#metabrainz


      • aerozol[m]
        Lightning fast, thanks Jade and bitmap!!
      • lucifer[m]
        mayhem: hi! let me know when you are around for some discussion about importing listens in spark.
      • BrainzGit
        [musicbrainz-server] reosarevok opened pull request #3485 (master…MBS-13950): MBS-13950: Support Utaite/Touhou/VocaDB tag pages for genres https://github.com/metabrainz/musicbrainz-serve...
      • mayhem[m]
        lucifer: around-ish now, more so in 90 mins
      • monkey[m]
        suvid: Yes, you will need to remove any buttons to play music and other actions relating to BP, for which you can search for the `useBrainzPlayerDispatch` hook across the codebase.
      • You should also ideally make sure that the BrainzPlayer component is not loaded at all if it is disabled (i.e. not just hidden, but not loaded in the first place)
      • lucifer[m]
        Sure, let's do it in 90 mins
      • BrainzGit
        [listenbrainz-server] anshg1214 merged pull request #3195 (master…LB-1755): LB-1755: Fix Feed event deletion https://github.com/metabrainz/listenbrainz-serv...
      • mayhem[m]
        lucifer: around now.
      • lucifer[m]
        mayhem: hi!
      • further working on eliminating spark's reliance on full dumps, there are two approaches I have been looking into. The first approach is to keep using dumps, but only incremental ones; this is easier to implement. The only issue is that currently the incremental dump after a full dump includes only the listens created since that full dump. I could change it to include the listens since the last incremental dump (adding a column in the
      • data_dump table to recognise which dump is incremental and which is full). Incremental dumps can also be made to run concurrently with full dumps, so that we don't lag while full dumps are being generated (currently incremental dumps wait for full dumps to finish).
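(An editorial sketch of the windowing change described above, assuming a hypothetical `data_dump` table with a new `dump_type` column and a `listen` table with a `created` timestamp; table, column, and function names are illustrative, not the actual LB schema.)

```python
import psycopg2

def last_incremental_created(conn):
    """Find where the previous *incremental* dump ended, instead of
    anchoring on the last full dump. dump_type is the assumed new column."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT created FROM data_dump
             WHERE dump_type = 'incremental'
             ORDER BY created DESC LIMIT 1
        """)
        row = cur.fetchone()
        return row[0] if row else None

def fetch_incremental_listens(conn, since, until):
    """Filter on the created timestamp, so listens backfilled with old
    listened_at values are still captured exactly once."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT * FROM listen WHERE created > %s AND created <= %s",
            (since, until),
        )
        return cur.fetchall()
```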
      • suvid[m] uploaded an image: (294KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/matrix.org/yfZgZUVeycwCNkQQZrvZQZhw/image.png >
      • lucifer[m]
        However, this breaks the current sequencing of listens import. For example, at the moment we can import a full dump + all incremental dumps and have all of LB's listens. With the proposed change, the incremental dumps would duplicate some listens from the full dump, and the onus would be on the user of the full dumps to deduplicate them; or they could use just the full dump's listens and wait for the next full dump to avoid duplicates.
      • what do you think about this approach?
      • mayhem[m]
        could we generate full dumps from the incremental dumps??
      • that changes the whole equation, no?
      • lucifer[m]
        right so you mean end the full dump at the last incremental dump instead of the current time?
      • mayhem[m]
        sort-of.
      • first, let's elevate the idea of an incremental dump to a first-class concept, rather than it being lower than the full dump.
      • let's forget about full dumps for now.
      • you fix the incremental dumps as you described above - from incremental to incremental.
      • and those are the only dumps the LB infra does directly.
      • to make a full dump, we take the last full dump, add all the incrementals, and then zip it all up into the newest and latest full dump.
      • we could do this last step on some VM that we spin up for this task.
      • lucifer[m]
        how do you get the first full dump?
      • mayhem[m]
        clearly we need to create a "make a dump NOW" script.
      • but we don't plan to use it very often.
      • the key cool thing here is that the full dumps are no longer running on core infra
      • lucifer[m]
        yes makes sense. there are two issues however.
      • the first full dump and incremental dump might possibly still have duplicates, because full dumps filter on listened_at and incremental dumps on the created timestamp. so if you import listens from the past you could end up with duplicates in case of unfortunate timing. but i guess we could solve it by adding an additional created filter to the full dump generation (the first one). for subsequent ones, just incremental dumps.
      • i just realised that this created filter would solve the issue i was stuck on in the current infra fwiw.
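(Illustrative only, under the same assumed schema as the sketch above: the bootstrap full dump gets an extra created cutoff so the first incremental dump can pick up exactly where it left off, with no overlap.)

```python
# The full dump still selects on listened_at, but the added created
# cutoff guarantees no overlap with the first incremental dump, which
# selects on created > cutoff. Query shapes are assumptions.
FULL_DUMP_QUERY = """
    SELECT * FROM listen
     WHERE listened_at <= %(dump_time)s
       AND created <= %(created_cutoff)s
"""
INCREMENTAL_DUMP_QUERY = """
    SELECT * FROM listen
     WHERE created > %(created_cutoff)s
       AND created <= %(next_cutoff)s
"""
```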
      • mayhem[m]
        can we redefine the full dumps so they are derived from the incremental semantics?
      • lucifer[m]
        right so on that note, the issue would be deleted listens.
      • mayhem[m]
        our nemesis.
      • lucifer[m]
        fwiw, i am doing something similar in spark anyway. a full dump exists, a bunch of incremental dumps come in. combine both and load the listens for stats generation, filtering out deleted listens every time we load this combination.
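(A minimal PySpark sketch of this load-and-filter step. The HDFS paths and the `(user_id, listened_at, recording_msid)` identity key are assumptions for illustration, not LB's actual layout.)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-listens").getOrCreate()

# Hypothetical HDFS paths; the real layout in LB differs.
full = spark.read.parquet("/data/listens/full")
incremental = spark.read.parquet("/data/listens/incremental/*")
deleted = spark.read.parquet("/data/listens/deleted")

# Combine the base full dump with all incrementals, then drop every
# listen that appears in the deleted-listens table: the anti join keeps
# only rows with no match on the assumed identity key.
key = ["user_id", "listened_at", "recording_msid"]
listens = full.unionByName(incremental).join(deleted, on=key, how="left_anti")
```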
      • mayhem[m]
        I think this is where my thinking is coming from.
      • lucifer[m]
        after N days, take the combined listens, remove the deleted listens from them, and rewrite them to disk.
      • (this step is pending implementation)
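(Continuing the sketch above: the pending compaction step could rewrite the filtered union as the new base dataset, so old deletions never need to be re-applied. Paths are again hypothetical, and the write targets a new location because Spark cannot overwrite a dataset it is reading from.)

```python
# Periodic compaction (pending implementation, per the discussion):
# persist the deduplicated, deletion-filtered combination as the new
# base, then swap it in for /data/listens/full and drop the consumed
# incrementals out of band.
listens.write.mode("overwrite").parquet("/data/listens/full_new")
```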
      • mayhem[m]
        this could be done during the full dump generation.
      • lucifer[m]
        right. however, the issue with the rewrite is that for spark we store the listens somewhat optimally in hdfs; doing this with a bunch of tar files on a VM is not going to be efficient.
      • but then what if we use spark for full dumps?
      • mayhem[m]
        oh weird.
      • can we imagine what the impact on the cluster would be?
      • lucifer[m]
        yeah, i am just wondering how you do the combination step to remove the deleted listens efficiently.
      • mayhem[m]
        would that be offline or can we pull just from hdfs?
      • lucifer[m]
        not sure what you mean by offline?
      • oh do you mean if the cluster will become unavailable during this step.
      • mayhem[m]
        would the cluster be unavailable for regular tasks?
      • lucifer[m]
        i think yeah, but how long is the question. if it's less than 8 hours i think we could schedule it well enough.
      • mayhem[m]
        overall, this doesn't feel great to me. the spark cluster is meant to be "disposable" and this feels like a departure from that
      • lucifer[m]
        yeah fair point.
      • we could add a LB db replica and let dumps run off of it.
      • mayhem[m]
        lucifer[m]: poor use of resources.
      • is the deletion of listens from the incremental dumps the sticking point in the idea of making full dumps from incrementals?
      • that doesn't seem that hard.
      • lucifer[m]
        i think a LB replica would be good in general, but yes, just for dumps it's not optimal.
      • mayhem[m]
        you start with a list of listens to be deleted and a pile of incrementals.
      • lucifer[m]
        deletion of listens needs to happen from the full dump as well.
      • mayhem[m]
        for each incremental, look at each listen. is it deleted? if yes, skip it; if no, write it to the full dump.
      • lucifer[m]
        if listens were to be deleted only from incrementals that wouldn't be too big of a deal.
      • mayhem[m]
      • this VM could continually update the incrementals, removing deleted listens from them too.
      • lucifer[m]
        mayhem[m]: yes but how do you delete listens from the full dump.
      • monkey[m]
        <suvid[m]> "Hey..." <- suvid: I think that's a great step 1. The improvement I have in mind is to avoid loading any of the BrainzPlayer code in the first place if the user has deactivated it, saving on time and data usage.
      • The files to look at are frontend/js/src/index.tsx, where the call to `getRoutes` (frontend/js/src/routes/routes.tsx) should have a new argument that can be trickled down to the Layout component at frontend/js/src/layout/index.tsx (in the same way that Layout has a `withProtectedRoutes` prop, it should have a new `withBrainzPlayer` prop that defaults to true)
      • In frontend/js/src/index.tsx as well the BrainzPlayerContext should be rendered conditionally.
      • lucifer[m]
        say you have a full dump till 15th feb and you are adding incrementals from 16th to 24th to it. and there are listen deletions that happened today for listens from say 2023.
      • those listens are already in the full dump.
      • mayhem[m]
        lucifer[m]: since the full dump is made from incremental dumps that have listens deleted, it should be no problem, right?
      • ah -- I am going back to my idea of making full dumps from incrementals.
      • monkey[m]
        Basically, seeing where BrainzPlayer and its BrainzPlayerContext are rendered, and working your way down to avoid rendering the components
      • lucifer[m]
        yes i am on that idea too. but doesn't your idea start off with the full dump that was created last time, and just add the incrementals generated since then to it?
      • mayhem[m]
        we would have to start with a clean dump that has no deleted listens in it, for starters.
      • lucifer[m]
        Full Dump until 15 Feb, 2024 + Incremental Dumps from 16th to 24th = Full dump until 24th Feb
      • do i understand your idea right?
      • mayhem[m]
        yes
      • lucifer[m]
        so now if i deleted some listens from 2021 on Feb 23rd, those listens would be in the starter full dump?
      • mayhem[m]
        except it won't line up on date lines, but on when the dump started/terminated. still no real change in logic.
      • lucifer[m]
        yup, date lines are just for example's sake.
      • mayhem[m]
        yes, they would be in the full starter dump.
      • but the key is to not have them in the NEXT full dump, right?
      • lucifer[m]
        yes. but how do you do that?
      • do you process the full starter dump too?
      • i was under the impression your plan meant to copy it as is.
      • mayhem[m]
        that is the full dump making process: collect the listens to be deleted and all the incremental dumps, then filter out deleted listens from the inc dumps and combine them into one giant clean full dump.
      • all inc dumps since last full dump, I should say
      • suvid[m]
        <monkey[m]> "suvid: I think that's a great..." <- oh ok
      • i'll look into it
      • in the meantime, i have also pushed a commit which changes ListenCard to remove the play button, and the "add to queue" and "play next" options from the 3-dot menu
      • lucifer[m]
        right, just to be clear: you would read the existing full dump and process it too, to delete listens as needed?
      • mayhem[m]
        lucifer[m]: I wasn't suggesting that, no. is that needed?
      • suvid[m] uploaded an image: (331KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/matrix.org/QNAoBxHEEsqGUOrrQSOcSPMX/image.png >
      • suvid[m]
        so it looks like this now :)
      • lucifer[m]
        mayhem: hmm, but then you can't filter out all the deleted listens. you need to process your starter full dump to delete listens; if you just look at incremental dumps to delete listens it won't always work.
      • mayhem[m]
        yes, you're right. filter deleted listens from last full and all incs to make a new one.
      • that will still be faster than a full dump off our infra.
      • and if it runs on a one-off VM, we don't care how long it takes. 3 days? fine -- that is our schedule then
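(An editorial sketch of the rebuild process as agreed here: stream the last full dump plus every incremental since it, skip deleted listens, and write one clean new full dump. The line-delimited JSON format and the listen identity key are assumptions for illustration; this is deliberately single-threaded and disk-bound, matching the one-off VM idea.)

```python
import json

def listen_key(listen):
    # Assumed identity for a listen; the real dump format may differ.
    return (listen["user_id"], listen["listened_at"], listen["recording_msid"])

def rebuild_full_dump(last_full, incrementals, deleted_keys, out_path):
    """Filter the last full dump and all incrementals since it against
    the set of deleted listen keys, producing the next clean full dump."""
    with open(out_path, "w") as out:
        for dump_path in [last_full, *incrementals]:
            with open(dump_path) as dump:
                for line in dump:  # one JSON listen per line (assumed)
                    listen = json.loads(line)
                    if listen_key(listen) in deleted_keys:
                        continue  # drop deleted listens everywhere
                    out.write(line)
```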
      • lucifer[m]
        sure i just wanted to be clear on the idea.
      • mayhem[m]
        but does the rest of it check out?
      • lucifer[m]
        yes it does.
      • mayhem[m]
        overall I quite like this approach.
      • full dumps far away from prod systems == great.
      • lucifer[m]
        i don't like the idea of a one-off VM though, because we don't usually do that sort of provisioning, and it would likely be some work to add it. not that it's not doable.
      • mayhem[m]
        fair.
      • if we construct it carefully and have it consume only one thread, it still won't take forever. it will just be disk heavy.
      • we'd have to find a machine where we can do that.
      • lucifer[m]
        one more thing: we can only move the listens dump away this way. the rest of the things still need to be dumped from prod.
      • mayhem[m]
        yes, but those dumps take how long? measured in minutes, not hours/days?
      • lucifer[m]
        i think stats dump takes 6 hours or more.
      • but i don't think that's an issue still.
      • mayhem[m]
        most of that isn't on PG, right?
      • lucifer[m]
        yes
      • mayhem[m]
        perfect.
      • I think this will be a great improvement for our infra, honestly.