#metabrainz


      • akshaaatt
        Hi Freso, I will be in a location today which doesn’t have any communication means, so will have to skip today’s meeting. For my update, I’ve been on a vacation and had a look at yellowhatpro’s work on the android app! Thank you.
      • v6lur joined the channel
      • v6lur has quit
      • Pratha-Fish
      • alastairp: looks like there has been a _slight miscalculation_
      • MLHD has 600k files, not 60k. So the estimated total time is 50 hours, not 5 hours ⚰️
      • However the good news is, the processing is going fine so far with 0.3s avg testing time per file
      • No track-MBIDs detected in recording-MBID so far
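The revised 50-hour estimate is easy to verify from the figures in the chat (a quick check, nothing more):

```python
files = 600_000          # actual MLHD file count (not the assumed 60k)
secs_per_file = 0.3      # measured average per-file processing time

total_hours = files * secs_per_file / 3600
print(round(total_hours))  # → 50
```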
      • BrainzGit
        [critiquebrainz] anshg1214 opened pull request #446 (master…CB_440): CB-440: Recording entity [unknown artist] https://github.com/metabrainz/critiquebrainz/pu...
      • Freso
        akshaaatt: Noted. Thanks. :)
      • trolley has quit
      • trolley joined the channel
      • alastairp
        Pratha-Fish: and imagine if we were taking 3 seconds per file instead of 0.3!
      • Pratha-Fish
        alastairp: that would've been 500 hours 🥶
      • alastairp: Also, looks like the process is constantly using > 90% CPU on wolf. I hope it doesn't disturb other users' work
      • alastairp
        Pratha-Fish: note that this is 90% of 1 CPU core
      • we have 12 CPU cores
      • Pratha-Fish
        _wow_
      • alastairp
        btw, one thing we could have done is start up 6-8 parallel workers for the same process
      • get it done in 7 hours
      • no worries, let's just leave it to do its thing
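The parallel-worker idea could be sketched with a process pool; `process_file` here is a stand-in for the real per-file check, which isn't shown in the log:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def process_file(path: Path) -> int:
    """Stand-in for the real check: load the file, look for track MBIDs
    in the recording-MBID column, write the result back out."""
    return 0  # number of suspect MBIDs found

def run_all(paths: list[Path], workers: int = 7) -> int:
    # With ~50 h of serial work, 7 CPU-bound workers bring the wall-clock
    # time down to roughly 50 / 7 ≈ 7 h, assuming near-linear scaling.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_file, paths, chunksize=1000))
```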
      • Pratha-Fish
        Damn that would've been nice
      • alastairp
        Pratha-Fish: any indication so far about track ids?
      • Pratha-Fish: are you also saving the zst files? in what location?
      • Pratha-Fish
        alastairp: Nothing as of today morning
      • alastairp: I am saving it all in snaek/MLHD/rec_track_checker/MLHD
      • alastairp
        I see it, great
      • Pratha-Fish
        Also, all logs are being written in 1 level above the dir where MLHD is being written. So far so good
      • alastairp: 163k files checked so far. Nothing found. Do you need any other numbers while we're at it?
      • alastairp
        nothing yet
      • hmm
      • Pratha-Fish: from what I can see of the code, you're just loading it in, checking the first column against your db tables, and then writing the dataframe out again?
      • Pratha-Fish
        alastairp: yes that's right
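A minimal sketch of the loop as alastairp describes it (the column layout and the `known_track_mbids` set are assumptions; the actual script isn't shown in the log):

```python
import pandas as pd

def check_file(in_path: str, out_path: str, known_track_mbids: set) -> int:
    # MLHD rows are tab-separated; the first column holds the recording MBID
    df = pd.read_csv(in_path, sep="\t", header=None)

    # count rows whose "recording" MBID is really a track MBID
    hits = int(df[0].isin(known_track_mbids).sum())

    # write back out without the pandas row index (pandas infers zstd
    # compression from a .zst suffix when the zstandard package is installed)
    df.to_csv(out_path, sep="\t", header=False, index=False)
    return hits
```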
      • alastairp
        however, I just randomly sampled a few of your compressed zst files and compared them against the gzip version of the same file, and the resulting uncompressed data is different
      • Pratha-Fish
        How exactly?
      • alastairp
        good question
      • I ran this:
      • diff <(zstdcat /home/snaek/MLHD/rec_track_checker/MLHD/0a/0a118981-15b5-46df-8666-080ca5a1af62.csv.zst) <(zcat /data/mlhd/0a/0a118981-15b5-46df-8666-080ca5a1af62.txt.gz)
      • which should have no output (indicating that the files are the same)
      • oh wait -
      • sorry, of course, we're using csv for zstd and tsv for txt
      • Pratha-Fish
        ah right that could be the reason
      • alastairp
        or did you use tabs in the end?
      • Pratha-Fish
        I think I ended up using tabs
      • alastairp
      • yes, right. so I would expect these files to be identical
      • Pratha-Fish
        Hmm
      • Lemme load up a few files in python and cross check
      • I hope the difference is only limited to something trivial like row indices being written with the data
      • ansh
        moin!
      • alastairp
      • Pratha-Fish: at least the first 10 lines of the files are the same
      • oh, hmm. interesting
      • one sec
      • Pratha-Fish: right, so the original files have \r\n line terminators, and the ones that we generated have only \n
      • phew, that's less terrible than I thought
      • Pratha-Fish
        alastairp: Phew
      • pandas confirms that too
      • alastairp: does not having \r make a significant difference?
      • alastairp
        no, it just tends to appear more often on files created from windows
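The CRLF-vs-LF difference is easy to confirm from Python by reading a line in binary mode (a sketch; the actual comparison in the chat used `diff`):

```python
def line_ending(path: str) -> str:
    """Classify a text file's first line as 'CRLF' or 'LF'."""
    with open(path, "rb") as f:   # binary mode, so \r isn't translated away
        line = f.readline()
    return "CRLF" if line.endswith(b"\r\n") else "LF"
```

For the compressed files discussed here, substitute a `gzip.open` or zstandard reader for plain `open`.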
      • Pratha-Fish
        thank god
      • alastairp
        in fact, I believe that the way that we are doing it is more correct
      • oh cool, there's a flag to `diff`:
      • diff --strip-trailing-cr <(zstdcat /home/snaek/MLHD/rec_track_checker/MLHD/0a/0a118981-15b5-46df-8666-080ca5a1af62.csv.zst) <(zcat /data/mlhd/0a/0a118981-15b5-46df-8666-080ca5a1af62.txt.gz)
      • that correctly outputs nothing
      • Pratha-Fish
        Nicee
      • Linux CLI is surprisingly powerful NGL. Makes me wanna switch back to arch
      • I just couldn't live with the constantly breaking system as a daily driver tbh
      • ansh
        alastairp: The tests are passing on CB#445. I tried running them locally.
      • BrainzBot
        Remove script to update Bookbrainz Database: https://github.com/metabrainz/critiquebrainz/pu...
      • Freso
        Pratha-Fish: That’s what eventually pushed me to drop Windows for good. 🙃 IME Windows is as likely to break as Linux, but with Linux I at least have an idea of what’s going on and a fighting chance to fix it.
      • ansh
        Is there any way to retest them on github before merging?
      • Pratha-Fish
        Freso: relatable haha but the opposite
      • The only reason why I am sticking with windows at this point is because of excellent software support, and force of habit
      • alastairp
        ansh: I can trigger it again
      • should be running again
      • ansh
        yes it started running
      • Pratha-Fish
        alastairp: What should I do while the computation is running?
      • We could jump back on the artist conflation issue, or even start converting all pandas.isin() code to set queries
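On the `pandas.isin()` point: the usual change is to build a `set` once and test membership against it, instead of handing `isin` a fresh list on every call. A sketch with illustrative data (whether it actually helps depends on the workload):

```python
import pandas as pd

df = pd.DataFrame({"mbid": ["a", "b", "c", "a"]})

wanted_list = ["a", "c"]
mask_isin = df["mbid"].isin(wanted_list)        # pandas hashes the list per call

wanted = set(wanted_list)                       # hash once, reuse everywhere
mask_set = df["mbid"].map(lambda m: m in wanted)

assert mask_isin.equals(mask_set)               # same result either way
```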
      • alastairp
        Pratha-Fish: I think that the next interesting step is going to be a comparison of our two data lookup methods
      • remember back at the beginning of the year when we were explaining that we might need to rewrite some lookup methods in spark or some other faster system?
      • Pratha-Fish
        right
      • alastairp
        so, given a recording mbid in the data file, we currently have 2 ways of looking up a canonical id: mbid -> canonical mbid table; or mbid -> text metadata -> mapper
      • and the previous experiment you did a few weeks back shows that some items give different results
      • what we're interested in doing is seeing why these results are different, and what we can do to make them the same
      • because ideally we could continue to use the canonical mbid table, because it's super fast (otherwise we need to look up all 27 billion rows in the mapper, which is slow)
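The comparison alastairp proposes could be framed per MBID along these lines (both lookups are stubbed here; the real ones are the canonical-MBID table and the metadata mapper in ListenBrainz):

```python
def compare_lookups(mbid, canonical_table, mapper_lookup):
    """Run one recording MBID through both paths and report disagreement.

    canonical_table: dict of mbid -> canonical mbid (the fast table lookup)
    mapper_lookup:   callable doing mbid -> text metadata -> mapper (slow)
    """
    fast = canonical_table.get(mbid)
    slow = mapper_lookup(mbid)
    return {"mbid": mbid, "canonical": fast, "mapper": slow,
            "agree": fast == slow}
```

Running this over the rows that previously disagreed would yield a table of mismatches to investigate.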
      • Pratha-Fish
        the mapper method won't finish computing this year tbh
      • alastairp
        so we need to decide if the mapper really is "better" (we don't know what the definition of better is here, we need to investigate the data and make a decision)
      • and if it _is_ better, we need to move on to the next steps of seeing if we can re-implement in something faster (spark? something else) in order to do the processing in a reasonable time
      • Pratha-Fish
        very interesting :D
      • So I'll take a look at the data first ig. Let's see if there's any patterns
      • alastairp
        Pratha-Fish: this dataset endpoint should be useful: https://labs.api.listenbrainz.org/explain-mbid-...
      • you give it a single artist and recording, and it'll return debugging about how it finds the item
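Mapping the endpoint over the faulty rows might start with building the query payload. The endpoint URL is truncated in the log and the parameter names below are assumptions, not confirmed against the labs API, so treat this as a sketch:

```python
import json

def build_explain_queries(rows):
    """rows: iterable of (artist_name, recording_name) pairs.

    Returns a JSON-serializable payload; field names are assumed,
    not confirmed against the labs API.
    """
    return [{"artist_name": a, "recording_name": r} for a, r in rows]

payload = build_explain_queries([("Some Artist", "Some Recording")])
body = json.dumps(payload)  # POST this to the explain endpoint
```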
      • Pratha-Fish
        Sounds good. I can try mapping the API to some data
      • mayhem
        moooin!
      • for anyone who knows about Don Norman's "Design of Everyday Things" but hasn't been able to read/finish it, these notes look quite cool: https://elvischidera.com/2022-06-24-design-ever...
      • Pratha-Fish
        mayhem: What a coincidence, I started reading that one today :))
      • BrainzGit
        [critiquebrainz] alastair merged pull request #445 (master…remove_temp_script): Remove script to update Bookbrainz Database https://github.com/metabrainz/critiquebrainz/pu...
      • CatQuest
        "Perceived affordances help people figure out what actions are possible without the need for labels or instructions."
      • this is the bullshit mentality that makes everything have icons and boxes now instead of CLEAR LABELS AND INSTRUCTIONS
      • I *LIKE* Labels and Instructions!!!!
      • aaaaaaaaaaaaaaaaa
      • .. but later on they say that a simple "you are offline" would suffice as a notification that the connection is broken..
      • i'm confused
      • but in short: please label things with text, & write succinct instructions where needed. Thanks
      • mayhem
      • is now ready with PR feedback and missing test file added.
      • ansh
        alastairp: The tests passed successfully after rebasing CB#441 last time. If there are any more changes required, pls let me know :)
      • BrainzBot
        CB-437: Add entity metadata to review get endpoints: https://github.com/metabrainz/critiquebrainz/pu...
      • lucifer
        mayhem: lgtm, thanks. it would be nice to add some tests with real mb data as well but currently we don't have MB db in LB tests so I'll open a ticket for it.
      • mayhem
        k
      • BrainzGit
        [listenbrainz-server] mayhem merged pull request #2065 (master…add-upcoming-releases-backend): Add fresh releases backend https://github.com/metabrainz/listenbrainz-serv...
      • mayhem
        lucifer: is it time for us to chat about how to integrate the three separate branches of fresh releases work we've got going on?
      • lucifer
        yes sure
      • mayhem
        the fetching of user specific data needs to be added to the endpoint I just added, that is one thing I see.
      • and now that I made space for the react work, chinmay can drop his work on top of the blank template that was just merged.
      • lucifer
        we have a couple of options there, either spark calls the api or lb fetches the data from db and sends it as a part of the rmq message
      • mayhem
        what else?
      • I was expecting for LB to fetch the data from couchdb/postgres.
      • and for the endpoint to return sitewide fresh releases unless a username was given.
      • lucifer
        rest of the backend is almost done. most of the couchdb integration will be done when migrating stats to it. after that i'll finish the fresh releases pr.
      • mayhem
        ok, maybe we should just wait for that to be done before doing more stuff.
      • lucifer
        yes, that makes sense.
      • a few tests and dumps are pending on that front fwiw.
      • mayhem
        ok, ping me if you need anything. I'm going to see if I can classify tracks as high/low energy with the data we have at our disposal.... see if I can make another playlist for users.
      • lucifer
        will do. sounds great! :D
      • mayhem
        daily jams are making me pretty happy. looking quite nice now. I really need to make a point of listening to them each day to see how things shape up over time.
      • BP makes that pretty hard though. It plays a handful of tracks and then halts. :(
      • lucifer
        yeah spotify does not have any documentation on how to fix the issue and no one answered on forums either.
      • mayhem
        yeah, its fully meh.
      • I think I might try my hand at the spotify cache using couchdb as the document store.
      • Sophist_UK has quit
      • Sophist-UK joined the channel
      • riksucks
        hi lucifer, are you up?
      • lucifer
        riksucks: yes. sorry forgot to answer your question earlier. there are 2 things to consider here: 1) multiple notifications on feed 2) allowing individual recommendees to delete a personal notification they received without affecting others.
      • also maybe allow the recommender to unsend the recommendation to a particular person without unsending it to others?
      • for instance, Instagram allows you to send a post to multiple persons at a time but then you can unsend to a particular person later if you want.
      • riksucks
        true, I thought about the 2) one, and realised that in normal recommendation, only the recommender can delete it, and the recommendees can hide it from their timelines. So maybe we can implement a similar feature. Similarly for unsending for a particular person, we can try removing that specific ID from the array, and update it in the DB
      • lucifer
        yes that's possible. alternative option is to keep 1 row per user and instead group all the notifications by recording id.
      • mayhem, alastairp: thoughts on how to handle this: say a user sends a track recommendation to multiple people. should we create 1) 1 row per user or 2) 1 row with array containing all the users' ids.
      • mayhem
        2
      • lucifer
        in 1 it's easier to handle deletes/hiding, but more work to manage notifications in feed. vice versa in 2.
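The two options under discussion can be illustrated with row shapes (field names invented for illustration, not the real schema):

```python
# Option 1: one row per recipient. Per-user delete/hide is a plain row
# delete, but the feed has to group sibling rows into one event.
rows_per_user = [
    {"recording_mbid": "abc", "sender": "u1", "recipient": "u2"},
    {"recording_mbid": "abc", "sender": "u1", "recipient": "u3"},
]

# Option 2: one row holding all recipients. The feed gets one event for
# free, but "unsend to one person" means rewriting the array in place.
row_with_array = {
    "recording_mbid": "abc", "sender": "u1", "recipients": ["u2", "u3"],
}

row_with_array["recipients"].remove("u3")   # unsend for u3 only
```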
      • Pratha-Fish
        alastairp: I've run the explainer API on 333 rows of faulty data. Now how do we debug it?
      • riksucks
        also lucifer, I wanted to tell you another thing. I was reading up on how postgres handles JSONB, and what happens when we update certain keys or certain parts of that JSONB. Turns out, postgres always writes a new version of the whole row whenever we update. Do you think that would create overhead for lots of personal recommendations?
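riksucks is describing Postgres MVCC: an UPDATE produces a new version of the whole row, so changing one key inside a large JSONB document rewrites the entire document (TOASTed values are rewritten wholesale too). A rough write-amplification estimate, with illustrative numbers:

```python
def write_amplification(row_bytes: int, changed_bytes: int) -> float:
    """Bytes physically rewritten per byte logically changed."""
    return row_bytes / changed_bytes

# e.g. removing one 36-byte UUID from a ~4 KB JSONB recipients array
amp = write_amplification(4096, 36)
print(f"{amp:.0f}x")   # roughly 114x
```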