in #metabrainz

0:46 AM
akshaaatt

Hi Freso, I will be in a location today which doesn’t have any communication means, so will have to skip today’s meeting. For my update, I’ve been on a vacation and had a look at yellowhatpro’s work on the android app! Thank you.
4:46 AM
v6lur joined the channel
5:17 AM
v6lur has quit
5:19 AM
Pratha-Fish

Looks like it's still going https://usercontent.irccloud-cdn.com/file/rPHMD...
5:21 AM
alastairp: looks like there has been a _slight miscalculation_
5:22 AM
MLHD has 600k files, not 60k. So the estimated total time is 50hours, not 5 hours ⚰️
5:25 AM
However the good news is, the processing is going fine so far with 0.3s avg testing time per file
5:26 AM
No track-MBIDs detected in recording-MBID so far
6:38 AM
BrainzGit

[critiquebrainz] 14anshg1214 opened pull request #446 (03master…CB_440): CB-440: Recording entity [unknown artist] https://github.com/metabrainz/critiquebrainz/pu...
7:53 AM
Freso

akshaaatt: Noted. Thanks. :)
8:01 AM
trolley has quit
8:03 AM
trolley joined the channel
8:49 AM
alastairp

Pratha-Fish: and imagine if we were taking 3 seconds per file instead of 0.3!
8:50 AM
Pratha-Fish

alastairp: that would've been 500 hours 🥶
8:51 AM
alastairp: Also, looks like the process is constantly using > 90% CPU on wolf. I hope it doesn't disturb other user's work
8:52 AM
alastairp

Pratha-Fish: note that this is 90% of 1 CPU core
8:52 AM
we have 12 CPU cores
8:52 AM
Pratha-Fish

_wow_
8:52 AM
alastairp

btw, one thing we could have done is start up 6-8 parallel workers for the same process
8:53 AM
get it done in 7 hours
8:53 AM
no worries, let's just leave it to do its thing
8:53 AM
Pratha-Fish

Damn that would've been nice
8:53 AM
alastairp

Pratha-Fish: any indication so far about track ids?
8:53 AM
Pratha-Fish: are you also saving the zst files? in what location?
8:53 AM
Pratha-Fish

alastairp: Nothing as of today morning
8:54 AM
alastairp: I am saving it all in snaek/MLHD/rec_track_checker/MLHD
8:54 AM
alastairp

I see it, grat
8:54 AM
great
8:55 AM
Pratha-Fish

Also, all logs are being written in 1 level above the dir where MLHD is being written. So far so good
8:56 AM
https://usercontent.irccloud-cdn.com/file/U4wsP...
8:57 AM
alastairp: 163k files checked so far. Nothing found. Do you need any other numbers while we're at it?
8:58 AM
alastairp

nothing yet
9:01 AM
hmm
9:01 AM
Pratha-Fish: from what I can see of the code, you're just loading it in, checking the first column against your db tables, and then writing the dataframe out again?
9:02 AM
Pratha-Fish

alastairp: yes that's right
9:02 AM
alastairp

however, I just randomly sampled a few of your compressed zst files and compared them against the gzip version of the same file, and the resulting uncompressed data is different
9:03 AM
Pratha-Fish

How exactly?
9:03 AM
alastairp

good question
9:03 AM
I ran this:
9:03 AM
diff <(zstdcat /home/snaek/MLHD/rec_track_checker/MLHD/0a/0a118981-15b5-46df-8666-080ca5a1af62.csv.zst) <(zcat /data/mlhd/0a/0a118981-15b5-46df-8666-080ca5a1af62.txt.gz)
9:04 AM
which should have no output (indicating that the files are the same)
9:04 AM
oh wait -
9:04 AM
sorry, of course, we're using csv for zstd and tsv for txt
9:05 AM
Pratha-Fish

ah right that could be the reason
9:05 AM
alastairp

or did you use tabs in the end?
9:05 AM
Pratha-Fish

I think I ended up using tabs
9:05 AM
alastairp

https://github.com/Prathamesh-Ghatole/MLHD/blob...
9:06 AM
yes, right. so I would expect these files to be identical
9:06 AM
Pratha-Fish

Hmm
9:06 AM
Lemme load up a few files in python and cross check
9:07 AM
I hope the difference is only limited to something trivial like row indices being written with the data
9:08 AM
ansh

moin!
9:08 AM
alastairp

https://www.irccloud.com/pastebin/q9yUuQG7/
9:09 AM
Pratha-Fish: at least the first 10 lines of the files are the same
9:09 AM
oh, hmm. interesting
9:09 AM
one sec
9:11 AM
https://www.irccloud.com/pastebin/tIChzWqm/
9:12 AM
Pratha-Fish: right, so the original files have \r\n line terminators, and the ones that we generated have only \n
9:12 AM
phew, that's less terrible than I thought
9:13 AM
Pratha-Fish

alastairp: Phew
9:14 AM
https://usercontent.irccloud-cdn.com/file/mNa8d...
9:14 AM
pandas confirms that too
9:14 AM
alastairp: does not having \r make a significant difference?
9:15 AM
alastairp

no, it just tends to appear more often on files created from windows
9:15 AM
Pratha-Fish

thankgod
9:15 AM
alastairp

in fact, I believe that the way that we are doing it is more correct
9:15 AM
oh cool, there's a flag to `diff`:
9:15 AM
diff --strip-trailing-cr <(zstdcat /home/snaek/MLHD/rec_track_checker/MLHD/0a/0a118981-15b5-46df-8666-080ca5a1af62.csv.zst) <(zcat /data/mlhd/0a/0a118981-15b5-46df-8666-080ca5a1af62.txt.gz)
9:15 AM
that correctly outputs nothing
9:15 AM
Pratha-Fish

Nicee
9:16 AM
Linux CLI is surprisingly powerful NGL. Makes me wanna switch back to arch
9:16 AM
I just couldn't live with the constantly breaking system as a daily driver tbh
9:17 AM
ansh

alastairp: I tests are passing on CB#445. I tries running them locally.
9:17 AM
BrainzBot

Remove script to update Bookbrainz Database: https://github.com/metabrainz/critiquebrainz/pu...
9:17 AM
ansh

The*
9:18 AM
Freso

Pratha-Fish: That’s what eventually pushed me to drop Windows for good. 🙃 IME Windows is as likely to break as Linux, but with Linux I at least have an idea of what’s going on and a fighting chance to fix it.
9:19 AM
ansh

Is there any way to retest them on github before merging?
9:19 AM
Pratha-Fish

Freso: relatable haha but the opposite
9:19 AM
The only reason why I am sticking with windows at this point is because of excellent software support, and force of habit
9:20 AM
alastairp

ansh: I can trigger it again
9:20 AM
should be running again
9:20 AM
https://github.com/metabrainz/critiquebrainz/ac...
9:20 AM
ansh

yes it started running
9:36 AM
Pratha-Fish

alastairp: What should I do while the computation is running?
9:36 AM
We could jump back on the artist conflation issue, or even start converting all pandas.isin() code to set queries
9:39 AM
alastairp

Pratha-Fish: I think that the next interesting step is going to be a comparison of our two data lookup methods
9:39 AM
remember back at the beginning of the year when we were explaining that we might need to rewrite some lookup methods in spark or some other faster system?
9:39 AM
Pratha-Fish

right
9:40 AM
alastairp

so, given a recording mbid in the data file, we currently have 2 ways of looking up a canonical id: mbid -> canonical mbid table; or mbid -> text metadata -> mapper
9:41 AM
and the previous experiment you did a few weeks back shows that some items give different results
9:42 AM
what we're interested in doing is seeing why these results are different, and what we can do to make them the same
9:42 AM
because ideally we could continue to use the canonical mbid table, because it's super fast (otherwise we need to look up all 27 billion rows in the mapper, which is slow)
9:43 AM
Pratha-Fish

the mapper method won't complete the computing this year tbh
9:44 AM
alastairp

so we need to decide if the mapper really is "better" (we don't know what the definition of better is here, we need to investigate the data and make a decision)
9:45 AM
and if it _is_ better, we need to move on to the next steps of seeing if we can re-implement in something faster (spark? something else) in order to do the processing in a reasonable time
9:45 AM
Pratha-Fish

very interesting :D
9:46 AM
So I'll take a look at the data first ig. Let's see if there's any patterns
9:47 AM
alastairp

Pratha-Fish: this dataset endpoint should be useful: https://labs.api.listenbrainz.org/explain-mbid-...
9:48 AM
you give it a single artist and recording, and it'll return debugging about how it finds the item
9:49 AM
Pratha-Fish

Sounds good. I can try mapping the API to some data
9:51 AM
mayhem

moooin!
9:51 AM
for any one who knows about Don Norman's "Design of Everyday things", but hasn't been able to read/finish it, these notes look quite cool" https://elvischidera.com/2022-06-24-design-ever...
9:55 AM
Pratha-Fish

mayhem: What a coincidence, I started reading that one today :))
9:58 AM
BrainzGit

[critiquebrainz] 14alastair merged pull request #445 (03master…remove_temp_script): Remove script to update Bookbrainz Database https://github.com/metabrainz/critiquebrainz/pu...
10:05 AM
CatQuest

"Perceived affordances help people figure out what actions are possible without the need for labels or instructions."
10:05 AM
this is the bullshit mentality that makes everything have icons and boxes now instead of CLEAR LABELS AND INSTRUCTIONS
10:05 AM
I *LIKE* Labels and Instructions!!!!
10:05 AM
aaaaaaaaaaaaaaaaa
10:09 AM
.. but later on they say that a simple "you are offline" wouldsuffice as a notfication that connection is broken..
10:09 AM
i'm confused
10:10 AM
but in short: please label things with text, & write succinct instructions where needed." thanks
10:15 AM
mayhem

lucifer: https://github.com/metabrainz/listenbrainz-serv...
10:15 AM
is now ready with PR feedback and missing test file added.
10:28 AM
ansh

alastairp: The tests passed successfully after rebasing CB#441 last time. If there are any more changes required, pls let me know :)
10:28 AM
BrainzBot

CB-437: Add entity metadata to review get endpoints: https://github.com/metabrainz/critiquebrainz/pu...
10:51 AM
lucifer

mayhem: lgtm, thanks. it would be nice to add some tests with real mb data as well but currently we don't have MB db in LB tests so I'll open a ticket for it.
10:51 AM
mayhem

k
10:52 AM
BrainzGit

[listenbrainz-server] 14mayhem merged pull request #2065 (03master…add-upcoming-releases-backend): Add fresh releases backend https://github.com/metabrainz/listenbrainz-serv...
10:52 AM
mayhem

lucifer: is it time for us to chat about how to integrate the three separarate branches of fresh releases work we've got going on?
10:53 AM
lucifer

yes sure
10:53 AM
mayhem

the fetching of user specific data needs to be added to the endpoint I just added, that is one thing I see.
10:53 AM
and now that I made space for the react work, chinmay can drop his work on top of the blank template that was just merged.
10:53 AM
lucifer

we have a couple of options there, either spark calls the api or lb fetches the data from db and sends it as a part of the rmq message
10:53 AM
mayhem

what else?
10:54 AM
I was expecting for LB to fetch the data from couchdb/postgres.
10:54 AM
and for the endpoint to return sidewide fresh releases unless a user name was given.
10:55 AM
lucifer

rest of the backend is almost done. most of the couchdb integration will be done when migrating stats to it. after that i'll finish the fresh releases pr.
10:55 AM
mayhem

ok, maybe we should just wait for that to be done before doing more stuff.
10:55 AM
lucifer

yes, that makes sense.
10:56 AM
a few tests and dumps are pending on that front fwiw.
10:57 AM
mayhem

ok, ping me if you need anything. I'm going to see if I can classify tracks as high/low energy with the data we have at our disposal.... see if I can make another playlist for users.
10:57 AM
lucifer

will do. sounds great! :D
10:58 AM
mayhem

daily jams are making me pretty happy. looking quite nice now. I really need to make a point of listening to them each day to see how things shape up over time.
10:58 AM
BP makes that pretty hard though. It plays a handful of tracks and then halts. :(
10:59 AM
lucifer

yeah spotify does not have any documentation on how to fix the issue and no one answered on forums either.
11:00 AM
mayhem

yeah, its fully meh.
11:02 AM
I think I might try my hand at the spotify cache using couchdb as the document store.
11:12 AM
Sophist_UK has quit
11:12 AM
Sophist-UK joined the channel
11:16 AM
riksucks

hi lucifer, are you up?
11:19 AM
lucifer

riksucks: yes. sorry forgot to answer your question earlier. there are 2 things to consider here: 1) mulitple notifications on feed 2) allowing individual recommendee's to delete a personal notification they received without affecting others.
11:20 AM
also maybe allow the recommender to unsend the recommendation to a particular person without unsending it to others?
11:21 AM
for instance, Instagram allows you to send a post to multiple persons at a time but then you can unsend to a particular person later if you want.
11:21 AM
riksucks

true, I thought about the 2) one, and realised that in normal recommendation, only the recommender can delete it, and the recommendees can hide it from their timelines. So maybe we can implement a similar feature. Similarly for unsending for a particular person, we can try removing that specific ID from the array, and update it in the DB
11:23 AM
lucifer

yes that's possible. alternative option is to keep 1 row per user and instead group all the notifications by recording id.
11:25 AM
mayhem, alastairp: thoughts on how to handle this: say a user sends a track recommendation to multiple people. should we create 1) 1 row per user or 2) 1 row with array containing all the users' ids.
11:26 AM
mayhem

2
11:26 AM
lucifer

in 1 its easier to handle deletes/hiding. but more work to manage notifications in feed. vice versa in 2.
11:27 AM
Pratha-Fish

alastairp: I've ran the explainer API on 333 rows of faulty data. Now how do we debug it?
11:30 AM
riksucks

also lucifer, I wanted to tell you another thing. I was reading up on how postgres handles JSONB, and what happens when we update certain keys or certain parts of that JSONB. Turns out, postgres always writes a new version of the whole row whenever we are updating. Do you think that would create overhead for lots of personal recommendation?