It has the same number of differences as before though (650)
2022-09-01 24423, 2022
Pratha-Fish
I'll retry once without caching
2022-09-01 24410, 2022
lucifer
Pratha-Fish: yes, please. almost sure something is wrong in the script generating this report, because some of the mbids in there are not present at all in the new table.
last entry in that file. see the canonical mbid is not present as canonical mbid at all in the new table.
2022-09-01 24422, 2022
Pratha-Fish
This time there's 677 differences
2022-09-01 24438, 2022
lucifer
hmm so it increased.
2022-09-01 24453, 2022
Pratha-Fish
right
2022-09-01 24422, 2022
BrainzGit
[bookbrainz-site] MonkeyDo merged pull request #814 (master…nameSectionImprovements): feat(BB-432): Show possible duplicates next to the name section https://github.com/metabrainz/bookbrainz-site/pul…
2022-09-01 24445, 2022
lucifer
thanks for the report, Pratha-Fish.
2022-09-01 24459, 2022
lucifer
i'll look into it later, cc: alastairp.
2022-09-01 24408, 2022
Pratha-Fish
lucifer: You're welcome, and super sorry for keeping you hanging
2022-09-01 24427, 2022
Pratha-Fish
alastairp: Hi, I wanted to discuss a few points about the MLHD cleanup process if you're free
aerozol: I've got feedback on the play icon on the coverart: I actually see why (and I don't dislike the concept of being able to play on the page (with userscripts or links or what))
2022-09-01 24439, 2022
CatQuest
but when I see coverart in mb, my first idea is that clicking it will take me directly to the coverart tab of the release/choose coverart for RG
2022-09-01 24452, 2022
CatQuest
also, gimme some time and I will talk to you about expand/collapse release groups pls
2022-09-01 24446, 2022
CatQuest
if you can wait until after the 7th that would be best, right now things are way too hectic and my brain isn't on the subject at all, after it i'll have LOTS of time having to take it easy and can sit and talk to you
2022-09-01 24423, 2022
CatQuest
same to Shubh, monkey, reosarevok, anyone else who wants feedback/help/instruments/work whatever
hi Pratha-Fish, lucifer. I was out doing errands. let me check history
2022-09-01 24419, 2022
piwu joined the channel
2022-09-01 24429, 2022
alastairp
lucifer: interesting that the differences increased! definitely something to look in to. let me see if I can find some time tomorrow for that
2022-09-01 24416, 2022
alastairp
Pratha-Fish: regarding rows without recording mbids, what's the number 78.13%? is that the number without mbids, or with?
2022-09-01 24434, 2022
Pratha-Fish
alastairp: Hi, that's the % of rows that have a recording MBID present in them
2022-09-01 24409, 2022
Pratha-Fish
i.e. ~21.87 % rows in MLHD don't have any recording MBID
2022-09-01 24436, 2022
alastairp
again - try to be consistent in how you report these things. you say in the text "rows that don't have an mbid" and then you give a figure for rows that _do_ have mbids
2022-09-01 24458, 2022
Pratha-Fish
Oops
2022-09-01 24458, 2022
alastairp
it makes it easier for us to understand when reading the items
2022-09-01 24405, 2022
Pratha-Fish
Didn't notice that one. I'll fix it rn
2022-09-01 24403, 2022
alastairp
I think we should do 2 things with those - we were thinking of distributing two versions of the dataset, one "as close to the original as possible" and one "as much data as possible". The first, will keep the blank rows. it might be useful for people who want to know exactly when a user listened to music, even though they might not know what
2022-09-01 24415, 2022
alastairp
the 2nd, will have complete metadata (recording, artist(s), release) so that people who want to do something with the data can do so
2022-09-01 24445, 2022
alastairp
so it would be great if your code could have a flag which says "skip missing mbids" or "include rows for missing mbids"
2022-09-01 24413, 2022
Pratha-Fish
Yes definitely
2022-09-01 24431, 2022
Pratha-Fish
We could just completely leave such rows without rec-mbid alone
2022-09-01 24432, 2022
Pratha-Fish
While we're at it, should we move the artist-mbid column to the end of the dataset or just keep it as it is
2022-09-01 24414, 2022
alastairp
do you mean related to our discussion the other day?
2022-09-01 24419, 2022
alastairp
I think it's fine where it is
2022-09-01 24438, 2022
Pratha-Fish
yep. It won't affect the data set much, it might just make the text representation a little prettier.. which really doesn't matter anyway
2022-09-01 24426, 2022
Pratha-Fish
As for the next question, what about completely unknown rec-MBIDs.. i.e. the ones that weren't found in redirects, canonical, or the MB recording table
2022-09-01 24456, 2022
alastairp
turn them into blank rows
2022-09-01 24415, 2022
alastairp
that is, if we want to keep the row, output only the timestamp, if we want the "full data", omit the row entirely
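A minimal sketch of the flag discussed above, assuming rows are `(timestamp, artist_mbid, release_mbid, recording_mbid)` tuples and that `known_recordings` is the set of recording MBIDs that could be resolved. The function name, the `keep_missing` parameter and the row layout are hypothetical and not taken from the actual cleanup code.

```python
from typing import Iterable, Iterator, Set, Tuple

# Assumed row layout: (timestamp, artist_mbid, release_mbid, recording_mbid)
Row = Tuple[str, str, str, str]


def clean_rows(rows: Iterable[Row],
               known_recordings: Set[str],
               keep_missing: bool) -> Iterator[Row]:
    """Yield cleaned rows.

    Rows whose recording MBID is blank or unknown are either turned into
    timestamp-only rows (keep_missing=True, the "close to the original"
    version) or dropped (keep_missing=False, the "full data" version).
    """
    for timestamp, artist, release, recording in rows:
        if recording and recording in known_recordings:
            yield (timestamp, artist, release, recording)
        elif keep_missing:
            # Keep only the timestamp, blank out every MBID column.
            yield (timestamp, "", "", "")
        # Otherwise omit the row entirely.
```

With `keep_missing=True` the output has the same number of rows as the input, so listening times are preserved even when the recording is unknown.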
2022-09-01 24441, 2022
Pratha-Fish
sounds good
2022-09-01 24427, 2022
Pratha-Fish
Also, would this be the final version of MLHD, or are we gonna re-iterate through the data once again
2022-09-01 24402, 2022
alastairp
we should release it when we're happy with the mapping, and that will be our final version
2022-09-01 24439, 2022
Pratha-Fish
Great :)
2022-09-01 24442, 2022
alastairp
re: parquet - yes, this is a good idea. we should have an option to write to either tsv/zstd, or to parquet, maybe if pandas gives us easily other formats we could do that too
2022-09-01 24401, 2022
Pratha-Fish
Just asking, because this one might take a while to get processed
2022-09-01 24420, 2022
alastairp
absolutely. I hope it'll be a similar time to the last thing we ran
2022-09-01 24429, 2022
alastairp
even if it's 2x as slow, that's only 10 days or so
2022-09-01 24443, 2022
Pratha-Fish
alastairp: Pandas, as well as PyArrow (the high-performance library that I'm currently using for reading and writing files)
2022-09-01 24452, 2022
alastairp
remember last time that I suggested that we can do parallel computation too - if we run it on 8 threads, it could be finished within a few days
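One way the 8-worker idea could look, as a sketch only: it assumes the dataset can be split into independent chunks (for example one file per user) and that a hypothetical `worker` function processes one chunk. `process_in_parallel` and `count_lines` are illustrative names, not part of the existing scripts.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_in_parallel(items: Iterable[T],
                        worker: Callable[[T], R],
                        max_workers: int = 8) -> List[R]:
    """Run `worker` on every item using a pool of `max_workers` threads.

    Results come back in the same order as `items`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))


def count_lines(chunk: str) -> int:
    # Stand-in for the real per-file work (read, look up MBIDs, write).
    return len(chunk.splitlines())


results = process_in_parallel(["a\nb", "c"], count_lines)
```

Threads help most when the work is dominated by I/O such as reading and writing files. If the per-file work is CPU-bound pure Python, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface and sidesteps the GIL.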
2022-09-01 24426, 2022
Pratha-Fish
alastairp: yes, I'm hoping for a similar computation time as well, but the checking process seems to be quite slow right now, so it could take a while longer
2022-09-01 24436, 2022
alastairp
re: release mbids... that's another big question, and one that we haven't thought through completely. As a result of computing the canonical recording mbid, we end up also with the "canonical release mbid", so that's a good option
2022-09-01 24406, 2022
alastairp
another option - consider someone is listening to a compilation album. lots of songs by different artists
2022-09-01 24422, 2022
alastairp
we could report each song as being a part of the original album that it was released on
2022-09-01 24404, 2022
alastairp
but we could also see if it would be possible to identify that these songs were actually listened to in order, and as part of an album
we should bring this code inline, in fact we should load it into memory as well
2022-09-01 24441, 2022
Pratha-Fish
So that has been sped up exponentially
2022-09-01 24458, 2022
alastairp
this data can't be more than a few 10s of gb, if that. no problem to store in memory and just do a lookup into a dict
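The in-memory lookup could be sketched like this, assuming the mapping is available as `(recording_mbid, canonical_mbid)` pairs, for instance dumped once from the database. The function names and the shape of the pairs are assumptions for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple


def build_canonical_index(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Load (recording_mbid, canonical_mbid) pairs into a dict once."""
    return dict(pairs)


def lookup_canonical(mbid: str, index: Dict[str, str]) -> Optional[str]:
    """Return the canonical MBID, or None when the MBID is not in the index."""
    return index.get(mbid)


index = build_canonical_index([("r1", "c1"), ("r2", "c1")])
```

Each lookup is then an average constant-time hash lookup with no database round trip, at the cost of holding the whole table in memory.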
2022-09-01 24403, 2022
alastairp
what else is slow?
2022-09-01 24418, 2022
Pratha-Fish
I know what exactly is bottlenecking the process at this point
2022-09-01 24422, 2022
alastairp
so the mbc lookup in the current code already goes directly to the db?
2022-09-01 24434, 2022
alastairp
ok cool, let's benchmark it then.
2022-09-01 24446, 2022
alastairp
when will you be working next? we could sit down tomorrow afternoon and look at it?
2022-09-01 24449, 2022
Pratha-Fish
I'm using pandas.map() functions to apply functions to a whole series / array
2022-09-01 24412, 2022
Pratha-Fish
But the aforementioned function is just a fancy implementation of a non-vectorized for loop, which makes it painfully slow
2022-09-01 24421, 2022
Pratha-Fish
The solution would be to somehow vectorize it
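One possible way to vectorize the lookup, shown only as a sketch: replace the per-row Python function with a single left join against a mapping table. The column names `recording_mbid` and `canonical_mbid` and the function `add_canonical` are assumptions, not names from the existing code.

```python
import pandas as pd


def add_canonical(listens: pd.DataFrame, mapping: pd.DataFrame) -> pd.DataFrame:
    """Attach a canonical_mbid column with one left join.

    Rows whose recording_mbid has no match get NaN in canonical_mbid,
    and the row order of `listens` is preserved.
    """
    return listens.merge(mapping, on="recording_mbid", how="left")


listens = pd.DataFrame({"recording_mbid": ["a", "b", "x"]})
mapping = pd.DataFrame({"recording_mbid": ["a", "b"],
                        "canonical_mbid": ["A", "B"]})
out = add_canonical(listens, mapping)
```

The join runs inside pandas' compiled code instead of calling a Python function once per row, which is where `Series.map` with a function spends most of its time.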
2022-09-01 24446, 2022
Pratha-Fish
re: let's benchmark it, when would you be free
2022-09-01 24411, 2022
Pratha-Fish
I'd be free by 5pm IST tomorrow
2022-09-01 24457, 2022
Pratha-Fish
^1:30 PM Madrid time
2022-09-01 24443, 2022
alastairp
I won't be free for another ~2 hours after that
2022-09-01 24418, 2022
alastairp
but after that would be fine
2022-09-01 24431, 2022
Pratha-Fish
Sure, works for me too
2022-09-01 24433, 2022
Pratha-Fish
I should be free from the said time to 12:00AM IST (except for some time in b/w for dinner)
2022-09-01 24419, 2022
lucifer
Pratha-Fish: hi! i took a quick look again at the list. i think the report still has some issues. search for `Whirlwind in D Minor` in it and see that "canonical mbid lookup" links to a totally different recording. but if you open the lookup link, the lookup found by it is actually correct and matches the canonical mbid.
2022-09-01 24420, 2022
lucifer
oh yeah, Pratha-Fish, looking at the report, a lot of the canonical mbids are the same as the canonical mbid lookup of the previous row.
2022-09-01 24454, 2022
alastairp
🔍
2022-09-01 24431, 2022
style- joined the channel
2022-09-01 24420, 2022
tandy1000
can you use listen imports to submit now playing listens too?