alastairp: I just realized you've worked w/ freesound.org as well???
That's insane. I've used that site for so many of my personal music projects lool
Also, my most popular meme was also a fart sample that I took from freesound.org and visualized with a spectrum analyzer and posted on r/FL_studio because it looked beautiful XD
Good times
odnes joined the channel
odnes has quit
mayhem
moin moin!
Pratha-Fish
mayhem: hey tell me the story of how you got amazon to pay a 3 year due invoice by sending them a cake sometime lol
yvanzo: moin! Anything special for this docker release?
yvanzo
hi reosarevok: yes, I drafted release notes but have to make some improvements.
reosarevok
Ok :) I'll start the prod release (won't be around in the evening probably)
but we can look at that bit later
CatQuest
oh hey reo ˆ__ˆ
reosarevok
Hi!
CatQuest
:D
mayhem
lucifer: on TS the listened_at_track_name_user_id_ndx_listen index was created live and we didn't decide at the time if we wanted to keep it, yes?
because if that is so then PR 2042 makes sense. :)
lucifer
mayhem: yes it was created live. we needed it to keep the on conflict clauses working.
still need to figure out how many dupes there are in the db and how to delete those.
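Counting the dupes lucifer mentions could be sketched like this. This is a minimal sketch against a toy in-memory SQLite table, not the real TimescaleDB listen table; the column names (`user_id`, `listened_at`, `track_name`) are taken from the chat, everything else is an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE listen (
        user_id INTEGER,
        listened_at INTEGER,
        track_name TEXT
    )
""")
conn.executemany(
    "INSERT INTO listen VALUES (?, ?, ?)",
    [
        (1, 100, "Song A"),
        (1, 100, "song a"),   # case-insensitive dup of the row above
        (1, 200, "Song B"),
    ],
)

# Count groups that collide when track_name is compared case-insensitively.
dupes = conn.execute("""
    SELECT user_id, listened_at, lower(track_name), COUNT(*) AS n
    FROM listen
    GROUP BY user_id, listened_at, lower(track_name)
    HAVING n > 1
""").fetchall()

print(len(dupes))  # number of duplicate groups
```

On the real table the same GROUP BY would be restricted to a time range so it doesn't scan everything at once.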
mayhem
there is dup detection and removal code in the MBID mapping stuff, you can take a look at it.
to use it for TS, I think we would have to do it on a set of chunks at the same time
well, one at a time, once the new index is in place.
lucifer
we cant create the index without deleting dupes.
mayhem
why not delete the dups?
lucifer
ah no, i mean we should delete the dupes. i misunderstood your message as saying to create the index first and delete the dupes afterwards
mayhem
that would be ideal, but not possible.
we will have the problem that new dups can be created while we are deleting the old ones.
but I wonder if we can make the script that deletes dups work on ranges or the whole listen table.
then we do a month or so at a time and then once that is done, we try to create the index.
if that fails, we delete dups across the whole table.
but I doubt that would work, so we might end up chasing our tail on this one.
lucifer
i think dup deletion should be fast enough that we can stop ts writer while the script runs.
mayhem
I really doubt that.
lucifer
i see, lets try how fast it goes on one chunk and then decide what to do accordingly.
mayhem
well, if we do it in python then maybe. but pure SQL, I think that is going to OOM
lucifer
hmm, dont think it should oom but yeah really cant say without trying
mayhem
if we just fetch all the tracks ordered by listened_at and the other dedup fields and then just slowly delete all the dups, that could work. it might be fast enough for the second pass to run with TS writer stopped.
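The single-pass delete mayhem describes (group on the dedup fields, keep one row per group, delete the rest) could look roughly like this. Again a toy SQLite sketch; the real script would run per time range against TimescaleDB, and `rowid` stands in for whatever unique identifier the real table has:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listen (user_id INTEGER, listened_at INTEGER, track_name TEXT)"
)
conn.executemany("INSERT INTO listen VALUES (?, ?, ?)", [
    (1, 100, "Song A"),
    (1, 100, "song a"),
    (1, 100, "SONG A"),
    (1, 200, "Song B"),
])

# Keep the first row (lowest rowid) of each case-insensitive group,
# delete the rest. A production run would do this a month at a time,
# as discussed above, rather than over the whole table.
conn.execute("""
    DELETE FROM listen
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM listen
        GROUP BY user_id, listened_at, lower(track_name)
    )
""")
remaining = conn.execute("SELECT COUNT(*) FROM listen").fetchone()[0]
print(remaining)  # 2
```

Doing the grouping in SQL like this keeps memory bounded by the range being processed, which is the OOM concern raised above.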
Just published the blog post, apparently my first attempt failed.
mayhem
lucifer: which is the last_played API endpoint? I can't find it in the docs...
lucifer
mayhem: you mean when the recommendation was last played? if so, there is no separate endpoint. the recs json includes the timestamp with the mbid.
mayhem
ahhh, ok, no wonder I couldn't find it.
lucifer
those times are available for all recordings but only stored in spark currently. before sending recs to LB, that data is merged with the recs to add a timestamp field.
mayhem
easy then. :)
Pratha-Fish
hey alastairp sorry for the delay. Couldn't do a lot today, but I am getting started with the updated to-do list right now.
The to-do list is hosted in the journal BTW.
Updating it with specifics of the artist conflation issue too
[acousticbrainz-server] alastair closed pull request #396 (master…AB-407): AB-407: Redirect legacy API endpoints to new endpoint with http redirect https://github.com/metabrainz/acousticbrainz-se...
alastairp: also, you mentioned the part about making a csv with the following columns: mlhd_recording_mbid, mlhd_artist_mbid, mlhd_recording_name, mlhd_artist_name, mb_recording_artist_credit, mb_artist_mbids, mb_canonical_recording_mbid
TBH I am still a bit confused about this one. Maybe breaking it down into some macro steps could help :)
BrainzGit
[critiquebrainz] alastair opened pull request #438 (master…sampledb-missing-entities): Always return dummy data in debug mode if it's not in MusicBrainz https://github.com/metabrainz/critiquebrainz/pu...
alastairp
Pratha-Fish: sure. maybe let's deal with the first 4 columns then
you already look up these fields from the mlhd dataset in the `recording` table and the `artist` table
this will just involve selecting the `name` field from these tables too, and writing them to a new csv file
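The lookup alastairp describes for the first four columns could be sketched as below. The MusicBrainz table and column names (`recording.gid`, `recording.name`, `artist.gid`, `artist.name`) and the sample MBIDs are assumptions for illustration; only the CSV column names come from the chat:

```python
import csv
import io
import sqlite3

# Toy stand-ins for the MusicBrainz recording and artist tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE recording (gid TEXT, name TEXT);
    CREATE TABLE artist (gid TEXT, name TEXT);
    INSERT INTO recording VALUES ('rec-mbid-1', 'Some Recording');
    INSERT INTO artist VALUES ('art-mbid-1', 'Some Artist');
""")

# One (recording_mbid, artist_mbid) pair as it might appear in an MLHD row.
mlhd_rows = [("rec-mbid-1", "art-mbid-1")]

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["mlhd_recording_mbid", "mlhd_artist_mbid",
                 "mlhd_recording_name", "mlhd_artist_name"])
for rec_mbid, art_mbid in mlhd_rows:
    # The existing lookup already hits these tables; this just also
    # selects the `name` field and writes it out.
    rec_name = conn.execute(
        "SELECT name FROM recording WHERE gid = ?", (rec_mbid,)).fetchone()[0]
    art_name = conn.execute(
        "SELECT name FROM artist WHERE gid = ?", (art_mbid,)).fetchone()[0]
    writer.writerow([rec_mbid, art_mbid, rec_name, art_name])

print(out.getvalue())
```

The remaining columns (`mb_recording_artist_credit`, `mb_artist_mbids`, `mb_canonical_recording_mbid`) would extend the same loop with further lookups.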
lucifer
shared a script to delete listens submitted with the same listened_at and user_id but a different case for track_name.
alastairp
lucifer: that code skeleton looks familiar ;)
lucifer
alastairp: hehe yes, i copied it from your listen_fill_userid script :D
alastairp
so to confirm, we already reject exact duplicates of (userid, submitted, track_name), but we found these cases where we had case-insensitive dups on track name?
lucifer
currently we don't reject those. the PR adds an index to fix that.
before we create the index, we need to cleanup the existing dupes.
the intent is we do 1 pass, then turn off ts writer, do another pass. try to create index. restart ts writer.
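The index the PR adds would presumably be a unique index over the dedup fields with `track_name` lowercased, so the database itself rejects the case-insensitive dups once the cleanup passes are done. A sketch in SQLite (Postgres/Timescale expression-index syntax is close, though the real definition and the interaction with the ON CONFLICT clauses may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listen (user_id INTEGER, listened_at INTEGER, track_name TEXT)"
)

# Unique expression index: collisions on (listened_at, lower(track_name),
# user_id) are rejected at insert time. Index name taken from the chat.
conn.execute("""
    CREATE UNIQUE INDEX listened_at_track_name_user_id_ndx_listen
    ON listen (listened_at, lower(track_name), user_id)
""")

conn.execute("INSERT INTO listen VALUES (1, 100, 'Song A')")
try:
    conn.execute("INSERT INTO listen VALUES (1, 100, 'song a')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

This is why the dupes have to go first: creating a unique index fails if any existing rows already violate it, hence the pass / stop-writer / pass / create-index ordering above.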