in #metabrainz

10:39 AM
mayhem

and australia sorts before EU, so that explains the first one.
10:39 AM
lucifer

i see, makes sense.
10:39 AM
mayhem

the second one is that the format sort prefers CD over digital media. I think we should move digital media to the top.
10:40 AM
thoughts?
10:40 AM
lucifer

given that most listens are spotify listens +1
10:40 AM
mayhem

yeah, I hate building this bias in, but the mapping is all about assumptions and expectations.
10:40 AM
move on to the next?
10:40 AM
lucifer

yeah indeed
10:40 AM
yes
10:41 AM
mayhem

https://tickets.metabrainz.org/browse/LB-1036?f...
10:41 AM
BrainzBot

LB-1036: MBID Mapping improvement rollup ticket
10:41 AM
mayhem

unicode issue.
10:41 AM
solved.
10:41 AM
https://tickets.metabrainz.org/browse/LB-1036?f...
10:41 AM
alastairp

so the idea is that it would match a recording from a release made in US before, say, Chile? (assuming the recording was in fact released in both countries)
10:41 AM
mayhem

I see the point of this one and it would be nice, but might be hard to do.
10:41 AM
alastairp

is this a year tiebreaker? or independent of the year?
10:41 AM
mayhem

year is one of the much earlier sort columsn.
10:42 AM
lucifer

year tiebreaker
10:42 AM
alastairp

right, so it'll happen in the case that the year is the same
10:42 AM
mayhem

yes.
10:42 AM
and there is a case in the mapping later that puts this to a finer point.
10:42 AM
https://tickets.metabrainz.org/browse/LB-1036?f...
10:42 AM
alastairp

it sounds good (and of course, will cause its own problems in very few cases, but we're expecting that)
10:42 AM
mayhem

track length and attribution are the only differences here.
10:43 AM
this latter case and the video case we may need to leave here for the time being.
10:45 AM
thoughts/comments/move on?
10:46 AM
lucifer

uh jira doesn't have a way to mark comments as resolved to know what's left :/
10:46 AM
mayhem

I can paste my notes that should have a comment for each of the ticket's comments.
10:46 AM
lucifer

yes to next
10:47 AM
+1
10:47 AM
mayhem

https://tickets.metabrainz.org/browse/LB-1036?f...
10:47 AM
BrainzBot

LB-1036: MBID Mapping improvement rollup ticket
10:47 AM
mayhem

currently album always sorts higher than single.
10:47 AM
but maybe album/dj-mix should not rank higher than single?
10:48 AM
lucifer

something on lines of original albums rank high then single then dj-mix makes sense to me.
10:48 AM
mayhem

ok.
10:49 AM
https://gist.github.com/mayhem/cb9a6476d26a1e15...
10:49 AM
my notes in full
10:49 AM
lucifer

we could also try including album name for mapping (since many listens have those) but that would complicate stuff.
10:50 AM
mayhem

things we've decided: use PG unaccent, improve format sort for dj-mix/single, add country sort.
10:50 AM
lucifer: it tends to make things worse, really.
10:51 AM
I would prefer to keep album out -- at least until a clear use case emerges.
10:51 AM
lucifer

oh! yeah makes sense to leave out for now then
10:52 AM
mayhem

ok, so this brings us to fixing existing issues - those will be tricky, but I can hammer those out in a few days time.
10:52 AM
lets discuss de-tuning.
10:53 AM
right now we have an iterative approach to this process. try exact, detune, try fuzzy, etc.
10:53 AM
and that is too expensive to do if we are trying to make a better API endpoint.
10:54 AM
and given the timings, the exact lookup is MUCH faster and the most common match type, we should do this:
10:54 AM
1. Exact match.
10:54 AM
2. Fuzzy match
10:54 AM
3. Detune,
10:54 AM
4 (option A): Exact match, fuzzy match.
10:54 AM
4 (option B), Fuzzy match.
10:55 AM
reosarevok

bitmap: is your tags code for the schema change also solving MBS-11755? If not, we can run a script whenever to delete the extra tags, see comment there for a (hopefully relevant, I wrote it ages ago) query
10:55 AM
BrainzBot

MBS-11755: Remove unused tags https://tickets.metabrainz.org/browse/MBS-11755
10:55 AM
mayhem

and detune should happen in one and only one step. and only on the incoming metadata side. not sure detuning the MB data makes sense.
10:57 AM
reosarevok

bitmap: since you're regenerating all refcounts, it might be easier to just look at whatever is 0 after that and remove it though
10:57 AM
lucifer

i have seen spotify do some weird stuff aroudn this, let me find an example
10:58 AM
https://musicbrainz.org/recording/f90720d0-3c4e...
10:59 AM
Selena Gomez with Rauw Alejandro or Selena Gomez w/ Rauw Alejandro is the artist name in MB. Baila conmigo is recording name.
10:59 AM
spotify does. Baila conmigo (with Rauw Alejandro) as recording name and Selena Gomez as artist name.
11:00 AM
so this case may need detuning MB data to match.
11:00 AM
mayhem

if we feel that MB needs detuning, then we should add detuned rows to the index.
11:01 AM
lucifer

or maybe this gets caught by fuzzy match.
11:01 AM
mayhem

the fuzzy match will match 2-3 characters at most.
11:01 AM
lucifer

i see, makes sense.
11:01 AM
mayhem

match a difference of 2-3 characters at most.
11:01 AM
lucifer

can we do fuzzy match on words?
11:01 AM
mayhem

or otherwise it slows down.
11:01 AM
yes, that is supported and another order of magnitude slower than just letters
11:02 AM
lucifer

artist_name + recording_name (incoming) fuzzy match on artist_name + recording_name Mb data
11:02 AM
mayhem

I put it in last night and took it back out immediately, since it was sooooo slow.
11:02 AM
lucifer

oh :/
11:05 AM
can we do a faster endpoint and a slower background one?
11:06 AM
mayhem

yes, I think that is a good approach.
11:06 AM
lucifer

a background process that reads unmatched stuff from mapping table and looks it up via the slower means.
11:06 AM
cool sounds good
11:06 AM
mayhem

I am already clear on the fact that I want to keep the pipeline for mapping around. it is working well.
11:07 AM
ok, given that we want to work on this stuff this week, where should we split the work?
11:07 AM
lucifer

makes sense.
11:07 AM
mayhem

I know how to work on the things already discussed.
11:07 AM
I wonder if you'd be open for working on a better detuning engine.
11:08 AM
you seem to have ideas on that front and its a pretty separate piece of code.
11:08 AM
lucifer

sure makes sense
11:09 AM
mayhem

maybe draw up your thinking on a gist/doc so we can discuss?
11:09 AM
lucifer

do we keep typesense around in the background pipeline?
11:10 AM
mayhem

possibly. I have a feeling it performs better for fuzzy matching up to 5 characters.
11:10 AM
let me add some timing to the typesense based search and then we'll have a better idea.
11:11 AM
reosarevok

yvanzo, bitmap: I added some descriptions for the different tickets to the schema change draft doc too, btw
11:11 AM
mayhem

that will be my first task, I think.
11:11 AM
lucifer

makes sense 👍
11:11 AM
reosarevok

yvanzo: I also added the description from last year's blog post to your AC ticket, but do check if it's still correct
11:13 AM
lucifer

mayhem, do we have a process that periodically rechecks unmatched listens automatically or is it manually invalidating some rows?
11:14 AM
mayhem

right now I invalidate rows by hand. I just invalidated all no_match matches for 2021.
11:14 AM
this needs to be automated. not sure how yet.
11:15 AM
lucifer

cron job to invalidate listens weekly, newer ones more frequently and older ones less so?
11:16 AM
mayhem

that
11:25 AM
Dijia

Hi, I had a problem when submitting my listen record. I have successfully built the development environment, everything works well except the "listens" page. Everytime I open this page, it shows "get /socket.io/?eio=4&transport=polling&t=nz-uqq1 http/1.1" 404 in git. When I tried to submit a record, the number of recent listens can increase but no listen records shown in the page. I have googled this bug but there seems to be no solutions. Is there
11:25 AM
anyone know what to do with that?
11:27 AM
lucifer

Dijia: uh yeah, that's a known issue. can you try running `./develop.sh manage update_user_listen_data` ? (that 404 is unrelated)
11:27 AM
mayhem

Dijia: hi. if you have errors like these, it is best to past the error you're getting -- that helps us understand better.
11:28 AM
unless you're lucifer , who is clearly a mind-reader. :)
11:28 AM
lucifer

being the devil comes with its perks :)
11:29 AM
Dijia

Ah thank you lucifer!! It works!! Amazing!
11:29 AM
mayhem

the dark side does seem to have better perks. sigh.
11:29 AM
lucifer

this is the same issue that we discussed last week, happens in prod due to which listens don't appear until cron job runs but well it never runs in dev so listens would never appear. i'll get the fix out soon.
11:33 AM
mayhem: oh i forgot to tell you. that query downgraded to full chunk scan again :-(. the test reproducer i had created to compare 11/13 works fine on 13 but the actual still query doesn't. i think i have found another work around (using a subquery isntead of CTE). opening a TS bug as we speak.
11:34 AM
https://explain.dalibo.com/plan/q7W#query and https://explain.dalibo.com/plan/Vo6#query
11:34 AM
akshaaatt

Hi yellowhatpro! The design looks good to me so far. We can have a figma design for it if you're comfortable. Otherwise also I think you can proceed :)
11:34 AM
lucifer

instead of JOINing to CTE and select from it as subquery and you get chunk exclusion.
11:35 AM
mayhem

oh joy. :(
11:44 AM
https://www.ctrl.blog/entry/account-deletions-2...
11:46 AM
lucifer

>The last successful request was processed 71 days after the first email. The GDPR doesn’t define “without undue delay”, but I’m fairly certain that it requires companies to not stall for over 10 weeks.
11:47 AM
spotify assumes it to mean 3 months apparently
11:47 AM
mayhem

once again, we're among the few who take this seriously.
11:47 AM
lucifer

indeed
11:59 AM
yellowhatpro

<akshaaatt> "Hi yellowhatpro! The design..." <- Thanks sempaiii.. I will be working on the Figma designs then. (/≧▽≦)/
12:47 PM
mayhem

lucifer: alastairp : I guess we're keeping typesense then
12:47 PM
https://bono.metabrainz.org/typesense-test?arti...
12:47 PM
edit distance of 5 and and 32 queries a sec.
12:48 PM
alastairp

compared to https://bono.metabrainz.org/acr-lookup-trgm?art... ?
12:48 PM
mayhem

pg_trgm can't touch that. just like mchammer can't touch THIS.
12:48 PM
alastairp

fuzzy 16/sec
12:48 PM
mayhem

16/sec at .6 which is about 2 edit distance on average.
12:48 PM
so, pg_trm is quite a bit slower, sadly.
12:49 PM
alastairp

is postgres memory settings on bono optimal?
13:06 PM
reosarevok

mayhem: wow D:
13:07 PM
ankes

Hi, I have seen that after Spotify's hiccup last week the listening logs collection in ListenBrainz stopped working for some accounts. I am monitoring a few users for an experiment that I am doing, and after asking them to disconnect / reconnect, still it's not working (I have written about this issue yesterday to the MetaBrainz contact email). Is
13:07 PM
there any way to check if their ListenBrainz accounts are still linked to Spotify (and that the collection is working properly)?
13:24 PM
lucifer

ankes: hi! yeah if you can share the username with us, we can check whether spotify is linked or not. other than that all data is public so you can see if listens are coming on website/api, then its working.
13:25 PM
mayhem: oh nice! what about edit distance 2, 3? if typesense is fast there too then might as well not do fuzzy match in pg. also which version of typesense is this, probably should upgrade to latest for more enhancements.
13:29 PM
ankes

lucifer thanks! for instance "draconisfirebolt" was working until last tue, then stopped, but then after disconnecting/reconnecting still is not working. The same goes for "ByeBye", "bigDart" and "Danysanak" (I doublecheck with the API)
13:31 PM
CatQuest

whatever is 0 after that and remove it though
13:31 PM
will this remove tags fro msearch and the like that have literally no hits?
13:32 PM
it annoys me to no end that misspelled tags I made one second exist forever because you can't permaremove tags
13:32 PM
zas

atj: about ansible role for haproxy, I think we'll need quite a lot specific settings, but we can use one as basis. I read some roles are not 100% compatible with most recent haproxy versions, we'll likely use one of the most recent version (2.5.x) because we need some very recent features
13:33 PM
lucifer

ankes: all of those disconnected on 8th (probably due to the spotify downtime), and haven't been reconnected since.
13:40 PM
ankes

lucifer this is strange because they told me they did it. I will ask them to double-check. thanks!
13:41 PM
lucifer

👍, i also disconnected/reconnected my account just to confirm that our part of workflow if working fine.
13:41 PM
*just now
13:50 PM
alastairp: had you tried importing from the pg_dump you made the other day for ts? i am trying to dump my local db (~400 listens) and import to create a small smaple for TS bug report and importing from it is failing.
13:50 PM
alastairp

lucifer: I didn't make a pg_dump, we just copied the entire data directory
13:50 PM
lucifer

ah ok 👍
13:50 PM
alastairp

what error are you seeing?
13:51 PM
lucifer: I just updated https://github.com/metabrainz/docker-python/pul... with latest python, and am just testing the 3.10 one now too
13:51 PM
do you want to give it a quick look so that we can merge?
13:52 PM
lucifer

i'll reproduce the error and share it.
13:52 PM
sure will look at it.
13:52 PM
alastairp: you mean 3.9.10 or 3.10?