so the idea is that it would match a recording from a release made in the US before, say, one from Chile? (assuming the recording was in fact released in both countries)
mayhem
I see the point of this one and it would be nice, but might be hard to do.
alastairp
is this a year tiebreaker? or independent of the year?
mayhem
year is one of the much earlier sort columns.
lucifer
year tiebreaker
alastairp
right, so it'll happen in the case that the year is the same
mayhem
yes.
and there is a case in the mapping later that puts a finer point on this.
we could also try including album name for mapping (since many listens have those) but that would complicate stuff.
mayhem
things we've decided: use PG unaccent, improve format sort for dj-mix/single, add country sort.
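(A minimal sketch of the unaccent decision, assuming lookups go through a normalization function; the wrapper name is invented. unaccent() is only STABLE, so the usual trick is an IMMUTABLE wrapper so it can be used in an expression index.)
```sql
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Passing the dictionary explicitly makes the call deterministic,
-- which is what lets us declare the wrapper IMMUTABLE.
CREATE FUNCTION mapping_unaccent(text) RETURNS text AS
$$ SELECT public.unaccent('public.unaccent', $1) $$
LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT;

SELECT mapping_unaccent('Beyoncé');  -- => 'Beyonce'
```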
lucifer: it tends to make things worse, really.
I would prefer to keep album out -- at least until a clear use case emerges.
lucifer
oh! yeah makes sense to leave out for now then
mayhem
ok, so this brings us to fixing existing issues - those will be tricky, but I can hammer them out in a few days' time.
let's discuss de-tuning.
right now we have an iterative approach to this process. try exact, detune, try fuzzy, etc.
and that is too expensive to do if we are trying to make a better API endpoint.
and given the timings (the exact lookup is MUCH faster and also the most common match type), we should do this, roughly sketched after the list:
1. Exact match.
2. Fuzzy match.
3. Detune.
4 (option A): Exact match, then fuzzy match.
4 (option B): Fuzzy match only.
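(A rough sketch of how those steps might look as queries; the table and column names here are invented for illustration.)
```sql
-- 1. Exact match against the (normalized) index.
SELECT recording_mbid
  FROM mapping.canonical_index          -- hypothetical table
 WHERE lookup = :query;

-- 2. Only if step 1 found nothing: pg_trgm fuzzy match.
SELECT recording_mbid, similarity(lookup, :query) AS sim
  FROM mapping.canonical_index
 WHERE lookup % :query
 ORDER BY sim DESC
 LIMIT 1;

-- 3./4. Detune :query in application code (strip "(feat. …)" and
-- similar decorations), then rerun step 1 followed by step 2
-- (option A), or go straight to step 2 (option B).
```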
reosarevok
bitmap: does your tags code for the schema change also solve MBS-11755? If not, we can run a script whenever to delete the extra tags; see the comment there for a (hopefully relevant, I wrote it ages ago) query
Selena Gomez with Rauw Alejandro or Selena Gomez w/ Rauw Alejandro is the artist name in MB. Baila conmigo is the recording name.
spotify does it differently: Baila conmigo (with Rauw Alejandro) as the recording name and Selena Gomez as the artist name.
so this case may need detuning MB data to match.
mayhem
if we feel that MB needs detuning, then we should add detuned rows to the index.
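(One possible detune rule, purely as an illustration of what a detuned index row could hold; the regex is a guess at the kind of decoration to strip.)
```sql
SELECT regexp_replace(
         'Baila conmigo (with Rauw Alejandro)',
         '\s*\((with|feat\.?|featuring)\s[^)]*\)\s*$',
         '',
         'i'
       );
-- => 'Baila conmigo'
-- Indexing this alongside the original row would let the exact
-- match catch Spotify-style "(with X)" titles directly.
```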
lucifer
or maybe this gets caught by fuzzy match.
mayhem
the fuzzy match will match 2-3 characters at most.
lucifer
i see, makes sense.
mayhem
that is, it will match a difference of 2-3 characters at most.
lucifer
can we do fuzzy match on words?
mayhem
or otherwise it slows down.
yes, that is supported, but it's another order of magnitude slower than just letters
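(If the word-level variant ends up in Postgres, pg_trgm does ship one; whether it applies here is an assumption on my part.)
```sql
-- word_similarity() scores the best word-boundary substring of the
-- second argument against the first, rather than whole strings:
SELECT word_similarity('rauw alejandro',
                       'Baila conmigo (with Rauw Alejandro)');
-- the related indexable operator is <% (commutator %>)
```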
lucifer
artist_name + recording_name (incoming) fuzzy matched against artist_name + recording_name in the MB data?
mayhem
I put it in last night and took it back out immediately, since it was sooooo slow.
lucifer
oh :/
can we do a faster endpoint and a slower background one?
mayhem
yes, I think that is a good approach.
lucifer
a background process that reads unmatched stuff from the mapping table and looks it up via the slower method.
cool sounds good
mayhem
I am already clear that I want to keep the mapping pipeline around; it is working well.
ok, given that we want to work on this stuff this week, where should we split the work?
lucifer
makes sense.
mayhem
I know how to work on the things already discussed.
I wonder if you'd be open to working on a better detuning engine.
you seem to have ideas on that front and it's a pretty separate piece of code.
lucifer
sure makes sense
mayhem
maybe draw up your thinking on a gist/doc so we can discuss?
lucifer
do we keep typesense around in the background pipeline?
mayhem
possibly. I have a feeling it performs better for fuzzy matching up to 5 characters.
let me add some timing to the typesense based search and then we'll have a better idea.
reosarevok
yvanzo, bitmap: I added some descriptions for the different tickets to the schema change draft doc too, btw
mayhem
that will be my first task, I think.
lucifer
makes sense 👍
reosarevok
yvanzo: I also added the description from last year's blog post to your AC ticket, but do check if it's still correct
lucifer
mayhem, do we have a process that periodically rechecks unmatched listens automatically, or do you manually invalidate some rows?
mayhem
right now I invalidate rows by hand. I just invalidated all no_match matches for 2021.
this needs to be automated. not sure how yet.
lucifer
cron job to invalidate listens weekly, newer ones more frequently and older ones less so?
mayhem
that
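(A hedged sketch of that cron idea; the mapping table and column names are invented.)
```sql
-- Re-queue unmatched rows for another pass: retry recent listens
-- weekly, older ones monthly.
UPDATE mapping.listen_mapping
   SET match_type = NULL           -- NULL = "look this up again"
 WHERE match_type = 'no_match'
   AND (
         (last_checked < now() - interval '7 days'
          AND listened_at > now() - interval '1 year')
      OR  last_checked < now() - interval '30 days'
       );
```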
Dijia
Hi, I had a problem when submitting my listen record. I have successfully built the development environment and everything works well except the "listens" page. Every time I open this page, it shows "get /socket.io/?eio=4&transport=polling&t=nz-uqq1 http/1.1" 404 in the logs. When I tried to submit a record, the number of recent listens increased but no listen records were shown on the page. I have googled this bug but there seem to be no solutions. Does anyone know what to do with that?
lucifer
Dijia: uh yeah, that's a known issue. can you try running `./develop.sh manage update_user_listen_data` ? (that 404 is unrelated)
mayhem
Dijia: hi. if you have errors like these, it is best to paste the error you're getting -- that helps us understand better.
unless you're lucifer, who is clearly a mind-reader. :)
lucifer
being the devil comes with its perks :)
Dijia
Ah thank you lucifer!! It works!! Amazing!
mayhem
the dark side does seem to have better perks. sigh.
lucifer
this is the same issue that we discussed last week: in prod, listens don't appear until the cron job runs, but that job never runs in dev, so listens would never appear. i'll get the fix out soon.
mayhem: oh i forgot to tell you. that query downgraded to a full chunk scan again :-(. the test reproducer i had created to compare 11/13 works fine on 13, but the actual query still doesn't. i think i have found another workaround (using a subquery instead of a CTE). opening a TS bug as we speak.
akshaaatt
Hi yellowhatpro! The design looks good to me so far. We can have a figma design for it if you're comfortable. Otherwise I think you can proceed anyway :)
lucifer
instead of JOINing to the CTE, select from it as a subquery, and you get chunk exclusion.
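(Roughly what that workaround looks like; the actual query is TimescaleDB-specific and these names are made up.)
```sql
-- Before: JOINing the CTE kept the planner from excluding chunks,
-- so every chunk of the hypertable was scanned.
WITH candidates AS (
    SELECT user_id, created FROM listen_state
)
SELECT l.*
  FROM listen l
  JOIN candidates c
    ON l.user_id = c.user_id AND l.listened_at > c.created;

-- After: selecting from the CTE as a correlated subquery lets the
-- predicate reach the hypertable scan, restoring chunk exclusion.
WITH candidates AS (
    SELECT user_id, created FROM listen_state
)
SELECT l.*
  FROM listen l
 WHERE l.listened_at > (SELECT c.created
                          FROM candidates c
                         WHERE c.user_id = l.user_id);
```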
>The last successful request was processed 71 days after the first email. The GDPR doesn’t define “without undue delay”, but I’m fairly certain that it requires companies to not stall for over 10 weeks.
spotify assumes it to mean 3 months apparently
mayhem
once again, we're among the few who take this seriously.
lucifer
indeed
yellowhatpro
<akshaaatt> "Hi yellowhatpro! The design..." <- Thanks sempaiii.. I will be working on the Figma designs then. (/≧▽≦)/
mayhem
lucifer: alastairp : I guess we're keeping typesense then
pg_trgm can't touch that. just like MC Hammer can't touch THIS.
alastairp
fuzzy 16/sec
mayhem
16/sec at 0.6, which is about an edit distance of 2 on average.
so, pg_trgm is quite a bit slower, sadly.
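(For context, the 0.6 above is presumably pg_trgm's similarity cutoff, which is tunable per session; the table name is invented.)
```sql
SET pg_trgm.similarity_threshold = 0.6;

-- % only matches rows at or above the threshold, so raising it
-- trades recall for speed.
SELECT lookup, similarity(lookup, :query) AS sim
  FROM mapping.canonical_index
 WHERE lookup % :query
 ORDER BY sim DESC;
```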
alastairp
are the postgres memory settings on bono optimal?
reosarevok
mayhem: wow D:
ankes
Hi, I have seen that after Spotify's hiccup last week, the listening logs collection in ListenBrainz stopped working for some accounts. I am monitoring a few users for an experiment that I am doing, and after asking them to disconnect/reconnect, it's still not working (I wrote about this issue yesterday to the MetaBrainz contact email). Is there any way to check if their ListenBrainz accounts are still linked to Spotify (and that the collection is working properly)?
lucifer
ankes: hi! yeah, if you can share the usernames with us, we can check whether spotify is linked or not. other than that, all the data is public, so if you can see listens coming in on the website/API, then it's working.
mayhem: oh nice! what about edit distance 2 or 3? if typesense is fast there too, then we might as well not do fuzzy matching in pg. also, which version of typesense is this? we should probably upgrade to the latest for more enhancements.
ankes
lucifer thanks! for instance "draconisfirebolt" was working until last tue, then stopped, and after disconnecting/reconnecting it is still not working. The same goes for "ByeBye", "bigDart" and "Danysanak" (I double-checked with the API)
CatQuest
whatever is at 0 after that, remove it though
will this remove tags from search and the like that have literally no hits?
it annoys me to no end that misspelled tags I made one second ago exist forever because you can't perma-remove tags
zas
atj: about the ansible role for haproxy, I think we'll need quite a lot of specific settings, but we can use one as a basis. I read that some roles are not 100% compatible with the most recent haproxy versions; we'll likely use one of the most recent versions (2.5.x) because we need some very recent features
lucifer
ankes: all of those disconnected on the 8th (probably due to the spotify downtime), and haven't been reconnected since.
ankes
lucifer this is strange because they told me they did it. I will ask them to double-check. thanks!
lucifer
👍, i also disconnected/reconnected my account just to confirm that our part of the workflow is working fine.
*just now
alastairp: had you tried importing from the pg_dump you made the other day for ts? i am trying to dump my local db (~400 listens) and import it to create a small sample for the TS bug report, and importing from it is failing.
alastairp
lucifer: I didn't make a pg_dump, we just copied the entire data directory