#metabrainz

      • mayhem
        and australia sorts before EU, so that explains the first one.
      • lucifer
        i see, makes sense.
      • mayhem
        the second one is that the format sort prefers CD over digital media. I think we should move digital media to the top.
      • thoughts?
      • lucifer
        given that most listens are spotify listens +1
      • mayhem
        yeah, I hate building this bias in, but the mapping is all about assumptions and expectations.
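The format-preference sort being agreed on above might look like this minimal sketch; the function name and the exact format list are assumptions for illustration, not the real mapping code:

```python
# Hypothetical sketch: rank release formats so Digital Media sorts first,
# reflecting the assumption that most listens come from streaming services.
FORMAT_PREFERENCE = ["Digital Media", "CD", "Vinyl", "Cassette"]  # assumed order

def format_sort_key(release_format: str) -> int:
    """Known formats sort by preference; unknown formats sort last."""
    try:
        return FORMAT_PREFERENCE.index(release_format)
    except ValueError:
        return len(FORMAT_PREFERENCE)

releases = ["CD", "Vinyl", "Digital Media"]
print(sorted(releases, key=format_sort_key))  # ['Digital Media', 'CD', 'Vinyl']
```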
      • move on to the next?
      • lucifer
        yeah indeed
      • yes
      • mayhem
      • BrainzBot
        LB-1036: MBID Mapping improvement rollup ticket
      • mayhem
        unicode issue.
      • solved.
      • alastairp
        so the idea is that it would match a recording from a release made in US before, say, Chile? (assuming the recording was in fact released in both countries)
      • mayhem
        I see the point of this one and it would be nice, but might be hard to do.
      • alastairp
        is this a year tiebreaker? or independent of the year?
      • mayhem
        year is one of the much earlier sort columns.
      • lucifer
        year tiebreaker
      • alastairp
        right, so it'll happen in the case that the year is the same
      • mayhem
        yes.
      • and there is a case in the mapping later that puts this to a finer point.
      • alastairp
        it sounds good (and of course, will cause its own problems in very few cases, but we're expecting that)
      • mayhem
        track length and attribution are the only differences here.
      • this latter case and the video case we may need to leave here for the time being.
      • thoughts/comments/move on?
      • lucifer
        uh jira doesn't have a way to mark comments as resolved to know what's left :/
      • mayhem
        I can paste my notes that should have a comment for each of the ticket's comments.
      • lucifer
        yes to next
      • +1
      • mayhem
      • BrainzBot
        LB-1036: MBID Mapping improvement rollup ticket
      • mayhem
        currently album always sorts higher than single.
      • but maybe album/dj-mix should not rank higher than single?
      • lucifer
        something along the lines of original albums ranking highest, then singles, then dj-mixes makes sense to me.
      • mayhem
        ok.
      • my notes in full
      • lucifer
        we could also try including album name for mapping (since many listens have those) but that would complicate stuff.
      • mayhem
        things we've decided: use PG unaccent, improve format sort for dj-mix/single, add country sort.
      • lucifer: it tends to make things worse, really.
      • I would prefer to keep album out -- at least until a clear use case emerges.
      • lucifer
        oh! yeah makes sense to leave out for now then
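For context, the "use PG unaccent" decision above amounts to accent-folding strings before comparison. A rough Python equivalent of what the Postgres unaccent extension does:

```python
import unicodedata

# Strip diacritics so e.g. "Beyoncé" and "Beyonce" compare equal:
# decompose to NFKD, then drop the combining marks.
def unaccent(s: str) -> str:
    return "".join(c for c in unicodedata.normalize("NFKD", s)
                   if not unicodedata.combining(c))

print(unaccent("Beyoncé"))  # Beyonce
```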
      • mayhem
        ok, so this brings us to fixing existing issues - those will be tricky, but I can hammer those out in a few days' time.
      • lets discuss de-tuning.
      • right now we have an iterative approach to this process. try exact, detune, try fuzzy, etc.
      • and that is too expensive to do if we are trying to make a better API endpoint.
      • and given the timings (the exact lookup is MUCH faster and is the most common match type), we should do this:
      • 1. Exact match.
      • 2. Fuzzy match.
      • 3. Detune.
      • 4 (option A): Exact match, fuzzy match.
      • 4 (option B): Fuzzy match.
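The cascade above (taking option A for step 4) can be sketched as follows; the toy index, matchers, and detune step are illustrative stand-ins, not the real mapping code:

```python
import difflib
import re

INDEX = {("selena gomez", "baila conmigo")}  # toy stand-in for the MB index

def exact_match(artist, rec):
    return (artist, rec) if (artist, rec) in INDEX else None

def fuzzy_match(artist, rec):
    # toy fuzzy matcher: tolerate only a small difference, as discussed
    for a, r in INDEX:
        if difflib.SequenceMatcher(None, artist + rec, a + r).ratio() > 0.9:
            return (a, r)
    return None

def detune(artist, rec):
    # strip featured-artist suffixes like "(with Rauw Alejandro)"
    return artist, re.sub(r"\s*\((with|feat\.?)[^)]*\)", "", rec).strip()

def lookup(artist, rec):
    hit = exact_match(artist, rec)       # 1. exact match (fastest, most common)
    if hit is None:
        hit = fuzzy_match(artist, rec)   # 2. fuzzy match
    if hit is None:
        artist, rec = detune(artist, rec)                           # 3. detune once
        hit = exact_match(artist, rec) or fuzzy_match(artist, rec)  # 4 (option A)
    return hit
```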
      • reosarevok
        bitmap: is your tags code for the schema change also solving MBS-11755? If not, we can run a script whenever to delete the extra tags, see comment there for a (hopefully relevant, I wrote it ages ago) query
      • BrainzBot
      • mayhem
        and detune should happen in one and only one step. and only on the incoming metadata side. not sure detuning the MB data makes sense.
      • reosarevok
        bitmap: since you're regenerating all refcounts, it might be easier to just look at whatever is 0 after that and remove it though
      • lucifer
        i have seen spotify do some weird stuff around this, let me find an example
      • Selena Gomez with Rauw Alejandro or Selena Gomez w/ Rauw Alejandro is the artist name in MB. Baila conmigo is the recording name.
      • spotify does: Baila conmigo (with Rauw Alejandro) as the recording name and Selena Gomez as the artist name.
      • so this case may need detuning MB data to match.
      • mayhem
        if we feel that MB needs detuning, then we should add detuned rows to the index.
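The "add detuned rows to the index" idea could look like this sketch: whenever an artist credit contains a join phrase, also emit a row under the bare lead artist, so a plain "Selena Gomez" listen can match. The join-phrase list is an assumption:

```python
import re

# Assumed set of join phrases that split a credit from its featured artists.
JOIN_PHRASE = re.compile(r"\s+(featuring|feat\.?|with|w/)\s+.*$", re.IGNORECASE)

def index_rows(artist_credit, recording):
    yield (artist_credit, recording)            # the original row
    detuned = JOIN_PHRASE.sub("", artist_credit)
    if detuned != artist_credit:
        yield (detuned, recording)              # the extra, detuned row

rows = list(index_rows("Selena Gomez with Rauw Alejandro", "Baila conmigo"))
# rows[1] == ("Selena Gomez", "Baila conmigo")
```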
      • lucifer
        or maybe this gets caught by fuzzy match.
      • mayhem
        the fuzzy match will match 2-3 characters at most.
      • lucifer
        i see, makes sense.
      • mayhem
        match a difference of 2-3 characters at most.
      • lucifer
        can we do fuzzy match on words?
      • mayhem
        or otherwise it slows down.
      • yes, that is supported, and it is another order of magnitude slower than just letters.
      • lucifer
        artist_name + recording_name (incoming) fuzzy match on artist_name + recording_name MB data
      • mayhem
        I put it in last night and took it back out immediately, since it was sooooo slow.
      • lucifer
        oh :/
      • can we do a faster endpoint and a slower background one?
      • mayhem
        yes, I think that is a good approach.
      • lucifer
        a background process that reads unmatched stuff from the mapping table and looks it up via the slower means.
      • cool sounds good
      • mayhem
        I am already clear on the fact that I want to keep the pipeline for mapping around. it is working well.
      • ok, given that we want to work on this stuff this week, where should we split the work?
      • lucifer
        makes sense.
      • mayhem
        I know how to work on the things already discussed.
      • I wonder if you'd be open for working on a better detuning engine.
      • you seem to have ideas on that front and it's a pretty separate piece of code.
      • lucifer
        sure makes sense
      • mayhem
        maybe draw up your thinking on a gist/doc so we can discuss?
      • lucifer
        do we keep typesense around in the background pipeline?
      • mayhem
        possibly. I have a feeling it performs better for fuzzy matching up to 5 characters.
      • let me add some timing to the typesense based search and then we'll have a better idea.
      • reosarevok
        yvanzo, bitmap: I added some descriptions for the different tickets to the schema change draft doc too, btw
      • mayhem
        that will be my first task, I think.
      • lucifer
        makes sense 👍
      • reosarevok
        yvanzo: I also added the description from last year's blog post to your AC ticket, but do check if it's still correct
      • lucifer
        mayhem, do we have a process that periodically rechecks unmatched listens automatically, or is it manual invalidation of some rows?
      • mayhem
        right now I invalidate rows by hand. I just invalidated all no_match matches for 2021.
      • this needs to be automated. not sure how yet.
      • lucifer
        cron job to invalidate listens weekly, newer ones more frequently and older ones less so?
      • mayhem
        that
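The tiered cron policy agreed above might be sketched like this; the concrete tier boundaries are made-up numbers for illustration:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical recheck policy: newer unmatched listens are retried
# more often than older ones.
def recheck_interval(listened_at):
    age = datetime.now(timezone.utc) - listened_at
    if age < timedelta(days=30):
        return timedelta(days=7)     # recent listens: retry weekly
    if age < timedelta(days=365):
        return timedelta(days=30)    # this year: retry monthly
    return timedelta(days=90)        # older: retry quarterly

def due_for_recheck(listened_at, last_checked):
    return datetime.now(timezone.utc) - last_checked >= recheck_interval(listened_at)
```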
      • Dijia
        Hi, I had a problem when submitting my listen record. I have successfully built the development environment, and everything works well except the "listens" page. Every time I open this page, it shows "get /socket.io/?eio=4&transport=polling&t=nz-uqq1 http/1.1" 404 in git. When I tried to submit a record, the number of recent listens increases, but no listen records are shown on the page. I have googled this bug but there seems to be no solution. Does
      • anyone know what to do with that?
      • lucifer
        Dijia: uh yeah, that's a known issue. can you try running `./develop.sh manage update_user_listen_data` ? (that 404 is unrelated)
      • mayhem
        Dijia: hi. if you have errors like these, it is best to paste the error you're getting -- that helps us understand better.
      • unless you're lucifer, who is clearly a mind-reader. :)
      • lucifer
        being the devil comes with its perks :)
      • Dijia
        Ah thank you lucifer!! It works!! Amazing!
      • mayhem
        the dark side does seem to have better perks. sigh.
      • lucifer
        this is the same issue that we discussed last week. it happens in prod, which is why listens don't appear until the cron job runs, but the job never runs in dev, so listens would never appear there. i'll get the fix out soon.
      • mayhem: oh i forgot to tell you. that query downgraded to a full chunk scan again :-(. the test reproducer i had created to compare 11/13 works fine on 13, but the actual query still doesn't. i think i have found another workaround (using a subquery instead of a CTE). opening a TS bug as we speak.
      • akshaaatt
        Hi yellowhatpro! The design looks good to me so far. We can have a figma design for it if you're comfortable. Otherwise also I think you can proceed :)
      • lucifer
        instead of JOINing to the CTE, select from it as a subquery and you get chunk exclusion.
      • mayhem
        oh joy. :(
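The workaround lucifer describes, rewriting a JOIN against a CTE as a subquery so the planner can still exclude chunks, might look roughly like the pair below; the table and column names are assumptions, not the actual ListenBrainz schema:

```python
# Hypothetical illustration: the same filter written two ways. Joining to a
# CTE can defeat TimescaleDB chunk exclusion; the subquery form lets the
# planner prune chunks by listened_at.
CTE_FORM = """
WITH wanted AS (SELECT user_id FROM follows WHERE follower = %(user)s)
SELECT l.* FROM listen l JOIN wanted w ON l.user_id = w.user_id
WHERE l.listened_at > %(since)s;
"""

SUBQUERY_FORM = """
SELECT l.* FROM listen l
WHERE l.user_id IN (SELECT user_id FROM follows WHERE follower = %(user)s)
  AND l.listened_at > %(since)s;
"""
```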
      • lucifer
        >The last successful request was processed 71 days after the first email. The GDPR doesn’t define “without undue delay”, but I’m fairly certain that it requires companies to not stall for over 10 weeks.
      • spotify assumes it to mean 3 months apparently
      • mayhem
        once again, we're among the few who take this seriously.
      • lucifer
        indeed
      • yellowhatpro
        <akshaaatt> "Hi yellowhatpro! The design..." <- Thanks sempaiii.. I will be working on the Figma designs then. (/≧▽≦)/
      • mayhem
        lucifer: alastairp : I guess we're keeping typesense then
      • edit distance of 5 and 32 queries a sec.
      • alastairp
      • mayhem
        pg_trgm can't touch that. just like mchammer can't touch THIS.
      • alastairp
        fuzzy 16/sec
      • mayhem
        16/sec at .6, which is about an edit distance of 2 on average.
      • so, pg_trgm is quite a bit slower, sadly.
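For reference, pg_trgm's similarity score is the share of three-character substrings two strings have in common. A single-string Python approximation (pg_trgm proper pads and splits per word):

```python
# Rough sketch of trigram similarity: Jaccard overlap of 3-char substrings.
# A threshold of ~0.6, as above, roughly corresponds to a couple of edits.
def trigrams(s):
    s = "  " + s.lower() + " "   # pg_trgm pads words with spaces
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)
```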
      • alastairp
        are the postgres memory settings on bono optimal?
      • reosarevok
        mayhem: wow D:
      • ankes
        Hi, I have seen that after Spotify's hiccup last week the listening logs collection in ListenBrainz stopped working for some accounts. I am monitoring a few users for an experiment that I am doing, and after asking them to disconnect / reconnect, still it's not working (I have written about this issue yesterday to the MetaBrainz contact email). Is
      • there any way to check if their ListenBrainz accounts are still linked to Spotify (and that the collection is working properly)?
      • lucifer
        ankes: hi! yeah, if you can share the usernames with us, we can check whether spotify is linked or not. other than that, all data is public, so if you can see listens coming in on the website/api, then it's working.
      • mayhem: oh nice! what about edit distance 2, 3? if typesense is fast there too, then we might as well not do the fuzzy match in pg. also, which version of typesense is this? we probably should upgrade to the latest for more enhancements.
      • ankes
        lucifer thanks! for instance, "draconisfirebolt" was working until last tue, then stopped, and after disconnecting/reconnecting it is still not working. The same goes for "ByeBye", "bigDart" and "Danysanak" (I double-checked with the API)
      • CatQuest
        > whatever is 0 after that and remove it though
      • will this remove tags from search and the like that have literally no hits?
      • it annoys me to no end that misspelled tags I made one second exist forever because you can't permaremove tags
      • zas
        atj: about the ansible role for haproxy, I think we'll need quite a lot of specific settings, but we can use one as a basis. I read that some roles are not 100% compatible with the most recent haproxy versions; we'll likely use one of the most recent versions (2.5.x) because we need some very recent features
      • lucifer
        ankes: all of those disconnected on 8th (probably due to the spotify downtime), and haven't been reconnected since.
      • ankes
        lucifer this is strange because they told me they did it. I will ask them to double-check. thanks!
      • lucifer
        👍, i also disconnected/reconnected my account just to confirm that our part of the workflow is working fine.
      • *just now
      • alastairp: had you tried importing from the pg_dump you made the other day for TS? i am trying to dump my local db (~400 listens) and import it to create a small sample for the TS bug report, but importing is failing.
      • alastairp
        lucifer: I didn't make a pg_dump, we just copied the entire data directory
      • lucifer
        ah ok 👍
      • alastairp
        what error are you seeing?
      • lucifer: I just updated https://github.com/metabrainz/docker-python/pul... with latest python, and am just testing the 3.10 one now too
      • do you want to give it a quick look so that we can merge?
      • lucifer
        i'll reproduce the error and share it.
      • sure will look at it.
      • alastairp: you mean 3.9.10 or 3.10?