#metabrainz

      • bitmap
        GibusWearingMann: congrats on your achievement
      • bitmap appears courtesy of spotty plane wifi
      • darkstardevx joined the channel
      • GibusWearingMann has quit
      • agatzk has quit
      • agatzk joined the channel
      • agatzk has quit
      • agatzk joined the channel
      • lucifer
        mayhem: a quick question for when you have time, for storing the spotify data in normalized form, do we want to store all data or a subset? https://developer.spotify.com/documentation/web...
      • for instance, there's external_urls, restrictions, available_markets. i don't think we'll need those.
      • there's images also, which we don't need for the cache but i guess might help in speeding up cover art loading on frontend somehow.
      • reosarevok
        lucifer: what external_urls do they usually have, if any? (other than their own)
      • lucifer
        reosarevok: according to the docs and so far what i have seen, only the spotify url.
      • reosarevok
        Boooring
      • lucifer
        indeed.
      • aerozol
        bitmap: woo, hope the flight goes (or went?) smoothly
      • lucifer
        reosarevok: however, there is an undocumented external_ids key in the response which has UPC ids.
      • mayhem
        Moin
      • bitmap landed about half an hour ago. Is anyone at the office?
      • aerozol
        Buenos dias! Not us sorry
      • mayhem
      • aerozol: atj_mb akshaaatt : these are the human tower builders I was talking about
      • reosarevok
        mayhem: can you change the name for a donation? (see support) Alternatively, if I have access to do that, can you show me how so I don't have to pester you? :)
      • bitmap
        hola I’m aboard the aerobus, I’ll head to the office
      • mayhem
        moin bitmap
      • not sure if anyone is at the office yet. you could head to the other airbnb and see if aerozol akshaaatt and atj_mb will let you take a shower there until people head to the office.
      • bitmap
        Oh sounds good, then I’ll ping the three of them when I’m outside the Airbnb
      • mayhem
        good plan. I'll be heading to the office within the hour.
      • reosarevok
        mayhem: I understand the plan is to have zoom streaming for the 14:00 - 19:30 section each day?
      • (just figuring out when I should check in)
      • mayhem
        I've not really been part of the streaming/zoom discussion. all I know is that I am lending my webcam to the effort.
      • reosarevok
        Ok, sure, but I mean that that's the actual "meeting" part, and the rest is hacking and socialising, from the schedule, right? :)
      • Should be doable, if so :)
      • mayhem
        as for tomorrow, we're going to have a sooper special project for the day. one we all need to get through, whether we like it or not. OAuth.
      • reosarevok
        AOuch
      • mayhem
        it would be good if you could be around for that and help the MB team
      • reosarevok
        Let me know the estimated time and I'll do my best :)
      • mayhem
        BCN summit-like daytime. you know what that means. :)
      • reosarevok: that donation has been updated
      • reosarevok
        Thanks!
      • Mailed them back then
      • aerozol
        Hey bitmap, sounds good, see you soon!
      • mayhem: those human tower photos are insane!!
      • mayhem
        right?
      • and on university campuses you can see them practicing over lunch.
      • aerozol
        The photos of one collapsing are terrifying. Would definitely watch though!
      • mayhem
        which is far different from the computer sci geeks watching the aggies rope wooden bulls in the parking lot with a real rope.
      • which is what kept happening at my uni, lol
      • aerozol
        Hah, a bit less exciting. Real rope though, ooh
      • bitmap
        aerozol: akshaaatt: atj_mb: I’m outside Carrer de Nàpols, 98, I think
      • place with a bunch of scaffolding?
      • mayhem
        yep
      • Press the Atico 3 button to wake them up. :)
      • bitmap
        Button pressed!
      • yvanzo
        Is there a universal power adapter (19V, 2.1A) I could borrow from BCNers? Or I will buy one for my laptop when I arrive in 6h from now (I know the best shops already).
      • mayhem
        yvanzo: I dont have one of those, everything I have is now USB powered. :(
      • q3lont joined the channel
      • agatzk has quit
      • agatzk joined the channel
      • alastairp
        yvanzo: what kind of plug does your laptop take?
      • I have a 20v 1.5A old thinkpad round connector
      • and a 20V 1.3A lenovo new-style thinkpad square connector
      • mayhem has arrived at the office
      • slow start here, but I'll turn up soonish
      • q3lont has quit
      • yvanzo
        alastairp: round dc jack, tip pin size: 5.5mm * 2.5mm
      • (40w)
      • alastairp
        sorry, looks like mine isn't that size. I'll bring it anyway in case you want to try
      • mayhem
        lucifer: regarding what we want to store, I wonder if we should store the key things we really care about as columns, but still have a JSONB column for all the other fields.
      • and the question about markets is tricky: do we know that the cross linking they mentioned works as expected?
      • q3lont joined the channel
      • I think storing external links is also useful. stuff that we could mine...
      • lucifer
        mayhem: i see, how about keep everything as jsonb in the existing cache table but build normalized tables of only the stuff we need. so those normalized tables won't have any jsonb columns but the original table we have currently will.
      • uh let me write down a schema to clear it up.
      • alastairp
        I'm on my way. will stop and pick up some tea. does anyone drink it/have preferences? akshaaatt aerozol bitmap?
      • lucifer
        oh and about those external links, the album tracks endpoint only returns isrc/ean/upc for albums but not for individual tracks. the tracks lookup endpoint does.
      • so we'll probably have to do some more lookups to get all the external ids.
      • same for genres. those are not returned in all endpoints only some.
      • petitminion joined the channel
      • agatzk has quit
      • agatzk joined the channel
      • q3lont has quit
      • petitminion has quit
      • petitminion_ joined the channel
      • mayhem: actually thinking more, yes makes sense to have just one JSONB column for extra things there.
      • i think we can add columns for the data we need in building the index, rest all goes to jsonb column for now. we can add more columns as we need them in future. usecases like reading external ids only need a read of the jsonb column without joins so should be fast regardless.
      • and we won't need to do them on a very regular basis.
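A minimal sketch of the split discussed above: promote the fields needed for the index to their own columns and keep everything else in one JSON payload. The function name `split_album`, the `PROMOTED` field list and the sample payload are hypothetical; the real column set is whatever the index build turns out to need.

```python
import json

# Hypothetical choice of fields promoted to real columns; everything else
# stays in the JSONB-style catch-all payload.
PROMOTED = ("id", "name", "release_date")


def split_album(album: dict) -> tuple[dict, str]:
    """Return (column values, JSON string for the catch-all column)."""
    columns = {key: album.get(key) for key in PROMOTED}
    extra = {key: value for key, value in album.items() if key not in PROMOTED}
    # sort_keys keeps the serialized form stable between runs.
    return columns, json.dumps(extra, sort_keys=True)


# Made-up payload shaped like a trimmed album response.
album = {
    "id": "abc123",
    "name": "Example Album",
    "release_date": "2020-01-01",
    "external_ids": {"upc": "000000000000"},
    "available_markets": ["ES", "NZ"],
}

columns, extra_json = split_album(album)
# Reading an external id later only touches the JSON column, no joins.
upc = json.loads(extra_json)["external_ids"]["upc"]
```

Promoting a field later would mean adding a column and backfilling it from the JSON payload, so nothing is lost by starting with a small column set.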
      • petitminion joined the channel
      • petitminion_ has quit
      • petitminion_ joined the channel
      • petitminion has quit
      • mayhem
        Lucifer exactly that.
      • lucifer
      • mayhem: ^ to normalize our existing cache data
      • mayhem
        woah.
      • thats pretty well insane, lucifer . :)
      • have you tried it yet?
      • there is the tmp_sp_metadata table on gaga that you could try it on.
      • lucifer
        i am not sure how spotify handles artist credits for now, i have put a unique index on spotify_id, name. we'll be able to detect issues but will have to rebuild in case we find some. i guess that's fine.
      • yes tried it there. takes <1min to execute on it.
      • so ~4 hours on entire cache i guess.
      • mayhem
        lets try it!
      • lucifer
        👍
      • petitminion_ has quit
      • this is the schema for now. does that look fine?
      • petitminion_ joined the channel
      • mayhem
        for a first try, looks pretty good.
      • the artist name might need some thoughts.
      • possibly an array of artist names?
      • lucifer
        do you mean storing an array of artist name in the artist table with each id ? or an array of artist name in track or album table?
      • yvanzo
        alastairp, mayhem: Thanks, issue resolved, I bought an adapter on the way.
      • reosarevok
        mayhem: since you're apparently not on #musicbrainz, someone posted "helloooo, I just tested the recommendation endpoint and wanted to let you know I think results are greeeeat o/ its amazing :D thx a lot :) " :)
      • Oh. That'd be petitminion_
      • mayhem
        Heh, lol.
      • lucifer
        alastairp: hi! any progress on CB PRs?
      • alastairp
        lucifer: hi! we've just finished lunch, lol
      • so that's a no
      • lucifer
        ah nice! :D
      • np
      • Pratha-Fish
        alastairp: h e n l o
      • I have a little update.
      • The script will be completed in no time, but it might be a lot slower than expected. To counter that, I am trying to rack my brain to optimize the cleanup functions. But if you could recommend any options, it would be pretty helpful.
      • Here's what I've tried:
      • - numpy.vectorize (slower than pandas for some reason lol)
      • - numba.vectorize, numba.jit -> apparently, can't even process a simple dictionary checking function :|
      • - Currently trying out Dask, modin for speedups
      • CC: lucifer
      • lucifer
        what does the cleanup function do?
      • can you share its code?
      • alastairp
        Pratha-Fish: yeah, let's take a look at the code first
      • mayhem
        lucifer: I guess an array of (artist name, artist id) would be best.
      • Pratha-Fish
        lucifer, alastairp: pushing the code to the repo..
      • alastairp
        Pratha-Fish: my initial feedback on this is that it's almost certainly not as slow as you think
      • and if it does have a problem, it's probably a really simple fix
      • Pratha-Fish
        alastairp: I really hope so
      • Currently, it's taking ~18 - 24s to process 105k rows (excluding r/w times)
      • alastairp
        I bet we can get that 10x faster
      • Pratha-Fish
        note that it's 105k non-unique rows. Maybe making it unique could help, but mapping the results back to the dataframe could be another bottleneck
      • alastairp: epic
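The "make it unique, then map back" idea mentioned above can be sketched in plain Python. The `clean` function here is a hypothetical stand-in for the real cleanup step, and the sample rows are made up; this is not the notebook's code.

```python
def clean(value: str) -> str:
    """Stand-in for an expensive per-value cleanup step (hypothetical)."""
    return value.strip().lower()


def clean_column(values: list[str]) -> list[str]:
    # Run the expensive function once per distinct value...
    cleaned = {value: clean(value) for value in set(values)}
    # ...then map the results back with cheap dictionary lookups.
    return [cleaned[value] for value in values]


rows = [" Foo", "bar ", " Foo", "BAR", " Foo"]
result = clean_column(rows)
```

With a pandas column the same lookup table could be applied through `Series.map`, which accepts a dict. How much this helps depends on how many of the 105k rows are duplicates, so it is worth measuring before assuming the mapping step is a bottleneck.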
      • lucifer
        mayhem: that can be built during the join query to fetch the data imo.
      • Pratha-Fish
        alastairp: lucifer:
      • sorry can't guide you to the exact lines of code since it's a notebook. but go to cell "IN [8]" and see the "process_df()" function
      • mayhem
        lucifer: the only thing we should make sure is that artist names are not conflated together.
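A rough sketch of building the array of (artist name, artist id) at fetch time, keeping each name paired with its own id so that names are not conflated. The row layout, including the `position` column used for ordering, is an assumption for illustration and not the actual schema.

```python
from collections import defaultdict

# Hypothetical rows as a join of track-artist links and artists might return
# them: (track_id, position, artist_id, artist_name).
rows = [
    ("t1", 1, "a2", "Second Artist"),
    ("t1", 0, "a1", "First Artist"),
    ("t2", 0, "a1", "First Artist"),
]


def build_artist_credits(rows):
    """Group joined rows into {track_id: [(artist_name, artist_id), ...]}."""
    by_track = defaultdict(list)
    # Sort by track and credit position so each array keeps credit order.
    for track_id, _position, artist_id, artist_name in sorted(
        rows, key=lambda row: (row[0], row[1])
    ):
        by_track[track_id].append((artist_name, artist_id))
    return dict(by_track)


artist_credits = build_artist_credits(rows)
```

Because each entry carries the id alongside the name, two artists who share a name still end up as distinct entries.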
      • Pratha-Fish
        note: here the io.replace function simply does the following:
      • ```
      • try:
      • return some_table.at[mbid, 'new']
      • except KeyError: