#metabrainz


      • intrnl_[m] joined the channel
      • akshaaatt
        aerozol: sure, I’ll look into it for you.
      • yellowhatpro: definitely, you can also coordinate with aerozol in case you need some help finalising the post
      • d4rk joined the channel
      • d4rkie has quit
      • aerozol
        ansh: I accepted your changes to the PR, was there anything else I should do? Otherwise feel free to put it through!
      • ansh
        Thanks aerozol. Now the PR is ready for merge :)
      • skelly37 joined the channel
      • d4rk has quit
      • d4rkie joined the channel
      • ssam5 joined the channel
      • ssam has quit
      • monotux has quit
      • }8] has quit
      • genpaku has quit
      • Pokey has quit
      • ssam5 is now known as ssam
      • monotux joined the channel
      • }8] joined the channel
      • genpaku joined the channel
      • Pokey joined the channel
      • Pokey has quit
      • Pokey joined the channel
      • skelly37 has quit
      • Pratha-Fish
        Is it just me or does this certificate look forged for some reason https://usercontent.irccloud-cdn.com/file/SYTKN...
      • The logos aren't even placed properly :(
      • alastairp
        hello
      • alastairp is back
      • skelly37 joined the channel
      • skelly37 has quit
      • skelly37 joined the channel
      • Etua joined the channel
      • lucifer
        chinmay: oh i see. will fix.
      • Etua has quit
      • Etua joined the channel
      • mayhem: alastairp: monkey: do we have a LB meeting later today?
      • mayhem
        std meeting time, yes?
      • alastairp
        sounds good, thanks
      • Pratha-Fish
        alastairp: Good morning!
      • alastairp
        hi Pratha-Fish, how are you?
      • Pratha-Fish
        I am doing well!
      • The track-MBID checking stuff is going fine too :)
      • However, I've run into a bottleneck
      • alastairp
        great, do you have code for that available now?
      • Pratha-Fish
        Yep
      • skelly37
        Hello everyone :)
      • Pratha-Fish
        However, it's taking ~18s to check a single file!
      • lucifer
        👍
      • Pratha-Fish
        that's just not feasible for the whole MLHD
      • skelly37
        outsidecontext, zas: Which time will we meet today?
      • alastairp
        no, that sounds quite slow. however we were testing thousands of files in only 30 seconds or so, right?
      • were you able to benchmark which part was slow? loading/checking/saving?
      • Pratha-Fish
        Yes, currently the checking part is the slowest
      • I even crosschecked the numbers with our previous tests and it looks just right
      • alastairp
        OK, let's take a look at the code then
      • Pratha-Fish
        right
      • Here's the code
      • almost all the elements are ready to be assembled. It's in a jupyter notebook right now for testing, etc
      • Ah I see
      • This code isn't working with unique MBIDs!
      • zas
        skelly37, outsidecontext : hey, what about usual meeting time?
      • outsidecontext
        ok, works for me
      • alastairp
        Pratha-Fish: hm, right. but you're right that this still seems to be much longer than I'd expect
      • Pratha-Fish
        That's right
      • Thankfully I've just realized a few optimizations here
      • i.e. Taking only unique rows
      • skelly37
        Fine for me
      • alastairp
        I'm loading and running the code now to see if I can recommend something
      • Pratha-Fish
        alastairp: The other faster optimization is just logging the "recording-MBID" that was marked as a track-MBID.
      • It's more complicated under the hood than it sounds actually
      • sure
      • zas
        atj: I'd like us to make progress on the ansible crowdsec stuff. What about deploying it on rex & rudi to experiment and improve?
      • alastairp
        Pratha-Fish: just to check - `df_test_positive` is a sample dataframe that you created that includes some track ids and some track redirects, so that you can test the method?
      • Pratha-Fish
        alastairp: yes that's right
      • alastairp
        ok, great
      • Pratha-Fish: it looks to me like this specific pandas operation is really inefficient and we should target it
      • Pratha-Fish
        exactly
      • alastairp
        the first thing that I'd think of in this case is to just use a basic loop + counter. take a look at this:
      • run that and see how long it takes
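The snippet pasted at this point was not captured in the log. A minimal sketch of a basic loop + counter check might look like the following, assuming `track_gids` stands in for `MB_track.gid` and `recording_ids` for the MBIDs read from one MLHD file (both names and all values are hypothetical):

```python
def count_track_mbids(recording_ids, track_gid_set):
    """Count how many of the given recording IDs are really track MBIDs."""
    count = 0
    for recid in recording_ids:
        # Membership test against a set is a hash lookup, not a scan
        if recid in track_gid_set:
            count += 1
    return count


# Hypothetical sample data standing in for the real tables
track_gids = ["aaa", "bbb", "ccc"]
recording_ids = ["bbb", "zzz", "ccc", "bbb"]

track_gid_set = set(track_gids)  # build the set once, reuse it for every file
matches = count_track_mbids(recording_ids, track_gid_set)
print(matches)
```

Building the set outside the function keeps the one-off conversion cost separate from the per-file check.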
      • Pratha-Fish
        on it
      • Nice that took 11s
      • alastairp
        did you run that in a single cell?
      • Pratha-Fish
        yes
      • alastairp
        put it in 2 - converting the track gids to a set only has to be done once
      • Pratha-Fish
        oh right
      • This time, let's also try doing it on both MB_track and MB_track_redir
      • All right, so the plain Python test is taking 11s for the set conversion, and 300 microseconds for the test!
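The split between the one-off conversion and the check itself can be measured separately, mirroring the two notebook cells. This is only a sketch: the random UUID strings below are hypothetical stand-ins for `MB_track.gid`, and the timings it prints will differ from the 11s / 300 microseconds reported above.

```python
import time
import uuid

# Hypothetical stand-in for MB_track.gid: 200k random UUID strings
track_gids = [str(uuid.uuid4()) for _ in range(200_000)]
# 50 IDs that are known track gids plus 50 that are not
recording_ids = track_gids[:50] + [str(uuid.uuid4()) for _ in range(50)]

# "Cell 1": one-off conversion to a set
start = time.perf_counter()
track_gid_set = set(track_gids)
build_seconds = time.perf_counter() - start

# "Cell 2": the actual check, which reuses the set built above
start = time.perf_counter()
hits = sum(1 for recid in recording_ids if recid in track_gid_set)
check_seconds = time.perf_counter() - start

print(f"set build: {build_seconds:.4f}s, check: {check_seconds:.6f}s, hits: {hits}")
```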
      • alastairp
        mmmhm
      • Pratha-Fish
        * For both MB_track and MB_track_redir
      • alastairp
        so, in one way I'm surprised that the "typical" way that you might do it in pandas is so slow
      • Pratha-Fish
        It's probably due to the massive conversion overhead in pandas!
      • alastairp
        but it's also important to work out what our goal is - we could generate the intersection of the 2 dataframes and save it, but we could also do this in 2 phases - 1) quickly look at all items and then if we see something, 2) save the filename for further analysis
      • my guess is that it's slow because both of these dataframes in pandas behave like lists
      • try `track_mbid_list = MB_track.gid.tolist()` and use `if recid in track_mbid_list:` instead
      • Pratha-Fish
        alastairp: Yes, that's exactly how I am logging the anomalies
      • Only the suspected MBID, and its file path
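The two-phase idea (scan quickly, then record only the suspected MBID and its file path for later analysis) could be sketched as below. The function name, the in-memory `files` mapping, and the paths are all hypothetical; the real code would read MLHD files from disk.

```python
def find_suspect_mbids(files, track_gid_set):
    """Phase 1: collect (mbid, file_path) pairs where a recording MBID is really a track MBID."""
    anomalies = []
    for file_path, recording_ids in files.items():
        # Only unique IDs per file need checking
        for recid in set(recording_ids):
            if recid in track_gid_set:
                anomalies.append((recid, file_path))
    # Sort so the log is stable between runs
    return sorted(anomalies)


# Hypothetical sample: file path -> recording MBIDs found in that file
files = {
    "mlhd/a.txt": ["r1", "t1", "r2", "t1"],
    "mlhd/b.txt": ["r3", "r4"],
    "mlhd/c.txt": ["t2"],
}
track_gid_set = {"t1", "t2"}

anomalies = find_suspect_mbids(files, track_gid_set)
for mbid, path in anomalies:
    print(mbid, path)
```

Phase 2 would then reopen only the files listed in `anomalies`.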
      • alastairp
        and you'll see that this kind of check against a list is _way_ slower
      • Pratha-Fish
        Python sets for the win!
      • alastairp
        well, keep in mind that this is "datastructures for the win", using the correct tool for the job
      • this is big-O notation, if you've covered it before in classes
      • Pratha-Fish
        Yes I am aware of it
      • Now I see why DSA is so important in interviews
      • alastairp
        I just tested `if recid in track_mbid_list:` on the 6 item dataset, and it took about 16 seconds
      • Pratha-Fish
        Looks like Python's set data structure is implemented with hash tables. Probably the biggest reason for these ridiculous lookup speeds
      • alastairp
        that's really close to the 20 that we saw with pandas
      • Pratha-Fish
        right
      • I also modified my pandas code a little, and it's taking ~14s now
      • alastairp
        yes, right. when you have a list [1,2,3,4,5,6] and you want to check `if 8 in mylist` then it has to look at each item in the list. if it's not in the list you have to go through the entire list every single time you check
      • whereas with a hashed datastructure like a set (python dictionary keys work the same way), it's just a hash operation + 1 lookup
      • O(1) vs O(n)
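A small benchmark can make the O(1) vs O(n) difference visible. This sketch uses integers rather than MBIDs and looks up a value that is never present, which is the worst case for the list because every element gets compared; the sizes and repeat count are arbitrary choices.

```python
import timeit

items = list(range(100_000))
as_list = items
as_set = set(items)
missing = -1  # never present, so the list must be scanned end to end

# Repeat each lookup 100 times to get a measurable duration
list_time = timeit.timeit(lambda: missing in as_list, number=100)
set_time = timeit.timeit(lambda: missing in as_set, number=100)

print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```

Both containers give the same answer; only the amount of work per lookup differs.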
      • Pratha-Fish
        !!!
      • I still have one doubt though
      • This old code with pandas took only 2.89s for 381k (unique) rows https://usercontent.irccloud-cdn.com/file/khxUq...
      • alastairp
        Pratha-Fish: looks like there may be some efficient pandas methods here too: https://stackoverflow.com/a/21175114
      • Pratha-Fish
        alastairp: Wow that one looks fast
      • Let's try out the intersection method. It's written in numpy, so it's gotta be pretty fast for such a basic operation
      • alastairp
        note that in that microbenchmark it's still faster to convert the series to a set, do the intersection, turn it back into a python list, and then turn that into a Series again (the first item, almost 10% faster than the last one)
      • .intersection is a python set method
      • .intersect1d is a numpy method
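As a rough sketch of the set-based variant, the round trip is: convert one column to a set, intersect it with the other, and turn the result back into a list. The sample values are hypothetical, and the numpy `intersect1d` route is left out to keep the example stdlib-only.

```python
# Hypothetical sample values standing in for the two columns
recording_ids = ["r1", "t1", "r2", "t2", "t1"]
track_gids = ["t1", "t2", "t3"]

# set.intersection accepts any iterable, so only one side needs converting
common = sorted(set(recording_ids).intersection(track_gids))
print(common)
```

The result holds each shared ID once, and `sorted` gives it a predictable order before it is turned back into a list or a Series.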
      • Pratha-Fish
        right, so the method with set is still gonna be faster I assume
      • Let's check out both of them just in case
      • alastairp
        sure, it'd be good to take a look at them
      • one other thing that I noticed - after you select the data from the database, they are a UUID object:
      • > array([UUID('9b02977e-a03b-4a6b-a9a9-06e722bdcd7a'),
      • and it looks like this object has an "equality" method to compare other UUID objects and strings of UUIDs, but it looks like it's faster if we just treat them as strings
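One thing worth verifying before mixing types: with the standard-library `uuid.UUID` class, comparing a UUID object directly with a plain string evaluates to `False`, so a set of UUID objects would silently miss string lookups. A minimal sketch, using the gid shown above and normalising everything to strings:

```python
import uuid

gid = uuid.UUID("9b02977e-a03b-4a6b-a9a9-06e722bdcd7a")
gid_text = "9b02977e-a03b-4a6b-a9a9-06e722bdcd7a"

# Stdlib UUID equality only recognises other UUID objects
mixed_equal = (gid == gid_text)
# Converting to str first makes the comparison behave as intended
text_equal = (str(gid) == gid_text)

# Keep one type throughout: here, a set of strings
track_gid_set = {str(gid)}
found = gid_text in track_gid_set

print(mixed_equal, text_equal, found)
```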
      • lucifer
        mayhem: hi! by any chance, do we have extra MeB tshirts available at office? :)
      • Pratha-Fish
        alastairp: yes, I noticed that one too.
      • I also tried explicitly loading MLHD data with all UUID columns specified as string
      • But somehow it gets converted into UUID along the way
      • yvanzo
        O’Moin
      • alastairp
        Pratha-Fish: psycopg2/postgres does this if you select an item which is a uuid column
      • mayhem
        lucifer: yes. Let me check sizes. But we'll be making more for the summit, right monkey?
      • alastairp
        you can force this to text if you do `SELECT gid::text FROM recording`
      • lucifer
        awesome! :DD
      • Pratha-Fish
        alastairp: oh yes, that's probably the reason here
      • alastairp
        Pratha-Fish: OK, I just benchmarked using a UUID object and a string and the difference isn't actually as much as I expected. probably not a big issue
      • object: 3.06 ms ± 12.1 µs per loop
      • string: 3.03 ms ± 7.64 µs per loop
      • Pratha-Fish
        Alright
      • ansh
        alastairp: Hi! I had reviewed CB#442 and now it's ready for merge.
      • BrainzBot
        Adding Guidelines and link to CoC: https://github.com/metabrainz/critiquebrainz/pu...
      • alastairp
        ansh: I saw that, thank you so much!
      • ansh: I'll also take a look at your search PR again, and the other API ones you added (you need the metadata and rating one for BB integration, right?)
      • ansh
        yes
      • It was really great reviewing a PR :) Thank you for this opportunity
      • alastairp
        great, I'm glad that you enjoyed it!
      • Etua has quit