almost all the elements are ready to be assembled. It's in a Jupyter notebook right now for testing, etc
Ah I see
This code isn't working with unique MBIDs!
zas
skelly37, outsidecontext : hey, what about usual meeting time?
outsidecontext
ok, works for me
alastairp
Pratha-Fish: hm, right. but you're right that this still seems to be much longer than I'd expect
Pratha-Fish
That's right
Thankfully I've just realized a few optimizations here
i.e. Taking only unique rows
skelly37
Fine for me
alastairp
I'm loading and running the code now to see if I can recommend something
Pratha-Fish
alastairp: The other faster optimization is just logging the "recording-MBID" that was marked as a track-MBID.
It's more complicated under the hood than it sounds actually
sure
zas
atj: I'd like us to progress on the ansible crowdsec stuff; what about deploying it on rex & rudi to experiment and improve?
alastairp
Pratha-Fish: just to check - `df_test_positive` is a sample dataframe that you created that includes some track ids and some track redirects, so that you can test the method?
Pratha-Fish
alastairp: yes that's right
alastairp
ok, great
Pratha-Fish: it looks to me like this specific pandas operation is really inefficient and we should target it
Pratha-Fish
exactly
alastairp
the first thing that I'd think of in this case is to just use a basic loop + counter. take a look at this:
put it in 2 - converting the track gids to a set only has to be done once
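A minimal sketch of the loop + counter approach being suggested here; the dataframe and column names are illustrative, not the actual MLHD code:

```python
# Sketch: check recording ids against track gids with a set built ONCE.
# `MB_track` and `rec_ids` are hypothetical stand-ins for the real data.
import pandas as pd

MB_track = pd.DataFrame({"gid": ["a1", "b2", "c3"]})
rec_ids = ["b2", "x9", "c3"]

# Convert the track gids to a set a single time, up front
track_gids = set(MB_track["gid"])

matches = 0
for rec_id in rec_ids:
    if rec_id in track_gids:  # O(1) membership test per id
        matches += 1

print(matches)  # → 2
```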
Pratha-Fish
oh right
This time, let's also try doing it on both MB_track and MB_track_redir
All right, so the plain Python test is taking 11 s for the set conversion, and 300 microseconds for the lookup test!
alastairp
mmmhm
Pratha-Fish
* For both MB_track and MB_track_redir
alastairp
so, in one way I'm surprised that the "typical" way that you might do it in pandas is so slow
Pratha-Fish
It's probably due to the massive conversion overhead in pandas!
alastairp
but it's also important to work out what our goal is - we could generate the intersection of the 2 dataframes and save it, but we could also do this in 2 phases - 1) quickly look at all items and then if we see something, 2) save the filename for further analysis
my guess is that it's because both of these dataframes in pandas are lists
try `track_mbid_list = MB_track.gid.tolist()` and use `if recid in track_mbid_list:` instead
Pratha-Fish
alastairp: Yes, that's exactly how I am logging the anomalies
Only the suspected MBID and its file path
alastairp
and you'll see that this kind of check against a list is _way_ slower
Pratha-Fish
Python sets for the win!
alastairp
well, keep in mind that this is "datastructures for the win", using the correct tool for the job
this is big-O notation, if you've covered it before in classes
Pratha-Fish
Yes I am aware of it
Now I see why DSA is so important in interviews
alastairp
I just tested `if recid in track_mbid_list:` on the 6 item dataset, and it took about 16 seconds
Pratha-Fish
Looks like the Python set data structure is implemented with hash tables. Probably the biggest reason for these ridiculous lookup speeds
alastairp
that's really close to the 20 that we saw with pandas
Pratha-Fish
right
I also modified my pandas code a little, and it's taking ~14s now
alastairp
yes, right. when you have a list [1,2,3,4,5,6] and you want to check `if 8 in mylist` then it has to look at each item in the list. if it's not in the list you have to go through the entire list every single time you check
whereas with a hashed datastructure like a set (python dictionary keys work the same way), it's just a hash operation + 1 lookup
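A quick microbenchmark illustrating the list-scan vs hashed-lookup difference described above (sizes and names are made up for the example):

```python
# Illustrative timing: membership test on a list vs a set.
import timeit

items = [f"id-{i}" for i in range(100_000)]
as_list = list(items)
as_set = set(items)

probe = "not-present"  # worst case: the list scan visits every element

t_list = timeit.timeit(lambda: probe in as_list, number=100)
t_set = timeit.timeit(lambda: probe in as_set, number=100)

print(f"list: {t_list:.4f}s  set: {t_set:.6f}s")
```

The set lookup should come out orders of magnitude faster, since a miss on the list has to compare against all 100,000 items every time.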
Pratha-Fish
Let's try out the intersection method. It's written in numpy, so it's gotta be pretty fast for such a basic operation
alastairp
note that in that microbenchmark it's still faster to convert the series to a set, do the intersection, turn it back into a python list, and then turn that into a Series again (the first item, almost 10% faster than the last one)
.intersection is a python set method
.intersect1d is a numpy method
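For reference, the two intersection routes side by side on a tiny sample (the data here is invented, just to show the shapes):

```python
# Route 1: set.intersection, then back to a Series.
# Route 2: np.intersect1d, which returns a sorted unique array.
import numpy as np
import pandas as pd

a = pd.Series(["a1", "b2", "c3", "d4"])
b = pd.Series(["b2", "d4", "e5"])

set_result = pd.Series(sorted(set(a) & set(b)))
np_result = pd.Series(np.intersect1d(a, b))

print(set_result.tolist())  # → ['b2', 'd4']
```

Both give the same values; the difference is purely in how the lookup/dedup work is done under the hood.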
Pratha-Fish
right, so the method with set is still gonna be faster I assume
Let's check out both of them just in case
alastairp
sure, it'd be good to take a look at them
one other thing that I noticed - after you select the data from the database, they are a UUID object:
and it looks like this object has an "equality" method to compare other UUID objects and strings of UUIDs, but it looks like it's faster if we just treat them as strings
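One thing worth noting with the stdlib `uuid` module: a `UUID` object does not actually compare equal to its string form, which is another reason normalizing everything to strings up front is the safer (and cheaper) comparison:

```python
# UUID objects only compare equal to other UUID objects,
# so comparisons against string ids should go through str().
import uuid

u = uuid.UUID("12345678-1234-5678-1234-567812345678")
s = "12345678-1234-5678-1234-567812345678"

print(u == s)       # → False: UUID vs plain string never matches
print(str(u) == s)  # → True: normalize to strings before comparing
```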
lucifer
mayhem: hi! by any chance, do we have extra MeB t-shirts available at the office? :)
Pratha-Fish
alastairp: yes, I noticed that one too.
I also tried explicitly loading MLHD data with all UUID columns specified as string
But somehow it gets converted into UUID along the way
ansh: I'll also take a look at your search PR again, and the other API ones you added (you need the metadata and rating one for BB integration, right?)
ansh
yes
It was really great reviewing a PR :) Thank you for this opportunity