#metabrainz

2:41 AM
intrnl_[m] joined the channel

2022-07-14 19545, 2022

2:41 AM
akshaaatt

aerozol: sure, I’ll look into it for you.

2022-07-14 19517, 2022

2:43 AM
akshaaatt

yellowhatpro: definitely, you can also coordinate with aerozol in case you need some help finalising the post

2022-07-14 19552, 2022

2:55 AM
d4rk joined the channel

2022-07-14 19519, 2022

2:58 AM
d4rkie has quit

2022-07-14 19521, 2022

4:36 AM
aerozol

ansh: I accepted your changes to the PR, was there anything else I should do? Otherwise feel free to put it through!

2022-07-14 19544, 2022

4:44 AM
ansh

Thanks aerozol. Now the PR is ready for merge :)

2022-07-14 19539, 2022

5:43 AM
skelly37 joined the channel

2022-07-14 19514, 2022

6:05 AM
d4rk has quit

2022-07-14 19554, 2022

6:05 AM
d4rkie joined the channel

2022-07-14 19512, 2022

6:12 AM
ssam5 joined the channel

2022-07-14 19525, 2022

6:14 AM
ssam has quit

2022-07-14 19525, 2022

6:14 AM
monotux has quit

2022-07-14 19525, 2022

6:14 AM
}8] has quit

2022-07-14 19526, 2022

6:14 AM
genpaku has quit

2022-07-14 19526, 2022

6:14 AM
Pokey has quit

2022-07-14 19526, 2022

6:14 AM
ssam5 is now known as ssam

2022-07-14 19541, 2022

6:17 AM
monotux joined the channel

2022-07-14 19541, 2022

6:17 AM
}8] joined the channel

2022-07-14 19541, 2022

6:17 AM
genpaku joined the channel

2022-07-14 19541, 2022

6:17 AM
Pokey joined the channel

2022-07-14 19522, 2022

6:18 AM
Pokey has quit

2022-07-14 19537, 2022

6:18 AM
Pokey joined the channel

2022-07-14 19510, 2022

6:58 AM
skelly37 has quit

2022-07-14 19535, 2022

6:58 AM
Pratha-Fish

Is it just me, or does this certificate look forged for some reason? https://usercontent.irccloud-cdn.com/file/SYTKNbH…

2022-07-14 19511, 2022

6:59 AM
Pratha-Fish

The logos aren't even placed properly :(

2022-07-14 19559, 2022

7:38 AM
alastairp

hello

2022-07-14 19502, 2022

7:39 AM
alastairp is back

2022-07-14 19506, 2022

7:45 AM
skelly37 joined the channel

2022-07-14 19521, 2022

7:51 AM
skelly37 has quit

2022-07-14 19545, 2022

7:51 AM
skelly37 joined the channel

2022-07-14 19507, 2022

7:54 AM
Etua joined the channel

2022-07-14 19544, 2022

8:06 AM
lucifer

chinmay: oh i see. will fix.

2022-07-14 19523, 2022

8:13 AM
Etua has quit

2022-07-14 19534, 2022

8:13 AM
Etua joined the channel

2022-07-14 19540, 2022

8:17 AM
lucifer

mayhem: alastairp: monkey: do we have a LB meeting later today?

2022-07-14 19556, 2022

8:17 AM
mayhem

std meeting time, yes?

2022-07-14 19503, 2022

8:18 AM
alastairp

sounds good, thanks

2022-07-14 19531, 2022

8:18 AM
Pratha-Fish

alastairp: Good morning!

2022-07-14 19540, 2022

8:18 AM
alastairp

hi Pratha-Fish, how are you?

2022-07-14 19505, 2022

8:19 AM
Pratha-Fish

I am doing well!

2022-07-14 19505, 2022

8:19 AM
Pratha-Fish

The track-MBID checking stuff is going fine too :)

2022-07-14 19512, 2022

8:19 AM
Pratha-Fish

However, I've run into a bottleneck

2022-07-14 19520, 2022

8:19 AM
alastairp

great, do you have code for that available now?

2022-07-14 19526, 2022

8:19 AM
Pratha-Fish

Yep

2022-07-14 19540, 2022

8:19 AM
skelly37

Hello everyone :)

2022-07-14 19544, 2022

8:19 AM
Pratha-Fish

However, it's taking ~18s to check 1 single file!

2022-07-14 19545, 2022

8:19 AM
lucifer

👍

2022-07-14 19500, 2022

8:20 AM
Pratha-Fish

that's just not feasible for the whole MLHD

2022-07-14 19509, 2022

8:20 AM
skelly37

outsidecontext, zas: What time will we meet today?

2022-07-14 19523, 2022

8:20 AM
alastairp

no, that sounds quite slow. however we were testing thousands of files in only 30 seconds or so, right?

2022-07-14 19533, 2022

8:20 AM
alastairp

were you able to benchmark which part was slow? loading/checking/saving?

2022-07-14 19533, 2022

8:21 AM
Pratha-Fish

Yes, currently the checking part is the slowest

2022-07-14 19557, 2022

8:21 AM
Pratha-Fish

I even crosschecked the numbers with our previous tests and it looks just right

2022-07-14 19534, 2022

8:22 AM
alastairp

OK, let's take a look at the code then

2022-07-14 19541, 2022

8:22 AM
Pratha-Fish

right

2022-07-14 19510, 2022

8:23 AM
Pratha-Fish

https://github.com/Prathamesh-Ghatole/MLHD/blob/m…

2022-07-14 19510, 2022

8:23 AM
Pratha-Fish

Here's the code

2022-07-14 19533, 2022

8:23 AM
Pratha-Fish

almost all the elements are ready to be assembled. It's in a jupyter notebook right now for testing, etc

2022-07-14 19502, 2022

8:25 AM
Pratha-Fish

Ah I see

2022-07-14 19503, 2022

8:25 AM
Pratha-Fish

This code isn't working with unique MBIDs!

2022-07-14 19507, 2022

8:25 AM
zas

skelly37, outsidecontext : hey, what about usual meeting time?

2022-07-14 19530, 2022

8:25 AM
outsidecontext

ok, works for me

2022-07-14 19541, 2022

8:25 AM
alastairp

Pratha-Fish: hm, right. but you're right that this still seems to be much longer than I'd expect

2022-07-14 19557, 2022

8:25 AM
Pratha-Fish

That's right

2022-07-14 19506, 2022

8:26 AM
Pratha-Fish

Thankfully I've just realized a few optimizations here

2022-07-14 19530, 2022

8:26 AM
Pratha-Fish

i.e. Taking only unique rows

2022-07-14 19531, 2022

8:26 AM
skelly37

Fine for me

2022-07-14 19519, 2022

8:27 AM
alastairp

I'm loading and running the code now to see if I can recommend something

2022-07-14 19528, 2022

8:27 AM
Pratha-Fish

alastairp: The other faster optimization is just logging the "recording-MBID" that was marked as a track-MBID.

2022-07-14 19528, 2022

8:27 AM
Pratha-Fish

It's more complicated under the hood than it sounds actually

2022-07-14 19532, 2022

8:27 AM
Pratha-Fish

sure

2022-07-14 19526, 2022

8:31 AM
zas

atj: I'd like us to make progress on the ansible crowdsec stuff. What about deploying it on rex & rudi to experiment and improve?

2022-07-14 19522, 2022

8:39 AM
alastairp

Pratha-Fish: just to check - `df_test_positive` is a sample dataframe that you created that includes some track ids and some track redirects, so that you can test the method?

2022-07-14 19556, 2022

8:39 AM
Pratha-Fish

alastairp: yes that's right

2022-07-14 19500, 2022

8:40 AM
alastairp

ok, great

2022-07-14 19558, 2022

8:40 AM
alastairp

Pratha-Fish: it looks to me like this specific pandas operation is really inefficient and we should target it

2022-07-14 19510, 2022

8:41 AM
Pratha-Fish

exactly

2022-07-14 19521, 2022

8:41 AM
alastairp

the first thing that I'd think of in this case is to just use a basic loop + counter. take a look at this:

2022-07-14 19541, 2022

8:41 AM
alastairp

https://www.irccloud.com/pastebin/ohAA0cY4/

2022-07-14 19551, 2022

8:41 AM
alastairp

run that and see how long it takes

2022-07-14 19556, 2022

8:41 AM
Pratha-Fish

on it

2022-07-14 19511, 2022

8:43 AM
Pratha-Fish

Nice that took 11s

2022-07-14 19527, 2022

8:43 AM
alastairp

did you run that in a single cell?

2022-07-14 19537, 2022

8:43 AM
Pratha-Fish

yes

2022-07-14 19543, 2022

8:43 AM
alastairp

put it in 2 - converting the track gids to a set only has to be done once

2022-07-14 19511, 2022
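
The "basic loop + counter" idea, with the set built once and reused, can be sketched roughly as follows. The pastebin contents are not reproduced in this log, so this is only an illustration: `track_gids`, `recording_mbids`, and `count_track_mbids` are hypothetical names, and the random UUID strings stand in for the real `MB_track.gid` column and one file's recording MBIDs.

```python
import uuid

# Hypothetical stand-ins for the MB_track.gid column and one file's recording MBIDs.
track_gids = [str(uuid.uuid4()) for _ in range(100_000)]
recording_mbids = [str(uuid.uuid4()) for _ in range(1_000)] + track_gids[:3]

# Step 1 (run once, e.g. in its own notebook cell): build the lookup set.
track_gid_set = set(track_gids)


# Step 2 (run for every file): a plain loop with a counter.
def count_track_mbids(recids, known_track_gids):
    """Count how many recording MBIDs are actually known track MBIDs."""
    count = 0
    for recid in recids:
        if recid in known_track_gids:
            count += 1
    return count


matches = count_track_mbids(recording_mbids, track_gid_set)
print(matches)
```

Keeping the set construction in a separate step means its cost is paid once, while the per-file check stays cheap.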

8:44 AM
Pratha-Fish

oh right

2022-07-14 19527, 2022

8:44 AM
Pratha-Fish

This time, let's also try doing it on both MB_track and MB_track_redir

2022-07-14 19501, 2022

8:48 AM
Pratha-Fish

All right, so the plain Python test is taking 11s for the set conversion, and 300 microseconds for the test!

2022-07-14 19511, 2022

8:48 AM
alastairp

mmmhm

2022-07-14 19523, 2022

8:48 AM
Pratha-Fish

* For both MB_track and MB_track_redir

2022-07-14 19540, 2022

8:48 AM
alastairp

so, in one way I'm surprised that the "typical" way that you might do it in pandas is so slow

2022-07-14 19543, 2022

8:49 AM
Pratha-Fish

It's probably due to the massive conversion overhead in pandas!

2022-07-14 19545, 2022

8:49 AM
alastairp

but it's also important to work out what our goal is - we could generate the intersection of the 2 dataframes and save it, but we could also do this in 2 phases - 1) quickly look at all items and then if we see something, 2) save the filename for further analysis

2022-07-14 19502, 2022

8:50 AM
alastairp

my guess is that it's because both of these dataframes are effectively lists in pandas, so each lookup is a linear scan

2022-07-14 19525, 2022

8:50 AM
alastairp

try `track_mbid_list = MB_track.gid.tolist()` and use `if recid in track_mbid_list:` instead

2022-07-14 19525, 2022

8:50 AM
Pratha-Fish

alastairp: Yes, that's exactly how I am logging the anomalies

2022-07-14 19525, 2022

8:50 AM
Pratha-Fish

Only the suspected MBID, and its file path

2022-07-14 19548, 2022

8:50 AM
alastairp

and you'll see that this kind of check against a list is _way_ slower

2022-07-14 19517, 2022
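
The two-phase approach described above (scan everything quickly, then record only the suspect MBID and its file path for later analysis) might look something like this. It is a sketch under assumptions: `find_suspect_mbids`, the file paths, and the placeholder IDs are all hypothetical, and a dict of in-memory lists stands in for reading the real MLHD files.

```python
def find_suspect_mbids(files, known_track_gids):
    """Return (file_path, mbid) pairs for recording MBIDs that are known track MBIDs."""
    suspects = []
    for path, recids in files.items():
        # Only unique values need checking; duplicates add no information.
        for recid in set(recids):
            if recid in known_track_gids:
                suspects.append((path, recid))
    return suspects


# Hypothetical data: placeholder IDs instead of real MBIDs.
known_track_gids = {"track-1", "track-2"}
files = {
    "mlhd/a.txt": ["rec-1", "track-1", "rec-2", "track-1"],
    "mlhd/b.txt": ["rec-3"],
}

suspects = find_suspect_mbids(files, known_track_gids)
print(suspects)
```

Only files that produce at least one pair need to be reopened in the second phase.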

8:51 AM
Pratha-Fish

Python sets for the win!

2022-07-14 19539, 2022

8:51 AM
alastairp

well, keep in mind that this is "datastructures for the win", using the correct tool for the job

2022-07-14 19556, 2022

8:51 AM
alastairp

this is O-notation, if you've covered it before in classes

2022-07-14 19506, 2022

8:52 AM
Pratha-Fish

Yes I am aware of it

2022-07-14 19527, 2022

8:52 AM
Pratha-Fish

Now I see why DSA is so important in interviews

2022-07-14 19556, 2022

8:53 AM
alastairp

I just tested `if recid in track_mbid_list:` on the 6 item dataset, and it took about 16 seconds

2022-07-14 19556, 2022

8:53 AM
Pratha-Fish

Looks like the Python set data structure is implemented on top of hash tables. Probably the biggest reason for these ridiculous lookup speeds

2022-07-14 19510, 2022

8:54 AM
alastairp

that's really close to the 20 that we saw with pandas

2022-07-14 19517, 2022

8:54 AM
Pratha-Fish

right

2022-07-14 19527, 2022

8:54 AM
Pratha-Fish

I also modified my pandas code a little, and it's taking ~14s now

2022-07-14 19557, 2022

8:54 AM
alastairp

yes, right. when you have a list [1,2,3,4,5,6] and you want to check `if 8 in mylist` then it has to look at each item in the list. if it's not in the list you have to go through the entire list every single time you check

2022-07-14 19522, 2022

8:55 AM
alastairp

whereas with a hashed datastructure like a set (python dictionary keys work the same way), it's just a hash operation + 1 lookup

2022-07-14 19529, 2022

8:55 AM
alastairp

O(1) vs O(n)

2022-07-14 19534, 2022
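
The O(1) vs O(n) difference is easy to see with a small timing experiment. This is a minimal sketch, not the benchmark run in the channel: the size `n`, the integer items, and the repeat count are arbitrary choices, and absolute timings will vary by machine.

```python
import timeit

n = 200_000
items_list = list(range(n))
items_set = set(items_list)
missing = -1  # not present, so the list check must scan every element

# Membership in a list is a linear scan: O(n) per check.
list_time = timeit.timeit(lambda: missing in items_list, number=20)

# Membership in a set is a hash plus a lookup: O(1) on average.
set_time = timeit.timeit(lambda: missing in items_set, number=20)

print(f"list: {list_time:.4f}s, set: {set_time:.6f}s")
```

The worst case for the list is exactly the common case here: most recording MBIDs are not track MBIDs, so nearly every check walks the whole list.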

8:55 AM
Pratha-Fish

!!!

2022-07-14 19515, 2022

8:56 AM
Pratha-Fish

I still have one doubt though

2022-07-14 19510, 2022

8:57 AM
Pratha-Fish

This old code with pandas took only 2.89s for 381k (unique) rows https://usercontent.irccloud-cdn.com/file/khxUquf…

2022-07-14 19517, 2022

8:57 AM
alastairp

Pratha-Fish: looks like there may be some efficient pandas methods here too: https://stackoverflow.com/a/21175114

2022-07-14 19508, 2022

8:58 AM
Pratha-Fish

alastairp: Wow that one looks fast

2022-07-14 19555, 2022

8:58 AM
Pratha-Fish

Let's try out the intersection method. It's written in numpy, so it's gotta be pretty fast for such a basic operation

2022-07-14 19558, 2022

8:58 AM
alastairp

note that in that microbenchmark it's still faster to convert the series to a set, do the intersection, turn it back into a python list, and then turn that into a Series again (the first item, almost 10% faster than the last one)

2022-07-14 19516, 2022

8:59 AM
alastairp

.intersection is a python set method

2022-07-14 19525, 2022

8:59 AM
alastairp

.intersect1d is a numpy method

2022-07-14 19543, 2022
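
For comparison, the approaches mentioned here can be written side by side. This is only a sketch with tiny made-up Series (`recordings` and `tracks` are hypothetical names); it shows that the methods agree on the result, not which is fastest, which needs to be measured on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for a file's recording MBIDs and the track gid column.
recordings = pd.Series(["a", "b", "c", "d"])
tracks = pd.Series(["c", "d", "e"])

# Python set method: convert both sides to sets and intersect.
via_set = set(recordings).intersection(set(tracks))

# numpy function: returns the sorted, unique common values.
via_numpy = np.intersect1d(recordings.to_numpy(), tracks.to_numpy())

# pandas method: boolean mask of recordings that appear in tracks.
via_isin = recordings[recordings.isin(tracks)]

print(via_set, list(via_numpy), via_isin.tolist())
```

Note that `isin` keeps duplicates and the original order of `recordings`, while the set and numpy versions return unique values only.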

9:00 AM
Pratha-Fish

right, so the method with set is still gonna be faster I assume

2022-07-14 19533, 2022

9:01 AM
Pratha-Fish

Let's check out both of them just in case

2022-07-14 19544, 2022

9:01 AM
alastairp

sure, it'd be good to take a look at them

2022-07-14 19502, 2022

9:02 AM
alastairp

one other thing that I noticed - after you select the data from the database, they are a UUID object:

2022-07-14 19508, 2022

9:02 AM
alastairp

> array([UUID('9b02977e-a03b-4a6b-a9a9-06e722bdcd7a'),

2022-07-14 19543, 2022

9:02 AM
alastairp

and it looks like this object has an "equality" method to compare other UUID objects and strings of UUIDs, but it looks like it's faster if we just treat them as strings

2022-07-14 19519, 2022

9:05 AM
lucifer

mayhem: hi! by any chance, do we have extra MeB tshirts available at office? :)

2022-07-14 19503, 2022

9:06 AM
Pratha-Fish

alastairp: yes, I noticed that one too.

2022-07-14 19503, 2022

9:06 AM
Pratha-Fish

I also tried explicitly loading MLHD data with all UUID columns specified as string

2022-07-14 19503, 2022

9:06 AM
Pratha-Fish

But somehow it gets converted into UUID along the way

2022-07-14 19527, 2022

9:06 AM
yvanzo

O’Moin

2022-07-14 19543, 2022

9:06 AM
Pratha-Fish

I am beyond surprised https://usercontent.irccloud-cdn.com/file/MLvSha6…

2022-07-14 19555, 2022

9:06 AM
alastairp

Pratha-Fish: psycopg2/postgres does this if you select an item which is a uuid column

2022-07-14 19503, 2022

9:07 AM
mayhem

lucifer: yes. Let me check sizes. But we'll be making more for the summit, right monkey?

2022-07-14 19516, 2022

9:07 AM
alastairp

you can force this to text if you do `SELECT gid::text FROM recording`

2022-07-14 19521, 2022

9:07 AM
lucifer

awesome! :DD

2022-07-14 19541, 2022

9:07 AM
Pratha-Fish

alastairp: oh yes, that's probably the reason here

2022-07-14 19510, 2022

9:09 AM
alastairp

Pratha-Fish: OK, I just benchmarked using a UUID object and a string and the difference isn't actually as much as I expected. probably not a big issue

2022-07-14 19519, 2022

9:09 AM
alastairp

object: 3.06 ms ± 12.1 µs per loop

2022-07-14 19525, 2022

9:09 AM
alastairp

string: 3.03 ms ± 7.64 µs per loop

2022-07-14 19516, 2022

9:10 AM
Pratha-Fish

Alright

2022-07-14 19555, 2022
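
One caveat may be worth checking before mixing the two representations. Assuming the values coming back from the database are Python's standard `uuid.UUID` objects, a UUID object does not compare equal to its string form, so a set of strings will never match a UUID object (and vice versa). A quick check, using the UUID quoted earlier in the discussion:

```python
import uuid

gid_text = "9b02977e-a03b-4a6b-a9a9-06e722bdcd7a"
gid_obj = uuid.UUID(gid_text)

# uuid.UUID only compares equal to other UUID objects, not to strings.
obj_equals_text = gid_obj == gid_text
str_equals_text = str(gid_obj) == gid_text

# The same applies to set membership, which relies on hash and equality.
track_gids = {gid_text}
obj_in_set = gid_obj in track_gids
str_in_set = str(gid_obj) in track_gids

print(obj_equals_text, str_equals_text, obj_in_set, str_in_set)
```

Whichever type is chosen, both sides of the lookup need to use the same one, for example by selecting `gid::text` on the database side or calling `str()` on the UUID objects.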

9:16 AM
ansh

alastairp: Hi! I had reviewed CB#442 and now it's ready for merge.

2022-07-14 19556, 2022

9:16 AM
BrainzBot

Adding Guidelines and link to CoC: https://github.com/metabrainz/critiquebrainz/pull…

2022-07-14 19505, 2022

9:17 AM
alastairp

ansh: I saw that, thank you so much!

2022-07-14 19535, 2022

9:17 AM
alastairp

ansh: I'll also take a look at your search PR again, and the other API ones you added (you need the metadata and rating one for BB integration, right?)

2022-07-14 19549, 2022

9:17 AM
ansh

yes

2022-07-14 19553, 2022

9:18 AM
ansh

It was really great reviewing a PR :) Thank you for this opportunity

2022-07-14 19518, 2022

9:20 AM
alastairp

great, I'm glad that you enjoyed it!

2022-07-14 19523, 2022

9:21 AM
Etua has quit