almost all the elements are ready to be assembled. It's in a Jupyter notebook right now for testing, etc
Ah I see
This code isn't working with unique MBIDs!
zas
skelly37, outsidecontext : hey, what about usual meeting time?
outsidecontext
ok, works for me
alastairp
Pratha-Fish: hm, right. but you're right that this still seems to be much longer than I'd expect
Pratha-Fish
That's right
Thankfully I've just realized a few optimizations here
i.e. Taking only unique rows
skelly37
Fine for me
alastairp
I'm loading and running the code now to see if I can recommend something
Pratha-Fish
alastairp: The other faster optimization is just logging the "recording-MBID" that was marked as a track-MBID.
It's more complicated under the hood than it sounds actually
sure
zas
atj: I'd like us to progress on the ansible crowdsec stuff; what about deploying it on rex & rudi to experiment and improve?
alastairp
Pratha-Fish: just to check - `df_test_positive` is a sample dataframe that you created that includes some track ids and some track redirects, so that you can test the method?
Pratha-Fish
alastairp: yes that's right
alastairp
ok, great
Pratha-Fish: it looks to me like this specific pandas operation is really inefficient and we should target it
Pratha-Fish
exactly
alastairp
the first thing that I'd think of in this case is to just use a basic loop + counter. take a look at this:
put it in 2 - converting the track gids to a set only has to be done once
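A minimal sketch of the loop + counter approach being suggested here; the dataframe and column names are illustrative, not the actual MLHD code:

```python
# Sketch: check recording ids against track gids with a set built ONCE.
# `MB_track` and `rec_ids` are hypothetical stand-ins for the real data.
import pandas as pd

MB_track = pd.DataFrame({"gid": ["a1", "b2", "c3"]})
rec_ids = ["b2", "x9", "c3"]

# Convert the track gids to a set a single time, up front
track_gids = set(MB_track["gid"])

matches = 0
for rec_id in rec_ids:
    if rec_id in track_gids:  # O(1) membership test per id
        matches += 1

print(matches)  # → 2
```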
Pratha-Fish
oh right
This time, let's also try doing it on both MB_track and MB_track_redir
All right, so the plain Python test is taking 11 s for the set conversion, and 300 microseconds for the lookup test!
alastairp
mmmhm
Pratha-Fish
* For both MB_track and MB_track_redir
alastairp
so, in one way I'm surprised that the "typical" way that you might do it in pandas is so slow
Pratha-Fish
It's probably due to the massive conversion overhead in pandas!
alastairp
but it's also important to work out what our goal is - we could generate the intersection of the 2 dataframes and save it, but we could also do this in 2 phases - 1) quickly look at all items and then if we see something, 2) save the filename for further analysis
my guess is that it's because both of these dataframes in pandas are lists
try `track_mbid_list = MB_track.gid.tolist()` and use `if recid in track_mbid_list:` instead
Pratha-Fish
alastairp: Yes, that's exactly how I am logging the anomalies
Only the suspected MBID and its file path
alastairp
and you'll see that this kind of check against a list is _way_ slower
Pratha-Fish
Python sets for the win!
alastairp
well, keep in mind that this is "datastructures for the win", using the correct tool for the job
this is big-O notation, if you've covered it before in classes
Pratha-Fish
Yes I am aware of it
Now I see why DSA is so important in interviews
alastairp
I just tested `if recid in track_mbid_list:` on the 6 item dataset, and it took about 16 seconds
Pratha-Fish
Looks like the Python set data structure is implemented with hash tables. Probably the biggest reason for these ridiculous lookup speeds
alastairp
that's really close to the 20 that we saw with pandas
Pratha-Fish
right
I also modified my pandas code a little, and it's taking ~14s now
alastairp
yes, right. when you have a list [1,2,3,4,5,6] and you want to check `if 8 in mylist` then it has to look at each item in the list. if it's not in the list you have to go through the entire list every single time you check
whereas with a hashed datastructure like a set (python dictionary keys work the same way), it's just a hash operation + 1 lookup
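A quick microbenchmark illustrating the list-scan vs hashed-lookup difference described above (sizes and names are made up for the example):

```python
# Illustrative timing: membership test on a list vs a set.
import timeit

items = [f"id-{i}" for i in range(100_000)]
as_list = list(items)
as_set = set(items)

probe = "not-present"  # worst case: the list scan visits every element

t_list = timeit.timeit(lambda: probe in as_list, number=100)
t_set = timeit.timeit(lambda: probe in as_set, number=100)

print(f"list: {t_list:.4f}s  set: {t_set:.6f}s")
```

The set lookup should come out orders of magnitude faster, since a miss on the list has to compare against all 100,000 items every time.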
Pratha-Fish
Let's try out the intersection method. It's written in numpy, so it's gotta be pretty fast for such a basic operation
alastairp
note that in that microbenchmark it's still faster to convert the series to a set, do the intersection, turn it back into a python list, and then turn that into a Series again (the first item, almost 10% faster than the last one)
.intersection is a python set method
.intersect1d is a numpy method
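For reference, the two intersection routes side by side on a tiny sample (the data here is invented, just to show the shapes):

```python
# Route 1: set.intersection, then back to a Series.
# Route 2: np.intersect1d, which returns a sorted unique array.
import numpy as np
import pandas as pd

a = pd.Series(["a1", "b2", "c3", "d4"])
b = pd.Series(["b2", "d4", "e5"])

set_result = pd.Series(sorted(set(a) & set(b)))
np_result = pd.Series(np.intersect1d(a, b))

print(set_result.tolist())  # → ['b2', 'd4']
```

Both give the same values; the difference is purely in how the lookup/dedup work is done under the hood.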
Pratha-Fish
right, so the method with set is still gonna be faster I assume
Let's check out both of them just in case
alastairp
sure, it'd be good to take a look at them
one other thing that I noticed - after you select the data from the database, they are a UUID object:
and it looks like this object has an "equality" method to compare other UUID objects and strings of UUIDs, but it looks like it's faster if we just treat them as strings
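One thing worth noting with the stdlib `uuid` module: a `UUID` object does not actually compare equal to its string form, which is another reason normalizing everything to strings up front is the safer (and cheaper) comparison:

```python
# UUID objects only compare equal to other UUID objects,
# so comparisons against string ids should go through str().
import uuid

u = uuid.UUID("12345678-1234-5678-1234-567812345678")
s = "12345678-1234-5678-1234-567812345678"

print(u == s)       # → False: UUID vs plain string never matches
print(str(u) == s)  # → True: normalize to strings before comparing
```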
lucifer
mayhem: hi! by any chance, do we have extra MeB t-shirts available at the office? :)
Pratha-Fish
alastairp: yes, I noticed that one too.
I also tried explicitly loading MLHD data with all UUID columns specified as string
But somehow it gets converted into UUID along the way
ansh: I'll also take a look at your search PR again, and the other API ones you added (you need the metadata and rating one for BB integration, right?)
ansh
yes
It was really great reviewing a PR :) Thank you for this opportunity