yes, I guess there's nothing stopping us from making a mapping during analysis and applying it. The question is that someone has to change it, so is that us or the submitter?
outsidecontext
I see
alastairp
I think that I didn't really foresee people trying to take the url of the stream that they're playing from and trying to derive this field from it
I was thinking of it more like a java package name. it "looks like" a domain name, but is really just supposed to be a stable identifier that isn't plain text
but then I can see that it probably has to be hard-coded in all cases, because, especially seeing the example from Web Scrobbler, I suspect that they have too many variations to derive it automatically from the metadata that they currently have
monkey
get the first string in the `matches` array in the connectors.js file -> strip characters like `/` and `*` -> strip leading `www.` -> strip any leading or trailing `.` -> canonical domain?
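A rough Python sketch of that pipeline (the function name is made up for illustration; Web Scrobbler itself is JavaScript, so this only demonstrates the string manipulation):

```python
def derive_domain(pattern: str) -> str:
    """Turn a match pattern like '*://*.freemusicarchive.org/*' into a bare domain."""
    # strip the wildcard/scheme characters: * : /
    domain = pattern.replace("*", "").replace(":", "").replace("/", "")
    domain = domain.strip(".")            # strip any leading or trailing dots
    if domain.startswith("www."):         # strip a leading www.
        domain = domain[len("www."):]
    return domain

print(derive_domain("*://*.freemusicarchive.org/*"))  # freemusicarchive.org
```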
alastairp
I think that would be possible, though it does mean that the order of their matches array suddenly becomes significant
a leading `.` is problematic too, as you don't know if the TLD has 1 or 2 parts
monkey
I meant for cases like `*://*.freemusicarchive.org/*`, where stripping the `*`, `:` and `/` characters leaves you with `.freemusicarchive.org`.
With a leading .
alastairp
ah right, a single leading ., got it
so maybe we could suggest to them to automatically derive it in the case of a single `matches`?
monkey
But to be honest this feels hacky indeed.
My preference would be hardcoding like they do the id and label. Let's wait and see what they think.
alastairp
one thought I had was to have a `primary_match: "freemusicarchive.org"` field which is just a single domain and is used for the service domain (they could automatically add the `*`s to it), and if there are multiple other matches, have a separate `additional_matches`
again, that's a lot of additional work for them, and it gives the primary_match field two semantic tasks. In which case, maybe a separate `music_service` field would be more explicit
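To make the idea concrete, a hypothetical connector entry with that split might look like this (shown as a Python dict for illustration; the field names are only the suggestions above, not Web Scrobbler's actual schema):

```python
# Hypothetical connector entry illustrating the proposed split.
connector = {
    "id": "freemusicarchive",
    "label": "Free Music Archive",
    # single canonical domain, doing only one job: identifying the service
    "music_service": "freemusicarchive.org",
    # full match patterns stay exactly as they are today
    "matches": ["*://*.freemusicarchive.org/*"],
}
```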
monkey
A quick application of my suggested manipulation of the first `matches` string for all connectors:
maybe some people might have that redirect in place... but if it was only there for 5 minutes then I wouldn't worry. this redirect is only in place for old URLs anyway, my links that I'll create will be to data.metabrainz.org
Pratha-Fish
alastairp: hi, it's completely fine, let's do it whenever you're free
CatQuest: Makes sense 🧠
alastairp
Pratha-Fish: hi, if you're around let's do it
Pratha-Fish
sure
alastairp
cool, I'm just pulling up your doc + code
Pratha-Fish
okie
alastairp
cool, code looks really good so far just from a look at it
let me check it out on wolf and run it myself
so it looks like we have in MLHD.ipynb a basic analysis of the mbids in a subset of the data, in MLHD_conflation.ipynb we look up metadata and build a new csv file, and in MLHD_conflation_mapping.ipynb we check that csv file against the mapping API, generating a new file for comparison. is that right?
Pratha-Fish
yes that's right!
Pratha-Fish comes back from AFK
alastairp
after installing from requirements.txt, I can't run `jupyter notebook`, do you know why?
Pratha-Fish
maybe because a few requirements are missing.
I'll update it real quick. gimme a sec
From the looks of it, requirements.txt is already up to date!
alastairp
are you using notebook (the web interface), or are you using it from vscode?
I see that jupyter-core is installed, but not notebook
Pratha-Fish
I am using it in VS Code actually, maybe that's why
alastairp
sure, ok. maybe you can `pip install notebook` and re-update requirements. not a big deal, I've done it myself on my version
Pratha-Fish
yes you're right jupyter notebook wasn't installed in this newer edition of the project
Updating it RN
alastairp: done
alastairp
awesome
did you see this warning in MLHD.ipynb?
> UserWarning: pandas only support SQLAlchemy connectable(engine/connection) or database string URI or sqlite3 DBAPI2 connection; other DBAPI2 objects are not tested, please consider using SQLAlchemy
Pratha-Fish
Right
pandas officially supports only SQLAlchemy connectable objects, but a psycopg2 connection worked too in the current versions we're using, and it looked lighter than SQLAlchemy, so I just went ahead with psycopg2
It shouldn't be hard to replace it tho
alastairp
yes, no problem. we use sqlalchemy in all of our other projects too, it's a small thing but would be nice if we can fix the warnings
but yes, it seems that there is no difference for our cases
Pratha-Fish
Sure, I'll add that one to the to-do list
alastairp
from sqlalchemy import create_engine; engine = create_engine('postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db'); conn = engine.connect()
just changed it in my version. works fine
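For the notebooks, the switch might then look something like this (a sketch assuming the same local credentials as above, and one of the queries mentioned below):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db")
with engine.connect() as conn:
    # same query as before, but via an SQLAlchemy connection,
    # so pandas no longer emits the DBAPI2 warning
    recordings = pd.read_sql("SELECT gid, name FROM recording", conn)
```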
Pratha-Fish
got it
I'll change it ASAP
alastairp
these db load methods look really useful for future work. in fact, I see that we already have similar versions in MLHD.ipynb (`SELECT gid FROM recording`) and MLHD_conflation.ipynb (`select gid, name from recording`)
let's move them into your lib/ folder, so that we can do `from lib import mb; recordings = mb.load_recordings_df()` or something
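A sketch of what that module could look like (the file name and function name follow the suggestion above; everything else is an assumption):

```python
# lib/mb.py
import pandas as pd
from sqlalchemy import create_engine

_engine = create_engine("postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db")

def load_recordings_df() -> pd.DataFrame:
    """Return all recording MBIDs and names from the MusicBrainz DB."""
    with _engine.connect() as conn:
        return pd.read_sql("SELECT gid, name FROM recording", conn)
```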
Pratha-Fish
sounds great! Added to the to-do list
alastairp
perfect
are we still checking items against the track table?
Pratha-Fish
alastairp: we did one check (with 370k rows I think) and found that no track MBID shows up in the recording-MBID column. So at least right now, rec-MBIDs are not being checked against the track table from MB
alastairp
OK. I wonder if we can run this over the entire dataset
I see your pending item in your doc for June 17:
> Analyze a larger sample space to verify if any track-MBIDs exist in the rec-MBID column.
Pratha-Fish
Ah right
alastairp
I think that we should prioritise this step this week, so that we can write it up in our document and remove this code from the notebook to simplify it
Pratha-Fish
definitely
alastairp
you were looking at the speed of loading data, right? let's take a look at that now
Pratha-Fish
So should I just loop over the complete dataset?
That could take ages, so is there any better way to get around it?
BrainzGit
[bookbrainz-site] 14MonkeyDo merged pull request #852 (03master…async-language-select): Feat(language-select): Asynchronously load options in language select https://github.com/metabrainz/bookbrainz-site/p...
alastairp
can you explain to me what you tried? I recall you had a few ideas with numba and other tools
well, let's try it! it could be interesting to get a general idea about how long it takes to read the entire dataset
Pratha-Fish
Right, I tried using Numba a little bit, but it looks like it doesn't have good support for higher-level functions from libraries like pandas, etc.
So we could write NumPy functions or plain Python functions with lists, and use Numba decorators with those
But I haven't found it very reliable since it barely worked for any of the functions that I've tried it with
alastairp
ok
and what you're trying to do here is make the `read_files` method quicker?
Pratha-Fish
So the current pandas read_csv method has a ton of bells and whistles that could be slowing it down, so I figured a reading function tailored to MLHD data could be quicker
So I wrote a simple function to take multiple MLHD txt files (in gzip format) > extract them > read the text and split it on `"\t"` to get the columns
Basically the whole table loaded as nested Python dictionaries
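A minimal sketch of a reader like the one described (the function name is illustrative, and it returns a list of rows rather than nested dicts, purely for brevity):

```python
import gzip

def read_mlhd_file(path: str) -> list[list[str]]:
    """Read one gzipped MLHD .txt file and split each line on tabs."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t") for line in f]
```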
alastairp
mmm, I tell you what - I'm just testing this myself
from what I can see, pandas is actually already 2x faster than just reading the file as a csv
I'm actually really surprised with this (in a good way!)
Pratha-Fish
yes that's right!
pandas.read_csv is pretty optimized, so I don't really think there's a point in writing another function to tackle that. I did the same tests as you and found my custom CSV loading function took ~20s while pandas took ~12s
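The kind of comparison being described, sketched out (the path is a placeholder; pandas reads `.gz` files transparently):

```python
import gzip
import time
import pandas as pd

path = "some_user.txt.gz"  # placeholder MLHD file

t0 = time.perf_counter()
df = pd.read_csv(path, sep="\t", header=None)
print(f"pandas.read_csv: {time.perf_counter() - t0:.1f}s")

t0 = time.perf_counter()
with gzip.open(path, "rt", encoding="utf-8") as f:
    rows = [line.rstrip("\n").split("\t") for line in f]
print(f"plain split:     {time.perf_counter() - t0:.1f}s")
```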
alastairp
cool, so maybe we shouldn't worry too much about optimising this more
soundandvision joined the channel
Pratha-Fish
Yes, lets skip this one
soundandvision
has anyone else had issues connecting to IRC here? I use the kiwi web browser app and it often times out on the first login attempt, then goes through on the second
alastairp
600000 files * 50ms = 500 minutes (~8h), so that's a lot of time spent just reading files, but it's the kind of time-period that you could just leave it running overnight and come back to the results the following day
it's not like it's going to take days or weeks just to read the files
soundandvision: timeout with your computer accessing kiwi, or kiwi accessing libera.chat?
Pratha-Fish
alastairp: But that's like 600GB of data, will it even fit in the RAM?
alastairp
Pratha-Fish: there's no reason to load it all into memory at once! just load 1 file, process it - get some results, move on to the next file
Pratha-Fish
ah yes
alastairp
does a pandas dataframe have a fast way of checking if a value is present in a column?
I'm wondering what the most efficient way to do this might be
Pratha-Fish
Yes, I'm currently using the .isin() function for that. I heard it's all written in C and is pretty fast
alastairp
cool, so let's do a test right now (if you are free)
soundandvision
(apologies, I'm crap at IRC commands) alastairp: kiwi accessing libera, it would seem
oh hey that tag worked!
alastairp
soundandvision: sure, this can be a bit complex
yes, I got the notification!
I wouldn't be surprised if kiwiirc has a limit to the number of connections to libera it can make, it could be related to this
Pratha-Fish: I think that we should try and load track table and track_gid_redirect, then load a single file and do isin to check the recording column against these two tables
soundandvision
ok cool :)
alastairp
if we get that working for 1 file, we can put it in a loop and leave it running for the next 8-10h
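Putting the pieces together, the planned run might look roughly like this (the glob path, column names, and file layout are assumptions; one file is loaded at a time so memory use stays flat):

```python
import glob
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db")
with engine.connect() as conn:
    track_gids = pd.read_sql("SELECT gid FROM track", conn)["gid"]
    redirect_gids = pd.read_sql("SELECT gid FROM track_gid_redirect", conn)["gid"]

total_hits = 0
for path in glob.glob("mlhd/**/*.txt.gz", recursive=True):
    # one file per user; process it, collect results, move on to the next
    df = pd.read_csv(path, sep="\t", header=None,
                     names=["timestamp", "artist_mbid", "release_mbid", "recording_mbid"])
    hits = df["recording_mbid"].isin(track_gids) | df["recording_mbid"].isin(redirect_gids)
    total_hits += hits.sum()

print(f"rows whose rec-MBID appears in track/track_gid_redirect: {total_hits}")
```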
Pratha-Fish
alastairp: exciting
I'll make sure to do some extensive testing this time around too
alastairp
Pratha-Fish: btw, I see in your method `get_null_stats` you say "Number of NOT-null rows in ...", I guess this should be Percentage?
Pratha-Fish
Is it in the MLHD.ipynb file?
alastairp
yes
Pratha-Fish
checking
P.S. sorry if my messages are delivered late. I ran out of mobile data so the latency has shot up a lot
alastairp
no prob. it looks fine from here
Pratha-Fish
also, yes the get_null_stats function returns output in %
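For reference, a guess at what such a helper does (not the project's actual implementation, just the behaviour agreed on above):

```python
import pandas as pd

def get_null_stats(df: pd.DataFrame) -> None:
    """Print the percentage of NOT-null rows in each column."""
    for col in df.columns:
        pct = df[col].notna().mean() * 100
        print(f"Percentage of NOT-null rows in {col}: {pct:.2f}%")
```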