yes, I guess nothing is stopping us from making a mapping during analysis and applying it. I guess the idea is that someone has to change it, so is that us or the submitter?
outsidecontext
I see
alastairp
I think that I didn't really foresee people trying to take the url of the stream that they're playing from and trying to derive this field from it
alastairp
I was thinking of it more like a java package name. it "looks like" a domain name, but is really just supposed to be a stable identifier that isn't plain text
alastairp
but then I can see that it probably has to be hard-coded in all cases; especially after seeing the example from webscrobbler, I suspect that they have too many variations to derive it automatically from the metadata they currently have
monkey
get first string in the `matches` array in the connectors.js file -> strip characters like `/` and `*` -> strip leading `www.` -> strip any leading or trailing `.` -> canonical domain?
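For illustration, a minimal Python sketch of the manipulation monkey describes (the function name and example are assumptions, not web-scrobbler code):

```python
# Hypothetical sketch of the derivation described above: reduce the
# first entry of a connector's `matches` array to a bare domain.
def derive_domain(match_pattern: str) -> str:
    domain = match_pattern
    # strip characters like '/', '*' and the scheme separator ':'
    for ch in "/*:":
        domain = domain.replace(ch, "")
    # strip a leading 'www.'
    if domain.startswith("www."):
        domain = domain[len("www."):]
    # strip any leading or trailing '.'
    return domain.strip(".")

# e.g. '*://*.freemusicarchive.org/*' -> 'freemusicarchive.org'
print(derive_domain("*://*.freemusicarchive.org/*"))
```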
alastairp
I think that would be possible, though it does mean that the order of their matches array suddenly becomes significant
alastairp
a leading . is problematic too, as you don't know if the TLD has 1 or 2 parts
monkey
I meant for cases like `*://*.freemusicarchive.org/*`, where stripping the `*`, `:` and `/` characters leaves you with `.freemusicarchive.org`.
monkey
With a leading .
alastairp
ah right, a single leading ., got it
alastairp
so maybe we could suggest to them to automatically derive it in the case of a single `matches`?
monkey
But to be honest this feels hacky indeed.
monkey
My preference would be hardcoding, like they do the id and label. Let's wait and see what they think.
alastairp
one thought I had was to have a `primary_match: "freemusicarchive.org"` field which is just a single domain and is used for the service domain (they could automatically add the `*`s to it), and if there are multiple other matches, have a separate `additional_matches`
alastairp
again, lots of additional work for them, and it gives the `primary_match` field two semantic tasks. in which case maybe an explicit `music_service` field would be clearer
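As a rough illustration of the proposed shape (the field names come from the chat; the surrounding structure is an assumption, shown as a Python dict rather than web-scrobbler's actual JS connector format):

```python
# Hypothetical connector entry illustrating the proposal above;
# real web-scrobbler connectors are JS objects, and the
# additional_matches value here is invented for illustration.
connector = {
    "id": "freemusicarchive",
    "label": "Free Music Archive",
    # single domain used for the service identifier; the extension
    # could expand this to '*://*.freemusicarchive.org/*' itself
    "primary_match": "freemusicarchive.org",
    # any extra patterns that don't fit the simple domain form
    "additional_matches": ["*://cdn.freemusicarchive.org/*"],
}
```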
monkey
A quick application of my suggested manipulation of the first `matches` string for all connectors:
maybe some people might have that redirect in place... but if it was only there for 5 minutes then I wouldn't worry. this redirect is only in place for old URLs anyway, my links that I'll create will be to data.metabrainz.org
Pratha-Fish
alastairp: hi, it's completely fine, let's do it whenever you're free
Pratha-Fish
CatQuest: Makes sense
alastairp
Pratha-Fish: hi, if you're around let's do it
Pratha-Fish
sure
alastairp
cool, I'm just pulling up your doc + code
Pratha-Fish
okie
alastairp
cool, code looks really good so far just from a look at it
alastairp
let me check it out on wolf and run it myself
alastairp
so it looks like we have in MLHD.ipynb a basic analysis of the mbids in a subset of the data, in MLHD_conflation.ipynb we look up metadata and build a new csv file, and in MLHD_conflation_mapping.ipynb we check that csv file against the mapping API, generating a new file for comparison. is that right?
Pratha-Fish
yes that's right!
Pratha-Fish comes back from AFK
alastairp
after installing from requirements.txt, I can't run `jupyter notebook`, do you know why?
Pratha-Fish
maybe because a few requirements are missing.
Pratha-Fish
I'll update it real quick. gimme a sec
Pratha-Fish
From the looks of it, requirements.txt is already up to date!
alastairp
are you using notebook (the web interface), or are you using it from vscode?
alastairp
I see that jupyter-core is installed, but not notebook
Pratha-Fish
I am using it on vscode actually, maybe that's why
alastairp
sure, ok. maybe you can `pip install notebook` and re-update requirements. not a big deal, I've done it myself on my version
Pratha-Fish
yes, you're right, jupyter notebook wasn't installed in this newer version of the project
Pratha-Fish
Updating it RN
Pratha-Fish
alastairp: done
alastairp
awesome
alastairp
did you see this warning in MLHD.ipynb?
alastairp
> UserWarning: pandas only support SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested, please consider using SQLAlchemy
Pratha-Fish
Right
Pratha-Fish
pandas officially supports only SQLAlchemy connectable objects, but a psycopg2 connection worked too in the current versions we're using, and it looked lighter than SQLAlchemy, so I just went ahead with psycopg2
Pratha-Fish
It shouldn't be hard to replace it tho
alastairp
yes, no problem. we use sqlalchemy in all of our other projects too, it's a small thing but would be nice if we can fix the warnings
alastairp
but yes, it seems that there is no difference for our cases
Pratha-Fish
Sure, I'll add that one to the to-do list
alastairp
from sqlalchemy import create_engine; engine = create_engine('postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db'); conn = engine.connect()
alastairp
just changed it in my version. works fine
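For context, a minimal sketch of how that engine replaces the psycopg2 connection in the notebook's loading code (the query is one quoted later in this log; the surrounding code is an assumption):

```python
import pandas as pd
from sqlalchemy import create_engine

# SQLAlchemy engine instead of a raw psycopg2 connection, which
# avoids the pandas UserWarning quoted above
engine = create_engine("postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db")

with engine.connect() as conn:
    # query taken from the MLHD.ipynb discussion in this log
    recordings = pd.read_sql("SELECT gid FROM recording", conn)
```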
Pratha-Fish
got it
Pratha-Fish
I'll change it ASAP
alastairp
these db load methods look really useful for future work. in fact, I see that we already have some similar version in MLHD.ipynb (`SELECT gid FROM recording`) and MLHD_conflation.ipynb (`select gid, name from recording`)
alastairp
let's move them into your lib/ folder, so that we can do `from lib import mb; recordings = mb.load_recordings_df()` or something
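A possible shape for that helper module (the module and function names come from the chat; everything else is a sketch, not the actual project code):

```python
# lib/mb.py -- hypothetical sketch of the shared loading helpers
import pandas as pd
from sqlalchemy import create_engine

# assumed connection string, matching the one used earlier in this log
DB_URI = "postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db"

def load_recordings_df(with_names: bool = False) -> pd.DataFrame:
    """Load recording MBIDs (optionally with names) from MusicBrainz."""
    query = "SELECT gid, name FROM recording" if with_names else "SELECT gid FROM recording"
    engine = create_engine(DB_URI)
    with engine.connect() as conn:
        return pd.read_sql(query, conn)
```

Usage would then be `from lib import mb; recordings = mb.load_recordings_df()`, as suggested above.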
Pratha-Fish
sounds great! Added to the to-do list
alastairp
perfect
alastairp
are we still checking items against the track table?
Pratha-Fish
alastairp: we did one check (with 370k rows I think) and found that no track-MBID shows up in the recording-MBID column. So at least right now, rec-MBIDs are not being checked against the track table from MB
alastairp
OK. I wonder if we can run this over the entire dataset
alastairp
I see your pending item in your doc for June 17:
alastairp
> Analyze a larger sample space to verify if any track-MBIDs exist in the rec-MBID column.
Pratha-Fish
Ah right
alastairp
I think that we should prioritise this step this week, so that we can write it up in our document and remove this code from the notebook to simplify it
Pratha-Fish
definitely
alastairp
you were looking at the speed of loading data, right? let's take a look at that now
Pratha-Fish
So should I just loop over the complete dataset?
Pratha-Fish
That could take ages, so is there any better way to get around it?
BrainzGit
[bookbrainz-site] MonkeyDo merged pull request #852 (master…async-language-select): Feat(language-select): Asynchronously load options in language select https://github.com/metabrainz/bookbrainz-site/pul…
alastairp
can you explain to me what you tried? I recall you had a few ideas with numba and other tools
alastairp
well, let's try it! it could be interesting to get a general idea about how long it takes to read the entire dataset
Pratha-Fish
Right, I tried using Numba a little bit, but it looks like it doesn't have good support for higher-level functions from libraries like pandas, etc
Pratha-Fish
So we could write numpy functions or plain Python functions with lists and such, and use numba decorators with them
Pratha-Fish
But I haven't found it very reliable, since it barely worked for any of the functions that I've tried it with
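For illustration, the kind of thing that does compile under numba (a plain loop over a numpy array) versus what doesn't (pandas calls); this is a generic example, not code from the project:

```python
import numpy as np
from numba import njit

@njit
def count_matches(values: np.ndarray, target: np.int64) -> int:
    # plain loops over numpy arrays compile fine in nopython mode
    count = 0
    for v in values:
        if v == target:
            count += 1
    return count

# count_matches(np.array([1, 2, 2, 3], dtype=np.int64), np.int64(2)) -> 2
# by contrast, decorating a function that calls pd.read_csv with @njit
# fails to compile, matching the experience described above
```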
alastairp
ok
alastairp
and what you're trying to do here is make the `read_files` method quicker?
Pratha-Fish
So the current pandas read_csv method has a ton of bells and whistles that could be slowing it down. I figured writing a tailored reading function for MLHD data could be quicker
Pratha-Fish
So I wrote a simple function to get multiple mlhd txt files (in gzip format) -> extract them -> read the text and split it on the "\t" to get columns
Pratha-Fish
Basically the whole table loaded as nested Python dictionaries
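A minimal sketch of the reader described above (the function name and exact output shape are assumptions):

```python
import gzip

# Hypothetical reconstruction of the tailored MLHD reader: read
# gzipped tab-separated files and split into columns by hand.
def read_mlhd_files(paths: list[str]) -> dict[str, list[list[str]]]:
    tables = {}
    for path in paths:
        with gzip.open(path, "rt") as fh:
            # one row per line, columns split on tabs
            tables[path] = [line.rstrip("\n").split("\t") for line in fh]
    return tables
```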
alastairp
mmm, I tell you what - I'm just testing this myself
from what I can see, pandas is actually already 2x faster than just reading the file as a csv
alastairp
I'm actually really surprised with this (in a good way!)
Pratha-Fish
yes that's right!
Pratha-Fish
pandas.read_csv is pretty optimized, so I don't really think there's a point in writing another function to tackle that. I did the same tests as you and found my custom csv loading func took ~20s while pandas took ~12s
alastairp
cool, so maybe we shouldn't worry too much about optimising this more
soundandvision joined the channel
Pratha-Fish
Yes, let's skip this one
soundandvision
has anyone else had issues with connecting to IRC here? I use the Kiwi web browser app and it often times out on the first login attempt, then goes through on the second
alastairp
600000 files * 50ms = 500 minutes (~8h), so that's a lot of time spent just reading files, but it's the kind of time where you could just leave it running overnight and come back to the results the following day
alastairp
it's not like it's going to take days or weeks just to read the files
alastairp
soundandvision: timeout with your computer accessing kiwi, or kiwi accessing libera.chat?
Pratha-Fish
alastairp: But that's like 600GB of data, will it even fit in the RAM?
alastairp
Pratha-Fish: there's no reason to load it all into memory at once! just load 1 file, process it - get some results, move on to the next file
Pratha-Fish
ah yes
alastairp
does a pandas dataframe have a fast way of checking if a value is present in a column?
alastairp
I'm wondering what the most efficient way to do this might be
Pratha-Fish
Yes, I'm currently using the .isin() function for that. I heard it's all written in C and is pretty fast
alastairp
cool, so let's do a test right now (if you are free)
soundandvision
(apologies, I'm crap at IRC commands) alastairp: kiwi accessing libera, it would seem
soundandvision
oh hey, that tag worked!
alastairp
soundandvision: sure, this can be a bit complex
alastairp
yes, I got the notification!
alastairp
I wouldn't be surprised if kiwiirc has a limit to the number of connections to libera it can make; it could be related to this
alastairp
Pratha-Fish: I think that we should try and load the track table and track_gid_redirect, then load a single file and do isin to check the recording column against these two tables
soundandvision
ok cool :)
alastairp
if we get that working for 1 file, we can put it in a loop and leave it running for the next 8-10h
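A minimal sketch of that check, under assumptions: MLHD files are gzipped TSVs with a recording-MBID column, the table and column names follow the MusicBrainz schema mentioned above (`track.gid`, `track_gid_redirect.gid`), and the file paths and column names are placeholders:

```python
import glob
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db")

# load the two gid columns to check against
with engine.connect() as conn:
    track_gids = pd.read_sql("SELECT gid FROM track", conn)["gid"]
    redirect_gids = pd.read_sql("SELECT gid FROM track_gid_redirect", conn)["gid"]

# hypothetical path and column layout for the MLHD files
for path in glob.glob("MLHD/*.txt.gz"):
    df = pd.read_csv(path, sep="\t",
                     names=["timestamp", "artist_mbid", "release_mbid", "recording_mbid"])
    hits = df["recording_mbid"].isin(track_gids) | df["recording_mbid"].isin(redirect_gids)
    if hits.any():
        print(path, int(hits.sum()), "recording-MBIDs found in the track tables")
```

Getting this working for one file first, then wrapping the loop around it, matches the plan described above.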
Pratha-Fish
alastairp: exciting
Pratha-Fish
I'll make sure to do some extensive testing this time around too
alastairp
Pratha-Fish: btw, I see in your method `get_null_stats` you say "Number of NOT-null rows in ...", I guess this should be Percentage?
Pratha-Fish
Is it in the MLHD.ipynb file?
alastairp
yes
Pratha-Fish
checking
Pratha-Fish
P.S. sorry if my messages are delivered late. I ran out of mobile data so the latency has shot up a lot
alastairp
no prob. it looks fine from here
Pratha-Fish
also, yes, the get_null_stats function returns output in %
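For reference, a hypothetical reconstruction of what `get_null_stats` presumably computes (only the name and the percentage output are from the chat; the body is an assumption):

```python
import pandas as pd

def get_null_stats(df: pd.DataFrame) -> pd.Series:
    # percentage of non-null rows per column, matching the corrected
    # "Percentage of NOT-null rows in ..." wording discussed above
    return df.notna().mean() * 100
```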