yes, I guess there's nothing stopping us from making a mapping during analysis and applying it. The question is that someone has to change it, so is that us or the submitter?
outsidecontext
I see
alastairp
I think that I didn't really foresee people trying to take the url of the stream that they're playing from and trying to derive this field from it
I was thinking of it more like a java package name. it "looks like" a domain name, but is really just supposed to be a stable identifier that isn't plain text
but then I can see that it probably has to be hard-coded in all cases, because, especially seeing the example from Web Scrobbler, I suspect that they have too many variations to derive it automatically from the metadata that they currently have
monkey
get the first string in the `matches` array in the connectors.js file -> strip characters like `/` and `*` -> strip leading `www.` -> strip any leading or trailing `.` -> canonical domain?
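A rough Python sketch of that pipeline (the function name is made up for illustration; Web Scrobbler itself is JavaScript, so this only demonstrates the string manipulation):

```python
def derive_domain(pattern: str) -> str:
    """Turn a match pattern like '*://*.freemusicarchive.org/*' into a bare domain."""
    # strip the wildcard/scheme characters: * : /
    domain = pattern.replace("*", "").replace(":", "").replace("/", "")
    domain = domain.strip(".")            # strip any leading or trailing dots
    if domain.startswith("www."):         # strip a leading www.
        domain = domain[len("www."):]
    return domain

print(derive_domain("*://*.freemusicarchive.org/*"))  # freemusicarchive.org
```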
alastairp
I think that would be possible, though it does mean that the order of their matches array suddenly becomes significant
a leading `.` is problematic too, as you don't know if the TLD has 1 or 2 parts
monkey
I meant for cases like `*://*.freemusicarchive.org/*`, where stripping the `*`, `:` and `/` characters leaves you with `.freemusicarchive.org`.
With a leading .
alastairp
ah right, a single leading ., got it
so maybe we could suggest to them to automatically derive it in the case of a single `matches`?
monkey
But to be honest this feels hacky indeed.
My preference would be hardcoding like they do the id and label. Let's wait and see what they think.
alastairp
one thought I had was to have a `primary_match: "freemusicarchive.org"` field which is just a single domain and is used for the service domain (they could automatically add the `*`s to it), and if there are multiple other matches, have a separate `additional_matches`
again, that's a lot of additional work for them, and it gives the primary_match field two semantic tasks. In which case, maybe a separate `music_service` field would be more explicit
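To make the idea concrete, a hypothetical connector entry with that split might look like this (shown as a Python dict for illustration; the field names are only the suggestions above, not Web Scrobbler's actual schema):

```python
# Hypothetical connector entry illustrating the proposed split.
connector = {
    "id": "freemusicarchive",
    "label": "Free Music Archive",
    # single canonical domain, doing only one job: identifying the service
    "music_service": "freemusicarchive.org",
    # full match patterns stay exactly as they are today
    "matches": ["*://*.freemusicarchive.org/*"],
}
```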
monkey
A quick application of my suggested manipulation of the first `matches` string for all connectors:
maybe some people might have that redirect in place... but if it was only there for 5 minutes then I wouldn't worry. this redirect is only in place for old URLs anyway, my links that I'll create will be to data.metabrainz.org
Pratha-Fish
alastairp: hi, it's completely fine, let's do it whenever you're free
CatQuest: Makes sense 🧠
alastairp
Pratha-Fish: hi, if you're around let's do it
Pratha-Fish
sure
alastairp
cool, I'm just pulling up your doc + code
Pratha-Fish
okie
alastairp
cool, code looks really good so far just from a look at it
let me check it out on wolf and run it myself
so it looks like we have in MLHD.ipynb a basic analysis of the mbids in a subset of the data, in MLHD_conflation.ipynb we look up metadata and build a new csv file, and in MLHD_conflation_mapping.ipynb we check that csv file against the mapping API, generating a new file for comparison. is that right?
Pratha-Fish
yes that's right!
Pratha-Fish comes back from AFK
alastairp
after installing from requirements.txt, I can't run `jupyter notebook`, do you know why?
Pratha-Fish
maybe because a few requirements are missing.
I'll update it real quick. gimme a sec
From the looks of it, requirements.txt is already up to date!
alastairp
are you using notebook (the web interface), or are you using it from vscode?
I see that jupyter-core is installed, but not notebook
Pratha-Fish
I am using it in VS Code actually, maybe that's why
alastairp
sure, ok. maybe you can `pip install notebook` and re-update requirements. not a big deal, I've done it myself on my version
Pratha-Fish
yes you're right jupyter notebook wasn't installed in this newer edition of the project
Updating it RN
alastairp: done
alastairp
awesome
did you see this warning in MLHD.ipynb?
> UserWarning: pandas only support SQLAlchemy connectable(engine/connection) or database string URI or sqlite3 DBAPI2 connection; other DBAPI2 objects are not tested, please consider using SQLAlchemy
Pratha-Fish
Right
pandas officially supports only SQLAlchemy connectable objects, but a psycopg2 connection worked too in the current versions we're using, and it looked lighter than SQLAlchemy, so I just went ahead with psycopg2
It shouldn't be hard to replace it tho
alastairp
yes, no problem. we use sqlalchemy in all of our other projects too, it's a small thing but would be nice if we can fix the warnings
but yes, it seems that there is no difference for our cases
Pratha-Fish
Sure, I'll add that one to the to-do list
alastairp
from sqlalchemy import create_engine; engine = create_engine('postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db'); conn = engine.connect()
just changed it in my version. works fine
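For the notebooks, the switch might then look something like this (a sketch assuming the same local credentials as above, and one of the queries mentioned below):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db")
with engine.connect() as conn:
    # same query as before, but via an SQLAlchemy connection,
    # so pandas no longer emits the DBAPI2 warning
    recordings = pd.read_sql("SELECT gid, name FROM recording", conn)
```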
Pratha-Fish
got it
I'll change it ASAP
alastairp
these db load methods look really useful for future work. in fact, I see that we already have similar versions in MLHD.ipynb (`SELECT gid FROM recording`) and MLHD_conflation.ipynb (`select gid, name from recording`)
let's move them into your lib/ folder, so that we can do `from lib import mb; recordings = mb.load_recordings_df()` or something
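A sketch of what that module could look like (the file name and function name follow the suggestion above; everything else is an assumption):

```python
# lib/mb.py
import pandas as pd
from sqlalchemy import create_engine

_engine = create_engine("postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db")

def load_recordings_df() -> pd.DataFrame:
    """Return all recording MBIDs and names from the MusicBrainz DB."""
    with _engine.connect() as conn:
        return pd.read_sql("SELECT gid, name FROM recording", conn)
```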
Pratha-Fish
sounds great! Added to the to-do list
alastairp
perfect
are we still checking items against the track table?
Pratha-Fish
alastairp: we did one check (with 370k rows I think) and found that no track MBID shows up in the recording-MBID column. So at least right now, rec-MBIDs are not being checked against the track table from MB
alastairp
OK. I wonder if we can run this over the entire dataset
I see your pending item in your doc for June 17:
> Analyze a larger sample space to verify if any track-MBIDs exist in the rec-MBID column.
Pratha-Fish
Ah right
alastairp
I think that we should prioritise this step this week, so that we can write it up in our document and remove this code from the notebook to simplify it
Pratha-Fish
definitely
alastairp
you were looking at the speed of loading data, right? let's take a look at that now
Pratha-Fish
So should I just loop over the complete dataset?
That could take ages, so is there any better way to get around it?
BrainzGit
[bookbrainz-site] 14MonkeyDo merged pull request #852 (03master…async-language-select): Feat(language-select): Asynchronously load options in language select https://github.com/metabrainz/bookbrainz-site/p...
alastairp
can you explain to me what you tried? I recall you had a few ideas with numba and other tools
well, let's try it! it could be interesting to get a general idea about how long it takes to read the entire dataset
Pratha-Fish
Right, I tried using Numba a little bit, but it looks like it doesn't have good support for higher-level functions from libraries like pandas, etc.
So we could write NumPy functions or plain Python functions with lists, and use Numba decorators with those
But I haven't found it very reliable since it barely worked for any of the functions that I've tried it with
alastairp
ok
and what you're trying to do here is make the `read_files` method quicker?
Pratha-Fish
So the current pandas read_csv method has a ton of bells and whistles that could be slowing it down, so I figured a reading function tailored to MLHD data could be quicker
So I wrote a simple function to take multiple MLHD txt files (in gzip format) > extract them > read the text and split it on `"\t"` to get the columns
Basically the whole table loaded as nested Python dictionaries
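A minimal sketch of a reader like the one described (the function name is illustrative, and it returns a list of rows rather than nested dicts, purely for brevity):

```python
import gzip

def read_mlhd_file(path: str) -> list[list[str]]:
    """Read one gzipped MLHD .txt file and split each line on tabs."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t") for line in f]
```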
alastairp
mmm, I tell you what - I'm just testing this myself
from what I can see, pandas is actually already 2x faster than just reading the file as a csv
I'm actually really surprised with this (in a good way!)
Pratha-Fish
yes that's right!
pandas.read_csv is pretty optimized, so I don't really think there's a point in writing another function to tackle that. I did the same tests as you and found my custom CSV loading function took ~20s while pandas took ~12s
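The kind of comparison being described, sketched out (the path is a placeholder; pandas reads `.gz` files transparently):

```python
import gzip
import time
import pandas as pd

path = "some_user.txt.gz"  # placeholder MLHD file

t0 = time.perf_counter()
df = pd.read_csv(path, sep="\t", header=None)
print(f"pandas.read_csv: {time.perf_counter() - t0:.1f}s")

t0 = time.perf_counter()
with gzip.open(path, "rt", encoding="utf-8") as f:
    rows = [line.rstrip("\n").split("\t") for line in f]
print(f"plain split:     {time.perf_counter() - t0:.1f}s")
```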
alastairp
cool, so maybe we shouldn't worry too much about optimising this more
soundandvision joined the channel
Pratha-Fish
Yes, lets skip this one
soundandvision
has anyone else had issues connecting to IRC here? I use the kiwi web browser app and it often times out on the first login attempt, then goes through on the second
alastairp
600000 files * 50ms = 500 minutes (~8h), so that's a lot of time spent just reading files, but it's the kind of time-period that you could just leave it running overnight and come back to the results the following day
it's not like it's going to take days or weeks just to read the files
soundandvision: timeout with your computer accessing kiwi, or kiwi accessing libera.chat?
Pratha-Fish
alastairp: But that's like 600GB of data, will it even fit in the RAM?
alastairp
Pratha-Fish: there's no reason to load it all into memory at once! just load 1 file, process it - get some results, move on to the next file
Pratha-Fish
ah yes
alastairp
does a pandas dataframe have a fast way of checking if a value is present in a column?
I'm wondering what the most efficient way to do this might be
Pratha-Fish
Yes, I'm currently using the .isin() function for that. I heard it's all written in C and is pretty fast
alastairp
cool, so let's do a test right now (if you are free)
soundandvision
(apologies, I'm crap at IRC commands) alastairp: kiwi accessing libera, it would seem
oh hey that tag worked!
alastairp
soundandvision: sure, this can be a bit complex
yes, I got the notification!
I wouldn't be surprised if kiwiirc has a limit to the number of connections to libera it can make, it could be related to this
Pratha-Fish: I think that we should try and load track table and track_gid_redirect, then load a single file and do isin to check the recording column against these two tables
soundandvision
ok cool :)
alastairp
if we get that working for 1 file, we can put it in a loop and leave it running for the next 8-10h
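Putting the pieces together, the planned run might look roughly like this (the glob path, column names, and file layout are assumptions; one file is loaded at a time so memory use stays flat):

```python
import glob
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db")
with engine.connect() as conn:
    track_gids = pd.read_sql("SELECT gid FROM track", conn)["gid"]
    redirect_gids = pd.read_sql("SELECT gid FROM track_gid_redirect", conn)["gid"]

total_hits = 0
for path in glob.glob("mlhd/**/*.txt.gz", recursive=True):
    # one file per user; process it, collect results, move on to the next
    df = pd.read_csv(path, sep="\t", header=None,
                     names=["timestamp", "artist_mbid", "release_mbid", "recording_mbid"])
    hits = df["recording_mbid"].isin(track_gids) | df["recording_mbid"].isin(redirect_gids)
    total_hits += hits.sum()

print(f"rows whose rec-MBID appears in track/track_gid_redirect: {total_hits}")
```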
Pratha-Fish
alastairp: exciting
I'll make sure to do some extensive testing this time around too
alastairp
Pratha-Fish: btw, I see in your method `get_null_stats` you say "Number of NOT-null rows in ...", I guess this should be Percentage?
Pratha-Fish
Is it in the MLHD.ipynb file?
alastairp
yes
Pratha-Fish
checking
P.S. sorry if my messages are delivered late. I ran out of mobile data so the latency has shot up a lot
alastairp
no prob. it looks fine from here
Pratha-Fish
also, yes the get_null_stats function returns output in %
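For reference, a guess at what such a helper does (not the project's actual implementation, just the behaviour agreed on above):

```python
import pandas as pd

def get_null_stats(df: pd.DataFrame) -> None:
    """Print the percentage of NOT-null rows in each column."""
    for col in df.columns:
        pct = df[col].notna().mean() * 100
        print(f"Percentage of NOT-null rows in {col}: {pct:.2f}%")
```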