yes, I guess nothing is stopping us from making a mapping during analysis and applying it. I guess the idea is that someone has to change it, so is that us or the submitter?
outsidecontext
I see
alastairp
I think that I didn't really foresee people trying to take the url of the stream that they're playing from and trying to derive this field from it
alastairp
I was thinking of it more like a java package name. it "looks like" a domain name, but is really just supposed to be a stable identifier that isn't plain text
alastairp
but then I can see that it probably has to be hard-coded in all cases; especially after seeing the example from webscrobbler, I suspect that they have too many variations to derive it automatically from the metadata they currently have
monkey
get first string in the `matches` array in the connectors.js file -> strip characters like `/` and `*` -> strip leading `www.` -> strip any leading or trailing `.` -> canonical domain?
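For illustration, a minimal Python sketch of the manipulation monkey describes (the function name and example are assumptions, not web-scrobbler code):

```python
# Hypothetical sketch of the derivation described above: reduce the
# first entry of a connector's `matches` array to a bare domain.
def derive_domain(match_pattern: str) -> str:
    domain = match_pattern
    # strip characters like '/', '*' and the scheme separator ':'
    for ch in "/*:":
        domain = domain.replace(ch, "")
    # strip a leading 'www.'
    if domain.startswith("www."):
        domain = domain[len("www."):]
    # strip any leading or trailing '.'
    return domain.strip(".")

# e.g. '*://*.freemusicarchive.org/*' -> 'freemusicarchive.org'
print(derive_domain("*://*.freemusicarchive.org/*"))
```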
alastairp
I think that would be possible, though it does mean that the order of their matches array suddenly becomes significant
alastairp
a leading . is problematic too, as you don't know if the TLD has 1 or 2 parts
monkey
I meant for cases like `*://*.freemusicarchive.org/*`, where stripping the `*`, `:` and `/` characters leaves you with `.freemusicarchive.org`.
monkey
With a leading .
alastairp
ah right, a single leading ., got it
alastairp
so maybe we could suggest to them to automatically derive it in the case of a single `matches`?
monkey
But to be honest this feels hacky indeed.
monkey
My preference would be hardcoding, like they do the id and label. Let's wait and see what they think.
alastairp
one thought I had was to have a `primary_match: "freemusicarchive.org"` field which is just a single domain and is used for the service domain (they could automatically add the `*`s to it), and if there are multiple other matches, have a separate `additional_matches`
alastairp
again, lots of additional work for them, and it gives the `primary_match` field two semantic tasks. in which case maybe an explicit `music_service` field would be clearer
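As a rough illustration of the proposed shape (the field names come from the chat; the surrounding structure is an assumption, shown as a Python dict rather than web-scrobbler's actual JS connector format):

```python
# Hypothetical connector entry illustrating the proposal above;
# real web-scrobbler connectors are JS objects, and the
# additional_matches value here is invented for illustration.
connector = {
    "id": "freemusicarchive",
    "label": "Free Music Archive",
    # single domain used for the service identifier; the extension
    # could expand this to '*://*.freemusicarchive.org/*' itself
    "primary_match": "freemusicarchive.org",
    # any extra patterns that don't fit the simple domain form
    "additional_matches": ["*://cdn.freemusicarchive.org/*"],
}
```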
monkey
A quick application of my suggested manipulation of the first `matches` string for all connectors:
maybe some people might have that redirect in place... but if it was only there for 5 minutes then I wouldn't worry. this redirect is only in place for old URLs anyway, my links that I'll create will be to data.metabrainz.org
Pratha-Fish
alastairp: hi, it's completely fine, let's do it whenever you're free
Pratha-Fish
CatQuest: Makes sense
alastairp
Pratha-Fish: hi, if you're around let's do it
Pratha-Fish
sure
alastairp
cool, I'm just pulling up your doc + code
Pratha-Fish
okie
alastairp
cool, code looks really good so far just from a look at it
alastairp
let me check it out on wolf and run it myself
alastairp
so it looks like we have in MLHD.ipynb a basic analysis of the mbids in a subset of the data, in MLHD_conflation.ipynb we look up metadata and build a new csv file, and in MLHD_conflation_mapping.ipynb we check that csv file against the mapping API, generating a new file for comparison. is that right?
Pratha-Fish
yes that's right!
Pratha-Fish comes back from AFK
alastairp
after installing from requirements.txt, I can't run `jupyter notebook`, do you know why?
Pratha-Fish
maybe because a few requirements are missing.
Pratha-Fish
I'll update it real quick. gimme a sec
Pratha-Fish
From the looks of it, requirements.txt is already up to date!
alastairp
are you using notebook (the web interface), or are you using it from vscode?
alastairp
I see that jupyter-core is installed, but not notebook
Pratha-Fish
I am using it on vscode actually, maybe that's why
alastairp
sure, ok. maybe you can `pip install notebook` and re-update requirements. not a big deal, I've done it myself on my version
Pratha-Fish
yes, you're right, jupyter notebook wasn't installed in this newer version of the project
Pratha-Fish
Updating it RN
Pratha-Fish
alastairp: done
alastairp
awesome
alastairp
did you see this warning in MLHD.ipynb?
alastairp
> UserWarning: pandas only support SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested, please consider using SQLAlchemy
Pratha-Fish
Right
Pratha-Fish
pandas officially supports only SQLAlchemy connectable objects, but a psycopg2 connection worked too in the current versions we're using, and it looked lighter than SQLAlchemy, so I just went ahead with psycopg2
Pratha-Fish
It shouldn't be hard to replace it tho
alastairp
yes, no problem. we use sqlalchemy in all of our other projects too, it's a small thing but would be nice if we can fix the warnings
alastairp
but yes, it seems that there is no difference for our cases
Pratha-Fish
Sure, I'll add that one to the to-do list
alastairp
from sqlalchemy import create_engine; engine = create_engine('postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db'); conn = engine.connect()
alastairp
just changed it in my version. works fine
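For context, a minimal sketch of how that engine replaces the psycopg2 connection in the notebook's loading code (the query is one quoted later in this log; the surrounding code is an assumption):

```python
import pandas as pd
from sqlalchemy import create_engine

# SQLAlchemy engine instead of a raw psycopg2 connection, which
# avoids the pandas UserWarning quoted above
engine = create_engine("postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db")

with engine.connect() as conn:
    # query taken from the MLHD.ipynb discussion in this log
    recordings = pd.read_sql("SELECT gid FROM recording", conn)
```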
Pratha-Fish
got it
Pratha-Fish
I'll change it ASAP
alastairp
these db load methods look really useful for future work. in fact, I see that we already have some similar version in MLHD.ipynb (`SELECT gid FROM recording`) and MLHD_conflation.ipynb (`select gid, name from recording`)
alastairp
let's move them into your lib/ folder, so that we can do `from lib import mb; recordings = mb.load_recordings_df()` or something
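A possible shape for that helper module (the module and function names come from the chat; everything else is a sketch, not the actual project code):

```python
# lib/mb.py -- hypothetical sketch of the shared loading helpers
import pandas as pd
from sqlalchemy import create_engine

# assumed connection string, matching the one used earlier in this log
DB_URI = "postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db"

def load_recordings_df(with_names: bool = False) -> pd.DataFrame:
    """Load recording MBIDs (optionally with names) from MusicBrainz."""
    query = "SELECT gid, name FROM recording" if with_names else "SELECT gid FROM recording"
    engine = create_engine(DB_URI)
    with engine.connect() as conn:
        return pd.read_sql(query, conn)
```

Usage would then be `from lib import mb; recordings = mb.load_recordings_df()`, as suggested above.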
Pratha-Fish
sounds great! Added to the to-do list
alastairp
perfect
alastairp
are we still checking items against the track table?
Pratha-Fish
alastairp: we did one check (with 370k rows I think) and found that no track-MBID shows up in the recording-MBID column. So at least right now, rec-MBIDs are not being checked against the track table from MB
alastairp
OK. I wonder if we can run this over the entire dataset
alastairp
I see your pending item in your doc for June 17:
alastairp
> Analyze a larger sample space to verify if any track-MBIDs exist in the rec-MBID column.
Pratha-Fish
Ah right
alastairp
I think that we should prioritise this step this week, so that we can write it up in our document and remove this code from the notebook to simplify it
Pratha-Fish
definitely
alastairp
you were looking at the speed of loading data, right? let's take a look at that now
Pratha-Fish
So should I just loop over the complete dataset?
Pratha-Fish
That could take ages, so is there any better way to get around it?
BrainzGit
[bookbrainz-site] MonkeyDo merged pull request #852 (master…async-language-select): Feat(language-select): Asynchronously load options in language select https://github.com/metabrainz/bookbrainz-site/pul…
alastairp
can you explain to me what you tried? I recall you had a few ideas with numba and other tools
alastairp
well, let's try it! it could be interesting to get a general idea about how long it takes to read the entire dataset
Pratha-Fish
Right, I tried using Numba a little bit, but it looks like it doesn't have good support for higher-level functions from libraries like pandas, etc
Pratha-Fish
So we could write numpy functions or plain Python functions with lists and such, and use numba decorators with them
Pratha-Fish
But I haven't found it very reliable, since it barely worked for any of the functions that I've tried it with
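For illustration, the kind of thing that does compile under numba (a plain loop over a numpy array) versus what doesn't (pandas calls); this is a generic example, not code from the project:

```python
import numpy as np
from numba import njit

@njit
def count_matches(values: np.ndarray, target: np.int64) -> int:
    # plain loops over numpy arrays compile fine in nopython mode
    count = 0
    for v in values:
        if v == target:
            count += 1
    return count

# count_matches(np.array([1, 2, 2, 3], dtype=np.int64), np.int64(2)) -> 2
# by contrast, decorating a function that calls pd.read_csv with @njit
# fails to compile, matching the experience described above
```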
alastairp
ok
alastairp
and what you're trying to do here is make the `read_files` method quicker?
Pratha-Fish
So the current pandas read_csv method has a ton of bells and whistles that could be slowing it down. I figured writing a tailored reading function for MLHD data could be quicker
Pratha-Fish
So I wrote a simple function to get multiple mlhd txt files (in gzip format) -> extract them -> read the text and split it on the "\t" to get columns
Pratha-Fish
Basically the whole table loaded as nested Python dictionaries
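A minimal sketch of the reader described above (the function name and exact output shape are assumptions):

```python
import gzip

# Hypothetical reconstruction of the tailored MLHD reader: read
# gzipped tab-separated files and split into columns by hand.
def read_mlhd_files(paths: list[str]) -> dict[str, list[list[str]]]:
    tables = {}
    for path in paths:
        with gzip.open(path, "rt") as fh:
            # one row per line, columns split on tabs
            tables[path] = [line.rstrip("\n").split("\t") for line in fh]
    return tables
```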
alastairp
mmm, I tell you what - I'm just testing this myself
from what I can see, pandas is actually already 2x faster than just reading the file as a csv
alastairp
I'm actually really surprised with this (in a good way!)
Pratha-Fish
yes that's right!
Pratha-Fish
pandas.read_csv is pretty optimized, so I don't really think there's a point in writing another function to tackle that. I did the same tests as you and found my custom csv loading func took ~20s while pandas took ~12s
alastairp
cool, so maybe we shouldn't worry too much about optimising this more
soundandvision joined the channel
Pratha-Fish
Yes, let's skip this one
soundandvision
has anyone else had issues with connecting to IRC here? I use the Kiwi web browser app and it often times out on the first login attempt, then goes through on the second
alastairp
600000 files * 50ms = 500 minutes (~8h), so that's a lot of time spent just reading files, but it's the kind of time where you could just leave it running overnight and come back to the results the following day
alastairp
it's not like it's going to take days or weeks just to read the files
alastairp
soundandvision: timeout with your computer accessing kiwi, or kiwi accessing libera.chat?
Pratha-Fish
alastairp: But that's like 600GB of data, will it even fit in the RAM?
alastairp
Pratha-Fish: there's no reason to load it all into memory at once! just load 1 file, process it - get some results, move on to the next file
Pratha-Fish
ah yes
alastairp
does a pandas dataframe have a fast way of checking if a value is present in a column?
alastairp
I'm wondering what the most efficient way to do this might be
Pratha-Fish
Yes, I'm currently using the .isin() function for that. I heard it's all written in C and is pretty fast
alastairp
cool, so let's do a test right now (if you are free)
soundandvision
(apologies, I'm crap at IRC commands) alastairp: kiwi accessing libera, it would seem
soundandvision
oh hey, that tag worked!
alastairp
soundandvision: sure, this can be a bit complex
alastairp
yes, I got the notification!
alastairp
I wouldn't be surprised if kiwiirc has a limit to the number of connections to libera it can make; it could be related to this
alastairp
Pratha-Fish: I think that we should try and load the track table and track_gid_redirect, then load a single file and do isin to check the recording column against these two tables
soundandvision
ok cool :)
alastairp
if we get that working for 1 file, we can put it in a loop and leave it running for the next 8-10h
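A minimal sketch of that check, under assumptions: MLHD files are gzipped TSVs with a recording-MBID column, the table and column names follow the MusicBrainz schema mentioned above (`track.gid`, `track_gid_redirect.gid`), and the file paths and column names are placeholders:

```python
import glob
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db")

# load the two gid columns to check against
with engine.connect() as conn:
    track_gids = pd.read_sql("SELECT gid FROM track", conn)["gid"]
    redirect_gids = pd.read_sql("SELECT gid FROM track_gid_redirect", conn)["gid"]

# hypothetical path and column layout for the MLHD files
for path in glob.glob("MLHD/*.txt.gz"):
    df = pd.read_csv(path, sep="\t",
                     names=["timestamp", "artist_mbid", "release_mbid", "recording_mbid"])
    hits = df["recording_mbid"].isin(track_gids) | df["recording_mbid"].isin(redirect_gids)
    if hits.any():
        print(path, int(hits.sum()), "recording-MBIDs found in the track tables")
```

Getting this working for one file first, then wrapping the loop around it, matches the plan described above.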
Pratha-Fish
alastairp: exciting
Pratha-Fish
I'll make sure to do some extensive testing this time around too
alastairp
Pratha-Fish: btw, I see in your method `get_null_stats` you say "Number of NOT-null rows in ...", I guess this should be Percentage?
Pratha-Fish
Is it in the MLHD.ipynb file?
alastairp
yes
Pratha-Fish
checking
Pratha-Fish
P.S. sorry if my messages are delivered late. I ran out of mobile data so the latency has shot up a lot
alastairp
no prob. it looks fine from here
Pratha-Fish
also, yes, the get_null_stats function returns output in %
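For reference, a hypothetical reconstruction of what `get_null_stats` presumably computes (only the name and the percentage output are from the chat; the body is an assumption):

```python
import pandas as pd

def get_null_stats(df: pd.DataFrame) -> pd.Series:
    # percentage of non-null rows per column, matching the corrected
    # "Percentage of NOT-null rows in ..." wording discussed above
    return df.notna().mean() * 100
```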