#metabrainz

      • alastairp
        yes, I guess there's nothing stopping us from making a mapping during analysis and applying it. I guess the idea is that someone has to set it, so is that us or the submitter?
      • outsidecontext
        I see
      • alastairp
        I think that I didn't really foresee people trying to take the url of the stream that they're playing from and trying to derive this field from it
      • I was thinking of it more like a java package name. it "looks like" a domain name, but is really just supposed to be a stable identifier that isn't plain text
      • but then I can see that it probably has to be hard-coded in all cases; especially after seeing the example from webscrobbler, I suspect they have too many variations to derive it automatically from the metadata they currently have
      • monkey
        get first string in the `matches` array in the connectors.js file -> strip characters like `/` and `*` -> strip leading `www.` -> strip any leading or trailing `.` -> canonical domain?
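        A minimal Python sketch of that manipulation, assuming match patterns like the ones in webscrobbler's connectors.js (the function name is illustrative):

        ```python
        import re

        def derive_domain(match_pattern: str) -> str:
            # strip wildcard/scheme characters like *, : and /
            domain = re.sub(r"[*:/]", "", match_pattern)
            # strip a leading "www."
            if domain.startswith("www."):
                domain = domain[4:]
            # strip any leading or trailing "." left by wildcard subdomains
            return domain.strip(".")

        print(derive_domain("*://*.freemusicarchive.org/*"))  # freemusicarchive.org
        ```

        Note that path segments survive this pruning, so a pattern like `*://www.1001tracklists.com/tracklist/*` (hypothetical) would come out as `1001tracklists.comtracklist`, as seen below.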
      • alastairp
        I think that would be possible, though it does mean that the order of their matches array suddenly becomes significant
      • leading . is problematic too as you don't know if the tld has 1 or 2 parts
      • monkey
        I meant for cases like `*://*.freemusicarchive.org/*`, where stripping the `*`, `:` and `/` characters leaves you with `.freemusicarchive.org`.
      • With a leading .
      • alastairp
        ah right, a single leading ., got it
      • so maybe we could suggest that they derive it automatically in the case of a single `matches` entry?
      • monkey
        But to be honest this feels hacky indeed.
      • My preference would be hardcoding like they do the id and label. Let's wait and see what they think.
      • alastairp
        one thought I had was to have a `primary_match: "freemusicarchive.org"` field which is just a single domain, and is used for the service domain (they could automatically add the *s to it), and if there are multiple other matches have a separate `additional_matches`
      • again, lots of additional work for them, and it gives the primary_match field two semantic tasks. in which case maybe a dedicated `music_service` field would be more explicit
      • monkey
        A quick application of my suggested manipulation of the first `matches` string for all connectors:
      • Hmm, still a couple of mistakes here.
      • alastairp
        1001tracklists.comtracklist
      • but that looks like just a mistake in pruning, you could be a bit more aggressive
      • monkey
        Even with proper pruning we're still left with a couple of subdomains:
      • Such as 'daily.bandcamp.com', which has a different connector from 'bandcamp.com'
      • (and also more mistakes on my part in there, it looks like. Gonna stop here, I don't think this is reliable)
      • This one doesn't have a TLD at all.
      • zas
        alastairp: ftp.eu has no https?
      • the redirect works, but it redirects http -> http
      • should I change it to http -> https?
      • alastairp
        zas: hmm, I wasn't aware of that
      • zas
        ?
      • alastairp
        It sounds good to me to add a certificate and redirect for this
      • zas
        to me, it is better to have https on ftp.eu.
      • alastairp
        👍
      • zas
        why ftp.eu btw? you could use data.metabrainz.org
      • alastairp
        oh, that was probably my mistake. I just went to the MB download page and picked a URL :(
      • yes, data.metabrainz.org makes much more sense
      • zas
        ok, done
      • but since it is a 301 ...
      • alastairp
        maybe some people might have that redirect cached... but if it was only there for 5 minutes then I wouldn't worry. this redirect is only in place for old URLs anyway; the links that I'll create will point to data.metabrainz.org
      • yuzie joined the channel
      • yuzie has quit
      • Pratha-Fish
        alastairp: hi, it's completely fine, let's do it whenever you're free
      • CatQuest: Makes sense 🧠
      • alastairp
        Pratha-Fish: hi, if you're around let's do it
      • Pratha-Fish
        sure
      • alastairp
        cool, I'm just pulling up your doc + code
      • Pratha-Fish
        okie
      • alastairp
        cool, code looks really good so far just from a look at it
      • let me check it out on wolf and run it myself
      • so it looks like we have in MLHD.ipynb a basic analysis of the mbids in a subset of the data, in MLHD_conflation.ipynb we look up metadata and build a new csv file, and in MLHD_conflation_mapping.ipynb we check that csv file against the mapping API, generating a new file for comparison. is that right?
      • Pratha-Fish
        yes that's right!
      • Pratha-Fish comes back from AFK
      • alastairp
        after installing from requirements.txt, I can't run `jupyter notebook`, do you know why?
      • Pratha-Fish
        maybe because a few requirements are missing.
      • I'll update it real quick. gimme a sec
      • From the looks of it, requirements.txt is already up to date!
      • alastairp
        are you using notebook (the web interface), or are you using it from vscode?
      • I see that jupyter-core is installed, but not notebook
      • Pratha-Fish
        I am using it on vscode actually maybe that's why
      • alastairp
        sure, ok. maybe you can `pip install notebook` and re-update requirements. not a big deal, I've done it myself on my version
      • Pratha-Fish
        yes you're right jupyter notebook wasn't installed in this newer edition of the project
      • Updating it RN
      • alastairp: done
      • alastairp
        awesome
      • did you see this warning in MLHD.ipynb?
      • > UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested, please consider using SQLAlchemy
      • Pratha-Fish
        Right
      • pandas officially supports only SQLAlchemy connectable objects, but a psycopg2 connection worked too in the versions we're using, and it looked lighter than SQLAlchemy, so I just went ahead with psycopg2
      • It shouldn't be hard to replace it tho
      • alastairp
        yes, no problem. we use sqlalchemy in all of our other projects too, it's a small thing but would be nice if we can fix the warnings
      • but yes, it seems that there is no difference for our cases
      • Pratha-Fish
        Sure, I'll add that one to the to-do list
      • alastairp
        `from sqlalchemy import create_engine; engine = create_engine('postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db'); conn = engine.connect()`
      • just changed it in my version. works fine
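        Expanded into a fuller example of the same change (the query is one already used in the notebooks; everything else matches the one-liner above):

        ```python
        import pandas as pd
        from sqlalchemy import create_engine

        # a SQLAlchemy connectable instead of a raw psycopg2 connection,
        # which is what the pandas UserWarning above is about
        engine = create_engine("postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db")

        with engine.connect() as conn:
            recordings = pd.read_sql("SELECT gid, name FROM recording", conn)
        ```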
      • Pratha-Fish
        got it
      • I'll change it ASAP
      • alastairp
        these db load methods look really useful for future work. in fact, I see that we already have similar versions in MLHD.ipynb (`SELECT gid FROM recording`) and MLHD_conflation.ipynb (`select gid, name from recording`)
      • let's move them into your lib/ folder, so that we can do `from lib import mb; recordings = mb.load_recordings_df()` or something
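        A rough sketch of what that module could look like (layout and function name as suggested above; the connection string is the one from earlier):

        ```python
        # lib/mb.py -- shared loaders for the MB tables used in the notebooks
        import pandas as pd
        from sqlalchemy import create_engine

        engine = create_engine("postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db")

        def load_recordings_df() -> pd.DataFrame:
            """Recording MBIDs and names, as queried in MLHD_conflation.ipynb."""
            with engine.connect() as conn:
                return pd.read_sql("SELECT gid, name FROM recording", conn)
        ```

        and then in a notebook: `from lib import mb; recordings = mb.load_recordings_df()`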
      • Pratha-Fish
        sounds great! Added to the to-do list
      • alastairp
        perfect
      • are we still checking items against the track table?
      • Pratha-Fish
        alastairp: we did one check (with 370k rows I think) and found that no track MBID shows up in the recording-MBID column. So at least in that sample, the rec-MBID column doesn't contain MBIDs from the MB track table
      • alastairp
        OK. I wonder if we can run this over the entire dataset
      • I see your pending item in your doc for June 17:
      • > Analyze a larger sample space to verify if any track-MBIDs exist in the rec-MBID column.
      • Pratha-Fish
        Ah right
      • alastairp
        I think that we should prioritise this step this week, so that we can write it up in our document and remove this code from the notebook to simplify it
      • Pratha-Fish
        definitely
      • alastairp
        you were looking at the speed of loading data, right? let's take a look at that now
      • Pratha-Fish
        So should I just loop over the complete dataset?
      • That could take ages, so is there any better way to get around it?
      • BrainzGit
        [bookbrainz-site] 14MonkeyDo merged pull request #852 (03master…async-language-select): Feat(language-select): Asynchronously load options in language select https://github.com/metabrainz/bookbrainz-site/p...
      • alastairp
        can you explain to me what you tried? I recall you had a few ideas with numba and other tools
      • well, let's try it! it could be interesting to get a general idea about how long it takes to read the entire dataset
      • Pratha-Fish
        Right, I tried using Numba a little bit, but it looks like it doesn't have good support for higher-level functions from libraries like pandas, etc
      • So we could write numpy functions or plain python functions with lists, and use numba decorators with those
      • But I haven't found it very reliable since it barely worked for any of the functions that I've tried it with
      • alastairp
        ok
      • and what you're trying to do here is make the `read_files` method quicker?
      • Pratha-Fish
        So the current pandas read_csv method has a ton of bells and whistles that could be slowing it down. So I figured writing a tailored reading function for MLHD data could be quicker
      • So I wrote a simple function to get multiple MLHD txt files (in gzip format) -> extract them -> read the text and split it on `\t` to get columns
      • Basically the whole table loaded as python nested dictionaries
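        A sketch of that kind of hand-rolled reader, next to the pandas call it was compared against (assuming the files are gzipped TSVs):

        ```python
        import gzip

        def read_mlhd_file(path: str) -> list[list[str]]:
            """Naive reader: gunzip, then split each line on tab."""
            with gzip.open(path, mode="rt") as f:
                return [line.rstrip("\n").split("\t") for line in f]

        # pandas equivalent; read_csv decompresses .gz files itself
        # import pandas as pd
        # df = pd.read_csv(path, sep="\t", header=None)
        ```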
      • alastairp
        mmm, I tell you what - I'm just testing this myself
      • from what I can see, pandas is actually already 2x faster than just reading the file as a csv
      • I'm actually really surprised with this (in a good way!)
      • Pratha-Fish
        yes that's right!
      • pandas.read_csv is pretty optimized, so I don't really think there's a point in writing another function to tackle that. I did the same tests as you and found my custom csv loading func took ~20s while pandas took ~12s
      • alastairp
        cool, so maybe we shouldn't worry too much about optimising this more
      • soundandvision joined the channel
      • Pratha-Fish
        Yes, let's skip this one
      • soundandvision
        has anyone else had issues connecting to IRC here? I use the kiwi web browser app and it often times out on the first login attempt, then goes through on the second
      • alastairp
        600000 files * 50ms = 500 minutes (~8h), so that's a lot of time spent just reading files, but it's the kind of time-period that you could just leave it running overnight and come back to the results the following day
      • it's not like it's going to take days or weeks just to read the files
      • soundandvision: timeout with your computer accessing kiwi, or kiwi accessing libera.chat?
      • Pratha-Fish
        alastairp: But that's like 600GB of data, will it even fit in RAM?
      • alastest joined the channel
      • alastest has quit
      • alastairp
        Pratha-Fish: there's no reason to load it all into memory at once! just load 1 file, process it - get some results, move on to the next file
      • Pratha-Fish
        ah yes
      • alastairp
        does a pandas dataframe have a fast way of checking if a value is present in a column?
      • I'm wondering what the most efficient way to do this might be
      • Pratha-Fish
        Yes, I'm currently using the .isin() function for that. I heard it's implemented in C and is pretty fast
      • alastairp
        cool, so let's do a test right now (if you are free)
      • soundandvision
        (apologies, I'm crap at IRC commands) alastairp: kiwi accessing libera, it would seem
      • oh hey that tag worked!
      • alastairp
        soundandvision: sure, this can be a bit complex
      • yes, I got the notification!
      • I wouldn't be surprised if kiwiirc has a limit to the number of connections it can make to libera; it could be related to this
      • Pratha-Fish: I think that we should try and load track table and track_gid_redirect, then load a single file and do isin to check the recording column against these two tables
      • soundandvision
        ok cool :)
      • alastairp
        if we get that working for 1 file, we can put it in a loop and leave it running for the next 8-10h
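        A sketch of that whole check, assuming the MLHD files are gzipped TSVs (the column names here are illustrative; the table queries follow the notebooks):

        ```python
        import os
        import pandas as pd
        from sqlalchemy import create_engine

        engine = create_engine("postgresql://musicbrainz:musicbrainz@localhost/musicbrainz_db")

        # load the MBID columns of both tables once, as plain sets;
        # cast to text so they compare against the strings read from the files
        with engine.connect() as conn:
            track_gids = set(pd.read_sql("SELECT gid::text FROM track", conn)["gid"])
            redirect_gids = set(pd.read_sql("SELECT gid::text FROM track_gid_redirect", conn)["gid"])

        def count_track_mbids(path: str) -> int:
            """Count values in the recording column that are actually track MBIDs."""
            df = pd.read_csv(path, sep="\t", header=None,
                             names=["timestamp", "artist_mbid", "release_mbid", "recording_mbid"])
            hits = df["recording_mbid"].isin(track_gids) | df["recording_mbid"].isin(redirect_gids)
            return int(hits.sum())

        # once one file works, loop over the dataset and leave it running overnight
        for root, _dirs, files in os.walk("mlhd/"):
            for name in files:
                path = os.path.join(root, name)
                n = count_track_mbids(path)
                if n:
                    print(path, n)
        ```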
      • Pratha-Fish
        alastairp: exciting
      • I'll make sure to do some extensive testing this time around too
      • alastairp
        Pratha-Fish: btw, I see in your method `get_null_stats` you say "Number of NOT-null rows in ...", I guess this should be Percentage?
      • Pratha-Fish
        Is it in the MLHD.ipynb file?
      • alastairp
        yes
      • Pratha-Fish
        checking
      • P.S. sorry if my messages are delivered late. I ran out of mobile data, so the latency has shot up a lot
      • alastairp
        no prob. it looks fine from here
      • Pratha-Fish
        also, yes the get_null_stats function returns output in %
      • alastairp
        Pratha-Fish: check something like this: https://gist.github.com/alastair/0dcf5a3b670f83...
      • the time_stats() function is what I linked to you the other day
      • Pratha-Fish
        Wow that os.walk function is something new
      • alastairp
        :)
      • Pratha-Fish
        I'll try out something with this code ig
      • Also, what exactly does the time.monotonic function do?
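        For reference, time.monotonic() returns a clock that only moves forward (it isn't affected by system clock adjustments), which makes it the safe choice for measuring elapsed time, e.g.:

        ```python
        import time

        start = time.monotonic()
        rows = read_mlhd_file("some_file.txt.gz")  # placeholder: any work being timed
        elapsed = time.monotonic() - start
        print(f"read {len(rows)} rows in {elapsed:.2f}s")
        ```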