#metabrainz

      • jmp_music_
        @alastairp after eating your lunch can we do a small meeting?
      • 2020-07-30 21247, 2020

      • iliekcomputers
        ishaanshah: np. i personally use the metabrainz repo with a `param/` prefix for my repos
      • 2020-07-30 21252, 2020

      • iliekcomputers
        my branches
      • 2020-07-30 21204, 2020

      • iliekcomputers
        i gave you the extra access so you could push to my artist map branch
      • 2020-07-30 21245, 2020

      • ishaanshah
        Oh, cool thanks :)
      • 2020-07-30 21215, 2020

      • Mr_Monkey
        Is the API endpoint likely to return a count lower than the requested count (other than cases in which the user has less than $count listens)?
      • 2020-07-30 21238, 2020

      • Mr_Monkey
        There's some timescale knowledge I'm missing to understand this
      • 2020-07-30 21246, 2020

      • shivam-kapila
        A sec
      • 2020-07-30 21253, 2020

      • shivam-kapila
        I will send you some links
      • 2020-07-30 21254, 2020

      • Mr_Monkey
        I'm remembering some of it. By default we search listens for 3 ranges (15 days), with no assurance that there aren't any older ones. So basically if I request 25 listens but there is a month gap between two listens, I will not get the whole lot returned.
      • 2020-07-30 21234, 2020

      • shivam-kapila
        yep
      • 2020-07-30 21214, 2020

      • Mr_Monkey
        Yeah, I followed the mechanism up to timescale, which is the part I don't understand much. But I guess it makes sense. So as discussed, there's probably an argument missing for the API endpoint.
      • 2020-07-30 21222, 2020

      • Mr_Monkey
        On the front-end side, I'll need to do the listens count comparison (provided I'm not on the last page [that might have less than $count listens]) and call the API again (automatically, ideally) with the extra arg.
      • 2020-07-30 21238, 2020
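
A minimal sketch of the behaviour discussed above, written in Python for brevity rather than in the front-end's own language. Everything here is an assumption for illustration: `search_listens` is a hypothetical stand-in for the API endpoint, `ranges` stands in for the "extra arg", and each range is assumed to cover 5 days (3 ranges = 15 days, as stated in the chat). `fetch_page` shows the client-side count comparison followed by an automatic retry with a wider search.

```python
from typing import List

RANGE_DAYS = 5  # assumption: the 3 default ranges cover 15 days, so 5 days each
DAY = 24 * 60 * 60


def search_listens(listens: List[int], max_ts: int, count: int, ranges: int = 3) -> List[int]:
    """Hypothetical stand-in for the API endpoint.

    Returns up to `count` listen timestamps older than `max_ts`,
    looking back at most `ranges` * RANGE_DAYS days.
    """
    min_ts = max_ts - ranges * RANGE_DAYS * DAY
    window = [ts for ts in listens if min_ts <= ts < max_ts]
    return sorted(window, reverse=True)[:count]


def fetch_page(listens: List[int], max_ts: int, count: int, max_ranges: int = 48) -> List[int]:
    """Client-side retry: widen the search while the page comes back short."""
    ranges = 3
    page = search_listens(listens, max_ts, count, ranges)
    while len(page) < count and ranges < max_ranges:
        ranges *= 2  # hypothetical extra argument: search more ranges
        page = search_listens(listens, max_ts, count, ranges)
    return page
```

With a month-long gap between listens, the default call returns a short page, while `fetch_page` keeps widening until the page is full or `max_ranges` is reached (the last-page case, where fewer than `count` listens exist at all).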

      • shivam-kapila
        yes
      • 2020-07-30 21241, 2020

      • shivam-kapila
        tha would do
      • 2020-07-30 21246, 2020

      • shivam-kapila
        that*
      • 2020-07-30 21258, 2020

      • Mr_Monkey
        OK, makes sense. Thanks for your help refreshing my memory :)
      • 2020-07-30 21202, 2020

      • shivam-kapila
        we may also add the count check in api
      • 2020-07-30 21208, 2020

      • shivam-kapila
        one more thing
      • 2020-07-30 21220, 2020

      • Mr_Monkey
        Hm. That would probably be better on the API directly
      • 2020-07-30 21202, 2020

      • shivam-kapila
        Can we make the user/<user-name> not fetch the listens even the first time and make a call from frontend itself
      • 2020-07-30 21207, 2020

      • Mr_Monkey
        Why so? Best practice would be to serve the results with the page and save an extra call from the frontend.
      • 2020-07-30 21207, 2020

      • shivam-kapila
        Also we do limit the results
      • 2020-07-30 21242, 2020

      • shivam-kapila
        I just see that most of the newer services serve a template and then make the calls from frontend
      • 2020-07-30 21231, 2020

      • Mr_Monkey
        Not sure what newer services you're referring to
      • 2020-07-30 21245, 2020

      • shivam-kapila
        No strong opinions on this suggestion though
      • 2020-07-30 21205, 2020

      • ruaok
        Not the ideal time, Mr_Monkey
      • 2020-07-30 21243, 2020

      • shivam-kapila
        cheers ruaok
      • 2020-07-30 21244, 2020

      • ruaok
        But I see that shivam-kapila is already helping.
      • 2020-07-30 21255, 2020

      • ruaok
        Thanks!
      • 2020-07-30 21258, 2020

      • shivam-kapila
        heh. Not so much
      • 2020-07-30 21218, 2020

      • Mr_Monkey
        Oh, totally. You answered my questions
      • 2020-07-30 21225, 2020

      • shivam-kapila
        I am almost confusing XD
      • 2020-07-30 21239, 2020

      • Mr_Monkey
        prost ruaok !
      • 2020-07-30 21241, 2020

      • shivam-kapila
        > Also we do limit the results
      • 2020-07-30 21241, 2020

      • shivam-kapila
        So if you want all pages to have close to 25 listens each for consistency, we can do that
      • 2020-07-30 21200, 2020

      • ruaok
        Danke!
      • 2020-07-30 21217, 2020

      • shivam-kapila
        lol I am searching the meanings
      • 2020-07-30 21225, 2020

      • alastairp
        ishaanshah: do you have some time to talk about a few things in the hdfs uploader?
      • 2020-07-30 21231, 2020

      • alastairp
        jmp_music_: hi, I'm here. how are you?
      • 2020-07-30 21200, 2020

      • alastairp
        ruaok: you do know that Mr_Monkey and I make a beer with a ship on the label too? You could have just got some from here
      • 2020-07-30 21214, 2020

      • ishaanshah
        alastairp: sure, give me 5 mins
      • 2020-07-30 21246, 2020

      • ruaok
        You mean my whole escape was for naught??
      • 2020-07-30 21204, 2020

      • alastairp
        if your whole escape was to find beer with a ship on the label, then yes
      • 2020-07-30 21226, 2020

      • ruaok
        Crap!
      • 2020-07-30 21252, 2020

      • jmp_music_
        @alastairp: Hey! I'm fine! I made some changes over the last days
      • 2020-07-30 21256, 2020

      • ishaanshah
        alastairp: Hey I am up
      • 2020-07-30 21210, 2020

      • jmp_music_
        Finally everything works fine
      • 2020-07-30 21213, 2020

      • ishaanshah
        maybe after your meeting with jmp_music_ ?
      • 2020-07-30 21222, 2020

      • shivam-kapila
        busy alastairp
      • 2020-07-30 21251, 2020

      • jmp_music_
        @alastairp: Do you want to have a short meeting later today so I can update you on the changes?
      • 2020-07-30 21244, 2020

      • alastairp
        jmp_music_: let's do it now
      • 2020-07-30 21200, 2020

      • jmp_music_
        great
      • 2020-07-30 21232, 2020

      • jmp_music_
        well, I finally made every transformation with Pipelines and the prediction issues are solved
      • 2020-07-30 21251, 2020

      • jmp_music_
        thus the code is shortened a lot
      • 2020-07-30 21225, 2020

      • alastairp
        that's great. so we're probably in a position where the new models are basically a drop-in replacement for the existing ones?
      • 2020-07-30 21233, 2020

      • jmp_music_
        exactly
      • 2020-07-30 21234, 2020

      • alastairp
        do you know what the issue was with the prediction?
      • 2020-07-30 21258, 2020

      • jmp_music_
        the `random` library again. There were two shuffled processes in the past. One for the tracks (which were in a list), and one for the labels, which were included in a pandas series
      • 2020-07-30 21224, 2020

      • jmp_music_
        now everything works properly because I do the whole shuffling at the start
      • 2020-07-30 21231, 2020

      • jmp_music_
        and then I split the labels from the tracks
      • 2020-07-30 21232, 2020
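
A minimal sketch of the fix described above, assuming the tracks and labels start out as two parallel sequences. The data and the `shuffle_together` helper are hypothetical; the point is only that shuffling the (track, label) pairs as one unit and splitting afterwards keeps each track attached to its own label, whereas two independent shuffles do not.

```python
import random

# Hypothetical toy data: parallel lists of track ids and class labels.
tracks = ["t1", "t2", "t3", "t4", "t5", "t6"]
labels = ["rock", "jazz", "rock", "pop", "jazz", "pop"]
truth = dict(zip(tracks, labels))


def shuffle_together(tracks, labels, seed=0):
    """Shuffle (track, label) pairs as one unit, then split them apart."""
    pairs = list(zip(tracks, labels))
    random.Random(seed).shuffle(pairs)
    shuffled_tracks, shuffled_labels = zip(*pairs)
    return list(shuffled_tracks), list(shuffled_labels)


shuffled_tracks, shuffled_labels = shuffle_together(tracks, labels)
# Every track still carries its own label after shuffling.
aligned = all(truth[t] == l for t, l in zip(shuffled_tracks, shuffled_labels))
```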

      • alastairp
        cool
      • 2020-07-30 21249, 2020

      • alastairp
        so it was actually returning results for a different item?
      • 2020-07-30 21200, 2020

      • jmp_music_
        yeap
      • 2020-07-30 21220, 2020

      • alastairp
        whoops. good thing that we caught that
      • 2020-07-30 21230, 2020

      • jmp_music_
        I think so :)
      • 2020-07-30 21221, 2020

      • jmp_music_
        Furthermore, now there is a project template, and for each classification problem a different classification config yaml is created
      • 2020-07-30 21247, 2020

      • alastairp
        what do you think is the next step in the project, then?
      • 2020-07-30 21225, 2020

      • jmp_music_
        I just want to finish some logging now, and then proceed to the integration with AB
      • 2020-07-30 21225, 2020

      • alastairp
        looking at your proposal, we had the integration of the new models into the rest of acousticbrainz?
      • 2020-07-30 21240, 2020

      • alastairp
        great. once you've finished with the logging can you make a pull request on the acousticbrainz-server repository to add the code?
      • 2020-07-30 21201, 2020

      • jmp_music_
        yes of course
      • 2020-07-30 21206, 2020

      • alastairp
        let's make a new package for it. Perhaps `acousticbrainz.models`
      • 2020-07-30 21214, 2020

      • jmp_music_
        ok!
      • 2020-07-30 21234, 2020

      • alastairp
        we don't have an `acousticbrainz` package at the moment, but we want to move stuff into it eventually, so we could make this as the first thing that uses it
      • 2020-07-30 21207, 2020

      • alastairp
        thinking into the future, let's add the sklearn stuff into a `sklearn` submodule, so that if we have other libraries (tensorflow, etc), we can put them in there as well
      • 2020-07-30 21215, 2020

      • jmp_music_
        Do you think that we could make it as an `acousticbrainz` library?
      • 2020-07-30 21228, 2020

      • alastairp
        as something that is installable with pip?
      • 2020-07-30 21235, 2020

      • jmp_music_
        yes
      • 2020-07-30 21242, 2020

      • alastairp
        I don't think that's important at the moment
      • 2020-07-30 21224, 2020

      • jmp_music_
        so I have to move all of the code into the AB repository?
      • 2020-07-30 21229, 2020

      • jmp_music_
        am i right?
      • 2020-07-30 21231, 2020

      • alastairp
        yes.
      • 2020-07-30 21241, 2020

      • alastairp
        we normally keep all code for each project in the same repository
      • 2020-07-30 21246, 2020

      • alastairp
        listenbrainz does the same
      • 2020-07-30 21256, 2020

      • jmp_music_
        sounds good
      • 2020-07-30 21207, 2020

      • alastairp
        we'll have a new 'acousticbrainz/models/sklearn' folder
      • 2020-07-30 21231, 2020

      • alastairp
        do you want to talk briefly about the integration?
      • 2020-07-30 21208, 2020

      • alastairp
        I can give you an overview about how the system currently works, and then we could plan the first part of this
      • 2020-07-30 21239, 2020

      • jmp_music_
        yes of course. Because I want to start thinking about how it predicts the class each instance belongs to
      • 2020-07-30 21219, 2020

      • alastairp
        OK
      • 2020-07-30 21228, 2020

      • jmp_music_
        for example, in the sklearn prediction tool, I get a low-level instance from the API
      • 2020-07-30 21242, 2020

      • jmp_music_
        and I predict the class it belongs to
      • 2020-07-30 21250, 2020

      • jmp_music_
        here is an example
      • 2020-07-30 21254, 2020

      • alastairp
        at the moment we have two related, but slightly distinct parts to do with machine learning in acousticbrainz
      • 2020-07-30 21255, 2020

      • alastairp
        the first is the high-level extractor: https://github.com/metabrainz/acousticbrainz-serv…
      • 2020-07-30 21218, 2020

      • alastairp
        this uses essentia and gaia to do prediction
      • 2020-07-30 21238, 2020

      • alastairp
        we have a script that runs that looks at the `lowlevel` database table and the `highlevel` database table
      • 2020-07-30 21258, 2020

      • jmp_music_
        ok! So I have to replace gaia over there
      • 2020-07-30 21222, 2020

      • alastairp
        if there is no item in the highlevel table for a specific row in the lowlevel table, we get the lowlevel data from the database, perform the prediction, and then write the highlevel data
      • 2020-07-30 21247, 2020
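
The workflow just described can be sketched roughly as follows. This is an illustration only: plain dictionaries stand in for the `lowlevel` and `highlevel` database tables, and `fake_predict` is a hypothetical placeholder for the essentia/gaia prediction step, not the real extractor.

```python
from typing import Callable, Dict


def fill_missing_highlevel(
    lowlevel: Dict[int, dict],
    highlevel: Dict[int, dict],
    predict: Callable[[dict], dict],
) -> int:
    """For each lowlevel row without a highlevel row, predict and store it.

    Returns the number of highlevel rows that were written.
    """
    written = 0
    for row_id, features in lowlevel.items():
        if row_id in highlevel:
            continue  # this lowlevel row already has highlevel data
        highlevel[row_id] = predict(features)
        written += 1
    return written


def fake_predict(features: dict) -> dict:
    """Hypothetical stand-in for the real prediction step."""
    return {"danceable": features["bpm"] > 110}


# Toy stand-ins for the two tables, keyed by lowlevel row id.
lowlevel = {1: {"bpm": 120}, 2: {"bpm": 90}, 3: {"bpm": 128}}
highlevel = {1: {"danceable": True}}
written = fill_missing_highlevel(lowlevel, highlevel, fake_predict)
```

Running the function a second time writes nothing, since every lowlevel row then has a matching highlevel entry.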

      • alastairp
        however, I believe that we can take advantage of this project to improve this workflow
      • 2020-07-30 21209, 2020

      • alastairp
        recall that we have the functionality to build datasets in acousticbrainz: https://acousticbrainz.org/datasets/create
      • 2020-07-30 21223, 2020

      • jmp_music_
        hmm ok
      • 2020-07-30 21247, 2020

      • alastairp
        when you have built a dataset, we have a button called "Evaluate", which submits it to have a model trained with gaia
      • 2020-07-30 21204, 2020

      • alastairp
        I would like to set up a complete end-to-end pipeline that allows us to build a dataset, construct a model with sklearn, perform an evaluation with a separate subset of the acousticbrainz database, and then finally promote a model as live if we decide that it works well, so that it shows on the website and is available in the API
      • 2020-07-30 21240, 2020

      • jmp_music_
        however, the models built by dataset evaluations are not the ones used for the high-level predictions, right?
      • 2020-07-30 21257, 2020

      • alastairp
        it would be great to be able to do this completely through the website. almost all of these components exist as individual parts, I think that now would be a great time to integrate them together
      • 2020-07-30 21214, 2020

      • jmp_music_
        I understand
      • 2020-07-30 21228, 2020

      • alastairp
        I sent you our paper about cross-collection evaluation, right?
      • 2020-07-30 21204, 2020

      • jmp_music_
        yes yes
      • 2020-07-30 21233, 2020

      • alastairp
        at the moment we have an accuracy of the model made with sklearn, using cross-validation train/test splits
      • 2020-07-30 21253, 2020

      • jmp_music_
        right
      • 2020-07-30 21205, 2020

      • alastairp
        however we would like to also calculate a second accuracy, using a second dataset
      • 2020-07-30 21218, 2020

      • alastairp
        for example, you and I both make a dataset for electronic/not electronic
      • 2020-07-30 21238, 2020

      • alastairp
        you make a model with your dataset and you get 89% accuracy
      • 2020-07-30 21229, 2020

      • alastairp
        then you use your model to compute predictions on the items in my dataset, and see how many of the predictions match my ground-truth
      • 2020-07-30 21248, 2020
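
A minimal sketch of this second accuracy, assuming a dataset is a mapping from item id to a (features, ground-truth label) pair. The `synth` feature, the threshold model, and the toy dataset are all hypothetical and only illustrate scoring one person's model against another person's ground truth.

```python
from typing import Callable, Dict, Tuple

Dataset = Dict[str, Tuple[dict, str]]  # item id -> (features, ground-truth label)


def accuracy(model: Callable[[dict], str], dataset: Dataset) -> float:
    """Fraction of items whose predicted class matches the ground truth."""
    correct = sum(1 for features, label in dataset.values() if model(features) == label)
    return correct / len(dataset)


def my_model(features: dict) -> str:
    """Hypothetical model trained on the first dataset."""
    return "electronic" if features["synth"] > 0.5 else "not_electronic"


# Hypothetical second dataset, labelled independently by someone else.
other_dataset: Dataset = {
    "a": ({"synth": 0.9}, "electronic"),
    "b": ({"synth": 0.2}, "not_electronic"),
    "c": ({"synth": 0.6}, "not_electronic"),
    "d": ({"synth": 0.8}, "electronic"),
}

cross_accuracy = accuracy(my_model, other_dataset)
```

Here the model disagrees with the second dataset on item `c` only, so three of the four predictions match the other person's ground truth.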

      • jmp_music_
        aha, i understand
      • 2020-07-30 21211, 2020

      • alastairp
        we actually have this functionality. it's called dataset contests, however it's not fully merged
      • 2020-07-30 21219, 2020

      • alastairp
        I will work on merging it in the next few weeks
      • 2020-07-30 21235, 2020

      • alastairp
        but the idea would be to modify this existing code so that it works with either gaia or sklearn
      • 2020-07-30 21257, 2020

      • jmp_music_
        ok!
      • 2020-07-30 21215, 2020

      • alastairp
        OK, that's the first part
      • 2020-07-30 21218, 2020

      • alastairp
        the second part:
      • 2020-07-30 21248, 2020

      • alastairp
        we have this table in the database called 'model'
      • 2020-07-30 21211, 2020

      • alastairp
        It currently has all of the gaia models (genre, mood, instrumental, electronic, etc)
      • 2020-07-30 21203, 2020

      • alastairp
        we should update this table to include some additional information - for example, the tool that was used to create the model
      • 2020-07-30 21220, 2020

      • jmp_music_
        that will be gaia or sklearn
      • 2020-07-30 21224, 2020

      • alastairp
        yes
      • 2020-07-30 21253, 2020

      • alastairp
        see that when we store highlevel data, we do it in a number of different tables: https://github.com/metabrainz/acousticbrainz-serv…
      • 2020-07-30 21209, 2020

      • alastairp
        see the highlevel_model table. This is the prediction for a single model. So for 1 lowlevel item, we will have 1 highlevel item, and 18 highlevel_model items (one for each model)
      • 2020-07-30 21232, 2020

      • jmp_music_
        aha ok!
      • 2020-07-30 21203, 2020

      • jmp_music_
        I think I understand
      • 2020-07-30 21216, 2020

      • alastairp
        so when we add a new row to the model table, we should have a script which can find all of the lowlevel items that don't have a prediction for that model, and then compute the prediction, and add a row to the highlevel_model table
      • 2020-07-30 21226, 2020
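
A rough sketch of such a backfill script, using an in-memory SQLite database so that it runs on its own. The schema is a simplified assumption, not the real acousticbrainz-server schema: only the table names `lowlevel`, `model`, and `highlevel_model` come from the discussion, while the columns (including the `tool` column for "gaia" or "sklearn") and the helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified, hypothetical schema for illustration only.
    CREATE TABLE lowlevel (id INTEGER PRIMARY KEY);
    CREATE TABLE model (id INTEGER PRIMARY KEY, name TEXT, tool TEXT);
    CREATE TABLE highlevel_model (
        lowlevel_id INTEGER REFERENCES lowlevel(id),
        model_id INTEGER REFERENCES model(id),
        prediction TEXT
    );
    INSERT INTO lowlevel (id) VALUES (1), (2), (3);
    INSERT INTO model (id, name, tool) VALUES (1, 'genre', 'gaia'), (2, 'genre', 'sklearn');
    INSERT INTO highlevel_model VALUES (1, 1, 'rock'), (2, 1, 'jazz'), (3, 1, 'pop'), (1, 2, 'rock');
    """
)


def lowlevel_missing_prediction(conn, model_id):
    """Return ids of lowlevel rows with no highlevel_model row for `model_id`."""
    rows = conn.execute(
        """
        SELECT ll.id
          FROM lowlevel ll
          LEFT JOIN highlevel_model hm
            ON hm.lowlevel_id = ll.id AND hm.model_id = ?
         WHERE hm.lowlevel_id IS NULL
         ORDER BY ll.id
        """,
        (model_id,),
    ).fetchall()
    return [row[0] for row in rows]


def backfill(conn, model_id, predict):
    """Compute and store a prediction for every row the model has not seen."""
    missing = lowlevel_missing_prediction(conn, model_id)
    conn.executemany(
        "INSERT INTO highlevel_model (lowlevel_id, model_id, prediction) VALUES (?, ?, ?)",
        [(ll_id, model_id, predict(ll_id)) for ll_id in missing],
    )
    return len(missing)
```

In this toy data the gaia model (id 1) already has a prediction for every lowlevel row, while the newly added sklearn model (id 2) is missing rows 2 and 3, so `backfill(conn, 2, predict)` would write two rows.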

      • alastairp
        then this data will appear on the API
      • 2020-07-30 21228, 2020

      • jmp_music_
        ok
      • 2020-07-30 21242, 2020

      • jmp_music_
        Can I ask something
      • 2020-07-30 21242, 2020

      • jmp_music_
        ?
      • 2020-07-30 21248, 2020

      • alastairp
        OK, that's the whole overview. I'm not sure if we will have enough time to finish it this summer, but I wanted you to know the full cycle
      • 2020-07-30 21251, 2020

      • alastairp
        absolutely
      • 2020-07-30 21221, 2020

      • jmp_music_
        Where should I save the .pkl models of the transformation pipelines?
      • 2020-07-30 21231, 2020

      • jmp_music_
        (gaussianize, normalize, etc.)
      • 2020-07-30 21242, 2020

      • alastairp
        this is the script that currently runs the gaia model training