#metabrainz

13:23 PM
jmp_music_

@alastairp after eating your lunch can we do a small meeting?

2020-07-30 21247, 2020

13:23 PM
iliekcomputers

ishaanshah: np. i personally use the metabrainz repo with a `param/` prefix for my repos

2020-07-30 21252, 2020

13:23 PM
iliekcomputers

my branches

2020-07-30 21204, 2020

13:24 PM
iliekcomputers

i gave you the extra access so you could push to my artist map branch

2020-07-30 21245, 2020

13:24 PM
ishaanshah

Oh, cool thanks :)

2020-07-30 21215, 2020

13:25 PM
Mr_Monkey

Is the API endpoint likely to return a count lower than the requested count (other than cases in which the user has less than $count listens)?

2020-07-30 21238, 2020

13:25 PM
Mr_Monkey

There's some timescale knowledge I'm missing to understand this

2020-07-30 21246, 2020

13:25 PM
shivam-kapila

A sec

2020-07-30 21253, 2020

13:25 PM
shivam-kapila

I will send you some links

2020-07-30 21254, 2020

13:29 PM
shivam-kapila

Mr_Monkey: https://github.com/metabrainz/listenbrainz-server…

2020-07-30 21216, 2020

13:30 PM
Mr_Monkey

I'm remembering some of it. By default we search listens for 3 ranges (15 days), with no assurance that there aren't any older ones. So basically if I request 25 listens but there is a month gap between two listens, I will not get the whole lot returned.

2020-07-30 21234, 2020

13:30 PM
shivam-kapila

yep

2020-07-30 21214, 2020

13:31 PM
Mr_Monkey

Yeah, I followed the mechanism up to timescale, which is the part I don't understand much. But I guess it makes sense. So as discussed, there's probably an argument missing for the API endpoint.

2020-07-30 21222, 2020

13:32 PM
Mr_Monkey

On the front-end side, I'll need to do the listens count comparison (provided I'm not on the last page [that might have less than $count listens]) and call the API again (automatically, ideally) with the extra arg.

2020-07-30 21238, 2020
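
[Illustrative sketch, not from the channel] The retry described above (compare the number of listens returned with the requested count, then call again with an extra argument) could look roughly like this. The `ranges` argument, the `fetch` callable and the doubling strategy are all assumptions made for illustration; the only figures taken from the discussion are the default of 3 ranges covering 15 days and the page size of 25.

```python
from typing import Callable, List

# Hypothetical signature: fetch(count, ranges) -> listens, newest first.
Fetch = Callable[[int, int], List[dict]]


def fetch_full_page(fetch: Fetch, count: int = 25,
                    default_ranges: int = 3, max_ranges: int = 48) -> List[dict]:
    """Call the endpoint, widening the search window until `count` listens
    come back or the window cannot grow any further (e.g. the last page)."""
    ranges = default_ranges
    listens = fetch(count, ranges)
    while len(listens) < count and ranges < max_ranges:
        ranges = min(ranges * 2, max_ranges)  # assumed strategy: double the window
        listens = fetch(count, ranges)
    return listens


# Toy backend: each range covers 5 days; listens are identified by age in days.
LISTEN_AGES = [1, 2, 40, 41, 42]  # note the gap of more than a month


def fake_fetch(count: int, ranges: int) -> List[dict]:
    window_days = ranges * 5
    hits = [{"age_days": age} for age in LISTEN_AGES if age <= window_days]
    return hits[:count]


page = fetch_full_page(fake_fetch, count=5)
```

With the default window only the two most recent listens are found; the loop widens the window until all five are returned. Doing the same check inside the API, as suggested a few messages later, would move this loop to the server.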

13:32 PM
shivam-kapila

yes

2020-07-30 21241, 2020

13:32 PM
shivam-kapila

tha would do

2020-07-30 21246, 2020

13:32 PM
shivam-kapila

that*

2020-07-30 21258, 2020

13:32 PM
Mr_Monkey

OK, makes sense. Thanks for your help refreshing my memory :)

2020-07-30 21202, 2020

13:33 PM
shivam-kapila

we may also add the count check in api

2020-07-30 21208, 2020

13:33 PM
shivam-kapila

one more thing

2020-07-30 21220, 2020

13:33 PM
Mr_Monkey

Hm. That would probably be better on the API directly

2020-07-30 21202, 2020

13:35 PM
shivam-kapila

Can we make the user/<user-name> not fetch the listens even the first time and make a call from frontend itself

2020-07-30 21207, 2020

13:36 PM
Mr_Monkey

Why so? Best practice would be to serve the results with the page and save an extra call from the frontend.

2020-07-30 21207, 2020

13:36 PM
shivam-kapila

Also we do limit the results

2020-07-30 21242, 2020

13:36 PM
shivam-kapila

I just see that most of the newer services serve a template and then make the calls from frontend

2020-07-30 21231, 2020

13:37 PM
Mr_Monkey

Not sure what newer services you're referring to

2020-07-30 21245, 2020

13:39 PM
shivam-kapila

No strong opinions on this suggestion though

2020-07-30 21205, 2020

13:41 PM
ruaok

https://usercontent.irccloud-cdn.com/file/WdSXLpW…

2020-07-30 21220, 2020

13:41 PM
ruaok

Not the ideal time, Mr_Monkey

2020-07-30 21243, 2020

13:41 PM
shivam-kapila

cheers ruaok

2020-07-30 21244, 2020

13:41 PM
ruaok

But I see that shivam-kapila is already helping.

2020-07-30 21255, 2020

13:41 PM
ruaok

Thanks!

2020-07-30 21258, 2020

13:41 PM
shivam-kapila

heh. Not so much

2020-07-30 21218, 2020

13:42 PM
Mr_Monkey

Oh, totally. You answered my questions

2020-07-30 21225, 2020

13:43 PM
shivam-kapila

I am almost confusing XD

2020-07-30 21239, 2020

13:43 PM
Mr_Monkey

prost ruaok !

2020-07-30 21241, 2020

13:43 PM
shivam-kapila

> Also we do limit the results

2020-07-30 21241, 2020

13:43 PM
shivam-kapila

So if you want all pages to have roughly 25 listens in each case for consistency, we can do that

2020-07-30 21200, 2020

13:44 PM
ruaok

Danke!

2020-07-30 21217, 2020

13:44 PM
shivam-kapila

lol I am searching the meanings

2020-07-30 21225, 2020

13:48 PM
alastairp

ishaanshah: do you have some time to talk about a few things in the hdfs uploader?

2020-07-30 21231, 2020

13:48 PM
alastairp

jmp_music_: hi, I'm here. how are you?

2020-07-30 21200, 2020

13:49 PM
alastairp

ruaok: you do know that Mr_Monkey and I make a beer with a ship on the label too? You could have just got some from here

2020-07-30 21214, 2020

13:49 PM
ishaanshah

alastairp: sure, give me 5 mins

2020-07-30 21246, 2020

13:49 PM
ruaok

You mean my whole escape was for naught??

2020-07-30 21204, 2020

13:50 PM
alastairp

if your whole escape was to find beer with a ship on the label, then yes

2020-07-30 21226, 2020

13:50 PM
ruaok

Crap!

2020-07-30 21252, 2020

13:50 PM
jmp_music_

@alastairp: Hey! I'm fine! I made some changes over the last days

2020-07-30 21256, 2020

13:50 PM
ishaanshah

alastairp: Hey I am up

2020-07-30 21210, 2020

13:51 PM
jmp_music_

Finally everything works fine

2020-07-30 21213, 2020

13:51 PM
ishaanshah

maybe after your meeting with jmp_music_ ?

2020-07-30 21222, 2020

13:51 PM
shivam-kapila

busy alastairp

2020-07-30 21251, 2020

13:51 PM
jmp_music_

@alastairp: Do you want to have a short meeting later today so I can tell you about the updates?

2020-07-30 21244, 2020

13:52 PM
alastairp

jmp_music_: let's do it now

2020-07-30 21200, 2020

13:54 PM
jmp_music_

great

2020-07-30 21232, 2020

13:54 PM
jmp_music_

well, I finally made every transformation with Pipelines and the prediction issues are solved

2020-07-30 21251, 2020

13:54 PM
jmp_music_

thus the code is shortened a lot

2020-07-30 21225, 2020

13:55 PM
alastairp

that's great. so we're probably in a position where the new models are basically a drop-in replacement for the existing ones?

2020-07-30 21233, 2020

13:55 PM
jmp_music_

exactly

2020-07-30 21234, 2020

13:55 PM
alastairp

do you know what the issue was with the prediction?

2020-07-30 21258, 2020

13:56 PM
jmp_music_

the `random` library again. Previously there were two separate shuffling steps: one for the tracks (which were in a list), and one for the labels, which were in a pandas Series

2020-07-30 21224, 2020

13:57 PM
jmp_music_

now everything works properly because I do the whole shuffling at the start

2020-07-30 21231, 2020

13:57 PM
jmp_music_

and then I split the labels from the tracks

2020-07-30 21232, 2020
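
[Illustrative sketch, not from the channel] The fix described above, shuffling once at the start and only then splitting labels from tracks, can be sketched as follows. The function name and the sample data are hypothetical; the real code works with a list of tracks and a pandas Series of labels.

```python
import random


def shuffle_together(tracks, labels, seed=0):
    """Shuffle tracks and labels as pairs so each track keeps its own label."""
    pairs = list(zip(tracks, labels))
    random.Random(seed).shuffle(pairs)  # one shuffle over the pairs
    shuffled_tracks = [track for track, _ in pairs]
    shuffled_labels = [label for _, label in pairs]
    return shuffled_tracks, shuffled_labels


tracks = ["track_a", "track_b", "track_c", "track_d"]
labels = ["rock", "jazz", "rock", "pop"]
truth = dict(zip(tracks, labels))

s_tracks, s_labels = shuffle_together(tracks, labels)
# Every shuffled track still lines up with its original label.
aligned = all(truth[t] == l for t, l in zip(s_tracks, s_labels))
```

Shuffling the two collections independently breaks this alignment, which matches the symptom confirmed a few messages later: predictions were being returned for a different item.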

13:57 PM
alastairp

cool

2020-07-30 21249, 2020

13:57 PM
alastairp

so it was actually returning results for a different item?

2020-07-30 21200, 2020

13:58 PM
jmp_music_

yeap

2020-07-30 21220, 2020

13:58 PM
alastairp

whoops. good thing that we caught that

2020-07-30 21230, 2020

13:58 PM
jmp_music_

I think so :)

2020-07-30 21221, 2020

13:59 PM
jmp_music_

Furthermore, now there is a project template, and for each classification problem a different classification config yaml is created

2020-07-30 21247, 2020

13:59 PM
alastairp

what do you think is the next step in the project, then?

2020-07-30 21225, 2020

14:00 PM
jmp_music_

I just want to finish some logging now, and then proceed to the integration with AB

2020-07-30 21225, 2020

14:00 PM
alastairp

looking at your proposal, we had the integration of the new models into the rest of acousticbrainz?

2020-07-30 21240, 2020

14:01 PM
alastairp

great. once you've finished with the logging can you make a pull request on the acousticbrainz-server repository to add the code?

2020-07-30 21201, 2020

14:03 PM
jmp_music_

yes of course

2020-07-30 21206, 2020

14:03 PM
alastairp

let's make a new package for it. Perhaps `acousticbrainz.models`

2020-07-30 21214, 2020

14:03 PM
jmp_music_

ok!

2020-07-30 21234, 2020

14:03 PM
alastairp

we don't have an `acousticbrainz` package at the moment, but we want to move stuff into it eventually, so we could make this as the first thing that uses it

2020-07-30 21207, 2020

14:04 PM
alastairp

thinking into the future, let's add the sklearn stuff into a `sklearn` submodule, so that if we have other libraries (tensorflow, etc), we can put them in there as well

2020-07-30 21215, 2020

14:04 PM
jmp_music_

Do you think that we could make it as an `acousticbrainz` library?

2020-07-30 21228, 2020

14:04 PM
alastairp

as something that is installable with pip?

2020-07-30 21235, 2020

14:04 PM
jmp_music_

yes

2020-07-30 21242, 2020

14:04 PM
alastairp

I don't think that's important at the moment

2020-07-30 21224, 2020

14:05 PM
jmp_music_

so I have to move all of the code into the AB repository?

2020-07-30 21229, 2020

14:05 PM
jmp_music_

am i right?

2020-07-30 21231, 2020

14:05 PM
alastairp

yes.

2020-07-30 21241, 2020

14:05 PM
alastairp

we normally keep all code for each project in the same repository

2020-07-30 21246, 2020

14:05 PM
alastairp

listenbrainz does the same

2020-07-30 21256, 2020

14:05 PM
jmp_music_

sounds good

2020-07-30 21207, 2020

14:06 PM
alastairp

so in here: https://github.com/metabrainz/acousticbrainz-serv…

2020-07-30 21218, 2020

14:06 PM
alastairp

we'll have a new 'acousticbrainz/models/sklearn' folder

2020-07-30 21231, 2020

14:06 PM
alastairp

do you want to talk briefly about the integration?

2020-07-30 21208, 2020

14:07 PM
alastairp

I can give you an overview about how the system currently works, and then we could plan the first part of this

2020-07-30 21239, 2020

14:07 PM
jmp_music_

yes of course. Because I want to start thinking about how it predicts the class each instance belongs to

2020-07-30 21219, 2020

14:08 PM
alastairp

OK

2020-07-30 21228, 2020

14:08 PM
jmp_music_

for example, in the sklearn prediction tool, I get a low-level instance from the API

2020-07-30 21242, 2020

14:08 PM
jmp_music_

and I predict the class it belongs to

2020-07-30 21250, 2020

14:08 PM
jmp_music_

here is an example

2020-07-30 21254, 2020

14:08 PM
alastairp

at the moment we have two related, but slightly distinct parts to do with machine learning in acousticbrainz

2020-07-30 21255, 2020

14:08 PM
jmp_music_

https://www.irccloud.com/pastebin/CHRp9osW/

2020-07-30 21209, 2020

14:09 PM
alastairp

the first is the high-level extractor: https://github.com/metabrainz/acousticbrainz-serv…

2020-07-30 21218, 2020

14:09 PM
alastairp

this uses essentia and gaia to do prediction

2020-07-30 21238, 2020

14:09 PM
alastairp

we have a script that runs that looks at the `lowlevel` database table and the `highlevel` database table

2020-07-30 21258, 2020

14:09 PM
jmp_music_

ok! So I have to replace gaia over there

2020-07-30 21222, 2020

14:10 PM
alastairp

if there is no item in the highlevel table for a specific row in the lowlevel table, we get the lowlevel data from the database, perform the prediction, and then write the highlevel data

2020-07-30 21247, 2020

14:10 PM
alastairp

for example: https://github.com/metabrainz/acousticbrainz-serv…

2020-07-30 21201, 2020

14:11 PM
alastairp

https://github.com/metabrainz/acousticbrainz-serv…

2020-07-30 21238, 2020

14:11 PM
alastairp

however, I believe that we can take advantage of this project to improve this workflow

2020-07-30 21209, 2020

14:12 PM
alastairp

recall that we have the functionality to build datasets in acousticbrainz: https://acousticbrainz.org/datasets/create

2020-07-30 21223, 2020

14:12 PM
jmp_music_

hmm ok

2020-07-30 21247, 2020

14:12 PM
alastairp

when you have built a dataset, we have a button called "Evaluate", which submits it to have a model trained with gaia

2020-07-30 21204, 2020

14:14 PM
alastairp

I would like to set up a complete end-to-end pipeline that allows us to build a dataset, construct a model with sklearn, perform an evaluation with a separate subset of the acousticbrainz database, and then finally promote a model as live if we decide that it works well, so that it shows on the website and is available in the API

2020-07-30 21240, 2020

14:14 PM
jmp_music_

however, the dataset evaluations are not the models that are used for the high-level predictions, right?

2020-07-30 21257, 2020

14:14 PM
alastairp

it would be great to be able to do this completely through the website. almost all of these components exist as individual parts, I think that now would be a great time to integrate them together

2020-07-30 21214, 2020

14:15 PM
jmp_music_

I understand

2020-07-30 21228, 2020

14:15 PM
alastairp

I sent you our paper about cross-collection evaluation, right?

2020-07-30 21204, 2020

14:16 PM
jmp_music_

yes yes

2020-07-30 21233, 2020

14:16 PM
alastairp

at the moment we have an accuracy of the model made with sklearn, using cross-validation train/test splits

2020-07-30 21253, 2020

14:16 PM
jmp_music_

right

2020-07-30 21205, 2020

14:17 PM
alastairp

however we would like to also calculate a second accuracy, using a second dataset

2020-07-30 21218, 2020

14:17 PM
alastairp

for example, you and I both make a dataset for electronic/not electronic

2020-07-30 21238, 2020

14:17 PM
alastairp

you make a model with your dataset and you get 89% accuracy

2020-07-30 21229, 2020

14:18 PM
alastairp

then you use your model to compute predictions on the items in my dataset, and see how many of the predictions match my ground-truth

2020-07-30 21248, 2020
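
[Illustrative sketch, not from the channel] The second accuracy described above, running one person's model over another person's dataset and counting matches against that dataset's ground truth, can be sketched as below. The threshold "model", the `synth_energy` feature and both toy datasets are hypothetical stand-ins; a real model would be trained with gaia or sklearn, and its own-dataset accuracy would come from cross-validation rather than from scoring the training items.

```python
def accuracy(predict, dataset):
    """Fraction of items whose predicted class matches the ground truth.

    `dataset` maps an item id to a (features, label) tuple.
    """
    correct = sum(1 for features, label in dataset.values()
                  if predict(features) == label)
    return correct / len(dataset)


# Hypothetical stand-in for a trained model: a threshold on one feature.
def my_model(features):
    return "electronic" if features["synth_energy"] > 0.5 else "not electronic"


my_dataset = {
    "a": ({"synth_energy": 0.9}, "electronic"),
    "b": ({"synth_energy": 0.1}, "not electronic"),
    "c": ({"synth_energy": 0.8}, "electronic"),
    "d": ({"synth_energy": 0.2}, "not electronic"),
}

# A second dataset for the same problem, labelled by someone else.
other_dataset = {
    "e": ({"synth_energy": 0.7}, "electronic"),
    "f": ({"synth_energy": 0.6}, "not electronic"),
    "g": ({"synth_energy": 0.3}, "not electronic"),
    "h": ({"synth_energy": 0.4}, "electronic"),
}

own_accuracy = accuracy(my_model, my_dataset)
cross_accuracy = accuracy(my_model, other_dataset)
```

A large gap between the two numbers suggests the model fits one person's idea of the classes rather than the problem in general.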

14:18 PM
jmp_music_

aha, i understand

2020-07-30 21211, 2020

14:19 PM
alastairp

we actually have this functionality. it's called dataset contests, however it's not fully merged

2020-07-30 21219, 2020

14:19 PM
alastairp

I will work on merging it in the next few weeks

2020-07-30 21235, 2020

14:19 PM
alastairp

but the idea would be to modify this existing code so that it works with either gaia or sklearn

2020-07-30 21257, 2020

14:19 PM
jmp_music_

ok!

2020-07-30 21215, 2020

14:20 PM
alastairp

OK, that's the first part

2020-07-30 21218, 2020

14:20 PM
alastairp

the second part:

2020-07-30 21248, 2020

14:20 PM
alastairp

https://github.com/metabrainz/acousticbrainz-serv…

2020-07-30 21254, 2020

14:20 PM
alastairp

we have this table in the database called 'model'

2020-07-30 21211, 2020

14:21 PM
alastairp

It currently has all of the gaia models (genre, mood, instrumental, electronic, etc)

2020-07-30 21203, 2020

14:22 PM
alastairp

we should update this table to include some additional information - for example, the tool that was used to create the model

2020-07-30 21220, 2020

14:22 PM
jmp_music_

that will be gaia or sklearn

2020-07-30 21224, 2020

14:22 PM
alastairp

yes

2020-07-30 21253, 2020

14:22 PM
alastairp

see that when we store highlevel data, we do it in a number of different tables: https://github.com/metabrainz/acousticbrainz-serv…

2020-07-30 21209, 2020

14:24 PM
alastairp

see the highlevel_model table. This is the prediction for a single model. So for 1 lowlevel item, we will have 1 highlevel item, and 18 highlevel_model items (one for each model)

2020-07-30 21232, 2020

14:24 PM
jmp_music_

aha ok!

2020-07-30 21203, 2020

14:25 PM
jmp_music_

I think I understand

2020-07-30 21216, 2020

14:25 PM
alastairp

so when we add a new row to the model table, we should have a script which can find all of the lowlevel items that don't have a prediction for that model, and then compute the prediction, and add a row to the highlevel_model table

2020-07-30 21226, 2020
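
[Illustrative sketch, not from the channel] The backfill script described above could be shaped roughly like this. The table and column layout is a simplified, hypothetical version of the `lowlevel`, `model` and `highlevel_model` tables (the real acousticbrainz-server schema differs), the `tool` column reflects the suggestion to record gaia vs. sklearn, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lowlevel (id INTEGER PRIMARY KEY);
    CREATE TABLE model (id INTEGER PRIMARY KEY, name TEXT, tool TEXT);
    CREATE TABLE highlevel_model (lowlevel_id INTEGER, model_id INTEGER, result TEXT);
    INSERT INTO lowlevel (id) VALUES (1), (2), (3);
    INSERT INTO model (id, name, tool) VALUES (1, 'electronic', 'sklearn');
    INSERT INTO highlevel_model VALUES (1, 1, 'electronic');
""")


def missing_lowlevel_ids(conn, model_id):
    """Return ids of lowlevel rows with no prediction for the given model."""
    rows = conn.execute("""
        SELECT ll.id
          FROM lowlevel ll
          LEFT JOIN highlevel_model hm
                 ON hm.lowlevel_id = ll.id AND hm.model_id = ?
         WHERE hm.lowlevel_id IS NULL
         ORDER BY ll.id
    """, (model_id,)).fetchall()
    return [row[0] for row in rows]


def backfill(conn, model_id, predict):
    """Compute and store a prediction for every item the model has not seen."""
    todo = missing_lowlevel_ids(conn, model_id)
    conn.executemany(
        "INSERT INTO highlevel_model (lowlevel_id, model_id, result) VALUES (?, ?, ?)",
        [(ll_id, model_id, predict(ll_id)) for ll_id in todo],
    )
    return len(todo)


# `predict` is a placeholder; a real one would load the lowlevel data and run the model.
added = backfill(conn, 1, lambda ll_id: "not electronic")
```

Running the script again is a no-op, because the anti-join only selects items that still lack a row for that model.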

14:25 PM
alastairp

then this data will appear on the API

2020-07-30 21228, 2020

14:26 PM
jmp_music_

ok

2020-07-30 21242, 2020

14:26 PM
jmp_music_

Can I ask something

2020-07-30 21242, 2020

14:26 PM
jmp_music_

?

2020-07-30 21248, 2020

14:26 PM
alastairp

OK, that's the whole overview. I'm not sure if we will have enough time to finish it this summer, but I wanted you to know the full cycle

2020-07-30 21251, 2020

14:26 PM
alastairp

absolutely

2020-07-30 21221, 2020

14:27 PM
jmp_music_

Where should I save the .pkl models of the transformation pipelines?

2020-07-30 21231, 2020

14:27 PM
jmp_music_

(gaussianize, normalize, etc.)

2020-07-30 21242, 2020

14:28 PM
alastairp

https://github.com/metabrainz/acousticbrainz-serv…

2020-07-30 21253, 2020

14:28 PM
alastairp

this is the script that currently runs the gaia model training