`listens` -> `listens_df` and users_df and artists_df -> `playcounts_df`,
2020-11-10 31552, 2020
pristine___
Also
2020-11-10 31515, 2020
pristine___
We are storing users_df, artist_df and artistcount to HDFS
2020-11-10 31506, 2020
alastairp
my feeling is that listens_df isn't strictly needed, because you could take data from listens and put it directly into the playcounts df, right? Also, is this similar to artist stats in any way?
2020-11-10 31537, 2020
pristine___
Yeah, right on the first point!
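The point above (skip `listens_df` and aggregate listens straight into playcounts) can be sketched in plain Python; the record shapes and IDs below are illustrative assumptions, not the actual ListenBrainz pipeline, which runs in Spark:

```python
from collections import Counter

# Hypothetical listen records: (user_id, artist_id) pairs pulled from the listens data.
listens = [
    (1, "artist_a"), (1, "artist_a"), (1, "artist_b"),
    (2, "artist_a"), (2, "artist_c"), (2, "artist_c"),
]

# Aggregate directly into playcounts, with no intermediate listens_df.
playcounts = Counter(listens)

# Each entry is ((user_id, artist_id), count) -- the shape playcounts_df would hold.
rows = [(user, artist, count) for (user, artist), count in playcounts.items()]
print(sorted(rows))
```

In Spark the same aggregation would be a single `groupBy(user, artist).count()` over the listens data.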
2020-11-10 31540, 2020
sumedh joined the channel
2020-11-10 31526, 2020
pristine___
Artist stats, like?
2020-11-10 31545, 2020
alastairp
we already have a count of how many times users have listened to each artist
2020-11-10 31500, 2020
alastairp
(I used this data dump when I did artist similarity using some python tools)
2020-11-10 31509, 2020
pristine___
Yeah, but I think the range will differ. I might want artist stats for, like, the last 15 days; or even if it's the last 7 days, the 7 days of artist stats might be different from the 7 days I want to fetch data for. So I thought it would be good to keep it separate
2020-11-10 31540, 2020
alastairp
ok. I would keep this open as an option - it seems a bit silly to compute exactly the same data slightly differently in two different places. Perhaps we can modify the stats computation to also generate the data that we need here
2020-11-10 31555, 2020
pristine___
Yes.
2020-11-10 31556, 2020
alastairp
however, that's fine. For now we can go about it the way that you have suggested
2020-11-10 31538, 2020
alastairp
those dataframes are fine then, just make sure that their names are a bit more descriptive, and make sure that you add some more details to this document about how data from the listens df gets loaded into these ones
2020-11-10 31551, 2020
alastairp
I was looking at your comments about evaluation
2020-11-10 31551, 2020
pristine___
Sure
2020-11-10 31501, 2020
alastairp
did you look into the other projects that I sent you?
2020-11-10 31549, 2020
pristine___
I looked at the user similarity PR, and the project. Noticed some evaluation techniques but didn't understand them fully.
alastairp
these are some good metrics to look into. You can search their names to learn more about them. There are implementations in this repository
2020-11-10 31538, 2020
pristine___
Yeah, thanks for the link
2020-11-10 31504, 2020
alastairp
you're right about the train/test split
2020-11-10 31534, 2020
alastairp
I will look for some more information about how to do splits in these types of models and see if I can find anything useful
2020-11-10 31556, 2020
pristine___
Yes, I will want to use that in recording recs too. Thanks
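One common way to do a train/test split for this kind of implicit-feedback model is to hold out a fraction of each user's known interactions rather than whole users; the sketch below is an illustrative assumption about the approach, not the split the project settled on:

```python
import random

# Hypothetical (user_id, artist_id, playcount) interactions.
interactions = [
    (1, "a", 5), (1, "b", 2), (1, "c", 9),
    (2, "a", 3), (2, "d", 7), (2, "e", 1),
]

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Hold out a fraction of each user's interactions for evaluation.

    Whole users cannot be held out, since the model only learns factors
    for users it has seen; instead we hide some of each user's known
    artists and later check whether they get recommended back.
    """
    rng = random.Random(seed)
    by_user = {}
    for row in rows:
        by_user.setdefault(row[0], []).append(row)
    train, test = [], []
    for user_rows in by_user.values():
        rng.shuffle(user_rows)
        n_test = max(1, int(len(user_rows) * test_fraction))
        test.extend(user_rows[:n_test])
        train.extend(user_rows[n_test:])
    return train, test

train, test = train_test_split(interactions)
assert len(train) + len(test) == len(interactions)
```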
2020-11-10 31507, 2020
alastairp
next, you're looking at what input to give to the model
2020-11-10 31525, 2020
alastairp
either all artists by a user, or just their top artists
2020-11-10 31532, 2020
pristine___
Giving all artists by a user or just their top artists is not a good idea
2020-11-10 31544, 2020
alastairp
I'm not sure that the existing artist relations information is useful here - is that what you're suggesting? "Get artists similar to these artists using the artist relations dump"
2020-11-10 31500, 2020
pristine___
Okay, so
2020-11-10 31510, 2020
pristine___
We create the dfs, train the model
2020-11-10 31521, 2020
pristine___
The third step is to get the recommendations
2020-11-10 31530, 2020
pristine___
Generating recommendations is nothing but
2020-11-10 31553, 2020
ruaok barges in
2020-11-10 31512, 2020
ruaok
I have one pie in the sky use case that I would like us to consider... for our own survival.
2020-11-10 31513, 2020
pristine___
Give some artist IDs to the recommender for a user and it will give scores to the artists
2020-11-10 31531, 2020
pristine___
Now, what artist IDs should we give as input?
2020-11-10 31534, 2020
alastairp
ruaok: go
2020-11-10 31502, 2020
ruaok
I'd love it if we could build a system so that a small music label could come in, feed all of their metadata into MB and then train a model to recommend tracks in their collection.
2020-11-10 31543, 2020
ruaok
take the model, host it on their own server and say: "what is your Spotify ID? Can we pull your last 50 listens and suggest which tracks you should listen to at our label?"
2020-11-10 31509, 2020
ruaok
this would be a massive win for us -- give us data, take recs and give us a little money for the effort.
2020-11-10 31547, 2020
ruaok
and pie in sky... I'd like to make this possible in 2021.
2020-11-10 31531, 2020
_lucifer
that should be possible right now i guess?
2020-11-10 31539, 2020
alastairp
yeah, that should be possible. we'd need to work out the missing step between 'add metadata' and 'generate recommendations', if we're using collab filtering, we'll need to wait until we have enough listens
2020-11-10 31549, 2020
alastairp
if their data is already in MB and people are listening to it, perfect
2020-11-10 31534, 2020
alastairp
pristine___: mm, that's definitely one way of doing recommendations, but you don't need to select the potential artists first
2020-11-10 31542, 2020
ruaok
_lucifer: not quite, no.
2020-11-10 31502, 2020
alastairp
see the repo that I linked previously, the "Analysis of the recommendations" section
2020-11-10 31506, 2020
ruaok
alastairp: agreed. this isn't a week long project.
2020-11-10 31502, 2020
alastairp
the returned list of tuples are "artists that are similar to all of the artists that this user has listened to". The True/False indicates if we know that the user has listened to that artist before
2020-11-10 31537, 2020
pristine___
Returned from where?
2020-11-10 31554, 2020
alastairp
that's data from the model
2020-11-10 31504, 2020
pristine___
If I follow, we have trained the model, we are on prepare candidate sets step now?
2020-11-10 31526, 2020
sumedh has quit
2020-11-10 31505, 2020
alastairp
your candidate set is the list of artists for a user that you pass into the model?
2020-11-10 31521, 2020
pristine___
We give some input to the recommender, and the recommender, based on the trained model, assigns scores to the input
2020-11-10 31523, 2020
pristine___
Yes
2020-11-10 31551, 2020
alastairp
but you don't need to give an explicit input to the model
2020-11-10 31518, 2020
alastairp
that is, the model itself will give you some similar artists, ordered by score
2020-11-10 31528, 2020
sumedh joined the channel
2020-11-10 31542, 2020
alastairp
you don't need to go and find some artists yourself
2020-11-10 31505, 2020
pristine___
Okay
2020-11-10 31553, 2020
alastairp
you can just say "given the artists that pristine___ listens to, what are some other similar artists that other people listen to?"
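The "no explicit input needed" point can be sketched with the factor matrices a trained collaborative-filtering model (e.g. ALS) produces: score every artist for a user and take the top few. The factors and artist names below are made-up assumptions for illustration:

```python
import numpy as np

# Hypothetical latent factors from a trained model: one vector per user/artist.
rng = np.random.default_rng(0)
user_factors = rng.random((3, 4))    # 3 users, 4 latent dimensions
artist_factors = rng.random((5, 4))  # 5 artists

artist_names = ["a", "b", "c", "d", "e"]

def recommend(user_id, n=3):
    """Score every artist for one user and return the top-n by score --
    no hand-picked candidate list needed."""
    scores = artist_factors @ user_factors[user_id]
    top = np.argsort(scores)[::-1][:n]
    return [(artist_names[i], float(scores[i])) for i in top]

recs = recommend(0)
print(recs)
```

A candidate set would only restrict which rows of `artist_factors` get scored; the model can rank all of them.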
2020-11-10 31553, 2020
pristine___
Right, I understand. Candidate sets are more of a layer to refine the results. It's like: these are the artists the user might like, now give them a ranking. But yeah, what you say makes sense to me
2020-11-10 31530, 2020
alastairp
there is definitely value in using this kind of score lookup to be able to say "we think that you might like this artist 67%" when you are viewing an artist's page
2020-11-10 31523, 2020
alastairp
but I think that initially we should focus on directly using the model to generate the recommendations, and only look at expanding this set if we find that it's not big enough
2020-11-10 31508, 2020
pristine___
Makes sense.
2020-11-10 31510, 2020
alastairp
how long do you think it takes to do a cycle of: make a change, test something, get results?
2020-11-10 31534, 2020
alastairp
I'm just interested in knowing what your workflow is. Every time that you test something, do you deploy and run a job? Do you have a local setup with a smaller dataset?
ruaok
_lucifer: listens data, postgres data (user info, etc) and spark import friendly data.
2020-11-10 31555, 2020
pristine___
alastairp: it was easier when I could test stuff on the cluster. Rn, it depends on if someone is available to merge, deploy. I generally add logging statements if I am stuck somewhere or if I feel that the result is not what I expect. Testing on big data of course helps. So rn, I write code, test on my machine on whatever lil data I have, open a PR, merge, deploy and then understand if it is working as
2020-11-10 31555, 2020
pristine___
expected. 3-4 days maybe
2020-11-10 31541, 2020
_lucifer
ruaok: ah! ok. whats the diff between the listens data and spark import friendly data ? some schema difference or some data difference
2020-11-10 31558, 2020
ruaok
one is postgres, the other spark
2020-11-10 31505, 2020
ruaok
suited for import to each one...
2020-11-10 31521, 2020
alastairp
pristine___: right. I think that we should try and improve this workflow if we can, then. It would be great if you were able to try different parameters to models, or try different inputs to see the results easily, without having to wait for a review/deploy/test cycle
2020-11-10 31531, 2020
alastairp
even if that was on a smaller subset of the dataset
2020-11-10 31510, 2020
_lucifer
ruaok: understood. thanks!
2020-11-10 31522, 2020
pristine___
Yup
2020-11-10 31532, 2020
alastairp
let's keep this open as an idea, then.
2020-11-10 31505, 2020
alastairp
I spent some time a few months ago improving the local spark development workflow, but didn't get any further. I think that there are still some things that we could change here
2020-11-10 31521, 2020
_lucifer
for recs, in production spark only uses data from last 6 months right ?
2020-11-10 31503, 2020
pristine___
_lucifer: yes, we can change that window as we like
2020-11-10 31504, 2020
alastairp
pristine___: I strongly recommend that you run the mtlab-recsys notebook yourself. the test dataset is quite small, and you can reduce the number of model iterations too
2020-11-10 31515, 2020
pristine___
Will do.
2020-11-10 31522, 2020
alastairp
the repository is designed as an entry to this kind of task
2020-11-10 31547, 2020
alastairp
so have a play around and see if you can work out what different inputs give, and how everything fits together
2020-11-10 31556, 2020
pristine___
Yup
2020-11-10 31502, 2020
alastairp
I think that it will make things a lot clearer when you have to implement it in spark
2020-11-10 31542, 2020
alastairp
I have a feeling that we should be able to basically make a direct copy from the recsys repo -> spark, just changing the tools
2020-11-10 31506, 2020
alastairp
it does all of the same things, like making a mapping of usernames -> ids, and making the artist counts
2020-11-10 31522, 2020
pristine___
Oh.
2020-11-10 31535, 2020
_lucifer
pristine___: right so for local testing, the sample should be at most last 6 months data. that would be probably a 1.5g to 2g dump i think.
2020-11-10 31544, 2020
pristine___
I will play with it. Will help me understand stuff better
2020-11-10 31550, 2020
alastairp
take a look at the last item in the notebook too, which gives you artist -> other similar artist lookups, that should be easy for us to implement with the model, too
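The artist -> similar-artist lookup mentioned above is typically a nearest-neighbour search over the model's artist factor vectors (cosine similarity is one common choice); the factors below are illustrative assumptions, not values from the actual model:

```python
import numpy as np

# Hypothetical artist latent factors from a trained model.
rng = np.random.default_rng(1)
artist_factors = rng.random((5, 4))
artist_names = ["a", "b", "c", "d", "e"]

def similar_artists(artist_idx, n=3):
    """Rank other artists by cosine similarity of their factor vectors."""
    v = artist_factors[artist_idx]
    norms = np.linalg.norm(artist_factors, axis=1) * np.linalg.norm(v)
    sims = (artist_factors @ v) / norms
    order = np.argsort(sims)[::-1]
    # Drop the query artist itself (similarity 1.0 with itself).
    return [(artist_names[i], float(sims[i])) for i in order if i != artist_idx][:n]

print(similar_artists(0))
```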
2020-11-10 31524, 2020
alastairp
please update the document, especially the parts about how the dataframes are generated from listen data and improve the names, and the section about the candidate sets
2020-11-10 31538, 2020
pristine___
_lucifer: you can change that from 6 months to maybe two if you don't have much data on your local machine, but yeah, your estimate looks good to me
2020-11-10 31557, 2020
pristine___
alastairp: yes, I will update
2020-11-10 31558, 2020
_lucifer
pristine___: if we create dumps of that size, we can easily host our own notebooks with kaggle/colab etc.
2020-11-10 31505, 2020
pristine___
Right
2020-11-10 31515, 2020
_lucifer
that would make prototyping really quick
2020-11-10 31517, 2020
pristine___
_lucifer: so I thought about it earlier
2020-11-10 31522, 2020
alastairp
perhaps split the candidate sets into two - one that uses existing user listens to generate similar artists, and an extension to look at your idea by using other data sources to find other potential artists
2020-11-10 31532, 2020
pristine___
I will come back to you after the discussion with alastairp
2020-11-10 31538, 2020
_lucifer
+1
2020-11-10 31540, 2020
alastairp
pristine___: did you also want to talk about your feedback work?
2020-11-10 31504, 2020
pristine___
Yes after the artist recs stuff
2020-11-10 31522, 2020
alastairp
after you have finished it, or after we have discussed it?
2020-11-10 31525, 2020
pristine___
Oh lol. After we have discussed the artist recs, I was thinking that I will start work on it right after this discussion. So we can talk about the road map and timeline as well
2020-11-10 31505, 2020
alastairp
oh, right.
2020-11-10 31527, 2020
alastairp
sure, let's continue with the discussion. I didn't see the lower down stuff
2020-11-10 31537, 2020
alastairp
schema - this is in the listenbrainz postgres database?
2020-11-10 31554, 2020
pristine___
It's not there now, but this is what I propose.
2020-11-10 31559, 2020
pristine___
Yes
2020-11-10 31502, 2020
alastairp
yes, right. that's what I mean
2020-11-10 31522, 2020
alastairp
you should describe what this `artist` column will be in more detail
2020-11-10 31533, 2020
pristine___
Right
2020-11-10 31550, 2020
pristine___
Will edit in the docs. It contains the artist recommendations
2020-11-10 31501, 2020
alastairp
will you only store one row for each user? or will you store results from multiple evaluations of the model?
I see that you initially used json fields in some other tables because you weren't sure about the structure of the returned data. is that why you're doing it here?
if you know that the data will be (artist name, mbid, artistcredit_id, score), then maybe we could just add the columns directly
2020-11-10 31534, 2020
alastairp
will you store multiple sets of data, or just refresh it periodically?
2020-11-10 31541, 2020
pristine___
Oh that, that I am not sure now. For example
2020-11-10 31523, 2020
pristine___
In the recording recs we were initially just storing the track mbid, later we wanted the score too, having a json field makes it easy to update and store the data.
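The flexibility described above (add a new field like `score` without a schema migration) is what a JSON column buys; the field and key names below are illustrative assumptions, not the actual ListenBrainz schema:

```python
import json

# Hypothetical shape of the recommendations JSON column: a list of dicts,
# so new keys (e.g. "score") can be added later without altering the table.
recs = [
    {"artist_name": "artist_a", "artist_mbid": None, "score": 0.92},
    {"artist_name": "artist_b", "artist_mbid": None, "score": 0.87},
]
payload = json.dumps(recs)

# Reading it back preserves the structure, including the later-added field.
assert json.loads(payload)[0]["score"] == 0.92
```

The trade-off alastairp raises still applies: once the shape is stable, dedicated columns make the data queryable and type-checked.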
2020-11-10 31508, 2020
pristine___
I am thinking to update the data periodically, so like I have a list of dicts this week, the next week a fresh list of dicts will replace the older one
2020-11-10 31511, 2020
pristine___
Something like that
2020-11-10 31555, 2020
alastairp
there are some questions about exactly how you might want to do that - e.g. you could have multiple rows for a users
2020-11-10 31559, 2020
alastairp
for a user.
2020-11-10 31513, 2020
alastairp
and then when showing the data, you select the one that was added most recently
2020-11-10 31528, 2020
alastairp
you could even delete old ones after you add a new one, that means that the site will never break
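The multiple-rows-then-prune scheme described above can be sketched as follows; the row shape and timestamps are illustrative assumptions:

```python
from datetime import datetime

# Hypothetical recommendation rows: (user_id, created, recs). A new row is
# appended on each run instead of overwriting the previous one in place.
rows = [
    (1, datetime(2020, 11, 3), ["artist_a"]),
    (1, datetime(2020, 11, 10), ["artist_b"]),
    (2, datetime(2020, 11, 10), ["artist_c"]),
]

def latest_recs(rows, user_id):
    """Serve each user's most recently generated row; older rows stay
    around until pruned, so readers never see a gap between runs."""
    user_rows = [r for r in rows if r[0] == user_id]
    return max(user_rows, key=lambda r: r[1])[2] if user_rows else None

def prune_old(rows):
    """After a new row lands, keep only each user's newest row."""
    newest = {}
    for r in rows:
        if r[0] not in newest or r[1] > newest[r[0]][1]:
            newest[r[0]] = r
    return list(newest.values())

assert latest_recs(rows, 1) == ["artist_b"]
```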
2020-11-10 31520, 2020
pristine___
But why do we want to retain old data when we have new recs generated?
2020-11-10 31521, 2020
sumedh joined the channel
2020-11-10 31533, 2020
alastairp
great question. that's what I'm asking
2020-11-10 31540, 2020
pristine___
Also, the site won't break I think. If we don't have recs we can show a message or something
2020-11-10 31542, 2020
alastairp
do we want it? is there anything interesting that we can do with it?
2020-11-10 31556, 2020
pristine___
Atm
2020-11-10 31524, 2020
alastairp
right, it depends what you mean by "break" - I would say that it's not nice if you're looking at recs, refresh and they disappear, refresh and there are new ones
2020-11-10 31537, 2020
alastairp
it would be better to refresh and see the new ones immediately
2020-11-10 31542, 2020
pristine___
I think we need it so that we don't repeat recommendations. For eg, if this week's recs are kinda similar to last week's, I can look at the table and maybe do something about it
2020-11-10 31506, 2020
pristine___
> right, it depends what you mean by "break" - I would say that it's not nice if you're looking at recs, refresh and they disappear, refresh and there are new ones