`listens` -> `listens_df` and users_df and artists_df -> `playcounts_df`,
2020-11-10 31552, 2020
pristine___
Also
2020-11-10 31515, 2020
pristine___
We are storing users_df, artist_df and artistcount to HDFS
2020-11-10 31506, 2020
alastairp
my feeling is that listens_df isn't strictly needed, because you could take data from listens and put it directly into the playcounts df, right? Also, is this similar to artist stats in any way?
2020-11-10 31537, 2020
pristine___
Yeah, right on the first point!
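The point above (skip `listens_df` and aggregate listens straight into playcounts) can be sketched in plain Python; the record shapes and IDs below are illustrative assumptions, not the actual ListenBrainz pipeline, which runs in Spark:

```python
from collections import Counter

# Hypothetical listen records: (user_id, artist_id) pairs pulled from the listens data.
listens = [
    (1, "artist_a"), (1, "artist_a"), (1, "artist_b"),
    (2, "artist_a"), (2, "artist_c"), (2, "artist_c"),
]

# Aggregate directly into playcounts, with no intermediate listens_df.
playcounts = Counter(listens)

# Each entry is ((user_id, artist_id), count) -- the shape playcounts_df would hold.
rows = [(user, artist, count) for (user, artist), count in playcounts.items()]
print(sorted(rows))
```

In Spark the same aggregation would be a single `groupBy(user, artist).count()` over the listens data.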
2020-11-10 31540, 2020
sumedh joined the channel
2020-11-10 31526, 2020
pristine___
Artist stats, like?
2020-11-10 31545, 2020
alastairp
we already have a count of how many times users have listened to each artist
2020-11-10 31500, 2020
alastairp
(I used this data dump when I did artist similarity using some python tools)
2020-11-10 31509, 2020
pristine___
Yeah, but I think the range will differ. I might want artist stats for, like, the last 15 days; or even if it's the last 7 days, the 7 days of artist stats might be different from the 7 days I want to fetch data for. So I thought it would be good to keep it separate
2020-11-10 31540, 2020
alastairp
ok. I would keep this open as an option - it seems a bit silly to compute exactly the same data slightly differently in two different places. Perhaps we can modify the stats computation to also generate the data that we need here
2020-11-10 31555, 2020
pristine___
Yes.
2020-11-10 31556, 2020
alastairp
however, that's fine. For now we can go about it the way that you have suggested
2020-11-10 31538, 2020
alastairp
those dataframes are fine then, just make sure that their names are a bit more descriptive, and make sure that you add some more details to this document about how data from the listens df gets loaded into these ones
2020-11-10 31551, 2020
alastairp
I was looking at your comments about evaluation
2020-11-10 31551, 2020
pristine___
Sure
2020-11-10 31501, 2020
alastairp
did you look into the other projects that I sent you?
2020-11-10 31549, 2020
pristine___
I looked at the user similarity PR, and the project. Noticed some evaluation techniques but didn't understand them fully.
alastairp
these are some good metrics to look into. You can search their names to learn more about them. There are implementations in this repository
2020-11-10 31538, 2020
pristine___
Yeah, thanks for the link
2020-11-10 31504, 2020
alastairp
you're right about the train/test split
2020-11-10 31534, 2020
alastairp
I will look for some more information about how to do splits in these types of models and see if I can find anything useful
2020-11-10 31556, 2020
pristine___
Yes, I will want to use that in recording recs too. Thanks
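One common way to do a train/test split for this kind of implicit-feedback model is to hold out a fraction of each user's known interactions rather than whole users; the sketch below is an illustrative assumption about the approach, not the split the project settled on:

```python
import random

# Hypothetical (user_id, artist_id, playcount) interactions.
interactions = [
    (1, "a", 5), (1, "b", 2), (1, "c", 9),
    (2, "a", 3), (2, "d", 7), (2, "e", 1),
]

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Hold out a fraction of each user's interactions for evaluation.

    Whole users cannot be held out, since the model only learns factors
    for users it has seen; instead we hide some of each user's known
    artists and later check whether they get recommended back.
    """
    rng = random.Random(seed)
    by_user = {}
    for row in rows:
        by_user.setdefault(row[0], []).append(row)
    train, test = [], []
    for user_rows in by_user.values():
        rng.shuffle(user_rows)
        n_test = max(1, int(len(user_rows) * test_fraction))
        test.extend(user_rows[:n_test])
        train.extend(user_rows[n_test:])
    return train, test

train, test = train_test_split(interactions)
assert len(train) + len(test) == len(interactions)
```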
2020-11-10 31507, 2020
alastairp
next, you're looking at what input to give to the model
2020-11-10 31525, 2020
alastairp
either all artists by a user, or just their top artists
2020-11-10 31532, 2020
pristine___
Giving all artists by a user or just their top artists is not a good idea
2020-11-10 31544, 2020
alastairp
I'm not sure that the existing artist relations information is useful here - is that what you're suggesting? "Get artists similar to these artists using the artist relations dump"
2020-11-10 31500, 2020
pristine___
Okay, so
2020-11-10 31510, 2020
pristine___
We create the dfs, train the model
2020-11-10 31521, 2020
pristine___
The third step is to get the recommendations
2020-11-10 31530, 2020
pristine___
Generating recommendations is nothing but
2020-11-10 31553, 2020
ruaok barges in
2020-11-10 31512, 2020
ruaok
I have one pie in the sky use case that I would like us to consider... for our own survival.
2020-11-10 31513, 2020
pristine___
Give some artist IDs to the recommender for a user and it will give scores to the artists
2020-11-10 31531, 2020
pristine___
Now, what artist IDs should we give as input?
2020-11-10 31534, 2020
alastairp
ruaok: go
2020-11-10 31502, 2020
ruaok
I'd love it if we could build a system so that a small music label could come in, feed all of their metadata into MB and then train a model to recommend tracks in their collection.
2020-11-10 31543, 2020
ruaok
take the model, host it on their own server and say: "what is your Spotify ID? Can we pull your last 50 listens and suggest which tracks you should listen to at our label?"
2020-11-10 31509, 2020
ruaok
this would be a massive win for us -- give us data, take recs and give us a little money for the effort.
2020-11-10 31547, 2020
ruaok
and pie in sky... I'd like to make this possible in 2021.
2020-11-10 31531, 2020
_lucifer
that should be possible right now i guess?
2020-11-10 31539, 2020
alastairp
yeah, that should be possible. we'd need to work out the missing step between 'add metadata' and 'generate recommendations', if we're using collab filtering, we'll need to wait until we have enough listens
2020-11-10 31549, 2020
alastairp
if their data is already in MB and people are listening to it, perfect
2020-11-10 31534, 2020
alastairp
pristine___: mm, that's definitely one way of doing recommendations, but you don't need to select the potential artists first
2020-11-10 31542, 2020
ruaok
_lucifer: not quite, no.
2020-11-10 31502, 2020
alastairp
see the repo that I linked previously, the "Analysis of the recommendations" section
2020-11-10 31506, 2020
ruaok
alastairp: agreed. this isn't a week long project.
2020-11-10 31502, 2020
alastairp
the returned list of tuples are "artists that are similar to all of the artists that this user has listened to". The True/False indicates if we know that the user has listened to that artist before
2020-11-10 31537, 2020
pristine___
Returned from where?
2020-11-10 31554, 2020
alastairp
that's data from the model
2020-11-10 31504, 2020
pristine___
If I follow, we have trained the model, we are on prepare candidate sets step now?
2020-11-10 31526, 2020
sumedh has quit
2020-11-10 31505, 2020
alastairp
your candidate set is the list of artists for a user that you pass into the model?
2020-11-10 31521, 2020
pristine___
We give some input to the recommender, and the recommender, based on the trained model, assigns scores to the input
2020-11-10 31523, 2020
pristine___
Yes
2020-11-10 31551, 2020
alastairp
but you don't need to give an explicit input to the model
2020-11-10 31518, 2020
alastairp
that is, the model itself will give you some similar artists, ordered by score
2020-11-10 31528, 2020
sumedh joined the channel
2020-11-10 31542, 2020
alastairp
you don't need to go and find some artists yourself
2020-11-10 31505, 2020
pristine___
Okay
2020-11-10 31553, 2020
alastairp
you can just say "given the artists that pristine___ listens to, what are some other similar artists that other people listen to?"
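The "no explicit input needed" point can be sketched with the factor matrices a trained collaborative-filtering model (e.g. ALS) produces: score every artist for a user and take the top few. The factors and artist names below are made-up assumptions for illustration:

```python
import numpy as np

# Hypothetical latent factors from a trained model: one vector per user/artist.
rng = np.random.default_rng(0)
user_factors = rng.random((3, 4))    # 3 users, 4 latent dimensions
artist_factors = rng.random((5, 4))  # 5 artists

artist_names = ["a", "b", "c", "d", "e"]

def recommend(user_id, n=3):
    """Score every artist for one user and return the top-n by score --
    no hand-picked candidate list needed."""
    scores = artist_factors @ user_factors[user_id]
    top = np.argsort(scores)[::-1][:n]
    return [(artist_names[i], float(scores[i])) for i in top]

recs = recommend(0)
print(recs)
```

A candidate set would only restrict which rows of `artist_factors` get scored; the model can rank all of them.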
2020-11-10 31553, 2020
pristine___
Right, I understand. Candidate sets are more of a layer to refine the results. It's like: these are the artists the user might like, now give them a ranking. But yeah, what you say makes sense to me
2020-11-10 31530, 2020
alastairp
there is definitely value in using this kind of score lookup to be able to say "we think that you might like this artist 67%" when you are viewing an artist's page
2020-11-10 31523, 2020
alastairp
but I think that initially we should focus on directly using the model to generate the recommendations, and only look at expanding this set if we find that it's not big enough
2020-11-10 31508, 2020
pristine___
Makes sense.
2020-11-10 31510, 2020
alastairp
how long do you think it takes to do a cycle of: make a change, test something, get results?
2020-11-10 31534, 2020
alastairp
I'm just interested in knowing what your workflow is. Every time that you test something, do you deploy and run a job? Do you have a local setup with a smaller dataset?
ruaok
_lucifer: listens data, postgres data (user info, etc) and spark import friendly data.
2020-11-10 31555, 2020
pristine___
alastairp: it was easier when I could test stuff on the cluster. Rn, it depends on if someone is available to merge, deploy. I generally add logging statements if I am stuck somewhere or if I feel that the result is not what I expect. Testing on big data of course helps. So rn, I write code, test on my machine on whatever lil data I have, open a PR, merge, deploy and then understand if it is working as
2020-11-10 31555, 2020
pristine___
expected. 3-4 days maybe
2020-11-10 31541, 2020
_lucifer
ruaok: ah! ok. whats the diff between the listens data and spark import friendly data ? some schema difference or some data difference
2020-11-10 31558, 2020
ruaok
one is postgres, the other spark
2020-11-10 31505, 2020
ruaok
suited for import to each one...
2020-11-10 31521, 2020
alastairp
pristine___: right. I think that we should try and improve this workflow if we can, then. It would be great if you were able to try different parameters to models, or try different inputs to see the results easily, without having to wait for a review/deploy/test cycle
2020-11-10 31531, 2020
alastairp
even if that was on a smaller subset of the dataset
2020-11-10 31510, 2020
_lucifer
ruaok: understood. thanks!
2020-11-10 31522, 2020
pristine___
Yup
2020-11-10 31532, 2020
alastairp
let's keep this open as an idea, then.
2020-11-10 31505, 2020
alastairp
I spent some time a few months ago improving the local spark development workflow, but didn't get any further. I think that there are still some things that we could change here
2020-11-10 31521, 2020
_lucifer
for recs, in production spark only uses data from last 6 months right ?
2020-11-10 31503, 2020
pristine___
_lucifer: yes, we can change that window as we like
2020-11-10 31504, 2020
alastairp
pristine___: I strongly recommend that you run the mtlab-recsys notebook yourself. the test dataset is quite small, and you can reduce the number of model iterations too
2020-11-10 31515, 2020
pristine___
Will do.
2020-11-10 31522, 2020
alastairp
the repository is designed as an entry to this kind of task
2020-11-10 31547, 2020
alastairp
so have a play around and see if you can work out what different inputs give, and how everything fits together
2020-11-10 31556, 2020
pristine___
Yup
2020-11-10 31502, 2020
alastairp
I think that it will make things a lot clearer when you have to implement it in spark
2020-11-10 31542, 2020
alastairp
I have a feeling that we should be able to basically make a direct copy from the recsys repo -> spark, just changing the tools
2020-11-10 31506, 2020
alastairp
it does all of the same things, like making a mapping of usernames -> ids, and making the artist counts
2020-11-10 31522, 2020
pristine___
Oh.
2020-11-10 31535, 2020
_lucifer
pristine___: right so for local testing, the sample should be at most last 6 months data. that would be probably a 1.5g to 2g dump i think.
2020-11-10 31544, 2020
pristine___
I will play with it. Will help me understand stuff better
2020-11-10 31550, 2020
alastairp
take a look at the last item in the notebook too, which gives you artist -> other similar artist lookups, that should be easy for us to implement with the model, too
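The artist -> similar-artist lookup mentioned above is typically a nearest-neighbour search over the model's artist factor vectors (cosine similarity is one common choice); the factors below are illustrative assumptions, not values from the actual model:

```python
import numpy as np

# Hypothetical artist latent factors from a trained model.
rng = np.random.default_rng(1)
artist_factors = rng.random((5, 4))
artist_names = ["a", "b", "c", "d", "e"]

def similar_artists(artist_idx, n=3):
    """Rank other artists by cosine similarity of their factor vectors."""
    v = artist_factors[artist_idx]
    norms = np.linalg.norm(artist_factors, axis=1) * np.linalg.norm(v)
    sims = (artist_factors @ v) / norms
    order = np.argsort(sims)[::-1]
    # Drop the query artist itself (similarity 1.0 with itself).
    return [(artist_names[i], float(sims[i])) for i in order if i != artist_idx][:n]

print(similar_artists(0))
```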
2020-11-10 31524, 2020
alastairp
please update the document, especially the parts about how the dataframes are generated from listen data and improve the names, and the section about the candidate sets
2020-11-10 31538, 2020
pristine___
_lucifer: you can change that from 6 months to maybe two if you don't have much data on your local machine, but yeah, your estimate looks good to me
2020-11-10 31557, 2020
pristine___
alastairp: yes, I will update
2020-11-10 31558, 2020
_lucifer
pristine___: if we create dumps of that size, we can easily host our own notebooks with kaggle/colab etc.
2020-11-10 31505, 2020
pristine___
Right
2020-11-10 31515, 2020
_lucifer
that would make prototyping really quick
2020-11-10 31517, 2020
pristine___
_lucifer: so I thought about it earlier
2020-11-10 31522, 2020
alastairp
perhaps split the candidate sets into two - one that uses existing user listens to generate similar artists, and an extension to look at your idea by using other data sources to find other potential artists
2020-11-10 31532, 2020
pristine___
I will come back to you after the discussion with alastairp
2020-11-10 31538, 2020
_lucifer
+1
2020-11-10 31540, 2020
alastairp
pristine___: did you also want to talk about your feedback work?
2020-11-10 31504, 2020
pristine___
Yes after the artist recs stuff
2020-11-10 31522, 2020
alastairp
after you have finished it, or after we have discussed it?
2020-11-10 31525, 2020
pristine___
Oh lol. After we have discussed the artist recs, I was thinking that I will start work on it right after this discussion. So we can talk about the road map and timeline as well
2020-11-10 31505, 2020
alastairp
oh, right.
2020-11-10 31527, 2020
alastairp
sure, let's continue with the discussion. I didn't see the lower down stuff
2020-11-10 31537, 2020
alastairp
schema - this is in the listenbrainz postgres database?
2020-11-10 31554, 2020
pristine___
It's not there now, but this is what I propose.
2020-11-10 31559, 2020
pristine___
Yes
2020-11-10 31502, 2020
alastairp
yes, right. that's what I mean
2020-11-10 31522, 2020
alastairp
you should describe what this `artist` column will be in more detail
2020-11-10 31533, 2020
pristine___
Right
2020-11-10 31550, 2020
pristine___
Will edit in the docs. It contains the artist recommendations
2020-11-10 31501, 2020
alastairp
will you only store one row for each user? or will you store results from multiple evaluations of the model?
I see that you initially used json fields in some other tables because you weren't sure about the structure of the returned data. is that why you're doing it here?
if you know that the data will be (artist name, mbid, artistcredit_id, score), then maybe we could just add the columns directly
2020-11-10 31534, 2020
alastairp
will you store multiple sets of data, or just refresh it periodically?
2020-11-10 31541, 2020
pristine___
Oh that, that I am not sure now. For example
2020-11-10 31523, 2020
pristine___
In the recording recs we were initially just storing the track mbid, later we wanted the score too, having a json field makes it easy to update and store the data.
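The flexibility described above (add a new field like `score` without a schema migration) is what a JSON column buys; the field and key names below are illustrative assumptions, not the actual ListenBrainz schema:

```python
import json

# Hypothetical shape of the recommendations JSON column: a list of dicts,
# so new keys (e.g. "score") can be added later without altering the table.
recs = [
    {"artist_name": "artist_a", "artist_mbid": None, "score": 0.92},
    {"artist_name": "artist_b", "artist_mbid": None, "score": 0.87},
]
payload = json.dumps(recs)

# Reading it back preserves the structure, including the later-added field.
assert json.loads(payload)[0]["score"] == 0.92
```

The trade-off alastairp raises still applies: once the shape is stable, dedicated columns make the data queryable and type-checked.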
2020-11-10 31508, 2020
pristine___
I am thinking to update the data periodically, so like I have a list of dicts this week, the next week a fresh list of dicts will replace the older one
2020-11-10 31511, 2020
pristine___
Something like that
2020-11-10 31555, 2020
alastairp
there are some questions about exactly how you might want to do that - e.g. you could have multiple rows for a users
2020-11-10 31559, 2020
alastairp
for a user.
2020-11-10 31513, 2020
alastairp
and then when showing the data, you select the one that was added most recently
2020-11-10 31528, 2020
alastairp
you could even delete old ones after you add a new one, that means that the site will never break
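The multiple-rows-then-prune scheme described above can be sketched as follows; the row shape and timestamps are illustrative assumptions:

```python
from datetime import datetime

# Hypothetical recommendation rows: (user_id, created, recs). A new row is
# appended on each run instead of overwriting the previous one in place.
rows = [
    (1, datetime(2020, 11, 3), ["artist_a"]),
    (1, datetime(2020, 11, 10), ["artist_b"]),
    (2, datetime(2020, 11, 10), ["artist_c"]),
]

def latest_recs(rows, user_id):
    """Serve each user's most recently generated row; older rows stay
    around until pruned, so readers never see a gap between runs."""
    user_rows = [r for r in rows if r[0] == user_id]
    return max(user_rows, key=lambda r: r[1])[2] if user_rows else None

def prune_old(rows):
    """After a new row lands, keep only each user's newest row."""
    newest = {}
    for r in rows:
        if r[0] not in newest or r[1] > newest[r[0]][1]:
            newest[r[0]] = r
    return list(newest.values())

assert latest_recs(rows, 1) == ["artist_b"]
```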
2020-11-10 31520, 2020
pristine___
But why do we want to retain old data when we have new recs generated?
2020-11-10 31521, 2020
sumedh joined the channel
2020-11-10 31533, 2020
alastairp
great question. that's what I'm asking
2020-11-10 31540, 2020
pristine___
Also, the site won't break I think. If we don't have recs we can show a message or something
2020-11-10 31542, 2020
alastairp
do we want it? is there anything interesting that we can do with it?
2020-11-10 31556, 2020
pristine___
Atm
2020-11-10 31524, 2020
alastairp
right, it depends what you mean by "break" - I would say that it's not nice if you're looking at recs, refresh and they disappear, refresh and there are new ones
2020-11-10 31537, 2020
alastairp
it would be better to refresh and see the new ones immediately
2020-11-10 31542, 2020
pristine___
I think we need it so that we don't repeat recommendations. For eg, if this week's recs are kinda similar to last week's, I can look at the table and maybe do something about it
2020-11-10 31506, 2020
pristine___
> right, it depends what you mean by "break" - I would say that it's not nice if you're looking at recs, refresh and they disappear, refresh and there are new ones