cool. this seems super exciting, let me know what you come up with
2020-11-10 31513, 2020
ruaok
it would be cool to turn around and show us having built something with it.
2020-11-10 31505, 2020
ruaok
will do. I think this slots in perfectly after the mapping is calculated. the mapping itself isn't that useful in context of ACRP, but it shows us "these are the tracks that can't be matched with exact matching". it's a perfect test dataset.
2020-11-10 31530, 2020
ruaok
and I can feel the "add MBIDs to listens in timescale" getting closer to reality.
2020-11-10 31551, 2020
ruaok
which would make so much more code on the spark side simpler for not having to match MBIDs as part of the process.
alastairp: not sure why I cannot request your review for this PR through GitHub, so I’m doing it here ^
2020-11-10 31529, 2020
alastairp
maybe I'm not in an admin team on the repo. will look
2020-11-10 31538, 2020
alastairp
pristine___: OK, I'm here. how are you?
2020-11-10 31558, 2020
Nyanko-sensei joined the channel
2020-11-10 31515, 2020
alastairp
Let's start with the artist recommendations doc
2020-11-10 31521, 2020
pristine___
Great!
2020-11-10 31550, 2020
alastairp
I haven't been following your discussions with ruaok. Can you give me a very quick 2-3 line overview of what you're planning on doing?
2020-11-10 31507, 2020
pristine___
Sure
2020-11-10 31516, 2020
discopatrick has quit
2020-11-10 31513, 2020
pristine___
So rn, we are generating recording recommendations for users (the playlists), we thought of generating artist recommendations too, as in artists users might like. Roughly, it can help us in refining the daily jams/playlists.
2020-11-10 31536, 2020
alastairp
and what are your thoughts on the artist recommendations? How are you planning on collecting the training data, and what will be the input and output to the model?
2020-11-10 31550, 2020
pristine___
Umm... I have written that in the doc. The training data will be, as I plan, the playcounts/artistcounts, as in how many times a user has listened to a particular artist. It's implicit.
2020-11-10 31543, 2020
pristine___
So we can use listens of past month/year to fetch these artist counts.
2020-11-10 31550, 2020
pristine___
And train model on these
2020-11-10 31502, 2020
alastairp
ok, cool
2020-11-10 31521, 2020
alastairp
I'm looking at this 'create dataframes' section in the document
it's not clear to me if these are all new dataframes
2020-11-10 31548, 2020
pristine___
New?
2020-11-10 31557, 2020
alastairp
do they currently exist?
2020-11-10 31546, 2020
pristine___
No.
2020-11-10 31556, 2020
pristine___
The users df
2020-11-10 31558, 2020
pristine___
Exists now, as in it is used for recording recs, but I will want to have a separate users df, stored in a separate dir for artist recs, something like `/recs/artist/df/users.parquet`
2020-11-10 31558, 2020
alastairp
the names of these dataframes seem really generic - especially the one that you've named playcounts_df, this doesn't have anything in the name that says that they are _artist_ playcounts
2020-11-10 31542, 2020
pristine___
Yeah, right I missed that. It can be users_df, listens_df, artists_df, and artistcount_df
2020-11-10 31556, 2020
alastairp
given a listen, what are the steps for putting it into these tables?
2020-11-10 31527, 2020
pristine___
Yeah, so the listens are first mapped with the mapping.
2020-11-10 31534, 2020
alastairp
the input to the model will be a matrix of User / Artist, right? With counts in the cells
2020-11-10 31543, 2020
pristine___
Yes
2020-11-10 31512, 2020
pristine___
Then we fetch distinct users and assign them a user ID and prepare users_df
2020-11-10 31529, 2020
alastairp
right. so the descriptions of these dataframes don't explain to me why each of them is needed in this setup. it's difficult to follow how the data flows into these dfs
2020-11-10 31532, 2020
pristine___
Fetch distinct artists, assign artist ID and prepare artists_df
2020-11-10 31553, 2020
alastairp
what's an artist ID, and why do you need it?
2020-11-10 31505, 2020
pristine___
users_df, artists_df and listens_df are needed to prepare artistcount_df
2020-11-10 31512, 2020
pristine___
Yeah, so the IDs
2020-11-10 31558, 2020
pristine___
The user ID and artist ID, it is assigned like this
The model takes input of the form (int, int bigint)
2020-11-10 31545, 2020
pristine___
Int, int, bigint
2020-11-10 31552, 2020
pristine___
So we cannot just pass an mbid
2020-11-10 31554, 2020
pristine___
Or string
2020-11-10 31500, 2020
pristine___
To identify user, artist
2020-11-10 31506, 2020
pristine___
We assign them IDs
2020-11-10 31531, 2020
alastairp
great. so it's a mapping from our input to the indexes in the matrix
2020-11-10 31526, 2020
pristine___
I won't say indexes, but yeah we can use ids to later reference mbids/names etc
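A minimal sketch of the ID assignment discussed above, in plain Python rather than Spark so that it runs on its own. The function name `assign_ids` and the sample values are hypothetical; the actual dataframes are built in Spark and may assign IDs differently.

```python
def assign_ids(values):
    """Map each distinct value to a small integer ID, in first-seen order."""
    ids = {}
    for value in values:
        if value not in ids:
            ids[value] = len(ids)
    return ids


# Hypothetical sample data: user names and artist identifiers as strings.
user_ids = assign_ids(["alice", "bob", "alice"])
artist_ids = assign_ids(["artist-a", "artist-b", "artist-a"])

# Reverse lookup, to go from a model ID back to the original identifier.
artist_by_id = {artist_id: name for name, artist_id in artist_ids.items()}
```

The reverse dictionary is what lets the integer IDs in the model output be turned back into MBIDs or names later.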
2020-11-10 31500, 2020
alastairp
can you please edit the document to make this clearer? Explicitly describe the matrix format, and explain that each of these other tables is for the mapping
2020-11-10 31523, 2020
alastairp
index - I mean how to reference a particular row or column.
2020-11-10 31529, 2020
pristine___
> Explicitly describe the matrix format, and explain that each of these other tables is for the mapping
2020-11-10 31550, 2020
pristine___
So we generally use the term mapping for msid->mbid mapping
2020-11-10 31556, 2020
pristine___
Sorry, I want clear on that
2020-11-10 31559, 2020
pristine___
Wasn't
2020-11-10 31503, 2020
pristine___
Earlier
2020-11-10 31506, 2020
pristine___
Will edit
2020-11-10 31519, 2020
pristine___
Do you want me to edit rn, or after the discussion?
2020-11-10 31529, 2020
alastairp
anything that goes from one identifier to another identifier is a mapping
2020-11-10 31536, 2020
pristine___
Yeah
2020-11-10 31539, 2020
alastairp
users_df is a mapping from a username to an integer
Listens_df is nothing but the listens as such, I have just filtered the columns/fields I need
2020-11-10 31547, 2020
alastairp
can you explain it to me?
2020-11-10 31552, 2020
pristine___
Yeah
2020-11-10 31558, 2020
alastairp
oh, it's an existing table?
2020-11-10 31508, 2020
pristine___
We just do, `listens_df = listens.select('artist_credit_id', 'user_name')`.
2020-11-10 31514, 2020
alastairp
you just said before that these are all new dataframes
2020-11-10 31550, 2020
pristine___
Yes. New in the sense, we are creating them from the listens table (Submitted to LB)
2020-11-10 31509, 2020
alastairp
is there a specific technical reason why this is needed? So it seems like you're going from `listens` -> `listens_df` -> users_df and artists_df -> `playcounts_df`
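The flow described in this exchange can be sketched as follows, in plain Python with hypothetical names (`build_artist_counts`, the sample listens) and not the Spark code itself. It goes from listens as (user_name, artist) pairs to (user_id, artist_id, count) rows, which is the shape pristine___ describes as the model input.

```python
from collections import Counter


def build_artist_counts(listens):
    """Turn (user_name, artist) listens into (user_id, artist_id, count) rows."""
    user_ids = {}
    artist_ids = {}
    counts = Counter()
    for user_name, artist in listens:
        # Assign the next free integer the first time a name is seen.
        user_id = user_ids.setdefault(user_name, len(user_ids))
        artist_id = artist_ids.setdefault(artist, len(artist_ids))
        counts[(user_id, artist_id)] += 1
    rows = sorted((u, a, n) for (u, a), n in counts.items())
    return user_ids, artist_ids, rows


# Hypothetical sample listens, already reduced to the two needed fields.
listens = [
    ("alice", "artist-a"),
    ("alice", "artist-a"),
    ("bob", "artist-b"),
    ("alice", "artist-b"),
]
user_ids, artist_ids, rows = build_artist_counts(listens)
```

Each row corresponds to one non-empty cell of the User / Artist matrix; user/artist pairs with no listens are simply absent.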