#metabrainz

      • alastairp
        is that right?
      • 2020-11-10 31551, 2020

      • pristine___
        `listens` -> `listens_df`, and `users_df` and `artists_df` -> `playcounts_df`,
      • 2020-11-10 31552, 2020

      • pristine___
        Also
      • 2020-11-10 31515, 2020

      • pristine___
        We are storing `users_df`, `artists_df` and `artistcount` to HDFS
      • 2020-11-10 31506, 2020

      • alastairp
        my feeling is that listens_df isn't strictly needed, because you could take data from listens and put it directly into the playcounts df, right? Also, is this similar to artist stats in any way?
      • 2020-11-10 31537, 2020
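The step alastairp describes here — aggregating raw listens straight into a playcounts dataframe, with no intermediate `listens_df` — can be sketched in pure Python (a local stand-in for the Spark job; the column names are hypothetical):

```python
from collections import Counter

def build_playcounts(listens):
    """Aggregate raw listens into (user_id, artist_id) play counts,
    skipping the intermediate listens_df entirely."""
    counts = Counter((l["user_id"], l["artist_id"]) for l in listens)
    # rows shaped like the hypothetical playcounts_df
    return [
        {"user_id": u, "artist_id": a, "count": c}
        for (u, a), c in sorted(counts.items())
    ]

listens = [
    {"user_id": 1, "artist_id": "a1"},
    {"user_id": 1, "artist_id": "a1"},
    {"user_id": 1, "artist_id": "a2"},
    {"user_id": 2, "artist_id": "a1"},
]
playcounts = build_playcounts(listens)
```

In Spark this would be a single `groupBy("user_id", "artist_id").count()` over the listens dataframe.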

      • pristine___
        Yeah, right on the first point!
      • 2020-11-10 31540, 2020

      • sumedh joined the channel
      • 2020-11-10 31526, 2020

      • pristine___
        Artist stats, like?
      • 2020-11-10 31545, 2020

      • alastairp
        we already have a count of how many times users have listened to each artist
      • 2020-11-10 31500, 2020

      • alastairp
        (I used this data dump when I did artist similarity using some python tools)
      • 2020-11-10 31509, 2020

      • pristine___
        Yeah, but I think the range will differ. I might want to have artist stats for, like, the last 15 days; or even if it's the last 7 days, the 7 days of artist stats might be different from the 7 days I want to fetch data for. So I thought it would be good to keep it separate
      • 2020-11-10 31540, 2020

      • alastairp
        ok. I would keep this open as an option - it seems a bit silly to compute exactly the same data slightly differently in two different places. Perhaps we can modify the stats computation to also generate the data that we need here
      • 2020-11-10 31555, 2020

      • pristine___
        Yes.
      • 2020-11-10 31556, 2020

      • alastairp
        however, that's fine. For now we can go about it the way that you have suggested
      • 2020-11-10 31538, 2020

      • alastairp
        those dataframes are fine then, just make sure that their names are a bit more descriptive, and make sure that you add some more details to this document about how data from the listens df gets loaded into these ones
      • 2020-11-10 31551, 2020

      • alastairp
        I was looking at your comments about evaluation
      • 2020-11-10 31551, 2020

      • pristine___
        Sure
      • 2020-11-10 31501, 2020

      • alastairp
        did you look into the other projects that I sent you?
      • 2020-11-10 31549, 2020

      • pristine___
        I looked at the user similarity PR, and the project. Noticed some evaluation techniques but didn't understand them fully.
      • 2020-11-10 31542, 2020

      • alastairp
        especially take a look at the "Evaluation" section here: https://github.com/andrebola/mtlab-recsys/blob/ma…
      • 2020-11-10 31513, 2020

      • alastairp
        these are some good metrics to look into. You can search their names to learn more about them. There are implementations in this repository
      • 2020-11-10 31538, 2020
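Ranking metrics of the kind that repository's "Evaluation" section uses — precision@k and average precision are typical examples — can be sketched minimally like this (illustrative implementations, not the repository's own code):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations the user actually
    listened to (the 'relevant' set)."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    return sum(1 for item in top_k if item in relevant) / len(top_k)

def average_precision(recommended, relevant, k):
    """Mean of precision@i at every rank i where a hit occurs;
    averaging this over users gives MAP."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k)
```

For example, with recommendations `["a", "b", "c", "d"]` and relevant set `{"a", "c"}`, precision@4 is 0.5 and the average precision is (1 + 2/3) / 2.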

      • pristine___
        Yeah, thanks for the link
      • 2020-11-10 31504, 2020

      • alastairp
        you're right about the train/test split
      • 2020-11-10 31534, 2020

      • alastairp
        I will look for some more information about how to do splits in these types of models and see if I can find anything useful
      • 2020-11-10 31556, 2020

      • pristine___
        Yes, I will want to use that in recording recs too. Thanks
      • 2020-11-10 31507, 2020

      • alastairp
        next, you're looking at what input to give to the model
      • 2020-11-10 31525, 2020

      • alastairp
        either all artists by a user, or just their top artists
      • 2020-11-10 31532, 2020

      • pristine___
        Giving all artists by a user or just their top artists is not a good idea
      • 2020-11-10 31544, 2020

      • alastairp
        I'm not sure that the existing artist relations information is useful here - is that what you're suggesting? "Get artists similar to these artists using the artist relations dump"
      • 2020-11-10 31500, 2020

      • pristine___
        Okay, so
      • 2020-11-10 31510, 2020

      • pristine___
        We create the dfs, train the model
      • 2020-11-10 31521, 2020

      • pristine___
        The third step is to get the recommendations
      • 2020-11-10 31530, 2020

      • pristine___
        Generating recommendations is nothing but
      • 2020-11-10 31553, 2020

      • ruaok barges in
      • 2020-11-10 31512, 2020

      • ruaok
        I have one pie in the sky use case that I would like us to consider... for our own survival.
      • 2020-11-10 31513, 2020

      • pristine___
        Give some artist IDs to the recommender for a user and it will give scores to the artists
      • 2020-11-10 31531, 2020

      • pristine___
        Now, which artist IDs should we give as input?
      • 2020-11-10 31534, 2020

      • alastairp
        ruaok: go
      • 2020-11-10 31502, 2020

      • ruaok
        I'd love it if we could build a system so that a small music label could come in, feed all of their metadata into MB and then train a model to recommend tracks in their collection.
      • 2020-11-10 31543, 2020

      • ruaok
        take the model, host it on their own server and say: "what is your Spotify ID? Can we pull your last 50 listens and suggest which tracks you should listen to at our label?"
      • 2020-11-10 31509, 2020

      • ruaok
        this would be a massive win for us -- give us data, take recs and give us a little money for the effort.
      • 2020-11-10 31547, 2020

      • ruaok
        and pie in sky... I'd like to make this possible in 2021.
      • 2020-11-10 31531, 2020

      • _lucifer
        that should be possible right now i guess?
      • 2020-11-10 31539, 2020

      • alastairp
        yeah, that should be possible. we'd need to work out the missing step between 'add metadata' and 'generate recommendations', if we're using collab filtering, we'll need to wait until we have enough listens
      • 2020-11-10 31549, 2020

      • alastairp
        if their data is already in MB and people are listening to it, perfect
      • 2020-11-10 31534, 2020

      • alastairp
        pristine___: mm, that's definitely one way of doing recommendations, but you don't need to select the potential artists first
      • 2020-11-10 31542, 2020

      • ruaok
        _lucifer: not quite, no.
      • 2020-11-10 31502, 2020

      • alastairp
        see the repo that I linked previously, the "Analysis of the recommendations" section
      • 2020-11-10 31506, 2020

      • ruaok
        alastairp: agreed. this isn't a week long project.
      • 2020-11-10 31502, 2020

      • alastairp
        the returned list of tuples are "artists that are similar to all of the artists that this user has listened to". The True/False indicates if we know that the user has listened to that artist before
      • 2020-11-10 31537, 2020
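The tuple shape alastairp describes — each recommended artist paired with a True/False "already listened" flag — can be reproduced with a small helper (a sketch with hypothetical names; the real flag comes from the model's training data):

```python
def annotate_recommendations(scored_artists, listened):
    """Pair each (artist, score) from the model with a flag saying
    whether the user has already listened to that artist, ordered
    by descending score."""
    return [
        (artist, score, artist in listened)
        for artist, score in sorted(scored_artists, key=lambda x: -x[1])
    ]

recs = annotate_recommendations(
    [("Nirvana", 0.91), ("Hole", 0.78)], listened={"Nirvana"}
)
```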

      • pristine___
        Returned from where?
      • 2020-11-10 31554, 2020

      • alastairp
        that's data from the model
      • 2020-11-10 31504, 2020

      • pristine___
        If I follow, we have trained the model, we are on prepare candidate sets step now?
      • 2020-11-10 31526, 2020

      • sumedh has quit
      • 2020-11-10 31505, 2020

      • alastairp
        your candidate set is the list of artists for a user that you pass into the model?
      • 2020-11-10 31521, 2020

      • pristine___
        We give some input to the recommender, and the recommender, based on the trained model, assigns scores to the input
      • 2020-11-10 31523, 2020

      • pristine___
        Yes
      • 2020-11-10 31551, 2020

      • alastairp
        but you don't need to give an explicit input to the model
      • 2020-11-10 31518, 2020

      • alastairp
        that is, the model itself will give you some similar artists, ordered by score
      • 2020-11-10 31528, 2020

      • sumedh joined the channel
      • 2020-11-10 31542, 2020

      • alastairp
        you don't need to go and find some artists yourself
      • 2020-11-10 31505, 2020

      • pristine___
        Okay
      • 2020-11-10 31553, 2020

      • alastairp
        you can just say "given the artists that pristine___ listens to, what are some other similar artists that other people listen to?"
      • 2020-11-10 31553, 2020
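The idea of "given the artists this user listens to, what similar artists do other people listen to?" can be prototyped without a trained model at all, using cosine similarity over co-listen vectors (a stand-in for the latent-factor similarity an ALS model would give; all names here are hypothetical):

```python
from math import sqrt

def similar_artists(playcounts, seed_artists, top_n=3):
    """Rank artists by cosine similarity of their user-playcount
    vectors against the artists a user already listens to."""
    # artist -> {user: count}
    vecs = {}
    for user, artist, count in playcounts:
        vecs.setdefault(artist, {})[user] = count

    def cosine(a, b):
        dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
        na = sqrt(sum(v * v for v in a.values()))
        nb = sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    scores = {}
    for artist, vec in vecs.items():
        if artist in seed_artists:
            continue
        scores[artist] = max(cosine(vec, vecs[s]) for s in seed_artists)
    return sorted(scores.items(), key=lambda x: -x[1])[:top_n]

pc = [(1, "A", 5), (1, "B", 3), (2, "A", 2), (2, "B", 4), (3, "C", 6)]
top = similar_artists(pc, {"A"})
```

Here artist B, listened to by the same users as A, outranks C, which shares no listeners with A.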

      • pristine___
        Right, I understand. Candidate sets are more of a layer to refine the results. It's like: these are the artists the user might like, now give them a ranking. But yeah, what you say makes sense to me
      • 2020-11-10 31530, 2020

      • alastairp
        there is definitely value in using this kind of score lookup to be able to say "we think that you might like this artist 67%" when you are viewing an artist's page
      • 2020-11-10 31523, 2020

      • alastairp
        but I think that initially we should focus on directly using the model to generate the recommendations, and only look at expanding this set if we find that it's not big enough
      • 2020-11-10 31508, 2020

      • pristine___
        Makes sense.
      • 2020-11-10 31510, 2020

      • alastairp
        how long do you think it takes to do a cycle of: make a change, test something, get results?
      • 2020-11-10 31534, 2020

      • alastairp
        I'm just interested in knowing what your workflow is. Every time that you test something, do you deploy and run a job? Do you have a local setup with a smaller dataset?
      • 2020-11-10 31555, 2020

      • _lucifer
        what's the difference between the 3 different dumps here http://ftp.musicbrainz.org/pub/musicbrainz/listen…
      • 2020-11-10 31542, 2020

      • ruaok
        _lucifer: listens data, postgres data (user info, etc) and spark import friendly data.
      • 2020-11-10 31555, 2020

      • pristine___
        alastairp: it was easier when I could test stuff on the cluster. Right now, it depends on whether someone is available to merge and deploy. I generally add logging statements if I am stuck somewhere or if I feel that the result is not what I expect. Testing on big data of course helps. So right now, I write code, test on my machine on whatever little data I have, open a PR, merge, deploy and then understand if it is working as expected. 3-4 days maybe
      • 2020-11-10 31555, 2020

      • _lucifer
        ruaok: ah! ok. whats the diff between the listens data and spark import friendly data ? some schema difference or some data difference
      • 2020-11-10 31558, 2020

      • ruaok
        one is postgres, the other spark
      • 2020-11-10 31505, 2020

      • ruaok
        suited for import to each one...
      • 2020-11-10 31521, 2020

      • alastairp
        pristine___: right. I think that we should try and improve this workflow if we can, then. It would be great if you were able to try different parameters to models, or try different inputs to see the results easily, without having to wait for a review/deploy/test cycle
      • 2020-11-10 31531, 2020

      • alastairp
        even if that was on a smaller subset of the dataset
      • 2020-11-10 31510, 2020

      • _lucifer
        ruaok: understood. thanks!
      • 2020-11-10 31522, 2020

      • pristine___
        Yup
      • 2020-11-10 31532, 2020

      • alastairp
        let's keep this open as an idea, then.
      • 2020-11-10 31505, 2020

      • alastairp
        I spent some time a few months ago improving the local spark development workflow, but didn't get any further. I think that there are still some things that we could change here
      • 2020-11-10 31521, 2020

      • _lucifer
        for recs, in production spark only uses data from last 6 months right ?
      • 2020-11-10 31503, 2020

      • pristine___
        _lucifer: yes, we can change that window as we like
      • 2020-11-10 31504, 2020

      • alastairp
        pristine___: I strongly recommend that you run the mtlab-recsys notebook yourself. the test dataset is quite small, and you can reduce the number of model iterations too
      • 2020-11-10 31515, 2020

      • pristine___
        Will do.
      • 2020-11-10 31522, 2020

      • alastairp
        the repository is designed as an entry to this kind of task
      • 2020-11-10 31547, 2020

      • alastairp
        so have a play around and see if you can work out what different inputs give, and how everything fits together
      • 2020-11-10 31556, 2020

      • pristine___
        Yup
      • 2020-11-10 31502, 2020

      • alastairp
        I think that it will make things a lot clearer when you have to implement it in spark
      • 2020-11-10 31542, 2020

      • alastairp
        I have a feeling that we should be able to basically make a direct copy from the recsys repo -> spark, just changing the tools
      • 2020-11-10 31506, 2020

      • alastairp
        it does all of the same things, like making a mapping of usernames -> ids, and making the artist counts
      • 2020-11-10 31522, 2020
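The "mapping of usernames -> ids" step alastairp mentions is needed because matrix-factorization libraries index users and items by dense integers. A minimal sketch of that mapping (Spark would do the same with a `StringIndexer`-style transform):

```python
def build_index(values):
    """Map each distinct value (e.g. a username) to a dense integer
    id, in first-seen order, as factorization models require."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping

user_ids = build_index(["pristine___", "alastairp", "pristine___", "ruaok"])
```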

      • pristine___
        Oh.
      • 2020-11-10 31535, 2020

      • _lucifer
        pristine___: right so for local testing, the sample should be at most the last 6 months of data. that would probably be a 1.5 GB to 2 GB dump, i think.
      • 2020-11-10 31544, 2020
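Trimming a listens dump to a recent window, as discussed above for the 6-month production setting, amounts to a simple timestamp filter (pure-Python sketch with hypothetical field names; months approximated as 30 days):

```python
from datetime import datetime, timedelta

def recent_listens(listens, now, months=6):
    """Keep only listens within roughly the last `months` months
    (approximated as 30-day months for simplicity)."""
    cutoff = now - timedelta(days=30 * months)
    return [l for l in listens if l["listened_at"] >= cutoff]

now = datetime(2020, 11, 10)
listens = [
    {"track": "old", "listened_at": datetime(2020, 1, 1)},
    {"track": "new", "listened_at": datetime(2020, 10, 1)},
]
kept = recent_listens(listens, now)
```

In Spark this is the same idea: `listens_df.filter(col("listened_at") >= cutoff)`.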

      • pristine___
        I will play with it. Will help me understand stuff better
      • 2020-11-10 31550, 2020

      • alastairp
        take a look at the last item in the notebook too, which gives you artist -> other similar artist lookups, that should be easy for us to implement with the model, too
      • 2020-11-10 31524, 2020

      • alastairp
        please update the document, especially the parts about how the dataframes are generated from listen data and improve the names, and the section about the candidate sets
      • 2020-11-10 31538, 2020

      • pristine___
        _lucifer: you can change that from 6 months to maybe two if you don't have much data on your local machine, but yeah, your estimate looks good to me
      • 2020-11-10 31557, 2020

      • pristine___
        alastairp: yes, I will update
      • 2020-11-10 31558, 2020

      • _lucifer
        pristine___: if we create dumps of those sizes, we can easily host our own notebooks with kaggle/colab etc.
      • 2020-11-10 31505, 2020

      • pristine___
        Right
      • 2020-11-10 31515, 2020

      • _lucifer
        that would make prototyping really quick
      • 2020-11-10 31517, 2020

      • pristine___
        _lucifer: so I thought about it earlier
      • 2020-11-10 31522, 2020

      • alastairp
        perhaps split the candidate sets into two - one that uses existing user listens to generate similar artists, and an extension to look at your idea by using other data sources to find other potential artists
      • 2020-11-10 31532, 2020

      • pristine___
        I will come back to you after the discussion with alastairp
      • 2020-11-10 31538, 2020

      • _lucifer
        +1
      • 2020-11-10 31540, 2020

      • alastairp
        pristine___: did you also want to talk about your feedback work?
      • 2020-11-10 31504, 2020

      • pristine___
        Yes after the artist recs stuff
      • 2020-11-10 31522, 2020

      • alastairp
        after you have finished it, or after we have discussed it?
      • 2020-11-10 31525, 2020

      • pristine___
        Oh lol. After we have discussed the artist recs. I was thinking that I will start work on it right after this discussion, so we can talk about the roadmap and timeline as well
      • 2020-11-10 31505, 2020

      • alastairp
        oh, right.
      • 2020-11-10 31527, 2020

      • alastairp
        sure, let's continue with the discussion. I didn't see the lower down stuff
      • 2020-11-10 31537, 2020

      • alastairp
        schema - this is in the listenbrainz postgres database?
      • 2020-11-10 31554, 2020

      • pristine___
        It's not there now, but this is what I propose.
      • 2020-11-10 31559, 2020

      • pristine___
        Yes
      • 2020-11-10 31502, 2020

      • alastairp
        yes, right. that's what I mean
      • 2020-11-10 31522, 2020

      • alastairp
        you should describe what this `artist` column will be in more detail
      • 2020-11-10 31533, 2020

      • pristine___
        Right
      • 2020-11-10 31550, 2020

      • pristine___
        Will edit in the docs. It contains the artist recommendations
      • 2020-11-10 31501, 2020

      • alastairp
        will you only store one row for each user? or will you store results from multiple evaluations of the model?
      • 2020-11-10 31505, 2020

      • pristine___
        The recommended artist mbid/credit id and score
      • 2020-11-10 31509, 2020

      • alastairp
        in what format? what data?
      • 2020-11-10 31512, 2020

      • sumedh has quit
      • 2020-11-10 31536, 2020

      • pristine___
        It will be a list of dicts
      • 2020-11-10 31538, 2020

      • BrainzGit
        [listenbrainz-server] MonkeyDo opened pull request #1173 (master…LB-717): LB-717: Return array instead of object when no feedback https://github.com/metabrainz/listenbrainz-server…
      • 2020-11-10 31539, 2020

      • alastairp
        I see that you initially used json fields in some other tables because you weren't sure about the structure of the returned data. is that why you're doing it here?
      • 2020-11-10 31540, 2020

      • BrainzBot
        LB-717: Error thrown in loadFeedback function (React) https://tickets.metabrainz.org/browse/LB-717
      • 2020-11-10 31506, 2020

      • pristine___
        No, I think I am sure about it now
      • 2020-11-10 31516, 2020

      • alastairp
        if you know that the data will be (artist name, mbid, artistcredit_id, score), then maybe we could just add the columns directly
      • 2020-11-10 31534, 2020

      • alastairp
        will you store multiple sets of data, or just refresh it periodically?
      • 2020-11-10 31541, 2020

      • pristine___
        Oh that, that I am not sure now. For example
      • 2020-11-10 31523, 2020

      • pristine___
        In the recording recs we were initially just storing the track mbid, later we wanted the score too, having a json field makes it easy to update and store the data.
      • 2020-11-10 31508, 2020

      • pristine___
        I am thinking to update the data periodically, so like I have a list of dicts this week, the next week a fresh list of dicts will replace the older one
      • 2020-11-10 31511, 2020

      • pristine___
        Something like that
      • 2020-11-10 31555, 2020

      • alastairp
        there are some questions about exactly how you might want to do that - e.g. you could have multiple rows for a user
      • 2020-11-10 31559, 2020

      • alastairp
        and then when showing the data, you select the one that was added most recently
      • 2020-11-10 31528, 2020

      • alastairp
        you could even delete old ones after you add a new one, that means that the site will never break
      • 2020-11-10 31520, 2020
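The pattern alastairp sketches — insert a new row per generation, always read the newest, then delete the stale rows — can be illustrated with an in-memory SQLite table (purely illustrative; not the actual ListenBrainz schema):

```python
import sqlite3

# One row per (user, generation); readers pick the newest row, and
# old generations are deleted once a new one has landed, so the site
# never shows an empty result between runs.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE recommendation (
           user_id INTEGER,
           recs TEXT,          -- JSON blob of [{artist, score}, ...]
           created INTEGER     -- generation timestamp
       )"""
)
conn.execute("INSERT INTO recommendation VALUES (1, 'week-1 recs', 100)")
conn.execute("INSERT INTO recommendation VALUES (1, 'week-2 recs', 200)")

# Readers always show the most recently generated recs.
latest = conn.execute(
    "SELECT recs FROM recommendation WHERE user_id = ? "
    "ORDER BY created DESC LIMIT 1",
    (1,),
).fetchone()[0]

# After the new generation is committed, drop the stale rows.
conn.execute(
    "DELETE FROM recommendation WHERE user_id = ? AND created < "
    "(SELECT MAX(created) FROM recommendation WHERE user_id = ?)",
    (1, 1),
)
remaining = conn.execute(
    "SELECT COUNT(*) FROM recommendation WHERE user_id = 1"
).fetchone()[0]
```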

      • pristine___
        But why do we want to retain old data when we have new recs generated?
      • 2020-11-10 31521, 2020

      • sumedh joined the channel
      • 2020-11-10 31533, 2020

      • alastairp
        great question. that's what I'm asking
      • 2020-11-10 31540, 2020

      • pristine___
        Also, the site won't break I think. If we don't have recs we can show a message or something
      • 2020-11-10 31542, 2020

      • alastairp
        do we want it? is there anything interesting that we can do with it?
      • 2020-11-10 31556, 2020

      • pristine___
        Atm
      • 2020-11-10 31524, 2020

      • alastairp
        right, it depends what you mean by "break" - I would say that it's not nice if you're looking at recs, refresh and they disappear, refresh and there are new ones
      • 2020-11-10 31537, 2020

      • alastairp
        it would be better to refresh and see the new ones immediately
      • 2020-11-10 31542, 2020

      • pristine___
        I think we need it so that we don't repeat recommendations. For example, if this week's recs are kinda similar to last week's, I can look at the table and do something about it maybe
      • 2020-11-10 31506, 2020

      • pristine___
        > right, it depends what you mean by "break" - I would say that it's not nice if you're looking at recs, refresh and they disappear, refresh and there are new ones
      • 2020-11-10 31507, 2020

      • pristine___
        A sec
      • 2020-11-10 31508, 2020

      • BrainzGit
        [musicbrainz-server] reosarevok opened pull request #1781 (master…MBS-11212): MBS-11212: Serialize correct quality for JSON release https://github.com/metabrainz/musicbrainz-server/…