#metabrainz


      • ruaok
        I can do that, but how do I partition that data?
      • 2020-01-18 01832, 2020

      • pristine__
        It's not about testing
      • 2020-01-18 01853, 2020

      • pristine__
It is for anyone to try and run the recommendation engine on their local machine
      • 2020-01-18 01858, 2020

      • pristine__
        Basically use labs
      • 2020-01-18 01842, 2020

      • pristine__
        So the all-changes-mapping PR downloads huge data as of now. We discussed this a few weeks ago
      • 2020-01-18 01858, 2020

      • ruaok
        yes.
      • 2020-01-18 01804, 2020

      • pristine__
That is, first do it for the big data, and as a next step do it for the small data.
      • 2020-01-18 01812, 2020

      • pristine__
        So what I was thinking is
      • 2020-01-18 01845, 2020

      • pristine__
        We can put an option in the config that for small data chunks use this config value
      • 2020-01-18 01855, 2020
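The config-flag idea above could be sketched roughly like this. This is a minimal sketch with hypothetical names (`USE_SAMPLE_DATA` and the paths are illustrative, not actual ListenBrainz config keys):

```python
# Hypothetical sketch: a config flag that switches the import scripts
# between the full dumps and a small sample chunk. All names here are
# illustrative, not real listenbrainz-server config keys.

FULL_DUMP_PATH = "/data/listenbrainz/full"
SAMPLE_DUMP_PATH = "/data/listenbrainz/sample"

# contributors flip this to work with the small data chunk locally
USE_SAMPLE_DATA = True

def get_dump_path():
    """Return the dump directory the import scripts should read from."""
    return SAMPLE_DUMP_PATH if USE_SAMPLE_DATA else FULL_DUMP_PATH
```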

      • pristine__
        Now the question is where to get the small chunk for
      • 2020-01-18 01858, 2020

      • pristine__
        From*
      • 2020-01-18 01805, 2020

      • pristine__
        For that I was thinking
      • 2020-01-18 01821, 2020

      • pristine__
        To use latest incremental dump for time being
      • 2020-01-18 01829, 2020

      • pristine__
        And for mapping and relation
      • 2020-01-18 01830, 2020

      • ruaok
        > Now the question is where to get the small chunk for
      • 2020-01-18 01837, 2020

      • nav2002_ joined the channel
      • 2020-01-18 01839, 2020

      • ruaok
        this is what I was saying with the -test versions of relations, etc.
      • 2020-01-18 01856, 2020

      • pristine__
        Umm... I think we should not call it test
      • 2020-01-18 01806, 2020

      • pristine__
        But yeah, we both are on the same track :)
      • 2020-01-18 01811, 2020

      • pristine__
        So we can do a join?
      • 2020-01-18 01827, 2020

      • ruaok
we should use the same terminology MB uses. hang on.
      • 2020-01-18 01844, 2020

      • pristine__
        Oooo. Didn't know that. Cool
      • 2020-01-18 01857, 2020

      • ruaok
        sample.
      • 2020-01-18 01858, 2020

      • ruaok
      • 2020-01-18 01804, 2020

      • ruaok
        let's use the word sample.
      • 2020-01-18 01812, 2020

      • ruaok
        it solves exactly this test case.
      • 2020-01-18 01842, 2020

      • ruaok
        and then it makes for a nice starting point for us. I can adjust artist relations to only use artists that are in that dump.
      • 2020-01-18 01848, 2020

      • ruaok
        run full, only output sample.
      • 2020-01-18 01805, 2020
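The "run full, only output sample" step could look roughly like the sketch below. In Spark itself this maps to `DataFrame.sample(fraction=..., seed=...)`; this is a plain-Python illustration with hypothetical field names, just to show that a fixed seed gives a reproducible sample dump:

```python
import random

def sample_listens(listens, fraction=0.1, seed=42):
    """Deterministically draw a small sample from the full listen set,
    so repeated runs produce the same 'sample' dump."""
    rng = random.Random(seed)
    k = max(1, int(len(listens) * fraction))
    return rng.sample(listens, k)

# hypothetical listen records, only for illustration
full = [{"user": f"user{i}", "recording_msid": f"msid-{i}"} for i in range(1000)]
sample = sample_listens(full, fraction=0.01)
```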

      • ruaok
        not sure that will work very well, TBH.
      • 2020-01-18 01824, 2020

      • ruaok
        because recommendations are going to recommend outside of the set almost immediately.
      • 2020-01-18 01835, 2020

      • pristine__
        I know they won't. But rn it is difficult for contributors to know what is happening with data
      • 2020-01-18 01841, 2020

      • pristine__
        Do you have any other idea?
      • 2020-01-18 01850, 2020

      • ruaok
        not really no.
      • 2020-01-18 01852, 2020

      • pristine__
Maybe I am not able to think of one
      • 2020-01-18 01858, 2020

      • sumedh
        pristine_ thanks :)
      • 2020-01-18 01814, 2020

      • ruaok
        at best we can use the sample data sets to get the algs to run locally. but they will not produce very good results.
      • 2020-01-18 01821, 2020

      • pristine__
Because if someone wants to make a patch, they cannot really see their changes because there is no data in hdfs
      • 2020-01-18 01836, 2020

      • pristine__
        > at best we can use the sample data sets to get the algs to run locally. but they will not produce very good results.
      • 2020-01-18 01839, 2020

      • pristine__
        Exactly
      • 2020-01-18 01843, 2020

      • ruaok
also, let's split this discussion up into two separate points.
      • 2020-01-18 01853, 2020

      • ruaok
        work that involves spark and work that does not.
      • 2020-01-18 01812, 2020

      • pristine__
Also, we can keep that dump constant so that we don't have to update mappings and relations accordingly
      • 2020-01-18 01823, 2020

      • pristine__
        > work that involves spark and work that does not.
      • 2020-01-18 01834, 2020

      • pristine__
Cool, will be careful :)
      • 2020-01-18 01836, 2020

      • ruaok
        so, for someone wishing to build a rec engine using the new data sets on their own laptop, that should be fine.
      • 2020-01-18 01852, 2020

      • pristine__
        Yes
      • 2020-01-18 01806, 2020

      • ruaok
        that entails downloading some 20GB of data and importing it. that's just the cost of playing in this sandbox.
      • 2020-01-18 01818, 2020

      • pristine__
        True
      • 2020-01-18 01840, 2020

      • ruaok
        now, if someone wants to tinker on the data set that comes out of spark, we have an entirely different problem.
      • 2020-01-18 01844, 2020

      • ruaok
        with me so far?
      • 2020-01-18 01850, 2020

      • pristine__
        Yes
      • 2020-01-18 01814, 2020

      • ruaok
now it is difficult for people to set up their own spark clusters to do meaningful work.
      • 2020-01-18 01841, 2020

      • ruaok
        and this is where the guessing starts. we don't know:
      • 2020-01-18 01848, 2020

      • ruaok
        1) how many people will want to work on this.
      • 2020-01-18 01802, 2020

      • ruaok
        2) how big of a cluster we need.
      • 2020-01-18 01813, 2020

      • ruaok
        3) how busy that cluster will be.
      • 2020-01-18 01823, 2020

      • ruaok
        4) how much we can share it.
      • 2020-01-18 01805, 2020

      • pristine__
        With you
      • 2020-01-18 01810, 2020

      • ruaok
        given this vast quantity of unknowns, it is hard to direct future work, so I am inclined to wait and see how this demand for the cluster plays out.
      • 2020-01-18 01839, 2020

      • ruaok
        because if we do huge amounts of work, and no one wants to play, then we've wasted a huge amount of work.
      • 2020-01-18 01851, 2020

      • pristine__
        I totally agree.
      • 2020-01-18 01817, 2020

      • ruaok
        ok, now the really vague thinking starts. bear with me.
      • 2020-01-18 01834, 2020

      • ruaok
        1. We can ignore people willing to work on spark stuff. (not good)
      • 2020-01-18 01857, 2020

      • pristine__
        Phew.
      • 2020-01-18 01810, 2020

      • sarthak_jain
        =(
      • 2020-01-18 01810, 2020

      • pristine__
        Like new contributors?
      • 2020-01-18 01834, 2020

      • ruaok
        2. We can have you become the de-facto mentor for people wishing to work on spark stuff. They should get unit tests to pass on their code, but running the code should be done by you. initially.
      • 2020-01-18 01849, 2020

      • ruaok
        3. Once we trust contributors, we can give them accounts on leader.
      • 2020-01-18 01808, 2020

      • ruaok
        4. Create the ability to create sample data sets for everything.
      • 2020-01-18 01824, 2020

      • ruaok
        5. Create a small cluster just for testing with much looser access rules.
      • 2020-01-18 01840, 2020

      • pristine__
        The third point is a lil critical, but yes we might need people for that.
      • 2020-01-18 01852, 2020

      • pristine__
        Do we plan on GOOD project for labs?
      • 2020-01-18 01855, 2020

      • ruaok
        critical of?
      • 2020-01-18 01805, 2020

      • ruaok
        GOOD project?
      • 2020-01-18 01816, 2020

      • pristine__
        > critical of?
      • 2020-01-18 01853, 2020

      • pristine__
someone who really wants to get into spark core and not just write pyspark code. Writing code is different, running on a cluster is different
      • 2020-01-18 01805, 2020

      • pristine__
        > GOOD project?
      • 2020-01-18 01809, 2020

      • pristine__
        GOOD*
      • 2020-01-18 01814, 2020

      • pristine__
        GSOC*
      • 2020-01-18 01821, 2020

      • pristine__
        This auto correct
      • 2020-01-18 01824, 2020

      • pristine__
        Phew
      • 2020-01-18 01837, 2020

      • pristine__
        Many people have come in to try for GSOC in labs
      • 2020-01-18 01846, 2020

      • ruaok
        Consider the list above as a range of the spectrum of possibilities.
      • 2020-01-18 01851, 2020

      • pristine__
        And I don't have any concrete task for them
      • 2020-01-18 01856, 2020

      • pristine__
        I was thinning
      • 2020-01-18 01801, 2020

      • pristine__
        Thinking*
      • 2020-01-18 01819, 2020

      • ruaok
        some people will make it clear quickly that they deserve more access, so we give it to them.
      • 2020-01-18 01824, 2020

      • pristine__
That after this mapping PR, we will want to write code to fill the schema in lemmy
      • 2020-01-18 01843, 2020

      • pristine__
        Which doesn't need much access imo
      • 2020-01-18 01843, 2020

      • ruaok
        I guess over time, as the unknowns become knowns, we move from one end of the spectrum to the other.
      • 2020-01-18 01821, 2020

      • ruaok
        > Which doesn't need much access imo
      • 2020-01-18 01822, 2020

      • pristine__
        Okay. The second point in your list
      • 2020-01-18 01827, 2020

      • ruaok
        then this is a perfect task for someone to work on.
      • 2020-01-18 01844, 2020

      • ruaok
much like you, newcomers will need to work on a pile of non-spark things, in order for the spark things to work.
      • 2020-01-18 01852, 2020

      • ruaok
        such is the life of software engineering.
      • 2020-01-18 01810, 2020

      • pristine__
        I guess they will understand it from this conversation
      • 2020-01-18 01826, 2020

      • pristine__
        My near future task is to somehow show recommendation to LB website
      • 2020-01-18 01830, 2020

      • pristine__
        Even if they are bad.
      • 2020-01-18 01839, 2020

      • pristine__
        On*
      • 2020-01-18 01852, 2020

      • sarthak_jain
So, pristine__ do we have some task that I can begin with?
      • 2020-01-18 01800, 2020

      • pristine__
        sarthak_jain: wait
      • 2020-01-18 01803, 2020

      • ruaok
we should update the list of things to be done soon.
      • 2020-01-18 01810, 2020

      • ruaok
        sarthak_jain: follow along.
      • 2020-01-18 01821, 2020

      • ruaok
        a lot of this may not make sense yet, but it should soon.
      • 2020-01-18 01839, 2020

      • sarthak_jain
        With you :)
      • 2020-01-18 01837, 2020

      • pristine__
Yes. For that we need to fill the lemmy schema, which will be a very complex algo
      • 2020-01-18 01849, 2020

      • ruaok
        what is complex about it?
      • 2020-01-18 01857, 2020

      • pristine__
I am not sure about it anyhow. The schema itself is complex (at least for me :p). Complex on what songs to show, how to change them weekly, etc etc
      • 2020-01-18 01827, 2020

      • nav2002_ has quit
      • 2020-01-18 01839, 2020

      • ruaok
        this is exactly why I want to step back a bit and consider next steps.
      • 2020-01-18 01851, 2020

      • ruaok
your roadmap is still sort of a leftover from GSoC, methinks.
      • 2020-01-18 01802, 2020

      • ruaok
        and a lot has changed since then.
      • 2020-01-18 01831, 2020

      • pristine__
        Maybe :( but I am not able to think where to go from here.
      • 2020-01-18 01835, 2020

      • ruaok
        remember how I decided that we should create a suite of new data sets before we create fully running recommendation engines?
      • 2020-01-18 01846, 2020

      • pristine__
        Yeah. I can recall
      • 2020-01-18 01800, 2020

      • ruaok
        so, short term goals should be:
      • 2020-01-18 01806, 2020

      • ruaok
        0. Get tests done
      • 2020-01-18 01816, 2020

      • ruaok
        1. Write dumps of the collab filtering stuff.
      • 2020-01-18 01858, 2020

      • pristine__
        > 1. Write dumps of the collab filtering stuff.
      • 2020-01-18 01800, 2020

      • ruaok
        2. Write some example algorithms that show off our new data sets.
      • 2020-01-18 01818, 2020

      • ruaok
        3. Work to release some of these examples onto lemmy.
      • 2020-01-18 01819, 2020

      • pristine__
The test dumps you gonna filter? Just to confirm
      • 2020-01-18 01833, 2020

      • mukesh joined the channel
      • 2020-01-18 01842, 2020

      • ruaok
so, the push to lemmy, short of what iliekcomputers is doing for stats, isn't super important just yet.
      • 2020-01-18 01850, 2020

      • pristine__
        So what these data sets should look like? Do you have an image in mind?
      • 2020-01-18 01853, 2020

      • ruaok
which is good for having simpler tasks to hand.
      • 2020-01-18 01802, 2020

      • pristine__
> so, the push to lemmy, short of what iliekcomputers is doing for stats, isn't super important just yet.
      • 2020-01-18 01804, 2020

      • pristine__
        Right
      • 2020-01-18 01818, 2020

      • ruaok
        > The tests dumps you gonna filter? Just to confirm
      • 2020-01-18 01836, 2020

      • ruaok
        I was thinking of the collaborative filtering output. don't push to lemmy, but write to disk.
      • 2020-01-18 01849, 2020
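"Write to disk" could be as simple as dumping the collaborative-filtering output to a line-delimited JSON file. A minimal sketch, with a hypothetical record shape (`user_name` / `recording_msid` / `score` are illustrative, not an agreed format):

```python
import json
import os
import tempfile

def write_recommendations_dump(recommendations, out_dir):
    """Write collaborative-filtering output to disk as a line-delimited
    JSON dump instead of pushing it to the web server."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "cf_recommendations.jsonl")
    with open(path, "w") as f:
        for rec in recommendations:
            f.write(json.dumps(rec) + "\n")
    return path

# hypothetical CF output rows, for illustration only
recs = [
    {"user_name": "rob", "recording_msid": "msid-1", "score": 0.91},
    {"user_name": "rob", "recording_msid": "msid-2", "score": 0.87},
]
dump_path = write_recommendations_dump(recs, tempfile.mkdtemp())
```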

      • pristine__
        Oooo. Gotcha
      • 2020-01-18 01852, 2020

      • ruaok
        this task might be perfect for sarthak_jain.
      • 2020-01-18 01804, 2020

      • pristine__
        Yeah right.
      • 2020-01-18 01829, 2020

      • pristine__
        We need a schema for that. On what to store in hdfs
      • 2020-01-18 01830, 2020
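One possible shape for that per-row schema, purely as a strawman to brainstorm against (all field names are hypothetical, not an agreed design):

```python
from dataclasses import dataclass, asdict

# Hypothetical sketch of one row of CF output stored in HDFS.
# Field names are illustrative; the real schema would be agreed on first.
@dataclass
class RecommendationRow:
    user_name: str
    recording_msid: str
    score: float
    generated_week: str  # e.g. "2020-W03", so rows can be rotated weekly

row = RecommendationRow("rob", "msid-1", 0.91, "2020-W03")
```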

      • antara joined the channel
      • 2020-01-18 01840, 2020

      • pristine__
        That will be fun to brainstorm any time
      • 2020-01-18 01850, 2020

      • ruaok
        the idea then is having: AAR, Annoy, Collab output, and MSB->MSID mappings as publicly available data sets.
      • 2020-01-18 01816, 2020

      • pristine__
        >I was thinking of the collaborative filtering output. don't push to lemmy, but write to disk.
      • 2020-01-18 01836, 2020

      • pristine__
        We will then soon get rid of the HTML files.
      • 2020-01-18 01837, 2020

      • ruaok
        I'm not sure that the output of the collaborative filtering needs to go into/stay in spark. we need it on disk for people to play.
      • 2020-01-18 01840, 2020

      • pristine__
        :p
      • 2020-01-18 01803, 2020

      • pristine__
        Ummm.... disk?
      • 2020-01-18 01806, 2020

      • ruaok
        the HTML files haven't begun to be important yet!
      • 2020-01-18 01816, 2020

      • mukesh has quit
      • 2020-01-18 01817, 2020

      • ruaok
        disk -> data dumps.
      • 2020-01-18 01826, 2020

      • ruaok
        [one sec, brb]
      • 2020-01-18 01835, 2020

      • sarthak_jain
        pristine__ might have to help in making me understand the task exactly.
      • 2020-01-18 01844, 2020

      • sarthak_jain
        *helping
      • 2020-01-18 01848, 2020

      • sarthak_jain
=D
      • 2020-01-18 01818, 2020

      • pristine__
But with every patch I have to modify them a lil. For now I have turned off the flag which generates them in the script, but I really don't think they are useful. We can just keep one HTML file for people to see the recommendations and remove the ones that have query run time, since those were made for the community to understand spark. I think they have fulfilled their purpose