and then it makes for a nice starting point for us. I can adjust artist relations to only use artists that are in that dump.
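(Editor's sketch.) Restricting artist relations to the artists present in a fixed dump could look something like this minimal sketch; the tuple shape and MBID strings are invented for illustration, not the actual ListenBrainz layout:

```python
# Hypothetical sketch: keep only artist-relation pairs where both
# artists appear in a fixed sample dump. Field shapes are illustrative.

def filter_relations(relations, dump_artist_mbids):
    """Keep only relations whose two artists are both in the dump."""
    dump_set = set(dump_artist_mbids)
    return [
        (a, b, score)
        for (a, b, score) in relations
        if a in dump_set and b in dump_set
    ]

relations = [
    ("mbid-1", "mbid-2", 0.9),
    ("mbid-2", "mbid-3", 0.5),  # mbid-3 is outside the dump
]
print(filter_relations(relations, ["mbid-1", "mbid-2"]))
# [('mbid-1', 'mbid-2', 0.9)]
```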
2020-01-18 01848, 2020
ruaok
run the full pipeline, only output a sample.

2020-01-18 01805, 2020
ruaok
not sure that will work very well, TBH.
2020-01-18 01824, 2020
ruaok
because recommendations are going to recommend outside of the set almost immediately.
2020-01-18 01835, 2020
pristine__
I know they won't. But rn it is difficult for contributors to know what is happening with data
2020-01-18 01841, 2020
pristine__
Do you have any other idea?
2020-01-18 01850, 2020
ruaok
not really no.
2020-01-18 01852, 2020
pristine__
Maybe I am not able to think of one
2020-01-18 01858, 2020
sumedh
pristine_ thanks :)
2020-01-18 01814, 2020
ruaok
at best we can use the sample data sets to get the algs to run locally. but they will not produce very good results.
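(Editor's sketch.) A sample data set for local runs can at least be carved out reproducibly, so every contributor sees the same subset; a sketch with a fixed seed, where the listen structure is made up for illustration:

```python
import random

def sample_listens(listens, n, seed=42):
    """Draw a reproducible sample: the fixed seed means every
    contributor running this locally gets the identical subset."""
    rng = random.Random(seed)
    return rng.sample(listens, min(n, len(listens)))

# Fake listens just to show the shape; real dumps differ.
listens = [{"user": f"u{i}", "recording": f"r{i % 10}"} for i in range(1000)]
sample = sample_listens(listens, 50)
print(len(sample))  # 50
```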
2020-01-18 01821, 2020
pristine__
Because if someone wants to make a patch, they cannot really see their changes because there is no data in HDFS
2020-01-18 01836, 2020
pristine__
> at best we can use the sample data sets to get the algs to run locally. but they will not produce very good results.
2020-01-18 01839, 2020
pristine__
Exactly
2020-01-18 01843, 2020
ruaok
also, let's break this discussion up into two separate points.
2020-01-18 01853, 2020
ruaok
work that involves spark and work that does not.
2020-01-18 01812, 2020
pristine__
Also, we can keep that dump constant so that we don't have to update mappings and relations accordingly
2020-01-18 01823, 2020
pristine__
> work that involves spark and work that does not.
2020-01-18 01834, 2020
pristine__
Cool, will be careful :)
2020-01-18 01836, 2020
ruaok
so, for someone wishing to build a rec engine using the new data sets on their own laptop, that should be fine.
2020-01-18 01852, 2020
pristine__
Yes
2020-01-18 01806, 2020
ruaok
that entails downloading some 20GB of data and importing it. that's just the cost of playing in this sandbox.
2020-01-18 01818, 2020
pristine__
True
2020-01-18 01840, 2020
ruaok
now, if someone wants to tinker on the data set that comes out of spark, we have an entirely different problem.
2020-01-18 01844, 2020
ruaok
with me so far?
2020-01-18 01850, 2020
pristine__
Yes
2020-01-18 01814, 2020
ruaok
now it is difficult for people to set up their own spark clusters to do meaningful work.
2020-01-18 01841, 2020
ruaok
and this is where the guessing starts. we don't know:
2020-01-18 01848, 2020
ruaok
1) how many people will want to work on this.
2020-01-18 01802, 2020
ruaok
2) how big of a cluster we need.
2020-01-18 01813, 2020
ruaok
3) how busy that cluster will be.
2020-01-18 01823, 2020
ruaok
4) how much we can share it.
2020-01-18 01805, 2020
pristine__
With you
2020-01-18 01810, 2020
ruaok
given this vast quantity of unknowns, it is hard to direct future work, so I am inclined to wait and see how this demand for the cluster plays out.
2020-01-18 01839, 2020
ruaok
because if we do huge amounts of work, and no one wants to play, then we've wasted a huge amount of work.
2020-01-18 01851, 2020
pristine__
I totally agree.
2020-01-18 01817, 2020
ruaok
ok, now the really vague thinking starts. bear with me.
2020-01-18 01834, 2020
ruaok
1. We can ignore people willing to work on spark stuff. (not good)
2020-01-18 01857, 2020
pristine__
Phew.
2020-01-18 01810, 2020
sarthak_jain
=(
2020-01-18 01810, 2020
pristine__
Like new contributors?
2020-01-18 01834, 2020
ruaok
2. We can have you become the de-facto mentor for people wishing to work on spark stuff. They should get unit tests to pass on their code, but running the code should be done by you. initially.
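(Editor's sketch.) One way to make "unit tests pass locally, cluster runs later" workable is to keep the transformation logic in plain functions that never touch a SparkSession; this hypothetical example shows the idea, with invented names and data:

```python
def top_n_recommendations(scored, n=3):
    """Pure transformation: pick the n highest-scored recordings.

    Keeping this free of Spark APIs means contributors can unit-test
    it on any laptop; only the thin job wrapper needs the cluster.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [rec for rec, _score in ranked[:n]]

# A contributor can verify this locally before anyone runs it on leader:
scored = [("rec-a", 0.2), ("rec-b", 0.9), ("rec-c", 0.5), ("rec-d", 0.1)]
print(top_n_recommendations(scored))  # ['rec-b', 'rec-c', 'rec-a']
```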
2020-01-18 01849, 2020
ruaok
3. Once we trust contributors, we can give them accounts on leader.
2020-01-18 01808, 2020
ruaok
4. Create the ability to create sample data sets for everything.
2020-01-18 01824, 2020
ruaok
5. Create a small cluster just for testing with much looser access rules.
2020-01-18 01840, 2020
pristine__
The third point is a lil critical, but yes we might need people for that.
2020-01-18 01852, 2020
pristine__
Do we plan on GOOD project for labs?
2020-01-18 01855, 2020
ruaok
critical of?
2020-01-18 01805, 2020
ruaok
GOOD project?
2020-01-18 01816, 2020
pristine__
> critical of?
2020-01-18 01853, 2020
pristine__
someone who really wants to get into spark core and not just write pyspark code. Writing code is one thing; running it on a cluster is another
2020-01-18 01805, 2020
pristine__
> GOOD project?
2020-01-18 01809, 2020
pristine__
GOOD*
2020-01-18 01814, 2020
pristine__
GSOC*
2020-01-18 01821, 2020
pristine__
This auto correct
2020-01-18 01824, 2020
pristine__
Phew
2020-01-18 01837, 2020
pristine__
Many people have come in to try for GSOC in labs
2020-01-18 01846, 2020
ruaok
Consider the list above as a range of the spectrum of possibilities.
2020-01-18 01851, 2020
pristine__
And I don't have any concrete task for them
2020-01-18 01856, 2020
pristine__
I was thinning
2020-01-18 01801, 2020
pristine__
Thinking*
2020-01-18 01819, 2020
ruaok
some people will make it clear quickly that they deserve more access, so we give it to them.
2020-01-18 01824, 2020
pristine__
That after this mapping PR, we will want to write code to fill the schema in lemme
2020-01-18 01843, 2020
pristine__
Which doesn't need much access imo
2020-01-18 01843, 2020
ruaok
I guess over time, as the unknowns become knowns, we move from one end of the spectrum to the other.
2020-01-18 01821, 2020
ruaok
> Which doesn't need much access imo
2020-01-18 01822, 2020
pristine__
Okay. The second point in your list
2020-01-18 01827, 2020
ruaok
then this is a perfect task for someone to work on.
2020-01-18 01844, 2020
ruaok
much like you, newcomers will need to work on a pile of non-spark things in order for the spark things to work.
2020-01-18 01852, 2020
ruaok
such is the life of software engineering.
2020-01-18 01810, 2020
pristine__
I guess they will understand it from this conversation
2020-01-18 01826, 2020
pristine__
My near future task is to somehow show recommendation to LB website
2020-01-18 01830, 2020
pristine__
Even if they are bad.
2020-01-18 01839, 2020
pristine__
On*
2020-01-18 01852, 2020
sarthak_jain
So, pristine__ do we have some task that I can begin with?
2020-01-18 01800, 2020
pristine__
sarthak_jain: wait
2020-01-18 01803, 2020
ruaok
we should update the to-do list soon.
2020-01-18 01810, 2020
ruaok
sarthak_jain: follow along.
2020-01-18 01821, 2020
ruaok
a lot of this may not make sense yet, but it should soon.
2020-01-18 01839, 2020
sarthak_jain
With you :)
2020-01-18 01837, 2020
pristine__
Yes. For that we need to fill the lemme schema, which will be a very complex algo
2020-01-18 01849, 2020
ruaok
what is complex about it?
2020-01-18 01857, 2020
pristine__
I am not sure about it anyhow. The schema itself is complex (at least for me :p): complex in what songs to show, how to change them weekly, etc.
2020-01-18 01827, 2020
nav2002_ has quit
2020-01-18 01839, 2020
ruaok
this is exactly why I want to step back a bit and consider next steps.
2020-01-18 01851, 2020
ruaok
your roadmap is still sort of a leftover from GSoC, methinks.
2020-01-18 01802, 2020
ruaok
and a lot has changed since then.
2020-01-18 01831, 2020
pristine__
Maybe :( but I am not able to think where to go from here.
2020-01-18 01835, 2020
ruaok
remember how I decided that we should create a suite of new data sets before we create fully running recommendation engines?
2020-01-18 01846, 2020
pristine__
Yeah. I can recall
2020-01-18 01800, 2020
ruaok
so, short term goals should be:
2020-01-18 01806, 2020
ruaok
0. Get tests done
2020-01-18 01816, 2020
ruaok
1. Write dumps of the collab filtering stuff.
2020-01-18 01858, 2020
pristine__
> 1. Write dumps of the collab filtering stuff.
2020-01-18 01800, 2020
ruaok
2. Write some example algorithms that show off our new data sets.
2020-01-18 01818, 2020
ruaok
3. Work to release some of these examples onto lemmy.
2020-01-18 01819, 2020
pristine__
The test dumps you're gonna filter? Just to confirm
2020-01-18 01833, 2020
mukesh joined the channel
2020-01-18 01842, 2020
ruaok
so, the push to lemme, aside from what iliekcomputers is doing for stats, isn't super important just yet.
2020-01-18 01850, 2020
pristine__
So what these data sets should look like? Do you have an image in mind?
2020-01-18 01853, 2020
ruaok
which is good for having simpler tasks to hand.
2020-01-18 01802, 2020
pristine__
> so, the push to lemme, short of what iliekcomputers is doing for stats, isn't super important just yet.
2020-01-18 01804, 2020
pristine__
Right
2020-01-18 01818, 2020
ruaok
> The tests dumps you gonna filter? Just to confirm
2020-01-18 01836, 2020
ruaok
I was thinking of the collaborative filtering output. don't push to lemmy, but write to disk.
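(Editor's sketch.) "Write to disk" here could be as simple as serializing the per-user recommendations to a JSON-lines dump that anyone can download and replay; a sketch where the file name and record shape are invented:

```python
import json

def write_recommendations_dump(recs_per_user, path):
    """Write one JSON object per line: easy to stream, easy to diff,
    and consumable without Spark or HDFS on the reader's side."""
    with open(path, "w") as f:
        for user, recordings in recs_per_user.items():
            f.write(json.dumps({"user": user, "recordings": recordings}) + "\n")

write_recommendations_dump(
    {"alice": ["msid-1", "msid-2"], "bob": ["msid-3"]},
    "collab_filtering_dump.jsonl",
)
with open("collab_filtering_dump.jsonl") as f:
    print(sum(1 for _ in f))  # 2
```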
2020-01-18 01849, 2020
pristine__
Oooo. Gotcha
2020-01-18 01852, 2020
ruaok
this task might be perfect for sarthak_jain.
2020-01-18 01804, 2020
pristine__
Yeah right.
2020-01-18 01829, 2020
pristine__
We need a schema for that, on what to store in HDFS
2020-01-18 01830, 2020
antara joined the channel
2020-01-18 01840, 2020
pristine__
That will be fun to brainstorm any time
2020-01-18 01850, 2020
ruaok
the idea then is having: AAR, Annoy, Collab output, and MSB->MSID mappings as publicly available data sets.
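(Editor's sketch.) On the consumer side, a mapping data set like the MSB->MSID one could be loaded as a plain lookup table; the two-column CSV format and identifier strings below are assumptions for illustration, which may not match the real dump:

```python
import csv
import io

# Hypothetical two-column CSV dump mapping one identifier space
# to the other; real dump columns may differ.
dump = io.StringIO("msb-111,id-aaa\nmsb-222,id-bbb\n")

mapping = {row[0]: row[1] for row in csv.reader(dump)}
print(mapping["msb-111"])  # id-aaa
```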
2020-01-18 01816, 2020
pristine__
>I was thinking of the collaborative filtering output. don't push to lemmy, but write to disk.
2020-01-18 01836, 2020
pristine__
We will then soon get rid of the HTML files.
2020-01-18 01837, 2020
ruaok
I'm not sure that the output of the collaborative filtering needs to go into/stay in spark. we need it on disk for people to play.
2020-01-18 01840, 2020
pristine__
:p
2020-01-18 01803, 2020
pristine__
Ummm.... disk?
2020-01-18 01806, 2020
ruaok
the HTML files haven't begun to be important yet!
2020-01-18 01816, 2020
mukesh has quit
2020-01-18 01817, 2020
ruaok
disk -> data dumps.
2020-01-18 01826, 2020
ruaok
[one sec, brb]
2020-01-18 01835, 2020
sarthak_jain
pristine__ might have to help in making me understand the task exactly.
2020-01-18 01844, 2020
sarthak_jain
*helping
2020-01-18 01848, 2020
sarthak_jain
'=D
2020-01-18 01818, 2020
pristine__
But with every patch I have to modify them a lil. For now I have turned off the flag which generates them in the script, but I really don't think they are useful. We can just keep one HTML file for people to see the recommendations and remove the ones that have query run time, since those were made for the community to understand spark. I think they have fulfilled their purpose