#metabrainz


      • ruaok
        I can do that, but how do I partition that data?
      • 2020-01-18 01832, 2020

      • pristine__
        It's not about testing
      • 2020-01-18 01853, 2020

      • pristine__
It is for anyone to try and run the recommendation engine on their local machine
      • 2020-01-18 01858, 2020

      • pristine__
        Basically use labs
      • 2020-01-18 01842, 2020

      • pristine__
        So the all-changes-mapping PR downloads huge data as of now. We discussed this a few weeks ago
      • 2020-01-18 01858, 2020

      • ruaok
        yes.
      • 2020-01-18 01804, 2020

      • pristine__
That is, first do it for the big data, and as a next step do it for the small data.
      • 2020-01-18 01812, 2020

      • pristine__
        So what I was thinking is
      • 2020-01-18 01845, 2020

      • pristine__
        We can put an option in the config that for small data chunks use this config value
      • 2020-01-18 01855, 2020
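The config-flag idea above could be sketched roughly like this. This is a minimal sketch with hypothetical names (`USE_SAMPLE_DATA` and the paths are illustrative, not actual ListenBrainz config keys):

```python
# Hypothetical sketch: a config flag that switches the import scripts
# between the full dumps and a small sample chunk. All names here are
# illustrative, not real listenbrainz-server config keys.

FULL_DUMP_PATH = "/data/listenbrainz/full"
SAMPLE_DUMP_PATH = "/data/listenbrainz/sample"

# contributors flip this to work with the small data chunk locally
USE_SAMPLE_DATA = True

def get_dump_path():
    """Return the dump directory the import scripts should read from."""
    return SAMPLE_DUMP_PATH if USE_SAMPLE_DATA else FULL_DUMP_PATH
```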

      • pristine__
        Now the question is where to get the small chunk for
      • 2020-01-18 01858, 2020

      • pristine__
        From*
      • 2020-01-18 01805, 2020

      • pristine__
        For that I was thinking
      • 2020-01-18 01821, 2020

      • pristine__
        To use latest incremental dump for time being
      • 2020-01-18 01829, 2020

      • pristine__
        And for mapping and relation
      • 2020-01-18 01830, 2020

      • ruaok
        > Now the question is where to get the small chunk for
      • 2020-01-18 01837, 2020

      • nav2002_ joined the channel
      • 2020-01-18 01839, 2020

      • ruaok
        this is what I was saying with the -test versions of relations, etc.
      • 2020-01-18 01856, 2020

      • pristine__
        Umm... I think we should not call it test
      • 2020-01-18 01806, 2020

      • pristine__
        But yeah, we both are on the same track :)
      • 2020-01-18 01811, 2020

      • pristine__
        So we can do a join?
      • 2020-01-18 01827, 2020

      • ruaok
we should use the same terminology MB uses. hang on.
      • 2020-01-18 01844, 2020

      • pristine__
        Oooo. Didn't know that. Cool
      • 2020-01-18 01857, 2020

      • ruaok
        sample.
      • 2020-01-18 01858, 2020

      • ruaok
      • 2020-01-18 01804, 2020

      • ruaok
        let's use the word sample.
      • 2020-01-18 01812, 2020

      • ruaok
        it solves exactly this test case.
      • 2020-01-18 01842, 2020

      • ruaok
        and then it makes for a nice starting point for us. I can adjust artist relations to only use artists that are in that dump.
      • 2020-01-18 01848, 2020

      • ruaok
        run full, only output sample.
      • 2020-01-18 01805, 2020
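The "run full, only output sample" step could look roughly like the sketch below. In Spark itself this maps to `DataFrame.sample(fraction=..., seed=...)`; this is a plain-Python illustration with hypothetical field names, just to show that a fixed seed gives a reproducible sample dump:

```python
import random

def sample_listens(listens, fraction=0.1, seed=42):
    """Deterministically draw a small sample from the full listen set,
    so repeated runs produce the same 'sample' dump."""
    rng = random.Random(seed)
    k = max(1, int(len(listens) * fraction))
    return rng.sample(listens, k)

# hypothetical listen records, only for illustration
full = [{"user": f"user{i}", "recording_msid": f"msid-{i}"} for i in range(1000)]
sample = sample_listens(full, fraction=0.01)
```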

      • ruaok
        not sure that will work very well, TBH.
      • 2020-01-18 01824, 2020

      • ruaok
        because recommendations are going to recommend outside of the set almost immediately.
      • 2020-01-18 01835, 2020

      • pristine__
        I know they won't. But rn it is difficult for contributors to know what is happening with data
      • 2020-01-18 01841, 2020

      • pristine__
        Do you have any other idea?
      • 2020-01-18 01850, 2020

      • ruaok
        not really no.
      • 2020-01-18 01852, 2020

      • pristine__
Maybe I am not able to think of one
      • 2020-01-18 01858, 2020

      • sumedh
        pristine_ thanks :)
      • 2020-01-18 01814, 2020

      • ruaok
        at best we can use the sample data sets to get the algs to run locally. but they will not produce very good results.
      • 2020-01-18 01821, 2020

      • pristine__
Because if someone wants to make a patch, they cannot really see their changes because there is no data in hdfs
      • 2020-01-18 01836, 2020

      • pristine__
        > at best we can use the sample data sets to get the algs to run locally. but they will not produce very good results.
      • 2020-01-18 01839, 2020

      • pristine__
        Exactly
      • 2020-01-18 01843, 2020

      • ruaok
also, let's split this discussion up into two separate points.
      • 2020-01-18 01853, 2020

      • ruaok
        work that involves spark and work that does not.
      • 2020-01-18 01812, 2020

      • pristine__
Also, we can keep that dump constant so that we don't have to update mappings and relations accordingly
      • 2020-01-18 01823, 2020

      • pristine__
        > work that involves spark and work that does not.
      • 2020-01-18 01834, 2020

      • pristine__
Cool, will be careful :)
      • 2020-01-18 01836, 2020

      • ruaok
        so, for someone wishing to build a rec engine using the new data sets on their own laptop, that should be fine.
      • 2020-01-18 01852, 2020

      • pristine__
        Yes
      • 2020-01-18 01806, 2020

      • ruaok
        that entails downloading some 20GB of data and importing it. that's just the cost of playing in this sandbox.
      • 2020-01-18 01818, 2020

      • pristine__
        True
      • 2020-01-18 01840, 2020

      • ruaok
        now, if someone wants to tinker on the data set that comes out of spark, we have an entirely different problem.
      • 2020-01-18 01844, 2020

      • ruaok
        with me so far?
      • 2020-01-18 01850, 2020

      • pristine__
        Yes
      • 2020-01-18 01814, 2020

      • ruaok
now it is difficult for people to set up their own spark clusters to do meaningful work.
      • 2020-01-18 01841, 2020

      • ruaok
        and this is where the guessing starts. we don't know:
      • 2020-01-18 01848, 2020

      • ruaok
        1) how many people will want to work on this.
      • 2020-01-18 01802, 2020

      • ruaok
        2) how big of a cluster we need.
      • 2020-01-18 01813, 2020

      • ruaok
        3) how busy that cluster will be.
      • 2020-01-18 01823, 2020

      • ruaok
        4) how much we can share it.
      • 2020-01-18 01805, 2020

      • pristine__
        With you
      • 2020-01-18 01810, 2020

      • ruaok
        given this vast quantity of unknowns, it is hard to direct future work, so I am inclined to wait and see how this demand for the cluster plays out.
      • 2020-01-18 01839, 2020

      • ruaok
        because if we do huge amounts of work, and no one wants to play, then we've wasted a huge amount of work.
      • 2020-01-18 01851, 2020

      • pristine__
        I totally agree.
      • 2020-01-18 01817, 2020

      • ruaok
        ok, now the really vague thinking starts. bear with me.
      • 2020-01-18 01834, 2020

      • ruaok
        1. We can ignore people willing to work on spark stuff. (not good)
      • 2020-01-18 01857, 2020

      • pristine__
        Phew.
      • 2020-01-18 01810, 2020

      • sarthak_jain
        =(
      • 2020-01-18 01810, 2020

      • pristine__
        Like new contributors?
      • 2020-01-18 01834, 2020

      • ruaok
        2. We can have you become the de-facto mentor for people wishing to work on spark stuff. They should get unit tests to pass on their code, but running the code should be done by you. initially.
      • 2020-01-18 01849, 2020

      • ruaok
        3. Once we trust contributors, we can give them accounts on leader.
      • 2020-01-18 01808, 2020

      • ruaok
        4. Create the ability to create sample data sets for everything.
      • 2020-01-18 01824, 2020

      • ruaok
        5. Create a small cluster just for testing with much looser access rules.
      • 2020-01-18 01840, 2020

      • pristine__
        The third point is a lil critical, but yes we might need people for that.
      • 2020-01-18 01852, 2020

      • pristine__
        Do we plan on GOOD project for labs?
      • 2020-01-18 01855, 2020

      • ruaok
        critical of?
      • 2020-01-18 01805, 2020

      • ruaok
        GOOD project?
      • 2020-01-18 01816, 2020

      • pristine__
        > critical of?
      • 2020-01-18 01853, 2020

      • pristine__
someone who really wants to get into spark core and not just write pyspark code. Writing code is different, running on a cluster is different
      • 2020-01-18 01805, 2020

      • pristine__
        > GOOD project?
      • 2020-01-18 01809, 2020

      • pristine__
        GOOD*
      • 2020-01-18 01814, 2020

      • pristine__
        GSOC*
      • 2020-01-18 01821, 2020

      • pristine__
        This auto correct
      • 2020-01-18 01824, 2020

      • pristine__
        Phew
      • 2020-01-18 01837, 2020

      • pristine__
        Many people have come in to try for GSOC in labs
      • 2020-01-18 01846, 2020

      • ruaok
        Consider the list above as a range of the spectrum of possibilities.
      • 2020-01-18 01851, 2020

      • pristine__
        And I don't have any concrete task for them
      • 2020-01-18 01856, 2020

      • pristine__
        I was thinning
      • 2020-01-18 01801, 2020

      • pristine__
        Thinking*
      • 2020-01-18 01819, 2020

      • ruaok
        some people will make it clear quickly that they deserve more access, so we give it to them.
      • 2020-01-18 01824, 2020

      • pristine__
That after this mapping PR, we will want to write code to fill the schema in lemmy
      • 2020-01-18 01843, 2020

      • pristine__
        Which doesn't need much access imo
      • 2020-01-18 01843, 2020

      • ruaok
        I guess over time, as the unknowns become knowns, we move from one end of the spectrum to the other.
      • 2020-01-18 01821, 2020

      • ruaok
        > Which doesn't need much access imo
      • 2020-01-18 01822, 2020

      • pristine__
        Okay. The second point in your list
      • 2020-01-18 01827, 2020

      • ruaok
        then this is a perfect task for someone to work on.
      • 2020-01-18 01844, 2020

      • ruaok
much like you, newcomers will need to work on a pile of non-spark things, in order for the spark things to work.
      • 2020-01-18 01852, 2020

      • ruaok
        such is the life of software engineering.
      • 2020-01-18 01810, 2020

      • pristine__
        I guess they will understand it from this conversation
      • 2020-01-18 01826, 2020

      • pristine__
        My near future task is to somehow show recommendation to LB website
      • 2020-01-18 01830, 2020

      • pristine__
        Even if they are bad.
      • 2020-01-18 01839, 2020

      • pristine__
        On*
      • 2020-01-18 01852, 2020

      • sarthak_jain
So, pristine__ do we have some task that I can begin with?
      • 2020-01-18 01800, 2020

      • pristine__
        sarthak_jain: wait
      • 2020-01-18 01803, 2020

      • ruaok
we should update the list of things to be done soon.
      • 2020-01-18 01810, 2020

      • ruaok
        sarthak_jain: follow along.
      • 2020-01-18 01821, 2020

      • ruaok
        a lot of this may not make sense yet, but it should soon.
      • 2020-01-18 01839, 2020

      • sarthak_jain
        With you :)
      • 2020-01-18 01837, 2020

      • pristine__
Yes. For that we need to fill the lemmy schema, which will be a very complex algo
      • 2020-01-18 01849, 2020

      • ruaok
        what is complex about it?
      • 2020-01-18 01857, 2020

      • pristine__
I am not sure about it anyhow. The schema itself is complex (at least for me :p). Complex on what songs to show, how to change them weekly, etc etc
      • 2020-01-18 01827, 2020

      • nav2002_ has quit
      • 2020-01-18 01839, 2020

      • ruaok
        this is exactly why I want to step back a bit and consider next steps.
      • 2020-01-18 01851, 2020

      • ruaok
your roadmap is still sort of a leftover from GSoC, methinks.
      • 2020-01-18 01802, 2020

      • ruaok
        and a lot has changed since then.
      • 2020-01-18 01831, 2020

      • pristine__
        Maybe :( but I am not able to think where to go from here.
      • 2020-01-18 01835, 2020

      • ruaok
        remember how I decided that we should create a suite of new data sets before we create fully running recommendation engines?
      • 2020-01-18 01846, 2020

      • pristine__
        Yeah. I can recall
      • 2020-01-18 01800, 2020

      • ruaok
        so, short term goals should be:
      • 2020-01-18 01806, 2020

      • ruaok
        0. Get tests done
      • 2020-01-18 01816, 2020

      • ruaok
        1. Write dumps of the collab filtering stuff.
      • 2020-01-18 01858, 2020

      • pristine__
        > 1. Write dumps of the collab filtering stuff.
      • 2020-01-18 01800, 2020

      • ruaok
        2. Write some example algorithms that show off our new data sets.
      • 2020-01-18 01818, 2020

      • ruaok
        3. Work to release some of these examples onto lemmy.
      • 2020-01-18 01819, 2020

      • pristine__
The test dumps you gonna filter? Just to confirm
      • 2020-01-18 01833, 2020

      • mukesh joined the channel
      • 2020-01-18 01842, 2020

      • ruaok
so, the push to lemmy, short of what iliekcomputers is doing for stats, isn't super important just yet.
      • 2020-01-18 01850, 2020

      • pristine__
        So what these data sets should look like? Do you have an image in mind?
      • 2020-01-18 01853, 2020

      • ruaok
which is good for having simpler tasks to hand.
      • 2020-01-18 01802, 2020

      • pristine__
> so, the push to lemmy, short of what iliekcomputers is doing for stats, isn't super important just yet.
      • 2020-01-18 01804, 2020

      • pristine__
        Right
      • 2020-01-18 01818, 2020

      • ruaok
        > The tests dumps you gonna filter? Just to confirm
      • 2020-01-18 01836, 2020

      • ruaok
        I was thinking of the collaborative filtering output. don't push to lemmy, but write to disk.
      • 2020-01-18 01849, 2020
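"Write to disk" could be as simple as dumping the collaborative-filtering output to a line-delimited JSON file. A minimal sketch, with a hypothetical record shape (`user_name` / `recording_msid` / `score` are illustrative, not an agreed format):

```python
import json
import os
import tempfile

def write_recommendations_dump(recommendations, out_dir):
    """Write collaborative-filtering output to disk as a line-delimited
    JSON dump instead of pushing it to the web server."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "cf_recommendations.jsonl")
    with open(path, "w") as f:
        for rec in recommendations:
            f.write(json.dumps(rec) + "\n")
    return path

# hypothetical CF output rows, for illustration only
recs = [
    {"user_name": "rob", "recording_msid": "msid-1", "score": 0.91},
    {"user_name": "rob", "recording_msid": "msid-2", "score": 0.87},
]
dump_path = write_recommendations_dump(recs, tempfile.mkdtemp())
```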

      • pristine__
        Oooo. Gotcha
      • 2020-01-18 01852, 2020

      • ruaok
        this task might be perfect for sarthak_jain.
      • 2020-01-18 01804, 2020

      • pristine__
        Yeah right.
      • 2020-01-18 01829, 2020

      • pristine__
        We need a schema for that. On what to store in hdfs
      • 2020-01-18 01830, 2020
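One possible shape for that per-row schema, purely as a strawman to brainstorm against (all field names are hypothetical, not an agreed design):

```python
from dataclasses import dataclass, asdict

# Hypothetical sketch of one row of CF output stored in HDFS.
# Field names are illustrative; the real schema would be agreed on first.
@dataclass
class RecommendationRow:
    user_name: str
    recording_msid: str
    score: float
    generated_week: str  # e.g. "2020-W03", so rows can be rotated weekly

row = RecommendationRow("rob", "msid-1", 0.91, "2020-W03")
```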

      • antara joined the channel
      • 2020-01-18 01840, 2020

      • pristine__
        That will be fun to brainstorm any time
      • 2020-01-18 01850, 2020

      • ruaok
        the idea then is having: AAR, Annoy, Collab output, and MSB->MSID mappings as publicly available data sets.
      • 2020-01-18 01816, 2020

      • pristine__
        >I was thinking of the collaborative filtering output. don't push to lemmy, but write to disk.
      • 2020-01-18 01836, 2020

      • pristine__
        We will then soon get rid of the HTML files.
      • 2020-01-18 01837, 2020

      • ruaok
        I'm not sure that the output of the collaborative filtering needs to go into/stay in spark. we need it on disk for people to play.
      • 2020-01-18 01840, 2020

      • pristine__
        :p
      • 2020-01-18 01803, 2020

      • pristine__
        Ummm.... disk?
      • 2020-01-18 01806, 2020

      • ruaok
        the HTML files haven't begun to be important yet!
      • 2020-01-18 01816, 2020

      • mukesh has quit
      • 2020-01-18 01817, 2020

      • ruaok
        disk -> data dumps.
      • 2020-01-18 01826, 2020

      • ruaok
        [one sec, brb]
      • 2020-01-18 01835, 2020

      • sarthak_jain
        pristine__ might have to help in making me understand the task exactly.
      • 2020-01-18 01844, 2020

      • sarthak_jain
        *helping
      • 2020-01-18 01848, 2020

      • sarthak_jain
=D
      • 2020-01-18 01818, 2020

      • pristine__
But with every patch I have to modify them a lil. For now I have turned off the flag which generates them in the script, but I really don't think they are useful. We can just keep one HTML file for people to see the recommendations and remove the ones that have query run time, since those were made for the community to understand spark. I think they have fulfilled their purpose