#metabrainz

9:47 AM
ruaok

I'd like to get this giant PR merged sooner than later to avoid messes down the road.

2020-07-30 21244, 2020

9:47 AM
ruaok

(the PR is at least easy to read. :) )

2020-07-30 21245, 2020

9:51 AM
yvanzo

alastairp: yes

2020-07-30 21208, 2020

9:52 AM
alastairp

yvanzo: perfect. my previous tickets are already there, so nothing more to do for me. thanks

2020-07-30 21214, 2020

9:52 AM
alastairp

setup worked well last night

2020-07-30 21241, 2020

9:55 AM
ruaok

iliekcomputers: did you just do something? builds passed now. heh.

2020-07-30 21243, 2020

9:55 AM
ruaok

https://github.com/metabrainz/listenbrainz-server… is ready then.

2020-07-30 21200, 2020

9:56 AM
iliekcomputers

i didn't

2020-07-30 21207, 2020

9:56 AM
ruaok

how the??

2020-07-30 21225, 2020

9:56 AM
ruaok

ok, well, the PR is happy.

2020-07-30 21238, 2020

9:56 AM
iliekcomputers

>+3 −1,187

2020-07-30 21240, 2020

9:56 AM
iliekcomputers

nice

2020-07-30 21220, 2020

10:01 AM
alastairp

who should I talk to about spark setup/scripts? iliekcomputers? ruaok? pristine___?

2020-07-30 21234, 2020

10:01 AM
ruaok

not i

2020-07-30 21248, 2020

10:01 AM
alastairp

the setup instructions for local dev as they stand don't work, so we'll need to work on that

2020-07-30 21217, 2020

10:02 AM
iliekcomputers

they don't?

2020-07-30 21219, 2020

10:02 AM
iliekcomputers

https://listenbrainz.readthedocs.io/en/production…

2020-07-30 21224, 2020

10:02 AM
iliekcomputers

https://listenbrainz.readthedocs.io/en/production…

2020-07-30 21239, 2020

10:03 AM
alastairp

and the various scripts lying around the repo? run.sh, config.sh.sample, docker/*.sh ?

2020-07-30 21247, 2020

10:03 AM
alastairp

are these all for production? or unused?

2020-07-30 21224, 2020

10:04 AM
iliekcomputers

most are for production.

2020-07-30 21252, 2020

10:04 AM
iliekcomputers

docker/*.sh could definitely be consolidated, but they don't matter for local dev

2020-07-30 21202, 2020

10:05 AM
alastairp

would be nice to use a similar trick that yvanzo has in musicbrainz-docker to add overlay docker-compose files to make the spark reader container

2020-07-30 21206, 2020

10:05 AM
alastairp

mmm, ok. I'll look again

2020-07-30 21222, 2020

10:05 AM
alastairp

iliekcomputers: did you see my comment about dumps? I recall that you mentioned something a few days ago

2020-07-30 21249, 2020

10:05 AM
iliekcomputers

right, i tried importing the public dump using the manage command 2-3 years ago

2020-07-30 21204, 2020

10:06 AM
iliekcomputers

and surprise surprise, it didn't work :P

2020-07-30 21223, 2020

10:06 AM
iliekcomputers

i think manual COPY FROMs would still work, but I have a ticket open for fixing this.

2020-07-30 21247, 2020

10:06 AM
alastairp

so, that is to say it's not possible to load a public dump at the moment?

2020-07-30 21257, 2020

10:06 AM
alastairp

2-3 _years_ ?

2020-07-30 21206, 2020

10:07 AM
iliekcomputers

a public postgres dump

2020-07-30 21247, 2020

10:07 AM
iliekcomputers

yeah, i wrote the initial data dump code at least 2 years ago

2020-07-30 21206, 2020

10:08 AM
iliekcomputers

i tried to import a few weeks ago

2020-07-30 21213, 2020

10:08 AM
iliekcomputers

sorry, me english bad

2020-07-30 21220, 2020

10:08 AM
alastairp

got it

2020-07-30 21201, 2020

10:09 AM
alastairp

so, if I'm going to be working through this in the same environment, it'd be great to try and fill in these gaps

2020-07-30 21229, 2020

10:09 AM
alastairp

make sure that an external person can do the setup, import some data, push it into local spark, build some models, etc

2020-07-30 21217, 2020

10:10 AM
ruaok

ishaanshah: iliekcomputers : https://labs.api.listenbrainz.org

2020-07-30 21230, 2020

10:10 AM
iliekcomputers

alastairp: i'm pretty sure that should be possible right now.

2020-07-30 21234, 2020

10:10 AM
alastairp

so, I have an lb env set up now. next step is data. can you give a brief description of what data is available, and the process for loading it?

2020-07-30 21235, 2020

10:10 AM
ruaok

please update your code to use this URL from now on.

2020-07-30 21241, 2020

10:10 AM
alastairp

I'll fill in some missing docs if necessary

2020-07-30 21244, 2020

10:10 AM
iliekcomputers

the listens data can get imported into spark.

2020-07-30 21249, 2020

10:10 AM
alastairp

(take your time, this evening after work would be fine)

2020-07-30 21212, 2020

10:11 AM
ishaanshah

ruaok: noice!

2020-07-30 21218, 2020

10:11 AM
ishaanshah

I will update the code

2020-07-30 21257, 2020

10:11 AM
ishaanshah

ruaok: we need msid->mbid for artists too

2020-07-30 21259, 2020

10:11 AM
iliekcomputers

alastairp: i'd suggest going through the steps here once: https://listenbrainz.readthedocs.io/en/production…, if something doesn't work, we can fix the docs. but that should set you up with a valid data dump with listens in spark.

2020-07-30 21214, 2020

10:12 AM
ruaok

ishaanshah: ah, ok.

2020-07-30 21219, 2020

10:12 AM
ishaanshah

https://labs.api.listenbrainz.org/artist-credit-f…

2020-07-30 21228, 2020

10:12 AM
ishaanshah

^ this is not needed

2020-07-30 21235, 2020

10:12 AM
alastairp

right, so the process is to load a public spark dump into spark?

2020-07-30 21236, 2020

10:12 AM
iliekcomputers

at that point, this explains how we send requests to spark: https://listenbrainz.readthedocs.io/en/production…

2020-07-30 21241, 2020

10:12 AM
iliekcomputers

alastairp: yes.

2020-07-30 21251, 2020

10:12 AM
alastairp

rather than load a public data dump into timescale and then ship to spark?

2020-07-30 21256, 2020

10:12 AM
iliekcomputers

yes.

2020-07-30 21216, 2020

10:13 AM
alastairp

ok. I'll try with this process first then

2020-07-30 21226, 2020

10:13 AM
iliekcomputers

ishaanshah will also be able to help if you have questions.

2020-07-30 21201, 2020

10:14 AM
iliekcomputers

pointing out things missing in the docs etc would be very much appreciated :)

2020-07-30 21232, 2020

10:15 AM
MajorLurker joined the channel

2020-07-30 21227, 2020

10:19 AM
ruaok

zas: another one, matching the previous one: https://github.com/metabrainz/docker-server-confi…

2020-07-30 21245, 2020

10:19 AM
MajorLurker has quit

2020-07-30 21237, 2020

10:21 AM
zas

ruaok: reviewed, lgtm

2020-07-30 21245, 2020

10:21 AM
ruaok

thanks!

2020-07-30 21216, 2020

10:25 AM
alastairp

iliekcomputers: what creates the metabrainz/hadoop-yarn, metabrainz/spark-master, and metabrainz/spark-worker images?

2020-07-30 21225, 2020

10:26 AM
alastairp

ah, hadoop-cluster-docker

2020-07-30 21218, 2020

10:38 AM
ruaok

shit. :(

2020-07-30 21256, 2020

10:38 AM
ruaok

ishaanshah: I didn't know you need the artist_msid lookup to be in production. that's the messybrainz mapping which is a lot harder to put into production.

2020-07-30 21226, 2020

10:41 AM
ruaok

for now, keep using the one on bono until I figure out what to do.

2020-07-30 21215, 2020

10:45 AM
ishaanshah

ruaok: sure will do

2020-07-30 21246, 2020

10:57 AM
BrainzGit

[listenbrainz-server] mayhem opened pull request #996 (master…update-dump-docs): Update dump docs to reflect new timescale based dumps https://github.com/metabrainz/listenbrainz-server…

2020-07-30 21228, 2020

11:11 AM
diru1100

Morning!!!

2020-07-30 21220, 2020

11:14 AM
diru1100

yvanzo: i have changed the draft name to v-0.1. The files needed are stored as assets

2020-07-30 21254, 2020

11:22 AM
alastairp

anyone seen the follow server fail to start?

2020-07-30 21254, 2020

11:22 AM
alastairp

follow_server_1 | 2020-07-30 11:22:03,996 CRITICAL Could not get addresses to use: [Errno -3] Lookup timed out (rabbitmq)

2020-07-30 21212, 2020

11:23 AM
alastairp

looks like it can't lookup `rabbitmq` hostname, but this service is running

2020-07-30 21203, 2020

11:25 AM
alastairp

https://www.irccloud.com/pastebin/kWfHBChy/

2020-07-30 21225, 2020

11:25 AM
alastairp

loading a new shell and pinging rabbitmq works as expected... :/

2020-07-30 21213, 2020
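A lookup that fails at startup but works from a later shell is sometimes a timing problem: the service tries to resolve its dependency before name resolution is ready. The log does not confirm that this was the cause here, so the following is only a sketch of a retry wrapper under that assumption. The name `wait_for_host`, its parameters, and the use of port 5672 (RabbitMQ's default AMQP port) are illustrative and not taken from the listenbrainz-server code.

```python
import socket
import time


def wait_for_host(hostname, port, attempts=5, delay=1.0,
                  resolver=socket.getaddrinfo, sleep=time.sleep):
    """Retry a hostname lookup until it succeeds or attempts run out.

    Returns whatever the resolver returns; re-raises the last error
    if every attempt fails. `resolver` and `sleep` are injectable so
    the function can be exercised without a network.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return resolver(hostname, port)
        except OSError as err:  # socket.gaierror is a subclass of OSError
            last_error = err
            if attempt < attempts:
                sleep(delay * attempt)  # linear backoff between tries
    raise last_error
```

Calling `wait_for_host("rabbitmq", 5672)` before opening the connection would give the hostname a few seconds to become resolvable instead of failing on the first attempt.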

11:28 AM
sumedh joined the channel

2020-07-30 21250, 2020

11:32 AM
BrainzGit

[listenbrainz-server] alastair merged pull request #994 (master…develop-improvements): develop.sh improvements https://github.com/metabrainz/listenbrainz-server…

2020-07-30 21202, 2020

11:33 AM
alastairp

thanks for the review ruaok

2020-07-30 21236, 2020

11:33 AM
ruaok

Thanks for the PR!

2020-07-30 21212, 2020

11:38 AM
v6lur has quit

2020-07-30 21201, 2020

11:40 AM
yvanzo

diru1100: nice

2020-07-30 21235, 2020

11:44 AM
Major_Lurker has quit

2020-07-30 21254, 2020

11:47 AM
BrainzGit

[listenbrainz-server] alastair opened pull request #997 (master…spark-hdfs): Improve HDFS setup and startup scripts https://github.com/metabrainz/listenbrainz-server…

2020-07-30 21201, 2020

11:48 AM
yvanzo

diru1100: is bio_tokenizer.pickle still needed? I don't see any reference to it in pr #2.

2020-07-30 21204, 2020

11:54 AM
supersandro2000 has quit

2020-07-30 21208, 2020

11:54 AM
alastairp

iliekcomputers:

2020-07-30 21209, 2020

11:54 AM
alastairp

hadoop-master_1 | 2020-07-30 11:53:40,816 WARN hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /temp to /data/listenbrainz because destination's parent does not exist

2020-07-30 21214, 2020

11:54 AM
alastairp

does this look familiar?

2020-07-30 21219, 2020

11:54 AM
supersandro2000 joined the channel

2020-07-30 21243, 2020

11:54 AM
iliekcomputers

ishaanshah: ^

2020-07-30 21257, 2020

11:54 AM
alastairp

when running the spark data importer. I don't see any reference to /data in the docker-compose.spark file

2020-07-30 21205, 2020

11:55 AM
alastairp

thanks :)

2020-07-30 21201, 2020

11:57 AM
diru1100

yvanzo: it's not needed. I have removed all pickle files in pr #2

2020-07-30 21200, 2020

11:58 AM
travis-ci joined the channel

2020-07-30 21200, 2020

11:58 AM
travis-ci

Project bookbrainz-site build #3281: passed in 4 min 26 sec: https://travis-ci.org/bookbrainz/bookbrainz-site/…

2020-07-30 21200, 2020

11:58 AM
travis-ci has left the channel

2020-07-30 21218, 2020

11:59 AM
yvanzo

diru1100: so does it still need to be in v-0.1 assets?

2020-07-30 21241, 2020

12:00 PM
diru1100

yvanzo: it's needed if they want to generate data.

2020-07-30 21254, 2020

12:00 PM
diru1100

Not needed to run the model

2020-07-30 21202, 2020

12:02 PM
diru1100

We can remove all *_tokenizers actually

2020-07-30 21206, 2020

12:13 PM
yvanzo

diru1100: It would probably be more useful to just explain how to use these tokenizer files as the goal is to allow reproducing tests and derivative works.

2020-07-30 21251, 2020

12:13 PM
yvanzo

diru1100: For example, how can one use bio_tokenizer.pickle (step by step)?

2020-07-30 21252, 2020

12:17 PM
diru1100

yvanzo: Yes, that should help. But in dataset_generation notebook we aren't using the pickle file at all. We are directly using Keras Tokenizer class to do the job.

2020-07-30 21228, 2020

12:18 PM
diru1100

I think it is kept maybe to store the tokenizers once we use it in production, but due to online it might change.

2020-07-30 21247, 2020

12:18 PM
diru1100

*online learning

2020-07-30 21207, 2020

12:20 PM
diru1100

I think the starting paragraph explains the model well. https://github.com/diru1100/spambrainz_ml/blob/gs…

2020-07-30 21258, 2020

12:52 PM
yvanzo

my bad, I probably just mistyped the filename

2020-07-30 21259, 2020

12:58 PM
sumedh has quit

2020-07-30 21203, 2020

12:59 PM
ishaanshah

alastairp: Is this the first time you're running the import?

2020-07-30 21247, 2020

12:59 PM
alastairp

ishaanshah: yes

2020-07-30 21202, 2020

1:00 PM
ishaanshah

Hmm, I think I know why it's happening

2020-07-30 21213, 2020

1:00 PM
ishaanshah

the /data folder hasn't been created

2020-07-30 21233, 2020

1:00 PM
ishaanshah

It worked for me because it had already been created before

2020-07-30 21241, 2020

1:00 PM
alastairp

yes, makes sense

2020-07-30 21243, 2020

1:00 PM
alastairp

how did you create it?

2020-07-30 21255, 2020

1:00 PM
alastairp

should this happen automatically as part of the import script?

2020-07-30 21201, 2020

1:01 PM
ishaanshah

The moving part is a recent addition

2020-07-30 21211, 2020

1:01 PM
ishaanshah

> should this happen automatically as part of the import script?

2020-07-30 21211, 2020

1:01 PM
ishaanshah

yep

2020-07-30 21219, 2020
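The failure described above (a rename that fails because the destination's parent does not exist) has a close analogue on a local filesystem. The sketch below illustrates the general fix of creating the parent before renaming. It is not the HDFS code from the import script, and `move_into_place` is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path


def move_into_place(src, dst):
    """Rename src to dst, creating dst's parent directory first.

    A plain rename fails when dst's parent is missing, so the
    parent is created (if needed) before the rename is attempted.
    """
    dst = Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    os.rename(src, dst)
```

With a scratch directory from `tempfile.TemporaryDirectory()`, moving `temp` to `data/listenbrainz` succeeds through `move_into_place` even when `data` does not exist yet.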

1:01 PM
ishaanshah

I'll open a PR

2020-07-30 21216, 2020

1:02 PM
alastairp

thanks. is there something I can do now to be able to continue without waiting for the PR?

2020-07-30 21221, 2020

1:02 PM
alastairp

or will it only take a few minutes?

2020-07-30 21230, 2020

1:02 PM
ishaanshah

it should take a few minutes

2020-07-30 21242, 2020

1:02 PM
ishaanshah

Otherwise you could use the hdfs cli to create the directory

2020-07-30 21225, 2020

1:03 PM
ishaanshah

iliekcomputers may be able to help you with that

2020-07-30 21236, 2020

1:03 PM
ishaanshah

I am not familiar with the hdfs cli

2020-07-30 21258, 2020

1:03 PM
kieto joined the channel

2020-07-30 21233, 2020

1:04 PM
alastairp

hdfs cli sounds like something that we should have at least some basic documentation for if it doesn't exist

2020-07-30 21207, 2020

1:11 PM
iliekcomputers

alastairp: hmm, i think something like `hdfs dfs -mkdir /data` from inside the hadoop container should fix the issue

2020-07-30 21217, 2020

1:11 PM
iliekcomputers

https://hadoop.apache.org/docs/current/hadoop-pro…

2020-07-30 21224, 2020
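The suggested workaround can also be scripted. The following sketch builds the `hdfs dfs -mkdir` command and optionally wraps it in `docker exec`. The `-p` flag (create missing parents, no error if the directory exists) is an addition to the command quoted above, and the container name is a placeholder that must be replaced with whatever `docker ps` shows locally. Both function names are hypothetical.

```python
import subprocess


def hdfs_mkdir_command(path, container=None):
    """Build the argv for creating an HDFS directory.

    If `container` is given, the command is wrapped in `docker exec`
    so that it runs inside that container.
    """
    command = ["hdfs", "dfs", "-mkdir", "-p", path]
    if container is not None:
        command = ["docker", "exec", container] + command
    return command


def create_hdfs_directory(path, container=None):
    """Run the mkdir command and raise if it exits non-zero."""
    subprocess.run(hdfs_mkdir_command(path, container), check=True)
```

For example, `create_hdfs_directory("/data", container="my-hadoop-container")` would run the command inside a container with that (assumed) name.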

1:11 PM
ishaanshah

alastairp: can you wait for 2 mins

2020-07-30 21231, 2020

1:11 PM
ishaanshah

I am just about to open a PR

2020-07-30 21245, 2020

1:11 PM
alastairp

I can wait, I'm just eating lunch :)

2020-07-30 21228, 2020

1:13 PM
Mr_Monkey

Hi ruaok ! Do you have a few minutes to talk about LB's search_larger_time_range mechanism?

2020-07-30 21258, 2020

1:17 PM
shivam-kapila

Mr_Monkey: hi. I have some idea about it. I may be able to help in case you are in hurry

2020-07-30 21247, 2020

1:18 PM
Mr_Monkey

Not in a hurry per se, no, but you can probably help me understand a bit better. In short, I'm trying to figure out what should be changed now that pagination is done in react

2020-07-30 21212, 2020

1:20 PM
BrainzGit

[listenbrainz-server] ishaanshah opened pull request #998 (master…import_fix): Fix bug in spark import code https://github.com/metabrainz/listenbrainz-server…

2020-07-30 21220, 2020

1:20 PM
Mr_Monkey

As far as I can tell there's no search_larger_time_range param for the API /user/XXXXX/listens endpoint, so i wonder if that's now needed

2020-07-30 21221, 2020

1:20 PM
ishaanshah

iliekcomputers: I created a branch in the metabrainz repo by mistake, I'll delete it after the PR gets merged

2020-07-30 21252, 2020

1:20 PM
shivam-kapila

Mr_Monkey: oh yes you are right

2020-07-30 21257, 2020

1:20 PM
shivam-kapila

We need that

2020-07-30 21208, 2020

1:21 PM
shivam-kapila

I may do it for you

2020-07-30 21220, 2020

1:21 PM
shivam-kapila

And we may remove it from other page

2020-07-30 21227, 2020

1:21 PM
shivam-kapila

route*

2020-07-30 21235, 2020

1:22 PM
Mr_Monkey

What's the idea behind that mechanism again?

2020-07-30 21203, 2020

1:23 PM
jmp_music_

@alastair after eating your lunch can we do a small meeting?

2020-07-30 21219, 2020

1:23 PM
shivam-kapila

Actually it compares if the length of listens fetched is less than the minimum no. of listens we set as a threshold
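One possible reading of this mechanism, going only by its name and the threshold check described above, is: fetch listens for a time window, and if fewer than the threshold come back, search again over a larger window. The sketch below implements that reading. The function name, the doubling factor, and the retry cap are all assumptions, not the ListenBrainz implementation.

```python
def fetch_with_wider_range(fetch, start_ts, end_ts, min_listens,
                           max_widenings=3, factor=2):
    """Fetch listens, widening the time window while too few come back.

    `fetch(start, end)` is a caller-supplied function returning the
    listens in [start, end]. The window grows backwards in time by
    `factor` on each retry, at most `max_widenings` times.
    """
    listens = fetch(start_ts, end_ts)
    widenings = 0
    while len(listens) < min_listens and widenings < max_widenings:
        span = end_ts - start_ts
        start_ts = end_ts - span * factor  # extend the window into the past
        listens = fetch(start_ts, end_ts)
        widenings += 1
    return listens
```

The cap on retries keeps a user with very few listens from triggering an unbounded series of queries.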