it might be a matter of just enabling some options for the spark context
2020-09-02 24644, 2020
ruaok
_lucifer: could be useful yes. but I think it will be more useful having this data in our log files, where they are in context of what is happening....
2020-09-02 24614, 2020
pristine___
_lucifer: I have used that in the past, logging is better imo.
2020-09-02 24653, 2020
_lucifer
yeah, makes sense
2020-09-02 24659, 2020
_lucifer
👍
2020-09-02 24612, 2020
yvanzo
reosarevok: seems reasonable, replied in comment
2020-09-02 24618, 2020
shivam-kapila
pristine___: might be worth trying the online notebook to compare
2020-09-02 24624, 2020
pianoguy has quit
2020-09-02 24601, 2020
_lucifer
pristine___: one question, is the dataset processed lazily?
ruaok: do we want to try generating recs in batches or switch to ML? which one first?
2020-09-02 24644, 2020
alastairp
pristine___: why don't you also log the runtime of the items in get_recommendations_for_user? because I understand that some of these items are the lookups in the model, but you're also doing stuff like creating new dataframes (are these serialised to disk?) and running joins over them
2020-09-02 24654, 2020
musiclover67 joined the channel
2020-09-02 24620, 2020
musiclover67 has quit
2020-09-02 24630, 2020
alastairp
I would recommend getting timing information first, but after that I suspect that switching to batches isn't much extra work, so might be better to start with that
2020-09-02 24606, 2020
ruaok
pristine___: as alastair says. let's collect timing data, then make a decision. we're flying blind right now...
2020-09-02 24639, 2020
pristine___
Spark is lazy, so in my experience getting the run time of different items needs some action, which will in turn increase the run time. I will have a look though.
2020-09-02 24650, 2020
pristine___
ruaok: cool
2020-09-02 24613, 2020
alastairp
I think it's OK to force it to evaluate a dataset (or whatever) for testing the time of each step
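Something along these lines could time each step; this is only a sketch, assuming a PySpark job, with a made-up helper name (timed_count), and it deliberately uses count() as the action that forces evaluation:

    import logging
    import time

    logger = logging.getLogger(__name__)

    def timed_count(step_name, df):
        """Force evaluation of a lazy dataframe and log how long it took.

        count() is a Spark action, so the measured time includes every
        transformation upstream of df, not just the last step.
        """
        start = time.monotonic()
        rows = df.count()
        logger.info("%s: %d rows in %.2fs", step_name, rows, time.monotonic() - start)
        return df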
2020-09-02 24645, 2020
v6lur__ joined the channel
2020-09-02 24600, 2020
ruaok
pristine___: the user_df.persist() -- is there so that the user_df doesn't go out of scope before the count() call later in the function?
2020-09-02 24629, 2020
ruaok
not sure if the persist uses more resources, but we could save the number of uses instead of persisting the df....
2020-09-02 24633, 2020
ruaok
just wondering, really.
2020-09-02 24649, 2020
alastairp
pristine___: I'm just thinking out loud here. I understand that you do a lookup in the matrix, and get a bunch of IDs or indexes or something, right? Is this why you have to serialise it to a dataframe and then join it against another table to get artist names or ids? This is one reason why the timing of each step is important
2020-09-02 24659, 2020
alastairp
As a hypothesis, generating recommendations for 100 users and putting it into a single dataframe, and then joining over that might be faster
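A rough sketch of that idea, purely illustrative: it assumes the RDD-based mllib MatrixFactorizationModel (recommendProducts) and made-up column names (user_id, recording_index), so the real code will differ:

    # hypothetical: collect raw recommendations for a whole batch of users,
    # build ONE dataframe, then resolve names/ids with a single join
    def recommend_batch(spark, model, user_ids, recordings_df, limit=100):
        rows = []
        for user_id in user_ids:                     # e.g. a chunk of 100 users
            for rating in model.recommendProducts(user_id, limit):
                rows.append((rating.user, rating.product, rating.rating))
        recs_df = spark.createDataFrame(rows, ["user_id", "recording_index", "score"])
        # one join for the whole batch instead of one join per user
        return recs_df.join(recordings_df, on="recording_index", how="inner")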
2020-09-02 24606, 2020
pristine___
ruaok: so that users_df is not calculated with each run of the loop.
2020-09-02 24624, 2020
ruaok
ah, ok. that's kinda important. :)
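That pattern, roughly (a sketch only; get_users_df and generate_recs are hypothetical helpers standing in for the real code):

    users_df = get_users_df(listens_df)   # hypothetical helper
    users_df.persist()                    # cache it so each loop iteration reuses
                                          # the materialized data instead of
                                          # recomputing the whole lineage
    for user_id in user_ids:
        generate_recs(user_id, users_df)  # hypothetical per-user step
    users_df.unpersist()                  # free executor memory afterwards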
2020-09-02 24617, 2020
pristine___
> As a hypothesis, generating recommendations for 100 users and putting it into a single dataframe, and then joining over that might be faster
2020-09-02 24633, 2020
ruaok
pristine___: that build error on the PR... should I restart the build?
2020-09-02 24641, 2020
pristine___
I completely agree here. I was thinking of doing this, but then I came across the ML lib, which deals in dataframes. I think we can combine the two, alastairp
2020-09-02 24648, 2020
pristine___
ruaok: a sec
2020-09-02 24625, 2020
alastairp
when I say dataframe, I mean whatever storage system is currently in use - sorry, I guess I mean to say RDD. As I said previously, I don't know this system very well
2020-09-02 24656, 2020
pristine___
Yeah ruaok restart the build
2020-09-02 24602, 2020
alastairp
I'm not implicitly suggesting that you switch to pyspark.ml here at the same time
2020-09-02 24645, 2020
ruaok agrees with alastairp
2020-09-02 24656, 2020
ruaok
we should take one step at a time, measure, think, act.
2020-09-02 24600, 2020
pristine___
Right. Cool. I understand what you say. I will try to do that and see if it improves the time.
2020-09-02 24632, 2020
pristine___
> As a hypothesis, generating recommendations for 100 users and putting it into a single dataframe, and then joining over that might be faster
2020-09-02 24636, 2020
pristine___
This.
2020-09-02 24638, 2020
ruaok
pristine___: aside from lunch time, I can work with you all day to ensure that code gets run right away.
2020-09-02 24653, 2020
pristine___
Cool.
2020-09-02 24603, 2020
ruaok
> As a hypothesis, generating recommendations for 100 users and putting it into a single dataframe, and then joining over that might be faster
2020-09-02 24618, 2020
ruaok
my guess is that this will be a better speed improvement than going to the other lib
ruaok: hi, I shared a CB document with you last week. no real rush on it, just checking that it's on your radar to respond to and you didn't inbox-bankruptcy it
2020-09-02 24636, 2020
ruaok
pristine___: issued
2020-09-02 24659, 2020
ruaok
I didn't declare inbox bankruptcy. you're on my radar.
timing seems directly related to the number of listens.
2020-09-02 24611, 2020
ruaok
INFO in recommend: Average time: 29.72sec
2020-09-02 24625, 2020
ruaok
so that figure that pristine___ quoted was in fact correct.
2020-09-02 24641, 2020
ruaok
ok, so there are 3 possible things for us to do in the short term that I see: 1) drive requests in chunks of 100, 2) batch chunks into single dfs, and 3) move to the new lib
2020-09-02 24614, 2020
ruaok
my impression is that #2 is the low-hanging fruit here. not that much work, but it could give drastic improvements.
2020-09-02 24618, 2020
ruaok
thoughts?
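For reference, the driver loop for option 2 could be as simple as the sketch below; the chunk size of 100 and the recommend_batch/save_recommendations names are assumptions, not existing code:

    CHUNK_SIZE = 100

    def chunks(seq, size):
        """Yield successive fixed-size slices of a list of user ids."""
        for i in range(0, len(seq), size):
            yield seq[i:i + size]

    for batch in chunks(all_user_ids, CHUNK_SIZE):
        batch_df = recommend_batch(spark, model, batch, recordings_df)  # see sketch above
        save_recommendations(batch_df)    # hypothetical sink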
2020-09-02 24648, 2020
shivam-kapila
> timing seems directly related to the number of listens.
2020-09-02 24648, 2020
shivam-kapila
There are exceptions
2020-09-02 24607, 2020
shivam-kapila
like for avma12 it's 4.55 sec
2020-09-02 24626, 2020
shivam-kapila
but they have 30 times more listens than me
2020-09-02 24633, 2020
shivam-kapila
sorry ukko12
2020-09-02 24618, 2020
pristine___
ruaok: I am writing a PR to log the count as well so we are sure if it is because of listen count, like shivam-kapila said
ruaok: I have added *counts* for logging purposes. Count is an action in Spark terminology, so two things will happen now: computation of the count and computation of the recs. It will increase the runtime. Therefore, the runtime logs should not be taken as absolute; rather, they should be used for comparison.
I purged the requests for now. once the other job finishes, I will restart the consumer and reissue the commands.
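A sketch of what the count logging described above could look like (the function and dataframe names are placeholders, not the actual recommend.py code); as noted, count() is itself an action, so the logged time is only useful for comparison between users:

    import logging
    import time

    logger = logging.getLogger(__name__)

    def log_user_recs(user_id, user_listens_df, generate_recs):
        """Log listen count and recommendation time for one user."""
        start = time.monotonic()
        listen_count = user_listens_df.count()     # action: forces evaluation
        recs = generate_recs(user_id)              # placeholder for the real call
        elapsed = time.monotonic() - start
        logger.info("user %s: %d listens, %.2fs", user_id, listen_count, elapsed)
        return recs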
2020-09-02 24636, 2020
pristine___
Cool
2020-09-02 24622, 2020
ruaok
ok, updated, restarted. running now.
2020-09-02 24656, 2020
alastairp
pristine___: btw, I guess log messages have a timestamp on them too, so the time.monotonic check isn't strictly necessary, unless you want summary information :)
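For completeness, the timestamp comes for free with a standard logging setup, e.g. (generic Python, not the project's actual config):

    import logging

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s in %(module)s: %(message)s",
    )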
2020-09-02 24638, 2020
alastairp
will the count() calls force an evaluation of the data at each log point?
2020-09-02 24603, 2020
ruaok
alastairp: did you see her comment above?
2020-09-02 24615, 2020
ruaok
> ruaok: I have added *counts* for logging purposes. Count is an action in Spark terminology, so two things will happen now: computation of the count and computation of the recs. It will increase the runtime. Therefore, the runtime logs should not be taken as absolute; rather, they should be used for comparison.
2020-09-02 24636, 2020
alastairp
yes, but it wasn't clear to me that this is what the count() was doing
2020-09-02 24648, 2020
ruaok
just checking.
2020-09-02 24602, 2020
alastairp
we really need to use slack so that we can use threads when talking to each other :D
2020-09-02 24628, 2020
shivam-kapila
IRC prem allows that
2020-09-02 24651, 2020
pristine___
> will the count() calls force an evaluation of the data at each log point?