it might be a matter of just enabling some options for the spark context
2020-09-02 24644, 2020
ruaok
_lucifer: could be useful yes. but I think it will be more useful having this data in our log files, where they are in context of what is happening....
2020-09-02 24614, 2020
pristine___
_lucifer: I have used that in the past, logging is better imo.
2020-09-02 24653, 2020
_lucifer
yeah, makes sense
2020-09-02 24659, 2020
_lucifer
👍
2020-09-02 24612, 2020
yvanzo
reosarevok: seems reasonable, replied in comment
2020-09-02 24618, 2020
shivam-kapila
pristine___: might be worth trying the online notebook to compare
2020-09-02 24624, 2020
pianoguy has quit
2020-09-02 24601, 2020
_lucifer
pristine___: one question, is the dataset processed lazily?
ruaok: do we want to try generating recs in batches or switch to ML? which one first?
2020-09-02 24644, 2020
alastairp
pristine___: why don't you also log the runtime of the items in get_recommendations_for_user? because I understand that some of these items are the lookups in the model, but you're also doing stuff like creating new dataframes (are these serialised to disk?) and running joins over them
2020-09-02 24654, 2020
musiclover67 joined the channel
2020-09-02 24620, 2020
musiclover67 has quit
2020-09-02 24630, 2020
alastairp
I would recommend getting timing information first, but after that I suspect that switching to batches isn't much extra work, so might be better to start with that
2020-09-02 24606, 2020
ruaok
pristine___: as alastair says. let's collect timing data, then make a decision. we're flying blind right now...
2020-09-02 24639, 2020
pristine___
Spark is lazy, so in my experience getting the run time of different items needs some action, which will in turn increase the run time. I will have a look though.
2020-09-02 24650, 2020
pristine___
ruaok: cool
2020-09-02 24613, 2020
alastairp
I think it's OK to force it to evaluate a dataset (or whatever) for testing the time of each step
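Something along these lines could time each step; this is only a sketch, assuming a PySpark job, with a made-up helper name (timed_count), and it deliberately uses count() as the action that forces evaluation:

    import logging
    import time

    logger = logging.getLogger(__name__)

    def timed_count(step_name, df):
        """Force evaluation of a lazy dataframe and log how long it took.

        count() is a Spark action, so the measured time includes every
        transformation upstream of df, not just the last step.
        """
        start = time.monotonic()
        rows = df.count()
        logger.info("%s: %d rows in %.2fs", step_name, rows, time.monotonic() - start)
        return df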
2020-09-02 24645, 2020
v6lur__ joined the channel
2020-09-02 24600, 2020
ruaok
pristine___: the user_df.persist() -- is there so that the user_df doesn't go out of scope before the count() call later in the function?
2020-09-02 24629, 2020
ruaok
not sure if the persist uses more resources, but we could save the number of uses instead of persisting the df....
2020-09-02 24633, 2020
ruaok
just wondering, really.
2020-09-02 24649, 2020
alastairp
pristine___: I'm just thinking out loud here. I understand that you do a lookup in the matrix, and get a bunch of IDs or indexes or something, right? Is this why you have to serialise it to a dataframe and then join it against another table to get artist names or ids? This is one reason why the timing of each step is important
2020-09-02 24659, 2020
alastairp
As a hypothesis, generating recommendations for 100 users and putting it into a single dataframe, and then joining over that might be faster
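A rough sketch of that idea, purely illustrative: it assumes the RDD-based mllib MatrixFactorizationModel (recommendProducts) and made-up column names (user_id, recording_index), so the real code will differ:

    # hypothetical: collect raw recommendations for a whole batch of users,
    # build ONE dataframe, then resolve names/ids with a single join
    def recommend_batch(spark, model, user_ids, recordings_df, limit=100):
        rows = []
        for user_id in user_ids:                     # e.g. a chunk of 100 users
            for rating in model.recommendProducts(user_id, limit):
                rows.append((rating.user, rating.product, rating.rating))
        recs_df = spark.createDataFrame(rows, ["user_id", "recording_index", "score"])
        # one join for the whole batch instead of one join per user
        return recs_df.join(recordings_df, on="recording_index", how="inner")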
2020-09-02 24606, 2020
pristine___
ruaok: so that users_df is not calculated with each run of the loop.
2020-09-02 24624, 2020
ruaok
ah, ok. that's kinda important. :)
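That pattern, roughly (a sketch only; get_users_df and generate_recs are hypothetical helpers standing in for the real code):

    users_df = get_users_df(listens_df)   # hypothetical helper
    users_df.persist()                    # cache it so each loop iteration reuses
                                          # the materialized data instead of
                                          # recomputing the whole lineage
    for user_id in user_ids:
        generate_recs(user_id, users_df)  # hypothetical per-user step
    users_df.unpersist()                  # free executor memory afterwards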
2020-09-02 24617, 2020
pristine___
> As a hypothesis, generating recommendations for 100 users and putting it into a single dataframe, and then joining over that might be faster
2020-09-02 24633, 2020
ruaok
pristine___: that build error on the PR... should I restart the build?
2020-09-02 24641, 2020
pristine___
I completely agree here. I was thinking of doing this, but then I came across the ML lib, which deals in dataframes. I think we can combine the two, alastairp
2020-09-02 24648, 2020
pristine___
ruaok: a sec
2020-09-02 24625, 2020
alastairp
when I say dataframe, I mean whatever storage system is currently in use - sorry, I guess I mean to say RDD. As I said previously, I don't know this system very well
2020-09-02 24656, 2020
pristine___
Yeah ruaok restart the build
2020-09-02 24602, 2020
alastairp
I'm not implicitly suggesting that you switch to pyspark.ml here at the same time
2020-09-02 24645, 2020
ruaok agrees with alastairp
2020-09-02 24656, 2020
ruaok
we should take one step at a time, measure, think, act.
2020-09-02 24600, 2020
pristine___
Right. Cool. I understand what you say. I will try to do that and see if it improves the time.
2020-09-02 24632, 2020
pristine___
> As a hypothesis, generating recommendations for 100 users and putting it into a single dataframe, and then joining over that might be faster
2020-09-02 24636, 2020
pristine___
This.
2020-09-02 24638, 2020
ruaok
pristine___: aside from lunch time, I can work with you all day to ensure that code gets run right away.
2020-09-02 24653, 2020
pristine___
Cool.
2020-09-02 24603, 2020
ruaok
> As a hypothesis, generating recommendations for 100 users and putting it into a single dataframe, and then joining over that might be faster
2020-09-02 24618, 2020
ruaok
my guess is that this will be a better speed improvement than going to the other lib
ruaok: hi, I shared a CB document with you last week. no real rush on it, just checking that it's on your radar to respond to and you didn't inbox-bankruptcy it
2020-09-02 24636, 2020
ruaok
pristine___: issued
2020-09-02 24659, 2020
ruaok
I didn't declare inbox bankruptcy. you're on my radar.
timing seems directly related to the number of listens.
2020-09-02 24611, 2020
ruaok
INFO in recommend: Average time: 29.72sec
2020-09-02 24625, 2020
ruaok
so that figure that pristine___ quoted was in fact correct.
2020-09-02 24641, 2020
ruaok
ok, so there are 3 possible things for us to do in the short term that I see: 1) drive requests in chunks of 100, 2) batch chunks into single dfs, and 3) move to the new lib
2020-09-02 24614, 2020
ruaok
my impression is that #2 is the low-hanging fruit here. not that much work, but it could give drastic improvements.
2020-09-02 24618, 2020
ruaok
thoughts?
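For reference, the driver loop for option 2 could be as simple as the sketch below; the chunk size of 100 and the recommend_batch/save_recommendations names are assumptions, not existing code:

    CHUNK_SIZE = 100

    def chunks(seq, size):
        """Yield successive fixed-size slices of a list of user ids."""
        for i in range(0, len(seq), size):
            yield seq[i:i + size]

    for batch in chunks(all_user_ids, CHUNK_SIZE):
        batch_df = recommend_batch(spark, model, batch, recordings_df)  # see sketch above
        save_recommendations(batch_df)    # hypothetical sink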
2020-09-02 24648, 2020
shivam-kapila
> timing seems directly related to the number of listens.
2020-09-02 24648, 2020
shivam-kapila
There are exceptions
2020-09-02 24607, 2020
shivam-kapila
like for avma12 it's 4.55 sec
2020-09-02 24626, 2020
shivam-kapila
but they have 30 times more listens than me
2020-09-02 24633, 2020
shivam-kapila
sorry ukko12
2020-09-02 24618, 2020
pristine___
ruaok: I am writing a PR to log the count as well so we are sure if it is because of listen count, like shivam-kapila said
ruaok: I have added *counts* for logging purposes. Count is an action in Spark terminology, so two things will happen now: computation of the count and computation of the recs. It will increase the runtime. Therefore, the runtime logs should not be taken as absolute; rather, they should be used for comparison.
I purged the requests for now. once the other job finishes, I will restart the consumer and reissue the commands.
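A sketch of what the count logging described above could look like (the function and dataframe names are placeholders, not the actual recommend.py code); as noted, count() is itself an action, so the logged time is only useful for comparison between users:

    import logging
    import time

    logger = logging.getLogger(__name__)

    def log_user_recs(user_id, user_listens_df, generate_recs):
        """Log listen count and recommendation time for one user."""
        start = time.monotonic()
        listen_count = user_listens_df.count()     # action: forces evaluation
        recs = generate_recs(user_id)              # placeholder for the real call
        elapsed = time.monotonic() - start
        logger.info("user %s: %d listens, %.2fs", user_id, listen_count, elapsed)
        return recs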
2020-09-02 24636, 2020
pristine___
Cool
2020-09-02 24622, 2020
ruaok
ok, updated, restarted. running now.
2020-09-02 24656, 2020
alastairp
pristine___: btw, I guess log messages have a timestamp on them too, so the time.monotonic check isn't strictly necessary, unless you want summary information :)
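For completeness, the timestamp comes for free with a standard logging setup, e.g. (generic Python, not the project's actual config):

    import logging

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s in %(module)s: %(message)s",
    )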
2020-09-02 24638, 2020
alastairp
will the count() calls force an evaluation of the data at each log point?
2020-09-02 24603, 2020
ruaok
alastairp: did you see her comment above?
2020-09-02 24615, 2020
ruaok
> ruaok: I have added *counts* for logging purposes. Count is an action in Spark terminology, so two things will happen now: computation of the count and computation of the recs. It will increase the runtime. Therefore, the runtime logs should not be taken as absolute; rather, they should be used for comparison.
2020-09-02 24636, 2020
alastairp
yes, but it wasn't clear to me that this is what the count() was doing
2020-09-02 24648, 2020
ruaok
just checking.
2020-09-02 24602, 2020
alastairp
we really need to use slack so that we can use threads when talking to each other :D
2020-09-02 24628, 2020
shivam-kapila
IRC prem allows that
2020-09-02 24651, 2020
pristine___
> will the count() calls force an evaluation of the data at each log point?