#metabrainz

      • pristine___
        +1
      • 2020-09-02 24608, 2020

      • pristine___
        as in, get recs for a few (maybe 100) users at a time from spark to Lemmy.
      • 2020-09-02 24609, 2020

      • alastairp
        ruaok: right, thinking in terms of the current issue with rabbitmq?
      • 2020-09-02 24614, 2020

      • ruaok
        yes
      • 2020-09-02 24601, 2020

      • _lucifer
        spark actually records such information automatically
      • 2020-09-02 24604, 2020

      • _lucifer
        it might be a matter of just enabling some options for the spark context
      • 2020-09-02 24644, 2020
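A minimal sketch of the kind of option _lucifer is likely referring to: Spark's built-in event log, which records per-stage and per-task timings for the Spark UI / history server. The app name and log directory here are illustrative, not from the actual listenbrainz-server config:

```python
from pyspark.sql import SparkSession

# Enable Spark's event log so stage/task timings are recorded and can be
# browsed in the Spark UI or history server (paths here are illustrative).
spark = (
    SparkSession.builder
    .appName("recommendations")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs:///spark-events")
    .getOrCreate()
)
```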

      • ruaok
        _lucifer: could be useful yes. but I think it will be more useful having this data in our log files, where they are in context of what is happening....
      • 2020-09-02 24614, 2020

      • pristine___
        _lucifer: I have used that in the past, logging is better imo.
      • 2020-09-02 24653, 2020

      • _lucifer
        yeah, makes sense
      • 2020-09-02 24659, 2020

      • _lucifer
        👍
      • 2020-09-02 24612, 2020

      • yvanzo
        reosarevok: seems reasonable, replied in comment
      • 2020-09-02 24618, 2020

      • shivam-kapila
        pristine___: might be worth trying an online notebook to compare
      • 2020-09-02 24624, 2020

      • pianoguy has quit
      • 2020-09-02 24601, 2020

      • _lucifer
        pristine___: one question, is the dataset processed lazily?
      • 2020-09-02 24637, 2020

      • pristine___
        Which dataset?
      • 2020-09-02 24609, 2020

      • _lucifer
        the input dataset for generating recommendations
      • 2020-09-02 24646, 2020

      • pristine___
        Everything is processed lazily in spark
      • 2020-09-02 24600, 2020
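A minimal illustration of the laziness pristine___ describes: transformations only build an execution plan, and nothing is computed until an action runs (data and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "name"])

filtered = df.filter(df.id > 1)     # transformation: nothing computed yet
selected = filtered.select("name")  # still nothing computed

selected.show()  # action: only now does Spark execute the whole plan
```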

      • pristine___
        I am thinking of persisting the datasets
      • 2020-09-02 24604, 2020

      • pristine___
        something like this
      • 2020-09-02 24625, 2020

      • d4rkie joined the channel
      • 2020-09-02 24628, 2020

      • v6lur_ has quit
      • 2020-09-02 24616, 2020

      • Nyanko-sensei has quit
      • 2020-09-02 24632, 2020

      • BrainzGit
        [listenbrainz-server] vansika opened pull request #1070 (master…log-time-for-each-user): Log recommendation generation time for each user https://github.com/metabrainz/listenbrainz-server…
      • 2020-09-02 24643, 2020

      • iliekcomputers
        pristine___: that pr has two timestamp variables, one named ts and one named ti
      • 2020-09-02 24648, 2020

      • iliekcomputers
        :P
      • 2020-09-02 24650, 2020

      • pristine___
        lol
      • 2020-09-02 24654, 2020

      • pristine___
        i will do
      • 2020-09-02 24605, 2020

      • pristine___
        iliekcomputers: ts_initial and ts?
      • 2020-09-02 24602, 2020

      • iliekcomputers
        i'm not completely sure what the code does, so not sure if these are good. but i try to avoid two character names in general
      • 2020-09-02 24643, 2020

      • BrainzGit
        [musicbrainz-server] reosarevok merged pull request #1602 (master…MBS-10972): MBS-10972: Convert Add Instrument edit to React https://github.com/metabrainz/musicbrainz-server/…
      • 2020-09-02 24645, 2020

      • BrainzBot
        MBS-10972: Convert Add Instrument edit to React https://tickets.metabrainz.org/browse/MBS-10972
      • 2020-09-02 24641, 2020

      • pristine___
        ruaok: do we want to try generating recs in batches or switching to ML? which one first?
      • 2020-09-02 24644, 2020

      • alastairp
        pristine___: why don't you also log the runtime of the items in get_recommendations_for_user? because I understand that some of these items are the lookups in the model, but you're also doing stuff like creating new dataframes (are these serialised to disk?) and running joins over them
      • 2020-09-02 24654, 2020
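A sketch of the per-step timing alastairp suggests. get_recommendations_for_user is the function named in the chat, but its internals here (the model lookup call, column names) are assumptions for illustration:

```python
import logging
import time

logger = logging.getLogger(__name__)

def get_recommendations_for_user(spark, user_id, model, metadata_df):
    t0 = time.monotonic()
    raw_recs = model.recommendProducts(user_id, 1000)  # hypothetical model lookup
    logger.info("model lookup: %.2fs", time.monotonic() - t0)

    t0 = time.monotonic()
    recs_df = spark.createDataFrame(raw_recs, ["user", "recording_id", "score"])
    logger.info("dataframe creation: %.2fs", time.monotonic() - t0)

    t0 = time.monotonic()
    # caveat: Spark is lazy, so without an action the join timing mostly
    # measures plan construction, not actual work (pristine___'s point below)
    result = recs_df.join(metadata_df, "recording_id").collect()
    logger.info("join + collect: %.2fs", time.monotonic() - t0)
    return result
```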

      • musiclover67 joined the channel
      • 2020-09-02 24620, 2020

      • musiclover67 has quit
      • 2020-09-02 24630, 2020

      • alastairp
        I would recommend getting timing information first, but after that I suspect that switching to batches isn't much extra work, so might be better to start with that
      • 2020-09-02 24606, 2020

      • ruaok
        pristine___: as alastair says. let's collect timing data, then make a decision. we're flying blind right now...
      • 2020-09-02 24639, 2020

      • pristine___
        Spark is lazy, so in my experience getting the run time of different items needs some action, which will in turn increase the run time. But I will have a look.
      • 2020-09-02 24650, 2020

      • pristine___
        ruaok: cool
      • 2020-09-02 24613, 2020

      • alastairp
        I think it's OK to force it to evaluate a dataset (or whatever) for testing the time of each step
      • 2020-09-02 24645, 2020

      • v6lur__ joined the channel
      • 2020-09-02 24600, 2020

      • ruaok
        pristine___: the user_df.persist() -- is there so that the user_df doesn't go out of scope before the count() call later in the function?
      • 2020-09-02 24629, 2020

      • ruaok
        not sure if the persist uses more resources, but we could save the number of users instead of persisting the df....
      • 2020-09-02 24633, 2020

      • ruaok
        just wondering, really.
      • 2020-09-02 24649, 2020

      • alastairp
        pristine___: I'm just thinking out loud here. I understand that you do a lookup in the matrix, and get a bunch of IDs or indexes or something, right? Is this why you have to serialise it to a dataframe and then join it against another table to get artist names or ids? This is one reason why the timing of each step is important
      • 2020-09-02 24659, 2020

      • alastairp
        As a hypothesis, generating recommendations for 100 users and putting it into a single dataframe, and then joining over that might be faster
      • 2020-09-02 24606, 2020
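A rough sketch of that hypothesis: gather the raw recommendations for a whole batch of users into one dataframe and run a single join, instead of one dataframe and one join per user. All names here are hypothetical:

```python
def recommend_batch(spark, model, user_ids, metadata_df, num_recs=1000):
    rows = []
    for user_id in user_ids:  # e.g. a chunk of 100 users
        for rec in model.recommendProducts(user_id, num_recs):  # hypothetical API
            rows.append((user_id, rec.product, rec.rating))

    batch_df = spark.createDataFrame(rows, ["user_id", "recording_id", "score"])
    # one join over the whole batch instead of one join per user
    return batch_df.join(metadata_df, "recording_id")
```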

      • pristine___
        ruaok: so that users_df is not recalculated on each run of the loop.
      • 2020-09-02 24624, 2020
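In other words (a self-contained sketch with made-up data): without persist(), every action inside the loop would re-run the full lineage that produced users_df:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").getOrCreate()
listens_df = spark.createDataFrame(
    [("rob", 1), ("rob", 2), ("catcat", 3)], ["user_id", "recording_id"]
)

users_df = listens_df.groupBy("user_id").count()
users_df.persist()  # computed once, then reused by every action below

for user_id in ["rob", "catcat"]:
    # without persist(), each count() would re-run the whole groupBy lineage
    n = users_df.filter(users_df.user_id == user_id).count()

users_df.unpersist()  # release the cached copy when done
```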

      • ruaok
        ah, ok. that's kinda important. :)
      • 2020-09-02 24617, 2020

      • pristine___
        > As a hypothesis, generating recommendations for 100 users and putting it into a single dataframe, and then joining over that might be faster
      • 2020-09-02 24633, 2020

      • ruaok
        pristine___: that build error on the PR... should I restart the build?
      • 2020-09-02 24641, 2020

      • pristine___
        I completely agree here. I was thinking of doing this, but then I came across the ML lib, which deals in dataframes. I think we can combine these two, alastairp
      • 2020-09-02 24648, 2020

      • pristine___
        ruaok: a sec
      • 2020-09-02 24625, 2020

      • alastairp
        when I say dataframe, I mean whatever storage system is currently in use - sorry, I guess I mean to say RDD. As I said previously, I don't know this system very well
      • 2020-09-02 24656, 2020

      • pristine___
        Yeah ruaok restart the build
      • 2020-09-02 24602, 2020

      • alastairp
        to be clear, I'm not suggesting that you switch to pyspark.ml here at the same time
      • 2020-09-02 24645, 2020

      • ruaok agrees with alastairp
      • 2020-09-02 24656, 2020

      • ruaok
        we should take one step at a time, measure, think, act.
      • 2020-09-02 24600, 2020

      • pristine___
        Right. Cool. I understand what you say. I will try to do that and see if it improves the time.
      • 2020-09-02 24606, 2020

      • pristine___
        > As a hypothesis, generating recommendations for 100 users and putting it into a single dataframe, and then joining over that might be faster
      • 2020-09-02 24636, 2020

      • pristine___
        This.
      • 2020-09-02 24638, 2020

      • ruaok
        pristine___: aside from lunch time, I can work with you all day to ensure that code gets run right away.
      • 2020-09-02 24653, 2020

      • pristine___
        Cool.
      • 2020-09-02 24603, 2020

      • ruaok
        > As a hypothesis, generating recommendations for 100 users and putting it into a single dataframe, and then joining over that might be faster
      • 2020-09-02 24618, 2020

      • ruaok
        my guess is that this will give a bigger speed improvement than moving to the other lib
      • 2020-09-02 24644, 2020

      • pristine___
        Maybe. But I will try this first, for sure.
      • 2020-09-02 24635, 2020

      • ruaok
        pristine___: build passed.
      • 2020-09-02 24654, 2020

      • ruaok
        let me merge and restart request consumer
      • 2020-09-02 24602, 2020

      • ruaok
        what commands do you want to run?
      • 2020-09-02 24609, 2020

      • BrainzGit
        [listenbrainz-server] mayhem merged pull request #1070 (master…log-time-for-each-user): Log recommendation generation time for each user https://github.com/metabrainz/listenbrainz-server…
      • 2020-09-02 24605, 2020

      • ruaok
        restarted, ready to rock.
      • 2020-09-02 24632, 2020

      • pristine___
        ruaok: what commands did you run?
      • 2020-09-02 24640, 2020

      • pristine___
        for all users?
      • 2020-09-02 24656, 2020

      • ruaok
        nothing, I'm awaiting commands from you.
      • 2020-09-02 24631, 2020

      • pristine___
        So if we just want to get an idea of runtime, we should run for a list of users, no?
      • 2020-09-02 24629, 2020

      • ruaok
        that, yes.
      • 2020-09-02 24630, 2020

      • pristine___
        right, a sec
      • 2020-09-02 24652, 2020

      • pristine___
        `./develop.sh manage spark request_recommendations --user-name=rob --user-name=iliekcomputers --user-name=shivam-kapila --user-name=avma --user-name=sprung --user-name=ukko12`
      • 2020-09-02 24653, 2020

      • pristine___
        ruaok:
      • 2020-09-02 24628, 2020

      • alastairp
        ruaok: hi, I shared a CB document with you last week. no real rush on it, just checking that it's on your radar to respond to and you didn't inbox-bankruptcy it
      • 2020-09-02 24636, 2020

      • ruaok
        pristine___: issued
      • 2020-09-02 24659, 2020

      • ruaok
        I didn't declare inbox bankruptcy. you're on my radar.
      • 2020-09-02 24625, 2020

      • ruaok
        pristine___: iliekcomputers alastairp : https://gist.github.com/mayhem/fa279ee177f147d596…
      • 2020-09-02 24654, 2020

      • ruaok
        timing seems directly related to the number of listens.
      • 2020-09-02 24611, 2020

      • ruaok
        INFO in recommend: Average time: 29.72sec
      • 2020-09-02 24625, 2020

      • ruaok
        so that figure that pristine___ quoted was in fact correct.
      • 2020-09-02 24641, 2020

      • ruaok
        ok, so there are 3 possible things for us to do in the short term that I see: 1) drive requests in chunks of 100, 2) batch chunks into single dfs, and 3) move to the new lib
      • 2020-09-02 24614, 2020

      • ruaok
        my impression is that #2 is the low-hanging fruit here. not much work, but it could give drastic improvements.
      • 2020-09-02 24618, 2020

      • ruaok
        thoughts?
      • 2020-09-02 24648, 2020

      • shivam-kapila
        > timing seems directly related to the number of listens.
      • 2020-09-02 24648, 2020

      • shivam-kapila
        There are exceptions
      • 2020-09-02 24607, 2020

      • shivam-kapila
        like for ukko12 it's 4.55 sec
      • 2020-09-02 24626, 2020

      • shivam-kapila
        but they have 30 times more listens than me
      • 2020-09-02 24633, 2020

      • pristine___
        ruaok: I am writing a PR to log the count as well, so we can be sure whether it is because of the listen count, like shivam-kapila said
      • 2020-09-02 24620, 2020

      • pristine___
        a min
      • 2020-09-02 24637, 2020

      • ruaok
        good idea
      • 2020-09-02 24620, 2020

      • shivam-kapila
        I have a suggestion to increase the list of users
      • 2020-09-02 24641, 2020

      • shivam-kapila
        so we can clearly analyse the trend
      • 2020-09-02 24653, 2020

      • pristine___
        shivam-kapila: can you give a bigger list? :p
      • 2020-09-02 24615, 2020

      • shivam-kapila
        I can
      • 2020-09-02 24623, 2020

      • shivam-kapila
        number?
      • 2020-09-02 24643, 2020

      • shivam-kapila
        I will just pull usernames from the recent page
      • 2020-09-02 24656, 2020

      • pristine___
        nice idea
      • 2020-09-02 24613, 2020

      • ruaok
        user "nasasie"
      • 2020-09-02 24647, 2020

      • ruaok
        hmm, wrong spelling
      • 2020-09-02 24651, 2020

      • ruaok
        CatQuest: what was your LB name again?
      • 2020-09-02 24659, 2020

      • shivam-kapila
        catcat
      • 2020-09-02 24622, 2020

      • ruaok
        oh right, that other was the last.fm name
      • 2020-09-02 24633, 2020

      • pristine___
        shivam-kapila: just don't put rob's name first in the list
      • 2020-09-02 24658, 2020

      • ruaok
        yeah, go for catcat. 1.49M listens.
      • 2020-09-02 24605, 2020

      • pristine___
        put it somewhere in the middle
      • 2020-09-02 24627, 2020

      • shivam-kapila
        how many users should I go for
      • 2020-09-02 24602, 2020

      • pristine___
        10 ?
      • 2020-09-02 24627, 2020

      • shivam-kapila
        15 done
      • 2020-09-02 24609, 2020

      • pristine___
        shivam-kapila: can you write them like in the command I shared above, so that ruaok can copy-paste
      • 2020-09-02 24618, 2020

      • shivam-kapila
        yo
      • 2020-09-02 24626, 2020

      • BrainzGit
        [musicbrainz-server] reosarevok opened pull request #1671 (master…eslint-quotes): Eslint auto-fixes for quote-props https://github.com/metabrainz/musicbrainz-server/…
      • 2020-09-02 24633, 2020

      • BrainzGit
        [listenbrainz-server] vansika opened pull request #1071 (master…log-count-rec): log recording count of top artist and similar artist candidate sets https://github.com/metabrainz/listenbrainz-server…
      • 2020-09-02 24619, 2020

      • pristine___
        shivam-kapila: thanks
      • 2020-09-02 24644, 2020

      • pristine___
        ruaok: I have added *counts* for logging purposes. Count is an action in spark terminology, so two things will happen now: computation of the count and computation of the recs. It will increase the runtime. Therefore, the runtime logs should not be taken as absolute, but rather used for comparison.
      • 2020-09-02 24627, 2020
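A sketch of the trade-off being described: count() is an action, so logging a dataframe's size forces an extra evaluation, which inflates the measured runtime (the helper name is illustrative):

```python
import logging
import time

logger = logging.getLogger(__name__)

def log_count(df, label):
    t0 = time.monotonic()
    n = df.count()  # action: triggers computation of df's whole lineage
    # the count itself adds runtime, so the surrounding timings are only
    # useful for comparison between runs, not as absolute numbers
    logger.info("%s: %d rows (count took %.2fs)", label, n, time.monotonic() - t0)
```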

      • v6lur__ has quit
      • 2020-09-02 24632, 2020

      • iliekcomputers
        ishaanshah: you'd like this, https://stripe.com/blog/globe
      • 2020-09-02 24611, 2020

      • ruaok
        pristine___: makes sense.
      • 2020-09-02 24615, 2020

      • ruaok
        let me run that command.
      • 2020-09-02 24656, 2020

      • ruaok
        done. stats in the queue atm.
      • 2020-09-02 24619, 2020

      • ruaok
        pristine___: does 1071 need rebasing?
      • 2020-09-02 24609, 2020

      • pristine___
        No
      • 2020-09-02 24622, 2020

      • BrainzGit
        [listenbrainz-server] mayhem merged pull request #1071 (master…log-count-rec): log recording count of top artist and similar artist candidate sets https://github.com/metabrainz/listenbrainz-server…
      • 2020-09-02 24605, 2020

      • ruaok
        I purged the requests for now. once the other job finishes, I will restart the consumer and reissue the commands.
      • 2020-09-02 24636, 2020

      • pristine___
        Cool
      • 2020-09-02 24622, 2020

      • ruaok
        ok, updated, restarted. running now.
      • 2020-09-02 24656, 2020

      • alastairp
        pristine___: btw, I guess log messages have a timestamp on them too, so the time.monotonic check isn't strictly necessary, unless you want summary information :)
      • 2020-09-02 24638, 2020
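Illustrating the two options alastairp mentions: the logging module already timestamps each record, so explicit time.monotonic() deltas are mainly useful for summary lines:

```python
import logging
import time

logging.basicConfig(format="%(asctime)s %(message)s", level=logging.INFO)

t0 = time.monotonic()
logging.info("starting recommendations")  # asctime already gives a timestamp
# ... do the work ...
logging.info("done, total %.2fs", time.monotonic() - t0)  # summary needs monotonic
```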

      • alastairp
        will the count() calls force an evaluation of the data at each log point?
      • 2020-09-02 24603, 2020

      • ruaok
        alastairp: did you see her comment above?
      • 2020-09-02 24615, 2020

      • ruaok
        > ruaok: I have added *counts* for logging purposes. Count is an action in spark terminology, so two things will happen now: computation of the count and computation of the recs. It will increase the runtime. Therefore, the runtime logs should not be taken as absolute, but rather used for comparison.
      • 2020-09-02 24636, 2020

      • alastairp
        yes, but it wasn't clear to me that this is what the count() was doing
      • 2020-09-02 24648, 2020

      • ruaok
        just checking.
      • 2020-09-02 24602, 2020

      • alastairp
        we really need to use slack so that we can use threads when talking to each other :D
      • 2020-09-02 24628, 2020

      • shivam-kapila
        IRC prem allows that
      • 2020-09-02 24651, 2020

      • pristine___
        > will the count() calls force an evaluation of the data at each log point?