#metabrainz

      • alastairp
        Gentlecat: well, that’s kind of fun
      • 2015-09-29 27224, 2015

      • ruaok
        alastairp: that is my job in dec.
      • 2015-09-29 27229, 2015

      • alastairp
        who’s the teacher in the second class?
      • 2015-09-29 27230, 2015

      • ruaok
        they are just turning it off.
      • 2015-09-29 27200, 2015

      • Gentlecat
        Perfecto Herrera
      • 2015-09-29 27225, 2015

      • alastairp
        when does it finish?
      • 2015-09-29 27233, 2015

      • Gentlecat
        oh wait, second one
      • 2015-09-29 27244, 2015

      • Gentlecat
        Rafael Ramírez
      • 2015-09-29 27250, 2015

      • alastairp
        if you asked him, he’d probably bump it 15 or 30 minutes
      • 2015-09-29 27258, 2015

      • alastairp
        depending on how many people go to both
      • 2015-09-29 27203, 2015

      • Gentlecat
        it was actually shorter today
      • 2015-09-29 27232, 2015

      • Gentlecat
        most of these classes are only for us so
      • 2015-09-29 27247, 2015

      • alastairp
        “us”?
      • 2015-09-29 27257, 2015

      • Gentlecat
        when drinking beer today we actually decided to ask about that next week
      • 2015-09-29 27203, 2015

      • Gentlecat
        SMC
      • 2015-09-29 27212, 2015

      • alastairp
        oh yeah. it wouldn’t be a problem at all
      • 2015-09-29 27217, 2015

      • Gentlecat
        I don't think there are any other students there
      • 2015-09-29 27241, 2015

      • zas joined the channel
      • 2015-09-29 27256, 2015

      • yeeeargh has quit
      • 2015-09-29 27211, 2015

      • ruaok
        regommend looks like an easy way to make a recommender for music, now that we have this data. making an artist similarity one would be pretty cake.
      • 2015-09-29 27231, 2015

      • darwin
        so excited
      • 2015-09-29 27225, 2015

      • ruaok
        it is almost guaranteed to be not very good. but that is an excellent way to get people fired up.
      • 2015-09-29 27237, 2015

      • alastairp
        I’m glad it looks good
      • 2015-09-29 27246, 2015

      • ruaok
        ----> here is the code, fix it
      • 2015-09-29 27200, 2015

      • alastairp
        see you tomorrow
      • 2015-09-29 27205, 2015

      • ruaok wonders about cassandra counters
      • 2015-09-29 27216, 2015

      • ruaok
        sounds like they are stable now. would be very useful, methinks.
      • 2015-09-29 27218, 2015

      • darwin
        what about them?
      • 2015-09-29 27233, 2015

      • ruaok
        we've been told to avoid them due to stability issues.
      • 2015-09-29 27241, 2015

      • ruaok
        but we hear mixed things.
      • 2015-09-29 27241, 2015

      • darwin
        briefly, you probably don't want to use them to count without an aggregator in front of them
      • 2015-09-29 27247, 2015

      • alastairp
        By darwin no less
      • 2015-09-29 27250, 2015

      • darwin
        the person who told you that is giving you good advice, actually
      • 2015-09-29 27202, 2015

      • darwin
        however aleksey did meaningfully improve them in recent versions
      • 2015-09-29 27207, 2015

      • darwin
        what use case are you considering using them for?
      • 2015-09-29 27227, 2015

      • ruaok
        count the total number of listens a user has. for starters.
      • 2015-09-29 27246, 2015

      • ruaok
        then count per artist would be the next logical one. esp for regommend.
      • 2015-09-29 27252, 2015

      • darwin
        how accurate do you want that number to be? do you plan to ever bulk recalculate?
      • 2015-09-29 27210, 2015

      • darwin
        (by, for example, hadooping their entire history?)
      • 2015-09-29 27212, 2015

      • ruaok
        decent, yes.
      • 2015-09-29 27229, 2015

      • darwin
        if you have authoritative source and plan to recalculate, accuracy is likely to be fine
      • 2015-09-29 27233, 2015

      • darwin
        most error is in direction of overcount
      • 2015-09-29 27238, 2015

      • ruaok
        our data is going to continually evolve and I expect many passes over the data.
      • 2015-09-29 27211, 2015

      • ruaok
        what sort of error percentages are we talking about?
      • 2015-09-29 27223, 2015

      • ruaok
        off by a couple? couple thousand?
      • 2015-09-29 27227, 2015

      • darwin
        it depends on your client error rate, which depends on how well you operate
      • 2015-09-29 27239, 2015

      • darwin
        modern counters are much less likely to catastrophically overcount
      • 2015-09-29 27241, 2015

      • darwin
        than old ones
      • 2015-09-29 27258, 2015

      • darwin
        the worst case is when you try to increment a counter
      • 2015-09-29 27200, 2015

      • darwin
        and you get an error
      • 2015-09-29 27201, 2015
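
A minimal sketch of why an errored increment is the worst case, assuming the usual failure mode where the write is applied but the acknowledgement is lost. `FlakyCounter` and `incr_with_retry` are hypothetical names for illustration and do not use any real Cassandra API. Because an increment is not idempotent, a client that retries after the error counts the same listen twice:

```python
class FlakyCounter:
    """Toy counter: every increment is applied, but the first ack is lost."""

    def __init__(self):
        self.value = 0
        self._drop_next_ack = True

    def incr(self, amount=1):
        self.value += amount  # the write lands on the "server"
        if self._drop_next_ack:
            self._drop_next_ack = False
            # The client cannot tell whether the write was applied or not.
            raise TimeoutError("ack lost; write state unknown to the client")


def incr_with_retry(counter, retries=1):
    """Naive client: retry the increment whenever it reports an error."""
    for attempt in range(retries + 1):
        try:
            counter.incr()
            return
        except TimeoutError:
            if attempt == retries:
                raise


counter = FlakyCounter()
incr_with_retry(counter)  # one logical listen
print(counter.value)      # 2: the retry overcounted by one
```

Not retrying would undercount whenever the failed write really was dropped, which is why the error rate of the client bounds the accuracy of the counter.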

      • ruaok
        the alternative we're considering is keeping counts in redis.
      • 2015-09-29 27210, 2015

      • darwin
        15:01 < darwin> briefly, you probably don't want to use them to count without an aggregator in front of them
      • 2015-09-29 27211, 2015

      • ruaok
        it has atomic count operations.
      • 2015-09-29 27218, 2015

      • darwin
        you could use redis as the aggregator I describe
      • 2015-09-29 27230, 2015

      • darwin
        simplereach gave a good presentation on their approach to 300,000 counter updates a minute
      • 2015-09-29 27231, 2015

      • ruaok
        ah, thanks for clarifying that.
      • 2015-09-29 27234, 2015

      • darwin
        at the recent cassandra summit
      • 2015-09-29 27208, 2015

      • darwin
        I would be uncomfortable with the idea of cassandra counting in the same cluster as the other data without an in-memory aggregator
      • 2015-09-29 27214, 2015

      • darwin
        for the last.fm case
      • 2015-09-29 27230, 2015

      • darwin
        in theory it might be tractable, but each one of those things
      • 2015-09-29 27233, 2015

      • darwin
        doing it in the same cluster
      • 2015-09-29 27239, 2015

      • ruaok
        yeah, that would be a lot of writes.
      • 2015-09-29 27244, 2015

      • darwin
        and counters read-before-write whereas other cassandra writes don't, etc.
      • 2015-09-29 27203, 2015

      • darwin
        basically counters are pretty okish these days, but they're still distributed counters which can only ever be so accurate
      • 2015-09-29 27207, 2015

      • darwin
        and so performant
      • 2015-09-29 27222, 2015

      • darwin
        meanwhile those redis counters are basically not-distributed, single node, in memory counters
      • 2015-09-29 27234, 2015

      • ruaok
        so redis batches for a period, then updates counters in cassandra, is that the model?
      • 2015-09-29 27235, 2015

      • darwin
        which would, if operated in LRU, probably be enough for mb.fm for quite some time
      • 2015-09-29 27249, 2015

      • darwin
        ruaok: yeah, they batch 1 minute of increments at simplereach in a custom aggregator
      • 2015-09-29 27252, 2015

      • darwin
        ruaok: then flush once a minute
      • 2015-09-29 27211, 2015
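
The batch-then-flush model described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the SimpleReach implementation: an in-process `Counter` stands in for Redis, the `flush` callback stands in for the Cassandra counter update, and the `(user, artist)` key shape and all names are hypothetical.

```python
import threading
from collections import Counter
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (user, artist); hypothetical key shape


class IncrementAggregator:
    """Buffers increments in memory and flushes one delta per key."""

    def __init__(self, flush: Callable[[Dict[Key, int]], None]) -> None:
        self._flush = flush  # stand-in for the Cassandra counter write
        self._pending: Counter = Counter()
        self._lock = threading.Lock()

    def incr(self, key: Key, amount: int = 1) -> None:
        with self._lock:
            self._pending[key] += amount

    def flush(self) -> int:
        """Swap out the buffer, pass the batch on, return the number of keys."""
        with self._lock:
            batch, self._pending = self._pending, Counter()
        if batch:
            self._flush(dict(batch))
        return len(batch)


# Usage: three listens collapse into two counter writes.
store: Dict[Key, int] = {}


def apply_batch(batch: Dict[Key, int]) -> None:
    for key, delta in batch.items():
        store[key] = store.get(key, 0) + delta


agg = IncrementAggregator(apply_batch)
agg.incr(("alice", "Autechre"))
agg.incr(("alice", "Autechre"))
agg.incr(("bob", "Boards of Canada"))
written = agg.flush()
```

In a real deployment `flush()` would run on a timer (once a minute in the example discussed above), so the counter table sees one write per key per interval instead of one per listen.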

      • ruaok
        ok, got it.
      • 2015-09-29 27245, 2015

      • ruaok
        that is a bit different than what I had in mind, but having a practical example to model after is probably better.
      • 2015-09-29 27203, 2015

      • darwin
        they are the most public high-volume cassandra counters shop
      • 2015-09-29 27212, 2015

      • darwin
        the other one is disqus, which... also has an aggregator in front of them
      • 2015-09-29 27223, 2015

      • ruaok
      • 2015-09-29 27234, 2015

      • ruaok
        we put a lot of emphasis onto the listened at timestamp.
      • 2015-09-29 27245, 2015

      • ruaok
        but we have no timestamp for when we wrote something to disk.
      • 2015-09-29 27204, 2015

      • ruaok
        when you put counters in place, how to backfill the data?
      • 2015-09-29 27213, 2015

      • ruaok
        is there a best practice there?
      • 2015-09-29 27234, 2015

      • alastairp
        if we write something, how do we know if it actually got serialised (in our case where we dedup)
      • 2015-09-29 27252, 2015

      • alastairp
        and therefore, when should we increment?
      • 2015-09-29 27223, 2015

      • ruaok
        ouch, that's another tricky case.
      • 2015-09-29 27238, 2015

      • darwin
        the best practice would be a single increment
      • 2015-09-29 27245, 2015

      • darwin
        from a materialized-by-other-means count
      • 2015-09-29 27201, 2015

      • ruaok
        without having some sort of inserted_at timestamp, we couldn't reliably back fill.
      • 2015-09-29 27230, 2015

      • ruaok
        right. you replay the data to the aggregator.
      • 2015-09-29 27249, 2015

      • ruaok
        but how do you know what data to query for to feed to the backfiller?
      • 2015-09-29 27219, 2015

      • darwin
        all data before timestamp [x] with no count?
      • 2015-09-29 27228, 2015

      • darwin
        and I'm not actually saying to replay against the aggregator
      • 2015-09-29 27235, 2015

      • darwin
        I'm saying if you have the entire dataset in something like hadoop
      • 2015-09-29 27245, 2015

      • darwin
        you can make the initial increment be an increment of all history up to time [x]
      • 2015-09-29 27205, 2015
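
A rough sketch of that backfill idea, assuming each listen is available offline as a `(user, timestamp)` pair. `seed_counts`, the record shape, and the sample data are hypothetical. The per-user totals for all history up to the cutoff are computed in bulk, and each total is then applied as a single increment, with live increments covering only data after the cutoff:

```python
from collections import Counter


def seed_counts(listens, cutoff):
    """Per-user totals for every listen at or before `cutoff`."""
    return Counter(user for user, ts in listens if ts <= cutoff)


# Made-up sample data: (user, timestamp)
listens = [("alice", 100), ("alice", 150), ("bob", 120), ("alice", 250)]

seeds = seed_counts(listens, cutoff=200)
for user, total in sorted(seeds.items()):
    # In practice this would be one counter increment of `total` per user.
    print(user, total)
```

Which timestamp the cutoff should be compared against (listened-at versus insertion time) is left open here, as it is in the discussion.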

      • ruaok
        ok, and if you don't have hadoop yet?
      • 2015-09-29 27251, 2015

      • darwin
        in that case replaying to the aggregator is a good stress test!
      • 2015-09-29 27221, 2015

      • darwin
      • 2015-09-29 27224, 2015

      • darwin
        is the presentation btw
      • 2015-09-29 27228, 2015

      • darwin
        trying to get the slide deck
      • 2015-09-29 27247, 2015

      • kahu has quit
      • 2015-09-29 27239, 2015

      • ruaok
        thanks for the thoughts on all this. I'm looking forward to playing with the data. :)
      • 2015-09-29 27255, 2015

      • ruaok todders off
      • 2015-09-29 27207, 2015

      • zas has quit
      • 2015-09-29 27221, 2015

      • LordSputnik joined the channel
      • 2015-09-29 27222, 2015

      • Gentlecat
        ruaok: did you figure out how to set up documentation?
      • 2015-09-29 27252, 2015

      • Gentlecat
        might want to update sphinxcontrib-httpdomain to 1.4.0 in requirements
      • 2015-09-29 27205, 2015

      • Lotheric joined the channel
      • 2015-09-29 27245, 2015

      • ruaok
        that and the https sidetracked me. I'll try that tomorrow.
      • 2015-09-29 27202, 2015

      • Gentlecat
        I can come and help again
      • 2015-09-29 27224, 2015

      • Gentlecat
        tomorrow I'm a free man \o/
      • 2015-09-29 27224, 2015

      • ruaok
        if you want. I'll be there usual slacker o clock
      • 2015-09-29 27230, 2015

      • ruaok
        nice
      • 2015-09-29 27221, 2015

      • ruaok has quit
      • 2015-09-29 27209, 2015

      • mat__ joined the channel
      • 2015-09-29 27204, 2015

      • rvedotrc has quit
      • 2015-09-29 27211, 2015

      • rvedotrc joined the channel
      • 2015-09-29 27209, 2015

      • mat__ is now known as mat_