#metabrainz


      • alastairp
        Gentlecat: well, that’s kind of fun
      • ruaok
        alastairp: that is my job in dec.
      • alastairp
        who’s the teacher in the second class?
      • ruaok
        they are just turning it off.
      • Gentlecat
        Perfecto Herrera
      • alastairp
        when does it finish?
      • Gentlecat
        oh wait, second one
      • Rafael Ramírez
      • alastairp
        if you asked him, he’d probably bump it 15 or 30 minutes
      • depending on how many people go to both
      • Gentlecat
        it was actually shorter today
      • most of these classes are only for us so
      • alastairp
        “us”?
      • Gentlecat
        when drinking beer today we actually decided to ask about that next week
      • SMC
      • alastairp
        oh yeah. it wouldn’t be a problem at all
      • Gentlecat
        I don't think there are any other students there
      • zas joined the channel
      • yeeeargh has quit
      • ruaok
        regommend looks like an easy way to make a recommender for music, now that we have this data. making an artist similarity one would be pretty cake.
      • darwin
        so excited
      • ruaok
        it is almost guaranteed to be not very good. but that is an excellent way to get people fired up.
      • alastairp
        I’m glad it looks good
      • ruaok
        ----> here is the code, fix it
      • alastairp
        see you tomorrow
      • ruaok wonders about cassandra counters
      • ruaok
        sounds like they are stable now. would be very useful, methinks.
      • darwin
        what about them?
      • ruaok
        we've been told to avoid them due to stability issues.
      • but we hear mixed things.
      • darwin
        briefly, you probably don't want to use them to count without an aggregator in front of them
      • alastairp
        By darwin no less
      • darwin
        the person who told you that is giving you good advice, actually
      • however aleksey did meaningfully improve them in recent versions
      • what use case are you considering using them for?
      • ruaok
        count the total number of listens a user has. for starters.
      • then count per artist would be the next logical one. esp for regommend.
      • darwin
        how accurate do you want that number to be? do you plan to ever bulk recalculate?
      • (by, for example, hadooping their entire history?)
      • ruaok
        decent, yes.
      • darwin
        if you have authoritative source and plan to recalculate, accuracy is likely to be fine
      • most error is in direction of overcount
      • ruaok
        our data is going to continually evolve and I expect many passes over the data.
      • what sort of error percentages are we talking about?
      • off by a couple? couple thousand?
      • darwin
        it depends on your client error rate, which depends on how well you operate
      • modern counters are much less likely to catastrophically overcount
      • than old ones
      • the worst case is when you try to increment a counter
      • and you get an error
      • ruaok
        the alternative we're considering is keeping counts in redis.
      • darwin
        15:01 < darwin> briefly, you probably don't want to use them to count without an aggregator in front of them
      • ruaok
        it has atomic count operations.
      • darwin
        you could use redis as the aggregator I describe
      • simplereach gave a good presentation on their approach to 300,000 counter updates a minute
      • ruaok
        ah, thanks for clarifying that.
      • darwin
        at the recent cassandra summit
      • I would be uncomfortable with the idea of cassandra counting in the same cluster as the other data without an in-memory aggregator
      • for the last.fm case
      • in theory it might be tractable, but each one of those things
      • doing it in the same cluster
      • ruaok
        yeah, that would be a lot of writes.
      • darwin
        and counters read-before-write whereas all other cassandra doesn't, etc.
      • basically counters are pretty okish these days, but they're still distributed counters which can only ever be so accurate
      • and so performant
      • meanwhile those redis counters are basically not-distributed, single node, in memory counters
      • ruaok
        so redis batches for a period, then updates counters in cassandra, is that the model?
      • darwin
        which would, if operated in LRU, probably be enough for mb.fm for quite some time
      • ruaok: yeah, they batch 1 minute of increments at simplereach in a custom aggregator
      • ruaok: then flush once a minute
      • ruaok
        ok, got it.
      • that is a bit different than what I had in mind, but having a practical example to model after is probably better.
      • darwin
        they are the most public high-volume cassandra counters shop
      • the other one is disqus, which... also has an aggregator in front of them
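
The batch-then-flush model discussed above can be sketched roughly as follows. This is a minimal illustration, not SimpleReach's or Disqus's actual code: the `CounterAggregator` class, its method names, and the `apply_delta` callback are all hypothetical, and a plain in-process `Counter` stands in for both the Redis-style aggregator and the Cassandra counter table. The idea is only that many small increments collapse into one delta per key per flush interval.

```python
import threading
from collections import Counter


class CounterAggregator:
    """Hypothetical in-memory aggregator that batches counter increments.

    Increments accumulate locally; flush() hands one summed delta per key
    to flush_fn, which stands in for a write to the distributed counter.
    """

    def __init__(self, flush_fn):
        self._pending = Counter()
        self._lock = threading.Lock()
        self._flush_fn = flush_fn  # called as flush_fn(key, delta)

    def incr(self, key, amount=1):
        # Cheap local update; nothing touches the backing store here.
        with self._lock:
            self._pending[key] += amount

    def flush(self):
        # Swap the pending batch out under the lock, then write outside it
        # so new increments are not blocked by slow store writes.
        with self._lock:
            batch, self._pending = self._pending, Counter()
        for key, delta in batch.items():
            self._flush_fn(key, delta)
        return len(batch)


# Stand-in for the durable counter store (assumption: a dict-like total).
store = Counter()


def apply_delta(key, delta):
    store[key] += delta


agg = CounterAggregator(apply_delta)
for _ in range(3):
    agg.incr(("some_user", "listens"))
agg.incr(("some_user", "artist:some_artist"))
agg.flush()  # in the model described above, this would run once a minute
```

The sketch does not retry failed writes. That is deliberate: as noted earlier in the conversation, an increment that returns an error is the worst case, because retrying it blindly is what leads to overcounting.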
      • ruaok
      • we put a lot of emphasis onto the listened at timestamp.
      • but we have no timestamp for when we wrote something to disk.
      • when you put counters in place, how to backfill the data?
      • is there a best practice there?
      • alastairp
        if we write something, how do we know if it actually got serialised (in our case where we dedup)
      • and therefore, when should we increment?
      • ruaok
        ouch, that's another tricky case.
      • darwin
        the best practice would be a single increment
      • from a materialized-by-other-means count
      • ruaok
        without having some sort of inserted_at timestamp, we couldn't reliably back fill.
      • right. you replay the data to the aggregator.
      • but how do you know what data to query for to feed to the backfiller?
      • darwin
        all data before timestamp [x] with no count?
      • and I'm not actually saying to replay against the aggregator
      • I'm saying if you have the entire dataset in something like hadoop
      • you can make the initial increment be an increment of all history up to time [x]
      • ruaok
        ok, and if you don't have hadoop yet?
      • darwin
        in that case replaying to the aggregator is a good stress test!
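
As a rough illustration of the "single increment of all history up to time [x]" idea, the sketch below computes one backfill delta per user from a full pass over the stored listens. It assumes each record carries an `inserted_at` value, which the discussion above points out does not exist yet, and it assumes duplicates are identified by `(user, listened_at)`. The function name and the record layout are hypothetical.

```python
from collections import Counter


def backfill_counts(listens, cutoff):
    """Return one backfill delta per user for listens stored before cutoff.

    listens: iterable of (user, listened_at, inserted_at) tuples.
    Records with inserted_at >= cutoff are skipped, on the assumption that
    the live aggregator already counted them. Duplicate
    (user, listened_at) pairs are counted once.
    """
    seen = set()
    totals = Counter()
    for user, listened_at, inserted_at in listens:
        if inserted_at >= cutoff:
            continue  # counted by the live path, not by the backfill
        key = (user, listened_at)
        if key in seen:
            continue  # deduplicated listen, do not count twice
        seen.add(key)
        totals[user] += 1
    return totals


# Each total would then be applied as a single increment per counter.
history = [
    ("user_a", 100, 1),
    ("user_a", 100, 2),   # duplicate of the first listen
    ("user_a", 200, 5),
    ("user_b", 100, 3),
    ("user_a", 300, 50),  # written after the cutoff
]
deltas = backfill_counts(history, cutoff=10)
```

Without a reliable write-time field, the cutoff cannot be applied this way, which is the gap raised earlier about having only the listened-at timestamp.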
      • is the presentation btw
      • trying to get the slide deck
      • kahu has quit
      • ruaok
        thanks for the thoughts on all this. I'm looking forward to playing with the data. :)
      • ruaok todders off
      • zas has quit
      • LordSputnik joined the channel
      • Gentlecat
        ruaok: did you figure out how to set up documentation?
      • might want to update sphinxcontrib-httpdomain to 1.4.0 in requirements
      • Lotheric joined the channel
      • ruaok
        that and the https side tracked me. I'll try that tomorrow.
      • Gentlecat
        I can come and help again
      • tomorrow I'm a free man \o/
      • ruaok
        if you want. I'll be there usual slacker o clock
      • nice
      • ruaok has quit
      • mat__ joined the channel
      • rvedotrc has quit
      • rvedotrc joined the channel
      • mat__ is now known as mat_