in #metabrainz

21:14 PM
alastairp

Gentlecat: well, that’s kind of fun
21:14 PM
ruaok

alastairp: that is my job in dec.
21:14 PM
alastairp

who’s the teacher in the second class?
21:14 PM
ruaok

they are just turning it off.
21:15 PM
Gentlecat

Perfecto Herrera
21:15 PM
alastairp

when does it finish?
21:15 PM
Gentlecat

oh wait, second one
21:15 PM
Rafael Ramírez
21:15 PM
alastairp

if you asked him, he’d probably bump it 15 or 30 minutes
21:15 PM
depending on how many people go to both
21:16 PM
Gentlecat

it was actually shorter today
21:16 PM
most of these classes are only for us so
21:16 PM
alastairp

“us”?
21:16 PM
Gentlecat

when drinking beer today we actually decided to ask about that next week
21:17 PM
SMC
21:17 PM
alastairp

oh yeah. it wouldn’t be a problem at all
21:17 PM
Gentlecat

I don't think there are any other students there
21:21 PM
zas joined the channel
21:37 PM
yeeeargh has quit
21:54 PM
ruaok

regommend looks like an easy way to make a recommender for music, now that we have this data. making an artist similarity one would be pretty cake.
21:55 PM
darwin

so excited
21:56 PM
ruaok

it is almost guaranteed to be not very good. but that is an excellent way to get people fired up.
21:56 PM
alastairp

I’m glad it looks good
21:56 PM
ruaok

----> here is the code, fix it
22:00 PM
alastairp

see you tomorrow
22:01 PM
ruaok wonders about cassandra counters
22:01 PM
ruaok

sounds like they are stable now. would be very useful, methinks.
22:01 PM
darwin

what about them?
22:01 PM
ruaok

we've been told to avoid them due to stability issues.
22:01 PM
but we hear mixed things.
22:01 PM
darwin

briefly, you probably don't want to use them to count without an aggregator in front of them
22:01 PM
alastairp

By darwin no less
22:01 PM
darwin

the person who told you that is giving you good advice, actually
22:02 PM
however aleksey did meaningfully improve them in recent versions
22:02 PM
what use case are you considering using them for?
22:02 PM
ruaok

count the total number of listens a user has. for starters.
22:02 PM
then count per artist would be the next logical one. esp for regommend.
22:02 PM
darwin

how accurate do you want that number to be? do you plan to ever bulk recalculate?
22:03 PM
(by, for example, hadooping their entire history?)
22:03 PM
ruaok

decent, yes.
22:03 PM
darwin

if you have authoritative source and plan to recalculate, accuracy is likely to be fine
22:03 PM
most error is in direction of overcount
22:03 PM
ruaok

our data is going to continually evolve and I expect many passes over the data.
22:04 PM
what sort of error percentages are we talking about?
22:04 PM
off by a couple? couple thousand?
22:04 PM
darwin

it depends on your client error rate, which depends on how well you operate
22:04 PM
modern counters are much less likely to catastrophically overcount
22:04 PM
than old ones
22:04 PM
the worst case is when you try to increment a counter
22:05 PM
and you get an error
22:05 PM
ruaok

the alternative we're considering is keeping counts in redis.
22:05 PM
darwin

15:01 < darwin> briefly, you probably don't want to use them to count without an aggregator in front of them
22:05 PM
ruaok

it has atomic count operations.
22:05 PM
darwin

you could use redis as the aggregator I describe
22:05 PM
simplereach gave a good presentation on their approach to 300,000 counter updates a minute
22:05 PM
ruaok

ah, thanks for clarifying that.
22:05 PM
darwin

at the recent cassandra summit
22:06 PM
I would be uncomfortable with the idea of cassandra counting in the same cluster as the other data without an in-memory aggregator
22:06 PM
for the last.fm case
22:06 PM
in theory it might be tractable, but each one of those things
22:06 PM
doing it in the same cluster
22:06 PM
ruaok

yeah, that would be a lot of writes.
22:07 PM
darwin

and counters read-before-write whereas all other cassandra doesn't, etc.
22:08 PM
basically counters are pretty okish these days, but they're still distributed counters which can only ever be so accurate
22:08 PM
and so performant
22:08 PM
meanwhile those redis counters are basically not-distributed, single node, in memory counters
22:08 PM
ruaok

so redis batches for a period, then updates counters in cassandra, is that the model?
22:08 PM
darwin

which would, if operated in LRU, probably be enough for mb.fm for quite some time
22:08 PM
ruaok: yeah, they batch 1 minute of increments at simplereach in a custom aggregator
22:08 PM
ruaok: then flush once a minute
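(A minimal sketch of the batch-then-flush model described above, not SimpleReach's actual aggregator. The `CounterAggregator` class, the `sink` callable, and the key layout are all hypothetical names for illustration; an in-process dict stands in for Redis, and `sink` stands in for whatever would issue the counter updates to Cassandra. The once-a-minute timer is left out, so `flush()` is called by hand.)

```python
import threading
from collections import Counter


class CounterAggregator:
    """Batch increments in memory and hand them to a sink in one flush."""

    def __init__(self, sink):
        # sink: callable taking a dict of {key: delta}. Hypothetical stand-in
        # for the code that would apply counter updates to the backing store.
        self._sink = sink
        self._pending = Counter()
        self._lock = threading.Lock()

    def increment(self, key, amount=1):
        # Cheap in-memory update; nothing is written to the store here.
        with self._lock:
            self._pending[key] += amount

    def flush(self):
        # Swap the batch out under the lock, then call the sink outside it
        # so that new increments are not blocked while the sink runs.
        with self._lock:
            batch, self._pending = self._pending, Counter()
        if batch:
            self._sink(dict(batch))
        return len(batch)


# Usage: three listens collapse into two counter updates.
flushed = []
agg = CounterAggregator(flushed.append)
for user in ["alice", "bob", "alice"]:
    agg.increment(("listens", user))
agg.flush()
```

(A real version would also have to decide what happens when the sink fails partway, since retrying a counter increment after an error is the case darwin flags as the source of overcounts.)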
22:09 PM
ruaok

ok, got it.
22:09 PM
that is a bit different than what I had in mind, but having a practical example to model after is probably better.
22:10 PM
darwin

they are the most public high-volume cassandra counters shop
22:10 PM
the other one is disqus, which... also has an aggregator in front of them
22:12 PM
ruaok

https://github.com/metabrainz/listenbrainz-serv...
22:12 PM
we put a lot of emphasis on the listened at timestamp.
22:12 PM
but we have no timestamp for when we wrote something to disk.
22:13 PM
when you put counters in place, how do you backfill the data?
22:13 PM
is there a best practice there?
22:13 PM
alastairp

if we write something, how do we know if it actually got serialised (in our case where we dedup)
22:13 PM
and therefore, when should we increment?
22:14 PM
ruaok

ouch, that's another tricky case.
22:14 PM
darwin

the best practice would be a single increment
22:14 PM
from a materialized-by-other-means count
22:15 PM
ruaok

without having some sort of inserted_at timestamp, we couldn't reliably back fill.
22:15 PM
right. you replay the data to the aggregator.
22:15 PM
but how do you know what data to query for to feed to the backfiller?
22:16 PM
darwin

all data before timestamp [x] with no count?
22:16 PM
and I'm not actually saying to replay against the aggregator
22:16 PM
I'm saying if you have the entire dataset in something like hadoop
22:16 PM
you can make the initial increment be an increment of all history up to time [x]
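(A small sketch of the "single initial increment" idea, assuming the full history can be scanned in one pass. The `initial_increments` function and the `(user, ts, track)` tuple layout are hypothetical, and which timestamp `ts` should be is exactly the open question in this discussion. Duplicates are skipped so the materialized count matches the deduplicated data.)

```python
from collections import Counter


def initial_increments(listens, cutoff):
    """Count distinct listens per user with a timestamp before cutoff."""
    seen = set()
    totals = Counter()
    for user, ts, track in listens:
        if ts >= cutoff:
            continue  # left for the live aggregator to count
        key = (user, ts, track)
        if key in seen:
            continue  # duplicate submission, already counted
        seen.add(key)
        totals[user] += 1
    return dict(totals)


history = [
    ("alice", 100, "song-a"),
    ("alice", 100, "song-a"),  # duplicate
    ("alice", 150, "song-b"),
    ("bob", 120, "song-a"),
    ("bob", 300, "song-c"),  # after the cutoff
]
increments = initial_increments(history, cutoff=200)
```

(Each entry in `increments` would then be applied as one counter increment per user, rather than one increment per historical listen.)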
22:17 PM
ruaok

ok, and if you don't have hadoop yet?
22:17 PM
darwin

in that case replaying to the aggregator is a good stress test!
22:18 PM
http://cassandrasummit-datastax.com/agenda/coun...
22:18 PM
is the presentation btw
22:18 PM
trying to get the slide deck
22:18 PM
kahu has quit
22:19 PM
ruaok

thanks for the thoughts on all this. I'm looking forward to playing with the data. :)
22:19 PM
ruaok todders off
22:20 PM
zas has quit
22:36 PM
LordSputnik joined the channel
22:42 PM
Gentlecat

ruaok: did you figure out how to set up documentation?
22:45 PM
might want to update sphinxcontrib-httpdomain to 1.4.0 in requirements
22:47 PM
Lotheric joined the channel
23:05 PM
ruaok

that and the https sidetracked me. I'll try that tomorrow.
23:07 PM
Gentlecat

I can come and help again
23:07 PM
tomorrow I'm a free man \o/
23:07 PM
ruaok

if you want. I'll be there usual slacker o clock
23:07 PM
nice
23:15 PM
ruaok has quit
23:32 PM
mat__ joined the channel
23:34 PM
rvedotrc has quit
23:34 PM
rvedotrc joined the channel
23:35 PM
mat__ is now known as mat_