-
alastairp
Gentlecat: well, that’s kind of fun
-
ruaok
alastairp: that is my job in dec.
-
alastairp
who’s the teacher in the second class?
-
ruaok
they are just turning it off.
-
Gentlecat
Perfecto Herrera
-
alastairp
when does it finish?
-
Gentlecat
oh wait, second one
-
Rafael Ramírez
-
alastairp
if you asked him, he’d probably bump it 15 or 30 minutes
-
depending on how many people go to both
-
Gentlecat
it was actually shorter today
-
most of these classes are only for us so
-
alastairp
“us”?
-
Gentlecat
when drinking beer today we actually decided to ask about that next week
-
SMC
-
alastairp
oh yeah. it wouldn’t be a problem at all
-
Gentlecat
I don't think there are any other students there
-
zas joined the channel
-
yeeeargh has quit
-
ruaok
regommend looks like an easy way to make a recommender for music, now that we have this data. making an artist similarity one would be pretty cake.
-
darwin
so excited
-
ruaok
it is almost guaranteed to be not very good. but that is an excellent way to get people fired up.
-
alastairp
I’m glad it looks good
-
ruaok
----> here is the code, fix it
-
alastairp
see you tomorrow
-
ruaok wonders about cassandra counters
-
ruaok
sounds like they are stable now. would be very useful, methinks.
-
darwin
what about them?
-
ruaok
we've been told to avoid them due to stability issues.
-
but we hear mixed things.
-
darwin
briefly, you probably don't want to use them to count without an aggregator in front of them
-
alastairp
By darwin no less
-
darwin
the person who told you that is giving you good advice, actually
-
however aleksey did meaningfully improve them in recent versions
-
what use case are you considering using them for?
-
ruaok
count the total number of listens a user has. for starters.
-
then count per artist would be the next logical one. esp for regommend.
-
darwin
how accurate do you want that number to be? do you plan to ever bulk recalculate?
-
(by, for example, hadooping their entire history?)
-
ruaok
decent, yes.
-
darwin
if you have authoritative source and plan to recalculate, accuracy is likely to be fine
-
most error is in direction of overcount
-
ruaok
our data is going to continually evolve and I expect many passes over the data.
-
what sort of error percentages are we talking about?
-
off by a couple? couple thousand?
-
darwin
it depends on your client error rate, which depends on how well you operate
-
modern counters are much less likely to catastrophically overcount
-
than old ones
-
the worst case is when you try to increment a counter
-
and you get an error
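-
editor's aside
A minimal sketch of why an errored increment tends to overcount, assuming the common failure mode where the write is applied but the acknowledgement is lost. `FlakyCounter` and `increment_with_retry` are hypothetical names for illustration only; nothing here is a Cassandra API.

```python
class FlakyCounter:
    """Stand-in counter whose first increment is applied but reported as a timeout."""

    def __init__(self):
        self.value = 0
        self._fail_next = True

    def increment(self, amount=1):
        self.value += amount  # the write lands on the server...
        if self._fail_next:
            self._fail_next = False
            # ...but the client never receives the acknowledgement.
            raise TimeoutError("no acknowledgement")


def increment_with_retry(counter, amount=1, attempts=2):
    """Retry on timeout. Increments are not idempotent, so a retry can double count."""
    for _ in range(attempts):
        try:
            counter.increment(amount)
            return True
        except TimeoutError:
            continue  # the client cannot tell whether the write was applied
    return False


counter = FlakyCounter()
ok = increment_with_retry(counter)
print(ok, counter.value)  # one logical increment, but the stored value is 2
```

Not retrying would avoid the overcount in this scenario, at the cost of undercounting when the write really was lost.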
-
ruaok
the alternative we're considering is keeping counts in redis.
-
darwin
15:01 < darwin> briefly, you probably don't want to use them to count without an aggregator in front of them
-
ruaok
it has atomic count operations.
-
darwin
you could use redis as the aggregator I describe
-
simplereach gave a good presentation on their approach to 300,000 counter updates a minute
-
ruaok
ah, thanks for clarifying that.
-
darwin
at the recent cassandra summit
-
I would be uncomfortable with the idea of cassandra counting in the same cluster as the other data without an in-memory aggregator
-
for the last.fm case
-
in theory it might be tractable, but each one of those things
-
doing it in the same cluster
-
ruaok
yeah, that would be a lot of writes.
-
darwin
and counters read-before-write whereas all other cassandra doesn't, etc.
-
basically counters are pretty okish these days, but they're still distributed counters which can only ever be so accurate
-
and so performant
-
meanwhile those redis counters are basically not-distributed, single node, in memory counters
-
ruaok
so redis batches for a period, then updates counters in cassandra, is that the model?
-
darwin
which would, if operated in LRU, probably be enough for mb.fm for quite some time
-
ruaok: yeah, they batch 1 minute of increments at simplereach in a custom aggregator
-
ruaok: then flush once a minute
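-
editor's aside
The batch-then-flush model described above could look roughly like the sketch below. It is an in-memory illustration under stated assumptions, not simplereach's aggregator: `CounterAggregator`, `write_delta`, and the key layout are hypothetical, and a plain `Counter` stands in for the durable counter store.

```python
import threading
from collections import Counter


class CounterAggregator:
    """Buffers increments in memory and flushes one summed delta per key."""

    def __init__(self, flush_fn):
        self._flush_fn = flush_fn  # hypothetical sink, e.g. a durable counter update
        self._pending = Counter()
        self._lock = threading.Lock()

    def increment(self, key, amount=1):
        with self._lock:
            self._pending[key] += amount

    def flush(self):
        # Swap the buffer out under the lock so new increments
        # are not blocked while the slower writes happen.
        with self._lock:
            batch, self._pending = self._pending, Counter()
        for key, delta in batch.items():
            self._flush_fn(key, delta)
        return len(batch)


durable = Counter()  # stand-in for the durable counter store


def write_delta(key, delta):
    durable[key] += delta


agg = CounterAggregator(write_delta)
for user in ["alice", "bob", "alice", "alice"]:
    agg.increment(("listens", user))

writes = agg.flush()  # four increments collapse into two durable writes
```

In a real deployment `flush` would presumably run on a timer (once a minute in the example discussed), and the buffer could live in Redis rather than in process memory.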
-
ruaok
ok, got it.
-
that is a bit different than what I had in mind, but having a practical example to model after is probably better.
-
darwin
they are the most public high-volume cassandra counters shop
-
the other one is disqus, which... also has an aggregator in front of them
-
ruaok
-
we put a lot of emphasis onto the listened at timestamp.
-
but we have no timestamp for when we wrote something to disk.
-
when you put counters in place, how do you backfill the data?
-
is there a best practice there?
-
alastairp
if we write something, how do we know if it actually got serialised (in our case where we dedup)
-
and therefore, when should we increment?
-
ruaok
ouch, that's another tricky case.
-
darwin
the best practice would be a single increment
-
from a materialized-by-other-means count
-
ruaok
without having some sort of inserted_at timestamp, we couldn't reliably back fill.
-
right. you replay the data to the aggregator.
-
but how do you know what data to query for to feed to the backfiller?
-
darwin
all data before timestamp [x] with no count?
-
and I'm not actually saying to replay against the aggregator
-
I'm saying if you have the entire dataset in something like hadoop
-
you can make the initial increment be an increment of all history up to time [x]
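-
editor's aside
A rough sketch of the single-increment backfill: compute each total offline from the full history up to a cutoff, then apply it as one increment per counter. The tuple layout, the `backfill_counts` helper, and the sample data are hypothetical. The timestamp is assumed to be whichever field marks the cutoff, which, as noted in the conversation, would need to reflect insertion time rather than `listened_at` to be reliable.

```python
from collections import Counter


def backfill_counts(listens, cutoff):
    """Count listens per user whose timestamp is strictly before the cutoff."""
    totals = Counter()
    for user, timestamp in listens:
        if timestamp < cutoff:
            totals[user] += 1
    return totals


# Hypothetical history as (user, timestamp) pairs.
history = [("alice", 100), ("bob", 150), ("alice", 200), ("alice", 300)]
cutoff = 250

counters = Counter()  # stand-in for the live counter store
for user, total in backfill_counts(history, cutoff).items():
    counters[user] += total  # one increment per counter covers all history before the cutoff
```

Listens at or after the cutoff are left to the live increment path, so each listen is counted exactly once as long as both sides agree on the same cutoff.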
-
ruaok
ok, and if you don't have hadoop yet?
-
darwin
in that case replaying to the aggregator is a good stress test!
-
-
is the presentation btw
-
trying to get the slide deck
-
kahu has quit
-
ruaok
thanks for the thoughts on all this. I'm looking forward to playing with the data. :)
-
ruaok todders off
-
zas has quit
-
LordSputnik joined the channel
-
Gentlecat
ruaok: did you figure out how to set up documentation?
-
might want to update sphinxcontrib-httpdomain to 1.4.0 in requirements
-
Lotheric joined the channel
-
ruaok
that and the https sidetracked me. I'll try that tomorrow.
-
Gentlecat
I can come and help again
-
tomorrow I'm a free man \o/
-
ruaok
if you want. I'll be there usual slacker o clock
-
nice
-
ruaok has quit
-
mat__ joined the channel
-
rvedotrc has quit
-
rvedotrc joined the channel
-
mat__ is now known as mat_