and I think we're using PG incorrectly here -- I think for the most part we never query the data in any way except to serve the data.
I wonder if we should write one very large file with all of the results in JSON or compressed JSON, grouped by user and then have a very small python module resolve a user name to a chunk on disk. then have nginx serve the data from a memory mapped file.
Etua has quit
alastairp
morning
mayhem
moin
alastairp
that sounds similar to the issues we ended up having with AB
mayhem
yes, same pattern.
alastairp
if it's just a block of data that you store somewhere and serve directly back to people, static json sounds good
mayhem
I am not 100% sure if we never query the data for anything, lucifer would know better.
alastairp
right
mayhem
but if we don't we should move to that system.
because it seems that every week we want to store more data.
and if we have a sudden explosion of users, bam, we're stuck
lucifer
i agree with the PG issue. for fixing it though, i'd prefer looking into a ready made JSON data store.
we only query for calculating the artist map stats.
mayhem
I think investigating JSON data store is a pretty good idea.
but I have this feeling that if we're going to replace 100% of each group of data each day, a data store is not a good way to go. it will always have a cost greater than simply writing a file to disk.
if we were replacing only a subset of the data each day, then absolutely yes a json data store.
lucifer
i think that is what stuff like mongo does anyway. store it as json files and internally keep an index of which file holds which record etc.
mayhem shudders when hearing mongo.
not proposing we use mongo :)
mayhem
does mongo still lose data?
ok, good. :) phew.
but if the datastore is set up for that use case, then great.
lucifer
my understanding is that document dbs/json data stores work like the small python module you described. we can try some and if one works then fine, else we can write our own.
mayhem
ok, sounds good.
lucifer
(another option can be to create stats incrementally which might be possible but needs further exploration spark side)
mayhem
well, that brings about incremental updates of large amounts of data. Not a good pattern for big data.
I think we will get more done in a total recalculation and replacement daily.
lucifer
i see, makes sense.
alastairp
and just throwing some ideas around - are these stats already stored in spark/hdfs somehow?
Etua joined the channel
lucifer
they aren't currently but could be, and an api could be exposed on top of spark. fwiw, there exist tools like Apache Pinot, for example, which are capable of reading off spark and exposing realtime user visible analytics.
i haven't explored those yet but the premise is really interesting and might be useful to look into such tools in future.
so that you could expose an internal api from spark to serve stats and have LB backend call it and forward the response.
Etua has quit
mayhem
I was wondering about that, but I kinda dismissed it since the column major representation would have to get converted to row major and that sounds like a royal pain.
lucifer
as long as we don't have to do it manually, it'd be fine i think
but spark doesn't mandate column major or row major, you could just store it as json files directly.
mayhem
oh, very good. then that might be a very good solution.
it is easier to replace everything once a day than to work out the differences between days.
not in this case, no.
atj
is the data indexed by a key, like a username or id?
mayhem
there is no downside to having one user be served day old stats and another user be served fresh stats for a short window of time.
username.
lucifer
the table is like (username, stat_type, stat_range, last_update, stat_json)
mayhem
but there are/will be other tables as well, organized differently, but always with a primary key, likely to be user.
atj
so I'm lacking a lot of context here so this might be totally wrong/unsuitable/stupid, but what immediately came to mind was a prefix based directory hierarchy, based on the hash of an ID
mayhem
atj: that is exactly the direction we were going for, or at least me. so you're pretty much spot on.
except that filesystems don't like large numbers of files.
atj
well, they don't like large numbers of files in a single directory
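(a minimal sketch of the prefix scheme atj describes, assuming sha1 of the username and an illustrative two-level/two-hex-digit layout; the function name and parameters are hypothetical)

```python
import hashlib
from pathlib import Path

def user_stats_path(root, username, depth=2, width=2):
    """Shard per-user stat files into subdirectories keyed by the first
    hex digits of a hash of the username, so no single directory ends up
    holding a huge number of files."""
    digest = hashlib.sha1(username.encode("utf-8")).hexdigest()
    # e.g. depth=2, width=2 -> <root>/ab/cd/<username>.json
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return Path(root).joinpath(*parts, f"{username}.json")
```

the same hash determines the path on write and on read, so lookups need no index, and rehoming a prefix to another host makes sharding fairly mechanical.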
mayhem
so my idea was to write to a single file and keep a user index: (user, start offset, chunk size).
atj
which is I think partly why this pattern came about
mayhem
true, but even large numbers of files distributed into a well organized filesystem tree will always incur more overhead than writing one massive file.
there is massive overhead in creating dirs and opening and closing files.
all that can be skipped if we just open a file once, write it, then close it.
but we may not need to do that -- we may be able to write the results in HDFS on our cluster and serve them from there.
lucifer
i don't know about the overheads but this one file could be rather huge, think 30-40G.
atj
I agree with the overheads of managing lots of files, but managing huge files has its own issues
mayhem
lucifer: oh, that would actually make the cluster part of our infrastructure that must be up at all times. perhaps not such a good idea after all.
and I am not suggesting that we write ALL stats into one file.
but each class into one file, so that we have 10-15 large files.
last month stats one file, last year stats another, all time yet another
atj
also, a directory structure makes sharding pretty trivial, if it comes to that
mayhem
good point.
atj
are you ever likely to want to open that 30G JSON file in vim to check something? :)
mayhem
no, but I am also not suggesting 30G JSON files. perhaps a GB or so max.
atj
ah, ok
anyway, just some thoughts that came to mind after reading your discussion
mayhem
thanks for your input!
Etua joined the channel
atj
mmap makes me nervous, but I think that's just because I've never properly understood how it works in practice :)
mayhem thanks his cranky comp sci prof for making him implement a virtual memory scheme in a project for computer architectures 2
monkey
mayhem:
Woops. Good and bad about the tags
mayhem
software architorture, more like.
monkey
We did talk about that issue: assembling tags from releases and rgs
atj
heh
mayhem
monkey: I'll fix up another PR this afternoon, I hope.
assembling them is pretty easy. we just need to figure out how to make the UI clear that the user is tagging a release group and not the release.
I'll need to send along the release group id as well.
lucifer
mayhem: yeah indeed, i am not suggesting to do that right now. HDFS is probably slower than the filesystem, and yes, uptime is another issue. in the future, maybe, when stuff is more stable and we have a tool at hand to make this stuff go faster.
Etua has quit
mayhem
lucifer: agreed. let's start with looking at JSON doc stores.
hehe, I am suggesting this, knowing for the first time that, "aw fuck it, just use PG as always" is not an acceptable outcome
texke has quit
lucifer
hehe lol :D
i am trying to find json data stores and half of my results are about storing json in PG lol
mayhem
pretty soon the entire world will be stored in PG.
lucifer
🤞
mayhem
I'm just so glad that people are finally shutting up about mysql.
and sqlite has many more installs than mysql. heh.