and I think we're using PG incorrectly here -- for the most part we never query the data in any way except to serve it.
2022-05-23 14344, 2022
mayhem
I wonder if we should write one very large file with all of the results in JSON or compressed JSON, grouped by user and then have a very small python module resolve a user name to a chunk on disk. then have nginx serve the data from a memory mapped file.
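A minimal sketch of the lookup half of that idea, assuming a data file of back-to-back JSON chunks plus a JSON index of {username: [offset, size]}; the file names and index format here are illustrative, not anything that exists yet:

```python
import json
import mmap

def load_index(index_path):
    # Hypothetical index: {username: [start offset, chunk size]}
    with open(index_path) as f:
        return json.load(f)

def read_user_chunk(data_path, index, username):
    # Resolve a user name to a byte range, then read it from a
    # memory-mapped view of the big data file.
    offset, size = index[username]
    with open(data_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return json.loads(mm[offset:offset + size])
```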
2022-05-23 14352, 2022
Etua has quit
2022-05-23 14313, 2022
alastairp
morning
2022-05-23 14319, 2022
mayhem
moin
2022-05-23 14323, 2022
alastairp
that sounds similar to the issues we ended up having with AB
2022-05-23 14334, 2022
mayhem
yes, same pattern.
2022-05-23 14342, 2022
alastairp
if it's just a block of data that you store somewhere and serve directly back to people, static json sounds good
2022-05-23 14347, 2022
mayhem
I am not 100% sure if we never query the data for anything, lucifer would know better.
2022-05-23 14354, 2022
alastairp
right
2022-05-23 14305, 2022
mayhem
but if we don't, we should move to that system.
2022-05-23 14314, 2022
mayhem
because it seems that every week we want to store more data.
2022-05-23 14323, 2022
mayhem
and if we have a sudden explosion of users, bam, we're stuck
2022-05-23 14324, 2022
lucifer
i agree about the PG issue. for fixing it though, i'd prefer looking into a ready-made JSON data store.
2022-05-23 14342, 2022
lucifer
we only query for calculating the artist map stats.
2022-05-23 14305, 2022
mayhem
I think investigating JSON data store is a pretty good idea.
2022-05-23 14344, 2022
mayhem
but I have this feeling that if we're going to replace 100% of each group of data each day, a data store is not a good way to go. it will always have a cost greater than simply writing a file to disk.
2022-05-23 14322, 2022
mayhem
if we were replacing only a subset of the data each day, then absolutely yes a json data store.
2022-05-23 14345, 2022
lucifer
i think that is what stuff like mongo does anyway. store it as json files and internally keep an index of which file holds which record etc.
2022-05-23 14309, 2022
mayhem shudders when hearing mongo.
2022-05-23 14318, 2022
lucifer
not proposing we use mongo :)
2022-05-23 14324, 2022
mayhem
does mongo still lose data?
2022-05-23 14329, 2022
mayhem
ok, good. :) phew.
2022-05-23 14346, 2022
mayhem
but if the datastore is set up for that use case, then great.
2022-05-23 14358, 2022
lucifer
my understanding is that document dbs/json data stores work like the small python module you described. we can try some, and if one works then fine, else we can write our own.
2022-05-23 14325, 2022
mayhem
ok, sounds good.
2022-05-23 14339, 2022
lucifer
(another option can be to create stats incrementally which might be possible but needs further exploration spark side)
2022-05-23 14308, 2022
mayhem
well, that means incremental updates of large amounts of data. Not a good pattern for big data.
2022-05-23 14324, 2022
mayhem
I think we will get more done with a total recalculation and replacement daily.
2022-05-23 14331, 2022
lucifer
i see, makes sense.
2022-05-23 14311, 2022
alastairp
and just throwing some ideas around - are these stats already stored in spark/hdfs somehow?
2022-05-23 14333, 2022
Etua joined the channel
2022-05-23 14343, 2022
lucifer
they aren't currently, but could be, and an api could be exposed on top of spark. fwiw, there exist tools like Apache Pinot, for example, which are capable of reading off spark and exposing realtime user-visible analytics.
2022-05-23 14345, 2022
lucifer
i haven't explored those yet but the premise is really interesting and might be useful to look into such tools in future.
2022-05-23 14330, 2022
lucifer
so that you could expose an internal api from spark to serve stats and have LB backend call it and forward the response.
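A hedged sketch of that forwarding shape, using Flask since that is what the LB backend runs on; the internal host, port, and routes are invented for illustration, not an existing service:

```python
import requests
from flask import Flask, Response

app = Flask(__name__)
INTERNAL_STATS_API = "http://spark-stats.internal:8080"  # hypothetical host

@app.route("/1/stats/user/<user_name>/artist-map")
def artist_map(user_name):
    # Relay the request to the internal stats API and forward its response.
    upstream = requests.get(f"{INTERNAL_STATS_API}/artist-map/{user_name}",
                            timeout=5)
    return Response(upstream.content, status=upstream.status_code,
                    mimetype="application/json")
```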
2022-05-23 14332, 2022
Etua has quit
2022-05-23 14310, 2022
mayhem
I was wondering about that, but I kinda dismissed it since the column-major representation would have to be converted to row-major, and that sounds like a royal pain.
2022-05-23 14305, 2022
lucifer
as long as we don't have to do it manually, it'd be fine i think
2022-05-23 14340, 2022
lucifer
but spark doesn't mandate column major or row major, you could just store it as json files directly.
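For example, a rough PySpark sketch, assuming the per-user stats already exist as a DataFrame; the source and destination paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stats-export").getOrCreate()
stats_df = spark.read.parquet("/data/stats/user_artist_map")  # hypothetical

# Each row is written as one JSON object per line, i.e. row-major output,
# regardless of the columnar format the data may be stored in internally.
stats_df.write.mode("overwrite").json("/data/stats/export/json")
```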
2022-05-23 14334, 2022
mayhem
oh, very good. then that might be a very good solution.
it is easier to replace everything once a day than to work out the differences between days.
2022-05-23 14337, 2022
mayhem
not in this case, no.
2022-05-23 14304, 2022
atj
is the data indexed by a key, like a username or id?
2022-05-23 14304, 2022
mayhem
there is no downside to having one user be served day-old stats and another user be served fresh stats for a short window of time.
2022-05-23 14312, 2022
mayhem
username.
2022-05-23 14359, 2022
lucifer
the table is like (username, stat_type, stat_range, last_update, stat_json)
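So the serve path is a single keyed lookup, something like this sketch (psycopg2; the table name user_stats is a guess, only the column shape comes from the chat):

```python
import psycopg2

def fetch_stat(conn, username, stat_type, stat_range):
    # One primary-key-style lookup per request; the JSON blob is
    # returned to the client as-is.
    with conn.cursor() as cur:
        cur.execute(
            """SELECT stat_json, last_update
                 FROM user_stats
                WHERE username = %s AND stat_type = %s AND stat_range = %s""",
            (username, stat_type, stat_range),
        )
        return cur.fetchone()
```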
2022-05-23 14349, 2022
mayhem
but there are/will be other tables as well, organized differently, but always with a primary key, likely to be user.
2022-05-23 14337, 2022
atj
so I'm lacking a lot of context here, so this might be totally wrong/unsuitable/stupid, but what immediately came to mind was a prefix-based directory hierarchy, based on the hash of an ID
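A small sketch of that pattern; the fan-out depth and root path are arbitrary choices for illustration:

```python
import hashlib
from pathlib import Path

def stats_path(root, username):
    # Hash the ID and use the first hex digits as two directory levels,
    # so files spread evenly and no single directory grows huge.
    digest = hashlib.sha1(username.encode("utf-8")).hexdigest()
    return Path(root) / digest[:2] / digest[2:4] / f"{username}.json"

# stats_path("/data/stats", "mayhem")
# -> /data/stats/xx/yy/mayhem.json, where xx/yy come from the hash
```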
2022-05-23 14330, 2022
mayhem
atj: that is exactly the direction we were going for, or at least me. so you're pretty much spot on.
2022-05-23 14341, 2022
mayhem
except that filesystems don't like large numbers of files.
2022-05-23 14356, 2022
atj
well, they don't like large numbers of files in a single directory
2022-05-23 14307, 2022
mayhem
so my idea was to write to a single file and have a user index: user, start offset, chunk size.
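The write side of that, as a sketch; it pairs with the mmap lookup sketched earlier, and the names are again made up:

```python
import json

def write_stats_file(stats_by_user, data_path, index_path):
    index = {}
    with open(data_path, "wb") as f:
        for user, stats in stats_by_user.items():
            chunk = json.dumps(stats).encode("utf-8")
            # Record (start offset, chunk size) before appending the chunk.
            index[user] = (f.tell(), len(chunk))
            f.write(chunk)
    with open(index_path, "w") as f:
        json.dump(index, f)
```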
2022-05-23 14324, 2022
atj
which is I think partly why this pattern came about
2022-05-23 14347, 2022
mayhem
true, but even a large number of files distributed into a well-organized filesystem tree is always going to incur more overhead than writing one massive file.
2022-05-23 14308, 2022
mayhem
there is massive overhead in creating dirs and opening and closing files.
2022-05-23 14320, 2022
mayhem
all that can be skipped if we just open a file once, write it, then close it.
2022-05-23 14353, 2022
mayhem
but we may not need to do that -- we may be able to write the results in HDFS on our cluster and serve them from there.
2022-05-23 14358, 2022
lucifer
i don't know about the overheads but this one file could be rather huge, think 30-40G.
2022-05-23 14321, 2022
atj
I agree with the overheads of managing lots of files, but managing huge files has its own issues
2022-05-23 14333, 2022
mayhem
lucifer: oh, that would actually make the cluster a part of our infrastructure that must be up at all times. perhaps not such a good idea after all.
2022-05-23 14342, 2022
mayhem
and I am not suggesting that we write ALL stats into one file.
2022-05-23 14351, 2022
mayhem
but each class of stats into one file, so that we have 10-15 large files.
2022-05-23 14307, 2022
mayhem
last month stats one file, last year stats another, all time yet another
2022-05-23 14315, 2022
atj
also, a directory structure makes sharding pretty trivial, if it comes to that
2022-05-23 14328, 2022
mayhem
good point.
2022-05-23 14317, 2022
atj
are you ever likely to want to open that 30G JSON file in vim to check something? :)
2022-05-23 14348, 2022
mayhem
no, but I am also not suggesting 30G JSON files. perhaps a GB or so max.
2022-05-23 14300, 2022
atj
ah, ok
2022-05-23 14325, 2022
atj
anyway, just some thoughts that came to mind after reading your discussion
2022-05-23 14313, 2022
mayhem
thanks for your input!
2022-05-23 14340, 2022
Etua joined the channel
2022-05-23 14311, 2022
atj
mmap makes me nervous, but I think that's just because I've never properly understood how it works in practice :)
2022-05-23 14317, 2022
mayhem thanks his cranky comp sci prof for making him implement a virtual memory scheme in a project for computer architectures 2
2022-05-23 14343, 2022
monkey
mayhem:
2022-05-23 14357, 2022
monkey
Woops. Good and bad about the tags
2022-05-23 14307, 2022
mayhem
software architorture, more like.
2022-05-23 14318, 2022
monkey
We did talk about that issue: assembling tags from releases and rgs
2022-05-23 14319, 2022
atj
heh
2022-05-23 14321, 2022
mayhem
monkey: I'll fix up another PR this afternoon, I hope.
2022-05-23 14354, 2022
mayhem
assembling them is pretty easy. we just need to figure out how to make the UI clear that the user is tagging a release group and not the release.
2022-05-23 14302, 2022
mayhem
I'll need to send along the release group id as well.
2022-05-23 14349, 2022
lucifer
mayhem: yeah indeed, i am not suggesting we do that right now. HDFS is probably slower than the filesystem, and yes, uptime is another issue. in the future maybe, when stuff is more stable and we have a tool at hand to make this stuff go faster.
2022-05-23 14302, 2022
Etua has quit
2022-05-23 14312, 2022
mayhem
lucifer: agreed. let's start with looking at JSON doc stores.
2022-05-23 14342, 2022
mayhem
hehe, I am suggesting this, knowing for the first time that "aw fuck it, just use PG as always" is not an acceptable outcome
2022-05-23 14305, 2022
texke has quit
2022-05-23 14347, 2022
lucifer
hehe lol :D
2022-05-23 14316, 2022
lucifer
i am trying to find json data stores and half of my results are about storing json in PG lol
2022-05-23 14347, 2022
mayhem
pretty soon the entire world will be stored in PG.
2022-05-23 14301, 2022
lucifer
🤞
2022-05-23 14301, 2022
mayhem
I'm just so glad that people are finally shutting up about mysql.
2022-05-23 14322, 2022
mayhem
and sqlite has many more installs than mysql. heh.