and I think we're using PG incorrectly here -- I think for the most part we never query the data in any way except to serve the data.
I wonder if we should write one very large file with all of the results in JSON or compressed JSON, grouped by user and then have a very small python module resolve a user name to a chunk on disk. then have nginx serve the data from a memory mapped file.
Etua has quit
alastairp
morning
mayhem
moin
alastairp
that sounds similar to the issues we ended up having with AB
mayhem
yes, same pattern.
alastairp
if it's just a block of data that you store somewhere and serve directly back to people, static json sounds good
mayhem
I am not 100% sure if we never query the data for anything, lucifer would know better.
alastairp
right
mayhem
but if we don't we should move to that system.
because it seems that every week we want to store more data.
and if we have a sudden explosion of users, bam, we're stuck
lucifer
i agree with the PG issue. for fixing it though, i'd prefer looking into a ready made JSON data store.
we only query for calculating the artist map stats.
mayhem
I think investigating JSON data store is a pretty good idea.
but I have this feeling that if we're going to replace 100% of each group of data each day, a data store is not a good way to go. it will always have a cost greater than simply writing a file to disk.
if we were replacing only a subset of the data each day, then absolutely yes a json data store.
lucifer
i think that is what stuff like mongo does anyway. store it as json files and internally keep an index of which file holds which record etc.
mayhem shudders when hearing mongo.
not proposing we use mongo :)
mayhem
does mongo still lose data?
ok, good. :) phew.
but if the datastore is set up for that use case, then great.
lucifer
my understanding is that document dbs/json data stores work like the small python module you described. we can try some and if one works then fine, else we can write our own.
mayhem
ok, sounds good.
lucifer
(another option can be to create stats incrementally which might be possible but needs further exploration spark side)
mayhem
well, that brings about incremental updates of large amounts of data. Not a good pattern for big data.
I think we will get more done in a total recalculation and replacement daily.
lucifer
i see, makes sense.
alastairp
and just throwing some ideas around - are these stats already stored in spark/hdfs somehow?
Etua joined the channel
lucifer
they aren't currently but could be, and an api could be exposed on top of spark. fwiw, there exist tools like Apache Pinot, for example, which are capable of reading off spark and exposing realtime user visible analytics.
i haven't explored those yet but the premise is really interesting and might be useful to look into such tools in future.
so that you could expose an internal api from spark to serve stats and have LB backend call it and forward the response.
Etua has quit
mayhem
I was wondering about that, but I kinda dismissed it since the column major representation would have to get converted to row major and that sounds like a royal pain.
lucifer
as long as we don't have to do it manually, it'd be fine i think
but spark doesn't mandate column major or row major, you could just store it as json files directly.
mayhem
oh, very good. then that might be a very good solution.
it is easier to replace everything once a day than to work out the differences between days.
not in this case, no.
atj
is the data indexed by a key, like a username or id?
mayhem
there is no downside to having one user be served day old stats and another user be served fresh stats for a short window of time.
username.
lucifer
the table is like (username, stat_type, stat_range, last_update, stat_json)
mayhem
but there are/will be other tables as well, organized differently, but always with a primary key, likely to be user.
atj
so I'm lacking a lot of context here so this might be totally wrong/unsuitable/stupid, but what immediately came to mind was a prefix based directory hierarchy, based on the hash of an ID
mayhem
atj: that is exactly the direction we were going for, or at least me. so you're pretty much spot on.
except that filesystems don't like large numbers of files.
atj
well, they don't like large numbers of files in a single directory
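(a minimal sketch of the prefix scheme atj describes, assuming sha1 of the username and an illustrative two-level/two-hex-digit layout; the function name and parameters are hypothetical)

```python
import hashlib
from pathlib import Path

def user_stats_path(root, username, depth=2, width=2):
    """Shard per-user stat files into subdirectories keyed by the first
    hex digits of a hash of the username, so no single directory ends up
    holding a huge number of files."""
    digest = hashlib.sha1(username.encode("utf-8")).hexdigest()
    # e.g. depth=2, width=2 -> <root>/ab/cd/<username>.json
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return Path(root).joinpath(*parts, f"{username}.json")
```

the same hash determines the path on write and on read, so lookups need no index, and rehoming a prefix to another host makes sharding fairly mechanical.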
mayhem
so my idea was to write to a single file and keep a user index: (user, start offset, chunk size).
atj
which is I think partly why this pattern came about
mayhem
true, but even large numbers of files distributed into a well organized filesystem tree will always incur more overhead than writing one massive file.
there is massive overhead in creating dirs and opening and closing files.
all that can be skipped if we just open a file once, write it, then close it.
but we may not need to do that -- we may be able to write the results in HDFS on our cluster and serve them from there.
lucifer
i don't know about the overheads but this one file could be rather huge, think 30-40G.
atj
I agree with the overheads of managing lots of files, but managing huge files has its own issues
mayhem
lucifer: oh, that would actually make the cluster part of our infrastructure that must be up at all times. perhaps not such a good idea after all.
and I am not suggesting that we write ALL stats into one file.
but each class into one file, so that we have 10-15 large files.
last month stats one file, last year stats another, all time yet another
atj
also, a directory structure makes sharding pretty trivial, if it comes to that
mayhem
good point.
atj
are you ever likely to want to open that 30G JSON file in vim to check something? :)
mayhem
no, but I am also not suggesting 30G JSON files. perhaps a GB or so max.
atj
ah, ok
anyway, just some thoughts that came to mind after reading your discussion
mayhem
thanks for your input!
Etua joined the channel
atj
mmap makes me nervous, but I think that's just because I've never properly understood how it works in practice :)
mayhem thanks his cranky comp sci prof for making him implement a virtual memory scheme in a project for computer architectures 2
monkey
mayhem:
Woops. Good and bad about the tags
mayhem
software architorture, more like.
monkey
We did talk about that issue: assembling tags from releases and rgs
atj
heh
mayhem
monkey: I'll fix up another PR this afternoon, I hope.
assembling them is pretty easy. we just need to figure out how to make the UI clear that the user is tagging a release group and not the release.
I'll need to send along the release group id as well.
lucifer
mayhem: yeah indeed, i am not suggesting to do that right now. HDFS is probably slower than the filesystem, and yes, uptime is another issue. in the future, maybe, when stuff is more stable and we have a tool at hand to make this stuff go faster.
Etua has quit
mayhem
lucifer: agreed. let's start with looking at JSON doc stores.
hehe, I am suggesting this, knowing for the first time that, "aw fuck it, just use PG as always" is not an acceptable outcome
texke has quit
lucifer
hehe lol :D
i am trying to find json data stores and half of my results are about storing json in PG lol
mayhem
pretty soon the entire world will be stored in PG.
lucifer
🤞
mayhem
I'm just so glad that people are finally shutting up about mysql.
and sqlite has many more installs than mysql. heh.