alastairp: question: is there a reason we use bz2 (and not lzma and pxz) for compressing the json dumps in AB?
github joined the channel
github
[listenbrainz-server] paramsingh opened pull request #280: LB-233: Refactor testcase classes to use single path_to_data_file (master...tests-refactor) https://git.io/vdjtR
github has left the channel
github joined the channel
[listenbrainz-server] paramsingh opened pull request #281: LB-206: Add message that ts must be omitted from playing_now submissions (master...docs-fix) https://git.io/vdjmW
github has left the channel
alastairp
iliekcomputers: speed/size tradeoff
I don't think that we specifically measured it, but lzma compression is much slower
bz2 is "good enough", while not taking 12h to dump and compress the database
iliekcomputers
alastairp: oh, I've used lzma for the listens in LB, do you think bz2 would be more appropriate there too? Listens are also mostly JSON documents.
alastairp: I also wanted your thoughts on how exactly AB-117 should be implemented (if you have any)
I wonder if we should do it again and document that somewhere
I wonder if they have any difference when you compress lots of small files or one big file
oh, I see - that example is compressing linux tarred, that's probably a good enough demo
I can't remember how big AB is, currently about 250GB I think
zas
well, results may differ on json/text
iliekcomputers
alastairp: :O
zas
doing some measurements should be fairly easy, but be sure to measure memory used during compression (it can be an issue, especially if the process swaps to disk....)
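(A minimal sketch of the measurement zas suggests, comparing bz2 and lzma on the same JSON-like input; the sample data is made up for illustration, and real AB dumps would need memory profiling on top of this.)

```python
import bz2
import lzma
import time

def benchmark(data: bytes) -> dict:
    """Compress the same bytes with bz2 and lzma, recording wall time
    and compressed/original size ratio for each."""
    results = {}
    for name, compress in (("bz2", bz2.compress), ("lzma", lzma.compress)):
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = {"seconds": elapsed, "ratio": len(compressed) / len(data)}
    return results

# Hypothetical stand-in for a lowlevel JSON document, repeated to get some bulk.
sample = b'{"lowlevel": {"mfcc": [1.0, 2.0, 3.0]}}' * 10000
print(benchmark(sample))
```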
iliekcomputers
alastairp: what should I do about tables like statistics etc that get dumped too? I assume stats would need to be calculated again after importing anyways?
alastairp
are we talking about AB or LB?
iliekcomputers
AB
Sorry :)
alastairp
theoretically they should time-sync
have you seen how AB dumps are based on a timestamp
if you import x rows up until a point in time, then the rows for stats will be the same
so we should dump them
note that the reason that dumps are broken is because we used to only have a few tables for highlevel, and only 1 (or 2?) for lowlevel
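(A sketch of the timestamp-based idea above, using an assumed in-memory row shape rather than the real AB schema: selecting rows by a fixed time window means the same window always yields the same rows, so stats dumped at the same cutoff stay consistent with the data.)

```python
from datetime import datetime

def rows_in_window(rows, t1, t2):
    """Return submissions with t1 <= submitted < t2. Re-running with the
    same window reproduces the same dump, so stats tables exported at the
    same cutoff line up with the exported rows."""
    return [r for r in rows if t1 <= r["submitted"] < t2]

# Illustrative rows only; real rows would come from the lowlevel table.
rows = [
    {"gid": "a", "submitted": datetime(2017, 1, 1)},
    {"gid": "b", "submitted": datetime(2017, 6, 1)},
    {"gid": "c", "submitted": datetime(2017, 9, 1)},
]
print(rows_in_window(rows, datetime(2017, 1, 1), datetime(2017, 7, 1)))
```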
iliekcomputers
Oh, I didn't come across that, cool.
alastairp
now lowlevel is 2 tables, and hl is about 6
I'd like to also make some changes to dump files so that we only have a certain maximum number of submissions per dump archive
so if for example we made a dump between t1 and t2
and there were 167000 submissions in that time
we could make 2 files, one with 100k, and one with 67k
again, for speed/ease of handling files
iliekcomputers
So that dumps have a maximum size, makes sense.
alastairp
I would rather people download 40 1gb files than 1 40gb file
if we do that, we could get the small dump for "free"
it would also go towards making sure we don't use too much memory, like zas mentioned
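(The 167k-submissions example above can be sketched as a simple chunking helper; the 100k cap is the figure from the discussion, not a settled value.)

```python
def chunk_submissions(ids, max_per_archive=100_000):
    """Split one dump window's submission ids into fixed-size archives,
    e.g. 167000 ids -> one archive of 100k and one of 67k. Bounding the
    archive size also bounds memory used while building each one."""
    for start in range(0, len(ids), max_per_archive):
        yield ids[start:start + max_per_archive]
```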
iliekcomputers
alastairp: :O
That is a really cool idea
Leo_Verto
depending on how often you do dumps you could also distribute them as torrents
alastairp
hmm, I had some other code that i had been working towards
yeah, we started doing that initially
no one used them
and we have oodles of bandwidth
for now we should get dumps out there, and then see how much people use them
note that torrents doesn't solve the "this data is too big for me to install and play around with locally in a trial manner"
iliekcomputers: Have you worked out how ordinal numbering of submissions works?
e.g. mbid-1, mbid-2, mbid-3 ?
ferbncode_ joined the channel
iliekcomputers
Using the version table? How it keeps different submissions of the same mbid? Yes, I think
alastairp
no, not using the version table
that's to do with the version of essentia used to compute data
iliekcomputers
Oh
alastairp
perhaps we should rename that table
for multiple submissions we allow multiple versions of the same mbid to be added to the lowlevel table
so we don't unique on lowlevel.gid
assume that we have two submissions for an mbid; we need a way to access each of them. we do that by adding a query parameter on GET, or by adding -1, -2 to the json files when we dump them
but what if we dump -1, -2, then another one gets submitted, then we make another dump
in this dump, the new row needs to be written with -3
and is joined in `highlevel_model` to a highlevel and model row
and then the "metadata" block comes from the `highlevel_meta` table
we make the dumps in a way that you can uncompress them on top of each other
so if highlevel-dump-1.tar.bz2 has mbid-1, mbic-1, mbid-2 and hl-dump-2.tar.bz2 has mbid-3, mbic-2, then they all go in the right place
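(A sketch of the ordinal-numbering scheme just described: file names continue counting from previous dumps, so archives can be untarred on top of each other without clobbering earlier submissions. The function and its arguments are illustrative, not AB's actual code.)

```python
from collections import defaultdict

def assign_dump_filenames(submissions, already_dumped):
    """Assign mbid-<n>.json names that continue from prior dumps.

    submissions: mbids in this dump window, in submission order.
    already_dumped: mapping of mbid -> how many submissions of that mbid
    appeared in earlier dumps, so a newly submitted row gets -3 if -1
    and -2 were already dumped.
    """
    counts = defaultdict(int, already_dumped)
    names = []
    for mbid in submissions:
        counts[mbid] += 1
        names.append(f"{mbid}-{counts[mbid]}.json")
    return names
```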
now, we can add a new model to AB; in this case we create new highlevel_model rows for all of the existing items in the database for this new model
what would the -n be?
if we name the file mbid-1.json for the first submission of mbid-1, and only include the data for this new model, we would overwrite the highlevel data for all other models for this mbid-1
do you follow? if not we should stop here and I'll try and explain it in another way
iliekcomputers
Ok, I think I do.
And making a new mbid-1 would be redundant
alastairp
right, we had a few solutions to this
1) make the file mbid-1-modelid.json
iliekcomputers
Would that be weird if there are too many models? Maybe add something to the JSON so that old models don't get overridden. Not sure.
alastairp
actually, I can't think of many other options
I don't think we're going to have too many models
we also wondered if this would make too many files
lots of small files
also, since we have the metadata block, where would we put it? in every file? (duplicated n times, where n = number of models)
we considered having one file per model
with a dictionary inside where each key was mbid-n
this would be neat for updates, but the first file would be huge
iliekcomputers
Hmmm
alastairp
we could split it into many files, but then the separation would be arbitrary - if you wanted to search for a specific mbid you would have to open each file and look for it - now it's easy because you construct the filename
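(Option 1 from the discussion, sketched as a path builder: one file per (mbid, submission, model), so a consumer constructs the path directly instead of scanning arbitrary per-model files. The function name and argument names are made up for illustration.)

```python
def highlevel_path(mbid: str, submission_n: int, model_id: int) -> str:
    """Build the dump path for one submission of one mbid under one
    highlevel model, e.g. <mbid>-1-<modelid>.json. Adding a new model
    later only creates new files; existing per-model files are untouched."""
    return f"{mbid}-{submission_n}-{model_id}.json"

print(highlevel_path("0dbb7b9e-8b0e-4a3b-9f2c-000000000000", 1, 7))
```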