alastairp: question: is there a reason we use bz2 (and not lzma and pxz) for compressing the json dumps in AB?
2017-10-24 29721, 2017
drsaunde has quit
2017-10-24 29727, 2017
antlarr has quit
2017-10-24 29738, 2017
antlarr joined the channel
2017-10-24 29716, 2017
kyan joined the channel
2017-10-24 29717, 2017
github joined the channel
2017-10-24 29717, 2017
github
[listenbrainz-server] paramsingh opened pull request #280: LB-233: Refactor testcase classes to use single path_to_data_file (master...tests-refactor) https://git.io/vdjtR
2017-10-24 29717, 2017
github has left the channel
2017-10-24 29723, 2017
github joined the channel
2017-10-24 29723, 2017
github
[listenbrainz-server] paramsingh opened pull request #281: LB-206: Add message that ts must be omitted from playing_now submissions (master...docs-fix) https://git.io/vdjmW
2017-10-24 29723, 2017
github has left the channel
2017-10-24 29736, 2017
alastairp
iliekcomputers: speed/size tradeoff
2017-10-24 29750, 2017
alastairp
I don't think that we specifically measured it, but lzma compression is much slower
2017-10-24 29712, 2017
alastairp
bz2 is "good enough", while not taking 12h to dump and compress the database
2017-10-24 29739, 2017
iliekcomputers
alastairp: oh, I've used lzma for the listens in LB, do you think bz2 would be more appropriate there too? Listens are also mostly JSON documents.
2017-10-24 29741, 2017
iliekcomputers
alastairp: I also wanted your thoughts on how exactly AB-117 should be implemented (if you have any)
I wonder if we should do it again and document that somewhere
2017-10-24 29743, 2017
alastairp
I wonder if they have any difference when you compress lots of small files or one big file
2017-10-24 29758, 2017
alastairp
oh, I see - that example is compressing a linux tarball, that's probably a good enough demo
2017-10-24 29710, 2017
alastairp
I can't remember how big AB is, currently about 250GB I think
2017-10-24 29725, 2017
zas
well, results may differ on json/text
2017-10-24 29745, 2017
iliekcomputers
alastairp: :O
2017-10-24 29715, 2017
zas
doing some measurements should be fairly easy, but be sure to measure memory used during compression (it can be an issue, especially if the process swaps to disk....)
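A rough sketch of the measurement zas suggests, not something that was actually run here: it compares bz2 against lzma/xz on one file and records compressed size, wall time, and peak RSS. The input filename is a placeholder, and the `resource` module makes it Linux/Unix only.

```python
import bz2
import lzma
import resource
import time

INPUT = "lowlevel-sample.json"  # hypothetical sample dump file


def measure(name, compress):
    with open(INPUT, "rb") as f:
        data = f.read()
    start = time.monotonic()
    compressed = compress(data)
    elapsed = time.monotonic() - start
    # ru_maxrss is in KiB on Linux and is the peak for the whole process,
    # so run each codec in its own process for a fair memory comparison.
    peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    print(f"{name}: {len(compressed) / len(data):.1%} of original, "
          f"{elapsed:.1f}s, peak RSS ~{peak_mib:.0f} MiB")


measure("bz2 level 9", lambda d: bz2.compress(d, compresslevel=9))
measure("xz preset 6", lambda d: lzma.compress(d, preset=6))
```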
2017-10-24 29702, 2017
iliekcomputers
alastairp: what should I do about tables like statistics etc that get dumped too? I assume stats would need to be calculated again after importing anyways?
2017-10-24 29721, 2017
alastairp
are we talking about AB or LB?
2017-10-24 29747, 2017
iliekcomputers
AB
2017-10-24 29756, 2017
iliekcomputers
Sorry :)
2017-10-24 29703, 2017
alastairp
theoretically they should time-sync
2017-10-24 29713, 2017
alastairp
have you seen how AB dumps are based on a timestamp
2017-10-24 29727, 2017
alastairp
if you import x rows up until a point in time, then the rows for stats will be the same
2017-10-24 29735, 2017
alastairp
so we should dump them
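A minimal sketch of that idea, assuming illustrative table and column names rather than the real AB schema: every table in the dump is filtered by the same cut-off timestamp, so the statistics rows in the archive match the submission rows they were computed from.

```python
import psycopg2


def dump_rows(conn, cutoff):
    """Yield (table, row) pairs for everything submitted up to the cut-off."""
    queries = {
        "lowlevel": "SELECT * FROM lowlevel WHERE submitted <= %s",
        "statistics": "SELECT * FROM statistics WHERE collected <= %s",
    }
    with conn.cursor() as cur:
        for table, query in queries.items():
            cur.execute(query, (cutoff,))
            for row in cur:
                yield table, row


# usage sketch:
# conn = psycopg2.connect("dbname=acousticbrainz")
# for table, row in dump_rows(conn, "2017-10-24 00:00:00"):
#     ...
```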
2017-10-24 29703, 2017
alastairp
note that the reason the dumps are broken is that we used to only have a few tables for highlevel, and only 1 (or 2?) for lowlevel
2017-10-24 29704, 2017
iliekcomputers
Oh, I didn't come across that, cool.
2017-10-24 29716, 2017
alastairp
now lowlevel is 2 tables, and hl is about 6
2017-10-24 29709, 2017
alastairp
I'd like to also make some changes to dump files so that we only have a certain maximum number of submissions per dump archive
2017-10-24 29723, 2017
alastairp
so if for example we made a dump between t1 and t2
2017-10-24 29735, 2017
alastairp
and there were 167000 submissions in that time
2017-10-24 29745, 2017
alastairp
we could make 2 files, one with 100k, and one with 67k
2017-10-24 29703, 2017
alastairp
again, for speed/ease of handling files
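A small sketch of that splitting, with hypothetical archive names: the submission ids for the t1-t2 window are sliced into fixed-size chunks, so 167k submissions become one archive of 100k and one of 67k.

```python
MAX_PER_ARCHIVE = 100_000


def chunks(submission_ids, size=MAX_PER_ARCHIVE):
    for start in range(0, len(submission_ids), size):
        yield submission_ids[start:start + size]


# 167000 submissions between t1 and t2 -> archives of 100000 and 67000
ids = list(range(167_000))
for n, chunk in enumerate(chunks(ids), start=1):
    print(f"acousticbrainz-dump-{n}.tar.bz2", len(chunk))
```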
2017-10-24 29713, 2017
iliekcomputers
So that dumps have a maximum size, makes sense.
2017-10-24 29713, 2017
alastairp
I would rather people download 40 1GB files than 1 40GB file
2017-10-24 29734, 2017
alastairp
if we do that, we could get the small dump for "free"
2017-10-24 29752, 2017
alastairp
it would also go towards making sure we don't use too much memory, like zas mentioned
2017-10-24 29755, 2017
iliekcomputers
alastairp: :O
2017-10-24 29710, 2017
iliekcomputers
That is a really cool idea
2017-10-24 29757, 2017
Leo_Verto
depending on how often you do dumps you could also distribute them as torrents
2017-10-24 29757, 2017
alastairp
hmm, I had some other code that i had been working towards
2017-10-24 29706, 2017
alastairp
yeah, we started doing that initially
2017-10-24 29709, 2017
alastairp
no one used them
2017-10-24 29717, 2017
alastairp
and we have oodles of bandwidth
2017-10-24 29729, 2017
alastairp
for now we should get dumps out there, and then see how much people use them
2017-10-24 29702, 2017
alastairp
note that torrents don't solve the "this data is too big for me to install and play around with locally in a trial manner" problem
iliekcomputers: Have you worked out how ordinal numbering of submissions works?
2017-10-24 29740, 2017
alastairp
e.g. mbid-1, mbid-2, mbid-3 ?
2017-10-24 29716, 2017
ferbncode_ joined the channel
2017-10-24 29720, 2017
iliekcomputers
Using the version table? How it keeps different submissions of the same mbid? Yes, I think😅
2017-10-24 29701, 2017
alastairp
no, not using the version table
2017-10-24 29745, 2017
alastairp
that's to do with the version of essentia used to compute data
2017-10-24 29702, 2017
iliekcomputers
Oh
2017-10-24 29718, 2017
alastairp
perhaps we should rename that table
2017-10-24 29724, 2017
alastairp
for multiple submissions we allow multiple versions of the same mbid to be added to the lowlevel table
2017-10-24 29732, 2017
alastairp
so we don't unique on lowlevel.gid
2017-10-24 29726, 2017
alastairp
assume that we have two submissions for an mbid; we need a way to access each of them, we do that by adding a query parameter on GET, or by adding -1, -2 to the json files when we dump them
2017-10-24 29748, 2017
alastairp
but what if we dump -1, -2, then another one gets submitted, then we make another dump
2017-10-24 29701, 2017
alastairp
in this dump, the new row needs to be written with -3
and is joined in `highlevel_model` to a highlevel and model row
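A hedged sketch of the numbering problem with made-up helper names: the -n suffix is the submission's ordinal position among all rows for that MBID, so an incremental dump has to continue counting from whatever earlier dumps already contained.

```python
from collections import defaultdict


def number_submissions(rows, already_dumped):
    """rows: (mbid, row_id) pairs in submission order.
    already_dumped: {mbid: submissions contained in previous dumps}."""
    counters = defaultdict(int, already_dumped)
    for mbid, row_id in rows:
        counters[mbid] += 1
        yield row_id, f"{mbid}-{counters[mbid]}.json"


# two submissions of this mbid were already dumped, so the new row gets -3
new_rows = [("0dbd8b9f-example-mbid", 4242)]
print(list(number_submissions(new_rows, {"0dbd8b9f-example-mbid": 2})))
# -> [(4242, '0dbd8b9f-example-mbid-3.json')]
```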
2017-10-24 29729, 2017
alastairp
and then the "metadata" block comes from the `highlevel_meta` table
2017-10-24 29743, 2017
alastairp
we make the dumps in a way that you can uncompress them on top of each other
2017-10-24 29724, 2017
alastairp
so if highlevel-dump-1.tar.bz2 has mbidA-1, mbidB-1, mbidA-2 and hl-dump-2.tar.bz2 has mbidA-3, mbidB-2, then they all go in the right place
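A minimal illustration of "uncompress them on top of each other", with placeholder archive and directory names: a given submission always maps to the same member path, so extracting the later archive into the same directory only adds the new files.

```python
import tarfile

for dump in ("highlevel-dump-1.tar.bz2", "hl-dump-2.tar.bz2"):
    with tarfile.open(dump, "r:bz2") as tar:
        tar.extractall("acousticbrainz-highlevel")  # same target directory each time
```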
2017-10-24 29701, 2017
alastairp
now, we can add a new model to AB; in this case we create new highlevel_model rows for all of the existing items in the database for this new model
2017-10-24 29705, 2017
alastairp
what would the -n be?
2017-10-24 29711, 2017
alastairp
if we name the file mbid-1.json for the first submission of this mbid, and only include the data for this new model, extracting it would overwrite the highlevel data for all the other models for mbid-1
2017-10-24 29725, 2017
alastairp
do you follow? if not we should stop here and I'll try and explain it in another way
2017-10-24 29712, 2017
iliekcomputers
Ok, I think I do.
2017-10-24 29751, 2017
iliekcomputers
And making a new mbid-1 would be redundant
2017-10-24 29700, 2017
alastairp
right, we had a few solutions to this
2017-10-24 29713, 2017
alastairp
1) make the file mbid-1-modelid.json
2017-10-24 29718, 2017
iliekcomputers
Would be weird if there are too many models? Maybe add something to the JSON so that old models don't get overwritten. Not sure.
2017-10-24 29730, 2017
alastairp
actually, I can't think of many other options
2017-10-24 29747, 2017
alastairp
I don't think we're going to have too many models
2017-10-24 29755, 2017
alastairp
we also wondered if this would make too many files
2017-10-24 29705, 2017
alastairp
lots of small files
2017-10-24 29749, 2017
alastairp
also, since we have the metadata block, where would we put it? in every file? (duplicated n times where n = number of models)
2017-10-24 29756, 2017
alastairp
we considered having one file per model
2017-10-24 29705, 2017
alastairp
with a dictionary inside where each key was mbid-n
2017-10-24 29714, 2017
alastairp
this would be neat for updates, but the first file would be huge
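A hedged sketch contrasting the two layouts, using made-up MBIDs, model names, and values: option 1 keeps one small file per submission and model, so adding a model never touches existing files; option 2 keeps one document per model keyed by "mbid-n", which is easy to update but whose first dump would contain every submission.

```python
# option 1: one file per (mbid, submission offset, model)
def per_model_filename(mbid, offset, model):
    return f"{mbid}-{offset}-{model}.json"


print(per_model_filename("0dbd8b9f-example-mbid", 1, "genre_rosamerica"))
# -> 0dbd8b9f-example-mbid-1-genre_rosamerica.json

# option 2: one document per model, keyed by "mbid-n"
genre_rosamerica_dump = {
    "0dbd8b9f-example-mbid-1": {"value": "roc", "probability": 0.61},
    "0dbd8b9f-example-mbid-2": {"value": "jaz", "probability": 0.55},
}
```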
2017-10-24 29738, 2017
iliekcomputers
Hmmm
2017-10-24 29754, 2017
alastairp
we could split it into many files, but then the separation would be arbitrary - if you wanted to search for a specific mbid you would have to open each file and look for it - now it's easy because you can just construct the filename