#metabrainz

      • heyoni joined the channel
      • drsaunde joined the channel
      • UmkaDK_ has quit
      • UmkaDK joined the channel
      • UmkaDK has quit
      • UmkaDK joined the channel
      • agentsim_ joined the channel
      • agentsim has quit
      • iliekcomputers
        alastairp: question: is there a reason we use bz2 (and not lzma and pxz) for compressing the json dumps in AB?
      • drsaunde has quit
      • antlarr has quit
      • antlarr joined the channel
      • kyan joined the channel
      • github joined the channel
      • github
        [listenbrainz-server] paramsingh opened pull request #280: LB-233: Refactor testcase classes to use single path_to_data_file (master...tests-refactor) https://git.io/vdjtR
      • github has left the channel
      • github joined the channel
      • [listenbrainz-server] paramsingh opened pull request #281: LB-206: Add message that ts must be omitted from playing_now submissions (master...docs-fix) https://git.io/vdjmW
      • github has left the channel
      • alastairp
        iliekcomputers: speed/size tradeoff
      • I don't think that we specifically measured it, but lzma compression is much slower
      • bz2 is "good enough", while not taking 12h to dump and compress the database
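For illustration, a minimal sketch of the speed/size comparison being discussed here, using Python's standard bz2 and lzma modules (the input filename is hypothetical):

```python
import bz2
import lzma
import time

# Hypothetical input: any reasonably large JSON dump file.
with open("lowlevel-sample.json", "rb") as f:
    data = f.read()

for name, compress in (("bz2", bz2.compress), ("lzma", lzma.compress)):
    start = time.monotonic()
    compressed = compress(data)
    elapsed = time.monotonic() - start
    ratio = 100.0 * len(compressed) / len(data)
    print("%s: %.1f%% of original size in %.2fs" % (name, ratio, elapsed))
```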
      • iliekcomputers
        alastairp: oh, I've used lzma for the listens in LB, do you think bz2 would be more appropriate there too? Listens are also mostly JSON documents.
      • alastairp: I also wanted your thoughts on how exactly AB-117 should be implemented (if you have any)
      • BrainzBot
      • iliekcomputers
        oh wait, wrong issue
      • AB-97
      • BrainzBot
        AB-97: Provide small JSON exports, for testing https://tickets.metabrainz.org/browse/AB-97
      • iliekcomputers
        There doesn't seem to be a way to import the JSON exports yet?
      • Should we do a smaller dump in the format of the full db dump instead?
      • alastairp
        right, small dumps should probably be in both formats
      • import would be db dumps, not JSON
      • but small json dumps would also be good for people to be able to test
      • something ~100MB, perhaps
      • not sure how many files that would be? 100k?
      • not sure for LB, perhaps bz2 might be better than lzma too
      • I think we ran a speed/size test for compression at some point
      • zas
      • alastairp
        I wonder if we should do it again and document that somewhere
      • I wonder if they have any difference when you compress lots of small files or one big file
      • oh, I see - that example compresses a tarred Linux source tree, which is probably a good enough demo
      • I can't remember how big AB is, currently about 250GB I think
      • zas
        well, results may differ on json/text
      • iliekcomputers
        alastairp: :O
      • zas
        doing some measurements should be fairly easy, but be sure to measure memory used during compression (it can be an issue, especially if the process swaps to disk....)
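On Linux, zas's point about memory use can be checked roughly with the resource module; this is a sketch with the same hypothetical input file, not AB code (ru_maxrss is a peak figure, reported in kilobytes on Linux):

```python
import lzma
import resource

def peak_rss_kb():
    # Peak resident set size of this process so far (kilobytes on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Hypothetical input file.
with open("lowlevel-sample.json", "rb") as f:
    data = f.read()

before = peak_rss_kb()
lzma.compress(data, preset=9)
print("peak RSS grew by roughly %d KB during compression" % (peak_rss_kb() - before))
```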
      • iliekcomputers
        alastairp: what should I do about tables like statistics etc. that get dumped too? I assume stats would need to be recalculated after importing anyway?
      • alastairp
        are we talking about AB or LB?
      • iliekcomputers
        AB
      • Sorry :)
      • alastairp
        theoretically they should time-sync
      • have you seen how AB dumps are based on a timestamp?
      • if you import x rows up until a point in time, then the rows for stats will be the same
      • so we should dump them
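A minimal sketch of the timestamp-bounded dumping described above; the table and column names, and the dump_rows helper, are illustrative assumptions rather than the actual AB schema or code:

```python
def dump_rows(cur, table, dump_time):
    # Cut every table at the same timestamp so that a dump (including any
    # statistics rows computed before that time) is internally consistent.
    # NOTE: `table` must come from a trusted whitelist, since it is
    # interpolated into the query string.
    cur.execute(
        "SELECT * FROM " + table + " WHERE submitted <= %s ORDER BY id",
        (dump_time,),
    )
    return cur.fetchall()
```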
      • note that the reason that dumps are broken is because we used to only have a few tables for highlevel, and only 1 (or 2?) for lowlevel
      • iliekcomputers
        Oh, I didn't come across that, cool.
      • alastairp
        now lowlevel is 2 tables, and hl is about 6
      • I'd like to also make some changes to dump files so that we only have a certain maximum number of submissions per dump archive
      • so if for example we made a dump between t1 and t2
      • and there were 167000 submissions in that time
      • we could make 2 files, one with 100k, and one with 67k
      • again, for speed/ease of handling files
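A sketch of the per-archive cap on submissions; MAX_PER_ARCHIVE and chunk_ids are hypothetical names:

```python
MAX_PER_ARCHIVE = 100000

def chunk_ids(submission_ids, max_per_archive=MAX_PER_ARCHIVE):
    """Split the submissions made between t1 and t2 into groups of at most
    max_per_archive, one group per dump archive."""
    for start in range(0, len(submission_ids), max_per_archive):
        yield submission_ids[start:start + max_per_archive]

# e.g. 167000 submissions -> two archives: one with 100000 ids, one with 67000
```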
      • iliekcomputers
        So that dumps have a maximum size, makes sense.
      • alastairp
        I would rather people download 40 1gb files than 1 40gb file
      • if we do that, we could get the small dump for "free"
      • it would also go towards making sure we don't use too much memory, like zas mentioned
      • iliekcomputers
        alastairp: :O
      • That is a really cool idea
      • Leo_Verto
        depending on how often you do dumps you could also distribute them as torrents
      • alastairp
        hmm, I had some other code that i had been working towards
      • yeah, we started doing that initially
      • no one used them
      • and we have oodles of bandwidth
      • for now we should get dumps out there, and then see how much people use them
      • note that torrents doesn't solve the "this data is too big for me to install and play around with locally in a trial manner"
      • which is AB-97
      • BrainzBot
        AB-97: Provide small JSON exports, for testing https://tickets.metabrainz.org/browse/AB-97
      • alastairp
        iliekcomputers: Have you worked out how ordinal numbering of submissions works?
      • e.g. mbid-1, mbid-2, mbid-3 ?
      • ferbncode_ joined the channel
      • iliekcomputers
        Using the version table? How it keeps different submissions of the same mbid? Yes, I think 😅
      • alastairp
        no, not using the version table
      • that's to do with the version of essentia used to compute data
      • iliekcomputers
        Oh
      • alastairp
        perhaps we should rename that table
      • for multiple submissions we allow multiple versions of the same mbid to be added to the lowlevel table
      • so we don't unique on lowlevel.gid
      • assume that we have two submissions for an mbid; we need a way to access each of them, which we do by adding a query parameter on GET, or by adding -1, -2 to the JSON files when we dump them
      • but what if we dump -1, -2, then another one gets submitted, then we make another dump
      • in this dump, the new row needs to be written with -3
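A sketch of how the -n offsets could be carried across dumps: each MBID's counter continues from the highest offset written in earlier dumps. The function and argument names are hypothetical:

```python
import collections

def assign_offsets(new_rows, previous_offsets):
    """new_rows: (mbid, lowlevel_id) pairs not yet dumped, in submission order.
    previous_offsets: mbid -> highest -n written in earlier dumps.
    Yields (filename, lowlevel_id) pairs for this dump."""
    counters = collections.defaultdict(int, previous_offsets)
    for mbid, lowlevel_id in new_rows:
        counters[mbid] += 1
        yield "%s-%d.json" % (mbid, counters[mbid]), lowlevel_id
```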
      • iliekcomputers
        Right
      • alastairp
      • iliekcom- joined the channel
      • and we have something similar here https://github.com/metabrainz/acousticbrainz-se...
      • I had a thought about the dumps which I wanted to try and make faster, but looking at it again I think it's pretty good
      • but definitely for the bulk lookup in the API
      • SQL has these things called window functions: https://www.postgresql.org/docs/current/static/...
      • see the stuff about the rank() function
      • that can do the counting for us, I wonder if it's faster than the one-at-a-time lookup that we currently do
      • I started rewriting this method to use them to see if we could make it faster, but I didn't get around to testing it
      • I wonder if we should take a look at that too
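A sketch of the window-function approach; with this ordering over a unique id, rank() and row_number() give the same numbering, and the column names follow the lowlevel table but should be treated as illustrative:

```python
# Compute the per-MBID submission offset in a single pass over lowlevel,
# instead of counting rows one MBID at a time.
OFFSET_QUERY = """
    SELECT id,
           gid,
           row_number() OVER (PARTITION BY gid ORDER BY id) AS submission_offset
      FROM lowlevel
"""
```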
      • iliekcomputers
        How much time does it take to create dumps now? 😅
      • ferbncode joined the channel
      • alastairp
        to our (my) shame, no one knows
      • after we updated the schema (almost 2 years ago) we didn't fix dumps
      • and haven't made a public dump since
      • iliekcomputers
        Were datasets the new tables added then? Or is there more stuff that needs to be done?
      • Also, what would be the best way to test speed in stuff like this, run both versions of the code and time them?
      • alastairp
        yes, but probably on the full database
      • yes, datasets have been added since
      • but also the schema change added version, highlevel_meta, highlevel_model, model, and deleted highlevel_json
      • iliekcomputers
        But the new hl tables get dumped in the current code
      • alastairp
        mmm
      • do they?
      • maybe I made the changes but never tested it
      • certainly I never fixed highlevel json dumps
      • because they're more complex
      • iliekcomputers
      • alastairp
        yeah, so I see...
      • iliekcomputers
        alastairp: so for new dumps, we'd need to see if the new hl tables get dumped properly and start dumping datasets
      • alastairp
        yes
      • and we need to fix hl json
      • let me explain
      • see each of the keys in the "highlevel" block?
      • each key is a row in `model`
      • iliekcomputers
        Okay
      • alastairp
        and is joined in `highlevel_model` to a highlevel and model row
      • and then the "metadata" block comes from the `highlevel_meta` table
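A rough sketch of how one highlevel document could be assembled from those tables; the column names are assumptions based on the description above, not the exact AB code:

```python
def build_highlevel_doc(cur, highlevel_id):
    doc = {"highlevel": {}, "metadata": None}

    # The "metadata" block comes from highlevel_meta.
    cur.execute("SELECT data FROM highlevel_meta WHERE id = %s", (highlevel_id,))
    row = cur.fetchone()
    if row:
        doc["metadata"] = row[0]

    # Each key in the "highlevel" block is a row in model, joined through
    # highlevel_model to this highlevel row.
    cur.execute(
        """SELECT model.model, highlevel_model.data
             FROM highlevel_model
             JOIN model ON model.id = highlevel_model.model
            WHERE highlevel_model.highlevel = %s""",
        (highlevel_id,),
    )
    for model_name, data in cur.fetchall():
        doc["highlevel"][model_name] = data
    return doc
```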
      • we make the dumps in a way that you can uncompress them on top of each other
      • so if highlevel-dump-1.tar.bz2 has mbid-1, mbic-1, mbid-2 and hl-dump-2.tar.bz2 has mbid-3, mbic-2, then they all go in the right place
      • now, we can add a new model to AB; in this case we create new highlevel_model rows for all of the existing items in the database for this new model
      • what would the -n be?
      • if we name the file mbid-1.json for the first submission of mbid-1, and only include the data for this new model, we would overwrite the highlevel data for all the other models for this mbid-1
      • do you follow? if not we should stop here and I'll try and explain it in another way
      • iliekcomputers
        Ok, I think I do.
      • And making a new mbid-1 would be redundant
      • alastairp
        right, we had a few solutions to this
      • 1) make the file mbid-1-modelid.json
      • iliekcomputers
        Would be weird if there are too many models? Maybe add something to the JSON so that old models don't get overwritten. Not sure.
      • alastairp
        actually, I can't think of many other options
      • I don't think we're going to have too many models
      • we also wondered if this would make too many files
      • lots of small files
      • also, since we have the metadata block, where would we put it? in every file? (duplicated n times where n = number of models)
      • we considered having one file per model
      • with a dictionary inside where each key was mbid-n
      • this would be neat for updates, but the first file would be huge
      • iliekcomputers
        Hmmm
      • alastairp
      • we could split it into many files, but then the separation would be arbitrary - if you wanted to find a specific mbid you would have to open each file and look for it - right now it's easy because you can construct the filename
      • in fact, I just thought of another solution