alastairp: question: is there a reason we use bz2 (and not lzma and pxz) for compressing the json dumps in AB?
2017-10-24 29721, 2017
drsaunde has quit
2017-10-24 29727, 2017
antlarr has quit
2017-10-24 29738, 2017
antlarr joined the channel
2017-10-24 29716, 2017
kyan joined the channel
2017-10-24 29717, 2017
github joined the channel
2017-10-24 29717, 2017
github
[listenbrainz-server] paramsingh opened pull request #280: LB-233: Refactor testcase classes to use single path_to_data_file (master...tests-refactor) https://git.io/vdjtR
2017-10-24 29717, 2017
github has left the channel
2017-10-24 29723, 2017
github joined the channel
2017-10-24 29723, 2017
github
[listenbrainz-server] paramsingh opened pull request #281: LB-206: Add message that ts must be omitted from playing_now submissions (master...docs-fix) https://git.io/vdjmW
2017-10-24 29723, 2017
github has left the channel
2017-10-24 29736, 2017
alastairp
iliekcomputers: speed/size tradeoff
2017-10-24 29750, 2017
alastairp
I don't think that we specifically measured it, but lzma compression is much slower
2017-10-24 29712, 2017
alastairp
bz2 is "good enough", while not taking 12h to dump and compress the database
2017-10-24 29739, 2017
iliekcomputers
alastairp: oh, I've used lzma for the listens in LB, do you think bz2 would be more appropriate there too? Listens are also mostly JSON documents.
2017-10-24 29741, 2017
iliekcomputers
alastairp: I also wanted your thoughts on how exactly AB-117 should be implemented (if you have any)
I wonder if we should do it again and document that somewhere
2017-10-24 29743, 2017
alastairp
I wonder if they have any difference when you compress lots of small files or one big file
2017-10-24 29758, 2017
alastairp
oh, I see - that example is compressing a linux tarball, that's probably a good enough demo
2017-10-24 29710, 2017
alastairp
I can't remember how big AB is, currently about 250GB I think
2017-10-24 29725, 2017
zas
well, results may differ on json/text
2017-10-24 29745, 2017
iliekcomputers
alastairp: :O
2017-10-24 29715, 2017
zas
doing some measurements should be fairly easy, but be sure to measure memory used during compression (it can be an issue, especially if the process swaps to disk....)
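A rough sketch of the measurement zas suggests, not something that was actually run here: it compares bz2 against lzma/xz on one file and records compressed size, wall time, and peak RSS. The input filename is a placeholder, and the `resource` module makes it Linux/Unix only.

```python
import bz2
import lzma
import resource
import time

INPUT = "lowlevel-sample.json"  # hypothetical sample dump file


def measure(name, compress):
    with open(INPUT, "rb") as f:
        data = f.read()
    start = time.monotonic()
    compressed = compress(data)
    elapsed = time.monotonic() - start
    # ru_maxrss is in KiB on Linux and is the peak for the whole process,
    # so run each codec in its own process for a fair memory comparison.
    peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    print(f"{name}: {len(compressed) / len(data):.1%} of original, "
          f"{elapsed:.1f}s, peak RSS ~{peak_mib:.0f} MiB")


measure("bz2 level 9", lambda d: bz2.compress(d, compresslevel=9))
measure("xz preset 6", lambda d: lzma.compress(d, preset=6))
```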
2017-10-24 29702, 2017
iliekcomputers
alastairp: what should I do about tables like statistics etc that get dumped too? I assume stats would need to be calculated again after importing anyways?
2017-10-24 29721, 2017
alastairp
are we talking about AB or LB?
2017-10-24 29747, 2017
iliekcomputers
AB
2017-10-24 29756, 2017
iliekcomputers
Sorry :)
2017-10-24 29703, 2017
alastairp
theoretically they should time-sync
2017-10-24 29713, 2017
alastairp
have you seen how AB dumps are based on a timestamp
2017-10-24 29727, 2017
alastairp
if you import x rows up until a point in time, then the rows for stats will be the same
2017-10-24 29735, 2017
alastairp
so we should dump them
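A minimal sketch of that idea, assuming illustrative table and column names rather than the real AB schema: every table in the dump is filtered by the same cut-off timestamp, so the statistics rows in the archive match the submission rows they were computed from.

```python
import psycopg2


def dump_rows(conn, cutoff):
    """Yield (table, row) pairs for everything submitted up to the cut-off."""
    queries = {
        "lowlevel": "SELECT * FROM lowlevel WHERE submitted <= %s",
        "statistics": "SELECT * FROM statistics WHERE collected <= %s",
    }
    with conn.cursor() as cur:
        for table, query in queries.items():
            cur.execute(query, (cutoff,))
            for row in cur:
                yield table, row


# usage sketch:
# conn = psycopg2.connect("dbname=acousticbrainz")
# for table, row in dump_rows(conn, "2017-10-24 00:00:00"):
#     ...
```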
2017-10-24 29703, 2017
alastairp
note that the reason the dumps are broken is that we used to only have a few tables for highlevel, and only 1 (or 2?) for lowlevel
2017-10-24 29704, 2017
iliekcomputers
Oh, I didn't come across that, cool.
2017-10-24 29716, 2017
alastairp
now lowlevel is 2 tables, and hl is about 6
2017-10-24 29709, 2017
alastairp
I'd like to also make some changes to dump files so that we only have a certain maximum number of submissions per dump archive
2017-10-24 29723, 2017
alastairp
so if for example we made a dump between t1 and t2
2017-10-24 29735, 2017
alastairp
and there were 167000 submissions in that time
2017-10-24 29745, 2017
alastairp
we could make 2 files, one with 100k, and one with 67k
2017-10-24 29703, 2017
alastairp
again, for speed/ease of handling files
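A small sketch of that splitting, with hypothetical archive names: the submission ids for the t1-t2 window are sliced into fixed-size chunks, so 167k submissions become one archive of 100k and one of 67k.

```python
MAX_PER_ARCHIVE = 100_000


def chunks(submission_ids, size=MAX_PER_ARCHIVE):
    for start in range(0, len(submission_ids), size):
        yield submission_ids[start:start + size]


# 167000 submissions between t1 and t2 -> archives of 100000 and 67000
ids = list(range(167_000))
for n, chunk in enumerate(chunks(ids), start=1):
    print(f"acousticbrainz-dump-{n}.tar.bz2", len(chunk))
```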
2017-10-24 29713, 2017
iliekcomputers
So that dumps have a maximum size, makes sense.
2017-10-24 29713, 2017
alastairp
I would rather people download 40 1GB files than 1 40GB file
2017-10-24 29734, 2017
alastairp
if we do that, we could get the small dump for "free"
2017-10-24 29752, 2017
alastairp
it would also go towards making sure we don't use too much memory, like zas mentioned
2017-10-24 29755, 2017
iliekcomputers
alastairp: :O
2017-10-24 29710, 2017
iliekcomputers
That is a really cool idea
2017-10-24 29757, 2017
Leo_Verto
depending on how often you do dumps you could also distribute them as torrents
2017-10-24 29757, 2017
alastairp
hmm, I had some other code that i had been working towards
2017-10-24 29706, 2017
alastairp
yeah, we started doing that initially
2017-10-24 29709, 2017
alastairp
no one used them
2017-10-24 29717, 2017
alastairp
and we have oodles of bandwidth
2017-10-24 29729, 2017
alastairp
for now we should get dumps out there, and then see how much people use them
2017-10-24 29702, 2017
alastairp
note that torrents don't solve the "this data is too big for me to install and play around with locally in a trial manner" problem
iliekcomputers: Have you worked out how ordinal numbering of submissions works?
2017-10-24 29740, 2017
alastairp
e.g. mbid-1, mbid-2, mbid-3 ?
2017-10-24 29716, 2017
ferbncode_ joined the channel
2017-10-24 29720, 2017
iliekcomputers
Using the version table? How it keeps different submissions of the same mbid? Yes, I think😅
2017-10-24 29701, 2017
alastairp
no, not using the version table
2017-10-24 29745, 2017
alastairp
that's to do with the version of essentia used to compute data
2017-10-24 29702, 2017
iliekcomputers
Oh
2017-10-24 29718, 2017
alastairp
perhaps we should rename that table
2017-10-24 29724, 2017
alastairp
for multiple submissions we allow multiple versions of the same mbid to be added to the lowlevel table
2017-10-24 29732, 2017
alastairp
so we don't unique on lowlevel.gid
2017-10-24 29726, 2017
alastairp
assume that we have two submissions for an mbid; we need a way to access each of them, we do that by adding a query parameter on GET, or by adding -1, -2 to the json files when we dump them
2017-10-24 29748, 2017
alastairp
but what if we dump -1, -2, then another one gets submitted, then we make another dump
2017-10-24 29701, 2017
alastairp
in this dump, the new row needs to be written with -3
and is joined in `highlevel_model` to a highlevel and model row
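A hedged sketch of the numbering problem with made-up helper names: the -n suffix is the submission's ordinal position among all rows for that MBID, so an incremental dump has to continue counting from whatever earlier dumps already contained.

```python
from collections import defaultdict


def number_submissions(rows, already_dumped):
    """rows: (mbid, row_id) pairs in submission order.
    already_dumped: {mbid: submissions contained in previous dumps}."""
    counters = defaultdict(int, already_dumped)
    for mbid, row_id in rows:
        counters[mbid] += 1
        yield row_id, f"{mbid}-{counters[mbid]}.json"


# two submissions of this mbid were already dumped, so the new row gets -3
new_rows = [("0dbd8b9f-example-mbid", 4242)]
print(list(number_submissions(new_rows, {"0dbd8b9f-example-mbid": 2})))
# -> [(4242, '0dbd8b9f-example-mbid-3.json')]
```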
2017-10-24 29729, 2017
alastairp
and then the "metadata" block comes from the `highlevel_meta` table
2017-10-24 29743, 2017
alastairp
we make the dumps in a way that you can uncompress them on top of each other
2017-10-24 29724, 2017
alastairp
so if highlevel-dump-1.tar.bz2 has mbidA-1, mbidB-1, mbidA-2 and hl-dump-2.tar.bz2 has mbidA-3, mbidB-2, then they all go in the right place
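A minimal illustration of "uncompress them on top of each other", with placeholder archive and directory names: a given submission always maps to the same member path, so extracting the later archive into the same directory only adds the new files.

```python
import tarfile

for dump in ("highlevel-dump-1.tar.bz2", "hl-dump-2.tar.bz2"):
    with tarfile.open(dump, "r:bz2") as tar:
        tar.extractall("acousticbrainz-highlevel")  # same target directory each time
```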
2017-10-24 29701, 2017
alastairp
now, we can add a new model to AB; in this case we create new highlevel_model rows for all of the existing items in the database for this new model
2017-10-24 29705, 2017
alastairp
what would the -n be?
2017-10-24 29711, 2017
alastairp
if we name the file mbid-1.json for the first submission of this mbid, and only include the data for this new model, extracting it would overwrite the highlevel data for all the other models for mbid-1
2017-10-24 29725, 2017
alastairp
do you follow? if not we should stop here and I'll try and explain it in another way
2017-10-24 29712, 2017
iliekcomputers
Ok, I think I do.
2017-10-24 29751, 2017
iliekcomputers
And making a new mbid-1 would be redundant
2017-10-24 29700, 2017
alastairp
right, we had a few solutions to this
2017-10-24 29713, 2017
alastairp
1) make the file mbid-1-modelid.json
2017-10-24 29718, 2017
iliekcomputers
Would be weird if there are too many models? Maybe add something to the JSON so that old models don't get overwritten. Not sure.
2017-10-24 29730, 2017
alastairp
actually, I can't think of many other options
2017-10-24 29747, 2017
alastairp
I don't think we're going to have too many models
2017-10-24 29755, 2017
alastairp
we also wondered if this would make too many files
2017-10-24 29705, 2017
alastairp
lots of small files
2017-10-24 29749, 2017
alastairp
also, since we have the metadata block, where would we put it? in every file? (duplicated n times where n = number of models)
2017-10-24 29756, 2017
alastairp
we considered having one file per model
2017-10-24 29705, 2017
alastairp
with a dictionary inside where each key was mbid-n
2017-10-24 29714, 2017
alastairp
this would be neat for updates, but the first file would be huge
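A hedged sketch contrasting the two layouts, using made-up MBIDs, model names, and values: option 1 keeps one small file per submission and model, so adding a model never touches existing files; option 2 keeps one document per model keyed by "mbid-n", which is easy to update but whose first dump would contain every submission.

```python
# option 1: one file per (mbid, submission offset, model)
def per_model_filename(mbid, offset, model):
    return f"{mbid}-{offset}-{model}.json"


print(per_model_filename("0dbd8b9f-example-mbid", 1, "genre_rosamerica"))
# -> 0dbd8b9f-example-mbid-1-genre_rosamerica.json

# option 2: one document per model, keyed by "mbid-n"
genre_rosamerica_dump = {
    "0dbd8b9f-example-mbid-1": {"value": "roc", "probability": 0.61},
    "0dbd8b9f-example-mbid-2": {"value": "jaz", "probability": 0.55},
}
```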
2017-10-24 29738, 2017
iliekcomputers
Hmmm
2017-10-24 29754, 2017
alastairp
we could split it into many files, but then the separation would be arbitrary - if you wanted to search for a specific mbid you would have to open each file and look for it - now it's easy because you can just construct the filename