#metabrainz

      • heyoni joined the channel
      • drsaunde joined the channel
      • UmkaDK_ has quit
      • UmkaDK joined the channel
      • UmkaDK has quit
      • UmkaDK joined the channel
      • agentsim_ joined the channel
      • agentsim has quit
      • iliekcomputers
        alastairp: question: is there a reason we use bz2 (and not lzma and pxz) for compressing the json dumps in AB?
      • drsaunde has quit
      • antlarr has quit
      • antlarr joined the channel
      • kyan joined the channel
      • github joined the channel
      • github
        [listenbrainz-server] paramsingh opened pull request #280: LB-233: Refactor testcase classes to use single path_to_data_file (master...tests-refactor) https://git.io/vdjtR
      • github has left the channel
      • github joined the channel
      • [listenbrainz-server] paramsingh opened pull request #281: LB-206: Add message that ts must be omitted from playing_now submissions (master...docs-fix) https://git.io/vdjmW
      • github has left the channel
      • alastairp
        iliekcomputers: speed/size tradeoff
      • I don't think that we specifically measured it, but lzma compression is much slower
      • bz2 is "good enough", while not taking 12h to dump and compress the database
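For illustration, a minimal sketch of the speed/size comparison being discussed here, using Python's standard bz2 and lzma modules (the input filename is hypothetical):

```python
import bz2
import lzma
import time

# Hypothetical input: any reasonably large JSON dump file.
with open("lowlevel-sample.json", "rb") as f:
    data = f.read()

for name, compress in (("bz2", bz2.compress), ("lzma", lzma.compress)):
    start = time.monotonic()
    compressed = compress(data)
    elapsed = time.monotonic() - start
    ratio = 100.0 * len(compressed) / len(data)
    print("%s: %.1f%% of original size in %.2fs" % (name, ratio, elapsed))
```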
      • iliekcomputers
        alastairp: oh, I've used lzma for the listens in LB, do you think bz2 would be more appropriate there too? Listens are also mostly JSON documents.
      • alastairp: I also wanted your thoughts on how exactly AB-117 should be implemented (if you have any)
      • BrainzBot
      • iliekcomputers
        oh wait, wrong issue
      • AB-97
      • BrainzBot
        AB-97: Provide small JSON exports, for testing https://tickets.metabrainz.org/browse/AB-97
      • iliekcomputers
        There doesn't seem to be a way to import the JSON exports yet?
      • Should we do a smaller dump in the format of the full db dump instead?
      • alastairp
        right, small dumps should probably be in both formats
      • import would be db dumps, not JSON
      • but small json dumps would also be good for people to be able to test
      • something ~100MB, perhaps
      • not sure how many files that would be? 100k?
      • not sure for LB, perhaps bz2 might be better than lzma too
      • I think we ran a speed/size test for compression at some point
      • zas
      • alastairp
        I wonder if we should do it again and document that somewhere
      • I wonder if they have any difference when you compress lots of small files or one big file
      • oh, I see - that example compresses a tarred Linux source tree, which is probably a good enough demo
      • I can't remember how big AB is, currently about 250GB I think
      • zas
        well, results may differ on json/text
      • iliekcomputers
        alastairp: :O
      • zas
        doing some measurements should be fairly easy, but be sure to measure memory used during compression (it can be an issue, especially if the process swaps to disk....)
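On Linux, zas's point about memory use can be checked roughly with the resource module; this is a sketch with the same hypothetical input file, not AB code (ru_maxrss is a peak figure, reported in kilobytes on Linux):

```python
import lzma
import resource

def peak_rss_kb():
    # Peak resident set size of this process so far (kilobytes on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Hypothetical input file.
with open("lowlevel-sample.json", "rb") as f:
    data = f.read()

before = peak_rss_kb()
lzma.compress(data, preset=9)
print("peak RSS grew by roughly %d KB during compression" % (peak_rss_kb() - before))
```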
      • iliekcomputers
        alastairp: what should I do about tables like statistics etc. that get dumped too? I assume stats would need to be recalculated after importing anyway?
      • alastairp
        are we talking about AB or LB?
      • iliekcomputers
        AB
      • Sorry :)
      • alastairp
        theoretically they should time-sync
      • have you seen how AB dumps are based on a timestamp?
      • if you import x rows up until a point in time, then the rows for stats will be the same
      • so we should dump them
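A minimal sketch of the timestamp-bounded dumping described above; the table and column names, and the dump_rows helper, are illustrative assumptions rather than the actual AB schema or code:

```python
def dump_rows(cur, table, dump_time):
    # Cut every table at the same timestamp so that a dump (including any
    # statistics rows computed before that time) is internally consistent.
    # NOTE: `table` must come from a trusted whitelist, since it is
    # interpolated into the query string.
    cur.execute(
        "SELECT * FROM " + table + " WHERE submitted <= %s ORDER BY id",
        (dump_time,),
    )
    return cur.fetchall()
```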
      • note that the reason that dumps are broken is because we used to only have a few tables for highlevel, and only 1 (or 2?) for lowlevel
      • iliekcomputers
        Oh, I didn't come across that, cool.
      • alastairp
        now lowlevel is 2 tables, and hl is about 6
      • I'd like to also make some changes to dump files so that we only have a certain maximum number of submissions per dump archive
      • so if for example we made a dump between t1 and t2
      • and there were 167000 submissions in that time
      • we could make 2 files, one with 100k, and one with 67k
      • again, for speed/ease of handling files
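A sketch of the per-archive cap on submissions; MAX_PER_ARCHIVE and chunk_ids are hypothetical names:

```python
MAX_PER_ARCHIVE = 100000

def chunk_ids(submission_ids, max_per_archive=MAX_PER_ARCHIVE):
    """Split the submissions made between t1 and t2 into groups of at most
    max_per_archive, one group per dump archive."""
    for start in range(0, len(submission_ids), max_per_archive):
        yield submission_ids[start:start + max_per_archive]

# e.g. 167000 submissions -> two archives: one with 100000 ids, one with 67000
```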
      • iliekcomputers
        So that dumps have a maximum size, makes sense.
      • alastairp
        I would rather people download 40 1gb files than 1 40gb file
      • if we do that, we could get the small dump for "free"
      • it would also go towards making sure we don't use too much memory, like zas mentioned
      • iliekcomputers
        alastairp: :O
      • That is a really cool idea
      • Leo_Verto
        depending on how often you do dumps you could also distribute them as torrents
      • alastairp
        hmm, I had some other code that i had been working towards
      • yeah, we started doing that initially
      • no one used them
      • and we have oodles of bandwidth
      • for now we should get dumps out there, and then see how much people use them
      • note that torrents doesn't solve the "this data is too big for me to install and play around with locally in a trial manner"
      • which is AB-97
      • BrainzBot
        AB-97: Provide small JSON exports, for testing https://tickets.metabrainz.org/browse/AB-97
      • alastairp
        iliekcomputers: Have you worked out how ordinal numbering of submissions works?
      • e.g. mbid-1, mbid-2, mbid-3 ?
      • ferbncode_ joined the channel
      • iliekcomputers
        Using the version table? How it keeps different submissions of the same mbid? Yes, I think 😅
      • alastairp
        no, not using the version table
      • that's to do with the version of essentia used to compute data
      • iliekcomputers
        Oh
      • alastairp
        perhaps we should rename that table
      • for multiple submissions we allow multiple versions of the same mbid to be added to the lowlevel table
      • so we don't unique on lowlevel.gid
      • assume that we have two submissions for an mbid; we need a way to access each of them, which we do by adding a query parameter on GET, or by adding -1, -2 to the JSON files when we dump them
      • but what if we dump -1, -2, then another one gets submitted, then we make another dump
      • in this dump, the new row needs to be written with -3
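A sketch of how the -n offsets could be carried across dumps: each MBID's counter continues from the highest offset written in earlier dumps. The function and argument names are hypothetical:

```python
import collections

def assign_offsets(new_rows, previous_offsets):
    """new_rows: (mbid, lowlevel_id) pairs not yet dumped, in submission order.
    previous_offsets: mbid -> highest -n written in earlier dumps.
    Yields (filename, lowlevel_id) pairs for this dump."""
    counters = collections.defaultdict(int, previous_offsets)
    for mbid, lowlevel_id in new_rows:
        counters[mbid] += 1
        yield "%s-%d.json" % (mbid, counters[mbid]), lowlevel_id
```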
      • iliekcomputers
        Right
      • alastairp
      • iliekcom- joined the channel
      • and we have something similar here https://github.com/metabrainz/acousticbrainz-se...
      • I had a thought about the dumps which I wanted to try and make faster, but looking at it again I think it's pretty good
      • but definitely for the bulk lookup in the API
      • SQL has these things called window functions: https://www.postgresql.org/docs/current/static/...
      • see the stuff about the rank() function
      • that can do the counting for us, I wonder if it's faster than the one-at-a-time lookup that we currently do
      • I started rewriting this method to use them to see if we could make it faster, but I didn't get around to testing it
      • I wonder if we should take a look at that too
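A sketch of the window-function approach; with this ordering over a unique id, rank() and row_number() give the same numbering, and the column names follow the lowlevel table but should be treated as illustrative:

```python
# Compute the per-MBID submission offset in a single pass over lowlevel,
# instead of counting rows one MBID at a time.
OFFSET_QUERY = """
    SELECT id,
           gid,
           row_number() OVER (PARTITION BY gid ORDER BY id) AS submission_offset
      FROM lowlevel
"""
```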
      • iliekcomputers
        How much time does it take to create dumps now? 😅
      • ferbncode joined the channel
      • alastairp
        to our (my) shame, no one knows
      • after we updated the schema (almost 2 years ago) we didn't fix dumps
      • and haven't made a public dump since
      • iliekcomputers
        Were datasets the new tables added then? Or is there more stuff that needs to be done?
      • Also, what would be the best way to test speed in stuff like this, run both versions of the code and time them?
      • alastairp
        yes, but probably on the full database
      • yes, datasets have been added since
      • but also the schema change added version, highlevel_meta, highlevel_model, model, and deleted highlevel_json
      • iliekcomputers
        But the new hl tables get dumped in the current code
      • alastairp
        mmm
      • do they?
      • maybe I made the changes but never tested it
      • certainly I never fixed highlevel json dumps
      • because they're more complex
      • iliekcomputers
      • alastairp
        yeah, so I see...
      • iliekcomputers
        alastairp: so for new dumps, we'd need to see if the new hl tables get dumped properly and start dumping datasets
      • alastairp
        yes
      • and we need to fix hl json
      • let me explain
      • see each of the keys in the "highlevel" block?
      • each key is a row in `model`
      • iliekcomputers
        Okay
      • alastairp
        and is joined in `highlevel_model` to a highlevel and model row
      • and then the "metadata" block comes from the `highlevel_meta` table
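A rough sketch of how one highlevel document could be assembled from those tables; the column names are assumptions based on the description above, not the exact AB code:

```python
def build_highlevel_doc(cur, highlevel_id):
    doc = {"highlevel": {}, "metadata": None}

    # The "metadata" block comes from highlevel_meta.
    cur.execute("SELECT data FROM highlevel_meta WHERE id = %s", (highlevel_id,))
    row = cur.fetchone()
    if row:
        doc["metadata"] = row[0]

    # Each key in the "highlevel" block is a row in model, joined through
    # highlevel_model to this highlevel row.
    cur.execute(
        """SELECT model.model, highlevel_model.data
             FROM highlevel_model
             JOIN model ON model.id = highlevel_model.model
            WHERE highlevel_model.highlevel = %s""",
        (highlevel_id,),
    )
    for model_name, data in cur.fetchall():
        doc["highlevel"][model_name] = data
    return doc
```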
      • we make the dumps in a way that you can uncompress them on top of each other
      • so if highlevel-dump-1.tar.bz2 has mbid-1, mbic-1, mbid-2 and hl-dump-2.tar.bz2 has mbid-3, mbic-2, then they all go in the right place
      • now, we can add a new model to AB; in this case we create new highlevel_model rows for all of the existing items in the database for this new model
      • what would the -n be?
      • if we name the file mbid-1.json for the first submission of mbid-1, and only include the data for this new model, we would overwrite the highlevel data for all the other models for this mbid-1
      • do you follow? if not we should stop here and I'll try and explain it in another way
      • iliekcomputers
        Ok, I think I do.
      • And making a new mbid-1 would be redundant
      • alastairp
        right, we had a few solutions to this
      • 1) make the file mbid-1-modelid.json
      • iliekcomputers
        Would be weird if there are too many models? Maybe add something to the JSON so that old models don't get overwritten. Not sure.
      • alastairp
        actually, I can't think of many other options
      • I don't think we're going to have too many models
      • we also wondered if this would make too many files
      • lots of small files
      • also, since we have the metadata block, where would we put it? in every file? (duplicated n times where n = number of models)
      • we considered having one file per model
      • with a dictionary inside where each key was mbid-n
      • this would be neat for updates, but the first file would be huge
      • iliekcomputers
        Hmmm
      • alastairp
      • we could split it into many files, but then the separation would be arbitrary - if you wanted to find a specific mbid you would have to open each file and look for it - right now it's easy because you can construct the filename
      • in fact, I just thought of another solution