#metabrainz

      • d4rkie joined the channel
      • d4rk-ph0enix has quit
      • santiagofn joined the channel
      • MRiddickW joined the channel
      • Pratha-Fish
        alastairp: Hi, I also checked out a few things this morning, and realized we could also use zstd with parquet for even better performance! (especially in the long run)
      • Pros:
      • - Parquet is future-proof, with great support in pandas, Spark, etc.
      • - It'll make the data ready to go for future ML and similar applications.
      • - It supports zstd as a compression codec too!
      • - Loading times are much faster, and file sizes are much smaller, especially for text data.
      • - Even export times could be lowered, since pandas has great support for parquet too
      • Here are some stats for comparison https://usercontent.irccloud-cdn.com/file/HgLYe...
      • It already performs well with gzip and snappy. I expect it to perform even better with zstd
      • darkstardevx joined the channel
      • monotux has quit
      • monotux joined the channel
      • BrainzGit
        [bookbrainz-site] tr1ten opened pull request #858 (new-creation-form…uf-edit-entity): Feat(route): Edit exisiting entity through unified form POST route https://github.com/metabrainz/bookbrainz-site/p...
      • aerozol
        alastairp: did you still have some tickets coming my way?
      • BrainzGit
        [musicbrainz-server] reosarevok merged pull request #2565 (master…MBS-12457): MBS-12457: Wrap overly long words in annotations https://github.com/metabrainz/musicbrainz-serve...
      • d4rkie has quit
      • d4rkie joined the channel
      • d4rkie has quit
      • MRiddickW has quit
      • d4rkie joined the channel
      • reosarevok
        bitmap: are you still up and dealing with test.mb?
      • (was going to test but it's 504ing)
      • d4rkie has quit
      • d4rkie joined the channel
      • bitmap
        reosarevok: I'm up but not sure what to do about test.mb, json dumps are still running
      • reosarevok
        Oh, so it's not that you're releasing an update, it's just that dumps take so much from the server test just fails to load? :D
      • If so, guess I just need to check later
      • I was thinking of putting beta out so I can start adding more genre rels
      • Anything against that? :)
      • bitmap
        sounds good to me
      • and yeah, it's been unusable all day :(
      • reosarevok
        I guess we maybe should move test away from that server then, but that means we need a dedicated server for json dumps, basically? Sounds bad
      • You said you had an idea to make them less bad :D
      • Maybe that should be the next project...
      • bitmap
        I was working on it a bit today since I got bored of writing rel editor tests
      • reosarevok
        How are you doing that btw, mostly all selenium?
      • Pratha-Fish
        alastairp: Here are some comparison charts!
      • bitmap
        reosarevok: mostly planning to write some normal JS ones where I feed some state into the reducer functions and check the output. but I'll try to add some basic Selenium ones too
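
The "feed some state into the reducer and check the output" approach can be sketched as below. This is written in Python purely for illustration (the real tests would be JS), and `EditorState`, `reducer`, and the action shapes are hypothetical, not taken from the MusicBrainz relationship editor.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EditorState:
    """Hypothetical editor state: just an immutable tuple of relationship ids."""
    relationships: tuple = ()


def reducer(state: EditorState, action: dict) -> EditorState:
    """Pure function: returns a new state and never mutates the old one."""
    if action["type"] == "add":
        return replace(state, relationships=state.relationships + (action["rel"],))
    if action["type"] == "remove":
        kept = tuple(r for r in state.relationships if r != action["rel"])
        return replace(state, relationships=kept)
    # Unknown actions leave the state untouched.
    return state


def test_add_then_remove() -> None:
    start = EditorState()
    added = reducer(start, {"type": "add", "rel": "rel-1"})
    assert added.relationships == ("rel-1",)
    # The input state is unchanged, which is what makes this easy to test.
    assert start.relationships == ()
    removed = reducer(added, {"type": "remove", "rel": "rel-1"})
    assert removed == start


test_add_then_remove()
```

Because the reducer is a pure function, no browser or Selenium session is needed for this kind of test.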
      • Pratha-Fish
        alastairp: Looks like parquet+zstd is acing every stat.
      • Excellent r/w time with the smallest storage size!
      • Also note that reading gzip > writing to txt.zst is ridiculously slow.
      • Maybe we should really shift to zst+parquet for all the data
      • reosarevok
        bitmap: sounds good, at least to begin with :)
      • I'll do the beta thing and add more genre data, and then I'll look into the search changes yvanzo requested for the genre ws things :) Maybe we can release that when he's back
      • BrainzGit
        [musicbrainz-server] reosarevok merged pull request #2555 (master…MBS-12418): MBS-12418: Also format artists in setlists if there's no MBID link https://github.com/metabrainz/musicbrainz-serve...
      • antlarr has quit
      • antlarr joined the channel
      • d4rk-ph0enix joined the channel
      • d4rkie has quit
      • alastairp
        morning
      • thanks bitmap for the comments on SQL, that's more or less what I expected
      • Pratha-Fish: thanks for the experiment, incredible to see how small and fast parquet is. you're right, this might be a good idea.
      • keep in mind that we also need a format that can be used by people no matter what programming language they use - let's have a discussion with lucifer about this, because we're saving space maybe we can distribute it in both formats
      • how are you writing csv/parquet? just with the DataFrame methods? for comparison, can you also do the same with csv+gz?
      • how many items did you write? because 22 seconds does seem like a lot, but I guess you're writing many files in a loop?
      • antlarr has quit
      • KassOtsimine has quit
      • reosarevok has quit
      • zas has quit
      • mruszczyk has quit
      • reosarevok joined the channel
      • KassOtsimine joined the channel
      • zas joined the channel
      • mruszczyk joined the channel
      • milkii has quit
      • aerozol: CB-441 and CB-442
      • BrainzBot
        CB-441: Need a way to show that an entity/review is for MB or BB https://tickets.metabrainz.org/browse/CB-441
      • CB-442: Improve layout of CB entity page https://tickets.metabrainz.org/browse/CB-442
      • milkii joined the channel
      • lucifer
        alastairp: hi! whats the data to be distributed here?
      • alastairp
        hi lucifer
      • antlarr joined the channel
      • this is the music listen histories dataset. 27 billion rows, currently distributed as csv, 1 file per user. each row is timestamp, recording mbid, artist mbid, release mbid
      • we've already obtained a 50% filesize reduction by switching from gz to zstd, but the question is: should we go further and use parquet as well, which gives additional size reductions as well as a significant speed increase when loading?
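
For reference, the per-user CSV layout described above (timestamp, recording mbid, artist mbid, release mbid) can be written and read as gzip-compressed CSV with only the Python standard library. This is a sketch under assumptions: the header row, the field names, and the sample values are made up, and the real files may have no header at all.

```python
import csv
import gzip
import io

# Assumed column names; the log only gives the column order.
FIELDS = ["timestamp", "recording_mbid", "artist_mbid", "release_mbid"]


def write_listens_csv_gz(rows) -> bytes:
    """Serialise rows as CSV and gzip-compress them in memory."""
    raw = io.BytesIO()
    with gzip.open(raw, mode="wt", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(FIELDS)
        writer.writerows(rows)
    return raw.getvalue()


def read_listens_csv_gz(blob: bytes) -> list:
    """Decompress and parse the CSV back into a list of dicts."""
    with gzip.open(io.BytesIO(blob), mode="rt", newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


# Placeholder ids, not real MBIDs.
sample = [
    (1600000000, "rec-0001", "art-0001", "rel-0001"),
    (1600000180, "rec-0002", "art-0001", "rel-0001"),
]
blob = write_listens_csv_gz(sample)
listens = read_listens_csv_gz(blob)
```

Every value comes back as a string here, which is one of the practical differences from parquet, where column types are stored in the file.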
      • lucifer
        parquet is probably better in almost all ways with the exception of being not human readable and you can't edit it in excel.
      • but 27B rows is too much data for direct human consumption anyway so i think these issues don't matter much.
      • alastairp
        yeah, right
      • so we read/write it in spark and python?
      • perhaps it'd be nice to be able to `cat file.parquet | to-csv` somehow, to be able to view the data as csv if necessary? I'm reading the internet and it seems to just suggest using pandas for this, which is ok but not great
      • lucifer
        yes. both parquet and csv have built in support in spark. in python, you need to add pyarrow manually to use parquet (csv support is builtin as you know).
      • there's duckdb and some cli parquet tools but not too great.
      • alastairp
        right
      • lucifer
      • alastairp
        oh nice, that's exactly what I was wanting
      • lucifer
        there's also a convert csv command in this fwiw.
      • but i remember facing issues with installing this tool last i tried. let me see if it works now.
      • alastairp
        lucifer: and remind me - how do we make spark dumps? I seem to recall that we dump json and then convert?
      • lucifer
        ah no, we dump parquet directly using pandas now.
      • we make 2 dumps. 1 json and 1 parquet one for spark.
      • alastairp
        oh great
      • found it
      • do we do any compression on the parquet file?
      • lucifer
        parquet uses snappy compression by default so yes
      • alastairp
      • MRiddickW joined the channel
      • thanks lucifer! we'll put parquet on the list of possibilities for the final version of the dataset, then
      • I think it makes sense. I might do a quick look and see if I can find what programming languages people used for projects that use this dataset
      • Lotheric_ joined the channel
      • Lotheric has quit
      • CatQuest
        Pratha-Fish: you forgot the cons, tho ;)
      • (not that i know them :D)
      • alastairp: your mockup on https://tickets.metabrainz.org/browse/CB-442 gets a 👍 from me, it's a clear improvement
      • BrainzBot
        CB-442: Improve layout of CB entity page
      • alastairp
        CatQuest: that's monkey's mockup, all 👍 should go to him!
      • CatQuest
        (I agree cb layout has always been a tad odd)
      • ... you were the reporter so i was confused :D
      • alastairp
        CatQuest: the only large con that we know of is what lucifer and I were just discussing - the data format is no longer text, so you need a 3rd party software library to read it
      • CatQuest
        !m mofor mockups then!
      • BrainzBot
        You're doing good work, mofor mockups then!!
      • CatQuest
        ...
      • !m monkey for mockups then!
      • BrainzBot
        You're doing good work, monkey for mockups then!!
      • alastairp
        I suspect that CB was much like every other project - a programmer makes a start and needs to lay out the data so comes up with something, but we never get around to looking at it from a design perspective
      • CatQuest
        alastairp: yeah. i was kinda joking a little with Pratha-Fish, because they said "pros:" and didn't list cons :)
      • alastairp
        there really are very few cons compared to the pros, though
      • now that we have the right people on board for design help, definitely agree that we should work to improve it
      • CatQuest
        I dunno about design. but usability from a user perspective
      • alastairp
        sure, to me that's part of design too
      • CatQuest
        i trust monkey explicitly, as he seems to also take that into account, instead of trying to be "fancy" for the sake of "design" or the sake of "just being fancy"
      • as i've seen other such people do :D
      • btw, have you seen the new entity editor that Shubh is cooking up for bb?
      • it's also a "design" improvement
      • so far I've given feedback about performance (ostensibly on old browsers) but others' input re usability/design/ui/whatever would also be useful to them i think
      • skelly37 joined the channel
      • ROpdebee has quit
      • ROpdebee joined the channel
      • lucifer
        monkey: alastairp: mayhem: we didn't have a LB meeting this month (last too). thoughts on doing one soon?
      • alastairp
        yeah, let's do it. next week I'm away Mon-Weds
      • lucifer
        i see, maybe later today or tomorrow if it works for all?
      • alastairp
        tomorrow midday/early afternoon would be OK for me
      • d4rk-ph0enix has quit
      • d4rkie joined the channel
      • d4rkie has quit
      • mayhem
        oh yes, sorry about that one.
      • tomorrow afternoon is pretty bad for me.
      • lucifer
        next thursday/friday or maybe the monday of july 18 1 hr before regular meeting?
      • zas
      • mayhem
        next thursday I could do. I'm out Friday/Monday that weekend.
      • d4rkie joined the channel
      • d4rkie has quit
      • d4rkie joined the channel
      • ansh
        alastairp: Thanks for the detailed review :)