in #metabrainz

1:15 AM
d4rkie joined the channel
1:17 AM
d4rk-ph0enix has quit
2:58 AM
santiagofn joined the channel
3:01 AM
MRiddickW joined the channel
3:56 AM
Pratha-Fish

alastairp: Hi, I also checked out a few things this morning, and realized we could also use zstd with parquet for even better performance! (especially in the long run)
4:26 AM
Pros:
4:26 AM
- Parquet is future-proof, with great support in pandas, Spark, etc.
4:26 AM
- It'll make the data ready to go for future use in ML and other applications.
4:26 AM
- It supports zstd as a compression codec too!
4:26 AM
- Loading times are much faster, and file sizes are ridiculously small, especially for text data.
4:26 AM
- Even export times could be lowered since pandas has great support for parquet too
4:27 AM
Here are some stats for comparison https://usercontent.irccloud-cdn.com/file/HgLYe...
4:28 AM
It performs great enough with gzip and snappy. It'll surely perform even better with zstd
4:30 AM
darkstardevx joined the channel
4:46 AM
monotux has quit
4:47 AM
monotux joined the channel
4:49 AM
BrainzGit

[bookbrainz-site] 14tr1ten opened pull request #858 (03new-creation-form…uf-edit-entity): Feat(route): Edit exisiting entity through unified form POST route https://github.com/metabrainz/bookbrainz-site/p...
4:51 AM
aerozol

alastairp: did you still have some tickets coming my way?
5:42 AM
BrainzGit

[musicbrainz-server] 14reosarevok merged pull request #2565 (03master…MBS-12457): MBS-12457: Wrap overly long words in annotations https://github.com/metabrainz/musicbrainz-serve...
5:49 AM
d4rkie has quit
5:50 AM
d4rkie joined the channel
5:54 AM
d4rkie has quit
5:55 AM
MRiddickW has quit
5:59 AM
d4rkie joined the channel
6:00 AM
reosarevok

bitmap: are you still up and dealing with test.mb?
6:00 AM
(was going to test but it's 504ing)
6:00 AM
d4rkie has quit
6:00 AM
d4rkie joined the channel
6:01 AM
bitmap

reosarevok: I'm up but not sure what to do about test.mb, json dumps are still running
6:02 AM
reosarevok

Oh, so it's not that you're releasing an update, it's just that dumps take so much from the server that test just fails to load? :D
6:02 AM
If so, guess I just need to check later
6:03 AM
I was thinking of putting beta out so I can start adding more genre rels
6:03 AM
Anything against that? :)
6:04 AM
bitmap

sounds good to me
6:04 AM
and yeah, it's been unusable all day :(
6:06 AM
reosarevok

I guess we maybe should move test away from that server then, but that means we need a dedicated server for json dumps, basically? Sounds bad
6:06 AM
You said you had an idea to make them less bad :D
6:07 AM
Maybe that should be the next project...
6:07 AM
bitmap

I was working on it a bit today since I got bored of writing rel editor tests
6:08 AM
reosarevok

How are you doing that btw, mostly all selenium?
6:09 AM
Pratha-Fish

alastairp: Here are some comparison charts!
6:09 AM
https://usercontent.irccloud-cdn.com/file/wZTxZ...
6:09 AM
https://usercontent.irccloud-cdn.com/file/zlWLW...
6:10 AM
https://usercontent.irccloud-cdn.com/file/4Dhn3...
6:11 AM
bitmap

reosarevok: mostly planning to write some normal JS ones where I feed some state into the reducer functions and check the output. but I'll try to add some basic Selenium ones too
6:11 AM
Pratha-Fish

alastairp: Looks like parquet+zstd is acing every stat.
6:11 AM
Excellent r/w time with the smallest storage size!
6:12 AM
Also note that reading gzip > writing to txt.zst is ridiculously slow.
6:12 AM
Maybe we should really shift to zst+parquet for all the data
6:17 AM
reosarevok

bitmap: sounds good, at least to begin with :)
6:19 AM
I'll do the beta thing and add more genre data, and then I'll look into the search changes yvanzo requested for the genre ws things :) Maybe we can release that when he's back
6:31 AM
BrainzGit

[musicbrainz-server] 14reosarevok merged pull request #2555 (03master…MBS-12418): MBS-12418: Also format artists in setlists if there's no MBID link https://github.com/metabrainz/musicbrainz-serve...
6:40 AM
antlarr has quit
6:49 AM
antlarr joined the channel
7:12 AM
d4rk-ph0enix joined the channel
7:12 AM
d4rkie has quit
7:29 AM
alastairp

morning
7:29 AM
thanks bitmap for the comments on SQL, that's more or less what I expected
7:30 AM
Pratha-Fish: thanks for the experiment, incredible to see how small and fast parquet is. you're right, this might be a good idea.
7:31 AM
keep in mind that we also need a format that can be used by people no matter what programming language they use - let's have a discussion with lucifer about this, because we're saving space maybe we can distribute it in both formats
7:31 AM
how are you writing csv/parquet? just with the DataFrame methods? for comparison, can you also do the same with csv+gz?
7:32 AM
how many items did you write? because 22 seconds does seem like a lot, but I guess you're writing many files in a loop?
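For the csv+gz comparison asked about above, a rough sketch using only the standard library could look like this. It assumes one small file per user written in a loop; the row layout, file names and counts are hypothetical and only illustrate how per-file overhead adds up.

```python
import csv
import gzip
import os
import tempfile
import time
import uuid


def write_csv_gz(path, rows):
    """Write rows to a gzip-compressed CSV file."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def time_many_files(n_files, rows_per_file):
    """Write n_files csv.gz files and return (seconds elapsed, total bytes)."""
    # Hypothetical rows: a timestamp plus one UUID string.
    rows = [(1_600_000_000 + i, str(uuid.uuid4())) for i in range(rows_per_file)]
    total = 0
    with tempfile.TemporaryDirectory() as tmp:
        start = time.perf_counter()
        for i in range(n_files):
            path = os.path.join(tmp, f"user_{i}.csv.gz")
            write_csv_gz(path, rows)
            total += os.path.getsize(path)
        elapsed = time.perf_counter() - start
    return elapsed, total


elapsed, total_bytes = time_many_files(n_files=20, rows_per_file=1_000)
print(f"{elapsed:.3f}s for 20 files, {total_bytes} bytes total")
```

Timing the whole loop rather than a single write makes it easier to tell whether a large total comes from the format or from the number of files.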
7:41 AM
antlarr has quit
7:41 AM
KassOtsimine has quit
7:42 AM
reosarevok has quit
7:42 AM
zas has quit
7:42 AM
mruszczyk has quit
7:42 AM
reosarevok joined the channel
7:42 AM
KassOtsimine joined the channel
7:42 AM
zas joined the channel
7:42 AM
mruszczyk joined the channel
7:43 AM
milkii has quit
7:43 AM
aerozol: CB-441 and CB-442
7:43 AM
BrainzBot

CB-441: Need a way to show that an entity/review is for MB or BB https://tickets.metabrainz.org/browse/CB-441
7:43 AM
CB-442: Improve layout of CB entity page https://tickets.metabrainz.org/browse/CB-442
7:44 AM
milkii joined the channel
7:45 AM
lucifer

alastairp: hi! what's the data to be distributed here?
7:45 AM
alastairp

hi lucifer
7:45 AM
antlarr joined the channel
7:46 AM
this is the music listen histories dataset. 27 billion rows, currently distributed as csv, 1 file per user. each row is timestamp, recording mbid, artist mbid, release mbid
7:48 AM
we've already obtained a 50% filesize reduction by switching from gz to zstd, but the question is should we go further and use parquet as well, which gives additional size reductions as well as a significant speed increase when loading
7:49 AM
lucifer

parquet is probably better in almost all ways with the exception of being not human readable and you can't edit it in excel.
7:50 AM
but 27B rows is too much data for direct human consumption anyway so i think these issues don't matter much.
7:50 AM
alastairp

yeah, right
7:50 AM
so we read/write it in spark and python?
7:51 AM
perhaps it'd be nice to be able to `cat file.parquet | to-csv` somehow, to be able to view the data as csv if necessary? I'm reading the internet and it seems to just suggest using pandas for this, which is ok but not great
7:52 AM
lucifer

yes. both parquet and csv have built in support in spark. in python, you need to add pyarrow manually to use parquet (csv support is builtin as you know).
7:52 AM
there's duckdb and some cli parquet tools but not too great.
7:52 AM
alastairp

right
7:53 AM
lucifer

for instance, https://stackoverflow.com/questions/36140264/in...
7:53 AM
alastairp

oh nice, that's exactly what I was wanting
7:53 AM
lucifer

there's also a convert csv command in this fwiw.
7:54 AM
but i remember facing issues with installing this tool last i tried. let me see if it works now.
7:56 AM
alastairp

lucifer: and remind me - how do we make spark dumps? I seem to recall that we dump json and then convert?
7:57 AM
lucifer

ah no, we dump parquet directly using pandas now.
7:57 AM
we make 2 dumps. 1 json and 1 parquet one for spark.
7:57 AM
alastairp

oh great
7:58 AM
found it
7:59 AM
do we do any compression on the parquet file?
8:00 AM
lucifer

parquet uses snappy compression by default so yes
8:00 AM
alastairp

right (https://usercontent.irccloud-cdn.com/file/wZTxZ...)
8:01 AM
MRiddickW joined the channel
8:01 AM
thanks lucifer! we'll put parquet on the list of possibilities for the final version of the dataset, then
8:03 AM
I think it makes sense. I might do a quick look and see if I can find what programming languages people used for projects that use this dataset
8:04 AM
Lotheric_ joined the channel
8:06 AM
Lotheric has quit
8:23 AM
CatQuest

Pratha-Fish: you forgot the cons, tho ;)
8:27 AM
(not that i know them :D)
8:31 AM
alastairp: your mockup on https://tickets.metabrainz.org/browse/CB-442 gets a 👍 from me, it's a clear improvement
8:31 AM
BrainzBot

CB-442: Improve layout of CB entity page
8:31 AM
alastairp

CatQuest: that's monkey's mockup, all 👍 should go to him!
8:31 AM
CatQuest

(I agree cb layout has always been a tad odd)
8:32 AM
... you were the reporter so i was confused :D
8:32 AM
alastairp

CatQuest: the only large con that we know of is what lucifer and I were just discussing - the data format is no longer text, so you need a 3rd party software library to read it
8:32 AM
CatQuest

!m mofor mockups then!
8:32 AM
BrainzBot

You're doing good work, mofor mockups then!!
8:32 AM
CatQuest

...
8:32 AM
!m monkey for mockups then!
8:32 AM
BrainzBot

You're doing good work, monkey for mockups then!!
8:32 AM
alastairp

I suspect that CB was much like every other project - a programmer makes a start and needs to lay out the data so comes up with something, but we never get around to looking at it from a design perspective
8:32 AM
CatQuest

alastairp: yea. i was kinda joking a little with Pratha-Fish, because they said "pros:" and didn't list cons :)
8:33 AM
alastairp

there really are very few cons compared to the pros, though
8:33 AM
now that we have the right people on board for design help, definitely agree that we should work to improve it
8:33 AM
CatQuest

I dunno about design. but usability from a user perspective
8:33 AM
alastairp

sure, to me that's part of design too
8:33 AM
CatQuest

i trust monkey explicitly, as he seems to also take that into account, instead of trying to be "fancy" for the sake of "design" or the sake of "just being fancy"
8:34 AM
as i've seen other such people do :D
8:35 AM
btw, have you seen the new entity editor that Shubh is cooking up for bb?
8:35 AM
it's also a "design" improvement
8:35 AM
https://test.bookbrainz.org/create
8:36 AM
so far I've given feedback about performance (ostensibly on old browsers) but others' input re usability/design/ui/whatever would also be useful to them i think
9:18 AM
skelly37 joined the channel
9:21 AM
ROpdebee has quit
9:21 AM
ROpdebee joined the channel
9:39 AM
lucifer

monkey: alastairp: mayhem: we didn't have a LB meeting this month (last too). thoughts on doing one soon?
9:40 AM
alastairp

yeah, let's do it. next week I'm away Mon-Weds
9:40 AM
lucifer

i see, maybe later today or tomorrow if it works for all?
9:42 AM
alastairp

tomorrow midday/early afternoon would be OK for me
9:45 AM
d4rk-ph0enix has quit
9:45 AM
d4rkie joined the channel
9:49 AM
d4rkie has quit
9:55 AM
mayhem

oh yes, sorry about that one.
9:55 AM
tomorrow afternoon is pretty bad for me.
9:57 AM
lucifer

next thursday/friday or maybe the monday of july 18 1 hr before regular meeting?
9:57 AM
zas

https://blog.metabrainz.org/2022/07/07/picard-2...
9:59 AM
mayhem

next thursday I could do. I'm out Friday/Monday that weekend.
9:59 AM
d4rkie joined the channel
10:01 AM
d4rkie has quit
10:01 AM
d4rkie joined the channel
10:20 AM
ansh

alastairp: Thanks for the detailed review :)