alastairp: Hi, I also checked out a few things this morning, and realized we could also use zstd with parquet for even better performance! (especially in the long run)
2022-07-07 18801, 2022
Pratha-Fish
Pros:
2022-07-07 18801, 2022
Pratha-Fish
- Parquet is future-proof with great support with pandas, spark, etc.
2022-07-07 18801, 2022
Pratha-Fish
- It'll make the data ready to go for future use in ML and other applications.
2022-07-07 18801, 2022
Pratha-Fish
- It supports zstd compression too!
2022-07-07 18801, 2022
Pratha-Fish
- Loading is much faster, and file sizes are ridiculously small, especially for text data.
2022-07-07 18801, 2022
Pratha-Fish
- Even export times could be lowered since pandas has great support for parquet too
It performs great enough with gzip and snappy. It'll surely perform even better with zstd
2022-07-07 18833, 2022
darkstardevx joined the channel
2022-07-07 18832, 2022
monotux has quit
2022-07-07 18841, 2022
monotux joined the channel
2022-07-07 18851, 2022
BrainzGit
[bookbrainz-site] 14tr1ten opened pull request #858 (03new-creation-form…uf-edit-entity): Feat(route): Edit exisiting entity through unified form POST route https://github.com/metabrainz/bookbrainz-site/pul…
2022-07-07 18821, 2022
aerozol
alastairp: did you still have some tickets coming my way?
reosarevok: mostly planning to write some normal JS ones where I feed some state into the reducer functions and check the output. but I'll try to add some basic Selenium ones too
2022-07-07 18808, 2022
Pratha-Fish
alastairp: Looks like parquet+zstd is acing every stat.
2022-07-07 18809, 2022
Pratha-Fish
Excellent r/w time with the smallest storage size!
2022-07-07 18845, 2022
Pratha-Fish
Also note that reading gzip > writing to txt.zst is ridiculously slow.
2022-07-07 18845, 2022
Pratha-Fish
Maybe we should really shift to zst+parquet for all the data
2022-07-07 18805, 2022
reosarevok
bitmap: sounds good, at least to begin with :)
2022-07-07 18820, 2022
reosarevok
I'll do the beta thing and add more genre data, and then I'll look into the search changes yvanzo requested for the genre ws things :) Maybe we can release that when he's back
thanks bitmap for the comments on SQL, that's more or less what I expected
2022-07-07 18820, 2022
alastairp
Pratha-Fish: thanks for the experiment, incredible to see how small and fast parquet is. you're right, this might be a good idea.
2022-07-07 18809, 2022
alastairp
keep in mind that we also need a format that can be used by people no matter what programming language they use - let's have a discussion with lucifer about this. because we're saving space, maybe we can distribute it in both formats
2022-07-07 18847, 2022
alastairp
how are you writing csv/parquet? just with the DataFrame methods? for comparison, can you also do the same with csv+gz?
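For the csv+gz comparison asked about above, one possible baseline is sketched below using only the Python standard library. It is not necessarily how the experiment was run; the helper names, row layout and row count are assumptions for illustration.

```python
import csv
import gzip
import os
import tempfile
import time


def write_csv_gz(rows, path):
    """Write rows to a gzip-compressed CSV file and return elapsed seconds."""
    start = time.perf_counter()
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return time.perf_counter() - start


def read_csv_gz(path):
    """Read every row back from a gzip-compressed CSV file."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


# Hypothetical rows: a timestamp and a placeholder recording id.
rows = [[str(1657180800 + i), f"rec-{i % 50}"] for i in range(10_000)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "user.csv.gz")
    elapsed = write_csv_gz(rows, path)
    size_bytes = os.path.getsize(path)
    round_trip = read_csv_gz(path)

print(f"wrote {len(rows)} rows in {elapsed:.3f}s, {size_bytes} bytes")
```

Timing one file in isolation like this helps separate the per-file cost from the cost of looping over many files.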
2022-07-07 18810, 2022
alastairp
how many items did you write? because 22 seconds does seem like a lot, but I guess you're writing many files in a loop?
alastairp: hi! what's the data to be distributed here?
2022-07-07 18834, 2022
alastairp
hi lucifer
2022-07-07 18843, 2022
antlarr joined the channel
2022-07-07 18813, 2022
alastairp
this is the music listen histories dataset. 27 billion rows, currently distributed as csv, 1 file per user. each row is timestamp, recording mbid, artist mbid, release mbid
2022-07-07 18826, 2022
alastairp
we've already obtained a 50% filesize reduction by switching from gz to zstd, but the question is should we go further and use parquet as well, which gives additional size reductions as well as a significant speed increase when loading
2022-07-07 18825, 2022
lucifer
parquet is probably better in almost all ways with the exception of being not human readable and you can't edit it in excel.
2022-07-07 18822, 2022
lucifer
but 27B rows is too much data for direct human consumption anyway so i think these issues don't matter much.
2022-07-07 18835, 2022
alastairp
yeah, right
2022-07-07 18844, 2022
alastairp
so we read/write it in spark and python?
2022-07-07 18859, 2022
alastairp
perhaps it'd be nice to be able to `cat file.parquet | to-csv` somehow, to be able to view the data as csv if necessary? I'm reading the internet and it seems to just suggest using pandas for this, which is ok but not great
2022-07-07 18803, 2022
lucifer
yes. both parquet and csv have built in support in spark. in python, you need to add pyarrow manually to use parquet (csv support is builtin as you know).
2022-07-07 18822, 2022
lucifer
there's duckdb and some cli parquet tools but not too great.
CatQuest: that's monkey's mockup, all 👍 should go to him!
2022-07-07 18848, 2022
CatQuest
(I agree cb layout has always been a tad odd)
2022-07-07 18806, 2022
CatQuest
... you were the reporter so i was confused :D
2022-07-07 18808, 2022
alastairp
CatQuest: the only large con that we know of is what lucifer and I were just discussing - the data format is no longer text, so you need a 3rd party software library to read it
2022-07-07 18813, 2022
CatQuest
!m mofor mockups then!
2022-07-07 18814, 2022
BrainzBot
You're doing good work, mofor mockups then!!
2022-07-07 18824, 2022
CatQuest
...
2022-07-07 18834, 2022
CatQuest
!m monkey for mockups then!
2022-07-07 18834, 2022
BrainzBot
You're doing good work, monkey for mockups then!!
2022-07-07 18854, 2022
alastairp
I suspect that CB was much like every other project - a programmer makes a start and needs to lay out the data so comes up with something, but we never get around to looking at it from a design perspective
2022-07-07 18856, 2022
CatQuest
alastairp: yeah. i was kinda joking a little with Pratha-Fish, because they said "pros:" and didn't list cons :)
2022-07-07 18811, 2022
alastairp
there really are very few cons compared to the pros, though
2022-07-07 18821, 2022
alastairp
now that we have the right people on board for design help, definitely agree that we should work to improve it
2022-07-07 18822, 2022
CatQuest
I dunno about design. but usability from a user perspective
2022-07-07 18842, 2022
alastairp
sure, to me that's part of design too
2022-07-07 18859, 2022
CatQuest
i trust monkey explicitly, as he seems to also take that into account, instead of trying to be "fancy" for the sake of "design" or the sake of "just being fancy"
2022-07-07 18816, 2022
CatQuest
as i've seen other such people do :D
2022-07-07 18805, 2022
CatQuest
btw, have you seen the new entity editor that Shubh is cooking up for bb?
so far I've given feedback about performance (ostensibly on old browsers) but others' input re usability/design/ui/whatever would also be useful to them i think
2022-07-07 18857, 2022
skelly37 joined the channel
2022-07-07 18824, 2022
ROpdebee has quit
2022-07-07 18852, 2022
ROpdebee joined the channel
2022-07-07 18834, 2022
lucifer
monkey: alastairp: mayhem: we didn't have a LB meeting this month (last too). thoughts on doing one soon?
2022-07-07 18800, 2022
alastairp
yeah, let's do it. next week I'm away Mon-Weds
2022-07-07 18821, 2022
lucifer
i see, maybe later today or tomorrow if it works for all?
2022-07-07 18822, 2022
alastairp
tomorrow midday/early afternoon would be OK for me
2022-07-07 18802, 2022
d4rk-ph0enix has quit
2022-07-07 18834, 2022
d4rkie joined the channel
2022-07-07 18850, 2022
d4rkie has quit
2022-07-07 18805, 2022
mayhem
oh yes, sorry about that one.
2022-07-07 18826, 2022
mayhem
tomorrow afternoon is pretty bad for me.
2022-07-07 18811, 2022
lucifer
next thursday/friday or maybe the monday of july 18 1 hr before regular meeting?