thanks bitmap for the comments on SQL, that's more or less what I expected
Pratha-Fish: thanks for the experiment, incredible to see how small and fast parquet is. you're right, this might be a good idea.
keep in mind that we also need a format that can be used by people no matter what programming language they use - let's have a discussion with lucifer about this. because we're saving space, maybe we can distribute it in both formats
how are you writing csv/parquet? just with the DataFrame methods? for comparison, can you also do the same with csv+gz?
how many items did you write? because 22 seconds does seem like a lot, but I guess you're writing many files in a loop?
alastairp: hi! whats the data to be distributed here?
this is the music listen histories dataset. 27 billion rows, currently distributed as csv, 1 file per user. each row is timestamp, recording mbid, artist mbid, release mbid
we've already obtained a 50% filesize reduction by switching from gz to zstd, but the question is should we go further and use parquet as well, which gives additional size reductions as well as a significant speed increase when loading
parquet is probably better in almost all ways, with the exception that it's not human-readable and you can't edit it in excel.
but 27B rows is too much data for direct human consumption anyway, so i think these issues don't matter much.
so we read/write it in spark and python?
perhaps it'd be nice to be able to `cat file.parquet | to-csv` somehow, to be able to view the data as csv if necessary? I'm reading the internet and it seems to just suggest using pandas for this, which is ok but not great
yes. both parquet and csv have built-in support in spark. in python, you need to add pyarrow manually to use parquet (csv support is built in, as you know).
there's duckdb and some cli parquet tools but not too great.
CatQuest: that's monkey's mockup, all 👍 should go to him!
(I agree cb layout has always been a tad odd)
... you were the reporter so i was confused :D
CatQuest: the only large con that we know of is what lucifer and I were just discussing - the data format is no longer text, so you need a 3rd party software library to read it
!m monkey for mockups then!
You're doing good work, monkey for mockups then!!
I suspect that CB was much like every other project - a programmer makes a start and needs to lay out the data so comes up with something, but we never get around to looking at it from a design perspective
alastairp: yea, i was kinda joking a little with Pratha-Fish, because they said "pros:" and didn't list cons :)
there really are very few cons compared to the pros, though
now that we have the right people on board for design help, definitely agree that we should work to improve it
I dunno about design, but usability from a user perspective
sure, to me that's part of design too
i trust monkey implicitly, as he seems to also take that into account, instead of trying to be "fancy" for the sake of "design" or the sake of "just being fancy"
as i've seen other such people do :D
btw, have you seen the new entity editor that Shubh is cooking up for bb?