#metabrainz

1:15 AM
d4rkie joined the channel

2022-07-07 18835, 2022

1:17 AM
d4rk-ph0enix has quit

2022-07-07 18858, 2022

2:58 AM
santiagofn joined the channel

2022-07-07 18855, 2022

3:01 AM
MRiddickW joined the channel

2022-07-07 18803, 2022

3:56 AM
Pratha-Fish

alastairp: Hi, I also checked out a few things this morning, and realized we could also use zstd with parquet for even better performance! (especially in the long run)

2022-07-07 18801, 2022

4:26 AM
Pratha-Fish

Pros:

2022-07-07 18801, 2022

4:26 AM
Pratha-Fish

- Parquet is future-proof with great support with pandas, spark, etc.

2022-07-07 18801, 2022

4:26 AM
Pratha-Fish

- It'll make the data ready to go for future use for ML, etc applications.

2022-07-07 18801, 2022

4:26 AM
Pratha-Fish

- It supports zstd as a compression too!

2022-07-07 18801, 2022

4:26 AM
Pratha-Fish

- Loading times are exceptionally faster, and sizes are ridiculously low, especially for text data.

2022-07-07 18801, 2022

4:26 AM
Pratha-Fish

- Even export times could be lowered since pandas has great support for parquet too

2022-07-07 18824, 2022

4:27 AM
Pratha-Fish

Here's some stats for comparison https://usercontent.irccloud-cdn.com/file/HgLYejT…

2022-07-07 18837, 2022

4:28 AM
Pratha-Fish

It performs great enough with gzip and snappy. It'll surely perform even better with zstd

2022-07-07 18833, 2022

4:30 AM
darkstardevx joined the channel

2022-07-07 18832, 2022

4:46 AM
monotux has quit

2022-07-07 18841, 2022

4:47 AM
monotux joined the channel

2022-07-07 18851, 2022

4:49 AM
BrainzGit

[bookbrainz-site] 14tr1ten opened pull request #858 (03new-creation-form…uf-edit-entity): Feat(route): Edit exisiting entity through unified form POST route https://github.com/metabrainz/bookbrainz-site/pul…

2022-07-07 18821, 2022

4:51 AM
aerozol

alastairp: did you still have some tickets coming my way?

2022-07-07 18814, 2022

5:42 AM
BrainzGit

[musicbrainz-server] 14reosarevok merged pull request #2565 (03master…MBS-12457): MBS-12457: Wrap overly long words in annotations https://github.com/metabrainz/musicbrainz-server/…

2022-07-07 18841, 2022

5:49 AM
d4rkie has quit

2022-07-07 18815, 2022

5:50 AM
d4rkie joined the channel

2022-07-07 18851, 2022

5:54 AM
d4rkie has quit

2022-07-07 18822, 2022

5:55 AM
MRiddickW has quit

2022-07-07 18822, 2022

5:59 AM
d4rkie joined the channel

2022-07-07 18814, 2022

6:00 AM
reosarevok

bitmap: are you still up and dealing with test.mb?

2022-07-07 18823, 2022

6:00 AM
reosarevok

(was going to test but it's 504ing)

2022-07-07 18847, 2022

6:00 AM
d4rkie has quit

2022-07-07 18859, 2022

6:00 AM
d4rkie joined the channel

2022-07-07 18836, 2022

6:01 AM
bitmap

reosarevok: I'm up but not sure what to do about test.mb, json dumps are still running

2022-07-07 18817, 2022

6:02 AM
reosarevok

Oh, so it's not that you're releasing an update, it's just that dumps take so much from the server that test just fails to load? :D

2022-07-07 18847, 2022

6:02 AM
reosarevok

If so, guess I just need to check later

2022-07-07 18803, 2022

6:03 AM
reosarevok

I was thinking of putting beta out so I can start adding more genre rels

2022-07-07 18807, 2022

6:03 AM
reosarevok

Anything against that? :)

2022-07-07 18818, 2022

6:04 AM
bitmap

sounds good to me

2022-07-07 18849, 2022

6:04 AM
bitmap

and yeah, it's been unusable all day :(

2022-07-07 18850, 2022

6:06 AM
reosarevok

I guess we maybe should move test away from that server then, but that means we need a dedicated server for json dumps, basically? Sounds bad

2022-07-07 18858, 2022

6:06 AM
reosarevok

You said you had an idea to make them less bad :D

2022-07-07 18805, 2022

6:07 AM
reosarevok

Maybe that should be the next project...

2022-07-07 18840, 2022

6:07 AM
bitmap

I was working on it a bit today since I got bored of writing rel editor tests

2022-07-07 18807, 2022

6:08 AM
reosarevok

How are you doing that btw, mostly all selenium?

2022-07-07 18827, 2022

6:09 AM
Pratha-Fish

alastairp: Here's some comparison charts!

2022-07-07 18831, 2022

6:09 AM
Pratha-Fish

https://usercontent.irccloud-cdn.com/file/wZTxZ7n…

2022-07-07 18850, 2022

6:09 AM
Pratha-Fish

https://usercontent.irccloud-cdn.com/file/zlWLWgn…

2022-07-07 18813, 2022

6:10 AM
Pratha-Fish

https://usercontent.irccloud-cdn.com/file/4Dhn34g…

2022-07-07 18804, 2022

6:11 AM
bitmap

reosarevok: mostly planning to write some normal JS ones where I feed some state into the reducer functions and check the output. but I'll try to add some basic Selenium ones too

2022-07-07 18808, 2022

6:11 AM
Pratha-Fish

alastairp: Looks like parquet+zstd is acing every stat.

2022-07-07 18809, 2022

6:11 AM
Pratha-Fish

Excellent r/w time with the smallest storage size!

2022-07-07 18845, 2022

6:12 AM
Pratha-Fish

Also note that reading gzip > writing to txt.zst is ridiculously slow.

2022-07-07 18845, 2022

6:12 AM
Pratha-Fish

Maybe we should really shift to zst+parquet for all the data

2022-07-07 18805, 2022

6:17 AM
reosarevok

bitmap: sounds good, at least to begin with :)

2022-07-07 18820, 2022

6:19 AM
reosarevok

I'll do the beta thing and add more genre data, and then I'll look into the search changes yvanzo requested for the genre ws things :) Maybe we can release that when he's back

2022-07-07 18843, 2022

6:31 AM
BrainzGit

[musicbrainz-server] 14reosarevok merged pull request #2555 (03master…MBS-12418): MBS-12418: Also format artists in setlists if there's no MBID link https://github.com/metabrainz/musicbrainz-server/…

2022-07-07 18819, 2022

6:40 AM
antlarr has quit

2022-07-07 18846, 2022

6:49 AM
antlarr joined the channel

2022-07-07 18805, 2022

7:12 AM
d4rk-ph0enix joined the channel

2022-07-07 18805, 2022

7:12 AM
d4rkie has quit

2022-07-07 18815, 2022

7:29 AM
alastairp

morning

2022-07-07 18833, 2022

7:29 AM
alastairp

thanks bitmap for the comments on SQL, that's more or less what I expected

2022-07-07 18820, 2022

7:30 AM
alastairp

Pratha-Fish: thanks for the experiment, incredible to see how small and fast parquet is. you're right, this might be a good idea.

2022-07-07 18809, 2022

7:31 AM
alastairp

keep in mind that we also need a format that can be used by people no matter what programming language they use - let's have a discussion with lucifer about this, because we're saving space maybe we can distribute it in both formats

2022-07-07 18847, 2022

7:31 AM
alastairp

how are you writing csv/parquet? just with the DataFrame methods? for comparison, can you also do the same with csv+gz?

2022-07-07 18810, 2022

7:32 AM
alastairp

how many items did you write? because 22 seconds does seem like a lot, but I guess you're writing many files in a loop?
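
A minimal sketch of how the csv+gz baseline asked about above could be measured, using only the Python standard library. The rows are synthetic, seeded random UUIDs stand in for MBIDs, and `make_rows`, `rows_to_csv_bytes`, and `measure` are hypothetical names; this is not the script behind the charts linked earlier.

```python
import csv
import gzip
import io
import random
import time
import uuid


def make_rows(n, seed=0):
    """Build n synthetic rows: a timestamp plus three UUIDs standing in for MBIDs."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same rows

    def fake_mbid():
        return str(uuid.UUID(int=rng.getrandbits(128)))

    return [(1_600_000_000 + i, fake_mbid(), fake_mbid(), fake_mbid()) for i in range(n)]


def rows_to_csv_bytes(rows):
    """Serialise the rows as CSV text and return it UTF-8 encoded."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue().encode("utf-8")


def measure(rows):
    """Return {format: (size_in_bytes, seconds)} for plain CSV and gzip-compressed CSV."""
    start = time.perf_counter()
    plain = rows_to_csv_bytes(rows)
    plain_seconds = time.perf_counter() - start

    start = time.perf_counter()
    packed = gzip.compress(rows_to_csv_bytes(rows))
    gz_seconds = time.perf_counter() - start

    return {"csv": (len(plain), plain_seconds), "csv+gz": (len(packed), gz_seconds)}


if __name__ == "__main__":
    for name, (size, seconds) in measure(make_rows(100_000)).items():
        print(f"{name}: {size} bytes in {seconds:.3f}s")
```

Real listen data repeats MBIDs heavily, so it should compress better than these random values, and the timings are only meaningful relative to each other on the same machine.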

2022-07-07 18820, 2022

7:41 AM
antlarr has quit

2022-07-07 18855, 2022

7:41 AM
KassOtsimine has quit

2022-07-07 18826, 2022

7:42 AM
reosarevok has quit

2022-07-07 18830, 2022

7:42 AM
zas has quit

2022-07-07 18837, 2022

7:42 AM
mruszczyk has quit

2022-07-07 18842, 2022

7:42 AM
reosarevok joined the channel

2022-07-07 18842, 2022

7:42 AM
KassOtsimine joined the channel

2022-07-07 18842, 2022

7:42 AM
zas joined the channel

2022-07-07 18848, 2022

7:42 AM
mruszczyk joined the channel

2022-07-07 18800, 2022

7:43 AM
milkii has quit

2022-07-07 18839, 2022

7:43 AM
alastairp

aerozol: CB-441 and CB-442

2022-07-07 18840, 2022

7:43 AM
BrainzBot

CB-441: Need a way to show that an entity/review is for MB or BB https://tickets.metabrainz.org/browse/CB-441

2022-07-07 18840, 2022

7:43 AM
BrainzBot

CB-442: Improve layout of CB entity page https://tickets.metabrainz.org/browse/CB-442

2022-07-07 18815, 2022

7:44 AM
milkii joined the channel

2022-07-07 18825, 2022

7:45 AM
lucifer

alastairp: hi! what's the data to be distributed here?

2022-07-07 18834, 2022

7:45 AM
alastairp

hi lucifer

2022-07-07 18843, 2022

7:45 AM
antlarr joined the channel

2022-07-07 18813, 2022

7:46 AM
alastairp

this is the music listen histories dataset. 27 billion rows, currently distributed as csv, 1 file per user. each row is timestamp, recording mbid, artist mbid, release mbid
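
The row layout described above could be read like this. A sketch only: it assumes comma-separated files with no header row and an integer Unix timestamp, none of which the discussion confirms, and `Listen` and `read_listens` are hypothetical names.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, TextIO


@dataclass(frozen=True)
class Listen:
    """One row of a per-user file, in the column order given in the chat."""
    timestamp: int        # assumption: integer Unix epoch seconds
    recording_mbid: str
    artist_mbid: str
    release_mbid: str


def read_listens(handle: TextIO) -> Iterator[Listen]:
    """Yield one Listen per non-empty CSV row (assumes no header row)."""
    for row in csv.reader(handle):
        if not row:
            continue  # csv.reader yields [] for blank lines
        timestamp, recording, artist, release = row
        yield Listen(int(timestamp), recording, artist, release)


if __name__ == "__main__":
    # Placeholder identifiers, not real MBIDs.
    sample = io.StringIO("1657180800,rec-1,art-1,rel-1\n")
    for listen in read_listens(sample):
        print(listen)
```

Streaming rows from a file handle keeps memory use flat, which matters with one file per user and billions of rows overall.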

2022-07-07 18826, 2022

7:48 AM
alastairp

we've already obtained a 50% filesize reduction by switching from gz to zstd, but the question is should we go further and use parquet as well, which gives additional size reductions as well as a significant speed increase when loading

2022-07-07 18825, 2022

7:49 AM
lucifer

parquet is probably better in almost all ways with the exception of being not human readable and you can't edit it in excel.

2022-07-07 18822, 2022

7:50 AM
lucifer

but 27B rows is too much data for direct human consumption anyway so i think these issues don't matter much.

2022-07-07 18835, 2022

7:50 AM
alastairp

yeah, right

2022-07-07 18844, 2022

7:50 AM
alastairp

so we read/write it in spark and python?

2022-07-07 18859, 2022

7:51 AM
alastairp

perhaps it'd be nice to be able to `cat file.parquet | to-csv` somehow, to be able to view the data as csv if necessary? I'm reading the internet and it seems to just suggest using pandas for this, which is ok but not great

2022-07-07 18803, 2022

7:52 AM
lucifer

yes. both parquet and csv have built in support in spark. in python, you need to add pyarrow manually to use parquet (csv support is builtin as you know).

2022-07-07 18822, 2022

7:52 AM
lucifer

there's duckdb and some cli parquet tools but not too great.

2022-07-07 18829, 2022

7:52 AM
alastairp

right

2022-07-07 18801, 2022

7:53 AM
lucifer

for instance, https://stackoverflow.com/questions/36140264/insp…

2022-07-07 18816, 2022

7:53 AM
alastairp

oh nice, that's exactly what I was wanting

2022-07-07 18857, 2022

7:53 AM
lucifer

there's also a convert csv command in this fwiw.

2022-07-07 18845, 2022

7:54 AM
lucifer

but i remember facing issues with installing this tool last i tried. let me see if it works now.

2022-07-07 18859, 2022

7:56 AM
alastairp

lucifer: and remind me - how do we make spark dumps? I seem to recall that we dump json and then convert?

2022-07-07 18814, 2022

7:57 AM
lucifer

ah no, we dump parquet directly using pandas now.

2022-07-07 18832, 2022

7:57 AM
lucifer

we make 2 dumps. 1 json and 1 parquet one for spark.

2022-07-07 18850, 2022

7:57 AM
alastairp

oh great

2022-07-07 18810, 2022

7:58 AM
alastairp

found it

2022-07-07 18805, 2022

7:59 AM
alastairp

do we do any compression on the parquet file?

2022-07-07 18808, 2022

8:00 AM
lucifer

parquet uses snappy compression by default so yes

2022-07-07 18842, 2022

8:00 AM
alastairp

right (https://usercontent.irccloud-cdn.com/file/wZTxZ7n…)

2022-07-07 18800, 2022

8:01 AM
MRiddickW joined the channel

2022-07-07 18830, 2022

8:01 AM
alastairp

thanks lucifer! we'll put parquet on the list of possibilities for the final version of the dataset, then

2022-07-07 18838, 2022

8:03 AM
alastairp

I think it makes sense. I might do a quick look and see if I can find what programming languages people used for projects that use this dataset

2022-07-07 18816, 2022

8:04 AM
Lotheric_ joined the channel

2022-07-07 18818, 2022

8:06 AM
Lotheric has quit

2022-07-07 18851, 2022

8:23 AM
CatQuest

Pratha-Fish: you forgot the cons, tho ;)

2022-07-07 18809, 2022

8:27 AM
CatQuest

(not that i know them :D)

2022-07-07 18809, 2022

8:31 AM
CatQuest

alastairp: your mockup on https://tickets.metabrainz.org/browse/CB-442 gets a 👍 from me, it's a clear improvement

2022-07-07 18810, 2022

8:31 AM
BrainzBot

CB-442: Improve layout of CB entity page

2022-07-07 18830, 2022

8:31 AM
alastairp

CatQuest: that's monkey's mockup, all 👍 should go to him!

2022-07-07 18848, 2022

8:31 AM
CatQuest

(I agree cb layout has always been a tad odd)

2022-07-07 18806, 2022

8:32 AM
CatQuest

... you were the reporter so i was confused :D

2022-07-07 18808, 2022

8:32 AM
alastairp

CatQuest: the only large con that we know of is what lucifer and I were just discussing - the data format is no longer text, so you need a 3rd party software library to read it

2022-07-07 18813, 2022

8:32 AM
CatQuest

!m mofor mockups then!

2022-07-07 18814, 2022

8:32 AM
BrainzBot

You're doing good work, mofor mockups then!!

2022-07-07 18824, 2022

8:32 AM
CatQuest

...

2022-07-07 18834, 2022

8:32 AM
CatQuest

!m monkey for mockups then!

2022-07-07 18834, 2022

8:32 AM
BrainzBot

You're doing good work, monkey for mockups then!!

2022-07-07 18854, 2022

8:32 AM
alastairp

I suspect that CB was much like every other project - a programmer makes a start and needs to lay out the data so comes up with something, but we never get around to looking at it from a design perspective

2022-07-07 18856, 2022

8:32 AM
CatQuest

alastairp: yea. i was kinda joking a little with Pratha-Fish, because they said "pros:" and didn't list cons :)

2022-07-07 18811, 2022

8:33 AM
alastairp

there really are very few cons compared to the pros, though

2022-07-07 18821, 2022

8:33 AM
alastairp

now that we have the right people on board for design help, definitely agree that we should work to improve it

2022-07-07 18822, 2022

8:33 AM
CatQuest

I dunno about design. but usability from a user perspective

2022-07-07 18842, 2022

8:33 AM
alastairp

sure, to me that's part of design too

2022-07-07 18859, 2022

8:33 AM
CatQuest

i trust monkey explicitly, as he seems to also take that into account, instead of trying to be "fancy" for the sake of "design" or the sake of "just being fancy"

2022-07-07 18816, 2022

8:34 AM
CatQuest

as i've seen other such people do :D

2022-07-07 18805, 2022

8:35 AM
CatQuest

btw, have you seen the new entity editor that Shubh is cooking up for bb?

2022-07-07 18811, 2022

8:35 AM
CatQuest

it's also a "design" improvement

2022-07-07 18859, 2022

8:35 AM
CatQuest

https://test.bookbrainz.org/create

2022-07-07 18844, 2022

8:36 AM
CatQuest

so far I've given feedback about performance (ostensibly on old browsers) but others' input re usability/design/ui/whatever would also be useful to them i think

2022-07-07 18857, 2022

9:18 AM
skelly37 joined the channel

2022-07-07 18824, 2022

9:21 AM
ROpdebee has quit

2022-07-07 18852, 2022

9:21 AM
ROpdebee joined the channel

2022-07-07 18834, 2022

9:39 AM
lucifer

monkey: alastairp: mayhem: we didn't have a LB meeting this month (last too). thoughts on doing one soon?

2022-07-07 18800, 2022

9:40 AM
alastairp

yeah, let's do it. next week I'm away Mon-Weds

2022-07-07 18821, 2022

9:40 AM
lucifer

i see, maybe later today or tomorrow if it works for all?

2022-07-07 18822, 2022

9:42 AM
alastairp

tomorrow midday/early afternoon would be OK for me

2022-07-07 18802, 2022

9:45 AM
d4rk-ph0enix has quit

2022-07-07 18834, 2022

9:45 AM
d4rkie joined the channel

2022-07-07 18850, 2022

9:49 AM
d4rkie has quit

2022-07-07 18805, 2022

9:55 AM
mayhem

oh yes, sorry about that one.

2022-07-07 18826, 2022

9:55 AM
mayhem

tomorrow afternoon is pretty bad for me.

2022-07-07 18811, 2022

9:57 AM
lucifer

next thursday/friday or maybe the monday of july 18 1 hr before regular meeting?

2022-07-07 18851, 2022

9:57 AM
zas

https://blog.metabrainz.org/2022/07/07/picard-2-8…

2022-07-07 18836, 2022

9:59 AM
mayhem

next thursday I could do. I'm out Friday/Monday that weekend.

2022-07-07 18842, 2022

9:59 AM
d4rkie joined the channel

2022-07-07 18805, 2022

10:01 AM
d4rkie has quit

2022-07-07 18841, 2022

10:01 AM
d4rkie joined the channel

2022-07-07 18842, 2022

10:20 AM
ansh

alastairp: Thanks for the detailed review :)