aerozol: spotify probably has an abundance of both ;). but yes i think the huesound design can improve a lot. that said, i dislike Spotify's current design for the page too.
2022-09-27 27017, 2022
aerozol
Oh yes I hate it :D
2022-09-27 27053, 2022
aerozol
Halfway between huesound and their mess would probably be perfect!
2022-09-27 27058, 2022
aerozol
Maybe a bit closer to huesound...
2022-09-27 27008, 2022
aerozol
Good morning by the way!
2022-09-27 27049, 2022
aerozol
Wooo I have a local LB server running! No CSS by the looks of it, but a start
lucifer: thanks for fixing CB, I wonder if it continued to work on CI because we had cached layers and we never rebuilt from scratch?
2022-09-27 27020, 2022
alastairp
ansh: give me a moment and I'll look through your licensing questions, did you clarify an answer with lucifer?
2022-09-27 27052, 2022
alastairp
hi Pratha-Fish, we can look into this in a bit more detail, but some things that I can see are 1) even if writing is slower for zstd that's not a huge problem, this is a one-off operation and if we can make smaller files then that's an OK tradeoff to have even if it takes a bit longer, 2) we already have proof that zstd is smaller than gzip, given that we've rewritten the files, so we should look into this and see what's going on
2022-09-27 27028, 2022
alastairp
3) the "0" time for writing csv+gzip looks really suspicious, and makes me think that something is wrong. what are you measuring here - the read/write time for 1 file, or a set of files?
2022-09-27 27056, 2022
alastairp
Pratha-Fish: where is the code to generate this graph? I can have a look at it
yvanzo: when around, do make sure that seems fine to you since you've looked at pot stuff more than I have :)
2022-09-27 27032, 2022
lucifer
aerozol: halfway sounds good indeed. re the tests, you'll also need to make the other change monkey mentioned in his comment but other than that the output looks good to push.
2022-09-27 27028, 2022
reosarevok
outsidecontext: good catch! :D Not sure how we didn't notice for 8 years, but
2022-09-27 27034, 2022
lucifer
alastairp: yes, it was the cache indeed. many of the builds on existing PRs failed when i tried to push a development image from them after updating master, which invalidated the older cache layers.
2022-09-27 27006, 2022
alastairp
lucifer: right, got it. we can just rebase those
2022-09-27 27014, 2022
lucifer
yup
2022-09-27 27036, 2022
alastairp
I think I'll review CB things today - try and get all of that finished up
Pratha-Fish
2) Maybe it's because I didn't set the compression level in this particular test
2022-09-27 27052, 2022
Pratha-Fish
3) The "0" time is because the files are already in CSV+GZIP format, so I didn't benchmark the write times for the same.
2022-09-27 27053, 2022
Pratha-Fish
Lastly, the write time here is the total time taken to write 100 files if I remember correctly
2022-09-27 27005, 2022
CatQuest
bitmap: *please* 🙇 consider making the new "Credited as" button a toggle instead of a button one always has to press to use relationship credits 🙇 thank you
2022-09-27 27038, 2022
alastairp
Pratha-Fish: I'm just reading this notebook - for the cells where you test write times, it looks like you include the time needed to read the csv.gz too?
2022-09-27 27009, 2022
alastairp
this would be more accurate if we read all of the data files into memory first, and then start the timer and do the writes
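(A minimal sketch of that approach; the file names and the pandas-based I/O here are assumptions, not the notebook's actual code:)
```python
import time
import pandas as pd

# Load every source file into memory first, so disk reads never
# overlap with the timed write phase.
paths = [f"listens_{i}.parquet" for i in range(100)]  # hypothetical names
frames = [pd.read_parquet(p) for p in paths]

start = time.perf_counter()
for i, df in enumerate(frames):
    df.to_parquet(f"out_{i}.parquet", compression="zstd")
print(f"total write time for {len(frames)} files: {time.perf_counter() - start:.2f}s")
```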
2022-09-27 27052, 2022
Pratha-Fish
Ooh that really makes a lot of sense!
2022-09-27 27056, 2022
alastairp
especially because we should keep in mind that linux does lots of smart things if you read the same file from disk many times one after another
2022-09-27 27021, 2022
alastairp
it caches files in memory a lot of the time - this means that the first time you run it, it will be slower, and the 2nd time it will be much much faster
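(The effect is easy to demonstrate: timing two back-to-back reads of the same file, assuming it starts out uncached, usually shows the second read served from the page cache:)
```python
import time

def timed_read(path):
    start = time.perf_counter()
    with open(path, "rb") as f:
        f.read()
    return time.perf_counter() - start

# hypothetical file; the first (cold) read hits the disk, the second
# (warm) read is typically served from the Linux page cache
print("cold:", timed_read("big_file.zst"))
print("warm:", timed_read("big_file.zst"))
```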
2022-09-27 27025, 2022
Pratha-Fish
wow I didn't take that into account
2022-09-27 27041, 2022
alastairp
so we're actually including a bias against the first test (zst parquet in your example)
2022-09-27 27015, 2022
Pratha-Fish
Alright, let me run it once again, but this time I'll load the tables into memory
2022-09-27 27041, 2022
Pratha-Fish
That's a nice new addition to the "lessons learnt" section too :)
2022-09-27 27049, 2022
alastairp
I see that in our test_rec_track_checker we used level 10 zst compression, whereas in test_file_type_io_testing you're just using the default (3)
2022-09-27 27059, 2022
alastairp
so that might also be a factor in your file size differences
2022-09-27 27034, 2022
Pratha-Fish
alastairp: indeed, that seems to be the case
2022-09-27 27009, 2022
Pratha-Fish
But the issue is, even with these same settings the tests seemed to prefer zstd earlier
2022-09-27 27011, 2022
alastairp
btw, great to see title + axis labels + legend in your graphs! very easy to understand!
2022-09-27 27046, 2022
alastairp
yes, you're right - that's a bit confusing. but the disk cache issue might really be part of the problem
2022-09-27 27014, 2022
alastairp
we could try and test independently from disk access - you could read/write into a `BytesIO` object
this is a thing that "looks like" a file handle, but is actually all in memory
2022-09-27 27016, 2022
alastairp
it means that we would really only be testing the CPU part of compressing the data, rather than any overhead that reading from/writing to the disk might bring
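(A sketch of that idea, assuming pandas and a single pre-loaded DataFrame:)
```python
import io
import time
import pandas as pd

df = pd.read_parquet("sample.parquet")  # hypothetical input, loaded once up front

# Writing into an in-memory buffer times only the compression work;
# no disk I/O happens inside the measured section.
buf = io.BytesIO()
start = time.perf_counter()
df.to_parquet(buf, compression="zstd")
print(f"compress: {time.perf_counter() - start:.3f}s, size: {buf.tell()} bytes")
```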
2022-09-27 27049, 2022
atj
simplest to just copy the files to /dev/shm
2022-09-27 27014, 2022
alastairp
ah, cool. thanks atj!
2022-09-27 27058, 2022
alastairp
Pratha-Fish: so, as atj just pointed out, there is a folder on wolf `/dev/shm`, this is a "memory filesystem"
2022-09-27 27050, 2022
alastairp
it looks just like a filesystem, but is only stored in memory, not on disk. We could copy our base files there, and then do the normal tests - reading/writing etc, but use this as the target location. this will avoid all issues that reading from disk may bring
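(Copying the inputs there is straightforward; directory names in this sketch are hypothetical:)
```python
import shutil
from pathlib import Path

# /dev/shm is a tmpfs, so everything copied here lives in RAM
shm_dir = Path("/dev/shm/benchmark")
shm_dir.mkdir(exist_ok=True)
for src in Path("data").glob("*.zst"):  # hypothetical source directory
    shutil.copy(src, shm_dir / src.name)
# ...then point the usual read/write benchmarks at shm_dir
```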
2022-09-27 27004, 2022
alastairp
Pratha-Fish: one more thing I thought of - you said "3) The "0" time is because the files are already in CSV+GZIP format, so I didn't benchmark the write times for the same."
2022-09-27 27036, 2022
alastairp
but is that correct now? because we deleted all of the gz files. is your sample using gz files (which you may have copied to your home directory?) or is it reading the zst files?
2022-09-27 27027, 2022
alastairp
atj: you know, I've seen /dev/shm in the output of mount(1) so often and wondered what it was, but never actually looked into it
2022-09-27 27022, 2022
atj
alastairp: not sure on the origins of it myself, but I assume shm stands for shared memory
2022-09-27 27034, 2022
elomatreb[m]
You probably just want a regular tmpfs, rather than /dev/shm
2022-09-27 27050, 2022
atj
elomatreb[m]: it is a regular tmpfs?
2022-09-27 27006, 2022
alastairp
elomatreb[m]: are you aware of specific issues in using /dev/shm randomly as a scratch space?
2022-09-27 27017, 2022
elomatreb[m]
No, but it would be weird
2022-09-27 27026, 2022
elomatreb[m]
Same reason you don't put your temporary files into /boot
2022-09-27 27036, 2022
atj
well boot isn't world writable
2022-09-27 27004, 2022
alastairp
also, my /boot is a separate smaller partition :) (but that's labouring the point)
2022-09-27 27034, 2022
atj
for the purposes of some small benchmarks, I think we're OK
2022-09-27 27049, 2022
alastairp
while I agree that setting up a specific tmpfs for this task would be more correct than using /dev/shm, in the grand scheme of things it'd be even easier to just read the files into a bytesio and use that for the tests
2022-09-27 27053, 2022
alastairp
so we're in sort of a middleground here
2022-09-27 27029, 2022
Pratha-Fish
alastairp: atj thanks for the tips :)
2022-09-27 27024, 2022
atj
would be interesting to include different ZSTD levels in the benchmark to see what the CPU/size tradeoff is
2022-09-27 27040, 2022
Pratha-Fish
alastairp: Nice catch. We don't use the gzip files anymore. Which also means even the GZIP tests aren't GZIP tests anymore!
2022-09-27 27040, 2022
Pratha-Fish
I think the best option here might be to load all the files at once into a list, and then test the read/write times independently
2022-09-27 27009, 2022
alastairp
atj: yeah, I did a bit of that when I was looking at some dumps code that I worked on a few months back
2022-09-27 27047, 2022
alastairp
anything over about 12 started getting really slow for not much more benefit
2022-09-27 27008, 2022
atj
diminishing returns
2022-09-27 27017, 2022
alastairp
it's definitely much faster if you build a compression-specific dictionary, but that's a lot of drama to need to carry around the dictionary for anyone who wants to uncompress the archives
2022-09-27 27031, 2022
atj
3 vs. 10 or something
2022-09-27 27003, 2022
alastairp
sure - Pratha-Fish, you already have the default code for compression level 3, and you have the code in the other notebook for compression level 10. You could add that as 2 different columns in the graphs
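(One way to get both columns, sketched with pyarrow on toy data; the real tables would come from the notebook:)
```python
import os
import time
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"msid": ["a", "b", "c"], "listens": [3, 5, 7]})  # toy data

# level 3 is the default mentioned above; level 10 matches the
# test_rec_track_checker notebook
for level in (3, 10):
    out = f"sample_zstd{level}.parquet"
    start = time.perf_counter()
    pq.write_table(table, out, compression="zstd", compression_level=level)
    print(f"level {level}: {time.perf_counter() - start:.3f}s, "
          f"{os.path.getsize(out)} bytes")
```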
2022-09-27 27010, 2022
atj
gzip has levels too, I think the compression does improve quite a lot at higher levels
2022-09-27 27041, 2022
alastairp
at a speed tradeoff?
2022-09-27 27014, 2022
alastairp
it's true - we were always testing gz default level against things like bzip2 and xz default levels, but then testing 10 different zstd levels and saying that it was clearly better
2022-09-27 27042, 2022
atj
IIRC gzip uses a lot more CPU at higher levels
2022-09-27 27051, 2022
alastairp
but even in my tests - I was seeing for multi-GB files that even zst -10 was faster _and_ gave smaller results than gzip's default compression
2022-09-27 27007, 2022
atj
to be fair, how old is gzip? :)
2022-09-27 27008, 2022
alastairp
and much much faster decompressing too
2022-09-27 27017, 2022
alastairp
absolutely
2022-09-27 27017, 2022
Pratha-Fish
alastairp: on it
2022-09-27 27040, 2022
alastairp
so I think we're making the right decision with zst, there's just a question of what we want our speed/size tradeoff to be
2022-09-27 27015, 2022
alastairp
especially given that the compression is a once-off operation, we can definitely afford to take longer at the compression stage if it gives significantly better size results
2022-09-27 27037, 2022
Pratha-Fish
not to mention, pyarrow has significantly increased the write speed as well
2022-09-27 27019, 2022
Pratha-Fish
If we end up making the cleanup process multithreaded, we could also leverage arrow's batch writing functions to further improve the speeds
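(A hedged sketch of what that could look like with pyarrow's ParquetWriter; the schema and the batch source below are stand-ins, not the real cleanup code:)
```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("recording_msid", pa.string()),
                    ("listen_count", pa.int64())])

def cleaned_batches():
    # stand-in for the real cleanup logic, yielding RecordBatches
    yield pa.record_batch([pa.array(["a", "b"]), pa.array([1, 2])],
                          schema=schema)

# stream batches through one writer instead of materialising a full table
with pq.ParquetWriter("cleaned.parquet", schema, compression="zstd") as writer:
    for batch in cleaned_batches():
        writer.write_batch(batch)
```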
2022-09-27 27014, 2022
alastairp
Pratha-Fish: that being said, I was just looking at my calendar and it reminded me that the submission week starts in only 3 weeks' time - so given our previous experience of the re-writing taking ~1 week, I think we should start test runs of creating our new version quite soon
2022-09-27 27027, 2022
alastairp
because if we find a mistake after running it for 4 days, we will have to re-run it
2022-09-27 27044, 2022
alastairp
so let's work on these graphs once more, but then definitely focus on getting the new dataset created
2022-09-27 27026, 2022
Pratha-Fish
alastairp: absolutely
2022-09-27 27027, 2022
alastairp
if you want to write up a bunch of posts about things that you learned, then this also has to be finished before your submission deadline, so I'd think about finishing these within the next 2 weeks so that we have time to polish them if needed
2022-09-27 27021, 2022
Pratha-Fish
This is also one of the reasons why I was hesitant about restructuring the notebooks and older scripts. It could get in the way of the primary objective, given the time constraints
2022-09-27 27059, 2022
Pratha-Fish
alastairp: Yes. Let's hop on to the cleanup script once this benchmark is done :)
2022-09-27 27012, 2022
alastairp
ok, agreed. so if you need to re-do an older script because you need it as the base of your post then let's do that
2022-09-27 27020, 2022
alastairp
otherwise, let's leave it
2022-09-27 27040, 2022
Pratha-Fish
sounds good 👍
2022-09-27 27008, 2022
Pratha-Fish
P.S. Let's also factor in the fact that I have exams on the following schedule:
2022-09-27 27008, 2022
Pratha-Fish
11-13 Oct (tentative)
2022-09-27 27008, 2022
Pratha-Fish
16th Oct (fixed)
2022-09-27 27008, 2022
Pratha-Fish
So there goes another 5 days :skull_and_crossbones:
2022-09-27 27004, 2022
alastairp
definitely. so let's focus on creating the new dataset as a top priority, and how about you tell me in the next few days what topics you would like to write about so that we can see how many of them we think that we can do. Given your constraints I think that we should think about a maximum of two (in addition to a general blog post about what you did for the project)
2022-09-27 27004, 2022
Pratha-Fish
alastairp: Great. I'll take a look at the journal, and pick 2 topics that I could expand the most upon.
2022-09-27 27020, 2022
Pratha-Fish
The speed benchmarks sound good for starters
2022-09-27 27023, 2022
lucifer
lol trying to copy 40 GB from SSD to HDD using 4 threads brought my laptop to its knees. git is taking 10 mins to squash 4 commits.
2022-09-27 27025, 2022
alastairp
agreed
2022-09-27 27015, 2022
alastairp
lucifer: we ran out of disk space in a machine at the uni, so I moved the docker root from the / ssd to our large spinning disk
2022-09-27 27022, 2022
alastairp
and things got _so slow_
2022-09-27 27036, 2022
alastairp
I can't believe we used to consider these things normal
2022-09-27 27055, 2022
alastairp
maybe it's because we saw that we had so many resources that we just started writing really inefficient code
2022-09-27 27011, 2022
lucifer
oh yeah, docker is way too slow for me to run locally so i mostly use wolf. i am planning to buy a new SSD this week to speed up local development.
2022-09-27 27001, 2022
lucifer
(the current SSD is small and runs out of space often, so i mostly keep stuff on the HDD, hence getting a bigger SSD)
2022-09-27 27046, 2022
atj
is it NVME?
2022-09-27 27002, 2022
lucifer
atj: yes
2022-09-27 27013, 2022
mayhem goes back to digesting the large spark query in LB#2037
2022-09-27 27049, 2022
mayhem
oh, alastairp have you been getting Awair scores of 100 in the past days?
2022-09-27 27045, 2022
alastairp
mayhem: yeah, 96-100 yesterday morning
2022-09-27 27053, 2022
lucifer
mayhem: if it helps, that query 1) explodes the array of artist mbids 2) counts the number of times an artist has been listened to by a user. 3) filters the fresh releases based on this artist data 4) finally sorts it.
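(Roughly, in PySpark, with toy stand-ins for the real listens and fresh releases tables; the column names are guesses, the actual query lives in LB#2037:)
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
listens = spark.createDataFrame(
    [("u1", ["m1", "m2"]), ("u1", ["m1"])], ["user_id", "artist_mbids"])
fresh_releases = spark.createDataFrame(
    [("m1", "Album A"), ("m3", "Album B")], ["artist_mbid", "release_name"])

artist_counts = (
    listens.withColumn("artist_mbid", F.explode("artist_mbids"))  # 1) explode the mbid array
           .groupBy("user_id", "artist_mbid")
           .count())                                              # 2) listens per user/artist
result = (fresh_releases
          .join(artist_counts, "artist_mbid")                     # 3) keep releases by listened-to artists
          .orderBy(F.desc("count")))                              # 4) sort
```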
2022-09-27 27018, 2022
alastairp
not sure what weights were causing that, I didn't look in detail at CO2 or PM2.5 levels. low temperature and humidity certainly would have helped
lucifer
alastairp: mayhem: are you both available today for some discussion about incremental updates to mb_metadata_cache?
2022-09-27 27049, 2022
mayhem
i can be
2022-09-27 27025, 2022
mayhem
atj: zas: I've got a raspberry pi here in the office running the door opener. The load on the RPi is near zero, but network connectivity is rather spotty/shit.
2022-09-27 27058, 2022
mayhem
for instance, it takes 1-2 minutes to log in, but once logged in, everything is fine.
2022-09-27 27018, 2022
mayhem
but then sometimes operations are really slow.
2022-09-27 27035, 2022
mayhem
any ideas what might cause this -- it looks very much like a network setup issue to me.
2022-09-27 27016, 2022
mayhem
but, the telegram bot that runs there is always responsive and quick. but generally using the RPi is... meh.
2022-09-27 27040, 2022
zas
weird, which RPi version is it?
2022-09-27 27011, 2022
mayhem
3 or 4, not sure. I didn't install it -- the one I installed blew over the summer.