#metabrainz


      • Maxr1998 joined the channel
      • Maxr1998_ has quit
      • aerozol
        lucifer: driving all day today and all the hills and shit are making it hard to load figma... There's a mock-up in there with some of the stats I thought we could share, if you don't mind having a look?
      • There's a section that has social media post mockups
      • I'll post a screenshot if it ever loads 🤔
      • Honestly, any annual stats would be interesting!
      • You grab em and I'll make them look interesting 😁
      • (Lucy is driving btw everyone, otherwise this would be very problematic haha)
      • minimal has quit
      • bitmap
        uhh jimmy out of disk space?
      • all of MB is down rn because jimmy can't be accessed
      • lucifer: is anything running that would be eating up space? ^
      • zas: around?
      • lediur joined the channel
      • lediur has quit
      • zas: I had to delete /home/zas/temp.file to even allow postgres to start, it was 2.1GB and I really wasn't sure what else I could remove
      • postgres is back but we're in very dangerous territory rn...
      • I tried pruning unused docker images but there was nothing
      • aerozol
        I've posted on the socials that we're working on the issue, ping me if there's updates to share bitmap
      • bitmap
        aerozol: I was able to restart postgres and musicbrainz seems to be back, at least, but things might be unstable if whatever caused the meltdown starts running again
      • aerozol
        bitmap: thanks, will update now
      • bitmap
        ty
      • the listenbrainz DB rose from 32GB to 119GB which I believe is when it ran out of space
      • aerozol
        Year in music related?
      • bitmap
        also, musicbrainz_db dropped from 314GB to 259GB afterward, but I believe that's because the mapping.* schema is now empty
      • yeah, possibly, I'm not sure what was running earlier
      • the standby on hendrix seems to still be in recovery, that might take a while
      • aerozol
      • bitmap
        XD
      • aerozol
        Grafana link doesn't work on my phone, but was Discord onto this hours ago 🥴
      • Oh wait, not hours, sorry, I'm still on holiday time. This is about the same time as you posted here
      • bitmap
        ah, good, I was worried I missed some alerts
      • I was afk but I did see the alerts when I checked my phone
      • atj: zas: do we have to increase the size of /srv/postgresql? (or is that not how ZFS works)
      • relaxoMob has quit
      • relaxoMob joined the channel
      • Maxr1998
        I only checked Grafana once it was already down and when you guys were already aware of it ^^
      • Thanks for resolving it so quickly btw!
      • bitmap
        I was really worried we'd be down for a long time if I couldn't find anything on jimmy to delete and clear up some space (since PG couldn't even start, so I couldn't clear any tables or anything)
      • luckily zas had some random 2GB temp file lying around (hope it wasn't important). maybe we ought to keep more of those in case of emergency lol
      • derwin joined the channel
      • derwin
        is the site still broken, or is the endless spinning when I try to add an artist from the release editor on this gigantic release some other issue?
      • bitmap
        derwin: there have been quite a few artists/releases added since, so perhaps some other issue (but I'm not sure about adding artists from the release editor specifically)
      • if you see anything relevant in the browser console or network panel I can take a look
      • lucifer
        bitmap: hi!
      • just woke up
      • relaxoMob has quit
      • bitmap
        lucifer: hey
      • lucifer
        where do you see the LB db size? i just checked and the largest table is 5G
      • bitmap
        from the Grafana link above
      • lucifer
        oh my bad, i had the query wrong. yes i see a 100G table.
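
A minimal sketch of a per-table size query that counts TOAST and index data as well as the main heap (measuring only the main relation would miss the TOAST table where oversized jsonb values live); not necessarily the exact query either of them ran:

```sql
-- Largest tables in the current database, counting heap + TOAST + indexes.
-- pg_relation_size() alone reports only the main heap, which is how a ~5G
-- table can actually occupy ~100G once oversized jsonb values are counted.
SELECT c.relname,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
       pg_size_pretty(pg_relation_size(c.oid))       AS heap_only
FROM pg_class c
WHERE c.relkind = 'r'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
```
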
      • bitmap
        any idea what's wrong, or is it expected to grow that much?
      • lucifer
        nope, i can try to remove some of the autogenerated data.
      • bitmap
        /srv/postgresql is apparently only 258G but I assume that can be increased if needed
      • lucifer
        yeah that seems very weird
      • bitmap
        I'm not sure how that is calculated if musicbrainz_db alone is 259GB (is that taking compression into account?)
      • lucifer
        or maybe zfs commands need to be used to obtain the free disk space for it.
      • bitmap
        yea zpool list shows 68GB free
      • much of the currently used space is likely WAL files
      • lucifer
        ah okay makes sense
      • bitmap
        there is 1.5 TB of WAL files, which is a crazy amount of writes
      • trying to get those to drop atm
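
The WAL backlog can also be measured from inside Postgres; a sketch, assuming PostgreSQL 10 or later (where the directory is pg_wal and pg_ls_waldir() exists):

```sql
-- Number and total size of WAL segments currently sitting in pg_wal.
SELECT count(*)                  AS segments,
       pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();
```
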
      • lucifer
        are writes still happening?
      • bitmap
        I don't think so, since the WAL graph isn't a 45 degree line anymore (lol)
      • lucifer
        weird so many writes for 5 hours
      • relaxoMob joined the channel
      • bitmap
        I got WAL archiving working again so it should start dropping soon
      • but there are close to 100,000, which exceeds anything I've seen before by like 10x
      • lucifer
        can we know what the writes were that created those wal files?
      • table name maybe?
      • nullhawk joined the channel
      • nullhawk has quit
      • bitmap
        pg_stat_all_tables will probably help
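
For example, something along these lines, ordering by tuple-write counts to see which tables took the writes (a sketch, not the exact query that was run):

```sql
-- Tables with the most inserted/updated/deleted tuples since stats reset.
SELECT schemaname, relname, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_all_tables
ORDER BY n_tup_ins + n_tup_upd DESC
LIMIT 20;
```
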
      • lucifer
        bitmap: i can run vacuum on the big table and try seeing if that reduces the space. probably should, i checked the rows and there is no row bigger than 1 MB. 24K rows in table. but that would generate more wal i guess so should i wait or do it?
      • bitmap
        I'd wait a bit until WAL drops
      • lucifer
        makes sense
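
For reference on the vacuum idea above: a plain VACUUM only marks dead tuples as reusable and rarely shrinks the file, while VACUUM FULL rewrites the table (and, for a logged table, that rewrite itself generates WAL, hence the suggestion to wait). A sketch, assuming the big table is statistics.year_in_music, the table named later in this log:

```sql
-- Plain VACUUM: frees dead tuples for reuse but normally does not shrink
-- the on-disk file.
VACUUM (VERBOSE, ANALYZE) statistics.year_in_music;

-- VACUUM FULL rewrites the whole table and does return space to the OS,
-- but it takes an ACCESS EXCLUSIVE lock and writes the rewrite to WAL,
-- so it is better run after the WAL backlog has drained.
-- VACUUM FULL statistics.year_in_music;
```
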
      • bitmap
        in musicbrainz_db, mapping.canonical_release_tmp has the most n_tup_ins by far
      • in listenbrainz, it's pg_toast_160991024 (so dunno, I guess oversized columns?)
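
The owning table of a TOAST relation can be looked up in pg_class; a sketch using the relation name quoted above:

```sql
-- Which table's oversized values are stored in pg_toast_160991024?
SELECT c.oid::regclass AS owning_table
FROM pg_class c
WHERE c.reltoastrelid = 'pg_toast.pg_toast_160991024'::regclass;
```
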
      • lucifer
        we store json in one of the columns so probably that
      • the mapping schema i had changed all to unlogged
      • so it shouldn't have created any wal
      • bitmap
        there is an INSERT INTO statistics.year_in_music statement in the pg logs which has a crazy json document (was holding pg up for like a minute)
      • lucifer
        hmm i see.
      • bitmap
        (meant I was holding the "page up" key on my keyboard for a minute, not that the query was holding postgres up. :) realized that phrasing was confusing)
      • lucifer
        ah okay
      • derwin
        bleh, I guess I will just re-start adding this 50 track release :/
      • bitmap
        postgres said the insert took 1428.462 ms so not sure if it was an issue, really
      • though I guess there are many of these
      • how big is the average "data" column on year_in_music and how many rows are expected?
      • lucifer
        i had checked with pg_column_size and the likes, rows are less than 1MB in size which checks out. ~25K rows.
      • most people would have less data than 1MB so 20G is my estimate of how large the table should be.
      • 95 G uhhh, i have a hunch on how that could have happened. the data column is jsonb so every update writes a new TOAST copy of the value, and there is an update per data point.
      • there are about 10+ of those, for 8k rows. i guess in the worst case that could somehow balloon into that.
      • the WAL space i am not sure about, what's the relation between table/row size and the WAL it generates.
      • 1.5TB seems very excessive, and we have run these queries multiple times last year without any issues.
      • so i am still unsure on if this is the actual cause.
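
A sketch of the sizing check described above (pg_column_size on the jsonb "data" column versus the table's total on-disk footprint); the column and table names are the ones mentioned earlier in the log:

```sql
-- Average and maximum stored (compressed) size of the jsonb "data" column,
-- compared with the table's total footprint including TOAST and indexes.
SELECT count(*)                                                           AS row_count,
       pg_size_pretty(avg(pg_column_size(data))::bigint)                  AS avg_data_size,
       pg_size_pretty(max(pg_column_size(data))::bigint)                  AS max_data_size,
       pg_size_pretty(pg_total_relation_size('statistics.year_in_music')) AS table_total
FROM statistics.year_in_music;
```
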
      • bitmap
        WAL actually started rising at around 18:15 yesterday and it looks like mbid mapper stuff was running at this time, so perhaps a combination of that + YIM? you said the former was moved to unlogged tables, but something is amiss
      • since PG is still complaining of too-frequent checkpoints during that time
      • lucifer
        hmm i can check the mapping schema to find any logged tables
      • relaxoMob has quit
      • just checked all the expected ones are indeed unlogged.
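
The check could look like this: list the persistence flag of every relation in the mapping schema (a sketch):

```sql
-- Persistence of every ordinary table in the mapping schema:
-- 'p' = permanent (WAL-logged), 'u' = unlogged, 't' = temporary.
SELECT c.relname, c.relpersistence
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'mapping'
  AND c.relkind = 'r'
ORDER BY c.relname;
```
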
      • bitmap
        hrm. well I don't really see anything else in the logs and the tup stats only point to the toast table otherwise. YIM was also running at 18:00 yesterday it seems
      • lucifer
        18:00 utc?
      • bitmap
        I see an insert at 18:25 UTC
      • lucifer
        makes sense
      • the database size shot up at 00:19 UTC
      • derwin has left the channel
      • bitmap
        you had already tested the YIM stuff on jimmy though?
      • lucifer
        yes it was run a couple of weeks ago on jimmy
      • relaxoMob joined the channel
      • this was the final production run.
      • bitmap
        weird :\
      • lucifer
      • is the database size calculation working correctly?
      • hmm i guess on normal days the changes are too small to be noticeable in this graph
      • i checked the logs on the possible previous YIM runs, and wal didn't rise like this on those days
      • 3 peaks all of mapping stuff.
      • bitmap
        which day was the previous YIM run?
      • maybe something on aretha was slowing down the WAL archiver
      • lucifer
        18th December
      • bitmap
        load average on aretha was significantly higher during this time than on december 18th (maybe json dumps running or something?)
      • lucifer
        i see
      • bitmap
        I believe this could possibly slow down WAL archiving enough to cause a build up
      • lucifer
        oh makes sense
      • bitmap
      • (graph uses logarithmic scale)
      • lucifer
        umm i see 1200 unprocessed messages in RMQ, possibly a 100 or so are for YIM which failed to insert because of some unrelated error. i have stopped the container so that it doesn't try to insert.
      • bitmap
        👍
      • lucifer
        wal archiving seems to have stopped
      • bitmap
        daily json dumps started on 1/3 at 00:00 utc and they are still running
      • lucifer
      • bitmap
        we really need to move this to another server
      • lucifer
        do you mean 1/4 at 00:00?
      • bitmap
        no :\
      • lucifer
        oh wow
      • bitmap
        Archiving segment 2539 of 10347
      • it's processing one every few seconds at least
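
Archiver progress is also visible in pg_stat_archiver; a sketch:

```sql
-- Total segments archived, the most recently archived WAL file, and any
-- archive_command failures.
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal
FROM pg_stat_archiver;
```
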
      • lucifer
        i see, the wal size isn't decreasing at the same speed.
      • maybe postgres will catch up in a while
      • bitmap
        hmm, maybe they're being kept around until the next barman backup
      • I'll delete the latest backup and start a new one
      • lucifer
        pg_stat_replication shows two receivers, hendrix and streaming barman
      • bitmap
        yea, hendrix is up to date but streaming barman is still significantly behind & in catchup mode
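
A sketch of the kind of check meant here, showing each receiver's lag against the current WAL position (column names per PostgreSQL 10+):

```sql
-- Per-receiver replication lag in bytes behind the current WAL insert position.
-- replay_lsn can be NULL for a pure WAL-streaming client (e.g. barman's
-- streaming receiver), which only receives and never replays.
SELECT application_name, state, sync_state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS send_lag,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;
```
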
      • relaxoMob has quit
      • relaxoMob joined the channel
      • BrainzGit
        [musicbrainz-server] reosarevok opened pull request #3138 (master…MBS-13420-test-fix): Fix Selenium test broken by MBS-13420 https://github.com/metabrainz/musicbrainz-serve...