lucifer: driving all day today and all the hills and shit are making it hard to load figma... There's a mock-up in there with some of the stats I thought we could share, if you don't mind having a look?
There's a section that has social media post mockups
(Lucy is driving btw everyone, otherwise this would be very problematic haha)
minimal has quit
bitmap
uhh jimmy out of disk space?
all of MB is down rn because jimmy can't be accessed
lucifer: is anything running that would be eating up space? ^
zas: around?
lediur joined the channel
lediur has quit
zas: I had to delete /home/zas/temp.file to even allow postgres to start, it was 2.1GB and I really wasn't sure what else I could remove
postgres is back but we're in very dangerous territory rn...
I tried pruning unused docker images but there was nothing
aerozol
I've posted on the socials that we're working on the issue, ping me if there's updates to share, bitmap
bitmap
aerozol: I was able to restart postgres and musicbrainz seems to be back, at least, but things might be unstable if whatever caused the meltdown starts running again
aerozol
bitmap: thanks, will update now
bitmap
ty
the listenbrainz DB rose from 32GB to 119GB which I believe is when it ran out of space
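For reference, that figure is presumably Grafana graphing pg_database_size(); the same number can be checked directly with a standard catalog query along these lines:

    SELECT datname,
           pg_size_pretty(pg_database_size(datname)) AS size
    FROM pg_database
    ORDER BY pg_database_size(datname) DESC;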
Grafana link doesn't work on my phone, but was Discord onto this hours ago? 🥴
Oh wait, not hours sorry, I'm still on holiday time sorry. This is about the same time as you posted here
bitmap
ah, good, I was worried I missed some alerts
I was afk but I did see the alerts when I checked my phone
atj: zas: do we have to increase the size of /srv/postgresql? (or is that not how ZFS works)
relaxoMob has quit
relaxoMob joined the channel
Maxr1998
I only checked Grafana once it was already down and when you guys were already aware of it ^^
Thanks for resolving it so quickly btw!
bitmap
I was really worried we'd be down for a long time if I couldn't find anything on jimmy to delete and clear up some space (since PG couldn't even start, so I couldn't clear any tables or anything)
luckily zas had some random 2GB temp file lying around (hope it wasn't important). maybe we ought to keep more of those in case of emergency lol
derwin joined the channel
derwin
is the site still broken, or is the endless spinning when I try to add an artist from the release editor on this gigantic release some other issue?
bitmap
derwin: there have been quite a few artists/releases added since, so perhaps some other issue (but I'm not sure about adding artists from the release editor specifically)
if you see anything relevant in the browser console or network panel I can take a look
lucifer
bitmap: hi!
just woke up
relaxoMob has quit
bitmap
lucifer: hey
lucifer
where do you see the LB db size? i just checked and the largest table is 5G
bitmap
from the Grafana link above
lucifer
oh my bad, i had the query wrong. yes i see a 100G table.
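A sketch of a per-table size query that counts TOAST and indexes as well; the earlier 5G number was likely from pg_relation_size(), which only measures the main heap (a guess, not stated in the chat):

    SELECT n.nspname || '.' || c.relname AS table_name,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
    ORDER BY pg_total_relation_size(c.oid) DESC
    LIMIT 10;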
bitmap
any idea what's wrong, or is it expected to grow that much?
lucifer
nope, i can try to remove some of the autogenerated data.
bitmap
/srv/postgresql is apparently only 258G but I assume that can be increased if needed
lucifer
yeah that seems very weird
bitmap
I'm not sure how that is calculated if musicbrainz_db alone is 259GB (is that taking compression into account?)
lucifer
or maybe zfs commands need to be used to obtain the free disk space for it.
bitmap
yea zpool list shows 68GB free
much of the currently used space is likely WAL files
lucifer
ah okay makes sense
bitmap
there is 1.5 TB of WAL files, which is a crazy amount of writes
trying to get those to drop atm
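The WAL backlog can be measured from inside Postgres; a minimal sketch (pg_ls_waldir() needs superuser or the pg_monitor role):

    SELECT count(*) AS segments,
           pg_size_pretty(sum(size)) AS total_size
    FROM pg_ls_waldir();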
lucifer
are writes still happening?
bitmap
I don't think so, since the WAL graph isn't a 45 degree line anymore (lol)
lucifer
weird so many writes for 5 hours
relaxoMob joined the channel
bitmap
I got WAL archiving working again so it should start dropping soon
but there are close to 100,000, which exceeds anything I've seen before by like 10x
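Archiving progress is visible in pg_stat_archiver, e.g.:

    SELECT archived_count, last_archived_wal, last_archived_time,
           failed_count, last_failed_wal
    FROM pg_stat_archiver;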
lucifer
can we know what the writes were that created those wal files?
table name maybe?
nullhawk joined the channel
nullhawk has quit
bitmap
pg_stat_all_tables will probably help
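A sketch of the kind of pg_stat_all_tables query that surfaces the heaviest writers (counters are cumulative since the last stats reset):

    SELECT schemaname, relname, n_tup_ins, n_tup_upd, n_tup_del
    FROM pg_stat_all_tables
    ORDER BY n_tup_ins DESC
    LIMIT 10;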
lucifer
bitmap: i can run vacuum on the big table and try seeing if that reduces the space. probably should, i checked the rows and there is no row bigger than 1 MB. 24K rows in table. but that would generate more wal i guess so should i wait or do it?
bitmap
I'd wait a bit until WAL drops
lucifer
makes sense
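The trade-off being weighed, roughly: plain VACUUM only marks dead rows as reusable and won't shrink the files, while VACUUM FULL returns space to the OS but rewrites the table, takes an exclusive lock, and is itself WAL-logged. The table name below is the one that comes up later in the chat, so treat it as an assumption:

    -- frees dead space for reuse, but does not return it to the OS
    VACUUM (VERBOSE, ANALYZE) statistics.year_in_music;
    -- returns space to the OS, but locks the table and generates more WAL:
    -- VACUUM FULL statistics.year_in_music;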
bitmap
in musicbrainz_db, mapping.canonical_release_tmp has the most n_tup_ins by far
in listenbrainz, it's pg_toast_160991024 (so dunno, I guess oversized columns?)
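The numeric TOAST relation can be traced back to its owning table via pg_class.reltoastrelid; a sketch using the name from the stats above:

    SELECT n.nspname, c.relname
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.reltoastrelid = 'pg_toast.pg_toast_160991024'::regclass;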
lucifer
we store json in one of the columns so probably that
the mapping schema i had changed all to unlogged
so it shouldn't have created any wal
bitmap
there is an INSERT INTO statistics.year_in_music statement in the pg logs which has a crazy json document (was holding pg up for like a minute)
lucifer
hmm i see.
bitmap
(meant I was holding the "page up" key on my keyboard for a minute, not that the query was holding postgres up. :) realized that phrasing was confusing)
lucifer
ah okay
derwin
bleh, I guess I will just re-start adding this 50 track release :/
bitmap
postgres said the insert took 1428.462 ms so not sure if it was an issue, really
though I guess there are many of these
how big is the average "data" column on year_in_music and how many rows are expected?
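A sketch of how that could be measured directly; the table and column names are the ones mentioned in the chat:

    SELECT count(*) AS n_rows,
           pg_size_pretty(avg(pg_column_size(data))::bigint) AS avg_data,
           pg_size_pretty(max(pg_column_size(data))::bigint) AS max_data
    FROM statistics.year_in_music;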
lucifer
i had checked with pg_column_size and the like, rows are less than 1MB in size which checks out. ~25K rows.
most people would have less data than 1MB so 20G is my estimate of how large the table should be.
95 G uhhh, i have a hunch on how that could have happened. the table is jsonb, so every data point update creates a new toast entry.
there are about 10+ of those, for 8k rows. i guess in the worst case that could somehow balloon into that.
i am not sure about the WAL space, what's the relation between table/row size and the WAL it generates?
1.5TB seems very excessive, and we have run these queries multiple times last year without any issues.
so i am still unsure on if this is the actual cause.
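One way to test the jsonb-update hunch is to compare update counts against dead tuples for the table; a sketch, again assuming the table name from above:

    SELECT n_tup_upd, n_tup_hot_upd, n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
    WHERE schemaname = 'statistics' AND relname = 'year_in_music';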
bitmap
WAL actually started rising at around 18:15 yesterday and it looks like mbid mapper stuff was running at this time, so perhaps a combination of that + YIM? you said the former was moved to unlogged tables, but something is amiss
since PG is still complaining of too-frequent checkpoints during that time
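"Checkpoints occurring too frequently" generally means WAL is being written faster than max_wal_size allows between checkpoints; the relevant counters can be checked with something like this (on Postgres 17+ the checkpoint columns moved to pg_stat_checkpointer):

    SHOW max_wal_size;
    SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;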
lucifer
hmm i can check the mapping schema to find any logged tables
relaxoMob has quit
just checked all the expected ones are indeed unlogged.
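A sketch of that check, assuming the schema is literally named mapping ('u' = unlogged, 'p' = permanent, i.e. WAL-logged):

    SELECT c.relname, c.relpersistence
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'mapping' AND c.relkind = 'r';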
bitmap
hrm. well I don't really see anything else in the logs and the tup stats only point to the toast table otherwise. YIM was also running at 18:00 yesterday it seems
lucifer
umm i see 1200 unprocessed messages in RMQ, possibly 100 or so are for YIM which failed to insert because of some unrelated error. i have stopped the container so that it doesn't try to insert.
bitmap
👍
lucifer
wal archiving seems to have stopped
bitmap
daily json dumps started on 1/3 at 00:00 utc and they are still running