lucifer: driving all day today and all the hills and shit are making it hard to load figma... There's a mock-up in there with some of the stats I thought we could share, if you don't mind having a look?
There's a section that has social media post mockups
(Lucy is driving btw everyone, otherwise this would be very problematic haha)
minimal has quit
bitmap
uhh jimmy out of disk space?
all of MB is down rn because jimmy can't be accessed
lucifer: is anything running that would be eating up space? ^
zas: around?
lediur joined the channel
lediur has quit
zas: I had to delete /home/zas/temp.file to even allow postgres to start, it was 2.1GB and I really wasn't sure what else I could remove
postgres is back but we're in very dangerous territory rn...
I tried pruning unused docker images but there was nothing
aerozol
I've posted on the socials that we're working on the issue, ping me if there's updates to share, bitmap
bitmap
aerozol: I was able to restart postgres and musicbrainz seems to be back, at least, but things might be unstable if whatever caused the meltdown starts running again
aerozol
bitmap: thanks, will update now
bitmap
ty
the listenbrainz DB rose from 32GB to 119GB which I believe is when it ran out of space
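For reference, that figure is presumably Grafana graphing pg_database_size(); the same number can be checked directly with a standard catalog query along these lines:

    SELECT datname,
           pg_size_pretty(pg_database_size(datname)) AS size
    FROM pg_database
    ORDER BY pg_database_size(datname) DESC;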
Grafana link doesn't work on my phone, but was Discord onto this hours ago? 🥴
Oh wait, not hours sorry, I'm still on holiday time sorry. This is about the same time as you posted here
bitmap
ah, good, I was worried I missed some alerts
I was afk but I did see the alerts when I checked my phone
atj: zas: do we have to increase the size of /srv/postgresql? (or is that not how ZFS works)
relaxoMob has quit
relaxoMob joined the channel
Maxr1998
I only checked Grafana once it was already down and when you guys were already aware of it ^^
Thanks for resolving it so quickly btw!
bitmap
I was really worried we'd be down for a long time if I couldn't find anything on jimmy to delete and clear up some space (since PG couldn't even start, so I couldn't clear any tables or anything)
luckily zas had some random 2GB temp file lying around (hope it wasn't important). maybe we ought to keep more of those in case of emergency lol
derwin joined the channel
derwin
is the site still broken, or is the endless spinning when I try to add an artist from the release editor on this gigantic release some other issue?
bitmap
derwin: there have been quite a few artists/releases added since, so perhaps some other issue (but I'm not sure about adding artists from the release editor specifically)
if you see anything relevant in the browser console or network panel I can take a look
lucifer
bitmap: hi!
just woke up
relaxoMob has quit
bitmap
lucifer: hey
lucifer
where do you see the LB db size? i just checked and the largest table is 5G
bitmap
from the Grafana link above
lucifer
oh my bad, i had the query wrong. yes i see a 100G table.
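A sketch of a per-table size query that counts TOAST and indexes as well; the earlier 5G number was likely from pg_relation_size(), which only measures the main heap (a guess, not stated in the chat):

    SELECT n.nspname || '.' || c.relname AS table_name,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
    ORDER BY pg_total_relation_size(c.oid) DESC
    LIMIT 10;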
bitmap
any idea what's wrong, or is it expected to grow that much?
lucifer
nope, i can try to remove some of the autogenerated data.
bitmap
/srv/postgresql is apparently only 258G but I assume that can be increased if needed
lucifer
yeah that seems very weird
bitmap
I'm not sure how that is calculated if musicbrainz_db alone is 259GB (is that taking compression into account?)
lucifer
or maybe zfs commands need to be used to obtain the free disk space for it.
bitmap
yea zpool list shows 68GB free
much of the currently used space is likely WAL files
lucifer
ah okay makes sense
bitmap
there is 1.5 TB of WAL files, which is a crazy amount of writes
trying to get those to drop atm
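The WAL backlog can be measured from inside Postgres; a minimal sketch (pg_ls_waldir() needs superuser or the pg_monitor role):

    SELECT count(*) AS segments,
           pg_size_pretty(sum(size)) AS total_size
    FROM pg_ls_waldir();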
lucifer
are writes still happening?
bitmap
I don't think so, since the WAL graph isn't a 45 degree line anymore (lol)
lucifer
weird so many writes for 5 hours
relaxoMob joined the channel
bitmap
I got WAL archiving working again so it should start dropping soon
but there are close to 100,000, which exceeds anything I've seen before by like 10x
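Archiving progress is visible in pg_stat_archiver, e.g.:

    SELECT archived_count, last_archived_wal, last_archived_time,
           failed_count, last_failed_wal
    FROM pg_stat_archiver;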
lucifer
can we know what the writes were that created those wal files?
table name maybe?
nullhawk joined the channel
nullhawk has quit
bitmap
pg_stat_all_tables will probably help
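A sketch of the kind of pg_stat_all_tables query that surfaces the heaviest writers (counters are cumulative since the last stats reset):

    SELECT schemaname, relname, n_tup_ins, n_tup_upd, n_tup_del
    FROM pg_stat_all_tables
    ORDER BY n_tup_ins DESC
    LIMIT 10;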
lucifer
bitmap: i can run vacuum on the big table and try seeing if that reduces the space. probably should, i checked the rows and there is no row bigger than 1 MB. 24K rows in table. but that would generate more wal i guess so should i wait or do it?
bitmap
I'd wait a bit until WAL drops
lucifer
makes sense
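The trade-off being weighed, roughly: plain VACUUM only marks dead rows as reusable and won't shrink the files, while VACUUM FULL returns space to the OS but rewrites the table, takes an exclusive lock, and is itself WAL-logged. The table name below is the one that comes up later in the chat, so treat it as an assumption:

    -- frees dead space for reuse, but does not return it to the OS
    VACUUM (VERBOSE, ANALYZE) statistics.year_in_music;
    -- returns space to the OS, but locks the table and generates more WAL:
    -- VACUUM FULL statistics.year_in_music;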
bitmap
in musicbrainz_db, mapping.canonical_release_tmp has the most n_tup_ins by far
in listenbrainz, it's pg_toast_160991024 (so dunno, I guess oversized columns?)
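The numeric TOAST relation can be traced back to its owning table via pg_class.reltoastrelid; a sketch using the name from the stats above:

    SELECT n.nspname, c.relname
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.reltoastrelid = 'pg_toast.pg_toast_160991024'::regclass;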
lucifer
we store json in one of the columns so probably that
the mapping schema i had changed all to unlogged
so it shouldn't have created any wal
bitmap
there is an INSERT INTO statistics.year_in_music statement in the pg logs which has a crazy json document (was holding pg up for like a minute)
lucifer
hmm i see.
bitmap
(meant I was holding the "page up" key on my keyboard for a minute, not that the query was holding postgres up. :) realized that phrasing was confusing)
lucifer
ah okay
derwin
bleh, I guess I will just re-start adding this 50 track release :/
bitmap
postgres said the insert took 1428.462 ms so not sure if it was an issue, really
though I guess there are many of these
how big is the average "data" column on year_in_music and how many rows are expected?
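A sketch of how that could be measured directly; the table and column names are the ones mentioned in the chat:

    SELECT count(*) AS n_rows,
           pg_size_pretty(avg(pg_column_size(data))::bigint) AS avg_data,
           pg_size_pretty(max(pg_column_size(data))::bigint) AS max_data
    FROM statistics.year_in_music;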
lucifer
i had checked with pg_column_size and the like, rows are less than 1MB in size which checks out. ~25K rows.
most people would have less data than 1MB so 20G is my estimate of how large the table should be.
95 G uhhh, i have a hunch on how that could have happened. the table is jsonb, so every data point update creates a new toast entry.
there are about 10+ of those, for 8k rows. i guess in the worst case that could somehow balloon into that.
i am not sure about the WAL space, what's the relation between table/row size and the WAL it generates?
1.5TB seems very excessive, and we have run these queries multiple times last year without any issues.
so i am still unsure on if this is the actual cause.
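One way to test the jsonb-update hunch is to compare update counts against dead tuples for the table; a sketch, again assuming the table name from above:

    SELECT n_tup_upd, n_tup_hot_upd, n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
    WHERE schemaname = 'statistics' AND relname = 'year_in_music';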
bitmap
WAL actually started rising at around 18:15 yesterday and it looks like mbid mapper stuff was running at this time, so perhaps a combination of that + YIM? you said the former was moved to unlogged tables, but something is amiss
since PG is still complaining of too-frequent checkpoints during that time
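"Checkpoints occurring too frequently" generally means WAL is being written faster than max_wal_size allows between checkpoints; the relevant counters can be checked with something like this (on Postgres 17+ the checkpoint columns moved to pg_stat_checkpointer):

    SHOW max_wal_size;
    SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;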
lucifer
hmm i can check the mapping schema to find any logged tables
relaxoMob has quit
just checked all the expected ones are indeed unlogged.
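A sketch of that check, assuming the schema is literally named mapping ('u' = unlogged, 'p' = permanent, i.e. WAL-logged):

    SELECT c.relname, c.relpersistence
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'mapping' AND c.relkind = 'r';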
bitmap
hrm. well I don't really see anything else in the logs and the tup stats only point to the toast table otherwise. YIM was also running at 18:00 yesterday it seems
lucifer
umm i see 1200 unprocessed messages in RMQ, possibly 100 or so are for YIM which failed to insert because of some unrelated error. i have stopped the container so that it doesn't try to insert.
bitmap
👍
lucifer
wal archiving seems to have stopped
bitmap
daily json dumps started on 1/3 at 00:00 utc and they are still running