#metabrainz


      • Maxr1998 joined the channel
      • Maxr1998_ has quit
      • aerozol
        lucifer: driving all day today and all the hills and shit are making it hard to load figma... There's a mock-up in there with some of the stats I thought we could share, if you don't mind having a look?
      • There's a section that has social media post mockups
      • I'll post a screenshot if it ever loads 🤔
      • Honestly, any annual stats would be interesting!
      • You grab em and I'll make them look interesting 😁
      • (Lucy is driving btw everyone, otherwise this would be very problematic haha)
      • minimal has quit
      • bitmap
        uhh jimmy out of disk space?
      • all of MB is down rn because jimmy can't be accessed
      • lucifer: is anything running that would be eating up space? ^
      • zas: around?
      • lediur joined the channel
      • lediur has quit
      • zas: I had to delete /home/zas/temp.file to even allow postgres to start, it was 2.1GB and I really wasn't sure what else I could remove
      • postgres is back but we're in very dangerous territory rn...
      • I tried pruning unused docker images but there was nothing
      • aerozol
        I've posted on the socials that we're working on the issue, ping me if there's updates to share bitmap
      • bitmap
        aerozol: I was able to restart postgres and musicbrainz seems to be back, at least, but things might be unstable if whatever caused the meltdown starts running again
      • aerozol
        bitmap: thanks, will update now
      • bitmap
        ty
      • the listenbrainz DB rose from 32GB to 119GB which I believe is when it ran out of space
      • aerozol
        Year in music related?
      • bitmap
        also, musicbrainz_db dropped from 314GB to 259GB afterward, but I believe that's because the mapping.* schema is now empty
      • yeah, possibly, I'm not sure what was running earlier
      • the standby on hendrix seems to still be in recovery, that might take a while
      • aerozol
      • bitmap
        XD
      • aerozol
        Grafana link doesn't work on my phone, but was Discord onto this hours ago 🥴
      • Oh wait, not hours, sorry, I'm still on holiday time. This is about the same time as you posted here
      • bitmap
        ah, good, I was worried I missed some alerts
      • I was afk but I did see the alerts when I checked my phone
      • atj: zas: do we have to increase the size of /srv/postgresql? (or is that not how ZFS works)
      • relaxoMob has quit
      • relaxoMob joined the channel
      • Maxr1998
        I only checked Grafana once it was already down and when you guys were already aware of it ^^
      • Thanks for resolving it so quickly btw!
      • bitmap
        I was really worried we'd be down for a long time if I couldn't find anything on jimmy to delete and clear up some space (since PG couldn't even start, so I couldn't clear any tables or anything)
      • luckily zas had some random 2GB temp file lying around (hope it wasn't important). maybe we ought to keep more of those in case of emergency lol
      • derwin joined the channel
      • derwin
        is the site still broken, or is the endless spinning when I try to add an artist from the release editor on this gigantic release some other issue?
      • bitmap
        derwin: there have been quite a few artists/releases added since, so perhaps some other issue (but I'm not sure about adding artists from the release editor specifically)
      • if you see anything relevant in the browser console or network panel I can take a look
      • lucifer
        bitmap: hi!
      • just woke up
      • relaxoMob has quit
      • bitmap
        lucifer: hey
      • lucifer
        where do you see the LB db size? i just checked and the largest table is 5G
      • bitmap
        from the Grafana link above
      • lucifer
        oh my bad, i had the query wrong. yes i see a 100G table.
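
A minimal sketch of a per-table size query that counts TOAST and index data as well as the main heap (measuring only the main relation would miss the TOAST table where oversized jsonb values live); not necessarily the exact query either of them ran:

```sql
-- Largest tables in the current database, counting heap + TOAST + indexes.
-- pg_relation_size() alone reports only the main heap, which is how a ~5G
-- table can actually occupy ~100G once oversized jsonb values are counted.
SELECT c.relname,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
       pg_size_pretty(pg_relation_size(c.oid))       AS heap_only
FROM pg_class c
WHERE c.relkind = 'r'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
```
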
      • bitmap
        any idea what's wrong, or is it expected to grow that much?
      • lucifer
        nope, i can try to remove some of the autogenerated data.
      • bitmap
        /srv/postgresql is apparently only 258G but I assume that can be increased if needed
      • lucifer
        yeah that seems very weird
      • bitmap
        I'm not sure how that is calculated if musicbrainz_db alone is 259GB (is that taking compression into account?)
      • lucifer
        or maybe zfs commands need to be used to obtain the free disk space for it.
      • bitmap
        yea zpool list shows 68GB free
      • much of the currently used space is likely WAL files
      • lucifer
        ah okay makes sense
      • bitmap
        there is 1.5 TB of WAL files, which is a crazy amount of writes
      • trying to get those to drop atm
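
The WAL backlog can also be measured from inside Postgres; a sketch, assuming PostgreSQL 10 or later (where the directory is pg_wal and pg_ls_waldir() exists):

```sql
-- Number and total size of WAL segments currently sitting in pg_wal.
SELECT count(*)                  AS segments,
       pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();
```
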
      • lucifer
        are writes still happening?
      • bitmap
        I don't think so, since the WAL graph isn't a 45 degree line anymore (lol)
      • lucifer
        weird so many writes for 5 hours
      • relaxoMob joined the channel
      • bitmap
        I got WAL archiving working again so it should start dropping soon
      • but there are close to 100,000, which exceeds anything I've seen before by like 10x
      • lucifer
        can we know what the writes were that created those wal files?
      • table name maybe?
      • nullhawk joined the channel
      • nullhawk has quit
      • bitmap
        pg_stat_all_tables will probably help
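
For example, something along these lines, ordering by tuple-write counts to see which tables took the writes (a sketch, not the exact query that was run):

```sql
-- Tables with the most inserted/updated/deleted tuples since stats reset.
SELECT schemaname, relname, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_all_tables
ORDER BY n_tup_ins + n_tup_upd DESC
LIMIT 20;
```
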
      • lucifer
        bitmap: i can run vacuum on the big table and try seeing if that reduces the space. probably should, i checked the rows and there is no row bigger than 1 MB. 24K rows in table. but that would generate more wal i guess so should i wait or do it?
      • bitmap
        I'd wait a bit until WAL drops
      • lucifer
        makes sense
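
For reference on the vacuum idea above: a plain VACUUM only marks dead tuples as reusable and rarely shrinks the file, while VACUUM FULL rewrites the table (and, for a logged table, that rewrite itself generates WAL, hence the suggestion to wait). A sketch, assuming the big table is statistics.year_in_music, the table named later in this log:

```sql
-- Plain VACUUM: frees dead tuples for reuse but normally does not shrink
-- the on-disk file.
VACUUM (VERBOSE, ANALYZE) statistics.year_in_music;

-- VACUUM FULL rewrites the whole table and does return space to the OS,
-- but it takes an ACCESS EXCLUSIVE lock and writes the rewrite to WAL,
-- so it is better run after the WAL backlog has drained.
-- VACUUM FULL statistics.year_in_music;
```
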
      • bitmap
        in musicbrainz_db, mapping.canonical_release_tmp has the most n_tup_ins by far
      • in listenbrainz, it's pg_toast_160991024 (so dunno, I guess oversized columns?)
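
The owning table of a TOAST relation can be looked up in pg_class; a sketch using the relation name quoted above:

```sql
-- Which table's oversized values are stored in pg_toast_160991024?
SELECT c.oid::regclass AS owning_table
FROM pg_class c
WHERE c.reltoastrelid = 'pg_toast.pg_toast_160991024'::regclass;
```
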
      • lucifer
        we store json in one of the columns so probably that
      • the mapping schema i had changed all to unlogged
      • so it shouldn't have created any wal
      • bitmap
        there is an INSERT INTO statistics.year_in_music statement in the pg logs which has a crazy json document (was holding pg up for like a minute)
      • lucifer
        hmm i see.
      • bitmap
        (meant I was holding the "page up" key on my keyboard for a minute, not that the query was holding postgres up. :) realized that phrasing was confusing)
      • lucifer
        ah okay
      • derwin
        bleh, I guess I will just re-start adding this 50 track release :/
      • bitmap
        postgres said the insert took 1428.462 ms so not sure if it was an issue, really
      • though I guess there are many of these
      • how big is the average "data" column on year_in_music and how many rows are expected?
      • lucifer
        i had checked with pg_column_size and the likes, rows are less than 1MB in size which checks out. ~25K rows.
      • most people would have less data than 1MB so 20G is my estimate of how large the table should be.
      • 95 G uhhh, i have a hunch on how that could have happened. the data column is jsonb so every update writes a new TOAST copy of the value, and there is an update per data point.
      • there are about 10+ of those, for 8k rows. i guess in the worst case that could somehow balloon into that.
      • the WAL space i am not sure about, what's the relation between table/row size and the WAL it generates.
      • 1.5TB seems very excessive, and we have run these queries multiple times last year without any issues.
      • so i am still unsure on if this is the actual cause.
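
A sketch of the sizing check described above (pg_column_size on the jsonb "data" column versus the table's total on-disk footprint); the column and table names are the ones mentioned earlier in the log:

```sql
-- Average and maximum stored (compressed) size of the jsonb "data" column,
-- compared with the table's total footprint including TOAST and indexes.
SELECT count(*)                                                           AS row_count,
       pg_size_pretty(avg(pg_column_size(data))::bigint)                  AS avg_data_size,
       pg_size_pretty(max(pg_column_size(data))::bigint)                  AS max_data_size,
       pg_size_pretty(pg_total_relation_size('statistics.year_in_music')) AS table_total
FROM statistics.year_in_music;
```
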
      • bitmap
        WAL actually started rising at around 18:15 yesterday and it looks like mbid mapper stuff was running at this time, so perhaps a combination of that + YIM? you said the former was moved to unlogged tables, but something is amiss
      • since PG is still complaining of too-frequent checkpoints during that time
      • lucifer
        hmm i can check the mapping schema to find any logged tables
      • relaxoMob has quit
      • just checked all the expected ones are indeed unlogged.
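
The check could look like this: list the persistence flag of every relation in the mapping schema (a sketch):

```sql
-- Persistence of every ordinary table in the mapping schema:
-- 'p' = permanent (WAL-logged), 'u' = unlogged, 't' = temporary.
SELECT c.relname, c.relpersistence
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'mapping'
  AND c.relkind = 'r'
ORDER BY c.relname;
```
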
      • bitmap
        hrm. well I don't really see anything else in the logs and the tup stats only point to the toast table otherwise. YIM was also running at 18:00 yesterday it seems
      • lucifer
        18:00 utc?
      • bitmap
        I see an insert at 18:25 UTC
      • lucifer
        makes sense
      • the database size shot up at 00:19 UTC
      • derwin has left the channel
      • bitmap
        you had already tested the YIM stuff on jimmy though?
      • lucifer
        yes it was run a couple of weeks ago on jimmy
      • relaxoMob joined the channel
      • this was the final production run.
      • bitmap
        weird :\
      • lucifer
      • is the database size calculation working correctly?
      • hmm i guess on normal days the changes are too small to be noticeable in this graph
      • i checked the logs on the possible previous YIM runs, and wal didn't rise like this on those days
      • 3 peaks all of mapping stuff.
      • bitmap
        which day was the previous YIM run?
      • maybe something on aretha was slowing down the WAL archiver
      • lucifer
        18th December
      • bitmap
        load average on aretha was significantly higher during this time than on december 18th (maybe json dumps running or something?)
      • lucifer
        i see
      • bitmap
        I believe this could possibly slow down WAL archiving enough to cause a build up
      • lucifer
        oh makes sense
      • bitmap
      • (graph uses logarithmic scale)
      • lucifer
        umm i see 1200 unprocessed messages in RMQ, possibly a 100 or so are for YIM which failed to insert because of some unrelated error. i have stopped the container so that it doesn't try to insert.
      • bitmap
        👍
      • lucifer
        wal archiving seems to have stopped
      • bitmap
        daily json dumps started on 1/3 at 00:00 utc and they are still running
      • lucifer
      • bitmap
        we really need to move this to another server
      • lucifer
        do you mean 1/4 at 00:00?
      • bitmap
        no :\
      • lucifer
        oh wow
      • bitmap
        Archiving segment 2539 of 10347
      • it's processing one every few seconds at least
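
Archiver progress is also visible in pg_stat_archiver; a sketch:

```sql
-- Total segments archived, the most recently archived WAL file, and any
-- archive_command failures.
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal
FROM pg_stat_archiver;
```
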
      • lucifer
        i see, the wal size isn't decreasing at the same speed.
      • maybe postgres will catch up in a while
      • bitmap
        hmm, maybe they're being kept around until the next barman backup
      • I'll delete the latest backup and start a new one
      • lucifer
        pg_stat_replication shows two receivers, hendrix and streaming barman
      • bitmap
        yea, hendrix is up to date but streaming barman is still significantly behind & in catchup mode
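
A sketch of the kind of check meant here, showing each receiver's lag against the current WAL position (column names per PostgreSQL 10+):

```sql
-- Per-receiver replication lag in bytes behind the current WAL insert position.
-- replay_lsn can be NULL for a pure WAL-streaming client (e.g. barman's
-- streaming receiver), which only receives and never replays.
SELECT application_name, state, sync_state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS send_lag,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;
```
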
      • relaxoMob has quit
      • relaxoMob joined the channel
      • BrainzGit
        [musicbrainz-server] reosarevok opened pull request #3138 (master…MBS-13420-test-fix): Fix Selenium test broken by MBS-13420 https://github.com/metabrainz/musicbrainz-serve...