#metabrainz

      • Maxr1998 joined the channel
      • 2024-01-04 00456, 2024

      • Maxr1998_ has quit
      • 2024-01-04 00437, 2024

      • aerozol
        lucifer: driving all day today and all the hills and shit are making it hard to load figma... There's a mock-up in there with some of the stats I thought we could share, if you don't mind having a look?
      • 2024-01-04 00413, 2024

      • aerozol
        There's a section that has social media post mockups
      • 2024-01-04 00417, 2024

      • aerozol
        I'll post a screenshot if it ever loads 🤔
      • 2024-01-04 00431, 2024

      • aerozol
      • 2024-01-04 00444, 2024

      • aerozol
        Honestly, any annual stats would be interesting!
      • 2024-01-04 00413, 2024

      • aerozol
        You grab em and I'll make them look interesting 😁
      • 2024-01-04 00423, 2024

      • aerozol
        (Lucy is driving btw everyone, otherwise this would be very problematic haha)
      • 2024-01-04 00403, 2024

      • minimal has quit
      • 2024-01-04 00444, 2024

      • bitmap
        uhh jimmy out of disk space?
      • 2024-01-04 00435, 2024

      • bitmap
        all of MB is down rn because jimmy can't be accessed
      • 2024-01-04 00451, 2024

      • bitmap
        lucifer: is anything running that would be eating up space? ^
      • 2024-01-04 00454, 2024

      • bitmap
        zas: around?
      • 2024-01-04 00459, 2024

      • lediur joined the channel
      • 2024-01-04 00453, 2024

      • lediur has quit
      • 2024-01-04 00441, 2024

      • bitmap
        zas: I had to delete /home/zas/temp.file to even allow postgres to start, it was 2.1GB and I really wasn't sure what else I could remove
      • 2024-01-04 00426, 2024

      • bitmap
        postgres is back but we're in very dangerous territory rn...
      • 2024-01-04 00400, 2024

      • bitmap
        I tried pruning unused docker images but there was nothing
      • 2024-01-04 00419, 2024

      • aerozol
        I've posted on the socials that we're working on the issue, ping me if there's updates to share bitmap
      • 2024-01-04 00432, 2024

      • bitmap
        aerozol: I was able to restart postgres and musicbrainz seems to be back, at least, but things might be unstable if whatever caused the meltdown starts running again
      • 2024-01-04 00408, 2024

      • aerozol
        bitmap: thanks, will update now
      • 2024-01-04 00451, 2024

      • bitmap
        ty
      • 2024-01-04 00442, 2024

      • bitmap
        the listenbrainz DB rose from 32GB to 119GB which I believe is when it ran out of space
      • 2024-01-04 00417, 2024
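
      A quick way to get per-database figures like the ones quoted above (a sketch, assuming a psql session on the cluster with sufficient privileges):

          -- Per-database on-disk size, largest first
          SELECT datname,
                 pg_size_pretty(pg_database_size(datname)) AS size
          FROM pg_database
          ORDER BY pg_database_size(datname) DESC;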

      • bitmap
      • 2024-01-04 00430, 2024

      • aerozol
        Year in music related?
      • 2024-01-04 00435, 2024

      • bitmap
        also, musicbrainz_db dropped from 314GB to 259GB afterward, but I believe that's because the mapping.* schema is now empty
      • 2024-01-04 00452, 2024

      • bitmap
        yeah, possibly, I'm not sure what was running earlier
      • 2024-01-04 00417, 2024

      • bitmap
        the standby on hendrix seems to still be in recovery, that might take a while
      • 2024-01-04 00400, 2024

      • aerozol
      • 2024-01-04 00425, 2024

      • bitmap
        XD
      • 2024-01-04 00432, 2024

      • aerozol
        Grafana link doesn't work on my phone, but was Discord onto this hours ago?
      • 2024-01-04 00431, 2024

      • aerozol
        Oh wait, not hours, sorry, I'm still on holiday time. This is about the same time as you posted here
      • 2024-01-04 00456, 2024

      • bitmap
        ah, good, I was worried I missed some alerts
      • 2024-01-04 00410, 2024

      • bitmap
        I was afk but I did see the alerts when I checked my phone
      • 2024-01-04 00426, 2024

      • bitmap
        atj: zas: do we have to increase the size of /srv/postgresql? (or is that not how ZFS works)
      • 2024-01-04 00436, 2024

      • relaxoMob has quit
      • 2024-01-04 00414, 2024

      • relaxoMob joined the channel
      • 2024-01-04 00459, 2024

      • Maxr1998
        I only checked Grafana once it was already down and when you guys were already aware of it ^^
      • 2024-01-04 00459, 2024

      • Maxr1998
        Thanks for resolving it so quickly btw!
      • 2024-01-04 00418, 2024

      • bitmap
        I was really worried we'd be down for a long time if I couldn't find anything on jimmy to delete and clear up some space (since PG couldn't even start, so I couldn't clear any tables or anything)
      • 2024-01-04 00431, 2024

      • bitmap
        luckily zas had some random 2GB temp file lying around (hope it wasn't important). maybe we ought to keep more of those in case of emergency lol
      • 2024-01-04 00445, 2024

      • derwin joined the channel
      • 2024-01-04 00412, 2024

      • derwin
        is the site still broken, or is the endless spinning when I try to add an artist from the release editor on this gigantic release some other issue?
      • 2024-01-04 00449, 2024

      • bitmap
        derwin: there have been quite a few artists/releases added since, so perhaps some other issue (but I'm not sure about adding artists from the release editor specifically)
      • 2024-01-04 00436, 2024

      • bitmap
        if you see anything relevant in the browser console or network panel I can take a look
      • 2024-01-04 00448, 2024

      • lucifer
        bitmap: hi!
      • 2024-01-04 00451, 2024

      • lucifer
        just woke up
      • 2024-01-04 00405, 2024

      • relaxoMob has quit
      • 2024-01-04 00411, 2024

      • bitmap
        lucifer: hey
      • 2024-01-04 00412, 2024

      • lucifer
        where do you see the LB db size? i just checked and the largest table is 5G
      • 2024-01-04 00426, 2024

      • bitmap
        from the Grafana link above
      • 2024-01-04 00416, 2024

      • lucifer
        oh my bad, i had the query wrong. yes i see a 100G table.
      • 2024-01-04 00454, 2024
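
      The discrepancy above is likely the usual pitfall: pg_relation_size() counts only the main fork, while pg_total_relation_size() also counts TOAST data and indexes. A sketch for listing the largest relations in a database:

          -- Largest tables including their TOAST data and indexes
          SELECT c.oid::regclass AS relation,
                 pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
          FROM pg_class c
          JOIN pg_namespace n ON n.oid = c.relnamespace
          WHERE c.relkind = 'r'
            AND n.nspname NOT IN ('pg_catalog', 'information_schema')
          ORDER BY pg_total_relation_size(c.oid) DESC
          LIMIT 10;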

      • bitmap
        any idea what's wrong, or is it expected to grow that much?
      • 2024-01-04 00433, 2024

      • lucifer
        nope, i can try to remove some of the autogenerated data.
      • 2024-01-04 00451, 2024

      • bitmap
        /srv/postgresql is apparently only 258G but I assume that can be increased if needed
      • 2024-01-04 00410, 2024

      • lucifer
        yeah that seems very weird
      • 2024-01-04 00411, 2024

      • bitmap
        I'm not sure how that is calculated if musicbrainz_db alone is 259GB (is that taking compression into account?)
      • 2024-01-04 00437, 2024

      • lucifer
        or maybe zfs commands need to be used to obtain the free disk space for it.
      • 2024-01-04 00414, 2024

      • bitmap
        yea zpool list shows 68GB free
      • 2024-01-04 00400, 2024

      • bitmap
        much of the currently used space is likely WAL files
      • 2024-01-04 00424, 2024

      • lucifer
        ah okay makes sense
      • 2024-01-04 00456, 2024

      • bitmap
        there is 1.5 TB of WAL files, which is a crazy amount of writes
      • 2024-01-04 00443, 2024
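
      One way to confirm a WAL figure like that from inside the database (a sketch; pg_ls_waldir() needs PostgreSQL 10+ and superuser or a suitably granted role):

          -- Number and total size of WAL segments currently sitting in pg_wal
          SELECT count(*)                  AS segments,
                 pg_size_pretty(sum(size)) AS total_size
          FROM pg_ls_waldir();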

      • bitmap
        trying to get those to drop atm
      • 2024-01-04 00458, 2024

      • lucifer
        are writes still happening?
      • 2024-01-04 00420, 2024

      • bitmap
        I don't think so, since the WAL graph isn't a 45 degree line anymore (lol)
      • 2024-01-04 00434, 2024

      • lucifer
        weird so many writes for 5 hours
      • 2024-01-04 00416, 2024

      • relaxoMob joined the channel
      • 2024-01-04 00448, 2024

      • bitmap
        I got WAL archiving working again so it should start dropping soon
      • 2024-01-04 00432, 2024

      • bitmap
        but there are close to 100,000, which exceeds anything I've seen before by like 10x
      • 2024-01-04 00445, 2024
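
      Archiver progress can be watched through the pg_stat_archiver view (a generic sketch, not specific to this setup):

          -- Last archived segment, last failure, and cumulative counters
          SELECT archived_count,
                 last_archived_wal,
                 last_archived_time,
                 failed_count,
                 last_failed_wal
          FROM pg_stat_archiver;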

      • lucifer
        can we know what the writes were that created those wal files?
      • 2024-01-04 00449, 2024

      • lucifer
        table name maybe?
      • 2024-01-04 00455, 2024

      • nullhawk joined the channel
      • 2024-01-04 00429, 2024

      • nullhawk has quit
      • 2024-01-04 00452, 2024

      • bitmap
        pg_stat_all_tables will probably help
      • 2024-01-04 00452, 2024
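
      A sketch of the kind of query meant above, ranking tables by write activity since statistics were last reset:

          -- Tables with the most inserted/updated/deleted tuples
          SELECT schemaname, relname, n_tup_ins, n_tup_upd, n_tup_del
          FROM pg_stat_all_tables
          ORDER BY n_tup_ins + n_tup_upd + n_tup_del DESC
          LIMIT 10;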

      • lucifer
        bitmap: i can run vacuum on the big table and try seeing if that reduces the space. probably should, i checked the rows and there is no row bigger than 1 MB. 24K rows in the table. but that would generate more wal i guess so should i wait or do it?
      • 2024-01-04 00410, 2024

      • bitmap
        I'd wait a bit until WAL drops
      • 2024-01-04 00425, 2024

      • lucifer
        makes sense
      • 2024-01-04 00426, 2024
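
      For context on the VACUUM question: plain VACUUM marks dead row versions reusable but rarely shrinks the file on disk, while VACUUM FULL rewrites the table and is itself WAL-logged when replication is in use. A hedged check of how much of the table is actually dead, using the statistics.year_in_music name that comes up a bit later in the discussion:

          -- Live vs dead tuples and last (auto)vacuum time for the suspect table
          SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
          FROM pg_stat_user_tables
          WHERE schemaname = 'statistics' AND relname = 'year_in_music';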

      • bitmap
        in musicbrainz_db, mapping.canonical_release_tmp has the most n_tup_ins by far
      • 2024-01-04 00437, 2024

      • bitmap
        in listenbrainz, it's pg_toast_160991024 (so dunno, I guess oversized columns?)
      • 2024-01-04 00431, 2024
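
      The owning table behind a pg_toast_<oid> name can be resolved directly, since the number is the OID of the table it belongs to (a sketch):

          -- Resolve the TOAST table back to the table that owns it
          SELECT 160991024::regclass AS owning_table;

          -- Or go through the catalog the other way around
          SELECT c.oid::regclass AS owning_table
          FROM pg_class c
          JOIN pg_class t ON t.oid = c.reltoastrelid
          WHERE t.relname = 'pg_toast_160991024';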

      • lucifer
        we store json in one of the columns so probably that
      • 2024-01-04 00402, 2024

      • lucifer
        i had changed all of the mapping schema to unlogged
      • 2024-01-04 00425, 2024

      • lucifer
        so it shouldn't have created any wal
      • 2024-01-04 00429, 2024

      • bitmap
        there is an INSERT INTO statistics.year_in_music statement in the pg logs which has a crazy json document (was holding pg up for like a minute)
      • 2024-01-04 00409, 2024

      • lucifer
        hmm i see.
      • 2024-01-04 00414, 2024

      • bitmap
        (meant I was holding the "page up" key on my keyboard for a minute, not that the query was holding postgres up. :) realized that phrasing was confusing)
      • 2024-01-04 00412, 2024

      • lucifer
        ah okay
      • 2024-01-04 00409, 2024

      • derwin
        bleh, I guess I will just re-start adding this 50 track release :/
      • 2024-01-04 00448, 2024

      • bitmap
        postgres said the insert took 1428.462 ms so not sure if it was an issue, really
      • 2024-01-04 00419, 2024

      • bitmap
        though I guess there are many of these
      • 2024-01-04 00441, 2024

      • bitmap
        how big is the average "data" column on year_in_music and how many rows are expected?
      • 2024-01-04 00454, 2024

      • lucifer
        i had checked with pg_column_size and the like, rows are less than 1MB in size which checks out. ~25K rows.
      • 2024-01-04 00434, 2024

      • lucifer
        most people would have less data than 1MB so 20G is my estimate of how large the table should be.
      • 2024-01-04 00424, 2024

      • lucifer
        95 G uhhh, i have a hunch on how that could have happened. the column is jsonb so every data point update creates a new toast entry.
      • 2024-01-04 00413, 2024

      • lucifer
        there are about 10+ of those, for 8k rows. i guess in the worst case that could somehow balloon into that.
      • 2024-01-04 00445, 2024
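
      A sketch of the kind of size check described above, using the "data" column and table name from the discussion (actual counts may differ):

          -- Rough payload size of the jsonb column as stored (after compression/TOAST)
          SELECT count(*)                                          AS rows,
                 pg_size_pretty(avg(pg_column_size(data))::bigint) AS avg_payload,
                 pg_size_pretty(sum(pg_column_size(data))::bigint) AS total_payload
          FROM statistics.year_in_music;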

      • lucifer
        the WAL space i am not sure about, what's the relation between table/row size and the WAL it generates?
      • 2024-01-04 00414, 2024

      • lucifer
        1.5TB seems very excessive, and we have run these queries multiple times last year without any issues.
      • 2024-01-04 00426, 2024

      • lucifer
        so i am still unsure if this is the actual cause.
      • 2024-01-04 00415, 2024

      • bitmap
        WAL actually started rising at around 18:15 yesterday and it looks like mbid mapper stuff was running at this time, so perhaps a combination of that + YIM? you said the former was moved to unlogged tables, but something is amiss
      • 2024-01-04 00431, 2024

      • bitmap
        since PG is still complaining of too-frequent checkpoints during that time
      • 2024-01-04 00455, 2024
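
      "checkpoints are occurring too frequently" generally means WAL is being produced faster than max_wal_size allows between checkpoints; the relevant settings can be inspected with something like this (a sketch, values are whatever the server has configured):

          -- Checkpoint / WAL settings that control how often checkpoints are forced
          SELECT name, setting, unit
          FROM pg_settings
          WHERE name IN ('max_wal_size', 'checkpoint_timeout',
                         'checkpoint_completion_target', 'wal_level');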

      • lucifer
        hmm i can check the mapping schema to find any logged tables
      • 2024-01-04 00405, 2024

      • relaxoMob has quit
      • 2024-01-04 00453, 2024

      • lucifer
        just checked, all the expected ones are indeed unlogged.
      • 2024-01-04 00438, 2024
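
      One way to do that check (a sketch, assuming the schema is literally named mapping):

          -- 'u' = unlogged (no WAL), 'p' = permanent/logged
          SELECT c.relname, c.relpersistence
          FROM pg_class c
          JOIN pg_namespace n ON n.oid = c.relnamespace
          WHERE n.nspname = 'mapping'
            AND c.relkind = 'r'
          ORDER BY c.relname;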

      • bitmap
        hrm. well I don't really see anything else in the logs and the tup stats only point to the toast table otherwise. YIM was also running at 18:00 yesterday it seems
      • 2024-01-04 00450, 2024

      • lucifer
        18:00 utc?
      • 2024-01-04 00423, 2024

      • bitmap
        I see an insert at 18:25 UTC
      • 2024-01-04 00437, 2024

      • lucifer
        makes sense
      • 2024-01-04 00438, 2024

      • lucifer
      • 2024-01-04 00453, 2024

      • lucifer
        the database size shot up at 00:19 UTC
      • 2024-01-04 00419, 2024

      • derwin has left the channel
      • 2024-01-04 00422, 2024

      • bitmap
        you had already tested the YIM stuff on jimmy though?
      • 2024-01-04 00446, 2024

      • lucifer
        yes it was run a couple of weeks ago on jimmy
      • 2024-01-04 00401, 2024

      • relaxoMob joined the channel
      • 2024-01-04 00401, 2024

      • lucifer
        this was the final production run.
      • 2024-01-04 00412, 2024

      • bitmap
        weird :\
      • 2024-01-04 00429, 2024

      • lucifer
      • 2024-01-04 00439, 2024

      • lucifer
        is the database size calculation working correctly?
      • 2024-01-04 00424, 2024

      • lucifer
        hmm i guess on normal days the changes are too small to be noticeable in this graph
      • 2024-01-04 00415, 2024

      • lucifer
        i checked the logs on the possible previous YIM runs, and wal didn't rise like this on those days
      • 2024-01-04 00437, 2024

      • lucifer
      • 2024-01-04 00447, 2024

      • lucifer
        3 peaks, all of them mapping stuff.
      • 2024-01-04 00449, 2024

      • bitmap
        which day was the previous YIM run?
      • 2024-01-04 00440, 2024

      • bitmap
        maybe something on aretha was slowing down the WAL archiver
      • 2024-01-04 00427, 2024

      • lucifer
        18th December
      • 2024-01-04 00441, 2024

      • bitmap
        load average on aretha was significantly higher during this time than on december 18th (maybe json dumps running or something?)
      • 2024-01-04 00403, 2024

      • lucifer
        i see
      • 2024-01-04 00417, 2024

      • bitmap
        I believe this could possibly slow down WAL archiving enough to cause a build up
      • 2024-01-04 00453, 2024

      • lucifer
        oh makes sense
      • 2024-01-04 00429, 2024

      • bitmap
      • 2024-01-04 00447, 2024

      • bitmap
        (graph uses logarithmic scale)
      • 2024-01-04 00442, 2024

      • lucifer
        umm i see 1200 unprocessed messages in RMQ, possibly 100 or so are for YIM which failed to insert because of some unrelated error. i have stopped the container so that it doesn't try to insert.
      • 2024-01-04 00441, 2024

      • bitmap
        👍
      • 2024-01-04 00406, 2024

      • lucifer
        wal archiving seems to have stopped
      • 2024-01-04 00414, 2024

      • bitmap
        daily json dumps started on 1/3 at 00:00 utc and they are still running
      • 2024-01-04 00414, 2024

      • lucifer
      • 2024-01-04 00422, 2024

      • bitmap
        we really need to move this to another server
      • 2024-01-04 00439, 2024

      • lucifer
        do you mean 1/4 at 00:00 ?
      • 2024-01-04 00456, 2024

      • bitmap
        no :\
      • 2024-01-04 00406, 2024

      • lucifer
        oh wow
      • 2024-01-04 00407, 2024

      • bitmap
        Archiving segment 2539 of 10347
      • 2024-01-04 00429, 2024

      • bitmap
        it's processing one every few seconds at least
      • 2024-01-04 00447, 2024

      • lucifer
        i see, the wal size isn't decreasing at the same speed.
      • 2024-01-04 00456, 2024

      • lucifer
        maybe postgres will catch up in a while
      • 2024-01-04 00447, 2024

      • bitmap
        hmm, maybe they're being kept around until the next barman backup
      • 2024-01-04 00458, 2024

      • bitmap
        I'll delete the latest backup and start a new one
      • 2024-01-04 00438, 2024
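
      If barman streams WAL through a replication slot (an assumption about this setup), the slot pins segments in pg_wal until they have been received; how much each slot is holding back can be checked with something like:

          -- WAL retained on disk for each replication slot (PostgreSQL 10+)
          SELECT slot_name, slot_type, active,
                 pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
          FROM pg_replication_slots;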

      • lucifer
        pg_stat_replication shows two receivers, hendrix and streaming barman
      • 2024-01-04 00432, 2024

      • bitmap
        yea, hendrix is up to date but streaming barman is still significantly behind & in catchup mode
      • 2024-01-04 00417, 2024
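
      A sketch of the lag check behind that observation, measured in bytes of WAL rather than time:

          -- How far each receiver (hendrix, barman) is behind the primary's current WAL position
          SELECT application_name, state, sync_state,
                 pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))  AS send_lag,
                 pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) AS flush_lag
          FROM pg_stat_replication;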

      • relaxoMob has quit
      • 2024-01-04 00435, 2024

      • relaxoMob joined the channel
      • 2024-01-04 00453, 2024

      • BrainzGit
        [musicbrainz-server] reosarevok opened pull request #3138 (master…MBS-13420-test-fix): Fix Selenium test broken by MBS-13420 https://github.com/metabrainz/musicbrainz-server/…