#metabrainz


      • ruaok
but, this only buys a small amount of time. we will hit this point again, and each time it comes sooner.
      • 2018-05-01 12158, 2018

      • ruaok
looking for a possible cause in all of this can be a huge undertaking that has one recommended outcome: tune your server or add more capacity.
      • 2018-05-01 12121, 2018

      • ruaok
        so, best to take two approaches to this, as we have been: work to add more capacity in the short term.
      • 2018-05-01 12141, 2018

      • ruaok
        post react move, we focus on caching and optimizing the worst queries in our system.
      • 2018-05-01 12113, 2018

      • zas
so, you think ANALYZE was run by autovacuum and it fixed itself during the first 2 events? but i don't see any autovacuum stuff in logs
      • 2018-05-01 12158, 2018

      • ruaok
for some reason autovacuum isn't working. we need to find out why -- but twice now an ANALYZE has solved our problems.
      • 2018-05-01 12109, 2018

      • ruaok
        so, we need to do that on a regular basis until we know why we need to do it.
      • 2018-05-01 12137, 2018

      • zas
        yes, i just checked logs again
      • 2018-05-01 12153, 2018

      • ruaok
not sure what you mean by first 2 events.. I think it may have done its job until the load became too high, and then it stopped working in order to not bring the server to its knees.
      • 2018-05-01 12158, 2018

      • zas
      • 2018-05-01 12114, 2018

      • zas
        basically nothing between january and april...
      • 2018-05-01 12137, 2018

      • zas
my guess: temp tables created by the search indexer cause this
      • 2018-05-01 12139, 2018

      • ruaok
        and I bet that if we check our chat logs we will find that we had a freakout in jan. :)
      • 2018-05-01 12151, 2018

      • ruaok
        sure, that is quite possible.
      • 2018-05-01 12102, 2018

      • zas
        they aren't cleaned by autovacuum, and other tables stay under thresholds
      • 2018-05-01 12109, 2018

      • Slurpee has quit
      • 2018-05-01 12126, 2018

      • zas
        but autovacuum process actually runs, it just does nothing
      • 2018-05-01 12138, 2018

      • ruaok
an immediate fix is to add a cron job to run ANALYZE once a week during slow times.
      • 2018-05-01 12159, 2018
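A cron-driven ANALYZE like the one ruaok describes could be sketched as follows; the database name and schedule are assumptions for illustration, not the actual production setup.

```python
# Sketch: build the psql command a weekly cron job (run during a slow
# period) could invoke. "musicbrainz_db" is a hypothetical database name.
import subprocess

def build_analyze_command(dbname):
    """psql invocation for a database-wide ANALYZE."""
    return ["psql", "-d", dbname, "-c", "ANALYZE;"]

cmd = build_analyze_command("musicbrainz_db")
print(" ".join(cmd))
# The cron job would then execute it, e.g.: subprocess.run(cmd, check=True)
```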

      • ruaok
        not sure about nothing... it probably updates stats.
      • 2018-05-01 12113, 2018

      • zas
        yes, but according to https://dba.stackexchange.com/questions/18664/are… it shouldn't be needed
      • 2018-05-01 12125, 2018

      • zas
        we can tweak autovacuum options for that
      • 2018-05-01 12143, 2018

      • zas
        it doesn't fix the temp tables thing though
      • 2018-05-01 12143, 2018

      • zas
question: why are temp tables created? pg_temp_44.tmp_artistcredit (and other orphans) were created by what?
      • 2018-05-01 12131, 2018

      • zas
        it seems to me autovacuum took care of those after we restarted pg
      • 2018-05-01 12108, 2018

      • ruaok
        the search indexer creates them to make index building more efficient.
      • 2018-05-01 12145, 2018

      • ruaok
        rather than repeatedly running the same query to get a chunk of rows, it creates a temp table and then queries that table while building an index.
      • 2018-05-01 12122, 2018

      • ruaok
the difference is running a fat query a few thousand times, vs running it once and then just fetching rows without a join.
      • 2018-05-01 12138, 2018
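The indexer pattern described above can be sketched as follows; this uses sqlite3 purely for illustration (the real indexer talks to PostgreSQL, and the table names here are hypothetical):

```python
# Sketch: run the expensive join ONCE into a temp table, then fetch rows
# from it without repeating the join on every chunk.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE credit (artist_id INTEGER, credit TEXT);
    INSERT INTO artist VALUES (1, 'A'), (2, 'B');
    INSERT INTO credit VALUES (1, 'lead'), (2, 'feat');
""")

# The fat query runs once...
conn.execute("""
    CREATE TEMP TABLE tmp_artistcredit AS
    SELECT a.id, a.name, c.credit
    FROM artist a JOIN credit c ON c.artist_id = a.id
""")

# ...then index building just reads plain rows from the temp table.
rows = conn.execute("SELECT * FROM tmp_artistcredit ORDER BY id").fetchall()
print(rows)  # [(1, 'A', 'lead'), (2, 'B', 'feat')]
```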

      • zas
      • 2018-05-01 12125, 2018

      • zas
        i propose to increase autovacuum verbosity to start with
      • 2018-05-01 12134, 2018
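One way to get that verbosity, as a sketch: the setting below is standard PostgreSQL, but the value is a suggestion rather than what was actually deployed.

```
# postgresql.conf
# Log every autovacuum/autoanalyze action, regardless of how long it took:
log_autovacuum_min_duration = 0
```

A config reload (`SELECT pg_reload_conf();`) picks this up without a restart.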

      • ruaok
        sure.
      • 2018-05-01 12150, 2018

      • ruaok
        what is our current work mem setting?
      • 2018-05-01 12156, 2018

      • ruaok
        that one had good results too.
      • 2018-05-01 12148, 2018

      • ruaok
        yvanzo: ok, in that case, which container has the master set of replication packets?
      • 2018-05-01 12108, 2018

      • ruaok
        we have a gap in sequence in 2016, so we should discard packets prior to that.
      • 2018-05-01 12116, 2018

      • ruaok
        UmkaDK: check your replicated tables, should look better now.
      • 2018-05-01 12156, 2018

      • UmkaDK
        Yep, our replication restarted and is on 1113,034
      • 2018-05-01 12121, 2018

      • ruaok
        yay.
      • 2018-05-01 12124, 2018

      • UmkaDK
      • 2018-05-01 12135, 2018

      • ruaok
        phew.
      • 2018-05-01 12109, 2018

      • bukwurm joined the channel
      • 2018-05-01 12141, 2018

      • UmkaDK
        So ruaok, does it mean that the replication is now stable?
      • 2018-05-01 12117, 2018

      • UmkaDK
        As in, I can re-enable our alarms.
      • 2018-05-01 12122, 2018

      • yvanzo
        ruaok: No idea, the gap is from 2016-04-28 to 2016-05-23, that was prior to the move to Hetzner.
      • 2018-05-01 12122, 2018

      • ruaok
        yes, do that.
      • 2018-05-01 12102, 2018

      • UmkaDK
        Will do! Thanks guys!!
      • 2018-05-01 12104, 2018

      • ruaok
        yvanzo: yeah, I'm moving old packets out of the way on the metabrainz production container.
      • 2018-05-01 12113, 2018

      • ruaok
        sorry for the hassle UmkaDK !
      • 2018-05-01 12141, 2018

      • UmkaDK
        ruaok, not a problem at all!! These things just happen... :)
      • 2018-05-01 12108, 2018

      • UmkaDK
        (I wouldn't have a job if they didn't) :)
      • 2018-05-01 12148, 2018

      • ruaok
        lol, good perspective.
      • 2018-05-01 12115, 2018

      • ruaok
        yvanzo: a sanity check from you please.
      • 2018-05-01 12158, 2018

      • ruaok
        on hip, in metabrainz-prod in /data/replication_packets ... the packets now start with replication-95701.tar.bz2
      • 2018-05-01 12123, 2018

      • ruaok
      • 2018-05-01 12132, 2018

      • ruaok
        reports that CRITICAL Replication packet 95700 is missing
      • 2018-05-01 12137, 2018
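The actual check's code isn't shown in the log, but a continuity check over REPLICATION_PACKETS_DIR that would report "packet 95700 is missing" plausibly has this general shape (a sketch; function and file-name pattern are assumptions):

```python
# Sketch: find gaps in the replication packet sequence by filename.
import re

def find_gaps(filenames):
    """Return missing sequence numbers between lowest and highest packet."""
    nums = sorted(int(m.group(1)) for f in filenames
                  if (m := re.match(r"replication-(\d+)\.tar\.bz2$", f)))
    if not nums:
        return []
    present = set(nums)
    return [n for n in range(nums[0], nums[-1] + 1) if n not in present]

print(find_gaps(["replication-95699.tar.bz2",
                 "replication-95701.tar.bz2"]))  # [95700]
```

This explains why moving old packets out of the directory (so the sequence starts after the gap) makes the check go green.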

      • ruaok
      • 2018-05-01 12138, 2018

      • yvanzo
        ruaok: found missing replication packets on bootsy
      • 2018-05-01 12144, 2018

      • ruaok
        it doesn't seem to be cached.
      • 2018-05-01 12100, 2018

      • ruaok
        I don't really think we need to be concerned with them.
      • 2018-05-01 12112, 2018

      • ruaok
        they are super old.
      • 2018-05-01 12126, 2018

      • ruaok
        I just don't understand why that check is failing still.
      • 2018-05-01 12145, 2018

      • yvanzo
OK, does it check the backup dir or the ftp dir?
      • 2018-05-01 12124, 2018

      • ruaok
        it checks REPLICATION_PACKETS_DIR...
      • 2018-05-01 12144, 2018

      • ruaok
        'REPLICATION_PACKETS_DIR': '/data/replication_packets'
      • 2018-05-01 12154, 2018

      • ruaok
        so, I'm stumped as to why that check still fails.
      • 2018-05-01 12130, 2018

      • yvanzo
        looks like it is checking the FTP dir
      • 2018-05-01 12156, 2018

      • yvanzo
        Shall I move old packets from bootsy to there?
      • 2018-05-01 12136, 2018

      • yvanzo
Actually that was probably somehow broken at that time, which would explain why they were no longer available from FTP.
      • 2018-05-01 12124, 2018

      • ruaok
        how is it checking the FTP dir? I just double checked the settings.
      • 2018-05-01 12133, 2018

      • ruaok
        > 'REPLICATION_PACKETS_DIR': '/data/replication_packets'
      • 2018-05-01 12142, 2018

      • ruaok
        that is the current config inside the container.
      • 2018-05-01 12154, 2018

      • ruaok
and the packets are not supposed to be on FTP any longer.
      • 2018-05-01 12104, 2018

      • ruaok
        too many companies were freeloading
      • 2018-05-01 12134, 2018

      • yvanzo
        ruaok: Volume musicbrainz-replication-packets is mounted on /data in MeB container and under .../ftp/... in MB Cron container.
      • 2018-05-01 12159, 2018
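yvanzo's point, one named volume mounted at different paths in two containers, could look like this in compose terms; service names and the cron container's mount path are hypothetical (the log only shows `.../ftp/...`):

```yaml
# Illustrative sketch, not the production config.
volumes:
  musicbrainz-replication-packets:

services:
  metabrainz:
    volumes:
      - musicbrainz-replication-packets:/data
  musicbrainz-cron:
    volumes:
      # hypothetical path; the log elides the full ftp path
      - musicbrainz-replication-packets:/home/ftp/data
```

So the same files appear under `/data/replication_packets` in one container and under an ftp path in the other, which is why the check "looked like" it was reading the FTP dir.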

      • yvanzo
        the backup dir you cleaned up is from volume musicbrainz-backups
      • 2018-05-01 12119, 2018

      • ruaok
i don't quite follow...
      • 2018-05-01 12136, 2018

      • ruaok
        I cleaned up var/lib/docker/volumes/musicbrainz-replication-packets/_data which is mounted into:
      • 2018-05-01 12140, 2018

      • ruaok
      • 2018-05-01 12142, 2018

      • yvanzo
        ruaok: May I move old packets (prior to the gap) from replication-packets (ftp) (where it is checked) to backups (or any other place for archives)?
      • 2018-05-01 12159, 2018

      • ruaok
        sure
      • 2018-05-01 12106, 2018

      • ruaok
just don't put more recent ones onto FTP
      • 2018-05-01 12118, 2018

      • yvanzo
        OK
      • 2018-05-01 12158, 2018

      • Lotheric has quit
      • 2018-05-01 12153, 2018

      • yvanzo
        ruaok: replication check is happy now
      • 2018-05-01 12118, 2018

      • yvanzo
        We still have packets previous to the gap in backups.
      • 2018-05-01 12128, 2018

      • ruaok
        thanks for fixing that.
      • 2018-05-01 12137, 2018

      • ruaok
        zas: replication check is ready for you to try again.
      • 2018-05-01 12135, 2018

      • yvanzo
I also moved discontinued packets (from during the gap) from old/ subdir in replication-packets (ftp) to subdir replication-95000-95625/ in backups.
      • 2018-05-01 12109, 2018

      • Lotheric joined the channel
      • 2018-05-01 12155, 2018

      • Lotheric has quit
      • 2018-05-01 12123, 2018

      • SothoTalKer has quit
      • 2018-05-01 12123, 2018

      • Nyanko-sensei has quit
      • 2018-05-01 12103, 2018

      • Nyanko-sensei joined the channel
      • 2018-05-01 12139, 2018

      • Sophist-UK has quit
      • 2018-05-01 12158, 2018

      • Nyanko-sensei has quit
      • 2018-05-01 12145, 2018

      • Sophist-UK joined the channel
      • 2018-05-01 12134, 2018

      • kartikeyaSh_ircc joined the channel
      • 2018-05-01 12154, 2018

      • SothoTalKer joined the channel
      • 2018-05-01 12102, 2018

      • Lotheric joined the channel
      • 2018-05-01 12128, 2018

      • Sophist-UK has quit
      • 2018-05-01 12154, 2018

      • Nyanko-sensei joined the channel
      • 2018-05-01 12157, 2018

      • bukwurm has quit
      • 2018-05-01 12132, 2018

      • Nyanko-sensei has quit
      • 2018-05-01 12143, 2018

      • Nyanko-sensei joined the channel
      • 2018-05-01 12122, 2018

      • ruaok
        yvanzo: zas: samj1912: invoices please!
      • 2018-05-01 12143, 2018

      • Slurpee joined the channel
      • 2018-05-01 12143, 2018

      • Slurpee has quit
      • 2018-05-01 12143, 2018

      • Slurpee joined the channel
      • 2018-05-01 12114, 2018

      • bukwurm joined the channel
      • 2018-05-01 12148, 2018

      • zas
        bitmap: ping me when around
      • 2018-05-01 12132, 2018

      • Slurpee has quit
      • 2018-05-01 12117, 2018

      • drsaund has quit
      • 2018-05-01 12129, 2018

      • rembo10 has quit
      • 2018-05-01 12129, 2018

      • angreifer has quit
      • 2018-05-01 12147, 2018

      • rembo10 joined the channel
      • 2018-05-01 12149, 2018

      • angreifer joined the channel
      • 2018-05-01 12123, 2018

      • kartikeyaSh_ircc has quit
      • 2018-05-01 12121, 2018

      • ephemer0l has quit
      • 2018-05-01 12133, 2018

      • drsaund joined the channel
      • 2018-05-01 12153, 2018

      • drsaund has quit
      • 2018-05-01 12126, 2018

      • outsidecontext joined the channel
      • 2018-05-01 12139, 2018

      • ephemer0l_ joined the channel
      • 2018-05-01 12157, 2018

      • bukwurm has quit
      • 2018-05-01 12149, 2018

      • outsidecontext has quit
      • 2018-05-01 12130, 2018

      • CatQuest
        meagblup! O_O
      • 2018-05-01 12141, 2018

      • CatQuest
        megablup! O_O
      • 2018-05-01 12150, 2018

      • CatQuest
        o_O
      • 2018-05-01 12150, 2018

      • CatQuest
        O_o
      • 2018-05-01 12159, 2018

      • CatQuest
        blupblupblup
      • 2018-05-01 12112, 2018

      • CatQuest
        (when will sigyn ban me :P)
      • 2018-05-01 12146, 2018

      • bitmap
        zas: ping
      • 2018-05-01 12158, 2018

      • CatQuest
sorry, I didn't take my medicine
      • 2018-05-01 12154, 2018

      • zas
        Hey bitmap, I'm available in 30 minutes to discuss about pg stuff, meanwhile you can read backlog if you didn't already.
      • 2018-05-01 12146, 2018

      • bitmap
        ok cool, I saw some talk about autovacuum misbehaving
      • 2018-05-01 12109, 2018

      • bitmap
        s/misbehaving/not being tuned properly?/
      • 2018-05-01 12140, 2018

      • zas
        I'd say it needs tuning, but I'm not sure yet about the exact cause of the problem. We saw a manual analyze fixed it. But I'm not sure why yet.
      • 2018-05-01 12119, 2018

      • Slurpee joined the channel
      • 2018-05-01 12129, 2018

      • bitmap
      • 2018-05-01 12156, 2018

      • zas
        autoanalyze is what we want
      • 2018-05-01 12157, 2018

      • bitmap
        yeah
      • 2018-05-01 12121, 2018

      • zas
        to start with i think we need to increase logging information concerning autovacuum
      • 2018-05-01 12145, 2018

      • Slurpee has quit
      • 2018-05-01 12121, 2018

      • zas
according to my findings, the problem started 2 weeks ago, but i can't explain why it somehow fixed itself; we had 3 "events", look at https://stats.metabrainz.org/d/eStswhGmk/postgres…
      • 2018-05-01 12144, 2018

      • zas
      • 2018-05-01 12157, 2018

      • zas
it is perhaps because autoanalyze actually partially occurred, but overall it continued to degrade over time
      • 2018-05-01 12137, 2018

      • zas
perhaps conditions triggering analyze aren't met often enough
      • 2018-05-01 12140, 2018
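zas's point about trigger conditions can be made concrete: PostgreSQL only autoanalyzes a table once enough rows have changed since the last analyze. A minimal sketch of that condition, using the stock default settings:

```python
# Sketch of PostgreSQL's autoanalyze trigger condition (stock defaults:
# autovacuum_analyze_threshold = 50, autovacuum_analyze_scale_factor = 0.1).
def autoanalyze_fires(n_mod_since_analyze, reltuples,
                      threshold=50, scale_factor=0.1):
    # autoanalyze runs when rows modified since the last analyze exceed:
    #   threshold + scale_factor * reltuples
    return n_mod_since_analyze > threshold + scale_factor * reltuples

# A 10M-row table needs over 1,000,050 modified rows before it qualifies:
print(autoanalyze_fires(500_000, 10_000_000))    # False
print(autoanalyze_fires(1_100_000, 10_000_000))  # True
```

Lowering `autovacuum_analyze_scale_factor` (globally, or per table via `ALTER TABLE ... SET (...)`) makes large tables qualify far more often.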

      • bitmap
are there no logs indicating whether autoanalyze actually occurred?
      • 2018-05-01 12118, 2018

      • zas
      • 2018-05-01 12134, 2018

      • zas
        nothing between january and april
      • 2018-05-01 12149, 2018

      • zas
        though the daemon was running afaik
      • 2018-05-01 12122, 2018

      • zas
so, maybe the conditions to start autovacuum aren't met