#metabrainz

      • UmkaDK joined the channel
      • 2018-05-01 12127, 2018

      • UmkaDK has quit
      • 2018-05-01 12151, 2018

      • UmkaDK joined the channel
      • 2018-05-01 12110, 2018

      • reosarevok
        zas: what's off with the https://tickets.metabrainz.org/browse/STYLE-398 certificates?
      • 2018-05-01 12158, 2018

      • drsaund is back from tax hell
      • 2018-05-01 12154, 2018

      • Major_Lurker
        yay
      • 2018-05-01 12149, 2018

      • drsaund
        worked 16hrs today, been home for 2 and a half and i'm wired
      • 2018-05-01 12101, 2018

      • Major_Lurker
        tax should be banned
      • 2018-05-01 12111, 2018

      • drsaund
        well then i wouldn't have a job
      • 2018-05-01 12125, 2018

      • drsaund
        but aholes that wait until the last day should be flogged at least
      • 2018-05-01 12128, 2018

      • Major_Lurker
        mmm oh well.... hermit is not bad
      • 2018-05-01 12137, 2018

      • drsaund
        now i've got lots of time to pester reosarevok into making CAA-84 happen
      • 2018-05-01 12124, 2018

      • Major_Lurker has quit
      • 2018-05-01 12150, 2018

      • Major_Lurker joined the channel
      • 2018-05-01 12119, 2018

      • Freso
        reosarevok (zas): Nothing's off with it, it's just expired. It expired on May 1st 1:59 CEST (i.e., almost 8 hours ago).
      • 2018-05-01 12102, 2018

      • Freso
        There might be something off with the script that should automatically make new certificates before the current one expires, though. :)
      • 2018-05-01 12147, 2018

      • d4rkie joined the channel
      • 2018-05-01 12145, 2018

      • Nyanko-sensei has quit
      • 2018-05-01 12139, 2018

      • drsaund
        sweet..and i just found a J in the bowels of my chair
      • 2018-05-01 12121, 2018

      • Freso
        Congrats!
      • 2018-05-01 12100, 2018

      • drsaund
      • 2018-05-01 12107, 2018

      • UmkaDK_ joined the channel
      • 2018-05-01 12114, 2018

      • UmkaDK has quit
      • 2018-05-01 12133, 2018

      • Slurpee joined the channel
      • 2018-05-01 12153, 2018

      • UmkaDK_ has quit
      • 2018-05-01 12159, 2018

      • UmkaDK joined the channel
      • 2018-05-01 12142, 2018

      • zas
        moin
      • 2018-05-01 12112, 2018

      • zas
        reosarevok: cert expired on tickets but also on stats.metabrainz.org, i don't know what is going on, but that's very weird
      • 2018-05-01 12109, 2018

      • Nyanko-sensei joined the channel
      • 2018-05-01 12121, 2018

      • d4rkie has quit
      • 2018-05-01 12139, 2018

      • zas
        ok, my fault, forgot to deploy new certs to those machines, fix in progress
      • 2018-05-01 12120, 2018

      • zas
        fixed, i also added checks for those issues
      • 2018-05-01 12131, 2018

      • zas
        ruaok: autovacuum is actually running, according to logs, but it may not trigger an ANALYZE, because the config has no specific options (it uses default thresholds). The command that was run yesterday is VACUUM ANALYZE.
      • 2018-05-01 12112, 2018
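
        (Whether autoanalyze is actually firing can also be checked in the database itself rather than only in the logs. A minimal sketch, assuming a reasonably recent PostgreSQL and access to musicbrainz_db:)

            -- autovacuum settings currently in effect (defaults unless overridden)
            SELECT name, setting FROM pg_settings WHERE name LIKE 'autovacuum%';

            -- per-table picture: pending modifications and last (auto)analyze times
            SELECT relname, n_mod_since_analyze, last_autoanalyze, last_analyze
            FROM pg_stat_user_tables
            ORDER BY last_autoanalyze NULLS FIRST;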

      • zas
        I checked that the improvement is due to that, and it seems to be the case: 2018-04-30 23:29:16.936 GMT postgres@musicbrainz_db 22301 172.17.0.1(16512) LOG: duration: 2486396.275 ms statement: VACUUM ANALYZE;
      • 2018-05-01 12126, 2018
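
        (For scale, that logged duration works out to roughly 41 minutes:)

            SELECT 2486396.275 / 1000 / 60 AS minutes;   -- ≈ 41.4 minutes for the VACUUM ANALYZE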

      • zas
        the timing matches the load decrease
      • 2018-05-01 12159, 2018

      • zas
        so, to my understanding, autovacuum works, but needs to be tuned
      • 2018-05-01 12131, 2018

      • zas
        running vacuum manually shouldn't be needed with current pg versions
      • 2018-05-01 12144, 2018

      • zas
      • 2018-05-01 12115, 2018

      • zas
        bitmap: i'd start by logging autovacuum actions (log_autovacuum_min_duration); it should also be noted that we can set those options per table
      • 2018-05-01 12135, 2018
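
        (A sketch of what zas is proposing, assuming tuning via postgresql.conf plus per-table storage parameters; the table name below is only an example:)

            # postgresql.conf: log every autovacuum/autoanalyze action
            # (value is a duration threshold in ms; 0 = log all, -1 = off)
            log_autovacuum_min_duration = 0

            -- per-table override, e.g. for a table that changes a lot:
            ALTER TABLE artist_credit SET (
                autovacuum_analyze_scale_factor = 0.02,
                autovacuum_analyze_threshold    = 1000
            );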

      • UmkaDK
        Hi guys, I just wanted to check on your replication server. How is it doing? Is it feeling better?
      • 2018-05-01 12111, 2018

      • UmkaDK
        We've disabled some of the alarms while replication was going up and down, so I just need to know if it's safe to re-enable them.
      • 2018-05-01 12123, 2018

      • zas
        UmkaDK: nope, this issue isn't fixed yet. https://metabrainz.org/api/musicbrainz/replicatio…
      • 2018-05-01 12127, 2018

      • UmkaDK
        Thanks zas, I'll keep the alarms off for now. Good luck with the fix!!
      • 2018-05-01 12147, 2018

      • zas
        UmkaDK: we had serious db issues these last few days, the replication issue is prolly related to this, i think it'll be fixed soon
      • 2018-05-01 12110, 2018

      • UmkaDK
        Yee, I've seen the chatter. Thanks for the update zas!
      • 2018-05-01 12104, 2018

      • ruaok
        moin!
      • 2018-05-01 12128, 2018

      • ruaok
        > I checked that the improvement is due to that, and it seems to be the case: 2018-04-30 23:29:16.936 GMT postgres@musicbrainz_db 22301 172.17.0.1(16512) LOG: duration: 2486396.275 ms statement: VACUUM ANALYZE;
      • 2018-05-01 12137, 2018

      • ruaok
        that is the manual one yvanzo started.
      • 2018-05-01 12121, 2018

      • ruaok
      • 2018-05-01 12146, 2018

      • ruaok
        that is not actually the reason why replication is behind -- that check is pointing to another ancient problem.
      • 2018-05-01 12107, 2018

      • zas
        ah
      • 2018-05-01 12118, 2018

      • ruaok
        yvanzo: when you get up, things improved after the vacuum analyze, now we should be able to generate packets again.
      • 2018-05-01 12125, 2018

      • ruaok
        can you kick that process, please.
      • 2018-05-01 12107, 2018

      • ruaok
        do you know which machine stores replication packets, zas? not inside a container.. just on the FS.
      • 2018-05-01 12142, 2018

      • yvanzo
        ruaok: yup, seen that :)
      • 2018-05-01 12108, 2018

      • yvanzo
        kicked
      • 2018-05-01 12127, 2018

      • ruaok
        great, thanks.
      • 2018-05-01 12137, 2018

      • ruaok
        do you know where the replication packets are stored?
      • 2018-05-01 12158, 2018

      • ruaok
        one of our machines generates replication packets and stores them on the FS. where is that?
      • 2018-05-01 12154, 2018

      • yvanzo
        still the same place, in docker, on hip
      • 2018-05-01 12141, 2018

      • ruaok
        not ever stored on the FS?
      • 2018-05-01 12141, 2018

      • yvanzo
        oh right, there is a backup dir as well, let me search
      • 2018-05-01 12127, 2018

      • yvanzo
        ruaok: still not on the FS, the backup dir is in the container musicbrainz-production-cron at ~musicbrainz/backup
      • 2018-05-01 12106, 2018

      • ruaok
        yvanzo: ok, thanks.
      • 2018-05-01 12123, 2018

      • ruaok
        zas: sudden degradation in performance happens because of the optimizer.
      • 2018-05-01 12130, 2018

      • ruaok
        ... errr planner.
      • 2018-05-01 12147, 2018

      • zas
        morning yvanzo
      • 2018-05-01 12152, 2018

      • ruaok
        the planner needs to use statistics from the DB to plan a query for optimum efficiency.
      • 2018-05-01 12103, 2018
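
        (To make that concrete: the statistics live in pg_stats and the planner's current choice can be inspected with EXPLAIN. A rough sketch; the query and names are illustrative, not one of the known-slow production queries:)

            -- what plan does the planner pick right now, and was it accurate?
            EXPLAIN (ANALYZE, BUFFERS)
                SELECT r.id
                FROM release r
                JOIN artist_credit ac ON ac.id = r.artist_credit
                WHERE ac.name = 'Some Artist';

            -- the row estimates above are driven by these statistics:
            SELECT attname, n_distinct, null_frac
            FROM pg_stats
            WHERE tablename = 'artist_credit';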

      • yvanzo
        hi zas!
      • 2018-05-01 12110, 2018

      • ruaok
        at some point in time, black magic, really, something changes.
      • 2018-05-01 12130, 2018

      • zas
        let's start with facts: autovacuum runs, but doesn't help with this problem
      • 2018-05-01 12132, 2018

      • ruaok
        the planner needs to do something different because the previous strategy didn't work anymore.
      • 2018-05-01 12142, 2018

      • zas
        VACUUM ANALYZE did help a lot
      • 2018-05-01 12151, 2018

      • ruaok
        hang on. I'm trying to answer your previous questions.
      • 2018-05-01 12108, 2018

      • zas
        k
      • 2018-05-01 12151, 2018

      • ruaok
        so, what used to be an efficient in-memory query now becomes a more expensive on-disk query... for instance.
      • 2018-05-01 12101, 2018

      • ruaok
        and that has a greater impact on system performance.
      • 2018-05-01 12110, 2018

      • zas
        hmmm, wait, i need to verify this
      • 2018-05-01 12116, 2018

      • ruaok
        and if that query was used a lot, then bam your server is overloaded.
      • 2018-05-01 12135, 2018

      • ruaok
        but, this may be because it is using old statistics.
      • 2018-05-01 12129, 2018

      • zas
        you assume it causes disk activity, and therefore a slowdown, but nope: https://stats.metabrainz.org/d/000000048/hetzner-…
      • 2018-05-01 12104, 2018

      • zas
        there was no increase in disk IO during the 3 events, i can only see a CPU usage increase in fact
      • 2018-05-01 12115, 2018

      • ruaok
        if the statistics don't get updated, or old shit isn't thrown out of the DB, then it has to assume that things have changed. it can't make good judgements anymore.
      • 2018-05-01 12110, 2018

      • zas
        yup, but then why isn't it more gradual? i mean i'd expect slow performance degradation
      • 2018-05-01 12116, 2018

      • ruaok
        so, running vacuum analyze throws out old shit and updates stats.
      • 2018-05-01 12145, 2018

      • ruaok
        because the query planner now needs to make a different decision. it reaches a threshold.
      • 2018-05-01 12155, 2018

      • ruaok
        once it passes the threshold, bam everything backs up.
      • 2018-05-01 12112, 2018

      • ruaok
        now, you can spend a pile of time looking for exactly what happened.
      • 2018-05-01 12127, 2018

      • ruaok
        but in the end the outcome is the same: optimize your DB or add more capacity.
      • 2018-05-01 12157, 2018

      • zas
        i tend to disagree here, as it doesn't explain everything
      • 2018-05-01 12110, 2018

      • zas
      • 2018-05-01 12137, 2018

      • zas
        the first and second ones went through without any action (i think), and lasted for 24-36 hours
      • 2018-05-01 12102, 2018

      • zas
        i see no trace of VACUUM in logs for those events
      • 2018-05-01 12151, 2018

      • ruaok
        ok, I'm speaking from 15 years of experience running postgres.
      • 2018-05-01 12107, 2018

      • ruaok
        I'm trying to save you more frustration.
      • 2018-05-01 12112, 2018

      • zas
        yes, i understand, but please stick to facts
      • 2018-05-01 12126, 2018

      • ruaok
        I see you approaching this with tools that simply are not effective in combatting this.
      • 2018-05-01 12147, 2018

      • ruaok
        I don't have facts. that's the whole problem about this. facts are hard to come by.
      • 2018-05-01 12112, 2018

      • ruaok
        what I do have is observed patterns and experience with this problem.
      • 2018-05-01 12135, 2018

      • ruaok
        this is classic "nothing changed, but PG is freaking out, why?" I've been here several times before.
      • 2018-05-01 12135, 2018

      • zas
        i can't find solutions until the problem is well defined... and for now, your attempt to define the problem doesn't match any measurement
      • 2018-05-01 12107, 2018

      • zas
        you said "what used to be an efficient in-memory query now becomes an more expensive on disk query..." -> where's the disk activity ?
      • 2018-05-01 12139, 2018

      • ruaok
        I am trying to describe one of many different scenarios. I really don't know what the query planner is or is not doing in this case.
      • 2018-05-01 12151, 2018

      • ruaok
        it simply may not be disk related.
      • 2018-05-01 12136, 2018

      • zas
        ok, but keep your scenarios realistic: here we only had an increase in cpu usage; network, memory and disk activity remained constant
      • 2018-05-01 12122, 2018

      • zas
        but i agree with you it is a decrease in efficiency, likely coming from bad predictions (and lack of ANALYZE)
      • 2018-05-01 12101, 2018

      • zas
        so why does this happen suddenly (starting on the 17th) after months of stable activity?
      • 2018-05-01 12143, 2018

      • zas
        did the size of a table trigger something? the number of inserts/deletes/updates?
      • 2018-05-01 12107, 2018

      • ruaok
        I don't know, but I can speculate....
      • 2018-05-01 12116, 2018

      • ruaok
        the query planner decides to use X tables for a query...
      • 2018-05-01 12136, 2018

      • ruaok
        then stats go out of date and things get more fuzzy.
      • 2018-05-01 12111, 2018

      • ruaok
        now it can't keep the whole table in ram anymore, or it thinks it can't.
      • 2018-05-01 12130, 2018

      • ruaok
        then it may need to go get bits from disk and do more loading of data.
      • 2018-05-01 12147, 2018

      • ruaok
        but that data is fresh so it actually resides in cache (RAM, not L2)
      • 2018-05-01 12127, 2018

      • ruaok
        so, now more fetching across RAM.
      • 2018-05-01 12151, 2018

      • ruaok
        and it might really only be a slight change, but that slight change is what trips the tipping point.
      • 2018-05-01 12121, 2018

      • ruaok
        and now everything backs up and can't ever recover and suddenly the server is totally overloaded.
      • 2018-05-01 12147, 2018

      • ruaok
        by running the stats and throwing out old cruft, we get back to a point where the planner can do things better.
      • 2018-05-01 12106, 2018

      • ruaok
        but, this only buys a small amount of time. we will hit this point again and each time this point comes closer.
      • 2018-05-01 12158, 2018

      • ruaok
        looking for a possible cause in all of this can be a huge undertaking that has one recommended outcome: tune your server or add more capacity.
      • 2018-05-01 12121, 2018

      • ruaok
        so, best to take two approaches to this, as we have been: work to add more capacity in the short term.
      • 2018-05-01 12141, 2018

      • ruaok
        post react move, we focus on caching and optimizing the worst queries in our system.
      • 2018-05-01 12113, 2018

      • zas
        so, you think ANALYZE was run by autovacuum and it was fixed by itself during the first 2 events? but i don't see any autovacuum stuff in the logs
      • 2018-05-01 12158, 2018

      • ruaok
        for some reason autovacuum isn't working. we need to find out why -- but twice now an ANALYZE has solved our problems.
      • 2018-05-01 12109, 2018

      • ruaok
        so, we need to do that on a regular basis until we know why we need to do it.
      • 2018-05-01 12137, 2018

      • zas
        yes, i just checked logs again
      • 2018-05-01 12153, 2018

      • ruaok
        not sure what you mean by the first 2 events.. I think it may have done its job until the load became too high, and then it stopped working in order to not bring the server to its knees.
      • 2018-05-01 12158, 2018

      • zas
      • 2018-05-01 12114, 2018

      • zas
        basically nothing between january and april...
      • 2018-05-01 12137, 2018

      • zas
        my guess: temp tables created by the search indexer cause this
      • 2018-05-01 12139, 2018

      • ruaok
        and I bet that if we check our chat logs we will find that we had a freakout in jan. :)
      • 2018-05-01 12151, 2018

      • ruaok
        sure, that is quite possible.
      • 2018-05-01 12102, 2018

      • zas
        they aren't cleaned by autovacuum, and other tables stay under thresholds
      • 2018-05-01 12109, 2018
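
        (For context on "under thresholds": with default settings, autoanalyze only fires on a table once n_mod_since_analyze exceeds 50 + 0.1 × reltuples, which large, mostly-static tables rarely reach. A sketch of how to compare tables against that default:)

            SELECT s.relname,
                   s.n_mod_since_analyze,
                   50 + 0.1 * c.reltuples AS default_analyze_threshold
            FROM pg_stat_user_tables s
            JOIN pg_class c ON c.oid = s.relid
            ORDER BY s.n_mod_since_analyze DESC;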

      • Slurpee has quit
      • 2018-05-01 12126, 2018

      • zas
        but the autovacuum process actually runs, it just does nothing
      • 2018-05-01 12138, 2018

      • ruaok
        then an immediate fix is to add a cron job to run ANALYZE once a week during slow times.
      • 2018-05-01 12159, 2018
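
        (A minimal sketch of such a cron entry, /etc/cron.d style; the schedule, user and database name are assumptions, not what production uses:)

            # run a weekly ANALYZE during a quiet window (here: Sunday 04:00)
            0 4 * * 0   postgres   psql -d musicbrainz_db -c 'ANALYZE;'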

      • ruaok
        not sure about nothing... it probably updates stats.
      • 2018-05-01 12113, 2018

      • zas
        yes, but according to https://dba.stackexchange.com/questions/18664/are… it shouldn't be needed
      • 2018-05-01 12125, 2018

      • zas
        we can tweak autovacuum options for that
      • 2018-05-01 12143, 2018

      • zas
        it doesn't fix the temp tables thing though
      • 2018-05-01 12143, 2018

      • zas
        question: why are temp tables created? pg_temp_44.tmp_artistcredit (and other orphans) were created by what?
      • 2018-05-01 12131, 2018

      • zas
        it seems to me autovacuum took care of those after we restarted pg
      • 2018-05-01 12108, 2018

      • ruaok
        the search indexer creates them to make index building more efficient.
      • 2018-05-01 12145, 2018

      • ruaok
        rather than repeatedly running the same query to get a chunk of rows, it creates a temp table and then queries that table while building an index.
      • 2018-05-01 12122, 2018

      • ruaok
        the difference is running a fat query a few thousand times vs. running it once and then just fetching rows without a join.
      • 2018-05-01 12138, 2018
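
        (The pattern ruaok describes looks roughly like this; the join and column names are made up for illustration, only tmp_artistcredit comes from the log above:)

            -- run the fat query once, into a session-local temp table...
            CREATE TEMPORARY TABLE tmp_artistcredit AS
                SELECT ac.id, ac.name, acn.artist
                FROM artist_credit ac
                JOIN artist_credit_name acn ON acn.artist_credit = ac.id;

            -- ...then page through it cheaply while building the index, no join per chunk
            SELECT * FROM tmp_artistcredit WHERE id BETWEEN 1 AND 10000;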

      • zas
      • 2018-05-01 12125, 2018

      • zas
        i propose to increase autovacuum verbosity to start with
      • 2018-05-01 12134, 2018

      • ruaok
        sure.
      • 2018-05-01 12150, 2018

      • ruaok
        what is our current work_mem setting?
      • 2018-05-01 12156, 2018
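
        (For reference, how the setting can be checked and, if needed, changed; the 64MB value is purely illustrative:)

            SHOW work_mem;                        -- current value for this session
            ALTER SYSTEM SET work_mem = '64MB';   -- illustrative value, written to postgresql.auto.conf
            SELECT pg_reload_conf();              -- picked up by new sessions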

      • ruaok
        that one had good results too.