0:08 AM 
     
        
        ephemer0l_ has quit
     
      2018-05-01 12120, 2018  
    
    
        0:11 AM 
     
        
        Nyanko-sensei has quit
     
      2018-05-01 12133, 2018  
    
    
        0:12 AM 
     
        
        ephemer0l joined the channel
     
      2018-05-01 12145, 2018  
    
    
        0:15 AM 
     
        
        UmkaDK_ has quit
     
      2018-05-01 12137, 2018  
    
    
        0:18 AM 
     
                    ruaok
                nice! load on bowie is now 8.5!
     
      2018-05-01 12104, 2018  
    
    
        0:19 AM 
     
                    ruaok
                ok, lesson learned: autovacuum does not work. we all need to remember that for next time this exact same problem happens.
     
      2018-05-01 12119, 2018  
    
    
        0:19 AM 
     
                    ruaok
                I wonder why that is -- because the server is too overloaded for the daemon to run?
     
      2018-05-01 12121, 2018  
    
    
        0:22 AM 
     
                    SothoTalKer
                is it fast again?
     
      2018-05-01 12148, 2018  
    
    
        0:29 AM 
     
        
        Nyanko-sensei joined the channel
     
      2018-05-01 12143, 2018  
    
    
        0:30 AM 
     
                    ruaok
                looks better
     
      2018-05-01 12144, 2018  
    
    
        0:35 AM 
     
                    SothoTalKer
                yay
     
      2018-05-01 12149, 2018  
    
    
        0:50 AM 
     
        
        dragonzeron has quit
     
      2018-05-01 12102, 2018  
    
    
        1:14 AM 
     
        
        Slurpee joined the channel
     
      2018-05-01 12102, 2018  
    
    
        1:14 AM 
     
        
        Slurpee has quit
     
      2018-05-01 12102, 2018  
    
    
        1:14 AM 
     
        
        Slurpee joined the channel
     
      2018-05-01 12153, 2018  
    
    
        1:17 AM 
     
                    CatQuest
                hilariously, it seems everything *but* mb is slow now (wikipedia, discogs)
     
      2018-05-01 12130, 2018  
    
    
        1:19 AM 
     
        
        Gore|woerk joined the channel
     
      2018-05-01 12109, 2018  
    
    
        1:20 AM 
     
        
        G0re has quit
     
      2018-05-01 12121, 2018  
    
    
        2:30 AM 
     
                    Major_Lurker
                woah, metabrainz TShirt arrived....
     
      2018-05-01 12129, 2018  
    
    
        2:34 AM 
     
                    SothoTalKer
                does it fit?
     
      2018-05-01 12157, 2018  
    
    
        2:34 AM 
     
                    Major_Lurker
                looks like it will
     
      2018-05-01 12112, 2018  
    
    
        2:35 AM 
     
                    Major_Lurker
                i got the fat size
     
      2018-05-01 12109, 2018  
    
    
        2:47 AM 
     
                    SothoTalKer
                3XL
     
      2018-05-01 12122, 2018  
    
    
        2:50 AM 
     
        
        Major_Lurker has quit
     
      2018-05-01 12149, 2018  
    
    
        2:50 AM 
     
        
        Major_Lurker joined the channel
     
      2018-05-01 12138, 2018  
    
    
        3:12 AM 
     
                    Major_Lurker
                ha no just xl
     
      2018-05-01 12156, 2018  
    
    
        3:32 AM 
     
                    SothoTalKer
                aww
     
      2018-05-01 12133, 2018  
    
    
        3:40 AM 
     
        
        sentriz has quit
     
      2018-05-01 12122, 2018  
    
    
        3:41 AM 
     
        
        sentriz joined the channel
     
      2018-05-01 12151, 2018  
    
    
        4:10 AM 
     
        
        Slurpee has quit
     
      2018-05-01 12136, 2018  
    
    
        5:08 AM 
     
                    CatQuest
                he's australian SothoTalKer, not american :3
     
      2018-05-01 12144, 2018  
    
    
        5:12 AM 
     
        
        drsaund joined the channel
     
      2018-05-01 12118, 2018  
    
    
        6:21 AM 
     
        
        UmkaDK joined the channel
     
      2018-05-01 12127, 2018  
    
    
        6:25 AM 
     
        
        UmkaDK has quit
     
      2018-05-01 12151, 2018  
    
    
        7:00 AM 
     
        
        UmkaDK joined the channel
     
      2018-05-01 12110, 2018  
    
    
        7:23 AM 
     
                    reosarevok
                
     
      2018-05-01 12158, 2018  
    
    
        7:31 AM 
     
        
        drsaund is back from tax hell
     
      2018-05-01 12154, 2018  
    
    
        7:32 AM 
     
                    Major_Lurker
                yay
     
      2018-05-01 12149, 2018  
    
    
        7:33 AM 
     
                    drsaund
                worked 16hrs today, been home for 2 and a half and i'm wired
     
      2018-05-01 12101, 2018  
    
    
        7:34 AM 
     
                    Major_Lurker
                tax should be banned
     
      2018-05-01 12111, 2018  
    
    
        7:34 AM 
     
                    drsaund
                well then i wouldn't have a job
     
      2018-05-01 12125, 2018  
    
    
        7:34 AM 
     
                    drsaund
                but aholes that wait until the last day should be flogged at least
     
      2018-05-01 12128, 2018  
    
    
        7:34 AM 
     
                    Major_Lurker
                mmm oh well.... hermit is not bad
     
      2018-05-01 12137, 2018  
    
    
        7:36 AM 
     
                    drsaund
                now i've got lots of time to pester reosarevok into making CAA-84 happen
     
      2018-05-01 12124, 2018  
    
    
        7:40 AM 
     
        
        Major_Lurker has quit
     
      2018-05-01 12150, 2018  
    
    
        7:40 AM 
     
        
        Major_Lurker joined the channel
     
      2018-05-01 12119, 2018  
    
    
        7:44 AM 
     
                    Freso
                reosarevok (zas): Nothing's off with it, it's just expired. It expired on May 1st 1:59 CEST (ie., almost 8 hours ago).
     
      2018-05-01 12102, 2018  
    
    
        7:45 AM 
     
                    Freso
                There might be something off with the script that should automatically make new certificates before the current one expires though. :)
     
      2018-05-01 12147, 2018  
    
    
        7:46 AM 
     
        
        d4rkie joined the channel
     
      2018-05-01 12145, 2018  
    
    
        7:47 AM 
     
        
        Nyanko-sensei has quit
     
      2018-05-01 12139, 2018  
    
    
        7:55 AM 
     
                    drsaund
                sweet..and i just found a J in the bowels of my chair
     
      2018-05-01 12121, 2018  
    
    
        7:56 AM 
     
                    Freso
                Congrats!
     
      2018-05-01 12100, 2018  
    
    
        7:57 AM 
     
                    drsaund
                
     
      2018-05-01 12107, 2018  
    
    
        7:58 AM 
     
        
        UmkaDK_ joined the channel
     
      2018-05-01 12114, 2018  
    
    
        8:00 AM 
     
        
        UmkaDK has quit
     
      2018-05-01 12133, 2018  
    
    
        8:15 AM 
     
        
        Slurpee joined the channel
     
      2018-05-01 12153, 2018  
    
    
        8:27 AM 
     
        
        UmkaDK_ has quit
     
      2018-05-01 12159, 2018  
    
    
        8:27 AM 
     
        
        UmkaDK joined the channel
     
      2018-05-01 12142, 2018  
    
    
        8:33 AM 
     
                    zas
                moin
     
      2018-05-01 12112, 2018  
    
    
        8:34 AM 
     
                    zas
                reosarevok: cert expired on tickets but also on stats.metabrainz.org, i don't know what is going on, but that's very weird
     
      2018-05-01 12109, 2018  
    
    
        8:35 AM 
     
        
        Nyanko-sensei joined the channel
     
      2018-05-01 12121, 2018  
    
    
        8:36 AM 
     
        
        d4rkie has quit
     
      2018-05-01 12139, 2018  
    
    
        8:41 AM 
     
                    zas
                ok, my fault, forgot to deploy new certs to those machines, fix in progress
     
      2018-05-01 12120, 2018  
    
    
        8:48 AM 
     
                    zas
                fixed, i also added checks for those issues
     
      2018-05-01 12131, 2018  
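zas does not say what the added checks look like. As one possible shape for such a check, here is a minimal Python sketch that turns a certificate's `notAfter` string (the format returned by `ssl.SSLSocket.getpeercert()`) into days remaining. The function names and the 14-day warning window are hypothetical, and the `:00` seconds in the example timestamp are assumed, since Freso only gives the expiry to the minute (1:59 CEST, which is 23:59 GMT the previous day).

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Days left before a certificate expires; negative once it has expired.

    not_after uses the 'Mon DD HH:MM:SS YYYY GMT' format found in the
    'notAfter' field of ssl.SSLSocket.getpeercert().
    """
    now = now or datetime.now(timezone.utc)
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now.timestamp()) / 86400.0


def needs_renewal(not_after, warn_days=14, now=None):
    """True when the certificate expires within warn_days (hypothetical window)."""
    return days_until_expiry(not_after, now) < warn_days


if __name__ == "__main__":
    # Expiry time from the log, seconds assumed; "now" is Freso's message time in GMT.
    when = datetime(2018, 5, 1, 7, 44, tzinfo=timezone.utc)
    print(days_until_expiry("Apr 30 23:59:00 2018 GMT", now=when) * 24, "hours")
```

Run against the values from the log, the result is negative by a little under 8 hours, which matches Freso's "almost 8 hours ago".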
    
    
        9:08 AM 
     
                    zas
                ruaok: autovacuum is actually running, according to logs, but it may not trigger an ANALYZE, because the config has no specific options (it uses default thresholds). The command that was run yesterday is VACUUM ANALYZE.
     
      2018-05-01 12112, 2018  
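For context on the "default thresholds" zas mentions: per the PostgreSQL documentation, autovacuum issues an ANALYZE on a table once the rows inserted, updated or deleted since the last analyze exceed `autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * rows` (defaults 50 and 0.1), and a VACUUM once dead rows exceed the analogous vacuum threshold (defaults 50 and 0.2). The sketch below only does that arithmetic; the table size is made up for illustration and is not a MusicBrainz figure.

```python
# Stock PostgreSQL defaults: (base threshold, scale factor).
DEFAULTS = {
    "vacuum": (50, 0.2),
    "analyze": (50, 0.1),
}


def autovacuum_trigger(kind, live_rows, base=None, scale=None):
    """Number of changed rows needed before autovacuum acts on a table.

    kind is "vacuum" or "analyze"; base/scale override the defaults,
    as a per-table setting would.
    """
    default_base, default_scale = DEFAULTS[kind]
    base = default_base if base is None else base
    scale = default_scale if scale is None else scale
    return base + scale * live_rows


if __name__ == "__main__":
    rows = 20_000_000  # hypothetical table size
    print("analyze after", autovacuum_trigger("analyze", rows), "changed rows")
    print("vacuum after", autovacuum_trigger("vacuum", rows), "dead rows")
    # A smaller scale factor makes ANALYZE fire far sooner on a big table.
    print("tuned analyze after", autovacuum_trigger("analyze", rows, scale=0.01))
```

With the defaults, a large table can accumulate millions of changes before its statistics are refreshed, which is one way autovacuum can be "running" yet not help here.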
    
    
        9:09 AM 
     
                    zas
                I checked that the improvement is due to that, and it seems to be the case: 2018-04-30 23:29:16.936 GMT postgres@musicbrainz_db 22301 172.17.0.1(16512) LOG:  duration: 2486396.275 ms  statement: VACUUM ANALYZE;
     
      2018-05-01 12126, 2018  
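To compare that log line against the load graphs, it helps to turn the duration into a start and end time. A small sketch, assuming the timestamp on a PostgreSQL `duration:` line is written when the statement completes; the function name and the regular expression are illustrative only.

```python
import re
from datetime import datetime, timedelta

# The log line quoted by zas, verbatim.
LINE = (
    "2018-04-30 23:29:16.936 GMT postgres@musicbrainz_db 22301 "
    "172.17.0.1(16512) LOG:  duration: 2486396.275 ms  statement: VACUUM ANALYZE;"
)

PATTERN = re.compile(
    r"(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+) GMT .*duration: ([\d.]+) ms"
)


def statement_window(line):
    """Return (started, finished, duration) for a 'duration:' log line."""
    match = PATTERN.match(line)
    if match is None:
        raise ValueError("not a duration log line")
    finished = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
    duration = timedelta(milliseconds=float(match.group(2)))
    return finished - duration, finished, duration


if __name__ == "__main__":
    started, finished, duration = statement_window(LINE)
    print("started ", started)
    print("finished", finished)
    print("minutes ", round(duration.total_seconds() / 60, 1))
```

For this line the VACUUM ANALYZE ran for roughly 41 minutes, starting at about 22:47 GMT and finishing at 23:29 GMT.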
    
    
        9:09 AM 
     
                    zas
                time matches load decreases
     
      2018-05-01 12159, 2018  
    
    
        9:09 AM 
     
                    zas
                so, to my understanding, autovacuum works, but needs to be tuned
     
      2018-05-01 12131, 2018  
    
    
        9:10 AM 
     
                    zas
                running vacuum manually shouldn't be needed with current pg versions
     
      2018-05-01 12144, 2018  
    
    
        9:10 AM 
     
                    zas
                
     
      2018-05-01 12115, 2018  
    
    
        9:13 AM 
     
                    zas
                bitmap: i'd start by logging autovacuum actions (log_autovacuum_min_duration), it should be also noted we can set those options per table
     
      2018-05-01 12135, 2018  
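As a sketch of what "set those options per table" could look like: PostgreSQL accepts autovacuum settings as table storage parameters via `ALTER TABLE ... SET (...)`. The helper below only builds such a statement as a string; the helper name, the table name `example_table` and the values are placeholders, not a recommendation for the MusicBrainz database.

```python
# Autovacuum-related storage parameters this sketch is willing to emit.
ALLOWED = {
    "autovacuum_vacuum_threshold",
    "autovacuum_vacuum_scale_factor",
    "autovacuum_analyze_threshold",
    "autovacuum_analyze_scale_factor",
    "log_autovacuum_min_duration",
}


def per_table_autovacuum_sql(table, **options):
    """Build an ALTER TABLE statement that sets per-table autovacuum options."""
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    if not options:
        raise ValueError("no options given")
    unknown = set(options) - ALLOWED
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    settings = ", ".join(
        f"{name} = {value}" for name, value in sorted(options.items())
    )
    return f"ALTER TABLE {table} SET ({settings});"


if __name__ == "__main__":
    # Placeholder table and values: analyze after ~1% of rows change,
    # and log every autovacuum action on this table (0 = log all).
    print(per_table_autovacuum_sql(
        "example_table",
        autovacuum_analyze_scale_factor=0.01,
        log_autovacuum_min_duration=0,
    ))
```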
    
    
        9:13 AM 
     
                    UmkaDK
                Hi guys, I just wanted to check on your replication server. How is it doing? Is it feeling better?
     
      2018-05-01 12111, 2018  
    
    
        9:15 AM 
     
                    UmkaDK
                We've disabled some of the alarms while replication was going up and down, so I just need to know if it's safe to re-enable them.
     
      2018-05-01 12123, 2018  
    
    
        9:15 AM 
     
                    zas
                
     
      2018-05-01 12127, 2018  
    
    
        9:16 AM 
     
                    UmkaDK
                Thanks zas, I'll keep the alarms off for now. Good luck with the fix!!
     
      2018-05-01 12147, 2018  
    
    
        9:18 AM 
     
                    zas
                UmkaDK: we had serious db issues those last days, the replication issue is prolly related to this, i think it'll be fixed soon
     
      2018-05-01 12110, 2018  
    
    
        9:20 AM 
     
                    UmkaDK
                Yee, I've seen the chatter. Thanks for the update zas!
     
      2018-05-01 12104, 2018  
    
    
        9:33 AM 
     
                    ruaok
                moin!
     
      2018-05-01 12128, 2018  
    
    
        9:34 AM 
     
                    ruaok
                > I checked that the improvement is due to that, and it seems to be the case: 2018-04-30 23:29:16.936 GMT postgres@musicbrainz_db 22301 172.17.0.1(16512) LOG:  duration: 2486396.275 ms  statement: VACUUM ANALYZE;
     
      2018-05-01 12137, 2018  
    
    
        9:34 AM 
     
                    ruaok
                that is the manual one yvanzo started.
     
      2018-05-01 12121, 2018  
    
    
        9:35 AM 
     
                    ruaok
                
     
      2018-05-01 12146, 2018  
    
    
        9:35 AM 
     
                    ruaok
                that is not actually the reason why replication is behind -- that check is pointing to another ancient problem.
     
      2018-05-01 12107, 2018  
    
    
        9:36 AM 
     
                    zas
                ah
     
      2018-05-01 12118, 2018  
    
    
        9:36 AM 
     
                    ruaok
                yvanzo: when you get up, things improved after the vacuum analyze, now we should be able to generate packets again.
     
      2018-05-01 12125, 2018  
    
    
        9:36 AM 
     
                    ruaok
                can you kick that process, please.
     
      2018-05-01 12107, 2018  
    
    
        9:38 AM 
     
                    ruaok
                do you know which machine stores replication packets, zas? not inside a container.. just on FS.
     
      2018-05-01 12142, 2018  
    
    
        9:40 AM 
     
                    yvanzo
                ruaok: yup, seen that :)
     
      2018-05-01 12108, 2018  
    
    
        9:41 AM 
     
                    yvanzo
                kicked
     
      2018-05-01 12127, 2018  
    
    
        9:41 AM 
     
                    ruaok
                great, thanks.
     
      2018-05-01 12137, 2018  
    
    
        9:41 AM 
     
                    ruaok
                do you know where the replication packets are stored?
     
      2018-05-01 12158, 2018  
    
    
        9:41 AM 
     
                    ruaok
                one of our machines generates replication packets and stores them on the FS. where is that?
     
      2018-05-01 12154, 2018  
    
    
        9:42 AM 
     
                    yvanzo
                still the same place, in docker, on hip
     
      2018-05-01 12141, 2018  
    
    
        9:43 AM 
     
                    ruaok
                not ever stored on the FS?
     
      2018-05-01 12141, 2018  
    
    
        9:44 AM 
     
                    yvanzo
                oh right, there is a backup dir as well, let me search
     
      2018-05-01 12127, 2018  
    
    
        9:48 AM 
     
                    yvanzo
                ruaok: still not on the FS, the backup dir is in the container musicbrainz-production-cron at ~musicbrainz/backup
     
      2018-05-01 12106, 2018  
    
    
        9:49 AM 
     
                    ruaok
                yvanzo: ok, thanks.
     
      2018-05-01 12123, 2018  
    
    
        9:49 AM 
     
                    ruaok
                zas: sudden degradations in performance happen because of the optimizer.
     
      2018-05-01 12130, 2018  
    
    
        9:49 AM 
     
                    ruaok
                ... errr planner.
     
      2018-05-01 12147, 2018  
    
    
        9:49 AM 
     
                    zas
                morning yvanzo
     
      2018-05-01 12152, 2018  
    
    
        9:49 AM 
     
                    ruaok
                the planner needs to use statistics from the DB to plan a query for optimum efficiency.
     
      2018-05-01 12103, 2018  
    
    
        9:50 AM 
     
                    yvanzo
                hi zas!
     
      2018-05-01 12110, 2018  
    
    
        9:50 AM 
     
                    ruaok
                at some point in time, black magic, really, something changes.
     
      2018-05-01 12130, 2018  
    
    
        9:50 AM 
     
                    zas
                let's start with facts: autovacuum runs, but doesn't help with this problem
     
      2018-05-01 12132, 2018  
    
    
        9:50 AM 
     
                    ruaok
                the planner needs to do something different because the previous strategy didn't work anymore.
     
      2018-05-01 12142, 2018  
    
    
        9:50 AM 
     
                    zas
                VACUUM ANALYZE did help a lot
     
      2018-05-01 12151, 2018  
    
    
        9:50 AM 
     
                    ruaok
                hang on. I'm trying to answer your previous questions.
     
      2018-05-01 12108, 2018  
    
    
        9:51 AM 
     
                    zas
                k
     
      2018-05-01 12151, 2018  
    
    
        9:51 AM 
     
                    ruaok
                so, what used to be an efficient in-memory query now becomes a more expensive on-disk query... for instance.
     
      2018-05-01 12101, 2018  
    
    
        9:52 AM 
     
                    ruaok
                and that has a greater impact on system performance.
     
      2018-05-01 12110, 2018  
    
    
        9:52 AM 
     
                    zas
                hmmm, wait, i need to verify this
     
      2018-05-01 12116, 2018  
    
    
        9:52 AM 
     
                    ruaok
                and if that query was used a lot, then bam your server is overloaded.
     
      2018-05-01 12135, 2018  
    
    
        9:52 AM 
     
                    ruaok
                but, this may be because it is using old statistics.
     
      2018-05-01 12129, 2018  
    
    
        9:53 AM 
     
                    zas
                
     
      2018-05-01 12104, 2018  
    
    
        9:54 AM 
     
                    zas
                there was no increase in disk IO during 3 events, i can only see CPU usage increase in fact
     
      2018-05-01 12115, 2018  
    
    
        9:54 AM 
     
                    ruaok
                if the statistics don't get updated, or old shit doesn't get thrown out of the DB, then it has to assume that things have changed. it can't make good judgements anymore.
     
      2018-05-01 12110, 2018  
    
    
        9:55 AM 
     
                    zas
                yup, but then why isn't it more progressive? i mean i'd expect slow performance degradation
     
      2018-05-01 12116, 2018  
    
    
        9:55 AM 
     
                    ruaok
                so, running vacuum analyze throws out old shit and updates stats.
     
      2018-05-01 12145, 2018  
    
    
        9:55 AM 
     
                    ruaok
                because the query planner now needs to make a different decision. it reaches a threshold.
     
      2018-05-01 12155, 2018  
    
    
        9:55 AM 
     
                    ruaok
                once it passes the threshold, bam everything backs up.
     
      2018-05-01 12112, 2018  
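One way to picture the threshold ruaok describes, and why zas sees a sudden jump rather than a slow decline: a server keeps up as long as its capacity exceeds the arrival rate, and a small rise in CPU cost per query can flip that balance. The sketch below is a toy fluid model of a queue, not a diagnosis of this incident; the core count, query rate and per-query costs are all invented.

```python
def backlog_after(seconds, arrivals_per_s, cpu_ms_per_query, cores):
    """Queries still waiting after `seconds`, in a simple fluid queue model.

    Each core can finish 1000 / cpu_ms_per_query queries per second.
    """
    capacity_per_s = cores * 1000.0 / cpu_ms_per_query
    backlog = 0.0
    for _ in range(seconds):
        # The backlog grows by the shortfall and can never go below zero.
        backlog = max(0.0, backlog + arrivals_per_s - capacity_per_s)
    return backlog


if __name__ == "__main__":
    # Invented numbers: 16 cores, 1000 queries/s, ten minutes of traffic.
    for cost_ms in (14, 15, 16, 17, 18):
        waiting = backlog_after(600, 1000, cost_ms, 16)
        print(f"{cost_ms} ms per query -> backlog {waiting:.0f}")
```

In this model nothing visible happens while the per-query cost creeps up to 16 ms; one more millisecond and the backlog grows without bound, with CPU as the only saturated resource.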
    
    
        9:56 AM 
     
                    ruaok
                now, you can spend a pile of time looking for exactly what happened.
     
      2018-05-01 12127, 2018  
    
    
        9:56 AM 
     
                    ruaok
                but in the end the outcome is the same: optimize your DB or add more capacity.
     
      2018-05-01 12157, 2018  
    
    
        9:56 AM 
     
                    zas
                i tend to disagree here, as it doesn't explain all
     
      2018-05-01 12110, 2018  
    
    
        9:57 AM 
     
                    zas
                
     
      2018-05-01 12137, 2018  
    
    
        9:57 AM 
     
                    zas
                first one and second one went through without any action (i think), and lasted for 24-36hours
     
      2018-05-01 12102, 2018  
    
    
        9:58 AM 
     
                    zas
                i see no trace of VACUUM in logs for those events
     
      2018-05-01 12151, 2018  
    
    
        9:58 AM 
     
                    ruaok
                ok, I'm speaking from experience of 15 years of running postgres.
     
      2018-05-01 12107, 2018  
    
    
        9:59 AM 
     
                    ruaok
                I'm trying to save you more frustration.
     
      2018-05-01 12112, 2018  
    
    
        9:59 AM 
     
                    zas
                yes, i understand, but please stick to facts
     
      2018-05-01 12126, 2018  
    
    
        9:59 AM 
     
                    ruaok
                I see you approaching this with tools that simply are not effective in combatting this.
     
      2018-05-01 12147, 2018  
    
    
        9:59 AM 
     
                    ruaok
                I don't have facts. that's the whole problem about this. facts are hard to come by.
     
      2018-05-01 12112, 2018  
    
    
        10:01 AM 
     
                    ruaok
                what I do have is observed patterns and experience with this problem.
     
      2018-05-01 12135, 2018  
    
    
        10:01 AM 
     
                    ruaok
                this is classic "nothing changed, but PG is freaking out, why?" I've been here several times before.
     
      2018-05-01 12135, 2018  
    
    
        10:01 AM 
     
                    zas
                i can't find solutions until the problem is well defined... and for now,  your attempt to define the problem doesn't match any measurement
     
      2018-05-01 12107, 2018  
    
    
        10:02 AM 
     
                    zas
                you said "what used to be an efficient in-memory query now becomes a more expensive on-disk query..." -> where's the disk activity?
     
      2018-05-01 12139, 2018  
    
    
        10:02 AM 
     
                    ruaok
                I am trying to describe one of many different scenarios. I really don't know what the query planner is or is not doing in this case.
     
      2018-05-01 12151, 2018  
    
    
        10:02 AM 
     
                    ruaok
                it simply may not be disk related.
     
      2018-05-01 12136, 2018  
    
    
        10:03 AM 
     
                    zas
                ok, but keep your scenarii realistic: here we had only increase in cpu usage; network, memory and disk activity remained constant
     
      2018-05-01 12122, 2018  
    
    
        10:04 AM 
     
                    zas
                but i agree with you it is a decrease of efficiency, likely coming from bad predictions (and lack of ANALYZE)
     
      2018-05-01 12101, 2018  
    
    
        10:05 AM 
     
                    zas
                so why does this happen suddenly (starting on 17th) after months of stable activity ?
     
      2018-05-01 12143, 2018  
    
    
        10:05 AM 
     
                    zas
                size of table triggered something ? number of insert/delete/update ?
     
      2018-05-01 12107, 2018  
    
    
        10:09 AM 
     
                    ruaok
                I don't know, but I can speculate....
     
      2018-05-01 12116, 2018  
    
    
        10:10 AM 
     
                    ruaok
                the query planner decides to use X tables for a query...
     
      2018-05-01 12136, 2018  
    
    
        10:10 AM 
     
                    ruaok
                then stats go out of date and things get more fuzzy.
     
      2018-05-01 12111, 2018  
    
    
        10:11 AM 
     
                    ruaok
                now it can't keep the whole table in ram anymore, or it thinks it can't.
     
      2018-05-01 12130, 2018  
    
    
        10:11 AM 
     
                    ruaok
                then it may need to go get bits from disk and do more loading of data.
     
      2018-05-01 12147, 2018  
    
    
        10:11 AM 
     
                    ruaok
                but that data is fresh so it actually resides in cache, (RAM, not L2)
     
      2018-05-01 12127, 2018  
    
    
        10:12 AM 
     
                    ruaok
                so, now more fetching across RAM.
     
      2018-05-01 12151, 2018  
    
    
        10:12 AM 
     
                    ruaok
                and it might really only be a slight change, but that slight change is what trips the tipping point.
     
      2018-05-01 12121, 2018  
    
    
        10:13 AM 
     
                    ruaok
                and now everything backs up and can't ever recover and suddenly the server is totally overloaded.
     
      2018-05-01 12147, 2018  
    
    
        10:13 AM 
     
                    ruaok
                by running the stats and throwing out old cruft, we turn back to where the planner can do things better.