reosarevok: cert expired on tickets but also on stats.metabrainz.org, i don't know what is going on, but that's very weird
2018-05-01 12109, 2018
Nyanko-sensei joined the channel
2018-05-01 12121, 2018
d4rkie has quit
2018-05-01 12139, 2018
zas
ok, my fault, forgot to deploy new certs to those machines, fix in progress
2018-05-01 12120, 2018
zas
fixed, i also added checks for those issues
2018-05-01 12131, 2018
zas
ruaok: autovacuum is actually running, according to logs, but it may not trigger an ANALYZE, because the config has no specific options (it uses default thresholds). The command that was run yesterday is VACUUM ANALYZE.
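For reference, the thresholds zas is referring to can be read straight from psql; a minimal sketch, assuming an untuned config where the stock defaults (50 rows plus 10% of the table before an auto-ANALYZE) still apply:

    -- show the settings that control when autovacuum triggers a VACUUM or an ANALYZE
    SELECT name, setting
      FROM pg_settings
     WHERE name IN ('autovacuum',
                    'autovacuum_analyze_threshold',
                    'autovacuum_analyze_scale_factor',
                    'autovacuum_vacuum_threshold',
                    'autovacuum_vacuum_scale_factor');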
2018-05-01 12112, 2018
zas
I checked that the improvement is due to that, and it seems to be the case: 2018-04-30 23:29:16.936 GMT postgres@musicbrainz_db 22301 172.17.0.1(16512) LOG: duration: 2486396.275 ms statement: VACUUM ANALYZE;
2018-05-01 12126, 2018
zas
the timing matches the load decrease (that 2486396 ms VACUUM ANALYZE is roughly 41 minutes)
2018-05-01 12159, 2018
zas
so, to my understanding, autovacuum works, but needs to be tuned
2018-05-01 12131, 2018
zas
running vacuum manually shouldn't be needed with current pg versions
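Whether autovacuum is actually keeping up can be confirmed from the stats views rather than the logs; a sketch, assuming PostgreSQL 9.4+ (n_mod_since_analyze does not exist in older versions):

    -- when each table was last (auto)vacuumed / (auto)analyzed, and how much has changed since
    SELECT relname,
           last_vacuum, last_autovacuum,
           last_analyze, last_autoanalyze,
           n_dead_tup, n_mod_since_analyze
      FROM pg_stat_user_tables
     ORDER BY n_mod_since_analyze DESC
     LIMIT 20;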
Thanks zas, I'll keep the alarms off for now. Good luck with the fix!!
2018-05-01 12147, 2018
zas
UmkaDK: we had serious db issues these last few days, the replication issue is prolly related to this, i think it'll be fixed soon
2018-05-01 12110, 2018
UmkaDK
Yee, I've seen the chatter. Thanks for the update zas!
2018-05-01 12104, 2018
ruaok
moin!
2018-05-01 12128, 2018
ruaok
> I checked that the improvement is due to that, and it seems to be the case: 2018-04-30 23:29:16.936 GMT postgres@musicbrainz_db 22301 172.17.0.1(16512) LOG: duration: 2486396.275 ms statement: VACUUM ANALYZE;
there was no increase in disk IO during the 3 events, in fact i can only see a CPU usage increase
2018-05-01 12115, 2018
ruaok
if the statistics don't get updated, or old cruft isn't thrown out of the DB, then it has to assume that things have changed. it can't make good judgements anymore.
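One way to see that effect directly is to compare the planner's row estimates with what actually comes back on one of the slow queries; a sketch, where the query shown is only a hypothetical stand-in:

    -- estimated vs. actual row counts; a large mismatch usually points at stale statistics
    EXPLAIN (ANALYZE, BUFFERS)
        SELECT count(*)
          FROM recording              -- hypothetical example, substitute a real slow query
         WHERE name LIKE 'Love%';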
2018-05-01 12110, 2018
zas
yup, but then why isn't it more gradual ? i mean i'd expect slow performance degradation
2018-05-01 12116, 2018
ruaok
so, running vacuum analyze throws out old shit and updates stats.
2018-05-01 12145, 2018
ruaok
because the query planner now needs to make a different decision. it reaches a threshold.
2018-05-01 12155, 2018
ruaok
once it passes the threshold, bam everything backs up.
2018-05-01 12112, 2018
ruaok
now, you can spend a pile of time looking for exactly what happened.
2018-05-01 12127, 2018
ruaok
but in the end the outcome is the same: optimize your DB or add more capacity.
2018-05-01 12157, 2018
zas
i tend to disagree here, as it doesn't explain everything
the first and second events went through without any action (i think), and lasted for 24-36 hours
2018-05-01 12102, 2018
zas
i see no trace of VACUUM in logs for those events
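That is not necessarily proof it never ran: autovacuum actions are not logged at all by default (log_autovacuum_min_duration = -1). A sketch of turning the logging on, assuming PostgreSQL 9.4+ for ALTER SYSTEM:

    -- log every autovacuum/autoanalyze action (value is a minimum duration in ms, 0 = log all)
    ALTER SYSTEM SET log_autovacuum_min_duration = 0;
    SELECT pg_reload_conf();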
2018-05-01 12151, 2018
ruaok
ok, I'm speaking from experience of 15 years of running postgres.
2018-05-01 12107, 2018
ruaok
I'm trying to save you more frustration.
2018-05-01 12112, 2018
zas
yes, i understand, but please stick to facts
2018-05-01 12126, 2018
ruaok
I see you approaching this with tools that simply are not effective in combatting this.
2018-05-01 12147, 2018
ruaok
I don't have facts. that's the whole problem with this. facts are hard to come by.
2018-05-01 12112, 2018
ruaok
what I do have is observed patterns and experience with this problem.
2018-05-01 12135, 2018
ruaok
this is classic "nothing changed, but PG is freaking out, why?" I've been here several times before.
2018-05-01 12135, 2018
zas
i can't find solutions until the problem is well defined... and for now, your attempt to define the problem doesn't match any measurement
2018-05-01 12107, 2018
zas
you said "what used to be an efficient in-memory query now becomes an more expensive on disk query..." -> where's the disk activity ?
2018-05-01 12139, 2018
ruaok
I am trying to describe one of many different scenarios. I really don't know what the query planner is or is not doing in this case.
2018-05-01 12151, 2018
ruaok
it simply may not be disk related.
2018-05-01 12136, 2018
zas
ok, but keep your scenarios realistic: here we only saw an increase in cpu usage; network, memory and disk activity remained constant
2018-05-01 12122, 2018
zas
but i agree with you that it is a decrease in efficiency, likely coming from bad predictions (and the lack of ANALYZE)
2018-05-01 12101, 2018
zas
so why does this happen suddenly (starting on the 17th) after months of stable activity ?
2018-05-01 12143, 2018
zas
did the size of a table trigger something ? the number of inserts/deletes/updates ?
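The analyze side of that trigger is roughly: auto-ANALYZE fires once the rows modified since the last ANALYZE exceed autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples (50 + 10% of the table with defaults). A sketch of checking which tables sit near that line, assuming the default thresholds:

    -- modifications since last ANALYZE vs. the default auto-analyze trigger point
    SELECT s.relname,
           c.reltuples::bigint              AS est_rows,
           s.n_mod_since_analyze            AS modified_since_analyze,
           (50 + 0.1 * c.reltuples)::bigint AS default_analyze_trigger
      FROM pg_stat_user_tables s
      JOIN pg_class c ON c.oid = s.relid
     ORDER BY s.n_mod_since_analyze DESC
     LIMIT 20;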
2018-05-01 12107, 2018
ruaok
I don't know, but I can speculate....
2018-05-01 12116, 2018
ruaok
the query planner decides to use X tables for a query...
2018-05-01 12136, 2018
ruaok
then stats go out of date and things get more fuzzy.
2018-05-01 12111, 2018
ruaok
now it can't keep the whole table in ram anymore, or it thinks it can't.
2018-05-01 12130, 2018
ruaok
then it may need to go get bits from disk and do more loading of data.
2018-05-01 12147, 2018
ruaok
but that data is fresh, so it actually resides in cache (RAM, not L2)
2018-05-01 12127, 2018
ruaok
so, now more fetching across RAM.
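Whether reads are in fact being served from PostgreSQL's own buffer cache rather than disk is visible in the stats views; a sketch against the database name that appears earlier in the log (note this covers shared_buffers only, not the OS page cache):

    -- fraction of block requests answered from shared buffers
    SELECT datname, blks_hit, blks_read,
           round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4) AS cache_hit_ratio
      FROM pg_stat_database
     WHERE datname = 'musicbrainz_db';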
2018-05-01 12151, 2018
ruaok
and it might really only be a slight change, but that slight change is what trips it past the tipping point.
2018-05-01 12121, 2018
ruaok
and now everything backs up and can't ever recover and suddenly the server is totally overloaded.
2018-05-01 12147, 2018
ruaok
by running the stats and throwing out the old cruft, we get back to where the planner can do things better.
2018-05-01 12106, 2018
ruaok
but, this only buys a small amount of time. we will hit this point again, and each time it will come sooner.
2018-05-01 12158, 2018
ruaok
looking for a possible cause in all of this can be a huge undertaking that has one recommended outcome: tune your server or add more capacity.
2018-05-01 12121, 2018
ruaok
so, best to take two approaches to this, as we have been: work to add more capacity in the short term.
2018-05-01 12141, 2018
ruaok
post react move, we focus on caching and optimizing the worst queries in our system.
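For the "worst queries" part, pg_stat_statements is the usual starting point; a sketch, assuming the extension is installed and that the column is still named total_time (as it was in the PostgreSQL versions current when this log was written):

    -- top 10 queries by cumulative execution time
    SELECT calls, total_time,
           total_time / calls AS mean_ms,
           rows, query
      FROM pg_stat_statements
     ORDER BY total_time DESC
     LIMIT 10;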
2018-05-01 12113, 2018
zas
so, you think ANALYZE was run by autovacuum and it fixed itself during the first 2 events ? but i don't see any autovacuum stuff in the logs
2018-05-01 12158, 2018
ruaok
for some reason autovacuum isn't working. we need to find out why -- but twice now an ANALYZE has solved our problems.
2018-05-01 12109, 2018
ruaok
so, we need to do that on a regular basis until we know why we need to do it.
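Until the root cause is understood, the manual step itself is small and can be scheduled (via cron calling psql or vacuumdb, details depending on the ops setup); a sketch:

    -- refresh planner statistics for the whole database; much cheaper than a full VACUUM
    ANALYZE;
    -- or, when dead rows should be reclaimed at the same time:
    VACUUM (ANALYZE, VERBOSE);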
2018-05-01 12137, 2018
zas
yes, i just checked logs again
2018-05-01 12153, 2018
ruaok
not sure what you mean by the first 2 events.. I think it may have done its job until the load became too high, and then it stopped working in order to not bring the server to its knees.