#metabrainz

/

6:21 AM
UmkaDK joined the channel

2018-05-01 12127, 2018

6:25 AM
UmkaDK has quit

2018-05-01 12151, 2018

7:00 AM
UmkaDK joined the channel

2018-05-01 12110, 2018

7:23 AM
reosarevok

zas: what's off with the https://tickets.metabrainz.org/browse/STYLE-398 certificates?

2018-05-01 12158, 2018

7:31 AM
drsaund is back from tax hell

2018-05-01 12154, 2018

7:32 AM
Major_Lurker

yay

2018-05-01 12149, 2018

7:33 AM
drsaund

worked 16hrs today, been home for 2and a half and i'm wired

2018-05-01 12101, 2018

7:34 AM
Major_Lurker

tax should be banned

2018-05-01 12111, 2018

7:34 AM
drsaund

well then i wouldn't have a job

2018-05-01 12125, 2018

7:34 AM
drsaund

but aholes that wait until the last day should be flogged at least

2018-05-01 12128, 2018

7:34 AM
Major_Lurker

mmm oh well.... hermit is not bad

2018-05-01 12137, 2018

7:36 AM
drsaund

now i've got lots of time to pester reosarevok into making CAA-84 happen

2018-05-01 12124, 2018

7:40 AM
Major_Lurker has quit

2018-05-01 12150, 2018

7:40 AM
Major_Lurker joined the channel

2018-05-01 12119, 2018

7:44 AM
Freso

reosarevok (zas): Nothing's off with it, it's just expired. It expired on May 1st 1:59 CEST (ie., almost 8 hours ago).

2018-05-01 12102, 2018

7:45 AM
Freso

There might be something off with the script that should automatically make new certificates before the current one expired though. :)

2018-05-01 12147, 2018

7:46 AM
d4rkie joined the channel

2018-05-01 12145, 2018

7:47 AM
Nyanko-sensei has quit

2018-05-01 12139, 2018

7:55 AM
drsaund

sweet..and i just found a J in the bowels of my chair

2018-05-01 12121, 2018

7:56 AM
Freso

Congrats!

2018-05-01 12100, 2018

7:57 AM
drsaund

https://www.youtube.com/watch?v=M67E9mpwBpM

2018-05-01 12107, 2018

7:58 AM
UmkaDK_ joined the channel

2018-05-01 12114, 2018

8:00 AM
UmkaDK has quit

2018-05-01 12133, 2018

8:15 AM
Slurpee joined the channel

2018-05-01 12153, 2018

8:27 AM
UmkaDK_ has quit

2018-05-01 12159, 2018

8:27 AM
UmkaDK joined the channel

2018-05-01 12142, 2018

8:33 AM
zas

moin

2018-05-01 12112, 2018

8:34 AM
zas

reosarevok: cert expired on tickets but also on stats.metabrainz.org, i don't know what is going on, but that's very weird

2018-05-01 12109, 2018

8:35 AM
Nyanko-sensei joined the channel

2018-05-01 12121, 2018

8:36 AM
d4rkie has quit

2018-05-01 12139, 2018

8:41 AM
zas

ok, my fault, forgot to deploy new certs to those machines, fix in progress

2018-05-01 12120, 2018

8:48 AM
zas

fixed, i also added checks for those issues

2018-05-01 12131, 2018

9:08 AM
zas

ruaok: autovacuum is actually running, according to logs, but it may not trigger an ANALYZE, because the config has no specific options (it uses default thresholds). The command that was run yesterday is VACUUM ANALYZE.

2018-05-01 12112, 2018

9:09 AM
zas

I checked that the improvement is due to that, and it seems to be the case: 2018-04-30 23:29:16.936 GMT postgres@musicbrainz_db 22301 172.17.0.1(16512) LOG: duration: 2486396.275 ms statement: VACUUM ANALYZE;

2018-05-01 12126, 2018

9:09 AM
zas

time matches load decreases

2018-05-01 12159, 2018

9:09 AM
zas

so, to my understanding, autovacuum works, but needs to be tuned

2018-05-01 12131, 2018

9:10 AM
zas

running vacuum manually shouldn't be needed with current pg versions

2018-05-01 12144, 2018

9:10 AM
zas

https://www.postgresql.org/docs/current/static/ru…

2018-05-01 12115, 2018

9:13 AM
zas

bitmap: i'd start by logging autovacuum actions (log_autovacuum_min_duration), it should be also noted we can set those options per table

2018-05-01 12135, 2018

9:13 AM
UmkaDK

Hi guys, I just wanted to the check on your replication server. How is it doing? Is it feeling better?

2018-05-01 12111, 2018

9:15 AM
UmkaDK

We've disabled some of the alarms while replication was going up and down, so I just need to know if it's safe to re-enable them.

2018-05-01 12123, 2018

9:15 AM
zas

UmkaDK: nope, this issue isn't fixed yet. https://metabrainz.org/api/musicbrainz/replicatio…

2018-05-01 12127, 2018

9:16 AM
UmkaDK

Thanks zas, I'll keep the alarms off for now. Good luck with the fix!!

2018-05-01 12147, 2018

9:18 AM
zas

UmkaDK: we had serious db issues those last days, the replication issue is prolly related to this, i think it'll be fixed soon

2018-05-01 12110, 2018

9:20 AM
UmkaDK

Yee, I've seen the chatter. Thanks for the update zas!

2018-05-01 12104, 2018

9:33 AM
ruaok

moin!

2018-05-01 12128, 2018

9:34 AM
ruaok

> I checked that the improvement is due to that, and it seems to be the case: 2018-04-30 23:29:16.936 GMT postgres@musicbrainz_db 22301 172.17.0.1(16512) LOG: duration: 2486396.275 ms statement: VACUUM ANALYZE;

2018-05-01 12137, 2018

9:34 AM
ruaok

that is the manual one yvanzo started.

2018-05-01 12121, 2018

9:35 AM
ruaok

zas: https://metabrainz.org/api/musicbrainz/replicatio…

2018-05-01 12146, 2018

9:35 AM
ruaok

that is not actually the reason why replication is behind -- that check is pointing to another ancient problem.

2018-05-01 12107, 2018

9:36 AM
zas

ah

2018-05-01 12118, 2018

9:36 AM
ruaok

yvanzo: when you get up, things improved after the vacuum analyze, now we should be able to generate packets again.

2018-05-01 12125, 2018

9:36 AM
ruaok

can you kick that process, please.

2018-05-01 12107, 2018

9:38 AM
ruaok

do you know which machines stores replication packets, zas? not inside a container.. just on FS.

2018-05-01 12142, 2018

9:40 AM
yvanzo

ruaok: yup, seen that :)

2018-05-01 12108, 2018

9:41 AM
yvanzo

kicked

2018-05-01 12127, 2018

9:41 AM
ruaok

great, thanks.

2018-05-01 12137, 2018

9:41 AM
ruaok

do you know where the replication packets are stored?

2018-05-01 12158, 2018

9:41 AM
ruaok

one of our machines generates replication packets and stores them on the FS. where is that?

2018-05-01 12154, 2018

9:42 AM
yvanzo

still the same place, in docker, on hip

2018-05-01 12141, 2018

9:43 AM
ruaok

not ever stored on the FS?

2018-05-01 12141, 2018

9:44 AM
yvanzo

oh right, there is a backup dir as well, let me search

2018-05-01 12127, 2018

9:48 AM
yvanzo

ruaok: still not on the FS, the backup dir is in the container musicbrainz-production-cron at ~musicbrainz/backup

2018-05-01 12106, 2018

9:49 AM
ruaok

yvanzo: ok, thanks.

2018-05-01 12123, 2018

9:49 AM
ruaok

zas: sudden degradation in performance happen because of the optimizer.

2018-05-01 12130, 2018

9:49 AM
ruaok

... errr planner.

2018-05-01 12147, 2018

9:49 AM
zas

morning yvanzo

2018-05-01 12152, 2018

9:49 AM
ruaok

the planner needs to use statistics from the DB to plan a query for optimium efficiency.

2018-05-01 12103, 2018

9:50 AM
yvanzo

hi zas!

2018-05-01 12110, 2018

9:50 AM
ruaok

at some point in time, black magic, really, something changes.

2018-05-01 12130, 2018

9:50 AM
zas

let's start with facts: autovacuum runs, but doesn't help with this problem

2018-05-01 12132, 2018

9:50 AM
ruaok

the planner needs to do something different because the previous strategy didn't work anymore.

2018-05-01 12142, 2018

9:50 AM
zas

VACUUM ANALYZE did help a lot

2018-05-01 12151, 2018

9:50 AM
ruaok

hang on. I'm trying to answer your previous questions.

2018-05-01 12108, 2018

9:51 AM
zas

k

2018-05-01 12151, 2018

9:51 AM
ruaok

so, what used to be an efficient in-memory query now becomes an more expensive on disk query... for instance.

2018-05-01 12101, 2018

9:52 AM
ruaok

and that has a greater impact on system performance.

2018-05-01 12110, 2018

9:52 AM
zas

hmmm, wait, i need to verify this

2018-05-01 12116, 2018

9:52 AM
ruaok

and if that query was used a lot, then bam your server is overloaded.

2018-05-01 12135, 2018

9:52 AM
ruaok

but, this may be because it is using old statistics.

2018-05-01 12129, 2018

9:53 AM
zas

you assume it causes disk activity, and therefore a slowdown, but nope: https://stats.metabrainz.org/d/000000048/hetzner-…

2018-05-01 12104, 2018

9:54 AM
zas

there was no increase in disk IO during 3 events, i can only see CPU usage increase in fact

2018-05-01 12115, 2018

9:54 AM
ruaok

if the statistic don't get updated, or old shit thrown from the DB, then it has to assume that things have changed. it can't made good judgements anymore.

2018-05-01 12110, 2018

9:55 AM
zas

yup, but then why it isn't more progressive ? i mean i'd expect slow performance degradation

2018-05-01 12116, 2018

9:55 AM
ruaok

so, running vacuum analyze throws out old shit and updates stats.

2018-05-01 12145, 2018

9:55 AM
ruaok

because the query planner now needs to make a different decision. it reaches a threshold.

2018-05-01 12155, 2018

9:55 AM
ruaok

once it passes the threshold, bam everything backs up.

2018-05-01 12112, 2018

9:56 AM
ruaok

now, you can spend a pile of time looking for exactly what happend.

2018-05-01 12127, 2018

9:56 AM
ruaok

but in the end the outcome is the same: optimize your DB or add more capacity.

2018-05-01 12157, 2018

9:56 AM
zas

i tend to disagree here, as it doesn't explain all

2018-05-01 12110, 2018

9:57 AM
zas

https://stats.metabrainz.org/d/eStswhGmk/postgres… shows there was 3 "major events"

2018-05-01 12137, 2018

9:57 AM
zas

first one and second one went through without any action (i think), and lasted for 24-36hours

2018-05-01 12102, 2018

9:58 AM
zas

i see no trace of VACUUM in logs for those events

2018-05-01 12151, 2018

9:58 AM
ruaok

ok, I'm speaking from experience of 15 years of running postgres.

2018-05-01 12107, 2018

9:59 AM
ruaok

I'm trying to save you more frustration.

2018-05-01 12112, 2018

9:59 AM
zas

yes, i understand, but please stick to facts

2018-05-01 12126, 2018

9:59 AM
ruaok

I see you approaching this with tools that simply are not effective in combatting this.

2018-05-01 12147, 2018

9:59 AM
ruaok

I don't have facts. that's the whole problem about this. facts are hard to come by.

2018-05-01 12112, 2018

10:01 AM
ruaok

what I do have is observed patterns and experience with this problem.

2018-05-01 12135, 2018

10:01 AM
ruaok

this is classic "nothing changed, but PG is freaking out, why?" I've been here several times before.

2018-05-01 12135, 2018

10:01 AM
zas

i can't find solutions until the problem is well defined... and for now, your attempt to define the problem doesn't match any measurement

2018-05-01 12107, 2018

10:02 AM
zas

you said "what used to be an efficient in-memory query now becomes an more expensive on disk query..." -> where's the disk activity ?

2018-05-01 12139, 2018

10:02 AM
ruaok

I am trying to describe one of many different scenarios. I really don't know what the query planner is or is not doing in this case.

2018-05-01 12151, 2018

10:02 AM
ruaok

it simply may not be disk related.

2018-05-01 12136, 2018

10:03 AM
zas

ok, but keep your scenarii realistic: here we had only increase in cpu usage; network, memory and disk activity remained constant

2018-05-01 12122, 2018

10:04 AM
zas

but i agree with you it is a decrease of efficiency, likely coming from bad predictions (and lack of ANALYZE)

2018-05-01 12101, 2018

10:05 AM
zas

so why does this happen suddenly (starting on 17th) after months of stable activity ?

2018-05-01 12143, 2018

10:05 AM
zas

size of table triggered something ? number of insert/delete/update ?

2018-05-01 12107, 2018

10:09 AM
ruaok

I don't know, but I can speculate....

2018-05-01 12116, 2018

10:10 AM
ruaok

the query planner decides to use X tables for a query...

2018-05-01 12136, 2018

10:10 AM
ruaok

then stats go out of date and things get more fuzzy.

2018-05-01 12111, 2018

10:11 AM
ruaok

now it can't keep the whole table in ram anymore, or it thinks it can't.

2018-05-01 12130, 2018

10:11 AM
ruaok

the it may need to go get bits from disk and do more loading of data.

2018-05-01 12147, 2018

10:11 AM
ruaok

but that data is fresh so it actually resides in cache, (RAM, not L2)

2018-05-01 12127, 2018

10:12 AM
ruaok

so, now more fetching across RAM.

2018-05-01 12151, 2018

10:12 AM
ruaok

and it might really only be a slight change, but that slight change is what trip the tipping point.

2018-05-01 12121, 2018

10:13 AM
ruaok

and now everything backs up and can't ever recover and suddenly the server is totally overloaded.

2018-05-01 12147, 2018

10:13 AM
ruaok

by running the stats and throwing out old cruft, we turn back to where the planner can do things better.

2018-05-01 12106, 2018

10:14 AM
ruaok

but, this only buys a small amount of time. we will hit this point again and each time this point comes closer.

2018-05-01 12158, 2018

10:14 AM
ruaok

looking for a possibly cause in all of this can be a huge undertaking that has one recommended outcome: tune your server or add more capacity.

2018-05-01 12121, 2018

10:15 AM
ruaok

so, best to take two approaches to this, as we have been: work to add more capacity in the short term.

2018-05-01 12141, 2018

10:15 AM
ruaok

post react move, we focus on caching and optimizing the worst queries in our system.

2018-05-01 12113, 2018

10:16 AM
zas

so, you think ANALYZE was run by autovacuum and it was fixed by itself during first 2 events ? but i don't see any autovacuum stuff in logs

2018-05-01 12158, 2018

10:16 AM
ruaok

for some reason autovacuum isn't working. that we need to find out why -- but twice now an ANALYZE has solved out problems.

2018-05-01 12109, 2018

10:17 AM
ruaok

so, we need to do that on a regular basis until we know why we need to do it.

2018-05-01 12137, 2018

10:17 AM
zas

yes, i just checked logs again

2018-05-01 12153, 2018

10:17 AM
ruaok

not sure by what you mean by first 2 events.. I think it may have done its job until the load become too high and then it stops working in order to not being the server to its knees.

2018-05-01 12158, 2018

10:17 AM
zas

https://www.irccloud.com/pastebin/Y8MnDDGV/

2018-05-01 12114, 2018

10:18 AM
zas

basically nothing between january and april...

2018-05-01 12137, 2018

10:18 AM
zas

my guess: temp tables created by search indexer causes this

2018-05-01 12139, 2018

10:18 AM
ruaok

and I bet that if we check our chat logs we will find that we had a freakout in jan. :)

2018-05-01 12151, 2018

10:18 AM
ruaok

sure, that is quite possible.

2018-05-01 12102, 2018

10:19 AM
zas

they aren't cleaned by autovacuum, and other tables stay under thresholds

2018-05-01 12109, 2018

10:19 AM
Slurpee has quit

2018-05-01 12126, 2018

10:19 AM
zas

but autovacuum process actually runs, it just does nothing

2018-05-01 12138, 2018

10:19 AM
ruaok

the an immediate fix is to add a cron job to run ANALYZE once a week during slow times.

2018-05-01 12159, 2018

10:19 AM
ruaok

not sure about nothing... it probably updates stats.

2018-05-01 12113, 2018

10:20 AM
zas

yes, but according to https://dba.stackexchange.com/questions/18664/are… it shouldn't be needed

2018-05-01 12125, 2018

10:20 AM
zas

we can tweak autovacuum options for that

2018-05-01 12143, 2018

10:20 AM
zas

it doesn't fix the temp tables thing though

2018-05-01 12143, 2018

10:21 AM
zas

question: why are temp tables created ? pg_temp_44.tmp_artistcredit (and others orphans) were created by ?

2018-05-01 12131, 2018

10:22 AM
zas

it seems to me autovacuum took care of those after we restarted pg

2018-05-01 12108, 2018

10:23 AM
ruaok

the search indexer creates them to make index building more efficient.

2018-05-01 12145, 2018

10:23 AM
ruaok

rather than repeatedly running the same query to get a chunk of rows, it creates a temp table and then queries that table while building an index.

2018-05-01 12122, 2018

10:24 AM
ruaok

it has the effect of running a fat query a few thousand times, vs a fat query once and then just fetching rows without a join.

2018-05-01 12138, 2018

10:25 AM
zas

https://www.postgresql.org/message-id/7b7f3e98-97…

2018-05-01 12125, 2018

10:28 AM
zas

i propose to increase autovacuum verbosity to start with

2018-05-01 12134, 2018

10:28 AM
ruaok

sure.

2018-05-01 12150, 2018

10:28 AM
ruaok

what is our current work mem setting?

2018-05-01 12156, 2018

10:28 AM
ruaok

that one had good results too.