#metabrainz

      • ephemer0l_ has quit
      • 2018-05-01 12120, 2018

      • Nyanko-sensei has quit
      • 2018-05-01 12133, 2018

      • ephemer0l joined the channel
      • 2018-05-01 12145, 2018

      • UmkaDK_ has quit
      • 2018-05-01 12137, 2018

      • ruaok
        nice! load on bowie is now 8.5!
      • 2018-05-01 12104, 2018

      • ruaok
ok, lesson learned: autovacuum does not work. we all need to remember that for next time this exact same problem happens.
      • 2018-05-01 12119, 2018

      • ruaok
        I wonder why that is -- because the server is too overloaded for the daemon to run?
      • 2018-05-01 12121, 2018

      • SothoTalKer
        is it fast again?
      • 2018-05-01 12148, 2018

      • Nyanko-sensei joined the channel
      • 2018-05-01 12143, 2018

      • ruaok
        looks better
      • 2018-05-01 12144, 2018

      • SothoTalKer
        yay
      • 2018-05-01 12149, 2018

      • dragonzeron has quit
      • 2018-05-01 12102, 2018

      • Slurpee joined the channel
      • 2018-05-01 12102, 2018

      • Slurpee has quit
      • 2018-05-01 12102, 2018

      • Slurpee joined the channel
      • 2018-05-01 12153, 2018

      • CatQuest
hilariously, it seems everything *but* mb is slow now (wikipedia, discogs)
      • 2018-05-01 12130, 2018

      • Gore|woerk joined the channel
      • 2018-05-01 12109, 2018

      • G0re has quit
      • 2018-05-01 12121, 2018

      • Major_Lurker
        woah, metabrainz TShirt arrived....
      • 2018-05-01 12129, 2018

      • SothoTalKer
        does it fit?
      • 2018-05-01 12157, 2018

      • Major_Lurker
        looks like it will
      • 2018-05-01 12112, 2018

      • Major_Lurker
        i got the fat size
      • 2018-05-01 12109, 2018

      • SothoTalKer
        3XL
      • 2018-05-01 12122, 2018

      • Major_Lurker has quit
      • 2018-05-01 12149, 2018

      • Major_Lurker joined the channel
      • 2018-05-01 12138, 2018

      • Major_Lurker
        ha no just xl
      • 2018-05-01 12156, 2018

      • SothoTalKer
        aww
      • 2018-05-01 12133, 2018

      • sentriz has quit
      • 2018-05-01 12122, 2018

      • sentriz joined the channel
      • 2018-05-01 12151, 2018

      • Slurpee has quit
      • 2018-05-01 12136, 2018

      • CatQuest
        he's australian SothoTalKer, not american :3
      • 2018-05-01 12144, 2018

      • drsaund joined the channel
      • 2018-05-01 12118, 2018

      • UmkaDK joined the channel
      • 2018-05-01 12127, 2018

      • UmkaDK has quit
      • 2018-05-01 12151, 2018

      • UmkaDK joined the channel
      • 2018-05-01 12110, 2018

      • reosarevok
        zas: what's off with the https://tickets.metabrainz.org/browse/STYLE-398 certificates?
      • 2018-05-01 12158, 2018

      • drsaund is back from tax hell
      • 2018-05-01 12154, 2018

      • Major_Lurker
        yay
      • 2018-05-01 12149, 2018

      • drsaund
worked 16 hrs today, been home for 2 and a half and i'm wired
      • 2018-05-01 12101, 2018

      • Major_Lurker
        tax should be banned
      • 2018-05-01 12111, 2018

      • drsaund
        well then i wouldn't have a job
      • 2018-05-01 12125, 2018

      • drsaund
        but aholes that wait until the last day should be flogged at least
      • 2018-05-01 12128, 2018

      • Major_Lurker
        mmm oh well.... hermit is not bad
      • 2018-05-01 12137, 2018

      • drsaund
        now i've got lots of time to pester reosarevok into making CAA-84 happen
      • 2018-05-01 12124, 2018

      • Major_Lurker has quit
      • 2018-05-01 12150, 2018

      • Major_Lurker joined the channel
      • 2018-05-01 12119, 2018

      • Freso
reosarevok (zas): Nothing's off with it, it's just expired. It expired on May 1st 1:59 CEST (i.e., almost 8 hours ago).
      • 2018-05-01 12102, 2018

      • Freso
There might be something off with the script that should automatically make new certificates before the current one expires, though. :)
      • 2018-05-01 12147, 2018

      • d4rkie joined the channel
      • 2018-05-01 12145, 2018

      • Nyanko-sensei has quit
      • 2018-05-01 12139, 2018

      • drsaund
        sweet..and i just found a J in the bowels of my chair
      • 2018-05-01 12121, 2018

      • Freso
        Congrats!
      • 2018-05-01 12100, 2018

      • drsaund
      • 2018-05-01 12107, 2018

      • UmkaDK_ joined the channel
      • 2018-05-01 12114, 2018

      • UmkaDK has quit
      • 2018-05-01 12133, 2018

      • Slurpee joined the channel
      • 2018-05-01 12153, 2018

      • UmkaDK_ has quit
      • 2018-05-01 12159, 2018

      • UmkaDK joined the channel
      • 2018-05-01 12142, 2018

      • zas
        moin
      • 2018-05-01 12112, 2018

      • zas
        reosarevok: cert expired on tickets but also on stats.metabrainz.org, i don't know what is going on, but that's very weird
      • 2018-05-01 12109, 2018

      • Nyanko-sensei joined the channel
      • 2018-05-01 12121, 2018

      • d4rkie has quit
      • 2018-05-01 12139, 2018

      • zas
        ok, my fault, forgot to deploy new certs to those machines, fix in progress
      • 2018-05-01 12120, 2018

      • zas
        fixed, i also added checks for those issues
      • 2018-05-01 12131, 2018

      • zas
ruaok: autovacuum is actually running, according to logs, but it may not trigger an ANALYZE, because the config has no specific options (it uses default thresholds). The command that was run yesterday was VACUUM ANALYZE.
      • 2018-05-01 12112, 2018
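
      (A quick way to verify what zas describes above, from psql: the standard statistics
      view records when autovacuum and autoanalyze last ran per table. A sketch only,
      assuming access to musicbrainz_db; the sort order and LIMIT are illustrative.)

          -- when did (auto)vacuum / (auto)analyze last touch each table,
          -- and how many dead rows have piled up since?
          SELECT relname,
                 n_live_tup,
                 n_dead_tup,
                 last_vacuum,
                 last_autovacuum,
                 last_analyze,
                 last_autoanalyze
            FROM pg_stat_user_tables
           ORDER BY n_dead_tup DESC
           LIMIT 20;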

      • zas
        I checked that the improvement is due to that, and it seems to be the case: 2018-04-30 23:29:16.936 GMT postgres@musicbrainz_db 22301 172.17.0.1(16512) LOG: duration: 2486396.275 ms statement: VACUUM ANALYZE;
      • 2018-05-01 12126, 2018

      • zas
the time matches the load decrease
      • 2018-05-01 12159, 2018

      • zas
        so, to my understanding, autovacuum works, but needs to be tuned
      • 2018-05-01 12131, 2018

      • zas
        running vacuum manually shouldn't be needed with current pg versions
      • 2018-05-01 12144, 2018

      • zas
      • 2018-05-01 12115, 2018

      • zas
bitmap: i'd start by logging autovacuum actions (log_autovacuum_min_duration); it should also be noted that we can set those options per table
      • 2018-05-01 12135, 2018
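
      (The tuning suggested here might look roughly like the following; a sketch only.
      The scale factors are illustrative and "edit" is just an example of a large table,
      not a value or target anyone in the channel agreed on.)

          -- log every autovacuum/autoanalyze run (0 = log regardless of duration)
          ALTER SYSTEM SET log_autovacuum_min_duration = 0;
          SELECT pg_reload_conf();

          -- per-table storage parameters, so big tables get vacuumed/analyzed sooner
          ALTER TABLE edit SET (
              autovacuum_vacuum_scale_factor  = 0.05,
              autovacuum_analyze_scale_factor = 0.02
          );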

      • UmkaDK
Hi guys, I just wanted to check on your replication server. How is it doing? Is it feeling better?
      • 2018-05-01 12111, 2018

      • UmkaDK
        We've disabled some of the alarms while replication was going up and down, so I just need to know if it's safe to re-enable them.
      • 2018-05-01 12123, 2018

      • zas
        UmkaDK: nope, this issue isn't fixed yet. https://metabrainz.org/api/musicbrainz/replicatio…
      • 2018-05-01 12127, 2018

      • UmkaDK
        Thanks zas, I'll keep the alarms off for now. Good luck with the fix!!
      • 2018-05-01 12147, 2018

      • zas
UmkaDK: we had serious db issues these last few days, the replication issue is prolly related to this, i think it'll be fixed soon
      • 2018-05-01 12110, 2018

      • UmkaDK
        Yee, I've seen the chatter. Thanks for the update zas!
      • 2018-05-01 12104, 2018

      • ruaok
        moin!
      • 2018-05-01 12128, 2018

      • ruaok
        > I checked that the improvement is due to that, and it seems to be the case: 2018-04-30 23:29:16.936 GMT postgres@musicbrainz_db 22301 172.17.0.1(16512) LOG: duration: 2486396.275 ms statement: VACUUM ANALYZE;
      • 2018-05-01 12137, 2018

      • ruaok
        that is the manual one yvanzo started.
      • 2018-05-01 12121, 2018

      • ruaok
      • 2018-05-01 12146, 2018

      • ruaok
        that is not actually the reason why replication is behind -- that check is pointing to another ancient problem.
      • 2018-05-01 12107, 2018

      • zas
        ah
      • 2018-05-01 12118, 2018

      • ruaok
yvanzo: when you get up: things improved after the vacuum analyze, so now we should be able to generate packets again.
      • 2018-05-01 12125, 2018

      • ruaok
        can you kick that process, please.
      • 2018-05-01 12107, 2018

      • ruaok
do you know which machine stores replication packets, zas? not inside a container... just on the FS.
      • 2018-05-01 12142, 2018

      • yvanzo
        ruaok: yup, seen that :)
      • 2018-05-01 12108, 2018

      • yvanzo
        kicked
      • 2018-05-01 12127, 2018

      • ruaok
        great, thanks.
      • 2018-05-01 12137, 2018

      • ruaok
        do you know where the replication packets are stored?
      • 2018-05-01 12158, 2018

      • ruaok
        one of our machines generates replication packets and stores them on the FS. where is that?
      • 2018-05-01 12154, 2018

      • yvanzo
        still the same place, in docker, on hip
      • 2018-05-01 12141, 2018

      • ruaok
        not ever stored on the FS?
      • 2018-05-01 12141, 2018

      • yvanzo
        oh right, there is a backup dir as well, let me search
      • 2018-05-01 12127, 2018

      • yvanzo
        ruaok: still not on the FS, the backup dir is in the container musicbrainz-production-cron at ~musicbrainz/backup
      • 2018-05-01 12106, 2018

      • ruaok
        yvanzo: ok, thanks.
      • 2018-05-01 12123, 2018

      • ruaok
zas: sudden degradation in performance happens because of the optimizer.
      • 2018-05-01 12130, 2018

      • ruaok
        ... errr planner.
      • 2018-05-01 12147, 2018

      • zas
        morning yvanzo
      • 2018-05-01 12152, 2018

      • ruaok
the planner needs to use statistics from the DB to plan a query for optimum efficiency.
      • 2018-05-01 12103, 2018
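
      (One way to see the planner's dependence on statistics is to compare its row
      estimates with what actually happened; a sketch, where the query itself is only
      an illustration of a "hot" MusicBrainz-style join.)

          -- estimated rows come from the stats gathered by ANALYZE;
          -- "actual rows" is reality, and BUFFERS shows whether data came
          -- from cache (shared hit) or from disk (read)
          EXPLAIN (ANALYZE, BUFFERS)
          SELECT r.id
            FROM recording r
            JOIN artist_credit ac ON ac.id = r.artist_credit
           WHERE ac.name = 'Some Artist';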

      • yvanzo
        hi zas!
      • 2018-05-01 12110, 2018

      • ruaok
        at some point in time, black magic, really, something changes.
      • 2018-05-01 12130, 2018

      • zas
        let's start with facts: autovacuum runs, but doesn't help with this problem
      • 2018-05-01 12132, 2018

      • ruaok
        the planner needs to do something different because the previous strategy didn't work anymore.
      • 2018-05-01 12142, 2018

      • zas
        VACUUM ANALYZE did help a lot
      • 2018-05-01 12151, 2018

      • ruaok
        hang on. I'm trying to answer your previous questions.
      • 2018-05-01 12108, 2018

      • zas
        k
      • 2018-05-01 12151, 2018

      • ruaok
so, what used to be an efficient in-memory query now becomes a more expensive on-disk query... for instance.
      • 2018-05-01 12101, 2018

      • ruaok
        and that has a greater impact on system performance.
      • 2018-05-01 12110, 2018

      • zas
        hmmm, wait, i need to verify this
      • 2018-05-01 12116, 2018

      • ruaok
        and if that query was used a lot, then bam your server is overloaded.
      • 2018-05-01 12135, 2018

      • ruaok
        but, this may be because it is using old statistics.
      • 2018-05-01 12129, 2018

      • zas
        you assume it causes disk activity, and therefore a slowdown, but nope: https://stats.metabrainz.org/d/000000048/hetzner-…
      • 2018-05-01 12104, 2018

      • zas
there was no increase in disk IO during the 3 events, i can only see a CPU usage increase in fact
      • 2018-05-01 12115, 2018
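
      (The same point can be cross-checked from inside PostgreSQL rather than from
      host metrics; a sketch using the standard per-database counters.)

          -- buffer cache hits vs. physical reads for the whole database;
          -- if queries had started spilling to disk, blks_read would climb
          SELECT datname,
                 blks_hit,
                 blks_read,
                 round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4) AS hit_ratio
            FROM pg_stat_database
           WHERE datname = 'musicbrainz_db';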

      • ruaok
if the statistics don't get updated, or old shit isn't thrown out of the DB, then it has to assume that things have changed. it can't make good judgements anymore.
      • 2018-05-01 12110, 2018

      • zas
yup, but then why isn't it more gradual? i mean i'd expect a slow performance degradation
      • 2018-05-01 12116, 2018

      • ruaok
        so, running vacuum analyze throws out old shit and updates stats.
      • 2018-05-01 12145, 2018

      • ruaok
        because the query planner now needs to make a different decision. it reaches a threshold.
      • 2018-05-01 12155, 2018

      • ruaok
        once it passes the threshold, bam everything backs up.
      • 2018-05-01 12112, 2018

      • ruaok
now, you can spend a pile of time looking for exactly what happened.
      • 2018-05-01 12127, 2018

      • ruaok
        but in the end the outcome is the same: optimize your DB or add more capacity.
      • 2018-05-01 12157, 2018

      • zas
i tend to disagree here, as it doesn't explain everything
      • 2018-05-01 12110, 2018

      • zas
      • 2018-05-01 12137, 2018

      • zas
the first one and the second one went through without any action (i think), and lasted for 24-36 hours
      • 2018-05-01 12102, 2018

      • zas
        i see no trace of VACUUM in logs for those events
      • 2018-05-01 12151, 2018

      • ruaok
        ok, I'm speaking from experience of 15 years of running postgres.
      • 2018-05-01 12107, 2018

      • ruaok
        I'm trying to save you more frustration.
      • 2018-05-01 12112, 2018

      • zas
        yes, i understand, but please stick to facts
      • 2018-05-01 12126, 2018

      • ruaok
        I see you approaching this with tools that simply are not effective in combatting this.
      • 2018-05-01 12147, 2018

      • ruaok
        I don't have facts. that's the whole problem about this. facts are hard to come by.
      • 2018-05-01 12112, 2018

      • ruaok
        what I do have is observed patterns and experience with this problem.
      • 2018-05-01 12135, 2018

      • ruaok
        this is classic "nothing changed, but PG is freaking out, why?" I've been here several times before.
      • 2018-05-01 12135, 2018

      • zas
        i can't find solutions until the problem is well defined... and for now, your attempt to define the problem doesn't match any measurement
      • 2018-05-01 12107, 2018

      • zas
you said "what used to be an efficient in-memory query now becomes a more expensive on-disk query..." -> where's the disk activity?
      • 2018-05-01 12139, 2018

      • ruaok
        I am trying to describe one of many different scenarios. I really don't know what the query planner is or is not doing in this case.
      • 2018-05-01 12151, 2018

      • ruaok
        it simply may not be disk related.
      • 2018-05-01 12136, 2018

      • zas
ok, but keep your scenarios realistic: here we had only an increase in cpu usage; network, memory and disk activity remained constant
      • 2018-05-01 12122, 2018

      • zas
but i agree with you it is a decrease in efficiency, likely coming from bad predictions (and lack of ANALYZE)
      • 2018-05-01 12101, 2018

      • zas
so why does this happen suddenly (starting on the 17th) after months of stable activity?
      • 2018-05-01 12143, 2018

      • zas
did the size of a table trigger something? the number of inserts/deletes/updates?
      • 2018-05-01 12107, 2018
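
      (The default autovacuum thresholds are one possible answer to the "why suddenly"
      question: with stock settings a table is only auto-analyzed after about
      50 + 10% of its rows have been modified, and only auto-vacuumed after about
      50 + 20% of its rows are dead, so a huge table can go months without either.
      A sketch for checking how close each table is, assuming the defaults are in effect.)

          -- default trigger points: autoanalyze at 50 + 0.10 * reltuples changes,
          -- autovacuum at 50 + 0.20 * reltuples dead tuples
          SELECT s.relname,
                 s.n_mod_since_analyze,
                 50 + 0.10 * c.reltuples AS autoanalyze_threshold,
                 s.n_dead_tup,
                 50 + 0.20 * c.reltuples AS autovacuum_threshold
            FROM pg_stat_user_tables s
            JOIN pg_class c ON c.oid = s.relid
           ORDER BY c.reltuples DESC
           LIMIT 20;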

      • ruaok
        I don't know, but I can speculate....
      • 2018-05-01 12116, 2018

      • ruaok
        the query planner decides to use X tables for a query...
      • 2018-05-01 12136, 2018

      • ruaok
        then stats go out of date and things get more fuzzy.
      • 2018-05-01 12111, 2018

      • ruaok
        now it can't keep the whole table in ram anymore, or it thinks it can't.
      • 2018-05-01 12130, 2018

      • ruaok
then it may need to go get bits from disk and do more loading of data.
      • 2018-05-01 12147, 2018

      • ruaok
but that data is fresh, so it actually resides in cache (RAM, not L2)
      • 2018-05-01 12127, 2018

      • ruaok
        so, now more fetching across RAM.
      • 2018-05-01 12151, 2018

      • ruaok
and it might really only be a slight change, but that slight change is what trips it past the tipping point.
      • 2018-05-01 12121, 2018

      • ruaok
        and now everything backs up and can't ever recover and suddenly the server is totally overloaded.
      • 2018-05-01 12147, 2018

      • ruaok
by regenerating the stats and throwing out old cruft, we get back to a point where the planner can do things better.