#metabrainz

      • ephemer0l_ has quit
      • Nyanko-sensei has quit
      • ephemer0l joined the channel
      • UmkaDK_ has quit
      • ruaok
        nice! load on bowie is now 8.5!
      • ok, lesson learned: autovacuum does not work. we all need to remember that for next time this exact same problem happens.
      • I wonder why that is -- because the server is too overloaded for the daemon to run?
      • SothoTalKer
        is it fast again?
      • Nyanko-sensei joined the channel
      • ruaok
        looks better
      • SothoTalKer
        yay
      • dragonzeron has quit
      • Slurpee joined the channel
      • Slurpee has quit
      • Slurpee joined the channel
      • CatQuest
        hilariously, it seems everything *but* mb is slow now (wikipedia, discogs)
      • Gore|woerk joined the channel
      • G0re has quit
      • Major_Lurker
        woah, metabrainz TShirt arrived....
      • SothoTalKer
        does it fit?
      • Major_Lurker
        looks like it will
      • i got the fat size
      • SothoTalKer
        3XL
      • Major_Lurker has quit
      • Major_Lurker joined the channel
      • Major_Lurker
        ha no just xl
      • SothoTalKer
        aww
      • sentriz has quit
      • sentriz joined the channel
      • Slurpee has quit
      • CatQuest
        he's australian SothoTalKer, not american :3
      • drsaund joined the channel
      • UmkaDK joined the channel
      • UmkaDK has quit
      • UmkaDK joined the channel
      • reosarevok
        zas: what's off with the https://tickets.metabrainz.org/browse/STYLE-398 certificates?
      • drsaund is back from tax hell
      • Major_Lurker
        yay
      • drsaund
        worked 16hrs today, been home for 2 and a half and i'm wired
      • Major_Lurker
        tax should be banned
      • drsaund
        well then i wouldn't have a job
      • but aholes that wait until the last day should be flogged at least
      • Major_Lurker
        mmm oh well.... hermit is not bad
      • drsaund
        now i've got lots of time to pester reosarevok into making CAA-84 happen
      • Major_Lurker has quit
      • Major_Lurker joined the channel
      • Freso
        reosarevok (zas): Nothing's off with it, it's just expired. It expired on May 1st at 1:59 CEST (i.e., almost 8 hours ago).
      • There might be something off with the script that should automatically make new certificates before the current one expires, though. :)
      • d4rkie joined the channel
      • Nyanko-sensei has quit
      • drsaund
        sweet..and i just found a J in the bowels of my chair
      • Freso
        Congrats!
      • drsaund
      • UmkaDK_ joined the channel
      • UmkaDK has quit
      • Slurpee joined the channel
      • UmkaDK_ has quit
      • UmkaDK joined the channel
      • zas
        moin
      • reosarevok: cert expired on tickets but also on stats.metabrainz.org, i don't know what is going on, but that's very weird
      • Nyanko-sensei joined the channel
      • d4rkie has quit
      • ok, my fault, forgot to deploy new certs to those machines, fix in progress
      • fixed, i also added checks for those issues
      • ruaok: autovacuum is actually running, according to logs, but it may not trigger an ANALYZE, because the config has no specific options (it uses default thresholds). The command that was run yesterday was VACUUM ANALYZE.
      • I checked whether the improvement is due to that, and it seems to be the case: 2018-04-30 23:29:16.936 GMT postgres@musicbrainz_db 22301 172.17.0.1(16512) LOG: duration: 2486396.275 ms statement: VACUUM ANALYZE;
      • time matches load decreases
      • so, to my understanding, autovacuum works, but needs to be tuned
      • running vacuum manually shouldn't be needed with current pg versions
      • bitmap: i'd start by logging autovacuum actions (log_autovacuum_min_duration); it should also be noted that we can set those options per table
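
A minimal sketch of the tuning zas is suggesting, assuming superuser access on the database; the table name and the scale factors are only illustrative, not settings actually applied to the MusicBrainz database:

```sql
-- Log every autovacuum/autoanalyze run (the default of -1 disables this),
-- so the logs show whether and when the daemon actually fires.
ALTER SYSTEM SET log_autovacuum_min_duration = 0;
SELECT pg_reload_conf();

-- Per-table tuning: analyze after ~1% of rows change instead of the
-- default 10%, and vacuum after ~2% instead of the default 20%.
-- "recording" is a hypothetical example table.
ALTER TABLE recording SET (
    autovacuum_analyze_scale_factor = 0.01,
    autovacuum_vacuum_scale_factor  = 0.02
);
```
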
      • UmkaDK
        Hi guys, I just wanted to check on your replication server. How is it doing? Is it feeling better?
      • We've disabled some of the alarms while replication was going up and down, so I just need to know if it's safe to re-enable them.
      • zas
        UmkaDK: nope, this issue isn't fixed yet. https://metabrainz.org/api/musicbrainz/replicat...
      • UmkaDK
        Thanks zas, I'll keep the alarms off for now. Good luck with the fix!!
      • zas
        UmkaDK: we had serious db issues these last few days, the replication issue is prolly related to this, i think it'll be fixed soon
      • UmkaDK
        Yee, I've seen the chatter. Thanks for the update zas!
      • ruaok
        moin!
      • > I checked that the improvement is due to that, and it seems to be the case: 2018-04-30 23:29:16.936 GMT postgres@musicbrainz_db 22301 172.17.0.1(16512) LOG: duration: 2486396.275 ms statement: VACUUM ANALYZE;
      • that is the manual one yvanzo started.
      • that is not actually the reason why replication is behind -- that check is pointing to another ancient problem.
      • zas
        ah
      • ruaok
        yvanzo: when you get up: things improved after the vacuum analyze, so we should be able to generate packets again.
      • can you kick that process, please.
      • do you know which machine stores replication packets, zas? not inside a container.. just on the FS.
      • yvanzo
        ruaok: yup, seen that :)
      • kicked
      • ruaok
        great, thanks.
      • do you know where the replication packets are stored?
      • one of our machines generates replication packets and stores them on the FS. where is that?
      • yvanzo
        still the same place, in docker, on hip
      • ruaok
        not ever stored on the FS?
      • yvanzo
        oh right, there is a backup dir as well, let me search
      • ruaok: still not on the FS, the backup dir is in the container musicbrainz-production-cron at ~musicbrainz/backup
      • ruaok
        yvanzo: ok, thanks.
      • zas: sudden degradation in performance happens because of the optimizer.
      • ... errr planner.
      • zas
        morning yvanzo
      • ruaok
        the planner needs to use statistics from the DB to plan a query for optimum efficiency.
      • yvanzo
        hi zas!
      • ruaok
        at some point in time, black magic, really, something changes.
      • zas
        let's start with facts: autovacuum runs, but doesn't help with this problem
      • ruaok
        the planner needs to do something different because the previous strategy didn't work anymore.
      • zas
        VACUUM ANALYZE did help a lot
      • ruaok
        hang on. I'm trying to answer your previous questions.
      • zas
        k
      • ruaok
        so, what used to be an efficient in-memory query now becomes a more expensive on-disk query... for instance.
      • and that has a greater impact on system performance.
      • zas
        hmmm, wait, i need to verify this
      • ruaok
        and if that query was used a lot, then bam your server is overloaded.
      • but, this may be because it is using old statistics.
      • zas
        you assume it causes disk activity, and therefore a slowdown, but nope: https://stats.metabrainz.org/d/000000048/hetzne...
      • there was no increase in disk IO during the 3 events, i can only see a CPU usage increase in fact
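
One way to test zas's observation (CPU up, disk flat) would be pg_stat_statements, assuming that extension is loaded on the server; this is only a sketch, and total_time / mean_time are the pre-PostgreSQL-13 column names:

```sql
-- Top statements by cumulative time; a planner flip typically shows up
-- as a jump in mean time for one frequently-called query, without any
-- matching rise in disk I/O.
SELECT calls,
       round(total_time::numeric, 1) AS total_ms,
       round(mean_time::numeric, 2)  AS mean_ms,
       left(query, 80)               AS query
FROM   pg_stat_statements
ORDER  BY total_time DESC
LIMIT  10;
```
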
      • ruaok
        if the statistics don't get updated, or old shit isn't thrown out of the DB, then it has to assume that things have changed. it can't make good judgements anymore.
      • zas
        yup, but then why isn't it more gradual? i mean i'd expect a slow performance degradation
      • ruaok
        so, running vacuum analyze throws out old shit and updates stats.
      • because the query planner now needs to make a different decision. it reaches a threshold.
      • once it passes the threshold, bam everything backs up.
      • now, you can spend a pile of time looking for exactly what happened.
      • but in the end the outcome is the same: optimize your DB or add more capacity.
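
Before choosing between "optimize" and "more capacity", the standard statistics views can show whether dead rows and stale statistics really are the issue; a sketch, using only built-in views:

```sql
-- Tables with the most dead rows, and when each was last
-- vacuumed/analyzed, either manually or by autovacuum.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum,
       last_analyze,
       last_autoanalyze
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC
LIMIT  20;
```
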
      • zas
        i tend to disagree here, as it doesn't explain everything
      • the first and second events went through without any action (i think), and lasted for 24-36 hours
      • i see no trace of VACUUM in logs for those events
      • ruaok
        ok, I'm speaking from 15 years of experience running postgres.
      • I'm trying to save you more frustration.
      • zas
        yes, i understand, but please stick to facts
      • ruaok
        I see you approaching this with tools that simply are not effective in combatting this.
      • I don't have facts. that's the whole problem about this. facts are hard to come by.
      • what I do have is observed patterns and experience with this problem.
      • this is classic "nothing changed, but PG is freaking out, why?" I've been here several times before.
      • zas
        i can't find solutions until the problem is well defined... and for now, your attempt to define the problem doesn't match any measurement
      • you said "what used to be an efficient in-memory query now becomes a more expensive on-disk query..." -> where's the disk activity?
      • ruaok
        I am trying to describe one of many different scenarios. I really don't know what the query planner is or is not doing in this case.
      • it simply may not be disk related.
      • zas
        ok, but keep your scenarios realistic: here we only had an increase in cpu usage; network, memory and disk activity remained constant
      • but i agree with you that it is a decrease in efficiency, likely coming from bad predictions (and a lack of ANALYZE)
      • so why does this happen suddenly (starting on the 17th) after months of stable activity?
      • did the size of a table trigger something? the number of inserts/deletes/updates?
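
One plausible answer to zas's question: autoanalyze fires per table once the number of changed rows exceeds a threshold that scales with table size (with the defaults zas mentioned, 50 rows plus 10% of the table), so on a large, steadily growing table statistics get refreshed less and less often until a plan finally flips. A sketch for estimating how close each table is to that threshold:

```sql
-- With default settings, autoanalyze triggers when
--   changed_rows > autovacuum_analyze_threshold (50)
--                  + autovacuum_analyze_scale_factor (0.10) * reltuples
SELECT s.relname,
       c.reltuples::bigint                AS estimated_rows,
       s.n_mod_since_analyze              AS changed_since_analyze,
       (50 + 0.10 * c.reltuples)::bigint  AS analyze_threshold
FROM   pg_stat_user_tables s
JOIN   pg_class c ON c.oid = s.relid
ORDER  BY s.n_mod_since_analyze DESC
LIMIT  20;
```
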
      • ruaok
        I don't know, but I can speculate....
      • the query planner decides to use X tables for a query...
      • then stats go out of date and things get more fuzzy.
      • now it can't keep the whole table in ram anymore, or it thinks it can't.
      • then it may need to go get bits from disk and do more loading of data.
      • but that data is fresh, so it actually resides in cache (RAM, not L2)
      • so, now more fetching across RAM.
      • and it might really only be a slight change, but that slight change is what trips the tipping point.
      • and now everything backs up and can't ever recover and suddenly the server is totally overloaded.
      • by running the stats and throwing out old cruft, we get back to where the planner can do things better.
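
If the pattern recurs, comparing a hot query's plan before and after an ANALYZE would confirm ruaok's tipping-point theory directly, and the BUFFERS counters separate buffer-cache hits from real reads, which would also square with zas's CPU-only graphs. A sketch only; the table and query are placeholders:

```sql
-- "shared hit" = found in PostgreSQL's buffer cache,
-- "read"       = fetched from the OS page cache or disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   recording                -- hypothetical table
WHERE  name = 'placeholder';    -- hypothetical predicate

-- Refresh statistics for just that table, then re-run the EXPLAIN
-- and compare the chosen plans.
ANALYZE recording;
```
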