10:14 AM
ruaok
but, this only buys a small amount of time. we will hit this point again and each time this point comes closer.
2018-05-01 12158, 2018
10:14 AM
ruaok
looking for a possibly cause in all of this can be a huge undertaking that has one recommended outcome: tune your server or add more capacity.
2018-05-01 12121, 2018
10:15 AM
ruaok
so, best to take two approaches to this, as we have been: work to add more capacity in the short term.
2018-05-01 12141, 2018
10:15 AM
ruaok
post react move, we focus on caching and optimizing the worst queries in our system.
2018-05-01 12113, 2018
10:16 AM
zas
so, you think ANALYZE was run by autovacuum and it was fixed by itself during first 2 events ? but i don't see any autovacuum stuff in logs
2018-05-01 12158, 2018
10:16 AM
ruaok
for some reason autovacuum isn't working. that we need to find out why -- but twice now an ANALYZE has solved out problems.
2018-05-01 12109, 2018
10:17 AM
ruaok
so, we need to do that on a regular basis until we know why we need to do it.
2018-05-01 12137, 2018
10:17 AM
zas
yes, i just checked logs again
2018-05-01 12153, 2018
10:17 AM
ruaok
not sure by what you mean by first 2 events.. I think it may have done its job until the load become too high and then it stops working in order to not being the server to its knees.
2018-05-01 12158, 2018
10:17 AM
zas
2018-05-01 12114, 2018
10:18 AM
zas
basically nothing between january and april...
2018-05-01 12137, 2018
10:18 AM
zas
my guess: temp tables created by search indexer causes this
2018-05-01 12139, 2018
10:18 AM
ruaok
and I bet that if we check our chat logs we will find that we had a freakout in jan. :)
2018-05-01 12151, 2018
10:18 AM
ruaok
sure, that is quite possible.
2018-05-01 12102, 2018
10:19 AM
zas
they aren't cleaned by autovacuum, and other tables stay under thresholds
2018-05-01 12109, 2018
10:19 AM
Slurpee has quit
2018-05-01 12126, 2018
10:19 AM
zas
but autovacuum process actually runs, it just does nothing
2018-05-01 12138, 2018
10:19 AM
ruaok
the an immediate fix is to add a cron job to run ANALYZE once a week during slow times.
2018-05-01 12159, 2018
10:19 AM
ruaok
not sure about nothing... it probably updates stats.
2018-05-01 12113, 2018
10:20 AM
zas
2018-05-01 12125, 2018
10:20 AM
zas
we can tweak autovacuum options for that
2018-05-01 12143, 2018
10:20 AM
zas
it doesn't fix the temp tables thing though
2018-05-01 12143, 2018
10:21 AM
zas
question: why are temp tables created ? pg_temp_44.tmp_artistcredit (and others orphans) were created by ?
2018-05-01 12131, 2018
10:22 AM
zas
it seems to me autovacuum took care of those after we restarted pg
2018-05-01 12108, 2018
10:23 AM
ruaok
the search indexer creates them to make index building more efficient.
2018-05-01 12145, 2018
10:23 AM
ruaok
rather than repeatedly running the same query to get a chunk of rows, it creates a temp table and then queries that table while building an index.
2018-05-01 12122, 2018
10:24 AM
ruaok
it has the effect of running a fat query a few thousand times, vs a fat query once and then just fetching rows without a join.
2018-05-01 12138, 2018
10:25 AM
zas
2018-05-01 12125, 2018
10:28 AM
zas
i propose to increase autovacuum verbosity to start with
2018-05-01 12134, 2018
10:28 AM
ruaok
sure.
2018-05-01 12150, 2018
10:28 AM
ruaok
what is our current work mem setting?
2018-05-01 12156, 2018
10:28 AM
ruaok
that one had good results too.
2018-05-01 12148, 2018
10:29 AM
ruaok
yvanzo: ok, in that case, which container has the master set of replication packets?
2018-05-01 12108, 2018
10:30 AM
ruaok
we have a gap in sequence in 2016, so we should discard packets prior to that.
2018-05-01 12116, 2018
10:38 AM
ruaok
UmkaDK: check your replicated tables, should look better now.
2018-05-01 12156, 2018
10:45 AM
UmkaDK
Yep, our replication restarted and is on 1113,034
2018-05-01 12121, 2018
10:46 AM
ruaok
yay.
2018-05-01 12124, 2018
10:46 AM
UmkaDK
2018-05-01 12135, 2018
10:46 AM
ruaok
phew.
2018-05-01 12109, 2018
10:47 AM
bukwurm joined the channel
2018-05-01 12141, 2018
10:50 AM
UmkaDK
So ruaok, does it mean that the replication is now stable?
2018-05-01 12117, 2018
10:51 AM
UmkaDK
As in, I can re-enable our alarms.
2018-05-01 12122, 2018
10:51 AM
yvanzo
ruaok: No idea, the gap is from 2016-04-28 to 2016-05-23, that was prior to the move to Hetzner.
2018-05-01 12122, 2018
10:51 AM
ruaok
yes, do that.
2018-05-01 12102, 2018
10:52 AM
UmkaDK
Will do! Thanks guys!!
2018-05-01 12104, 2018
10:52 AM
ruaok
yvanzo: yeah, I'm moving old packets out of the way on the metabrainz production container.
2018-05-01 12113, 2018
10:52 AM
ruaok
sorry for the hassle UmkaDK !
2018-05-01 12141, 2018
10:52 AM
UmkaDK
ruaok, not a problem at all!! These things just happen... :)
2018-05-01 12108, 2018
10:53 AM
UmkaDK
(I wouldn't have a job if they didn't) :)
2018-05-01 12148, 2018
10:53 AM
ruaok
lol, good perspective.
2018-05-01 12115, 2018
11:02 AM
ruaok
yvanzo: a sanity check from you please.
2018-05-01 12158, 2018
11:02 AM
ruaok
on hip, in metabrainz-prod in /data/replication_packets ... the packets now start with replication-95701.tar.bz2
2018-05-01 12123, 2018
11:03 AM
ruaok
2018-05-01 12132, 2018
11:03 AM
ruaok
reports that CRITICAL Replication packet 95700 is missing
2018-05-01 12137, 2018
11:03 AM
ruaok
2018-05-01 12138, 2018
11:03 AM
yvanzo
ruaok: found missing replication packets on bootsy
2018-05-01 12144, 2018
11:03 AM
ruaok
it doesn't seem to be cached.
2018-05-01 12100, 2018
11:04 AM
ruaok
I don't really think we need to be concerned with them.
2018-05-01 12112, 2018
11:04 AM
ruaok
they are super old.
2018-05-01 12126, 2018
11:04 AM
ruaok
I just don't understand why that check is failing still.
2018-05-01 12145, 2018
11:04 AM
yvanzo
OK, does it checks backup dir or ftp dir?
2018-05-01 12124, 2018
11:05 AM
ruaok
it checks REPLICATION_PACKETS_DIR...
2018-05-01 12144, 2018
11:06 AM
ruaok
'REPLICATION_PACKETS_DIR': '/data/replication_packets'
2018-05-01 12154, 2018
11:06 AM
ruaok
so, I'm stumped as to why that check still fails.
2018-05-01 12130, 2018
11:07 AM
yvanzo
looks like it is checking the FTP dir
2018-05-01 12156, 2018
11:07 AM
yvanzo
Shall I move old packets from bootsy to there?
2018-05-01 12136, 2018
11:11 AM
yvanzo
Actually that was probably somehow broken at that time, that would explain why they were no more available from FTP.
2018-05-01 12124, 2018
11:12 AM
ruaok
how is it checking the FTP dir? I just double checked the settings.
2018-05-01 12133, 2018
11:12 AM
ruaok
> 'REPLICATION_PACKETS_DIR': '/data/replication_packets'
2018-05-01 12142, 2018
11:12 AM
ruaok
that is the current config inside the container.
2018-05-01 12154, 2018
11:12 AM
ruaok
and the packets are not supposed to be on FTP anylonger.
2018-05-01 12104, 2018
11:13 AM
ruaok
too many companies were freeloading
2018-05-01 12134, 2018
11:15 AM
yvanzo
ruaok: Volume musicbrainz-replication-packets is mounted on /data in MeB container and under .../ftp/... in MB Cron container.
2018-05-01 12159, 2018
11:17 AM
yvanzo
the backup dir you cleaned up is from volume musicbrainz-backups
2018-05-01 12119, 2018
11:23 AM
ruaok
i dont quite follow....
2018-05-01 12136, 2018
11:26 AM
ruaok
I cleaned up var/lib/docker/volumes/musicbrainz-replication-packets/_data which is mounted into:
2018-05-01 12140, 2018
11:26 AM
ruaok
2018-05-01 12142, 2018
12:08 PM
yvanzo
ruaok: May I move old packets (prior to the gap) from replication-packets (ftp) (where it is checked) to backups (or any other place for archives)?
2018-05-01 12159, 2018
12:08 PM
ruaok
sure
2018-05-01 12106, 2018
12:09 PM
ruaok
just dont put more recent ones onto FTP
2018-05-01 12118, 2018
12:09 PM
yvanzo
OK
2018-05-01 12158, 2018
12:10 PM
Lotheric has quit
2018-05-01 12153, 2018
12:17 PM
yvanzo
ruaok: replication check is happy now
2018-05-01 12118, 2018
12:19 PM
yvanzo
We still have packets previous to the gap in backups.
2018-05-01 12128, 2018
12:21 PM
ruaok
thanks for fixing that.
2018-05-01 12137, 2018
12:21 PM
ruaok
zas: replication check is ready for you to try again.
2018-05-01 12135, 2018
12:29 PM
yvanzo
I also moved distcontinued packets (from during the gap) from old/ subdir in replication-packets (ftp) to subdir replication-95000-95625/ in backups.
2018-05-01 12109, 2018
12:30 PM
Lotheric joined the channel
2018-05-01 12155, 2018
12:55 PM
Lotheric has quit
2018-05-01 12123, 2018
12:56 PM
SothoTalKer has quit
2018-05-01 12123, 2018
13:00 PM
Nyanko-sensei has quit
2018-05-01 12103, 2018
13:01 PM
Nyanko-sensei joined the channel
2018-05-01 12139, 2018
13:03 PM
Sophist-UK has quit
2018-05-01 12158, 2018
13:05 PM
Nyanko-sensei has quit
2018-05-01 12145, 2018
13:12 PM
Sophist-UK joined the channel
2018-05-01 12134, 2018
13:19 PM
kartikeyaSh_ircc joined the channel
2018-05-01 12154, 2018
13:23 PM
SothoTalKer joined the channel
2018-05-01 12102, 2018
13:31 PM
Lotheric joined the channel
2018-05-01 12128, 2018
13:49 PM
Sophist-UK has quit
2018-05-01 12154, 2018
13:51 PM
Nyanko-sensei joined the channel
2018-05-01 12157, 2018
13:53 PM
bukwurm has quit
2018-05-01 12132, 2018
13:56 PM
Nyanko-sensei has quit
2018-05-01 12143, 2018
14:21 PM
Nyanko-sensei joined the channel
2018-05-01 12122, 2018
14:23 PM
ruaok
yvanzo: zas: samj1912: invoices please!
2018-05-01 12143, 2018
14:38 PM
Slurpee joined the channel
2018-05-01 12143, 2018
14:38 PM
Slurpee has quit
2018-05-01 12143, 2018
14:38 PM
Slurpee joined the channel
2018-05-01 12114, 2018
14:40 PM
bukwurm joined the channel
2018-05-01 12148, 2018
14:42 PM
zas
bitmap: ping me when around
2018-05-01 12132, 2018
14:45 PM
Slurpee has quit
2018-05-01 12117, 2018
15:16 PM
drsaund has quit
2018-05-01 12129, 2018
15:32 PM
rembo10 has quit
2018-05-01 12129, 2018
15:32 PM
angreifer has quit
2018-05-01 12147, 2018
15:34 PM
rembo10 joined the channel
2018-05-01 12149, 2018
15:34 PM
angreifer joined the channel
2018-05-01 12123, 2018
15:47 PM
kartikeyaSh_ircc has quit
2018-05-01 12121, 2018
16:01 PM
ephemer0l has quit
2018-05-01 12133, 2018
16:01 PM
drsaund joined the channel
2018-05-01 12153, 2018
16:53 PM
drsaund has quit
2018-05-01 12126, 2018
17:52 PM
outsidecontext joined the channel
2018-05-01 12139, 2018
18:02 PM
ephemer0l_ joined the channel
2018-05-01 12157, 2018
18:13 PM
bukwurm has quit
2018-05-01 12149, 2018
18:20 PM
outsidecontext has quit
2018-05-01 12130, 2018
18:44 PM
CatQuest
meagblup! O_O
2018-05-01 12141, 2018
18:44 PM
CatQuest
megablup! O_O
2018-05-01 12150, 2018
18:44 PM
CatQuest
o_O
2018-05-01 12150, 2018
18:44 PM
CatQuest
O_o
2018-05-01 12159, 2018
18:44 PM
CatQuest
blupblupblup
2018-05-01 12112, 2018
18:45 PM
CatQuest
(when will sigyn ban me :P)
2018-05-01 12146, 2018
18:45 PM
bitmap
zas: ping
2018-05-01 12158, 2018
18:47 PM
CatQuest
sorry, no me tomar medicina
2018-05-01 12154, 2018
18:57 PM
zas
Hey bitmap, I'm available in 30 minutes to discuss about pg stuff, meanwhile you can read backlog if you didn't already.
2018-05-01 12146, 2018
18:58 PM
bitmap
ok cool, I saw some talk about autovacuum misbehaving
2018-05-01 12109, 2018
19:00 PM
bitmap
s/misbehaving/not being tuned properly?/
2018-05-01 12140, 2018
19:02 PM
zas
I'd say it needs tuning, but I'm not sure yet about the exact cause of the problem. We saw a manual analyze fixed it. But I'm not sure why yet.
2018-05-01 12119, 2018
19:10 PM
Slurpee joined the channel
2018-05-01 12129, 2018
19:14 PM
bitmap
2018-05-01 12156, 2018
19:17 PM
zas
autoanalyze is what we want
2018-05-01 12157, 2018
19:18 PM
bitmap
yeah
2018-05-01 12121, 2018
19:19 PM
zas
to start with i think we need to increase logging information concerning autovacuum
2018-05-01 12145, 2018
19:19 PM
Slurpee has quit
2018-05-01 12121, 2018
19:21 PM
zas
2018-05-01 12144, 2018
19:22 PM
zas
2018-05-01 12157, 2018
19:23 PM
zas
it is perhaps because autoanalyze actually partially occured, but overall it continued to degrade among time
2018-05-01 12137, 2018
19:24 PM
zas
perhaps conditions triggering analyze aren't met enough often
2018-05-01 12140, 2018
19:24 PM
bitmap
is there not logs indicating if autoanalyze actually occurred?
2018-05-01 12118, 2018
19:25 PM
zas
2018-05-01 12134, 2018
19:25 PM
zas
nothing between january and april
2018-05-01 12149, 2018
19:25 PM
zas
though the daemon was running afaik
2018-05-01 12122, 2018
19:26 PM
zas
so, may be, conditions to start the autovacuum aren't met