bitmap, ruaok: i'm back from dinner, what's the plan finally?
2018-05-11 13102, 2018
bitmap
I think it's dependent on how long the fan replacement will take, and whether we want to wait for it
2018-05-11 13101, 2018
zas
it should be short (<20 minutes) but we need to give them an exact time where they can start
2018-05-11 13116, 2018
zas
we have to be careful, any hardware intervention can lead to other hardware failure... a misplugged cable and we lose a lot of time ;)
2018-05-11 13142, 2018
zas
also we'll prolly have the same issue with WAL vs queen after bowie shutdown, and need to resync queen after
2018-05-11 13126, 2018
zas
this (long) procedure shows how much we need something better, for fast switching
2018-05-11 13124, 2018
bitmap
yup, I was playing with pgpool & repmgr yesterday
2018-05-11 13136, 2018
bitmap
it'll still need manual intervention and careful attention, but would be a lot faster
2018-05-11 13128, 2018
bitmap
I'll push the test containers I have somewhere soon
2018-05-11 13128, 2018
iliekcomputers
Could I be invited to the switchover doc too?
2018-05-11 13154, 2018
ruaok
I say go for it and just take the downtime.
2018-05-11 13157, 2018
bitmap
iliekcomputers: sent
2018-05-11 13124, 2018
ruaok
twice, even now. :)
2018-05-11 13111, 2018
iliekcomputers
Thanks.
2018-05-11 13144, 2018
ruaok
well, zas, what should we do?
2018-05-11 13122, 2018
zas
either the complicated procedure that will fail (perhaps) or the simple that will succeed (perhaps)
2018-05-11 13111, 2018
zas
let's take everything down, and just do it fast, we plan an hour, tweet about the maintenance, and let hetzner work
2018-05-11 13127, 2018
zas
it will limit possible issues, because the switch has so many steps, and so many nodes involved, that i fear it will not work as we expect...
2018-05-11 13142, 2018
zas
ruaok: is it your feeling ?
2018-05-11 13154, 2018
iliekcomputers
I support the non complicated method.
2018-05-11 13109, 2018
zas
well, bad things can still happen
2018-05-11 13130, 2018
zas
bitmap: can you trigger a db backup now ? how long does it take ?
2018-05-11 13151, 2018
ruaok
I agree with everything you said.
2018-05-11 13120, 2018
ruaok
we should still offer €20 of beer if they can do it in under 10 minutes.
2018-05-11 13138, 2018
zas
in fact if they switch cpu+fan they can do it in 3
2018-05-11 13140, 2018
ruaok
:)
2018-05-11 13100, 2018
zas
bitmap: ?
2018-05-11 13104, 2018
bitmap
I don't think a pg_basebackup takes very long
2018-05-11 13123, 2018
bitmap
you want to take it now before bowie is down?
2018-05-11 13123, 2018
ruaok
then let's do that.
2018-05-11 13134, 2018
zas
yes, just do one now, it will limit damage in case of
2018-05-11 13137, 2018
ruaok
backup and then pick an exact time.
2018-05-11 13144, 2018
zas
20:00 UTC is perhaps too soon (for backup+hetzner), i'd say 21:00 UTC (in 1 hour 45 minutes)
2018-05-11 13113, 2018
bitmap
ok, I'll copy a backup to williams now. and queen should remain a usable replica backup too
2018-05-11 13139, 2018
zas
ok for the time ?
2018-05-11 13144, 2018
bitmap
let me start the backup really quick and just double check the progress of it
2018-05-11 13151, 2018
zas
ok
2018-05-11 13153, 2018
bitmap
started, no progress visible yet (pg_basebackup: initiating base backup, waiting for checkpoint to complete)
2018-05-11 13149, 2018
zas
we have to deploy barman next week, it'll help in this field
2018-05-11 13112, 2018
reosarevok
Is barman the newest version of bartendro? :p
2018-05-11 13120, 2018
zas
:) not really, but it may give us more time to play with bartendro in case of database disaster
2018-05-11 13107, 2018
bitmap
ok looks like it should complete in ~30 minutes
2018-05-11 13136, 2018
bitmap
it's at 5% now
2018-05-11 13159, 2018
bitmap
writing to /root/postgres-master-data-2018-05-11 on williams
2018-05-11 13126, 2018
bitmap
21:00 UTC would be more than enough time anyway
2018-05-11 13144, 2018
zas
ok, i'll ask hetzner if they are ok for this time
2018-05-11 13154, 2018
bitmap
in a disaster (gasp) we can still restore from queen since it should be an exact copy after bowie is shut down
2018-05-11 13156, 2018
zas
request sent to hetzner, waiting for them to confirm
2018-05-11 13105, 2018
zas
asked for shorter delay (with beers)
2018-05-11 13149, 2018
zas
bitmap: queen disk usage >85%, we need to move solr stuff elsewhere (it takes ~40 GB)
2018-05-11 13132, 2018
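A minimal sketch of the kind of check behind the ">85%" figure above, assuming a plain used/total ratio and the 85% threshold quoted in the message. The function names and the example path are hypothetical, and `df` may report a slightly different percentage because it accounts for reserved blocks.

```python
import shutil


def usage_percent(path: str) -> float:
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def over_threshold(path: str, threshold: float = 85.0) -> bool:
    """True when usage exceeds the threshold (85% is the figure from the chat)."""
    return usage_percent(path) > threshold


if __name__ == "__main__":
    # "." is only a placeholder; on the real host this would be the data mount.
    print(f"{usage_percent('.'):.1f}% used, over 85%: {over_threshold('.')}")
```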
zas
"We regret to tell you that the named appointment isn't available." hmmm
2018-05-11 13132, 2018
bitmap
do they say what times they have available?
2018-05-11 13134, 2018
zas
nope, asking
2018-05-11 13109, 2018
CatQuest
wat
2018-05-11 13108, 2018
CatQuest
any reply zas?
2018-05-11 13133, 2018
zas
not yet
2018-05-11 13142, 2018
CatQuest
meh
2018-05-11 13116, 2018
CatQuest
I am lucky that today is my "day off" - playing botw instead of mb editing. it would have been annoying to have to wait for this.
2018-05-11 13147, 2018
CatQuest
someone mentioned in #mb that something weird was happening with the mb site. odd css and errors
2018-05-11 13106, 2018
CatQuest
if the fan is literally trying to make the machine not *boil* (95 °C!?) then a banner might be a good idea? (probably not a good idea to edit now: fan needs replacing, waiting for hardware reply etc., site will go down while it is repaired..)
2018-05-11 13125, 2018
zas
they suggest .... 23:15 CEST (21:15 UTC)
2018-05-11 13136, 2018
zas
pff
2018-05-11 13141, 2018
CatQuest
so people can finish what they're doing and have some forewarning
2018-05-11 13158, 2018
CatQuest
eh...
2018-05-11 13125, 2018
CatQuest
idk, put up a banner, put the site in read only, have a nap for a couple hours then go-go?
2018-05-11 13125, 2018
zas
bitmap: we have to prepare everything a bit before, and shut down the server a few minutes before 21:15 UTC
2018-05-11 13159, 2018
bitmap
okay
2018-05-11 13124, 2018
CatQuest
(imho putting up a banner *now* about it is a good idea too but..)
2018-05-11 13155, 2018
zas
ruaok: around?
2018-05-11 13113, 2018
bitmap
I think I can put mb in read only and point it to queen as part of the process, if we want
2018-05-11 13113, 2018
ruaok
now, yes.
2018-05-11 13105, 2018
CatQuest
well it's like 1 hour 15 minutes but still
2018-05-11 13106, 2018
bitmap
or not, not sure if it'll accept connections if bowie is down.
2018-05-11 13119, 2018
bitmap
since it's a hot standby. so nvm
2018-05-11 13145, 2018
bitmap
pg_basebackup: base backup completed
2018-05-11 13144, 2018
CatQuest
anyway, good luck and well done guys. I know you will do a good job