2018-04-30
for instance, replication cannot even connect to the DB.
zas: ok, so we have no clue about what is going on, right?
yvanzo: and subscriptions had run to the end yesterday
yvanzo: exactly, I’m not even sure MB is the cause of it.
zas: my guess is that something in pg went bad
zas: i see nothing that could be an "external" cause
yvanzo: reports are taking X times longer than usual to complete, with X large enough
zas: sure something triggered it, but hard to know what
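One way to look for whatever is dragging the reports out is to list long-running backends in pg_stat_activity. A minimal sketch, assuming psycopg2 is available, using the port mentioned later in the log and placeholder credentials:

    # list backends that have been running for a while (sketch only;
    # host, user and password are placeholders)
    import psycopg2

    conn = psycopg2.connect(host="bowie", port=65400, dbname="musicbrainz",
                            user="musicbrainz_ro", password="...")
    cur = conn.cursor()
    cur.execute("""
        SELECT pid, usename, state, now() - query_start AS runtime, left(query, 80)
          FROM pg_stat_activity
         WHERE state <> 'idle'
           AND now() - query_start > interval '5 minutes'
         ORDER BY runtime DESC
    """)
    for pid, user, state, runtime, query in cur.fetchall():
        print(pid, user, state, runtime, query)
    conn.close()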
zas: ok, let's do that: put everything in maintenance, and restart pg
ruaok: yes, please.
ruaok: after that I would like to take some mitigation steps, but first a restart.
ruaok: what is needed to do this?
ruaok: do we need to do prep besides tweeting?
zas: ruaok: can you tweet about maintenance? i'll toggle a value in docker server config
zas: i added it recently, untested in prod yet though
yvanzo: I can stop mb cron and sentry already
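Stopping the auxiliary services around the window is just a couple of docker commands; a sketch, with hypothetical container names:

    # stop auxiliary containers before maintenance, restart them afterwards
    # (container names below are made up for illustration)
    import subprocess

    AUX_CONTAINERS = ["mb-cron", "sentry"]  # placeholder names

    def stop_all(names):
        for name in names:
            # `docker stop` sends SIGTERM and waits for the container to exit
            subprocess.run(["docker", "stop", name], check=True)

    def start_all(names):
        for name in names:
            subprocess.run(["docker", "start", name], check=True)

    if __name__ == "__main__":
        stop_all(AUX_CONTAINERS)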
ruaok: on it.
zas: btw, bowie needs a reboot, but i want to test a pg restart first
zas: let's not add random issues on this one
yvanzo: subscriptions are only 25% processed
zas: yes, it is uber slow
zas: can it resume?
ruaok: twatted.
ruaok: we'll just run it tomorrow.
ruaok: subscriptions are not that important.
yvanzo: no, but tomorrow’s run will catch it.
zas: ok, i'll push the change to docker server config, to put everything down during bowie's maintenance
zas: ok done
zas: yvanzo: restart the pg container on bowie please
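A restart-and-verify step can look roughly like the sketch below; the container name is made up, the port and db name come from elsewhere in the log, and the credentials are placeholders:

    # restart the postgres container and wait until it accepts connections again
    import subprocess, time
    import psycopg2

    subprocess.run(["docker", "restart", "musicbrainz-postgres"], check=True)

    deadline = time.time() + 120
    while True:
        try:
            psycopg2.connect(host="bowie", port=65400, dbname="musicbrainz",
                             user="musicbrainz_ro", password="...",
                             connect_timeout=3).close()
            print("postgres is accepting connections again")
            break
        except psycopg2.OperationalError:
            if time.time() > deadline:
                raise
            time.sleep(2)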
ruaok: we need nagios to read our twitter feed.
ruaok: calm down, nagios!
zas: yvanzo: ?
yvanzo: yep
yvanzo: done
zas: ok, i'll let it run a bit before removing maintenance mode
zas: removing alldown, it should happen within the next 2 minutes
zas: we'll soon know if it changes anything
zas: services are back
ruaok: cpu at 70%.
zas: ruaok: nagios happy again ;)
ruaok: no dice, that didn't change anything.
ruaok: what is the status of the DB on queen?
ruaok: yvanzo: do you know?
zas: i wouldn't conclude anything yet, but it doesn't look too promising
yvanzo: ruaok: reloading config files
ruaok: but it should be operational, in theory?
yvanzo: yes
ruaok: and we simply do not have load balancing set up?
ruaok: is any DB traffic going there right now?
yvanzo: CAA redirect is already using it again
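To see what is actually hitting queen, one option is to group live backends by role and application; a rough sketch, assuming queen exposes Postgres on the same port as bowie and using placeholder credentials:

    # summarize current client connections on queen by role and application
    import psycopg2

    conn = psycopg2.connect(host="queen", port=65400, dbname="musicbrainz",
                            user="musicbrainz_ro", password="...")
    cur = conn.cursor()
    cur.execute("""
        SELECT usename, application_name, count(*)
          FROM pg_stat_activity
         GROUP BY usename, application_name
         ORDER BY count(*) DESC
    """)
    for role, app, n in cur.fetchall():
        print(f"{n:5d}  {role or '-':20s}  {app or '-'}")
    conn.close()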
ruaok: ok. what other DB-using things could use the read-only mirror?
ruaok: search index creation. subscription emails.
yvanzo: no one else
yvanzo: subscription emails have been aborted
ruaok: I'm thinking about manual balancing to buy us some time before we get a proper load balancer going.
ruaok: what else can be moved to queen ASAP?
ruaok: but queen is reliably replicating, right?
yvanzo: there are requests to bowie pg with the musicbrainz_ro pg user
ruaok: do you know how many and who they are?
yvanzo: looking into it
yvanzo: roughly 15%
ruaok: 15% of relief for PG sounds very useful.
ruaok: let's see if we can move this traffic to queen.
yvanzo: I’m not sure what matches these requests :/
yvanzo: Or even if they are meaningful.
ruaok: might just be search.
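A way to attribute those musicbrainz_ro requests is to group them by client address and application_name; a sketch along the same lines, again with placeholder credentials:

    # try to attribute the musicbrainz_ro traffic on bowie by client address
    import psycopg2

    conn = psycopg2.connect(host="bowie", port=65400, dbname="musicbrainz",
                            user="musicbrainz_ro", password="...")
    cur = conn.cursor()
    cur.execute("""
        SELECT client_addr, application_name, count(*)
          FROM pg_stat_activity
         WHERE usename = 'musicbrainz_ro'
         GROUP BY client_addr, application_name
         ORDER BY count(*) DESC
    """)
    for addr, app, n in cur.fetchall():
        print(f"{n:4d}  {addr or 'local'}  {app or '(no application_name)'}")
    conn.close()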
zas: queen replication is ok, right? we need to put everything in maintenance again and reboot bowie (sec updates)
zas: yvanzo, ruaok: ok?
yvanzo: yep, queen is fine
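A quick sanity check for the standby, assuming streaming replication (a sketch, placeholder credentials):

    # quick replication health check on the standby (queen)
    import psycopg2

    conn = psycopg2.connect(host="queen", port=65400, dbname="musicbrainz",
                            user="musicbrainz_ro", password="...")
    cur = conn.cursor()
    cur.execute("""
        SELECT pg_is_in_recovery(),
               now() - pg_last_xact_replay_timestamp() AS replay_lag
    """)
    in_recovery, lag = cur.fetchone()
    print("standby mode:", in_recovery, "- approximate replay lag:", lag)
    conn.close()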
ruaok: I wish we would've done the restart at the same time.
ruaok: can we redirect traffic to queen for a bit?
zas: i'd prefer not to, the reboot may cause its own issues...
ruaok: or should we just do a reboot and then start working on balancing the load?
zas: the reboot is required by the sec upgrades, but it is unrelated to the current load issue
ruaok: understood.
yvanzo: agreed
zas: about targeting queen instead of bowie, is there an easy way to do it yet?
ruaok: search.
ruaok: I'm working on a PR for that.
zas: ah, just for search, ok
ruaok: I'm open to suggestions on other RO traffic.
ruaok: and moving it over.
zas: i have a bunch of ideas, but they all require being able to configure mb containers to select a db server easily; i asked bitmap about it, and he told me he'll work on something
yvanzo: indeed, there is no easy way to do it right now
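One lightweight way to make a container switchable between bowie and queen is an environment override; a sketch with made-up variable names, not the actual MB configuration:

    # pick the database server from the environment so a container can be pointed
    # at bowie or queen without a code change (variable names are hypothetical)
    import os
    import psycopg2

    def connect_readonly():
        host = os.environ.get("MB_DB_RO_HOST", "bowie")   # set to "queen" to shift RO load
        port = int(os.environ.get("MB_DB_RO_PORT", "65400"))
        return psycopg2.connect(host=host, port=port, dbname="musicbrainz",
                                user="musicbrainz_ro",
                                password=os.environ.get("MB_DB_RO_PASSWORD", ""))

    if __name__ == "__main__":
        conn = connect_readonly()
        print("connected to", conn.get_dsn_parameters()["host"])
        conn.close()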
zas: this is where we need pgpool or something
ruaok: yep.
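pgpool-II (or a similar proxy) would do the read/write split transparently; done by hand at the application level it looks roughly like this sketch, which is not how MB actually routes queries (the read-write "musicbrainz" role and the sample query are assumptions):

    # crude read/write split: writes go to the primary (bowie), reads to the mirror (queen)
    import psycopg2

    PRIMARY = dict(host="bowie", port=65400, dbname="musicbrainz",
                   user="musicbrainz", password="...")
    REPLICA = dict(host="queen", port=65400, dbname="musicbrainz",
                   user="musicbrainz_ro", password="...")

    def run_query(sql, params=(), readonly=False):
        target = REPLICA if readonly else PRIMARY
        with psycopg2.connect(**target) as conn:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall() if cur.description else None

    # example read that the mirror can serve (assumes the musicbrainz schema
    # is on the search_path)
    rows = run_query("SELECT count(*) FROM artist", readonly=True)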
zas: let's decide what to do during the meeting, this issue has been around for too long and causes too much hassle
ruaok: let's not. this topic is too big and detailed for the meeting.
ruaok: first, let's do the reboot.
zas: ok
ruaok: hard downtime. starting in 2 minutes. ready?
zas: yup
zas: and ending when ...
zas: ruaok: tell me when you've tweeted
ruaok: done.
ruaok: proceed when ready.
zas: k, alldown set, it will happen soon
zas: rebooting
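While waiting for bowie to come back, one can poll the Postgres port rather than watching ping; a small sketch (host and port from the log, timeouts arbitrary):

    # wait for bowie's postgres port to answer again after the reboot
    import socket, time

    HOST, PORT = "bowie", 65400
    while True:
        try:
            with socket.create_connection((HOST, PORT), timeout=3):
                print("port is open again, safe to remove alldown")
                break
        except OSError:
            time.sleep(5)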
ruaok: shhh nagios. down boy.
zas: at least, nagios is working ;)
ruaok: lol
zas: reboot in progress, time to sacrifice a black chicken
ruaok: why a black chicken?
* ruaok would prefer a rainbow chicken
zas: well, it works better, or so they say ;)
yvanzo: green powder, anyone?
zas: yuppppi, ping is back
zas: removing alldown
ruaok: samj1912: ping
yvanzo: zas: something is still wrong with mb
zas: yes, 502s
yvanzo: “No such database: musicbrainz_db”
ruaok: uh oh
zas: hmmm
yvanzo: the database should be 'musicbrainz'
yvanzo: Or is the container port the issue?
zas: it appeared after a reboot, nothing should have changed inside the pg container
yvanzo: zas: yep, it is the container port
yvanzo: port=65401 in the MB error msg, whereas pg runs on 65400 as usual
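The mismatch above (65401 in the error versus the expected 65400) can be confirmed from outside the container by trying both ports; a sketch with placeholder credentials:

    # confirm which port postgres is actually answering on
    # (65400 expected, 65401 seen in the MB error)
    import psycopg2

    for port in (65400, 65401):
        try:
            psycopg2.connect(host="bowie", port=port, dbname="musicbrainz",
                             user="musicbrainz_ro", password="...",
                             connect_timeout=3).close()
            print(port, "-> connects, database 'musicbrainz' exists")
        except psycopg2.OperationalError as exc:
            print(port, "->", str(exc).strip())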