I've not looked at machines other than astro -- does this happen on pingu/asterix?
ianmcorvidae
yes, same thing on pingu at least
ruaok
da fuq?
how can a local unix domain socket start acting up on three servers at the same time?
ianmcorvidae
enough requests that it's overloading the number of processes on all of them
but that's 150, across the three of them, so I dunno
ruaok
so, the theory is that we have three servers stuck in some bizarre state.
we could take the site down and reboot all three of them
let me make sure I understand this...
ianmcorvidae
I don't really know if I understand it, to be fair
ruaok
this is nginx trying to make a unix domain socket call to mb-server and it fails, but only for search queries.
ianmcorvidae also notes I need to leave nearly right now
ianmcorvidae
it's not only for searches
we just have so many more search requests that the others get lost
if you tail the error log and grep -v 'query' you'll see the others, same sort of thing
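(A minimal sketch of that filter, not something pasted in the channel. The log path is an assumption, since the conversation never says where the nginx error log lives, and the function names are made up for illustration.)

```shell
#!/bin/sh
# Hypothetical default path; the chat does not name the nginx error log location.
ERROR_LOG="${ERROR_LOG:-/var/log/nginx/error.log}"

# Read log lines on stdin and drop the ones that mention 'query'
# (the search requests), so the less frequent failures become visible.
non_search_errors() {
    grep -v 'query'
}

# Follow a log file live and apply the same filter.
# --line-buffered stops grep from holding matches back in its output buffer.
follow_non_search_errors() {
    tail -F "${1:-$ERROR_LOG}" | grep --line-buffered -v 'query'
}
```

Running `follow_non_search_errors` with no argument would follow the assumed default path; pass the real log path as the first argument instead.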
ruaok
ok, will do.
bitmap: when we restarted the front end, what exactly did you do?
we may just need to repeat the exercise.
but this time stop both nginx and mb-server.
ianmcorvidae
follow the release process, it's in syswiki
ruaok
then remove the unix domain socket.
ianmcorvidae
and that *should* involve restarting nginx as well, but
ruaok
ah, we didn't fully do that. there was something amiss that was causing the git pull to hang.
ianmcorvidae
well, the pull doesn't matter really
bitmap
yeah, I just restarted the apps with svc -t /etc/service/musicbrainz-server
ruaok
clearly, but it halted the process.
ah!
no nginx restart.
bitmap: let's do this process again.
bitmap
ok
ianmcorvidae
sudo -i; cd server-configs; ./provision.sh; sudo svc -t /etc/service/musicbrainz-server; tail -f /etc/service/musicbrainz-server/log/main/current until you see the "Binding to ..." etc. and then you can bring it back in
ruaok
this time use the usual process and watch that it does restart nginx
ianmcorvidae
provision may do a git pull for mbserver, of course, but hopefully that doesn't keep failing
bitmap
yeah, the provision.sh is what was hanging
ruaok
it better not now. :)
ianmcorvidae
if it's failing, a manual nginx restart is the main thing we wanted out of that anyway, but
bitmap nods
ianmcorvidae notes I did do a HUP to nginx on astro a bit ago, because I added a quick bit of logging for the search endpoint, which a provision would overwrite (but that's fine)
ruaok
astro out
bitmap
looks like ./provision.sh still hangs, I'll try svc -t
ianmcorvidae
nginx is system-installed, not daemontools. /etc/init.d
bitmap
ah, right
ianmcorvidae
(and the musicbrainz-server svc -t isn't done by provision anyway, so that needs to be done separately in any case)
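(A rough sketch of the manual fallback described above, for when provision.sh hangs. It is not taken from the syswiki release process. The init script name `/etc/init.d/nginx` is an assumption, since only `/etc/init.d` is mentioned, and the helper names and the 120 second timeout are made up for illustration.)

```shell
#!/bin/sh
# Wait until a line matching a pattern appears in a file AFTER a given line
# number. Args: file, line count to skip, pattern, timeout in seconds.
# Returns 0 on a match, 1 if the timeout passes first.
wait_for_line_after() {
    _wf_file=$1 _wf_skip=$2 _wf_pattern=$3 _wf_timeout=${4:-60}
    _wf_elapsed=0
    while [ "$_wf_elapsed" -lt "$_wf_timeout" ]; do
        # Only inspect lines appended after the recorded position,
        # so a "Binding to" line from an earlier start is ignored.
        if tail -n "+$((_wf_skip + 1))" "$_wf_file" | grep -q -- "$_wf_pattern"; then
            return 0
        fi
        sleep 1
        _wf_elapsed=$((_wf_elapsed + 1))
    done
    return 1
}

# Manual restart of one frontend; run as root on a node that is out of rotation.
restart_frontend() {
    _rf_log=/etc/service/musicbrainz-server/log/main/current
    _rf_seen=$(wc -l < "$_rf_log")           # remember where the log ends now
    /etc/init.d/nginx restart                # assumed script name (nginx is not under daemontools)
    svc -t /etc/service/musicbrainz-server   # daemontools restarts the app after TERM
    wait_for_line_after "$_rf_log" "$_rf_seen" 'Binding to' 120
}
```

`restart_frontend` returns non-zero if no new "Binding to" line shows up in time, which is the cue not to bring the node back in yet. The sketch assumes the log file is not rotated during the wait.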
bitmap
ok, astro done
ruaok
astro in, pingu out
getting resource busy errors on astro. though not as fast as before
ianmcorvidae
resource temporarily unavailable as before, or a new 'resource busy' error?
ruaok
the former.
as before.
1 every few seconds.
ianmcorvidae
1 every few seconds is way better than before, anyway
ruaok
I wonder if this started happening before today and my tinkering kicked it into high gear.
for sure.
bitmap
pingu done
ruaok
pingu in, asterix out
ianmcorvidae
it has been happening some, this is what causes the instant 502s
and I'm out. I'll have my phone if I'm really urgently needed but I'm out basically the whole evening :( and then free of this particular thing until the fall, so it's not all bad :P