#musicbrainz-devel


      • johtso_ joined the channel
      • ianmcorvidae
        best idea I have is kicking some musicbrainz-server instances
      • ruaok
        we did that.
      • no effect.
      • I think all my mucking about may have confused carl.
      • bad hardware + mucking == who knows.
      • ianmcorvidae
        this problem isn't actually *on* carl, as far as I can tell
      • it seems like we're running out of perl processes, or something
      • ruaok
        but half the world passes through carl.
      • ianmcorvidae
        yes, but the problem is nginx on $frontend connecting to a unix socket on the same machine
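        (For context, a minimal sketch of the kind of local check this points at — the socket path here is an assumption, not taken from the real config:)

            # does the local mb-server unix socket exist, and how many connections are parked on it?
            ls -l /tmp/musicbrainz-server.socket
            ss -x | grep -c musicbrainz-server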
      • ruaok
        the most useful, and not very at that, error message that I've seen so far is this:
      • ws/2/release/?query=...+artist:...&fmt=json", host: "musicbrainz.org"
      • ianmcorvidae
        yes, that's the issue
      • ruaok
        I've not looked at machines other than astro -- does this happen on pingu/asterix?
      • ianmcorvidae
        yes, same thing on pingu at least
      • ruaok
        da fuq?
      • how can a local unix domain socket start acting up on three servers at the same time?
      • ianmcorvidae
        enough requests that it's overloading the number of processes on all of them
      • but that's 150, across the three of them, so I dunno
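        (A rough way to count the worker pool on one frontend — the '[p]lackup' process name is an assumption about how the Plack workers show up in ps:)

            # count perl/plack worker processes on this box
            ps aux | grep -c '[p]lackup'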
      • ruaok
      • so, the theory is that we have three servers stuck in some bizarre state.
      • we could take the site down and reboot all three of them
      • let me make sure I understand this...
      • ianmcorvidae
        I don't really know if I understand it, to be fair
      • ruaok
        this is nginx trying to make a unix domain socket call to mb-server and it fails, but only for search queries.
      • ianmcorvidae also notes I need to leave nearly right now
      • ianmcorvidae
        it's not only for searches
      • we just have so many more search requests that the others get lost
      • if you tail the error log and grep -v 'query' you'll see the others, same sort of thing
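        (i.e. something along these lines, using the error-log path mentioned further down:)

            # watch the non-search failures only
            tail -f /var/log/nginx/001-musicbrainz.error.log | grep -v query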
      • ruaok
        ok, will do.
      • bitmap: when we restarted the front end, what exactly did you do?
      • we may just need to repeat the exercise.
      • but this time stop both nginx and mb-server.
      • ianmcorvidae
        follow the release process, it's in syswiki
      • ruaok
        then remove the unix domain socket.
      • ianmcorvidae
        and that *should* involve restarting nginx as well, but
      • ruaok
        ah, we didn't fully do that. there was something amiss that was causing the git pull to hang.
      • ianmcorvidae
        well, the pull doesn't matter really
      • bitmap
        yeah, I just restarted the apps with svc -t /etc/service/musicbrainz-server
      • ruaok
        clearly, but it halted the process.
      • ah!
      • no nginx restart.
      • bitmap: lets do this process again.
      • bitmap
        ok
      • ianmcorvidae
        sudo -i, cd server-configs; ./provision.sh; sudo svc -t /etc/service/musicbrainz-server; tail -f /etc/service/musicbrainz-server/log/main/current until you see the "Binding to ..." etc. and then you can bring it back in
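        (That sequence, laid out one step per line, with the paths exactly as given above:)

            sudo -i
            cd server-configs
            ./provision.sh
            sudo svc -t /etc/service/musicbrainz-server
            # wait for "Binding to ..." in the log before bringing the host back in
            tail -f /etc/service/musicbrainz-server/log/main/current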
      • ruaok
        this time use the usual process and watch that it does restart nginx
      • ianmcorvidae
        provision may do a git pull for mbserver, of course, but hopefully that doesn't keep failing
      • bitmap
        yeah, the provision.sh is what was hanging
      • ruaok
        it better not now. :)
      • ianmcorvidae
        if it's failing, a manual nginx restart is the main thing we wanted out of that anyway, but
      • bitmap nods
      • ianmcorvidae notes I did do a HUP to nginx on astro a bit ago, because I added a quick bit of logging for the search endpoint, which a provision would overwrite (but that's fine)
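        (One common way to send that HUP — the pid-file path is the usual Debian default and is an assumption here:)

            sudo kill -HUP "$(cat /var/run/nginx.pid)"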
      • ruaok
        astro out
      • bitmap
        looks like ./provision.sh still hangs, I'll try svc -t
      • ianmcorvidae
        nginx is system-installed, not daemontools. /etc/init.d
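        (i.e. restart it through the init script rather than daemontools:)

            sudo /etc/init.d/nginx restart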
      • bitmap
        ah, right
      • ianmcorvidae
        (and the musicbrainz-server svc -t isn't done by provision anyway, so that needs to be done separately in any case)
      • bitmap
        ok, astro done
      • ruaok
        astro in, pingu out
      • getting resource busy errors on astro. though not as fast as before
      • ianmcorvidae
        resource temporarily unavailable as before, or a new 'resource busy' error?
      • ruaok
        the former.
      • as before.
      • 1 every few seconds.
      • ianmcorvidae
        1 every few seconds is way better than before, anyway
      • ruaok
        I wonder if this started happening before today and my tinkering kicked it into high gear.
      • for sure.
      • bitmap
        pingu done
      • ruaok
        pingu in, asterix out
      • ianmcorvidae
        it has been happening some, this is what causes the instant 502s
      • and I'm out. I'll have my phone if I'm really urgently needed but I'm out basically the whole evening :( and then free of this particular thing until the fall, so it's not all bad :P
      • ruaok
        k
      • you around tomorrow?
      • bitmap
        asterix done
      • ruaok
        ok, should be coming back in.
      • and I just got three 502s in a row.
      • bitmap
        I'm still seeing the same errors in astro's logs
      • they haven't really slowed down
      • ruaok
        really? what log are you watching?
      • tail -f /var/log/nginx/001-musicbrainz.error.log | grep 502
      • bitmap
        same log without the grep 502
      • ruaok
        fuck.
      • CatCat joined the channel
      • ruaok reads http://serverfault.com/questions/398972/need-to-increase-nginx-throughput-to-an-upstream-unix-socket-linux-kernel-tun
      • but our loads are low right now
      • for obvious reasons.
      • ok, here is a thought and it ties together with your postgres observation...
      • (and perhaps you meant this earlier)
      • but if the clients are somehow getting hung up on talking to postgres and everything gets backed up...
      • if no perl procs are available, it would throw this error.
      • bitmap
        that's something I was thinking, but I wasn't sure what pgbouncer things were tweaked that'd cause that
      • ruaok
        they shouldn't really.
      • but they should also not be running out of connections to postgres.
      • but that was the last thing that was changed in our config, so that is a valid concern.
      • let me go poke pgbouncer some more
      • the load on totoro is very low.
      • no disk usage.
      • bitmap: so this is starting to make sense.
      • we're allowing a backlog of 128 waiting query connections to pgbouncer.
      • and there are about ~120 or so waiting.
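        (The waiting-client count can be read from pgbouncer's admin console; the port and admin user below are assumptions based on pgbouncer defaults, not the actual totoro setup:)

            # cl_waiting shows clients queued for a server connection
            psql -p 6432 -U postgres pgbouncer -c 'SHOW POOLS;'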
      • bitmap
        should we up the backlog then?
      • not sure if that was one of the settings that was changed
      • ruaok
        it wasn't, but perhaps it should be changed.
      • but if the server is not busy, it should be able to handle this.
      • bitmap
        yes, I would think so
      • ariscop joined the channel
      • there's not a lot of info on how to tune this
      • ruaok
        apparently the best info appears on the musicbrainz blog. :(
      • I can't tune the backlog without a restart of pgbouncer.
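        (That is pgbouncer's listen_backlog setting in pgbouncer.ini; it's the listen() backlog, so it only takes effect on a full restart. A sketch, with the config and init-script paths as assumptions:)

            # in /etc/pgbouncer/pgbouncer.ini, under [pgbouncer]:
            #   listen_backlog = 256
            sudo /etc/init.d/pgbouncer restart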
      • bitmap
        I've no clue why we'd hit this all of a sudden, but I can't think of any other options at the moment
      • ruaok
        well, chirlu and I did some tweaking yesterday. I'm guessing this is the side-effect.
      • so, what happens if I restart pgbouncer?
      • without taking the site down. bad things?
      • bitmap
        perhaps we should tweak net.core.somaxconn too then, since that's also 128 and seems to be a system-wide limit
      • ruaok is utterly out of patience and coping
      • ruaok
        give it a shot.
      • but pgbouncer really looks like it's *full*
      • bitmap
        uh, I've never restarted pgbouncer on my own before
      • ruaok
        out of 63 connections, 57 are idle!!
      • I'm just going to restart it. screw it.
      • bitmap
        alright. :/ I doubled net.core.somaxconn to 256 on totoro
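        (The corresponding commands; note that a change made via sysctl -w like this does not persist across reboots unless it is also written to /etc/sysctl.conf:)

            sysctl net.core.somaxconn             # was 128
            sudo sysctl -w net.core.somaxconn=256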
      • ruaok
        astro error log went quiet.
      • bitmap
        wow, yeah
      • ruaok
        pgbouncer looks much happier.
      • astro log is screaming again.
      • pgbouncer still looks good.
      • quiet again.
      • da hell?
      • bitmap
        yeah, I saw a burst of errors and then it stopped
      • Freso
        This channel is such a rollercoaster today!
      • ruaok
        be glad you're not on the rollercoaster.
      • ok, so this problem was unrelated to the gateway work.
      • go figger.
      • Leo_Verto
        Wee, everything is broken!
      • ruaok
        pgbouncer filled up again.
      • bitmap
        what'd you increase the backlog setting to?
      • ruaok
        256
      • though I am not convinced of the correlation.
      • I'm seeing lots of errors, but pgbouncer is fine
      • in fact it isn't getting even close to the limit now
      • in general, things are a lot better now
      • reosarevok
        From editing at least it seems that way, yup!
      • ruaok
        it's going back to shit, don't worry reosarevok
      • reosarevok
        :p
      • We'll see
      • ruaok
        pretty much when the waiting clients get to 100 the errors start flying
      • what da fuq pgbouncer?
      • the db server is idle and bored. why you be making life so hard.
      • bitmap
        have you tried reverting the settings you changed recently?
      • ruaok
        I can do that next.