-
kepstin
search server is timing out again...
-
rest of the site's being slow as well. just got a 502 bad gateway.
-
nikki prods ruaok
-
Leftmost joined the channel
-
ruaok returns
-
nikki
we have unhappy servers
-
kepstin
started right about 1/2 hour ago.
-
ruaok
next time things go bad, please txt me!
-
things should be better now.
-
ocharles updates his number for ruaok
-
ocharles
last time I tried to text it didn't seem to go through
-
ruaok
cool, thx
-
ocharles
ah, the + has vanished
-
ruaok
hmm.
-
so we know that the jvm isn't the problem
-
it could be tomcat or the search server code.
-
fuss
-
ocharles
what are you doing to "fix" this problem
-
is it worth having an hourly server restart in cron?
-
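an hourly restart like that would just be one crontab line; the init script path here is a guess, not necessarily what's on the search boxes:

0 * * * * /etc/init.d/tomcat restart >/dev/null 2>&1
-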
ruaok
well, investigating the problem first of all.
-
ocharles
(akin to the constant restarting we have in mb_server)
-
ruaok
well, possibly.
-
it's clearly a case of something running out of filehandles.
-
near 7000 open files things go belly up.
-
ocharles
does lsof show you what's open?
-
ruaok
restarting tomcat works
-
ocharles
i think it's lsof anyway
-
ruaok
I've just been doing netstat | wc -l
-
we know who has the files open.
-
I'm mainly interested in finding out how many are open.
-
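one quick way to get that count per process (assuming lsof is installed and the search server is the only java process on the box) would be something like:

lsof -p $(pgrep -f java | head -1) | wc -l

or just netstat -an | wc -l for the network side, as above.
-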
ocharles
ok, but what are the files? do more files open on every search request?
-
ruaok
and it's not a clear case of file handles leaking straight out.
-
the number goes up, then drops.
-
but over time, it creeps upwards.
-
it happens faster with only one server than with two.
-
ocharles
as in one server leaks faster than 2?
-
ruaok
and, I think, with our increased traffic we're hitting a problem that we didn't hit before.
-
thus it's coming around to bite us now.
-
yes.
-
I'm guessing that every X requests our servers tip over.
-
ocharles
hmm
-
ruaok
X requests per server.
-
so with one server you hit twice as fast as with two servers.
-
ocharles
it must be X requests per N duration, not just X requests
-
ruaok
and restarting the server isn't as easy as it is with the mb_server.
-
wait, I don't know that for sure.
-
we do a restart on the server every 3 hours.
-
but I am not sure a restart is enough to free the handles.
-
it may need to be stop/start
-
X/N might also make sense.
-
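a quick way to check whether a plain restart actually frees the handles (the init script path is a guess here) is to compare the count around it:

netstat -an | wc -l    # before
/etc/init.d/tomcat stop && sleep 5 && /etc/init.d/tomcat start
netstat -an | wc -l    # after
-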
ocharles
do I have any access to these servers? just curious what lsof shows
-
ruaok
let me see if dora has calmed down now.
-
7235
-
100
-
ruaok hands ocharles a cookie
-
ocharles
what'd I do? :)
-
ruaok
the first one was when dora was tits up.
-
the second 5 minutes after diverting traffic.
-
ocharles
and doing nothing?
-
ruaok
so, sockets are being kept open too long and eventually we run out of open file handles.
-
ocharles
yea
-
that sounds like a reasonable hypothesis
-
ruaok
they are all in CLOSE_WAIT status
-
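counting sockets by TCP state shows that sort of thing at a glance; this is plain netstat/awk, nothing server-specific assumed:

netstat -tan | awk '/^tcp/ {print $6}' | sort | uniq -c | sort -rn
-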
ocharles
or the garbage collection on sockets is not fast enough
-
oh wait, so the problem is that the search server is running out of sockets, not file handles?
-
so that would indicate that musicbrainz-server isn't closing its connections properly
-
the cool down after 5 minutes is very likely from the musicbrainz-server fastcgi processes being killed and forced to close sockets
-
ruaok
on astro, see /tmp
-
lsof output for both dora and rhubarb.
-
dora is idle.
-
rhubarb is live.
-
sockets == filehandles.
-
yes! that is a great observation.
-
perhaps this is the fault of mb_server.
-
well, we should see a high open socket count on the mb-server machines too, no?
-
ocharles
i know sockets == file handles, but I'm not sure if lsof shows them
-
hmm, but doesn't astro talk to carl before dora?
-
ruaok
yes.
-
nginx on carl is doing the load balancing.
-
ocharles
so, wouldn't carl have problems first?
-
ruaok
hard to say off hand.
-
could be carl, could be carl propagating problems via mb_server
-
ocharles
ruaok: hrm, keep dora out
-
maybe if I try and locally do some requests from my box to dora, we can see if the socket count goes up at all?
-
ruaok
rhubarb will suffer the same problem.
-
ah.
-
we can try.
-
make the requests from hobbes
-
ocharles
ok
-
which ip?
-
dora.mb will work :)
-
port though?
-
ruaok
8080
-
ocharles
ok, 1 request a second atm
-
erm, one sec :)
-
ok, off
-
this is using curl to do the requests from bash
-
straight to dora
-
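roughly this shape of loop, with the search path and query as placeholders rather than the real ones used here:

while true; do
  curl -s -o /dev/null 'http://dora.mb:8080/search?query=test'
  sleep 1
done
-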
ruaok
was 30 open, now 19.
-
no seeming effect.
-
ocharles
i'll up the rate then
-
ruaok
and next we should try going through the load balancer
-
ocharles
yea
-
ok, twice as fast now
-
ruaok
well, the load balancer is going to go to both, so that's not so good.
-
19
-
up the rate much higher
-
ocharles
10 requests a second now
-
now 20
-
anything interesting?
-
ruaok
20 open handles.
-
try 1000 requests per sec.
-
ocharles
ok
-
ruaok
:-)
-
30
-
ocharles
does that look like it's actually getting a lot of traffic in the logs too?
-
ruaok
we don't actually record traffic on these servers.
-
ocharles
ah, ok
-
ruaok
1.63 load.
-
so it's doing something
-
double it. :)
-
ocharles
load went up to 0.64 on hobbes just doing that
-
ruaok
lol
-
ocharles
next test can be through carl
-
then we can test direct to dora with LWP
-
and then through carl with LWP
-
ruaok
not bouncing up at all.
-
but I wonder if we should leave it running longer.
-
ocharles
ruaok: I think a better option would be: while [ 1 -eq 1 ] do; netstat | wc -l >> sockets; sleep 30; done on dora
-
and bring it back into production
-
get a very rough idea of how that count changes over time
-
ruaok
yeah.
-
want an account on dora?
-
ocharles
sure
-
ruaok
gimme a sec.
-
mom is whipping up a storm of activity
-
geez, she's not sick anymore
-
driving me insane.
-
ruaok sighs
-
ready!
-
ocharles
heh
-
ruaok
syntax error near done. :-(
-
dora is back in.
-
ocharles
you get it running?
-
ruaok
no
-
the count is climbing fast though.
-
ocharles
while [ 1 -eq 1 ]; do netstat | wc -l; sleep 1; done >> counter
-
redirection in the wrong place, my bad
-
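a variant that also records when each sample was taken (the output file name is just an example) makes the counts easier to line up with traffic later:

while true; do echo "$(date +%s) $(netstat -an | wc -l)"; sleep 30; done >> /tmp/socket_counts
-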
ruaok
I'm going to take dora out and let things calm down.