-
kepstin
search server is timing out again...
2011-10-01 27456, 2011
-
kepstin
rest of the site's being slow as well. just got a 502 bad gateway.
2011-10-01 27404, 2011
-
nikki prods ruaok
2011-10-01 27417, 2011
-
Leftmost joined the channel
2011-10-01 27446, 2011
-
ruaok returns
2011-10-01 27400, 2011
-
nikki
we have unhappy servers
2011-10-01 27439, 2011
-
kepstin
started right about 1/2 hour ago.
2011-10-01 27420, 2011
-
ruaok
next time things to bad, please txt me!
2011-10-01 27434, 2011
-
ruaok
things should be better now.
2011-10-01 27400, 2011
-
ocharles updates his number for ruaok
2011-10-01 27406, 2011
-
ocharles
last time I tried to text it didn't seem to go through
2011-10-01 27417, 2011
-
ruaok
cool, thx
2011-10-01 27451, 2011
-
ocharles
ah, the + has vanished
2011-10-01 27402, 2011
-
ruaok
hmm.
2011-10-01 27410, 2011
-
ruaok
so we know that the jvm isn't the problem
2011-10-01 27426, 2011
-
ruaok
it could be tomcat or the search server code.
2011-10-01 27428, 2011
-
ruaok
fuss
2011-10-01 27439, 2011
-
ocharles
what are you doing to "fix" this problem
2011-10-01 27446, 2011
-
ocharles
is it worth having an hourly server restart in cron?
2011-10-01 27459, 2011
-
ruaok
well, investigating the problem first of all.
2011-10-01 27400, 2011
-
ocharles
(akin to the constant restarting we have in mb_server)
2011-10-01 27410, 2011
-
ruaok
well, possibly.
2011-10-01 27419, 2011
-
ruaok
its clearly a case of something running out of filehandles.
2011-10-01 27427, 2011
-
ruaok
near 7000 open files things to belly up.
2011-10-01 27436, 2011
-
ocharles
lsof show you what's open?
2011-10-01 27440, 2011
-
ruaok
restarting tomcat works
2011-10-01 27446, 2011
-
ocharles
i think it's lsof anyway
2011-10-01 27449, 2011
-
ruaok
I've just been doing netstat | wc -l
2011-10-01 27412, 2011
-
ruaok
we know who has the files open.
2011-10-01 27423, 2011
-
ruaok
I'm mainly interested in finding out how many are open.
2011-10-01 27425, 2011
-
ocharles
ok, but what are the files? do more files open on every search request?
2011-10-01 27435, 2011
-
ruaok
and it not a clear case of file handles leaking straight out.
2011-10-01 27447, 2011
-
ruaok
the number goes up, then drops.
2011-10-01 27451, 2011
-
ruaok
but over time, it creeps upwards.
2011-10-01 27402, 2011
-
ruaok
it happens faster with only one server as with two.
2011-10-01 27415, 2011
-
ocharles
as in one server leaks faster than 2?
2011-10-01 27427, 2011
-
ruaok
and, I think, with our increased traffic we're hitting a problem that we didn't hit before.
2011-10-01 27432, 2011
-
ruaok
thus is coming around to bite us now.
2011-10-01 27434, 2011
-
ruaok
yes.
2011-10-01 27443, 2011
-
ruaok
I'm guessing that ever X requests our servers tip over.
2011-10-01 27449, 2011
-
ocharles
hmm
2011-10-01 27454, 2011
-
ruaok
X requests per server.
2011-10-01 27403, 2011
-
ruaok
so with one server you hit twice as fast as with two servers.
2011-10-01 27423, 2011
-
ocharles
it must be a X requests/N duration, not just X requests
2011-10-01 27435, 2011
-
ruaok
and restarting the server isn't as easy as it is with the mb_server.
2011-10-01 27447, 2011
-
ruaok
wait, I don't know that for sure.
2011-10-01 27454, 2011
-
ruaok
we do a restart on the server every 3 hours.
2011-10-01 27403, 2011
-
ruaok
but I am not sure a restart is enough to free the handles.
2011-10-01 27408, 2011
-
ruaok
it may need to be stop/start
2011-10-01 27420, 2011
-
ruaok
X/N might also make sense.
2011-10-01 27426, 2011
-
ocharles
do I have any access to these servers? just curious what lsof shows
2011-10-01 27428, 2011
-
ruaok
let me see if dora has calmed down now.
2011-10-01 27457, 2011
-
ruaok
7235
2011-10-01 27458, 2011
-
ruaok
100
2011-10-01 27404, 2011
-
ruaok hands ocharles a cookie
2011-10-01 27412, 2011
-
ocharles
what'd I do? :)
2011-10-01 27418, 2011
-
ruaok
the first one was when dora was tits up.
2011-10-01 27428, 2011
-
ruaok
the second 5 minutes after diverting traffic.
2011-10-01 27449, 2011
-
ocharles
and doing nothing?
2011-10-01 27450, 2011
-
ruaok
so, sockets are being kept open too long and eventually we run out of open file handles.
2011-10-01 27455, 2011
-
ocharles
yea
2011-10-01 27404, 2011
-
ocharles
that sounds like a reasonable hypothesis
2011-10-01 27411, 2011
-
ruaok
they are all in CLOSE_WAIT status
2011-10-01 27413, 2011
-
ocharles
or the garbage collection on sockets is not fast enough
2011-10-01 27416, 2011
-
ocharles
oh wait, so the problem is that the search server is running out of sockets, not file handles?
2011-10-01 27425, 2011
-
ocharles
so that would indicate that musicbrainz-server isn't closing it's connections properly
2011-10-01 27448, 2011
-
ocharles
the cool down after 5 minutes is very likely from the musicbrainz-server fastcgi processes being killed and forced to close sockets
2011-10-01 27455, 2011
-
ruaok
on astro, see /tmp
2011-10-01 27405, 2011
-
ruaok
lsof output for both dora and rhubarb.
2011-10-01 27408, 2011
-
ruaok
dora is idle.
2011-10-01 27415, 2011
-
ruaok
roobarb is live.
2011-10-01 27451, 2011
-
ruaok
sockets == filehandles.
2011-10-01 27459, 2011
-
ruaok
yes! that is a great observation.
2011-10-01 27405, 2011
-
ruaok
perhaps this is the fault of mb_server.
2011-10-01 27437, 2011
-
ruaok
well, we should see a high open socket count on the mb-server machines too, no?
2011-10-01 27447, 2011
-
ocharles
i know sockets == file handles, but I'm not sure if lsof shows them
2011-10-01 27405, 2011
-
ocharles
hmm, but doesn't astro talk to carl before dora?
2011-10-01 27415, 2011
-
ruaok
yes.
2011-10-01 27424, 2011
-
ruaok
nginx on carl is doing the load balancing.
2011-10-01 27425, 2011
-
ocharles
so, wouldn't carl have problems first?
2011-10-01 27432, 2011
-
ruaok
hard to say off hand.
2011-10-01 27447, 2011
-
ruaok
could be carl, could be carl propagating problems via mb_server
2011-10-01 27434, 2011
-
ocharles
ruaok: hrm, keep dora out
2011-10-01 27450, 2011
-
ocharles
maybe if I try and locally do some requests from my box to dora, we can see if the socket count goes up at all?
2011-10-01 27451, 2011
-
ruaok
rhubarb will suffer the same problem.
2011-10-01 27459, 2011
-
ruaok
ah.
2011-10-01 27401, 2011
-
ruaok
we can try.
2011-10-01 27407, 2011
-
ruaok
make the requests from hobbes
2011-10-01 27413, 2011
-
ocharles
ok
2011-10-01 27415, 2011
-
ocharles
which ip?
2011-10-01 27435, 2011
-
ocharles
dora.mb will work :)
2011-10-01 27453, 2011
-
ocharles
port though?
2011-10-01 27438, 2011
-
ruaok
8080
2011-10-01 27439, 2011
-
ocharles
ok, 1 request a second atm
2011-10-01 27407, 2011
-
ocharles
erm, one sec :)
2011-10-01 27456, 2011
-
ocharles
ok, off
2011-10-01 27410, 2011
-
ocharles
this is using curl to do the requests from bash
2011-10-01 27413, 2011
-
ocharles
straight to dora
2011-10-01 27439, 2011
-
ruaok
was 30 open, now 19.
2011-10-01 27443, 2011
-
ruaok
no seeming effect.
2011-10-01 27448, 2011
-
ocharles
i'll up the rate then
2011-10-01 27459, 2011
-
ruaok
and next we should try going through the load balancer
2011-10-01 27409, 2011
-
ocharles
yea
2011-10-01 27414, 2011
-
ocharles
ok, twice as fast now
2011-10-01 27426, 2011
-
ruaok
well, the load balancer is going go to both, so thats not so good.
2011-10-01 27439, 2011
-
ruaok
19
2011-10-01 27444, 2011
-
ruaok
up the rate much higher
2011-10-01 27457, 2011
-
ocharles
10 requests a second now
2011-10-01 27417, 2011
-
ocharles
now 20
2011-10-01 27434, 2011
-
ocharles
anything interesting?
2011-10-01 27413, 2011
-
ruaok
20 open handles.
2011-10-01 27421, 2011
-
ruaok
try 1000 requests per sec.
2011-10-01 27423, 2011
-
ocharles
ok
2011-10-01 27425, 2011
-
ruaok
:-)
2011-10-01 27455, 2011
-
ruaok
30
2011-10-01 27404, 2011
-
ocharles
that look like it's actually getting a lot of traffic in the logs too?
2011-10-01 27401, 2011
-
ruaok
we don't actually record traffic on these servers.
2011-10-01 27408, 2011
-
ocharles
ah, ok
2011-10-01 27413, 2011
-
ruaok
1.63 load.
2011-10-01 27415, 2011
-
ruaok
so doing somehting
2011-10-01 27419, 2011
-
ruaok
double it. :)
2011-10-01 27423, 2011
-
ocharles
2011-10-01 27431, 2011
-
ocharles
load went up to 0.64 on hobbes just doing that
2011-10-01 27435, 2011
-
ruaok
lol
2011-10-01 27446, 2011
-
ocharles
next text can be through carl
2011-10-01 27452, 2011
-
ocharles
then we can test direct to dora with LWP
2011-10-01 27455, 2011
-
ocharles
and the through carl with LWP
2011-10-01 27408, 2011
-
ruaok
not bouncing up at all.
2011-10-01 27414, 2011
-
ruaok
but I wonder if we should leave it running longer.
2011-10-01 27405, 2011
-
ocharles
ruaok: I think a better option would be: while [ 1 -eq 1 ] do; netstat | wc -l >> sockets; sleep 30; done on dora
2011-10-01 27409, 2011
-
ocharles
and bring it back into production
2011-10-01 27419, 2011
-
ocharles
get a very rough idea of how that count changes over time
2011-10-01 27402, 2011
-
ruaok
yeah.
2011-10-01 27412, 2011
-
ruaok
want an account on dora?
2011-10-01 27416, 2011
-
ocharles
sure
2011-10-01 27432, 2011
-
ruaok
gimme a sec.
2011-10-01 27438, 2011
-
ruaok
mom is whipping up a storm of activity
2011-10-01 27448, 2011
-
ruaok
geez, she's not sick anymore
2011-10-01 27451, 2011
-
ruaok
driving me insane.
2011-10-01 27454, 2011
-
ruaok sighs
2011-10-01 27431, 2011
-
ruaok
ready!
2011-10-01 27449, 2011
-
ocharles
heh
2011-10-01 27431, 2011
-
ruaok
syntax error near done. :-(
2011-10-01 27457, 2011
-
ruaok
dora is back in.
2011-10-01 27447, 2011
-
ocharles
you get it running?
2011-10-01 27437, 2011
-
ruaok
no
2011-10-01 27423, 2011
-
ruaok
the count is climbing fast though.
2011-10-01 27449, 2011
-
ocharles
while [ 1 -eq 1 ]; do netstat | wc -l; sleep 1; done >> counter
2011-10-01 27459, 2011
-
ocharles
redirection in the wrong place, my bad
2011-10-01 27453, 2011
-
ruaok
I'm going to take dora out and let things calm down.