20:55 PM
djce has set a reminder to ask him on Monday morning
2011-10-01 27455, 2011
20:56 PM
ruaok
thanks!
2011-10-01 27443, 2011
20:57 PM
ruaok wonders if this could be a keep alive problem?
2011-10-01 27450, 2011
21:01 PM
djce
you know dora/roobarb are still running the /usr/local jvm, right?
2011-10-01 27455, 2011
21:02 PM
ruaok
yes.
2011-10-01 27417, 2011
21:03 PM
ruaok
I only moved the jvm from /usr/lib so that I was *sure* that it was using the new jre
2011-10-01 27433, 2011
21:03 PM
djce
not sure about keepalive. Might have to check next time we catch the problem in progress.
2011-10-01 27433, 2011
21:03 PM
ruaok
I will likely switch back to openjdk.
2011-10-01 27401, 2011
21:04 PM
djce
I suspect it's not keepalive, since the sockets are closing, not ESTABLISHED
2011-10-01 27404, 2011
21:04 PM
ruaok
I've set the keep alive on dora from the default 100 to 10.
2011-10-01 27423, 2011
21:04 PM
djce
what setting is that?
2011-10-01 27451, 2011
21:04 PM
ruaok
maxKeepAliveRequests="10"
2011-10-01 27402, 2011
21:05 PM
ruaok
in server.xml in the HTTP connector section.
2011-10-01 27406, 2011
21:05 PM
djce
ok
2011-10-01 27428, 2011
21:05 PM
djce
I don't think nginx (which is the client) uses keepalive in client mode.
2011-10-01 27435, 2011
21:05 PM
ruaok
I suspect that you're right.
2011-10-01 27439, 2011
21:05 PM
djce
might be better if it did in fact.
2011-10-01 27448, 2011
21:05 PM
djce
'cos there'd be less opening/closing going on.
2011-10-01 27453, 2011
21:05 PM
ruaok
the growth seem to bot have changed.
2011-10-01 27421, 2011
21:06 PM
ruaok
yeah, we've got a simple script that connects to dora directly and makes a shitload of requests.
2011-10-01 27432, 2011
21:06 PM
ruaok
without going through nginx and the problem persists.
2011-10-01 27451, 2011
21:06 PM
djce
shitload as in open/request/close, or open/req/req/req/req/req/.../close?
2011-10-01 27435, 2011
21:07 PM
ruaok
the former
2011-10-01 27441, 2011
21:07 PM
ruaok
simple LWP::UserAgent calls
2011-10-01 27401, 2011
21:08 PM
ruaok
or even wgets from forked processes.
2011-10-01 27442, 2011
21:08 PM
ruaok
3892 sockets in TIME_WAIT and 8 in ESTABLISHED.
2011-10-01 27424, 2011
21:09 PM
djce
from what host?
2011-10-01 27458, 2011
21:09 PM
ruaok
all from carl
2011-10-01 27401, 2011
21:10 PM
ruaok
nxing
2011-10-01 27423, 2011
21:10 PM
ruaok
2011-10-01 27424, 2011
21:10 PM
ruaok
hmmm.
2011-10-01 27435, 2011
21:10 PM
ruaok
TIME_WAIT is a client state.
2011-10-01 27444, 2011
21:10 PM
ruaok
not a server state.
2011-10-01 27413, 2011
21:11 PM
djce
tcp doesn't have clients and servers.
2011-10-01 27459, 2011
21:11 PM
djce
but I take the point.
2011-10-01 27405, 2011
21:12 PM
ruaok
well, see the diagram in the link I posted.
2011-10-01 27423, 2011
21:12 PM
ruaok
in this case ngxinx/wget would be the "client", no?
2011-10-01 27435, 2011
21:12 PM
djce
yup
2011-10-01 27407, 2011
21:13 PM
ruaok
why then does tomcat have connections from carl that are TIME_WAIT
2011-10-01 27408, 2011
21:13 PM
ruaok
?
2011-10-01 27422, 2011
21:13 PM
djce
right now?
2011-10-01 27423, 2011
21:13 PM
ruaok
if it never made the connection to carl? carl made the connection to it.
2011-10-01 27425, 2011
21:13 PM
ruaok
yeah
2011-10-01 27430, 2011
21:13 PM
djce
on which host?
2011-10-01 27432, 2011
21:13 PM
djce
I don't see 'em
2011-10-01 27440, 2011
21:13 PM
ruaok
see netstat on dora.
2011-10-01 27448, 2011
21:13 PM
ruaok
am I reading that wrong?
2011-10-01 27424, 2011
21:14 PM
djce
no, you're right
2011-10-01 27447, 2011
21:14 PM
djce
(I was only looking at sockets owned by processes; those ones aren't though)
2011-10-01 27447, 2011
21:14 PM
ruaok
then there is something I clearly don't understand about all this. :-)
2011-10-01 27407, 2011
21:15 PM
ruaok
lsof gives sane output, yes.
2011-10-01 27410, 2011
21:15 PM
ruaok
netstat does not.
2011-10-01 27435, 2011
21:16 PM
djce
I think TIME WAIT applies to whichever peer closed the connection first.
2011-10-01 27439, 2011
21:16 PM
ruaok
2011-10-01 27457, 2011
21:16 PM
ruaok
the diagram here suggests that there is a timeout period for TIME_WAIT
2011-10-01 27424, 2011
21:17 PM
djce
there is, 2 x MSL (Maximum Segment Lifetime)
2011-10-01 27428, 2011
21:18 PM
ruaok nods
2011-10-01 27431, 2011
21:18 PM
ruaok
4 minutes.
2011-10-01 27443, 2011
21:18 PM
ruaok
but, for normal operations, it should not have to time out.
2011-10-01 27447, 2011
21:18 PM
djce
but I don't think that TIME WAIT sockets not owned by processes are a problem.
2011-10-01 27459, 2011
21:18 PM
djce
Our problem is sockets that /are/ owned by processes.
2011-10-01 27411, 2011
21:19 PM
djce
specifically, the number of them, exceeding some limit.
2011-10-01 27449, 2011
21:19 PM
ruaok
but there are only a few ESTABLISHED sockets.
2011-10-01 27452, 2011
21:19 PM
ruaok
like 8
2011-10-01 27439, 2011
21:20 PM
ruaok
and we know that when tomcat complains about being out of file handles, that count of TIME_WAIT sockets is *very* high.
2011-10-01 27442, 2011
21:20 PM
ruaok
like 8k or so.
2011-10-01 27412, 2011
21:21 PM
djce
I think that's a red herring. Unless those sockets are owned by tomcat.
2011-10-01 27435, 2011
21:21 PM
djce
I think, just ignore netstat, and observe lsof (or /proc/NNNN/fd)
2011-10-01 27408, 2011
21:22 PM
djce
so right now, all looks happy.
2011-10-01 27426, 2011
21:22 PM
djce
later on, you'll see "lsof -c java -a -i -n -P" (for example) show 000s of sockets.
2011-10-01 27436, 2011
21:22 PM
djce
/that/ is what we need to debug, AFAIK.
2011-10-01 27424, 2011
21:24 PM
ruaok
how can we monitor this?
2011-10-01 27438, 2011
21:24 PM
ruaok
maybe run that command every few second?
2011-10-01 27448, 2011
21:24 PM
ruaok
and if the output is more than 20 lines, save it?
2011-10-01 27452, 2011
21:24 PM
ruaok
otherwise overwrite?
2011-10-01 27428, 2011
21:25 PM
ruaok
then we would have timestamps as to when it happens and what the output is.
2011-10-01 27453, 2011
21:25 PM
djce
/var/log/tomcat6/catalina.out already gives us timestamps
2011-10-01 27409, 2011
21:26 PM
djce
and we could watch that log to tell us when it's all gone wrong.
2011-10-01 27442, 2011
21:26 PM
ruaok
true that.
2011-10-01 27431, 2011
21:27 PM
ruaok
I'm wondering if it's correlated with index updates.
2011-10-01 27448, 2011
21:27 PM
ruaok
like copying the files over. or rotating them into place. or restarting tomcat.
2011-10-01 27403, 2011
21:28 PM
djce
I didn't see any obvious correlation last time I checked
2011-10-01 27402, 2011
21:29 PM
ruaok
hmm.
2011-10-01 27410, 2011
21:29 PM
ruaok
well I dropped the global limit down to 2000.
2011-10-01 27419, 2011
21:29 PM
ruaok
so the problem is unlikely to return.
2011-10-01 27427, 2011
21:29 PM
ruaok
but that is far from an acceptable solution.
2011-10-01 27443, 2011
21:29 PM
djce
/un/likely?
2011-10-01 27451, 2011
21:29 PM
ruaok
yes.
2011-10-01 27401, 2011
21:30 PM
ruaok
this morning at 7am ocharles called me.
2011-10-01 27408, 2011
21:30 PM
djce
why so? lowering limits sounds like it makes it /more/ likely to me
2011-10-01 27412, 2011
21:30 PM
ruaok
I dropped the limit from 2500 to 2000 and that made everything happy.
2011-10-01 27431, 2011
21:30 PM
djce
oh, ws global ratelimit?
2011-10-01 27434, 2011
21:30 PM
ruaok
well, we're handling fewer requests/
2011-10-01 27435, 2011
21:30 PM
ruaok
yes.
2011-10-01 27437, 2011
21:30 PM
djce
I see
2011-10-01 27455, 2011
21:30 PM
ruaok
sorry for not making that clear. :)
2011-10-01 27453, 2011
21:31 PM
ruaok
do you have saved output from lsof when things were bad?
2011-10-01 27402, 2011
21:32 PM
ruaok
did we have 0000s of ESTABLISHED sockets?
2011-10-01 27412, 2011
21:32 PM
djce
sorry, not saved. nor remembered.
2011-10-01 27414, 2011
21:32 PM
djce
:-(
2011-10-01 27423, 2011
21:32 PM
ruaok
ok.
2011-10-01 27432, 2011
21:32 PM
ruaok
I think mum wants to go run around for a bit.
2011-10-01 27434, 2011
21:32 PM
ruaok
I'll do that.
2011-10-01 27447, 2011
21:32 PM
ruaok
once I get back, I'll up the request limit and watch for things to go wrong.
2011-10-01 27449, 2011
21:32 PM
djce
ttfn
2011-10-01 27454, 2011
21:32 PM
ruaok
then I'll investigate a bit.
2011-10-01 27402, 2011
21:33 PM
ruaok
save some lsofs and netstats
2011-10-01 27406, 2011
21:33 PM
djce
I might still be here. Maybe.
2011-10-01 27421, 2011
21:33 PM
ruaok
ok, probably won't be too long.
2011-10-01 27424, 2011
21:42 PM
nikki
bitmap: ping
2011-10-01 27432, 2011
21:42 PM
bitmap
pong
2011-10-01 27400, 2011
21:43 PM
nikki
2011-10-01 27449, 2011
21:43 PM
nikki
(btw, do you want me to make you the owner of the picard component thing in trac too?)
2011-10-01 27424, 2011
21:44 PM
bitmap
hm, no, I hadn't seen that…
2011-10-01 27446, 2011
21:44 PM
bitmap
that doesn't really make sense since it should be using the metadata in the folder name
2011-10-01 27410, 2011
21:45 PM
bitmap
but I'll look into it.
2011-10-01 27454, 2011
21:45 PM
bitmap
nikki: what can the owner of the component do?
2011-10-01 27409, 2011
21:46 PM
nikki
well I think you'll probably get emails then
2011-10-01 27421, 2011
21:46 PM
bitmap
ah, yeah I'd prefer that
2011-10-01 27450, 2011
21:46 PM
bitmap
since I haven't really looked at trac at all lately
2011-10-01 27427, 2011
21:47 PM
nikki
there, done
2011-10-01 27459, 2011
21:47 PM
bitmap
great, thanks :)
2011-10-01 27438, 2011
22:16 PM
Leftmost joined the channel
2011-10-01 27416, 2011
23:14 PM
bitmap
2011-10-01 27417, 2011
23:26 PM
nikki
oh, right :)
2011-10-01 27440, 2011
23:39 PM
reosarevok joined the channel
2011-10-01 27453, 2011
23:39 PM
ruaok
djce: still here?
2011-10-01 27448, 2011
23:43 PM
djce
ruaok: pong
2011-10-01 27438, 2011
23:57 PM
ruaok
the next thing I want to try to replace tomcat with something else.
2011-10-01 27442, 2011
23:57 PM
ruaok
like jboss.
2011-10-01 27436, 2011
23:58 PM
djce
interesting.
2011-10-01 27438, 2011
23:58 PM
ruaok
we've ruled out the jvm. and the search code doesn't seem to leak file handles -- they get cleaned up.
2011-10-01 27443, 2011
23:58 PM
ruaok
it might just be a tomcat bug.
2011-10-01 27446, 2011
23:58 PM
ruaok
a high load one.
2011-10-01 27402, 2011
23:59 PM
djce
heading off to bed now but will be around tomorrow for more of this
2011-10-01 27407, 2011
23:59 PM
ruaok
ok.
2011-10-01 27407, 2011
23:59 PM
djce
:-) zzzzzz
2011-10-01 27414, 2011
23:59 PM
ruaok
I'll play with it a bit
2011-10-01 27415, 2011
23:59 PM
ruaok
nn
2011-10-01 27422, 2011
23:59 PM
ruaok
(once my mom lets up on the chores. sigh.)
2011-10-01 27428, 2011
23:59 PM
djce
ttfn. keep me posted if anything exciting comes up.
2011-10-01 27434, 2011
23:59 PM
ruaok
will do.