-
ruaok
I think so.
2015-10-22 29538, 2015
-
ruaok
check /var/log/tomcat
2015-10-22 29549, 2015
-
zas
it started at ~19:50
2015-10-22 29511, 2015
-
ruaok
the load is dropping.
2015-10-22 29550, 2015
-
ruaok
recovery in progress.
2015-10-22 29550, 2015
-
ruaok
WTF?
2015-10-22 29532, 2015
-
ruaok
was there exactly 24 hours between incidents?
2015-10-22 29549, 2015
-
zas
not exactly
2015-10-22 29559, 2015
-
ruaok
and now going back up. :(
2015-10-22 29542, 2015
-
ruaok
doh. no loads for dora & roobarb.
2015-10-22 29549, 2015
-
ruaok
we really need those next, please!
2015-10-22 29557, 2015
-
samphippen joined the channel
2015-10-22 29528, 2015
-
LordSputnik joined the channel
2015-10-22 29543, 2015
-
Lotheric
sorry to be the bearer of bad news :)
2015-10-22 29545, 2015
-
Lotheric
hehe
2015-10-22 29558, 2015
-
zas
logs are full of exceptions :-(
2015-10-22 29531, 2015
-
ruaok
Lotheric: oh the news rolled in well before you said anything. I was just away.
2015-10-22 29535, 2015
-
ruaok
zas: which logs?
2015-10-22 29549, 2015
-
ruaok
roo & dora were quiet yesterday
2015-10-22 29551, 2015
-
zas
tomcat's ones
2015-10-22 29505, 2015
-
ruaok
specifically, which file are you looking at, zas ?
2015-10-22 29546, 2015
-
zas
wait a few secs, i found something
2015-10-22 29553, 2015
-
zas
grep '22-Oct-2015 17:50' /var/log/tomcat/catalina.2015-10-22.log on dora
2015-10-22 29513, 2015
-
zas
org.apache.catalina.startup.VersionLoggerListener.log Server version: Apache Tomcat/8.0.20 ?
2015-10-22 29519, 2015
-
zas
server restart ?
2015-10-22 29557, 2015
-
zas
that's when it started
2015-10-22 29534, 2015
-
ruaok
I rebooted both servers yesterday.
2015-10-22 29548, 2015
-
ruaok
both of those servers have a history of acting up strangely at times.
2015-10-22 29500, 2015
-
ruaok
and either a java upgrade or a reboot has always fixed that issue.
2015-10-22 29513, 2015
-
zas
roobarb's logs are configured differently :(
2015-10-22 29535, 2015
-
ruaok
one has tomcat 8.0.20 the other 8.0.28
2015-10-22 29548, 2015
-
ruaok
I was curious to see if they acted differently.
2015-10-22 29554, 2015
-
LordSputnik has quit
2015-10-22 29558, 2015
-
LordSputnik joined the channel
2015-10-22 29520, 2015
-
LordSputnik has quit
2015-10-22 29543, 2015
-
zas
they didn't
2015-10-22 29531, 2015
-
zas
but the logs are different on dora and roobarb
2015-10-22 29558, 2015
-
zas
can you check dora log and confirm the server somehow restarted ?
2015-10-22 29543, 2015
-
zas
what happens when one server goes down ? does the other get the whole load ?
2015-10-22 29552, 2015
-
ruaok
yep.
2015-10-22 29557, 2015
-
ruaok
nginx is very smart about that.
2015-10-22 29516, 2015
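For context, the failover behaviour described here is standard nginx upstream handling; a minimal sketch of such a block (the upstream name and fail parameters are assumptions, only the hosts and port 8080 appear in this log):

    upstream search {
        # after max_fails failed attempts nginx stops sending a backend
        # traffic for fail_timeout, so the surviving server takes the
        # whole load in the meantime
        server dora:8080    max_fails=3 fail_timeout=30s;
        server roobarb:8080 max_fails=3 fail_timeout=30s;
    }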
-
jct- joined the channel
2015-10-22 29538, 2015
-
ruaok
on roobarb have a look at /tmp/jstack.9790.out
2015-10-22 29520, 2015
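A dump with that name is typically produced by pointing jstack at the JVM's pid; a minimal sketch (pid taken from the filename; the -F flag is an assumption, but forced dumps are the ones that annotate frames with "information may be imprecise"):

    # on roobarb, against the tomcat JVM
    jstack -F 9790 > /tmp/jstack.9790.out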
-
ruaok
so, cartman does remote restarts of the server.
2015-10-22 29528, 2015
-
ruaok
so, yes that is happening.
2015-10-22 29548, 2015
-
ruaok
cartman builds indexes, pushes to dora/roobarb, then restarts them
2015-10-22 29532, 2015
-
zas
did it happen at 17:50 UTC ?
2015-10-22 29504, 2015
-
ruaok
no
2015-10-22 29519, 2015
-
ruaok
from /etc/service/search-build/log/main/current
2015-10-22 29505, 2015
-
ruaok
the top of the stack trace for each proc seems to be:
2015-10-22 29506, 2015
-
ruaok
- sun.nio.ch.FileDispatcherImpl.write0(java.io.FileDescriptor, long, int) @bci=0 (Compiled frame; information may be imprecise)
2015-10-22 29509, 2015
-
zas
Oct 22 17:50:02 dora monit[1326]: 'tomcat' total mem amount of 12712944kB matches resource limit
2015-10-22 29509, 2015
-
zas
[total mem amount>12288046kB]
2015-10-22 29509, 2015
-
zas
Oct 22 17:50:02 dora monit[1326]: 'tomcat' trying to restart
2015-10-22 29544, 2015
-
ruaok
my bad. sorry, I never mentioned monit
2015-10-22 29555, 2015
-
ruaok
so the search servers leak memory like a sieve.
2015-10-22 29501, 2015
-
ruaok
we've never been able to find the source of it.
2015-10-22 29507, 2015
-
zas
java ;)
2015-10-22 29521, 2015
-
ruaok
so, we have monit installed. and monit kills the JVM when it gets to 40% of ram or so.
2015-10-22 29524, 2015
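A minimal sketch of the monit rule being described (pidfile and start/stop commands are assumptions; the 12288046kB threshold is taken from the syslog lines above):

    check process tomcat with pidfile /var/run/tomcat.pid
        start program = "/usr/sbin/service tomcat start"
        stop program  = "/usr/sbin/service tomcat stop"
        # ~40% of RAM; matches "total mem amount>12288046kB" in syslog
        if totalmem > 12288046 kB then restart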
-
ruaok
yes, exactly.
2015-10-22 29534, 2015
-
zas
ok, monit restarted dora's tomcat, the load went to roobarb, and then ...
2015-10-22 29553, 2015
-
ruaok
oh -- is that the chain of events?
2015-10-22 29554, 2015
-
zas
i noticed we had free memory on dora
2015-10-22 29501, 2015
-
ruaok
that is where it started?
2015-10-22 29506, 2015
-
zas
yes
2015-10-22 29518, 2015
-
ruaok
ohh, a clue. :)
2015-10-22 29524, 2015
-
zas
imho the restart of dora's search server caused all this havoc
2015-10-22 29529, 2015
-
zas
let's check for yesterday
2015-10-22 29558, 2015
-
zas
it started at 17:56 yesterday
2015-10-22 29544, 2015
-
zas
UTC+2
2015-10-22 29552, 2015
-
zas
Oct 21 15:56:42 dora monit[1240]: 'tomcat' total mem amount of 12585720kB matches resource limit [total mem amount>12288046kB]
2015-10-22 29553, 2015
-
zas
Oct 21 15:56:42 dora monit[1240]: 'tomcat' trying to restart
2015-10-22 29555, 2015
-
zas
heh
2015-10-22 29500, 2015
-
zas
same cause same effect
2015-10-22 29516, 2015
-
zas
but why don't they recover ??
2015-10-22 29531, 2015
-
ruaok
they used to.
2015-10-22 29537, 2015
-
ruaok
something else must've changed.
2015-10-22 29551, 2015
-
ruaok
the only other thing that comes to mind is that we're out of bandwidth.
2015-10-22 29502, 2015
-
zas
not atm
2015-10-22 29502, 2015
-
ruaok
now waitaminute.
2015-10-22 29522, 2015
-
ruaok
backlog is 0 now.
2015-10-22 29527, 2015
-
ruaok
but not always.
2015-10-22 29532, 2015
-
ruaok
this suggests that we're just below the limit.
2015-10-22 29536, 2015
-
ruaok
even though search isn't working.
2015-10-22 29549, 2015
-
ruaok
where the hell is our bandwidth going?
2015-10-22 29550, 2015
-
ariscop joined the channel
2015-10-22 29515, 2015
-
zas
ernie bandwidth is far under the limit
2015-10-22 29538, 2015
-
ruaok
the rate limiter is kicking in.
2015-10-22 29551, 2015
-
ruaok
on ernie: watch tc -s qdisc show
2015-10-22 29501, 2015
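What to look for in that output: each qdisc line carries "Sent ... (dropped X, overlimits Y ...)" counters, so a climbing dropped count means the shaper is discarding packets. The same thing narrowed to one interface (device name as given later for ernie):

    # counters are cumulative; watch refreshes every 2s by default
    watch tc -s qdisc show dev em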
-
ruaok
how are you determining current bandwidth?
2015-10-22 29535, 2015
-
zas
select ernie
2015-10-22 29526, 2015
-
ruaok
MBps, so 10x mbits, right?
2015-10-22 29543, 2015
-
ruaok
well, that isn't right either.
2015-10-22 29508, 2015
-
ruaok
ok, let's turn off the rate limiter we have in place.
2015-10-22 29524, 2015
-
ruaok
and let things run.
2015-10-22 29538, 2015
-
ruaok
if everything returns to normal we know that this is a function of rate limiting.
2015-10-22 29550, 2015
-
zas
hmmm, how does the load balancer behave when both search servers are down ?
2015-10-22 29518, 2015
-
ruaok
throws a lot of errors, why?
2015-10-22 29520, 2015
-
ruaok
restart both at the same time?
2015-10-22 29550, 2015
-
zas
i don't get why they still eat so much cpu while we have much lower traffic atm
2015-10-22 29504, 2015
-
ruaok
because java?
2015-10-22 29518, 2015
-
ruaok
let's try removing the ratelimit on ernie.
2015-10-22 29527, 2015
-
ruaok
worst case, nothing happens.
2015-10-22 29534, 2015
-
ruaok
best case, services return to normal.
2015-10-22 29538, 2015
-
zas
root@dora:/var/log# netstat -natuop|grep '10.1.1.21:8080'|grep 'CLOSE'|wc -l
2015-10-22 29538, 2015
-
zas
1169
2015-10-22 29552, 2015
-
zas
that's a lot for a "fast" network
2015-10-22 29553, 2015
-
ruaok
CLOSE_WAIT?
2015-10-22 29557, 2015
-
zas
yes
2015-10-22 29504, 2015
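For reference: CLOSE_WAIT means the remote end has already closed the connection while the local process has not yet closed its descriptor. A minimal sketch of the same count with ss (standard iproute2 filter syntax; tail skips the header line):

    # sockets stuck in CLOSE_WAIT whose local port is 8080
    ss -tn state close-wait '( sport = :8080 )' | tail -n +2 | wc -l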
-
ruaok
interesting.
2015-10-22 29522, 2015
-
zas
i suspect the load balancer is doing something weird
2015-10-22 29524, 2015
-
ruaok
remember when I said that the gateways were failing (carl/lenny) and that we had a huge backlog?
2015-10-22 29534, 2015
-
zas
yup
2015-10-22 29539, 2015
-
ruaok
because the gateways were dropping packets like mad.
2015-10-22 29548, 2015
-
ruaok
I see the rate limiter dropping packets.
2015-10-22 29550, 2015
-
ruaok
the pieces fit.
2015-10-22 29540, 2015
-
zas
did you try a nginx restart on ernie ?
2015-10-22 29513, 2015
-
ariscop has quit
2015-10-22 29514, 2015
-
ruaok
not complete restart, no.
2015-10-22 29520, 2015
-
ruaok
hup, yes.
2015-10-22 29520, 2015
-
ruaok
feel free.
2015-10-22 29515, 2015
-
ariscop joined the channel
2015-10-22 29511, 2015
-
zas
no impact
2015-10-22 29534, 2015
-
ruaok
rebooting ernie might have more impact.
2015-10-22 29539, 2015
-
ruaok
but we can't do that. :(
2015-10-22 29543, 2015
-
ruaok
well, we can, but...
2015-10-22 29544, 2015
-
zas
hmm
2015-10-22 29502, 2015
-
zas
not needed, the issue is on the search servers side
2015-10-22 29523, 2015
-
ruaok
I agree with your sentiment that a reboot isn't needed, but the problem lies with the gateways.
2015-10-22 29541, 2015
-
ruaok
we've made zero changes to search; we see a ton of CLOSE_WAIT sockets.
2015-10-22 29550, 2015
-
ruaok
screams gateway, not search.
2015-10-22 29536, 2015
-
ruaok
ok, owner of DWNI responded. will forward my bandwidth query to their salesdroid.
2015-10-22 29513, 2015
-
DjSlash
(zas: fwiw, `lsof -n -iTCP:8080 -sTCP:CLOSE_WAIT|wc -l` would be faster)
2015-10-22 29533, 2015
-
LordSput1 joined the channel
2015-10-22 29504, 2015
-
LordSput1 has quit
2015-10-22 29512, 2015
-
ruaok
zas: "tc qdisc del dev em root"
2015-10-22 29518, 2015
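For what it's worth, deleting the root qdisc removes the shaping entirely and the interface falls back to the kernel's default qdisc (typically pfifo_fast), so the change is easy to verify and to undo by re-adding the old qdisc:

    # afterwards, only the default qdisc should be listed
    tc qdisc show dev em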
-
ruaok
any objections?
2015-10-22 29520, 2015
-
LordSputnik joined the channel
2015-10-22 29558, 2015
-
zas
nope; no better idea
2015-10-22 29505, 2015
-
ruaok
k
2015-10-22 29552, 2015
-
ruaok
well, that wasn't it. :(
2015-10-22 29512, 2015
-
New__ has quit
2015-10-22 29518, 2015
-
ruaok
and the second you restart the search, the CLOSE_WAITs pile up. fast.
2015-10-22 29512, 2015
-
zas
we could stop the forwarding to the search servers for a while to see if they recover without new queries
2015-10-22 29522, 2015
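One low-risk way to do that, sketched against the assumed upstream block from earlier (the down directive and reload command are standard nginx; the block itself is an assumption):

    upstream search {
        # "down" takes a backend out of rotation without deleting it;
        # with both down, nginx returns errors instead of forwarding
        server dora:8080    down;
        server roobarb:8080 down;
    }

    # apply without dropping ernie's other traffic
    nginx -s reload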
-
MajorLurker joined the channel
2015-10-22 29536, 2015
-
ruaok
try it. what do we have to lose? :)
2015-10-22 29556, 2015
-
ruaok
so, yes, they recovered, did I follow that right?