#metabrainz

/

      • ruaok
        I think so.
      • 2015-10-22 29538, 2015

      • ruaok
        check /var/log/tomcat
      • 2015-10-22 29549, 2015

      • zas
        it started at ~19:50
      • 2015-10-22 29511, 2015

      • ruaok
        the load is dropping.
      • 2015-10-22 29550, 2015

      • ruaok
        recovery in progress.
      • 2015-10-22 29550, 2015

      • ruaok
        WTF?
      • 2015-10-22 29532, 2015

      • ruaok
        was there exactly 24 hours between incidents?
      • 2015-10-22 29549, 2015

      • zas
        not exactly
      • 2015-10-22 29559, 2015

      • ruaok
        and now going back up. :(
      • 2015-10-22 29542, 2015

      • ruaok
        doh. no loads for dora & roobarb.
      • 2015-10-22 29549, 2015

      • ruaok
        we really need those next, please!
      • 2015-10-22 29557, 2015

      • samphippen joined the channel
      • 2015-10-22 29528, 2015

      • LordSputnik joined the channel
      • 2015-10-22 29543, 2015

      • Lotheric
        sorry to be the bearer of bad news :)
      • 2015-10-22 29545, 2015

      • Lotheric
        hehe
      • 2015-10-22 29558, 2015

      • zas
        logs are full of exceptions :-(
      • 2015-10-22 29531, 2015

      • ruaok
        Lotheric: oh the news rolled in much before you said something. I was just away.
      • 2015-10-22 29535, 2015

      • ruaok
        zas: which logs?
      • 2015-10-22 29549, 2015

      • ruaok
        roo & dora were quiet yesterday
      • 2015-10-22 29551, 2015

      • zas
        tomcat's ones
      • 2015-10-22 29505, 2015

      • ruaok
        specifically, which file are you looking at, zas ?
      • 2015-10-22 29546, 2015

      • zas
        wait few secs, i found smt
      • 2015-10-22 29553, 2015

      • zas
        grep '22-Oct-2015 17:50' /var/log/tomcat/catalina.2015-10-22.log on dora
      • 2015-10-22 29513, 2015

      • zas
        org.apache.catalina.startup.VersionLoggerListener.log Server version: Apache Tomcat/8.0.20 ?
      • 2015-10-22 29519, 2015

      • zas
        server restart ?
      • 2015-10-22 29557, 2015

      • zas
        it is when it started
      • 2015-10-22 29534, 2015

      • ruaok
        I rebooted both servers yesterday.
      • 2015-10-22 29548, 2015

      • ruaok
        both of those servers have a history of acting up strangely at times.
      • 2015-10-22 29500, 2015

      • ruaok
        and either a java upgrade or a reboot has always fixed that issue.
      • 2015-10-22 29513, 2015

      • zas
        roobarb logs are differently configured :(
      • 2015-10-22 29535, 2015

      • ruaok
        one has tomcat 8.0.20 the other 8.0.28
      • 2015-10-22 29548, 2015

      • ruaok
        I was curious to see if they acted differently.
      • 2015-10-22 29554, 2015

      • LordSputnik has quit
      • 2015-10-22 29558, 2015

      • LordSputnik joined the channel
      • 2015-10-22 29520, 2015

      • LordSputnik has quit
      • 2015-10-22 29543, 2015

      • zas
        they didn't
      • 2015-10-22 29531, 2015

      • zas
        but the logs are different on dora and roobarb
      • 2015-10-22 29558, 2015

      • zas
        can you check dora log and confirm the server somehow restarted ?
      • 2015-10-22 29543, 2015

      • zas
        what is happening when one server is going down ? the other gets the whole load ?
      • 2015-10-22 29552, 2015

      • ruaok
        yep.
      • 2015-10-22 29557, 2015

      • ruaok
        nginx is very smart about that.
      • 2015-10-22 29516, 2015

      • jct- joined the channel
      • 2015-10-22 29538, 2015

      • ruaok
        on roobarb have a look at /tmp/jstack.9790.out
      • 2015-10-22 29520, 2015

      • ruaok
        so, cartman does remote restarts of the server.
      • 2015-10-22 29528, 2015

      • ruaok
        so, yes that is happening.
      • 2015-10-22 29548, 2015

      • ruaok
        cartman builds indexes, pushes to dora/roobarb, then restarts them
      • 2015-10-22 29532, 2015

      • zas
        did it happen at 17:50 UTC ?
      • 2015-10-22 29504, 2015

      • ruaok
        no
      • 2015-10-22 29519, 2015

      • ruaok
      • 2015-10-22 29534, 2015

      • ruaok
        from /etc/service/search-build/log/main/current
      • 2015-10-22 29505, 2015

      • ruaok
        the top of the stack trace for each proc seems to be:
      • 2015-10-22 29506, 2015

      • ruaok
        - sun.nio.ch.FileDispatcherImpl.write0(java.io.FileDescriptor, long, int) @bci=0 (Compiled frame; information may be imprecise)
      • 2015-10-22 29509, 2015

      • zas
        Oct 22 17:50:02 dora monit[1326]: 'tomcat' total mem amount of 12712944kB matches resource limit
      • 2015-10-22 29509, 2015

      • zas
        [total mem amount>12288046kB]
      • 2015-10-22 29509, 2015

      • zas
        Oct 22 17:50:02 dora monit[1326]: 'tomcat' trying to restart
      • 2015-10-22 29544, 2015

      • ruaok
        my bad. sorry, I never mentioned monit
      • 2015-10-22 29555, 2015

      • ruaok
        so the search servers leak memory like a sieve.
      • 2015-10-22 29501, 2015

      • ruaok
        we've never been able to find the source of it.
      • 2015-10-22 29507, 2015

      • zas
        java ;)
      • 2015-10-22 29521, 2015

      • ruaok
        so, we have monit installed. and monit kills the JVM when it gets to 40% of ram or so.
      • 2015-10-22 29524, 2015

      • ruaok
        yes, exactly.
      • 2015-10-22 29534, 2015

      • zas
        ok monit restarted dora tomcat, the load went on roobarb, and then ...
      • 2015-10-22 29553, 2015

      • ruaok
        oh -- is that the chain of events?
      • 2015-10-22 29554, 2015

      • zas
        i noticed we had free memory on dora
      • 2015-10-22 29501, 2015

      • ruaok
        that is where it started?
      • 2015-10-22 29506, 2015

      • zas
        yes
      • 2015-10-22 29518, 2015

      • ruaok
        ohh, a clue. :)
      • 2015-10-22 29524, 2015

      • zas
        imho the restart of dora search server caused the whole havoc
      • 2015-10-22 29529, 2015

      • zas
        let's check for yesterday
      • 2015-10-22 29558, 2015

      • zas
        it started at 17:56 yesterday
      • 2015-10-22 29544, 2015

      • zas
        UTC+2
      • 2015-10-22 29552, 2015

      • zas
        Oct 21 15:56:42 dora monit[1240]: 'tomcat' total mem amount of 12585720kB matches resource limit [total mem amount>12288046kB]
      • 2015-10-22 29553, 2015

      • zas
        Oct 21 15:56:42 dora monit[1240]: 'tomcat' trying to restart
      • 2015-10-22 29555, 2015

      • zas
        heh
      • 2015-10-22 29500, 2015

      • zas
        same cause same effect
      • 2015-10-22 29516, 2015

      • zas
        but why don't they recover ??
      • 2015-10-22 29531, 2015

      • ruaok
        they used to.
      • 2015-10-22 29537, 2015

      • ruaok
        something else must've changed.
      • 2015-10-22 29551, 2015

      • ruaok
        the only other thing that comes to mind is that we're out of bandwidith.
      • 2015-10-22 29502, 2015

      • zas
        not atm
      • 2015-10-22 29502, 2015

      • ruaok
        now waitaminute.
      • 2015-10-22 29522, 2015

      • ruaok
        backlog is 0 now.
      • 2015-10-22 29527, 2015

      • ruaok
        but not always.
      • 2015-10-22 29532, 2015

      • ruaok
        this suggest that we're just below the limit.
      • 2015-10-22 29536, 2015

      • ruaok
        +s
      • 2015-10-22 29542, 2015

      • ruaok
        even though search isn't working.
      • 2015-10-22 29549, 2015

      • ruaok
        where the hell is our bandwidth going?
      • 2015-10-22 29550, 2015

      • ariscop joined the channel
      • 2015-10-22 29515, 2015

      • zas
        ernie bandwidth is far under the limit
      • 2015-10-22 29538, 2015

      • ruaok
        the rate limiter is kicking in.
      • 2015-10-22 29551, 2015

      • ruaok
        on ernie: watch tc -s qdisc show
      • 2015-10-22 29501, 2015

      • ruaok
        how are you determining current bandwidth?
      • 2015-10-22 29535, 2015

      • zas
      • 2015-10-22 29544, 2015

      • zas
        select ernie
      • 2015-10-22 29526, 2015

      • ruaok
        MBps, so 10x mbits, right?
      • 2015-10-22 29543, 2015

      • ruaok
        well, that isn't right either.
      • 2015-10-22 29508, 2015

      • ruaok
        ok, let's turn off the rate limiter we have in place.
      • 2015-10-22 29524, 2015

      • ruaok
        and let things run.
      • 2015-10-22 29538, 2015

      • ruaok
        if everything returns to normal we know that this is a function of rate liminting.
      • 2015-10-22 29550, 2015

      • zas
        hmmm, how does the load balancer behave when both search servers are down ?
      • 2015-10-22 29518, 2015

      • ruaok
        throws a lot of errors, why?
      • 2015-10-22 29520, 2015

      • ruaok
        restart both at the same time?
      • 2015-10-22 29550, 2015

      • zas
        i don't get why they still eat so much cpu while we have much lower traffic atm
      • 2015-10-22 29504, 2015

      • ruaok
        because java?
      • 2015-10-22 29518, 2015

      • ruaok
        let's try removing the ratelimit on ernie.
      • 2015-10-22 29527, 2015

      • ruaok
        worst case, nothing happens.
      • 2015-10-22 29534, 2015

      • ruaok
        best case, services return to normal.
      • 2015-10-22 29538, 2015

      • zas
        root@dora:/var/log# netstat -natuop|grep '10.1.1.21:8080'|grep 'CLOSE'|wc -l
      • 2015-10-22 29538, 2015

      • zas
        1169
      • 2015-10-22 29552, 2015

      • zas
        that's a lot for "fast" network
      • 2015-10-22 29553, 2015

      • ruaok
        CLOSE_WAIT?
      • 2015-10-22 29557, 2015

      • zas
        yes
      • 2015-10-22 29504, 2015

      • ruaok
        interesting.
      • 2015-10-22 29522, 2015

      • zas
        i suspect the load balancer doing something weird
      • 2015-10-22 29524, 2015

      • ruaok
        remember when I said that the gateways were failing (carl/lenny) and that we had a huge backlog?
      • 2015-10-22 29534, 2015

      • zas
        yup
      • 2015-10-22 29539, 2015

      • ruaok
        because the gateways were dropping packets like mad.
      • 2015-10-22 29548, 2015

      • ruaok
        I see the rate limiter dropping packets.
      • 2015-10-22 29550, 2015

      • ruaok
        the pieces fit.
      • 2015-10-22 29540, 2015

      • zas
        did you try a nginx restart on ernie ?
      • 2015-10-22 29513, 2015

      • ariscop has quit
      • 2015-10-22 29514, 2015

      • ruaok
        not complete restart, no.
      • 2015-10-22 29520, 2015

      • ruaok
        hup, yes.
      • 2015-10-22 29520, 2015

      • ruaok
        feel free.
      • 2015-10-22 29515, 2015

      • ariscop joined the channel
      • 2015-10-22 29511, 2015

      • zas
        no impact
      • 2015-10-22 29534, 2015

      • ruaok
        rebooting ernie might have more impact.
      • 2015-10-22 29539, 2015

      • ruaok
        but we can't do that. :(
      • 2015-10-22 29543, 2015

      • ruaok
        well, we can, but...
      • 2015-10-22 29544, 2015

      • zas
        hmm
      • 2015-10-22 29502, 2015

      • zas
        not needed, the issue is on the search servers side
      • 2015-10-22 29523, 2015

      • ruaok
        I agree with your sentiment, the problem lies with the gateways.
      • 2015-10-22 29541, 2015

      • ruaok
        we've made zero changes to search; we see a ton of CLOSE_WAIT sockets.
      • 2015-10-22 29550, 2015

      • ruaok
        screams gateway, not search.
      • 2015-10-22 29536, 2015

      • ruaok
        ok, owner of DWNI responded. will forward my bandwidth query to their salesdroid.
      • 2015-10-22 29513, 2015

      • DjSlash
        (zas: fwiw, `lsof -n -iTCP:8080 -sTCP:CLOSE_WAIT|wc -l` would be faster)
      • 2015-10-22 29533, 2015

      • LordSput1 joined the channel
      • 2015-10-22 29504, 2015

      • LordSput1 has quit
      • 2015-10-22 29512, 2015

      • ruaok
        zas: "tc qdisc del dev em root"
      • 2015-10-22 29518, 2015

      • ruaok
        any objections?
      • 2015-10-22 29520, 2015

      • LordSputnik joined the channel
      • 2015-10-22 29558, 2015

      • zas
        nope; no better idea
      • 2015-10-22 29505, 2015

      • ruaok
        k
      • 2015-10-22 29552, 2015

      • ruaok
        well, that wasn't it. :(
      • 2015-10-22 29512, 2015

      • New__ has quit
      • 2015-10-22 29518, 2015

      • ruaok
        and the second you restart the search, the CLOSE_WAITs pile up. fast.
      • 2015-10-22 29512, 2015

      • zas
        we could stop the forwarding to the search servers for a while to see if they recover without new queries
      • 2015-10-22 29522, 2015

      • MajorLurker joined the channel
      • 2015-10-22 29536, 2015

      • ruaok
        try it. what do we have to lose? :)
      • 2015-10-22 29556, 2015

      • ruaok
        so, yes, they recovered, did I follow that right?