- 
    
        
    
                    
kepstin
                search server is timing out again...
     
      2011-10-01 27456, 2011
    - 
    
        
    
                    
kepstin
                rest of the site's being slow as well. just got a 502 bad gateway.
     
      2011-10-01 27404, 2011
    - 
    
        
    
        
        
nikki prods ruaok
     
      2011-10-01 27417, 2011
    - 
    
        
    
        
        
Leftmost joined the channel
     
      2011-10-01 27446, 2011
    - 
    
        
    
        
        
ruaok returns
     
      2011-10-01 27400, 2011
    - 
    
        
    
                    
nikki
                we have unhappy servers
     
      2011-10-01 27439, 2011
    - 
    
        
    
                    
kepstin
                started right about 1/2 hour ago.
     
      2011-10-01 27420, 2011
    - 
    
        
    
                    
ruaok
                next time things to bad, please txt me!
     
      2011-10-01 27434, 2011
    - 
    
        
    
                    
ruaok
                things should be better now.
     
      2011-10-01 27400, 2011
    - 
    
        
    
        
        
ocharles updates his number for ruaok
     
      2011-10-01 27406, 2011
    - 
    
        
    
                    
ocharles
                last time I tried to text it didn't seem to go through
     
      2011-10-01 27417, 2011
    - 
    
        
    
                    
ruaok
                cool, thx
     
      2011-10-01 27451, 2011
    - 
    
        
    
                    
ocharles
                ah, the + has vanished
     
      2011-10-01 27402, 2011
    - 
    
        
    
                    
ruaok
                hmm.
     
      2011-10-01 27410, 2011
    - 
    
        
    
                    
ruaok
                so we know that the jvm isn't the problem
     
      2011-10-01 27426, 2011
    - 
    
        
    
                    
ruaok
                it could be tomcat or the search server code.
     
      2011-10-01 27428, 2011
    - 
    
        
    
                    
ruaok
                fuss
     
      2011-10-01 27439, 2011
    - 
    
        
    
                    
ocharles
                what are you doing to "fix" this problem
     
      2011-10-01 27446, 2011
    - 
    
        
    
                    
ocharles
                is it worth having an hourly server restart in cron?
     
      2011-10-01 27459, 2011
    - 
    
        
    
                    
ruaok
                well, investigating the problem first of all.
     
      2011-10-01 27400, 2011
    - 
    
        
    
                    
ocharles
                (akin to the constant restarting we have in mb_server)
     
      2011-10-01 27410, 2011
    - 
    
        
    
                    
ruaok
                well, possibly.
     
      2011-10-01 27419, 2011
    - 
    
        
    
                    
ruaok
                its clearly a case of something running out of filehandles.
     
      2011-10-01 27427, 2011
    - 
    
        
    
                    
ruaok
                near 7000 open files things to belly up.
     
      2011-10-01 27436, 2011
    - 
    
        
    
                    
ocharles
                lsof show you what's open?
     
      2011-10-01 27440, 2011
    - 
    
        
    
                    
ruaok
                restarting tomcat works
     
      2011-10-01 27446, 2011
    - 
    
        
    
                    
ocharles
                i think it's lsof anyway
     
      2011-10-01 27449, 2011
    - 
    
        
    
                    
ruaok
                I've just been doing netstat | wc -l
     
      2011-10-01 27412, 2011
    - 
    
        
    
                    
ruaok
                we know who has the files open.
     
      2011-10-01 27423, 2011
    - 
    
        
    
                    
ruaok
                I'm mainly interested in finding out how many are open.
     
      2011-10-01 27425, 2011
    - 
    
        
    
                    
ocharles
                ok, but what are the files? do more files open on every search request?
     
      2011-10-01 27435, 2011
    - 
    
        
    
                    
ruaok
                and it not a clear case of file handles leaking straight out.
     
      2011-10-01 27447, 2011
    - 
    
        
    
                    
ruaok
                the number goes up, then drops.
     
      2011-10-01 27451, 2011
    - 
    
        
    
                    
ruaok
                but over time, it creeps upwards.
     
      2011-10-01 27402, 2011
    - 
    
        
    
                    
ruaok
                it happens faster with only one server as with two.
     
      2011-10-01 27415, 2011
    - 
    
        
    
                    
ocharles
                as in one server leaks faster than 2?
     
      2011-10-01 27427, 2011
    - 
    
        
    
                    
ruaok
                and, I think, with our increased traffic we're hitting a problem that we didn't hit before.
     
      2011-10-01 27432, 2011
    - 
    
        
    
                    
ruaok
                thus is coming around to bite us now.
     
      2011-10-01 27434, 2011
    - 
    
        
    
                    
ruaok
                yes.
     
      2011-10-01 27443, 2011
    - 
    
        
    
                    
ruaok
                I'm guessing that ever X requests our servers tip over.
     
      2011-10-01 27449, 2011
    - 
    
        
    
                    
ocharles
                hmm
     
      2011-10-01 27454, 2011
    - 
    
        
    
                    
ruaok
                X requests per server.
     
      2011-10-01 27403, 2011
    - 
    
        
    
                    
ruaok
                so with one server you hit twice as fast as with two servers.
     
      2011-10-01 27423, 2011
    - 
    
        
    
                    
ocharles
                it must be a X requests/N duration, not just X requests
     
      2011-10-01 27435, 2011
    - 
    
        
    
                    
ruaok
                and restarting the server isn't as easy as it is with the mb_server.
     
      2011-10-01 27447, 2011
    - 
    
        
    
                    
ruaok
                wait, I don't know that for sure.
     
      2011-10-01 27454, 2011
    - 
    
        
    
                    
ruaok
                we do a restart on the server every 3 hours.
     
      2011-10-01 27403, 2011
    - 
    
        
    
                    
ruaok
                but I am not sure a restart is enough to free the handles.
     
      2011-10-01 27408, 2011
    - 
    
        
    
                    
ruaok
                it may need to be stop/start
     
      2011-10-01 27420, 2011
    - 
    
        
    
                    
ruaok
                X/N might also make sense.
     
      2011-10-01 27426, 2011
    - 
    
        
    
                    
ocharles
                do I have any access to these servers? just curious what lsof shows
     
      2011-10-01 27428, 2011
    - 
    
        
    
                    
ruaok
                let me see if dora has calmed down now.
     
      2011-10-01 27457, 2011
    - 
    
        
    
                    
ruaok
                7235
     
      2011-10-01 27458, 2011
    - 
    
        
    
                    
ruaok
                100
     
      2011-10-01 27404, 2011
    - 
    
        
    
        
        
ruaok hands ocharles a cookie
     
      2011-10-01 27412, 2011
    - 
    
        
    
                    
ocharles
                what'd I do? :)
     
      2011-10-01 27418, 2011
    - 
    
        
    
                    
ruaok
                the first one was when dora was tits up.
     
      2011-10-01 27428, 2011
    - 
    
        
    
                    
ruaok
                the second 5 minutes after diverting traffic.
     
      2011-10-01 27449, 2011
    - 
    
        
    
                    
ocharles
                and doing nothing?
     
      2011-10-01 27450, 2011
    - 
    
        
    
                    
ruaok
                so, sockets are being kept open too long and eventually we run out of open file handles.
     
      2011-10-01 27455, 2011
    - 
    
        
    
                    
ocharles
                yea
     
      2011-10-01 27404, 2011
    - 
    
        
    
                    
ocharles
                that sounds like a reasonable hypothesis
     
      2011-10-01 27411, 2011
    - 
    
        
    
                    
ruaok
                they are all in CLOSE_WAIT status
     
      2011-10-01 27413, 2011
    - 
    
        
    
                    
ocharles
                or the garbage collection on sockets is not fast enough
     
      2011-10-01 27416, 2011
    - 
    
        
    
                    
ocharles
                oh wait, so the problem is that the search server is running out of sockets, not file handles?
     
      2011-10-01 27425, 2011
    - 
    
        
    
                    
ocharles
                so that would indicate that musicbrainz-server isn't closing it's connections properly
     
      2011-10-01 27448, 2011
    - 
    
        
    
                    
ocharles
                the cool down after 5 minutes is very likely from the musicbrainz-server fastcgi processes being killed and forced to close sockets
     
      2011-10-01 27455, 2011
    - 
    
        
    
                    
ruaok
                on astro, see /tmp
     
      2011-10-01 27405, 2011
    - 
    
        
    
                    
ruaok
                lsof output for both dora and rhubarb.
     
      2011-10-01 27408, 2011
    - 
    
        
    
                    
ruaok
                dora is idle.
     
      2011-10-01 27415, 2011
    - 
    
        
    
                    
ruaok
                roobarb is live.
     
      2011-10-01 27451, 2011
    - 
    
        
    
                    
ruaok
                sockets == filehandles.
     
      2011-10-01 27459, 2011
    - 
    
        
    
                    
ruaok
                yes! that is a great observation.
     
      2011-10-01 27405, 2011
    - 
    
        
    
                    
ruaok
                perhaps this is the fault of mb_server.
     
      2011-10-01 27437, 2011
    - 
    
        
    
                    
ruaok
                well, we should see a high open socket count on the mb-server machines too, no?
     
      2011-10-01 27447, 2011
    - 
    
        
    
                    
ocharles
                i know sockets == file handles, but I'm not sure if lsof shows them
     
      2011-10-01 27405, 2011
    - 
    
        
    
                    
ocharles
                hmm, but doesn't astro talk to carl before dora?
     
      2011-10-01 27415, 2011
    - 
    
        
    
                    
ruaok
                yes.
     
      2011-10-01 27424, 2011
    - 
    
        
    
                    
ruaok
                nginx on carl is doing the load balancing.
     
      2011-10-01 27425, 2011
    - 
    
        
    
                    
ocharles
                so, wouldn't carl have problems first?
     
      2011-10-01 27432, 2011
    - 
    
        
    
                    
ruaok
                hard to say off hand.
     
      2011-10-01 27447, 2011
    - 
    
        
    
                    
ruaok
                could be carl, could be carl propagating problems via mb_server
     
      2011-10-01 27434, 2011
    - 
    
        
    
                    
ocharles
                ruaok: hrm, keep dora out
     
      2011-10-01 27450, 2011
    - 
    
        
    
                    
ocharles
                maybe if I try and locally do some requests from my box to dora, we can see if the socket count goes up at all?
     
      2011-10-01 27451, 2011
    - 
    
        
    
                    
ruaok
                rhubarb will suffer the same problem.
     
      2011-10-01 27459, 2011
    - 
    
        
    
                    
ruaok
                ah.
     
      2011-10-01 27401, 2011
    - 
    
        
    
                    
ruaok
                we can try.
     
      2011-10-01 27407, 2011
    - 
    
        
    
                    
ruaok
                make the requests from hobbes
     
      2011-10-01 27413, 2011
    - 
    
        
    
                    
ocharles
                ok
     
      2011-10-01 27415, 2011
    - 
    
        
    
                    
ocharles
                which ip?
     
      2011-10-01 27435, 2011
    - 
    
        
    
                    
ocharles
                dora.mb will work :)
     
      2011-10-01 27453, 2011
    - 
    
        
    
                    
ocharles
                port though?
     
      2011-10-01 27438, 2011
    - 
    
        
    
                    
ruaok
                8080
     
      2011-10-01 27439, 2011
    - 
    
        
    
                    
ocharles
                ok, 1 request a second atm
     
      2011-10-01 27407, 2011
    - 
    
        
    
                    
ocharles
                erm, one sec :)
     
      2011-10-01 27456, 2011
    - 
    
        
    
                    
ocharles
                ok, off
     
      2011-10-01 27410, 2011
    - 
    
        
    
                    
ocharles
                this is using curl to do the requests from bash
     
      2011-10-01 27413, 2011
    - 
    
        
    
                    
ocharles
                straight to dora
     
      2011-10-01 27439, 2011
    - 
    
        
    
                    
ruaok
                was 30 open, now 19.
     
      2011-10-01 27443, 2011
    - 
    
        
    
                    
ruaok
                no seeming effect.
     
      2011-10-01 27448, 2011
    - 
    
        
    
                    
ocharles
                i'll up the rate then
     
      2011-10-01 27459, 2011
    - 
    
        
    
                    
ruaok
                and next we should try going through the load balancer
     
      2011-10-01 27409, 2011
    - 
    
        
    
                    
ocharles
                yea
     
      2011-10-01 27414, 2011
    - 
    
        
    
                    
ocharles
                ok, twice as fast now
     
      2011-10-01 27426, 2011
    - 
    
        
    
                    
ruaok
                well, the load balancer is going go to both, so thats not so good.
     
      2011-10-01 27439, 2011
    - 
    
        
    
                    
ruaok
                19
     
      2011-10-01 27444, 2011
    - 
    
        
    
                    
ruaok
                up the rate much higher
     
      2011-10-01 27457, 2011
    - 
    
        
    
                    
ocharles
                10 requests a second now
     
      2011-10-01 27417, 2011
    - 
    
        
    
                    
ocharles
                now 20
     
      2011-10-01 27434, 2011
    - 
    
        
    
                    
ocharles
                anything interesting?
     
      2011-10-01 27413, 2011
    - 
    
        
    
                    
ruaok
                20 open handles.
     
      2011-10-01 27421, 2011
    - 
    
        
    
                    
ruaok
                try 1000 requests per sec.
     
      2011-10-01 27423, 2011
    - 
    
        
    
                    
ocharles
                ok
     
      2011-10-01 27425, 2011
    - 
    
        
    
                    
ruaok
                :-)
     
      2011-10-01 27455, 2011
    - 
    
        
    
                    
ruaok
                30
     
      2011-10-01 27404, 2011
    - 
    
        
    
                    
ocharles
                that look like it's actually getting a lot of traffic in the logs too?
     
      2011-10-01 27401, 2011
    - 
    
        
    
                    
ruaok
                we don't actually record traffic on these servers.
     
      2011-10-01 27408, 2011
    - 
    
        
    
                    
ocharles
                ah, ok
     
      2011-10-01 27413, 2011
    - 
    
        
    
                    
ruaok
                1.63 load.
     
      2011-10-01 27415, 2011
    - 
    
        
    
                    
ruaok
                so doing somehting
     
      2011-10-01 27419, 2011
    - 
    
        
    
                    
ruaok
                double it. :)
     
      2011-10-01 27423, 2011
    - 
    
        
    
                    
ocharles
                
     
      2011-10-01 27431, 2011
    - 
    
        
    
                    
ocharles
                load went up to 0.64 on hobbes just doing that
     
      2011-10-01 27435, 2011
    - 
    
        
    
                    
ruaok
                lol
     
      2011-10-01 27446, 2011
    - 
    
        
    
                    
ocharles
                next text can be through carl
     
      2011-10-01 27452, 2011
    - 
    
        
    
                    
ocharles
                then we can test direct to dora with LWP
     
      2011-10-01 27455, 2011
    - 
    
        
    
                    
ocharles
                and the through carl with LWP
     
      2011-10-01 27408, 2011
    - 
    
        
    
                    
ruaok
                not bouncing up at all.
     
      2011-10-01 27414, 2011
    - 
    
        
    
                    
ruaok
                but I wonder if we should leave it running longer.
     
      2011-10-01 27405, 2011
    - 
    
        
    
                    
ocharles
                ruaok: I think a better option would be: while [ 1 -eq 1 ] do; netstat | wc -l >> sockets; sleep 30; done on dora
     
      2011-10-01 27409, 2011
    - 
    
        
    
                    
ocharles
                and bring it back into production
     
      2011-10-01 27419, 2011
    - 
    
        
    
                    
ocharles
                get a very rough idea of how that count changes over time
     
      2011-10-01 27402, 2011
    - 
    
        
    
                    
ruaok
                yeah.
     
      2011-10-01 27412, 2011
    - 
    
        
    
                    
ruaok
                want an account on dora?
     
      2011-10-01 27416, 2011
    - 
    
        
    
                    
ocharles
                sure
     
      2011-10-01 27432, 2011
    - 
    
        
    
                    
ruaok
                gimme a sec.
     
      2011-10-01 27438, 2011
    - 
    
        
    
                    
ruaok
                mom is whipping up a storm of activity
     
      2011-10-01 27448, 2011
    - 
    
        
    
                    
ruaok
                geez, she's not sick anymore
     
      2011-10-01 27451, 2011
    - 
    
        
    
                    
ruaok
                driving me insane.
     
      2011-10-01 27454, 2011
    - 
    
        
    
        
        
ruaok sighs
     
      2011-10-01 27431, 2011
    - 
    
        
    
                    
ruaok
                ready!
     
      2011-10-01 27449, 2011
    - 
    
        
    
                    
ocharles
                heh
     
      2011-10-01 27431, 2011
    - 
    
        
    
                    
ruaok
                syntax error near done. :-(
     
      2011-10-01 27457, 2011
    - 
    
        
    
                    
ruaok
                dora is back in.
     
      2011-10-01 27447, 2011
    - 
    
        
    
                    
ocharles
                you get it running?
     
      2011-10-01 27437, 2011
    - 
    
        
    
                    
ruaok
                no
     
      2011-10-01 27423, 2011
    - 
    
        
    
                    
ruaok
                the count is climbing fast though.
     
      2011-10-01 27449, 2011
    - 
    
        
    
                    
ocharles
                while [ 1 -eq 1 ]; do netstat | wc -l; sleep 1; done >> counter
     
      2011-10-01 27459, 2011
    - 
    
        
    
                    
ocharles
                redirection in the wrong place, my bad
     
      2011-10-01 27453, 2011
    - 
    
        
    
                    
ruaok
                I'm going to take dora out and let things calm down.