- 
    
        
    
                    kepstin search server is timing out again... 
2011-10-01 27456, 2011
    - 
    
        
    
                    kepstin rest of the site's being slow as well. just got a 502 bad gateway. 
2011-10-01 27404, 2011
    - 
    
        
    
        
        nikki prods ruaok 
2011-10-01 27417, 2011
    - 
    
        
    
        
        Leftmost joined the channel 
2011-10-01 27446, 2011
    - 
    
        
    
        
        ruaok returns 
2011-10-01 27400, 2011
    - 
    
        
    
                    nikki we have unhappy servers 
2011-10-01 27439, 2011
    - 
    
        
    
                    kepstin started right about 1/2 hour ago. 
2011-10-01 27420, 2011
    - 
    
        
    
                    ruaok next time things to bad, please txt me! 
2011-10-01 27434, 2011
    - 
    
        
    
                    ruaok things should be better now. 
2011-10-01 27400, 2011
    - 
    
        
    
        
        ocharles updates his number for ruaok 
2011-10-01 27406, 2011
    - 
    
        
    
                    ocharles last time I tried to text it didn't seem to go through 
2011-10-01 27417, 2011
    - 
    
        
    
                    ruaok cool, thx 
2011-10-01 27451, 2011
    - 
    
        
    
                    ocharles ah, the + has vanished 
2011-10-01 27402, 2011
    - 
    
        
    
                    ruaok hmm. 
2011-10-01 27410, 2011
    - 
    
        
    
                    ruaok so we know that the jvm isn't the problem 
2011-10-01 27426, 2011
    - 
    
        
    
                    ruaok it could be tomcat or the search server code. 
2011-10-01 27428, 2011
    - 
    
        
    
                    ruaok fuss 
2011-10-01 27439, 2011
    - 
    
        
    
                    ocharles what are you doing to "fix" this problem 
2011-10-01 27446, 2011
    - 
    
        
    
                    ocharles is it worth having an hourly server restart in cron? 
2011-10-01 27459, 2011
    - 
    
        
    
                    ruaok well, investigating the problem first of all. 
2011-10-01 27400, 2011
    - 
    
        
    
                    ocharles (akin to the constant restarting we have in mb_server) 
2011-10-01 27410, 2011
    - 
    
        
    
                    ruaok well, possibly. 
2011-10-01 27419, 2011
    - 
    
        
    
                    ruaok its clearly a case of something running out of filehandles. 
2011-10-01 27427, 2011
    - 
    
        
    
                    ruaok near 7000 open files things to belly up. 
2011-10-01 27436, 2011
    - 
    
        
    
                    ocharles lsof show you what's open? 
2011-10-01 27440, 2011
    - 
    
        
    
                    ruaok restarting tomcat works 
2011-10-01 27446, 2011
    - 
    
        
    
                    ocharles i think it's lsof anyway 
2011-10-01 27449, 2011
    - 
    
        
    
                    ruaok I've just been doing netstat | wc -l 
2011-10-01 27412, 2011
    - 
    
        
    
                    ruaok we know who has the files open. 
2011-10-01 27423, 2011
    - 
    
        
    
                    ruaok I'm mainly interested in finding out how many are open. 
2011-10-01 27425, 2011
    - 
    
        
    
                    ocharles ok, but what are the files? do more files open on every search request? 
2011-10-01 27435, 2011
    - 
    
        
    
                    ruaok and it not a clear case of file handles leaking straight out. 
2011-10-01 27447, 2011
    - 
    
        
    
                    ruaok the number goes up, then drops. 
2011-10-01 27451, 2011
    - 
    
        
    
                    ruaok but over time, it creeps upwards. 
2011-10-01 27402, 2011
    - 
    
        
    
                    ruaok it happens faster with only one server as with two. 
2011-10-01 27415, 2011
    - 
    
        
    
                    ocharles as in one server leaks faster than 2? 
2011-10-01 27427, 2011
    - 
    
        
    
                    ruaok and, I think, with our increased traffic we're hitting a problem that we didn't hit before. 
2011-10-01 27432, 2011
    - 
    
        
    
                    ruaok thus is coming around to bite us now. 
2011-10-01 27434, 2011
    - 
    
        
    
                    ruaok yes. 
2011-10-01 27443, 2011
    - 
    
        
    
                    ruaok I'm guessing that ever X requests our servers tip over. 
2011-10-01 27449, 2011
    - 
    
        
    
                    ocharles hmm 
2011-10-01 27454, 2011
    - 
    
        
    
                    ruaok X requests per server. 
2011-10-01 27403, 2011
    - 
    
        
    
                    ruaok so with one server you hit twice as fast as with two servers. 
2011-10-01 27423, 2011
    - 
    
        
    
                    ocharles it must be a X requests/N duration, not just X requests 
2011-10-01 27435, 2011
    - 
    
        
    
                    ruaok and restarting the server isn't as easy as it is with the mb_server. 
2011-10-01 27447, 2011
    - 
    
        
    
                    ruaok wait, I don't know that for sure. 
2011-10-01 27454, 2011
    - 
    
        
    
                    ruaok we do a restart on the server every 3 hours. 
2011-10-01 27403, 2011
    - 
    
        
    
                    ruaok but I am not sure a restart is enough to free the handles. 
2011-10-01 27408, 2011
    - 
    
        
    
                    ruaok it may need to be stop/start 
2011-10-01 27420, 2011
    - 
    
        
    
                    ruaok X/N might also make sense. 
2011-10-01 27426, 2011
    - 
    
        
    
                    ocharles do I have any access to these servers? just curious what lsof shows 
2011-10-01 27428, 2011
    - 
    
        
    
                    ruaok let me see if dora has calmed down now. 
2011-10-01 27457, 2011
    - 
    
        
    
                    ruaok 7235 
2011-10-01 27458, 2011
    - 
    
        
    
                    ruaok 100 
2011-10-01 27404, 2011
    - 
    
        
    
        
        ruaok hands ocharles a cookie 
2011-10-01 27412, 2011
    - 
    
        
    
                    ocharles what'd I do? :) 
2011-10-01 27418, 2011
    - 
    
        
    
                    ruaok the first one was when dora was tits up. 
2011-10-01 27428, 2011
    - 
    
        
    
                    ruaok the second 5 minutes after diverting traffic. 
2011-10-01 27449, 2011
    - 
    
        
    
                    ocharles and doing nothing? 
2011-10-01 27450, 2011
    - 
    
        
    
                    ruaok so, sockets are being kept open too long and eventually we run out of open file handles. 
2011-10-01 27455, 2011
    - 
    
        
    
                    ocharles yea 
2011-10-01 27404, 2011
    - 
    
        
    
                    ocharles that sounds like a reasonable hypothesis 
2011-10-01 27411, 2011
    - 
    
        
    
                    ruaok they are all in CLOSE_WAIT status 
2011-10-01 27413, 2011
    - 
    
        
    
                    ocharles or the garbage collection on sockets is not fast enough 
2011-10-01 27416, 2011
    - 
    
        
    
                    ocharles oh wait, so the problem is that the search server is running out of sockets, not file handles? 
2011-10-01 27425, 2011
    - 
    
        
    
                    ocharles so that would indicate that musicbrainz-server isn't closing it's connections properly 
2011-10-01 27448, 2011
    - 
    
        
    
                    ocharles the cool down after 5 minutes is very likely from the musicbrainz-server fastcgi processes being killed and forced to close sockets 
2011-10-01 27455, 2011
    - 
    
        
    
                    ruaok on astro, see /tmp 
2011-10-01 27405, 2011
    - 
    
        
    
                    ruaok lsof output for both dora and rhubarb. 
2011-10-01 27408, 2011
    - 
    
        
    
                    ruaok dora is idle. 
2011-10-01 27415, 2011
    - 
    
        
    
                    ruaok roobarb is live. 
2011-10-01 27451, 2011
    - 
    
        
    
                    ruaok sockets == filehandles. 
2011-10-01 27459, 2011
    - 
    
        
    
                    ruaok yes! that is a great observation. 
2011-10-01 27405, 2011
    - 
    
        
    
                    ruaok perhaps this is the fault of mb_server. 
2011-10-01 27437, 2011
    - 
    
        
    
                    ruaok well, we should see a high open socket count on the mb-server machines too, no? 
2011-10-01 27447, 2011
    - 
    
        
    
                    ocharles i know sockets == file handles, but I'm not sure if lsof shows them 
2011-10-01 27405, 2011
    - 
    
        
    
                    ocharles hmm, but doesn't astro talk to carl before dora? 
2011-10-01 27415, 2011
    - 
    
        
    
                    ruaok yes. 
2011-10-01 27424, 2011
    - 
    
        
    
                    ruaok nginx on carl is doing the load balancing. 
2011-10-01 27425, 2011
    - 
    
        
    
                    ocharles so, wouldn't carl have problems first? 
2011-10-01 27432, 2011
    - 
    
        
    
                    ruaok hard to say off hand. 
2011-10-01 27447, 2011
    - 
    
        
    
                    ruaok could be carl, could be carl propagating problems via mb_server 
2011-10-01 27434, 2011
    - 
    
        
    
                    ocharles ruaok: hrm, keep dora out 
2011-10-01 27450, 2011
    - 
    
        
    
                    ocharles maybe if I try and locally do some requests from my box to dora, we can see if the socket count goes up at all? 
2011-10-01 27451, 2011
    - 
    
        
    
                    ruaok rhubarb will suffer the same problem. 
2011-10-01 27459, 2011
    - 
    
        
    
                    ruaok ah. 
2011-10-01 27401, 2011
    - 
    
        
    
                    ruaok we can try. 
2011-10-01 27407, 2011
    - 
    
        
    
                    ruaok make the requests from hobbes 
2011-10-01 27413, 2011
    - 
    
        
    
                    ocharles ok 
2011-10-01 27415, 2011
    - 
    
        
    
                    ocharles which ip? 
2011-10-01 27435, 2011
    - 
    
        
    
                    ocharles dora.mb will work :) 
2011-10-01 27453, 2011
    - 
    
        
    
                    ocharles port though? 
2011-10-01 27438, 2011
    - 
    
        
    
                    ruaok 8080 
2011-10-01 27439, 2011
    - 
    
        
    
                    ocharles ok, 1 request a second atm 
2011-10-01 27407, 2011
    - 
    
        
    
                    ocharles erm, one sec :) 
2011-10-01 27456, 2011
    - 
    
        
    
                    ocharles ok, off 
2011-10-01 27410, 2011
    - 
    
        
    
                    ocharles this is using curl to do the requests from bash 
2011-10-01 27413, 2011
    - 
    
        
    
                    ocharles straight to dora 
2011-10-01 27439, 2011
    - 
    
        
    
                    ruaok was 30 open, now 19. 
2011-10-01 27443, 2011
    - 
    
        
    
                    ruaok no seeming effect. 
2011-10-01 27448, 2011
    - 
    
        
    
                    ocharles i'll up the rate then 
2011-10-01 27459, 2011
    - 
    
        
    
                    ruaok and next we should try going through the load balancer 
2011-10-01 27409, 2011
    - 
    
        
    
                    ocharles yea 
2011-10-01 27414, 2011
    - 
    
        
    
                    ocharles ok, twice as fast now 
2011-10-01 27426, 2011
    - 
    
        
    
                    ruaok well, the load balancer is going go to both, so thats not so good. 
2011-10-01 27439, 2011
    - 
    
        
    
                    ruaok 19 
2011-10-01 27444, 2011
    - 
    
        
    
                    ruaok up the rate much higher 
2011-10-01 27457, 2011
    - 
    
        
    
                    ocharles 10 requests a second now 
2011-10-01 27417, 2011
    - 
    
        
    
                    ocharles now 20 
2011-10-01 27434, 2011
    - 
    
        
    
                    ocharles anything interesting? 
2011-10-01 27413, 2011
    - 
    
        
    
                    ruaok 20 open handles. 
2011-10-01 27421, 2011
    - 
    
        
    
                    ruaok try 1000 requests per sec. 
2011-10-01 27423, 2011
    - 
    
        
    
                    ocharles ok 
2011-10-01 27425, 2011
    - 
    
        
    
                    ruaok :-) 
2011-10-01 27455, 2011
    - 
    
        
    
                    ruaok 30 
2011-10-01 27404, 2011
    - 
    
        
    
                    ocharles that look like it's actually getting a lot of traffic in the logs too? 
2011-10-01 27401, 2011
    - 
    
        
    
                    ruaok we don't actually record traffic on these servers. 
2011-10-01 27408, 2011
    - 
    
        
    
                    ocharles ah, ok 
2011-10-01 27413, 2011
    - 
    
        
    
                    ruaok 1.63 load. 
2011-10-01 27415, 2011
    - 
    
        
    
                    ruaok so doing somehting 
2011-10-01 27419, 2011
    - 
    
        
    
                    ruaok double it. :) 
2011-10-01 27423, 2011
    - 
    
        
    
                    ocharles 
2011-10-01 27431, 2011
    - 
    
        
    
                    ocharles load went up to 0.64 on hobbes just doing that 
2011-10-01 27435, 2011
    - 
    
        
    
                    ruaok lol 
2011-10-01 27446, 2011
    - 
    
        
    
                    ocharles next text can be through carl 
2011-10-01 27452, 2011
    - 
    
        
    
                    ocharles then we can test direct to dora with LWP 
2011-10-01 27455, 2011
    - 
    
        
    
                    ocharles and the through carl with LWP 
2011-10-01 27408, 2011
    - 
    
        
    
                    ruaok not bouncing up at all. 
2011-10-01 27414, 2011
    - 
    
        
    
                    ruaok but I wonder if we should leave it running longer. 
2011-10-01 27405, 2011
    - 
    
        
    
                    ocharles ruaok: I think a better option would be: while [ 1 -eq 1 ] do; netstat | wc -l >> sockets; sleep 30; done on dora 
2011-10-01 27409, 2011
    - 
    
        
    
                    ocharles and bring it back into production 
2011-10-01 27419, 2011
    - 
    
        
    
                    ocharles get a very rough idea of how that count changes over time 
2011-10-01 27402, 2011
    - 
    
        
    
                    ruaok yeah. 
2011-10-01 27412, 2011
    - 
    
        
    
                    ruaok want an account on dora? 
2011-10-01 27416, 2011
    - 
    
        
    
                    ocharles sure 
2011-10-01 27432, 2011
    - 
    
        
    
                    ruaok gimme a sec. 
2011-10-01 27438, 2011
    - 
    
        
    
                    ruaok mom is whipping up a storm of activity 
2011-10-01 27448, 2011
    - 
    
        
    
                    ruaok geez, she's not sick anymore 
2011-10-01 27451, 2011
    - 
    
        
    
                    ruaok driving me insane. 
2011-10-01 27454, 2011
    - 
    
        
    
        
        ruaok sighs 
2011-10-01 27431, 2011
    - 
    
        
    
                    ruaok ready! 
2011-10-01 27449, 2011
    - 
    
        
    
                    ocharles heh 
2011-10-01 27431, 2011
    - 
    
        
    
                    ruaok syntax error near done. :-( 
2011-10-01 27457, 2011
    - 
    
        
    
                    ruaok dora is back in. 
2011-10-01 27447, 2011
    - 
    
        
    
                    ocharles you get it running? 
2011-10-01 27437, 2011
    - 
    
        
    
                    ruaok no 
2011-10-01 27423, 2011
    - 
    
        
    
                    ruaok the count is climbing fast though. 
2011-10-01 27449, 2011
    - 
    
        
    
                    ocharles while [ 1 -eq 1 ]; do netstat | wc -l; sleep 1; done >> counter 
2011-10-01 27459, 2011
    - 
    
        
    
                    ocharles redirection in the wrong place, my bad 
2011-10-01 27453, 2011
    - 
    
        
    
                    ruaok I'm going to take dora out and let things calm down.