#musicbrainz-devel

/

      • kepstin
        search server is timing out again...
      • 2011-10-01 27456, 2011

      • kepstin
        rest of the site's being slow as well. just got a 502 bad gateway.
      • 2011-10-01 27404, 2011

      • nikki prods ruaok
      • 2011-10-01 27417, 2011

      • Leftmost joined the channel
      • 2011-10-01 27446, 2011

      • ruaok returns
      • 2011-10-01 27400, 2011

      • nikki
        we have unhappy servers
      • 2011-10-01 27439, 2011

      • kepstin
        started right about 1/2 hour ago.
      • 2011-10-01 27420, 2011

      • ruaok
        next time things to bad, please txt me!
      • 2011-10-01 27434, 2011

      • ruaok
        things should be better now.
      • 2011-10-01 27400, 2011

      • ocharles updates his number for ruaok
      • 2011-10-01 27406, 2011

      • ocharles
        last time I tried to text it didn't seem to go through
      • 2011-10-01 27417, 2011

      • ruaok
        cool, thx
      • 2011-10-01 27451, 2011

      • ocharles
        ah, the + has vanished
      • 2011-10-01 27402, 2011

      • ruaok
        hmm.
      • 2011-10-01 27410, 2011

      • ruaok
        so we know that the jvm isn't the problem
      • 2011-10-01 27426, 2011

      • ruaok
        it could be tomcat or the search server code.
      • 2011-10-01 27428, 2011

      • ruaok
        fuss
      • 2011-10-01 27439, 2011

      • ocharles
        what are you doing to "fix" this problem
      • 2011-10-01 27446, 2011

      • ocharles
        is it worth having an hourly server restart in cron?
      • 2011-10-01 27459, 2011

      • ruaok
        well, investigating the problem first of all.
      • 2011-10-01 27400, 2011

      • ocharles
        (akin to the constant restarting we have in mb_server)
      • 2011-10-01 27410, 2011

      • ruaok
        well, possibly.
      • 2011-10-01 27419, 2011

      • ruaok
        its clearly a case of something running out of filehandles.
      • 2011-10-01 27427, 2011

      • ruaok
        near 7000 open files things to belly up.
      • 2011-10-01 27436, 2011

      • ocharles
        lsof show you what's open?
      • 2011-10-01 27440, 2011

      • ruaok
        restarting tomcat works
      • 2011-10-01 27446, 2011

      • ocharles
        i think it's lsof anyway
      • 2011-10-01 27449, 2011

      • ruaok
        I've just been doing netstat | wc -l
      • 2011-10-01 27412, 2011

      • ruaok
        we know who has the files open.
      • 2011-10-01 27423, 2011

      • ruaok
        I'm mainly interested in finding out how many are open.
      • 2011-10-01 27425, 2011

      • ocharles
        ok, but what are the files? do more files open on every search request?
      • 2011-10-01 27435, 2011

      • ruaok
        and it not a clear case of file handles leaking straight out.
      • 2011-10-01 27447, 2011

      • ruaok
        the number goes up, then drops.
      • 2011-10-01 27451, 2011

      • ruaok
        but over time, it creeps upwards.
      • 2011-10-01 27402, 2011

      • ruaok
        it happens faster with only one server as with two.
      • 2011-10-01 27415, 2011

      • ocharles
        as in one server leaks faster than 2?
      • 2011-10-01 27427, 2011

      • ruaok
        and, I think, with our increased traffic we're hitting a problem that we didn't hit before.
      • 2011-10-01 27432, 2011

      • ruaok
        thus is coming around to bite us now.
      • 2011-10-01 27434, 2011

      • ruaok
        yes.
      • 2011-10-01 27443, 2011

      • ruaok
        I'm guessing that ever X requests our servers tip over.
      • 2011-10-01 27449, 2011

      • ocharles
        hmm
      • 2011-10-01 27454, 2011

      • ruaok
        X requests per server.
      • 2011-10-01 27403, 2011

      • ruaok
        so with one server you hit twice as fast as with two servers.
      • 2011-10-01 27423, 2011

      • ocharles
        it must be a X requests/N duration, not just X requests
      • 2011-10-01 27435, 2011

      • ruaok
        and restarting the server isn't as easy as it is with the mb_server.
      • 2011-10-01 27447, 2011

      • ruaok
        wait, I don't know that for sure.
      • 2011-10-01 27454, 2011

      • ruaok
        we do a restart on the server every 3 hours.
      • 2011-10-01 27403, 2011

      • ruaok
        but I am not sure a restart is enough to free the handles.
      • 2011-10-01 27408, 2011

      • ruaok
        it may need to be stop/start
      • 2011-10-01 27420, 2011

      • ruaok
        X/N might also make sense.
      • 2011-10-01 27426, 2011

      • ocharles
        do I have any access to these servers? just curious what lsof shows
      • 2011-10-01 27428, 2011

      • ruaok
        let me see if dora has calmed down now.
      • 2011-10-01 27457, 2011

      • ruaok
        7235
      • 2011-10-01 27458, 2011

      • ruaok
        100
      • 2011-10-01 27404, 2011

      • ruaok hands ocharles a cookie
      • 2011-10-01 27412, 2011

      • ocharles
        what'd I do? :)
      • 2011-10-01 27418, 2011

      • ruaok
        the first one was when dora was tits up.
      • 2011-10-01 27428, 2011

      • ruaok
        the second 5 minutes after diverting traffic.
      • 2011-10-01 27449, 2011

      • ocharles
        and doing nothing?
      • 2011-10-01 27450, 2011

      • ruaok
        so, sockets are being kept open too long and eventually we run out of open file handles.
      • 2011-10-01 27455, 2011

      • ocharles
        yea
      • 2011-10-01 27404, 2011

      • ocharles
        that sounds like a reasonable hypothesis
      • 2011-10-01 27411, 2011

      • ruaok
        they are all in CLOSE_WAIT status
      • 2011-10-01 27413, 2011

      • ocharles
        or the garbage collection on sockets is not fast enough
      • 2011-10-01 27416, 2011

      • ocharles
        oh wait, so the problem is that the search server is running out of sockets, not file handles?
      • 2011-10-01 27425, 2011

      • ocharles
        so that would indicate that musicbrainz-server isn't closing it's connections properly
      • 2011-10-01 27448, 2011

      • ocharles
        the cool down after 5 minutes is very likely from the musicbrainz-server fastcgi processes being killed and forced to close sockets
      • 2011-10-01 27455, 2011

      • ruaok
        on astro, see /tmp
      • 2011-10-01 27405, 2011

      • ruaok
        lsof output for both dora and rhubarb.
      • 2011-10-01 27408, 2011

      • ruaok
        dora is idle.
      • 2011-10-01 27415, 2011

      • ruaok
        roobarb is live.
      • 2011-10-01 27451, 2011

      • ruaok
        sockets == filehandles.
      • 2011-10-01 27459, 2011

      • ruaok
        yes! that is a great observation.
      • 2011-10-01 27405, 2011

      • ruaok
        perhaps this is the fault of mb_server.
      • 2011-10-01 27437, 2011

      • ruaok
        well, we should see a high open socket count on the mb-server machines too, no?
      • 2011-10-01 27447, 2011

      • ocharles
        i know sockets == file handles, but I'm not sure if lsof shows them
      • 2011-10-01 27405, 2011

      • ocharles
        hmm, but doesn't astro talk to carl before dora?
      • 2011-10-01 27415, 2011

      • ruaok
        yes.
      • 2011-10-01 27424, 2011

      • ruaok
        nginx on carl is doing the load balancing.
      • 2011-10-01 27425, 2011

      • ocharles
        so, wouldn't carl have problems first?
      • 2011-10-01 27432, 2011

      • ruaok
        hard to say off hand.
      • 2011-10-01 27447, 2011

      • ruaok
        could be carl, could be carl propagating problems via mb_server
      • 2011-10-01 27434, 2011

      • ocharles
        ruaok: hrm, keep dora out
      • 2011-10-01 27450, 2011

      • ocharles
        maybe if I try and locally do some requests from my box to dora, we can see if the socket count goes up at all?
      • 2011-10-01 27451, 2011

      • ruaok
        rhubarb will suffer the same problem.
      • 2011-10-01 27459, 2011

      • ruaok
        ah.
      • 2011-10-01 27401, 2011

      • ruaok
        we can try.
      • 2011-10-01 27407, 2011

      • ruaok
        make the requests from hobbes
      • 2011-10-01 27413, 2011

      • ocharles
        ok
      • 2011-10-01 27415, 2011

      • ocharles
        which ip?
      • 2011-10-01 27435, 2011

      • ocharles
        dora.mb will work :)
      • 2011-10-01 27453, 2011

      • ocharles
        port though?
      • 2011-10-01 27438, 2011

      • ruaok
        8080
      • 2011-10-01 27439, 2011

      • ocharles
        ok, 1 request a second atm
      • 2011-10-01 27407, 2011

      • ocharles
        erm, one sec :)
      • 2011-10-01 27456, 2011

      • ocharles
        ok, off
      • 2011-10-01 27410, 2011

      • ocharles
        this is using curl to do the requests from bash
      • 2011-10-01 27413, 2011

      • ocharles
        straight to dora
      • 2011-10-01 27439, 2011

      • ruaok
        was 30 open, now 19.
      • 2011-10-01 27443, 2011

      • ruaok
        no seeming effect.
      • 2011-10-01 27448, 2011

      • ocharles
        i'll up the rate then
      • 2011-10-01 27459, 2011

      • ruaok
        and next we should try going through the load balancer
      • 2011-10-01 27409, 2011

      • ocharles
        yea
      • 2011-10-01 27414, 2011

      • ocharles
        ok, twice as fast now
      • 2011-10-01 27426, 2011

      • ruaok
        well, the load balancer is going go to both, so thats not so good.
      • 2011-10-01 27439, 2011

      • ruaok
        19
      • 2011-10-01 27444, 2011

      • ruaok
        up the rate much higher
      • 2011-10-01 27457, 2011

      • ocharles
        10 requests a second now
      • 2011-10-01 27417, 2011

      • ocharles
        now 20
      • 2011-10-01 27434, 2011

      • ocharles
        anything interesting?
      • 2011-10-01 27413, 2011

      • ruaok
        20 open handles.
      • 2011-10-01 27421, 2011

      • ruaok
        try 1000 requests per sec.
      • 2011-10-01 27423, 2011

      • ocharles
        ok
      • 2011-10-01 27425, 2011

      • ruaok
        :-)
      • 2011-10-01 27455, 2011

      • ruaok
        30
      • 2011-10-01 27404, 2011

      • ocharles
        that look like it's actually getting a lot of traffic in the logs too?
      • 2011-10-01 27401, 2011

      • ruaok
        we don't actually record traffic on these servers.
      • 2011-10-01 27408, 2011

      • ocharles
        ah, ok
      • 2011-10-01 27413, 2011

      • ruaok
        1.63 load.
      • 2011-10-01 27415, 2011

      • ruaok
        so doing somehting
      • 2011-10-01 27419, 2011

      • ruaok
        double it. :)
      • 2011-10-01 27423, 2011

      • ocharles
        "while [ 1 -eq 1 ]; do curl -s http://dora.mb:8080/ws/2/artist/?query=Hello > /dev/null& sleep 0.001; done" is my test :)
      • 2011-10-01 27431, 2011

      • ocharles
        load went up to 0.64 on hobbes just doing that
      • 2011-10-01 27435, 2011

      • ruaok
        lol
      • 2011-10-01 27446, 2011

      • ocharles
        next text can be through carl
      • 2011-10-01 27452, 2011

      • ocharles
        then we can test direct to dora with LWP
      • 2011-10-01 27455, 2011

      • ocharles
        and the through carl with LWP
      • 2011-10-01 27408, 2011

      • ruaok
        not bouncing up at all.
      • 2011-10-01 27414, 2011

      • ruaok
        but I wonder if we should leave it running longer.
      • 2011-10-01 27405, 2011

      • ocharles
        ruaok: I think a better option would be: while [ 1 -eq 1 ] do; netstat | wc -l >> sockets; sleep 30; done on dora
      • 2011-10-01 27409, 2011

      • ocharles
        and bring it back into production
      • 2011-10-01 27419, 2011

      • ocharles
        get a very rough idea of how that count changes over time
      • 2011-10-01 27402, 2011

      • ruaok
        yeah.
      • 2011-10-01 27412, 2011

      • ruaok
        want an account on dora?
      • 2011-10-01 27416, 2011

      • ocharles
        sure
      • 2011-10-01 27432, 2011

      • ruaok
        gimme a sec.
      • 2011-10-01 27438, 2011

      • ruaok
        mom is whipping up a storm of activity
      • 2011-10-01 27448, 2011

      • ruaok
        geez, she's not sick anymore
      • 2011-10-01 27451, 2011

      • ruaok
        driving me insane.
      • 2011-10-01 27454, 2011

      • ruaok sighs
      • 2011-10-01 27431, 2011

      • ruaok
        ready!
      • 2011-10-01 27449, 2011

      • ocharles
        heh
      • 2011-10-01 27431, 2011

      • ruaok
        syntax error near done. :-(
      • 2011-10-01 27457, 2011

      • ruaok
        dora is back in.
      • 2011-10-01 27447, 2011

      • ocharles
        you get it running?
      • 2011-10-01 27437, 2011

      • ruaok
        no
      • 2011-10-01 27423, 2011

      • ruaok
        the count is climbing fast though.
      • 2011-10-01 27449, 2011

      • ocharles
        while [ 1 -eq 1 ]; do netstat | wc -l; sleep 1; done >> counter
      • 2011-10-01 27459, 2011

      • ocharles
        redirection in the wrong place, my bad
      • 2011-10-01 27453, 2011

      • ruaok
        I'm going to take dora out and let things calm down.