#musicbrainz-devel

/

      • djce has set a reminder to ask him on Monday morning
      • 2011-10-01 27455, 2011

      • ruaok
        thanks!
      • 2011-10-01 27443, 2011

      • ruaok wonders if this could be a keep alive problem?
      • 2011-10-01 27450, 2011

      • djce
        you know dora/roobarb are still running the /usr/local jvm, right?
      • 2011-10-01 27455, 2011

      • ruaok
        yes.
      • 2011-10-01 27417, 2011

      • ruaok
        I only moved the jvm from /usr/lib so that I was *sure* that it was using the new jre
      • 2011-10-01 27433, 2011

      • djce
        not sure about keepalive. Might have to check next time we catch the problem in progress.
      • 2011-10-01 27433, 2011

      • ruaok
        I will likely switch back to openjdk.
      • 2011-10-01 27401, 2011

      • djce
        I suspect it's not keepalive, since the sockets are closing, not ESTABLISHED
      • 2011-10-01 27404, 2011

      • ruaok
        I've set the keep alive on dora from the default 100 to 10.
      • 2011-10-01 27423, 2011

      • djce
        what setting is that?
      • 2011-10-01 27451, 2011

      • ruaok
        maxKeepAliveRequests="10"
      • 2011-10-01 27402, 2011

      • ruaok
        in server.xml in the HTTP connector section.
      • 2011-10-01 27406, 2011

      • djce
        ok
      • 2011-10-01 27428, 2011

      • djce
        I don't think nginx (which is the client) uses keepalive in client mode.
      • 2011-10-01 27435, 2011

      • ruaok
        I suspect that you're right.
      • 2011-10-01 27439, 2011

      • djce
        might be better if it did in fact.
      • 2011-10-01 27448, 2011

      • djce
        'cos there'd be less opening/closing going on.
      • 2011-10-01 27453, 2011

      • ruaok
        the growth seem to bot have changed.
      • 2011-10-01 27421, 2011

      • ruaok
        yeah, we've got a simple script that connects to dora directly and makes a shitload of requests.
      • 2011-10-01 27432, 2011

      • ruaok
        without going through nginx and the problem persists.
      • 2011-10-01 27451, 2011

      • djce
        shitload as in open/request/close, or open/req/req/req/req/req/.../close?
      • 2011-10-01 27435, 2011

      • ruaok
        the former
      • 2011-10-01 27441, 2011

      • ruaok
        simple LWP::UserAgent calls
      • 2011-10-01 27401, 2011

      • ruaok
        or even wgets from forked processes.
      • 2011-10-01 27442, 2011

      • ruaok
        3892 sockets in TIME_WAIT and 8 in ESTABLISHED.
      • 2011-10-01 27424, 2011

      • djce
        from what host?
      • 2011-10-01 27458, 2011

      • ruaok
        all from carl
      • 2011-10-01 27401, 2011

      • ruaok
        nxing
      • 2011-10-01 27423, 2011

      • ruaok
      • 2011-10-01 27424, 2011

      • ruaok
        hmmm.
      • 2011-10-01 27435, 2011

      • ruaok
        TIME_WAIT is a client state.
      • 2011-10-01 27444, 2011

      • ruaok
        not a server state.
      • 2011-10-01 27413, 2011

      • djce
        tcp doesn't have clients and servers.
      • 2011-10-01 27459, 2011

      • djce
        but I take the point.
      • 2011-10-01 27405, 2011

      • ruaok
        well, see the diagram in the link I posted.
      • 2011-10-01 27423, 2011

      • ruaok
        in this case ngxinx/wget would be the "client", no?
      • 2011-10-01 27435, 2011

      • djce
        yup
      • 2011-10-01 27407, 2011

      • ruaok
        why then does tomcat have connections from carl that are TIME_WAIT
      • 2011-10-01 27408, 2011

      • ruaok
        ?
      • 2011-10-01 27422, 2011

      • djce
        right now?
      • 2011-10-01 27423, 2011

      • ruaok
        if it never made the connection to carl? carl made the connection to it.
      • 2011-10-01 27425, 2011

      • ruaok
        yeah
      • 2011-10-01 27430, 2011

      • djce
        on which host?
      • 2011-10-01 27432, 2011

      • djce
        I don't see 'em
      • 2011-10-01 27440, 2011

      • ruaok
        see netstat on dora.
      • 2011-10-01 27448, 2011

      • ruaok
        am I reading that wrong?
      • 2011-10-01 27424, 2011

      • djce
        no, you're right
      • 2011-10-01 27447, 2011

      • djce
        (I was only looking at sockets owned by processes; those ones aren't though)
      • 2011-10-01 27447, 2011

      • ruaok
        then there is something I clearly don't understand about all this. :-)
      • 2011-10-01 27407, 2011

      • ruaok
        lsof gives sane output, yes.
      • 2011-10-01 27410, 2011

      • ruaok
        netstat does not.
      • 2011-10-01 27435, 2011

      • djce
        I think TIME WAIT applies to whichever peer closed the connection first.
      • 2011-10-01 27439, 2011

      • ruaok
      • 2011-10-01 27457, 2011

      • ruaok
        the diagram here suggests that there is a timeout period for TIME_WAIT
      • 2011-10-01 27424, 2011

      • djce
        there is, 2 x MSL (Maximum Segment Lifetime)
      • 2011-10-01 27428, 2011

      • ruaok nods
      • 2011-10-01 27431, 2011

      • ruaok
        4 minutes.
      • 2011-10-01 27443, 2011

      • ruaok
        but, for normal operations, it should not have to time out.
      • 2011-10-01 27447, 2011

      • djce
        but I don't think that TIME WAIT sockets not owned by processes are a problem.
      • 2011-10-01 27459, 2011

      • djce
        Our problem is sockets that /are/ owned by processes.
      • 2011-10-01 27411, 2011

      • djce
        specifically, the number of them, exceeding some limit.
      • 2011-10-01 27449, 2011

      • ruaok
        but there are only a few ESTABLISHED sockets.
      • 2011-10-01 27452, 2011

      • ruaok
        like 8
      • 2011-10-01 27439, 2011

      • ruaok
        and we know that when tomcat complains about being out of file handles, that count of TIME_WAIT sockets is *very* high.
      • 2011-10-01 27442, 2011

      • ruaok
        like 8k or so.
      • 2011-10-01 27412, 2011

      • djce
        I think that's a red herring. Unless those sockets are owned by tomcat.
      • 2011-10-01 27435, 2011

      • djce
        I think, just ignore netstat, and observe lsof (or /proc/NNNN/fd)
      • 2011-10-01 27408, 2011

      • djce
        so right now, all looks happy.
      • 2011-10-01 27426, 2011

      • djce
        later on, you'll see "lsof -c java -a -i -n -P" (for example) show 000s of sockets.
      • 2011-10-01 27436, 2011

      • djce
        /that/ is what we need to debug, AFAIK.
      • 2011-10-01 27424, 2011

      • ruaok
        how can we monitor this?
      • 2011-10-01 27438, 2011

      • ruaok
        maybe run that command every few second?
      • 2011-10-01 27448, 2011

      • ruaok
        and if the output is more than 20 lines, save it?
      • 2011-10-01 27452, 2011

      • ruaok
        otherwise overwrite?
      • 2011-10-01 27428, 2011

      • ruaok
        then we would have timestamps as to when it happens and what the output is.
      • 2011-10-01 27453, 2011

      • djce
        /var/log/tomcat6/catalina.out already gives us timestamps
      • 2011-10-01 27409, 2011

      • djce
        and we could watch that log to tell us when it's all gone wrong.
      • 2011-10-01 27442, 2011

      • ruaok
        true that.
      • 2011-10-01 27431, 2011

      • ruaok
        I'm wondering if it's correlated with index updates.
      • 2011-10-01 27448, 2011

      • ruaok
        like copying the files over. or rotating them into place. or restarting tomcat.
      • 2011-10-01 27403, 2011

      • djce
        I didn't see any obvious correlation last time I checked
      • 2011-10-01 27402, 2011

      • ruaok
        hmm.
      • 2011-10-01 27410, 2011

      • ruaok
        well I dropped the global limit down to 2000.
      • 2011-10-01 27419, 2011

      • ruaok
        so the problem is unlikely to return.
      • 2011-10-01 27427, 2011

      • ruaok
        but that is far from an acceptable solution.
      • 2011-10-01 27443, 2011

      • djce
        /un/likely?
      • 2011-10-01 27451, 2011

      • ruaok
        yes.
      • 2011-10-01 27401, 2011

      • ruaok
        this morning at 7am ocharles called me.
      • 2011-10-01 27408, 2011

      • djce
        why so? lowering limits sounds like it makes it /more/ likely to me
      • 2011-10-01 27412, 2011

      • ruaok
        I dropped the limit from 2500 to 2000 and that made everything happy.
      • 2011-10-01 27431, 2011

      • djce
        oh, ws global ratelimit?
      • 2011-10-01 27434, 2011

      • ruaok
        well, we're handling fewer requests/
      • 2011-10-01 27435, 2011

      • ruaok
        yes.
      • 2011-10-01 27437, 2011

      • djce
        I see
      • 2011-10-01 27455, 2011

      • ruaok
        sorry for not making that clear. :)
      • 2011-10-01 27453, 2011

      • ruaok
        do you have saved output from lsof when things were bad?
      • 2011-10-01 27402, 2011

      • ruaok
        did we have 0000s of ESTABLISHED sockets?
      • 2011-10-01 27412, 2011

      • djce
        sorry, not saved. nor remembered.
      • 2011-10-01 27414, 2011

      • djce
        :-(
      • 2011-10-01 27423, 2011

      • ruaok
        ok.
      • 2011-10-01 27432, 2011

      • ruaok
        I think mum wants to go run around for a bit.
      • 2011-10-01 27434, 2011

      • ruaok
        I'll do that.
      • 2011-10-01 27447, 2011

      • ruaok
        once I get back, I'll up the request limit and watch for things to go wrong.
      • 2011-10-01 27449, 2011

      • djce
        ttfn
      • 2011-10-01 27454, 2011

      • ruaok
        then I'll investigate a bit.
      • 2011-10-01 27402, 2011

      • ruaok
        save some lsofs and netstats
      • 2011-10-01 27406, 2011

      • djce
        I might still be here. Maybe.
      • 2011-10-01 27421, 2011

      • ruaok
        ok, probably won't be too long.
      • 2011-10-01 27424, 2011

      • nikki
        bitmap: ping
      • 2011-10-01 27432, 2011

      • bitmap
        pong
      • 2011-10-01 27400, 2011

      • nikki
      • 2011-10-01 27449, 2011

      • nikki
        (btw, do you want me to make you the owner of the picard component thing in trac too?)
      • 2011-10-01 27424, 2011

      • bitmap
        hm, no, I hadn't seen that…
      • 2011-10-01 27446, 2011

      • bitmap
        that doesn't really make sense since it should be using the metadata in the folder name
      • 2011-10-01 27410, 2011

      • bitmap
        but I'll look into it.
      • 2011-10-01 27454, 2011

      • bitmap
        nikki: what can the owner of the component do?
      • 2011-10-01 27409, 2011

      • nikki
        well I think you'll probably get emails then
      • 2011-10-01 27421, 2011

      • bitmap
        ah, yeah I'd prefer that
      • 2011-10-01 27450, 2011

      • bitmap
        since I haven't really looked at trac at all lately
      • 2011-10-01 27427, 2011

      • nikki
        there, done
      • 2011-10-01 27459, 2011

      • bitmap
        great, thanks :)
      • 2011-10-01 27438, 2011

      • Leftmost joined the channel
      • 2011-10-01 27416, 2011

      • bitmap
        nikki: oh, my memory is so terrible that I forgot that issue was already fixed: http://tickets.musicbrainz.org/browse/PICARD-1 (the folder name uses albumartist in the default string, at least)
      • 2011-10-01 27417, 2011

      • nikki
        oh, right :)
      • 2011-10-01 27440, 2011

      • reosarevok joined the channel
      • 2011-10-01 27453, 2011

      • ruaok
        djce: still here?
      • 2011-10-01 27448, 2011

      • djce
        ruaok: pong
      • 2011-10-01 27438, 2011

      • ruaok
        the next thing I want to try to replace tomcat with something else.
      • 2011-10-01 27442, 2011

      • ruaok
        like jboss.
      • 2011-10-01 27436, 2011

      • djce
        interesting.
      • 2011-10-01 27438, 2011

      • ruaok
        we've ruled out the jvm. and the search code doesn't seem to leak file handles -- they get cleaned up.
      • 2011-10-01 27443, 2011

      • ruaok
        it might just be a tomcat bug.
      • 2011-10-01 27446, 2011

      • ruaok
        a high load one.
      • 2011-10-01 27402, 2011

      • djce
        heading off to bed now but will be around tomorrow for more of this
      • 2011-10-01 27407, 2011

      • ruaok
        ok.
      • 2011-10-01 27407, 2011

      • djce
        :-) zzzzzz
      • 2011-10-01 27414, 2011

      • ruaok
        I'll play with it a bit
      • 2011-10-01 27415, 2011

      • ruaok
        nn
      • 2011-10-01 27422, 2011

      • ruaok
        (once my mom lets up on the chores. sigh.)
      • 2011-10-01 27428, 2011

      • djce
        ttfn. keep me posted if anything exciting comes up.
      • 2011-10-01 27434, 2011

      • ruaok
        will do.