#musicbrainz-devel

2015-04-25

      • johtso_ joined the channel

      • ianmcorvidae
        best idea I have is kicking some musicbrainz-server instances

      • ruaok
        we did that.

      • ruaok
        no effect.

      • ruaok
        I think all my mucking about may have confused carl.

      • ruaok
        bad hardware + mucking == who knows.

      • ianmcorvidae
        this problem isn't actually *on* carl, as far as I can tell

      • ianmcorvidae
        it seems like we're running out of perl processes, or something

      • ruaok
        but half the world passes through carl.

      • ianmcorvidae
        yes, but the problem is nginx on $frontend connecting to a unix socket on the same machine
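
      [A way to see what nginx is doing here: it proxies requests to a local
      unix socket, so the connection can be tested directly. The socket path is
      an assumption, not taken from this log, and --unix-socket needs curl 7.40+:

          curl --unix-socket /tmp/musicbrainz-server.socket \
              'http://localhost/ws/2/release/?query=test&fmt=json'
      ]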

      • ruaok
        the most useful, and not very at that, error message that I've seen so far is this:

      • ruaok
        ws/2/release/?query=...artist:...&fmt=json", host: "musicbrainz.org"

      • ianmcorvidae
        yes, that's the issue

      • ruaok
        I've not looked at machines other than astro -- does this happen on pingu/asterix?

      • ianmcorvidae
        yes, same thing on pingu at least

      • ruaok
        da fuq?

      • ruaok
        how can a local unix domain socket start acting up on three servers at the same time?

      • ianmcorvidae
        enough requests that it's overloading the number of processes on all of them

      • ianmcorvidae
        but that's 150, across the three of them, so I dunno

      • ruaok
        so, the theory is that we have three servers in some bizarre stuck state.

      • ruaok
        we could take the site down and reboot all three of them

      • ruaok
        let me make sure I understand this...

      • ianmcorvidae
        I don't really know if I understand it, to be fair

      • ruaok
        this is nginx trying to make a unix domain socket call to mb-server and it fails, but only for search queries.

      • ianmcorvidae also notes I need to leave nearly right now

      • ianmcorvidae
        it's not only for searches

      • ianmcorvidae
        we just have so many more search requests that the others get lost

      • ianmcorvidae
        if you tail the error log and grep -v 'query' you'll see the others, same sort of thing

      • ruaok
        ok, will do.

      • ruaok
        bitmap: when we restarted the front end, what exactly did you do?

      • ruaok
        we may just need to repeat the exercise.

      • ruaok
        but this time stop both nginx and mb-server.

      • ianmcorvidae
        follow the release process, it's in syswiki

      • ruaok
        then remove the unix domain socket.

      • ianmcorvidae
        and that *should* involve restarting nginx as well, but

      • ruaok
        ah, we didn't fully do that. there was something amiss that was causing the git pull to hang.

      • ianmcorvidae
        well, the pull doesn't matter really

      • bitmap
        yeah, I just restarted the apps with svc -t /etc/service/musicbrainz-server

      • ruaok
        clearly, but it halted the process.

      • ruaok
        ah!

      • ruaok
        no nginx restart.

      • ruaok
        bitmap: let's do this process again.

      • bitmap
        ok

      • ianmcorvidae
        sudo -i, cd server-configs; ./provision.sh; sudo svc -t /etc/service/musicbrainz-server; tail -f /etc/service/musicbrainz-server/log/main/current until you see the "Binding to ..." etc. and then you can bring it back in
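
      [The recipe above, spelled out as commands run on one frontend; paths are
      as given, comments are a gloss:

          sudo -i
          cd server-configs
          ./provision.sh                            # re-render configs; should also restart nginx
          svc -t /etc/service/musicbrainz-server    # daemontools: TERM the app, supervise restarts it
          tail -f /etc/service/musicbrainz-server/log/main/current  # wait for "Binding to ..."
      ]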

      • ruaok
        this time use the usual process and watch that it does restart nginx

      • ianmcorvidae
        provision may do a git pull for mbserver, of course, but hopefully that doesn't keep failing

      • bitmap
        yeah, the provision.sh is what was hanging

      • ruaok
        it better not now. :)

      • ianmcorvidae
        if it's failing, a manual nginx restart is the main thing we wanted out of that anyway, but

      • bitmap nods

      • ianmcorvidae notes I did do a HUP to nginx on astro a bit ago, because I added a quick bit of logging for the search endpoint, which a provision would overwrite (but that's fine)

      • ruaok
        astro out

      • bitmap
        looks like ./provision.sh still hangs, I'll try svc -t

      • ianmcorvidae
        nginx is system-installed, not daemontools. /etc/init.d

      • bitmap
        ah, right

      • ianmcorvidae
        (and the musicbrainz-server svc -t isn't done by provision anyway, so that needs to be done separately in any case)
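
      [So the two restarts live in different places; spelled out:

          sudo /etc/init.d/nginx restart                 # nginx: system init script, not daemontools
          sudo svc -t /etc/service/musicbrainz-server    # the app: daemontools, not covered by provision
      ]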

      • bitmap
        ok, astro done

      • ruaok
        astro in pingu out

      • ruaok
        getting resource busy errors on astro. though not as fast as before

      • ianmcorvidae
        resource temporarily unavailable as before, or a new 'resource busy' error?

      • ruaok
        the former.
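
      [For reference, errno 11 is EAGAIN, "Resource temporarily unavailable":
      connect() to the local socket fails because its listen backlog is full.
      The errors can be watched in the log grepped later in this conversation:

          tail -f /var/log/nginx/001-musicbrainz.error.log \
              | grep 'Resource temporarily unavailable'
      ]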

      • ruaok
        as before.

      • ruaok
        1 every few seconds.

      • ianmcorvidae
        1 every few seconds is way better than before, anyway

      • ruaok
        I wonder if this started happening before today and my tinkering kicked it into high gear.

      • ruaok
        for sure.

      • bitmap
        pingu done

      • ruaok
        pingu in, asterix out

      • ianmcorvidae
        it has been happening some, this is what causes the instant 502s

      • ianmcorvidae
        and I'm out. I'll have my phone if I'm really urgently needed but I'm out basically the whole evening :( and then free of this particular thing until the fall, so it's not all bad :P

      • ruaok
        k

      • ruaok
        you around tomorrow?

      • bitmap
        asterix done

      • ruaok
        ok, should be coming back in.

      • ruaok
        and I just got three 502s in a row.

      • bitmap
        I'm still seeing the same errors in astro's logs

      • bitmap
        they haven't really slowed down

      • ruaok
        really? what log are you watching?

      • ruaok
        tail -f /var/log/nginx/001-musicbrainz.error.log | grep 502

      • bitmap
        same log without the grep 502

      • ruaok
        fuck.

      • CatCat joined the channel

      • ruaok reads http://serverfault.com/questions/398972/need-to-increase-nginx-throughput-to-an-upstream-unix-socket-linux-kernel-tun

      • ruaok
        but our loads are low right now

      • ruaok
        for obvious reasons.

      • ruaok
        ok, here is a thought, and it ties together with your postgres observation...

      • ruaok
        (and perhaps you meant this earlier)

      • ruaok
        but if the clients are somehow getting hung up on talking to postgres and everything gets backed up...

      • ruaok
        if no perl procs are available, it would throw this error.
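
      [The hypothesis, sketched:

          # client -> nginx -> unix socket -> perl workers -> pgbouncer -> postgres
          # if all perl workers block waiting on the database, the socket's accept
          # backlog fills up, and nginx's connect() fails instantly with EAGAIN --
          # hence the immediate 502s
      ]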

      • bitmap
        that's something I was thinking, but I wasn't sure which pgbouncer settings were tweaked that'd cause that

      • ruaok
        they shouldn't really.

      • ruaok
        but they should also not be running out of connections to postgres.

      • ruaok
        but that was the last thing that was changed in our config, so that is a valid concern.

      • ruaok
        let me go poke pgbouncer some more

      • ruaok
        the load on totoro is very low.

      • ruaok
        no disk usage.

      • ruaok
        bitmap: so this is starting to make sense.

      • ruaok
        we're allowing a backlog of 128 waiting query connections to pgbouncer.

      • ruaok
        and there are about ~120 or so waiting.
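
      [Those numbers come from pgbouncer's admin console; one way to read them,
      with the port and admin user as typical defaults rather than values
      confirmed here (cl_waiting is the "waiting" column):

          psql -p 6432 -U pgbouncer pgbouncer -c 'SHOW POOLS;'
      ]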

      • bitmap
        should we up the backlog then?

      • bitmap
        not sure if that was one of the settings that was changed

      • ruaok
        it wasn't, but perhaps it should be changed.

      • ruaok
        but if the server is not busy, it should be able to handle this.

      • bitmap
        yes, I would think so

      • ariscop joined the channel

      • bitmap
        there's not a lot of info on how to tune this

      • ruaok
        apparently the best info appears on the musicbrainz blog. :(

      • ruaok
        I can't tune the backlog without a restart of pgbouncer.
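
      [That is pgbouncer's listen_backlog setting; the listening socket is
      created at startup, so a reload doesn't pick the new value up. A sketch,
      with the init script path as an assumption:

          # pgbouncer.ini, [pgbouncer] section:
          #   listen_backlog = 256
          sudo /etc/init.d/pgbouncer restart    # required for the new backlog to apply
      ]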

      • bitmap
        I've no clue why we'd hit this all of a sudden, but I can't think of any other options at the moment

      • ruaok
        well, chirlu and I did some tweaking yesterday. I'm guessing this is the side-effect.

      • ruaok
        so, what happens if I restart pgbouncer?

      • ruaok
        without taking the site down. bad things?

      • bitmap
        perhaps we should tweak net.core.somaxconn too then, since that's also 128 and seems to be a system-wide limit
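
      [net.core.somaxconn caps the accept backlog of every listening socket on
      the machine, so a listen_backlog larger than it is silently clamped.
      Checking and doubling it:

          sysctl net.core.somaxconn               # kernel default at the time: 128
          sudo sysctl -w net.core.somaxconn=256   # runtime change; add to /etc/sysctl.conf to persist
      ]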

      • ruaok is utterly out of patience and coping

      • ruaok
        give it a shot.

      • ruaok
        but pgbouncer really looks like it's *full*

      • bitmap
        uh, I've never restarted pgbouncer on my own before

      • ruaok
        out of 63 connections, 57 are idle!!

      • ruaok
        I'm just going to restart it. screw it.

      • bitmap
        alright. :/ I doubled net.core.somaxconn to 256 on totoro

      • ruaok
        astro error log went quiet.

      • bitmap
        wow, yeah

      • ruaok
        pgbouncer looks much happier.

      • ruaok
        astro log is screaming again.

      • ruaok
        pgbouncer still looks good.

      • ruaok
        quiet again.

      • ruaok
        da hell?

      • bitmap
        yeah, I saw a burst of errors and then it stopped

      • Freso
        This channel is such a rollercoaster today!

      • ruaok
        be glad you're not on the rollercoaster.

      • ruaok
        ok, so this problem was unrelated to the gateway work.

      • ruaok
        go figger.

      • Leo_Verto
        Wee, everything is broken!

      • ruaok
        pgbouncer filled up again.

      • bitmap
        what'd you increase the backlog setting to?

      • ruaok
        256

      • ruaok
        though I am not convinced of the correlation.

      • ruaok
        I'm seeing lots of errors, but pgbouncer is fine

      • ruaok
        in fact it isn't getting even close to the limit now

      • ruaok
        in general, things are a lot better now

      • reosarevok
        From editing at least it seems that way, yup!

      • ruaok
        it's going back to shit, don't worry reosarevok

      • reosarevok
        :p

      • reosarevok
        We'll see

      • ruaok
        pretty much when the waiting clients get to 100, the errors start flying
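
      [One way to watch that threshold, reusing the admin-console query from
      above (interval is arbitrary):

          watch -n 5 "psql -p 6432 -U pgbouncer pgbouncer -c 'SHOW POOLS;'"
      ]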

      • ruaok
        what da fuq pgbouncer?

      • ruaok
        the db server is idle and bored. why you be making life so hard.

      • bitmap
        have you tried reverting the settings you changed recently?

      • ruaok
        I can do that next.