#musicbrainz-devel


      • johtso_ joined the channel
      • ianmcorvidae
        best idea I have is kicking some musicbrainz-server instances
      • ruaok
        we did that.
      • no effect.
      • I think all my mucking about may have confused carl.
      • bad hardware + mucking == who knows.
      • ianmcorvidae
        this problem isn't actually *on* carl, as far as I can tell
      • it seems like we're running out of perl processes, or something
      • ruaok
        but half the world passes through carl.
      • ianmcorvidae
        yes, but the problem is nginx on $frontend connecting to a unix socket on the same machine
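        (For context, a minimal sketch of the kind of local check this points at — the socket path here is an assumption, not taken from the real config:)

            # does the local mb-server unix socket exist, and how many connections are parked on it?
            ls -l /tmp/musicbrainz-server.socket
            ss -x | grep -c musicbrainz-server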
      • ruaok
        the most useful, and not very at that, error message that I've seen so far is this:
      • ws/2/release/?query=...+artist:...&fmt=json", host: "musicbrainz.org"
      • ianmcorvidae
        yes, that's the issue
      • ruaok
        I've not looked at machines other than astro -- does this happen on pingu/asterix?
      • ianmcorvidae
        yes, same thing on pingu at least
      • ruaok
        da fuq?
      • how can a local unix domain socket start acting up on three servers at the same time?
      • ianmcorvidae
        enough requests that it's overloading the number of processes on all of them
      • but that's 150, across the three of them, so I dunno
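        (A rough way to count the worker pool on one frontend — the '[p]lackup' process name is an assumption about how the Plack workers show up in ps:)

            # count perl/plack worker processes on this box
            ps aux | grep -c '[p]lackup'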
      • ruaok
      • so, the theory is that we have three servers stuck in some bizarre state.
      • we could take the site down and reboot all three of them
      • let me make sure I understand this...
      • ianmcorvidae
        I don't really know if I understand it, to be fair
      • ruaok
        this is nginx trying to make a unix domain socket call to mb-server and it fails, but only for search queries.
      • ianmcorvidae also notes I need to leave nearly right now
      • ianmcorvidae
        it's not only for searches
      • we just have so many more search requests that the others get lost
      • if you tail the error log and grep -v 'query' you'll see the others, same sort of thing
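        (i.e. something along these lines, using the error-log path mentioned further down:)

            # watch the non-search failures only
            tail -f /var/log/nginx/001-musicbrainz.error.log | grep -v query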
      • ruaok
        ok, will do.
      • bitmap: when we restarted the front end, what exactly did you do?
      • we may just need to repeat the exercise.
      • but this time stop both nginx and mb-server.
      • ianmcorvidae
        follow the release process, it's in syswiki
      • ruaok
        then remove the unix domain socket.
      • ianmcorvidae
        and that *should* involve restarting nginx as well, but
      • ruaok
        ah, we didn't fully do that. there was something amiss that was causing the git pull to hang.
      • ianmcorvidae
        well, the pull doesn't matter really
      • bitmap
        yeah, I just restarted the apps with svc -t /etc/service/musicbrainz-server
      • ruaok
        clearly, but it halted the process.
      • ah!
      • no nginx restart.
      • bitmap: lets do this process again.
      • bitmap
        ok
      • ianmcorvidae
        sudo -i, cd server-configs; ./provision.sh; sudo svc -t /etc/service/musicbrainz-server; tail -f /etc/service/musicbrainz-server/log/main/current until you see the "Binding to ..." etc. and then you can bring it back in
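        (That sequence, laid out one step per line, with the paths exactly as given above:)

            sudo -i
            cd server-configs
            ./provision.sh
            sudo svc -t /etc/service/musicbrainz-server
            # wait for "Binding to ..." in the log before bringing the host back in
            tail -f /etc/service/musicbrainz-server/log/main/current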
      • ruaok
        this time use the usual process and watch that it does restart nginx
      • ianmcorvidae
        provision may do a git pull for mbserver, of course, but hopefully that doesn't keep failing
      • bitmap
        yeah, the provision.sh is what was hanging
      • ruaok
        it better not now. :)
      • ianmcorvidae
        if it's failing, a manual nginx restart is the main thing we wanted out of that anyway, but
      • bitmap nods
      • ianmcorvidae notes I did do a HUP to nginx on astro a bit ago, because I added a quick bit of logging for the search endpoint, which a provision would overwrite (but that's fine)
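        (One common way to send that HUP — the pid-file path is the usual Debian default and is an assumption here:)

            sudo kill -HUP "$(cat /var/run/nginx.pid)"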
      • ruaok
        astro out
      • bitmap
        looks like ./provision.sh still hangs, I'll try svc -t
      • ianmcorvidae
        nginx is system-installed, not daemontools. /etc/init.d
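        (i.e. restart it through the init script rather than daemontools:)

            sudo /etc/init.d/nginx restart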
      • bitmap
        ah, right
      • ianmcorvidae
        (and the musicbrainz-server svc -t isn't done by provision anyway, so that needs to be done separately in any case)
      • bitmap
        ok, astro done
      • ruaok
        astro in, pingu out
      • getting resource busy errors on astro. though not as fast as before
      • ianmcorvidae
        resource temporarily unavailable as before, or a new 'resource busy' error?
      • ruaok
        the former.
      • as before.
      • 1 every few seconds.
      • ianmcorvidae
        1 every few seconds is way better than before, anyway
      • ruaok
        I wonder if this started happening before today and my tinkering kicked it into high gear.
      • for sure.
      • bitmap
        pingu done
      • ruaok
        pingu in, asterix out
      • ianmcorvidae
        it has been happening some, this is what causes the instant 502s
      • and I'm out. I'll have my phone if I'm really urgently needed but I'm out basically the whole evening :( and then free of this particular thing until the fall, so it's not all bad :P
      • ruaok
        k
      • you around tomorrow?
      • bitmap
        asterix done
      • ruaok
        ok, should be coming back in.
      • and I just got three 502s in a row.
      • bitmap
        I'm still seeing the same errors in astro's logs
      • they haven't really slowed down
      • ruaok
        really? what log are you watching?
      • tail -f /var/log/nginx/001-musicbrainz.error.log | grep 502
      • bitmap
        same log without the grep 502
      • ruaok
        fuck.
      • CatCat joined the channel
      • ruaok reads http://serverfault.com/questions/398972/need-to-increase-nginx-throughput-to-an-upstream-unix-socket-linux-kernel-tun
      • but our loads are low right now
      • for obvious reasons.
      • ok, here is a thought and it ties together with your postgres observation...
      • (and perhaps you meant this earlier)
      • but if the clients are somehow getting hung up on talking to postgres and everything gets backed up...
      • if no perl procs are available, it would throw this error.
      • bitmap
        that's something I was thinking, but I wasn't sure what pgbouncer things were tweaked that'd cause that
      • ruaok
        they shouldn't really.
      • but they should also not be running out of connections to postgres.
      • but that was the last thing that was changed in our config, so that is a valid concern.
      • let me go poke pgbouncer some more
      • the load on totoro is very low.
      • no disk usage.
      • bitmap: so this is starting to make sense.
      • we're allowing a backlog of 128 waiting query connections to pgbouncer.
      • and there are about ~120 or so waiting.
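        (The waiting-client count can be read from pgbouncer's admin console; the port and admin user below are assumptions based on pgbouncer defaults, not the actual totoro setup:)

            # cl_waiting shows clients queued for a server connection
            psql -p 6432 -U postgres pgbouncer -c 'SHOW POOLS;'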
      • bitmap
        should we up the backlog then?
      • not sure if that was one of the settings that was changed
      • ruaok
        it wasn't, but perhaps it should be changed.
      • but if the server is not busy, it should be able to handle this.
      • bitmap
        yes, I would think so
      • ariscop joined the channel
      • there's not a lot of info on how to tune this
      • ruaok
        apparently the best info appears on the musicbrainz blog. :(
      • I can't tune the backlog without a restart of pgbouncer.
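        (That is pgbouncer's listen_backlog setting in pgbouncer.ini; it's the listen() backlog, so it only takes effect on a full restart. A sketch, with the config and init-script paths as assumptions:)

            # in /etc/pgbouncer/pgbouncer.ini, under [pgbouncer]:
            #   listen_backlog = 256
            sudo /etc/init.d/pgbouncer restart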
      • bitmap
        I've no clue why we'd hit this all of a sudden, but I can't think of any other options at the moment
      • ruaok
        well, chirlu and I did some tweaking yesterday. I'm guessing this is the side-effect.
      • so, what happens if I restart pgbouncer?
      • without taking the site down. bad things?
      • bitmap
        perhaps we should tweak net.core.somaxconn too then, since that's also 128 and seems to be a system-wide limit
      • ruaok is utterly out of patience and coping
      • ruaok
        give it a shot.
      • but pgbouncer really looks like it's *full*
      • bitmap
        uh, I've never restarted pgbouncer on my own before
      • ruaok
        out of 63 connections, 57 are idle!!
      • I'm just going to restart it. screw it.
      • bitmap
        alright. :/ I doubled net.core.somaxconn to 256 on totoro
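        (The corresponding commands; note that a change made via sysctl -w like this does not persist across reboots unless it is also written to /etc/sysctl.conf:)

            sysctl net.core.somaxconn             # was 128
            sudo sysctl -w net.core.somaxconn=256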
      • ruaok
        astro error log went quiet.
      • bitmap
        wow, yeah
      • ruaok
        pgbouncer looks much happier.
      • astro log is screaming again.
      • pgbouncer still looks good.
      • quiet again.
      • da hell?
      • bitmap
        yeah, I saw a burst of errors and then it stopped
      • Freso
        This channel is such a rollercoaster today!
      • ruaok
        be glad you're not on the rollercoaster.
      • ok, so this problem was unrelated to the gateway work.
      • go figger.
      • Leo_Verto
        Wee, everything is broken!
      • ruaok
        pgbouncer filled up again.
      • bitmap
        what'd you increase the backlog setting to?
      • ruaok
        256
      • though I am not convinced of the correlation.
      • I'm seeing lots of errors, but pgbouncer is fine
      • in fact it isn't getting even close to the limit now
      • in general, things are a lot better now
      • reosarevok
        From editing at least it seems that way, yup!
      • ruaok
        it's going back to shit, don't worry reosarevok
      • reosarevok
        :p
      • We'll see
      • ruaok
        pretty much when the waiting clients get to 100 the errors start flying
      • what da fuq pgbouncer?
      • the db server is idle and bored. why you be making life so hard.
      • bitmap
        have you tried reverting the settings you changed recently?
      • ruaok
        I can do that next.