#musicbrainz-devel

      • ruaok
        pingu in, asterix out.
      • just do a restart there.
      • I suspect the git issue might be firewall related.
      • doesn't seem like this is helping.
      • but let's finish.
      • after that I can try to bring the old search load balancer back
      • bitmap
        asterix is done
      • ruaok
        thx.
      • bitmap
        yeah, still seeing 502s on the homepage
      • ruaok
        the rate has dropped, but is still very high.
      • Leo_Verto
        hmm, reboot all the things? :P
      • ruaok
        trying.
      • kicking search servers now.
      • 20% 502s now. :)
      • down from 50%
      • bitmap: can you look into the access log on astro and examine the 502 errors?
      • tell me what you think.
      • ok, I'm calling this test failed. I'll bring the traffic back to carl. :(
      • moufl joined the channel
      • mb-chat-logger joined the channel
      • MBJenkins joined the channel
      • Muz_ joined the channel
      • bitmap
        the only hints I can find are that there's too many backlogged sockets on the frontends hitting net.core.somaxconn, but I don't get why switching to ernie would saturate that
      • ruaok
        maybe that is something that needs to be tuned.
      • how did you spy this?
      • bitmap
        I googled the 'Resource temporarily unavailable' thing and saw that somaxconn was still the default on astro
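
A minimal sketch of the check bitmap describes, assuming a Linux host where the limit is exposed at /proc/sys/net/core/somaxconn. The function names are hypothetical. The kernel caps the backlog passed to listen() at this value, so a busy frontend with a low limit can run out of queue space for pending connections.

```python
from pathlib import Path

# Linux-only location of the sysctl value (assumption: /proc is mounted).
SOMAXCONN_PATH = Path("/proc/sys/net/core/somaxconn")


def parse_somaxconn(text: str) -> int:
    """Parse the contents of the somaxconn file, e.g. '128\\n'."""
    return int(text.strip())


def effective_backlog(requested: int, somaxconn: int) -> int:
    """The kernel silently caps a listen() backlog at net.core.somaxconn."""
    return min(requested, somaxconn)


def report(requested: int = 1024) -> str:
    """Describe how a requested backlog compares with the current limit."""
    if not SOMAXCONN_PATH.exists():
        return "somaxconn not available on this platform"
    limit = parse_somaxconn(SOMAXCONN_PATH.read_text())
    capped = effective_backlog(requested, limit)
    return f"net.core.somaxconn={limit}, backlog {requested} is capped at {capped}"
```

Calling report() on the frontend prints the current limit next to the backlog the server asked for, which shows at a glance whether the sysctl is the tighter of the two.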
      • ruaok
        on astro?
      • hm that shouldn't be affected.
      • reosarevok joined the channel
      • bitmap
        yeah, that's why I was confused
      • I can't ssh into ernie though
      • ruaok
        another thing that has me puzzled.
      • log in directly.
      • not via the 10. net
      • can you ping 72.29.166.157 from your home machine?
      • bitmap
        k, that works
      • that times out
      • ruaok
        every step makes it worse. :(
      • I can ping the ip from internally.
      • the service works fine from internally.
      • ah, found it.
      • ping should work again.
      • bitmap
        yep, it works now
      • ruaok
        and the 502 error rate is not dropping. :(
      • diana_olhovik_ joined the channel
      • upstream: "http://unix:/home/musicbrainz/musicbrainz-server/musicbrainz-server.socket
      • bitmap
        no, I think that looks normal
      • ruaok
        ok.
      • bitmap
        last time this happened I think it was related to totoro's network card issues
      • but it fixed itself
      • ruaok
        this doesn't seem totoro related.
      • every one of the 502s is a web service search query. all of them
      • /ws/2/recording/?query=artist:Ed+Sheeran+%26+X+recording:don+t
      • LordSputnik plays that song
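
A rough sketch of how the 502s in an access log could be broken down by request type, as ruaok does by eye above. It assumes the nginx "combined" log format, and the class names and function names are hypothetical. It tallies totals as well as failures, since a class that dominates the traffic will also dominate the raw error count.

```python
import re
from collections import Counter

# Matches the quoted request line and the status code that follows it,
# as they appear in the nginx "combined" log format (an assumption here).
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<target>\S+) [^"]*" (?P<status>\d{3}) ')


def classify(target: str) -> str:
    """Bucket a request target into ws-search, ws-other, or other."""
    path, _, query = target.partition("?")
    if path.startswith("/ws/"):
        return "ws-search" if "query=" in query else "ws-other"
    return "other"


def tally(lines):
    """Return (total, failed) Counters keyed by request class."""
    total, failed = Counter(), Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        kind = classify(match.group("target"))
        total[kind] += 1
        if match.group("status") == "502":
            failed[kind] += 1
    return total, failed


def failure_rates(total, failed):
    """Fraction of requests in each class that returned 502."""
    return {kind: failed[kind] / count for kind, count in total.items()}
```

Comparing failure_rates() across classes separates "only search queries fail" from "search queries are simply most of the traffic".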
      • bitmap
        right, I think the common thread is something's causing the connections to hang and they're backed up
      • ruaok
        ok, when astro spits out that error message...
      • is that because of the connection to the search load balancer or something else?
      • ianmcorvidae
        is the special search-private IP address working?
      • ruaok
        to the best of my knowledge, yes.
      • yes, but it threw a 501 for the first time.
      • I've tried it before and it never failed on me.
      • the search private error log shows no 502s.
      • what is up with these redirects?
      • 302 to mb.org /search?
      • ianmcorvidae
        something's misencoded the query I think
      • that's what happens when you go to a search server with no /ws/etc. and no query
      • ruaok
        <=== made the query by hand
      • what should the query look like?
      • bitmap
        "psql: FATAL: remaining connection slots are reserved for non-replication superuser connections" is that related to the pgbouncer changes?
      • ruaok
        oh, empty query/
      • bitmap: yes, when was that?
      • we fixed those about 24 hours ago.
      • bitmap
        just now trying psql from astro
      • ruaok
        does that problem persist?
      • bitmap
        it gave that error 7 of the 9 times I tried just now
      • ruaok
        sounds like pgbouncer needs checking.
      • I can go look in a sec.
      • bitmap
        (it does work for brief windows of time if I keep at it, but mostly fails)
      • ruaok
        wget 'http://10.1.1.247:777?type=recording&query=love'
      • that query works fine for me.
      • bitmap: while I go look at pgbouncer, can you please see what queries astro sends to the search load balancer?
      • chirlu-mobile joined the channel
      • chirlu-mobile
        Could the WS search queries thing be related to the nginx accel-redirect?
      • ruaok
        I was wondering about that.
      • ianmcorvidae: you know more about that than I do.
      • chirlu-mobile
        re: psql, the number of connections has always been close to the limit since yesterday, so increasing max_connections in Postgres may be the solution.
      • ruaok
        chirlu-mobile: what is interesting is that the number of waiting connections is now at ~120.
      • constantly.
      • I think I am going to move the ips that I took from carl and send them back to carl.
      • and hopefully that will get things to go back to normal
      • chirlu-mobile
        Long queue could also be caused simply by higher load.
      • ruaok
        the load is very low right now
      • chirlu-mobile
        Then it's not that. :)
      • michiwend joined the channel
      • ruaok
        can anyone ping 72.29.167.148 ?
      • bitmap
        not I
      • ruaok
        carl is refusing to take the IP back.
      • ianmcorvidae
        bitmap/ruaok: if you want to do psql from somewhere like astro, use port 6899 so you go through pgbouncer
      • (or use ./admin/psql READWRITE, which of course does that for you)
      • ruaok
        yes, and if you get that error, using -U postgres will get you in too.
      • ianmcorvidae
        yes. though probably not from astro :)
      • and the reason I was asking about the search-private thing was the accel-redirect thing, but it appears to be correctly configured
      • unless there's differences in a toplevel nginx config somehow, but I don't know how that'd be since you just copied it from carl
      • ruaok
        I'm starting to think that the iptables has some goof in it.
      • kepstin-laptop joined the channel
      • reosarevok joined the channel
      • kepstin-laptop
        ruaok, pong
      • ruaok
        hey, been a rough day here. :(
      • I'm in the process of migrating everything back to the old gateway. I'm burnt.
      • let me finish that first.
      • ianmcorvidae
        only difference in iptables is ernie has stuff in INPUT for .150 and for gtest.musicbrainz.org
      • ruaok
        150 was my dns test ip.
      • ianmcorvidae: do you see all the differences between em2:0 and em2?
      • any time I wanted to use a rule with em2:0 it would not work. using em2 would.
      • ianmcorvidae
        iptables -L isn't showing me anything with that.
      • ruaok
        iptables-save does.
      • kepstin-laptop
        iptables -L doesn't show all the tables, you have to use -t to ask for a specific one
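
To illustrate the point about tables: iptables -L lists only the filter table unless -t names another, while iptables-save dumps every table in one text stream. The sketch below groups the rules in such a dump by table so two hosts can be compared table by table. The sample dump is invented for illustration and is not taken from carl or ernie; the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical dump in iptables-save format: "*table" opens a table,
# ":CHAIN POLICY [pkts:bytes]" declares a chain, "-A" appends a rule,
# and "COMMIT" closes the table.
SAMPLE_DUMP = """\
# Generated by iptables-save
*raw
:PREROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A PREROUTING -p udp -m udp --dport 53 -j NOTRACK
COMMIT
*nat
:PREROUTING ACCEPT [0:0]
-A PREROUTING -i em2 -p tcp -m tcp --dport 80 -j DNAT --to-destination 192.0.2.10:80
COMMIT
*filter
:INPUT ACCEPT [0:0]
-A INPUT -i lo -j ACCEPT
COMMIT
"""


def rules_by_table(dump: str) -> dict:
    """Map each table name in an iptables-save dump to its '-A' rule lines."""
    tables = defaultdict(list)
    current = None
    for raw in dump.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rules
        if line.startswith("*"):
            current = line[1:]
            tables[current]  # register the table even if it has no rules
        elif line == "COMMIT":
            current = None
        elif line.startswith("-A") and current is not None:
            tables[current].append(line)
    return dict(tables)
```

Running rules_by_table() on the dumps from two machines and diffing the per-table lists makes a mismatch in nat or raw visible even though the filter tables agree.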
      • ianmcorvidae
        yeah, looking at nat now
      • the raw table has something that seems amiss as far as connection-tracking, but that's the only thing I can see
      • (carl has target NOTRACK, as should be expected, for the couple of UDP things we turn that off for, ernie has target CT, and NOTRACK is in the destination stuff for some reason)
      • ruaok
        kepstin-laptop: what was the arping invocation you were suggesting?
      • I've got one external ip that refused to go back to carl.
      • that is the last thing to undo everything I've done today.
      • kepstin-laptop
        arping -U -I ethdevice ip.add.re.ss
      • ruaok
        from which machine?
      • the one that lost or received the ip?
      • kepstin-laptop
        the one that you want to have the ip
      • that will send a fake arp reply from that machine to update the arp table in the switch
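
For readers unfamiliar with the trick, here is a sketch of what an unsolicited (gratuitous) ARP frame looks like on the wire. It only builds the bytes and does not send anything. The function name, the example MAC, and the example IP are hypothetical, and zeroing the target hardware address is an assumption; tools differ on that field and on whether they use the request or reply opcode.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806
ARP_REQUEST, ARP_REPLY = 1, 2


def gratuitous_arp_frame(mac: bytes, ip: str, reply: bool = False) -> bytes:
    """Build an Ethernet frame announcing that `ip` lives at `mac`.

    The frame is broadcast, and the sender and target IP are both the
    announced address, which is what makes the ARP packet "gratuitous".
    """
    if len(mac) != 6:
        raise ValueError("mac must be exactly 6 bytes")
    ip_bytes = socket.inet_aton(ip)
    broadcast = b"\xff" * 6
    eth_header = broadcast + mac + struct.pack("!H", ETHERTYPE_ARP)
    arp_fixed = struct.pack(
        "!HHBBH",
        1,        # hardware type: Ethernet
        0x0800,   # protocol type: IPv4
        6,        # hardware address length
        4,        # protocol address length
        ARP_REPLY if reply else ARP_REQUEST,
    )
    target_mac = b"\x00" * 6  # assumption: zeros in this field
    return eth_header + arp_fixed + mac + ip_bytes + target_mac + ip_bytes


# Example with a locally administered MAC and a documentation-range IP.
EXAMPLE_FRAME = gratuitous_arp_frame(b"\x02\x00\x00\x00\x00\x01", "192.0.2.1")
```

Because the frame goes to the broadcast address, switches and neighbours that see it can refresh their mapping for the address, which is why it is run on the machine that should own the IP and why no replies come back.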
      • ruaok
        no replies.
      • it just sits there.
      • kepstin-laptop
        there aren't supposed to be any replies, it's sending replies
      • ruaok
        worked this time. never did before. :)
      • ok, everything is undone. the site is as I found it this morning.
      • ianmcorvidae
        search queries still not working though
      • ruaok
        exactly.
      • da fuq?
      • ianmcorvidae
        huh
      • though one just did
      • ruaok
        30% - 40% are failing.
      • ianmcorvidae
        yup, now they're back reliably
      • wacky
      • maybe one server's still not figured it out
      • ruaok
        lots and lots still persist.
      • ok, food just arrived
      • back after noms.
      • ianmcorvidae confirms, search stuff to the backend thing is working, something's failing higher-up than that
      • ianmcorvidae
        and it's not all queries, that's just because there's so many more of them than other stuff
      • right, and all the stats are still broken >_<
      • ruaok
        yeah, once the gateway stuff is done, I was going to pick rvedotrc's brain about that