#musicbrainz-devel

      • ruaok
        pingu in, asterix out.
      • 2015-04-25 11523, 2015

      • ruaok
        just do a restart there.
      • 2015-04-25 11536, 2015

      • ruaok
        I suspect the git issue might be firewall related.
      • 2015-04-25 11558, 2015

      • ruaok
        doesn't seem like this is helping.
      • 2015-04-25 11504, 2015

      • ruaok
        but let's finish.
      • 2015-04-25 11522, 2015

      • ruaok
        after that I can try to bring the old search load balancer back
      • 2015-04-25 11532, 2015

      • bitmap
        asterix is done
      • 2015-04-25 11509, 2015

      • ruaok
        thx.
      • 2015-04-25 11515, 2015

      • bitmap
        yeah, still seeing 502s on the homepage
      • 2015-04-25 11517, 2015

      • ruaok
        the rate has dropped, but is still very high.
      • 2015-04-25 11512, 2015

      • Leo_Verto
        hmm, reboot all the things? :P
      • 2015-04-25 11551, 2015

      • ruaok
        trying.
      • 2015-04-25 11501, 2015

      • ruaok
        kicking search servers now.
      • 2015-04-25 11529, 2015

      • ruaok
        20% 502s now. :)
      • 2015-04-25 11534, 2015

      • ruaok
        down from 50%
      • 2015-04-25 11513, 2015

      • ruaok
        bitmap: can you look into the access log on astro and examine the 502 errors?
      • 2015-04-25 11521, 2015

      • ruaok
        tell me what you think.
      • 2015-04-25 11545, 2015

      • ruaok
        ok, I'm calling this test failed. I'll bring the traffic back to the carl. :(
      • 2015-04-25 11536, 2015

      • moufl joined the channel
      • 2015-04-25 11552, 2015

      • mb-chat-logger joined the channel
      • 2015-04-25 11511, 2015

      • MBJenkins joined the channel
      • 2015-04-25 11538, 2015

      • Muz_ joined the channel
      • 2015-04-25 11521, 2015

      • bitmap
        the only hints I can find are that there's too many backlogged sockets on the frontends hitting net.core.somaxconn, but I don't get why switching to ernie would saturate that
      • 2015-04-25 11555, 2015

      • ruaok
        maybe that is something that needs to be tuned.
      • 2015-04-25 11502, 2015

      • ruaok
        how did you spy this?
      • 2015-04-25 11505, 2015

      • bitmap
        I googled the 'Resource temporarily unavailable' thing and saw that somaxconn was still the default on astro
      • 2015-04-25 11524, 2015
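
      [Editor's sketch] The somaxconn check bitmap describes could be reproduced roughly as below. This is a minimal sketch, assuming a Linux host that exposes the sysctl under /proc; the helper names `read_somaxconn` and `effective_backlog` are hypothetical and do not come from the MusicBrainz setup.

```python
from pathlib import Path

# Standard Linux location of the listen-backlog cap. That the host in
# question exposes it here is an assumption, not something shown in the log.
SOMAXCONN_PATH = Path("/proc/sys/net/core/somaxconn")


def read_somaxconn(path: Path = SOMAXCONN_PATH) -> int:
    """Return the net.core.somaxconn value stored at `path` as an integer."""
    return int(path.read_text().strip())


def effective_backlog(requested: int, somaxconn: int) -> int:
    """Backlog a listening socket actually gets: the kernel caps the
    value passed to listen(2) at net.core.somaxconn."""
    return min(requested, somaxconn)


if __name__ == "__main__":
    if SOMAXCONN_PATH.exists():
        limit = read_somaxconn()
        print("net.core.somaxconn =", limit)
        print("a server asking for 1024 gets", effective_backlog(1024, limit))
    else:
        print("net.core.somaxconn is not exposed on this system")
```

      If the reported value is small compared with the backlog the frontend asks for, queued connections beyond the cap are dropped or refused, which would fit the backed-up sockets mentioned above.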

      • ruaok
        on astro?
      • 2015-04-25 11530, 2015

      • ruaok
        hm that shouldn't be affected.
      • 2015-04-25 11511, 2015

      • reosarevok joined the channel
      • 2015-04-25 11515, 2015

      • bitmap
        yeah, that's why I was confused
      • 2015-04-25 11523, 2015

      • bitmap
        I can't ssh into ernie though
      • 2015-04-25 11537, 2015

      • ruaok
        another thing that has me puzzled.
      • 2015-04-25 11543, 2015

      • ruaok
        log in directly.
      • 2015-04-25 11551, 2015

      • ruaok
        not via the 10. net
      • 2015-04-25 11528, 2015

      • ruaok
        can you ping 72.29.166.157 from your home machine?
      • 2015-04-25 11531, 2015

      • bitmap
        k, that works
      • 2015-04-25 11548, 2015

      • bitmap
        that times out
      • 2015-04-25 11518, 2015

      • ruaok
        every step makes it worse. :(
      • 2015-04-25 11536, 2015

      • ruaok
        I can ping the ip from internally.
      • 2015-04-25 11542, 2015

      • ruaok
        the service works fine from internally.
      • 2015-04-25 11516, 2015

      • ruaok
        ah, found it.
      • 2015-04-25 11502, 2015

      • ruaok
        ping should work again.
      • 2015-04-25 11525, 2015

      • bitmap
        yep, it works now
      • 2015-04-25 11546, 2015

      • ruaok
        and the 502 error rate is not dropping. :(
      • 2015-04-25 11552, 2015

      • diana_olhovik_ joined the channel
      • 2015-04-25 11551, 2015

      • ruaok
        upstream: "http://unix:/home/musicbrainz/musicbrainz-server/musicbrainz-server.socket
      • 2015-04-25 11536, 2015

      • bitmap
        no, I think that looks normal
      • 2015-04-25 11558, 2015

      • ruaok
        ok.
      • 2015-04-25 11527, 2015

      • bitmap
        last time this happened I think it was related to totoro's network card issues
      • 2015-04-25 11541, 2015

      • bitmap
        but it fixed itself
      • 2015-04-25 11510, 2015

      • ruaok
        this doesn't seem totoro related.
      • 2015-04-25 11527, 2015

      • ruaok
        every one of the 502s is a web service search query. all of them.
      • 2015-04-25 11559, 2015

      • ruaok
        /ws/2/recording/?query=artist:Ed+Sheeran+%26+X+recording:don+t
      • 2015-04-25 11557, 2015
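
      [Editor's sketch] One way to check which requests account for the 502s is to tally them by path prefix. The sketch below assumes the access log uses nginx's "combined" format; the function name `tally_502s` and the sample lines are hypothetical, not taken from astro's actual logs.

```python
import re
from collections import Counter

# Matches the quoted request and the status code that follows it in a
# "combined"-style access log line (assumed format).
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def tally_502s(lines):
    """Return (total parsed requests, Counter of 502s per path prefix)."""
    total = 0
    bad = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like requests
        total += 1
        if match.group("status") == "502":
            path = match.group("path").split("?", 1)[0]
            # Keep the first two path segments: '/ws/2/recording/' -> '/ws/2'
            prefix = "/".join(path.split("/")[:3])
            bad[prefix] += 1
    return total, bad


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        '203.0.113.5 - - [25/Apr/2015:11:00:00 +0000] '
        '"GET /ws/2/recording/?query=love HTTP/1.1" 502 172 "-" "client/1.0"',
        '203.0.113.6 - - [25/Apr/2015:11:00:01 +0000] '
        '"GET /artist/abc HTTP/1.1" 200 5120 "-" "client/1.0"',
    ]
    total, bad = tally_502s(sample)
    for prefix, count in bad.most_common():
        print(f"{prefix}: {count} of {total} requests returned 502")
```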

      • LordSputnik plays that song
      • 2015-04-25 11507, 2015

      • bitmap
        right, I think the common thread is something's causing the connections to hang and they're backed up
      • 2015-04-25 11551, 2015

      • ruaok
        ok, when astro spits out that error message...
      • 2015-04-25 11508, 2015

      • ruaok
        is that because of the connection to the search load balancer or something else?
      • 2015-04-25 11543, 2015

      • ianmcorvidae
        is the special search-private IP address working?
      • 2015-04-25 11503, 2015

      • ruaok
        to the best of my knowledge, yes.
      • 2015-04-25 11544, 2015

      • ruaok
        yes, but it threw a 501 for the first time.
      • 2015-04-25 11553, 2015

      • ruaok
        I've tried it before and it never failed on me.
      • 2015-04-25 11557, 2015

      • ruaok
        the search private error log shows no 502s.
      • 2015-04-25 11544, 2015

      • ruaok
        what is up with these redirects?
      • 2015-04-25 11545, 2015

      • ruaok
        302 to mb.org /search?
      • 2015-04-25 11535, 2015

      • ianmcorvidae
        something's misencoded the query I think
      • 2015-04-25 11551, 2015

      • ianmcorvidae
        that's what happens when you go to a search server with no /ws/etc. and no query
      • 2015-04-25 11551, 2015

      • ruaok
        <=== made the query by hand
      • 2015-04-25 11515, 2015

      • ruaok
        what should the query look like?
      • 2015-04-25 11541, 2015

      • bitmap
        "psql: FATAL: remaining connection slots are reserved for non-replication superuser connections" is that related to the pgbouncer changes?
      • 2015-04-25 11551, 2015

      • ruaok
        oh, empty query.
      • 2015-04-25 11501, 2015

      • ruaok
        bitmap: yes, when was that?
      • 2015-04-25 11509, 2015

      • ruaok
        we fixed those about 24 hours ago.
      • 2015-04-25 11513, 2015

      • bitmap
        just now trying psql from astro
      • 2015-04-25 11551, 2015

      • ruaok
        does that problem persist?
      • 2015-04-25 11527, 2015

      • bitmap
        it gave that error 7 of the 9 times I tried just now
      • 2015-04-25 11523, 2015

      • ruaok
        sounds like pgbouncer needs checking.
      • 2015-04-25 11528, 2015

      • ruaok
        I can go look in a sec.
      • 2015-04-25 11528, 2015

      • bitmap
        (it does work for brief windows of time if I keep at it, but mostly fails)
      • 2015-04-25 11539, 2015

      • ruaok
        wget 'http://10.1.1.247:777?type=recording&query=love'
      • 2015-04-25 11543, 2015
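
      [Editor's sketch] Since a misencoded or empty query was suspected earlier, a hand-built test URL can be generated with proper escaping. This is a minimal sketch: the base address and the `type`/`query` parameter names are copied from the wget line above, while the helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(base: str, entity_type: str, query: str) -> str:
    """Build a search URL with the query string percent-encoded.

    Rejects empty queries, since the discussion above suggests those
    lead to a redirect rather than a search result.
    """
    if not query.strip():
        raise ValueError("refusing to send an empty search query")
    return base + "?" + urlencode({"type": entity_type, "query": query})


if __name__ == "__main__":
    # Base address taken from the wget example in the log.
    print(build_search_url("http://10.1.1.247:777", "recording", "love"))
    # Characters such as '&' and ':' are escaped rather than passed raw.
    print(build_search_url("http://10.1.1.247:777", "recording",
                           "artist:Ed Sheeran & X recording:don t"))
```

      urlencode escapes `&` and `:` so they cannot be mistaken for parameter separators, which is the kind of misencoding that would otherwise leave the search server with a truncated query.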

      • ruaok
        that query works fine for me.
      • 2015-04-25 11504, 2015

      • ruaok
        bitmap: while I go look at pgbouncer, can you please see what queries astro sends to the search load balancer?
      • 2015-04-25 11506, 2015

      • chirlu-mobile joined the channel
      • 2015-04-25 11521, 2015

      • chirlu-mobile
        Could the WS search queries thing be related to the nginx accel-redirect?
      • 2015-04-25 11544, 2015

      • ruaok
        I was wondering about that.
      • 2015-04-25 11554, 2015

      • ruaok
        ianmcorvidae: you know more about that than I do.
      • 2015-04-25 11519, 2015

      • chirlu-mobile
        re: psql, the number of connections has always been close to the limit since yesterday, so increasing max_connections in Postgres may be the solution.
      • 2015-04-25 11541, 2015

      • ruaok
        chirlu-mobile: what is interesting is that the number of waiting connections is now at ~120.
      • 2015-04-25 11544, 2015

      • ruaok
        constantly.
      • 2015-04-25 11515, 2015

      • ruaok
        I think I am going to move the ips that I took from carl and send them back to carl.
      • 2015-04-25 11530, 2015

      • ruaok
        and hopefully that will get things to go back to normal
      • 2015-04-25 11557, 2015

      • chirlu-mobile
        Long queue could also be caused simply by higher load.
      • 2015-04-25 11514, 2015

      • ruaok
        the load is very low right now
      • 2015-04-25 11519, 2015

      • chirlu-mobile
        Then it's not that. :)
      • 2015-04-25 11553, 2015

      • michiwend joined the channel
      • 2015-04-25 11541, 2015

      • ruaok
        can anyone ping 72.29.167.148 ?
      • 2015-04-25 11527, 2015

      • bitmap
        not I
      • 2015-04-25 11524, 2015

      • ruaok
        carl is refusing to take the IP back.
      • 2015-04-25 11511, 2015

      • ianmcorvidae
        bitmap/ruaok: if you want to do psql from somewhere like astro, use port 6899 so you go through pgbouncer
      • 2015-04-25 11523, 2015

      • ianmcorvidae
        (or use ./admin/psql READWRITE, which of course does that for you)
      • 2015-04-25 11536, 2015

      • ruaok
        yes, and if you get that error, using -U postgres will get you in too.
      • 2015-04-25 11558, 2015

      • ianmcorvidae
        yes. though probably not from astro :)
      • 2015-04-25 11533, 2015

      • ianmcorvidae
        and the reason I was asking about the search-private thing was the accel-redirect thing, but it appears to be correctly configured
      • 2015-04-25 11548, 2015

      • ianmcorvidae
        unless there's differences in a toplevel nginx config somehow, but I don't know how that'd be since you just copied it from carl
      • 2015-04-25 11555, 2015

      • ruaok
        I'm starting to think that the iptables has some goof in it.
      • 2015-04-25 11519, 2015

      • kepstin-laptop joined the channel
      • 2015-04-25 11523, 2015

      • reosarevok joined the channel
      • 2015-04-25 11524, 2015

      • kepstin-laptop
        ruaok, pong
      • 2015-04-25 11534, 2015

      • ruaok
        hey, been a rough day here. :(
      • 2015-04-25 11558, 2015

      • ruaok
        I'm in the process of migrating everything back to the old gateway. I'm burnt.
      • 2015-04-25 11506, 2015

      • ruaok
        let me finish that first.
      • 2015-04-25 11537, 2015

      • ianmcorvidae
        only difference in iptables is ernie has stuff in INPUT for .150 and for gtest.musicbrainz.org
      • 2015-04-25 11506, 2015

      • ruaok
        150 was my dns test ip.
      • 2015-04-25 11540, 2015

      • ruaok
        ianmcorvidae: do you see all the differences between em2:0 and em2?
      • 2015-04-25 11504, 2015

      • ruaok
        any time I wanted to use a rule with em2:0 it would not work. using em2 would.
      • 2015-04-25 11532, 2015

      • ianmcorvidae
        iptables -L isn't showing me anything with that.
      • 2015-04-25 11502, 2015

      • ruaok
        iptables-save does.
      • 2015-04-25 11509, 2015

      • kepstin-laptop
        iptables -L doesn't show all the tables, you have to use -t to ask for a specific one
      • 2015-04-25 11519, 2015
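
      [Editor's sketch] Because `iptables -L` without `-t` lists only the filter table while `iptables-save` dumps every table, grouping a saved dump by table makes it easier to compare two hosts. The sketch below parses the iptables-save text format; the function name `rules_by_table` and the sample dump are hypothetical and are not carl's or ernie's real rules.

```python
from collections import defaultdict


def rules_by_table(dump: str) -> dict:
    """Group the '-A ...' rule lines of an iptables-save dump by table."""
    tables = defaultdict(list)
    current = None
    for raw in dump.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments
        if line.startswith("*"):
            current = line[1:]   # '*nat' starts the nat table
            tables[current]      # register the table even if it has no rules
        elif line == "COMMIT":
            current = None       # end of the current table
        elif line.startswith("-A") and current is not None:
            tables[current].append(line)
        # ':CHAIN POLICY [pkts:bytes]' declaration lines are ignored here
    return dict(tables)


if __name__ == "__main__":
    # Hypothetical dump, for illustration only.
    sample = """\
*raw
:PREROUTING ACCEPT [0:0]
-A PREROUTING -p udp -m udp --dport 53 -j NOTRACK
COMMIT
*filter
:INPUT ACCEPT [0:0]
-A INPUT -i em2 -p tcp -m tcp --dport 80 -j ACCEPT
COMMIT
"""
    for table, rules in rules_by_table(sample).items():
        print(f"{table}: {len(rules)} rule(s)")
        for rule in rules:
            print("   ", rule)
```

      Running this over the dumps from both gateways and diffing the per-table lists would surface differences in nat or raw that the default filter listing hides.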

      • ianmcorvidae
        yeah, looking at nat now
      • 2015-04-25 11546, 2015

      • ianmcorvidae
        the raw table has something that seems amiss as far as connection-tracking, but that's the only thing I can see
      • 2015-04-25 11519, 2015

      • ianmcorvidae
        (carl has target NOTRACK, as should be expected, for the couple of UDP things we turn that off for, ernie has target CT, and NOTRACK is in the destination stuff for some reason)
      • 2015-04-25 11529, 2015

      • ruaok
        kepstin-laptop: what was the arping invocation you were suggesting?
      • 2015-04-25 11544, 2015

      • ruaok
        I've got one external ip that refused to go back to carl.
      • 2015-04-25 11503, 2015

      • ruaok
        that is the last thing to undo everything I've done today.
      • 2015-04-25 11526, 2015

      • kepstin-laptop
        arping -U -I ethdevice ip.add.re.ss
      • 2015-04-25 11557, 2015

      • ruaok
        from which machine?
      • 2015-04-25 11504, 2015

      • ruaok
        the one that lost or received the ip?
      • 2015-04-25 11510, 2015

      • kepstin-laptop
        the one that you want to have the ip
      • 2015-04-25 11526, 2015

      • kepstin-laptop
        that will send a fake arp reply from that machine to update the arp table in the switch
      • 2015-04-25 11546, 2015

      • ruaok
        no replies.
      • 2015-04-25 11551, 2015

      • ruaok
        it just sits there.
      • 2015-04-25 11501, 2015

      • kepstin-laptop
        there's not supposed to be any replies, it's sending replies
      • 2015-04-25 11532, 2015

      • ruaok
        worked this time. never did before. :)
      • 2015-04-25 11521, 2015

      • ruaok
        ok, everything is undone. the site is as I found it this morning.
      • 2015-04-25 11503, 2015

      • ianmcorvidae
        search queries still not working though
      • 2015-04-25 11544, 2015

      • ruaok
        exactly.
      • 2015-04-25 11548, 2015

      • ruaok
        da fuq?
      • 2015-04-25 11550, 2015

      • ianmcorvidae
        huh
      • 2015-04-25 11552, 2015

      • ianmcorvidae
        though one just did
      • 2015-04-25 11504, 2015

      • ruaok
        30% - 40% are failing.
      • 2015-04-25 11505, 2015

      • ianmcorvidae
        yup, now they're back reliably
      • 2015-04-25 11507, 2015

      • ianmcorvidae
        wacky
      • 2015-04-25 11512, 2015

      • ianmcorvidae
        maybe one server's still not figured it out
      • 2015-04-25 11533, 2015

      • ruaok
        lots and lots still persist.
      • 2015-04-25 11537, 2015

      • ruaok
        ok, food just arrived
      • 2015-04-25 11541, 2015

      • ruaok
        back after noms.
      • 2015-04-25 11548, 2015

      • ianmcorvidae confirms, search stuff to the backend thing is working, something's failing higher-up than that
      • 2015-04-25 11528, 2015

      • ianmcorvidae
        and it's not all queries, that's just because there's so many more of them than other stuff
      • 2015-04-25 11546, 2015

      • ianmcorvidae
        right, and all the stats are still broken >_<
      • 2015-04-25 11511, 2015

      • ruaok
        yeah, once the gateway stuff is done, I was going to pick rvedotrc's brain about that