I suspect the git issue might be firewall related.
doesn't seem like this is helping.
but let's finish.
after that I can try to bring the old search load balancer back
bitmap
asterix is done
ruaok
thx.
bitmap
yeah, still seeing 502s on the homepage
ruaok
the rate has dropped, but is still very high.
Leo_Verto
hmm, reboot all the things? :P
ruaok
trying.
kicking search servers now.
20% 502s now. :)
down from 50%
bitmap: can you look into the access log on astro and examine the 502 errors?
tell me what you think.
ok, I'm calling this test failed. I'll bring the traffic back to the carl. :(
moufl joined the channel
mb-chat-logger joined the channel
MBJenkins joined the channel
Muz_ joined the channel
bitmap
the only hints I can find are that there are too many backlogged sockets on the frontends hitting net.core.somaxconn, but I don't get why switching to ernie would saturate that
ruaok
maybe that is something that needs to be tuned.
how did you spy this?
bitmap
I googled the 'Resource temporarily unavailable' thing and saw that somaxconn was still the default on astro
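A minimal sketch of the check described above, assuming a Linux host where the limit is exposed under procfs. The function names `read_somaxconn` and `backlog_is_small` are hypothetical helpers, and the threshold of 128 is the historical Linux default, which is only an assumption about what astro was running.

```shell
#!/bin/sh
# Read the kernel's cap on a listening socket's accept backlog.
# The optional path argument exists only so the function can be pointed
# at a test file; on Linux the live value is in procfs.
read_somaxconn() {
    cat "${1:-/proc/sys/net/core/somaxconn}"
}

# Succeed if the cap is at or below a threshold (second argument).
# 128 is the historical Linux default and is an assumption here.
backlog_is_small() {
    [ "$(read_somaxconn "$1")" -le "${2:-128}" ]
}
```

If the value turns out to be small, it can be raised at runtime with something like `sysctl -w net.core.somaxconn=1024` (the number is illustrative). The listening service also has to request a larger backlog for the change to have any effect.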
bitmap: while I go look at pgbouncer, can you please see what queries astro sends to the search load balancer?
chirlu-mobile joined the channel
chirlu-mobile
Could the WS search queries thing be related to the nginx accel-redirect?
ruaok
I was wondering about that.
ianmcorvidae: you know more about that than I do.
chirlu-mobile
re: psql, the number of connections has always been close to the limit since yesterday, so increasing max_connections in Postgres may be the solution.
ruaok
chirlu-mobile: what is interesting is that the number of waiting connections is now at ~120.
constantly.
I think I am going to move the ips that I took from carl and send them back to carl.
and hopefully that will get things to go back to normal
chirlu-mobile
Long queue could also be caused simply by higher load.
ruaok
the load is very low right now
chirlu-mobile
Then it's not that. :)
michiwend joined the channel
ruaok
can anyone ping 72.29.167.148 ?
bitmap
not I
ruaok
carl is refusing to take the IP back.
ianmcorvidae
bitmap/ruaok: if you want to do psql from somewhere like astro, use port 6899 so you go through pgbouncer
(or use ./admin/psql READWRITE, which of course does that for you)
ruaok
yes, and if you get that error, using -U postgres will get you in too.
ianmcorvidae
yes. though probably not from astro :)
and the reason I was asking about the search-private thing was the accel-redirect thing, but it appears to be correctly configured
unless there's differences in a toplevel nginx config somehow, but I don't know how that'd be since you just copied it from carl
ruaok
I'm starting to think that the iptables has some goof in it.
kepstin-laptop joined the channel
reosarevok joined the channel
kepstin-laptop
ruaok, pong
ruaok
hey, been a rough day here. :(
I'm in the process of migrating everything back to the old gateway. I'm burnt.
let me finish that first.
ianmcorvidae
only difference in iptables is ernie has stuff in INPUT for .150 and for gtest.musicbrainz.org
ruaok
150 was my dns test ip.
ianmcorvidae: do you see all the differences between em2:0 and em2?
any time I wanted to use a rule with em2:0 it would not work. using em2 would.
ianmcorvidae
iptables -L isn't showing me anything with that.
ruaok
iptables-save does.
kepstin-laptop
iptables -L doesn't show all the tables, you have to use -t to ask for a specific one
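As a rough sketch of the point above: `iptables -L` with no `-t` lists only the `filter` table, so rules in `nat`, `mangle`, or `raw` stay hidden unless each table is requested. The function name `dump_all_tables` and the `IPTABLES` override are hypothetical, added here so the loop can be dry-run without root.

```shell
#!/bin/sh
# The four commonly used tables; plain `iptables -L` shows only "filter".
TABLES="filter nat mangle raw"

# Overridable command name (an assumption, for dry runs and testing).
IPTABLES="${IPTABLES:-iptables}"

# List every table in turn, numeric output (-n) with counters (-v).
dump_all_tables() {
    for table in $TABLES; do
        echo "== table: $table =="
        "$IPTABLES" -t "$table" -L -n -v
    done
}
```

Calling `dump_all_tables` as root on both gateways and diffing the output is one way to compare them. `iptables-save` prints all tables in one pass and is usually easier to diff.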
ianmcorvidae
yeah, looking at nat now
the raw table has something that seems amiss as far as connection-tracking, but that's the only thing I can see
(carl has target NOTRACK, as should be expected, for the couple of UDP things we turn that off for, ernie has target CT, and NOTRACK is in the destination stuff for some reason)
ruaok
kepstin-laptop: what was the arping invocation you were suggesting?
I've got one external ip that refused to go back to carl.
that is the last thing to undo everything I've done today.
kepstin-laptop
arping -U -I ethdevice ip.add.re.ss
ruaok
from which machine?
the one that lost or received the ip?
kepstin-laptop
the one that you want to have the ip
that will send a fake arp reply from that machine to update the arp table in the switch
ruaok
no replies.
it just sits there.
kepstin-laptop
there aren't supposed to be any replies, it's sending replies
ruaok
worked this time. never did before. :)
ok, everything is undone. the site is as I found it this morning.
ianmcorvidae
search queries still not working though
ruaok
exactly.
da fuq?
ianmcorvidae
huh
though one just did
ruaok
30% - 40% are failing.
ianmcorvidae
yup, now they're back reliably
wacky
maybe one server's still not figured it out
ruaok
lots and lots still persist.
ok, food just arrived
back after noms.
ianmcorvidae confirms, search stuff to the backend thing is working, something's failing higher-up than that
ianmcorvidae
and it's not all queries, that's just because there's so many more of them than other stuff
right, and all the stats are still broken >_<
ruaok
yeah, once the gateway stuff is done, I was going to pick rvedotrc's brain about that