Today is a holiday, so I'm not around. Last week I spent most of my time attending a kubernetes training course at MTG. I also reviewed some LB PRs.
"""
Others up for review: yvanzo, bitmap, monkey, ruaok, lucifer, zas, akshat, reosarevok, CatQuest, Freso – as always, let me know if you too want to give a review and you're not listed here! :)
I mostly reviewed PRs and looked into improving docker server configs
Freso
bitmap and anyone else in the US/North America/Turtle Island: note that next week’s meeting will be an hour different again (back to normal time) if you’re in an area that (stops) observing DST. :)
Take care of yourselves out there. :)
</BANG>
Possibly shortest meeting ever? :p
yvanzo
Thanks everyone :)
monkey
Just catching up with the drama. Anything I can help with?
ruaok
send chocolate?
CatQuest
I only have nonstop?
but i mean i can send that
rdswift
<ruaok> send chocolate? Or chill the beer for when this is resolved.
ruaok
zas: lets regroup. what else have we learned?
zas
the problem is concerning https only (apparently, perhaps we need to ensure that)
load on gateways is higher than usual, we don't know if it's the cause or the consequence yet
ruaok
do we have any other indicators of abnormal behaviour right before the outage?
zas
it started when I switched floating IP to herb, but it can be due to another event at the same time
this is a different error from before and happens immediately
ruaok
kepstin: ping
kepstin
hola
ruaok
kepstin: hey.
kepstin
i've read a bit of the backlog, but i don't really have any insight as to what's going wrong :/
ruaok
have you been following along? I think you could be really helpful to us.
ah.
what are your go-to things to check when a server is overloaded?
what would you do and look at?
ruaok is trying to get ideas for things to do next
zas: we have redirects from HTTP -> HTTPS in place for a lot of things.
what if we turned them off and told people via twitter/blog to use HTTP for the time being?
kepstin
well, narrow down the specific thing that's overloaded - where is the error page being generated, which component is it failing to connect to?
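A minimal sketch of the narrowing-down kepstin describes: time each hop separately with curl's write-out variables and compare the frontend path against a direct backend hit. The hostnames and port here are placeholders, not the actual MetaBrainz topology.

```shell
# Per-layer latency probe using curl's timing variables. Comparing the
# public frontend against a direct backend request shows which hop is slow.
fmt='connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s\n'
# Against the public HTTPS frontend (placeholder host):
#   curl -so /dev/null -w "$fmt" https://example.org/
# Directly against a backend, bypassing the gateway (placeholder host):
#   curl -so /dev/null -w "$fmt" http://backend.internal:8080/
printf 'probe format: %s\n' "$fmt"
```

If `time_connect` is already high at the frontend but low when hitting the backend directly, the gateway itself is the bottleneck.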
i did mention about the http/2 thing earlier, dunno if you saw that?
ruaok
kepstin: it appears that all our HTTP traffic is totally fine and snappy.
kepstin: saw that, but it didn't lead me to a new direction.
kepstin
hmm, it just means that if you're measuring traffic by number of connections, you'll overcount http or undercount https
yvanzo
BrainzGit is running normally from aretha.
ruaok
but none of our HTTPS traffic is going through. we checked certs and so on.. fine. no changes to the HTTP stuff when this happened
yvanzo
Cert for tickets is fine too.
ruaok
kepstin: understood.
CatQuest
yea tickets loads
kepstin
and also that https can be a connection amplifier, since most http/2 reverse-proxies convert multiple requests in one connection to multiple http/1.1 connections to the backend
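The amplification kepstin mentions has a standard nginx-side mitigation: keep a pool of reused backend connections instead of opening a fresh HTTP/1.1 connection per multiplexed HTTP/2 request. This is a generic sketch with placeholder addresses, not the actual gateway config.

```nginx
# Hypothetical sketch: reuse backend connections instead of opening a
# fresh HTTP/1.1 connection for every multiplexed HTTP/2 request.
upstream app_backend {
    server 10.0.0.10:8080;   # placeholder backend
    keepalive 32;            # idle connections kept open for reuse
}
server {
    location / {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # required for upstream keepalive
    }
}
```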
ruaok
we're also getting TCP listen drop messages.
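The listen-drop messages can be quantified from the kernel's own counters. The sample `netstat -s` excerpt below is fabricated for illustration; on a live gateway you would pipe `netstat -s` in directly.

```shell
# Listen-queue overflow counters as reported by `netstat -s` (sample text
# below is illustrative; numbers are invented).
stats='    4123 times the listen queue of a socket overflowed
    4123 SYNs to LISTEN sockets dropped'
drops=$(printf '%s\n' "$stats" | grep -ci 'listen')
printf 'listen-queue counters found: %s\n' "$drops"
```

Both counters climbing together is the classic sign of a saturated accept queue rather than a backend error.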
yvanzo
But tickets are not running from the same set of servers.
ruaok
but we think that is symptom, not a cause
kepstin: that would be great for us -- our gateways, the HTTPS endpoints are having a hard time. the backends are bored.
CatQuest
hm, might be worth checking that out tho, to rule it out at least
kepstin
hmm. if you have https ocsp pinning enabled, consider turning that off?
ruaok
zas: ^^
zas
let me check that
kepstin
probably not the issue, but it can cause some server configs to need to make outgoing connections to verify the ocsp before responding
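For reference, turning this off in nginx is two directives. A hypothetical sketch of what zas would change in the TLS server block, assuming stapling was enabled there:

```nginx
# Hypothetical sketch: disable OCSP stapling so the proxy never blocks
# on an outbound OCSP fetch while answering TLS handshakes.
ssl_stapling off;
ssl_stapling_verify off;
```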
i have to say, having connections refused on the border reverse proxies is a new experience for me
all the overload issues i've dealt with have been purely due to backend
ruaok
I believe we haven't ruled out the CAA enough yet.
CatQuest
honestly that huesound/caa thing seemed very suspicious
zas
yes, it started to rise at 16:19
so basically the switch happened when it was already 2x the usual traffic
ruaok
we're overloaded. clearly.
bitmap: you about?
bitmap
I'm here
ruaok
can you help me cobble together a CAA?
zas
kepstin: I disabled ssl stapling
ruaok
if I rent us a new dedicated server right now, can we get CAA up and running post-haste?
bitmap
ruaok: sure, I could help with that
ruaok
ok, I'm going to do that. that rules it out.
hang on.
kepstin
what's the outward-facing application that's actually handling requests on port 443? is that an nginx reverse proxy, haproxy, ???
gavinatkinson joined the channel
ruaok
openresty / nginx
kepstin
interesting that in the stats, the connect time to the upstream servers from that frontend seems to have risen dramatically
normally less than 5ms, but then starting around 11:29 jumped way up to ~20ms
which .... might mean that it was sending requests to the upstreams faster than they were responding, until the frontend's queues filled up and it became unable to accept more connections itself?
zas: what ubuntu image should I install the new CAA server? 20.04 focal?
zas
ruaok: if possible yes
basically we have too many connections incoming to 443, saturating queues, leading to failed connections
lucifer
and http is working because no one is using it?
zas
yes
kepstin
right, and the queues are backed up specifically because responses to stuff forwarded to CAA backends aren't coming back fast enough, i think?
lucifer
are we overloaded in general or some extra traffic from particular ips?
kepstin
i guess this is a problem i wouldn't have anticipated with running multiple sites on a single external endpoint - an issue with one service can take down others :/
so if the CAA backends are the problem in particular, stubbing out caa requests to immediately return an error rather than forward to backend could bring the rest of the site back up?
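The stubbing kepstin suggests is a one-line rule at the frontend. The location prefix below is a placeholder, not the real routing rule for the CAA:

```nginx
# Hypothetical sketch: short-circuit requests for the overloaded service
# at the frontend so the other sites behind the same endpoint recover.
location /coverart/ {
    return 503;   # fail fast instead of queueing for a slow backend
}
```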
lucifer
but mean connect times seem to up in general for all hosts?
kepstin
hmm. i might have just not looked at enough different graphs then :)
hmm. so if it is affecting all backends more or less evenly then i guess it is a frontend issue then :/
ruaok
DNS for CAA has been pointed to the new server.
now waiting for it to come back up and get the new server setup going.
kepstin
there are some tuning options for nginx to increase the number of backlogged connections which might help. requires changing some sysctls and matching nginx config. but i don't know if that's something you've already done.
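The sysctl-plus-nginx pairing kepstin refers to looks roughly like this; the values are illustrative, not recommendations for these particular gateways:

```nginx
# Matching pair of changes (values are illustrative):
# 1) kernel accept-queue limits, e.g. /etc/sysctl.d/99-backlog.conf
#      net.core.somaxconn = 4096
#      net.ipv4.tcp_max_syn_backlog = 8192
# 2) nginx must request the larger accept queue explicitly:
server {
    listen 443 ssl http2 backlog=4096;
}
```

Raising the sysctls alone does nothing for nginx: the listener only gets the larger queue if the `backlog=` parameter asks for it.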
lucifer: packets from different IPs, over 100k SYN packets to port 443, top IP has 328
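A per-source-IP SYN count like kepstin's can be produced with a short pipeline over a tcpdump text capture. The capture lines below are fabricated samples; on a gateway you would feed in `tcpdump -nn 'tcp[tcpflags] & tcp-syn != 0 and dst port 443'` instead.

```shell
# Count SYNs per source IP from tcpdump-style text output. Sample lines
# are invented; field 3 is the source as ip.ip.ip.ip.port.
capture='10:00:01 IP 198.51.100.7.51000 > 203.0.113.1.443: Flags [S]
10:00:01 IP 198.51.100.7.51001 > 203.0.113.1.443: Flags [S]
10:00:02 IP 192.0.2.9.40000 > 203.0.113.1.443: Flags [S]'
top=$(printf '%s\n' "$capture" \
  | awk '{split($3, a, "."); print a[1]"."a[2]"."a[3]"."a[4]}' \
  | sort | uniq -c | sort -rn | head -1)
printf 'top source: %s\n' "$top"
```

A flat distribution like the one zas reports (100k+ SYNs, top IP only 328) points at organic or widely distributed traffic rather than a single abusive client worth blocking.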
lucifer
makes sense
kepstin
(disk performance due to logging sometimes can cause things to fall over when traffic reaches a certain point, but I don't see anything obvious in the disk graphs)