#metabrainz

      • monkey is a bit late, please don't call me first :)
      • yvanzo
        monkey: not sure the meeting will take place as usual since MeB websites are down.
      • monkey
        Oop, thanks
      • Freso
        <BANG>
      • It’s World Vegan Monday!
      • I’ve received
      • alastairp: Go!
      • alastairp says…
      • """
      • Today is a holiday, so I'm not around. Last week I spent most of my time attending a kubernetes training course at MTG. I also reviewed some LB PRs.
      • """
      • Others up for reviews: yvanzo, bitmap, monkey, ruaok, lucifer, zas, akshat, reosarevok, CatQuest, Freso – as aye, let me know if you too want to give a review and you’re not listed here! :)
      • monkey: Go!
      • ;)
      • Nah. yvanzo, go! :)
      • yvanzo
        Hi!
      • Freso
        Oh. Uh.
      • Yeah.
      • zas
        Freso: not the right time for me....
      • Freso
        Go focus on sites.
      • ruaok
      • BrainzBot
        LB-992: Too many LB full dumps on FTP site
      • lucifer
        👍
      • yvanzo
        I mostly reviewed PRs and looked into improving docker server configs
      • Freso
        bitmap and anyone else in the US/North America/Turtle Island: note that next week’s meeting will be an hour different again (back to normal time) if you’re in an area that (stops) observing DST. :)
      • Take care of yourselves out there. :)
      • </BANG>
      • Possibly shortest meeting ever? :p
      • yvanzo
        Thanks everyone :)
      • monkey
        Just catching up with the drama. Anything I can help with?
      • ruaok
        send chocolate?
      • CatQuest
        I only have nonstop?
      • but i mean i can send that
      • rdswift
        <ruaok> send chocolate? Or chill the beer for when this is resolved.
      • ruaok
        zas: let's regroup. what else have we learned?
      • zas
        the problem concerns https only (apparently; perhaps we need to confirm that)
      • load on gateways is higher than usual, we don't know if it's the cause or the consequence yet
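(A quick way to confirm the "HTTPS only" theory is to time both schemes from an outside host; this is a sketch, with musicbrainz.org standing in for whichever MeB site is affected.)

```sh
# Compare plain HTTP vs HTTPS connection/response timing from outside the network
curl -so /dev/null -w 'connect=%{time_connect}s tls=%{time_appconnect}s total=%{time_total}s\n' http://musicbrainz.org/
curl -so /dev/null -w 'connect=%{time_connect}s tls=%{time_appconnect}s total=%{time_total}s\n' https://musicbrainz.org/
```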
      • ruaok
        do we have any other indicators of abnormal behaviour right before the outage?
      • zas
        it started when I switched floating IP to herb, but it can be due to another event at the same time
      • ruaok
        the CAA one is the only one I've seen.
      • do you know exactly when you did that?
      • zas
        to me, everything was fine before the switch
      • 01/11/21 04:32 pm CET
      • according to hetzner robot web service log
      • ruaok
        the CAA load spike was in progress at 4:26pm CET.
      • zas
        hmmm
      • ruaok
        perhaps the switch just made everything worse.
      • lucifer
      • this is a different error from before and happens immediately
      • ruaok
        kepstin: ping
      • kepstin
        hola
      • ruaok
        kepstin: hey.
      • kepstin
        i've read a bit of the backlog, but i don't really have any insight as to what's going wrong :/
      • ruaok
        have you been following along? I think you could be really helpful to us.
      • ah.
      • what are your go-to things to check when a server is overloaded?
      • what would you do and look at?
      • ruaok is trying to get ideas for things to do next
      • zas: we have redirects from HTTP -> HTTPS in place for a lot of things.
      • what if we turned them off and told people via twitter/blog to use HTTP for the time being?
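(For context, the redirects ruaok mentions typically look like the generic nginx sketch below; this is not the actual MetaBrainz gateway config, just an illustration of what would be switched off or replaced to serve plain HTTP temporarily.)

```nginx
# Typical port-80 server block that bounces everything to HTTPS.
server {
    listen 80;
    server_name musicbrainz.org;           # example hostname
    return 301 https://$host$request_uri;  # disabling the redirect means replacing
                                           # this with the normal proxy_pass config
}
```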
      • kepstin
        well, narrow down the specific thing that's overloaded - where is the error page being generated, which component is it failing to connect to?
      • i did mention the http/2 thing earlier, dunno if you saw that?
      • ruaok
        kepstin: it appears that all our HTTP traffic is totally fine and snappy.
      • kepstin: saw that, but it didn't lead me to a new direction.
      • kepstin
        hmm, it just means that if you're measuring traffic by number of connections, you'll overcount http or undercount https
      • yvanzo
        BrainzGit is running normally from aretha.
      • ruaok
        but none of our HTTPS traffic is going through. we checked certs and so on... fine. no changes to the http stuff when this happened
      • yvanzo
        Cert for tickets is fine too.
      • ruaok
        kepstin: understood.
      • CatQuest
        yea tickets loads
      • kepstin
        and also that https can be a connection amplifier, since most http/2 reverse-proxies convert multiple requests in one connection to multiple http/1.1 connections to the backend
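(A sketch of the amplification kepstin describes, assuming a generic nginx/openresty frontend rather than the real config: many requests multiplexed over one client-side HTTP/2 connection each become a separate HTTP/1.1 request to the upstream unless keepalive reuse is configured.)

```nginx
server {
    listen 443 ssl http2;                 # one client connection, many multiplexed requests
    location / {
        proxy_http_version 1.1;           # each proxied request goes upstream as HTTP/1.1
        proxy_set_header Connection "";   # needed (with an upstream keepalive pool) to reuse connections
        proxy_pass http://backend_pool;   # hypothetical upstream name
    }
}
```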
      • ruaok
        we're also getting TCP listen drop messages.
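(The listen drops ruaok mentions are usually visible with something like the following on the gateway host; commands are a sketch, not taken from the incident itself.)

```sh
# SYN/accept-queue drops reported by the kernel
nstat -az TcpExtListenDrops TcpExtListenOverflows
# older-style equivalent
netstat -s | grep -iE 'listen|SYNs to LISTEN'
# current vs maximum accept-queue usage for the HTTPS listener
ss -ltn 'sport = :443'
```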
      • yvanzo
        But tickets are not running from the same set of servers.
      • ruaok
      • but we think that is a symptom, not a cause
      • kepstin: that would be great for us -- our gateways, the HTTPS endpoints are having a hard time. the backends are bored.
      • CatQuest
        hm, might be worth checking that out tho, to rule it out at least
      • kepstin
        hmm. if you have https ocsp pinning enabled, consider turning that off?
      • ruaok
        zas: ^^
      • zas
        let me check that
      • kepstin
        probably not the issue, but it can cause some server configs to need to make outgoing connections to verify the ocsp before responding
      • i have to say, having connections refused on the border reverse proxies is a new experience for me
      • ruaok
        zas: when I zoom in on the CAA graph: https://stats.metabrainz.org/d/000000061/mbstat...
      • kepstin
        all the overload issues i've dealt with have been purely due to the backend
      • ruaok
        I believe we haven't ruled out the CAA enough yet.
      • CatQuest
        honestly that huesound/caa thing seemed very suspicious
      • zas
        yes, it started to rise at 16:19
      • so basically the switch happened when it was already at 2x the usual traffic
      • ruaok
        we're overloaded. clearly.
      • bitmap: you about?
      • bitmap
        I'm here
      • ruaok
        can you help me cobble together a CAA?
      • zas
        kepstin: I disabled ssl stapling
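(In nginx/openresty that change is typically just the following directives; shown as a sketch rather than the exact gateway config.)

```nginx
ssl_stapling off;
ssl_stapling_verify off;
```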
      • ruaok
        if I rent us a new dedicated server right now, can we get CAA up and running post-haste?
      • bitmap
        ruaok: sure, I could help with that
      • ruaok
        ok, I'm going to do that. that rules it out.
      • hang on.
      • kepstin
        what's the outward-facing application that's actually handling requests on port 443? is that an nginx reverse proxy, haproxy, ???
      • gavinatkinson joined the channel
      • ruaok
        openresty / nginx
      • kepstin
        interesting that in the stats, the connect time to the upstream servers from that frontend seems to have risen dramatically
      • normally less than 5ms, but then starting around 11:29 jumped way up to ~20ms
      • which .... might mean that it was sending requests to the upstreams faster than they were responding, until the frontend's queues filled up and it became unable to accept more connections itself?
      • ruaok
        kepstin: which graph are you looking at?
      • kepstin
        "Mean connect time per upstream"
      • (on the CAA graph link you'd sent to zas)
      • zas
        kepstin: yes, that's what's happening
      • ruaok
        zas: what ubuntu image should I install on the new CAA server? 20.04 focal?
      • zas
        ruaok: if possible yes
      • basically we have too many connections incoming to 443, saturating queues, leading to failed connections
      • lucifer
        and http is working because no one is using it?
      • zas
        yes
      • kepstin
        right, and the queues are backed up specifically because responses to stuff forwarded to CAA backends aren't coming back fast enough, i think?
      • lucifer
        are we overloaded in general, or is it extra traffic from particular IPs?
      • kepstin
        i guess this is a problem i wouldn't have anticipated with running multiple sites on a single external endpoint - an issue with one service can take down others :/
      • so if the CAA backends are the problem in particular, stubbing out caa requests to immediately return an error rather than forward to backend could bring the rest of the site back up?
      • lucifer
        but mean connect times seem to be up in general for all hosts?
      • kepstin
        hmm. i might have just not looked at enough different graphs then :)
      • lucifer
      • for one instance
      • kepstin
        hmm. so if it is affecting all backends more or less evenly then i guess it is a frontend issue then :/
      • ruaok
        DNS for CAA has been pointed to the new server.
      • now waiting for it to come back up and get the new server setup going.
      • kepstin
        there are some tuning options for nginx to increase the number of backlogged connections which might help. they require changing some sysctls and matching nginx config. but i don't know if that's something you've already done.
      • ruaok
      • kepstin
        (and also file descriptor limits can cause issues)
      • zas
        kepstin: yes, I did that already, but here we have more than enough queue sizes for "normal" situations
      • ruaok: let's see if it helps
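(Roughly the knobs kepstin is referring to, with placeholder values; zas notes these were already raised on the gateways.)

```sh
# Kernel-side accept/SYN queue limits
sysctl -w net.core.somaxconn=4096
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
# Matching nginx side (in the gateway config):
#   listen 443 ssl backlog=4096;
#   worker_rlimit_nofile 65535;   # the file descriptor limit kepstin also mentions
```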
      • CatQuest
        wait purple? like in "deep" whoo hoo
      • ruaok
        bitmap: if you can start working out how we can get the CAA up without consul and all that...
      • riksucks
        btw this spike in traffic, it's all organic right?
      • ruaok
        riksucks: unsure
      • lucifer
        we don't know that
      • CatQuest
        organic?
      • ruaok
        CatQuest: free range.
      • CatQuest
        uuhhh
      • lucifer
        not someone intentionally overloading us.
      • bitmap
        ruaok: can you add this ssh key to bitmap? https://gist.github.com/mwiencek/efe7236012099d...
      • CatQuest
        not ddos, ah
      • bitmap
        I guess it only has my laptop one
      • ruaok
        oh, it's not ready yet.
      • still working toward server setup
      • a few more mins
      • bitmap
        ok!
      • zas
        lucifer: packets from different IPs, over 100k SYN packets to port 443, top IP has 328
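(Numbers like these can be pulled from a short capture on the gateway; a sketch, with the interface name and packet count as placeholders.)

```sh
# Capture 100k SYNs to port 443, then count them per source IP
tcpdump -ni eth0 -c 100000 -w syns.pcap 'tcp dst port 443 and tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn'
tcpdump -nr syns.pcap | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head
```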
      • lucifer
        makes sense
      • kepstin
        (disk performance due to logging sometimes can cause things to fall over when traffic reaches a certain point, but I don't see anything obvious in the disk graphs)
      • lucifer
        so it should most probably be organic traffic.