Today is a holiday, so I'm not around. Last week I spent most of my time attending a kubernetes training course at MTG. I also reviewed some LB PRs.
"""
Others up for review: yvanzo, bitmap, monkey, ruaok, lucifer, zas, akshat, reosarevok, CatQuest, Freso – as always, let me know if you too want to give a review and you're not listed here! :)
I mostly reviewed PRs and looked into improving docker server configs
Freso
bitmap and anyone else in the US/North America/Turtle Island: note that next week’s meeting will be an hour different again (back to normal time) if you’re in an area that (stops) observing DST. :)
Take care of yourselves out there. :)
</BANG>
Possibly shortest meeting ever? :p
yvanzo
Thanks everyone :)
monkey
Just catching up with the drama. Anything I can help with?
ruaok
send chocolate?
CatQuest
I only have nonstop?
but i mean i can send that
rdswift
<ruaok> send chocolate? Or chill the beer for when this is resolved.
ruaok
zas: lets regroup. what else have we learned?
zas
the problem is concerning https only (apparently, perhaps we need to ensure that)
load on gateways is higher than usual, we don't know if it's the cause or the consequence yet
ruaok
do we have any other indicators of abnormal behaviour right before the outage?
zas
it started when I switched floating IP to herb, but it can be due to another event at the same time
this is a different error from before and happens immediately
ruaok
kepstin: ping
kepstin
hola
ruaok
kepstin: hey.
kepstin
i've read a bit of the backlog, but i don't really have any insight as to what's going wrong :/
ruaok
have you been following along? I think you could be really helpful to us.
ah.
what are your go-to things to check when a server is overloaded?
what would you do and look at?
ruaok is trying to get ideas for things to do next
zas: we have redirects from HTTP -> HTTPS in place for a lot of things.
what if we turned them off and told people via twitter/blog to use HTTP for the time being?
kepstin
well, narrow down the specific thing that's overloaded - where is the error page being generated, which component is it failing to connect to?
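A minimal sketch of the narrowing-down kepstin describes: time each hop separately with curl's write-out variables and compare the frontend path against a direct backend hit. The hostnames and port here are placeholders, not the actual MetaBrainz topology.

```shell
# Per-layer latency probe using curl's timing variables. Comparing the
# public frontend against a direct backend request shows which hop is slow.
fmt='connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s\n'
# Against the public HTTPS frontend (placeholder host):
#   curl -so /dev/null -w "$fmt" https://example.org/
# Directly against a backend, bypassing the gateway (placeholder host):
#   curl -so /dev/null -w "$fmt" http://backend.internal:8080/
printf 'probe format: %s\n' "$fmt"
```

If `time_connect` is already high at the frontend but low when hitting the backend directly, the gateway itself is the bottleneck.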
i did mention about the http/2 thing earlier, dunno if you saw that?
ruaok
kepstin: it appears that all our HTTP traffic is totally fine and snappy.
kepstin: saw that, but it didn't lead me to a new direction.
kepstin
hmm, it just means that if you're measuring traffic by number of connections, you'll overcount http or undercount https
yvanzo
BrainzGit is running normally from aretha.
ruaok
but none of our HTTPS traffic is going through. we checked certs and so on.. fine. no changes to the HTTP stuff when this happened
yvanzo
Cert for tickets is fine too.
ruaok
kepstin: understood.
CatQuest
yea tickets loads
kepstin
and also that https can be a connection amplifier, since most http/2 reverse-proxies convert multiple requests in one connection to multiple http/1.1 connections to the backend
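The amplification kepstin mentions has a standard nginx-side mitigation: keep a pool of reused backend connections instead of opening a fresh HTTP/1.1 connection per multiplexed HTTP/2 request. This is a generic sketch with placeholder addresses, not the actual gateway config.

```nginx
# Hypothetical sketch: reuse backend connections instead of opening a
# fresh HTTP/1.1 connection for every multiplexed HTTP/2 request.
upstream app_backend {
    server 10.0.0.10:8080;   # placeholder backend
    keepalive 32;            # idle connections kept open for reuse
}
server {
    location / {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # required for upstream keepalive
    }
}
```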
ruaok
we're also getting TCP listen drop messages.
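The listen-drop messages can be quantified from the kernel's own counters. The sample `netstat -s` excerpt below is fabricated for illustration; on a live gateway you would pipe `netstat -s` in directly.

```shell
# Listen-queue overflow counters as reported by `netstat -s` (sample text
# below is illustrative; numbers are invented).
stats='    4123 times the listen queue of a socket overflowed
    4123 SYNs to LISTEN sockets dropped'
drops=$(printf '%s\n' "$stats" | grep -ci 'listen')
printf 'listen-queue counters found: %s\n' "$drops"
```

Both counters climbing together is the classic sign of a saturated accept queue rather than a backend error.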
yvanzo
But tickets are not running from the same set of servers.
ruaok
but we think that is symptom, not a cause
kepstin: that would be great for us -- our gateways, the HTTPS endpoints are having a hard time. the backends are bored.
CatQuest
hm, might be worth checking that out tho, to rule it out at least
kepstin
hmm. if you have https ocsp pinning enabled, consider turning that off?
ruaok
zas: ^^
zas
let me check that
kepstin
probably not the issue, but it can cause some server configs to need to make outgoing connections to verify the ocsp before responding
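For reference, turning this off in nginx is two directives. A hypothetical sketch of what zas would change in the TLS server block, assuming stapling was enabled there:

```nginx
# Hypothetical sketch: disable OCSP stapling so the proxy never blocks
# on an outbound OCSP fetch while answering TLS handshakes.
ssl_stapling off;
ssl_stapling_verify off;
```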
i have to say, having connections refused on the border reverse proxies is a new experience for me
all the overload issues i've dealt with have been purely due to backend
ruaok
I believe we haven't ruled out the CAA enough yet.
CatQuest
honestly that huesound/caa thing seemed very suspicious
zas
yes, it started to rise at 16:19
so basically the switch happened when it was already 2x the usual traffic
ruaok
we're overloaded. clearly.
bitmap: you about?
bitmap
I'm here
ruaok
can you help me cobble together a CAA?
zas
kepstin: I disabled ssl stapling
ruaok
if I rent us a new dedicated server right now, can we get CAA up and running post-haste?
bitmap
ruaok: sure, I could help with that
ruaok
ok, I'm going to do that. that rules it out.
hang on.
kepstin
what's the outward-facing application that's actually handling requests on port 443? is that an nginx reverse proxy, haproxy, ???
gavinatkinson joined the channel
ruaok
openresty / nginx
kepstin
interesting that in the stats, the connect time to the upstream servers from that frontend seems to have risen dramatically
normally less than 5ms, but then starting around 11:29 jumped way up to ~20ms
which .... might mean that it was sending requests to the upstreams faster than they were responding, until the frontend's queues filled up and it became unable to accept more connections itself?
zas: what ubuntu image should I install the new CAA server? 20.04 focal?
zas
ruaok: if possible yes
basically we have too many connections incoming to 443, saturating queues, leading to failed connections
lucifer
and http is working because no one is using it?
zas
yes
kepstin
right, and the queues are backed up specifically because responses to stuff forwarded to CAA backends aren't coming back fast enough, i think?
lucifer
are we overloaded in general or some extra traffic from particular ips?
kepstin
i guess this is a problem i wouldn't have anticipated with running multiple sites on a single external endpoint - an issue with one service can take down others :/
so if the CAA backends are the problem in particular, stubbing out caa requests to immediately return an error rather than forward to backend could bring the rest of the site back up?
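The stubbing kepstin suggests is a one-line rule at the frontend. The location prefix below is a placeholder, not the real routing rule for the CAA:

```nginx
# Hypothetical sketch: short-circuit requests for the overloaded service
# at the frontend so the other sites behind the same endpoint recover.
location /coverart/ {
    return 503;   # fail fast instead of queueing for a slow backend
}
```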
lucifer
but mean connect times seem to up in general for all hosts?
kepstin
hmm. i might have just not looked at enough different graphs then :)
hmm. so if it is affecting all backends more or less evenly then i guess it is a frontend issue then :/
ruaok
DNS for CAA has been pointed to the new server.
now waiting for it to come back up and get the new server setup going.
kepstin
there are some tuning options for nginx to increase the number of backlogged connections which might help. requires changing some sysctls and matching nginx config. but i don't know if that's something you've already done.
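The sysctl-plus-nginx pairing kepstin refers to looks roughly like this; the values are illustrative, not recommendations for these particular gateways:

```nginx
# Matching pair of changes (values are illustrative):
# 1) kernel accept-queue limits, e.g. /etc/sysctl.d/99-backlog.conf
#      net.core.somaxconn = 4096
#      net.ipv4.tcp_max_syn_backlog = 8192
# 2) nginx must request the larger accept queue explicitly:
server {
    listen 443 ssl http2 backlog=4096;
}
```

Raising the sysctls alone does nothing for nginx: the listener only gets the larger queue if the `backlog=` parameter asks for it.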
lucifer: packets from different IPs, over 100k SYN packets to port 443, top IP has 328
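A per-source-IP SYN count like kepstin's can be produced with a short pipeline over a tcpdump text capture. The capture lines below are fabricated samples; on a gateway you would feed in `tcpdump -nn 'tcp[tcpflags] & tcp-syn != 0 and dst port 443'` instead.

```shell
# Count SYNs per source IP from tcpdump-style text output. Sample lines
# are invented; field 3 is the source as ip.ip.ip.ip.port.
capture='10:00:01 IP 198.51.100.7.51000 > 203.0.113.1.443: Flags [S]
10:00:01 IP 198.51.100.7.51001 > 203.0.113.1.443: Flags [S]
10:00:02 IP 192.0.2.9.40000 > 203.0.113.1.443: Flags [S]'
top=$(printf '%s\n' "$capture" \
  | awk '{split($3, a, "."); print a[1]"."a[2]"."a[3]"."a[4]}' \
  | sort | uniq -c | sort -rn | head -1)
printf 'top source: %s\n' "$top"
```

A flat distribution like the one zas reports (100k+ SYNs, top IP only 328) points at organic or widely distributed traffic rather than a single abusive client worth blocking.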
lucifer
makes sense
kepstin
(disk performance due to logging sometimes can cause things to fall over when traffic reaches a certain point, but I don't see anything obvious in the disk graphs)