akshat: meeting is an hour late today :). (EU DST switch)
2021-11-01 30546, 2021
akshat
Oh right lucifer! I'll have tea up and ready then :)
2021-11-01 30550, 2021
CatQuest
wait, oh, it's not 19 but 20?
2021-11-01 30528, 2021
lucifer
CatQuest, i think it's the same time for you as you are in the EU. it only changed for those outside it
2021-11-01 30533, 2021
akshat
Cool, let's do it now then, outsidecontext. The automation is done so far; for the rest of the stuff, we need to directly follow the Fastlane documentation
2021-11-01 30538, 2021
zas
lucifer: we could, but that's a prod server, it is likely to explode if we do that
2021-11-01 30559, 2021
zas
basically we don't know what broke....
2021-11-01 30500, 2021
CatQuest
lucifer: honestly i don't change times on my clocks because i'm dead against it :[
2021-11-01 30513, 2021
zas
and why suddenly
2021-11-01 30517, 2021
CatQuest
it's 18:50 for me now
2021-11-01 30525, 2021
lucifer
yeah indeed
2021-11-01 30542, 2021
zas
I didn't see any error at openresty level that could explain the problem, but...
2021-11-01 30515, 2021
zas
musicbrainz is using hard certs, and it doesn't work either
2021-11-01 30512, 2021
CatQuest
if there is anything i can do, help test, etc please holler
2021-11-01 30547, 2021
ruaok
there are very few entries in nginx error logs and those that appear are ... benign.
2021-11-01 30556, 2021
akshat
outsidecontext: hit `fastlane run download_from_playstore`
2021-11-01 30503, 2021
ruaok
lots of non-http traffic getting through.
2021-11-01 30521, 2021
ruaok
*non-https
2021-11-01 30523, 2021
akshat
But have a play_config.json ready in your local setup and don't add it to git
2021-11-01 30524, 2021
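A minimal pre-flight sketch in Python of the steps akshat describes: make sure play_config.json exists locally and is kept out of git, then invoke the fastlane action. The action name is taken verbatim from akshat's message, and the repo-root location of play_config.json is an assumption.

```python
# Sketch only: the paths and the fastlane action name are taken from the chat
# or assumed, not verified against the project's actual Play Store setup.
import pathlib
import subprocess
import sys

REPO_ROOT = pathlib.Path(".")                 # assumed: run from the repo root
PLAY_CONFIG = REPO_ROOT / "play_config.json"  # local credentials, keep out of git
GITIGNORE = REPO_ROOT / ".gitignore"

def main() -> None:
    if not PLAY_CONFIG.exists():
        sys.exit("play_config.json is missing; create it in your local setup first")

    # Warn if the credentials file could be committed by accident.
    ignored = GITIGNORE.exists() and "play_config.json" in GITIGNORE.read_text()
    if not ignored:
        print("warning: play_config.json is not listed in .gitignore")

    # Run the fastlane action exactly as quoted above.
    subprocess.run(["fastlane", "run", "download_from_playstore"], check=True)

if __name__ == "__main__":
    main()
```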
zas
yes...
2021-11-01 30555, 2021
zas
so I suspect https is broken, for both LE & hard certs, which leads us to ... DNS; openresty config didn't change (I reverted recent changes), nor did the openresty version
2021-11-01 30549, 2021
ruaok
how many instances of openresty are supposed to be running? I see 8 instances in top, but that seems too few.
fwiw, when it *does* load it loads fine, all css etc
2021-11-01 30503, 2021
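For ruaok's question about the expected openresty count: with `worker_processes auto;` nginx/openresty runs roughly one worker per CPU core plus a master process, so the number in top can be checked against the core count. A Linux-only sketch that counts matching processes by scanning /proc; the process names are assumptions about how the gateways run openresty.

```python
# Rough count of openresty/nginx processes, to compare against the configured
# worker_processes value and the machine's core count. Linux-only sketch.
import os

def count_processes(names=("nginx", "openresty")) -> int:
    count = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
        except OSError:
            continue  # process exited while we were scanning
        if comm in names:
            count += 1
    return count

if __name__ == "__main__":
    # With `worker_processes auto;` expect roughly one worker per core, plus the master.
    print(f"{count_processes()} nginx/openresty processes, {os.cpu_count()} CPU cores")
```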
zas
and http loads fast
2021-11-01 30510, 2021
ruaok
ok, how about this interpretation: TLS fails because we're too busy and it times out? we're too busy because of something else and we're chasing the wrong symptom?
2021-11-01 30546, 2021
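One way to probe this interpretation is to time a bare TCP connect to port 80 against a full TLS handshake on port 443: if the gateways are CPU bound, the handshake should be much slower or time out while the plain connect stays fast. A standard-library sketch; the target host and timeout are placeholders.

```python
# Compare plain TCP connect time vs. TCP + TLS handshake time to one gateway.
# Host and timeout are placeholders, not part of the original discussion.
import socket
import ssl
import time

HOST = "musicbrainz.org"  # placeholder target
TIMEOUT = 10.0

def tcp_connect_time(host: str, port: int) -> float:
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=TIMEOUT):
        return time.monotonic() - start

def tls_handshake_time(host: str, port: int = 443) -> float:
    context = ssl.create_default_context()
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=TIMEOUT) as sock:
        with context.wrap_socket(sock, server_hostname=host):
            return time.monotonic() - start

if __name__ == "__main__":
    print(f"tcp :80   {tcp_connect_time(HOST, 80):.3f}s")
    print(f"tls :443  {tls_handshake_time(HOST):.3f}s")
```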
ruaok
ok, my idea would hold if something on the TLS path were causing things to slow down.
2021-11-01 30515, 2021
ruaok
we're CPU bound, and the TLS handshake computation backs up, causing listens to drop.
2021-11-01 30527, 2021
ruaok
but what is causing the slowdown?
2021-11-01 30547, 2021
zas
I have another theory: since everything is slow, we exhaust open sockets or the like, and ssl fails because of that
2021-11-01 30509, 2021
zas
but something is the root of this slowdown
2021-11-01 30510, 2021
ruaok
yes, that could do it.
2021-11-01 30555, 2021
ruaok
dmesg has messages with: [ 5026.378971] TCP: too many orphaned sockets
2021-11-01 30536, 2021
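That dmesg line means the kernel is hitting its orphaned-socket ceiling, which fits zas's socket-exhaustion theory. A Linux-only sketch to check the current orphan count against the limit on the gateway:

```python
# Read the current TCP orphan count and the kernel limit that the
# "TCP: too many orphaned sockets" message refers to. Run on the gateway.
def read_orphan_count() -> int:
    with open("/proc/net/sockstat") as f:
        for line in f:
            if line.startswith("TCP:"):
                fields = line.split()
                # Format: TCP: inuse N orphan N tw N alloc N mem N
                return int(fields[fields.index("orphan") + 1])
    raise RuntimeError("no TCP line in /proc/net/sockstat")

def read_orphan_limit() -> int:
    with open("/proc/sys/net/ipv4/tcp_max_orphans") as f:
        return int(f.read())

if __name__ == "__main__":
    print(f"orphaned sockets: {read_orphan_count()} / limit {read_orphan_limit()}")
```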
rdswift
A cron job gone bad?
2021-11-01 30546, 2021
lucifer
DDoS?
2021-11-01 30559, 2021
ruaok
lucifer: a DDoS would not allow our HTTP traffic to flow freely.
2021-11-01 30501, 2021
CatQuest
i was gonna ask
2021-11-01 30517, 2021
lucifer
ah makes sense
2021-11-01 30526, 2021
ruaok
rdswift: something kinda like that is my guess. we don't really have cron jobs on the gateways, but something is eating CPU.
cpu is high, temp too, network traffic on eth0 & eth1 is high, so I guess load is "normal"; it doesn't explain why https fails that much, especially for hard certs, which don't depend on anything
2021-11-01 30550, 2021
zas
nginx processes have high prio
2021-11-01 30519, 2021
ruaok
nothing stands out as weird.
2021-11-01 30526, 2021
zas
can it be a cascading effect? slowdown leading to https failures leading to slowdown?
2021-11-01 30524, 2021
lucifer
ping from kiki to outside is working again. should we retry switching?
2021-11-01 30501, 2021
ruaok
zas: it could be.
2021-11-01 30512, 2021
zas
well, we can, but since we didn't find the reason behind this shit...
2021-11-01 30524, 2021
ruaok
herb/kiki might just be saturated. and it's a holiday in the EU.
2021-11-01 30542, 2021
ruaok
we can rule out hardware failure since HTTP is working.
2021-11-01 30546, 2021
CatQuest
it might be fans :P
2021-11-01 30555, 2021
CatQuest
ah ignore me
2021-11-01 30557, 2021
ruaok
one thing that speaks against a cascading failure is that HTTP is working.
2021-11-01 30505, 2021
ruaok
if everything was overloaded, HTTP would suck too.
2021-11-01 30518, 2021
zas
yes, but http doesn't require as many resources
2021-11-01 30529, 2021
zas
but I agree, seems weird
2021-11-01 30530, 2021
ruaok
and HTTPS is not *that* much more resource intensive, so it shouldn't fail like this.
2021-11-01 30506, 2021
zas
ok, any https specialist around?
2021-11-01 30502, 2021
rdswift
Could the certs have changed but for some reason nginx didn't reload them?
There was a drastic spike in the LB traffic just before this started.
2021-11-01 30508, 2021
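rdswift's stale-certificate idea is easy to check: compare the certificate the gateway is actually serving with the file on disk. A standard-library sketch; the host and certificate path are placeholders, and the on-disk file is assumed to start with the leaf certificate.

```python
# Compare the served certificate against the file on disk; a mismatch would
# mean nginx/openresty never reloaded the new cert. Host and path are placeholders.
import hashlib
import ssl

HOST = "musicbrainz.org"               # placeholder host
CERT_ON_DISK = "/etc/ssl/example.pem"  # placeholder path, not the real layout

def first_pem_block(pem: str) -> str:
    # Keep only the first certificate if the file holds a full chain.
    end = "-----END CERTIFICATE-----"
    return pem.split(end, 1)[0] + end + "\n"

def fingerprint(pem: str) -> str:
    return hashlib.sha256(ssl.PEM_cert_to_DER_cert(pem)).hexdigest()

if __name__ == "__main__":
    served = ssl.get_server_certificate((HOST, 443))
    with open(CERT_ON_DISK) as f:
        on_disk = first_pem_block(f.read())
    if fingerprint(served) == fingerprint(on_disk):
        print("served certificate matches the one on disk")
    else:
        print("mismatch: the gateway is serving a different certificate")
```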
zas
hmmm
2021-11-01 30510, 2021
ruaok
if something picked up huesound, that would give 25 requests for 1 LB request.
2021-11-01 30518, 2021
ruaok
25 CAA requests for 1 LB request.
2021-11-01 30532, 2021
CatQuest
oooooooh
2021-11-01 30534, 2021
ruaok
and that might be a problem.
2021-11-01 30559, 2021
ruaok
CatQuest: good idea.
2021-11-01 30505, 2021
ruaok
lucifer: wanna try something?
2021-11-01 30506, 2021
CatQuest
:)
2021-11-01 30517, 2021
lucifer
sure. what?
2021-11-01 30525, 2021
ruaok
try deploying lb-web-prod using non-HTTPS for the CAA URLs.
2021-11-01 30532, 2021
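Roughly, that deploy amounts to building the Cover Art Archive URLs with a configurable scheme so the CAA fetches skip TLS. A hypothetical sketch only: the names and the config flag are illustrative, not ListenBrainz's actual code, and the CAA thumbnail URL form is assumed.

```python
# Illustrative only: not the real lb-web-prod change, just the shape of it.
CAA_USE_HTTPS = False  # in practice this would come from the deployment config

def caa_cover_url(release_mbid: str, caa_id: int, size: int = 250) -> str:
    # Assumed CAA thumbnail form: /release/<mbid>/<image-id>-<size>.jpg
    scheme = "https" if CAA_USE_HTTPS else "http"
    return f"{scheme}://coverartarchive.org/release/{release_mbid}/{caa_id}-{size}.jpg"
```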
lucifer
on it
2021-11-01 30535, 2021
ruaok
thanks.
2021-11-01 30535, 2021
kepstin
one tricky thing to note if you have an https endpoint is that people might be doing http/2 to it, and multiple requests can be in progress on a single http/2 connection that get reverse proxied to many individual http/1.1 requests
2021-11-01 30557, 2021
kepstin
so pure connection count comparisons between the two might not be valid
2021-11-01 30516, 2021
lucifer
ruaok: also bring down the container on gaga which hits the CAA?
is there any specific reason for that to go through CAA at all? IIRC after the plex situation, MB switched to linking directly to IA instead of following the redirect
2021-11-01 30524, 2021
ruaok
the logic for going direct is complex and not available to LB, ROpdebee
2021-11-01 30539, 2021
CatQuest
getting 502's on listenbrainz
2021-11-01 30540, 2021
lucifer
i don't think the http thing is going to work. i took LB prod down while the image built but nothing seems to have changed.
the CAA had a drastic spike in traffic before it all went bad.
2021-11-01 30501, 2021
ruaok
lucifer: we may be trying the wrong thing.
2021-11-01 30513, 2021
ruaok
zas: thoughts on blocking the CAA from the gateways
2021-11-01 30514, 2021
ruaok
?
2021-11-01 30515, 2021
lucifer
makes sense. let's bring CAA down for a while
2021-11-01 30536, 2021
ruaok
agreed.
2021-11-01 30537, 2021
ruaok
zas?
2021-11-01 30539, 2021
lucifer
yeah like that. everything is inaccessible anyways
2021-11-01 30526, 2021
ruaok
if the CAA is the problem, we can get another server to run it.
2021-11-01 30555, 2021
ROpdebee
ruaok: isn't it just `https://archive.org/download/mbid-${selectedRelease.release_mbid}/mbid-${selectedRelease.release_mbid}-${selectedRelease.caa_id}_thumb250.jpg`?
2021-11-01 30502, 2021
ROpdebee
or am i missing something
2021-11-01 30507, 2021
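For reference, ROpdebee's template in Python, next to the redirect form that goes through coverartarchive.org; as ruaok says, MusicBrainz's actual "go direct" logic handles more cases than this naive mapping, and the CAA URL form here is assumed.

```python
def ia_direct_thumb_url(release_mbid: str, caa_id: int) -> str:
    # Mirrors ROpdebee's template literal verbatim.
    return (f"https://archive.org/download/mbid-{release_mbid}/"
            f"mbid-{release_mbid}-{caa_id}_thumb250.jpg")

def caa_redirect_thumb_url(release_mbid: str, caa_id: int) -> str:
    # Assumed form that hits coverartarchive.org and redirects to the Internet Archive.
    return f"https://coverartarchive.org/release/{release_mbid}/{caa_id}-250.jpg"
```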
lucifer
yup, once we know the problem we can plan the mitigation but first we need to figure out the issue.