akshat: meeting is an hour later today :) (EU DST switch)
akshat
Oh right lucifer! I'll have tea up and ready then :)
CatQuest
wait, so it's not 19 but 20?
lucifer
CatQuest, i think it's the same time for you since you are in the EU. it only changed for those outside it
akshat
Cool, let's do it now then, outsidecontext. The automation is done so far; for the rest of the stuff we need to directly follow the Fastlane documentation
zas
lucifer: we could, but that's a prod server, it is likely to explode if we do that
basically we don't know what broke....
CatQuest
lucifer: honestly i don't change times on my clocks because i'm dead against it :[
zas
and why suddenly?
CatQuest
it's 18:50 for me now
lucifer
yeah indeed
zas
I didn't see any error at the openresty level that could explain the problem, but...
musicbrainz is using hard certs, and it doesn't work either
CatQuest
if there is anything i can do, help test, etc please holler
ruaok
there are very few entries in nginx error logs and those that appear are ... benign.
akshat
outsidecontext: hit `fastlane run download_from_playstore`
ruaok
lots of non-http traffic getting through.
*non-https
akshat
But have a `play_config.json` ready in your local setup and don't add it to git
zas
yes...
so I suspect https is broken for both LE & hard certs, which leads us to ... DNS? the openresty config didn't change (I reverted recent changes), nor did the openresty version
ruaok
how many instances of openresty are supposed to be running? I see 8 instances in top, but that seems too few.
fwiw, when it *does* load it loads fine, all css etc
zas
and http loads fast
ruaok
ok, how about this interpretation: TLS fails because we're too busy and it times out? we're too busy because of something else and we're chasing the wrong symptom?
ok, my idea would hold if something on the TLS path would cause something to slow down.
we're CPU bound, and the TLS handshake work backs up, causing listens to drop.
but what is causing the slowdown?
zas
I have another theory: since everything is slow, we exhaust open sockets or the like, and ssl fails due to that
but something is the root of this slowdown
ruaok
yes, that could do it.
dmesg has messages like: [ 5026.378971] TCP: too many orphaned sockets
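(A rough way to quantify that kernel message, as a minimal sketch: it assumes a Linux gateway with Node.js available; the /proc paths are standard, everything else here is hypothetical and not part of the actual setup.)

```typescript
// Compare the current orphaned-socket count against the kernel limit whose
// exhaustion produces "TCP: too many orphaned sockets".
import { readFileSync } from "node:fs";

// /proc/net/sockstat contains a line like:
//   TCP: inuse 123 orphan 456 tw 789 alloc 1000 mem 200
const sockstat = readFileSync("/proc/net/sockstat", "utf8");
const tcpLine = sockstat.split("\n").find((line) => line.startsWith("TCP:")) ?? "";
const orphans = Number(tcpLine.match(/orphan (\d+)/)?.[1] ?? NaN);

// The sysctl the kernel checks before it starts killing orphaned sockets.
const maxOrphans = Number(
  readFileSync("/proc/sys/net/ipv4/tcp_max_orphans", "utf8").trim()
);

console.log(`orphaned TCP sockets: ${orphans} / limit ${maxOrphans}`);
```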
rdswift
A cron job gone bad?
lucifer
DDoS?
ruaok
lucifer: a DDOS would not allow our HTTP traffic to flow freely.
CatQuest
i was gonna ask
lucifer
ah makes sense
ruaok
rdswift: something kinda like that is my guess. we don't really have cron jobs on the gateways, but something is eating CPU.
zas
cpu is high, temp too, network traffic on eth0 & eth1 is high, so I guess load is "normal". it doesn't explain why https fails that much, especially for hard certs which don't depend on anything
nginx processes have high prio
ruaok
nothing stands out as weird.
zas
can it be a cascading effect? slowdown leading to https failures leading to slowdown?
lucifer
ping from kiki to outside is working again. should we retry switching?
ruaok
zas: it could be.
zas
well, we can, but since we didn't find the reason behind this shit...
ruaok
herb/kiki might just be saturated. and it's a holiday in the EU.
we can rule out hardware failure since HTTP is working.
CatQuest
it might be fans :P
ah ignore me
ruaok
one thing that speaks against a cascading failure is that HTTP is working.
if everything was overloaded, HTTP would suck too.
zas
yes, but http doesn't require as many resources
but I agree, seems weird
ruaok
and HTTPS is not *so* much more resource intensive that it should fail like this.
zas
ok, any https specialist around?
rdswift
Could the certs have changed but for some reason nginx didn't reload them?
there was a drastic spike in the LB traffic just before this started.
zas
hmmm
ruaok
if something picked up huesound, that would give 25 requests for 1 LB request.
25 CAA requests for 1 LB request.
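(A minimal sketch of that fan-out, assuming huesound shows a 5x5 grid of covers per page; the interface fields and function name below are made up for illustration, not the real LB API shape.)

```typescript
// One LB huesound request returns a grid of releases; the browser then fetches
// one cover thumbnail per release from the CAA, each an HTTPS connection
// through the same gateways -- roughly 25 CAA requests per LB request.
interface HuesoundRelease {
  release_mbid: string;
  caa_id: number;
}

function coverArtUrls(releases: HuesoundRelease[]): string[] {
  return releases.map(
    (r) =>
      `https://coverartarchive.org/release/${r.release_mbid}/${r.caa_id}-250.jpg`
  );
}
```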
CatQuest
oooooooh
ruaok
and that might be a problem.
CatQuest: good idea.
lucifer: wanna try something?
CatQuest
:)
lucifer
sure. what?
ruaok
try deploying lb-web-prod using NON-https for CAA URLs.
lucifer
on it
ruaok
thanks.
kepstin
one tricky thing to note if you have an https endpoint is that people might be doing http/2 to it, and multiple requests can be in progress on a single http/2 connection that get reverse proxied to many individual http/1.1 requests
so pure connection count comparisons between the two might not be valid
lucifer
ruaok: also bring down the container on gaga which hits CAA?
ROpdebee
is there any specific reason for that to go through CAA at all? IIRC after the plex situation, MB switched to linking directly to IA instead of following the redirect
ruaok
the logic for going direct is complex and not available to LB, ROpdebee
CatQuest
getting 502's on listenbrainz
lucifer
i don't think the http thing is going to work. i took LB prod down while the image built but nothing seems to have changed.
ruaok
the CAA had a drastic spike in traffic before it all went bad.
lucifer: we may be trying the wrong thing.
zas: thoughts on blocking the CAA from the gateways?
lucifer
makes sense. let's bring CAA down for a while
ruaok
agreed.
zas?
lucifer
yeah like that. everything is inaccessible anyways
ruaok
if the CAA is the problem, we can get another server to run that.
ROpdebee
ruaok: isn't it just `https://archive.org/download/mbid-${selectedRelease.release_mbid}/mbid-${selectedRelease.release_mbid}-${selectedRelease.caa_id}_thumb250.jpg`?
or am i missing something
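(A minimal sketch of the two URL shapes under discussion: the direct form is the one ROpdebee quoted above; the CAA form assumes the standard -250 thumbnail convention and redirects to archive.org. Function names are made up for illustration.)

```typescript
// Two ways to point at the same 250px thumbnail. Going through the CAA costs
// an extra HTTPS request that redirects to archive.org; going direct skips
// coverartarchive.org but hard-codes the IA item naming convention.
function caaThumbUrl(releaseMbid: string, caaId: number): string {
  return `https://coverartarchive.org/release/${releaseMbid}/${caaId}-250.jpg`;
}

function directIaThumbUrl(releaseMbid: string, caaId: number): string {
  return `https://archive.org/download/mbid-${releaseMbid}/mbid-${releaseMbid}-${caaId}_thumb250.jpg`;
}
```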
lucifer
yup, once we know the problem we can plan the mitigation but first we need to figure out the issue.