akshat: meeting is an hour late today :). (EU DST switch)
2021-11-01 30546, 2021
akshat
Oh right lucifer! I'll have tea up and ready then :)
2021-11-01 30550, 2021
CatQuest
wait, oh, it's not 19 but 20?
2021-11-01 30528, 2021
lucifer
CatQuest, i think it's the same time for you as you are in the EU. it only changed for those outside it
2021-11-01 30533, 2021
akshat
Cool, let's do it now then, outsidecontext. The automation is done so far; for the rest of the stuff, we need to directly follow the Fastlane documentation
2021-11-01 30538, 2021
zas
lucifer: we could, but that's a prod server, it is likely to explode if we do that
2021-11-01 30559, 2021
zas
basically we don't know what broke....
2021-11-01 30500, 2021
CatQuest
lucifer: honestly i don't change times on my clocks because i'm dead against it :[
2021-11-01 30513, 2021
zas
and why suddenly
2021-11-01 30517, 2021
CatQuest
it's 18:50 for me now
2021-11-01 30525, 2021
lucifer
yeah indeed
2021-11-01 30542, 2021
zas
I didn't see any error at openresty level that could explain the problem, but...
2021-11-01 30515, 2021
zas
musicbrainz is using hard certs, and it doesn't work either
2021-11-01 30512, 2021
CatQuest
if there is anything i can do, help test, etc please holler
2021-11-01 30547, 2021
ruaok
there are very few entries in nginx error logs and those that appear are ... benign.
2021-11-01 30556, 2021
akshat
outsidecontext: hit `fastlane run download_from_playstore`
2021-11-01 30503, 2021
ruaok
lots of non-http traffic getting through.
2021-11-01 30521, 2021
ruaok
*non-https
2021-11-01 30523, 2021
akshat
But have a play_config.json ready in your local setup and don't add it to git
2021-11-01 30524, 2021
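A minimal pre-flight sketch in Python of the steps akshat describes: make sure play_config.json exists locally and is kept out of git, then invoke the fastlane action. The action name is taken verbatim from akshat's message, and the repo-root location of play_config.json is an assumption.

```python
# Sketch only: the paths and the fastlane action name are taken from the chat
# or assumed, not verified against the project's actual Play Store setup.
import pathlib
import subprocess
import sys

REPO_ROOT = pathlib.Path(".")                 # assumed: run from the repo root
PLAY_CONFIG = REPO_ROOT / "play_config.json"  # local credentials, keep out of git
GITIGNORE = REPO_ROOT / ".gitignore"

def main() -> None:
    if not PLAY_CONFIG.exists():
        sys.exit("play_config.json is missing; create it in your local setup first")

    # Warn if the credentials file could be committed by accident.
    ignored = GITIGNORE.exists() and "play_config.json" in GITIGNORE.read_text()
    if not ignored:
        print("warning: play_config.json is not listed in .gitignore")

    # Run the fastlane action exactly as quoted above.
    subprocess.run(["fastlane", "run", "download_from_playstore"], check=True)

if __name__ == "__main__":
    main()
```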
zas
yes...
2021-11-01 30555, 2021
zas
so I suspect https is broken, for both LE & hard certs, which leads us to ... DNS; openresty config didn't change (I reverted recent changes), nor did the openresty version
2021-11-01 30549, 2021
ruaok
how many instances of openresty are supposed to be running? I see 8 instances in top, but that seems too few.
fwiw, when it *does* load it loads fine, all css etc
2021-11-01 30503, 2021
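For ruaok's question about the expected openresty count: with `worker_processes auto;` nginx/openresty runs roughly one worker per CPU core plus a master process, so the number in top can be checked against the core count. A Linux-only sketch that counts matching processes by scanning /proc; the process names are assumptions about how the gateways run openresty.

```python
# Rough count of openresty/nginx processes, to compare against the configured
# worker_processes value and the machine's core count. Linux-only sketch.
import os

def count_processes(names=("nginx", "openresty")) -> int:
    count = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
        except OSError:
            continue  # process exited while we were scanning
        if comm in names:
            count += 1
    return count

if __name__ == "__main__":
    # With `worker_processes auto;` expect roughly one worker per core, plus the master.
    print(f"{count_processes()} nginx/openresty processes, {os.cpu_count()} CPU cores")
```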
zas
and http loads fast
2021-11-01 30510, 2021
ruaok
ok, how about this interpretation: TLS fails because we're too busy and it times out? we're too busy because of something else and we're chasing the wrong symptom?
2021-11-01 30546, 2021
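One way to probe this interpretation is to time a bare TCP connect to port 80 against a full TLS handshake on port 443: if the gateways are CPU bound, the handshake should be much slower or time out while the plain connect stays fast. A standard-library sketch; the target host and timeout are placeholders.

```python
# Compare plain TCP connect time vs. TCP + TLS handshake time to one gateway.
# Host and timeout are placeholders, not part of the original discussion.
import socket
import ssl
import time

HOST = "musicbrainz.org"  # placeholder target
TIMEOUT = 10.0

def tcp_connect_time(host: str, port: int) -> float:
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=TIMEOUT):
        return time.monotonic() - start

def tls_handshake_time(host: str, port: int = 443) -> float:
    context = ssl.create_default_context()
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=TIMEOUT) as sock:
        with context.wrap_socket(sock, server_hostname=host):
            return time.monotonic() - start

if __name__ == "__main__":
    print(f"tcp :80   {tcp_connect_time(HOST, 80):.3f}s")
    print(f"tls :443  {tls_handshake_time(HOST):.3f}s")
```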
ruaok
ok, my idea would hold if something on the TLS path were causing things to slow down.
2021-11-01 30515, 2021
ruaok
we're CPU bound, and the TLS handshake computation backs up, causing listens to drop.
2021-11-01 30527, 2021
ruaok
but what is causing the slowdown?
2021-11-01 30547, 2021
zas
I have another theory: since everything is slow, we exhaust open sockets or the like, and ssl fails because of that
2021-11-01 30509, 2021
zas
but something is the root of this slowdown
2021-11-01 30510, 2021
ruaok
yes, that could do it.
2021-11-01 30555, 2021
ruaok
dmesg has messages with: [ 5026.378971] TCP: too many orphaned sockets
2021-11-01 30536, 2021
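That dmesg line means the kernel is hitting its orphaned-socket ceiling, which fits zas's socket-exhaustion theory. A Linux-only sketch to check the current orphan count against the limit on the gateway:

```python
# Read the current TCP orphan count and the kernel limit that the
# "TCP: too many orphaned sockets" message refers to. Run on the gateway.
def read_orphan_count() -> int:
    with open("/proc/net/sockstat") as f:
        for line in f:
            if line.startswith("TCP:"):
                fields = line.split()
                # Format: TCP: inuse N orphan N tw N alloc N mem N
                return int(fields[fields.index("orphan") + 1])
    raise RuntimeError("no TCP line in /proc/net/sockstat")

def read_orphan_limit() -> int:
    with open("/proc/sys/net/ipv4/tcp_max_orphans") as f:
        return int(f.read())

if __name__ == "__main__":
    print(f"orphaned sockets: {read_orphan_count()} / limit {read_orphan_limit()}")
```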
rdswift
A cron job gone bad?
2021-11-01 30546, 2021
lucifer
DDoS?
2021-11-01 30559, 2021
ruaok
lucifer: a DDoS would not allow our HTTP traffic to flow freely.
2021-11-01 30501, 2021
CatQuest
i was gonna ask
2021-11-01 30517, 2021
lucifer
ah makes sense
2021-11-01 30526, 2021
ruaok
rdswift: something kinda like that is my guess. we don't really have cron jobs on the gateways, but something is eating CPU.
cpu is high, temp too, network traffic on eth0 & eth1 is high, so I guess load is "normal"; it doesn't explain why https fails that much, especially for hard certs, which don't depend on anything
2021-11-01 30550, 2021
zas
nginx processes have high prio
2021-11-01 30519, 2021
ruaok
nothing stands out as weird.
2021-11-01 30526, 2021
zas
can it be a cascading effect? slowdown leading to https failures leading to slowdown?
2021-11-01 30524, 2021
lucifer
ping from kiki to outside is working again. should we retry switching?
2021-11-01 30501, 2021
ruaok
zas: it could be.
2021-11-01 30512, 2021
zas
well, we can, but since we didn't find the reason behind this shit...
2021-11-01 30524, 2021
ruaok
herb/kiki might just be saturated. and it's a holiday in the EU.
2021-11-01 30542, 2021
ruaok
we can rule out hardware failure since HTTP is working.
2021-11-01 30546, 2021
CatQuest
it might be fans :P
2021-11-01 30555, 2021
CatQuest
ah ignore me
2021-11-01 30557, 2021
ruaok
one thing that speaks against a cascading failure is that HTTP is working.
2021-11-01 30505, 2021
ruaok
if everything was overloaded, HTTP would suck too.
2021-11-01 30518, 2021
zas
yes, but http doesn't require as many resources
2021-11-01 30529, 2021
zas
but I agree, seems weird
2021-11-01 30530, 2021
ruaok
and HTTPS is not *that* much more resource intensive, so it shouldn't fail like this.
2021-11-01 30506, 2021
zas
ok, any https specialist around?
2021-11-01 30502, 2021
rdswift
Could the certs have changed but for some reason nginx didn't reload them?
There was a drastic spike in the LB traffic just before this started.
2021-11-01 30508, 2021
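rdswift's stale-certificate idea is easy to check: compare the certificate the gateway is actually serving with the file on disk. A standard-library sketch; the host and certificate path are placeholders, and the on-disk file is assumed to start with the leaf certificate.

```python
# Compare the served certificate against the file on disk; a mismatch would
# mean nginx/openresty never reloaded the new cert. Host and path are placeholders.
import hashlib
import ssl

HOST = "musicbrainz.org"               # placeholder host
CERT_ON_DISK = "/etc/ssl/example.pem"  # placeholder path, not the real layout

def first_pem_block(pem: str) -> str:
    # Keep only the first certificate if the file holds a full chain.
    end = "-----END CERTIFICATE-----"
    return pem.split(end, 1)[0] + end + "\n"

def fingerprint(pem: str) -> str:
    return hashlib.sha256(ssl.PEM_cert_to_DER_cert(pem)).hexdigest()

if __name__ == "__main__":
    served = ssl.get_server_certificate((HOST, 443))
    with open(CERT_ON_DISK) as f:
        on_disk = first_pem_block(f.read())
    if fingerprint(served) == fingerprint(on_disk):
        print("served certificate matches the one on disk")
    else:
        print("mismatch: the gateway is serving a different certificate")
```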
zas
hmmm
2021-11-01 30510, 2021
ruaok
if something picked up huesound, that would give 25 requests for 1 LB request.
2021-11-01 30518, 2021
ruaok
25 CAA requests for 1 LB request.
2021-11-01 30532, 2021
CatQuest
oooooooh
2021-11-01 30534, 2021
ruaok
and that might be a problem.
2021-11-01 30559, 2021
ruaok
CatQuest: good idea.
2021-11-01 30505, 2021
ruaok
lucifer: wanna try something?
2021-11-01 30506, 2021
CatQuest
:)
2021-11-01 30517, 2021
lucifer
sure. what?
2021-11-01 30525, 2021
ruaok
try deploying lb-web-prod using non-HTTPS for the CAA URLs.
2021-11-01 30532, 2021
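Roughly, that deploy amounts to building the Cover Art Archive URLs with a configurable scheme so the CAA fetches skip TLS. A hypothetical sketch only: the names and the config flag are illustrative, not ListenBrainz's actual code, and the CAA thumbnail URL form is assumed.

```python
# Illustrative only: not the real lb-web-prod change, just the shape of it.
CAA_USE_HTTPS = False  # in practice this would come from the deployment config

def caa_cover_url(release_mbid: str, caa_id: int, size: int = 250) -> str:
    # Assumed CAA thumbnail form: /release/<mbid>/<image-id>-<size>.jpg
    scheme = "https" if CAA_USE_HTTPS else "http"
    return f"{scheme}://coverartarchive.org/release/{release_mbid}/{caa_id}-{size}.jpg"
```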
lucifer
on it
2021-11-01 30535, 2021
ruaok
thanks.
2021-11-01 30535, 2021
kepstin
one tricky thing to note if you have an https endpoint is that people might be doing http/2 to it, and multiple requests can be in progress on a single http/2 connection that get reverse proxied to many individual http/1.1 requests
2021-11-01 30557, 2021
kepstin
so pure connection count comparisons between the two might not be valid
2021-11-01 30516, 2021
lucifer
ruaok: also bring down the container on gaga which hits the CAA?
is there any specific reason for that to go through CAA at all? IIRC after the plex situation, MB switched to linking directly to IA instead of following the redirect
2021-11-01 30524, 2021
ruaok
the logic for going direct is complex and not available to LB, ROpdebee
2021-11-01 30539, 2021
CatQuest
getting 502's on listenbrainz
2021-11-01 30540, 2021
lucifer
i don't think the http thing is going to work. i took LB prod down while the image built but nothing seems to have changed.
the CAA had a drastic spike in traffic before it all went bad.
2021-11-01 30501, 2021
ruaok
lucifer: we may be trying the wrong thing.
2021-11-01 30513, 2021
ruaok
zas: thoughts on blocking the CAA from the gateways
2021-11-01 30514, 2021
ruaok
?
2021-11-01 30515, 2021
lucifer
makes sense. let's bring CAA down for a while
2021-11-01 30536, 2021
ruaok
agreed.
2021-11-01 30537, 2021
ruaok
zas?
2021-11-01 30539, 2021
lucifer
yeah like that. everything is inaccessible anyways
2021-11-01 30526, 2021
ruaok
if the CAA is the problem, we can get another server to run it.
2021-11-01 30555, 2021
ROpdebee
ruaok: isn't it just `https://archive.org/download/mbid-${selectedRelease.release_mbid}/mbid-${selectedRelease.release_mbid}-${selectedRelease.caa_id}_thumb250.jpg`?
2021-11-01 30502, 2021
ROpdebee
or am i missing something
2021-11-01 30507, 2021
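For reference, ROpdebee's template in Python, next to the redirect form that goes through coverartarchive.org; as ruaok says, MusicBrainz's actual "go direct" logic handles more cases than this naive mapping, and the CAA URL form here is assumed.

```python
def ia_direct_thumb_url(release_mbid: str, caa_id: int) -> str:
    # Mirrors ROpdebee's template literal verbatim.
    return (f"https://archive.org/download/mbid-{release_mbid}/"
            f"mbid-{release_mbid}-{caa_id}_thumb250.jpg")

def caa_redirect_thumb_url(release_mbid: str, caa_id: int) -> str:
    # Assumed form that hits coverartarchive.org and redirects to the Internet Archive.
    return f"https://coverartarchive.org/release/{release_mbid}/{caa_id}-250.jpg"
```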
lucifer
yup, once we know the problem we can plan the mitigation but first we need to figure out the issue.