akshat: meeting is an hour later today :) (EU DST switch)
akshat
Oh right lucifer! I'll have tea up and ready then :)
CatQuest
wait, so it's not 19 but 20?
lucifer
CatQuest, i think it's the same time for you since you are in the EU. it only changed for those outside it
akshat
Cool, let's do it now then, outsidecontext. The automation is done so far; for the rest of the stuff we need to directly follow the Fastlane documentation
zas
lucifer: we could, but that's a prod server, it is likely to explode if we do that
basically we don't know what broke....
CatQuest
lucifer: honestly i don't change times on my clocks because i'm dead against it :[
zas
and why suddenly?
CatQuest
it's 18:50 for me now
lucifer
yeah indeed
zas
I didn't see any error at the openresty level that could explain the problem, but...
musicbrainz is using hard certs, and it doesn't work either
CatQuest
if there is anything i can do, help test, etc please holler
ruaok
there are very few entries in nginx error logs and those that appear are ... benign.
akshat
outsidecontext: hit `fastlane run download_from_playstore`
ruaok
lots of non-http traffic getting through.
*non-https
akshat
But have a `play_config.json` ready in your local setup and don't add it to git
zas
yes...
so I suspect https is broken for both LE & hard certs, which leads us to ... DNS? the openresty config didn't change (I reverted recent changes), nor did the openresty version
ruaok
how many instances of openresty are supposed to be running? I see 8 instances in top, but that seems too few.
fwiw, when it *does* load it loads fine, all css etc
zas
and http loads fast
ruaok
ok, how about this interpretation: TLS fails because we're too busy and it times out? we're too busy because of something else and we're chasing the wrong symptom?
ok, my idea would hold if something on the TLS path would cause something to slow down.
we're CPU bound, and the TLS handshake work backs up, causing listens to drop.
but what is causing the slowdown?
zas
I have another theory: since everything is slow, we exhaust open sockets or the like, and ssl fails due to that
but something is the root of this slowdown
ruaok
yes, that could do it.
dmesg has messages like: [ 5026.378971] TCP: too many orphaned sockets
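(A rough way to quantify that kernel message, as a minimal sketch: it assumes a Linux gateway with Node.js available; the /proc paths are standard, everything else here is hypothetical and not part of the actual setup.)

```typescript
// Compare the current orphaned-socket count against the kernel limit whose
// exhaustion produces "TCP: too many orphaned sockets".
import { readFileSync } from "node:fs";

// /proc/net/sockstat contains a line like:
//   TCP: inuse 123 orphan 456 tw 789 alloc 1000 mem 200
const sockstat = readFileSync("/proc/net/sockstat", "utf8");
const tcpLine = sockstat.split("\n").find((line) => line.startsWith("TCP:")) ?? "";
const orphans = Number(tcpLine.match(/orphan (\d+)/)?.[1] ?? NaN);

// The sysctl the kernel checks before it starts killing orphaned sockets.
const maxOrphans = Number(
  readFileSync("/proc/sys/net/ipv4/tcp_max_orphans", "utf8").trim()
);

console.log(`orphaned TCP sockets: ${orphans} / limit ${maxOrphans}`);
```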
rdswift
A cron job gone bad?
lucifer
DDoS?
ruaok
lucifer: a DDOS would not allow our HTTP traffic to flow freely.
CatQuest
i was gonna ask
lucifer
ah makes sense
ruaok
rdswift: something kinda like that is my guess. we don't really have cron jobs on the gateways, but something is eating CPU.
zas
cpu is high, temp too, network traffic on eth0 & eth1 is high, so I guess load is "normal". it doesn't explain why https fails that much, especially for hard certs which don't depend on anything
nginx processes have high prio
ruaok
nothing stands out as weird.
zas
can it be a cascading effect? slowdown leading to https failures leading to slowdown?
lucifer
ping from kiki to outside is working again. should we retry switching?
ruaok
zas: it could be.
zas
well, we can, but since we didn't find the reason behind this shit...
ruaok
herb/kiki might just be saturated. and it's a holiday in the EU.
we can rule out hardware failure since HTTP is working.
CatQuest
it might be fans :P
ah ignore me
ruaok
one thing that speaks against a cascading failure is that HTTP is working.
if everything was overloaded, HTTP would suck too.
zas
yes, but http doesn't require as many resources
but I agree, seems weird
ruaok
and HTTPS is not *so* much more resource intensive that it should fail like this.
zas
ok, any https specialist around?
rdswift
Could the certs have changed but for some reason nginx didn't reload them?
there was a drastic spike in the LB traffic just before this started.
zas
hmmm
ruaok
if something picked up huesound, that would give 25 requests for 1 LB request.
25 CAA requests for 1 LB request.
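(A minimal sketch of that fan-out, assuming huesound shows a 5x5 grid of covers per page; the interface fields and function name below are made up for illustration, not the real LB API shape.)

```typescript
// One LB huesound request returns a grid of releases; the browser then fetches
// one cover thumbnail per release from the CAA, each an HTTPS connection
// through the same gateways -- roughly 25 CAA requests per LB request.
interface HuesoundRelease {
  release_mbid: string;
  caa_id: number;
}

function coverArtUrls(releases: HuesoundRelease[]): string[] {
  return releases.map(
    (r) =>
      `https://coverartarchive.org/release/${r.release_mbid}/${r.caa_id}-250.jpg`
  );
}
```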
CatQuest
oooooooh
ruaok
and that might be a problem.
CatQuest: good idea.
lucifer: wanna try something?
CatQuest
:)
lucifer
sure. what?
ruaok
try deploying lb-web-prod using NON-https for CAA URLs.
lucifer
on it
ruaok
thanks.
kepstin
one tricky thing to note if you have an https endpoint is that people might be doing http/2 to it, and multiple requests can be in progress on a single http/2 connection that get reverse proxied to many individual http/1.1 requests
so pure connection count comparisons between the two might not be valid
lucifer
ruaok: also bring down the container on gaga which hits CAA?
ROpdebee
is there any specific reason for that to go through CAA at all? IIRC after the plex situation, MB switched to linking directly to IA instead of following the redirect
ruaok
the logic for going direct is complex and not available to LB, ROpdebee
CatQuest
getting 502's on listenbrainz
lucifer
i don't think the http thing is going to work. i took LB prod down while the image built but nothing seems to have changed.
ruaok
the CAA had a drastic spike in traffic before it all went bad.
lucifer: we may be trying the wrong thing.
zas: thoughts on blocking the CAA from the gateways?
lucifer
makes sense. let's bring CAA down for a while
ruaok
agreed.
zas?
lucifer
yeah like that. everything is inaccessible anyways
ruaok
if the CAA is the problem, we can get another server to run that.
ROpdebee
ruaok: isn't it just `https://archive.org/download/mbid-${selectedRelease.release_mbid}/mbid-${selectedRelease.release_mbid}-${selectedRelease.caa_id}_thumb250.jpg`?
or am i missing something
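(A minimal sketch of the two URL shapes under discussion: the direct form is the one ROpdebee quoted above; the CAA form assumes the standard -250 thumbnail convention and redirects to archive.org. Function names are made up for illustration.)

```typescript
// Two ways to point at the same 250px thumbnail. Going through the CAA costs
// an extra HTTPS request that redirects to archive.org; going direct skips
// coverartarchive.org but hard-codes the IA item naming convention.
function caaThumbUrl(releaseMbid: string, caaId: number): string {
  return `https://coverartarchive.org/release/${releaseMbid}/${caaId}-250.jpg`;
}

function directIaThumbUrl(releaseMbid: string, caaId: number): string {
  return `https://archive.org/download/mbid-${releaseMbid}/mbid-${releaseMbid}-${caaId}_thumb250.jpg`;
}
```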
lucifer
yup, once we know the problem we can plan the mitigation but first we need to figure out the issue.