#metabrainz

      • outsidecontext
        I'm away soon, I'll check back tomorrow
      • lucifer
        akshat: meeting is an hour late today :). (EU DST switch)
      • akshat
        Oh right lucifer! I'll have tea up and ready then :)
      • CatQuest
        wait, so it's not 19 but 20?
      • lucifer
        CatQuest, i think it's the same time for you as you are in the EU. only changed for those outside it
      • akshat
        Cool, let's do it now then outsidecontext. The automation is done so far; for the rest, we need to follow the Fastlane documentation directly
      • zas
        lucifer: we could, but that's a prod server, it is likely to explode if we do that
      • basically we don't know what broke....
      • CatQuest
        lucifer: honestly i don't change times on my clocks because i'm dead against it :[
      • zas
        and why suddenly
      • CatQuest
        it's 18:50 for me now
      • lucifer
        yeah indeed
      • zas
        I didn't see any error at openresty level that could explain the problem, but...
      • musicbrainz is using hard certs, and it doesn't work either
      • CatQuest
        if there is anything i can do, help test, etc please holler
      • ruaok
        there are very few entries in nginx error logs and those that appear are ... benign.
      • akshat
        outsidecontext: hit `fastlane run download_from_playstore`
      • ruaok
        lots of non-https traffic getting through.
      • akshat
        But have a play_config.json ready in your local setup and don't add it to git
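A minimal sketch of the setup akshat describes, assuming play_config.json is the Google Play service-account JSON key (the file name and the action name come from the chat; `SUPPLY_JSON_KEY` is fastlane Supply's key env var, but verify against your Appfile):

```sh
# keep the key out of version control first, as akshat says:
echo "play_config.json" >> .gitignore

# point Supply at the key, then run the action named in the chat:
export SUPPLY_JSON_KEY="$PWD/play_config.json"
fastlane run download_from_playstore
```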
      • zas
        yes...
      • so I suspect https is broken, for both LE & hard certs, which leads us to ... DNS. openresty config didn't change (I reverted recent changes), nor did the openresty version
      • ruaok
        how many instances of openresty are supposed to be running? I see 8 instances in top, but that seems too few.
      • zas
        one per cpu
      • that's ok
      • ruaok
        can't be DNS because that would affect http
      • zas
        a simple curl -vik https://musicbrainz.org hangs
      • that's the problem, why does it hang?
      • lucifer
        dns seems to be working but then it hangs
      • zas
        so that's the handshake
      • ruaok
        I also changed my /etc/hosts to map musicbrainz.org directly to herb, still hangs. unlikely the floating IP.
      • lucifer
        how about we block all incoming traffic except from say bono and then restart openresty in debug?
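A rough sketch of what lucifer proposes, assuming iptables on the gateway; 203.0.113.10 is a placeholder for bono's address, and the openresty service name is a guess:

```sh
# drop all web traffic except from bono (placeholder address):
iptables -I INPUT -p tcp -m multiport --dports 80,443 \
  ! -s 203.0.113.10 -j DROP

# set "error_log /var/log/nginx/error.log debug;" in nginx.conf
# (debug level needs an openresty built --with-debug), then:
systemctl restart openresty
```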
      • CatQuest
        :O i just got through to https://beta.musicbrainz
      • very slow
      • it kinda crashed my browser XD
      • zas
        curl -vik https://listenbrainz.org did work... but after a looong time
      • CatQuest
        yes
      • zas
        so https works (kinda)
      • ruaok
        I couldn't load a page from beta
      • lucifer
        right, that's what i saw some time ago. it looks like responses are too slow. sometimes so slow that it times out, at other times it loads but late.
      • dns_resolution: 0.005, tcp_established: 31.402, ssl_handshake_done: 111.763, TTFB: 111.829
      • everything seems slow. not any one single thing.
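Timings like lucifer's can be gathered with curl's write-out variables; a minimal sketch (all values in seconds):

```sh
curl -sk -o /dev/null https://musicbrainz.org -w \
'dns_resolution:     %{time_namelookup}
tcp_established:    %{time_connect}
ssl_handshake_done: %{time_appconnect}
TTFB:               %{time_starttransfer}
'
```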
      • CatQuest
        fwiw, when it *does* load it loads fine, all css etc
      • zas
        and http loads fast
      • ruaok
        ok, how about this interpretation: TLS fails because we're too busy and it times out? we're too busy because of something else and we're chasing the wrong symptom?
      • ok, my idea would hold if something on the TLS path would cause something to slow down.
      • we're CPU bound, and TLS handshake computation backs up, causing listens to drop.
      • but what is causing the slowdown?
      • zas
        I have another theory: since everything is slow, we exhaust open sockets or the like, and ssl fails due to that
      • but something is the root of this slowdown
      • ruaok
        yes, that could do it.
      • dmesg has messages with: [ 5026.378971] TCP: too many orphaned sockets
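That dmesg line means the count of orphaned TCP sockets (closed by the application but not yet released by the kernel) is hitting the tcp_max_orphans limit. A quick way to inspect it:

```sh
# current limit vs. live orphan count:
cat /proc/sys/net/ipv4/tcp_max_orphans
ss -s | grep -i orphaned

# raising the limit is only a stopgap, not a fix for whatever leaks sockets:
sysctl -w net.ipv4.tcp_max_orphans=262144
```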
      • rdswift
        A cron job gone bad?
      • lucifer
        DDoS?
      • ruaok
        lucifer: DDoS would not allow our HTTP traffic to flow freely.
      • CatQuest
        i was gonna ask
      • lucifer
        ah makes sense
      • ruaok
        rdswift: something kinda like that is my guess. we don't really have cron jobs on the gateways, but something is eating CPU.
      • zas
        ruaok: look at network traffic on herb
      • ruaok
        on stats.mb ?
      • zas
        yes
      • ruaok
        on it
      • which dashboard?
      • zas
        cpu is high, temp too, network traffic on eth0 & eth1 high, so I guess load is "normal"; it doesn't explain why https fails that much, especially for hard certs, which don't depend on anything
      • nginx processes have high prio
      • ruaok
        nothing stands out as weird.
      • zas
        can it be a cascading effect? slowdown leading to https failures leading to slowdown?
      • lucifer
        ping from kiki to outside is working again. should we retry switching?
      • ruaok
        zas: it could be.
      • zas
        well, we can, but since we didn't find the reason behind this shit...
      • ruaok
        herb/kiki might just be saturated. and it's a holiday in the EU.
      • we can rule out hardware failure since HTTP is working.
      • CatQuest
        it might be fans :P
      • ah ignore me
      • ruaok
        one thing that speaks against a cascading failure is that HTTP is working.
      • if everything was overloaded, HTTP would suck too.
      • zas
        yes, but http doesn't require as many resources
      • but I agree, seems weird
      • ruaok
        and HTTPS is not *that* much more resource intensive that it should fail like this.
      • zas
        ok, any https specialist around?
      • rdswift
        Could the certs have changed but for some reason nginx didn't reload them?
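rdswift's theory is easy to check: compare the certificate nginx actually serves with the one on disk (the file path below is a guess; adjust to the real cert location):

```sh
echo | openssl s_client -connect musicbrainz.org:443 \
  -servername musicbrainz.org 2>/dev/null \
  | openssl x509 -noout -dates -fingerprint
openssl x509 -in /etc/nginx/ssl/musicbrainz.org.pem \
  -noout -dates -fingerprint
# differing fingerprints => nginx never reloaded the new cert
```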
      • ruaok
        kepstin: ping!
      • kiki's eth0 was just under 100mbit for a long time before things went bad.
      • that doesn't feel like a coincidence.
      • oh. CAA?
      • zas: what if we leave the CAA down or block its traffic for right now.
      • that is a decent chunk of requests, possibly in HTTPS.
      • zas
        reminder: all this started on failover IP switching
      • ruaok
        could be a coincidence, but good to keep in mind.
      • zas
        yes
      • ruaok
        going back to lucifer's idea: let's block something that causes a lot of traffic. CAA seems a good candidate.
      • CatQuest
        is all traffic of caa towards mbs in https?
      • if someone had loads of requests on caa specifically
      • ruaok
        there was a drastic spike in the LB traffic just before this started.
      • zas
        hmmm
      • ruaok
        if something picked up huesound, that would give 25 CAA requests for 1 LB request.
      • CatQuest
        oooooooh
      • ruaok
        and that might be a problem.
      • CatQuest: good idea.
      • lucifer: wanna try something?
      • CatQuest
        :)
      • lucifer
        sure. what?
      • ruaok
        try deploying lb-web-prod using non-HTTPS for CAA URLs.
      • lucifer
        on it
      • ruaok
        thanks.
      • kepstin
        one tricky thing to note if you have an https endpoint is that people might be doing http/2 to it, and multiple requests can be in progress on a single http/2 connection that get reverse proxied to many individual http/1.1 requests
      • so pure connection count comparisons between the two might not be valid
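kepstin's point can be observed by forcing each protocol with curl: several --http2 transfers may share one connection while the proxied upstream still sees separate http/1.1 requests, so counting connections undercounts https load:

```sh
curl -s -o /dev/null --http1.1 -w 'served over %{http_version}\n' https://listenbrainz.org
curl -s -o /dev/null --http2   -w 'served over %{http_version}\n' https://listenbrainz.org
```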
      • lucifer
        ruaok: also bring down the container on gaga which hits CAA?
      • ruaok
        lucifer: ok, will do.
      • I doubt it will do much of anything. its very low traffic.
      • and it runs in batches, 10 mins after the hour.
      • zas
        the traffic doesn't look much higher than usual...
      • ruaok
        writer is down. PR approved.
      • zas: yes, it could be that we hit a threshold and that tipped us over.
      • CatQuest
        eh. question, did we make a huesound.org site?
      • ruaok
        CatQuest: no, but I should make a redirect to the LB one.
      • CatQuest
        yea http://huesound.org/ doesn't load
      • ROpdebee
        is there any specific reason for that to go through CAA at all? IIRC after the plex situation, MB switched to linking directly to IA instead of following the redirect
      • ruaok
        the logic for going direct is complex and not available to LB, ROpdebee
      • CatQuest
        getting 502's on listenbrainz
      • lucifer
        i don't think the http thing is going to work. i took LB prod down while the image built but nothing seems to have changed.
      • ruaok
        the CAA had a drastic spike in traffic before it all went bad.
      • lucifer: we may be trying the wrong thing.
      • zas: thoughts on blocking the CAA from the gateways?
      • lucifer
        makes sense. let's bring CAA down for a while
      • ruaok
        agreed.
      • zas?
      • lucifer
        yeah like that. everything is inaccessible anyways
      • ruaok
        if the CAA is the problem, we can get another server to run it.
      • ROpdebee
        ruaok: isn't it just `https://archive.org/download/mbid-${selectedRelease.release_mbid}/mbid-${selectedRelease.release_mbid}-${selectedRelease.caa_id}_thumb250.jpg`?
      • or am i missing something
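ROpdebee's template, as a shell sketch with placeholder values for the release MBID and caa_id:

```sh
mbid="some-release-mbid"; caa_id="12345"   # placeholders
curl -sL -o thumb.jpg \
  "https://archive.org/download/mbid-${mbid}/mbid-${mbid}-${caa_id}_thumb250.jpg"
```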
      • lucifer
        yup, once we know the problem we can plan the mitigation but first we need to figure out the issue.
      • zas
        we can take it down in docker-server-configs
      • ruaok
        zas: want a PR?
      • zas
        but... it will not prevent incoming reqs
      • I'll do it manually
      • ruaok
        reduce the TTL on caa.org and point it elsewhere?
      • ROpdebee: could be. but we want a CAA cache ideally.
      • I dropped the TTL on caa.org, just in case
      • this is the only real clue that something went wrong, IMHO: https://stats.metabrainz.org/d/000000061/mbstat...
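The lowered TTL can be verified from any resolver; assuming "caa.org" here is shorthand for coverartarchive.org, the second column of the answer is the remaining TTL in seconds:

```sh
dig +noall +answer coverartarchive.org A
```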
      • zas
        caa is now in maintenance mode, 503
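The maintenance mode can be confirmed from outside (same coverartarchive.org shorthand assumption as above):

```sh
curl -s -o /dev/null -w '%{http_code}\n' https://coverartarchive.org/
# expect: 503
```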