#metabrainz

      • monkey is a bit late, please don't call me first :)
      • yvanzo
        monkey: not sure the meeting will take place as usual since MeB websites are down.
      • monkey
        Oop, thanks
      • Freso
        <BANG>
      • It’s World Vegan Monday!
      • I’ve received
      • alastairp: Go!
      • alastairp says…
      • """
      • Today is a holiday, so I'm not around. Last week I spent most of my time attending a kubernetes training course at MTG. I also reviewed some LB PRs.
      • """
      • Others up for reviews: yvanzo, bitmap, monkey, ruaok, lucifer, zas, akshat, reosarevok, CatQuest, Freso – as aye, let me know if you too want to give a review and you’re not listed here! :)
      • monkey: Go!
      • ;)
      • Nah. yvanzo, go! :)
      • yvanzo
        Hi!
      • Freso
        Oh. Uh.
      • Yeah.
      • zas
        Freso: not the right time for me....
      • Freso
        Go focus on sites.
      • ruaok
      • BrainzBot
        LB-992: Too many LB full dumps on FTP site
      • lucifer
        👍
      • yvanzo
        I mostly reviewed PRs and looked into improving docker server configs
      • Freso
        bitmap and anyone else in the US/North America/Turtle Island: note that next week’s meeting will be an hour different again (back to normal time) if you’re in an area that (stops) observing DST. :)
      • Take care of yourselves out there. :)
      • </BANG>
      • Possibly shortest meeting ever? :p
      • yvanzo
        Thanks everyone :)
      • monkey
        Just catching up with the drama. Anything I can help with?
      • ruaok
        send chocolate?
      • CatQuest
        I only have nonstop?
      • but i mean i can send that
      • rdswift
        <ruaok> send chocolate? Or chill the beer for when this is resolved.
      • ruaok
        zas: let's regroup. what else have we learned?
      • zas
        the problem concerns https only (apparently; perhaps we need to confirm that)
      • load on gateways is higher than usual, we don't know if it's the cause or the consequence yet
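(A quick way to confirm the "HTTPS only" theory is to time both schemes from an outside host; this is a sketch, with musicbrainz.org standing in for whichever MeB site is affected.)

```sh
# Compare plain HTTP vs HTTPS connection/response timing from outside the network
curl -so /dev/null -w 'connect=%{time_connect}s tls=%{time_appconnect}s total=%{time_total}s\n' http://musicbrainz.org/
curl -so /dev/null -w 'connect=%{time_connect}s tls=%{time_appconnect}s total=%{time_total}s\n' https://musicbrainz.org/
```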
      • ruaok
        do we have any other indicators of abnormal behaviour right before the outage?
      • zas
        it started when I switched floating IP to herb, but it can be due to another event at the same time
      • ruaok
        the CAA one is the only one I've seen.
      • do you know exactly when you did that?
      • zas
        to me, everything was fine before the switch
      • 01/11/21 04:32 pm CET
      • according to hetzner robot web service log
      • ruaok
        the CAA load spike was in progress at 4:26pm CET.
      • zas
        hmmm
      • ruaok
        perhaps the switch just made everything worse.
      • lucifer
      • this is a different error from before and happens immediately
      • ruaok
        kepstin: ping
      • kepstin
        hola
      • ruaok
        kepstin: hey.
      • kepstin
        i've read a bit of the backlog, but i don't really have any insight as to what's going wrong :/
      • ruaok
        have you been following along? I think you could be really helpful to us.
      • ah.
      • what are your go-to things to check when a server is overloaded?
      • what would you do and look at?
      • ruaok is trying to get ideas for things to do next
      • zas: we have redirects from HTTP -> HTTPS in place for a lot of things.
      • what if we turned them off and told people via twitter/blog to use HTTP for the time being?
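(For context, the redirects ruaok mentions typically look like the generic nginx sketch below; this is not the actual MetaBrainz gateway config, just an illustration of what would be switched off or replaced to serve plain HTTP temporarily.)

```nginx
# Typical port-80 server block that bounces everything to HTTPS.
server {
    listen 80;
    server_name musicbrainz.org;           # example hostname
    return 301 https://$host$request_uri;  # disabling the redirect means replacing
                                           # this with the normal proxy_pass config
}
```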
      • kepstin
        well, narrow down the specific thing that's overloaded - where is the error page being generated, which component is it failing to connect to?
      • i did mention the http/2 thing earlier, dunno if you saw that?
      • ruaok
        kepstin: it appears that all our HTTP traffic is totally fine and snappy.
      • kepstin: saw that, but it didn't lead me to a new direction.
      • kepstin
        hmm, it just means that if you're measuring traffic by number of connections, you'll overcount http or undercount https
      • yvanzo
        BrainzGit is running normally from aretha.
      • ruaok
        but none of our HTTPS traffic is going through. we checked certs and so on... fine. no changes to the http stuff when this happened
      • yvanzo
        Cert for tickets is fine too.
      • ruaok
        kepstin: understood.
      • CatQuest
        yea tickets loads
      • kepstin
        and also that https can be a connection amplifier, since most http/2 reverse-proxies convert multiple requests in one connection to multiple http/1.1 connections to the backend
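(A sketch of the amplification kepstin describes, assuming a generic nginx/openresty frontend rather than the real config: many requests multiplexed over one client-side HTTP/2 connection each become a separate HTTP/1.1 request to the upstream unless keepalive reuse is configured.)

```nginx
server {
    listen 443 ssl http2;                 # one client connection, many multiplexed requests
    location / {
        proxy_http_version 1.1;           # each proxied request goes upstream as HTTP/1.1
        proxy_set_header Connection "";   # needed (with an upstream keepalive pool) to reuse connections
        proxy_pass http://backend_pool;   # hypothetical upstream name
    }
}
```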
      • ruaok
        we're also getting TCP listen drop messages.
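(The listen drops ruaok mentions are usually visible with something like the following on the gateway host; commands are a sketch, not taken from the incident itself.)

```sh
# SYN/accept-queue drops reported by the kernel
nstat -az TcpExtListenDrops TcpExtListenOverflows
# older-style equivalent
netstat -s | grep -iE 'listen|SYNs to LISTEN'
# current vs maximum accept-queue usage for the HTTPS listener
ss -ltn 'sport = :443'
```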
      • yvanzo
        But tickets are not running from the same set of servers.
      • ruaok
      • but we think that is a symptom, not a cause
      • kepstin: that would be great for us -- our gateways, the HTTPS endpoints are having a hard time. the backends are bored.
      • CatQuest
        hm, might be worth checking that out tho, to rule it out at least
      • kepstin
        hmm. if you have https ocsp pinning enabled, consider turning that off?
      • ruaok
        zas: ^^
      • zas
        let me check that
      • kepstin
        probably not the issue, but it can cause some server configs to need to make outgoing connections to verify the ocsp before responding
      • i have to say, having connections refused on the border reverse proxies is a new experience for me
      • ruaok
        zas: when I zoom in on the CAA graph: https://stats.metabrainz.org/d/000000061/mbstat...
      • kepstin
        all the overload issues i've dealt with have been purely due to the backend
      • ruaok
        I believe we haven't ruled out the CAA enough yet.
      • CatQuest
        honestly that huesound/caa thing seemed very suspicious
      • zas
        yes, it started to rise at 16:19
      • so basically the switch happened when it was already at 2x the usual traffic
      • ruaok
        we're overloaded. clearly.
      • bitmap: you about?
      • bitmap
        I'm here
      • ruaok
        can you help me cobble together a CAA?
      • zas
        kepstin: I disabled ssl stapling
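(In nginx/openresty that change is typically just the following directives; shown as a sketch rather than the exact gateway config.)

```nginx
ssl_stapling off;
ssl_stapling_verify off;
```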
      • ruaok
        if I rent us a new dedicated server right now, can we get CAA up and running post-haste?
      • bitmap
        ruaok: sure, I could help with that
      • ruaok
        ok, I'm going to do that. that rules it out.
      • hang on.
      • kepstin
        what's the outward-facing application that's actually handling requests on port 443? is that an nginx reverse proxy, haproxy, ???
      • gavinatkinson joined the channel
      • ruaok
        openresty / nginx
      • kepstin
        interesting that in the stats, the connect time to the upstream servers from that frontend seems to have risen dramatically
      • normally less than 5ms, but then starting around 11:29 jumped way up to ~20ms
      • which .... might mean that it was sending requests to the upstreams faster than they were responding, until the frontend's queues filled up and it became unable to accept more connections itself?
      • ruaok
        kepstin: which graph are you looking at?
      • kepstin
        "Mean connect time per upstream"
      • (on the CAA graph link you'd sent to zas)
      • zas
        kepstin: yes, that's what's happening
      • ruaok
        zas: what ubuntu image should I install on the new CAA server? 20.04 focal?
      • zas
        ruaok: if possible yes
      • basically we have too many connections incoming to 443, saturating queues, leading to failed connections
      • lucifer
        and http is working because no one is using it?
      • zas
        yes
      • kepstin
        right, and the queues are backed up specifically because responses to stuff forwarded to CAA backends aren't coming back fast enough, i think?
      • lucifer
        are we overloaded in general, or is it extra traffic from particular IPs?
      • kepstin
        i guess this is a problem i wouldn't have anticipated with running multiple sites on a single external endpoint - an issue with one service can take down others :/
      • so if the CAA backends are the problem in particular, stubbing out caa requests to immediately return an error rather than forward to backend could bring the rest of the site back up?
      • lucifer
        but mean connect times seem to be up in general for all hosts?
      • kepstin
        hmm. i might have just not looked at enough different graphs then :)
      • lucifer
      • for one instance
      • kepstin
        hmm. so if it is affecting all backends more or less evenly then i guess it is a frontend issue then :/
      • ruaok
        DNS for CAA has been pointed to the new server.
      • now waiting for it to come back up and get the new server setup going.
      • kepstin
        there are some tuning options for nginx to increase the number of backlogged connections which might help. they require changing some sysctls and matching nginx config. but i don't know if that's something you've already done.
      • ruaok
      • kepstin
        (and also file descriptor limits can cause issues)
      • zas
        kepstin: yes, I did that already, but here we have more than enough queue sizes for "normal" situations
      • ruaok: let's see if it helps
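(Roughly the knobs kepstin is referring to, with placeholder values; zas notes these were already raised on the gateways.)

```sh
# Kernel-side accept/SYN queue limits
sysctl -w net.core.somaxconn=4096
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
# Matching nginx side (in the gateway config):
#   listen 443 ssl backlog=4096;
#   worker_rlimit_nofile 65535;   # the file descriptor limit kepstin also mentions
```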
      • CatQuest
        wait purple? like in "deep" whoo hoo
      • ruaok
        bitmap: if you can start working out how we can get the CAA up without consul and all that...
      • riksucks
        btw this spike in traffic, it's all organic right?
      • ruaok
        riksucks: unsure
      • lucifer
        we don't know that
      • CatQuest
        organic?
      • ruaok
        CatQuest: free range.
      • CatQuest
        uuhhh
      • lucifer
        not someone intentionally overloading us.
      • bitmap
        ruaok: can you add this ssh key to bitmap? https://gist.github.com/mwiencek/efe7236012099d...
      • CatQuest
        not ddos, ah
      • bitmap
        I guess it only has my laptop one
      • ruaok
        oh, it's not ready yet.
      • still working toward server setup
      • a few more mins
      • bitmap
        ok!
      • zas
        lucifer: packets from different IPs, over 100k SYN packets to port 443, top IP has 328
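(Numbers like these can be pulled from a short capture on the gateway; a sketch, with the interface name and packet count as placeholders.)

```sh
# Capture 100k SYNs to port 443, then count them per source IP
tcpdump -ni eth0 -c 100000 -w syns.pcap 'tcp dst port 443 and tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn'
tcpdump -nr syns.pcap | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head
```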
      • lucifer
        makes sense
      • kepstin
        (disk performance due to logging sometimes can cause things to fall over when traffic reaches a certain point, but I don't see anything obvious in the disk graphs)
      • lucifer
        so it should most probably be organic traffic.