#metabrainz

      • outsidecontext joined the channel
      • 2018-05-11 13139, 2018

      • ruaok
        well, one good thing about the semi-flaky hetzner hardware is that our single points of failure are more obvious. and what better motivation... :-/
      • 2018-05-11 13102, 2018

      • CardinalWolseley joined the channel
      • 2018-05-11 13149, 2018

      • outsidecontext has quit
      • 2018-05-11 13124, 2018

      • djwhitey has quit
      • 2018-05-11 13145, 2018

      • djwhitey joined the channel
      • 2018-05-11 13121, 2018

      • CardinalWolseley has quit
      • 2018-05-11 13139, 2018

      • UmkaDK_ has quit
      • 2018-05-11 13121, 2018

      • bitmap
        ruaok, yvanzo, zas: I shared a gdoc with you for the bowie -> queen switchover
      • 2018-05-11 13151, 2018

      • ruaok waits for the invite
      • 2018-05-11 13135, 2018

      • ruaok
        wow. so much more complicated than changing a fan.
      • 2018-05-11 13121, 2018

      • ruaok
        > What about other projects?
      • 2018-05-11 13140, 2018

      • ruaok
        hmmm. LB doesn't have a read only mode
      • 2018-05-11 13135, 2018

      • ruaok
        I was doing ok following along until that part.
      • 2018-05-11 13150, 2018

      • ruaok
        now I am wondering if it wouldn't be better to just take downtime.
      • 2018-05-11 13113, 2018

      • ruaok
        can we promise hetzner some beer if they can do it in under 10 minutes?
      • 2018-05-11 13114, 2018

      • bitmap
        :P
      • 2018-05-11 13103, 2018

      • bitmap
        we could go into downtime earlier, though it's still important we make sure no writes are happening before copying the data to queen
      • 2018-05-11 13155, 2018

      • ruaok
        I guess what I am saying is that we should take all sites offline, shut down bowie, have hetzner work, bring bowie and everything else back up.
      • 2018-05-11 13139, 2018

      • bitmap
        ah sorry, without switching to queen
      • 2018-05-11 13157, 2018

      • ruaok
        I'm not fully convinced that is best.
      • 2018-05-11 13118, 2018

      • ruaok
        just floating ideas -- the whole process looks freaking scary. we don't normally have schema change docs that are that long.
      • 2018-05-11 13131, 2018

      • bitmap
        true, it's closer in scope to when we had to move from totoro... though easier now that things are pieced together
      • 2018-05-11 13144, 2018

      • bitmap
        but if we take downtime during the whole process in the doc, I can simplify it
      • 2018-05-11 13110, 2018

      • ruaok
        what besides "shut down PG, then bowie" would need to happen?
      • 2018-05-11 13148, 2018

      • bitmap
        nothing if the fan replacement can be done quickly, we could just take downtime, shut down PG, then bowie
      • 2018-05-11 13127, 2018

      • ruaok
        zas?
      • 2018-05-11 13142, 2018

      • bitmap
        if it'll take a while and we want to switch to queen, we can simplify the doc a bit more if we take full downtime during the switchover
      • 2018-05-11 13120, 2018

      • bitmap
        some of the complexity was from keeping bowie up to accept RO queries while we copy the data, but we don't *have* to do that
      • 2018-05-11 13150, 2018

      • ruaok
        that might be a good compromise.
      • 2018-05-11 13139, 2018

      • D4RK-PH0ENiX has quit
      • 2018-05-11 13103, 2018

      • reosarevok
        yvanzo: seems "ended" is no longer displayed on edits? https://musicbrainz.org/edit/50509593
      • 2018-05-11 13127, 2018

      • D4RK-PH0ENiX joined the channel
      • 2018-05-11 13129, 2018

      • zas
        bitmap, ruaok: i'm back from dinner, what's the plan finally?
      • 2018-05-11 13102, 2018

      • bitmap
        I think it's dependent on how long the fan replacement will take, and whether we want to wait for it
      • 2018-05-11 13101, 2018

      • zas
        it should be short (<20 minutes) but we need to give them an exact time where they can start
      • 2018-05-11 13116, 2018

      • zas
        we have to be careful, any hardware intervention can lead to other hardware failure... a misplugged cable and we lose a lot of time ;)
      • 2018-05-11 13142, 2018

      • zas
        also we'll prolly have the same issue with WAL vs queen after bowie shutdown, and need to resync queen after
      • 2018-05-11 13126, 2018

      • zas
        this (long) procedure shows how much we need something better, for fast switching
      • 2018-05-11 13124, 2018

      • bitmap
        yup, I was playing with pgpool & repmgr yesterday
      • 2018-05-11 13136, 2018

      • bitmap
        it'll still need manual intervention and careful attention, but would be a lot faster
      • 2018-05-11 13128, 2018

      • bitmap
        I'll push the test containers I have somewhere soon
      • 2018-05-11 13128, 2018

      • iliekcomputers
        Could I be invited to the switchover doc too?
      • 2018-05-11 13154, 2018

      • ruaok
        I say go for it and just take the downtime.
      • 2018-05-11 13157, 2018

      • bitmap
        iliekcomputers: sent
      • 2018-05-11 13124, 2018

      • ruaok
        twice, even now. :)
      • 2018-05-11 13111, 2018

      • iliekcomputers
        Thanks.
      • 2018-05-11 13144, 2018

      • ruaok
        well, zas, what should we do?
      • 2018-05-11 13122, 2018

      • zas
        either the complicated procedure that will fail (perhaps) or the simple that will succeed (perhaps)
      • 2018-05-11 13111, 2018

      • zas
        let's take everything down, and just do it fast, we plan an hour, tweet about the maintenance, and let hetzner work
      • 2018-05-11 13127, 2018

      • zas
        it will limit possible issues, because the switch has so many steps, and so many nodes involved, that i fear it will not work as we expect...
      • 2018-05-11 13142, 2018

      • zas
        ruaok: is it your feeling ?
      • 2018-05-11 13154, 2018

      • iliekcomputers
        I support the non complicated method.
      • 2018-05-11 13109, 2018

      • zas
        well, bad things can still happen
      • 2018-05-11 13130, 2018

      • zas
        bitmap: can you trigger a db backup now ? how long does it take ?
      • 2018-05-11 13151, 2018

      • ruaok
        I agree with everything you said.
      • 2018-05-11 13120, 2018

      • ruaok
        we should still offer €20 of beer if they can do it in under 10 minutes.
      • 2018-05-11 13138, 2018

      • zas
        in fact if they switch cpu+fan they can do it in 3
      • 2018-05-11 13140, 2018

      • ruaok
        :)
      • 2018-05-11 13100, 2018

      • zas
        bitmap: ?
      • 2018-05-11 13104, 2018

      • bitmap
        I don't think a pg_basebackup takes very long
      • 2018-05-11 13123, 2018

      • bitmap
        you want to take it now before bowie is down?
      • 2018-05-11 13123, 2018

      • ruaok
        then let's do that.
      • 2018-05-11 13134, 2018

      • zas
        yes, just do one now, it will limit damage in case of
      • 2018-05-11 13137, 2018

      • ruaok
        backup and then pick an exact time.
      • 2018-05-11 13144, 2018

      • zas
        20:00 UTC is perhaps too short (for backup+hetzner), i'd say 21UTC (in 1 hour 45 minutes)
      • 2018-05-11 13113, 2018

      • bitmap
        ok, I'll copy a backup to williams now. and queen should remain a usable replica backup too
      • 2018-05-11 13139, 2018

      • zas
        ok for the time ?
      • 2018-05-11 13144, 2018

      • bitmap
        let me start the backup really quick and just double check the progress of it
      • 2018-05-11 13151, 2018

      • zas
        ok
      • 2018-05-11 13153, 2018

      • bitmap
        started, no progress visible yet (pg_basebackup: initiating base backup, waiting for checkpoint to complete)
      • 2018-05-11 13149, 2018

      • zas
        we have to deploy barman next week, it'll help in this field
      • 2018-05-11 13112, 2018

      • reosarevok
        Is barman the newest version of bartendro? :p
      • 2018-05-11 13120, 2018

      • zas
        :) not really, but it may give us more time to play with bartendro in case of database disaster
      • 2018-05-11 13107, 2018

      • bitmap
        ok looks like it should complete in ~30 minutes
      • 2018-05-11 13136, 2018

      • bitmap
        it's at 5% now
      • 2018-05-11 13159, 2018

      • bitmap
        writing to /root/postgres-master-data-2018-05-11 on williams
      • 2018-05-11 13126, 2018

      • bitmap
        21UTC would be more than enough time anyway
      • 2018-05-11 13144, 2018

      • zas
        ok, i'll ask hetzner if they are ok for this time
      • 2018-05-11 13154, 2018

      • bitmap
        in a disaster (gasp) we can still restore from queen since it should be an exact copy after bowie is shut down
      • 2018-05-11 13156, 2018

      • zas
        request sent to hetzner, waiting for them to confirm
      • 2018-05-11 13105, 2018

      • zas
        asked for shorter delay (with beers)
      • 2018-05-11 13149, 2018

      • zas
        bitmap: queen disk usage >85%, we need to move solr stuff elsewhere (it takes ~40Gb)
      • 2018-05-11 13132, 2018

      • zas
        "We regret to tell you that the named appointment isn't available." hmmm
      • 2018-05-11 13132, 2018

      • bitmap
        do they say what times they have available?
      • 2018-05-11 13134, 2018

      • zas
        nope, asking
      • 2018-05-11 13109, 2018

      • CatQuest
        wat
      • 2018-05-11 13108, 2018

      • CatQuest
        any reply zas?
      • 2018-05-11 13133, 2018

      • zas
        not yet
      • 2018-05-11 13142, 2018

      • CatQuest
        meh
      • 2018-05-11 13116, 2018

      • CatQuest
        I am lucky that today is my "day off" - playing botw instead of mb editing. it would have been annoying to have to wait for this.
      • 2018-05-11 13147, 2018

      • CatQuest
        someone mentioned in #mb that something weird was happening with the mb site. odd css and errors
      • 2018-05-11 13106, 2018

      • CatQuest
        if the fan is literally trying to make the machine not *boil* (95˚c!?) then a banner might be a good idea? (probably not a good idea to edit now, fan needs replacing, waiting for hardware-reply-etc, site will go down while it is repaired..)
      • 2018-05-11 13125, 2018

      • zas
        they suggest .... 23:15 CEST 21:15 UTC
      • 2018-05-11 13136, 2018

      • zas
        pff
      • 2018-05-11 13141, 2018

      • CatQuest
        so people can finish what they're doing and have some forewarning
      • 2018-05-11 13158, 2018

      • CatQuest
        eh...
      • 2018-05-11 13125, 2018

      • CatQuest
        idk, put up a banner, put the site in read only, have a nap for a couple hours then go-go?
      • 2018-05-11 13125, 2018

      • zas
        bitmap: we have to prepare everything a bit before, and shut down the server a few minutes before 21:15 UTC
      • 2018-05-11 13159, 2018

      • bitmap
        okay
      • 2018-05-11 13124, 2018

      • CatQuest
        (imho putting up a banner *now* about it is a good idea too but..)
      • 2018-05-11 13155, 2018

      • zas
        ruaok: around?
      • 2018-05-11 13113, 2018

      • bitmap
        I think I can put mb in read only and point it to queen as part of the process, if we want
      • 2018-05-11 13113, 2018

      • ruaok
        now, yes.
      • 2018-05-11 13105, 2018

      • CatQuest
        well it's like 1 hour 15 minutes but still
      • 2018-05-11 13106, 2018

      • bitmap
        or not, not sure if it'll accept connections if bowie is down.
      • 2018-05-11 13119, 2018

      • bitmap
        since it's a hot standby. so nvm
      • 2018-05-11 13145, 2018

      • bitmap
        pg_basebackup: base backup completed
      • 2018-05-11 13144, 2018

      • CatQuest
        anyway, good luck and well done guys. I know you will do a good job
      • 2018-05-11 13150, 2018

      • moufl has quit
      • 2018-05-11 13100, 2018

      • zas
        Banner up, ruaok can you tweet about it ?
      • 2018-05-11 13121, 2018

      • iliekcomputers
      • 2018-05-11 13138, 2018

      • bitmap
        stopped the sitemaps container
      • 2018-05-11 13124, 2018

      • CatQuest
        iliekcomputers: unrelated, did you know that "redd" means "scared" in norwegian ? :D
      • 2018-05-11 13137, 2018

      • iliekcomputers
        i didn't!
      • 2018-05-11 13147, 2018

      • iliekcomputers
        maisie isn't scary tho! :D
      • 2018-05-11 13154, 2018

      • ruaok
        am I allowed to add the tag #wecanhazcloudnao to it?
      • 2018-05-11 13109, 2018

      • iliekcomputers
        add maisie too!
      • 2018-05-11 13111, 2018

      • CatQuest
        zas: I don't see a banner on beta
      • 2018-05-11 13120, 2018

      • zas
        ah i didn't put it on beta
      • 2018-05-11 13142, 2018

      • moufl joined the channel
      • 2018-05-11 13155, 2018

      • bitmap
        the daily script is running and processing subscriptions, I'm gonna stop it
      • 2018-05-11 13136, 2018

      • ruaok
        attempted tweets out. seem like twitter is having some issues too.
      • 2018-05-11 13143, 2018

      • bitmap
        pre-emptively stopped musicbrainz-production-cron
      • 2018-05-11 13138, 2018

      • iliekcomputers
        woo!
      • 2018-05-11 13146, 2018

      • bitmap
        stopped sentry so it has a nice day while the db is down
      • 2018-05-11 13155, 2018

      • CatQuest
        :D
      • 2018-05-11 13115, 2018

      • CatQuest imagines sentry having a picnic or smth :P
      • 2018-05-11 13124, 2018

      • CatQuest
        (I'm sorry for being silly)
      • 2018-05-11 13105, 2018

      • ruaok
        I was lazy. I copied zas and zas uses a different way to express himself. more formal than I. hence lolspeak. :)
      • 2018-05-11 13128, 2018

      • CatQuest
        oh
      • 2018-05-11 13114, 2018

      • zas
        lol
      • 2018-05-11 13145, 2018

      • bitmap
        should we put mb in read-only mode in ~10 minutes?
      • 2018-05-11 13133, 2018

      • ruaok
        good plan
      • 2018-05-11 13139, 2018

      • zas
        it seems safer yes
      • 2018-05-11 13149, 2018

      • bitmap
        done
      • 2018-05-11 13132, 2018

      • Leo__Verto joined the channel
      • 2018-05-11 13105, 2018

      • zas
        i'll put all down in 7 mins, and we shutdown bowie
      • 2018-05-11 13147, 2018

      • yokel joined the channel
      • 2018-05-11 13113, 2018

      • ruaok
        that soon?
      • 2018-05-11 13123, 2018

      • ruaok
        just wondering, don't let me stop you.
      • 2018-05-11 13117, 2018

      • zas
        i'll set all down now, expect 503s
      • 2018-05-11 13107, 2018

      • ruaok tweets more
      • 2018-05-11 13128, 2018

      • zas
        i shutdown bowie
      • 2018-05-11 13114, 2018

      • zas
        nagios is panicking ;) expected
      • 2018-05-11 13128, 2018

      • bitmap
        I'll stop pg slave too so it doesn't spam logs about bowie being down
      • 2018-05-11 13141, 2018

      • zas
        ok
      • 2018-05-11 13124, 2018

      • zas
        perhaps time to reboot queen too
      • 2018-05-11 13111, 2018

      • Leo__Verto
        that sounds like the thing you say before neither of the two servers comes back up :P
      • 2018-05-11 13107, 2018

      • zas
        well, this is why i added "perhaps" (usually it means: bad idea, don't do it)