[listenbrainz-server] MonkeyDo merged pull request #1739 (master…monkey-window-title-trackname): LB-747: Update browser window title with currently playing track's name https://github.com/metabrainz/listenbrainz-serv...
Is there another "send random PRs and win prizes" competition? ^
lucifer
meeting time! :D
zas
hey
lucifer
hi!
ruaok waves
zas
ok, should I start? alastairp ?
alastairp
hi
zas
so, I made a list of what's working for us regarding gateways, and what's not
first what's working:
autossl (auto-generation of letsencrypt certs)
ruaok
zas: you are leading this meeting, yes?
zas
well, I prefer to start, not really "leading" ;)
autoconfiguration of frontends & backends from consul (docker-server-configs + gitzconsul + serviceregistrator + consul + consul-template); see the sketch after this list
load balancing with weights
lucifer has quit
redundancy: being able to switch between gateways, even though it's manual, that's something we want to keep somehow for maintenance tasks
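[To illustrate the consul-driven autoconfiguration and weighted load balancing mentioned above, here is a minimal sketch that asks a Consul agent for the healthy instances of a service and prints an nginx-style weighted upstream list. The service name and agent address are assumptions for illustration; in the real setup this job is done by consul-template, not a hand-rolled script.]

```python
# Minimal sketch, assuming a local Consul agent and a hypothetical service
# name: list healthy backends and their weights, as consul-template would
# when it regenerates the gateway config.
import requests

CONSUL = "http://127.0.0.1:8500"     # assumed default local agent address
SERVICE = "listenbrainz-web"         # hypothetical service name

def healthy_backends(service):
    # /v1/health/service/<name>?passing=true returns only instances whose
    # health checks are all passing.
    resp = requests.get(f"{CONSUL}/v1/health/service/{service}",
                        params={"passing": "true"}, timeout=5)
    resp.raise_for_status()
    for entry in resp.json():
        svc = entry["Service"]
        # Service.Weights.Passing is Consul's per-instance weight hint.
        weight = svc.get("Weights", {}).get("Passing", 1)
        yield svc["Address"], svc["Port"], weight

if __name__ == "__main__":
    for addr, port, weight in healthy_backends(SERVICE):
        print(f"server {addr}:{port} weight={weight};")
```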
monkey has quit
what doesn't work well:
alastairp has quit
ssl puts serious pressure on the cpu, limiting the number of requests we can handle (aka the CAA redirect mess)
lucifer joined the channel
alastairp joined the channel
alastairp
sorry, irccloud is sad. here now
zas
too much "pre-processing" on gateways: some stuff we do there should be done on backends instead, but that's like this for multiple reasons
monkey joined the channel
lucifer
i missed some messages too. looking at chatlogs now
alastairp_ joined the channel
zas
grrrr
monkey
Bad timing for a split
zas
yep :(
lack of redundancy: no auto switch in case of failure
lack of horizontal scalability
only layer 7 load balancing
so, I started to think about a redesign
I did a lot of research, about tools we could use, that are open source, and free (as in beer)
alastairp
currently requests for all MeB services go to 1 gateway? and it's routed by nginx/openresty?
ruaok
I never got a response from the haproxy people about non profit licensing.
zas
yes
alastairp: in fact, everything is going through the gateways, except a few services
btw, I think it would be preferable to not have "holes" (aka services not going through gateways), it makes a few things harder to control (like IP blocking etc)
so, to scale, we have to upgrade the hardware, that's not great
alastairp
so the goal is to get traffic off of the gateways sooner? (not sure if we're talking about goals yet)
zas
nope, the goal is to have better gateways, we still have to load balance over backends, do caching, do filtering, handle ssl, etc
but we need a solution that is more robust & scalable
any questions so far?
ok, I'll continue: hetzner failover IPs
those are great, basically we can switch one or more IPs between machines (kiki & herb in this case)
ruaok
with the caveat that the automatic detecting that something needs failing over is VERY SLOW.
zas
but this is a slow process, because we have to use hetzner API, which is slow to answer
lucifer
how slow? 1m, 5m, 10m?
kepstin
(still faster than updating dns tho, probably)
ruaok
but if we detect the failure ourselves and tell hetzner to switch over it is fast, yes?
zas
the switch itself takes 1 second or less, but the API takes 1min or more
alastairp
is hetzner API the only way of doing the failover? (I ask because I have done the same with keepalived, and you just bring up the IP on the host that needs it)
zas
it is never fast: detect the failure at T, connect to API at T+1 second, wait til T+60s, switch at T+61 seconds
ruaok
alastairp: yes, it must be done via hetzner, otherwise their routers can't route traffic.
alastairp
right
ruaok
oh, it is NEVER fast?
that's pretty crap.
zas
yes :( it was a bit faster in the past, but now never under a minute
ruaok
oy
alastairp
devil's advocate: 60 seconds may be faster than us getting a telegram, working out what's wrong, sshing in to the gateway and triggering switchover
lucifer
how do we do the failover currently?
zas
alastairp: yes
lucifer: manually
lucifer
right, i mean what's the process once we decide to do it? still need to go through the api?
zas
why? because I never managed to get a stable setup with failure detection
we run a script, which asks the API to switch the IP
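[A rough sketch of what such a switch script might look like, using the Hetzner Robot webservice's failover endpoint (POST /failover/<failover-ip> with the new active_server_ip). The credentials, the example IPs and the exact request/response shape are assumptions to verify against Hetzner's Robot API documentation; as discussed above, the API call itself is the slow part and routinely takes about a minute.]

```python
# Sketch only: move a Hetzner failover IP to another server via the Robot API.
import os
import requests

ROBOT_API = "https://robot-ws.your-server.de"
# Robot webservice credentials, assumed to be provided via the environment.
AUTH = (os.environ["ROBOT_USER"], os.environ["ROBOT_PASSWORD"])

def switch_failover(failover_ip: str, new_active_server_ip: str) -> dict:
    resp = requests.post(
        f"{ROBOT_API}/failover/{failover_ip}",
        data={"active_server_ip": new_active_server_ip},
        auth=AUTH,
        timeout=120,  # the API answer is what takes ~60s, not the switch itself
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # Hypothetical documentation-range addresses: move the shared failover IP
    # from one gateway (e.g. kiki) to the other (e.g. herb).
    print(switch_failover("198.51.100.10", "203.0.113.20"))
```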
alastairp
what is the use case for switching over? We currently do it when zas upgrades gateways. We did it a few weeks ago when the server was overloaded. what other cases are there? server disappears unexpectedly?
kepstin
unless you can convince them to reconfigure their switches to allow using something like ARP triggered from your machines to do the failover, there probably isn't a faster option.
zas
kepstin: we can't convince them to do that, this is why we have to use the API
lucifer
so the only difference between the automatic failover and the script we have now is the time to detect the failure? after that both processes are the same, right?
zas
aka we can't use keepalived for those IPs as we would on our own network
alastairp
lucifer: and as zas points out, getting a stable setup that only switches over on real errors
interesting, so it probably involves a routing reconfiguration, since they support this across the datacenter, not just for machines on the same switch. would explain why it's slow to set up :/
zas
kepstin: yes
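[A sketch of the "only switch over on real errors" idea being discussed: poll the active gateway and only trigger the slow Hetzner failover after several consecutive failed checks, so a single blip doesn't cause a ~60-second IP move. The check URL, thresholds and the switch_failover() helper (the sketch above) are illustrative assumptions, not the actual monitoring setup.]

```python
# Sketch of a debounced failure detector that triggers the failover script.
import time
import requests

CHECK_URL = "https://musicbrainz.org/"   # hypothetical health-check target
FAILURES_BEFORE_SWITCH = 5               # require sustained failure, not a blip
CHECK_INTERVAL = 10                      # seconds between probes

def gateway_is_healthy() -> bool:
    try:
        # Treat network errors and 5xx responses as "unhealthy".
        return requests.get(CHECK_URL, timeout=5).status_code < 500
    except requests.RequestException:
        return False

def watch_and_failover(do_failover):
    consecutive_failures = 0
    while True:
        if gateway_is_healthy():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURES_BEFORE_SWITCH:
                do_failover()   # e.g. the switch_failover() sketch above
                return          # stop here; switching back stays a manual decision
        time.sleep(CHECK_INTERVAL)
```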
reosarevok has quit
lucifer
uh irccloud acting up again
ruaok
> Switching a failover IP/subnet takes between 40 and 60 seconds.
lucifer has quit
zas
yes, in practice, 60 seconds rather than 40s
lucifer joined the channel
but, importantly, the actual switch itself is fast: < 1 second; we do lose connections when it happens, though