[listenbrainz-server] MonkeyDo merged pull request #1739 (master…monkey-window-title-trackname): LB-747: Update browser window title with currently playing track's name https://github.com/metabrainz/listenbrainz-server…
Is there another "send random PRs and win prizes" competition? ^
2021-11-23 32737, 2021
lucifer
meeting time! :D
2021-11-23 32742, 2021
zas
hey
2021-11-23 32747, 2021
lucifer
hi!
2021-11-23 32748, 2021
ruaok waves
2021-11-23 32745, 2021
zas
ok, should I start? alastairp ?
2021-11-23 32700, 2021
alastairp
hi
2021-11-23 32731, 2021
zas
so, I made a list of what's working for us regarding gateways, and what's not
2021-11-23 32742, 2021
zas
first what's working:
2021-11-23 32703, 2021
zas
autossl (auto-generation of letsencrypt certs)
2021-11-23 32703, 2021
ruaok
zas: you are leading this meeting, yes?
2021-11-23 32721, 2021
zas
well, I prefer to start, not really "leading" ;)
2021-11-23 32717, 2021
zas
autoconfiguration of frontends & backends from consul (docker-server-configs + gitzconsul + serviceregistrator + consul + consul-template)
2021-11-23 32733, 2021
zas
load balancing with weights
2021-11-23 32715, 2021
lucifer has quit
2021-11-23 32747, 2021
zas
redundancy: being able to switch between gateways, even though it's manual, that's something we want to keep somehow for maintenance tasks
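The autoconfiguration item above (consul + consul-template feeding the gateways) and the weighted load balancing could look roughly like this hypothetical consul-template fragment; the service name, meta key, and file paths are invented for illustration, not taken from docker-server-configs:

```
# upstream.ctmpl -- render an nginx upstream block from consul,
# with per-backend weights pulled from service metadata.
# Rendered e.g. via:
#   consul-template -template "upstream.ctmpl:/etc/nginx/conf.d/upstream.conf:nginx -s reload"
upstream web_backend {
{{ range service "web-backend" }}
    server {{ .Address }}:{{ .Port }} weight={{ or (index .ServiceMeta "weight") "1" }};
{{ end }}
}
```

When a backend registers or deregisters in consul, consul-template re-renders the file and reloads nginx, which is the "autoconfiguration of frontends & backends" zas describes.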
2021-11-23 32751, 2021
monkey has quit
2021-11-23 32702, 2021
zas
what doesn't work well:
2021-11-23 32727, 2021
alastairp has quit
2021-11-23 32737, 2021
zas
ssl puts serious pressure on the cpu, limiting the number of requests we can handle (aka the CAA redirect mess)
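For context on the SSL/CPU point: the expensive part is the full TLS handshake, and the standard nginx mitigation is session reuse plus connection keepalive. This is a generic example of those knobs, not the actual MetaBrainz gateway config:

```nginx
# Generic nginx TLS tuning sketch (values are illustrative):
# let returning clients resume sessions instead of paying for a
# full handshake, and reuse TCP+TLS connections across requests.
http {
    ssl_session_cache   shared:SSL:50m;   # resumption cache shared by all workers
    ssl_session_timeout 1h;               # how long resumption data stays valid
    keepalive_timeout   65s;              # keep client connections open for reuse
}
```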
2021-11-23 32746, 2021
lucifer joined the channel
2021-11-23 32754, 2021
alastairp joined the channel
2021-11-23 32710, 2021
alastairp
sorry, irccloud is sad. here now
2021-11-23 32720, 2021
zas
too much "pre-processing" on gateways: some stuff we do there should be done on backends instead, but that's like this for multiple reasons
2021-11-23 32745, 2021
monkey joined the channel
2021-11-23 32750, 2021
lucifer
i missed some messages too. looking at chatlogs now
2021-11-23 32752, 2021
alastairp_ joined the channel
2021-11-23 32756, 2021
zas
grrrr
2021-11-23 32704, 2021
monkey
Bad timing for a split
2021-11-23 32715, 2021
zas
yep :(
2021-11-23 32736, 2021
zas
lack of redundancy: no auto switch in case of failure
2021-11-23 32751, 2021
zas
lack of horizontal scalability
2021-11-23 32742, 2021
zas
only level 7 load balancing
2021-11-23 32756, 2021
zas
so, I started to think about a redesign
2021-11-23 32720, 2021
zas
I did a lot of research into tools we could use that are open source and free (as in beer)
2021-11-23 32733, 2021
alastairp
currently requests for all MeB services go to 1 gateway? and it's routed by nginx/openresty?
2021-11-23 32742, 2021
ruaok
I never got a response from the haproxy people about non profit licensing.
2021-11-23 32742, 2021
zas
yes
2021-11-23 32722, 2021
zas
alastairp: in fact, everything is going through gateways, except a few services
2021-11-23 32702, 2021
zas
btw, I think it would be preferable to not have "holes" (aka services not going through gateways), it makes a few things harder to control (like IP blocking etc)
2021-11-23 32756, 2021
zas
so, to scale, we have to upgrade the hardware, that's not great
2021-11-23 32724, 2021
alastairp
so the goal is to get traffic off of the gateways sooner? (not sure if we're talking about goals yet)
2021-11-23 32703, 2021
zas
nope, the goal is to have better gateways, we still have to load balance over backends, do caching, do filtering, handle ssl, etc
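The weighted load balancing the gateways do can be sketched with nginx's "smooth" weighted round-robin algorithm. This is a minimal Python illustration of that algorithm; the backend names and weights are made up:

```python
# Smooth weighted round-robin, as used by nginx for upstream selection.
# Each pick: bump every backend's current weight by its configured
# weight, choose the highest, then subtract the weight total from it.
# This spreads the heavy backend's picks out instead of bursting them.

def swrr(backends, n):
    """backends: dict of name -> weight (insertion order breaks ties).
    Returns the first n picks."""
    current = {name: 0 for name in backends}
    total = sum(backends.values())
    picks = []
    for _ in range(n):
        for name, weight in backends.items():
            current[name] += weight
        best = max(current, key=current.get)  # ties go to the first-listed backend
        current[best] -= total
        picks.append(best)
    return picks

print(swrr({"herb": 5, "kiki": 1, "rudi": 1}, 7))
# -> ['herb', 'herb', 'kiki', 'herb', 'rudi', 'herb', 'herb']
```

Note how the weight-5 backend is interleaved with the others rather than taking five requests in a row, which smooths load on the backends.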
2021-11-23 32734, 2021
zas
but we need a solution that is more robust & scalable
2021-11-23 32727, 2021
zas
any questions so far?
2021-11-23 32759, 2021
zas
ok, I continue: hetzner failover IPs
2021-11-23 32733, 2021
zas
those are great, basically we can switch one or more IPs between machines (kiki & herb in this case)
2021-11-23 32759, 2021
ruaok
with the caveat that the automatic detecting that something needs failing over is VERY SLOW.
2021-11-23 32700, 2021
zas
but this is a slow process, because we have to use hetzner API, which is slow to answer
2021-11-23 32722, 2021
lucifer
how slow? 1m, 5m, 10m?
2021-11-23 32726, 2021
kepstin
(still faster than updating dns tho, probably)
2021-11-23 32734, 2021
ruaok
but if we detect the failure ourselves and tell hetzer to switch over it is fast, yes?
2021-11-23 32736, 2021
zas
the switch itself takes 1 second or less, but the API takes 1min or more
2021-11-23 32756, 2021
alastairp
is hetzner API the only way of doing the failover? (I ask because I have done the same with keepalived, and you just bring up the IP on the host that needs it)
2021-11-23 32729, 2021
zas
it is never fast: detect the failure at T, connect to API at T+1 second, wait til T+60s, switch at T+61 seconds
2021-11-23 32734, 2021
ruaok
alastairp: yes, it must be done via hetzner otherwise their routers can't route traffic.
2021-11-23 32739, 2021
alastairp
right
2021-11-23 32748, 2021
ruaok
oh, it is NEVER fast?
2021-11-23 32754, 2021
ruaok
that's pretty crap.
2021-11-23 32711, 2021
zas
yes :( it was a bit faster in the past, but now never under a minute
2021-11-23 32718, 2021
ruaok
oy
2021-11-23 32737, 2021
alastairp
devil's advocate: 60 seconds may be faster than us getting a telegram, working out what's wrong, sshing in to a gateway and triggering switchover
2021-11-23 32743, 2021
lucifer
how do we do the failover currently?
2021-11-23 32749, 2021
zas
alastairp: yes
2021-11-23 32755, 2021
zas
lucifer: manually
2021-11-23 32716, 2021
lucifer
right, i mean what's the process once we decide to do it? still need to go through the api?
2021-11-23 32725, 2021
zas
why? because I never managed to get a stable setup with failure detection
2021-11-23 32746, 2021
zas
we run a script, which asks API to switch IP
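The manual script zas describes could look roughly like this. It is a minimal sketch assuming the shape of Hetzner's Robot webservice (basic-auth POST to a failover endpoint); the env var names and the documentation-range IPs are invented:

```python
# Hypothetical sketch of "run a script, which asks the API to switch IP".
# Assumes Hetzner Robot's failover endpoint; check their docs before use.
import base64
import os
import urllib.parse
import urllib.request

ROBOT_API = "https://robot-ws.your-server.de"

def failover_request(failover_ip, new_active_server_ip):
    """Build the URL and form body for a failover switch (pure, testable)."""
    url = f"{ROBOT_API}/failover/{failover_ip}"
    body = urllib.parse.urlencode({"active_server_ip": new_active_server_ip})
    return url, body

def switch_failover(failover_ip, new_active_server_ip):
    """Point the failover IP at a new gateway, e.g. from kiki to herb.
    The switch itself is <1s once processed, but the API round trip
    is the 40-60s delay discussed above."""
    url, body = failover_request(failover_ip, new_active_server_ip)
    creds = f'{os.environ["HETZNER_USER"]}:{os.environ["HETZNER_PASS"]}'
    req = urllib.request.Request(
        url,
        data=body.encode(),
        headers={"Authorization": "Basic "
                 + base64.b64encode(creds.encode()).decode()},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.status

# Usage (invented IPs from the documentation ranges):
#   switch_failover("203.0.113.10", "198.51.100.7")
```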
2021-11-23 32752, 2021
alastairp
what is the usecase for switching over? We currently do it when zas upgrades gateways. We did it a few weeks ago when the server was overloaded. what other cases are there? server disappears unexpectedly?
2021-11-23 32716, 2021
kepstin
unless you can convince them to reconfigure their switches to allow using something like ARP triggered from your machines to do the failover, there probably isn't a faster option.
2021-11-23 32751, 2021
zas
kepstin: we can't convince them to do that, this is why we have to use the API
2021-11-23 32706, 2021
lucifer
so the only difference between the automatic failover and the script we have now is the time taken to detect the failure; after that both processes are the same, right?
2021-11-23 32713, 2021
zas
aka we can't keepalived those IPs as we would on our own network
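For contrast, on a network you control, the keepalived setup alastairp mentions is only a few lines of VRRP config per gateway; it is this kind of sub-second, router-independent failover that Hetzner's routed failover IPs rule out. Interface name, router id, and IP are invented:

```
# /etc/keepalived/keepalived.conf -- hypothetical own-network setup.
# Both gateways run this; the higher priority holds the shared IP,
# and the peer takes it over within a few advert intervals on failure.
vrrp_instance gateway_v4 {
    state BACKUP          # start BACKUP on both; priority elects the MASTER
    interface eth0
    virtual_router_id 51
    priority 150          # use a lower value (e.g. 100) on the standby
    advert_int 1          # missed adverts -> failover in a few seconds
    virtual_ipaddress {
        203.0.113.10/32   # the shared gateway IP
    }
}
```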
2021-11-23 32744, 2021
alastairp
lucifer: and as zas points out, getting a stable setup that only switches over on real errors
kepstin
interesting, so it probably involves a routing reconfiguration, since they support this across the datacenter, not just for machines on the same switch. would explain why it's slow to set up :/
2021-11-23 32749, 2021
zas
kepstin: yes
2021-11-23 32739, 2021
reosarevok has quit
2021-11-23 32744, 2021
lucifer
uh irccloud acting up again
2021-11-23 32750, 2021
ruaok
> Switching a failover IP/subnet takes between 40 and 60 seconds.
2021-11-23 32700, 2021
lucifer has quit
2021-11-23 32718, 2021
zas
yes, in practice, 60 seconds rather than 40s
2021-11-23 32744, 2021
lucifer joined the channel
2021-11-23 32702, 2021
zas
but, it is important: the actual switch is fast, < 1 second, though we lose connections when it happens