but it overlaps the network port range on the server (1024-65535), so host mode containers can use anything; that's perhaps the cause, and it was changed recently when I fixed the sysctl / ufw issue on servers
2021-02-04 03506, 2021
zas
I'll change that and we'll see
2021-02-04 03535, 2021
zas
I changed pink's local port range to 21000-54000, so it doesn't conflict with the ports defined in the docker server scripts constants
2021-02-04 03510, 2021
zas
it also means we should group ports used by services, we have a bunch around 13k, 20k and over 55k
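A minimal sketch of the overlap check discussed above, assuming made-up service ports (the real values live in the docker server scripts constants and are not shown in this log). Only the two ranges, 1024-65535 and 21000-54000, come from the chat. On Linux the local port range is exposed through the sysctl net.ipv4.ip_local_port_range.

```python
def ports_in_range(service_ports, low, high):
    """Return the service ports that fall inside the inclusive range [low, high]."""
    return sorted(p for p in service_ports if low <= p <= high)


# Hypothetical service ports, picked only to mirror "around 13k, 20k and over 55k".
SERVICE_PORTS = [13001, 20001, 55001]

OLD_RANGE = (1024, 65535)   # range quoted in the chat before the change
NEW_RANGE = (21000, 54000)  # range quoted in the chat after the change

old_conflicts = ports_in_range(SERVICE_PORTS, *OLD_RANGE)
new_conflicts = ports_in_range(SERVICE_PORTS, *NEW_RANGE)
print("old range conflicts:", old_conflicts)
print("new range conflicts:", new_conflicts)
```

With the old range every fixed service port can collide with an outgoing connection's source port; with the narrower range the illustrative ports all sit outside it.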
2021-02-04 03558, 2021
zas
yvanzo: sir-prod does not work, but that's another issue (prolly related to the long queue), please have a look asap
2021-02-04 03505, 2021
yvanzo
ok
2021-02-04 03530, 2021
yvanzo
I stopped it again
2021-02-04 03551, 2021
yusuf56 joined the channel
2021-02-04 03543, 2021
Rohan_Pillai has quit
2021-02-04 03518, 2021
yvanzo
zas: sir-prod has a RuntimeError since 9:16:52 this morning
2021-02-04 03537, 2021
yvanzo
(8:16:52 UTC)
2021-02-04 03544, 2021
yvanzo
It seems it cannot connect to RabbitMQ
2021-02-04 03552, 2021
yvanzo
I deleted and recreated sir-prod container, but I still have the logs of the previous container for the last 24 hours.
the RuntimeError itself is related to logging to sentry
2021-02-04 03512, 2021
zas
RuntimeError: maximum recursion depth exceeded in cmp <--- ?
2021-02-04 03526, 2021
yvanzo
this is related to logging with raven
2021-02-04 03558, 2021
yvanzo
at least that's my guess, and I found it is a common error with raven-python
2021-02-04 03554, 2021
yvanzo
the source error seems to be when trying to reach rmq
2021-02-04 03549, 2021
zas
to me it seems it can connect but cannot cope with all the data. This error log is very busy, so it's hard to tell what the problem is
2021-02-04 03532, 2021
yvanzo
zas: it matches the time you restarted sir-prod.
2021-02-04 03555, 2021
yvanzo
it did error all the night without anything related to amqp
2021-02-04 03541, 2021
yvanzo
so something has changed that likely makes sir unable to reach rabbitmq
2021-02-04 03500, 2021
zas
are you sure it cannot connect to rabbitmq?
2021-02-04 03526, 2021
yvanzo
I will try from the container
2021-02-04 03511, 2021
zas
I think you are mixing 2 issues here: one (this night) was due to the port issue and was a connection issue, but it should be solved now; the other is that too much data has accumulated and sir is unable to cope with it. But I don't know enough about this stuff to be sure
2021-02-04 03547, 2021
zas
there are 115k items in queue atm and it keeps growing, meaning rabbitmq is working
2021-02-04 03547, 2021
zas
rabbitmq log is rather useless in this matter, it doesn't show client IPs
rabbitmq is growing and I can reach it from host at least
2021-02-04 03544, 2021
CatQuest
morn!
2021-02-04 03558, 2021
CatQuest
I see the rabbit(mq) is at it again, growing..
2021-02-04 03502, 2021
CatQuest
:D
2021-02-04 03544, 2021
reosarevok
yvanzo, zas: do you expect to need to restart pink again soon?
2021-02-04 03554, 2021
reosarevok
Or should I start running my bot? :D
2021-02-04 03552, 2021
ruaok
things better now??
2021-02-04 03549, 2021
yvanzo
no
2021-02-04 03508, 2021
ruaok
oh boo. :(
2021-02-04 03516, 2021
yvanzo
reosarevok: it's okay to run your bot
2021-02-04 03531, 2021
ruaok
let me know if you want another set of eyes to look. but I suspect that you have enough already.
2021-02-04 03501, 2021
Gazooo7949440 has quit
2021-02-04 03534, 2021
yvanzo
I don't see why 100k queued msg would be an issue.
2021-02-04 03543, 2021
Gazooo7949440 joined the channel
2021-02-04 03527, 2021
ruaok updates the gsoc ideas redirect in the usual decade update cycle
2021-02-04 03555, 2021
ruaok
100k queued essages and trille freaks out?
2021-02-04 03558, 2021
ruaok
+m
2021-02-04 03526, 2021
yvanzo
I’m scrutinizing sir's code to see what could make it throw the error in the first place.
2021-02-04 03540, 2021
yvanzo
trille is probably fine; the problem might be with sir's code or its amqp requirement.
2021-02-04 03545, 2021
ruaok
I wonder if there is something... that isn't being cleaned up, and after X years we've accumulated enough cruft that things start slowing down. Like "duh, you didn't vacuum your PG database, no wonder it's slow"
2021-02-04 03507, 2021
yvanzo
it's not related to PG, that is for sure.
2021-02-04 03540, 2021
yvanzo
sir only retrieves 100 msg at a time, I don't see why the queue length would be a problem.
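A rough illustration of that argument, assuming a plain in-memory queue rather than RabbitMQ or sir's actual consumer: when a consumer pulls a fixed batch, the work per batch depends on the batch size and not on how many messages are still waiting. The function name and the deque stand-in are hypothetical.

```python
from collections import deque


def take_batch(queue, size=100):
    """Remove and return up to `size` messages from the front of the queue."""
    count = min(size, len(queue))
    return [queue.popleft() for _ in range(count)]


# Stand-in for the ~115k queued messages mentioned earlier in the log.
backlog = deque(range(115_000))
batch = take_batch(backlog)
print(len(batch), "taken,", len(backlog), "still queued")
```

This only models the consumer side; it says nothing about broker-side costs or what sir does on startup.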
2021-02-04 03559, 2021
ruaok
and why is it *now* a problem and not before?
2021-02-04 03544, 2021
yvanzo
I hoped it could be zas fault for fixing network port setup ;)
2021-02-04 03503, 2021
Mineo
I had a quick look at the stacktrace: what seems to be happening is that sir is in the process of acknowledging a message during the initial connection attempt (timestamps 2021-02-04T09:07:58.656585230Z and 2021-02-04T09:07:58.656641473Z) and that includes the following line: https://github.com/metabrainz/sir/blob/a586387c24… which basically says
2021-02-04 03509, 2021
Mineo
"hey, if we've lost the connection to rabbitmq while processing this message, please reconnect, so there's someone we can send the ACK to". that usually includes the following lines to skip that when the connection is already set up: https://github.com/metabrainz/sir/blob/a586387c24… but that only works if
2021-02-04 03514, 2021
Mineo
https://github.com/metabrainz/sir/blob/a586387c24… were already called to set the "yes, there's an existing connection" flag. however, that still begs the very good question "and why is it *now* a problem and not before?" :(
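The sequence Mineo describes can be sketched with a toy model. This is not sir's code: the class, the flag name and the method names are all hypothetical, and the only thing taken from the chat is the idea that an acknowledgement arriving before the "existing connection" flag is set triggers another connection attempt.

```python
class ToyConsumer:
    """Toy consumer with a 'connection already set up' flag (hypothetical names)."""

    def __init__(self, pending):
        self.pending = list(pending)  # messages delivered as soon as we connect
        self.connected = False        # flag that is only set at the end of connect()
        self.connect_calls = 0
        self.acked = []

    def connect(self):
        self.connect_calls += 1
        # Messages can be handled before the flag below has been set.
        while self.pending:
            self.handle(self.pending.pop(0))
        self.connected = True

    def ensure_connection(self):
        # Reconnecting is skipped only when the flag says a connection exists.
        if not self.connected:
            self.connect()

    def handle(self, message):
        # Before acknowledging, make sure there is someone to send the ACK to.
        self.ensure_connection()
        self.acked.append(message)


consumer = ToyConsumer(["msg-1"])
consumer.connect()
print("connect attempts:", consumer.connect_calls)
```

In this toy model a single message handled during the initial connection is enough to cause a second, nested connection attempt; once the flag is set, later messages no longer reconnect.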
2021-02-04 03542, 2021
yvanzo
in rabbitmq, there are a lot of "missed heartbeats from client, timeout: 60s"
2021-02-04 03526, 2021
Etua joined the channel
2021-02-04 03550, 2021
yvanzo
139 errors in 10min, it's likely related to sir (I don't think CAA, CB or LB could produce that many errors at once)
2021-02-04 03551, 2021
ruaok
LB gets those too. and then a connection is reestablished.
2021-02-04 03521, 2021
ruaok
agreed, LB does not generate that many errors
2021-02-04 03554, 2021
zas
yvanzo: are you sure sir retrieves only 100 messages at start? because usually when we restart it, it first gets all messages, then goes 100 by 100. But that's according to the logs.
2021-02-04 03504, 2021
yvanzo
(because SIR has 12 import threads, it is probably the most active)
just use the git version in CB and see if it works right?
2021-02-04 03512, 2021
alastairp
we need to: 1) update BU in some downstream dependency (e.g. LB) to this branch, 2) make sure we pip install --upgrade pip to the latest version, 3) try and install dependencies
2021-02-04 03524, 2021
alastairp
if that works in all brainzes, then it's ready to release
2021-02-04 03541, 2021
_lucifer
makes sense, i'll do that
2021-02-04 03546, 2021
alastairp
thanks!
2021-02-04 03523, 2021
_lucifer
moving back to cache then, what other than BU-4 do you have in mind