zas: Hello! Hope you are not too busy battling AI vermin, if you have a moment I have a question regarding a Grafana alert.... (full message at <https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/KYxbdoKEvZqxPwRnDSGEQONU>)
2025-03-25 08413, 2025
d4rk-ph0enix has quit
2025-03-25 08435, 2025
d4rk-ph0enix joined the channel
2025-03-25 08409, 2025
zas[m]
Why doesn't it provide data sometimes? I mean, if the reason is the service doesn't run we expect an alert, right?
2025-03-25 08453, 2025
monkey[m]
Transient issues that resolve themselves
2025-03-25 08436, 2025
monkey[m]
We have other alerts for non-responding website and API, so I'm not expecting these stats alerts to trigger when connection is temporarily lost.
2025-03-25 08455, 2025
monkey[m]
After a delay of 10 minutes, then yes it would make sense to trigger.
2025-03-25 08418, 2025
zas[m]
There's already a 5m pending period for this alert, we can increase this
2025-03-25 08430, 2025
monkey[m]
Ah, yes please!
2025-03-25 08457, 2025
monkey[m]
Is that a 5m delay on triggering the alert in any case, or just for loss of data (just out of curiosity)
2025-03-25 08459, 2025
zas[m]
10m or more?
2025-03-25 08410, 2025
jasje[m]
Any contributors who are cooking a GSoC proposal towards ListenBrainz Android project should take a look at ideas page again. Eased some things and added more context.
2025-03-25 08416, 2025
zas[m]
Any alert for this alert rule
2025-03-25 08422, 2025
zas[m]
We can't dissociate them
2025-03-25 08458, 2025
monkey[m]
That's what I figured. Thanks for confirming
2025-03-25 08412, 2025
monkey[m]
To me this reads as a 1m delay, not 5. Am I looking at the wrong thing?
2025-03-25 08415, 2025
monkey[m] uploaded an image: (34KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/XKEAkQFpjiAZCDNZEtXQOqpW/image.png >
2025-03-25 08430, 2025
zas[m]
That's the time the rule is evaluated, but the pending period is the time before it actually alerts (if the state is the same)
2025-03-25 08434, 2025
zas[m] uploaded an image: (46KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/matrix.org/wQKnfFxSmudIESOxhainIIZZ/image.png >
2025-03-25 08407, 2025
zas[m]
So basically it evaluates the state every minute, but waits for 5 minutes to see if it was transient or not
2025-03-25 08425, 2025
zas[m]
It limits false alerts
2025-03-25 08453, 2025
zas[m]
We can evaluate the state less often too
2025-03-25 08406, 2025
zas[m]
but then the pending period has to be longer
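
A minimal sketch of the behaviour zas describes above, assuming a simplified model: the rule is evaluated at a fixed interval, a breaching result first puts it in "Pending", and it only moves to "Alerting" (when notifications would be sent) once the condition has held for the whole pending period. The class and function names are hypothetical and this is not Grafana's code; exact timing at the boundaries may differ in Grafana itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertRule:
    """Toy model of an alert rule with an evaluation interval and a pending period."""
    eval_interval_s: int = 60     # how often the rule is evaluated (1m)
    pending_period_s: int = 300   # how long the condition must persist (5m)
    state: str = "Normal"
    breaching_since: Optional[int] = None

    def evaluate(self, now_s: int, breaching: bool) -> str:
        if not breaching:
            # Condition cleared: a transient issue never reaches "Alerting".
            self.state = "Normal"
            self.breaching_since = None
        else:
            if self.breaching_since is None:
                self.breaching_since = now_s
            held_for = now_s - self.breaching_since
            self.state = "Alerting" if held_for >= self.pending_period_s else "Pending"
        return self.state


def simulate(samples: List[bool], rule: Optional[AlertRule] = None) -> List[str]:
    """Feed one boolean per evaluation tick and return the state after each tick."""
    if rule is None:
        rule = AlertRule()
    return [rule.evaluate(i * rule.eval_interval_s, b) for i, b in enumerate(samples)]


if __name__ == "__main__":
    # A two-minute blip never alerts; a sustained problem does.
    print(simulate([True, True, False, True]))
    print(simulate([True] * 7))
```

In this model a connection loss that clears within the pending period only ever shows as "Pending", which is the false-alert limiting effect discussed here.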
2025-03-25 08439, 2025
monkey[m]
I was looking at the pending period, it seemed to me to be set to 1m when I opened the alert edit page
2025-03-25 08448, 2025
zas[m]
If you look carefully Grafana shows alerts as activated as "Pending" when it happens, notifications are sent at the end of the pending period
2025-03-25 08403, 2025
zas[m]
s/as//
2025-03-25 08404, 2025
monkey[m]
Anyway, thanks for the assist. Let's see how it goes with 5m delay.
2025-03-25 08417, 2025
zas[m]
Wait, which alert did you copy from? Because I have 5m for stats
2025-03-25 08428, 2025
monkey[m]
<zas[m]> "There's already a 5m pending..." <- This is the bit I was confused about. I can only see 1m pending period, nothing that says it was set to 5m
OK, and it wasn't configured with the same delay. Mystery solved :)
2025-03-25 08419, 2025
monkey[m]
I'll change that for the sitewide stats alerts
2025-03-25 08446, 2025
monkey[m]
Then I think it will work fine, the other (non-sitewide) alerts have been behaving better
2025-03-25 08457, 2025
zas[m]
The pending period limits the number of notifications if states change too quickly$
2025-03-25 08404, 2025
zas[m]
s/$/./
2025-03-25 08420, 2025
monkey[m]
Yep, that's what I was looking for.
2025-03-25 08425, 2025
monkey[m]
Thank you!
2025-03-25 08440, 2025
zas[m]
np :)
2025-03-25 08429, 2025
reosarevok[m]
Weren't we supposed to meet around now? :)
2025-03-25 08437, 2025
zas[m]
yup
2025-03-25 08408, 2025
reosarevok[m]
mayhem, bitmap, julian45 @julian45:julian45.net
2025-03-25 08414, 2025
mayhem[m] is here despite being on the phone with the bank
2025-03-25 08409, 2025
reosarevok[m]
So, any updates or new info?
2025-03-25 08419, 2025
julian45[m]
none from me that haven't already been discussed out-of-band
2025-03-25 08425, 2025
zas[m]
We have to decide what to do, it was suggested to use Anubis, does it look actionable? Are there any objections?
2025-03-25 08435, 2025
mayhem[m]
I wrote a proposal with my ideas, but didn't get a lot of feedback, only from julian45 who made a number of good arguments as to why it's not a great idea.
2025-03-25 08420, 2025
zas[m]
We had discussions about Cloudflare, and potential conflicts with our policies, this should be investigated too (in case we decide to move to such service, cloudflare or similar)
2025-03-25 08430, 2025
mayhem[m]
I feel so so about anubis. it seems a bit heavy handed, so I wish we could find out more about what level of effort these people are willing to go through to keep scraping us.
2025-03-25 08454, 2025
mayhem[m]
the recent cloudflare outages make me really not like that option much.
2025-03-25 08410, 2025
julian45[m]
anubis looks actionable IMO as a first line/first attempt, esp since docs indicate policy is configurable to allow, e.g., legit scrapers like google while challenging others
I think heavy handed is a good start given the situation
2025-03-25 08446, 2025
mayhem[m]
if we implement anubis, can we have it only on pages that contain data that is being scraped?
2025-03-25 08450, 2025
reosarevok[m]
(and we could lower the heavy-handedness if we get things under control in other ways)
2025-03-25 08453, 2025
julian45[m]
i do worry that it could potentially be annoying for some users who, e.g., disable js by default in their browsers, but those kinds of folks should be willing to carve out exceptions
2025-03-25 08410, 2025
mayhem[m]
e.g. style guides require no anubis
2025-03-25 08412, 2025
zas[m]
mayhem: about your proposal, I think we should rely on existing tools first if possible (not reinventing the wheel), but I don't totally rule it out, because we might find limitations in third party tools we don't have with our own ones.
2025-03-25 08421, 2025
reosarevok[m]
Those users could possibly get in touch with us and get exceptions added for them
2025-03-25 08424, 2025
julian45[m]
mayhem[m]: this kind of goes back to the separation of API requests to a subdomain need that was discussed yesterday
2025-03-25 08427, 2025
reosarevok[m]
(re: no-js people)
2025-03-25 08407, 2025
julian45[m]
reosarevok[m]: if they aren't able to configure their clients to make exceptions for us, sure - chances are they would need js to use our site(s) anyway, no?
2025-03-25 08423, 2025
mayhem[m]
julian45[m]: that seems conflated to me. we can partition the URL space for web pages without ever considering the API.
2025-03-25 08445, 2025
julian45[m]
ah i see
2025-03-25 08402, 2025
zas[m]
They (may) hit ANY page, but the fact is those with MB data have a much higher cost for us (they hit backends and db)
2025-03-25 08407, 2025
reosarevok[m]
julian45: fair, we require JS for editing anyway, just not for reading - but the kind of people we'd possibly make exceptions for probably edit so
2025-03-25 08423, 2025
julian45[m]
if needed, per the policy doc i linked, i think we can tell anubis to let certain path regexes through but not others
2025-03-25 08429, 2025
mayhem[m]
julian45[m]: still, your point is valid.
2025-03-25 08442, 2025
mayhem[m]
but probably not needed right this second.
2025-03-25 08402, 2025
reosarevok[m]
Would something like anubis interfere with external seeding?
2025-03-25 08437, 2025
reosarevok[m]
(it's ok if it does temporarily given the circumstances, but making sure we don't need to make MBS changes to support it)
2025-03-25 08443, 2025
julian45[m]
i doubt it, but then again external seeding is something we allow that many implementers (e.g., GNOME project gitlab) might not
2025-03-25 08403, 2025
julian45[m]
so i would suggest deploying on test or beta to figure that out before prod
2025-03-25 08420, 2025
julian45[m]
* i doubt it, but then again external seeding is something we have as part of our use cases that many implementers (e.g., GNOME project gitlab) might not
2025-03-25 08437, 2025
reosarevok[m]
zas: is beta also being hit?
2025-03-25 08455, 2025
zas[m]
Yes, it was first
2025-03-25 08457, 2025
reosarevok[m]
(so, if we put this in front of beta first, would it actually teach us if it will help)
2025-03-25 08400, 2025
reosarevok[m]
Ok
2025-03-25 08413, 2025
zas[m]
This is how I discovered the problem, beta containers were eating a lot of resources suddenly
2025-03-25 08448, 2025
bitmap[m]
reosarevok[m]: it might interfere with release editor seeding since that requires POST data which can't be redirected
2025-03-25 08447, 2025
reosarevok[m]
Hopefully those can be let through then since those should never match the kind of hits causing issues
2025-03-25 08407, 2025
zas[m]
But seeding requires being logged in, so we can just skip any check for those
2025-03-25 08432, 2025
julian45[m]
zas[m]: it usually forces logout, then login though, right?
2025-03-25 08446, 2025
reosarevok[m]
Ok, that's another thing I asked several times but I think never got an answer for: can we separate logged in from not logged in queries?
2025-03-25 08450, 2025
reosarevok[m]
For anubis
2025-03-25 08400, 2025
reosarevok[m]
And run it only on logged out for now
2025-03-25 08426, 2025
mayhem[m]
I propose that we try anubis on mb.org's data pages and see what happens.
2025-03-25 08427, 2025
mayhem[m]
I much prefer this option over cloudflare.
2025-03-25 08427, 2025
mayhem[m]
who objects to this suggestion?
2025-03-25 08439, 2025
mayhem[m]
(sorry was disconnected for a bit, back now)
2025-03-25 08442, 2025
zas[m]
bitmap suggested an internal header set by backends for that, so we can get this info on gateways at least
2025-03-25 08450, 2025
reosarevok[m]
I'd say on beta.mb.org for now, but I'd agree otherwise
2025-03-25 08453, 2025
julian45[m]
reosarevok[m]: unfortunately not sure
2025-03-25 08403, 2025
mayhem[m]
mayhem[m]: this may now be out of date. heh.
2025-03-25 08412, 2025
julian45[m]
* not sure i.r.t. anubis
2025-03-25 08422, 2025
reosarevok[m]
That'd also allow us to play with any changes we need to make things better on the MBS side
2025-03-25 08428, 2025
reosarevok[m]
Before we put them in prod
2025-03-25 08448, 2025
bitmap[m]
julian45[m]: yeah, IIRC the login cookies are not available to the request MB receives
2025-03-25 08404, 2025
bitmap[m]
but if it's only restricted to data pages then not an issue
2025-03-25 08430, 2025
reosarevok[m]
Yeah, I guess if we can ignore /edit pages it seems good
2025-03-25 08457, 2025
reosarevok[m]
Well, /edit /add /create etc
2025-03-25 08415, 2025
reosarevok[m]
But we could easily come up with a list
2025-03-25 08424, 2025
julian45[m]
reosarevok[m]: which seems doable but i would like someone to double check the doc page i linked to make sure i'm interpreting correctly
2025-03-25 08443, 2025
julian45[m]
reosarevok[m]: great, because policy for anubis is configured by json file anyway per docs
{... (full message at <https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/jJwopxmBmSAzwBbgUVbNzSrO>)
2025-03-25 08447, 2025
reosarevok[m]
I don't see why it would not work for other things than robots.txt :)
2025-03-25 08458, 2025
julian45[m]
exactly
2025-03-25 08422, 2025
julian45[m]
just wanted to be sure i wasn't the only one reaching that conclusion from docs
2025-03-25 08432, 2025
reosarevok[m]
Ok, this looks like it should work - so sysadmin team works on setting up anubis, mbs team figures out what paths to allow (could include all of /ws/2 as well for now AFAICT), we reconvene and let it loose on test first, then beta if nothing is horribly broken?
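
The split proposed above (challenge data pages, let editing paths and `/ws/2` through) can be sketched as a plain path classifier. This is an illustration only: it is not Anubis's policy file format, the pattern list is a guess built from the paths named in the discussion (`/edit`, `/add`, `/create`, `/ws/2`), and `needs_challenge` is a hypothetical helper name. The real list would come from the MBS team.

```python
import re

# Hypothetical exemption list based on the paths mentioned in the chat.
EXEMPT_PATTERNS = [
    re.compile(r"^/ws/2(/|$)"),               # web service requests
    re.compile(r"/(edit|add|create)(/|$)"),   # editing and seeding endpoints
]


def needs_challenge(path: str) -> bool:
    """Return True when a request path should be sent through the challenge."""
    return not any(pattern.search(path) for pattern in EXEMPT_PATTERNS)


if __name__ == "__main__":
    for path in ("/artist/abc", "/artist/abc/edit", "/release/add", "/ws/2/artist/abc"):
        print(path, "->", "challenge" if needs_challenge(path) else "allow")
```

Trying a list like this against real request logs on test before beta would show whether seeding or API clients get caught by mistake.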
2025-03-25 08401, 2025
bitmap[m]
"Anubis uses a multi-threaded proof of work check to ensure that users browsers are up to date and support modern standards." not sure what they mean by the last past (since we support older browsers too)
2025-03-25 08450, 2025
zas[m]
I wonder if Anubis is able to handle our traffic too. There's no number.
2025-03-25 08459, 2025
zas[m]
"Anubis has very minimal system requirements. I suspect that 128Mi of ram may be sufficient for a large number of concurrent clients. Anubis may be a poor fit for apps that use WebSockets and maintain open connections, but I don't have enough real-world experience to know one way or another."
2025-03-25 08400, 2025
reosarevok[m]
If we need to limit support for some older browsers temporarily while we find better options, that's a sacrifice that seems sensible to me
2025-03-25 08406, 2025
bitmap[m]
s/past/part/
2025-03-25 08420, 2025
reosarevok[m]
zas[m]: Only one way to find out?
2025-03-25 08444, 2025
zas[m]
Well, we can conduct a test on test.mb (...)
2025-03-25 08455, 2025
zas[m]
And evaluate pros & cons after that
2025-03-25 08433, 2025
reosarevok[m]
It seems worth a try compared with what you are having to spend time doing now
2025-03-25 08456, 2025
reosarevok[m]
Worst case scenario, we know we need to keep doing the same and look into cloudflare or our own version