I've changed the owner/notifications to be #listenbrainz now -- so if we see an alert from Sentry something is actually wrong and needs to be investigated.
2025-01-24 02426, 2025
monkey[m]
Roger.
2025-01-24 02452, 2025
monkey[m]
Nice work, this is going to be very helpful
2025-01-24 02446, 2025
mayhem[m]
agreed. we should know about failures before our users do. #abouttime
2025-01-24 02434, 2025
monkey[m]
Dang, sentry is a drama queen
2025-01-24 02435, 2025
monkey[m] uploaded an image: (4KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/xIjTgADllrWthsmSHmlybhgJ/image.png >
2025-01-24 02417, 2025
monkey[m]
mayhem: Should we try to reach out to users like this, considering their metadata is not exactly clean?
LB-1590: Global recording listen counts not updating
2025-01-24 02438, 2025
monkey[m]
Unclear to me. I would say it is probably fixed, but if we close the ticket it would be good to add a comment welcoming users to reopen the ticket accordingly.
2025-01-24 02406, 2025
monkey[m]
From recently looking at it, global counts were looking good
mayhem: the base class IncrementalStats that was added in the sitewide stats PR has documentation for all the abstract methods. it's not in the current PR and that's why it seems like a lot of the documentation is missing. I have added a comment to all the concrete classes to look at that class for documentation.
mayhem: i am unsure what the question is but every sitewide stat is calculated separately in spark and stored in the same couchdb database.
2025-01-24 02452, 2025
lucifer[m]
i think one job for each stat-range combination might be an overkill though.
2025-01-24 02456, 2025
mayhem[m]
is there a chance that, for instance, artists stats would succeed and be present, but release stats are not?
2025-01-24 02405, 2025
lucifer[m]
maybe just one job that check it all.
2025-01-24 02406, 2025
lucifer[m]
yes.
2025-01-24 02421, 2025
lucifer[m]
and also artists week can pass but artists year can fail.
2025-01-24 02430, 2025
mayhem[m]
yeah, I did one job where I checked artist and monkey thinks we should check the others.
2025-01-24 02430, 2025
lucifer[m]
applies to all user stats as well
2025-01-24 02416, 2025
mayhem[m]
it's clearly a slippery slope... where do you stop checking?
2025-01-24 02435, 2025
lucifer[m]
i think you can club it by range or entity.
2025-01-24 02457, 2025
mayhem[m]
club?
2025-01-24 02459, 2025
monkey[m]
By range might make sense.
2025-01-24 02401, 2025
monkey[m]
Buckets
2025-01-24 02420, 2025
lucifer[m]
group the checks. a check passes if all ranges of a given entity are up to date.
2025-01-24 02423, 2025
mayhem[m]
ok, so check all artists ranges?
2025-01-24 02427, 2025
lucifer[m]
yes.
2025-01-24 02435, 2025
mayhem[m]
ok, can do.
2025-01-24 02443, 2025
monkey[m]
So, check weekly artist and release stats age, if one is missing return none (alert), otherwise use the oldest date between the two
2025-01-24 02456, 2025
monkey[m]
Same for other ranges.
2025-01-24 02401, 2025
monkey[m]
Does that sound reasonable?
2025-01-24 02419, 2025
mayhem[m]
"weekly artist and release stats age," that is not what I understood.
2025-01-24 02437, 2025
mayhem[m]
check artist all time, artist week, artist last week.
2025-01-24 02446, 2025
mayhem[m]
and so on
2025-01-24 02452, 2025
monkey[m]
Depends how we group it I suppose. In the interest of not creating too many alerts, we could group by range (i.e. artists+releases weekly, artists+releases monthly, etc.)
2025-01-24 02409, 2025
monkey[m]
That's how I understood "club it by range or entity"
2025-01-24 02412, 2025
monkey[m]
or we could group by time range, i.e. check artist stats for week+month+year
2025-01-24 02449, 2025
monkey[m]
But the issue here is that our alerting system is based on an age metric. Are those stats all calculated at the exact same time and interval?
2025-01-24 02453, 2025
mayhem[m]
as lucifer said, that is overkill.
2025-01-24 02410, 2025
mayhem[m]
both he and I agreed to pick one entity and check all the ranges.
2025-01-24 02419, 2025
mayhem[m]
and if we find that that fails us, we can always improve it.
2025-01-24 02434, 2025
monkey[m]
> a check passes if all ranges of a given entity are up to date
2025-01-24 02435, 2025
monkey[m]
I might be misunderstanding, but I don't know if that will work
2025-01-24 02456, 2025
monkey[m]
Because we don't return a boolean OK/not OK
2025-01-24 02458, 2025
lucifer[m]
mayhem: sorry for the confusion, i think you should check all ranges and entities but report only one alert if any of them is failing.
2025-01-24 02417, 2025
mayhem[m]
oh, now I see what you were saying.
2025-01-24 02431, 2025
mayhem[m]
I wonder how long that test will take.
2025-01-24 02453, 2025
lucifer[m]
should be less than a minute or two i think.
2025-01-24 02408, 2025
mayhem[m]
yea, but this is being done in response to a web call.
2025-01-24 02416, 2025
mayhem[m]
if the call times out it could give a false positive.
2025-01-24 02435, 2025
lucifer[m]
does grafana make the call or custom python code?
2025-01-24 02439, 2025
mayhem[m]
for this reason we might need to break it into smaller groups
2025-01-24 02450, 2025
mayhem[m]
I believe grafana.
2025-01-24 02417, 2025
mayhem[m]
I would be inclined to make a check for artists, all ranges. releases, all ranges.
2025-01-24 02424, 2025
mayhem[m]
and so on. that should finish in time.
2025-01-24 02434, 2025
lucifer[m]
sure that sounds fine too.
2025-01-24 02446, 2025
mayhem[m]
k
2025-01-24 02441, 2025
lucifer[m]
alternatively we can write the time at which stats are ingested into couchdb to say redis and let grafana handle it.
2025-01-24 02420, 2025
mayhem[m]
that's more work. let me see how this plays out.
2025-01-24 02453, 2025
lucifer[m]
i see prometheus is pull based so we would store the latest timestamp in redis and then the endpoint returns the age from redis.
2025-01-24 02412, 2025
lucifer[m]
sure, we can change it later if needed.
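For illustration, a minimal sketch of the alternative lucifer describes: record the ingest time when stats are written, and have the pull-based endpoint report the age. A plain dict stands in for Redis so the sketch runs on its own; the key prefix and the function names `record_ingest` and `ingest_age` are hypothetical, not taken from the ListenBrainz code.

```python
import time
from typing import MutableMapping, Optional

# Hypothetical key prefix; a dict stands in for the Redis key/value store.
KEY_PREFIX = "stats_ingested:"


def record_ingest(store: MutableMapping[str, str], entity: str,
                  now: Optional[float] = None) -> None:
    """Remember when stats for `entity` were last written to the stats DB."""
    timestamp = time.time() if now is None else now
    store[KEY_PREFIX + entity] = str(timestamp)


def ingest_age(store: MutableMapping[str, str], entity: str,
               now: Optional[float] = None) -> Optional[float]:
    """Return seconds since the last ingest, or None if none was recorded."""
    raw = store.get(KEY_PREFIX + entity)
    if raw is None:
        return None  # nothing recorded yet: the caller should treat this as an alert
    current = time.time() if now is None else now
    return current - float(raw)


store: dict = {}
record_ingest(store, "artists", now=1000.0)
artists_age = ingest_age(store, "artists", now=1060.0)
releases_age = ingest_age(store, "releases", now=1060.0)
```

With this shape the scrape endpoint only reads one small value per entity, so it should not be at risk of timing out the way a full CouchDB check might.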
2025-01-24 02400, 2025
mayhem[m]
hhmm. next problem.
2025-01-24 02418, 2025
mayhem[m]
right now the python code does not make a determination if something is out of date or not.
2025-01-24 02407, 2025
mayhem[m]
the grafana alert does that. so I can't realistically check all of them and then report back if at least one failed. that decision is not mine to make. How should we handle that?
2025-01-24 02446, 2025
mayhem[m]
short of reporting them all, I would have to duplicate "is this current" logic, which is not great.
2025-01-24 02402, 2025
monkey[m]
That is why I suggested returning the age of whichever is the oldest stat for that entity. So for example if weekly artist stats ran fine but monthly did not, return the age of the (previous) monthly stats.
2025-01-24 02402, 2025
monkey[m]
That means when an alert triggers we don't exactly know which range failed, but we know some artist stats failed.
2025-01-24 02437, 2025
mayhem[m]
ah, good simple solution! will do that.
2025-01-24 02433, 2025
monkey[m]
And of course if even only one of the stats for that entity is not found in DB, return none, which would alert
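For illustration, a minimal sketch of the approach monkey describes: report the age of the oldest stat across all ranges of one entity, and return None as soon as any range is missing so that the alert fires. The range names, the `last_updated` mapping, and the function name `oldest_stats_age` are assumptions made for this sketch, not the actual ListenBrainz code.

```python
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional

# Hypothetical list of ranges; the real set of ranges may differ.
RANGES = ("week", "month", "year", "all_time")


def oldest_stats_age(last_updated: Mapping[str, Optional[datetime]],
                     now: Optional[datetime] = None) -> Optional[float]:
    """Return the age in seconds of the stalest range for one entity.

    Returns None if any range has no stats at all, so the alerting side
    can treat a missing value as a failure.
    """
    now = now or datetime.now(timezone.utc)
    ages = []
    for stats_range in RANGES:
        updated_at = last_updated.get(stats_range)
        if updated_at is None:
            return None  # one missing range is enough to alert
        ages.append((now - updated_at).total_seconds())
    return max(ages)  # the oldest stat decides the reported age


now = datetime(2025, 1, 24, 12, 0, tzinfo=timezone.utc)
fresh = {r: now - timedelta(hours=1) for r in RANGES}
stale_month = dict(fresh, month=now - timedelta(days=3))
missing_year = {r: t for r, t in fresh.items() if r != "year"}

fresh_age = oldest_stats_age(fresh, now)
stale_age = oldest_stats_age(stale_month, now)
missing_age = oldest_stats_age(missing_year, now)
```

The "is this current" threshold stays in the Grafana alert; the Python side only reports a single age per entity.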
2025-01-24 02446, 2025
Arsen_ is now known as Arsen
2025-01-24 02438, 2025
minimal joined the channel
2025-01-24 02438, 2025
Jade[m]1 has quit
2025-01-24 02411, 2025
zas[m]
bitmap: we keep getting OOM kills on selda/yamaoka (MB containers). I was thinking about what you said (that now that we get more requests on the same server, downtime has more impact on users). What about running multiple containers for the same service on the same machine and reverting to the settings we had before? We could have 2 or 3 instances of MB containers on the same machine. It will change deployments/upgrades though, but over
2025-01-24 02411, 2025
zas[m]
time we'll get even more powerful machines, so such a move makes sense anyway. It would also make it easier to control the resources allocated to each container and limit the impact of misbehavior.
2025-01-24 02415, 2025
zas[m]
Well, just something I was thinking about.
2025-01-24 02436, 2025
mario[m] joined the channel
2025-01-24 02436, 2025
mario[m]
Hi everyone, I'm new here so not sure if this is the correct room for a request like this, but I recently moved my Jellyfin install to a new provider, and now it looks like my IP can't reach musicbrainz.org (maybe blacklisted?) when running "beet import". Can I request for the IP (43.153.138.95) to be whitelisted?
2025-01-24 02436, 2025
mario[m]
Thanks - and if this is not the right channel, apologies!
2025-01-24 02457, 2025
mayhem[m]
zas: one for you ^^^^
2025-01-24 02420, 2025
mayhem[m]
mario: zas normally handles these requests, hang tight until he appears