I've changed the owner/notifications to be #listenbrainz now -- so if we see an alert from Sentry something is actually wrong and needs to be investigated.
monkey[m]
Roger.
Nice work, this is going to be very helpful
mayhem[m]
agreed. we should know about failures before our users do. #abouttime
monkey[m]
Dang, sentry is a drama queen
monkey[m] uploaded an image: (4KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/xIjTgADllrWthsmSHmlybhgJ/image.png >
mayhem: Should we try to reach out to users like this, considering their metadata is not exactly clean?
LB-1590: Global recording listen counts not updating
monkey[m]
Unclear to me. I would say it is probably fixed, but if we close the ticket it would be good to add a comment welcoming users to reopen it if needed
From recently looking at it, global counts were looking good
mayhem: the base class IncrementalStats that was added in the sitewide stats PR has documentation for all the abstract methods. It's not in the current PR, and that's why it seems like a lot of the documentation is missing. I have added a comment to all the concrete classes to look at that class for documentation.
mayhem: i am unsure what the question is but every sitewide stat is calculated separately in spark and stored in the same couchdb database.
i think one job for each stat-range combination might be an overkill though.
mayhem[m]
is there a chance that, for instance, artist stats would succeed and be present, but release stats are not?
lucifer[m]
maybe just one job that checks it all.
yes.
and also artists week can pass but artists year can fail.
mayhem[m]
yeah, I did one job where I checked artist and monkey thinks we should check the others.
lucifer[m]
applies to all user stats as well
mayhem[m]
it's clearly a slippery slope... where do you stop checking?
lucifer[m]
i think you can club it by range or entity.
mayhem[m]
club?
monkey[m]
By range might make sense.
Buckets
lucifer[m]
group the checks. a check passes if all ranges of a given entity are up to date.
mayhem[m]
ok, so check all artists ranges?
lucifer[m]
yes.
mayhem[m]
ok, can do.
monkey[m]
So, check weekly artist and release stats age, if one is missing return none (alert), otherwise use the oldest date between the two
Same for other ranges.
Does that sound reasonable?
mayhem[m]
"weekly artist and release stats age," that is not what I understood.
check artist all time, artist week, artist last week.
and so on
monkey[m]
Depends how we group it I suppose. In the interest of not creating too many alerts, we could group by range (i.e. artists+releases weekly, artists+releases monthly, etc.)
That's how I understood "club it by range or entity"
or we could group by entity, i.e. check artist stats for week+month+year
But the issue here is that our alerting system is based on an age metric. Are those stats all calculated at the exact same time and interval?
mayhem[m]
as lucifer said, that is overkill.
both he and I agreed to pick one entity and check all the ranges.
and if we find that that fails us, we can always improve it.
monkey[m]
> a check passes if all ranges of a given entity are up to date
I might be misunderstanding, but I don't know if that will work
Because we don't return a boolean OK/not OK
lucifer[m]
mayhem: sorry for the confusion, i think you should check all ranges and entities but report only one alert if any of them is failing.
mayhem[m]
oh, now I see what you were saying.
I wonder how long that test will take.
lucifer[m]
should be less than 1 minute or two i think.
mayhem[m]
yea, but this is being done in response to a web call.
if the call times out it could give a false positive.
lucifer[m]
does grafana make the call or custom python code?
mayhem[m]
for this reason we might need to break it into smaller groups
I believe grafana.
I would be inclined to make a check for artists, all ranges. releases, all ranges.
and so on. that should finish in time.
lucifer[m]
sure that sounds fine too.
mayhem[m]
k
lucifer[m]
alternatively we can write the time at which stats are ingested into couchdb to say redis and let grafana handle it.
mayhem[m]
that's more work. let me see how this plays out.
lucifer[m]
i see prometheus is pull based so we would store the latest timestamp in redis and then the endpoint returns the age from redis.
sure, we can change it later if needed.
mayhem[m]
hhmm. next problem.
right now the python code does not make a determination if something is out of date or not.
the grafana alert does that. so I can't realistically check all of them and then report back if at least one failed. that decision is not mine to make. How should we handle that?
short of reporting them all, I would have to duplicate "is this current" logic, which is not great.
monkey[m]
That is why I suggested returning the age of whichever is the oldest stat for that entity. So for example if weekly artist stats ran fine but monthly did not, return the age of the (previous) monthly stats.
That means when an alert triggers we don't exactly know which range failed, but we know some artist stats failed.
mayhem[m]
ah, good simple solution! will do that.
monkey[m]
And of course if even only one of the stats for that entity is not found in DB, return none, which would alert
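
For reference, the approach agreed on above could be sketched roughly as follows. This is only an illustration, not the actual ListenBrainz code: the function name `oldest_stats_age` and the `last_updated` mapping (range name to the timestamp of the last successful run, or `None` if the stat is not in the DB) are hypothetical. The decision whether an age counts as "out of date" stays with the Grafana alert, as discussed.

```python
from datetime import datetime, timezone
from typing import Mapping, Optional


def oldest_stats_age(
    last_updated: Mapping[str, Optional[datetime]],
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Return the age in seconds of the stalest range for one entity.

    `last_updated` maps a range name (e.g. "week") to the time that
    range was last calculated, or None when it is missing from the DB.
    Returns None if any range is missing, so the caller can alert.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # A missing stat (or no stats at all) is reported as None.
    if not last_updated or any(ts is None for ts in last_updated.values()):
        return None
    # The oldest timestamp yields the largest age for this entity.
    oldest = min(last_updated.values())
    return (now - oldest).total_seconds()
```

With weekly artist stats one day old and monthly stats seven days old, this would report the seven-day age, so the alert says some artist stat is stale without saying which range.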
Arsen_ is now known as Arsen
minimal joined the channel
Jade[m]1 has quit
zas[m]
bitmap: we keep getting OOM kills on selda/yamaoka (MB containers). I was thinking about what you said (that now that we get more requests on the same server, downtime has more impact on users). What about running multiple containers for the same service on the same machine and reverting to the settings we had before? We could have 2 or 3 instances of MB containers on the same machine. It will change deployments/upgrades though, but over time we'll get even more powerful machines, so such a move makes sense anyway. It would also make it easier to control the resources allocated to each container and limit the impact of misbehavior.
Well, just something I was thinking about.
mario[m] joined the channel
mario[m]
Hi everyone, I'm new here so not sure if this is the correct room for a request like this, but I recently moved my Jellyfin install to a new provider, and now it looks like my IP can't reach musicbrainz.org (maybe blacklisted?) when running "beet import". Can I request for the IP (43.153.138.95) to be whitelisted?
Thanks - and if this is not the right channel, apologies!
mayhem[m]
zas: one for you ^^^^
mario: zas normally handles these requests, hang tight until he appears