#metabrainz

      • zerodogg joined the channel
      • minimal has quit
      • zerodogg has quit
      • zerodogg joined the channel
      • zerodogg has quit
      • zerodogg joined the channel
      • pite has quit
      • Kladky joined the channel
      • Kladky has quit
      • Kladky joined the channel
      • pranav[m] joined the channel
      • pranav[m]
        Had a doubt.. aerozol the loading animation for the screen should be shown just once when the user navigates to the user page right? or should it be shown each time the user changes their tab as well?
      • akshaaatt
      • SigHunter has quit
      • SigHunter joined the channel
      • rimskii[m]
        hi, lucifer! I have finished integrating Apple Music into troi. ([Here's the PR](https://github.com/metabrainz/troi-recommendation-playground/pull/141))
      • I plan to switch to LB and implement remaining features there. Should I continue?
      • SigHunter has quit
      • SigHunter joined the channel
      • adhawkins has quit
      • mayhem[m]
        zas: atj: I'd love to have a chat with both of you today. I want to find a way to make it super easy to monitor a task. ideally by setting up the details on how to monitor the service as part of the code that creates the service. atj and I talked about this in BCN recently.
      • zas[m]
        sure, when?
      • mayhem[m]
        I'll be around most of the day. let's see what atj says
      • I've already explained this to atj -- do you understand what I am suggesting?
      • we have too many services failing silently. I would like to make the monitoring of the services be part of the code we check in. part of the PR review process.
      • d4rk-ph0enix has quit
      • d4rk-ph0enix joined the channel
      • zas[m]
        Well, if something is failing silently it's usually because no one added any health check, metric and/or alert for it, so it makes sense to add those at the time the service is created/added/defined. But I'm curious about details. Let's have this discussion whenever atj is available.
      • mayhem[m]
        ok, cool. I don't have any details -- I'm going to listen to you and atj how this could work. this is just my motivation, since too many systems are quietly failing.
      • how could this work?
      • if we have a monitoring "definition" that is part of the code checkout, how could the services on the server discover the service definitions?
      • a fixed directory in a container?
      • /monitor/service_001.json, /monitor/service_002.json ...
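A per-service file in such a directory might look something like this; every field name here is hypothetical, sketched only to make the discovery idea concrete:

```json
{
  "service": "spotify-playlist-sync",
  "checks": [
    {"type": "file_age", "path": "/monitor/state/last_sync", "max_age": "1h"},
    {"type": "process", "pidfile": "/monitor/pids/sync.pid"}
  ]
}
```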
      • zas[m]
        What are those "too many services" which are failing?
      • mayhem[m]
        eg. playlist sync with spotify wasn't working for some time. we missed it until users reported it.
      • daily jams were not updated for weeks.
      • zas[m]
        So that's mainly "internal" services, it could easily be solved if the app generates proper metrics (like a Prometheus endpoint) and includes metrics about those. That's the starting point, the "service" has to provide information about its status. But defining alerts programmatically is another step. It's possible I guess, but defining alerts can be rather complex (well, if it's just a status like work/do not work that's simpler).
      • mayhem[m]
        seems right.
      • is there a default monitor we could set up? or something that screams at us for not having set up the alert?
      • I suppose that we can write a python lib that defines services and manages the creation and handling of a Prometheus endpoint.
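A minimal sketch of what such a lib could provide, assuming the prometheus_client package; the metric names and the wrapper are illustrative, not an agreed design:

```python
import time

from prometheus_client import Gauge, start_http_server

# 1 = the subtask's last run succeeded, 0 = it failed
SUBTASK_OK = Gauge("subtask_ok", "Whether the subtask last succeeded", ["subtask"])
# unix timestamp of the last success, usable for staleness alerts
LAST_SUCCESS = Gauge("subtask_last_success_timestamp_seconds",
                     "Time of the subtask's last successful run", ["subtask"])

def run_monitored(name, task):
    """Run a task and record its outcome as Prometheus metrics."""
    try:
        task()
    except Exception:
        SUBTASK_OK.labels(subtask=name).set(0)
        raise
    SUBTASK_OK.labels(subtask=name).set(1)
    LAST_SUCCESS.labels(subtask=name).set_to_current_time()

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics on :8000
    while True:
        run_monitored("spotify_playlist_sync", lambda: None)  # placeholder task
        time.sleep(3600)
```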
      • akshaaatt[m] joined the channel
      • akshaaatt[m]
        pranav: whenever data loading is happening, show the loader. Doesn’t depend on when the user comes to the ui or not
      • zas[m]
        From what I understand, you want to monitor tasks running in one container, right? In the examples above ("playlist sync"), what tells us whether it works or not?
      • mayhem[m]
        it could be a simple timestamp for when a test playlist was last synced to/from spotify.
      • and if the timestamp is > X time, complain.
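With that timestamp exported as a metric (as in the sketch above), the "complain if older than X" rule becomes a single Prometheus alerting expression; the metric name and the one-hour threshold are assumptions:

```yaml
# Sketch of a Prometheus alerting rule (fragment of a rule group).
- alert: SpotifyPlaylistSyncStale
  # fire if no successful sync has been recorded for over an hour
  expr: time() - subtask_last_success_timestamp_seconds{subtask="spotify_playlist_sync"} > 3600
  for: 10m
```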
      • aerozol[m]
        pranav: you can follow akshaaatt's lead on that, I don't have any experience with how it's most efficient to load stuff. Seems like a technical question? Unless you are doing some tactical pre-loading or something, in which case I'm happy to sit down and dig into the UX, lemme know!
      • zas[m]
        Usually we have things like a number of objects (let's call it V) and we collect this data on a regular basis (a Prometheus endpoint or telegraf reports metrics, and this is queried every X seconds), and alerts are based on V compared to a threshold (V under or above M, for example) for a certain time (usually a multiple of X). In some cases the alert is triggered only if the state persists (like if it fails for 5 minutes), but in other cases immediately.
      • More complex cases use multiple metrics to determine the status.
      • But we can also have another approach: app just reports ok/not ok for all the subtasks and the metrics are like "subtask_ok" with a value of 0 or 1.
      • mayhem[m]
        I think the ok/not ok should be the default so that some monitoring is in place.
      • do you have a link for what the Prometheus endpoint response should look like?
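For reference, the plain-text exposition format that Prometheus scrapes (documented at https://prometheus.io/docs/instrumenting/exposition_formats/) would look like this for the ok/not ok case:

```text
# HELP subtask_ok Whether the subtask last succeeded
# TYPE subtask_ok gauge
subtask_ok{subtask="spotify_playlist_sync"} 1
subtask_ok{subtask="daily_jams"} 0
```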
      • atj[m]
        sorry, been on work calls all morning
      • i think my perspective on this is a bit different to zas' 😛
      • mayhem[m]
        let's hear it. :)
      • Maxr1998 joined the channel
      • atj[m]
        IMO Prometheus is for metrics and so not a good fit for service-based monitoring (in general)
      • when we discussed this in Feb I got the impression it was more about functional monitoring, rather than "number go up"
      • Maxr1998_ has quit
      • mayhem[m]
        yes, most services just need to be verified running. time since last check could be the monitored value...
      • atj[m]
        AFAIR you asked me a fairly simple question about it and I went on some diatribe about deploying an Icinga 2 monitoring system and you somehow managed to feign interest
      • mayhem[m]
        atj[m]: not feigned. I asked a question, I really wanted to hear your take
      • atj[m]
        "recollections may vary" 😉
      • yeah i know, it's just me being paranoid
      • anyway, my idea was to have some sort of markup file that contains rules for general checks
      • then we write an Icinga / Nagios plugin that reads the files and executes the checks, then alerts if any of them fail
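Such a plugin only has to honour the Nagios plugin convention: one line of output plus exit code 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A rough Python skeleton, with the rule-file location, format and check dispatch all assumed rather than decided:

```python
#!/usr/bin/env python3
import glob
import sys

import yaml  # pip install pyyaml

def run_check(rule):
    """Placeholder: dispatch on rule["type"] and return (ok, message)."""
    return True, "%s ok" % rule.get("name", "unnamed")

failures = []
for path in glob.glob("/monitor/*.yml"):  # assumed mount point
    with open(path) as f:
        for rule in yaml.safe_load(f).get("checks", []):
            ok, message = run_check(rule)
            if not ok:
                failures.append(message)

if failures:
    print("CRITICAL: " + "; ".join(failures))
    sys.exit(2)
print("OK: all service checks passed")
sys.exit(0)
```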
      • mayhem[m]
        that works.
      • atj[m]
        something along those lines
      • we can mount a volume in the docker containers and the application writes a rule file into it
      • this is then automatically picked up
      • the only issue with this approach is that you have one "service check" in Nagios / Icinga that encompasses multiple checks
      • mayhem[m]
        I like that better than having to have an HTTP endpoint.
      • zas[m]
        rules are what exactly?
      • atj[m]
        process X is running, HTTP URL returns 200...
      • file exists and was last updated < 60secs ago
      • zas[m]
        I mean, in which syntax? nagios-like?
      • mayhem[m]
        I think we can go far with those 3 types of checks.
      • atj[m]
        TBD, my thought was YAML or similar markup that is human and machine readable
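A hypothetical YAML rendering of the three check types atj listed above; the keys and values are illustrative only:

```yaml
service: spotify-playlist-sync
checks:
  - type: process        # process X is running
    pidfile: /monitor/sync.pid
  - type: http           # HTTP URL returns 200
    url: http://localhost:8000/health
    expect_status: 200
  - type: file_age       # file exists and was updated recently
    path: /monitor/last_sync
    max_age: 60s
```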
      • mayhem[m]
        I don't love YAML, but I don't love coding JSON files either.
      • atj[m]
        YAML isn't great, but it's more human readable than JSON
      • zas: if you think this is a bad / stupid idea please speak up, i won't be offended!
      • essentially, we need to make it easy for the devs to implement monitoring
      • mayhem[m]
        atj[m]: that. please.
      • zas[m]
        So basically we define a proprietary language to define rules, code a proprietary icinga/nagios plugin to interpret them, install a nagios agent on all nodes, and share a volume with the docker containers for the nagios agent to collect files? I'm not yet sure about the full picture.
      • atj[m]
        that's roughly it
      • however i'd like to deploy an Icinga 2 monitoring system on all servers as a prerequisite
      • (I spend a lot of time doing that at work)
      • mayhem[m]
        I'm happy to implement a standard if there is one. no need to reinvent things -- I just don't know of such a thing
      • atj[m]
        Icinga 2 supports a proper agent mode and has a UI with config deployment etc so it is much more flexible than Nagios
      • but supports Nagios plugins
      • zas[m]
        For monitoring, we currently use: telegraf/influx/grafana, prometheus/grafana/alertmanager, nagios (old version)
      • nagios is clearly obsolete, so an upgrade to icinga could be good
      • atj[m]
        nagios / icinga serve a different need to prometheus / telegraf / grafana IMV
      • when you need some business logic in a check
      • rather than just doing some calculations on some numbers
      • so they complement each other
      • <mayhem[m]> "I'm happy to implement a..." <- i don't know of anything suitable, and do a lot of work in that space
      • zas[m]
        I think your approach can work, but I still think publishing proper metrics for our apps could be very useful, as it can serve much wider purposes. A /metrics endpoint is rather common nowadays, and format is basically key = value. But I guess making it simpler for devs is the real goal here. And from this point of view it seems to me your approach is likely simpler.
      • atj[m]
        i definitely agree that apps exporting metrics would be good too, there's no reason we can't do both
      • mayhem[m]
        I think a simple system that requires very little thought for the most basic services should be our goal here.
      • atj[m]
        but the immediate issue is that services stop working and nobody seems to know
      • mayhem[m]
        anything that requires real monitoring should use our classic setup
      • yvanzo[m]
        atj[m]: Our services are generally lacking health checks.
      • zas[m]
        We also have Consul Service Health Checks for containers btw; they serve mainly to dynamically add/remove upstream servers on gateways and aren't usually linked to the alerting system, but they could be (see https://stats.metabrainz.org/d/1lYZuloMz/consul...)
      • mayhem[m]
      • zas[m]
        This is collected by telegraf, based on consul health checks, defined for all containers using SERVICE_* env vars
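For context, registrator-style Consul registration is driven by environment variables on the container; the service name and check path below are made up, and the exact variables depend on the registrator setup in use:

```yaml
environment:
  SERVICE_NAME: example-api          # hypothetical service name
  SERVICE_CHECK_HTTP: /health        # hypothetical health-check path
  SERVICE_CHECK_INTERVAL: 15s
```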
      • mayhem[m]
        atj: how should Process X checks be defined? PID files in the mounted "service-checks" volume dir?
      • atj[m]
        PID files are probably easiest, but we could probably support pgrep-style options
      • mayhem[m]
        KISS and start with PID files?
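A PID-file liveness check is only a few lines of Python; the path here is hypothetical, and signal 0 only tests that the process exists:

```python
import os

def process_running(pidfile="/monitor/sync.pid"):
    """Return True if the PID named in pidfile belongs to a live process."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # missing or malformed PID file
    try:
        os.kill(pid, 0)  # signal 0: existence check, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```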
      • how should check_intervals be defined?
      • Xm, Xh, Xd ?
      • or go full unix def and make it crontab?
      • atj[m]
        ok, so that's a wrinkle because the check frequency is defined at the Icinga / Nagios level
      • however we could probably do something where the plugin runs every 1 minute and remembers when it last ran a check
      • that would add a bit of complexity though
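The "remembers when it last ran" part could be a small state file; a sketch, with the location and format assumed:

```python
import json
import time

STATE_FILE = "/var/tmp/service-checks-state.json"  # assumed location

def load_state():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}

def due(state, check_name, interval_seconds):
    """True if at least interval_seconds have passed since the last run."""
    return time.time() - state.get(check_name, 0) >= interval_seconds

def mark_ran(state, check_name):
    state[check_name] = time.time()
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
```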
      • mayhem[m]
        tricky.
      • atj[m]
        what use case did you have in mind for the check interval?
      • mayhem[m]
        just to define how often the check runs.
      • I suppose we could have a fixed interval for this simple case.
      • atj[m]
        sorry, i meant cases where this would be relevant to the usefulness of the check
      • yvanzo[m]
        atj: Do you remember how the Solr collections/cores have been created?
      • atj[m]
        yvanzo[m]: it was using Ansible
      • mayhem: maybe don't worry too much about these specific details / limitations for now. if you use the document as a brain dump for what you need i can then work out where the wrinkles are and zas and I can discuss
      • mayhem[m]
        all the cases I can think of could easily fit into the category of "if 1h interval is not good, use classic methods".
      • so I think we can just go with 1h as the default and move on.
      • atj: yea, ok. I was just hoping to write a doc that would answer all the questions from a developer's perspective.
      • atj[m]
        we'll get there in the next iteration
      • pranav[m]
        Cool, thanks aerozol and akshat
      • atj[m]
        there are some other options because Icinga 2 has an API which would give additional flexibility
      • atj[m]
        but I need to know all your use cases first to see what is feasible
      • mayhem[m]
        ok.
      • for the file-exists check, should that be limited to the mounted service-check volume or can that be done for any path in the container?
      • atj[m]
        i was just thinking about that one...
      • mayhem[m]
        KISS and start on the mounted volume?
      • with that in mind, I think I've covered the use cases I am looking for in that doc.
      • please have a look and see what else I should be including/defining.
      • atj[m]
        mayhem[m]: probably for now
      • looks like you can do this to get the root dir of the container but it's not ideal:... (full message at <https://matrix.chatbrainz.org/_matrix/media/v3/...>)
      • mayhem[m]
        ick. I wonder if "docker cp ..." would be a better solution.
      • but for now, KISS
      • yvanzo[m]
        atj: I found the definition of `_solr_collection_create`, but couldn’t find where it is used. My goal is to reproduce it on mirrors.
      • (Previously, I created collections locally using `/solr/admin/collections?action=CREATE&name=`… but they don’t have the same schema version.)
      • atj[m]
        yvanzo: so, i create configsets first, then use the V2 API to create collections referencing the relevant configset
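A rough sketch of that flow over HTTP, using Python's requests; the host, configset name, collection name and payload are assumptions, and the exact v2 route differs between Solr versions:

```python
import requests

SOLR = "http://localhost:8983"

# 1. Upload a configset (zipped conf/ directory) via the v1 Configsets API.
with open("musicbrainz_conf.zip", "rb") as f:
    requests.post(
        SOLR + "/solr/admin/configs",
        params={"action": "UPLOAD", "name": "musicbrainz"},
        data=f.read(),
        headers={"Content-Type": "application/octet-stream"},
    ).raise_for_status()

# 2. Create a collection referencing that configset via the v2 API.
requests.post(
    SOLR + "/api/collections",
    json={"create": {"name": "artist", "config": "musicbrainz", "numShards": 1}},
).raise_for_status()
```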
      • is the schema version set by Solr / ZK after every update?
      • yvanzo[m]
        No, it is set on creation only.
      • Or do you mean every new Solr release? Only once in a while. And apparently different schema versions can be set depending on (?) the configset, the method, the server mode…