Had a doubt.. aerozol the loading animation for the screen should be shown just once when the user navigates to the user page, right? or should it be shown each time the user changes their tab as well?
I plan to switch to LB and implement remaining features there. Should I continue?
2024-06-13 16552, 2024
SigHunter has quit
2024-06-13 16531, 2024
SigHunter joined the channel
2024-06-13 16535, 2024
adhawkins has quit
2024-06-13 16547, 2024
mayhem[m]
zas: atj: I'd love to have a chat with both of you today. I want to find a way to make it super easy to monitor a task. ideally by setting up the details on how to monitor the service as part of the code that creates the service. atj and I talked about this in BCN recently.
2024-06-13 16552, 2024
zas[m]
sure, when?
2024-06-13 16541, 2024
mayhem[m]
I'll be around most of the day. let's see what atj says
2024-06-13 16516, 2024
mayhem[m]
I've already explained this to atj -- do you understand what I am suggesting?
2024-06-13 16548, 2024
mayhem[m]
we have too many services failing silently. I would like to make the monitoring of the services be part of the code we check in. part of the PR review process.
2024-06-13 16503, 2024
d4rk-ph0enix has quit
2024-06-13 16528, 2024
d4rk-ph0enix joined the channel
2024-06-13 16521, 2024
zas[m]
Well, if something is failing silently it's usually because no one added any health check, metric and/or alert for it, so it makes sense to add those at the time the service is created/added/defined. But I'm curious about details. Let's have this discussion whenever atj is available.
2024-06-13 16519, 2024
mayhem[m]
ok, cool. I don't have any details -- I'm going to listen to you and atj on how this could work. this is just my motivation, since too many systems are quietly failing.
2024-06-13 16525, 2024
mayhem[m]
how could this work?
2024-06-13 16524, 2024
mayhem[m]
if we have a monitoring "definition" that is part of the code checkout, how could the services on the server discover the service definitions?
What are those "too many services" which are failing?
2024-06-13 16526, 2024
mayhem[m]
eg. playlist sync with spotify wasn't working for some time. we missed it until users reported it.
2024-06-13 16535, 2024
mayhem[m]
daily jams were not updated for weeks.
2024-06-13 16536, 2024
zas[m]
So that's mainly "internal" services, it could easily be solved if the app generates proper metrics (like a Prometheus endpoint) and includes metrics about those. That's the starting point, the "service" has to provide information about its status. But defining alerts programmatically is another step. It's possible I guess, but defining alerts can be rather complex (well, if it's just a status like work/do not work that's simpler).
2024-06-13 16502, 2024
mayhem[m]
seems right.
2024-06-13 16528, 2024
mayhem[m]
is there a default monitor we could set up? or something that screams at us for not having set up the alert?
2024-06-13 16520, 2024
mayhem[m]
I suppose that we can write a python lib that defines services and that manages the creation and handling of a prometheus endpoint.
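No such lib exists yet; a minimal stdlib-only sketch of the idea (all names here are hypothetical -- a real implementation would more likely build on the official prometheus_client package) could look like:

```python
# Hypothetical sketch of a tiny metrics registry that renders the
# Prometheus text exposition format and serves it over HTTP.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

_METRICS: dict[str, float] = {}

def set_metric(name: str, value: float) -> None:
    """Record the current value of a gauge-style metric."""
    _METRICS[name] = value

def render_metrics() -> str:
    """Render all recorded metrics in the Prometheus text format."""
    lines = []
    for name, value in sorted(_METRICS.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# e.g. record when the Spotify playlist sync last succeeded:
set_metric("playlist_sync_last_success_timestamp", time.time())
# to expose it: HTTPServer(("", 9100), MetricsHandler).serve_forever()
```

Prometheus alert rules would then compare these values against thresholds, as zas describes above.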
2024-06-13 16554, 2024
akshaaatt[m] joined the channel
2024-06-13 16554, 2024
akshaaatt[m]
pranav: whenever data loading is happening, show the loader. Doesn’t depend on when the user comes to the ui or not
2024-06-13 16536, 2024
zas[m]
From what I understand, you want to monitor tasks running in one container, right? In the examples above ("playlist sync"), what tells us whether it works or not?
2024-06-13 16552, 2024
mayhem[m]
it could be a simple timestamp for when a test playlist was last synced to/from spotify.
2024-06-13 16501, 2024
mayhem[m]
and if the timestamp is > X time, complain.
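That "timestamp older than X" rule is easy to sketch -- assuming, hypothetically, that the sync job touches a state file on every successful run:

```python
# Hypothetical staleness check: the job touches a state file on each
# successful run, and the check compares the file's age to a threshold.
import os
import time

def seconds_since_last_success(state_file: str) -> float:
    """Age of the state file, i.e. time since the job last succeeded."""
    return time.time() - os.path.getmtime(state_file)

def check_staleness(state_file: str, max_age_seconds: float) -> bool:
    """True if the job ran recently enough, False if it is stale."""
    try:
        return seconds_since_last_success(state_file) <= max_age_seconds
    except FileNotFoundError:
        return False  # never ran at all -> definitely complain
```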
2024-06-13 16522, 2024
aerozol[m]
pranav: you can follow akshaaatt's lead on that, I don't have any experience dealing with how it's efficient to load stuff. Seems like a technical question? Unless you are doing some tactical pre-loading or something, then I'm happy to sit down and dig into the UX, lemme know!
2024-06-13 16552, 2024
zas[m]
Usually we have things like a number of objects (let's call it V) and we collect this data on a regular basis (a Prometheus endpoint or telegraf reports metrics, and this is queried every X seconds), and alerts are based on V compared to a threshold (V under or above M for example) since a certain time (usually a multiple of X). The alert is triggered only if the state persists in some cases (like if it fails for 5 minutes), but immediately
2024-06-13 16552, 2024
zas[m]
in other cases.
2024-06-13 16512, 2024
zas[m]
More complex cases use multiple metrics to determine the status.
2024-06-13 16530, 2024
zas[m]
But we can also have another approach: app just reports ok/not ok for all the subtasks and the metrics are like "subtask_ok" with a value of 0 or 1.
2024-06-13 16539, 2024
mayhem[m]
I think the ok/not ok should be the default so that some monitoring is in place.
2024-06-13 16529, 2024
mayhem[m]
do you have a link for what the Prometheus endpoint response should look like?
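(For reference, the text format Prometheus scrapes is plain `name value` lines with optional `# HELP`/`# TYPE` comments; it is documented on prometheus.io as the "exposition formats". A response for the ok/not-ok idea might look like this -- metric and label names are made up:)

```text
# HELP subtask_ok Whether the subtask last completed successfully (1 = ok).
# TYPE subtask_ok gauge
subtask_ok{subtask="spotify_playlist_sync"} 1
subtask_ok{subtask="daily_jams"} 0
# HELP subtask_last_success_timestamp_seconds Unix time of last success.
# TYPE subtask_last_success_timestamp_seconds gauge
subtask_last_success_timestamp_seconds{subtask="spotify_playlist_sync"} 1718290000
```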
2024-06-13 16514, 2024
atj[m]
sorry, been on work calls all morning
2024-06-13 16541, 2024
atj[m]
i think my perspective on this is a bit different to zas' 😛
2024-06-13 16553, 2024
mayhem[m]
lets hear it. :)
2024-06-13 16515, 2024
Maxr1998 joined the channel
2024-06-13 16532, 2024
atj[m]
IMO Prometheus is for metrics and so not a good fit for service based monitoring (in general)
2024-06-13 16510, 2024
atj[m]
when we discussed this in Feb I got the impression it was more about functional monitoring, rather than "number go up"
2024-06-13 16514, 2024
Maxr1998_ has quit
2024-06-13 16506, 2024
mayhem[m]
yes, most services just need to be verified running. time since last check could be the monitored value...
2024-06-13 16523, 2024
atj[m]
AFAIR you asked me a fairly simple question about it and I went on some diatribe about deploying an Icinga 2 monitoring system and you somehow managed to feign interest
2024-06-13 16509, 2024
mayhem[m]
atj[m]: not feigned. I asked a question, I really wanted to hear your take
2024-06-13 16517, 2024
atj[m]
"recollections may vary" 😉
2024-06-13 16533, 2024
atj[m]
yeah i know, it's just me being paranoid
2024-06-13 16518, 2024
atj[m]
anyway, my idea was to have some sort of markup file that contains rules for general checks
2024-06-13 16509, 2024
atj[m]
then we write an Icinga / Nagios plugin that reads the files and executes the checks, then alerts if any of them fail
2024-06-13 16541, 2024
mayhem[m]
that works.
2024-06-13 16544, 2024
atj[m]
something along those lines
2024-06-13 16524, 2024
atj[m]
we can mount a volume in docker images and the application writes a rule file into it
2024-06-13 16530, 2024
atj[m]
this is then automatically picked up
2024-06-13 16555, 2024
atj[m]
the only issue with this approach is that you have one "service check" in Nagios / Icinga that encompasses multiple checks
2024-06-13 16501, 2024
mayhem[m]
I like that better than having to have an HTTP endpoint.
2024-06-13 16536, 2024
zas[m]
rules are what exactly?
2024-06-13 16500, 2024
atj[m]
process X is running, HTTP URL returns 200...
2024-06-13 16528, 2024
atj[m]
file exists and was last updated < 60secs ago
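Those three rule types are each only a few lines; a hedged sketch of the plugin side (exit code 0 = OK, 2 = CRITICAL is the standard Nagios/Icinga plugin convention; all function names here are invented, not an agreed interface):

```python
# Hypothetical implementations of the three proposed rule types,
# shaped like a Nagios/Icinga plugin.
import os
import time
import urllib.request

def check_process(pid_file: str) -> bool:
    """Rule: process X is running (PID read from a written PID file)."""
    try:
        pid = int(open(pid_file).read().strip())
        os.kill(pid, 0)  # signal 0 only tests that the PID exists
        return True
    except (OSError, ValueError):
        return False

def check_http(url: str, timeout: float = 10.0) -> bool:
    """Rule: HTTP URL returns 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def check_file_age(path: str, max_age_seconds: float) -> bool:
    """Rule: file exists and was last updated < N seconds ago."""
    try:
        return time.time() - os.path.getmtime(path) < max_age_seconds
    except FileNotFoundError:
        return False

def run_checks(checks: list[tuple[str, bool]]) -> int:
    """Report each check result and return a Nagios-style exit code."""
    failed = [name for name, ok in checks if not ok]
    if failed:
        print("CRITICAL: " + ", ".join(failed))
        return 2
    print("OK: all checks passed")
    return 0
```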
2024-06-13 16551, 2024
zas[m]
I mean, in which syntax? nagios-like?
2024-06-13 16556, 2024
mayhem[m]
I think we can go far with those 3 types of checks.
2024-06-13 16522, 2024
atj[m]
TBD, my thought was YAML or similar markup that is human and machine readable
2024-06-13 16548, 2024
mayhem[m]
I don't love YAML. but I don't love coding JSON files either.
2024-06-13 16536, 2024
atj[m]
YAML isn't great, but it's more human readable than JSON
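Purely to illustrate the shape being discussed -- none of these field names are agreed, everything here is hypothetical -- a YAML rule file covering the three check types might look like:

```yaml
# Hypothetical service-checks rule file, written by the app into the
# mounted volume; field names are illustrative only.
service: listenbrainz-spotify-sync
checks:
  - type: process
    pid_file: /service-checks/spotify_sync.pid
  - type: http
    url: http://localhost:8080/healthz
    expect_status: 200
  - type: file_age
    path: /service-checks/last_playlist_sync
    max_age: 1h
```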
2024-06-13 16530, 2024
atj[m]
zas: if you think this is a bad / stupid idea please speak up, i won't be offended!
2024-06-13 16556, 2024
atj[m]
essentially, we need to make it easy for the devs to implement monitoring
2024-06-13 16521, 2024
mayhem[m]
atj[m]: that. please.
2024-06-13 16534, 2024
zas[m]
So basically we define a proprietary language to define rules, code a proprietary icinga/nagios plugin to interpret them, install nagios agent on all nodes, share a volume for docker container for nagios agent to collect files ? I'm not yet sure about the full picture.
2024-06-13 16557, 2024
atj[m]
that's roughly it
2024-06-13 16518, 2024
atj[m]
however i'd like to deploy an Icinga 2 monitoring system on all servers as a pre-requisite
2024-06-13 16535, 2024
atj[m]
(I spend a lot of time doing that at work)
2024-06-13 16537, 2024
mayhem[m]
I'm happy to implement a standard if there is one. no need to reinvent things -- I just don't know of such a thing
2024-06-13 16511, 2024
atj[m]
Icinga 2 supports a proper agent mode and has a UI with config deployment etc so it is much more flexible than Nagios
2024-06-13 16521, 2024
atj[m]
but supports Nagios plugins
2024-06-13 16527, 2024
zas[m]
For monitoring, we currently use: telegraf/influx/grafana, prometheus/grafana/alertmanager, nagios (old version)
2024-06-13 16546, 2024
zas[m]
nagios is clearly obsolete, so an upgrade to icinga could be good
2024-06-13 16511, 2024
atj[m]
nagios / icinga serve a different need to prometheus / telegraf / grafana IMV
2024-06-13 16524, 2024
atj[m]
when you need some business logic in a check
2024-06-13 16534, 2024
atj[m]
rather than just doing some calculations on some numbers
2024-06-13 16554, 2024
atj[m]
so they complement each other
2024-06-13 16541, 2024
atj[m]
<mayhem[m]> "I'm happy to implement a..." <- i don't know of anything suitable, and do a lot of work in that space
2024-06-13 16520, 2024
zas[m]
I think your approach can work, but I still think publishing proper metrics for our apps could be very useful, as it can serve much wider purposes. A /metrics endpoint is rather common nowadays, and format is basically key = value. But I guess making it simpler for devs is the real goal here. And from this point of view it seems to me your approach is likely simpler.
2024-06-13 16507, 2024
atj[m]
i definitely agree that apps exporting metrics would be good too, there's no reason we can't do both
2024-06-13 16512, 2024
mayhem[m]
I think a simple system that requires very little thought for the most basic services should be our goal here.
2024-06-13 16524, 2024
atj[m]
but the immediate issue is that services stop working and nobody seems to know
2024-06-13 16527, 2024
mayhem[m]
anything that requires real monitoring should use our classic setup
2024-06-13 16511, 2024
yvanzo[m]
atj[m]: Our services are generally lacking health checks.
2024-06-13 16550, 2024
zas[m]
We also have Consul Service Health Checks for containers btw, though they serve mainly to dynamically add/remove upstream servers on gateways, they aren't usually linked to the alerting system, but they could be (see https://stats.metabrainz.org/d/1lYZuloMz/consul-h…)
This is collected by telegraf, based on consul health checks, defined for all containers using SERVICE_* env vars
2024-06-13 16544, 2024
mayhem[m]
atj: how should Process X checks be defined? PID files in the mounted volume dir? "service-checks"
2024-06-13 16542, 2024
atj[m]
PID files are probably easiest, but we could probably support pgrep-style options
2024-06-13 16510, 2024
mayhem[m]
KISS and start with PID files?
2024-06-13 16544, 2024
mayhem[m]
how should check_intervals be defined
2024-06-13 16556, 2024
mayhem[m]
Xm, Xh, Xd ?
2024-06-13 16532, 2024
mayhem[m]
or go full unix def and make it crontab?
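Whichever direction the interval question goes, the simple "Xm, Xh, Xd" form is trivial to parse; a sketch:

```python
# Sketch of parsing the proposed "Xm, Xh, Xd" interval syntax into
# seconds; the unit set is just the three suffixes mentioned above.
_UNITS = {"m": 60, "h": 3600, "d": 86400}

def parse_interval(spec: str) -> int:
    """Parse e.g. '5m', '1h' or '2d' into a number of seconds."""
    spec = spec.strip()
    if not spec or spec[-1] not in _UNITS:
        raise ValueError(f"bad interval: {spec!r}")
    return int(spec[:-1]) * _UNITS[spec[-1]]
```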
2024-06-13 16558, 2024
atj[m]
ok, so that's a wrinkle because the check frequency is defined at the Icinga / Nagios level
2024-06-13 16537, 2024
atj[m]
however we could probably do something where the plugin runs every 1 minute and remembers when it last ran a check
2024-06-13 16558, 2024
atj[m]
that would add a bit of complexity though
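One hypothetical way to do the "remembers when it last ran" part: keep a small JSON state file of last-run times and only execute the checks that are due. (Part of the extra complexity is that results of skipped checks would also need to be cached so the plugin can still report a status for them; that is omitted here.)

```python
# Sketch: per-rule intervals on top of a fixed Icinga/Nagios cadence,
# via a JSON state file recording when each check last ran.
# Paths and the function name are assumptions.
import json
import os
import time

def due_checks(state_path: str, intervals: dict[str, float]) -> list[str]:
    """Return names of checks whose interval has elapsed, and record
    the current time as their new last-run timestamp."""
    state = {}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    now = time.time()
    due = [name for name, interval in intervals.items()
           if now - state.get(name, 0) >= interval]
    for name in due:
        state[name] = now
    with open(state_path, "w") as f:
        json.dump(state, f)
    return due
```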
2024-06-13 16536, 2024
mayhem[m]
tricky.
2024-06-13 16527, 2024
atj[m]
what use case did you have in mind for the check interval?
2024-06-13 16520, 2024
mayhem[m]
just to define how often the check runs.
2024-06-13 16557, 2024
mayhem[m]
I suppose we could have a fixed interval for this simple case.
2024-06-13 16515, 2024
atj[m]
sorry, i meant cases where this would be relevant to the usefulness of the check
2024-06-13 16532, 2024
yvanzo[m]
atj: Do you remember how the Solr collections/cores have been created?
2024-06-13 16553, 2024
atj[m]
yvanzo[m]: it was using Ansible
2024-06-13 16510, 2024
atj[m]
mayhem: maybe don't worry too much about these specific details / limitations for now. if you use the document as a brain dump for what you need i can then work out where the wrinkles are and zas and I can discuss
2024-06-13 16510, 2024
mayhem[m]
all the cases I can think of could easily fit into the category of "if 1h interval is not good, use classic methods".
2024-06-13 16521, 2024
mayhem[m]
so I think we can just go with 1h as the default and move on.
2024-06-13 16552, 2024
mayhem[m]
atj: yea, ok. I was just hoping to write a doc that would answer all the questions from a developer's perspective.
2024-06-13 16513, 2024
atj[m]
we'll get there in the next iteration
2024-06-13 16502, 2024
pranav[m]
Cool thanks aerozol and akshat
2024-06-13 16508, 2024
atj[m]
there are some other options because Icinga 2 has an API which would give additional flexibility
2024-06-13 16526, 2024
atj[m]
but I need to know all your use cases before to see what is feasible
2024-06-13 16533, 2024
mayhem[m]
ok.
2024-06-13 16519, 2024
mayhem[m]
for the file exists check, should that be limited to the mounted service-check volume or can that be done for any path in the container?
2024-06-13 16550, 2024
atj[m]
i was just thinking about that one...
2024-06-13 16526, 2024
mayhem[m]
KISS and start on the mounted volume?
2024-06-13 16518, 2024
mayhem[m]
with that in mind, I think I've covered the use cases I am looking for in that doc.
2024-06-13 16529, 2024
mayhem[m]
please have a look and see what else I should be including/defining.
2024-06-13 16518, 2024
atj[m]
mayhem[m]: probably for now
2024-06-13 16545, 2024
atj[m]
looks like you can do this to get the root dir of the container but it's not ideal:... (full message at <https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/dernpLEThZfDezwpZmXuGpmz>)
2024-06-13 16554, 2024
mayhem[m]
ick. I wonder if "docker cp ..." would be a better solution.
2024-06-13 16539, 2024
mayhem[m]
but for now, KISS
2024-06-13 16506, 2024
yvanzo[m]
atj: I found the definition of `_solr_collection_create`, but couldn’t find where it is used. My goal is to reproduce it on mirrors.
2024-06-13 16517, 2024
yvanzo[m]
(Previously, I created collections locally using `/solr/admin/collections?action=CREATE&name=`… but they don’t have the same schema version.)
2024-06-13 16515, 2024
atj[m]
yvanzo: so, i create configsets first, then use the V2 API to create collections referencing the relevant configset
2024-06-13 16558, 2024
atj[m]
is the schema version set by Solr / ZK after every update?
2024-06-13 16528, 2024
yvanzo[m]
No, it is set on creation only.
2024-06-13 16519, 2024
yvanzo[m]
Or do you mean every new Solr release? Only once in a while. And apparently different schema versions can be set depending on (?) the configset, the method, the server mode…