pranav[m]
Had a doubt... aerozol, the loading animation for the screen should be shown just once when the user navigates to the user page, right? Or should it be shown each time the user changes their tab as well?
akshaaatt
SigHunter has quit
SigHunter joined the channel
rimskii[m]
hi, lucifer! I have finished integrating Apple Music into troi. ([Here's the PR](https://github.com/metabrainz/troi-recommendation-playground/pull/141))
I plan to switch to LB and implement remaining features there. Should I continue?
SigHunter has quit
SigHunter joined the channel
adhawkins has quit
mayhem[m]
zas: atj: I'd love to have a chat with both of you today. I want to find a way to make it super easy to monitor a task. ideally by setting up the details on how to monitor the service as part of the code that creates the service. atj and I talked about this in BCN recently.
zas[m]
sure, when?
mayhem[m]
I'll be around most of the day. let's see what atj says
I've already explained this to atj -- do you understand what I am suggesting?
we have too many services failing silently. I would like to make the monitoring of the services be part of the code we check in. part of the PR review process.
d4rk-ph0enix has quit
d4rk-ph0enix joined the channel
zas[m]
Well, if something is failing silently it's usually because no one added any health check, metric and/or alert for it, so it makes sense to add those at the time the service is created/added/defined. But I'm curious about details. Let's have this discussion whenever atj is available.
mayhem[m]
ok, cool. I don't have any details -- I'm going to listen to you and atj on how this could work. this is just my motivation, since too many systems are quietly failing.
how could this work?
if we have a monitoring "definition" that is part of the code checkout, how could the services on the server discover the service definitions?
zas[m]
What are those "too many services" which are failing?
mayhem[m]
e.g. playlist sync with spotify wasn't working for some time. we missed it until users reported it.
daily jams were not updated for weeks.
zas[m]
So those are mainly "internal" services. It could easily be solved if the app generated proper metrics (like a Prometheus endpoint) and included metrics about those tasks. That's the starting point: the "service" has to provide information about its status. But defining alerts programmatically is another step. It's possible, I guess, but defining alerts can be rather complex (well, if it's just a works/doesn't work status, that's simpler).
mayhem[m]
seems right.
is there a default monitor we could set up? or something that screams at us for not having set up the alert?
I suppose that we can write a Python lib that defines services and manages the creation and handling of a Prometheus endpoint.
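As a rough illustration only (nothing here is an agreed design), a minimal sketch of such a lib using the standard prometheus_client package; the metric name, service names and port are made up:

```python
# Hypothetical sketch: a tiny registry of background "services" that exposes
# one freshness gauge per service via a Prometheus /metrics endpoint.
# Assumes the standard prometheus_client package; names and port are made up.
from prometheus_client import Gauge, start_http_server

LAST_RUN = Gauge(
    "service_last_success_timestamp_seconds",
    "Unix timestamp of the last successful run of a background service",
    ["service"],
)

def register_service(name: str) -> None:
    # Create the labelled time series so it is visible even before the first run.
    LAST_RUN.labels(service=name).set(0)

def report_success(name: str) -> None:
    # Call this at the end of each successful run (e.g. daily jams generation).
    LAST_RUN.labels(service=name).set_to_current_time()

if __name__ == "__main__":
    register_service("spotify_playlist_sync")
    register_service("daily_jams")
    start_http_server(9100)  # serve /metrics on an arbitrary port
```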
akshaaatt[m] joined the channel
akshaaatt[m]
pranav: whenever data loading is happening, show the loader. It doesn't depend on when the user comes to the UI or not
zas[m]
From what I understand, you want to monitor tasks running in one container, right? In the examples above ("playlist sync"), what tells you it works or not?
mayhem[m]
it could be a simple timestamp for when a test playlist was last synced to/from spotify.
and if the timestamp is > X time, complain.
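Tying that to the gauge sketched earlier (names are assumptions, not an agreed metric), the "complain if the timestamp is too old" rule could be a single PromQL expression, e.g. fire when the last successful sync is more than an hour old:

```
time() - service_last_success_timestamp_seconds{service="spotify_playlist_sync"} > 3600
```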
aerozol[m]
pranav: you can follow akshaaatt's lead on that, I don't have any experience with what's efficient for loading stuff. Seems like a technical question? Unless you are doing some tactical pre-loading or something, in which case I'm happy to sit down and dig into the UX, lemme know!
zas[m]
Usually we have things like a number of objects (let's call it V), and we collect this data on a regular basis (a Prometheus endpoint or telegraf reports metrics, and this is queried every X seconds). Alerts are based on V compared to a threshold (V under or above M, for example) for a certain time (usually a multiple of X). In some cases the alert is triggered only if the state persists (like if it fails for 5 minutes), in other cases immediately.
More complex cases use multiple metrics to determine the status.
But we can also have another approach: the app just reports ok/not ok for all the subtasks, and the metrics are like "subtask_ok" with a value of 0 or 1.
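As an illustration of the threshold-plus-duration pattern described above, a Prometheus alerting rule for the ok/not-ok approach might look roughly like this (the metric, label and alert names are assumptions):

```yaml
groups:
  - name: subtask-health
    rules:
      - alert: SubtaskFailing
        # subtask_ok is the hypothetical 0/1 gauge exported by the app
        expr: subtask_ok == 0
        for: 5m          # only fire if the failure persists for 5 minutes
        labels:
          severity: warning
        annotations:
          summary: "Subtask {{ $labels.subtask }} has been failing for 5 minutes"
```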
mayhem[m]
I think the ok/not ok should be the default so that some monitoring is in place.
do you have a link for what the prometheus endpoint response should look like?
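For reference, the Prometheus text exposition format returned by a /metrics endpoint is just lines of `name{labels} value`; a response could look like this (metric and label names here are only illustrative):

```
# HELP subtask_ok Whether the subtask's last run succeeded (1) or failed (0)
# TYPE subtask_ok gauge
subtask_ok{subtask="spotify_playlist_sync"} 1
subtask_ok{subtask="daily_jams"} 0
```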
atj[m]
sorry, been on work calls all morning
i think my perspective on this is a bit different to zas' 😛
mayhem[m]
lets hear it. :)
Maxr1998 joined the channel
atj[m]
IMO Prometheus is for metrics and so not a good fit for service based monitoring (in general)
when we discussed this in Feb I got the impression it was more about functional monitoring, rather than "number go up"
Maxr1998_ has quit
mayhem[m]
yes, most services just need to be verified running. time since last check could be the monitored value...
atj[m]
AFAIR you asked me a fairly simple question about it and I went on some diatribe about deploying an Icinga 2 monitoring system and you somehow managed to feign interest
mayhem[m]
atj[m]: not feigned. I asked a question, I really wanted to hear your take
atj[m]
"recollections may vary" 😉
yeah i know, it's just me being paranoid
anyway, my idea was to have some sort of markup file that contains rules for general checks
then we write an Icinga / Nagios plugin that reads the files and executes the checks, then alerts if any of them fail
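A rough sketch of what such a plugin could look like, assuming Nagios-style exit codes (0 = OK, 2 = CRITICAL) and a hypothetical directory of YAML rule files; the paths, rule fields and check types are assumptions, not a settled format:

```python
#!/usr/bin/env python3
# Hypothetical Nagios/Icinga-style plugin: read every rule file from a
# directory, run the checks it describes, and exit with an aggregate status.
import glob
import os
import sys
import time
import urllib.request

import yaml  # PyYAML

RULES_DIR = "/srv/service-checks"  # hypothetical mounted volume

def run_check(rule: dict) -> str | None:
    """Return None if the check passes, or a failure message."""
    kind = rule["type"]
    if kind == "http":
        resp = urllib.request.urlopen(rule["url"], timeout=10)
        if resp.status != rule.get("expect_status", 200):
            return f"{rule['url']} returned {resp.status}"
    elif kind == "file_age":
        age = time.time() - os.path.getmtime(rule["path"])
        if age > rule["max_age_seconds"]:
            return f"{rule['path']} last updated {int(age)}s ago"
    elif kind == "pid_file":
        pid = int(open(rule["path"]).read().strip())
        try:
            os.kill(pid, 0)  # signal 0 only checks that the process exists
        except OSError:
            return f"process {pid} from {rule['path']} is not running"
    return None

def main() -> int:
    failures = []
    for path in glob.glob(os.path.join(RULES_DIR, "*.yml")):
        for rule in yaml.safe_load(open(path)) or []:
            try:
                error = run_check(rule)
            except Exception as exc:  # unreachable URL, missing file, ...
                error = str(exc)
            if error:
                failures.append(f"{rule.get('name', path)}: {error}")
    if failures:
        print("CRITICAL: " + "; ".join(failures))
        return 2
    print("OK: all service checks passed")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```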
mayhem[m]
that works.
atj[m]
something along those lines
we can mount a volume into the docker containers and the application writes a rule file into it
this is then automatically picked up
the only issue with this approach is that you have one "service check" in Nagios / Icinga that encompasses multiple checks
mayhem[m]
I like that better than having to have an HTTP endpoint.
zas[m]
rules are what exactly?
atj[m]
process X is running, HTTP URL returns 200...
file exists and was last updated < 60secs ago
zas[m]
I mean, in which syntax? nagios-like?
mayhem[m]
I think we can go far with those 3 types of checks.
atj[m]
TBD, my thought was YAML or similar markup that is human and machine readable
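Purely as an illustration of the idea (nothing about the format is decided), a YAML rule file covering the three check types mentioned above could look like this, matching the fields assumed in the plugin sketch earlier:

```yaml
# hypothetical service-check rule file, written by the application at startup
- name: labs_api_up
  type: http
  url: http://localhost:8080/health
  expect_status: 200
- name: spotify_sync_fresh
  type: file_age
  path: /srv/service-checks/spotify_sync.stamp
  max_age_seconds: 3600
- name: cron_runner_alive
  type: pid_file
  path: /srv/service-checks/cron_runner.pid
```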
mayhem[m]
I don't love YAML, but I don't love hand-writing JSON files either.
atj[m]
YAML isn't great, but it's more human readable than JSON
zas: if you think this is a bad / stupid idea please speak up, i won't be offended!
essentially, we need to make it easy for the devs to implement monitoring
mayhem[m]
atj[m]: that. please.
zas[m]
So basically we define a proprietary language to define rules, code a proprietary icinga/nagios plugin to interpret them, install a nagios agent on all nodes, and share a volume from each docker container for the nagios agent to collect the files? I'm not yet sure about the full picture.
atj[m]
that's roughly it
however i'd like to deploy an Icinga 2 monitoring system on all servers as a pre-requisite
(I spend a lot of time doing that at work)
mayhem[m]
I'm happy to implement a standard if there is one. no need to reinvent things -- I just don't know of such a thing
atj[m]
Icinga 2 supports a proper agent mode and has a UI with config deployment etc so it is much more flexible than Nagios
but supports Nagios plugins
zas[m]
For monitoring, we currently use: telegraf/influx/grafana, prometheus/grafana/alertmanager, nagios (old version)
nagios is clearly obsolete, so an upgrade to icinga could be good
atj[m]
nagios / icinga serve a different need to prometheus / telegraf / grafana IMV
when you need some business logic in a check
rather than just doing some calculations on some numbers
so they complement each other
<mayhem[m]> "I'm happy to implement a..." <- i don't know of anything suitable, and I do a lot of work in that space
zas[m]
I think your approach can work, but I still think publishing proper metrics for our apps could be very useful, as it can serve much wider purposes. A /metrics endpoint is rather common nowadays, and the format is basically key = value. But I guess making it simpler for devs is the real goal here, and from that point of view your approach is likely simpler.
atj[m]
i definitely agree that apps exporting metrics would be good too, there's no reason we can't do both
mayhem[m]
I think a simple system that requires very little thought for the most basic services should be our goal here.
atj[m]
but the immediate issue is that services stop working and nobody seems to know
mayhem[m]
anything that requires real monitoring should use our classic setup
yvanzo[m]
atj[m]: Our services are generally lacking health checks.
zas[m]
We also have Consul Service Health Checks for containers btw. They serve mainly to dynamically add/remove upstream servers on gateways and aren't usually linked to the alerting system, but they could be (see https://stats.metabrainz.org/d/1lYZuloMz/consul...)
These are collected by telegraf, based on the consul health checks defined for all containers using SERVICE_* env vars
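For context, a sketch of how such a check is typically declared, assuming a gliderlabs/registrator-style setup where SERVICE_* environment variables on the container are turned into Consul health checks; the service name, image and path below are made up:

```yaml
# hypothetical docker-compose excerpt
services:
  listenbrainz-web:
    image: metabrainz/listenbrainz
    environment:
      SERVICE_NAME: "listenbrainz-web"
      SERVICE_CHECK_HTTP: "/1/ping"     # path Consul should poll
      SERVICE_CHECK_INTERVAL: "15s"
      SERVICE_CHECK_TIMEOUT: "5s"
```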
mayhem[m]
atj: how should Process X checks be defined? PID files in the mounted volume dir ("service-checks")?
atj[m]
PID files are probably easiest, but we could also support pgrep-style options
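If pgrep-style matching were added alongside PID files, the check itself could stay very small; a sketch (the pattern is just an example):

```python
import subprocess

def process_running(pattern: str) -> bool:
    # pgrep exits 0 if at least one process matches the pattern
    # (-f matches against the full command line, not just the process name)
    return subprocess.run(
        ["pgrep", "-f", pattern],
        stdout=subprocess.DEVNULL,
    ).returncode == 0

# e.g. process_running("spotify_sync_daemon")
```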
mayhem[m]
KISS and start with PID files?
how should check_intervals be defined
Xm, Xh, Xd ?
or go full unix def and make it crontab?
atj[m]
ok, so that's a wrinkle because the check frequency is defined at the Icinga / Nagios level
however we could probably do something where the plugin runs every minute and remembers when it last ran each check
that would add a bit of complexity though
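One way that "remember when it last ran" could work, sketched only; the state-file location is arbitrary and the 1h fallback matches the default discussed further down:

```python
import json
import os
import time

STATE_FILE = "/var/tmp/service-checks-state.json"  # arbitrary location

def due_checks(rules: list[dict]) -> list[dict]:
    """Return the rules whose interval has elapsed since their last run."""
    state = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)
    now = time.time()
    due = []
    for rule in rules:
        if now - state.get(rule["name"], 0) >= rule.get("interval_seconds", 3600):
            due.append(rule)
            state[rule["name"]] = now
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
    return due
```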
mayhem[m]
tricky.
atj[m]
what use case did you have in mind for the check interval?
mayhem[m]
just to define how often the check runs.
I suppose we could have a fixed interval for this simple case.
atj[m]
sorry, i meant cases where this would be relevant to the usefulness of the check
yvanzo[m]
atj: Do you remember how the Solr collections/cores were created?
atj[m]
yvanzo[m]: it was using Ansible
mayhem: maybe don't worry too much about these specific details / limitations for now. if you use the document as a brain dump for what you need i can then work out where the wrinkles are and zas and I can discuss
mayhem[m]
all the cases I can think of could easily fit into the category of "if 1h interval is not good, use classic methods".
so I think we can just go with 1h as the default and move on.
atj: yea, ok. I was just hoping to write a doc that would answer all the questions from a developer's perspective.
atj[m]
we'll get there in the next iteration
pranav[m]
Cool, thanks aerozol and akshaaatt
atj[m]
there are some other options because Icinga 2 has an API which would give additional flexibility
but I need to know all your use cases first to see what is feasible
mayhem[m]
ok.
for the file-exists check, should that be limited to the mounted service-checks volume or can it be done for any path in the container?
atj[m]
i was just thinking about that one...
mayhem[m]
KISS and start on the mounted volume?
with that in mind, I think I've covered the use cases I am looking for in that doc.
please have a look and see what else I should be including/defining.
ick. I wonder if "docker cp ..." would be a better solution.
but for now, KISS
yvanzo[m]
atj: I found the definition of `_solr_collection_create`, but couldn’t find where it is used. My goal is to reproduce it on mirrors.
(Previously, I created collections locally using `/solr/admin/collections?action=CREATE&name=`… but they don’t have the same schema version.)
atj[m]
yvanzo: so, i create configsets first, then use the V2 API to create collections referencing the relevant configset
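Roughly what that flow looks like, assuming a Solr 8-style setup; the configset and collection names are placeholders, and the exact v2 payload shape varies between Solr versions, so the ref guide for the version in use should be checked:

```
# upload a configset (admin/configs API), then create a collection referencing it via the v2 API
curl -X POST "http://localhost:8983/solr/admin/configs?action=UPLOAD&name=my_configset" \
     -H 'Content-Type: application/octet-stream' --data-binary @my_configset.zip

curl -X POST http://localhost:8983/api/collections -H 'Content-Type: application/json' \
     -d '{"create": {"name": "my_collection", "config": "my_configset", "numShards": 1}}'
```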
is the schema version set by Solr / ZK after every update?
yvanzo[m]
No, it is set on creation only.
Or do you mean every new Solr release? Only once in a while. And apparently different schema versions can be set depending on (?) the configset, the method, the server mode…