#metabrainz

      • zerodogg joined the channel
      • minimal has quit
      • zerodogg has quit
      • zerodogg joined the channel
      • zerodogg has quit
      • zerodogg joined the channel
      • pite has quit
      • Kladky joined the channel
      • Kladky has quit
      • Kladky joined the channel
      • pranav[m] joined the channel
      • pranav[m]
        Had a doubt.. aerozol the loading animation for the screen should be shown just once when the user navigates to the user page right? or should it be shown each time the user changes their tab as well?
      • akshaaatt
      • SigHunter has quit
      • SigHunter joined the channel
      • rimskii[m]
        hi, lucifer! I have finished integrating Apple Music into troi. ([Here's the PR](https://github.com/metabrainz/troi-recommendation-playground/pull/141))
      • I plan to switch to LB and implement remaining features there. Should I continue?
      • SigHunter has quit
      • SigHunter joined the channel
      • adhawkins has quit
      • mayhem[m]
        zas: atj: I'd love to have a chat with both of you today. I want to find a way to make it super easy to monitor a task. ideally by setting up the details on how to monitor the service as part of the code that creates the service. atj and I talked about this in BCN recently.
      • zas[m]
        sure, when?
      • mayhem[m]
        I'll be around most of the day. let's see what atj says
      • I've already explained this to atj -- do you understand what I am suggesting?
      • we have too many services failing silently. I would like to make the monitoring of the services be part of the code we check in. part of the PR review process.
      • d4rk-ph0enix has quit
      • d4rk-ph0enix joined the channel
      • zas[m]
        Well, if something is failing silently it's usually because no one added any health check, metric and/or alert for it, so it makes sense to add those at the time the service is created/added/defined. But I'm curious about details. Let's have this discussion whenever atj is available.
      • mayhem[m]
        ok, cool. I don't have any details -- I'm going to listen to you and atj how this could work. this is just my motivation, since too many systems are quietly failing.
      • how could this work?
      • if we have a monitoring "definition" that is part of the code checkout, how could the services on the server discover the service definitions?
      • a fixed directory in a container?
      • /monitor/service_001.json, /monitor/service_002.json ...
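A per-service file in such a directory might look something like this; every field name here is hypothetical, sketched only to make the discovery idea concrete:

```json
{
  "service": "spotify-playlist-sync",
  "checks": [
    {"type": "file_age", "path": "/monitor/state/last_sync", "max_age": "1h"},
    {"type": "process", "pidfile": "/monitor/pids/sync.pid"}
  ]
}
```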
      • zas[m]
        What are those "too many services" which are failing?
      • mayhem[m]
        eg. playlist sync with spotify wasn't working for some time. we missed it until users reported it.
      • daily jams were not updated for weeks.
      • zas[m]
        So that's mainly "internal" services, it could easily be solved if the app generates proper metrics (like a Prometheus endpoint) and includes metrics about those. That's the starting point, the "service" has to provide information about its status. But defining alerts programmatically is another step. It's possible I guess, but defining alerts can be rather complex (well, if it's just a status like work/do not work that's simpler).
      • mayhem[m]
        seems right.
      • is there a default monitor we could set up? or something that screams at us for not having set up the alert?
      • I suppose that we can write a python lib that defines services and manages the creation and handling of a Prometheus endpoint.
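A minimal sketch of what such a lib could provide, assuming the prometheus_client package; the metric names and the wrapper are illustrative, not an agreed design:

```python
import time

from prometheus_client import Gauge, start_http_server

# 1 = the subtask's last run succeeded, 0 = it failed
SUBTASK_OK = Gauge("subtask_ok", "Whether the subtask last succeeded", ["subtask"])
# unix timestamp of the last success, usable for staleness alerts
LAST_SUCCESS = Gauge("subtask_last_success_timestamp_seconds",
                     "Time of the subtask's last successful run", ["subtask"])

def run_monitored(name, task):
    """Run a task and record its outcome as Prometheus metrics."""
    try:
        task()
    except Exception:
        SUBTASK_OK.labels(subtask=name).set(0)
        raise
    SUBTASK_OK.labels(subtask=name).set(1)
    LAST_SUCCESS.labels(subtask=name).set_to_current_time()

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics on :8000
    while True:
        run_monitored("spotify_playlist_sync", lambda: None)  # placeholder task
        time.sleep(3600)
```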
      • akshaaatt[m] joined the channel
      • akshaaatt[m]
        pranav: whenever data loading is happening, show the loader. Doesn’t depend on when the user comes to the ui or not
      • zas[m]
        From what I understand, you want to monitor tasks running in one container, right? In the examples above ("playlist sync"), what tells us whether it works or not?
      • mayhem[m]
        it could be a simple timestamp for when a test playlist was last synced to/from spotify.
      • and if the timestamp is > X time, complain.
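With that timestamp exported as a metric (as in the sketch above), the "complain if older than X" rule becomes a single Prometheus alerting expression; the metric name and the one-hour threshold are assumptions:

```yaml
# Sketch of a Prometheus alerting rule (fragment of a rule group).
- alert: SpotifyPlaylistSyncStale
  # fire if no successful sync has been recorded for over an hour
  expr: time() - subtask_last_success_timestamp_seconds{subtask="spotify_playlist_sync"} > 3600
  for: 10m
```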
      • aerozol[m]
        pranav: you can follow akshaaatt's lead on that, I don't have any experience with how it's most efficient to load stuff. Seems like a technical question? Unless you are doing some tactical pre-loading or something, in which case I'm happy to sit down and dig into the UX, lemme know!
      • zas[m]
        Usually we have things like a number of objects (let's call it V) and we collect this data on a regular basis (a Prometheus endpoint or telegraf reports metrics, and this is queried every X seconds), and alerts are based on V compared to a threshold (V under or above M, for example) for a certain time (usually a multiple of X). In some cases the alert is triggered only if the state persists (like if it fails for 5 minutes), but in other cases immediately.
      • More complex cases use multiple metrics to determine the status.
      • But we can also have another approach: app just reports ok/not ok for all the subtasks and the metrics are like "subtask_ok" with a value of 0 or 1.
      • mayhem[m]
        I think the ok/not ok should be the default so that some monitoring is in place.
      • do you have a link for what the Prometheus endpoint response should look like?
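For reference, the plain-text exposition format that Prometheus scrapes (documented at https://prometheus.io/docs/instrumenting/exposition_formats/) would look like this for the ok/not ok case:

```text
# HELP subtask_ok Whether the subtask last succeeded
# TYPE subtask_ok gauge
subtask_ok{subtask="spotify_playlist_sync"} 1
subtask_ok{subtask="daily_jams"} 0
```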
      • atj[m]
        sorry, been on work calls all morning
      • i think my perspective on this is a bit different to zas' 😛
      • mayhem[m]
        let's hear it. :)
      • Maxr1998 joined the channel
      • atj[m]
        IMO Prometheus is for metrics and so not a good fit for service-based monitoring (in general)
      • when we discussed this in Feb I got the impression it was more about functional monitoring, rather than "number go up"
      • Maxr1998_ has quit
      • mayhem[m]
        yes, most services just need to be verified running. time since last check could be the monitored value...
      • atj[m]
        AFAIR you asked me a fairly simple question about it and I went on some diatribe about deploying an Icinga 2 monitoring system and you somehow managed to feign interest
      • mayhem[m]
        atj[m]: not feigned. I asked a question, I really wanted to hear your take
      • atj[m]
        "recollections may vary" 😉
      • yeah i know, it's just me being paranoid
      • anyway, my idea was to have some sort of markup file that contains rules for general checks
      • then we write an Icinga / Nagios plugin that reads the files and executes the checks, then alerts if any of them fail
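Such a plugin only has to honour the Nagios plugin convention: one line of output plus exit code 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A rough Python skeleton, with the rule-file location, format and check dispatch all assumed rather than decided:

```python
#!/usr/bin/env python3
import glob
import sys

import yaml  # pip install pyyaml

def run_check(rule):
    """Placeholder: dispatch on rule["type"] and return (ok, message)."""
    return True, "%s ok" % rule.get("name", "unnamed")

failures = []
for path in glob.glob("/monitor/*.yml"):  # assumed mount point
    with open(path) as f:
        for rule in yaml.safe_load(f).get("checks", []):
            ok, message = run_check(rule)
            if not ok:
                failures.append(message)

if failures:
    print("CRITICAL: " + "; ".join(failures))
    sys.exit(2)
print("OK: all service checks passed")
sys.exit(0)
```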
      • mayhem[m]
        that works.
      • atj[m]
        something along those lines
      • we can mount a volume in the docker containers and the application writes a rule file into it
      • this is then automatically picked up
      • the only issue with this approach is that you have one "service check" in Nagios / Icinga that encompasses multiple checks
      • mayhem[m]
        I like that better than having to have an HTTP endpoint.
      • zas[m]
        rules are what exactly?
      • atj[m]
        process X is running, HTTP URL returns 200...
      • file exists and was last updated < 60secs ago
      • zas[m]
        I mean, in which syntax? nagios-like?
      • mayhem[m]
        I think we can go far with those 3 types of checks.
      • atj[m]
        TBD, my thought was YAML or similar markup that is human and machine readable
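A hypothetical YAML rendering of the three check types atj listed above; the keys and values are illustrative only:

```yaml
service: spotify-playlist-sync
checks:
  - type: process        # process X is running
    pidfile: /monitor/sync.pid
  - type: http           # HTTP URL returns 200
    url: http://localhost:8000/health
    expect_status: 200
  - type: file_age       # file exists and was updated recently
    path: /monitor/last_sync
    max_age: 60s
```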
      • mayhem[m]
        I don't love YAML, but I don't love coding JSON files either.
      • atj[m]
        YAML isn't great, but it's more human readable than JSON
      • zas: if you think this is a bad / stupid idea please speak up, i won't be offended!
      • essentially, we need to make it easy for the devs to implement monitoring
      • mayhem[m]
        atj[m]: that. please.
      • zas[m]
        So basically we define a proprietary language to define rules, code a proprietary icinga/nagios plugin to interpret them, install a nagios agent on all nodes, and share a volume with the docker containers for the nagios agent to collect files? I'm not yet sure about the full picture.
      • atj[m]
        that's roughly it
      • however i'd like to deploy an Icinga 2 monitoring system on all servers as a prerequisite
      • (I spend a lot of time doing that at work)
      • mayhem[m]
        I'm happy to implement a standard if there is one. no need to reinvent things -- I just don't know of such a thing
      • atj[m]
        Icinga 2 supports a proper agent mode and has a UI with config deployment etc so it is much more flexible than Nagios
      • but supports Nagios plugins
      • zas[m]
        For monitoring, we currently use: telegraf/influx/grafana, prometheus/grafana/alertmanager, nagios (old version)
      • nagios is clearly obsolete, so an upgrade to icinga could be good
      • atj[m]
        nagios / icinga serve a different need to prometheus / telegraf / grafana IMV
      • when you need some business logic in a check
      • rather than just doing some calculations on some numbers
      • so they complement each other
      • <mayhem[m]> "I'm happy to implement a..." <- i don't know of anything suitable, and do a lot of work in that space
      • zas[m]
        I think your approach can work, but I still think publishing proper metrics for our apps could be very useful, as it can serve much wider purposes. A /metrics endpoint is rather common nowadays, and format is basically key = value. But I guess making it simpler for devs is the real goal here. And from this point of view it seems to me your approach is likely simpler.
      • atj[m]
        i definitely agree that apps exporting metrics would be good too, there's no reason we can't do both
      • mayhem[m]
        I think a simple system that requires very little thought for the most basic services should be our goal here.
      • atj[m]
        but the immediate issue is that services stop working and nobody seems to know
      • mayhem[m]
        anything that requires real monitoring should use our classic setup
      • yvanzo[m]
        atj[m]: Our services are generally lacking health checks.
      • zas[m]
        We also have Consul Service Health Checks for containers btw; they serve mainly to dynamically add/remove upstream servers on gateways and aren't usually linked to the alerting system, but they could be (see https://stats.metabrainz.org/d/1lYZuloMz/consul...)
      • mayhem[m]
      • zas[m]
        This is collected by telegraf, based on consul health checks, defined for all containers using SERVICE_* env vars
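For context, registrator-style Consul registration is driven by environment variables on the container; the service name and check path below are made up, and the exact variables depend on the registrator setup in use:

```yaml
environment:
  SERVICE_NAME: example-api          # hypothetical service name
  SERVICE_CHECK_HTTP: /health        # hypothetical health-check path
  SERVICE_CHECK_INTERVAL: 15s
```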
      • mayhem[m]
        atj: how should Process X checks be defined? PID files in the mounted "service-checks" volume dir?
      • atj[m]
        PID files are probably easiest, but we could probably support pgrep-style options
      • mayhem[m]
        KISS and start with PID files?
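A PID-file liveness check is only a few lines of Python; the path here is hypothetical, and signal 0 only tests that the process exists:

```python
import os

def process_running(pidfile="/monitor/sync.pid"):
    """Return True if the PID named in pidfile belongs to a live process."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # missing or malformed PID file
    try:
        os.kill(pid, 0)  # signal 0: existence check, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```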
      • how should check_intervals be defined?
      • Xm, Xh, Xd ?
      • or go full unix def and make it crontab?
      • atj[m]
        ok, so that's a wrinkle because the check frequency is defined at the Icinga / Nagios level
      • however we could probably do something where the plugin runs every 1 minute and remembers when it last ran a check
      • that would add a bit of complexity though
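The "remembers when it last ran" part could be a small state file; a sketch, with the location and format assumed:

```python
import json
import time

STATE_FILE = "/var/tmp/service-checks-state.json"  # assumed location

def load_state():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}

def due(state, check_name, interval_seconds):
    """True if at least interval_seconds have passed since the last run."""
    return time.time() - state.get(check_name, 0) >= interval_seconds

def mark_ran(state, check_name):
    state[check_name] = time.time()
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
```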
      • mayhem[m]
        tricky.
      • atj[m]
        what use case did you have in mind for the check interval?
      • mayhem[m]
        just to define how often the check runs.
      • I suppose we could have a fixed interval for this simple case.
      • atj[m]
        sorry, i meant cases where this would be relevant to the usefulness of the check
      • yvanzo[m]
        atj: Do you remember how the Solr collections/cores have been created?
      • atj[m]
        yvanzo[m]: it was using Ansible
      • mayhem: maybe don't worry too much about these specific details / limitations for now. if you use the document as a brain dump for what you need i can then work out where the wrinkles are and zas and I can discuss
      • mayhem[m]
        all the cases I can think of could easily fit into the category of "if 1h interval is not good, use classic methods".
      • so I think we can just go with 1h as the default and move on.
      • atj: yea, ok. I was just hoping to write a doc that would answer all the questions from a developer's perspective.
      • atj[m]
        we'll get there in the next iteration
      • pranav[m]
        Cool, thanks aerozol and akshat
      • atj[m]
        there are some other options because Icinga 2 has an API which would give additional flexibility
      • atj[m]
        but I need to know all your use cases first to see what is feasible
      • mayhem[m]
        ok.
      • for the file-exists check, should that be limited to the mounted service-check volume or can that be done for any path in the container?
      • atj[m]
        i was just thinking about that one...
      • mayhem[m]
        KISS and start on the mounted volume?
      • with that in mind, I think I've covered the use cases I am looking for in that doc.
      • please have a look and see what else I should be including/defining.
      • atj[m]
        mayhem[m]: probably for now
      • looks like you can do this to get the root dir of the container but it's not ideal:... (full message at <https://matrix.chatbrainz.org/_matrix/media/v3/...>)
      • mayhem[m]
        ick. I wonder if "docker cp ..." would be a better solution.
      • but for now, KISS
      • yvanzo[m]
        atj: I found the definition of `_solr_collection_create`, but couldn’t find where it is used. My goal is to reproduce it on mirrors.
      • (Previously, I created collections locally using `/solr/admin/collections?action=CREATE&name=`… but they don’t have the same schema version.)
      • atj[m]
        yvanzo: so, i create configsets first, then use the V2 API to create collections referencing the relevant configset
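A rough sketch of that flow over HTTP, using Python's requests; the host, configset name, collection name and payload are assumptions, and the exact v2 route differs between Solr versions:

```python
import requests

SOLR = "http://localhost:8983"

# 1. Upload a configset (zipped conf/ directory) via the v1 Configsets API.
with open("musicbrainz_conf.zip", "rb") as f:
    requests.post(
        SOLR + "/solr/admin/configs",
        params={"action": "UPLOAD", "name": "musicbrainz"},
        data=f.read(),
        headers={"Content-Type": "application/octet-stream"},
    ).raise_for_status()

# 2. Create a collection referencing that configset via the v2 API.
requests.post(
    SOLR + "/api/collections",
    json={"create": {"name": "artist", "config": "musicbrainz", "numShards": 1}},
).raise_for_status()
```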
      • is the schema version set by Solr / ZK after every update?
      • yvanzo[m]
        No, it is set on creation only.
      • Or do you mean every new Solr release? Only once in a while. And apparently different schema versions can be set depending on (?) the configset, the method, the server mode…