#metabrainz

      • Nyanko-sensei has quit
      • 2021-02-01 03239, 2021

      • Nyanko-sensei joined the channel
      • 2021-02-01 03209, 2021

      • yokel has quit
      • 2021-02-01 03240, 2021

      • yokel joined the channel
      • 2021-02-01 03226, 2021

      • Nyanko-sensei has quit
      • 2021-02-01 03212, 2021

      • Nyanko-sensei joined the channel
      • 2021-02-01 03250, 2021

      • Nyanko-sensei has quit
      • 2021-02-01 03235, 2021

      • Nyanko-sensei joined the channel
      • 2021-02-01 03202, 2021

      • Nyanko-sensei has quit
      • 2021-02-01 03235, 2021

      • Nyanko-sensei joined the channel
      • 2021-02-01 03202, 2021

      • AmandeeKumar joined the channel
      • 2021-02-01 03206, 2021

      • AmandeeKumar is now known as AmandeepKumar
      • 2021-02-01 03207, 2021

      • Nyanko-sensei has quit
      • 2021-02-01 03208, 2021

      • Nyanko-sensei joined the channel
      • 2021-02-01 03232, 2021

      • AmandeepKumar has quit
      • 2021-02-01 03243, 2021

      • _lucifer has quit
      • 2021-02-01 03256, 2021

      • _lucifer joined the channel
      • 2021-02-01 03218, 2021

      • revi has quit
      • 2021-02-01 03210, 2021

      • D4RK-PH0_ has quit
      • 2021-02-01 03200, 2021

      • revi joined the channel
      • 2021-02-01 03232, 2021

      • AmandeeKumar joined the channel
      • 2021-02-01 03253, 2021

      • AmandeeKumar has quit
      • 2021-02-01 03240, 2021

      • yvanzo
        mo’’in’
      • 2021-02-01 03258, 2021

      • sumedh joined the channel
      • 2021-02-01 03214, 2021

      • rdswift has quit
      • 2021-02-01 03212, 2021

      • rdswift joined the channel
      • 2021-02-01 03247, 2021

      • sumedh has quit
      • 2021-02-01 03224, 2021

      • Nyanko-sensei has quit
      • 2021-02-01 03224, 2021

      • Nyanko-sensei joined the channel
      • 2021-02-01 03238, 2021

      • ruaok
        mo'in!
      • 2021-02-01 03227, 2021

      • ruaok
        yvanzo: zas: shall we put some thought behind fixing trille today?
      • 2021-02-01 03257, 2021

      • yvanzo
        ruaok: is listenbrainz using rabbitmq too?
      • 2021-02-01 03205, 2021

      • ruaok
        yes
      • 2021-02-01 03207, 2021

      • yvanzo
        MB uses it to update search indexes, so it should preferably not be stopped or the search indexes won’t be up to date.
      • 2021-02-01 03247, 2021

      • yvanzo
        I don’t think that this is what takes the most resources on trille, but we could move this queue to PostgreSQL.
      • 2021-02-01 03212, 2021
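
For context on whether the search-index queue is actually backing up, a passive queue_declare shows the message and consumer counts without disturbing the queue. A minimal sketch with pika; the host, credentials and queue name are placeholders, not the real trille configuration:

    # Sketch: inspect RabbitMQ queue depth without consuming anything.
    import pika

    params = pika.ConnectionParameters(
        host="trille.example.org",  # placeholder host
        credentials=pika.PlainCredentials("guest", "guest"),  # placeholder credentials
    )
    conn = pika.BlockingConnection(params)
    channel = conn.channel()
    # passive=True only inspects the queue; it errors if the queue does not exist.
    q = channel.queue_declare(queue="search-index", passive=True)  # placeholder queue name
    print("messages waiting:", q.method.message_count)
    print("consumers:", q.method.consumer_count)
    conn.close()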

      • ruaok
        LB is less sensitive. if a user cannot submit a listen, clients must re-try. so, restarts are ok.
      • 2021-02-01 03228, 2021

      • ruaok
      • 2021-02-01 03254, 2021

      • ruaok
        something is amiss. load on trille keeps growing, but traffic in rabbitmq hasn't grown.
      • 2021-02-01 03205, 2021

      • yvanzo
        IMHO, CB probably has a lot of room to reduce its resource footprint.
      • 2021-02-01 03254, 2021

      • ruaok
        Does CB use RMQ, or is CB on trille as well?
      • 2021-02-01 03214, 2021

      • yvanzo
        it is on trille as well
      • 2021-02-01 03255, 2021

      • ruaok
        have we ascertained if the resource hog is CB or RMQ?
      • 2021-02-01 03248, 2021

      • ruaok
        telegraf is the top process on trille? that feels odd to me.
      • 2021-02-01 03222, 2021

      • yvanzo
        zas: Is it possible to monitor trille’s containers from grafana more closely, for example using cadvisor?
      • 2021-02-01 03257, 2021
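
If cAdvisor were added, its REST API exposes per-container stats that Grafana could scrape. A rough sketch of querying it directly, assuming a default cAdvisor setup on port 8080 (the port, endpoint path and response fields are assumptions, not the actual MetaBrainz deployment):

    # Sketch: list docker containers known to a local cAdvisor instance.
    import requests

    resp = requests.get("http://localhost:8080/api/v1.3/docker/", timeout=10)
    resp.raise_for_status()

    # The response maps container ids to metadata plus recent stats samples;
    # only the container aliases are printed, to avoid relying on exact stat fields.
    for container_id, info in resp.json().items():
        print(container_id, info.get("aliases", []))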

      • zas
        well, we already have reports for containers on trille
      • 2021-02-01 03258, 2021

      • ruaok
        you read my mind, we need to have a % usage per container graph....
      • 2021-02-01 03236, 2021

      • zas
      • 2021-02-01 03237, 2021

      • ruaok
        zas: got link?
      • 2021-02-01 03247, 2021

      • ruaok
        heh
      • 2021-02-01 03250, 2021

      • yvanzo
      • 2021-02-01 03251, 2021

      • zas
        ignore empty graphs, scroll down
      • 2021-02-01 03227, 2021

      • zas
        guys, we already have 2 suspects: rabbitmq and critiquebrainz-redis
      • 2021-02-01 03248, 2021

      • zas
        the first one is known to eat cpu for nothing in certain cases
      • 2021-02-01 03203, 2021

      • zas
        the second one actually has huge write-to-disk spikes
      • 2021-02-01 03210, 2021
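
One quick way to confirm which processes are behind those write spikes, independent of the container graphs, is to sample per-process I/O counters twice and diff them. A sketch using psutil, which needs root on trille to see other users' processes:

    # Sketch: rank processes by bytes written over a short sampling window.
    import time
    import psutil

    def snapshot():
        counts = {}
        for p in psutil.process_iter(["pid", "name"]):
            try:
                counts[p.info["pid"]] = (p.info["name"], p.io_counters().write_bytes)
            except (psutil.AccessDenied, psutil.NoSuchProcess):
                pass
        return counts

    before = snapshot()
    time.sleep(10)
    after = snapshot()

    deltas = [
        (after[pid][1] - before[pid][1], name, pid)
        for pid, (name, _) in after.items() if pid in before
    ]
    for written, name, pid in sorted(deltas, reverse=True)[:10]:
        print(f"{written / 1024:10.1f} KiB/10s  {name} (pid {pid})")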

      • ruaok
        I just dont see rabbitmq as the culprit. but I see redis. those peaks in BlkIo are worrying.
      • 2021-02-01 03218, 2021

      • zas
        yes^^
      • 2021-02-01 03228, 2021

      • zas
        it writes far too much data
      • 2021-02-01 03234, 2021

      • zas
        yesterday I reduced the share of trille's mbs to almost nothing, so only a few queries go to it, and even with that, we still have very slow queries (read: seconds instead of milliseconds)
      • 2021-02-01 03238, 2021

      • ruaok
      • 2021-02-01 03252, 2021

      • zas
        on some ws queries (usually < 100ms) we can reach > 10s
      • 2021-02-01 03252, 2021
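
A crude way to reproduce those numbers is to time the same ws request repeatedly against the host being measured. A sketch with requests; the URL is an illustrative MB webservice lookup, not the exact query zas measured:

    # Sketch: time a sample MB webservice request a few times.
    import requests

    # Illustrative query; point it at the gateway/host under investigation.
    URL = ("https://musicbrainz.org/ws/2/release-group/"
           "ee9b6cad-ee58-3529-81ba-cc204769459c?fmt=json")

    timings = []
    for _ in range(5):
        r = requests.get(URL, timeout=30)
        r.raise_for_status()
        timings.append(r.elapsed.total_seconds())

    print("min: %.3fs  max: %.3fs" % (min(timings), max(timings)))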

      • ruaok
        it looks like we need to investigate wtf is happening here.
      • 2021-02-01 03258, 2021

      • ruaok
        what did we do on 10/9, for instance?
      • 2021-02-01 03204, 2021

      • yvanzo
      • 2021-02-01 03217, 2021

      • ruaok
        yvanzo: thank you. that helps.
      • 2021-02-01 03227, 2021

      • ruaok
        so, rabbitmq is not the problem. agreed?
      • 2021-02-01 03232, 2021

      • yvanzo
        +1
      • 2021-02-01 03208, 2021

      • ruaok
      • 2021-02-01 03223, 2021

      • ruaok
        that coincides with a CB release.
      • 2021-02-01 03236, 2021

      • ruaok
        and presumably a CB container restart.
      • 2021-02-01 03242, 2021

      • ruaok
        though the release on 10.26 didn't cause the same drop. perhaps redis was not restarted then?
      • 2021-02-01 03258, 2021

      • ruaok
        zas, yvanzo: has redis been restarted recently?
      • 2021-02-01 03231, 2021

      • yvanzo
        last time 2 months ago
      • 2021-02-01 03245, 2021

      • zas
        this instance of redis doesn't run with --appendonly=yes, like most instances, so it doesn't use aof
      • 2021-02-01 03200, 2021
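
The persistence setup can be double-checked without touching the data by reading the appendonly config and the INFO persistence section. A sketch with redis-py; the host and port are placeholders for the critiquebrainz-redis container:

    # Sketch: inspect what persistence the CB redis instance is doing.
    import redis

    r = redis.Redis(host="localhost", port=6379)  # placeholder host/port

    print("appendonly:", r.config_get("appendonly"))

    persistence = r.info("persistence")
    for key in ("aof_enabled", "rdb_bgsave_in_progress",
                "rdb_changes_since_last_save", "rdb_last_bgsave_status"):
        print(key, "=", persistence.get(key))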

      • ruaok
        any objections to restarting it to see what happens to the graph?
      • 2021-02-01 03213, 2021

      • zas
        but imho beam.smp cannot be excluded yet
      • 2021-02-01 03227, 2021

      • ruaok
        I could see a situation where CB is keeping a list in redis that keeps growing. and it is written over and over again.
      • 2021-02-01 03236, 2021

      • ruaok
        a bug, for sure.
      • 2021-02-01 03257, 2021

      • ruaok
        zas: I didn't. I'm trying to exclude on clear trouble maker to get another data point.
      • 2021-02-01 03210, 2021

      • ruaok
        *one
      • 2021-02-01 03218, 2021

      • ruaok
        _lucifer: ping
      • 2021-02-01 03207, 2021

      • zas
        process 2273 (beam.smp) is writing a lot
      • 2021-02-01 03238, 2021

      • ruaok
        can you tell where the data goes, zas?
      • 2021-02-01 03252, 2021

      • zas
        wait
      • 2021-02-01 03208, 2021

      • zas
        I caught redis write ops
      • 2021-02-01 03215, 2021
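
Catching the write ops can also be done from redis-py's MONITOR wrapper, filtering for write commands. A brief sketch, assuming redis-py's Monitor helper; MONITOR adds real overhead, so only run it for a short while:

    # Sketch: stream commands hitting redis and print only the write-ish ones.
    import redis

    WRITE_COMMANDS = {"SET", "SETEX", "SETNX", "DEL", "EXPIRE",
                      "HSET", "LPUSH", "RPUSH", "SADD"}

    r = redis.Redis(host="localhost", port=6379)  # placeholder host/port

    with r.monitor() as m:
        for event in m.listen():
            command = event["command"]
            if command.split(" ", 1)[0].upper() in WRITE_COMMANDS:
                print(event["time"], command[:120])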

      • ruaok
        beam is deffo the highest disk user. didn't CB add more telegraf logging? could it be overdoing it?
      • 2021-02-01 03217, 2021

      • zas
        it goes up to 50mb/s
      • 2021-02-01 03234, 2021

      • ruaok
        yes, that is why I am focusing on redis.
      • 2021-02-01 03234, 2021

      • zas
        while beam.smp doesn't go over 400kb/s
      • 2021-02-01 03247, 2021

      • ruaok
        beam.smp is an issue too, but it's less spiky.
      • 2021-02-01 03203, 2021

      • zas
        yes, so unlikely to cause the huge delays we see
      • 2021-02-01 03205, 2021

      • ruaok
        I'm going to restart redis, ok?
      • 2021-02-01 03215, 2021

      • yvanzo
        It seems to be due to CB's usage of redis.
      • 2021-02-01 03238, 2021

      • zas
        not sure restarting it will help, but you can try
      • 2021-02-01 03208, 2021

      • ruaok
        it might drop the traffic back to 0 and then start growing again. but that would clearly indicate CB is doing something bad.
      • 2021-02-01 03209, 2021

      • yvanzo
        we will probably see the same graph as from September.
      • 2021-02-01 03222, 2021

      • _lucifer
        ruaok: pong
      • 2021-02-01 03236, 2021

      • yvanzo
        At least, CPU/mem usage should be lower at first.
      • 2021-02-01 03252, 2021

      • ruaok
        peaks are 5-6 mins apart.
      • 2021-02-01 03254, 2021

      • ruaok
        hi _lucifer !
      • 2021-02-01 03209, 2021

      • ruaok
        can you please follow the scrollback for the last 20 minutes?
      • 2021-02-01 03215, 2021

      • _lucifer
        sure
      • 2021-02-01 03223, 2021

      • ruaok
        we're seeing strange redis use coming from CB.
      • 2021-02-01 03246, 2021

      • ruaok
        I'm curious if redis use in CB has recently changed. is there anything that gets processed every 5 minutes or so?
      • 2021-02-01 03205, 2021

      • ruaok
        it is causing massive disk io spikes: https://stats.metabrainz.org/d/000000051/hetzner-…
      • 2021-02-01 03247, 2021

      • ruaok
        zas, yvanzo : as expected the disk io for redis has dropped to nothing.
      • 2021-02-01 03229, 2021

      • ruaok
        beam.smp too, which might suggest that whatever the redis bug is, it might be logging info to telegraf.
      • 2021-02-01 03201, 2021

      • Gazooo7949440 has quit
      • 2021-02-01 03235, 2021

      • _lucifer
        there were a couple of bug fixes regarding that (a key mismatch), but nothing comes to mind that happens at a regular interval
      • 2021-02-01 03202, 2021

      • ruaok
        ok, the regular interval might be a redis behaviour due to increased use.
      • 2021-02-01 03211, 2021

      • _lucifer
        all data served by CB is cached
      • 2021-02-01 03247, 2021

      • Gazooo7949440 joined the channel
      • 2021-02-01 03248, 2021

      • ruaok
        what is weird is that we are seeing a lot of data being written to redis, but with very little read. that is completely upside-down from what it should be.
      • 2021-02-01 03208, 2021
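
That write-heavy, read-light pattern should also show up in redis's own counters; comparing keyspace hits/misses with the per-command call counts is a quick sanity check. A sketch with redis-py, with placeholders for host/port:

    # Sketch: compare cache hits/misses against read/write command counts.
    import redis

    r = redis.Redis(host="localhost", port=6379)  # placeholder host/port

    stats = r.info("stats")
    print("keyspace_hits:  ", stats.get("keyspace_hits"))
    print("keyspace_misses:", stats.get("keyspace_misses"))

    # commandstats breaks calls down per command, e.g. cmdstat_get / cmdstat_setex.
    for name, data in sorted(r.info("commandstats").items()):
        if name in ("cmdstat_get", "cmdstat_set", "cmdstat_setex"):
            print(name, data.get("calls"))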

      • _lucifer
        yeah right
      • 2021-02-01 03237, 2021

      • _lucifer
        can we like a sample of the latest read/writes?
      • 2021-02-01 03237, 2021

      • ruaok
        so, the primary traffic for CB comes from MB hitting its API.
      • 2021-02-01 03206, 2021

      • ruaok
        /ws/1/review/?limit=1&offset=0&release_group=ee9b6cad-ee58-3529-81ba-cc204769459c&sort=rating
      • 2021-02-01 03235, 2021

      • _lucifer
        *view a sample
      • 2021-02-01 03257, 2021

      • ruaok
        is the endpoint that gets all the traffic. can you please go through the entire code chain of this endpoint and review its redis use in great detail, to see if we can find somewhere redis might be used incorrectly?
      • 2021-02-01 03210, 2021
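
For reference while auditing, the pattern to look for is a cache-aside helper whose read key and write key can diverge, or whose key includes per-request noise, since that produces exactly this many-writes/few-reads signature. A purely illustrative sketch; none of these names come from the actual CritiqueBrainz code:

    # Illustrative only: a cache-aside helper and the key bug to look for.
    import hashlib
    import json
    import redis

    r = redis.Redis()  # placeholder connection
    TTL = 3600  # seconds

    def cache_key(entity_type, mbid, **params):
        # Bug to look for: if params are serialized unsorted (or include e.g. a
        # timestamp), the key differs between the read and the write, so every
        # request misses and still performs a write.
        blob = json.dumps({"type": entity_type, "mbid": mbid, "params": params},
                          sort_keys=True)
        return "cb:" + hashlib.sha1(blob.encode("utf-8")).hexdigest()

    def fetch_reviews_from_db(entity_type, mbid, **params):
        # Stand-in for the real database query.
        return {"reviews": [], "entity": {"type": entity_type, "mbid": mbid}}

    def get_reviews(entity_type, mbid, **params):
        key = cache_key(entity_type, mbid, **params)
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)          # cache hit: read only
        result = fetch_reviews_from_db(entity_type, mbid, **params)
        r.setex(key, TTL, json.dumps(result))  # cache miss: one write
        return result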

      • _lucifer
        sure, i'll do that
      • 2021-02-01 03231, 2021

      • ruaok
        _lucifer: that is what I am hoping to do next. what is being written and read. let me dig
      • 2021-02-01 03249, 2021

      • ruaok
        wow.
      • 2021-02-01 03206, 2021

      • ruaok
        an incredible number of new keys are being generated in redis.
      • 2021-02-01 03250, 2021

      • _lucifer
        my preliminary guess is that if there is no bug, then many different MB entities are being queried (different release groups being viewed, but the same page is not viewed frequently), so the writes are frequent but the reads are not.
      • 2021-02-01 03216, 2021

      • ruaok
        6 keys a second are being created. that would be the problem.
      • 2021-02-01 03213, 2021
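
That rate is easy to confirm by sampling DBSIZE over a short window. A sketch with redis-py, with placeholders for host/port; note it undercounts if keys also expire during the window:

    # Sketch: estimate how many new keys are created per second.
    import time
    import redis

    r = redis.Redis(host="localhost", port=6379)  # placeholder host/port

    before = r.dbsize()
    time.sleep(60)
    after = r.dbsize()

    print("net new keys/sec:", (after - before) / 60)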

      • ruaok
        there must be a problem with the page cache.
      • 2021-02-01 03202, 2021

      • ruaok
        worst case, each page fetched from MB would cause 1 fetch and 1 write to redis.
      • 2021-02-01 03220, 2021

      • ruaok
        but I would expect some cache hits, so there should be fewer writes than reads.
      • 2021-02-01 03242, 2021

      • ruaok
        do you know where the cache keys are generated in CB, _lucifer?
      • 2021-02-01 03254, 2021

      • _lucifer
        yes a sec
      • 2021-02-01 03215, 2021

      • _lucifer
      • 2021-02-01 03236, 2021

      • ruaok
        was caching in brainzutils changed recently?
      • 2021-02-01 03243, 2021

      • _lucifer
        no, doesn't seem so, the last commit to brainzutils cache was 2 years ago
      • 2021-02-01 03250, 2021

      • _lucifer
      • 2021-02-01 03256, 2021

      • ruaok
        _lucifer: could you do me a quick favor? can you disable caching from that function and make a small PR?
      • 2021-02-01 03218, 2021

      • ruaok
        then we can deploy that and observe.
      • 2021-02-01 03223, 2021

      • _lucifer
        sure, on it
      • 2021-02-01 03236, 2021

      • ruaok
        because right now caching is creating load problems, not solving them.
      • 2021-02-01 03239, 2021

      • ruaok
        thx
      • 2021-02-01 03216, 2021

      • BrainzGit
        [critiquebrainz] amCap1712 opened pull request #337 (master…disable-cache): Disable release-group caching https://github.com/metabrainz/critiquebrainz/pull…
      • 2021-02-01 03237, 2021

      • ruaok
        thx
      • 2021-02-01 03247, 2021

      • _lucifer
        1 test is failing but that is expected.
      • 2021-02-01 03240, 2021

      • ruaok
        agreed.
      • 2021-02-01 03247, 2021

      • ruaok
        let me see about deploying.
      • 2021-02-01 03211, 2021

      • BrainzGit
        [critiquebrainz] mayhem merged pull request #337 (master…disable-cache): Disable release-group caching https://github.com/metabrainz/critiquebrainz/pull…
      • 2021-02-01 03242, 2021

      • _lucifer
        I think MB only requires the review text and review ratings, but CB provides additional entity data which MB already has. The number of reviews is much smaller than the number of entities, so for most entities we are just caching entity data that is not useful to MB. I can add an MB mode to cache only the review text and ratings. I think that would reduce the cache writes to a large extent.
      • 2021-02-01 03222, 2021

      • yvanzo
        +1
      • 2021-02-01 03230, 2021

      • ruaok
        oh, that is interesting. maybe make a separate endpoint for that?
      • 2021-02-01 03253, 2021
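
One way the proposed MB mode (or a separate endpoint) could cut the cache volume is to keep only the fields MB actually uses before caching or returning a review. A purely illustrative sketch with hypothetical field names, not a design for the real endpoint:

    # Illustrative only: strip a review down to what MB displays.
    def slim_review(review):
        # Hypothetical field names; the real CB review model may differ.
        return {
            "id": review["id"],
            "text": review.get("text"),
            "rating": review.get("rating"),
            "last_updated": review.get("last_updated"),
        }

    def slim_payload(reviews, count):
        # Entity metadata that MB already has is intentionally dropped.
        return {"count": count, "reviews": [slim_review(rev) for rev in reviews]}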

      • alastairp
        hello. reading backlog
      • 2021-02-01 03258, 2021

      • alastairp
        need help with CB release?
      • 2021-02-01 03217, 2021

      • ruaok
        hopefully not. 🤞