#metabrainz

      • suvid[m] joined the channel
      • suvid[m]
        <aerozol[m]> "^ @fettuccinae:matrix.org..." <- https://www.linkedin.com/in/suvidsinghal
      • mamanullah7[m] joined the channel
      • mamanullah7[m]
        <aerozol[m]> "Congrats GSoC’ers! If you can..." <- https://www.linkedin.com/in/mamanullah7/
      • HemangMishra[m] joined the channel
      • HemangMishra[m]
        <aerozol[m]> "^ @fettuccinae:matrix.org..." <- https://www.linkedin.com/in/hemangmishra
      • mrnelgin joined the channel
      • kepstin has quit
      • nelgin has quit
      • kepstin joined the channel
      • holycow23[m] joined the channel
      • holycow23[m]
        <aerozol[m]> "^ @fettuccinae:matrix.org..." <- https://www.linkedin.com/in/granth-bagadia/
      • vardhan joined the channel
      • Kladky joined the channel
      • darkdrift joined the channel
      • darkdrift has quit
      • akshaaatt[m] joined the channel
      • akshaaatt[m]
        Congratulations GSoC’ers! 🔥🔥🔥
      • lucifer[m]
        [@mayhem:chatbrainz.org](https://matrix.to/#/@mayhem:chatbrainz.org) lb and meb redis are separate already. not sure what caused it. were meb or lb under elevated traffic when this happened?
      • I would check myself but I am unsure when this started.
      • FWIW, I did find some queries to donor api just before the redis mess started
      • mayhem[m]
        no elevated traffic; nothing I could see, except for redis connection messages.
      • lucifer[m]
        So I have implemented a cache for those queries
      • Testing it in beta at the moment.
      • mayhem[m]
        and redis flushing its data to disk with heavy IO once a minute. that may not seem like a good idea...
      • lucifer[m]
        Those queries could create a lot of db connections to the meb db, bringing down meb too
      • mayhem[m]
        ah, interesting.
      • lucifer[m]
        As for redis flushing, the bulk of the data in redis would be metadata caches.
      • Wild guess would be that the metadata caches overwhelmed redis somehow, which caused issues in lb, which in turn made the meb donor queries lag and brought down meb too
      • But I will have to dig deeper
      • To be sure
      • zas[m]
        mayhem: How is metabrainz.org linked to kiss? Kiss exhibits high disk write activity every 4 minutes for 3 minutes, leading to overall high disk I/O; load is rather low, cpu usage is low, memory usage looks normal
      • mayhem[m]
        That might be redis flushing to disk, which seems excessive
      • zas[m]
      • But does metabrainz.org use this instance of redis?
      • mayhem[m]
        <zas[m]> "But metabrainz.org use this..." <- no, but it looks like lucifer may have put a lot of load on the DB testing a pesky query, which may have caused this. that query is now being cached to limit the impact.
      • LupinIII
        HA
      • love this!
      • UltraFuzzy has left the channel
      • lucifer[m]
        zas: is the high disk write happening now as well?
      • Sophist-UK joined the channel
      • mayhem: redis logs are unavailable since the container restart so can't debug further for now.
      • mayhem[m]
        yeah, who knows what happened. let's keep our eyes open for the time being.
      • _BrainzGit
        [listenbrainz-server] amCap1712 opened pull request #3265 (master…mapper-fix): Fix legacy listens index date comparison https://github.com/metabrainz/listenbrainz-serv...
      • [listenbrainz-server] amCap1712 opened pull request #3266 (master…cache-donors): Cache donor queries results https://github.com/metabrainz/listenbrainz-serv...
      • lucifer[m]
        monkey / ansh can you please rebase bootstrap upgrade branch on latest master?
      • monkey[m]
        Ah, am i creating issues on test?
      • Will do
      • lucifer[m]
        just some old errors that i resolved earlier popping back up in sentry.
      • monkey[m]
        Soz. Might take a moment to rebase, but I think I'm done with testing that branch on test so i can put another image on there for the time being
      • lucifer[m]
        no worries either way.
      • zas[m]
        lucifer: not much changes
      • lucifer[m]
        zas: which of these graphs should i be looking at?
      • _BrainzGit
        [listenbrainz-server] amCap1712 merged pull request #3265 (master…mapper-fix): Fix legacy listens index date comparison https://github.com/metabrainz/listenbrainz-serv...
      • zas[m]
        lucifer: for kiss/lb-redis have a look at https://stats.metabrainz.org/d/000000051/hetzne...
      • Maxr1998_ has quit
      • Maxr1998 joined the channel
      • lucifer[m]
        mayhem: restarting lb redis to check something.
      • one more restart.
        mayhem, zas: the default redis config that we are using for LB saves the existing data to an RDB dump on disk on exit and loads it back into memory on restart. it also apparently keeps track of the max memory usage reached in its last lifetime and then allocates that much memory again. we had only 900M of data in redis but it occupied 42G, because at some point in the last year lb-redis on kiss used 42G, so it kept holding 42G ever since, even after restarts. a large amount of memory also means more CPU consumption. i manually got rid of the existing dumps and then stopped the redis server with nosave to prevent it from writing the dump to disk. the cpu usage has now dropped to almost 1% and the memory usage is of course a few MBs at the moment; i expect that to grow to 1G or so in a few hours with normal usage.
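A minimal sketch of how that memory state can be inspected and the dump-to-disk behaviour switched off, assuming redis-py and an illustrative lb-redis host; this is not the actual LB tooling, just the standard Redis commands involved:

```python
import redis

# Illustrative connection details; the real lb-redis host/port are assumptions.
r = redis.Redis(host="lb-redis", port=6379)

# INFO memory shows the gap between the live dataset and what the process holds.
mem = r.info("memory")
print(mem["used_memory_human"])        # live dataset, e.g. ~900M
print(mem["used_memory_rss_human"])    # resident memory the OS sees, e.g. ~42G
print(mem["mem_fragmentation_ratio"])  # rss / used; large values mean memory is not being returned

# Disable periodic RDB snapshots so the instance stops dumping to disk.
r.config_set("save", "")
```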
      • mayhem[m]
        does that mean we had a lot of data that never expired?
      • lucifer[m]
        i don't think this is related to yesterday's issue but still should speed up our redis use in LB.
      • no, the data expired but redis doesn't return the memory to the OS.
      • i would expect it to know that 40 of the 42G are unallocated and don't need to be searched, and hence not affect CPU usage, but i don't know enough about redis internals to tell.
      • mayhem[m]
        that's less than ideal.
      • lucifer[m]
        that's how most modern allocators work but we can modify the config to add a max-memory limit on how much it can allocate.
      • we can also look into redis-compatible alternatives for caching. i guess that was the plan anyway ever since they changed their licensing.
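A hedged sketch of the config change being described, applied via CONFIG SET with redis-py; the 4gb limit and the eviction policy are placeholders, not the production settings:

```python
import redis

r = redis.Redis(host="lb-redis", port=6379)  # assumed host/port

# Cap how much memory the instance may allocate, and evict instead of growing past the cap.
r.config_set("maxmemory", "4gb")                 # placeholder limit, not the production value
r.config_set("maxmemory-policy", "allkeys-lru")  # evict least-recently-used keys at the cap
```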
      • _BrainzGit
        [listenbrainz-server] amCap1712 merged pull request #3266 (master…cache-donors): Cache donor queries results https://github.com/metabrainz/listenbrainz-serv...
      • lucifer[m]
        i don't think we will face this issue again anytime soon though; the 42G of memory usage was most likely from the redis-streams test to replace rabbitmq for spark last year.
      • mayhem[m]
        great. thanks for digging into this!
      • monkey[m]
        <_BrainzGit> "[listenbrainz-server] amCap1712..." <- I suppose we could cache this query result for more than 10 minutes (maybe 30m?), but not sure how much impact that would have.
      • Jade[m]
        <lucifer[m]> "we can also look into redis-..." <- Iirc they changed it back recently
      • However the forks are still going, so there's a bit of an ecosystem split
      • _BrainzGit
        [listenbrainz-server] MonkeyDo merged pull request #3244 (bootstrap5…ansh/bs5): Upgrade to Bootstrap 5 https://github.com/metabrainz/listenbrainz-serv...
      • dvirtz[m] has quit
      • bitmap[m]
        <lucifer[m]> "mayhem: redis logs are unavailab..." <- I briefly checked the container last night and have some truncated logs from my terminal that I can send (including `MONITOR` output)
      • lucifer[m]
        bitmap: thanks, took a look at it but didn't find anything abnormal.
      • bitmap[m]
        yeah, I didn't either but thanks for taking a look!
      • lucifer[m]
        <monkey[m]> "I suppose we could cache this..." <- if we invalidated the cache when a new donation is received we could cache it longer. but anyway for now the impact is how long LB takes to detect a new donation and then enable flair.
      • monkey[m]
        lucifer[m]: Yeah, I made the same calculation, but figured we get on average one or two donations from LB users a day, so thought maybe we could be more aggressive with the caching if needed.
      • lucifer[m]
        yup makes sense
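A rough sketch of the caching pattern discussed above (short TTL, with invalidation on new donations as the option for a longer TTL), assuming redis-py; the key name, TTL constant and fetch_donors helper are illustrative, not the actual code from PR #3266:

```python
import json
import redis

r = redis.Redis(host="lb-redis", port=6379)  # assumed host/port

DONORS_CACHE_KEY = "donations:recent"  # illustrative key name, not the real LB one
DONORS_CACHE_TTL = 10 * 60             # 10 minutes, per the discussion above


def get_recent_donors(fetch_donors):
    """Return donor data from the cache, running the expensive query only on a miss."""
    cached = r.get(DONORS_CACHE_KEY)
    if cached is not None:
        return json.loads(cached)
    donors = fetch_donors()  # the query that was hitting the meb db
    r.setex(DONORS_CACHE_KEY, DONORS_CACHE_TTL, json.dumps(donors))
    return donors


def invalidate_donors_cache():
    """Drop the cached result (e.g. when a new donation arrives) so a longer TTL stays accurate."""
    r.delete(DONORS_CACHE_KEY)
```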
      • monkey[m]
        Well, that's going to be an interesting merge experience...
      • monkey[m] uploaded an image: (8KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/VycMfUDszNayMaFRnAPoCiiK/image.png >
      • lucifer[m]
        oof
      • monkey[m]
        13 conflicted files. Not too bad actually
      • lucifer[m]
        <Jade[m]> "Iirc they changed it back..." <- oh i see, i wasn't aware of that. nice
      • monkey[m]
        Ohhh, that's new!
      • yvanzo[m] joined the channel
      • yvanzo[m]
        My bad, I clicked on the wrong tab.