#metabrainz

      • ^arcade_droid joined the channel
      • 2021-02-04 03538, 2021

      • arcade_droid has quit
      • 2021-02-04 03511, 2021

      • d4rkie has quit
      • 2021-02-04 03548, 2021

      • Nyanko-sensei joined the channel
      • 2021-02-04 03507, 2021

      • sumedh joined the channel
      • 2021-02-04 03505, 2021

      • sumedh has quit
      • 2021-02-04 03533, 2021

      • yvanzo has quit
      • 2021-02-04 03549, 2021

      • yvanzo joined the channel
      • 2021-02-04 03501, 2021

      • sumedh joined the channel
      • 2021-02-04 03525, 2021

      • sampsyo has quit
      • 2021-02-04 03558, 2021

      • sampsyo joined the channel
      • 2021-02-04 03512, 2021

      • sumedh has quit
      • 2021-02-04 03526, 2021

      • yvanzo
        Blog post updated.
      • 2021-02-04 03535, 2021

      • zas
        moin yvanzo
      • 2021-02-04 03555, 2021

      • zas
        bitmap couldn't restart mbws container on pink
      • 2021-02-04 03549, 2021

      • zas
        and I know why: port it uses was taken by a postgres connection : tcp 0 0 172.17.0.1:65012 172.17.0.4:6899 TIME_WAIT - timewait (22.00/0/0)
      • 2021-02-04 03502, 2021

      • zas
        perhaps each container should use its own bridge, rather than the default one
      • 2021-02-04 03519, 2021

      • zas
        and sir-prod issues are perhaps related
      • 2021-02-04 03553, 2021

      • Rohan_Pillai joined the channel
      • 2021-02-04 03523, 2021

      • zas
        can we stop/start postgres on pink?
      • 2021-02-04 03512, 2021

      • yvanzo
        good catch
      • 2021-02-04 03527, 2021

      • yvanzo
        let me check about pg/pink
      • 2021-02-04 03530, 2021

      • yvanzo
        we should probably wait for cron jobs to end
      • 2021-02-04 03527, 2021

      • reosarevok
        Ok, I was going to tunnel into pink soon but I'll just wait :)
      • 2021-02-04 03538, 2021

      • yvanzo
        zas: hourly cron job is done, I stopped sir-prod too, should be ok to stop/start pink now
      • 2021-02-04 03527, 2021

      • yvanzo
        I also prevented sitemaps and json-dump from running at :30
      • 2021-02-04 03557, 2021

      • zas
        ok I just stopped pgbouncer, then I could restart mbws and sir-prod
      • 2021-02-04 03515, 2021

      • zas
        so that was it; now I'm not sure what a proper long-term fix is
      • 2021-02-04 03531, 2021

      • zas
        I think the issue is related to local ip port range
      • 2021-02-04 03554, 2021

      • zas
      • 2021-02-04 03549, 2021

      • zas
        but it overlaps network port range on the server (1024-65535), so host mode containers can use anything, that's perhaps the cause, and it was changed recently when I fixed sysctl / ufw issue on servers
      • 2021-02-04 03506, 2021

      • zas
        I'll change that and we'll see
      • 2021-02-04 03535, 2021

      • zas
        I changed pink's local port range to be 21000 54000, so it doesn't conflict with ports defined in docker server scripts constants
      • 2021-02-04 03510, 2021

      • zas
        it also means we should group ports used by services, we have a bunch around 13k, 20k and over 55k
      • 2021-02-04 03558, 2021
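
A minimal Python sketch of the check zas describes: which fixed service ports fall inside the kernel's local (ephemeral) port range, which on Linux is set by the `net.ipv4.ip_local_port_range` sysctl. Port 65012 comes from the TIME_WAIT line above and is assumed to be the port mbws listens on; the other service names and ports are hypothetical placeholders for the groups "around 13k, 20k and over 55k", and the old range 1024-65535 is assumed from zas's remark.

```python
def conflicting_ports(service_ports, low, high):
    """Return the services whose fixed port lies inside the local port range.

    A port inside [low, high] can be picked by the kernel as the source port
    of an outgoing connection, which then blocks the service from binding it.
    """
    return {name: port for name, port in service_ports.items() if low <= port <= high}


# 65012 is taken from the log; the other entries are hypothetical examples.
SERVICE_PORTS = {
    "mbws": 65012,
    "example-13k": 13000,
    "example-20k": 20000,
    "example-55k": 55000,
}

old_conflicts = conflicting_ports(SERVICE_PORTS, 1024, 65535)   # assumed old range
new_conflicts = conflicting_ports(SERVICE_PORTS, 21000, 54000)  # range set on pink

print("old range conflicts:", sorted(old_conflicts))
print("new range conflicts:", sorted(new_conflicts))
```

With these example values every service port sits inside the old range and none sits inside the new one, which is why grouping service ports outside 21000-54000 would keep them safe.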

      • zas
        yvanzo: sir-prod does not work, but that's another issue (prolly related to long queue), please have a look asap
      • 2021-02-04 03505, 2021

      • yvanzo
        ok
      • 2021-02-04 03530, 2021

      • yvanzo
        I stopped it again
      • 2021-02-04 03551, 2021

      • yusuf56 joined the channel
      • 2021-02-04 03543, 2021

      • Rohan_Pillai has quit
      • 2021-02-04 03518, 2021

      • yvanzo
        zas: sir-prod has a RuntimeError since 9:16:52 this morning
      • 2021-02-04 03537, 2021

      • yvanzo
        (8:16:52 UTC)
      • 2021-02-04 03544, 2021

      • yvanzo
        It seems it cannot connect to RabbitMQ
      • 2021-02-04 03552, 2021

      • yvanzo
        I deleted and recreated sir-prod container, but I still have the logs of the previous container for the last 24 hours.
      • 2021-02-04 03554, 2021

      • zas
        does it run now?
      • 2021-02-04 03509, 2021

      • yvanzo
        no
      • 2021-02-04 03520, 2021

      • yvanzo
        I stopped it
      • 2021-02-04 03536, 2021

      • zas
        what's the error preventing it to run?
      • 2021-02-04 03542, 2021

      • yvanzo
        yes
      • 2021-02-04 03551, 2021

      • zas
        ?
      • 2021-02-04 03556, 2021

      • reosarevok
        What is, not was, yvanzo
      • 2021-02-04 03557, 2021

      • reosarevok
        :D
      • 2021-02-04 03557, 2021

      • yvanzo
        oops
      • 2021-02-04 03502, 2021

      • yvanzo
        thanks :)
      • 2021-02-04 03512, 2021

      • zas
        morning reosarevok ;)
      • 2021-02-04 03527, 2021

      • reosarevok
        Morning :) If I can help, let me know
      • 2021-02-04 03537, 2021

      • yvanzo
      • 2021-02-04 03519, 2021

      • yvanzo
        the RuntimeError itself is related to logging to sentry
      • 2021-02-04 03512, 2021

      • zas
        RuntimeError: maximum recursion depth exceeded in cmp <--- ?
      • 2021-02-04 03526, 2021

      • yvanzo
        this is related to logging with raven
      • 2021-02-04 03558, 2021

      • yvanzo
        at least that's my guess, and I found it is a common error with raven-python
      • 2021-02-04 03554, 2021

      • yvanzo
        the source error seems to be when trying to reach rmq
      • 2021-02-04 03549, 2021

      • zas
        to me it seems it can connect but cannot cope with all the data. This error log is very busy, hard to tell what the problem is
      • 2021-02-04 03532, 2021

      • yvanzo
        zas: it matches the time you restarted sir-prod.
      • 2021-02-04 03555, 2021

      • yvanzo
        it errored all night without anything related to amqp
      • 2021-02-04 03541, 2021

      • yvanzo
        so something has changed that likely makes sir unable to reach rabbitmq
      • 2021-02-04 03500, 2021

      • zas
        are you sure it cannot connect to rabbitmq?
      • 2021-02-04 03526, 2021

      • yvanzo
        I will try from the container
      • 2021-02-04 03511, 2021

      • zas
        I think you are mixing 2 issues here: one (this night) was due to port issue, and was a connection issue, but now it should be solved, and the issue is too much data accumulated and sir is unable to cope with that. But I don't know enough about this stuff to be sure
      • 2021-02-04 03547, 2021

      • zas
        there are 115k items in queue atm and it keeps growing, meaning rabbitmq is working
      • 2021-02-04 03547, 2021

      • zas
        rabbitmq log is rather useless in this matter, it doesn't show client IPs
      • 2021-02-04 03555, 2021

      • zas
      • 2021-02-04 03511, 2021

      • zas
        there are multiple sir connections apparently
      • 2021-02-04 03527, 2021

      • yvanzo
        zas: connection issue was with pg
      • 2021-02-04 03554, 2021

      • yvanzo
        rabbitmq is growing and I can reach it from host at least
      • 2021-02-04 03544, 2021

      • CatQuest
        morn!
      • 2021-02-04 03558, 2021

      • CatQuest
        I see the rabbit(mq) is at it again, growing..
      • 2021-02-04 03502, 2021

      • CatQuest
        :D
      • 2021-02-04 03544, 2021

      • reosarevok
        yvanzo, zas: do you expect to need to restart pink again soon?
      • 2021-02-04 03554, 2021

      • reosarevok
        Or should I start running my bot? :D
      • 2021-02-04 03552, 2021

      • ruaok
        things better now??
      • 2021-02-04 03549, 2021

      • yvanzo
        no
      • 2021-02-04 03508, 2021

      • ruaok
        oh boo. :(
      • 2021-02-04 03516, 2021

      • yvanzo
        reosarevok: it's okay to run your bot
      • 2021-02-04 03531, 2021

      • ruaok
        let me know if you want another set of eyes to look. but I suspect that you have enough already.
      • 2021-02-04 03501, 2021

      • Gazooo7949440 has quit
      • 2021-02-04 03534, 2021

      • yvanzo
        I don't see why 100k queued msg would be an issue.
      • 2021-02-04 03543, 2021

      • Gazooo7949440 joined the channel
      • 2021-02-04 03527, 2021

      • ruaok updates the gsoc ideas redirect in the usual decade update cycle
      • 2021-02-04 03555, 2021

      • ruaok
        100k queued essages and trille freaks out?
      • 2021-02-04 03558, 2021

      • ruaok
        +m
      • 2021-02-04 03526, 2021

      • yvanzo
        I’m scrutinizing sir code to see what could make it throw an error first.
      • 2021-02-04 03540, 2021

      • yvanzo
        trille is probably fine, the problem might be with sir code or its amqp requirement.
      • 2021-02-04 03545, 2021

      • ruaok
        I wonder if there is something... that isn't being cleaned up. and after X years, we've accumulated enough cruft that things start slowing down. Like "duh, you didn't vacuum your PG database, no wonder it's slow"
      • 2021-02-04 03507, 2021

      • yvanzo
        it's not related to PG, that is for sure.
      • 2021-02-04 03540, 2021

      • yvanzo
        sir only retrieves 100 msg at a time, I don't see why the queue length would be a problem.
      • 2021-02-04 03559, 2021

      • ruaok
        and why is it *now* a problem and not before?
      • 2021-02-04 03544, 2021

      • yvanzo
        I hoped it could be zas's fault for fixing network port setup ;)
      • 2021-02-04 03503, 2021

      • Mineo
        I had a quick look at the stacktrace: what seems to be happening is that sir is in the process of acknowledging a message during the initial connection attempt (timestamps 2021-02-04T09:07:58.656585230Z and 2021-02-04T09:07:58.656641473Z) and that includes the following line: https://github.com/metabrainz/sir/blob/a586387c24… which basically says
      • 2021-02-04 03509, 2021

      • Mineo
        "hey, if we've lost the connection to rabbitmq while processing this message, please reconnect, so there's someone we can send the ACK to". that usually includes the following lines to skip that when the connection is already set up: https://github.com/metabrainz/sir/blob/a586387c24… but that only works if
      • 2021-02-04 03514, 2021

      • Mineo
        https://github.com/metabrainz/sir/blob/a586387c24… were already called to set the "yes, there's an existing connection" flag. however, that still begs the very good question "and why is it *now* a problem and not before?" :(
      • 2021-02-04 03542, 2021
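
The guard Mineo describes can be sketched roughly as below. This is not sir's code: the class, method and attribute names are hypothetical, and the connection itself is replaced by a counter so the sketch runs on its own. It only shows the pattern of reconnecting before an ACK unless an "existing connection" flag has already been set.

```python
class SketchWorker:
    """Hypothetical worker showing a reconnect-before-ACK guard."""

    def __init__(self):
        self.connected = False   # the "yes, there's an existing connection" flag
        self.connect_calls = 0   # stands in for real connection attempts

    def connect(self):
        # A real worker would open the AMQP connection here.
        self.connect_calls += 1
        self.connected = True

    def ensure_connection(self):
        # Skip reconnecting when the flag says a connection already exists.
        if self.connected:
            return
        self.connect()

    def ack(self, message):
        # Make sure there is someone to send the ACK to, then acknowledge.
        self.ensure_connection()
        return ("ack", message)


worker = SketchWorker()
worker.ack("first")    # flag not set yet, so this triggers a connection attempt
worker.ack("second")   # flag is set, so the reconnect is skipped
print(worker.connect_calls)
```

The point of the sketch is that the skip only works once the flag has been set; an ACK that arrives before that moment always goes down the reconnect path.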

      • yvanzo
        in rabbitmq, there are a lot of "missed heartbeats from client, timeout: 60s"
      • 2021-02-04 03526, 2021

      • Etua joined the channel
      • 2021-02-04 03550, 2021

      • yvanzo
        139 errors in 10min, it's likely related to sir (I don't think CAA, CB or LB could produce that many errors at once)
      • 2021-02-04 03551, 2021

      • ruaok
        LB gets those too. and then a connection is reestablished.
      • 2021-02-04 03521, 2021

      • ruaok
        agreed, LB does not generate that many errors
      • 2021-02-04 03554, 2021

      • zas
        yvanzo: are you sure sir retrieves only 100 messages at start? because usually when we restart it, it first gets all messages, then goes 100 by 100. But that's according to the logs.
      • 2021-02-04 03504, 2021

      • yvanzo
        (because SIR has 12 import threads, it is probably the most active)
      • 2021-02-04 03514, 2021

      • yvanzo
        zas: ok, I will check that
      • 2021-02-04 03506, 2021

      • Etua has quit
      • 2021-02-04 03535, 2021

      • BrainzGit
        [musicbrainz-server] reosarevok opened pull request #1893 (master…MBS-11365): MBS-11365: Allow Resident Advisor /podcast URLs for releases https://github.com/metabrainz/musicbrainz-server/…
      • 2021-02-04 03518, 2021

      • yvanzo
        zas: still debugging sir, at least it's definitely a problem specific to sir, not a network issue.
      • 2021-02-04 03530, 2021

      • yvanzo
        (in debug log, sir workers are processing the first message they grab from rabbitmq)
      • 2021-02-04 03518, 2021

      • alastairp
        for people who hate cron, here's a story that goes down the rabbit hole: https://twitter.com/jpaulreed/status/135716040680…
      • 2021-02-04 03542, 2021

      • _lucifer
        alastairp: since the consul upgrade is postponed, would you like to discuss cache fixes sooner?
      • 2021-02-04 03552, 2021

      • alastairp
        hi _lucifer
      • 2021-02-04 03502, 2021

      • _lucifer
        hi! :D
      • 2021-02-04 03508, 2021

      • alastairp
        yes, fine. remind me what we have to talk about?
      • 2021-02-04 03523, 2021

      • alastairp
        improvements to the cache library, or improvements to how we do caches in CB?
      • 2021-02-04 03524, 2021

      • _lucifer
        improving brainzutils cache module
      • 2021-02-04 03512, 2021

      • alastairp
        great, so first let's do BU-4, that should be really straight-forward
      • 2021-02-04 03513, 2021

      • BrainzBot
        BU-4: Don't hash cache keys before inserting them https://tickets.metabrainz.org/browse/BU-4
      • 2021-02-04 03525, 2021

      • alastairp
        next, I guess we need to finish any CB improvements that require upgrading BU, right?
      • 2021-02-04 03535, 2021

      • _lucifer
        yes right
      • 2021-02-04 03501, 2021

      • _lucifer
        Can you test and release the open ended versions fix
      • 2021-02-04 03515, 2021

      • _lucifer
        without that we cannot upgrade BU in CB
      • 2021-02-04 03520, 2021

      • alastairp
        what PR?
      • 2021-02-04 03523, 2021

      • alastairp
        I'll add it to my list
      • 2021-02-04 03532, 2021

      • alastairp
        maybe tomorrow afternoon or Monday
      • 2021-02-04 03547, 2021

      • _lucifer
      • 2021-02-04 03512, 2021

      • alastairp
        ah yes, that one
      • 2021-02-04 03516, 2021

      • alastairp
        that's already on my list
      • 2021-02-04 03525, 2021

      • alastairp
        could you help with the testing of it?
      • 2021-02-04 03528, 2021

      • sumedh joined the channel
      • 2021-02-04 03531, 2021

      • _lucifer
        sure
      • 2021-02-04 03559, 2021

      • _lucifer
        just use the git version in CB and see if it works right?
      • 2021-02-04 03512, 2021

      • alastairp
        we need to: 1) update BU in some downstream dependency (e.g. LB) to this branch, 2) make sure we pip install --upgrade pip to the latest version, 3) try and install dependencies
      • 2021-02-04 03524, 2021

      • alastairp
        if that works in all brainzes, then it's ready to release
      • 2021-02-04 03541, 2021

      • _lucifer
        makes sense, i'll do that
      • 2021-02-04 03546, 2021

      • alastairp
        thanks!
      • 2021-02-04 03523, 2021

      • _lucifer
        moving back to cache then, what other than BU-4 do you have in mind
      • 2021-02-04 03524, 2021

      • BrainzBot
        BU-4: Don't hash cache keys before inserting them https://tickets.metabrainz.org/browse/BU-4
      • 2021-02-04 03504, 2021

      • alastairp
        the other ones that I remember opening are BU-28, BU-29, BU-25
      • 2021-02-04 03506, 2021

      • BrainzBot
        BU-28: Caching a timezone aware datetime loses the timezone https://tickets.metabrainz.org/browse/BU-28
      • 2021-02-04 03506, 2021

      • BrainzBot
        BU-29: Use cache namespaces in ratelimit module https://tickets.metabrainz.org/browse/BU-29
      • 2021-02-04 03506, 2021

      • BrainzBot
        BU-25: Cache namespace versions don't work in docker or with distributed hosts https://tickets.metabrainz.org/browse/BU-25
      • 2021-02-04 03511, 2021
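
To illustrate the kind of problem BU-28 names, here is a small Python sketch of one way a timezone can get lost in a serialise/deserialise round trip, and one way to keep it. How brainzutils actually serialises cached values is not shown in this log, so the helper functions below are hypothetical and only use the standard library.

```python
from datetime import datetime, timezone


def dump_naive(dt):
    """Lossy: the format string has no UTC offset, so tzinfo is dropped."""
    return dt.strftime("%Y-%m-%dT%H:%M:%S")


def load_naive(text):
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S")


def dump_aware(dt):
    """isoformat() includes the UTC offset for timezone-aware datetimes."""
    return dt.isoformat()


def load_aware(text):
    return datetime.fromisoformat(text)


original = datetime(2021, 2, 4, 9, 7, 58, tzinfo=timezone.utc)

lossy = load_naive(dump_naive(original))
kept = load_aware(dump_aware(original))

print(lossy.tzinfo)       # the naive round trip has no timezone left
print(kept == original)   # the ISO 8601 round trip compares equal
```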

      • alastairp
        other than that, we want to talk about if we should enforce timeouts for all items in the cache
      • 2021-02-04 03548, 2021

      • _lucifer
        i think that's a good idea. we can at least add a sensible default.
      • 2021-02-04 03549, 2021
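
A rough sketch of what "a sensible default" timeout could look like, written as a tiny in-memory cache rather than against the real brainzutils cache module. The class name, the one-hour default and the injectable clock are all assumptions made for illustration.

```python
import time

DEFAULT_TIMEOUT = 3600  # seconds; an assumed default, not a value from the discussion


class TinyCache:
    """Hypothetical in-memory cache where every item gets an expiry time."""

    def __init__(self, default_timeout=DEFAULT_TIMEOUT, clock=time.monotonic):
        self._store = {}
        self._default_timeout = default_timeout
        self._clock = clock

    def set(self, key, value, timeout=None):
        # Fall back to the default so no item is stored without an expiry.
        if timeout is None:
            timeout = self._default_timeout
        self._store[key] = (value, self._clock() + timeout)

    def get(self, key, default=None):
        item = self._store.get(key)
        if item is None:
            return default
        value, expires_at = item
        if self._clock() >= expires_at:
            # Expired: drop the item and behave as a cache miss.
            del self._store[key]
            return default
        return value


cache = TinyCache()
cache.set("greeting", "moin")              # uses the default timeout
cache.set("short-lived", "x", timeout=5)   # explicit timeout still allowed
print(cache.get("greeting"))
```

Callers can still pass an explicit timeout; the default only applies when they do not, which keeps existing call sites working while making sure nothing stays cached forever.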

      • alastairp
        for 25, only CB uses this functionality. I'm unsure if I like the concept, and I'm wondering if we should remove it completely
      • 2021-02-04 03512, 2021

      • alastairp
        can you open a ticket for timeouts, and focus on that one, -4, and -29?
      • 2021-02-04 03532, 2021

      • _lucifer
        👍