#metabrainz

      • BrainzGit
        [musicbrainz-docker] danielunderwood opened pull request #250 (master…sync-upstream): Sync upstream changes https://github.com/metabrainz/musicbrainz-docker/…
      • 2023-05-20 14036, 2023

      • BrainzGit
        [musicbrainz-docker] danielunderwood closed pull request #250 (master…sync-upstream): Sync upstream changes https://github.com/metabrainz/musicbrainz-docker/…
      • 2023-05-20 14004, 2023

      • d4rkie has quit
      • 2023-05-20 14045, 2023

      • Isabelxxx has quit
      • 2023-05-20 14045, 2023

      • d4rkie joined the channel
      • 2023-05-20 14017, 2023

      • Maxr1998_ joined the channel
      • 2023-05-20 14048, 2023

      • Maxr1998 has quit
      • 2023-05-20 14033, 2023

      • mayhem
        aerozol: it's quite an effort to get stats like those. it's a rather tricky sort of proposal.
      • 2023-05-20 14054, 2023

      • mayhem
        I suppose we could get the data for LB, which doesn't have a lot of traffic.
      • 2023-05-20 14017, 2023

      • mayhem
        and now the for real question: do we need it? This is often the domain of evil people trying to make sites more sticky and trick people into doing stuff.
      • 2023-05-20 14037, 2023

      • mayhem
        can we do an exercise where we discuss what data we want to learn? does it really help us to know which pages are more active than others? or do we want to know how many people are actively using the site? what level of granularity do we really need?
      • 2023-05-20 14014, 2023

      • zas
        something weird is happening on MB right now
      • 2023-05-20 14044, 2023

      • zas
        likely due to increased pg activity
      • 2023-05-20 14054, 2023

      • zas
      • 2023-05-20 14025, 2023

      • mayhem
        first complainer that we're not as good as last.fm: https://community.metabrainz.org/t/stats-page-sho…
      • 2023-05-20 14031, 2023

      • mayhem
        thanks for being helpful, buddy.
      • 2023-05-20 14000, 2023

      • zas
        bitmap, yvanzo: there was an increase of 504s, likely due to unresponsive pg backend, between 8:08 and 8:23
      • 2023-05-20 14008, 2023

      • zas
        I didn't find anything obvious in the web logs, I kept them in my home on kiki for inspection
      • 2023-05-20 14051, 2023

      • zas
        there was a load increase on pink, it's the cause of 504s, but not sure what triggered it -> https://stats.metabrainz.org/d/000000048/hetzner-…
      • 2023-05-20 14057, 2023

      • zas
        bitmap: ^^
      • 2023-05-20 14043, 2023

      • v6lur joined the channel
      • 2023-05-20 14053, 2023

      • v6lur has quit
      • 2023-05-20 14035, 2023

      • petitminion joined the channel
      • 2023-05-20 14008, 2023

      • petitminion_ joined the channel
      • 2023-05-20 14011, 2023

      • petitminion has quit
      • 2023-05-20 14008, 2023

      • petitminion_ has quit
      • 2023-05-20 14009, 2023

      • petitminion_ joined the channel
      • 2023-05-20 14038, 2023

      • mayhem
        monkey: tag radio is already a really good background music generator. and that was just a test to see if it was feasible at all!
      • 2023-05-20 14009, 2023

      • monkey
        Yeah, big potential !
      • 2023-05-20 14031, 2023

      • mayhem
        I'm going to play around with adding release-group tags into the mix later.
      • 2023-05-20 14037, 2023

      • mayhem
        and also take tag votes into account.
      • 2023-05-20 14006, 2023

      • petitminion_ has quit
      • 2023-05-20 14042, 2023

      • trolley has quit
      • 2023-05-20 14011, 2023

      • trolley joined the channel
      • 2023-05-20 14051, 2023

      • petitminion_ joined the channel
      • 2023-05-20 14053, 2023

      • mayhem
        zas: blocked IP request in support@
      • 2023-05-20 14006, 2023

      • zas
        mayhem: handled
      • 2023-05-20 14026, 2023

      • mayhem
        thx
      • 2023-05-20 14039, 2023

      • mayhem
        docker on wolf is misbehaving a bit.
      • 2023-05-20 14006, 2023

      • mayhem
        when I restart musicbrainz mirror's postgres I can connect to the DB. but then after some time connections fail until I restart docker.
      • 2023-05-20 14018, 2023

      • mayhem
        zas: is that a known problem? tips for what to do?
      • 2023-05-20 14019, 2023

      • zas
        no idea; weird it failed after a while, let me see
      • 2023-05-20 14044, 2023

      • mayhem
        thx
      • 2023-05-20 14031, 2023

      • zas
        recent docker upgrade broke network stats (with telegraf), I wonder if it is related
      • 2023-05-20 14049, 2023

      • zas
        imho we should completely restart docker to rule that out; by default, it uses live-restore, which doesn't restart containers when docker is upgraded
      • 2023-05-20 14053, 2023

      • mayhem
        stop all containers, restart, then restart containers?
      • 2023-05-20 14059, 2023

      • zas
        that means that when you restart docker, it doesn't actually restart everything, and experience shows it can be necessary. There are 2 services: docker.service and docker.socket
      • 2023-05-20 14022, 2023

      • zas
        restarting docker.socket can help (but in this case it will restart all containers)
      • 2023-05-20 14033, 2023

      • zas
        that said, that's a wild guess at this point
      • 2023-05-20 14042, 2023

      • mayhem
        ok, let me try.
      • 2023-05-20 14041, 2023

      • zas
        btw, since there were also kernel upgrades, rebooting the server, if possible, could be a good thing
      • 2023-05-20 14009, 2023

      • mayhem
        might as well.
      • 2023-05-20 14003, 2023

      • petitminion_ has quit
      • 2023-05-20 14043, 2023

      • petitminion_ joined the channel
      • 2023-05-20 14007, 2023

      • mayhem
      • 2023-05-20 14059, 2023

      • zas
        mayhem: did the reboot help?
      • 2023-05-20 14027, 2023

      • mayhem
        we won't know for a few hours yet. but there was a lot of cruft running, so it feels good to clean up a bit.
      • 2023-05-20 14016, 2023

      • petitminion_ has quit
      • 2023-05-20 14031, 2023

      • petitminion_ joined the channel
      • 2023-05-20 14031, 2023

      • petitminion_ has quit
      • 2023-05-20 14059, 2023

      • zas
        bitmap: not sure why but we have tons of postgres-health-check processes running on aretha
      • 2023-05-20 14010, 2023

      • zas
        I restarted consulagent container on aretha, it was the parent for all those checks
      • 2023-05-20 14040, 2023

      • zas
        There were a lot of warnings like those in consulagent logs https://www.irccloud.com/pastebin/Uj3ZgTYu/
      • 2023-05-20 14039, 2023

      • zas
        after the restart the number of stuck processes starts to grow again
      • 2023-05-20 14023, 2023

      • bitmap
        zas: ok I'll take a look
      • 2023-05-20 14006, 2023

      • mayhem
        zas: docker on wolf is still being weird.
      • 2023-05-20 14041, 2023

      • mayhem
        my app cannot connect to postgres sometimes. when connected to PG directly, all is well and queries execute fast.
      • 2023-05-20 14056, 2023

      • zas
        bitmap: from inside the consulagent container
      • 2023-05-20 14059, 2023

      • zas
      • 2023-05-20 14038, 2023

      • zas
        the check attempts to connect to port 5432, while the specified port is 65401
      • 2023-05-20 14001, 2023

      • zas
        mayhem: which app is it?
      • 2023-05-20 14015, 2023

      • mayhem
        container bono-data-sets.
      • 2023-05-20 14039, 2023

      • mayhem
        working fine right this second, but that can hang at any moment.
      • 2023-05-20 14047, 2023

      • zas
        and it connects to which pg instance?
      • 2023-05-20 14058, 2023

      • zas
        musicbrainz-docker_db_1 ?
      • 2023-05-20 14007, 2023

      • mayhem
        musicbrainz-docker_db_1, yes
      • 2023-05-20 14008, 2023

      • bitmap
        zas: that's weird, also for some reason I can't tail the postgres-aretha logs, they seem to have stopped logging
      • 2023-05-20 14011, 2023

      • zas
        bitmap: docker logs --tail 100 postgres-aretha works for me
      • 2023-05-20 14035, 2023

      • bitmap
        sorry s/tail/follow/
      • 2023-05-20 14039, 2023

      • zas
        ah, but last lines are ... old
      • 2023-05-20 14056, 2023

      • zas
        2023-05-20 08:53:09.906
      • 2023-05-20 14013, 2023

      • bitmap
        yeah
      • 2023-05-20 14023, 2023

      • zas
        so I suspect pg container has some issues, can you restart it?
      • 2023-05-20 14041, 2023

      • zas
        I wonder if all this is related to last docker upgrade... hard to tell
      • 2023-05-20 14042, 2023

      • bitmap
        there is a json dump running right now, let me see how far along it is
      • 2023-05-20 14015, 2023

      • bitmap
        hmm, there are a lot of pg connection errors in the json dump logs for today
      • 2023-05-20 14020, 2023

      • bitmap
        2023-05-20T08:53:43.194396Z 08006 DBI connect('dbname=musicbrainz_json_dump;host=10.2.2.43;port=65401','musicbrainz',...) failed: connection to server at "10.2.2.43", port 65401 failed: Connection refused
      • 2023-05-20 14024, 2023

      • bitmap
        2023-05-20T08:53:43.194406Z Is the server running on that host and accepting TCP/IP connections?
      • 2023-05-20 14035, 2023

      • saum0n has quit
      • 2023-05-20 14009, 2023

      • saum0n joined the channel
      • 2023-05-20 14048, 2023

      • zas
      • 2023-05-20 14058, 2023

      • zas
        it broke when docker was upgraded
      • 2023-05-20 14033, 2023

      • bitmap
        the first error occurred at 2023-05-20T08:53:27.445201Z
      • 2023-05-20 14040, 2023

      • bitmap
        "DBD::Pg::st execute failed: server closed the connection unexpectedly"
      • 2023-05-20 14015, 2023

      • bitmap
        ok I guess that about matches the time when docker was upgraded then
      • 2023-05-20 14002, 2023

      • bitmap
        zas: I stopped the json dumps, if you need to restart docker
      • 2023-05-20 14004, 2023

      • zas
        yes, we need to try to see if it helps, if not, I guess we'll have to downgrade docker
      • 2023-05-20 14048, 2023

      • zas
        I'll proceed
      • 2023-05-20 14029, 2023

      • zas
        done
      • 2023-05-20 14007, 2023

      • zas
        bitmap: can you check pg on aretha, container reported consul-template issues, but I think that's because of consulagent restarting
      • 2023-05-20 14048, 2023

      • lucifer
        mayhem: issue fixed?
      • 2023-05-20 14013, 2023

      • mayhem
        the wolf issue? working right now, but not trusting it
      • 2023-05-20 14031, 2023

      • lucifer
        the recording from tag query is invalid fwiw
      • 2023-05-20 14038, 2023

      • bitmap
        zas: I can connect to pg, but postgres-health-check still fails from inside the consulagent container
      • 2023-05-20 14045, 2023

      • lucifer
      • 2023-05-20 14058, 2023

      • lucifer
        instead of count there should be rg.name in GROUP BY
      • 2023-05-20 14015, 2023

      • mayhem
        yes, thanks. fixed that a while ago.
      • 2023-05-20 14019, 2023

      • lucifer
        due to some invalid error handling, this shows up as 502/499 instead
      • 2023-05-20 14052, 2023

      • mayhem
        these queries were done by hand, so they are not contributing to the problem at hand.
      • 2023-05-20 14000, 2023

      • lucifer
        oh okay
      • 2023-05-20 14007, 2023

      • zas
        bitmap: the check doesn't use the correct port, so I guess it is expected that it fails, but why didn't we notice before and why doesn't it use the specified port?
      • 2023-05-20 14014, 2023

      • lucifer
        regarding doing dumps on gaga, that should be possible yes.
      • 2023-05-20 14032, 2023

      • lucifer
        zas, atj: will prepare a rough list of lb disk usage.
      • 2023-05-20 14033, 2023

      • mayhem
        and now I realize I need to convert RG -> tracks, which needs the canonical data. I think I'm done for the day. lol.
      • 2023-05-20 14052, 2023

      • mayhem
        lucifer: ok, good to know. let's keep discussing what to do.
      • 2023-05-20 14012, 2023

      • zas
        bitmap: we don't have any stuck check process anymore
      • 2023-05-20 14051, 2023

      • zas
        this means we'd better restart docker everywhere....
      • 2023-05-20 14024, 2023

      • bitmap
        zas: I don't understand why it connects to 5432 if port=65401 is passed
      • 2023-05-20 14057, 2023

      • zas
        nor I, and it doesn't anymore...
      • 2023-05-20 14047, 2023

      • bitmap
        if I run `docker exec consulagent postgres-health-check host=10.2.2.43 port=65401 user=postgres dbname=template1 sslmode=disable` on aretha it does
      • 2023-05-20 14042, 2023

      • zas
        and caa redirect tx alerts broke again...
      • 2023-05-20 14009, 2023

      • zas
      • 2023-05-20 14024, 2023

      • bitmap
        maybe related to consulagent using host network and postgres-aretha using bridge?
      • 2023-05-20 14027, 2023

      • zas
        something definitely goes wrong in docker / network
      • 2023-05-20 14055, 2023

      • zas
        ok, very weird, yesterday I restarted caa-redirect-prod container on serge, zappa & hip, because telegraf wasn't able to collect network stats from this container, after restart, it worked. But today it broke again because of another docker upgrade (8:53)
      • 2023-05-20 14031, 2023

      • atj
        docker package upgrade?
      • 2023-05-20 14022, 2023

      • q3lont joined the channel
      • 2023-05-20 14021, 2023

      • zas
        atj: yes
      • 2023-05-20 14008, 2023

      • zas
        I think they changed something related to network, and it requires containers to restart, though we had 2 upgrades (yesterday & today), and both broke telegraf docker plugin (which doesn't report net stats after docker upgrade). Restarting containers fixes it.
      • 2023-05-20 14022, 2023

      • zas
        so that's also related to live restore option
      • 2023-05-20 14059, 2023

      • zas
        I saw nothing obvious in docker changelog though
      • 2023-05-20 14017, 2023

      • atj
        I'm becoming a bit wary of live restore
      • 2023-05-20 14001, 2023

      • zas
        https://stats.metabrainz.org/d/000000051/hetzner-… shows a lot of containers have no net stats, though they have memory/cpu/disk
      • 2023-05-20 14023, 2023

      • zas
        (some containers show nothing at all, because they don't exist anymore, and we cannot easily hide them)
      • 2023-05-20 14002, 2023

      • atj
        Did you update docker for any specific reason or just general maintenance?
      • 2023-05-20 14001, 2023

      • petitminion joined the channel
      • 2023-05-20 14039, 2023

      • zas
        just usual upgrades
      • 2023-05-20 14050, 2023

      • atj
        We've had live restore break things weirdly a couple of times now I think
      • 2023-05-20 14031, 2023

      • zas
        yup, I agree, but we can't restart all containers either
      • 2023-05-20 14011, 2023

      • zas
        if we drop live-restore, we'll need to fix the docker version
      • 2023-05-20 14039, 2023

      • zas
        though, the underlying issue is more SPoF containers
      • 2023-05-20 14001, 2023

      • atj
        yes, I agree
      • 2023-05-20 14027, 2023

      • atj
        I'm not sure live restore should be relied on for Docker version upgrades
      • 2023-05-20 14047, 2023

      • atj
        Configuration changes perhaps
      • 2023-05-20 14037, 2023

      • atj
        Should I add apt hold support to the docker role?
      • 2023-05-20 14046, 2023

      • zas
        I guess so
      • 2023-05-20 14000, 2023

      • zas
        but there are multiple packages involved
      • 2023-05-20 14058, 2023

      • zas
        docker-ce docker-ce-cli
      • 2023-05-20 14019, 2023

      • zas
        and likely containerd.io
      • 2023-05-20 14042, 2023
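
The package hold discussed above could be sketched roughly as follows. This is only an illustration, not the Ansible docker role atj mentions: the helper name `apt_mark_command` is hypothetical, the package list is taken from zas's messages (with containerd.io only "likely" involved), and the sketch builds the `apt-mark` command line without running it.

```python
import shlex

# Package names as listed in the discussion above; containerd.io was
# described only as "likely" involved, so treat the list as an assumption.
DOCKER_PACKAGES = ["docker-ce", "docker-ce-cli", "containerd.io"]


def apt_mark_command(action, packages):
    """Build an `apt-mark hold` or `apt-mark unhold` argument list.

    Hypothetical helper: it only constructs the command, it does not run it.
    """
    if action not in ("hold", "unhold"):
        raise ValueError(f"unsupported apt-mark action: {action!r}")
    return ["apt-mark", action, *packages]


if __name__ == "__main__":
    # Print the command an admin might run (as root) to pin the packages.
    print(shlex.join(apt_mark_command("hold", DOCKER_PACKAGES)))
```

Holding all the related packages together would keep a routine upgrade from replacing only part of the Docker stack; releasing them later would use the same list with `unhold`.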

      • zas