#musicbrainz-devel

/

      • Nyanko-sensei joined the channel
      • 2015-04-23 11317, 2015

      • LordSputnik has left the channel
      • 2015-04-23 11342, 2015

      • CallerNo6
        fwiw, somebody in #musicbrainz is asking about 502s
      • 2015-04-23 11308, 2015

      • ruaok joined the channel
      • 2015-04-23 11340, 2015

      • Nyanko-sensei
        I get a lot of them too. some servers dead?
      • 2015-04-23 11300, 2015

      • ruaok
        database server is freaking out again.
      • 2015-04-23 11305, 2015

      • ruaok
        trying to do some tweaking.
      • 2015-04-23 11316, 2015

      • Nyanko-sensei
        ok
      • 2015-04-23 11312, 2015

      • ruaok
        UNION ALL SELECT gid FROM artist UNION ALL SELECT gid FROM label UNION ALL SELECT gid FROM release_group_gid_redirect UNION ALL SELECT gid FROM recording_gid_redirect UNION ALL SELECT gid FROM area_gid_redirect UNION ALL SELECT gid FROM place_gid_redirect UNION ALL SELECT gid FROM series_gid_redirect UNION ALL SELECT gid FROM url_gid_redirect UNION ALL SELECT gid FROM track_gid_redirect UNION ALL SELECT gid FROM
      • 2015-04-23 11313, 2015

      • ruaok
        instrument_gid_redirect UNION ALL SELECT gid FROM release_gid_redirect UNION ALL SELECT gid FROM work_gid_redirect UNION ALL SELECT gid FROM event_gid_redirect UNION ALL SELECT gid
      • 2015-04-23 11317, 2015

      • ruaok
        O_O
      • 2015-04-23 11337, 2015

      • ruaok
        who thought *that* was a good idea?
      • 2015-04-23 11328, 2015

      • Gentlecat joined the channel
      • 2015-04-23 11316, 2015

      • derwin
        well *there's* your problem
      • 2015-04-23 11337, 2015

      • derwin
        I would not do that query in that way at ALL.
      • 2015-04-23 11341, 2015

      • ruaok
        yeah, me neither.
      • 2015-04-23 11323, 2015

      • ruaok
        I'm really hoping this is some sort of automatically generated query and this was an overlooked edge case
      • 2015-04-23 11357, 2015

      • ruaok
        ok, should be doing better now.
      • 2015-04-23 11358, 2015

      • ruaok
        nn
      • 2015-04-23 11345, 2015

      • JesseW joined the channel
      • 2015-04-23 11315, 2015

      • luks
        derwin: out of curiosity, how would you do the query?
      • 2015-04-23 11333, 2015

      • derwin
        it depends on what facilities the engine provides for counting rows
      • 2015-04-23 11304, 2015

      • derwin
        but I would be highly unlikely to count all tables in a query like that on an in-service host
      • 2015-04-23 11314, 2015

      • derwin
        the actual structure of it is fine
      • 2015-04-23 11323, 2015

      • luks
        it's a statistic query that runs once a day
      • 2015-04-23 11344, 2015

      • derwin
        yeah :/
      • 2015-04-23 11348, 2015

      • luks
        you could sum the tables individually and then sum the sums, but I don't think that would make a difference
      • 2015-04-23 11355, 2015

      • derwin
        no, it won't in most cases
      • 2015-04-23 11357, 2015

      • luks
        other than that, I don't see an alternative
      • 2015-04-23 11309, 2015

      • derwin
        oh, wait
      • 2015-04-23 11316, 2015

      • derwin
        I would count individually within the tables
      • 2015-04-23 11317, 2015

      • luks
        the solution is to run statistics on a read-only slave
      • 2015-04-23 11321, 2015

      • derwin
        to avoid the intermediate result set
      • 2015-04-23 11323, 2015

      • derwin
        which would be huge
      • 2015-04-23 11327, 2015

      • derwin
        and yes, +1
      • 2015-04-23 11341, 2015

      • luks
        postgresql would have to be quite native to collect the IDs
      • 2015-04-23 11346, 2015

      • derwin
        I would sum counts there
      • 2015-04-23 11353, 2015

      • luks
        it would have to with union, but not with union all
      • 2015-04-23 11359, 2015

      • derwin
        not count result-set rows
      • 2015-04-23 11312, 2015

      • derwin
        sure, I would examine handlers in mysql land
      • 2015-04-23 11314, 2015

      • luks
        I don't think there are really any
      • 2015-04-23 11328, 2015

      • luks
        it's just a huge +1 loop
      • 2015-04-23 11337, 2015

      • luks
        at least I assume so
      • 2015-04-23 11353, 2015

      • luks
        er, I meant naive, not native :)
      • 2015-04-23 11319, 2015

      • derwin
        one can also do uh, rolling summary tables if you have time modification stamps on things
      • 2015-04-23 11349, 2015

      • luks
        the individual sums indeed are faster, I expected postgresql to run it differently
      • 2015-04-23 11303, 2015

      • luks
      • 2015-04-23 11332, 2015

      • derwin
        intermediate result set, if the engine doesn't recognize
      • 2015-04-23 11340, 2015

      • derwin
        I agree that it conceptually "could" ?
      • 2015-04-23 11333, 2015

      • JesseW joined the channel
      • 2015-04-23 11333, 2015

      • diana_olhovik_ joined the channel
      • 2015-04-23 11323, 2015

      • ianmcorvidae
        for the curious, that query's for statistics, it's run once a day
      • 2015-04-23 11348, 2015

      • ianmcorvidae
        counting the total number of MBIDs. could, however, easily be changed to the other method
      • 2015-04-23 11335, 2015

      • ianmcorvidae
        that's been in for many months now, note, so I'm surprised if it's causing problems now, but anyway
      • 2015-04-23 11349, 2015

      • kepstin-laptop joined the channel
      • 2015-04-23 11300, 2015

      • sv2241
        how can I fix this error? http://pastebin.com/wRFK5mQU
      • 2015-04-23 11321, 2015

      • sv2241
        vm [at musicbrainzvm]:~$ bin/replicate now
      • 2015-04-23 11325, 2015

      • sv2241
        perl: symbol lookup error: /usr/lib/libperl.so.5.14: undefined symbol: Perl_pp_reAdline
      • 2015-04-23 11334, 2015

      • sv2241
        this is happening on the musicbrainz-server-vmware-2014-05-14.ova vmware image
      • 2015-04-23 11352, 2015

      • 7GHAAA0QH joined the channel
      • 2015-04-23 11330, 2015

      • Gentlecat joined the channel
      • 2015-04-23 11308, 2015

      • Junior_ joined the channel
      • 2015-04-23 11337, 2015

      • d4rkie joined the channel
      • 2015-04-23 11336, 2015

      • ianmcorvidae
      • 2015-04-23 11342, 2015

      • ianmcorvidae
        guessing that query wasn't the problem.
      • 2015-04-23 11332, 2015

      • Freso joined the channel
      • 2015-04-23 11347, 2015

      • kepstin-laptop joined the channel
      • 2015-04-23 11305, 2015

      • sv2241
        the above issues occured after deploying the latest musicbrainz OVA (musicbrainz-server-vmware-2014-05-14.ova) into vmware vsphere. before triggering "/bin/replicate" now I did a sudo apt-get update && sudo apt-get upgrade. is this the reason that perl broke? what should I do? redeploy the ova and this time not update the system?!
      • 2015-04-23 11341, 2015

      • ianmcorvidae
        those errors do seem consistent with some sort of perl upgrade -- I'm not sure how the perl dependencies for musicbrainz-server are installed in the VM because I haven't looked in a while, but perl version upgrades do tend to require reinstalling XS (i.e. C-language, rather than pure perl) dependencies
      • 2015-04-23 11340, 2015

      • sv2241
        how can I fix these issues?
      • 2015-04-23 11332, 2015

      • ohrstrom joined the channel
      • 2015-04-23 11306, 2015

      • Freso
        ianmcorvidae: Yo. Is there some flag that needs to be set on AREQ in Jira for it to send e-mails for followed tickets?
      • 2015-04-23 11330, 2015

      • ianmcorvidae
        Freso: no idea, but presumably
      • 2015-04-23 11351, 2015

      • ianmcorvidae
        lemme poke around a bit, or if not nikki would be the one who might know
      • 2015-04-23 11335, 2015

      • ianmcorvidae found that it had the Notification Scheme set to "None", which seems like it might be related
      • 2015-04-23 11301, 2015

      • ianmcorvidae
        I've changed it to "Default Notification Scheme", which matches most projects (e.g. picard, geordi, etc.)
      • 2015-04-23 11326, 2015

      • UmkaDK joined the channel
      • 2015-04-23 11340, 2015

      • ijabz2 joined the channel
      • 2015-04-23 11308, 2015

      • Freso
        Cool. Thanks. :)
      • 2015-04-23 11314, 2015

      • sv2241
        i will delete and redeploy the ova image as I have no other option. will make sure not to update any package from now onwards
      • 2015-04-23 11317, 2015

      • sv2241
        thanks
      • 2015-04-23 11352, 2015

      • CatQuest joined the channel
      • 2015-04-23 11355, 2015

      • ruaok joined the channel
      • 2015-04-23 11338, 2015

      • ijabz2 joined the channel
      • 2015-04-23 11340, 2015

      • ariscop joined the channel
      • 2015-04-23 11335, 2015

      • chirlu` joined the channel
      • 2015-04-23 11313, 2015

      • chirlu`
        sv2241: Both issues seem to be related to a letter that suddenly became uppercase, “tRack” and “reAdline”. No idea what might have caused this, though.
      • 2015-04-23 11303, 2015

      • UmkaDK joined the channel
      • 2015-04-23 11333, 2015

      • CatQuest
        erhm, to me https://imgur.com/KUSZPMC looks *exactly* that "stats query" is the culprit, you can see the clear ebbing down towards a lower load after it
      • 2015-04-23 11322, 2015

      • ruaok
        yep, that was the culprit, without a doubt.
      • 2015-04-23 11339, 2015

      • ruaok
        I'm going to take a look at trying to fix that now.
      • 2015-04-23 11346, 2015

      • CatQuest
        :D
      • 2015-04-23 11357, 2015

      • ruaok
        otherwise this mess will happen every night until we upgrade the kernel on that machine.
      • 2015-04-23 11317, 2015

      • chirlu`
        ruaok: The trouble started at 23 UTC already, so it doesn’t seem the statistics collector caused it.
      • 2015-04-23 11337, 2015

      • chirlu`
      • 2015-04-23 11305, 2015

      • ruaok
        well, there may be more problems then.
      • 2015-04-23 11313, 2015

      • chirlu`
        Also, did you change the backend numbers?
      • 2015-04-23 11323, 2015

      • ruaok
        I specifically saw this query running for a reallly long time taking 15% of ram on tororo.
      • 2015-04-23 11335, 2015

      • ruaok
        yes, I dropped it down to cores *2.
      • 2015-04-23 11347, 2015

      • ruaok
        and soon we need to look a the kernel upgrade to get to 3.20
      • 2015-04-23 11323, 2015

      • ruaok
        well, I killed the query, but totoro was in a bad state.
      • 2015-04-23 11334, 2015

      • ruaok
        so, I dropped the number of backend connections, because many were still idle.
      • 2015-04-23 11352, 2015

      • ruaok
        and then provisioned that, which causes a restart so, after that things where better.
      • 2015-04-23 11346, 2015

      • ijabz2 joined the channel
      • 2015-04-23 11350, 2015

      • chirlu`
        What value(s) did you change?
      • 2015-04-23 11305, 2015

      • chirlu` refers to MBH-404
      • 2015-04-23 11305, 2015

      • mb-chat-logger
      • 2015-04-23 11315, 2015

      • ruaok
        max_connections is now 50.
      • 2015-04-23 11325, 2015

      • ruaok
        going on the cores * 2, route.
      • 2015-04-23 11338, 2015

      • chirlu`
        That’s probably too low unless you also told pgbouncer to use less than 50 connections.
      • 2015-04-23 11351, 2015

      • ruaok
        I dropped pgbouncer too, but that change may have gotten overwritten. I was already asleep when all of this happened.
      • 2015-04-23 11342, 2015

      • chirlu`
        Generally, max_connections should be the pgbouncer pool size plus something.
      • 2015-04-23 11315, 2015

      • chirlu`
        Otherwise there may be no chance to phone into Postgres for diagnostics when connections are maxed out.
      • 2015-04-23 11353, 2015

      • ruaok
        k, let me find the where pgbouncer gets configured and tweak that and number of connections.
      • 2015-04-23 11319, 2015

      • ijabz2
        probably stating the obvious but it does seem to me that lots of issues of this kind and simailr issues related to building search indexes would be solved if we had a read only db they could all use
      • 2015-04-23 11348, 2015

      • chirlu`
        “ for pgbouncer, default_pool_size=50 and reserve_pool_size=0; not sure about max_client_conn, perhaps 150 or so? for Postgres itself, max_connections=65 so that there are 50 for the website plus some more for index generation, ad-hoc queries etc.”
      • 2015-04-23 11350, 2015

      • ruaok
        I disagree, ijabz2
      • 2015-04-23 11306, 2015

      • chirlu`
        was what I wrote last month, I hope it was sane. ;-)
      • 2015-04-23 11317, 2015

      • ruaok
        there never has been on single thing to push PG over the edge.
      • 2015-04-23 11327, 2015

      • ruaok
      • 2015-04-23 11328, 2015

      • chirlu`
        What was max_client_conn before?
      • 2015-04-23 11333, 2015

      • ruaok
        320!
      • 2015-04-23 11317, 2015

      • ruaok
      • 2015-04-23 11322, 2015

      • ruaok
        oh that was default pool size.
      • 2015-04-23 11329, 2015

      • ruaok
        which is also nuts.
      • 2015-04-23 11344, 2015

      • chirlu`
        So, 400. That’s the number of client connections that may wait in queue before pgbouncer rejects additional ones.
      • 2015-04-23 11357, 2015

      • chirlu`
        I’m unsure how much would be good here.
      • 2015-04-23 11323, 2015

      • ruaok
        apparently our blog post has been very useful to people trying to tune their own systems
      • 2015-04-23 11317, 2015

      • ruaok
        chirlu` if you have a moment, I think you could fix this faster than I could:
      • 2015-04-23 11319, 2015

      • ruaok
      • 2015-04-23 11343, 2015

      • ruaok
        the query needs to be rewritten to sum the counts, not to union all the gids.
      • 2015-04-23 11352, 2015

      • chirlu`
        Hm, not sure if it makes much of a difference.
      • 2015-04-23 11312, 2015

      • chirlu`
        Upgrading to Postgres 9.2 would, because it has index-only scans. :)
      • 2015-04-23 11341, 2015

      • ruaok
        as it stands it eats 2-3 times the amount of ram of other processes.
      • 2015-04-23 11313, 2015

      • chirlu`
        Well, we can change it.
      • 2015-04-23 11309, 2015

      • Junior_ joined the channel
      • 2015-04-23 11303, 2015

      • ruchiranga joined the channel
      • 2015-04-23 11359, 2015

      • ruaok
        in git, work_mem is: :work_mem => "80MB"
      • 2015-04-23 11306, 2015

      • ruaok
        in production it was 64MB.
      • 2015-04-23 11309, 2015

      • ruaok
        thoughts, chirlu`
      • 2015-04-23 11340, 2015

      • chirlu`
        As far as I remember, we reduced it in several steps during the Crisis.
      • 2015-04-23 11358, 2015

      • chirlu`
        128 → 80 → 64, I think.
      • 2015-04-23 11305, 2015

      • ruaok
        correct.
      • 2015-04-23 11318, 2015

      • ruaok
        but I don't recall getting any real feedback on its effect.
      • 2015-04-23 11336, 2015

      • ruaok
        I think it just made the backends smaller, but the real winner was fewer back ends.
      • 2015-04-23 11357, 2015

      • ruaok
        I'm going to leave it at 80MB for now.
      • 2015-04-23 11303, 2015

      • ruaok
        max connections is 65 now.
      • 2015-04-23 11327, 2015

      • chirlu`
        +1
      • 2015-04-23 11342, 2015

      • ruaok
        now to find the swapiness setting, so I can finish all the items in that ticket.
      • 2015-04-23 11343, 2015

      • chirlu`
        It’s a kernel setting, controlled via /proc somewhere, I think.