#musicbrainz-devel

/

0:01 AM
Nyanko-sensei joined the channel

2015-04-23 11317, 2015

0:26 AM
LordSputnik has left the channel

2015-04-23 11342, 2015

0:44 AM
CallerNo6

fwiw, somebody in #musicbrainz is asking about 502s

2015-04-23 11308, 2015

0:50 AM
ruaok joined the channel

2015-04-23 11340, 2015

0:56 AM
Nyanko-sensei

I get a lot of them too. some servers dead?

2015-04-23 11300, 2015

0:57 AM
ruaok

database server is freaking out again.

2015-04-23 11305, 2015

0:57 AM
ruaok

trying to do some tweaking.

2015-04-23 11316, 2015

0:59 AM
Nyanko-sensei

ok

2015-04-23 11312, 2015

1:18 AM
ruaok

UNION ALL SELECT gid FROM artist UNION ALL SELECT gid FROM label UNION ALL SELECT gid FROM release_group_gid_redirect UNION ALL SELECT gid FROM recording_gid_redirect UNION ALL SELECT gid FROM area_gid_redirect UNION ALL SELECT gid FROM place_gid_redirect UNION ALL SELECT gid FROM series_gid_redirect UNION ALL SELECT gid FROM url_gid_redirect UNION ALL SELECT gid FROM track_gid_redirect UNION ALL SELECT gid FROM

2015-04-23 11313, 2015

1:18 AM
ruaok

instrument_gid_redirect UNION ALL SELECT gid FROM release_gid_redirect UNION ALL SELECT gid FROM work_gid_redirect UNION ALL SELECT gid FROM event_gid_redirect UNION ALL SELECT gid

2015-04-23 11317, 2015

1:18 AM
ruaok

O_O

2015-04-23 11337, 2015

1:18 AM
ruaok

who thought *that* was a good idea?

2015-04-23 11328, 2015

1:21 AM
Gentlecat joined the channel

2015-04-23 11316, 2015

1:26 AM
derwin

well *there's* your problem

2015-04-23 11337, 2015

1:26 AM
derwin

I would not do that query in that way at ALL.

2015-04-23 11341, 2015

1:27 AM
ruaok

yeah, me neither.

2015-04-23 11323, 2015

1:28 AM
ruaok

I'm really hoping this is some sort of automatically generated query and this was an overlooked edge case

2015-04-23 11357, 2015

1:40 AM
ruaok

ok, should be doing better now.

2015-04-23 11358, 2015

1:40 AM
ruaok

nn

2015-04-23 11345, 2015

1:46 AM
JesseW joined the channel

2015-04-23 11315, 2015

2:53 AM
luks

derwin: out of curiosity, how would you do the query?

2015-04-23 11333, 2015

2:55 AM
derwin

it depends on what facilities the engine provides for counting rows

2015-04-23 11304, 2015

2:57 AM
derwin

but I would be highly unlikely to count all tables in a query like that on an in-service host

2015-04-23 11314, 2015

2:57 AM
derwin

the actual structure of it is fine

2015-04-23 11323, 2015

2:57 AM
luks

it's a statistic query that runs once a day

2015-04-23 11344, 2015

2:57 AM
derwin

yeah :/

2015-04-23 11348, 2015

2:57 AM
luks

you could sum the tables individually and then sum the sums, but I don't think that would make a difference

2015-04-23 11355, 2015

2:57 AM
derwin

no, it won't in most cases

2015-04-23 11357, 2015

2:57 AM
luks

other than that, I don't see an alternative

2015-04-23 11309, 2015

2:58 AM
derwin

oh, wait

2015-04-23 11316, 2015

2:58 AM
derwin

I would count individually within the tables

2015-04-23 11317, 2015

2:58 AM
luks

the solution is to run statistics on a read-only slave

2015-04-23 11321, 2015

2:58 AM
derwin

to avoid the intermediate result set

2015-04-23 11323, 2015

2:58 AM
derwin

which would be huge

2015-04-23 11327, 2015

2:58 AM
derwin

and yes, +1

2015-04-23 11341, 2015

2:58 AM
luks

postgresql would have to be quite native to collect the IDs

2015-04-23 11346, 2015

2:58 AM
derwin

I would sum counts there

2015-04-23 11353, 2015

2:58 AM
luks

it would have to with union, but not with union all

2015-04-23 11359, 2015

2:58 AM
derwin

not count result-set rows

2015-04-23 11312, 2015

2:59 AM
derwin

sure, I would examine handlers in mysql land

2015-04-23 11314, 2015

2:59 AM
luks

I don't think there are really any

2015-04-23 11328, 2015

2:59 AM
luks

it's just a huge +1 loop

2015-04-23 11337, 2015

2:59 AM
luks

at least I assume so

2015-04-23 11353, 2015

2:59 AM
luks

er, I meant naive, not native :)

2015-04-23 11319, 2015

3:00 AM
derwin

one can also do uh, rolling summary tables if you have time modification stamps on things

2015-04-23 11349, 2015

3:04 AM
luks

the individual sums indeed are faster, I expected postgresql to run it differently

2015-04-23 11303, 2015

3:06 AM
luks

https://gist.github.com/lalinsky/2aad06bc64d9c86a…

2015-04-23 11332, 2015

3:14 AM
derwin

intermediate result set, if the engine doesn't recognize

2015-04-23 11340, 2015

3:14 AM
derwin

I agree that it conceptually "could" ?

2015-04-23 11333, 2015

4:50 AM
JesseW joined the channel

2015-04-23 11333, 2015

5:02 AM
diana_olhovik_ joined the channel

2015-04-23 11323, 2015

5:22 AM
ianmcorvidae

for the curious, that query's for statistics, it's run once a day

2015-04-23 11348, 2015

5:22 AM
ianmcorvidae

counting the total number of MBIDs. could, however, easily be changed to the other method

2015-04-23 11335, 2015

5:23 AM
ianmcorvidae

that's been in for many months now, note, so I'm surprised if it's causing problems now, but anyway

2015-04-23 11349, 2015

5:31 AM
kepstin-laptop joined the channel

2015-04-23 11300, 2015

5:58 AM
sv2241

how can I fix this error? http://pastebin.com/wRFK5mQU

2015-04-23 11321, 2015

5:58 AM
sv2241

vm [at musicbrainzvm]:~$ bin/replicate now

2015-04-23 11325, 2015

5:58 AM
sv2241

perl: symbol lookup error: /usr/lib/libperl.so.5.14: undefined symbol: Perl_pp_reAdline

2015-04-23 11334, 2015

5:59 AM
sv2241

this is happening on the musicbrainz-server-vmware-2014-05-14.ova vmware image

2015-04-23 11352, 2015

6:00 AM
7GHAAA0QH joined the channel

2015-04-23 11330, 2015

6:05 AM
Gentlecat joined the channel

2015-04-23 11308, 2015

6:13 AM
Junior_ joined the channel

2015-04-23 11337, 2015

6:31 AM
d4rkie joined the channel

2015-04-23 11336, 2015

6:36 AM
ianmcorvidae

also, btw: https://imgur.com/KUSZPMC

2015-04-23 11342, 2015

6:36 AM
ianmcorvidae

guessing that query wasn't the problem.

2015-04-23 11332, 2015

6:46 AM
Freso joined the channel

2015-04-23 11347, 2015

7:03 AM
kepstin-laptop joined the channel

2015-04-23 11305, 2015

7:15 AM
sv2241

the above issues occured after deploying the latest musicbrainz OVA (musicbrainz-server-vmware-2014-05-14.ova) into vmware vsphere. before triggering "/bin/replicate" now I did a sudo apt-get update && sudo apt-get upgrade. is this the reason that perl broke? what should I do? redeploy the ova and this time not update the system?!

2015-04-23 11341, 2015

7:18 AM
ianmcorvidae

those errors do seem consistent with some sort of perl upgrade -- I'm not sure how the perl dependencies for musicbrainz-server are installed in the VM because I haven't looked in a while, but perl version upgrades do tend to require reinstalling XS (i.e. C-language, rather than pure perl) dependencies

2015-04-23 11340, 2015

7:23 AM
sv2241

how can I fix these issues?

2015-04-23 11332, 2015

7:25 AM
ohrstrom joined the channel

2015-04-23 11306, 2015

7:31 AM
Freso

ianmcorvidae: Yo. Is there some flag that needs to be set on AREQ in Jira for it to send e-mails for followed tickets?

2015-04-23 11330, 2015

7:31 AM
ianmcorvidae

Freso: no idea, but presumably

2015-04-23 11351, 2015

7:31 AM
ianmcorvidae

lemme poke around a bit, or if not nikki would be the one who might know

2015-04-23 11335, 2015

7:32 AM
ianmcorvidae found that it had the Notification Scheme set to "None", which seems like it might be related

2015-04-23 11301, 2015

7:33 AM
ianmcorvidae

I've changed it to "Default Notification Scheme", which matches most projects (e.g. picard, geordi, etc.)

2015-04-23 11326, 2015

7:37 AM
UmkaDK joined the channel

2015-04-23 11340, 2015

7:40 AM
ijabz2 joined the channel

2015-04-23 11308, 2015

7:45 AM
Freso

Cool. Thanks. :)

2015-04-23 11314, 2015

7:56 AM
sv2241

i will delete and redeploy the ova image as I have no other option. will make sure not to update any package from now onwards

2015-04-23 11317, 2015

7:56 AM
sv2241

thanks

2015-04-23 11352, 2015

8:16 AM
CatQuest joined the channel

2015-04-23 11355, 2015

8:30 AM
ruaok joined the channel

2015-04-23 11338, 2015

8:38 AM
ijabz2 joined the channel

2015-04-23 11340, 2015

8:58 AM
ariscop joined the channel

2015-04-23 11335, 2015

9:33 AM
chirlu` joined the channel

2015-04-23 11313, 2015

9:39 AM
chirlu`

sv2241: Both issues seem to be related to a letter that suddenly became uppercase, “tRack” and “reAdline”. No idea what might have caused this, though.

2015-04-23 11303, 2015

9:46 AM
UmkaDK joined the channel

2015-04-23 11333, 2015

10:24 AM
CatQuest

erhm, to me https://imgur.com/KUSZPMC looks *exactly* that "stats query" is the culprit, you can see the clear ebbing down towards a lower load after it

2015-04-23 11322, 2015

10:25 AM
ruaok

yep, that was the culprit, without a doubt.

2015-04-23 11339, 2015

10:25 AM
ruaok

I'm going to take a look at trying to fix that now.

2015-04-23 11346, 2015

10:25 AM
CatQuest

:D

2015-04-23 11357, 2015

10:25 AM
ruaok

otherwise this mess will happen every night until we upgrade the kernel on that machine.

2015-04-23 11317, 2015

10:37 AM
chirlu`

ruaok: The trouble started at 23 UTC already, so it doesn’t seem the statistics collector caused it.

2015-04-23 11337, 2015

10:37 AM
chirlu`

See totoro cached-memory graph: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi…

2015-04-23 11305, 2015

10:40 AM
ruaok

well, there may be more problems then.

2015-04-23 11313, 2015

10:40 AM
chirlu`

Also, did you change the backend numbers?

2015-04-23 11323, 2015

10:40 AM
ruaok

I specifically saw this query running for a reallly long time taking 15% of ram on tororo.

2015-04-23 11335, 2015

10:40 AM
ruaok

yes, I dropped it down to cores *2.

2015-04-23 11347, 2015

10:40 AM
ruaok

and soon we need to look a the kernel upgrade to get to 3.20

2015-04-23 11323, 2015

10:41 AM
ruaok

well, I killed the query, but totoro was in a bad state.

2015-04-23 11334, 2015

10:41 AM
ruaok

so, I dropped the number of backend connections, because many were still idle.

2015-04-23 11352, 2015

10:41 AM
ruaok

and then provisioned that, which causes a restart so, after that things where better.

2015-04-23 11346, 2015

10:43 AM
ijabz2 joined the channel

2015-04-23 11350, 2015

10:43 AM
chirlu`

What value(s) did you change?

2015-04-23 11305, 2015

10:44 AM
chirlu` refers to MBH-404

2015-04-23 11305, 2015

10:44 AM
mb-chat-logger

http://tickets.musicbrainz.org/browse/MBH-404

2015-04-23 11315, 2015

10:47 AM
ruaok

max_connections is now 50.

2015-04-23 11325, 2015

10:47 AM
ruaok

going on the cores * 2, route.

2015-04-23 11338, 2015

10:48 AM
chirlu`

That’s probably too low unless you also told pgbouncer to use less than 50 connections.

2015-04-23 11351, 2015

10:49 AM
ruaok

I dropped pgbouncer too, but that change may have gotten overwritten. I was already asleep when all of this happened.

2015-04-23 11342, 2015

10:50 AM
chirlu`

Generally, max_connections should be the pgbouncer pool size plus something.

2015-04-23 11315, 2015

10:51 AM
chirlu`

Otherwise there may be no chance to phone into Postgres for diagnostics when connections are maxed out.

2015-04-23 11353, 2015

10:51 AM
ruaok

k, let me find the where pgbouncer gets configured and tweak that and number of connections.

2015-04-23 11319, 2015

10:53 AM
ijabz2

probably stating the obvious but it does seem to me that lots of issues of this kind and simailr issues related to building search indexes would be solved if we had a read only db they could all use

2015-04-23 11348, 2015

10:53 AM
chirlu`

“ for pgbouncer, default_pool_size=50 and reserve_pool_size=0; not sure about max_client_conn, perhaps 150 or so? for Postgres itself, max_connections=65 so that there are 50 for the website plus some more for index generation, ad-hoc queries etc.”

2015-04-23 11350, 2015

10:53 AM
ruaok

I disagree, ijabz2

2015-04-23 11306, 2015

10:54 AM
chirlu`

was what I wrote last month, I hope it was sane. ;-)

2015-04-23 11317, 2015

10:54 AM
ruaok

there never has been on single thing to push PG over the edge.

2015-04-23 11327, 2015

10:56 AM
ruaok

https://gist.github.com/mayhem/6db261249b556af620…

2015-04-23 11328, 2015

10:57 AM
chirlu`

What was max_client_conn before?

2015-04-23 11333, 2015

10:57 AM
ruaok

320!

2015-04-23 11317, 2015

10:58 AM
ruaok

https://gist.github.com/mayhem/7ef3534a0ed78a5623…

2015-04-23 11322, 2015

10:58 AM
ruaok

oh that was default pool size.

2015-04-23 11329, 2015

10:58 AM
ruaok

which is also nuts.

2015-04-23 11344, 2015

10:59 AM
chirlu`

So, 400. That’s the number of client connections that may wait in queue before pgbouncer rejects additional ones.

2015-04-23 11357, 2015

10:59 AM
chirlu`

I’m unsure how much would be good here.

2015-04-23 11323, 2015

11:00 AM
ruaok

apparently our blog post has been very useful to people trying to tune their own systems

2015-04-23 11317, 2015

11:02 AM
ruaok

chirlu` if you have a moment, I think you could fix this faster than I could:

2015-04-23 11319, 2015

11:02 AM
ruaok

https://bitbucket.org/metabrainz/musicbrainz-serv…

2015-04-23 11343, 2015

11:02 AM
ruaok

the query needs to be rewritten to sum the counts, not to union all the gids.

2015-04-23 11352, 2015

11:04 AM
chirlu`

Hm, not sure if it makes much of a difference.

2015-04-23 11312, 2015

11:05 AM
chirlu`

Upgrading to Postgres 9.2 would, because it has index-only scans. :)

2015-04-23 11341, 2015

11:05 AM
ruaok

as it stands it eats 2-3 times the amount of ram of other processes.

2015-04-23 11313, 2015

11:07 AM
chirlu`

Well, we can change it.

2015-04-23 11309, 2015

11:08 AM
Junior_ joined the channel

2015-04-23 11303, 2015

11:10 AM
ruchiranga joined the channel

2015-04-23 11359, 2015

11:13 AM
ruaok

in git, work_mem is: :work_mem => "80MB"

2015-04-23 11306, 2015

11:14 AM
ruaok

in production it was 64MB.

2015-04-23 11309, 2015

11:14 AM
ruaok

thoughts, chirlu`

2015-04-23 11340, 2015

11:16 AM
chirlu`

As far as I remember, we reduced it in several steps during the Crisis.

2015-04-23 11358, 2015

11:16 AM
chirlu`

128 → 80 → 64, I think.

2015-04-23 11305, 2015

11:18 AM
ruaok

correct.

2015-04-23 11318, 2015

11:18 AM
ruaok

but I don't recall getting any real feedback on its effect.

2015-04-23 11336, 2015

11:18 AM
ruaok

I think it just made the backends smaller, but the real winner was fewer back ends.

2015-04-23 11357, 2015

11:18 AM
ruaok

I'm going to leave it at 80MB for now.

2015-04-23 11303, 2015

11:19 AM
ruaok

max connections is 65 now.

2015-04-23 11327, 2015

11:19 AM
chirlu`

+1

2015-04-23 11342, 2015

11:19 AM
ruaok

now to find the swapiness setting, so I can finish all the items in that ticket.

2015-04-23 11343, 2015

11:22 AM
chirlu`

It’s a kernel setting, controlled via /proc somewhere, I think.