it's quite embarrassing that we have libdiscid fixes committed in 2009 and never released :/
2013-01-01 00107, 2013
ianmcorvidae
heh
2013-01-01 00126, 2013
ianmcorvidae
I don't know that anyone in particular has been keeping tabs on that project, I guess that suggests nobody was :)
2013-01-01 00158, 2013
ocharles
Explosions eh
2013-01-01 00140, 2013
nikki
are you here to fix our replication?
2013-01-01 00155, 2013
ocharles
not entirely
2013-01-01 00112, 2013
ocharles
maybe if i catch up enough to understand the problem better
2013-01-01 00136, 2013
ianmcorvidae
problem isn't really well-understood generally, I think right now we're just going for getting replication back on track
2013-01-01 00111, 2013
ianmcorvidae
(though potentially still paused -- just have correct packets up to the current replication sequence on production)
2013-01-01 00127, 2013
ianmcorvidae
but until we know what's actually causing the problem we'll have the potential for getting in trouble again :/
2013-01-01 00147, 2013
ocharles
i see more scare noise about locks though
2013-01-01 00157, 2013
ocharles
1500 locks is nothing to be alarmed or happy about - it's just a number
2013-01-01 00104, 2013
ocharles
do we know what type of locks are held, and where?
2013-01-01 00111, 2013
ianmcorvidae
no
2013-01-01 00119, 2013
ianmcorvidae
which is what I want to test for next time statistics run
2013-01-01 00128, 2013
ianmcorvidae
(which is when this happened as well -- I think it may be related)
2013-01-01 00144, 2013
ianmcorvidae
(people have been getting 502s trying to edit... anything, when stats are running)
2013-01-01 00101, 2013
ocharles
yea, it does sound related
2013-01-01 00116, 2013
ocharles
are our 5xx graphs shining any light on correlation?
2013-01-01 00120, 2013
ianmcorvidae
(which makes no sense, it has almost no locks other than a bunch of access share on various tables and an exclusive lock on the stats table (but one that allows access share))
2013-01-01 00141, 2013
ianmcorvidae
for the statistics problem, it's definitely correlated -- we moved stats an hour later to test exactly this and the problem moved with it
2013-01-01 00113, 2013
ianmcorvidae
I don't know if this is related or how, but it did happen at the right time
2013-01-01 00123, 2013
ocharles
hmm
2013-01-01 00125, 2013
ianmcorvidae
basically I just want more information about the problem I know the most about that looks like it might be related :)
2013-01-01 00101, 2013
ocharles
same
2013-01-01 00119, 2013
ianmcorvidae
of course, that can't happen until twelve hours from now
2013-01-01 00103, 2013
ocharles
unless you collect stats again and throw away current stats
2013-01-01 00112, 2013
ocharles
(well, back them up, run, and then restore)
2013-01-01 00137, 2013
ianmcorvidae
yeah, we could dump and then delete today's stats, run it, then delete and reimport
2013-01-01 00104, 2013
ocharles
right
2013-01-01 00107, 2013
ianmcorvidae
my plan while running it was to trigger what would be a 502 -- i.e. try to submit and edit -- and just dump all of pg_locks to a file while it's timing out
2013-01-01 00113, 2013
ianmcorvidae
an*
2013-01-01 00148, 2013
ianmcorvidae
and then it's "just" a matter of looking at everything that's waiting for a lock to figure out *why* it's waiting when statistics shouldn't need such a thing
2013-01-01 00133, 2013
ocharles
we could also change the time out killer thingy to log the query that was executing at timeout
2013-01-01 00136, 2013
ocharles
but that should be in the serverlog
2013-01-01 00138, 2013
ocharles
(pg)
2013-01-01 00151, 2013
ianmcorvidae
hm
2013-01-01 00154, 2013
ianmcorvidae
I may not know where that log is
2013-01-01 00124, 2013
ocharles
/var/log/postgres/serverlog
2013-01-01 00141, 2013
ocharles
i need to shut this 'unexpected eof' thing up
2013-01-01 00144, 2013
ocharles
in fact, that sorta implies that queries aren't getting aborted and are running for long periods of time
2013-01-01 00144, 2013
ianmcorvidae
hm
2013-01-01 00154, 2013
ianmcorvidae
yeah
2013-01-01 00121, 2013
ianmcorvidae
rob was theorizing that something was causing locks -- by which he may have meant transactions holding locks -- to remain open
2013-01-01 00128, 2013
ianmcorvidae
DBDefs, perhaps?
2013-01-01 00140, 2013
ianmcorvidae
did the timeout make it through the DBDefs changes
2013-01-01 00102, 2013
ocharles
we still don't know if it's a locking problem, really
2013-01-01 00131, 2013
ocharles
iirc, postgresql is setup to log if stuff takes ages to acquire a lock
2013-01-01 00134, 2013
ocharles
and i'm not seeing those messages
2013-01-01 00115, 2013
ianmcorvidae
just brainstorming things to check :)
2013-01-01 00110, 2013
ocharles
2013-01-01 00114, 2013
ocharles
somewhat interesting
2013-01-01 00136, 2013
ianmcorvidae
I was wondering if it was that
2013-01-01 00157, 2013
ianmcorvidae
that's the only explicit lock we're getting (the select from editor for update) that looked probable
2013-01-01 00100, 2013
ocharles
they do always crop up at ~1:30
2013-01-01 00107, 2013
ianmcorvidae
yeah, that's the statistics time
2013-01-01 00117, 2013
ianmcorvidae
(since it got moved an hour later for diagnosing this)
2013-01-01 00131, 2013
ianmcorvidae
what I don't understand is why statistics would have a lock on an editor table that conflicts there
2013-01-01 00156, 2013
ocharles
i'm not sure it does, i just wonder if the amount of writes it does causes stuff to slow down
2013-01-01 00101, 2013
ocharles
but that's a whopping slow down
2013-01-01 00128, 2013
ianmcorvidae
yeah
2013-01-01 00102, 2013
ocharles
how is replication broke?
2013-01-01 00114, 2013
ianmcorvidae
aborted in the middle of doing a packet
2013-01-01 00134, 2013
ocharles
is there a log i can see?
2013-01-01 00149, 2013
ianmcorvidae
probably, it'd be in email
2013-01-01 00155, 2013
ianmcorvidae
did you read through rob's email?
2013-01-01 00115, 2013
ianmcorvidae looks for the relevant email, anyway
2013-01-01 00115, 2013
ocharles
yea
2013-01-01 00122, 2013
ocharles
i'm not finding what i want in emails
2013-01-01 00132, 2013
ianmcorvidae
I'm not really sure why it aborted, rob seems to have an idea why
2013-01-01 00155, 2013
ocharles
the only abort i see is that the next hour rolled around and an existing job was running
2013-01-01 00117, 2013
ianmcorvidae
yeah, that's all I'm seeing in email
2013-01-01 00149, 2013
ocharles
i'm going to guess that rob killed that job with SIGTERM/SIGKILL
2013-01-01 00150, 2013
ianmcorvidae
however, as rob outlines, it stopped dumping one packet and the next hour (whichever hour that was) included some of the same sequence IDs
2013-01-01 00117, 2013
ianmcorvidae
possibly
2013-01-01 00128, 2013
ianmcorvidae
hoping he can provide insight on this topic in a few hours
2013-01-01 00134, 2013
ocharles
mmmm
2013-01-01 00145, 2013
ocharles
well, i need to get out of bed, have a shower and get some breakfast then
2013-01-01 00149, 2013
ocharles
it's a bit of a lazy day :P
2013-01-01 00151, 2013
ianmcorvidae
seems reasonable
2013-01-01 00144, 2013
ianmcorvidae
if that editor lock is in fact the thing that's failing, btw, we might limit the grabbing of that lock to autoedits, which is the only place where it *should* be required (but we're doing it for every edit)
2013-01-01 00104, 2013
ianmcorvidae
I think that's not the main issue though
2013-01-01 00132, 2013
ocharles
i mostly think that might be a symptomn, but not the problem
2013-01-01 00139, 2013
ianmcorvidae
yeah, agreed
2013-01-01 00103, 2013
ocharles
ok, gonna et up then, bbiab
2013-01-01 00152, 2013
Freso
ianmcorvidae: Just replied to your comment on CR.
2013-01-01 00135, 2013
Freso
And, uh, sorrry for being a blurb of text. I just get up and couldn't manage figuring out a good place to insert linebreaks. :|
2013-01-01 00157, 2013
ianmcorvidae
you need to publish the comment
2013-01-01 00100, 2013
ianmcorvidae
it's not there :P
2013-01-01 00119, 2013
Freso
Oh, right.
2013-01-01 00122, 2013
Freso
Silly CR.
2013-01-01 00124, 2013
Freso
Done.
2013-01-01 00115, 2013
LordSputnik joined the channel
2013-01-01 00120, 2013
reosarevok joined the channel
2013-01-01 00121, 2013
reosarevok joined the channel
2013-01-01 00153, 2013
sezuan joined the channel
2013-01-01 00135, 2013
voiceinsideyou joined the channel
2013-01-01 00121, 2013
voiceinsideyou1 joined the channel
2013-01-01 00137, 2013
kepstin-laptop joined the channel
2013-01-01 00158, 2013
ruaok joined the channel
2013-01-01 00141, 2013
ruaok
ianmcorvidae: ping?
2013-01-01 00114, 2013
nikki
I imagine he's still asleep. he didn't go to bed until late
2013-01-01 00135, 2013
nikki
not even four hours ago :P
2013-01-01 00136, 2013
ruaok
I figured that. :) I got the last email from him about 5 hours ago.
2013-01-01 00153, 2013
ruaok
I'll wait for him to wake to try and patch things back up.
2013-01-01 00107, 2013
ruaok
I will, however, get the search indexes building again.
2013-01-01 00113, 2013
ruaok
damn nagios. not sending me emails.
2013-01-01 00119, 2013
reosarevok
Supposedly he did that?
2013-01-01 00154, 2013
ruaok
who did what?
2013-01-01 00103, 2013
nikki
ian apparently got search indexes updating again
2013-01-01 00110, 2013
reosarevok
that
2013-01-01 00129, 2013
ruaok
oh, whoops.
2013-01-01 00133, 2013
ruaok
he didn't mail me about that.
2013-01-01 00113, 2013
ruaok
ah looks like we got one set out and I killed the next run thats been going for about an hour
2013-01-01 00145, 2013
LordSputnik has left the channel
2013-01-01 00133, 2013
LordSputnik joined the channel
2013-01-01 00154, 2013
ocharles
lots of fires it seems :(
2013-01-01 00134, 2013
Freso
Well, it is New Year's! :D
2013-01-01 00102, 2013
ruaok
ocharles: yep
2013-01-01 00126, 2013
ruaok
can you look at bisecting the code from the last release so see what might've introduced the lock issue?
2013-01-01 00133, 2013
ruaok
ian and I are going to clean up as best as we can.
2013-01-01 00156, 2013
ocharles
i still don't know if there is a lock issue
2013-01-01 00103, 2013
ocharles
we had a chat about it earlier
2013-01-01 00121, 2013
ocharles
1500 locks certainly looks scary, but it's not inherently a problem if they are mostly access share or something
2013-01-01 00133, 2013
ocharles
do you have any details on the locks you saw?
2013-01-01 00105, 2013
ocharles
the only problem we think it could be is the fact that we lock editor rows when we created edits, and it could be that
2013-01-01 00110, 2013
ocharles
but that was added months ago
2013-01-01 00118, 2013
ruaok
I DO know we have a problem.
2013-01-01 00144, 2013
ocharles
ok - i only know there is a timeout problem
2013-01-01 00159, 2013
ruaok
and had you actually implemented the lock tracking code I asked you to implement a few weeks/months ago, we would have more info now.
2013-01-01 00119, 2013
ruaok
I guess I'll ask ianmcorvidae to get that done so that it actually gets done.
2013-01-01 00145, 2013
ocharles
i've read the postgresql server logs, which are configured to output information if something is blocked waiting to get a lock, and that rarely shows up
2013-01-01 00102, 2013
ocharles
it does show up occasionally around 1:30am though, which is inline with the stats collection
2013-01-01 00159, 2013
ocharles
but neither me or ian can figure out why stats stuff would interfere with that - it doesn't take any locks out
2013-01-01 00114, 2013
ocharles
one really weak hypothesis is that the increased write load is causing stuff to just timeout
2013-01-01 00124, 2013
ocharles
but again - it's strange that this only just started happening