can you look at bisecting the code from the last release so see what might've introduced the lock issue?
2013-01-01 00133, 2013
ruaok
ian and I are going to clean up as best as we can.
2013-01-01 00156, 2013
ocharles
i still don't know if there is a lock issue
2013-01-01 00103, 2013
ocharles
we had a chat about it earlier
2013-01-01 00121, 2013
ocharles
1500 locks certainly looks scary, but it's not inherently a problem if they are mostly access share or something
2013-01-01 00133, 2013
ocharles
do you have any details on the locks you saw?
2013-01-01 00105, 2013
ocharles
the only problem we think it could be is the fact that we lock editor rows when we created edits, and it could be that
2013-01-01 00110, 2013
ocharles
but that was added months ago
2013-01-01 00118, 2013
ruaok
I DO know we have a problem.
2013-01-01 00144, 2013
ocharles
ok - i only know there is a timeout problem
2013-01-01 00159, 2013
ruaok
and had you actually implemented the lock tracking code I asked you to implement a few weeks/months ago, we would have more info now.
2013-01-01 00119, 2013
ruaok
I guess I'll ask ianmcorvidae to get that done so that it actually gets done.
2013-01-01 00145, 2013
ocharles
i've read the postgresql server logs, which are configured to output information if something is blocked waiting to get a lock, and that rarely shows up
2013-01-01 00102, 2013
ocharles
it does show up occasionally around 1:30am though, which is inline with the stats collection
2013-01-01 00159, 2013
ocharles
but neither me or ian can figure out why stats stuff would interfere with that - it doesn't take any locks out
2013-01-01 00114, 2013
ocharles
one really weak hypothesis is that the increased write load is causing stuff to just timeout
2013-01-01 00124, 2013
ocharles
but again - it's strange that this only just started happening
2013-01-01 00128, 2013
nikki
maybe it also doesn't like 2013
2013-01-01 00121, 2013
ruaok
this shit all happened in 2012
2013-01-01 00128, 2013
nikki
I meant whatever broke replication
2013-01-01 00141, 2013
nikki
since afaik that broke with the first packet of 2013
2013-01-01 00145, 2013
nikki
(utc time)
2013-01-01 00127, 2013
ruaok
the deadlocked server broke replication. and the last replication packet happened before midnight PST ;-/
2013-01-01 00137, 2013
nikki
ah
2013-01-01 00157, 2013
ocharles
what did the dead locking look like? I'm not familiar with what happened the week before cause of christmas
2013-01-01 00128, 2013
ruaok
1500 open locks was all I needed to know to take action.
2013-01-01 00111, 2013
ruaok
can you please write the script that fires up periodically, dumps the pg_locks table if there are more than X locks?
2013-01-01 00122, 2013
ruaok
then we finally have some collected evidence of what is going wrong.
2013-01-01 00129, 2013
ruaok
warp: ping!
2013-01-01 00153, 2013
ocharles
well, somewhat done
2013-01-01 00158, 2013
ocharles
/root/dump-locks on totoro
2013-01-01 00102, 2013
ocharles
just put that in cron somewhere
2013-01-01 00141, 2013
ocharles
but what i'm saying is 1500 locks doesn't point to a problem until you know what the locks are. every select takes out a lock, but those would only block truncates and alter table
2013-01-01 00130, 2013
ruaok
at least now I'll have an answer to you when you ask what it looked like. :)
2013-01-01 00131, 2013
ruaok
thanks.
2013-01-01 00138, 2013
ocharles
plus a lock doesn't get released until a transaction ends. so a few transactions doing a few selects and so on can produce quite a few locks in pg_locks
2013-01-01 00126, 2013
ruaok
hmm.
2013-01-01 00127, 2013
ruaok
t
2013-01-01 00141, 2013
ruaok
that script will just chew up needless disk space.
2013-01-01 00149, 2013
ruaok fixes it himself
2013-01-01 00150, 2013
ocharles
in what way?
2013-01-01 00133, 2013
ruaok
it creates a new file every time its invoked.
2013-01-01 00107, 2013
ocharles
mmm, intentionally
2013-01-01 00116, 2013
ruaok
ideally it would create one file and if the count is > 500 it would dump to that file, including a timestamp.
2013-01-01 00133, 2013
ruaok
the most important is that it not dump stuff when the lock count isn't troublesome.
2013-01-01 00140, 2013
ocharles
ok, well i'll add that predicate
2013-01-01 00152, 2013
ruaok
thx
2013-01-01 00125, 2013
sezuan joined the channel
2013-01-01 00136, 2013
ocharles
ok, i believe that's done
2013-01-01 00103, 2013
ruaok
wanna deploy it and have it log to to /var/log/postgres or should I?
2013-01-01 00123, 2013
ruaok
when warp shows up I'll ask him about nagios and getting a check for number of locks.
2013-01-01 00127, 2013
ruaok
and then also chart that.
2013-01-01 00133, 2013
ocharles
nope, go ahead
2013-01-01 00139, 2013
ruaok
k
2013-01-01 00145, 2013
ocharles
i couldn't figure out what i needed to write to get charts
2013-01-01 00111, 2013
hawke_ joined the channel
2013-01-01 00130, 2013
ianmcorvidae
ruaok: sorry, guess I forgot to send email about the search index thing, which is what I get for staying up until 9am
2013-01-01 00142, 2013
ruaok
ha, no worries. :)
2013-01-01 00153, 2013
ruaok
didja get enough sleep?
2013-01-01 00113, 2013
ianmcorvidae
well, probably not in a general sense, but enough for now :)
2013-01-01 00119, 2013
ruaok
k.
2013-01-01 00120, 2013
ianmcorvidae
I need to take a shower though, and then we can get down to work.
2013-01-01 00126, 2013
ruaok
ok, sounds good.
2013-01-01 00102, 2013
ianmcorvidae
okay, and I'm back.
2013-01-01 00114, 2013
ianmcorvidae
catch me up -- what all's been done/needs to be done?
2013-01-01 00116, 2013
ruaok
I'm making lunch.
2013-01-01 00121, 2013
ruaok
nothing has been done.
2013-01-01 00126, 2013
ruaok
first I want to compare notes with you.
2013-01-01 00130, 2013
ianmcorvidae
okay
2013-01-01 00136, 2013
ruaok
my conclusions last night were different from yours.
2013-01-01 00137, 2013
ianmcorvidae
yeah -- your start/end numbers were right, but a transaction being open can cause there to be some overlap in numbers -- there were no lines exactly the same between the aborted packet and the next/completed one
2013-01-01 00105, 2013
ianmcorvidae
(statistics being the one that does that, here, since it stays open for quite a while)
2013-01-01 00151, 2013
ruaok
ok, I'll be back in about 10-15 minutes.
2013-01-01 00101, 2013
ianmcorvidae
k
2013-01-01 00104, 2013
ianmcorvidae
what I'm not sure is about the missing 10 that aren't in either -- what they are, what that means
2013-01-01 00126, 2013
Leftmost joined the channel
2013-01-01 00153, 2013
ruaok returns
2013-01-01 00109, 2013
ruaok
ianmcorvidae: for starters, ocharles and I put in place the lock dumper on totoro
2013-01-01 00125, 2013
ruaok
it fires up every 5 minutes and dumps the state of the locks if there are more than 1000 locks in place.
2013-01-01 00136, 2013
ruaok
they get written to /var/log/postgres/locks
2013-01-01 00110, 2013
ruaok
so, in theory when the stats collection fires up, we should see a set of locks get written out.
2013-01-01 00124, 2013
ianmcorvidae
yeah
2013-01-01 00154, 2013
ruaok
let me see if I understand where things are.
2013-01-01 00110, 2013
ruaok
the boundaries that I listed yesterday are mostly correct.
2013-01-01 00129, 2013
ruaok
but you checked the mbexport data and found that there are 10 seqids missing, right?
2013-01-01 00101, 2013
ianmcorvidae
yes, just after the end of the quasi-aborted packet
2013-01-01 00133, 2013
ianmcorvidae
69829154-29164 IIRC
2013-01-01 00158, 2013
ruaok
those would be missing from packet 832 ?
2013-01-01 00124, 2013
ianmcorvidae
well, they're not in either packet, but if they exist they'd be in 832
2013-01-01 00146, 2013
ianmcorvidae
because 69829153 is the line updating replication_control, which should be the end of 831
2013-01-01 00104, 2013
ianmcorvidae
what I don't know here is what exactly missing sequence IDs mean
2013-01-01 00123, 2013
ruaok
yeah.
2013-01-01 00114, 2013
ruaok re-reads the paragraph that starts with Update:
2013-01-01 00119, 2013
ianmcorvidae
the likely situation is that, if they're really missing and are things that should be there, they're statistics, since that performs a lot of writes
2013-01-01 00133, 2013
ianmcorvidae refreshes my memory on what I said then too :P
2013-01-01 00142, 2013
ianmcorvidae
ah, yes -- all I'm saying there is that the only manual futzing we potentially need to do is adding in bits for those 10 (er, 11) sequence IDs
2013-01-01 00157, 2013
ianmcorvidae
because there's no actual sequence ID overlap between the 831 and 832
2013-01-01 00157, 2013
ruaok
ok, let me grok that there is no overlap.
2013-01-01 00142, 2013
ruaok
the files in mbexport start with 69823186 and end with 69829153
2013-01-01 00158, 2013
ruaok
right?
2013-01-01 00102, 2013
ianmcorvidae
correct
2013-01-01 00158, 2013
ruaok
packet 64832 starts with 69827962 and ends with 69834703
2013-01-01 00126, 2013
ianmcorvidae
also correct
2013-01-01 00108, 2013
ruaok
the thing I dont quite get is why mbexport ends with 29153 and 832 starts with 27962
2013-01-01 00138, 2013
ianmcorvidae
because the bits of 27962-29153 that are in 832 are statistics
2013-01-01 00121, 2013
ianmcorvidae
specifically: those sequence IDs were created by the statistics collector in its transaction, which had not completed when 831's transaction to dump a packet started
2013-01-01 00139, 2013
ianmcorvidae
however, other things that started after statistics, but ended before the packet dumper started, were
2013-01-01 00103, 2013
ianmcorvidae
for example, the dumper itself adds a line to replication_control, (that's sequence id 29153)
2013-01-01 00151, 2013
ianmcorvidae
basically, sequence IDs are assigned in order of real time, not in order of transactions closing, but the replication packet creator cares about what's visible to it in terms of transactions
2013-01-01 00121, 2013
ianmcorvidae
so the fact that there's some overlap in terms of numerical *range* is fine, and perfectly normal when a long-running transaction's involved
2013-01-01 00102, 2013
ianmcorvidae
importantly, and what I mean when I say there's no overlap, there is no sequence ID that appears in both the aborted 831 files and in the 832 files
2013-01-01 00142, 2013
ianmcorvidae
meaning that 831 is mostly correct, other than not having a README or COPYING and not getting tarred up
2013-01-01 00104, 2013
ruaok
and missing a few lines in dbmirror, no?
2013-01-01 00114, 2013
ianmcorvidae
I don't know if the 11 missing sequence IDs are a cause for concern, as I don't know all the specifics of sequence ID assignment
2013-01-01 00128, 2013
ianmcorvidae
well, the missing numbers are after what should be the end of 831
2013-01-01 00110, 2013
ruaok
so, is it ok to have gaps in sequence ids?
2013-01-01 00119, 2013
ianmcorvidae
that's what I don't know.
2013-01-01 00116, 2013
ruaok
the mbexport file has more missing:
2013-01-01 00117, 2013
ruaok
69829121 "statistics"."statistic" i 704488954
2013-01-01 00141, 2013
ianmcorvidae
those are in 832, no?
2013-01-01 00105, 2013
ruaok checks
2013-01-01 00115, 2013
ianmcorvidae
yeah, they're in my combined file
2013-01-01 00149, 2013
ianmcorvidae
that's fine, those are from a transaction that hadn't finished when 831 was dumping -- so they should be in 832, and they are :)
2013-01-01 00132, 2013
ruaok
I'm looking at 832 and I cant see them
2013-01-01 00100, 2013
ianmcorvidae
I get:
2013-01-01 00101, 2013
ianmcorvidae
dbmirror_pending:69828840 "statistics"."statistic" i 704488954
2013-01-01 00102, 2013
ianmcorvidae
dbmirror_pendingdata:69828840 f "id"='2648648' "name"='count.discid' "value"='521483' "date_collected"='2013-01-01'
2013-01-01 00113, 2013
ianmcorvidae
(~/rep is the extracted 832 packet)
2013-01-01 00135, 2013
ruaok
Isee 8840, but no 8841
2013-01-01 00104, 2013
ruaok
no 28841 - 29120
2013-01-01 00111, 2013
ianmcorvidae
that one's in 831
2013-01-01 00121, 2013
ianmcorvidae
because it's a change to cover_art_archive.cover_art
2013-01-01 00107, 2013
nikki
by the way, I have a db on rika that's been replicating that's not doing a whole lot otherwise that I could test the next replication packet with if you need to
2013-01-01 00122, 2013
ruaok
nikki: that would be fabulous!
2013-01-01 00132, 2013
ianmcorvidae
here's how I have the list of everything that's in one packet or the other: cat 831/mbdump/dbmirror_pending 832/mbdump/dbmirror_pending | sort -g > combined_dbmirror_pending
2013-01-01 00135, 2013
nikki
at least I'm pretty sure I left it replicating
2013-01-01 00144, 2013
ianmcorvidae
(or, combined_dbmirror in my lenny home dir)
2013-01-01 00152, 2013
ruaok
ok, I see them in 831.
2013-01-01 00139, 2013
ruaok
but I see duplicated sequids, is that ok?
2013-01-01 00147, 2013
ianmcorvidae
where? I don't
2013-01-01 00101, 2013
ruaok
let me find them again.
2013-01-01 00114, 2013
nikki
ah, I did. excellent.
2013-01-01 00132, 2013
nikki
it's mbslave rather than mbserver but I guess that doesn't matter for checking whether the thing applies or not
2013-01-01 00155, 2013
ianmcorvidae
uniq -d can't find any duplicate lines either, so unless the same sequence ID is duplicate but with different data...
2013-01-01 00121, 2013
ruaok
ok, I think I wasn't looking at enough digits. can't find it again.