#musicbrainz-devel

/

      • LordSputnik joined the channel
      • 2013-01-01 00154, 2013

      • ocharles
        lots of fires it seems :(
      • 2013-01-01 00134, 2013

      • Freso
        Well, it is New Year's! :D
      • 2013-01-01 00102, 2013

      • ruaok
        ocharles: yep
      • 2013-01-01 00126, 2013

      • ruaok
        can you look at bisecting the code from the last release so see what might've introduced the lock issue?
      • 2013-01-01 00133, 2013

      • ruaok
        ian and I are going to clean up as best as we can.
      • 2013-01-01 00156, 2013

      • ocharles
        i still don't know if there is a lock issue
      • 2013-01-01 00103, 2013

      • ocharles
        we had a chat about it earlier
      • 2013-01-01 00121, 2013

      • ocharles
        1500 locks certainly looks scary, but it's not inherently a problem if they are mostly access share or something
      • 2013-01-01 00133, 2013

      • ocharles
        do you have any details on the locks you saw?
      • 2013-01-01 00105, 2013

      • ocharles
        the only problem we think it could be is the fact that we lock editor rows when we created edits, and it could be that
      • 2013-01-01 00110, 2013

      • ocharles
        but that was added months ago
      • 2013-01-01 00118, 2013

      • ruaok
        I DO know we have a problem.
      • 2013-01-01 00144, 2013

      • ocharles
        ok - i only know there is a timeout problem
      • 2013-01-01 00159, 2013

      • ruaok
        and had you actually implemented the lock tracking code I asked you to implement a few weeks/months ago, we would have more info now.
      • 2013-01-01 00119, 2013

      • ruaok
        I guess I'll ask ianmcorvidae to get that done so that it actually gets done.
      • 2013-01-01 00145, 2013

      • ocharles
        i've read the postgresql server logs, which are configured to output information if something is blocked waiting to get a lock, and that rarely shows up
      • 2013-01-01 00102, 2013

      • ocharles
        it does show up occasionally around 1:30am though, which is inline with the stats collection
      • 2013-01-01 00159, 2013

      • ocharles
        but neither me or ian can figure out why stats stuff would interfere with that - it doesn't take any locks out
      • 2013-01-01 00114, 2013

      • ocharles
        one really weak hypothesis is that the increased write load is causing stuff to just timeout
      • 2013-01-01 00124, 2013

      • ocharles
        but again - it's strange that this only just started happening
      • 2013-01-01 00128, 2013

      • nikki
        maybe it also doesn't like 2013
      • 2013-01-01 00121, 2013

      • ruaok
        this shit all happened in 2012
      • 2013-01-01 00128, 2013

      • nikki
        I meant whatever broke replication
      • 2013-01-01 00141, 2013

      • nikki
        since afaik that broke with the first packet of 2013
      • 2013-01-01 00145, 2013

      • nikki
        (utc time)
      • 2013-01-01 00127, 2013

      • ruaok
        the deadlocked server broke replication. and the last replication packet happened before midnight PST ;-/
      • 2013-01-01 00137, 2013

      • nikki
        ah
      • 2013-01-01 00157, 2013

      • ocharles
        what did the dead locking look like? I'm not familiar with what happened the week before cause of christmas
      • 2013-01-01 00128, 2013

      • ruaok
        1500 open locks was all I needed to know to take action.
      • 2013-01-01 00111, 2013

      • ruaok
        can you please write the script that fires up periodically, dumps the pg_locks table if there are more than X locks?
      • 2013-01-01 00122, 2013

      • ruaok
        then we finally have some collected evidence of what is going wrong.
      • 2013-01-01 00129, 2013

      • ruaok
        warp: ping!
      • 2013-01-01 00153, 2013

      • ocharles
        well, somewhat done
      • 2013-01-01 00158, 2013

      • ocharles
        /root/dump-locks on totoro
      • 2013-01-01 00102, 2013

      • ocharles
        just put that in cron somewhere
      • 2013-01-01 00141, 2013

      • ocharles
        but what i'm saying is 1500 locks doesn't point to a problem until you know what the locks are. every select takes out a lock, but those would only block truncates and alter table
      • 2013-01-01 00130, 2013

      • ruaok
        at least now I'll have an answer to you when you ask what it looked like. :)
      • 2013-01-01 00131, 2013

      • ruaok
        thanks.
      • 2013-01-01 00138, 2013

      • ocharles
        plus a lock doesn't get released until a transaction ends. so a few transactions doing a few selects and so on can produce quite a few locks in pg_locks
      • 2013-01-01 00126, 2013

      • ruaok
        hmm.
      • 2013-01-01 00127, 2013

      • ruaok
        t
      • 2013-01-01 00141, 2013

      • ruaok
        that script will just chew up needless disk space.
      • 2013-01-01 00149, 2013

      • ruaok fixes it himself
      • 2013-01-01 00150, 2013

      • ocharles
        in what way?
      • 2013-01-01 00133, 2013

      • ruaok
        it creates a new file every time its invoked.
      • 2013-01-01 00107, 2013

      • ocharles
        mmm, intentionally
      • 2013-01-01 00116, 2013

      • ruaok
        ideally it would create one file and if the count is > 500 it would dump to that file, including a timestamp.
      • 2013-01-01 00133, 2013

      • ruaok
        the most important is that it not dump stuff when the lock count isn't troublesome.
      • 2013-01-01 00140, 2013

      • ocharles
        ok, well i'll add that predicate
      • 2013-01-01 00152, 2013

      • ruaok
        thx
      • 2013-01-01 00125, 2013

      • sezuan joined the channel
      • 2013-01-01 00136, 2013

      • ocharles
        ok, i believe that's done
      • 2013-01-01 00103, 2013

      • ruaok
        wanna deploy it and have it log to to /var/log/postgres or should I?
      • 2013-01-01 00123, 2013

      • ruaok
        when warp shows up I'll ask him about nagios and getting a check for number of locks.
      • 2013-01-01 00127, 2013

      • ruaok
        and then also chart that.
      • 2013-01-01 00133, 2013

      • ocharles
        nope, go ahead
      • 2013-01-01 00139, 2013

      • ruaok
        k
      • 2013-01-01 00145, 2013

      • ocharles
        i couldn't figure out what i needed to write to get charts
      • 2013-01-01 00111, 2013

      • hawke_ joined the channel
      • 2013-01-01 00130, 2013

      • ianmcorvidae
        ruaok: sorry, guess I forgot to send email about the search index thing, which is what I get for staying up until 9am
      • 2013-01-01 00142, 2013

      • ruaok
        ha, no worries. :)
      • 2013-01-01 00153, 2013

      • ruaok
        didja get enough sleep?
      • 2013-01-01 00113, 2013

      • ianmcorvidae
        well, probably not in a general sense, but enough for now :)
      • 2013-01-01 00119, 2013

      • ruaok
        k.
      • 2013-01-01 00120, 2013

      • ianmcorvidae
        I need to take a shower though, and then we can get down to work.
      • 2013-01-01 00126, 2013

      • ruaok
        ok, sounds good.
      • 2013-01-01 00102, 2013

      • ianmcorvidae
        okay, and I'm back.
      • 2013-01-01 00114, 2013

      • ianmcorvidae
        catch me up -- what all's been done/needs to be done?
      • 2013-01-01 00116, 2013

      • ruaok
        I'm making lunch.
      • 2013-01-01 00121, 2013

      • ruaok
        nothing has been done.
      • 2013-01-01 00126, 2013

      • ruaok
        first I want to compare notes with you.
      • 2013-01-01 00130, 2013

      • ianmcorvidae
        okay
      • 2013-01-01 00136, 2013

      • ruaok
        my conclusions last night were different from yours.
      • 2013-01-01 00137, 2013

      • ianmcorvidae
        yeah -- your start/end numbers were right, but a transaction being open can cause there to be some overlap in numbers -- there were no lines exactly the same between the aborted packet and the next/completed one
      • 2013-01-01 00105, 2013

      • ianmcorvidae
        (statistics being the one that does that, here, since it stays open for quite a while)
      • 2013-01-01 00151, 2013

      • ruaok
        ok, I'll be back in about 10-15 minutes.
      • 2013-01-01 00101, 2013

      • ianmcorvidae
        k
      • 2013-01-01 00104, 2013

      • ianmcorvidae
        what I'm not sure is about the missing 10 that aren't in either -- what they are, what that means
      • 2013-01-01 00126, 2013

      • Leftmost joined the channel
      • 2013-01-01 00153, 2013

      • ruaok returns
      • 2013-01-01 00109, 2013

      • ruaok
        ianmcorvidae: for starters, ocharles and I put in place the lock dumper on totoro
      • 2013-01-01 00125, 2013

      • ruaok
        it fires up every 5 minutes and dumps the state of the locks if there are more than 1000 locks in place.
      • 2013-01-01 00136, 2013

      • ruaok
        they get written to /var/log/postgres/locks
      • 2013-01-01 00110, 2013

      • ruaok
        so, in theory when the stats collection fires up, we should see a set of locks get written out.
      • 2013-01-01 00124, 2013

      • ianmcorvidae
        yeah
      • 2013-01-01 00154, 2013

      • ruaok
        let me see if I understand where things are.
      • 2013-01-01 00110, 2013

      • ruaok
        the boundaries that I listed yesterday are mostly correct.
      • 2013-01-01 00129, 2013

      • ruaok
        but you checked the mbexport data and found that there are 10 seqids missing, right?
      • 2013-01-01 00101, 2013

      • ianmcorvidae
        yes, just after the end of the quasi-aborted packet
      • 2013-01-01 00133, 2013

      • ianmcorvidae
        69829154-29164 IIRC
      • 2013-01-01 00158, 2013

      • ruaok
        those would be missing from packet 832 ?
      • 2013-01-01 00124, 2013

      • ianmcorvidae
        well, they're not in either packet, but if they exist they'd be in 832
      • 2013-01-01 00146, 2013

      • ianmcorvidae
        because 69829153 is the line updating replication_control, which should be the end of 831
      • 2013-01-01 00104, 2013

      • ianmcorvidae
        what I don't know here is what exactly missing sequence IDs mean
      • 2013-01-01 00123, 2013

      • ruaok
        yeah.
      • 2013-01-01 00114, 2013

      • ruaok re-reads the paragraph that starts with Update:
      • 2013-01-01 00119, 2013

      • ianmcorvidae
        the likely situation is that, if they're really missing and are things that should be there, they're statistics, since that performs a lot of writes
      • 2013-01-01 00133, 2013

      • ianmcorvidae refreshes my memory on what I said then too :P
      • 2013-01-01 00142, 2013

      • ianmcorvidae
        ah, yes -- all I'm saying there is that the only manual futzing we potentially need to do is adding in bits for those 10 (er, 11) sequence IDs
      • 2013-01-01 00157, 2013

      • ianmcorvidae
        because there's no actual sequence ID overlap between the 831 and 832
      • 2013-01-01 00157, 2013

      • ruaok
        ok, let me grok that there is no overlap.
      • 2013-01-01 00142, 2013

      • ruaok
        the files in mbexport start with 69823186 and end with 69829153
      • 2013-01-01 00158, 2013

      • ruaok
        right?
      • 2013-01-01 00102, 2013

      • ianmcorvidae
        correct
      • 2013-01-01 00158, 2013

      • ruaok
        packet 64832 starts with 69827962 and ends with 69834703
      • 2013-01-01 00126, 2013

      • ianmcorvidae
        also correct
      • 2013-01-01 00108, 2013

      • ruaok
        the thing I dont quite get is why mbexport ends with 29153 and 832 starts with 27962
      • 2013-01-01 00138, 2013

      • ianmcorvidae
        because the bits of 27962-29153 that are in 832 are statistics
      • 2013-01-01 00121, 2013

      • ianmcorvidae
        specifically: those sequence IDs were created by the statistics collector in its transaction, which had not completed when 831's transaction to dump a packet started
      • 2013-01-01 00139, 2013

      • ianmcorvidae
        however, other things that started after statistics, but ended before the packet dumper started, were
      • 2013-01-01 00103, 2013

      • ianmcorvidae
        for example, the dumper itself adds a line to replication_control, (that's sequence id 29153)
      • 2013-01-01 00151, 2013

      • ianmcorvidae
        basically, sequence IDs are assigned in order of real time, not in order of transactions closing, but the replication packet creator cares about what's visible to it in terms of transactions
      • 2013-01-01 00121, 2013

      • ianmcorvidae
        so the fact that there's some overlap in terms of numerical *range* is fine, and perfectly normal when a long-running transaction's involved
      • 2013-01-01 00102, 2013

      • ianmcorvidae
        importantly, and what I mean when I say there's no overlap, there is no sequence ID that appears in both the aborted 831 files and in the 832 files
      • 2013-01-01 00142, 2013

      • ianmcorvidae
        meaning that 831 is mostly correct, other than not having a README or COPYING and not getting tarred up
      • 2013-01-01 00104, 2013

      • ruaok
        and missing a few lines in dbmirror, no?
      • 2013-01-01 00114, 2013

      • ianmcorvidae
        I don't know if the 11 missing sequence IDs are a cause for concern, as I don't know all the specifics of sequence ID assignment
      • 2013-01-01 00128, 2013

      • ianmcorvidae
        well, the missing numbers are after what should be the end of 831
      • 2013-01-01 00110, 2013

      • ruaok
        so, is it ok to have gaps in sequence ids?
      • 2013-01-01 00119, 2013

      • ianmcorvidae
        that's what I don't know.
      • 2013-01-01 00116, 2013

      • ruaok
        the mbexport file has more missing:
      • 2013-01-01 00117, 2013

      • ruaok
        69829121 "statistics"."statistic" i 704488954
      • 2013-01-01 00141, 2013

      • ianmcorvidae
        those are in 832, no?
      • 2013-01-01 00105, 2013

      • ruaok checks
      • 2013-01-01 00115, 2013

      • ianmcorvidae
        yeah, they're in my combined file
      • 2013-01-01 00149, 2013

      • ianmcorvidae
        that's fine, those are from a transaction that hadn't finished when 831 was dumping -- so they should be in 832, and they are :)
      • 2013-01-01 00132, 2013

      • ruaok
        I'm looking at 832 and I cant see them
      • 2013-01-01 00100, 2013

      • ianmcorvidae
        I get:
      • 2013-01-01 00101, 2013

      • ianmcorvidae
        dbmirror_pending:69828840 "statistics"."statistic" i 704488954
      • 2013-01-01 00102, 2013

      • ianmcorvidae
        dbmirror_pendingdata:69828840 f "id"='2648648' "name"='count.discid' "value"='521483' "date_collected"='2013-01-01'
      • 2013-01-01 00113, 2013

      • ianmcorvidae
        (~/rep is the extracted 832 packet)
      • 2013-01-01 00135, 2013

      • ruaok
        Isee 8840, but no 8841
      • 2013-01-01 00104, 2013

      • ruaok
        no 28841 - 29120
      • 2013-01-01 00111, 2013

      • ianmcorvidae
        that one's in 831
      • 2013-01-01 00121, 2013

      • ianmcorvidae
        because it's a change to cover_art_archive.cover_art
      • 2013-01-01 00107, 2013

      • nikki
        by the way, I have a db on rika that's been replicating that's not doing a whole lot otherwise that I could test the next replication packet with if you need to
      • 2013-01-01 00122, 2013

      • ruaok
        nikki: that would be fabulous!
      • 2013-01-01 00132, 2013

      • ianmcorvidae
        here's how I have the list of everything that's in one packet or the other: cat 831/mbdump/dbmirror_pending 832/mbdump/dbmirror_pending | sort -g > combined_dbmirror_pending
      • 2013-01-01 00135, 2013

      • nikki
        at least I'm pretty sure I left it replicating
      • 2013-01-01 00144, 2013

      • ianmcorvidae
        (or, combined_dbmirror in my lenny home dir)
      • 2013-01-01 00152, 2013

      • ruaok
        ok, I see them in 831.
      • 2013-01-01 00139, 2013

      • ruaok
        but I see duplicated sequids, is that ok?
      • 2013-01-01 00147, 2013

      • ianmcorvidae
        where? I don't
      • 2013-01-01 00101, 2013

      • ruaok
        let me find them again.
      • 2013-01-01 00114, 2013

      • nikki
        ah, I did. excellent.
      • 2013-01-01 00132, 2013

      • nikki
        it's mbslave rather than mbserver but I guess that doesn't matter for checking whether the thing applies or not
      • 2013-01-01 00155, 2013

      • ianmcorvidae
        uniq -d can't find any duplicate lines either, so unless the same sequence ID is duplicate but with different data...
      • 2013-01-01 00121, 2013

      • ruaok
        ok, I think I wasn't looking at enough digits. can't find it again.
      • 2013-01-01 00132, 2013

      • ianmcorvidae
        k