#musicbrainz-devel

      • Sophist-UK
        <open file u'C:\\temp\\\U00010916\U00010916\U00010916\U00010916\U00010916\U00010916\U00010916\U00010916\U00010916\U00010916\U00010916...
      • xlotlu
        goodie. and 200?
      • Sophist-UK
        open(os.path.join("C:\\temp", u'\U00010916' * 200), 'w')
      • IOError: [Errno 2] No such file or directory: u'C:\\temp\\\U00010916\U00010916\U00010916\U00010916\U00010916\U00010916\U00010916...
      • xlotlu
        thank you
      • Sophist-UK
        np
      • JonnyJD
        Sebastinas and others: FYI I will release libdiscid 0.5.1 this week (planned for thursday) (changelog already updated)
      • xlotlu
        Sophist-UK: one more thing and i'm done bugging you: len(u'\U00010916')
      • Sophist-UK
        2
      • xlotlu
        and that is how 1 equals 2 :))
      • Sophist-UK
        I think that u'\U00010916' is actually two unicode characters.
      • u'\U0001' and u'\U0916'
      • nikki
        no
      • Sophist-UK
        No?
      • When I try to convert to ascii using u'\U00010916'.encode('ascii','replace') it gives '??'
      • nikki
        it's one character (u+10916), but it gets encoded using two characters in some encodings
      • hence xlotlu's comment :P
      • Sophist-UK
        Ah yes - so I see - a unicode > 65535
      • nikki
        yep
      • JonnyJD
        Sophist-UK: the point about actual unicode character support is to count these 2-byte characters as one, making string length and character by character comparisons sane again
      • Sophist-UK
        According to a page I just found:
      • the behavior of the CPython interpreter in versions earlier than 3.3 depends on whether it is built with "narrow" or "wide" Unicode support. So the same code, such as a call to len, can have a different result in different builds of the standard interpreter.
      • xlotlu
        it turns that character into a "surrogate pair", i.e. u'\ud802\udd16' ... and funnily, u'\ud802\udd16'.encode('utf-8').decode('utf-8') --> u'\U00010916'
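A minimal sketch of the arithmetic behind the surrogate pair xlotlu quotes, following the UTF-16 encoding rule. The function name `to_surrogate_pair` is a hypothetical helper for illustration and did not appear in the channel.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP have a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value to spread over two units
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10916)
print(hex(high), hex(low))  # 0xd802 0xdd16
```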
      • JonnyJD
        Sophist-UK: doesn't that depend on if you are using a python 3 str (which is unicode in python 2) or a bytes (which is str in python 2)
      • Sophist-UK
        JJD: No idea. Never really got to grips fully with unicode.
      • xlotlu
        JonnyJD: it's a unicode talk. so it's about strs
      • JonnyJD is somewhat interested in the topic since he has 2-3 python projects that are supposed to run on Python 2 and 3 (unchanged)
      • bytes are bytes are bytes. unicodes are ... something virtual. bytes only when represented in an encoding
      • Sophist-UK
        ValueError: unichr() arg not in range(0x10000) (narrow Python build)
      • JonnyJD
        Anyways, my machine doesn't return two for len(u'\U00010916'). Why the heck is there a compile option that breaks this?
      • xlotlu
        JonnyJD: linux?
      • Sophist-UK
        pep261 defines this stuff
      • JonnyJD
        xlotlu: Arch Linux, yes
      • PEP 261 is 12 years old
      • Sophist-UK
        Yup
      • And it's still causing discussions like this.
      • Pretty stupid thing to do IMO - allow differing results depending on how you build python.exe
      • xlotlu
        there's a point when every programmer gets smacked behind the head with unicode. for me it was today.
      • Sophist-UK
        I suppose it is fixed fully in Python3
      • JonnyJD
        So it is Windows where this still doesn't work?
      • xlotlu
        it's narrow on mac too
      • JonnyJD
        Hm, I should test this on my BSDs
      • Biggest problem on Windows so far was having a Unicode-enabled console..
      • Sophist-UK
        So - whilst my knowledge of unicode is lower than basic, it seems to me that you are stuffed - when passed a string with a mixture of unicode <=65535 and > 65535 you probably cannot tell whether it is u'\u00010916' or u'\u0001' followed by u'\u0916'
      • JonnyJD
        FreeBSD 8 is wide. Mac OS X really is narrow (Python 2.7 tested on both)
      • xlotlu
        you can. you check for wide vs narrow, and if on narrow, you look for surrogate pairs
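What xlotlu describes could look roughly like the sketch below. It is written for Python 3 so it runs today, and the `narrow` parameter is an assumption added only so the surrogate-scanning path can be exercised on a wide build. The names `is_narrow_build` and `count_code_points` are hypothetical.

```python
import sys


def is_narrow_build():
    """True when the interpreter stores text as UTF-16 code units."""
    return sys.maxunicode == 0xFFFF


def count_code_points(text, narrow=None):
    """Count code points, treating a high+low surrogate pair as one."""
    if narrow is None:
        narrow = is_narrow_build()
    if not narrow:
        return len(text)  # wide build: one item per code point
    count = 0
    i = 0
    while i < len(text):
        is_high = 0xD800 <= ord(text[i]) <= 0xDBFF
        has_low = i + 1 < len(text) and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF
        i += 2 if (is_high and has_low) else 1  # skip both halves of a pair
        count += 1
    return count
```

On a narrow Python 2 build, `u'\U00010916'` is stored as `u'\ud802\udd16'`, which is the case the loop handles.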
      • JonnyJD
        and probably tell people to use Python 3.3 if available. (tested on Windows, the length is one starting with Python 3.3, as announced)
      • Sophist-UK
        I have seen a page where you encode as UTF-8 and then use a regex to count characters based on values < 128.
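The page Sophist-UK refers to is not identified, so the following is only one way such a recipe can work, not necessarily the one on that page: encode to UTF-8 and count every byte that is not a continuation byte (0x80 to 0xBF), since each code point contributes exactly one such byte. The name `count_code_points_utf8` is hypothetical.

```python
import re

# Matches any byte that starts a UTF-8 sequence (ASCII or a lead byte).
_NOT_CONTINUATION = re.compile(br'[^\x80-\xbf]')


def count_code_points_utf8(text):
    """Count code points by counting non-continuation bytes in UTF-8."""
    data = text.encode('utf-8')
    return len(_NOT_CONTINUATION.findall(data))


print(count_code_points_utf8('a\U00010916\u0103'))  # 3
```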
      • xlotlu
        there's more recipes. i like a one-liner with struct (which i don't understand) :)
      • JonnyJD
        FYI: NetBSD 6.0 is also narrow (in contrast to FreeBSD)
      • xlotlu
        Sophist-UK: the long-charactered representation is with capital U. it means that character's number in the Unicode specification
      • Sophist-UK
        Ah.
      • So, where is the unicode data with > 65535 originating from then?
      • xlotlu
        from outside the basic multilingual plane :)
      • JonnyJD
        OpenBSD is also narrow, same as NetBSD. Up until now I thought that is only a Windows problem :-/
      • xlotlu
        JonnyJD: it really doesn't matter. the function will have to test for wide / narrow, and act accordingly
      • the real mess is how the filesystem actually stores the data
      • s/data/filename
      • it seems on windows it does just like narrow python does, but i'm not sure..
      • and JonnyJD: why do you have that many OSs? :)
      • JonnyJD
        libdiscid supports a shitload of platforms
      • xlotlu
        ah
      • Sophist-UK
        I mean, is this data from MB or from Windows via an API or ...
      • JonnyJD
        and since I maintain it, I try to maintain a couple testing platforms. I am missing a solaris system with a physical disc drive and I only borrow Mac Systems with physical drives. Apart from that, I have all major platforms available now.
      • xlotlu
        Sophist-UK: you mean libdiscid's data, or my data? 'cause mine's is filenames in picard
      • Sophist-UK
        xlotlu: yours - so we are talking existing files whose filenames include u'\U00010916'?
      • xlotlu
        filenames that include \U > 65535
      • actually, it gets even messier
      • apparently windows (but i'm not sure!) stores that in a pair of surrogates
      • nikki
        hm, I wonder if my non-bmp characters are being stored as surrogates
      • xlotlu
        os x does the same, but also uses a specific form of normalization that stores composite characters as two, i.e. ă --> a + ˘
      • and on everything else it's "simple", because they don't use unicode filenames, but simply bytes
      • nikki mutters something about nfd being evil
      • and they don't care about the representation of those bytes. it's userspace doing the translation from whatever into the user's encoding
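The decomposition xlotlu mentions can be reproduced with the standard `unicodedata` module. This sketch uses plain NFD; whether OS X's filesystem form matches it in every detail is not something the discussion establishes.

```python
import unicodedata

composed = '\u0103'                                  # ă as a single code point
decomposed = unicodedata.normalize('NFD', composed)  # a + combining breve

print([hex(ord(c)) for c in decomposed])   # ['0x61', '0x306']
print(len(composed), len(decomposed))      # 1 2

# NFC puts the two code points back together.
recomposed = unicodedata.normalize('NFC', decomposed)
print(recomposed == composed)              # True
```

The same visible character therefore has two different lengths, which matters for any code that limits filename length by counting.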
      • nikki: i wonder if one could actually write a diacritic in a filename
      • Sophist-UK
        So, if filename has unicode characters greater than 65535 how do you distinguish them from two characters <= 65535?
      • nikki
        xlotlu: hm?
      • every so often I end up going around tidying up combining characters that probably came from people using some function to generate the tags from the filename and then adding stuff to mb using that data
      • xlotlu
        how does it know if i meant ă and not really ˘a?
      • Sophist-UK: they're in a specific range
      • nikki
        for a start, the combining characters go after the base characters :P
      • xlotlu
        you get my point :P
      • nikki
        if you want them separate and there's a standalone diacritic in unicode, I don't *think* nfd decomposes it... if it does, you'd have to combine it with a space or something
      • xlotlu
        Sophist-UK: They are divided into leading or "high surrogates" (D800–DBFF) and trailing or "low surrogates" (DC00–DFFF). In UTF-16, they must always appear in pairs, ....
      • Sophist-UK
        I am beginning to wish I hadn't asked LOL
      • So the answer is that you can distinguish, but the len method is broken?
      • xlotlu
        yes
      • Sophist-UK
        Actually, unicode ob
      • xlotlu
        it really isn't "broken", because it counts what it has stored internally. it's just that it's.. broken :)
      • Sophist-UK
        = False
      • Broken and not broken = False
      • xlotlu
        the really crappy part is not just len() is broken, but [:limit] is too
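A hedged sketch of a slice that does not cut a surrogate pair in half. It assumes `limit` is a non-negative count of stored units, and it is written for Python 3, with `'\ud802\udd16'` standing in for what a narrow build stores. The name `safe_truncate` is hypothetical and is not Picard's code.

```python
def safe_truncate(text, limit):
    """Return text[:limit], backing off by one unit if that splits a pair."""
    cut = text[:limit]
    if (
        cut
        and len(text) > limit
        and 0xD800 <= ord(cut[-1]) <= 0xDBFF      # cut ends on a high surrogate
        and 0xDC00 <= ord(text[limit]) <= 0xDFFF  # its low half was cut away
    ):
        cut = cut[:-1]
    return cut


name = 'ab\ud802\udd16'          # narrow-build view of u'ab\U00010916'
print(len(safe_truncate(name, 3)))  # 2
```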
      • Sophist-UK
        It sounds like you get UTF-8 or UTF-16 encoded unicode, and can decode it OK if you know to do so, but once you store it in a unicode object then you are stuffed.
      • I have no idea why someone didn't reimplement the py2 unicode object like a py3 string.
      • nikki
        xlotlu: ah, nfd doesn't change it, nfkd changes it to space + combining breve
      • xlotlu
        meaning you can't tell the difference?
      • nikki
        well, the space stops it from combining with the 'a' because it combines with the space instead
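nikki's observation about the standalone breve (U+02D8) can be checked with `unicodedata`; this is a small illustration only.

```python
import unicodedata

breve = '\u02d8'  # ˘ as a standalone, spacing character

nfd = unicodedata.normalize('NFD', breve)
nfkd = unicodedata.normalize('NFKD', breve)

print(nfd == breve)                   # True: NFD leaves it alone
print([hex(ord(c)) for c in nfkd])    # ['0x20', '0x306']: space + combining breve
```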
      • xlotlu
        and nfd is the one that is, or the one that was..?
      • i wonder if somewhere deep inside OS X's guts there's some 5000-line code to check if one character is allowed to combine with the next
      • nikki
        I'm not sure what the question was
      • xlotlu
        if os x uses nfd or nfkd
      • nikki
        oh
      • nfd
      • Sophist-UK
        Alternative is to move Picard to Py3.
      • nikki
        nfkd loses various bits of info
      • xlotlu
        Sophist-UK: that would only solve part of the problem. i still have to count the code points in whatever flavour the os has
      • Sophist-UK
        Well - *if* in Py3 there are automatic encoding from UTF-8/16 which Windows uses to the fully unicode enabled string function, then len and [:limit] will work as expected without any coding on your part.
      • P.S. Can I discuss with you file naming in Windows?
      • xlotlu
        sure
      • Sophist-UK
        At present the code which gets rid of disallowed characters in windows path/file names is a bit basic.
      • According to MSDN, the following characters are disallowed: <>:"/\|?*
      • xlotlu
        it's a regexp, isn't it?
      • Sophist-UK
        So we have code to replace these with an underscore.
      • WIBNI... the code was a little more intelligent in how it replaced characters...
      • So < could be replaced with [, > with ]
      • " with '
      • xlotlu
        umh
      • Sophist-UK
        And :/\|?* as follows (using - as an example):
      • using : as an example I mean...
      • space:space => space-space
      • :space => space-space
      • space: => space-
      • : (no spaces either side) => -
      • xlotlu
        i don't follow
      • Sophist-UK
        Also leading and trailing . are disallowed - currently translate to _ but propose they are dropped entirely.
      • "Beatles: Hard days night" currently comes out as "Beatles_ Hard days night"
      • xlotlu
        i kinda agree on the quotes. i did stare at that dot stuff, didn't understand why that'd be..
      • Sophist-UK
        But it would be better as "Beatles - Hard days night"
      • But not "Beatles- Hard days night"
      • misterswag joined the channel
      • i.e. Trying to make it more readable automatically.
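Sophist-UK's proposal could be sketched as below. This is not Picard's actual code: the name `sanitize_windows_name` is hypothetical, it assumes the input is a single file name rather than a full path, and it folds in his suggestion to drop leading and trailing dots.

```python
import re

_LOOKALIKES = {'<': '[', '>': ']', '"': "'"}
_SEPARATORS = r'[:/\\|?*]'


def sanitize_windows_name(name):
    """Replace characters Windows disallows, aiming for readable output."""
    # < > " get look-alike replacements instead of underscores.
    name = ''.join(_LOOKALIKES.get(ch, ch) for ch in name)
    # "x: y" becomes "x - y": add the space that is missing before the dash.
    name = re.sub(r'(?<=\S)' + _SEPARATORS + r'(?= )', ' -', name)
    # Every other separator becomes a plain dash, surrounding spaces kept.
    name = re.sub(_SEPARATORS, '-', name)
    # Leading and trailing dots are dropped entirely.
    return name.strip('.')


print(sanitize_windows_name('Beatles: Hard days night'))  # Beatles - Hard days night
```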
      • xlotlu
        it makes some sense
      • Sophist-UK
        In Ux, a leading . means hidden.
      • You can't have a leading . in Windows
      • SultS_
        you can… maybe not in every windows though
      • Sophist-UK
        This would be an empty filename and only a file extension which is not allowed (thanks to MS-DOS 1.0 from c. 1980).
      • xlotlu
        makes sense :)
      • Sophist-UK
        You can have just a . - but this is shorthand for current directory and can't be used for a file.
      • xlotlu
        submit an issue? i'm only messing about with filename lengths, so won't touch this
      • Sophist-UK
        So an mp3 file with filename trailing. would be trailing..mp3.
      • xlotlu
        but personally, i only fully agree with the quotes. the rest are debatable
      • Sophist-UK
        Yes - but equally debatable is just doing a dumb substitution for underscore.
      • xlotlu
        it may not look pretty, but that period had some semantic meaning
      • which the windows folks with extensions disabled will cry for :)
      • Sophist-UK
        Do you mean extensions hidden rather than disabled?
      • xlotlu
        hidden indeed
      • misterswag joined the channel