IOError: [Errno 2] No such file or directory: u'C:\\temp\\\U00010916\U00010916\U00010916\U00010916\U00010916\U00010916\U00010916...
xlotlu
thank you
Sophist-UK
np
JonnyJD
Sebastinas and others: FYI I will release libdiscid 0.5.1 this week (planned for thursday) (changelog already updated)
xlotlu
Sophist-UK: one more thing and i'm done bugging you: len(u'\U00010916')
Sophist-UK
2
xlotlu
and that is how 1 equals 2 :))
Sophist-UK
I think that u'\U00010916' is actually two unicode characters.
u'\U0001' and u'\U0916'
nikki
no
Sophist-UK
No?
When I try to convert to ascii using u'\U00010916'.encode('ascii','replace') it gives '??'
nikki
it's one character (u+10916), but it gets encoded using two characters in some encodings
hence xlotlu's comment :P
Sophist-UK
Ah yes - so I see - a unicode > 65535
nikki
yep
JonnyJD
Sophist-UK: the point about actual unicode character support is to count these 2-byte characters as one, making string length and character by character comparisons sane again
Sophist-UK
According to a page I just found:
the behavior of the CPython interpreter in versions earlier than 3.3 depends on whether it is built with "narrow" or "wide" Unicode support. So the same code, such as a call to len, can have a different result in different builds of the standard interpreter.
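For reference, a minimal sketch of the narrow/wide distinction being discussed; on CPython before 3.3, sys.maxunicode tells the builds apart, and len() of a non-BMP character differs between them:

    import sys

    # 0xFFFF on a "narrow" build (UTF-16 code units),
    # 0x10FFFF on a "wide" build (full code points) and on Python 3.3+
    NARROW_BUILD = sys.maxunicode == 0xFFFF

    s = u'\U00010916'
    print(len(s))  # 2 on a narrow build (surrogate pair), 1 on a wide build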
xlotlu
it turns that character into a "surrogate pair", i.e. u'\ud802\udd16' ... and funnily, u'\ud802\udd16'.encode('utf-8').decode('utf-8') --> u'\U00010916'
JonnyJD
Sophist-UK: doesn't that depend on whether you are using a python 3 str (which is unicode in python 2) or a bytes (which is str in python 2)?
Sophist-UK
JJD: No idea. Never really got to grips fully with unicode.
xlotlu
JonnyJD: it's a unicode talk. so it's about strs
JonnyJD is somewhat interested in the topic since he has 2-3 python projects that are supposed to run on Python 2 and 3 (unchanged)
bytes are bytes are bytes. unicodes are ... something virtual. bytes only when represented in an encoding
Sophist-UK
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
JonnyJD
Anyways, my machine doesn't return two for len(u'\U00010916'). Why the heck is there a compile option that breaks this?
xlotlu
JonnyJD: linux?
Sophist-UK
PEP 261 defines this stuff
JonnyJD
xlotlu: Arch Linux, yes
PEP 261 is 12 years old
Sophist-UK
Yup
And it's still causing discussions like this.
Pretty stupid thing to do IMO - allow differing results depending on how you build python.exe
xlotlu
there's a point when every programmer gets smacked behind the head with unicode. for me it was today.
Biggest problem on Windows so far was having a Unicode-enabled console..
Sophist-UK
So - whilst my knowledge of unicode is lower than basic, it seems to me that you are stuffed - when passed a string with a mixture of unicode <=65535 and > 65535 you probably cannot tell whether it is u'\u00010916' or u'\u0001' followed by u'\u0916'
JonnyJD
FreeBSD 8 is wide. Mac OS X really is narrow (Python 2.7 tested on both)
xlotlu
you can. you check for wide vs narrow, and if on narrow, you look for surrogate pairs
and probably tell people to use Python 3.3 if available. (tested on Windows, the length is one starting with Python 3.3, as announced)
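A hedged sketch of the approach xlotlu describes (count a surrogate pair as one character on narrow builds); the function name code_point_len is illustrative, not anything from Picard:

    import re
    import sys

    # high surrogate followed by low surrogate: one non-BMP code point
    _SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]')

    def code_point_len(u):
        if sys.maxunicode > 0xFFFF:
            return len(u)  # wide build / Python 3.3+: len already counts code points
        # narrow build: each pair was counted twice, so subtract the extras
        return len(u) - len(_SURROGATE_PAIR.findall(u))

    print(code_point_len(u'\U00010916'))  # 1 on both narrow and wide builds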
Sophist-UK
I have seen a page where you encode as UTF-8 and then use a regex to count characters based on values < 128
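A sketch in the same spirit as that recipe (not necessarily the exact page): in UTF-8 every code point has exactly one lead byte, so counting the bytes that are not continuation bytes (0x80-0xBF) counts code points regardless of the build:

    import re

    def utf8_len(u):
        return len(re.findall(b'[^\x80-\xbf]', u.encode('utf-8')))

    print(utf8_len(u'\U00010916'))  # 1 on narrow and wide builds alike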
xlotlu
there are more recipes. i like a one-liner with struct (which i don't understand) :)
JonnyJD
FYI: NetBSD 6.0 is also narrow (in contrast to FreeBSD)
xlotlu
Sophist-UK: the long-charactered representation is with capital U. it means that character's number in the Unicode specification
Sophist-UK
Ah.
So, where is the unicode data with > 65535 originating from then?
xlotlu
from outside the basic multilingual plane :)
JonnyJD
OpenBSD is also narrow, same as NetBSD. Up until now I thought that is only a Windows problem :-/
xlotlu
JonnyJD: it really doesn't matter. the function will have to test for wide / narrow, and act accordingly
the real mess is how the filesystem actually stores the data
s/data/filename
it seems on windows it does just like narrow python does, but i'm not sure..
and JonnyJD: why do you have that many OSs? :)
JonnyJD
libdiscid supports a shitload of platforms
xlotlu
ah
Sophist-UK
I mean, is this data from MB or from Windows via an API or ...
JonnyJD
and since I maintain it, I try to maintain a couple testing platforms. I am missing a solaris system with a physical disc drive and I only borrow Mac Systems with physical drives. Apart from that, I have all major platforms available now.
xlotlu
Sophist-UK: you mean libdiscid's data, or my data? 'cause mine's filenames in picard
Sophist-UK
xlotlu: yours - so we are talking existing files whose filenames include u'\U10916'?
xlotlu
filenames that include \U > 65535
actually, it gets even messier
apparently windows (but i'm not sure!) stores that in a pair of surrogates
nikki
hm, I wonder if my non-bmp characters are being stored as surrogates
xlotlu
os x does the same, but also uses a specific form of normalization that stores composite characters as two, i.e. ă --> a + ˘
and on everything else it's "simple", because they don't use unicode filenames, but simply bytes
nikki mutters something about nfd being evil
and they don't care about the representation of those bytes. it's userspace doing the translation from whatever into the user's encoding
nikki: i wonder if one could actually write a diacritic in a filename
Sophist-UK
So, if filename has unicode characters greater than 65535 how do you distinguish them from two characters <= 65535?
nikki
xlotlu: hm?
every so often I end up going around tidying up combining characters that probably came from people using some function to generate the tags from the filename and then adding stuff to mb using that data
xlotlu
how does it know if i meant ă and not really ˘a?
Sophist-UK: they're in a specific range
nikki
for a start, the combining characters go after the base characters :P
xlotlu
you get my point :P
nikki
if you want them separate and there's a standalone diacritic in unicode, I don't *think* nfd decomposes it... if it does, you'd have to combine it with a space or something
xlotlu
Sophist-UK: They are divided into leading or "high surrogates" (D800–DBFF) and trailing or "low surrogates" (DC00–DFFF). In UTF-16, they must always appear in pairs, ....
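A small sketch of what those ranges mean in practice: recombining a high/low surrogate pair into the real code point (the arithmetic comes straight from the UTF-16 definition):

    def combine_surrogates(hi, lo):
        # hi in D800-DBFF, lo in DC00-DFFF, per the ranges quoted above
        assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

    print(hex(combine_surrogates(0xD802, 0xDD16)))  # 0x10916, i.e. u'\U00010916'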
Sophist-UK
I am beginning to wish I hadn't asked LOL
So the answer is that you can distinguish, but the len method is broken?
xlotlu
yes
Sophist-UK
Actually, unicode ob
xlotlu
it really isn't "broken", because it counts what it has stored internally. it's just that it's.. broken :)
Sophist-UK
= False
Broken and not broken = False
xlotlu
the really crappy part is not just len() is broken, but [:limit] is too
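A hedged sketch of why [:limit] hurts on narrow builds, and one way to truncate without cutting a surrogate pair in half (safe_truncate is an illustrative name, not Picard code):

    import sys

    def safe_truncate(u, limit):
        cut = u[:limit]
        # on a narrow build, don't end on a lone high surrogate
        if sys.maxunicode == 0xFFFF and cut and u'\ud800' <= cut[-1] <= u'\udbff':
            cut = cut[:-1]
        return cut

    # on a narrow build u'a' + u'\U00010916' has len 3, and [:2] would split the pair
    print(repr(safe_truncate(u'a' + u'\U00010916', 2)))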
Sophist-UK
It sounds like you get UTF-8 or UTF-16 encoded unicode, and can decode it OK if you know to do so, but once you store it in a unicode object then you are stuffed.
I have no idea why someone didn't reimplement the py2 unicode object like a py3 string.
nikki
xlotlu: ah, nfd doesn't change it, nfkd changes it to space + combining breve
xlotlu
meaning you can't tell the difference?
nikki
well, the space stops it from combining with the 'a' because it combines with the space instead
xlotlu
and nfd is the one that is, or the one that was..?
i wonder if somewhere deep inside OS X's guts there's some 5000-line code to check if one character is allowed to combine with the next
nikki
I'm not sure what the question was
xlotlu
if os x uses nfd or nfkd
nikki
oh
nfd
Sophist-UK
Alternative is to move Picard to Py3.
nikki
nfkd loses various bits of info
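A quick stdlib check of what nikki describes, using unicodedata (the code points in the comments are the expected results):

    import unicodedata

    a_breve = u'\u0103'   # ă
    breve = u'\u02d8'     # standalone ˘

    print([hex(ord(c)) for c in unicodedata.normalize('NFD', a_breve)])
    # ['0x61', '0x306']  -> 'a' + combining breve
    print([hex(ord(c)) for c in unicodedata.normalize('NFD', breve)])
    # ['0x2d8']          -> NFD leaves the standalone breve alone
    print([hex(ord(c)) for c in unicodedata.normalize('NFKD', breve)])
    # ['0x20', '0x306']  -> NFKD turns it into space + combining breve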
xlotlu
Sophist-UK: that would only solve part of the problem. i still have to count the code points in whatever flavour the os has
Sophist-UK
Well - *if* in Py3 there is automatic conversion from the UTF-8/16 which Windows uses into the fully unicode-enabled string type, then len and [:limit] will work as expected without any coding on your part.
P.S. Can I discuss with you file naming in Windows?
xlotlu
sure
Sophist-UK
At present the code which gets rid of disallowed characters in windows path/file names is a bit basic.
According to MSDN, the following characters are disallowed: <>:"/\|?*
xlotlu
it's a regexp, isn't it?
Sophist-UK
So we have code to replace these with an underscore.
WIBNI... the code was a little more intelligent in how it replaced characters...
So < could be replaced with [, > with ]
" with '
xlotlu
umh
Sophist-UK
And :/\|?* as follows (using : as an example):
space:space => space-space
:space => space-space
space: => space-
: (no spaces either side) => -
xlotlu
i don't follow
Sophist-UK
Also leading and trailing . are disallowed - currently translate to _ but propose they are dropped entirely.
"Beatles: Hard days night" currently comes out as "Beatles_ Hard days night"
xlotlu
i kinda agree on the quotes. i did stare at that dot stuff, didn't understand why that'd be..
Sophist-UK
But it would be better as "Beatles - Hard days night"
But not "Beatles- Hard days night"
i.e. Trying to make it more readable automatically.
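A rough sketch of the "more intelligent" replacement being proposed; the helper name and exact rules are illustrative, not Picard's current implementation:

    import re

    # lookalike replacements for the paired/quote characters
    _LOOKALIKE = {u'<': u'[', u'>': u']', u'"': u"'"}

    def _dash(match):
        # ":" and friends become "-"; if a space follows, make sure one precedes,
        # so "Beatles: Hard days night" -> "Beatles - Hard days night"
        before, after = match.group(1), match.group(2)
        if after and not before:
            before = u' '
        return before + u'-' + after

    def sanitize_windows_name(name):
        for bad, good in _LOOKALIKE.items():
            name = name.replace(bad, good)
        name = re.sub(u'( ?)[:/\\\\|?*]( ?)', _dash, name)
        # drop leading/trailing dots instead of turning them into underscores
        return name.strip(u'.')

    print(sanitize_windows_name(u'Beatles: Hard days night'))
    # Beatles - Hard days night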
xlotlu
it makes some sense
Sophist-UK
In Unix, a leading . means hidden.
You can't have a leading . in Windows
SultS_
you can… maybe not in every windows though
Sophist-UK
This would be an empty filename and only a file extension, which is not allowed (thanks to MS-DOS 1.0 from c. 1980).
xlotlu
makes sense :)
Sophist-UK
You can have just a . - but this is shorthand for current directory and can't be used for a file.
xlotlu
submit an issue? i'm only messing about with filename lengths, so won't touch this
Sophist-UK
So an mp3 file with filename "trailing." would be "trailing..mp3".
xlotlu
but personally, i only fully agree with the quotes. the rest are debatable
Sophist-UK
Yes - but equally debatable is just doing a dumb substitution for underscore.
xlotlu
it may not look pretty, but that period had some semantic meaning
which the windows folks with extensions disabled will cry for :)
Sophist-UK
Do you mean extensions hidden rather than disabled?