IOError: [Errno 2] No such file or directory: u'C:\\temp\\\U00010916\U00010916\U00010916\U00010916\U00010916\U00010916\U00010916...
2013-06-16 16710, 2013
xlotlu
thank you
2013-06-16 16728, 2013
Sophist-UK
np
2013-06-16 16758, 2013
JonnyJD
Sebastinas and others: FYI I will release libdiscid 0.5.1 this week (planned for thursday) (changelog already updated)
2013-06-16 16749, 2013
xlotlu
Sophist-UK: one more thing and i'm done bugging you: len(u'\U00010916')
2013-06-16 16706, 2013
Sophist-UK
2
2013-06-16 16700, 2013
xlotlu
and that is how 1 equals 2 :))
2013-06-16 16702, 2013
Sophist-UK
I think that u'\U00010916' is actually two unicode characters.
2013-06-16 16726, 2013
Sophist-UK
u'\U0001' and u'\U0916'
2013-06-16 16732, 2013
nikki
no
2013-06-16 16739, 2013
Sophist-UK
No?
2013-06-16 16711, 2013
Sophist-UK
When I try to convert to ascii using u'\U00010916'.encode('ascii','replace') it gives '??'
2013-06-16 16714, 2013
nikki
it's one character (u+10916), but it gets encoded using two characters in some encodings
2013-06-16 16756, 2013
nikki
hence xlotlu's comment :P
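To make nikki's point concrete, here is a minimal sketch, not something posted in the channel. It assumes Python 3.3 or later, where `len` counts code points, and shows how many code units U+10916 occupies in each encoding:

```python
ch = '\U00010916'  # one code point outside the Basic Multilingual Plane

# Number of code units per encoding (the -le variants avoid a BOM).
utf8_units = len(ch.encode('utf-8'))            # 8-bit units
utf16_units = len(ch.encode('utf-16-le')) // 2  # 16-bit units
utf32_units = len(ch.encode('utf-32-le')) // 4  # 32-bit units

print(len(ch), utf8_units, utf16_units, utf32_units)  # 1 4 2 1
```

A narrow Python 2 build stores the string as UTF-16 code units internally, which is why `len` reports 2 there.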
2013-06-16 16736, 2013
Sophist-UK
Ah yes - so I see - a unicode > 65535
2013-06-16 16740, 2013
nikki
yep
2013-06-16 16743, 2013
JonnyJD
Sophist-UK: the point about actual unicode character support is to count these 2-byte characters as one, making string length and character by character comparisons sane again
2013-06-16 16732, 2013
Sophist-UK
According to a page I just found:
2013-06-16 16745, 2013
Sophist-UK
the behavior of the CPython interpreter in versions earlier than 3.3 depends on whether it is built with "narrow" or "wide" Unicode support. So the same code, such as a call to len, can have a different result in different builds of the standard interpreter.
2013-06-16 16732, 2013
xlotlu
it turns that character into a "surrogate pair", i.e. u'\ud802\udd16' ... and funnily, u'\ud802\udd16'.encode('utf-8').decode('utf-8') --> u'\U00010916'
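The pair xlotlu quotes follows from the standard UTF-16 arithmetic. A small sketch of that calculation (the function names are hypothetical, chosen only for illustration):

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not outside the BMP')
    offset = code_point - 0x10000      # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10916)
print(hex(high), hex(low))  # 0xd802 0xdd16
```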
2013-06-16 16733, 2013
JonnyJD
Sophist-UK: doesn't that depend on if you are using a python 3 str (which is unicode in python 2) or a bytes (which is str in python 2)
2013-06-16 16723, 2013
Sophist-UK
JJD: No idea. Never really got to grips fully with unicode.
2013-06-16 16723, 2013
xlotlu
JonnyJD: it's a unicode talk. so it's about strs
2013-06-16 16739, 2013
JonnyJD is somewhat interested in the topic since he has 2-3 python projects that are supposed to run on Python 2 and 3 (unchanged)
2013-06-16 16720, 2013
xlotlu
bytes are bytes are bytes. unicodes are ... something virtual. bytes only when represented in an encoding
2013-06-16 16744, 2013
Sophist-UK
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
2013-06-16 16700, 2013
JonnyJD
Anyways, my machine doesn't return two for len(u'\U00010916'). Why the heck is there a compile option that breaks this?
2013-06-16 16722, 2013
xlotlu
JonnyJD: linux?
2013-06-16 16731, 2013
Sophist-UK
pep261 defines this stuff
2013-06-16 16732, 2013
JonnyJD
xlotlu: Arch Linux, yes
2013-06-16 16748, 2013
JonnyJD
PEP 261 is 12 years old
2013-06-16 16755, 2013
Sophist-UK
Yup
2013-06-16 16708, 2013
Sophist-UK
And it's still causing discussions like this.
2013-06-16 16745, 2013
Sophist-UK
Pretty stupid thing to do IMO - allow differing results depending on how you build python.exe
2013-06-16 16704, 2013
xlotlu
there's a point when every programmer gets smacked behind the head with unicode. for me it was today.
Biggest problem on Windows so far was having a Unicode-enabled console..
2013-06-16 16756, 2013
Sophist-UK
So - whilst my knowledge of unicode is lower than basic, it seems to me that you are stuffed - when passed a string with a mixture of unicode <=65535 and > 65535 you probably cannot tell whether it is u'\u00010916' or u'\u0001' followed by u'\u0916'
2013-06-16 16701, 2013
JonnyJD
FreeBSD 8 is wide. Mac OS X really is narrow (Python 2.7 tested on both)
2013-06-16 16711, 2013
xlotlu
you can. you check for wide vs narrow, and if on narrow, you look for surrogate pairs
and probably tell people to use Python 3.3 if available. (tested on Windows, the length is one starting with Python 3.3, as announced)
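A rough sketch of the approach xlotlu describes: check the build, then treat each surrogate pair as one character. The helper name `count_code_points` is made up for this example, and the code is written for Python 3 (on Python 2 the literals would need a `u` prefix):

```python
import sys

# True only on narrow builds (Python 2, or Python 3 before 3.3).
IS_NARROW_BUILD = sys.maxunicode == 0xFFFF


def count_code_points(text):
    """Count code points, treating each high+low surrogate pair as one."""
    count = 0
    i = 0
    while i < len(text):
        unit = ord(text[i])
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(text)
                and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF):
            i += 2  # a well-formed surrogate pair
        else:
            i += 1  # a BMP character (or a stray surrogate)
        count += 1
    return count
```

On a wide build or on Python 3.3+ the pair branch is simply never taken for well-formed strings, so the function gives the same answer as `len`.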
2013-06-16 16715, 2013
Sophist-UK
I have seen a page where you encode as UTF-8 and then use a regex to count characters based on values < .
2013-06-16 16725, 2013
Sophist-UK
128
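The page itself is not identified, so the following is only a guess at the kind of recipe meant. In UTF-8 every code point contributes exactly one byte that is not a continuation byte (0x80 to 0xBF), so counting those bytes counts code points. The test is therefore "not a continuation byte" rather than simply "below 128". The function name is hypothetical:

```python
def count_via_utf8(text):
    """Count code points by counting non-continuation bytes in UTF-8."""
    data = bytearray(text.encode('utf-8'))
    # Continuation bytes look like 10xxxxxx; everything else starts a character.
    return sum(1 for b in data if (b & 0xC0) != 0x80)
```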
2013-06-16 16753, 2013
xlotlu
there's more recipes. i like a one-liner with struct (which i don't understand) :)
2013-06-16 16703, 2013
JonnyJD
FYI: NetBSD 6.0 is also narrow (in contrast to FreeBSD)
2013-06-16 16728, 2013
xlotlu
Sophist-UK: the long-charactered representation is with capital U. it means that character's number in the Unicode specification
2013-06-16 16742, 2013
Sophist-UK
Ah.
2013-06-16 16706, 2013
Sophist-UK
So, where is the unicode data with > 65535 originating from then?
2013-06-16 16754, 2013
xlotlu
from outside the basic multilingual plane :)
2013-06-16 16739, 2013
JonnyJD
OpenBSD is also narrow, same as NetBSD. Up until now I thought that is only a Windows problem :-/
2013-06-16 16722, 2013
xlotlu
JonnyJD: it really doesn't matter. the function will have to test for wide / narrow, and act accordingly
2013-06-16 16739, 2013
xlotlu
the real mess is how the filesystem actually stores the data
2013-06-16 16702, 2013
xlotlu
s/data/filename
2013-06-16 16746, 2013
xlotlu
it seems on windows it does just like narrow python does, but i'm not sure..
2013-06-16 16751, 2013
xlotlu
and JonnyJD: why do you have that many OSs? :)
2013-06-16 16708, 2013
JonnyJD
libdiscid supports a shitload of platforms
2013-06-16 16717, 2013
xlotlu
ah
2013-06-16 16756, 2013
Sophist-UK
I mean, is this data from MB or from Windows via an API or ...
2013-06-16 16718, 2013
JonnyJD
and since I maintain it, I try to maintain a couple testing platforms. I am missing a solaris system with a physical disc drive and I only borrow Mac Systems with physical drives. Apart from that, I have all major platforms available now.
2013-06-16 16702, 2013
xlotlu
Sophist-UK: you mean libdiscid's data, or my data? 'cause mine is filenames in picard
2013-06-16 16707, 2013
Sophist-UK
xlotlu: yours - so we are talking existing files whose filenames include u'\U00010916'?
2013-06-16 16724, 2013
xlotlu
filenames that include \U > 65535
2013-06-16 16734, 2013
xlotlu
actually, it gets even messier
2013-06-16 16700, 2013
xlotlu
apparently windows (but i'm not sure!) stores that in a pair of surrogates
2013-06-16 16712, 2013
nikki
hm, I wonder if my non-bmp characters are being stored as surrogates
2013-06-16 16755, 2013
xlotlu
os x does the same, but also uses a specific form of normalization that stores composite characters as two, i.e. ă --> a + ˘
2013-06-16 16719, 2013
xlotlu
and on everything else it's "simple", because they don't use unicode filenames, but simply bytes
2013-06-16 16734, 2013
nikki mutters something about nfd being evil
2013-06-16 16751, 2013
xlotlu
and they don't care about the representation of those bytes. it's userspace doing the translation from whatever into the user's encoding
2013-06-16 16722, 2013
xlotlu
nikki: i wonder if one could actually write a diacritic in a filename
2013-06-16 16726, 2013
Sophist-UK
So, if filename has unicode characters greater than 65535 how do you distinguish them from two characters <= 65535?
2013-06-16 16732, 2013
nikki
xlotlu: hm?
2013-06-16 16716, 2013
nikki
every so often I end up going around tidying up combining characters that probably came from people using some function to generate the tags from the filename and then adding stuff to mb using that data
2013-06-16 16717, 2013
xlotlu
how does it know if i meant ă and not really ˘a?
2013-06-16 16746, 2013
xlotlu
Sophist-UK: they're in a specific range
2013-06-16 16758, 2013
nikki
for a start, the combining characters go after the base characters :P
2013-06-16 16712, 2013
xlotlu
you get my point :P
2013-06-16 16705, 2013
nikki
if you want them separate and there's a standalone diacritic in unicode, I don't *think* nfd decomposes it... if it does, you'd have to combine it with a space or something
2013-06-16 16712, 2013
xlotlu
Sophist-UK: They are divided into leading or "high surrogates" (D800–DBFF) and trailing or "low surrogates" (DC00–DFFF). In UTF-16, they must always appear in pairs, ....
2013-06-16 16735, 2013
Sophist-UK
I am beginning to wish I hadn't asked LOL
2013-06-16 16703, 2013
Sophist-UK
So the answer is that you can distinguish, but the len method is broken?
2013-06-16 16723, 2013
xlotlu
yes
2013-06-16 16748, 2013
Sophist-UK
Actually, unicode ob
2013-06-16 16752, 2013
xlotlu
it really isn't "broken", because it counts what it has stored internally. it's just that it's.. broken :)
2013-06-16 16712, 2013
Sophist-UK
= False
2013-06-16 16720, 2013
Sophist-UK
Broken and not broken = False
2013-06-16 16729, 2013
xlotlu
the really crappy part is not just len() is broken, but [:limit] is too
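One possible way around the slicing problem, assuming the limit is expressed in UTF-16 code units (as a narrow build counts them) and is non-negative. This is an illustrative sketch with a made-up function name, not Picard's code:

```python
def truncate_utf16(text, limit):
    """Keep at most `limit` UTF-16 code units without splitting a pair."""
    data = bytearray(text.encode('utf-16-le'))
    cut = data[:2 * limit]
    if len(cut) >= 2:
        last_unit = cut[-2] | (cut[-1] << 8)  # little-endian 16-bit unit
        if 0xD800 <= last_unit <= 0xDBFF:     # high surrogate left dangling
            cut = cut[:-2]
    return bytes(cut).decode('utf-16-le')
```

With `'ab\U00010916c'` and a limit of 3, a naive slice on a narrow build would keep `a`, `b` and half of the pair; this version returns `'ab'` instead.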
2013-06-16 16705, 2013
Sophist-UK
It sounds like you get UTF-8 or UTF-16 encoded unicode, and can decode it OK if you know to do so, but once you store it in a unicode object then you are stuffed.
2013-06-16 16755, 2013
Sophist-UK
I have no idea why someone didn't reimplement the py2 unicode object like a py3 string.
2013-06-16 16716, 2013
nikki
xlotlu: ah, nfd doesn't change it, nfkd changes it to space + combining breve
2013-06-16 16752, 2013
xlotlu
meaning you can't tell the difference?
2013-06-16 16734, 2013
nikki
well, the space stops it from combining with the 'a' because it combines with the space instead
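What nikki describes can be checked with the standard `unicodedata` module. A short sketch (Python 3 syntax), assuming the standalone diacritic in question is U+02D8 BREVE:

```python
import unicodedata

a_breve = '\u0103'  # LATIN SMALL LETTER A WITH BREVE
breve = '\u02d8'    # standalone BREVE

nfd_a = unicodedata.normalize('NFD', a_breve)      # 'a' + combining breve
nfd_breve = unicodedata.normalize('NFD', breve)    # left unchanged
nfkd_breve = unicodedata.normalize('NFKD', breve)  # space + combining breve

for s in (nfd_a, nfd_breve, nfkd_breve):
    print([hex(ord(c)) for c in s])
```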
2013-06-16 16729, 2013
xlotlu
and nfd is the one that is, or the one that was..?
2013-06-16 16712, 2013
xlotlu
i wonder if somewhere deep inside OS X's guts there's some 5000-line code to check if one character is allowed to combine with the next
2013-06-16 16747, 2013
nikki
I'm not sure what the question was
2013-06-16 16756, 2013
xlotlu
if os x uses nfd or nfkd
2013-06-16 16759, 2013
nikki
oh
2013-06-16 16700, 2013
nikki
nfd
2013-06-16 16705, 2013
Sophist-UK
Alternative is to move Picard to Py3.
2013-06-16 16730, 2013
nikki
nfkd loses various bits of info
2013-06-16 16714, 2013
xlotlu
Sophist-UK: that would only solve part of the problem. i still have to count the code points in whatever flavour the os has
2013-06-16 16751, 2013
Sophist-UK
Well - *if* in Py3 there is automatic decoding from the UTF-8/16 which Windows uses into the fully unicode enabled string type, then len and [:limit] will work as expected without any coding on your part.
2013-06-16 16710, 2013
Sophist-UK
P.S. Can I discuss with you file naming in Windows?
2013-06-16 16741, 2013
xlotlu
sure
2013-06-16 16744, 2013
Sophist-UK
At present the code which gets rid of disallowed characters in windows path/file names is a bit basic.
2013-06-16 16721, 2013
Sophist-UK
According to MSDN, the following characters are disallowed: <>:"/\|?*
2013-06-16 16725, 2013
xlotlu
it's a regexp, isn't it?
2013-06-16 16738, 2013
Sophist-UK
So we have code to replace these with an underscore.
2013-06-16 16702, 2013
Sophist-UK
WIBNI... the code was a little more intelligent in how it replaced characters...
2013-06-16 16726, 2013
Sophist-UK
So < could be replaced with [, > with ]
2013-06-16 16742, 2013
Sophist-UK
" with '
2013-06-16 16730, 2013
xlotlu
umh
2013-06-16 16759, 2013
Sophist-UK
And :/\|?* as follows (using - as an example):
2013-06-16 16729, 2013
Sophist-UK
using : as an example I mean...
2013-06-16 16713, 2013
Sophist-UK
space:space => space-space
2013-06-16 16731, 2013
Sophist-UK
:space => space-space
2013-06-16 16747, 2013
Sophist-UK
space: => space-
2013-06-16 16710, 2013
Sophist-UK
: (no spaces either side) => -
2013-06-16 16749, 2013
xlotlu
i don't follow
2013-06-16 16751, 2013
Sophist-UK
Also leading and trailing . are disallowed - currently translate to _ but propose they are dropped entirely.
2013-06-16 16732, 2013
Sophist-UK
"Beatles: Hard days night" currently comes out as "Beatles_ Hard days night"
2013-06-16 16737, 2013
xlotlu
i kinda agree on the quotes. i did stare at that dot stuff, didn't understand why that'd be..
2013-06-16 16700, 2013
Sophist-UK
But it would be better as "Beatles - Hard days night"
2013-06-16 16712, 2013
Sophist-UK
But not "Beatles- Hard days night"
2013-06-16 16714, 2013
misterswag joined the channel
2013-06-16 16739, 2013
Sophist-UK
i.e. Trying to make it more readable automatically.
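A sketch of how the proposed replacement rules might look in code. This is only an interpretation of the proposal above, not Picard's actual implementation; `sanitize_component` is a hypothetical name, and it is meant for a single file or directory name rather than a whole path:

```python
import re

# Direct one-for-one replacements from the proposal.
_SIMPLE = {'<': '[', '>': ']', '"': "'"}
# The remaining disallowed characters, all treated like ':'.
_SEPARATORS = r'[:/\\|?*]'


def sanitize_component(name):
    """Replace characters Windows disallows with more readable ones."""
    name = ''.join(_SIMPLE.get(ch, ch) for ch in name)
    # "x: y" and "x : y" both become "x - y".
    name = re.sub(r' ?' + _SEPARATORS + r' ', ' - ', name)
    # What is left: "x :y" becomes "x -y" and "x:y" becomes "x-y".
    name = re.sub(_SEPARATORS, '-', name)
    # Leading and trailing dots are dropped rather than replaced.
    return name.strip('.')


print(sanitize_component('Beatles: Hard days night'))  # Beatles - Hard days night
```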
2013-06-16 16754, 2013
xlotlu
it makes some sense
2013-06-16 16703, 2013
Sophist-UK
In Ux, a leading . means hidden.
2013-06-16 16716, 2013
Sophist-UK
You can't have a leading . in Windows
2013-06-16 16756, 2013
SultS_
you can… maybe not in every windows though
2013-06-16 16707, 2013
Sophist-UK
This would be an empty filename and only a file extension which is not allowed (thanks to MS-DOS 1.0 from c. 1980).
2013-06-16 16733, 2013
xlotlu
makes sense :)
2013-06-16 16743, 2013
Sophist-UK
You can have just a . - but this is shorthand for current directory and can't be used for a file.
2013-06-16 16720, 2013
xlotlu
submit an issue? i'm only messing about with filename lengths, so won't touch this
2013-06-16 16732, 2013
Sophist-UK
So an mp3 file with filename trailing. would be trailing..mp3.
2013-06-16 16742, 2013
xlotlu
but personally, i only fully agree with the quotes. the rest are debatable
2013-06-16 16712, 2013
Sophist-UK
Yes - but equally debatable is just doing a dumb substitution for underscore.
2013-06-16 16714, 2013
xlotlu
it may not look pretty, but that period had some semantic meaning
2013-06-16 16728, 2013
xlotlu
which the windows folks with extensions disabled will cry for :)
2013-06-16 16753, 2013
Sophist-UK
Do you mean extensions hidden rather than disabled?