IOError: [Errno 2] No such file or directory: u'C:\\temp\\\U00010916\U00010916\U00010916\U00010916\U00010916\U00010916\U00010916...
xlotlu
thank you
Sophist-UK
np
JonnyJD
Sebastinas and others: FYI I will release libdiscid 0.5.1 this week (planned for thursday) (changelog already updated)
xlotlu
Sophist-UK: one more thing and i'm done bugging you: len(u'\U00010916')
Sophist-UK
2
xlotlu
and that is how 1 equals 2 :))
Sophist-UK
I think that u'\U00010916' is actually two unicode characters.
u'\U0001' and u'\U0916'
nikki
no
Sophist-UK
No?
When I try to convert to ascii using u'\U00010916'.encode('ascii','replace') it gives '??'
nikki
it's one character (u+10916), but it gets encoded using two characters in some encodings
hence xlotlu's comment :P
Sophist-UK
Ah yes - so I see - a unicode > 65535
nikki
yep
JonnyJD
Sophist-UK: the point about actual unicode character support is to count these 2-byte characters as one, making string length and character by character comparisons sane again
Sophist-UK
According to a page I just found:
the behavior of the CPython interpreter in versions earlier than 3.3 depends on whether it is built with "narrow" or "wide" Unicode support. So the same code, such as a call to len, can have a different result in different builds of the standard interpreter.
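For reference, a minimal sketch of the narrow/wide distinction being discussed; on CPython before 3.3, sys.maxunicode tells the builds apart, and len() of a non-BMP character differs between them:

    import sys

    # 0xFFFF on a "narrow" build (UTF-16 code units),
    # 0x10FFFF on a "wide" build (full code points) and on Python 3.3+
    NARROW_BUILD = sys.maxunicode == 0xFFFF

    s = u'\U00010916'
    print(len(s))  # 2 on a narrow build (surrogate pair), 1 on a wide build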
xlotlu
it turns that character into a "surrogate pair", i.e. u'\ud802\udd16' ... and funnily, u'\ud802\udd16'.encode('utf-8').decode('utf-8') --> u'\U00010916'
JonnyJD
Sophist-UK: doesn't that depend on whether you are using a python 3 str (which is unicode in python 2) or a bytes (which is str in python 2)?
Sophist-UK
JJD: No idea. Never really got to grips fully with unicode.
xlotlu
JonnyJD: it's a unicode talk. so it's about strs
JonnyJD is somewhat interested in the topic since he has 2-3 python projects that are supposed to run on Python 2 and 3 (unchanged)
bytes are bytes are bytes. unicodes are ... something virtual. bytes only when represented in an encoding
Sophist-UK
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
JonnyJD
Anyways, my machine doesn't return two for len(u'\U00010916'). Why the heck is there a compile option that breaks this?
xlotlu
JonnyJD: linux?
Sophist-UK
PEP 261 defines this stuff
JonnyJD
xlotlu: Arch Linux, yes
PEP 261 is 12 years old
Sophist-UK
Yup
And it's still causing discussions like this.
Pretty stupid thing to do IMO - allow differing results depending on how you build python.exe
xlotlu
there's a point when every programmer gets smacked behind the head with unicode. for me it was today.
Biggest problem on Windows so far was having a Unicode-enabled console..
Sophist-UK
So - whilst my knowledge of unicode is lower than basic, it seems to me that you are stuffed - when passed a string with a mixture of unicode <=65535 and > 65535 you probably cannot tell whether it is u'\u00010916' or u'\u0001' followed by u'\u0916'
JonnyJD
FreeBSD 8 is wide. Mac OS X really is narrow (Python 2.7 tested on both)
xlotlu
you can. you check for wide vs narrow, and if on narrow, you look for surrogate pairs
and probably tell people to use Python 3.3 if available. (tested on Windows, the length is one starting with Python 3.3, as announced)
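A hedged sketch of the approach xlotlu describes (count a surrogate pair as one character on narrow builds); the function name code_point_len is illustrative, not anything from Picard:

    import re
    import sys

    # high surrogate followed by low surrogate: one non-BMP code point
    _SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]')

    def code_point_len(u):
        if sys.maxunicode > 0xFFFF:
            return len(u)  # wide build / Python 3.3+: len already counts code points
        # narrow build: each pair was counted twice, so subtract the extras
        return len(u) - len(_SURROGATE_PAIR.findall(u))

    print(code_point_len(u'\U00010916'))  # 1 on both narrow and wide builds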
Sophist-UK
I have seen a page where you encode as UTF-8 and then use a regex to count characters based on values < 128
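A sketch in the same spirit as that recipe (not necessarily the exact page): in UTF-8 every code point has exactly one lead byte, so counting the bytes that are not continuation bytes (0x80-0xBF) counts code points regardless of the build:

    import re

    def utf8_len(u):
        return len(re.findall(b'[^\x80-\xbf]', u.encode('utf-8')))

    print(utf8_len(u'\U00010916'))  # 1 on narrow and wide builds alike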
xlotlu
there are more recipes. i like a one-liner with struct (which i don't understand) :)
JonnyJD
FYI: NetBSD 6.0 is also narrow (in contrast to FreeBSD)
xlotlu
Sophist-UK: the long-charactered representation is with capital U. it means that character's number in the Unicode specification
Sophist-UK
Ah.
So, where is the unicode data with > 65535 originating from then?
xlotlu
from outside the basic multilingual plane :)
JonnyJD
OpenBSD is also narrow, same as NetBSD. Up until now I thought that is only a Windows problem :-/
xlotlu
JonnyJD: it really doesn't matter. the function will have to test for wide / narrow, and act accordingly
the real mess is how the filesystem actually stores the data
s/data/filename
it seems on windows it does just like narrow python does, but i'm not sure..
and JonnyJD: why do you have that many OSs? :)
JonnyJD
libdiscid supports a shitload of platforms
xlotlu
ah
Sophist-UK
I mean, is this data from MB or from Windows via an API or ...
JonnyJD
and since I maintain it, I try to maintain a couple testing platforms. I am missing a solaris system with a physical disc drive and I only borrow Mac Systems with physical drives. Apart from that, I have all major platforms available now.
xlotlu
Sophist-UK: you mean libdiscid's data, or my data? 'cause mine's filenames in picard
Sophist-UK
xlotlu: yours - so we are talking existing files whose filenames include u'\U10916'?
xlotlu
filenames that include \U > 65535
actually, it gets even messier
apparently windows (but i'm not sure!) stores that in a pair of surrogates
nikki
hm, I wonder if my non-bmp characters are being stored as surrogates
xlotlu
os x does the same, but also uses a specific form of normalization that stores composite characters as two, i.e. ă --> a + ˘
and on everything else it's "simple", because they don't use unicode filenames, but simply bytes
nikki mutters something about nfd being evil
and they don't care about the representation of those bytes. it's userspace doing the translation from whatever into the user's encoding
nikki: i wonder if one could actually write a diacritic in a filename
Sophist-UK
So, if filename has unicode characters greater than 65535 how do you distinguish them from two characters <= 65535?
nikki
xlotlu: hm?
every so often I end up going around tidying up combining characters that probably came from people using some function to generate the tags from the filename and then adding stuff to mb using that data
xlotlu
how does it know if i meant ă and not really ˘a?
Sophist-UK: they're in a specific range
nikki
for a start, the combining characters go after the base characters :P
xlotlu
you get my point :P
nikki
if you want them separate and there's a standalone diacritic in unicode, I don't *think* nfd decomposes it... if it does, you'd have to combine it with a space or something
xlotlu
Sophist-UK: They are divided into leading or "high surrogates" (D800–DBFF) and trailing or "low surrogates" (DC00–DFFF). In UTF-16, they must always appear in pairs, ....
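A small sketch of what those ranges mean in practice: recombining a high/low surrogate pair into the real code point (the arithmetic comes straight from the UTF-16 definition):

    def combine_surrogates(hi, lo):
        # hi in D800-DBFF, lo in DC00-DFFF, per the ranges quoted above
        assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

    print(hex(combine_surrogates(0xD802, 0xDD16)))  # 0x10916, i.e. u'\U00010916'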
Sophist-UK
I am beginning to wish I hadn't asked LOL
So the answer is that you can distinguish, but the len method is broken?
xlotlu
yes
Sophist-UK
Actually, unicode ob
xlotlu
it really isn't "broken", because it counts what it has stored internally. it's just that it's.. broken :)
Sophist-UK
= False
Broken and not broken = False
xlotlu
the really crappy part is not just len() is broken, but [:limit] is too
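A hedged sketch of why [:limit] hurts on narrow builds, and one way to truncate without cutting a surrogate pair in half (safe_truncate is an illustrative name, not Picard code):

    import sys

    def safe_truncate(u, limit):
        cut = u[:limit]
        # on a narrow build, don't end on a lone high surrogate
        if sys.maxunicode == 0xFFFF and cut and u'\ud800' <= cut[-1] <= u'\udbff':
            cut = cut[:-1]
        return cut

    # on a narrow build u'a' + u'\U00010916' has len 3, and [:2] would split the pair
    print(repr(safe_truncate(u'a' + u'\U00010916', 2)))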
Sophist-UK
It sounds like you get UTF-8 or UTF-16 encoded unicode, and can decode it OK if you know to do so, but once you store it in a unicode object then you are stuffed.
I have no idea why someone didn't reimplement the py2 unicode object like a py3 string.
nikki
xlotlu: ah, nfd doesn't change it, nfkd changes it to space + combining breve
xlotlu
meaning you can't tell the difference?
nikki
well, the space stops it from combining with the 'a' because it combines with the space instead
xlotlu
and nfd is the one that is, or the one that was..?
i wonder if somewhere deep inside OS X's guts there's some 5000-line code to check if one character is allowed to combine with the next
nikki
I'm not sure what the question was
xlotlu
if os x uses nfd or nfkd
nikki
oh
nfd
Sophist-UK
Alternative is to move Picard to Py3.
nikki
nfkd loses various bits of info
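A quick stdlib check of what nikki describes, using unicodedata (the code points in the comments are the expected results):

    import unicodedata

    a_breve = u'\u0103'   # ă
    breve = u'\u02d8'     # standalone ˘

    print([hex(ord(c)) for c in unicodedata.normalize('NFD', a_breve)])
    # ['0x61', '0x306']  -> 'a' + combining breve
    print([hex(ord(c)) for c in unicodedata.normalize('NFD', breve)])
    # ['0x2d8']          -> NFD leaves the standalone breve alone
    print([hex(ord(c)) for c in unicodedata.normalize('NFKD', breve)])
    # ['0x20', '0x306']  -> NFKD turns it into space + combining breve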
xlotlu
Sophist-UK: that would only solve part of the problem. i still have to count the code points in whatever flavour the os has
Sophist-UK
Well - *if* in Py3 there is automatic conversion from the UTF-8/16 which Windows uses into the fully unicode-enabled string type, then len and [:limit] will work as expected without any coding on your part.
P.S. Can I discuss with you file naming in Windows?
xlotlu
sure
Sophist-UK
At present the code which gets rid of disallowed characters in windows path/file names is a bit basic.
According to MSDN, the following characters are disallowed: <>:"/\|?*
xlotlu
it's a regexp, isn't it?
Sophist-UK
So we have code to replace these with an underscore.
WIBNI... the code was a little more intelligent in how it replaced characters...
So < could be replaced with [, > with ]
" with '
xlotlu
umh
Sophist-UK
And :/\|?* as follows (using : as an example):
space:space => space-space
:space => space-space
space: => space-
: (no spaces either side) => -
xlotlu
i don't follow
Sophist-UK
Also leading and trailing . are disallowed - currently translate to _ but propose they are dropped entirely.
"Beatles: Hard days night" currently comes out as "Beatles_ Hard days night"
xlotlu
i kinda agree on the quotes. i did stare at that dot stuff, didn't understand why that'd be..
Sophist-UK
But it would be better as "Beatles - Hard days night"
But not "Beatles- Hard days night"
i.e. Trying to make it more readable automatically.
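A rough sketch of the "more intelligent" replacement being proposed; the helper name and exact rules are illustrative, not Picard's current implementation:

    import re

    # lookalike replacements for the paired/quote characters
    _LOOKALIKE = {u'<': u'[', u'>': u']', u'"': u"'"}

    def _dash(match):
        # ":" and friends become "-"; if a space follows, make sure one precedes,
        # so "Beatles: Hard days night" -> "Beatles - Hard days night"
        before, after = match.group(1), match.group(2)
        if after and not before:
            before = u' '
        return before + u'-' + after

    def sanitize_windows_name(name):
        for bad, good in _LOOKALIKE.items():
            name = name.replace(bad, good)
        name = re.sub(u'( ?)[:/\\\\|?*]( ?)', _dash, name)
        # drop leading/trailing dots instead of turning them into underscores
        return name.strip(u'.')

    print(sanitize_windows_name(u'Beatles: Hard days night'))
    # Beatles - Hard days night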
xlotlu
it makes some sense
Sophist-UK
In Unix, a leading . means hidden.
You can't have a leading . in Windows
SultS_
you can… maybe not in every windows though
Sophist-UK
This would be an empty filename and only a file extension, which is not allowed (thanks to MS-DOS 1.0 from c. 1980).
xlotlu
makes sense :)
Sophist-UK
You can have just a . - but this is shorthand for current directory and can't be used for a file.
xlotlu
submit an issue? i'm only messing about with filename lengths, so won't touch this
Sophist-UK
So an mp3 file with filename "trailing." would be "trailing..mp3".
xlotlu
but personally, i only fully agree with the quotes. the rest are debatable
Sophist-UK
Yes - but equally debatable is just doing a dumb substitution for underscore.
xlotlu
it may not look pretty, but that period had some semantic meaning
which the windows folks with extensions disabled will cry for :)
Sophist-UK
Do you mean extensions hidden rather than disabled?