At the moment we're using "detect" to guess encoding on urls, maybe we should try utf-8 first always, and if it fails, try and guess
kepstin-laptop
yeah, ä is 0xC3 0xA4 in UTF-8
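A minimal Python sketch of that UTF-8-first idea, assuming the URL path is percent-encoded and using a chardet-style detector only as the fallback; the helper name and the chardet dependency are illustrative assumptions, not the actual MusicBrainz code:

from urllib.parse import unquote_to_bytes

import chardet  # assumed fallback detector, purely for illustration


def decode_url_path(path):
    raw = unquote_to_bytes(path)        # "%C3%A4" -> b"\xc3\xa4"
    try:
        return raw.decode("utf-8")      # always try UTF-8 first
    except UnicodeDecodeError:
        guess = chardet.detect(raw)["encoding"] or "latin-1"
        return raw.decode(guess, errors="replace")


assert "ä".encode("utf-8") == b"\xc3\xa4"    # ä really is 0xC3 0xA4 in UTF-8
print(decode_url_path("Universit%C3%A4t"))   # -> Universität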
nikki
I would just use utf-8 first and if it fails, leave it encoded
ocharles
nikki: the problem is then we can't display a pretty wikipedia name
nikki
why not?
wikipedia uses utf-8
kepstin-laptop
nikki: some of the wikipedias don't.
ocharles
nikki: the examples in that ticket aren't utf-8
nor is that de. url above
oh, sorry
that one is
but the one in the ticket is latin-1
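For contrast, the same character percent-encoded both ways; the ticket's actual URL isn't quoted here, so "Universität" stands in as a hypothetical example:

from urllib.parse import unquote_to_bytes

latin1 = unquote_to_bytes("Universit%E4t")     # Latin-1 ä is the single byte 0xE4
utf8 = unquote_to_bytes("Universit%C3%A4t")    # UTF-8 ä is the two bytes 0xC3 0xA4

print(utf8.decode("utf-8"))      # "Universität"
print(latin1.decode("latin-1"))  # "Universität"
try:
    latin1.decode("utf-8")
except UnicodeDecodeError:
    print("a lone 0xE4 byte is not valid UTF-8")  # this is what trips the decoder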
nikki
actually they do. that example redirects
I only left it unedited so that I had an example I could find
ocharles
so we'd have to resolve the URL to display it, which is also not really an option
here's my real suggestion
kepstin-laptop
ocharles: you could leave them unencoded, until someone comes along and fixes them to use the correct UTF-8 url.
nikki
kepstin-laptop: exactly
ocharles
Adding/editing URLs should only allow utf-8 encoding. If it's not utf-8, we present the user with a list of how it would look in various encodings, and they can correct it
nikki
that's what we did pre-ngs and it worked just fine
ocharles
In the database, we need to find a list of URLs that aren't utf-8, and just clean them up
kepstin-laptop
ocharles: presumably there are some legacy sites that don't use UTF-8 in urls tho; you can't just convert them, you'll get 404s.
ocharles
and we need to ensure that we *only* store utf-8 encoding in the database
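A sketch of how that cleanup pass could flag non-UTF-8 rows, assuming plain access to the stored URL strings; the sample list is made up and nothing here reflects the real MusicBrainz schema:

from urllib.parse import unquote_to_bytes


def is_utf8_url(url):
    try:
        unquote_to_bytes(url).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


stored_urls = [
    "https://de.wikipedia.org/wiki/Universit%C3%A4t",  # UTF-8, fine
    "https://example.org/Universit%E4t",               # Latin-1, needs cleanup
]
needs_cleanup = [u for u in stored_urls if not is_utf8_url(u)]
print(needs_cleanup)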
nikki
and when a site only accepts non-utf-8? we tell the user they can't add it because we can't implement something we used to have?
ocharles
ok, then we need to store the encoding
nikki
I don't know why you're making it so bloody complicated
ocharles
getting emotional isn't going to help...
nikki
no, it isn't
ocharles
i'm not making it complicated, I'm making it correct
ocharles shrugs
kepstin-laptop
well, character sets are hard to detect, and requiring a user to manually select one would be quite a pain.
ocharles
kepstin-laptop: I was going to hide that complexity from the user though
instead of saying "is this utf-8 or latin-1" just say "which of these looks correct?"
<option value="encoding scheme">[ url in that encoding ]</option>
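One way that option list could be built: try a handful of candidate encodings and keep only the ones that decode the raw bytes cleanly, which is also the filtering ocharles describes below. The candidate list and helper name are assumptions for illustration:

from urllib.parse import unquote_to_bytes
from html import escape

CANDIDATES = ["utf-8", "latin-1", "windows-1252", "shift_jis", "koi8-r"]


def candidate_decodings(url):
    raw = unquote_to_bytes(url)
    seen = {}
    for enc in CANDIDATES:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue                   # only offer encodings that decode successfully
        seen.setdefault(text, enc)     # collapse candidates that render identically
    return seen


for text, enc in candidate_decodings("Universit%E4t").items():
    print('<option value="%s">%s</option>' % (enc, escape(text)))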
kepstin-laptop
ocharles: so, an incomplete list? what do you pick if none are correct?
ijabz joined the channel
ocharles
kepstin-laptop: how would that be the case?
kepstin-laptop
ocharles: there are a lot of character sets.
ocharles
right
kepstin-laptop
so either a very long list, or an incomplete list :/
ocharles
well, the list would only display stuff that successfully decodes from the bytes in the url to text
kepstin-laptop
for something that honestly really doesn't matter that much.
ocharles
it matters if we want to display them human readable
(in the wikipedia case)
kepstin-laptop
right now, all the sites where you use a human-readable version take UTF-8.
and the random urls to things like blogs, etc. don't have a human-readable name - none would make sense, so the character encoding doesn't matter.
ocharles
so what was nikki trying to suggest? if it doesn't decode, then just display the URL and nothing else?