At the moment we're using "detect" to guess encoding on urls, maybe we should try utf-8 first always, and if it fails, try and guess
2011-09-21 26458, 2011
kepstin-laptop
yeah, ä is 0xC3 0xA4 in UTF-8
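To make the byte values concrete, here is a small sketch using Python's standard library. It shows how the same character ends up as different percent-escapes depending on the encoding chosen; the choice of Python is an assumption for illustration and says nothing about the language the site's code is written in.

```python
from urllib.parse import quote

# "ä" is two bytes in UTF-8 but a single byte in Latin-1.
utf8_bytes = "ä".encode("utf-8")      # b"\xc3\xa4"
latin1_bytes = "ä".encode("latin-1")  # b"\xe4"

# The percent-escaped form in a URL therefore differs per encoding.
utf8_escaped = quote("ä")                         # UTF-8 is the default
latin1_escaped = quote("ä", encoding="latin-1")

print(utf8_escaped, latin1_escaped)
```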
2011-09-21 26459, 2011
nikki
I would just use utf-8 first and if it fails, leave it encoded
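A minimal sketch of that suggestion, assuming Python and a hypothetical helper name `pretty_url` (neither comes from the discussion): decode the percent-escapes as UTF-8, and if the bytes are not valid UTF-8, hand back the URL exactly as stored. The example.org URLs are made up for illustration.

```python
from urllib.parse import unquote_to_bytes

def pretty_url(url: str) -> str:
    """Return a human-readable form of url, or url unchanged if it is not UTF-8."""
    raw = unquote_to_bytes(url)  # percent-escapes become raw bytes
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: leave the URL percent-encoded rather than guess.
        return url

print(pretty_url("http://example.org/K%C3%A4se"))  # UTF-8 escapes
print(pretty_url("http://example.org/K%E4se"))     # Latin-1 escape, left as-is
```

Note that this unescapes every percent-escape, including reserved characters, so the result is only suitable for display and should not be used as the link target.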
2011-09-21 26419, 2011
ocharles
nikki: the problem is then we can't display a pretty wikipedia name
2011-09-21 26423, 2011
nikki
why not?
2011-09-21 26427, 2011
nikki
wikipedia uses utf-8
2011-09-21 26432, 2011
kepstin-laptop
nikki: some of the wikipedias don't.
2011-09-21 26436, 2011
ocharles
nikki: the example in that ticket isn't utf-8
2011-09-21 26446, 2011
ocharles
nor is that de. url above
2011-09-21 26452, 2011
ocharles
oh, sorry
2011-09-21 26454, 2011
ocharles
that one is
2011-09-21 26458, 2011
ocharles
but the one in the ticket is latin-1
2011-09-21 26402, 2011
nikki
actually they do. that example redirects
2011-09-21 26416, 2011
nikki
I only left it unedited so that I had an example I could find
2011-09-21 26432, 2011
ocharles
so we'd have to resolve the URL to display it, which is also not really an option
2011-09-21 26442, 2011
ocharles
here's my real suggestion
2011-09-21 26400, 2011
kepstin-laptop
ocharles: you could leave them unencoded, until someone comes along and fixes them to use the correct UTF-8 url.
2011-09-21 26408, 2011
nikki
kepstin-laptop: exactly
2011-09-21 26408, 2011
ocharles
Adding/editing URLs should only allow utf-8 encoding. If it's not utf-8, we present the user with a list of how it would look in various encodings, and they can correct it
2011-09-21 26417, 2011
nikki
that's what we did pre-ngs and it worked just fine
2011-09-21 26431, 2011
ocharles
In the database, we need to find a list of URLs that aren't utf-8, and just clean them up
2011-09-21 26456, 2011
kepstin-laptop
ocharles: presumably there are some legacy sites that don't use UTF-8 in urls tho; you can't just convert them, you'll get 404s.
2011-09-21 26456, 2011
ocharles
and we need to ensure that we *only* store utf-8 encoding in the database
2011-09-21 26415, 2011
nikki
and when a site only accepts non-utf-8? we tell the user they can't add it because we can't implement something we used to have?
2011-09-21 26434, 2011
ocharles
ok, then we need to store the encoding
2011-09-21 26446, 2011
nikki
I don't know why you're making it so bloody complicated
2011-09-21 26458, 2011
ocharles
getting emotional isn't going to help...
2011-09-21 26403, 2011
nikki
no, it isn't
2011-09-21 26406, 2011
ocharles
i'm not making it complicated, I'm making it correct
2011-09-21 26416, 2011
ocharles shrugs
2011-09-21 26436, 2011
kepstin-laptop
well, character sets are hard to detect, and requiring a user to manually select one would be quite a pain.
2011-09-21 26453, 2011
ocharles
kepstin-laptop: I was going to hide that complexity from the user though
2011-09-21 26404, 2011
ocharles
instead of saying "is this utf-8 or latin-1" just say "which of these looks correct?"
2011-09-21 26421, 2011
ocharles
<option value="encoding scheme">[ url in that encoding ]</option>
2011-09-21 26425, 2011
kepstin-laptop
ocharles: so, an incomplete list? what do you pick if none are correct?
2011-09-21 26435, 2011
ijabz joined the channel
2011-09-21 26439, 2011
ocharles
kepstin-laptop: how would that be the case?
2011-09-21 26453, 2011
kepstin-laptop
ocharles: there are a lot of character sets.
2011-09-21 26411, 2011
ocharles
right
2011-09-21 26442, 2011
kepstin-laptop
so either a very long list, or an incomplete list :/
2011-09-21 26401, 2011
ocharles
well, the list would only display stuff that successfully decodes from the bytes in the url to text
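The "only list what decodes" idea could be sketched roughly as follows. This assumes Python; the helper name `decoding_options`, the short candidate list, and the example.org URL are all hypothetical and only serve the illustration.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical, deliberately short candidate list.
CANDIDATES = ("utf-8", "latin-1", "cp1251")

def decoding_options(url, encodings=CANDIDATES):
    """Return (encoding, decoded_url) pairs for every encoding that decodes cleanly."""
    raw = unquote_to_bytes(url)
    options = []
    for enc in encodings:
        try:
            options.append((enc, raw.decode(enc)))
        except UnicodeDecodeError:
            continue  # these bytes are invalid in this encoding, so skip it
    return options

for enc, text in decoding_options("http://example.org/K%E4se"):
    print(enc, text)
```

For this input, UTF-8 is filtered out but both single-byte encodings remain. Latin-1 in particular decodes any byte sequence without error, so filtering on successful decoding alone does not shorten the list much; the user still has to choose among the single-byte candidates.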
2011-09-21 26401, 2011
kepstin-laptop
for something that honestly really doesn't matter that much.
2011-09-21 26414, 2011
ocharles
it matters if we want to display them in human-readable form
2011-09-21 26418, 2011
ocharles
(in the wikipedia case)
2011-09-21 26431, 2011
kepstin-laptop
right now, all the sites where you use a human-readable version take UTF-8.
2011-09-21 26416, 2011
kepstin-laptop
and the random urls to things like blogs, etc. don't have a human-readable name - none would make sense, so the character encoding doesn't matter.
2011-09-21 26422, 2011
ocharles
so what was nikki trying to suggest? if it doesn't decode, then just display the URL and nothing else?