Image uploads from Mac OS X to a Wikipedia that uses UTF-8 result in the image not
being found later. This appears to be independent of the browser used. (I am not
experiencing this bug myself, as I don't have a Mac; I am just reporting
something that has been discussed on the German Wikipedia:
http://de.wikipedia.org/w/wiki.phtml?title=Wikipedia_Diskussion:UTF8-Probleme#Umlaute_in_Upload_Dateinamen_bei_Mac_OS_X
(in German).)
The reason for this problem seems to be that the Mac OS X file system uses a
different decomposition policy for file names than other operating
systems and most browsers do. To me it seems that the best solution (and The
Right Thing) would be to perform Unicode canonicalization (see
http://www.unicode.org/notes/tn5/) on the server side, on the names of uploaded
files, but also on search terms and article titles.
To clarify: in Unicode (and therefore in UTF-8) there are often several ways of
expressing the same character. For instance, there is a separate character for
"ü", but it can also be expressed as "u" followed by a combining diaeresis (the
"dots"). The two representations are (or should be) equivalent, but the wiki
software does not handle them as such. It would be best to enforce a consistent
internal canonical form by normalizing all incoming Unicode.
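To illustrate the two representations, here is a minimal sketch in Python (not the language the wiki software is written in; it is used here only for demonstration). The helper name canonical() is made up for this example, and the choice of NFC as the target form is an assumption; any single normalization form applied consistently would do.

```python
import unicodedata

composed = "\u00fc"      # "ü" as one precomposed code point
decomposed = "u\u0308"   # "u" followed by COMBINING DIAERESIS

# The two strings render identically but differ code point by code point,
# so a plain string comparison treats them as different names.
print(composed == decomposed)
print(len(composed), len(decomposed))


def canonical(text: str) -> str:
    """Hypothetical helper: bring text into one canonical form (NFC assumed)."""
    return unicodedata.normalize("NFC", text)


# After normalization both spellings compare equal.
print(canonical(composed) == canonical(decomposed))
```

Applying the same normalization to every incoming string is what makes the comparison reliable; which form is chosen matters less than choosing one.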
The following appeared on the mailing list unicode@unicode.org:
FYI, by far the largest source of text in NFD (decomposed) form in Mac
OS X is the file system. File names are stored this way (for historical
reasons), so anything copied from a file name is in (a slightly altered
form of) NFD.
Also, a few keyboard layouts generate text that is partly decomposed,
for ease of typing (e.g., Vietnamese).
Deborah Goldsmith
Internationalization, Unicode liaison
Apple Computer, Inc.
goldsmit@apple.com
This makes it quite clear that this is not a bug on the part of Mac OS X; it is a
classic incompatibility, which should be handled by the server.
Version: 1.3.x
Severity: normal
OS: Mac OS X 10.0
Platform: Macintosh