
File-Uploads from MacOS X has Problems with UTF-8
Closed, Resolved, Public

Description

Image uploads from Mac OS X to a Wikipedia that uses UTF-8 result in the image
not being found later. This appears to be independent of the browser used. (I'm
not experiencing this bug myself, as I don't have a Mac; I'm just reporting
something that has been discussed on the German Wikipedia:
http://de.wikipedia.org/w/wiki.phtml?title=Wikipedia_Diskussion:UTF8-Probleme#Umlaute_in_Upload_Dateinamen_bei_Mac_OS_X
(in German).)

The reason for this problem seems to be that the Mac OS file system uses a
different decomposition policy for filenames than other operating systems or
most browsers do. To me it seems that the best solution (and The Right Thing)
would be to perform Unicode canonicalization (see
http://www.unicode.org/notes/tn5/) on the server side, on the names of uploaded
files, but also on search terms and article titles.

To clarify: in Unicode (and therefore in UTF-8) there are often several ways of
expressing the same character. For instance, there is a separate character for
"ü", but also a way to express it as "u" + "dots" (a combining diaeresis). The
two representations are (or should be) equivalent, but are not handled as such
by the wiki software. It would be best to enforce a consistent internal
canonical form by normalizing all incoming Unicode.
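
To make the two representations concrete, here is a minimal Python sketch. It uses Python's standard `unicodedata` module as a stand-in for whatever the wiki software would use; the helper name `canonical` is hypothetical and only for illustration.

```python
import unicodedata

precomposed = "\u00fc"   # "ü" as a single code point
decomposed = "u\u0308"   # "u" followed by COMBINING DIAERESIS

def canonical(text: str) -> str:
    """Return text in NFC (canonical composition); hypothetical helper."""
    return unicodedata.normalize("NFC", text)

# The raw strings differ, so a naive title or filename lookup fails.
print(precomposed == decomposed)                        # False
# After normalizing both sides, they compare equal.
print(canonical(precomposed) == canonical(decomposed))  # True
```

NFC is chosen here only as an example of a composed form; the point is that both the stored name and the lookup key pass through the same normalization.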

The following appeared on the mailing list unicode@unicode.org:

FYI, by far the largest source of text in NFD (decomposed) form in Mac
OS X is the file system. File names are stored this way (for historical
reasons), so anything copied from a file name is in (a slightly altered
form of) NFD.
Also, a few keyboard layouts generate text that is partly decomposed,
for ease of typing (e.g., Vietnamese).

Deborah Goldsmith
Internationalization, Unicode liaison
Apple Computer, Inc.
goldsmit@apple.com

This makes it quite clear that this is not a BUG on the part of Mac OS - it's a
classic incompatibility, which should be handled by the server.


Version: 1.3.x
Severity: normal
OS: Mac OS X 10.0
Platform: Macintosh

Details

Reference
bz215

Revisions and Commits

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 6:43 PM
bzimport set Reference to bz215.
bzimport added a subscriber: Unknown Object (MLST).

I have dug up some more info on this:

The crucial point is that *some* canonical form (normal form) should be used as
the internal representation. For compatibility reasons, this should probably be
a composed form, as the decomposed forms are rendered badly on some systems.
Here is the official document about Unicode normal forms:

http://www.unicode.org/reports/tr15/

HTH

jeluf wrote:

This has to be done at least for:

  • User names
  • File names
  • Page titles

and should also be done for wikitext, at least in the search index.

Bug seems to be specific to Safari. Firefox 0.9.1 and IE 5.2.3 both normalize the name to the precomposed form.

Uploaded from Safari:
http://meta.wikimedia.org/wiki/Image:Wiki_test_e%CC%81.png

Uploaded from Firefox and IE:
http://meta.wikimedia.org/wiki/Image:Wiki_test_%C3%A9.png
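
As a rough illustration of why these two URLs point at different files, the following Python sketch decodes the percent-encoded names and compares them before and after NFC normalization. Python's `urllib.parse.unquote` and `unicodedata` are used only for demonstration and are not what the wiki software runs.

```python
import unicodedata
from urllib.parse import unquote

# File names taken from the two upload URLs above.
safari_name = unquote("Wiki_test_e%CC%81.png")   # "e" + U+0301 COMBINING ACUTE ACCENT
firefox_name = unquote("Wiki_test_%C3%A9.png")   # U+00E9, precomposed "é"

# Different code point sequences, hence different titles on the server.
print(safari_name == firefox_name)                                  # False

# Normalizing the Safari name to NFC yields the Firefox/IE name.
print(unicodedata.normalize("NFC", safari_name) == firefox_name)    # True
```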

Nonetheless we certainly should be normalizing input... Check if there's an iconv or mb_* function for doing this efficiently.

I spent a few minutes googling and came up with nothing useful pre-existing in PHP. Guess I'll have to write another hack. :P
All the necessary data should be in the Unicode data tables... It may be possible to write a DSO extension that makes use of
existing library functions (libidn seems to have UTF-8-based normalization functions for instance) but we'll need a 'native'
PHP version anyway for general distribution.

jeluf wrote:

libidn provides a stringprep_utf8_nfkc_normalize() function. The result of this
normalization can differ visibly from the input; e.g. ² becomes 2.
When using this for user names, would we want to preserve the original string
for display but use an internal representation for comparing?
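
One way to picture the "preserve for display, normalize for comparison" idea is the sketch below. It uses Python's `unicodedata` rather than libidn, and the function names `display_form` and `comparison_key` are hypothetical; treat it as an illustration of the trade-off, not as a proposed implementation.

```python
import unicodedata

def display_form(name: str) -> str:
    """NFC keeps compatibility characters such as the superscript two intact."""
    return unicodedata.normalize("NFC", name)

def comparison_key(name: str) -> str:
    """NFKC folds compatibility characters, so the superscript two becomes "2"."""
    return unicodedata.normalize("NFKC", name)

name = "User\u00b2"                                  # "User" + SUPERSCRIPT TWO

print(display_form(name) == "User\u00b2")            # True: shown as entered
print(comparison_key(name) == "User2")               # True: folded for comparison
print(comparison_key(name) == comparison_key("User2"))  # True: the two names collide
```

Storing the NFC form and comparing on the NFKC key would keep the displayed name unchanged while treating visually similar names as the same account.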

There is a PHP-libidn binding at http://php-idn.bayour.com/ but it looks like
they do not yet provide access to stringprep_utf8_nfkc_normalize().

jeluf wrote:

The ucdata library might be interesting, it provides both composition and
decomposition, upper case, etc.

http://crl.nmsu.edu/~mleisher/ucdata.html

The download page at that site is broken, but rev 2.5 is available at
ftp://crl.nmsu.edu/CLR/multiling/unicode/ucdata-2.5.tar.gz

A further note: in addition to being decomposed, Safari actually is sending the
filename with HTML character references: "Wiki test e&#769;.png"

Adding an accept-charset attribute to the <form> unfortunately doesn't seem to
change anything. Also, in current 1.4 CVS the # is now stripped to - before we
get to the point where we normalize the title and would be interpreting the
character reference, so things get even weirder.
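
The ordering problem described above can be sketched in Python as follows. This assumes the reference Safari sends is the decimal form of U+0301 (`&#769;`), which matches the `%CC%81` sequence in the Safari upload URL; `clean_filename` is a hypothetical helper, and `html.unescape` merely stands in for the wiki's own entity handling.

```python
import html
import unicodedata

def clean_filename(raw: str) -> str:
    """Hypothetical helper: resolve character references, then compose to NFC."""
    # Step 1: turn references such as "&#769;" into real code points.
    # This has to happen before any step that strips or rewrites "#".
    decoded = html.unescape(raw)
    # Step 2: compose "e" + U+0301 into the single code point U+00E9.
    return unicodedata.normalize("NFC", decoded)

print(clean_filename("Wiki test e&#769;.png") == "Wiki test \u00e9.png")  # True
```

Normalizing first would not help, because the combining accent is still spelled out as ASCII text until the reference has been decoded.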

Now fixed in 1.4 CVS.

Might consider backporting the isolated filename normalization part to 1.3 on account of the Safari problem, without
risking the general case; leaving this bug open for the moment.

1.4 nearing release; not backporting.

epriestley added a commit: Unknown Object (Diffusion Commit). Mar 4 2015, 8:20 AM