
MediaWiki allows characters in the U+0080 to U+009F range
Open, LowPublic

Description

MediaWiki allows characters in the U+0080 to U+009F range in article titles and
bodies. These characters should never appear in valid HTML. I suggest they be
handled like characters in the U+0000 to U+001F range: using them in article
titles/URLs should lead to a "Bad title" error, and using them in the article
body should cause them to be replaced with U+FFFD upon save.

See also T44807: Invisible Unicode characters allowed on pagetitle (\u200E | \uFEFF | \u200B)
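
A minimal PHP sketch of the suggested save-time behaviour, using a hypothetical
helper name (a sketch only, not actual MediaWiki code):

  // Replace C1 controls (U+0080-U+009F) with U+FFFD, mirroring how the
  // C0 range is treated; /u makes the class match codepoints, not bytes.
  function replaceC1Controls( string $text ): string {
      return preg_replace( '/[\x{0080}-\x{009F}]/u', "\u{FFFD}", $text );
  }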

Details

Reference
bz5732

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 9:11 PM
bzimport set Reference to bz5732.
bzimport added a subscriber: Unknown Object (MLST).

Should be added to the title chars blacklist. While technically legal, these are
control characters. (Unlike the ASCII control characters, however, they are
allowed in XML.)
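
Note that such a blacklist would have to operate on codepoints: the default
$wgLegalTitleChars is a byte-oriented character class that includes \x80-\xFF,
so every UTF-8 sequence passes it. A hypothetical codepoint-level check, as a
PHP sketch (not actual MediaWiki code):

  // Reject titles containing C1 controls, analogous to the existing
  // "Bad title" handling of the C0 range.
  function titleHasC1Controls( string $title ): bool {
      return (bool)preg_match( '/[\x{0080}-\x{009F}]/u', $title );
  }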

inbox wrote:

(In reply to comment #1)

Should be added to the title chars blacklist. While technically legal, these are
control characters. (Unlike the ASCII control characters, however, they are
allowed in XML.)

No, they are not. If any of these characters is present, the page will fail to
validate, unlike a page which contains a CR, for example.

This may be an error in your validation tool.

XML 1.0 explicitly includes them among the allowed character ranges, see:
http://www.w3.org/TR/2004/REC-xml-20040204/#charsets
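
For reference, the Char production from that section of XML 1.0, which places
U+0080-U+009F inside the allowed #x20-#xD7FF range:

  Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]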

inbox wrote:

(In reply to comment #3)

This may be an error in your validation tool.

XML 1.0 explicitly includes them among the allowed character ranges, see:
http://www.w3.org/TR/2004/REC-xml-20040204/#charsets

Interesting. I was using the W3C's validator:
http://validator.w3.org/check?uri=http%3A%2F%2Fen.wikipedia.org%2Fw%2Findex.php%3Ftitle%3D%25C2%2580.

Yeah; the description of the error contains some definite mistakes:

"HTML uses the standard UNICODE Consortium character repertoire, and it leaves undefined (among others)
65 character codes (0 to 31 inclusive and 127 to 159 inclusive) that are sometimes used for typographical
quote marks and similar in proprietary character sets."

  1. Unicode most definitely *does* define these; you can see them in the code
     charts and the character database:
     http://www.unicode.org/charts/PDF/U0080.pdf
     http://www.unicode.org/Public/UNIDATA/

  2. The list of points mentioned there as undefined includes tab, newline, and
     carriage return, which are *most definitely* defined in Unicode and allowed
     in HTML.

  3. So far as I'm aware, neither HTML nor XHTML declares these characters to be
     disallowed, and XML only disallows a subset of 0-31 (minus tab, newline,
     and carriage return).

inbox wrote:

Compare http://test.wikipedia.org/wiki/User:R._Koot/C1-1 (which uses numeric
entities) with http://test.wikipedia.org/wiki/User:R._Koot/C1-2 (which simply
uses the characters directly). C1-1 validates, while C1-2 doesn't. They also
look different (under Firefox 1.0.7/SUSE 10.0): C1-1 displays characters from
the Windows-1252 character set in a TrueType font, while C1-2 displays them in a
bitmapped font and also shows glyphs where C1-1 has blanks. I viewed
C1-2 earlier today on Firefox 1.5/Windows 2000. All the characters are black
except for one, which displays as a question mark. There is definitely some
compatibility stuff going on here. Could it be that XML only allows characters
in the U+0080-U+009F range to be represented using numeric entities?

No, XML allows them completely.

Anyway, that's not really relevant; we probably want to ban them just to avoid confusion. :)

omniplex wrote:

re #7: I'm not sure about u+00 up to u+1F; IIRC the allowed characters
are HT, LF, and CR. Anything else, including FF and VT, is bad.

The range u+7F (not u+80, it starts at 127) up to u+9F used to be bad.
In XML 1.1 (caveat: XHTML 1.0 is XML 1.0) u+85 NEL was declared to be
okay, because the EBCDIC folks have a single NEL elsewhere in addition
to their own variants of CR and LF.

But that's XML 1.1; at the moment u+7F up to u+9F is marked as invalid
by the W3C validator (for Unicode charsets, or independent of that for
NCRs &#127; up to &#159;).
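
For comparison, the corresponding XML 1.1 productions: Char is widened to
nearly everything above #x0, but the C1 range minus u+85 (NEL) falls into
RestrictedChar, which may only appear as character references:

  Char           ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
  RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]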

michael wrote:

Can't the software assume that a browser sending characters in the 7F ... 9F range is sending Windows CP-1252 typographic
characters? In this case, shouldn't they just be converted to the Unicode equivalents and entered thus into the database, once
and for all?

I can't imagine that there is any utility in entering the equivalent Unicode values into wikitext—aren't they all control
characters which have no valid display, or are only useful in a text terminal?
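
A PHP sketch of that conversion, using a hypothetical helper and an abridged
form of the standard Windows-1252 to Unicode mapping (27 of the 32 codepoints
have assignments; U+0081, U+008D, U+008F, U+0090 and U+009D do not):

  function cp1252ToUnicode( string $text ): string {
      // Map C1 codepoints to the typographic characters CP-1252 puts at
      // those positions (abridged; the full table has 27 entries).
      static $map = [
          "\u{0080}" => "\u{20AC}", // euro sign
          "\u{0085}" => "\u{2026}", // horizontal ellipsis
          "\u{0091}" => "\u{2018}", // left single quotation mark
          "\u{0092}" => "\u{2019}", // right single quotation mark
          "\u{0093}" => "\u{201C}", // left double quotation mark
          "\u{0094}" => "\u{201D}", // right double quotation mark
          "\u{0096}" => "\u{2013}", // en dash
          "\u{0097}" => "\u{2014}", // em dash
      ];
      // strtr() matches whole UTF-8 key sequences, so it is byte-safe here.
      return strtr( $text, $map );
  }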

michael wrote:

I believe that most modern web browsers display these characters assuming that they are Windows CP-1252 anyway, so why
not explicitly enter into the database what is being assumed?

inbox wrote:

Someone knowledgeable at the W3C has concluded that the SGML, XML 1.0, XML 1.1,
HTML 4.01 and XHTML 1.0 specifications are inconsistent and unclear on this
point, but suggests that the correct behaviour of the W3C validator is to reject
these characters as invalid: http://www.w3.org/People/cmsmcq/2007/C1.xml

rd232 wrote:

Five years later, MediaWiki still allows these C1 control codes
(http://en.wikipedia.org/wiki/C0_and_C1_control_codes), even though they can
cause problems, especially when appearing in filenames (the file can only be
used by copy-pasting the name from the file page, and it's a mystery to the
user why). As far as I can tell, these codes are not valid characters in XML
(http://en.wikipedia.org/wiki/Valid_characters_in_XML), with the possible
exception of U+0085, which if possible should be translated to a newline (I
think). Can we do something about this?
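
A PHP sketch of that suggestion (hypothetical helper, not actual MediaWiki
code): translate U+0085 (NEL) to an ordinary newline first, then replace the
remaining C1 controls with U+FFFD as proposed in the description:

  function normalizeC1( string $text ): string {
      $text = str_replace( "\u{0085}", "\n", $text ); // NEL -> LF
      return preg_replace( '/[\x{0080}-\x{009F}]/u', "\u{FFFD}", $text );
  }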

Aklapper lowered the priority of this task from Medium to Low.Apr 11 2019, 9:32 AM
Aklapper removed a subscriber: wikibugs-l-list.