Add flexible magic character conversion to the user interface
Closed, Declined
Public

Description

Author: gangleri

Description:
Hello!

To my knowledge, the built-in editor can handle "magical character conversion",
which is activated for wikis with content language Esperanto.

This magical character conversion works as follows. The characters Ĉ, Ĝ, Ĥ, Ĵ,
Ŝ, Ŭ, ĉ, ĝ, ĥ, ĵ, ŝ, ŭ are stored in the database but are displayed as Cx, Gx,
Hx, Jx, Sx, Ux, cx, gx, hx, jx, sx, ux in the browser. See
[[eo:Vikipedio:Bugzilla_1512#Notes]].
*note* :eo: also has an "escape notation", such as displaying Cxx for a stored Cxx, CxX
for a stored CxX etc.
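
For illustration, the round trip between stored text and editor text could be sketched as below. This is not MediaWiki's actual code. It assumes one particular escape rule: every literal x that follows one of the twelve base letters is doubled in the editor, so an odd run of x's marks an accented letter and an even run is plain text. The function names `to_editor` and `from_editor` are hypothetical.

```python
import re

# Accented letter -> ASCII base letter (the marker "x" is added separately).
ACCENTED_TO_BASE = dict(zip("ĈĜĤĴŜŬĉĝĥĵŝŭ", "CGHJSUcghjsu"))
BASE_TO_ACCENTED = {base: acc for acc, base in ACCENTED_TO_BASE.items()}

# A convertible letter followed by any run of x/X (possibly empty).
_STORED = re.compile(r"([CGHJSUcghjsuĈĜĤĴŜŬĉĝĥĵŝŭ])([xX]*)")
# A base letter followed by at least one x/X, as typed in the editor.
_EDITED = re.compile(r"([CGHJSUcghjsu])([xX]+)")


def to_editor(stored: str) -> str:
    """Stored Unicode text -> x-notation shown in the edit box (assumed rule)."""
    def repl(m):
        letter, xs = m.group(1), m.group(2)
        escaped = "".join(x + "x" for x in xs)  # double every literal x
        if letter in ACCENTED_TO_BASE:
            return ACCENTED_TO_BASE[letter] + "x" + escaped
        return letter + escaped
    return _STORED.sub(repl, stored)


def from_editor(edited: str) -> str:
    """x-notation from the edit box -> Unicode text to store (assumed rule)."""
    def repl(m):
        letter, xs = m.group(1), m.group(2)
        if len(xs) % 2 == 1:          # odd run: the first x marks an accent
            letter = BASE_TO_ACCENTED[letter]
            xs = xs[1:]
        return letter + xs[0::2]      # collapse each escaped pair to one x
    return _EDITED.sub(repl, edited)


stored = "Ĉu ŝi luxe cx"
assert from_editor(to_editor(stored)) == stored
```

Under this assumed rule, text that already contains sequences such as "ux" is changed in the editor view (to "uxx"), which is the kind of side effect the escape notation exists to handle.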

*Requirement of this bug report:*

It should be possible to set up the character conversion

  • character by character

and / or

  • by *include* range

and / or

  • by *exclude* range

Examples:
If you want to distinguish between "minus" = "-" and &ndash; &#8211; &#x2013; = "–"
see: Unicode Character EN DASH - U+2013

There should be a syntax that
"–" should be shown as &ndash; in the editor but saved as "-".

There should be a syntax to match the "existing" magical character conversion in
Esperanto. If such a syntax or configuration were available, users could activate
it as they choose in their monobook definition.

This feature would make it possible to detect BiDi punctuation characters as mentioned in
bug 3819: strip phantom general punctuation characters from page titles

This feature would help to distinguish different kinds of whitespace, such as "space"
and "tab", as mentioned in
bug 3894: white space characters, BiDi control characters should show up in diff
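
As a rough sketch of what "showing" such characters could look like, the snippet below replaces whitespace other than the ordinary space, control characters and format characters (which include the BiDi controls) with a visible tag. The function name `reveal_invisible` and the tag format are assumptions made for illustration only; nothing here describes an existing MediaWiki feature.

```python
import unicodedata

# General categories treated as "invisible": controls, format characters
# (including BiDi controls), and the space/line/paragraph separators.
INVISIBLE_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}


def reveal_invisible(text: str) -> str:
    """Replace invisible characters with a readable <U+XXXX NAME> tag."""
    out = []
    for ch in text:
        keep_as_is = ch in (" ", "\n")  # ordinary space and newline stay readable
        if not keep_as_is and unicodedata.category(ch) in INVISIBLE_CATEGORIES:
            # Control characters such as TAB have no Unicode name.
            name = unicodedata.name(ch, "CONTROL")
            out.append(f"<U+{ord(ch):04X} {name}>")
        else:
            out.append(ch)
    return "".join(out)


# The user name in the URL field of this report ends in U+202D U+202C.
print(reveal_invisible("Brion\u202d\u202c"))
```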

This feature would make it possible to edit InterLangua links using the magic character
conversion, as desired by Scot in bug 3615 comment 1:
bug 3615: blocks of code not handling magic character conversions in Esperanto
correctly - reason for page deletion

*notes*
a) The best way to introduce a feature is to offer it as an option. This will not
break code or irritate users.
b) Because of the "escape syntax", at some point it should be decided whether this feature
would break documentation pages with examples. If all versions &ndash;
&#8211; &#x2013; = "–" are used there, these should not be changed while the page is
edited and saved again.
c) No details about the required syntax are specified here, in order to avoid
limitations.

The requested flexible magic character conversion built into the editor would
offer solutions / workarounds and would also be helpful for other reported bugs:

One could see in the source of a page

  • whether Unicode whitespace is used in an article title
    bug 1414: Unicode whitespaces allowed in article title

  • characters which look the same in different alphabets but are coded differently
    bug 1524: usernames should use unicode whitelist
    bug 2290: user impersonation using homographs
    bug 3885: title normalisation

This feature would "normalise" the way characters are saved. This would /
should make search more efficient.

"Copy and paste" can be platform and program dependent. I have seen many broken
pages where Cyrillic characters in InterLanguage links were saved as ????. This
could be avoided if a whole range were displayed only in &#nnnn; notation in
the editor and saved back as real Unicode.
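
The range-based variant described here might look roughly like the following sketch. Code point ranges are given as inclusive (low, high) pairs; the function names and the include / exclude parameters are assumptions chosen to mirror the requirement list above, not an existing interface.

```python
import re


def _in_ranges(cp, ranges):
    """True if code point `cp` falls inside any inclusive (low, high) range."""
    return any(low <= cp <= high for low, high in ranges)


def show_as_references(text, include, exclude=()):
    """Editor view: characters in `include` but not in `exclude` become &#nnnn;."""
    return "".join(
        f"&#{ord(ch)};"
        if _in_ranges(ord(ch), include) and not _in_ranges(ord(ch), exclude)
        else ch
        for ch in text
    )


def save_as_unicode(text):
    """Save step: turn every decimal &#nnnn; reference back into a real character."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)


CYRILLIC = [(0x0400, 0x04FF)]  # basic Cyrillic block
edited = show_as_references("ru:Москва", CYRILLIC)
assert edited.isascii()
assert save_as_unicode(edited) == "ru:Москва"
```

Note that `save_as_unicode` as sketched converts every numeric reference, including ones an author typed deliberately in a documentation example, which is the concern raised in note b).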

The same applies to detecting homoglyphs / homographs. The reports are mentioned above.

regards reinhardt [[user:gangleri]]

P.S. This request is only about implementing the basic feature. Please open
individual bugs for subfeatures wherever necessary.


Version: unspecified
Severity: enhancement
URL: http://test.leuksman.com/edit/User:Brion%E2%80%AD%E2%80%AC?oldid=9812

Details

Reference
bz4012

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 21 2014, 8:57 PM
bzimport set Reference to bz4012.
bzimport added a subscriber: Unknown Object (MLST).

gangleri wrote:

With special setups, this feature could provide a limited replacement for the
discontinued special page ~makeutf8~.

gangleri wrote:

(In reply to comment #0)

This feature would make it possible to detect BiDi punctuation characters as mentioned in
bug 3819: strip phantom general punctuation characters from page titles

testcase
http://test.leuksman.com/index.php?title=User_talk:Gangleri&oldid=10505&action=edit&section=6

I mentioned this testcase in order to illustrate how editing of a page could turn
out. There are scenarios, which I do not want to discuss here in public, that would
make it possible to "achieve this effect" many revisions after the erroneous / malicious
change. This would make it very difficult to trace and correct the error later.
Please e-mail me if you would like to know more details.

gangleri wrote:

(In reply to comment #0)
Expanding the request because %nn%nn%nn is another method to encode characters.
see %C3%9C at
http://de.wikipedia.org/w/index.php?title=Benutzer:VanGore&action=edit&section=7
[[wikibooks:de:%c3%9Cber_das_Wesen_der_Information]] is an alternative way to encode
[[wikibooks:de:Über_das_Wesen_der_Information]]
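
As a small illustration, Python's standard `urllib.parse` decodes both spellings to the same text, because the hex digits in percent-encoding are case-insensitive. The wrapper name `decode_title` is hypothetical; only `quote` and `unquote` are real library calls.

```python
from urllib.parse import quote, unquote


def decode_title(title: str) -> str:
    """Decode %nn sequences (interpreted as UTF-8) into real characters."""
    return unquote(title)


encoded = "%c3%9Cber_das_Wesen_der_Information"
print(decode_title(encoded))

# Upper- and lower-case hex digits decode to the same EN DASH (U+2013).
assert decode_title("%E2%80%93") == decode_title("%e2%80%93") == "\u2013"
# Re-encoding yields the upper-case form.
assert quote("\u2013") == "%E2%80%93"
```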

Examples:
If you want to distinguish between "minus" = "-" and &ndash; &#8211; &#x2013; = "–"

see: Unicode Character EN DASH - U+2013

If you want to distinguish between "minus" = "-" and &ndash; &#8211; &#x2013;
%E2%80%93 and %e2%80%93 = "–"

*notes*
b) Because of the "escape syntax", at some point it should be decided whether this feature
would break documentation pages with examples. If all versions &ndash;
&#8211; &#x2013; = "–" are used there, these should not be changed while the page is
edited and saved again.

b) ... If all versions &ndash; &#8211; &#x2013; %E2%80%93 and %e2%80%93 =
"–" are used there, these should not be changed while the page is edited and saved again.

This feature would "normalise" the way how characters are saved. This would /
should make search more efficient.

add ... It also makes it easier to read and edit / correct text encoded as
%nn%nn%nn.

ui2t5v002 wrote:

Here's my version, in response to Bug 2676:

This is great, but Unicode-unaware *browsers* aren't the only problem. A lot of
people want to work in Unicode-unaware text editors as well, and this makes it
difficult for them. They'd have to fake out the server into thinking they had an
old browser or something in order to see the HTML entity version of the source.
I have a different proposal:

  1. Convert all HTML entities (named or Unicode numbers or whatever) into plain
     Unicode characters in the wikisource.

  2. Provide an option in the editing interface to view the source in either
     "plain Unicode" format (with actual characters) or "plain text" format (with HTML
     entities) on a per-edit basis.

2.a. When editing in "plain text" mode, all the bad characters (non-ASCII?) will
be converted into named HTML entities if possible (&mdash; and the like), or
into numbered HTML entities if not possible (&#8212; and the like).

2.b. The default editing format will be selectable in preferences.
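
The two directions of this proposal can be sketched with Python's standard `html` module. The function names are hypothetical, and the sketch uses the HTML 4 entity table (`html.entities.codepoint2name`) to decide whether a named entity exists.

```python
import html
from html.entities import codepoint2name


def to_plain_unicode(text: str) -> str:
    """Step 1: turn named and numbered entities into real characters."""
    return html.unescape(text)


def to_plain_text(text: str) -> str:
    """Step 2.a: turn non-ASCII characters into entities, named if possible."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                         # ASCII stays as typed
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # e.g. &mdash;
        else:
            out.append(f"&#{cp};")                 # no named entity available
    return "".join(out)


assert to_plain_unicode("&mdash;&#8212;&#x2014;") == "\u2014" * 3
assert to_plain_text("a\u2014b") == "a&mdash;b"
```

One caveat of the sketch: `html.unescape` also decodes `&amp;` and `&lt;`, which can change the meaning of wikitext, so a real conversion would presumably need to leave those alone.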

gangleri wrote:

*note*
This bug was opened with the summary:
"Add flexible magic character conversion to the built-in editor"

After reading the response from Omegatron in comment 4 and searching for bug
dependencies and duplicates, I wonder whether the "magic character conversion" should
be limited to "the built-in editor" or should be available for other functions
as well.

changing summary to
"Add flexible magic character conversion to the user interface"
this should also cover the "built-in editor"

adding dependency
blocks: bug 3894: white space characters, BiDi control characters should show up
in diff
having a duplicate: bug 3672: BiDi: improuve the diffs with regard to RTL issues

A feature as requested here would prevent users from opening invalid bugs such as
(invalid bug 3621: BiDi: RTL list not rendered correctly)

best regards reinhardt [[user:gangleri]]

With the experience of Esperanto magic character conversion, I'm pretty sure we don't want to add any more of that.