
Automatic hyphens to (localized?) dashes
Closed, Declined; Public

Description

Author: mapellegrini

Description:
The en manual of style has long been promising that the software would
automatically convert -- (a double dash) into the HTML entity &ndash;. This
would keep ugly HTML out of our articles and make editing more accessible for
the HTML-impaired. When is it coming?


Version: unspecified
Severity: enhancement

Details

Reference
bz1485

Revisions and Commits

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 21 2014, 8:12 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz1485.
bzimport added a subscriber: Unknown Object (MLST).

nhamblen wrote:

Replaces certain sequences with UTF-8 codes for dashes

I've written a patch that I think is fairly well placed since it's adjacent to
the existing code that inserts non-breaking spaces between guillemets. This
method would make a lot of people happy, and it promotes compliance to the
Manual of Style as much as is possible. Here's how it works:

  1. Replace any ' -- ' with the UTF-8 sequence equivalent to ' – '
  2. Replace any '--' between numbers with '–' alone.
  3. Replace any ' --- ' with the UTF-8 sequence equivalent to ' — '
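
The three rules above can be sketched as follows. This is a minimal illustration in Python, not the attached patch (which is not shown in this thread); the function name `convert_dashes_v1` and the use of lookarounds for the "between numbers" rule are assumptions.

```python
import re

EN_DASH = "\u2013"
EM_DASH = "\u2014"


def convert_dashes_v1(text: str) -> str:
    """Apply the three replacement rules listed above (hypothetical helper)."""
    # Rule 1: a double hyphen surrounded by spaces becomes a spaced en dash.
    text = text.replace(" -- ", f" {EN_DASH} ")
    # Rule 2: a double hyphen directly between two digits becomes a bare en dash.
    text = re.sub(r"(?<=\d)--(?=\d)", EN_DASH, text)
    # Rule 3: a triple hyphen surrounded by spaces becomes a spaced em dash.
    text = text.replace(" --- ", f" {EM_DASH} ")
    return text
```

A single hyphen, as in "well-known", is left alone by all three rules.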

attachment string_replacement.diff ignored as obsolete

Don't use raw UTF-8 here; numeric character references will be compatible with Latin-1 wikis as well. Test to make sure this
doesn't break things interestingly.

Also, there's no need to use the 'i' regex modifier on an expression that contains no letters.
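
To illustrate the point about numeric character references: the reference itself is plain ASCII, so it survives in a Latin-1 encoded wiki, while the raw dash character does not. The sketch below uses Python only for demonstration; the helper name `fits_latin1` is an assumption.

```python
import html

EN_DASH_REF = "&#8211;"  # numeric character reference for U+2013
EM_DASH_REF = "&#8212;"  # numeric character reference for U+2014


def fits_latin1(s: str) -> bool:
    """Return True if the string can be stored in a Latin-1 encoded page."""
    try:
        s.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


# The reference is ASCII-only; the browser decodes it to the real character.
decoded_en = html.unescape(EN_DASH_REF)
decoded_em = html.unescape(EM_DASH_REF)
```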

nhamblen wrote:

Replace dash sequences with HTML codes rather than UTF-8

Hey, that's great... I hadn't thought we would be able to do it with HTML
entities, having a distant memory of a prior dash fix causing problems for
exactly that reason. But, the guillemet replace string uses &nbsp; so, duh, of
course we can. I had copied the /insensitive from the guillemet string —
which doesn't need it either — so this patch removes it from both places.

By the way, I filed bug #1513 to do similar work for quotes and ellipses (and
dashes) in a separate function. I used UTF-8 for it because I don't see it
going in before 1.5 when everything's UTF-8 anyway. (Whether or not people even
want that feature is, of course, up for debate.)

attachment HTML dashes.diff ignored as obsolete

michael wrote:

This is excellent. But a million typists already habitually use two hyphens to represent a parenthetical dash (em dash), usually
spaced, but often not. There's a very strong usability case to make things work the way people expect.

nhamblen wrote:

(In reply to comment #4)
I agree, it would be nice to make -- do — since that's what many people already
use for it, but I'm not sure how we could do that and still allow for the (also
very common) shorter dash used in ranges (i.e. January -- March becomes January
– March).

More people than I would expect are familiar with the triple-hyphen from TeX,
and the idea of doing likewise was debated on [[Wikipedia talk:Manual
of Style (dashes)]] and didn't meet with tremendous opposition. I think that if
something is finally put into place, people will adapt to it quickly and fix
pages in short order (there are some pretty serious typographers out there!).

Agree with Michael; I can't imagine ever intending to write an en-dash with '--'.
Virtually all existing cases will be meant as em-dashes.

nhamblen wrote:

From that talk page I keep mentioning: "When the automatic conversion was
briefly turned on, a - remained unaffected, -- turned into a dash (an n dash I
assume) and --- turned into a longer dash (an m dash I assume)."

My thinking was that if this was ok once, it will be ok again (especially since
it won't break tables this time!) The question could be raised once again on the
talk page, but from what I can tell it's a technical problem (how to allow for
both length dashes) with only one proposed solution.

mapellegrini wrote:

[[en:user:Curps]] asked me to post this:

It would be nice to accommodate the minus sign as well, and it could probably easily be
done.

The Unicode minus-sign character is approved in [[Wikipedia:Manual of Style
(dashes)]].

In addition to the three rules already proposed, anything of the form
' -[0123456789]' (space followed by hyphen followed by a digit) should have its
hyphen converted to − (the Unicode minus sign).
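
A minimal sketch of this proposed rule, written in Python for illustration only; the function name `add_minus_signs` is an assumption and the pattern is one possible reading of "space, hyphen, digit".

```python
import re

MINUS_SIGN = "\u2212"


def add_minus_signs(text: str) -> str:
    """Turn ' -<digit>' into ' <minus sign><digit>' (hypothetical helper)."""
    # The lookahead keeps the digit itself out of the replaced span.
    return re.sub(r" -(?=\d)", " " + MINUS_SIGN, text)
```

As written, the rule only covers negative numbers: a spaced subtraction such as "5 - 3" does not match, because a space rather than a digit follows the hyphen.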

nhamblen wrote:

(In reply to comment #8)
But shouldn't the minus sign also apply to subtraction?

And we'd need to make sure that <math> sections aren't affected. My test setup
doesn't have the right parts installed to render them so I'm not sure; if we're
lucky, <math> is turned into a reference to a graphic before it gets to the
patch's code.

gwalla wrote:

I suggest having "--" become an en dash, and " -- " (spaces and all) become an
em dash. This is the usage recommended by many typewriter style manuals, and it
has carried through to modern computing. "---" as an em dash is obvious to TeX
users, but not to the general populace.

jra wrote:

I can just barely agree with Garth's comment, above. But any code that converts
-- into anything but an em-dash will be surgically pruned out of any wikis *I*
run; that violates the Principle Of Least Astonishment with *unusual* violence.

It's bad enough no one thinks that we can reasonably parse the traditional
ASCII-7 'escape sequences' for *bold* and _italics_ (as the typographical
special case of underlining).

No one *needs* an en-dash, anyway.

nhamblen wrote:

For me it would be a little "astonishing" to prohibit spaces around en-dashes,
since those spaces are prescribed in our style guide. And please don't dismiss
en-dashes out of hand; there's a mob on wikipedia that wants shortcuts to both
kinds of dashes. (Please do read for yourself.)

There's another proposal on the dash talk page: " -- " goes to em-dash and " - "
goes to en-dash. I'm a little concerned that it would affect <math> code. Can
someone confirm that?

nhamblen wrote:

applies new dash rules, excludes math sections

I got math parsing going on my install and found that the old patch did affect
math sections if they were simple enough to be rendered in HTML. That would
pose problems, especially if we convert ' - ' to endashes. To exclude the math
markup, I moved the replace function to be between the strip() and unstrip()
functions. That worked, then I updated the regular expressions to the new
proposed format.
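
The idea of running the replacement between a strip step and an unstrip step can be sketched like this. It is an illustration in Python, not MediaWiki's actual strip()/unstrip() code; the names `strip_math`, `unstrip_math` and `convert_outside_math`, and the placeholder format, are assumptions.

```python
import re

MATH_RE = re.compile(r"<math>.*?</math>", re.DOTALL)
PLACEHOLDER_RE = re.compile(r"\x00MATH(\d+)\x00")


def strip_math(text):
    """Replace each <math> section with a placeholder and remember the original."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return "\x00MATH%d\x00" % (len(saved) - 1)

    return MATH_RE.sub(stash, text), saved


def unstrip_math(text, saved):
    """Put the remembered <math> sections back in place of their placeholders."""
    return PLACEHOLDER_RE.sub(lambda m: saved[int(m.group(1))], text)


def convert_outside_math(text, convert):
    """Run `convert` on the text while <math> sections are hidden from it."""
    stripped, saved = strip_math(text)
    return unstrip_math(convert(stripped), saved)
```

Because the dash rules never see the contents of the hidden sections, a ' - ' inside <math> stays a hyphen.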

Have a look at the source yourself to be sure. Here's what the expressions do
in words:

  1. replace a hyphen surrounded by spaces with an endash preceded by a
     nonbreaking space and followed by a regular space
  2. replace a hyphen between two numeric characters (a range) with an endash.
  3. replace a double-hyphen surrounded by spaces with an emdash preceded by a
     nonbreaking space and followed by a regular space
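
In code, the three expressions described in words above might look roughly like this. It is a Python sketch rather than the attached PHP patch; the function name `convert_dashes_v3` is an assumption, and numeric character references are used for the output as suggested earlier in this thread.

```python
import re

NBSP_REF = "&#160;"
EN_DASH_REF = "&#8211;"
EM_DASH_REF = "&#8212;"


def convert_dashes_v3(text: str) -> str:
    """Apply the three rules described above (hypothetical helper)."""
    # Spaced double hyphen: nonbreaking space, em dash, regular space.
    text = text.replace(" -- ", NBSP_REF + EM_DASH_REF + " ")
    # Spaced single hyphen: nonbreaking space, en dash, regular space.
    text = text.replace(" - ", NBSP_REF + EN_DASH_REF + " ")
    # Single hyphen directly between two digits (a range): bare en dash.
    text = re.sub(r"(?<=\d)-(?=\d)", EN_DASH_REF, text)
    return text
```

Note that the digit rule cannot tell a range from any other hyphenated number: under this sketch "ISBN 0-306-40615-2" has all three hyphens replaced.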


jeluf wrote:

Fixed in CVS HEAD. Scheduled for Release 1.5

*** Bug 1782 has been marked as a duplicate of this bug. ***

I've removed this from 1.5 as it has a nasty tendency to break legitimate markup in addition to
generally being inconsistent in when it activated.

nhamblen wrote:

(In reply to comment #16)

I've removed this from 1.5 as it has a nasty tendency to break legitimate
markup in addition to generally being inconsistent in when it activated.

Could we have some more information? I'm happy to play with the regular
expression some more to fix whatever's breaking.

  • conversion must not happen in markup
  • conversion must not happen in markup
  • conversion must not happen in markup
  • conversion should happen in text regardless of surrounding markup
  • conversion must not happen in markup

and, let's not forget:

  • conversion must not happen in markup

not to mention:

  • nobody agrees on what should actually be converted when to what

A regex is unlikely to get this right very easily.
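
To make the "not in markup" requirement concrete, here is a rough Python sketch that splits the input into protected spans and plain text, and converts only the plain text. The list of protected constructs (wikilinks, templates, HTML-style tags, bare URLs) and the name `convert_text_only` are assumptions, and the patterns are deliberately simplistic: nested templates, for example, are not handled.

```python
import re

# Spans that must pass through untouched (an incomplete, illustrative list).
PROTECTED = re.compile(
    r"\[\[.*?\]\]"      # wikilinks
    r"|\{\{.*?\}\}"     # templates (non-nested only)
    r"|<[^>]+>"         # HTML-style tags
    r"|https?://\S+",   # bare URLs
    re.DOTALL,
)


def convert_text_only(text, convert):
    """Apply `convert` to the text between protected spans only."""
    out = []
    pos = 0
    for match in PROTECTED.finditer(text):
        out.append(convert(text[pos:match.start()]))  # plain text: convert
        out.append(match.group(0))                    # markup: copy verbatim
        pos = match.end()
    out.append(convert(text[pos:]))
    return "".join(out)
```

Even this falls short of the list above: because each text segment is converted on its own, a rule that depends on surrounding spaces can fail when the dash sits directly next to markup, which is the consistency problem mentioned in the fourth bullet.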

Nathan asked for more details. Here are the existing bug reports for the issues I mentioned above.
Some had been worked around, others not:

bug 2021: Corruption of markup (wikilinks)
bug 2462: Corruption of markup (URLs)
bug 2122: Consistency of application when there is surrounding markup
bug 2109: Is this just consistency or does it break date conversion too?
bug 1937: Was this just consistency or did it break functioning of ISBN links too?

nhamblen wrote:

How about using this SmartyPants implementation on PHP:
http://www.michelf.com/projects/php-smartypants/ . I tried hooking it up to
MediaWiki and it works fine. SmartyPants is used on all kinds of web sites and
doesn't do dumb things like changing hyphens inside URLs, and it won't even
touch MathML. It does conversion "in the markup," but it's battle tested.

It also does quotes and ellipses. (bug #1513)

Downside is it doesn't offer exactly the conversion syntax we sort-of agreed
to, - to ndash and -- to mdash. From discussions here I would say the best
configuration for it is -- to mdash, --- to ndash (backwards and weird) or ndash
disabled entirely. People were pretty hostile to the idea of having to use ---
for the very common mdash, which is its default.

ui2t5v002 wrote:

"in the markup" meaning it converts -- into &mdash; when you save? Like
converting ~~~ into signature? That's bad. It needs to *render* -- as &mdash;,
but leave the markup as --

mapellegrini wrote:

Erm, yes, sorry if I wasn't clear about that. Yes, I meant the conversion should
occur at page-render time, not at save time.

plugwash wrote:

mmm another option would be to convert on save but put the dash itself in the
wikitext rather than a html entity.

ui2t5v002 wrote:

(In reply to comment #23)

mmm another option would be to convert on save but put the dash itself in the
wikitext rather than a html entity.

Do all browsers support them in edit boxes, though? Or will some convert them
back into hyphens?

ui2t5v002 wrote:

(In reply to comment #24)

Do all browsers support them in edit boxes, though? Or will some convert them
back into hyphens?

There is a workaround for old browsers and dashes can now be entered directly
into the unicode wikitext with no problems. I've written a user script that
automatically converts the HTML entities, double hyphens, and so on into their
unicode characters.

ayg wrote:

(In reply to comment #13)

Have a look at the source yourself to be sure. Here's what the expressions do
in words:

  1. replace a hyphen surrounded by spaces with an endash preceded by a
     nonbreaking space and followed by a regular space
  2. replace a hyphen between two numeric characters (a range) with an endash.
  3. replace a double-hyphen surrounded by spaces with an emdash preceded by a
     nonbreaking space and followed by a regular space

You forgot 4: replace a double-hyphen not surrounded by spaces with a lone em
dash. (Obviously the attachment is most likely so old as to be worthless at
this point, so this is just a note to future implementers.)

avarab wrote:

*** Bug 6402 has been marked as a duplicate of this bug. ***

ayg wrote:

Thinking about it, I don't think that a hyphen between two numbers should be
converted to en dash. Consider the text "Type Alt-0-1-5-0 to get an en
dash"—those are supposed to be hyphens, I believe, not en dashes. More
generally, there's no legitimate use of two consecutive hyphens in English other
than as a dash, and I certainly can't think of a legitimate use for " - " other
than as a dash, but I get the nagging feeling that there will be a nontrivial
number of non-ranges/subtractions that will look like them. I'd drop point 2
and go for 1, 3, and 4 instead.
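
A sketch of the reduced rule set (points 1, 3 and 4, with point 2 dropped), in Python for illustration. The function name `convert_dashes_134` is an assumption, and point 4 is read here as "any remaining run of exactly two hyphens", which is only one possible interpretation.

```python
import re

NBSP_REF = "&#160;"
EN_DASH_REF = "&#8211;"
EM_DASH_REF = "&#8212;"


def convert_dashes_134(text: str) -> str:
    """Apply points 1, 3 and 4 only (hypothetical helper)."""
    # Point 3: spaced double hyphen -> nonbreaking space, em dash, space.
    text = text.replace(" -- ", NBSP_REF + EM_DASH_REF + " ")
    # Point 4: any other run of exactly two hyphens -> lone em dash.
    text = re.sub(r"(?<!-)--(?!-)", EM_DASH_REF, text)
    # Point 1: spaced single hyphen -> nonbreaking space, en dash, space.
    text = text.replace(" - ", NBSP_REF + EN_DASH_REF + " ")
    return text
```

With the digit rule gone, the key sequence from the example above passes through unchanged.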

ayg wrote:

*** Bug 7125 has been marked as a duplicate of this bug. ***

ayg wrote:

Please note that this should really be localized. Whether to use phrases
(presumably very slow, but easy for i18n people to manage) or switch statements
(as fast as is possible, but slightly icky) I leave to people who know about
server load.
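
One way to picture the localized variant is a per-language lookup table, sketched below in Python. Everything here is hypothetical: the table name `RULES_BY_LANG`, the function `convert_localized`, and especially the replacement values, which are placeholders and not a statement about what any language's typography actually requires.

```python
# Hypothetical per-language rule tables. The replacement values are
# placeholders for illustration, not recommendations for any language.
RULES_BY_LANG = {
    "en": [(" -- ", "\u00a0\u2014 ")],
    "de": [(" -- ", "\u00a0\u2013 ")],
}


def convert_localized(text: str, lang: str) -> str:
    """Apply the rule table for `lang`; unknown languages are left untouched."""
    for needle, replacement in RULES_BY_LANG.get(lang, []):
        text = text.replace(needle, replacement)
    return text
```

A fixed in-code table like this corresponds to the "switch statement" option; the alternative of reading the rules from editable interface messages would trade speed for easier maintenance by translators.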

yonidebest wrote:

I would like to note that I would like this feature to *replace* the -- and --- sign
into another sort of hyphen (like the replacement of ~~~ to sig) and not just display
the text in another way. I want the Wiki code itself to change and display another
sort of hyphen - i.e. I wouldn't like to see Wiki code with -- and --- everywhere.

ui2t5v002 wrote:

(In reply to comment #31)

I would like to note that I would like this feature to *replace* the -- and ---
sign into another sort of hyphen (like the replacement of ~~~ to sig) and not
just display the text in another way. I want the Wiki code itself to change and
display another sort of hyphen - i.e. I wouldn't like to see Wiki code with --
and --- everywhere.

I would like to note that I want the opposite. :-) -- should be a wikicode and
rendered by the software as an em dash, in the right circumstances. If you just
want a double-hyphen to unicode dash converter, one can be made in javascript.

yonidebest wrote:

Thanks for the idea Omegatron. We will see if it is worth using javascript locally,
but I do think that text conversions should be handled by the server. If there is a
demand to keep the -- and --- as is in wikicode, perhaps the developers can create an
option for those who would like the -- and --- converted. At these times I wish I
knew programming...

*** Bug 14795 has been marked as a duplicate of this bug. ***

De-assigning since not under active development atm.

Marking this as wontfix for now. It is too hard to get it right and the existing automatic conversions already cause us trouble. If you mean something, type it. There is already enough assistance and methods to do so even if your keyboard layout is missing characters which are needed to type typographically correct and good looking text in your language.

(In reply to comment #16)

I've removed this from 1.5 as it has a nasty tendency to break legitimate
markup in addition to
generally being inconsistent in when it activated.

(In reply to comment #18)

  • nobody agrees on what should actually be converted when to what

Having used mailing lists, Usenet, and Mediawiki etc. for years, I was
aghast at the gall of WordPress meddling with what the user entered
(mainly quote marks), and am glad that Mediawiki will not be stepping
over that fine line.

epriestley added a commit: Unknown Object (Diffusion Commit). Mar 4 2015, 8:21 AM