
">"-token in URL-tail parsed wrongly
Closed, ResolvedPublic

Description

Author: timwi

Description:
BUG MIGRATED FROM SOURCEFORGE
http://sourceforge.net/tracker/index.php?func=detail&aid=957818&group_id=34373&atid=411192
Originally submitted by Roger Persson (rogper) 2004-05-21 07:00

By coincidence I noticed that a greater-than (>) character in a URL is
rendered wrongly if it occurs as the last character of the URL.

Example:
Check this extra semicolon http://sample.link/<hello> in the
end
Check this http://sample.link/<hello&gt strange thing

Result:
http://sample.link/<hello>;
http://sample.link/<hello>

  • Additional comments ------------------------

Date: 2004-05-28 09:35
Sender: SF user vibber

The HTML output is:
http://
sample.link/&lt;hello&gt;;

It looks like the HTML stripping is being done before external links, so
the "<" and ">" have become "&lt;" and "&gt;". Semicolons are actually
legal in links; the _final_ punctuation (not followed by linkable chars) is
stripped, but the bits in the middle are considered fair game for
belonging to a link, so it extends up to the "&gt" but not including
the final ";" (or the other ";" that follows, which is extraneous).
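
A rough sketch of the effect described above, in Python rather than the parser's own PHP. The regex, the punctuation set and the function name are simplifying assumptions for illustration, not the actual Parser.php logic:

```python
import html
import re

# Assumption: a bare URL runs until the next whitespace character.
URL_RE = re.compile(r"https?://\S+")
# Assumption: punctuation stripped from the very end of a matched URL.
TRAILING_PUNCT = ",;.:!?"


def find_links_after_escaping(wikitext):
    """Escape HTML first, then look for external links (the buggy order)."""
    escaped = html.escape(wikitext, quote=False)  # '<' -> '&lt;', '>' -> '&gt;'
    links = []
    for match in URL_RE.finditer(escaped):
        # Only the final punctuation is dropped; '&' and ';' in the
        # middle of the match stay inside the link.
        links.append(match.group(0).rstrip(TRAILING_PUNCT))
    return links
```

With the input "Check http://sample.link/<hello> in the end", the single link found is "http://sample.link/&lt;hello&gt": the ";" of the entity is treated as trailing punctuation and falls outside the link.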

Correct behavior would be to have the link cover "http://sample.link/",
then cut off at the <. This will require parsing for external links before
stripping HTML; perhaps another placeholder step would be useful
here (might also help the longstanding URL-within-URL bug).
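
One way the placeholder idea could look, sketched in Python. Everything here (the NUL-delimited token format, the regex, the function name) is a hypothetical illustration and not how the parser does it; trailing-punctuation handling is left out for brevity:

```python
import html
import re

# Assumption: on raw text a URL stops at whitespace, '<' or '>'.
RAW_URL_RE = re.compile(r"https?://[^\s<>]+")
# Hypothetical placeholder token: NUL, index, NUL.
PLACEHOLDER_RE = re.compile(r"\x00(\d+)\x00")


def render(wikitext):
    """Find links on raw text, escape the rest, then put the links back."""
    links = []

    def stash(match):
        # Remember the URL and leave a token that escaping will not alter.
        links.append(match.group(0))
        return "\x00%d\x00" % (len(links) - 1)

    with_tokens = RAW_URL_RE.sub(stash, wikitext)
    escaped = html.escape(with_tokens, quote=False)

    def restore(match):
        url = html.escape(links[int(match.group(1))], quote=True)
        return '<a href="%s">%s</a>' % (url, url)

    return PLACEHOLDER_RE.sub(restore, escaped)
```

For "Check http://sample.link/<hello> end" this yields a link covering only "http://sample.link/", followed by the escaped text "&lt;hello&gt;".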

Bug is present in both 1.2 and current 1.3.


Version: 1.4.x
Severity: normal

Details

Reference
bz289

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 6:52 PM
bzimport set Reference to bz289.
bzimport added a subscriber: Unknown Object (MLST).

timwi wrote:

*** Bug 308 has been marked as a duplicate of this bug. ***

Still present; added a test case to parserTests.

wmahan_04 wrote:

According to RFC 2396, '<' and '>' are disallowed within URIs, and hence I added
them to the list of prohibited characters.

Wil, right. The problem is that the conversion of < and > to &lt; and &gt; has already been done when we do the
external link parsing, and & and ; _are_ allowed in URLs.

wmahan_04 wrote:

(In reply to comment #4)
> Wil, right. The problem is that the conversion of < and > to &lt; and &gt; has
> already been done when we do the external link parsing, and & and ; _are_
> allowed in URLs.

Oh, I see. This should now be fixed in HEAD (Parser.php revision 1.323).
Rather than replacing external links before stripping HTML tags as
you suggested before, I just added a check for '&lt;' and '&gt;'
within external links. It's not an especially elegant solution, but
I think it will fix this without meddling with the order of
parser passes.
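
The kind of check described here might look roughly like the following Python sketch. It is not the Parser.php change itself; the function name and the returned (url, trail) pair are assumptions for illustration:

```python
import re

# The escaped forms of '<' and '>' as they appear after HTML stripping.
ESCAPED_ANGLE_RE = re.compile(r"&(?:lt|gt);")


def split_at_escaped_angle(escaped_url):
    """Cut a matched URL at the first '&lt;' or '&gt;'.

    Returns (url, trail): the part kept as the link target and the part
    handed back to the surrounding text.
    """
    match = ESCAPED_ANGLE_RE.search(escaped_url)
    if match is None:
        return escaped_url, ""
    return escaped_url[:match.start()], escaped_url[match.start():]
```

Applied to "http://sample.link/&lt;hello&gt;", this keeps "http://sample.link/" as the link and returns "&lt;hello&gt;" as plain trailing text, without changing the order of the parser passes.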

wmahan_04 wrote:

(In reply to comment #6)

Added more test cases.

Fixed one by adding '<' and '>' back to the list of disallowed chars
(I added them earlier, but then I got nervous and undid the change.)

The two cases that still fail are due to the way disallowed
characters are treated as part of the link description; if that's
a bug, it's separate from this one, IMHO.

Issue fixed in HEAD and 1.5; all parserTests in HEAD passed successfully.