">"-token in URL-tail parsed wrongly
Closed, ResolvedPublic
Actions

Assigned To

None

Authored By

	• bzimport
	Sep 3 2004, 3:07 AM

Description

Author: timwi

Description:
BUG MIGRATED FROM SOURCEFORGE
http://sourceforge.net/tracker/index.php?func=detail&aid=957818&group_id=34373&atid=411192
Originally submitted by Roger Persson (rogper) 2004-05-21 07:00

By coincidence I noticed that the greater-than (>) character in URLs is
rendered wrongly if it occurs as the last character in the URL.

Example:
Check this extra semicolon http://sample.link/<hello> in the
end
Check this http://sample.link/<hello&gt strange thing

Result:
http://sample.link/<hello>;
http://sample.link/<hello>

Additional comments ------------------------

Date: 2004-05-28 09:35
Sender: SF user vibber

The HTML output is:
http://
sample.link/<hello>;

It looks like the HTML stripping is being done before external links, so
the "<" and ">" have become "&lt;" and "&gt;". Semicolons are actually
legal in links; the _final_ punctuation (not followed by linkable chars) is
stripped, but the bits in the middle are considered fair game for belonging
to a link, so it extends up to the "&gt" but not including the final ";"
(or the other ";" that follows, which is extraneous).
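
The pass ordering described above can be reproduced with a small sketch. This is not MediaWiki's Parser.php; the function name, the URL regex, and the trailing-punctuation set are all assumptions made for illustration, and Python is used only to keep the example short.

```python
import html
import re

# Hypothetical stand-ins for the parser passes; not MediaWiki code.
URL_RE = re.compile(r"https?://[^\s<>\"]+")
TRAILING_PUNCT = ",;.:!?"  # assumed set of punctuation stripped from a link's end


def linkify_after_escaping(text):
    """Escape HTML first, then look for bare URLs (the buggy order)."""
    escaped = html.escape(text, quote=False)  # "<" -> "&lt;", ">" -> "&gt;"

    def make_link(match):
        url = match.group(0)
        # Only the final punctuation is stripped; "&" and ";" in the middle stay.
        stripped = url.rstrip(TRAILING_PUNCT)
        tail = url[len(stripped):]
        return f'<a href="{stripped}">{stripped}</a>{tail}'

    return URL_RE.sub(make_link, escaped)


print(linkify_after_escaping("see http://sample.link/<hello> here"))
```

With this ordering the link swallows "&lt;hello&gt" and only the final ";" is left outside the anchor, which matches the symptom reported here.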

Correct behavior would be to have the link cover "http://sample.link/",
then cut off at the <. This will require parsing for external links before
stripping HTML; perhaps another placeholder step would be useful here
(might also help the longstanding URL-within-URL bug).
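
A minimal sketch of the placeholder idea suggested above, assuming a NUL-delimited marker and a simplified URL regex (both hypothetical, and not taken from MediaWiki): links are pulled out before escaping, so the raw "<" still terminates the URL.

```python
import html
import re

# Hypothetical URL pattern: a raw "<" or ">" ends the URL.
URL_RE = re.compile(r"https?://[^\s<>\"]+")


def linkify_before_escaping(text):
    """Stash external links as placeholders, escape HTML, then restore them."""
    links = []

    def stash(match):
        url = html.escape(match.group(0), quote=True)
        links.append(f'<a href="{url}">{url}</a>')
        # NUL-delimited index; assumed never to occur in real input.
        return f"\x00{len(links) - 1}\x00"

    with_placeholders = URL_RE.sub(stash, text)
    escaped = html.escape(with_placeholders, quote=False)
    return re.sub(r"\x00(\d+)\x00", lambda m: links[int(m.group(1))], escaped)


print(linkify_before_escaping("see http://sample.link/<hello> here"))
```

Here the link covers only "http://sample.link/" and the rest is emitted as escaped text.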

Bug is present in both 1.2 and current 1.3.

Version: 1.4.x
Severity: normal

Details

Reference: bz289

Event Timeline

• bzimport raised the priority of this task from to Medium. Nov 21 2014, 6:52 PM

• bzimport added projects: MediaWiki-General, Parser.

• bzimport set Reference to bz289.

• bzimport added a subscriber: Unknown Object (MLST).

• bzimport created this task. Sep 3 2004, 3:07 AM

timwi wrote:

*** Bug 308 has been marked as a duplicate of this bug. ***

Still present; added a test case to parserTests.

wmahan_04 wrote:

According to RFC 2396, '<' and '>' are disallowed within URIs, and hence I added
them to the list of prohibited characters.

Wil, right. The problem is that the conversion of < and > to &lt; and &gt; has already been done when we do the
external link parsing, and & and ; _are_ allowed in URLs.

wmahan_04 wrote:

(In reply to comment #4)
> Wil, right. The problem is that the conversion of < and > to &lt; and &gt; has
> already been done when we do the
> external link parsing, and & and ; _are_ allowed in URLs.

Oh, I see. This should now be fixed in HEAD (Parser.php revision 1.323).
Rather than replacing external links before stripping HTML tags as
you suggested before, I just added a check for '&lt;' and '&gt;'
within external links. It's not an especially elegant solution, but
I think it will fix this without meddling with the order of
parser passes.
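
The kind of check described above might look roughly like the following sketch. It is not the code from Parser.php revision 1.323; the helper name and regex are assumptions. The idea is that a URL matched after HTML escaping is cut at the first escaped angle bracket, and the remainder is handed back as ordinary text.

```python
import re

# Escaped forms of "<" and ">" as they appear after the HTML-stripping pass.
ENTITY_RE = re.compile(r"&(?:lt|gt);")


def split_at_escaped_bracket(url):
    """Return (link_part, trailing_text), cutting at the first &lt; or &gt;."""
    match = ENTITY_RE.search(url)
    if match is None:
        return url, ""
    return url[:match.start()], url[match.start():]


print(split_at_escaped_bracket("http://sample.link/&lt;hello&gt;"))
print(split_at_escaped_bracket("http://example.org/a?b=1&c=2;d"))
```

URLs that merely contain "&" or ";" are left alone, since those characters remain legal in links.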

Added more test cases.

wmahan_04 wrote:

(In reply to comment #6)
> Added more test cases.

Fixed one by adding '<' and '>' back to the list of disallowed chars
(I added them earlier, but then I got nervous and undid the change.)

The two cases that still fail are due to the way disallowed
characters are treated as part of the link description; if that's
a bug, it's separate from this one, IMHO.

Issue fixed in HEAD and 1.5, all parsertests in HEAD passed successfully.