*first* perform Unicode normalisation and check for existence of pages *after* the normalisation
Closed, ResolvedPublic

Description

Hi,

I found a problem with URLs containing some Devanagari characters on the current (14.02.2005) Hindi
Wiktionary project. I tested this with Konqueror and Mozilla, and I think it is only present in 1.4.

URLs with some Devanagari characters (at least ज़, ड़ and फ़) can't be resolved. Links appear in red
although the article exists. The same happens when using the Unicode numeric references, respectively
&#x095B;, &#x095C; and &#x095E; for the 3 characters above.

Examples :
http://hi.wiktionary.org/wiki/शनिवार
Article [[हफ़्ता]] exists, but is not accessible on http://hi.wiktionary.org/wiki/हफ़्ता

Thanks a lot,
Yann


Version: 1.4.x
Severity: major
URL: http://hi.wiktionary.org

Details

Reference
bz1527

Revisions and Commits

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 8:13 PM
bzimport set Reference to bz1527.
bzimport added a subscriber: Unknown Object (MLST).

This bug also appears with Firefox and IE on Windows, so it's independent of the browser.

Here is a way to get out of it, thanks to Muke. Yann

<MukeUTF-8> I have run into the same bug
<MukeUTF-8> It is because of Unicode normalization
<MukeUTF-8> the same happened with old articles using the Greek acute accent
<MukeUTF-8> I think it is the same problem. i am looking into ti
<MukeUTF-8> *it
<yannf> MukeUTF-8, oh interesting
<MukeUTF-8> the reason is because, say.
<MukeUTF-8> the mediawiki software takes the "ज़" that you type in
<MukeUTF-8> and it converts it into the "ज" plus the dot
<MukeUTF-8> as two separate characters, because the Unicode standard defines them as identical.
<MukeUTF-8> the problem is that your article with "ज़" in the title was created first... and it was never
converted
<yannf> yes, it appears on URLs with letters with a dot
<yannf> what do you mean by "converted" ?
<MukeUTF-8> I mean that it converts the one character "ज़" into the two characters "ज" and "़"
<yannf> how can we solve this ?
<MukeUTF-8> Someone has to go into the database and convert the old article titles.
<yannf> there are also articles which are accessible, but the link remains red
<MukeUTF-8> or at least convert whatever points to the articles
<yannf> also with a dot in the URL
<yannf> "at least convert whatever points to the articles" <- but the links seem to be ok
<MukeUTF-8> I mean in the database
<MukeUTF-8> I don't really know the details of how it could be fixed.
<yannf> why it appears only in 1.4 ?
<MukeUTF-8> Because Unicode normalization was implemented
<MukeUTF-8> which means, for convenience of storage and searching and whatnot, characters that are defined as
identical are stored in a canonical form, which may not be the same form as was typed in
<MukeUTF-8> another example was the Greek characters I mentioned... where "ά" (greek alpha with old acute
accent) was typed in before, it is now converted to "ά" (greek alpha with modern tonos)
<MukeUTF-8> So old article titles with "ά" with the old accent can't be reached anymore, because it will always
be turned into the letter with the modern accent by the software
<yannf> what if i copy the articles by hand ?
<MukeUTF-8> if you can get to the article
<MukeUTF-8> New articles shouldn't have any trouble
<MukeUTF-8> only ones from before the conversion
<yannf> yes, but there are also articles which are accessible, but the link remains red
<MukeUTF-8> that i'm not sure about
<MukeUTF-8> oh wait
<MukeUTF-8> When was the last time the page with the link was edited?
<yannf> http://hi.wiktionary.org/w/index.php?title=Template:-fr-&action=history
<yannf> Dec 30, 2004
<MukeUTF-8> because not only old article titles, but old article text was not converted. So if the link
contains an "old" character, it will consider it a red link, even though the target page with the "new"
character exists. But the conversion is in place now, so if you edit the page, it should convert it to a
"new" character and work properly. Try it now, edit the page and hit "preview"
<MukeUTF-8> (the page is not loading for me atm, or i would check this myself)
<yannf> if i edit the page, the link becomes red on http://hi.wiktionary.org/wiki/Template:-fr-
<MukeUTF-8> ah...
<MukeUTF-8> that's because the page with the "old" character exists, but not the page with the "new" character
<yannf> yes, i think i understood
<MukeUTF-8> http://bugzilla.wikipedia.org/show_bug.cgi?id=1375
<yannf> on this page, the link was red, i edited, and it's now blue, http://hi.wiktionary.org/wiki/Template:kk
<MukeUTF-8> *nod*
<MukeUTF-8> the articles can be updated to the new characters by editing them... but the titles need to be
edited by someone with access to the database, because we can't reach them from here
<MukeUTF-8> I posted on the wiktionary mailing list for them to do it for the Greek words involved but it never
happened :\
<yannf> well, there are only a handful of them, so i could even create them again, if it solves the pb
<MukeUTF-8> but then, the things i ask for never seem to happen...
<yannf> i have a dump of the old database
<MukeUTF-8> true, you could make them again, though you lose the history
<MukeUTF-8> and attributions
<yannf> yes, i am the only editor on the indi wiktionary ;)
<yannf> *hindi
<MukeUTF-8> ah, well, then that is probably ok :x)
<MukeUTF-8> i'm just about the only editor on the latin one, so I know how it is ;)
<yannf> ;)
<MukeUTF-8> there is like... one other regular user. but he only speaks Japanese, and only adds proper
names...
<MukeUTF-8> so I don't generally count him o-o
<yannf> there will be a few lost articles in the database, that the only remaining pb
<MukeUTF-8> hmm, i suppose i could pull those greek articles out of the old db dumps...
<yannf> may i copy the log of this chat to the bug report ?
<yannf> it would be others
<yannf> it would help others
<MukeUTF-8> ok
<MukeUTF-8> I have to go to work now. ttyl.
<yannf> ok thanks
<MukeUTF-8> no problem :)
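
The conversion Muke describes in the log above can be reproduced outside MediaWiki. The following is a minimal sketch in Python, not MediaWiki's own code: it assumes the normalisation in question is Unicode NFC, and it assumes that the "old acute accent" alpha is U+1F71 (Greek small letter alpha with oxia). The helper names `codepoints` and `nfc` are made up for this illustration.

```python
import unicodedata


def codepoints(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def nfc(text: str) -> str:
    """Return the NFC (canonical composition) form of the text."""
    return unicodedata.normalize("NFC", text)


# Precomposed code points cited in this report (U+095B, U+095C, U+095E),
# plus U+1F71, which is an assumption about the Greek case in the chat.
samples = ["\u095b", "\u095c", "\u095e", "\u1f71"]

for original in samples:
    normalized = nfc(original)
    name = unicodedata.name(original)
    print(f"{name}: {codepoints(original)} -> {codepoints(normalized)}")
```

With the Unicode data bundled with Python, the three Devanagari letters should each come out as a base letter followed by the nukta U+093C, and the oxia alpha as the single tonos letter U+03AC. That matches the "one character becomes two" and "old accent becomes modern accent" behaviour described in the chat.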

So I re-created the inaccessible articles. Now the old ones need to be deleted: every article created
before the conversion with ड़ (&#x095C;), ज़ (&#x095B;) or फ़ (&#x095E;) in its URL.
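
One way to find the affected titles in a database dump is to look for titles that change under normalisation. This is only a sketch under assumptions: it treats the normalisation as Unicode NFC, takes the titles as a plain list of strings, and the function name `titles_needing_migration` is hypothetical, not an existing maintenance script.

```python
import unicodedata


def titles_needing_migration(titles):
    """Yield (stored, normalised) pairs for titles that are not in NFC form."""
    for title in titles:
        normalized = unicodedata.normalize("NFC", title)
        if normalized != title:
            yield title, normalized


# Two sample titles written as escapes so the exact code points are visible.
# The first contains no precomposed nukta letter; the second is spelled
# with the precomposed U+095E, as a title stored before the conversion might be.
titles = [
    "\u0936\u0928\u093f\u0935\u093e\u0930",
    "\u0939\u095e\u094d\u0924\u093e",
]

for stored, normalized in titles_needing_migration(titles):
    print(f"{stored!r} should become {normalized!r}")
```

Only the second title is reported, because only it contains a precomposed letter that normalisation rewrites.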

gangleri wrote:

Hallo!

please see

This fixed the problem both for the section and the category and also
[[wiktionary:hi:अंग्रेज़ी]]. (All links are blue now / some black at
[[wiktionary:hi:अंग्रेज़ी]]).
http://hi.wiktionary.org/w/index.php?title=%E0%A4%85%E0%A4%82%E0%A4%97%E0%A5%8D%E0%A4%B0%E0%A5%87%E0%A4%9C%E0%A4%BC%E0%A5%80&action=purge

A duplicate of this is
Bug 3860: links generated with precombined characters show red despite the fact
that the normalised links exist

best regards reinhardt [[user:gangleri]]

gangleri wrote:

*** Bug 3860 has been marked as a duplicate of this bug. ***

gangleri wrote:

making readjustments for component and dependencies
There are some plans to make this easier in Bugzilla:
Bug [Bugzilla] 102161: Resolving as duplicate should display field differences
Bug [Bugzilla] 319803: feature request: when changing product, component etc. display old product,
old component, other fields in all required steps
Bug [Bugzilla] 65382: Let people know when deps exist as resolving duplicate.

Bug 3860
depends on Bug 2399: Unicode normalization interferes with Hebrew and Arabic
with vowels
blocks Bug 3985: character conversion (tracking)

"Component" will be changed to "Internationalization" in a next "edit".

gangleri wrote:

*** Bug 1375 has been marked as a duplicate of this bug. ***

gangleri wrote:

changing summary from
problem on URL with Devanagari characters
to
*first* perform Unicode normalisation and check for existence of pages *after*
the normalisation

Hope that this would be easy to fix. Unicode normalisation should always be
performed *first*.
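
The requested order of operations can be sketched as follows. This is an illustration under assumptions, not MediaWiki's implementation: the normalisation is taken to be Unicode NFC, the page store is modelled as a plain set of title strings, and `normalize_title` and `page_exists` are hypothetical names.

```python
import unicodedata


def normalize_title(title: str) -> str:
    """Normalise a title to NFC before any lookup or comparison."""
    return unicodedata.normalize("NFC", title)


def page_exists(requested: str, stored_titles: set) -> bool:
    """Normalise the requested title first, then check for existence."""
    return normalize_title(requested) in stored_titles


# A title stored before normalisation was introduced, spelled with the
# precomposed letter U+095E.
legacy_store = {"\u0939\u095e\u094d\u0924\u093e"}

# The same store after its titles have been converted to NFC.
migrated_store = {normalize_title(t) for t in legacy_store}

link_text = "\u0939\u095e\u094d\u0924\u093e"
print(page_exists(link_text, legacy_store))    # False: stored title was never converted
print(page_exists(link_text, migrated_store))  # True: both sides are normalised
```

The first lookup fails for the reason given in the chat log: the request is normalised but the stored title is not. Normalising first only helps once the stored titles are in the same form.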

Changing Severity from "normal" to "major".

Bug 1375: Unicode normalization leaves red links
mentions that special:Whatlinkshere might be affected as well. Please verify whether
this will be fixed as well.

Hopefully there are no other places in the code where the Unicode normalisation
is *not* performed first.

best regards reinhardt [[user:gangleri]]

gangleri wrote:

(In reply to comment #11)

Bug 1375: Unicode normalization leaves red links
mentions that special:Whatlinkshere might be affected as well. Please verify whether
this will be fixed as well.

http://la.wiktionary.org/wiki/Special:Whatlinkshere/%E1%BD%88%CE%BE%CF%8D%CF%82
does *not* show "[[wiktionary:la:Ὀξύς]]"

*but* *every* [[Special:Whatlinkshere/foo]] shows [[foo]] in the list.
This is easier to see at [[Special:Whatlinkshere/Tofu]].
Why is this *not* the case at [[wiktionary:la:Special:Whatlinkshere/Ὀξύς]]?

gangleri wrote:

(In reply to comment #6)

So I created again the inaccessible articles. Now the old ones need to be
deleted: all articles with ड़ (&#x095C;), ज़ (&#x095B;) or फ़ (&#x095E;) in the
URL created before the conversion have to be deleted.

Please also read the discussion in comment #5.

Yann, I understand that there was / there is also *another* problem related to
page titles that you cannot access and which should be deleted.
Please go to [[wiktionary:hi:special:Allpages]]. Try to identify whether you see
titles which will not open or which appear twice there. Please do all of the following:
a) make a screen dump and mark some of the titles which have / create problems
b) please provide the links
c) please describe the problem from *your* point of view (what you expect, what
you can do, what does not work)
d) How many namespaces are affected?
Thanks in advance!

best regards reinhardt [[user:gangleri]]

gangleri wrote:

(In reply to comment #12)

(In reply to comment #11)

Bug 1375: Unicode normalization leaves red links
mentions that special:Whatlinkshere might be affected as well. Please verify whether
this will be fixed as well.

"special:Whatlinkshere might be affected as well": see also
[[user:Gangleri/tests/bugzilla/03860]]

As far as I can see the problem only affects very old titles, and I think your script that checks invalid titles should catch them.

Diffusion added a commit: Unknown Object (Diffusion Commit). Mar 4 2015, 8:22 AM