
Fuzzy (approximate) wiki page title access - fuzzy bookmarking - auto search
Closed, Resolved (Public)

Description

I recently came across a wiki which implements a more useful way to access
(search) pages by actually implementing a form of fuzzy (approximate) bookmarking.

I am copying the relevant text from http://wiki.tcl.tk/391 :

To search for the word "cgi" in all page titles, you can use the URL:

http://purl.org/tcl/wiki/cgi

To search for this word in all titles and in the full texts, use:

http://purl.org/tcl/wiki/cgi*  (in general: a regular expression)

Or, if you prefer, you can enter the search word on the search page, at:

http://purl.org/tcl/wiki/search

But there's a little more to it. That last URL is actually a form of fuzzy
bookmarking. There is no web page called "search". Wikit presents its contents
as if it were a directory with pages, but it's all smoke and mirrors...

First of all, note that all Wikit pages have a unique identifying number. The
"About" page is at http://purl.org/tcl/wiki/1.html, for example. But although
these unique IDs are effective for internal links, they are quite awkward as
bookmarks, since they convey no information whatsoever about the title or
contents of a page.

To offer a more useful way of bookmarking, pages which are not of the form
<number>.html are treated as search instructions to locate a page. The following
URL is an instruction to look for a page titled "hawaii":

http://purl.org/tcl/wiki/hawaii

Assuming there is a page titled "hawaii" (case is ignored), the above URL will
lead directly to that page.

But wikis change. So do page titles, occasionally. Some page titles are long
and may contain embedded spaces or other inconvenient characters. This all makes
the above search mechanism a bit too brittle for long-lasting URLs.

The solution which has been adopted here is to refine the search process as
follows (everything after the slash will be called the search term):

  1. If the search term is a reference to a page (<number>.html), then simply
     go to that page
  2. If the search term matches a page title (while ignoring case), then jump
     to the page with that title
  3. If the search term includes one or more upper-case letters, modify the
     search to be approximate (see below). If the approximate match finds
     exactly one page, jump to that page.
  4. Otherwise, treat the search term as a regular search, and present the
     search results.

Approximate matching - if the search term has upper-case letters, for example
"OneTwoThree", it is turned into a match pattern (using the glob / string match
syntax). In the example given, a search would be performed on page titles
matching the pattern "*[Oo]ne*[Tt]wo*[Tt]hree*".
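
The four-step lookup and the glob conversion described above could be sketched roughly as follows. This is a Python illustration, not Wikit's actual code (Wikit is written in Tcl). The function names `approximate_pattern` and `resolve`, the `pages` dictionary, and the handling of search terms that start with a lower-case letter are assumptions made for the example.

```python
import re
from fnmatch import fnmatchcase


def approximate_pattern(term: str) -> str:
    """Turn a term such as 'OneTwoThree' into '*[Oo]ne*[Tt]wo*[Tt]hree*'."""
    out = []
    for ch in term:
        if ch.isupper():
            # Each upper-case letter starts a new fragment that may be
            # preceded by anything and may appear in either case.
            out.append("*[" + ch + ch.lower() + "]")
        elif ch in "*?[":
            # Assumption: glob metacharacters in the term are taken literally.
            out.append("[" + ch + "]")
        else:
            out.append(ch)
    body = "".join(out)
    if not body.startswith("*"):
        # Assumption: a term starting in lower case still matches anywhere.
        body = "*" + body
    return body + "*"


def resolve(term: str, pages: dict) -> tuple:
    """Resolve the part of the URL after the slash.

    `pages` is a hypothetical mapping of page id -> page title.
    Returns ("page", id) for a direct hit, or ("search", term) otherwise.
    """
    # Step 1: explicit page reference of the form <number>.html
    m = re.fullmatch(r"(\d+)\.html", term)
    if m:
        return ("page", int(m.group(1)))

    # Step 2: title match, ignoring case
    for pid, title in pages.items():
        if title.lower() == term.lower():
            return ("page", pid)

    # Step 3: approximate match, only if the term has upper-case letters
    if any(ch.isupper() for ch in term):
        pattern = approximate_pattern(term)
        hits = [pid for pid, title in pages.items()
                if fnmatchcase(title, pattern)]
        if len(hits) == 1:
            return ("page", hits[0])

    # Step 4: fall back to a regular search
    return ("search", term)
```

With a toy page table such as `{1: "About", 2: "hawaii", 3: "one two three"}`, `resolve("HAWAII", pages)` takes the step 2 branch, `resolve("OneTwoThree", pages)` takes the step 3 branch, and `resolve("cgi", pages)` falls through to the regular search.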

What's the point of all this? Well... this mechanism allows you to specify URLs
pointing into the Tcl'ers Wiki with some quite attractive properties:

  • If the search keyword is accurate enough, it's equivalent to a real URL
  • If the search is general enough, it'll survive minor title changes (e.g.
    typos)
  • The URL has a meaningful word in it, so people can remember what it was about
  • If more pages are added to the wiki, the search will turn up more than one
    match
  • This is an extremely useful feature, because the original match will be
    one of the search results listed, and so will new - probably related - pages

For an example, here's a link to Don Libes' book on Expect:

http://purl.org/tcl/wiki/Expect

And here's a search which lists all pages where the word "expect" is used:

http://purl.org/tcl/wiki/expect*

Version: unspecified
Severity: enhancement
URL: http://wiki.tcl.tk/391

Details

Reference
bz883

Event Timeline

bzimport raised the priority of this task from to Medium. (Nov 21 2014, 7:00 PM)
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz883.
bzimport added a subscriber: Unknown Object (MLST).

soloturn99 wrote:

does this solve the problem of not finding "my_faq" and just getting "faq" wiki,
when searching for "faq"?

(In reply to comment #1)

does this solve the problem of not finding "my_faq" and just getting "faq" wiki,
when searching for "faq"?

Of course it will, as long as the user input (call it the "needle") is not too far away from the matching text in
the "haystack". I am an expert in AGREP (see http://www.tgries.de/agrep; there are several spin-offs which could be
integrated in MediaWiki). AGREP used with the option "-By" would automatically first try an exact match (my_faq = faq), which
does not match; in that case it increments the allowed error count to 1 and searches again with one allowed error. The same loops until at
least one match has been found, usually several similar spellings ...

... which then would be presented to the user to select from OR
... to really create a new page with the "my_faq" page title, if the user wants this.
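
The loop described here (try an exact match first, then allow one more error each round until something matches) could be sketched like this. It is not AGREP itself and does not reproduce its options. The names `edit_distance` and `fuzzy_find`, the use of whole-title Levenshtein distance, the case folding, and the `max_errors` cutoff are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def fuzzy_find(needle: str, titles: list, max_errors: int = 3) -> tuple:
    """Return (errors, matches) for the smallest error count with a match.

    Starts at 0 errors (exact match, ignoring case) and increments the
    allowed error count until at least one title matches or the
    hypothetical `max_errors` limit is reached.
    """
    needle = needle.lower()
    for allowed in range(max_errors + 1):
        hits = [t for t in titles
                if edit_distance(needle, t.lower()) <= allowed]
        if hits:
            return (allowed, hits)
    return (None, [])
```

The returned list is what would be presented to the user to select from; an empty list is the case where offering to create a new page would make sense.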

Are you a developer?

(added for documentation completeness only)

See also my other enhancement bug
http://bugzilla.wikimedia.org/show_bug.cgi?id=2486

Automatic wiki page name suggestion similar to "Google Suggest"

Changed component to "RecentChanges"

happy.melon.wiki wrote:

This bug is totally stale, but most of the features requested seem to have been developed in the intervening period. We have much better search functionality through LuceneSearch, which includes "did you mean", fuzzy matching, etc. We have mwsuggest, which does useful things in the search box. I don't think a fuzzy-matching algorithm being automatically triggered on all URLs is a good idea. Resolving FIXED.