
Google sitemap support
Closed, ResolvedPublic

Description

Author: neubau

Description:
Google has released a protocol for sitemaps. It is in an experimental stage
right now, and Google does not guarantee anything.

The announcement is at
http://googleblog.blogspot.com/2005/06/webmaster-friendly.html.

The protocol is explained at
https://www.google.com/webmasters/sitemaps/docs/en/protocol.html

The FAQ for this project is at
https://www.google.com/webmasters/sitemaps/docs/en/faq.html.

Currently, Wikipedia's pages do not get re-indexed by Google "fast enough". I
monitored the article [[Sarah Kane]] at de.wikipedia when it was first written
and well linked. It took several weeks until the URL appeared and several weeks
more until the content was indexed. With more than 2 million Wikipedia articles
and many more pages such as user pages, talk pages and other namespaces, simply
crawling the site might not be the best solution.

MediaWiki could automatically provide sitemaps. The protocol supports gzip
compression. A single file may not be larger than 10MB (uncompressed) and may
not contain more than 50,000 URLs. It is allowed to have more than one sitemap
file and to link them in a sitemap index file. Sitemap index files may not list
more than 1,000 sitemaps. Under these constraints, it seems possible to stay
within the protocol limits for the foreseeable future, even for the largest
individual MediaWiki installations (such as en.wikipedia and de.wikipedia).
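The arithmetic above suggests the split can be done mechanically. A minimal
sketch (the helper names are invented for illustration, not actual MediaWiki
code) that chunks a flat URL list into sitemap-sized pieces, derives per-file
names, and checks the index limit:

```python
import math

# Limits taken from the sitemap protocol description above.
MAX_URLS_PER_SITEMAP = 50_000
MAX_SITEMAPS_PER_INDEX = 1_000

def chunk_urls(urls):
    """Split a flat URL list into chunks of at most 50,000 entries each."""
    for i in range(0, len(urls), MAX_URLS_PER_SITEMAP):
        yield urls[i:i + MAX_URLS_PER_SITEMAP]

def sitemap_filenames(total_urls, base="sitemap"):
    """File names for the per-chunk sitemaps, listed by one index file.
    Raises if the site would exceed the 1,000-sitemaps-per-index limit."""
    n_files = math.ceil(total_urls / MAX_URLS_PER_SITEMAP)
    if n_files > MAX_SITEMAPS_PER_INDEX:
        raise ValueError("site would need more than one sitemap index")
    return [f"{base}-{i:04d}.xml.gz" for i in range(n_files)]
```

With 2 million articles this yields 40 sitemap files, comfortably below the
1,000-file index limit, which supports the estimate above.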

The XML DTD contains several tags which would have to be filled with content:

  • changefreq: Enumerated list. Valid values are "always", "hourly", "daily",
    "weekly", "monthly", "yearly" and "never". Suggestion: take the current date
    minus the date on which the article was created, and divide the result by
    the number of revisions (i.e. the average interval between edits). A finer
    solution might be to monitor only the frequency of edits within the last 2
    months, to reflect "current event" articles better.

  • lastmod: Time at which the URL was last modified. We already have that
    information in the cur table.

  • loc: URL for that page. Obvious.

  • priority: Optional. The priority of a particular URL relative to other pages
    on the same site. The value for this tag is a number between 0.0 and 1.0,
    where 0.0 identifies the lowest-priority page(s) on your site and 1.0 the
    highest. It would be simple to give all articles a 0.7 and other namespaces
    significantly lower priorities. One might consider a more sophisticated
    approach based on the number of backlinks.
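The heuristics above can be sketched as follows. The bucket thresholds and
helper names are assumptions for illustration; the protocol only defines the
enumerated values, not how to pick them:

```python
def changefreq(age_days, num_revisions):
    """Average interval between edits (page age divided by revision count),
    mapped onto the enumerated changefreq values. Thresholds are arbitrary."""
    interval = age_days / max(num_revisions, 1)
    if interval < 1:
        return "daily"
    if interval < 7:
        return "weekly"
    if interval < 31:
        return "monthly"
    return "yearly"

def priority(namespace):
    """Flat per-namespace priority: 0.7 for articles (namespace 0),
    significantly lower for all other namespaces."""
    return 0.7 if namespace == 0 else 0.3
```

For example, an article edited 60 times in its first 30 days would report
"daily", while a page untouched for years degrades to "yearly".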


Version: unspecified
Severity: enhancement
URL: https://www.google.com/webmasters/sitemaps/docs/en/protocol.html

Details

Reference
bz2320

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 8:30 PM
bzimport set Reference to bz2320.
bzimport added a subscriber: Unknown Object (MLST).

I like the idea, and I believe this could be easily done by extending the code
that creates the RSS feed for the recent changes page.

However, this would probably be helpful mostly for small wikis, as the large
number of edits on the big wikipedias is likely to be more than google is
willing to handle. But it's worth a try... Maybe it would be a good idea to talk
to google directly about it - they may be willing to handle wikipedia specially
(they already do, for their "definition" feature).

avarab wrote:

So, basically a frigg'n huge RC feed for search engines ;)

It might be good to just set changefreq to "always" to avoid the performance
penalty of computing it (you'd have to load the whole timestamp history, or a
window like the last 10 revisions, rather than just look at the recentchanges
table); at least make it optional.

It's probably best to skip priority, or make it optional and compute it from
backlinks like Mathias suggested.

neubau wrote:

(In reply to comment #2)

> So, basically a frigg'n huge RC feed for search engines ;)

There is a proverb in German "Seit wann kommt der Berg zum Prophet?" (does
"Since when does the mountain go to the prophet?" make any sense?).

I would rather not call this an RC feed. It is simply a "standardized" and
slightly more informative sitemap which can be found on some web sites. An RC
feed would look more similar to trackback/pings.

> It might be good to just set changefreq to "always" to avoid the performance
> penalty of computing it (you'd have to load the whole timestamp history, or a
> window like the last 10 revisions, rather than just look at the recentchanges
> table); at least make it optional.

Putting the changefreq to "always" would not be a good idea. The point of this
sitemap is to make crawler and spider bots work more efficiently. An "always"
for all pages would eliminate all the advantages over blind crawling of the web
site. "Always" is meant for pages that change every time you check them. Most
standard Wikipedia articles do not change fast. I see that there is some
computational effort to get useful results, but putting changefreq="always" on
articles like [[Mucocutaneous boundary]] is simply wrong. That article has
looked the same for 18 months. If you really want a fixed setting for all
articles, I would recommend nothing more frequent than "weekly". A possible
solution would be a standard setting of "weekly" for all articles while running
a log on the recentchanges channel in IRC. A script could find all the articles
which have changed more than twice within two days (just a thought) and do a
simple search-and-replace to "daily" in the sitemap files. This would reduce
computing power, I guess. I don't know if this idea would survive contact with
reality.
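The "weekly by default, bump busy pages to daily" idea could be sketched like
this (the helper name is invented, and Unix timestamps in seconds are assumed):

```python
def override_changefreq(default_freq, recent_changes, now):
    """Return "daily" for pages edited more than twice within the last
    two days (per the suggestion above); otherwise keep the default.
    recent_changes is a list of edit timestamps in seconds."""
    two_days = 2 * 24 * 3600
    recent = [t for t in recent_changes if now - t <= two_days]
    return "daily" if len(recent) > 2 else default_freq
```

A batch script could feed this from a log of the recentchanges channel and
patch the sitemap files in place.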

> It's probably best to skip priority, or make it optional and compute it from
> backlinks like Mathias suggested.

Actually, priority *is* optional according to the google web site. Computing it
from the backlinks might not really reflect the "real" priority of that
article, whatever that may be. I guess we could all agree that namespace 0 has
a higher priority in any case than all the other namespaces. The logarithm
(base 4) of the number of backlinks, divided by 10 (capped at 0.5), plus 0.5,
would give a priority between 0.5 and 1.0 (are there articles with more than
1024 backlinks on en right now?). This might be better than nothing. The
backlink information is in the MySQL tables anyway, right?
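One reading of the formula (a base-4 logarithm, which makes the 0.5 cap land
exactly at 1024 = 4^5 backlinks, matching the question above) can be sketched
as follows; the function name is invented for illustration:

```python
import math

def backlink_priority(num_backlinks):
    """Priority in [0.5, 1.0]: log base 4 of the backlink count,
    divided by 10 and capped at 0.5, plus a 0.5 floor."""
    if num_backlinks < 1:
        return 0.5
    return 0.5 + min(0.5, math.log(num_backlinks, 4) / 10)
```

An orphan page gets 0.5, a page with 16 backlinks 0.7, and anything with 1024
or more backlinks saturates at 1.0.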

avarab wrote:

(In reply to comment #3)

> (In reply to comment #2)
>
>> So, basically a frigg'n huge RC feed for search engines ;)
>
> [snip]
>
> I would rather not call this an RC feed. It is simply a "standardized" and
> slightly more informative sitemap which can be found on some web sites. An RC
> feed would look more similar to trackback/pings.

I was under the impression that this index would only list a subset of the pages
on the wiki, but just so that we're clear, would it be a complete index (though
perhaps not totally up to date) of them all?

>> It might be good to just set changefreq to "always" to avoid the performance
>> penalty of computing it (you'd have to load the whole timestamp history, or a
>> window like the last 10 revisions, rather than just look at the recentchanges
>> table); at least make it optional.

> Putting the changefreq to "always" would not be a good idea. [snip] I don't
> know if this idea would survive contact with reality.

I agree, not using "always" is probably best.

>> It's probably best to skip priority, or make it optional and compute it from
>> backlinks like Mathias suggested.

> Actually, priority *is* optional according to the google web site. Computing it

I know, read the spec;)

> from the backlinks might not really reflect the "real" priority of that
> article, whatever that may be. I guess we could all agree that namespace 0 has
> a higher priority in any case than all the other namespaces.

Yeah, NS_MAIN should be highest.

> The logarithm (base 4) of the number of backlinks, divided by 10 (capped at
> 0.5), plus 0.5, would give a priority between 0.5 and 1.0 (are there articles
> with more than 1024 backlinks on en right now?).

Definitely, stuff like [[Europe]] gets linked a lot.

> This might be better than nothing. The backlink information is in the MySQL
> tables anyway, right?

Indeed, but something like this is way too heavy to be generated dynamically;
it would have to be done by a cron job or made from a database dump.

neubau wrote:

(In reply to comment #4)

> I was under the impression that this index would only list a subset of the
> pages on the wiki, but just so that we're clear, would it be a complete index
> (though perhaps not totally up to date) of them all?

This is what this sitemap is meant for (as far as I understand it): a complete
list of all pages of a certain web site (those that a search engine should find).

>> This might be better than nothing. The backlink information is in the MySQL
>> tables anyway, right?
>
> Indeed, but something like this is way too heavy to be generated dynamically;
> it would have to be done by a cron job or made from a database dump.

While I do not consider an up-to-date version of this sitemap impossible, a
sitemap would only have to be made once in a while (even a weekly output would
still mean an improvement for Google and, in the medium and long run, for us).

All I did was add the recent changes RSS feeds to my google sitemap account
http://www.wikicities.com/wiki/Community_portal#Google_Sitemaps
Don't know if it's having much effect, but it seems to be retrieving the pages
about every 12 hours. Perhaps it could be done more efficiently by batch.
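A batch setup could be as simple as a weekly cron entry. This is a hypothetical
sketch: the script path and name below are assumptions for illustration, not
actual MediaWiki configuration:

```shell
# Hypothetical crontab entry: regenerate the sitemap every Sunday at 03:00.
# Path and script name are assumptions; adjust to the actual generator.
0 3 * * 0  php /srv/mediawiki/maintenance/generateSitemap.php > /dev/null
```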

avarab wrote:

We now have a sitemap generator in CVS HEAD, marking this as FIXED.