Page MenuHomePhabricator

Subcategory paging is not separate from article or image paging
Closed, ResolvedPublic

Description

Category:Business is large, so the list of articles pages, as it should - there
are 200 articles on a page. The first page is all up to F.

However, the subcategories are also paged and only go up to F, even though there
are only 17 on the first page. This is not optimal for usability - I was
actually going to the page to check which subcat was right for an article, but
have to page unnecessarily.

Expected result: That subcats would not be paged, or would also page at 200.

Actual result: Subcats page alphabetically in step with articles, though this
reduces the usability of the subcat listing.

Test case: category:business


Version: 1.12.x
Severity: normal
URL: http://en.wikipedia.org/wiki/Category:Business

Details

Reference
bz1211
ReferenceSource BranchDest BranchAuthorTitle
repos/abstract-wiki/wikifunctions/function-orchestrator!80sync-function-schematamainjforresterUpdate function-schemata sub-module to HEAD (a3499c5)
repos/abstract-wiki/wikifunctions/function-evaluator!98sync-function-schematamainjforresterUpdate function-schemata sub-module to HEAD (b392301)
repos/abstract-wiki/wikifunctions/wikilambda-cli!18sync-function-schematamainjforresterUpdate function-schemata sub-module to HEAD (a3499c5)
repos/abstract-wiki/wikifunctions/function-orchestrator!78T321115-8maindmartinReplace remaining hardcoded errors in ZWrapper.js
repos/abstract-wiki/wikifunctions/function-orchestrator!77T321115-7maindmartinChange 2 hardcoded Z22-format errors to invalid_format error type
repos/abstract-wiki/wikifunctions/function-orchestrator!73T321115-6maindmartinChange some more English-language errors to structured ones
repos/abstract-wiki/wikifunctions/function-orchestrator!69T321115-5maindmartinChange 5 more English-language errors to structured ones
repos/abstract-wiki/wikifunctions/function-orchestrator!68T321115-4maindmartinChange more English-language errors to structured ones
repos/abstract-wiki/wikifunctions/function-orchestrator!63T321115-2maindmartinChange English-language errors to properly structured ones
repos/cloud/toolforge/builds-builder!7arturo-tekton-pipelines-contromainaborrerotekton-pipelines-controller: refresh distroless digest
repos/cloud/toolforge/builds-builder!6arturo-tekton-pipeline-controlmainaborrerotekton-pipeline-controller: use distroless-base:latest
repos/abstract-wiki/wikifunctions/function-orchestrator!15T321115maindmartinUse proper errors when Head/Tail builtins get empty list
Show related patches Customize query in GitLab

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 8:24 PM
bzimport set Reference to bz1211.

Had a similar problem with Category:Computer_stubs. I added a sub=category
(Category:Software stubs) but it wasn't showing up in the list. Spent a while
trying to get it to show up, until I stumbled across it if I looked atarticles
from 'S' onwards. Sub-Categories should be completely separate from articles.
I understand that mediawiki is getting top 200 matches and then sorting them
into separate arrays, but unless this is sorted it makes sub-categories in
well-populated categories redundant.

The list of sub-categories should be completely separate from the list of articles.

gangleri wrote:

Halló!

To my opinion (hopefully all) subcategories should be listed whenever a
categgory page is generated. A problem can occur if there are more then 200
subcategories the maximal number of items displayed.

Some "turn arounds" for the actual implementation are discussed in English at:
[[sr:Википедија:Tests for Корисник:Bonzo]] .
One sugestion is to add a space in [[:category:XXXX| YYYY]] or a "!"
[[:category:XXXX|!YYYY]] depending on what consensus you have in your wiki, if
you are using a navigation template etc.

Regards Reinhardt

gangleri wrote:

[[en:Category:Stub categories]] is a category having more than 200 subcategories.
Regards Reinhardt

  • Bug 1990 has been marked as a duplicate of this bug. ***

Is this really only an enhancement? It seems a loss of function to me.

Indeed. The subcats not appearing *at all* until the second page if they're out of
luck in the alpha sorting is a definite bug. Marking "minor" rather than
"enhancement".

shaww wrote:

This needs fixing. All subcats should come before any articles. This bug is forcing burdensome workarounds, like setting the sort keys of Category tags on each of 40 subcategory pages done for http://en.wikipedia.org/wiki/Category:Computing

elian wrote:

This is definitely a usability bug. If someone fixes this, please care for
http://bugzilla.wikimedia.org/show_bug.cgi?id=3095 as well.

jnothman wrote:

[Note that #3095 was a duplicate of #1212]

I see two solutions to this problem (together with #1212):
(1) get ALL entries under the category and sort them out
(2) make additional SQL queries

I have also proposed a synthesis of these two approaches, whereby (1) is used
for categories containing less than a certain threshold number of entries and
(2) is used for large categories. I don't know if this is more efficient at all.

jnothman wrote:

Proposed patch

Using two-pronged approach with options (1) and (2) given in most recent
comment. May be useless =)

attachment CategoryPage.php.diff ignored as obsolete

Wiki.Melancholie wrote:

*** Bug 4910 has been marked as a duplicate of this bug. ***

jimhu wrote:

It seems to me that both subcategories and articles for large categories should
be displayed in a way similar to how Special:Allpages displays links to ranges
of pages once the size exceeds some threshold.

This is a severe problem for me, since I am running a wiki on biology/genomics,
where the category pages reflect the taxonomy tree, and there are articles for
every gene in the organisms I'm covering. This means that sometimes the
subcategories are buried under TENS OF THOUSANDS of other pages! :(

So, I'll try to take this on if no one else is working in this direction...

jimhu wrote:

alternate category page

This alternative Category Page is similar to the earlier patch, but has some
significant differences. If the number of items in either subcategories or
articles is larger than $limit = $wgCategoryPagingLimit different things
happen:
*If the number of items is < 5*$limit, the user gets pages of with $limit items
with the usual previous and next links
*If the number of items is larger than that, the user gets "Show $limit pages
starting with..." and a list of links to chunks. The script divides the total
number of items so that there will be <= 20 total chunks, which means that the
number of clicks it will take to get to a particular item will vary.

attachment CategoryPage.php ignored as obsolete

jimhu wrote:

p.s.

  1. The alternate category page can be seen in action at:

http://dimer.tamu.edu/GO/wiki/index.php/Category:cellular_organisms

  1. I am not entirely happy with speed or the cleanliness of the code, but I

posted in the hope that others will build on it and share their versions. This
can take a couple of minutes on my server for a really large category. The link
is to one with 95,320 articles, and it just took about a minute to load.

  1. Whether or not you like the other aspects, I think the Category page should

give the correct number of subcategories and articles in the Category, instead
of just giving the size of the subset retrieved by the SELECT statement. This
alternate does so by doing select count(*) for both subcategories and articles.

bugs wrote:

No progress on this? This is awful for usability - I thought a certain category just didn't exist at all, because it didn't show up on the list of subcategories. It's incredibly counterintuitive when you see something like this:

Subcategories:
Aardvark

Pages in category "blah"
A
[lots of pages]
B
[lots of pages]
(previous 200) (next 200)

There's no reason to think that there are more subcategories that you can't see.

Would an easier fix be to simply repeat the link to the more pages, something like:

Subcategories:
Aardvark
(previous) (next)

Pages in category "blah"
A
[lots of pages]
B
[lots of pages]
(previous 200) (next 200)

ayg wrote:

*** Bug 10488 has been marked as a duplicate of this bug. ***

ayg wrote:

This really should not be difficult, I don't think. A committer just has to review those patches, or spend maybe half an hour writing their own.

ayg wrote:

Ah, I see why it was done this way. It's more efficient. Otherwise you have to add a WHERE page_namespace != NS_CATEGORY and muck up the efficiency. Actually, I think that might require that all the rows of the category be read regardless of the LIMIT, depending on how MySQL does joins, but even if not, you could still easily read ten times as many rows as strictly needed this way. A schema change to add a boolean "cl_subcategory" to categorylinks, along with appropriate indexing, would fix the inefficiency and possibly be useful for other things too. Alternatively we could ignore the deficiency and scan a bunch of extra rows on certain category views, which is what I would provisionally be inclined to do.

Proposed patch #1 deliberately scans five times as many rows as necessary, not really acceptable. Proposed patch #2 isn't in diff format and so I can't easily review it.

mediazilla wrote:

Patch to always show all subcategories

This is my take on this. I just always show all subcategories; they are no longer paged. This could of course lead to problems if a category has a lot of subcategories; and it requires an additional database query (although I think that's unavoidable if you don't want to read through all rows using PHP).

I've also moved the top "previous"/"next" links below the subcategories; after this change, leaving them at the very top makes no sense.

Attached:

  • Bug 11696 has been marked as a duplicate of this bug. ***

magnusrk+wiki wrote:

No paging at all on subcategories would need to be configurable, but that's simple enough. Full-fledged subcategory paging would probably require changes that affect extensions that extend CategoryPage/Viewer as well. I could only find CategoryTree (and DPL using some static methods), so this might not be such a big problem.

ayg wrote:

(In reply to comment #20)

Created an attachment (id=4083) [details]
Patch to always show all subcategories

This is my take on this. I just always show all subcategories; they are no
longer paged. This could of course lead to problems if a category has a lot of
subcategories; and it requires an additional database query (although I think
that's unavoidable if you don't want to read through all rows using PHP).

I've also moved the top "previous"/"next" links below the subcategories; after
this change, leaving them at the very top makes no sense.

After some consideration: this patch is unacceptable, given that I believe it requires scanning the entire set of category links for a page. This can be hundreds of thousands of rows, and take seconds, on the English Wikipedia. Some kind of schema change is almost certainly necessary for this to be done efficiently.

t.laqua wrote:

(In reply to comment #23)

After some consideration: this patch is unacceptable, given that I believe it
requires scanning the entire set of category links for a page. This can be
hundreds of thousands of rows, and take seconds, on the English Wikipedia.
Some kind of schema change is almost certainly necessary for this to be done
efficiently.

What do you mean? The patch specifies 'page_namespace' => NS_CATEGORY as part of the WHERE clause in the additional sub-category query.

t.laqua wrote:

(In reply to comment #24)

(In reply to comment #23)

After some consideration: this patch is unacceptable, given that I believe it
requires scanning the entire set of category links for a page. This can be
hundreds of thousands of rows, and take seconds, on the English Wikipedia.
Some kind of schema change is almost certainly necessary for this to be done
efficiently.

What do you mean? The patch specifies 'page_namespace' => NS_CATEGORY as part
of the WHERE clause in the additional sub-category query.

I ran an explain on the two queries (trimmed output):
explain select page_title, page_len, page_is_redirect, cl_sortkey from page, categorylinks use index (cl_sortkey) where cl_from=page_id AND cl_to='Departments' AND page_namespace=14 order by cl_sortkey;

table | type | possible_keys | Extra |
categorylinks | ref | cl_sortkey | Using where; Using index |
page | eq_ref | PRIMARY,name_title | Using where |

explain select page_title, page_len, page_is_redirect, cl_sortkey from page, categorylinks use index (cl_sortkey) where cl_from=page_id AND cl_to='Departments' order by cl_sortkey;

table | type | possible_keys | Extra |
categorylinks | ref | cl_sortkey | Using where; Using index |
page | eq_ref | PRIMARY | |

As far as I understand, the page_namespace addition to the WHERE clause brings in the name_title index f/ the page table, but the type is still eq_ref and unless i'm misinterpreting, I can't see where we'd be taking a huge performance hit.

I haven't worked too much w/ query optimization, especially on the scale of the wikipedia DB - but I'd certainly like to know how significant of a performance hit there is bringing in the name_title index (assuming that really is the only significant difference here).

t.laqua wrote:

I talked to Domas about it, the name_title key wouldn't be used, so as Simetrical said - it could indeed lead to scanning a pile of rows that wouldn't be needed w/o page_namespace in the WHERE clause.

So ignore the last two comments. ;-)

ayg wrote:

The difference, AFAIK, is not in the queries themselves, but in how they behave with LIMIT (which EXPLAIN ignores). The query with no condition will scan a maximum of 200 rows, because it scans until it gets 200 rows that satisfy the condition. The query with condition page_namespace=14 will have to scan every single row, and will not be able to omit rows that don't match that condition, because page_namespace is not in (and cannot be in) the cl_sortkey index it's using to retrieve the rows for the first table. It has to retrieve a row from categorylinks and join to page before it can decide whether to keep it. Compare the two queries' performance on a large category (this is the toolserver, so much slower than it would be on the live server, but you can tell the difference anyway):

mysql> SELECT page_namespace, page_title FROM categorylinks USE INDEX (cl_sortkey) JOIN page ON cl_from=page_id WHERE cl_to='Living_people' ORDER BY cl_sortkey LIMIT 200;
+----------------+--------------------------------+

page_namespacepage_title

+----------------+--------------------------------+

4Biographies_of_living_persons
020Bello
040_Cal.
050/50_Twin
06025_(musician)

. . .

0Hisham_Abbas
0Imran_Abbas

+----------------+--------------------------------+
200 rows in set (3.63 sec)

mysql> SELECT page_namespace, page_title FROM categorylinks USE INDEX (cl_sortkey) JOIN page ON cl_from=page_id WHERE cl_to='Living_people' AND page_namespace=14 ORDER BY cl_sortkey LIMIT 200;
+----------------+----------------+

page_namespacepage_title

+----------------+----------------+

14Pete_Rock
14Royal_children
14Mar_Roxas

+----------------+----------------+
3 rows in set (12 min 49.22 sec)

Now note that the latter is probably more like three seconds on the real servers (cf. http://en.wikipedia.org/wiki/Special:Categories?offset=Living_peopl, which evidently no one has killed yet). But the point remains, the latter is O(N) in the whole category size and the former is O(N) (maybe O(N log N), I don't know) in just the result set. The difference here was a factor of over 200, the difference on the real servers is probably over 300 (figuring 3 s for the latter, <10 ms for the former), and it will only increase as categories get bigger. A scalable solution is needed if this is to be put in core.

The sane solution here is probably to just have a distinct table for "subcategories" from other pages.

Alternatively, an additional column could be added to categorylinks which indicates which set of categorization it belongs in, with an appropriate index.

A separate table would be a quicker and more backwards-compatible upgrade, as it wouldn't require a complete rebuild of the (large) categorylinks table. The upgrade would go like this:

  1. create the new subcategorylinks table
  2. copy categorylinks data for page_namespace=14 into subcategorylinks
  3. delete categorylinks data for page_namespace=14
  4. profit!

The two sets of data would then be trivial to page separately.

ayg wrote:

And it's even safe, since it's impossible to move non-categories to categories or vice versa (a point that didn't occur to me when similar ideas were mentioned elsewhere).

ayg wrote:

. . . Of course, any such change will probably break anything else that deals with categorylinks, to some extent.

johnblumel wrote:

(In reply to comment #28)

The sane solution here is probably to just have a distinct table for
"subcategories" from other pages...

I'm a submitter of a duplicate of this issue (although I didn't notice it referenced here, but may have just overlooked it) and I'm afraid I'm utterly unfamiliar with the MediaWiki schema or the code that deals with this, but I just wanted to raise the question of how this would work with subcategories that also have subcategories and categories/subcategories that are subcategories of multiple categories which may also be subcategories.

In our wiki,

http://wiki.hmssurprise.org

every category but one,

http://wiki.hmssurprise.org/index.php/Category:WikiPOBia

is (or will be) a subcategory of some other category, and in many instances is a subcategory of multiple categories. Examples of this can be found particularly under the following two categories (all the entries under these categories, subcategories and articles, were generated and botted to the site, and this is probably quite atypical):

http://wiki.hmssurprise.org/index.php/Category:Guide_for_the_Perplexed
http://wiki.hmssurprise.org/index.php/Category:Maturin%27s_Medicine

The whole categorization scheme here was organized with this particular issue in mind, and we currently are not plagued by this problem, but I wouldn't want to see the fix cause it's own set of problems, or wreak havic with our existing categorization scheme.

ayg wrote:

(In reply to comment #31)

I'm a submitter of a duplicate of this issue (although I didn't notice it
referenced here, but may have just overlooked it) and I'm afraid I'm utterly
unfamiliar with the MediaWiki schema or the code that deals with this, but I
just wanted to raise the question of how this would work with subcategories
that also have subcategories and categories/subcategories that are
subcategories of multiple categories which may also be subcategories.

There would be no difference to the current way things work.

shaww wrote:

Above it was suggested that a new subcategorylinks table be added. There may be a simpler solution which is also potentially less disruptive. This wouldn't involve a complete rebuild of categorylinks, but would require a upgrade phase. It would go like this:

  1. Add a column, say specialref, to categorylinks, default NULL, indexed if value not NULL.
  2. Make a pass over categorylinks where namespace=14, updating specialref to a nonnull value, say 14.

The normal subcat display would use the index, the page member display would not but would skip records which have a non-NULL specialref. Accessors ignorant of the change would continue to work as now. Add of a new subcat of course sets specialref.

(The page list retrieval has to skip the subcat entries, but that can be expected to be a small percentage.)

ayg wrote:

That requires an ALTER TABLE, which is much more of a pain than adding a table. Copying millions of rows from categorylinks to subcategorylinks: easy, just run some queries. Altering categorylinks: locks the table to writes while you're doing it, so you have to take slave out of rotation, apply alteration, re-add to rotation, repeat for each slave, switch the master, repeat for master. It's more disruptive for the sysadmins, not less.

Besides, as I say, there are indexing issues. For this to be effective here, you would need to change the cl_sortkey index from (cl_to, cl_sortkey, cl_from) to (cl_to, cl_sortkey, cl_subcat, cl_from). This would break the use of the index for ordering by cl_from, unless you were retrieving by cl_subcat. Unmodified apps would still work, they'd just take forever on large categories, as it filesorted the entire set of pages in the given category. This is as far as I can see, without testing, but I'm fairly confident this is correct for MySQL.

ayg wrote:

Of course, your solution would *not* disrupt uses of categorylinks that don't use the cl_sortkey index, which is quite possibly most of them. It does have that in its favor.

Fixing bug 164 requires changing the cl_sortkey index anyway. So if a fix were within reach for the two of them, the categorylinks rebuild could be combined and we could stay with a single table.

ayg wrote:

Why does it require changing the cl_sortkey index? It just requires rebuilding the contents of the cl_sortkey column, which can be done live (sorting on category pages may or may not be a bit odd in the interim, depending on how different the new collation algorithm is).

Turelio001 wrote:

I'm asking to step up the Severity of this bug at least to Normal (better to Major) as it really restricts everyday usability of categories on Commons. As another user in the current discussion on Commons put it, we set up a system with currently >2.7 mio files and our category system has problems when there are >200 files in a category. [http://commons.wikimedia.org/wiki/Commons:Village_pump#How_to_show_first_all_sub-cats_of_a_cat.3F]

billclark wrote:

I suggest that we add a cl_namespace column to the table and also add that column to the cl_sortkey index. This will allow efficient independent selection (and thus paging) of a category's members based on the namespace, which would not only resolve this bug but would help with #450 as well. It's really not feasible to resolve #450 unless the different namespaces can be paged independently.

I'll be happy to provide a software patch for both of these bugs (I'm almost done with the one for #450) but want to be sure that the proposed schema change is acceptable, first.

ayg wrote:

(In reply to comment #40)

I suggest that we add a cl_namespace column to the table and also add that
column to the cl_sortkey index. This will allow efficient independent
selection (and thus paging) of a category's members based on the namespace,
which would not only resolve this bug but would help with #450 as well. It's
really not feasible to resolve #450 unless the different namespaces can be
paged independently.

I'll be happy to provide a software patch for both of these bugs (I'm almost
done with the one for #450) but want to be sure that the proposed schema change
is acceptable, first.

First of all, this implementation of a fix would depend on bug 16012. All schema changes to large tables are suspended until there's a reasonable way of executing them, apparently, and categorylinks is certainly a large table.

Second of all, you're (further) denormalizing the database here. This might be justifiable -- certainly it seems like the only reasonable way to fix bug 450 -- but it's a cost. You'd have to make sure that anything that moves pages also updates cl_namespace. This is not currently a cost we have anywhere else, AFAIK: either things are associated with the page_id and don't have to be updated, or are associated with the namespace/title combination rather than the page itself (e.g., pl_namespace and pl_title) and therefore still don't have to be updated. We can expect that bugs will occur here as cl_namespace sometimes isn't updated on moves, for whatever reason.

Thirdly, this would probably *require* bug 450 to be fixed to be reasonably efficient. In the current setup, we'd need something like 'SELECT ... WHERE cl_from=xxx AND cl_namespace NOT IN (' . NS_FILE . ',' . NS_CATEGORY . ') ORDER BY cl_sortkey', which MySQL (at least older versions like 4.0) is too stupid to execute efficiently: it can't use the index for ordering if there's a range condition like IN in the WHERE clause. (I think it would be smart enough in 5.0, or at least 5.1.)

And as for the proposal in bug 450, I see two issues. First of all, you'd surely need one query per namespace on each category view, which could be a couple dozen on a lot of wikis. That seems kind of excessive, if it's avoidable. Second of all, the current structure of the category table reflects the pages/subcats/files breakdown. If counts are given separately for all namespaces, you'd need to store all the counts in the category table. This would either require an ALTER TABLE for every namespace added or removed (not happening), or else breaking off a new categorycount table like (cc_cat, cc_namespace, cc_count) to store the counts and deleting the cat_pages, cat_subcats, cat_files from the category table.

(In reply to comment #41)

And as for the proposal in bug 450, I see two issues. First of all, you'd
surely need one query per namespace on each category view, which could be a
couple dozen on a lot of wikis. That seems kind of excessive, if it's
avoidable. Second of all, the current structure of the category table reflects
the pages/subcats/files breakdown. If counts are given separately for all
namespaces, you'd need to store all the counts in the category table. This
would either require an ALTER TABLE for every namespace added or removed (not
happening), or else breaking off a new categorycount table like (cc_cat,
cc_namespace, cc_count) to store the counts and deleting the cat_pages,
cat_subcats, cat_files from the category table.

If we throw the separate paging thing out the window citing performance concerns, we could page by (ns, sortkey), which should be feasible given the right index (especially since the range condition is now also the first field of the ORDER BY). Of course that would page the main namespace first, then Talk, then User, etc., but that's the price we'd have to pay for not liking burned database servers. Jump-ahead links to a certain namespace would still be possible as well.

sean wrote:

Would it be rocking the boat too much to suggest that this bug gets upgraded to a severity above normal? I know it doesn't matter toooooo too much, but it might make us more likely to solve it ;)

It's just that most reasonably sized wikis tend to have categories with >200 articles and categories that large will quite often have subcategories and it's really not obvious to a user what's going on when they see only SOME of the subcategories on the first page.

ayg wrote:

-patch, -need-review. Patch was reviewed and rejected, see comment 23. Also removing schema-change until it's clear that we need one (we could just adjust the sortkeys in-place, perhaps).

The most straightforward solution to this seems to be to reserve the first character of the cl_sortkey column for a namespace designator. So just make it "c" for subcategories, "a" for articles, "f" for files. Then the three could be paged separately. This would require some interface work too, of course. A script would then have to be run before the change goes live, updating the categorylinks table; until this is done, category pagination would be kind of wonky, but there would be no other ill effects (at least if it works ;) ).

  • Bug 17359 has been marked as a duplicate of this bug. ***
  • Bug 16249 has been marked as a duplicate of this bug. ***

What's the problem about solving this?
I'm thinking about manipulating CategoryPage.php myself but my php skills suck and it would take me ages of trial & error.
I want to:

sort articles desc, subcats asc

subcats in 3 cols, articles in 2 cols

different default navigation ammount for all 3 sections (subcats: e.g. all, articles: 200, files: 20)

split navigation logically: the 3 sections have their own navigation: going NEXT to articles 200-400 will still list subcats and images 0-X

Can anyone at provide a patch that sorts articles Articles in the recerve order?

Changing function finaliseCategoryState() to array_reserve only sorts batch per page (200) but not the whole set.

brian wrote:

If we can't fix this quickly (and it looks like we can't), can we at least add a message to each category page explaining this issue? I can't even find it documented anywhere except in bug reports.

I skimmed through the above comments and couldn't find an explanation of how the subcategories are split between pages at, say, http://commons.wikimedia.org/wiki/Category:Girls. The subcategories go from A-D, then from G-T, even though the media goes from A-L, then L-Z. Also, if you were viewing the first 200 category items, you'd never guess that the page "Super Girl (contest)" was in that category as well.

(In reply to comment #41)

First of all, this implementation of a fix would depend on bug 16012. All
schema changes to large tables are suspended until there's a reasonable way of
executing them, apparently, and categorylinks is certainly a large table.

Schema changes are in no way blocked by bug 16012, that was a misconception. I did a couple of schema changes during the deployment of 1.16wmf4, and I'm happy to do more in the next deployment. There are some conditions which need to be met for easy migration, but those won't change when bug 16012 is implemented. The table can be any size, as long as it fits into the free disk space, since MySQL needs to make a copy when it does an ALTER TABLE.

[...]

Thirdly, this would probably *require* bug 450 to be fixed to be reasonably
efficient. In the current setup, we'd need something like 'SELECT ... WHERE
cl_from=xxx AND cl_namespace NOT IN (' . NS_FILE . ',' . NS_CATEGORY . ') ORDER
BY cl_sortkey', which MySQL (at least older versions like 4.0) is too stupid to
execute efficiently: it can't use the index for ordering if there's a range
condition like IN in the WHERE clause. (I think it would be smart enough in
5.0, or at least 5.1.)

Do you mean the category page query? Presumably that would be cl_to=xxx not cl_from=xxx.

You don't need bug 450 to be fixed. You can keep the current (cl_to, cl_sortkey, cl_from) index and add an additional (cl_to, cl_namespace, cl_sortkey, cl_from) index. Then we can support both UIs.

When namespace breakdowns are enabled, there would be some complexity involved in making the main category page query skip subcategories and files, but I think it would be possible to do it efficiently, even in MySQL 4.0, by using ranges and unions.

For instance, we could have a union with the first subquery having a condition like (cl_sortkey > $offset AND cl_namespace < NS_CATEGORY), and the second subquery having a condition like (cl_namespace > NS_CATEGORY).

And as for the proposal in bug 450, I see two issues. First of all, you'd
surely need one query per namespace on each category view, which could be a
couple dozen on a lot of wikis. That seems kind of excessive, if it's
avoidable.

The category page view with the namespace breakdown would have to page through the results one namespace at a time, it wouldn't display, say, pages in all namespaces that start with the letter "L". As in the current scheme, the UI has to reflect the indexes. The query would have ORDER BY cl_namespace, cl_sortkey.

Second of all, the current structure of the category table reflects
the pages/subcats/files breakdown. If counts are given separately for all
namespaces, you'd need to store all the counts in the category table. This
would either require an ALTER TABLE for every namespace added or removed (not
happening), or else breaking off a new categorycount table like (cc_cat,
cc_namespace, cc_count) to store the counts and deleting the cat_pages,
cat_subcats, cat_files from the category table.

Breaking namespaces out into a new table would be the only decent solution, assuming we need that feature.

(In reply to comment #51)

You don't need bug 450 to be fixed. You can keep the current (cl_to,
cl_sortkey, cl_from) index and add an additional (cl_to, cl_namespace,
cl_sortkey, cl_from) index. Then we can support both UIs.

Although of course that means not fixing the current bug. My mistake.

To fix the current bug and not include namespaces in the sorting order would require a boolean cl_is_subcat field, defined to be the same as cl_namespace=NS_CATEGORY. Then the index would be (cl_to, cl_is_subcat, cl_sortkey, cl_from).

ayg wrote:

(In reply to comment #51)

Schema changes are in no way blocked by bug 16012, that was a misconception. I
did a couple of schema changes during the deployment of 1.16wmf4, and I'm happy
to do more in the next deployment. There are some conditions which need to be
met for easy migration, but those won't change when bug 16012 is implemented.
The table can be any size, as long as it fits into the free disk space, since
MySQL needs to make a copy when it does an ALTER TABLE.

Okay.

Do you mean the category page query? Presumably that would be cl_to=xxx not
cl_from=xxx.

Yes, right. I always get them confused. cl_page and cl_category would have made much more sense as column names. Although less consistent with pagelinks.

You don't need bug 450 to be fixed. You can keep the current (cl_to,
cl_sortkey, cl_from) index and add an additional (cl_to, cl_namespace,
cl_sortkey, cl_from) index. Then we can support both UIs.

I think that by "fixing bug 450" I meant that you'd need to store the namespace in the categorylinks table somehow.

When namespace breakdowns are enabled, there would be some complexity involved
in making the main category page query skip subcategories and files, but I
think it would be possible to do it efficiently, even in MySQL 4.0, by using
ranges and unions.

For instance, we could have a union with the first subquery having a condition
like (cl_sortkey > $offset AND cl_namespace < NS_CATEGORY), and the second
subquery having a condition like (cl_namespace > NS_CATEGORY).

Yes, this is true. Maybe I hadn't seen that hack yet in 2008.

The category page view with the namespace breakdown would have to page through
the results one namespace at a time, it wouldn't display, say, pages in all
namespaces that start with the letter "L". As in the current scheme, the UI has
to reflect the indexes. The query would have ORDER BY cl_namespace, cl_sortkey.

Do you mean it would have WHERE cl_to='X' AND cl_namespace=Y ORDER BY cl_sortkey? It wouldn't make sense in any UI to display the first 200 results from the main namespace only, so you'd have to go through fifty pages to get to the files. You could cut it down to one query by doing a UNION, with different WHERE clauses in each part of the union.

Breaking namespaces out into a new table would be the only decent solution,
assuming we need that feature.

If we're displaying different namespaces under different headings as bug 450 suggests, we'd presumably want to keep the total count for each namespace present in the UI. It could be dropped, but it doesn't seem like there's any point.

(In reply to comment #52)

To fix the current bug and not include namespaces in the sorting order would
require a boolean cl_is_subcat field, defined to be the same as
cl_namespace=NS_CATEGORY. Then the index would be (cl_to, cl_is_subcat,
cl_sortkey, cl_from).

I don't see why this is necessary or advantageous. You could just do several queries, maybe UNIONed together, one for each namespace, and that will work with a (cl_to, cl_namespace, cl_sortkey, cl_from) index. The only advantage would be that pages can't get moved between the category namespace and other namespaces, so if we only stored subcategory-ness, we wouldn't have to update anything extra on page moves, and there'd be little to no risk of errors creeping in. This is an advantage of sticking to the current subcats/pages/files approach, but we can't have that if we want to fix bug 450. (Not sure if we need to or not.)

*** Bug 24271 has been marked as a duplicate of this bug. ***

(In reply to comment #50)

If we can't fix this quickly (and it looks like we can't), can we at least add
a message to each category page explaining this issue?

This would be a reasonable interrim solution. Just knowing that there are more subcategories is a huge usability benefit.

ayg wrote:

I'll be working on this bug. I hope to have a solution coded up within a couple of weeks. I wrote a post to wikitech-l about it, and encourage people to respond there rather than here (since this involves several bugs):

http://lists.wikimedia.org/pipermail/wikitech-l/2010-July/048399.html

ayg wrote:

This was fixed in trunk as of r70415. Subcategories, files, and other pages sort in three separate sections.

felsodaniel wrote:

I don't think this is fixed, just see: http://en.wikipedia.org/wiki/Category:Computer_stubs
It says "This category has the following 8 subcategories, out of 16 total."

Bugs are marked as fixed right away when the code is fixed. It then takes time before end users see the fix, until entities like Wikimedia update their code to newer version. When that happens, bugs can be marked as verified or closed, but that doesn't seem the actual practice here.

felsodaniel wrote:

I checked the version of the English Wikipedia (http://en.wikipedia.org/wiki/Special:Version), it's r72153, so it should be "fixed".

That is a different branch of the software. Don't rely on those numbers.

ayg wrote:

Wikimedia is way behind on code updates. There's no guarantee on when Wikipedia will see the features, but they're fixed in trunk -- which means there's no more development work to do, so there's no purpose for the bug to remain open. The fix is done, it just has to be deployed, which will happen along with all the other tens of thousands of revisions that aren't live yet.

It might not go live for months to come, who knows. The last full scap was around r64689, and that was based on REL1_16, which was based on r62818 of trunk. That was in February. The revision numbers don't tell you what's live, as Derk-Jan points out, since Wikimedia is using the wmf branch of the code, not trunk. I only gave the revision number so it would produce a commit link.

So do we file a new bug that asks to implement the fixed bug?

ayg wrote:

If anyone is going to be interested in deploying this out of schedule, they won't need a bug to remind them to do so. Plus, I doubt it will happen till bug 164 is fixed.

(In reply to comment #65)

The patch is included in release 1.17. See
http://svn.wikimedia.org/viewvc/mediawiki/branches/REL1_17/phase3/RELEASE-NOTES?view=co

As an aside, last i heard the category stuff was going to be deployed slightly later then the rest of 1.17 (On Wikimedia).

If someone can link an example category, pls do so.

Also, please let us know when this was included in a stable MW release.

ayg wrote:

This is in 1.17, as comment 65 says. It will be in a stable release when 1.17 is released. It will be enabled on Wikimedia at some time in the relatively near future. Please be patient.

johnnymrninja wrote:

1.17 is live, though I have not been able to confirm that this is fixed. I've edited categories and they do not move to the front. It could be some sort of category lag... Can someone confirm that this is now working properly on WP?

(In reply to comment #69)

1.17 is live, though I have not been able to confirm that this is fixed. I've
edited categories and they do not move to the front. It could be some sort of
category lag... Can someone confirm that this is now working properly on WP?

1.17 was deployed with the category changes stripped out. Tim has been working on doing the required database schema changes yesterday and today, and I guess we'll roll out the related software changes in the next few days.