Recursive checkuser feature needed
Closed, Declined (Public)

Description

Author: mapellegrini

Description:
Following a recent checkuser request that required over 3 hours of tedious
manual labor (which I never did finish) - [[Wikipedia:Requests for
checkuser/Case/Artaxiad]] - it's become obvious to me that checkuser needs a
recursion feature.

The feature should take a username or IP address, and repeatedly run checkuser
on all subsequent matches to a certain user-specified recursion depth. Repeated
entries (that is, a user or IP address that appears multiple times in the tree)
should be excluded.
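
The behaviour requested above can be sketched roughly as follows. This is only an illustration, not CheckUser code: `lookup` is a hypothetical stand-in for a single check (username to IPs, or IP to usernames), and `demo_links` is made-up sample data.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set


def recursive_check(
    start: str,
    lookup: Callable[[str], Iterable[str]],
    max_depth: int,
) -> Dict[str, List[str]]:
    """Walk outward from `start` and return a parent -> children tree.

    `lookup` is a hypothetical stand-in for one manual check: given a
    username it yields IP addresses, given an IP it yields usernames.
    """
    tree: Dict[str, List[str]] = {start: []}
    seen: Set[str] = {start}          # repeated entries are excluded
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:        # user-specified recursion depth
            continue
        for match in lookup(node):
            if match in seen:
                continue
            seen.add(match)
            tree[node].append(match)
            tree[match] = []
            queue.append((match, depth + 1))
    return tree


def render(tree: Dict[str, List[str]], root: str, indent: int = 0) -> List[str]:
    """Flatten the tree into indented lines (the tree-list form)."""
    lines = ["  " * indent + root]
    for child in tree[root]:
        lines.extend(render(tree, child, indent + 1))
    return lines


# Made-up sample data, for illustration only.
demo_links = {
    "UserA": ["10.0.0.1"],
    "10.0.0.1": ["UserA", "UserB"],
    "UserB": ["10.0.0.1", "10.0.0.2"],
    "10.0.0.2": ["UserB", "UserC"],
}


def demo_lookup(name: str) -> List[str]:
    return demo_links.get(name, [])


print("\n".join(render(recursive_check("UserA", demo_lookup, 2), "UserA")))
```

With a depth of 2 the walk stops at UserB; raising the depth to 4 would also reach 10.0.0.2 and UserC.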

The output should be a tree-list form, as shown at
http://en.wikipedia.org/wiki/Wikipedia:Requests_for_checkuser/Case/Artaxiad#Raw_data
(Note - the recursion depth in that case is 17). Usernames and IP addresses
should be linked to their respective user pages.


Version: unspecified
Severity: enhancement

Details

Reference
bz9858

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 9:37 PM
bzimport added a project: CheckUser.
bzimport set Reference to bz9858.
bzimport added a subscriber: Unknown Object (MLST).

WONTFIX for performance reasons, unless Tim thinks it's OK.

lar wrote:

I'd like to ask that this WONTFIX decision be revisited. It is OK, in my view, if the request itself is throttled to not put too much of a load on the system, and takes considerable elapsed time to return results, because the alternative of doing it by hand takes far more elapsed time.

I don't see why it would be a performance problem, as long as the total result set size was limited.

lar wrote:

So can it be reopened then, or do you need a new bug submitted? The choices I see (as a non developer) don't admit of reopening via change resolution to ____ , but I'll be happy to check "reopen bug" instead... I just think maybe one of the devs should do it, I don't want to overstep.

My other problem with this is that it would be giving a very incomplete report, possibly with false negatives with no context. I suppose it is not useful enough that I'd want to code it... but if someone else wants to, they can.

Re-opening per Tim's comment.

(In reply to comment #6)

My other problem with this is that it would be giving a very incomplete report,
possibly with false negatives with no context. I suppose it is not useful
enough that I'd want to code it... but if someone else wants to, they can.

Re-opening per Tim's comment.

Oops, I meant "false positives", though there would be even more "false negatives" (as users' IPs don't stay exactly the same for long).

lar wrote:

Thanks for the reopen. Perhaps the requirement wasn't stated clearly enough? This is just automating the "walking the tree" we do now in many cases. Judgement in interpreting the results (and deciding what further explorations to do or not do) would still be just as required, and the tree walking we do now is exactly as vulnerable (no more, no less) to false negatives and false positives as if it were automated. But getting the report 3 levels deep on one page instead of spread out over 35 tabs (which required 35 separate checks) makes our work a bit easier. I think I would make the suppression of repetition an option, though; I probably would run the check with that suppression turned off.

You will still need to check CIDRs rather than single IPs. Also, a sheer list alone doesn't have enough context to spot false positives. I suppose the user can then check those users/IPs.
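
To illustrate the point about ranges, here is a minimal sketch using Python's standard `ipaddress` module. The function names and the /24 prefix length are assumptions chosen for the example (and only make sense for IPv4); nothing here reflects how CheckUser itself handles ranges.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable, List


def in_range(ip: str, cidr: str) -> bool:
    """True if a single IP falls inside the given CIDR block."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr, strict=False)


def group_by_prefix(ips: Iterable[str], prefix_len: int = 24) -> Dict[str, List[str]]:
    """Bucket IPv4 addresses by their enclosing network.

    The default prefix length is an arbitrary choice for this sketch.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for ip in ips:
        # strict=False lets us build the network from a host address.
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[str(net)].append(ip)
    return dict(groups)


print(in_range("192.0.2.17", "192.0.2.0/24"))
print(group_by_prefix(["192.0.2.17", "192.0.2.200", "198.51.100.5"]))
```

Two addresses that differ as strings can land in the same bucket, which is why matching on exact IPs alone would miss them.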

It could occasionally be a decent starting point for catching a few users in large sock farms, like running a large fishing net through the water.

mapellegrini wrote:

Also, the ability to input a list of usernames into recursive checkuser (instead of just one) would be very helpful.

Anything that reduces the legwork of complex Checkuser cases will be great. Perhaps I'm speaking from ignorance here (I'm not a Checkuser myself) but it seems to me it's not just a matter of time-saving, but surely also that the increased automation will enable the Checkusers to put their intellect into checking the results for false positives/negatives, rather than worrying about running the processes.

mapellegrini wrote:

I filed another feature request - bug 12808 - which is related to this one. Preferably both could be implemented in a single patch.

Bumped up to *high*, given the obvious necessity for a checkuser interface that isn't second in tediousness only to running the queries by hand against the db.

(In reply to comment #13)

Bumped up to *high*, given the obvious necessity for a checkuser interface that
isn't second in tediousness only to running the queries by hand against the db.

I honestly don't think this is high on any developer's priorities. If there are other improvements to the CU interface that can be made, it would be helpful to file them as separate bugs.

There has to be concern that a recursive check like this could seriously degrade the privacy of users. For example suppose that I am a trusted user - something I am never sure about - and I come up as having shared an IP with a vandal. A manual checkuser would stop there, and perhaps confirm that the IP usage was widely disparate in time if it was a dynamic IP, or was a communal hotspot or something, but a recursive one would identify all the IPs I had edited from, and all the editors who had edited from those IPs, etc. I know certain editors who do not want their attendance at meet-ups shared with non-attendees, and it would certainly not be a good thing to have these associations seen by some who have Checkuser power, and doubtless archived on the Checkuser wiki for all eternity.

@Rich_Farmbrough checkusers do not retain data of good users; that was a clear statement made by WMF legal to checkusers and stewards, and it is why the checkusers don't have a mailing list archive. It was determined that checkusers should only be retaining data that relates to problematic vandals, banned users, and the like. Comment can be made against IP ranges that there are good and/or bad users, and precautions and conditions to be made on blocks.

I doubt that there is any difference on privacy for a recursive check, as they can be done per IP address now. The same information is available; it just takes a lot of time to generate and you chase false leads.

We should be looking for ways to kill CheckUser and its surrounding bureaucracy, in my opinion.

Well, that sort of opinion belongs elsewhere. FWIW, welcome to all the spam that is coming our way if checkuser disappears, along with all the other open doors that checkuser allows to be closed.

Assessment from Sprint meeting:

  • risk: medium (could have performance issues, need to get the UI right)
  • impact: ? (need to discuss with CheckUser users)
  • feasibility: medium (UI challenges)
  • support: high (came from Community Wishlist Survey)
DannyH renamed this task from Recursive checkuser featured needed to Recursive checkuser feature needed. Oct 11 2016, 11:46 PM
DannyH added a project: Community-Tech.
DannyH moved this task from New & TBD Tickets to Older: Team Work on the Community-Tech board.

This is a weird task.

As an example, let's say users A, B, C, and D are all using the same IP address. We could fairly easily do a GROUP BY on the MariaDB database back-end and find and report on all the cases where IPs are being shared in the recent past.
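
For concreteness, the shape of such a GROUP BY query might look like the sketch below. It runs against an in-memory SQLite database rather than MariaDB, and the `edits` table with its `username` and `ip` columns is a hypothetical schema invented for this example, not the real CheckUser tables.

```python
import sqlite3
from typing import Iterable, List, Tuple


def shared_ips(rows: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Return (ip, distinct_user_count) for IPs used by more than one user.

    `rows` are (username, ip) pairs loaded into a throwaway table; the
    schema is hypothetical and exists only for this illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edits (username TEXT, ip TEXT)")
    conn.executemany("INSERT INTO edits VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT ip, COUNT(DISTINCT username) "
        "FROM edits "
        "GROUP BY ip "
        "HAVING COUNT(DISTINCT username) > 1 "
        "ORDER BY ip"
    ).fetchall()
    conn.close()
    return result


# Made-up sample data: users A-D share one IP, user E edits alone.
sample_rows = [
    ("A", "192.0.2.1"),
    ("B", "192.0.2.1"),
    ("C", "192.0.2.1"),
    ("D", "192.0.2.1"),
    ("E", "198.51.100.7"),
    ("E", "198.51.100.7"),
]

print(shared_ips(sample_rows))
```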

However, there are strong objections to doing this. Wikimedians are pretty opposed to this type of behavior (cf. https://en.wikipedia.org/wiki/Wikipedia:CheckUser#Fishing) and instead require there to be reasonable suspicion when checking a user.

Won't a recursive CheckUser feature inherently mean that more people who are innocent get checked?

And, if implementing recursive CheckUser functionality is not problematic, why not "go all the way" and just have aggregate reporting?

In the case of the CheckUser extension, Wikimedia has set up a system that allows privileged users the ability to manually check a user's recent IP addresses and User-Agent strings. There's a whole process for appointing these people, there are policies surrounding its use (e.g., https://meta.wikimedia.org/wiki/CheckUser_policy), each use is logged, etc. There's even https://wikimediafoundation.org/wiki/Guidelines_for_modifying_CheckUser_logs about modifying private CheckUser logs. This is to say: it's considered a big deal to run a check on someone, particularly long-time users.

I can understand wanting to automate checks and it's likely some CheckUsers are already doing this on their own with scripts, but I'm not sure it's what was ever intended here. That said, alternately, if we're going to go down this route, why not just automate the checking further and do away with the inefficiencies of the current checking system?

I continue to wonder whether CheckUser is a case where the cure is worse than the disease. The amount of bureaucracy and process and distraction that have come from "sockpuppet investigations" is mind-boggling and sad, in my opinion. It's possible a world without CheckUser would be worse, with more spam and double voting and such, but I'm not sure.

@MZMcBride: Those are excellent points. I agree that such a feature could be problematic in light of Wikipedia:CheckUser#Fishing and similar policies. I wonder if an RfC at meta would be helpful.

Huji subscribed.

This is a request that was made in 2007. I think our understanding of CU, and our community's expectations of when and how NOT to use CU, and our policies regarding privacy and nonpublic information have changed a lot since this was asked.

The motivation behind this was one particularly complex case. Most CU requests don't need this level of complexity. Also, as your "tree" expands and you find more and more "possibly linked" accounts, your true positive rate diminishes. Just because A is related to B, B is related to C and C is related to D, it doesn't mean A is related to D. These relationships are technical relationships. In settings where people use lots of different ISPs to connect to the internet (their work machine, their home machine, their mobile hotspot, the Starbucks at the corner, etc.) these similarities are likely due to chance.
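
A rough way to picture the diminishing true positive rate, under the loud assumption that each technical link independently reflects a real connection with the same probability (a simplification made up for this sketch, not a measured figure):

```python
def chain_confidence(p_link: float, hops: int) -> float:
    """Chance that every link in a chain of `hops` links is genuine.

    Assumes each link is genuine with probability `p_link`, independently
    of the others. Both the model and the numbers are illustrative only.
    """
    return p_link ** hops


# With a hypothetical 90% confidence per link, confidence in the whole
# chain drops with each additional hop.
for hops in (1, 3, 5, 17):
    print(hops, round(chain_confidence(0.9, hops), 3))
```

Even with a generous per-link figure, a chain the length of the 17-level case mentioned in the original request leaves little confidence that its two ends are actually related.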

In light of the above criticisms and the small amount of activity this bug has received between 2008 and 2015, I am marking it as Declined. People should feel free to reopen it, but I hope they only reopen it if the following conditions are met:

  1. A clearer design for the alternative is provided.
  2. A stronger justification as to why it is needed is provided.
  3. Strong support from the community is shown before it is reopened.