Analysis of Lichess' cheating detection with Machine Learning (ML) - a misuse of ML that doesn't work

<Comment deleted by user>
The point made about the lack of proper, reliable labelled data on "cheaters" and "non-cheaters" is extremely important.

Even if Lichess always has a human reviewer look at a report before banning someone, that reviewer isn't above making mistakes either. If flawed training data leads the model to produce an inaccurate "cheater score", that score can in turn influence the reviewer's decision (I assume the reviewer looks at the games but also has access to the output of the ML models; Lichess doesn't disclose details of the procedure here).

If a moderator gets a model that tells him this player is 99% a cheater, that will inevitably bias how the specific moves, move times and so on are judged. The same move that counts as "proof of cheating" when the model spits out 99% may be judged human and normal in a different context.

Catching non-obvious cheaters without having many false positives is very challenging. I'm sure Lichess tries its best, but people shouldn't believe that ML is some kind of magical cure. Even with good training data, this would be a challenging task, and the training data is flawed.
Your linked paper about dropout, http://mipal.snu.ac.kr/images/1/16/Dropout_ACCV2016.pdf, is about improving dropout for CNNs. It specifically says "Dropout works well in practice especially with fully connected layers," which is how it is being used in this model. In what way is it weird to use dropout as used in this model? Do you have a better citation for why dropout might be controversial or dubious in any way?
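
For reference, dropout between fully connected layers is exactly the textbook usage. A minimal PyTorch sketch (the layer sizes are invented for illustration and are not claimed to match the model under discussion):

```python
import torch
import torch.nn as nn

# Minimal sketch of dropout used with fully connected layers.
# Layer sizes are illustrative only, not the actual architecture.
model = nn.Sequential(
    nn.Linear(64, 128),   # feature vector -> hidden layer
    nn.ReLU(),
    nn.Dropout(p=0.5),    # randomly zeroes 50% of activations while training
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 1),     # single "cheater score" logit
)

model.train()             # dropout active during training
x = torch.randn(8, 64)
print(model(x).shape)     # torch.Size([8, 1])

model.eval()              # dropout disabled at inference time
```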

Regarding the model training on its own labels, how would you go about reframing the problem to avoid the predict-your-labels problem? Given that cheating detection has a massive scale problem and confessions are rare, would you rely on supplying human arbiters with heuristics and deriving labels from the arbiter's decisions? Could you augment the dataset with "artificial" cheating to make the training data more robust?
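
To make the last idea concrete, here is one way "artificial" cheating could be generated; a rough sketch using python-chess and a local Stockfish binary (the 30% consultation rate and the depth settings are arbitrary assumptions, not anyone's actual pipeline):

```python
import random
import chess
import chess.engine

# Sketch: generate an "artificial cheater" game. White mostly plays at a
# crude, weak setting (a stand-in for human play) but consults full engine
# strength on a random fraction of moves. Assumes python-chess and a
# Stockfish binary on PATH; the 30% rate and depths are arbitrary choices.
def artificial_cheater_game(engine_path="stockfish", rate=0.3, max_plies=120):
    board = chess.Board()
    labels = []  # per White move: 1 = engine-assisted, 0 = "human"
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        while not board.is_game_over() and len(board.move_stack) < max_plies:
            is_white = board.turn == chess.WHITE
            assisted = is_white and random.random() < rate
            limit = chess.engine.Limit(depth=18 if assisted else 1)
            move = engine.play(board, limit).move
            if is_white:
                labels.append(int(assisted))
            board.push(move)
    return board, labels
```

Games produced this way come with per-move ground-truth labels, which is precisely what real-world data lacks. The obvious caveat is that the model may then learn to detect this particular splicing pattern rather than real cheating behavior.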

> (for instance, FIDE requires 99.998% accuracy for statistical models).

This is related to the z-score of the conventional statistical tests that FIDE uses. That figure describes the error rate among those *evaluated* (how rarely an honest player gets flagged), not the probability that an *accused* player is actually guilty. If cheating is rare, false positives will make up a much larger share of accusations than the figure suggests.
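
To make the base-rate point concrete, a quick Bayes calculation (the prevalence and sensitivity figures are assumptions for illustration; only the 99.998% figure comes from the quoted FIDE requirement):

```python
# P(cheater | flagged) via Bayes. The prevalence and sensitivity are
# illustrative assumptions; 99.998% is the quoted FIDE figure.
prevalence = 0.01                    # assumed fraction of players who cheat
sensitivity = 0.95                   # assumed P(flagged | cheater)
fpr = 1 - 0.99998                    # 0.002% of honest players flagged

p_flagged = sensitivity * prevalence + fpr * (1 - prevalence)
print(f"P(cheater | flagged) = {sensitivity * prevalence / p_flagged:.4f}")
# -> ~0.998: at FIDE's threshold, accusations stay reliable

fpr = 0.01                           # a weaker test: 99% specificity
p_flagged = sensitivity * prevalence + fpr * (1 - prevalence)
print(f"P(cheater | flagged) = {sensitivity * prevalence / p_flagged:.4f}")
# -> ~0.49: nearly half of the flagged players would be innocent
```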

> Second, when chess-playing friends as well as my wife reviewed this work, they told me that professional players & chess streamers often move their mouse "off the board" after every move to avoid any kind of "mouse slip", particularly in tight games. For instance, I was shown a YouTube video of Hikaru Nakamura where he did exactly that and explained during a game why he does it.

1. Moving your mouse off the board does not make the browser window lose focus. That would require actually clicking on another tab or window, which is not what happens in Hikaru's case, and it does look like a clear signal of cheating, especially if the moves that coincide with a focus change are correlated with higher strength.
2. This kind of very careful mouse behavior is almost certainly rare in the first place, and it only applies to slower time controls where quick mouse movement isn't needed, so it probably does not pollute the telemetry data much.

> But more importantly, given the results above, the browser telemetry does not seem to have a huge influence on the model performance, as I did not have this data and the model still behaves in a similar way.

It's possible that some of the accounts you decided were false positives were flagged due to signals from telemetry, right? Someone who cheated in a few games might have a strong cheating signal from telemetry plus conventional features in those games without it affecting their overall score.
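
A trivial sketch of that effect: a handful of strongly suspicious games barely moves an account-level average, so whether per-game signals survive depends entirely on how scores are aggregated (the numbers below are invented):

```python
# Invented per-game cheat probabilities for one account:
game_scores = [0.05, 0.08, 0.97, 0.03, 0.95, 0.06, 0.04]

mean_score = sum(game_scores) / len(game_scores)
top2 = sorted(game_scores, reverse=True)[:2]

print(f"account mean:  {mean_score:.2f}")        # 0.31 -- looks unremarkable
print(f"account max:   {max(game_scores):.2f}")  # 0.97 -- clearly suspicious
print(f"top-2 average: {sum(top2) / 2:.2f}")     # 0.96 -- robust to one fluke
```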
@Cedur216 said in #5:
> @IrwinCaladinResearch
>
> go to 31:00 and forward
>
> at 32:30 there's the spot where Chris says that "only the dumbest and most obvious cheaters are caught automatically, the low hanging fruit if you will, and then from there the case is moved on to human beings". So you don't need to argue about automated cases.
>
> Most bans come from reports, and every report is reviewed by humans. It doesn't just take machine learning or knowledge of statistics; it also takes an understanding of chess.
>
> also, talking about chesscom: there was the (former) NSF president who confessed that he used assistance from a stronger player on chesscom. At the time, chesscom was already highly suspicious of him but didn't ban him, because the chance of a false positive was 0.1%, and at that rate they'd false-ban a lot of high-profile players every month and have serious issues. Besides, the likelihood of a false positive is not the only thing that matters; the likelihood of a false negative matters too.

I would also add that not only are there a plethora of other factors that indicate cheating (I personally know of at least 3 that I've never seen made public)...

...but also that Lichess has been dedicated to anti-cheat without any interest in appearing to be the best, while chesscom has a history of doing just enough.

Right now, they can't figure out why and how Lichess manages so much dedicated traffic, and they're finally beginning to understand that their decade of "do-just-enough" culture is catching up with them, while Lichess happily reaps what it has sown.

If not for chesscom's massive propaganda and advertisement expenditures...
If not for the appearance of legitimacy chesscom bought by investing in the GMs...

...chesscom wouldn't even be a footnote in the chess world.

Their latest trick was announcing how many cheaters they catch, in order to sway public opinion toward the idea that they're the only legitimate site to play on.

This is the whole of chesscom's "substance".
It's all facade and veneer.

On practical and on moral grounds, Lichess deserves the title.
@IrwinCaladinResearch

Consider this a response on behalf of the mod team.

Thanks for the post. We appreciate the effort you put into it.

We're always looking to improve our systems, so if this is something you're genuinely interested in, we are willing to engage with you further about your work and findings.

Please, however, could you first send us some evidence of your technical experience and skills - a CV, links to publications, a GitHub profile, anything really - just so we can gauge where you're coming from. No need to post anything here; you can email it to contact@lichess.org.

Alright, let's talk about your post. Unsurprisingly, we strongly disagree with your central claims that our systems are "fundamentally flawed" and that "it is very likely that [Lichess] punishes a lot of non-cheating players as well" - especially if the latter refers to decisions taken about accounts.

More generally, we think your claims are rather strongly stated given the limited details that you provide to support them, and even those details need to be probed further. For example, just from what you've posted, we'd have concerns about the inferences you have drawn from the available data, the assumptions and logic behind your 'false positive' estimate, and whether your analysis has fully accounted for our ML systems' primary role to inform a multifactorial decision process.

In addition, your characterisation of the feedback loops in our models is completely off the mark, because we have taken proactive measures to avoid exactly what you describe. You really ought to give us a bit more credit! We know we're dealing with applied ML here, where "ground truth" data never truly reflects actual ground truth, and perfect labels are the exception, not the norm.

We'll leave it there for now. Our offer of further engagement still stands, so please get in touch if you're interested.
Well, chess dot com was mentioned, but one distinguishing feature of their cheat detection system wasn't: they do claim to somehow objectively measure the difficulty of choosing a move out of the possible legal moves. In standard chess the number of legal moves in a position ranges from 1 to 218, if I remember correctly.

Unfortunately, I don't have time to explain this in more depth now, so I'll just link to selected papers by Ken Regan:

cse.buffalo.edu/~regan/publications.html#chess
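
Very roughly, the idea is a choice model: each legal move gets a probability derived from its engine evaluation, and the "difficulty" of a position falls out of how peaked that distribution is. A toy sketch loosely in the spirit of Regan's work (the softmax form and the skill parameter are my simplification, not his actual model):

```python
import math

# Toy move-choice model: each legal move's probability comes from its
# centipawn loss relative to the best move. The softmax form and the
# "skill" temperature s are my simplification, not Regan's actual model.
def move_probabilities(centipawn_losses, s=100.0):
    weights = [math.exp(-loss / s) for loss in centipawn_losses]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

# Position A: one move is clearly best -> easy choice, low entropy.
easy = move_probabilities([0, 250, 300, 400])
# Position B: several near-equal candidates -> hard choice, high entropy.
hard = move_probabilities([0, 10, 15, 20])

print(f"easy: p(best)={easy[0]:.2f}, entropy={entropy(easy):.2f} bits")
print(f"hard: p(best)={hard[0]:.2f}, entropy={entropy(hard):.2f} bits")
```

On such a model, matching the engine's first choice in the "hard" position is far more surprising for a human than matching it in the "easy" one, which is what lets a statistical test weight moves by difficulty rather than treating all engine matches equally.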
@izzie26 I think he’s looking for my lost parcel. A response to your request may take some time. :-)

Ps... I thought he sounded like an intelligent bloke, but I understood very little of what he wrote beyond 'cheat detection is flawed' and it 'teaches itself to become more flawed'... well, I think that's what he's saying. Keep up the great work you mods do! Anything that makes the site better gets a thumbs up from me.
I am going to reply to a few statements in this thread and ignore some of the ad-hominem attacks (which were expected):

@SomewhatUnsound said in #10:
> Model drift is going to be a problem for sure if it's "eating its own classifications", but I would hope the human labelled data is heavily used for training purposes which might help.

Let's assume for a second that only a user report triggers a validation of a player, and consider the process that occurs then: a moderator / chess expert looks at some of the games and requests an Irwin report. More likely than not, the moderator will then rely on a combination of their "understanding of chess" and the Irwin report to ban the player. Even if the "weighted" influence of Irwin on this decision is only 50%, the decision itself is based on the ML model behind Irwin, and hence the label itself results from "eating its own classifications".
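
For illustration, a toy simulation of that loop (all numbers and features are invented; this is not Irwin, just the "eating its own classifications" mechanism in miniature). An initial labelling bias against an irrelevant feature keeps surviving once the model's own verdicts become the next round's training labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample_players(n):
    # Invented toy population: 5% true cheaters who play a bit stronger;
    # "speed" is an irrelevant feature with no link to cheating.
    cheater = rng.random(n) < 0.05
    accuracy = rng.normal(cheater * 1.5, 1.0)
    speed = rng.normal(0, 1.0, n)
    return np.column_stack([accuracy, speed]), cheater

# Generation 0: labels come from a biased human process that also
# flags fast movers (speed > 1.5) regardless of actual cheating.
X, truth = sample_players(5000)
labels = (truth | (X[:, 1] > 1.5)).astype(int)

for gen in range(5):
    model = LogisticRegression().fit(X, labels)
    X, truth = sample_players(5000)
    labels = model.predict(X)        # the model's own verdicts become labels
    fp = np.mean(labels[~truth])     # fraction of honest players "banned"
    print(f"gen {gen}: speed coef={model.coef_[0][1]:+.2f}, "
          f"false positive rate={fp:.3f}")
    # The weight on the irrelevant "speed" feature, and the resulting
    # false positives, persist across generations instead of washing out.
```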

On a side note: I do not think that being transparent in a Lichess blog post about exactly how this user-report process works would benefit cheaters.

@IndigoEngun said in #12:
> Catching non-obvious cheaters without having many false positives is very challenging. I'm sure Lichess tries its best, but people shouldn't believe that ML is some kind of magical cure. Even with good training data, this would be a challenging task, and the training data is flawed.

Not having sufficiently good data is not an excuse to deploy an ML model that makes unacceptable numbers of mistakes. Period. How would you feel about this statement if we were talking about level-5 autonomous cars? "Oh, we did not have enough data on pedestrians wearing yellow jackets crossing the street at a construction site, so 1-2% of them got hit." While the Lichess model does not kill anyone, I can still see how it could affect real people in real life in a really bad way.

@izzie26 said in #15:
> Alright, let's talk about your post. Unsurprisingly, we strongly disagree with your central claims that our systems are "fundamentally flawed" and that "it is very likely that [Lichess] punishes a lot of non-cheating players as well" - especially if the latter refers to decisions taken about accounts.

That's good. I hoped you would disagree.

@izzie26 said in #15:
> More generally, we think your claims are rather strongly stated given the limited details that you provide to support them, and even those details need to be probed further. For example, just from what you've posted, we'd have concerns about the inferences you have drawn from the available data, the assumptions and logic behind your 'false positive' estimate, and whether your analysis has fully accounted for our ML systems' primary role to inform a multifactorial decision process.

I am happy to be proven wrong. I would be the first one to say: "Lichess is doing this differently than I expected, and my concern is void because xyz addresses this." That would obviously require you to write down how you execute this "multifactorial decision process". I do not think this would benefit cheaters in their effort to avoid the process, as it occurs after they have been reported or flagged by one of the systems. If you think it would benefit cheaters, please outline how.

This kind of transparency would go a long way for all of these systems, and as stated in my opening post, Lichess is already a lot better than Chesscom.

@izzie26 said in #15:
> In addition, your characterisation of the feedback loops in our models is completely off the mark, because we have taken proactive measures to avoid exactly what you describe. You really ought to give us a bit more credit! We know we're dealing with applied ML here, where "ground truth" data never truly reflects actual ground truth, and perfect labels are the exception, not the norm.

Here I disagree, and I have already made a statement about the input data above. Your model is affecting real people, and using "perfect labels are the exception" as an excuse to accept sub-par performance in the model is not a stance I would subscribe to.

And again: I am happy to admit I am wrong here. If your model training process in fact does not use its own labels, then describe the process by which you obtain them. This would likely not benefit cheaters because, as you outlined, you are not using labels generated from previous iterations/versions of the model. It would be yet another measure to improve transparency without reducing effectiveness.

@izzie26 said in #15:
> We'll leave it there for now. Our offer of further engagement still stands, so please get in touch if you're interested.

I will have to ask my employer about that and will come back if I get a positive response by e-mail.
