In my last blog post, I wrote about how to detect LLM output, with the conclusion that machine-learning approaches are not the solution. Today I want to show how the recent news “Man beats machine at Go in human victory over AI” is related to that topic.
Man beats machine at Go
What actually happened? FAR AI discovered that it is possible to beat KataGo, a state-of-the-art Go AI that plays at a superhuman level, using adversarial policies: “… our adversaries do not win by learning to play Go better than KataGo … our adversaries win by tricking KataGo into making serious blunders.” The full article can be found here, the paper here, and the GitHub repository here. Kellin Pelrine exploited that blind spot to win over 90% of his games without direct computer support. More details can be found in the Financial Times article.
How is that related to detecting LLM output?
The apparent goal of an LLM is to generate text at a (super)human level, indistinguishable from text written by a human. Detecting LLM output is therefore a search for flaws. But what happens once such a flaw is found? The next generation of LLMs will be optimized to avoid it. Depending on how large each of these evolutionary steps is, at some point the race between generator and detector will be almost even, and the result will be lost in the signal-to-noise ratio, as mentioned in the last blog post.
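To make that dynamic concrete, here is a toy numerical sketch (my own illustration, not anything from the paper or from my detection post): “texts” are just Gaussian feature vectors, the detector is a small logistic-regression classifier, and in each “generation” the generator shifts its distribution against the detector’s weights. All names and numbers are made up; the only point is how the detector’s accuracy drifts toward chance as the race goes on.

```python
# Toy sketch of the generator-vs-detector race (illustrative only).
# "Texts" are Gaussian feature vectors; the detector is a tiny
# logistic-regression classifier; each round the generator is
# "optimized" against the detector's learned weights.
import numpy as np

rng = np.random.default_rng(0)
DIM, N = 16, 2000

human_mean = np.zeros(DIM)
gen_mean = np.full(DIM, 1.0)          # generation 0 is clearly "off"

def sample(mean, n):
    return rng.normal(mean, 1.0, size=(n, DIM))

def train_detector(x_human, x_gen, steps=300, lr=0.1):
    """Plain logistic regression via gradient descent (labels: human=0, generated=1)."""
    x = np.vstack([x_human, x_gen])
    y = np.concatenate([np.zeros(len(x_human)), np.ones(len(x_gen))])
    w, b = np.zeros(DIM), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y
        w -= lr * (x.T @ grad) / len(x)
        b -= lr * grad.mean()
    return w, b

def accuracy(w, b, x_human, x_gen):
    p_h = 1.0 / (1.0 + np.exp(-(x_human @ w + b)))
    p_g = 1.0 / (1.0 + np.exp(-(x_gen @ w + b)))
    return 0.5 * ((p_h < 0.5).mean() + (p_g >= 0.5).mean())

for generation in range(6):
    x_human, x_gen = sample(human_mean, N), sample(gen_mean, N)
    w, b = train_detector(x_human, x_gen)
    print(f"generation {generation}: detector accuracy = "
          f"{accuracy(w, b, x_human, x_gen):.2f}")
    # "Next generation": the generator is optimized to avoid the detected
    # flaw, i.e. its distribution moves against the detector's weight vector.
    gen_mean = gen_mean - 0.7 * w / (np.linalg.norm(w) + 1e-12)
```

On a typical run the printed accuracy starts close to 1.0 and falls toward chance level within a few generations, which is the signal-to-noise problem from the last blog post in miniature.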
In the case of KataGo, the researchers retrained the model against the adversarial policies, but it did not improve much. However, they also propose, in the future-work section of the paper, other methods that could help remove the flaw. Let's see how many more iterations it takes.