Why AI detectors think the US Constitution was written by AI
Can AI writing detectors be trusted? We dig into the theory behind them.
BENJ EDWARDS – 7/14/2023, 4:00 AM
If you feed America’s most important legal document—the US Constitution—into a tool designed to detect text written by AI models like ChatGPT, it will tell you that the document was almost certainly written by AI. But unless James Madison was a time traveler, that can’t be the case. Why do AI writing detection tools give false positives? We spoke to several experts—and the creator of AI writing detector GPTZero—to find out.
Among news stories of overzealous professors flunking an entire class due to the suspicion of AI writing tool use and kids falsely accused of using ChatGPT, generative AI has education in a tizzy. Some think it represents an existential crisis. Teachers relying on educational methods developed over the past century have been scrambling for ways to keep the status quo—the tradition of relying on the essay as a tool to gauge student mastery of a topic.
As tempting as it is to rely on AI tools to detect AI-generated writing, evidence so far has shown that they are not reliable. Due to false positives, AI writing detectors such as GPTZero, ZeroGPT, and OpenAI’s Text Classifier cannot be trusted to detect text composed by large language models (LLMs) like ChatGPT.
If you feed GPTZero a section of the US Constitution, it says the text is “likely to be written entirely by AI.” Several times over the past six months, screenshots of other AI detectors showing similar results have gone viral on social media, inspiring confusion and plenty of jokes about the founding fathers being robots. It turns out the same thing happens with selections from The Bible, which also show up as being AI-generated.
To explain why these tools make such obvious mistakes (and otherwise often return false positives), we first need to understand how they work.
Understanding the concepts behind AI detection
Different AI writing detectors use slightly different methods of detection but with a similar premise: There’s an AI model that has been trained on a large body of text (consisting of millions of writing examples) and a set of surmised rules that determine whether the writing is more likely to be human- or AI-generated.
For example, at the heart of GPTZero is a neural network trained on “a large, diverse corpus of human-written and AI-generated text, with a focus on English prose,” according to the service’s FAQ. Next, the system uses properties like “perplexity” and “burstiness” to evaluate the text and make its classification.
In machine learning, perplexity is a measurement of how much a piece of text deviates from what an AI model has learned during its training. As Dr. Margaret Mitchell of AI company Hugging Face told Ars, “Perplexity is a function of ‘how surprising is this language based on what I’ve seen?’”
So the thinking behind measuring perplexity is that when they’re writing text, AI models like ChatGPT will naturally reach for what they know best, which comes from their training data. The closer the output is to the training data, the lower the perplexity rating. Humans are much more chaotic writers—or at least that’s the theory—but humans can write with low perplexity, too, especially when imitating a formal style used in law or certain types of academic writing. Also, many of the phrases we use are surprisingly common.
Let’s say we’re guessing the next word in the phrase “I’d like a cup of _____.” Most people would fill in the blank with “water,” “coffee,” or “tea.” A language model trained on a lot of English text would do the same because those phrases occur frequently in English writing. The perplexity of any of those three results would be quite low because the prediction is fairly certain.
Now consider a less common completion: “I’d like a cup of spiders.” Both humans and a well-trained language model would be quite surprised (or “perplexed”) by this sentence, so its perplexity would be high. (As of this writing, the phrase “I’d like a cup of spiders” gives exactly one result in a Google search, compared to 3.75 million results for “I’d like a cup of coffee.”)
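To make the idea concrete, here is a minimal sketch of how a perplexity score can be computed with a small open model (GPT-2) and the Hugging Face transformers library. It only illustrates the concept; GPTZero and other detectors don’t publish their exact scoring code, and the model and example sentences here are stand-ins.

```python
# Sketch: scoring text with perplexity using GPT-2 (illustrative only).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Ask the model to predict each token from the tokens before it.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the average cross-entropy loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    # Perplexity is the exponential of the average negative log-likelihood.
    return torch.exp(loss).item()

print(perplexity("I'd like a cup of coffee."))   # common phrase: low perplexity
print(perplexity("I'd like a cup of spiders."))  # surprising phrase: higher perplexity
```

Run on those two sentences, the common phrase should score far lower than the surprising one, and that gap is the signal a detector leans on.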
If the language in a piece of text isn’t surprising based on the model’s training, the perplexity will be low, so the AI detector will be more likely to classify that text as AI-generated. This leads us to the interesting case of the US Constitution. In essence, the Constitution’s language is so ingrained in these models that they classify it as AI-generated, creating a false positive.
GPTZero creator Edward Tian told Ars Technica, “The US Constitution is a text fed repeatedly into the training data of many large language models. As a result, many of these large language models are trained to generate similar text to the Constitution and other frequently used training texts. GPTZero predicts text likely to be generated by large language models, and thus this fascinating phenomenon occurs.”
The problem is that it’s entirely possible for human writers to create content with low perplexity as well (if they write primarily using common phrases such as “I’d like a cup of coffee,” for example), which deeply undermines the reliability of AI writing detectors.
Another property of text measured by GPTZero is “burstiness,” which refers to the phenomenon where certain words or phrases appear in rapid succession or “bursts” within a text. Essentially, burstiness evaluates the variability in sentence length and structure throughout a text.
Human writers often exhibit a dynamic writing style, resulting in text with variable sentence lengths and structures. For instance, we might write a long, complex sentence followed by a short, simple one, or we might use a burst of adjectives in one sentence and none in the next. This variability is a natural outcome of human creativity and spontaneity.
AI-generated text, on the other hand, tends to be more consistent and uniform—at least so far. Language models, which are still in their infancy, generate sentences with more regular lengths and structures. This lack of variability can result in a low burstiness score, indicating that the text may be AI-generated.
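GPTZero doesn’t publish its burstiness formula, but a crude, hypothetical proxy is easy to sketch: split a passage into sentences and measure how much their lengths vary relative to the average. The sentence-splitting heuristic and the two sample passages below are illustrations, not anything a real detector is known to use.

```python
# Sketch: a rough "burstiness" proxy based on variation in sentence length.
import re
import statistics

def burstiness(text: str) -> float:
    # Split on sentence-ending punctuation (a crude heuristic).
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    # Coefficient of variation: standard deviation relative to the mean length.
    return statistics.stdev(lengths) / statistics.mean(lengths)

uniform = "The cat sat on the mat. The dog sat on the rug. The bird sat on the perch."
varied = "Stop. The storm rolled in fast, flooding the narrow streets before anyone could react. We ran."

print(burstiness(uniform))  # lower: every sentence is about the same length
print(burstiness(varied))   # higher: sentence lengths swing widely
```

The uniform passage scores near zero while the varied one scores much higher, mirroring the human-versus-machine pattern the metric is meant to capture.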
However, burstiness isn’t a foolproof metric for detecting AI-generated content, either. As with perplexity, there are exceptions. A human writer may write in a highly structured, consistent style, resulting in a low burstiness score. Conversely, an AI model might be trained to emulate a more human-like variability in sentence length and structure, raising its burstiness score. In fact, as AI language models improve, studies show that their writing looks more and more like human writing all the time.
Ultimately, there’s no magic formula that can always distinguish human-written text from that composed by a machine. AI writing detectors can make a strong guess, but the margin of error is too large to rely on them for an accurate result.
A 2023 study from researchers at the University of Maryland demonstrated empirically that detectors for AI-generated text are not reliable in practical scenarios and that they perform only marginally better than a random classifier. Not only do they return false positives, but detectors and watermarking schemes (that seek to alter word choice in a telltale way) can easily be defeated by “paraphrasing attacks” that modify language model output while retaining its meaning.
“I think they’re mostly snake oil,” said AI researcher Simon Willison of AI detector products. “Everyone desperately wants them to work—people in education especially—and it’s easy to sell a product that everyone wants, especially when it’s really hard to prove if it’s effective or not.”
Additionally, a recent study from Stanford University researchers showed that AI writing detection is biased against non-native English speakers, producing high false-positive rates for their human-written work and potentially penalizing them in the global discourse if AI detectors become widely used.
The cost of false accusations
Some educators, like Professor Ethan Mollick of Wharton School, are accepting this new AI-infused reality and even actively promoting the use of tools like ChatGPT to aid learning. Mollick’s reaction is reminiscent of how some teachers dealt with the introduction of pocket calculators into classrooms: They were initially controversial but eventually came to be widely accepted.
“There is no tool that can reliably detect ChatGPT-4/Bing/Bard writing,” Mollick tweeted recently. “The existing tools are trained on GPT-3.5, they have high false positive rates (10%+), and they are incredibly easy to defeat.” Additionally, ChatGPT itself cannot assess whether text is AI-written or not, he added, so you can’t just paste in text and ask if it was written by ChatGPT.
In a conversation with Ars Technica, GPTZero’s Tian seemed to see the writing on the wall and said he plans to pivot his company away from vanilla AI detection into something more ambiguous. “Compared to other detectors, like Turn-it-in, we’re pivoting away from building detectors to catch students, and instead, the next version of GPTZero will not be detecting AI but highlighting what’s most human, and helping teachers and students navigate together the level of AI involvement in education,” he said.
How does he feel about people using GPTZero to accuse students of academic dishonesty? Unlike traditional plagiarism checker companies, Tian said, “We don’t want people using our tools to punish students. Instead, for the education use case, it makes much more sense to stop relying on detection on the individual level (where some teachers punish students and some teachers are fine with AI technologies) but to apply these technologies on the school [or] school board [level], even across the country, because how can we craft the right policies to respond to students using AI technologies until we understand what is going on, and the degree of AI involvement across the board?”
Yet despite the inherent problems with accuracy, GPTZero still advertises itself as being “built for educators,” and its site proudly displays a list of universities that supposedly use the technology. There’s a strange tension between Tian’s stated goals not to punish students and his desire to make money with his invention. But whatever the motives, using these flawed products can have terrible effects on students. Perhaps the most damaging result of people using these inaccurate and imperfect tools is the personal cost of false accusations.
A case reported by USA Today highlights the issue in a striking way. A student was accused of cheating based on AI text detection tools and had to present his case before an honor board. His defense included showing his Google Docs history to demonstrate his research process. Despite the board finding no evidence of cheating, the stress of preparing to defend himself led the student to experience panic attacks. Similar scenarios have played out dozens (if not hundreds) of times across the US and are commonly documented on desperate Reddit threads.
Common penalties for academic dishonesty often include failing grades, academic probation, suspension, or even expulsion, depending on the severity and frequency of the violation. That’s a difficult charge to face, and the use of flawed technology to levy those charges feels almost like a modern-day academic witch hunt.
“AI writing is undetectable and likely to remain so”
In light of the high rate of false positives and the potential to punish non-native English speakers unfairly, it’s clear that the science of detecting AI-generated text is far from foolproof—and likely never will be. Humans can write like machines, and machines can write like humans. A more helpful question might be: Do humans who write with machine assistance understand what they are saying? If someone is using AI tools to fill in factual content in a way they don’t understand, that should be easy enough to figure out by a competent reader or teacher.
AI writing assistance is here to stay, and if used wisely, AI language models can potentially speed up composition in a responsible and ethical way. Teachers may want to encourage responsible use and ask questions like: Does the writing reflect the intentions and knowledge of the writer? And can the human author vouch for every fact included?
A teacher who is also a subject matter expert could quiz students on the contents of their work afterward to see how well they understand it. Writing is not just a demonstration of knowledge but a projection of a person’s reputation, and if the human author can’t stand by every fact represented in the writing, AI assistance has not been used appropriately.
Like any tool, language models can be used poorly or used with skill. And that skill also depends on context: You can paint an entire wall with a paintbrush or create the Mona Lisa. Both scenarios are an appropriate use of the tool, but each demands different levels of human attention and creativity. Similarly, some rote writing tasks (generating standardized weather reports, perhaps) may be accelerated appropriately by AI, while more intricate tasks need more human care and attention. There’s no black-or-white solution.
For now, Ethan Mollick told Ars Technica that despite panic from educators, he isn’t convinced that anyone should use AI writing detectors. “I am not a technical expert in AI detection,” Mollick said. “I can speak from the perspective of an educator working with AI to say that, as of now, AI writing is undetectable and likely to remain so, AI detectors have high false positive rates, and they should not be used as a result.”