It’s simple enough for AI to seem to comprehend data, but devising a true test of a machine’s knowledge has proved difficult.

Illustration showing a human understanding text while a machine doesn’t. — **Maggie Chiang for Quanta Magazine**

Remember IBM’s Watson, the AI Jeopardy! champion? A 2010 promotion proclaimed, “Watson understands natural language with all its ambiguity and complexity.” However, as we saw when Watson subsequently failed spectacularly in its quest to “revolutionize medicine with artificial intelligence,” a veneer of linguistic facility is not the same as actually comprehending human language.

Natural language understanding has long been a major goal of AI research. At first, researchers tried to manually program everything a machine would need to make sense of news stories, fiction or anything else humans might write. This approach, as Watson showed, was futile — it’s impossible to write down all the unwritten facts, rules and assumptions required for understanding text. More recently, a new paradigm has been established: Instead of building in explicit knowledge, we let machines learn to understand language on their own, simply by ingesting vast amounts of written text and learning to predict words. The result is what researchers call a language model. When based on large neural networks, like OpenAI’s GPT-3, such models can generate uncannily humanlike prose (and poetry!) and seemingly perform sophisticated linguistic reasoning.

But has GPT-3 — trained on text from thousands of websites, books and encyclopedias — transcended Watson’s veneer? Does it really understand the language it generates and ostensibly reasons about? This is a topic of stark disagreement in the AI research community. Such discussions used to be the purview of philosophers, but in the past decade AI has burst out of its academic bubble into the real world, and its lack of understanding of that world can have real and sometimes devastating consequences. In one study, IBM’s Watson was found to propose “multiple examples of unsafe and incorrect treatment recommendations.” Another study showed that Google’s machine translation system made significant errors when used to translate medical instructions for non-English-speaking patients.

How can we determine in practice whether a machine can understand? In 1950, the computing pioneer Alan Turing tried to answer this question with his famous “imitation game,” now called the Turing test. A machine and a human, both hidden from view, would compete to convince a human judge of their humanness using only conversation. If the judge couldn’t tell which one was the human, then, Turing asserted, we should consider the machine to be thinking — and, in effect, understanding.

Unfortunately, Turing underestimated the propensity of humans to be fooled by machines. Even simple chatbots, such as Joseph Weizenbaum’s 1960s ersatz psychotherapist Eliza, have fooled people into believing they were conversing with an understanding being, even when they knew that their conversation partner was a machine.

In a 2012 paper, the computer scientists Hector Levesque, Ernest Davis and Leora Morgenstern proposed a more objective test, which they called the Winograd schema challenge. This test has since been adopted in the AI language community as one way, and perhaps the best way, to assess machine understanding — though as we’ll see, it is not perfect. A Winograd schema, named for the language researcher Terry Winograd, consists of a pair of sentences, differing by exactly one word, each followed by a question. Here are two examples:

Sentence 1: I poured water from the bottle into the cup until it was full.
Question: What was full, the bottle or the cup?
Sentence 2: I poured water from the bottle into the cup until it was empty.
Question: What was empty, the bottle or the cup?

Sentence 1: Joe’s uncle can still beat him at tennis, even though he is 30 years older.
Question: Who is older, Joe or Joe’s uncle?
Sentence 2: Joe’s uncle can still beat him at tennis, even though he is 30 years younger.
Question: Who is younger, Joe or Joe’s uncle?

In each sentence pair, the one-word difference can change which thing or person a pronoun refers to. Answering these questions correctly seems to require commonsense understanding. Winograd schemas are designed precisely to test this kind of understanding, alleviating the Turing test’s vulnerability to unreliable human judges or chatbot tricks. In particular, the authors designed a few hundred schemas that they believed were “Google-proof”: A machine shouldn’t be able to use a Google search (or anything like it) to answer the questions correctly.
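For readers who think in code, here is one way a schema could be represented. This is only an illustrative sketch: the class and field names (`WinogradItem`, `WinogradSchema`, `differing_words`) are invented for this example and do not come from the challenge's authors.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class WinogradItem:
    """One sentence, its question, the two candidate referents, and the answer."""
    sentence: str
    question: str
    candidates: tuple[str, str]
    answer: str


@dataclass(frozen=True)
class WinogradSchema:
    """A pair of items whose sentences differ by a single word."""
    first: WinogradItem
    second: WinogradItem

    def differing_words(self) -> list[tuple[str, str]]:
        # Compare the sentences word by word, ignoring trailing punctuation.
        # Assumes both sentences have the same number of words.
        a = [w.strip(".,?!") for w in self.first.sentence.split()]
        b = [w.strip(".,?!") for w in self.second.sentence.split()]
        return [(x, y) for x, y in zip(a, b) if x != y]


bottle_cup = WinogradSchema(
    first=WinogradItem(
        sentence="I poured water from the bottle into the cup until it was full.",
        question="What was full, the bottle or the cup?",
        candidates=("the bottle", "the cup"),
        answer="the cup",
    ),
    second=WinogradItem(
        sentence="I poured water from the bottle into the cup until it was empty.",
        question="What was empty, the bottle or the cup?",
        candidates=("the bottle", "the cup"),
        answer="the bottle",
    ),
)

print(bottle_cup.differing_words())  # [('full', 'empty')]
```

Swapping “full” for “empty” flips the answer from the cup to the bottle, even though nothing else in the sentence changes.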

These schemas were the subject of a competition held in 2016 in which the winning program was correct on only 58% of the sentences — hardly a better result than if it had guessed. Oren Etzioni, a leading AI researcher, quipped, “When AI can’t determine what ‘it’ refers to in a sentence, it’s hard to believe that it will take over the world.”

However, the ability of AI programs to solve Winograd schemas rose quickly due to the advent of large neural network language models. A 2020 paper from OpenAI reported that GPT-3 was correct on nearly 90% of the sentences in a benchmark set of Winograd schemas. Other language models have performed even better after training specifically on these tasks. At the time of this writing, neural network language models have achieved about 97% accuracy on a particular set of Winograd schemas that are part of an AI language-understanding competition known as SuperGLUE. This accuracy roughly equals human performance. Does this mean that neural network language models have attained humanlike understanding?

Not necessarily. Despite the creators’ best efforts, those Winograd schemas were not actually Google-proof. These challenges, like many other current tests of AI language understanding, sometimes permit shortcuts that allow neural networks to perform well without understanding. For example, consider the sentences “The sports car passed the mail truck because it was going faster” and “The sports car passed the mail truck because it was going slower.” A language model trained on a huge corpus of English sentences will have absorbed the correlation between “sports car” and “fast,” and between “mail truck” and “slow,” and so it can answer correctly based on those correlations alone rather than by drawing on any understanding. It turns out that many of the Winograd schemas in the SuperGLUE competition allow for these kinds of statistical correlations.
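To make the shortcut concrete, consider a toy sketch. The association scores below are made-up numbers standing in for the word correlations a language model might absorb from its training text, and `shortcut_guess` is a hypothetical helper, not anything a real system exposes. The point is that the guesser knows nothing about cars, trucks or passing.

```python
# Hypothetical association strengths between a noun phrase and a cue word.
# These values are invented for illustration, not measured from any corpus.
ASSOCIATION = {
    ("sports car", "faster"): 0.9,
    ("sports car", "slower"): 0.1,
    ("mail truck", "faster"): 0.2,
    ("mail truck", "slower"): 0.8,
}


def shortcut_guess(candidates, cue_word):
    """Pick the candidate most strongly associated with the cue word.

    No grammar, no physics, no notion of what "passing" means:
    only a table lookup. Unknown pairs score 0.0.
    """
    return max(candidates, key=lambda c: ASSOCIATION.get((c, cue_word), 0.0))


vehicles = ("sports car", "mail truck")
print(shortcut_guess(vehicles, "faster"))  # sports car
print(shortcut_guess(vehicles, "slower"))  # mail truck
```

Both guesses match the correct answers, yet the same lookup would give the same guesses for a sentence in which the mail truck did the passing.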

Rather than give up on the Winograd schemas as a test of understanding, a group of researchers from the Allen Institute for Artificial Intelligence decided instead to try to fix some of their problems. In 2019 they created WinoGrande, a much larger set of Winograd schemas. Instead of several hundred examples, WinoGrande contains a whopping 44,000 sentences. To obtain that many examples, the researchers turned to Amazon Mechanical Turk, a popular platform for crowdsourcing work. Each (human) worker was asked to write several sentence pairs, with some constraints to ensure that the collection would contain diverse topics, though now the sentences in each pair could differ by more than one word.

The researchers then attempted to eliminate sentences that could allow statistical shortcuts by applying a relatively unsophisticated AI method to each sentence and discarding any that were too easily solved. As expected, the remaining sentences presented a much harder challenge for machines than the original Winograd schema collection. While humans still scored very high, neural network language models that had matched human performance on the original set scored much lower on the WinoGrande set. This new challenge seemed to redeem Winograd schemas as a test for commonsense understanding — as long as the sentences were carefully screened to ensure that they were Google-proof.
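The screening idea can be sketched in a few lines. This is a deliberately simplified stand-in, not the researchers' actual procedure: the toy association table, the `margin` threshold and the function names are all assumptions made for illustration. An item is thrown out when a crude solver prefers the correct answer by a wide margin.

```python
# Invented association scores playing the role of the "unsophisticated" method.
TOY_ASSOCIATION = {
    ("sports car", "faster"): 0.9,
    ("mail truck", "faster"): 0.2,
}


def simple_scores(item):
    """Score each candidate by its association with the item's cue word."""
    return {
        c: TOY_ASSOCIATION.get((c, item["cue"]), 0.0)
        for c in item["candidates"]
    }


def is_too_easy(item, margin=0.5):
    """True if the crude solver favors the correct answer by at least `margin`."""
    scores = simple_scores(item)
    correct = scores[item["answer"]]
    best_wrong = max(s for c, s in scores.items() if c != item["answer"])
    return correct - best_wrong >= margin


def filter_items(items, margin=0.5):
    """Keep only the items the crude solver cannot confidently solve."""
    return [item for item in items if not is_too_easy(item, margin)]


items = [
    {"cue": "faster", "candidates": ("sports car", "mail truck"), "answer": "sports car"},
    {"cue": "full", "candidates": ("bottle", "cup"), "answer": "cup"},
]

kept = filter_items(items)
print([item["cue"] for item in kept])  # ['full']
```

Note that the filter judges each sentence on its own, so it can discard one sentence of a pair while keeping the other.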

However, another surprise was in store. In the almost two years since the WinoGrande collection was published, neural network language models have grown ever larger, and the larger they get, the better they seem to score on this new challenge. At the time of this writing, the current best programs, which have been trained on terabytes of text and then further trained on thousands of WinoGrande examples, get close to 90% correct (humans get about 94% correct). This increase in performance is due almost entirely to the increased size of the neural network language models and their training data.

Have these ever larger networks finally attained humanlike commonsense understanding? Again, it’s not likely. The WinoGrande results come with some important caveats. For example, because the sentences were written by Amazon Mechanical Turk workers, the quality and coherence of the writing are quite uneven. Also, the “unsophisticated” AI method used to weed out “non-Google-proof” sentences may have been too unsophisticated to spot all possible statistical shortcuts available to a huge neural network, and it was applied only to individual sentences, so some of the remaining sentences ended up losing their “twin.” One follow-up study showed that neural network language models tested on twin sentences only, and required to be correct on both, are much less accurate than humans, showing that the earlier 90% result is less significant than it seemed.
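The gap between the two ways of scoring is easy to see in a small sketch. The results below are invented purely for illustration and do not come from any study; the function names and the `pair_id` field are likewise hypothetical.

```python
def per_sentence_accuracy(results):
    """Fraction of individual sentences answered correctly."""
    return sum(r["correct"] for r in results) / len(results)


def twin_accuracy(results):
    """Fraction of complete twin pairs with BOTH sentences answered correctly."""
    pairs = {}
    for r in results:
        pairs.setdefault(r["pair_id"], []).append(r["correct"])
    # Only pairs that still have both twins can be scored this way.
    complete = [outcomes for outcomes in pairs.values() if len(outcomes) == 2]
    return sum(all(outcomes) for outcomes in complete) / len(complete)


# Made-up outcomes for four twin pairs (eight sentences).
results = [
    {"pair_id": "A", "correct": True},
    {"pair_id": "A", "correct": True},
    {"pair_id": "B", "correct": True},
    {"pair_id": "B", "correct": False},
    {"pair_id": "C", "correct": True},
    {"pair_id": "C", "correct": True},
    {"pair_id": "D", "correct": False},
    {"pair_id": "D", "correct": True},
]

print(per_sentence_accuracy(results))  # 0.75
print(twin_accuracy(results))          # 0.5
```

The stricter score also lowers the bar set by blind guessing: a coin flip gets half of the individual sentences right, but gets both twins right only a quarter of the time.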

So, what to make of the Winograd saga? The main lesson is that it is often hard to determine from their performance on a given challenge if AI systems truly understand the language (or other data) that they process. We now know that neural networks often use statistical shortcuts — instead of actually demonstrating humanlike understanding — to obtain high performance on the Winograd schemas as well as many of the most popular “general language understanding” benchmarks.

The crux of the problem, in my view, is that understanding language requires understanding the world, and a machine exposed only to language cannot gain such an understanding. Consider what it means to understand “The sports car passed the mail truck because it was going slower.” You need to know what sports cars and mail trucks are, that cars can “pass” one another, and, at an even more basic level, that vehicles are objects that exist and interact in the world, driven by humans with their own agendas.