A Response to Noam Chomsky on Machine Learning and Knowledge
On machine learning, the nature of knowledge and intelligence, associations, predictions, and explanations.
1. Noam Chomsky on Machine Learning
Recent advances in machine learning have revived an old criticism: that these systems do not possess genuine knowledge. Noam Chomsky has long voiced doubts of this kind. More recently, he co-authored an essay predicting that machine learning “will degrade our science and debase our ethics by incorporating into our technology a fundamentally flawed conception of language and knowledge.”
Chomsky’s main concern is with the statistical nature of these methods. Machine learning systems operate without explicit programming, instead identifying patterns in data and making predictions based on them. This, Chomsky argues, is not how the human mind works. The human mind is not “a lumbering statistical engine for pattern matching”. It does not seek “to infer brute correlations among data points” but “to create explanations”.
This skepticism follows naturally from Chomsky’s broader intellectual commitments. Referring to Plato’s Meno—where Socrates shows how an uneducated slave can exhibit knowledge of geometric principles with only a few prompts—Chomsky asks: How does a child acquire so much knowledge so rapidly, with so little evidence? Like Immanuel Kant, he recognizes that the selectivity imposed on the shower of data to which we are exposed must originate from within us. And like Plato and Kant, his answer to the possibility of knowledge involves something innate to human nature.
Chomsky views the mind as composed of inborn, interacting faculties. Each faculty operates according to distinct, domain-specific rules that produce various mental phenomena.
Language is one such faculty. Chomsky explains it through a framework of principles (invariant rules common to all natural languages) and parameters (options that are set upon exposure to linguistic data, accounting for differences between languages). To use his analogy: Although growth would not occur without eating, it is not the food but the child’s inner nature that determines how growth will occur. Similarly, it is not linguistic data but the child’s biological endowment that determines how language is acquired.
Opposed to this nativist conception of knowledge and language is the kind of empiricism which denies innate rules for knowledge. Where Chomsky sees the mind as directed by inner principles that selectively use data as part of a fixed developmental program, empiricists see it as forming associations based on experience. Without rigid inner procedures to impose structure, experience plays a far more determining role. Chomsky criticizes this approach for treating the mind differently than we treat other bodily systems—all of which we assume have innate structures.
An approach focused solely on associations from observed data, Chomsky argues, would fail to discover the underlying principles from which certain outcomes—and only those outcomes—follow. He sees scientific inquiry as the pursuit of such principles and emphasizes abstraction and idealization, as with thought experiments:
More generally in the sciences, for millennia, conclusions have been reached by experiments–often thought experiments–each a radical abstraction from phenomena. Experiments are theory-driven, seeking to discard the innumerable irrelevant factors that enter into observed phenomena ... the basic distinction goes back to Aristotle’s distinction between possession of knowledge and use of knowledge. The former is the central object of study.
In machine learning, where associations dominate rather than rules, Chomsky sees echoes of the empiricist approach. He doubts that statistical techniques for finding patterns in data will yield explanatory insights. A notion of success that involves predictions but not explanatory insights, he remarks, has little precedent in science. He calls such predictions pseudoscience.
2. Two Distinct Claims
Chomsky makes two distinct claims that are worth separating.
The first concerns how the human mind differs from machine learning systems. “The human mind,” Chomsky says, “is a surprisingly efficient and even elegant system that operates with small amounts of information.” Referring to the innate language faculty “that limits the languages we can learn to those with a certain kind of almost mathematical elegance,” he points to a lack of similar constraints in machine learning. The same applies to knowledge: “humans are limited in the kinds of explanations we can rationally conjecture,” while “machine learning systems can learn both that the earth is flat and that the earth is round.”
This claim is unsurprising. The human mind—a part of the human organism shaped by evolution over countless generations—naturally differs from machine learning systems, which are non-biological creations developed using human methods and data. If Chomsky’s only point were that the human mind operates differently from machine learning systems, it would be a rather trivial observation.
But Chomsky appears to be making a second, more general claim about the nature of knowledge itself. He identifies the deepest flaw of machine learning programs as:
the absence of the most critical capacity of any intelligence: to say not only what is the case, what was the case and what will be the case—that’s description and prediction—but also what is not the case and what could and could not be the case. Those are the ingredients of explanation, the mark of true intelligence ... The crux of machine learning is description and prediction; it does not posit any causal mechanisms or physical laws.
To define knowledge this way, one must rely on a universal conception of what knowledge is—one that could apply equally to animals, humans, and artificial systems, free from human biases. With such a conception, we could distinguish what is essential to knowledge from what is merely instrumental. Chomsky, however, seems to be using a particular way of acquiring knowledge, with its unique constraints, to conclude that any different path would fail to yield genuine knowledge.
Consider an analogy with locomotion. There are different ways to move from place to place. For humans, walking is the natural method—enabled by a particular bodily architecture with particular constraints. A wheeled vehicle has a different architecture with different constraints. If we compared them, we would note their differences: one walks, the other rolls, and so they are not identical. Yet if we understand locomotion as a general concept—independent of the particular means through which it occurs—then both walking and rolling are genuine instances of it. We might legitimately ask which is more energy-efficient, but there would be no question of which represents “genuine” locomotion and which merely simulates it.
What should concern us, then, is the nature of knowledge itself. What are its core features that allow us to identify it regardless of how it is achieved? I will argue that associations and predictions are sufficient to capture the essential features of knowledge. Everything that aids in achieving them—whether biological endowments, algorithms, or epistemic techniques—holds only instrumental value.
3. Associations and Predictions
Associations, Platonic forms, causal relations, and physical laws are all instances of what we can call “propositions”. A proposition emerges from accounting for, or linking, data.
Consider a simple example. Suppose we traverse an area and develop an internal representation of its physical features. We have linked some material available to us—material that increased as we moved through the space. If we externalized this representation using pencil and paper, it would take the form of a map, where points are connected in particular ways. What we did internally is analogous to linking points on paper.
As Chomsky himself believes about cognitive operations, this process of developing propositions is independent of individual awareness.
The process is also independent of the ability to externalize a proposition. Our internal representation exists before we translate it into lines on paper and would exist even if we never externalized it. That an animal cannot create diagrams or utter sentences does not prevent it from discovering propositions.
Everything that aids in accounting for data contributes to one’s capacity for knowledge. Some contributions are internal (like memory); others are external (like the mechanical calculator or the scientific method). Since memory can deteriorate with age, and since we can be exposed to new inventions through cultural diffusion, our capacity for knowledge is subject to variation.
What is the material that we account for when discovering a proposition? If we define data as that which is accounted for in this process, what properties does it have? Understanding these properties will clarify how different types of propositions are possible and help address some of Chomsky’s concerns.
The material that forms propositions has properties paralleling those of physical objects. Just as a physical object consists of a quantity of material that can increase or decrease, the amount of data available to us fluctuates over time. Data, like physical materials, comes from diverse sources. The divisibility of a proposition lets us appreciate the granularity of its constituent data, much like dividing an object into smaller parts reveals its material composition. This granularity enables propositions to be about anything discoverable in data, just as different combinations of particles yield distinct objects.
Consider each of these properties in turn.
Accumulation. As the map example illustrates, data can become available to us after previously being unavailable. New data constantly accrues, even while we sleep; at each moment, what was once inaccessible becomes part of the totality of data available to us.
Diversity of sources. In traversing an area, we encounter one source of data: direct observation. Other sources might include photographs of the area or written descriptions. Some sources provide more value by allowing us to account for more data with less effort. Instead of physically traversing an area, we might rely on a photograph to form our mental map. To the extent the photograph is representative, the features in our proposition will align with any actual visit we make. Our ability to discriminate among different sources emerges through experience with data itself.
Granularity. Philosophers have long drawn an analogy between how particles combine to form physical objects and how simple ideas from experience combine to form complex ones. We can think of this in terms of divisibility: any part of a physical object can be distinguished as a constituent, which can have its own divisible parts. Similarly, a proposition contains other propositions presupposed within it.
Consider the proposition “whenever one ball strikes another, the other ball moves.” Within it, we find presupposed such propositions as “a ball is a round body,” “to move is to go from one place to another,” and even “a thing cannot both be and not be at the same time.” When we form a proposition, we implicitly account for all the data accounted for by the propositions it presupposes.
This granularity allows for the different kinds of propositions we discover. Some pertain to what we can observe: both what we have observed and what we have yet to observe. If we ask through what data we arrived at the ball proposition, we might say we had a series of experiences directly observing one ball strike another. But this is not the only path. Suppose we have never directly observed one ball strike another but have observed objects colliding, seen pictures of balls, and read about what happens when balls collide. We could form the same proposition without ever witnessing such an event directly.
Beyond propositions about what we can observe, we can also form propositions about what we cannot observe. Suppose we hear a sound from a tape recorder. We cannot observe the original source—it is in the past and not present before us. Yet by combining the data from the recording with what we already know, we can infer that the original speaker was a woman, that she has a quiet disposition, that she felt lonely the night before. We might have reached these same judgments had she been speaking in front of us.
We never find, in imagination’s ingredients or products, anything not present in some form within the data available to us. This helps explain how propositions can be about what does not exist. In nature we may observe lions and men, but a lion-man—a figure with a lion’s head and a man’s body—exists only in imagination. Yet this figure could never have occurred to us without first observing the elements we used to create it.
While propositions are discovered based on available data, the data a proposition accounts for can also include data yet to become available. When we account for certain currently available data, we treat multiple instances as examples of a single state of affairs. When we treat data yet to become available as further instances of that state of affairs, we form predictions. If these predictions agree with the data when it eventually becomes available, the proposition has successfully accounted for that data.
We can compare propositions if one accounts for at least all the data another accounts for. The proposition that accounts for more data is truer. Continuing our analogy with physical objects: truth is a gradable property, much like physical size. We can describe an object as large relative to a certain range, or larger than another object, or as having a specific length. Similarly with truth: we can describe a proposition as true relative to a range, truer than another proposition, or as accounting for specific data.
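To make this concrete, here is a minimal sketch of my own, not from the essay: propositions treated as predictive rules, with truth measured by the amount of data each rule accounts for, including data that becomes available only after the rules are formed. The observations, masses, and cutoff below are invented for illustration.

```python
# A minimal sketch (not from the essay): propositions as predictive rules,
# truth as the amount of data a rule accounts for. All observations,
# masses, and thresholds are invented for illustration.

# Observations: (mass of striking ball, mass of struck ball, did the struck ball move?)
available_data = [
    (1.0, 1.0, True),
    (1.0, 1.5, True),
    (1.0, 12.0, False),  # a much heavier struck ball barely moves
]

def proposition_a(striker_mass, struck_mass):
    """Whenever one ball strikes another, the other moves."""
    return True

def proposition_b(striker_mass, struck_mass):
    """The other ball moves, except when it is significantly heavier."""
    return struck_mass <= 5 * striker_mass  # "significantly heavier" is an invented cutoff

def data_accounted_for(proposition, data):
    """Count the observations whose outcome the proposition predicts correctly."""
    return sum(1 for striker, struck, moved in data
               if proposition(striker, struck) == moved)

# Data yet to become available: the rules' outputs for these cases are predictions.
newly_available_data = [(2.0, 1.0, True), (1.0, 20.0, False)]

for name, prop in (("A", proposition_a), ("B", proposition_b)):
    total = data_accounted_for(prop, available_data + newly_available_data)
    print(f"Proposition {name} accounts for {total} of "
          f"{len(available_data) + len(newly_available_data)} observations")
```

On this picture, proposition B is truer than A simply because it accounts for more of the data, including the data that arrived after both rules were formed.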
A problem in Plato’s Meno concerns the possibility of inquiry: How can we search for what we don’t know, since we wouldn’t recognize it if we found it? And why search for what we already know? This puzzle dissolves once we recognize that knowledge consists in accounting for data, that truth refers to the amount of data accounted for, and that data becoming available is simply one of the transformations occurring in the world.
From this account, we can see that animals, humans, and machine learning systems differ in their capacities for knowledge partly because of differences in their architectures. These architectures—more malleable and improvable in machine learning systems than in biological organisms—create tendencies to account for data in specific ways. They also constrain what kinds of data a system can access. A human and a bat exposed to the same environment for the same time are not accounting for the same data. Architectures also determine how external contributions can enhance a system’s capacity for knowledge: education increases a human’s capacity but not an animal’s.
Despite these differences, a universal constraint across all systems is the kind of data available for building propositions. Corrupt data misleading a machine learning system is not fundamentally different from a Cartesian deceiver misleading a human. The journey from error to knowledge is the discovery of truer propositions. In this process, unsuccessful predictions by a machine learning system are no different from those made by a human.
The ultimate reference for truth and falsehood is what occurs in the world. Propositions about what does not occur, whether possible or impossible, can only be true insofar as they point to what does occur. Using propositions about possibilities to arrive at propositions about actualities can be understood as an epistemic technique. Other, yet undiscovered techniques may exist. If it is possible to reach a true proposition without relying on the particular properties of the human system, then the constraints of that system are merely instrumental.
The endowments of biological entities can also be considered independently of knowledge—as pure manifestations of bodily processes, like the pumping of the heart. In that case, as Kant said of the senses, they would form no judgment, correct or incorrect. They would not be involved in accounting for data, which is the basis for truth and falsehood. The faculties of the human mind contribute to knowledge by enhancing a person’s capacity for it, as when the language faculty helps us read a textbook. We might compare the mind’s contribution to knowledge with the limbs’ contribution to locomotion.
This account also clarifies how predictions make the internal process of accounting for data concrete. When someone claims to understand something, we assess this by asking questions or presenting problems, then comparing their predictions against a benchmark. We treat past actions as records of previous predictions. Since the capacity for externalization is independent of the capacity for knowledge, it is possible to demonstrate knowledge without being able to articulate it.
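As a rough sketch of this assessment (my own illustration, with invented questions and reference answers), the scoring procedure is indifferent to what kind of system produced the predictions, and it never requires the predictor to articulate the proposition it relies on.

```python
# A rough sketch with invented questions and reference answers: scoring
# predictions against a benchmark, regardless of what kind of system made them.

benchmark = {
    "Does the struck ball move when the balls have equal mass?": "yes",
    "Does the struck ball move noticeably when it is far heavier?": "no",
}

def score(predictions: dict) -> float:
    """Fraction of benchmark items on which the predictions agree with the reference."""
    agreed = sum(1 for question, answer in benchmark.items()
                 if predictions.get(question) == answer)
    return agreed / len(benchmark)

# The same function applies to any predictor, articulate or not.
some_predictor = {
    "Does the struck ball move when the balls have equal mass?": "yes",
    "Does the struck ball move noticeably when it is far heavier?": "yes",
}
print(score(some_predictor))  # 0.5
```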
Predictions thus help us distinguish knowledge from its absence in any system—animal, human, or machine. We can raise legitimate questions about which system performs better on a specific metric, such as how much data is required to reach a particular prediction. But there is no meaningful question about which possesses “genuine” knowledge and which “merely simulates” it. Any such demarcation would be arbitrary, like saying a system achieves genuine locomotion only once it crosses a certain speed or efficiency threshold.
4. Explanations
Chomsky’s contention that knowledge requires something more than description and prediction echoes an argument from Socrates: even if we form beliefs that happen to be true, we need something more for genuine knowledge.
In Plato’s Theaetetus, Socrates illustrates this with the example of lawyers who persuade jurors about criminal acts. If jurors conclude that a defendant is innocent based on a compelling argument, we cannot say they possess knowledge of his innocence—even if he is indeed innocent—since we can imagine them reaching the opposite verdict had the lawyer argued otherwise. We can form all kinds of beliefs, some of which might be true by mere chance. Socrates suggests that a true belief must be fastened by explanation and made reliable to become knowledge.
What, then, is an explanation? In Aristotle’s Posterior Analytics, we find:
We suppose ourselves to possess unqualified scientific knowledge of a thing, as opposed to knowing it in the accidental way in which the sophist knows, whenever we think we are aware both that the explanation because of which the object is is its explanation, and that it is not possible for this to be otherwise.
An explanation, then, is an answer to the question of why a particular fact is the way it is and not otherwise. In an explanation, the fact to be explained is shown to be an instance of a general proposition—a causal relation or universal law. A particular ball moving after being struck can be explained by the general proposition “whenever one ball strikes another, it causes the other ball to move.”
Let us examine this example to understand causal relations. Our intuitive notion of causation suggests that the first ball acts on the second upon contact, causing it to move by necessity. However, as David Hume argued, what we actually discover in causal relations is merely a constant conjunction between pairs of events. The necessity we attribute to this connection—the sense that one event must follow the other—is not something we directly observe.
Strip away this obscure notion of necessity, which adds nothing to the relationships we identify despite any significance we attach to it, and what remains are associations as discovered by a system with certain capacities. More precisely, causal relations are propositions in which relations of precedence and succession between states of affairs are presupposed, with some propositions being truer than others.
In explaining why, when one ball strikes another, the other moves, we can also invoke a law of nature, such as “the total momentum of a system remains constant.” This law is a truer proposition than “whenever one ball strikes another, the other moves.” It accounts for all the data the simpler proposition does while also accounting for more. A law of nature, then, is simply a type of proposition distinguished by the large amount of data it accounts for.
If finding causal relations and laws of nature consists of identifying propositions of certain kinds, then explanation—where it is taken to demonstrate knowledge—involves having a sufficiently truer proposition than the one being explained. Such a truer proposition contains within it the limits that made the other proposition falser, just as an amended map contains the constraints of the older map. This is evident in the predictions we form based on the truer proposition.
There is, however, a risk when explanations are viewed as the goal of all inquiry. Because accounting for data is an internal process, our ignorance can become obscured in the pursuit of explanations. We may mistake a satisfying subjective state—the feeling of clarity or simplicity a particular explanation generates—for actual knowledge. Predictions help mitigate this risk, as demonstrated by the success of modern science, which offers predictions that pre-modern philosophy, despite its elaborate explanations, failed to offer. Anything involved in developing explanations is relevant to knowledge only insofar as it helps account for more data.
Once we understand how predictions are made, we can acknowledge that a machine learning system making predictions must have based its conclusions on something it discovered, even if it cannot externalize that discovery. While this presents its own risks—such as hindering the cumulative knowledge-building that externalization facilitates in humans—as a fact about the possession of knowledge, it resembles the case of a person who cannot articulate a proposition but demonstrates it through successful predictions. Each successful prediction reflects the data accounted for by a proposition, and it is on this basis that we can distinguish accidentally true beliefs from more reliable ones.
In the case of Socrates’ jurors, what yields knowledge rather than mere true belief is the identification of a truer proposition—one like “the accused is innocent because of such-and-such evidence, and the lawyer’s persuasiveness is independent of this,” rather than “the accused is innocent because otherwise the lawyer would not have been so convincing.”
Or consider our ball example: we might observe that one ball strikes another but the second does not move. We would then realize that any explanation that had satisfied us, whatever its other uses, had limits attributable to nothing else but the limited data accounted for by our proposition. This could prompt us to identify a truer proposition, such as “whenever one ball strikes another, the other moves, except when it is significantly heavier,” or eventually, “the total momentum of a system remains constant.” It is an observed consequence of a proposition being sufficiently true that it leads us to discover what would or would not occur under particular circumstances.
5. Conclusion
When we understand the nature of knowledge, we recognize that it is aided by a system’s endowments—such as those innate to humans—but not solely defined by them. We can imagine these endowments independently of knowledge, while also conceiving of entities that lack the same endowments yet remain capable of acquiring knowledge.