Chomsky's one-paragraph quote at the beginning of this article is clearer and more thoughtful than the rest of it. I feel the author is missing the point.
In the case of language, observing and reporting statistical probabilities in written/spoken language output does very little to explain the cognitive systems used in acquiring and using language. Even one statistical anomaly serves to show that statistical learning is NOT the entire picture when it comes to language development.
There was another article on HN a while back that had another great quote from Chomsky that does well to illustrate what I feel is his main point here: "Fooling people into mistaking a submarine for a whale doesn't show that submarines really swim; nor does it fail to establish the fact". Creating a computer that can produce millions of grammatical utterances does little to show that we understand language systems. Now, if a computer could - like humans - learn to produce infinite, novel, contextual, and meaningful grammatical utterances, that's a different story. But that story will take a lot more than statistical learning to write.
Chomsky is just appealing to our own biases. We don't want to be statistical approximation machines, so that makes it easy to dismiss attempts to mimic us with statistical approximation machines.
However, the preponderance of evidence* so far suggests that we are just statistical processing machines, which is why Chomsky seems way off the mark.
*We know that various layers in the visual and auditory systems basically just compute ICA, and we know that the brain is incredibly plastic. Large areas can be removed and the remainder will compensate. That makes it seem likely that all neurons compute something like ICA (or at least something that degrades to ICA when confronted with visual or auditory input).
> Chomsky is just appealing to our own biases. We don't want to be statistical approximation machines
Do we? The intellectual cynicism that makes people publicly reduce humans and humanity to something mechanical and predictable is probably the most popular attitude I see online. This doesn't mean it's wrong in all cases, but it's surely not something exceptional.
The kind of intellectual cynicism you describe is often the prerogative and mode of those highly educated in the sciences, and therefore of a very small minority of humans generally, who otherwise, in my experience, do tend to think of themselves as agents free of statistical determination, as having wills and minds rather than being cogs of any sort.
Statistical learning as a field benefits humans most when it augments our actions. I think that instead of comparing humans and statistical algorithms to see who fares better, we should focus on how the two can blend together and help each other out. As the author points out, all the success stories are largely man-machine collaborations (imagine search engines without user data and inputs).
Thanks for the comment, brockf. I'm sorry the essay didn't make sense to you. Let me try again on a few points.
Are you saying that one statistical error in a probabilistic model makes the entire model wrong? Then you'd have to say that one logical error in a categorical model makes it equally wrong. And manifestly, there are many logical errors in all grammars. So I'm not sure what your point is here.
I'm interested to know: I quoted Chomsky: "That's a notion of [scientific] success that's very novel. I don't know of anything like it in the history of science." Do you agree with him? If so, do you judge all the Science and Cell articles as not being about accurately modeling the world and only about providing insight? Or do you think Chomsky meant something else by that?
I understand that there are two goals: accurately representing the world, and finding satisfactorily simple explanations. I think Chomsky has gone too far in ignoring the first, but I acknowledge that both are part of science. I further think that statistical/probabilistic models of language are better for both goals. This is obvious to me after working on the problem for 30 years, so maybe it is hard for me to explain why. I think Manning, Pereira, Abney, and Lappin/Shieber do a good job of it. Also, I don't see how a system that successfully learns language could be anything other than statistical and probabilistic. I agree it is a long way away ...
>I further think that statistical/probabilistic models of language are better for both goals.
Could you give some concrete examples? As a linguist, I don't see that statistical models are currently giving us much insight in those areas where current syntactic theory does give some insight. So for example, we don't seem to have learned much about relative clauses, ergativity, passivization, etc. etc. through these models. On the whole, statistical methods seem very much complementary to traditional syntactic theory. This seems to be Chomsky's view also:
"A quite separate question is whether various characterizations of the entities and processes of language, and steps in acquisition, might involve statistical analysis and procedural algorithms. That they do was taken for granted in the earliest work in generative grammar, for example, in my Logical Structure of Linguistic Theory (LSLT, Chomsky 1955). I assumed that identification of chunked word-like elements in phonologically analyzed strings was based on analysis of transitional probabilities — which, surprisingly, turns out to be false, as Thomas Gambell and Charles Yang discovered, unless a simple UG prosodic principle is presupposed. LSLT also proposed methods to assign chunked elements to categories, some with an information-theoretic flavor; hand calculations in that pre-computer age had suggestive results in very simple cases, but to my knowledge, the topic has not been further pursued."
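For readers unfamiliar with the transitional-probability idea mentioned in that quote, here is a toy sketch (not Gambell & Yang's actual experiment; the syllable stream and threshold are invented for illustration): word boundaries are posited wherever the probability of the next syllable given the current one dips, as in Saffran-style segmentation proposals.

```python
from collections import Counter

# Hypothetical syllable stream built from three made-up "words":
# pabiku, golatu, daropi, repeated in varying order.
stream = ("pa bi ku go la tu da ro pi go la tu "
          "pa bi ku da ro pi pa bi ku go la tu").split()

pair_counts = Counter(zip(stream, stream[1:]))
unigram_counts = Counter(stream[:-1])

def transitional_prob(a, b):
    """Estimate P(next syllable = b | current syllable = a) from the stream."""
    return pair_counts[(a, b)] / unigram_counts[a]

# Posit a word boundary wherever the transitional probability dips
# below an (arbitrary, invented) threshold.
boundaries = [
    i + 1
    for i, (a, b) in enumerate(zip(stream, stream[1:]))
    if transitional_prob(a, b) < 0.75
]
print(boundaries)  # boundaries fall exactly at the made-up word edges
```

On this toy stream, within-word transitions all have probability 1.0 and between-word transitions are lower, so the threshold recovers the word edges; Gambell and Yang's point, per the quote, is that on realistic child-directed speech this alone fails without a prosodic principle.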
Anyway, if you want to pursue this critique of Chomsky further, I'd recommend a bit more background reading. This article gives a fuller explanation of the views he was outlining at the conference: http://www.tilburguniversity.edu/research/institutes-and-res...
>Or do you think Chomsky meant something else by that?
He presumably means what he said, namely that merely creating accurate models of phenomena has never been the end goal of science. You acknowledge this yourself when you say that you take both modeling and explanation to be part of science.
What about the middle ground of structured probabilistic/statistical models? By introducing strong assumptions and prior information you create models that still have great flexibility, but have meaningful parameters which can be interpreted theoretically. These appear to me to solve both Chomsky's apparent non-interpretive model complaint and the technical problem of training a model with a large number of parameters.
On one end of the continuum, n-gram models for large n with infinite training data estimate the empirical distribution of language and thus are the best you can possibly do. On the other end, rule based grammars directly transcribe intelligible "rules" of language generation and comprehension. Both ends are clearly fraught with problems.
In the middle we have topic models, recursive grammars, decision trees, and various ad-hoc smoothing methods, each of which both allows for more tractable training and introduces more meaning to the parameters of the trained model.
I feel like the effort here provides (somewhat unsatisfactory) answers to both criticisms. I think it's fair to say, however, that probabilistic/statistical models deserve more attention in a lot of fields in order to overcome a history of neglect.
>In the case of language, observing and reporting statistical probabilities in written/spoken language output does very little to explain the cognitive systems used in acquiring and using language.
Unless, of course, those cognitive systems are nothing more than some statistical probabilistic mechanism. I don't know anything about the field, but the article was interesting to me in that it seemed to at least partly argue that. I know, for me at least, I'll frequently produce a sentence and then repeat it to myself a few times to see if it "sounds right." Now, I don't know what is happening to determine that, but perhaps I'm comparing it to some statistical probabilistic model I have in my head?
> Even one statistical anomaly serves to show that statistical learning is NOT the entire picture when it comes to language development.
1) Does it? Maybe it shows the specific statistical probabilistic model in question is wrong. Consider, as Chomsky did, a model which predicts zero probability for a novel sentence. Clearly, as you say, one anomalous novel sentence is all it takes to disprove such a model. But what about other models which can handle them? The "anomaly" may not be an anomaly anymore.
2) Do you have some anomaly in mind which shows statistical probabilistic models don't work?
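To make point (1) concrete, here is a toy contrast (corpus and words entirely invented): an unsmoothed bigram model assigns zero probability to any unseen word pair, while a Laplace-smoothed one does not, so a "novel sentence" is only an anomaly for the former.

```python
from collections import Counter
from itertools import chain

# Hypothetical toy corpus.
corpus = [["the", "dog", "ate", "my", "homework"],
          ["the", "cat", "ate", "my", "lunch"]]

vocab = set(chain.from_iterable(corpus))
bigrams = Counter(chain.from_iterable(zip(s, s[1:]) for s in corpus))
unigrams = Counter(chain.from_iterable(s[:-1] for s in corpus))

def p_unsmoothed(a, b):
    """Raw maximum-likelihood bigram estimate: zero for unseen pairs."""
    return bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0

def p_laplace(a, b):
    """Add-one (Laplace) smoothed estimate: never zero."""
    return (bigrams[(a, b)] + 1) / (unigrams[a] + len(vocab))

# "cat homework" never occurs in the corpus.
print(p_unsmoothed("cat", "homework"))      # 0.0
print(p_laplace("cat", "homework") > 0.0)   # True
```

The smoothed model is exactly the kind of "other model which can handle them" the comment describes: the unseen pair stops being a counterexample once the model stops assigning it probability zero.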
-----
The article was very interesting to me, but I don't know anything about the field. I guess my main question boils down to: Is it possible that language acquisition and production is nothing more inside our heads than a simple statistical probabilistic model?
"Now, I don't know what is happening to determine that, but perhaps I'm comparing it to some statistical probabilistic model I have in my head?"
I had a non-native Japanese teacher once who, when asked a question on proper Japanese usage, would often stop for a second, clearly playing the sentence or phrase over again in his head, and say "no, they don't really say that" or "yes, they do say it that way."
Clearly, he was using his extensive experience listening to Japanese over many years to determine grammaticality, so at least a statistical model, if not conclusively a probabilistic one.
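That intuition of "playing it back to see if it sounds right" can be cashed out as a toy scoring model: fit a smoothed bigram model to past linguistic experience and rank candidate sentences by log-probability. Everything below, corpus included, is a hypothetical illustration, not a claim about what brains actually compute.

```python
import math
from collections import Counter
from itertools import chain

# Made-up "experience": sentences the speaker has heard before.
heard = [["they", "say", "it", "that", "way"],
         ["they", "say", "it", "this", "way"],
         ["they", "do", "say", "that"]]

vocab = set(chain.from_iterable(heard))
bigrams = Counter(chain.from_iterable(zip(s, s[1:]) for s in heard))
unigrams = Counter(chain.from_iterable(s[:-1] for s in heard))

def log_score(sentence):
    """Add-one-smoothed bigram log-probability: higher means 'sounds righter'."""
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + len(vocab)))
        for a, b in zip(sentence, sentence[1:])
    )

familiar = ["they", "say", "it", "that", "way"]
odd = ["way", "that", "it", "say", "they"]
print(log_score(familiar) > log_score(odd))  # True
```

A familiar word order outscores the same words scrambled, which is at least a crude mechanical analogue of "no, they don't really say that."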
A simple statistical model is probably not the only thing human infants are using when they learn language. Linguists make a pretty good case that there must be some structure in place for infants to acquire language robustly, quickly, and with the kinds of noisy input (overheard speech) they have to work with.
It's not my field so I can't give examples off the top of my head, but the argument involves rapid acquisition of syntax and near-complete absence of errors that you'd expect to see in a simple statistical model.
Exactly. Almost everyone can identify recursive grammar (except people in a small South American tribe who speak a non-recursive language).
You don't need a raw Markov chain to assess the likelihood of "The DOG ate my homework", "My WASHING MACHINE ate my homework", and "MY LEGALLY ate my homework". You need P(WASHING MACHINE = NOUN), and P("My NOUN ate my homework").
But that's not right. You could also have P(WASHING MACHINE = NOUN THAT CAN EAT STUFF). Or maybe P(EAT = HUMOROUS TERM FOR DESTROYED), P(HUMOROUS SENTENCE), P(WASHING MACHINE = OBJECT THAT CAN DESTROY HOMEWORK).
Anyway, it's really bloody hard to put it all together. But that's what humans do. I'd imagine that we store it in our short term memory, then make a few quick parses of it, under varying assumptions, and keep the ones that are most consistent.
In reality, Chomsky is fighting the same sort of battle that happened when Newton and Leibniz were around (no, not the battle between Newton and ... the rest of the world, really). OK, you have gravity. But what causes it? Why? It's an interesting question, but not necessarily one that will lead anywhere.
In the case of "my washing machine ate my homework" and other non-standard expressions, many people will be confused by making the obvious associations. It's only when the new rules are explained to them that they come to understand what was meant.
The failure of a machine to understand a sentence given a set of rules may simply mean that it needs to be taught new rules.
If that was true, then why did humans evolve to speak at all? Why, if speech is simply a reaction to statistics we are tracking and behaviours that have been rewarded, would the first utterances have been made? And how do we make completely novel utterances that attempt to express our otherwise abstract thoughts?
> Why, if speech is simply a reaction to statistics we are tracking and behaviours that have been rewarded, would the first utterances have been made?
Why not? Look at it from the bottom up:
Communication is a fundament of life, from intra-cellular to inter-cellular to inter-organism interactions (another fundament is the ability to keep oneself in a low entropy state, at the expense of the rest of the world).
Human speech is an evolution of mammal communication. It grew in complexity, from grunts and other basic noises, along with our way of living, up to what we have now.
> And how do we make completely novel utterances that attempt to express our otherwise abstract thoughts?
Speech is a big collage. Anything new is either the result of
* a recombination of the sub-parts of past speech, or
* the definition of a new word in terms of older words, or sometimes arbitrarily (for proper nouns).
There's a big difference between "grunts and basic noises" and language. Or at least, that's my opinion. In this same line, I don't believe dogs/monkeys/birds/bees have language, despite the ability to communicate.
This view is just too simplistic to hold its weight when you really look at the intricacies of language and its evolutionary history, which, by the way, I would suggest comes from manual gesture and not grunting.
> There's a big difference between "grunts and basic noises" and language. Or at least, that's my opinion. In this same line, I don't believe dogs/monkeys/birds/bees have language, despite the ability to communicate.
> This view is just too simplistic to hold its weight when you really look at the intricacies of language and its evolutionary history, which, by the way, I would suggest comes from manual gesture and not grunting.
Mu![1]
But you're probably right about gestures.
Wild chimps have a vocabulary of about 66 signs. We can also observe tribes with languages more primitive than ours (no pronouns, for example). But there's a missing link of several million years of evolution between the two.
What are the (known) intricacies of the evolution of our ability to communicate?
There's no definitive proof for the statistical argument, but a growing amount of (neuro)scientific evidence points to it. What are your alternative hypotheses?
I think that most people who believe in some form of the motor theory of speech perception will also believe that speech evolved from manual gesture.
Others scoff at the motor theory. In fact, I'd say I'm in the minority by bringing it up with any regularity.
If the question is what is "known" about the evolution of our ability to communicate, I wouldn't have much to point you towards. Most of it is theory based on modern evidence, somewhat like armchair psychology. Other people point to our ability to integrate non-verbal gestures into our comprehension, activation of our motor cortex prior to semantic/phonetic network activation when disambiguating difficult speech sounds, our ability to synthesize visual/auditory sources of information when the visual information relates to speech gestures (mouth/tongue movements), etc.
> Why, if speech is simply a reaction to statistics we are tracking and behaviours that have been rewarded, would the first utterances have been made?
That criticism can be lobbed at all abilities that we claim came about due to evolution - which, to be clear, is all of them. The statistical model would be the mechanism, but it wouldn't be the reason why it evolved. That answer is relatively boring, and is the same one as all evolutionary processes: it appeared randomly from mutation, and it provided benefit to those that had it.
"That answer is relatively boring, and is the same one as all evolutionary processes: it appeared randomly from mutation, and it provided benefit to those that had it."
Not just boring, but a totally banal and useless answer.
What kinds of mutations? In what sequence? How did it provide a survival benefit? What were the earlier forms of language like, and how did they become the languages spoken today?
Just saying "evolution did it" is about as informative as saying "God did it."
Excellent questions! That I hope someone will investigate. But brockf seemed skeptical that it was even possible for there to be an evolutionary process that produced humans with a statistical-learning engine in their brains for language. Which I find curious, since - and this is my point - the same can be said for everything that is a result of evolutionary processes. That is, his complaint has nothing to do with language and statistical processes. The same complaint could be lobbed at eyes.
Just to clarify my position (as it is misunderstood above): I believe it is one of the most important factors in acquiring language. 100%. However, I personally believe that it's a domain-general tool exploited by a domain-specific language module adhering to evolved instincts in language acquisition.
And why can't that domain-general tool be some kind of statistical machine? I ask this because I don't see why what you said is incompatible with it - in fact, I agree with what you said - but I suspect that the mechanism is probably statistical in nature.
There's no doubt that Noam Chomsky founded a paradigm of academic activity. Linguists can generate an unlimited number of papers and monographs by finding problems and proposing intellectually convincing solutions.
From an engineering standpoint, however, Chomsky's view of grammar has been remarkably barren when it comes to machine processing of natural language. It's made a major contribution to artificial languages but despite a lot of effort it hasn't added much performance to what can be done with statistical methods.
I'd agree that a hidden Markov model that does POS tagging with high accuracy doesn't provide an intellectually satisfying model for "how language works", but you don't need to have a model for "how language works" in order to use it.
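For concreteness, the kind of HMM POS tagger mentioned above can be sketched in a few lines: hidden states are tags, observations are words, and Viterbi decoding recovers the most probable tag sequence. The tag set, vocabulary, and all probabilities below are invented for illustration.

```python
# Minimal Viterbi decoder for a toy HMM part-of-speech tagger.
tags = ["DET", "NOUN", "VERB"]
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {"DET": {"DET": 0.01, "NOUN": 0.9, "VERB": 0.09},
         "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
         "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1}}
emit = {"DET": {"the": 0.9, "dog": 0.0, "barks": 0.0},
        "NOUN": {"the": 0.0, "dog": 0.8, "barks": 0.2},
        "VERB": {"the": 0.0, "dog": 0.1, "barks": 0.9}}

def viterbi(words):
    # v[t] maps tag -> (best probability of any path ending in tag, backpointer)
    v = [{t: (start[t] * emit[t][words[0]], None) for t in tags}]
    for w in words[1:]:
        v.append({
            t: max(
                (v[-1][prev][0] * trans[prev][t] * emit[t][w], prev)
                for prev in tags
            )
            for t in tags
        })
    # Trace the best path back through the stored backpointers.
    best = max(tags, key=lambda t: v[-1][t][0])
    path = [best]
    for layer in reversed(v[1:]):
        path.append(layer[path[-1]][1])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```

Note that nothing in the model "knows" what a determiner is; it just multiplies frequencies, which is exactly the sense in which it works without being a model of "how language works".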
I feel there is excessive emphasis on "what Chomsky said" and "what Chomsky did". Norvig chooses to point out that the principles and parameters framework is, let's say, imperfect, but, well, Chomsky would agree. Moreover, if you zoom out and stop obsessing about quotes from "Syntactic Structures", you will realize that a lot of the work that's being done in theoretical linguistics is not quite as barren. Yes, statistical methods for (say) anaphora resolution can be extremely efficient, but basically very few people had thought about anaphoric relations in any systematic way before generative linguistics came around.
Moreover, rule-based NLP approaches also have their place, and they are often the direct result of theoretical advances. A case in point is the modelling of morphophonology (which is necessary for spell checking, dictionaries and text generation for morphologically complex languages): many successful approaches are those based on finite-state machines, which could not have happened without Johnson and later Koskenniemi using them to formalize the rule-based approach pioneered by Halle and (yes) Chomsky (well, not quite, but this is still the point of reference for rule-based phonology).
(I am a theoretical phonologist, but my colleagues who do actual NLP work of this type tell me that statistical methods aren't that great for the sort of work they do.)
There is no guarantee that the Kolmogorov complexity of any interesting system will fit inside the rational parts of our heads. There pretty much is a guarantee that we will not be able to fully understand our own brains using our brains; the part of our brain that can understand things is just dwarfed by the size of the rest of it. (We really do quite a lot with not very many free neurons.) Even if there is a generative theory that can explain human speech in fewer bits than a direct lookup table, there's no guarantee we can find it, and the null hypothesis must be that we won't, because there is no such theory.
We should look for it, but we should not expect to find it.
> I'd agree that a hidden Markov model that does POS tagging with high accuracy doesn't provide an intellectually satisfying model for "how language works", but you don't need to have a model for "how language works" in order to use it.
I'd agree that an adhoc equation that fits observed core sample data doesn't provide an intellectually satisfying model for "how sedimentation works", but you don't need to have a model for "how sedimentation works" in order to use it.
That's actually from something I worked on a long time ago. People actually use adhoc models of sedimentation and the formation of sedimentary rock for practical purposes. I also think most people suspect that we'd learn something valuable by figuring out the underlying reason why the data fits the particular description.
I think Norvig acknowledges the point you are making here, namely that the statistical approach does not explain the cognitive systems behind language. However (if I understand correctly) he implies that those systems might be too complex to be adequately explained, let alone emulated and we can achieve more by observing them as black boxes, analyzing their outputs, i.e. language as it is used.
"if a computer could - like humans - learn to produce infinite, novel, contextual, and meaningful grammatical utterances"
To perfectly achieve this goal, you might have to simulate 4 billion years of evolution under the same conditions as it happened on Earth, and a few thousand years of cultural evolution as it led to our languages and our cultural context. Language is incredibly complex and changing, many of its details might be incidental, i.e. results of random events, so it seems unreasonable to pretend that we can deduce it all from some elegant first principles. At least that is my reading of Norvig's argument.
> I think Norvig acknowledges the point you are making here, namely that the statistical approach does not explain the cognitive systems behind language.
If that is the case, then the argument that Norvig is making is irrelevant to the argument Chomsky is making. Chomsky simply makes the point that statistical accounts lack explanatory adequacy. As someone who has worked closely with many of his students and who has received extensive training on his scientific program, I can say with confidence that Chomsky would have no objection whatsoever about the usefulness of statistical approaches to linguistic engineering problems. The results speak for themselves. He would go on to say, however, that how well a statistical approach solves a linguistic engineering problem is irrelevant to the question of how humans do what they do.
The answer to the question may well be statistically grounded. That is a valid hypothesis and a logical possibility which should be taken seriously. However, it is incumbent on the proponents of such an answer to provide evidence that it is what humans are doing. Here are some examples of the kinds of evidence necessary:
* evidence that humans are capable of performing the kinds of computations that the statistical approach requires,
* evidence that the statistical approach works with the relatively limited amount of data that a human receives,
* evidence that the statistical approach fails in ways that humans fail
How well a statistical approach succeeds at an engineering task is not an item on this list, simply, again, because engineering tasks are irrelevant to what humans actually do.
Let me specifically say that statistical approaches are not, from the start, ruled out as potential candidates for the algorithms underlying human language. It's just that a case has to be made for them using the right kind of evidence.
Finally, I'll reiterate what others have pointed out: from a scientific perspective, that something is hard to explain doesn't mean that we shouldn't try. And, those that have given up (as you suggest Norvig has) shouldn't fault those who haven't for calling them out on it.
In situations like this, I tend to speak in theoretical absolutes. A computer that "could - like humans - learn to produce infinite, novel, contextual, and meaningful grammatical utterances" isn't even on the timeline right now, but it's the theoretical goal in showing that we understand language acquisition (ontogenetic development), evolution (phylogenetic development), and production.
Just because that goal seems unattainable doesn't, to me, mean that we need to aim any lower. Now, this is premised on my belief that mimicking phenomena with statistical learning is not as intellectually satisfying as understanding the underlying cognitive systems, but that's not believed by everyone.
I agree that learning to produce and interpret varied utterances is a worthy goal, but the fact is that (far off as it still is today) lowly statistical methods have gotten us closer to this goal than the other, Chomskyan approach. It could be a situation where aiming lower lets you shoot higher.
This is a fundamental misunderstanding of what modern generative linguistics is all about (to be fair, it is extremely widespread). The aim of this branch of science is expressly not to "learn to produce and interpret varied utterances" (called E-language in the jargon), but to understand the cognitive processes behind the production and interpretation of utterances (called I-language). Now you may agree or disagree with the methods and assumptions used in the pursuit of this goal, but it is patently unfair to accuse the field of failing to do something it never set out to do.
You provide no evidence for the last statement: "that story will take a lot more than statistical learning to write."
The existing evidence overwhelmingly suggests that a computer that can "learn to produce infinite, novel, contextual, and meaningful grammatical utterances" will be based on probabilistic models. In fact it's hard to imagine how it could possibly be otherwise.
The computer is observing noisy sensory input and is trying to make inferences about how to communicate with some future reader. Mathematically, there is only one way to write this problem: probability. It's true that the learned model may have amazing structure to it, but this will almost certainly be learned via probabilistic models rather than being hand-coded by some future Chomsky.
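The "only one way to write this problem" claim is essentially Bayes' rule: form a posterior over intended messages given a noisy observation. A toy sketch, with the prior and the noise model entirely made up:

```python
# Toy Bayesian inference: which intended word best explains a noisy
# acoustic observation? Priors and likelihoods are invented for illustration.
prior = {"whale": 0.7, "wail": 0.3}          # P(intended word)
likelihood = {"whale": 0.4, "wail": 0.6}     # P(noisy observation | word)

evidence = sum(prior[w] * likelihood[w] for w in prior)
posterior = {w: prior[w] * likelihood[w] / evidence for w in prior}

# The prior pulls the inference toward "whale" despite the weaker likelihood.
print(posterior["whale"] > posterior["wail"])  # True
```

Any learned structure (grammar rules, word classes, and so on) would live inside the prior and the likelihood; the point of the comment is that the inference wrapping them is probabilistic either way.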
That fact does not imply we will understand language systems or the human mind. The Chomsky route may be better suited for that task.
Where is the existing evidence? And what evidence in modern science can possibly look into the future and make a prediction about something that right now is so far beyond our grasp?
Statistical learning explains a lot. I'm a huge fan of it. Skinner's behaviourism also explained a lot in psychology. But, just as in the case of behaviourism, I fear that statistical learning has hit, or will hit, a wall at which its explanations become futile and overly simplified.
My personal belief is that, at that wall, we'll see that human language instincts and evolved language-specific mechanisms will be what we are looking for.
Consider catching a ball. We know how to design a robot that will catch a ball: it will be the hardware for moving an "arm" and a "hand" for the catching, as well as computer hardware and some software for the logic. The software will solve differential equations in order to predict where the ball will be, and when to move the "arm" and "hand" to the correct spot in order to catch it.
No one, as far as I know, argues that humans actually solve differential equations in their head when they catch a ball. They just... catch it. Perhaps with some failed attempts along the way, but as a part of growing up, we learned basic eye-hand coordination.
The notion that syntax and grammar as we have formalized them exist in our brains is the same as saying that differential equations exist in our brains. I find it much more likely that we innately have rough models for syntax, grammar, mechanical movement, and object trajectories, but that it takes significant trial and error for us to tune those models to the point of competence. I think these models have to be at least partly statistical - otherwise, we wouldn't need to learn anything - and while our formalisms may be nice approximations of what we do in our brains, I see no reason why they have to be exactly it.
By "actually" I meant solving them in the same way that you and I solve them: analytically, using our formalisms. Rather, I'm proposing that our brains are using some statistical model that gives results pretty damn close to what the analytical answers would be. And that something similar is true for syntax and grammar.
I don't think Norvig was arguing Chomsky was completely wrong about what he said, more that statistical models are a hell of a lot more important than Chomsky implies.
Looking at the statistics and evidence is of great importance in trying to form models and answers to the "why" questions. Although mimicking a bee dance may not mean we understand it, it does provide a basis for founding and comparing theories.
When is the pretending so good that it ceases to be pretending? How much mocking does the mockingbird need to partake in before it is the creating mockingbird?
What it didn't seem like Norvig got was the difference between understanding and a highly sophisticated pretender. Gut-level vs. self-aware intelligence. Both are valid forms of intelligence, but only one is a valid form of understanding.
I think statistical methods are a form of intelligence that is highly mechanical and could never achieve human-level cognition (e.g. fart jokes). But I could be wrong; usually am, more than half the time.
This [Chomsky, not you] is just a warmed-over restatement of Searle's Chinese Room argument against AI. And it's a bullshit argument, for a reason I can state in two words: Turing test.
Who's to say that the human brain doesn't learn by statistical analysis?
The human brain collects data, forms hypotheses and tests them. Just like a statistical machine.
just because you only understand one side of an argument doesn't mean everyone else is an idiot.
in some sense this is the same argument as searle's chinese room. that's sufficiently well known and debated that it's fair to say that neither side can be dismissed as simply "missing the point".
Throughout history, there have been many "well known and debated" arguments that have proven to be idiotic. See: the shape of the Earth.
This isn't one of them, though. And I didn't mean to imply that anyone was an idiot.
Instead of critiquing how I said something, or that I said something at all, can you tell me what I am missing? I'm obviously missing something - Norvig is no idiot.
the problem is whether or not there is any way to "ground" meaning. for physics, the "unreasonable effectiveness of mathematics" might suggest that there are simple "meanings" that underly physical "laws".
but there's nothing to say that the same is true for intelligence or language. maybe the brain is nothing more than a particularly flexible "neural net", which statistical methods are modelling quite well. in that case, "intelligence" is not qualitatively different from "a good simulation of intelligence".
the same problem occurs in free will - does it "really" exist? if we're just (mechanical, predictable, although highly complex) machines then it is difficult to imagine how it can. yet the intuition is that there is clearly some meaning to the idea of a "free agent".
these are hard questions. people don't know the answers. instead we look to what daniel dennett calls "intuition pumps" (see his book "elbow room" on the free will problem) - simple parallels that "feel right". from those, we use intuition to argue in one direction or another. but the problem with that approach is that it depends on what you choose as a "hint".
some advance is being made through experiment. imaging of neural activity in the brain, for example, or the recent discovery that people who believe they have free will behave differently to those that don't.
i can't find the free will behaviour result that was in the news about a week ago. but that's perhaps less related to this anyway.
and i doubt that everyone who thought the earth was flat was idiotic, frankly. just because something is obvious now doesn't mean it was a stupid question, or easy to answer, when first raised.