Of course, he is well-known for his paper with Marcus Hutter on providing a mathematical definition of universal general intelligence. I’m not sure if we’ve made a lot of progress since then at turning this highly theoretical notion into some sort of practical “AI IQ” though.
Personally, I would argue the already widely used cross-entropy loss for sequence prediction applied to datasets containing highly diverse types of data generated or collected by humans is a pretty darn good approximation. Much better than attempting to use IQ tests.
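To make the "cross-entropy as a proxy for intelligence" idea concrete, here's a minimal toy sketch (the vocabulary, distributions, and observed tokens are all made up for illustration):

```python
import math

def cross_entropy(probs, targets):
    """Average negative log-likelihood the model assigns to the observed
    next tokens -- lower means the model predicts the sequence better."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

# Toy example: the model's predicted distribution over a 3-token
# vocabulary at each step, and the token that actually occurred.
predicted = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
observed = [0, 1, 2]

loss = cross_entropy(predicted, observed)
```

A perfect predictor of the data drives this toward the entropy floor of the source; the claim above is that doing so across a sufficiently diverse human-generated dataset is itself a decent intelligence measure.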
The only problem with this approach is that AI can converge on higher intelligence in a lopsided fashion depending on how much weight is given to the different problem domains represented in the dataset; suppose our sequence predictor performs well on subsets of the training data that relate to photographs but not mathematical proofs.
For an optimal machine intelligence, the weights don’t really matter (it will perform as well as possible across all problem domains), but from the perspective of how we want to steer improvements to the sequence predictor, we need to specify these weights manually, otherwise they will be determined implicitly based on the number of samples in the dataset representing each problem domain.
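The implicit-vs-explicit weighting point can be sketched in a few lines. This is a hypothetical illustration with made-up domain names and numbers, not a real training setup:

```python
def weighted_total_loss(domain_losses, weights=None):
    """Combine per-domain average losses into one number.
    domain_losses: dict of domain -> (avg_loss, n_samples).
    If no explicit weights are given, each domain is weighted by its
    sample count -- which is what happens implicitly when you simply
    train on the pooled dataset."""
    if weights is None:
        total = sum(n for _, n in domain_losses.values())
        weights = {d: n / total for d, (_, n) in domain_losses.items()}
    return sum(weights[d] * loss for d, (loss, _) in domain_losses.items())

# Hypothetical numbers: photographs dominate the dataset 9:1.
losses = {"photos": (1.2, 900_000), "proofs": (3.5, 100_000)}

implicit = weighted_total_loss(losses)                              # photo-dominated
explicit = weighted_total_loss(losses, {"photos": 0.5, "proofs": 0.5})  # manual steer
```

With implicit weighting the high loss on proofs barely registers; an equal manual weighting makes it the dominant term, which is exactly the steering decision described above.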
I suppose the selection of these weights is an optimization problem in its own right, where if the eventual goal is minimizing total loss across all problem domains relevant to humans (i.e., not a random sample of distinct problem instances of a formal language), then the optimal selection of weights corresponds to that which leads to the fastest improvement in our development of sequence predictors. Highly weighting human language seems to be having outsized returns at the moment, but I imagine that more highly weighting problems that relate to abstract mathematics will lead to better returns in the future.
For example, most (if not all) IQ tests include a working-memory component: you're given a string of letters and numbers and have to repeat them back in some specified order. That is completely trivial for a machine, and would yield a skewed, near-maximum score.
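Just to underline how trivial that subtest is for a machine, a WAIS-style "backwards digit span" task is a one-liner (toy illustration):

```python
def digit_span_backwards(sequence):
    """Backwards digit-span task: repeat the presented items in reverse
    order. A machine gets perfect recall at any sequence length."""
    return list(reversed(sequence))

# A span of six mixed items -- already hard for most humans.
digit_span_backwards([7, "K", 3, "Q", 9, 1])
```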
Same with detecting differences. A typical task is to be shown two different pictures, and find the difference between those. Again, totally trivial task for a machine.
Or the vocabulary test. Quite trivial for language models.
The final IQ score will be some weighted and scaled score that consists of all those different parts. When I took the WAIS-IV, that's how it worked.
On the other hand, excluding those (trivial for machine) parts would give a score which may not mirror human intelligence, as far as scoring/testing goes.
Here's a wonderful paper (which I probably found here on HN) on the subject: "On the Measure of Intelligence" by Francois Chollet (working at Google), from 2019:
I am ambivalent on how accurate the test is for an LLM, but it's interesting nonetheless and can be used as a complementary metric for LLM capabilities.
Unlike Chatbot Arena leaderboard and standard benchmark datasets, visuospatial IQ tests are largely knowledge-free and focused on measuring pattern matching and reasoning capabilities.
I think this result is really cool, and it's another way to measure progress in AI capabilities. I don't think it says much about how "smart" AIs are in absolute terms, but it definitely has value in showing how far they're progressing.
How is an AI passing the visual reasoning questions?
edit:
> But if I translate the image to this (it’s tedious to read for us, who are used to processing such things visually):
If you translate the visual questions, they're no longer visual questions; wouldn't this massage the results? Especially given that AIs are really bad at context.
IQ tests are incredibly crude. They amalgamate a series of distinct measures of visual, verbal and mathematical reasoning - selected primarily for ease of measurement, rather than completeness of assessment.
My undergrad thesis involved developing a neurocognitive battery of tests (to explore cognitive deficits related to diabetes). The array of cognitive modules you can test (and that can be defective) is dizzying. You can score a couple of standard deviations over the median on the Stanford-Binet, yet be severely functionally impaired due to anterograde amnesia, unable to speak due to Broca's aphasia, or unable to find your way out of a room due to visual agnosia -- to give some concrete examples.
There's also no capacity (due to the pen-and-paper nature of IQ tests, and the culture-bound cognitive aspects they measure) to assess proprioceptive intelligence, musical ability, affective comprehension, explicit or implicit long-term memory, and numerous other aspects of fluid intelligence that concretely impact an individual's capacity to complete real-world 'intelligence'-dependent tasks.
Performance on IQ tests can be highly dependent on cultural approaches to test taking, and priming effects in IQ testing have withstood the replication crisis. The mathematical aspects of IQ tests are dependent on familiarity with the kinds of operations tested (and, unsurprisingly, scores increase with practice), and similarly literacy is required for and influences verbal scores. While learning math and acquiring the ability to read and write certainly do have an impact on cognitive function, it would be ludicrous to suggest that they don't depend on latent abilities IQ tests literally can't measure.
So while IQ tests certainly have their uses, to consider them a worthwhile measure of human cognitive capacity, let alone machine intelligence, is extremely dubious.
It wasn't a full-spectrum IQ test, but some random Mensa test that looks like the online 'test your IQ' nonsense. And the software was given precise written descriptions of the visual questions after performing too poorly otherwise. This completely removes the element of visual-spatial understanding that the questions are testing to begin with!
But I think there's also a more salient point. Correlated abilities mean different things for different groups. The ability to do simple math rapidly is generally going to correlate with intelligence for a human. That doesn't mean a calculator, which can do simple math millions of times faster than the fastest human, is thus millions of times more intelligent.
The Stanford-Binet IQ test was created to distinguish less intelligent kids from average ones, so they could develop in special school programs. It doesn't measure all of intelligence, just a part of it. I mean, one can go take it, get 140, and boast about it, but there are IQ-90 guys out there earning six figures, so a good score on this test does not mean one is great in every area of life.
Yes, and if you want to believe in something, you can always come up with a correct-sounding justification. In this case, it's the use of obviously formulaic IQ test formats, easily targeted for training, to justify claims that we have AI with intelligence equal to the average human's 100 points. This is pure BS to anyone who has not sold their soul to the AI bandwagon, for profit or otherwise.
The average IQ in the training data isn't 100. Higher IQ individuals likely write more online, higher quality datasets like Wikipedia are oversampled etc.
Holy crap, ChatGPT 3.5 fared terribly on that one. I'm usually skeptical of these kinds of tests and rather rely on the blind test at the leaderboard on Hugging Face, but this one was special in that its unique results still make "sense".
It looks like a particularly punishing test, but one that still adheres to the trend of LLM advances, so it's not completely BS either.
I actually agree on the test regarding the free Bing Copilot in Creative Mode vs Gemini Pro 1.0 (or called "Gemini (normal)" here). Copilot has been my favorite free way of getting near-GPT4 quality. It's clearly been better at coding for me than Gemini. I think these tables will turn soon though, with the coming public launch of Gemini Pro 1.5.