Of course, he is well-known for his paper with Marcus Hutter on providing a mathematical definition of universal general intelligence. I’m not sure if we’ve made a lot of progress since then at turning this highly theoretical notion into some sort of practical “AI IQ” though.
Personally, I would argue the already widely used cross-entropy loss for sequence prediction applied to datasets containing highly diverse types of data generated or collected by humans is a pretty darn good approximation. Much better than attempting to use IQ tests.
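To make the "cross-entropy as a proxy for intelligence" idea concrete, here's a minimal toy sketch (the vocabulary, distributions, and observed tokens are all made up for illustration):

```python
import math

def cross_entropy(probs, targets):
    """Average negative log-likelihood the model assigns to the observed
    next tokens -- lower means the model predicts the sequence better."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

# Toy example: the model's predicted distribution over a 3-token
# vocabulary at each step, and the token that actually occurred.
predicted = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
observed = [0, 1, 2]

loss = cross_entropy(predicted, observed)
```

A perfect predictor of the data drives this toward the entropy floor of the source; the claim above is that doing so across a sufficiently diverse human-generated dataset is itself a decent intelligence measure.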
The only problem with this approach is that AI can converge on higher intelligence in a lopsided fashion depending on how much weight is given to the different problem domains represented in the dataset; suppose our sequence predictor performs well on subsets of the training data that relate to photographs but not mathematical proofs.
For an optimal machine intelligence, the weights don’t really matter (it will perform as well as possible across all problem domains), but from the perspective of how we want to steer improvements to the sequence predictor, we need to specify these weights manually, otherwise they will be determined implicitly based on the number of samples in the dataset representing each problem domain.
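The implicit-vs-explicit weighting point can be sketched in a few lines. This is a hypothetical illustration with made-up domain names and numbers, not a real training setup:

```python
def weighted_total_loss(domain_losses, weights=None):
    """Combine per-domain average losses into one number.
    domain_losses: dict of domain -> (avg_loss, n_samples).
    If no explicit weights are given, each domain is weighted by its
    sample count -- which is what happens implicitly when you simply
    train on the pooled dataset."""
    if weights is None:
        total = sum(n for _, n in domain_losses.values())
        weights = {d: n / total for d, (_, n) in domain_losses.items()}
    return sum(weights[d] * loss for d, (loss, _) in domain_losses.items())

# Hypothetical numbers: photographs dominate the dataset 9:1.
losses = {"photos": (1.2, 900_000), "proofs": (3.5, 100_000)}

implicit = weighted_total_loss(losses)                              # photo-dominated
explicit = weighted_total_loss(losses, {"photos": 0.5, "proofs": 0.5})  # manual steer
```

With implicit weighting the high loss on proofs barely registers; an equal manual weighting makes it the dominant term, which is exactly the steering decision described above.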
I suppose the selection of these weights is an optimization problem in its own right, where if the eventual goal is minimizing total loss across all problem domains relevant to humans (i.e., not a random sample of distinct problem instances of a formal language), then the optimal selection of weights corresponds to that which leads to the fastest improvement in our development of sequence predictors. Highly weighting human language seems to be having outsized returns at the moment, but I imagine that more highly weighting problems that relate to abstract mathematics will lead to better returns in the future.
For example, most (if not all) IQ tests include a working-memory component: you're given a string of letters and numbers and have to repeat them back in some specified order. That is completely trivial for a machine, and would yield a skewed, near-maximum score.
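Just to underline how trivial that subtest is for a machine, a WAIS-style "backwards digit span" task is a one-liner (toy illustration):

```python
def digit_span_backwards(sequence):
    """Backwards digit-span task: repeat the presented items in reverse
    order. A machine gets perfect recall at any sequence length."""
    return list(reversed(sequence))

# A span of six mixed items -- already hard for most humans.
digit_span_backwards([7, "K", 3, "Q", 9, 1])
```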
Same with detecting differences. A typical task is to be shown two different pictures, and find the difference between those. Again, totally trivial task for a machine.
Or the vocabulary test. Quite trivial for language models.
The final IQ score will be some weighted and scaled score that consists of all those different parts. When I took the WAIS-IV, that's how it worked.
On the other hand, excluding those (trivial for machine) parts would give a score which may not mirror human intelligence, as far as scoring/testing goes.
Here's a wonderful paper (which I probably found here on HN) on the subject: "On the Measure of Intelligence" by Francois Chollet (working at Google), from 2019:
I am ambivalent on how accurate the test is for an LLM, but it's interesting nonetheless and can be used as a complementary metric for LLM capabilities.
Unlike Chatbot Arena leaderboard and standard benchmark datasets, visuospatial IQ tests are largely knowledge-free and focused on measuring pattern matching and reasoning capabilities.
I think this result is really cool, and it's another way to measure progress in AI capabilities. I don't think it says much about how "smart" AIs are in absolute terms, but it definitely has value in showing how far they're progressing.
How is an AI passing the visual reasoning questions?
edit:
> But if I translate the image to this (it’s tedious to read for us, who are used to processing such things visually):
If you translate the visual questions, they're no longer visual questions; wouldn't this massage the results? Especially given that AIs are really bad at context.
IQ tests are incredibly crude. They amalgamate a series of distinct measures of visual, verbal and mathematical reasoning - selected primarily for ease of measurement, rather than completeness of assessment.
My undergrad thesis involved developing a neurocognitive battery of tests (to explore cognitive deficits related to diabetes). The array of cognitive modules you can test (and that can be defective) is dizzying. You can score a couple of standard deviations over the median on the Stanford-Binet, yet be severely functionally impaired due to anterograde amnesia, unable to speak due to Broca's aphasia, or unable to find your way out of a room due to visual agnosia -- to give some concrete examples.
There's also no capacity (due to the pen-and-paper nature of IQ tests, and the culture-bound cognitive aspects they measure) to assess proprioceptive intelligence, musical ability, affective comprehension, explicit or implicit long-term memory, and numerous other aspects of fluid intelligence that concretely impact an individual's capacity to complete real-world 'intelligence'-dependent tasks.
Performance on IQ tests can be highly dependent on cultural approaches to test taking, and priming effects in IQ testing have withstood the replication crisis. The mathematical aspects of IQ tests are dependent on familiarity with the kinds of operations tested (and, unsurprisingly, scores increase with practice), and similarly literacy is required for and influences verbal scores. While learning math and acquiring the ability to read and write certainly do have an impact on cognitive function, it would be ludicrous to suggest that they don't depend on latent abilities IQ tests literally can't measure.
So while IQ tests certainly have their uses, to consider them a worthwhile measure of human cognitive capacity, let alone machine intelligence, is extremely dubious.
It wasn't a full-spectrum IQ test, but some random Mensa test that looks like the online 'test your IQ' nonsense. And the software was given precise written descriptions of the visual questions after performing too poorly otherwise. This completely removes the element of visual-spatial understanding that the questions are testing to begin with!
But I think there's also a more salient point. Correlated abilities mean different things for different groups. The ability to do simple math rapidly is generally going to correlate with intelligence for a human. That doesn't mean a calculator, which can do simple math millions of times faster than the fastest human, is thus millions of times more intelligent.
The Stanford-Binet IQ test was created to distinguish less intelligent kids from average ones, so they could develop in special school programs. It doesn't measure all of intelligence, just a part of it. I mean, one can go take it, get 140, and boast about it, but there are IQ-90 guys out there earning six figures, so a good score on this test does not mean one is great in every area of life.
Yes, and if you want to believe in something, you can always come up with a correct-sounding justification. In this case, it's the use of obviously formulaic IQ test formats, easily targeted for training, to justify claims that we have AI with intelligence equal to the average human's 100 points. This is pure BS to anyone who has not sold their soul to the AI bandwagon, for profit or otherwise.
The average IQ in the training data isn't 100. Higher IQ individuals likely write more online, higher quality datasets like Wikipedia are oversampled etc.
Holy crap, ChatGPT 3.5 fared terribly on that one. I'm usually skeptical of these kinds of tests and rather rely on the blind test at the leaderboard on Hugging Face, but this one was special in that its unique results still make "sense".
It looks like a particularly punishing test, but one that still adheres to the trend of LLM advances, so it's not completely BS either.
I actually agree on the test regarding the free Bing Copilot in Creative Mode vs Gemini Pro 1.0 (or called "Gemini (normal)" here). Copilot has been my favorite free way of getting near-GPT4 quality. It's clearly been better at coding for me than Gemini. I think these tables will turn soon though, with the coming public launch of Gemini Pro 1.5.