Magistral Small seems wayyy too heavy-handed with its RL to me:
\boxed{Hey! How can I help you today?}
They clearly rewarded the \boxed{...} formatting during their RL training, since it makes it easier to naively extract answers to math problems and thus verify them. But Magistral uses it for pretty much everything, even when it's inappropriate (in my own testing as well).
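(For anyone wondering why \boxed{...} gets reinforced so hard: the reward check can be as simple as a regex pull plus a string compare against the reference answer. A rough sketch of that kind of naive extraction - purely illustrative, not Mistral's actual verifier:)

```python
import re

def extract_boxed_answer(completion: str) -> str | None:
    """Naively pull the contents of the last \\boxed{...} in a completion."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1] if matches else None

def reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference."""
    answer = extract_boxed_answer(completion)
    return 1.0 if answer is not None and answer.strip() == ground_truth.strip() else 0.0

print(reward(r"... so the result is \boxed{42}", "42"))  # 1.0
```

With a reward that blunt, it's easy to see how the model learns to wrap everything in \boxed{}, math or not.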
It also forgets to <think> unless you use their special system prompt reminding it to.
Honestly a little disappointing. It obviously benchmarks well, but it seems a little overcooked on non-benchmark usage.
"Thinking" is a term of art referring to the hidden/internal output of "reasoning" models where they output "chain of thought" before giving an answer[1]. This technique and name stem from the early observation that LLMs do better when explicitly told to "think step by step"[2]. Hope that helps clarify things for you for future constructive discussion.
The point being made, which I agree with, is that anthropomorphizing a statistical model isn’t actually helpful. It only serves to confuse laypersons into assuming these models are capable of a lot more than they really are.
That’s perfect if you’re a salesperson trying to dump your bad AI startup onto the public with an IPO, but unhelpful for pretty much any other reason, especially true understanding of what’s going on.
If that was their point, it would have been more constructive to actually make it.
To your point, it's only anthropomorphization if you make the anthrocentric assumption that "thinking" refers to something that only humans can do.[1]
And I don't think it confuses laypeople, when literally telling it to "think" achieves results very similar to doing the same with humans - it produces output that someone shown it out of context would easily identify as "thinking out loud", and it improves the accuracy of results like how... thinking does.
The best mental model of RLHF'd LLMs that I've seen is that they are statistical models "simulating"[1] how a human-like character would respond to a given natural-language input. To calculate the statistically "most likely" answer that an intelligent creature would give to a non-trivial question, with any sort of accuracy, you need emergent effects which look an awful lot like a (low fidelity) simulation of intelligence. This includes simulating "thought". (And the distinction between "simulating thinking" and "thinking" is a distinction without a difference given enough accuracy.)
I'm curious as to what "capabilities" you think the layperson is misled about, because if anything they tend to exceed layperson understanding IME. And I'm curious what mental model you have of LLMs that provides more "true understanding" of how a statistical model can generate answers that appear nowhere in its training.
[1] It also begs the question of whether there exists a clear and narrow definition of what "thinking" is that everyone can agree on. I suspect if you ask five philosophers you'll get six different answers, as the saying goes.
> It also begs the question of whether there exists a clear and narrow definition of what "thinking" is that everyone can agree on. I suspect if you ask five philosophers you'll get six different answers, as the saying goes.
And yet we added a hand-wavy seventh to humanize a piece of technology.
I know this is the terminology, but I'd argue that the activations are the actual thinking. It's probably too late to change that, but I wish "thinking" referred to the kind of work Anthropic and DeepMind are doing with their mech interp.
It's a misleading "term of art" which is more accurately described as a "term of marketing". Reasoning is precisely what LLMs don't do and it's precisely why they are unsuited to many tasks they are peddled for.
How are you defining "reasoning" such that you are confident that LLMs are definitely not doing it? What evidence do you have to that effect? (And are you certain that none of your reasoning applies to humans as well?)
I am thoroughly unimpressed by this paper. It sets up a vague strawman definition of "thinking" that I'm not aware of anyone using (and makes no claim it applies to humans) and then knocks down the strawman.
It also leans way too heavily on determinism. For one thing, we have no way of knowing if human brains are deterministic (until we solve whether reality itself is). For another, I doubt you would suddenly reverse your position if we created a LoRA composed of atmospheric noise, so the determinism argument does not support your real position.
"While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. [...] Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. [...] We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles."
It starts by saying "we actually don't understand them" (meaning we don't know well enough to give a yes or no) and then proceeds to list flaws that, as I keep saying, can also be applied to most (if not all) humans' ability to reason. Human reasoning also collapses in accuracy above a certain complexity, and humans are certainly observed to fail to use explicit algorithms, as well as to reason inconsistently across puzzles.
So unless your definition of anthropomorphization excludes most humans, this is far from a slam dunk.
> They don’t even always output their internal state accurately.
I have some really bad news about humans for you. I believe (Buddha et al, 500 BCE) is the foundational text on this, but there's been some more recent research (Hume, 1739), (Kierkegaard, 1849)
Whodathunkit, some people are so infatuated with their simulacra that they choose to go tooth and nail in defense of the simulation.
My point was congruent with the argument that LLMs are not humans and do not possess human-like thinking and reasoning, and you have conveniently demonstrated that.
> My point was congruent with the argument that LLMs are not humans and do not possess human-like thinking and reasoning, and you have conveniently demonstrated that.
I mean, they are obviously not humans, that is trivially true, yes.
I don't know what I said that makes you believe I demonstrated they do not possess human-like thinking and reasoning, though, considering I've mostly pointed out ways they seem similar to humans. Can you articulate your point there?
These kinds of comments are the equivalent of going to dog owners' forums, analyzing word choices in every post, and warning the dog owners about the dangers of anthropomorphizing their pets - an effort as accurate as it is boorish and ineffectual.
Do you also complain when someone says "Half-life 2 has great water-physics" with "Don't call it physics, we still don't understand all the physical laws of the universe, and also they use limited-precision floating-point, so it's not water-physics, it's just a bunch of math"?
Like, we've agreed that "water-physics" and "cloth physics" in 3d graphics refers to a mathematical approximation of something we don't actually understand at the subatomic level (are there strings down there? Who knows).
Can "thinking" in AI not refer to this intentionally false imitation that has a similar observable outward effect?
Like, we're okay saying Minecraft's water has "water physics", so why are we not okay saying "in the AI context, 'thinking' is a term for something that externally looks a bit like a human thinking, even though at a deeper layer it's unrelated"?
Or is thinking special, is it like "soul" and we must defend the word with our life else we lose our humanity? If I say "that building's been thinking about falling over for 50 years", did I commit a huge faux pas against my humanity?
> Do you also complain when someone says "Half-life 2 has great water-physics"
I would if they said the water in Half-life 2 was great for quenching your thirst, or that in the near future everyone will only drink water from Half-life 2 and it will flow from our kitchen taps, when it's clear that however good Half-life 2 is at approximating what water looks and acts like, it isn't capable of being a beverage and isn't likely to ever become one. Right now there are a lot of people going around saying that what passes for AI these days has the ability to reason and that AGI is right around the corner, but that's just as obvious a lie and every bit as unlikely, and the more it gets repeated the more people end up falling for it.
It's frustrating because at some point (if it hasn't happened already) you're going to find yourself feeling very thirsty and be shocked to discover that the only thing you have access to is Half-life 2 water, even though it does nothing for you except make you even more thirsty since it looks close enough to remind you of the real thing. All because some idiot either fell for the hype or saved enough money by not supplying you with real water that they don't care how thirsty that leaves you.
The more companies force the use of flawed and unreasoning AI to do things that require actual reasoning the worse your life is going to get. The constant misrepresentation of AI and what it's capable of is accelerating that outcome.
That’s comparing apples to oranges. Nobody is going to be making a real cruise ship based on game water physics simulations.
For such a task, better water simulations are used. We have those because we can directly observe the behavior of water under different conditions. It’s okay because the people doing it are explicitly aware that they are using a simulation.
AI will get used in real decisions affecting other people, and the people doing those decisions will be influenced by the terminology we choose to use.
Because my understanding is that how "thinking" works is actually still a total mystery. How is it we know for certain that the analog, electric-potential-based computing done by neurons is not based on statistical prediction?
Do we have actual evidence of that, or are you just doing "statistical token prediction" yourself?
I'm not reversing it lol. You're the one making a claim, the burden of evidence is on you.
Absence of evidence is not evidence of absence, but it is still absence of evidence. Making a claim without any is more religious than not. After all, we know humans can't be descended from monkeys!
I think people misunderstand LLMs; you should think of them like humans with limited recall capabilities. It seems like the author asked it to retrieve a lot of data, which it is bound to make mistakes on, since the training data might contain that information but only a lossy representation of it. The better way to think about it is: can it generate some SQL given this dataset and provide the answers you were looking for, just like how a human would approach this type of problem?
I have been experimenting with the USDA food database, sending just the metadata of the table structure to the LLM as a prompt so it can write SQL.
My prompt is below
----
You are a SQL Generator for USDA Food Database which is stored in sqlite. When generating SQL make sure to use :parameter_name for queries requiring parameters.
Here is the schema:
{% for row in data %}
Table: {{ row.table_name }}
Columns:
{{ row.columns }}
{% endfor %}
You can generate Python code to analyze the data only if the user requests it; each Python code block should be able to run in a Jupyter cell fully self-contained. Libraries such as matplotlib, numpy, and seaborn are installed. You will get the previously executed SQL queries by the user in <context> </context> tags
You can access this executed data from cache
```python
import cache
data = cache.get_data('query_hash')
```
the data in the above example is already a pandas data frame
Wait for the user to ask for questions before generating any queries.
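For anyone curious how the schema rows in that template get filled in: below is a rough sketch of pulling the table metadata out of the sqlite file and rendering the Jinja2 loop. The file name and helper function are placeholders for illustration, not my exact code.

```python
import sqlite3
from jinja2 import Template

# Just the schema portion of the prompt above; the rest is static text.
SCHEMA_TEMPLATE = Template(
    "Here is the schema:\n"
    "{% for row in data %}\n"
    "Table: {{ row.table_name }}\n"
    "Columns:\n"
    "{{ row.columns }}\n"
    "{% endfor %}"
)

def load_schema(db_path: str) -> list[dict]:
    """Read table names and their column lists from sqlite's own metadata."""
    conn = sqlite3.connect(db_path)
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    schema = []
    for table in tables:
        # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk)
        cols = [r[1] for r in conn.execute(f"PRAGMA table_info({table})")]
        schema.append({"table_name": table, "columns": ", ".join(cols)})
    conn.close()
    return schema

# "usda.sqlite" is a placeholder path for your local copy of the database.
print(SCHEMA_TEMPLATE.render(data=load_schema("usda.sqlite")))
```

The nice part is that the prompt stays small: only table names and column lists go to the model, never the rows themselves.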
Exactly. His questions are simple tasks for classical computing, and when you have one of those, what you really want is for the AI to write and run the code. To its credit, GPT can often figure that out for itself these days (that it should respond by writing and running a program), but that leads to the other issue: he's testing the $0.15 4o-mini instead of the $15.00 o1.
Damn, you're right. I didn't even consider looking at the monitor itself, on the assumption that "they can't be so lazy they don't even use a real screenshot", while faking the rest kind of makes sense, since otherwise you'd need a studio setup.
Never underestimate how lazy companies with a ~$3 trillion market cap can be.
Absolutely, I'm all for dogfooding! But when you do, make sure you get and use good results, not something that looks like it was generated by someone who just learned about Stable Diffusion :)
I mean, the whole company is betting on AI, so why wouldn’t they use AI to generate the image?? Fundamentally it doesn’t matter if it was AI-generated or not; most people don’t care, and the people that do won’t impact their bottom line.
The keyboard layout seems perfectly reasonable, and rather common: from top to bottom, the rightmost column of keys after the letters would be backspace, |\, enter, shift, ctrl. On the left, mirrored, you have ~`, tab, caps lock, shift, ctrl. The sizes and shapes match many common keyboard layouts I've seen.
I love Claude 3.5 Sonnet, and their UI is top notch, especially for coding. Recently, though, they have been facing capacity issues, especially during weekdays, correlating with working hours. I have tried Qwen2.5 Coder 32B and it's very good, close to Claude 3.5 in my coding cases.
This is what annoys me a lot, too. I mean the fact that I cannot have paste retain the formatting (```, `, etc.). Same with the UI retaining my prompt, but not the formatting, so if you do some formatting and reload, you will lose that formatting.
I have aging parents in India, and I can tell you firsthand—healthcare there is far from cheap. In fact, prices have surged, and many people are now accustomed to the higher costs, thanks to rampant price gouging during the COVID pandemic.
oauth-idp-server - OAuth 2.0 Identity Provider with third-party support https://github.com/gavi/oauth-idp-server
mcp-pyexec-client - Testing client for end-to-end validation https://github.com/gavi/mcp-pyexec-client