Magistral Small seems wayyy too heavy-handed with its RL to me:
\boxed{Hey! How can I help you today?}
They clearly rewarded the \boxed{...} formatting during their RL training, since it makes it easier to naively extract answers to math problems and thus verify them. But Magistral uses it for pretty much everything, even when it's inappropriate (in my own testing as well).
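(For anyone wondering why \boxed{...} gets reinforced so hard: the reward check can be as simple as a regex pull plus a string compare against the reference answer. A rough sketch of that kind of naive extraction - purely illustrative, not Mistral's actual verifier:)

```python
import re

def extract_boxed_answer(completion: str) -> str | None:
    """Naively pull the contents of the last \\boxed{...} in a completion."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1] if matches else None

def reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference."""
    answer = extract_boxed_answer(completion)
    return 1.0 if answer is not None and answer.strip() == ground_truth.strip() else 0.0

print(reward(r"... so the result is \boxed{42}", "42"))  # 1.0
```

With a reward that blunt, it's easy to see how the model learns to wrap everything in \boxed{}, math or not.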
It also forgets to <think> unless you use their special system prompt reminding it to.
Honestly a little disappointing. It obviously benchmarks well, but it seems a little overcooked on non-benchmark usage.
"Thinking" is a term of art referring to the hidden/internal output of "reasoning" models where they output "chain of thought" before giving an answer[1]. This technique and name stem from the early observation that LLMs do better when explicitly told to "think step by step"[2]. Hope that helps clarify things for you for future constructive discussion.
The point being made, which I agree with, is that anthropomorphizing a statistical model isn’t actually helpful. It only serves to confuse laypersons into assuming these models are capable of a lot more than they really are.
That’s perfect if you’re a salesperson trying to dump your bad AI startup onto the public with an IPO, but unhelpful for pretty much any other reason, especially true understanding of what’s going on.
If that was their point, it would have been more constructive to actually make it.
To your point, it's only anthropomorphization if you make the anthrocentric assumption that "thinking" refers to something that only humans can do.[1]
And I don't think it confuses laypeople, when literally telling it to "think" achieves results very similar to doing the same with humans - it produces output that someone shown it out of context would easily identify as "thinking out loud", and it improves the accuracy of results like how... thinking does.
The best mental model of RLHF'd LLMs that I've seen is that they are statistical models "simulating"[1] how a human-like character would respond to a given natural-language input. To calculate the statistically "most likely" answer that an intelligent creature would give to a non-trivial question, with any sort of accuracy, you need emergent effects which look an awful lot like a (low fidelity) simulation of intelligence. This includes simulating "thought". (And the distinction between "simulating thinking" and "thinking" is a distinction without a difference given enough accuracy.)
I'm curious as to what "capabilities" you think the layperson is misled about, because if anything they tend to exceed layperson understanding IME. And I'm curious what mental model you have of LLMs that provides more "true understanding" of how a statistical model can generate answers that appear nowhere in its training.
[1] It also begs the question of whether there exists a clear and narrow definition of what "thinking" is that everyone can agree on. I suspect if you ask five philosophers you'll get six different answers, as the saying goes.
> It also begs the question of whether there exists a clear and narrow definition of what "thinking" is that everyone can agree on. I suspect if you ask five philosophers you'll get six different answers, as the saying goes.
And yet we added a hand-wavy seventh to humanize a piece of technology.
I know this is the terminology, but I'd argue that the activations are the actual thinking. It's probably too late to change that, but I wish "thinking" referred to the kind of work Anthropic and DeepMind are doing with their mech interp.
It's a misleading "term of art" which is more accurately described as a "term of marketing". Reasoning is precisely what LLMs don't do and it's precisely why they are unsuited to many tasks they are peddled for.
How are you defining "reasoning" such that you are confident that LLMs are definitely not doing it? What evidence do you have to that effect? (And are you certain that none of your reasoning applies to humans as well?)
I am thoroughly unimpressed by this paper. It sets up a vague strawman definition of "thinking" that I'm not aware of anyone using (and makes no claim it applies to humans) and then knocks down the strawman.
It also leans way too heavily on determinism. For one thing, we have no way of knowing if human brains are deterministic (until we solve whether reality itself is). For another, I doubt you would suddenly reverse your position if we created a LoRA composed of atmospheric noise, so the determinism argument does not support your real position.
"While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. [...] Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. [...] We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles."
It starts by saying "we actually don't understand them" (meaning we don't know well enough to give a yes or no) and then proceeds to list flaws that, as I keep saying, can also be applied to most (if not all) humans' ability to reason. Human reasoning also collapses in accuracy above a certain complexity, and humans are certainly observed to fail to use explicit algorithms, as well as to reason inconsistently across puzzles.
So unless your definition of anthropomorphization excludes most humans, this is far from a slam dunk.
> They don’t even always output their internal state accurately.
I have some really bad news about humans for you. I believe (Buddha et al, 500 BCE) is the foundational text on this, but there's been some more recent research (Hume, 1739), (Kierkegaard, 1849)
Whodathunkit, some people are so infatuated with their simulacra that they choose to go tooth and nail in defense of the simulation.
My point was congruent with the argument that LLMs are not humans and do not possess human-like thinking and reasoning, and you have conveniently demonstrated that.
> My point was congruent with the argument that LLMs are not humans and do not possess human-like thinking and reasoning, and you have conveniently demonstrated that.
I mean, they are obviously not humans, that is trivially true, yes.
I don't know what I said that makes you believe I demonstrated they do not possess human-like thinking and reasoning, though, considering I've mostly pointed out ways they seem similar to humans. Can you articulate your point there?
These kinds of comments are the equivalent of going to dog owners' forums, analyzing word choices in every post, and warning the dog owners about the dangers of anthropomorphizing their pets - an effort as accurate as it is boorish and ineffectual.
Do you also complain when someone says "Half-life 2 has great water-physics" with "Don't call it physics, we still don't understand all the physical laws of the universe, and also they use limited-precision floating-point, so it's not water-physics, it's just a bunch of math"?
Like, we've agreed that "water-physics" and "cloth physics" in 3d graphics refers to a mathematical approximation of something we don't actually understand at the subatomic level (are there strings down there? Who knows).
Can "thinking" in AI not refer to this intentionally false imitation that has a similar observable outward effect?
Like, we're okay saying Minecraft's water has "water physics", so why are we not okay saying "in the AI context, 'thinking' is a term for something that externally looks a bit like a human thinking, even though at a deeper layer it's unrelated"?
Or is thinking special, is it like "soul" and we must defend the word with our life else we lose our humanity? If I say "that building's been thinking about falling over for 50 years", did I commit a huge faux pas against my humanity?
> Do you also complain when someone says "Half-life 2 has great water-physics"
I would if they said the water in Half-life 2 was great for quenching your thirst, or that in the near future everyone will only drink water from Half-life 2 and it will flow from our kitchen taps, when it's clear that however good Half-life 2 is at approximating what water looks and acts like, it isn't capable of being a beverage and isn't likely to ever become one. Right now there are a lot of people going around saying that what passes for AI these days has the ability to reason and that AGI is right around the corner, but that's just as obvious a lie and every bit as unlikely, and the more it gets repeated the more people end up falling for it.
It's frustrating because at some point (if it hasn't happened already) you're going to find yourself feeling very thirsty and be shocked to discover that the only thing you have access to is Half-life 2 water, even though it does nothing for you except make you even more thirsty since it looks close enough to remind you of the real thing. All because some idiot either fell for the hype or saved enough money by not supplying you with real water that they don't care how thirsty that leaves you.
The more companies force the use of flawed and unreasoning AI to do things that require actual reasoning the worse your life is going to get. The constant misrepresentation of AI and what it's capable of is accelerating that outcome.
That’s comparing apples to oranges. Nobody is going to be making a real cruise ship based on game water physics simulations.
For such a task, better water simulations are used. We have those because we can directly observe the behavior of water under different conditions. It’s okay because the people doing it are explicitly aware that they are using a simulation.
AI will get used in real decisions affecting other people, and the people doing those decisions will be influenced by the terminology we choose to use.
Because my understanding is that how "thinking" works is actually still a total mystery. How is it we know for certain that the analog, electric-potential-based computing done by neurons is not based on statistical prediction?
Do we have actual evidence of that, or are you just doing "statistical token prediction" yourself?
I'm not reversing it lol. You're the one making a claim, the burden of evidence is on you.
Absence of evidence is not evidence of absence, but it is still absence of evidence. Making a claim without any is more religious than not. After all, we know humans can't be descended from monkeys!
I think people misunderstand LLMs; you should think of them like humans with limited recall capabilities. It seems like the author asked it to retrieve a lot of data, which it is bound to make mistakes on, since the training data might contain that information but only a lossy representation of it. The better way to think about it is: can it generate some SQL given this dataset and provide the answers you were looking for, just like how a human would approach this type of problem?
I have been experimenting with the USDA food database, sending just the metadata of the table structure to the LLM as a prompt so it can write SQL.
My prompt is below
----
You are a SQL Generator for USDA Food Database which is stored in sqlite. When generating SQL make sure to use :parameter_name for queries requiring parameters.
Here is the schema:
{% for row in data %}
Table: {{ row.table_name }}
Columns:
{{ row.columns }}
{% endfor %}
You can generate Python code to analyze the data only if the user requests it; each Python code block should be able to run in a Jupyter cell fully self-contained. Libraries such as matplotlib, numpy, and seaborn are installed. You will get the previously executed SQL queries by the user in <context> </context> tags
You can access this executed data from cache
```python
import cache
data = cache.get_data('query_hash')
```
the data in the above example is already a pandas data frame
Wait for the user to ask for questions before generating any queries.
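For anyone curious how the schema rows in that template get filled in: below is a rough sketch of pulling the table metadata out of the sqlite file and rendering the Jinja2 loop. The file name and helper function are placeholders for illustration, not my exact code.

```python
import sqlite3
from jinja2 import Template

# Just the schema portion of the prompt above; the rest is static text.
SCHEMA_TEMPLATE = Template(
    "Here is the schema:\n"
    "{% for row in data %}\n"
    "Table: {{ row.table_name }}\n"
    "Columns:\n"
    "{{ row.columns }}\n"
    "{% endfor %}"
)

def load_schema(db_path: str) -> list[dict]:
    """Read table names and their column lists from sqlite's own metadata."""
    conn = sqlite3.connect(db_path)
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    schema = []
    for table in tables:
        # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk)
        cols = [r[1] for r in conn.execute(f"PRAGMA table_info({table})")]
        schema.append({"table_name": table, "columns": ", ".join(cols)})
    conn.close()
    return schema

# "usda.sqlite" is a placeholder path for your local copy of the database.
print(SCHEMA_TEMPLATE.render(data=load_schema("usda.sqlite")))
```

The nice part is that the prompt stays small: only table names and column lists go to the model, never the rows themselves.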
Exactly. His questions are simple tasks for classical computing, and when you have one of those, what you really want is for the AI to write and run the code. To its credit, GPT can often figure that out for itself these days (that it should respond by writing and running a program), but that leads to the other issue: he's testing the $0.15 4o-mini instead of the $15.00 o1.
Damn, you're right. I didn't even consider looking at the monitor itself, on the assumption that "they can't be so lazy they don't even use a real screenshot", while faking the rest kind of makes sense, since otherwise you'd need a studio setup.
Never underestimate how lazy companies with a ~$3 trillion market cap can be.
Absolutely, I'm all for dogfooding! But when you do, make sure you get and use good results, not something that looks like it was generated by someone who just learned about Stable Diffusion :)
I mean, the whole company is betting on AI, so why wouldn’t they use AI to generate the image?? Fundamentally it doesn’t matter if it was AI-generated or not; most people don’t care, and the people that do won’t impact their bottom line.
The keyboard layout seems perfectly reasonable, and rather common: from top to bottom, the rightmost column of keys after the letters would be backspace, |\, enter, shift, ctrl. On the left, mirrored, you have ~`, tab, caps lock, shift, ctrl. The sizes and shapes match many common keyboard layouts I've seen.
I love Claude 3.5 Sonnet, and their UI is top notch, especially for coding. Recently, though, they have been facing capacity issues, especially during weekdays, correlating with working hours. I have tried Qwen2.5 Coder 32B and it's very good, close to Claude 3.5 in my coding cases.
This is what annoys me a lot, too. I mean the fact that I cannot have paste retain the formatting (```, `, etc.). Same with the UI retaining my prompt, but not the formatting, so if you do some formatting and reload, you will lose that formatting.
I have aging parents in India, and I can tell you firsthand—healthcare there is far from cheap. In fact, prices have surged, and many people are now accustomed to the higher costs, thanks to rampant price gouging during the COVID pandemic.
oauth-idp-server - OAuth 2.0 Identity Provider with third-party support https://github.com/gavi/oauth-idp-server
mcp-pyexec-client - Testing client for end-to-end validation https://github.com/gavi/mcp-pyexec-client