Hacker News | cjbarber's comments

See also various sandbox tools I and others (e.g. jpeeler) have collected: https://news.ycombinator.com/item?id=47102258

I'd be interested in seeing a breakdown in which ones use:

- VMs
- Containers
- sandbox-exec (macOS built-in tool)
- Endpoint Security + Network Extension (AFAIK this is just Ash, but it would be good to see more company here)


This is an exceptional read.


Priceless thank you

I've tried a few computer-use and browser-use tools and they feel relatively tok/s-bottlenecked.

And in some sense, all of my Claude Code usage feels tok/s-bottlenecked. There's never really a time when I'm glad to wait for the tokens; I'd always prefer faster.


It could be interesting to use a metric of intelligence per second:

i.e. intelligence per token, multiplied by tokens per second.
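That metric is simple enough to sketch as a toy calculation. All numbers below are made up for illustration; "benchmark score" is standing in for whatever intelligence measure you trust:

```python
# Toy "intelligence per second": score delivered per wall-clock second,
# i.e. (score / tokens needed for an answer) * tokens per second.
def intelligence_per_second(benchmark_score: float,
                            tokens_per_second: float,
                            tokens_per_answer: float) -> float:
    intelligence_per_token = benchmark_score / tokens_per_answer
    return intelligence_per_token * tokens_per_second

# Hypothetical: a smarter-but-slower model vs. a dumber-but-faster one.
slow_smart = intelligence_per_second(benchmark_score=90,
                                     tokens_per_second=50,
                                     tokens_per_answer=2000)
fast_ok = intelligence_per_second(benchmark_score=75,
                                  tokens_per_second=500,
                                  tokens_per_answer=2000)
print(slow_smart, fast_ok)  # the faster model wins on this metric
```

On these made-up numbers the 10x speed advantage dominates the 15-point score gap, which matches the intuition in the thread.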

My current feeling is that if Sonnet 4.6 were 5x faster than Opus 4.6, I'd primarily be using Sonnet 4.6. But that wasn't true for me with prior model generations; in those generations the Sonnet-class models didn't feel good enough compared to the Opus-class models. And it might shift again when I'm doing things that feel more intelligence-bottlenecked.

But fast responses have an advantage of their own: they give you faster iteration. Kind of like how I used to like OpenAI Deep Research, but then switched to o3-thinking with web search enabled after that came out, because it was 80% of the thoroughness in 20% of the time, which tended to be better overall.


I think there's clearly a "speed is a quality of its own" axis. When you use Cerebras (or Groq) to develop against an API, the turnaround speed of iterating on jobs is so much faster (and cheaper!) than using the frontier high-intelligence labs that it's almost a different product.

Also, I put together a little research paper recently--I think there's probably an underexplored option of "Use frontier AR model for a little bit of planning then switch to diffusion for generating the rest." You can get really good improvements with diffusion models! https://estsauver.com/think-first-diffuse-fast.pdf


I'm very worried for both.

Cerebras requires a $3K/year membership to use APIs.

Groq's been dead for about 6 months, even pre-acquisition.

I hope Inception is going well; it's the only real democratic shot at this. Gemini 2.5 Flash Lite was promising, but it never really went anywhere, even by the standards of a Google preview.


Taalas is interesting. 16,000 TPS for Llama on a chip.

https://taalas.com/


On a very old model, it's more like 16,000 garbage words/s.


Llama 3.1 8B is pretty useful for some things. I use it to generate SQL pretty reliably, for example.

They are doing an updated model in a month or so anyway, then a frontier level one "by summer".


But Taalas had to quantize Llama 3.1 8B to death to get it to fit. It can't produce coherent non-English text at all.


I do wonder if there are tasks where 16k garbage words/s are more useful than 200 good words per second. Does anyone have any ideas? Data extraction perhaps?


A politician communication agent maybe...


Neat! I had been wondering if anyone was trying to implement a model directly in silicon. We're getting closer to having chatty talking toasters every day now!




I wonder how many tokens per second they could get if they put Mercury 2 on a chip.


It's exciting to see, but look at the die size for only an 8B model.


You can call Cerebras APIs via OpenRouter if you specify them as the provider in your request fyi. It's a bit pricier but it exists!
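The request shape for that looks roughly like the sketch below. The model name is a placeholder, and the `provider` routing field follows OpenRouter's documented provider-routing shape, but treat the exact values as assumptions and double-check against their docs:

```python
import json

# Sketch of an OpenRouter chat-completions request pinned to Cerebras.
# POST this (with your API key) to:
#   https://openrouter.ai/api/v1/chat/completions
payload = {
    "model": "meta-llama/llama-3.3-70b-instruct",  # hypothetical model choice
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        "only": ["Cerebras"],  # route exclusively to this provider
    },
}
print(json.dumps(payload, indent=2))
```

Without the `provider` block, OpenRouter is free to route you to whichever provider it prefers, so you lose the speed guarantee.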


I used their API normally (pay per token) a few weeks ago. Their Coding Plan appears to be permanently sold out though.


I don't think it's a good comparison given Inception work on software and Cerebras/Groq work on hardware. If Inception demonstrate that diffusion LLMs work well at scale (at a reasonable price) then we can probably expect all the other frontier labs to copy them quickly, similarly to OpenAI's reasoning models.


Definitely depends on what you're buying, maybe some of the audience here was buying Groq and Cerebras chips? I don't think they sold them but can't say for sure.

If you're a poor schmuck like me, you'd be thinking of them as API vendors of ~1000 token/s LLMs.

Especially because Inception v1's been out for a while and we haven't seen a follow-the-leader effect.

Coincidentally, that's one of my biggest questions: why not?


What do you mean by Groq being dead for about 6 months? Not refuting your point, but I'm curious.


No new models since GPT-OSS 120B, or maybe Kimi K2 non-thinking? Basically there were a couple of models it would normally, obviously have supported, and it didn't.

Something about that Nvidia sale smelled funny to me, because the number was yuge, yet the software side shut down well before the acquisition.

But that's 100% speculation, wouldn't be shocked if it was:

"We were never looking to become profitable just on API users, but we had to have it to stay visible. So, yeah, once it was clear an Nvidia sale was going through, we stopped working 16 hours a day, and now we're waiting to see what Nvidia wants to do with the API"


The Groq purchase was structured to avoid triggering federal merger oversight: you buy out the 'interesting' part and leave a skeleton team and a line of business you don't care about -> no CFIUS, no mandatory FTC reporting -> smoother process.


I am currently using their APIs on a paygo plan, I think it might just be a capacity issue for new sign ups.


Cerebras are on OpenRouter.


Once again, it's a tech that Google created but never turned into a product. AFAIK in their demo last year, Google showed a special version of Gemini that used diffusion. They were so excited about it (on the stage) and I thought that's what they'd use in Google search and Gmail.


Google did not create it; it is correct that there was a Gemini that used diffusion, and you could apply for access (not via API). It was okay.

We agree! In fact, there is an emerging class of models aimed at fast agentic iteration (think of Composer, the Flash versions of proprietary and open models). We position Mercury 2 as a strong model in this category.


Do you all think you'll be able to convert open-source models to diffusion models relatively cheaply, a la the d1 / LLaDA series of papers? If so, that seems like an extremely powerful story, where you get to retool the much, much larger capex of open models into high-performance diffusion models.

(I can also see a world where it just doesn't make sense to share most of the layers/infra and you diverge, but curious how you all see the approach.)


Maybe make that intelligence per token per relative unit of hardware per watt. If you're burning 30 tons of coal to be 0.0000000001% better than the 5 tons of coal option because you're throwing more hardware at it, well, it's not much of a real improvement.


I think the fast inference options have historically been only marginally more expensive than their slow cousins. There's a whole body of research about optimal efficiency, speed, and intelligence Pareto curves. If you can deliver even an outdated, low-intelligence model at high efficiency, everyone will be interested. If you can deliver a model very fast, everyone will be interested. (If you can deliver a very smart model, everyone is obviously the most interested, but that's the free space.)

But to be clear, 1000 tokens/second is WAY better. Anthropic's Haiku serves at ~50 tokens per second.


Intelligence per second is a great metric. I never could fully articulate why I like Gemini 3 Flash but this is exactly why. It’s smart enough and unbelievably fast. Thanks for sharing this


Yeah, I agree with this. We might be able to benchmark it soon (if we can't already) by asking different agentic code models to produce some relatively simple pieces of software. Fast models can iterate faster. Big models will write better code on the first attempt and need less loop debugging. Who will win?

At the moment I'm loving Opus 4.6, but I have no idea if its extra intelligence makes it worth using over Sonnet. Some data would be great!


For what it's worth, most people are already doing this! Some of the subagents in Claude Code (Explore, and I think even compaction) default to Haiku, and then you have to manually override it with an env variable if you want to change it.

Imagine the quality of life upgrade of getting compaction down to a few second blip, or the "Explore" going 20 times faster! As these models get better, it will be super exciting!


> Imagine the quality of life upgrade of getting compaction down to a few second blip, or the "Explore" going 20 times faster! As these models get better, it will be super exciting!

I'm awaiting the day the small and fast models come anywhere close to acceptable quality. As of today, neither GPT5.3-codex-spark nor Haiku is very suitable for compaction or similar tasks, as they'll miss so much, being quite a lot dumber.

Personally I do it the other way: the compaction is done by the biggest model I can run, the planning as well, but then actually following the step-by-step "implement it" plan is done by a small model. It seemed to me like letting a smaller model do the compaction or write overviews just makes things worse, even if it gets a lot faster.


The explore step with Codex-5.3-Spark and Opus 4.6 Fast both feel incredible.

Interesting perspective. Perhaps the user would also adapt his queries, knowing it can only do small (but very fast) steps. I wonder who would win!


Interesting suggestion.

Maybe we could use some sort of entropy-based metric as a proxy for that?
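One hedged way to read "entropy-based proxy": the mean Shannon entropy of the model's per-token output distributions, as a crude stand-in for information delivered per token. The distributions below are illustrative; real values would come from an API's logprobs:

```python
import math

# Mean per-token Shannon entropy (bits) over a generation.
# per_token_probs: one probability distribution per emitted token.
def mean_token_entropy(per_token_probs):
    def entropy(dist):
        return -sum(p * math.log2(p) for p in dist if p > 0)
    return sum(entropy(d) for d in per_token_probs) / len(per_token_probs)

# A token drawn uniformly from 4 options carries 2 bits; a forced token, 0.
print(mean_token_entropy([[0.25, 0.25, 0.25, 0.25], [1.0]]))  # 1.0
```

Whether output-distribution entropy actually tracks "intelligence per token" is very much an open question; this only shows what the proxy would compute.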


Useful for evaluating people as well


I really thought this was sarcasm. Intelligence per token? Intelligence at all, in a token? We don’t even agree on how to measure _human_ intelligence! I just can’t. Artificially intelligent indeed. Probably the perfect term for it, you know in lieu of authentic intelligence.

picard_facepalm.jpg


See also:

https://github.com/obra/packnplay

https://github.com/strongdm/leash

https://github.com/lynaghk/vibe

(I've been collecting different tools for sandboxing coding agents)


And from this thread I also see:

https://github.com/eugene1g/agent-safehouse via CGamesPlay

https://multitui.com/ via davidcann




Also (in case people haven't already seen this), I recently discovered Docker now has an easy way to run agents in a sandbox, ie:

  docker sandbox run claude ~/project-a

https://docs.docker.com/ai/sandboxes/


I wonder if it's actually from CC harness updates that make it much more inclined to use subagents, rather than from the model update.


From the author on X (https://x.com/g_leech_/status/2023384135201349633), below is all me quoting the tweet thread:

New paper on a long-shot I've been obsessed with for a year:

How much are AI reasoning gains confounded by expanding the training corpus 10000x? How much LLM performance is down to "local" generalisation (pattern-matching to hard-to-detect semantically equivalent training data)?

tl;dr

- OLMo 3 training corpus contains exact duplicates of 50% of the ZebraLogic test set.

- We embed the corpus to find semantic duplicates of test data in the wild. 78% of the CodeForces test set had >=1 semantic duplicate

- The semantic duplicate rate is maybe >4 in 10000

* at least 50% and at least 78% that is

arxiv.org/pdf/2602.12413

Imagine you're head of training at OpenAI, and you want your benchmark scores to be meaningful (i.e., to estimate OOD performance).

You have a hard task ahead of you! Your models have seen so much, memorisation is so easy - as is local generalisation (noisy pattern-matching).

What can you do? Well, obviously you take every benchmark you're going to test on and try to "decontaminate" your training corpus (remove test data from the training data).

By default this is just one level above string matching ("n-gram matching" - if sentences overlap in (say) a 13-token window, remove them from the training corpus).
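As an aside from the quoted thread, that n-gram filter is simple enough to sketch. Whitespace tokenisation and the helper names here are illustrative simplifications; real pipelines match at the model-tokeniser level:

```python
# 13-gram decontamination sketch: drop any training document that shares
# a 13-token window with any test document.
def ngrams(text: str, n: int = 13) -> set:
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_docs: list, test_docs: list, n: int = 13) -> list:
    test_grams = set()
    for doc in test_docs:
        test_grams |= ngrams(doc, n)
    # Keep only training docs with no 13-gram overlap against the test set.
    return [d for d in train_docs if not (ngrams(d, n) & test_grams)]
```

The thread's point is that this catches only near-verbatim copies: rewordings, translations, and algebraic rewrites (`x + y = 10` vs `2x + 2y = 20`) sail straight through.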

But you're actually trying, so you also translate the test sets and delete translations of test from train.

But! every piece of test data has an arbitrary number of logical equivalents and neighbours (like how `x + y = 10` is the same problem as `2x + 2y = 20`). And LLMs are amazing at semantic search, so maybe this inflates benchmark scores.

The cutting-edge tech for detecting these "semantic" duplicates is... an LLM. But you simply can't do 100T x 1M calls. There's not enough compute in the world (yet).

So you do what you can - maybe you

- categorise the entire corpus & do intense search inside relevant partitions (e.g. maths > number theory > ...)

- embed the whole corpus & look for things really close to test data

- train a wee 300M filter model & do what you can with that
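The embedding pass in that list amounts to nearest-neighbour filtering against the test set. As an illustrative stand-in for a neural embedding model, a bag-of-words cosine shows the shape of the step:

```python
import math
from collections import Counter

# Toy "embed and flag anything close to test data". Real pipelines use a
# neural embedding model; word-count vectors here just illustrate the idea.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def flag_near_duplicates(corpus, test_docs, threshold=0.9):
    test_vecs = [embed(t) for t in test_docs]
    return [doc for doc in corpus
            if any(cosine(embed(doc), tv) >= threshold for tv in test_vecs)]
```

The expensive part at corpus scale is not the cosine but building and searching the index, which is why the thread reaches for partitioning and a small filter model.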

How much does this process catch? How many semantic duplicates of test data slip through? And what's the impact on final benchmark scores?

We don't know. This (finally) is where our paper comes in:

We experiment on OLMo 3, one of the only really good models with open training data. Since we have its entire training corpus, we can exhaustively check for real "natural" duplicates and finetune it to estimate their impact. We embed the entire Dolma Instruct corpus.

Firstly: we were surprised by how ineffective n-gram decontamination was at catching exact duplicates - 70% of harder tasks had a match. But the spurious performance gain wasn't so large, at most +4pp.

Secondly, every single MBPP test example and 78% of CodeForces have semantic duplicates

Thirdly we generated 10k synthetic duplicates for MuSR, Zebralogic, and MBPP problems and finetuned on them.

- MuSR +22pp. Semantic duplicates as strong as exact

- ZebraLogic +12pp. Exact much stronger

- MBPP +17pp. Exact stronger

Fourthly we guess that 4 in 10,000 training datapoints are a strong semantic duplicate for a given benchmark datapoint (where strong means just "obvious to Gemini")

So: n-gram decontamination is not enough even for the easy (exact) stuff, semantic duplicates are at least a moderately big deal, and this probably transfers to frontier models to some degree. The above are probably underestimates too (since our detection pipeline was cheapo).

Data contamination is a huge field. Here's how we're new

This is preliminary work on a shoestring - we didn't get at the big questions yet ("what share of benchmark gains come from interpolation over a hidden training corpus?", "does this even matter?")

And local generalisation across very different strings is anyway pretty miraculous

The grand aim of this research programme is to decompose benchmark gains / apparent AI progress into 4 estimates:

1. benchmaxxing (memorising exact duplicates)

2. usemaxxing (RLing narrow capabilities)

3. hidden interpolation / local generalisation

4. OOD generalisation

We have a lot of ideas! If you're interested in funding this, grab me at gavin@arbresearch.com

Nearly all of the real work done by Ari Spiesberger, Juan_VaGu, Nicky Pochinkov, Tomas Gavenciak, peligrietzer and NandiSchoots

And ofc this work wouldn't be possible without allen_ai and natolambert working in public and enabling actually scientific evals.


It'll be nice when there's smarter routing between models, or easier routing, so some things get sent to the fast model, some get sent to the cheap model, some get sent to the smart model, etc.
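A router along those lines could start as a crude heuristic before anything learned. Everything below is hypothetical: the tier names and the markers are placeholders, not any vendor's API:

```python
# Hypothetical fast/cheap/smart model router based on crude request features.
def route(prompt: str, latency_sensitive: bool = False) -> str:
    hard_markers = ("prove", "refactor", "architecture", "debug")
    if latency_sensitive:
        return "fast-model"          # interactive loops: speed first
    if len(prompt) > 2000 or any(m in prompt.lower() for m in hard_markers):
        return "smart-model"         # long or hard-looking tasks
    return "cheap-model"             # everything else
```

In practice you'd likely replace the keyword check with a small classifier, which is itself a "fast model routing to smart models" loop.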


> In my opinion, they solved the wrong problem. The main issue I have with Codex is that the best model is insanely slow, except at nights and weekends when Silicon Valley goes to bed. I don't want a faster, smaller model (already have that with GLM and MiniMax). I want a faster, better model (at least as fast as Opus).

It's entirely possible that this is the first step and that they will also do faster better models, too.


I doubt it; there's a limit on model size that can be supported by Cerebras tech. GPT-5.3 supposedly has 1T+ parameters...


Um, no. There's no limit on model size for Cerebras hardware. Where do you come up with this stuff?

