
> 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability

I don't know anything about TerminalBench, but on the face of it a 51% score on a test metric doesn't sound like it would guarantee 'unwavering stability' on sophisticated long-horizon tasks


51% doesn't tell you much by itself. Benchmarks like this are usually not graded on a curve and aren't calibrated so that 100% is the performance level of a qualified human. You could design a superhuman benchmark where 10% was the human level of performance.

Looking at https://www.tbench.ai/leaderboard/terminal-bench/2.0, I see that the current best score is 75%, so 51% is about ⅔ of SOTA.


This is interesting: TFA lists Opus at 59, which is the same as Claude Code with Opus on the page you linked here. But it has the Droid agent with Opus scoring 69, which means the CC harness loses Opus 10 points on this benchmark.

I'm reminded of https://swe-rebench.com/ where Opus actually does better without CC. (Roughly same score but half the cost!)


That score is on par with Gemini 3 Flash, but from scrolling through the results these scores look much more affected by the agent used than by the model.

Gemini 3 Flash is pure rubbish. It can easily get into a loop and spout information no better than a Markov chain, repeating it over and over.

TerminalBench is maybe the worst-named benchmark. It has almost nothing to do with the terminal; it's mostly about random tools' syntax. Also, most tasks aren't really agentic if the model has just memorized some random tools' command-line flags.

What do you mean? It tests whether the model knows the tools and uses them.

Yeah, it's a knowledge benchmark, not an agentic benchmark.

That's like saying coding benchmarks are about memorizing the language syntax. You have to know what to call when and how. If you get the job done you win.

I am saying the opposite. If a coding benchmark just tests the syntax of an esoteric language, it shouldn't be called a coding benchmark.

For a benchmark named Terminal-Bench, I would assume it requires some terminal "interaction", not just handing over the code and commands.


Also the answers are non-deterministic

Is this only when using the Go SDK?

Nah, it’ll show up in the others in their upcoming releases. Much of the code for the SDKs is autogenerated from JSON “API shape” files: https://github.com/aws/api-models-aws

Specifically, in this case: https://github.com/aws/api-models-aws/commit/8bca88a33592ca4...
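
For a rough idea of what those shape files look like, here's a minimal Smithy-JSON-style fragment (the service and shape names are made up for illustration; the real models in that repo carry many more traits):

```json
{
  "smithy": "2.0",
  "shapes": {
    "com.example.widgets#GetWidgetInput": {
      "type": "structure",
      "members": {
        "WidgetId": {
          "target": "smithy.api#String",
          "traits": { "smithy.api#required": {} }
        }
      }
    }
  }
}
```

The SDK generators walk these shape definitions to emit request/response types and serializers, which is why a change in one model file shows up across every language SDK.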


I still don't really understand what it's for, even though it sounds interesting and gets linked here from time to time

but I think the difference is the "distributed" part, meaning distributed over untrusted networks as opposed to distributed over nodes in a private cluster


How easy will this be to combine with https://github.com/mysql/mysql-operator for deployment?


We haven't tried that before; maybe I will try combining it with mysql-operator later.


Or just any guidance on production deployments would be appreciated

I am a big fan of prek and have converted a couple of projects over from pre-commit

The main advantage for me is that prek has support for monorepo/workspaces, while staying compatible with existing pre-commit hooks.

So you can have additional .pre-commit-config.yaml files in each workspace under the root, and prek will find and run them all when you commit. The results are collated nicely. Just works.
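
For example, a hypothetical layout (paths made up) with a root config plus one per workspace, where each file is just an ordinary pre-commit config:

```yaml
# services/api/.pre-commit-config.yaml - one of several configs in the
# repo (hypothetical path); prek discovers and runs it alongside the
# root .pre-commit-config.yaml on commit.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: check-yaml
```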

Having the default hooks reimplemented in Rust is a minor bonus (3rd-party hooks won't be any faster), and using uv as the package manager speeds up hook updates for Python hooks.


I believe that was the main reason it was created - the pre-commit author acknowledged the request for such support but said they wouldn't do (or merge) it in pre-commit.


> I would wonder why anyone would chose Turso over SQLite

well, Turso adds features

otherwise yeah, there'd be no reason


Ok: `io_uring` (like NVMe, but for IO commands from application to kernel) and DBSP (a high-grade framework for differential "incremental view maintenance" - differential as in based on delta streams/diffs, not full updates). DBSP can keep materialized views synchronously up-to-date at a cost proportional to just the diff, for most typical views at least; certain queries can of course do things at an intermediate stage that blow up and collapse again right after.
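
A toy sketch of the incremental-view-maintenance idea (the concept only - not DBSP's or Turso's actual API; all names here are made up): a per-key SUM view updated from (key, change) deltas, so each update costs on the order of the delta size rather than the base table size.

```python
# Toy illustration of incremental view maintenance (IVM), not the real
# DBSP or Turso API: a SUM(value) GROUP BY key "materialized view" that
# is maintained from deltas instead of being recomputed from scratch.
from collections import defaultdict

class IncrementalSumView:
    def __init__(self):
        # The materialized view: key -> running SUM of values.
        self.totals = defaultdict(int)

    def apply_delta(self, delta):
        # delta: iterable of (key, change); inserts are positive,
        # deletes negative. Cost is proportional to len(delta) only.
        for key, change in delta:
            self.totals[key] += change
            if self.totals[key] == 0:
                del self.totals[key]  # drop keys that net out to zero

view = IncrementalSumView()
view.apply_delta([("a", 5), ("b", 3)])  # two inserted rows
view.apply_delta([("a", -5)])           # the "a" row deleted again
assert dict(view.totals) == {"b": 3}
```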

Those are the notable ones, at least; I'm not sure about the practical relevance of the MVCC `BEGIN CONCURRENT`, though. I'm just already familiar enough with the other two big ones to chime in without having to dive into what Turso does with them...


> Ok, `io_uring` (like NVMe but for IO commands from application to kernel)

Are there benchmarks comparing turso with io_uring to sqlite (with other config the same)?

io_uring has the potential to be faster, but it's not guaranteed. It might be the same, it might be slower, depending on how you use it. People bragging about the technology instead of the results of using the technology is a bit of a red flag.


> I don’t think anyone actually rejects that. And those who do...

slow clap


Could plate tectonics conceivably throw off the alignment of this monument within the ~10ky timescales involved?


Not really. The axis of the earth's rotation is not affected by plate tectonics and the star map is recording where the earth's northern axis is pointing as precession slowly moves it around a circle. The star map could certainly move around or get broken up by plate tectonics over that timescale but it's not really aligned with anything in a meaningful way so that doesn't matter.

The stars themselves will move relative to us, and eventually some of them will disappear, but nothing much is expected on that front until well beyond 10ky timescales.
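
For scale on the precession point: one full precession circle takes roughly 25,772 years (the standard published figure; the script below is just illustrative arithmetic), so over ~10ky the pole sweeps through well over a third of that circle.

```python
# Back-of-envelope: how far the celestial pole drifts in ~10,000 years.
PRECESSION_PERIOD_YEARS = 25_772              # approx. full precession cycle

deg_per_year = 360 / PRECESSION_PERIOD_YEARS  # ~0.014 degrees per year
drift_10ky = deg_per_year * 10_000            # ~140 degrees of the circle

print(f"{deg_per_year:.4f} deg/yr -> about {drift_10ky:.0f} deg over 10 ky")
```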


Yes, it's embarrassing

