
> 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability

I don't know anything about TerminalBench, but on the face of it a 51% score on a test metric doesn't sound like it would guarantee 'unwavering stability' on sophisticated long-horizon tasks


51% doesn't tell you much by itself. Benchmarks like this are usually not graded on a curve and aren't calibrated so that 100% is the performance level of a qualified human. You could design a superhuman benchmark where 10% was the human level of performance.

Looking at https://www.tbench.ai/leaderboard/terminal-bench/2.0, I see that the current best score is 75%, so 51% is about ⅔ of SOTA.


This is interesting: TFA lists Opus at 59, which is the same as Claude Code with Opus on the page you linked here. But it has the Droid agent with Opus scoring 69, which means the CC harness loses Opus 10 points on this benchmark.

I'm reminded of https://swe-rebench.com/ where Opus actually does better without CC. (Roughly same score but half the cost!)


That score is on par with Gemini 3 Flash, but from scrolling through the results these scores look much more affected by the agent used than by the model.

Gemini 3 Flash is pure rubbish. It can easily get into a loop and spout information no better than a Markov chain, repeating it over and over.

TerminalBench is maybe the worst-named benchmark. It has almost nothing to do with the terminal; it's mostly about random tools' syntax. Also, most tasks aren't really agentic if the model has just memorized some random tools' command-line flags.

What do you mean? It tests whether the model knows the tools and uses them.

Yeah, it's a knowledge benchmark, not an agentic benchmark.

That's like saying coding benchmarks are about memorizing the language syntax. You have to know what to call when and how. If you get the job done you win.

I am saying the opposite. If a coding benchmark just tests the syntax of an esoteric language, it shouldn't be called a coding benchmark.

For a benchmark named Terminal-Bench, I would assume it requires some terminal "interaction", not just handing over the code and commands.


Also the answers are non-deterministic

Is this only when using the Go SDK?

Nah, it’ll show up in the others in their upcoming releases. Much of the code for the SDKs is autogenerated from JSON “API shape” files: https://github.com/aws/api-models-aws

Specifically, in this case: https://github.com/aws/api-models-aws/commit/8bca88a33592ca4...
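
For a rough idea of what those shape files look like, here's a minimal Smithy-JSON-style fragment (the service and shape names are made up for illustration; the real models in that repo carry many more traits):

```json
{
  "smithy": "2.0",
  "shapes": {
    "com.example.widgets#GetWidgetInput": {
      "type": "structure",
      "members": {
        "WidgetId": {
          "target": "smithy.api#String",
          "traits": { "smithy.api#required": {} }
        }
      }
    }
  }
}
```

The SDK generators walk these shape definitions to emit request/response types and serializers, which is why a change in one model file shows up across every language SDK.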


I still don't really understand what it's for, even though it sounds interesting and gets linked here from time to time

but I think the difference is the "distributed" part, meaning distributed over untrusted networks as opposed to distributed over nodes in a private cluster


How easy will this be to combine with https://github.com/mysql/mysql-operator for deployment?


We haven't tried that before; maybe I will try combining it with mysql-operator later.


Or just any guidance on production deployments would be appreciated

I am a big fan of prek and have converted a couple of projects over from pre-commit

The main advantage for me is that prek has support for monorepo/workspaces, while staying compatible with existing pre-commit hooks.

So you can have additional .pre-commit-config.yaml files in each workspace under the root, and prek will find and run them all when you commit. The results are collated nicely. Just works.
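
For example, a hypothetical layout (paths made up) with a root config plus one per workspace, where each file is just an ordinary pre-commit config:

```yaml
# services/api/.pre-commit-config.yaml - one of several configs in the
# repo (hypothetical path); prek discovers and runs it alongside the
# root .pre-commit-config.yaml on commit.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: check-yaml
```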

Having the default hooks reimplemented in Rust is a minor bonus (3rd-party hooks won't be any faster), and using uv as the package manager speeds up hook updates for Python hooks.


I believe that was the main reason it was created - the pre-commit author acknowledged the request for such support but said they wouldn't do (or merge) it in pre-commit.


> I would wonder why anyone would chose Turso over SQLite

well, Turso adds features

otherwise yeah, there'd be no reason


Ok: `io_uring` (like NVMe, but for IO commands from application to kernel) and DBSP (a high-grade framework for differential "incremental view maintenance" - differential as in based on delta streams/diffs, not full updates). DBSP can keep materialized views synchronously up-to-date at a cost proportional to just the diff, for most typical views at least; certain queries can of course do things at an intermediate stage that blow up and collapse again right after.
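
A toy sketch of the incremental-view-maintenance idea (the concept only - not DBSP's or Turso's actual API; all names here are made up): a per-key SUM view updated from (key, change) deltas, so each update costs on the order of the delta size rather than the base table size.

```python
# Toy illustration of incremental view maintenance (IVM), not the real
# DBSP or Turso API: a SUM(value) GROUP BY key "materialized view" that
# is maintained from deltas instead of being recomputed from scratch.
from collections import defaultdict

class IncrementalSumView:
    def __init__(self):
        # The materialized view: key -> running SUM of values.
        self.totals = defaultdict(int)

    def apply_delta(self, delta):
        # delta: iterable of (key, change); inserts are positive,
        # deletes negative. Cost is proportional to len(delta) only.
        for key, change in delta:
            self.totals[key] += change
            if self.totals[key] == 0:
                del self.totals[key]  # drop keys that net out to zero

view = IncrementalSumView()
view.apply_delta([("a", 5), ("b", 3)])  # two inserted rows
view.apply_delta([("a", -5)])           # the "a" row deleted again
assert dict(view.totals) == {"b": 3}
```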

Those are the notable ones, at least; I'm not sure about the practical relevance of the MVCC `BEGIN CONCURRENT`, though. I'm just already familiar enough with the other two big ones to chime in without having to dive into what Turso does with them...


> Ok, `io_uring` (like NVMe but for IO commands from application to kernel)

Are there benchmarks comparing turso with io_uring to sqlite (with other config the same)?

io_uring has the potential to be faster, but it's not guaranteed. It might be the same, it might be slower, depending on how you use it. People bragging about the technology instead of the results of using the technology is a bit of a red flag.


> I don’t think anyone actually rejects that. And those who do...

slow clap


Could plate tectonics conceivably throw off the alignment of this monument within the ~10ky timescales involved?


Not really. The axis of the earth's rotation is not affected by plate tectonics and the star map is recording where the earth's northern axis is pointing as precession slowly moves it around a circle. The star map could certainly move around or get broken up by plate tectonics over that timescale but it's not really aligned with anything in a meaningful way so that doesn't matter.

The stars themselves will move relative to us, and eventually some of them will disappear, but nothing much is expected on that front until well beyond 10ky timescales.
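
For scale on the precession point: one full precession circle takes roughly 25,772 years (the standard published figure; the script below is just illustrative arithmetic), so over ~10ky the pole sweeps through well over a third of that circle.

```python
# Back-of-envelope: how far the celestial pole drifts in ~10,000 years.
PRECESSION_PERIOD_YEARS = 25_772              # approx. full precession cycle

deg_per_year = 360 / PRECESSION_PERIOD_YEARS  # ~0.014 degrees per year
drift_10ky = deg_per_year * 10_000            # ~140 degrees of the circle

print(f"{deg_per_year:.4f} deg/yr -> about {drift_10ky:.0f} deg over 10 ky")
```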


Yes, it's embarrassing

