mustaphah's comments | Hacker News

Google is terrible at marketing, but this feels like a big step forward.

As per the announcement, Gemini 3.1 Pro scored 68.5% on Terminal-Bench 2.0, which makes it the top performer on the Terminus 2 harness [1]. That harness is a "neutral agent scaffold," built by researchers at Terminal-Bench to compare different LLMs in the same standardized setup (same tools, prompts, etc.).

It's also taken the top spot on both the Intelligence Index and the Coding Index at Artificial Analysis [2], but on their Agentic Index it's still lagging behind Opus 4.6, GLM-5, Sonnet 4.6, and GPT-5.2.

---

[1] https://www.tbench.ai/leaderboard/terminal-bench/2.0?agents=...

[2] https://artificialanalysis.ai


Benchmarks aren't everything.

Gemini consistently has the best benchmarks but the worst actual real-world results.

Every time they announce the best benchmarks, I try their tools and products again, and each time I immediately go back to the Claude and Codex models, because Google is just so terrible at building actual products.

They are good at research and benchmaxxing, but the day-to-day usage of the products and tools is horrible.

Try using Google Antigravity and you will not make it an hour before switching back to Codex or Claude Code; it's so incredibly shitty.


That's been my experience too; can't disagree. Still, when it comes to tasks that require deep intelligence (esp. mathematical reasoning [1]), Gemini has consistently been the best.

[1] https://arxiv.org/abs/2602.10177


What’s so shitty about it?

This is like trying to fix hallucination by telling the LLM not to hallucinate.

What if the most interesting finding ends up buried under a vague title? Aside from the "self-generated skills" aspect, there isn't much there that meaningfully warrants deeper discussion.

I chose a title that directly reflects an interesting finding - something that offers substantial insight to the community. I think the rule should be applied with some nuance; in this case, being explicit is a net positive.

I have no interest in linkbait, and I hope that's evident from my previous submissions.


Thanks @dang for moderating! This is indeed not our original finding; it's a sub-conclusion from an ablation we ran to remove the confound of the LLM's internal domain knowledge. Thanks for submitting for us, @mustaphah. Here's a bit more detail on how we approached this:

> I would frame the 'post-trajectory generated skills' as feedback-generated skills, as does Letta: https://www.letta.com/blog/skill-learning. We hadn't seen existing research or hypotheses debating whether the improvement from skills might come from the skill prompts themselves activating knowledge the LLM already has. That's why we added an ablation with 'pre-trajectory generated skills': we had that hypothesis, and this seemed a very clean way to test it. It's also very plausible that feedback-generated skills help, because they almost certainly capture the agent's failure modes on those specific tasks.
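
Roughly, the contrast looks like this (hypothetical pseudocode with a made-up llm.generate helper, not our actual pipeline):

    # Hypothetical sketch of the two conditions; llm.generate is a stand-in,
    # not the paper's actual code.

    def pre_trajectory_skills(llm, task_description):
        # Skills written from the task description alone, before the agent
        # attempts the task: no feedback, only the LLM's prior knowledge.
        return llm.generate(
            f"Write skills/tips for solving this task:\n{task_description}"
        )

    def post_trajectory_skills(llm, task_description, trajectory, outcome):
        # Skills distilled from an actual attempt, so they can encode the
        # agent's observed failure modes on this specific task.
        return llm.generate(
            f"Task: {task_description}\n"
            f"Attempt transcript: {trajectory}\n"
            f"Outcome: {outcome}\n"
            "Write skills/tips that would have prevented the failures above."
        )

    # If pre-trajectory skills already close most of the gap, the gain is
    # largely latent domain knowledge being activated; any extra gain from
    # post-trajectory skills is attributable to feedback.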


Yeah, I got your point when I read the paper. You're essentially controlling for "latent domain knowledge."

I might have been a bit blunt with the title - sorry about that, but I still think it was a good title. From what I've observed, a lot of Skills on GitHub are just AI-generated without any feedback or deliberative refinement. Many thought those would still be valuable, but you've shown evidence otherwise.


No worries, it's totally fine! There is indeed work to be done on feedback-generated skills. Thanks for helping us submit to Hacker News. As for "a lot of Skills on GitHub are just AI-generated without any feedback or deliberative refinement. Many thought those would still be valuable, but you've shown evidence otherwise": we do find most skills on the internet to be useless. Thanks to the generosity of the https://skillsmp.com/ author, we were able to get the metadata for the 99k skills indexed on his website. After a lot of filtering and deduping, we found ~40k+ skills that were relevant at the time we did the study.

Yes, I appreciate that, and yes there is room for nuance. But I think you went too far in this case, meaning that the delta between the article title and the submission title was too large. For example, the word "useless" appears nowhere in the article abstract nor in the article body. That's a big delta.

I was starting to type out a longer explanation but I ran out of time - however, I probably would just be repeating things I've said many times before, for example here: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... - perhaps some of that would be helpful.

You're a fine HN contributor and obviously a genuine user, and I hope I didn't come across as critical! From our side it's just standard HN moderation practice. The way we deal with titles has been stable for many years. It isn't entirely mechanical; there are many subtleties (back to the nuance thing), but the core rules have served the site really well. The main thing we want to avoid is having the title field become a mini-genre where whoever makes the submission gets to put their spin on the article.


I'm fine with editing the title, and I see your point.

In retrospect, I'd probably avoid "useless." While it's a fairly descriptive term for their finding, it probably carries a subjective tone.



Location: Baghdad (UTC+3)

Remote: Yes

Willing to relocate: Depends

Technologies: TypeScript/Node, Ruby/Rails, Python, decent frontend (JS, React, Tailwind, ...), microservices, REST APIs, GraphQL, RabbitMQ, Pub/Sub, Redis, Postgres/MySQL, Elastic Stack, Prometheus, Splunk, Kubernetes/Docker, Ansible.

Website: https://hadid.dev

Résumé/CV: https://hadid.dev/resume/

GitHub: https://github.com/mhadidg

Email: career+hn @ [my website domain]

---

Hi! I'm a backend engineer (~8 YOE) with strong DevOps experience and decent frontend skills. Looking for a backend or backend-leaning full-stack role.

I worked at Automattic (US), the company behind WordPress; fully remote, async teams across the globe.

I've been part of teams focused on speed and rapid iteration, and I've also worked on high-quality systems where long-term maintainability and reliability matter the most. I believe this mix of experience has helped me develop a good sense of where each fits best and has enabled me to adapt quickly based on context and requirements.

I've built and maintained time-sensitive, high-throughput, distributed services (millions of ops daily) and owned features and small- to mid-sized projects end-to-end from design to deployment. I do my best working autonomously, and I like to think of myself as a generalist.

Most of my career has been in large enterprises (7+ YOE), but I've also done a fair amount of freelance work (around 1 YOE) for clients.

I have a couple of small open-source projects on GitHub.


I've seen a couple of power users already switching to Pi [1], and I'm considering that too. The premise is very appealing:

- Minimal, configurable context - including system prompts [2]

- Minimal and extensible tools; for example, todo tasks extension [3]

- No built-in MCP support; extensions exist [4]. I'd rather use mcporter [5]

Full control over context is a high-leverage capability. If you're aware of how much context quality affects performance (in-context retrieval limits [6], context rot [7], contextual drift [8], etc.), you'll truly appreciate that Pi lets you fine-tune the WHOLE context for optimal performance.

It's clearly not for everyone, but I can see how powerful it can be.
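
To make the context-control point concrete, here's a rough sketch of the idea. This is hypothetical pseudocode, not Pi's actual API; the point is just that every token sent to the model is something you chose explicitly:

    # Hypothetical sketch (not Pi's actual API): assemble the entire prompt
    # yourself instead of letting the harness accumulate history and tool spam.
    def build_context(system_prompt, task, files, notes="", max_chars=40_000):
        parts = [system_prompt, f"Task:\n{task}"]
        for path, content in files:  # only the files you explicitly picked
            parts.append(f"--- {path} ---\n{content}")
        if notes:  # e.g. a hand-curated todo list from an extension
            parts.append(f"Notes:\n{notes}")
        context = "\n\n".join(parts)
        return context[:max_chars]  # hard budget instead of silent growth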

---

[1] https://lucumr.pocoo.org/2026/1/31/pi/

[2] https://github.com/badlogic/pi-mono/tree/main/packages/codin...

[3] https://github.com/mitsuhiko/agent-stuff/blob/main/pi-extens...

[4] https://github.com/nicobailon/pi-mcp-adapter

[5] https://github.com/steipete/mcporter

[6] https://github.com/gkamradt/LLMTest_NeedleInAHaystack

[7] https://research.trychroma.com/context-rot

[8] https://arxiv.org/html/2601.20834v1


Pi is the part of moltXYZ that should have gone viral. Armin is way ahead of the curve here.

The Claude sub is the only thing keeping me on Claude Code. It's not as janky as it used to be, but the hooks and context management support are still fairly superficial.


The author of Pi is Mario, not Armin, but Armin is a contributor.


I can see their point.

Traditional systems (git blame, review history, ticket links) tell you who committed or approved changes, but not whether the content originated from an AI agent, which model it used, or what prompt/context produced it.

Agent Trace is aiming to standardize that missing layer so different tools can read/write it consistently.
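
As a rough illustration (field names are hypothetical, not the actual Agent Trace schema), a per-change provenance record might look something like this:

    # Hypothetical example of the kind of record such a standard could attach
    # to a commit or file range; the fields are illustrative, not the spec.
    agent_trace_record = {
        "commit": "abc1234",
        "files": ["src/billing/invoice.py"],
        "generated_by": {
            "agent": "some-coding-agent",
            "model": "example-model-v1",
        },
        "prompt_summary": "Refactor invoice rounding to use Decimal",
        "context_sources": ["src/billing/invoice.py", "docs/billing.md"],
        "human_review": {"reviewer": "jane", "approved": True},
    }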


It's kinda popular these days. I've read some high-quality articles there.


Once exercise becomes a habit, it's very easy to do even on days when your mood is terrible. A strict routine (initially) is the trick to making things easier forever.

You definitely want to build that habit when you're at your best.


Codex is better for backend coding. For UI/UX, Claude is a clear winner for me.

I use both interchangeably.


Interesting, that is good to know. I have definitely experienced Codex fumbling really easy UI tasks, so it will be worth giving Claude a try for those.

