Why are they doing that? Opus is the only good way to run Claw. Do they regret making it cheaper or what?
Also, what's the point of `claude -p` if not integration with third-party code? (They have a whole agents SDK which does the same thing... but I think that one requires per-token pricing.) I guess they regret supporting subscription auth on the -p flag.
The site claims 14x less memory usage, and I'm a bit confused by that. The model file is indeed very small, but on my machine it used roughly the same RAM as a 4-bit quant (on CPU).
Though I couldn't get actual English output from it, so maybe something went wrong while running it.
Do I need to build their llama.cpp fork from source?
Looks like they only offer CUDA builds on the releases page, which I think might support CPU mode but refuse to even start without CUDA installed. Seems a bit odd to me; I thought the whole point was supporting low-end devices!
Edit: 30 minutes of C++ compile time later, I got it built, although it uses 7GB of RAM and then hangs at "Loading model". I thought this thing was less memory-hungry than 4-bit quants?
Edit 2: Got the 4B version running, but at 0.1 tok/s, and the output was nonsensical. For comparison, on the same machine I can run the qwen 3.5 4B model (at 4-bit quant) correctly and about 50x faster.
Could you elaborate on what you did to get it working? I built it from source, but couldn't get the 4B model to produce coherent English.
Sample output below (the model's response to "hi" in the forked llama-cli):
X ( Altern as the from (..
Each. ( the or,./, and, can the Altern for few the as ( (.
.
( the You theb,’s, Switch, You entire as other, You can the similar is the, can the You other on, and. Altern.
. That, on, and similar, and, similar,, and, or in
Literally just downloaded the model into a folder, opened Cursor in that folder, and told it to get it running.
Prompt: The gguf for bonsai 8b are in this local project. Get it up and running so I can chat with it. I don't care through what interface. Just get things going quickly. Run it locally - I have plenty of vram. https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/main
I had to ask it to increase the context window to 64k, but other than that it got things running just fine. After that I just pointed ngrok at the port I was serving on, and voilà.
Especially since what Anthropic describes here is a bit of a Rube Goldberg machine, which also involves preprocessing (contextual summarization) and a reranking model, so I was wondering if there are any "good enough" out-of-the-box solutions for it.
Yes, hybrid search is one of the main use cases we had in mind when developing the extension, but it works for old-fashioned standalone keyword-only search as well. There is a lot of art to how you combine keyword and semantic search (there are entire companies, like Cohere, devoted to just this step!). We're leaving this part, at least for now, up to application developers.
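For the "combine the two rankings" step that's being left to application developers, one common, simple baseline is reciprocal rank fusion (RRF). A minimal sketch; the doc IDs and the k=60 constant are illustrative, not anything the extension itself ships:

```python
def rrf_merge(keyword_ranking, semantic_ranking, k=60):
    """Reciprocal rank fusion: each doc scores 1/(k + rank) per list it appears in.

    k (commonly 60) damps the dominance of the very top ranks, so a doc that
    is merely decent in both lists can beat one that tops only a single list.
    """
    scores = {}
    for ranking in (keyword_ranking, semantic_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks well in both lists, so it wins the fused ordering.
keyword = ["a", "b", "c"]
semantic = ["b", "d", "a"]
print(rrf_merge(keyword, semantic))  # → ['b', 'a', 'd', 'c']
```

A learned reranker (the Cohere-style step mentioned above) would replace or follow this fusion; RRF is just the cheap, model-free starting point.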
I want to say "we structured the system like that, right?", i.e. maximize profit at all costs.
But it seems to be the natural outcome of the incentives of an organization made of organisms in an entropy-based simulation.
i.e. the problem might be slightly deeper than an economic or political model. That being said, we might see something approximating post-scarcity economics in our lifetimes, which will be very interesting.
In the meantime... we might fiddle with the incentives a bit ;)
The upper arm of the K shaped economy uses their capital to invent and control the replicator and the lower arm dies off? Seems like the most realistic path to "post-scarcity" from where we're standing now.