Why are they doing that? Opus is the only good way to run Claw. Do they regret making it cheaper or what?
Also, what's the point of `claude -p` if not integration with third-party code? (They have a whole agents SDK which does the same thing... but I think that one requires per-token pricing.) I guess they regret supporting subscription auth on the -p flag.
The site claims 14x less memory usage, and I'm a bit confused by that. The model file is indeed very small, but on my machine it used roughly the same RAM as a 4-bit quant (on CPU).
Though I couldn't get actual English output from it, so maybe something went wrong while running it.
Do I need to build their llama.cpp fork from source?
Looks like they only offer CUDA builds on the releases page, which I think might support CPU mode but refuse to even start without CUDA installed. Seems a bit odd to me; I thought the whole point was supporting low-end devices!
Edit: 30 minutes of C++ compile time later, I got it built, although it uses 7GB of RAM and then hangs at "Loading model". I thought this thing was less memory-hungry than 4-bit quants?
Edit 2: Got the 4B version running, but at 0.1 tok/s, and the output was nonsensical. For comparison, on the same machine I can run the qwen 3.5 4B model (at 4-bit quant) correctly and about 50x faster.
Could you elaborate on what you did to get it working? I built it from source, but couldn't get the 4B model to produce coherent English.
Sample output below (the model's response to "hi" in the forked llama-cli):
X ( Altern as the from (..
Each. ( the or,./, and, can the Altern for few the as ( (.
.
( the You theb,’s, Switch, You entire as other, You can the similar is the, can the You other on, and. Altern.
. That, on, and similar, and, similar,, and, or in
Literally just downloaded the model into a folder, opened Cursor in that folder, and told it to get it running.
Prompt: The gguf for bonsai 8b are in this local project. Get it up and running so I can chat with it. I don't care through what interface. Just get things going quickly. Run it locally - I have plenty of vram. https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/main
I had to ask it to increase the context window to 64k, but other than that it got things running just fine. After that I just pointed ngrok at the port I was serving on, and voilà.
Especially since what Anthropic describes here is a bit of a Rube Goldberg machine, which also involves preprocessing (contextual summarization) and a reranking model, so I was wondering if there are any "good enough" out-of-the-box solutions for it.
Yes, hybrid search is one of the main use cases we had in mind when developing the extension, but it works for old-fashioned standalone keyword-only search as well. There is a lot of art to how you combine keyword and semantic search (there are entire companies, like Cohere, devoted to just this step!). We're leaving this part, at least for now, up to application developers.
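For the "combine the two rankings" step that's being left to application developers, one common, simple baseline is reciprocal rank fusion (RRF). A minimal sketch; the doc IDs and the k=60 constant are illustrative, not anything the extension itself ships:

```python
def rrf_merge(keyword_ranking, semantic_ranking, k=60):
    """Reciprocal rank fusion: each doc scores 1/(k + rank) per list it appears in.

    k (commonly 60) damps the dominance of the very top ranks, so a doc that
    is merely decent in both lists can beat one that tops only a single list.
    """
    scores = {}
    for ranking in (keyword_ranking, semantic_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks well in both lists, so it wins the fused ordering.
keyword = ["a", "b", "c"]
semantic = ["b", "d", "a"]
print(rrf_merge(keyword, semantic))  # → ['b', 'a', 'd', 'c']
```

A learned reranker (the Cohere-style step mentioned above) would replace or follow this fusion; RRF is just the cheap, model-free starting point.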
I want to say "we structured the system like that, right?", i.e. maximize profit at all costs.
But it seems to be the natural outcome of the incentives of an organization made of organisms in an entropy-based simulation.
i.e. the problem might be slightly deeper than an economic or political model. That being said, we might see something approximating post-scarcity economics in our lifetimes, which will be very interesting.
In the meantime... we might fiddle with the incentives a bit ;)
The upper arm of the K shaped economy uses their capital to invent and control the replicator and the lower arm dies off? Seems like the most realistic path to "post-scarcity" from where we're standing now.