
qwen3-coder-next 80B (128k ctx) local benchmark on RTX 5090 + Ryzen 9 9950X3D

ran 10 real coding workloads via ollama, with the KV cache quantized to q8.
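For context on where the numbers below come from: ollama's streaming /api/generate endpoint reports per-run stats in its final JSON chunk (prompt_eval_count / prompt_eval_duration for prefill, eval_count / eval_duration for decode, all durations in nanoseconds). A minimal sketch of how those fields reduce to the TTFT and tok/s figures in the table -- my reconstruction, not the exact harness:

```python
def throughput(final_chunk: dict) -> dict:
    """Reduce ollama's final streamed chunk to benchmark metrics.

    Field names are from ollama's /api/generate response; all
    durations are reported in nanoseconds.
    """
    ns = 1e9
    return {
        # prefill speed: prompt tokens / prompt eval time
        "prompt_tok_s": final_chunk["prompt_eval_count"]
        / (final_chunk["prompt_eval_duration"] / ns),
        # decode speed: generated tokens / generation time
        "gen_tok_s": final_chunk["eval_count"]
        / (final_chunk["eval_duration"] / ns),
        # TTFT approximated as model load + prompt processing
        "ttft_s": (final_chunk.get("load_duration", 0)
                   + final_chunk["prompt_eval_duration"]) / ns,
    }
```

With the warm-model case, load_duration is near zero, so TTFT is essentially prompt processing time -- which is why it scales with context length in the table.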

below is a claude-generated report from the raw numbers:

Hardware: Corsair Vengeance a7500 AIR -- RTX 5090 32GB, Ryzen 9 9950X3D, 192GB DDR5, Fedora 43 (kernel 6.17.6)
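On the kv cache q8 choice: q8 halves the cache footprint versus f16, which matters when VRAM is already blown by the weights. The standard sizing formula is below -- the shapes in the example are placeholders, not qwen3-next's actual config (its hybrid attention scheme shrinks the real cache further):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int) -> int:
    """Standard transformer KV cache size.

    K and V each store n_layers * n_kv_heads * head_dim values per
    context position; bytes_per_elem is 2 for f16, 1 for q8.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
```

For a hypothetical 48-layer model with 8 KV heads of dim 128, a full 128k context at q8 would need ~12GB -- half the f16 figure, but still a big slice of a 32GB card.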

Workload                         TTFT    Time   Prompt      Gen  P tok/s  G tok/s
---------------------------------------------------------------------------------
Short Code Gen (Dijkstra)        5.9s     68s       74      754     93.8     12.2
Code Comprehension (100L hash)   3.7s    271s    1,131    3,202    319.3     12.0
Bug Fixing (connection pool)     2.9s    478s      876    5,629    308.7     11.9
Security Review (Flask app)      2.7s    329s      758    3,905    285.6     12.0
System Design (persistent BST)   1.0s    407s      141    4,857    156.1     12.0
Code Refactoring                 2.8s    242s      823    2,862    309.1     12.0
Tool Calling (single turn)       3.6s      9s    1,093       66    309.7     12.2
Tool Calling (multi-turn, 3T)   11.2s     43s    2,746      326    407.8     11.8
Needle in Haystack (~15k ctx)   29.4s    210s   10,576    1,962    361.2     10.9
Long Generation (mini-Celery)    1.1s  1,001s      238   11,342    242.3     11.4
---------------------------------------------------------------------------------
TOTAL                                  ~51min   18,456   34,905           11.4 avg

(TTFT = time to first token; Prompt/Gen = token counts; P/G tok/s = prompt-processing / generation throughput)

Key observations:

- The model is 51GB, so it runs at a 53% CPU / 47% GPU split since it exceeds the 5090's 32GB of VRAM.

- Generation throughput is a dead-flat 12 tok/s regardless of task complexity. The CPU is the bottleneck (53% of layers offloaded to RAM), not the GPU -- GPU utilization sits at 3-4%.

- Prompt processing is fast: 280-410 tok/s. The 5090 helps here.

- TTFT scales linearly with context: 1-6s for small prompts, 29s for the 10k+ token needle-in-haystack test.

- Sustained generation holds steady -- the 16-minute long generation test (11,342 tokens, full distributed task queue implementation) showed no throughput degradation.

- Quality was solid: correctly identified a buried bug report comment in 3000 lines of code (JIRA ticket number, proposed fix), found the N+1 query in the multi-turn agent test, caught all major OWASP vulns in the security review, generated correct tool calls with proper parameters.

- Thinking mode was not enabled (/nothink). Would be interesting to compare with /think.

The 12 tok/s ceiling is purely the CPU/RAM bandwidth wall from the offload split. If you have 48GB+ VRAM (dual GPU or a future card) this model would fly. On this setup it's usable but you feel the wait on anything over ~1000 tokens.
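The bandwidth wall has a simple back-of-envelope form: every generated token has to stream the CPU-resident weights through system RAM once, so decode speed is capped at RAM bandwidth divided by bytes read per token (for a sparse MoE like qwen3-next, if I have the architecture right, only the activated experts' weights count, not the full 51GB). The numbers in the example are illustrative assumptions, not measurements from this run:

```python
def offload_tok_s_ceiling(ram_bytes_per_token: float, ram_bw_gbs: float) -> float:
    """Rough decode ceiling when part of the model lives in system RAM.

    ram_bytes_per_token: weight bytes streamed from RAM per generated
    token (for an MoE, only the activated experts resident in RAM).
    ram_bw_gbs: effective system RAM read bandwidth in GB/s.
    """
    return ram_bw_gbs * 1e9 / ram_bytes_per_token
```

E.g. with a hypothetical 2GB of RAM-resident weights touched per token on an 80 GB/s dual-channel DDR5 setup, the ceiling is 40 tok/s -- the GPU half of the model never becomes the limit, which matches the 3-4% GPU utilization observed above.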




