Hacker News | tigershark's comments

What about NVDA (~$4.5T) and AVGO (~$1.8T)?


How much more was USD worth at the beginning of the year?


Flux Kontext quality is noticeably worse than nano banana, Qwen Image 2509, and Seedream 4 most of the time. For pure image generation, on the other hand, Hunyuan Image is scarily good.


Seedream 4 is better than nano banana on average, so that test result seems accurate to me.


The biggest model they used has only 760M parameters, and it outperforms models an order of magnitude larger.


Gah, damn.


Yeah, less than the distance from Milan to my city, still within Italy…


I’m not an expert, but I would say extremely small.

For comparison, Hunyuan Video encodes a shit-ton of video, and a rudimentary understanding of real-world physics, at very high quality in only 13B parameters. Llama 3.3 encodes a good chunk of all the knowledge available to humanity in only 70B parameters. And this is only considering open-source models; the closed-source ones may be even more efficient.


Maybe we have different understandings of what extremely small means (including that emphasis), but an LLM is by definition not that (the first L). I'm no expert either, but the smaller value mentioned is 13e9 parameters. If these things are 8-bit integers, that's 13 GB of data (more for a normal integer or a float). That's a significant percentage of the long-term storage on a phone (especially Apple models), let alone that it would fit in RAM on even most desktops, which is afaik required for useful speeds. Taking this as an upper bound and saying it must be extremely small to encode only landmarks, idk. I'd be impressed if it came down to a few dozen megabytes, but with potentially hundreds of such mildly useful neural nets, it adds up, and it isn't so small that you'd include it as a no-brainer either.
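
A quick back-of-the-envelope check of that arithmetic (just a sketch: the only inputs are the 13B/70B parameter counts from the comments above and standard dtype sizes; real deployments need extra memory for activations and runtime on top of the raw weights):

  # Raw weight storage for a model, assuming one stored value per parameter.
  def model_size_gb(n_params, bytes_per_param):
      return n_params * bytes_per_param / 1e9

  for n_params in (13e9, 70e9):
      for dtype, nbytes in (("int8", 1), ("fp16", 2), ("fp32", 4)):
          print(f"{n_params / 1e9:.0f}B @ {dtype}: "
                f"{model_size_gb(n_params, nbytes):.0f} GB")

At int8 the 13B model lands exactly on the 13 GB figure above; fp16 doubles it, and the 70B model at fp16 is already 140 GB.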


Exactly. The previous version of o1 actually did worse in the coding benchmarks, so I would expect it to be worse in real-life scenarios. The new version released a few days ago, on the other hand, is better in the benchmarks, so it would seem strange for someone to use it and say that it's worse than Claude.


Where is the plateau? ChatGPT-4 was at ~0% on ARC-AGI. 4o was at 5%. This model literally solved it, with a score higher than the average human's 85%. And let's not forget the unbelievable 25% on FrontierMath, where even the most brilliant mathematicians in the world cannot solve many of the problems by themselves. We are talking about cutting-edge math research problems that are out of reach for practically everyone. You will get a rude awakening if you call this unbelievable advancement a "plateau".


I don't care about benchmarks. o1 ranks higher than Claude on "benchmarks" but performs worse in particular real-life coding situations. I'll judge the model myself by how useful/correct it is for my tasks rather than by hypothetical benchmarks.


In most non-competitive coding benchmarks (Aider, LiveBench, SWE-bench), o1 ranks worse than Sonnet (so the benchmarks aren't saying anything different), or at least it did; the new checkpoint from 2 days ago finally pushed o1 over Sonnet on LiveBench.


As I said, o3 demonstrated Fields Medal-level research capacity in the FrontierMath tests. But I'm sure that your use cases are much more difficult than that, obviously.


There are many comments on the internet about this: only a subset of the FrontierMath benchmark is "Fields Medal-level research", and o3 likely scored on the easier subset.

Also, all of that is shady in that it's just numbers from OAI, not reproducible by anyone else, on a benchmark sponsored by OAI. If we allow that OAI could be a bad actor, they had plenty of opportunities to cheat on this.


“Objective benchmarks are useless, let’s argue about which one works better for me personally.”


Yes. Passing both my benchmarks and their benchmarks means AGI. Passing only their benchmarks just means it's overfitted.


OK, so what if we get different results on our own personal benchmarks/use cases?

(See why objective benchmarks exist?)


Yes, "objective" benchmarks can be gamed, real-life tasks cannot.


AI benchmarks and tests that claim to measure understanding, reasoning, intelligence, and so on are a dime a dozen. Chess, Go, Atari, Jeopardy, Raven's Progressive Matrices, the Winograd Schema Challenge, Starcraft... and so on and so forth.

Or let's talk about the breakthroughs. SVMs would lead us to AGI. Then LSTMs would lead us to AGI. Then Convnets would lead us to AGI. Then DeepRL would lead us to AGI. Now Transformers will lead us to AGI.

Benchmarks fall right and left and we keep being led to AGI but we never get there. It leaves one with such a feeling of angst. Are we ever gonna get to AGI? When's Godot coming?


You must be stuck at SDXL to post something as absolutely and verifiably false as the sentence above.

