
Obviously this is more of a personal project to help you out, but could be cool if there were more characters (rest of lowercase, uppercase, special) and the speed kept increasing. Reached level 208 with a score of 2083 before I got bored.


Horrible comparison given one score was achieved using 32-shot CoT (Gemini) and the other was 5-shot (GPT-4).


CoT@32 isn't "32-shot CoT"; it's CoT with 32 samples (or rollouts) from the model, and the answer is taken by consensus vote from those rollouts. It doesn't use any extra data, only extra compute. It's explained in the tech report here:

> We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach (Wei et al., 2022) that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood choice without chain of thought.

(They could certainly have been clearer about it -- I don't see anywhere they explicitly explain the CoT@k notation, but I'm pretty sure this is what they're referring to given that they report CoT@8 and CoT@32 in various places, and use 8 and 32 as the example numbers in the quoted paragraph. I'm not entirely clear on whether CoT@32 uses the 5-shot examples or not, though; it might be 0-shot?)
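The consensus procedure described in the quoted paragraph could be sketched roughly like this. (This is a guess at the mechanics, not the actual implementation: `generate_cot`, `generate_greedy`, and the threshold value are all hypothetical stand-ins, since the report only says the threshold is selected on the validation split.)

```python
from collections import Counter

def cot_at_k(generate_cot, generate_greedy, prompt, k=32, threshold=0.4):
    """Sketch of the CoT@k procedure from the Gemini tech report.

    `generate_cot(prompt)` is assumed to return a final answer extracted
    from one chain-of-thought rollout; `generate_greedy(prompt)` is a
    max-likelihood sample without chain of thought. Both are hypothetical.
    """
    # Sample k chain-of-thought rollouts and tally the final answers.
    answers = [generate_cot(prompt) for _ in range(k)]
    answer, votes = Counter(answers).most_common(1)[0]
    # If the consensus fraction clears the preset threshold, take the
    # majority answer; otherwise fall back to the greedy non-CoT sample.
    if votes / k >= threshold:
        return answer
    return generate_greedy(prompt)
```

So it's extra compute at inference time (k rollouts plus a vote), not extra training data.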

The 87% for GPT-4 is also with CoT@32, so it's more or less "fair" to compare that with Gemini's 90% with CoT@32. (Although getting to choose the metric you report for both models is probably a little "unfair".)

It's also fair to point out that with the more "standard" 5-shot eval Gemini does do significantly worse than GPT-4 at 83.7% (Gemini) vs 86.4% (GPT-4).


> I'm not entirely clear on whether CoT@32 uses the 5-shot examples or not, though; it might be 0-shot?

Chain-of-thought prompting, as defined in the referenced paper, is a modification of few-shot prompting in which the example Q/A pairs include chain-of-thought-style reasoning alongside the question and answer. So if they were using a 0-shot method (even one designed to elicit CoT-style output), I don't think they would call it chain of thought and cite that paper.
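As a concrete illustration of that definition: a few-shot CoT prompt in the Wei et al. style embeds worked reasoning in each exemplar rather than a bare answer. The exemplar and helper below are made up for illustration, not taken from any benchmark.

```python
# Each exemplar pairs a question with chain-of-thought reasoning that
# ends in the answer; the model is expected to imitate that pattern.
EXEMPLARS = [
    ("If there are 3 cars and 2 more arrive, how many cars are there?",
     "There are 3 cars initially. 2 more arrive, so 3 + 2 = 5. The answer is 5."),
]

def build_cot_prompt(question, exemplars=EXEMPLARS):
    parts = []
    for q, reasoning in exemplars:
        parts.append(f"Q: {q}\nA: {reasoning}")
    # The trailing "A:" leaves room for the model's own reasoning.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

A 0-shot variant would drop the exemplars entirely (e.g. just appending "Let's think step by step"), which is why the citation suggests they used the few-shot form.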


A-ha, thanks! Hadn't looked at or heard of the referenced paper, but yeah, sounds like it's almost certainly also 5-shot then.

It would've been more consistent to call it e.g. "5-shot w/ CoT@32" in that case, but I guess there's only so much you can squeeze into a table.


The vibe I was getting from the paper was that they think something's funny about GPT-4's 5-shot MMLU (e.g. possibly leakage into the training set).


Sounds like something right up the alley of synthetic data.


Yeah that's what I thought. Nothing I downloaded in the past week showed up on any of the VPN IPs I regularly use.


"Unreal Engine will no longer be free for non-gaming companies" is 100x less misleading. You are right in that technically speaking it's not misleading, but to me it feels like it's skirting the line of lying by omission. Obviously now with hindsight I can see how the "for all" changes the meaning, but it wasn't obvious (at least to me).


OpenOrca's goal is to provide an open-source replica of Microsoft's Orca 13B model, so changing the name makes no sense.


I wonder how it would perform if you fed it a comprehensive chess rulebook and told it to avoid all possible illegal moves.


I think it'll stop when browsers integrate an LLM-powered summary tool.


Edge did, though better integration would be nice (a context menu, perhaps?).


Considering that ChatGPT is going to run on GPT-4 and is more or less free, why would someone use this paid feature instead of talking to ChatGPT?


GPT-4 on ChatGPT is currently only available for paying users.


The original LLaMA paper has some benchmarks. https://arxiv.org/pdf/2302.13971.pdf

