The Phi models always seem to do really well on benchmarks, but in real-world use they fall way behind competing models.
>the biggest indication of flawed leadership is a company or agency leadership photo where the majority percentage of the people in it are all the same skin tone.
Does this opinion come from your actual experience or just from your ideological indoctrination?
Virtually all non-Western businesses have zero concern about fostering racial diversity; are they all failures, in your opinion?
Given how quickly AI is progressing from the software side, and how poorly AI scales from just throwing raw compute time at a model, I don't see a company holding onto the lead for very long with that strategy.
If I can come out with a model a year later that provides 95% of the performance while costing 10% as much to run, I think I'd steal a lot of customers before they had a chance to break even.
Take Llama3-8B for example: an 8 billion parameter model from 2024 that performs about as well as the original ChatGPT, a 175 billion parameter model from 2022. It only took 2 years before a model that can run on a desktop could compete with a model that required a data center.
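To put that gap in rough hardware terms, here's a back-of-the-envelope sketch of the memory needed just to hold the weights at fp16 (2 bytes per parameter) — ignoring activations, KV cache, and quantization, which all shift the numbers:

```python
def weight_memory_gb(params_billions, bytes_per_param=2):
    """Rough memory (GB) just to hold the weights; fp16 = 2 bytes/param."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# An 8B model fits in 16 GB at fp16 -- single consumer GPU / desktop territory
# (even less with 4-bit quantization).
print(weight_memory_gb(8))    # 16.0

# A 175B model needs ~350 GB just for weights -- multi-GPU server territory.
print(weight_memory_gb(175))  # 350.0
```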
LLMs actually scale extremely well just by throwing compute at them. That's the whole reason they took off. Training a bigger model, training it longer, or increasing the dataset all work more or less equally well. Now that we've pretty much saturated the dataset component (at least for human-written text), everyone throws their compute at bigger models or more epochs.
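The model-size-vs-data tradeoff here is roughly what the Chinchilla scaling work formalized. A toy sketch of splitting a compute budget, using two common approximations (both rules of thumb, not exact): training cost C ≈ 6·N·D FLOPs, and a compute-optimal ratio of ~20 training tokens per parameter:

```python
def chinchilla_optimal(compute_budget_flops, tokens_per_param=20):
    """Split a FLOP budget between model size N and dataset size D.

    Uses C = 6 * N * D (rough transformer training cost) and the
    Chinchilla rule of thumb D ~= 20 * N, so N = sqrt(C / 120).
    Both are approximations, not exact laws.
    """
    n_params = (compute_budget_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e24 FLOP training budget
n, d = chinchilla_optimal(1e24)
print(f"{n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")  # 91B params, 1.8T tokens
```

Once the ~1.8T tokens above start to exceed the usable text you can actually collect, the only levers left are the ones the comment names: more parameters or more passes over the same data.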
Well, I guess the question I have is, what exactly does he mean by the "cost to train"? As in, just the cost of the electricity used to train that one model? That seems really excessive.
Or is it the total overall cost of buying TPUs / GPUs, developing infrastructure, constructing data centers, putting together quality data sets, doing R&D, paying salaries, etc. as well as training the model itself? I could see that overall investment into AI scaling into the tens of billions over the next few years.
Well, the statement that GPT-4 is 1.8T parameters is a little misleading, since it's really an 8 x 220B MoE (according to the rumors, at least).
Also, the size of the model itself isn't the only factor that determines performance: Llama 3 70B outperforms Llama 2 70B even though they're the same size.
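The "misleading" part of the MoE headline number is easy to see with some arithmetic. This sketch uses the rumored figures from the comment above (8 experts of ~220B, reportedly 2 routed per token); it's a naive count that ignores how attention weights are shared across experts, so treat it as illustration, not a real parameter audit:

```python
# Rumored GPT-4 architecture: 8 experts x ~220B params, 2 experts per token.
n_experts = 8
params_per_expert = 220e9
experts_per_token = 2

total_params = n_experts * params_per_expert           # the headline number
active_params = experts_per_token * params_per_expert  # used per token

print(f"total:  {total_params / 1e12:.2f}T params")  # total:  1.76T params
print(f"active: {active_params / 1e9:.0f}B params")  # active: 440B params
```

So the "1.8T" figure describes storage, while the compute per token looks more like a ~440B dense model — which is why total parameter count alone says little about either speed or quality.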