
> 8 billion people all publishing random info can't be listened to.

Yet it's what we train LLMs on.



It's what we train LLMs on to teach them language, something all healthy adult humans are expert at using. It's definitely not what we train LLMs on if we want them to do science.


There's a paper, "Textbooks Are All You Need": https://arxiv.org/abs/2306.11644

> We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.
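For context on the numbers in that abstract: pass@k is conventionally computed with the unbiased estimator introduced in the Codex/HumanEval paper (Chen et al., 2021), pass@k = 1 - C(n-c, k)/C(n, k), where n samples are drawn per problem and c of them pass the tests. A minimal sketch (function name is mine):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator:
    n = samples generated per problem,
    c = samples that pass all unit tests,
    k = evaluation budget."""
    if n - c < k:
        # every size-k draw contains at least one passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples of which 5 pass, pass@1 estimates a 50% chance
# that a single drawn sample solves the problem:
print(pass_at_k(10, 5, 1))  # 0.5
```

Averaged over all benchmark problems, this is the figure reported as "pass@1 accuracy 50.6% on HumanEval" above.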

We train on the internet because, for example, I speak a fairly niche English dialect influenced by Hebrew, Yiddish and Aramaic, and no digitised textbooks or dictionaries cover it. I assume the base weights of models still draw on high-quality materials as well.


Which are known to be unreliable beyond the basic things that most people with relevant experience get right anyway.



