> 8 billion people all publishing random info can't be listened to. Yet it's wha...

tsimionescu · 2025-05-24T18:51:03 1748112663

It's what we train LLMs on to make them learn language, a thing that all healthy adult human beings are experts at using. It's definitely not what we train LLMs on if we want them to do science.

verbify · 2025-05-24T19:43:22 1748115802

There's a paper, "Textbooks Are All You Need": https://arxiv.org/abs/2306.11644

> We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.

We train on the internet because, for example, I speak a fairly niche English dialect influenced by Hebrew, Yiddish and Aramaic, and there are no digitised textbooks or dictionaries that cover this language. I assume the base weights of models still come from high-quality materials.

birn559 · 2025-05-24T18:55:02 1748112902

Which are known to be unreliable beyond basic things that most people with some relevant experience get right anyway.