And no, I don't think the knowledge of language is necessary. To give a concrete example, tokens from TinyStories dataset (the dataset size is ~1GB) are known to be sufficient to bootstrap basic language.
And no, I don't think the knowledge of language is necessary. To give a concrete example, tokens from TinyStories dataset (the dataset size is ~1GB) are known to be sufficient to bootstrap basic language.