
Then, you should check out papers like https://arxiv.org/abs/2302.13971 and https://arxiv.org/abs/2307.09288

In the paper covering the original Llama, they explicitly list their data sources in Table 1, including saying that they pretrained on the somewhat controversial Books3 dataset.

The paper for Llama 2 also explicitly says they don't take data from Meta's products and services, and that they filter out data from sites known to contain a lot of PII, although it is more coy about precisely which data sources were used, as many such papers are.
