This seems like what xkcd called "citogenesis" - something inaccurate gets published on Wikipedia, a traditional "reliable source" (newspaper, published nonfiction book, etc.) repeats that bit of information without attribution, Wikipedia then cites that source, and the information is now firmly established as truth for the rest of humanity.
I worry about this with Google's language translation models, too. It's entirely possible that they're making up phrases or connotations that never existed in organic human speech, but people who aren't fully fluent in the language use Google Translate for assistance, publish something, and then suddenly it's in a published text by an actual human and Google reinforces its own belief.
For at least the past five years, Google Translate has translated "who proceeds from the father" into Latin as "qui ex patre filioque procedit" - inserting the additional word "filioque," which means "and the son." The question of whether to add this word is a 1500-year-old theological argument (see https://en.wikipedia.org/wiki/Filioque ). Since the Western Church added the word, most texts in Latin include it, so Google is almost certainly deciding that the phrasing with "filioque" is more popular - but it doesn't know what the words mean, so it can't realize that the phrase it came up with means something different!
I met an artist whose body of work consisted of citogenesis. He'd been at it for a decade when I met him, and had created at least a dozen fictional artists, with detailed biographies and a complete "retrospective" gallery show for each. This let him copy the styles of the great masters of different periods, yet make something new, of a sort. He never edited Wikipedia himself, but instead led gallery tours of these retrospectives and was always overjoyed when people would write about them, propagating the fiction.
If I'm reading that post right, Latin and other languages all use the same model; there's just more training data for the more common pairs of languages.
How do you enforce that other people provide provenance on text?
Taking the translation example - how do you enforce that users of Google Translate attach provenance to their translated text, and that it stays with the text wherever it gets republished?