
This seems like what xkcd called "citogenesis" - something inaccurate gets published on Wikipedia, a traditional "reliable source" (newspaper, published nonfiction book, etc.) retells that bit of information without a source, Wikipedia then cites that publication, and the information is now firmly established as truth for the rest of humanity.

I worry about this with Google's language translation models, too. It's entirely possible that it's making up phrases or connotations that never existed in organic human speech, but people who aren't fully fluent in the language use Google Translate for assistance, publish something, and then suddenly it's in a published text by an actual human and Google reinforces its own belief.

For at least the past five years, Google Translate has translated "who proceeds from the father" into Latin as "qui ex patre filioque procedit" - inserting the additional word "filioque," which means "and the son." The question of whether to add this word is a 1500-year-old theological argument (https://en.wikipedia.org/wiki/Filioque). Since the Western Church added the word, most texts in Latin include it, so Google is almost certainly deciding that the phrasing with "filioque" is more popular - but it doesn't know what the words mean, so it can't realize that the phrase it came up with means something different!



I met an artist whose body of work consisted of citogenesis. He'd been at it for a decade when I met him, and had created at least a dozen fictional artists, detailed biographies, and a complete "retrospective" gallery show for each. This let him copy the styles of the great masters of different periods, yet make something new, of a sort. He never edited Wikipedia himself, but instead led gallery tours of these retrospectives and was always overjoyed when people would write about them, propagating the fiction.


Google Translate's Latin translations are just a short Markov model.

https://www.reddit.com/r/latin/comments/6akqdi/why_is_google...

Translation for many (most?) other languages does something more sophisticated.
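
If the Markov-model description is right, a toy sketch shows how the "filioque" insertion could fall out of plain frequency counts. The corpus below is made up for illustration (three texts with the word, one without), and `train_bigrams` and `most_likely_continuation` are hypothetical helper names, not anything Google is known to use:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: three texts with "filioque", one without.
# The counts are invented for illustration only.
corpus = [
    "qui ex patre filioque procedit",
    "qui ex patre filioque procedit",
    "qui ex patre filioque procedit",
    "qui ex patre procedit",
]

def train_bigrams(sentences):
    """Count, for every word, which words follow it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def most_likely_continuation(counts, start, max_words=10):
    """Greedily follow the most frequent next word from `start`."""
    words = [start]
    while len(words) < max_words and counts[words[-1]]:
        next_word, _ = counts[words[-1]].most_common(1)[0]
        words.append(next_word)
    return " ".join(words)

model = train_bigrams(corpus)
result = most_likely_continuation(model, "qui")
print(result)
```

After "patre" the model has seen "filioque" three times and "procedit" once, so it picks "filioque" with no notion of what either word means.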


I really wish someone would come up with some better Latin translation software/service. Google Translate is so bad for Latin it is close to useless.


If I'm reading that post right, Latin and other languages all use the same model; they just have more training data for more common pairs of languages.


They use some kind of neural net thing for other languages, e.g. here is a paper: https://arxiv.org/abs/1609.08144


This is an issue solvable with better provenance / data lineage, which has come up in recent HN discussions.


How do you enforce that other people provide provenance on text?

Taking the translation example - how do you enforce that users of Google Translate attach provenance to their translated text, and that it stays with the text?


Sounds like a kind of cyber-meme?



