
If, hypothetically, libraries in the US - in particular the Library of Congress - were to scan and OCR every book, newspaper, and magazine they hold whose copyright has already expired, would that be enough? Is there an estimate of the size of such a dataset?
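For a rough sense of scale, here's a back-of-envelope sketch in Python. Every input is a guess, not a measured figure - the volume count, average book length, and tokens-per-word ratio are all assumptions:

    # Back-of-envelope size of a hypothetical public-domain corpus.
    # All inputs below are assumptions, not measured figures.
    PD_VOLUMES = 5_000_000      # guess: pre-1928 volumes actually scanned
    WORDS_PER_VOLUME = 80_000   # guess: average book length
    BYTES_PER_WORD = 6          # ~5 chars plus a space in plain UTF-8
    TOKENS_PER_WORD = 1.3       # typical subword-tokenizer ratio

    raw_bytes = PD_VOLUMES * WORDS_PER_VOLUME * BYTES_PER_WORD
    tokens = PD_VOLUMES * WORDS_PER_VOLUME * TOKENS_PER_WORD

    print(f"~{raw_bytes / 1e12:.1f} TB of plain text")   # ~2.4 TB
    print(f"~{tokens / 1e9:.0f}B tokens")                # ~520B tokens

Even with fairly generous inputs that lands around half a trillion tokens, well short of the 10T+ tokens recent large models are reportedly trained on, so raw size alone is probably the binding constraint.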


Much of that material is already available at https://archive.org. It might be good enough for some purposes, but limiting it to works published before 1928 (the public-domain cutoff in the United States) isn't going to be very helpful for, e.g., coding.

Maybe if you added GitHub projects with permissive licenses?
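If anyone wants to eyeball how much permissively licensed code is out there, GitHub's search API supports a license: qualifier. A minimal sketch (unauthenticated, so heavily rate-limited; the license list is just a sample):

    import requests

    # Count repositories per permissive license via GitHub's
    # repository-search endpoint. "license:" is real search syntax;
    # without an auth token, requests are tightly rate-limited.
    for lic in ["mit", "apache-2.0", "bsd-3-clause"]:
        r = requests.get(
            "https://api.github.com/search/repositories",
            params={"q": f"license:{lic}", "per_page": 1},
            headers={"Accept": "application/vnd.github+json"},
        )
        r.raise_for_status()
        print(lic, r.json()["total_count"])

Of course, a repo's declared license says nothing about whether its contents are actually cleanly licensed, so counts like these are an upper bound.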



