If you just want to segment larger blocks of text into tokens, you can try the segment library, which implements the word boundary portion of Unicode Standard Annex #29 (Unicode Text Segmentation):

https://github.com/blevesearch/segment

If you need to manipulate the tokens further after segmentation/tokenization, you could look at the analysis sub-package of bleve. It's intended to be usable independently of the rest of the library.

https://github.com/blevesearch/bleve


