I have a small batch (~900) of Russian sentences and tried running the default word2vec example on it. After training finished, I noticed that the results contained many entries that are the same word in different forms (declension, case, ending), like (cat, cat’s, cats) or (кошка, кошку, кошки). Does dl4j contain any tools for word normalization/lemmatization?
Or maybe there is a widely used Java tool for this?
I think I found a tool for lemmatizing Russian texts: https://github.com/alexeyev/mystem-scala
Actually, it’s a wrapper for the proprietary Mystem from Yandex. It has some usage limitations, but for personal use it’s OK.
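To illustrate the idea of collapsing word forms before training, here is a minimal sketch in plain Java. It does not use Mystem or any dl4j API: the `LemmaNormalizer` class and its tiny hand-built lookup table are hypothetical stand-ins for a real lemmatizer, and the example only shows where the normalization step would sit (after tokenization, before the tokens reach word2vec).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps inflected forms to a single lemma.
// A real setup would call a morphological analyzer instead of this table.
public class LemmaNormalizer {
    private final Map<String, String> lemmas = new HashMap<>();

    public LemmaNormalizer() {
        // Assumed toy dictionary: inflected form -> lemma.
        lemmas.put("кошку", "кошка");
        lemmas.put("кошки", "кошка");
        lemmas.put("кошка", "кошка");
    }

    // Lowercases the token and returns its lemma, or the lowercased
    // token itself when the form is not in the dictionary.
    public String lemma(String token) {
        String key = token.toLowerCase(Locale.ROOT);
        return lemmas.getOrDefault(key, key);
    }

    // Splits on any run of non-letter characters and lemmatizes each token.
    public List<String> normalize(String sentence) {
        List<String> out = new ArrayList<>();
        for (String token : sentence.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                out.add(lemma(token));
            }
        }
        return out;
    }
}
```

With this in place, `new LemmaNormalizer().normalize("Кошку видят кошки.")` yields the same token for both forms of "кошка", so word2vec would see one vocabulary entry instead of several. In dl4j, a step like this could presumably be plugged in as a token pre-processor on the tokenizer factory, but I have not verified that wiring here.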