Foreign text lemmatisation?

sbaldin · November 10, 2020, 5:17pm

Hello.
I have a small batch (~900) of Russian sentences and tried running the default word2vec example on it. After training finished, I noticed that the results contained many similar words that differ only in form (declension, case, ending), like (cat, cat’s, cats) or (кошка, кошку, кошки). Does DL4J contain any tools for word normalization/lemmatization?
Or is there a widely used Java tool for this?

sbaldin · November 15, 2020, 3:13pm

I think I found a tool for lemmatising Russian text: GitHub - alexeyev/mystem-scala: Morphological analyzer `mystem` (Russian language) wrapper for JVM languages
It is actually a wrapper for the proprietary Mystem from Yandex, so it comes with some usage limitations, but for personal use it’s OK.
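
For anyone landing here later, this is a minimal sketch of the normalization step itself: map every word form to a single lemma before the sentences reach word2vec. It does not use mystem-scala or any DL4J API. The class name `LemmaNormalizer` and the tiny hand-written dictionary are hypothetical, and in practice the form-to-lemma mapping would come from a morphological analyzer such as Mystem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of DL4J or mystem-scala.
public class LemmaNormalizer {
    // Word form -> lemma. Assumed to be filled from a real analyzer's output.
    private final Map<String, String> lemmas;

    public LemmaNormalizer(Map<String, String> lemmas) {
        this.lemmas = lemmas;
    }

    // Lowercases the token and returns its lemma, or the lowercased token if unknown.
    public String lemma(String token) {
        String key = token.toLowerCase(Locale.ROOT);
        return lemmas.getOrDefault(key, key);
    }

    // Splits on anything that is not a Unicode letter and lemmatises each token.
    public List<String> normalize(String sentence) {
        List<String> out = new ArrayList<>();
        for (String token : sentence.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                out.add(lemma(token));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy dictionary for illustration only.
        LemmaNormalizer n = new LemmaNormalizer(
                Map.of("кошку", "кошка", "кошки", "кошка"));
        System.out.println(n.normalize("Кошку, кошки и кошка"));
    }
}
```

With this in place, the three forms from the example above collapse into the single token кошка, so word2vec sees one vocabulary entry instead of three. On a corpus of only ~900 sentences that should help, since each lemma gets more training contexts.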
