Deeplearning4j NLP: how to add Chinese word segmentation when using TfidfVectorizer

cxp · July 26, 2022, 7:48am

// DefaultTokenizerFactory is designed for English. For Chinese text, tokenization
// needs a Chinese word segmenter such as ansj or HanLP instead.
DefaultTokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();
// tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());

// iter is the sentence iterator over the corpus, defined elsewhere.
TfidfVectorizer vectorizer = new TfidfVectorizer.Builder()
        .setMinWordFrequency(1)
        .setStopWords(new ArrayList<>())
        .setTokenizerFactory(tokenizerFactory)
        .setIterator(iter)
        .build();

agibsonccc · July 26, 2022, 8:16am

@cxp Please feel free to adapt our old code for this: deeplearning4j/deeplearning4j/deeplearning4j-nlp-parent/deeplearning4j-nlp-chinese/src/main/java/org/deeplearning4j/nlp/chinese/tokenization (release/1.0.0-beta7 branch of eclipse/deeplearning4j on GitHub)

It used ansj. We weren’t maintaining it, and we needed to keep code we weren’t updating out of the code base. It should still be usable for your purposes, though.

Just make sure to implement the proper TokenizerFactory and Tokenizer by copying this code and pointing it at an updated ansj.
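
To make the shape of such a tokenizer concrete, here is a minimal sketch. It is not the ansj-based code from the linked module. It swaps in a toy greedy forward maximum-matching segmenter over a small dictionary so that it runs with the JDK alone. The class name MaxMatchTokenizer, the dictionary, and the maxWordLength parameter are all hypothetical. The method names mirror DL4J's Tokenizer interface (org.deeplearning4j.text.tokenization.tokenizer.Tokenizer), which additionally declares setTokenPreProcessor; a real version would implement that interface, delegate segmentation to ansj or HanLP, and be returned from the create(String) method of a matching TokenizerFactory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a real segmenter such as ansj or HanLP.
// Greedy forward maximum matching: at each position take the longest
// dictionary word that fits, otherwise fall back to a single character.
class MaxMatchTokenizer {
    private final List<String> tokens = new ArrayList<>();
    private int cursor = 0;

    MaxMatchTokenizer(String text, Set<String> dictionary, int maxWordLength) {
        int i = 0;
        while (i < text.length()) {
            // Whitespace never becomes a token.
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int end = Math.min(text.length(), i + maxWordLength);
            // Shrink the window until it matches a dictionary word
            // or only one character is left.
            while (end > i + 1 && !dictionary.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
    }

    // The four methods below use the same names as DL4J's Tokenizer interface.
    boolean hasMoreTokens() {
        return cursor < tokens.size();
    }

    String nextToken() {
        return tokens.get(cursor++);
    }

    int countTokens() {
        return tokens.size();
    }

    List<String> getTokens() {
        return new ArrayList<>(tokens);
    }
}
```

For instance, with the dictionary {自然语言, 处理} and a maximum word length of 4, the input 我爱自然语言处理 is split into 我, 爱, 自然语言, 处理. Once the class implements the real interface, a factory that returns it can be passed to setTokenizerFactory in place of DefaultTokenizerFactory.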
