Deeplearning4j NLP: how to add Chinese word segmentation when using TfidfVectorizer

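import java.util.ArrayList;
import org.deeplearning4j.bagofwords.vectorizer.TfidfVectorizer;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;

// DefaultTokenizerFactory is meant for English text. For Chinese, the tokenizer
// has to be backed by a Chinese word segmenter such as ansj or HanLP.
DefaultTokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();
// tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());

// iter is the document iterator, defined elsewhere.
TfidfVectorizer vectorizer = new TfidfVectorizer.Builder()
        .setMinWordFrequency(1)
        .setStopWords(new ArrayList<>())
        .setTokenizerFactory(tokenizerFactory)
        .setIterator(iter)
        .build();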

@cxp please feel free to adapt our old code for this: deeplearning4j/deeplearning4j/deeplearning4j-nlp-parent/deeplearning4j-nlp-chinese/src/main/java/org/deeplearning4j/nlp/chinese/tokenization at release/1.0.0-beta7 in eclipse/deeplearning4j on GitHub.

It used ansj. We just weren’t maintaining it, and we needed to keep code we weren’t updating out of the code base. It should still be usable for your purposes, though.

Just make sure to implement the proper TokenizerFactory and Tokenizer by copying that code and switching it to an updated ansj.
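
To give a rough idea of the shape such a tokenizer takes, here is a minimal sketch that compiles with the JDK alone. It is not the removed deeplearning4j-nlp-chinese code and it does not call ansj: the class name ChineseTokenizerSketch, the DictionaryTokenizer class, and the tiny dictionary are all hypothetical, and the forward maximum matching segmenter is only a stand-in for a real segmenter. The four methods mirror the core of DL4J's Tokenizer interface (hasMoreTokens, countTokens, nextToken, getTokens). In a real project the class would declare that it implements org.deeplearning4j.text.tokenization.tokenizer.Tokenizer, also provide setTokenPreProcessor, and be returned from the create(String) method of your own TokenizerFactory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ChineseTokenizerSketch {

    /**
     * Hypothetical tokenizer whose methods mirror the core of DL4J's Tokenizer
     * interface. It does not implement that interface here so the sketch has
     * no dependency on DL4J.
     */
    public static class DictionaryTokenizer {
        private final List<String> tokens;
        private int index = 0;

        public DictionaryTokenizer(String text, Set<String> dictionary, int maxWordLength) {
            // Segment once up front; with ansj this is where its output would be collected.
            this.tokens = segment(text, dictionary, maxWordLength);
        }

        public boolean hasMoreTokens() { return index < tokens.size(); }

        public int countTokens() { return tokens.size(); }

        public String nextToken() { return tokens.get(index++); }

        public List<String> getTokens() { return new ArrayList<>(tokens); }
    }

    /**
     * Forward maximum matching: at each position take the longest dictionary
     * word of at most maxWordLength characters, falling back to one character.
     * A stand-in for a real segmenter; it works on char units and ignores
     * surrogate pairs.
     */
    public static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++; // whitespace never becomes a token
                continue;
            }
            int end = Math.min(text.length(), i + maxWordLength);
            // Shrink the window until it is a dictionary word or a single character.
            while (end > i + 1 && !dictionary.contains(text.substring(i, end))) {
                end--;
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("北京", "天安门"));
        DictionaryTokenizer tokenizer = new DictionaryTokenizer("我爱北京天安门", dictionary, 3);
        while (tokenizer.hasMoreTokens()) {
            System.out.println(tokenizer.nextToken());
        }
    }
}
```

Once the factory exists, it is passed to setTokenizerFactory in the TfidfVectorizer builder in place of DefaultTokenizerFactory. The rest of the builder stays the same.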