I am trying to build a Word2Vec model in parts.
I split the original text file into several parts to reduce memory consumption during training.
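// initEmptyModel(), hasNextFile() and getNextFile() are my own helper methods
Word2Vec word2Vec = initEmptyModel();
while (hasNextFile()) {
    // Tokenize each line and lower-case every token
    TokenizerFactory tokenizerFactory = new UimaTokenizerFactory();
    tokenizerFactory.setTokenPreProcessor(new LowCasePreProcessor());
    word2Vec.setTokenizerFactory(tokenizerFactory);
    // Point the same model at the next part of the corpus
    word2Vec.setSentenceIterator(new BasicLineIterator(getNextFile()));
    // Rebuild the vocabulary, then train on this part
    word2Vec.buildVocab();
    word2Vec.fit();
}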
Processing the first file succeeds, but processing the second one always fails.
Maybe I’m not using the API correctly for this case. Can anyone help me resolve the problem?
INFO o.d.m.s.SequenceVectors - Starting vocabulary building...
INFO o.d.m.w.wordstore.VocabConstructor - Sequences checked: [1], Current vocabulary size: [244448]; Sequences/sec: [0.01];
INFO o.d.m.s.SequenceVectors - Starting vocabulary building...
INFO o.d.m.w.wordstore.VocabConstructor - Sequences checked: [1], Current vocabulary size: [244448]; Sequences/sec: [0.01];
INFO o.d.m.e.loader.WordVectorSerializer - Projected memory use for model: [186.50 MB]
INFO o.d.m.e.inmemory.InMemoryLookupTable - Initializing syn1...
INFO o.d.m.s.SequenceVectors - Building learning algorithms:
INFO o.d.m.s.SequenceVectors - building ElementsLearningAlgorithm: [SkipGram]
INFO o.d.m.s.SequenceVectors - Starting learning process...
INFO o.d.m.s.SequenceVectors - Epoch [1] finished; Elements processed so far: [5849492]; Sequences processed: [1]
INFO o.d.m.s.SequenceVectors - Time spent on training: 303155 ms
INFO o.d.m.s.SequenceVectors - Starting vocabulary building...
INFO o.d.m.w.wordstore.VocabConstructor - Sequences checked: [1], Current vocabulary size: [355772]; Sequences/sec: [0.01];
INFO o.d.m.s.SequenceVectors - Starting vocabulary building...
INFO o.d.m.w.wordstore.VocabConstructor - Sequences checked: [1], Current vocabulary size: [355772]; Sequences/sec: [0.01];
INFO o.d.m.e.inmemory.InMemoryLookupTable - Initializing syn1...
INFO o.d.m.s.SequenceVectors - Starting learning process...
Process finished with exit code -1073741819 (0xC0000005)
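
For reference, the exit code on the last line can be decoded. Read as an unsigned 32-bit value, -1073741819 is 0xC0000005, which on Windows is the access-violation status, so the process probably crashed in native code rather than failing with a Java exception. The small sketch below only shows that conversion; the class and method names are hypothetical and are not part of deeplearning4j.

```java
// Illustrative sketch: interpret a signed process exit code as an
// unsigned 32-bit status value. Names here are hypothetical.
final class ExitCodeDecoder {
    // Windows status value for an access violation (invalid memory access).
    static final long STATUS_ACCESS_VIOLATION = 0xC0000005L;

    // Reinterpret the signed 32-bit exit code as an unsigned value.
    static long toUnsigned(int exitCode) {
        return Integer.toUnsignedLong(exitCode);
    }

    // Format the exit code as 8 hex digits, the way the IDE shows it.
    static String toHex(int exitCode) {
        return String.format("0x%08X", exitCode);
    }

    // True if the exit code matches the access-violation status.
    static boolean isAccessViolation(int exitCode) {
        return toUnsigned(exitCode) == STATUS_ACCESS_VIOLATION;
    }

    public static void main(String[] args) {
        int exitCode = -1073741819; // value taken from the log above
        System.out.println(toHex(exitCode));
        System.out.println(isAccessViolation(exitCode));
    }
}
```

If this reading is right, the failure happens inside the native backend during the second fit() call, which would also explain why no Java stack trace appears in the log.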