Subset of Google model Vector file

ethiel · March 11, 2020, 8:06pm

Hi, people.
I’m using the widely known Google word vector model for sentiment analysis together with a CNN.
It took a lot of effort, but it now works fine. However, for unit testing I’d like to use a small subset of that file. I tried creating one with gensim, but the resulting file was not correct: the number of words and the vector size were not read correctly when I tried to load it with the load static method.
So, is there a way to use a small subset of that file with DL4J?

treo · March 12, 2020, 7:36am

What exactly have you tried? For the text file based word vectors, you can easily reduce its size by removing the lines that you don’t need.
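
The line-removal approach could be sketched as follows in Python. This is a minimal sketch that assumes the vectors are stored in the word2vec text format, where the first line holds the word count and the vector size, and every following line holds a word and its values. The function name `subset_word2vec_text` and the sample data are made up for illustration. Note that the header has to be rewritten so that it matches the rows that are actually kept.

```python
import io


def subset_word2vec_text(src, dst, max_words):
    """Copy at most max_words vectors from src to dst and write a matching header.

    Assumes the word2vec text format: a "<count> <size>" header line,
    then one "<word> <v1> ... <vsize>" row per line.
    """
    header = src.readline().split()
    vector_size = int(header[1])

    kept = []
    for line in src:
        if len(kept) >= max_words:
            break
        parts = line.split()
        # Skip malformed rows so the header count stays truthful.
        if len(parts) == vector_size + 1:
            kept.append(" ".join(parts))

    # The header reports the number of rows really written, not max_words.
    dst.write(f"{len(kept)} {vector_size}\n")
    for row in kept:
        dst.write(row + "\n")
    return len(kept)


# Tiny in-memory example (hypothetical data, 2-dimensional vectors).
sample = io.StringIO("3 2\nthe 0.1 0.2\nof 0.3 0.4\nand 0.5 0.6\n")
out = io.StringIO()
kept_count = subset_word2vec_text(sample, out, 2)
print(out.getvalue())
```

With real files, `src` and `dst` would be file objects opened in text mode. This does not apply to the binary variant of the Google vectors, which would have to be converted to text first.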

ethiel · March 12, 2020, 8:43am

Hi @treo, thanks for answering.
I solved my problem by manually editing the file to add the number of words and the size of the vectors. However, there is something curious about gensim: even though I told gensim to use only 200 words, it kept 184. I don’t understand why, but that was the issue.
Thanks for your help, @treo
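
The manual header fix could also be scripted. What follows is a minimal sketch, assuming the file is in the word2vec text format with an optional "<count> <size>" header line. `with_correct_header` is a hypothetical helper, not part of gensim or DL4J. It counts the rows that are really present instead of trusting the requested number, which covers the 200 versus 184 case.

```python
def with_correct_header(lines):
    """Return the rows with a header that matches the actual content.

    Assumption: a first line made of exactly two integers is an existing
    header and gets replaced; anything else is treated as a vector row.
    """
    rows = [ln.strip() for ln in lines if ln.strip()]

    first = rows[0].split()
    if len(first) == 2 and all(tok.isdigit() for tok in first):
        rows = rows[1:]  # drop the old, possibly wrong header

    # Each row is a word followed by its vector values.
    vector_size = len(rows[0].split()) - 1
    return [f"{len(rows)} {vector_size}"] + rows


# Hypothetical example: the header claims 200 words, but only 2 are present.
broken = ["200 3", "good 0.1 0.2 0.3", "bad 0.4 0.5 0.6"]
fixed = with_correct_header(broken)
print("\n".join(fixed))
```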
