FastText - Getting vectors for misspelled / unknown words?

I’m using the FastText.java class included in DL4J to access FastText functionality:

One of the main features of FastText is that it can provide vectors even for misspellings / unknown words. For example, according to the FastText FAQ:

However, if I try to access a vector for an unknown word using fast.getWordVectorMatrix(word) in FastText.java, it just returns a default vector. I’d like it to retrieve a vector from the underlying fastText implementation instead.

Is this possible?

What type of FastText vectors are you loading?

Keep in mind there are two types of vectors for FastText - text/.vec and binary/.bin.
The text vectors are word-level only - no subword information.
Only the binary vectors have subword information and hence can return proper (non-default) vectors for unknown words.

That’s not in any way a limitation of DL4J, that’s a limitation of the FastText text/.vec format.

@AlexBlack, Thanks for that clarification. getWordVectorMatrix() indeed works for unknown words if I just use the binary model!
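
In case it helps anyone later, this is roughly what the working flow looks like - a minimal sketch only, assuming the FastText builder and the loadBinaryModel(String) entry point of the DL4J class, with a placeholder model path:

```java
import org.deeplearning4j.models.fasttext.FastText;
import org.nd4j.linalg.api.ndarray.INDArray;

public class FastTextOovExample {
    public static void main(String[] args) {
        // Load the Facebook binary (.bin) model - only this format carries the
        // subword (character n-gram) information needed for unknown words.
        FastText fastText = FastText.builder().build();
        fastText.loadBinaryModel("/path/to/model.bin"); // placeholder path

        // Returns a composed vector even for a misspelled / out-of-vocabulary token.
        INDArray vector = fastText.getWordVectorMatrix("misspeled");
        System.out.println(vector);
    }
}
```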

Is it possible to call the other methods (e.g. wordsNearest(), similarity(), etc.) for unknown words on the FastText class? I get an NPE if I call wordsNearest(), since modelUtils is null. There’s a setModelUtils() method - do I need to provide a modelUtils implementation myself?
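
For reference, this is the kind of wiring I have in mind - completely untested, and it assumes BasicModelUtils and that the loaded model exposes a populated lookup table via lookupTable(). Even then, I’d expect wordsNearest() to only cover in-vocabulary words, since OOV vectors are composed from subwords on the fly and never live in the lookup table:

```java
import java.util.Collection;

import org.deeplearning4j.models.embeddings.reader.impl.BasicModelUtils;
import org.deeplearning4j.models.fasttext.FastText;
import org.deeplearning4j.models.word2vec.VocabWord;

public class FastTextModelUtilsSketch {
    public static void main(String[] args) {
        FastText fastText = FastText.builder().build();
        fastText.loadBinaryModel("/path/to/model.bin"); // placeholder path

        // Plug in a default ModelUtils so wordsNearest()/similarity() don't hit a null modelUtils.
        BasicModelUtils<VocabWord> modelUtils = new BasicModelUtils<>();
        modelUtils.init(fastText.lookupTable()); // assumes the lookup table is populated
        fastText.setModelUtils(modelUtils);

        Collection<String> nearest = fastText.wordsNearest("someWord", 5);
        double sim = fastText.similarity("someWord", "otherWord");
        System.out.println(nearest + " / " + sim);
    }
}
```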

@aliakhtar Not sure, sorry, but that’s a good question. I’ve opened an issue here - keep an eye on that

Hi, I trained a FastText model using gensim in Python.
In order to use it with DL4J, I need to use the “save_facebook_model” method instead of the other option, the “save” method.
Now, I noticed that the “save_facebook_model” method produces a file which is over 380MB in size, while the “save” method produces a file of less than 1MB.

I wonder why there is such a big difference. Is it possible to somehow use the model that is saved with the “save” method?

The Gensim documentation says that a model saved with “save” can be loaded again using load(), and that it supports incremental training and getting vectors for out-of-vocabulary words.

So, it does not seem to leave out any information.
Is there a way to load the gensim FastText model that is saved with save()?
Thanks!

@Darius please don’t revive year-old threads. You would have gotten an answer more quickly if you had just created a new thread.

As for your actual question: as far as I’m aware, there is no support for Gensim’s custom binary format.
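
That said, the file written by gensim’s save_facebook_model() is in the standard Facebook .bin format, so it should go through the normal binary loading path - an untested sketch, assuming the same loadBinaryModel(String) entry point discussed above and a placeholder file name:

```java
import org.deeplearning4j.models.fasttext.FastText;

public class LoadGensimExport {
    public static void main(String[] args) {
        // The output of gensim's model.save_facebook_model(...) is a regular
        // Facebook fastText binary, so the usual binary loader applies.
        FastText fastText = FastText.builder().build();
        fastText.loadBinaryModel("gensim_exported_model.bin"); // placeholder file name

        System.out.println(fastText.getWordVectorMatrix("hello"));
    }
}
```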