passing the dataset to the fit method of the MultiLayerNetwork class

@Arasaka sure, no hurry on my part.

@agibsonccc hello, I’m back to work. I need to know if I’m doing everything right. This is the error that started to appear: Sequence lengths do not match for RnnOutputLayer input and labels:Arrays should be rank 3 with shape [minibatch, size, sequenceLength] - mismatch on dimension 2 (sequence length).
As I understand it, that error indicates that the shape of the input and output data does not match the expected shape. In this case, RnnOutputLayer expects the third dimension (sequenceLength) of the input data to match the third dimension of the labels.
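To check my understanding, the shapes would need to line up roughly like this (toy numbers, not my real data):

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.factory.Nd4j;

int miniBatch = 1;
int featureSize = 300;  // size of each word vector
int numClasses = 2;
int seqLength = 6;      // has to be the same for features and labels

// [minibatch, size, sequenceLength] for both arrays; a mismatch on dimension 2 is exactly this error
INDArray features = Nd4j.zeros(miniBatch, featureSize, seqLength);
INDArray labels = Nd4j.zeros(miniBatch, numClasses, seqLength);
DataSet dataSet = new DataSet(features, labels);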

@Arasaka do you have a minimal reproducer for me I can run from a main method? Thanks!

@agibsonccc

I’m on a short business trip, so I’ll have to wait a couple of days. I apologize for the wait.

@agibsonccc hello. Finally back and ready to continue if you don’t mind. I did it a little differently, and now the error is this: Exception in thread "main" java.lang.IllegalStateException: Cannot merge time series with different size for dimension 1 (first shape: [1, 6, 300], 1th shape: [1, 5, 300])
at org.nd4j.linalg.dataset.api.DataSetUtil.mergeTimeSeries(DataSetUtil.java:467)
at org.nd4j.linalg.dataset.api.DataSetUtil.mergeFeatures(DataSetUtil.java:206)
at org.nd4j.linalg.dataset.api.DataSetUtil.mergeFeatures(DataSetUtil.java:226)
at org.nd4j.linalg.dataset.MultiDataSet.merge(MultiDataSet.java:488)
I tried to align the arrays with padding, but then a different error appeared. What shape does the fit method expect? Exactly which dimensions have to match?

@Arasaka you should figure out whatever the max possible lengths are and use masking to address different lengths like that. See more here:
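Roughly, dl4j's convention for RNN data is features of shape [miniBatchSize, featureSize, timeSteps] and a feature mask of shape [miniBatchSize, timeSteps]. A minimal sketch of padding one sentence to a fixed max length with a mask (the sizes below are placeholders for illustration only):

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

// Placeholder sizes for illustration only
int maxLength = 10;     // longest sentence across the whole dataset
int vectorSize = 300;   // word2vec dimensionality
int actualLength = 6;   // length of this particular sentence

// Features: [miniBatchSize, featureSize, timeSteps]; mask: [miniBatchSize, timeSteps]
INDArray features = Nd4j.zeros(1, vectorSize, maxLength);
INDArray featureMask = Nd4j.zeros(1, maxLength);

for (int t = 0; t < actualLength; t++) {
    double[] wordVector = new double[vectorSize]; // stand-in for the real word2vec lookup
    for (int j = 0; j < vectorSize; j++) {
        features.putScalar(new int[]{0, j, t}, wordVector[j]);
    }
    featureMask.putScalar(new int[]{0, t}, 1.0); // mark this time step as real data
}
// Labels (and a label mask if needed) follow the same convention, so every example
// in the list ends up with the same padded length and the merge no longer fails.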

@agibsonccc thanks, I read the article. I equalized the shapes to [1, 6, 300]. Another exception was thrown: Exception in thread "main" java.lang.IllegalStateException: Invalid input: EmbeddingSequenceLayer expects either rank 2 input of shape [minibatch,seqLength] or rank 3 input of shape [minibatch,1,seqLength]. Got rank 3 input of shape [4, 6, 300]
I tried reshaping to get the expected rank-3 shape, that is, [1, 1, 6]. Another exception appeared: Exception in thread "main" org.nd4j.linalg.exception.ND4JIllegalStateException: New shape length doesn't match original length: [6] vs [1800]. Original shape: [1, 6, 300] New Shape: [1, 1, 6]
I understand that all of this is because of vectorSize, which is 300. I tried creating a mask; it showed [1, 6] in the logs, and I added it to the List. The same exception recurred: Exception in thread "main" java.lang.IllegalStateException: Invalid input: EmbeddingSequenceLayer expects either rank 2 input of shape [minibatch,seqLength] or rank 3 input of shape [minibatch,1,seqLength]. Got rank 3 input of shape [4, 6, 300]
I tried passing the mask first when constructing the MultiDataSet (new MultiDataSet(inputMask, input, output, null)), and the following exception appeared: Invalid size: cannot get size of dimension 2 for rank 2 NDArray (array shape: [1, 6])
Is there a way to change the input format that the EmbeddingSequenceLayer expects? Or am I doing something wrong with the masks?

@Arasaka could you clarify your data pipeline and the like? I can’t tell what’s off yet. A reminder of your use case would be appreciated.

@agibsonccc OK, I’ll hide some things if you don’t mind

public static MultiDataSet convertDataToMultiDataSet(String data) throws IOException {
List<MultiDataSet> dataSetList = new ArrayList<>();

InputStream modelIn = new FileInputStream(*path to the model*);
SentenceModel sentenceModel = new SentenceModel(modelIn);
SentenceDetectorME sentenceDetectorME = new SentenceDetectorME(sentenceModel);

String[] sentences = sentenceDetectorME.sentDetect(data);

Tokenizer tokenizer = SimpleTokenizer.INSTANCE;

int maxLength = 0;
for (String sentence : sentences) {
    String[] tokens = tokenizer.tokenize(sentence);
    maxLength = Math.max(maxLength, tokens.length);
}

int vectorSize = 300;
int numClasses = 2;
INDArray padding = Nd4j.zeros(1, maxLength, vectorSize);

for (String sentence : sentences) {

    if (sentence.trim().isEmpty()) {
        continue;
    }

    INDArray inputMask = Nd4j.zeros(1, maxLength); // mask
    INDArray input = Nd4j.zeros(1, maxLength, vectorSize);
    INDArray output = Nd4j.zeros(1, maxLength, numClasses);


    String[] tokens = tokenizer.tokenize(sentence);
    int length = Math.min(tokens.length, maxLength);
    //int length = tokens.length;

    System.out.println("Processing sentence: " + sentence);
    System.out.println("Tokens: " + Arrays.toString(tokens));

    for (int i = 0; i < length; i++) {
        String token = tokens[i];
        System.out.println("Processing token: " + token);
        INDArray vector = getWordVector(token);
        if (vector != null) {
            input.put(new INDArrayIndex[]{NDArrayIndex.point(0), NDArrayIndex.point(i), NDArrayIndex.all()}, vector);
        inputMask.put(new INDArrayIndex[]{NDArrayIndex.point(0), NDArrayIndex.interval(0, length)}, Nd4j.ones(1, length)); // set the mask
        }
    }

    int length1 = (int) input.size(1);

    if (length1 < maxLength) {
        INDArray inputCopy = input.dup(); // copy
        inputCopy.put(new INDArrayIndex[]{NDArrayIndex.all(), NDArrayIndex.interval(length1, maxLength), NDArrayIndex.all()}, padding);
        dataSetList.add(new MultiDataSet(inputCopy, output));
    }
    else {
        dataSetList.add(new MultiDataSet(input, output));
    }

    System.out.println("Input shape: " + Arrays.toString(input.shape()));
    System.out.println("Output shape: " + Arrays.toString(output.shape()));
    System.out.println("InputMask shape: " + Arrays.toString(inputMask.shape()));

    dataSetList.add(new MultiDataSet(input, output, inputMask, null));
    //dataSetList.add(new MultiDataSet(input, output));

}

return MultiDataSet.merge(dataSetList);

}

The three shape lines are printed to the console twice. The first time:
Input shape: [1, 6, 300]
Output shape: [1, 6, 2]
InputMask shape: [1, 6]
The second time:
Input shape: [1, 6, 300]
Output shape: [1, 6, 2]
InputMask shape: [1, 6]
Thanks in advance for your reply.

@Arasaka Sorry for the late reply.
Could you tell me a bit more about your problem? I don’t need secret/proprietary information, just what you’re trying to do.
It looks like you’re using opennlp. Are you feeding the output of that to dl4j?

It looks like you are trying to train custom word embeddings? Is that what the embedding sequence layer is for?

@agibsonccc let me elaborate on what I’m trying to do.

Yes, I’m using the OpenNLP library for sentence splitting and tokenization. The result of that processing is then passed to dl4j to train the neural network.

No, training my own word embeddings is not my goal.
In my code, I use an already pre-trained word2vec model to get a vector representation for each word in a sentence.
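For reference, the getWordVector helper is roughly along these lines (a simplified sketch, with a placeholder model path rather than my real one):

import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.embeddings.wordvectors.WordVectors;
import org.nd4j.linalg.api.ndarray.INDArray;
import java.io.File;

// Loaded once and reused; the path is a placeholder
private static final WordVectors wordVectors =
        WordVectorSerializer.readWord2VecModel(new File("path/to/word2vec/model.bin"));

private static INDArray getWordVector(String token) {
    if (!wordVectors.hasWord(token)) {
        return null; // out-of-vocabulary token; the calling code checks for null
    }
    return wordVectors.getWordVectorMatrix(token); // row vector of length 300
}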

You are right that the EmbeddingSequenceLayer is designed to work with sequences of word vector representations; I use it to feed the word embeddings into the LSTM input.

Please ask any additional questions if any clarification is needed! It is important for me to understand what I am doing wrong.

@Arasaka you shouldn’t need the embedding layer then. Your embeddings are external. Just skip straight to using LSTMs. Look around at this:
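Roughly speaking, since the vectors come from outside the network, the first layer can just be an LSTM whose nIn equals the vector size. A minimal sketch (the hyperparameters below are placeholders, not recommendations):

import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.LSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.nn.weights.WeightInit;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;

int vectorSize = 300;  // word2vec dimensionality, matches the features' second dimension
int lstmUnits = 128;   // placeholder
int numClasses = 2;

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .updater(new Adam(1e-3))
        .weightInit(WeightInit.XAVIER)
        .list()
        // no EmbeddingSequenceLayer: the features are already dense vectors of shape
        // [minibatch, vectorSize, seqLength], so the LSTM can consume them directly
        .layer(0, new LSTM.Builder()
                .nIn(vectorSize)
                .nOut(lstmUnits)
                .activation(Activation.TANH)
                .build())
        .layer(1, new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                .nIn(lstmUnits)
                .nOut(numClasses)
                .activation(Activation.SOFTMAX)
                .build())
        .build();

MultiLayerNetwork net = new MultiLayerNetwork(conf);
net.init();
// net.fit(...) then takes the padded features, labels, and masks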

@agibsonccc I apologize for the slow reply; I am currently going through your example. If any questions come up, I will write.