Why is my model only predicting 0 and never 1?

Hi everyone. I made a simple example to get to know deeplearning4j:

  • the data has the shape Map<Double[], Double>, with the Double[] key holding the features: 2 numbers chosen randomly between 0 and 100. The Double value is the target: 1 if they together are above 50, 0 if not.
  • Here is how I make the Iterator:
    public static DataSetIterator fromEntriesToDataSet(Map<Double[], Double> entries) {
        final Double[][] boxedDoubles = entries.keySet().toArray(new Double[entries.size()][numFeatures]);
        double[][] unboxedDoubles = Utilities.unbox2DArrayOfDoubles(boxedDoubles, numFeatures);

        INDArray inputNDArray = Nd4j.create(unboxedDoubles);

        final Double[] bools = entries.values().toArray(new Double[entries.size()]);
        INDArray outPut = Nd4j.create(Utilities.unbox1DArrayOfDoubles(bools), entries.size(), 1);

        DataSet dataSet = new DataSet(inputNDArray, outPut);
        List<DataSet> listDs = dataSet.asList();
        return new ListDataSetIterator<>(listDs, entries.size());
    }
  • Here’s how I make the network:
    static int numInput_ofNetwork = 2;
    static int nHidden_ofNetwork = 110;
    static int numOutputs_ofNetwork = 1;

    private static MultiLayerNetwork makeNetworkModel() {
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .updater(new Sgd(0.1))
                .layer(new DenseLayer.Builder().nIn(numInput_ofNetwork).nOut(nHidden_ofNetwork).build())
                .layer(new OutputLayer.Builder(LossFunctions.LossFunction.XENT)
        MultiLayerNetwork model = new MultiLayerNetwork(conf);
        return model;

I assume the mistake will be somewhere in there, otherwise tell me if I need to post more code.
I wasn’t expecting the model to be outstanding on the first try. But the only-zero prediction suggests I made a mistake somewhere.

I think I’m getting closer to figuring it out, but it still makes no sense:

  • For simplicity, I used the training data for testing as well. I made an INDArray out of the input Doubles.
  • I passed that to model.predict and got an int[]. The array contains only zeroes, which is why I thought it didn’t work. But now I suspect the problem is that I got fractions between 0 and 1 and it just rounded them all down. So first question: Can’t you have it return fractions? That’s actually why I put the sigmoid activation function at the end.
  • If I use INDArray output = model.output(inputIndArray); and inspect it via output.data().asFloat(), I see a bunch of different values between 0 and 1 that look a lot more like solutions. If we have this, then what’s the point of .predict? However, they are all around 0.05. Why are they so small? The classes are equally distributed.
  • eval.eval(labelIndArray, output) shows me a confusion matrix according to which the model always guessed 0

I’m going to ignore that the data preparation step you are using could be massively improved, as it works for now and you aren’t asking about that in particular here.

As for your actual question, I can see a few reasons why you aren’t having any success. They all stem from the same root cause: You don’t quite understand what is happening in the neural network that you have designed.

Mathematically, your neural network can be described as: y = sigmoid(W_2*(tanh(W_1*x+b_1))+b_2)

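Your x is defined as: x = (x_1, x_2), where x_1 and x_2 are each between 0 and 100.

And your y is defined as: y = 1 if the two inputs together are above 50, and y = 0 otherwise.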

For this discussion we can assume that W and b are initialized with values in the -1 to 1 range.

So let’s take a look at what happens when you put a number between 0 and 100 in.

W_1 * x will be 110 values in the range between -200 and 200.

tanh(x) is defined as (e^x - e^-x) / (e^x + e^-x). Its value range is therefore -1 to 1, and it is most sensitive for inputs between -2 and 2.

So for pretty much all of your inputs the value of tanh(W_1*x + b_1) is going to be pinned to -1 or 1. The problem with that is that the farther you get away from the sensitive range, the less actual gradient your model will get during training. So it will either take a very long time to train it (or require some very specific hyper parameters), or the model will not train at all.
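
To make the saturation concrete, here is a minimal plain-Java sketch (no DL4J involved) that prints tanh and its derivative, 1 - tanh(z)^2, for a few pre-activation values. The class name, the helper tanhGradient and the sample values are illustrative assumptions; they are not taken from the code in the question.

```java
// Illustrative sketch only: shows how the tanh gradient vanishes for large inputs.
final class TanhSaturation {

    /** Derivative of tanh at z, computed as 1 - tanh(z)^2. */
    static double tanhGradient(double z) {
        double t = Math.tanh(z);
        return 1.0 - t * t;
    }

    public static void main(String[] args) {
        // Hypothetical pre-activations: small ones as seen with inputs scaled to 0..1,
        // large ones as seen with raw inputs in 0..100 and weights in -1..1.
        double[] preActivations = {0.0, 0.5, 2.0, 10.0, 50.0, 200.0};
        for (double z : preActivations) {
            System.out.printf("z = %6.1f  tanh(z) = %.6f  gradient = %.3e%n",
                    z, Math.tanh(z), tanhGradient(z));
        }
    }
}
```

The gradient is largest at z = 0 and shrinks quickly as |z| grows, which is why units fed with values in the tens or hundreds contribute almost nothing to the weight updates.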

I sometimes feel like a broken record talking about this, because it is so often the reason why people fail even at their toy tasks, but the easiest way to solve that problem is to normalize your data.

TL;DR: Scale down your input data from 0 to 100 to 0 to 1, and your model will get at least a chance to solve the problem.
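
As a rough illustration of that scaling step, the following plain-Java sketch min-max scales a feature array from a known range to 0..1 before it would be turned into an INDArray. The class InputScaling and the method scaleToUnitRange are hypothetical names, and the fixed bounds 0 and 100 are an assumption based on how the data is described in the question. DL4J also ships normalizer preprocessors such as NormalizerMinMaxScaler that can be fitted on an iterator; this sketch deliberately avoids library calls so it stays self-contained.

```java
// Illustrative sketch only: manual min-max scaling of raw features to the 0..1 range.
final class InputScaling {

    /** Scales every value from the range [min, max] to [0, 1]. */
    static double[][] scaleToUnitRange(double[][] features, double min, double max) {
        double range = max - min;
        double[][] scaled = new double[features.length][];
        for (int i = 0; i < features.length; i++) {
            scaled[i] = new double[features[i].length];
            for (int j = 0; j < features[i].length; j++) {
                scaled[i][j] = (features[i][j] - min) / range;
            }
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Hypothetical raw samples with two features each, in the range 0..100.
        double[][] raw = {{0.0, 100.0}, {25.0, 50.0}, {73.0, 12.0}};
        double[][] scaled = scaleToUnitRange(raw, 0.0, 100.0);
        for (double[] row : scaled) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

The same bounds have to be applied to any data you later pass to the model for prediction, otherwise the network sees inputs on a different scale than it was trained on.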

Also, given that you know the solution to your problem beforehand, it would make sense to design your NN accordingly. It doesn’t need 110 nodes in the hidden layer to do your task; they only make it harder to train the model. Sure, the L2 constraint will pull most of them to 0 after some time, but given that it is a trivial problem where you can work out the solution by hand, it doesn’t really make sense to make it any less clear.
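
To show what working it out by hand could look like, here is a minimal plain-Java sketch of a single sigmoid unit with hand-picked weights. It assumes that "together above 50" means the sum of the two raw inputs exceeds 50, which becomes a sum above 0.5 once the inputs are scaled to 0..1. The class name, the STEEPNESS constant and its value of 20 are illustrative choices and not trained values.

```java
// Illustrative sketch only: a hand-set single neuron, assuming the label is
// 1 when the sum of the two inputs (scaled to 0..1) is above 0.5.
final class HandSolvedClassifier {

    // Hypothetical weight magnitude; larger values make the decision boundary sharper.
    static final double STEEPNESS = 20.0;

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Output of the neuron: sigmoid(w1*x1 + w2*x2 + b) with w1 = w2 = STEEPNESS, b = -0.5 * STEEPNESS. */
    static double probability(double x1, double x2) {
        return sigmoid(STEEPNESS * x1 + STEEPNESS * x2 - 0.5 * STEEPNESS);
    }

    /** Thresholds the neuron output at 0.5 to get a class label. */
    static int classify(double x1, double x2) {
        return probability(x1, x2) > 0.5 ? 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(classify(0.10, 0.20)); // scaled inputs with sum 0.3
        System.out.println(classify(0.30, 0.30)); // scaled inputs with sum 0.6
    }
}
```

Under that assumption, two weights and one bias are enough to separate the classes, so a network for this task can be far smaller than 110 hidden units.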