Autoencoder sanity check

Hello there.

I am new to DL4J. I am trying to create an autoencoder and have looked at several examples. My dataset has 11 doubles as input, and I would like to build an autoencoder that reduces the 11 features to a code of size 2 or 3 for nice visualisations. Before doing that, I wanted to run a sanity check using a code of size 11: a dense multi-layer network with 5 layers, each having 11 nodes. With that architecture the input can simply be passed straight through, so a perfect score should be obtainable. I have tried many different settings, but nothing made DL4J learn this solution. I am looking for help and ideas so that I can get this sanity check to pass; after that I want to reduce the code size.
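For intuition, here is a minimal ND4J sketch (not from my actual code; it assumes org.nd4j.linalg.factory.Nd4j and org.nd4j.linalg.api.ndarray.INDArray are imported) of why a perfect score is reachable: with the identity activation each dense layer computes y = Wx + b, so identity weights and zero biases reproduce the input exactly.

INDArray x = Nd4j.rand(1, 11);                 // one hypothetical input row
INDArray W = Nd4j.eye(11);                     // identity weight matrix
INDArray b = Nd4j.zeros(1, 11);                // zero bias
INDArray y = x.mmul(W).add(b);                 // one dense layer with identity activation
System.out.println(y.equalsWithEps(x, 1e-9));  // true: the input passes straight through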

Here are some things I have tried.
Preprocessing

  • None
  • NormalizerStandardize
  • NormalizerMinMaxScaler

Batch size

  • 64

Epochs

  • 30+, but the score stops changing after only a couple of epochs

Weight initializations

  • Identity
  • RELU
  • Xavier

Updater

  • AdaGrad(0.05)
  • Adam(0.05)

Activation functions

  • Identity
  • ReLU
  • Sigmoid

Optimization

  • Stochastic gradient descent
  • Line gradient descent

L2 regularization

  • 0.0001
  • disabled

Layers

  • 0-3 hidden layers

Loss function

  • MSE
  • MAE

I have tried many of the above combinations, but the network just will not converge to anything. The error stays pretty much the same as it was right after weight initialization. When feeding inputs through the network, I see that the output is somewhat, but not very, similar to the original input. It should be nearly identical, since no real encoding should be occurring. I also see that some internal nodes have an activation of 0 (depending on the settings).
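Here is roughly how I inspect those activations (a sketch, assuming a trained MultiLayerNetwork net and a single DataSet ds taken from the iterator, as in the code below):

List<INDArray> activations = net.feedForward(ds.getFeatures());
for (int i = 0; i < activations.size(); i++) {
    // index 0 is the input itself; zero entries in later layers indicate dead units
    System.out.println("Layer " + i + ": " + activations.get(i));
}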

I have no idea what I am doing wrong; it seems I have tried enough variants. Could there be something wrong with my custom data iterator? When I inspect the data set, I do see the same numbers as features and labels, as expected.
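For reference, each minibatch the iterator produces is equivalent to the following sketch (the Nd4j.rand call is just a hypothetical stand-in for one batch of 64 rows read from the CSV):

INDArray features = Nd4j.rand(64, 11);            // hypothetical stand-in for one CSV batch
DataSet batch = new DataSet(features, features);  // autoencoder target: labels == features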

Please help me solve this mystery. Thank you very much.

Here is the code, for reference.

public class CAMSTAMUnsupervised {

private static int trainBatchSize = 64;
private static int testBatchSize = 1;
private static int numEpochs = 30;

public static String dataLocalPath;


public static void main(String[] args) throws Exception {

    File modelFile = new File(dataLocalPath, "camstam.gz");
    DataSetIterator trainIterator = new CAMSTAMDataSetIterator(new File(dataLocalPath, "camstam_with_hr_features.csv").getAbsolutePath(), trainBatchSize);
    DataSetIterator testIterator = new CAMSTAMDataSetIterator(new File(dataLocalPath,"camstam_with_hr_features.csv").getAbsolutePath(), testBatchSize);
    
    System.out.println("Input Columns: "+trainIterator.inputColumns());
    System.out.println("Output Columns: "+trainIterator.totalOutcomes());
    
    MultiLayerNetwork net = createModel(trainIterator.inputColumns(), trainIterator.totalOutcomes());
    
    UIServer uiServer = UIServer.getInstance();
    StatsStorage statsStorage = new InMemoryStatsStorage();
    uiServer.attach(statsStorage);
   
    DataSet dst = trainIterator.next(1);
    System.out.println(dst.getFeatures());
    trainIterator.reset(); // reset after this debug peek so the normalizer fit below sees all the data
    
    //DataNormalization normalizer = new NormalizerStandardize();
    DataNormalization normalizer = new NormalizerMinMaxScaler();
    
    normalizer.fit(trainIterator);              //Collect training data statistics
    trainIterator.reset();
    trainIterator.setPreProcessor(normalizer);
    testIterator.setPreProcessor(normalizer);	//Note: using training normalization statistics
    NormalizerSerializer.getDefault().write(normalizer, new File(dataLocalPath, "anomalyDetectionNormlizer.ty").getAbsolutePath());
    
    // training
    net.setListeners(new StatsListener(statsStorage), new ScoreIterationListener(10));
    net.fit(trainIterator, numEpochs);

    // Sanity check: feed each example through the net and print the reconstruction
    while (testIterator.hasNext()) {
        DataSet ds = testIterator.next();
        System.out.println(ds.getFeatures());
        List<INDArray> result = net.feedForward(ds.getFeatures());
        System.out.println(result.get(4)); // feedForward returns [input, layers 0-3]; index 4 is the output
        System.out.println("-----------");
    }
}

public static MultiLayerNetwork createModel(int inputNum, int outputNum) {
    MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
            .seed(12345)
            .weightInit(WeightInit.XAVIER) // also tried IDENTITY and RELU
            //.updater(new AdaGrad(0.05))
            .updater(new Adam(0.05))
            .activation(Activation.IDENTITY)
            .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT) // also tried LINE_GRADIENT_DESCENT
            //.l2(0.0001)
            .list()
            .layer(0, new DenseLayer.Builder().nIn(inputNum).nOut(11)
                    .build())
            .layer(1, new DenseLayer.Builder().nIn(11).nOut(11)
                    .build())
            .layer(2, new DenseLayer.Builder().nIn(11).nOut(11)
                    .build())
            .layer(3, new OutputLayer.Builder().nIn(11).nOut(outputNum)
                    .lossFunction(LossFunctions.LossFunction.MSE)
                    .build())
            .build();
	
    MultiLayerNetwork net = new MultiLayerNetwork(conf);
    net.init();
    return net;
}

I have found the problem. I assumed that the .activation() setting on the MultiLayerConfiguration.Builder would be the default for all the layers' activations; however, it is not. After setting the identity activation separately on each layer, it works perfectly. With identity weight initialization the solution is there immediately, and with other weight initializations the optimal weights are learned; the output can be reconstructed nicely. Here is now my network configuration for reference (reducing 11 dimensions to 10).

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(12345)
        .weightInit(WeightInit.XAVIER)
        .updater(new Adam(0.5))
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
        .list()
        .layer(0, new DenseLayer.Builder().nIn(inputNum).nOut(11).activation(Activation.IDENTITY)
                .build())
        .layer(1, new DenseLayer.Builder().nIn(11).nOut(10).activation(Activation.IDENTITY)
                .build())
        .layer(2, new DenseLayer.Builder().nIn(10).nOut(11).activation(Activation.IDENTITY)
                .build())
        .layer(3, new OutputLayer.Builder().nIn(11).nOut(outputNum).activation(Activation.IDENTITY)
                .lossFunction(LossFunctions.LossFunction.MSE)
                .build())
        .build();

Thanks for posting this @maarten.schadd - I found exactly the same thing. To be explicit, the network:

conf = new NeuralNetConfiguration.Builder()
    .seed(12345)
    .weightInit(WeightInit.XAVIER)
    .updater(new AdaGrad(0.05))
    .activation(Activation.IDENTITY)
    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
    .l2(0.0)
    .list()
        .layer(0, new DenseLayer.Builder().nIn(base_width).nOut(inner_width).activation(Activation.IDENTITY)
                .build())
        .layer(1, new DenseLayer.Builder().nIn(inner_width).nOut(bottleneck).activation(Activation.IDENTITY)
                .build())
        .layer(2, new DenseLayer.Builder().nIn(bottleneck).nOut((inner_width)).activation(Activation.IDENTITY)
                .build())
        .layer(3, new OutputLayer.Builder().nIn((inner_width)).nOut(base_width).activation(Activation.IDENTITY)
                .lossFunction(LossFunctions.LossFunction.MSE)
                .build())
    .build();

should have the same behaviour as this one, where the activations are not specified for each layer:

conf = new NeuralNetConfiguration.Builder()
    .seed(12345)
    .weightInit(WeightInit.XAVIER)
    .updater(new AdaGrad(0.05))
    .activation(Activation.IDENTITY)
    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
    .l2(0.0)
    .list()
    .layer(0, new DenseLayer.Builder().nIn(base_width).nOut(inner_width)
            .build())
    .layer(1, new DenseLayer.Builder().nIn(inner_width).nOut(bottleneck)
            .build())
    .layer(2, new DenseLayer.Builder().nIn(bottleneck).nOut((inner_width))
            .build())
    .layer(3, new OutputLayer.Builder().nIn((inner_width)).nOut(base_width)
            .lossFunction(LossFunctions.LossFunction.MSE)
            .build())
    .build();

Empirically, however, the behavior is very different: the first one trains well and reproduces its input, while the second fails to train and produces a very poor, wrongly scaled impression of it.

The documentation very clearly says that the activation setting is supposed to be applied to all applicable layers:

Note: values set by this method will be applied to all applicable layers in the network

So I really wonder what is happening here?

Finally, I think I have worked out that the issue comes from the use of OutputLayer: its constructor sets the activation function to SoftMax. Clearly OutputLayer is meant for categorical classification problems, not for autoencoder-type applications. Perhaps you were fooled, like I was, into trying this by the AutoEncoder example that is floating around in the docs and has this error. It is not present in the example that is actually in the examples directory, though.

Replacing just the activation function on the OutputLayer solves this issue for me.
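In other words, a minimal sketch of the fix, using the output layer from the configurations above:

// Broken: OutputLayer's constructor defaults the activation to SoftMax, silently
// overriding the network-wide .activation(Activation.IDENTITY) setting.
.layer(3, new OutputLayer.Builder().nIn(inner_width).nOut(base_width)
        .lossFunction(LossFunctions.LossFunction.MSE)
        .build())

// Fixed: set the activation explicitly on the OutputLayer itself.
.layer(3, new OutputLayer.Builder().nIn(inner_width).nOut(base_width)
        .activation(Activation.IDENTITY)
        .lossFunction(LossFunctions.LossFunction.MSE)
        .build())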