Why can the autoencoder model be fitted when the loss function is MSE but not when it is XENT?

I want to use DL4J to build an autoencoder model. I find that the model can be fitted when the loss function is MSE, but not when it is XENT. The model code is shown below:

val tm = new ParameterAveragingTrainingMaster.Builder(100,1)
      .averagingFrequency(5)
      .workerPrefetchNumBatches(2)
      .rddTrainingApproach(RDDTrainingApproach.Direct)
      .storageLevel(StorageLevel.MEMORY_ONLY_SER)
      .batchSizePerWorker(256)
      .build()

val conf = new NeuralNetConfiguration.Builder()
      .seed(12345)
      .weightInit(WeightInit.XAVIER)
      .updater(new Adam())
      .activation(Activation.RELU)
      .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
      .list()
      .layer(0, new DenseLayer.Builder().nIn(144).nOut(64)
        .build())
      .layer(1, new DenseLayer.Builder().nIn(64).nOut(16)
        .build())
      .layer(2, new DenseLayer.Builder().nIn(16).nOut(2)
        .build())
      .layer(3, new DenseLayer.Builder().nIn(2).nOut(16)
        .build())
      .layer(4, new DenseLayer.Builder().nIn(16).nOut(64)
        .build())
      .layer(5, new OutputLayer.Builder().nIn(64).activation(Activation.SIGMOID)
        .nOut(144).lossFunction(LossFunctions.LossFunction.XENT)
        .build())
      .build()
val sparkNet = new SparkDl4jMultiLayer(sc,conf,tm)

My training data is a DataSet(INDArray, INDArray). When the loss function is XENT and I call sparkNet.getScore after each epoch, the result is always around 99.
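
The training loop looks roughly like this (a sketch; trainData and numEpochs are placeholder names, with trainData standing for the RDD of those DataSet pairs):

for (epoch <- 0 until numEpochs) {
  sparkNet.fit(trainData)                               // one round of parameter-averaging training
  println(s"Epoch $epoch Score: ${sparkNet.getScore}")  // the score that stays around 99 with XENT
}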

Autoencoders are typically used with a reconstruction-error type of metric, which is why MSE works so well here.
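
For reference, a plain reconstruction-error output layer could look something like this (a sketch that reuses the nIn/nOut from your configuration; the identity activation is an assumption on my part, not something taken from your code):

val mseOutput = new OutputLayer.Builder()
      .nIn(64).nOut(144)
      .activation(Activation.IDENTITY)                  // assumed: linear output for plain reconstruction
      .lossFunction(LossFunctions.LossFunction.MSE)
      .build()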

In principle your setup can also work, but only if your input data is actually a set of probabilities. Binary cross entropy, i.e. the loss function you select when you use XENT, expects predictions to be in the range of 0 to 1 and the label to be either 0 or 1.

As you are using an autoencoder setup, your labels and your inputs are the same, and if they don’t satisfy the requirements of XENT, it is no wonder that you are getting weird loss scores.
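
A quick sanity check is to look at the value range of what you feed in; a minimal sketch, assuming features is one of the INDArrays you use as both input and label:

val minVal = features.minNumber().doubleValue()
val maxVal = features.maxNumber().doubleValue()
// XENT needs every value in [0, 1], and ideally the "labels" are exactly 0 or 1
println(s"feature range: [$minVal, $maxVal]")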

Yeah, I understand. I had built the model in Keras, where the loss function was binary cross entropy, but I didn't know that this wouldn't work in DL4J. What loss function can I use, KL divergence?

So it worked with the same data in Keras? Can you do a sanity check and print out your labels, so we can be sure you are actually feeding it the same data?

OK
A record from the input dataset looks like this:
===========INPUT===================
[[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.3333, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
=================OUTPUT==================
[[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.3333, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
and then the model prediction looks like this:
Epoch 0 Score: 99.87674774169922
input:
[[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.3333, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
predict:
[[ 0.5013, 0.4983, 0.5016, 0.5005, 0.5014, 0.5012, 0.5000, 0.5005, 0.5012, 0.5001, 0.4994, 0.5001, 0.4986, 0.5003, 0.5003, 0.5024, 0.4972, 0.5008, 0.4972, 0.5026, 0.4986, 0.5011, 0.4976, 0.5005, 0.5007, 0.5007, 0.4995, 0.4998, 0.4983, 0.5018, 0.5002, 0.5003, 0.4980, 0.4987, 0.5002, 0.5028, 0.5035, 0.4961, 0.4989, 0.4990, 0.5011, 0.4993, 0.5012, 0.5002, 0.5015, 0.5022, 0.5028, 0.5019, 0.5000, 0.4993, 0.4995, 0.5007, 0.5005, 0.5021, 0.4992, 0.4993, 0.5007, 0.5008, 0.4997, 0.4970, 0.4984, 0.4981, 0.4993, 0.4991, 0.5017, 0.4995, 0.5001, 0.4996, 0.5016, 0.4996, 0.4982, 0.5014, 0.5005, 0.4999, 0.5010, 0.5009, 0.5001, 0.4995, 0.4997, 0.4992, 0.4986, 0.5009, 0.5014, 0.5023, 0.5034, 0.4999, 0.5035, 0.5011, 0.5007, 0.4997, 0.5020, 0.5021, 0.4997, 0.5008, 0.5044, 0.4999, 0.5001, 0.5019, 0.5006, 0.4976, 0.4985, 0.4980, 0.4981, 0.5000, 0.4988, 0.5014, 0.4986, 0.5003, 0.5018, 0.4986, 0.4964, 0.5005, 0.4982, 0.4984, 0.5008, 0.5033, 0.4964, 0.5000, 0.4983, 0.4960, 0.4991, 0.4967, 0.5001, 0.4970, 0.5002, 0.4997, 0.5013, 0.5008, 0.5018, 0.4990, 0.5005, 0.5014, 0.4982, 0.5011, 0.4996, 0.5021, 0.4992, 0.5011, 0.4995, 0.5018, 0.5004, 0.5022, 0.4991, 0.5001]]
Euclidean Distance: 5.982511043548584

The following is the corresponding Keras setup.
Input data:


Model training:

autoencoder.compile(Adam(), loss='binary_crossentropy')
autoencoder.fit(train_data[:train_nums], train_data[:train_nums], batch_size= 256, epochs=50,
                validation_data=(train_data[train_nums:], train_data[train_nums:]))

Then it might just be a matter of hyperparameter tuning.

As far as I can see, you are running it in a distributed environment in DL4J, which is inherently a different “hyperparameter” than what you’d get with single-node training.

Can you run it directly, with the same hyperparameters that you’ve used in Keras, and see if it trains similarly?
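
Something along these lines takes Spark out of the picture entirely (a sketch; conf is the configuration you posted, and trainIter is a placeholder for a DataSetIterator over the same data):

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.deeplearning4j.optimize.listeners.ScoreIterationListener

val net = new MultiLayerNetwork(conf)              // plain single-node network, no TrainingMaster
net.init()
net.setListeners(new ScoreIterationListener(10))   // log the score every 10 iterations
for (epoch <- 0 until 50) {                        // 50 epochs, matching your Keras run
  net.fit(trainIter)
  trainIter.reset()                                // rewind the iterator for the next epoch
}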

I want to train the model with big data, so I don’t want to train on a single node. I will try it next. And can I ask you another question? I find that model training is slow, even in a distributed environment in DL4J. But I can’t use RDDTrainingApproach.Export, because that approach produces lots of small files, which may cause the file system to crash. So is there any way to solve this? Thank you.

It will take a lot of data before a distributed approach is actually faster than fully utilizing a fast single machine.

How much data do you have?

400 million records.

So I chose DL4J to try.

That is still something that can probably be handled in a reasonable time on a single machine.

Can you share a bit more about the system you are trying to run this on, and your overall configuration? The autoencoder network you’ve shared should be able to easily handle many thousands of records per second, even on modest machines.

I tested DL4J on a Spark cluster with half a million records, and each epoch takes half an hour. :crazy_face: The same work takes just a few seconds on a GPU. The Spark cluster configuration is: --num-executors 50 --executor-cores 2 --executor-memory 10g --driver-memory 10g. I can’t understand why it is so slow; maybe I just need a single machine.

The reason is pretty simple: network overhead. There is little you can do to reduce the inherent latency once you start doing distributed training. In addition to that, GPUs are just very efficient at matrix multiplications.

If you want, we can go into more details on tuning a single machine to run as fast as possible. But I guess we should start a new thread for that, as we’ve gone quite far away from the initial topic of discussion.

I have to use Spark to read the training set from HDFS, so I chose distributed training. Otherwise I would use TensorFlow on a GPU to train the model. Next, I will try configuring the TrainingMaster parameters to speed things up. Maybe I need to start a new topic about speeding up my distributed DL4J model. :sweat_smile: And thank you very much.