Why can the autoencoder model be fitted when the loss function is MSE but not when it is XENT?

I want to use DL4J to build an autoencoder model. I find that the model can be fitted when the loss function is MSE, but not when it is XENT. The model code is shown below:

val tm = new ParameterAveragingTrainingMaster.Builder(100,1)
      .averagingFrequency(5)
      .workerPrefetchNumBatches(2)
      .rddTrainingApproach(RDDTrainingApproach.Direct)
      .storageLevel(StorageLevel.MEMORY_ONLY_SER)
      .batchSizePerWorker(256)
      .build()

val conf = new NeuralNetConfiguration.Builder()
      .seed(12345)
      .weightInit(WeightInit.XAVIER)
      .updater(new Adam())
      .activation(Activation.RELU)
      .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
      .list()
      .layer(0, new DenseLayer.Builder().nIn(144).nOut(64)
        .build())
      .layer(1, new DenseLayer.Builder().nIn(64).nOut(16)
        .build())
      .layer(2, new DenseLayer.Builder().nIn(16).nOut(2)
        .build())
      .layer(3, new DenseLayer.Builder().nIn(2).nOut(16)
        .build())
      .layer(4, new DenseLayer.Builder().nIn(16).nOut(64)
        .build())
      .layer(5, new OutputLayer.Builder().nIn(64).activation(Activation.SIGMOID)
        .nOut(144).lossFunction(LossFunctions.LossFunction.XENT)
        .build())
      .build()
val sparkNet = new SparkDl4jMultiLayer(sc,conf,tm)

My training data is a DataSet(INDArray, INDArray). When the loss function is XENT and I call sparkNet.getScore after each epoch, the result is always around 99.
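
The training loop looks roughly like this (a sketch; trainData and numEpochs are placeholder names, with trainData standing for the RDD of those DataSet pairs):

for (epoch <- 0 until numEpochs) {
  sparkNet.fit(trainData)                               // one round of parameter-averaging training
  println(s"Epoch $epoch Score: ${sparkNet.getScore}")  // the score that stays around 99 with XENT
}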

Autoencoders are typically used with a reconstruction-error type of metric, which is why MSE works so well here.
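
For reference, a plain reconstruction-error output layer could look something like this (a sketch that reuses the nIn/nOut from your configuration; the identity activation is an assumption on my part, not something taken from your code):

val mseOutput = new OutputLayer.Builder()
      .nIn(64).nOut(144)
      .activation(Activation.IDENTITY)                  // assumed: linear output for plain reconstruction
      .lossFunction(LossFunctions.LossFunction.MSE)
      .build()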

In principle your setup can also work, but only if your input data is actually a set of probabilities. Binary cross entropy, i.e. the loss function you select when you use XENT, expects predictions to be in the range of 0 to 1 and the label to be either 0 or 1.

As you are using an autoencoder setup, your labels and your inputs are the same, and if they don’t satisfy the requirements of XENT, it is no wonder that you are getting weird loss scores.
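
A quick sanity check is to look at the value range of what you feed in; a minimal sketch, assuming features is one of the INDArrays you use as both input and label:

val minVal = features.minNumber().doubleValue()
val maxVal = features.maxNumber().doubleValue()
// XENT needs every value in [0, 1], and ideally the "labels" are exactly 0 or 1
println(s"feature range: [$minVal, $maxVal]")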

Yeah, I understand. I had built the model in Keras, where the loss function was binary cross entropy, but I didn't know that this wouldn't work in DL4J. What loss function can I use, KL divergence?

So it worked with the same data in Keras? Can you do a sanity check and print out your labels, so we can be sure you are actually feeding it the same data?

OK
A record from the input dataset looks like this:
===========INPUT===================
[[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.3333, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
=================OUTPUT==================
[[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.3333, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
and then the model prediction looks like this:
Epoch 0 Score: 99.87674774169922
input:
[[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.3333, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
predict:
[[ 0.5013, 0.4983, 0.5016, 0.5005, 0.5014, 0.5012, 0.5000, 0.5005, 0.5012, 0.5001, 0.4994, 0.5001, 0.4986, 0.5003, 0.5003, 0.5024, 0.4972, 0.5008, 0.4972, 0.5026, 0.4986, 0.5011, 0.4976, 0.5005, 0.5007, 0.5007, 0.4995, 0.4998, 0.4983, 0.5018, 0.5002, 0.5003, 0.4980, 0.4987, 0.5002, 0.5028, 0.5035, 0.4961, 0.4989, 0.4990, 0.5011, 0.4993, 0.5012, 0.5002, 0.5015, 0.5022, 0.5028, 0.5019, 0.5000, 0.4993, 0.4995, 0.5007, 0.5005, 0.5021, 0.4992, 0.4993, 0.5007, 0.5008, 0.4997, 0.4970, 0.4984, 0.4981, 0.4993, 0.4991, 0.5017, 0.4995, 0.5001, 0.4996, 0.5016, 0.4996, 0.4982, 0.5014, 0.5005, 0.4999, 0.5010, 0.5009, 0.5001, 0.4995, 0.4997, 0.4992, 0.4986, 0.5009, 0.5014, 0.5023, 0.5034, 0.4999, 0.5035, 0.5011, 0.5007, 0.4997, 0.5020, 0.5021, 0.4997, 0.5008, 0.5044, 0.4999, 0.5001, 0.5019, 0.5006, 0.4976, 0.4985, 0.4980, 0.4981, 0.5000, 0.4988, 0.5014, 0.4986, 0.5003, 0.5018, 0.4986, 0.4964, 0.5005, 0.4982, 0.4984, 0.5008, 0.5033, 0.4964, 0.5000, 0.4983, 0.4960, 0.4991, 0.4967, 0.5001, 0.4970, 0.5002, 0.4997, 0.5013, 0.5008, 0.5018, 0.4990, 0.5005, 0.5014, 0.4982, 0.5011, 0.4996, 0.5021, 0.4992, 0.5011, 0.4995, 0.5018, 0.5004, 0.5022, 0.4991, 0.5001]]
Euclidean Distance: 5.982511043548584

The following is the corresponding Keras setup.
Input data:


Model training:

autoencoder.compile(Adam(), loss='binary_crossentropy')
autoencoder.fit(train_data[:train_nums], train_data[:train_nums], batch_size= 256, epochs=50,
                validation_data=(train_data[train_nums:], train_data[train_nums:]))

Then it might just be a matter of hyperparameter tuning.

As far as I can see, you are running it in a distributed environment in DL4J, which is inherently a different “hyperparameter” than what you’d get with single-node training.

Can you run it directly, with the same hyperparameters that you’ve used in Keras, and see if it trains similarly?
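
Something along these lines takes Spark out of the picture entirely (a sketch; conf is the configuration you posted, and trainIter is a placeholder for a DataSetIterator over the same data):

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.deeplearning4j.optimize.listeners.ScoreIterationListener

val net = new MultiLayerNetwork(conf)              // plain single-node network, no TrainingMaster
net.init()
net.setListeners(new ScoreIterationListener(10))   // log the score every 10 iterations
for (epoch <- 0 until 50) {                        // 50 epochs, matching your Keras run
  net.fit(trainIter)
  trainIter.reset()                                // rewind the iterator for the next epoch
}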

I want to train the model with big data, so I don’t want to train on a single node. I will try it next. And can I ask you another question? I find that model training is slow, even in a distributed environment in DL4J. But I can’t use RDDTrainingApproach.Export, because that approach produces lots of small files, which may cause the file system to crash. So is there any way to solve this? Thank you.

It will take a lot of data before a distributed approach is actually faster than fully utilizing a fast single machine.

How much data do you have?

400 million records.

So I chose DL4J to try.

That is still something that can probably be handled in a reasonable time on a single machine.

Can you share a bit more about the system you are trying to run this on, and your overall configuration? The autoencoder network you’ve shared should be able to easily handle many thousands of records per second, even on modest machines.

I tested DL4J on a Spark cluster with half a million records, and each epoch takes half an hour. :crazy_face: The same work takes just a few seconds on a GPU. The Spark cluster configuration is: --num-executors 50 --executor-cores 2 --executor-memory 10g --driver-memory 10g. I can’t understand why it is so slow; maybe I just need a single machine.

The reason is pretty simple: network overhead. There is little you can do to reduce the inherent latency once you start doing distributed training. In addition to that, GPUs are just very efficient at matrix multiplications.

If you want, we can go into more details on tuning a single machine to run as fast as possible. But I guess we should start a new thread for that, as we’ve gone quite far away from the initial topic of discussion.

I have to use Spark to read the training set from HDFS, so I chose distributed training. Otherwise I would use TensorFlow on a GPU to train the model. Next, I will try configuring the TrainingMaster parameters to speed things up. Maybe I need to start a new topic about speeding up my distributed DL4J model. :sweat_smile: And thank you very much.