I built a multi-layer 1D CNN + LSTM network. During training I found that enabling dropout, even at a very small value such as 0.05, prevented the network from learning at all. My guess was that each layer has too few nodes, which makes the network overly sensitive to dropout. Surprisingly, training through the multi-GPU ParallelWrapper (mutilGPUWrapper in the code below) was very effective at preventing overfitting. Unfortunately, the UIServer does not work properly when it is used. I hope this can be improved.
@cqiaoYc please file issues with this stuff and I’ll take a look. Small improvements like this should be straightforward.
@agibsonccc If mutilGPUWrapper = new ParallelWrapper.Builder(net) is created outside the "for (int i = 0; i < nEpochs; i++) {}" loop, the UIServer works fine, but an exception is thrown: https://community.konduit.ai/t/how-to-customize-a-dataset-iterator-that-supports-multiple-gpus/3163
Another question: in a single-GPU environment, ParallelWrapper nearly doubles the training speed, but it does not provide the anti-overfitting effect that I see in a multi-GPU environment. Can a similar effect be obtained on a single GPU by configuring multi-threaded training?
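For context on what averagingFrequency controls in the code below: ParallelWrapper keeps one model replica per worker and periodically averages their parameters. The following is a minimal plain-Java sketch of that averaging step only. The class `ParamAveragingSketch` and its `average` method are hypothetical names for illustration, not DL4J API, and whether averaging explains the anti-overfitting effect described above is not established here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not DL4J code: element-wise mean of the
// parameter vectors held by each worker replica.
class ParamAveragingSketch {

    static double[] average(List<double[]> workerParams) {
        if (workerParams.isEmpty()) {
            throw new IllegalArgumentException("need at least one worker");
        }
        int n = workerParams.get(0).length;
        double[] avg = new double[n];
        for (double[] params : workerParams) {
            if (params.length != n) {
                throw new IllegalArgumentException("parameter vectors differ in length");
            }
            for (int i = 0; i < n; i++) {
                avg[i] += params[i];
            }
        }
        for (int i = 0; i < n; i++) {
            avg[i] /= workerParams.size();
        }
        return avg;
    }

    public static void main(String[] args) {
        // Two workers whose parameters drifted apart between averaging points.
        List<double[]> workers = Arrays.asList(
                new double[]{1.0, 2.0},
                new double[]{3.0, 6.0});
        System.out.println(Arrays.toString(average(workers))); // prints [2.0, 4.0]
    }
}
```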
package com.cq.aifocusstocks.train;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.deeplearning4j.core.storage.StatsStorage;
import org.deeplearning4j.nn.api.OptimizationAlgorithm;
import org.deeplearning4j.nn.conf.ConvolutionMode;
import org.deeplearning4j.nn.conf.GradientNormalization;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.RNNFormat;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.Convolution1D;
import org.deeplearning4j.nn.conf.layers.LSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.deeplearning4j.nn.conf.layers.Subsampling1DLayer;
import org.deeplearning4j.nn.conf.layers.SubsamplingLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.nn.weights.WeightInit;
import org.deeplearning4j.parallelism.ParallelWrapper;
import org.deeplearning4j.ui.api.UIServer;
import org.deeplearning4j.ui.model.stats.StatsListener;
import org.deeplearning4j.ui.model.storage.InMemoryStatsStorage;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.learning.config.RmsProp;
import org.nd4j.linalg.lossfunctions.impl.LossMSE;
import org.nd4j.linalg.schedule.ISchedule;
import org.nd4j.linalg.schedule.ScheduleType;
import org.nd4j.linalg.schedule.StepSchedule;
/** @author cqiao */
public class CnnLstmPredictModelTestUIServer {
protected int featuresCount = 24;
protected int timeStep = 30;
protected int nEpochs = 100;
protected int startTrainResultReportEpoch = 3; // start printing reports from this epoch
protected int trainResultReportStep = 2;
protected int batchSize = 64;
private int samplesTotal = 100000;
protected double l1 = 0;
protected double l2 = 0.0001;
protected float dropOut = 0.5f;
protected ISchedule rnnLrSchedule;
protected ISchedule outLrSchedule;
protected Path modelFileNamesFilePath;
protected final Charset CHARSET = Charset.forName("UTF-8");
protected boolean mutilGPU = true;
protected int prefetchBufferMutilGPU = 24;
protected int workersMutilGPU = 4;
protected int avgFrequencyMutilGPU = 2;
protected float gradientNormalizationThreshold = 1; // default
protected float rnnGradientNormalizationThreshold = 0.5f;
protected boolean hasPoolingLayer = false;
protected ISchedule cnnLrSchedule;
protected int[] cnnStrides = {1, 1, 1, 1};// Strides for each CNN layer
protected int[] cnnNeurons = {32, 64}; // Number of neurons in each CNN layer
protected int[] rnnNeurons = {64, 32}; // Number of neurons in each RNN layer
int[] cnnKernelSizes = {3, 3, 3, 3}; // Kernel sizes for each CNN layer
public MultiLayerConfiguration getNetConf() {
double startLR = 0.001;
double endLR = 0.00001;
long iterationsTotal = samplesTotal / batchSize * nEpochs;
long step = 100;
double decayRate = computeDecayRate(startLR, endLR, iterationsTotal, step);
// StepSchedule's fourth argument is the step interval in iterations, not the final learning rate.
cnnLrSchedule = new StepSchedule(ScheduleType.ITERATION, startLR, decayRate, step);
rnnLrSchedule = cnnLrSchedule;
outLrSchedule = cnnLrSchedule;
DataType dataType = DataType.FLOAT;
NeuralNetConfiguration.Builder nncBuilder = new NeuralNetConfiguration.Builder()
// .updater(new RmsProp(rnnLrSchedule))//(rnnLrSchedule))
NeuralNetConfiguration.ListBuilder listBuilder = nncBuilder.list();
int nIn = featuresCount;//
int layerIndex = 0;
// Add CNN layers
if (cnnNeurons != null) {
final int cnnLayerCount = cnnNeurons.length;
final Adam adam = new Adam(cnnLrSchedule);
for (int i = 0; i < cnnLayerCount; i++) {
listBuilder.layer(layerIndex, new Convolution1D.Builder()
// .padding(cnnPadding)
nIn = cnnNeurons[i];
if (hasPoolingLayer) {
listBuilder.layer(layerIndex, new Subsampling1DLayer.Builder()
// listBuilder.layer(layerIndex, new BatchNormalization.Builder().nOut(nIn).build());//an exception is thrown
// ++layerIndex;
// Add RNN layers
final RmsProp rmsProp = new RmsProp(rnnLrSchedule);
for (int i = 0; i < this.rnnNeurons.length; ++i) {
listBuilder.layer(layerIndex, new LSTM.Builder()
nIn = rnnNeurons[i];
// listBuilder.layer(layerIndex, new BatchNormalization.Builder().nOut(nIn).build());//an exception is thrown
// ++layerIndex;
new RnnOutputLayer.Builder(new LossMSE()).updater(new RmsProp(outLrSchedule))//
// listBuilder.setInputType(InputType.recurrent(featuresCount));
MultiLayerConfiguration conf = listBuilder.build();
return conf;
private double computeDecayRate(double startLr, double endLr, long iterationsTotal, long step) {
return Math.pow(endLr / startLr, (double) step / iterationsTotal);
public void trainModel() {
System.out.println("start train: " + LocalDateTime.now());
TimeSeriesListDataSetIterator trainIterator=generateIterator();
MultiLayerNetwork net = new MultiLayerNetwork(getNetConf());
UIServer uiServer = uiMonitor(net);
// modelFileNamesFilePath = Paths.get(modelSaveFileName + LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss")) + ".txt");
ParallelWrapper mutilGPUWrapper = null;
// if (mutilGPU) {
// mutilGPUWrapper = new ParallelWrapper.Builder(net)
// .prefetchBuffer(prefetchBufferMutilGPU)
// .workers(workersMutilGPU)
// .averagingFrequency(avgFrequencyMutilGPU)
// .reportScoreAfterAveraging(true)
// .build();
// }
for (int i = 0; i < nEpochs; i++) {
if (mutilGPU) {
//if this statement is placed outside the loop,
//an exception will be thrown after being executed multiple times.
//The number of times the loop can be executed is uncertain.
//My iterator is custom-defined and I don't know if it is caused by it.
mutilGPUWrapper = new ParallelWrapper.Builder(net)
} else {
System.out.println("==No." + i + " nEpochs, " //
+ LocalTime.now() + ", model.score=" + net.score());
// if ((i == startTrainResultReportEpoch || (i > startTrainResultReportEpoch && (i - startTrainResultReportEpoch) % trainResultReportStep == 0)) && i != nEpochs - 1) {
// if (!mutilGPU) {
// RegressionEvaluation eval = new RegressionEvaluation();
// test(eval, model, validateDataSetIterator);
// }
// String modelId = getNeuronsStr() + "-" + LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss"));
// this.saveModel(model, modelSaveFileName, modelId);
// }
try {
if (uiServer != null) {
} catch (InterruptedException ex) {
Logger.getLogger(CnnLstmPredictModelTestUIServer.class.getName()).log(Level.SEVERE, null, ex);
private TimeSeriesListDataSetIterator generateIterator() {
List<DataSet> dataSetList = new ArrayList<>();
int dataSetCount = samplesTotal / batchSize;
System.out.println("the iterator count of each Epoch: "+dataSetCount);
for (int i = 0; i < dataSetCount; ++i) {
INDArray features3D = Nd4j.randn(new int[]{batchSize, featuresCount, timeStep}).muli(2).subi(1);
INDArray labels3D = Nd4j.randn(new int[]{batchSize, 1,timeStep}).muli(2).subi(1);
dataSetList.add(new DataSet(features3D,labels3D));
return new TimeSeriesListDataSetIterator(dataSetList, true);
public UIServer uiMonitor(MultiLayerNetwork model) {
System.setProperty("org.deeplearning4j.ui.port", "9001");
UIServer uiServer = UIServer.getInstance();
StringBuilder sb = new StringBuilder("http://localhost:").append(UIServer.getInstance().getPort()).append("/");
System.out.println("UIServer url:" + sb.toString());
StatsStorage statsStorage = new InMemoryStatsStorage(); // or: new FileStatsStorage(File), for saving and loading later
model.setListeners(new StatsListener(statsStorage));
return uiServer;
public void setCnnStrides(int[] cnnStrides) {
this.cnnStrides = cnnStrides;
public int[] getCnnNeurons() {
return cnnNeurons;
public void setCnnNeurons(int[] cnnNeurons) {
this.cnnNeurons = cnnNeurons;
public int[] getCnnKernelSizes() {
return cnnKernelSizes;
public void setCnnKernelSizes(int[] cnnKernelSizes) {
this.cnnKernelSizes = cnnKernelSizes;
public ISchedule getCnnLrSchedule() {
return cnnLrSchedule;
public void setCnnLrSchedule(ISchedule cnnLrSchedule) {
this.cnnLrSchedule = cnnLrSchedule;
public float getGradientNormalizationThreshold() {
return gradientNormalizationThreshold;
public void setGradientNormalizationThreshold(float gradientNormalizationThreshold) {
this.gradientNormalizationThreshold = gradientNormalizationThreshold;
public float getRnnGradientNormalizationThreshold() {
return rnnGradientNormalizationThreshold;
public void setRnnGradientNormalizationThreshold(float rnnGradientNormalizationThreshold) {
this.rnnGradientNormalizationThreshold = rnnGradientNormalizationThreshold;
public boolean isHasPoolingLayer() {
return hasPoolingLayer;
public void setHasPoolingLayer(boolean hasPoolingLayer) {
this.hasPoolingLayer = hasPoolingLayer;
public static void main(String[] args){
CnnLstmPredictModelTestUIServer testUI=new CnnLstmPredictModelTestUIServer();
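As a quick check of the learning-rate arithmetic in getNetConf(): computeDecayRate picks a per-step factor so that a step-decay schedule of the form lr(i) = startLr * decayRate^floor(i / step) reaches endLr after iterationsTotal iterations. The plain-Java sketch below reproduces that arithmetic with the values from the class (100000 samples, batch size 64, 100 epochs, step 100). `StepDecaySketch` and `lrAt` are hypothetical names, and the formula is an assumption about how a step schedule is evaluated rather than a copy of library code.

```java
// Hypothetical illustration of the step-decay arithmetic; no DL4J dependency.
class StepDecaySketch {

    // Same formula as computeDecayRate in the class above.
    static double computeDecayRate(double startLr, double endLr, long iterationsTotal, long step) {
        return Math.pow(endLr / startLr, (double) step / iterationsTotal);
    }

    // Assumed step-decay rule: the rate drops by decayRate once every `step` iterations.
    static double lrAt(double startLr, double decayRate, long step, long iteration) {
        return startLr * Math.pow(decayRate, Math.floor((double) iteration / step));
    }

    public static void main(String[] args) {
        double startLr = 0.001;
        double endLr = 0.00001;
        long iterationsTotal = 100000 / 64 * 100; // integer division: 1562 batches * 100 epochs
        long step = 100;

        double decayRate = computeDecayRate(startLr, endLr, iterationsTotal, step);
        System.out.println("decayRate        = " + decayRate);
        System.out.println("lr at iteration 0 = " + lrAt(startLr, decayRate, step, 0));
        System.out.println("lr at last iter   = " + lrAt(startLr, decayRate, step, iterationsTotal));
    }
}
```

With these inputs the decay rate is slightly below 1, and the learning rate at the last iteration comes out at approximately endLr, which is what the formula is designed to achieve.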
package com.cq.aifocusstocks.train;
import java.util.List;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.DataSetPreProcessor;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.factory.Nd4j;
/** @author cqiao */
public class TimeSeriesListDataSetIterator implements DataSetIterator {
private List<DataSet> dataSetList;
protected int inputColumns;
protected int outputColumns;
private int totalSamples;
private int totalBatch;
private int batchCursor;
protected int batchSize;
protected int sampleStep;
private INDArray batchsizeLabelMask;
private boolean needLabelMask = true;
// public TimeSeriesListDataSetIterator(List<DataSet> dataSetList) {
// TimeSeriesListDataSetIterator(dataSetList,true);
// }
/**
 * @param dataSetList each DataSet contains the same number of samples, 3D
 * @param needLabelMask
 */
public TimeSeriesListDataSetIterator(List<DataSet> dataSetList,boolean needLabelMask) {
this.dataSetList = dataSetList;
totalBatch = dataSetList.size();
long[] featuresShape = dataSetList.get(0).getFeatures().shape();
long[] labelsShape = dataSetList.get(0).getLabels().shape();
this.batchSize = (int) labelsShape[0];
this.inputColumns = (int) featuresShape[1];
this.outputColumns = (int) labelsShape[1];
this.sampleStep = (int) featuresShape[2];
totalSamples = totalBatch * batchSize;
if (needLabelMask) {
if (!dataSetList.get(0).hasMaskArrays()) {
batchsizeLabelMask = generateLabelsMask(batchSize);
for (DataSet dataSet : this.dataSetList) {
public synchronized DataSet next(int num) {
if (batchCursor == totalBatch) {
batchCursor = 0;
DataSet dataSet = this.dataSetList.get(batchCursor);
return dataSet;
private INDArray generateLabelsMask(int batchSize) {
INDArray mask = Nd4j.create(new int[]{batchSize, sampleStep}, 'f');
for (int j = 0; j < batchSize; ++j) {
mask.putScalar(j, sampleStep - 1, 1);
return mask;
public boolean resetSupported() {
return true;
public boolean asyncSupported() {
return true;
public synchronized void reset() {
batchCursor = 0;
public synchronized boolean hasNext() {
return this.batchCursor < this.totalBatch;
public DataSet next() {
return next(0);
public int getTotalExamples() {
return totalSamples;
public int inputColumns() {
return this.inputColumns;
public int totalOutcomes() {
return this.outputColumns;
public int batch() {
return batchSize;
public boolean isNeedLabelMask() {
return needLabelMask;
public void setNeedLabelMask(boolean needLabelMask) {
this.needLabelMask = needLabelMask;
public void setPreProcessor(DataSetPreProcessor dspp) {
throw new UnsupportedOperationException("Not supported yet.");
public DataSetPreProcessor getPreProcessor() {
throw new UnsupportedOperationException("Not supported yet.");
public List<String> getLabels() {
throw new UnsupportedOperationException("Not supported yet.");
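The iterator above tracks its position with a single cursor that is shared by hasNext(), next() and reset(). As a point of comparison, here is a minimal plain-Java sketch of a list-backed, resettable iterator in which next() fails loudly once the data is exhausted instead of wrapping around. `ResettableListIterator` is a hypothetical class, not part of DL4J, and whether the custom iterator is related to the exception mentioned earlier is not established in this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration: a cursor-based iterator with an explicit reset().
class ResettableListIterator<T> {
    private final List<T> items;
    private int cursor = 0;

    ResettableListIterator(List<T> items) {
        this.items = items;
    }

    synchronized boolean hasNext() {
        return cursor < items.size();
    }

    synchronized T next() {
        if (cursor >= items.size()) {
            // Fail instead of silently restarting, so a missing reset() is visible.
            throw new NoSuchElementException("exhausted; call reset() before reuse");
        }
        return items.get(cursor++);
    }

    synchronized void reset() {
        cursor = 0;
    }

    public static void main(String[] args) {
        ResettableListIterator<String> it = new ResettableListIterator<>(Arrays.asList("a", "b"));
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        it.reset(); // start the next epoch from the first batch
        System.out.println(it.next());
    }
}
```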
Can you clarify your gpu setup a bit? Your training will be as fast as your slowest gpu.
I would look at that and also try to measure more to understand what your overhead on gpu communication is before even going to multi gpu.
E-PRO:~$ lspci -nn | grep -i nvidia
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA104 [GeForce RTX 3070 Ti] [10de:2482] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation GA104 High Definition Audio Controller [10de:228b] (rev a1)
OS: Ubuntu 24.04 LTS
@cqiaoYc so you’re running on 1 gpu? Why are you trying to use multi gpu then? I’d probably need to see some sort of profiler output. A 3070 is a good card so I don’t think that should be an issue in and of itself.
My development environment has one GPU, and my training environment has four GPUs (2080 Ti). I need to train multiple models at the same time, so I usually train in both environments simultaneously and hope that the results from the two are reasonably consistent.
@cqiaoYc I would recommend just isolating what gpus are available to each process then. You can do that using the CUDA_VISIBLE_DEVICES environment variable for each process before you start it. That would allow you to use each gpu separately. Just make sure that the cpu you’re using can handle that.
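A minimal sketch of the per-process isolation suggested above, assuming each model is trained in its own JVM process started from a small Java launcher. CUDA_VISIBLE_DEVICES is the standard CUDA environment variable; the class `GpuProcessLauncher`, the jar name `trainer.jar`, and the count of four GPUs in the usage loop are hypothetical placeholders.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher: prepares one child process per GPU, each seeing only its own device.
class GpuProcessLauncher {

    static ProcessBuilder forGpu(int gpuIndex, List<String> command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        // The child process will only see the GPU with this index.
        pb.environment().put("CUDA_VISIBLE_DEVICES", Integer.toString(gpuIndex));
        pb.inheritIO(); // forward the child's stdout/stderr to this console
        return pb;
    }

    public static void main(String[] args) {
        for (int gpu = 0; gpu < 4; gpu++) {
            ProcessBuilder pb = forGpu(gpu, Arrays.asList("java", "-jar", "trainer.jar"));
            System.out.println("GPU " + gpu + ": " + pb.command()
                    + " CUDA_VISIBLE_DEVICES=" + pb.environment().get("CUDA_VISIBLE_DEVICES"));
            // pb.start(); // uncomment to actually launch the training process
        }
    }
}
```

The same effect can be had from a shell by setting the variable before each java command; the Java form is shown only to stay in the language of the thread.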
In this topic, what I am after is an anti-overfitting effect in a single-GPU environment similar to the one I get in a multi-GPU environment. The exception thrown during multi-GPU training is covered in another topic.