GPU Error between epochs

I’m getting some errors after the first epoch when training on a multi-GPU setup.

nvidia-smi output is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:03:00.0 Off |                    0 |
| N/A   33C    P0    33W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   29C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   31C    P0    33W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:83:00.0 Off |                    0 |
| N/A   32C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Code is:

    DataSetIterator train = new ExistingMiniBatchDataSetIterator(new File(TRAIN_PATH));
    DataSetIterator test = new ExistingMiniBatchDataSetIterator(new File(TEST_PATH));

    ParallelWrapper pw = new ParallelWrapper.Builder<>(net)
        .prefetchBuffer(16 * Nd4j.getAffinityManager().getNumberOfDevices())
        .reportScoreAfterAveraging(true)
        .averagingFrequency(10)
        .workers(Nd4j.getAffinityManager().getNumberOfDevices())
        .build();

    log.info("Starting training...");
    for (int i = 0; i < nEpochs; i++) {
        pw.fit(train);
        train.reset();
    }

Error is:

2020-10-14 15:51:42,963 WARN o.d.n.l.r.LSTMHelpers [ParallelWrapper training thread 0] MKL/CuDNN execution failed - falling back on built-in implementation
java.lang.RuntimeException: cuDNN status = 8: CUDNN_STATUS_EXECUTION_FAILED
at org.deeplearning4j.cuda.BaseCudnnHelper.checkCudnn(BaseCudnnHelper.java:48)
at org.deeplearning4j.cuda.recurrent.CudnnLSTMHelper.activate(CudnnLSTMHelper.java:469)
at org.deeplearning4j.nn.layers.recurrent.LSTMHelpers.activateHelper(LSTMHelpers.java:205)
at org.deeplearning4j.nn.layers.recurrent.LSTM.activateHelper(LSTM.java:177)
at org.deeplearning4j.nn.layers.recurrent.LSTM.activate(LSTM.java:147)
at org.deeplearning4j.nn.layers.recurrent.BidirectionalLayer.activate(BidirectionalLayer.java:201)
at org.deeplearning4j.nn.layers.recurrent.BidirectionalLayer.activate(BidirectionalLayer.java:239)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.ffToLayerActivationsInWs(MultiLayerNetwork.java:1134)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2746)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2704)
at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:170)
at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(StochasticGradientDescent.java:63)
at org.deeplearning4j.optimize.Solver.optimize(Solver.java:52)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fitHelper(MultiLayerNetwork.java:2305)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:2263)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:2326)
at org.deeplearning4j.parallelism.trainer.DefaultTrainer.fit(DefaultTrainer.java:236)
at org.deeplearning4j.parallelism.trainer.DefaultTrainer.run(DefaultTrainer.java:385)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.deeplearning4j.parallelism.ParallelWrapper$2$1.run(ParallelWrapper.java:161)
at java.lang.Thread.run(Thread.java:748)

@zuzumumba do you mind sharing the pom.xml/build.gradle and the main class you’re using? I’m wondering if this might be an out-of-memory error in disguise.

This is effectively just the ImdbReviewClassificationRNN example from the cuda-specific-examples, with a small modification to run multiple epochs.

pom.xml

<?xml version="1.0" encoding="UTF-8"?>


<modelVersion>4.0.0</modelVersion>
<groupId>org.deeplearning4j</groupId>
<artifactId>dl4j-cuda-specific-examples</artifactId>
<version>1.0.0-beta7</version>
<name>DeepLearning4j CUDA special examples</name>

<properties>
    <dl4j-master.version>1.0.0-beta7</dl4j-master.version>
    <nd4j.backend>nd4j-cuda-10.2</nd4j.backend>
    <java.version>1.8</java.version>
    <exec-maven-plugin.version>1.4.0</exec-maven-plugin.version>
    <maven-shade-plugin.version>2.4.3</maven-shade-plugin.version>
    <jcommon.version>1.0.23</jcommon.version>
    <logback.version>1.1.7</logback.version>
</properties>

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>nd4j-cuda-9.2</artifactId>
            <version>${dl4j-master.version}</version>
        </dependency>
        <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>nd4j-cuda-10.0</artifactId>
            <version>${dl4j-master.version}</version>
        </dependency>
        <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>nd4j-cuda-10.1</artifactId>
            <version>${dl4j-master.version}</version>
        </dependency>
        <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>nd4j-cuda-10.2</artifactId>
            <version>${dl4j-master.version}</version>
        </dependency>
        <dependency>
            <groupId>org.freemarker</groupId>
            <artifactId>freemarker</artifactId>
            <version>2.3.29</version>
        </dependency>
        <dependency>
            <groupId>io.netty</groupId>
            <artifactId>netty-common</artifactId>
            <version>4.1.42.Final</version>
        </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
    <!-- Dependency for parallel wrapper (for multi-GPU parameter averaging) -->
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-parallel-wrapper</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>

    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-core</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-ui</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-zoo</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>${nd4j.backend}</artifactId>
    </dependency>
    <!-- datavec-data-codec: used only in video example for loading video data -->
    <dependency>
        <artifactId>datavec-data-codec</artifactId>
        <groupId>org.datavec</groupId>
        <version>${dl4j-master.version}</version>
    </dependency>

    <dependency>
        <groupId>ch.qos.logback</groupId>
        <artifactId>logback-classic</artifactId>
        <version>${logback.version}</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>exec-maven-plugin</artifactId>
            <version>${exec-maven-plugin.version}</version>
            <executions>
                <execution>
                    <goals>
                        <goal>exec</goal>
                    </goals>
                </execution>
            </executions>
            <configuration>
                <executable>java</executable>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>${maven-shade-plugin.version}</version>
            <configuration>
                <shadedArtifactAttached>true</shadedArtifactAttached>
                <shadedClassifierName>${shadedClassifier}</shadedClassifierName>
                <createDependencyReducedPom>true</createDependencyReducedPom>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>org/datanucleus/**</exclude>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                    </filter>
                </filters>
            </configuration>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                <resource>reference.conf</resource>
                            </transformer>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.5.1</version>
            <configuration>
                <source>${java.version}</source>
                <target>${java.version}</target>
            </configuration>
        </plugin>
    </plugins>
</build>

main -

/*******************************************************************************
 * Copyright (c) 2020 Konduit K.K.
 * Copyright (c) 2015-2019 Skymind, Inc.
 *
 * This program and the accompanying materials are made available under the
 * terms of the Apache License, Version 2.0 which is available at
 * https://www.apache.org/licenses/LICENSE-2.0.
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations
 * under the License.
 *
 * SPDX-License-Identifier: Apache-2.0
 ******************************************************************************/

package org.deeplearning4j.examples.multigpu.advanced.w2vsentiment;

import org.deeplearning4j.nn.conf.GradientNormalization;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.LSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.nn.weights.WeightInit;
import org.deeplearning4j.optimize.listeners.PerformanceListener;
import org.deeplearning4j.parallelism.ParallelWrapper;
import org.nd4j.evaluation.classification.Evaluation;
import org.nd4j.jita.conf.CudaEnvironment;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.dataset.ExistingMiniBatchDataSetIterator;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;
import org.slf4j.Logger;

import java.io.File;

import static org.deeplearning4j.examples.multigpu.advanced.w2vsentiment.DataSetsBuilder.TEST_PATH;
import static org.deeplearning4j.examples.multigpu.advanced.w2vsentiment.DataSetsBuilder.TRAIN_PATH;

/**
 * Example: Given a movie review (raw text), classify that movie review as either positive or negative based on the words it contains.
 * This example is the multi-gpu version of the dl4j-example example of the same name.
 * Here the dataset is presaved to save time on multiple epochs.
 *
 * @author Alex Black
 */
public class ImdbReviewClassificationRNN {
    private static final Logger log = org.slf4j.LoggerFactory.getLogger(ImdbReviewClassificationRNN.class);

    public static void main(String[] args) throws Exception {

    int vectorSize = 300;   //Size of the word vectors. 300 in the Google News model
    int nEpochs = 3;        //Number of epochs (full passes of training data) to train on

    // Nd4j.setDataType(DataBuffer.Type.DOUBLE);

    CudaEnvironment.getInstance().getConfiguration()
        // key option enabled
        .allowMultiGPU(true)

        // we're allowing larger memory caches
        .setMaximumDeviceCache(2L * 1024L * 1024L * 1024L)

        // cross-device access is used for faster model averaging over pcie
        .allowCrossDeviceAccess(true);

    //Set up network configuration
    MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .updater(new Adam.Builder().learningRate(2e-2).build())
        .l2(1e-5)
        .weightInit(WeightInit.XAVIER)
        .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue).gradientNormalizationThreshold(1.0)
        .list()
        .layer(0, new LSTM.Builder().nIn(vectorSize).nOut(256)
            .activation(Activation.TANH).build())
        .layer(1, new RnnOutputLayer.Builder().activation(Activation.SOFTMAX)
            .lossFunction(LossFunctions.LossFunction.MCXENT).nIn(256).nOut(2).build())
        .build();

    MultiLayerNetwork net = new MultiLayerNetwork(conf);
    net.init();
    net.setListeners(new PerformanceListener(10, true));

    if (!new File(TRAIN_PATH).exists() || !new File(TEST_PATH).exists()) {
        new DataSetsBuilder().run(args);
    }
    //DataSetIterators for training and testing respectively
    DataSetIterator train = new ExistingMiniBatchDataSetIterator(new File(TRAIN_PATH));
    DataSetIterator test = new ExistingMiniBatchDataSetIterator(new File(TEST_PATH));

    ParallelWrapper pw = new ParallelWrapper.Builder<>(net)
        .prefetchBuffer(16 * Nd4j.getAffinityManager().getNumberOfDevices())
        .reportScoreAfterAveraging(true)
        .averagingFrequency(10)
        .workers(Nd4j.getAffinityManager().getNumberOfDevices())
        .build();

    log.info("Starting training...");
    for (int i = 0; i < nEpochs; i++) {
        pw.fit(train);
        train.reset();
    }

    log.info("Starting evaluation...");

    //Run evaluation. This is on 25k reviews, so can take some time
    Evaluation evaluation = net.evaluate(test);
    System.out.println(evaluation.stats());
}

}

@zuzumumba hmm, that pom looks strange. You have multiple versions of CUDA in there for some reason, one of which doesn’t even exist for the latest release. You’re using CUDA 10.2 on your machine, but the CUDA versions in the pom don’t seem to match.

I also only see those in dependencyManagement, not as an explicit dependency. Could you fix that as well and put an actual nd4j backend declaration in there rather than only in dependencyManagement?

Could you clean this up first?
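
For reference, here is a minimal sketch of what an explicit backend declaration could look like. It assumes you want the CUDA 10.2 backend (to match the CUDA version in your nvidia-smi output) and keeps the 1.0.0-beta7 version already used in your pom; adjust both if your setup differs.

```xml
<dependencies>
    <!-- Explicit ND4J backend with artifact and version written out directly.
         Assumption: CUDA 10.2 and 1.0.0-beta7, both taken from the pom above. -->
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-cuda-10.2</artifactId>
        <version>1.0.0-beta7</version>
    </dependency>
    <!-- other dependencies unchanged -->
</dependencies>
```

With the backend declared this way, the nd4j-cuda-9.2, nd4j-cuda-10.0 and nd4j-cuda-10.1 entries in dependencyManagement are not used by anything and should be safe to drop.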