MmulHelper::mmulMxM cuda failed !; Error code: [700]

Hello everyone, I am working on a project where I am using a YOLOv2 model wrapped in ParallelInference.
Hardware properties:
OS: Windows 10
RAM: 16GB
GPU: 2x RTX 3060 Ti
CUDA: 11.6

My pom.xml file:
-----------------------------------------pom.xml-------------------------------------------------

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">

<modelVersion>4.0.0</modelVersion>

<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.7.2</version>
</parent>

<groupId>com.matcon</groupId>
<artifactId>yolo-v2-runner</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>yolo-v2-runner</name>
<description>yolo-v2-runner</description>

<properties>
    <java.version>11</java.version>
    <maven.compiler.source>11</maven.compiler.source>
    <maven.compiler.target>11</maven.compiler.target>
    <dl4j-master.version>1.0.0-M2.1</dl4j-master.version>
    <nd4j-master.version>1.0.0-M2.1</nd4j-master.version>
    <cuda-version>11.6</cuda-version>
    <maven-shade-plugin.version>3.3.0</maven-shade-plugin.version>
    <maven.minimum.version>3.3.1</maven.minimum.version>
    <exec-maven-plugin.version>1.4.0</exec-maven-plugin.version>
    <shaded.classifier>bin</shaded.classifier>
</properties>

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-validation</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-core</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-native-platform</artifactId>
        <version>${nd4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-cuda-${cuda-version}-platform</artifactId>
        <version>${nd4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-parallel-wrapper</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-zoo</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.bytedeco</groupId>
        <artifactId>opencv-platform</artifactId>
        <version>4.5.5-1.5.7</version>
    </dependency>
    <!--        $CUDA_VERSION-$CUDNN_VERSION-$JAVACPP_VERSION-->
</dependencies>

<build>
    <plugins>

        <plugin>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-maven-plugin</artifactId>
        </plugin>
        <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>exec-maven-plugin</artifactId>
            <version>${exec-maven-plugin.version}</version>
            <executions>
                <execution>
                    <goals>
                        <goal>exec</goal>
                    </goals>
                </execution>
            </executions>
            <configuration>
                <executable>java</executable>
            </configuration>
        </plugin>
        <plugin>
            <groupId>com.lewisd</groupId>
            <artifactId>lint-maven-plugin</artifactId>
            <version>0.0.11</version>
            <configuration>
                <failOnViolation>true</failOnViolation>
                <onlyRunRules>
                    <rule>DuplicateDep</rule>
                    <rule>RedundantPluginVersion</rule>
                    <!-- Rules incompatible with Java 9
                    <rule>VersionProp</rule>
                    <rule>DotVersionProperty</rule> -->
                </onlyRunRules>
            </configuration>
            <executions>
                <execution>
                    <id>pom-lint</id>
                    <phase>validate</phase>
                    <goals>
                        <goal>check</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <artifactId>maven-enforcer-plugin</artifactId>
            <version>1.0.1</version>
            <executions>
                <execution>
                    <id>enforce-default</id>
                    <goals>
                        <goal>enforce</goal>
                    </goals>
                    <configuration>
                        <rules>
                            <requireMavenVersion>
                                <version>[${maven.minimum.version},)</version>
                                <message>********** Minimum Maven Version is ${maven.minimum.version}. Please
                                    upgrade Maven before continuing (run "mvn --version" to check). **********
                                </message>
                            </requireMavenVersion>
                        </rules>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <artifactId>maven-surefire-plugin</artifactId>
            <version>3.0.0-M5</version>
            <inherited>true</inherited>
            <dependencies>
                <dependency>
                    <groupId>org.apache.maven.surefire</groupId>
                    <artifactId>surefire-junit-platform</artifactId>
                    <version>3.0.0-M5</version>
                </dependency>
            </dependencies>
        </plugin>
    </plugins>


    <pluginManagement>
        <plugins>
            <plugin>
                <groupId>org.eclipse.m2e</groupId>
                <artifactId>lifecycle-mapping</artifactId>
                <version>1.0.0</version>
                <configuration>
                    <lifecycleMappingMetadata>
                        <pluginExecutions>
                            <pluginExecution>
                                <pluginExecutionFilter>
                                    <groupId>com.lewisd</groupId>
                                    <artifactId>lint-maven-plugin</artifactId>
                                    <versionRange>[0.0.11,)</versionRange>
                                    <goals>
                                        <goal>check</goal>
                                    </goals>
                                </pluginExecutionFilter>
                                <action>
                                    <ignore/>
                                </action>
                            </pluginExecution>
                        </pluginExecutions>
                    </lifecycleMappingMetadata>
                </configuration>
            </plugin>
        </plugins>
    </pluginManagement>

</build>

</project>
-----------------------------------------pom.xml-------------------------------------------------

Below is the output at application startup:
-----------------------------------------app logs-------------------------------------------------
2022-09-29 13:27:09.846 INFO 1796 — [ main] org.nd4j.linalg.factory.Nd4jBackend : Loaded [JCublasBackend] backend
2022-09-29 13:27:13.762 INFO 1796 — [ main] org.nd4j.nativeblas.NativeOpsHolder : Number of threads used for linear algebra: 32
2022-09-29 13:27:13.826 INFO 1796 — [ main] o.n.l.a.o.e.DefaultOpExecutioner : Backend used: [CUDA]; OS: [Windows 10]
2022-09-29 13:27:13.826 INFO 1796 — [ main] o.n.l.a.o.e.DefaultOpExecutioner : Cores: [4]; Memory: [2.0GB];
2022-09-29 13:27:13.826 INFO 1796 — [ main] o.n.l.a.o.e.DefaultOpExecutioner : Blas vendor: [CUBLAS]
2022-09-29 13:27:13.837 INFO 1796 — [ main] org.nd4j.linalg.jcublas.JCublasBackend : ND4J CUDA build version: 11.6.55
2022-09-29 13:27:13.840 INFO 1796 — [ main] org.nd4j.linalg.jcublas.JCublasBackend : CUDA device 0: [NVIDIA GeForce RTX 3060 Ti]; cc: [8.6]; Total memory: [8589279232]
2022-09-29 13:27:13.840 INFO 1796 — [ main] org.nd4j.linalg.jcublas.JCublasBackend : CUDA device 1: [NVIDIA GeForce RTX 3060 Ti]; cc: [8.6]; Total memory: [8589410304]
2022-09-29 13:27:13.840 INFO 1796 — [ main] org.nd4j.linalg.jcublas.JCublasBackend : Backend build information:
MSVC: 192930146
STD version: 201402L
DEFAULT_ENGINE: samediff::ENGINE_CUDA
HAVE_FLATBUFFERS
-----------------------------------------app logs-------------------------------------------------
My project works fine with one GPU, but when I added the second one I encountered the following error:
-----------------------------------------exception------------------------------------------------
java.lang.RuntimeException: MmulHelper::mmulMxM cuda failed !; Error code: [700]
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2067) ~[nd4j-cuda-11.6-1.0.0-M2.1.jar:na]
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:1870) ~[nd4j-cuda-11.6-1.0.0-M2.1.jar:na]
at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6545) ~[nd4j-api-1.0.0-M2.1.jar:na]
at org.nd4j.linalg.api.blas.impl.BaseLevel3.gemm(BaseLevel3.java:62) ~[nd4j-api-1.0.0-M2.1.jar:na]
at org.nd4j.linalg.api.ndarray.BaseNDArray.mmuli(BaseNDArray.java:3202) ~[nd4j-api-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.layers.convolution.ConvolutionLayer.preOutput(ConvolutionLayer.java:473) ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.layers.convolution.ConvolutionLayer.activate(ConvolutionLayer.java:509) ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.graph.vertex.impl.LayerVertex.doForward(LayerVertex.java:110) ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.graph.ComputationGraph.outputOfLayersDetached(ComputationGraph.java:2450) ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.graph.ComputationGraph.output(ComputationGraph.java:1752) ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.graph.ComputationGraph.output(ComputationGraph.java:1708) ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.graph.ComputationGraph.output(ComputationGraph.java:1694) ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.parallelism.InplaceParallelInference$ModelHolder.output(InplaceParallelInference.java:267) ~[deeplearning4j-parallel-wrapper-1.0.0-M2.1.jar:na]
at org.deeplearning4j.parallelism.InplaceParallelInference$ModelSelector.output(InplaceParallelInference.java:150) ~[deeplearning4j-parallel-wrapper-1.0.0-M2.1.jar:na]
at org.deeplearning4j.parallelism.InplaceParallelInference.output(InplaceParallelInference.java:91) ~[deeplearning4j-parallel-wrapper-1.0.0-M2.1.jar:na]
at org.deeplearning4j.parallelism.ParallelInference.output(ParallelInference.java:191) ~[deeplearning4j-parallel-wrapper-1.0.0-M2.1.jar:na]
at org.deeplearning4j.parallelism.ParallelInference.output(ParallelInference.java:187) ~[deeplearning4j-parallel-wrapper-1.0.0-M2.1.jar:na]
-----------------------------------------exception------------------------------------------------

@Constantin could you specify which GPUs you're using as well as the CUDA version? It appears there's some sort of issue with an illegal memory access (CUDA error 700): see "cudaDeviceSynchronize returned error code 700" - CUDA Programming and Performance - NVIDIA Developer Forums.

Are you using the same arrays across different GPUs? Either way I would need a reproducer to even begin debugging this.

I am using 2x RTX 3060 Ti, CUDA version 11.6.

@Constantin could you answer all of my questions, not just one? If you don't know what I mean, please ask me to clarify and I'm happy to help. I can't read your screen or run your code. I need as much detail as possible to help you.

What do you mean by "Are you using the same arrays across different GPUs?"

Thanks for following up. Based on the error message, one of the main suspects for the crash is that a pointer allocated for GPU 1 might be passed to GPU 2.

A common issue there would be training data or weights.

My other question was about a reproducer. Do you have a standalone example I might be able to run on multi gpu? I don’t need anything proprietary from your code just a standalone example that might be similar to your situation.
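To make the "same arrays across GPUs" question concrete, here is a rough, hypothetical sketch of the kind of check I mean. The helper name logDeviceInfo is made up, and you should verify the AffinityManager methods against your ND4J version:

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class DeviceCheck {

    // Hypothetical helper: log which GPU an array was allocated on versus the
    // device the calling thread is bound to. getDeviceForArray() should exist
    // on AffinityManager in M2.1, but please double-check your version.
    static void logDeviceInfo(String label, INDArray arr) {
        Integer arrayDevice = Nd4j.getAffinityManager().getDeviceForArray(arr);
        Integer threadDevice = Nd4j.getAffinityManager().getDeviceForCurrentThread();
        System.out.println(label + ": array lives on device " + arrayDevice
                + ", current thread is bound to device " + threadDevice);
    }

    public static void main(String[] args) {
        INDArray input = Nd4j.rand(new int[]{1, 10});
        logDeviceInfo("input", input);
        // If an array created on one thread/device is later consumed by a worker
        // pinned to the other GPU, that is the kind of sharing that can end in
        // an illegal memory access (error 700).
    }
}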

This piece of code is shared between multiple threads.


This morning I received this exit code.
The matrixImg used in the detect method is created using NativeImageLoader.asMatrix(Mat image).
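Roughly, the path looks like this. This is a simplified sketch, not the exact project code; the Detector class name, the 416x416 input size, and the normalisation step are placeholders:

import org.bytedeco.opencv.opencv_core.Mat;
import org.datavec.image.loader.NativeImageLoader;
import org.deeplearning4j.parallelism.ParallelInference;
import org.nd4j.linalg.api.ndarray.INDArray;

// Simplified sketch of the detection path; the real code has more steps.
public class Detector {
    private final ParallelInference inference;
    // NB: this loader instance is currently shared by all request threads.
    private final NativeImageLoader loader = new NativeImageLoader(416, 416, 3);

    Detector(ParallelInference inference) {
        this.inference = inference;
    }

    INDArray detect(Mat image) throws Exception {
        INDArray matrixImg = loader.asMatrix(image);   // called from multiple threads
        matrixImg.divi(255.0);                         // assumed normalisation step
        return inference.output(matrixImg);
    }
}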

@Constantin do you have an hs_err_pid.log somewhere in the directory where this was run?

I did not find any log files in the project.

I am not entirely sure, but this problem could somehow be related to the topic from this post:

@Constantin no, it's definitely not. I already gave you the probable cause: data sharing. If I can't run your code I can't see the problem, only guess. Again, if you don't understand something I'm saying, ask, don't ignore it. It doesn't help either of us.

I'm not going to chase down something I can't reproduce. It's up to you whether you want to meet me halfway. I can't magically SSH into your computer and see what's going on. I need something to go off of: logs, a reproducer of some kind, or some effort on your part beyond just posting unrelated errors.

If you'd like to put effort into this, try to see whether a purely standalone example using ParallelInference reproduces the problem. If the issue isn't related to your specific usage of the library, then I'm happy to look at it.
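For example, something along these lines would be enough for me to run. This is only a skeleton; the worker count, the INPLACE mode (guessed from the InplaceParallelInference in your stack trace), and the 608x608 input size (the zoo YOLO2 default, as far as I remember) are assumptions you would need to adjust to your setup:

import java.util.Arrays;

import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.parallelism.ParallelInference;
import org.deeplearning4j.parallelism.inference.InferenceMode;
import org.deeplearning4j.zoo.model.YOLO2;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class MultiGpuYoloRepro {
    public static void main(String[] args) throws Exception {
        // Pretrained YOLO2 from the zoo, same model family as the real project
        ComputationGraph model = (ComputationGraph) YOLO2.builder().build().initPretrained();

        // Two workers so each GPU gets one model copy; INPLACE matches the
        // InplaceParallelInference seen in the stack trace above
        ParallelInference pi = new ParallelInference.Builder(model)
                .inferenceMode(InferenceMode.INPLACE)
                .workers(2)
                .build();

        // Hammer the inference from two threads, each creating its own random "image"
        Runnable task = () -> {
            for (int i = 0; i < 100; i++) {
                INDArray input = Nd4j.rand(new int[]{1, 3, 608, 608});
                INDArray out = pi.output(input);
                System.out.println(Thread.currentThread().getName()
                        + " iteration " + i + " -> output shape "
                        + Arrays.toString(out.shape()));
            }
        };
        Thread t1 = new Thread(task, "worker-1");
        Thread t2 = new Thread(task, "worker-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}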

@agibsonccc it is not a problem for me to share the entire project with you, or to give you access to the PC with the code. Tell me what would be the easiest way for you to reproduce the case.

@Constantin I just need to reproduce it. If you want, share it with me on GitHub. I don't want to log in to anything unless you're paying me :) I purely want to see what you're doing.

@agibsonccc I added you as a collaborator to the repository.