MmulHelper::mmulMxM cuda failed !; Error code: [700]

Hello everyone, I am working on a project where I am using a YOLOv2 model wrapped in ParallelInference.
Hardware properties:
OS: Windows 10
RAM: 16GB
GPU: 2x RTX 3060 Ti
CUDA: 11.6

My pom.xml file:
-----------------------------------------pom.xml-------------------------------------------------

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">

<modelVersion>4.0.0</modelVersion>

<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.7.2</version>
</parent>

<groupId>com.matcon</groupId>
<artifactId>yolo-v2-runner</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>yolo-v2-runner</name>
<description>yolo-v2-runner</description>

<properties>
    <java.version>11</java.version>
    <maven.compiler.source>11</maven.compiler.source>
    <maven.compiler.target>11</maven.compiler.target>
    <dl4j-master.version>1.0.0-M2.1</dl4j-master.version>
    <nd4j-master.version>1.0.0-M2.1</nd4j-master.version>
    <cuda-version>11.6</cuda-version>
    <maven-shade-plugin.version>3.3.0</maven-shade-plugin.version>
    <maven.minimum.version>3.3.1</maven.minimum.version>
    <exec-maven-plugin.version>1.4.0</exec-maven-plugin.version>
    <shaded.classifier>bin</shaded.classifier>
</properties>

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-validation</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-core</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-native-platform</artifactId>
        <version>${nd4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-cuda-${cuda-version}-platform</artifactId>
        <version>${nd4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-parallel-wrapper</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-zoo</artifactId>
        <version>${dl4j-master.version}</version>
    </dependency>
    <dependency>
        <groupId>org.bytedeco</groupId>
        <artifactId>opencv-platform</artifactId>
        <version>4.5.5-1.5.7</version>
    </dependency>
    <!--        $CUDA_VERSION-$CUDNN_VERSION-$JAVACPP_VERSION-->
</dependencies>

<build>
    <plugins>

        <plugin>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-maven-plugin</artifactId>
        </plugin>
        <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>exec-maven-plugin</artifactId>
            <version>${exec-maven-plugin.version}</version>
            <executions>
                <execution>
                    <goals>
                        <goal>exec</goal>
                    </goals>
                </execution>
            </executions>
            <configuration>
                <executable>java</executable>
            </configuration>
        </plugin>
        <plugin>
            <groupId>com.lewisd</groupId>
            <artifactId>lint-maven-plugin</artifactId>
            <version>0.0.11</version>
            <configuration>
                <failOnViolation>true</failOnViolation>
                <onlyRunRules>
                    <rule>DuplicateDep</rule>
                    <rule>RedundantPluginVersion</rule>
                    <!-- Rules incompatible with Java 9
                    <rule>VersionProp</rule>
                    <rule>DotVersionProperty</rule> -->
                </onlyRunRules>
            </configuration>
            <executions>
                <execution>
                    <id>pom-lint</id>
                    <phase>validate</phase>
                    <goals>
                        <goal>check</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <artifactId>maven-enforcer-plugin</artifactId>
            <version>1.0.1</version>
            <executions>
                <execution>
                    <id>enforce-default</id>
                    <goals>
                        <goal>enforce</goal>
                    </goals>
                    <configuration>
                        <rules>
                            <requireMavenVersion>
                                <version>[${maven.minimum.version},)</version>
                                <message>********** Minimum Maven Version is ${maven.minimum.version}. Please
                                    upgrade Maven before continuing (run "mvn --version" to check). **********
                                </message>
                            </requireMavenVersion>
                        </rules>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <artifactId>maven-surefire-plugin</artifactId>
            <version>3.0.0-M5</version>
            <inherited>true</inherited>
            <dependencies>
                <dependency>
                    <groupId>org.apache.maven.surefire</groupId>
                    <artifactId>surefire-junit-platform</artifactId>
                    <version>3.0.0-M5</version>
                </dependency>
            </dependencies>
        </plugin>
    </plugins>


    <pluginManagement>
        <plugins>
            <plugin>
                <groupId>org.eclipse.m2e</groupId>
                <artifactId>lifecycle-mapping</artifactId>
                <version>1.0.0</version>
                <configuration>
                    <lifecycleMappingMetadata>
                        <pluginExecutions>
                            <pluginExecution>
                                <pluginExecutionFilter>
                                    <groupId>com.lewisd</groupId>
                                    <artifactId>lint-maven-plugin</artifactId>
                                    <versionRange>[0.0.11,)</versionRange>
                                    <goals>
                                        <goal>check</goal>
                                    </goals>
                                </pluginExecutionFilter>
                                <action>
                                    <ignore/>
                                </action>
                            </pluginExecution>
                        </pluginExecutions>
                    </lifecycleMappingMetadata>
                </configuration>
            </plugin>
        </plugins>
    </pluginManagement>

</build>

</project>
-----------------------------------------pom.xml-------------------------------------------------

Below is the output at application startup:
-----------------------------------------app logs-------------------------------------------------
2022-09-29 13:27:09.846 INFO 1796 — [ main] org.nd4j.linalg.factory.Nd4jBackend : Loaded [JCublasBackend] backend
2022-09-29 13:27:13.762 INFO 1796 — [ main] org.nd4j.nativeblas.NativeOpsHolder : Number of threads used for linear algebra: 32
2022-09-29 13:27:13.826 INFO 1796 — [ main] o.n.l.a.o.e.DefaultOpExecutioner : Backend used: [CUDA]; OS: [Windows 10]
2022-09-29 13:27:13.826 INFO 1796 — [ main] o.n.l.a.o.e.DefaultOpExecutioner : Cores: [4]; Memory: [2.0GB];
2022-09-29 13:27:13.826 INFO 1796 — [ main] o.n.l.a.o.e.DefaultOpExecutioner : Blas vendor: [CUBLAS]
2022-09-29 13:27:13.837 INFO 1796 — [ main] org.nd4j.linalg.jcublas.JCublasBackend : ND4J CUDA build version: 11.6.55
2022-09-29 13:27:13.840 INFO 1796 — [ main] org.nd4j.linalg.jcublas.JCublasBackend : CUDA device 0: [NVIDIA GeForce RTX 3060 Ti]; cc: [8.6]; Total memory: [8589279232]
2022-09-29 13:27:13.840 INFO 1796 — [ main] org.nd4j.linalg.jcublas.JCublasBackend : CUDA device 1: [NVIDIA GeForce RTX 3060 Ti]; cc: [8.6]; Total memory: [8589410304]
2022-09-29 13:27:13.840 INFO 1796 — [ main] org.nd4j.linalg.jcublas.JCublasBackend : Backend build information:
MSVC: 192930146
STD version: 201402L
DEFAULT_ENGINE: samediff::ENGINE_CUDA
HAVE_FLATBUFFERS
-----------------------------------------app logs-------------------------------------------------
My project works fine with one GPU, but when I added the second one I encountered the following error:
-----------------------------------------exception------------------------------------------------
java.lang.RuntimeException: MmulHelper::mmulMxM cuda failed !; Error code: [700]
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2067) ~[nd4j-cuda-11.6-1.0.0-M2.1.jar:na]
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:1870) ~[nd4j-cuda-11.6-1.0.0-M2.1.jar:na]
at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6545) ~[nd4j-api-1.0.0-M2.1.jar:na]
at org.nd4j.linalg.api.blas.impl.BaseLevel3.gemm(BaseLevel3.java:62) ~[nd4j-api-1.0.0-M2.1.jar:na]
at org.nd4j.linalg.api.ndarray.BaseNDArray.mmuli(BaseNDArray.java:3202) ~[nd4j-api-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.layers.convolution.ConvolutionLayer.preOutput(ConvolutionLayer.java:473) ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.layers.convolution.ConvolutionLayer.activate(ConvolutionLayer.java:509) ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.graph.vertex.impl.LayerVertex.doForward(LayerVertex.java:110) ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.graph.ComputationGraph.outputOfLayersDetached(ComputationGraph.java:2450) ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.graph.ComputationGraph.output(ComputationGraph.java:1752) ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.graph.ComputationGraph.output(ComputationGraph.java:1708) ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.graph.ComputationGraph.output(ComputationGraph.java:1694) ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.parallelism.InplaceParallelInference$ModelHolder.output(InplaceParallelInference.java:267) ~[deeplearning4j-parallel-wrapper-1.0.0-M2.1.jar:na]
at org.deeplearning4j.parallelism.InplaceParallelInference$ModelSelector.output(InplaceParallelInference.java:150) ~[deeplearning4j-parallel-wrapper-1.0.0-M2.1.jar:na]
at org.deeplearning4j.parallelism.InplaceParallelInference.output(InplaceParallelInference.java:91) ~[deeplearning4j-parallel-wrapper-1.0.0-M2.1.jar:na]
at org.deeplearning4j.parallelism.ParallelInference.output(ParallelInference.java:191) ~[deeplearning4j-parallel-wrapper-1.0.0-M2.1.jar:na]
at org.deeplearning4j.parallelism.ParallelInference.output(ParallelInference.java:187) ~[deeplearning4j-parallel-wrapper-1.0.0-M2.1.jar:na]
-----------------------------------------exception------------------------------------------------

@Constantin could you specify which GPUs you're using as well as the CUDA version? It appears there's some sort of issue with an illegal memory access (CUDA error 700): see "cudaDeviceSynchronize returned error code 700" - CUDA Programming and Performance - NVIDIA Developer Forums.

Are you using the same arrays across different GPUs? Either way I would need a reproducer to even begin debugging this.

I am using 2x RTX 3060 Ti, CUDA version 11.6.

@Constantin could you answer all of my questions, not just one? If you don't know what I mean, please ask me to clarify and I'm happy to help. I can't read your screen or run your code. I need as much detail as possible to help you.

What do you mean by "Are you using the same arrays across different GPUs?"

Thanks for following up. Based on the error message, one of the main suspects for the crash is that a pointer allocated for GPU 1 might be passed to GPU 2.

A common issue there would be training data or weights.

My other question was about a reproducer. Do you have a standalone example I might be able to run on multi gpu? I don’t need anything proprietary from your code just a standalone example that might be similar to your situation.
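To make the "same arrays across GPUs" question concrete, here is a rough, hypothetical sketch of the kind of check I mean. The helper name logDeviceInfo is made up, and you should verify the AffinityManager methods against your ND4J version:

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class DeviceCheck {

    // Hypothetical helper: log which GPU an array was allocated on versus the
    // device the calling thread is bound to. getDeviceForArray() should exist
    // on AffinityManager in M2.1, but please double-check your version.
    static void logDeviceInfo(String label, INDArray arr) {
        Integer arrayDevice = Nd4j.getAffinityManager().getDeviceForArray(arr);
        Integer threadDevice = Nd4j.getAffinityManager().getDeviceForCurrentThread();
        System.out.println(label + ": array lives on device " + arrayDevice
                + ", current thread is bound to device " + threadDevice);
    }

    public static void main(String[] args) {
        INDArray input = Nd4j.rand(new int[]{1, 10});
        logDeviceInfo("input", input);
        // If an array created on one thread/device is later consumed by a worker
        // pinned to the other GPU, that is the kind of sharing that can end in
        // an illegal memory access (error 700).
    }
}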

This piece of code is shared between multiple threads.


This morning I received this exit code.
The matrixImg used in the detect method is created using NativeImageLoader.asMatrix(Mat image).
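Roughly, the path looks like this. This is a simplified sketch, not the exact project code; the Detector class name, the 416x416 input size, and the normalisation step are placeholders:

import org.bytedeco.opencv.opencv_core.Mat;
import org.datavec.image.loader.NativeImageLoader;
import org.deeplearning4j.parallelism.ParallelInference;
import org.nd4j.linalg.api.ndarray.INDArray;

// Simplified sketch of the detection path; the real code has more steps.
public class Detector {
    private final ParallelInference inference;
    // NB: this loader instance is currently shared by all request threads.
    private final NativeImageLoader loader = new NativeImageLoader(416, 416, 3);

    Detector(ParallelInference inference) {
        this.inference = inference;
    }

    INDArray detect(Mat image) throws Exception {
        INDArray matrixImg = loader.asMatrix(image);   // called from multiple threads
        matrixImg.divi(255.0);                         // assumed normalisation step
        return inference.output(matrixImg);
    }
}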

@Constantin do you have an hs_err_pid.log somewhere in the directory where this was run?

I did not find any log files in the project.

I am not entirely sure, but this problem could somehow be related to the topic from this post:

@Constantin no, it's definitely not. I already gave you the probable cause: data sharing. If I can't run your code I can't see the problem, only guess. Again, if you don't understand something I'm saying, ask, don't ignore it. It doesn't help either of us.

I'm not going to chase down something I can't reproduce. It's up to you whether you want to meet me halfway. I can't magically SSH into your computer and see what's going on. I need something to go off of: logs, a reproducer of some kind, or some effort on your part beyond just posting unrelated errors.

If you'd like to put effort into this, try to see whether a purely standalone example using ParallelInference reproduces the problem. If the issue isn't related to your specific usage of the library, then I'm happy to look at it.
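For example, something along these lines would be enough for me to run. This is only a skeleton; the worker count, the INPLACE mode (guessed from the InplaceParallelInference in your stack trace), and the 608x608 input size (the zoo YOLO2 default, as far as I remember) are assumptions you would need to adjust to your setup:

import java.util.Arrays;

import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.parallelism.ParallelInference;
import org.deeplearning4j.parallelism.inference.InferenceMode;
import org.deeplearning4j.zoo.model.YOLO2;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class MultiGpuYoloRepro {
    public static void main(String[] args) throws Exception {
        // Pretrained YOLO2 from the zoo, same model family as the real project
        ComputationGraph model = (ComputationGraph) YOLO2.builder().build().initPretrained();

        // Two workers so each GPU gets one model copy; INPLACE matches the
        // InplaceParallelInference seen in the stack trace above
        ParallelInference pi = new ParallelInference.Builder(model)
                .inferenceMode(InferenceMode.INPLACE)
                .workers(2)
                .build();

        // Hammer the inference from two threads, each creating its own random "image"
        Runnable task = () -> {
            for (int i = 0; i < 100; i++) {
                INDArray input = Nd4j.rand(new int[]{1, 3, 608, 608});
                INDArray out = pi.output(input);
                System.out.println(Thread.currentThread().getName()
                        + " iteration " + i + " -> output shape "
                        + Arrays.toString(out.shape()));
            }
        };
        Thread t1 = new Thread(task, "worker-1");
        Thread t2 = new Thread(task, "worker-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}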

@agibsonccc it is not a problem for me to share the entire project with you, or to give you access to the PC with the code. Tell me what would be the easiest way for you to reproduce the case.

@Constantin I just need to reproduce it. If you want, share it with me on GitHub. I don't want to log in to anything unless you're paying me :) I purely want to see what you're doing.

@agibsonccc I added you as a collaborator to the repository.