MmulHelper::mmulMxM cuda failed !; Error code: [700]

Hello everyone, I am working on a project where I use a YOLOv2 model wrapped in ParallelInference.
Hardware properties:
OS: Windows 10
GPU: 2x RTX 3060 Ti
CUDA: 11.6

My pom.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Rules incompatible with Java 9
<rule>DotVersionProperty</rule> -->
<message>********** Minimum Maven Version is ${maven.minimum.version}. Please
    upgrade Maven before continuing (run "mvn --version" to check). **********


Below is the output at application startup:
-----------------------------------------app logs-------------------------------------------------
2022-09-29 13:27:09.846 INFO 1796 — [ main] org.nd4j.linalg.factory.Nd4jBackend : Loaded [JCublasBackend] backend
2022-09-29 13:27:13.762 INFO 1796 — [ main] org.nd4j.nativeblas.NativeOpsHolder : Number of threads used for linear algebra: 32
2022-09-29 13:27:13.826 INFO 1796 — [ main] o.n.l.a.o.e.DefaultOpExecutioner : Backend used: [CUDA]; OS: [Windows 10]
2022-09-29 13:27:13.826 INFO 1796 — [ main] o.n.l.a.o.e.DefaultOpExecutioner : Cores: [4]; Memory: [2.0GB];
2022-09-29 13:27:13.826 INFO 1796 — [ main] o.n.l.a.o.e.DefaultOpExecutioner : Blas vendor: [CUBLAS]
2022-09-29 13:27:13.837 INFO 1796 — [ main] org.nd4j.linalg.jcublas.JCublasBackend : ND4J CUDA build version: 11.6.55
2022-09-29 13:27:13.840 INFO 1796 — [ main] org.nd4j.linalg.jcublas.JCublasBackend : CUDA device 0: [NVIDIA GeForce RTX 3060 Ti]; cc: [8.6]; Total memory: [8589279232]
2022-09-29 13:27:13.840 INFO 1796 — [ main] org.nd4j.linalg.jcublas.JCublasBackend : CUDA device 1: [NVIDIA GeForce RTX 3060 Ti]; cc: [8.6]; Total memory: [8589410304]
2022-09-29 13:27:13.840 INFO 1796 — [ main] org.nd4j.linalg.jcublas.JCublasBackend : Backend build information:
MSVC: 192930146
STD version: 201402L
-----------------------------------------app logs-------------------------------------------------
My project works fine with one GPU, but after I added the second one I encountered the following error:
java.lang.RuntimeException: MmulHelper::mmulMxM cuda failed !; Error code: [700]
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec( ~[nd4j-cuda-11.6-1.0.0-M2.1.jar:na]
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec( ~[nd4j-cuda-11.6-1.0.0-M2.1.jar:na]
at org.nd4j.linalg.factory.Nd4j.exec( ~[nd4j-api-1.0.0-M2.1.jar:na]
at org.nd4j.linalg.api.blas.impl.BaseLevel3.gemm( ~[nd4j-api-1.0.0-M2.1.jar:na]
at org.nd4j.linalg.api.ndarray.BaseNDArray.mmuli( ~[nd4j-api-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.layers.convolution.ConvolutionLayer.preOutput( ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.layers.convolution.ConvolutionLayer.activate( ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.graph.vertex.impl.LayerVertex.doForward( ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.graph.ComputationGraph.outputOfLayersDetached( ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.graph.ComputationGraph.output( ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.graph.ComputationGraph.output( ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.nn.graph.ComputationGraph.output( ~[deeplearning4j-nn-1.0.0-M2.1.jar:na]
at org.deeplearning4j.parallelism.InplaceParallelInference$ModelHolder.output( ~[deeplearning4j-parallel-wrapper-1.0.0-M2.1.jar:na]
at org.deeplearning4j.parallelism.InplaceParallelInference$ModelSelector.output( ~[deeplearning4j-parallel-wrapper-1.0.0-M2.1.jar:na]
at org.deeplearning4j.parallelism.InplaceParallelInference.output( ~[deeplearning4j-parallel-wrapper-1.0.0-M2.1.jar:na]
at org.deeplearning4j.parallelism.ParallelInference.output( ~[deeplearning4j-parallel-wrapper-1.0.0-M2.1.jar:na]
at org.deeplearning4j.parallelism.ParallelInference.output( ~[deeplearning4j-parallel-wrapper-1.0.0-M2.1.jar:na]

@Constantin could you specify which GPUs you’re using as well as the CUDA version? It appears there’s some sort of illegal (out-of-bounds) memory access issue. See "CUDA error 700 ?? cudaDeviceSynchronize returned error code 700" in CUDA Programming and Performance on the NVIDIA Developer Forums.

Are you using the same arrays across different GPUs? Either way I would need a reproducer to even begin debugging this.

I am using 2x RTX 3060 Ti, CUDA version 11.6.

@Constantin could you answer all of my questions not just 1? If you don’t know what I mean please ask me to clarify and I’m happy to help. I can’t read your screen or run your code. I need as much detail as possible to help you.

What do you mean by "Are you using the same arrays across different GPUs?"

Thanks for following up. One of the main suspects of the crash based on the error message is that a pointer for GPU 1 might be passed to GPU 2.

A common source of that would be training data or weights shared between devices.

My other question was about a reproducer. Do you have a standalone example I might be able to run on multiple GPUs? I don’t need anything proprietary from your code, just a standalone example that might be similar to your situation.
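The suspected failure mode described above (a pointer that belongs to one GPU being used on another) can be illustrated with a toy model in plain Java. Everything in this sketch is hypothetical: `DeviceBuffer`, `sumOnDevice`, and `runOnWorker` are illustrative names, not ND4J or DL4J APIs, and the "device" is only an integer tag. It shows only why an array created on one thread/device should be copied before a worker bound to a different device touches it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Toy model only: these classes are hypothetical and are NOT part of ND4J. */
public class DeviceAffinitySketch {

    /** Stand-in for a GPU buffer that is only valid on the device that allocated it. */
    static final class DeviceBuffer {
        final int deviceId;
        final float[] data;

        DeviceBuffer(int deviceId, float[] data) {
            this.deviceId = deviceId;
            this.data = data;
        }

        /** Explicit copy, so the consuming device owns its own buffer. */
        DeviceBuffer copyTo(int targetDevice) {
            return new DeviceBuffer(targetDevice, data.clone());
        }
    }

    /** Stand-in for a kernel launch: using a buffer owned by another device fails. */
    static float sumOnDevice(int currentDevice, DeviceBuffer buffer) {
        if (buffer.deviceId != currentDevice) {
            throw new IllegalStateException("illegal memory access: buffer belongs to device "
                    + buffer.deviceId + " but was used on device " + currentDevice);
        }
        float sum = 0f;
        for (float v : buffer.data) {
            sum += v;
        }
        return sum;
    }

    /** Runs the "kernel" on a worker thread that, in this model, is bound to workerDevice. */
    static float runOnWorker(int workerDevice, DeviceBuffer buffer, boolean copyFirst) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Callable<Float> task = () -> {
                DeviceBuffer local = copyFirst ? buffer.copyTo(workerDevice) : buffer;
                return sumOnDevice(workerDevice, local);
            };
            return worker.submit(task).get();
        } catch (ExecutionException e) {
            // Surface the worker's failure to the caller as an unchecked exception.
            throw new IllegalStateException(e.getCause().getMessage(), e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        DeviceBuffer shared = new DeviceBuffer(0, new float[] {1f, 2f, 3f});
        System.out.println(runOnWorker(1, shared, true));   // copy first, then compute
        try {
            runOnWorker(1, shared, false);                   // shared buffer used on the wrong device
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In a real application the equivalent check would be whether the same input arrays or model parameters are reachable from worker threads attached to different GPUs; the sketch makes no claim about how ND4J handles that internally.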

This piece of code is shared between multiple threads.

This morning I received this exit code.
The matrixImg in the detect method is created using NativeImageLoader.asMatrix(Mat image).

@Constantin do you have an hs_err_pid.log somewhere in the directory where this was run?
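For context, HotSpot JVMs normally write a fatal-error log named `hs_err_pid<pid>.log` into the working directory of the crashed process, unless `-XX:ErrorFile` points elsewhere. A minimal sketch for checking a directory for such files follows; the class name `CrashLogFinder` is made up for illustration, and it assumes the default file naming and only looks at the top level of the directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists JVM fatal-error logs in a directory. */
public class CrashLogFinder {

    /** Returns the files directly inside dir whose names match hs_err_pid*.log. */
    static List<Path> findCrashLogs(Path dir) {
        List<Path> logs = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "hs_err_pid*.log")) {
            for (Path p : stream) {
                logs.add(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return logs;
    }

    public static void main(String[] args) {
        // Default to the current working directory, where the JVM usually writes the log.
        Path dir = Paths.get(args.length > 0 ? args[0] : System.getProperty("user.dir"));
        List<Path> logs = findCrashLogs(dir);
        if (logs.isEmpty()) {
            System.out.println("No hs_err_pid*.log files found in " + dir.toAbsolutePath());
        } else {
            logs.forEach(p -> System.out.println(p.toAbsolutePath()));
        }
    }
}
```

If the application is started from an IDE or as a service, the working directory may differ from the project root, so it can be worth running the check against that directory as well.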

I did not find any log files in the project.

I am not entirely sure, but this problem could be related to the topic of this post:

@Constantin no, it’s definitely not. I already gave you the probable cause: data sharing. If I can’t run your code, I can’t see the problem, only guess. Again, if you don’t understand something I’m saying, ask, don’t ignore it. It doesn’t help either of us.

I’m not going to chase down something I can’t reproduce. It’s up to you whether you want to meet me halfway. I can’t magically ssh into your computer and see what’s going on. I need something to go off of: logs, a reproducer of some kind, or more effort on your part than just posting unrelated errors.

If you’d like to put effort into this, try to see if a purely standalone solution with ParallelInference reproduces that problem. If the issue isn’t related to your specific usage of the library then I’m happy to look at it.

@agibsonccc it is not a problem for me to share the entire project with you, or to give you access to the PC with the code. Tell me what would be the easiest way for you to reproduce the case.

@Constantin I just need to reproduce it. If you want, share it with me on GitHub. I don’t want to log in to anything unless you’re paying me :slight_smile: I purely want to see what you’re doing.

@agibsonccc I added you as a collaborator to the repository.