Fatal Error in Java Runtime Environment

After upgrading from 1.0.0-M1.1 to 1.0.0-M2.1, I see JVM crashes routinely. I train a MultiLayerNetwork (model.fit(trainData, nEpochs)) for a few hours and then it fails. At first I thought it was a hardware problem (memory instability), but after examining the hs_err_pid log files I now think it’s a bug in ND4J. I’ve run my code on three different computers.

Here are snippets from two hs_err_pid files:
C [libnd4jcpu.so+0x172ced0] sd::DataBuffer::primary()+0x0
C [libjnind4jcpu.so+0x1d44f7] Java_org_nd4j_linalg_cpu_nativecpu_bindings_Nd4jCpu_execBroadcast__Lorg_bytedeco_javacpp_PointerPointer_2ILorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2+0x437
J 5080 org.nd4j.linalg.cpu.nativecpu.bindings.Nd4jCpu.execBroadcast(Lorg/bytedeco/javacpp/PointerPointer;ILorg/nd4j/nativeblas/OpaqueDataBuffer;Lorg/bytedeco/javacpp/LongPointer;Lorg/bytedeco/javacpp/LongPointer;Lorg/nd4j/nativeblas/OpaqueDataBuffer;Lorg/bytedeco/javacpp/LongPointer;Lorg/bytedeco/javacpp/LongPointer;Lorg/nd4j/nativeblas/OpaqueDataBuffer;Lorg/bytedeco/javacpp/LongPointer;Lorg/bytedeco/javacpp/LongPointer;Lorg/nd4j/nativeblas/OpaqueDataBuffer;Lorg/bytedeco/javacpp/LongPointer;Lorg/bytedeco/javacpp/LongPointer;)V (0 bytes) @ 0x00007effe946e585 [0x00007effe946e3a0+0x00000000000001e5]

C [libnd4jcpu.so+0x174c5f0] sd::TadDescriptor::TadDescriptor(long long const*, int const*, int, bool)+0x420
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 5099 org.nd4j.linalg.cpu.nativecpu.bindings.Nd4jCpu.execBroadcast(Lorg/bytedeco/javacpp/PointerPointer;ILorg/nd4j/nativeblas/OpaqueDataBuffer;Lorg/bytedeco/javacpp/LongPointer;Lorg/bytedeco/javacpp/LongPointer;Lorg/nd4j/nativeblas/OpaqueDataBuffer;Lorg/bytedeco/javacpp/LongPointer;Lorg/bytedeco/javacpp/LongPointer;Lorg/nd4j/nativeblas/OpaqueDataBuffer;Lorg/bytedeco/javacpp/LongPointer;Lorg/bytedeco/javacpp/LongPointer;Lorg/nd4j/nativeblas/OpaqueDataBuffer;Lorg/bytedeco/javacpp/LongPointer;Lorg/bytedeco/javacpp/LongPointer;)V (0 bytes) @ 0x00007fd3f14a4380 [0x00007fd3f14a4220+0x0000000000000160]
J 5677 c2 org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.exec(Lorg/nd4j/linalg/api/ops/BroadcastOp;Lorg/nd4j/linalg/api/ops/OpContext;)Lorg/nd4j/linalg/api/ndarray/INDArray; (614 bytes) @ 0x00007fd3f159ed60 [0x00007fd3f159cf00+0x0000000000001e60]
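
To compare crashes across several hs_err_pid files, it can help to pull out just the native frames. Below is a minimal sketch in plain Java, assuming the frames look like the `C [library+offset] symbol` lines quoted above. The class and method names (`HsErrFrames`, `nativeFrames`) are made up for illustration and are not part of ND4J or the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts native (C) frames from the text of an hs_err_pid log. */
final class HsErrFrames {

    /** One native frame: the library, the offset inside it, and the symbol text. */
    static final class NativeFrame {
        final String library;
        final String offset;
        final String symbol;

        NativeFrame(String library, String offset, String symbol) {
            this.library = library;
            this.offset = offset;
            this.symbol = symbol;
        }
    }

    // Matches lines such as:
    //   C [libnd4jcpu.so+0x1700070] sd::DataBuffer::primary()+0x0
    private static final Pattern NATIVE_FRAME =
            Pattern.compile("^C\\s+\\[([^\\]+]+)\\+(0x[0-9a-fA-F]+)\\]\\s+(.*)$");

    /** Returns the native frames in the order they appear; other lines are ignored. */
    static List<NativeFrame> nativeFrames(String logText) {
        List<NativeFrame> frames = new ArrayList<>();
        for (String line : logText.split("\\R")) {
            Matcher m = NATIVE_FRAME.matcher(line.trim());
            if (m.matches()) {
                frames.add(new NativeFrame(m.group(1), m.group(2), m.group(3).trim()));
            }
        }
        return frames;
    }
}
```

Feeding each log through `nativeFrames` and grouping by `symbol` would show whether the crashes keep landing in the same native function or in different ones.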

Here’s a link to several hs_err_pid.log files:

I notice there are 17 transitive vulnerable dependencies from my pom.xml file. Could there be a dependency problem? Here is the list of vulnerabilities reported by IntelliJ.

Provides transitive vulnerable dependency maven:commons-net:commons-net:3.1 CVE-2021-37533 6.5 Improper Input Validation vulnerability with medium severity found Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:org.apache.commons:commons-collections4:4.1 Cx78f40514-81ff 7.5 Uncontrolled Recursion vulnerability pending CVSS allocation Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:commons-codec:commons-codec:1.10 Cxeb68d52e-5509 3.7 Exposure of Sensitive Information to an Unauthorized Actor vulnerability pending CVSS allocation Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:org.freemarker:freemarker:2.3.23 Cxb3498186-093f 7.5 Vulnerability with medium severity found Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:com.fasterxml.jackson.core:jackson-databind:2.11.4 Cxced0c06c-935c 5.9 Uncontrolled Resource Consumption vulnerability pending CVSS allocation CVE-2020-36518 7.5 Out-of-bounds Write vulnerability pending CVSS allocation CVE-2022-42003 7.5 Deserialization of Untrusted Data vulnerability pending CVSS allocation CVE-2022-42004 7.5 Deserialization of Untrusted Data vulnerability pending CVSS allocation Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:io.netty:netty-common:4.1.68.Final CVE-2022-24823 5.5 Exposure of Resource to Wrong Sphere vulnerability with medium severity found Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:org.json:json:20190722 Cxdb5a1032-eda2 7.5 Loop with Unreachable Exit Condition (‘Infinite Loop’) vulnerability pending CVSS allocation CVE-2022-45689 7.5 Out-of-bounds Write vulnerability with medium severity found CVE-2022-45690 7.5 Out-of-bounds Write vulnerability with medium severity found Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:com.beust:jcommander:1.27 Cx8fd408ac-dd80 8.1 Inclusion of Functionality from Untrusted Control Sphere vulnerability pending CVSS allocation Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:com.google.guava:guava:19.0 CVE-2018-10237 5.9 Allocation of Resources Without Limits or Throttling vulnerability pending CVSS allocation CVE-2020-8908 3.3 Incorrect Permission Assignment for Critical Resource vulnerability pending CVSS allocation Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:io.netty:netty-codec-http:4.1.48.Final CVE-2021-21290 5.5 Creation of Temporary File With Insecure Permissions vulnerability pending CVSS allocation CVE-2021-21295 5.9 Inconsistent Interpretation of HTTP Requests (‘HTTP Request Smuggling’) vulnerability pending CVSS allocation CVE-2021-43797 6.5 Inconsistent Interpretation of HTTP Requests (‘HTTP Request Smuggling’) vulnerability pending CVSS allocation Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:io.netty:netty-codec:4.1.48.Final CVE-2021-37136 7.5 Uncontrolled Resource Consumption vulnerability pending CVSS allocation CVE-2021-37137 7.5 Uncontrolled Resource Consumption vulnerability pending CVSS allocation CVE-2022-41915 6.5 Improper Neutralization of CRLF Sequences in HTTP Headers (‘HTTP Response Splitting’) vulnerability with medium severity found Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:io.vertx:vertx-web:3.9.0 CVE-2019-17640 9.8 Improper Limitation of a Pathname to a Restricted Directory (‘Path Traversal’) vulnerability pending CVSS allocation CVE-2020-35217 8.8 Cross-Site Request Forgery (CSRF) vulnerability pending CVSS allocation Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:org.webjars.bower:lodash:3.10.1-amd Cx0b414307-5d4b 7.3 Improperly Controlled Modification of Object Prototype Attributes (‘Prototype Pollution’) vulnerability pending CVSS allocation Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:org.webjars:bootstrap:2.2.2-1 CVE-2018-14042 6.1 Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’) vulnerability pending CVSS allocation CVE-2018-14040 6.1 Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’) vulnerability pending CVSS allocation Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:org.webjars:jquery-ui:1.10.2 CVE-2016-7103 6.1 Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’) vulnerability pending CVSS allocation CVE-2021-41184 6.1 Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’) vulnerability with medium severity found CVE-2021-41182 6.1 Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’) vulnerability pending CVSS allocation CVE-2021-41183 6.1 Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’) vulnerability pending CVSS allocation Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:org.webjars:jquery:2.2.0 CVE-2007-2379 5.8 Exposure of Sensitive Information to an Unauthorized Actor vulnerability pending CVSS allocation CVE-2020-11022 6.1 Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’) vulnerability pending CVSS allocation CVE-2016-10707 7.5 Uncontrolled Resource Consumption vulnerability pending CVSS allocation CVE-2015-9251 6.1 Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’) vulnerability pending CVSS allocation CVE-2019-11358 6.1 Improperly Controlled Modification of Object Prototype Attributes (‘Prototype Pollution’) vulnerability pending CVSS allocation CVE-2020-11023 6.1 Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’) vulnerability pending CVSS allocation Results powered by Checkmarx(c)
Provides transitive vulnerable dependency maven:org.apache.httpcomponents:httpclient:4.5.2 CVE-2020-13956 5.3 Improper Input Validation vulnerability pending CVSS allocation Results powered by Checkmarx(c)

These are my dependencies:

        <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>${nd4j.backend}</artifactId>
        </dependency>
        <dependency>
            <groupId>org.datavec</groupId>
            <artifactId>datavec-api</artifactId>
        </dependency>
        <dependency>
            <groupId>org.datavec</groupId>
            <artifactId>datavec-data-image</artifactId>
        </dependency>
        <dependency>
            <groupId>org.datavec</groupId>
            <artifactId>datavec-local</artifactId>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-datasets</artifactId>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-core</artifactId>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>resources</artifactId>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-ui</artifactId>
        </dependency>
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-zoo</artifactId>
        </dependency>
        <!-- ParallelWrapper & ParallelInference live here -->
        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-parallel-wrapper</artifactId>
        </dependency>

In a parent pom I set the master DL4J version to 1.0.0-M2.1.

@daviddbal sorry, your earlier post was flagged for some reason. I fixed that, and I needed to take a closer look at this.

What you’re seeing might be already fixed in snapshots. Could you please try that and confirm? Snapshots - Deeplearning4j

After some load testing I discovered a potential race condition between shape buffer deallocation and usage in other threads.

NDArrays should be a lot closer to being thread safe now after this.
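
For readers unfamiliar with this kind of race, here is a rough sketch in plain Java of the general idea: a buffer must not be released while another thread is still reading it. This is only an illustration with made-up names (`GuardedBuffer`, `use`, `release`); it is not the actual ND4J change, and a `long[]` stands in for the native shape buffer.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

/** Hypothetical illustration of guarding a buffer against release-while-in-use. */
final class GuardedBuffer {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private long[] data; // stand-in for an off-heap shape buffer

    GuardedBuffer(long[] data) {
        this.data = data.clone();
    }

    /** Runs the reader under the read lock, so release() cannot run at the same time. */
    <T> T use(Function<long[], T> reader) {
        lock.readLock().lock();
        try {
            if (data == null) {
                // Fail loudly in Java instead of touching freed memory.
                throw new IllegalStateException("buffer already released");
            }
            return reader.apply(data);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Waits for in-flight readers to finish, then drops the buffer. Safe to call twice. */
    void release() {
        lock.writeLock().lock();
        try {
            data = null;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Without a guard like this, a reader that grabbed the pointer just before deallocation ends up dereferencing freed memory, which in native code shows up as a SIGSEGV rather than a Java exception.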

Thanks for the tip. I’ve changed to the snapshot version and I’ll let you know if it’s stable in about a day.
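
One way to double check that the snapshot jars are really the ones being loaded is to print where a class comes from. This is a small sketch using only standard JDK calls; `JarLocator` and `locationOf` are names I made up, and passing `org.nd4j.linalg.factory.Nd4j` as the class to look up is just an example.

```java
import java.security.CodeSource;

/** Hypothetical helper: reports which jar or directory a class was loaded from. */
final class JarLocator {

    static String locationOf(String className) {
        try {
            // initialize=false: look the class up without running its static initializers.
            Class<?> cls = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "(no code source; likely a JDK class)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // The jar file name in the printed path normally includes the resolved version.
        System.out.println(locationOf("org.nd4j.linalg.factory.Nd4j"));
    }
}
```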

So far, it seems that the crashes are less frequent using SNAPSHOT, but I’ve still seen two in the past two days. I would prefer zero. Is there anything I can do to get it rock solid? Below is the beginning of the latest hs_err_pid:

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007f2dce700070, pid=4974, tid=4976

JRE version: Java(TM) SE Runtime Environment (17.0.5+9) (build 17.0.5+9-LTS-191)
Java VM: Java HotSpot(TM) 64-Bit Server VM (17.0.5+9-LTS-191, mixed mode, sharing, tiered, compressed class ptrs, g1 gc, linux-amd64)
Problematic frame:
C [libnd4jcpu.so+0x1700070] sd::DataBuffer::primary()+0x0

Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/david/workspace/stock1/net-trainer/core.4974)

If you would like to submit a bug report, please visit:
https://bugreport.java.com/bugreport/crash.jsp
The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.

@daviddbal could you send me more information? I’d love it if you could DM me your code that’s being used here.

I do think it’s been a bit since I’ve updated the PRs. I just tested this with a customer recently with 100s of concurrent users and didn’t see crashes. I might just have to update snapshots and see what’s going on. I will do another update here in a day or so after I merge the next PR and we can try again there.

Just in case let’s double check your specific case though. I still think it’s the same issue.

I’m seeing errors so rarely now - maybe once every couple of days. I’m not sure it’s the same cause (a race condition, as you explained).

When you say you want my code, do you want snippets of the code to review or a functioning example? Providing snippets is easy. Making functioning example code is more difficult.

Debugging a specific issue without something that reproduces it is even more difficult, though.

@daviddbal I was seeing that with a customer. This sounds VERY similar, including your training issues. If you profile it, the slowdown is probably related to the buildup of deallocatable references in the deallocator service. I bet if you look in jvisualvm you’ll see strings as a good portion of the on-heap memory. I’d love to know more to confirm whether that’s the case.
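
Before opening jvisualvm, a quick way to see whether on-heap memory keeps growing during training is to sample it from inside the process. The sketch below uses only the standard `java.lang.management` API; `HeapSampler` and its methods are hypothetical names, and it only reports total heap use, so it cannot tell you whether the growth is strings or something else.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.Locale;

/** Hypothetical helper: tracks heap growth relative to the first sample taken. */
final class HeapSampler {
    private static final long MIB = 1024L * 1024L;

    private final MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
    private long baselineBytes = -1;

    /** Heap bytes currently in use, as reported by the JVM. */
    long usedHeapBytes() {
        return memory.getHeapMemoryUsage().getUsed();
    }

    /** Growth in bytes since the first call (0 on the first call). */
    long sampleGrowth() {
        long used = usedHeapBytes();
        if (baselineBytes < 0) {
            baselineBytes = used;
        }
        return used - baselineBytes;
    }

    /** Formats one log line; sizes are rounded down to whole MiB. */
    static String format(long epoch, long usedBytes, long growthBytes) {
        return String.format(Locale.ROOT, "epoch %d: heap used %d MiB (%+d MiB since start)",
                epoch, usedBytes / MIB, growthBytes / MIB);
    }
}
```

Logging `HeapSampler.format(epoch, sampler.usedHeapBytes(), sampler.sampleGrowth())` once per epoch from your own training loop should show whether heap use climbs steadily. If it does, a heap dump in jvisualvm is the next step to see which classes dominate.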