JVM Crash on Linux after upgrade to M2

partarstu · April 11, 2022, 7:53pm

Hi. After upgrading to M2 I get the following crash while running a SameDiff model on Debian 10 (Buster) in a Docker container. With previous versions it worked fine. The error:

 A fatal error has been detected by the Java Runtime Environment:
SIGSEGV (0xb) at pc=0x00007efee98fab70, pid=1, tid=79
JRE version: OpenJDK Runtime Environment (18.0+36) (build 18+36-2087)
Java VM: OpenJDK 64-Bit Server VM (18+36-2087, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
 Problematic frame:
 C  [libnd4jcpu.so+0x113cb70]  void functions::transform::TransformAny<float, float>::exec<simdOps::Assign<float, float> >(void const*, long long const*, void*, long long const*, void*, unsigned long, unsigned long)+0x1410

I have never seen this issue before. I’ve been running the model with the linux-x86_64-avx2, linux-x86_64-onednn-avx2 and linux-x86_64 classifiers. With the last one the error showed up much later (after roughly twice as long). I haven’t found an option to upload the hs_err_pid1.log file to this topic, though.

I would appreciate any ideas on how to fix it.

agibsonccc · April 12, 2022, 12:38pm

@partarstu could you file an issue with more details? Something end to end to reproduce it would be nice. These crashes can come from multithreading, misuse of allocation, or anything else, and may not be bugs. We need to know what your assumptions are when using the framework.

partarstu · April 13, 2022, 2:31pm

@agibsonccc should I report this in the eclipse/deeplearning4j issue tracker on GitHub?

Unfortunately I can’t share the model that reproduces it, but the exact same model worked fine on the exact same platform with M1.1.

I’m currently testing it on Windows and will let you know the results shortly.

partarstu · April 13, 2022, 8:57pm

I think I found the root cause of this issue. It seems to be the Gather op: it fails during back-prop because of incorrect dimensions, but that failure is not propagated as an error/exception, so the rest of the workflow keeps executing instead of stopping. I’ll post additional details tomorrow.
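
For readers unfamiliar with this failure mode, here is a minimal plain-Java sketch of the general idea: a gather that validates its inputs and throws, rather than continuing with bad data. It is only an illustration under stated assumptions. `CheckedGather` and `gatherRows` are hypothetical names, not part of ND4J or SameDiff, and the check shown (index bounds) is a stand-in for whatever dimension check the real op is missing.

```java
import java.util.Arrays;

public class CheckedGather {
    // Hypothetical helper (not an ND4J API): gathers rows of a 2-D array along axis 0.
    // An invalid index raises a Java exception instead of being read silently,
    // which is the behavior one would want before any native code touches memory.
    public static float[][] gatherRows(float[][] params, int[] indices) {
        float[][] out = new float[indices.length][];
        for (int i = 0; i < indices.length; i++) {
            int idx = indices[i];
            if (idx < 0 || idx >= params.length) {
                throw new IndexOutOfBoundsException(
                        "gather index " + idx + " at position " + i
                                + " is outside [0, " + params.length + ")");
            }
            out[i] = params[idx].clone(); // copy so callers cannot alias the source rows
        }
        return out;
    }

    public static void main(String[] args) {
        float[][] table = {{1f, 2f}, {3f, 4f}, {5f, 6f}};
        System.out.println(Arrays.deepToString(gatherRows(table, new int[]{2, 0})));
        try {
            gatherRows(table, new int[]{1, 3}); // 3 is out of range for 3 rows
        } catch (IndexOutOfBoundsException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

When a check like this is absent on the native side, an invalid access can surface later as a SIGSEGV in an unrelated-looking frame instead of as an exception at the failing op.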

agibsonccc · April 13, 2022, 9:28pm

@partarstu OK, that seems way more plausible and I’d be happy to look at a fix for that. As mentioned in the other post, we’ll do a follow-up release within the next week or so to address these problems as well.

partarstu · April 14, 2022, 9:11am

@agibsonccc I’ve submitted an issue on GitHub regarding both the Gather op and this JVM crash: https://github.com/eclipse/deeplearning4j/issues/9669
