Hi. After upgrading to M2 I get the following error while running a SameDiff model on Debian 10 (Buster) in a Docker container. With the previous versions it worked fine. The error:
A fatal error has been detected by the Java Runtime Environment:
SIGSEGV (0xb) at pc=0x00007efee98fab70, pid=1, tid=79
JRE version: OpenJDK Runtime Environment (18.0+36) (build 18+36-2087)
Java VM: OpenJDK 64-Bit Server VM (18+36-2087, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
Problematic frame:
C [libnd4jcpu.so+0x113cb70] void functions::transform::TransformAny<float, float>::exec<simdOps::Assign<float, float> >(void const*, long long const*, void*, long long const*, void*, unsigned long, unsigned long)+0x1410
Never seen this issue before. I’ve been running the model using the linux-x86_64-avx2, linux-x86_64-onednn-avx2 and linux-x86_64 classifiers. With the last one the error showed up considerably later (roughly twice as late). I haven’t found an option to upload the hs_err_pid1.log file to this topic though.
Would appreciate any ideas on how to fix it.
@partarstu could you file an issue with more details? Something end to end to reproduce it would be nice. These crashes can come from multithreading, misuse of allocation, or anything else, and may not be bugs. We need to know what your assumptions are when using the framework.
@agibsonccc should I report this on the eclipse/deeplearning4j GitHub issue tracker?
Unfortunately I can’t share the model that reproduces it, but the exact same model used to work fine on the exact same platform with M1.1.
I’m currently testing it on Windows and will let you know the results shortly.
I think I found the root cause of this issue. It seems to be the Gather op: it fails (incorrect dimensions during back-prop), but the failure is not propagated as an error/exception that would stop further execution of the workflow. I’ll post additional details tomorrow.
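For illustration only, here is a plain-Java sketch of the behaviour described above: a gather along axis 0 whose backward pass checks the incoming gradient's dimensions and throws an exception instead of carrying on with mismatched shapes. This is not the libnd4j or SameDiff implementation. The class `GatherShapeCheck` and its methods are hypothetical names made up for this example, and the real op works on native buffers rather than Java arrays.

```java
import java.util.Arrays;

public class GatherShapeCheck {

    /** Forward pass: out[i] = params[indices[i]] (gather along axis 0). */
    static float[][] gather(float[][] params, int[] indices) {
        float[][] out = new float[indices.length][];
        for (int i = 0; i < indices.length; i++) {
            int idx = indices[i];
            if (idx < 0 || idx >= params.length) {
                throw new IndexOutOfBoundsException(
                        "gather index " + idx + " out of range [0, " + params.length + ")");
            }
            out[i] = params[idx].clone();
        }
        return out;
    }

    /**
     * Backward pass: scatter-add the rows of gradOut into the gradient
     * with respect to params. Every shape is validated first, so a
     * mismatch surfaces as an exception rather than an out-of-bounds write.
     */
    static float[][] gatherBackward(int rows, int cols, int[] indices, float[][] gradOut) {
        if (gradOut.length != indices.length) {
            throw new IllegalArgumentException(
                    "gradient has " + gradOut.length + " rows, expected " + indices.length);
        }
        float[][] gradParams = new float[rows][cols];
        for (int i = 0; i < indices.length; i++) {
            if (gradOut[i].length != cols) {
                throw new IllegalArgumentException(
                        "gradient row " + i + " has " + gradOut[i].length
                                + " columns, expected " + cols);
            }
            int idx = indices[i];
            if (idx < 0 || idx >= rows) {
                throw new IndexOutOfBoundsException(
                        "gather index " + idx + " out of range [0, " + rows + ")");
            }
            for (int c = 0; c < cols; c++) {
                gradParams[idx][c] += gradOut[i][c];
            }
        }
        return gradParams;
    }

    public static void main(String[] args) {
        float[][] params = {{1f, 2f}, {3f, 4f}, {5f, 6f}};
        int[] indices = {2, 0, 2};
        System.out.println(Arrays.deepToString(gather(params, indices)));

        float[][] gradOut = {{1f, 1f}, {1f, 1f}, {1f, 1f}};
        System.out.println(Arrays.deepToString(gatherBackward(3, 2, indices, gradOut)));

        try {
            // Wrong number of gradient rows: rejected with an exception.
            gatherBackward(3, 2, indices, new float[][]{{1f, 1f}});
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In native code an unchecked mismatch of this kind can read or write past the end of a buffer, which would be consistent with a SIGSEGV inside libnd4jcpu.so instead of a Java exception.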
@partarstu ok, that seems way more plausible and I’d be happy to look at a fix for that. As mentioned in the other post, we’ll do a follow-up release within the next week or so to address these problems as well.
@agibsonccc I’ve submitted an issue on GitHub covering both the Gather op and this JVM crash: https://github.com/eclipse/deeplearning4j/issues/9669