Fatal Error terminate called after throwing an instance of 'std::runtime_error' what(): bad data type

So I’m tinkering with a network, and right now I am getting this extremely vague and unhelpful error:

 terminate called after throwing an instance of 'std::runtime_error'
  what():  bad data type

If I go a little further up in my log, I see:

[ERROR] Unknown dtypeX=1025205691 on D:/jenkins/ws/dl4j-deeplearning4j-1.0.0-beta6-windows-x86_64-cpu-avx2/libnd4j/blas/cpu/NativeOpExecutioner.cpp

I have looked both of these up, and for the latter I get zero results.

To note, I am using RL4J, but I feel like this isn’t an issue caused by that, considering I am getting the error from the native library.

Does anyone have any idea why this is happening? My network is actually learning, but the whole program crashes when this error appears and I lose all the progress.

Thanks for your help

This looks very much like a bug on our side. Can you share the full log output with us, and ideally also the code that led to this behavior?

Yes please, we definitely need the code that reproduces this.

So in a later run, the program actually produced a crash file before it exited; I have linked it below.

However, I think I have discovered the problem. It may actually be related to RL4J after all: when I was checking my observation space, I realized that my getLow() method returned an INDArray of all zeros, when in actuality the lowest value my environment can return is -5. After adjusting the array to -5 I have not run into this problem again, although that may just be pure chance. So I think that fixes the problem, but if y’all really want, I can upload a version that still produces the error. Just let me know.

Link to the log (it’s too long to include): https://pastebin.com/vedBC5W3
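For anyone who wants to rule out the same kind of mismatch, here is a minimal sketch of a bounds check. It is plain Java on double arrays and does not use RL4J; the class ObservationBounds and its method names are hypothetical, and it only tells you whether the environment's output agrees with the declared low/high values, not whether that has anything to do with the crash.

```java
/** Hypothetical helper, not part of RL4J: compares observations with declared bounds. */
final class ObservationBounds {
    private final double[] low;
    private final double[] high;

    ObservationBounds(double[] low, double[] high) {
        if (low.length != high.length) {
            throw new IllegalArgumentException("low and high must have the same length");
        }
        for (int i = 0; i < low.length; i++) {
            if (low[i] > high[i]) {
                throw new IllegalArgumentException("low > high at index " + i);
            }
        }
        // Defensive copies so later changes to the caller's arrays do not affect the check.
        this.low = low.clone();
        this.high = high.clone();
    }

    /**
     * Returns the index of the first component that is NaN or outside [low, high],
     * or -1 if every component is within the declared bounds.
     */
    int firstViolation(double[] observation) {
        if (observation.length != low.length) {
            throw new IllegalArgumentException("observation has the wrong length");
        }
        for (int i = 0; i < observation.length; i++) {
            double v = observation[i];
            if (Double.isNaN(v) || v < low[i] || v > high[i]) {
                return i;
            }
        }
        return -1;
    }
}
```

With a declared low of all zeros, an observation containing -5 is reported as a violation at that component's index; with the low corrected to -5, the same observation returns -1.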

That looks very much like a buffer overflow somewhere to me :confused:

The observation space limits, however, should probably not be the reason for it, so I’d guess that you will run into it again.

If you do run into it again, can you please try using snapshots?

It kept happening, so I am using the snapshot right now like you suggested. So far it hasn’t happened, but my network is also significantly slower, probably by more than an order of magnitude. This makes training almost infeasible as I try adjusting values and reward functions. Do you know why this is happening? Is one of the native libraries, like avx2, not set up yet?

I will keep you posted on the crashes, though, and let you know if they occur again.

Ok, so I tried again with beta7 and things ran smoother; however, I ran into the error again. What I have noticed is that the error only occurs when I am using avx2, not cuda, so hopefully that can help you. That kinda sucks for me, though, because when using the cuda/cudnn library my network runs at like a fifth of the speed for some reason.

Hmm, that indicates it might be a problem with mkldnn.

Can you try disabling it with Nd4j.getEnvironment().allowHelpers(false);?

That did not fix it; I still got the same error.

I guess the actual error is now somewhat different. Can you share the new crash log?

Ok, so I got the error a few more times, but it wasn’t creating logs; I don’t know why. However, I found an old post saying that JNA errors can sometimes be caused by Java bugs, so after updating my Java version I have not run into this issue again. I’m not saying it’s fixed, but so far it seems to be. For reference, the JRE I was having issues with was

JRE 1.8.0_211

I’m now on

JRE 1.8.0_251

Never mind, it crashed again. Here’s the dump.
And here’s the text printed in the console:

> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000003ed5eec8, pid=2304, tid=0x0000000000005f0c
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_251-b08) (build 1.8.0_251-b08)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.251-b08 mixed mode windows-amd64 compressed oops)
> # Problematic frame:
> # C  [libnd4jcpu.dll+0xf2eec8]
> #
> # Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
> #
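A note on reading the output above: the "Problematic frame" line is the informative part. The leading C marks a native frame, and libnd4jcpu.dll is the native library of the CPU backend, so the access violation happened in native code rather than in Java code. If several of these logs pile up, a small parser makes it easy to compare the offsets between crashes. The sketch below is a hypothetical helper using only the JDK's regex classes; it assumes the frame line has the shape shown above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the JDK or DL4J): one parsed "Problematic frame" line. */
final class ProblematicFrame {
    // Frame type letter, then "[library+0xOFFSET]", e.g. "# C  [libnd4jcpu.dll+0xf2eec8]".
    private static final Pattern FRAME =
            Pattern.compile("#\\s+([A-Za-z])\\s+\\[([^+\\]]+)\\+0x([0-9a-fA-F]+)\\]");

    final char type;      // 'C' means the crash happened in a native frame
    final String library; // e.g. "libnd4jcpu.dll"
    final long offset;    // offset into the library, parsed from the hex digits

    private ProblematicFrame(char type, String library, long offset) {
        this.type = type;
        this.library = library;
        this.offset = offset;
    }

    /** Returns the parsed frame, or an empty Optional if the line is not a frame line. */
    static Optional<ProblematicFrame> parse(String line) {
        Matcher m = FRAME.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new ProblematicFrame(
                m.group(1).charAt(0), m.group(2), Long.parseLong(m.group(3), 16)));
    }

    boolean isNative() {
        return type == 'C';
    }
}
```

Calling ProblematicFrame.parse on each line of a crash report and keeping the one non-empty result gives the library and offset. The same offset showing up in several crashes would suggest they share a code path, though that is only a hint.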

I’ve been trying to reproduce this crash but haven’t been able to make it crash yet.

Can you share your code? Also, AsyncNStepQLearning appears to be buggy overall; it didn’t learn quite as well as A3C in the example I ran it with (a simple Snake game).

Can you try A3C as well? If it also crashes there, then it isn’t something about NStep; if it doesn’t, then we’ll at least have some kind of hint to follow.

I have tried A3C with AVX2 and got the error mentioned in my original post. I haven’t managed to get a dump out of it yet, though.

Here is a link to a version that produces the error using Async: https://drive.google.com/open?id=1Yom_iLDGPBBFO0Fbuus-J4_SklDmHVK7

Launch the main method in the class Learn.java to run it.

I believe A3C will also throw it if you launch it. I have been using cuda, but I changed it back to avx2 for this upload. To use A3C, uncomment SnakeControl.A3CcartPole(); in Learn, and in SnakeBox change getData() to return “a”, which is the proper array for LSTM.

I hope this helps. Let me know if you figure out a way to prevent the error; I would greatly appreciate it!

Also, the program doesn’t always throw the error, but when it does, it usually happens within the first 1-20 minutes, so if it takes longer than that you may want to relaunch it.

I’ve managed to reproduce the crashes and I’ve created an issue to track the progress: https://github.com/eclipse/deeplearning4j/issues/8977

I’ve still got to create something self-contained to reproduce the behavior, though.