So I'm tinkering with a network, and right now I'm getting this extremely vague and unhelpful error:
> terminate called after throwing an instance of 'std::runtime_error'
> what(): bad data type
If I look a little further up in my logs, I see:
> [ERROR] Unknown dtypeX=1025205691 on D:/jenkins/ws/dl4j-deeplearning4j-1.0.0-beta6-windows-x86_64-cpu-avx2/libnd4j/blas/cpu/NativeOpExecutioner.cpp
I have looked both of these up, and the latter turns up zero results.
For context, I am using RL4J, but I don't think this issue is caused by it, considering the error comes from the native library.
Does anyone have any idea why this is happening? My network is actually learning, but the whole program crashes when this error appears and I lose all my progress.
In a later run, the program actually produced a crash file before it exited; I have attached it below.
However, I think I have discovered the problem. It may be related to RL4J after all: when I was checking my observation space, I realized that my getLow() method returned an INDArray of all zeros, when in reality the lowest value my environment can return is -5. After adjusting the array to -5 I have not run into the problem again, though that may just be pure chance. So I think that fixes it, but if y'all really want, I can upload a version that still produces the error; just let me know.
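To illustrate the bug I'm describing: the observation space declared a lower bound of all zeros, while the environment actually emits values down to -5, so real observations fell outside the declared range. The method names below mirror the shape of RL4J's getLow()/getHigh() bounds, but this is a standalone sketch using plain double[] instead of INDArray so it runs without ND4J on the classpath:

```java
import java.util.Arrays;

public class BoundsCheck {
    // Build a lower-bound array where every dimension shares the same minimum.
    static double[] low(int shape, double min) {
        double[] arr = new double[shape];
        Arrays.fill(arr, min);
        return arr;
    }

    // Returns false if any component of the observation is below its declared minimum.
    static boolean inBounds(double[] obs, double[] low) {
        for (int i = 0; i < obs.length; i++) {
            if (obs[i] < low[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        double[] observation = {0.3, -5.0, 1.2, 0.0}; // a value the environment really produces
        double[] wrongLow = low(4, 0.0);   // the buggy declaration: all zeros
        double[] fixedLow = low(4, -5.0);  // the corrected declaration
        System.out.println("wrong bounds ok? " + inBounds(observation, wrongLow)); // false
        System.out.println("fixed bounds ok? " + inBounds(observation, fixedLow)); // true
    }
}
```

With the all-zeros bounds, a perfectly legal observation of -5 is out of range, which would explain garbage reaching the native library.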
It kept happening, so I am using the snapshot right now like you suggested. So far it hasn't happened, but my network is also significantly slower, probably by more than an order of magnitude. This makes training almost infeasible as I try adjusting values and reward functions. Do you know why this is happening? Is one of the native libraries, like AVX2, not set up yet?
I will keep you posted on the crashes, though, and let you know if they occur again.
OK, so I tried again with beta7 and things ran more smoothly. However, I ran into the error again. What I have noticed is that the error only occurs when I am using AVX2, not CUDA, so hopefully that helps you. That kind of sucks for me, though, because when using the CUDA/cuDNN backend my network runs at about a fifth of the speed for some reason.
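For anyone else trying to switch backends to compare: the choice between the AVX2 CPU backend and CUDA is made by which ND4J backend dependency is on the classpath. A rough pom.xml sketch; the version and the CUDA suffix (10.2) are assumptions, so match them to your actual beta7 setup:

```xml
<!-- CPU backend with AVX2 binaries (the combination that triggers the crash here) -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>1.0.0-beta7</version>
    <classifier>windows-x86_64-avx2</classifier>
</dependency>

<!-- Swap in the CUDA backend instead to reproduce the "no crash, but slower" case.
     The CUDA version suffix is an assumption; match your installed toolkit. -->
<!--
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-cuda-10.2</artifactId>
    <version>1.0.0-beta7</version>
</dependency>
-->
```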
OK, so I got the error a few more times, but it wasn't creating logs; I don't know why. However, I found an old post saying that JNA errors can sometimes be caused by Java bugs, so after updating my Java version I have not run into this issue again. I'm not saying it's fixed, but so far it seems to be. For reference, the JRE I was having issues with was
Never mind, it crashed again. Here's the dump: https://pastebin.com/NkZuZPKy
And here's the text printed in the console:
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000003ed5eec8, pid=2304, tid=0x0000000000005f0c
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_251-b08) (build 1.8.0_251-b08)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.251-b08 mixed mode windows-amd64 compressed oops)
> # Problematic frame:
> # C [libnd4jcpu.dll+0xf2eec8]
> #
> # Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
> #
I've been trying to reproduce this crash but haven't managed to trigger it yet.
Can you share your code? Also, AsyncNStepQLearning appears to be buggy overall; it didn't learn quite as well as A3C in the example I ran it with (a simple Snake game).
Can you try A3C as well? If it also crashes there, then it isn't something specific to NStep; if it doesn't, we'll at least have some kind of hint to follow.
Launch the main method in Learn.java to run it.
I believe A3C will also throw it if you launch it. I have been using CUDA, but I changed it back to AVX2 for this upload. To use A3C, uncomment SnakeControl.A3CcartPole(); in Learn, and in SnakeBox change getData() to return "a", which is the proper array for the LSTM.
I hope this helps. Let me know if you figure out a way to prevent the error; I would greatly appreciate it!
Also, the program doesn't always throw the error, but when it does, it usually happens in the first 1-20 minutes, so if it takes longer than that you may want to relaunch it.