ND4JIllegalStateException

On Windows with beta7 I can train models with data type DOUBLE or FLOAT, but when I try to train with HALF this exception is thrown:
Caused by: org.nd4j.linalg.exception.ND4JIllegalStateException: Op [histogram] X argument uses leaked workspace pointer from workspace [WS_LAYER_WORKING_MEM]: Workspace the array was defined in is no longer open.

Also I saw that HALF is deprecated and FLOAT16 should be used, but when I try to train with that, this happens:
IllegalStateException:Data type must be a floating point type: one of DOUBLE, FLOAT, or HALF. Got datatype: BFLOAT16
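For context, this is roughly how I set the data type (a simplified sketch, the layer setup below is just a placeholder and not my real network). With DataType.DOUBLE or DataType.FLOAT this trains fine; switching the same line to DataType.HALF is what triggers the exception above:

import org.deeplearning4j.nn.conf.ComputationGraphConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;

// Placeholder network, only the dataType() call is the relevant part
ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
        .dataType(DataType.HALF)   // DOUBLE and FLOAT work, HALF fails during training
        .updater(new Adam(1e-3))
        .graphBuilder()
        .addInputs("in")
        .addLayer("dense", new DenseLayer.Builder().nIn(784).nOut(128)
                .activation(Activation.RELU).build(), "in")
        .addLayer("out", new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                .activation(Activation.SOFTMAX).nIn(128).nOut(10).build(), "dense")
        .setOutputs("out")
        .build();

ComputationGraph net = new ComputationGraph(conf);
net.init();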

@lavajaw can you give us something to reproduce this? FWIW, training models with half precision varies on Intel chips and generally works on GPU, but in general it is a bit hit or miss. It's not really recommended if you can help it. If you want half precision, I would recommend optimizing the model for inference after training instead.

These are my Gradle dependencies:
implementation 'org.deeplearning4j:deeplearning4j-core:1.0.0-beta7'
implementation 'org.deeplearning4j:deeplearning4j-ui:1.0.0-beta7'
implementation 'org.deeplearning4j:deeplearning4j-zoo:1.0.0-beta7'
implementation 'ch.qos.logback:logback-classic:1.2.3'
implementation 'org.projectlombok:lombok:1.18.12'

//  GPU
implementation group: 'org.nd4j', name: 'nd4j-cuda-10.2-platform', version: '1.0.0-beta7'
implementation group: 'org.deeplearning4j', name: 'deeplearning4j-cuda-10.2', version: '1.0.0-beta7'

17:25:52.497 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [CUBLAS]
17:25:52.516 [main] INFO org.nd4j.linalg.jcublas.JCublasBackend - ND4J CUDA build version: 10.2.89
17:25:52.517 [main] INFO org.nd4j.linalg.jcublas.JCublasBackend - CUDA device 0: [GeForce RTX 2070]; cc: [7.5]; Total memory: [8589934592]
17:25:53.187 [main] INFO org.deeplearning4j.nn.graph.ComputationGraph - Starting ComputationGraph with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]

Caused by: org.nd4j.linalg.exception.ND4JIllegalStateException: Op [histogram] X argument uses leaked workspace pointer from workspace [WS_LAYER_WORKING_MEM]: Workspace the array was defined in is no longer open.
All open workspaces:

Then I tried adding:
.trainingWorkspaceMode(WorkspaceMode.NONE)
.inferenceWorkspaceMode(WorkspaceMode.NONE)
and it didn't crash, but all evaluation values were NaN.

I wanted to try HALF because the model should be executed on an Android device, so I wanted to check whether there is any improvement in execution speed.

Regarding “I would recommend optimizing it for inference instead after training”: I am not sure how to do this. Do you have any links/examples so I can learn more about this?

@quickwritereader could you take a look at HALF execution? ^

@lavajaw
I talked to the team. I was testing a simple example: depending on the updater and other parameters, training with FLOAT16 could give either suboptimal results or NaNs, so I was advised to convert the model after training. That reduced my model size and gave almost the same evaluation results.
You can use .convertDataType(DataType.FLOAT16) and save it.
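Something along these lines (a minimal sketch, the file names are just placeholders):

import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.util.ModelSerializer;
import org.nd4j.linalg.api.buffer.DataType;

// Load the network that was trained in FLOAT, convert its parameters to FLOAT16, and save the result
ComputationGraph trained = ModelSerializer.restoreComputationGraph("model_float32.zip");
ComputationGraph half = trained.convertDataType(DataType.FLOAT16);
ModelSerializer.writeModel(half, "model_float16.zip", false); // false = don't save updater state, not needed for inference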
Coming to performance: on ARM, float16 is promoted to float for computations, but Armv8.2-A and later architectures use hardware instructions for float16. So if your model is small enough, I do not see any reason to use it for now.
Thanks

@quickwritereader thanks.

Just so you know, I tested my model, the same one I provided to you in the DM.
Results:

ComputationGraph neuralNetwork = ModelSerializer.restoreComputationGraph(path).convertDataType(DataType.HALF);
INDArray[] output = neuralNetwork.output(indArray.dup());

Inference with .convertDataType(DataType.HALF) was 2 times slower than with .convertDataType(DataType.FLOAT) or .convertDataType(DataType.DOUBLE), which is very strange.
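For reference, this is roughly how I timed it (a simplified sketch, the iteration count and input shape are placeholders and not my real setup):

import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.util.ModelSerializer;
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

ComputationGraph halfNet = ModelSerializer.restoreComputationGraph(path).convertDataType(DataType.HALF);
ComputationGraph floatNet = ModelSerializer.restoreComputationGraph(path).convertDataType(DataType.FLOAT);
INDArray input = Nd4j.rand(DataType.FLOAT, 1, 3, 224, 224); // placeholder input shape

// Time repeated forward passes with the HALF network
long start = System.nanoTime();
for (int i = 0; i < 100; i++) {
    halfNet.output(input.dup());
}
System.out.println("HALF:  " + (System.nanoTime() - start) / 1e6 + " ms");

// Time repeated forward passes with the FLOAT network
start = System.nanoTime();
for (int i = 0; i < 100; i++) {
    floatNet.output(input.dup());
}
System.out.println("FLOAT: " + (System.nanoTime() - start) / 1e6 + " ms");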

Though it should be an improvement because of cache and memory bandwidth, it seems that having only limited instructions that promote float16 to float32 is not enough. But on the latest chips, float16 as well as bfloat16 will be faster and widely supported.