Float16 use gives cuda 700 error

I have implemented a PSFormer using SameDiff, and it works with dtype FLOAT. Hoping for faster training, I tried FLOAT16 (HALF) but ran into CUDA error 700 (an illegal memory access). I've narrowed it down to the initial weights. If I override Xavier init and set all weights to 0.01, a JUnit test I wrote for this succeeds. If I use Xavier init, or just set some random values myself, error 700 is triggered. Logging at trace level:

2025-10-03 17:15:41.083 INFO  Nd4jBackend - Loaded [JCublasBackend] backend
2025-10-03 17:15:41.085 TRACE ND4JClassLoading - Cannot find class [org.nd4j.linalg.jblas.JblasBackend] of provided class-loader.
2025-10-03 17:15:41.086 TRACE ND4JClassLoading - Cannot find class [org.canova.api.io.data.DoubleWritable] of provided class-loader.
2025-10-03 17:15:41.086 TRACE ND4JClassLoading - Cannot find class [org.nd4j.linalg.jblas.JblasBackend] of provided class-loader.
2025-10-03 17:15:41.087 TRACE ND4JClassLoading - Cannot find class [org.canova.api.io.data.DoubleWritable] of provided class-loader.
2025-10-03 17:15:42.261 INFO  NativeOpsHolder - Number of threads used for linear algebra: 32
2025-10-03 17:15:42.282 INFO  DefaultOpExecutioner - Backend used: [CUDA]; OS: [Linux]
2025-10-03 17:15:42.282 INFO  DefaultOpExecutioner - Cores: [48]; Memory: [4.0GB];
2025-10-03 17:15:42.282 INFO  DefaultOpExecutioner - Blas vendor: [CUBLAS]
2025-10-03 17:15:42.286 INFO  JCublasBackend - ND4J CUDA build version: 11.6.124
2025-10-03 17:15:42.287 INFO  JCublasBackend - CUDA device 0: [NVIDIA GeForce RTX 3090]; cc: [8.6]; Total memory: [25293946880]
2025-10-03 17:15:42.287 INFO  JCublasBackend - Backend build information:
 GCC: "7.5.0"
STD version: 201103L
DEFAULT_ENGINE: samediff::ENGINE_CUDA
HAVE_FLATBUFFERS
2025-10-03 17:15:42.304 TRACE DeallocatorService - Starting deallocator thread 1
2025-10-03 17:15:42.305 TRACE DeallocatorService - Starting deallocator thread 2
psf/ps/W1 datatype = HALF
psf/ps/W2 datatype = HALF
psf/ps/W3 datatype = HALF
psf/head/WF datatype = HALF
2025-10-03 17:15:43.071 TRACE SameDiff - Defining function "grad"
2025-10-03 17:15:43.211 WARN  ImportClassMapping - Duplicate TF op mapping found for op Pow: org.nd4j.linalg.api.ops.impl.scalar.Pow vs org.nd4j.linalg.api.ops.impl.transforms.custom.Pow
2025-10-03 17:15:43.216 WARN  ImportClassMapping - Duplicate TF op mapping found for op FloorMod: org.nd4j.linalg.api.ops.impl.transforms.pairwise.arithmetic.FModOp vs org.nd4j.linalg.api.ops.impl.transforms.pairwise.arithmetic.FloorModOp
2025-10-03 17:15:43.518 TRACE CudaDeallocator - Deallocating CUDA memory
2025-10-03 17:15:43.518 TRACE CudaDeallocator - Deallocating CUDA memory
2025-10-03 17:15:43.519 TRACE CudaDeallocator - Deallocating CUDA memory
....<a whole lot more>....
2025-10-03 17:15:43.551 TRACE FlatBuffersMapper - Own Name: reduce_mean
2025-10-03 17:15:43.562 TRACE FlatBuffersMapper - Own Name: expand_dims
2025-10-03 17:15:43.562 TRACE FlatBuffersMapper - Own Name: subtract
2025-10-03 17:15:43.613 TRACE FlatBuffersMapper - Own Name: multiply
2025-10-03 17:15:43.613 TRACE FlatBuffersMapper - Own Name: reduce_mean_1
2025-10-03 17:15:43.614 TRACE FlatBuffersMapper - Own Name: expand_dims_1
....<more operations>
2025-10-03 17:15:43.633 TRACE SameDiff - Defining backward function: initial outputs [scaled_loss]
Added differentiated op reduce_mean_3
Added differentiated op multiply_6
Added differentiated op subtract_1
Added differentiated op identity
...<more diff ops>....
2025-10-03 17:15:43.658 TRACE AbstractSession - Adding add_37 to subgraph for output.
2025-10-03 17:15:43.658 TRACE AbstractSession - Adding add_38 to subgraph for output.
2025-10-03 17:15:43.658 TRACE AbstractSession - Adding add_40 to subgraph for output.
2025-10-03 17:15:43.658 TRACE AbstractSession - Adding psf/head/WF-grad to subgraph for output.
2025-10-03 17:15:43.658 TRACE AbstractSession - Adding add_31 to subgraph for output.
2025-10-03 17:15:43.659 TRACE AbstractSession - Adding matmul_bp_24:1 to subgraph for output.
....<etc etc> ....
2025-10-03 17:15:43.679 TRACE AbstractSession - Beginning execution step 0: ExecStep(VARIABLE,name="grad",("main",0))
2025-10-03 17:15:43.679 TRACE AbstractDependencyTracker - No values depend on: ExecStep(VARIABLE,name="grad",("main",0))
2025-10-03 17:15:43.679 TRACE AbstractSession - Beginning execution step 1: ExecStep(VARIABLE,name="one-var",("main",0))
2025-10-03 17:15:43.679 TRACE AbstractSession - Beginning execution step 2: ExecStep(VARIABLE,name="psf/head/WF",("main",0))
2025-10-03 17:15:43.679 TRACE AbstractSession - Beginning execution step 3: ExecStep(VARIABLE,name="psf/ps/W1",("main",0))
2025-10-03 17:15:43.680 TRACE AbstractSession - Beginning execution step 4: ExecStep(VARIABLE,name="psf/ps/W2",("main",0))
2025-10-03 17:15:43.680 TRACE AbstractSession - Beginning execution step 5: ExecStep(VARIABLE,name="psf/ps/W3",("main",0))
...<more stuff>....
2025-10-03 17:15:43.685 TRACE InferenceSession - cast - ("main",0) outputs: (0 - cast = 170)
2025-10-03 17:15:43.688 TRACE AbstractDependencyTracker - No values depend on: OpDep(cast,frame=main,iter=0)
2025-10-03 17:15:43.688 TRACE AbstractSession - Beginning execution step 16: ExecStep(OP,name="reduce_mean",("main",0))
2025-10-03 17:15:43.689 TRACE DeviceTADManager - Creating new TAD...
2025-10-03 17:15:43.689 TRACE DeviceTADManager - Using TAD from cache...
2025-10-03 17:15:43.689 TRACE InferenceSession - reduce_mean - ("main",0) outputs: (0 - revin/mu = 171)
2025-10-03 17:15:43.690 TRACE AbstractDependencyTracker - No values depend on: OpDep(reduce_mean,frame=main,iter=0)
2025-10-03 17:15:43.690 TRACE AbstractSession - Beginning execution step 17: ExecStep(OP,name="expand_dims",("main",0))
... <really, a lot of variations on the above>....
2025-10-03 17:15:44.496 TRACE AbstractSession - Beginning execution step 11: ExecStep(PLACEHOLDER,name="input",("main",0))
2025-10-03 17:15:44.496 TRACE AbstractSession - Beginning execution step 12: ExecStep(PLACEHOLDER,name="is_training",("main",0))
2025-10-03 17:15:44.496 TRACE AbstractDependencyTracker - No values depend on: ExecStep(PLACEHOLDER,name="is_training",("main",0))
2025-10-03 17:15:44.496 TRACE AbstractSession - Beginning execution step 13: ExecStep(PLACEHOLDER,name="loss_scale",("main",0))
2025-10-03 17:15:44.496 TRACE AbstractDependencyTracker - No values depend on: ExecStep(PLACEHOLDER,name="loss_scale",("main",0))
2025-10-03 17:15:44.496 TRACE AbstractSession - Beginning execution step 14: ExecStep(PLACEHOLDER,name="target",("main",0))
2025-10-03 17:15:44.496 TRACE AbstractSession - Beginning execution step 15: ExecStep(OP,name="cast",("main",0))
2025-10-03 17:15:44.496 TRACE InferenceSession - cast - ("main",0) outputs: (0 - cast = 983)
2025-10-03 17:15:44.496 TRACE AbstractDependencyTracker - No values depend on: OpDep(cast,frame=main,iter=0)
2025-10-03 17:15:44.496 TRACE AbstractSession - Beginning execution step 16: ExecStep(OP,name="reduce_mean",("main",0))
2025-10-03 17:15:44.496 DEBUG ArrayCacheMemoryMgr - Cache hit for data type HALF and shape [32, 1]

java.lang.RuntimeException: execReduceFloat failed; Error code: [700]

	at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.invoke(CudaExecutioner.java:1104)
	at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:626)
	at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6534)
	at org.nd4j.autodiff.samediff.internal.InferenceSession.doExec(InferenceSession.java:805)
	at org.nd4j.autodiff.samediff.internal.InferenceSession.getOutputs(InferenceSession.java:255)
	at org.nd4j.autodiff.samediff.internal.TrainingSession.getOutputs(TrainingSession.java:163)
	at org.nd4j.autodiff.samediff.internal.TrainingSession.getOutputs(TrainingSession.java:45)
	at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:533)
	at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:154)
	at org.nd4j.autodiff.samediff.internal.TrainingSession.trainingIteration(TrainingSession.java:129)
	at org.nd4j.autodiff.samediff.SameDiff.fitHelper(SameDiff.java:1936)
	at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1792)
	at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1732)
	at org.nd4j.autodiff.samediff.config.FitConfig.exec(FitConfig.java:172)
	at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1747)
	at com.newhighs.psformer.model.UnderlyingModelTest.testFLOAT16_noDropout_noScaling_noXavier(UnderlyingModelTest.java:479)
	

The thing is, the FLOAT datatype doesn't trigger the error. I have checked that all variables are of type HALF, so at this point I can pretty much rule out a mixup of FLOAT and HALF datatypes; when creating ND4j arrays I set the datatype explicitly. Any pointers on how to debug this? Or are there known instabilities when using HALF?
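For context, I'm aware HALF has a limited numeric range, though that alone shouldn't cause a hard CUDA error. A quick sanity check of binary16's limits in plain Java (this sketch uses the JDK 20+ conversion methods Float.floatToFloat16 / float16ToFloat, nothing from ND4J):

```java
// Illustration of IEEE 754 binary16 (HALF) limits: max finite value is
// 65504, values below ~6e-8 underflow to zero, and only 10 mantissa
// bits are available, so most decimals are stored approximately.
public class HalfLimits {
    public static void main(String[] args) {
        // the max finite half value survives a float -> half -> float round-trip
        System.out.println(Float.float16ToFloat(Float.floatToFloat16(65504f)));  // 65504.0
        // anything much larger overflows to +Infinity
        System.out.println(Float.float16ToFloat(Float.floatToFloat16(70000f)));  // Infinity
        // tiny values (e.g. from a small-variance init) underflow to zero
        System.out.println(Float.float16ToFloat(Float.floatToFloat16(1e-8f)));   // 0.0
        // 0.01 is only approximately representable
        System.out.println(Float.float16ToFloat(Float.floatToFloat16(0.01f)));
    }
}
```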

Mark.

@markiemark could you try using cuda-memcheck or compute-sanitizer? To do that you would need to build an uber jar and execute your process with compute-sanitizer.

Something like:
compute-sanitizer java -cp ... (however you do that is up to you). I'd need more information though.
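Roughly like this (the jar and main-class names below are placeholders; use whatever your build produces):

```shell
# build a self-contained (shaded/uber) jar so the JavaCPP native
# libraries resolve from a single classpath entry
mvn -DskipTests package

# run the failing entry point under compute-sanitizer
# (memcheck is the default tool; racecheck reports shared-memory races)
compute-sanitizer --tool memcheck \
    java -cp target/your-app-shaded.jar com.example.YourMainClass
```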

I tried several times, sometimes with logging at trace level, sometimes at info, and compute-sanitizer both with and without --tool memcheck. The weird thing is that compute-sanitizer reports a different number of errors at the end of each run. Also, it fails much, much quicker when logging is at info level (even accounting for the fact that trace-level logging slows things down). Here's the log (info level, tool memcheck):

========= COMPUTE-SANITIZER
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2025-10-04 14:46:11.547 INFO  Nd4jBackend - Loaded [JCublasBackend] backend
2025-10-04 14:46:16.930 INFO  NativeOpsHolder - Number of threads used for linear algebra: 32
2025-10-04 14:46:16.946 INFO  DefaultOpExecutioner - Backend used: [CUDA]; OS: [Linux]
2025-10-04 14:46:16.946 INFO  DefaultOpExecutioner - Cores: [48]; Memory: [29.5GB];
2025-10-04 14:46:16.946 INFO  DefaultOpExecutioner - Blas vendor: [CUBLAS]
2025-10-04 14:46:16.949 INFO  JCublasBackend - ND4J CUDA build version: 11.6.124
2025-10-04 14:46:16.950 INFO  JCublasBackend - CUDA device 0: [NVIDIA GeForce RTX 3090]; cc: [8.6]; Total memory: [25293946880]
2025-10-04 14:46:16.950 INFO  JCublasBackend - Backend build information:
 GCC: "7.5.0"
STD version: 201103L
DEFAULT_ENGINE: samediff::ENGINE_CUDA
HAVE_FLATBUFFERS
psf/ps/W1 datatype = HALF
psf/ps/W2 datatype = HALF
psf/ps/W3 datatype = HALF
psf/head/WF datatype = HALF
2025-10-04 14:46:22.588 WARN  ImportClassMapping - Duplicate TF op mapping found for op Pow: org.nd4j.linalg.api.ops.impl.scalar.Pow vs org.nd4j.linalg.api.ops.impl.transforms.custom.Pow
2025-10-04 14:46:22.593 WARN  ImportClassMapping - Duplicate TF op mapping found for op FloorMod: org.nd4j.linalg.api.ops.impl.transforms.pairwise.arithmetic.FModOp vs org.nd4j.linalg.api.ops.impl.transforms.pairwise.arithmetic.FloorModOp
Added differentiated op reduce_mean_3
Added differentiated op multiply_6
Added differentiated op subtract_1
Added differentiated op identity
Added differentiated op add_9
Added differentiated op multiply_5
Added differentiated op reshape_17
Added differentiated op matmul_26
Added differentiated op reshape_16
Added differentiated op reshape_15
Added differentiated op permute_5
Added differentiated op reshape_14
Added differentiated op reshape_13
Added differentiated op matmul_25
Added differentiated op add_8
Added differentiated op matmul_24
Added differentiated op gelu_5
Added differentiated op matmul_23
Added differentiated op reshape_12
Added differentiated op add_7
Added differentiated op matmul_22
Added differentiated op softmax_3
Added differentiated op multiply_4
Added differentiated op matmul_21
Added differentiated op permute_4
Added differentiated op reshape_11
Added differentiated op matmul_20
Added differentiated op add_6
Added differentiated op matmul_19
Added differentiated op gelu_4
Added differentiated op matmul_18
Added differentiated op reshape_10
Added differentiated op relu_1
Added differentiated op matmul_17
Added differentiated op softmax_2
Added differentiated op multiply_3
Added differentiated op matmul_16
Added differentiated op permute_3
Added differentiated op reshape_9
Added differentiated op matmul_15
Added differentiated op add_5
Added differentiated op matmul_14
Added differentiated op gelu_3
Added differentiated op matmul_13
Added differentiated op reshape_8
Added differentiated op reshape_7
Added differentiated op matmul_12
Added differentiated op add_4
Added differentiated op matmul_11
Added differentiated op gelu_2
Added differentiated op matmul_10
Added differentiated op reshape_6
Added differentiated op add_3
Added differentiated op matmul_9
Added differentiated op softmax_1
Added differentiated op multiply_2
Added differentiated op matmul_8
Added differentiated op permute_2
Added differentiated op reshape_5
Added differentiated op matmul_7
Added differentiated op add_2
Added differentiated op matmul_6
Added differentiated op gelu_1
Added differentiated op matmul_5
Added differentiated op reshape_4
Added differentiated op relu
Added differentiated op matmul_4
Added differentiated op softmax
Added differentiated op multiply_1
Added differentiated op matmul_3
Added differentiated op permute_1
Added differentiated op reshape_3
Added differentiated op matmul_2
Added differentiated op add_1
Added differentiated op matmul_1
Added differentiated op gelu
Added differentiated op matmul
========= Invalid __global__ read of size 2 bytes
=========     at 0x5180 in void simpleReduce<float16, float16, simdOps::Mean<float16, float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *)
=========     by thread (96,0,0) in block (1,0,0)
=========     Address 0x7d17e30ab380 is out of bounds
=========     and is 128 bytes before the nearest allocation at 0x7d17e30ab400 of size 1544 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuLaunchKernel [0x39e715]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__cudart803 [0x27486ab]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:cudaLaunchKernel [0x27a3508]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:void functions::reduce::ReduceFloatFunction<float16, float16>::intermediateXD<simdOps::Mean<float16, float16> >(dim3, CUstream_st**, void const*, long long const*, long long const*, void*, void*, void*, long long const*, long long const*, int const*) [0x20a24e2]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:functions::reduce::ReduceFloatFunction<float16, float16>::execReduceXD(dim3, CUstream_st**, int, void const*, long long const*, long long const*, void*, void*, void*, long long const*, long long const*, int const*) [0x20a6fcb]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:NativeOpExecutioner::execReduceFloat(sd::LaunchContext*, int, void const*, long long const*, void const*, long long const*, void*, void*, long long const*, void*, long long const*, int*, int) [0xa15980]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:execReduceFloat2 [0xa4027f]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:Java_org_nd4j_linalg_jcublas_bindings_Nd4jCuda_execReduceFloat2__Lorg_bytedeco_javacpp_PointerPointer_2ILorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_Pointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2 [0xff8b8]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libjnind4jcuda.so
=========     Host Frame: [0x3c026eac0]
=========                in 
========= 
========= Invalid __global__ read of size 2 bytes
=========     at 0x5180 in void simpleReduce<float16, float16, simdOps::Mean<float16, float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *)
=========     by thread (97,0,0) in block (1,0,0)
=========     Address 0x7d17e30ab382 is out of bounds
=========     and is 126 bytes before the nearest allocation at 0x7d17e30ab400 of size 1544 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuLaunchKernel [0x39e715]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__cudart803 [0x27486ab]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:cudaLaunchKernel [0x27a3508]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:void functions::reduce::ReduceFloatFunction<float16, float16>::intermediateXD<simdOps::Mean<float16, float16> >(dim3, CUstream_st**, void const*, long long const*, long long const*, void*, void*, void*, long long const*, long long const*, int const*) [0x20a24e2]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:functions::reduce::ReduceFloatFunction<float16, float16>::execReduceXD(dim3, CUstream_st**, int, void const*, long long const*, long long const*, void*, void*, void*, long long const*, long long const*, int const*) [0x20a6fcb]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:NativeOpExecutioner::execReduceFloat(sd::LaunchContext*, int, void const*, long long const*, void const*, long long const*, void*, void*, long long const*, void*, long long const*, int*, int) [0xa15980]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:execReduceFloat2 [0xa4027f]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:Java_org_nd4j_linalg_jcublas_bindings_Nd4jCuda_execReduceFloat2__Lorg_bytedeco_javacpp_PointerPointer_2ILorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_Pointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2 [0xff8b8]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libjnind4jcuda.so
=========     Host Frame: [0x3c026eac0]
=========                in 

....<lots of messages about Invalid __global__ read of size 2 bytes. They differ in address> ...

========= 
========= Invalid __global__ read of size 2 bytes
=========     at 0x5180 in void simpleReduce<float16, float16, simdOps::Mean<float16, float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *)
=========     by thread (255,0,0) in block (7,0,0)
=========     Address 0x7d17e30abb3e is out of bounds
=========     and is 194 bytes before the nearest allocation at 0x7d17e30abc00 of size 6152 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuLaunchKernel [0x39e715]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__cudart803 [0x27486ab]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:cudaLaunchKernel [0x27a3508]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:void functions::reduce::ReduceFloatFunction<float16, float16>::intermediateXD<simdOps::Mean<float16, float16> >(dim3, CUstream_st**, void const*, long long const*, long long const*, void*, void*, void*, long long const*, long long const*, int const*) [0x20a24e2]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:functions::reduce::ReduceFloatFunction<float16, float16>::execReduceXD(dim3, CUstream_st**, int, void const*, long long const*, long long const*, void*, void*, void*, long long const*, long long const*, int const*) [0x20a6fcb]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:NativeOpExecutioner::execReduceFloat(sd::LaunchContext*, int, void const*, long long const*, void const*, long long const*, void*, void*, long long const*, void*, long long const*, int*, int) [0xa15980]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:execReduceFloat2 [0xa4027f]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:Java_org_nd4j_linalg_jcublas_bindings_Nd4jCuda_execReduceFloat2__Lorg_bytedeco_javacpp_PointerPointer_2ILorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_Pointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2 [0xff8b8]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libjnind4jcuda.so
=========     Host Frame: [0x3c026eac0]
=========                in 
========= 
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaStreamSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4acfa6]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cudaStreamSynchronize [0x27a32e8]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:NativeOpExecutioner::execReduceFloat(sd::LaunchContext*, int, void const*, long long const*, void const*, long long const*, void*, void*, long long const*, void*, long long const*, int*, int) [0xa15698]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:execReduceFloat2 [0xa4027f]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:Java_org_nd4j_linalg_jcublas_bindings_Nd4jCuda_execReduceFloat2__Lorg_bytedeco_javacpp_PointerPointer_2ILorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_Pointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2 [0xff8b8]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libjnind4jcuda.so
=========     Host Frame: [0xffffffffe3d06a1f]
=========                in 
========= 
Exception in thread "main" java.lang.RuntimeException: execReduceFloat failed; Error code: [719]
        at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.invoke(CudaExecutioner.java:1104)
        at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:626)
        at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6534)
        at org.nd4j.autodiff.samediff.internal.InferenceSession.doExec(InferenceSession.java:805)
        at org.nd4j.autodiff.samediff.internal.InferenceSession.getOutputs(InferenceSession.java:255)
        at org.nd4j.autodiff.samediff.internal.TrainingSession.getOutputs(TrainingSession.java:163)
        at org.nd4j.autodiff.samediff.internal.TrainingSession.getOutputs(TrainingSession.java:45)
        at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:533)
        at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:154)
        at org.nd4j.autodiff.samediff.internal.TrainingSession.trainingIteration(TrainingSession.java:129)
        at org.nd4j.autodiff.samediff.SameDiff.fitHelper(SameDiff.java:1936)
        at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1792)
        at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1732)
        at org.nd4j.autodiff.samediff.config.FitConfig.exec(FitConfig.java:172)
        at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1747)
        at com.newhighs.psformer.examples.example3.RunUnderlyingModel.testFLOAT16_noDropout_noScaling_noXavier(RunUnderlyingModel.java:492)
        at com.newhighs.psformer.examples.example3.RunUnderlyingModel.main(RunUnderlyingModel.java:144)
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4acfa6]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cudaFree [0x2788aee]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:sd::ContextBuffers::release() [0x1751ac3]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:sd::ContextBuffers::~ContextBuffers() [0x1751cdd]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:./stdlib/cxa_thread_atexit_impl.c:162:__call_tls_dtors [0x4772f]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:./nptl/pthread_create.c:455:start_thread [0x9c88b]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:../sysdeps/unix/sysv/linux/x86_64/clone3.S:80:clone3 [0x129c6c]
=========                in /lib/x86_64-linux-gnu/libc.so.6
========= 
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFreeHost.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4acfa6]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cudaFreeHost [0x2788c8e]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:sd::ContextBuffers::release() [0x1751ad1]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:sd::ContextBuffers::~ContextBuffers() [0x1751cdd]
=========                in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
=========     Host Frame:./stdlib/cxa_thread_atexit_impl.c:162:__call_tls_dtors [0x4772f]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:./nptl/pthread_create.c:455:start_thread [0x9c88b]
=========                in /lib/x86_64-linux-gnu/libc.so.6

....<lots of messages about 'Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFreeHost.'. They differ on Host Frame> ....

========= 
========= Target application returned an error
========= ERROR SUMMARY: 1805 errors

Am I running into a race condition?

(EDIT: also tried with --tool racecheck. Yes, I am running into race conditions):

========= Error: Race reported between Read access at 0x53d0 in void simpleReduce<float16, float16, simdOps::Mean<float16, float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *)
=========     and Write access at 0x5400 in void simpleReduce<float16, float16, simdOps::Mean<float16, float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *) [1280 hazards]
========= 
========= Error: Race reported between Read access at 0x53d0 in void simpleReduce<float16, float16, simdOps::Mean<float16, float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *)
=========     and Write access at 0x5400 in void simpleReduce<float16, float16, simdOps::Mean<float16, float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *) [2176 hazards]
========= 
========= Error: Race reported between Read access at 0x3000 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *)
=========     and Write access at 0x6d00 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *) [512 hazards]
========= 
========= Error: Race reported between Read access at 0x3000 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *)
=========     and Write access at 0x6d00 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *) [514 hazards]
========= 
========= Error: Race reported between Read access at 0x3000 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *)
=========     and Write access at 0x6d00 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *) [512 hazards]
========= 
========= Error: Race reported between Read access at 0x3000 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *)
=========     and Write access at 0x6d00 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *) [512 hazards]
========= 
========= Error: Race reported between Read access at 0x6f90 in void simpleReduce<float16, simdOps::Sum<float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *)
=========     and Write access at 0x7000 in void simpleReduce<float16, simdOps::Sum<float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *) [224 hazards]

.... and lots more of these....

Exception in thread "main" java.lang.RuntimeException: execReduceFloat failed; Error code: [700]
        at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.invoke(CudaExecutioner.java:1104)
        at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:626)
        at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6534)
        at org.nd4j.autodiff.samediff.internal.InferenceSession.doExec(InferenceSession.java:805)
        at org.nd4j.autodiff.samediff.internal.InferenceSession.getOutputs(InferenceSession.java:255)
        at org.nd4j.autodiff.samediff.internal.TrainingSession.getOutputs(TrainingSession.java:163)
        at org.nd4j.autodiff.samediff.internal.TrainingSession.getOutputs(TrainingSession.java:45)
        at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:533)
        at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:154)
        at org.nd4j.autodiff.samediff.internal.TrainingSession.trainingIteration(TrainingSession.java:129)
        at org.nd4j.autodiff.samediff.SameDiff.fitHelper(SameDiff.java:1936)
        at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1792)
        at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1732)
        at org.nd4j.autodiff.samediff.config.FitConfig.exec(FitConfig.java:172)
        at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1747)
        at com.newhighs.psformer.examples.example3.RunUnderlyingModel.testFLOAT16_noDropout_noScaling_noXavier(RunUnderlyingModel.java:492)
        at com.newhighs.psformer.examples.example3.RunUnderlyingModel.main(RunUnderlyingModel.java:144)
========= Target application returned an error
========= RACECHECK SUMMARY: 25 hazards displayed (25 errors, 0 warnings)

Not sure where it is coming from. My test doesn’t spawn any threads of its own. It’s basically this (the error comes from sd.fit(trainIter, 1)):

        final int M = 1;
        final int L = 96;
        final int F = 24;
        final int P = 8;
        final int layers = 2;
        final long seed = 1234L;

        // Set both the computation and variable data types
        Nd4j.setDefaultDataTypes(DataType.HALF, DataType.HALF);

        // Configure with FLOAT16
        PSFormerConfig cfg = new PSFormerConfig(M, L, F, P, layers);
        cfg.dtype = DataType.FLOAT16;
        cfg.seed = seed;
        cfg.dropout = null;
        cfg.validate();
        
        // Build model
        SameDiff sd = PSFormer.build(cfg);
        
        // Override the xavier-initialized parameters with hand-picked constants
        int N = cfg.N();
        sd.getVariable("psf/ps/W1").getArr().assign(0.01);
        sd.getVariable("psf/ps/W2").getArr().assign(0.31);
        sd.getVariable("psf/ps/W3").getArr().assign(-0.51);
        sd.getVariable("psf/head/WF").getArr().assign(-0.91);
        System.out.println("psf/ps/W1 datatype = " + (sd.getVariable("psf/ps/W1").dataType()));
        System.out.println("psf/ps/W2 datatype = " + (sd.getVariable("psf/ps/W2").dataType()));
        System.out.println("psf/ps/W3 datatype = " + (sd.getVariable("psf/ps/W3").dataType()));
        System.out.println("psf/head/WF datatype = " + (sd.getVariable("psf/head/WF").dataType()));

        // Training config
        TrainingConfig tc = new TrainingConfig.Builder()
            .updater(new Adam(1e-3))
            .dataSetFeatureMapping(PSFormer.INPUT)
            .dataSetLabelMapping(PSFormer.TARGET)
            .minimize(PSFormer.LOSS)
            .l2(1e-4)
            .build();
        sd.setTrainingConfig(tc);
        
        // Create training data
        MultiDataSetIterator trainIter = createTrainingIterator(20, 32, M, L, F, DataType.FLOAT16);
        
        // Train for 5 epochs
        for (int epoch = 0; epoch < 5; epoch++) {
            sd.fit(trainIter, 1);
            trainIter.reset();
        }
        
        // Perform prediction
        INDArray[] testData = createSawtoothData(1, M, L, F, DataType.FLOAT16);
        Map<String, INDArray> feed = new HashMap<>();
        feed.put(PSFormer.INPUT, testData[0]);
        feed.put(PSFormer.TARGET, testData[1]);
        
        INDArray prediction = sd.outputSingle(feed, PSFormer.PRED);
        
        // Verify prediction is finite
        assertNotNull(prediction);
        assertEquals(DataType.FLOAT16, prediction.dataType());
//        for (int i = 0; i < prediction.length(); i++) {
//            double val = prediction.getDouble(i);
//            assertTrue(Double.isFinite(val), "Prediction should be finite at index " + i);
//        }
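
To check whether the races come from the float16 kernels themselves rather than from the model, something like this standalone sketch could be run under compute-sanitizer. It exercises the two kernels named in the racecheck output (simpleReduce with Mean, and softMaxCuda) without any SameDiff graph around them. The class name is made up, and I’m assuming Nd4j.rand accepts HALF directly here:

```java
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.ops.transforms.Transforms;

// Hypothetical standalone repro: runs a HALF mean-reduction and a HALF
// softmax, the two float16 kernels flagged by racecheck, outside SameDiff.
public class HalfKernelRepro {

    static INDArray[] run() {
        // Same default data types as the failing test
        Nd4j.setDefaultDataTypes(DataType.HALF, DataType.HALF);
        INDArray x = Nd4j.rand(DataType.HALF, 32, 96); // batch x sequence, roughly the test's scale
        INDArray mean = x.mean(1);           // triggers simpleReduce<float16, Mean> on CUDA
        INDArray sm = Transforms.softmax(x); // triggers softMaxCuda<float16> on CUDA
        return new INDArray[]{mean, sm};
    }

    public static void main(String[] args) {
        for (INDArray a : run()) {
            System.out.println(a.shapeInfoToString());
        }
    }
}
```

If racecheck flags the same hazards here, the problem is in the float16 CUDA kernels themselves rather than in the parameter sharing.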

My final thought is that PSFormers share parameter blocks. Could that be the cause of these race conditions?

That looks likely, yes. CUDA error 700 is an illegal memory access, and it usually comes from out-of-bounds indexing or multi-threading problems.
The shared parameters probably need a closer look.
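
As a stopgap while that is investigated, forcing everything back to FLOAT avoids dispatching the float16 reduce/softmax kernels entirely. A minimal fragment of the config from the test above, adjusted (this is just the already-working FLOAT path, not a fix):

```java
// Workaround sketch: same setup as the failing test, but in FLOAT,
// so no float16 CUDA kernels are involved.
Nd4j.setDefaultDataTypes(DataType.FLOAT, DataType.FLOAT);
cfg.dtype = DataType.FLOAT;
```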