I tried this several times, sometimes with logging at trace level, sometimes at info, and both with and without compute-sanitizer --tool memcheck. The weird thing is that compute-sanitizer reports a different number of errors at the end of each run. Also, the program seems to fail much, much more quickly when logging is at info (even accounting for the fact that trace-level logging slows everything down). Here's the log (at info, with --tool memcheck):
========= COMPUTE-SANITIZER
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2025-10-04 14:46:11.547 INFO Nd4jBackend - Loaded [JCublasBackend] backend
2025-10-04 14:46:16.930 INFO NativeOpsHolder - Number of threads used for linear algebra: 32
2025-10-04 14:46:16.946 INFO DefaultOpExecutioner - Backend used: [CUDA]; OS: [Linux]
2025-10-04 14:46:16.946 INFO DefaultOpExecutioner - Cores: [48]; Memory: [29.5GB];
2025-10-04 14:46:16.946 INFO DefaultOpExecutioner - Blas vendor: [CUBLAS]
2025-10-04 14:46:16.949 INFO JCublasBackend - ND4J CUDA build version: 11.6.124
2025-10-04 14:46:16.950 INFO JCublasBackend - CUDA device 0: [NVIDIA GeForce RTX 3090]; cc: [8.6]; Total memory: [25293946880]
2025-10-04 14:46:16.950 INFO JCublasBackend - Backend build information:
GCC: "7.5.0"
STD version: 201103L
DEFAULT_ENGINE: samediff::ENGINE_CUDA
HAVE_FLATBUFFERS
psf/ps/W1 datatype = HALF
psf/ps/W2 datatype = HALF
psf/ps/W3 datatype = HALF
psf/head/WF datatype = HALF
2025-10-04 14:46:22.588 WARN ImportClassMapping - Duplicate TF op mapping found for op Pow: org.nd4j.linalg.api.ops.impl.scalar.Pow vs org.nd4j.linalg.api.ops.impl.transforms.custom.Pow
2025-10-04 14:46:22.593 WARN ImportClassMapping - Duplicate TF op mapping found for op FloorMod: org.nd4j.linalg.api.ops.impl.transforms.pairwise.arithmetic.FModOp vs org.nd4j.linalg.api.ops.impl.transforms.pairwise.arithmetic.FloorModOp
Added differentiated op reduce_mean_3
Added differentiated op multiply_6
Added differentiated op subtract_1
Added differentiated op identity
Added differentiated op add_9
Added differentiated op multiply_5
Added differentiated op reshape_17
Added differentiated op matmul_26
Added differentiated op reshape_16
Added differentiated op reshape_15
Added differentiated op permute_5
Added differentiated op reshape_14
Added differentiated op reshape_13
Added differentiated op matmul_25
Added differentiated op add_8
Added differentiated op matmul_24
Added differentiated op gelu_5
Added differentiated op matmul_23
Added differentiated op reshape_12
Added differentiated op add_7
Added differentiated op matmul_22
Added differentiated op softmax_3
Added differentiated op multiply_4
Added differentiated op matmul_21
Added differentiated op permute_4
Added differentiated op reshape_11
Added differentiated op matmul_20
Added differentiated op add_6
Added differentiated op matmul_19
Added differentiated op gelu_4
Added differentiated op matmul_18
Added differentiated op reshape_10
Added differentiated op relu_1
Added differentiated op matmul_17
Added differentiated op softmax_2
Added differentiated op multiply_3
Added differentiated op matmul_16
Added differentiated op permute_3
Added differentiated op reshape_9
Added differentiated op matmul_15
Added differentiated op add_5
Added differentiated op matmul_14
Added differentiated op gelu_3
Added differentiated op matmul_13
Added differentiated op reshape_8
Added differentiated op reshape_7
Added differentiated op matmul_12
Added differentiated op add_4
Added differentiated op matmul_11
Added differentiated op gelu_2
Added differentiated op matmul_10
Added differentiated op reshape_6
Added differentiated op add_3
Added differentiated op matmul_9
Added differentiated op softmax_1
Added differentiated op multiply_2
Added differentiated op matmul_8
Added differentiated op permute_2
Added differentiated op reshape_5
Added differentiated op matmul_7
Added differentiated op add_2
Added differentiated op matmul_6
Added differentiated op gelu_1
Added differentiated op matmul_5
Added differentiated op reshape_4
Added differentiated op relu
Added differentiated op matmul_4
Added differentiated op softmax
Added differentiated op multiply_1
Added differentiated op matmul_3
Added differentiated op permute_1
Added differentiated op reshape_3
Added differentiated op matmul_2
Added differentiated op add_1
Added differentiated op matmul_1
Added differentiated op gelu
Added differentiated op matmul
========= Invalid __global__ read of size 2 bytes
========= at 0x5180 in void simpleReduce<float16, float16, simdOps::Mean<float16, float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *)
========= by thread (96,0,0) in block (1,0,0)
========= Address 0x7d17e30ab380 is out of bounds
========= and is 128 bytes before the nearest allocation at 0x7d17e30ab400 of size 1544 bytes
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:cuLaunchKernel [0x39e715]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:__cudart803 [0x27486ab]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:cudaLaunchKernel [0x27a3508]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:void functions::reduce::ReduceFloatFunction<float16, float16>::intermediateXD<simdOps::Mean<float16, float16> >(dim3, CUstream_st**, void const*, long long const*, long long const*, void*, void*, void*, long long const*, long long const*, int const*) [0x20a24e2]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:functions::reduce::ReduceFloatFunction<float16, float16>::execReduceXD(dim3, CUstream_st**, int, void const*, long long const*, long long const*, void*, void*, void*, long long const*, long long const*, int const*) [0x20a6fcb]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:NativeOpExecutioner::execReduceFloat(sd::LaunchContext*, int, void const*, long long const*, void const*, long long const*, void*, void*, long long const*, void*, long long const*, int*, int) [0xa15980]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:execReduceFloat2 [0xa4027f]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:Java_org_nd4j_linalg_jcublas_bindings_Nd4jCuda_execReduceFloat2__Lorg_bytedeco_javacpp_PointerPointer_2ILorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_Pointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2 [0xff8b8]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libjnind4jcuda.so
========= Host Frame: [0x3c026eac0]
========= in
=========
========= Invalid __global__ read of size 2 bytes
========= at 0x5180 in void simpleReduce<float16, float16, simdOps::Mean<float16, float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *)
========= by thread (97,0,0) in block (1,0,0)
========= Address 0x7d17e30ab382 is out of bounds
========= and is 126 bytes before the nearest allocation at 0x7d17e30ab400 of size 1544 bytes
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:cuLaunchKernel [0x39e715]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:__cudart803 [0x27486ab]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:cudaLaunchKernel [0x27a3508]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:void functions::reduce::ReduceFloatFunction<float16, float16>::intermediateXD<simdOps::Mean<float16, float16> >(dim3, CUstream_st**, void const*, long long const*, long long const*, void*, void*, void*, long long const*, long long const*, int const*) [0x20a24e2]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:functions::reduce::ReduceFloatFunction<float16, float16>::execReduceXD(dim3, CUstream_st**, int, void const*, long long const*, long long const*, void*, void*, void*, long long const*, long long const*, int const*) [0x20a6fcb]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:NativeOpExecutioner::execReduceFloat(sd::LaunchContext*, int, void const*, long long const*, void const*, long long const*, void*, void*, long long const*, void*, long long const*, int*, int) [0xa15980]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:execReduceFloat2 [0xa4027f]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:Java_org_nd4j_linalg_jcublas_bindings_Nd4jCuda_execReduceFloat2__Lorg_bytedeco_javacpp_PointerPointer_2ILorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_Pointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2 [0xff8b8]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libjnind4jcuda.so
========= Host Frame: [0x3c026eac0]
========= in
....<lots of messages about Invalid __global__ read of size 2 bytes. They differ in address> ...
=========
========= Invalid __global__ read of size 2 bytes
========= at 0x5180 in void simpleReduce<float16, float16, simdOps::Mean<float16, float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *)
========= by thread (255,0,0) in block (7,0,0)
========= Address 0x7d17e30abb3e is out of bounds
========= and is 194 bytes before the nearest allocation at 0x7d17e30abc00 of size 6152 bytes
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:cuLaunchKernel [0x39e715]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:__cudart803 [0x27486ab]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:cudaLaunchKernel [0x27a3508]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:void functions::reduce::ReduceFloatFunction<float16, float16>::intermediateXD<simdOps::Mean<float16, float16> >(dim3, CUstream_st**, void const*, long long const*, long long const*, void*, void*, void*, long long const*, long long const*, int const*) [0x20a24e2]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:functions::reduce::ReduceFloatFunction<float16, float16>::execReduceXD(dim3, CUstream_st**, int, void const*, long long const*, long long const*, void*, void*, void*, long long const*, long long const*, int const*) [0x20a6fcb]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:NativeOpExecutioner::execReduceFloat(sd::LaunchContext*, int, void const*, long long const*, void const*, long long const*, void*, void*, long long const*, void*, long long const*, int*, int) [0xa15980]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:execReduceFloat2 [0xa4027f]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:Java_org_nd4j_linalg_jcublas_bindings_Nd4jCuda_execReduceFloat2__Lorg_bytedeco_javacpp_PointerPointer_2ILorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_Pointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2 [0xff8b8]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libjnind4jcuda.so
========= Host Frame: [0x3c026eac0]
========= in
=========
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaStreamSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: [0x4acfa6]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:cudaStreamSynchronize [0x27a32e8]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:NativeOpExecutioner::execReduceFloat(sd::LaunchContext*, int, void const*, long long const*, void const*, long long const*, void*, void*, long long const*, void*, long long const*, int*, int) [0xa15698]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:execReduceFloat2 [0xa4027f]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:Java_org_nd4j_linalg_jcublas_bindings_Nd4jCuda_execReduceFloat2__Lorg_bytedeco_javacpp_PointerPointer_2ILorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_Pointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_nd4j_nativeblas_OpaqueDataBuffer_2Lorg_bytedeco_javacpp_LongPointer_2Lorg_bytedeco_javacpp_LongPointer_2 [0xff8b8]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libjnind4jcuda.so
========= Host Frame: [0xffffffffe3d06a1f]
========= in
=========
Exception in thread "main" java.lang.RuntimeException: execReduceFloat failed; Error code: [719]
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.invoke(CudaExecutioner.java:1104)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:626)
at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6534)
at org.nd4j.autodiff.samediff.internal.InferenceSession.doExec(InferenceSession.java:805)
at org.nd4j.autodiff.samediff.internal.InferenceSession.getOutputs(InferenceSession.java:255)
at org.nd4j.autodiff.samediff.internal.TrainingSession.getOutputs(TrainingSession.java:163)
at org.nd4j.autodiff.samediff.internal.TrainingSession.getOutputs(TrainingSession.java:45)
at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:533)
at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:154)
at org.nd4j.autodiff.samediff.internal.TrainingSession.trainingIteration(TrainingSession.java:129)
at org.nd4j.autodiff.samediff.SameDiff.fitHelper(SameDiff.java:1936)
at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1792)
at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1732)
at org.nd4j.autodiff.samediff.config.FitConfig.exec(FitConfig.java:172)
at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1747)
at com.newhighs.psformer.examples.example3.RunUnderlyingModel.testFLOAT16_noDropout_noScaling_noXavier(RunUnderlyingModel.java:492)
at com.newhighs.psformer.examples.example3.RunUnderlyingModel.main(RunUnderlyingModel.java:144)
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: [0x4acfa6]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:cudaFree [0x2788aee]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:sd::ContextBuffers::release() [0x1751ac3]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:sd::ContextBuffers::~ContextBuffers() [0x1751cdd]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:./stdlib/cxa_thread_atexit_impl.c:162:__call_tls_dtors [0x4772f]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:./nptl/pthread_create.c:455:start_thread [0x9c88b]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:../sysdeps/unix/sysv/linux/x86_64/clone3.S:80:clone3 [0x129c6c]
========= in /lib/x86_64-linux-gnu/libc.so.6
=========
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFreeHost.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: [0x4acfa6]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:cudaFreeHost [0x2788c8e]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:sd::ContextBuffers::release() [0x1751ad1]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:sd::ContextBuffers::~ContextBuffers() [0x1751cdd]
========= in /home/mark/.javacpp/cache/psformer-1.4.1-SNAPSHOT.jar/org/nd4j/linalg/jcublas/bindings/linux-x86_64/libnd4jcuda.so
========= Host Frame:./stdlib/cxa_thread_atexit_impl.c:162:__call_tls_dtors [0x4772f]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:./nptl/pthread_create.c:455:start_thread [0x9c88b]
========= in /lib/x86_64-linux-gnu/libc.so.6
....<lots of messages about 'Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFreeHost.'. They differ on Host Frame> ....
=========
========= Target application returned an error
========= ERROR SUMMARY: 1805 errors
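Doing the arithmetic on the first two reports above (a quick pure-Java check; the addresses are copied straight from the log): with 2-byte float16 elements, thread 96 faults exactly 64 elements before the start of the allocation and thread 97 faults 63 elements before it. So consecutive threads are reading consecutive elements from a systematically wrong base offset, not hitting random addresses:

```java
public class AddressCheck {
    public static void main(String[] args) {
        long alloc = 0x7d17e30ab400L; // nearest allocation (1544 bytes), from the log
        long t96 = 0x7d17e30ab380L;   // faulting address, thread (96,0,0)
        long t97 = 0x7d17e30ab382L;   // faulting address, thread (97,0,0)
        int elemSize = 2;             // sizeof(float16)
        // Negative indices: how many elements before the allocation each thread reads
        System.out.println("thread 96 reads element " + (t96 - alloc) / elemSize);
        System.out.println("thread 97 reads element " + (t97 - alloc) / elemSize);
    }
}
```

That pattern (a fixed negative offset, with stride equal to the element size) looks to me more like a shape/stride or offset bug in the Mean reduction over HALF than like corrupted pointers, though I may be misreading it.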
Am I running into a race condition?
(EDIT: I also tried with --tool racecheck. Yes, I am running into race conditions):
========= Error: Race reported between Read access at 0x53d0 in void simpleReduce<float16, float16, simdOps::Mean<float16, float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *)
========= and Write access at 0x5400 in void simpleReduce<float16, float16, simdOps::Mean<float16, float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *) [1280 hazards]
=========
========= Error: Race reported between Read access at 0x53d0 in void simpleReduce<float16, float16, simdOps::Mean<float16, float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *)
========= and Write access at 0x5400 in void simpleReduce<float16, float16, simdOps::Mean<float16, float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *) [2176 hazards]
=========
========= Error: Race reported between Read access at 0x3000 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *)
========= and Write access at 0x6d00 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *) [512 hazards]
=========
========= Error: Race reported between Read access at 0x3000 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *)
========= and Write access at 0x6d00 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *) [514 hazards]
=========
========= Error: Race reported between Read access at 0x3000 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *)
========= and Write access at 0x6d00 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *) [512 hazards]
=========
========= Error: Race reported between Read access at 0x3000 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *)
========= and Write access at 0x6d00 in void sd::ops::helpers::softMaxCuda<float16>(const void *, const long long *, const long long *, void *, const long long *, const long long *) [512 hazards]
=========
========= Error: Race reported between Read access at 0x6f90 in void simpleReduce<float16, simdOps::Sum<float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *)
========= and Write access at 0x7000 in void simpleReduce<float16, simdOps::Sum<float16>>(const void *, const long long *, const long long *, void *, void *, void *, const long long *) [224 hazards]
.... and lots more of these....
Exception in thread "main" java.lang.RuntimeException: execReduceFloat failed; Error code: [700]
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.invoke(CudaExecutioner.java:1104)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:626)
at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6534)
at org.nd4j.autodiff.samediff.internal.InferenceSession.doExec(InferenceSession.java:805)
at org.nd4j.autodiff.samediff.internal.InferenceSession.getOutputs(InferenceSession.java:255)
at org.nd4j.autodiff.samediff.internal.TrainingSession.getOutputs(TrainingSession.java:163)
at org.nd4j.autodiff.samediff.internal.TrainingSession.getOutputs(TrainingSession.java:45)
at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:533)
at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:154)
at org.nd4j.autodiff.samediff.internal.TrainingSession.trainingIteration(TrainingSession.java:129)
at org.nd4j.autodiff.samediff.SameDiff.fitHelper(SameDiff.java:1936)
at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1792)
at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1732)
at org.nd4j.autodiff.samediff.config.FitConfig.exec(FitConfig.java:172)
at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1747)
at com.newhighs.psformer.examples.example3.RunUnderlyingModel.testFLOAT16_noDropout_noScaling_noXavier(RunUnderlyingModel.java:492)
at com.newhighs.psformer.examples.example3.RunUnderlyingModel.main(RunUnderlyingModel.java:144)
========= Target application returned an error
========= RACECHECK SUMMARY: 25 hazards displayed (25 errors, 0 warnings)
I'm not sure where this is coming from. My test doesn't spawn any threads of its own. It's basically the following (the error comes from sd.fit(trainIter, 1)):
final int M = 1;
final int L = 96;
final int F = 24;
final int P = 8;
final int layers = 2;
final long seed = 1234L;

// Set both the computation and variable data types
Nd4j.setDefaultDataTypes(DataType.HALF, DataType.HALF);

// Configure with FLOAT16
PSFormerConfig cfg = new PSFormerConfig(M, L, F, P, layers);
cfg.dtype = DataType.FLOAT16;
cfg.seed = seed;
cfg.dropout = null;
cfg.validate();

// Build model
SameDiff sd = PSFormer.build(cfg);

// Override the parameters with constant initializations
int N = cfg.N();
sd.getVariable("psf/ps/W1").getArr().assign(0.01);
sd.getVariable("psf/ps/W2").getArr().assign(0.31);
sd.getVariable("psf/ps/W3").getArr().assign(-0.51);
sd.getVariable("psf/head/WF").getArr().assign(-0.91);
System.out.println("psf/ps/W1 datatype = " + sd.getVariable("psf/ps/W1").dataType());
System.out.println("psf/ps/W2 datatype = " + sd.getVariable("psf/ps/W2").dataType());
System.out.println("psf/ps/W3 datatype = " + sd.getVariable("psf/ps/W3").dataType());
System.out.println("psf/head/WF datatype = " + sd.getVariable("psf/head/WF").dataType());

// Training config
TrainingConfig tc = new TrainingConfig.Builder()
        .updater(new Adam(1e-3))
        .dataSetFeatureMapping(PSFormer.INPUT)
        .dataSetLabelMapping(PSFormer.TARGET)
        .minimize(PSFormer.LOSS)
        .l2(1e-4)
        .build();
sd.setTrainingConfig(tc);

// Create training data
MultiDataSetIterator trainIter = createTrainingIterator(20, 32, M, L, F, DataType.FLOAT16);

// Train for 5 epochs
for (int epoch = 0; epoch < 5; epoch++) {
    sd.fit(trainIter, 1);
    trainIter.reset();
}

// Perform a prediction
INDArray[] testData = createSawtoothData(1, M, L, F, DataType.FLOAT16);
Map<String, INDArray> feed = new HashMap<>();
feed.put(PSFormer.INPUT, testData[0]);
feed.put(PSFormer.TARGET, testData[1]);
INDArray prediction = sd.outputSingle(feed, PSFormer.PRED);

// Verify the prediction
assertNotNull(prediction);
assertEquals(DataType.FLOAT16, prediction.dataType());
// for (int i = 0; i < prediction.length(); i++) {
//     double val = prediction.getDouble(i);
//     assertTrue(Double.isFinite(val), "Prediction should be finite at index " + i);
// }
My final thought is that PSFormers share parameter blocks. Could that be the cause of these race conditions?