Memory leak showing up excessively on Linux backend

Hi. I’m using the same SameDiff model on both Linux (Debian 10 running on GCP) and Windows 10 (personal PC). Both machines have 64GB of RAM. I run it with the same parameters (including a 3GB max heap and JDK 16); the only difference is that on Debian it runs in a Docker container. For Windows I built the JAR (1.0.0-beta7) using the Windows AVX2 backend, and for Linux the corresponding Linux AVX2 backend. I didn’t use the platform backend because AVX2 wasn’t activated in that case. I decided to post this topic in the ND4J category because I think the issue is not in SameDiff itself, but rather in INDArray memory deallocation. To retrieve memory consumption figures I used Pointer.physicalBytes().
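
For anyone who wants to reproduce the measurements, here is a minimal sketch of how the per-interval growth can be tracked. The class name `MemoryGrowthTracker` and its methods are hypothetical helpers made up for illustration, not part of ND4J or JavaCPP. The memory reading is passed in as a `LongSupplier`, so the sketch compiles without any library on the classpath.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper: reports how a memory reading changes between samples. */
final class MemoryGrowthTracker {
    private final LongSupplier probe; // source of the current memory reading, in bytes
    private long lastBytes;

    MemoryGrowthTracker(LongSupplier probe) {
        this.probe = probe;
        this.lastBytes = probe.getAsLong(); // baseline taken at construction time
    }

    /** Returns bytes gained (positive) or released (negative) since the previous sample. */
    long sampleDelta() {
        long now = probe.getAsLong();
        long delta = now - lastBytes;
        lastBytes = now;
        return delta;
    }

    /** Converts a byte count to mebibytes for logging. */
    static double toMegabytes(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }
}
```

With JavaCPP on the classpath, the probe would be `Pointer::physicalBytes`, i.e. `new MemoryGrowthTracker(Pointer::physicalBytes)`, with `sampleDelta()` called once per training iteration or on a timer.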

My observations:

  1. If org.bytedeco.javacpp.maxbytes=0, the model on Debian keeps consuming an additional 250 to 400 MB of RAM every 20 minutes (approximately 10 training iterations, with the dataset shuffled after each iteration); the same model on Windows consumes an additional 90-110 MB. After explicitly calling System.gc(), the behavior on both OSs was similar: approximately 700 MB were released after each call.
  2. If org.bytedeco.javacpp.maxbytes=57GB, the model on Debian fails to start at all because there’s not enough RAM to allocate data buffer memory (for a detached INDArray). I had to decrease my batch size from 650 to 500 to get it working. It then started taking a whopping 400 to 500 MB of additional RAM every 20 minutes. The same model on Windows started without problems with the initial batch size of 650. It went from 56.4 to 57.8GB in 1 hour, then slowed down, and after reaching 58.4GB in 6 hours it stopped consuming additional RAM entirely (at least for the next 12 hours). After explicitly calling System.gc(), between 34 and 36GB were released every time on Debian, and 4 to 5GB on Windows.
  3. I tried using workspaces explicitly, closing them after each iteration. First I had issues with the AdamUpdater (it initializes the mean and squared gradients at startup and uses them throughout training, so they must not be closed), and after that I discovered by chance that the ArrayCacheMemoryMgr (which SameDiff uses by default) initializes all arrays outside the cache as detached. Workspaces might help on Windows (because caching did), but as mentioned above, the memory leak remains on Debian even with caching enabled.
  4. Native Memory Tracking, Eclipse Memory Analyzer and VisualVM showed no issues with JVM-related memory. That basically leaves only the buffer allocations for INDArrays.
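
The explicit System.gc() checks from observations 1 and 2 can be sketched roughly as follows. `PeriodicGcProbe` is a hypothetical helper written for this post, not an ND4J or JavaCPP class. It runs a GC action every N iterations and reports how much the memory reading dropped. Both the GC action and the memory reading are injected, so the sketch has no library dependencies.

```java
import java.util.OptionalLong;
import java.util.function.LongSupplier;

/** Hypothetical helper: triggers a GC action every N iterations and measures what it released. */
final class PeriodicGcProbe {
    private final int interval;       // trigger the GC action every `interval` iterations
    private final Runnable gcAction;  // e.g. System::gc (only a hint to the JVM)
    private final LongSupplier probe; // current memory reading, in bytes
    private int iterations;

    PeriodicGcProbe(int interval, Runnable gcAction, LongSupplier probe) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.interval = interval;
        this.gcAction = gcAction;
        this.probe = probe;
    }

    /**
     * Call once per training iteration. Returns the bytes released by the GC action
     * (negative if the reading grew), or empty if the action was not triggered this time.
     */
    OptionalLong onIterationDone() {
        iterations++;
        if (iterations % interval != 0) {
            return OptionalLong.empty();
        }
        long before = probe.getAsLong();
        gcAction.run();
        long after = probe.getAsLong();
        return OptionalLong.of(before - after);
    }
}
```

In a setup like mine this would be created as `new PeriodicGcProbe(10, System::gc, Pointer::physicalBytes)`. Note that System.gc() is only a request, and off-heap buffers may be freed some time after it returns, so a reading taken immediately afterwards can underestimate what was released.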

I already searched through existing topics related to memory leaks, but couldn’t find anything related to my case.

Could anyone please help identify the root cause? As far as I know, Linux-based VMs are what is normally used in production (they’re cost-efficient), so it would be great to know whether this is a platform-specific issue (the Windows backend seems to be less affected) or something else (a config issue, a Docker-related issue, or maybe a bug in my own code).

@partarstu could you benchmark with snapshots maybe?

@partarstu Beyond that, could you DM me with more details? I’d want to know some more details that might not be good to display in public. I will reach out. We can post the resolution publicly afterwards. Thanks!