Why only 1 CPU on multi-socket systems

Why do we prefer to use only 1 CPU on multi-socket systems?

This is due to an architectural property of multi-socket systems called NUMA (non-uniform memory access).
On a NUMA system (for example, one with multiple physical CPUs/sockets), the time it takes to access memory depends on which CPU is accessing it: memory attached to a CPU's own socket is faster to reach than memory attached to another socket.

ND4J (and hence DL4J/SameDiff) is not currently NUMA-aware (it’s on our roadmap), so performance may not scale well on multi-socket systems because of the many potentially costly data transfers between CPUs.
No such issues are present on standard multi-core systems: almost all consumer/workstation setups and a sizable fraction of servers are not NUMA, so they are unaffected.
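
To check whether a given machine is affected, the following is a minimal sketch, not part of ND4J, that counts the NUMA nodes Linux exposes under /sys/devices/system/node. The class name NumaCheck and the method countNumaNodes are hypothetical names chosen for illustration, and the check assumes a Linux host; on other operating systems it simply reports that no topology was found.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not an ND4J/DL4J API.
final class NumaCheck {

    /**
     * Counts entries named node0, node1, ... in the given directory.
     * Returns 0 if the directory does not exist (e.g. not Linux).
     */
    static int countNumaNodes(Path nodeDir) throws IOException {
        if (!Files.isDirectory(nodeDir)) {
            return 0;
        }
        int count = 0;
        // The glob is matched against each entry's file name.
        try (DirectoryStream<Path> entries =
                 Files.newDirectoryStream(nodeDir, "node[0-9]*")) {
            for (Path ignored : entries) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: Linux sysfs layout for NUMA topology.
        int nodes = countNumaNodes(Paths.get("/sys/devices/system/node"));
        if (nodes > 1) {
            System.out.println(nodes
                + " NUMA nodes detected; consider restricting the JVM to one node.");
        } else {
            System.out.println("No multi-node NUMA topology detected.");
        }
    }
}
```

If more than one node is reported, one common way to restrict a process to a single socket on Linux is to launch it through numactl, for example `numactl --cpunodebind=0 --membind=0 java -jar app.jar`, where app.jar is a placeholder for your own application.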