Intel efficiency cores massively increase training time

I’ve been training networks on Intel 12700K (8 performance + 4 efficiency cores) and 13900K (8 performance + 16 efficiency cores) CPUs in Linux. I see massive increases in training time if the efficiency cores are turned on. For example, an 8-minute test with the efficiency cores turned off takes 13 minutes with them turned on for the 12700K. The 13900K performs even worse with the efficiency cores on, taking about 39 minutes.

I don’t know if this problem is limited to Linux and caused by poor thread management in the kernel (Linux kernel 6.1.9-060109-generic x86_64). My guess is that there is a misaligned distribution of performance/efficiency cores assigned to the linear algebra and BLAS work.

My work-around is to disable the efficiency cores when I’m training, which isn’t ideal because part of my CPU hardware sits unused. Does anyone have a better solution?
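A softer alternative to disabling the E-cores in firmware is to pin the training process to the P-cores only. A minimal sketch using Python’s `os.sched_setaffinity` (Linux-only); the assumption that P-core hyperthreads are logical CPUs 0–15 on a 12700K is illustrative — check `lscpu --extended` on your own machine:

```python
import os

# Assumption: on Alder Lake under Linux, P-core logical CPUs are typically
# enumerated first (0-15 on a 12700K), with E-cores after (16-19).
# Verify with `lscpu --extended` before relying on these IDs.
P_CORE_CPUS = set(range(16))

def pin_to_p_cores(pid: int = 0) -> set:
    """Restrict `pid` (0 = current process) to the presumed P-core set,
    intersected with the CPUs actually available to the process."""
    available = os.sched_getaffinity(pid)
    # Fall back to the current mask if there is no overlap (e.g. a
    # machine with a different topology), so we never set an empty mask.
    target = (P_CORE_CPUS & available) or available
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)
```

Running `pin_to_p_cores()` at the top of a training script (before the BLAS thread pool spins up) keeps worker threads off the E-cores; the same effect can be had externally with `taskset -c 0-15 python train.py`.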

The problem is that in those calculations BLAS expects every piece of work to take about the same amount of time.

As the efficiency cores are slower, giving them work essentially makes everything else wait for them.

For other tasks like Cinebench, every piece of work is independent of every other piece, which is why the extra cores do improve speed there.
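A toy model makes the straggler effect concrete. Assume a fork-join BLAS kernel splits work evenly across threads and the join waits for the slowest one; the relative core speeds below are illustrative assumptions, not measurements:

```python
# Toy model: work is split evenly across cores, and the fork-join
# barrier waits for the slowest core to finish its chunk.

def fork_join_time(total_work: float, core_speeds: list) -> float:
    """Completion time for an even split, gated by the slowest core."""
    chunk = total_work / len(core_speeds)
    return max(chunk / speed for speed in core_speeds)

# Assumption: E-cores at roughly half the per-core throughput of P-cores.
p_only = fork_join_time(800, [1.0] * 8)               # 8 P-cores
mixed = fork_join_time(800, [1.0] * 8 + [0.5] * 4)    # + 4 E-cores
```

With these numbers, adding the four slower cores *increases* the wall time (each chunk gets smaller, but the slowest core still gates the barrier), which matches the slowdown reported above. An independent-work load like Cinebench would instead sum throughputs and get faster.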

@daviddbal this actually might be allocation related, like I mentioned in the other thread. Could you show me profiling output from jvisualvm or a similar tool to clarify?