Performance analysis of DL4J-examples

I am a beginner of DL4J project. Recently, I profile the dl4j by dl4j-examples project. I want to figure out the following two problems:

First, I train the network in org.deeplearning4j.examples.quickstart.modeling.convolution.LeNetMNIST in both Pytorch & DL4J (keep the same network layers), and I profile the two processes with /usr/bin/time toolchain. I discover that the time in DL4J is pretty longer than Pytorch. Meantime, the time in systen kernel (systime) is even longer than user time (usrtime). What reasons cause this? Maybe there are many times syscall?


Secondly, I profile the by JFR (java flight recorder), the top of the flame graph shows that the CPU is mainly occupied by execCustomOp2. I want to figure out that the overhead for invoking the corresponding native methods. How can I do this?

My email is, if any one can explain the above problems, i will be very appreciate.

(reference to issue 9953 in github of dl4j)

The pytorch code is as follows:

import numpy as np
import os
import torch
from torchvision.datasets import mnist
from torch.nn import CrossEntropyLoss
from torch.optim import SGD
from import DataLoader
from torchvision.transforms import ToTensor

from torch.nn import Module
from torch import nn
import time

class Model(Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(256, 120)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(120, 84)
        self.relu4 = nn.ReLU()
        self.fc3 = nn.Linear(84, 10)
        self.relu5 = nn.ReLU()

    def forward(self, x):
        y = self.conv1(x)
        y = self.relu1(y)
        y = self.pool1(y)
        y = self.conv2(y)
        y = self.relu2(y)
        y = self.pool2(y)
        y = y.view(y.shape[0], -1)
        y = self.fc1(y)
        y = self.relu3(y)
        y = self.fc2(y)
        y = self.relu4(y)
        y = self.fc3(y)
        y = self.relu5(y)
        return y

if __name__ == '__main__':
    start = time.time()
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    batch_size = 64
    s = time.time()
    train_dataset = mnist.MNIST(root='./MNIST', train=True, transform=ToTensor())
    test_dataset = mnist.MNIST(root='./MNIST', train=False, transform=ToTensor())
    train_loader = DataLoader(train_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)
    print("load data: {:.3f} s".format(time.time() - s))
    model = Model().to(device)
    sgd = SGD(model.parameters(), lr=1e-1)
    loss_fn = CrossEntropyLoss()
    all_epoch = 1
    prev_acc = 0
    s = time.time()
    for current_epoch in range(all_epoch):
        for idx, (train_x, train_label) in enumerate(train_loader):
            si = time.time()
            train_x =
            train_label =
            predict_y = model(train_x.float())
            loss = loss_fn(predict_y, train_label.long())
            # print("iterate {}, time {:.3f}".format(idx, time.time() - si))

        all_correct_num = 0
        all_sample_num = 0
        for idx, (test_x, test_label) in enumerate(test_loader):
            test_x =
            test_label =
            predict_y = model(test_x.float()).detach()
            predict_y =torch.argmax(predict_y, dim=-1)
            current_correct_num = predict_y == test_label
            all_correct_num += np.sum('cpu').numpy(), axis=-1)
            all_sample_num += current_correct_num.shape[0]
        acc = all_correct_num / all_sample_num

        print('epoch %d, train acc %.3f'
              % (current_epoch + 1, acc))
        if not os.path.isdir("models"):
            os.mkdir("models"), 'models/mnist_{:.3f}.pkl'.format(acc))
        if np.abs(acc - prev_acc) < 1e-4:
        prev_acc = acc
    print("fit: {:.3f} s".format(time.time() - s))
    print("total time {:.3f}".format(time.time() - start))
    print("Model finished training")

The java code is in dl4j-examples.

@yaoyao did you run the example multiple times? On startup we have to download the dataset and then we cache it locally.

Also, just for reproducibility what version are you using of pytorch and dl4j? Was it the master branch like your github issue? I will try to figure out the difference there. Could you repost your original flame graph?

There might be some quick wins here.

If I had to guess it might be due to allocation/GC. There have been quite a few performance improvements in the current master branch I’d love to see if we can test here since you’re already compiling from source.

Do you mind helping with that? I’d really appreciate it. Thanks!