Forward pass has very high execution times

Dear all,
I implemented a fully connected MultiLayerNetwork with 5 hidden layers of 30 nodes each, 6 inputs and 4 outputs. When I perform a single forward pass by passing an input INDArray to the output method:

INDArray output = myNetwork.output(input);

it takes 0.15 seconds.
The same network implemented in Python using TensorFlow results in a much lower execution time, approximately 0.05 seconds.

In both cases I used the CPU to execute the computation (Intel Core i7-7700, 3.6 GHz, 8 GB RAM).
Is this possible? Shouldn’t DL4J be faster than Python? Where did I go wrong?
Thank you,
Lorenzo

Could you please post code? Unfortunately I can’t read your screen and know next to nothing about your scenario. Execution times for operations have many contributing parts, from the data transformation to the way the network is configured. You haven’t given me anything to work with. For all I know you could be using a very old version, or you could be doing the data transforms in a way that looks fast but is actually very slow.

Putting it another way: unless I can run your code (e.g. don’t make me guess literally anything), I can’t really help you much. Guessing just makes you spend your time describing a bunch of things and means I have to spend a bunch of time trying to guess what the problem could be.

As you’d guess, if we’re slower somewhere we’d like to improve it. None of this would be on purpose.

@ltiacci some idea of how you set up your benchmark would also be nice. If you could, a GitHub repo for both the TensorFlow code and the DL4J code would be appreciated. Asking “why is this slow” without giving some sort of way of seeing that, and potentially letting me see some profiler output, isn’t really helpful.

If I can see some profiler output I can probably give you an idea of what’s going on, or even potentially fix it if there really is a problem. If you do that we both win.
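Something along these lines would help: a minimal benchmark sketch (the class name and iteration counts here are just placeholders I picked) that warms up the JVM first and then averages over many forward passes, instead of timing a single call:

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class ForwardPassBenchmark {
    // Warm up first (to let JIT compilation and one-time initialization finish), then average over many calls.
    public static void benchmark(MultiLayerNetwork net) {
        int warmupIterations = 100;   // placeholder values
        int benchIterations = 1000;

        INDArray input = Nd4j.rand(1, 6);   // one example with 6 features, same shape as the real input

        for (int i = 0; i < warmupIterations; i++) {
            net.output(input);              // results discarded; only here to trigger JIT and lazy initialization
        }

        long start = System.nanoTime();
        for (int i = 0; i < benchIterations; i++) {
            net.output(input);
        }
        long end = System.nanoTime();

        System.out.println("Average forward pass (ms): " + (end - start) / 1e6 / benchIterations);
    }
}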

Thank you very much for your attention!
Sorry for giving you so little information; I am not very experienced and I thought I was doing something macroscopically wrong!
We are currently using dl4j-core version 1.0.0-M2.1.
The network is built as follows:

int numFeatures = 6;
int numOutput = 4;
int numHiddenNodes = 30;

MultiLayerConfiguration config = new NeuralNetConfiguration.Builder()
.seed(this.randSeed)
.weightInit(WeightInit.XAVIER)
.updater(new Sgd(learningRate))
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
.list()
.layer(new DenseLayer.Builder().nIn(numFeatures).nOut(numHiddenNodes).activation(Activation.RELU).build())
.layer(new DenseLayer.Builder().nIn(numHiddenNodes).nOut(numHiddenNodes).activation(Activation.RELU).build())
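// (hidden layers 3–5, with the same configuration as the layer above, presumably follow here)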
.layer(new OutputLayer.Builder(LossFunctions.LossFunction.SQUARED_LOSS).nIn(numHiddenNodes).nOut(numOutput).activation(Activation.IDENTITY).build())
.build();

myNetwork = new MultiLayerNetwork(config);
myNetwork.init();
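For completeness, the layer structure can be verified by printing the model summary (this is just a sanity check and is not included in the timing):

System.out.println(myNetwork.summary());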

The execution time reported above refers only to the forward pass; the data transformations are not included in the measured time.
For better comprehension, here is the relevant part of the code.

Double[] inputArray = (Double[]) inputsList.getLast();
INDArray input = Nd4j.expandDims(Nd4j.createFromArray(inputArray), 0);

long timeStart = System.currentTimeMillis();
INDArray output = myNetwork.output(input);
long timeEnd = System.currentTimeMillis();
System.out.println("Time elapsed\t" + (timeEnd - timeStart));

In the Python script the network configuration is the same, built as a tf.keras.Sequential() model, and the elapsed-time calculation is computed only on the prediction call.

input = np.expand_dims(np.array(), axis=0)
time_start = time.time()
output = myNetworkPy.predict(input)
time_end = time.time()
print("Time elapsed\t", time_end - time_start)

So the question was whether this difference in times is normal, or whether we could do something better to improve our code.
Now I’m trying to generate a profiler output.
Thank you!

@ltiacci I haven’t looked at this thoroughly yet, but I do believe that because these are smaller arrays we have a bit of fixed overhead that you don’t see on bigger problems. I made a fairly big dent in that in the current release. I will try this after all my pull requests for that are merged and let you know.
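In the meantime, one thing that usually helps with that kind of fixed per-call overhead is batching: stack several inputs into a single INDArray and call output() once for the whole batch. A rough sketch (the batch size is a placeholder, and I’m assuming the same myNetwork as in your snippet):

int batchSize = 32;                              // placeholder value
INDArray batch = Nd4j.rand(batchSize, 6);        // in practice, fill each row with one real 6-feature example

INDArray batchOutput = myNetwork.output(batch);  // shape [batchSize, 4]: one row of predictions per input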