Logits from last layer

How can I retrieve and save the logits from the last layer of a trained model (implemented as a ComputationGraph in DL4J)?

There are a few different signatures for doing what you want, depending on the exact use case:
https://javadoc.io/static/org.deeplearning4j/deeplearning4j-nn/1.0.0-beta7/org/deeplearning4j/nn/graph/ComputationGraph.html#feedForward-org.nd4j.linalg.api.ndarray.INDArray-int-boolean-

Whether they are logits or something else depends on your actual network configuration.

You can then go on and save that however you want, e.g. there are a few Nd4j.write* methods that can write an INDArray somewhere.
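For example, something like this (untested sketch; the layer name "out", the model variable and the features are placeholders for whatever your graph actually uses):

```java
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

import java.io.File;
import java.io.IOException;
import java.util.Map;

public class SaveActivations {
    public static void saveLastLayer(ComputationGraph model, INDArray features) throws IOException {
        // feedForward returns the activations of every layer, keyed by layer name
        Map<String, INDArray> activations = model.feedForward(features, false);

        // pick the activations of your last layer -- "out" is whatever you named it
        INDArray lastLayer = activations.get("out");

        // persist it; Nd4j.readBinary(...) reads it back later
        Nd4j.writeBinary(lastLayer, new File("activations.bin"));
    }
}
```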

Thanks for the response.

I am also looking to implement a specialized version of cross-entropy loss. How can I do that in DL4J?
I am looking to implement knowledge distillation.
Here is the link: https://nervanasystems.github.io/distiller/knowledge_distillation.html#hinton-et-al-2015

In that case, it is probably easiest to just go ahead and implement a custom loss function.

Since beta7 it is pretty easy to do so, see the announcement blog post for an example: https://blog.konduit.ai/2020/05/14/deeplearning4j-1-0-0-beta7-released/#easier-custom-loss-functions
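To give you a rough idea of the shape of it (untested sketch, I'm going from memory here -- check the blog post and javadoc for the exact base class, package and method signature): you extend the SameDiff loss base class and define the loss computation in a single method.

```java
import org.nd4j.autodiff.samediff.SDVariable;
import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.linalg.lossfunctions.SameDiffLoss;

// Untested sketch: a simple per-example mean squared error as a custom loss
public class MyCustomLoss extends SameDiffLoss {
    @Override
    public SDVariable defineLoss(SameDiff sd, SDVariable layerInput, SDVariable labels) {
        SDVariable diff = labels.sub(layerInput);
        return diff.mul(diff).mean(1); // one loss value per example
    }
}
```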

Thanks, will take a look.

Hi Treo,

I am trying to implement the loss function provided in this link:
https://nervanasystems.github.io/distiller/knowledge_distillation.html#hinton-et-al-2015

A custom loss function could be something that I can leverage here.
Can you please give me some direction on how I can approach this using a custom loss?

Appreciate your help.

Thanks
Ravi

Adding more context: it is one model, with a loss function that calculates the cross-entropy loss between two variables.

I’ve taken a closer look at the linked page and there are a few ways of implementing it. The following two are what I’d recommend trying first:

Precomputing the soft labels

As the teacher network is frozen, you will always get the same soft labels as its output, so unless you have more compute than storage space, it is very well worth it to precompute all of those labels and then just load them as yet another set of labels for your student network.

But first you will need to modify your teacher network to output the soft labels in the first place.

The most straightforward way would be to implement a Softmax that applies the temperature to its inputs. But, as there isn’t a SameDiff base class for activation functions yet, you would have to implement the backprop part of it – or create a SameDiff base class for activations :slight_smile:
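For reference, the math itself is trivial; a plain-Java sketch of a temperature-softened softmax (independent of DL4J, just to show what the activation computes) looks like this:

```java
public class TemperatureSoftmax {
    // softmax(z_i / T): T = 1 is the normal softmax, larger T softens the distribution
    public static double[] softened(double[] logits, double temperature) {
        int n = logits.length;
        double[] scaled = new double[n];
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            scaled[i] = logits[i] / temperature;
            if (scaled[i] > max) max = scaled[i];
        }
        double sum = 0.0;
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = Math.exp(scaled[i] - max); // subtract max for numerical stability
            sum += out[i];
        }
        for (int i = 0; i < n; i++) {
            out[i] /= sum;
        }
        return out;
    }
}
```

The hard part in DL4J is not this forward pass but the backprop implementation, which is why taking the output layer apart (below) avoids the problem entirely.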

If you don’t want to do that, you can also take the OutputLayer of your teacher network apart. An output layer can be thought of as a combination of a DenseLayer, an Activation Layer and a LossLayer. And that is exactly what you’d be doing: You’ll replace the OutputLayer with a DenseLayer with an Identity activation, then you add a ScaleVertex with a scale factor of 1/T, then an Activation Layer with Softmax and then a LossLayer. If your teacher network is already pretrained, you’d have to transplant the weights of the OutputLayer into the DenseLayer.
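The wiring for that could look roughly like this (untested sketch; the layer names, nIn/nOut sizes and the input name "lastHidden" are placeholders, and the LossLayer is left out since for generating soft labels you only need the forward pass):

```java
import org.deeplearning4j.nn.conf.ComputationGraphConfiguration;
import org.deeplearning4j.nn.conf.graph.ScaleVertex;
import org.deeplearning4j.nn.conf.layers.ActivationLayer;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.nd4j.linalg.activations.Activation;

public class TeacherHead {
    public static ComputationGraphConfiguration.GraphBuilder addSoftHead(
            ComputationGraphConfiguration.GraphBuilder builder, double temperature) {
        return builder
                // the former OutputLayer's weights go here, with no nonlinearity
                .addLayer("logits", new DenseLayer.Builder()
                        .nIn(128).nOut(10)                      // placeholder sizes
                        .activation(Activation.IDENTITY).build(), "lastHidden")
                // divide the logits by T
                .addVertex("scaled", new ScaleVertex(1.0 / temperature), "logits")
                // softened probabilities come out here
                .addLayer("softLabels", new ActivationLayer.Builder()
                        .activation(Activation.SOFTMAX).build(), "scaled")
                .setOutputs("softLabels");
    }
}
```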

Once your teacher network outputs soft labels, you iterate through all of your training data, output those labels and save them to be loaded when training your student network.
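That loop might look like this (untested sketch; one file per batch is just one way of organizing the saved labels):

```java
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.factory.Nd4j;

import java.io.File;
import java.io.IOException;

public class PrecomputeSoftLabels {
    public static void run(ComputationGraph teacher, DataSetIterator data) throws IOException {
        int batch = 0;
        while (data.hasNext()) {
            INDArray features = data.next().getFeatures();
            // the modified teacher now outputs soft labels directly
            INDArray soft = teacher.output(features)[0];
            Nd4j.writeBinary(soft, new File("softLabels-" + batch++ + ".bin"));
        }
    }
}
```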

Computing the soft labels on the fly

Obviously, you can also compute the soft labels on the fly. In this case you have two options, one is to create a computation graph that contains both the teacher model and the student model – this is too complex to describe in this already long answer.

The second one is to wrap your DataSetIterator, and apply your teacher model on the features to create the second set of labels on the fly when next is called.
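A sketch of such a wrapper (untested; since the student network will need two sets of labels, I'd wrap it into a MultiDataSetIterator, which supports multiple label arrays):

```java
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.api.MultiDataSet;
import org.nd4j.linalg.dataset.api.MultiDataSetPreProcessor;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.dataset.api.iterator.MultiDataSetIterator;

public class DistillationIterator implements MultiDataSetIterator {
    private final DataSetIterator base;
    private final ComputationGraph teacher;
    private MultiDataSetPreProcessor preProcessor;

    public DistillationIterator(DataSetIterator base, ComputationGraph teacher) {
        this.base = base;
        this.teacher = teacher;
    }

    @Override
    public MultiDataSet next(int num) {
        org.nd4j.linalg.dataset.DataSet ds = base.next(num);
        INDArray features = ds.getFeatures();
        INDArray hardLabels = ds.getLabels();
        // run the teacher on the fly to get the second set of labels
        INDArray softLabels = teacher.output(features)[0];
        return new org.nd4j.linalg.dataset.MultiDataSet(
                new INDArray[]{features},
                new INDArray[]{hardLabels, softLabels});
    }

    @Override public MultiDataSet next() { return next(base.batch()); }
    @Override public boolean hasNext() { return base.hasNext(); }
    @Override public void reset() { base.reset(); }
    @Override public boolean resetSupported() { return base.resetSupported(); }
    @Override public boolean asyncSupported() { return false; }
    @Override public void setPreProcessor(MultiDataSetPreProcessor p) { this.preProcessor = p; }
    @Override public MultiDataSetPreProcessor getPreProcessor() { return preProcessor; }
}
```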

Computing the soft labels on the fly, however, has the drawback that over multiple epochs you will recalculate those labels over and over again. But obviously, in this case you don’t have to save those labels, which reduces the storage requirements.

Setting up your student network

For your student network you set up two LossLayers, which both take a custom loss function. Both of them will be using sd.loss.softmaxCrossEntropy to actually calculate the cross entropy.

The first one will pass in its inputs unchanged, use the hard labels and multiply its output by alpha.

The second one will divide its inputs by the temperature (tau) first, use the soft labels and multiply its output by beta.

The loss of multiple outputs is just added up automatically, so you don’t have to do the addition part yourself.
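Put together, the two loss functions might look roughly like this (untested sketch; verify the exact custom-loss base class and the sd.loss.softmaxCrossEntropy signature against the javadoc, and put each class in its own file):

```java
import org.nd4j.autodiff.samediff.SDVariable;
import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.linalg.lossfunctions.SameDiffLoss;

// First loss: inputs unchanged, used with the hard labels, weighted by alpha
class HardLabelLoss extends SameDiffLoss {
    private final double alpha;

    HardLabelLoss(double alpha) { this.alpha = alpha; }

    @Override
    public SDVariable defineLoss(SameDiff sd, SDVariable layerInput, SDVariable labels) {
        return sd.loss.softmaxCrossEntropy(labels, layerInput, null).mul(alpha);
    }
}

// Second loss: inputs divided by tau, used with the soft labels, weighted by beta
class SoftLabelLoss extends SameDiffLoss {
    private final double tau;
    private final double beta;

    SoftLabelLoss(double tau, double beta) { this.tau = tau; this.beta = beta; }

    @Override
    public SDVariable defineLoss(SameDiff sd, SDVariable layerInput, SDVariable labels) {
        SDVariable scaled = layerInput.div(tau);
        return sd.loss.softmaxCrossEntropy(labels, scaled, null).mul(beta);
    }
}
```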

Thanks Treo, really appreciate your thoughts on this.