Is there any example for doing policy gradient calculation with dl4j/rl4j?

I cannot find any. Since dl4j embeds the activation function in the layer, it seems difficult to calculate the gradient externally and feed it back into the dl4j network.

There is an example for external errors:

If you don’t want the activation function to modify the output of a layer, you can always use an identity activation.

Yes, but I need an activation function such as softmax to get the probability of each action as the output of the policy network.

You can use external errors with softmax too, there should be nothing to prevent you from using a Softmax activation on your last layer.

How should the external error be calculated when softmax is the last layer?
In a policy gradient algorithm, the gradient is calculated from the reward, e.g. reward * log(actionProb). I get the reward from the environment and the action probabilities from the dl4j network output. But to backprop through a network with softmax as the final output, I need the error with respect to the probabilities, right?
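To make the question concrete, here is a plain-Java sketch (no dl4j, all names are mine) of what the algorithm side gives me. For the loss -reward * log(p[action]), the gradient with respect to the softmax inputs (the logits) works out to reward * (p - onehot(action)):

```java
public class PolicyGradientSketch {
    // Numerically stable softmax over raw network outputs (logits).
    static double[] softmax(double[] logits) {
        double max = Double.NEGATIVE_INFINITY;
        for (double l : logits) max = Math.max(max, l);
        double sum = 0.0;
        double[] p = new double[logits.length];
        for (int i = 0; i < logits.length; i++) {
            p[i] = Math.exp(logits[i] - max);
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) p[i] /= sum;
        return p;
    }

    // Gradient of loss = -reward * log(p[action]) w.r.t. the logits:
    // dL/dz_i = reward * (p_i - 1{i == action})
    static double[] logitGradient(double[] probs, int action, double reward) {
        double[] g = new double[probs.length];
        for (int i = 0; i < probs.length; i++) {
            g[i] = reward * (probs[i] - (i == action ? 1.0 : 0.0));
        }
        return g;
    }

    public static void main(String[] args) {
        double[] p = softmax(new double[]{1.0, 2.0, 0.5});
        double[] g = logitGradient(p, 1, 3.0); // took action 1, got reward 3
        System.out.println(java.util.Arrays.toString(g));
    }
}
```

So the algorithm hands me a gradient per logit (or per parameter); my question is how to push that through, or into, the dl4j layers.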

There are lots of different algorithms for that, so the exact details are going to be different depending on what exactly you want to do.

Thanks for the link. That’s true. My question is how to integrate any of the algorithm into dl4j network. To be more specific, I think those algorithms are all calculating the gradient of the parameters while a dl4j network(in case of using softmax as the output) requires the probility error to do the backprop.

If you take a closer look at the external errors example, you will find that it also shows you how to use external gradients. So you don’t even have to calculate an error: if you already have the gradients, you can pass them in directly.
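For completeness, a sketch of what that error could look like with softmax as the last layer. My understanding (worth double-checking against the external errors example) is that the epsilon you pass in is the derivative of your loss with respect to the network output, so for the loss L = -reward * log(p[a]) it is -reward / p[a] at the taken action and 0 elsewhere; the softmax layer then backprops through its own activation. Plain Java, no dl4j, names are mine:

```java
public class ExternalErrorSketch {
    // Error w.r.t. the softmax *output* (the probabilities), one row per
    // sample, matching the network output shape [batchSize, numActions]:
    //   dL/dp_i = -reward / p_a  if i == a (the action taken), else 0,
    // for the loss L = -reward * log(p_a).
    static double[][] buildEpsilon(double[][] probs, int[] actions, double[] rewards) {
        int batch = probs.length;
        int numActions = probs[0].length;
        double[][] eps = new double[batch][numActions];
        for (int i = 0; i < batch; i++) {
            eps[i][actions[i]] = -rewards[i] / probs[i][actions[i]];
        }
        return eps;
    }
}
```

In dl4j you would then wrap such a matrix in an INDArray (e.g. via Nd4j.create) so its shape matches the network output, and feed it in the same way the external errors example does.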

That would be cool, but I cannot figure out how to transform the gradient from the algorithm into the gradient format dl4j accepts. It is difficult for me to get the required shape of the dl4j gradient. Is there any example?
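For reference, my current (unverified) mental model of the shape: dl4j keeps the parameters as one flat row vector, with each layer’s weight gradient of shape [nIn, nOut] followed by its bias gradient of shape [1, nOut], so an external gradient would have to be flattened the same way. A plain-Java sketch of that layout; the within-matrix ordering (‘c’ vs ‘f’) is exactly the detail I would need to verify against net.params():

```java
public class GradientShapeSketch {
    // Flatten per-layer gradients into one vector, layer by layer:
    // weight gradient first, then bias gradient, matching (I assume)
    // the ordering of the network's flat parameter vector.
    static double[] flatten(double[][][] weights, double[][] biases) {
        int total = 0;
        for (int l = 0; l < weights.length; l++) {
            total += weights[l].length * weights[l][0].length + biases[l].length;
        }
        double[] flat = new double[total];
        int k = 0;
        for (int l = 0; l < weights.length; l++) {
            for (double[] row : weights[l])           // weight gradient, [nIn, nOut]
                for (double v : row) flat[k++] = v;
            for (double v : biases[l]) flat[k++] = v; // bias gradient, [1, nOut]
        }
        return flat;
    }
}
```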