Need a neural network configuration for pong / Atari games

I tried a lot of different WeightInits and Activation functions, but I don't get the network to learn. The gradients are, in my opinion, very small right from initialization (1e-3 to 1e-5). The starting Q-values are okay, between +1 and -1, but during training they get closer and closer to ±0 (1e-3 to 1e-5). The input to the network consists of grayscale 1-byte images, which are scaled from 0-255 down to 0-1 values.

            MultiLayerNetwork model = new MultiLayerNetwork(new NeuralNetConfiguration.Builder()
                    .seed(seed)
                    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
                    .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
                    .gradientNormalizationThreshold(4.0)
                    .updater(new Adam(0.00025))
                    .list()
                    .layer(new ConvolutionLayer.Builder(8, 8)
                            .stride(4, 4)
                            .nIn(CHANNELS)
                            .nOut(32)
                            .weightInit(WeightInit.RELU_UNIFORM)
                            .activation(Activation.LEAKYRELU)
                            .build())
//                    .layer(new BatchNormalization())
                    .layer(new ConvolutionLayer.Builder(4, 4)
                            .stride(2, 2)
                            .nOut(64)
                            .weightInit(WeightInit.RELU_UNIFORM)
                            .activation(Activation.LEAKYRELU)
                            .build())
//                    .layer(new BatchNormalization())
                    .layer(new ConvolutionLayer.Builder(3, 3)
                            .stride(1, 1)
                            .nOut(64)
                            .weightInit(WeightInit.RELU_UNIFORM)
                            .activation(Activation.LEAKYRELU)
                            .build())
//                    .layer(new BatchNormalization())
                    .layer(new DenseLayer.Builder()
                            .nOut(512)
                            .weightInit(WeightInit.XAVIER)
                            .activation(Activation.LEAKYRELU)
                            .build())
                    .layer(new OutputLayer.Builder(LossFunctions.LossFunction.MSE)
                            .nOut(GameAction.values().length) // output size (e.g. 10 classes)
                            .activation(Activation.IDENTITY) // No transformation
                            .weightInit(WeightInit.NORMAL)
                            .build())
                    .setInputType(InputType.convolutional(HEIGHT, WIDTH, CHANNELS)) // define the input shape
                    .build());
            model.init();

If I comment the BatchNormalization layers back in, the starting Q-values are much higher and escalate after a couple of thousand iterations.
I discussed the network configuration a lot with ChatGPT. The discussion went back and forth, and in the end I tried all of its suggestions, but nothing made it better.
I already verified the Q-value calculation with other small games like ConnectFour and TicTacToe.
Can you give me advice on how to optimize the network configuration?

And I have another question in that context.
For the Q-value update I first call output() with the current state and then output() with the next state, taking the max Q-value of the next state. Then I calculate the target value like this:

double target = transition.getReward() + q * getGamma();

Then I replace the Q-value of the taken action in the current state's output vector and train this vector back, just as it is described in the literature.
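In DL4J terms the whole update looks roughly like this (just a sketch of my code; toFeatures() and the transition accessors other than getReward() are stand-ins for my own helpers):

    // Q-values for the current state and the next state
    INDArray qCurrent = model.output(toFeatures(transition.getState()));   // shape [1, nActions]
    INDArray qNext    = model.output(toFeatures(transition.getNextState()));

    // Bootstrap target: reward + gamma * max_a Q(nextState, a)
    double maxQNext = qNext.maxNumber().doubleValue();
    double target = transition.getReward() + maxQNext * getGamma();
    // (a standard DQN update would drop the gamma term for terminal transitions)

    // Overwrite only the entry of the taken action, keep the rest as predicted
    INDArray labels = qCurrent.dup();
    labels.putScalar(0, transition.getAction(), target);

    model.fit(toFeatures(transition.getState()), labels);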
But I wonder if I can use the "labelMask" and only train back the changed Q-value. That way I might not need the output() call for the current state at all, because the entry I train back gets overridden anyway, and with the labelMask only that entry would be trained back. ChatGPT said no, but the explanation was not really convincing :frowning: . I haven't tried it yet, because even without that optimization my network is not learning :frowning:
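What I have in mind would be roughly this (untested sketch; it assumes DL4J's per-output label masking plays together with the MSE loss, and that features is the current-state input):

    // Labels: only the taken action carries the new target, the rest stays zero
    INDArray labels = Nd4j.zeros(1, GameAction.values().length);
    labels.putScalar(0, transition.getAction(), target);

    // Label mask: 1 for the taken action, 0 for all other outputs
    INDArray labelMask = Nd4j.zeros(1, GameAction.values().length);
    labelMask.putScalar(0, transition.getAction(), 1.0);

    // No features mask, only the label mask
    model.fit(new DataSet(features, labels, null, labelMask));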

Your gradients matter. Make sure the update steps aren't too big, or the network will just swing back and forth. If you need a breakdown, ask it about regularization, and make sure your input data is scaled in some way: either 0 to 1 or zero mean / unit variance. Numbers that are too big can also cause inconsistent learning.
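For example, in ND4J terms either of these would do (just a sketch; frame is assumed to be your grayscale frame as a float INDArray):

    // Option 1: scale byte values into [0, 1]
    INDArray scaled = frame.div(255.0);

    // Option 2: zero mean / unit variance per frame
    double mean = frame.meanNumber().doubleValue();
    double std  = frame.stdNumber().doubleValue();
    INDArray normalized = frame.sub(mean).div(std + 1e-8);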

I already noticed that the input data is relevant. I scaled it between 0 and 1 and the scoring output got better. Do you have a clue why my gradients are so small? As you wrote, gradients do matter. I already tried different weightInits (RELU, RELU_UNIFORM, XAVIER and so on), but it didn't really help.

I continued analyzing my code and the results. I pretty often get negative Q-values back. If the max Q-value is negative, then "gamma * max(q-value)" makes the value greater instead of smaller (so the future reward is not really discounted), which leads to wrong learning. How can I prevent getting negative values from output()? I tried Transforms.max(output, 0), but then the output contains only zeros and the maxIndex is always the first one. Is there a possibility to get better weights so that I don't receive negative values as Q-values?

You can use an activation function that only returns positive values, e.g. ReLU does that.

As you can see in my model above, I am using LEAKYRELU on all layers except for the last one, because I had the problem of dead neurons. LeakyReLU should also return just positive values (really small ones instead of zero; this can be adjusted via the alpha parameter, DEFAULT_ALPHA = 0.01). On the output layer I use IDENTITY because I want to feed the Q-values of the actions I did not take back unchanged. I tried LeakyReLU on this layer as well - I still get negative values.
I also think that this should not be the case. Can anybody tell me why that happens?

You were asking what you can do about making the output positive only: ReLU, being max(0, x), does exactly that. So if you want to enforce that on the output, you'll want to use ReLU as the activation function of your output layer.

The leaky variant will give you both positive and negative numbers: it is the identity for positive numbers (just like ReLU) and multiplies negative numbers by a small factor, but a negative value times a positive factor is still a negative number.
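In your configuration above that would just be a change to the output layer, roughly like this:

    .layer(new OutputLayer.Builder(LossFunctions.LossFunction.MSE)
            .nOut(GameAction.values().length)
            .activation(Activation.RELU)   // max(0, x): outputs can no longer go negative
            .weightInit(WeightInit.NORMAL)
            .build())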
