My Network isn't Learning anything and A3C keeps crashing

I am trying to teach a network how to play snake. I am using version 1.0.0-beta6.
I have tried:

  • QLearning

  • Async QLearning

  • Actor-Critic

I have had no luck with any of them. After millions of steps of Async QLearning, my snake has not learned a single thing and still runs straight into walls.

Here is how my network is set up. I pass in an array of 256 doubles that represents my snake board:

  • 1 = apple

  • 0 = free space

  • 0.25 = body

  • 0.3 + direction = head (0.1 <= direction <= 0.4)

I then have my neural network choose one of three actions: 1, 2, 3 for forward, left, right.
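In code, the encoding looks roughly like this (simplified sketch; the method and parameter names are not exactly what I use):

    // Rough sketch of the board encoding described above; names are illustrative.
    double[] encodeBoard(int[] bodyCells, int headCell, int appleCell, double direction) {
        double[] obs = new double[256];    // 16x16 grid, 0.0 = free space
        for (int cell : bodyCells) {
            obs[cell] = 0.25;              // body segment
        }
        obs[appleCell] = 1.0;              // apple
        obs[headCell] = 0.3 + direction;   // head; direction is 0.1, 0.2, 0.3 or 0.4
        return obs;
    }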

My reward function is:

    if (!snake.inGame()) {
        return -1.0; // died
    }
    if (snake.gotApple()) {
        return 1.0; // got an apple
        //return 5.0 + .37 * (snake.getLength());
    }
    return 0.0; // neither died nor got an apple this move

My async network is set up like so:

    public static AsyncNStepQLearningDiscrete.AsyncNStepQLConfiguration TOY_ASYNC_QL =
            new AsyncNStepQLearningDiscrete.AsyncNStepQLConfiguration(
                    123,      // random seed
                    400,      // max steps per epoch
                    5000000,  // max steps
                    16,       // number of threads
                    25,       // t_max
                    10,       // target update (hard)
                    10,       // num steps noop warmup
                    0.01,     // reward scaling
                    0.98,     // gamma
                    10.0,     // td-error clipping
                    0.15f,    // min epsilon
                    30000     // num steps for epsilon-greedy anneal
            );

    public static DQNFactoryStdDense.Configuration MALMO_NET = DQNFactoryStdDense.Configuration.builder()
            .l2(0.001)
            .updater(new Adam(0.0025))
            .numHiddenNodes(64)
            .numLayer(3)
            .build();

I can’t figure out how to get this thing to work. Is one of my settings wrong, or am I missing something?

Additionally, and probably more frustratingly, whenever I try to use A3C it crashes after maybe 10,000 steps with:

Exception in thread "Thread-8" java.lang.RuntimeException: Output from network is not a probability distribution: [[         ?,         ?,         ?]]
at org.deeplearning4j.rl4j.policy.ACPolicy.nextAction(ACPolicy.java:82)
at org.deeplearning4j.rl4j.policy.ACPolicy.nextAction(ACPolicy.java:37)
at org.deeplearning4j.rl4j.learning.async.AsyncThreadDiscrete.trainSubEpoch(AsyncThreadDiscrete.java:96)
at org.deeplearning4j.rl4j.learning.async.AsyncThread.handleTraining(AsyncThread.java:144)
at org.deeplearning4j.rl4j.learning.async.AsyncThread.run(AsyncThread.java:121)

Is there any reason why this is happening? I have tried tweaking the network and my reward function to stop this error, to no avail.

Deep learning seems like such a fun thing to explore, but when I can’t even get the library to work it becomes very frustrating.

I would appreciate any help that y’all can offer.

Seems like a NaN problem.
I think the question marks (?) mean your output gets too small at some point during training.
You need to make sure your net doesn’t run into arithmetic underflow (https://deeplearning4j.konduit.ai/tuning-and-training/troubleshooting-training).

Have you tried adjusting the regularization?
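For example, as a starting point you could try stronger L2 and a smaller learning rate, using only the builder options you already have in your config (the exact values here are just guesses to experiment with):

    // Example only: stronger L2 regularization and a smaller Adam learning rate,
    // reusing the same builder options from the original post.
    public static DQNFactoryStdDense.Configuration MALMO_NET = DQNFactoryStdDense.Configuration.builder()
            .l2(0.01)                   // was 0.001
            .updater(new Adam(0.0005))  // was 0.0025
            .numHiddenNodes(64)
            .numLayer(3)
            .build();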


There are a few things that look pretty fishy to me:

Your 30000 steps for the epsilon-greedy anneal mean that over those 30000 steps it will reduce the chance of acting randomly from 100% down to 15%. Given your 400 steps per epoch, it will take at least 75 rounds to get down to those 15%, and when your snake is close to the edge or close to itself, 15% random actions are still quite a lot and might make it crash into the nearest wall, so it will likely take even more than those 75 rounds before epsilon even gets that low.

As @StoicProgrammer has pointed out, and as I’ve already told you on Stack Overflow, the question marks are NaNs. Take a look at the training scores: if they get very close to 0 or very high, it is probably just a numeric under/overflow.

I used rl4j for snake too and didn’t get crashes. I gave a small negative reward for each step. You can try something like a reward of -0.00001 for each step that did not result in death or an apple, instead of a zero reward, and see how you go.
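Applied to your reward function, that would look roughly like this:

    // Same reward logic as in the question, but with a small step penalty
    // instead of 0.0 when the snake neither dies nor eats an apple.
    if (!snake.inGame()) {
        return -1.0;       // died
    }
    if (snake.gotApple()) {
        return 1.0;        // got an apple
    }
    return -0.00001;       // small negative reward per step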

Oh, another thing: the reason your snake has problems learning is that you’re feeding it the whole grid, with the direction encoded as an incremental number. It’s going to have a hard time learning that without a convolutional network. There are so many possible body positions for a snake on a 16x16 grid that it could take ages to learn.

If you can’t use a convolutional network (I couldn’t, due to bugs in rl4j), you’ll have to one-hot encode the current direction and use other features rather than feeding the entire grid to the network: for example, one-hot encoding whether there is an obstacle in front, left, and right of the snake, the x and y distance to the food, and the eight-directional distance to its own body. It’s not going to solve snake perfectly, but it’s a start.
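Something along these lines (all the helper methods here are made up, adapt them to your own game state):

    // Illustrative sketch of a compact, hand-crafted feature vector.
    // All helpers (isObstacleAhead, getDirectionIndex, getAppleDx, ...) are hypothetical.
    double[] encodeFeatures(Snake snake) {
        double[] f = new double[9];
        // obstacle directly in front / left / right of the head
        f[0] = snake.isObstacleAhead() ? 1.0 : 0.0;
        f[1] = snake.isObstacleLeft()  ? 1.0 : 0.0;
        f[2] = snake.isObstacleRight() ? 1.0 : 0.0;
        // one-hot encoded current direction (index 0..3)
        f[3 + snake.getDirectionIndex()] = 1.0;
        // normalized x/y distance from head to apple on a 16x16 grid
        f[7] = snake.getAppleDx() / 16.0;
        f[8] = snake.getAppleDy() / 16.0;
        return f;
    }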

And lower the grid size until you see that it actually learns before trying larger sizes.

Also, consider rewarding the snake as it moves closer to the apple, and penalise it when it moves further away. Shape the reward so it won’t go in circles.

At the start of the game, it’s unlikely to even taste an apple, so it won’t know apples are sweet.
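Something like this would do (just a sketch; the distance helper is a placeholder, e.g. Manhattan distance from head to apple):

    // Sketch: small shaping bonus for moving toward the apple, penalty for moving away.
    // distanceToApple(...) is a hypothetical helper.
    double shapingReward(Snake snake, int prevDistance) {
        int newDistance = distanceToApple(snake);
        if (newDistance < prevDistance) {
            return 0.01;    // moved closer to the apple
        } else if (newDistance > prevDistance) {
            return -0.01;   // moved away from the apple
        }
        return 0.0;
    }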

Your current implementation is just like a massive q table. There are way too many possible states for the network to learn anything in a short amount of time.

I had a similar problem, and I realized that the error was because I was doing a custom normalization of my observation without taking NaNs into account.
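Something along these lines (just a sketch; your normalization will differ):

    // Sketch: guard a custom observation normalization against NaN and division by zero.
    double[] normalize(double[] obs) {
        double max = 0.0;
        for (double v : obs) {
            max = Math.max(max, Math.abs(v));
        }
        double[] out = new double[obs.length];
        for (int i = 0; i < obs.length; i++) {
            // avoid dividing by zero, which produces NaN/Infinity downstream
            out[i] = (max > 0.0) ? obs[i] / max : 0.0;
            if (Double.isNaN(out[i])) {
                out[i] = 0.0; // never feed NaN to the network
            }
        }
        return out;
    }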