Trying to use QLearning in a custom MDP environment. Chooses action 0 every time, despite the heavy negative reward

It’s playing a dots and boxes game. I have been messing around with the hyperparameters constantly, but nothing changes. I’m completely stuck and need help; I’d appreciate it if anyone could look through the code and see if anything is set up wrong. One area where I have no idea whether it’s set up correctly is the observation space in the MDP.

Here are the classes involved:

public class testNeural {
    public static void main(String args[]) throws IOException, InterruptedException {
        GameBoard r = new GameBoard(3,3);
        DQNPolicy<testState> t = dots();
    }

    private static DQNPolicy<testState> dots() throws IOException {

        QLearningConfiguration DOTS_QL = QLearningConfiguration.builder()
                .seed(Long.valueOf(132))  // Random seed (for reproducibility)
                .maxEpochStep(500)        // Max steps per epoch
                .maxStep(1000)            // Max steps total
                .expRepMaxSize(15000)    // Max size of experience replay
                .batchSize(Graph.getEdgeList().size())            // size of batches
                .targetDqnUpdateFreq(100) // target update (hard)
                .updateStart(10)          // num step noop warmup
                .rewardFactor(0.1)       // reward scaling
                .gamma(0.95)              // gamma
                .errorClamp(1.0)          // TD-error clipping
                .minEpsilon(0.3f)         // min epsilon
                .epsilonNbStep(10)      // num step for eps greedy anneal
                .doubleDQN(false)          // double DQN
                .build();
        DQNDenseNetworkConfiguration DOTS_NET =
                DQNDenseNetworkConfiguration.builder()
                        .l2(0)
                        .updater(new RmsProp(0.000025))
                        .numHiddenNodes(50)
                        .numLayers(10)
                        .build();


        // The neural network used by the agent. Note that there is no need to specify
        // the number of inputs/outputs; these will be read from the environment at the start of training.

        testEnv env = new testEnv();
        QLearningDiscreteDense<testState> dql = new QLearningDiscreteDense<testState>(env, DOTS_NET, DOTS_QL);
        System.out.println(dql.toString());
        dql.train();
        return dql.getPolicy();
    }
}

The MDP environment:

public class testEnv implements MDP<testState, Integer, DiscreteSpace> {
    // Action space size = number of possible edges
    DiscreteSpace actionSpace = new DiscreteSpace(Graph.getEdgeList().size());
    ObservationSpace<testState> observationSpace = new ArrayObservationSpace(new int[] {Graph.getEdgeList().size()});
    private testState state = new testState(Graph.getMatrix(),0);
    private NeuralNetFetchable<IDQN> fetchable;
    boolean illegal=false;
    public testEnv(){}
    @Override
    public ObservationSpace<testState> getObservationSpace() {
        return observationSpace;
    }
    @Override
    public DiscreteSpace getActionSpace() {
        return actionSpace;
    }
    @Override
    public testState reset() {
       // System.out.println("RESET");
        try {
            GameBoard r = new GameBoard(3,3);
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
        return new testState(Graph.getMatrix(),0);
    }
    @Override
    public void close() { }
    @Override
    public StepReply<testState> step(Integer action) {
      //  System.out.println("Action: "+action);
     //   System.out.println(Arrays.deepToString(Graph.getMatrix()));
        int reward=0;
        try {
            placeEdge(action);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // change the getPlayer1 to whichever player the neural is
     //   System.out.println("step: "+state.step);
        if(!illegal) {
            System.out.println("Not Illegal");
            if (isDone()) {
                if (Graph.getPlayer1Score() > Graph.getPlayer2Score()) {
                    reward = 5;
                } else {
                    reward = -5;
                }
            }else {
                if (Graph.numOfMoves < 1) {
                    Graph.player1Turn = !Graph.player1Turn;
                    Graph.setNumOfMoves(1);
                    while (Graph.numOfMoves > 0) {
                            //       System.out.println(Arrays.deepToString(Graph.getMatrix()));
                        if (!isDone()) {
                            Graph.getRandomBot().placeRandomEdge();
                        } else {
                            Graph.numOfMoves = 0;
                            if (Graph.getPlayer1Score() > Graph.getPlayer2Score()) {
                                reward = 5;
                            } else {
                                reward = -5;
                            }
                        }
                    }
                    if (!isDone()) {
                        Graph.player1Turn = !Graph.player1Turn;
                        Graph.setNumOfMoves(1);
                    }
                }
            }
        }else{
            reward=-100000;
            illegal=false;
        }
        testState t = new testState(Graph.getMatrix(), state.step + 1);
        state=t;
        return new StepReply<>(t, reward, isDone(), null);
    }

    @Override
    public boolean isDone() {
        return gameThread.checkFinished();
    }

    @Override
    public MDP<testState, Integer, DiscreteSpace> newInstance() {
        testEnv test = new testEnv();
        test.setFetchable(fetchable);
        return test;
    }
    public void setFetchable(NeuralNetFetchable<IDQN> fetchable) {
        this.fetchable = fetchable;
    }
}

The state class:

public class testState implements Encodable {
    int[][] matrix;
    int step;
    public testState(int[][] m,int step){
        matrix=m;
        this.step=step;
    }
    @Override
    public double[] toArray() {
        double[] array = new double[matrix.length*matrix[0].length];
        int i=0;
        for(int a=0;a< matrix.length;a++){
            for(int b=0;b<matrix[0].length;b++){
                array[i]= matrix[a][b];
                i++;
            }
        }
        return array;
    }

    @Override
    public boolean isSkipped() {
        return false;
    }

    @Override
    public INDArray getData() {
        return null;
    }

    @Override
    public Encodable dup() {
        return null;
    }
}

Chooses action 0 every time, despite the heavy negative reward

I guess you came to the conclusion that it chooses 0 every time after watching fewer than 10 steps?

With updateStart(10) you are telling it in the configuration to do nothing (action 0 by convention) for the first 10 steps, in a game where doing nothing is an invalid move.

minEpsilon(0.3f) is probably not something you want either, as you are configuring it to do something random in 30% of the cases even after it is done with its initial random exploration.

And tying batchSize to Graph.getEdgeList().size() also doesn’t quite make sense, as you are likely to produce huge batches that way, giving your model less opportunity to learn: it can only do a learning pass once it has collected enough moves and outcomes to fill a batch.
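For reference, a sketch of what a more conventional starting configuration could look like; these numbers are illustrative guesses, not tuned values:

```java
QLearningConfiguration DOTS_QL = QLearningConfiguration.builder()
        .seed(132L)
        .maxEpochStep(500)
        .maxStep(100000)          // enough total steps to actually learn
        .expRepMaxSize(15000)
        .batchSize(32)            // small fixed batch, frequent learning passes
        .targetDqnUpdateFreq(100)
        .updateStart(0)           // no no-op warmup; action 0 is a real move here
        .rewardFactor(0.1)
        .gamma(0.95)
        .errorClamp(1.0)
        .minEpsilon(0.01f)        // mostly greedy once annealing finishes
        .epsilonNbStep(10000)     // anneal epsilon over many steps, not 10
        .doubleDQN(false)
        .build();
```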


Nope, I’ve let it run for an hour or so and it chooses action 0 every time. I changed updateStart to 0 now; it hasn’t changed anything.

I was messing around with all of the hyperparameters just in case I could luck my way into changing it. The thing is, there’s no “random exploration” at all: it never chooses any action other than 0, despite the fact that if I call the DiscreteSpace’s random action directly it gives different values from 0-12…

Thanks for the help, I didn’t know what batchSize or updateStart were.

In that case it might actually learn that not playing is the best way of going about it :)

What I mean with that is that it is hard to come up with a good reward function, and doubly so with a simple algorithm like QLearning.

I know that it shouldn’t always take 0 outside of the no-op warmup phase. So maybe something in your implementation isn’t quite working.

If you want to take a look at an example of a working custom MDP you can take a look at https://github.com/treo/rl4j-snake.

That was a work-in-progress example I never had the time to finish properly, and somewhere along the way I lost the one good QLearning setup I had found. But it usually does learn to not die, even if it doesn’t learn to collect points.


I really don’t think that’s it, unless it’s somehow not registering the rewards that I put in. I know that it’s correctly recognizing when something is an illegal move; if it is, it sets reward = -100000, then:
testState t = new testState(Graph.getMatrix(), state.step + 1);
state=t;
return new StepReply<>(t, reward, isDone(), null);

I don’t know where that can go wrong. Thanks for the example though.

I should also add that 0 is a valid play on the first turn (though it wouldn’t get any positive reward for choosing it), but then that edge is placed, so it’s no longer valid afterwards.

What I also don’t understand is how come it never chooses a random action.

Let me highlight the most important part here.

As far as I can tell it gets 5 reward points for winning a game, -5 for losing a game and -100000 points for making an invalid move.

So it basically kills the entire game score with one invalid move. Then, because you are using Q-learning, this score killer is propagated and essentially makes other moves unlikely to happen too, because they lead up to the invalid move.

Try making the invalid move less punishing, and see if it makes a difference. Also try to understand what is actually happening in the system as it trains, and then reconsider both how you have defined your reward function and how you are representing your game state as the input for your neural network.
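To make the scale problem concrete, here is a small self-contained plain-Java sketch of a single tabular Q-update. The learning rate and values are made up for illustration, and a DQN with error clamping behaves differently in detail, but the magnitude argument is the same:

```java
public class QPenaltyDemo {
    // One tabular Q-learning update:
    // Q(s,a) <- Q(s,a) + alpha * (reward + gamma * maxNextQ - Q(s,a))
    static double update(double q, double reward, double maxNextQ,
                         double alpha, double gamma) {
        return q + alpha * (reward + gamma * maxNextQ - q);
    }

    public static void main(String[] args) {
        double alpha = 0.1, gamma = 0.95;

        // Win: +5, loss: -5, illegal move: -100000 (as in the posted env).
        System.out.println(update(0.0, 5.0, 0.0, alpha, gamma));       // 0.5
        System.out.println(update(0.0, -5.0, 0.0, alpha, gamma));      // -0.5
        System.out.println(update(0.0, -100000.0, 0.0, alpha, gamma)); // -10000.0

        // If the illegal move's estimate ends up dominating the next-state
        // value (with a neural net it also drags nearby predictions down),
        // the penalty bleeds backwards into the predecessor state:
        System.out.println(update(0.0, 0.0, -10000.0, alpha, gamma));  // -950.0
    }
}
```

The win/loss signal of ±5 is four orders of magnitude smaller than the illegal-move penalty, so the network spends all of its capacity on avoiding the penalty; something on the order of -1 to -5 for an illegal move keeps the two signals comparable.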


Ah, this was the other thing I was wondering about. The size of the DiscreteSpace is 12, one for each edge; does that mean there are 12 inputs into the network?

Also, shouldn’t it be choosing random actions irrespective of the reward? Why isn’t that happening?

Yes.

No idea; you’ll have to figure that out on your own. In principle, you can force it to stay random for a long time by setting something like this:

.minEpsilon(1.0f)
.epsilonNbStep(10000)

If it isn’t random even then, I guess something in your setup must be wrong. You’ll have to share a project that we can run and inspect, but it can take a while until someone has enough time to actually take a look at it.


Yeah, unfortunately it’s still not random.
Here’s a link to a GitHub repository with the code:


Thanks for all the help :)

Just to make sure: remove that l2 setting. It pulls the weights to zero when it is set high like that. While you are trying to make this work, keep l2 = 0.


I got it working!!
It was an issue with the observation space setup.
Thanks so much for the help, I wouldn’t have been able to do it otherwise!
:)
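In case it helps anyone landing here later with the same symptom: the shape declared in the observation space has to match what the state actually encodes, and the Encodable callbacks should not return null. A rough sketch of what a consistent setup could look like; this is a hypothetical reconstruction (assuming ND4J’s Nd4j.create), not the exact diff:

```java
// In testEnv: the state encodes the flattened matrix, so size the
// observation space from the matrix, not from the edge count.
ObservationSpace<testState> observationSpace =
        new ArrayObservationSpace<>(new int[] {
                Graph.getMatrix().length * Graph.getMatrix()[0].length
        });

// In testState: return real data instead of null.
@Override
public INDArray getData() {
    return Nd4j.create(toArray());   // wrap the encoded state in an INDArray
}

@Override
public Encodable dup() {
    return new testState(matrix, step);
}
```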
