Trying to use QLearning in a custom MDP environment. Chooses action 0 every time, despite the heavy negative reward

It’s playing a dots and boxes game. I have been messing around with the hyperparameters constantly, but nothing changes. I’m completely stuck and need help; if anyone could look through the code and see if anything is set up wrong, I’d really appreciate it. One area where I have no idea if it’s set up correctly is the observation space in the MDP.

Here are the classes involved:

public class testNeural {
    public static void main(String[] args) throws IOException, InterruptedException {
        GameBoard r = new GameBoard(3,3);
        DQNPolicy<testState> t = dots();
    }

    private static DQNPolicy<testState> dots() throws IOException {

        QLearningConfiguration DOTS_QL = QLearningConfiguration.builder()
                .seed(Long.valueOf(132))  // Random seed (for reproducibility)
                .maxEpochStep(500)        // Max steps per epoch
                .maxStep(1000)            // Max total steps
                .expRepMaxSize(15000)    // Max size of experience replay
                .batchSize(Graph.getEdgeList().size())            // size of batches
                .targetDqnUpdateFreq(100) // target update (hard)
                .updateStart(10)          // num step noop warmup
                .rewardFactor(0.1)       // reward scaling
                .gamma(0.95)              // gamma
                .errorClamp(1.0)          // TD-error clipping
                .minEpsilon(0.3f)         // min epsilon
                .epsilonNbStep(10)      // num step for eps greedy anneal
                .doubleDQN(false)          // double DQN
                .build();
        DQNDenseNetworkConfiguration DOTS_NET =
                DQNDenseNetworkConfiguration.builder()
                        .l2(0)
                        .updater(new RmsProp(0.000025))
                        .numHiddenNodes(50)
                        .numLayers(10)
                        .build();


        // The neural network used by the agent. Note that there is no need to specify the number of inputs/outputs.
        // These will be read from the gym environment at the start of training.

        testEnv env = new testEnv();
        QLearningDiscreteDense<testState> dql = new QLearningDiscreteDense<testState>(env, DOTS_NET, DOTS_QL);
        System.out.println(dql.toString());
        dql.train();
        return dql.getPolicy();
    }
}

The MDP environment:

public class testEnv implements MDP<testState, Integer, DiscreteSpace> {
    DiscreteSpace actionSpace = new DiscreteSpace(Graph.getEdgeList().size());

    // takes amount of possible edges      ^
    ObservationSpace<testState> observationSpace = new ArrayObservationSpace(new int[] {Graph.getEdgeList().size()});
    private testState state = new testState(Graph.getMatrix(),0);
    private NeuralNetFetchable<IDQN> fetchable;
    boolean illegal=false;
    public testEnv(){}
    @Override
    public ObservationSpace<testState> getObservationSpace() {
        return observationSpace;
    }
    @Override
    public DiscreteSpace getActionSpace() {
        return actionSpace;
    }
    @Override
    public testState reset() {
       // System.out.println("RESET");
        try {
            GameBoard r = new GameBoard(3,3);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return new testState(Graph.getMatrix(),0);
    }
    @Override
    public void close() { }
    @Override
    public StepReply<testState> step(Integer action) {
      //  System.out.println("Action: "+action);
     //   System.out.println(Arrays.deepToString(Graph.getMatrix()));
        int reward=0;
        try {
            placeEdge(action);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // change the getPlayer1 to whichever player the neural is
     //   System.out.println("step: "+state.step);
        if(!illegal) {
            System.out.println("Not Illegal");
            if (isDone()) {
                if (Graph.getPlayer1Score() > Graph.getPlayer2Score()) {
                    reward = 5;
                } else {
                    reward = -5;
                }
            }else {
                if (Graph.numOfMoves < 1) {
                    if (Graph.player1Turn) {
                        Graph.player1Turn = false;
                    } else {
                        Graph.player1Turn = true;
                    }
                    Graph.setNumOfMoves(1);
                    while (Graph.numOfMoves > 0) {
                            //       System.out.println(Arrays.deepToString(Graph.getMatrix()));
                        if (!isDone()) {
                            Graph.getRandomBot().placeRandomEdge();
                        } else {
                            Graph.numOfMoves = 0;
                            if (Graph.getPlayer1Score() > Graph.getPlayer2Score()) {
                                reward = 5;
                            } else {
                                reward = -5;
                            }
                        }
                    }
                    if (!isDone()) {
                        if (Graph.player1Turn) {
                            Graph.player1Turn = false;
                        } else {
                            Graph.player1Turn = true;
                        }
                        Graph.setNumOfMoves(1);
                    }
                }
            }
        }else{
            reward=-100000;
            illegal=false;
        }
        testState t = new testState(Graph.getMatrix(), state.step + 1);
        state=t;
        return new StepReply<>(t, reward, isDone(), null);
    }

    @Override
    public boolean isDone() {
        return gameThread.checkFinished();
    }

    @Override
    public MDP<testState, Integer, DiscreteSpace> newInstance() {
        testEnv test = new testEnv();
        test.setFetchable(fetchable);
        return test;
    }
    public void setFetchable(NeuralNetFetchable<IDQN> fetchable) {
        this.fetchable = fetchable;
    }
}

The state class:

public class testState implements Encodable {
    int[][] matrix;
    int step;
    public testState(int[][] m,int step){
        matrix=m;
        this.step=step;
    }
    @Override
    public double[] toArray() {
        double[] array = new double[matrix.length*matrix[0].length];
        int i=0;
        for(int a=0;a< matrix.length;a++){
            for(int b=0;b<matrix[0].length;b++){
                array[i]= matrix[a][b];
                i++;
            }
        }
        return array;
    }

    @Override
    public boolean isSkipped() {
        return false;
    }

    @Override
    public INDArray getData() {
        return null;
    }

    @Override
    public Encodable dup() {
        return null;
    }
}

Chooses action 0 every time, despite the heavy negative reward

I guess you came to the conclusion that it picks action 0 “every time” after fewer than 10 steps?

With .updateStart(10) you are telling it in the configuration to do nothing (action 0 by convention) for 10 steps, in a game where doing nothing is an invalid move.

The .minEpsilon(0.3f) is probably not something you want either, as you are configuring it to do something random in 30% of the cases even after it is done with its initial random exploration.

And .batchSize(Graph.getEdgeList().size()) also doesn’t quite make sense, as you are likely to produce huge batches that way, giving your model less opportunity to learn, because it can only do a learning pass when it has collected enough moves and outcomes to fill a batch.
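For reference, a more conventional starting point for that configuration (illustrative values only, not a tuned recommendation) could look something like this:

QLearningConfiguration DOTS_QL = QLearningConfiguration.builder()
        .seed(132L)
        .maxEpochStep(500)
        .maxStep(100000)          // enough total steps to actually see learning
        .expRepMaxSize(15000)
        .batchSize(32)            // a fixed, modest batch size instead of the edge count
        .targetDqnUpdateFreq(100)
        .updateStart(0)           // no no-op warmup, since "do nothing" is not a legal move here
        .rewardFactor(0.1)
        .gamma(0.95)
        .errorClamp(1.0)
        .minEpsilon(0.05f)        // mostly greedy once exploration has annealed
        .epsilonNbStep(3000)      // anneal epsilon over a meaningful number of steps
        .doubleDQN(false)
        .build();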


Nope, I’ve let it run for an hour or so with it choosing action 0 every time. I’ve changed updateStart to 0 now; it hasn’t changed anything.

I was messing around with all of the hyperparameters just in case I could luck my way into changing it. The thing is, there’s no “random exploration” at all: it never chooses any action other than 0, despite the fact that if I ask the DiscreteSpace for a random action directly it gives me different values from 0–12…

Thanks for the help, I didn’t know what batchSize or updateStart were.

In that case it might actually learn that not playing is the best way of going about it 🙂

What I mean by that is that it is hard to come up with a good reward function, and doubly so with a simple algorithm like Q-learning.

I know that it shouldn’t always take action 0 outside of the no-op warmup phase, so maybe something in your implementation isn’t quite working.

If you want to see an example of a working custom MDP, you can take a look at https://github.com/treo/rl4j-snake.

That was a work-in-progress example I never had the time to finish properly, and somewhere along the way I lost the good QLearning setup I had found. But it usually does learn not to die, even if it doesn’t learn to collect points.


I really don’t think that’s it, unless it’s somehow not registering the rewards that I put in. I know that it’s correctly recognizing when something is an illegal move; when it is, it sets reward = -100000, then:
testState t = new testState(Graph.getMatrix(), state.step + 1);
state=t;
return new StepReply<>(t, reward, isDone(), null);

I don’t know where that can go wrong. Thanks for the example though.

I should also add that 0 is a valid play on the first turn (though it wouldn’t get any positive reward for choosing it), but then that edge is placed, so it’s no longer valid after that.

What I also don’t understand is how come it never chooses a random action.

Let me highlight the most important part here: your reward values.

As far as I can tell it gets 5 reward points for winning a game, -5 for losing a game and -100000 points for making an invalid move.

So it basically kills the entire game score with one invalid move. Then, because you are using Q-learning, this score killer is propagated and essentially makes the moves that lead up to the invalid move unlikely to be chosen too.

Try making the invalid move less punishing, and see if it makes a difference. Also try to understand what is actually happening in the system as it trains, and then reconsider both how you have defined your reward function and how you are representing your game state as the input for your neural network.
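To make the scale problem concrete, here is a rough arithmetic sketch (just the reward values and the rewardFactor from the configuration above, ignoring the error clamp for the moment):

// Rewards as the learner sees them after .rewardFactor(0.1) is applied
double rewardFactor = 0.1;
double win     =      5 * rewardFactor;   //      0.5
double loss    =     -5 * rewardFactor;   //     -0.5
double illegal = -100000 * rewardFactor;  // -10000.0

// With Q-values initialised near zero, the (unclamped) TD error for a single
// illegal move is about 20000 times larger than for winning an entire game,
// so one such transition dominates any batch it appears in and drags the
// Q-values of the moves leading up to it down as well.
System.out.println("win=" + win + ", loss=" + loss + ", illegal=" + illegal);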


Ah, this was the other thing I was wondering: the size of the DiscreteSpace is 12, one for each edge; does that mean there are 12 inputs into the network?

Also, shouldn’t it be choosing random actions irrespective of the reward? Why isn’t that happening?

Yes.

No idea, you have to figure that out on your own. In principle you can force it to stay random for a long time by setting something like this:

.minEpsilon(1.0f)
.epsilonNbStep(10000)

If it isn’t random even then, I guess something in your setup must be wrong. You’ll have to share a project that we can run and inspect your setup. But it can take a while until someone has enough time to actually take a look at it.


Yeah unfortunately it’s still not random.
Here’s a link to a github repository with the code:

Thanks for all the help 🙂

Just to make sure: remove that l2 setting. It pulls the weights toward zero when it is set high like that. While you are trying to make this work, keep l2 = 0.


I got it working!!
It was an issue with the observation space setup.
Thanks so much for the help, I wouldn’t have been able to do it otherwise!
🙂
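For anyone who runs into the same thing: the shape passed to ArrayObservationSpace has to match the length of the array the state actually encodes. A sketch of that kind of fix (illustrative rather than the exact change, assuming the flattened matrix from testState.toArray() is the observation):

// The observation is the flattened matrix, so the observation space length
// must equal matrix.length * matrix[0].length, not the number of edges.
int obsLength = Graph.getMatrix().length * Graph.getMatrix()[0].length;
ObservationSpace<testState> observationSpace =
        new ArrayObservationSpace<>(new int[] { obsLength });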


@jonathonbird810 @treo I am facing the same problem, my agent always chooses action 0… Could you please share some details on how you solved this kind of problem? I still don’t understand how you managed it.

You’ll have to share some more information with us if you want help. Otherwise I can only tell you to check my original reply in this thread: If you are checking during the warmup, you are just seeing it doing “no op” (which action 0 should be).

The idea is that I’m performing load testing (applying a workload, i.e. a number of virtual users) on different transactions of a SUT, e.g. Homepage, AboutMe, Logout, etc., and each of these transactions gets a given number of virtual users, with the purpose of detecting performance breaking points (error rate and response time). The problem is that when I run the code, the agent will always select the first transaction (in this case, Homepage), and the workload will be increased only on that transaction.

I can share the GitHub repository with the code as well!


I can see an error in the library source of the DiscreteSpace class saying that it does not match the bytecode for its class. In other words, I think there is a difference between DiscreteSpace.java and DiscreteSpace.class… Could this be the problem?

No, that is just because we are using Project Lombok to get around some of Java’s more annoying parts, and because it is a compile-time code generator, the .java file and the .class file are a bit different.

Your problem is from something else.

There are plenty of things you can try: set the batch size to something like 16, 32, 64, or 128; set the no-op warmup to 0; set l2 to 0; and set your learning rate (currently 1e-2) to something smaller, e.g. 1e-3, 1e-4, or 1e-5.

This would be mostly to try to see if you can get your model to behave randomly.
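For example, on the network side (illustrative values only, using the same DQNDenseNetworkConfiguration builder that appears earlier in the thread):

DQNDenseNetworkConfiguration net = DQNDenseNetworkConfiguration.builder()
        .l2(0)                        // no weight decay while debugging
        .updater(new RmsProp(1e-4))   // much smaller learning rate than the current 1e-2
        .numHiddenNodes(50)
        .numLayers(3)                 // a shallower net is easier to get training at all
        .build();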

I don’t really get what your model tries to achieve, though. You observe the response time and error rate, and your reward is a score that grows with those two values.

Going just by that, I can see how any agent would just learn to hammer hard on the page that takes the longest to load (which, given your example, would most probably be the homepage).

I have tried different “combinations” of the parameters and hyperparameters, but it’s still the same. We have done the same project with Q-learning and it works properly (there is a balance between exploration and exploitation), while with the DQN (in this setup) I’m facing this issue all the time.

The idea is that the agent tries different transactions/workloads and investigates the effect of the number of virtual users (with the help of JMeter) on that transaction, in terms of error rate and response time. At the end, a CSV file is created where all the metrics are saved, so we can analyze which of the transactions is most vulnerable and has exceeded our expectations (maximum response time or error rate).

But as I said, the agent is stuck on only the first action/transaction (and that is not the page with the longest response time or error rate!).

Any other idea what could be wrong?
Thanks for the help

Have you tried forcing it to be random yet? If it isn’t random when set up like that, then your setup is somehow broken.

Yeah, I just tried but it’s still the same.

Then it means that your setup is broken.

I’ve dived into your code, and found the reason:

That getData() method is not an optional thing. It should probably have a default implementation of Nd4j.createFromArray(toArray()), but it doesn’t, so you have to implement it yourself.

If you return null here, it will be interpreted as “skip this observation”. And if it can’t get an initial observation, it will try to do a no-op action (which by convention is action 0) until it can.
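In code, the fix would look roughly like this inside testState (a sketch; the thread only spells out getData(), and the dup() implementation here is an assumption):

@Override
public INDArray getData() {
    // Return a real observation instead of null, otherwise it is treated as "skip this observation".
    return Nd4j.createFromArray(toArray());
}

@Override
public Encodable dup() {
    // Copy the matrix so later moves on the shared game board don't mutate this state.
    int[][] copy = new int[matrix.length][];
    for (int i = 0; i < matrix.length; i++) {
        copy[i] = matrix[i].clone();
    }
    return new testState(copy, step);
}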
