Trying to use Q-learning in a custom MDP environment. Chooses action 0 every time, despite the heavy negative reward

Oh it worked! Thank you very much for all your help, I really appreciate it :smile:

BTW, what parameters/hyperparameters would you suggest for a “balanced” setup? Are the ones mentioned above (batch size, no-op warm-up and learning rate) the right ones to focus on when running training?

Just start with some sane defaults. I’d suggest a batch size of at least 64 and a learning rate of at most 10^-3, with an L2 (weight-decay) coefficient of at most one tenth of the learning rate.
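
For concreteness, here’s a minimal sketch of those defaults, assuming a PyTorch Q-network (the thread doesn’t say which framework you use, and names like `q_net` are purely illustrative):

```python
# Hedged sketch: sane default hyperparameters for a DQN-style setup.
# `q_net` is a hypothetical placeholder network, not the poster's model.
import torch

q_net = torch.nn.Sequential(
    torch.nn.Linear(4, 64),   # input size 4 is an arbitrary example
    torch.nn.ReLU(),
    torch.nn.Linear(64, 2),   # 2 actions, again just for illustration
)

batch_size = 64        # "at least 64"
learning_rate = 1e-3   # "at most 10^-3"
weight_decay = 1e-4    # L2 penalty, one order of magnitude below the LR

optimizer = torch.optim.Adam(
    q_net.parameters(),
    lr=learning_rate,
    weight_decay=weight_decay,  # Adam's weight_decay acts as an L2 penalty
)
```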

In your case, I don’t think it makes sense to have a warm-up phase at all, and you probably want to keep epsilon (the rate of random exploration) fixed at 0.1.
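
A fixed epsilon with no warm-up could look something like this sketch (plain Python, with `q_values` standing in for whatever your agent produces for the current state):

```python
# Hedged sketch: epsilon-greedy action selection with a constant epsilon
# of 0.1 and no warm-up or annealing schedule.
import random

EPSILON = 0.1  # constant exploration rate, as suggested above

def select_action(q_values):
    """Pick a random action with probability EPSILON, else the greedy one."""
    if random.random() < EPSILON:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```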


I understand, thank you again for your support!
All the best!