Problem removing LSTM, shape exception
I am running some tests using rl4j with A3C applied to an OpenAI Gym environment.
I am running on a Windows machine and using the latest versions (1.0.0-M1.1) of dl4j and rl4j.
The setup is quite simple and I was using LSTM (Long Short-Term Memory). Since I am still experimenting, I wanted to run some tests without LSTM, and because I use the configuration-based setup, this just meant switching the flag useLSTM from true to false. To my surprise, this switch alone produces an exception indicating that the neural network is no longer correctly configured. I agree that removing the LSTM has implications for the neural network, but I expected these to be handled implicitly by the configuration-based setup (see the Java code below). With a “manual” setup I could have tried operations like reshape, but with the configuration-based setup I did not think that was the way to go. But maybe I am missing something here…
The exception occurs during gradient calculation: a DL4JInvalidInputException reports that the DenseLayer layer0 expected a rank 2 matrix but got a rank 3 array with shape [1, 1, 3] (see the supplied file std-out-and-exception-no-lstm.txt).
Below I share the stdout and the exception, along with some of my code from the Java class A3CTest.java, in the hope that somebody can see what is going on and tell me whether something is wrong in my setup or whether this is a bug.
Note that if I switch the flag useLSTM from false to true, training completes without any problems, so the network is then correctly configured and all is fine. For comparison, I have attached the stdout from the point where the code passes the same phase in which the exception occurs (this file is called std-out-all-ok-with-lstm.txt).
In case you need further info, let me know.
std-out-and-exception-no-lstm.txt
"C:\Program Files\Java\jdk-11.0.11\bin\java.exe" "-javaagent:C:\Program Files\JetBrains\IntelliJ IDEA Community Edition 2021.1.2\lib\idea_rt.jar=52573:C:\Program Files\JetBrains\IntelliJ IDEA Community Edition 2021.1.2\bin" -Dfile.encoding=UTF-8 -classpath C:\Users\pette\AppData\Local\Temp\classpath186559065.jar rta.RTA
loadProps, loaded 21 properties from property file
loadProps, total app properties 21
[ac.policy.file.name]: [acPolicy]
[ac.value.file.name]: [acValue]
[base.path]: [C:/.../]
[do.training]: [true]
[episode.log.interval]: [25]
[eval.training.iterations]: [20]
[gym.envUD]: [...]
[learner.update.freq]: [5]
[learning.gamma]: [0.99]
[learning.max.epoch.step]: [42]
[learning.max.step]: [500]
[learning.num.step]: [42]
[learning.num.threads]: [1]
[learning.reward.factor]: [0.000004]
[learning.seed]: [123]
[nn.adam.learning.rate]: [0.0005]
[nn.l2]: [0.0005]
[nn.learning.rate]: [0.0005]
[nn.lstm]: [false]
[nn.num.hidden.nodes]: [50]
[nn.num.layers]: [2]
envUD: [...], doTraining: true, episodeLogInterval: 25, nbrOfStepsInEpisode: 42
using env: ...
15:11:49.559 [main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
15:11:50.731 [main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 4
15:11:50.731 [main] INFO org.nd4j.linalg.cpu.nativecpu.CpuNDArrayFactory - Binary level Generic x86 optimization level AVX/AVX2
15:11:50.763 [main] INFO org.nd4j.nativeblas.Nd4jBlas - Number of threads used for OpenMP BLAS: 4
15:11:50.778 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CPU]; OS: [Windows 10]
15:11:50.778 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Cores: [8]; Memory: [2,0GB];
15:11:50.778 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [OPENBLAS]
15:11:50.778 [main] INFO org.nd4j.linalg.cpu.nativecpu.CpuBackend - Backend build information:
GCC: "10.3.0"
STD version: 201103L
DEFAULT_ENGINE: samediff::ENGINE_CPU
HAVE_FLATBUFFERS
HAVE_OPENBLAS
found gymEnv: ...
gymEnv.envId: ...
gymEnv.actionSpace: org.deeplearning4j.rl4j.space.DiscreteSpace@72ba28ee
gymEnv.actionSpace.size: 3
gymEnv.actionSpace.noOp: 0
gymEnv.actionSpace.encode(0): 0
gymEnv.actionSpace.encode(1): 1
gymEnv.actionSpace.encode(2): 2
gymEnv.observationSpace: ArrayObservationSpace(name=Custom, shape=[1, 3], low=[0], high=[0])
gymEnv: rta.GymEnvRTA@1a5b8489
15:11:50.934 [main] INFO org.deeplearning4j.nn.multilayer.MultiLayerNetwork - Starting MultiLayerNetwork with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
a3c.neuralNetwork(actorCriticOri): org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate@6c298dc
a3c.neuralNetwork(actorCriticOri).NNs: [Lorg.deeplearning4j.nn.api.NeuralNetwork;@3e7dfd44
a3c.neuralNetwork(actorCriticOri).NNs.length: 2
NN[0]: org.deeplearning4j.nn.multilayer.MultiLayerNetwork@723ed581
Layer[0, name: layer0]: org.deeplearning4j.nn.layers.feedforward.dense.DenseLayer{conf=NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[L2Regularization(l2=FixedSchedule(value=5.0E-4))], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=3, nOut=50, timeDistributedFormat=null), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
Layer[1, name: layer1]: org.deeplearning4j.nn.layers.feedforward.dense.DenseLayer{conf=NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[L2Regularization(l2=FixedSchedule(value=5.0E-4))], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=50, nOut=50, timeDistributedFormat=null), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
Layer[2, name: layer2]: org.deeplearning4j.nn.layers.OutputLayer{conf=NeuralNetConfiguration(layer=OutputLayer(super=BaseOutputLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=identity, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[L2Regularization(l2=FixedSchedule(value=5.0E-4))], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=50, nOut=1, timeDistributedFormat=null), lossFn=LossMSE(), hasBias=true)), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
NN[1]: org.deeplearning4j.nn.multilayer.MultiLayerNetwork@6b760460
Layer[0, name: layer0]: org.deeplearning4j.nn.layers.feedforward.dense.DenseLayer{conf=NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=3, nOut=50, timeDistributedFormat=null), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
Layer[1, name: layer1]: org.deeplearning4j.nn.layers.feedforward.dense.DenseLayer{conf=NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=50, nOut=50, timeDistributedFormat=null), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
Layer[2, name: layer2]: org.deeplearning4j.nn.layers.OutputLayer{conf=NeuralNetConfiguration(layer=OutputLayer(super=BaseOutputLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=softmax, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=50, nOut=3, timeDistributedFormat=null), lossFn=ActorCriticLoss(), hasBias=true)), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
a3c.policy: org.deeplearning4j.rl4j.policy.ACPolicy@2c306a57
a3c.epochCount: 0
a3c.stepCount: 0
a3c.historyProcessor: null
a3c.progressMonitorFrequency: 20000
a3c.mdp: rta.GymEnvRTA@1a5b8489
a3c.learningConfiguration: A3CLearningConfiguration(numThreads=1, nStep=42, learnerUpdateFrequency=5)
a3c.neuralNetwork(actorCritic): ActorCritic2: org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate; org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate@6c298dc
actorCritic.isRecurrent: false
a3c.nn: ActorCritic2: org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate; org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate@6c298dc
a3c.nn: ActorCritic2: org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate; org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate@6c298dc
a3c.nn.class: rta.RTA$ActorCritic2
actorCritic.neuralNetwork[0]: org.deeplearning4j.nn.multilayer.MultiLayerNetwork@723ed581
actorCritic.neuralNetwork[0].optimizer: org.deeplearning4j.optimize.solvers.StochasticGradientDescent@773e2eb5
nnConfig: NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[L2Regularization(l2=FixedSchedule(value=5.0E-4))], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=3, nOut=50, timeDistributedFormat=null), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[0_W, 0_b, 1_W, 1_b, 2_W, 2_b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0)
actorCritic.neuralNetwork[1]: org.deeplearning4j.nn.multilayer.MultiLayerNetwork@6b760460
actorCritic.neuralNetwork[1].optimizer: org.deeplearning4j.optimize.solvers.StochasticGradientDescent@d8948cd
nnConfig: NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=3, nOut=50, timeDistributedFormat=null), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[0_W, 0_b, 1_W, 1_b, 2_W, 2_b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0)
training START
15:11:51.066 [main] INFO org.deeplearning4j.rl4j.learning.async.AsyncLearning - AsyncLearning training starting.
15:11:51.091 [main] INFO org.deeplearning4j.rl4j.learning.async.AsyncLearning - Threads launched.
15:11:51.091 [Thread-2] INFO org.deeplearning4j.rl4j.learning.async.AsyncThread - ThreadNum-0 Started!
Exception in thread "Thread-2" org.deeplearning4j.exception.DL4JInvalidInputException: Input that is not a matrix; expected matrix (rank 2), got rank 3 array with shape [1, 1, 3]. Missing preprocessor or wrong input type? (layer name: layer0, layer index: 0, layer type: DenseLayer)
at org.deeplearning4j.nn.layers.BaseLayer.preOutputWithPreNorm(BaseLayer.java:312)
at org.deeplearning4j.nn.layers.BaseLayer.preOutput(BaseLayer.java:295)
at org.deeplearning4j.nn.layers.BaseLayer.activate(BaseLayer.java:343)
at org.deeplearning4j.nn.layers.AbstractLayer.activate(AbstractLayer.java:262)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.ffToLayerActivationsInWs(MultiLayerNetwork.java:1138)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2783)
at org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate.gradient(ActorCriticSeparate.java:172)
at rta.RTA$ActorCritic2.gradient(RTA.java:767)
at org.deeplearning4j.rl4j.learning.async.a3c.discrete.AdvantageActorCriticUpdateAlgorithm.computeGradients(AdvantageActorCriticUpdateAlgorithm.java:102)
at org.deeplearning4j.rl4j.learning.async.a3c.discrete.AdvantageActorCriticUpdateAlgorithm.computeGradients(AdvantageActorCriticUpdateAlgorithm.java:33)
at org.deeplearning4j.rl4j.learning.async.AsyncThreadDiscrete.trainSubEpoch(AsyncThreadDiscrete.java:142)
at org.deeplearning4j.rl4j.learning.async.AsyncThread.handleTraining(AsyncThread.java:188)
at org.deeplearning4j.rl4j.learning.async.AsyncThread.run(AsyncThread.java:164)
Process finished with exit code 130
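To make the shape mismatch in the exception concrete, here is a minimal plain-Java sketch with no dl4j calls. It rests on an assumption that is only a guess: that a batch dimension is prepended to the observation shape before the array reaches the network. The class ShapeRankSketch and its helpers withBatchDimension and isRank2 are hypothetical names for illustration, not part of rl4j or dl4j. Under that assumption, the observation shape [1, 3] printed in the stdout becomes the rank 3 shape [1, 1, 3] named in the exception, while a flat shape [3] would become the rank 2 shape [1, 3] that a DenseLayer accepts.

```java
import java.util.Arrays;

public class ShapeRankSketch {

    // Hypothetical helper (not an rl4j API): prepend a batch dimension to an observation shape.
    static long[] withBatchDimension(long batchSize, long[] observationShape) {
        long[] batched = new long[observationShape.length + 1];
        batched[0] = batchSize;
        System.arraycopy(observationShape, 0, batched, 1, observationShape.length);
        return batched;
    }

    // The exception message says the DenseLayer expects a matrix, i.e. a rank 2 array.
    static boolean isRank2(long[] shape) {
        return shape.length == 2;
    }

    public static void main(String[] args) {
        long[] reportedShape = {1, 3}; // from the stdout line for gymEnv.observationSpace
        long[] flatShape = {3};        // assumed alternative: three features, no extra dimension

        for (long[] shape : new long[][] {reportedShape, flatShape}) {
            long[] batched = withBatchDimension(1, shape);
            System.out.println(Arrays.toString(shape) + " -> " + Arrays.toString(batched)
                    + ", rank 2: " + isRank2(batched));
        }
    }
}
```

If the assumption holds, this would also fit the run with useLSTM set to true passing, since a recurrent layer takes rank 3 input.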
---------------------------------------------------
A3CTest.java
package mytests;
import org.deeplearning4j.gym.StepReply;
import org.deeplearning4j.nn.api.Layer;
import org.deeplearning4j.nn.api.NeuralNetwork;
import org.deeplearning4j.nn.conf.InputPreProcessor;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.gradient.Gradient;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.rl4j.agent.learning.update.Features;
import org.deeplearning4j.rl4j.agent.learning.update.FeaturesLabels;
import org.deeplearning4j.rl4j.agent.learning.update.Gradients;
import org.deeplearning4j.rl4j.learning.IHistoryProcessor;
import org.deeplearning4j.rl4j.learning.Learning;
import org.deeplearning4j.rl4j.learning.async.a3c.discrete.A3CDiscreteDense;
import org.deeplearning4j.rl4j.learning.configuration.A3CLearningConfiguration;
import org.deeplearning4j.rl4j.mdp.MDP;
import org.deeplearning4j.rl4j.network.NeuralNetOutput;
import org.deeplearning4j.rl4j.network.ac.ActorCriticFactorySeparate;
import org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate;
import org.deeplearning4j.rl4j.network.ac.IActorCritic;
import org.deeplearning4j.rl4j.network.configuration.ActorCriticDenseNetworkConfiguration;
import org.deeplearning4j.rl4j.observation.Observation;
import org.deeplearning4j.rl4j.policy.ACPolicy;
import org.deeplearning4j.rl4j.space.*;
import org.deeplearning4j.rl4j.util.LegacyMDPWrapper;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.learning.config.Adam;
import java.io.*;
import java.text.SimpleDateFormat;
import java.util.*;
import java.util.logging.Logger;
public class A3CTest {

    private static final String PARAM_NAME_PROP_FILE_PATH = "propFilePath";

    // application property names
    private static final String APP_PROP_NAME_BASE_PATH = "base.path";
    private static final String APP_PROP_NAME_AC_VALUE_FILE_NAME = "ac.value.file.name";
    private static final String APP_PROP_NAME_AC_POLICY_FILE_NAME = "ac.policy.file.name";
    private static final String APP_PROP_NAME_DO_TRAINING = "do.training";
    private static final String APP_PROP_NAME_EVAL_TRAINING_ITERATIONS = "eval.training.iterations";
    private static final String APP_PROP_NAME_ENV_UD = "gym.envUD";
    private static final String APP_PROP_NAME_RESULT_FILE_NAME = "result.file.path";
    // logging of observations, actions etc is done every n-th episode
    // e.g. 25 means that logging about observations, actions etc for the episode is done every 25th episode
    private static final String APP_PROP_NAME_EPISODE_LOG_INTERVAL = "episode.log.interval";
    private static final String APP_PROP_NAME_NN_ADAM_LEARNING_RATE = "nn.adam.learning.rate";
    private static final String APP_PROP_NAME_NN_L2 = "nn.l2";
    private static final String APP_PROP_NAME_NN_NBR_OF_HIDDEN_NODES = "nn.num.hidden.nodes";
    private static final String APP_PROP_NAME_NN_NBR_OF_LAYERS = "nn.num.layers";
    private static final String APP_PROP_NAME_NN_LSTM = "nn.lstm";
    private static final String APP_PROP_NAME_NN_LEARNING_RATE = "nn.learning.rate";
    private static final String APP_PROP_NAME_LEARNER_UPDATE_FREQ = "learner.update.freq";
    private static final String APP_PROP_NAME_LEARNING_SEED = "learning.seed";
    private static final String APP_PROP_NAME_LEARNING_MAX_EPOCH_STEP = "learning.max.epoch.step";
    private static final String APP_PROP_NAME_LEARNING_MAX_STEP = "learning.max.step";
    private static final String APP_PROP_NAME_LEARNING_NUM_THREADS = "learning.num.threads";
    private static final String APP_PROP_NAME_LEARNING_NUM_STEP = "learning.num.step";
    private static final String APP_PROP_NAME_LEARNING_REWARD_FACTOR = "learning.reward.factor";
    private static final String APP_PROP_NAME_LEARNING_GAMMA = "learning.gamma";

    private static Properties APP_PROPS;

    public static void main(String[] args) throws IOException {
        ...
        a3c(envUD, resultFileName, episodeLogInterval, logIntervalInSteps);
    }

    private static void a3c(String envUD, String resultFileName, int episodeLogInterval, long logIntervalInSteps) throws IOException {
        GymEnvRTA<Box, Integer, DiscreteSpace> gymEnv =
                createGymEnvBox(envUD, false, false, episodeLogInterval);
        A3CLearningConfiguration learningConfNew = createA3CLearningConfigNew();
        ActorCriticDenseNetworkConfiguration netConfNew = createActorCriticNeuralNetworkNew();
        A3CDiscreteDense<Box> a3cOri = new A3CDiscreteDense<Box>(gymEnv, netConfNew, learningConfNew);
        ActorCriticSeparate actorCritic = (ActorCriticSeparate) a3cOri.getNeuralNet();
        System.out.println("a3c.neuralNetwork(actorCriticOri): " + actorCritic);
        System.out.println("a3c.neuralNetwork(actorCriticOri).NNs: " + actorCritic.getNeuralNetworks());
        System.out.println("a3c.neuralNetwork(actorCriticOri).NNs.length: " + actorCritic.getNeuralNetworks().length);
        int i = 0;
        for (NeuralNetwork nn : actorCritic.getNeuralNetworks()) {
            MultiLayerNetwork mlnn = (MultiLayerNetwork) nn;
            System.out.println(" NN[" + i + "]: " + mlnn);
            int j = 0;
            for (String layerName : mlnn.getLayerNames()) {
                Layer layer = mlnn.getLayer(layerName);
                System.out.println(" Layer[" + j + ", name: " + layerName + "]: " + layer);
                j++;
            }
            i++;
        }
        ActorCritic2 actorCritic2 = new ActorCritic2(actorCritic, logIntervalInSteps);
        A3CDiscreteDense2 a3c = new A3CDiscreteDense2(gymEnv, actorCritic2, learningConfNew);
        showInfoAboutActorCritic(a3c);
        System.out.println("training START");
        a3c.train();
        System.out.println("training DONE");
        gymEnv.close();
        System.out.println("gymEnv CLOSED");
        ACPolicy<Box> policy = a3c.getPolicy();
        System.out.println("policy: " + policy);
        savePolicy(policy);
        // for now we evaluate the agent using the same data as used for training
        // in a real scenario we would use another envUD for the evaluation
        useACPolicy(envUD, resultFileName, episodeLogInterval);
    }

    private static GymEnvRTA<Box, Integer, DiscreteSpace> createGymEnvBox(String envUD, boolean render, boolean monitor,
                                                                          int episodeLogInterval) {
        // define the mdp from gym (name, render)
        System.out.println("using env: " + envUD);
        GymEnvRTA<Box, Integer, DiscreteSpace> gymEnv =
                new GymEnvRTA<Box, Integer, DiscreteSpace>(envUD, render, monitor, episodeLogInterval);
        System.out.println("found gymEnv: " + envUD);
        System.out.println("gymEnv.envId: " + gymEnv.getEnvId());
        DiscreteSpace das = gymEnv.getActionSpace();
        System.out.println("gymEnv.actionSpace: " + das);
        System.out.println("gymEnv.actionSpace.size: " + das.getSize());
        System.out.println("gymEnv.actionSpace.noOp: " + das.noOp());
        System.out.println("gymEnv.actionSpace.encode(0): " + das.encode(0));
        System.out.println("gymEnv.actionSpace.encode(1): " + das.encode(1));
        System.out.println("gymEnv.actionSpace.encode(2): " + das.encode(2));
        ObservationSpace<Box> os = gymEnv.getObservationSpace();
        System.out.println("gymEnv.observationSpace: " + os);
        System.out.println("gymEnv: " + gymEnv);
        return gymEnv;
    }

    private static ActorCriticDenseNetworkConfiguration createActorCriticNeuralNetworkNew() {
        ActorCriticDenseNetworkConfiguration neuralNet =
                ActorCriticDenseNetworkConfiguration.builder()
                        // Adam is an optimization algorithm that can be used instead of the classical stochastic
                        // gradient descent procedure to update network weights iteratively based on training data.
                        // Adam params:
                        // alpha: learningRate or stepSize, the proportion by which weights are updated;
                        // larger values (e.g. 0.3) result in faster initial learning
                        // beta1: exponential decay rate for first moment (mean) estimates (default 0.9)
                        // beta2: exponential decay rate for second moment (variance) estimates (default 0.999)
                        // epsilon: very small number to prevent division by zero (default 1E-8)
                        .updater(new Adam(getDoubleProperty(APP_PROP_NAME_NN_ADAM_LEARNING_RATE))) // 1e-2
                        // L2 regularization: higher value --> more regularization or generalization
                        .l2(getDoubleProperty(APP_PROP_NAME_NN_L2)) // 0 normally 10 times smaller than learningRate
                        .numHiddenNodes(getIntProperty(APP_PROP_NAME_NN_NBR_OF_HIDDEN_NODES)) // 16
                        .numLayers(getIntProperty(APP_PROP_NAME_NN_NBR_OF_LAYERS)) // 2
                        // true: use LSTM (Long Short-Term Memory)
                        .useLSTM(getBooleanProperty(APP_PROP_NAME_NN_LSTM))
                        .learningRate(getDoubleProperty(APP_PROP_NAME_NN_LEARNING_RATE)) // 0.001
                        .build();
        return neuralNet;
    }

    private static A3CLearningConfiguration createA3CLearningConfigNew() {
        A3CLearningConfiguration learningConfig = A3CLearningConfiguration.builder()
                .learnerUpdateFrequency(getIntProperty(APP_PROP_NAME_LEARNER_UPDATE_FREQ))
                .seed(getLongProperty(APP_PROP_NAME_LEARNING_SEED))
                .maxEpochStep(getIntProperty(APP_PROP_NAME_LEARNING_MAX_EPOCH_STEP))
                .maxStep(getIntProperty(APP_PROP_NAME_LEARNING_MAX_STEP)) // 500000
                .numThreads(getIntProperty(APP_PROP_NAME_LEARNING_NUM_THREADS)) // 8
                .nStep(getIntProperty(APP_PROP_NAME_LEARNING_NUM_STEP)) // 20
                .rewardFactor(getDoubleProperty(APP_PROP_NAME_LEARNING_REWARD_FACTOR)) // 0.01
                // discount factor, typically 0.9 or even 0.99
                .gamma(getDoubleProperty(APP_PROP_NAME_LEARNING_GAMMA))
                .build();
        return learningConfig;
    }
}
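A possible, unverified direction for the non-LSTM case is to make the environment report a flat observation shape, [3] instead of [1, 3], so that a batched observation is rank 2. The plain-Java sketch below only illustrates the shape arithmetic: it drops size-1 dimensions and shows that the number of inputs stays 3, which matches nIn=3 of layer0 in the stdout. ObservationShapeSketch, squeeze and numInputs are hypothetical helper names and not part of rl4j; whether the dense setup really needs a flat observation shape is an assumption.

```java
import java.util.Arrays;

public class ObservationShapeSketch {

    // Hypothetical helper: remove all dimensions of size 1, keeping at least one dimension.
    static int[] squeeze(int[] shape) {
        int[] squeezed = Arrays.stream(shape).filter(d -> d != 1).toArray();
        return squeezed.length == 0 ? new int[] {1} : squeezed;
    }

    // Number of input features, computed as the product of all dimensions.
    static int numInputs(int[] shape) {
        return Arrays.stream(shape).reduce(1, (a, b) -> a * b);
    }

    public static void main(String[] args) {
        int[] reported = {1, 3}; // shape printed for gymEnv.observationSpace
        int[] flat = squeeze(reported);
        System.out.println(Arrays.toString(reported) + " -> " + Arrays.toString(flat));
        System.out.println("inputs before: " + numInputs(reported) + ", after: " + numInputs(flat));
    }
}
```

The feature count is unchanged by the squeeze, so the dense layer sizes in the configuration would not need to change; only the rank of the observation would.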