Problem removing LSTM: shape exception

I am running some tests using rl4j with A3C applied to an OpenAI Gym environment.
I am on a Windows machine and using the latest versions (1.0.0-M1.1) of dl4j and rl4j.

The setup is quite simple. I was using LSTM (Long Short-Term Memory), but since I am still experimenting I wanted to run some tests without it. Because I use the configuration-based setup, this only meant switching the flag useLSTM from true to false.

To my surprise, that switch alone produces an exception indicating that the neural network is no longer configured correctly. I agree that removing the LSTM has implications for the neural network, but I expected the configuration-based setup to take care of this implicitly (see the Java code below). With a "manual" setup I could have tried operations such as reshape, but with the configuration-based setup I did not think that was the way to go. Maybe I am missing something here.

The exception occurs during gradient calculation: layer0 (a DenseLayer) expects a rank 2 matrix but receives a rank 3 array with shape [1, 1, 3] (see the supplied file std-out-and-exception-no-lstm.txt).
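
To make the shape mismatch reported in the exception concrete, here is a small illustration in plain Java (standard library only, not ND4J or rl4j code). The class ShapeSketch and its methods are hypothetical names used only for this sketch. It assumes that the rank 3 shape [1, 1, 3] is the [1, 3] observation with a batch dimension in front, and it shows how dropping the singleton middle axis would give a rank 2 shape. It is a picture of the mismatch under that assumption, not a proposed fix.

```java
import java.util.Arrays;

// Plain-Java illustration only; this is NOT ND4J or rl4j code.
// ShapeSketch and its method names are hypothetical.
public class ShapeSketch {

    // Per the exception message, the dense layer expects a rank 2 input (a matrix).
    public static boolean isMatrix(long[] shape) {
        return shape.length == 2;
    }

    // Collapse [batch, 1, features] to [batch, features].
    // Assumption: the middle axis is a singleton that carries no information.
    public static long[] dropSingletonMiddleAxis(long[] shape) {
        if (shape.length != 3 || shape[1] != 1) {
            throw new IllegalArgumentException(
                    "Expected shape [batch, 1, features], got " + Arrays.toString(shape));
        }
        return new long[] { shape[0], shape[2] };
    }

    public static void main(String[] args) {
        long[] reported = {1, 1, 3}; // shape taken from the exception message
        System.out.println(Arrays.toString(reported) + " is a matrix: " + isMatrix(reported));

        long[] flattened = dropSingletonMiddleAxis(reported);
        System.out.println(Arrays.toString(flattened) + " is a matrix: " + isMatrix(flattened));
    }
}
```

Whether a reshape like this belongs in the environment, in an input preprocessor, or inside rl4j itself is the open question of this post.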

Below I share the stdout with the exception, and some of my code from the Java class A3CTest.java, in the hope that somebody can see what is going on and tell me whether something is wrong in my setup or whether this is a bug.

Note that if I switch the flag useLSTM from false back to true, training completes without any problems, so the network is then correctly configured. For comparison, I have attached the stdout from the point where the code passes the same phase in which the exception occurs (this file is called std-out-all-ok-with-lstm.txt).

In case you need further info, let me know.


std-out-and-exception-no-lstm.txt

"C:\Program Files\Java\jdk-11.0.11\bin\java.exe" "-javaagent:C:\Program Files\JetBrains\IntelliJ IDEA Community Edition 2021.1.2\lib\idea_rt.jar=52573:C:\Program Files\JetBrains\IntelliJ IDEA Community Edition 2021.1.2\bin" -Dfile.encoding=UTF-8 -classpath C:\Users\pette\AppData\Local\Temp\classpath186559065.jar rta.RTA
loadProps, loaded 21 properties from property file
loadProps, total app properties 21
   [ac.policy.file.name]: [acPolicy]
   [ac.value.file.name]: [acValue]
   [base.path]: [C:/.../]
   [do.training]: [true]
   [episode.log.interval]: [25]
   [eval.training.iterations]: [20]
   [gym.envUD]: [...]
   [learner.update.freq]: [5]
   [learning.gamma]: [0.99]
   [learning.max.epoch.step]: [42]
   [learning.max.step]: [500]
   [learning.num.step]: [42]
   [learning.num.threads]: [1]
   [learning.reward.factor]: [0.000004]
   [learning.seed]: [123]
   [nn.adam.learning.rate]: [0.0005]
   [nn.l2]: [0.0005]
   [nn.learning.rate]: [0.0005]
   [nn.lstm]: [false]
   [nn.num.hidden.nodes]: [50]
   [nn.num.layers]: [2]

envUD: [...], doTraining: true, episodeLogInterval: 25, nbrOfStepsInEpisode: 42
using env: ...
15:11:49.559 [main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
15:11:50.731 [main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 4
15:11:50.731 [main] INFO org.nd4j.linalg.cpu.nativecpu.CpuNDArrayFactory - Binary level Generic x86 optimization level AVX/AVX2
15:11:50.763 [main] INFO org.nd4j.nativeblas.Nd4jBlas - Number of threads used for OpenMP BLAS: 4
15:11:50.778 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CPU]; OS: [Windows 10]
15:11:50.778 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Cores: [8]; Memory: [2,0GB];
15:11:50.778 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [OPENBLAS]
15:11:50.778 [main] INFO org.nd4j.linalg.cpu.nativecpu.CpuBackend - Backend build information:
 GCC: "10.3.0"
STD version: 201103L
DEFAULT_ENGINE: samediff::ENGINE_CPU
HAVE_FLATBUFFERS
HAVE_OPENBLAS
found gymEnv: ...
gymEnv.envId: ...
gymEnv.actionSpace: org.deeplearning4j.rl4j.space.DiscreteSpace@72ba28ee
gymEnv.actionSpace.size: 3
gymEnv.actionSpace.noOp: 0
gymEnv.actionSpace.encode(0): 0
gymEnv.actionSpace.encode(1): 1
gymEnv.actionSpace.encode(2): 2
gymEnv.observationSpace: ArrayObservationSpace(name=Custom, shape=[1, 3], low=[0], high=[0])
gymEnv: rta.GymEnvRTA@1a5b8489
15:11:50.934 [main] INFO org.deeplearning4j.nn.multilayer.MultiLayerNetwork - Starting MultiLayerNetwork with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
a3c.neuralNetwork(actorCriticOri): org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate@6c298dc
a3c.neuralNetwork(actorCriticOri).NNs: [Lorg.deeplearning4j.nn.api.NeuralNetwork;@3e7dfd44
a3c.neuralNetwork(actorCriticOri).NNs.length: 2
   NN[0]: org.deeplearning4j.nn.multilayer.MultiLayerNetwork@723ed581
      Layer[0, name: layer0]: org.deeplearning4j.nn.layers.feedforward.dense.DenseLayer{conf=NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[L2Regularization(l2=FixedSchedule(value=5.0E-4))], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=3, nOut=50, timeDistributedFormat=null), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
      Layer[1, name: layer1]: org.deeplearning4j.nn.layers.feedforward.dense.DenseLayer{conf=NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[L2Regularization(l2=FixedSchedule(value=5.0E-4))], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=50, nOut=50, timeDistributedFormat=null), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
      Layer[2, name: layer2]: org.deeplearning4j.nn.layers.OutputLayer{conf=NeuralNetConfiguration(layer=OutputLayer(super=BaseOutputLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=identity, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[L2Regularization(l2=FixedSchedule(value=5.0E-4))], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=50, nOut=1, timeDistributedFormat=null), lossFn=LossMSE(), hasBias=true)), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
   NN[1]: org.deeplearning4j.nn.multilayer.MultiLayerNetwork@6b760460
      Layer[0, name: layer0]: org.deeplearning4j.nn.layers.feedforward.dense.DenseLayer{conf=NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=3, nOut=50, timeDistributedFormat=null), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
      Layer[1, name: layer1]: org.deeplearning4j.nn.layers.feedforward.dense.DenseLayer{conf=NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=50, nOut=50, timeDistributedFormat=null), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
      Layer[2, name: layer2]: org.deeplearning4j.nn.layers.OutputLayer{conf=NeuralNetConfiguration(layer=OutputLayer(super=BaseOutputLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=softmax, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=50, nOut=3, timeDistributedFormat=null), lossFn=ActorCriticLoss(), hasBias=true)), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
a3c.policy: org.deeplearning4j.rl4j.policy.ACPolicy@2c306a57
a3c.epochCount: 0
a3c.stepCount: 0
a3c.historyProcessor: null
a3c.progressMonitorFrequency: 20000
a3c.mdp: rta.GymEnvRTA@1a5b8489
a3c.learningConfiguration: A3CLearningConfiguration(numThreads=1, nStep=42, learnerUpdateFrequency=5)
a3c.neuralNetwork(actorCritic): ActorCritic2: org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate; org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate@6c298dc
actorCritic.isRecurrent: false
a3c.nn: ActorCritic2: org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate; org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate@6c298dc
a3c.nn: ActorCritic2: org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate; org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate@6c298dc
a3c.nn.class: rta.RTA$ActorCritic2
actorCritic.neuralNetwork[0]: org.deeplearning4j.nn.multilayer.MultiLayerNetwork@723ed581
actorCritic.neuralNetwork[0].optimizer: org.deeplearning4j.optimize.solvers.StochasticGradientDescent@773e2eb5
   nnConfig: NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[L2Regularization(l2=FixedSchedule(value=5.0E-4))], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=3, nOut=50, timeDistributedFormat=null), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[0_W, 0_b, 1_W, 1_b, 2_W, 2_b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0)
actorCritic.neuralNetwork[1]: org.deeplearning4j.nn.multilayer.MultiLayerNetwork@6b760460
actorCritic.neuralNetwork[1].optimizer: org.deeplearning4j.optimize.solvers.StochasticGradientDescent@d8948cd
   nnConfig: NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=3, nOut=50, timeDistributedFormat=null), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[0_W, 0_b, 1_W, 1_b, 2_W, 2_b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0)
training START
15:11:51.066 [main] INFO org.deeplearning4j.rl4j.learning.async.AsyncLearning - AsyncLearning training starting.
15:11:51.091 [main] INFO org.deeplearning4j.rl4j.learning.async.AsyncLearning - Threads launched.
15:11:51.091 [Thread-2] INFO org.deeplearning4j.rl4j.learning.async.AsyncThread - ThreadNum-0 Started!
Exception in thread "Thread-2" org.deeplearning4j.exception.DL4JInvalidInputException: Input that is not a matrix; expected matrix (rank 2), got rank 3 array with shape [1, 1, 3]. Missing preprocessor or wrong input type? (layer name: layer0, layer index: 0, layer type: DenseLayer)
	at org.deeplearning4j.nn.layers.BaseLayer.preOutputWithPreNorm(BaseLayer.java:312)
	at org.deeplearning4j.nn.layers.BaseLayer.preOutput(BaseLayer.java:295)
	at org.deeplearning4j.nn.layers.BaseLayer.activate(BaseLayer.java:343)
	at org.deeplearning4j.nn.layers.AbstractLayer.activate(AbstractLayer.java:262)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.ffToLayerActivationsInWs(MultiLayerNetwork.java:1138)
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2783)
	at org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate.gradient(ActorCriticSeparate.java:172)
	at rta.RTA$ActorCritic2.gradient(RTA.java:767)
	at org.deeplearning4j.rl4j.learning.async.a3c.discrete.AdvantageActorCriticUpdateAlgorithm.computeGradients(AdvantageActorCriticUpdateAlgorithm.java:102)
	at org.deeplearning4j.rl4j.learning.async.a3c.discrete.AdvantageActorCriticUpdateAlgorithm.computeGradients(AdvantageActorCriticUpdateAlgorithm.java:33)
	at org.deeplearning4j.rl4j.learning.async.AsyncThreadDiscrete.trainSubEpoch(AsyncThreadDiscrete.java:142)
	at org.deeplearning4j.rl4j.learning.async.AsyncThread.handleTraining(AsyncThread.java:188)
	at org.deeplearning4j.rl4j.learning.async.AsyncThread.run(AsyncThread.java:164)

Process finished with exit code 130
---------------------------------------------------

A3CTest.java

package mytests;

import org.deeplearning4j.gym.StepReply;
import org.deeplearning4j.nn.api.Layer;
import org.deeplearning4j.nn.api.NeuralNetwork;
import org.deeplearning4j.nn.conf.InputPreProcessor;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.gradient.Gradient;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.rl4j.agent.learning.update.Features;
import org.deeplearning4j.rl4j.agent.learning.update.FeaturesLabels;
import org.deeplearning4j.rl4j.agent.learning.update.Gradients;
import org.deeplearning4j.rl4j.learning.IHistoryProcessor;
import org.deeplearning4j.rl4j.learning.Learning;
import org.deeplearning4j.rl4j.learning.async.a3c.discrete.A3CDiscreteDense;
import org.deeplearning4j.rl4j.learning.configuration.A3CLearningConfiguration;
import org.deeplearning4j.rl4j.mdp.MDP;
import org.deeplearning4j.rl4j.network.NeuralNetOutput;
import org.deeplearning4j.rl4j.network.ac.ActorCriticFactorySeparate;
import org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate;
import org.deeplearning4j.rl4j.network.ac.IActorCritic;
import org.deeplearning4j.rl4j.network.configuration.ActorCriticDenseNetworkConfiguration;
import org.deeplearning4j.rl4j.observation.Observation;
import org.deeplearning4j.rl4j.policy.ACPolicy;
import org.deeplearning4j.rl4j.space.*;
import org.deeplearning4j.rl4j.util.LegacyMDPWrapper;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.learning.config.Adam;

import java.io.*;
import java.text.SimpleDateFormat;
import java.util.*;
import java.util.logging.Logger;


public class A3CTest {

    private static final String PARAM_NAME_PROP_FILE_PATH = "propFilePath";

    // application property names
    private static final String APP_PROP_NAME_BASE_PATH = "base.path";
    private static final String APP_PROP_NAME_AC_VALUE_FILE_NAME = "ac.value.file.name";
    private static final String APP_PROP_NAME_AC_POLICY_FILE_NAME = "ac.policy.file.name";

    private static final String APP_PROP_NAME_DO_TRAINING = "do.training";
    private static final String APP_PROP_NAME_EVAL_TRAINING_ITERATIONS = "eval.training.iterations";
    private static final String APP_PROP_NAME_ENV_UD = "gym.envUD";
    private static final String APP_PROP_NAME_RESULT_FILE_NAME = "result.file.path";

    // logging of observations, actions etc is done every n-th episode
    // e.g. 25 means that logging about observations, actions etc for the episode is done every 25th episode
    private static final String APP_PROP_NAME_EPISODE_LOG_INTERVAL = "episode.log.interval";

    private static final String APP_PROP_NAME_NN_ADAM_LEARNING_RATE = "nn.adam.learning.rate";
    private static final String APP_PROP_NAME_NN_L2 = "nn.l2";
    private static final String APP_PROP_NAME_NN_NBR_OF_HIDDEN_NODES = "nn.num.hidden.nodes";
    private static final String APP_PROP_NAME_NN_NBR_OF_LAYERS = "nn.num.layers";
    private static final String APP_PROP_NAME_NN_LSTM = "nn.lstm";
    private static final String APP_PROP_NAME_NN_LEARNING_RATE = "nn.learning.rate";

    private static final String APP_PROP_NAME_LEARNER_UPDATE_FREQ = "learner.update.freq";
    private static final String APP_PROP_NAME_LEARNING_SEED = "learning.seed";
    private static final String APP_PROP_NAME_LEARNING_MAX_EPOCH_STEP = "learning.max.epoch.step";
    private static final String APP_PROP_NAME_LEARNING_MAX_STEP = "learning.max.step";
    private static final String APP_PROP_NAME_LEARNING_NUM_THREADS = "learning.num.threads";
    private static final String APP_PROP_NAME_LEARNING_NUM_STEP = "learning.num.step";
    private static final String APP_PROP_NAME_LEARNING_REWARD_FACTOR = "learning.reward.factor";
    private static final String APP_PROP_NAME_LEARNING_GAMMA = "learning.gamma";

    private static Properties APP_PROPS;

    public static void main(String[] args) throws IOException {
       ...
       a3c(envUD, resultFileName, episodeLogInterval, logIntervalInSteps);
    }

    private static void a3c(String envUD, String resultFileName, int episodeLogInterval, long logIntervalInSteps) throws IOException {

        GymEnvRTA<Box, Integer, DiscreteSpace> gymEnv =
                createGymEnvBox(envUD, false, false, episodeLogInterval);

        A3CLearningConfiguration learningConfNew = createA3CLearningConfigNew();
        ActorCriticDenseNetworkConfiguration netConfNew = createActorCriticNeuralNetworkNew();
        A3CDiscreteDense<Box> a3cOri = new A3CDiscreteDense<Box>(gymEnv, netConfNew, learningConfNew);
        ActorCriticSeparate actorCritic = (ActorCriticSeparate)a3cOri.getNeuralNet();
        System.out.println("a3c.neuralNetwork(actorCriticOri): " + actorCritic);
        System.out.println("a3c.neuralNetwork(actorCriticOri).NNs: " + actorCritic.getNeuralNetworks());
        System.out.println("a3c.neuralNetwork(actorCriticOri).NNs.length: " + actorCritic.getNeuralNetworks().length);
        int i = 0;
        for (NeuralNetwork nn : actorCritic.getNeuralNetworks()) {
            MultiLayerNetwork mlnn = (MultiLayerNetwork) nn;
            System.out.println("   NN[" + i + "]: " + mlnn);
            int j = 0;
            for (String layerName : mlnn.getLayerNames()) {
                Layer layer = mlnn.getLayer(layerName);
                System.out.println("      Layer[" + j + ", name: " + layerName + "]: " + layer);
                j++;
            }
            i++;
        }
        ActorCritic2 actorCritic2 = new ActorCritic2(actorCritic, logIntervalInSteps);

        A3CDiscreteDense2 a3c = new A3CDiscreteDense2(gymEnv, actorCritic2, learningConfNew);

        showInfoAboutActorCritic(a3c);

        System.out.println("training START");
        a3c.train();
        System.out.println("training DONE");
        gymEnv.close();
        System.out.println("gymEnv CLOSED");

        ACPolicy<Box> policy = a3c.getPolicy();
        System.out.println("policy: " + policy);
        savePolicy(policy);

        // for now we evaluate the agent using the same data as used for training
        // in a real scenario we would use another envUD for the evaluation
        useACPolicy(envUD, resultFileName, episodeLogInterval);
    }

    private static GymEnvRTA<Box, Integer, DiscreteSpace> createGymEnvBox(String envUD, boolean render, boolean monitor,
                                                                          int episodeLogInterval) {
        // define the mdp from gym (name, render)
        System.out.println("using env: " + envUD);
        GymEnvRTA<Box, Integer, DiscreteSpace> gymEnv =
                new GymEnvRTA<Box, Integer, DiscreteSpace>(envUD, render, monitor, episodeLogInterval);

        System.out.println("found gymEnv: " + envUD);
        System.out.println("gymEnv.envId: " + gymEnv.getEnvId());

        DiscreteSpace das = gymEnv.getActionSpace();
        System.out.println("gymEnv.actionSpace: " + das);
        System.out.println("gymEnv.actionSpace.size: " + das.getSize());
        System.out.println("gymEnv.actionSpace.noOp: " + das.noOp());
        System.out.println("gymEnv.actionSpace.encode(0): " + das.encode(0));
        System.out.println("gymEnv.actionSpace.encode(1): " + das.encode(1));
        System.out.println("gymEnv.actionSpace.encode(2): " + das.encode(2));

        ObservationSpace<Box> os = gymEnv.getObservationSpace();
        System.out.println("gymEnv.observationSpace: " + os);

        System.out.println("gymEnv: " + gymEnv);

        return gymEnv;
    }

    private static ActorCriticDenseNetworkConfiguration createActorCriticNeuralNetworkNew() {
        ActorCriticDenseNetworkConfiguration neuralNet =
                ActorCriticDenseNetworkConfiguration.builder()
                        // Adam is an optimization algorithm that can be used instead of the classical stochastic
                        // gradient descent procedure to update network weights iteratively based on training data.
                        // Adam params:
                        // alpha: learningRate or stepSize, the proportion that weights are updated
                        //        larger values (e.g. 0.3) result in faster initial learning
                        // beta1: exponential decay rate for first moment (mean) estimates (default 0.9)
                        // beta2: exponential decay rate for second moment (variance) estimates (default 0.999)
                        // epsilon: very small number to prevent division by zero (default 1E-8)
                        .updater(new Adam(getDoubleProperty(APP_PROP_NAME_NN_ADAM_LEARNING_RATE)))// 1e-2
                        // L2 regularization: higher value --> more regularization or generalization
                        .l2(getDoubleProperty(APP_PROP_NAME_NN_L2)) // 0 normally 10 times smaller than learningRate
                        .numHiddenNodes(getIntProperty(APP_PROP_NAME_NN_NBR_OF_HIDDEN_NODES)) // 16
                        .numLayers(getIntProperty(APP_PROP_NAME_NN_NBR_OF_LAYERS)) // 2
                        // true: use LSTM (Long Short-Term Memory)
                        .useLSTM(getBooleanProperty(APP_PROP_NAME_NN_LSTM))
                        .learningRate(getDoubleProperty(APP_PROP_NAME_NN_LEARNING_RATE)) // 0.001
                        .build();
        return neuralNet;
    }

    private static A3CLearningConfiguration createA3CLearningConfigNew() {
        A3CLearningConfiguration learningConfig = A3CLearningConfiguration.builder()
                .learnerUpdateFrequency(getIntProperty(APP_PROP_NAME_LEARNER_UPDATE_FREQ))
                .seed(getLongProperty(APP_PROP_NAME_LEARNING_SEED))
                .maxEpochStep(getIntProperty(APP_PROP_NAME_LEARNING_MAX_EPOCH_STEP))
                .maxStep(getIntProperty(APP_PROP_NAME_LEARNING_MAX_STEP)) // 500000
                .numThreads(getIntProperty(APP_PROP_NAME_LEARNING_NUM_THREADS)) // 8
                .nStep(getIntProperty(APP_PROP_NAME_LEARNING_NUM_STEP)) // 20
                .rewardFactor(getDoubleProperty(APP_PROP_NAME_LEARNING_REWARD_FACTOR)) // 0.01
                // discount factor, typically 0.9 or even 0.99
                .gamma(getDoubleProperty(APP_PROP_NAME_LEARNING_GAMMA))
                .build();
        return learningConfig;
    }
}

For comparison, here is the stdout from the successful run with useLSTM=true:


std-out-all-ok-with-lstm.txt

...
found gymEnv: ...
gymEnv.envId: ...
gymEnv.actionSpace: org.deeplearning4j.rl4j.space.DiscreteSpace@72ba28ee
gymEnv.actionSpace.size: 3
gymEnv.actionSpace.noOp: 0
gymEnv.actionSpace.encode(0): 0
gymEnv.actionSpace.encode(1): 1
gymEnv.actionSpace.encode(2): 2
gymEnv.observationSpace: ArrayObservationSpace(name=Custom, shape=[1, 3], low=[0], high=[0])
gymEnv: rta.GymEnvRTA@1a5b8489
12:44:47.349 [main] INFO org.deeplearning4j.nn.multilayer.MultiLayerNetwork - Starting MultiLayerNetwork with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
a3c.neuralNetwork(actorCriticOri): org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate@4c2869a9
a3c.neuralNetwork(actorCriticOri).NNs: [Lorg.deeplearning4j.nn.api.NeuralNetwork;@518cf84a
a3c.neuralNetwork(actorCriticOri).NNs.length: 2
   NN[0]: org.deeplearning4j.nn.multilayer.MultiLayerNetwork@62e7dffa
      Layer[0, name: layer0]: org.deeplearning4j.nn.layers.feedforward.dense.DenseLayer{conf=NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[L2Regularization(l2=FixedSchedule(value=5.0E-4))], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=3, nOut=50, timeDistributedFormat=NCW), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
      Layer[1, name: layer1]: org.deeplearning4j.nn.layers.feedforward.dense.DenseLayer{conf=NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[L2Regularization(l2=FixedSchedule(value=5.0E-4))], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=50, nOut=50, timeDistributedFormat=NCW), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
      Layer[2, name: layer2]: org.deeplearning4j.nn.layers.recurrent.LSTM{conf=NeuralNetConfiguration(layer=LSTM(super=AbstractLSTM(super=BaseRecurrentLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=tanh, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[L2Regularization(l2=FixedSchedule(value=5.0E-4))], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=50, nOut=50, timeDistributedFormat=null), weightInitFnRecurrent=null, rnnDataFormat=NCW), forgetGateBiasInit=1.0, gateActivationFn=sigmoid, helperAllowFallback=true), forgetGateBiasInit=1.0, gateActivationFn=sigmoid), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, RW, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
      Layer[3, name: layer3]: org.deeplearning4j.nn.layers.recurrent.RnnOutputLayer{conf=NeuralNetConfiguration(layer=RnnOutputLayer(super=BaseOutputLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=identity, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[L2Regularization(l2=FixedSchedule(value=5.0E-4))], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=50, nOut=1, timeDistributedFormat=null), lossFn=LossMSE(), hasBias=true), rnnDataFormat=NCW), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
   NN[1]: org.deeplearning4j.nn.multilayer.MultiLayerNetwork@6edcd0d8
      Layer[0, name: layer0]: org.deeplearning4j.nn.layers.feedforward.dense.DenseLayer{conf=NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=3, nOut=50, timeDistributedFormat=NCW), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
      Layer[1, name: layer1]: org.deeplearning4j.nn.layers.feedforward.dense.DenseLayer{conf=NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=50, nOut=50, timeDistributedFormat=NCW), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
      Layer[2, name: layer2]: org.deeplearning4j.nn.layers.recurrent.LSTM{conf=NeuralNetConfiguration(layer=LSTM(super=AbstractLSTM(super=BaseRecurrentLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=tanh, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=50, nOut=50, timeDistributedFormat=null), weightInitFnRecurrent=null, rnnDataFormat=NCW), forgetGateBiasInit=1.0, gateActivationFn=sigmoid, helperAllowFallback=true), forgetGateBiasInit=1.0, gateActivationFn=sigmoid), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, RW, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
      Layer[3, name: layer3]: org.deeplearning4j.nn.layers.recurrent.RnnOutputLayer{conf=NeuralNetConfiguration(layer=RnnOutputLayer(super=BaseOutputLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=softmax, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=50, nOut=3, timeDistributedFormat=null), lossFn=ActorCriticLoss(), hasBias=true), rnnDataFormat=NCW), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[W, b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0), score=0.0, optimizer=null, listeners=[]}
a3c.policy: org.deeplearning4j.rl4j.policy.ACPolicy@b8a7e43
a3c.epochCount: 0
a3c.stepCount: 0
a3c.historyProcessor: null
a3c.progressMonitorFrequency: 20000
a3c.mdp: rta.GymEnvRTA@1a5b8489
a3c.learningConfiguration: A3CLearningConfiguration(numThreads=1, nStep=42, learnerUpdateFrequency=5)
a3c.neuralNetwork(actorCritic): ActorCritic2: org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate; org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate@4c2869a9
actorCritic.isRecurrent: true
a3c.nn: ActorCritic2: org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate; org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate@4c2869a9
a3c.nn: ActorCritic2: org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate; org.deeplearning4j.rl4j.network.ac.ActorCriticSeparate@4c2869a9
a3c.nn.class: rta.RTA$ActorCritic2
actorCritic.neuralNetwork[0]: org.deeplearning4j.nn.multilayer.MultiLayerNetwork@62e7dffa
actorCritic.neuralNetwork[0].optimizer: org.deeplearning4j.optimize.solvers.StochasticGradientDescent@35835fa
   nnConfig: NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[L2Regularization(l2=FixedSchedule(value=5.0E-4))], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=3, nOut=50, timeDistributedFormat=NCW), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[0_W, 0_b, 1_W, 1_b, 2_W, 2_RW, 2_b, 3_W, 3_b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0)
   found inputPreProcessor with key[0]: RnnToFeedForwardPreProcessor(rnnDataFormat=NCW)
   found inputPreProcessor with key[2]: FeedForwardToRnnPreProcessor(rnnDataFormat=NCW)
actorCritic.neuralNetwork[1]: org.deeplearning4j.nn.multilayer.MultiLayerNetwork@6edcd0d8
actorCritic.neuralNetwork[1].optimizer: org.deeplearning4j.optimize.solvers.StochasticGradientDescent@56f71edb
   nnConfig: NeuralNetConfiguration(layer=DenseLayer(super=FeedForwardLayer(super=BaseLayer(activationFn=relu, weightInitFn=org.deeplearning4j.nn.weights.WeightInitXavier@1, biasInit=0.0, gainInit=1.0, regularization=[], regularizationBias=[], iUpdater=Adam(learningRate=5.0E-4, learningRateSchedule=null, beta1=0.9, beta2=0.999, epsilon=1.0E-8), biasUpdater=null, weightNoise=null, gradientNormalization=None, gradientNormalizationThreshold=1.0), nIn=3, nOut=50, timeDistributedFormat=NCW), hasLayerNorm=false, hasBias=true), miniBatch=true, maxNumLineSearchIterations=5, seed=12345, optimizationAlgo=STOCHASTIC_GRADIENT_DESCENT, variables=[0_W, 0_b, 1_W, 1_b, 2_W, 2_RW, 2_b, 3_W, 3_b], stepFunction=null, minimize=true, cacheMode=NONE, dataType=FLOAT, iterationCount=0, epochCount=0)
   found inputPreProcessor with key[0]: RnnToFeedForwardPreProcessor(rnnDataFormat=NCW)
   found inputPreProcessor with key[2]: FeedForwardToRnnPreProcessor(rnnDataFormat=NCW)
training START
12:44:47.474 [main] INFO org.deeplearning4j.rl4j.learning.async.AsyncLearning - AsyncLearning training starting.
12:44:47.505 [main] INFO org.deeplearning4j.rl4j.learning.async.AsyncLearning - Threads launched.
12:44:47.505 [Thread-2] INFO org.deeplearning4j.rl4j.learning.async.AsyncThread - ThreadNum-0 Started!
12:44:48.005 [Thread-2] INFO org.deeplearning4j.rl4j.learning.async.AsyncThread - ThreadNum-0 Episode step: 1, Episode: 1, Epoch: 1, reward: 0.0
12:44:48.068 [Thread-2] INFO org.deeplearning4j.rl4j.learning.async.AsyncThread - ThreadNum-0 Episode step: 1, Episode: 2, Epoch: 2, reward: 120000.0
...

Sorry for the crappy formatting, not sure how to do it better as I am not allowed to upload txt and Java-files… only jpegs etc…

@petter that looks related to the input. We’re passing in a matrix and it needs to be 3d. Do you mind reshaping the input to be 1 x rows x columns instead?

@agibsonccc thanks a lot for your proposal, and yes, I guess you are right. I assume this means the reshape could/should be done in the gym environment, which is Python code, right? Or is there a way to do it on the Java side? I tried doing it in Python, but my Python reshaping skills are not up to it. Basically, what I have tried looks as follows:

In the Python gym env, the method step returns the observation. Originally it looks as follows (window_size=1; my observation consists of three values):

def step(self, action):

    observation = self._get_observation()
    return observation,…

def _get_observation(self):
    # Slice the last window_size rows up to (but not including) the current tick.
    res = self.signal_features[(self._current_tick - self.window_size):self._current_tick]
    return res

I have tried to change it to the following, but it still gives me the same exception as before:

def _get_observation(self):
    res = self.signal_features[(self._current_tick - self.window_size):self._current_tick]
    return res.reshape(1, 3)

Just to make sure I also tried the following (which I guess should do the same):

def _get_observation(self):
    res = self.signal_features[(self._current_tick - self.window_size):self._current_tick]
    return res.reshape(3)

Then I also tried the following, but also same error:

def _get_observation(self):
    res = self.signal_features[(self._current_tick - self.window_size):self._current_tick]
    res2 = res.view()
    # shape is an attribute, so it is assigned rather than called
    res2.shape = (1, 3)
    return res2
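
For reference, here is a quick NumPy check of the ranks these attempts produce. It is only a sketch: it assumes `signal_features` is a 2-D array with one row per tick and three columns, and the array contents and tick values below are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for self.signal_features: 10 ticks, 3 features each.
signal_features = np.arange(30, dtype=np.float32).reshape(10, 3)
current_tick, window_size = 5, 1

res = signal_features[(current_tick - window_size):current_tick]
print(res.shape)                # (1, 3): already a 2-D matrix
print(res.reshape(1, 3).shape)  # (1, 3): unchanged, still rank 2
print(res.reshape(3).shape)     # (3,): rank 1

# "1 x rows x columns" means adding a leading axis, which gives rank 3.
obs_3d = res.reshape(1, *res.shape)
print(obs_3d.shape)             # (1, 1, 3)
```

None of the three attempts changes the rank to 3, which would be consistent with seeing the same exception each time. Whether an extra axis added on the Python side survives the trip to the Java side is not something this sketch can show.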

@petter you’d need to make the input to the network the proper shape. You’re not showing me where you’re feeding in the data.

Beyond that, there are no “reshaping skills” required. I’m going to ignore that. I already told you what shape it needs to be, and nd4j has the exact same calls the other tensor frameworks do. The only concept you need to know is that LSTM input is 3D: number of samples, number of features, and number of time steps (think of the number of seconds or something), as I described earlier.
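
As a rough illustration of that layout, the NumPy sketch below turns a window of observations into a [samples, features, time steps] array, which matches the `rnnDataFormat=NCW` shown in the log above. The helper name `to_rnn_input` and the sample values are hypothetical, and the sketch assumes the window arrives as one row per time step.

```python
import numpy as np

def to_rnn_input(window):
    """Turn a (time_steps, n_features) window into (1, n_features, time_steps)."""
    window = np.asarray(window, dtype=np.float32)
    if window.ndim != 2:
        raise ValueError(f"expected a 2-D window, got shape {window.shape}")
    # Swap axes so features come before time steps, then add the sample axis.
    return window.T[np.newaxis, :, :]

# Made-up window: 2 time steps, 3 features per step.
window = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
x = to_rnn_input(window)
print(x.shape)     # (1, 3, 2)
print(x[0, :, 0])  # features at the first time step: [1. 2. 3.]
```

With window_size=1 and three features, the same helper would give a (1, 3, 1) array: one sample, three features, one time step.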

Read more on that here if you want to learn more:

That being said, show me where your input data is coming from and we can figure out how to get the right reshape calls to happen. Thanks!