Trained neural network hangs

I hope I can explain my problem well.

I have trained a DQN and then saved the network.

C must find E (see figure). C must go around O and must not leave the space. The space is 22x30; C, O, and E each occupy one field.

The whole thing works wonderfully during training.

When I load and execute the saved network, C gets stuck at some point at O or at the edge of the space. When C gets stuck, it just alternates, for example, between the fields to its left and right.

So far I have worked around it by adding an epsilon-greedy step at inference time.
Then C doesn’t get stuck anymore, but I don’t know if that is a good solution.

This is my code to load and run the saved network:

MultiLayerNetwork multiLayerNetwork = NetzwerkUtil.loadNetwork("network.zip");
Training training = new Training();

while (true) {

    Gamestatus gamestatus = training.gamestatus();

    final INDArray output = multiLayerNetwork.output(gamestatus.getMatrix(), false);
    double[] data = output.data().asDouble();

    int valueDirection = 0;
    Random r = new Random();

    // epsilon-greedy
    if (r.nextDouble() < 0.2)
        valueDirection = r.nextInt(4);
    else
        valueDirection = getMaxValue(data);

    training.changeDirection(valueDirection);
    training.move();

    if (isEnergyNeighbor()) {
        setNewEnergyPos();
    }
}
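For reference, getMaxValue(data) is never shown in the thread; presumably it just returns the index of the largest Q-value, i.e. the greedy action. A minimal sketch of what such a helper might look like (the class name is made up):

```java
// Hypothetical sketch of the getMaxValue helper: returns the index of the
// largest Q-value, which maps to the greedy action (up, right, down, left).
public class GreedyAction {

    public static int getMaxValue(double[] data) {
        int maxIndex = 0;
        for (int i = 1; i < data.length; i++) {
            if (data[i] > data[maxIndex]) {
                maxIndex = i;
            }
        }
        return maxIndex;
    }

    public static void main(String[] args) {
        // Example Q-values: index 2 ("down") is the largest
        double[] q = {0.10, 0.67, 0.75, 0.41};
        System.out.println(getMaxValue(q));
    }
}
```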

@ynot could you clarify what “stuck” means? As in it’s not training? The learning is stuck in a local minimum?

To be clear, I have already finished training the network. It doesn’t get stuck there.
It has learned well…

So I start the training.

final Training training = new Training();

final String networkname = "network.zip";

final QLearningDiscreteDense dql = new QLearningDiscreteDense<>(
        enviroment,
        NetzwerkUtil.buildDQNFactory(),
        NetzwerkUtil.buildConfig()
);

// Start the training
dql.train();
enviroment.close();

// Save the network
try {
    dql.getNeuralNet().save(networkname);
} catch (IOException e) {
    // LOG.error(e.getMessage(), e);
}

then I load the network

MultiLayerNetwork multiLayerNetwork = NetzwerkUtil.loadNetwork("network.zip");

… see above post

and there it gets stuck after about 200 iterations.

@ynot sorry, could you still be more specific about what “hang” means? Remember, I’m on the internet, not looking at your environment. I’m really trying to understand here. Freezing can mean a lot of things.

You keep saying you “trained the network” yet all I have to work with is…training code. It’s natural for me to assume that’s what you mean. The word “iterations” is also associated with training.

Sorry but you’re really not helping me here. “Freezes after 200 iterations” says to me “training process” if it’s not that then what is it? Both your code and your wording here suggest training.

Is there a difference between training the network and running the trained network?

After the while loop (see the post above) has run for about 200 iterations, C only alternates between, for example, up and down. And this only happens when C is next to O or at the edge of the space.

with

double[] data = output.data().asDouble();

I get the values for the four actions (up, right, down, left)

for example

data = [0.105776511132723232, 0.6717172468185425, 0.7585665711212158, 0.7527502183914185]

and C is here (see figure)


Now I get the highest value from the data (index 2 = down) and move C

with

valueDirection = getMaxValue(data);
training.changeDirection(valueDirection);
training.move();

C goes “down” one field.

In the next iteration, the data is

data = [0.8845572471618652, 0.6729172468185425, 0.7911865711212158, 0.6647502183914185]

C goes “up” one field again, and so on…
C is supposed to go around O, but it doesn’t.
This only happens when I load the trained network and then run it.

@ynot so you are only performing inference. Thanks for clarifying.

Running training and then inference in the same process really only has one common side effect, and that is usually memory.

A common “hang” is just the JVM freezing for a while when it is almost out of heap space, until it eventually throws an OutOfMemoryError.
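If no flags are given, you can ask the running JVM what heap limits it picked for itself. This is just a stdlib sketch (the class name is made up), not part of the thread’s code:

```java
// Prints the heap limits the JVM chose, useful when no -Xmx flag was given.
public class HeapInfo {

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("max heap:  " + rt.maxMemory() / (1024 * 1024) + " MB");
        System.out.println("allocated: " + rt.totalMemory() / (1024 * 1024) + " MB");
        System.out.println("free:      " + rt.freeMemory() / (1024 * 1024) + " MB");
    }
}
```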

Beyond that, you showed me what I was asking about. You trained the network in a separate process and now attempt to perform inference with it.

Yes training is a very specific term. That’s what builds the model and updates it. Using the model is…using the model. Typically we call this inference.

Firstly we need to see how you’re running it then. Could you clarify that a bit?

Are you running it in anything special like a docker container?

What kind of machine are you running it on?

What flags, if any, are you running the process with? Are you specifying any JVM flags like -Xmx?

Lastly, how big is the model you’re running? Do you mind printing the model with model.summary()?

How is the model used?

An action is triggered from the frontend, so that a player “P” moves one field and C moves one field.

In this case C should move one field to E.

Which machine do I use?

I run it on my CPU.

With which flags, if any, do you run the process?

I don’t specify any flags.
Which ones could I specify?

How big is the model?

(screenshot of the model.summary() output)

@ynot thanks a lot, that helps. So from your last post you mentioned Spring Boot. You basically kick off an RL job that uses the model for inference, invoked from a REST API, is that correct?

On your CPU and the like could you be more specific? Like what kind of CPU do you have? How much RAM do you have to work with?

For your model…that’s actually not that big then. So something else has to be going on.

If you aren’t specifying any flags, that means Java just picks the heap size for you. Usually that “guess” is pretty good, so let’s not worry about it for now.

Let’s start looking into more general things. If performance and memory aren’t your bottleneck, it might be a more general Java problem. Could you run jps to get your process ID, and then jstack $PID, where $PID is the process ID jps reports for your Java process?

A thread dump can tell us where it gets stuck. Do that when it freezes.
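As a side note, if jps lists several Java processes and it is unclear which one to dump, on Java 9+ the process can print its own PID, ready to paste into jstack. A small sketch (the class name is made up):

```java
// Prints the jstack command for the current JVM's own process ID (Java 9+).
public class PidInfo {

    public static void main(String[] args) {
        long pid = ProcessHandle.current().pid();
        System.out.println("jstack " + pid);
    }
}
```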

So it only gets stuck because getMaxValue(data) keeps returning the values for “up” and “down”?

  1. Yes, this is correct

  2. Thread Dump

@ynot make sure you’re dumping the right process. I don’t see any Spring or DL4J stack traces in there. This looks like the IntelliJ process. jps should output more than one process.

I am currently launching the model from the Main…

JPS and JSTACK - Command

Thread Dump with id 11712

@ynot this sticks out:

java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(java.base@16.0.2/Native Method)
        at netzwerk.util.NetzwerkUtil.waitMs(NetzwerkUtil.java:72)
        at netzwerk.MainTraining.bewerteNSC(MainTraining.java:94)
        at netzwerk.MainTraining.main(MainTraining.java:21)

Do you have some sort of an infinite loop somewhere?
The only other code I see is related to our deallocator. I would double check your main method to make sure you’re not sleeping too much or something.

You could potentially be running in to a deadlock somewhere but let’s keep the problems simple. Usually problems like this have a simple solution.

Beyond that, I would recommend turning on nd4j debug and verbose:

Nd4j.getExecutioner().enableVerboseMode(true);
Nd4j.getExecutioner().enableDebugMode(true);

That will at least show whenever math code is running.
Note: don’t keep this on when running your math code in real work; it produces not only a lot of logs but also a big performance hit.

C changes position to l4, then m4, then l4 again, then m4, and so on.

Before that, C found E 25 times.

I let the thread sleep for 10 ms.

@ynot I won’t comment on the specific state of your code without seeing the whole thing or being able to run something locally.

Could you just see if when it stops you see any patterns besides your sleeping thread?

Are you running a server? Since this is a REST API, I assume so.
If so, the process would just stay up until it receives the next request. Are you sure it’s not just finishing the output and then waiting for the next request?

Do you try a request more than once to see if it still responds?

I call the model for testing from the main.
There is no server running.

final MultiLayerNetwork multiLayerNetwork = NetzwerkUtil.loadNetwork("network.zip");

System.out.println(multiLayerNetwork.summary());

Nd4j.getExecutioner().enableVerboseMode(true);
Nd4j.getExecutioner().enableDebugMode(true);

Training training = new Training();
for (int i = 0; i < 10000; i++) {

    final Gamestatus gamestatus = training.gamestatus();
    final INDArray output = multiLayerNetwork.output(gamestatus.getMatrix(), false);
    double[] data = output.data().asDouble();

    int maxValueIndex = getMaxValue(data);

    training.changeDirection(maxValueIndex);
    training.move();

    if (isNeighborEnergy()) {
        setNewPosEnergy();
    }

    // Needed so that we can see more easily what the game is doing
    // NetzwerkUtil.waitMs(10);
}

If I add epsilon-greedy, it works.

But that doesn’t seem like a good solution, does it?
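If epsilon-greedy is kept at inference time as a workaround, a decaying epsilon is a common compromise: the agent explores early on but becomes (almost) greedy over time. This is only a sketch; the class name and decay constants are illustrative, not from the thread:

```java
import java.util.Random;

// Sketch of epsilon-greedy with a decaying epsilon: exploration fades out
// instead of staying at a fixed 0.2 forever. The constants are not tuned.
public class DecayingEpsilon {
    private double epsilon = 0.2;          // initial exploration rate
    private final double decay = 0.995;    // multiplicative decay per step
    private final double minEpsilon = 0.01;
    private final Random random = new Random();

    public int chooseAction(double[] qValues) {
        int action;
        if (random.nextDouble() < epsilon) {
            action = random.nextInt(qValues.length); // explore
        } else {
            action = argmax(qValues);                // exploit
        }
        epsilon = Math.max(minEpsilon, epsilon * decay);
        return action;
    }

    private static int argmax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) best = i;
        }
        return best;
    }
}
```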

@ynot does it stop on a particular loop number? You mentioned 200 before, I think. Beyond that I would have to dive deeper tomorrow. Try to figure out exactly when it stops and give me a trace of that. I guess you’ve already done that, but whatever you can give me would be great.

It depends on how many objects are present in the space.

It gets stuck at the objects and at the edge of the space.

So the network runs but is just exhibiting dumb behavior? In that case it is just a learning problem: train it more and try tuning the network a bit.
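One common alternative to a hard argmax, sometimes used to break exactly this kind of two-step oscillation, is softmax (Boltzmann) action selection: actions with near-equal Q-values get near-equal probabilities, so the agent cannot lock into a deterministic back-and-forth. A sketch (the class name and temperature value are illustrative):

```java
import java.util.Random;

// Sketch of softmax (Boltzmann) action selection as an alternative to a hard
// argmax. Each action is sampled with probability proportional to
// exp(Q / temperature); a lower temperature approaches pure argmax.
public class SoftmaxPolicy {
    private static final Random RANDOM = new Random();

    public static int sample(double[] qValues, double temperature) {
        double[] weights = new double[qValues.length];
        double sum = 0.0;
        for (int i = 0; i < qValues.length; i++) {
            weights[i] = Math.exp(qValues[i] / temperature);
            sum += weights[i];
        }
        // Sample an index proportionally to its weight
        double r = RANDOM.nextDouble() * sum;
        double acc = 0.0;
        for (int i = 0; i < weights.length; i++) {
            acc += weights[i];
            if (r <= acc) return i;
        }
        return weights.length - 1;
    }
}
```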

You aren’t being very clear about what’s wrong so I have had to just do process of elimination…

If my guess is correct then do what I suggested.
If it is another problem related to the functioning of the software itself then try to figure out if it runs out of memory when it renders more.

In that case, look into jvisualvm or YourKit and try to see more about what the program is doing internally.