Trained neural network hangs

I hope I can explain my problem well.

I have trained a DQN and then saved the network.

C must find E (see figure). C must go around O and must not leave the space. The space is 22x30; C, O, and E each occupy one field.

The whole thing works wonderfully during training.

When I load and execute the saved network, C gets stuck at some point at O or at the edge of the space. When C gets stuck, it just alternates, for example, between the fields to its left and right.

So far I have worked around it by adding an epsilon-greedy step at inference time.
Then C doesn’t get stuck anymore, but I don’t know if that is a good solution.

This is my code to load and run the saved network:

MultiLayerNetwork multiLayerNetwork = NetzwerkUtil.loadNetwork("network.zip");
Training training = new Training();

while (true) {

    Gamestatus gamestatus = training.gamestatus();

    final INDArray output = multiLayerNetwork.output(gamestatus.getMatrix(), false);
    double[] data = output.data().asDouble();

    int valueDirection = 0;
    Random r = new Random();

    // epsilon-greedy
    if (r.nextDouble() < 0.2)
        valueDirection = r.nextInt(4);
    else
        valueDirection = getMaxValue(data);

    training.changeDirection(valueDirection);
    training.move();

    if (isEnergyNeighbor()) {
        setNewEnergyPos();
    }
}
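For reference, getMaxValue(data) is never shown in the thread; presumably it just returns the index of the largest Q-value, i.e. the greedy action. A minimal sketch of what such a helper might look like (the class name is made up):

```java
// Hypothetical sketch of the getMaxValue helper: returns the index of the
// largest Q-value, which maps to the greedy action (up, right, down, left).
public class GreedyAction {

    public static int getMaxValue(double[] data) {
        int maxIndex = 0;
        for (int i = 1; i < data.length; i++) {
            if (data[i] > data[maxIndex]) {
                maxIndex = i;
            }
        }
        return maxIndex;
    }

    public static void main(String[] args) {
        // Example Q-values: index 2 ("down") is the largest
        double[] q = {0.10, 0.67, 0.75, 0.41};
        System.out.println(getMaxValue(q));
    }
}
```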

@ynot could you clarify what “stuck” means? As in it’s not training? The learning is stuck in a local minimum?

To be clear, I have already finished training the network. It doesn’t get stuck there.
It has learned well…

So I start the training.

final Training training = new Training();

final String networkname = "network.zip";

final QLearningDiscreteDense dql = new QLearningDiscreteDense<>(
        enviroment,
        NetzwerkUtil.buildDQNFactory(),
        NetzwerkUtil.buildConfig()
);

// Start the training
dql.train();
enviroment.close();

// Save the network
try {
    dql.getNeuralNet().save(networkname);
} catch (IOException e) {
    // LOG.error(e.getMessage(), e);
}

then I load the network

MultiLayerNetwork multiLayerNetwork = NetzwerkUtil.loadNetwork("network.zip");

… see above post

and there it gets stuck after about 200 iterations.

@ynot sorry, could you still be more specific about what “hang” means? Remember, I’m on the internet, not looking at your environment. I’m really trying to understand here. Freezing can mean a lot of things.

You keep saying you “trained the network” yet all I have to work with is…training code. It’s natural for me to assume that’s what you mean. The word “iterations” is also associated with training.

Sorry but you’re really not helping me here. “Freezes after 200 iterations” says to me “training process” if it’s not that then what is it? Both your code and your wording here suggest training.

Is there a difference between training the network and running the trained network?

After the while loop (see the post above) has run for about 200 iterations, C only alternates between, for example, up and down. And this only happens when C is next to O or at the edge of the space.

with

double[] data = output.data().asDouble();

I get the values for the four actions (up, right, down, left)

for example

data = [0.105776511132723232, 0.6717172468185425, 0.7585665711212158, 0.7527502183914185]

and C is here (see figure)


Now I get the highest value from the data (index 2 = down) and move C

with

valueDirection = getMaxValue(data);
training.changeDirection(valueDirection);
training.move();

C goes “down” one field.

In the next iteration, the data is

data = [0.8845572471618652, 0.6729172468185425, 0.7911865711212158, 0.6647502183914185]

C goes “up” one field again, and so on…
C is supposed to go around O, but it doesn’t.
This only happens when I load the trained network and then run it.

@ynot so you are only performing inference. Thanks for clarifying.

Running training and then inference in the same process really only has one common side effect, and that is usually memory.

A common “hang” is just the JVM freezing for a while when it is almost out of heap space, until it eventually throws an OutOfMemoryError.
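If no flags are given, you can ask the running JVM what heap limits it picked for itself. This is just a stdlib sketch (the class name is made up), not part of the thread’s code:

```java
// Prints the heap limits the JVM chose, useful when no -Xmx flag was given.
public class HeapInfo {

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("max heap:  " + rt.maxMemory() / (1024 * 1024) + " MB");
        System.out.println("allocated: " + rt.totalMemory() / (1024 * 1024) + " MB");
        System.out.println("free:      " + rt.freeMemory() / (1024 * 1024) + " MB");
    }
}
```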

Beyond that, you showed me what I was asking about. You trained the network in a separate process and now attempt to perform inference with it.

Yes training is a very specific term. That’s what builds the model and updates it. Using the model is…using the model. Typically we call this inference.

Firstly we need to see how you’re running it then. Could you clarify that a bit?

Are you running it in anything special like a docker container?

What kind of machine are you running it on?

What flags, if any, are you running the process with? Are you specifying any JVM flags like -Xmx?

Lastly, how big is the model you’re running? Do you mind printing the model with model.summary()?

How is the model used?

An action is triggered from the frontend, so that a player “P” moves one field and C moves one field.

In this case C should move one field to E.

Which machine do I use?

I run it on my CPU.

With which flags, if any, do you run the process?

I don’t specify any flags.
Which ones could I specify?

How big is the model?

(screenshot of the model.summary() output)

@ynot thanks a lot, that helps. So from your last post you mentioned Spring Boot. You basically kick off an RL job that uses the model for inference, invoked from a REST API, is that correct?

On your CPU and the like could you be more specific? Like what kind of CPU do you have? How much RAM do you have to work with?

For your model…that’s actually not that big then. So something else has to be going on.

If you aren’t specifying any flags, that means Java just picks the heap size for you. Usually that “guess” is pretty good, so let’s not worry about it for now.

Let’s start looking into more general things. If performance and memory aren’t your bottleneck, it might be a more general Java problem. Could you run jps to get your process ID, and then jstack $PID, where $PID is the process ID jps reports for your Java process?

A thread dump can tell us where it gets stuck. Do that when it freezes.
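As a side note, if jps lists several Java processes and it is unclear which one to dump, on Java 9+ the process can print its own PID, ready to paste into jstack. A small sketch (the class name is made up):

```java
// Prints the jstack command for the current JVM's own process ID (Java 9+).
public class PidInfo {

    public static void main(String[] args) {
        long pid = ProcessHandle.current().pid();
        System.out.println("jstack " + pid);
    }
}
```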

So it only gets stuck because getMaxValue(data) keeps returning the values for “up” and “down”?

  1. Yes, this is correct

  2. Thread Dump

@ynot make sure you’re dumping the right process. I don’t see any Spring or DL4J stack traces in there. This looks like the IntelliJ process. jps should output more than one process.

I am currently launching the model from the Main…

JPS and JSTACK - Command

Thread Dump with id 11712

@ynot this sticks out:

java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(java.base@16.0.2/Native Method)
        at netzwerk.util.NetzwerkUtil.waitMs(NetzwerkUtil.java:72)
        at netzwerk.MainTraining.bewerteNSC(MainTraining.java:94)
        at netzwerk.MainTraining.main(MainTraining.java:21)

Do you have some sort of an infinite loop somewhere?
The only other code I see is related to our deallocator. I would double check your main method to make sure you’re not sleeping too much or something.

You could potentially be running in to a deadlock somewhere but let’s keep the problems simple. Usually problems like this have a simple solution.

Beyond that, I would recommend turning on nd4j debug and verbose:

Nd4j.getExecutioner().enableVerboseMode(true);
Nd4j.getExecutioner().enableDebugMode(true);

That will at least show whenever math code is running.
Note: don’t keep this on when running your math code in real work; it produces not only a lot of logs but also a big performance hit.

C changes position to l4, then m4, then l4 again, then m4, and so on.

Before that, C found E 25 times.

I let the thread sleep for 10 ms.

@ynot I won’t comment on the specific state of your code without seeing the whole thing or being able to run something locally.

Could you just see if when it stops you see any patterns besides your sleeping thread?

Are you running a server? Since this is a REST API, I assume so.
If so, the process would just stay up until it receives the next request. Are you sure it’s not just finishing the output and then waiting for the next request?

Do you try a request more than once to see if it still responds?

I call the model for testing from the main.
There is no server running.

final MultiLayerNetwork multiLayerNetwork = NetzwerkUtil.loadNetwork("network.zip");

System.out.println(multiLayerNetwork.summary());

Nd4j.getExecutioner().enableVerboseMode(true);
Nd4j.getExecutioner().enableDebugMode(true);

Training training = new Training();
for (int i = 0; i < 10000; i++) {

    final Gamestatus gamestatus = training.gamestatus();
    final INDArray output = multiLayerNetwork.output(gamestatus.getMatrix(), false);
    double[] data = output.data().asDouble();

    int maxValueIndex = getMaxValue(data);

    training.changeDirection(maxValueIndex);
    training.move();

    if (isNeighborEnergy()) {
        setNewPosEnergy();
    }

    // Needed so that we can see more easily what the game is doing
    // NetzwerkUtil.waitMs(10);
}

If I add epsilon-greedy, it works.

But that doesn’t seem like a good solution, does it?
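If epsilon-greedy is kept at inference time as a workaround, a decaying epsilon is a common compromise: the agent explores early on but becomes (almost) greedy over time. This is only a sketch; the class name and decay constants are illustrative, not from the thread:

```java
import java.util.Random;

// Sketch of epsilon-greedy with a decaying epsilon: exploration fades out
// instead of staying at a fixed 0.2 forever. The constants are not tuned.
public class DecayingEpsilon {
    private double epsilon = 0.2;          // initial exploration rate
    private final double decay = 0.995;    // multiplicative decay per step
    private final double minEpsilon = 0.01;
    private final Random random = new Random();

    public int chooseAction(double[] qValues) {
        int action;
        if (random.nextDouble() < epsilon) {
            action = random.nextInt(qValues.length); // explore
        } else {
            action = argmax(qValues);                // exploit
        }
        epsilon = Math.max(minEpsilon, epsilon * decay);
        return action;
    }

    private static int argmax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) best = i;
        }
        return best;
    }
}
```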

@ynot does it stop on a particular loop number? You mentioned 200 before, I think. Beyond that I would have to dive deeper tomorrow. Try to figure out exactly when it stops and give me a trace of that. I guess you’ve already done that, but whatever you can give me would be great.

It depends on how many objects are present in the space.

It gets stuck at the objects and at the edge of the space.

So the network runs but is just exhibiting dumb behavior? In that case it is just a learning problem: train it more and try tuning the network a bit.
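One common alternative to a hard argmax, sometimes used to break exactly this kind of two-step oscillation, is softmax (Boltzmann) action selection: actions with near-equal Q-values get near-equal probabilities, so the agent cannot lock into a deterministic back-and-forth. A sketch (the class name and temperature value are illustrative):

```java
import java.util.Random;

// Sketch of softmax (Boltzmann) action selection as an alternative to a hard
// argmax. Each action is sampled with probability proportional to
// exp(Q / temperature); a lower temperature approaches pure argmax.
public class SoftmaxPolicy {
    private static final Random RANDOM = new Random();

    public static int sample(double[] qValues, double temperature) {
        double[] weights = new double[qValues.length];
        double sum = 0.0;
        for (int i = 0; i < qValues.length; i++) {
            weights[i] = Math.exp(qValues[i] / temperature);
            sum += weights[i];
        }
        // Sample an index proportionally to its weight
        double r = RANDOM.nextDouble() * sum;
        double acc = 0.0;
        for (int i = 0; i < weights.length; i++) {
            acc += weights[i];
            if (r <= acc) return i;
        }
        return weights.length - 1;
    }
}
```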

You aren’t being very clear about what’s wrong so I have had to just do process of elimination…

If my guess is correct then do what I suggested.
If it is another problem related to the functioning of the software itself then try to figure out if it runs out of memory when it renders more.

In that case, look into jvisualvm or YourKit and try to see more about what the program is doing internally.