The whole thing works wonderfully during training.
When I load and execute the saved network, C gets stuck at some point, either at O or at the edge of the space. When C gets stuck it only alternates, for example, between the fields to its left and right.
My workaround so far is to add an epsilon-greedy function to the action selection.
C no longer gets stuck then, but I don’t know whether that is a good solution.
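The epsilon-greedy step looks roughly like this (just a sketch, not my exact code; epsilon is a placeholder value and getMaxValue is my normal greedy selection of the action with the highest output):

import java.util.Random;

// Sketch only: with probability epsilon pick a random action,
// otherwise fall back to the normal greedy selection.
private static final Random RANDOM = new Random();

int chooseAction(double[] data, double epsilon) {
    if (RANDOM.nextDouble() < epsilon) {
        return RANDOM.nextInt(data.length);   // explore: random index of (up, right, down, left)
    }
    return getMaxValue(data);                 // exploit: index of the highest network output
}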
This is my code to load the saved network:

MultiLayerNetwork multiLayerNetwork = NetzwerkUtil.loadNetwork("network.zip");
Training training = new Training();
while (true) {
    Gamestatus gamestatus = training.gamestatus();
    final INDArray output = multiLayerNetwork.output(gamestatus.getMatrix(), false);
    double[] data = output.data().asDouble();
    // ... action selection and game update happen here
}
@ynot sorry, could you be more specific about what “hang” means? Remember I’m on the internet, not looking at your environment here. I’m really trying to understand. Freezing can mean a lot of things.
You keep saying you “trained the network” yet all I have to work with is…training code. It’s natural for me to assume that’s what you mean. The word “iterations” is also associated with training.
Sorry, but you’re really not helping me here. “Freezes after 200 iterations” says “training process” to me; if it’s not that, then what is it? Both your code and your wording here suggest training.
Is there a difference between training the network and running the trained network?
Once the while loop (see the post above) has run for about 200 iterations, C only alternates between, for example, up and down. And this only happens when C is at O or at the edge of the space.
With
double[] data = output.data().asDouble();
I get the values for the actions (up, right, down, left), for example:
data = [0.105776511132723232, 0.6717172468185425, 0.7585665711212158, 0.7527502183914185]
and C is here (see figure)
Now I take the index of the highest value from the data (index 2 = down) and move C accordingly.
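That selection step is basically an argmax; on the sample output above it looks roughly like this (just a sketch):

// Sketch of the argmax step: pick the index of the largest output value.
double[] data = {0.105776511132723232, 0.6717172468185425, 0.7585665711212158, 0.7527502183914185};
int maxValueIndex = 0;
for (int i = 1; i < data.length; i++) {
    if (data[i] > data[maxValueIndex]) {
        maxValueIndex = i;
    }
}
// maxValueIndex == 2 here, which maps to "down"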
@ynot so you are only performing inference. Thanks for clarifying.
Running training and then inference can really only have one kind of side effect, and that is usually memory.
A common hang is just the JVM staying frozen for a while when it is almost out of heap space, until it eventually throws an OutOfMemoryError.
Beyond that, you showed me what I was asking about. You trained the network in a separate process and now attempt to perform inference with it.
Yes training is a very specific term. That’s what builds the model and updates it. Using the model is…using the model. Typically we call this inference.
Firstly we need to see how you’re running it then. Could you clarify that a bit?
Are you running it in anything special like a docker container?
What kind of machine are you running it on?
What flags, if any, are you running the process with? Are you specifying any JVM flags like -Xmx?
Lastly, how big is the model you’re running? Do you mind printing the model with model.summary()?
@ynot thanks a lot, that helps. So from your last post you mentioned Spring Boot. So you basically kick off an RL job that uses the model for inference by invoking it from a REST API, is that correct?
Regarding your CPU and the like, could you be more specific? What kind of CPU do you have? How much RAM do you have to work with?
For your model…that’s actually not that big then. So something else has to be going on.
If you aren’t specifying any flags, that means Java just picks the heap size for you. Usually that “guess” is pretty good, so let’s not worry about it for now.
Let’s start looking into more general things. If performance and memory aren’t your bottleneck, it might be a more general Java problem. Could you run jps to get your process ID, and then jstack $PID, where $PID is the process ID jps reports for the Java process?
A thread dump can tell us where it gets stuck. Do that when it freezes.
@ynot make sure you’re dumping the right process. I don’t see any Spring or DL4J stack traces in there. This looks like the IntelliJ process. jps should output more than one process.
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(java.base@16.0.2/Native Method)
at netzwerk.util.NetzwerkUtil.waitMs(NetzwerkUtil.java:72)
at netzwerk.MainTraining.bewerteNSC(MainTraining.java:94)
at netzwerk.MainTraining.main(MainTraining.java:21)
Do you have some sort of an infinite loop somewhere?
The only other code I see is related to our deallocator. I would double check your main method to make sure you’re not sleeping too much or something.
You could potentially be running into a deadlock somewhere, but let’s keep the problems simple. Usually problems like this have a simple solution.
Beyond that, I would recommend turning on nd4j debug and verbose:
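Something along these lines, before you run the inference loop (these are the standard ND4J executioner calls):

import org.nd4j.linalg.factory.Nd4j;

// Turn on ND4J verbose and debug mode so every op execution gets logged.
Nd4j.getExecutioner().enableVerboseMode(true);
Nd4j.getExecutioner().enableDebugMode(true);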
That will at least show whenever math code is running.
Note: don’t keep this on when running your math code in real work, otherwise it’s not only a lot of logs but also a big performance hit.
@ynot I won’t comment on the specific state of your code without seeing the whole thing or being able to run something locally.
Could you just see if when it stops you see any patterns besides your sleeping thread?
Are you running a server? Since this is a REST API, I assume so.
If so, it would just stay up until it receives the next request. Are you sure it isn’t simply finishing the output and then waiting for the next request?
Have you tried sending a request more than once to see if it still responds?
I call the model for testing from the main.
There is no server running.
final MultiLayerNetwork multiLayerNetwork = NetzwerkUtil.loadNetwork("network.zip");
System.out.println(multiLayerNetwork.summary());
Nd4j.getExecutioner().enableVerboseMode(true);
Nd4j.getExecutioner().enableDebugMode(true);
Training training = new Training();
for (int i = 0; i < 10000; i++) {
    final Gamestatus gamestatus = training.gamestatus();
    INDArray d = gamestatus.getMatrix();
    final INDArray output = multiLayerNetwork.output(d, false);
    double[] data = output.data().asDouble();
    int maxValueIndex = getMaxValue(data);
    changeDirectory(maxValueIndex);
    move();
    if (isNeighborEnergy) {
        setNewPosEnergy();
    }
    // Needed so that it is easier to see what the game is doing
    // NetzwerkUtil.waitMs(10);
}
@ynot does it stop on a particular loop number? You mentioned 200 before, I think. Beyond that I would have to dive deeper tomorrow. Try to figure out exactly when it stops and give me a trace of that. I guess you’ve already done that, but whatever you can give me would be great.
So the network runs but is just exhibiting dumb behavior? In that case it is just a learning problem: train it more and try tuning the network a bit.
You aren’t being very clear about what’s wrong so I have had to just do process of elimination…
If my guess is correct then do what I suggested.
If it is another problem related to the functioning of the software itself, then try to figure out whether it runs out of memory when it renders more.
In that case, look into jvisualvm or YourKit and try to see more about what the program is doing internally.