Loading inputs from an ongoing simulation

The input to my convolution layer is BATCH_SIZE x 11 feature planes x a 9x9 chessboard. I’d like to describe the representation of the data in my simulation (in heap memory) and ask for advice on how best to convey it to the neural net for inference.

The first feature plane is derived from values bit-packed into a Java array of 9 ints: 18 bits of each int represent 9 values, each in the range 0-3 (2 bits per value), so the 9-int array holds 81 such values in total. In the feature plane, I want 0 where the simulation has 0, and 1 where the simulation has nonzero.
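
A minimal sketch of that unpacking, assuming 2 bits per square with square 0 in the lowest bits (the actual bit order in my project may differ):

```java
public class OccupancyPlane {
    // Unpack a 9-int bit-packed board (2 bits per square, 9 squares per int)
    // into a 9x9 binary plane: 1 where the stored value is nonzero, else 0.
    // Bit layout is an assumption: square c of row r lives in bits 2c..2c+1 of packed[r].
    public static float[][] unpackNonzeroPlane(int[] packed) {
        float[][] plane = new float[9][9];
        for (int row = 0; row < 9; row++) {
            for (int col = 0; col < 9; col++) {
                int value = (packed[row] >>> (col * 2)) & 0b11; // value in 0-3
                plane[row][col] = (value != 0) ? 1f : 0f;
            }
        }
        return plane;
    }
}
```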

The next seven features represent pieces on the board; there are seven different pieces, and I don’t think I need another plane to represent empty squares. In the simulation, again it’s an array of 9 ints, this time each using 27 bits, 3 per square, to represent empty or one of the seven pieces. In the neural net, these should become seven separate feature planes of 0s and 1s.
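
The expansion into one-hot planes might look like this (again assuming square 0 sits in the lowest bits; the real layout may differ):

```java
public class PiecePlanes {
    // Expand a 9-int bit-packed board (3 bits per square: 0 = empty, 1-7 = piece id)
    // into seven one-hot 9x9 planes, one per piece type. Bit layout is an
    // assumption: square c of row r lives in bits 3c..3c+2 of packed[r].
    public static float[][][] unpack(int[] packed) {
        float[][][] planes = new float[7][9][9];
        for (int row = 0; row < 9; row++) {
            for (int col = 0; col < 9; col++) {
                int code = (packed[row] >>> (col * 3)) & 0b111;
                if (code != 0) {
                    planes[code - 1][row][col] = 1f; // one-hot across the 7 planes
                }
            }
        }
        return planes;
    }
}
```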

The last three features are derived from a single java int each, in which 18 bits are used to represent 9 values each in the range 0-2. In the first of these three planes, the 9 values are duplicated across rows, like:

aaaaaaaaa
bbbbbbbbb
ccccccccc
ddddddddd
eeeeeeeee
fffffffff
ggggggggg
hhhhhhhhh
iiiiiiiii

In the second, across columns, like:

jklmnopqr
jklmnopqr
jklmnopqr
...
jklmnopqr

In the third, in a 3x3 pattern, like:

ssstttuuu
ssstttuuu
ssstttuuu
vvvwwwxxx
vvvwwwxxx
vvvwwwxxx
yyyzzzAAA
yyyzzzAAA
yyyzzzAAA
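
The three broadcast patterns above could be produced like this (assuming 2 bits per value with value 0 in the lowest bits; the real bit order may differ):

```java
public class BroadcastPlanes {
    // Pull value k (range 0-2) out of the 18-bit packed int; layout is an assumption.
    static int valueAt(int packed, int k) {
        return (packed >>> (k * 2)) & 0b11;
    }

    // Plane 1: value k fills all of row k.
    public static float[][] expandRows(int packed) {
        float[][] plane = new float[9][9];
        for (int k = 0; k < 9; k++)
            for (int col = 0; col < 9; col++)
                plane[k][col] = valueAt(packed, k);
        return plane;
    }

    // Plane 2: value k fills all of column k.
    public static float[][] expandCols(int packed) {
        float[][] plane = new float[9][9];
        for (int k = 0; k < 9; k++)
            for (int row = 0; row < 9; row++)
                plane[row][k] = valueAt(packed, k);
        return plane;
    }

    // Plane 3: value k fills the 3x3 block at block-row k/3, block-column k%3.
    public static float[][] expandBlocks(int packed) {
        float[][] plane = new float[9][9];
        for (int k = 0; k < 9; k++) {
            int r0 = (k / 3) * 3, c0 = (k % 3) * 3;
            for (int r = r0; r < r0 + 3; r++)
                for (int c = c0; c < c0 + 3; c++)
                    plane[r][c] = valueAt(packed, k);
        }
        return plane;
    }
}
```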

Now, my guess is that each time I want to do inference, I should convert these into java int arrays (or should I use byte arrays given that the values would fit?) and call Nd4j.create (or should I use Nd4j.createFromArray?), and do the rest of the transformation inside the ComputationGraph. I.e. the ComputationGraph is responsible for expanding the 9x9 piece array into the seven different layers each of 9x9 0s and 1s, and expanding the 3x3 into 9x9, then concatenating the 11 planes together. Is this the most performant way?

Generally, represent everything as ndarrays from the beginning. Otherwise you’ll have the overhead of moving data on and off the heap.
Java arrays and nd4j are fundamentally incompatible because ML models predominantly run in C/C++: nd4j keeps its data off-heap so it can hand buffers straight to native code.

Is there a reason why you have to have java arrays at all?

I implemented my simulation in Java. Just to check, there’s no way I should abandon java primitives for even the simulation, is there? Here are some things the simulation does a lot:

  • Draw a random card. The values in the range 0-2 I described earlier represent the quantity of each different card in the draw pile. To draw a card, I generate a random number x less than the total number of cards remaining (which I have cached), then iterate through the card counts to find the xth card.
  • For each square, count how many of its neighbors equal a certain value, making a list of the squares with certain counts, then shuffle that list.
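
For reference, the draw step described in the first bullet might look like this (names are illustrative, not from the actual project):

```java
import java.util.Random;

public class CardDraw {
    // Draw the x-th remaining card, where x is uniform in [0, totalRemaining).
    // counts[i] is how many copies of card i are left; the caller keeps
    // totalRemaining cached so the array isn't re-summed on every draw.
    public static int drawCard(int[] counts, int totalRemaining, Random rng) {
        int x = rng.nextInt(totalRemaining);
        for (int card = 0; card < counts.length; card++) {
            x -= counts[card];
            if (x < 0) return card;
        }
        throw new IllegalStateException("totalRemaining out of sync with counts");
    }
}
```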

@Xom it’s similar in the python/numpy world. Generally people don’t have this problem because the default behavior is to just use numpy arrays from the outset due to speed.

Even though we’re not as prevalent, the underlying performance reasons are still the same.
If you want to keep most of your code the way it is, you’ll have to accept there will be a performance hit no matter what you do.

With that in mind, I’ll help you as best as I can.
The best things to do here are:

  1. Batch as much as possible. It’s free performance.
  2. Stay as close to your intended input layout as possible. The fewer conversion steps, the more performance you keep.

Basically, you already mentioned how you want the neural net input to look, so focus on creating that layout from the outset and then calling Nd4j.create(…) on that array. Ideally, you would build a float/double array directly and just call create(…) on it from Java.

Normally you would use ints to save memory here, but since the neural net will be training on floats/doubles anyway, representing the data that way from the outset helps a bit; otherwise it has to be converted anyway.
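
Concretely, one low-copy approach is to keep a single flat float[] per inference batch in NCHW order, fill it plane by plane on-heap, and hand it to ND4J once. A sketch (the helper and layout are my assumptions, not DL4J API):

```java
public class InputBuffer {
    // Flat NCHW buffer: index = ((b * channels + c) * 9 + row) * 9 + col.
    // After filling all planes, wrap the buffer once with
    //   Nd4j.create(buf, new long[]{batchSize, channels, 9, 9})
    // so only a single heap-to-off-heap copy happens per inference batch.
    public static void putPlane(float[] buf, int channels, int b, int c, float[][] plane) {
        int base = (b * channels + c) * 81;
        for (int row = 0; row < 9; row++)
            for (int col = 0; col < 9; col++)
                buf[base + row * 9 + col] = plane[row][col];
    }
}
```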

Thanks, this is helpful. I realize there’s a performance hit.

One of the conversions is from

      [1 2 3 4 5 6 7 8 9]

into

      [[1 2 3 4 5 6 7 8 9]
       [1 2 3 4 5 6 7 8 9]
       [1 2 3 4 5 6 7 8 9]
       [1 2 3 4 5 6 7 8 9]
       [1 2 3 4 5 6 7 8 9]
       [1 2 3 4 5 6 7 8 9]
       [1 2 3 4 5 6 7 8 9]
       [1 2 3 4 5 6 7 8 9]
       [1 2 3 4 5 6 7 8 9]]

before concatenation into the big input matrix. One of my questions is: would it make sense to defer this conversion and the concatenation until inside the computation graph, so that there are fewer values to copy from heap to off-heap?

Could you clarify what you mean by “in the computation graph”? What I imagine you doing is physically having a concatenate vertex specified in the neural network itself, rather than doing the conversion to an ndarray beforehand.

Don’t ever couple what is specified in the neural network itself with what’s done within the nd4j framework.
I’m imagining you have a function that just treats the dl4j components as a black box, but it doesn’t really help to think about things that way.

The neural network itself is decoupled from the input data and should be treated separately.

You got me, I was indeed thinking of a repeat vector plus a concatenate vertex. But upon reflection I see that the repeat and concatenation could be applied after conversion to an ndarray but before invoking the neural net, and this would equally achieve my proposed reduction in the amount of data copied from heap to off-heap.

@Xom then yeah try to do that stuff outside the neural net if possible. You can do it inside but then your inputs become more complicated. It’s really up to you how you want to do that. Some people prefer to do everything in the graph so it’s easy to persist, but you’re always going to have a conversion step anyways.

Same project, unrelated questions.

  1. I have multiple outputs, and I’d like to weight the losses differently according to their importance. How do I do it?

For what it’s worth, here’s part of my code where I define two outputs. Note that I use LossLayer because OutputLayer comes with a built-in DenseLayer, which I don’t want.

      .addLayer("policy_softmax", new ActivationLayer.Builder().activation(Activation.SOFTMAX).build(), "policy_reshape")

      .addLayer("policy_pass_logit", new DenseLayer.Builder().nIn(POLICY_CHANNELS * 2).nOut(1).build(), "policy_pool_merge")
      .addVertex("policy_nopass_logit", new ScaleVertex(-1), "policy_pass_logit")
      .addVertex("policy_pass_merge", new MergeVertex(), "policy_pass_logit", "policy_nopass_logit")
      .addLayer("policy_pass_softmax", new ActivationLayer.Builder().activation(Activation.SOFTMAX).build(), "policy_pass_merge")
      .addVertex("policy_pass", new SubsetVertex(0, 0), "policy_pass_softmax")
      .addVertex("policy_nopass", new SubsetVertex(1, 1), "policy_pass_softmax")
      .addVertex("policy_product", new ElementWiseVertex(ElementWiseVertex.Op.Product), "policy_softmax", "policy_nopass")

      .addLayer("out_policy", new LossLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD).build(), "policy_product")
      .addLayer("out_pass", new LossLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD).build(), "policy_pass");
  2. I have some “auxiliary” loss layers (not quoted above) that empirically help training, but aren’t used after training is done. Is there a way to disable certain outputs during inference to avoid wasting compute?

Another unrelated question. I see that in the JSON output for my configuration, the batch normalization layer has an activation function parameter that defaults to sigmoid: https://pastebin.com/KFa7Dkex
Is that parameter actually being read and applied? Or is it just present due to class inheritance but inert?

By the way, if I am not mistaken, in org.deeplearning4j.examples.advanced.modelling.alphagozero.dualresidual, the convolution layers need to have their activation functions explicitly set to identity, otherwise they will default to sigmoid, which the relu exists to replace. Not that I’ve actually tried running that code or anything.

Any hints? To summarize:

  1. How can I weight loss layers differently according to their importance?
  2. Is there an easy way to skip computing certain outputs during inference, or is my only choice to delete them from the net after training?
  3. Does the batch normalization layer actually apply its activation function member variable, or is it just inherited from the base layer and unused?

(4. Unintended sigmoid activations in org.deeplearning4j.examples.advanced.modelling.alphagozero.dualresidual? This doesn’t affect me, of course.)

@Xom you can do a custom loss function: https://github.com/eclipse/deeplearning4j-examples/blob/9e971fdb782249873786a960f248247dd28650fa/dl4j-examples/src/main/java/org/deeplearning4j/examples/advanced/features/customizingdl4j/lossfunctions/CustomLossDefinition.java

For removing layers, use the transfer learning api: https://github.com/eclipse/deeplearning4j-examples/search?q=transferlearning
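
That model surgery might look something like this (a hedged, untested configuration sketch; "out_aux" is a placeholder for your auxiliary output's vertex name):

```java
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.nn.transferlearning.TransferLearning;

// Sketch: after training, build an inference-only copy of the graph with the
// auxiliary loss output removed, so it is no longer computed at inference time.
ComputationGraph inferenceNet = new TransferLearning.GraphBuilder(trainedNet)
        .removeVertexAndConnections("out_aux")
        .setOutputs("out_policy", "out_pass")
        .build();
```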

@Xom re: batch norm.
Implementation here: https://github.com/eclipse/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/layers/normalization/BatchNormalization.java

It doesn’t appear to.

If you aren’t sure, just set an identity activation function and then do an activation layer for whatever activation function you would prefer to use.