RNN with simple dense function

Hi,
I tried to build my first layers and had some initial success. As a second step I want to take the simple dense layer example for SameDiff and build an equally functional layer for RNN input in NCW format.

The base unit test for the simple feedforward case was like this:

		INDArray     weights = Nd4j.arange(36).reshape(new int[] {6,6}); 
		INDArray     value1  = Nd4j.arange(12).reshape(new int[] {2,6}); 
		INDArray     bias    = Nd4j.ones(6);
		
		SDVariable x1 = sd.var("x1",value1);
		SDVariable w  = sd.var("weights",weights);
		SDVariable b  = sd.var("bias",bias);
		
		SDVariable out = x1.mmul(w);

Now I'm struggling with the step up by one dimension. I tried this setup:

		INDArray     weights = Nd4j.arange(36).reshape(new int[] {6,6}); 
		INDArray     value1  = Nd4j.arange(12).reshape(new int[] {2,6,1}); 
		INDArray     bias    = Nd4j.ones(6);
		
		SDVariable x1 = sd.var("x1",value1);
		SDVariable w  = sd.var("weights",weights);
		SDVariable b  = sd.var("bias",bias);
		
		SDVariable out = sd.tensorMmul(x1, w, new int[] {1}, new int[] {0});

but the shape is [2,1,6] instead of [2,6,1]. I don't really know whether I have to

  • preprocess to 2D
  • use different dimensions
  • transpose in some way

If somebody could give me a snippet or a hint, it would be appreciated.

Best regards

Thomas

@thomas could you post the full configuration? Are you trying to embed this in a dl4j samediff layer? If so, there are some overlapping techniques you could use depending on the layer. RNNs themselves are available in samediff, and that's much closer to what we directly run under the covers in the dl4j layers.

Hey,

Thanks for your reply. My class looks like this at the moment:

public class RNNFeedForward extends SameDiffLayer {

	/**
	 * serialization id
	 */
	private static final long serialVersionUID = -1615438632558936374L;

	private int nIn;
	
	private int nOut;
	
	/**
	 * default constructor 
	 */
	public RNNFeedForward(int nIn,int nOut) {
		this.nIn  = nIn;
		this.nOut = nOut;
	}
	
	@Override
	public SDVariable defineLayer(SameDiff sd, SDVariable layerInput, Map<String, SDVariable> paramTable,
			SDVariable mask) {
		SDVariable weights = paramTable.get(DefaultParamInitializer.WEIGHT_KEY);
        SDVariable bias    = paramTable.get(DefaultParamInitializer.BIAS_KEY);

        SDVariable mmul = sd.transpose(sd.tensorMmul(layerInput, weights, new int[] {1}, new int[] {1}));
        mmul.reshape(layerInput.shape());
        
        SDVariable a = mmul.add(bias);
        
        return sd.nn.relu(a, 0);
	}

	@Override
	public void defineParameters(SDLayerParams params) {
		// dense layer parameter
        params.addWeightParam(DefaultParamInitializer.WEIGHT_KEY,1, nIn, nOut);
        params.addBiasParam(DefaultParamInitializer.BIAS_KEY, 1, nOut, 1);	
    }

	@Override
	public void initializeParameters(Map<String, INDArray> params) {
		params.get(DefaultParamInitializer.BIAS_KEY).assign(0);
		initWeights(nIn, nOut, weightInit, params.get(DefaultParamInitializer.WEIGHT_KEY));
	}

	@Override
	public InputType getOutputType(int layerIndex, InputType inputType) {
		return inputType;
	}

	@JsonProperty
	public int getnIn() {
		return nIn;
	}

	@JsonProperty
	public void setnIn(int nIn) {
		this.nIn = nIn;
	}

	@JsonProperty
	public int getnOut() {
		return nOut;
	}

	@JsonProperty
	public void setnOut(int nOut) {
		this.nOut = nOut;
	}
	
}

I'm trying to apply the same weight matrix at every timestep, i.e. roughly out[:, :, t] = relu(in[:, :, t] · W + b) for each timestep t.

@thomas what about your rnn configuration? You can actually configure NCW/NWC and use both depending on what your expected ordering is.

That or you can just permute the result to what you expect to be the final output. Just make sure it’s consistent. In that case just use samediff.permute for your result.

Either way should be fine.
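For example, roughly (just a sketch, assuming NCW input x1 of shape [batch, nIn, steps] and weights w of shape [nIn, nOut]):

		SDVariable out = sd.tensorMmul(x1, w, new int[] {1}, new int[] {0}); // contracts the nIn dimension -> [batch, steps, nOut]
		out = sd.permute(out, new int[] {0,2,1});                            // back to NCW -> [batch, nOut, steps]

That keeps it as a single multiplication and only reorders the dimensions afterwards.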

That's more of a second step; first I have to figure out how to correctly multiply the weights with the input matrix to get the correct shape and values.

I tried it with reshape and will check inside the layer later with:

		SameDiff sd = SameDiff.create();
		
		int nIn = 2;
		int nOut = 3;
		
		int batch = 2;
		int steps = 4;
		
		INDArray     weights = Nd4j.ones(nIn * nOut).reshape(new int[] {nIn,nOut}); 
		INDArray     value1  = Nd4j.arange(nIn * batch * steps).reshape(new int[] {batch,nIn,steps}); 
		INDArray     bias    = Nd4j.ones(1,nOut,1);
		
		SDVariable x1 = sd.var("x1",value1);
		SDVariable w  = sd.var("weights",weights);
		SDVariable b  = sd.var("bias",bias);
		
		SDVariable out = sd.tensorMmul(sd.transpose(x1), w, new int[] {1}, new int[] {0}); // [steps, batch, nOut]

		out = out.permute(new int[] {1,2,0}); // back to NCW: [batch, nOut, steps]

Thanks for the hint.

OK, I tried a model with a basic working iterator from other models and got the following error during backpropagation:

Added differentiated op relu
Added differentiated op permute
Added differentiated op tensordot
Added differentiated op transpose
Exception in thread "main" java.lang.RuntimeException: ShapeUtils::evalShapeForTensorDot method: the numbers of a axes and b axes to make dot product along must have identical values !
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2067)
at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6554)
at org.nd4j.autodiff.samediff.internal.InferenceSession.doExec(InferenceSession.java:801)
at org.nd4j.autodiff.samediff.internal.InferenceSession.getOutputs(InferenceSession.java:255)
at org.nd4j.autodiff.samediff.internal.InferenceSession.getOutputs(InferenceSession.java:68)
at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:533)
at org.nd4j.autodiff.samediff.SameDiff.directExecHelper(SameDiff.java:2927)
at org.nd4j.autodiff.samediff.SameDiff.batchOutputHelper(SameDiff.java:2870)
at org.nd4j.autodiff.samediff.SameDiff.batchOutputHelper(SameDiff.java:2841)
at org.nd4j.autodiff.samediff.SameDiff.calculateGradientsAndOutputs(SameDiff.java:4620)
at org.nd4j.autodiff.samediff.SameDiff.calculateGradients(SameDiff.java:4580)
at org.deeplearning4j.nn.layers.samediff.SameDiffLayer.backpropGradient(SameDiffLayer.java:197)
at org.deeplearning4j.nn.graph.vertex.impl.LayerVertex.doBackward(LayerVertex.java:148)
at org.deeplearning4j.nn.graph.ComputationGraph.calcBackpropGradients(ComputationGraph.java:2784)
at org.deeplearning4j.nn.graph.ComputationGraph.computeGradientAndScore(ComputationGraph.java:1393)
at org.deeplearning4j.nn.graph.ComputationGraph.computeGradientAndScore(ComputationGraph.java:1353)
at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:174)

I removed all other components and reduced the calculation in defineLayer to:

	@Override
	public SDVariable defineLayer(SameDiff sd, SDVariable layerInput, Map<String, SDVariable> paramTable,
			SDVariable mask) {
		SDVariable weights = paramTable.get(DefaultParamInitializer.WEIGHT_KEY);

		SDVariable activation = sd.tensorMmul(sd.transpose(layerInput), weights, new int[] {1}, new int[] {0}); // [steps, batch, nOut]
		SDVariable dimact     = activation.permute(new int[] {1,2,0});                                          // [batch, nOut, steps]

        return sd.nn.relu(dimact, 0);
	}

Is there a possibility to debug the real input and output shapes when the graph is instantiated?

Best regards

Thomas

I looked for a hint inside your code and reused the approach from the recurrent attention layer; this works, but I keep thinking there must be a better way to achieve the goal:

	@Override
	public SDVariable defineLayer(SameDiff sd, SDVariable layerInput, Map<String, SDVariable> paramTable,
			SDVariable mask) {
		SDVariable weights = paramTable.get(DefaultParamInitializer.WEIGHT_KEY);
		SDVariable bias    = paramTable.get(DefaultParamInitializer.BIAS_KEY);
		//SDVariable gain    = paramTable.get(DefaultParamInitializer.GAIN_KEY);
		
		long[]             shape = layerInput.getShape();
		SDVariable[] inputSlices = sd.unstack(layerInput, 2, (int)shape[2]); // one [batch, nIn] slice per timestep
		int            timeSteps = inputSlices.length;
		
		SDVariable[] outputSlices = new SDVariable[timeSteps];
		for (int i=0;i<timeSteps;i++) {
			outputSlices[i] = inputSlices[i].mmul(weights).add(bias); // [batch, nOut]
			outputSlices[i] = sd.expandDims(outputSlices[i], 2);      // [batch, nOut, 1]
		}
		SDVariable out = sd.concat(2, outputSlices);                  // [batch, nOut, timeSteps]
        
		//out = sd.nn.layerNorm(out, gain, false, new int[] {1});
		
        SDVariable relu = sd.nn.relu(out, 0);
        
        return relu;
	}

If anybody has a suggestion for how to achieve this with a simple multiplication, that would be nice.

Best regards

Thomas

@thomas You can usually see an SDVariable's shape with SDVariable.shape().eval(). I would recommend debugging during the execution/.output calls. You can put some miscellaneous debug code in there, e.g. you could call
INDArray shape = someVar.shape().eval();

Usually you'll need some dummy input though. Any graph that relies on placeholders needs some sort of structure to determine shapes. If you, say, define a shape for a placeholder then it's possible, but in the context of the layer itself there are no guarantees as to what the input shape is. The samediff graph here is also self-contained relative to the rest of the network, unfortunately. It would take some work (likely more than it's worth) to improve that.
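For example, a self-contained check might look roughly like this (made-up sizes; the dummy array only exists to give the graph concrete shapes to resolve):

		SameDiff sd = SameDiff.create();
		SDVariable in = sd.var("in", Nd4j.rand(DataType.FLOAT, 2, 6, 4)); // dummy NCW input: [batch, nIn, steps]
		SDVariable w  = sd.var("w",  Nd4j.rand(DataType.FLOAT, 6, 6));    // [nIn, nOut]
		SDVariable out = sd.tensorMmul(in, w, new int[] {1}, new int[] {0});
		System.out.println(out.shape().eval()); // resolved shape, here [2, 4, 6]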

Regarding your attempts at tensor matrix multiply: if you can extract the parameters and dimensions you're trying into a self-contained environment, where I don't have to guess all your variables and can just run it myself, I'd be happy to look at the result and help you out.

All I know from your inputs there is that the tensor matrix multiply is failing due to some sort of invalid dimensions you’re specifying.

Beyond that, maybe you could also look at doing a batch matrix multiply? That would allow the input × weights product for each time slice to happen in parallel.
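A rough, untested sketch of that idea, reusing the x1/w/steps variables from your earlier test and repeating the same weight matrix for every slice:

		SDVariable[] slices = sd.unstack(x1, 2, steps);    // one [batch, nIn] slice per timestep
		SDVariable[] wRep   = new SDVariable[steps];
		java.util.Arrays.fill(wRep, w);                    // same [nIn, nOut] weight matrix for every slice
		SDVariable[] prods  = sd.batchMmul(slices, wRep);  // each result: [batch, nOut]
		for (int i = 0; i < steps; i++) {
			prods[i] = sd.expandDims(prods[i], 2);         // [batch, nOut, 1]
		}
		SDVariable out = sd.concat(2, prods);              // back to NCW: [batch, nOut, steps]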

Hey, thanks for your answer. I checked the dimensions in a unit test but didn't see any problems there. I will check a second time tomorrow.

The idea of batchMmul seems nice. I set up a unit test to check the operation and ran into a problem; I don't really know if it is a bug (it seems to be fixed in the snapshots).

Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 2.999 sec <<< FAILURE!
normalizeBatchLayerTest(simple.generator.test.AttentionTest)  Time elapsed: 2.963 sec  <<< ERROR!
java.lang.NullPointerException: Cannot load from long array because "firstShape" is null
        at org.nd4j.linalg.api.ops.impl.reduce.custom.BatchMmul.<init>(BatchMmul.java:79)
        at org.nd4j.linalg.api.ops.impl.reduce.custom.BatchMmul.<init>(BatchMmul.java:48)
        at org.nd4j.autodiff.samediff.ops.SDBaseOps.batchMmul(SDBaseOps.java:384)
        at simple.generator.test.AttentionTest.normalizeBatchLayerTest(AttentionTest.java:85)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:578)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)

The test looks like this:

		SameDiff sd = SameDiff.create();
		
		int nIn = 2;
		int nOut = 3;
		
		int batch = 2;
		int steps = 2;
		
		INDArray     weights = Nd4j.ones(nIn * nOut).reshape(new int[] {nIn,nOut}); 
		INDArray     value1  = Nd4j.arange(nIn * batch * steps).reshape(new int[] {batch,nIn,steps}); 
		
		SDVariable x1 = sd.var("x1",value1);
		SDVariable w  = sd.var("weights",weights);
		
		long[] shape = x1.getShape();
		SDVariable[] inputSlices = sd.unstack(x1, 2, (int)shape[2]);

		SDVariable[] outputSlices = sd.batchMmul(inputSlices, new SDVariable[] {w,w});

As I understand the documentation, both arrays must be of the same length (here 2) and the dimensions must fit, which I had previously checked with the simple mmul loop. But the exception points to line 79, where in the repository (master) there is only the beginning of a constructor, so I think there may already be a bug fix there, or I don't really know what the cause of the error is. I checked both parameters and nothing is null here.

Best regards

Thomas

EDIT: I tried the snapshots, but they seemed out of date to me. The Sonatype repository is configured as given in the documentation.

@thomas ah yeah. Release will be going out soon so this shouldn’t be a big deal. I’ll publish snapshots by early next week after my last round of fixes gets merged.

Thanks for the info. I will check next week and will train with the current model for now. Finally my first complete transformer block :wink:

@thomas great! Glad you were able to get up and running! Thanks for asking when you were stuck. I’ll work on publishing some better examples on transformers soon!

I tried to set up a first layer definition from my split classes for the different tasks; maybe it helps somebody (WARNING: not tested, and not sure it works). These are only first steps, the weight init is not finished and there is no debugging … so use at your own risk :wink: :

/**
 * first try to implement a complete transformer in one layer 
 * 
 * @author mrrobot
 *
 */
public class TransformerLayer extends SameDiffLayer {

    /**
	 * serialization id
	 */
	private static final long serialVersionUID = 5974498113062600619L;
	
	// param names 
	private static final String WEIGHT_KEY_QUERY_PROJECTION = "Wq";
    private static final String WEIGHT_KEY_KEY_PROJECTION = "Wk";
    private static final String WEIGHT_KEY_VALUE_PROJECTION = "Wv";
    private static final String WEIGHT_KEY_OUT_PROJECTION = "Wo";
    
    private static final String GAIN_KEY_ADD1  = "Gadd1";    
    private static final String GAIN_KEY_ADD2  = "Gadd2";
    private static final String WEIGHT_KEY_FFN = "Wffn";

    // TODO add bias parameter 
  
	
	private int embWidth;
	
	private int seqLen;
	
	private int heads;
	
	private int headSize;
	
	/**
	 * builder constructor 
	 * 
	 * @param builder
	 */
	public TransformerLayer(Builder builder) {
		embWidth = builder.embWidth;
		seqLen   = builder.seqLen;
		heads    = builder.heads;
		headSize = embWidth / heads; // size of each attention head
	}
	
    @Override
    public SDVariable defineLayer(SameDiff sameDiff, SDVariable layerInput, Map<String, SDVariable> paramTable, SDVariable mask) {
        SDVariable result;
        SDVariable attention;
        
        SDVariable Wffn  = paramTable.get(WEIGHT_KEY_FFN);
        SDVariable gain1 = paramTable.get(GAIN_KEY_ADD1);
        SDVariable gain2 = paramTable.get(GAIN_KEY_ADD2);
        
    	// first multi head attention dot product
    	if(heads > 1){
            SDVariable Wq = paramTable.get(WEIGHT_KEY_QUERY_PROJECTION);
            SDVariable Wk = paramTable.get(WEIGHT_KEY_KEY_PROJECTION);
            SDVariable Wv = paramTable.get(WEIGHT_KEY_VALUE_PROJECTION);
            SDVariable Wo = paramTable.get(WEIGHT_KEY_OUT_PROJECTION);

            attention = sameDiff.nn.multiHeadDotProductAttention(getLayerName(), layerInput, layerInput, layerInput, Wq, Wk, Wv, Wo, mask, true);
        }else{
            attention = sameDiff.nn.dotProductAttention(getLayerName(), layerInput, layerInput, layerInput, mask, true);
        }
    	
    	// add and norm
    	SDVariable add1  = attention.add(layerInput);
    	SDVariable norm1 = sameDiff.nn.layerNorm(add1, gain1, false, new int[] {1});
    	
    	// ffn network part 
		long[]             shape = layerInput.getShape();
		SDVariable[] inputSlices = sameDiff.unstack(layerInput, 2, (int)shape[2]);
		int            timeSteps = inputSlices.length;
		
		SDVariable[] outputSlices = new SDVariable[timeSteps];
		for (int i=0;i<timeSteps;i++) {
			outputSlices[i] = inputSlices[i].mmul(Wffn);
			outputSlices[i] = sameDiff.expandDims(outputSlices[i], 2);
		}
		SDVariable ffnout = sameDiff.concat(2, outputSlices);    	
    	
		// add and norm
		SDVariable add2 = ffnout.add(norm1);
		result = sameDiff.nn.layerNorm(add2,gain2,false,new int[] {1});
		
    	return result;
    }

    @Override
    public void defineParameters(SDLayerParams params) {
        params.clear();

        // check for multi head attention parameter
        if(heads > 1){
            params.addWeightParam(WEIGHT_KEY_QUERY_PROJECTION, heads, headSize, embWidth);
            params.addWeightParam(WEIGHT_KEY_KEY_PROJECTION,   heads, headSize, embWidth);
            params.addWeightParam(WEIGHT_KEY_VALUE_PROJECTION, heads, headSize, embWidth);
            params.addWeightParam(WEIGHT_KEY_OUT_PROJECTION, heads * headSize, embWidth);
        }
        
        // ffn parameters
        params.addWeightParam(WEIGHT_KEY_FFN, embWidth, embWidth);
        
        // layer normalization parameter 
        params.addWeightParam(GAIN_KEY_ADD1, seqLen);
        params.addWeightParam(GAIN_KEY_ADD2, seqLen);
    }

    @Override
    public void initializeParameters(Map<String, INDArray> params) {
        try (MemoryWorkspace ws = Nd4j.getWorkspaceManager().scopeOutOfWorkspaces()) {
            for (Map.Entry<String, INDArray> e : params.entrySet()) {
                if(e.getKey().equals(WEIGHT_KEY_OUT_PROJECTION)){
                    WeightInitUtil.initWeights(embWidth, headSize, e.getValue().shape(), weightInit, null, 'c', e.getValue());
                } else if (!e.getKey().startsWith("Gadd")){
                    WeightInitUtil.initWeights(heads * headSize, embWidth, e.getValue().shape(), weightInit, null, 'c', e.getValue());
                }
            }
        }
        
        params.get(GAIN_KEY_ADD1).assign(1.0);
        params.get(GAIN_KEY_ADD2).assign(1.0);
    }	
	
	/**
	 * ensure NCW input format to be compatible with attention implementation 
	 */
    @Override
    public InputPreProcessor getPreProcessorForInputType(InputType inputType) {
        return InputTypeUtil.getPreprocessorForInputTypeRnnLayers(inputType, RNNFormat.NCW,getLayerName());
    }
    
    /**
     * configure and report the output type: RNNFormat.NCW recurrent output with embWidth size and seqLen length
     */
    @Override
    public InputType getOutputType(int layerIndex, InputType inputType) {
        if (inputType == null || inputType.getType() != InputType.Type.RNN) {
            throw new IllegalStateException("Invalid input for transformer layer (layer index = " + layerIndex
                    + ", layer name = \"" + getLayerName() + "\"): expect RNN input type with size > 0. Got: "
                    + inputType);
        }
        
        return InputType.recurrent(embWidth, seqLen);
    }
	
	
	/**
	 * util class to build a custom transformer layer 
	 */
	public static class Builder {
		
		public int embWidth;
		
		public int heads;
		
		public int seqLen;
		
		public Builder embWidth(int width) {
			embWidth = width;
			return this;
		}
		
		public Builder nHeads(int heads) {
			this.heads = heads;
			return this;
		}
		
		public Builder seqLen(int len) {
			seqLen = len;
			return this;
		}
		
		public TransformerLayer builder() {
			return new TransformerLayer(this);
		}
		
	}
	
}
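In case it helps, here is a rough (untested) sketch of how this layer might be wired into a ComputationGraph; the sizes, names and the output layer are just placeholders:

		int embWidth = 128, seqLen = 80, classes = 10;

		ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
				.updater(new Adam(1e-3))
				.graphBuilder()
				.addInputs("in")
				.setInputTypes(InputType.recurrent(embWidth, seqLen))
				.addLayer("transformer", new TransformerLayer.Builder()
						.embWidth(embWidth).nHeads(4).seqLen(seqLen).builder(), "in")
				.addLayer("out", new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
						.activation(Activation.SOFTMAX).nIn(embWidth).nOut(classes).build(), "transformer")
				.setOutputs("out")
				.build();

		ComputationGraph net = new ComputationGraph(conf);
		net.init();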

Hi,

I also tried to replace the given code with a second permute snippet that works in a unit test (only eval, no backprop). I cleaned up some errors regarding the RNN data format for the output layer, but I still get some errors:

	@Test
	public void rnnAsFFNTest() {
		SameDiff sd = SameDiff.create();
		
		int nIn = 3;
		int nOut = 5;
		
		int batch = 2;
		int steps = 4;
		
		INDArray     weights = Nd4j.ones(nIn * nOut).reshape(new int[] {nIn,nOut}); 
		INDArray     value1  = Nd4j.arange(nIn * batch * steps).reshape(new int[] {batch,nIn,steps}); 
		
		SDVariable x1 = sd.var("x1",value1);
		SDVariable w  = sd.var("weights",weights);
	
		SDVariable xp = sd.permute(x1, new int[] {0,2,1});                    // [batch, steps, nIn]
		SDVariable out = sd.tensorMmul(xp, w, new int[] {2}, new int[] {0});  // [batch, steps, nOut]
		SDVariable outp = sd.permute(out, new int[] {0,2,1});                 // back to NCW: [batch, nOut, steps]
		
		System.out.println(x1.eval().shapeInfoToString());
		System.out.println(outp.eval().shapeInfoToString());
	}

I think the dimensions in general should fit, but I got the same stack trace:

[main] ERROR org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner - Failed to execute op tensormmul_bp. Attempted to execute with 3 inputs, 2 outputs, 0 targs,0 bargs and 4 iargs. Inputs: [(FLOAT,[16,80,128],c), (FLOAT,[128,128],c), (FLOAT,[16,80,128],c)]. Outputs: [(FLOAT,[16,80,128],c), (FLOAT,[128,128],c)]. tArgs: -. iArgs: [1, 1, 1, 2]. bArgs: -. Input var names: [permute, Wffn, tensordot-grad]. Output var names: [permute-grad, Wffn-grad] - Please see above message (printed out from c++) for a possible cause of error.
[main] WARN org.deeplearning4j.earlystopping.trainer.BaseEarlyStoppingTrainer - Early stopping training terminated due to exception at epoch 0, iteration 0
java.lang.RuntimeException: Op with name tensormmul_bp and op type [tensormmul_bp] execution failed with message ShapeUtils::evalShapeForTensorDot method: the dimensions at given axes for both input arrays must be the same !
	at org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.exec(NativeOpExecutioner.java:1905)
	at org.nd4j.linalg.factory.Nd4j.exec(Nd4j.java:6531)
	at org.nd4j.autodiff.samediff.internal.InferenceSession.doExec(InferenceSession.java:491)
	at org.nd4j.autodiff.samediff.internal.InferenceSession.getOutputs(InferenceSession.java:218)
	at org.nd4j.autodiff.samediff.internal.InferenceSession.getOutputs(InferenceSession.java:60)
	at org.nd4j.autodiff.samediff.internal.AbstractSession.output(AbstractSession.java:391)
	at org.nd4j.autodiff.samediff.SameDiff.directExecHelper(SameDiff.java:2754)
	at org.nd4j.autodiff.samediff.SameDiff.batchOutputHelper(SameDiff.java:2722)
	at org.nd4j.autodiff.samediff.SameDiff.calculateGradientsAndOutputs(SameDiff.java:4248)
	at org.nd4j.autodiff.samediff.SameDiff.calculateGradients(SameDiff.java:4209)
	at org.deeplearning4j.nn.layers.samediff.SameDiffLayer.backpropGradient(SameDiffLayer.java:197)
	at org.deeplearning4j.nn.graph.vertex.impl.LayerVertex.doBackward(LayerVertex.java:148)

I found most of the variables (permute and Wffn) inside my code, but no "tensordot-grad". Can anybody explain where it comes from?

@thomas
At the beginning of every samediff method call you can specify a name of a variable. Feel free to do that to make it a little more debuggable.

It looks like there's some mismatch. A "-grad" variable is just a variable that gets created during training when attempting to do backprop. It stands for "gradient of the variable that precedes the -".

From there it looks like you’re still misusing tensor matrix multiply. The error is right there:

Failed to execute op tensormmul_bp. Attempted to execute with 3 inputs, 2 outputs, 0 targs,0 bargs and 4 iargs. Inputs: [(FLOAT,[16,80,128],c), (FLOAT,[128,128],c), (FLOAT,[16,80,128],c)]. Outputs: [(FLOAT,[16,80,128],c), (FLOAT,[128,128],c)]. tArgs: -. iArgs: [1, 1, 1, 2]. bArgs: -. Input var names: [permute, Wffn, tensordot-grad]. Output var names: [permute-grad, Wffn-grad] - Please see above message (printed out from c++) for a possible cause of error.

In that case the variable output from the tensor matrix multiply needs to have the same shape as the variable it's the -grad of.
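For example, naming the ops in your test above would look roughly like this (assuming the name-first overloads of the generated ops):

		SDVariable xp   = sd.permute("ffn_in_nwc", x1, new int[] {0,2,1});
		SDVariable out  = sd.tensorMmul("ffn_mmul", xp, w, new int[] {2}, new int[] {0});
		SDVariable outp = sd.permute("ffn_out_ncw", out, new int[] {0,2,1});

That way the failing gradient variable shows up as e.g. ffn_mmul-grad instead of the generic tensordot-grad.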