I have a job that I’m running on Spark, but it throws an error when doing the parameter averaging:
20/06/23 04:57:36 INFO scheduler.DAGScheduler: Job 7 failed: treeAggregate at ParameterAveragingTrainingMaster.java:666, took 4.857751 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in stage 8.0 failed 4 times, most recent failure: Lost task 19.3 in stage 8.0 (TID 284, data27, executor 5): java.lang.IllegalArgumentException: Op.X [DOUBLE] type must be the same as Op.Y [FLOAT] for op org.nd4j.linalg.api.ops.impl.reduce3.EuclideanDistance: x.shape=[512, 125], y.shape=[512, 125]
at org.nd4j.common.base.Preconditions.throwEx(Preconditions.java:636)
at org.nd4j.common.base.Preconditions.checkArgument(Preconditions.java:219)
at org.nd4j.linalg.api.ops.BaseReduceFloatOp.validateDataTypes(BaseReduceFloatOp.java:110)
at org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.exec(NativeOpExecutioner.java:258)
at org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.exec(NativeOpExecutioner.java:250)
at org.deeplearning4j.nn.graph.vertex.impl.L2Vertex.doForward(L2Vertex.java:81)
I’m using DL4J version 1.0.0-beta7 and Spark version 3.0.0. I did just update from an earlier version - is that the issue?
@agibsonccc - thanks for the response. The code is fairly verbose at this point, but your comment makes me think that it is a compatibility issue I’m seeing. Can you tell me what the latest supported Spark version is? 2.4.5?
The error itself says the problem is that you have both float and double tensors here, and that both of them should be the same type. If you updated from a rather old version, where only a single tensor type was supported at all, this change in behavior might surprise you.
To find out exactly what’s going on, though, you’d have to share the graph definition.
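For illustration, something like this minimal ND4J sketch trips the same validation check (the shapes here are just borrowed from your stack trace):

```scala
import org.nd4j.linalg.api.buffer.DataType
import org.nd4j.linalg.factory.Nd4j

// A DOUBLE tensor and a FLOAT tensor of the same shape...
val x = Nd4j.rand(DataType.DOUBLE, 512, 125)
val y = Nd4j.rand(DataType.FLOAT, 512, 125)

// ...and any op that combines them, e.g. a Euclidean distance, now fails
// type validation instead of silently casting:
val d = x.distance2(y) // IllegalArgumentException: Op.X [DOUBLE] ... Op.Y [FLOAT]
```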
I went from DL4J 1.0.0-beta2 to beta7 and was trying to go from Spark 2.3.x to 3.0. I did make some other code changes at the same time, but it was just making the output size 2 instead of 1. I noticed that I am able to skirt the issue if I cast the INDArrays in my MultiDataSet to DataType.FLOAT (instead of double) and then train. I’m not sure why that makes a difference?
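For reference, this is roughly what my workaround looks like (a sketch, not my exact code; the helper name is mine, and mask arrays are omitted for brevity):

```scala
import org.nd4j.linalg.api.buffer.DataType
import org.nd4j.linalg.dataset.MultiDataSet

// Cast every feature and label array to FLOAT before handing the
// MultiDataSet to the Spark training master:
def castToFloat(mds: MultiDataSet): MultiDataSet = {
  val features = mds.getFeatures.map(_.castTo(DataType.FLOAT))
  val labels   = mds.getLabels.map(_.castTo(DataType.FLOAT))
  new MultiDataSet(features, labels)
}
```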
OK - good to know!
I am working in Scala, but yes, I was using mostly doubles. I haven’t looked closely to see if I explicitly have all of them as doubles, or if some of them get changed to doubles in the Scala-to-Java translation. But at any rate, I’m glad to have a solution (and explanation). Thanks!
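In case it helps anyone who lands here later: I believe ND4J also lets you pin the default data type globally, which would avoid the per-dataset casting (a sketch; this has to run once, before any arrays are created):

```scala
import org.nd4j.linalg.api.buffer.DataType
import org.nd4j.linalg.factory.Nd4j

// Set both the default array dtype and the default math dtype to FLOAT,
// so arrays created afterwards all agree on one type:
Nd4j.setDefaultDataTypes(DataType.FLOAT, DataType.FLOAT)
```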