Hi guys, I am trying to reproduce the results of the DL4J distributed training example on the tinyImageNet dataset on my local dockerized Spark/Hadoop cluster, but I am getting this error:
07:04:04,547 INFO ~ Starting ComputationGraph with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
07:04:04,820 INFO ~ Loaded [CpuBackend] backend
07:04:09,241 INFO ~ Number of threads used for linear algebra: 4
07:04:09,248 INFO ~ Binary level Generic x86 optimization level Generic x86
07:04:09,291 INFO ~ Number of threads used for OpenMP BLAS: 4
07:04:09,333 INFO ~ Backend used: [CPU]; OS: [Linux]
07:04:09,333 INFO ~ Cores: [4]; Memory: [0.9GB];
07:04:09,333 INFO ~ Blas vendor: [OPENBLAS]
07:04:09,388 INFO ~ Backend build information:
GCC: “7.5.0”
STD version: 201103L
DEFAULT_ENGINE: samediff::ENGINE_CPU
HAVE_FLATBUFFERS
HAVE_OPENBLAS
07:04:24,227 INFO ~ ImageRecordReader: 200 label classes inferred using label generator ParentPathLabelGenerator
07:05:31,464 INFO ~ — Starting Training: Epoch 1 of 10 —
07:05:31,464 INFO ~ Setting controller address to 172.23.0.4:49876
07:05:40,422 INFO ~ ModelParameterServer starting
1629111941401 Exception:
io.aeron.exceptions.ChannelEndpointException: ERROR - AeronException : ERROR - channel error - Cannot assign requested address (at sun.nio.ch.Net.bind0(Native Method)): aeron:udp?endpoint=172.23.0.4:49876
at io.aeron.ClientConductor.onChannelEndpointError(ClientConductor.java:246)
at io.aeron.DriverEventsAdapter.onMessage(DriverEventsAdapter.java:109)
at org.agrona.concurrent.broadcast.CopyBroadcastReceiver.receive(CopyBroadcastReceiver.java:116)
at io.aeron.DriverEventsAdapter.receive(DriverEventsAdapter.java:68)
at io.aeron.ClientConductor.service(ClientConductor.java:1071)
at io.aeron.ClientConductor.doWork(ClientConductor.java:192)
at org.agrona.concurrent.AgentRunner.doDutyCycle(AgentRunner.java:291)
at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:164)
at java.lang.Thread.run(Thread.java:748)
07:05:44,399 INFO ~ Starting training of split 1 of 1. workerMiniBatchSize=32, thresholdAlgorithm=AdaptiveThresholdAlgorithm(initialThreshold=0.001,minTargetSparsity=1.0E-4,maxTargetSparsity=0.01,decayRate=0.9659363289248456), Configured for 4 workers
07:05:44,400 INFO ~ Repartitioning training data using repartitioner: DefaultRepartitioner(maxPartitions=5000)
[Stage 6:> (0 + 4) / 5000]
Is this a networking problem related to Docker? UDP ports 49876 and 40123 are already open on my master and worker, and the cluster works fine for other Spark tasks. Thank you.
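For context, "Cannot assign requested address" coming out of `bind` is the OS error EADDRNOTAVAIL, which usually means the IP in the endpoint is not assigned to any network interface visible to the process doing the bind. One way to check this is to attempt the same bind from inside each container. A minimal sketch, assuming Python is available in the containers; the helper name `try_bind_udp` is made up for illustration, and the address and port are the ones from the log above:

```python
import errno
import socket


def try_bind_udp(host: str, port: int):
    """Try to bind a UDP socket to host:port in this process's network namespace.

    Returns (True, None) on success or (False, errno_name) on failure.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
        return True, None
    except OSError as exc:
        # Translate the numeric errno into its symbolic name when known.
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()


if __name__ == "__main__":
    # Address and port taken from the log above; run this inside each container.
    print(try_bind_udp("172.23.0.4", 49876))
```

If the bind fails with `EADDRNOTAVAIL` in the container that logged the exception, the controller address probably does not belong to that container, which would point at address selection rather than at closed ports.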
2021-08-17 06:15:42 WARN TaskSetManager:66 - Lost task 2.0 in stage 6.0 (TID 22, 172.23.0.7, executor 0): org.nd4j.linalg.exception.ND4JIllegalStateException: Can’t establish connection afet 10 seconds. Terminating…
at org.nd4j.parameterserver.distributed.v2.transport.impl.AeronUdpTransport.addConnection(AeronUdpTransport.java:321)
at org.nd4j.parameterserver.distributed.v2.transport.impl.AeronUdpTransport.launch(AeronUdpTransport.java:351)
at org.nd4j.parameterserver.distributed.v2.ModelParameterServer.launch(ModelParameterServer.java:381)
at org.deeplearning4j.spark.parameterserver.pw.SharedTrainingWrapper.run(SharedTrainingWrapper.java:361)
at org.deeplearning4j.spark.parameterserver.functions.SharedFlatMapPaths.call(SharedFlatMapPaths.java:86)
at org.deeplearning4j.spark.parameterserver.functions.SharedFlatMapPaths.call(SharedFlatMapPaths.java:41)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
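The timeout thrown from `AeronUdpTransport.addConnection` suggests that the executor never completed a UDP exchange with the controller. Plain UDP reachability between two containers can be tested independently of DL4J with something like the following. This is only a sketch: the function names are hypothetical, and it exercises a generic UDP echo, not Aeron's own protocol.

```python
import socket
import threading


def start_udp_echo(bind_host: str, port: int):
    """Bind a UDP socket and echo one datagram back from a background thread.

    Returns (thread, bound_port); pass port=0 to let the OS pick a free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_host, port))
    sock.settimeout(10.0)
    bound_port = sock.getsockname()[1]

    def serve():
        try:
            data, peer = sock.recvfrom(2048)
            sock.sendto(data, peer)
        except OSError:
            pass  # timed out or socket error: nothing to echo
        finally:
            sock.close()

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    return thread, bound_port


def udp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Send one datagram to host:port and report whether the same bytes came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(b"ping", (host, port))
        try:
            data, _ = sock.recvfrom(2048)
        except OSError:
            return False  # timeout or ICMP-induced error: no echo received
        return data == b"ping"
```

While the training job is not running, one container could call `start_udp_echo("0.0.0.0", 40123)` and then `join()` the returned thread, and the other could call `udp_probe` with the first container's IP and the same port. A `False` result would indicate that UDP traffic is not getting through on that port.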
For both the master and the worker?
Which ports do I need to expose?
Just 49876, as mentioned in the error message:
AeronException : ERROR - channel error - Cannot assign requested address (at sun.nio.ch.Net.bind0(Native Method)): aeron:udp?endpoint=172.23.0.2:49876
or are there others?
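The error message names only the endpoint that failed, so it may help to collect every endpoint that appears in a saved log. A small sketch for doing that; the helper name and the regular expression are illustrative assumptions, and only the simple `endpoint=host:port` form seen in this thread is handled:

```python
import re

# Matches the endpoint part of an Aeron UDP channel URI, e.g.
# "aeron:udp?endpoint=172.23.0.4:49876".
ENDPOINT_RE = re.compile(r"aeron:udp\?endpoint=([0-9.]+):(\d+)")


def aeron_endpoints(log_text: str):
    """Return the distinct (host, port) pairs in log_text, in order of first appearance."""
    seen = []
    for host, port in ENDPOINT_RE.findall(log_text):
        pair = (host, int(port))
        if pair not in seen:
            seen.append(pair)
    return seen
```

In this thread the two pasted error messages name different hosts (172.23.0.4 and 172.23.0.2) with the same port, so the address being bound may be worth checking as well as the port.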
@treo, I tried exposing UDP port 49876 on the master and then on the worker using:
ports:
- 8080:8080
- 7077:7077
- 49876:49876/udp
But I still get the same error message.
Beyond that I can’t really help you here, as I’ve never used this in a Docker Compose setup (from my perspective it doesn’t make any sense whatsoever to do so).