Unable to perform distributed training using tiny ImageNet example

Hi guys, I am trying to reproduce the results of the DL4J distributed training example on the tinyImageNet dataset, using my local dockerized Spark/Hadoop cluster, but I am getting this error:

07:04:04,547 INFO ~ Starting ComputationGraph with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
07:04:04,820 INFO ~ Loaded [CpuBackend] backend
07:04:09,241 INFO ~ Number of threads used for linear algebra: 4
07:04:09,248 INFO ~ Binary level Generic x86 optimization level Generic x86
07:04:09,291 INFO ~ Number of threads used for OpenMP BLAS: 4
07:04:09,333 INFO ~ Backend used: [CPU]; OS: [Linux]
07:04:09,333 INFO ~ Cores: [4]; Memory: [0.9GB];
07:04:09,333 INFO ~ Blas vendor: [OPENBLAS]
07:04:09,388 INFO ~ Backend build information:
GCC: "7.5.0"
STD version: 201103L
DEFAULT_ENGINE: samediff::ENGINE_CPU
HAVE_FLATBUFFERS
HAVE_OPENBLAS
07:04:24,227 INFO ~ ImageRecordReader: 200 label classes inferred using label generator ParentPathLabelGenerator
07:05:31,464 INFO ~ — Starting Training: Epoch 1 of 10 —
07:05:31,464 INFO ~ Setting controller address to 172.23.0.4:49876
07:05:40,422 INFO ~ ModelParameterServer starting
1629111941401 Exception:
io.aeron.exceptions.ChannelEndpointException: ERROR - AeronException : ERROR - channel error - Cannot assign requested address (at sun.nio.ch.Net.bind0(Native Method)): aeron:udp?endpoint=172.23.0.4:49876
at io.aeron.ClientConductor.onChannelEndpointError(ClientConductor.java:246)
at io.aeron.DriverEventsAdapter.onMessage(DriverEventsAdapter.java:109)
at org.agrona.concurrent.broadcast.CopyBroadcastReceiver.receive(CopyBroadcastReceiver.java:116)
at io.aeron.DriverEventsAdapter.receive(DriverEventsAdapter.java:68)
at io.aeron.ClientConductor.service(ClientConductor.java:1071)
at io.aeron.ClientConductor.doWork(ClientConductor.java:192)
at org.agrona.concurrent.AgentRunner.doDutyCycle(AgentRunner.java:291)
at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:164)
at java.lang.Thread.run(Thread.java:748)
07:05:44,399 INFO ~ Starting training of split 1 of 1. workerMiniBatchSize=32, thresholdAlgorithm=AdaptiveThresholdAlgorithm(initialThreshold=0.001,minTargetSparsity=1.0E-4,maxTargetSparsity=0.01,decayRate=0.9659363289248456), Configured for 4 workers
07:05:44,400 INFO ~ Repartitioning training data using repartitioner: DefaultRepartitioner(maxPartitions=5000)
[Stage 6:> (0 + 4) / 5000]

Is this a networking problem related to Docker? UDP ports 49876 and 40123 are already open on my master and worker, and my cluster works fine for other Spark tasks. Thank you.

DL4J version: 1.0.0-M1.1
OS: Ubuntu 20.04

@MoslemTCM Aeron uses UDP underneath. Please ensure you have UDP as well as TCP ports mapped, and let us know if you still run into problems. See "Container networking" in the Docker documentation.

The next step would be reproducing this. If you can give us a standalone script + docker compose file to reproduce this locally that would help a ton.

Hi @agibsonccc, thank you for your answer.
Here is my docker-compose file:
https://drive.google.com/file/d/1kR6owbC3d4rwk4V0k9t8bxo8Uvb-X4eN/view?usp=sharing
SparkHadoopCluster is a bridge network.
I am using the same example that deeplearning-examples applies to the tinyImageNet dataset.
After the error mentioned above, the code exits with the following error:

2021-08-17 06:15:42 WARN TaskSetManager:66 - Lost task 2.0 in stage 6.0 (TID 22, 172.23.0.7, executor 0): org.nd4j.linalg.exception.ND4JIllegalStateException: Can’t establish connection afet 10 seconds. Terminating…
at org.nd4j.parameterserver.distributed.v2.transport.impl.AeronUdpTransport.addConnection(AeronUdpTransport.java:321)
at org.nd4j.parameterserver.distributed.v2.transport.impl.AeronUdpTransport.launch(AeronUdpTransport.java:351)
at org.nd4j.parameterserver.distributed.v2.ModelParameterServer.launch(ModelParameterServer.java:381)
at org.deeplearning4j.spark.parameterserver.pw.SharedTrainingWrapper.run(SharedTrainingWrapper.java:361)
at org.deeplearning4j.spark.parameterserver.functions.SharedFlatMapPaths.call(SharedFlatMapPaths.java:86)
at org.deeplearning4j.spark.parameterserver.functions.SharedFlatMapPaths.call(SharedFlatMapPaths.java:41)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

Hi guys,
Any help, please?

Your docker-compose file only exposes TCP ports, but as @agibsonccc already told you, you need to expose UDP ports too.

For both the master and the worker?
Which ports do I need to expose?
Only 49876, as mentioned in the error message:
AeronException : ERROR - channel error - Cannot assign requested address (at sun.nio.ch.Net.bind0(Native Method)): aeron:udp?endpoint=172.23.0.2:49876
Or are there others?

@treo, I tried exposing UDP port 49876 on the master and then on the worker using:
ports:
- 8080:8080
- 7077:7077
- 49876:49876/udp
But I still get the same error message.

Is the endpoint you’ve got there the IP address of the master?

Yes, it is the master's IP.

And you are sure that the IP is correct?
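
One way to check this: "Cannot assign requested address" is raised by the bind call, which usually means the address is not assigned to any network interface in the container where the process runs. Below is a minimal sketch, using only the JDK, that lists the addresses the JVM sees and then tries to bind a UDP socket to the endpoint from the log. The class name UdpBindCheck and its methods are hypothetical, and the default 172.23.0.4:49876 is simply copied from the error message; run it inside the container that logs the exception.

```java
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.SocketAddress;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of DL4J or Aeron.
class UdpBindCheck {

    // Collects "interface -> address" for every address assigned to this host/container.
    static List<String> localAddresses() throws SocketException {
        List<String> result = new ArrayList<>();
        for (NetworkInterface nic : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                result.add(nic.getName() + " -> " + addr.getHostAddress());
            }
        }
        return result;
    }

    // Returns true if a UDP socket can be bound to host:port from this JVM.
    static boolean canBindUdp(String host, int port) {
        // Create the socket unbound, then bind explicitly so the failure is easy to see.
        try (DatagramSocket socket = new DatagramSocket((SocketAddress) null)) {
            socket.bind(new InetSocketAddress(host, port));
            return true;
        } catch (SocketException e) {
            System.out.println("bind failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws SocketException {
        // Defaults are the values from the error message in this thread (assumption).
        String host = args.length > 0 ? args[0] : "172.23.0.4";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 49876;

        localAddresses().forEach(System.out::println);
        System.out.println("can bind udp " + host + ":" + port + " = " + canBindUdp(host, port));
    }
}
```

If the bind fails and the address is missing from the printed list, the process is being told to bind an IP that belongs to a different container, which would not be fixed by publishing more ports.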

Beyond that I can’t really help you here, as I’ve never used this in a docker compose setup (from my perspective it doesn’t make any sense whatsoever to do so).

You may have some luck if you first try to get an Aeron example running in docker compose, e.g. the GitHub repositories AlexeyPirogov/aeron-docker or gatesy/AeronDockerExample (an example of how to use Aeron in Docker containers).