Unable to perform distributed training using tiny ImageNet example

MoslemTCM · August 16, 2021, 11:25am

Hi guys, I am trying to reproduce the results of the DL4J distributed training example on the tinyImageNet dataset, using my local dockerized Spark/Hadoop cluster. But I am getting this error:

07:04:04,547 INFO ~ Starting ComputationGraph with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
07:04:04,820 INFO ~ Loaded [CpuBackend] backend
07:04:09,241 INFO ~ Number of threads used for linear algebra: 4
07:04:09,248 INFO ~ Binary level Generic x86 optimization level Generic x86
07:04:09,291 INFO ~ Number of threads used for OpenMP BLAS: 4
07:04:09,333 INFO ~ Backend used: [CPU]; OS: [Linux]
07:04:09,333 INFO ~ Cores: [4]; Memory: [0.9GB];
07:04:09,333 INFO ~ Blas vendor: [OPENBLAS]
07:04:09,388 INFO ~ Backend build information:
GCC: “7.5.0”
STD version: 201103L
DEFAULT_ENGINE: samediff::ENGINE_CPU
HAVE_FLATBUFFERS
HAVE_OPENBLAS
07:04:24,227 INFO ~ ImageRecordReader: 200 label classes inferred using label generator ParentPathLabelGenerator
07:05:31,464 INFO ~ — Starting Training: Epoch 1 of 10 —
07:05:31,464 INFO ~ Setting controller address to 172.23.0.4:49876
07:05:40,422 INFO ~ ModelParameterServer starting
1629111941401 Exception:
io.aeron.exceptions.ChannelEndpointException: ERROR - AeronException : ERROR - channel error - Cannot assign requested address (at sun.nio.ch.Net.bind0(Native Method)): aeron:udp?endpoint=172.23.0.4:49876
at io.aeron.ClientConductor.onChannelEndpointError(ClientConductor.java:246)
at io.aeron.DriverEventsAdapter.onMessage(DriverEventsAdapter.java:109)
at org.agrona.concurrent.broadcast.CopyBroadcastReceiver.receive(CopyBroadcastReceiver.java:116)
at io.aeron.DriverEventsAdapter.receive(DriverEventsAdapter.java:68)
at io.aeron.ClientConductor.service(ClientConductor.java:1071)
at io.aeron.ClientConductor.doWork(ClientConductor.java:192)
at org.agrona.concurrent.AgentRunner.doDutyCycle(AgentRunner.java:291)
at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:164)
at java.lang.Thread.run(Thread.java:748)
07:05:44,399 INFO ~ Starting training of split 1 of 1. workerMiniBatchSize=32, thresholdAlgorithm=AdaptiveThresholdAlgorithm(initialThreshold=0.001,minTargetSparsity=1.0E-4,maxTargetSparsity=0.01,decayRate=0.9659363289248456), Configured for 4 workers
07:05:44,400 INFO ~ Repartitioning training data using repartitioner: DefaultRepartitioner(maxPartitions=5000)
[Stage 6:> (0 + 4) / 5000]

Is it a networking problem related to Docker? UDP ports 49876 and 40123 are already open on my master and worker, and my cluster is working fine for other Spark tasks. Thank you.

DL4J version: 1.0.0-M1.1
OS: Ubuntu 20.04

agibsonccc · August 16, 2021, 12:21pm

@MoslemTCM Aeron uses UDP underneath. Please make sure you have UDP ports mapped as well as TCP, and let us know if you still run into problems. See: Container networking | Docker Documentation

The next step would be reproducing this. If you can give us a standalone script + docker compose file to reproduce this locally that would help a ton.
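
For a first check that is independent of DL4J and Aeron, a plain UDP round trip between two containers can show whether datagrams get through at all. The following is only a minimal sketch using the Java standard library; the class name UdpPing, its methods, and the command-line arguments are made up for illustration and are not part of DL4J or Aeron.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of DL4J or Aeron.
final class UdpPing {

    /** Starts a daemon thread that echoes datagrams back; port 0 picks a free port. Returns the bound port. */
    static int startEchoServer(int port) {
        final DatagramSocket socket;
        try {
            socket = new DatagramSocket(port);
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
        Thread echo = new Thread(() -> {
            byte[] buf = new byte[512];
            try {
                while (true) {
                    DatagramPacket in = new DatagramPacket(buf, buf.length);
                    socket.receive(in);
                    // Send the same bytes back to whoever sent them.
                    socket.send(new DatagramPacket(in.getData(), in.getLength(), in.getAddress(), in.getPort()));
                }
            } catch (IOException e) {
                System.err.println("echo server stopped: " + e.getMessage());
            }
        });
        echo.setDaemon(true);
        echo.start();
        return socket.getLocalPort();
    }

    /** Sends "ping" to host:port and returns true only if the same text comes back before the timeout. */
    static boolean ping(String host, int port, int timeoutMillis) {
        byte[] payload = "ping".getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setSoTimeout(timeoutMillis);
            socket.send(new DatagramPacket(payload, payload.length, InetAddress.getByName(host), port));
            DatagramPacket reply = new DatagramPacket(new byte[512], 512);
            socket.receive(reply); // throws SocketTimeoutException if nothing arrives in time
            String text = new String(reply.getData(), 0, reply.getLength(), StandardCharsets.UTF_8);
            return text.equals("ping");
        } catch (IOException e) {
            System.err.println("no reply from " + host + ":" + port + " (" + e + ")");
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        if (args.length == 2 && args[0].equals("server")) {
            int port = startEchoServer(Integer.parseInt(args[1]));
            System.out.println("echoing UDP datagrams on port " + port);
            Thread.sleep(Long.MAX_VALUE); // keep the JVM alive; the echo thread is a daemon
        } else if (args.length == 3 && args[0].equals("client")) {
            System.out.println("reply received: " + ping(args[1], Integer.parseInt(args[2]), 3000));
        } else {
            System.err.println("usage: server <port> | client <host> <port>");
        }
    }
}
```

After compiling with javac UdpPing.java, something like "java UdpPing server 49876" in the master container and "java UdpPing client <master-ip> 49876" in a worker container would exercise the same port as in the log. If the client prints false, the datagram or its reply is probably being dropped somewhere between the containers.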

MoslemTCM · August 17, 2021, 11:08am

Hi @agibsonccc, thank you for your answer.
I have attached my docker-compose file:
https://drive.google.com/file/d/1kR6owbC3d4rwk4V0k9t8bxo8Uvb-X4eN/view?usp=sharing
SparkHadoopCluster is a bridge network.
I am using the same example from deeplearning-examples, applied to the tinyImageNet dataset.
After the error mentioned above, the code exits with the following error:

2021-08-17 06:15:42 WARN TaskSetManager:66 - Lost task 2.0 in stage 6.0 (TID 22, 172.23.0.7, executor 0): org.nd4j.linalg.exception.ND4JIllegalStateException: Can’t establish connection afet 10 seconds. Terminating…
at org.nd4j.parameterserver.distributed.v2.transport.impl.AeronUdpTransport.addConnection(AeronUdpTransport.java:321)
at org.nd4j.parameterserver.distributed.v2.transport.impl.AeronUdpTransport.launch(AeronUdpTransport.java:351)
at org.nd4j.parameterserver.distributed.v2.ModelParameterServer.launch(ModelParameterServer.java:381)
at org.deeplearning4j.spark.parameterserver.pw.SharedTrainingWrapper.run(SharedTrainingWrapper.java:361)
at org.deeplearning4j.spark.parameterserver.functions.SharedFlatMapPaths.call(SharedFlatMapPaths.java:86)
at org.deeplearning4j.spark.parameterserver.functions.SharedFlatMapPaths.call(SharedFlatMapPaths.java:41)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

MoslemTCM · September 1, 2021, 1:53pm

Hi guys,
Any help please!

treo · September 1, 2021, 1:57pm

Your docker-compose file only exposes TCP ports, but as @agibsonccc already told you, you need to expose UDP ports too.

MoslemTCM · September 1, 2021, 2:02pm

For both the master and the worker?
Which ports do I need to expose?
Only 49876, which is mentioned in the error message:
AeronException : ERROR - channel error - Cannot assign requested address (at sun.nio.ch.Net.bind0(Native Method)): aeron:udp?endpoint=172.23.0.2:49876
or are there others?

MoslemTCM · September 1, 2021, 2:35pm

@treo, I tried exposing UDP port 49876 on the master and then on the worker, using:
ports:
  - 8080:8080
  - 7077:7077
  - 49876:49876/udp
But I still get the same error message.

treo · September 1, 2021, 3:33pm

Is the endpoint you’ve got there the IP address of the master?

MoslemTCM · September 1, 2021, 3:34pm

Yes, it is the master's IP.

treo · September 1, 2021, 3:45pm

And you are sure that the IP is correct?
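
One way to check this from inside a container: "Cannot assign requested address" is what a bind call typically reports when the requested address is not assigned to any network interface of the machine (here, the container) doing the bind. The sketch below uses only the Java standard library to list the addresses the current container owns and to try a UDP bind on a given endpoint. The class name EndpointCheck and its methods are hypothetical helpers for illustration, not DL4J or Aeron API.

```java
import java.io.UncheckedIOException;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper for illustration only; not part of DL4J or Aeron.
final class EndpointCheck {

    /** Returns the textual form of every address assigned to a local network interface. */
    static List<String> localAddresses() {
        List<String> result = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            if (nics == null) {
                return result; // no interfaces reported
            }
            for (NetworkInterface nic : Collections.list(nics)) {
                for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                    result.add(addr.getHostAddress());
                }
            }
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    /** Returns true if a UDP socket can be bound to host:port from this machine. */
    static boolean canBindUdp(String host, int port) {
        InetSocketAddress endpoint;
        try {
            endpoint = new InetSocketAddress(InetAddress.getByName(host), port);
        } catch (UnknownHostException e) {
            System.err.println("cannot resolve " + host + ": " + e.getMessage());
            return false;
        }
        try (DatagramSocket socket = new DatagramSocket(endpoint)) {
            return socket.isBound();
        } catch (SocketException e) {
            System.err.println("bind failed for " + host + ":" + port + ": " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 0; // 0 = any free port
        System.out.println("local addresses: " + localAddresses());
        System.out.println(host + " is local: " + localAddresses().contains(host));
        System.out.println(host + ":" + port + " bindable over UDP: " + canBindUdp(host, port));
    }
}
```

Running, for example, "java EndpointCheck 172.23.0.2 49876" in the container that logs the Aeron exception would show whether that address belongs to it. Note that the first log shows the controller address as 172.23.0.4 while the later error shows 172.23.0.2, so the container IPs may have changed between runs.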

Beyond that I can’t really help you here, as I’ve never used this in a Docker Compose setup (from my perspective it doesn’t make any sense whatsoever to do so).

You may have more luck if you first try to get a plain Aeron example running in Docker Compose (e.g. something like AlexeyPirogov/aeron-docker or gatesy/AeronDockerExample on GitHub, the latter being an example of how to use Aeron in Docker containers).
