Handle Data Pre-Processing

Hi Team

I have a CSV dataset with null values in both the training and the test set. The data therefore needs imputation, after which I would like to remove outliers from the continuous columns of the training set using the z-score.
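For reference, the z-score step described above can be sketched in plain Scala (this is only an illustration of the math, not the DataVec API; the sample values and the threshold of 3.0 are assumptions):

```scala
// Illustrative z-score outlier filter for a continuous column.
// The column values and the cut-off of 3.0 are made up for this sketch.
val values: Seq[Double] = Seq(10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 12.0,
                              13.0, 11.0, 12.0, 10.0, 13.0, 11.0, 12.0, 100.0)

val mean = values.sum / values.length
val stdDev = math.sqrt(values.map(v => math.pow(v - mean, 2)).sum / values.length)

// Keep only rows whose z-score magnitude is below the threshold;
// here the extreme value 100.0 is the only one removed.
val filtered = values.filter(v => math.abs((v - mean) / stdDev) < 3.0)
```

A DataVec transform process would apply the same arithmetic column-wise instead of on a plain collection.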

Now the question is whether DataVec is mature enough to handle such a process.


Alternatively, I could use Spark DataFrames, which will certainly support these tasks. The challenge there is the conversion into a data format that DL4J understands, such as a DataSet.

Please suggest a way to handle such data pre processing for a dl4j application.

A workaround for the challenge with the second approach would be to write the Spark DataFrame to CSV and load it back into DL4J via CSVRecordReader. However, this would break the flow of data through the application. Please also advise whether this is how a data science application typically handles data in production.

Thanks and Regards

You can do quite a lot with DataVec. And I’m pretty certain that you should be able to do what you need to do here.

Have you tried using it and run into any issues?

The DataVec documentation should have everything you need: https://deeplearning4j.konduit.ai/datavec/overview

And the examples also cover a lot of ground:


Thanks @treo for the examples!

I’ll go through them and use them as a reference for building my pre-processing tasks.

Now, I was wondering how the data would be handled if it were stored in a distributed system requiring Apache Spark. Wouldn’t a shift from Spark to DL4J be required in that case?

Please let me know how we should approach this scenario.


There are multiple ways to deal with data coming from Spark.

For example, https://github.com/eclipse/deeplearning4j-examples/blob/master/dl4j-distributed-training-examples/src/main/java/org/deeplearning4j/distributedtrainingexamples/tinyimagenet/TrainSpark.java#L181 uses DataVec to work through a list of files read from HDFS during training.

For more options, take a look at this part of the documentation in particular:


Thanks @treo, that helped a lot!

Hi @treo

Please guide me in building a transform process for simple imputation of null values in:

  1. a Double-type column, using the mean of the column
  2. a String-type categorical column, using the mode of the column

using DataVec. I am using CSVRecordReader for the input file.
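To make the two requested imputations concrete, here is a plain-Scala sketch of the intended logic (the column data is a made-up assumption, and this is not the DataVec transform API itself):

```scala
// Illustrative null imputation; nulls are modeled as None.
val ages: Seq[Option[Double]] = Seq(Some(22.0), None, Some(30.0), Some(26.0), None)
val genders: Seq[Option[String]] = Seq(Some("M"), Some("F"), None, Some("F"), Some("F"))

// 1. Mean imputation for the Double column (mean over non-null values only)
val knownAges = ages.flatten
val meanAge = knownAges.sum / knownAges.length
val imputedAges = ages.map(_.getOrElse(meanAge))

// 2. Mode imputation for the categorical column (most frequent non-null value)
val mode = genders.flatten.groupBy(identity).maxBy(_._2.size)._1
val imputedGenders = genders.map(_.getOrElse(mode))
```

A DataVec transform process would need the mean and mode computed up front (e.g. from an analysis pass) before the replacement step runs over the records.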

Thanks and Regards

Hi @treo

My last request for help might have been too broad. Here is how I’ll proceed:

Way 1 : Using Spark SQL:

  1. Create spark dataframe from CSV.
  2. Pre-process it (imputation and outlier removal by z-score) and save it back to disk as CSV.
  3. Proceed normally with CSVRecordReader to model training.

Way 2 : Using Datavec without Spark:

  1. Use CSVRecordReader to load data.
  2. Build transform process for pre-processing.
  3. Proceed normally to model training.

Here is where I would need help, in step 2 of Way 2. While creating the transform process to replace nulls, I need to compute the mean and mode of the relevant columns. Similarly, a z-score calculation is required for outlier removal.

Please guide me on computing them on the fly, or let me know if Way 1 is preferable here.

Thanks & Regards

I’m a bit time-constrained at the moment, so I can’t give a very detailed answer to an advanced question like this.

At the moment the fastest way to find a good answer to your question is going to be to take a look at the examples: https://github.com/eclipse/deeplearning4j-examples/tree/master/data-pipeline-examples

If I recall correctly it should have something that is very close to what you want to do.

Hi @treo

Using the examples, I’ve been able to figure out how to compute the statistics I need:

  1. Mean -
    val analyse : DataAnalysis = AnalyzeLocal.analyze(schema, trainReader)
    val ageAnalysis : DoubleAnalysis = analyse.getColumnAnalysis("Age").asInstanceOf[DoubleAnalysis]
    ageAnalysis.getMean
    where trainReader is a CSVRecordReader and schema is a Schema object.

  2. Standard Deviation -

  3. Mode of a categorical column -
    val genderAnalysis : CategoricalAnalysis = analyse.getColumnAnalysis("Gender").asInstanceOf[CategoricalAnalysis]
    // maxBy on the counts yields the most frequent value (the mode);
    // needs import scala.jdk.CollectionConverters._ for asScala
    genderAnalysis.getMapOfCounts.asScala.maxBy(_._2)._1

  4. Z-Score Calculation -
    val transformPx : TransformProcess = new TransformProcess.Builder(schema)
      .doubleMathOp("Age", MathOp.Subtract, meanAge)
      .doubleMathOp("Age", MathOp.Divide, stdDevAge)
      .build()

Now the issue is that AnalyzeLocal throws an error when there are nulls in the data: Exception in thread "main" java.lang.NullPointerException. Because of this I am unable to compute any of the above statistics.
Please help!

Thanks and Regards

Since you want to apply a replacement on those columns, I suggest first applying a reduction that calculates what you need while skipping invalid values (see setIgnoreInvalid), and then a transformation that replaces the invalid values.
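The suggested two-step idea (reduce while skipping invalid values, then replace them) can be sketched in plain Scala; here nulls are modeled as empty strings, roughly as they might arrive from a CSV reader, and all names are assumptions for the sketch:

```scala
// Step 1 (reduction): compute the mean while ignoring invalid (empty) entries,
// analogous to a reduction with setIgnoreInvalid enabled.
val rawAges: Seq[String] = Seq("22", "", "30", "26", "")
val valid = rawAges.filter(_.nonEmpty).map(_.toDouble)
val meanAge = valid.sum / valid.length

// Step 2 (transformation): replace each invalid entry with the computed mean.
val cleaned = rawAges.map(s => if (s.isEmpty) meanAge else s.toDouble)
```

The key point is that the statistic is computed once over the valid values and only then fed into the replacement pass.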

Hi @treo

Thanks for the suggestion!
Could you point me to some examples to guide me through this approach?

Also, please let me know how the mean and mode of a column will persist while the replacement is performed.