Handle Data Pre-Processing

Hi Team

I have a CSV dataset with null values in both the training and the test set. The data therefore needs imputation, after which I would like to remove outliers from the continuous columns of the training set using the z-score.
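For reference, the z-score step described above can be sketched in plain Scala (this is only an illustration of the math, not the DataVec API; the sample values and the threshold of 3.0 are assumptions):

```scala
// Illustrative z-score outlier filter for a continuous column.
// The column values and the cut-off of 3.0 are made up for this sketch.
val values: Seq[Double] = Seq(10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 12.0,
                              13.0, 11.0, 12.0, 10.0, 13.0, 11.0, 12.0, 100.0)

val mean = values.sum / values.length
val stdDev = math.sqrt(values.map(v => math.pow(v - mean, 2)).sum / values.length)

// Keep only rows whose z-score magnitude is below the threshold;
// here the extreme value 100.0 is the only one removed.
val filtered = values.filter(v => math.abs((v - mean) / stdDev) < 3.0)
```

A DataVec transform process would apply the same arithmetic column-wise instead of on a plain collection.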

Now the question is whether DataVec is mature enough to handle such a process.


Alternatively, I could use Spark DataFrames, which will certainly support these tasks. The challenge there is the conversion into a data format that DL4J understands, such as a DataSet.

Please suggest a way to handle such data pre processing for a dl4j application.

A workaround for the challenge with the second approach would be to write the Spark DataFrame to CSV and load it back into DL4J via CSVRecordReader. However, this would break the flow of data through the application. Please also advise whether this is how a data science application typically handles data in production.

Thanks and Regards

You can do quite a lot with DataVec. And I’m pretty certain that you should be able to do what you need to do here.

Have you tried using it and run into any issues?

The DataVec documentation should have everything you need: https://deeplearning4j.konduit.ai/datavec/overview

And the examples also cover a lot of ground:


Thanks @treo for the examples!

I’ll go through them and use them as a reference for building my pre-processing tasks.

Now, I was wondering how the data would be handled if it were stored in a distributed system requiring Apache Spark. Wouldn’t a shift from Spark to DL4J be required in that case?

Please let me know how we should approach this scenario.


There are multiple ways to deal with data coming from Spark.

For example, https://github.com/eclipse/deeplearning4j-examples/blob/master/dl4j-distributed-training-examples/src/main/java/org/deeplearning4j/distributedtrainingexamples/tinyimagenet/TrainSpark.java#L181 uses DataVec to work through a list of files read from HDFS during training.

For more options, take a look at this part of the documentation in particular:


Thanks @treo, that helped a lot!

Hi @treo

Please guide me in building a transform process for simple imputation of null values in:

  1. a Double-type column, using the mean of the column
  2. a String-type categorical column, using the mode of the column

using DataVec. I am using CSVRecordReader for the input file.
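To make the two requested imputations concrete, here is a plain-Scala sketch of the intended logic (the column data is a made-up assumption, and this is not the DataVec transform API itself):

```scala
// Illustrative null imputation; nulls are modeled as None.
val ages: Seq[Option[Double]] = Seq(Some(22.0), None, Some(30.0), Some(26.0), None)
val genders: Seq[Option[String]] = Seq(Some("M"), Some("F"), None, Some("F"), Some("F"))

// 1. Mean imputation for the Double column (mean over non-null values only)
val knownAges = ages.flatten
val meanAge = knownAges.sum / knownAges.length
val imputedAges = ages.map(_.getOrElse(meanAge))

// 2. Mode imputation for the categorical column (most frequent non-null value)
val mode = genders.flatten.groupBy(identity).maxBy(_._2.size)._1
val imputedGenders = genders.map(_.getOrElse(mode))
```

A DataVec transform process would need the mean and mode computed up front (e.g. from an analysis pass) before the replacement step runs over the records.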

Thanks and Regards

Hi @treo

My last request for help might have been too broad. Here is how I’ll proceed:

Way 1 : Using Spark SQL:

  1. Create spark dataframe from CSV.
  2. Pre-process it (imputation and outlier removal by z-score) and save it back to disk as CSV.
  3. Proceed normally with CSVRecordReader to model training.

Way 2 : Using Datavec without Spark:

  1. Use CSVRecordReader to load data.
  2. Build transform process for pre-processing.
  3. Proceed normally to model training.

Here is where I would need help, in step 2 of Way 2. While creating the transform process to replace nulls, I need to compute the mean and mode of the relevant columns. Similarly, a z-score calculation is required for outlier removal.

Please guide me on computing them on the fly, or let me know if Way 1 is preferable here.

Thanks & Regards

I’m a bit time-constrained at the moment, so I can’t give a very detailed answer to an advanced question like this.

At the moment the fastest way to find a good answer to your question is going to be to take a look at the examples: https://github.com/eclipse/deeplearning4j-examples/tree/master/data-pipeline-examples

If I recall correctly it should have something that is very close to what you want to do.

Hi @treo

Using the examples, I’ve been able to figure out how to compute the statistics I need:

  1. Mean -
    val analyse : DataAnalysis = AnalyzeLocal.analyze(schema, trainReader)
    val ageAnalysis : DoubleAnalysis = analyse.getColumnAnalysis("Age").asInstanceOf[DoubleAnalysis]
    ageAnalysis.getMean
    where trainReader is a CSVRecordReader and schema is a Schema object.

  2. Standard Deviation -

  3. Mode of a categorical column -
    val genderAnalysis : CategoricalAnalysis = analyse.getColumnAnalysis("Gender").asInstanceOf[CategoricalAnalysis]
    // maxBy on the counts yields the most frequent value (the mode);
    // needs import scala.jdk.CollectionConverters._ for asScala
    genderAnalysis.getMapOfCounts.asScala.maxBy(_._2)._1

  4. Z-Score Calculation -
    val transformPx : TransformProcess = new TransformProcess.Builder(schema)
      .doubleMathOp("Age", MathOp.Subtract, meanAge)
      .doubleMathOp("Age", MathOp.Divide, stdDevAge)
      .build()

Now the issue is that AnalyzeLocal throws an error when there are nulls in the data: Exception in thread "main" java.lang.NullPointerException. Because of this I am unable to compute any of the above statistics.
Please help!

Thanks and Regards

Since you want to apply a replacement on those columns, I suggest first applying a reduction that calculates what you need while skipping invalid values (see setIgnoreInvalid), and then a transformation that replaces the invalid values.
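The suggested two-step idea (reduce while skipping invalid values, then replace them) can be sketched in plain Scala; here nulls are modeled as empty strings, roughly as they might arrive from a CSV reader, and all names are assumptions for the sketch:

```scala
// Step 1 (reduction): compute the mean while ignoring invalid (empty) entries,
// analogous to a reduction with setIgnoreInvalid enabled.
val rawAges: Seq[String] = Seq("22", "", "30", "26", "")
val valid = rawAges.filter(_.nonEmpty).map(_.toDouble)
val meanAge = valid.sum / valid.length

// Step 2 (transformation): replace each invalid entry with the computed mean.
val cleaned = rawAges.map(s => if (s.isEmpty) meanAge else s.toDouble)
```

The key point is that the statistic is computed once over the valid values and only then fed into the replacement pass.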

Hi @treo

Thanks for the suggestion!
Could you point me to some examples to guide me through this approach?

Also, please let me know how the mean and mode of a column will persist while the replacement is performed.