Handle Data Pre-Processing

Hi Team

I have CSV data with nulls in both the training and test sets. This requires imputation, and afterwards I would like to remove outliers from the continuous columns of the training set using the z-score.
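(For concreteness, by the z-score rule I mean the usual one: drop a row when |x − mean| / std exceeds some threshold, e.g. 3. A minimal plain-Java sketch of just that logic, independent of DataVec, with made-up data:)

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ZScoreFilter {
    // Keep only values whose z-score magnitude is within the threshold.
    static List<Double> filterOutliers(List<Double> values, double zThreshold) {
        double mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = values.stream()
                .mapToDouble(v -> (v - mean) * (v - mean))
                .average().orElse(0.0);
        double std = Math.sqrt(variance);
        List<Double> kept = new ArrayList<>();
        for (double v : values) {
            // If std is 0 every value equals the mean, so nothing is an outlier.
            if (std == 0.0 || Math.abs(v - mean) / std <= zThreshold) {
                kept.add(v);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Ten values of 1.0 plus one extreme value of 100.0.
        List<Double> col = new ArrayList<>(Arrays.asList(
                1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 100.0));
        System.out.println(filterOutliers(col, 3.0)); // the 100.0 is dropped
    }
}
```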

Now the question is whether DataVec is mature enough to handle such a process.

OR

I could use Spark DataFrames, which certainly support these tasks. In that case the challenge becomes converting the result into a format DL4J understands, such as a DataSet.

Please suggest a way to handle such data pre-processing for a DL4J application.

A workaround for the second approach would be to write the Spark DataFrame to CSV and load it back into DL4J via CSVRecordReader, but this would break the flow of data in the application. Please also advise whether this is how a data science application handles data in production.

Thanks and Regards

You can do quite a lot with DataVec. And I’m pretty certain that you should be able to do what you need to do here.

Have you tried using it and run into any issues?

The DataVec documentation should have everything you need: https://deeplearning4j.konduit.ai/datavec/overview

And the examples also cover a lot of ground.


Thanks @treo for the examples!

I’ll go through them and refer to them while building my pre-processing tasks.

Now I was wondering how the data would be handled if it were stored in a distributed system and required Apache Spark. Wouldn’t a hand-off from Spark to DL4J be required in that case?

Please let me know how we would approach this scenario.

Regards

There are multiple ways to deal with data coming from Spark.

For example, https://github.com/eclipse/deeplearning4j-examples/blob/master/dl4j-distributed-training-examples/src/main/java/org/deeplearning4j/distributedtrainingexamples/tinyimagenet/TrainSpark.java#L181 uses DataVec to work through a list of files read from HDFS during training.

For more options, take a look at this part of the documentation in particular:
https://deeplearning4j.konduit.ai/distributed-deep-learning/data-howto


Thanks @treo, that helped a lot!

Hi @treo

Please guide me in building a TransformProcess with DataVec for simple imputation of null values in:

  1. a double-type column, using the mean of the column;
  2. a string-type categorical column, using the mode of the column.

I am using CSVRecordReader for the input file.
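For reference, the imputation itself reduces to two steps: compute the mean (or mode) over the non-null entries of a column, then replace each null with it. In DataVec, I believe the replacement step can be expressed with `TransformProcess.Builder`'s `conditionalReplaceValueTransform` combined with a null-value column condition, once the statistics are known from an analysis pass; check the DataVec docs for the exact signatures. Below is a self-contained plain-Java sketch of the underlying computation (the data is made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SimpleImputer {
    // Impute nulls in a double column with the mean of the non-null values.
    static List<Double> imputeMean(List<Double> col) {
        double sum = 0.0;
        int n = 0;
        for (Double v : col) {
            if (v != null) { sum += v; n++; }
        }
        double mean = n > 0 ? sum / n : 0.0;
        List<Double> out = new ArrayList<>();
        for (Double v : col) out.add(v != null ? v : mean);
        return out;
    }

    // Impute nulls in a categorical column with the mode (most frequent non-null value).
    static List<String> imputeMode(List<String> col) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : col) {
            if (v != null) counts.merge(v, 1, Integer::sum);
        }
        String mode = counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
        List<String> out = new ArrayList<>();
        for (String v : col) out.add(v != null ? v : mode);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(imputeMean(Arrays.asList(1.0, null, 3.0)));      // null becomes 2.0
        System.out.println(imputeMode(Arrays.asList("a", "b", null, "a"))); // null becomes "a"
    }
}
```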

Thanks and Regards