Handle Data Pre-Processing

Hi Team

I have a CSV dataset with null values in both the training and the test set. The data therefore needs imputation, after which I would like to remove outliers from the continuous columns of the training set using the z-score.
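For reference, the z-score step described above can be sketched in plain Scala (this is only an illustration of the math, not the DataVec API; the sample values and the threshold of 3.0 are assumptions):

```scala
// Illustrative z-score outlier filter for a continuous column.
// The column values and the cut-off of 3.0 are made up for this sketch.
val values: Seq[Double] = Seq(10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 12.0,
                              13.0, 11.0, 12.0, 10.0, 13.0, 11.0, 12.0, 100.0)

val mean = values.sum / values.length
val stdDev = math.sqrt(values.map(v => math.pow(v - mean, 2)).sum / values.length)

// Keep only rows whose z-score magnitude is below the threshold;
// here the extreme value 100.0 is the only one removed.
val filtered = values.filter(v => math.abs((v - mean) / stdDev) < 3.0)
```

A DataVec transform process would apply the same arithmetic column-wise instead of on a plain collection.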

Now the question is whether DataVec is mature enough to handle such a process.


Alternatively, I could use Spark DataFrames, which will certainly support these tasks. The challenge there is the conversion into a data format that DL4J understands, such as a DataSet.

Please suggest a way to handle such data pre processing for a dl4j application.

A workaround for the challenge with the second approach would be to write the Spark DataFrame to CSV and load it back into DL4J via CSVRecordReader. However, this would break the flow of data through the application. Please also advise whether this is how a data science application typically handles data in production.

Thanks and Regards

You can do quite a lot with DataVec. And I’m pretty certain that you should be able to do what you need to do here.

Have you tried using it and run into any issues?

The DataVec documentation should have everything you need: https://deeplearning4j.konduit.ai/datavec/overview

And the examples also cover a lot of ground:


Thanks @treo for the examples!

I’ll go through them and use them as a reference for building my pre-processing tasks.

Now, I was wondering how the data would be handled if it were stored in a distributed system requiring Apache Spark. Wouldn’t a shift from Spark to DL4J be required in that case?

Please let me know how we should approach this scenario.


There are multiple ways to deal with data coming from Spark.

For example, https://github.com/eclipse/deeplearning4j-examples/blob/master/dl4j-distributed-training-examples/src/main/java/org/deeplearning4j/distributedtrainingexamples/tinyimagenet/TrainSpark.java#L181 uses DataVec to work through a list of files read from HDFS during training.

For more options, take a look at this part of the documentation in particular:


Thanks @treo, that helped a lot!

Hi @treo

Please guide me in building a transform process for simple imputation of null values in:

  1. a Double-type column, using the mean of the column
  2. a String-type categorical column, using the mode of the column

using DataVec. I am using CSVRecordReader for the input file.
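To make the two requested imputations concrete, here is a plain-Scala sketch of the intended logic (the column data is a made-up assumption, and this is not the DataVec transform API itself):

```scala
// Illustrative null imputation; nulls are modeled as None.
val ages: Seq[Option[Double]] = Seq(Some(22.0), None, Some(30.0), Some(26.0), None)
val genders: Seq[Option[String]] = Seq(Some("M"), Some("F"), None, Some("F"), Some("F"))

// 1. Mean imputation for the Double column (mean over non-null values only)
val knownAges = ages.flatten
val meanAge = knownAges.sum / knownAges.length
val imputedAges = ages.map(_.getOrElse(meanAge))

// 2. Mode imputation for the categorical column (most frequent non-null value)
val mode = genders.flatten.groupBy(identity).maxBy(_._2.size)._1
val imputedGenders = genders.map(_.getOrElse(mode))
```

A DataVec transform process would need the mean and mode computed up front (e.g. from an analysis pass) before the replacement step runs over the records.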

Thanks and Regards

Hi @treo

My last request for help might have been too broad. Here is how I’ll proceed:

Way 1 : Using Spark SQL:

  1. Create spark dataframe from CSV.
  2. Pre-process it (imputation and outlier removal by z-score) and save it back to disk as CSV.
  3. Proceed normally with CSVRecordReader to model training.

Way 2 : Using Datavec without Spark:

  1. Use CSVRecordReader to load data.
  2. Build transform process for pre-processing.
  3. Proceed normally to model training.

Here is where I would need help, in step 2 of Way 2. While creating the transform process to replace nulls, I need to compute the mean and mode of the relevant columns. Similarly, a z-score calculation is required for outlier removal.

Please guide me on computing them on the fly, or let me know if Way 1 is preferable here.

Thanks & Regards

I’m a bit time-constrained at the moment, so I can’t give a very detailed answer to an advanced question like this.

At the moment the fastest way to find a good answer to your question is going to be to take a look at the examples: https://github.com/eclipse/deeplearning4j-examples/tree/master/data-pipeline-examples

If I recall correctly it should have something that is very close to what you want to do.

Hi @treo

Using the examples, I’ve been able to figure out how to compute the statistics I need:

  1. Mean -
    val analyse : DataAnalysis = AnalyzeLocal.analyze(schema, trainReader)
    val ageAnalysis : DoubleAnalysis = analyse.getColumnAnalysis("Age").asInstanceOf[DoubleAnalysis]
    ageAnalysis.getMean
    where trainReader is a CSVRecordReader and schema is a Schema object.

  2. Standard Deviation -

  3. Mode of a categorical column -
    val genderAnalysis : CategoricalAnalysis = analyse.getColumnAnalysis("Gender").asInstanceOf[CategoricalAnalysis]
    // maxBy on the counts yields the most frequent value (the mode);
    // needs import scala.jdk.CollectionConverters._ for asScala
    genderAnalysis.getMapOfCounts.asScala.maxBy(_._2)._1

  4. Z-Score Calculation -
    val transformPx : TransformProcess = new TransformProcess.Builder(schema)
      .doubleMathOp("Age", MathOp.Subtract, meanAge)
      .doubleMathOp("Age", MathOp.Divide, stdDevAge)
      .build()

Now the issue is that AnalyzeLocal throws an error when there are nulls in the data: Exception in thread "main" java.lang.NullPointerException. Because of this I am unable to compute any of the above statistics.
Please help!

Thanks and Regards

Since you want to apply a replacement on those columns, I suggest first applying a reduction that calculates what you need while skipping invalid values (see setIgnoreInvalid), and then a transformation that replaces the invalid values.
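The suggested two-step idea (reduce while skipping invalid values, then replace them) can be sketched in plain Scala; here nulls are modeled as empty strings, roughly as they might arrive from a CSV reader, and all names are assumptions for the sketch:

```scala
// Step 1 (reduction): compute the mean while ignoring invalid (empty) entries,
// analogous to a reduction with setIgnoreInvalid enabled.
val rawAges: Seq[String] = Seq("22", "", "30", "26", "")
val valid = rawAges.filter(_.nonEmpty).map(_.toDouble)
val meanAge = valid.sum / valid.length

// Step 2 (transformation): replace each invalid entry with the computed mean.
val cleaned = rawAges.map(s => if (s.isEmpty) meanAge else s.toDouble)
```

The key point is that the statistic is computed once over the valid values and only then fed into the replacement pass.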

Hi @treo

Thanks for the suggestion!
Could you point me to some examples to guide me through this approach?

Also, please let me know how the mean and mode of a column will persist while the replacement is performed.