Attributes normalization

Hello,

A general question about attribute normalization.

I do something like this:

DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(trainingData);       // collect statistics from the training set only
normalizer.transform(trainingData); // apply those statistics to the training data
normalizer.transform(testData);     // apply the same statistics to the test data

to normalize my dataset, but I am wondering whether it always has to be done, never, or only in specific cases.

Now, this is my scenario, used just as a test case:

temp,pressure,humidity,wind_speed,prec,prediction
29,1200,40,18,0,A
27,1200,23,14,1,B
21,1200,33,33,3,C

Since the data are “related” by column, and the scale is very different from one column to another (the units of measure differ, e.g. pressure vs. temperature vs. wind speed), should I normalize this dataset or convert it in some way? How and when should data normalization be used?

Yes, normalizing the data will be necessary. For most activation functions the sensitive region is between -1 and 1, and everything beyond that is saturated, which makes training very hard because you have almost no gradient to work with.
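To make the saturation point concrete, here is a minimal plain-Java sketch (no DL4J needed) using tanh: its derivative, which carries the gradient signal during training, is close to its maximum near 0 but collapses to essentially zero for inputs the size of a raw pressure reading.

```java
public class TanhSaturation {
    // The derivative of tanh(x) is 1 - tanh(x)^2
    static double tanhGrad(double x) {
        double t = Math.tanh(x);
        return 1.0 - t * t;
    }

    public static void main(String[] args) {
        double normalized = 0.5;  // a value inside the sensitive region
        double raw = 1200.0;      // an unnormalized pressure reading

        System.out.println(tanhGrad(normalized)); // plenty of gradient (~0.79)
        System.out.println(tanhGrad(raw));        // fully saturated, gradient ~0
    }
}
```

The same effect occurs with sigmoid; ReLU-family activations are less sensitive to scale but still train more reliably on normalized inputs.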

As for the specific normalization you want to use, that may depend on the data (see also Quickstart with Deeplearning4J – dubs·tech). With NormalizerStandardize, the statistics are calculated for each column, and each column is then normalized so that it has zero mean and unit variance (μ=0, σ=1).
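The per-column math can be sketched in plain Java (this is just the formula (x − μ)/σ that NormalizerStandardize applies via ND4J; the class and method names here are illustrative, not DL4J API):

```java
public class Standardize {
    // Standardize one column: subtract the mean, divide by the
    // (population) standard deviation.
    static double[] standardizeColumn(double[] col) {
        double mean = 0;
        for (double v : col) mean += v;
        mean /= col.length;

        double var = 0;
        for (double v : col) var += (v - mean) * (v - mean);
        double std = Math.sqrt(var / col.length);

        double[] out = new double[col.length];
        for (int i = 0; i < col.length; i++) {
            out[i] = (col[i] - mean) / std;
        }
        return out;
    }

    public static void main(String[] args) {
        // Temperature column from the example dataset
        double[] temp = {29, 27, 21};
        for (double v : standardizeColumn(temp)) {
            System.out.println(v); // resulting column has mean 0, std 1
        }
    }
}
```

After this transform, every feature lives on the same scale, so no single column (like pressure in the thousands) dominates the weighted sums inside the network.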

That has an interesting side effect if a column always has the same value (as you may have in the pressure column). Because it moves the mean to zero, that feature is effectively dropped (see Methods for dropping out inputs (features) during post-fit Evaluation? - #8 by treo for the explanation of the math for zero features).
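A short sketch of that side effect, assuming the normalizer guards against a zero standard deviation (here a fallback to 1 to avoid division by zero; the actual DL4J guard may differ): for a constant column, x − μ is already 0 for every row, so the standardized feature is all zeros no matter what the guard does.

```java
public class ConstantColumn {
    static double[] standardize(double[] col) {
        double mean = 0;
        for (double v : col) mean += v;
        mean /= col.length;

        double var = 0;
        for (double v : col) var += (v - mean) * (v - mean);
        double std = Math.sqrt(var / col.length);
        if (std == 0) std = 1; // hypothetical guard against division by zero

        double[] out = new double[col.length];
        for (int i = 0; i < col.length; i++) out[i] = (col[i] - mean) / std;
        return out;
    }

    public static void main(String[] args) {
        double[] pressure = {1200, 1200, 1200}; // constant column from the example
        for (double v : standardize(pressure)) System.out.println(v); // all 0.0
    }
}
```

An all-zero input contributes nothing to any weighted sum, which is why the feature is effectively invisible to the network.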

When you are running a classification, it will leave the labels alone. If you were running a regression, you would also have to call fitLabels(true) on the normalizer so the label column is normalized as well.