Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most appropriate normalization / transformation method for skewed features?

I am trying to pre-process biological data to train a neural network and despite an extensive search and repetitive presentation of the various normalization methods I am none the wiser as to which method should be used when. In particular I have a number of input variables which are positively skewed and have been trying to establish whether there is a normalisation method that is most appropriate.

I was also worried about whether the nature of these inputs would affect performance of the network and as such have experimented with data transformations (log transformation in particular). However some inputs have many zeros but may also be small decimal values and seem to be highly affected by a log(x + 1) (or any number from 1 to 0.0000001 for that matter) with the resulting distribution failing to approach normal (either remains skewed or becomes bimodal with a sharp peak at the min value).

Is any of this relevant to neural networks? ie. should I be using specific feature transformation / normalization methods to account for the skewed data or should I just ignore it and pick a normalization method and push ahead?

Any advice on the matter would be greatly appreciated!

Thanks!

like image 827
Anna Krystalli Avatar asked Nov 04 '22 11:11

Anna Krystalli


1 Answers

As features in your input vector are of different nature, you should use different normalization algorithms for every feature. Network should be feeded by uniformed data on every input for better performance.

As you wrote that some data is skewed, I suppose you can run some algoritm to "normalize" it. If applying logarithm does not work, perhaps other functions and methods such as rank transforms can be tried out.

If the small decimal values do entirely occur in a specific feature, then just normalize it in specific way, so that they get transformed into your work range: either [0, 1] or [-1, +1] I suppose.

If some inputs have many zeros, consider removing them from main neural network, and create additional neural network which will operate on vectors with non-zeroed features. Alternatively, you may try to run Principal Component Analysis (for example, via Autoassociative memory network with structure N-M-N, M < N) to reduce input space dimension and so eliminate zeroed components (they will be actually taken into account in the new combined inputs somehow). BTW, new M inputs will be automatically normalized. Then you can pass new vectors to your actual worker neural network.

like image 104
Stan Avatar answered Nov 15 '22 08:11

Stan