Normalization, Standardization, etc. of input data
Posted: Wed 18. Aug 2010, 22:40
From what I have gathered so far, there are basically four options for pre-processing input data:
- Do not pre-process the input data
- Allow automatic scaling with limits set to the MAX/MIN of the input variable
- Manually enter the automatic scaling limits.
- Write script to do any kind of scaling.
SITUATION: Exceeding of Normalization Limits:
There is one situation that cannot be solved by script: when limits are exceeded, the current program clips the value to the MAX or MIN defined for Normalization. This is a problem when making predictions based on new Lessons. In many cases it is not known in advance what the limits of the inputs will be, and they will often exceed the normalization limits used for training. Usually the excess is small and can be handled by the network without any problems.
Once a net has been trained, that training cannot be used if the prediction inputs turn out to exceed the limits.
The only way out of this right now is to write script that adjusts every limit of every input variable to allow some margin for excess; I usually use ±10%. Of course, there is always the chance that even these widened limits will be exceeded by still newer prediction cases.
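To illustrate the margin trick (a generic Python sketch, not MemBrain script; the function and parameter names here are my own), widening the training MIN/MAX before deriving the [-1, +1] scaling lets moderately out-of-range prediction inputs survive the clip:

```python
import numpy as np

def normalize_with_margin(train, margin=0.10):
    """Derive [-1, +1] scaling limits from training data,
    widened by a relative margin to absorb future out-of-range inputs."""
    lo, hi = train.min(), train.max()
    span = hi - lo
    return lo - margin * span, hi + margin * span

def scale(x, lo, hi):
    """Map x into [-1, +1] given the chosen limits; values beyond
    the widened limits are still clipped, as described above."""
    x = np.clip(x, lo, hi)
    return 2.0 * (x - lo) / (hi - lo) - 1.0

train = np.array([10.0, 12.0, 15.0, 20.0])
lo, hi = normalize_with_margin(train, margin=0.10)   # limits become 9.0 .. 21.0
print(scale(np.array([21.0]), lo, hi))  # 21.0 exceeds the training max (20) but fits the widened limits
```

The residual problem the post describes remains: an input of, say, 25.0 would still be clipped, so the margin only buys room, not a guarantee.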
NOTE: What is particularly bad about this situation is the case where the network has been trained over a very long period of time. If a limit is exceeded, that training is no longer usable and must be repeated with more margin. Since training is path dependent, the incremental trainings must be redone if the true nature of that learning is to be preserved.
REMEDY: While script can be written to add margin, there is no guarantee that this will hold for all new predictions. So a new program option is the only way to deal with this.
SITUATION: Standardization (zero mean, unit variance)
Standardization is very popular. In terms of Normalization limits of [-1, +1], zero is the mean, -1 is one standard deviation below the mean, and +1 is one standard deviation above it.
But since much data lies beyond one standard deviation, a large amount of data exceeds the Normalization limits. This is no small exception: for normally distributed data, roughly 32% of values fall outside [-1, +1], and 4-sigma points and fliers exceed the limits by a wide margin.
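As a rough check (a Python sketch assuming normally distributed inputs; MemBrain itself is not involved), the fraction of standardized data that would be clipped at [-1, +1] can be estimated directly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)

# Standardize: zero mean, unit variance.
z = (x - x.mean()) / x.std()

# Treating [-1, +1] as the normalization limits means clipping
# everything beyond one standard deviation of the mean.
frac_clipped = np.mean(np.abs(z) > 1.0)
print(f"fraction outside [-1, +1]: {frac_clipped:.3f}")  # ~0.317 for normal data
```

Real data with heavy tails or fliers would of course clip even more than this Gaussian estimate suggests.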
REMEDY: None is strictly required, since script can be written; but given the popularity of Standardization, it might be a nice option to add to the Normalization Wizard.
_____________________________
If I have missed that these situations are actually handled by MemBrain, sorry for the wasted time, yours and mine. Which leads to the question: where are they?
Thanks
Tom