I have a training dataset of 1000 samples. It contains about 50 features, of which 30 are categorical and the rest are numerical/continuous. Which algorithm is best suited to handle a mixed feature set of both categorical and continuous features?
In general, a preferred approach is to convert all your features into standardized continuous features.
For features that were originally continuous, perform standardization: x_i = (x_i - mean(x)) / standard_deviation(x). That is, for each feature, subtract the mean of the feature and then divide by the standard deviation of the feature. An alternative approach is to convert the continuous features into the range [0, 1]: x_i = (x_i - min(x)) / (max(x) - min(x)).
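As a rough sketch of the two formulas above, assuming NumPy is available: the function names `standardize` and `min_max_scale` are made up for illustration, and the use of the population standard deviation (NumPy's default) is an assumption, since the answer does not say which variant to use.

```python
import numpy as np

def standardize(x):
    """Subtract the feature's mean, then divide by its standard deviation."""
    x = np.asarray(x, dtype=float)
    std = x.std()  # population standard deviation (ddof=0), an assumption
    if std == 0:
        # A constant feature has no spread; return zeros instead of dividing by 0.
        return np.zeros_like(x)
    return (x - x.mean()) / std

def min_max_scale(x):
    """Rescale the feature into the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # A constant feature cannot be rescaled; return zeros.
        return np.zeros_like(x)
    return (x - x.min()) / span

# Hypothetical feature column, for illustration only.
ages = [20.0, 30.0, 40.0]
ages_standardized = standardize(ages)
ages_scaled = min_max_scale(ages)
```

In both cases the mean, standard deviation, minimum, and maximum should be computed on the training data and then reused unchanged for any new data.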
For categorical features, perform binarization so that each category value becomes its own continuous variable taking the value 0.0 or 1.0. For example, if you have a categorical variable "gender" that can take on the values MALE, FEMALE, and NA, create three binary variables IS_MALE, IS_FEMALE, and IS_NA, where each variable can be 0.0 or 1.0. You can then standardize these binary variables in the same way as the originally continuous features.
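The "gender" example can be sketched like this, again assuming NumPy. The helper `binarize` and the sample values are hypothetical; the column order follows the `categories` list passed in.

```python
import numpy as np

def binarize(values, categories):
    """Return one 0.0/1.0 indicator column per category."""
    values = list(values)
    encoded = np.zeros((len(values), len(categories)))
    for col, category in enumerate(categories):
        # 1.0 where the sample has this category, 0.0 elsewhere.
        encoded[:, col] = [1.0 if v == category else 0.0 for v in values]
    return encoded

# Hypothetical sample data, for illustration only.
gender = ["MALE", "FEMALE", "NA", "FEMALE"]
categories = ["MALE", "FEMALE", "NA"]  # columns: IS_MALE, IS_FEMALE, IS_NA

indicators = binarize(gender, categories)

# Optionally standardize each indicator column like any other continuous feature.
# This assumes every category occurs in some but not all samples, so that no
# column has a standard deviation of zero.
standardized = (indicators - indicators.mean(axis=0)) / indicators.std(axis=0)
```

Each row of `indicators` contains exactly one 1.0, in the column of that sample's category.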
Now you have all your features as standardized continuous variables.