I have a large dataset which I plan to do logistic regression on. It has lots of categorical variables, each having thousands of features which I am planning to use one hot encoding on. I will need to deal with the data in small batches. My question is how to make sure that one hot encoding sees all the features of each categorical variable during the first run?
There is no way around finding out which possible values your categorical features can take, which probably implies that you have to go through your data fully once in order to obtain a list of unique values of your categorical variables.
After that it is a matter of transforming your categorical variables to integer values and setting the n_values= kwarg in OneHotEncoder to an array corresponding to the number of different values each variable can take.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With