 

One-hot encoding of large dataset with scikit-learn

I have a large dataset that I plan to run logistic regression on. It has many categorical variables, each with thousands of distinct values, and I am planning to one-hot encode them. I will need to process the data in small batches. My question is: how do I make sure the one-hot encoder knows about every possible value of each categorical variable from the first batch onward?

Mostafa Mahmoud asked Dec 31 '25 16:12



1 Answer

There is no way around finding out which possible values your categorical features can take, which probably implies that you have to go through your data fully once in order to obtain a list of unique values of your categorical variables.
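As a rough illustration of that first full pass, here is a minimal sketch that scans the data batch by batch and accumulates the unique values of each categorical column. The helper name `collect_categories`, the column names, and the toy rows are all hypothetical; in practice the batches would come from whatever reader you use for your dataset.

```python
from collections import defaultdict


def collect_categories(batches, categorical_columns):
    """Scan every batch once and return the sorted unique values per column."""
    seen = defaultdict(set)
    for batch in batches:  # each batch is a list of dict rows (an assumption)
        for row in batch:
            for col in categorical_columns:
                seen[col].add(row[col])
    return {col: sorted(seen[col]) for col in categorical_columns}


# Hypothetical toy data standing in for the real batches.
batches = [
    [{"city": "Cairo", "device": "phone"}, {"city": "Paris", "device": "tablet"}],
    [{"city": "Lima", "device": "phone"}],
]

categories = collect_categories(batches, ["city", "device"])
print(categories)
# {'city': ['Cairo', 'Lima', 'Paris'], 'device': ['phone', 'tablet']}
```

Only the sets of unique values are kept in memory, so this pass stays cheap even when the dataset itself does not fit in RAM.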

After that, it is a matter of transforming your categorical variables to integer values and setting the n_values= kwarg in OneHotEncoder to an array giving the number of distinct values each variable can take. (In scikit-learn 0.20 and later, the categories= kwarg takes over this role, and n_values= was removed in 0.22.)
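A minimal sketch of the encoding step, assuming scikit-learn 0.20 or later, where `OneHotEncoder` accepts the full list of values through `categories=` and can take string columns directly. The category lists and the two toy batches are hypothetical; the lists are assumed to have been gathered in an earlier pass over the whole dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Assumed: complete value lists per column, collected in a prior full pass.
categories = [["Cairo", "Lima", "Paris"], ["phone", "tablet"]]

# With explicit categories, the output width is fixed up front,
# no matter which values happen to appear in a given batch.
encoder = OneHotEncoder(categories=categories)

# Hypothetical batches; the first one does not contain "Lima".
batch_1 = np.array([["Cairo", "phone"], ["Paris", "tablet"]], dtype=object)
batch_2 = np.array([["Lima", "phone"]], dtype=object)

encoder.fit(batch_1)  # fitting on any batch is enough once categories are given
X1 = encoder.transform(batch_1)
X2 = encoder.transform(batch_2)

print(X1.shape, X2.shape)  # both batches have 3 + 2 = 5 columns
print(X2.toarray())
```

Every batch is encoded with the same column layout, so the encoded batches can be fed one at a time to an incremental learner (for example an estimator that supports `partial_fit`).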

eickenberg answered Jan 02 '26 07:01



