One-hot encoding of large dataset with scikit-learn

Question

I have a large dataset on which I plan to run logistic regression. It has many categorical variables, each with thousands of distinct values, which I plan to one-hot encode. I will need to process the data in small batches. How can I make sure that the one-hot encoder sees all the values of each categorical variable during the first run?

eickenberg · Accepted Answer

There is no way around finding out which possible values your categorical features can take, which probably implies that you have to go through your data fully once in order to obtain a list of unique values of your categorical variables.
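That first pass can itself be done batch by batch, so the full dataset never has to sit in memory. A minimal sketch, assuming each batch is a list of rows and each row a tuple of categorical values; `collect_categories` and the toy batches are hypothetical names made up for illustration:

```python
def collect_categories(batches, n_columns):
    """Return, for each column, the sorted list of distinct values seen in any batch."""
    seen = [set() for _ in range(n_columns)]
    for batch in batches:
        for row in batch:
            for col, value in enumerate(row):
                seen[col].add(value)
    return [sorted(values) for values in seen]


# Hypothetical toy data: two batches, two categorical columns.
batches = [
    [("red", "small"), ("blue", "large")],
    [("green", "small"), ("red", "medium")],
]
categories = collect_categories(batches, n_columns=2)
print(categories)
```

Only the sets of unique values are kept between batches, so memory use grows with the number of distinct values rather than with the number of rows.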

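After that, it is a matter of transforming your categorical variables to integer values and setting the n_values= kwarg in OneHotEncoder to an array holding the number of different values each variable can take. Note that n_values= was deprecated in scikit-learn 0.20 and later removed; newer versions take the lists of values themselves through the categories= kwarg and no longer require the integer conversion.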
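With a current scikit-learn (0.20 or later), where `categories=` takes the place of `n_values=`, the setup might look like the sketch below. It assumes the unique values were already gathered in a first pass; the `categories` lists and the two batches are made-up toy data, not part of the original question.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Assumed result of a first pass over the data: every value each column can take.
categories = [["blue", "green", "red"], ["large", "medium", "small"]]

# Fixing the categories up front gives every batch the same output columns.
encoder = OneHotEncoder(categories=categories)

# Hypothetical batches; neither one contains all values on its own.
batch_1 = np.array([["red", "small"], ["blue", "large"]], dtype=object)
batch_2 = np.array([["green", "medium"]], dtype=object)

# fit() is still required, but with explicit categories it only checks the
# data against the given lists, so fitting on the first batch is enough.
encoder.fit(batch_1)

encoded_1 = encoder.transform(batch_1)  # sparse matrix by default
encoded_2 = encoder.transform(batch_2)
print(encoded_1.shape, encoded_2.shape)
```

Both batches come out with the same number of columns (one per known value), even though the second batch contains values the encoder never saw during `fit`.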

Tags: python, scikit-learn

Mostafa Mahmoud
