I have a CSV file with 25 columns: some are numeric, some are categorical, and some hold names of actors and directors. I want to use regression models on this data. To do so, I have to convert the categorical string columns to numeric values using LabelBinarizer from the scikit-learn package. How can I use LabelBinarizer on this dataframe, which has multiple categorical columns?
Essentially I want to binarize the labels and add them to the dataframe.
In the code below, I have the list of columns I want to binarize, but I am not able to figure out how to add the new columns back to the df.
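from sklearn.preprocessing import LabelBinarizer
label_binarizer = LabelBinarizer()
categorylist = ['color', 'language', 'country', 'content_rating']
for col in categorylist:
    tempdf = label_binarizer.fit_transform(df[col])  # NumPy array, not a DataFrame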
In the next step, I want to add tempdf to df and drop the original column df[col].
Using the DataFrame.insert() method, we can add a new column at a specific position in the column sequence. Although insert() takes a single column name and value as input, we can call it repeatedly to add multiple columns to the DataFrame.
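As a rough sketch of that idea applied to the question, the loop below binarizes each column, drops the original, and inserts the new columns where it used to be. The toy data and the `col_value` naming scheme are assumptions made for illustration, and the code assumes every column has at least two distinct values.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Toy stand-in for the real data; the values are made up for illustration.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "color": ["Color", "Black and White", "Color"],
    "language": ["English", "French", "Hindi"],
})
categorylist = ["color", "language"]

for col in categorylist:
    lb = LabelBinarizer()
    encoded = lb.fit_transform(df[col])  # 2-D NumPy array
    # With exactly two classes LabelBinarizer returns a single 0/1 column
    # (1 marks the second entry of lb.classes_); with more classes it
    # returns one column per class.
    if encoded.shape[1] == 1:
        names = [f"{col}_{lb.classes_[1]}"]
    else:
        names = [f"{col}_{cls}" for cls in lb.classes_]
    pos = df.columns.get_loc(col)  # position of the original column
    df = df.drop(columns=col)
    # insert() adds one column per call, so call it once per new column.
    for offset, name in enumerate(names):
        df.insert(pos + offset, name, encoded[:, offset])

print(df.columns.tolist())
```

The two-class special case matters because LabelBinarizer collapses a binary column into a single indicator rather than two.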
You can do this in a one-liner with pd.get_dummies:
tempdf = pd.get_dummies(df, columns=categorylist)
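To see what that call produces, here is a small self-contained sketch on made-up data (the column values are assumptions, not taken from the question's file). The listed columns are replaced by one indicator column per category, and the other columns are left alone. Passing `dtype=int` is optional; it just forces 0/1 integers, since the default indicator dtype differs between pandas versions.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "budget": [100, 200, 300],
    "color": ["Color", "Black and White", "Color"],
    "country": ["USA", "France", "USA"],
})
categorylist = ["color", "country"]

# Original categorical columns are dropped and replaced by indicator columns.
encoded = pd.get_dummies(df, columns=categorylist, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

High-cardinality columns such as actor or director names would produce one new column per distinct name, so it may be worth leaving them out of the list.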
Otherwise you can use a FeatureUnion with FunctionTransformer, as in the answer to "sklearn pipeline - how to apply different transformations on different columns".
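A minimal sketch of what the FeatureUnion approach might look like follows; it is not the code from the linked answer. The column names, the toy data, and the selector helpers are assumptions. It also swaps in OneHotEncoder for LabelBinarizer, because LabelBinarizer's fit_transform accepts only a single argument and so does not fit into a pipeline step, and it scales the numeric column purely as an example transformation.

```python
import pandas as pd
from scipy import sparse
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Toy data for illustration only.
df = pd.DataFrame({
    "budget": [100, 200, 300],
    "color": ["Color", "Black and White", "Color"],
    "country": ["USA", "France", "USA"],
})
numeric_cols = ["budget"]
categorical_cols = ["color", "country"]


def select_numeric(frame):
    """Hypothetical helper: keep only the numeric columns."""
    return frame[numeric_cols]


def select_categorical(frame):
    """Hypothetical helper: keep only the categorical columns."""
    return frame[categorical_cols]


# Each branch selects its columns, transforms them, and FeatureUnion
# stacks the results side by side.
features = FeatureUnion([
    ("numeric", make_pipeline(
        FunctionTransformer(select_numeric, validate=False),
        StandardScaler(),
    )),
    ("categorical", make_pipeline(
        FunctionTransformer(select_categorical, validate=False),
        OneHotEncoder(handle_unknown="ignore"),
    )),
])

X = features.fit_transform(df)
# OneHotEncoder yields a sparse matrix by default, so densify for display.
X_dense = X.toarray() if sparse.issparse(X) else X
print(X_dense.shape)
```

The result is a plain feature matrix rather than a DataFrame, which is usually what a scikit-learn regressor expects anyway.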
EDIT: As added by @dukebody in the comments, you can also use the sklearn-pandas package, whose purpose is to apply different transformations to each dataframe column.