LabelBinarizer for multiple columns in data frame

I have a CSV file with 25 columns: some are numeric, some are categorical, and some are free-text values such as the names of actors and directors. I want to use regression models on this data. To do so, I have to convert the categorical string columns to numeric values using LabelBinarizer from the scikit-learn package. How can I use LabelBinarizer on this dataframe, which has multiple categorical columns?

SampleData

Essentially I want to binarize the labels and add them to the dataframe.

In the code below, I have the list of columns I want to binarize, but I am not able to figure out how to add the new columns back to the df.

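from sklearn.preprocessing import LabelBinarizer
label_binarizer = LabelBinarizer()
categorylist = ['color', 'language', 'country', 'content_rating']
for col in categorylist:
    # fit_transform returns a NumPy array, which is overwritten on each iteration
    tempdf = label_binarizer.fit_transform(df[col])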

In the next step, I want to add tempdf to df and drop the original column df[col].
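The step described above could be sketched roughly as follows. This is only an illustration, not a confirmed solution: the small dataframe is a hypothetical stand-in for the CSV (only the column names come from the question), the new column names follow a made-up `<column>_<class>` pattern, and each categorical column is assumed to have at least two distinct values.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical stand-in for the CSV data; only the column names come from the question.
df = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "language": ["English", "French", "Hindi"],
    "budget": [100, 200, 300],
})
categorylist = ["color", "language"]

for col in categorylist:
    lb = LabelBinarizer()
    encoded = lb.fit_transform(df[col])  # NumPy array of 0/1 values
    # A two-class column yields a single 0/1 column (1 marks classes_[1]);
    # three or more classes yield one column per class.
    if encoded.shape[1] == 1:
        names = [f"{col}_{lb.classes_[1]}"]
    else:
        names = [f"{col}_{cls}" for cls in lb.classes_]
    encoded_df = pd.DataFrame(encoded, columns=names, index=df.index)
    # Append the binarized columns and drop the original categorical column.
    df = pd.concat([df.drop(columns=col), encoded_df], axis=1)

print(df)
```

Passing `index=df.index` keeps the new columns aligned with the original rows when the dataframe does not use a default integer index.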

asked Nov 07 '16 02:11

aks_Nin

1 Answer

You can do this in a one-liner with pd.get_dummies:

tempdf = pd.get_dummies(df, columns=categorylist)
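To see what that one-liner produces, here is a small self-contained sketch. The dataframe is made up for illustration (only the column names `color` and `country` come from the question); the call itself is the same `pd.get_dummies(df, columns=...)` as above.

```python
import pandas as pd

# Hypothetical sample data; the real dataframe comes from the asker's CSV.
df = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "country": ["USA", "UK", "USA"],
    "budget": [100, 200, 300],
})
categorylist = ["color", "country"]

# Each listed column is replaced by one indicator column per distinct value;
# columns not listed (here "budget") are passed through unchanged.
encoded = pd.get_dummies(df, columns=categorylist)

print(encoded.columns.tolist())
```

The original categorical columns are dropped automatically, so no separate concat or drop step is needed. The dtype of the indicator columns depends on the pandas version (boolean in recent releases, integer in older ones), so cast with `.astype(int)` if a model needs plain 0/1 values.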

Otherwise, you can use a FeatureUnion with FunctionTransformer, as in the answer to "sklearn pipeline - how to apply different transformations on different columns".
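A minimal sketch of the FeatureUnion idea might look like the following. It is not the code from the linked answer: the sample data, the two selector functions, and the choice of OneHotEncoder for the categorical branch are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "country": ["USA", "UK", "USA"],
    "budget": [100, 200, 300],
})
categorical = ["color", "country"]
numeric = ["budget"]


def select_categorical(frame):
    # Hypothetical helper: keep only the categorical columns.
    return frame[categorical]


def select_numeric(frame):
    # Hypothetical helper: keep only the numeric columns, as a NumPy array.
    return frame[numeric].to_numpy()


union = FeatureUnion([
    # Branch 1: select the categorical columns, then one-hot encode them.
    ("categorical", make_pipeline(FunctionTransformer(select_categorical),
                                  OneHotEncoder(handle_unknown="ignore"))),
    # Branch 2: pass the numeric columns through untouched.
    ("numeric", FunctionTransformer(select_numeric)),
])

# The branch outputs are stacked side by side: encoded columns first, then budget.
features = union.fit_transform(df)
print(features.shape)
```

Because OneHotEncoder returns a sparse matrix by default, the combined result is sparse as well; call `features.toarray()` if a dense array is needed. Recent scikit-learn versions also provide `sklearn.compose.ColumnTransformer`, which covers the same per-column use case more directly.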

EDIT: As added by @dukebody in the comments, you can also use the sklearn-pandas package, whose purpose is to apply different transformations to each dataframe column.

answered Sep 23 '22 08:09

maxymoo

Tags:

python

scipy

scikit-learn

sklearn-pandas
