I am trying to encode all the textual data in a .csv file to numeric values using Python's scikit-learn. I am using LabelEncoder and OneHotEncoder on the columns that have datatype object. I am wondering how to concatenate the new encoded columns with the original dataframe (df in this case). I am new to this and would really appreciate some help. Here's my code:
from sklearn import preprocessing

# Encode all columns of dtype object using LabelEncoder
columnsToEncode = df.select_dtypes(include=[object])
labelEncoder = preprocessing.LabelEncoder()
df_2 = columnsToEncode.apply(labelEncoder.fit_transform)

# Now encode the integer labels using OneHotEncoder
oneHotEncoder = preprocessing.OneHotEncoder()
df_3 = oneHotEncoder.fit_transform(df_2)
There are a couple of ways to do this. Assuming you want to encode the independent variables, you can use pd.get_dummies with drop_first=True, which returns the encoded columns already joined to the rest of the dataframe. Here is an example:
import pandas as pd
# Create a data of independent variables X for the example
X = pd.DataFrame({'Country': ['China', 'India', 'USA', 'Indonesia', 'Brasil'],
                  'Continent': ['Asia', 'Asia', 'North America', 'Asia', 'South America'],
                  'Population, M': [1403.5, 1324.2, 322.2, 261.1, 207.6]})
print(X)
# Encode
columnsToEncode=X.select_dtypes(include=[object]).columns
X = pd.get_dummies(X, columns=columnsToEncode, drop_first=True)
print(X)
# X prior to encoding
       Continent    Country  Population, M
0           Asia      China         1403.5
1           Asia      India         1324.2
2  North America        USA          322.2
3           Asia  Indonesia          261.1
4  South America     Brasil          207.6

# X after encoding
   Population, M  Continent_North America  Continent_South America  \
0         1403.5                        0                        0
1         1324.2                        0                        0
2          322.2                        1                        0
3          261.1                        0                        0
4          207.6                        0                        1

   Country_China  Country_India  Country_Indonesia  Country_USA
0              1              0                  0            0
1              0              1                  0            0
2              0              0                  0            1
3              0              0                  1            0
4              0              0                  0            0