I'm trying to use the car evaluation dataset from the UCI repository and I wonder whether there is a convenient way to binarize categorical variables in sklearn. One approach would be to use the DictVectorizer of LabelBinarizer but here I'm getting k different features whereas you should have just k-1 in order to avoid collinearization. I guess I could write my own function and drop one column but this bookkeeping is tedious, is there an easy way to perform such transformations and get as a result a sparse matrix?
In this encoding scheme, the categorical feature is first converted into numerical using an ordinal encoder. Then the numbers are transformed in the binary number. After that binary value is split into different columns. Binary encoding works really well when there are a high number of categories.
Another approach is to encode categorical values with a technique called "label encoding", which allows you to convert each value in a column to a number. Numerical labels are always between 0 and n_categories-1. You can do label encoding via attributes . cat.
We will be using . LabelEncoder() from sklearn library to convert categorical data to numerical data. We will use function fit_transform() in the process.
In ordinal encoding, each unique category value is assigned an integer value. For example, “red” is 1, “green” is 2, and “blue” is 3. This is called an ordinal encoding or an integer encoding and is easily reversible. Often, integer values starting at zero are used.
if your data is a pandas DataFrame, then you can simply call get_dummies. Assume that your data frame is df, and you want to have one binary variable per level of variable 'key'. You can simply call:
pd.get_dummies(df['key'])
and then delete one of the dummy variables, to avoid the multi-colinearity problem. I hope this helps ...
The basic method is
import numpy as np
import pandas as pd, os
from sklearn.feature_extraction import DictVectorizer
def one_hot_dataframe(data, cols, replace=False):
vec = DictVectorizer()
mkdict = lambda row: dict((col, row[col]) for col in cols)
vecData = pd.DataFrame(vec.fit_transform(data[cols].apply(mkdict, axis=1)).toarray())
vecData.columns = vec.get_feature_names()
vecData.index = data.index
if replace is True:
data = data.drop(cols, axis=1)
data = data.join(vecData)
return (data, vecData, vec)
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
df2, _, _ = one_hot_dataframe(df, ['state'], replace=True)
print df2
Here is how to do in sparse format
import numpy as np
import pandas as pd, os
import scipy.sparse as sps
import itertools
def one_hot_column(df, cols, vocabs):
mats = []; df2 = df.drop(cols,axis=1)
mats.append(sps.lil_matrix(np.array(df2)))
for i,col in enumerate(cols):
mat = sps.lil_matrix((len(df), len(vocabs[i])))
for j,val in enumerate(np.array(df[col])):
mat[j,vocabs[i][val]] = 1.
mats.append(mat)
res = sps.hstack(mats)
return res
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': ['2000', '2001', '2002', '2001', '2002'],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
print df
vocabs = []
vals = ['Ohio','Nevada']
vocabs.append(dict(itertools.izip(vals,range(len(vals)))))
vals = ['2000','2001','2002']
vocabs.append(dict(itertools.izip(vals,range(len(vals)))))
print vocabs
print one_hot_column(df, ['state','year'], vocabs).todense()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With