I am working my way through Wes's Python For Data Analysis, and I've run into a strange problem that is not addressed in the book.
In the code below, based on page 199 of his book, I create a dataframe and then use pd.cut() to create cat_obj. According to the book, cat_obj is
"a special Categorical object. You can treat it like an array of strings indicating the bin name; internally it contains a levels array indicating the distinct category names along with a labeling for the ages data in the labels attribute"
Awesome! However, if I use the exact same pd.cut() code (In [5] below) to create a new column of the dataframe (called df['cat']), that column is not treated as a special categorical variable but simply as a regular pandas Series.
How, then, do I create a column in a dataframe that is treated as a categorical variable?
In [4]:
import pandas as pd
raw_data = {'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'score': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns=['name', 'score'])
bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']
In [5]:
cat_obj = pd.cut(df['score'], bins, labels=group_names)
df['cat'] = pd.cut(df['score'], bins, labels=group_names)
In [7]:
type(cat_obj)
Out[7]:
pandas.core.categorical.Categorical
In [8]:
type(df['cat'])
Out[8]:
pandas.core.series.Series
The basic strategy of one-hot encoding is to convert each category value into a new column and assign a 1 or 0 (True/False) value to that column. This has the benefit of not weighting any value improperly, since no category is represented by a larger number than another. Many libraries support one-hot encoding, but the simplest option is the pandas get_dummies() function.
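As a rough illustration of the one-hot idea, here is a minimal sketch using pd.get_dummies(). The small dataframe is made up for the example (it only borrows the group names from the question), so treat it as a toy, not as the asker's data.

```python
import pandas as pd

# Hypothetical toy data, reusing the group names from the question.
df = pd.DataFrame({'cat': ['Low', 'Great', 'Good', 'Low']})

# One indicator column per distinct value found in 'cat'.
dummies = pd.get_dummies(df['cat'])
print(dummies)
```

Each row has exactly one indicator set, in the column matching that row's category.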
DataFrame(dtype="category"): To create a fully categorical dataframe, pass dtype="category" to the DataFrame constructor, which makes every column categorical at construction time. Columns can also be converted after construction with astype("category").
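A short sketch of both routes, assuming a pandas version that has the category dtype. The column names and values here are invented for the example.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
data = {'name': ['Miller', 'Ali', 'Ali'], 'grade': ['Low', 'Great', 'Low']}

# Route 1: make every column categorical at construction time.
df_all = pd.DataFrame(data, dtype='category')

# Route 2: build the frame normally, then convert a single column.
df_one = pd.DataFrame(data)
df_one['grade'] = df_one['grade'].astype('category')

print(df_all.dtypes)
print(df_one.dtypes)
```

The second route is usually the more practical one when only some columns are really categorical.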
For categorical data stored as strings, you can use the pandas string methods to filter rows. Series.str.startswith() returns a boolean mask that is True where the value starts with a given prefix, and Series.str.endswith() does the same for a given suffix; indexing the dataframe with that mask keeps only the matching rows.
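For instance, filtering on a prefix or suffix might look like the sketch below. The names are a subset of those in the question, and the prefix and suffix were picked arbitrarily.

```python
import pandas as pd

# A few names borrowed from the question, for illustration only.
df = pd.DataFrame({'name': ['Miller', 'Milner', 'Sloan', 'Riani']})

# Boolean mask from .str.startswith(), then used to select rows.
starts_mil = df[df['name'].str.startswith('Mil')]

# Same pattern with .str.endswith().
ends_an = df[df['name'].str.endswith('an')]

print(starts_mil)
print(ends_an)
```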
This might be happening because of how a setter can behave.
Sample getter and setter:
class a(object):
    x = 1

    @property
    def p(self):
        # The getter converts whatever was stored to int.
        return int(self.x)

    @p.setter
    def p(self, v):
        self.x = v

t = 1.32
obj = a()            # keep one instance so the assigned value is the one read back
obj.p = t
print(type(t))       # float
print(type(obj.p))   # int: the float comes back converted
For now, df only accepts Series data, and its setter converts Categorical data into a Series. Categorical support in df is due in the next pandas release.
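For what it's worth, on pandas versions that do include the category dtype (0.15 and later, as far as I know), the column is still reported as a Series by type(), but the categorical information lives in its dtype and is reachable through the .cat accessor. A minimal sketch, using a shortened version of the question's scores:

```python
import pandas as pd

# First five scores from the question, shortened for illustration.
df = pd.DataFrame({'score': [25, 94, 57, 62, 70]})
bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']

df['cat'] = pd.cut(df['score'], bins, labels=group_names)

print(type(df['cat']))                 # the column is still a Series
print(df['cat'].dtype)                 # but its dtype is categorical
print(list(df['cat'].cat.categories))  # the distinct category names
print(df['cat'].cat.codes.tolist())    # integer code for each row
```

So checking df['cat'].dtype, not type(df['cat']), is the way to tell whether a column is categorical.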