
pandas: Convert string column to ordered Category?

Tags: python, pandas

I'm working with pandas for the first time. I have a column of survey responses, which can take the values 'strongly agree', 'agree', 'disagree', 'strongly disagree', and 'neither'.

This is the output of describe() and value_counts() for the column:

count      4996
unique        5
top       Agree
freq       1745
dtype: object

Agree                1745
Strongly agree        926
Strongly disagree     918
Disagree              793
Neither               614
dtype: int64

I want to do a linear regression on this question versus overall score. However, I have a feeling that I should convert the column into a Category variable first, given that it's inherently ordered. Is this correct? If so, how should I do this?

I've tried this:

df.EasyToUseQuestionFactor = pd.Categorical.from_array(df.EasyToUseQuestion)
print(df.EasyToUseQuestionFactor)

This produces output that looks vaguely right, but it seems that the categories are in the wrong order. Is there a way that I can specify ordering? Do I even need to specify ordering?

This is the rest of my code right now:

import pandas as pd
from statsmodels.formula.api import ols

df = pd.read_csv('./data/responses.csv')
lm1 = ols('OverallScore ~ EasyToUseQuestion', data=df).fit()
print(lm1.rsquared)
Richard asked Sep 19 '14 16:09



1 Answer

There are two ways to do this nowadays. Converting makes the column more readable and reduces its memory use, and because it becomes a Categorical type you will still be able to order the values.

First, my preferred one:

df['grades'] = df['grades'].astype('category')

astype used to accept a categories argument, but it no longer does. So if you want to order your categories, or to include extra categories that aren't present in your data, you must use the solution below.

This recommendation is from the docs:

import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "c", "a"])
# "a" is not among the listed categories, so it becomes NaN after the cast
cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
s_cat = s.astype(cat_type)
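
Applied to the survey column from the question, the same approach might look like the sketch below. The sample rows are made up for illustration, the ordering of the levels is an assumption (most negative to most positive), and EasyToUseCode is a hypothetical column name.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical sample standing in for the survey column from the question.
df = pd.DataFrame({
    "EasyToUseQuestion": ["Agree", "Strongly agree", "Neither",
                          "Disagree", "Strongly disagree", "Agree"],
})

# Assumed ordering, from most negative to most positive.
levels = ["Strongly disagree", "Disagree", "Neither", "Agree", "Strongly agree"]
agreement = CategoricalDtype(categories=levels, ordered=True)

df["EasyToUseQuestion"] = df["EasyToUseQuestion"].astype(agreement)

# Ordered categoricals support comparisons against a listed category.
at_least_agree = df["EasyToUseQuestion"] >= "Agree"

# Integer codes follow the category order (0 = "Strongly disagree").
df["EasyToUseCode"] = df["EasyToUseQuestion"].cat.codes
```

Whether the integer codes are a sensible regressor is a modelling decision, since using them as numbers assumes the levels are evenly spaced.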

Extra tip: get all existing values from a column with df.colname.unique().
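
One place unique() helps is in checking the data against the category list before casting, because any value that is not listed becomes NaN without a warning. A minimal sketch, assuming a made-up series with one stray lowercase value:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical data; "agree" (lowercase) is a deliberate mismatch.
s = pd.Series(["Agree", "Neither", "agree", "Disagree"])

levels = ["Strongly disagree", "Disagree", "Neither", "Agree", "Strongly agree"]

# Values present in the data but missing from the intended category list.
unexpected = set(s.dropna().unique()) - set(levels)

# Casting turns every unlisted value into NaN.
s_cat = s.astype(CategoricalDtype(categories=levels, ordered=True))
n_missing = int(s_cat.isna().sum())
```

If unexpected is not empty, it is probably worth cleaning those values (or adding them to the list) before converting.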

neves answered Nov 01 '22 16:11
