I'm working with pandas for the first time. I have a column of survey responses, which can take the values 'strongly agree', 'agree', 'disagree', 'strongly disagree', and 'neither'.
This is the output of describe() and value_counts() for the column:
count 4996
unique 5
top Agree
freq 1745
dtype: object
Agree 1745
Strongly agree 926
Strongly disagree 918
Disagree 793
Neither 614
dtype: int64
I want to do a linear regression on this question versus overall score. However, I have a feeling that I should convert the column into a Category variable first, given that it's inherently ordered. Is this correct? If so, how should I do this?
I've tried this:
df.EasyToUseQuestionFactor = pd.Categorical.from_array(df.EasyToUseQuestion)
print(df.EasyToUseQuestionFactor)
This produces output that looks vaguely right, but it seems that the categories are in the wrong order. Is there a way that I can specify ordering? Do I even need to specify ordering?
This is the rest of my code right now:
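import pandas as pd
from statsmodels.formula.api import ols  # assumed source of ols
df = pd.read_csv('./data/responses.csv')
lm1 = ols('OverallScore ~ EasyToUseQuestion', df).fit()  # pass the DataFrame loaded above
print(lm1.rsquared)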
There are two ways to do it nowadays. As a categorical, your column will be more readable and use less memory, and because it is a Categorical type you will still be able to order the values.
First, my preferred one:
df['grades'] = df['grades'].astype('category')
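To see what that one-liner does, here is a small sketch on made-up data. The 'grades' values and the row count are assumptions for illustration, not the asker's file. It compares memory before and after the conversion and prints the category order pandas picks by default.

```python
import pandas as pd

# Hypothetical data: five labels repeated to give 5,000 rows.
labels = ["Strongly agree", "Agree", "Neither", "Disagree", "Strongly disagree"]
df = pd.DataFrame({"grades": labels * 1000})

as_text = df["grades"]
as_category = df["grades"].astype("category")

# deep=True counts the string contents, not only the references to them.
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(as_category.dtype)
print(text_bytes, category_bytes)

# Without an explicit dtype the categories are unordered and sorted alphabetically.
print(as_category.cat.ordered)
print(list(as_category.cat.categories))
```

The alphabetical default is probably why the categories in the question looked like they were in the wrong order.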
astype used to accept a categories argument, but it isn't present anymore. So if you want to order your categories, or to have extra categories that aren't present in your data, you must use the solution below.
This recommendation is from the docs:
In [26]: from pandas.api.types import CategoricalDtype
In [27]: s = pd.Series(["a", "b", "c", "a"])
In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"],
....: ordered=True)
In [29]: s_cat = s.astype(cat_type)
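Applied to the survey column from the question, the same approach might look like the sketch below. The sample rows are invented, and the ordering (with 'Neither' in the middle) is my assumption about how the scale should run, so adjust it to your survey.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical sample of the survey column from the question.
df = pd.DataFrame({
    "EasyToUseQuestion": [
        "Agree", "Strongly agree", "Neither",
        "Disagree", "Strongly disagree", "Agree",
    ],
})

# Assumed ordering, from lowest to highest agreement.
likert = CategoricalDtype(
    categories=["Strongly disagree", "Disagree", "Neither", "Agree", "Strongly agree"],
    ordered=True,
)
df["EasyToUseQuestion"] = df["EasyToUseQuestion"].astype(likert)

# Integer position of each response in the declared order (0 = lowest).
df["EasyToUseCode"] = df["EasyToUseQuestion"].cat.codes

# Ordered categoricals support min/max and sort by the declared order.
print(df["EasyToUseQuestion"].min())
print(df.sort_values("EasyToUseQuestion"))
```

If you feed the integer codes into a regression as a numeric predictor, you are treating the steps between answers as equally spaced, which is a modelling choice rather than something pandas decides for you.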
Extra tip: get all existing values from a column with df.colname.unique().
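One use for that tip is checking the data against your category list before converting, since values missing from the list come out as NaN (as 'a' does in the docs example above). A minimal sketch, with a made-up Series containing one misspelled label:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical column with one unexpected spelling ('agree' in lower case).
s = pd.Series(["Agree", "Neither", "agree", "Disagree"])

categories = ["Strongly disagree", "Disagree", "Neither", "Agree", "Strongly agree"]

# Values present in the data but absent from the intended category list.
unexpected = set(s.dropna().unique()) - set(categories)
print(unexpected)

# Converting anyway leaves the unlisted value as a missing entry.
converted = s.astype(CategoricalDtype(categories=categories, ordered=True))
print(converted.isna().sum())
```

Cleaning up whatever lands in unexpected before the conversion avoids silently losing rows.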