
ANOVA in python using pandas dataframe with statsmodels or scipy?

I want to use a pandas DataFrame to break down the variance in one variable.

For example, if I have a column called 'Degrees', and I have this indexed for various dates, cities, and night vs. day, I want to find out what fraction of the variation in this series is coming from cross-sectional city variation, how much is coming from time series variation, and how much is coming from night vs. day.

In Stata I would use fixed effects and look at the R^2. Hopefully my question makes sense.

Basically, I want to find the ANOVA breakdown of "Degrees" by three other columns.

wolfsatthedoor asked Aug 27 '14 21:08


1 Answer

I set up a direct comparison against R, found that the assumptions can differ slightly, and got a hint from a statistician. Here is an example of ANOVA on a pandas DataFrame that matches R's results:

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # R code on R sample dataset

    #> anova(with(ChickWeight, lm(weight ~ Time + Diet)))
    #Analysis of Variance Table
    #
    #Response: weight
    #           Df  Sum Sq Mean Sq  F value    Pr(>F)
    #Time        1 2042344 2042344 1576.460 < 2.2e-16 ***
    #Diet        3  129876   43292   33.417 < 2.2e-16 ***
    #Residuals 573  742336    1296
    #write.csv(file='ChickWeight.csv', x=ChickWeight, row.names=F)

    cw = pd.read_csv('ChickWeight.csv')

    cw_lm = ols('weight ~ Time + C(Diet)', data=cw).fit()  # Specify C for Categorical
    print(sm.stats.anova_lm(cw_lm, typ=2))
    #                  sum_sq   df            F         PR(>F)
    #C(Diet)    129876.056995    3    33.416570   6.473189e-20
    #Time      2016357.148493    1  1556.400956  1.803038e-165
    #Residual   742336.119560  573          NaN            NaN
cphlewis answered Oct 04 '22 09:10