
Calculate summary statistics of columns in dataframe

I have a dataframe of the following form, for example:

    shopper_num,is_martian,number_of_items,count_pineapples,birth_country,tranpsortation_method
    1,FALSE,0,0,MX,
    2,FALSE,1,0,MX,
    3,FALSE,0,0,MX,
    4,FALSE,22,0,MX,
    5,FALSE,0,0,MX,
    6,FALSE,0,0,MX,
    7,FALSE,5,0,MX,
    8,FALSE,0,0,MX,
    9,FALSE,4,0,MX,
    10,FALSE,2,0,MX,
    11,FALSE,0,0,MX,
    12,FALSE,13,0,MX,
    13,FALSE,0,0,CA,
    14,FALSE,0,0,US,

How can I use pandas to calculate summary statistics of each column? The column data types vary, and some columns have no information at all.

I would then like to get back a dataframe of the form:

    columnname, max, min, median
    is_martian, NA, NA, FALSE

And so on for the remaining columns.

Tyler Wood asked Mar 06 '14 20:03




2 Answers

describe may give you everything you want. Otherwise, you can perform aggregations using groupby and pass a list of agg functions: http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once

    In [43]: df.describe()
    Out[43]:
           shopper_num is_martian  number_of_items  count_pineapples
    count      14.0000         14        14.000000                14
    mean        7.5000          0         3.357143                 0
    std         4.1833          0         6.452276                 0
    min         1.0000      False         0.000000                 0
    25%         4.2500          0         0.000000                 0
    50%         7.5000          0         0.000000                 0
    75%        10.7500          0         3.500000                 0
    max        14.0000      False        22.000000                 0

    [8 rows x 4 columns]

Note that, by default, describe leaves out columns that have no meaningful numeric summary, for instance columns containing string data.
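As a side note on string columns: describe also takes an include argument, and include="all" asks it to report on every column regardless of dtype. A minimal sketch, assuming a dataframe rebuilt by hand from the question's sample (the construction below is an assumption, not the asker's code):

```python
import pandas as pd

# Sample data rebuilt by hand from the question (an assumption about how df was created).
df = pd.DataFrame({
    "shopper_num": list(range(1, 15)),
    "is_martian": [False] * 14,
    "number_of_items": [0, 1, 0, 22, 0, 0, 5, 0, 4, 2, 0, 13, 0, 0],
    "count_pineapples": [0] * 14,
    "birth_country": ["MX"] * 12 + ["CA", "US"],
    "tranpsortation_method": [float("nan")] * 14,  # column with no information
})

# include="all" adds count/unique/top/freq rows for non-numeric columns;
# statistics that do not apply to a given column are left as NaN.
full_summary = df.describe(include="all")
print(full_summary.transpose())
```

For birth_country this reports the most frequent value (top) and how often it occurs (freq) rather than a mean or median, which may or may not be what you are after.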

You can transpose the result if you prefer:

    In [47]: df.describe().transpose()
    Out[47]:
                      count      mean       std    min   25%  50%    75%    max
    shopper_num          14       7.5    4.1833      1  4.25  7.5  10.75     14
    is_martian           14         0         0  False     0    0      0  False
    number_of_items      14  3.357143  6.452276      0     0    0    3.5     22
    count_pineapples     14         0         0      0     0    0      0      0

    [4 rows x 8 columns]

EdChum answered Sep 24 '22 00:09
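To get closer to the layout asked for in the question (one row per column, with max, min and median), the list-of-functions aggregation mentioned at the start of this answer can also be applied to the whole frame. This is only a sketch: the sample frame is rebuilt by hand, and restricting to numeric columns and then reindexing so the other columns show up as NaN rows is one possible choice, not something pandas does for you.

```python
import pandas as pd

# Sample data rebuilt by hand from the question (an assumption about how df was created).
df = pd.DataFrame({
    "shopper_num": list(range(1, 15)),
    "is_martian": [False] * 14,
    "number_of_items": [0, 1, 0, 22, 0, 0, 5, 0, 4, 2, 0, 13, 0, 0],
    "count_pineapples": [0] * 14,
    "birth_country": ["MX"] * 12 + ["CA", "US"],
    "tranpsortation_method": [float("nan")] * 14,  # column with no information
})

stats = ["max", "min", "median"]

# Aggregate the numeric columns only; strings have no median.
summary = df.select_dtypes(include="number").agg(stats).transpose()

# Put the non-numeric columns back as all-NaN rows, in the original column order.
summary = summary.reindex(df.columns)
summary.index.name = "columnname"
print(summary)

# The same list of functions, applied per group (here: per birth country).
per_country = df.groupby("birth_country")["number_of_items"].agg(stats)
print(per_country)
```

The groupby variant is only useful if you want the statistics split by some key; for a single summary of the whole frame, the first half is enough.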



Now there is the pandas_profiling package, which is a more complete alternative to df.describe().

If your pandas dataframe is df, the following returns a complete analysis, including warnings about missing values, skewness, and so on. It also presents histograms and correlation plots.

    import pandas_profiling
    pandas_profiling.ProfileReport(df)

See the example notebook detailing the usage.

akilat90 answered Sep 21 '22 00:09