Calculate summary statistics of columns in dataframe

Tags: python, pandas, dataframe, csv, profiling

I have a dataframe of the following form (for example):

    shopper_num,is_martian,number_of_items,count_pineapples,birth_country,tranpsortation_method
    1,FALSE,0,0,MX,
    2,FALSE,1,0,MX,
    3,FALSE,0,0,MX,
    4,FALSE,22,0,MX,
    5,FALSE,0,0,MX,
    6,FALSE,0,0,MX,
    7,FALSE,5,0,MX,
    8,FALSE,0,0,MX,
    9,FALSE,4,0,MX,
    10,FALSE,2,0,MX,
    11,FALSE,0,0,MX,
    12,FALSE,13,0,MX,
    13,FALSE,0,0,CA,
    14,FALSE,0,0,US,

How can I use Pandas to calculate summary statistics of each column (column data types are variable, and some columns have no information) and then return a dataframe of the form:

    columnname, max, min, median
    is_martian, NA, NA, FALSE

And so on for each column.


asked Mar 06 '14 20:03

Tyler Wood

2 Answers

describe may give you everything you want. Otherwise, you can perform aggregations using groupby and pass a list of agg functions: http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once

    In [43]:

    df.describe()

    Out[43]:

           shopper_num  is_martian  number_of_items  count_pineapples
    count      14.0000          14        14.000000                14
    mean        7.5000           0         3.357143                 0
    std         4.1833           0         6.452276                 0
    min         1.0000       False         0.000000                 0
    25%         4.2500           0         0.000000                 0
    50%         7.5000           0         0.000000                 0
    75%        10.7500           0         3.500000                 0
    max        14.0000       False        22.000000                 0

    [8 rows x 4 columns]

Note that some columns cannot be summarised this way because there is no logical way to compute these statistics for them, for instance columns containing string data.
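As a hedged sketch of how string columns can still appear in the summary: current pandas releases accept include="all" in describe(), which adds count, unique, top and freq rows for non-numeric columns. The data below is the sample from the question, loaded through io.StringIO; the names csv_text and summary are illustrative only.

```python
import io

import pandas as pd

# Sample data copied from the question; the last column is empty in every row.
csv_text = """shopper_num,is_martian,number_of_items,count_pineapples,birth_country,tranpsortation_method
1,FALSE,0,0,MX,
2,FALSE,1,0,MX,
3,FALSE,0,0,MX,
4,FALSE,22,0,MX,
5,FALSE,0,0,MX,
6,FALSE,0,0,MX,
7,FALSE,5,0,MX,
8,FALSE,0,0,MX,
9,FALSE,4,0,MX,
10,FALSE,2,0,MX,
11,FALSE,0,0,MX,
12,FALSE,13,0,MX,
13,FALSE,0,0,CA,
14,FALSE,0,0,US,
"""

df = pd.read_csv(io.StringIO(csv_text))

# include="all" asks describe() to report on every column, not only numeric ones.
# Statistics that do not apply to a column (e.g. mean of a string column) are NaN.
summary = df.describe(include="all")
print(summary)
```

Rows such as unique and top are filled only for the non-numeric columns, while mean, std and the quantiles are filled only for the numeric ones.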

You can transpose the result if you prefer:

    In [47]:

    df.describe().transpose()

    Out[47]:

                      count      mean       std    min   25%  50%    75%    max
    shopper_num          14       7.5    4.1833      1  4.25  7.5  10.75     14
    is_martian           14         0         0  False     0    0      0  False
    number_of_items      14  3.357143  6.452276      0     0    0    3.5     22
    count_pineapples     14         0         0      0     0    0      0      0

    [4 rows x 8 columns]
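The other route mentioned at the top of this answer, passing a list of aggregation functions, can be sketched roughly as follows. This is a minimal illustration, not the only way to do it: it assumes only the numeric columns should be summarised (boolean and string columns are dropped by select_dtypes), and the variable names numeric and stats are made up for the example.

```python
import pandas as pd

# A small frame mirroring the question's sample data (empty column omitted).
df = pd.DataFrame({
    "shopper_num": list(range(1, 15)),
    "is_martian": [False] * 14,
    "number_of_items": [0, 1, 0, 22, 0, 0, 5, 0, 4, 2, 0, 13, 0, 0],
    "count_pineapples": [0] * 14,
    "birth_country": ["MX"] * 12 + ["CA", "US"],
})

# Keep only numeric columns so that median is well defined for each of them.
numeric = df.select_dtypes(include="number")

# Passing a list of function names returns one row per statistic;
# transposing gives one row per original column, as the question asks.
stats = numeric.agg(["max", "min", "median"]).transpose()
stats.index.name = "columnname"
print(stats)
```

Compared with describe(), this returns only the statistics that were requested, at the cost of having to decide up front which columns they make sense for.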


answered Sep 24 '22 00:09

EdChum

Now there is the pandas_profiling package, which is a more complete alternative to df.describe().

If your pandas dataframe is df, the below will return a complete analysis including some warnings about missing values, skewness, etc. It presents histograms and correlation plots as well.

    import pandas_profiling
    pandas_profiling.ProfileReport(df)

See the example notebook detailing the usage.

answered Sep 21 '22 00:09

akilat90
