
Pandas 'describe' is not returning summary of all columns

Tags: python, pandas

I am running 'describe()' on a dataframe and getting summaries of only the int columns (pandas 0.14.0).

The documentation says that for object columns, the frequency of the most common value and additional statistics would be returned. What could be wrong? (No error message is returned, by the way.)

Edit:

I think this is how the function is set to behave on mixed column types in a dataframe, although the documentation fails to mention it.

Example code:

    import pandas as pd

    df_test = pd.DataFrame({'$a': [1, 2], '$b': [10, 20]})
    df_test.dtypes
    df_test.describe()
    df_test['$a'] = df_test['$a'].astype(str)
    df_test.describe()
    df_test['$a'].describe()
    df_test['$b'].describe()

My ugly workaround in the meantime:

    def my_df_describe(df):
        objects = []
        numerics = []
        for c in df:
            if (df[c].dtype == object):
                objects.append(c)
            else:
                numerics.append(c)

        return df[numerics].describe(), df[objects].describe()
user2808117 asked Jul 02 '14 06:07


1 Answer

As of pandas v0.15.0, call DataFrame.describe(include = 'all') to get a summary of all the columns when the dataframe has mixed column types. The default behavior is to only provide a summary for the numerical columns.

Example:

    In[1]:  import numpy as np
            import pandas as pd

            df = pd.DataFrame({'$a':['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})
            df.describe(include = 'all')

    Out[1]:
             $a    $b
    count    5     5.000000
    unique   4     NaN
    top      a     NaN
    freq     2     NaN
    mean     NaN   2.000000
    std      NaN   1.581139
    min      NaN   0.000000
    25%      NaN   1.000000
    50%      NaN   2.000000
    75%      NaN   3.000000
    max      NaN   4.000000

The numerical columns will have NaNs for summary statistics pertaining to objects (strings) and vice versa.
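The NaN pattern described above can be checked directly. This is only a minimal sketch that reuses the example frame from this answer; the variable names (summary, numeric_gaps, object_gaps, object_only) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same example frame as in the answer: one string column, one integer column.
df = pd.DataFrame({'$a': ['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})
summary = df.describe(include='all')

# Statistics that only make sense for strings are NaN for the numeric column ...
numeric_gaps = summary.loc[['unique', 'top', 'freq'], '$b'].isna().all()

# ... and statistics that only make sense for numbers are NaN for the string column.
object_gaps = summary.loc[['mean', 'std', 'min', 'max'], '$a'].isna().all()

# Dropping the NaNs per column leaves just the rows that apply to that column.
object_only = summary['$a'].dropna()

print(numeric_gaps, object_gaps)
print(object_only)
```

Dropping the NaN rows per column is one way to read the combined table when the mixed layout gets in the way.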

Summarizing only numerical or object columns

  1. To call describe() on just the numerical columns use describe(include = [np.number])
  2. To call describe() on just the objects (strings) use describe(include = ['O']).

    In[2]:  df.describe(include = [np.number])

    Out[2]:
             $b
    count    5.000000
    mean     2.000000
    std      1.581139
    min      0.000000
    25%      1.000000
    50%      2.000000
    75%      3.000000
    max      4.000000

    In[3]:  df.describe(include = ['O'])

    Out[3]:
             $a
    count    5
    unique   4
    top      a
    freq     2
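For comparison, the workaround from the question (returning a numeric summary and an object summary as a pair) could probably be written with the include parameter instead of a manual loop over dtypes. This is a sketch, not a definitive replacement: the helper name describe_by_kind is hypothetical, the string column is created with an explicit object dtype so that include = ['O'] matches it, and the frame is assumed to contain at least one column of each kind.

```python
import numpy as np
import pandas as pd


def describe_by_kind(df):
    """Return (numeric summary, object summary) as two separate frames.

    Hypothetical helper; assumes df has at least one numeric column
    and at least one object column.
    """
    numeric_summary = df.describe(include=[np.number])
    object_summary = df.describe(include=['O'])
    return numeric_summary, object_summary


# Explicit object dtype is an assumption made here so the column
# is selected by include=['O'].
df = pd.DataFrame({
    '$a': pd.Series(['a', 'b', 'c', 'd', 'a'], dtype=object),
    '$b': np.arange(5),
})

numeric_summary, object_summary = describe_by_kind(df)
print(numeric_summary)
print(object_summary)
```

Keeping the two summaries separate avoids the NaN padding of include = 'all', at the cost of handling two frames.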
ilyas patanam answered Sep 24 '22 04:09