I am running describe() on a DataFrame and getting summaries of only the int columns (pandas 0.14.0).
The documentation says that for object columns, the frequency of the most common value and additional statistics would be returned. What could be wrong? (No error message is returned, by the way.)
Edit:
I think this is how the function is set to behave on mixed column types in a DataFrame, although the documentation fails to mention it.
Example code:
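    import pandas as pd

    df_test = pd.DataFrame({'$a': [1, 2], '$b': [10, 20]})
    df_test.dtypes
    df_test.describe()                         # both columns are int, both are summarized
    df_test['$a'] = df_test['$a'].astype(str)  # '$a' becomes an object column
    df_test.describe()                         # only '$b' is summarized now
    df_test['$a'].describe()
    df_test['$b'].describe()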
My ugly workaround in the meantime:
    def my_df_describe(df):
        objects = []
        numerics = []
        for c in df:
            if df[c].dtype == object:
                objects.append(c)
            else:
                numerics.append(c)
        return df[numerics].describe(), df[objects].describe()
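For what it's worth, the same split can probably be written more compactly with DataFrame.select_dtypes, assuming your pandas version provides it. This is only a sketch: the name describe_by_kind is made up for illustration, and "non-numeric" here means everything that is not a NumPy numeric dtype, which is slightly broader than the `dtype == object` check above. The guards are there because describe() raises on a frame with no columns.

```python
import numpy as np
import pandas as pd


def describe_by_kind(df):
    """Hypothetical helper: summarize numeric and non-numeric columns separately."""
    numeric = df.select_dtypes(include=[np.number])
    other = df.select_dtypes(exclude=[np.number])
    # describe() raises on a frame without columns, so return None for an empty side.
    numeric_summary = numeric.describe() if numeric.shape[1] else None
    other_summary = other.describe() if other.shape[1] else None
    return numeric_summary, other_summary


# '$a' is forced to object dtype so the example does not depend on
# how a given pandas version infers the dtype of strings.
df_test = pd.DataFrame({
    '$a': pd.Series(['1', '2'], dtype=object),
    '$b': [10, 20],
})
num_summary, obj_summary = describe_by_kind(df_test)
print(num_summary)
print(obj_summary)
```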
As of pandas 0.15.0, use DataFrame.describe(include = 'all') to get a summary of all the columns when the DataFrame has mixed column types. The default behavior is to only provide a summary for the numerical columns.
Example:
    In[1]: df = pd.DataFrame({'$a': ['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})
           df.describe(include = 'all')

    Out[1]:
             $a        $b
    count     5  5.000000
    unique    4       NaN
    top       a       NaN
    freq      2       NaN
    mean    NaN  2.000000
    std     NaN  1.581139
    min     NaN  0.000000
    25%     NaN  1.000000
    50%     NaN  2.000000
    75%     NaN  3.000000
    max     NaN  4.000000
The numerical columns will have NaNs for summary statistics pertaining to objects (strings) and vice versa.
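If the NaN padding gets in the way, one option is to pull a single column out of the combined summary and drop the rows that do not apply to it. A minimal sketch, using the same data as the example; the variable names are mine, and '$a' is given an explicit object dtype as an assumption so the result does not depend on how your pandas version infers string columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    '$a': pd.Series(['a', 'b', 'c', 'd', 'a'], dtype=object),
    '$b': np.arange(5),
})

summary = df.describe(include='all')

# Statistics that do not apply to a column are NaN,
# so dropna() keeps only the rows meaningful for that column.
a_stats = summary['$a'].dropna()
b_stats = summary['$b'].dropna()

print(a_stats)
print(b_stats)
```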
Summarizing only numerical or object columns
To call describe() on just the numerical columns, use describe(include = [np.number]).
To call describe() on just the object (string) columns, use describe(include = ['O']).
    In[2]: df.describe(include = [np.number])

    Out[2]:
                 $b
    count  5.000000
    mean   2.000000
    std    1.581139
    min    0.000000
    25%    1.000000
    50%    2.000000
    75%    3.000000
    max    4.000000

    In[3]: df.describe(include = ['O'])

    Out[3]:
           $a
    count   5
    unique  4
    top     a
    freq    2
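describe() also takes an exclude parameter, which can be handy when it is easier to say what you do not want summarized. A small sketch along the lines of the examples above; again '$a' is built with an explicit object dtype, which is my assumption to keep the result independent of how strings are inferred:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    '$a': pd.Series(['a', 'b', 'c', 'd', 'a'], dtype=object),
    '$b': np.arange(5),
})

# Excluding the numeric dtypes leaves only the object column in this frame.
non_numeric = df.describe(exclude=[np.number])
print(non_numeric)
```

For this frame the result should match what include = ['O'] gives; on frames that also hold other non-numeric dtypes the two calls may select different columns.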