
In pandas, how can I get a DataFrame as the output when I sum a DataFrame?

When I sum a DataFrame, it returns a Series:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([[1, 2, 3], [2, 3, 3]], columns=['a', 'b', 'c'])

In [3]: df
Out[3]: 
   a  b  c
0  1  2  3
1  2  3  3

In [4]: s = df.sum()

In [5]: type(s)
Out[5]: pandas.core.series.Series

I know I can construct a new DataFrame from this Series, but is there a more "pandasic" way?

waitingkuo asked May 09 '13 10:05



2 Answers

I'm going to go ahead and say "no": I don't think there is a direct way to do it. The pandastic (and Pythonic) way is to be explicit:

pd.DataFrame(df.sum(), columns=['sum'])

or, more elegantly, using a dictionary (be aware that this copies the summed array):

pd.DataFrame({'sum': df.sum()})

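As @root notes, it's faster to go through NumPy (this needs import numpy as np, and the result is indexed 0, 1, 2 rather than by column name):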

pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])

(As the Zen of Python states, "practicality beats purity", so if the extra time matters to you, use this.)
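
For reference, here is a minimal runnable sketch of the three constructions above, applied to the DataFrame from the question. The variable names (explicit, from_dict, from_numpy, labelled) are chosen only for illustration. The last line passes index=df.columns, which is an addition that is not in the original one-liner; it restores the column labels that the NumPy route drops.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 3, 3]], columns=['a', 'b', 'c'])

# Explicit construction from the summed Series; index is a, b, c
explicit = pd.DataFrame(df.sum(), columns=['sum'])

# Dictionary form; same labels and values as above
from_dict = pd.DataFrame({'sum': df.sum()})

# NumPy route: the column labels are lost, the index becomes 0, 1, 2
from_numpy = pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])

# Assumption for illustration: pass index=df.columns to keep the labels
labelled = pd.DataFrame(np.sum(df.values, axis=0), index=df.columns, columns=['sum'])

print(explicit)
print(from_numpy)
```

The first two results carry the labels a, b, c, so they can be aligned or joined with other data by column name. The plain NumPy result cannot be unless the index is supplied.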

However, perhaps the most pandastic way is to just use the Series! :)


Some %timeits for your tiny example:

In [11]: %timeit pd.DataFrame(df.sum(), columns=['sum'])
1000 loops, best of 3: 356 us per loop

In [12]: %timeit pd.DataFrame({'sum': df.sum()})
1000 loops, best of 3: 462 us per loop

In [13]: %timeit  pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])
1000 loops, best of 3: 205 us per loop

and for a slightly larger one:

In [21]: df = pd.DataFrame(np.random.randn(100000, 3), columns=list('abc'))

In [22]: %timeit pd.DataFrame(df.sum(), columns=['sum'])
100 loops, best of 3: 7.99 ms per loop

In [23]: %timeit pd.DataFrame({'sum': df.sum()})
100 loops, best of 3: 8.3 ms per loop

In [24]: %timeit  pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])
100 loops, best of 3: 2.47 ms per loop
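
Outside IPython, a similar comparison can be run with the standard-library timeit module. This is only a rough sketch: the helper name bench and the repeat counts are hypothetical choices, and absolute timings will differ by machine and pandas version, so no numbers are shown here.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 3), columns=list('abc'))

# The three constructions from the answer, wrapped as callables
candidates = {
    'Series + columns': lambda: pd.DataFrame(df.sum(), columns=['sum']),
    'dict': lambda: pd.DataFrame({'sum': df.sum()}),
    'numpy': lambda: pd.DataFrame(np.sum(df.values, axis=0), columns=['sum']),
}


def bench(func, number=20, repeat=3):
    """Return the best observed time per call, in milliseconds."""
    best_total = min(timeit.repeat(func, number=number, repeat=repeat))
    return best_total / number * 1e3


results = {name: bench(func) for name, func in candidates.items()}
for name, ms in results.items():
    print(f"{name}: {ms:.2f} ms per call")
```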
Andy Hayden answered Sep 22 '22 05:09


Often you need not only to convert the column sums into a DataFrame but also to transpose it, so that the sums form a single row with the original column names. Chaining to_frame() and transpose() does this:

df.sum().to_frame().transpose()
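
A short sketch of what this chain produces on the DataFrame from the question. Passing name='sum' to to_frame() is an optional addition, not part of the answer above; it labels the resulting row, which is otherwise labelled 0. The variable names row and named_row are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 3, 3]], columns=['a', 'b', 'c'])

# One row, columns a, b, c; the row label is 0 because the Series is unnamed
row = df.sum().to_frame().transpose()

# Assumption for illustration: name the column first so the row is labelled 'sum'
named_row = df.sum().to_frame(name='sum').T

print(row)
print(named_row)
```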
Plo_Koon answered Sep 21 '22 05:09