assigning values in each column to be the sum of that column

Question

I have DataFrame and I am trying to assign all values in each column to be the sum of that column.

x = pd.DataFrame(data = [[1,2],[3,4],[5,6],[7,8],[9,10]],index=[1,2,3,4,5],columns=['a','b'])
x 
   a   b
1  1   2
2  3   4
3  5   6
4  7   8
5  9  10

the output should be

I want to use x.apply(f, axis=0), but I do not know how to define a function that convert a column to be the sum of all column values in a lambda function. The following line raise SyntaxError: can't assign to lambda

f = lambda x : x[:]= x.sum()

jezrael · Accepted Answer

Another faster numpy solution with numpy.tile:

print (pd.DataFrame(np.tile(x.sum().values, (len(x.index),1)), 
                    columns=x.columns, 
                    index=x.index))
    a   b
1  25  30
2  25  30
3  25  30
4  25  30
5  25  30

Another solution with numpy.repeat:

h = pd.DataFrame(x.sum().values[np.newaxis,:].repeat(len(x.index), axis=0),
                 columns=x.columns,
                 index=x.index)

print (h)
    a   b
1  25  30
2  25  30
3  25  30
4  25  30
5  25  30


In [431]: %timeit df = pd.DataFrame([x.sum()] * len(x))
1000 loops, best of 3: 786 µs per loop

In [432]: %timeit (pd.DataFrame(np.tile(x.sum().values, (len(x.index),1)), columns=x.columns, index=x.index))
1000 loops, best of 3: 192 µs per loop

In [460]: %timeit pd.DataFrame(x.sum().values[np.newaxis,:].repeat(len(x.index), axis=0),columns=x.columns, index=x.index)
The slowest run took 8.65 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 184 µs per loop

Alexander · Answer

for col in df:
    df[col] = df[col].sum()

or a slower solution that doesn't use looping...

df = pd.DataFrame([df.sum()] * len(df))

Timings

@jezrael Thanks for the timings. This does them on a larger dataframe and includes the for loop as well. Most of the time is spent creating the dataframe rather than calculating the sums, so the most efficient method that does this appears to be the one from @ayhan that assigns the sum to the values directly:

from string import ascii_letters

df = pd.DataFrame(np.random.randn(10000, 52), columns=list(ascii_letters))

# A baseline timing figure to determine sum of each column.
%timeit df.sum()
1000 loops, best of 3: 1.47 ms per loop

# Solution 1 from @Alexander
%%timeit
for col in df:
    df[col] = df[col].sum()
100 loops, best of 3: 21.3 ms per loop

# Solution 2 from @Alexander (without `for loop`, but much slower)
%timeit df2 = pd.DataFrame([df.sum()] * len(df))
1 loops, best of 3: 270 ms per loop

# Solution from @PiRSquared
%timeit df.stack().groupby(level=1).transform('sum').unstack()
10 loops, best of 3: 159 ms per loop

# Solution 1 from @Jezrael
%timeit (pd.DataFrame(np.tile(df.sum().values, (len(df.index),1)), columns=df.columns, index=df.index))
100 loops, best of 3: 2.32 ms per loop

# Solution 2 from @Jezrael
%%timeit
df2 = pd.DataFrame(df.sum().values[np.newaxis,:].repeat(len(df.index), axis=0),
                 columns=df.columns,
                 index=df.index)
100 loops, best of 3: 2.3 ms per loop

# Solution from @ayhan
%time df.values[:] = df.values.sum(0)
CPU times: user 1.54 ms, sys: 485 µs, total: 2.02 ms
Wall time: 1.36 ms  # <<<< FASTEST

assigning values in each column to be the sum of that column

Tags:

python

pandas

lambda

Wang

2 Answers

jezrael

Alexander

Recent Activity

Donate For Us

assigning values in each column to be the sum of that column

Tags:

python

pandas

lambda

Wang

2 Answers

jezrael

Alexander

Related questions

Recent Activity

Donate For Us