
Assigning values in each column to be the sum of that column

I have a DataFrame and I am trying to set every value in each column to the sum of that column.

import pandas as pd

x = pd.DataFrame(data=[[1,2],[3,4],[5,6],[7,8],[9,10]], index=[1,2,3,4,5], columns=['a','b'])
x
   a   b
1  1   2
2  3   4
3  5   6
4  7   8
5  9  10

The output should be:

   a    b
1  25   30
2  25   30
3  25   30
4  25   30
5  25   30

I want to use x.apply(f, axis=0), but I do not know how to define a lambda function that converts a column into the sum of all its values. The following line raises SyntaxError: can't assign to lambda

f = lambda x : x[:]= x.sum()
Wang asked Dec 04 '22 00:12



2 Answers

A faster NumPy solution uses numpy.tile:

print(pd.DataFrame(np.tile(x.sum().values, (len(x.index), 1)),
                   columns=x.columns,
                   index=x.index))
    a   b
1  25  30
2  25  30
3  25  30
4  25  30
5  25  30

Another solution with numpy.repeat:

h = pd.DataFrame(x.sum().values[np.newaxis,:].repeat(len(x.index), axis=0),
                 columns=x.columns,
                 index=x.index)

print(h)
    a   b
1  25  30
2  25  30
3  25  30
4  25  30
5  25  30
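Both snippets above build the same intermediate array before wrapping it in a DataFrame. A minimal sketch of those steps on the question's frame, with the imports included (the names sums, tiled, repeated and tile_df are only illustrative):

```python
import numpy as np
import pandas as pd

x = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]],
                 index=[1, 2, 3, 4, 5], columns=['a', 'b'])

# One sum per column, as a plain 1-D NumPy array.
sums = x.sum().to_numpy()

# np.tile stacks that row len(x) times, giving shape (5, 2).
tiled = np.tile(sums, (len(x), 1))

# The repeat route: add a leading axis, then repeat along it.
repeated = sums[np.newaxis, :].repeat(len(x), axis=0)

# Passing index and columns explicitly keeps the original labels.
tile_df = pd.DataFrame(tiled, index=x.index, columns=x.columns)
print(tile_df)
```

The two arrays are identical, so the choice between tile and repeat is mostly a matter of taste.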


Timings

In [431]: %timeit df = pd.DataFrame([x.sum()] * len(x))
1000 loops, best of 3: 786 µs per loop

In [432]: %timeit (pd.DataFrame(np.tile(x.sum().values, (len(x.index),1)), columns=x.columns, index=x.index))
1000 loops, best of 3: 192 µs per loop

In [460]: %timeit pd.DataFrame(x.sum().values[np.newaxis,:].repeat(len(x.index), axis=0),columns=x.columns, index=x.index)
The slowest run took 8.65 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 184 µs per loop
jezrael answered Feb 15 '23 00:02



for col in df:
    df[col] = df[col].sum()

Or, a slower solution that avoids the explicit loop:

df = pd.DataFrame([df.sum()] * len(df))
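For the x.apply route the question asks about: a lambda body must be a single expression, and assignment is a statement, which is why `lambda x: x[:] = x.sum()` is a SyntaxError. One way around it, sketched here as an assumption rather than as anything from the answers, is to have the lambda return a new Series filled with the column sum. The loop is shown alongside it on a copy of the question's frame (via_apply and via_loop are illustrative names):

```python
import pandas as pd

x = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]],
                 index=[1, 2, 3, 4, 5], columns=['a', 'b'])

# apply variant: return a new Series per column instead of assigning in place.
# A scalar plus an index broadcasts the scalar to every row.
via_apply = x.apply(lambda col: pd.Series(col.sum(), index=col.index), axis=0)

# Loop variant, run on a copy so x itself is left untouched.
via_loop = x.copy()
for col in via_loop:
    via_loop[col] = via_loop[col].sum()

print(via_apply)
print(via_loop)
```

Expect the apply version to be slower than the loop on wide frames, since it builds one Series per column.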

Timings

@jezrael Thanks for the timings. The ones below use a larger DataFrame and include the for loop as well. Most of the time goes into creating the DataFrame rather than calculating the sums, so the most efficient method appears to be the one from @ayhan, which assigns the sums directly to the underlying values:

from string import ascii_letters

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 52), columns=list(ascii_letters))

# A baseline timing figure to determine sum of each column.
%timeit df.sum()
1000 loops, best of 3: 1.47 ms per loop

# Solution 1 from @Alexander
%%timeit
for col in df:
    df[col] = df[col].sum()
100 loops, best of 3: 21.3 ms per loop

# Solution 2 from @Alexander (without `for loop`, but much slower)
%timeit df2 = pd.DataFrame([df.sum()] * len(df))
1 loops, best of 3: 270 ms per loop

# Solution from @PiRSquared
%timeit df.stack().groupby(level=1).transform('sum').unstack()
10 loops, best of 3: 159 ms per loop

# Solution 1 from @Jezrael
%timeit (pd.DataFrame(np.tile(df.sum().values, (len(df.index),1)), columns=df.columns, index=df.index))
100 loops, best of 3: 2.32 ms per loop

# Solution 2 from @Jezrael
%%timeit
df2 = pd.DataFrame(df.sum().values[np.newaxis,:].repeat(len(df.index), axis=0),
                 columns=df.columns,
                 index=df.index)
100 loops, best of 3: 2.3 ms per loop

# Solution from @ayhan
%time df.values[:] = df.values.sum(0)
CPU times: user 1.54 ms, sys: 485 µs, total: 2.02 ms
Wall time: 1.36 ms  # <<<< FASTEST
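A caveat on the in-place `df.values[:] = ...` trick: it only changes the frame when `df.values` is a writable view of the underlying data. As far as I know, that holds for a single-dtype frame like the one timed here, but a mixed-dtype frame returns a copy (so the assignment silently does nothing), and with Copy-on-Write enabled (the default from pandas 3.0) the array is read-only. A minimal sketch of a version that does not depend on that, building a new frame instead (fill_with_column_sums is a hypothetical helper name, not a pandas function):

```python
import numpy as np
import pandas as pd

def fill_with_column_sums(df):
    """Hypothetical helper: return a new frame whose every row holds the column sums."""
    sums = df.sum().to_numpy()
    # broadcast_to yields a read-only view; copy() makes it an ordinary writable array.
    data = np.broadcast_to(sums, df.shape).copy()
    return pd.DataFrame(data, index=df.index, columns=df.columns)

df = pd.DataFrame(np.arange(1, 11).reshape(5, 2),
                  index=[1, 2, 3, 4, 5], columns=['a', 'b'])
result = fill_with_column_sums(df)
print(result)
```

This gives up the in-place speed advantage and should land close to the tile and repeat timings above, since it also constructs a new DataFrame.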
Alexander answered Feb 14 '23 22:02
