
pandas efficient dataframe set row

First I have the following empty DataFrame preallocated:

df=DataFrame(columns=range(10000),index=range(1000))

Then I want to update df row by row (efficiently), using a length-10000 NumPy array as the data for each row. My problem is that I don't know which DataFrame method I should use for this.

Thank you!

wdg asked Sep 12 '13 18:09



2 Answers

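Here are three methods, timed on a smaller frame of only 100 columns and 1000 rows. The snippets assume `import numpy as np` and `from pandas import DataFrame`.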

In [5]: row = np.random.randn(100)

Row-wise assignment

In [6]: def method1():
   ...:     df = DataFrame(columns=range(100),index=range(1000))
   ...:     for i in range(len(df)):
   ...:         df.iloc[i] = row
   ...:     return df
   ...: 

Build up the arrays in a list, create the frame all at once

In [9]: def method2():
   ...:     return DataFrame([ row for i in range(1000) ])
   ...: 

Columnwise assignment (with transposes at both ends)

In [13]: def method3():
   ....:     df = DataFrame(columns=range(100),index=range(1000)).T
   ....:     for i in range(1000):
   ....:         df[i] = row
   ....:     return df.T
   ....: 

All three produce the same output frame

In [22]: (method2() == method1()).all().all()
Out[22]: True

In [23]: (method2() == method3()).all().all()
Out[23]: True


In [8]: %timeit method1()
1 loops, best of 3: 1.76 s per loop

In [10]: %timeit method2()
1000 loops, best of 3: 7.79 ms per loop

In [14]: %timeit method3()
1 loops, best of 3: 1.33 s per loop

Building up a list and then creating the frame all at once is clearly faster than either form of assignment, here by about two orders of magnitude (7.79 ms versus 1.33 s to 1.76 s). Each assignment involves copying; building the frame all at once copies only once.
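As a rough sketch of the same build-then-construct approach when every row is different: collect the rows in a plain list and call the constructor once at the end. `make_row` is a hypothetical stand-in for whatever computes row `i`, and `build_frame` is just an illustrative name; neither comes from the timings above.

```python
import numpy as np
import pandas as pd


def make_row(i, n_cols=100):
    # Hypothetical placeholder: returns a length-n_cols array for row i.
    return np.full(n_cols, float(i))


def build_frame(n_rows=1000, n_cols=100):
    # Collect all rows first, then construct the DataFrame a single time.
    rows = [make_row(i, n_cols) for i in range(n_rows)]
    return pd.DataFrame(rows)


df = build_frame()
```

The loop only appends to a Python list, so no DataFrame is touched until the final constructor call.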

Jeff answered Oct 06 '22 09:10


df=DataFrame(columns=range(10),index=range(10))
a = np.array( [9,9,9,9,9,9,9,9,9,9] )

Update row:

df.loc[2] = a

Using Jeff's idea...

df2 = DataFrame(data=np.random.randn(10,10), index=np.arange(10))
df2.head().T
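If the rows really do arrive one at a time, one possible variant (a sketch, assuming all rows are numeric and have the same length) is to preallocate a 2-D NumPy array, fill that row by row, and wrap it in a DataFrame only once at the end. The names `buf` and `rng` and the random data are illustrative placeholders.

```python
import numpy as np
import pandas as pd

n_rows, n_cols = 1000, 100

# Preallocate a plain NumPy buffer instead of an empty DataFrame.
buf = np.empty((n_rows, n_cols), dtype=np.float64)

rng = np.random.default_rng(0)
for i in range(n_rows):
    # Placeholder data: in practice this would be the array for row i.
    buf[i] = rng.standard_normal(n_cols)

# Construct the frame once, after every row has been written.
df = pd.DataFrame(buf, index=range(n_rows), columns=range(n_cols))
```

Row assignment into a NumPy array is a simple in-place write, so the per-row cost stays small, and the resulting frame has a float dtype rather than the object dtype of an empty preallocated DataFrame.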

I have written up a notebook answering the question: https://www.wakari.io/sharing/bundle/hrojas/pandas%20efficient%20dataframe%20set%20row

DataByDavid answered Oct 06 '22 09:10