First I have the following empty DataFrame preallocated:
df=DataFrame(columns=range(10000),index=range(1000))
Then I want to update df row by row (efficiently), using a length-10000 NumPy array as the data for each row. My problem is that I have no idea which DataFrame method I should use to accomplish this.
Thank you!
By using apply with axis=1, we can run a function on every row of a DataFrame. This solution still loops over the rows to get the job done, but apply is generally better optimized than iterrows, which results in faster runtimes.
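As a minimal sketch of the row-wise apply described above (the frame contents and the `row_range` helper are made up for illustration, not taken from the question):

```python
import numpy as np
import pandas as pd

# Small illustrative frame: 4 rows, 3 columns.
df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])


def row_range(row):
    """Hypothetical per-row function: spread between largest and smallest value."""
    return row.max() - row.min()


# axis=1 passes each row to the function as a Series.
result = df.apply(row_range, axis=1)
print(result)
```

Note that apply returns a new object; it reads rows but does not write them back into df, so by itself it does not solve the "fill row by row" problem.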
The DataFrame.iterrows() method iterates over DataFrame rows as (index, Series) pairs. Note that it does not preserve dtypes across rows, because each row is converted into a Series.
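A small sketch of that dtype caveat, assuming a frame with one integer and one float column (the column names are illustrative):

```python
import pandas as pd

# One int64 column and one float64 column.
df = pd.DataFrame({"n": [1, 2], "x": [1.5, 2.5]})

# iterrows yields (index, Series) pairs; take the first row's Series.
first_index, first_row = next(df.iterrows())

print(df["n"].dtype)    # dtype of the column in the frame
print(first_row.dtype)  # dtype of the row Series, after upcasting to fit both values
```

The integer in column "n" comes back as a float inside the row Series, because a single Series has to hold the whole row in one dtype.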
Here are 3 methods, using only 100 columns and 1000 rows:
In [4]: import numpy as np; from pandas import DataFrame
In [5]: row = np.random.randn(100)
Row-wise assignment
In [6]: def method1():
   ...:     df = DataFrame(columns=range(100), index=range(1000))
   ...:     for i in range(len(df)):  # xrange on Python 2
   ...:         df.iloc[i] = row
   ...:     return df
   ...:
Build up the arrays in a list, then create the frame all at once
In [9]: def method2():
   ...:     return DataFrame([row for i in range(1000)])
   ...:
Column-wise assignment (with transposes at both ends)
In [13]: def method3():
   ....:     df = DataFrame(columns=range(100), index=range(1000)).T
   ....:     for i in range(1000):  # xrange on Python 2
   ....:         df[i] = row
   ....:     return df.T
   ....:
These all have the same output frame
In [22]: (method2() == method1()).all().all()
Out[22]: True
In [23]: (method2() == method3()).all().all()
Out[23]: True
In [8]: %timeit method1()
1 loops, best of 3: 1.76 s per loop
In [10]: %timeit method2()
1000 loops, best of 3: 7.79 ms per loop
In [14]: %timeit method3()
1 loops, best of 3: 1.33 s per loop
It is clear that building up a list and then creating the frame all at once is orders of magnitude faster than any form of assignment (roughly 170 to 225 times faster in the timings above). Each assignment involves copying; building the frame all at once copies only once.
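Applied to the original question, the same idea can be sketched by filling a preallocated NumPy array row by row and wrapping it in a DataFrame once at the end. This is only a sketch under assumptions: `build_frame` and its `make_row` callback are hypothetical names, and the sizes are reduced to keep it quick.

```python
import numpy as np
import pandas as pd


def build_frame(n_rows, n_cols, make_row):
    """Fill a preallocated ndarray row by row, then build the DataFrame once.

    `make_row` is a hypothetical callback returning a length-`n_cols` array
    for row `i`; substitute whatever produces the real data.
    """
    data = np.empty((n_rows, n_cols), dtype=float)  # preallocate a plain array
    for i in range(n_rows):
        data[i] = make_row(i)  # in-place write into the array
    return pd.DataFrame(data)  # single DataFrame construction at the end


# Illustrative usage with random rows (the question used 1000 x 10000).
df = build_frame(1000, 100, lambda i: np.random.randn(100))
print(df.shape)
```

The row-by-row loop is kept, but it writes into the ndarray rather than into the DataFrame, so the frame itself is only constructed once.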
df = DataFrame(columns=range(10), index=range(10))
a = np.array([9, 9, 9, 9, 9, 9, 9, 9, 9, 9])
Update row:
df.loc[2] = a
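One thing worth checking with this approach is the dtype of the preallocated frame. In the pandas versions I am aware of, a frame created with only index and columns has object-dtype columns, whereas preallocating with NaN gives float64 columns that stay numeric after the row update. A small sketch, where the names `empty` and `numeric` are just for illustration:

```python
import numpy as np
import pandas as pd

a = np.array([9] * 10)

# Preallocated without data, as in the answer above.
empty = pd.DataFrame(columns=range(10), index=range(10))
empty.loc[2] = a
print(empty.dtypes.unique())  # inspect the resulting column dtypes

# Preallocated with NaN so every column starts out as float64.
numeric = pd.DataFrame(np.nan, index=range(10), columns=range(10))
numeric.loc[2] = a
print(numeric.dtypes.unique())
```

Keeping the columns numeric matters if the frame is later used for arithmetic, since operations on object columns are typically much slower.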
Using Jeff's idea...
df2 = DataFrame(data=np.random.randn(10, 10), index=np.arange(10))
df2.head().T
I have written up a notebook answering the question: https://www.wakari.io/sharing/bundle/hrojas/pandas%20efficient%20dataframe%20set%20row