I'm writing some performance-sensitive code in which I have to add a large number of columns to a Pandas dataframe quickly.
I've managed to get a 3x improvement over naively repeating df[foo] = bar
by constructing a second dataframe from a dict and concatenating them:
def mkdf1(n):
df = pd.DataFrame(index=range(10,20), columns=list('qwertyuiop'))
for i in xrange(n):
df['col%d' % i] = range(i, 10+i)
return df
def mkdf2(n):
df = pd.DataFrame(index=range(10,20), columns=list('qwertyuiop'))
newcols = {}
for i in xrange(n):
newcols['col%d' % i] = range(i, 10+i)
return pd.concat([df, pd.DataFrame(newcols, index=df.index)], axis=1)
The timings show substantial improvement:
%timeit -r 1 mkdf1(100)
100 loops, best of 1: 16.6 ms per loop
%timeit -r 1 mkdf2(100)
100 loops, best of 1: 5.5 ms per loop
Are there any other optimizations I can make here?
EDIT: Also, the concat
call is taking much longer in my real code than my toy example; in particular the get_result
function takes a lot longer despite the production df having fewer rows and I can't figure out why. Any advice on how to speed this up would be appreciated.
Add multiple columns to a data frame using Dataframe. insert() method. Using DataFrame. insert() method, we can add new columns at specific position of the column name sequence.
Create pandas DataFrame from Multiple ListsUse column param and index param to provide column & row labels respectively to the DataFrame. Alternatively, you can also add column names to DataFrame and set the index using pandas. DataFrame. set_index() method.
I'm a little confused at exactly what your dataframe should look like, but it's easy to speed this up a lot with a general technique. Basically for pandas/numpy speed you want to avoid for
and any concat/merge/join/append
, if possible.
Your best bet here is most likely to use numpy
to create an array that will be the input to a dataframe and then name the columns however you like. Both of these operations should be trivial as far as computation time.
Here's the numpy part, it looks like you already know how to construct column names.
%timeit pd.DataFrame( np.ones([10,100]).cumsum(axis=0)
+ np.ones([10,100]).cumsum(axis=1) )
10000 loops, best of 3: 158 µs per loop
I think you are trying to make something like this? (If not, just check out numpy if you aren't familiar with it, it has all sorts of array operations that should make it very easy to do whatever you are trying to do here).
In [63]: df.ix[:5,:10]
Out[63]:
0 1 2 3 4 5 6 7 8 9 10
0 2 3 4 5 6 7 8 9 10 11 12
1 3 4 5 6 7 8 9 10 11 12 13
2 4 5 6 7 8 9 10 11 12 13 14
3 5 6 7 8 9 10 11 12 13 14 15
4 6 7 8 9 10 11 12 13 14 15 16
5 7 8 9 10 11 12 13 14 15 16 17
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With