Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to copy columns from one DataFrame to another using pandas?

Tags:

python

pandas

I have a large DataFrame (million +) records I'm using to store core of my data (like a database) and I then have a smaller DataFrame (1 to 2000) records that I'm combining a few of the columns for each time step in my program which can be several thousand time steps . Both DataFrames are indexed the same way by a id column.

the code I'm using is:

df_large.loc[new_ids, core_cols] = df_small.loc[new_ids, core_cols]

Where core_cols is a list of about 10 fields that I'm coping over and new_ids are the ids from the small DataFrame. This code works fine but it is the slowest part of my code my a magnitude of three. I just wanted to know if they was a faster way to merge the data of the two DataFrame together.

I tried merging the data each time with the merge function but process took way to long that is way I have gone to creating a larger DataFrame that I update to improve the speed.

like image 677
user1204369 Avatar asked Jan 22 '14 22:01

user1204369


1 Answers

There is nothing inherently slow about using .loc to set with an alignable frame, though it does go through a bit of code to cover lot of cases, so probably it's not ideal to have in a tight loop. FYI, this example is slightly different that the 2nd example.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: from pandas import DataFrame

In [4]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])

In [5]: df
Out[5]: 
   0  1  2
a  1  1  1
b  1  1  1
c  1  1  1
d  1  1  1
e  1  1  1
f  1  1  1
g  1  1  1
h  1  1  1
i  1  1  1
j  1  1  1

[10 rows x 3 columns]

In [6]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])

In [7]: df2
Out[7]: 
   1  2
a  0  0
f  0  0
g  0  0

[3 rows x 2 columns]

In [8]: df.loc[df2.index,df2.columns] = df2

In [9]: df
Out[9]: 
   0  1  2
a  1  0  0
b  1  1  1
c  1  1  1
d  1  1  1
e  1  1  1
f  1  0  0
g  1  0  0
h  1  1  1
i  1  1  1
j  1  1  1

[10 rows x 3 columns]

Here's an alternative. It may or may not fit your data pattern. If the updates (your small frame) are pretty much independent this would work (IOW you are not updating the big frame, then picking out a new sub-frame, then updating, etc. - if this is your pattern, then using .loc is about right).

Instead of updating the big frame, update the small frame with the columns from the big frame, e.g.:

In [10]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])

In [11]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])

In [12]: needed_columns = df.columns-df2.columns

In [13]: df2[needed_columns] = df.reindex(index=df2.index,columns=needed_columns)

In [14]: df2
Out[14]: 
   1  2  0
a  0  0  1
f  0  0  1
g  0  0  1

[3 rows x 3 columns]

In [15]: df3 = DataFrame(0,index=list('cji'),columns=[1,2])

In [16]: needed_columns = df.columns-df3.columns

In [17]: df3[needed_columns] = df.reindex(index=df3.index,columns=needed_columns)

In [18]: df3
Out[18]: 
   1  2  0
c  0  0  1
j  0  0  1
i  0  0  1

[3 rows x 3 columns]

And concat everything together when you want (they are kept in a list in the mean time, or see my comments below, these sub-frames could be moved to external storage when created, then read back before this concatenating step).

In [19]: pd.concat([ df.reindex(index=df.index-df2.index-df3.index), df2, df3]).reindex_like(df)
Out[19]: 
   0  1  2
a  1  0  0
b  1  1  1
c  1  0  0
d  1  1  1
e  1  1  1
f  1  0  0
g  1  0  0
h  1  1  1
i  1  0  0
j  1  0  0

[10 rows x 3 columns]

The beauty of this pattern is that it is easily extended to using an actual db (or much better an HDFStore), to actually store the 'database', then creating/updating sub-frames as needed, then writing out to a new store when finished.

I use this pattern all of the time, though with Panels actually.

  • perform a computation on a sub-set of the data and write each to a separate file
  • then at the end read them all in and concat (in memory), and write out a gigantic new file. The concat step could be done all at once in memory, or if truly a large task, then can be done iteratively.

I am able to use multi-processes to perform my computations AND write each individual Panel to a file separate as they are all completely independent. The only dependent part is the concat.

This is essentially a map-reduce pattern.

like image 127
Jeff Avatar answered Oct 14 '22 11:10

Jeff