Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Writing to multiple adjacent columns in pandas efficiently

With a numpy ndarray it is possible to write to multiple columns at a time without making a copy first (as long as they are adjacent). If I wanted to write to the first three columns of an array I would write

a[0,0:3] = 1,2,3 # this is very fast ('a' is a numpy ndarray)

I was hoping that in pandas I would similarly be able to select multiple adjacent columns by "label-slicing" like so (assuming the first 3 columns are labeled 'a','b','c')

a.loc[0,'a':'c'] = 1,2,3 # this works but is very slow ('a' is a pandas DataFrame)

or similarly

a.iloc[0,3:6] = 1,2,3 # this is equally as slow

However, this takes several 100s of milliseconds as compared to writing to a numpy array which takes only a few microseconds. I'm unclear on whether pandas is making a copy of the array under the hood. The only way I could find to write to the dataframe in this way that gives good speed is to work on the underlying ndarray directly

a.values[0,0:3] = 1,2,3 # this works fine and is fast

Have I missed something in the Pandas docs or is their no way to do multiple adjacent column indexing on a Pandas dataframe with speed comparable to numpy?

Edit

Here's the actual dataframe I am working with.

>> conn = sqlite3.connect('prath.sqlite')
>> prath = pd.read_sql("select image_id,pixel_index,skin,r,g,b from pixels",conn)
>> prath.shape
(5913307, 6)
>> prath.head()
   image_id  pixel_index  skin    r    g    b
0        21       113764     0    0    0    0
1        13       187789     0  183  149  173
2        17       535758     0  147   32   35
3        31         6255     0  116    1   16
4        15       119272     0  238  229  224
>> prath.dtypes
image_id       int64
pixel_index    int64
skin           int64
r              int64
g              int64
b              int64
dtype: object

Here is some runtime comparisons for the different indexing methods (again, pandas indexing is very slow)

>> %timeit prath.loc[0,'r':'b'] = 4,5,6
1 loops, best of 3: 888 ms per loop
>> %timeit prath.iloc[0,3:6] = 4,5,6
1 loops, best of 3: 894 ms per loop
>> %timeit prath.values[0,3:6] = 4,5,6
100000 loops, best of 3: 4.8 µs per loop
like image 309
toes Avatar asked Sep 28 '22 11:09

toes


1 Answers

Edit to clarify: I don't believe pandas has a direct analog to setting a view in numpy in terms of both speed and syntax. iloc and loc are probably the most direct analog in terms of syntax and purpose, but are much slower. This is a fairly common situation with numpy and pandas. Pandas does a lot more than numpy (labeled columns/indexes, automatic alignment, etc.), but is slower to varying degrees. When you need speed and can do things in numpy, then do them in numpy.


I think in a nutshell that the tradeoff here is that loc and iloc will be slower but work 100% of the time whereas values will be fast but not always work (to be honest, I didn't even realize it would work in the way you got it to work).

But here's a really simple example where values doesn't work because column 'g' is a float rather than integer.

prath['g'] = 3.33

prath.values[0,3:6] = 4,5,6

prath.head(3)

   image_id  pixel_index  skin    r     g    b
0        21       113764     0    0  3.33    0
1        13       187789     0  183  3.33  173
2        17       535758     0  147  3.33   35

prath.iloc[0,3:6] = 4,5,6

prath.head(3)

   image_id  pixel_index  skin    r     g    b
0        21       113764     0    4  5.00    6
1        13       187789     0  183  3.33  173
2        17       535758     0  147  3.33   35

You can often get numpy-like speed and behavior from pandas when columns are of homogeneous type, you want to be careful about this. Edit to add: As @toes notes in the comment, the documentation does state that you can do this with homogeneous data. However, it's potentially very error prone as the example above shows, and I don't think many people would consider this a good general practice in pandas.

My general recommendation would be to do things in numpy in cases where you need the speed (and have homogeneous data types), and pandas when you don't. The nice thing is that numpy and pandas play well together so it's really not that hard to convert between dataframes and arrays as you go.

Edit to add: The following seems to work (albeit with a warning) even with column 'g' as a float. The speed is in between the values way and loc/iloc ways. I'm not sure if this can be expected to work all the time though. Just putting it out as a possible middle way.

prath[0:1][['r','g','b']] = 4,5,6
like image 113
JohnE Avatar answered Oct 29 '22 06:10

JohnE