
Fast alternative to run a numpy based function over all the rows in Pandas DataFrame

I have a Pandas data frame created the following way:

import pandas as pd
def create(n):  
    df = pd.DataFrame({ 'gene':["foo",
                            "bar",
                            "qux",
                            "woz"], 
                            'cell1':[433.96,735.62,483.42,10.33], 
                            'cell2':[94.93,2214.38,97.93,1205.30],
                            'cell3':[1500,90,100,80]})
    df = df[["gene","cell1","cell2","cell3"]]
    df = pd.concat([df]*n)
    df = df.reset_index(drop=True)
    return df

It looks like this:

In [108]: create(1)
Out[108]:
  gene   cell1    cell2  cell3
0  foo  433.96    94.93   1500
1  bar  735.62  2214.38     90
2  qux  483.42    97.93    100
3  woz   10.33  1205.30     80

Then I have a function that takes the values of each gene (row) and computes a sparseness score:

sparseness(x) = (sqrt(n) - sum(|x_i|) / sqrt(sum(x_i^2))) / (sqrt(n) - 1)

import numpy as np

def sparseness(xvec):
    # Hoyer's sparseness: 0 when all entries are equal, 1 for a one-hot vector
    n = len(xvec)
    xvec_sum = np.sum(np.abs(xvec))
    xvecsq_sum = np.sum(np.square(xvec))

    numer = np.sqrt(n) - (xvec_sum / np.sqrt(xvecsq_sum))
    denom = np.sqrt(n) - 1
    sparseness_x = numer / denom

    return sparseness_x
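As a quick sanity check (an illustrative call of my own, not from the original post), the score should lie between 0 and 1: exactly 0 for a uniform vector and exactly 1 for a one-hot vector.

```python
import numpy as np

def sparseness(xvec):
    n = len(xvec)
    xvec_sum = np.sum(np.abs(xvec))
    xvecsq_sum = np.sum(np.square(xvec))
    numer = np.sqrt(n) - (xvec_sum / np.sqrt(xvecsq_sum))
    denom = np.sqrt(n) - 1
    return numer / denom

print(sparseness(np.array([1.0, 1.0, 1.0])))        # uniform vector -> 0.0
print(sparseness(np.array([0.0, 0.0, 5.0])))        # one-hot vector -> 1.0
print(sparseness(np.array([433.96, 94.93, 1500.0])))  # a row from the example frame, somewhere in between
```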

In reality I need to apply this function to over 40K rows, and it currently runs very slowly using pandas apply:

In [109]: df = create(10000)
In [110]: express_df = df.iloc[:, 1:]
In [111]: %timeit express_df.apply(sparseness, axis=1)
1 loops, best of 3: 8.32 s per loop

What's the faster alternative to implement that?

asked Nov 26 '15 by neversaint


2 Answers

A faster way is to implement a vectorized version of the function that operates on a two-dimensional ndarray directly. This is very doable, since most NumPy reductions accept a two-dimensional ndarray and let you choose the reduction direction with the axis parameter. A possible implementation:

def sparseness2(xs):
    # xs is a 2-D array with one gene per row; reduce across columns (axis=1)
    nr = np.sqrt(xs.shape[1])
    a = np.sum(np.abs(xs), axis=1)
    b = np.sqrt(np.sum(np.square(xs), axis=1))
    sparseness = (nr - a/b) / (nr - 1)
    return sparseness

res_arr = sparseness2(express_df.values)
res2 = pd.Series(res_arr, index=express_df.index)
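A small usage sketch of my own (not from the original answer), using the modern to_numpy accessor rather than .values, and attaching the result back to the frame as a hypothetical 'sparseness' column:

```python
import numpy as np
import pandas as pd

def sparseness2(xs):
    # vectorized sparseness over the rows of a 2-D numeric array
    nr = np.sqrt(xs.shape[1])
    a = np.sum(np.abs(xs), axis=1)
    b = np.sqrt(np.sum(np.square(xs), axis=1))
    return (nr - a / b) / (nr - 1)

df = pd.DataFrame({'gene': ['foo', 'bar'],
                   'cell1': [433.96, 735.62],
                   'cell2': [94.93, 2214.38],
                   'cell3': [1500.0, 90.0]})

# select only the numeric columns before converting to an ndarray
df['sparseness'] = sparseness2(df.iloc[:, 1:].to_numpy())
print(df)
```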

Some testing:

from pandas.testing import assert_series_equal
res1 = express_df.apply(sparseness, axis=1)
assert_series_equal(res1, res2) #OK
%timeit sparseness2(express_df.values)
# 1000 loops, best of 3: 655 µs per loop
answered Oct 04 '22 by YS-L

Here's one vectorized approach using np.einsum to perform all of those operations in one go across the entire array. np.einsum is quite efficient for such combined multiply-and-sum operations: here it performs the row-wise summation for the xvec_sum case and the squaring-plus-summation for the xvecsq_sum case. The implementation would look like this -

def sparseness_vectorized(A):
    nsqrt = np.sqrt(A.shape[1])
    # row-wise sum(|A|) divided by sqrt of the row-wise sum of squares
    B = np.einsum('ij->i', np.abs(A)) / np.sqrt(np.einsum('ij,ij->i', A, A))
    numer = nsqrt - B
    denom = nsqrt - 1
    return numer / denom
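For readers unfamiliar with the subscript notation, the two einsum calls above are equivalent to plain axis-1 reductions; a small illustrative check (my own, not from the original answer):

```python
import numpy as np

A = np.array([[433.96, 94.93, 1500.0],
              [735.62, 2214.38, 90.0]])

# 'ij->i': sum over j (the columns), leaving one value per row
row_sums = np.einsum('ij->i', np.abs(A))
assert np.allclose(row_sums, np.abs(A).sum(axis=1))

# 'ij,ij->i': elementwise product, then sum over j, i.e. row-wise sum of squares
row_sq_sums = np.einsum('ij,ij->i', A, A)
assert np.allclose(row_sq_sums, (A * A).sum(axis=1))
```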

Runtime tests -

This section compares all of the approaches listed so far, including the apply version from the question.

In [235]: df = create(1000)
     ...: express_df = df.iloc[:, 1:]
     ...: 

In [236]: %timeit express_df.apply(sparseness, axis=1)
1 loops, best of 3: 1.36 s per loop

In [237]: %timeit sparseness2(express_df.values)
1000 loops, best of 3: 247 µs per loop

In [238]: %timeit sparseness_vectorized(express_df.values)
1000 loops, best of 3: 231 µs per loop



In [239]: df = create(5000)
     ...: express_df = df.iloc[:, 1:]
     ...: 

In [240]: %timeit express_df.apply(sparseness, axis=1)
1 loops, best of 3: 6.66 s per loop

In [241]: %timeit sparseness2(express_df.values)
1000 loops, best of 3: 1.14 ms per loop

In [242]: %timeit sparseness_vectorized(express_df.values)
1000 loops, best of 3: 1.06 ms per loop
answered Oct 04 '22 by Divakar