
Fast alternative to run a numpy based function over all the rows in Pandas DataFrame

I have a Pandas data frame created the following way:

import pandas as pd
def create(n):  
    df = pd.DataFrame({ 'gene':["foo",
                            "bar",
                            "qux",
                            "woz"], 
                            'cell1':[433.96,735.62,483.42,10.33], 
                            'cell2':[94.93,2214.38,97.93,1205.30],
                            'cell3':[1500,90,100,80]})
    df = df[["gene","cell1","cell2","cell3"]]
    df = pd.concat([df]*n)
    df = df.reset_index(drop=True)
    return df

It looks like this:

In [108]: create(1)
Out[108]:
  gene   cell1    cell2  cell3
0  foo  433.96    94.93   1500
1  bar  735.62  2214.38     90
2  qux  483.42    97.93    100
3  woz   10.33  1205.30     80

Then I have a function that takes the values of each gene (row) and computes a sparseness score:

sparseness(x) = (sqrt(n) - sum(|x_i|) / sqrt(sum(x_i^2))) / (sqrt(n) - 1)

import numpy as np

def sparseness(xvec):
    # Hoyer's sparseness: 0 when all entries are equal, 1 for a one-hot vector
    n = len(xvec)
    xvec_sum = np.sum(np.abs(xvec))
    xvecsq_sum = np.sum(np.square(xvec))

    numer = np.sqrt(n) - (xvec_sum / np.sqrt(xvecsq_sum))
    denom = np.sqrt(n) - 1
    sparseness_x = numer / denom

    return sparseness_x
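As a quick sanity check (an illustrative call of my own, not from the original post), the score should lie between 0 and 1: exactly 0 for a uniform vector and exactly 1 for a one-hot vector.

```python
import numpy as np

def sparseness(xvec):
    n = len(xvec)
    xvec_sum = np.sum(np.abs(xvec))
    xvecsq_sum = np.sum(np.square(xvec))
    numer = np.sqrt(n) - (xvec_sum / np.sqrt(xvecsq_sum))
    denom = np.sqrt(n) - 1
    return numer / denom

print(sparseness(np.array([1.0, 1.0, 1.0])))        # uniform vector -> 0.0
print(sparseness(np.array([0.0, 0.0, 5.0])))        # one-hot vector -> 1.0
print(sparseness(np.array([433.96, 94.93, 1500.0])))  # a row from the example frame, somewhere in between
```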

In reality I need to apply this function to over 40K rows, and it currently runs very slowly using pandas apply:

In [109]: df = create(10000)
In [110]: express_df = df.iloc[:, 1:]
In [111]: %timeit express_df.apply(sparseness, axis=1)
1 loops, best of 3: 8.32 s per loop

What's the faster alternative to implement that?

asked Nov 26 '15 by neversaint


2 Answers

A faster way is to implement a vectorized version of the function that operates on a two-dimensional ndarray directly. This is very doable, since most NumPy reductions accept a two-dimensional ndarray and let you choose the reduction direction with the axis parameter. A possible implementation:

def sparseness2(xs):
    # xs is a 2-D array with one gene per row; reduce across columns (axis=1)
    nr = np.sqrt(xs.shape[1])
    a = np.sum(np.abs(xs), axis=1)
    b = np.sqrt(np.sum(np.square(xs), axis=1))
    sparseness = (nr - a/b) / (nr - 1)
    return sparseness

res_arr = sparseness2(express_df.values)
res2 = pd.Series(res_arr, index=express_df.index)
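A small usage sketch of my own (not from the original answer), using the modern to_numpy accessor rather than .values, and attaching the result back to the frame as a hypothetical 'sparseness' column:

```python
import numpy as np
import pandas as pd

def sparseness2(xs):
    # vectorized sparseness over the rows of a 2-D numeric array
    nr = np.sqrt(xs.shape[1])
    a = np.sum(np.abs(xs), axis=1)
    b = np.sqrt(np.sum(np.square(xs), axis=1))
    return (nr - a / b) / (nr - 1)

df = pd.DataFrame({'gene': ['foo', 'bar'],
                   'cell1': [433.96, 735.62],
                   'cell2': [94.93, 2214.38],
                   'cell3': [1500.0, 90.0]})

# select only the numeric columns before converting to an ndarray
df['sparseness'] = sparseness2(df.iloc[:, 1:].to_numpy())
print(df)
```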

Some testing:

from pandas.testing import assert_series_equal
res1 = express_df.apply(sparseness, axis=1)
assert_series_equal(res1, res2) #OK
%timeit sparseness2(express_df.values)
# 1000 loops, best of 3: 655 µs per loop
answered Oct 04 '22 by YS-L

Here's one vectorized approach using np.einsum to perform all of those operations in one go across the entire array. np.einsum is quite efficient for such combined multiply-and-sum operations: here it performs the row-wise summation for the xvec_sum case and the squaring-plus-summation for the xvecsq_sum case. The implementation would look like this -

def sparseness_vectorized(A):
    nsqrt = np.sqrt(A.shape[1])
    # row-wise sum(|A|) divided by sqrt of the row-wise sum of squares
    B = np.einsum('ij->i', np.abs(A)) / np.sqrt(np.einsum('ij,ij->i', A, A))
    numer = nsqrt - B
    denom = nsqrt - 1
    return numer / denom
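For readers unfamiliar with the subscript notation, the two einsum calls above are equivalent to plain axis-1 reductions; a small illustrative check (my own, not from the original answer):

```python
import numpy as np

A = np.array([[433.96, 94.93, 1500.0],
              [735.62, 2214.38, 90.0]])

# 'ij->i': sum over j (the columns), leaving one value per row
row_sums = np.einsum('ij->i', np.abs(A))
assert np.allclose(row_sums, np.abs(A).sum(axis=1))

# 'ij,ij->i': elementwise product, then sum over j, i.e. row-wise sum of squares
row_sq_sums = np.einsum('ij,ij->i', A, A)
assert np.allclose(row_sq_sums, (A * A).sum(axis=1))
```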

Runtime tests -

This section compares all of the approaches listed so far, including the apply version from the question.

In [235]: df = create(1000)
     ...: express_df = df.iloc[:, 1:]
     ...: 

In [236]: %timeit express_df.apply(sparseness, axis=1)
1 loops, best of 3: 1.36 s per loop

In [237]: %timeit sparseness2(express_df.values)
1000 loops, best of 3: 247 µs per loop

In [238]: %timeit sparseness_vectorized(express_df.values)
1000 loops, best of 3: 231 µs per loop



In [239]: df = create(5000)
     ...: express_df = df.iloc[:, 1:]
     ...: 

In [240]: %timeit express_df.apply(sparseness, axis=1)
1 loops, best of 3: 6.66 s per loop

In [241]: %timeit sparseness2(express_df.values)
1000 loops, best of 3: 1.14 ms per loop

In [242]: %timeit sparseness_vectorized(express_df.values)
1000 loops, best of 3: 1.06 ms per loop
answered Oct 04 '22 by Divakar