I have a Pandas data frame created the following way:
import pandas as pd

def create(n):
    df = pd.DataFrame({'gene': ["foo", "bar", "qux", "woz"],
                       'cell1': [433.96, 735.62, 483.42, 10.33],
                       'cell2': [94.93, 2214.38, 97.93, 1205.30],
                       'cell3': [1500, 90, 100, 80]})
    df = df[["gene", "cell1", "cell2", "cell3"]]
    df = pd.concat([df] * n)
    df = df.reset_index(drop=True)
    return df
It looks like this:
In [108]: create(1)
Out[108]:
gene cell1 cell2 cell3
0 foo 433.96 94.93 1500
1 bar 735.62 2214.38 90
2 qux 483.42 97.93 100
3 woz 10.33 1205.30 80
Then I have a function that takes the values of each gene (row) to compute a certain score:
import numpy as np

def sparseness(xvec):
    n = len(xvec)
    xvec_sum = np.sum(np.abs(xvec))        # L1 norm
    xvecsq_sum = np.sum(np.square(xvec))   # squared L2 norm
    numer = np.sqrt(n) - (xvec_sum / np.sqrt(xvecsq_sum))
    denom = np.sqrt(n) - 1
    sparseness_x = numer / denom
    return sparseness_x
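For context, this is the classic sparseness measure from Hoyer-style sparse coding: sparseness(x) = (sqrt(n) - ||x||_1/||x||_2) / (sqrt(n) - 1), which is 0 for a uniform vector and 1 for a one-hot vector. A quick sanity check on the first row (my own back-of-the-envelope example, not part of the original setup):

row = np.array([433.96, 94.93, 1500.0])  # "foo" across cell1..cell3
print(sparseness(row))  # ~0.594: most of the expression sits in one cell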
In reality I need to apply this function to over 40K rows, and it currently runs very slowly using Pandas apply:
In [109]: df = create(10000)
In [110]: express_df = df.ix[:,1:]
In [111]: %timeit express_df.apply(sparseness, axis=1)
1 loops, best of 3: 8.32 s per loop
What's a faster alternative to implement this?
Vectorization is always the first and best choice. Failing that, converting the DataFrame to a NumPy array or a dictionary before iterating speeds the loop up dramatically, often by one or two orders of magnitude on large inputs. If looping really is required, itertuples is considerably quicker than iterrows, and Series.map can be quicker still. As a rule of thumb, working on NumPy arrays directly outperforms row-wise Pandas operations, particularly on datasets of up to roughly 50K rows.
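To make the looping option concrete, here is a minimal itertuples sketch for this problem (my own illustration, not from the original post; it remains far slower than the vectorized answers below):

# Row-wise loop with itertuples: beats iterrows/apply, loses to full vectorization.
scores = [sparseness(np.array(row[1:]))   # row[0] is the index; the rest are cell values
          for row in express_df.itertuples()]
res_loop = pd.Series(scores, index=express_df.index)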
A faster way is to implement a vectorized version of the function, one that operates on a two-dimensional ndarray directly. This is very doable, since many NumPy functions can operate on a two-dimensional ndarray, controlled via the axis parameter. A possible implementation:
def sparseness2(xs):
    # xs is a 2-D ndarray; compute the score for every row at once
    nr = np.sqrt(xs.shape[1])
    a = np.sum(np.abs(xs), axis=1)               # row-wise L1 norms
    b = np.sqrt(np.sum(np.square(xs), axis=1))   # row-wise L2 norms
    sparseness = (nr - a / b) / (nr - 1)
    return sparseness
res_arr = sparseness2(express_df.values)
res2 = pd.Series(res_arr, index=express_df.index)
Some testing:
from pandas.util.testing import assert_series_equal
res1 = express_df.apply(sparseness, axis=1)
assert_series_equal(res1, res2) #OK
%timeit sparseness2(express_df.values)
# 1000 loops, best of 3: 655 µs per loop
Here's one vectorized approach using np.einsum to perform all those operations in one go across the entire dataframe. Now, np.einsum is supposedly pretty efficient for such multiplication-and-summation purposes. In our case, we can use it to perform the summation along one dimension for the xvec_sum case, and the squaring-and-summation for the xvecsq_sum case. The implementation would look like this:
def sparseness_vectorized(A):
    nsqrt = np.sqrt(A.shape[1])
    # 'ij->i' sums |A| over each row; 'ij,ij->i' is the row-wise sum of squares
    B = np.einsum('ij->i', np.abs(A)) / np.sqrt(np.einsum('ij,ij->i', A, A))
    numer = nsqrt - B
    denom = nsqrt - 1
    return numer / denom
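To see what the einsum subscripts do: both calls are just row-wise reductions. A quick equivalence check (my own addition, not part of the original answer):

A = express_df.values
assert np.allclose(np.einsum('ij->i', np.abs(A)), np.abs(A).sum(axis=1))   # xvec_sum per row
assert np.allclose(np.einsum('ij,ij->i', A, A), np.square(A).sum(axis=1))  # xvecsq_sum per row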
Runtime tests:
This section compares all the approaches listed so far, including the one in the question.
In [235]: df = create(1000)
...: express_df = df.ix[:,1:]
...:
In [236]: %timeit express_df.apply(sparseness, axis=1)
1 loops, best of 3: 1.36 s per loop
In [237]: %timeit sparseness2(express_df.values)
1000 loops, best of 3: 247 µs per loop
In [238]: %timeit sparseness_vectorized(express_df.values)
1000 loops, best of 3: 231 µs per loop
In [239]: df = create(5000)
...: express_df = df.ix[:,1:]
...:
In [240]: %timeit express_df.apply(sparseness, axis=1)
1 loops, best of 3: 6.66 s per loop
In [241]: %timeit sparseness2(express_df.values)
1000 loops, best of 3: 1.14 ms per loop
In [242]: %timeit sparseness_vectorized(express_df.values)
1000 loops, best of 3: 1.06 ms per loop
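One caveat if you rerun these snippets on a recent pandas (my note, assuming pandas >= 1.0): .ix has been removed and pandas.util.testing is deprecated, so the setup would look like this instead:

from pandas.testing import assert_series_equal   # replaces pandas.util.testing

express_df = df.iloc[:, 1:]                      # positional indexing instead of .ix
res_arr = sparseness_vectorized(express_df.to_numpy())  # .to_numpy() is preferred over .values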