I have a set of data frames with 4 columns and 1,000,000 rows each. For each row I'd like to run a hypergeometric test that takes the 4 values of those columns as input and returns a p-value (using the cumulative distribution function of a hypergeometric distribution).
I have tried two implementations based on SciPy (below) but both scale badly. Is there any other way to achieve what I do below with better efficiency? I have a working solution written in R (at the bottom) but unfortunately the code has to be written in Python, because it is to be used in an Airflow task that loads the data from a Postgres DB and at the moment there is no Postgres hook for R.
The sample data is created as follows (using 10,000 rows rather than the full 52 * 1,000,000 rows):
import numpy as np
import pandas as pd
from scipy.stats import hypergeom
from timeit import default_timer as timer
n_rows = 10000
n_total = 1000
max_good = 400
max_sample = 200
s = 100
df = pd.DataFrame({
    'ngood': np.random.hypergeometric(ngood=max_good, nbad=n_total - max_good,
                                      nsample=s, size=n_rows),
    'nsamp': np.random.hypergeometric(ngood=max_sample, nbad=n_total - max_sample,
                                      nsample=s, size=n_rows)
})
df = df.assign(kgood=np.array([
    np.random.hypergeometric(ngood=ngood, nbad=n_total - ngood,
                             nsample=nsamp, size=1)
    for ngood, nsamp
    in zip(df.ngood, df.nsamp)
]))
Slow implementation based on a list comprehension:
start = timer()
res = [
    hypergeom.cdf(k=ngood_found, M=n_total, n=ngood, N=nsamp)
    for ngood_found, ngood, nsamp
    in zip(df.kgood, df.ngood, df.nsamp)
]
end = timer()
print(res[0:10])
print("Elapsed time: %fs" % (end - start))
[0.44247900002512713, 0.71587318053768023, 0.97215178135616498]
Elapsed time: 2.405838s
Slow implementation based on numpy vectorisation:
# np.float was removed in recent NumPy releases; the builtin float is equivalent here.
vectorized_test = np.vectorize(hypergeom.cdf, otypes=[float], excluded=['M'])
start = timer()
res = vectorized_test(k=df.kgood.values, M=n_total,
                      n=df.ngood.values, N=df.nsamp.values)
end = timer()
print(res[0:10])
print("Elapsed time: %fs" % (end - start))
[ 0.442479 0.71587318 0.97215178]
Elapsed time: 2.518952s
For comparison, the R implementation below shows that the same calculation can be completed within milliseconds.
The trick is that phyper is vectorised at the C level, whereas np.vectorize is essentially a Python loop, as far as I know.
library(microbenchmark)
n_rows <- 10000
n_total <- 1000
max_good <- 400
max_sample <- 200
s <- 100
df <- data.frame(
  ngood = rhyper(nn=n_rows, m=max_good, n=n_total - max_good, k=s),
  nsamp = rhyper(nn=n_rows, m=max_sample, n=n_total - max_sample, k=s)
)
df$kgood <- rhyper(nn=n_rows, m=df$ngood, n=n_total - df$ngood, k=df$nsamp)
microbenchmark(
  res <- phyper(q = df$k, m = df$ngood, n = n_total - df$ngood, k=df$nsamp)
)
Unit: milliseconds
expr: phyper(q = df$k, m = df$ngood, n = n_total - df$ngood, k = df$nsamp)
     min      lq     mean   median       uq      max neval
2.984852 3.00838 3.350509 3.134745 3.439138 5.462694   100
A small improvement can be obtained by caching the results of hypergeom.cdf, since many rows share the same (k, n, N) combination:
from functools import lru_cache

# Alternative using the standard-library cache:
#@lru_cache(maxsize = 16*1024)
#def fn(k, n, N):
#    return hypergeom.cdf(k = k, M=n_total, n = n, N = N)

# Manual cache keyed on the (k, n, N) triple
data = {}
def fn(k, n, N):
    key = (k, n, N)
    if key not in data:
        val = hypergeom.cdf(k = k, M=n_total, n = n, N = N)
        data[key] = val
    else:
        val = data[key]
    return val
start = timer()
res = [
    fn(ngood_found, ngood, nsamp)
    for ngood_found, ngood, nsamp
    in zip(df.kgood, df.ngood, df.nsamp)
]
end = timer()
print(res[0:10])
print("Elapsed time: %fs" % (end - start))
On my machine this gives: Elapsed time: 0.279891s (0.315840s with lru_cache).
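The caching idea can also be expressed without a Python-level loop. The following is a minimal sketch, not part of the original timings: it assumes that hypergeom.cdf broadcasts over array arguments, and the helper name cdf_unique and the small input arrays are made up for illustration. It evaluates the CDF once per distinct (k, n, N) triple and maps the results back to the rows.

```python
import numpy as np
from scipy.stats import hypergeom

def cdf_unique(k, n, N, M):
    """Evaluate hypergeom.cdf once per distinct (k, n, N) triple.

    cdf_unique is a hypothetical helper, not a SciPy function.
    """
    triples = np.column_stack([k, n, N])
    # uniq holds the distinct rows; inverse maps each input row to its row in uniq
    uniq, inverse = np.unique(triples, axis=0, return_inverse=True)
    vals = hypergeom.cdf(uniq[:, 0], M, uniq[:, 1], uniq[:, 2])
    # np.ravel guards against NumPy versions that return a 2-D inverse
    return vals[np.ravel(inverse)]

# Made-up example data: rows 0 and 2 (and rows 1 and 3) are identical
k = np.array([3, 5, 3, 5])
n = np.array([40, 35, 40, 35])
N = np.array([20, 25, 20, 25])
res = cdf_unique(k, n, N, M=1000)
```

With the data frame from the question, the call would be cdf_unique(df.kgood.values.ravel(), df.ngood.values, df.nsamp.values, M=n_total). How much this helps depends on how many distinct triples the data contains, so it is worth timing on real data.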
EDIT:
In fact, it seems that the bottleneck is the calculation of the hypergeometric CDF itself rather than the overhead of the for loop.
To test this, I created a SWIG interface file _cdf.i for the function gsl_cdf_hypergeometric_P from the GSL package.
%module cdf
%{
#include "gsl/gsl_cdf.h"
%}
double gsl_cdf_hypergeometric_P(int, int, int, int);
This file is then "converted" into a package with:
swig -c++ -python _cdf.i
g++ -fPIC -c _cdf_wrap.cxx -I${HOME}/venvs/p3/include/python3.5m
g++ -shared _cdf_wrap.o -o _cdf.so -lgsl
One can then use this directly in the original example as:
import numpy as np
import pandas as pd
from scipy.stats import hypergeom
from timeit import default_timer as timer
from cdf import gsl_cdf_hypergeometric_P
n_rows = 10000
n_total = 1000
max_good = 400
max_sample = 200
s = 100
df = pd.DataFrame({
    'ngood': np.random.hypergeometric(ngood=max_good, nbad=n_total - max_good,
                                      nsample=s, size=n_rows),
    'nsamp': np.random.hypergeometric(ngood=max_sample, nbad=n_total - max_sample,
                                      nsample=s, size=n_rows)
})
df = df.assign(kgood=np.array([
    np.random.hypergeometric(ngood=ngood, nbad=n_total - ngood,
                             nsample=nsamp, size=1)
    for ngood, nsamp
    in zip(df.ngood, df.nsamp)
]))
start = timer()
res = [
    hypergeom.cdf(k=ngood_found, M=n_total, n=ngood, N=nsamp)
    for ngood_found, ngood, nsamp
    in zip(df.kgood, df.ngood, df.nsamp)
]
end = timer()
print(res[0:10])
print("Elapsed time: %fs" % (end - start))
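# Same argument names as hypergeom.cdf; GSL expects
# (k, number of good items, number of bad items, sample size).
def cdf(k, M, n, N):
    return gsl_cdf_hypergeometric_P(int(k), int(n), int(M-n), int(N))

start = timer()
res = [
    cdf(k=ngood_found, M=n_total, n=ngood, N=nsamp)
    for ngood_found, ngood, nsamp
    in zip(df.kgood, df.ngood, df.nsamp)
]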
end = timer()
print(res[0:10])
print("Elapsed time: %fs" % (end - start))
This yields:
[0.58605423287644209, 0.38055520197355552, 0.70597920363472055, 0.99728041338849138, 0.79797439957395955, 0.42245057292366844, 0.58627644982763727, 0.74819471224742817, 0.75121042470714849, 0.48561471798885397]
Elapsed time: 2.069916s
[0.5860542328771666, 0.38055520197384757, 0.7059792036350717, 0.997280413389543, 0.7979743995750694, 0.4224505729249291, 0.5862764498272103, 0.7481947122472634, 0.7512104247082603, 0.4856147179890127]
Elapsed time: 0.018253s
So even with an ordinary for loop, the speed-up is quite significant.
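If building a compiled extension is not an option, one more thing that may be worth timing is passing whole arrays to hypergeom.cdf in a single call instead of calling it once per row. This is a sketch that was not benchmarked in the runs above; it assumes SciPy broadcasts the array arguments, and the small input arrays are made up for illustration.

```python
import numpy as np
from scipy.stats import hypergeom

n_total = 1000
# Made-up example values standing in for df.kgood, df.ngood and df.nsamp
kgood = np.array([2, 7, 4])
ngood = np.array([40, 35, 45])
nsamp = np.array([20, 25, 15])

# One call over the whole arrays; the scalar M is broadcast against them
vec = hypergeom.cdf(kgood, n_total, ngood, nsamp)

# Reference: the per-row loop used in the examples above
loop = np.array([
    hypergeom.cdf(k=k, M=n_total, n=n, N=N)
    for k, n, N in zip(kgood, ngood, nsamp)
])
```

Both approaches should return the same values, so any difference is purely in runtime, which is best measured on the real data and the installed SciPy version.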