I am trying to calculate a set of FDR-corrected p-values using Benjamini & Hochberg's method. However, the vector I am trying to run this on contains over 10 billion values.
Given the amount of data, the normal method from statsmodels' multicomp module quickly runs out of memory. Looking at the source code of that function, it seems that it creates multiple vectors of length 10 billion in memory, which is obviously not going to work, even on a machine with 100 GB of RAM.
Is there a way to do this, ideally without having to keep the entire vector in memory? In particular, I am wondering whether it would be possible to re-implement BH in a way that lets it run on disk using h5py data structures.
Or any other suggestions?
In case anybody else stumbles across this:
The way I solved this was to first extract all p-values that had any chance of passing the FDR correction threshold (I used 1e-5 as the cutoff). Memory consumption was not an issue for this step, since I could simply iterate through the p-values on disk.
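A minimal sketch of that filtering pass with h5py (the file name "pvalues.h5", the dataset name "pvalues", the chunk size, and the 1e-5 cutoff are placeholders, not taken from my actual setup):

```python
import h5py
import numpy as np

THRESHOLD = 1e-5     # pre-filter cutoff; anything above this is discarded here
CHUNK = 10_000_000   # number of p-values loaded per iteration

candidate_p = []     # p-values below the threshold
candidate_idx = []   # their positions in the original vector

with h5py.File("pvalues.h5", "r") as f:   # hypothetical file/dataset names
    pvals = f["pvalues"]
    n_total = pvals.shape[0]              # the original number of tests (~10 billion)
    for start in range(0, n_total, CHUNK):
        chunk = pvals[start:start + CHUNK]     # reads only this slice into memory
        mask = chunk < THRESHOLD
        candidate_p.append(chunk[mask])
        candidate_idx.append(np.nonzero(mask)[0] + start)

candidate_p = np.concatenate(candidate_p)
candidate_idx = np.concatenate(candidate_idx)
```

Only one chunk plus the (small) candidate arrays are ever in memory at a time, so the chunk size can be tuned to whatever the machine comfortably holds.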
This gave me the ~400k smallest p-values. I then manually applied the BH procedure to these p-values, but used the original number of tests (10 billion) as m in the formula. Since BH is a step-up procedure, this is (as far as I know) equivalent to applying BH to the whole vector, without requiring me to sort 10 billion values.
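A sketch of that second step, assuming a significance level of alpha = 0.05 (the alpha is an assumption; it is not stated above): sort the surviving candidates, compare each to its BH critical value i * alpha / m with m set to the full number of tests, and reject everything up to the largest rank that passes. Strictly speaking, this matches full-vector BH only as long as the full-vector BH cutoff itself stays below the 1e-5 pre-filter threshold, which is the caveat behind "as far as I know" above.

```python
import numpy as np

def bh_reject(candidate_p, n_total, alpha=0.05):
    """BH step-up on a pre-filtered candidate set, with m = n_total
    (the original number of tests, ~10 billion here)."""
    order = np.argsort(candidate_p)
    p_sorted = candidate_p[order]
    ranks = np.arange(1, p_sorted.size + 1)
    passed = p_sorted <= ranks * alpha / n_total   # BH critical values use the full m
    if not passed.any():
        return np.array([], dtype=np.intp)         # nothing rejected
    k = passed.nonzero()[0][-1]                    # largest rank meeting its critical value
    return order[:k + 1]                           # indices into candidate_p of all rejections

# usage with the arrays from the previous snippet:
# rejected = bh_reject(candidate_p, n_total, alpha=0.05)
# rejected_positions = candidate_idx[rejected]     # positions in the full 10-billion vector
```

If adjusted p-values are wanted rather than reject/accept decisions, the same idea applies: compute m * p_(i) / i for each candidate (again with m equal to the full count), enforce monotonicity with a running minimum from the largest candidate downwards, and clip at 1; on the filtered subset this yields conservative upper bounds on the true BH-adjusted values.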