I am getting my feet wet with some genome analysis and am a little stuck. I have some very sparse data and need to find places where the moving average exceeds some threshold, marking each point as 1 or 0. The data is of a unique type, so I can't use available programs for analysis.
Each data point represents one basepair on the human genome. For each dataset there are 200,000,000 potential positions. The data is essentially a list of ~12,000 index/value pairs, where all other points are assumed to be zero. What I need to do is take a moving average across the whole dataset and return the regions where the average is above a threshold.
I am currently reading each point from the dataset sequentially and building an array around each point that I find, but this is very slow for large window sizes. Is there a more efficient way to do this, maybe with scipy or pandas?
Edit: Jamie's magic code below works great (but I can't upvote yet)! I am extremely appreciative.
You can vectorize the whole thing with NumPy. I have built this random dataset of (approximately) 12,000 indices between 0 and 199,999,999 (np.unique drops duplicates, so slightly fewer), and an equally long list of random floats between 0 and 1:
import numpy as np
indices = np.unique(np.random.randint(2e8, size=(12000,)))
values = np.random.rand(len(indices))
Then I construct an array of indices of total window size 2*win+1 around each of the indices, and a corresponding array of how much each point contributes to the moving average:
win = 10
avg_idx = np.arange(-win, win+1) + indices[:, None]
avg_val = np.tile(values[:, None]/(2*win+1), (1, 2*win+1))
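To make the broadcasting step above concrete, here is a tiny illustration (my own toy numbers, with win=1) of the shapes involved: adding a (2*win+1,) offset range to a column vector of indices yields one row per data point, each row holding the window of positions around that point, while avg_val holds that point's equal contribution to every position in its window:

```python
import numpy as np

win = 1
idx = np.array([5, 9])
val = np.array([0.3, 0.6])

# Broadcasting: (3,) offsets + (2, 1) indices -> (2, 3) window positions.
avg_idx = np.arange(-win, win + 1) + idx[:, None]
# avg_idx is [[4, 5, 6],
#             [8, 9, 10]]

# Each value is spread evenly over its window of 2*win+1 positions.
avg_val = np.tile(val[:, None] / (2 * win + 1), (1, 2 * win + 1))
# avg_val is [[0.1, 0.1, 0.1],
#             [0.2, 0.2, 0.2]]
```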
All that is left is figuring out repeated indices and adding contributions to the moving average together:
unique_idx, inv_idx = np.unique(avg_idx.ravel(), return_inverse=True)
mov_avg = np.bincount(inv_idx, weights=avg_val.ravel())
You can now get the list of indices at which, e.g., the moving average exceeds 0.5:
unique_idx[mov_avg > 0.5]
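Here is the whole pipeline end to end on a hand-checkable toy dataset (my own numbers, not from the question): two points one basepair apart, whose windows overlap, plus one isolated point. With win=1, positions 10 and 11 each receive 0.9/3 from both nearby points, so only they exceed the 0.5 threshold:

```python
import numpy as np

indices = np.array([10, 11, 50])
values = np.array([0.9, 0.9, 0.3])
win = 1

# Window positions and per-position contributions for each data point.
avg_idx = np.arange(-win, win + 1) + indices[:, None]
avg_val = np.tile(values[:, None] / (2 * win + 1), (1, 2 * win + 1))

# Collapse repeated positions and sum overlapping contributions.
unique_idx, inv_idx = np.unique(avg_idx.ravel(), return_inverse=True)
mov_avg = np.bincount(inv_idx, weights=avg_val.ravel())

# Positions where the moving average exceeds the threshold.
hot = unique_idx[mov_avg > 0.5]
# hot is [10, 11]
```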
As for performance, first turn the above code into a function:
def sparse_mov_avg(idx, val, win):
    avg_idx = np.arange(-win, win + 1) + idx[:, None]
    avg_val = np.tile(val[:, None] / (2 * win + 1), (1, 2 * win + 1))
    unique_idx, inv_idx = np.unique(avg_idx.ravel(), return_inverse=True)
    mov_avg = np.bincount(inv_idx, weights=avg_val.ravel())
    return unique_idx, mov_avg
and here are some timings for several window sizes, for the test data described at the beginning:
In [2]: %timeit sparse_mov_avg(indices, values, 10)
10 loops, best of 3: 33.7 ms per loop
In [3]: %timeit sparse_mov_avg(indices, values, 100)
1 loops, best of 3: 378 ms per loop
In [4]: %timeit sparse_mov_avg(indices, values, 1000)
1 loops, best of 3: 4.33 s per loop
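As a sanity check (my addition, not part of the original answer), the sparse result can be compared against a dense moving average computed with np.convolve on a small array. The indices are kept away from the array edges so zero-padding at the boundaries does not matter; the two results should agree at every position the sparse version reports:

```python
import numpy as np

def sparse_mov_avg(idx, val, win):
    avg_idx = np.arange(-win, win + 1) + idx[:, None]
    avg_val = np.tile(val[:, None] / (2 * win + 1), (1, 2 * win + 1))
    unique_idx, inv_idx = np.unique(avg_idx.ravel(), return_inverse=True)
    return unique_idx, np.bincount(inv_idx, weights=avg_val.ravel())

rng = np.random.default_rng(0)
N, win = 1000, 10
# Keep indices >= win away from both edges so the dense 'same'
# convolution sees no boundary effects at the reported positions.
idx = np.unique(rng.integers(50, 950, size=30))
val = rng.random(len(idx))

u, m = sparse_mov_avg(idx, val, win)

# Dense reference: scatter into a full array, then average with a
# centered uniform kernel of width 2*win+1.
dense = np.zeros(N)
dense[idx] = val
dense_avg = np.convolve(dense, np.ones(2 * win + 1) / (2 * win + 1), mode="same")
# m should match dense_avg sampled at the sparse output positions u.
```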