Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

doing better than numpy's in1d mask function: ordered arrays?

This operation needs to be applied as fast as possible as the actual arrays which contain millions of elements. This is a simple version of the problem.

So, I have a random array of unique integers (normally millions of elements).

totalIDs = [5,4,3,1,2,9,7,6,8 ...]

I have another array (normally a tens of thousands) of unique integers which I can create a mask.

subsampleIDs1 = [5,1,9]
subsampleIDs2 = [3,7,8]
subsampleIDs3 = [2,6,9]
...

I can use numpy to do

mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)

I can then extract the information I want of another array using the mask (say column 0 contains the one I want).

variable = allvariables[mask][:,0]

Now given that the IDs are unique in both arrays, is there any way to speed this up significantly. It takes a long time to construct the mask for a few thousand points (subsampleIDs) matching against millions of IDs (totalIDs).

I thought of going through it once and writing out a binary file of an index (to speed up future searches).

for i in range(0,3):
    mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)
    index[mask] = i

where X is in subsampleIDsX. Then I can just do:

for i in range(0,3):
    if index[i] == i:
        rowmatch = i
        break

variable = allvariables[rowmatch:len(subsampleIDs),0]

right? But this is also slow because there is a conditional in the loop to find when it first matches. Is there a faster way to find when a number first appears in an ordered array so the conditional doesn't slow the loop?

like image 738
Griff Avatar asked Mar 07 '13 19:03

Griff


1 Answers

I suggest you use DataFrame in Pandas. the index of the DataFrame is the totalIDs, and you can select subsampleIDs by: df.ix[subsampleIDs].

Create some test data first:

import numpy as np
N = 2000000
M = 5000
totalIDs = np.random.randint(0, 10000000, N)
totalIDs = np.unique(totalIDs)
np.random.shuffle(totalIDs)
v1 = np.random.rand(len(totalIDs))
v2 = np.random.rand(len(totalIDs))

subsampleIDs = np.random.choice(totalIDs, M)
subsampleIDs = np.unique(subsampleIDs)
np.random.shuffle(subsampleIDs)

Then convert you data in to a DataFrame:

import pandas as pd
df = pd.DataFrame(data = {"v1":v1, "v2":v2}, index=totalIDs) 
df.ix[subsampleIDs]

DataFrame use a hashtable to map the index to it's location, it's very fast.

like image 82
HYRY Avatar answered Sep 21 '22 02:09

HYRY