Fast combination of non-unique rows in numpy array, mapped to columns (i.e. fast pivot table problem, without Pandas)

Tags: python, arrays, pandas, vectorization, numpy

I'd appreciate any ideas or advice on the following coding problem. I'm particularly interested in a fast Python implementation (i.e. one that avoids Pandas).

I have a (dummy example) set of data like:

| User | Day | Place | Foo   | Bar   |
| 1    | 10  | 5     | True  | False |
| 1    | 11  | 8     | True  | False |
| 1    | 11  | 9     | True  | False |
| 2    | 11  | 9     | True  | False |
| 2    | 12  | 1     | False | True  |
| 1    | 12  | 2     | False | True  |

containing data for 2 users ("user1" and "user2") at a given day/place, where there are 2 boolean values of interest (called foo and bar here).

I'm only interested in situations where data is logged for BOTH users at the same day and place. From these relevant rows, I then want to build new columns, one row per day/place entry, that hold foo and bar as bools for each user, e.g.

| Day | Place | User 1 Foo | User 1 Bar | User 2 Foo | User 2 Bar |
| 11  | 9     | True       | False      | True       | False      |

Each column's data is stored in a numpy array. I appreciate this is an ideal problem for pandas, using the pivot table feature. For example, a Pandas solution is:

import numpy as np
import pandas as pd

user = np.array([1, 1, 1, 2, 2, 1], dtype=int)
day = np.array([10, 11, 11, 11, 12, 12], dtype=int)
place = np.array([5, 8, 9, 9, 1, 2], dtype=int)
foo = np.array([1, 1, 1, 1, 0, 0], dtype=bool)
bar = np.array([0, 0, 0, 0, 1, 1], dtype=bool)

df = pd.DataFrame({
'user': user,
'day': day,
'place': place,
'foo': foo,
'bar': bar,
})
df2 = df.set_index(['day','place']).pivot(columns='user')

df2.columns = ["User1_foo", "User2_foo", "User1_bar", "User2_bar"]
df2 = df2.reset_index()
df2.dropna(inplace=True)

but in practice I have millions of rows of data, and profiling shows that building the dataframe and running the pivot operation is a performance bottleneck.

Therefore, how can I achieve the same output, i.e. numpy arrays for day, place and user1_foo, user1_bar, user2_foo, user2_bar for just the cases where there is data for both users at the same day AND place in the original input arrays?

I wonder if finding indexes from np.unique and then inverting them would be a possible solution, but I couldn't make it work. Any solutions (ideally fast-executing) would be great, thanks!

asked Jul 14 '19 23:07

SLater01

1 Answer

Approach #1

Here's one that reduces each (Day, Place) pair to a single scalar for memory efficiency, then uses np.searchsorted to trace back and find the matching entries between the two users' data -

# Extract array data for efficiency, as we will work with NumPy tools
a = df.to_numpy(copy=False) # Pandas >= 0.24; use df.values otherwise
i = a[:,:3].astype(int)
j = a[:,3:].astype(bool)
# Also try without the astype(int)/astype(bool) conversions and compare timings

# Get grouped scalars for the Day and Place columns combined
# This assumes that Day and Place data are non-negative integers
g = i[:,2]*(i[:,1].max()+1) + i[:,1]

# Get groups for user1,2 for original and grouped-scalar items
m1 = i[:,0]==1
uj1,uj2 = j[m1],j[~m1]
ui1 = i[m1]
u1,u2 = g[m1],g[~m1]

# Use searchsorted to look for matching ones between user-1,2 grouped scalars
su1 = u1.argsort()
ssu1_idx = np.searchsorted(u1,u2,sorter=su1)
ssu1_idx[ssu1_idx==len(u1)] = 0
ssu1_idxc = su1[ssu1_idx]

match_mask = u1[ssu1_idxc]==u2
match_idx = ssu1_idxc[match_mask]

# Select matching items off original table
p1,p2 = uj1[match_idx],uj2[match_mask]

# Setup output arrays
day_place = ui1[match_idx,1:]
user1_bools = p1
user2_bools = p2
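
Since the question already holds every column in its own NumPy array, the DataFrame round trip can be skipped entirely. The following is a minimal sketch of the same steps wrapped in a function and run on the question's sample data. The function name match_users_searchsorted is hypothetical, and the sketch assumes that both users are present and labelled 1 and 2, that Day values are non-negative, and that each user has at most one row per (Day, Place) -

```python
import numpy as np

def match_users_searchsorted(user, day, place, foo, bar):
    # Hypothetical helper: Approach #1 on plain arrays, no DataFrame involved
    dp = np.column_stack((day, place))
    j = np.column_stack((foo, bar))
    # Collapse each (day, place) pair to one scalar key
    g = place * (day.max() + 1) + day
    m1 = user == 1
    u1, u2 = g[m1], g[~m1]
    # Locate every user-2 key among the sorted user-1 keys
    su1 = u1.argsort()
    pos = np.searchsorted(u1, u2, sorter=su1)
    pos[pos == len(u1)] = 0        # clamp out-of-range hits; rejected below
    cand = su1[pos]
    match_mask = u1[cand] == u2    # keep only exact key matches
    match_idx = cand[match_mask]
    return dp[m1][match_idx], j[m1][match_idx], j[~m1][match_mask]

user = np.array([1, 1, 1, 2, 2, 1])
day = np.array([10, 11, 11, 11, 12, 12])
place = np.array([5, 8, 9, 9, 1, 2])
foo = np.array([1, 1, 1, 1, 0, 0], dtype=bool)
bar = np.array([0, 0, 0, 0, 1, 1], dtype=bool)

day_place, user1_bools, user2_bools = match_users_searchsorted(
    user, day, place, foo, bar)
print(day_place.tolist())    # [[11, 9]]
print(user1_bools.tolist())  # [[True, False]]
print(user2_bools.tolist())  # [[True, False]]
```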

Approach #1-Extended : Generic Day and Place dtype data

We can extend this to the generic case where Day and Place data are not necessarily non-negative integers. In that case, we can use a dtype-combined, view-based method to perform the data reduction. The only change needed is to compute g differently: it becomes a view-based array, obtained like so -

# https://stackoverflow.com/a/44999009/ @Divakar
def view1D(a): # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

# Get grouped scalars for Day and place headers combined with dtype combined view
g = view1D(i[:,1:])
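
If a plain integer key is preferred over a void view, one possible alternative (a swapped-in technique, not part of the approach above) is to let np.unique label each distinct (Day, Place) row. This costs an extra sort, so it is likely slower, but it also works for negative values. A small sketch on made-up data -

```python
import numpy as np

# Hypothetical day/place rows with negative and repeated values
day_place = np.array([[-3, 7], [11, 9], [-3, 7], [12, -1], [11, 9]])

# Each distinct (day, place) row gets one integer label
_, g = np.unique(day_place, axis=0, return_inverse=True)
g = g.ravel()  # guard against NumPy versions that return a 2-D inverse

print(g.tolist())  # [0, 1, 0, 2, 1]
```

The resulting g can then be split into u1, u2 with the user mask exactly as before.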

Approach #2

We will use lex-sorting to group the data so that rows with identical (Day, Place) entries become consecutive; comparing consecutive rows then tells us whether there is a match across the two users. We will re-use a, i, j from Approach #1. The implementation would be -

# Lexsort the i table
sidx = np.lexsort(i.T)
# OR sidx = i.dot(np.r_[1,i[:,:-1].max(0)+1].cumprod()).argsort()

b = i[sidx]

# Get matching conditions on consecutive rows
m = (np.diff(b,axis=0)==[1,0,0]).all(1)
# Or m = (b[:-1,1] == b[1:,1]) & (b[:-1,2] == b[1:,2]) & (np.diff(b[:,0])==1)

# Trace back to original order by using sidx
match1_idx,match2_idx = sidx[:-1][m],sidx[1:][m]

# Index into relevant table and get desired array outputs
day_place,user1_bools,user2_bools = i[match1_idx,1:],j[match1_idx],j[match2_idx]
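
To see why the consecutive-row test works, it can help to look at the sorted table for the question's sample data. In this small sketch, np.lexsort treats the last key (Place) as the primary sort key, then Day, then User, so a matching pair shows up as two adjacent rows whose difference is exactly [1, 0, 0] -

```python
import numpy as np

# Sample rows from the question: columns are user, day, place
i = np.array([[1, 10, 5], [1, 11, 8], [1, 11, 9],
              [2, 11, 9], [2, 12, 1], [1, 12, 2]])

sidx = np.lexsort(i.T)   # last key (place) is primary, then day, then user
b = i[sidx]
m = (np.diff(b, axis=0) == [1, 0, 0]).all(1)

print(sidx.tolist())  # [4, 5, 0, 1, 2, 3]
print(b.tolist())
# [[2, 12, 1], [1, 12, 2], [1, 10, 5], [1, 11, 8], [1, 11, 9], [2, 11, 9]]
print(m.tolist())     # [False, False, False, False, True]
```

Only the last pair of sorted rows shares Day 11 and Place 9 with users 1 and 2, which maps back to original rows 2 and 3 through sidx.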

Alternatively, we could use an extended mask of m to index into sidx and generate match1_idx, match2_idx. The rest of the code stays the same. Hence, we could do -

from scipy.ndimage import binary_dilation

# Binary extend the mask to have the same length as the input.
# Index into sidx with it. Use one-off offset and stepsize of 2 to get
# user1,2 matching indices
m_ext = binary_dilation(np.r_[m,False],np.ones(2,dtype=bool),origin=-1)
match_idxs = sidx[m_ext]
match1_idx,match2_idx = match_idxs[::2],match_idxs[1::2]
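
If pulling in SciPy is not desirable, the extended mask can probably be built with plain NumPy as well. The sketch below sets position k whenever m[k] or m[k-1] is set, which is what the dilation is meant to achieve. It uses the m and sidx values from the question's sample data as stand-in inputs, and it assumes that no (Day, Place) group holds more than two rows, so matched pairs never overlap -

```python
import numpy as np

# Stand-in values for the question's sample data
m = np.array([False, False, False, False, True])
sidx = np.array([4, 5, 0, 1, 2, 3])

# Mark both rows of each matched pair: row k and row k+1 wherever m[k] is True
m_ext = np.r_[m, False] | np.r_[False, m]
match_idxs = sidx[m_ext]
match1_idx, match2_idx = match_idxs[::2], match_idxs[1::2]

print(m_ext.tolist())       # [False, False, False, False, True, True]
print(match1_idx.tolist())  # [2]
print(match2_idx.tolist())  # [3]
```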

Approach #3

Here's another based on Approach #2, ported over to numba for memory efficiency and hence performance. We will re-use a, i, j from Approach #1 -

from numba import njit

@njit
def find_groups_numba(i_s,j_s,user_data,bools):
    n = len(i_s)
    found_iterID = 0
    for iterID in range(n-1):
        if i_s[iterID,1] == i_s[iterID+1,1] and i_s[iterID,2] == i_s[iterID+1,2]:
            bools[found_iterID,0] = j_s[iterID,0]
            bools[found_iterID,1] = j_s[iterID,1]
            bools[found_iterID,2] = j_s[iterID+1,0]
            bools[found_iterID,3] = j_s[iterID+1,1]
            user_data[found_iterID,0] = i_s[iterID,1]
            user_data[found_iterID,1] = i_s[iterID,2]        
            found_iterID += 1
    return found_iterID

# Lexsort the i table
sidx = np.lexsort(i.T)
# OR sidx = i.dot(np.r_[1,i[:,:-1].max(0)+1].cumprod()).argsort()

i_s = i[sidx]
j_s = j[sidx]

n = len(i_s)
user_data = np.empty((n//2,2),dtype=i.dtype)
bools = np.empty((n//2,4),dtype=j.dtype)    
found_iterID = find_groups_numba(i_s,j_s,user_data,bools)    
out_bools = bools[:found_iterID] # Output bool
out_userd = user_data[:found_iterID] # Output user-Day, Place data

Append .copy() to the last 2 steps if the outputs must own their memory.

Alternatively, we can offload the indexing operation back to the NumPy side for a cleaner solution -

@njit
def find_consec_matching_group_indices(i_s,idx):
    n = len(i_s)
    found_iterID = 0
    for iterID in range(n-1):
        if i_s[iterID,1] == i_s[iterID+1,1] and i_s[iterID,2] == i_s[iterID+1,2]:
            idx[found_iterID] = iterID
            found_iterID += 1            
    return found_iterID

# Lexsort the i table
sidx = np.lexsort(i.T)
# OR sidx = i.dot(np.r_[1,i[:,:-1].max(0)+1].cumprod()).argsort()

i_s = i[sidx]
j_s = j[sidx]

idx = np.empty(len(i_s)//2,dtype=np.uint64)
found_iterID = find_consec_matching_group_indices(i_s,idx)
fidx = idx[:found_iterID]
day_place,user1_bools,user2_bools = i_s[fidx,1:],j_s[fidx],j_s[fidx+1]
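
The question asks for separate arrays for day, place, user1_foo, user1_bar, user2_foo and user2_bar, whereas each approach here returns day_place, user1_bools and user2_bools as 2D arrays. A small sketch of the final split, using the result for the sample data as stand-in input -

```python
import numpy as np

# Stand-in result for the question's sample data
day_place = np.array([[11, 9]])
user1_bools = np.array([[True, False]])
user2_bools = np.array([[True, False]])

# Transposing lets each column unpack into its own 1D array
day, place = day_place.T
user1_foo, user1_bar = user1_bools.T
user2_foo, user2_bar = user2_bools.T

print(day.tolist(), place.tolist())          # [11] [9]
print(user1_foo.tolist(), user1_bar.tolist())  # [True] [False]
print(user2_foo.tolist(), user2_bar.tolist())  # [True] [False]
```

The unpacked columns are views into the 2D arrays, so call .copy() on them if independent arrays are needed.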

answered Oct 05 '22 13:10

Divakar
