I have two 4-column 2D numpy arrays (cap and usp), each with several hundred rows of floats. Considering a subset of 3 columns in each array (e.g. capind=cap[:,:3]):
I am looking for an efficient means to identify the three-value rows common to both arrays while retaining the 4th column from both arrays for further processing. In essence I'm looking for a good numpy way to do the equivalent of Matlab's intersect function with the 'rows' option (i.e. [c, ia, ib]=intersect(capind, uspind, 'rows');).
This returns the indices of the matching rows, so that it is then trivial to get the matching triplets and the 4th-column value from the original array (matchcap=cap[ia,:]).
My current approach is based upon a similar question on the forum as I cannot find a good match for my problem. However, this approach seems a little inefficient considering my goal (I also haven't fully solved my problem):
The arrays are something like this:
cap=array([[ 2.50000000e+01, 1.27000000e+02, 1.00000000e+00,
9.81997200e-06],
[ 2.60000000e+01, 1.27000000e+02, 1.00000000e+00,
9.14296800e+00],
[ 2.70000000e+01, 1.27000000e+02, 1.00000000e+00,
2.30137100e-04],
...,
[ 6.10000000e+01, 1.80000000e+02, 1.06000000e+02,
8.44939900e-03],
[ 6.20000000e+01, 1.80000000e+02, 1.06000000e+02,
4.77729100e-03],
[ 6.30000000e+01, 1.80000000e+02, 1.06000000e+02,
1.40343500e-03]])
usp=array([[ 4.10000000e+01, 1.31000000e+02, 1.00000000e+00,
5.24197200e-06],
[ 4.20000000e+01, 1.31000000e+02, 1.00000000e+00,
8.39178800e-04],
[ 4.30000000e+01, 1.31000000e+02, 1.00000000e+00,
1.20279900e+01],
...,
[ 4.70000000e+01, 1.80000000e+02, 1.06000000e+02,
2.48667700e-02],
[ 4.80000000e+01, 1.80000000e+02, 1.06000000e+02,
4.23304600e-03],
[ 4.90000000e+01, 1.80000000e+02, 1.06000000e+02,
1.02051300e-03]])
I then convert each 4 column array (usp and cap) into a three column array (capind and uspind shown below as integers for ease of viewing).
capind=array([[ 25, 127, 1],
[ 26, 127, 1],
[ 27, 127, 1],
...,
[ 61, 180, 106],
[ 62, 180, 106],
[ 63, 180, 106]])
uspind=array([[ 41, 131, 1],
[ 42, 131, 1],
[ 43, 131, 1],
...,
[ 47, 180, 106],
[ 48, 180, 106],
[ 49, 180, 106]])
Using a set operation gives me the matching triplets: carray=np.array([x for x in set(tuple(x) for x in capind) & set(tuple(x) for x in uspind)]).
This seems to work fairly well for finding the common row values from both the uspind and capind arrays. I now need to get the 4th column's value from the matching rows (i.e. compare carray with the first three columns of the original source arrays (cap and usp) and somehow grab the value from the 4th column).
Is there a better, more efficient way to achieve this? Otherwise, any help on the best means to retrieve the 4th column values from the source arrays would be greatly appreciated.
Try using dictionaries.
capind = {tuple(row[:3]):row[3] for row in cap}
uspind = {tuple(row[:3]):row[3] for row in usp}
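# In Python 3, dict key views support set intersection (Python 2.7 used viewkeys())
keys = capind.keys() & uspind.keys()
for key in keys:
    # capind[key] and uspind[key] are the fourth columns
    print(key, capind[key], uspind[key])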
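To make the dictionary idea concrete, here is a minimal self-contained sketch. The toy arrays and the names cap_by_key, usp_by_key and matched are made up for illustration, and it assumes the triplets compare exactly equal (as with the integer-valued floats in the question) and are unique within each array.

```python
import numpy as np

# Hypothetical toy data: columns 0-2 form the key triplet, column 3 is the value.
cap = np.array([[25., 127., 1., 0.5],
                [26., 127., 1., 1.5],
                [27., 127., 1., 2.5]])
usp = np.array([[26., 127., 1., 10.0],
                [27., 127., 1., 20.0],
                [28., 127., 1., 30.0]])

# Map each triplet to its 4th-column value.
cap_by_key = {tuple(row[:3]): row[3] for row in cap}
usp_by_key = {tuple(row[:3]): row[3] for row in usp}

# Key views behave like sets, so & gives the common triplets.
common = sorted(cap_by_key.keys() & usp_by_key.keys())

# One output row per common triplet: the triplet, the cap value, the usp value.
matched = np.array([key + (cap_by_key[key], usp_by_key[key]) for key in common])
print(matched)
```

If a triplet could appear more than once in an array, the dictionary would silently keep only the last value, so this only fits the unique-rows case.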
Assuming, as you do, that the rows are unique within each matrix and that some rows are common to both, here is one solution. The basic idea is to concatenate the two arrays, sort the result so that identical rows end up next to each other, and then take the difference between consecutive rows. The first three values of a difference should be zero (or close to zero) when the two rows match.
[Original]
import numpy as np

## Concatenate the matrices together
cu = np.concatenate( (cap, usp), axis=0 )
print(cu)
## Sort whole rows by the first three columns
## (cu.sort( axis=0 ) would sort each column independently and scramble the rows)
cu = cu[np.lexsort( (cu[:,2], cu[:,1], cu[:,0]) )]
print(cu)
## Do a forward difference from row to row
cu_diff = np.diff( cu, n=1, axis=0 )
## Now calculate the sum of the first three columns
## as it should be zero (or near zero)
cu_diff_s = np.sum( np.abs( cu_diff[:,:-1] ), axis=1 )
## Find the indices where it is zero
## Change this to be <= eps if you are using float numbers
indices = np.flatnonzero( cu_diff_s == 0 )
print(indices)
## And here are the rows...
print(cu[indices,:])
I contrived a dataset based on your example above. It appears to work. There might be a faster way to do it but this way you don't have to loop anything. (I don't like looping :-) ).
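As a small check of the concatenate, sort and difference idea, here is a sketch on made-up data. One caveat worth stating: ndarray.sort(axis=0) sorts every column independently, so this sketch uses np.lexsort to reorder whole rows instead. The toy arrays and the names order, first and pairs are assumptions for illustration only.

```python
import numpy as np

# Hypothetical toy data: columns 0-2 form the key triplet, column 3 is the value.
cap = np.array([[25., 127., 1., 0.5],
                [26., 127., 1., 1.5],
                [27., 127., 1., 2.5]])
usp = np.array([[28., 127., 1., 30.0],
                [26., 127., 1., 10.0],
                [27., 127., 1., 20.0]])

cu = np.concatenate((cap, usp), axis=0)

# np.lexsort treats the LAST key as the primary one, so column 0 goes last.
order = np.lexsort((cu[:, 2], cu[:, 1], cu[:, 0]))
cu_sorted = cu[order]

# Adjacent rows with identical triplets differ by zero in columns 0-2.
same_as_next = np.all(np.diff(cu_sorted[:, :3], axis=0) == 0, axis=1)
first = np.flatnonzero(same_as_next)

# lexsort is stable, so in each matching pair the cap row precedes the usp row.
pairs = np.hstack((cu_sorted[first], cu_sorted[first + 1, 3:]))
print(pairs)
```

Each row of pairs holds the triplet, then the cap value, then the usp value. For keys that are not exactly representable, the == 0 test could be swapped for a tolerance check.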
[Updated]
Ok. So I added two more columns to each matrix. The second-to-last column is 1 for cap and 2 for usp. The last column is just an index into the original matrices.
import numpy as np

## Store more info in the array
## The first 4 columns are the initial data
## The fifth column is a code of 1 or 2 (ie cap or usp)
## The sixth column is the index into the original matrix
cap_code = np.concatenate( (np.ones( (cap.shape[0], 1 )), np.reshape( np.r_[0:cap.shape[0]], (cap.shape[0], 1))), axis=1 )
cap_info = np.concatenate( (cap, cap_code ), axis=1 )
usp_code = np.concatenate( (2*np.ones( (usp.shape[0], 1 )), np.reshape( np.r_[0:usp.shape[0]], (usp.shape[0], 1))), axis=1 )
usp_info = np.concatenate( (usp, usp_code ), axis=1 )
## Concatenate the matrices together
cu = np.concatenate( (cap_info, usp_info), axis=0 )
print(cu)
## Sort whole rows by the first three columns
## (cu.sort( axis=0 ) would sort each column independently and scramble the rows)
cu = cu[np.lexsort( (cu[:,2], cu[:,1], cu[:,0]) )]
print(cu)
## Do a forward difference from row to row
cu_diff = np.diff( cu, n=1, axis=0 )
## Now calculate the sum of the first three columns
## as it should be zero (or near zero)
cu_diff_s = np.sum( np.abs( cu_diff[:,:3] ), axis=1 )
## Find the indices where it is zero
## Change this to be <= eps if you are using float numbers
indices = np.flatnonzero( cu_diff_s == 0 )
print(indices)
## And here are the rows (each match is a pair of adjacent rows)...
print(cu[indices,:])
print(cu[indices+1,:])
It appears to work based on my contrived data. It is getting a tad convoluted so I don't think I would want to pursue this direction much further.
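A less convoluted route to the index output the question asks for might look like the sketch below. It is not a built-in numpy function: intersect_rows is a hypothetical helper name, and it only combines np.unique(..., axis=0, return_inverse=True) with np.intersect1d(..., return_indices=True). It assumes exact equality of the key columns and returns the first matching index when a row repeats.

```python
import numpy as np

def intersect_rows(a, b):
    """Hypothetical helper: rows common to a and b, plus their indices in each."""
    # Label every distinct row of the stacked keys with an integer id.
    _, inverse = np.unique(np.vstack((a, b)), axis=0, return_inverse=True)
    inverse = inverse.ravel()  # guards against a 2D inverse in some NumPy versions
    ids_a, ids_b = inverse[:len(a)], inverse[len(a):]
    # Intersect the 1D ids; ia and ib index the first occurrence in a and b.
    _, ia, ib = np.intersect1d(ids_a, ids_b, return_indices=True)
    return a[ia], ia, ib

# Hypothetical toy data: columns 0-2 form the key triplet, column 3 is the value.
cap = np.array([[25., 127., 1., 0.5],
                [26., 127., 1., 1.5],
                [27., 127., 1., 2.5]])
usp = np.array([[28., 127., 1., 30.0],
                [27., 127., 1., 20.0],
                [26., 127., 1., 10.0]])

c, ia, ib = intersect_rows(cap[:, :3], usp[:, :3])
matchcap = cap[ia, :]
matchusp = usp[ib, :]
print(np.column_stack((c, matchcap[:, 3], matchusp[:, 3])))
```

The common rows come back in sorted order, and cap[ia, :] and usp[ib, :] line up row for row, so the 4th columns can be compared directly.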
Good luck!