I'm moving some code from R to Python and am curious about how to merge arrays efficiently. I've found some material on concatenate in NumPy (I'm using NumPy for my operations, so I'd like to stick with it), but it doesn't work as expected.
Take two datasets
d1 = np.array([['1a2', '0'], ['2dd', '0'], ['z83', '1'], ['fz3', '0']])
ID Label
1a2 0
2dd 0
z83 1
fz3 0
and
d2 = np.array([['1a2', '33.3', '22.2'],
['43m', '66.6', '66.6'],
['z83', '12.2', '22.1']])
ID val1 val2
1a2 33.3 22.2
43m 66.6 66.6
z83 12.2 22.1
I want to merge these together so that the result is
d3
ID Label val1 val2
1a2 0 33.3 22.2
z83 1 12.2 22.1
So it has identified the rows that match on the ID column and concatenated them. This is relatively simple in R using merge, but in NumPy it's less obvious to me.
Is there a way to do this natively in NumPy that I am missing?
Use numpy.concatenate() to join two or more arrays into a single array. It takes a sequence of arrays plus an optional axis argument and returns a new ndarray. When axis is not specified, it defaults to 0.
Joining NumPy arrays: we pass the sequence of arrays that we want to join to concatenate(), along with the axis.
NumPy's concatenate function can join arrays either row-wise or column-wise. The arrays must have the same shape except along the axis being joined, and by default the join is row-wise (axis=0). For example, concatenating two 3 x 3 arrays row-wise gives an array of shape 6 x 3, i.e. 6 rows and 3 columns.
The NumPy append function adds new values to the end of an existing array. It returns a copy of the array with the values appended along the specified axis; if no axis is given, both inputs are flattened first. Like concatenate, it can join two arrays either row-wise or column-wise.
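As a quick illustration of the axis behaviour described above, here is a minimal sketch. The arrays a and b are made-up examples, not the question's data. Note that neither function matches rows by key, which is why concatenate alone does not give the merge asked for in the question.

```python
import numpy as np

# Two hypothetical 3 x 3 arrays, used only to show the resulting shapes
a = np.arange(9).reshape(3, 3)
b = np.arange(9, 18).reshape(3, 3)

rows = np.concatenate((a, b))          # axis=0 by default: stack row-wise
cols = np.concatenate((a, b), axis=1)  # join column-wise
flat = np.append(a, b)                 # no axis given: inputs are flattened first
stacked = np.append(a, b, axis=0)      # with axis=0, same result as concatenate

print(rows.shape)     # (6, 3)
print(cols.shape)     # (3, 6)
print(flat.shape)     # (18,)
print(stacked.shape)  # (6, 3)
```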
Here's one NumPy based solution using masking -
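import numpy as np

def numpy_merge_bycol0(d1, d2):
    # Note: rows are paired by position after masking, so this assumes the
    # IDs are unique and the matching IDs appear in the same order in both arrays
    # Mask of matches in d1 against d2
    d1mask = np.isin(d1[:,0], d2[:,0])
    # Mask of matches in d2 against d1
    d2mask = np.isin(d2[:,0], d1[:,0])
    # Mask respective arrays and concatenate for final o/p
    return np.c_[d1[d1mask], d2[d2mask,1:]]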
Sample run -
In [43]: d1
Out[43]:
array([['1a2', '0'],
       ['2dd', '0'],
       ['z83', '1'],
       ['fz3', '0']], dtype='|S3')
In [44]: d2
Out[44]:
array([['1a2', '33.3', '22.2'],
       ['43m', '66.6', '66.6'],
       ['z83', '12.2', '22.1']], dtype='|S4')
In [45]: numpy_merge_bycol0(d1, d2)
Out[45]:
array([['1a2', '0', '33.3', '22.2'],
       ['z83', '1', '12.2', '22.1']], dtype='|S4')
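If the matching IDs are not in the same order in both arrays, pairing rows by position after masking would mismatch them. One possible alternative, not part of the original answer, is np.intersect1d with return_indices=True (available from NumPy 1.15). The function name merge_on_first_column and the reordered d2 below are illustrative assumptions, and the sketch assumes IDs are unique within each array (for duplicates, only the first occurrence is used).

```python
import numpy as np

def merge_on_first_column(left, right):
    """Inner-join two 2-D arrays on column 0 (hypothetical helper)."""
    # Common IDs come back sorted, along with their positions in each input
    _, left_idx, right_idx = np.intersect1d(
        left[:, 0], right[:, 0], return_indices=True
    )
    # Take the matched rows and drop the duplicate ID column from `right`
    return np.hstack((left[left_idx], right[right_idx, 1:]))

d1 = np.array([['1a2', '0'], ['2dd', '0'], ['z83', '1'], ['fz3', '0']])
# Same rows as d2 in the question, deliberately reordered
d2 = np.array([['z83', '12.2', '22.1'],
               ['43m', '66.6', '66.6'],
               ['1a2', '33.3', '22.2']])

d3 = merge_on_first_column(d1, d2)
print(d3)
```

The result rows come out sorted by ID rather than in the original order of d1.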
We could also use broadcasting to get the indices and then use integer-indexing in place of masking, like so -
# Each row of idx is (row index into d1, row index into d2) for a matching ID
idx = np.argwhere(d1[:,0,None] == d2[:,0])
out = np.c_[d1[idx[:,0]], d2[idx[:,1],1:]]
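For reference, the broadcasting approach can be run end to end on the question's data as in the sketch below. The intermediate name matches is introduced here only for readability. Because every ID in d1 is compared with every ID in d2, this pairs rows correctly regardless of order, at the cost of building a len(d1) x len(d2) boolean array.

```python
import numpy as np

d1 = np.array([['1a2', '0'], ['2dd', '0'], ['z83', '1'], ['fz3', '0']])
d2 = np.array([['1a2', '33.3', '22.2'],
               ['43m', '66.6', '66.6'],
               ['z83', '12.2', '22.1']])

# Shape (4, 1) compared with shape (3,) broadcasts to a (4, 3) boolean array
matches = d1[:, 0, None] == d2[:, 0]

# Each row of idx is (row index into d1, row index into d2)
idx = np.argwhere(matches)

# Matched rows of d1, followed by the non-ID columns of the matched rows of d2
out = np.c_[d1[idx[:, 0]], d2[idx[:, 1], 1:]]
print(out)
```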