NumPy equivalent of merge

Tags:

python

merge

numpy

I'm transitioning some stuff from R to Python and am curious about merging efficiently. I've found some stuff on concatenate in NumPy (using NumPy for operations, so I'd like to stick with it), but it doesn't work as expected.

Take two datasets

d1 = np.array([['1a2', '0'], ['2dd', '0'], ['z83', '1'], ['fz3', '0']])

ID      Label
1a2     0
2dd     0
z83     1
fz3     0

and

d2 = np.array([['1a2', '33.3', '22.2'], 
               ['43m', '66.6', '66.6'], 
               ['z83', '12.2', '22.1']])

ID     val1   val2
1a2    33.3   22.2
43m    66.6   66.6
z83    12.2   22.1

I want to merge these together so that the result is

d3

ID    Label    val1    val2
1a2   0        33.3    22.2
z83   1        12.2    22.1

That is, it identifies the rows that match on the ID column and concatenates their remaining columns. This is relatively simple in R using merge, but in NumPy it's less obvious to me.

Is there a way to do this natively in NumPy that I am missing?

asked Mar 26 '18 15:03

Jibril

1 Answer

Here's one NumPy-based solution using masking -

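import numpy as np

def numpy_merge_bycol0(d1, d2):
    # Mask of rows in d1 whose ID (column 0) also appears in d2
    d1mask = np.isin(d1[:,0], d2[:,0])

    # Mask of rows in d2 whose ID (column 0) also appears in d1
    d2mask = np.isin(d2[:,0], d1[:,0])

    # Mask both arrays and concatenate column-wise for the final output.
    # Rows are paired by position, so this assumes the matching IDs are
    # unique and appear in the same relative order in both arrays.
    return np.c_[d1[d1mask], d2[d2mask,1:]]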

Sample run -

In [43]: d1
Out[43]: 
array([['1a2', '0'],
       ['2dd', '0'],
       ['z83', '1'],
       ['fz3', '0']], dtype='|S3')

In [44]: d2
Out[44]: 
array([['1a2', '33.3', '22.2'],
       ['43m', '66.6', '66.6'],
       ['z83', '12.2', '22.1']], dtype='|S4')

In [45]: numpy_merge_bycol0(d1, d2)
Out[45]: 
array([['1a2', '0', '33.3', '22.2'],
       ['z83', '1', '12.2', '22.1']], dtype='|S4')
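
The masking approach pairs the surviving rows by position, so it relies on the matching IDs being unique and in the same relative order in both arrays. If that is not guaranteed, one possible variant (a minimal sketch, not part of the original answer) is to let `np.intersect1d` return the matching row positions directly. The helper name `merge_on_col0` is hypothetical, and `return_indices` requires NumPy 1.15 or later.

```python
import numpy as np

def merge_on_col0(a, b):
    """Inner-join two 2-D string arrays on their first column.

    Hypothetical helper; assumes IDs are unique within each array.
    """
    # Common IDs (sorted) plus the row position of each one in a and in b
    _, ia, ib = np.intersect1d(a[:, 0], b[:, 0], return_indices=True)
    # Rows come out sorted by ID, not in the original order of a
    return np.c_[a[ia], b[ib, 1:]]

d1 = np.array([['1a2', '0'], ['2dd', '0'], ['z83', '1'], ['fz3', '0']])
# Same d2 as in the question, with the rows deliberately shuffled
d2 = np.array([['z83', '12.2', '22.1'],
               ['43m', '66.6', '66.6'],
               ['1a2', '33.3', '22.2']])

d3 = merge_on_col0(d1, d2)
print(d3)
```

Note that the result is ordered by sorted ID rather than by the row order of `d1`, which happens to coincide for this sample data.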

We could also use broadcasting to get the indices and then integer-indexing in place of masking, like so -

# Each row of idx is a (row in d1, row in d2) pair with equal IDs
idx = np.argwhere(d1[:,0,None] == d2[:,0])
out = np.c_[d1[idx[:,0]], d2[idx[:,1],1:]]
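
For reference, the broadcasting variant can be written out as a self-contained snippet on the question's data. This is only a sketch; the intermediate name `matches` is introduced here for illustration. The comparison builds a boolean matrix of shape `(len(d1), len(d2))`, so memory grows with the product of the two row counts.

```python
import numpy as np

d1 = np.array([['1a2', '0'], ['2dd', '0'], ['z83', '1'], ['fz3', '0']])
d2 = np.array([['1a2', '33.3', '22.2'],
               ['43m', '66.6', '66.6'],
               ['z83', '12.2', '22.1']])

# A (4, 1) column compared with a (3,) row broadcasts to a (4, 3) match matrix
matches = d1[:, 0, None] == d2[:, 0]

# Each row of idx is a (row in d1, row in d2) pair whose IDs are equal
idx = np.argwhere(matches)

# Take the matched d1 rows and the value columns of the matched d2 rows
out = np.c_[d1[idx[:, 0]], d2[idx[:, 1], 1:]]
print(out)
```

Because the row pairs come from explicit indices rather than positional masks, this form does not depend on the IDs appearing in the same order in both arrays, and duplicate IDs would yield one output row per matching pair.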

answered Oct 05 '22 14:10

Divakar
