Performing a merge in Pandas on a column containing a Python `range` or list-like

My question is an extension of this one made a few years ago.

I'm attempting a left join, but one of the columns I want to join on needs to be a range. It has to stay a range because expanding it would mean millions of new (and unnecessary) rows. Intuitively this seems possible with Python's in operator (x in range(y, z) is very common), but that would involve a nasty for loop and an if/else block. There has to be a better way.

Here's a simple version of my data:

import pandas as pd

# These are in any order
sample = pd.DataFrame({
    'col1': ['1b', '1a', '1a', '1b'],
    'col2': ['2b', '2b', '2a', '2a'],
    'col3': [42, 3, 21, 7]
})

# The 'look-up' table
look_up = pd.DataFrame({
    'col1': ['1a', '1a', '1a', '1a', '1b', '1b', '1b', '1b'],
    'col2': ['2a', '2a', '2b', '2b', '2a', '2a', '2b', '2b'],
    'col3': [range(0,10), range(10,101), range(0,10), range(10,101), range(0,10), range(10,101), range(0,10), range(10,101)],
    'col4': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})

I initially tried a merge to see if pandas would understand but there was a type mismatch error.

sample.merge(
    look_up,
    how='left',
    left_on=['col1', 'col2', 'col3'],
    right_on=['col1', 'col2', 'col3']
)
# ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
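
For context, the error is about the dtypes of the two col3 columns rather than about ranges as such: merge matches key values by equality and refuses to pair an integer column with an object column. A minimal sketch of that check, using cut-down versions of the frames above (the variable names sample_dtype and look_up_dtype are only for illustration):

```python
import pandas as pd

sample = pd.DataFrame({'col3': [42, 3, 21, 7]})
look_up = pd.DataFrame({'col3': [range(0, 10), range(10, 101)]})

# The integers are stored as int64, the range objects as generic Python objects
sample_dtype = sample['col3'].dtype
look_up_dtype = look_up['col3'].dtype
print(sample_dtype, look_up_dtype)

# Membership is the test that is wanted here; equality between an int
# and a range is never true, so an equality-based join could not match anyway
print(3 in range(0, 10), 3 == range(0, 10))
```

So even if the dtypes were made to agree, an equality join on col3 would find nothing; the range condition has to be applied separately.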

Reviewing the documentation for pd.concat, it looks like it will not give me the result I want either. Rather than appending, I'm still trying to get a result like merge. I tried to follow the answer given to the question I linked at the start, but that didn't work either. It's entirely possible I misunderstood how to use np.where, but I'm also hoping there is a solution that is a little less hacky.

Here's my attempt using np.where:

import numpy as np

s1 = sample['col1'].values
s2 = sample['col2'].values
s3 = sample['col3'].values

l1 = look_up['col1'].values
l2 = look_up['col2'].values
l3 = look_up['col3'].values

i, j = np.where((s3[:, None] in l3) & (s2[:, None] == l2) & (s1[:, None] == l1))
result = pd.DataFrame(
    np.column_stack([sample.values[i], look_up.values[j]]), 
    columns=sample.columns.append(look_up.columns)
)

len(result)  # returns 0
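
A note on why this attempt finds nothing: Python's in operator always returns a single boolean, so s3[:, None] in l3 cannot produce an elementwise mask the way == does, and the combined condition ends up False everywhere. One way to keep the broadcasting idea is to compare against the range bounds instead. The sketch below assumes every entry in col3 is a range with step 1; the names lo, hi, and mask are illustrative and not from the linked answer. It behaves like an inner join, so sample rows without a match are dropped.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'col1': ['1b', '1a', '1a', '1b'],
    'col2': ['2b', '2b', '2a', '2a'],
    'col3': [42, 3, 21, 7]
})
look_up = pd.DataFrame({
    'col1': ['1a', '1a', '1a', '1a', '1b', '1b', '1b', '1b'],
    'col2': ['2a', '2a', '2b', '2b', '2a', '2a', '2b', '2b'],
    'col3': [range(0, 10), range(10, 101)] * 4,
    'col4': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})

s1 = sample['col1'].to_numpy()
s2 = sample['col2'].to_numpy()
s3 = sample['col3'].to_numpy()
l1 = look_up['col1'].to_numpy()
l2 = look_up['col2'].to_numpy()

# Assumption: step-1 ranges, so membership is start <= x < stop
lo = np.array([r.start for r in look_up['col3']])
hi = np.array([r.stop for r in look_up['col3']])

# (n_sample, n_look_up) boolean matrix: True where all three conditions hold
mask = (
    (s1[:, None] == l1)
    & (s2[:, None] == l2)
    & (s3[:, None] >= lo)
    & (s3[:, None] < hi)
)
i, j = np.where(mask)

# Pair each matching sample row with the col4 value of its look-up row
result = sample.iloc[i].reset_index(drop=True)
result['col4'] = look_up['col4'].to_numpy()[j]
print(result)
```

The mask holds one cell per (sample row, look-up row) pair, so memory grows with the product of the two lengths; for a large sample the merge-then-filter approach is likely to scale better.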

The result I want should look like this:

col1  col2 col3 col4
'1b'  '2b'   42  'h'
'1a'  '2b'    3  'c'
'1a'  '2a'   21  'b'
'1b'  '2a'    7  'e'

asked Oct 27 '20 20:10

dlindsay

1 Answer

Since it looks like the ranges are pretty big and you are working with integer values, you can just compute the min and max of each range, merge on the other two columns, and filter:

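# Remember the original look-up columns so the helper columns can be dropped later
columns = look_up.columns

# Bounds of each range (inclusive on both ends)
look_up['minval'] = look_up['col3'].apply(min)
look_up['maxval'] = look_up['col3'].apply(max)

# Join on the exact-match keys, then keep rows whose value falls inside the range
(sample.merge(look_up, on=['col1','col2'], how='left',
              suffixes=('','_'))
       .query('minval <= col3 <= maxval')
       [columns]
)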

Output:

  col1 col2  col3 col4
1   1b   2b    42    h
2   1a   2b     3    c
5   1a   2a    21    b
6   1b   2a     7    e
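
A possible variation on the same idea, offered as a sketch rather than as part of the original answer. It reads the bounds from range.start and range.stop instead of scanning each range with min and max, leaves look_up unmodified, and keeps sample rows that have no matching range (with NaN in col4), which the query filter above would drop. It assumes every entry in col3 is a non-empty range with step 1; the names bounds, candidates, and matched are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'col1': ['1b', '1a', '1a', '1b'],
    'col2': ['2b', '2b', '2a', '2a'],
    'col3': [42, 3, 21, 7]
})
look_up = pd.DataFrame({
    'col1': ['1a', '1a', '1a', '1a', '1b', '1b', '1b', '1b'],
    'col2': ['2a', '2a', '2b', '2b', '2a', '2a', '2b', '2b'],
    'col3': [range(0, 10), range(10, 101)] * 4,
    'col4': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})

# Assumption: step-1, non-empty ranges, so the inclusive bounds are start and stop - 1
bounds = look_up.assign(
    minval=look_up['col3'].map(lambda r: r.start),
    maxval=look_up['col3'].map(lambda r: r.stop - 1),
).drop(columns='col3')

# Pair every sample row with all look-up rows sharing col1 and col2,
# carrying the original row label along in an 'index' column
candidates = sample.reset_index().merge(bounds, on=['col1', 'col2'], how='left')

# Keep only the pairs whose value lies inside the range (inclusive on both ends)
in_range = candidates['col3'].between(candidates['minval'], candidates['maxval'])
matched = candidates.loc[in_range, ['index', 'col4']].set_index('index')

# Attach col4 by original row label; sample rows without a match get NaN
result = sample.join(matched['col4'])
print(result)
```

If two ranges for the same col1 and col2 overlap, a sample row can match more than once and will appear once per match in the result.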

answered Oct 17 '22 08:10

Quang Hoang

Tags: merge, python-3.x, pandas, range
