
Performing a merge in Pandas on a column containing a Python `range` or list-like

My question is an extension of this one, asked a few years ago.

I'm attempting a left join, but one of the columns I want to join on needs to hold a range value. It has to be a range because expanding it would mean millions of new (and unnecessary) rows. Intuitively this seems possible using Python's `in` operator (`x in range(y, z)` is very common), but that would involve a nasty for loop and an if/else block. There has to be a better way.

Here's a simple version of my data:

import numpy as np
import pandas as pd

# These are in any order
sample = pd.DataFrame({
    'col1': ['1b', '1a', '1a', '1b'],
    'col2': ['2b', '2b', '2a', '2a'],
    'col3': [42, 3, 21, 7]
})

# The 'look-up' table
look_up = pd.DataFrame({
    'col1': ['1a', '1a', '1a', '1a', '1b', '1b', '1b', '1b'],
    'col2': ['2a', '2a', '2b', '2b', '2a', '2a', '2b', '2b'],
    'col3': [range(0,10), range(10,101), range(0,10), range(10,101), range(0,10), range(10,101), range(0,10), range(10,101)],
    'col4': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})

I initially tried a plain merge to see whether pandas would understand the ranges, but it raised a type mismatch error.

sample.merge(
    look_up,
    how='left',
    left_on=['col1', 'col2', 'col3'],
    right_on=['col1', 'col2', 'col3']
)
# ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
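
A small check of where the mismatch comes from (a sketch using cut-down versions of the two frames, not the full data): the `col3` key holds integers on one side and Python `range` objects on the other, and merge matches keys by equality, which an integer and a range never satisfy.

```python
import pandas as pd

# Cut-down versions of the two frames, keeping only the range-like key
sample = pd.DataFrame({'col3': [42, 3, 21, 7]})
look_up = pd.DataFrame({'col3': [range(0, 10), range(10, 101)]})

# Integers on one side, Python range objects (object dtype) on the other
print(sample['col3'].dtype)
print(look_up['col3'].dtype)

# merge looks for equal keys, but membership is what is wanted here:
# an int is never *equal* to a range, even when it is *in* the range
print(42 == range(10, 101))
print(42 in range(10, 101))
```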

Reviewing the documentation for pd.concat, it looks like it will not give me the result I want either. Rather than appending, I'm still trying to get a merge-like result. I tried to follow the answer given to the question I linked at the start, but that didn't work either. It's entirely possible I misunderstood how to use np.where, but I'm also hoping there is a solution that is a little less hacky.

Here's my attempt using np.where:

s1 = sample['col1'].values
s2 = sample['col2'].values
s3 = sample['col3'].values

l1 = look_up['col1'].values
l2 = look_up['col2'].values
l3 = look_up['col3'].values

i, j = np.where((s3[:, None] in l3) & (s2[:, None] == l2) & (s1[:, None] == l1))
result = pd.DataFrame(
    np.column_stack([sample.values[i], look_up.values[j]]), 
    columns=sample.columns.append(look_up.columns)
)

len(result)  # returns 0
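
For what it's worth, the empty result seems to come from `in`: it is not applied element by element, so `s3[:, None] in l3` collapses to a single boolean (here `False`, since no integer equals a range object) and the whole mask becomes `False`. A minimal sketch of the same broadcasting idea with explicit bounds, assuming every range has step 1 (the names `lo`, `hi` and `mask` are only illustrative):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'col1': ['1b', '1a', '1a', '1b'],
    'col2': ['2b', '2b', '2a', '2a'],
    'col3': [42, 3, 21, 7]
})
look_up = pd.DataFrame({
    'col1': ['1a', '1a', '1a', '1a', '1b', '1b', '1b', '1b'],
    'col2': ['2a', '2a', '2b', '2b', '2a', '2a', '2b', '2b'],
    'col3': [range(0, 10), range(10, 101)] * 4,
    'col4': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})

s1 = sample['col1'].to_numpy()
s2 = sample['col2'].to_numpy()
s3 = sample['col3'].to_numpy()
l1 = look_up['col1'].to_numpy()
l2 = look_up['col2'].to_numpy()

# Assumption: step-1 ranges, so start (inclusive) and stop (exclusive) describe them fully
lo = np.array([r.start for r in look_up['col3']])
hi = np.array([r.stop for r in look_up['col3']])

# Shape (len(sample), len(look_up)): True where a sample row matches a look-up row
mask = (
    (s3[:, None] >= lo) & (s3[:, None] < hi)
    & (s2[:, None] == l2)
    & (s1[:, None] == l1)
)
i, j = np.where(mask)

# Take the matched sample rows and attach col4 from the matched look-up rows
result = sample.iloc[i].reset_index(drop=True)
result['col4'] = look_up['col4'].to_numpy()[j]
print(result)
```

Note that the mask has one cell per (sample row, look-up row) pair, so memory grows with the product of the two lengths; that may not be practical for large tables.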

The result I want should look like this:

col1  col2 col3 col4
'1b'  '2b'   42  'h'
'1a'  '2b'    3  'c'
'1a'  '2a'   21  'b'
'1b'  '2a'    7  'e'
asked Oct 27 '20 at 20:10 by dlindsay



1 Answer

Since the ranges look pretty big and you are working with integer values, you can just compute each range's min and max, merge on the other two columns, and then filter on the bounds:

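# Remember the original look-up columns so the helper columns can be dropped at the end
columns = look_up.columns

# Inclusive lower and upper bound of each range
look_up['minval'] = look_up['col3'].apply(min)
look_up['maxval'] = look_up['col3'].apply(max)

# Merge on the exact-match keys, then keep rows whose col3 falls within the bounds
(sample.merge(look_up, on=['col1','col2'], how='left',
              suffixes=['','_'])
       .query('minval <= col3 <= maxval')
       [columns]
)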

Output:

  col1 col2  col3 col4
1   1b   2b    42    h
2   1a   2b     3    c
5   1a   2a    21    b
6   1b   2a     7    e
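
A possible variation on the above, sketched under the assumption that every range has step 1: `min` and `max` walk through each range, whereas `range.start` and `range.stop` give the bounds directly, which may matter when the ranges are large. The column names `lo` and `hi` are only illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'col1': ['1b', '1a', '1a', '1b'],
    'col2': ['2b', '2b', '2a', '2a'],
    'col3': [42, 3, 21, 7]
})
look_up = pd.DataFrame({
    'col1': ['1a', '1a', '1a', '1a', '1b', '1b', '1b', '1b'],
    'col2': ['2a', '2a', '2b', '2b', '2a', '2a', '2b', '2b'],
    'col3': [range(0, 10), range(10, 101)] * 4,
    'col4': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})

# Replace the range column with explicit bounds (assumes step-1 ranges)
bounds = look_up.assign(
    lo=[r.start for r in look_up['col3']],   # inclusive lower bound
    hi=[r.stop for r in look_up['col3']],    # exclusive upper bound
).drop(columns='col3')

# Merge on the exact-match keys, then keep rows whose col3 lies in [lo, hi)
merged = sample.merge(bounds, on=['col1', 'col2'], how='left')
in_range = (merged['lo'] <= merged['col3']) & (merged['col3'] < merged['hi'])
result = merged.loc[in_range, ['col1', 'col2', 'col3', 'col4']].reset_index(drop=True)
print(result)
```

Either way, the filter drops sample rows that have no matching range, so the result behaves like an inner join even though the merge itself is a left join.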
answered Oct 17 '22 at 08:10 by Quang Hoang