My question is an extension of this one made a few years ago.
I'm attempting a left join but one of the columns I want to join on needs to be a range value. It needs to be a range because expanding it would mean millions of new (and unnecessary) rows. Intuitively it seems possible using Python's in
operator (as x in range(y, z)
is very common) but would involve a nasty for loop and if/else block. There has to be a better way.
Here's a simple version of my data:
# These are in any order
sample = pd.DataFrame({
'col1': ['1b', '1a', '1a', '1b'],
'col2': ['2b', '2b', '2a', '2a'],
'col3': [42, 3, 21, 7]
})
# The 'look-up' table
look_up = pd.DataFrame({
'col1': ['1a', '1a', '1a', '1a', '1b', '1b', '1b', '1b'],
'col2': ['2a', '2a', '2b', '2b', '2a', '2a', '2b', '2b'],
'col3': [range(0,10), range(10,101), range(0,10), range(10,101), range(0,10), range(10,101), range(0,10), range(10,101)],
'col4': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})
I initially tried a merge to see if pandas would understand but there was a type mismatch error.
sample.merge(
look_up,
how='left',
left_on=['col1', 'col2', 'col3'],
right_on=['col1', 'col2', 'col3']
)
# ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
Reviewing the documentation for pd.concat
looks like it will not give me result I want either. Rather than appending, I'm still trying to get a result like merge
. I tried to follow the answer given to question I linked at the start but that didn't work either. It's entirely possible I misunderstood how to use np.where
but also I'm hoping there is a solution that is a little less hacky.
Here's my attempt using np.where
:
s1 = sample['col1'].values
s2 = sample['col2'].values
s3 = sample['col3'].values
l1 = look_up['col1'].values
l2 = look_up['col2'].values
l3 = look_up['col3'].values
i, j = np.where((s3[:, None] in l3) & (s2[:, None] == l2) & (s1[:, None] == l1))
result = pd.DataFrame(
np.column_stack([sample.values[i], look_up.values[j]]),
columns=sample.columns.append(look_up.columns)
)
len(result) # returns 0
The result I want should look like this:
col1 col2 col3 col4
'1b' '2b' 42 'h'
'1a' '2b' 3 'c'
'1a' '2a' 21 'b'
'1b' '2a' 7 'e'
merge() for combining data on common columns or indices. . join() for combining data on a key column or an index. concat() for combining DataFrames across rows or columns.
Key Points Pandas' merge and concat can be used to combine subsets of a DataFrame, or even data from different files. join function combines DataFrames based on index or column. Joining two DataFrames can be done in multiple ways (left, right, and inner) depending on what data must be in the final DataFrame.
INNER Merge Pandas uses “inner” merge by default. This keeps only the common values in both the left and right dataframes for the merged data. In our case, only the rows that contain use_id values that are common between user_usage and user_device remain in the merged data — inner_merge.
Since it looks like ranges are pretty big, and you are working with integer vales, you can just compute the min, max:
columns = look_up.columns
look_up['minval'] = look_up['col3'].apply(min)
look_up['maxval'] = look_up['col3'].apply(max)
(sample.merge(look_up, on=['col1','col2'], how='left',
suffixes=['','_'])
.query('minval <= col3 <= maxval')
[columns]
)
Output:
col1 col2 col3 col4
1 1b 2b 42 h
2 1a 2b 3 c
5 1a 2a 21 b
6 1b 2a 7 e
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With