
SQL-like joins in pandas

Tags: python, pandas

I have two DataFrames. The first is of the form (note that the dates are datetime objects):

df = DataFrame({'key': [0, 1, 2, 3, 4, 5],
                'date': [date0, date1, date2, date3, date4, date5],
                'value': [0, 10, 20, 30, 40, 50]})

And a second which is of the form:

df2 = DataFrame({'key': [0, 1, 2, 3, 4, 5],
                 'valid_from': [date0, date0, date0, date3, date3, date3],
                 'valid_to': [date2, date2, date2, date5, date5, date5],
                 'value': [0, 100, 200, 300, 400, 500]})

And I'm trying to efficiently join where the keys match and the date is between the valid_from and valid_to. What I've come up with is the following:

def map_keys(df2, key, date):
    # select the row whose key matches and whose validity window contains the date
    value = df2[(df2['key'] == key) &
                (df2['valid_from'] <= date) &
                (df2['valid_to'] >= date)]['value'].values[0]
    return value

keys = df['key'].values
dates = df['date'].values
keys_dates = zip(keys, dates)

values = []
for key_date in keys_dates:
    value = map_keys(df2, key_date[0], key_date[1])
    values.append(value)

df['joined_value'] = values

While this seems to do the job, it doesn't feel like a particularly elegant solution. I was wondering if anybody had a better idea for a join such as this.

Thanks for your help - it is much appreciated.

asked Jan 12 '13 by landewednack



1 Answer

Currently, you can do this in a few steps with the built-in pandas.merge() and boolean indexing.

merged = df.merge(df2, on='key')

valid = (merged.date >= merged.valid_from) & \
        (merged.date <= merged.valid_to)

df['joined_value'] = merged[valid].value_y

(Note: the value column of df2 is accessed as value_y after the merge because it conflicts with a column of the same name in df and the default merge-conflict suffixes are _x, _y for the left and right frames, respectively.)
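If the automatic _x/_y names feel opaque, merge() also accepts a suffixes argument, so the overlapping columns can be given explicit names instead. A small variation on the code above; the suffix strings here are arbitrary:

merged = df.merge(df2, on='key', suffixes=('_left', '_right'))

valid = (merged.date >= merged.valid_from) & \
        (merged.date <= merged.valid_to)

df['joined_value'] = merged[valid].value_right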

Here's an example, with a different setup to show how invalid dates are handled.

import numpy as np
import pandas as pd
from pandas import DataFrame

n = 8
dates = pd.date_range('1/1/2013', freq='D', periods=n)
df = DataFrame({'key': np.arange(n),
                'date': dates,
                'value': np.arange(n) * 10})
df2 = DataFrame({'key': np.arange(n),
                 'valid_from': dates[[1, 1, 1, 1, 5, 5, 5, 5]],
                 'valid_to': dates[[4, 4, 4, 4, 6, 6, 6, 6]],
                 'value': np.arange(n) * 100})

Input df2:

   key          valid_from            valid_to  value
0    0 2013-01-02 00:00:00 2013-01-05 00:00:00      0
1    1 2013-01-02 00:00:00 2013-01-05 00:00:00    100
2    2 2013-01-02 00:00:00 2013-01-05 00:00:00    200
3    3 2013-01-02 00:00:00 2013-01-05 00:00:00    300
4    4 2013-01-06 00:00:00 2013-01-07 00:00:00    400
5    5 2013-01-06 00:00:00 2013-01-07 00:00:00    500
6    6 2013-01-06 00:00:00 2013-01-07 00:00:00    600
7    7 2013-01-06 00:00:00 2013-01-07 00:00:00    700

Intermediate frame merged:

                 date  key  value_x          valid_from            valid_to  value_y
0 2013-01-01 00:00:00    0        0 2013-01-02 00:00:00 2013-01-05 00:00:00        0
1 2013-01-02 00:00:00    1       10 2013-01-02 00:00:00 2013-01-05 00:00:00      100
2 2013-01-03 00:00:00    2       20 2013-01-02 00:00:00 2013-01-05 00:00:00      200
3 2013-01-04 00:00:00    3       30 2013-01-02 00:00:00 2013-01-05 00:00:00      300
4 2013-01-05 00:00:00    4       40 2013-01-06 00:00:00 2013-01-07 00:00:00      400
5 2013-01-06 00:00:00    5       50 2013-01-06 00:00:00 2013-01-07 00:00:00      500
6 2013-01-07 00:00:00    6       60 2013-01-06 00:00:00 2013-01-07 00:00:00      600
7 2013-01-08 00:00:00    7       70 2013-01-06 00:00:00 2013-01-07 00:00:00      700

Final value of df after adding column joined_value:

                 date  key  value  joined_value
0 2013-01-01 00:00:00    0      0           NaN
1 2013-01-02 00:00:00    1     10           100
2 2013-01-03 00:00:00    2     20           200
3 2013-01-04 00:00:00    3     30           300
4 2013-01-05 00:00:00    4     40           NaN
5 2013-01-06 00:00:00    5     50           500
6 2013-01-07 00:00:00    6     60           600
7 2013-01-08 00:00:00    7     70           NaN
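
If this pattern comes up often, the same merge-and-mask steps can be wrapped into a small helper. The sketch below is only an illustration built on the code above; the function name, the out_col argument, and the '_r' suffix are my own choices, and it assumes key is unique in the right-hand frame so that a left merge yields exactly one output row per left row:

def join_in_range(left, right, key='key', date_col='date',
                  from_col='valid_from', to_col='valid_to',
                  value_col='value', out_col='joined_value'):
    # a left merge keeps every row of `left`, in order, when `key` is unique in `right`
    merged = left.merge(right, on=key, how='left', suffixes=('', '_r'))
    # keep the right-hand value only where the date falls inside the validity window
    in_range = ((merged[date_col] >= merged[from_col]) &
                (merged[date_col] <= merged[to_col]))
    out = left.copy()
    # rows outside the window (or with no matching key) come out as NaN
    out[out_col] = merged[value_col + '_r'].where(in_range).values
    return out

df_joined = join_in_range(df, df2)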
answered Oct 12 '22 by Garrett