
SQL like joins in pandas




I have two dataframes, the first is of the form (note that the dates are datetime objects):

df = DataFrame({'key': [0,1,2,3,4,5],
                'date': [date0,date1, date2, date3, date4, date5],
                'value': [0,10,20,30,40,50]})

And a second which is of the form:

df2 = DataFrame({'key': [0,1,2,3,4,5],
                 'valid_from': [date0, date0, date0, date3, date3, date3],
                 'valid_to': [date2, date2, date2, date5, date5, date5],
                 'value': [0, 100, 200, 300, 400, 500]})

And I'm trying to efficiently join where the keys match and the date is between the valid_from and valid_to. What I've come up with is the following:

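def map_keys(df2, key, date):
    # Parenthesize each comparison: & binds more tightly than ==, <= and >=.
    mask = ((df2['key'] == key) &
            (df2['valid_from'] <= date) &
            (df2['valid_to'] >= date))
    # Take the first matching value (raises IndexError if nothing matches).
    return df2.loc[mask, 'value'].values[0]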

keys = df['key'].values
dates = df['date'].values
keys_dates = zip(keys, dates)

values = []
for key, date in keys_dates:
    # Collect each looked-up value so the list lines up with df's rows.
    values.append(map_keys(df2, key, date))

df['joined_value'] = values

While this seems to do the job, it doesn't feel like a particularly elegant solution. Does anybody have a better idea for a join such as this?

Thanks for your help - it is much appreciated.

landewednack asked Jan 12 '13 22:01



1 Answer

Currently, you can do this in a few steps with the built-in pandas.merge() and boolean indexing.

merged = df.merge(df2, on='key')

valid = (merged.date >= merged.valid_from) & \
        (merged.date <= merged.valid_to)

df['joined_value'] = merged[valid].value_y

(Note: the value column of df2 is accessed as value_y after the merge because it conflicts with a column of the same name in df and the default merge-conflict suffixes are _x, _y for the left and right frames, respectively.)
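
To illustrate the suffix behaviour described above, here is a minimal sketch using two tiny frames. The frame names left and right and the suffix '_df2' are made up for this example; suffixes is the merge parameter that controls how clashing column names are renamed.

```python
import pandas as pd

# Hypothetical toy frames that share a non-key column called 'value'.
left = pd.DataFrame({'key': [0, 1], 'value': [0, 10]})
right = pd.DataFrame({'key': [0, 1], 'value': [0, 100]})

# Default suffixes: the clashing columns become value_x and value_y.
default = left.merge(right, on='key')

# Explicit suffixes: leave the left name alone and tag only the right one.
named = left.merge(right, on='key', suffixes=('', '_df2'))

print(list(default.columns))
print(list(named.columns))
```

Choosing an explicit suffix can make the later boolean-indexing step easier to read, since the column coming from df2 is named after its source.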

Here's an example, with a different setup to show how invalid dates are handled.

import numpy as np
import pandas as pd
from pandas import DataFrame

n = 8
dates = pd.date_range('1/1/2013', freq='D', periods=n)
df = DataFrame({'key': np.arange(n),
                'date': dates,
                'value': np.arange(n) * 10})
df2 = DataFrame({'key': np.arange(n),
                 'valid_from': dates[[1,1,1,1,5,5,5,5]],
                 'valid_to': dates[[4,4,4,4,6,6,6,6]],
                 'value': np.arange(n) * 100})

Input df2:

   key          valid_from            valid_to  value
0    0 2013-01-02 00:00:00 2013-01-05 00:00:00      0
1    1 2013-01-02 00:00:00 2013-01-05 00:00:00    100
2    2 2013-01-02 00:00:00 2013-01-05 00:00:00    200
3    3 2013-01-02 00:00:00 2013-01-05 00:00:00    300
4    4 2013-01-06 00:00:00 2013-01-07 00:00:00    400
5    5 2013-01-06 00:00:00 2013-01-07 00:00:00    500
6    6 2013-01-06 00:00:00 2013-01-07 00:00:00    600
7    7 2013-01-06 00:00:00 2013-01-07 00:00:00    700

Intermediate frame merged:

                 date  key  value_x          valid_from            valid_to  value_y
0 2013-01-01 00:00:00    0        0 2013-01-02 00:00:00 2013-01-05 00:00:00        0
1 2013-01-02 00:00:00    1       10 2013-01-02 00:00:00 2013-01-05 00:00:00      100
2 2013-01-03 00:00:00    2       20 2013-01-02 00:00:00 2013-01-05 00:00:00      200
3 2013-01-04 00:00:00    3       30 2013-01-02 00:00:00 2013-01-05 00:00:00      300
4 2013-01-05 00:00:00    4       40 2013-01-06 00:00:00 2013-01-07 00:00:00      400
5 2013-01-06 00:00:00    5       50 2013-01-06 00:00:00 2013-01-07 00:00:00      500
6 2013-01-07 00:00:00    6       60 2013-01-06 00:00:00 2013-01-07 00:00:00      600
7 2013-01-08 00:00:00    7       70 2013-01-06 00:00:00 2013-01-07 00:00:00      700

Final value of df after adding column joined_value:

                 date  key  value  joined_value
0 2013-01-01 00:00:00    0      0           NaN
1 2013-01-02 00:00:00    1     10           100
2 2013-01-03 00:00:00    2     20           200
3 2013-01-04 00:00:00    3     30           300
4 2013-01-05 00:00:00    4     40           NaN
5 2013-01-06 00:00:00    5     50           500
6 2013-01-07 00:00:00    6     60           600
7 2013-01-08 00:00:00    7     70           NaN
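
The assignment above works because merged and df share the same default integer index, which holds when each key appears once in df2. The following is a rough sketch, not part of the original answer, of one way to adapt the idea when a key has several validity ranges or df has a custom index. The data, the 'row_id' name and the '_df2' suffix are all invented for illustration, and it assumes the ranges for a given key do not overlap.

```python
import pandas as pd

dates = pd.date_range('2013-01-01', freq='D', periods=4)

# Hypothetical data: df has a non-default index, and key 0 has two ranges in df2.
df = pd.DataFrame({'key': [0, 0, 1, 1],
                   'date': dates,
                   'value': [0, 10, 20, 30]},
                  index=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame({'key': [0, 0, 1],
                    'valid_from': dates[[0, 1, 3]],
                    'valid_to': dates[[0, 1, 3]],
                    'value': [100, 200, 300]})

# Carry df's index through the merge as an ordinary column named 'row_id'.
merged = df.rename_axis('row_id').reset_index().merge(
    df2, on='key', suffixes=('', '_df2'))

# Keep only the (row, range) pairs where the date falls inside the range.
valid = ((merged['date'] >= merged['valid_from']) &
         (merged['date'] <= merged['valid_to']))
matches = merged.loc[valid].set_index('row_id')['value_df2']

# Assignment aligns on df's index; rows with no valid range get NaN.
# Assumes at most one match per row (non-overlapping ranges per key).
df['joined_value'] = matches
print(df)
```

Note that the intermediate frame has one row per (df row, df2 range) pair for each key, so it can grow large when keys have many ranges.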
Garrett answered Oct 12 '22 17:10
