I have a pandas DataFrame containing raw data which I would like to enrich by adding a lookup from another mapping table. The mapping table translates a symbol to another symbol, but since there are duplicate keys, it also has an 'end date' for the mapping.
The data to be enriched looks something like this:
date symbol price
0 2001-01-02 00:00:00 GCF5 1000.0
1 2001-01-02 00:00:00 GCZ5 1001.0
2 2001-01-03 00:00:00 GCF5 1002.0
3 2001-01-03 00:00:00 GCZ5 1003.0
4 2001-01-04 00:00:00 GCF5 1004.0
5 2001-01-04 00:00:00 GCZ5 1005.0
The mapping table looks like this:
from_symbol to_symbol end_date
0 GCF5 GCF05 2001-01-03 00:00:00
1 GCF5 GCF15 2001-12-31 00:00:00
2 GCZ5 GCZ15 2001-12-31 00:00:00
And I would like the output to look like this:
date symbol mapped price
0 2001-01-02 00:00:00 GCF5 GCF05 1000.0
1 2001-01-02 00:00:00 GCZ5 GCZ15 1001.0
2 2001-01-03 00:00:00 GCF5 GCF05 1002.0
3 2001-01-03 00:00:00 GCZ5 GCZ15 1003.0
4 2001-01-04 00:00:00 GCF5 GCF15 1004.0
5 2001-01-04 00:00:00 GCZ5 GCZ15 1005.0
I've looked at Series.asof() and the ordered_merge() functions but I can't see how to both join on the symbol == from_symbol clause, and use the end_date to find the first entry. The end_date is inclusive for the join.
Thanks, Jon
I don't know if there's a more elegant way to do this, but at the moment I see two ways of doing it. I mostly use SQL, so these approaches come from that background; since join is a concept taken from relational databases, I'll include the SQL syntax as well:
The SQL way to do this would be to use the row_number() window function and then keep only the rows where row_number = 1:
select
    a.date, a.symbol, a.price, a.mapping
from (
    select
        d.date, d.symbol, d.price, m.to_symbol as mapping,
        row_number() over (partition by d.date, d.symbol order by m.end_date asc) as rn
    from df as d
    inner join mapping as m on m.from_symbol = d.symbol and d.date <= m.end_date
) as a
where a.rn = 1
If there are no duplicates on (date, symbol) in your DataFrame, then:
# merge data on symbols
>>> res = pd.merge(df, mapping, left_on='symbol', right_on='from_symbol')
# remove all records where date > end_date
>>> res = res[res['date'] <= res['end_date']]
# for each (date, symbol) combination keep the mapping with the earliest end_date
# (groupby has no sort-key argument, so sort by end_date first, then take first())
>>> res = res.sort_values('end_date').groupby(['date', 'symbol'], as_index=False).first()
# subset result
>>> res = res[['date','symbol','to_symbol','price']]
>>> res
date symbol to_symbol price
0 2001-01-02 GCF5 GCF05 1000
1 2001-01-02 GCZ5 GCZ15 1001
2 2001-01-03 GCF5 GCF05 1002
3 2001-01-03 GCZ5 GCZ15 1003
4 2001-01-04 GCF5 GCF15 1004
5 2001-01-04 GCZ5 GCZ15 1005
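Putting the merge/filter/sort/groupby steps together, here is a self-contained sketch; the two frames are reconstructed from the tables in the question, and the column dtypes (datetime64 dates, float prices) are assumptions:

```python
import pandas as pd

# Reconstruct the example frames from the question.
df = pd.DataFrame({
    'date': pd.to_datetime(['2001-01-02', '2001-01-02', '2001-01-03',
                            '2001-01-03', '2001-01-04', '2001-01-04']),
    'symbol': ['GCF5', 'GCZ5'] * 3,
    'price': [1000.0, 1001.0, 1002.0, 1003.0, 1004.0, 1005.0],
})
mapping = pd.DataFrame({
    'from_symbol': ['GCF5', 'GCF5', 'GCZ5'],
    'to_symbol': ['GCF05', 'GCF15', 'GCZ15'],
    'end_date': pd.to_datetime(['2001-01-03', '2001-12-31', '2001-12-31']),
})

# 1. join every data row to every mapping row for its symbol
res = pd.merge(df, mapping, left_on='symbol', right_on='from_symbol')
# 2. drop mappings that have already expired (end_date is inclusive)
res = res[res['date'] <= res['end_date']]
# 3. per (date, symbol), keep the mapping with the earliest end_date
res = (res.sort_values('end_date')
          .groupby(['date', 'symbol'], as_index=False)
          .first())
res = res[['date', 'symbol', 'to_symbol', 'price']]
print(res)
```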
If there could be duplicates, you can create a DataFrame like mapping2 below and join on it instead.
The SQL (SQL Server, actually) way would be to use outer apply:
select
d.date, d.symbol, d.price, m.to_symbol as mapping
from df as d
outer apply (
select top 1
m.to_symbol
from mapping as m
where m.from_symbol = d.symbol and d.date <= m.end_date
order by m.end_date asc
) as m
I'm no pandas guru, but I think it would be faster to set a sorted MultiIndex on the mapping DataFrame:
>>> mapping2 = mapping.set_index(['from_symbol', 'end_date']).sort_index()
>>> mapping2
to_symbol
from_symbol end_date
GCF5 2001-01-03 GCF05
2001-12-31 GCF15
GCZ5 2001-12-31 GCZ15
# for each row: select the symbol's mappings, slice to end_date >= date (inclusive),
# and take the first to_symbol
>>> df['mapping'] = df.apply(lambda x: mapping2.loc[x['symbol']][x['date']:].values[0][0], axis=1)
>>> df
date price symbol mapping
0 2001-01-02 1000 GCF5 GCF05
1 2001-01-02 1001 GCZ5 GCZ15
2 2001-01-03 1002 GCF5 GCF05
3 2001-01-03 1003 GCZ5 GCZ15
4 2001-01-04 1004 GCF5 GCF15
5 2001-01-04 1005 GCZ5 GCZ15