
How to join pandas dataframe to itself by condition?

I have a pandas DataFrame with two relevant columns, "date" and "value". Let's assume it looks like this and is ordered by date:

data = pd.DataFrame({"date": ["2021-01-01", "2021-01-31", "2021-02-01", "2021-02-28", "2021-03-01", "2021-03-31", "2021-04-01", "2021-04-02"],
                     "value": [1,2,3,4,5,6,5,8]})
data["date"] = pd.to_datetime(data['date'])

Now I want to join the DataFrame to itself so that, for the last available day of each month, I get the next available day on which the value is higher. For our example, the result should look like this:

date, value, date2, value2:
2021-01-31, 2, 2021-02-01, 3
2021-02-28, 4, 2021-03-01, 5
2021-03-31, 6, 2021-04-02, 8
2021-04-02, 8, NaN, NaN

My current partial solution to this problem looks like this:

last_days = data.groupby([data.date.dt.year, data.date.dt.month]).last()
res = [data.loc[(data.date>date) & (data.value > value)][:1] for date, value in zip(last_days.date, last_days.value)]
print(res)

But given the answer "Don't iterate over rows in a dataframe", this doesn't feel like the pandas way to me.

So the question is, how to solve it the pandas way?

jackattack asked Jan 29 '26 16:01


1 Answer

If you don’t have too many rows, you could generate all pairs of items and filter from there.

Let’s start by getting the rows that fall on the last calendar day of their month:

>>> last = data.loc[data['date'].dt.daysinmonth == data['date'].dt.day]
>>> last
        date  value
1 2021-01-31      2
3 2021-02-28      4
5 2021-03-31      6
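
Note that this filter only matches calendar month ends, so 2021-04-02 is not selected, even though the question's expected output lists it as the last available day of April. Here is a minimal sketch of one way to pick the last available row per month instead. It assumes the frame can be sorted by date, and the name last_available is mine, not something from the question:

```python
import pandas as pd

data = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-31", "2021-02-01", "2021-02-28",
                            "2021-03-01", "2021-03-31", "2021-04-01", "2021-04-02"]),
    "value": [1, 2, 3, 4, 5, 6, 5, 8],
})

# Sort by date so that the last row of each group is the latest date in that month.
data = data.sort_values("date")

# Group by calendar month (year and month together) and keep the last row of each group.
# tail(1) keeps the original index labels, unlike .last().
last_available = data.groupby(data["date"].dt.to_period("M")).tail(1)
print(last_available)
```

On the sample data this should keep the rows with index 1, 3, 5 and 7, i.e. it also includes 2021-04-02.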

Now use a cross join (how='cross' requires pandas 1.2 or later) to pair each last day with every row of data, then keep only the pairs with a later date and a larger value:

>>> pairs = pd.merge(last, data, how='cross', suffixes=('', '2'))
>>> pairs = pairs.loc[pairs['date2'].gt(pairs['date']) & pairs['value2'].gt(pairs['value'])]
>>> pairs
         date  value      date2  value2
2  2021-01-31      2 2021-02-01       3
3  2021-01-31      2 2021-02-28       4
4  2021-01-31      2 2021-03-01       5
5  2021-01-31      2 2021-03-31       6
6  2021-01-31      2 2021-04-01       5
7  2021-01-31      2 2021-04-02       8
12 2021-02-28      4 2021-03-01       5
13 2021-02-28      4 2021-03-31       6
14 2021-02-28      4 2021-04-01       5
15 2021-02-28      4 2021-04-02       8
23 2021-03-31      6 2021-04-02       8

Finally, use GroupBy.idxmin() on date2 to get the earliest matching row for each last day:

>>> pairs.loc[pairs.groupby(['date', 'value'])['date2'].idxmin().values]
         date  value      date2  value2
2  2021-01-31      2 2021-02-01       3
12 2021-02-28      4 2021-03-01       5
23 2021-03-31      6 2021-04-02       8
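
The question's expected output also has a row for 2021-04-02 with NaN in date2 and value2. As a rough end-to-end sketch of how that could be produced, the steps can be combined with a final left merge so that last days without any match are kept. This assumes pandas 1.2 or later, one row per date, and that "last day" means the last available row of each month; the names first and result are only illustrative:

```python
import pandas as pd

data = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-31", "2021-02-01", "2021-02-28",
                            "2021-03-01", "2021-03-31", "2021-04-01", "2021-04-02"]),
    "value": [1, 2, 3, 4, 5, 6, 5, 8],
})
data = data.sort_values("date")

# Last available row of each calendar month (assumption: not only calendar month ends).
last = data.groupby(data["date"].dt.to_period("M")).tail(1)

# All (last day, any day) pairs, filtered to later dates with larger values.
pairs = pd.merge(last, data, how="cross", suffixes=("", "2"))
pairs = pairs.loc[pairs["date2"].gt(pairs["date"]) & pairs["value2"].gt(pairs["value"])]

# Earliest qualifying date2 for each last day.
first = pairs.loc[pairs.groupby("date")["date2"].idxmin()]

# The left merge brings back last days that had no qualifying pair (NaT / NaN).
result = last.merge(first[["date", "date2", "value2"]], on="date", how="left")
print(result)
```

Keep in mind that the cross join materializes len(last) * len(data) rows before filtering, so this only suits frames that are not too large.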

Otherwise, you might want DataFrame.apply, which, to be entirely honest, is pretty much the same as iterating over rows.
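
For comparison, a sketch of what the apply variant might look like. The helper next_higher is a hypothetical name of mine, and the code assumes data is sorted by date so that the first matching row is the earliest one:

```python
import pandas as pd

data = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-31", "2021-02-01", "2021-02-28",
                            "2021-03-01", "2021-03-31", "2021-04-01", "2021-04-02"]),
    "value": [1, 2, 3, 4, 5, 6, 5, 8],
})
data = data.sort_values("date")

# Last available row of each calendar month.
last = data.groupby(data["date"].dt.to_period("M")).tail(1)


def next_higher(row):
    """Hypothetical helper: first later row of `data` with a larger value."""
    later = data.loc[(data["date"] > row["date"]) & (data["value"] > row["value"])]
    if later.empty:
        # No later row with a higher value exists.
        return pd.Series({"date2": pd.NaT, "value2": float("nan")})
    first = later.iloc[0]
    return pd.Series({"date2": first["date"], "value2": first["value"]})


# apply(axis=1) calls next_higher once per row, i.e. it is a Python-level loop.
result = pd.concat([last, last.apply(next_higher, axis=1)], axis=1)
print(result)
```

This avoids the memory cost of the cross join, but it scans data once for every last day.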

Cimbali answered Feb 01 '26 06:02


