I'm having a python pandas dataframe with 2 relevant columns "date" and "value", let's assume it looks like this and is ordered by date:
data = pd.DataFrame({"date": ["2021-01-01", "2021-01-31", "2021-02-01", "2021-02-28", "2021-03-01", "2021-03-31", "2021-04-01", "2021-04-02"],
"value": [1,2,3,4,5,6,5,8]})
data["date"] = pd.to_datetime(data['date'])
Now I want to join the dataFrame to itself in such a way, that I get for each last available day in month the next available day where the value is higher. In our example this should basically look like this:
date, value, date2, value2:
2021-01-31, 2, 2021-02-01, 3
2021-02-28, 4, 2021-03-01, 5
2021-03-31, 6, 2021-04-02, 8
2021-04-02, 8, NaN, NaN
My current partial solution to this problem looks like this:
last_days = data.groupby([data.date.dt.year, data.date.dt.month]).last()
res = [data.loc[(data.date>date) & (data.value > value)][:1] for date, value in zip(last_days.date, last_days.value)]
print(res)
But because of this answer "Don't iterate over rows in a dataframe", it doesn't feel like the pandas way to me.
So the question is, how to solve it the pandas way?
If you don’t have too many rows, you could generate all pairs of items and filter from there.
Let’s start with getting the last days in the month:
>>> last = data.loc[data['date'].dt.daysinmonth == data['date'].dt.day]
>>> last
date value
1 2021-01-31 2
3 2021-02-28 4
5 2021-03-31 6
Now use a cross join to map each last day to any possible day, then filter on criteria such as later date and larger value:
>>> pairs = pd.merge(last, data, how='cross', suffixes=('', '2'))
>>> pairs = pairs.loc[pairs['date2'].gt(pairs['date']) & pairs['value2'].gt(pairs['value'])]
>>> pairs
date value date2 value2
2 2021-01-31 2 2021-02-01 3
3 2021-01-31 2 2021-02-28 4
4 2021-01-31 2 2021-03-01 5
5 2021-01-31 2 2021-03-31 6
6 2021-01-31 2 2021-04-01 5
7 2021-01-31 2 2021-04-02 8
12 2021-02-28 4 2021-03-01 5
13 2021-02-28 4 2021-03-31 6
14 2021-02-28 4 2021-04-01 5
15 2021-02-28 4 2021-04-02 8
23 2021-03-31 6 2021-04-02 8
Finally use GroupBy.idxmin() to get the first date2
>>> pairs.loc[pairs.groupby(['date', 'value'])['value2'].idxmin().values]
date value date2 value2
2 2021-01-31 2 2021-02-01 3
12 2021-02-28 4 2021-03-01 5
23 2021-03-31 6 2021-04-02 8
Otherwise you might want apply, which is pretty much the same as iterating on rows to be entirely honest.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With