Within a DataFrame I want to filter, for certain keys, all rows of a "series" down to one row per key and hour, keeping the row with the maximum value.
I have a working piece of code, but I wondered if there isn't something more compact or elegant to achieve the same?
import pandas as pd

df = pd.read_csv("./example.csv")
df["Date"] = pd.to_datetime(df["Date"])
Keys2Filter = (1, 2)
df
...
Key Date Value
0 1 2019-04-17 00:00:00 1
1 1 2019-04-17 01:00:00 2
2 1 2019-04-17 02:00:00 3
3 1 2019-04-17 00:00:00 4
4 1 2019-04-17 01:00:00 5
5 1 2019-04-17 00:00:00 7
6 2 2019-04-17 01:00:00 8
7 2 2019-04-17 02:00:00 9
8 2 2019-04-17 00:00:00 9
9 2 2019-04-17 01:00:00 9
10 3 2019-04-17 01:00:00 9
11 3 2019-04-17 01:00:00 9
...
if len(Keys2Filter):
    dfMax = df[df.Key.isin(Keys2Filter)]
    dfMax = dfMax.groupby(by=["Key", dfMax.Date.dt.hour]).max()
    dfMax.index.names = ["Key", "Hour"]
    dfMax = dfMax.reset_index()
    df = df[~df.Key.isin(Keys2Filter)].append(dfMax.drop(columns=['Hour']))
df
Key Date Value
10 3 2019-04-17 01:00:00 9
11 3 2019-04-17 01:00:00 9
0 1 2019-04-17 00:00:00 7
1 1 2019-04-17 01:00:00 5
2 1 2019-04-17 02:00:00 3
3 2 2019-04-17 00:00:00 9
4 2 2019-04-17 01:00:00 9
5 2 2019-04-17 02:00:00 9
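To reproduce this without example.csv, the sample data can be rebuilt inline (a sketch based on the output printed above):

import pandas as pd

# inline reconstruction of the sample data printed above
df = pd.DataFrame({
    "Key": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "Date": pd.to_datetime([
        "2019-04-17 00:00", "2019-04-17 01:00", "2019-04-17 02:00",
        "2019-04-17 00:00", "2019-04-17 01:00", "2019-04-17 00:00",
        "2019-04-17 01:00", "2019-04-17 02:00", "2019-04-17 00:00",
        "2019-04-17 01:00", "2019-04-17 01:00", "2019-04-17 01:00",
    ]),
    "Value": [1, 2, 3, 4, 5, 7, 8, 9, 9, 9, 9, 9],
})
Keys2Filter = (1, 2)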
Your solution is good; just a small simplification: cache the boolean mask in a variable m and invert it with ~. Also, DataFrame.reset_index is used twice - first to remove the second index level, then to convert the remaining index back to a column:
m = df.Key.isin(Keys2Filter)
df = (df[~m].append(df[m].groupby(by=["Key", df.Date.dt.hour])
                         .max()
                         .reset_index(level=1, drop=True)
                         .reset_index()))
print (df)
Key Date Value
10 3 2019-04-17 01:00:00 9
11 3 2019-04-17 01:00:00 9
0 1 2019-04-17 00:00:00 7
1 1 2019-04-17 01:00:00 5
2 1 2019-04-17 02:00:00 3
3 2 2019-04-17 00:00:00 9
4 2 2019-04-17 01:00:00 9
5 2 2019-04-17 02:00:00 9
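Note that DataFrame.append was removed in pandas 2.0; on current versions the same logic can be written with pd.concat (a sketch of the equivalent, assuming the same df and Keys2Filter as above):

import pandas as pd

m = df.Key.isin(Keys2Filter)
# rows of other keys pass through untouched; filtered keys are
# reduced to their per-(Key, hour) maxima
df = pd.concat([df[~m],
                df[m].groupby(by=["Key", df.Date.dt.hour])
                     .max()
                     .reset_index(level=1, drop=True)
                     .reset_index()])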
IIUC, use transform and isin. transform('idxmax') broadcasts the index label of each group's maximum value back to every row of that group. Then we combine it with isin: if a row's key is not in Keys2Filter, ~isin yields True and the row is selected unconditionally; if the key is in Keys2Filter, ~isin yields False, so we fall through to the second condition and select the row only if it holds the maximum value of its group.
s = df.groupby([df.Key, df.Date.dt.strftime('%Y-%m-%d %H')])['Value'].transform('idxmax')
Keys2Filter = (1, 2)
df = df.loc[~df.Key.isin(Keys2Filter) | df.index.eq(s)].copy()  # copy to avoid SettingWithCopyWarning
df
Out[991]:
Key Date Value
2 1 2019-04-17 02:00:00 3
4 1 2019-04-17 01:00:00 5
5 1 2019-04-17 00:00:00 7
7 2 2019-04-17 02:00:00 9
8 2 2019-04-17 00:00:00 9
9 2 2019-04-17 01:00:00 9
10 3 2019-04-17 01:00:00 9
11 3 2019-04-17 01:00:00 9
Notice: using this method will keep the original order of your df.
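A closely related variant (a sketch, starting from the unfiltered df and assuming a unique index): collect the per-group argmax labels once with idxmax instead of broadcasting them with transform:

# index labels of the row holding the maximum Value in each (Key, hour) group
keep = df.groupby([df.Key, df.Date.dt.strftime('%Y-%m-%d %H')])['Value'].idxmax()
df = df[~df.Key.isin(Keys2Filter) | df.index.isin(keep)]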