I have the following data frame:
import pandas as pd
df = pd.DataFrame([[1990,7,1000],[1990,8,2500],[1990,9,2500],[1990,9,1500],[1991,1,250],[1991,2,350],[1991,3,350],[1991,7,450]], columns = ['year','month','data1'])
year month data1
1990 7 1000
1990 8 2500
1990 9 2500
1990 9 1500
1991 1 250
1991 2 350
1991 3 350
1991 7 450
I would like to filter the data so that it excludes rows with month/year 07/1990, 08/1990 and 01/1991. I can do this for each month/year combination as follows:
df = df.loc[(df.year != 1990) | (df.month != 7)]
But this is not efficient if there are many month/year combinations. Is there a more efficient way of doing this?
Many thanks.
You could do:
mask = ~df[['year', 'month']].apply(tuple, 1).isin([(1990, 7), (1990, 8), (1991, 1)])
print(df[mask])
Output
year month data1
2 1990 9 2500
3 1990 9 1500
5 1991 2 350
6 1991 3 350
7 1991 7 450
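For reference, here is a self-contained sketch of the same tuple-based filter that can be run as-is. The helper name `drop_year_months` and the variable `excluded` are made up for illustration and are not part of the answer; the column names are the ones from the question.

```python
import pandas as pd

df = pd.DataFrame(
    [[1990, 7, 1000], [1990, 8, 2500], [1990, 9, 2500], [1990, 9, 1500],
     [1991, 1, 250], [1991, 2, 350], [1991, 3, 350], [1991, 7, 450]],
    columns=["year", "month", "data1"],
)

# (year, month) combinations to drop
excluded = [(1990, 7), (1990, 8), (1991, 1)]


def drop_year_months(frame, pairs):
    """Return the rows of `frame` whose (year, month) is not in `pairs`.

    Hypothetical helper: it wraps the one-liner from the answer.
    """
    # One tuple per row, e.g. (1990, 7), built row by row with apply.
    row_pairs = frame[["year", "month"]].apply(tuple, axis=1)
    return frame[~row_pairs.isin(pairs)]


filtered = drop_year_months(df, excluded)
print(filtered)
```

Because the excluded combinations live in a plain list, adding more of them does not require touching the filtering logic.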
Even faster (roughly 3x faster than @DaniMesejo's elegant version that applies tuple). However, it relies on knowing that months are bounded by (well below) 100, so it is less generalizable:
mask = ~(df.year*100 + df.month).isin({199007, 199008, 199101})
df[mask]
# out:
year month data1
2 1990 9 2500
3 1990 9 1500
5 1991 2 350
6 1991 3 350
7 1991 7 450
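The integer key works because `year*100 + month` maps each (year, month) pair to a distinct number as long as the month fits in two digits. A minimal sketch that builds the key set from a list of pairs and checks that assumption first (the names `excluded`, `excluded_keys` and `keys` are illustrative, not from the answer):

```python
import pandas as pd

df = pd.DataFrame(
    [[1990, 7, 1000], [1990, 8, 2500], [1990, 9, 2500], [1990, 9, 1500],
     [1991, 1, 250], [1991, 2, 350], [1991, 3, 350], [1991, 7, 450]],
    columns=["year", "month", "data1"],
)

# (year, month) combinations to drop
excluded = [(1990, 7), (1990, 8), (1991, 1)]

# The encoding is only unambiguous while every month lies in 0..99;
# otherwise two different pairs could produce the same key.
assert df["month"].between(0, 99).all()

# Encode the excluded pairs the same way as the rows: 1990, 7 -> 199007.
excluded_keys = {year * 100 + month for year, month in excluded}

# Vectorized integer key per row, then a single membership test.
keys = df["year"] * 100 + df["month"]
filtered = df[~keys.isin(excluded_keys)]
print(filtered)
```

If the second column could exceed 99 (for example a day-of-year), the multiplier would have to be raised accordingly, which is the generality trade-off mentioned above.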
How come this is 3x faster than the tuples solution? (Tricks for speed): it avoids apply, so the arithmetic stays vectorized, and it calls .isin() with a set as argument (not a list).
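One way to check the speed difference on your own machine is a small `timeit` comparison. This is only a sketch: the enlarged frame `big`, the repeat count, and the function names `tuple_mask` and `int_mask` are assumptions for illustration, and the measured ratio will depend on the pandas version and the data size.

```python
import timeit

import pandas as pd

small = pd.DataFrame(
    [[1990, 7, 1000], [1990, 8, 2500], [1990, 9, 2500], [1990, 9, 1500],
     [1991, 1, 250], [1991, 2, 350], [1991, 3, 350], [1991, 7, 450]],
    columns=["year", "month", "data1"],
)
# Repeat the sample rows to get a frame large enough to time (16,000 rows).
big = pd.concat([small] * 2000, ignore_index=True)

excluded = [(1990, 7), (1990, 8), (1991, 1)]
excluded_keys = {year * 100 + month for year, month in excluded}


def tuple_mask(frame):
    # Row-wise apply: builds one Python tuple per row.
    return ~frame[["year", "month"]].apply(tuple, axis=1).isin(excluded)


def int_mask(frame):
    # Vectorized arithmetic followed by one isin() call on integers.
    return ~(frame["year"] * 100 + frame["month"]).isin(excluded_keys)


# Both masks must select exactly the same rows before timing them.
assert tuple_mask(big).tolist() == int_mask(big).tolist()

t_tuple = timeit.timeit(lambda: tuple_mask(big), number=3)
t_int = timeit.timeit(lambda: int_mask(big), number=3)
print(f"tuple/apply: {t_tuple:.4f}s  integer key: {t_int:.4f}s")
```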