I am using pandas to check whether one DataFrame is contained within another. The method .isin() is only helpful (e.g., returns True) when the labels match (ref: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html), but I want to go further and include cases where the labels don't match.
Example: df1:
+----+----+----+----+----+
| 3 | 4 | 5 | 6 | 7 |
+----+----+----+----+----+
| 11 | 13 | 10 | 15 | 12 |
+----+----+----+----+----+
| 8 | 2 | 9 | 0 | 1 |
+----+----+----+----+----+
| 14 | 23 | 31 | 21 | 19 |
+----+----+----+----+----+
df2:
+----+----+
| 13 | 10 |
+----+----+
| 2 | 9 |
+----+----+
I want the output to be True since df2 is inside df1.
Any ideas how to do that using pandas?
You can use NumPy's sliding_window_view:
from numpy.lib.stride_tricks import sliding_window_view as swv
(swv(df1, df2.shape)==df2.to_numpy()).all((-2, -1)).any()
Output: True
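For reference, here is a self-contained sketch of the same idea using the frames from the question. The helper name contains_block and the early size check are my own additions for illustration, not part of the one-liner above; the check assumes that a df2 larger than df1 should simply give False rather than an error.

```python
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def contains_block(big: pd.DataFrame, small: pd.DataFrame) -> bool:
    """Return True if the values of `small` form a contiguous block of `big`.

    Hypothetical helper: labels (index/columns) are ignored, only values count.
    """
    big_arr = big.to_numpy()
    small_arr = small.to_numpy()
    # A block that is larger than the frame in any dimension cannot fit;
    # sliding_window_view would raise ValueError in that case.
    if any(s > b for s, b in zip(small_arr.shape, big_arr.shape)):
        return False
    # One window of shape small_arr.shape per possible top-left position.
    windows = sliding_window_view(big_arr, small_arr.shape)
    # A window matches when all of its cells are equal to small_arr.
    return bool((windows == small_arr).all(axis=(-2, -1)).any())


df1 = pd.DataFrame([[3, 4, 5, 6, 7],
                    [11, 13, 10, 15, 12],
                    [8, 2, 9, 0, 1],
                    [14, 23, 31, 21, 19]])
df2 = pd.DataFrame([[13, 10], [2, 9]])

print(contains_block(df1, df2))  # True
```

Note that sliding_window_view requires NumPy 1.20 or later.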
Intermediate:
(swv(df1, df2.shape)==df2.to_numpy()).all((-2, -1))
array([[False, False, False, False],
       [False,  True, False, False],  # df2 is found at row 1, column 1 (0-based)
       [False, False, False, False]])
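If you also need to know where the match is, one option is to pass that boolean array to np.argwhere. This is only a sketch: the names mask and positions are mine, and the offsets returned are positional (0-based row and column of the window's top-left cell), not index or column labels.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df1 = pd.DataFrame([[3, 4, 5, 6, 7],
                    [11, 13, 10, 15, 12],
                    [8, 2, 9, 0, 1],
                    [14, 23, 31, 21, 19]])
df2 = pd.DataFrame([[13, 10], [2, 9]])

# mask[i, j] is True when the window whose top-left cell is (i, j) equals df2.
windows = sliding_window_view(df1.to_numpy(), df2.shape)
mask = (windows == df2.to_numpy()).all(axis=(-2, -1))

# Each row of positions is the (row, column) offset of one matching window.
positions = np.argwhere(mask)
print(positions.tolist())  # [[1, 1]]
```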
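If a partial match is enough, threshold the fraction or the number of matching cells instead.
Example 1: at least 75% of the cells match: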
from numpy.lib.stride_tricks import sliding_window_view as swv
((swv(df1, df2.shape)==df2.to_numpy()).mean((-2, -1))>=0.75).any()
Example 2: at least 3 matching cells:
from numpy.lib.stride_tricks import sliding_window_view as swv
((swv(df1, df2.shape)==df2.to_numpy()).sum((-2, -1))>=3).any()
Alternative input (only three of the four cells match a block of df1):
df2 = pd.DataFrame([[13, 10], [2, 8]])
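To see how the exact and partial checks differ on this alternative input, here is a small runnable sketch. The variable names (df2_alt, match, exact, frac_ok, count_ok) are mine and only for illustration.

```python
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df1 = pd.DataFrame([[3, 4, 5, 6, 7],
                    [11, 13, 10, 15, 12],
                    [8, 2, 9, 0, 1],
                    [14, 23, 31, 21, 19]])
df2_alt = pd.DataFrame([[13, 10], [2, 8]])  # bottom-right cell differs (8 vs 9)

# Element-wise comparison of every 2x2 window of df1 against df2_alt.
match = sliding_window_view(df1.to_numpy(), df2_alt.shape) == df2_alt.to_numpy()

exact = bool(match.all(axis=(-2, -1)).any())               # every cell must match
frac_ok = bool((match.mean(axis=(-2, -1)) >= 0.75).any())  # at least 75% of cells
count_ok = bool((match.sum(axis=(-2, -1)) >= 3).any())     # at least 3 cells

print(exact, frac_ok, count_ok)  # False True True
```

For a 2x2 block the two thresholds are equivalent (3 of 4 cells is exactly 75%); they differ once df2 has another shape.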