I want to select all rows in a dataframe whose values match entries defined in a list. I've tried two approaches, and neither works as expected.
My dataframe looks something like this:
Timestamp | DEVICE | READING | VALUE
1 | DEV1 | READ1 | randomvalue
2 | DEV1 | READ2 | randomvalue
3 | DEV2 | READ1 | randomvalue
4 | DEV2 | READ2 | randomvalue
5 | DEV3 | READ1 | randomvalue
and I have a list (ls) like this:
[['DEV1', 'READ1'], ['DEV1', 'READ2'], ['DEV2', 'READ1']]
In this scenario I want to remove rows 4 and 5.
My first approach was:
df = df[(df['DEVICE'].isin([ls[i][0] for i in range(len(ls))])) &
        (df['READING'].isin([ls[k][1] for k in range(len(ls))]))]
The problem with this one is, obviously, that it does not remove row 4: DEV2 and READ2 each appear somewhere in the list, so the row passes both checks even though the pair [DEV2, READ2] is not in the list.
My second approach was:
df = df[(df[['DEVICE','READING']].isin({'DEVICE':  [ls[i][0] for i in range(len(ls))],
                                        'READING': [ls[i][1] for i in range(len(ls))] }))]
This one selects the correct rows, but it does not remove the other rows. Instead it sets every other cell to NaN, including the VALUE column, which I do want to keep. It also does not combine both conditions, so row 4 looks like 4 | DEV2 | NaN | NaN.
What would be the easiest or best way to solve this problem? Can you help me?
~Fabian
A MultiIndex, also referred to as a hierarchical or multi-level index, lets you build an index on multiple columns in pandas and work with data of an arbitrary number of dimensions.
The isin() function is used to filter DataFrame rows by a list of values. When it is called on a Series, it returns a boolean Series indicating whether each element is in the given values: True when present, False when not. You can use this boolean Series to index the DataFrame and filter its rows.
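As a minimal sketch of the mask behaviour described above, assuming the sample data from the question (the DataFrame is rebuilt here and the VALUE entries are placeholders), Series.isin yields one boolean per row. The same example shows why checking each column on its own is not enough when you need to match pairs:

```python
import pandas as pd

# Sample data rebuilt from the question; VALUE entries are placeholders.
df = pd.DataFrame({
    "Timestamp": [1, 2, 3, 4, 5],
    "DEVICE": ["DEV1", "DEV1", "DEV2", "DEV2", "DEV3"],
    "READING": ["READ1", "READ2", "READ1", "READ2", "READ1"],
    "VALUE": ["randomvalue"] * 5,
})

# One boolean per row: True where DEVICE is in the given list.
device_mask = df["DEVICE"].isin(["DEV1", "DEV2"])
print(device_mask.tolist())  # [True, True, True, True, False]

# Checking each column independently ignores which values belong together,
# so the row (DEV2, READ2) passes both checks and is kept.
reading_mask = df["READING"].isin(["READ1", "READ2"])
per_column = df[device_mask & reading_mask]
print(per_column["Timestamp"].tolist())  # [1, 2, 3, 4]
```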
You can convert the list to a list of tuples, then convert the required DataFrame columns to tuples row by row and use isin:
l = [['DEV1', 'READ1'], ['DEV1', 'READ2'], ['DEV2','READ1']]
l = [tuple(i) for i in l]
df[df[['DEVICE', 'READING']].apply(tuple, axis = 1).isin(l)]
You get
   Timestamp DEVICE READING        VALUE
0          1   DEV1   READ1  randomvalue
1          2   DEV1   READ2  randomvalue
2          3   DEV2   READ1  randomvalue
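If the row-wise apply turns out to be slow on a large frame, one possible alternative (not part of the answer above, just a sketch) is an inner merge against a small DataFrame built from the list. The names `pairs` and `wanted` are made up for illustration, and the sample data is rebuilt from the question:

```python
import pandas as pd

# Sample data rebuilt from the question; VALUE entries are placeholders.
df = pd.DataFrame({
    "Timestamp": [1, 2, 3, 4, 5],
    "DEVICE": ["DEV1", "DEV1", "DEV2", "DEV2", "DEV3"],
    "READING": ["READ1", "READ2", "READ1", "READ2", "READ1"],
    "VALUE": ["randomvalue"] * 5,
})

pairs = [["DEV1", "READ1"], ["DEV1", "READ2"], ["DEV2", "READ1"]]

# Turn the allowed pairs into a two-column frame; drop_duplicates guards
# against repeated pairs, which would otherwise duplicate matching rows.
wanted = pd.DataFrame(pairs, columns=["DEVICE", "READING"]).drop_duplicates()

# An inner merge keeps only rows whose (DEVICE, READING) pair is in `wanted`.
filtered = df.merge(wanted, on=["DEVICE", "READING"], how="inner")
print(filtered["Timestamp"].tolist())  # [1, 2, 3]
```

Note that the merge returns a fresh 0..n-1 index, so use one of the isin-based approaches if you need to keep the original row labels.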
You can use a multi-index to solve this problem.
values = [['DEV1', 'READ1'], ['DEV1', 'READ2'], ['DEV2', 'READ1']]
# DataFrame.loc requires tuples for multi-index lookups
index_values = [tuple(v) for v in values]
filtered = df.set_index(['DEVICE', 'READING']).loc[index_values].reset_index()
print(filtered)
  DEVICE READING  Timestamp        VALUE
0   DEV1   READ1          1  randomvalue
1   DEV1   READ2          2  randomvalue
2   DEV2   READ1          3  randomvalue  
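A variation on the same idea, sketched here under the assumption that you would rather keep the original column order and row index: build a MultiIndex from the two columns without setting it on the frame, and use MultiIndex.isin to get a boolean mask. The names `pairs`, `keys` and `mask` are illustrative, and the sample data is rebuilt from the question:

```python
import pandas as pd

# Sample data rebuilt from the question; VALUE entries are placeholders.
df = pd.DataFrame({
    "Timestamp": [1, 2, 3, 4, 5],
    "DEVICE": ["DEV1", "DEV1", "DEV2", "DEV2", "DEV3"],
    "READING": ["READ1", "READ2", "READ1", "READ2", "READ1"],
    "VALUE": ["randomvalue"] * 5,
})

pairs = [("DEV1", "READ1"), ("DEV1", "READ2"), ("DEV2", "READ1")]

# Build a MultiIndex from the two key columns; df itself is left unchanged.
keys = pd.MultiIndex.from_frame(df[["DEVICE", "READING"]])

# MultiIndex.isin compares whole (DEVICE, READING) tuples, not single columns.
mask = keys.isin(pairs)
print(mask.tolist())  # [True, True, True, False, False]

filtered = df[mask]
print(filtered["Timestamp"].tolist())  # [1, 2, 3]
```

Because this only produces a mask, no reset_index is needed afterwards.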