I want to select all rows in a DataFrame whose values match entries defined in a list. I have tried two approaches, but neither works as I want.
My dataframe looks something like this:
Timestamp | DEVICE | READING | VALUE
1 | DEV1 | READ1 | randomvalue
2 | DEV1 | READ2 | randomvalue
3 | DEV2 | READ1 | randomvalue
4 | DEV2 | READ2 | randomvalue
5 | DEV3 | READ1 | randomvalue
and I have a list (ls) like this:
[['DEV1', 'READ1'], ['DEV1', 'READ2'], ['DEV2', 'READ1']]
In this scenario I want to remove lines 4 and 5.
My first approach was:
df = df[(df['DEVICE'].isin([ls[i][0] for i in range(len(ls))])) &
        (df['READING'].isin([ls[k][1] for k in range(len(ls))]))]
The problem with this one is that it does not remove line 4 (DEV2 with READ2), although it should: DEV2 and READ2 each appear somewhere in the list, just not together as a pair.
My second approach was:
df = df[(df[['DEVICE','READING']].isin({'DEVICE': [ls[i][0] for i in range(len(ls))],
'READING': [ls[i][1] for i in range(len(ls))] }))]
This one selects the correct rows, but it does not remove the other rows. Instead it sets every other cell to NaN, including the VALUE column, which I do want to keep. It also does not combine both conditions, so row 4 looks like 4 | DEV2 | NaN | NaN.
What would be the easiest or best way to solve this problem? Can you help me?
~Fabian
A MultiIndex, also referred to as a hierarchical or multi-level index (part of advanced indexing in pandas), lets us create an index on multiple columns and store data with an arbitrary number of dimensions.
The isin() function is used to filter DataFrame rows against a list of values. When it is called on a Series, it returns a Series of booleans indicating whether each element is in values: True when present, False when not. You can use this Series to index the DataFrame and filter its rows.
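As a minimal sketch of how isin() produces a boolean mask on a single column, here is a small example. The sample DataFrame and its numeric VALUE entries are made up for illustration and only mirror the layout from the question.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's layout.
df = pd.DataFrame({
    "Timestamp": [1, 2, 3, 4, 5],
    "DEVICE": ["DEV1", "DEV1", "DEV2", "DEV2", "DEV3"],
    "READING": ["READ1", "READ2", "READ1", "READ2", "READ1"],
    "VALUE": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Series.isin returns one boolean per row.
mask = df["DEVICE"].isin(["DEV1", "DEV2"])

# Indexing the DataFrame with the boolean Series keeps only the True rows.
subset = df[mask]

print(mask.tolist())
print(subset["Timestamp"].tolist())
```

Note that this checks one column at a time, which is why combining two such masks cannot express a constraint on (DEVICE, READING) pairs.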
You can convert the list to a list of tuples, convert the required DataFrame columns to tuples as well, and use isin:
l = [['DEV1', 'READ1'], ['DEV1', 'READ2'], ['DEV2','READ1']]
l = [tuple(i) for i in l]
df[df[['DEVICE', 'READING']].apply(tuple, axis = 1).isin(l)]
You get
Timestamp DEVICE READING VALUE
0 1 DEV1 READ1 randomvalue
1 2 DEV1 READ2 randomvalue
2 3 DEV2 READ1 randomvalue
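To try this end to end, the following self-contained sketch rebuilds the question's data (with made-up numeric VALUE entries) and compares the column-wise filter from the question with the pairwise tuple filter. The variable names pairs, columnwise and pairwise are only illustrative.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's layout.
df = pd.DataFrame({
    "Timestamp": [1, 2, 3, 4, 5],
    "DEVICE": ["DEV1", "DEV1", "DEV2", "DEV2", "DEV3"],
    "READING": ["READ1", "READ2", "READ1", "READ2", "READ1"],
    "VALUE": [0.1, 0.2, 0.3, 0.4, 0.5],
})

pairs = [("DEV1", "READ1"), ("DEV1", "READ2"), ("DEV2", "READ1")]

# Column-wise check (the question's first approach): each column is
# tested independently, so (DEV2, READ2) slips through.
columnwise = df[
    df["DEVICE"].isin([d for d, _ in pairs])
    & df["READING"].isin([r for _, r in pairs])
]

# Pairwise check: build one tuple per row and test the whole pair.
row_pairs = df[["DEVICE", "READING"]].apply(tuple, axis=1)
pairwise = df[row_pairs.isin(pairs)]

print(columnwise["Timestamp"].tolist())
print(pairwise["Timestamp"].tolist())
```

The pairwise version keeps all original columns, including VALUE, and drops rows 4 and 5 as requested.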
You can use a multi-index to solve this problem.
values = [['DEV1', 'READ1'], ['DEV1', 'READ2'], ['DEV2', 'READ1']]
# DataFrame.loc requires tuples for multi-index lookups
index_values = [tuple(v) for v in values]
filtered = df.set_index(['DEVICE', 'READING']).loc[index_values].reset_index()
print(filtered)
DEVICE READING Timestamp VALUE
0 DEV1 READ1 1 randomvalue
1 DEV1 READ2 2 randomvalue
2 DEV2 READ1 3 randomvalue
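One caveat, as far as I know: in recent pandas versions, a .loc lookup with a list of labels fails with a KeyError when one of the requested pairs is not present in the index. A possible variant, sketched here with made-up sample data and an extra pair (DEV9, READ9) that does not occur, is to build a boolean mask with isin on the MultiIndex instead:

```python
import pandas as pd

# Hypothetical sample data mirroring the question's layout.
df = pd.DataFrame({
    "Timestamp": [1, 2, 3, 4, 5],
    "DEVICE": ["DEV1", "DEV1", "DEV2", "DEV2", "DEV3"],
    "READING": ["READ1", "READ2", "READ1", "READ2", "READ1"],
    "VALUE": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# The last pair is deliberately absent from the data.
pairs = [("DEV1", "READ1"), ("DEV1", "READ2"),
         ("DEV2", "READ1"), ("DEV9", "READ9")]

indexed = df.set_index(["DEVICE", "READING"])

# MultiIndex.isin takes an iterable of tuples and returns a boolean
# array; pairs that never occur simply match no rows.
mask = indexed.index.isin(pairs)

filtered = indexed[mask].reset_index()
print(filtered)
```

As with the .loc version, reset_index() moves DEVICE and READING to the front of the column order.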