Pandas isin with multiple columns

Tags: python, pandas

I want to select all rows in a dataframe whose values match entries defined in a list. I have tried two approaches, neither of which works as I want.

My dataframe looks something like this:

Timestamp | DEVICE | READING | VALUE
1 | DEV1 | READ1 | randomvalue
2 | DEV1 | READ2 | randomvalue
3 | DEV2 | READ1 | randomvalue
4 | DEV2 | READ2 | randomvalue
5 | DEV3 | READ1 | randomvalue

and I have a list (ls) like the following:

ls = [['DEV1', 'READ1'], ['DEV1', 'READ2'], ['DEV2', 'READ1']]

In this scenario I want to remove rows 4 and 5.

My first approach was:

df = df[(df['DEVICE'].isin([ls[i][0] for i in range(len(ls))])) &
        (df['READING'].isin([ls[k][1] for k in range(len(ls))]))]

The problem with this one is, obviously, that it does not remove row 4: DEV2 and READ2 each appear somewhere in the list, although the pair [DEV2, READ2] does not, so the row should be removed.
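To make the failure concrete, here is a minimal sketch that rebuilds the example frame (the 'randomvalue' strings are just the placeholders from the table above) and applies the two independent checks. Each column is tested on its own, so any combination of an allowed DEVICE with an allowed READING passes:

```python
import pandas as pd

# Example data from the question; VALUE holds placeholder strings.
df = pd.DataFrame({
    "Timestamp": [1, 2, 3, 4, 5],
    "DEVICE": ["DEV1", "DEV1", "DEV2", "DEV2", "DEV3"],
    "READING": ["READ1", "READ2", "READ1", "READ2", "READ1"],
    "VALUE": ["randomvalue"] * 5,
})
ls = [["DEV1", "READ1"], ["DEV1", "READ2"], ["DEV2", "READ1"]]

devices = [pair[0] for pair in ls]   # DEV1, DEV1, DEV2
readings = [pair[1] for pair in ls]  # READ1, READ2, READ1

# Each column is checked separately, so (DEV2, READ2) slips through.
mask = df["DEVICE"].isin(devices) & df["READING"].isin(readings)
kept_first = df.loc[mask, "Timestamp"].tolist()
print(kept_first)  # [1, 2, 3, 4]
```

Row 5 is dropped only because DEV3 is not in the list at all; row 4 stays because both of its values occur somewhere in ls.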

My second approach was:

df = df[(df[['DEVICE','READING']].isin({'DEVICE':  [ls[i][0] for i in range(len(ls))],
                                        'READING': [ls[i][1] for i in range(len(ls))] }))]

This one selects the correct cells, but it does not remove the other rows. Instead it sets every other cell to NaN, including the VALUE column, which I do want to keep. It also does not combine the two conditions, so row 4 looks like 4 | DEV2 | NaN | NaN.
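A rough sketch of what is going on here, using the same example data: DataFrame.isin with a dict returns a boolean DataFrame (one True/False per cell), and indexing with a two-dimensional boolean mask blanks out cells, much like DataFrame.where, instead of dropping rows. Collapsing the mask to one value per row with all(axis=1) does give a row filter, but the check is still per column, so row 4 survives:

```python
import pandas as pd

# Example data from the question; VALUE holds placeholder strings.
df = pd.DataFrame({
    "Timestamp": [1, 2, 3, 4, 5],
    "DEVICE": ["DEV1", "DEV1", "DEV2", "DEV2", "DEV3"],
    "READING": ["READ1", "READ2", "READ1", "READ2", "READ1"],
    "VALUE": ["randomvalue"] * 5,
})
ls = [["DEV1", "READ1"], ["DEV1", "READ2"], ["DEV2", "READ1"]]

# Cell-wise membership test: result has one boolean per cell.
mask_df = df[["DEVICE", "READING"]].isin({
    "DEVICE": [pair[0] for pair in ls],
    "READING": [pair[1] for pair in ls],
})
print(mask_df.shape)  # (5, 2)

# One boolean per row: True only where both cells matched.
row_mask = mask_df.all(axis=1)
kept_second = df.loc[row_mask, "Timestamp"].tolist()
print(kept_second)  # [1, 2, 3, 4]
```

So even with the NaN problem out of the way, the pairs have to be compared as pairs.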

What would be the easiest or best way to solve this problem? Can you help me?

~Fabian

asked Nov 09 '18 00:11 by PythonF



2 Answers

You can convert the list to a list of tuples, then convert the required columns of the dataframe to one tuple per row and use isin:

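# Pairs of (DEVICE, READING) to keep
l = [['DEV1', 'READ1'], ['DEV1', 'READ2'], ['DEV2', 'READ1']]
l = [tuple(i) for i in l]
# Build one tuple per row from the two columns, then test membership
df[df[['DEVICE', 'READING']].apply(tuple, axis=1).isin(l)]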

You get

    Timestamp   DEVICE  READING VALUE
0   1   DEV1    READ1   randomvalue
1   2   DEV1    READ2   randomvalue
2   3   DEV2    READ1   randomvalue
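
For anyone who wants to run this end to end, here is a self-contained sketch of the same idea. The frame is rebuilt from the question's table (with its placeholder 'randomvalue' strings), and the names wanted and filtered are only illustrative:

```python
import pandas as pd

# Example data from the question; VALUE holds placeholder strings.
df = pd.DataFrame({
    "Timestamp": [1, 2, 3, 4, 5],
    "DEVICE": ["DEV1", "DEV1", "DEV2", "DEV2", "DEV3"],
    "READING": ["READ1", "READ2", "READ1", "READ2", "READ1"],
    "VALUE": ["randomvalue"] * 5,
})

# Allowed (DEVICE, READING) pairs, as tuples so they compare as units.
wanted = [("DEV1", "READ1"), ("DEV1", "READ2"), ("DEV2", "READ1")]

# One tuple per row, then a single membership test on the pair.
row_pairs = df[["DEVICE", "READING"]].apply(tuple, axis=1)
filtered = df[row_pairs.isin(wanted)]
print(filtered)
```

Because the mask is a plain boolean Series, the original column order and index are kept.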
answered Oct 11 '22 00:10 by Vaishali


You can use a multi-index to solve this problem.

values = [['DEV1', 'READ1'], ['DEV1', 'READ2'], ['DEV2', 'READ1']]
# DataFrame.loc requires tuples for multi-index lookups
index_values = [tuple(v) for v in values]

filtered = df.set_index(['DEVICE', 'READING']).loc[index_values].reset_index()
print(filtered)

  DEVICE READING  Timestamp        VALUE
0   DEV1   READ1          1  randomvalue
1   DEV1   READ2          2  randomvalue
2   DEV2   READ1          3  randomvalue  
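
A possible variation on the same multi-index idea, sketched here as an assumption rather than as part of the answer above: instead of looking the pairs up with .loc, build the multi-index only to compute a boolean mask with its isin method. That leaves the column order and the original row index of df untouched. The frame below is rebuilt from the question's table:

```python
import pandas as pd

# Example data from the question; VALUE holds placeholder strings.
df = pd.DataFrame({
    "Timestamp": [1, 2, 3, 4, 5],
    "DEVICE": ["DEV1", "DEV1", "DEV2", "DEV2", "DEV3"],
    "READING": ["READ1", "READ2", "READ1", "READ2", "READ1"],
    "VALUE": ["randomvalue"] * 5,
})

values = [["DEV1", "READ1"], ["DEV1", "READ2"], ["DEV2", "READ1"]]
index_values = [tuple(v) for v in values]

# Use the (DEVICE, READING) multi-index only to test membership.
pair_index = df.set_index(["DEVICE", "READING"]).index
mask = pair_index.isin(index_values)

filtered = df[mask]
print(filtered)
```

Pairs in the list that never occur in the data simply match nothing here.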
answered Oct 10 '22 22:10 by Matthias Ossadnik