
How to select dataframe rows according to multi-(other column)-condition on columnar groups?

Copy the following dataframe to your clipboard:

  textId   score              textInfo
0  name1     1.0            text_stuff
1  name1     2.0  different_text_stuff
2  name1     2.0            text_stuff
3  name2     1.0  different_text_stuff
4  name2     1.3  different_text_stuff
5  name2     2.0  still_different_text
6  name2     1.0              yoko ono
7  name2     3.0     I lika da Gweneth
8  name3     1.0     Always a tradeoff
9  name3     3.0                What?!

Now use

import pandas as pd
df = pd.read_clipboard(sep=r'\s\s+')

to load it into your environment. How does one slice this dataframe such that all the rows of a particular textId are returned only if the score group of that textId includes at least one score equal to each of 1.0, 2.0 and 3.0? Here, the desired result would exclude the name1 rows, since that score group is missing a 3.0, and the name3 rows, since that score group is missing a 2.0:

  textId   score              textInfo
0  name2     1.0  different_text_stuff
1  name2     1.3  different_text_stuff
2  name2     2.0  still_different_text
3  name2     1.0              yoko ono
4  name2     3.0     I lika da Gweneth
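For anyone without clipboard access, here is a minimal sketch that builds the same sample frame by hand and lists which of the required scores each textId group lacks. The names `required` and `missing` are illustrative only and do not come from the question.

```python
import pandas as pd

# The same ten rows as the sample table, built without the clipboard.
df = pd.DataFrame({
    "textId": ["name1"] * 3 + ["name2"] * 5 + ["name3"] * 2,
    "score": [1.0, 2.0, 2.0, 1.0, 1.3, 2.0, 1.0, 3.0, 1.0, 3.0],
    "textInfo": [
        "text_stuff", "different_text_stuff", "text_stuff",
        "different_text_stuff", "different_text_stuff",
        "still_different_text", "yoko ono", "I lika da Gweneth",
        "Always a tradeoff", "What?!",
    ],
})

# Scores that every kept textId group must contain (illustrative name).
required = {1.0, 2.0, 3.0}

# For each textId, list the required scores that never occur in its group.
missing = df.groupby("textId")["score"].apply(lambda s: sorted(required - set(s)))
print(missing)
```

Only a group whose list comes back empty satisfies the condition; in this sample that should be name2 alone.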

Attempts

  1. df[(df.textId == "textIdRowName") & (df.score == 1.0) & (df.score == 2.0) & (df.score == 3.0)] isn't right, since the condition acts on individual rows rather than on the textId group (no single row can have all three scores at once). If this could be rewritten to match against textId groups, it could be placed in a for loop and fed the unique textId values. Such a function would collect the matching textId names in a series (say textIdThatMatchScore123) that could then be used to slice the original df like df[df.textId.isin(textIdThatMatchScore123)].
  2. Failing at groupby.
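The loop idea from the first attempt can be sketched roughly as follows. This is only an illustration of that approach, not a recommended solution; `required` and `text_ids_that_match` are made-up names (the latter standing in for textIdThatMatchScore123), and the frame is rebuilt by hand so the snippet runs on its own.

```python
import pandas as pd

# Sample frame from the question, rebuilt by hand.
df = pd.DataFrame({
    "textId": ["name1"] * 3 + ["name2"] * 5 + ["name3"] * 2,
    "score": [1.0, 2.0, 2.0, 1.0, 1.3, 2.0, 1.0, 3.0, 1.0, 3.0],
    "textInfo": [
        "text_stuff", "different_text_stuff", "text_stuff",
        "different_text_stuff", "different_text_stuff",
        "still_different_text", "yoko ono", "I lika da Gweneth",
        "Always a tradeoff", "What?!",
    ],
})

required = {1.0, 2.0, 3.0}  # scores each kept group must include

# Check each textId group as a whole, not row by row.
text_ids_that_match = []
for text_id in df["textId"].unique():
    group_scores = set(df.loc[df["textId"] == text_id, "score"])
    if required <= group_scores:  # subset test: every required score is present
        text_ids_that_match.append(text_id)

# Slice the original frame with the collected names.
result = df[df["textId"].isin(text_ids_that_match)]
print(result)
```

Note that the rows kept this way retain their original index labels rather than being renumbered from 0.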
zelusp asked Apr 13 '16 17:04


1 Answer

Here's one solution - group by textId, then keep only those groups where the set of unique score values is a superset (>=) of [1.0, 2.0, 3.0].

In [58]: df.groupby('textId').filter(lambda x: set(x['score']) >= set([1.,2.,3.]))
Out[58]: 
  textId  score              textInfo
3  name2    1.0  different_text_stuff
4  name2    1.3  different_text_stuff
5  name2    2.0  still_different_text
6  name2    1.0              yoko ono
7  name2    3.0     I lika da Gweneth
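As a possible alternative that avoids calling a Python lambda per group, the same selection can be sketched with isin and nunique. This is a different technique from the filter call above, offered under the assumption that the scores compare exactly equal to 1.0, 2.0 and 3.0; `required`, `hits` and `keep` are illustrative names.

```python
import pandas as pd

# Sample frame from the question, rebuilt by hand.
df = pd.DataFrame({
    "textId": ["name1"] * 3 + ["name2"] * 5 + ["name3"] * 2,
    "score": [1.0, 2.0, 2.0, 1.0, 1.3, 2.0, 1.0, 3.0, 1.0, 3.0],
    "textInfo": [
        "text_stuff", "different_text_stuff", "text_stuff",
        "different_text_stuff", "different_text_stuff",
        "still_different_text", "yoko ono", "I lika da Gweneth",
        "Always a tradeoff", "What?!",
    ],
})

required = [1.0, 2.0, 3.0]  # scores each kept group must include

# Keep only rows whose score is one of the required values, then count
# how many distinct required scores each textId group actually has.
hits = df[df["score"].isin(required)].groupby("textId")["score"].nunique()

# A group qualifies when it contains every required score.
keep = hits[hits == len(required)].index

result = df[df["textId"].isin(keep)]
print(result)
```

On larger frames this may be faster than filter with a lambda, though that depends on the data and has not been measured here.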
chrisb answered Sep 30 '22 08:09