 

python pandas dataframe : removing selected rows

Tags: python, pandas

I have a pandas DataFrame that looks something like this:

import pandas as pd

df = pd.read_csv('fruit.csv')

print(df)

   fruitname  quant
0      apple     10
1      apple     11
2      apple     13
3     banana     10
4     banana     20
5     banana     30
6     banana     40
7       pear     10
8       pear    102
9       pear   1033
10      pear   1012
11      pear    101
12      pear    100
13      pear   1044
14    orange     10

I want to remove the last entry PER FRUIT if that fruit has an odd (uneven) number of entries (count % 2 == 1), without looping through the DataFrame. For the data above, that means:

-- remove the last apple, since apple occurs 3 times
-- remove the last pear
-- remove the last (only) orange

resulting in:

   fruitname  quant
0      apple     10
1      apple     11
2     banana     10
3     banana     20
4     banana     30
5     banana     40
6       pear     10
7       pear    102
8       pear   1033
9       pear   1012
10      pear    101
11      pear    100

Is this possible? Or do I have to loop through the DF? I've been googling for 4 days, and just can't figure out how to do this.

W Kruger asked Oct 15 '15 14:10



1 Answer

Count the rows per fruit with value_counts, then keep only the fruits whose count is odd. Taking the counts modulo 2 with the % operator yields 1 for an odd count and 0 for an even one; casting that result with astype(bool) turns it into a boolean mask.

Apply the boolean mask to the index of the value_counts result to get the names of the fruits with an odd count.

With that list of fruits, iterate over each fruit: filter df down to that fruit's rows, take the last row with iloc[-1], read its index label from the .name attribute, and append the label to a list.

Finally, drop the labels in that list:

In [393]:
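# names of the fruits that occur an odd number of times
fruits = df['fruitname'].value_counts().index[(df['fruitname'].value_counts() % 2).astype(bool)]
idx = []
for fruit in fruits:
    # index label of the last row for this fruit
    idx.append(df[df['fruitname']==fruit].iloc[-1].name)
# drop returns a new DataFrame; df itself is left unchanged
df.drop(idx)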

Out[393]:
   fruitname  quant
0      apple     10
1      apple     11
3     banana     10
4     banana     20
5     banana     30
6     banana     40
7       pear     10
8       pear    102
9       pear   1033
10      pear   1012
11      pear    101
12      pear    100
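
Note that the output above keeps the original index labels (label 2 is missing, and the rows run up to 12), whereas the expected result in the question is numbered 0 to 11. A minimal sketch of how the two could be reconciled, assuming the data from the question is rebuilt inline instead of read from fruit.csv, and using the illustrative names counts, odd_fruits and trimmed that do not appear in the answer:

```python
import pandas as pd

# Rebuild the example data from the question (stands in for fruit.csv).
df = pd.DataFrame({
    "fruitname": ["apple"] * 3 + ["banana"] * 4 + ["pear"] * 7 + ["orange"],
    "quant": [10, 11, 13, 10, 20, 30, 40,
              10, 102, 1033, 1012, 101, 100, 1044, 10],
})

# Compute the counts once and keep the fruits with an odd count.
counts = df["fruitname"].value_counts()
odd_fruits = counts[counts % 2 == 1].index

# Last index label of each odd-count fruit, as in the answer's loop.
idx = [df[df["fruitname"] == fruit].index[-1] for fruit in odd_fruits]

# drop() returns a new frame; reset_index(drop=True) renumbers it from 0.
trimmed = df.drop(idx).reset_index(drop=True)
print(trimmed)
```

Assigning the result back (here to trimmed) matters because drop does not modify df in place by default.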

Breaking the above down:

In [394]:
df['fruitname'].value_counts()

Out[394]:
pear      7
banana    4
apple     3
orange    1
Name: fruitname, dtype: int64

In [398]:   
df['fruitname'].value_counts() % 2

Out[398]:
pear      1
banana    0
apple     1
orange    1
Name: fruitname, dtype: int64

In [399]:
fruits = df['fruitname'].value_counts().index[(df['fruitname'].value_counts() % 2).astype(bool)]
fruits

Out[399]:
Index(['pear', 'apple', 'orange'], dtype='object')

In [401]:    
for fruit in fruits:
    print(df[df['fruitname']==fruit].iloc[-1].name)

13
2
14

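You can also use last_valid_index instead of iloc[-1].name. It returns the label of the last row that is not entirely NaN, which here is the last row for each fruit, so the following works as well: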

fruits = df['fruitname'].value_counts().index[(df['fruitname'].value_counts() % 2).astype(bool)]
idx = []
for fruit in fruits:
    idx.append(df[df['fruitname']==fruit].last_valid_index())
df.drop(idx)
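
The question asks for a way to avoid looping, and both versions above still loop over the odd-count fruits. One possible loop-free variant, offered as a sketch and not as part of the original answer, uses groupby with transform('size') and cumcount to mark the last row of every odd-sized group. The names by_fruit, group_size, position, drop_mask and result are illustrative, and the data is rebuilt inline in place of fruit.csv:

```python
import pandas as pd

# Rebuild the example data from the question (stands in for fruit.csv).
df = pd.DataFrame({
    "fruitname": ["apple"] * 3 + ["banana"] * 4 + ["pear"] * 7 + ["orange"],
    "quant": [10, 11, 13, 10, 20, 30, 40,
              10, 102, 1033, 1012, 101, 100, 1044, 10],
})

by_fruit = df.groupby("fruitname")

# Size of the group each row belongs to, aligned with df's index.
group_size = by_fruit["fruitname"].transform("size")

# 0-based position of each row within its group.
position = by_fruit.cumcount()

# True for the last row of every group whose size is odd.
drop_mask = (group_size % 2 == 1) & (position == group_size - 1)

# Keep everything else; original index labels are preserved.
result = df[~drop_mask]
print(result)
```

This does not require each fruit's rows to be contiguous, since cumcount numbers rows within their group in order of appearance. Chaining reset_index(drop=True) onto result would renumber the rows from 0, as in the question's expected output.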
EdChum answered Oct 29 '22 11:10
