Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to select range of rows in Pandas?

I have created a dataframe that has a bunch of characteristics. I want to create a new column that selects all rows between two specific rows (which are going to be inputs).

Say that the dataframe is the following:

data = {'currency': ['Euro', 'Euro', 'Euro', 'Dollar', 'Dollar', 'Yen',
                     'Yen', 'Yen', 'Pound', 'Pound', 'Pound, 'Pesos',
                     'Pesos'], 
    'cost': [34, 67, 32, 29, 48, 123, 23, 45, 78, 86, 23, 45, 67]}
df = pd.DataFrame(data, columns = ['currency', 'cost'])
df

df table:

enter image description here

I want to add a new column that assigns a 1 when a condition is met. In my case, the condition is all rows between two specific currencies. For example, say that I want all the currencies between 'Dollar' and 'Pound'. My guess is I have to create a mask and use it as a conditional, that will say select all rows between the first 'Dollar' row and the last 'Pound' row (i.e. rows 3-10).

I have problems creating that mask though, as the currencies are selected alphabetically:

mask = (df['currency'] >= 'Dollar') & (df['currency'] <= 'Pound')

The above creates a new column with T in all currencies, except 'Yen'. I can see why the above is failing, but cannot think of a way of doing what I want.

Note: The same names of the currencies will come in groups, e.g. 'Pounds' cannot be in rows 4-5 and then 8-10.

like image 904
Dsto4 Avatar asked Apr 17 '18 06:04

Dsto4


2 Answers

General solution working for duplicated index too:

a = df['currency'].eq('Dollar').cumsum()
b = df['currency'].eq('Pound').iloc[::-1].cumsum()
df['new'] = a.mul(b).clip_upper(1)

Alternative working for unique index:

a = df['currency'].eq('Dollar').idxmax()
b = df['currency'].eq('Pound').iloc[::-1].idxmax()
df['new'] = 0
df.loc[a:b, 'new'] = 1

print (df)
   currency  cost  new
0      Euro    34    0
1      Euro    67    0
2      Euro    32    0
3    Dollar    29    1
4    Dollar    48    1
5       Yen   123    1
6       Yen    23    1
7       Yen    45    1
8     Pound    78    1
9     Pound    86    1
10    Pound    23    1
11    Pesos    45    0
12    Pesos    67    0

Explanation:

  1. Compare by Series.eq what is same as == first
  2. Get cumsum
  3. For second condition reverse mask by [::-1]
  4. Multiple together by mul and replace non 0 to 1 by clip_upper

Second solution use idxmax for first index value and set 1 by loc

like image 118
jezrael Avatar answered Oct 15 '22 23:10

jezrael


Using Numpy's accumulation on logical or

cumor = np.logical_or.accumulate

c = df.currency.values
d = c == 'Dollar'
p = c == 'Pound'

df.assign(new=(cumor(d) & cumor(p[::-1])[::-1]).astype(np.uint))

   currency  cost  new
0      Euro    34    0
1      Euro    67    0
2      Euro    32    0
3    Dollar    29    1
4    Dollar    48    1
5       Yen   123    1
6       Yen    23    1
7       Yen    45    1
8     Pound    78    1
9     Pound    86    1
10    Pound    23    1
11    Pesos    45    0
12    Pesos    67    0
like image 20
piRSquared Avatar answered Oct 15 '22 21:10

piRSquared