
Setting value to a copy of a slice of a DataFrame

I am setting up the following example which is similar to my situation and data:

Say, I have the following DataFrame:

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'price': [25, 30, 34, 40],
                   'Category': ['small', 'medium', 'medium', 'small']})


  Category  ID  price
0    small   1     25
1   medium   2     30
2   medium   3     34
3    small   4     40

Now, I have the following function, which returns the discount amount based on the item's category (1% for small items, 2% otherwise):

def mapper(price, category):
    if category == 'small':
        discount = 0.01 * price
    else:
        discount = 0.02 * price
    return discount

Now I want the resulting DataFrame:

  Category  ID  price  Discount
0    small   1     25      0.25
1   medium   2     30      0.60
2   medium   3     34      0.68
3    small   4     40      0.40

So I decided to call Series.map on the price column, because I don't want to use apply: I am working with a large DataFrame, and map is much faster than apply.

I tried doing this:

for c in list(df.Category.unique()):
    df[df['Category'] == c]['Discount'] = df[df['Category'] == c]['price'].map(lambda x: mapper(x, c))

That didn't work as I expected: pandas warns that a value is trying to be set on a copy of a slice from the DataFrame, because the chained indexing assigns into a temporary copy, so the assignment never reaches the original DataFrame.

My question is: is there a way to do this without using df.apply()?

Asked Mar 20 '17 by Rakesh Adhikesavan


1 Answer

One approach with np.where -

mask = df.Category.values == 'small'
df['Discount'] = np.where(mask, df.price * 0.01, df.price * 0.02)
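Run end-to-end on the question's data, the np.where approach looks like this (the DataFrame is re-created here so the snippet stands alone):

```python
import numpy as np
import pandas as pd

# Re-create the question's DataFrame so the sketch is self-contained.
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'price': [25, 30, 34, 40],
                   'Category': ['small', 'medium', 'medium', 'small']})

# Build the boolean mask once over the raw array, then pick between
# the two vectorized discount computations element-wise.
mask = df.Category.values == 'small'
df['Discount'] = np.where(mask, df.price * 0.01, df.price * 0.02)
```

This assigns the whole column in one shot, so no slice copy is ever written to and no warning is raised.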

Another way to put things a bit differently -

df['Discount'] = df.price*0.01
# note: chained indexing like df['Discount'][mask] triggers the same
# SettingWithCopyWarning; df.loc[mask, 'Discount'] *= 2 is the safer form
df['Discount'][df.Category.values!='small'] *= 2

For performance, you might want to work with the underlying array data, so df.price.values could be used wherever df.price appears.
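If the original mapper function must be kept, a plain list comprehension over the two columns also avoids both df.apply and chained assignment. A minimal self-contained sketch (re-creating the question's data, with the 1%/2% rates from the desired output):

```python
import pandas as pd

# Re-create the question's DataFrame for a runnable example.
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'price': [25, 30, 34, 40],
                   'Category': ['small', 'medium', 'medium', 'small']})

def mapper(price, category):
    # 1% discount for small items, 2% otherwise.
    return 0.01 * price if category == 'small' else 0.02 * price

# Row-wise mapping without apply: zip the two columns and build the
# new column in a single pass, then assign it whole.
df['Discount'] = [mapper(p, c) for p, c in zip(df['price'], df['Category'])]
```

This is slower than the pure-NumPy versions but keeps arbitrary Python logic in mapper intact.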

Benchmarking

Approaches -

def app1(df): # Proposed app#1 here
    mask = df.Category.values=='small'
    df_price = df.price.values
    df['Discount'] = np.where(mask,df_price*0.01, df_price*0.02)
    return df

def app2(df): # Proposed app#2 here
    df['Discount'] = df.price.values*0.01
    df['Discount'][df.Category.values!='small'] *= 2
    return df

def app3(df): # @piRSquared's soln (assign returns a new frame, so return it)
    return df.assign(
        Discount=((1 - (df.Category.values == 'small')) + 1) / 100 * df.price.values)

def app4(df): # @MaxU's soln
    return df.assign(Discount=df.price * df.Category.map({'small':0.01}).fillna(0.02))
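For reference, @MaxU's map/fillna approach can be run end-to-end like this (a self-contained sketch on the question's data):

```python
import pandas as pd

# Re-create the question's DataFrame so the sketch runs standalone.
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'price': [25, 30, 34, 40],
                   'Category': ['small', 'medium', 'medium', 'small']})

# Series.map with a partial dict yields NaN for unmatched categories;
# fillna supplies the default 2% rate, and assign returns a new frame.
rate = df.Category.map({'small': 0.01}).fillna(0.02)
df = df.assign(Discount=df.price * rate)
```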

Timings -

1) Large dataset:

In [122]: df
Out[122]: 
  Category  ID  price  Discount
0    small   1     25      0.25
1   medium   2     30      0.60
2   medium   3     34      0.68
3    small   4     40      0.40

In [123]: df1 = pd.concat([df]*1000,axis=0)
     ...: df2 = pd.concat([df]*1000,axis=0)
     ...: df3 = pd.concat([df]*1000,axis=0)
     ...: df4 = pd.concat([df]*1000,axis=0)
     ...: 

In [124]: %timeit app1(df1)
     ...: %timeit app2(df2)
     ...: %timeit app3(df3)
     ...: %timeit app4(df4)
     ...: 
1000 loops, best of 3: 209 µs per loop
10 loops, best of 3: 63.2 ms per loop
1000 loops, best of 3: 351 µs per loop
1000 loops, best of 3: 720 µs per loop

2) Very large dataset:

In [125]: df1 = pd.concat([df]*10000,axis=0)
     ...: df2 = pd.concat([df]*10000,axis=0)
     ...: df3 = pd.concat([df]*10000,axis=0)
     ...: df4 = pd.concat([df]*10000,axis=0)
     ...: 

In [126]: %timeit app1(df1)
     ...: %timeit app2(df2)
     ...: %timeit app3(df3)
     ...: %timeit app4(df4)
     ...: 
1000 loops, best of 3: 758 µs per loop
1 loops, best of 3: 2.78 s per loop
1000 loops, best of 3: 1.37 ms per loop
100 loops, best of 3: 2.57 ms per loop

Further boost with data reuse -

def app1_modified(df):
    mask = df.Category.values=='small'
    df_price = df.price.values*0.01
    df['Discount'] = np.where(mask,df_price, df_price*2)
    return df

Timings -

In [133]: df1 = pd.concat([df]*10000,axis=0)
     ...: df2 = pd.concat([df]*10000,axis=0)
     ...: df3 = pd.concat([df]*10000,axis=0)
     ...: df4 = pd.concat([df]*10000,axis=0)
     ...: 

In [134]: %timeit app1(df1)
1000 loops, best of 3: 699 µs per loop

In [135]: %timeit app1_modified(df1)
1000 loops, best of 3: 655 µs per loop
Answered Sep 18 '22 by Divakar