I am setting up the following example which is similar to my situation and data:
Say, I have the following DataFrame:
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'price': [25, 30, 34, 40],
                   'Category': ['small', 'medium', 'medium', 'small']})
Category ID price
0 small 1 25
1 medium 2 30
2 medium 3 34
3 small 4 40
Now, I have the following function, which returns the discount amount based on the following logic:
def mapper(price, category):
    if category == 'small':
        discount = 0.01 * price
    else:
        discount = 0.02 * price
    return discount
Now I want the resulting DataFrame:
  Category  ID  price  Discount
0    small   1     25      0.25
1   medium   2     30      0.60
2   medium   3     34      0.68
3    small   4     40      0.40
So I decided to call Series.map on the price column because I don't want to use apply: I am working on a large DataFrame, and map is much faster than apply.
I tried doing this:
for c in sample.Category.unique():
    sample[sample['Category'] == c]['Discount'] = sample[sample['Category'] == c]['price'].map(lambda x: mapper(x, c))
And that didn't work as I expected because I am trying to set a value on a copy of a slice of the DataFrame.
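For reference, the loop can be made to work by replacing the chained indexing with a single .loc assignment per category, which writes into the DataFrame directly. A sketch of that fix, keeping the mapper function (scaled to match the expected output above):

```python
import pandas as pd

def mapper(price, category):
    # same logic as the question's function, with the rates that
    # produce the expected output (0.01 for small, 0.02 otherwise)
    return 0.01 * price if category == 'small' else 0.02 * price

sample = pd.DataFrame({'ID': [1, 2, 3, 4],
                       'price': [25, 30, 34, 40],
                       'Category': ['small', 'medium', 'medium', 'small']})

for c in sample['Category'].unique():
    m = sample['Category'] == c
    # one .loc call selects and assigns in a single operation,
    # so there is no intermediate copy and no SettingWithCopyWarning
    sample.loc[m, 'Discount'] = sample.loc[m, 'price'].map(lambda x: mapper(x, c))
```

This still loops over categories, so the vectorized approaches below will be faster, but it shows why the original attempt failed: `sample[mask]['Discount'] = ...` assigns into a temporary slice, while `sample.loc[mask, 'Discount'] = ...` assigns into `sample` itself.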
My question is: is there a way to do this without using df.apply()?
"A value is trying to be set on a copy of a slice from a DataFrame" is a warning, and Stack Overflow has many posts on the subject. df.assign, added in pandas 0.16, is a good way to avoid it.
The warning exists because chained indexing like df[mask]['Discount'] = ... may assign into a temporary copy, so the modification may never propagate back to the original df; pandas warns precisely because it cannot tell which you intended. Assigning through a single .loc indexer, or working on an explicit .copy(), removes the ambiguity.
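As an illustration of the assign route: since assign returns a new DataFrame with the column added, no view of the original is ever written to, so the warning cannot arise. A minimal sketch on the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'price': [25, 30, 34, 40],
                   'Category': ['small', 'medium', 'medium', 'small']})

# assign never mutates df; it returns a fresh DataFrame,
# so there is no copy-of-a-slice for pandas to warn about
out = df.assign(Discount=np.where(df['Category'] == 'small',
                                  df['price'] * 0.01,
                                  df['price'] * 0.02))
```

Note that `df` itself is left unchanged; rebind the result (`df = df.assign(...)`) if you want to keep the column.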
One approach with np.where -
mask = df.Category.values == 'small'
df['Discount'] = np.where(mask, df.price * 0.01, df.price * 0.02)
Another way, putting things a bit differently -
df['Discount'] = df.price * 0.01
df['Discount'][df.Category.values != 'small'] *= 2
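Be aware that the `df['Discount'][...] *= 2` step is itself chained indexing and can trigger the very SettingWithCopyWarning the question is about. The same in-place doubling can be done with one .loc indexer; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'price': [25, 30, 34, 40],
                   'Category': ['small', 'medium', 'medium', 'small']})

df['Discount'] = df['price'] * 0.01
# a single .loc operation writes straight into df,
# instead of into a possible temporary returned by df['Discount'][mask]
df.loc[df['Category'] != 'small', 'Discount'] *= 2
```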
For performance, you might want to work with the underlying array data, so we could use df.price.values in place of df.price.
Approaches -
def app1(df):  # Proposed app#1 here
    mask = df.Category.values == 'small'
    df_price = df.price.values
    df['Discount'] = np.where(mask, df_price * 0.01, df_price * 0.02)
    return df

def app2(df):  # Proposed app#2 here
    df['Discount'] = df.price.values * 0.01
    df['Discount'][df.Category.values != 'small'] *= 2
    return df

def app3(df):  # @piRSquared's soln
    return df.assign(
        Discount=((1 - (df.Category.values == 'small')) + 1) / 100 * df.price.values)

def app4(df):  # @MaxU's soln
    return df.assign(Discount=df.price * df.Category.map({'small': 0.01}).fillna(0.02))
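For completeness, the mapping-plus-fillna idea in app4 can be run standalone; categories absent from the dict map to NaN, which fillna then turns into the default rate. A sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'price': [25, 30, 34, 40],
                   'Category': ['small', 'medium', 'medium', 'small']})

# 'small' -> 0.01; every other category maps to NaN, then fillna -> 0.02
rate = df['Category'].map({'small': 0.01}).fillna(0.02)
out = df.assign(Discount=df['price'] * rate)
```

This generalizes nicely: adding a new category is just another entry in the dict, with no extra branch in the code.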
Timings -
1) Large dataset :
In [122]: df
Out[122]:
Category ID price Discount
0 small 1 25 0.25
1 medium 2 30 0.60
2 medium 3 34 0.68
3 small 4 40 0.40
In [123]: df1 = pd.concat([df]*1000,axis=0)
...: df2 = pd.concat([df]*1000,axis=0)
...: df3 = pd.concat([df]*1000,axis=0)
...: df4 = pd.concat([df]*1000,axis=0)
...:
In [124]: %timeit app1(df1)
...: %timeit app2(df2)
...: %timeit app3(df3)
...: %timeit app4(df4)
...:
1000 loops, best of 3: 209 µs per loop
10 loops, best of 3: 63.2 ms per loop
1000 loops, best of 3: 351 µs per loop
1000 loops, best of 3: 720 µs per loop
2) Very large dataset :
In [125]: df1 = pd.concat([df]*10000,axis=0)
...: df2 = pd.concat([df]*10000,axis=0)
...: df3 = pd.concat([df]*10000,axis=0)
...: df4 = pd.concat([df]*10000,axis=0)
...:
In [126]: %timeit app1(df1)
...: %timeit app2(df2)
...: %timeit app3(df3)
...: %timeit app4(df4)
...:
1000 loops, best of 3: 758 µs per loop
1 loops, best of 3: 2.78 s per loop
1000 loops, best of 3: 1.37 ms per loop
100 loops, best of 3: 2.57 ms per loop
Further boost with data reuse -
def app1_modified(df):
    mask = df.Category.values == 'small'
    df_price = df.price.values * 0.01
    df['Discount'] = np.where(mask, df_price, df_price * 2)
    return df
Timings -
In [133]: df1 = pd.concat([df]*10000,axis=0)
...: df2 = pd.concat([df]*10000,axis=0)
...: df3 = pd.concat([df]*10000,axis=0)
...: df4 = pd.concat([df]*10000,axis=0)
...:
In [134]: %timeit app1(df1)
1000 loops, best of 3: 699 µs per loop
In [135]: %timeit app1_modified(df1)
1000 loops, best of 3: 655 µs per loop