Suppose we have a list that gets a random integer between 15 and 32 appended on each iteration (let's call the integer rand). I want to design an algorithm that assigns a reward around 1 (between 0.75 and 1.25) to each rand. The rule for assigning the reward goes like this:
First we calculate the average of the list. If rand is greater than the average, the reward should be less than 1; if rand is less than the average, the reward should be greater than 1. The greater the distance between the average and rand, the more the reward increases or decreases.
For example:
rand = 15, avg = 23 then reward = 1.25
rand = 32, avg = 23 then reward = 0.75
rand = 23, avg = 23 then reward = 1
and so on.
I developed the code below for this algorithm:
import numpy as np

rollouts = np.array([])
i = 0

def modify_reward(lst, rand):
    reward = 1
    constant1 = 0.25
    constant2 = 1
    std = np.std(lst)
    global avg
    avg = np.mean(lst)
    sub = np.subtract(avg, rand)
    landa = sub / std if std != 0 else 0
    coefficient = -1 + (2 / (1 + np.exp(-constant2 * landa)))
    md_reward = reward + (reward * constant1 * coefficient)
    return md_reward

while i < 100:
    rand = np.random.randint(15, 33)
    rollouts = np.append(rollouts, rand)
    modified_reward = modify_reward(rollouts, rand)
    i += 1
    print([i, rand, avg, modified_reward])
# test the reward for the upper bound and lower bound
rand1, rand2 = 15, 32
reward1, reward2 = modify_reward(rollouts, rand1), modify_reward(rollouts, rand2)
print(['reward for upper bound', rand1, avg, reward1])  # rand = 15 gives the highest reward
print(['reward for lower bound', rand2, avg, reward2])  # rand = 32 gives the lowest reward
The algorithm works quite well, but if you look at the examples below, you will notice the problem with it.
rand = 15, avg = 23.94 then reward = 1.17 # which has to be 1.25
rand = 32, avg = 23.94 then reward = 0.84 # which has to be 0.75
rand = 15, avg = 27.38 then reward = 1.15 # which has to be 1.25
rand = 32, avg = 27.38 then reward = 0.93 # which has to be 0.75
As you might have noticed, the algorithm doesn't consider the distance between avg and the bounds (15, 32). The more avg moves towards the lower or upper bound, the more unbalanced modified_reward gets. I need modified_reward to be assigned uniformly, no matter whether avg moves toward the upper bound or the lower bound. Can anyone suggest a modification to this algorithm that takes the distance between avg and the bounds of the list into account?
Putting together these two requirements:

"if rand is more than average, we expect the reward to be less than 1, and if rand is less than average, the reward gets higher than 1"

"I need modified_reward to be uniformly assigned, no matter whether avg moves toward the upper bound or the lower bound"

is slightly tricky, depending on what you mean by 'uniformly'.
If you want 15 to always be rewarded with 1.25, and 32 to always be rewarded with 0.75, you can't have a single linear relationship while also respecting your first requirement.
If you are happy with two linear relationships, you can aim for a situation where modified_reward depends on rand like this:

[plot (originally produced with a Wolfram Alpha query): two straight segments, falling from reward 1.25 at rand = 15 to 1 at rand = avg, and from 1 at avg to 0.75 at rand = 32]

As you can see, this is two linear relationships, with a 'knee' at avg. I expect you'll be able to derive the formulae for each part without too much trouble.
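For reference, here is a minimal sketch of those two linear formulae in Python (my derivation, with a hypothetical function name; the original answer leaves this as an exercise). Each side interpolates linearly between the reward at avg (1.0) and the reward at the nearer bound (1.25 at 15, 0.75 at 32):

def piecewise_reward(rand, avg, lo=15, hi=32, weight=0.25):
    # Two straight segments meeting at avg: the reward falls from
    # 1 + weight at lo, to 1 at avg, to 1 - weight at hi.
    if rand >= avg:
        return 1 - weight * (rand - avg) / (hi - avg)
    else:
        return 1 + weight * (avg - rand) / (avg - lo)

print(piecewise_reward(15, 23))  # 1.25
print(piecewise_reward(32, 23))  # 0.75
print(piecewise_reward(23, 23))  # 1.0

(Note this assumes avg stays strictly between lo and hi; otherwise one denominator becomes zero.)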
This code implements a linear distribution of weights proportional to the distance from the average towards your given limits.
import numpy as np

class Rewarder(object):
    lo = 15
    hi = 32
    weight = 0.25

    def __init__(self):
        self.lst = np.array([])

    def append(self, x):
        self.lst = np.append(self.lst, [x])

    def average(self):
        return np.mean(self.lst)

    def distribution(self, a, x, b):
        '''
        Return a number between 0 and 1 proportional to
        the distance of x from a towards b.

        Note: modify this fraction if you want a normal distribution
        or quadratic etc.
        '''
        return (x - a) / (b - a)

    def reward(self, x):
        avg = self.average()
        if x > avg:
            w = self.distribution(avg, x, self.hi)
        else:
            w = -self.distribution(avg, x, self.lo)
        return 1 - self.weight * w

rollouts = Rewarder()
rollouts.append(23)
print(rollouts.reward(15))
print(rollouts.reward(32))
print(rollouts.reward(23))
Producing:
1.25
0.75
1.0
The code in your question seems to be using np.std, which I presume is an attempt to get a normal distribution. Remember that the normal distribution never actually gets to zero.
If you tell me what shape you want for the distribution, we can modify Rewarder.distribution to suit.
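For instance, a quadratic weighting could be a hypothetical override like this (my illustration, not from the original answer); it squares the 0..1 linear fraction so the reward changes slowly near the average and faster towards the bounds, while still reaching exactly 1.25 and 0.75 at the limits:

class RewarderQuad(Rewarder):
    def distribution(self, a, x, b):
        # Square the 0..1 linear fraction; at the bounds the
        # fraction is 1, so the extreme rewards are unchanged.
        return super(RewarderQuad, self).distribution(a, x, b) ** 2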
Edit:
I can't access the paper you refer to, but I infer that you want a sigmoid-style distribution of rewards, giving a weight of 0 at the mean and approximately ±0.25 at the min and max. Using the error function as the weighting, if we scale by 2 we get approximately 0.995 at min and max.
Override Rewarder.distribution:
import math

class RewarderERF(Rewarder):
    def distribution(self, a, x, b):
        """
        Return an error function (sigmoid) weighting of the distance from a.

        Note: scaled to reduce the error at max to ~0.003
        ref: https://en.wikipedia.org/wiki/Sigmoid_function
        """
        return math.erf(2.0 * super(RewarderERF, self).distribution(a, x, b))

rollouts = RewarderERF()
rollouts.append(23)
print(rollouts.reward(15))
print(rollouts.reward(32))
print(rollouts.reward(23))
results in:
1.24878131454
0.75121868546
1.0
You can choose which error function suits your application and how much error you can accept at the min and max. I'd also expect you to integrate all these functions into your class; I've split everything out here so we can see the parts.
Regarding calculating the mean: do you need to keep the list of values and recalculate the mean each time, or can you keep a count and a running total of the sum? Then you would not need numpy for this calculation.
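A minimal sketch of that running-total idea (my illustration, not part of the original answer):

class RunningMean(object):
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def append(self, x):
        # Keep only a count and a running sum instead of the whole list.
        self.count += 1
        self.total += x

    def average(self):
        return self.total / self.count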
I don't understand why you are calculating md_reward like this; please explain the logic and reasoning behind it. In any case,

landa = sub / std if std != 0 else 0
coefficient = -1 + (2 / (1 + np.exp(-constant2 * landa)))
md_reward = reward + (reward * constant1 * coefficient)

will not give what you are looking for. Consider the cases below:
for md_reward to be 0.75
--> coefficient should be -1
--> landa should be -infinity (a large negative value, i.e. rand would have to be much larger than 32)

for md_reward to be 1
--> coefficient should be 0
--> landa should be 0 (std == 0 or sub == 0), which is possible

for md_reward to be 1.25
--> coefficient should be 1
--> landa should be +infinity (a large positive value, i.e. rand would have to be much smaller than 15)
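To see this numerically (my own check, with illustrative values avg = 23.5 and std = 5, which are in the range the question describes), the coefficient at the bounds stays well short of ±1:

import numpy as np

avg, std = 23.5, 5.0  # assumed illustrative values
for rand in (15, 32):
    landa = (avg - rand) / std
    coefficient = -1 + 2 / (1 + np.exp(-landa))
    print(rand, 1 + 0.25 * coefficient)
# prints roughly 1.17 for rand = 15 and 0.83 for rand = 32,
# not the desired 1.25 and 0.75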
If you want to normalize the reward from avg to max and from avg to min, check the links below:
https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range
https://stats.stackexchange.com/questions/70553/what-does-normalization-mean-and-how-to-verify-that-a-sample-or-a-distribution

Now modify your function to something like this:
def modify_reward(lst, rand):
    reward = 1
    constant1 = 0.25
    min_value = 15
    max_value = 32
    avg = np.mean(lst)
    if rand >= avg:
        # normalize rand from avg to max
        md_reward = reward - constant1 * (rand - avg) / (max_value - avg)
    else:
        # normalize rand from min to avg
        md_reward = reward + constant1 * (1 - (rand - min_value) / (avg - min_value))
    return md_reward
I have used the method below.

Normalized:
(X − min(X)) / (max(X) − min(X))

For the case rand >= avg, min(X) is avg and max(X) is max_value; for the case rand < avg, min(X) is min_value and max(X) is avg.
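A quick check of this function at the bounds (my own example values):

lst = [23]  # so avg = 23
print(modify_reward(lst, 15))  # 1.25
print(modify_reward(lst, 32))  # 0.75
print(modify_reward(lst, 23))  # 1.0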
Hope this helps.