
Learning of Outcome Space Given Noisy Actions and Non-Monotonic Reinforcement

I'm looking to construct or adapt a model, preferably one based in RL theory, that can solve the following problem. I would greatly appreciate any guidance or pointers.

I have a continuous action space, where actions can be chosen from the range 10-100 (inclusive). Each action is associated with a certain reinforcement value, ranging from 0 to 1 (also inclusive) according to a value function. So far, so good. Here's where I start to get in over my head:

Complication 1:

The value function V maps actions to reinforcement according to the distance between a given action x and a target action A: the smaller the distance between the two, the greater the reinforcement (that is, reinforcement is inversely proportional to abs(A - x)). However, the value function is only nonzero for actions close to A (abs(A - x) less than some epsilon) and zero elsewhere. So:

**V** is proportional to 1 / abs(**A** - **x**) for abs(**A** - **x**) < epsilon, and

**V** = 0 for abs(**A** - **x**) >= epsilon.
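
In code, the intent is roughly the following (the epsilon value and the cap that keeps the reward inside [0, 1] are placeholders, not part of the problem statement):

```python
def value(x, A, epsilon=5.0):
    """Reinforcement for action x given target A; epsilon is a placeholder width."""
    d = abs(A - x)
    if d >= epsilon:
        return 0.0
    # Inversely proportional to distance, capped so the reward stays in [0, 1].
    return min(1.0, 1.0 / d) if d > 0 else 1.0
```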

Complication 2:

I do not know precisely which action was taken at each step. I know roughly what it was, in that it lies somewhere in the range x +/- sigma, but I cannot associate a single exact action value with the reinforcement I receive.

The precise problem I would like to solve is as follows: I have a series of noisy action estimates and exact reinforcement values (e.g. on trial 1 I might have x of ~15-30 and reinforcement of 0; on trial 2, x of ~25-40 and reinforcement of 0; on trial 3, x of ~80-95 and reinforcement of 0.6). I would like to construct a model that represents the estimate of the most likely location of the target action A after each step, probably weighting new information according to some learning rate parameter (since certainty will increase with increasing samples).
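
To make the kind of trial data and update I have in mind concrete, here is a rough sketch of a grid-based estimate of A. Everything here is an assumption on my part: the grid resolution, the epsilon of 5, treating the noisy action as uniform over its reported interval, and the Gaussian observation-noise term standing in for a learning rate.

```python
import numpy as np

grid = np.linspace(10, 100, 901)              # candidate values of the target A
posterior = np.ones_like(grid) / len(grid)    # start from a flat prior over A
EPSILON = 5.0                                 # assumed half-width of the reward region

def expected_reward(A, lo, hi):
    """Average reward over actions drawn uniformly from the reported interval [lo, hi]."""
    xs = np.linspace(lo, hi, 50)
    d = np.abs(A - xs)
    r = np.where(d < EPSILON, np.minimum(1.0, 1.0 / np.maximum(d, 1e-3)), 0.0)
    return r.mean()

def update(posterior, lo, hi, reward, noise=0.1):
    """Weight each candidate A by how well its predicted reward matches the observed one."""
    predicted = np.array([expected_reward(A, lo, hi) for A in grid])
    likelihood = np.exp(-((predicted - reward) ** 2) / (2 * noise ** 2))
    posterior = posterior * likelihood
    return posterior / posterior.sum()

# The trials from the example above: (action interval, observed reinforcement)
trials = [((15, 30), 0.0), ((25, 40), 0.0), ((80, 95), 0.6)]
for (lo, hi), r in trials:
    posterior = update(posterior, lo, hi, r)
    print("most likely A so far:", grid[posterior.argmax()])
```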

asked May 16 '13 by user2388629

1 Answer

This journal article may be relevant: it addresses delayed rewards and robust learning in the presence of noise and inconsistent rewards.

"Rare neural correlations implement robot conditioning with delayed rewards and disturbances"

Specifically, they trace (remember) which synapses (or actions) had been firing before a reward event and reinforce all of them, with the amount of reinforcement decaying with the time elapsed between the action and the reward.

An individual reward event will credit any synapses (or actions) that happened to be active shortly before it, including those irrelevant to the reward. However, with a suitable learning rate this should stabilize after a handful of iterations, with only the relevant action being consistently rewarded and reinforced.
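
A rough sketch of that idea (this is not the model from the paper; the bin width, trace decay, and learning rate below are placeholders):

```python
import numpy as np

N_BINS = 91                       # one bin per unit of the 10-100 action range
values = np.zeros(N_BINS)         # learned value estimate per action bin
trace = np.zeros(N_BINS)          # eligibility trace: which actions fired recently

DECAY = 0.8                       # placeholder trace decay per step
LEARNING_RATE = 0.1               # placeholder learning rate

def bin_of(action):
    return int(np.clip(round(action) - 10, 0, N_BINS - 1))

def step(action, reward):
    """Mark the chosen action as eligible, then credit all eligible actions when a reward arrives."""
    global trace, values
    trace *= DECAY                          # older actions earn less credit
    trace[bin_of(action)] += 1.0
    if reward != 0:
        values += LEARNING_RATE * reward * trace
```

Over many trials, only the bins near the target action keep receiving reward while eligible, so their values grow while spurious credit assigned to unrelated actions washes out.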

answered Nov 05 '22 by python1981