Recently I've been reading a lot about Q-learning with neural networks and thought about updating an existing, old optimization system in a power plant boiler. It consists of a simple feed-forward neural network that approximates an output from many sensory inputs. That output feeds a linear model-based controller, which in turn outputs an optimal action, so the whole system can converge to a desired goal.
Identifying linear models is a time-consuming task. I thought about refurbishing the whole thing into model-free Q-learning with a neural network approximation of the Q-function. I drew a diagram to ask you whether I'm on the right track or not.
My question: if you think I understood the concept well, should my training set be composed of state feature vectors on one side and Q_target - Q_current on the other (here I'm assuming an increasing reward), in order to push the whole model towards the target, or am I missing something?
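For instance, a minimal sketch of what I have in mind (the feature values, reward and hyperparameters are made up for illustration, not taken from the plant):

```python
# Sketch of how one Q-learning training example could be formed:
# input = state feature vector, learning signal = Q_target - Q_current.
import numpy as np

gamma = 0.99     # discount factor (placeholder)

def td_error(q_current, reward, q_next_max):
    """TD error (Q_target - Q_current) for one observed transition."""
    q_target = reward + gamma * q_next_max
    return q_target - q_current

# One transition (s, a, r, s'); all values are hypothetical:
state_features = np.array([0.7, 0.2, 0.9])   # sensory inputs for state s
q_current = 1.5                               # Q(s, a) predicted by the network
reward = 0.3                                  # reward observed after action a
q_next_max = 1.8                              # max_a' Q(s', a') from the network

# The pair (state_features, delta) would be one training example:
delta = td_error(q_current, reward, q_next_max)
print(state_features, delta)                  # approx. 0.582
```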
Note: The diagram shows a comparison between the old system in the upper part and my proposed change in the lower part.
EDIT: Does a State Neural Network guarantee Experience Replay?
These networks have the same architecture but different weights. Every N steps, the weights from the main network are copied to the target network. Using both of these networks leads to more stability in the learning process and helps the algorithm learn more effectively.
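A minimal sketch of that periodic weight copy in PyTorch (the layer sizes and the interval N are placeholders, not tied to your boiler system):

```python
import copy
import torch.nn as nn

# Hypothetical Q-network: state features in, one Q value per action out.
q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
target_net = copy.deepcopy(q_net)          # same architecture, separate weights

N = 1000                                   # copy interval (placeholder)
for step in range(10_000):
    # ... train q_net on sampled transitions here ...
    if step % N == 0:
        # Every N steps, copy the main network's weights into the target network.
        target_net.load_state_dict(q_net.state_dict())
```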
In deep Q-learning, we utilize a neural network to approximate the Q value function. The network receives the state as input (whether it is the frame of the current state or a single value) and outputs the Q values for all possible actions. The action with the largest Q value is our next action.
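For example, a small sketch of that forward pass and greedy action selection (the state size and number of actions are assumptions):

```python
import torch
import torch.nn as nn

n_features, n_actions = 8, 4               # assumed sizes
q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                      nn.Linear(64, n_actions))

state = torch.rand(1, n_features)          # one state as input
q_values = q_net(state)                    # Q values for all possible actions
action = torch.argmax(q_values, dim=1)     # greedy: pick the largest Q value
```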
In the case of deep reinforcement learning, the neural network generalizes over the agent's experiences (which are typically stored in a replay buffer and sampled during training) and thus improves the way the task is performed.
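As a sketch of that replay idea (a plain Python deque; the buffer size and batch size are just placeholders):

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)      # stores past transitions, not the network

def store(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))

def sample(batch_size=32):
    # Random minibatch of past experiences to train the Q-network on.
    return random.sample(replay_buffer, batch_size)
```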
You might just use the Q values of all the actions in the current state as the output layer of your network. A poorly drawn diagram is here
You can therefore take advantage of the NN's ability to output multiple Q values at a time. Then, just backprop using the loss derived from

Q(s, a) <- Q(s, a) + alpha * (reward + discount * max(Q(s', a')) - Q(s, a)),

where max(Q(s', a')) can be easily computed from the output layer.
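A minimal sketch of that backprop step in PyTorch, where alpha is absorbed into the optimizer's learning rate (the network, minibatch and hyperparameters are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_features, n_actions = 8, 4
q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99                                          # discount factor

# One hypothetical minibatch of transitions (s, a, r, s'):
states = torch.rand(32, n_features)
actions = torch.randint(0, n_actions, (32,))
rewards = torch.rand(32)
next_states = torch.rand(32, n_features)

q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)    # Q(s, a)
with torch.no_grad():
    # max(Q(s', a')) taken directly from the output layer
    # (a separate target network could be used here for stability).
    max_q_next = q_net(next_states).max(dim=1).values
target = rewards + gamma * max_q_next                              # TD target

loss = F.mse_loss(q_sa, target)      # drives Q(s, a) toward the target
optimizer.zero_grad()
loss.backward()
optimizer.step()
```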
Please let me know if you have further questions.