Optimize deep Q network with long episode

I am working on a problem that we aim to solve with deep Q-learning. However, training just takes too long for each episode, roughly 83 hours. We are hoping to solve the problem within, say, 100 episodes.

We are gradually learning a 100 * 10 matrix, and within each episode we need to perform 100 * 10 iterations of certain operations. Basically, we select a candidate from a pool of 1000 candidates, put this candidate into the matrix, and compute a reward by feeding the whole matrix into a reward function.


The central hurdle is that the reward computation at each step is costly, roughly 2 minutes, while each step only updates one entry of the matrix.

All the elements in the matrix depend on each other in the long term, so the whole procedure does not seem suitable for a "distributed" setup, if I understand correctly.

Could anyone shed some light on potential optimization opportunities here, such as extra engineering effort? Any suggestions and comments would be appreciated very much. Thanks.

======================= update: some definitions =======================

0. initial stage:

  • a 100 * 10 matrix, with every element empty

1. action space:

  • At each step I select one element from a candidate pool of 1000 elements and insert it into the matrix, one element per step.

2. environment:

  • At each step I have an updated matrix to learn from.

  • An oracle function F returns a quantitative value ranging from 5000 to 30000, the higher the better (one computation of F takes roughly 120 seconds).

    This function F takes the matrix as input, performs a very costly computation, and returns a quantitative value indicating the quality of the synthesized matrix so far.

    This function essentially measures some performance of a system, so it does take a while to compute a reward value at each step.

3. episode:

By saying "we are envisioning to solve it within 100 episodes", that's just an empirical estimation. But it shouldn't be less than 100 episode, at least.

4. constraints

Ideally, as I mentioned, "all the elements in the matrix depend on each other in the long term", and that is why the reward function F computes the reward by taking the whole matrix as input rather than only the latest selected element.

Indeed, as more and more elements are appended to the matrix, the reward could increase, but it could also decrease.

5. goal

The synthesized matrix should make the oracle function F return a value greater than 25000. Whenever it reaches this goal, I will terminate the learning.
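To make the above concrete, here is a minimal sketch of how I think of the environment. All names are hypothetical, and `oracle_F` is just a cheap stand-in for the real 120-second oracle:

```python
import numpy as np

ROWS, COLS = 100, 10          # matrix to be synthesized
POOL_SIZE = 1000              # candidate pool
TARGET_REWARD = 25000         # terminate once F(matrix) exceeds this


def oracle_F(matrix):
    """Placeholder for the real oracle F (which takes ~120 s and returns
    a value roughly in [5000, 30000]); the real computation is domain-specific."""
    filled = ~np.isnan(matrix)
    return 5000 + 25000 * filled.mean()


class MatrixSynthesisEnv:
    """Minimal sketch of the described environment (names are made up)."""

    def __init__(self, candidate_pool):
        self.pool = candidate_pool                   # the 1000 candidates
        self.reset()

    def reset(self):
        self.matrix = np.full((ROWS, COLS), np.nan)  # "every element empty"
        self.cursor = 0                              # next cell to fill
        return self.matrix.copy()

    def step(self, action):
        """action: index into the candidate pool; fills the next cell."""
        r, c = divmod(self.cursor, COLS)
        self.matrix[r, c] = self.pool[action]
        self.cursor += 1
        reward = oracle_F(self.matrix)               # the expensive call
        done = reward > TARGET_REWARD or self.cursor == ROWS * COLS
        return self.matrix.copy(), reward, done


# One episode therefore triggers up to 100 * 10 = 1000 oracle evaluations.
if __name__ == "__main__":
    env = MatrixSynthesisEnv(candidate_pool=np.random.rand(POOL_SIZE))
    state = env.reset()
    state, reward, done = env.step(action=42)
    print(reward, done)
```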

asked May 17 '19 by lllllllllllll


2 Answers

Not a solution to your question, just some general thoughts that may be relevant:

  • One of the biggest obstacles to applying Reinforcement Learning to "real world" problems is the astoundingly large amount of data/experience required to achieve acceptable results. For example, OpenAI's Dota 2 agent collected experience equivalent to 900 years of play per day. In the original Deep Q-network paper, hundreds of millions of game frames were required to achieve performance close to that of a typical human, depending on the specific game. In other benchmarks where the inputs are not raw pixels, such as MuJoCo, the situation isn't a lot better. So, if you don't have a simulator that can generate samples (state, action, next state, reward) cheaply, maybe RL is not a good choice. On the other hand, if you have a ground-truth model, other approaches may easily outperform RL, such as Monte Carlo Tree Search (e.g., Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning or Simple random search provides a competitive approach to reinforcement learning). All these ideas and much more are discussed in this great blog post.
  • The previous point is especially true for deep RL. Approximating value functions or policies with a deep neural network of millions of parameters usually implies that you'll need a huge quantity of data, or experience.

And regarding your specific question:

  • In the comments, I've asked a few questions about the specific features of your problem. I was trying to figure out whether you really need RL to solve the problem, since it's not the easiest technique to apply. On the other hand, if you really need RL, it's not clear whether you should use a deep neural network as the approximator or whether a shallow model (e.g., random trees) would do; see the sketch after this list. However, these questions and other potential optimizations require more domain knowledge. Here, it seems you are not able to share the domain of the problem, which could be due to numerous reasons, and I perfectly understand.
  • You have estimated the number of required episodes based on some empirical studies using a smaller 20 * 10 matrix. Just a cautionary note: due to the curse of dimensionality, the complexity of the problem (or the experience needed) could grow exponentially as the state-space dimensionality grows, although maybe that is not your case.
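To illustrate the "shallow model" option mentioned above, here is a rough sketch of classic fitted Q-iteration with extremely randomized trees (in the spirit of Ernst et al.). The state featurizer and the transition log are placeholders, since I don't know your domain:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

GAMMA = 0.99
N_ACTIONS = 1000   # size of your candidate pool


def features(matrix):
    """Hypothetical state featurizer; replace with domain-specific features."""
    return np.nan_to_num(matrix).ravel()


def fitted_q_iteration(transitions, n_iterations=20):
    """transitions: list of (state_feats, action, reward, next_state_feats, done)
    logged from previous runs; returns a shallow Q(s, a) approximator."""
    S  = np.array([t[0] for t in transitions])
    A  = np.array([t[1] for t in transitions]).reshape(-1, 1)
    R  = np.array([t[2] for t in transitions])
    S2 = np.array([t[3] for t in transitions])
    D  = np.array([t[4] for t in transitions], dtype=float)

    X = np.hstack([S, A])          # Q takes (state, action) features as input
    model = None
    for _ in range(n_iterations):
        if model is None:
            y = R                  # first pass: Q is the immediate reward
        else:
            # Bootstrap target: max over actions of the previous Q estimate.
            # With 1000 actions you would probably restrict this max to a
            # sampled subset of candidates.
            q_next = np.column_stack([
                model.predict(np.hstack([S2, np.full((len(S2), 1), a)]))
                for a in range(N_ACTIONS)
            ])
            y = R + GAMMA * (1.0 - D) * q_next.max(axis=1)
        model = ExtraTreesRegressor(n_estimators=50, n_jobs=-1).fit(X, y)
    return model
```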

That said, I'm looking forward to seeing an answer that really helps you solve your problem.

answered Nov 11 '22 by Pablo EM


Honestly, there is no effective way to know how to optimize this system without knowing specifics, such as which computations are in the reward function or which programming design decisions you have made that we could help with.

You are probably right that the episodes are not suitable for distributed calculation, meaning we cannot parallelize across them, since they depend on previous search steps. However, it might be possible to throw more computing power at the reward function evaluation, reducing the total time required to run.
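One interpretation of that, assuming several candidate insertions could be scored independently at a given step (e.g. the top-k actions proposed by your Q-network), is sketched below. `oracle_F` and all names are placeholders:

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def oracle_F(matrix):
    """Stand-in for the real ~120 s reward computation."""
    return float(np.nan_to_num(matrix).sum())


def score_candidates(matrix, cell, candidate_values, workers=8):
    """Evaluate F for each candidate value placed in `cell`, in parallel."""
    variants = []
    for v in candidate_values:
        m = matrix.copy()
        m[cell] = v
        variants.append(m)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(oracle_F, variants))
    return scores


if __name__ == "__main__":   # guard required for multiprocessing
    matrix = np.full((100, 10), np.nan)
    scores = score_candidates(matrix, cell=(0, 0),
                              candidate_values=[0.1, 0.5, 0.9])
    print(scores)
```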

I would encourage you to share more details about the problem, for example by profiling the code to see which component takes up the most time, by sharing a code excerpt or, as the standard for doing science gets higher, by sharing a reproducible code base.
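For the profiling part, even the standard library's profiler run over a single reward evaluation would already tell a lot. `oracle_F` below is just a placeholder for your actual function:

```python
import cProfile
import pstats


def oracle_F(matrix):
    """Placeholder for the real reward function."""
    return sum(x * x for x in range(10**6))   # dummy workload


profiler = cProfile.Profile()
profiler.runcall(oracle_F, None)              # profile one reward evaluation
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)  # top 20 hot spots
```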

answered Nov 11 '22 by Simon