How to deal with different state space size in reinforcement learning?

I'm working with A2C reinforcement learning, where my environment has an increasing and decreasing number of agents. As the number of agents changes, the size of the state space also changes. I have tried to handle the changing state space this way:

  • If the state space exceeds the maximum state space selected as n_input, the excess entries are subsampled with np.random.choice, which draws random samples from the state space after converting the state values into probabilities.

  • If the state space is smaller than the maximum state, I pad the state space with zeros.

    def get_state_new(state):
        # n_input, env and get_state are assumed to be defined in the surrounding scope
        n_features = n_input - len(get_state(env))
        # print("state", len(get_state(env)))
        p = np.array(state)
        p = np.exp(p)
        if p.sum() != 1.0:
            p = p * (1. / p.sum())
        if len(get_state(env)) > n_input:
            statappend = np.random.choice(state, size=n_input, p=p)
            # print(statappend)
        else:
            statappend = np.zeros(n_input)
            statappend[:state.shape[0]] = state
        return statappend
    

It works, but the results are not as expected and I don't know if this is correct or not.

My question

Are there any reference papers that deal with such a problem and how to deal with the changing of state space?

asked Sep 03 '20 by I_Al-thamary
2 Answers

For the paper, I'm gonna give the same reference as in the other post already: Benchmarks for reinforcement learning in mixed-autonomy traffic.

In this approach, indeed, an expected number of agents (which are expected to be present in the simulation at any moment in time) is predetermined. During runtime, observations of agents present in the simulation are then retrieved and squashed into a container (tensor) of fixed size (let's call it overall observation container), which can contain as many observations (from individual agents) as there are agents expected to be present at any moment in time in the simulation. Just to be clear: size(overall observation container) = expected number of agents * individual observation size. Since the actual number of agents present in a simulation may vary from time step to time step, the following applies:

  • If fewer agents than expected are present in the environment, and hence fewer observations are provided than would fit into the overall observation container, then zero-padding is used to fill the empty observation slots.
  • If the number of agents exceeds the expected number of agents, then only a subset of the observations provided will be used: the observations of a randomly selected subset of the available agents are put into the overall observation container of fixed size. Only for the chosen agents will the controller compute actions, while the "excess agents" have to be treated as non-controlled agents in the simulation. (A quick numeric sketch of both cases follows below.)
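
As a quick numeric sketch of the container sizing and the two cases above (the numbers are made up, not taken from the paper):

import numpy as np

expected_agents = 5
individual_observation_size = 4  # assumed per-agent observation length
container = np.zeros((expected_agents, individual_observation_size))  # 5 * 4 = 20 values in total

# Case 1: only 3 agents present -> 3 rows filled, the remaining 2 rows stay zero-padded
observations = np.random.random((3, individual_observation_size))
container[:observations.shape[0]] = observations

# Case 2: 7 agents present -> keep a random subset of 5, the other 2 remain non-controlled
observations = np.random.random((7, individual_observation_size))
chosen = np.random.choice(7, size=expected_agents, replace=False)
container = observations[chosen]
print(container.shape)  # (5, 4)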

Coming back to your sample code, there are a few things I would do differently.

First, I was wondering why you have both the variable state (passed to the function get_state_new) and the call get_state(env), since I would expect the information returned by get_state(env) to be the same as stored already in the variable state. As a tip, it would make the code a bit nicer to read if you could try to use the state variable only (if the variable and the function call indeed provide the same information).

The second thing I would do differently is how you process states: p = np.exp(p), p = p * (1. / p.sum()). This normalizes the overall observation container by the sum of all exponentiated values present in all individual observations. In contrast, I would normalize each individual observation in isolation.

This has the following reason: If you provide a small number of observations, then the sum of exponentiated values contained in all individual observations can be expected to be smaller than when taking the sum over the exponentiated values contained in a larger number of individual observations. These differences in the sum, which is then used for normalization, will result in different magnitudes of the normalized values (as a function of the number of individual observations, roughly speaking). Consider the following example:

import numpy as np

# Less state representations
state = np.array([1,1,1])
state = state/state.sum()
state
# Output: array([0.33333333, 0.33333333, 0.33333333])

# More state representations
state = np.array([1,1,1,1,1])
state = state/state.sum()
state
# Output: array([0.2, 0.2, 0.2, 0.2, 0.2])

Actually, the same input state representation, as obtained by an individual agent, should always result in the same output state representation after normalization, regardless of the number of agents currently present in the simulation. So, please make sure to normalize all observations on their own. I'll give an example below.
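
For contrast, here is a small sketch of normalizing each observation on its own (by its maximum, as in the function proposed further below):

import numpy as np

obs_a = np.array([1., 2., 4.])  # one agent's individual observation
obs_b = np.array([3., 3., 6.])  # another agent's individual observation

# Each observation is scaled by its own maximum, independently of how many
# other agents (and hence observations) are present in the same time step.
print(obs_a / obs_a.max())  # [0.25 0.5  1.  ]
print(obs_b / obs_b.max())  # [0.5 0.5 1. ]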

Also, please make sure to keep track of which agents' observations (and in which order) have been squashed into your variable statappend. This is important for the following reason.

If there are agents A1 through A5, but the overall observation container can take only three observations, three out of five state representations are going to be selected at random. Say the observations randomly selected to be squashed into the overall observation container stem from the following agents, in the following order: A2, A5, A1. Then, these agents' observations will be squashed into the overall observation container in exactly this order: first the observation of A2, then that of A5, and eventually that of A1. Correspondingly, given the aforementioned overall observation container, the three actions predicted by your Reinforcement Learning controller will correspond to agents A2, A5, and A1 (in order!), respectively. In other words, the order of the agents on the input side also dictates to which agents the predicted actions correspond on the output side.

I would propose something like the following:

import numpy as np

def get_overall_observation(observations, expected_observations=5):
    # Return values:
    #   observations container: the (padded or subsampled) observations of fixed size
    #   order_agents: the returned observations stem from this ordered set of agents (in sequence)

    # Get some info
    n_observations = observations.shape[0]  # Actual nr of observations
    observation_size = list(observations.shape[1:])  # Shape of an agent's individual observation

    # Normalize individual observations
    for i in range(n_observations):
        # TODO: handle possible 0-divisions
        observations[i,:] = observations[i,:] / observations[i,:].max()

    if n_observations == expected_observations:
        # Return (normalized) observations as they are & sequence of agents in order (i.e. no randomization)
        order_agents = np.arange(n_observations)
        return observations, order_agents
    if n_observations < expected_observations:
        # Return padded observations as they are & padded sequence of agents in order (i.e. no randomization)
        padded_observations = np.zeros([expected_observations]+observation_size)
        padded_observations[0:n_observations,:] = observations
        order_agents = list(range(n_observations))+[-1]*(expected_observations-n_observations) # -1 == agent absent
        return padded_observations, order_agents
    if n_observations > expected_observations:
        # Return random selection of observations in random order
        order_agents = np.random.choice(range(n_observations), size=expected_observations, replace=False)
        selected_observations = np.zeros([expected_observations] + observation_size)
        for i_selected, i_given_observations in enumerate(order_agents):
            selected_observations[i_selected,:] = observations[i_given_observations,:]
        return selected_observations, order_agents


# Example usage
n_observations = 5      # Number of actual observations
width = height =  2     # Observation dimension
state = np.random.random(size=[n_observations,height,width])  # Random state
print(state)
print(get_overall_observation(state))
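
As a small, hypothetical follow-up (the random actions are a placeholder for whatever your controller predicts per slot), the returned order_agents can be used to map the predicted actions back to the right agents:

overall_observations, order_agents = get_overall_observation(state)
actions = np.random.randint(0, 3, size=len(order_agents))  # placeholder for the controller's per-slot actions
for slot, agent_id in enumerate(order_agents):
    if agent_id == -1:
        continue  # padded slot, no agent behind it
    print("Agent", agent_id, "executes action", actions[slot])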
answered Oct 18 '22 by Daniel B.

I solved the problem using several different approaches, but I found that encoding is the best solution for my problem:

  • Select the model with a pre-estimated maximum state space, and if the state space is smaller than this maximum, pad the state space with zeros.
  • Consider only the agent's own state, without sharing any of the other agents' states.
  • As paper [1] mentions, the extra connected autonomous vehicles (CAVs) are not included in the state, and if there are fewer of them than the maximum number of CAVs, the state is padded with zeros. We can select how many agents' states to share by adding them to the agent's own state.
  • Encode the state, which helps to process the input and compress the information into a fixed length. In the encoder, every cell in the LSTM layer, or in an RNN with Gated Recurrent Units (GRU), returns a hidden state (Ht) and a cell state (E't).


For the encoder, I use the code from the Neural machine translation with attention tutorial:

import tensorflow as tf

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state = hidden)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))
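
A brief usage sketch (the hyperparameter values are illustrative only; note that the Embedding layer expects integer token indices as input):

encoder = Encoder(vocab_size=1000, embedding_dim=64, enc_units=128, batch_sz=4)
hidden = encoder.initialize_hidden_state()
dummy_batch = tf.random.uniform((4, 10), maxval=1000, dtype=tf.int32)  # 4 sequences of length 10
output, state = encoder(dummy_batch, hidden)
print(output.shape)  # (4, 10, 128): one hidden state per time step
print(state.shape)   # (4, 128): the final hidden state, a fixed-length encoding of the sequence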
  • LSTM zero padding and masking, where we pad the state with a special value that will be masked (skipped) later. If we pad without masking, the padded values are treated as actual values and thus become noise in the state [2-4]. A minimal sketch follows below.
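
Below is a minimal sketch of zero padding plus a Keras Masking layer (the dimensions and the toy model are assumptions, not the exact setup from the references):

import numpy as np
import tensorflow as tf

max_agents = 5   # assumed fixed number of agent slots
feature_dim = 3  # assumed per-agent feature size

masked_model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_agents, feature_dim)),
    tf.keras.layers.Masking(mask_value=0.0),
    tf.keras.layers.LSTM(16),   # padded time steps are skipped thanks to the mask
    tf.keras.layers.Dense(4)    # e.g. action logits
])

# Two real agents, three slots left at the mask value (zero padding)
padded_state = np.zeros((1, max_agents, feature_dim), dtype=np.float32)
padded_state[0, :2, :] = np.random.random((2, feature_dim))
print(masked_model(padded_state).shape)  # (1, 4)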

1- Vinitsky, E., Kreidieh, A., Le Flem, L., Kheterpal, N., Jang, K., Wu, C., ... & Bayen, A. M. (2018, October). Benchmarks for reinforcement learning in mixed-autonomy traffic. In Conference on Robot Learning (pp. 399-409).

2- Kochkina, E., Liakata, M., & Augenstein, I. (2017). Turing at SemEval-2017 Task 8: Sequential approach to rumour stance classification with branch-LSTM. arXiv preprint arXiv:1704.07221.

3- Ma, L., & Liang, L. (2020). Enhance CNN Robustness Against Noises for Classification of 12-Lead ECG with Variable Length. arXiv preprint arXiv:2008.03609.

4- How to feed LSTM with different input array sizes?

5- Zhao, X., Xia, L., Zhang, L., Ding, Z., Yin, D., & Tang, J. (2018, September). Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems (pp. 95-103).

answered Oct 18 '22 by I_Al-thamary