I'm working with A2C reinforcement learning, where the number of agents in my environment increases and decreases over time. As the number of agents changes, the state space changes as well. I have tried to handle the changing state space this way:
If the state space exceeds the maximum state space selected as n_input, the excess entries are sampled with np.random.choice, which draws random samples from the state space after converting it into probabilities.
If the state space is smaller than the maximum, I pad the state space with zeros.
def get_state_new(state):
    n_features = n_input - len(get_state(env))
    # print("state", len(get_state(env)))
    p = np.array(state)
    p = np.exp(p)
    if p.sum() != 1.0:
        p = p * (1. / p.sum())
    if len(get_state(env)) > n_input:
        statappend = np.random.choice(state, size=n_input, p=p)
        # print(statappend)
    else:
        statappend = np.zeros(n_input)
        statappend[:state.shape[0]] = state
    return statappend
It works, but the results are not as expected and I don't know whether this is correct or not.
My question
Are there any reference papers that deal with such a problem, and how should I deal with the changing state space?
For the paper, I'll give the same reference as in the other post: Benchmarks for reinforcement learning in mixed-autonomy traffic.
In this approach, an expected number of agents (which are expected to be present in the simulation at any moment in time) is predetermined. During runtime, the observations of the agents present in the simulation are retrieved and squashed into a container (tensor) of fixed size (let's call it the overall observation container), which can hold as many individual observations as there are agents expected to be present at any moment in time. Just to be clear: size(overall observation container) = expected number of agents * individual observation size. Since the actual number of agents present in the simulation may vary from time step to time step, the following applies:
If fewer agents are present than expected, the remaining slots of the overall observation container are padded with zeros.
If more agents are present than expected, only a subset of the agents' observations is squashed into the overall observation container (selected at random).
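As a rough numeric sketch of the container sizing and the padding/sub-sampling rule (the numbers below are made up for illustration, not taken from the paper):
expected_agents = 5              # predetermined expected number of agents
individual_observation_size = 4  # size of one agent's observation (assumed here)

# The overall observation container has a fixed size, independent of how many
# agents are actually present at the current time step
overall_container_size = expected_agents * individual_observation_size  # 5 * 4 = 20

# 3 agents present -> 3 observations are used, the remaining 2 slots are zero-padded
# 7 agents present -> only 5 of the 7 observations fit and are selected at random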
Coming back to your sample code, there are a few things I would do differently.
First, I was wondering why you have both the variable state (passed to the function get_state_new) and the call get_state(env), since I would expect the information returned by get_state(env) to be the same as what is already stored in the variable state. As a tip, the code would be a bit nicer to read if you used the state variable only (if the variable and the function call indeed provide the same information).
The second thing I would do differently is how you process states: p = np.exp(p), p = p * (1. / p.sum()). This normalizes the overall observation container by the sum of all exponentiated values present in all individual observations. In contrast, I would normalize each individual observation in isolation.
This has the following reason: If you provide a small number of observations, then the sum of exponentiated values contained in all individual observations can be expected to be smaller than when taking the sum over the exponentiated values contained in a larger amount of individual observations. These differences in the sum, which is then used for normalization, will result in different magnitudes of the normalized values (as a function of the number of individual observations, roughly speaking). Consider the following example:
import numpy as np
# Less state representations
state = np.array([1,1,1])
state = state/state.sum()
state
# Output: array([0.33333333, 0.33333333, 0.33333333])
# More state representations
state = np.array([1,1,1,1,1])
state = state/state.sum()
state
# Output: array([0.2, 0.2, 0.2, 0.2, 0.2])
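In contrast, here is a minimal sketch of normalizing each observation in isolation (here by its own maximum, as in the example function further below); the resulting representation of an agent no longer depends on how many other agents are present:
import numpy as np

# Three agents present
observations = np.array([[1., 2.], [1., 2.], [1., 2.]])
normalized = observations / observations.max(axis=1, keepdims=True)
print(normalized)
# [[0.5 1. ]
#  [0.5 1. ]
#  [0.5 1. ]]

# Five agents present: each individual observation is normalized to the same values
observations = np.array([[1., 2.]] * 5)
normalized = observations / observations.max(axis=1, keepdims=True)
print(normalized)
# [[0.5 1. ]
#  [0.5 1. ]
#  [0.5 1. ]
#  [0.5 1. ]
#  [0.5 1. ]]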
The same input state representation, as obtained by an individual agent, should always result in the same output state representation after normalization, regardless of the number of agents currently present in the simulation. So, please make sure to normalize all observations on their own. I'll give an example below.
Also, please make sure to keep track of which agents' observations (and in which order) have been squashed into your variable statappend. This is important for the following reason.
If there are agents A1 through A5, but the overall observation container can take only three observations, three out of the five state representations are going to be selected at random. Say the observations randomly selected to be squashed into the overall observation container stem from the following agents in the following order: A2, A5, A1. Then, these agents' observations will be squashed into the overall observation container in exactly this order: first the observation of A2, then that of A5, and finally that of A1. Correspondingly, given the aforementioned overall observation container, the three actions predicted by your reinforcement learning controller will correspond to agents A2, A5, and A1 (in that order), respectively. In other words, the order of the agents on the input side also dictates which agents the predicted actions correspond to on the output side.
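To make that bookkeeping concrete, here is a minimal sketch (the action values are made up; -1 marks a padded, empty slot, as in the example function below):
import numpy as np

# Actions predicted by the controller, one per slot of the overall observation container
predicted_actions = np.array([0.3, -0.1, 0.7])  # made-up values

# Order in which the agents' observations were squashed into the container: A2, A5, A1
order_agents = [2, 5, 1]

# Dispatch each predicted action to the agent whose observation filled that slot
for slot, agent_id in enumerate(order_agents):
    if agent_id == -1:       # padded slot, no agent behind it
        continue
    print(f"Agent A{agent_id} receives action {predicted_actions[slot]}")
# Agent A2 receives action 0.3
# Agent A5 receives action -0.1
# Agent A1 receives action 0.7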
I would propose something like the following:
import numpy as np

def get_overall_observation(observations, expected_observations=5):
    # Return values:
    #   observations: the overall observation container (padded or sub-sampled as needed)
    #   order_agents: the returned observations stem from this ordered set of agents (in sequence)

    # Get some info
    n_observations = observations.shape[0]            # Actual nr of observations
    observation_size = list(observations.shape[1:])   # Shape of an agent's individual observation

    # Normalize individual observations
    for i in range(n_observations):
        # TODO: handle possible 0-divisions
        observations[i, :] = observations[i, :] / observations[i, :].max()

    if n_observations == expected_observations:
        # Return (normalized) observations as they are & sequence of agents in order (i.e. no randomization)
        order_agents = np.arange(n_observations)
        return observations, order_agents

    if n_observations < expected_observations:
        # Return padded observations & padded sequence of agents in order (i.e. no randomization)
        padded_observations = np.zeros([expected_observations] + observation_size)
        padded_observations[0:n_observations, :] = observations
        order_agents = list(range(n_observations)) + [-1] * (expected_observations - n_observations)  # -1 == agent absent
        return padded_observations, order_agents

    if n_observations > expected_observations:
        # Return a random selection of observations in random order
        order_agents = np.random.choice(range(n_observations), size=expected_observations, replace=False)
        selected_observations = np.zeros([expected_observations] + observation_size)
        for i_selected, i_given_observations in enumerate(order_agents):
            selected_observations[i_selected, :] = observations[i_given_observations, :]
        return selected_observations, order_agents
# Example usage
n_observations = 5 # Number of actual observations
width = height = 2 # Observation dimension
state = np.random.random(size=[n_observations,height,width]) # Random state
print(state)
print(get_overall_observation(state))
I tried different solutions to this problem, but I found that encoding is the best solution for my case.
[1] mentions that the extra connected autonomous vehicles (CAVs) are not included in the state, and if there are fewer than the maximum number of CAVs, the state is padded with zeros. We can select how many agents share their state as an addition to the agent's own state. For the encoder, I use the code from the Neural machine translation with attention tutorial:
import tensorflow as tf

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))
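A rough usage sketch, assuming the agents' states have been discretized and padded into integer token sequences of fixed length (the sizes and values below are arbitrary assumptions for illustration, not values from the post):
import tensorflow as tf

# Arbitrary sizes for illustration
vocab_size = 1000     # number of discrete state tokens (assumes states are discretized)
embedding_dim = 32
enc_units = 64
batch_sz = 1
max_agents = 5        # one sequence position per (padded) agent state

encoder = Encoder(vocab_size, embedding_dim, enc_units, batch_sz)
hidden = encoder.initialize_hidden_state()

# A padded batch of discretized agent states (0 = absent agent)
states = tf.constant([[12, 7, 3, 0, 0]])          # shape: (batch_sz, max_agents)
enc_output, enc_state = encoder(states, hidden)

print(enc_output.shape)  # (1, 5, 64): one encoding per agent slot
print(enc_state.shape)   # (1, 64): fixed-size summary usable as the RL state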
1- Vinitsky, E., Kreidieh, A., Le Flem, L., Kheterpal, N., Jang, K., Wu, C., ... & Bayen, A. M. (2018, October). Benchmarks for reinforcement learning in mixed-autonomy traffic. In Conference on Robot Learning (pp. 399-409).
2- Kochkina, E., Liakata, M., & Augenstein, I. (2017). Turing at SemEval-2017 Task 8: Sequential approach to rumour stance classification with branch-LSTM. arXiv preprint arXiv:1704.07221.
3- Ma, L., & Liang, L. (2020). Enhance CNN Robustness Against Noises for Classification of 12-Lead ECG with Variable Length. arXiv preprint arXiv:2008.03609.
4- How to feed LSTM with different input array sizes?
5- Zhao, X., Xia, L., Zhang, L., Ding, Z., Yin, D., & Tang, J. (2018, September). Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems (pp. 95-103).