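# Fragment of an action-selection function; q_values, EPSILON,
# max_number_of_actions and id_to_action are defined elsewhere.
if np.random.rand() < EPSILON:
    # Exploratory branch: always selects the last action index
    action = max_number_of_actions - 1
    make_action = id_to_action[action]
else:
    # Greedy branch: collect every action within 0.0001 of the best Q-value
    max_q = np.max(q_values)
    actions_argmax = np.arange(max_number_of_actions)[q_values >= max_q - 0.0001]
    # Rank-based weights (1, 1/2, 1/3, ...), normalized; not passed to np.random.choice below
    probs_unnormed = 1/(np.arange(actions_argmax.shape[0]) + 1.)
    probs_unnormed /= np.sum(probs_unnormed)
    # Uniform tie-break among the near-optimal actions
    action = np.random.choice(actions_argmax)
    make_action = id_to_action[action]
return action, make_action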
1- Identify the exploration technique used in this code.
2- Find the Bellman equation that corresponds to this code.
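To experiment with question 1, the fragment can be wrapped in a function. The following is a minimal sketch, not the original author's code: the name select_action, its parameters, the rng argument and the sample id_to_action mapping are assumptions made for illustration. It keeps the structure of the fragment, which is an epsilon-greedy scheme: with probability epsilon an exploratory action is taken, otherwise a greedy action is chosen with ties broken at random. Note that the exploratory branch here always returns the last action index, whereas textbook epsilon-greedy samples the exploratory action uniformly from all actions.

```python
import numpy as np

def select_action(q_values, epsilon, id_to_action, rng):
    """Epsilon-greedy selection mirroring the fragment above (hypothetical wrapper)."""
    q_values = np.asarray(q_values, dtype=float)
    n_actions = q_values.shape[0]
    if rng.random() < epsilon:
        # Exploratory branch, as in the fragment: always the last action index
        action = n_actions - 1
    else:
        # Greedy branch: all actions within 1e-4 of the maximum Q-value
        max_q = np.max(q_values)
        candidates = np.flatnonzero(q_values >= max_q - 1e-4)
        # Uniform tie-break among the near-optimal actions
        action = int(rng.choice(candidates))
    return action, id_to_action[action]

# Hypothetical action names, for illustration only
id_to_action = {0: "left", 1: "stay", 2: "right"}
rng = np.random.default_rng(0)

greedy = select_action([0.1, 0.9, 0.2], epsilon=0.0, id_to_action=id_to_action, rng=rng)
explore = select_action([0.1, 0.9, 0.2], epsilon=1.0, id_to_action=id_to_action, rng=rng)
```

With epsilon=0.0 the call is purely greedy and with epsilon=1.0 it always explores, which makes the two branches easy to inspect separately.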
Let the state at time $t$ be $x_t$. For a decision that begins at time 0, we take as given the initial state $x_0$. At any time, the set of possible actions depends on the current state; we can write this as $a_t \in \Gamma(x_t)$, where the action $a_t$ represents one or more control variables. We also assume that the state changes from $x$ to a new state $T(x,a)$ when action $a$ is taken, and that the current payoff from taking action $a$ in state $x$ is $F(x,a)$. Finally, we assume impatience, represented by a discount factor $0 < \beta < 1$.
Under these assumptions, an infinite-horizon decision problem takes the following form:
$$V(x_0) = \max_{\{a_t\}_{t=0}^{\infty}} \sum_{t=0}^{\infty} \beta^t F(x_t, a_t),$$
subject to the constraints
$$a_t \in \Gamma(x_t), \quad x_{t+1} = T(x_t, a_t), \quad \forall t = 0, 1, 2, \dots$$
Notice that we have defined the notation $V(x_0)$ to denote the optimal value that can be obtained by maximizing this objective function subject to the assumed constraints. This function is the value function. It is a function of the initial state variable $x_0$, since the best value obtainable depends on the initial situation.
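The objective inside the maximization can be made concrete with a short sketch. The code below evaluates the discounted sum of payoffs for one given action sequence over a finite horizon; it does not perform the maximization itself. The function name discounted_value and the toy transition and payoff functions are assumptions chosen purely for illustration.

```python
def discounted_value(x0, actions, T, F, beta):
    """Sum of beta**t * F(x_t, a_t) along the path generated by `actions` from x0."""
    total = 0.0
    x = x0
    for t, a in enumerate(actions):
        total += beta ** t * F(x, a)  # discounted payoff at time t
        x = T(x, a)                   # state transition x_{t+1} = T(x_t, a_t)
    return total

# Hypothetical toy problem: the action is added to the state,
# and the payoff equals the current state.
def T(x, a):
    return x + a

def F(x, a):
    return float(x)

value = discounted_value(0, [1, 1, 1], T, F, beta=0.5)
# Path: x = 0, 1, 2, so value = 0 + 0.5 * 1 + 0.25 * 2 = 1.0
```

The value function V(x_0) is the largest such sum over all admissible action sequences, taken over an infinite horizon.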
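Bellman equation:
The Bellman equation expresses the value of a state as the maximum, over the actions available in that state, of the immediate payoff plus the discounted value of the state that follows: $V(x) = \max_{a \in \Gamma(x)} \{ F(x,a) + \beta V(T(x,a)) \}$. In other words, it is the value of the state under the maximum long-term reward. Because the code works with action values (q_values), the corresponding action-value form is $Q(x,a) = F(x,a) + \beta \max_{a' \in \Gamma(T(x,a))} Q(T(x,a), a')$.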
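As an illustration, the recursion V(x) = max over a in Γ(x) of { F(x,a) + βV(T(x,a)) } can be solved numerically by applying it repeatedly until the values stop changing (value iteration). This is a minimal sketch under stated assumptions: the three-state problem, the function names and the tolerance are hypothetical and are not taken from the question.

```python
import numpy as np

def value_iteration(n_states, gamma_set, T, F, beta, tol=1e-10, max_iter=10_000):
    """Iterate V(x) <- max_a { F(x, a) + beta * V(T(x, a)) } until convergence."""
    V = np.zeros(n_states)
    for _ in range(max_iter):
        V_new = np.array([
            max(F(x, a) + beta * V[T(x, a)] for a in gamma_set(x))
            for x in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Hypothetical toy problem with states 0, 1, 2.
def gamma_set(x):
    # Feasible actions: stay (0) or move right (+1); state 2 can only stay
    return [0, 1] if x < 2 else [0]

def T(x, a):
    return x + a

def F(x, a):
    # Payoff of 1 for each period spent in state 2, otherwise 0
    return 1.0 if x == 2 else 0.0

V = value_iteration(3, gamma_set, T, F, beta=0.5)
```

For this toy problem the fixed point can be checked by hand: V(2) = 1 + 0.5 V(2) gives V(2) = 2, then V(1) = 0.5 V(2) = 1 and V(0) = 0.5 V(1) = 0.5.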