Sam Stites

RL Notes (short)

July 11, 2018

Things to remember:

  • discounted return is defined with respect to future rewards, even though, pragmatically, it is often implemented as an eligibility trace (the backward view). Only having to look forward from the current state works because of the Markov property.
  • γ = 0 implies a myopic agent – one that only cares about immediate reward.
  • γ = 1 implies a farsighted agent – one that cares about optimizing future reward.
  • 0 < γ < 1 implies that an agent cares about the future to varying degrees. Cart-pole is an environment where we can model the reward in two different ways, and each choice constrains which values of γ an RL agent can use.
    • In one formulation the reward is +1 for every timestep the pole stays up. In this case any positive discount works: reward keeps accumulating as the agent steps through time, so longer episodes yield larger returns.
    • Alternatively, we can set the reward to 0 at every step and −1 on failure. In this case the agent must use a discount rate with 0 < γ < 1. At γ = 0 the agent fails to learn anything (every immediate reward before the final timestep is 0, so there is no signal to learn from), and at γ = 1 every episode has a return of −1, so the agent treats all episode lengths the same. With 0 < γ < 1, the return is (−1) ⋅ γ^t for a failure at timestep t, which is pushed toward 0 by making t as large as possible (the first sketch after this list makes this concrete).
    • In general we choose a discount rate closer to 1 than to 0, like 0.9; otherwise the agent will be shortsighted.
  • The simplest policy is a deterministic one. This is usually the first set of rules/heuristics you would find in a naive business solution. Bumping these up to stochastic policies can be as simple as writing down an MDP and as complicated as function approximation.
  • Definition: π′ ≥ π if and only if v_π′(s) ≥ v_π(s) for all s ∈ S. Following this we get that an optimal policy π* is one where π* ≥ π for all π.
  • When you talk about action-value functions for a deterministic policy, you are asking about the value of the route the policy actually takes from each state, i.e. v_π(s) = q_π(s, π(s)). This is an example of how you can compare state-value and action-value functions directly, but you can only do this for a deterministic policy; a stochastic policy needs an expectation over actions (the second sketch after this list checks this identity).
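
To make the discount-rate discussion concrete, here is a minimal Python sketch of the forward-view return G = Σ_t γ^t ⋅ r_t under both cart-pole reward conventions above. It is not a cart-pole simulation: an episode that lasts T steps before the pole falls is just a list of rewards, and the helper name discounted_return is my own.

    def discounted_return(rewards, gamma):
        """Forward-view return: G = sum over t of gamma**t * r_t."""
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    # Two cart-pole reward conventions for an episode lasting T steps
    # before the pole falls:
    #   (a) +1 for every timestep the pole stays up
    #   (b) 0 at every step, -1 on the final (failure) step
    for T in (10, 100):
        plus_one = [1.0] * T
        minus_one = [0.0] * (T - 1) + [-1.0]
        for gamma in (0.0, 0.9, 1.0):
            print(f"T={T:3d} gamma={gamma:.1f}  "
                  f"+1 scheme: G={discounted_return(plus_one, gamma):8.3f}  "
                  f"-1 scheme: G={discounted_return(minus_one, gamma):8.3f}")

With the −1 scheme the printout shows the three regimes above: γ = 0 gives G = 0 for any episode longer than one step (no learning signal), γ = 1 gives G = −1 regardless of T (all episode lengths look the same), and γ = 0.9 gives G = −0.9^(T−1), which moves toward 0 as the agent keeps the pole up longer.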
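
For the last two bullets, here is a second sketch: a tiny two-state, two-action MDP whose transitions and rewards are made up purely for illustration. It solves v_π exactly for two deterministic policies, checks the π′ ≥ π ordering state by state, and confirms that v_π(s) = q_π(s, π(s)) for a deterministic policy.

    import numpy as np

    # A tiny hypothetical 2-state, 2-action MDP; the numbers below are
    # made up purely to illustrate the definitions.
    n_states, gamma = 2, 0.9
    P = np.array([[0, 1],          # P[s, a] = next state (deterministic dynamics)
                  [0, 1]])
    R = np.array([[0.0, 1.0],      # R[s, a] = immediate reward
                  [0.0, 2.0]])

    def v_pi(pi):
        """Solve v = R_pi + gamma * P_pi v exactly for a deterministic policy pi."""
        P_pi = np.zeros((n_states, n_states))
        R_pi = np.array([R[s, pi[s]] for s in range(n_states)])
        for s in range(n_states):
            P_pi[s, P[s, pi[s]]] = 1.0
        return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

    def q_pi(pi, v):
        """q_pi(s, a) = R(s, a) + gamma * v_pi(next state)."""
        return R + gamma * v[P]

    pi_a = np.array([0, 0])        # always take action 0
    pi_b = np.array([1, 1])        # always take action 1
    v_a, v_b = v_pi(pi_a), v_pi(pi_b)

    # pi_b >= pi_a iff v_b(s) >= v_a(s) for every state s.
    print("v_a =", v_a, " v_b =", v_b, " pi_b >= pi_a:", bool(np.all(v_b >= v_a)))

    # For a deterministic policy, v_pi(s) == q_pi(s, pi(s)).
    q_b = q_pi(pi_b, v_b)
    print(np.allclose(v_b, q_b[np.arange(n_states), pi_b]))

The ordering check mirrors the definition above: π_b ≥ π_a here because its state-value function is at least as large in every state, and the final check recovers v_π by reading the action-value function at the action the deterministic policy takes.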