Things to remember:

- The discounted return is computed with respect to *the future*, even though, pragmatically, it is often implemented as an eligibility trace. This works because of the Markov property.
  - *γ* = 0 implies a *myopic agent* – one that only cares about immediate reward.
  - *γ* = 1 implies a *farsighted agent* – one that cares about optimizing all future reward.
  - 0 < *γ* < 1 implies an agent that cares about the future to varying degrees.
- Cart-pole is an environment where we can model the reward in two different ways, and each requires a different range of *γ* for an RL agent to learn.
  - In one formulation the reward is +1 per timestep. Any positive discount then works: reward accumulates as the agent steps through time, so longer episodes are worth more.
  - Alternatively, we can set the reward to be -1 on failure (and 0 otherwise). In this case the agent *must* use a discount rate of 0 < *γ* < 1. With *γ* = 0 the agent fails to learn anything: every immediate reward before the final timestep is 0, so the failure signal never propagates back. With *γ* = 1 the agent treats all episode lengths the same, since the undiscounted penalty is -1 no matter when failure occurs. With 0 < *γ* < 1, the discounted penalty ( − 1) ⋅ *γ*^{t} shrinks toward 0 as *t* grows, so the agent prefers to delay failure as long as possible.
- In general we choose a discount rate closer to 1 than to 0, like 0.9; otherwise the agent will be shortsighted.
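The effect of *γ* on the -1 failure reward above can be sketched in a few lines. This is a minimal illustration, not tied to any RL library; the function name is a hypothetical helper.

```python
def discounted_failure_return(t: int, gamma: float) -> float:
    """Discounted value, seen from now, of a -1 reward received t steps ahead."""
    return -1.0 * gamma ** t

# gamma = 0: every future failure is worth 0, so the agent learns nothing.
# gamma = 1: failing at t=10 and t=1000 both cost exactly -1.
# 0 < gamma < 1: later failures cost less, so longer episodes are preferred.
for gamma in (0.0, 1.0, 0.9):
    early = discounted_failure_return(10, gamma)
    late = discounted_failure_return(1000, gamma)
    print(f"gamma={gamma}: fail@10 -> {early:.4f}, fail@1000 -> {late:.4f}")
```

With *γ* = 0.9, failing at t = 1000 is discounted to nearly 0 while failing at t = 10 still costs about -0.35, which is exactly the incentive to keep the pole up longer.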

- The simplest policy is a deterministic one. This is usually the first set of rules/heuristics you would find in a naive business solution. Bumping these up to stochastic policies can be as simple as a tabular MDP and as complicated as function approximation.
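The contrast between the two policy types can be sketched concretely. The states and actions below are invented placeholders for a "naive business solution":

```python
import random

# A deterministic policy: each state maps to exactly one action (a rule table).
deterministic_policy = {"low_stock": "reorder", "high_stock": "hold"}

# A stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "low_stock": {"reorder": 0.9, "hold": 0.1},
    "high_stock": {"reorder": 0.05, "hold": 0.95},
}

def act_deterministic(state: str) -> str:
    """Look up the single action prescribed for this state."""
    return deterministic_policy[state]

def act_stochastic(state: str) -> str:
    """Sample an action from the state's action distribution."""
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```

A deterministic policy is just the special case where all probability mass sits on one action per state, which is why heuristic rule tables upgrade so naturally.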
- Definition: *π*′ ≥ *π* if and only if *v*_{π′}(*s*) ≥ *v*_{π}(*s*) for all *s* ∈ *S*. From *this* definition, an optimal policy *π** is one where *π** ≥ *π* for all *π*.
- When you talk about the action-value function of a deterministic policy, you are asking about the value of the one route taken at each state: *v*_{π}(*s*) = *q*_{π}(*s*, *π*(*s*)). This is an example of how you can compare state-value and action-value functions directly, but you can only do this for a deterministic policy.
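The policy-ordering definition above translates directly into code. This is a toy sketch: the state values are assumed numbers, not the result of any evaluation algorithm.

```python
def policy_geq(v_pi_prime: dict, v_pi: dict) -> bool:
    """pi' >= pi iff v_{pi'}(s) >= v_{pi}(s) at every state s."""
    return all(v_pi_prime[s] >= v_pi[s] for s in v_pi)

# Two hypothetical policies' state-value functions over the same state set.
v_a = {"s0": 1.0, "s1": 2.0}
v_b = {"s0": 0.5, "s1": 2.0}

print(policy_geq(v_a, v_b))  # a dominates b at every state
print(policy_geq(v_b, v_a))  # b is strictly worse at s0
```

Note the ordering is only partial: if each policy is better in a different state, neither `policy_geq(a, b)` nor `policy_geq(b, a)` holds, yet an optimal *π** dominating all policies still exists for finite MDPs.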