Some quick reminders:

**On Dropout:** Don’t forget to disable dropout at validation/inference time! We want the regularization while training, but at inference we want the full power of our model.

**Autoencoders:** Currently, these *aren’t* good at lossless compression (unsurprising, since they output probabilistic data structures), and *aren’t* great at video compression either (which I mention because video could probably tolerate *some* loss of information).

- That said, if I were to try solving that problem, I would combine autoencoders with neural architecture search, using model size + compression size (+ accuracy) as the bandit reward.
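The dropout reminder above can be sketched in a few lines — a minimal "inverted dropout" helper (a hypothetical function, not from any particular library) that masks units during training and is the identity at inference:

```python
import numpy as np

def dropout(x, p_drop, training):
    """Inverted dropout: during training, zero each unit with probability
    p_drop and rescale survivors by 1/(1-p_drop) so expected activations
    match between training and inference. At inference, do nothing."""
    if not training or p_drop == 0.0:
        return x  # inference: use the full model, no masking
    mask = np.random.rand(*x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

x = np.ones(5)
train_out = dropout(x, 0.5, training=True)   # some units zeroed, rest scaled to 2.0
eval_out = dropout(x, 0.5, training=False)   # identical to x
```

Frameworks typically toggle this for you (e.g. a train/eval mode switch), which is exactly the "remove dropout on validation" step.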

**RNNs:** It seems like it’s synonymous to say that a “generated sequence” is the same as “applying recursion.” Intuitively, this seems incorrect, since a sequence does not account for tree structures or fancier things like mutually recursive structures, but maybe I am missing something here.

- When using least squares, normally we assume an i.i.d. process: $\sum_{t=1} (s_t - f(s_{t-1}))^2$. However, when injecting temporal dependence with a recursive function, least squares becomes $\sum_{t=1} (s_t - f(h_{t-1}, s_{t-1}))^2$, where $h_{t-1}$ is the hidden state of $f$ applied to the prior timestep.
- This $h$, the hidden state, is the “memory” and requires an initial object (or co-terminal object). Using this terminology, we can infer that the final, ideal parameters are the terminal object.
- Linear combinations are how we keep things in memory.
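The two losses above can be made concrete with a toy linear $f$ (all names and the specific update rule here are illustrative assumptions, not from the notes). The recurrent version threads a hidden state through time, updated as a linear combination — matching the "memory" bullet:

```python
import numpy as np

def loss_iid(s, f):
    # sum_t (s_t - f(s_{t-1}))^2 : each residual depends only on the previous value
    return sum((s[t] - f(s[t - 1])) ** 2 for t in range(1, len(s)))

def loss_recurrent(s, f, h0):
    # sum_t (s_t - f(h_{t-1}, s_{t-1}))^2 : f also consumes and updates
    # a hidden state h, seeded with an initial object h0
    h, total = h0, 0.0
    for t in range(1, len(s)):
        pred, h = f(h, s[t - 1])
        total += (s[t] - pred) ** 2
    return total

s = np.array([1.0, 2.0, 3.0, 4.0])
iid = loss_iid(s, lambda x: x)  # predict s_t = s_{t-1}; residuals 1+1+1 = 3.0
# hidden state kept as a linear combination of old memory and new input:
rec = loss_recurrent(s, lambda h, x: (0.5 * h + 0.5 * x, 0.9 * h + 0.1 * x), h0=0.0)
```

The `h0` argument is the "initial object" the note mentions; training an RNN amounts to choosing parameters of `f` (and possibly `h0`) to shrink the recurrent loss.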


**Question:** Is there a notion of things being “approximate” in CT? For instance, a neural network starts from an initial parameter space and moves towards an ideal, terminal parameter space — but it will never get there. The same holds for reinforcement learning and approximate optimization (which is probably the most formal of these three research areas).

Also: