Policy Gradient Methods
Policy gradient methods are part of the family of policy-based methods.
Existing policy gradient variants that apply to both continuous and discrete domains, such as the on-policy asynchronous advantage actor-critic (A3C) of Mnih et al. (2016), are sample-inefficient.
The best explanation of policy gradient is probably David Silver's UCL lecture on policy gradient methods: https://www.youtube.com/watch?v=KHZVXao4qXs
This post highlights how policy gradient can be seen as a way to do supervised learning without a true label: https://amoudgl.github.io/blog/policy-gradient/
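The supervised-learning view can be made concrete with a small sketch. It is not taken from the linked post: the function names (`cross_entropy_grad`, `policy_gradient_step`, `run_bandit`) and the toy multi-armed bandit setup are assumptions chosen for illustration. The idea is that the sampled action plays the role of the label, and the return scales the usual cross-entropy gradient.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_grad(logits, label):
    """Gradient of -log softmax(logits)[label] with respect to the logits."""
    grad = softmax(logits)
    grad[label] -= 1.0
    return grad

def policy_gradient_step(logits, action, ret, lr=0.1):
    """REINFORCE-style update: a supervised step on the sampled action,
    scaled by the return (no baseline, for simplicity)."""
    return logits - lr * ret * cross_entropy_grad(logits, action)

def run_bandit(rewards, steps=2000, lr=0.1, seed=0):
    """Hypothetical toy problem: a bandit with fixed, deterministic rewards."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(len(rewards))
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choice(len(rewards), p=probs)  # sampled "label"
        logits = policy_gradient_step(logits, action, rewards[action], lr)
    return softmax(logits)

if __name__ == "__main__":
    # Final action probabilities after training on three arms.
    print(run_bandit(np.array([0.0, 1.0, 0.2])))
```

With a return of 1 the update is exactly one gradient step of supervised classification toward the sampled action; a return of 0 leaves the policy unchanged, and a negative return pushes probability away from that action.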
Last update: April 9, 2020