1. Concepts
agent, state, policy, action, reward, return, discounted return
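A minimal sketch of the discounted return, accumulated backwards over a made-up reward sequence (gamma and the rewards are illustrative values, not from the notes):

```python
# Discounted return: G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    # accumulate from the last reward backwards: G_t = r + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0, 0, 1], gamma=0.9))  # 0.81
```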
2. Bellman equation
state value, action value
abstracted as the immediate reward plus the discounted sum of subsequent rewards
matrix form (closed-form solution); iterative computation converges
Bellman optimality equation
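A small sketch of both solution methods on an invented 3-state MDP under a fixed policy: the closed-form matrix solution v = (I - γP)⁻¹ r and the iterative update v_{k+1} = r + γ P v_k, which converges to the same vector (the reward vector r and transition matrix P are made up):

```python
import numpy as np

# Made-up 3-state example under a fixed policy:
# r[s]    = expected immediate reward in state s
# P[s,s'] = transition probability from s to s' under the policy
r = np.array([1.0, 0.0, -1.0])
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
gamma = 0.9

# Closed-form (matrix) solution: v = (I - gamma * P)^{-1} r
v_closed = np.linalg.solve(np.eye(3) - gamma * P, r)

# Iterative solution: v_{k+1} = r + gamma * P v_k, converges to the same vector
v = np.zeros(3)
for _ in range(500):
    v = r + gamma * P @ v

print(v_closed, v)  # the two estimates agree up to numerical error
```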
3. Model-based
policy iteration: policy evaluation → policy improvement
value iteration
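A compact sketch of value iteration on a randomly generated MDP whose model (P, R) is known; policy iteration would instead alternate a full policy evaluation with the greedy improvement step (all numbers here are invented):

```python
import numpy as np

# Value iteration on a tiny made-up MDP with known model (model-based).
# P[a, s, s'] = transition probability, R[a, s] = expected reward for action a in state s.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.normal(size=(n_actions, n_states))

v = np.zeros(n_states)
for _ in range(1000):
    # q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * v[s']
    q = R + gamma * P @ v
    v_new = q.max(axis=0)          # greedy value update (Bellman optimality)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

policy = q.argmax(axis=0)          # greedy policy extracted from q
print(v, policy)
```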
4. Model-free
Monte Carlo
sample averages replace the expectation
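A sketch of Monte Carlo estimation of action values, where the average of sampled returns replaces the expectation; `sample_episode` is a hypothetical callable standing in for whatever generates episodes:

```python
from collections import defaultdict

def mc_action_values(sample_episode, num_episodes=1000, gamma=0.9):
    """Every-visit Monte Carlo: estimate q(s, a) as the average of sampled
    returns. `sample_episode` is a hypothetical callable that returns one
    episode as a list of (state, action, reward) tuples."""
    total = defaultdict(float)   # sum of returns observed after (s, a)
    count = defaultdict(int)     # number of returns observed after (s, a)
    for _ in range(num_episodes):
        g = 0.0
        # walk the episode backwards so g is the return following each step
        for s, a, r in reversed(sample_episode()):
            g = r + gamma * g
            total[(s, a)] += g
            count[(s, a)] += 1
    # the average of sampled returns replaces the expectation q(s,a) = E[G | s,a]
    return {sa: total[sa] / count[sa] for sa in total}
```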
5. Stochastic approximation
approximate the expectation online, as the iteration proceeds
Robbins-Monro (RM) algorithm
TD algorithms (exploring starts, ε-greedy) (state value)
Sarsa, Q-learning (action value)
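A sketch connecting the Robbins-Monro-style incremental step to the tabular Q-learning (TD) update; the tiny Q table and the single transition are invented for illustration (Sarsa would bootstrap on the action actually taken instead of the max):

```python
def rm_update(x, noisy_target, alpha):
    """Robbins-Monro / TD-style step: move the current estimate a small step
    toward a noisy sample of the quantity whose expectation we want."""
    return x + alpha * (noisy_target - x)

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Tabular Q-learning: the TD target bootstraps on max_a' Q(s', a')
    (off-policy); Sarsa would instead use Q(s', a') for the sampled action."""
    td_target = r + gamma * max(Q[s_next].values())
    Q[s][a] = rm_update(Q[s][a], td_target, alpha)

# made-up Q table and one made-up transition
Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
q_learning_step(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0][1])  # 0.1 after one update
```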
6. Function approximation
value function approximation + policy update
DQN (MLP)
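A hedged PyTorch sketch of the DQN pieces: an MLP Q-network, a periodically synced target network, and the TD-target regression loss; dimensions, hyperparameters, and the minibatch are made up, and the replay buffer / ε-greedy exploration are omitted:

```python
import torch
import torch.nn as nn

# Minimal DQN-style pieces (made-up dimensions; replay buffer and exploration omitted).
obs_dim, n_actions, gamma = 4, 2, 0.99

def mlp():
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net, target_net = mlp(), mlp()
target_net.load_state_dict(q_net.state_dict())   # periodically synced copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_loss(s, a, r, s_next, done):
    # Q(s, a) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # TD target bootstraps on the frozen target network
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)

# one gradient step on a fake minibatch
batch = 32
loss = dqn_loss(torch.randn(batch, obs_dim),
                torch.randint(0, n_actions, (batch,)),
                torch.randn(batch),
                torch.randn(batch, obs_dim),
                torch.zeros(batch))
optimizer.zero_grad(); loss.backward(); optimizer.step()
```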
7. Policy function (policy gradient)
objective function
policy gradient
REINFORCE
value update (MC) → policy update
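A hedged PyTorch sketch of REINFORCE: Monte Carlo returns computed from one episode weight the log-probabilities of the sampled actions (the policy gradient); the network size and the fake episode are assumptions:

```python
import torch
import torch.nn as nn

# REINFORCE sketch (made-up dimensions; episode data would come from rollouts).
obs_dim, n_actions, gamma = 4, 2, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards):
    # Monte Carlo value estimate: discounted return G_t for every step of the episode
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    # Policy update: ascend E[G_t * grad log pi(a_t | s_t)]
    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    loss = -(returns * log_probs).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# fake 5-step episode
reinforce_update(torch.randn(5, obs_dim),
                 torch.randint(0, n_actions, (5,)),
                 [0.0, 0.0, 1.0, 0.0, 1.0])
```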
8. Actor-critic
value update → TD
A2C: advantage function (baseline) → estimated by the TD error
importance sampling
on-policy vs. off-policy
deterministic policy function (DPG)
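A hedged PyTorch sketch of one-step advantage actor-critic: the critic is updated by TD, and its TD error serves as the advantage estimate that weights the actor's policy gradient; network sizes and the fake batch of transitions are made up:

```python
import torch
import torch.nn as nn

# One-step advantage actor-critic sketch (made-up dimensions and transitions).
obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def a2c_step(s, a, r, s_next, done):
    v, v_next = critic(s).squeeze(-1), critic(s_next).squeeze(-1)
    # TD error = r + gamma * v(s') - v(s); it estimates the advantage A(s, a)
    td_error = r + gamma * (1 - done) * v_next.detach() - v
    critic_loss = td_error.pow(2).mean()                   # value update via TD
    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(td_error.detach() * log_prob).mean()    # policy update weighted by TD error
    opt.zero_grad(); (critic_loss + actor_loss).backward(); opt.step()

# one update on a fake batch of transitions
b = 8
a2c_step(torch.randn(b, obs_dim), torch.randint(0, n_actions, (b,)),
         torch.randn(b), torch.randn(b, obs_dim), torch.zeros(b))
```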
9. PPO, DPO, GRPO
reward model → return → state value
GAE (generalized advantage estimation)
KL divergence
critic model: bootstrap + GAE + return
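A sketch of GAE (an exponentially weighted sum of TD errors) and the PPO clipped surrogate built on the importance-sampling ratio; the trajectory values and hyperparameters are illustrative, and the KL penalty / reward-model details are omitted:

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation: exponentially weighted sum of TD errors.
    `values` has one extra entry for the bootstrap value of the final state."""
    advantages, adv = [], 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        adv = delta + gamma * lam * adv
        advantages.append(adv)
    advantages = torch.tensor(list(reversed(advantages)))
    # critic regression target: bootstrap + GAE = return-like value target
    returns = advantages + torch.tensor(values[:-1])
    return advantages, returns

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, eps=0.2):
    """PPO clipped surrogate: importance-sampling ratio between new and old
    policy, clipped so a single update cannot move the policy too far."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# made-up 4-step trajectory (last value entry is the bootstrap value)
adv, ret = gae(rewards=[0.0, 0.0, 1.0, 0.0], values=[0.1, 0.2, 0.3, 0.1, 0.0])
print(adv, ret)
```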
10. MARL (multi-agent RL)
competitive, cooperative, mixed, self-interested
centralized, decentralized; centralized training (shared critic) with decentralized execution
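A minimal sketch of centralized training with decentralized execution: each agent's actor sees only its local observation, while a single shared critic sees the joint observations and actions during training (all sizes invented):

```python
import torch
import torch.nn as nn

# Centralized training / decentralized execution sketch (made-up sizes).
n_agents, obs_dim, n_actions = 2, 4, 3

# Decentralized actors: each agent acts only on its own local observation.
actors = [nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
          for _ in range(n_agents)]

# Shared centralized critic: during training it sees all observations and actions.
critic = nn.Sequential(nn.Linear(n_agents * (obs_dim + n_actions), 64),
                       nn.ReLU(), nn.Linear(64, 1))

local_obs = [torch.randn(1, obs_dim) for _ in range(n_agents)]
# Execution: each agent picks its action distribution from its own actor only.
actions = [torch.softmax(actor(o), dim=-1) for actor, o in zip(actors, local_obs)]
# Training: the shared critic scores the joint state-action.
joint = torch.cat(local_obs + actions, dim=-1)
print(critic(joint).shape)  # torch.Size([1, 1])
```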

