Content Overview
This book starts from the most basic concepts of reinforcement learning and introduces the fundamental analytical tools, including the Bellman equation and the Bellman optimality equation. It then extends to model-based and model-free reinforcement learning algorithms, and finally to reinforcement learning methods based on function approximation. The book emphasizes introducing concepts, analyzing problems, and analyzing algorithms from a mathematical perspective; it does not emphasize the programming implementation of the algorithms. No background in reinforcement learning is required of the reader, only some knowledge of probability theory and linear algebra. For readers who already have a foundation in reinforcement learning, the book can deepen their understanding of certain topics and offer new perspectives.
The book is intended for undergraduate and graduate students, researchers, and practitioners in companies or research institutes who are interested in reinforcement learning.
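For reference, here is a sketch of the Bellman equation mentioned above, written in its standard state-value form; the notation used here (policy \pi, discount rate \gamma, state value v_\pi, model p) is the conventional one, and the book develops its own notation and full derivation in Chapter 2:

v_\pi(s) = \sum_{a} \pi(a \mid s) \Big[ \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \Big], \qquad \forall s \in \mathcal{S}.

In words, the value of a state equals the expected immediate reward plus the discounted value of the next state, averaged over the policy's action choices and the environment's transitions.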
About the Author
Shiyu Zhao (赵世钰) is a distinguished researcher in the AI division of the School of Engineering at Westlake University, head of the Intelligent Unmanned Systems Laboratory, and a recipient of the Young Talent award under the national overseas high-level talent recruitment program. He received his bachelor's and master's degrees from Beihang University and his Ph.D. from the National University of Singapore, and was previously a Lecturer in the Department of Automatic Control and Systems Engineering at the University of Sheffield, UK. His work focuses on developing interesting, useful, and challenging next-generation robotic systems, with emphasis on control, decision-making, and perception in multi-robot systems.
Table of Contents
Overview of this Book
Chapter 1 Basic Concepts
1.1 A grid world example
1.2 State and action
1.3 State transition
1.4 Policy
1.5 Reward
1.6 Trajectories, returns, and episodes
1.7 Markov decision processes
1.8 Summary
1.9 Q&A
Chapter 2 State Values and the Bellman Equation
2.1 Motivating example 1: Why are returns important?
2.2 Motivating example 2: How to calculate returns?
2.3 State values
2.4 The Bellman equation
2.5 Examples for illustrating the Bellman equation
2.6 Matrix-vector form of the Bellman equation
2.7 Solving state values from the Bellman equation
2.7.1 Closed-form solution
2.7.2 Iterative solution
2.7.3 Illustrative examples
2.8 From state value to action value
2.8.1 Illustrative examples
2.8.2 The Bellman equation in terms of action values
2.9 Summary
2.10 Q&A
Chapter 3 Optimal State Values and the Bellman Optimality Equation
3.1 Motivating example: How to improve policies?
3.2 Optimal state values and optimal policies
3.3 The Bellman optimality equation
3.3.1 Maximization of the right-hand side of the BOE
3.3.2 Matrix-vector form of the BOE
3.3.3 Contraction mapping theorem
3.3.4 Contraction property of the right-hand side of the BOE
3.4 Solving an optimal policy from the BOE
3.5 Factors that influence optimal policies
3.6 Summary
3.7 Q&A
Chapter 4 Value Iteration and Policy Iteration
4.1 Value iteration
4.1.1 Elementwise form and implementation
4.1.2 Illustrative examples
4.2 Policy iteration
4.2.1 Algorithm analysis
4.2.2 Elementwise form and implementation
4.2.3 Illustrative examples
4.3 Truncated policy iteration
4.3.1 Comparing value iteration and policy iteration
4.3.2 Truncated policy iteration algorithm
4.4 Summary
4.5 Q&A
Chapter 5 Monte Carlo Methods
5.1 Motivating example: Mean estimation
5.2 MC Basic: The simplest MC-based algorithm
5.2.1 Converting policy iteration to be model-free
5.2.2 The MC Basic algorithm
5.2.3 Illustrative examples
5.3 MC Exploring Starts
5.3.1 Utilizing samples more efficiently
5.3.2 Updating policies more efficiently
5.3.3 Algorithm description
5.4 MC ε-Greedy: Learning without exploring starts
5.4.1 ε-greedy policies
5.4.2 Algorithm description
5.4.3 Illustrative examples
5.5 Exploration and exploitation of ε-greedy policies
5.6 Summary
5.7 Q&A
Chapter 6 Stochastic Approximation
6.1 Motivating example: Mean estimation
6.2 Robbins-Monro algorithm
6.2.1 Convergence properties
6.2.2 Application to mean estimation
6.3 Dvoretzky's convergence theorem
6.3.1 Proof of Dvoretzky's theorem
6.3.2 Application to mean estimation
6.3.3 Application to the Robbins-Monro theorem
6.3.4 An extension of Dvoretzky's theorem
6.4 Stochastic gradient descent
6.4.1 Application to mean estimation
6.4.2 Convergence pattern of SGD
6.4.3 A deterministic formulation of SGD
6.4.4 BGD, SGD, and mini-batch GD
6.4.5 Convergence of SGD
6.5 Summary
6.6 Q&A
Chapter 7 Temporal-Difference Methods
7.1 TD learning of state values
7.1.1 Algorithm description
7.1.2 Property analysis
7.1.3 Convergence analysis
7.2 TD learning of action values: Sarsa
7.2.1 Algorithm description
7.2.2 Optimal policy learning via Sarsa
7.3 TD learning of action values: n-step Sarsa
7.4 TD learning of optimal action values: Q-learning
7.4.1 Algorithm description
7.4.2 Off-policy vs. on-policy
7.4.3 Implementation
7.4.4 Illustrative examples
7.5 A unified viewpoint
7.6 Summary
7.7 Q&A
Chapter 8 Value Function Approximation
8.1 Value representation: From table to function
8.2 TD learning of state values with function approximation
8.2.1 Objective function
8.2.2 Optimization algorithms
8.2.3 Selection of function approximators
8.2.4 Illustrative examples
8.2.5 Theoretical analysis
8.3 TD learning of action values with function approximation
8.3.1 Sarsa with function approximation
8.3.2 Q-learning with function approximation
8.4 Deep Q-learning
8.4.1 Algorithm description
8.4.2 Illustrative examples
8.5 Summary
8.6 Q&A
Chapter 9 Policy Gradient Methods
9.1 Policy representation: From table to function
9.2 Metrics for defining optimal policies
9.3 Gradients of the metrics
9.3.1 Derivation of the gradients in the discounted case
9.3.2 Derivation of the gradients in the undiscounted case
9.4 Monte Carlo policy gradient (REINFORCE)
9.5 Summary
9.6 Q&A
Chapter 10 Actor-Critic Methods
10.1 The simplest actor-critic algorithm (QAC)
10.2 Advantage actor-critic (A2C)
10.2.1 Baseline invariance
10.2.2 Algorithm description
10.3 Off-policy actor-critic
10.3.1 Importance sampling
10.3.2 The off-policy policy gradient theorem
10.3.3 Algorithm description
10.4 Deterministic actor-critic
10.4.1 The deterministic policy gradient theorem
10.4.2 Algorithm description
10.5 Summary
10.6 Q&A
Appendix A Preliminaries for Probability Theory
Appendix B Measure-Theoretic Probability Theory
Appendix C Convergence of Sequences
C.1 Convergence of deterministic sequences
C.2 Convergence of stochastic sequences
Appendix D Preliminaries for Gradient Descent
Bibliography
Symbols
Index