
An Introduction to Reinforcement Learning

2017-11-27 15:17 | Personal category: Reinforcement Learning | System category: Research Notes

I. Disciplines Involved in Reinforcement Learning and Its Characteristics

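Reinforcement learning draws on many disciplines, including mathematics, engineering, computer science, neuroscience, psychology, and economics, as shown in the figure below. It is one of the main branches of machine learning, sitting between supervised learning and unsupervised learning. Compared with other machine learning paradigms, its main characteristics are: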

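(1) There is no supervisor during learning, only a reward signal;

(2) Feedback is delayed rather than instantaneous;

(3) Time matters: the data are sequential, and learning is a sequential decision-making process;

(4) The agent's actions affect the subsequent data it receives.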

II. Examples of Reinforcement Learning Applications

Reinforcement learning has a wide range of applications, for example: flying stunt manoeuvres in a helicopter, defeating the world champion at Backgammon, managing an investment portfolio, controlling a power station, making a humanoid robot walk, and playing many different Atari games better than humans.

III. A Brief Look at the Key Concepts

The key concepts in reinforcement learning include the reward, the agent, the environment, and the state.

(1) Rewards and the Reward Hypothesis

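The reward R_t is a scalar feedback signal that indicates how well the agent is doing at time step t. The agent's job is to maximise cumulative reward. Reinforcement learning is therefore built on the reward hypothesis, which states: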

“All goals can be described by the maximisation of expected cumulative reward”

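In other words, the agent selects actions to maximise total future reward. Actions may have long-term consequences and reward may be delayed, so it can be better to sacrifice immediate reward to gain more long-term reward. A financial investment, for example, may take many months to pay off.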
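To make the trade-off concrete, here is a minimal sketch in Python. The two reward sequences are made-up numbers, not data from any real problem; they only illustrate that a plan with a worse immediate reward can still have a higher total reward.

```python
# Hypothetical reward sequences R_1..R_5 for two plans (illustrative numbers only).
greedy_rewards = [5, 0, 0, 0, 0]       # take the profit immediately
patient_rewards = [-2, -2, 0, 0, 12]   # pay a cost first, profit later


def total_reward(rewards):
    """Cumulative (undiscounted) reward of one reward sequence."""
    return sum(rewards)


greedy_total = total_reward(greedy_rewards)
patient_total = total_reward(patient_rewards)
print("greedy:", greedy_total, "patient:", patient_total)
```

The patient plan starts with a lower reward but ends with the larger total, which is the quantity the agent is trying to maximise.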

(2) Environments and Agents

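The environment and the agent are the two main components of a reinforcement learning problem. At each time step t, the agent receives an observation and a reward from the environment and then executes an action; the environment receives that action from the agent and emits the next observation and reward. The process is shown in the figure below.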
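The interaction loop can be sketched in a few lines of Python. Everything here is hypothetical and chosen only for illustration: the coin-guessing environment, the `RepeatLastAgent` class, and the method names `step` and `act` are not from the original notes or from any particular library.

```python
import random


class CoinFlipEnvironment:
    """Hypothetical toy environment: the agent guesses the next coin flip."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._coin = self._rng.choice(["H", "T"])

    def step(self, action):
        """Receive action A_t, emit observation O_{t+1} and reward R_{t+1}."""
        reward = 1 if action == self._coin else 0
        observation = self._coin                    # reveal the flip just guessed
        self._coin = self._rng.choice(["H", "T"])   # flip again for the next round
        return observation, reward


class RepeatLastAgent:
    """Hypothetical agent: guesses whatever it observed last."""

    def act(self, observation, reward):
        return observation if observation is not None else "H"


def run(env, agent, steps):
    observation, reward = None, 0   # nothing has been observed before the first action
    history = []
    for _ in range(steps):
        action = agent.act(observation, reward)    # agent: O_t, R_t -> A_t
        history.append((observation, reward, action))
        observation, reward = env.step(action)     # environment: A_t -> O_{t+1}, R_{t+1}
    return history


history = run(CoinFlipEnvironment(seed=0), RepeatLastAgent(), steps=10)
```

Each entry of `history` holds what the agent saw and what it did at one time step, which is the raw material for the history and state discussed next.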

(3) History and State

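The history is the sequence of observations, rewards, and actions so far. Formally:

H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t

What happens next depends on the history:

1. The agent selects its actions;

2. The environment selects observations and rewards.

The state is a function of the history:

S_t = f(H_t)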

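The state is further divided into environment state and agent state:

environment state (S_t^e): the environment's private representation, that is, whatever data the environment uses to pick the next observation and reward. It is usually not visible to the agent, and even when it is visible it may contain irrelevant information;
agent state (S_t^a): the agent's internal representation, that is, whatever information the agent uses to pick the next action;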

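By the Markov property, the next state depends only on the current state and not on the rest of the history. A state S_t is Markov if and only if

P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]

Both the environment state S_t^e and the history H_t are Markov.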

When the environment is fully observable, agent state = environment state = information state (Markov state), and the problem is formally a Markov decision process (MDP).
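Because the agent state is just some function of the history, the same history can give different states depending on the function chosen. The sketch below shows two such functions in Python; the history values and both function names are hypothetical and serve only as an illustration.

```python
# A history H_t stored as a list of (observation, reward, action) triples.
# The values are hypothetical, chosen only for illustration.
history = [("light", 0, "press"), ("bell", 0, "press"), ("light", 1, "wait")]


def last_observation_state(history):
    """S_t = f(H_t): keep only the most recent observation."""
    return history[-1][0]


def observation_count_state(history):
    """S_t = f(H_t): count how often each observation has occurred."""
    counts = {}
    for observation, _reward, _action in history:
        counts[observation] = counts.get(observation, 0) + 1
    return counts


print(last_observation_state(history))
print(observation_count_state(history))
```

Both are valid agent states. Which one is useful depends on the problem, because the agent's choice of action can only depend on what its state retains.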

(4) Components of an Agent

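A reinforcement learning agent may include one or more of the following three components:

1. Policy: agent's behaviour function

The policy is the agent's behaviour: a map from state to action. It may be a deterministic policy, a = π(s), or a stochastic policy, π(a|s) = P[A_t = a | S_t = s].

2. Value function: how good is each state and/or action

The value function is a prediction of future reward, used to evaluate how good or bad a state is.

3. Model: agent's representation of the environment

The model predicts what the environment will do next: the next state and the next immediate reward.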
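As a rough illustration, each of the three components can be written as a small Python function. The three-cell corridor, the table values, and all function names below are assumptions made for this sketch, not part of the original notes.

```python
import random

ACTIONS = ["N", "E", "S", "W"]


def deterministic_policy(state):
    """a = pi(s): the same action every time for a given state (always east here)."""
    return "E"


def stochastic_policy(state, rng):
    """pi(a|s): sample an action from a distribution (uniform in this sketch)."""
    return rng.choice(ACTIONS)


# Value function as a lookup table: predicted future reward from each state.
# States are (row, col) cells of a hypothetical 1x3 corridor with the goal at (0, 2).
value_table = {(0, 0): -2.0, (0, 1): -1.0, (0, 2): 0.0}


def value(state):
    return value_table[state]


def model(state, action):
    """The agent's prediction of the environment: (next state, immediate reward)."""
    row, col = state
    next_state = (row, min(col + 1, 2)) if action == "E" else state
    return next_state, -1
```

An agent does not need all three: the categories in the next subsection are defined by which of them it keeps.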

(5) Categories of Agents

Depending on which of the three components in (4) an agent has, agents are usually grouped into five categories:

Value Based: No Policy (Implicit), Value Function
Policy Based: Policy, No Value Function
Actor Critic: Policy, Value Function
Model Free: Policy and/or Value Function, No Model
Model Based: Policy and/or Value Function, Model
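The list above really combines two independent questions: which of policy and value function the agent keeps, and whether it has a model. A small Python sketch makes that explicit. The function `classify` and its boolean arguments are hypothetical names introduced here for illustration.

```python
def classify(has_policy, has_value_function, has_model):
    """Return the category labels implied by the components an agent has."""
    labels = []
    if has_policy and has_value_function:
        labels.append("Actor Critic")
    elif has_value_function:
        labels.append("Value Based")    # the policy is implicit in the values
    elif has_policy:
        labels.append("Policy Based")
    if has_policy or has_value_function:
        labels.append("Model Based" if has_model else "Model Free")
    return labels


print(classify(has_policy=False, has_value_function=True, has_model=False))
print(classify(has_policy=True, has_value_function=True, has_model=True))
```

So an agent normally carries two labels at once, for example Value Based and Model Free.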

IV. An Illustration Using a Maze Game

(1) Rewards: -1 per time-step;
(2) Actions: N, E, S, W, that is, the agent can move north, east, south, or west;
(3) States: Agent's location;

(4) In Figure 2, the arrows represent the policy π(s) for each state s;

(5) In Figure 3, the numbers represent the value v_π(s) of each state s, that is, the total reward the agent expects to collect from that state when following the policy;

(6) In Figure 4, the grid layout represents the transition model P_ss'^a;

(7) In Figure 4, the numbers represent the immediate reward R_s^a from each state s (the same for all a).
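The pieces listed above (reward, actions, states, policy, value, model) can be put together in a short Python sketch. The maze below is a made-up 3x4 layout, not the one from the figures, and the hand-written policy and all names are assumptions for illustration. With a deterministic policy and a reward of -1 per step, v_π(s) is simply minus the number of steps the policy takes to reach the goal from s.

```python
# Hypothetical 3x4 maze (not the one in the original figures).
# '#' is a wall, 'G' is the goal, '.' is free space.
MAZE = [
    "...G",
    ".#..",
    "....",
]
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}

# A hand-written deterministic policy pi(s): one arrow per non-goal state.
POLICY = {
    (0, 0): "E", (0, 1): "E", (0, 2): "E",
    (1, 0): "N", (1, 2): "N", (1, 3): "N",
    (2, 0): "N", (2, 1): "E", (2, 2): "N", (2, 3): "N",
}


def is_free(cell):
    row, col = cell
    return 0 <= row < len(MAZE) and 0 <= col < len(MAZE[0]) and MAZE[row][col] != "#"


def model(state, action):
    """Transition model and reward: deterministic moves, reward -1 per step."""
    d_row, d_col = MOVES[action]
    target = (state[0] + d_row, state[1] + d_col)
    next_state = target if is_free(target) else state   # bumping into a wall: stay put
    return next_state, -1


def policy_value(state, max_steps=100):
    """v_pi(s): total reward collected by following POLICY from state to the goal."""
    total = 0
    for _ in range(max_steps):
        if MAZE[state[0]][state[1]] == "G":
            return total
        state, reward = model(state, POLICY[state])
        total += reward
    raise RuntimeError("policy did not reach the goal")


values = {s: policy_value(s) for s in POLICY}
for state in sorted(values):
    print(state, POLICY[state], values[state])
```

States next to the goal get value -1, and the value becomes more negative the further the policy's path is from the goal.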

V. Similarities and Differences Between the Two Kinds of Sequential Decision-Making Problems

(1) Reinforcement Learning:

The environment is initially unknown
The agent interacts with the environment
The agent improves its policy

(2) Planning:

A model of the environment is known
The agent performs computations with its model (without any external interaction)
The agent improves its policy

Reinforcement learning is like trial-and-error learning. The agent should discover a good policy from its experiences of the environment without losing too much reward along the way.
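The contrast can be sketched on a tiny problem in Python. The notes do not name any algorithm, so the two methods below are stand-ins chosen for illustration: value iteration for planning and tabular Q-learning for reinforcement learning. The five-state corridor, the function names, and the parameter values are all assumptions.

```python
import random

N_STATES, GOAL = 5, 4   # states 0..4 in a row; state 4 is terminal
ACTIONS = {"left": -1, "right": +1}


def step(state, action):
    """True dynamics: -1 reward per move, walls at both ends of the corridor."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    return next_state, -1


def greedy_policy(q):
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}


def plan(model, sweeps=20):
    """Planning: compute with a known model, never acting in the environment."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(GOAL):
            for a in ACTIONS:
                next_state, reward = model(s, a)
                q[(s, a)] = reward + max(q[(next_state, b)] for b in ACTIONS)
    return greedy_policy(q)


def learn(env_step, episodes=500, alpha=0.5, epsilon=0.1, seed=0):
    """Reinforcement learning: improve the policy from sampled experience only."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(100):   # cap on episode length
            if state == GOAL:
                break
            if rng.random() < epsilon:   # occasional random action (trial and error)
                action = rng.choice(list(ACTIONS))
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward = env_step(state, action)
            target = reward + max(q[(next_state, b)] for b in ACTIONS)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return greedy_policy(q)


planned = plan(step)     # the agent is handed the model and only computes
learned = learn(step)    # the agent only sees the outcomes of its own actions
print(planned)
print(learned)
```

`plan` queries the model for every state and action without acting, while `learn` only sees the transitions it actually experiences and pays -1 for every exploratory step along the way.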

Two pairs of concepts: