An Analysis of DeepMind's Deep Reinforcement Learning

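Neil Zhu (Jianshu ID: Not_GOD) is the founder and Chief Scientist of University AI (UAI), working to advance the adoption of artificial intelligence worldwide. He sets and carries out UAI's medium- and long-term growth strategy and goals, and has led the team to grow quickly into a highly specialized force in the AI field.
As an industry leader, he and UAI founded TASA (described as China's earliest AI society), DL Center (a global value network for deep learning knowledge) and AI growth (industry think-tank training) in 2014, contributing substantially to AI talent development in China. He has also taken part in or organized various international AI summits and events, has written some 600,000 characters of technical AI content, and produced the Chinese translation of 《神經網絡與深度學習》 (Neural Networks and Deep Learning), described as the world's first introductory deep learning book. His content has been widely reposted and serialized by specialist public accounts and media. He has been invited to design AI curricula and teach frontier AI courses at leading universities in China, to positive feedback from students and teachers.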

Original article

Note: thanks to Tambet Matiisen for writing the original post. Only its core parts are translated here.

Two years ago, a small company in London called DeepMind uploaded their pioneering paper “Playing Atari with Deep Reinforcement Learning” to Arxiv. In this paper they demonstrated how a computer learned to play Atari 2600 video games by observing just the screen pixels and receiving a reward when the game score increased. The result was remarkable, because the games and the goals in every game were very different and designed to be challenging for humans. The same model architecture, without any change, was used to learn seven different games, and in three of them the algorithm performed even better than a human!

It has been hailed since then as the first step towards general artificial intelligence – an AI that can survive in a variety of environments, instead of being confined to strict realms such as playing chess. No wonder DeepMind was immediately bought by Google and has been on the forefront of deep learning research ever since. In February 2015 their paper “Human-level control through deep reinforcement learning” was featured on the cover of Nature, one of the most prestigious journals in science. In this paper they applied the same model to 49 different games and achieved superhuman performance in half of them.

Still, while deep models for supervised and unsupervised learning have seen widespread adoption in the community, deep reinforcement learning has remained a bit of a mystery. In this blog post I will be trying to demystify this technique and understand the rationale behind it. The intended audience is someone who already has background in machine learning and possibly in neural networks, but hasn’t had time to delve into reinforcement learning yet.

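We will work through the following questions to see what deep reinforcement learning actually looks like:

What are the main challenges in reinforcement learning? Here we cover the credit assignment problem and the exploration-exploitation dilemma.
How do we formalize reinforcement learning mathematically? We define the Markov Decision Process and use it to reason about reinforcement learning.
How do we form long-term strategies? We define the discounted future reward, which is the basis for the algorithms in the later sections.
How can we estimate or approximate the future reward? The simple table-based Q-learning algorithm is defined and explained.
What if the state space is too big? The Q-table can be replaced with a (deep) neural network.
What does it take to make this actually work? The experience replay technique stabilizes learning with neural networks.
Are we done yet? Finally we look at some simple solutions to the exploration-exploitation problem.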

Reinforcement Learning

Consider the game Breakout. In this game you control a paddle at the bottom of the screen and have to bounce the ball back to clear all the bricks in the upper half of the screen. Each time you hit a brick, it disappears and your score increases – you get a reward.

Figure 1: The Atari Breakout game. Image credit: DeepMind.

Suppose you want to teach a neural network to play this game. Input to your network would be screen images, and output would be three actions: left, right or fire (to launch the ball). It would make sense to treat it as a classification problem: for each game screen you have to decide whether you should move left, move right or press fire. Sounds straightforward? Sure, but then you need training examples, and lots of them. Of course you could go and record game sessions using expert players, but that's not really how we learn. We don't need somebody to tell us a million times which move to choose at each screen. We just need occasional feedback that we did the right thing and can then figure out everything else ourselves.
This is the task **reinforcement learning** tries to solve. Reinforcement learning lies somewhere in between supervised and unsupervised learning. Whereas in supervised learning one has a target label for each training example and in unsupervised learning one has no labels at all, in reinforcement learning one has sparse and time-delayed labels: the rewards. Based only on those rewards, the agent has to learn how to behave in the environment.

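While the idea is quite intuitive, in practice there are numerous challenges. For example, when you hit a brick and score a reward, it often has nothing to do with the actions (paddle movements) you took just before getting the reward. All the hard work was already done when you positioned the paddle correctly and bounced the ball back. This is called the credit assignment problem: which of the preceding actions was responsible for getting the reward, and to what extent?

Once you have figured out a strategy that collects a certain amount of reward, should you stick with it or experiment with something that could result in even bigger rewards? In the Breakout game above, a simple strategy is to move to the left edge and wait there. When launched, the ball tends to fly left more often than right, so you will easily score about 10 points before you die. Will you be satisfied with that? This is called the explore-exploit dilemma: should you exploit the known working strategy or explore other, possibly better strategies?

Reinforcement learning is an important model of how humans (and animals in general) learn. Praise from our parents, grades in school and salary at work are all examples of rewards. Credit assignment problems and exploration-exploitation dilemmas come up all the time in business and in relationships. That is why it is important to study this problem, and games form a wonderful sandbox for trying out new approaches.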

Markov Decision Process

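Now the question is how to formalize a reinforcement learning problem so that we can reason about it. The most common method is to represent it as a Markov decision process.

Suppose you are an agent situated in an environment (e.g. the Breakout game). The environment is in a certain state (e.g. the location of the paddle, the location and direction of the ball, the existence of every brick and so on). The agent can perform certain actions in the environment (e.g. move the paddle to the left or to the right). These actions sometimes result in a reward (e.g. an increase in score). Actions transform the environment and lead to a new state, where the agent can perform another action, and so on. The rules for how you choose those actions are called the policy. The environment is in general stochastic, which means the next state may be somewhat random (e.g. when you lose a ball and launch a new one, it goes in a random direction).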

Figure 2: Left: the reinforcement learning problem. Right: a Markov decision process.

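The set of states and actions, together with the rules for transitioning from one state to another, make up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:

$$s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$$

Here s_i represents the state, a_i the action and r_{i+1} the reward received after performing that action. The episode ends with the terminal state s_n (e.g. the "game over" screen). A Markov decision process relies on the Markov assumption: the probability of the next state s_{i+1} depends only on the current state s_i and action a_i, not on preceding states or actions.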
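The agent-environment loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ChainEnv`, its `slip` parameter and `run_episode` are hypothetical names invented here, and the toy environment (five states on a line, reward on reaching the right end) stands in for a real game.

```python
import random


class ChainEnv:
    """Hypothetical toy MDP: states 0..4 on a line, state 4 is terminal."""

    def __init__(self, n_states=5, slip=0.1, seed=0):
        self.n_states = n_states
        self.slip = slip                      # probability that a move is reversed
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """action: 0 = left, 1 = right. Returns (next_state, reward, done)."""
        move = 1 if action == 1 else -1
        if self.rng.random() < self.slip:     # the environment is stochastic
            move = -move
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=1000):
    """Collect one episode as a list of (s_i, a_i, r_{i+1}, s_{i+1}) tuples."""
    episode = []
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        episode.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return episode


episode = run_episode(ChainEnv(), policy=lambda s: 1)   # a policy that always moves right
print(episode)
```

Because the next state returned by `step` depends only on the current state and the chosen action, this toy environment satisfies the Markov assumption.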

Discounted Future Reward

To perform well in the long-term, we need to take into account not only the immediate rewards, but also the future rewards we are going to get. How should we go about that?

Given one run of the Markov decision process, we can easily calculate the total reward for one episode:

$$R = r_1 + r_2 + r_3 + \ldots + r_n$$

Given that, the total future reward from time point t onward can be expressed as:

$$R_t = r_t + r_{t+1} + r_{t+2} + \ldots + r_n$$

But because our environment is stochastic, we can never be sure that we will get the same rewards the next time we perform the same actions. The further into the future we go, the more the outcomes may diverge. For that reason it is common to use the **discounted future reward** instead:

$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{n-t} r_n$$

Here γ is the discount factor between 0 and 1: the further into the future a reward is, the less we take it into consideration. It is easy to see that the discounted future reward at time step t can be expressed in terms of the same quantity at time step t+1:

$$R_t = r_t + \gamma \left( r_{t+1} + \gamma \left( r_{t+2} + \ldots \right) \right) = r_t + \gamma R_{t+1}$$

If we set the discount factor γ=0, then our strategy will be short-sighted and we rely only on the immediate rewards. If we want to balance between immediate and future rewards, we should set discount factor to something like γ=0.9. If our environment is deterministic and the same actions always result in same rewards, then we can set discount factor γ=1.
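A small sketch of the recursion R_t = r_t + γ R_{t+1} may help make the effect of γ concrete. The function name `discounted_returns` and the reward list are made up for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [R_1, ..., R_n] for rewards [r_1, ..., r_n] via R_t = r_t + gamma * R_{t+1}."""
    returns = [0.0] * len(rewards)
    future = 0.0                      # nothing follows the last reward
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        returns[t] = future
    return returns


rewards = [0.0, 0.0, 1.0, 0.0, 10.0]                      # hypothetical episode rewards
short_sighted = discounted_returns(rewards, gamma=0.0)    # equals the immediate rewards
undiscounted = discounted_returns(rewards, gamma=1.0)     # plain sums of remaining rewards
balanced = discounted_returns(rewards, gamma=0.9)
print(short_sighted, undiscounted, balanced)
```

Working backwards from the end of the episode computes every R_t in a single pass, which is exactly what the recursive form makes possible.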

A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward.

Q-learning

In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s, and continue optimally from that point on:

$$Q(s_t, a_t) = \max R_{t+1}$$

The way to think about Q(s, a) is that it is "the best possible score at the end of the game after performing action a in state s". It is called the Q-function because it represents the "quality" of a certain action in a given state.

This may sound like quite a puzzling definition. How can we estimate the score at the end of game, if we know just the current state and action, and not the actions and rewards coming after that? We really can't. But as a theoretical construct we can assume existence of such a function. Just close your eyes and repeat to yourself five times: "Q(s, a) exists, Q(s, a) exists, …". Feel it?

If you're still not convinced, then consider what the implications of having such a function would be. Suppose you are in state s and pondering whether you should take action a or b. You want to select the action that results in the highest score at the end of game. Once you have the magical Q-function, the answer becomes really simple: pick the action with the highest Q-value!

$$\pi(s) = \arg\max_a Q(s, a)$$

Here π represents the policy, the rule how we choose an action in each state.

OK, how do we get that Q-function then? Let's focus on just one transition <s, a, r, s'>. Just like with discounted future rewards in the previous section, we can express the Q-value of state s and action a in terms of the Q-value of the next state s':

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$

This is called the Bellman equation. If you think about it, it is quite logical – maximum future reward for this state and action is the immediate reward plus maximum future reward for the next state.

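The main idea in Q-learning is that we can iteratively approximate the Q-function using the Bellman equation. In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns. The gist of the Q-learning algorithm[1] is simple: initialize the table arbitrarily, then repeatedly select and carry out an action a in the current state s, observe the reward r and the new state s', and apply the update

$$Q[s,a] \leftarrow Q[s,a] + \alpha \left( r + \gamma \max_{a'} Q[s',a'] - Q[s,a] \right)$$

until the episode terminates.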

α in the algorithm is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. In particular, when α=1, the two Q[s,a] terms cancel and the update is exactly the same as the Bellman equation.

The max_{a'} Q[s',a'] that we use to update Q[s,a] is only an approximation, and in the early stages of learning it may be completely wrong. However, the approximation gets more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function will converge and represent the true Q-value.
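Table-based Q-learning can be sketched as follows. This is a minimal sketch, not the listing from [1]: the five-state chain environment, the purely random choice of actions and all constants are assumptions chosen to keep the example small.

```python
import random

N_STATES, N_ACTIONS = 5, 2          # hypothetical chain: states 0..4, actions 0=left, 1=right
TERMINAL = N_STATES - 1


def step(s, a):
    """Deterministic toy dynamics: reward 1 only on reaching the right end."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), TERMINAL)
    return s_next, (1.0 if s_next == TERMINAL else 0.0), s_next == TERMINAL


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # Q-table: rows are states, columns are actions.
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.randrange(N_ACTIONS)          # purely random behaviour, for simplicity
            s_next, r, done = step(s, a)
            # A terminal state has no future reward, so its max is taken as 0.
            best_next = 0.0 if done else max(Q[s_next])
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s_next
    return Q


Q = q_learning()
policy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(TERMINAL)]
print(policy)
```

Even though the agent acts randomly while learning, the greedy policy read off the table afterwards prefers moving toward the reward, because the update always bootstraps from the best next action rather than from the action actually taken.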

Deep Q Network

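The state of the environment in the Breakout game can be defined by the location of the paddle, the location and direction of the ball, and the presence or absence of each individual brick. This intuitive representation, however, is game specific. Could we come up with something more universal that would be suitable for all games? The obvious choice is screen pixels: they implicitly contain all of the relevant information about the game situation, except for the speed and direction of the ball. Two consecutive screens cover these as well.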

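If we apply the same preprocessing to game screens as in the DeepMind paper (take the last four screen images, resize them to 84×84 and convert them to grayscale with 256 gray levels), we would have 256^{84×84×4} ≈ 10^{67970} possible game states. This means 10^{67970} rows in our imaginary Q-table, far more than the number of atoms in the known universe! One could argue that many pixel combinations (and therefore states) never occur, so we could represent the table sparsely and store only the visited states. Even so, most of the states are very rarely visited, and it would take a lifetime of the universe for the Q-table to converge. Ideally, we would also like to have a good guess of the Q-values for states we have never seen before.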
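The size of that state space is easy to check with logarithms, since log10(256^N) = N · log10(256). The variable names below are only for illustration.

```python
import math

pixels = 84 * 84 * 4                      # four stacked 84x84 frames
exponent = pixels * math.log10(256)       # log10 of 256 ** pixels
print(pixels, math.floor(exponent))
```

The floor of `exponent` is the power of ten in the estimate, so the number of distinct states is on the order of 10 raised to that power.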

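This is the point where deep learning steps in. Neural networks are exceptionally good at coming up with good features for highly structured data. We could represent our Q-function with a neural network that takes the state (four game screens) and an action as input and outputs the corresponding Q-value. Alternatively, we could take only the game screens as input and output a Q-value for each possible action. The second approach has the advantage that, when we want to perform a Q-value update or pick the action with the highest Q-value, a single forward pass through the network immediately gives us the Q-values for all actions.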

Figure 3: Left: naive formulation of the deep Q-network. Right: the more optimized architecture used in the DeepMind paper.

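The network architecture that DeepMind used is a classic convolutional neural network with three convolutional layers followed by two fully connected layers. People familiar with object recognition networks may notice that there are no pooling layers. But if you think about it, pooling layers buy you translation invariance: the network becomes insensitive to the location of an object in the image. That makes perfect sense for a classification task like ImageNet, but for games the location of the ball is crucial in determining the potential reward, and we would not want to discard this information!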
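To get a feel for how the three convolutional layers shrink an 84×84 input without any pooling, the layer output sizes can be computed directly. Note the assumption: the kernel sizes, strides and filter counts below are the ones reported for DeepMind's Nature network, they do not appear in the text above, and `conv_output_size` is a helper written for this sketch.

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of a convolution without padding."""
    return (size - kernel) // stride + 1


# (kernel, stride, filters) per convolutional layer.
# Assumed values, taken from the published Nature DQN description, not from this text.
conv_layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

size, channels = 84, 4                    # input: four stacked 84x84 grayscale frames
for kernel, stride, filters in conv_layers:
    size, channels = conv_output_size(size, kernel, stride), filters
    print(f"{size}x{size}x{channels}")

flattened = size * size * channels        # input width of the first fully connected layer
print(flattened)
```

Under these assumptions the strided convolutions do the downsampling that pooling layers would otherwise do, while every output unit still corresponds to a fixed region of the screen.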

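The input to the network is four 84×84 grayscale game screens. The outputs of the network are the Q-values for each possible action (18 in Atari). Q-values can be any real number, which makes this a regression task that can be optimized with a simple squared error loss:

$$L = \frac{1}{2} \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]^2$$

Here r + γ max_{a'} Q(s', a') is the target and Q(s, a) is the prediction.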

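Given a transition <s, a, r, s'>, the Q-table update rule is replaced with the following steps:

1. Do a feedforward pass for the current state s to get the predicted Q-values for all actions.
2. Do a feedforward pass for the next state s' and calculate the maximum over the network outputs, max_{a'} Q(s', a').
3. Set the Q-value target for action a to r + γ max_{a'} Q(s', a'), using the max calculated in step 2. For all other actions, set the Q-value target to the value originally returned in step 1, which makes the error 0 for those outputs.
4. Update the weights using backpropagation.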
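Steps 1 to 3 above can be sketched without any deep learning library. In this sketch `toy_network` is a hypothetical stand-in for the forward pass, `build_targets` is a name invented here, and the `terminal` flag (use the bare reward when s' ends the episode) is an addition not spelled out in the steps. Step 4, the backpropagation itself, is left out.

```python
GAMMA = 0.9


def build_targets(q_network, s, a, r, s_next, terminal=False):
    """Return (predictions, targets) for one transition <s, a, r, s'>."""
    q_s = q_network(s)                       # step 1: Q-values of all actions in s
    max_q_next = max(q_network(s_next))      # step 2: max over a' of Q(s', a')
    targets = list(q_s)                      # step 3: copy, so other actions get zero error
    targets[a] = r if terminal else r + GAMMA * max_q_next
    return q_s, targets


def toy_network(state):
    """Hypothetical stand-in for a forward pass: maps a state to three Q-values."""
    return [0.1 * state, 0.2 * state, 0.3 * state]


predictions, targets = build_targets(toy_network, s=1, a=0, r=1.0, s_next=2)
errors = [t - p for p, t in zip(predictions, targets)]
print(errors)
```

Only the output for the action that was actually taken carries a non-zero error, so the gradient flows through that single output while the others are left untouched.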

Experience Replay

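By now we have an idea of how to estimate the future reward in each state using Q-learning and how to approximate the Q-function using a convolutional neural network. But it turns out that approximating Q-values with non-linear functions is not very stable. There is a whole bag of tricks that you have to use to actually make it converge. And it takes a long time, almost a week on a single GPU.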

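The most important trick is experience replay. During gameplay all the experiences <s, a, r, s'> are stored in a replay memory. When training the network, random minibatches from the replay memory are used instead of the most recent transition. This breaks the similarity of subsequent training samples, which might otherwise drive the network into a local minimum. Experience replay also makes the training task more similar to ordinary supervised learning, which simplifies debugging and testing the algorithm. One could actually collect all those experiences from human gameplay and then train the network on them.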
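A replay memory can be as simple as a bounded queue with uniform sampling. The sketch below is one possible shape, not DeepMind's implementation: the class name `ReplayMemory`, the capacity of 1000 and the extra `done` flag stored with each transition are assumptions.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of transitions (s, a, r, s_next, done)."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)   # the oldest experiences are dropped first
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Uniformly sample a minibatch without replacement."""
        return self.rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=1000, seed=0)
for t in range(1500):                          # dummy transitions, just to fill the buffer
    memory.add(s=t, a=t % 3, r=0.0, s_next=t + 1, done=False)
batch = memory.sample(32)
print(len(memory), len(batch))
```

Sampling uniformly from the whole buffer is what decorrelates consecutive training samples: a minibatch mixes transitions from many different moments of play.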

Exploration-Exploitation

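Q-learning attempts to solve the credit assignment problem: it propagates rewards back in time until it reaches the crucial decision point that was the actual cause of the obtained reward. But we have not yet touched the exploration-exploitation dilemma…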

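First, observe that when a Q-table or Q-network is initialized randomly, its predictions are initially random as well. If we pick the action with the highest Q-value, that action will be random, so the agent performs crude "exploration". As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases. So one could say that Q-learning incorporates exploration as part of the algorithm. But this exploration is "greedy": it settles on the first effective strategy it finds.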

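A simple and effective fix for this problem is ε-greedy exploration: with probability ε choose a random action, otherwise go with the "greedy" action that has the highest Q-value. In their system DeepMind actually decreases ε over time from 1 to 0.1. In the beginning the system makes completely random moves to explore the state space as much as possible, and then it settles down to a fixed exploration rate.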
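ε-greedy action selection with a decaying ε might look like the following. The text only states that ε goes from 1 to 0.1; the linear shape of the decay and the `anneal_steps` value of one million are assumptions of this sketch, as are the function names.

```python
import random


def epsilon_at(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it fixed."""
    fraction = min(step / anneal_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.2, 1.5, -0.3]                    # hypothetical Q-values for three actions
action = epsilon_greedy(q_values, epsilon_at(step=250_000), rng)
print(action)
```

With ε = 0 the choice is always the greedy action, and with ε = 1 it is always random, so the schedule moves the agent smoothly from pure exploration toward mostly exploitation.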

Deep Q-learning Algorithm

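This gives us the final deep Q-learning algorithm with experience replay: at every step the agent picks an action ε-greedily, carries it out, stores the transition <s, a, r, s'> in the replay memory, samples a random minibatch of transitions from that memory, computes the target r + γ max_{a'} Q(s', a') for each sampled transition (just r if s' is terminal), and trains the Q-network on the squared error between target and prediction.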
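The whole loop fits in a short sketch if the Q-network is replaced by something trivial. Everything specific below is an assumption made for illustration: the five-state chain environment, the constants, and above all the "network", which is just one weight per state-action pair trained with the same squared-error gradient step a real network would use.

```python
import random
from collections import deque

N_STATES, N_ACTIONS, TERMINAL = 5, 2, 4     # hypothetical chain environment
GAMMA, LEARNING_RATE, BATCH_SIZE = 0.9, 0.1, 16


def step(s, a):
    """Toy dynamics: action 1 moves right, action 0 moves left; reward 1 at the right end."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), TERMINAL)
    return s_next, (1.0 if s_next == TERMINAL else 0.0), s_next == TERMINAL


def train(total_steps=5000, seed=0):
    rng = random.Random(seed)
    # Stand-in for the Q-network: one weight per (state, action), randomly initialized.
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(N_ACTIONS)] for _ in range(N_STATES)]
    memory = deque(maxlen=1000)                          # replay memory
    s = 0
    for t in range(total_steps):
        epsilon = max(0.1, 1.0 - t / (total_steps / 2))  # anneal from 1 to 0.1, then hold
        if rng.random() < epsilon:
            a = rng.randrange(N_ACTIONS)                 # explore
        else:
            a = max(range(N_ACTIONS), key=lambda i: weights[s][i])  # exploit
        s_next, r, done = step(s, a)
        memory.append((s, a, r, s_next, done))           # store the experience

        # Train on a random minibatch instead of the most recent transition.
        batch = rng.sample(list(memory), min(BATCH_SIZE, len(memory)))
        for ss, aa, rr, ss_next, terminal in batch:
            target = rr if terminal else rr + GAMMA * max(weights[ss_next])
            # Gradient step on 0.5 * (target - Q(ss, aa))^2 with the target held fixed.
            weights[ss][aa] += LEARNING_RATE * (target - weights[ss][aa])

        s = 0 if done else s_next                        # start a new episode when done
    return weights


weights = train()
greedy_policy = [max(range(N_ACTIONS), key=lambda i: weights[s][i]) for s in range(TERMINAL)]
print(greedy_policy)
```

Swapping the weight table for a convolutional network and the chain for an Atari emulator changes the scale of the problem but not the structure of this loop.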

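DeepMind actually uses many more tricks to make the system work well, such as a target network, error clipping and reward clipping, but these are out of scope for this introduction.

The most amazing part of this algorithm is that it learns anything at all. Just think about it: because our Q-function is initialized randomly, it initially outputs complete garbage. And we use this garbage (the maximum Q-value of the next state) as the target for the network, only occasionally folding in a tiny reward. That sounds insane. How could it learn anything meaningful at all? The fact is that it does.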

Final notes

Many improvements to deep Q-learning have been proposed since its first introduction – Double Q-learning, Prioritized Experience Replay, Dueling Network Architecture and extension to continuous action space to name a few. For latest advancements check out the NIPS 2015 deep reinforcement learning workshop and ICLR 2016 (search for “reinforcement” in title). But beware, that deep Q-learning has been patented by Google.

It is often said, that artificial intelligence is something we haven’t figured out yet. Once we know how it works, it doesn’t seem intelligent any more. But deep Q-networks still continue to amaze me. Watching them figure out a new game is like observing an animal in the wild – a rewarding experience by itself.

Credits

Thanks to Ardi Tampuu, Tanel Pärnamaa, Jaan Aru, Ilya Kuzovkin, Arjun Bansal and Urs Köster for comments and suggestions on the drafts of this post.

Links

David Silver’s lecture about deep reinforcement learning
Slightly awkward but accessible illustration of Q-learning
UC Berkeley's course on deep reinforcement learning
David Silver’s reinforcement learning course
Nando de Freitas’ course on machine learning (two lectures about reinforcement learning in the end)
Andrej Karpathy’s course on convolutional neural networks

[1] Algorithm adapted from http://artint.info/html/ArtInt_265.html

Last edited: 2017.12.03 02:31:19
