The Bitter Lesson, by Rich Sutton

http://www.incompleteideas.net/IncIdeas/BitterLesson.html

The Bitter Lesson

Rich Sutton

March 13, 2019

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. There were many examples of AI researchers’ belated learning of this bitter lesson, and it is instructive to review some of the most prominent.

In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that "brute force" search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.

A similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale. Also important was the use of learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion). Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research. In computer Go, as in computer chess, researchers’ initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.
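
To make the value-function idea concrete, here is a minimal sketch of learning a value function purely from self-play. Everything in it is an illustrative assumption: the game is a toy Nim variant rather than Go or chess, and the value function is a lookup table where a real program would use a neural network inside a tree search.

```python
# Minimal sketch: learning a value function by self-play with Monte Carlo
# backups. The game is a toy Nim variant (take 1 or 2 stones; taking the
# last stone wins), chosen only so the example is self-contained.
import random
from collections import defaultdict

ALPHA, EPSILON, EPISODES = 0.1, 0.1, 20000

# V[(stones_left, player_to_move)] ~ expected outcome for player 0
# (+1 = player 0 wins, -1 = player 0 loses)
V = defaultdict(float)

def moves(stones):
    return [m for m in (1, 2) if m <= stones]

def play_episode():
    stones, player = 9, 0
    history = [(stones, player)]
    while stones > 0:
        if random.random() < EPSILON:           # explore occasionally
            m = random.choice(moves(stones))
        else:
            # Greedy: player 0 maximizes V, player 1 minimizes it.
            best = max if player == 0 else min
            m = best(moves(stones), key=lambda mv: V[(stones - mv, 1 - player)])
        stones -= m
        player = 1 - player
        history.append((stones, player))
    # The player to move at stones == 0 failed to take the last stone: loser.
    outcome = 1.0 if player == 1 else -1.0      # from player 0's perspective
    # Monte Carlo backup: nudge every visited state toward the final outcome.
    for state in history:
        V[state] += ALPHA * (outcome - V[state])

for _ in range(EPISODES):
    play_episode()

print(V[(9, 0)])  # learned value of the opening position for player 0
```

The point the sketch tries to surface is Sutton's: nothing in the loop encodes how a human plays; more episodes, i.e. more computation, is the only lever.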

In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge—knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems. As in the games, researchers always tried to make systems that worked the way the researchers thought their own minds worked—they tried to put that knowledge in their systems—but it proved ultimately counterproductive, and a colossal waste of researcher’s time, when, through Moore’s law, massive computation became available and a means was found to put it to good use.
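
For readers who have not met HMMs, the core computation behind those statistical recognizers is the forward algorithm, sketched below. The two-state model and all of its probabilities are toy numbers, not taken from any actual speech system.

```python
# Minimal sketch of the HMM forward algorithm: the probability of an
# observed sequence under a hidden Markov model, computed by dynamic
# programming over all hidden state paths.
import numpy as np

pi = np.array([0.6, 0.4])            # initial state distribution
A  = np.array([[0.7, 0.3],           # A[i, j] = P(next state j | state i)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],      # B[i, k] = P(observation k | state i)
               [0.1, 0.3, 0.6]])

def forward(obs):
    """Return P(obs) by summing over all hidden state sequences."""
    alpha = pi * B[:, obs[0]]            # joint P(state, first observation)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate, then weight by emission
    return alpha.sum()

print(forward([0, 1, 2, 2]))  # likelihood of one toy observation sequence
```

Roughly speaking, recognition then amounts to scoring an acoustic sequence under many candidate models and picking the likeliest, which is exactly the kind of work that rides on cheap computation.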

In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
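
As a concrete footnote to "convolution and certain kinds of invariances": the sketch below (a toy 1-D signal and filter, for brevity) checks that convolution commutes with translation, which is essentially the only prior a plain convolutional layer builds in.

```python
# Minimal sketch: convolution commutes with translation (equivariance).
# Signal and filter are toy 1-D values chosen for illustration.
import numpy as np

signal = np.array([0., 0., 1., 2., 1., 0., 0., 0.])
kernel = np.array([1., -1.])  # a crude edge detector

conv_then_shift = np.roll(np.convolve(signal, kernel, mode="same"), 2)
shift_then_conv = np.convolve(np.roll(signal, 2), kernel, mode="same")

print(np.allclose(conv_then_shift, shift_then_conv))  # True for this
# zero-padded toy signal: shifting the input just shifts the output
```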

This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.

Having just finished the essay, here are a few passages I consider key:

Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research.

We have to learn the bitter lesson that building in how we think we think does not work in the long run.


Some thoughts of my own:

The author's claim: solving problems by leveraging human experience, or our knowledge of how we think we think, can work in the short term, but in the long run such methods are inevitably beaten by search and learning.

But what about using human experience when designing the learning algorithms themselves? More specifically, is it good or bad to draw on human experience when designing deep learning algorithms and neural network architectures, or to use experience to prune a search and avoid wasted effort? (See the sketch below.)
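
A sketch may sharpen this question. In classical game-tree search, alpha-beta pruning itself is fully general: it never changes the result and is sound in any domain. Human knowledge traditionally entered through two slots, the leaf evaluation and the move ordering. The `game` interface below is a hypothetical stand-in, not a real library.

```python
# Minimal sketch of where "knowledge" can enter a game-tree search.
# The cutoff rule is general-purpose; `game.evaluate` and `order_moves`
# are the slots where domain knowledge traditionally lived.
# `game` is a hypothetical interface: is_terminal, evaluate, moves, apply.
import math

def alphabeta(state, depth, alpha, beta, maximizing, game, order_moves):
    if depth == 0 or game.is_terminal(state):
        return game.evaluate(state)                    # knowledge slot #1
    ordered = order_moves(state, game.moves(state))    # knowledge slot #2
    if maximizing:
        value = -math.inf
        for m in ordered:
            value = max(value, alphabeta(game.apply(state, m), depth - 1,
                                         alpha, beta, False, game, order_moves))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # general-purpose cutoff: sound regardless of domain
        return value
    else:
        value = math.inf
        for m in ordered:
            value = min(value, alphabeta(game.apply(state, m), depth - 1,
                                         alpha, beta, True, game, order_moves))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value
```

One reading of Sutton: the general cutoff rule survives, while the hand-written contents of `evaluate` and `order_moves` are what plateau. Modern programs keep the slots but fill them with learned functions, so the knowledge is discovered by computation rather than written down by us.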

My reading of the author: as long as computing power keeps improving, we should look for methods that hold universally, i.e., scalable ones (CNNs and Transformers have survived because they work across model sizes; see the sketch below). Traditional software engineering also prizes the ability to scale, though there "scale" usually means the technique keeps working as the user base grows, not as computation grows.
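
A toy way to see "works at every size" (the widths and depths below are made-up example values): the definition of a convolutional network does not change as it scales; only two numbers do, so extra compute translates directly into a bigger model without any redesign.

```python
# Minimal sketch of "scalable by construction": the same few lines define
# a stack of 3x3 conv layers at any width and depth; scaling the model is
# just changing two integers. Parameter counts printed for illustration.
def conv_params(c_in, c_out, k=3):
    return c_in * c_out * k * k + c_out        # weights + biases

def cnn_param_count(width, depth, in_channels=3):
    chans = [in_channels] + [width] * depth
    return sum(conv_params(a, b) for a, b in zip(chans, chans[1:]))

for width, depth in [(16, 4), (64, 8), (256, 16)]:
    print(width, depth, cnn_param_count(width, depth))
```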

An example: