Reading Notes: Concrete Problems in AI Safety

[Below are my reading notes on the manuscript by Dario Amodei et al. of Google Brain]

Accidents in ML/AI systems

  • unintended and harmful behavior that may emerge from poor design of AI systems

The last five years saw rapid progress on long-standing, difficult problems in ML/AI:

  • computer vision
  • video game playing
  • autonomous vehicles
  • Go


This progress has also raised concerns about the impact of ML/AI on society:

  • privacy
  • security
  • fairness
  • economics
  • policy

An accident can be described as a situation where a human designer had in mind a certain (perhaps informally specified) objective or task, but the system that was designed and deployed for that task produced harmful and unexpected results.

What can go wrong?

  • First, the designer may have specified the wrong formal objective function.
  • Second, the designer may know the correct objective function, or at least have a method of evaluating it (for example explicitly consulting a human on a given situation), but it is too expensive to do so frequently, leading to possible harmful behavior caused by bad extrapolations from limited samples.
  • Third, the designer may have specified the correct formal objective, such that we would get the correct behavior were the system to have perfect beliefs, but something bad occurs due to making decisions from insufficient or poorly curated training data or an insufficiently expressive model.

An example to explain:

  • a fictional robot whose job is to clean up messes in an office using common cleaning tools

[Jerry] My example:

  • A small kid in a room

Possible failure modes:

  • Avoiding Negative Side Effects: How can we ensure that our cleaning robot will not disturb the environment in negative ways while pursuing its goals, e.g. by knocking over a vase because it can clean faster by doing so? Can we do this without manually specifying everything the robot should not disturb?
  • Avoiding Reward Hacking: How can we ensure that the cleaning robot won’t game its reward function? For example, if we reward the robot for achieving an environment free of messes, it might disable its vision so that it won’t find any messes, or cover over messes with materials it can’t see through, or simply hide when humans are around so they can’t tell it about new types of messes.
  • Scalable Oversight: How can we efficiently ensure that the cleaning robot respects aspects of the objective that are too expensive to be frequently evaluated during training? For instance, it should throw out things that are unlikely to belong to anyone, but put aside things that might belong to someone (it should handle stray candy wrappers differently from stray cellphones). Asking the humans involved whether they lost anything can serve as a check on this, but this check might have to be relatively infrequent—can the robot find a way to do the right thing despite limited information?
  • Safe Exploration: How do we ensure that the cleaning robot doesn’t make exploratory moves with very bad repercussions? For example, the robot should experiment with mopping strategies, but putting a wet mop in an electrical outlet is a very bad idea.
  • Robustness to Distributional Shift: How do we ensure that the cleaning robot recognizes, and behaves robustly, when in an environment different from its training environment? For example, strategies it learned for cleaning an office might be dangerous on a factory workfloor.

Why address these safety issues? Three trends:

  • First is the increasing promise of reinforcement learning (RL), which allows agents to have a highly intertwined interaction with their environment. Some of our research problems only make sense in the context of RL, and others (like distributional shift and scalable oversight) gain added complexity in an RL setting.
  • Second is the trend toward more complex agents and environments. “Side effects” are much more likely to occur in a complex environment, and an agent may need to be quite sophisticated to hack its reward function in a dangerous way.
  • Third is the general trend towards increasing autonomy in AI systems. Systems that simply output a recommendation to human users, such as speech systems, typically have relatively limited potential to cause harm. By contrast, systems that exert direct control over the world, such as machines controlling industrial processes, can cause harms in a way that humans cannot necessarily correct or oversee.

Avoiding Negative Side Effects

For an agent operating in a large, multifaceted environment, an objective function that focuses on only one aspect of the environment may implicitly express indifference over other aspects of the environment.

An agent optimizing this objective function might thus engage in major disruptions of the broader environment if doing so provides even a tiny advantage for the task at hand.

Objective functions that formalize “perform task X” may frequently give undesired results, because what the designer really should have formalized is closer to “perform task X subject to common-sense constraints on the environment,” or perhaps “perform task X but avoid side effects to the extent possible.”
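
One way to read “perform task X but avoid side effects to the extent possible” is as a penalized objective: the task reward minus a weighted measure of how far the environment has moved from a baseline state. The sketch below is only an illustration of that reading, not code from the paper. The state encoding, the L1 distance used as the impact measure, and the weight `lam` are all assumptions.

```python
from typing import Sequence


def impact(state: Sequence[float], baseline: Sequence[float]) -> float:
    """Hypothetical impact measure: L1 distance between a state and the baseline."""
    return sum(abs(s - b) for s, b in zip(state, baseline))


def penalized_reward(task_reward: float,
                     state: Sequence[float],
                     baseline: Sequence[float],
                     lam: float = 1.0) -> float:
    """Task reward minus lam times the impact on the rest of the environment."""
    return task_reward - lam * impact(state, baseline)


# Toy office state (assumed encoding): (vase_intact, desk_displacement).
baseline = (1.0, 0.0)

# Careful plan: slightly lower task reward, environment left as it was.
careful = penalized_reward(10.0, (1.0, 0.0), baseline, lam=2.0)

# Reckless plan: cleans a bit faster but knocks over the vase.
reckless = penalized_reward(10.5, (0.0, 0.0), baseline, lam=2.0)

print(careful, reckless)  # 10.0 8.5
```

With `lam = 0` the reckless plan wins on task reward alone. Any positive weight above 0.5 flips the preference in this toy case, which is the "tiny advantage for the task at hand" problem in miniature.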

There is reason to expect side effects to be negative on average, since they tend to disrupt the wider environment away from a status quo state that may reflect human preferences. A version of this problem has been discussed informally by [13] under the heading of “low impact agents.”


Possible approaches to avoiding side effects:

  • Define an Impact Regularizer
  • Learn an Impact Regularizer (more flexible than defining one by hand; can be done via transfer learning)
  • Penalize Influence
    • empowerment, the maximum possible mutual information between the agent’s potential future actions and its potential future state (or equivalently, the Shannon capacity of the channel between the agent’s actions and the environment)
  • Multi-Agent Approaches
  • Reward Uncertainty
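
The empowerment definition above (maximum mutual information between actions and future state, i.e. the Shannon capacity of the action-to-state channel) can be made concrete for a small discrete case. The sketch below is an assumption-laden illustration: it treats a single step with a known transition table `channel[a][s] = P(next state = s | action = a)`, and it uses the standard Blahut-Arimoto iteration to find the capacity. Neither the one-step setting nor the choice of algorithm comes from the notes.

```python
import math


def mutual_information_bits(p_action, channel):
    """I(A; S') in bits, where channel[a][s] = P(next state = s | action = a)."""
    n_states = len(channel[0])
    q = [sum(p_action[a] * channel[a][s] for a in range(len(channel)))
         for s in range(n_states)]
    total = 0.0
    for a, row in enumerate(channel):
        for s, p_sa in enumerate(row):
            if p_action[a] > 0 and p_sa > 0 and q[s] > 0:
                total += p_action[a] * p_sa * math.log2(p_sa / q[s])
    return total


def empowerment_bits(channel, iterations=200):
    """Channel capacity via Blahut-Arimoto, starting from a uniform action distribution."""
    n_actions = len(channel)
    n_states = len(channel[0])
    p = [1.0 / n_actions] * n_actions
    for _ in range(iterations):
        # Marginal over next states under the current action distribution.
        q = [sum(p[a] * channel[a][s] for a in range(n_actions))
             for s in range(n_states)]
        # Reweight each action by exp(KL(channel[a] || q)).
        weights = []
        for a in range(n_actions):
            kl = sum(p_sa * math.log(p_sa / q[s])
                     for s, p_sa in enumerate(channel[a])
                     if p_sa > 0 and q[s] > 0)
            weights.append(p[a] * math.exp(kl))
        norm = sum(weights)
        p = [w / norm for w in weights]
    return mutual_information_bits(p, channel)


# Four actions that each reach a distinct state: high influence (about 2 bits).
free_robot = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# Four actions that all lead to the same state: no influence (0 bits).
boxed_in = [[1, 0, 0, 0]] * 4

print(empowerment_bits(free_robot), empowerment_bits(boxed_in))
```

A "penalize influence" regularizer would then subtract some multiple of this quantity from the reward, so that an agent is discouraged from putting itself in positions where its actions can change a lot.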

Avoiding Reward Hacking

Formal rewards or objective functions are an attempt to capture the designer’s informal intent, and sometimes these objective functions, or their implementation, can be “gamed” by solutions that are valid in some literal sense but don’t meet the designer’s intent.

Pursuit of these “reward hacks” can lead to coherent but unanticipated behavior, and has the potential for harmful impacts in real-world systems.

There are several ways in which the problem can occur:

  • Partially Observed Goals
  • Complicated Systems
  • Abstract Rewards
  • Goodhart’s Law
    • In the economics literature this is known as Goodhart’s law [63]: “when a metric is used as a target, it ceases to be a good metric.”
  • Feedback Loops
  • Environmental Embedding
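
The partially observed goals case and Goodhart's law can be illustrated with the cleaning robot: the reward is computed from what the robot's sensors report, while the designer cares about what is actually in the room. The toy sketch below is my own illustration. The action names, the numbers, and the small effort cost are all made up.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    messes_remaining: int  # what the designer actually cares about
    messes_observed: int   # what the robot's camera reports
    effort: float          # cost of carrying out the action


EFFORT_WEIGHT = 0.1  # assumed small cost per unit of effort


def proxy_reward(o: Outcome) -> float:
    """Reward as implemented: penalize messes the robot can see."""
    return -o.messes_observed - EFFORT_WEIGHT * o.effort


def true_objective(o: Outcome) -> float:
    """Reward as intended: penalize messes that are really there."""
    return -o.messes_remaining - EFFORT_WEIGHT * o.effort


# Hypothetical outcomes for a room that starts with 10 messes.
OUTCOMES = {
    "clean_everything": Outcome(messes_remaining=0, messes_observed=0, effort=10),
    "clean_half": Outcome(messes_remaining=5, messes_observed=5, effort=5),
    "disable_camera": Outcome(messes_remaining=10, messes_observed=0, effort=1),
}


def best_action(score) -> str:
    """Pick the action whose outcome maximizes the given scoring function."""
    return max(OUTCOMES, key=lambda name: score(OUTCOMES[name]))


print(best_action(proxy_reward))    # disable_camera
print(best_action(true_objective))  # clean_everything
```

Once the observed-mess count becomes the optimization target, the cheapest way to drive it to zero is to stop observing, which is the sense in which the metric "ceases to be a good metric."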


Possible approaches to preventing reward hacking:

  • Adversarial Reward Functions
  • Model Lookahead
  • Adversarial Blinding
  • Careful Engineering
  • Reward Capping
  • Counterexample Resistance
  • Multiple Rewards
  • Reward Pretraining
  • Variable Indifference
  • Trip Wires
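
Two of the simpler items in the list, reward capping and multiple rewards, can be sketched in a few lines. This reflects my reading of those ideas rather than anything spelled out in the notes: capping bounds how much a hacked signal can pay off, and combining several reward signals by taking the minimum means that gaming one of them is not enough. The function names and numbers are hypothetical.

```python
from typing import Sequence


def capped_reward(reward: float, cap: float) -> float:
    """Reward capping: never pay out more than `cap`, however large the signal."""
    return min(reward, cap)


def combined_reward(rewards: Sequence[float]) -> float:
    """Multiple rewards: combine independent signals by taking the minimum."""
    return min(rewards)


# A hacked sensor reports an absurdly large reward; the cap limits the damage.
print(capped_reward(1_000_000.0, cap=10.0))  # 10.0

# One of three reward signals has been gamed, the other two have not.
print(combined_reward([1_000_000.0, 0.0, 0.5]))  # 0.0
```

Taking the minimum is only one way to combine signals. Averaging or taking a quantile are alternatives, with different trade-offs between robustness and sensitivity to a single noisy signal.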

Scalable Oversight

We can imagine many possible approaches to semi-supervised RL:

  • Supervised Reward Learning
  • Semi-supervised or Active Reward Learning
  • Unsupervised Value Iteration
  • Unsupervised Model Learning
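
Supervised reward learning, the first item above, can be sketched as follows: the agent sees the true reward on only a small fraction of timesteps, fits a predictor on those, and uses the predictor's output everywhere else. The example below is a minimal sketch under strong assumptions that are mine, not the paper's: a noise-free reward that is linear in three state features, a label on every tenth step, and ordinary least squares as the reward model.

```python
import numpy as np


def fit_reward_model(states, rewards, labeled):
    """Least-squares fit of a linear reward model using only the labeled steps."""
    w_hat, *_ = np.linalg.lstsq(states[labeled], rewards[labeled], rcond=None)
    return w_hat


def fill_in_rewards(states, rewards, labeled, w_hat):
    """Use the observed reward where available, the model's prediction elsewhere."""
    return np.where(labeled, rewards, states @ w_hat)


rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])   # hidden reward weights (assumed)
states = rng.normal(size=(200, 3))    # one row of features per timestep
rewards = states @ true_w             # true reward, mostly unseen by the agent

labeled = np.zeros(200, dtype=bool)
labeled[::10] = True                  # reward is revealed on every 10th step

w_hat = fit_reward_model(states, rewards, labeled)
estimated = fill_in_rewards(states, rewards, labeled, w_hat)

print(int(labeled.sum()), np.round(w_hat, 3))
```

In a real setting the reward would be noisy and nonlinear, and the interesting question is which steps to ask for labels on. That is where the active reward learning variant in the list comes in.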


Other possible approaches to scalable oversight:

  • Distant supervision
  • Hierarchical reinforcement learning

Safe Exploration

Some general routes that this research has taken:

  • Risk-Sensitive Performance Criteria
  • Use Demonstrations
  • Simulated Exploration
  • Bounded Exploration
  • Trusted Policy Oversight
  • Human Oversight
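
As an illustration of the first route, a risk-sensitive performance criterion replaces "maximize expected return" with a score that also penalizes spread. The sketch below uses mean minus a multiple of the standard deviation, which is one common choice and my own pick here. The action names and sampled returns are invented.

```python
import statistics
from typing import Dict, List


def risk_sensitive_score(returns: List[float], risk_aversion: float) -> float:
    """Mean return minus risk_aversion times the (population) standard deviation."""
    return statistics.mean(returns) - risk_aversion * statistics.pstdev(returns)


def choose_action(samples: Dict[str, List[float]], risk_aversion: float) -> str:
    """Pick the action with the best risk-sensitive score."""
    return max(samples, key=lambda a: risk_sensitive_score(samples[a], risk_aversion))


# Hypothetical sampled returns for two mopping strategies.
samples = {
    "mop_floor": [1.0, 1.0, 1.0, 1.0],            # steady, modest payoff
    "mop_near_outlet": [5.0, 5.0, 5.0, -9.0],     # higher mean, rare disaster
}

print(choose_action(samples, risk_aversion=0.0))  # mop_near_outlet
print(choose_action(samples, risk_aversion=1.0))  # mop_floor
```

With `risk_aversion = 0` this reduces to plain expected return, which prefers the strategy with the occasional catastrophe. A criterion like this only helps if the bad outcomes show up in the samples, which is part of why the other routes (demonstrations, simulation, bounded exploration, oversight) are listed alongside it.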

Robustness to Distributional Change

Related Efforts

  • Cyber-Physical Systems Community
  • Futurist Community
  • Other Calls for Work on Safety
  • Related Problems in Safety
    • Privacy
    • Fairness
    • Security
    • Abuse
    • Transparency
    • Policy
