As AI agent capabilities continue to advance, the environment in which these agents operate becomes increasingly important. The environment provides the context and resources that an agent needs to perceive, reason, and act effectively.

Specification gaming: the flip side of AI ingenuity

Victoria Krakovna has published a blog post on Google DeepMind’s blog, discussing the Specification gaming: the flip side of AI ingenuity.

She defined that Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome such history of King Midas and the the golden touch, Student cheating on exams to get high scores, reward designed to put a red lego block on a blue lego block is awarding the height of the bottom height of the red lego block, but the agent just flip the red lego block. These examples illustrate how agents can exploit loopholes in their environment to achieve their goals in unintended ways. It is also called reward hacking, which is a common problem in reinforcement learning. The agent is trying to maximize its reward, but it does so in a way that is not aligned with the intended outcome.

Designing task specifications (reward functions, environments, etc.) that accurately reflect the intent of the human designer tends to be difficult. It is more important to design a good environment for achieving the desired outcome as RL algorithms improve. The specification correctness decides whether our agent can achieve desirable novel solutions in the famous Move 37 in AlphaGo, or the agent who flip the lego block. If agent achieves a high score can tell that the specification gaming is valide, but the reward function that capture intended outcomes decide whether the agent can achieve the desired outcome.

One source of reward function misspecification is poorly designed reward shaping. Reward shaping makes it easier to learn some objectives by giving the agent some rewards on the way to solving a task, instead of only rewarding the final outcome. However, shaping rewards can change the optimal policy if they are not potential-based

Instead of trying to create a specification that covers every possible corner case, we could learn the reward function from human feedback. but learned reward model could also be misspecified for poor generalisation.

Agent can also explore the bug in the simulator by exploiting the simulator’s physics engine, which is a common problem in reinforcement learning.This is a cousin problem of reward hacking.

In the end, she states that we should be careful about the reward tampering problem, which is a problem that arises when the agent can modify its own reward function, human preference, or the environment in which it operates. This can lead to the agent finding ways to achieve high rewards without actually solving the intended task.

To sum up, there are at least three challenges to overcome in solving specification gaming:

  1. How do we faithfully capture the human concept of a given task in a reward function?
  2. How do we avoid making mistakes in our implicit assumptions about the domain, or design agents that correct mistaken assumptions instead of gaming them?
  3. How do we avoid reward tampering?

Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments

Running the Gauntlet states that there is a growing demand to faithfully evaluate their capacities as current environments are often too familiar and easy for agents that saturate the performance metrics. Contrary to widespread expectations, their empirical results reveal that frontier agentic systems remain far from achieving human-level performance. The main reason for the current evaluation benchmark is that the environment is often similar and the task design typically targets a narrow range of capablities.

For the environment in this article, for the observation we restrict agent interaction to high-level interfaces without exposing low-level browser API while for the action space, we executed via a browser automation layer (e.g., Playwright) and, when available, are optionally guided by elements identified from the accessibility tree.