This week, leading AI lab OpenAI released its latest project: an AI that can play hide-and-seek. It's the latest example of how, with current machine learning techniques, a very simple setup can produce shockingly sophisticated results.
The AI agents play a very simple version of the game, where the "seekers" get points whenever the "hiders" are in their field of view. The "hiders" get a little time at the start to set up a hiding place and get points when they've successfully hidden themselves; both sides can move objects around the playing field (like blocks, walls, and ramps) for an advantage.
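To make that concrete, here's one plausible way the per-step scoring could look in code. This is a hedged sketch loosely based on the team-based rewards the article describes; the function name and the zero-sum ±1 values are illustrative assumptions, not OpenAI's actual implementation.

```python
# Illustrative sketch of the per-step, team-based reward structure
# described above. Names and values are assumptions, not OpenAI's code.

def step_rewards(hiders, seekers, any_hider_visible):
    """Assign one step of team rewards.

    `any_hider_visible` should be True when at least one hider is in
    some seeker's field of view.
    """
    if any_hider_visible:
        hider_reward, seeker_reward = -1.0, +1.0  # seekers score this step
    else:
        hider_reward, seeker_reward = +1.0, -1.0  # hiders score this step
    return (
        {h: hider_reward for h in hiders},
        {s: seeker_reward for s in seekers},
    )

# Example: two hiders staying out of sight earn +1 each this step.
hider_r, seeker_r = step_rewards(["hider_1", "hider_2"], ["seeker_1"], False)
```

Notice that nothing in this scoring says anything about blocks, ramps, or shelters; all of that behavior has to emerge from the points alone.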
The results from this simple setup were quite impressive. Over the course of 481 million games of hide-and-seek, the AI seemed to develop strategies and counterstrategies, and the AI agents moved from running around at random to coordinating with their allies to make complicated strategies work. (Along the way, they showed off their ability to break the game physics in unexpected ways, too; more on that below.)
It's the latest example of how much can be done with a simple AI technique called reinforcement learning, where AI systems get "rewards" for desired behavior and are set loose to learn, over millions of games, the best way to maximize their rewards.
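To see that loop in miniature, here's a self-contained sketch of tabular Q-learning, one of the simplest reinforcement learning algorithms, on a toy five-square corridor. Everything here (the environment, the constants) is invented for illustration; OpenAI's agents use far more sophisticated neural-network methods, but the core idea of improving from a reward signal over many episodes is the same.

```python
# Minimal reinforcement learning sketch: tabular Q-learning on a tiny
# corridor world. Purely illustrative; not OpenAI's method.
import random

N_STATES = 5          # positions 0..4; reaching 4 is the "win"
ACTIONS = [-1, +1]    # step left or step right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(2000):            # "millions of games," scaled down
    state = 0
    while state != N_STATES - 1:
        # Explore occasionally; otherwise take the best-known action.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state = min(max(state + action, 0), N_STATES - 1)
        # The reward only marks the desired outcome: reaching the goal.
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next
                                       - Q[(state, action)])
        state = next_state

# After training, the learned policy at the start square is to head right.
print(max(ACTIONS, key=lambda a: Q[(0, a)]))  # prints 1
```

The agent is never told "walk right"; it stumbles onto the goal by chance, and the reward slowly reshapes its behavior. Hide-and-seek works the same way, just with a vastly bigger world.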
Reinforcement learning is incredibly simple, but the strategic behavior it produces isn't simple at all. Researchers have in the past used reinforcement learning, among other techniques, to build AI systems that can play complex wartime strategy games, and some researchers think highly sophisticated systems could be built with reinforcement learning alone. This simple game of hide-and-seek makes for a great example of how reinforcement learning works in action, and of how simple instructions produce shockingly intelligent behavior. AI capabilities are continuing to march forward, for better or for worse.
The first lesson: how to chase and hide
It may have taken a few million games of hide-and-seek, but eventually the AI agents figured out the basics of the game: chasing one another around the map.
The second lesson: how to build a defensive shelter
AI agents have the ability to "lock" blocks in place; only the team that locked a block can unlock it. After millions of games of practice, the AI agents learned to build a shelter out of the available blocks. Inside the shelter, the "seeker" agents can't find them, so this is a win for the "hiders," at least until someone comes up with a new idea.
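That ownership rule is easy to picture in code. Here's a minimal sketch, with invented class and team names, of how an "only the locking team can unlock" mechanic might work; it illustrates the rule the article describes, not OpenAI's implementation.

```python
# Sketch of the team-ownership lock rule. Names are illustrative.

class LockableObject:
    def __init__(self, name):
        self.name = name
        self.locked_by = None          # None means unlocked

    def lock(self, team):
        if self.locked_by is None:
            self.locked_by = team
            return True
        return False                   # already locked by someone

    def unlock(self, team):
        if self.locked_by == team:     # only the locking team may unlock
            self.locked_by = None
            return True
        return False

box = LockableObject("box")
box.lock("hiders")
assert not box.unlock("seekers")       # seekers can't free the hiders' box
```

This one rule is what makes shelters (and, later, frozen ramps) a durable strategy rather than something the other team can instantly undo.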
Using ramps to breach a shelter
Millions of generations later, the seekers have figured out how to handle this behavior by the "hiders": they can drag a ramp over, climb the ramp, and find the hiders.
After a while, the hiders learned a counterattack: they could freeze the ramps in place so the seekers couldn't move them. OpenAI's team notes that they thought this would be the end of the game, but they were wrong.
Box surfing to breach shelters
Eventually, the seekers learned to push a box over to the frozen ramps, climb onto the box, and "surf" it over to the shelter, where they could once again find the hiders.
Defending against box surfing
There's an obvious counterstrategy for the hiders here: lock everything on the field in place before hiding, so the seekers have no tools to work with. Indeed, that's what they learned to do.
That's how a game of hide-and-seek between AI agents with millions of games of experience goes. The interesting thing here is that none of the behavior on display was directly taught or even directly rewarded. Agents only get rewards when they win the game. But that simple incentive was enough to encourage lots of creative in-game behavior.
Many AI researchers think that reinforcement learning can be used to solve complicated tasks with real-world implications, too. The way powerful strategic decision-making emerges from simple instructions is promising, but it's also concerning. Solving problems with reinforcement learning leads, as we've seen, to lots of unexpected behavior: charming in a game of hide-and-seek, but potentially alarming in a system designing drugs meant to treat cancer (if the unintended behavior causes life-threatening complications) or an algorithm meant to improve power plant output (if the AI exploits some obscure loophole in its goals rather than simply providing consistent power).
That's the hazardous flip side of techniques like reinforcement learning. On the one hand, they're powerful techniques that can produce advanced behavior from a simple starting point. On the other hand, they're powerful techniques that can produce unexpected, and sometimes undesired, advanced behavior from a simple starting point.
As AI systems grow more powerful, we need to give careful consideration to how to ensure they do what we want.