
Yes! Hide and seek got me very excited this week. Remember how it was such a cherished game back when we were growing up? Even toddlers love it when you hide your face with your palms. I’m way past that life stage but when hide and seek was infused with some “swancy” Reinforcement Learning, you knew I’d obviously lose my mind.
Okay let’s rewind a bit. I have explained this before but, for revision, Machine Learning can be subdivided into three main groups: Supervised Learning, Unsupervised Learning and Reinforcement Learning:
My work largely revolves around the Supervised Learning cluster with a bit of Unsupervised (e.g. clustering), but Reinforcement Learning is my absolute darling. I drool over her. This subgroup is defined as:
Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
In simple terms, this field allows us to create “agents” in our algorithm that learn by themselves how to interact with their “environment” while chasing some form of “reward”. Think back to when you were growing up. There were things you would do and your parents would chastise you and call it “bad behaviour”. Other things were rewarded and called “good behaviour”, like passing your end of term exams. What that reinforcement did was make you want to avoid the “bad behaviour” actions and do more of the “good” ones. Why? Because it made you happy when your parents were happy. You got “positive rewards” for good behaviour and “negative rewards” for bad ones. That’s essentially what Reinforcement Learning is: the agent in your algorithm interacts with the environment you create for it and, by doing so, learns by itself what constitutes good behaviour (and gets positively rewarded) and what constitutes bad behaviour (for which you sometimes reward it negatively). In the definition above, the agent then takes actions to maximise cumulative reward, through what is called a policy.
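If you like seeing ideas in code, here is a minimal sketch of that agent-environment-reward loop in Python. I’m assuming the popular gymnasium library and its CartPole environment purely as a stand-in (it has nothing to do with hide and seek), and the “policy” here just picks random actions, i.e. the agent before any learning has happened:

```python
# A minimal sketch of the agent-environment-reward loop described above.
# Assumes the gymnasium library (pip install gymnasium); CartPole is only a
# stand-in environment, not the hide-and-seek arena from the paper.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    # A "policy" maps what the agent observes to an action.
    # Here the policy is random: the starting point before any learning.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward  # the cumulative reward the agent tries to maximise
    done = terminated or truncated

print(f"Cumulative reward this episode: {total_reward}")
env.close()
```

Everything an RL algorithm does boils down to replacing that random action with a learnt policy that racks up as much cumulative reward as possible.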
But it’s not that simple. The agent has to go through many rounds of trial and error to (hopefully) eventually understand what is good and what is bad, and then stick to the good set of actions. A human child might need one or two trials of touching fire to learn that it is dangerous, but these algorithms can play with fire, burn, reboot, and iterate millions of times until they fully grasp that fire = bad.
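To make that trial-and-error idea concrete, here is a toy I made up myself: a single situation with two possible actions, touching the fire or keeping away. The names and reward numbers are purely illustrative, but watch how a simple running estimate, nudged a little after every trial, ends up encoding fire = bad:

```python
# A made-up toy: one state, two actions ("touch_fire", "keep_away").
# The agent starts knowing nothing and, purely by trial and error,
# learns that fire = bad. Names and numbers are illustrative only.
import random

rewards = {"touch_fire": -10.0, "keep_away": +1.0}   # hidden from the agent
q_values = {"touch_fire": 0.0, "keep_away": 0.0}     # the agent's own estimates

epsilon = 0.2        # how often the agent explores a random action
learning_rate = 0.1  # how strongly each trial updates the estimate

for trial in range(1000):
    # Explore sometimes, otherwise exploit the best estimate so far.
    if random.random() < epsilon:
        action = random.choice(list(q_values))
    else:
        action = max(q_values, key=q_values.get)

    reward = rewards[action]
    # Nudge the estimate toward the reward actually received.
    q_values[action] += learning_rate * (reward - q_values[action])

print(q_values)  # "touch_fire" ends up strongly negative, "keep_away" positive
```

After enough burns, the agent’s own estimates tell it to keep away from the fire, without anyone ever spelling out the rule.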
Doesn’t that just tickle your fancy? An algorithm that will chase a set of good actions in order to maximise positive rewards! So back to what got me excited this week.
There is a group of super smart researchers at a company called OpenAI who claim to be “…discovering and enacting the path to safe artificial general intelligence.” Their bread and butter is Reinforcement Learning. They are much like DeepMind, the group from Alphabet that trained reinforcement learning algorithms to beat humans at Atari games, Go, and Quake, taught agents by themselves how to walk like a human, and built models that detect certain diseases better than human doctors. This week OpenAI published a fascinating paper (you can read it here) detailing how agents in their algorithm taught themselves how to play “Hide and Seek”; hence the reason for my excitement. Let’s get into some of the details:
OpenAI created an environment consisting of an arena with two boxes that could be moved around unless locked down by any of the agents in the environment. Two types of agents were involved: the “Seekers”, represented by the red avatar, and the “Hiders”, the blue guys. When training started, the hiders learnt by themselves that in order to survive, the first thing to do was to stay away from the seekers. At the same time, the seekers learnt they had to chase the hiders around the arena. How beautiful is that? The algorithm’s agents learnt all this by themselves. And that’s just the tip of the iceberg, because that’s how we play Cops and Robbers, not Hide and Seek. The point was for the hiders to learn not to be “seen” by the seekers, i.e. the hiders needed to figure out that in order to “win”, they needed to break line of sight between themselves and the seekers.
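Before we go on, here is a rough sketch (based on my own reading of the paper, not OpenAI’s actual code) of what that team-based reward looks like: hiders get a positive reward at each step only when every hider is out of sight, and seekers get the exact opposite. The is_seen helper below is hypothetical and stands in for the real line-of-sight check in the simulated arena:

```python
# A rough sketch of the team-based hide-and-seek reward, per my reading of
# the paper: hiders score +1 per timestep only if every hider is out of the
# seekers' line of sight, -1 otherwise; seekers get the opposite.
# is_seen() is a hypothetical helper standing in for the real visibility check.
from typing import Callable, Sequence, Tuple

def team_rewards(
    hiders: Sequence[object],
    seekers: Sequence[object],
    is_seen: Callable[[object, object], bool],
) -> Tuple[float, float]:
    """Return (reward per hider, reward per seeker) for one timestep."""
    any_hider_seen = any(is_seen(h, s) for h in hiders for s in seekers)
    hider_reward = -1.0 if any_hider_seen else +1.0
    return hider_reward, -hider_reward

# Hypothetical usage with made-up agent names and a dummy visibility check:
hider_r, seeker_r = team_rewards(hiders=["h1", "h2"], seekers=["s1"],
                                 is_seen=lambda h, s: False)
# -> hider_r == +1.0, seeker_r == -1.0, since no hider is seen this step
```

If I remember the paper correctly, rewards are also zeroed out during the preparation countdown before the seekers are activated; I have left that detail out to keep the sketch short.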
After millions of trials running around and failing to break line of sight, the hiders started noticing the boxes in the arena and, again by themselves, figured out they could push the boxes to the entrance of the room they were in and lock them down, thus blocking the seekers from entering; in other words, breaking line of sight. Even when the environment was changed to have two entrances, after millions of trials and errors in the new environment, the hiders started blocking both entrances and locking down the boxes. Yes, they learnt not only that blocking both entrances and locking the boxes would stop the seekers from finding them, they also learnt how to co-operate with each other to achieve this: one agent blocked one entrance and the other did the same to the second entrance. How brilliant is that?! This definitely frustrated the seekers. And we are still going down the iceberg:
With this new setting of blocked entrances, the seekers were forced to adapt, and they re-interacted with their environment chasing that goal of creating line of sight with the hiders. After millions of iterations, they came across the ramps that were part of the tools in the arena. Through many trials and errors, they figured out by themselves that they could grab a ramp, put it next to a wall and jump into the room the hiders were in!
Clever seekers! This went on for a couple more million rounds, with the seekers jumping into the room each time because they had learnt by themselves that using the ramp was a successful part of their solution. Because of previous successes, the hiders were confident that blocking the entrances was part of the solution, so what was happening now? Each time a seeker jumped into the room, the hiders were forced to re-interact with their environment: unlock the boxes, run out, try new things. After millions of rounds, they found a solution:
Like real-life Hide and Seek, the seekers had a countdown before they were activated. The hiders used this delay to interact with the environment; until now, that had simply meant running to the room and locking it down, only to have the seekers use the ramp to jump in. What they discovered after millions more iterations was that they could steal the ramp, drag it into their room while the seekers were still asleep, and then lock the entrances!
They discovered this strategy after trying out thousands, even millions, of different strategies, chasing that positive reward. Now the seekers were left with a useless solution because the ramp was no longer available to them.
One might say this game favoured the hiders because there was a prebuilt room they could run to. So OpenAI ran the same hide and seek experiment with no prebuilt walls. After many (millions of) iterations, the hiders started building their own walls and locking them in place. With no prebuilt walls, the seekers had free line of sight, but now they were being forced to adapt to hiders constructing shelters and, through multiple experiments, they rediscovered the use of the ramp. This in turn forced the hiders to learn a new way to prevent the seekers from using it: moving the ramp to the edge of the environment and locking it there before they built their walls:
The seekers were probably thinking, “These hiders learnt to build walls in this new environment and we learnt to use the ramp. Now the hiders are locking down the ramps before they start construction, so that renders our solution useless. What else can we do? Let’s re-interact with the environment and figure out a new strategy.” Sure enough, after millions of new interactions with the environment, they did. They figured out they could push the boxes around towards the constructed room, but they still had no way to get inside, until they discovered that they could ride (surf) the boxes around instead of pushing them!
Look at that sneaky seeker in the picture above! The hider thought it was safe, but the seeker had interacted with the environment millions of times and had come up with a new strategy the hider was not yet ready for: riding (surfing) the box! Once the seeker got to the room the hider had constructed, it was a simple matter of jumping in and winning. OpenAI confess that this is a solution they themselves had not even thought of, and that’s the beauty of Machine Learning: it helps us uncover trends, correlations, and solutions we humans cannot even begin to imagine! But as expected, once the seekers started jumping into the room, the hiders were forced to adapt and come up with new strategies, and over more iterations they started locking down as many of the boxes as they could before the seekers were activated and before constructing their safe room:
In this simple environment, this was the winning strategy for the hiders, and there was finally nothing the seekers could do. All of this was learnt by the agents as they interacted with their environment and passed the learnt knowledge on to the next iteration. How beautiful is that? An algorithm can teach itself the best strategies to play hide and seek and even develop ones we humans cannot think of!
This got me very excited this week, especially because of the implications of what OpenAI is working on. I am convinced some version of Reinforcement Learning will lead to exciting discoveries in the near future. Right now this field of Machine Learning does not have as many practical applications as, say, Supervised Learning. Ones I can think of include resource allocation, traffic light control systems, chemistry, and fraud detection in financial services. I will discuss these in an upcoming post.
For now, I think we can agree that there is just something beautiful and elegant about Reinforcement Learning, even in the code itself (I will publish a tutorial with explanations on my blog HoustonTheNerd, with a toy example of the Google Chrome dinosaur teaching itself how to run and jump over obstacles). Until the next article,
It’s a brave new world!