Today I’m excited to tell you that our general AI project has reached another important milestone.
A quick reminder of what our AI brain team has achieved so far:
an AI that can play Pong and Breakout (left/right movement, responding to visual input, achieving a simple goal)
Brain Simulator (a visual editor for designing the architecture of artificial brains)
The new milestone is that our general AI is now able to play a game that requires it to complete a series of actions in order to reach a final goal. This means that our AI is capable of working with a delayed reward and that it is able to create a hierarchy of goals.
Without any prior knowledge of the rules of the game, the AI was motivated to move its body through a maze-like map and learn the rules as it went. The agent behaves according to the principles of reinforcement learning: it seeks reward and avoids punishment. It moves to the places in the maze where it receives the highest reward and avoids places where it won’t be rewarded. We visualize this as a 2D map, but the agent actually works in a state space of arbitrary dimension; the 2D map is only our visualization. What the agent really “sees” is 8 numbers (an 8-dimensional state space) that change according to its behavior, and it must learn to understand the effects of its actions on these numbers.
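To make that concrete, here is a minimal toy sketch of the agent’s point of view; the variable layout and reward rule are assumptions for illustration, not the actual encoding:

```python
# Illustrative toy, not the actual encoding: the agent only ever sees an
# 8-number state vector and a scalar reward.
state = [3, 5, 0, 1, 0, 0, 0, 0]   # e.g. x, y, door/light/switch flags (assumed layout)

def step(state, action):
    """Move along x; grant reward only at the 'reddest' place, x == 9."""
    x = state[0] + {"left": -1, "right": +1}.get(action, 0)
    next_state = [max(0, min(9, x))] + state[1:]
    reward = 1.0 if next_state[0] == 9 else 0.0
    return next_state, reward

state, reward = step(state, "right")
print(state, reward)   # the agent must learn how its actions change these numbers
```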
Here you can see an example map of the reward areas – the red places represent the highest reward for the AI, and the blue places represent the lowest reward. The AI agent always tries to move to the reddest place on the map.
Visualization of the agent’s knowledge for one particular task: changing the state of the lights. It tells the agent what to do, in every known state of the world, in order to change the lights. The heat map corresponds to the expected utility (“usefulness”) of the best learned action in a given state, and a graphical representation of that best action is shown at each position on the map.
The agent’s current goal is to go towards the light switch and turn on the lights.
The maze we are using is one where doors can be opened and closed with one switch, and lights can be turned on or off with a different switch. When all of the doors are open, the AI agent moves easily through the maze to reach a final destination. This kind of task only requires the agent to complete one simple goal.
The agent uses its learned knowledge to reach the light switch and press the button in order to turn on the lights.
However, imagine that the agent wants to turn on the lights but the doors to the light switch are closed. In order to get to the light switch, it first has to open the doors by pressing a door switch. Now imagine that this door switch is located in a completely different part of the maze. Before the AI agent can reach its final destination, it has to understand that it cannot move directly to its goal location. It first has to move away from the light switch in order to press a different switch that will open the necessary door.
Our AI is able to follow a complex chain of strategies in order to complete its main goal. It can arrange its various goals into a hierarchy and plan ahead so that it can reach a larger, overarching goal.
The agent solves a more complex task: it has to open two doors in a particular sequence in order to turn the lights on or off. Everything is learned autonomously, online.
How this differs from Pong/Breakout, our first AI milestone
The AI is able to perform more complex directional tasks, and it operates (in some ways) in a more complex environment. While in the Pong environment it could only move left or right, in the maze the agent is able to move left/right, up/down, stay still, or press a switch.
Also, the AI agent in Pong acted according to visual input (pixels), which is raw and unstructured information. This means that it had to learn from, and act on, only what it could “see.” In the maze, by contrast, the AI agent has full, structured information about the environment from the beginning.
Our next step is to have the AI agent get through the maze using visual, unstructured input. This means that as it interacts with its environment, it will build a map of the environment based exclusively on the raw visual input it receives; it won’t be given that information when it starts.
How the algorithm works
The brain we have implemented for this milestone is based on a combination of a hierarchical Q-learning algorithm and a motivation model that is able to switch between different strategies in order to reach a complex goal. The Q-learning algorithm is more specifically known as HARM, or the Hierarchical Action Reinforcement Motivation system.
In a nutshell, the Q-learning algorithm (HARM) is able to spread a reward received in a specific state (e.g. the agent reaching a position on the map) to the surrounding state space, so the brain can take the proper action by climbing the steepest gradient of the Q function. However, if the goal state is far away from the current state, it might take a long time to build a strategy that leads to it. Also, the number of variables in the environment can lead to extremely long routes through the “state space”, rendering the problem almost unsolvable.
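For readers who want the mechanics, below is a minimal tabular Q-learning sketch; this is the generic textbook form of the update, not the HARM implementation itself, and the action names are just illustrative:

```python
import random
from collections import defaultdict

# Generic tabular Q-learning sketch (textbook version, not GoodAI's HARM code).
# Update rule: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
ACTIONS = ["left", "right", "up", "down", "stay", "press"]
Q = defaultdict(float)              # (state, action) -> expected utility

def choose_action(state):
    """Epsilon-greedy: mostly climb the steepest gradient of Q, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """One step of value propagation: pull Q(s,a) toward r + gamma * best next value."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

Because each update propagates value only one step backwards, a goal that is many steps away needs many visits before a usable gradient forms, which is exactly the scaling problem described above.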
There are several ideas that can improve the overall performance of the algorithm. First, we made the agent reward itself for any successful change to the environment. A motivation value can be assigned to each variable change, so the agent is constantly motivated to change its surroundings.
Second, the brain can develop a set of abstract actions for any type of change that is possible (e.g. changing the state of a door) and build an underlying strategy for how that change can be made. From such knowledge, a whole hierarchy of Q functions can be created. Third, in order to lower the complexity of the problem, the brain can analyze its past “experience buffer” and eventually drop variables that are not affected by its actions or are not necessary for the current goal (i.e. for the strategy that fulfills it).
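As a rough illustration of the first and third ideas, the sketch below shows self-reward for variable changes and pruning of unaffected variables; the function names and data layout are hypothetical, not taken from the actual system:

```python
# Illustrative sketch of two of the improvements above (hypothetical code,
# not the actual HARM implementation).

def intrinsic_reward(prev_state, next_state, motivation):
    """Reward the agent for every tracked variable its action changed."""
    return sum(m for p, n, m in zip(prev_state, next_state, motivation) if p != n)

def relevant_variables(experience_buffer):
    """Scan past experience and keep only variables the agent ever influenced."""
    changed = set()
    for prev_state, _action, next_state in experience_buffer:
        changed.update(i for i, (p, n) in enumerate(zip(prev_state, next_state)) if p != n)
    return changed   # variables outside this set can be dropped from the state space
```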
A mixture of these improvements creates a hierarchical decision model that is built during the exploration phase of learning (when the agent is left to randomly explore the environment). After a sufficient amount of knowledge has been gathered, we can “order” the agent to fulfill a goal by manually raising the motivation value for a selected variable. The agent will then execute the learned abstract action (strategy) by traversing the strategy tree and unrolling it into the chain of primitive actions that lie at the bottom.
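The execution phase can be pictured as a depth-first unrolling of the strategy tree. The sketch below hard-codes a tiny hierarchy for illustration (in the real system the hierarchy is learned autonomously, and navigation is treated as a primitive here only for brevity):

```python
# Hypothetical sketch: executing a learned abstract action by unrolling the
# strategy tree into a chain of primitive actions.
STRATEGIES = {
    "turn_on_lights": ["open_door", "go_to_light_switch", "press"],
    "open_door": ["go_to_door_switch", "press"],
}
PRIMITIVES = {"left", "right", "up", "down", "stay", "press",
              "go_to_light_switch", "go_to_door_switch"}

def unroll(action):
    """Depth-first traversal from an abstract action down to primitive actions."""
    if action in PRIMITIVES:
        return [action]
    return [p for sub in STRATEGIES[action] for p in unroll(sub)]

print(unroll("turn_on_lights"))
# ['go_to_door_switch', 'press', 'go_to_light_switch', 'press']
```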
As with the brain’s ability to play Pong/Breakout, this milestone doesn’t mean that our AI is useful to people or businesses at this stage. It does mean that our team is on the right track in its general AI research and development: we’re hitting the milestones we need to hit.
We never lose sight of our long-term goal, which is to build a brain that can think, learn, and interact with the world as a human would. We want to create an agent that can be flexible in a changing environment, just as human beings are. We also know that general AI will eventually bring amazing things to the world: cures for diseases, inventions that would take far longer to develop without the cooperation of AI, and a much deeper understanding of the universe than we have today.