

Reinforcement learning, a branch of artificial intelligence, is touted by some scientists as the path to artificial general intelligence (AGI). This contrasts with the view of those who believe that deep learning is sufficient to achieve AGI.

Francesco Gadaleta, the founder and chief engineer of Amethix Technologies and host of the Data Science At Home podcast, delves into what reinforcement learning is and how it works. While he acknowledges that reinforcement learning is a promising approach, he also discusses the limitations of relying solely on reward maximisation and the challenges of designing the reward and value functions for the system.

Some people believe that deep learning is enough to reach or build an artificial general intelligence. There are others (and by that I mean scientists) who say quite the opposite: deep learning methods are very good function approximators, but we cannot call them artificial intelligence, and definitely not artificial general intelligence, or AGI.

There is, however, another branch of artificial intelligence known as reinforcement learning, which some scientists state is enough for artificial general intelligence – a strong statement.

It does, however, make sense to some extent, so I’ll clarify what these scientists actually mean by that statement. I’ll then offer my own opinion, based on reading a huge amount of relevant literature and papers over the last few years, including the latest cutting-edge results and findings in artificial intelligence and machine learning.

Reinforcement learning is a computational paradigm that essentially allows an agent to learn how to solve a particular problem – a problem that has to be defined in a particular way. The classic example is an agent that moves around within an environment.

Looking at this in a different way, assume that we have one agent, a mouse, plus a cat, and the house is the environment. Reinforcement learning states that the agent can perform an action that will alter the environment, and the environment will respond to the agent with a new state, the next state, together with a reward. The reward that accompanies the state encodes how good or bad the action was, once performed in the environment.

If you think about the mouse moving to the next position, this would alter the environment in the sense that the position of the mouse would have changed, and its position with respect to the cheese would also have changed. The environment has obviously changed because the relative positions of the objects within it have changed. This could also happen if the mouse could move objects within the environment.

The action could be, for example, moving to another position: going up, down, left, right, diagonally… you can have as many actions as you want. Essentially, the mouse at state t chooses an action randomly, or in a more or less random way. This action will then bring the mouse into another state in the environment, and the environment will respond to the mouse with either ‘that was good’ or ‘that was bad’, depending on where the mouse is, where the cheese is, where the cat is, and so on.
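This interaction loop can be sketched in a few lines of Python. The grid size, cheese position, and reward values here are illustrative assumptions, not part of any particular framework:

```python
import random

# Hypothetical 5x5 grid world: the mouse is the agent, the cheese is the goal.
GRID_SIZE = 5
CHEESE = (4, 4)
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """The environment responds to an action with (next_state, reward)."""
    dx, dy = ACTIONS[action]
    next_state = (min(max(state[0] + dx, 0), GRID_SIZE - 1),
                  min(max(state[1] + dy, 0), GRID_SIZE - 1))
    # The reward encodes how good the action was: +1 for reaching the
    # cheese, otherwise a small step penalty (both arbitrary choices).
    reward = 1.0 if next_state == CHEESE else -0.01
    return next_state, reward

# The mouse at state t picks an action more or less at random.
random.seed(0)
state = (0, 0)
for _ in range(10):
    action = random.choice(list(ACTIONS))
    state, reward = step(state, action)
```

Each call to `step` plays the role of the environment: it applies the action, produces the next state, and answers with a reward.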

By repeating this simple procedure a number of times by trial and error, we end up with an agent that understands the environment, its specification, and the set of actions. If we keep playing this game of cat and mouse by trial and error, eventually the mouse will become intelligent enough to solve the game and win most of the time. To explain reinforcement learning in more detail, I’ll now talk about the concept of a policy.

The policy is a function that maps states to actions. The policy is a plan that says: if you are in a particular state and you perform a particular action, you are going to receive a particular value that will tell you whether the action was appropriate in that state or not. You can align a number of these actions into a chain – a sequence of actions that will bring you from A to B by simply performing them one by one.
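As a sketch, a policy over a small state space can literally be a mapping from states to actions; the states, actions, and grid below are hypothetical:

```python
# A toy policy for a grid world: each state (a grid cell) maps to an action.
policy = {
    (0, 0): "right",
    (1, 0): "right",
    (2, 0): "down",
    (2, 1): "down",
}

DELTAS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def rollout(policy, start, n_moves):
    """Chain the policy's actions one by one, going from A towards B."""
    state, path = start, [start]
    for _ in range(n_moves):
        dx, dy = DELTAS[policy[state]]
        state = (state[0] + dx, state[1] + dy)
        path.append(state)
    return path
```

Following this policy from (0, 0) for four moves traces the chain (0, 0) → (1, 0) → (2, 0) → (2, 1) → (2, 2): a sequence of actions taking the agent from A to B.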

This policy can be learned and stored in a lookup table if the number of possible state-action combinations is tractable. Otherwise, it can be approximated with whatever machine learning model you choose, including neural networks. So, if you have a neural network that lets you know the policy (approximately) – the mapping from states to actions – then we are not storing all the combinations or possibilities, which could be quite a large number if you have many states and many actions. Instead, we are approximating the policy, whether with a neural network or any other machine learning model.
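When the state-action space is small, the lookup table can be something as simple as a Python dictionary. Here is a minimal tabular Q-learning sketch of this idea; the grid size, learning rate, discount, and episode count are all arbitrary assumptions:

```python
import random
from collections import defaultdict

GRID, CHEESE = 4, (3, 3)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.2        # learning rate, discount, exploration

Q = defaultdict(float)                   # the lookup table: (state, action) -> value

def step(s, a):
    ns = (min(max(s[0] + a[0], 0), GRID - 1),
          min(max(s[1] + a[1], 0), GRID - 1))
    return ns, (1.0 if ns == CHEESE else 0.0)

random.seed(1)
for _ in range(500):                     # trial and error, episode after episode
    s = (0, 0)
    while s != CHEESE:
        if random.random() < EPS:        # explore: a more or less random action
            a = random.choice(ACTIONS)
        else:                            # exploit: best action in the table
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        ns, r = step(s, a)
        # Nudge the table entry towards reward plus discounted future value.
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(ns, b)] for b in ACTIONS) - Q[(s, a)])
        s = ns

def greedy(s):
    """The learned policy: read the best action straight out of the table."""
    return max(ACTIONS, key=lambda act: Q[(s, act)])
```

With many more states and actions the table becomes impractical, and `Q` would be replaced by a function approximator such as a neural network.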

There is another detail we need to look at, which is how the policy is usually trained. You can have on-policy or off-policy reinforcement learning. Traditionally, the agent observes the state of the environment, takes a certain action based on a policy, and performs it. The policy is identified with the Greek letter pi: pi of an action given a state. The agent then collects the result – the reward – and moves to the next state.

Learning can happen on-policy, which means that experiences are collected using the latest learned policy, and that experience is used to improve the policy as we go. Or learning can happen off-policy: the agent’s experience is appended to a buffer, also called a replay buffer, and each new policy collects additional data and is recalculated on the buffer. This means the buffer becomes a kind of training set that will generally be used by a neural network, or another model, to learn the policy at the next time step.
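The off-policy replay buffer can be sketched as follows; the buffer capacity, batch size, and the fake transitions are illustrative assumptions:

```python
import random
from collections import deque

replay_buffer = deque(maxlen=10_000)     # oldest experience evicted when full

def store(state, action, reward, next_state):
    """Append one experience tuple to the replay buffer."""
    replay_buffer.append((state, action, reward, next_state))

def sample_batch(batch_size=4):
    """Draw a minibatch; this is what a model would be trained on."""
    return random.sample(replay_buffer, batch_size)

# Fake interaction loop: the buffer grows into a kind of training set.
random.seed(0)
for t in range(20):
    store(t, "right", 0.0, t + 1)
batch = sample_batch()
```

The key property is that the sampled batch mixes experience gathered under older policies with the latest one, rather than training only on trajectories from the current policy.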

This trial-and-error approach is driven by a reward function – for example, survival: a rather brutal reward function. So, the closer the mouse is to the cheese, the higher the reward. This is, of course, a theory. Nobody can tell us that it is a hundred percent correct, but it is a theory that makes sense, and we are trying to take this concept, which we have theorised for humans and other living organisms, and apply it to artificial organisms that we define as intelligent, or would like to be intelligent.

There are many other behaviours that living organisms have discovered or developed that are too complicated to be explained by a simple reward-maximisation strategy. One example is the squirrel: it can only carry so much food, since it’s almost impossible to fit more than a few nuts in its mouth, so it hides these nuts to make sure that no other animal will steal them. This behaviour has allowed the squirrel to survive, but it’s not driven by an immediate reward, like ‘I’m starving, so I’ll eat and get a positive reward’. It’s more like ‘I’m planning for the end of the season, or for the next season, because I might die if there is a scarcity of food’. Obviously, human beings have even more powerful planning capabilities. It is hard to believe that reward maximisation is the only reason we’ve evolved the amazing capabilities of the human brain.

There are other scientists who claim that reward maximisation should not, or cannot, be enough, because there are many other scenarios in which these agents – these living organisms – have been living in and experiencing the environment and the world. Consider, for example, collective behaviour: the fact that some behaviours have developed through society and communities. For instance, take the survival of offspring as a positive reward from the environment for animals that protect their young. This is something we also see in humans – less visibly nowadays, because our highly developed society protects offspring through laws and regulations, which are very high-level concepts of behaviour. But if you look at animals, these behaviours are still present.

However, there is one observation from another scientist that goes against what many others have been saying in recent papers. I’m referring to Herbert Roitblat, the author of Algorithms Are Not Enough. He explains that reinforcement learning, as the cause of developing complex behaviour, is actually quite wrong – or at least quite incomplete. I would add that reinforcement learning requires certain assumptions to be made about the reward function and the value function.

There must therefore be someone (usually an engineer) who defines what the reward is, and how we reward the agent in any particular state for every particular action. In the mouse-and-cheese example, that’s pretty easy: the mouse gets a higher reward if it gets closer to the cheese, or if it keeps a minimum distance from the mousetrap, or some combination of the two. In this world everything is simple, so it’s easy to define what a reward function should look like. But I need to inject this knowledge – the reward function – into the system, and then run the algorithm that, by trial and error and approximation, will eventually converge to the optimal policy. The system could not find that reward function by itself. There is also the value function, which assigns a particular value to a state of the environment, or to a particular action in a particular state. These are the mathematical functions that actually drive convergence towards the optimal policy.
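As an illustration of the knowledge the engineer has to inject, here is one hypothetical reward function for the mouse example; the positions, penalty weight, and safe distance are all design choices the system could not discover by itself:

```python
import math

CHEESE, TRAP = (5.0, 5.0), (2.0, 2.0)
SAFE_DISTANCE = 1.0                      # arbitrary engineering choice

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def reward(mouse_position):
    """Hand-designed reward: closer to the cheese is better,
    entering the trap's danger zone is heavily penalised."""
    r = -dist(mouse_position, CHEESE)
    if dist(mouse_position, TRAP) < SAFE_DISTANCE:
        r -= 10.0
    return r
```

A different choice of penalty or safe distance would change which policy the algorithm converges to, which is exactly why this function is so critical.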

This is where most of the difficulty and challenge lies when you design a reinforcement learning algorithm. These are probably the most critical components of the entire model, because a reward function that is wrong will probably never let the agent converge to the optimal policy – and the same goes for a value function that is not well designed. So the point made by the author of Algorithms Are Not Enough is that reinforcement learning is fine, but the question remains: how do we deal with the fact that someone else has to inject the reward and value functions for the system to work? If this were the case, it would mean that even in the living world we should accept that some other entity designed the reward function for us, and for other living organisms.

This is an open topic, and it is definitely subject to interpretation and various theories. There is no way to close it with a solution or an explanation that accounts for a hundred percent of everything with a single reward-maximisation methodology. There are many other points that have been ignored by the DeepMind scientists – for example, the fact that, in my opinion, touch and other sensory data should be part of an intelligent organism. Intelligent organisms don’t make decisions, let alone intelligent decisions, just by using their eyes: we use almost every part of our body when we make up our minds.

© Data Science Talent Ltd, 2024. All Rights Reserved.