
Training for a reward


Reinforcement learning is one of the fastest-progressing areas of machine learning, with new applications appearing across domains at a steady pace. At PickleTech we follow the progress of its techniques closely, researching them and standing ready to apply them when they benefit a project. We have already implemented them in the field of particle accelerators, and last week we were invited to give a seminar about that work at the Physics Department of the University of Oxford. Now, we are exploring the impact of reinforcement learning in the fields of Health and Sports.

Standard supervised learning techniques require sizable labeled datasets in order to obtain a reliable model that is able to perform inference in tasks such as classification and regression. During the training phase, supervised models are shown a set of input-output labeled examples. The algorithms find patterns within these examples, and they carry out inference given new inputs.

Reinforcement learning follows a different approach. You do not need to provide a large labeled dataset to your algorithm in advance. Instead, you define the rules that govern an environment in which an agent acts, and let the agent learn by itself through its interactions with that environment.

The “agent” and the “environment” in Reinforcement Learning

In reinforcement learning, an agent is trained to carry out a specific task by interacting with an environment. The agent takes an action based on a complete or partial observation of the environment, and the environment responds to the action with a reward. The agent tries to maximize the total reward collected by the end of the whole process, which we call an episode, rather than only the immediate reward of each action.

Reinforcement Learning principle diagram (source: KDnuggets).

The set of actions an agent can take might be very limited (e.g. moving left or right in a 1-dimensional environment) or very complex, involving thousands of possible actions. In complex scenarios, such as those involving computer vision, the agent's decision-making may be represented by deep neural networks. In this case we speak of Deep Reinforcement Learning.
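
The agent-environment loop for the simple left/right case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five-cell corridor, the small step penalty, the hyperparameters, and the choice of tabular Q-learning as the learning rule are all ours, and the names `Corridor` and `train` are hypothetical.

```python
import random


class Corridor:
    """Toy 1-D environment (assumed): cells 0..size-1, with the goal at the right end."""

    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 = move left, action 1 = move right (walls clamp the position)
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.size - 1)
        done = self.state == self.size - 1
        reward = 1.0 if done else -0.01  # assumed: small penalty per step, bonus at the goal
        return self.state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    env = Corridor()
    q = [[0.0, 0.0] for _ in range(env.size)]  # q[state][action]
    for _ in range(episodes):
        state = env.reset()
        done = False
        steps = 0
        while not done and steps < 100:  # cap the episode length
            # Explore with probability epsilon, or when both actions look equally good.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = env.step(action)
            # Move the estimate toward the reward plus the discounted best future value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            steps += 1
    return q


q_table = train()
# Greedy action in every non-terminal cell after training.
greedy_policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(greedy_policy)
```

No labeled examples are supplied anywhere: the agent only sees states and rewards, and the preference for moving right emerges from the interaction itself.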

Problems in reinforcement learning are usually modeled as Markov Decision Processes. In these, the new state of the environment after an action of the agent depends only on the current state and on that action. Therefore, we can forget about the past history of the environment.
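
Written as a formula, this Markov property says that conditioning on the full history adds nothing beyond the current state and action. The notation is our own assumption: s_t and a_t denote the state and the action at step t, and P is the transition probability of the environment.

```latex
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)
```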

A milestone of Reinforcement Learning

Games are a common area of application for reinforcement learning. There, a well-defined set of rules naturally provides the environment, while the player naturally takes the role of the agent. The goal of the player is to obtain a reward, either by gaining some immediate advantage in a particular passage of the game, or by focusing only on winning the entire game at the end.

The Go world champion Lee Sedol lost 4-1 against the AI-powered engine AlphaGo. (Source: “AlphaGo”, Netflix)

A genuine milestone in computer intelligence came in 2016, when AlphaGo, an AI algorithm based on reinforcement learning and developed by DeepMind, beat the Go world champion Lee Sedol [1].

Applications: from Robotics to Physics

Reinforcement Learning (RL) is not restricted to games, though. Nowadays it is applied in many areas. Teaching a robot how to perform certain tasks by means of an action-reward mechanism has proven to be a highly effective approach [2]. In finance, RL algorithms are used to predict certain fluctuations of the market [3]. In medicine, RL is used in the framework of personalized medicine to make accurate prescriptions of certain drugs, and to assign optimal treatments to patients with distinct characteristics [4].

The action-reward mechanism for learning is driving significant advances in machine learning. Some researchers even argue that maximizing a properly defined reward is enough to eventually achieve Artificial General Intelligence [5].

Reinforcement learning finds interesting applications in physics as well. The seminar at Oxford focused on one of them: using reinforcement learning to perform complex optimization tasks related to the performance of a particle accelerator.

In this case, the particle accelerator represents the environment, which one can think of as a sequence of magnets. The actual behaviour of the magnets deviates from the nominal values, which alters the expected performance of the accelerator. These deviations must be compensated by activating secondary magnets, the corrector magnets. In this application, an agent is trained to learn the optimal configuration of the correctors. If a particular correction step increases the performance, the agent receives a positive reward; otherwise the reward is negative. With this approach we train an algorithm that performs the machine correction in very few steps compared to the large number of iterations a regular numerical optimizer requires.
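
The reward rule described here (positive when a correction step improves performance, negative otherwise) can be illustrated with a toy environment. This is a rough sketch and not the model used in the actual application: the class `ToyCorrectorEnv`, the assumption that each corrector cancels exactly one magnet error, the residual used as a performance proxy, and the reward values of +1 and -1 are all hypothetical.

```python
import random


class ToyCorrectorEnv:
    """Hypothetical stand-in for an accelerator: each corrector offsets one magnet error."""

    def __init__(self, n_correctors=4, seed=0):
        self.rng = random.Random(seed)
        self.n = n_correctors
        self.reset()

    def _residual(self):
        # Performance proxy (assumed): total uncorrected deviation, lower is better.
        return sum(abs(e + c) for e, c in zip(self.errors, self.correctors))

    def reset(self):
        # Draw new random magnet errors and switch all correctors off.
        self.errors = [self.rng.uniform(-1.0, 1.0) for _ in range(self.n)]
        self.correctors = [0.0] * self.n
        return self._residual()

    def step(self, index, delta):
        """Change one corrector by `delta`; reward is +1 if performance improved, else -1."""
        before = self._residual()
        self.correctors[index] += delta
        after = self._residual()
        reward = 1.0 if after < before else -1.0
        return after, reward


env = ToyCorrectorEnv()
start = env.reset()
# A step that exactly cancels the first magnet error improves the machine.
residual, good_reward = env.step(index=0, delta=-env.errors[0])
# A large kick on the same corrector makes things worse again.
_, bad_reward = env.step(index=0, delta=5.0)
print(good_reward, bad_reward)
```

An agent trained against such an environment would only ever see the residual and the sign of the reward, never the magnet errors themselves.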

Reinforcement Learning in Sports

At PickleTech we work on applications of reinforcement learning in medicine and sports. In sports, the environment is naturally based on the rules that govern the game. The agents are the players taking part in it, who interact through a sequence of actions and obtain rewards.

How to accurately model the environment, the set of actions a player can take, and the reward obtained after each action depends on the particular sport and the chosen framework. In football and basketball, well-known frameworks use the probability of scoring at the end of a play, or variations of it, as the reward. This is for instance the case of expected goals and expected possession value models. In other sports, like cycling, we can use other game context metrics such as times, distances, and positioning to model immediate rewards. The final purpose is in any case common to all sports: maximizing the chances of winning at the end of the game. In sports, we may also focus on particular actions during a competition, for instance simulating and evaluating the best strategy for a corner, a free kick, or a penalty kick in football, or deciding what to do when someone else breaks away in cycling.
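
One variation of the scoring-probability idea mentioned above for football and basketball is to reward each action with the change in scoring probability it produces. The sketch below assumes a hypothetical lookup table of scoring probabilities per pitch zone; the zone names and the numbers are made up for illustration and are not fitted to any data, and the function names are ours.

```python
# Hypothetical scoring probabilities per pitch zone (illustrative values, not fitted to data).
SCORING_PROBABILITY = {
    "own_half": 0.01,
    "midfield": 0.02,
    "final_third": 0.05,
    "penalty_area": 0.15,
}


def action_reward(zone_before, zone_after):
    """Reward of an action: change in the probability of scoring at the end of the play."""
    return SCORING_PROBABILITY[zone_after] - SCORING_PROBABILITY[zone_before]


def value_possession(zones):
    """Per-action rewards for a possession, given the sequence of zones it passed through."""
    return [action_reward(before, after) for before, after in zip(zones, zones[1:])]


# A possession that advances into the penalty area and is then pushed back out.
rewards = value_possession(["own_half", "midfield", "penalty_area", "final_third"])
print(rewards)
```

With this definition the rewards of a possession add up to the difference between its final and initial scoring probability, so an action that loses ground receives a negative reward.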

The agent-environment-reward logic is common to all games. But while in board or video games we can simulate multiple scenarios, in football, basketball or cycling we need actual historical data sets. These are available, and turn out to be very powerful, in the form of match event logs, tracking data, and similar data sets.

Another example is Formula 1 race strategy. There, pit stops for tyre changes are commonly scheduled following a Monte Carlo approach: multiple race scenarios are simulated, and the outcome of each is evaluated in terms of positions gained or lost during the stop. Using reinforcement learning we can train an agent that learns the dynamics of the environment (the race) and makes decisions based on observations of the current state of the race (e.g. drivers' positions, tyre compound and degradation, safety car presence, weather conditions), with the aim of deciding more accurately than traditional methods.
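
To make the Monte Carlo baseline concrete, here is a minimal sketch of how candidate pit laps could be compared over many simulated scenarios. It makes strong simplifying assumptions that are ours alone: a single car with one stop, linear tyre degradation, total race time as the outcome instead of positions gained or lost, and invented parameter values. The function names are hypothetical.

```python
import random
import statistics


def race_time(pit_lap, total_laps, degradation, pit_loss, base_lap=90.0):
    """Total time in seconds of a one-stop race under a linear tyre degradation model."""
    time = pit_loss
    tyre_age = 0
    for lap in range(1, total_laps + 1):
        time += base_lap + degradation * tyre_age  # older tyres give slower laps
        tyre_age += 1
        if lap == pit_lap:
            tyre_age = 0  # fresh tyres after the stop
    return time


def best_pit_lap(total_laps=50, n_scenarios=200, seed=0):
    """Evaluate every candidate pit lap on the same set of random scenarios."""
    rng = random.Random(seed)
    # Each scenario draws an uncertain degradation rate and pit-lane time loss (assumed values).
    scenarios = [(rng.gauss(0.10, 0.02), rng.gauss(22.0, 1.0)) for _ in range(n_scenarios)]
    mean_times = {}
    for pit_lap in range(1, total_laps):
        times = [race_time(pit_lap, total_laps, d, loss) for d, loss in scenarios]
        mean_times[pit_lap] = statistics.mean(times)
    best = min(mean_times, key=mean_times.get)
    return best, mean_times


best_lap, mean_times = best_pit_lap()
print(best_lap, round(mean_times[best_lap], 1))
```

Evaluating all candidates on the same sampled scenarios keeps the comparison between pit laps from being dominated by sampling noise. A reinforcement learning agent would replace the fixed candidate list with decisions taken lap by lap from the observed state of the race.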

Driven by Science

Reinforcement learning provides a new way to approach sports and find new strategies for playing the game, a notable example of such a strategic shift being the 3-point revolution in the NBA in recent years. It is not only about finding new strategies, however: the approach is also used to value the actual actions taken in the game, and thus the contribution of each player to the team. At PickleTech we work continuously on the development of tools that assist decision makers in complex and demanding environments. We believe reinforcement learning is a very powerful tool in many areas, so long as it is implemented within a scientific approach and with a dedicated validation process.


[1] Silver, D., Schrittwieser, J., Simonyan, K. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).

[2] Khan, Md. Al-Masrur, Sikder, Niloy, Mahmud, M. A., Nahid, Abdullah. A Systematic Review on Reinforcement Learning-Based Robotics Within the Last Decade. IEEE Access (2020).

[3] 7 Applications of Reinforcement Learning in Finance and Trading – neptune.ai

[4] Zhang Z; written on behalf of AME Big-Data Clinical Trial Collaborative Group. Reinforcement learning in clinical medicine: a method to optimize dynamic treatment regime over time. Ann Transl Med. 2019;7(14):345.

[5] David Silver, Satinder Singh, Doina Precup, Richard S. Sutton, Reward is enough, Artificial Intelligence, Volume 299, 2021, 103535, ISSN 0004-3702.