Active Inference is a theoretical framework of perception and action from neuroscience that can explain many phenomena in the brain. It aims to explain the behavior of cells, organs, animals, humans, and entire species. Active Inference is structured such that it requires only local computation and plasticity. In principle, it could therefore be implemented on neural hardware. Even though Active Inference has wide-reaching potential applications, for instance as an alternative to reinforcement learning, few people outside the neuroscience community are familiar with the framework. In this blog post, I want to give a machine learning perspective on the framework, omitting many neuroscience details. As such, this article is geared towards machine learning researchers familiar with probabilistic modeling and reinforcement learning.

Summary (tldr)
• Active Inference (in the presented form) relies only on local computation and plasticity
• Active Inference supports a hierarchy of spatiotemporal scales (cells, organs, animals, species, etc)
• In contrast to (commonly used) Reinforcement Learning approaches:
  • no reward has to be specified,
  • exploration is optimally solved, and
  • uncertainty is taken into account
• Active Inference relies solely on minimizing variational free energy
• Both perception and action become inference (including planning)
• Instead of rewards, agents have prior preferences over future observations that enter a prior over policies
• These preferences are shaped by model selection (e.g. evolution)
• Lower variational free energy corresponds to greater model evidence (e.g. adaptive fitness)

## The variational free energy

Active Inference is based on a single quantity: the variational free energy. As we will see later, minimizing the free energy explains both perception and action in any organism. The variational free energy has its origins in neuroscience but also has a very clean machine learning interpretation, so most researchers should already be familiar with it.

If you are entirely unfamiliar with the variational free energy, I would recommend reading this blog post before proceeding. To recap, given a latent variable model as shown in the above figure, with observed variables $x$ and latent variables $z$, we often want to optimize the model evidence

$$P(x) = \int P(x|z) P(z) \, dz$$

This integral often has no analytical form, and numerical integration is hard. Therefore, we resort to optimizing an alternative quantity: the variational free energy.

We define the negative variational free energy

$$F = \mathbb{E}_{Q(z)}[\log P(x|z)] - KL[Q(z)|P(z)]$$

where $Q$ is some approximate posterior distribution. Note that when we refer to the variational free energy, we will often omit that we are talking about the negative free energy $F$.

To see that $F$ lower-bounds the log model evidence, we can rewrite $F$ such that

$$F = \log P(x) - KL[Q(z)|P(z|x)]$$

$F$ is a lower bound because the $KL$-divergence is always greater than or equal to zero. Thus, we can maximize $F$ instead of $\log P(x)$. The bound is tight when the approximate posterior matches the true posterior:

$$Q(z) = P(z|x)$$
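The bound can be checked numerically on a tiny model. The following is a minimal sketch, assuming a single discrete latent variable with three values and one fixed observation. The numbers in `prior` and `likelihood` are made up for illustration. It computes the negative free energy as expected log-likelihood minus the $KL$-divergence to the prior, and compares it with the log evidence.

```python
import math

# Hypothetical toy model: one discrete latent variable with three values
# and a single fixed observation x.
prior = [0.5, 0.3, 0.2]           # P(z)
likelihood = [0.10, 0.60, 0.30]   # P(x | z) for the observed x


def log_evidence(prior, likelihood):
    """log P(x) = log sum_z P(x|z) P(z)."""
    return math.log(sum(p * l for p, l in zip(prior, likelihood)))


def negative_free_energy(q, prior, likelihood):
    """F = E_Q[log P(x|z)] - KL[Q(z) | P(z)]."""
    accuracy = sum(qi * math.log(li) for qi, li in zip(q, likelihood) if qi > 0)
    complexity = sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior) if qi > 0)
    return accuracy - complexity


def true_posterior(prior, likelihood):
    """P(z|x) via Bayes' rule."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


uniform_q = [1 / 3, 1 / 3, 1 / 3]
exact_q = true_posterior(prior, likelihood)

print("F with uniform Q:  ", negative_free_energy(uniform_q, prior, likelihood))
print("F with exact Q:    ", negative_free_energy(exact_q, prior, likelihood))
print("log evidence:      ", log_evidence(prior, likelihood))
```

With an arbitrary $Q$ the value stays below the log evidence, and with the exact posterior the two coincide up to floating point error.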

## The variational free energy in the context of perception

Active Inference uses the variational free energy both in the context of perception and action. We begin by describing its role in perception. What do we mean by the term perception? Intuitively, an agent observes and understands its environment. This means the agent can extract the underlying dynamics and causes in order to make sense of its surroundings. Modeling the environment in that manner helps to predict the past, present, and future which can either be viewed as an inherent property of any organism or as a tool to act and realize goals (in Active Inference formulated as extrinsic prior beliefs) as we will see later. Note that the agent can never be certain about the underlying structure, but only updates beliefs based on what it has observed. A very simplified version of perception is the modeling of these hidden causes and dynamics as a time series latent variable model.

We formalize the task of perception using the following notation: An agent receives observations $o_{1:T}$ over time steps $t \in \{1, \ldots, T\}$ and aims to model the underlying hidden states $s_{1:T}$ and their dynamics. This is visualized in the following graphical model.

Similar to its original definition, the free energy for a specific point in time $t$ then becomes

$$F_t = \mathbb{E}_{Q(s_t)}[\log P(o_t|s_t)] - KL[Q(s_t)|P(s_t)]$$

The approximate posterior distribution $Q(s_t)$ now describes beliefs about the latent state $s_t$ based on the observation $o_t$. The task of perception is then to optimize over $Q(s_t)$, in other words, to update the beliefs about states $s_t$ such that the observations the agent has made become most likely.

The two terms of the free energy are often also called the accuracy $\mathbb{E}_Q[\log P(o_t|s_t)]$ and the complexity $KL[Q(s_t)|P(s_t)]$. Increasing the first term increases the likelihood that the inferred states model the observations correctly and the $KL$-divergence describes how different the beliefs will have to be from the prior beliefs (and therefore how much information they contain).

So what does it mean to minimize the free energy in the context of perception? It means to adapt (posterior) beliefs about the environment such that the surprise of new observations is minimized. Thus, the (positive) free energy is a measure of surprise: it upper-bounds the surprise $-\log P(o_t)$ of an observation.
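To make "perception as optimization" concrete, here is a minimal sketch, assuming a two-state world, a made-up prior and likelihood, and a belief $Q(s_t)$ parameterized by softmax logits. The function names, step size, and number of steps are illustrative choices, not part of the framework. Gradient ascent on the negative free energy drives the belief towards the exact Bayesian posterior.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_free_energy(q, prior, lik):
    """Accuracy minus complexity: E_Q[log P(o|s)] - KL[Q(s) | P(s)]."""
    return sum(qi * (math.log(li) + math.log(pi) - math.log(qi))
               for qi, pi, li in zip(q, prior, lik) if qi > 0)


def perceive(prior, lik, steps=500, lr=1.0):
    """Gradient ascent on F with respect to the logits of Q(s)."""
    logits = [0.0] * len(prior)
    for _ in range(steps):
        q = softmax(logits)
        # h_i = log P(o|s_i) + log P(s_i) - log Q(s_i)
        h = [math.log(li * pi) - math.log(qi) for qi, pi, li in zip(q, prior, lik)]
        h_mean = sum(qi * hi for qi, hi in zip(q, h))
        # dF/dlogit_i = Q(s_i) * (h_i - E_Q[h])
        logits = [th + lr * qi * (hi - h_mean) for th, qi, hi in zip(logits, q, h)]
    return softmax(logits)


prior = [0.5, 0.5]          # P(s_t), hypothetical
lik = [0.9, 0.2]            # P(o_t | s_t) for the observed o_t, hypothetical
belief = perceive(prior, lik)
print("inferred belief:", belief)
print("negative free energy:", negative_free_energy(belief, prior, lik))
print("log evidence:", math.log(sum(p * l for p, l in zip(prior, lik))))
```

At convergence the negative free energy equals the log evidence, so the remaining positive free energy is exactly the surprise of the observation.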

The careful reader might have noticed that we currently only take into account the observation $o_t$ to infer $s_t$ but neither past nor future. Also, we have not talked about learning the transitions from $s_t$ to $s_{t+1}$ and the observations these states generate ($P(o_t|s_t)$). We will look at these issues more carefully when introducing the general framework of Active Inference. Furthermore, while we have modeled a single sequence of latent variables $s_{1:T}$ it is possible to stack latent variables in order to build a hierarchy of abstractions (i.e. deep generative models).

## From the variational free energy to Active Inference

So far, we have discussed how an agent realizes perception by minimizing the free energy. We will now extend this free-energy framework to include action, yielding Active Inference. Whenever we talk about an agent acting in an environment, we usually use the reinforcement learning framework. Active Inference offers an alternative to this framework, not driving action through reward but through preferred observations. We will find that action is another result of the minimization of free energy.

Before we jump right in, recall the reinforcement learning framework, as depicted below.

We have an agent defined by its policy $\pi$ that selects the action $u_t$ to take at time $t$ such that the return $R = \sum_{\tau > t} r_\tau$ (the sum of rewards) is maximized. The optimal action can be derived from Bellman’s Optimality Principle such that the optimal policy takes the action $u_t$ that maximizes the value function

$$V(s_t) = \max_{u_t} \mathbb{E}\left[r_{t+1} + V(s_{t+1}) \mid s_t, u_t\right]$$

In Active Inference we will not have to define a reward $r$, return $R$, or value function $V$. There is a fundamental reason for this: Active Inference treats exploration and exploitation as two sides of the same coin – in terms of choosing actions to minimize surprise or uncertainty. This brings something quite fundamental to the table; namely, the value function of states has to be replaced with a (free energy) functional of beliefs about states. Similarly, reward is replaced by (prior) beliefs about preferred, unsurprising, outcomes. This means we define the agent’s extrinsic preferences using priors on future observations $P(o_\tau)$ with $\tau > t$ (in the following we will always denote future observations with the index $\tau$ and past observations with $\rho$). The agent will then try to follow a policy that realizes these prior expectations of the future (so-called self-evidencing behavior). At the same time, we prefer observations $o_\tau$ that are likely under our model of the environment. In effect, both principles are a form of surprise minimization. Previously, we have identified the minimization of surprise as the minimization of the free energy. Thus, action can be integrated into our existing free energy model through priors over future observations $P(o_\tau)$. We then combine the prior $P(o_\tau)$ with our posterior beliefs $Q(s_\tau)$ to infer the best policy $\pi$ that minimizes our expected surprise. We describe our posterior belief over the policy we should follow with $Q(\pi)$. Using this strategy, we can essentially reduce action to inference through the means of minimizing expected free energy. This concept is also known as Hamilton’s Principle of Least Action.

The following figure visualizes Active Inference as a principle of self-evidencing. An agent acts to generate outcomes that fulfill its prior and model expectations (left side). At the same time, perception ensures that the model of the world is consistent with the observations (right side).

For the purpose of this blog post, it is very important to make a clear distinction between the free energy $F$ minimized for perception and the expected (future) free energy $G$ that reflects the amount of surprise in the future given a particular policy $\pi$. We will later present a precise formalization of $G$. For now, it suffices to say that it is the expected free energy of the future when a particular policy is followed. Thus, $G$ is the quantity to optimize in order to realize future preferences $P(o_\tau)$. Note that the free energies $F$ and $G$ can, in principle, be reformulated as a single quantity (but we will not further investigate this angle here).

At this point, it is worth noting that $G$ is a very universal quantity. Parallels can be drawn to Bayesian surprise, risk-sensitive control, expected utility theory, and the maximum entropy principle. Having said that, we will not further investigate these similarities because they are not essential for understanding Active Inference. More details and references on this topic can be found in the last section.

To summarize, we solve perception and action in a unifying framework that minimizes the free energy. Furthermore, because we will implement action through Bayesian probability theory, we obtain a Bayes optimal agent if no approximations have to be used. Before we turn to a precise mathematical definition of the different components that make up an Active Inference agent, we present some of the evidence for Active Inference being the fundamental principle underlying any form of living organism.

## Active Inference as a foundation for life

We have introduced Active Inference as a unifying framework for perception and action. The reach of this framework is quite extensive. It can be used to explain the self-organization of living systems, such as cells, neurons, organs, animals, and even entire species. Minimizing the free energy via action and perception is the central principle - called the free energy principle (FEP).

According to the FEP, central to any organism is its Markov blanket. In the statistical sense, a Markov blanket $b$ of states $s$ is a set of random variables that, when conditioned upon, renders $s$ independent of all other variables. Such a Markov blanket also exists for any living system. Intuitively, no organism can directly observe or modify its environment. Any interaction is through its sensors (only reflecting a reduced view of the environment) and its actuators (with limited capabilities to act upon the environment). It is this boundary that separates its internal states from its external milieu, and without it, the system would cease to exist. Formally, an organism receives sensory inputs $o \in O$ through its sensory states. Based on these inputs, it constructs a model of its environment $s \in S$. According to this model and in order to realize prior preferences, the organism takes actions $u \in \Upsilon$, the so-called active states. The only interaction with its environment $\psi \in \Psi$ is through its Markov blanket consisting of sensory and active states.

If an organism endures in a changing environment it must, by definition, retain its Markov blanket. When an organism minimizes free energy, it minimizes surprise (of sensory input, i.e. observations). Because under ergodic assumptions the long-term average of surprise is the entropy of these sensory states, retaining its Markov blanket is equivalent to placing an upper bound on the entropy of these sensory states. Conversely, if the Markov blanket is not maintained, entropy (disorder) of the sensory states diverges and subsequently leads to disintegration and death of the organism.
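The link between long-term average surprise and entropy can be illustrated with a small simulation. This is a minimal sketch, assuming independent samples from a fixed, made-up categorical distribution over sensory states, which is a much stronger assumption than ergodicity alone. The function names are illustrative.

```python
import math
import random


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def average_surprise(p, n, seed=0):
    """Average of -log P(o) over n observations sampled from p."""
    rng = random.Random(seed)
    samples = rng.choices(range(len(p)), weights=p, k=n)
    return sum(-math.log(p[o]) for o in samples) / n


sensory_distribution = [0.5, 0.25, 0.25]   # hypothetical P(o)
print("entropy:         ", entropy(sensory_distribution))
print("average surprise:", average_surprise(sensory_distribution, 100_000))
```

The two numbers agree closely, which is why bounding surprise at every step also bounds the entropy of the sensory states in the long run.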

The concept of Markov blankets can be used to model living systems recursively across spatial scales. For instance, humans consist of different organs which in turn consist of countless cells. Each system can be viewed as free-energy-minimizing, maintaining its own Markov blanket.

Let’s look at an example. How can an ensemble of cells form entire organs, a process called morphogenesis? Each cell needs to assume some specific function at a location for the entire organ to function. There is no central organization. Thus, each cell must infer the function and position of all other cells just from the signals reaching its local Markov blanket. Active Inference may solve this by assuming that each pluripotential cell starts out with a generative model of the entire ensemble. Therefore, a cell can predict which sensory input it would receive depending on its location in the organ. Each cell only optimizes its local free energy, but because the generative model of each cell embodies a model of the entire organ, a local free energy minimum of each cell corresponds to the entire ensemble converging to a global free energy minimum. Crucially, through this optimization, each cell will act upon the environment through $u \in \Upsilon$ to help other cells reach their respective free energy minimum.

Finally, evolution plays a very central role in free energy minimization in biotic systems. In a changing environment, the Markov blanket will eventually be destroyed, which results in the death of the organism. Therefore, species have developed the ability to reproduce, effectively transferring genetic and epigenetic information to their descendants. This information specifies the generative model (including prior preferences) in their descendants. Crucially, information is not transmitted noise-free and each generation introduces slight variations. These variations lead to changing generative models and prior preferences that create a selection process among population members. The adaptive fitness of each organism is reflected in how well the model fits to the niche of the species. But not only these variations are driving species, each organism can also shape evolution by free energy minimization. The adaptive pressure mostly depends on the niche of the species. But free energy minimization prefers predictable environments that behave according to prior preferences. Therefore, the niche itself is also shaped through actions that lead to future free energy minimization.

Evolution, therefore, can be seen as a higher-level, more slowly moving process that defines empirical priors $P(o_\tau)$. This hierarchy of temporal scales can be constructed analogously to the hierarchy of spatial scales we constructed previously. In this hierarchy, higher levels treat the preferences of lower levels as outcomes that need to be explained.

Furthermore, because free energy is an extensive property, hierarchical applications of free energy minimization at different spatial and temporal scales mean that there is an interesting circular causality – in which the minimization at one scale (e.g., evolution of a species) both creates and depends on free energy minimization at a lower scale (e.g., econiche construction by the conspecifics of a species).

## The generative model

We now introduce the generative model for Active Inference in all its mathematical detail. In order to simplify explanations, we present a specific model for Active Inference that is a special case in some aspects and thus no longer is entirely assumption and approximation free. We make the following assumptions:

• We have several finite discrete random variables for
  • outcomes $o_t$
  • actions $u_t$
  • states $s_t$
  • policies $\pi$ (the policy itself is a function $\pi(t) = u_t$, but we have a discrete set of such policies)
• The transitions between states are Markovian

Furthermore, as described in more detail later, we use variational approximate inference to make the inference process tractable.

The following graphical model shows all the relationships between the variables in our model. Alongside the latent variables $s_t$ and $\pi$, we have the transition matrix $B$, the observation probabilities $A$, and a variable $D$ describing the initial distribution over states $s_1$. At the beginning of this blog post, we considered $B$, $A$, and $D$ to be fixed. Furthermore, the matrix $U$ describes prior preferences over future observations $P(o_\tau) = \sigma(U_\tau)$, where $\sigma$ denotes the softmax function. These prior preferences will be integrated into the expected free energy $G(\pi)$, which in turn defines the prior over the policy $\pi$. This relationship is crucial to Active Inference! We want prior preferences on $P(o_\tau)$ but implement them in our generative model by expressing them as a prior $P(\pi|\gamma)$. Finally, $\gamma$ is a temperature parameter that increases or decreases the confidence in the policy $\pi$.
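As a rough illustration of these ingredients, the following minimal sketch writes down $A$, $B$, $D$, and $U$ for a hypothetical world with two states, two outcomes, and two actions, and samples outcomes from it. All numbers, the index conventions (`A[o][s]`, `B[u][s_next][s]`), and the helper names are assumptions made for this example.

```python
import math
import random


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# A[o][s] = P(o | s): each state mostly produces "its own" outcome.
A = [[0.9, 0.1],
     [0.1, 0.9]]
# B[u][s_next][s] = P(s_next | s, u): action u mostly leads to state u.
B = {0: [[0.9, 0.9], [0.1, 0.1]],
     1: [[0.1, 0.1], [0.9, 0.9]]}
# D[s] = P(s_1): initial state distribution.
D = [0.5, 0.5]
# U[o]: unnormalized log-preferences, P(o_tau) = softmax(U).
U = [2.0, 0.0]
preferred_outcomes = softmax(U)


def column(matrix, j):
    """Conditional distribution stored in column j."""
    return [row[j] for row in matrix]


def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs)[0]


def rollout(policy, rng):
    """Sample outcomes o_1, ..., o_{len(policy)+1} while following a policy."""
    s = sample(D, rng)
    outcomes = [sample(column(A, s), rng)]
    for u in policy:
        s = sample(column(B[u], s), rng)
        outcomes.append(sample(column(A, s), rng))
    return outcomes


print("preferences over outcomes:", preferred_outcomes)
print("sampled outcomes:", rollout([0, 0, 1], random.Random(0)))
```

Note that the preferences `U` play no role in generating outcomes. They only enter later, through the prior over policies.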

The policy $\pi$ differs from its Reinforcement Learning counterpart in that it is not a function of previous states $s_{t-1}$ but only of the current time $t$. Essentially, each possible policy describes a trajectory of actions. This is sufficient because the state history $s_{1:t}$ and future expected states $s_{\tau}$ are taken into consideration automatically by minimizing the free energy to achieve preferred outcomes $o_\tau$. Note that, regarding tractability, Friston argues that an agent only entertains a handful of policies at a time. This selection is optimized through a process further up in the temporal hierarchy (e.g. evolution). To give an example, the brain has evolved to only consider a limited repertoire of eye movements with a short time horizon.
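To see why only a handful of policies can be entertained, consider a minimal sketch in which a policy is represented as a fixed tuple of actions indexed by time. The representation and the helper names are assumptions for illustration. Enumerating all such tuples shows how quickly the set grows with the horizon.

```python
from itertools import product


def enumerate_policies(num_actions, horizon):
    """All action sequences of the given length, one tuple per policy."""
    return list(product(range(num_actions), repeat=horizon))


def action_at(policy, t):
    """pi(t) = u_t: the action depends on the time step only, not on the state."""
    return policy[t]


policies = enumerate_policies(num_actions=2, horizon=3)
print(len(policies), "policies, e.g.", policies[5])
print("action of that policy at t=0:", action_at(policies[5], 0))
print("policies with 4 actions and horizon 10:", 4 ** 10)
```

The count is `num_actions ** horizon`, so exhaustive enumeration is only feasible for short horizons and small action sets.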

Analogous to our introduction of the free energy and perception, we now define the negative variational free energy for this generative model, which we will have to maximize (or, equivalently, minimize the positive free energy):

$$F = \mathbb{E}_{Q(x)}[\log P(o_{1:t}|x)] - KL[Q(x)|P(x)]$$

where $x$ are all our latent variables $x = (s_{1:t}, \pi, A, B, D, \gamma)$.

In order to do inference, we have to define a posterior distribution $Q(x)$. We simplify our inference drastically by using a mean field approximation, such that the posterior factors

Each factor can be represented by its sufficient statistics, denoted by the bar $\bar x$

Because our approximate posterior $Q(x)$ factors, we can rewrite our free energy $F$ such that it factors into policy dependent terms $F(\pi, \rho)$ given by

Our policy dependent free energy then only takes the expectation over a single hidden state $s_\rho$ and conditioned on the policy:

Having factored our approximate posterior distribution and free energy in this manner, we can use belief propagation (or variational message passing) to calculate the sufficient statistics $\bar x$. Based on these, we can finally choose our action. Though, recall that the prior $P(\pi|\gamma)$ required a quantity called the expected free energy $G$ for each policy, which is what we will define next.

### Expected free energy for each policy

As we have established already, Active Inference requires two kinds of free energy: The free energy $F$ optimized for perception (i.e. inference), and the expected free energy $G$ for each policy $\pi$ that defines the prior distribution over policies $P(\pi)$; thereby informing the posterior $Q(\pi)$. Remember that we want to pick policies $\pi$ that minimize $-G$, essentially minimizing surprise. This is directly reflected in the prior $P(\pi|\gamma) = \sigma(\gamma \cdot G(\pi))$, making $\pi$ more likely for larger $G$.
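The effect of the temperature $\gamma$ on the prior $\sigma(\gamma \cdot G(\pi))$ can be seen in a minimal sketch. The values in `G` are made-up negative expected free energies for three hypothetical policies, and `policy_prior` is an illustrative helper name.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def policy_prior(G, gamma):
    """P(pi | gamma) = softmax(gamma * G(pi)); larger G means more likely."""
    return softmax([gamma * g for g in G])


G = [-1.0, -2.5, -4.0]   # hypothetical values, one per policy
for gamma in (0.0, 1.0, 4.0):
    print(gamma, [round(p, 3) for p in policy_prior(G, gamma)])
```

With $\gamma = 0$ all policies are equally likely, and as $\gamma$ grows the prior concentrates on the policy with the largest $G$.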

$G(\pi)$ is defined by the path integral over future timesteps $\tau > t$. This essentially means that just like the free energy, we factor $G$ across time. $G(\pi)$ is given by

$$G(\pi) = \sum_{\tau > t} G(\pi, \tau)$$

where each $G(\pi, \tau)$ is defined by the expected free energy at time $\tau$ if policy $\pi$ is pursued. Recall that the variational free energy can be written as

Because $G$ models expectations over the future, we define the predictive posterior $\tilde Q$, which now also treats the future observations $o_\tau$ as latent variables. Basically, we take the last posterior $Q(s_t)$ and recursively apply the transitions $B$ according to policy $\pi$ (and the observation probabilities $A$ to yield observations $o_\tau$). Thus, $\tilde Q$ is given by

$$\tilde Q(o_\tau, s_\tau) = P(o_\tau|s_\tau) \, Q(s_\tau|\pi)$$
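The recursive construction of the predictive posterior can be sketched in a few lines. This is a minimal illustration, assuming categorical beliefs stored as lists, made-up matrices with the conventions `A[o][s]` and `B[u][s_next][s]`, and a policy given as a tuple of future actions. The function names are hypothetical.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def predict(q_now, policy, A, B):
    """Roll current state beliefs forward under a policy.

    Returns a list of (state beliefs, outcome beliefs), one pair per future step.
    """
    predictions = []
    q_s = q_now
    for u in policy:
        q_s = matvec(B[u], q_s)   # apply the transition chosen by the policy
        q_o = matvec(A, q_s)      # map predicted states to predicted outcomes
        predictions.append((q_s, q_o))
    return predictions


A = [[0.9, 0.1],
     [0.1, 0.9]]
B = {0: [[0.8, 0.8], [0.2, 0.2]],
     1: [[0.2, 0.2], [0.8, 0.8]]}

for q_s, q_o in predict([0.5, 0.5], (0, 1), A, B):
    print("states:", q_s, "outcomes:", q_o)
```

Only matrix-vector products are needed, so predicting further into the future costs one extra step per time point and policy.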

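The expected variational free energy is now simply the free energy for each policy under the expectation of the predictive posterior $\tilde Q$ instead of the posterior $Q$:

$$\begin{aligned} G(\pi, \tau) &= \mathbb{E}_{\tilde Q}[\log P(s_\tau, o_\tau|o_{1:t}, \pi) - \log Q(s_\tau|\pi)] \\ &= \mathbb{E}_{\tilde Q}[\log P(s_\tau|o_\tau, o_{1:t}, \pi) + \log P(o_\tau) - \log Q(s_\tau|\pi)] \end{aligned}$$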

The last equation now also makes the role of the prior $P(o_\tau)$ apparent. We simply interpret the marginal $P(o_\tau)$ as a distribution over the sorts of outcomes the agent prefers. Through this interpretation, the expected free energy $G$ will be shaped by these preferences. Because we maximize $G(\pi, \tau)$, we will also maximize the prior probability $\log P(o_\tau)$, in effect making preferred observations more likely. We can gain even more intuition from the expected free energy if we rewrite it in an approximate form by simply replacing $P(s_\tau|o_\tau, o_{1:t}, \pi)$ with an approximate posterior $Q(s_\tau|o_\tau, \pi)$, essentially dropping the dependence on the observation history $o_{1:t}$:

$$G(\pi, \tau) \approx \underbrace{\mathbb{E}_{\tilde Q}[\log Q(s_\tau|o_\tau, \pi) - \log Q(s_\tau|\pi)]}_{\text{epistemic value}} + \underbrace{\mathbb{E}_{\tilde Q}[\log P(o_\tau)]}_{\text{extrinsic value}}$$

This form shows that maximizing $G$ leads to behavior that either learns about the environment (epistemic value, information gain) or maximizes the extrinsic value. In experiments, it can be observed that initially, the first term dominates and later little information can be gained by exploration, therefore the extrinsic value is maximized. Active Inference, therefore, provides a Bayes optimal exploration and exploitation trade-off.
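The two contributions can be computed explicitly for discrete variables. The following is a minimal sketch, assuming that the epistemic value is the expected log ratio between the beliefs about a future state after and before seeing the future outcome, and that the extrinsic value is the expected log preference $\log P(o_\tau)$ with $P(o_\tau) = \sigma(U)$. The matrices and the function name are made up for illustration.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def expected_free_energy_terms(q_s, A, U):
    """Return (epistemic, extrinsic) value for one future time step.

    q_s: predicted state beliefs under a policy; A[o][s] = P(o | s);
    U: unnormalized log-preferences over outcomes.
    """
    log_pref = [math.log(p) for p in softmax(U)]
    n_o, n_s = len(A), len(q_s)
    q_o = [sum(A[o][s] * q_s[s] for s in range(n_s)) for o in range(n_o)]
    epistemic = 0.0
    for o in range(n_o):
        for s in range(n_s):
            joint = A[o][s] * q_s[s]          # predictive joint over (o, s)
            if joint > 0:
                posterior = joint / q_o[o]    # belief about s after seeing o
                epistemic += joint * (math.log(posterior) - math.log(q_s[s]))
    extrinsic = sum(q_o[o] * log_pref[o] for o in range(n_o))
    return epistemic, extrinsic


informative_A = [[0.9, 0.1], [0.1, 0.9]]
blind_A = [[0.5, 0.5], [0.5, 0.5]]
U = [2.0, 0.0]

print("uncertain, informative:", expected_free_energy_terms([0.5, 0.5], informative_A, U))
print("uncertain, blind:      ", expected_free_energy_terms([0.5, 0.5], blind_A, U))
print("certain, informative:  ", expected_free_energy_terms([1.0, 0.0], informative_A, U))
```

The epistemic value is positive only when the agent is uncertain and observations are informative. Once the state is known it vanishes, and only the extrinsic value distinguishes policies.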

To summarize, we have defined the generative model, the free energy that is minimized for perception, its approximate posterior, and the expected free energy that is used to drive action.

## The algorithm in action

We now have all the pieces to put together the algorithm consisting of inference, learning, and taking action, as depicted in the figure below. We are only left with deriving the belief updates and picking an action. Note that I will list all the belief updates in this section. Feel free to skip the exact mathematical details. I only provide these for a complete understanding of how Active Inference would have to be computed.

### Inference

Recall that inference is done by maximizing

$$F = \mathbb{E}_{Q(x)}[\log P(o_{1:t}|x)] - KL[Q(x)|P(x)]$$

As mentioned before, we use a mean field approximation to factor our posterior $Q(x)$. Our beliefs $Q(x)$ are then updated through belief propagation, iteratively updating our sufficient statistics $\bar x$. The belief updates are derived by differentiating the variational free energy $F$ w.r.t. the sufficient statistics and setting the result to zero. One finally obtains the following update equations:

where $F$ and $G$ are vectors holding, for each policy, the free energy $F(\pi) = \sum_{\rho} F(\pi, \rho)$ summed over the time steps $\rho$ observed so far and the expected free energy $G(\pi) = \sum_{\tau > t} G(\pi, \tau)$.
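To illustrate how the two vectors could interact, here is a minimal sketch, assuming that the posterior over policies is a softmax of $F + \gamma \cdot G$ with both quantities in their negative (to be maximized) form. This form is an assumption borrowed from common discrete-state formulations, and the numbers and the function name are made up.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def policy_posterior(F, G, gamma):
    """Assumed update: Q(pi) = softmax(F(pi) + gamma * G(pi))."""
    return softmax([f + gamma * g for f, g in zip(F, G)])


F = [-3.0, -1.0, -1.0]   # how well each policy explains past observations
G = [-1.0, -1.0, -2.0]   # how well each policy is expected to do in the future
print(policy_posterior(F, G, gamma=1.0))
```

Under this assumption, a policy is believed in only if it is consistent with what has been observed so far and also promises low expected surprise in the future.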

$F(\pi, \rho)$ and $G(\pi, \tau)$ can be computed as follows.

### Learning

Learning $A$, $B$, and $D$ is simply inference as well. As such, we obtain belief updates using the same method as above.

### Choosing action

Finally, we choose an action by marginalizing over the posterior beliefs about policies to form a posterior distribution over the next action. Generally, in simulating Active Inference, the most likely (a posteriori) action is selected and a new observation is sampled from the world.

$$u_t = \arg\max_u \mathbb{E}_{Q(\pi)}[P(u|\pi)]$$

where

$$P(u|\pi) = \begin{cases} 1 & \text{if } u = \pi(t) \\ 0 & \text{otherwise} \end{cases}$$
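Action selection by marginalizing over policies can be sketched as follows. This is a minimal illustration, assuming policies are tuples of actions indexed by time and that posterior beliefs about policies are given as a list of probabilities. The helper names and numbers are hypothetical.

```python
def action_posterior(policies, q_pi, t, num_actions):
    """Sum the posterior mass of all policies that prescribe each action at time t."""
    probs = [0.0] * num_actions
    for policy, q in zip(policies, q_pi):
        probs[policy[t]] += q
    return probs


def select_action(policies, q_pi, t, num_actions):
    """Pick the most likely action a posteriori."""
    probs = action_posterior(policies, q_pi, t, num_actions)
    return max(range(num_actions), key=probs.__getitem__)


policies = [(0, 0), (0, 1), (1, 0), (1, 1)]
q_pi = [0.1, 0.2, 0.3, 0.4]   # hypothetical posterior over the four policies

print(action_posterior(policies, q_pi, t=0, num_actions=2))
print(select_action(policies, q_pi, t=0, num_actions=2))
```

Note that the selected action need not belong to the single most likely policy. What matters is the total posterior mass behind each action at the current time step.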

## Conclusion

We have seen that Active Inference unifies action and perception by minimizing a single quantity - the variational free energy. Through this simple concept, we have reduced action to inference. Compared to Reinforcement Learning, no reward had to be specified, exploration is optimally solved and uncertainty is taken into account. Instead of rewards, agents have prior preferences over observations that are shaped by evolution. A brief look at Active Inference as the foundation of life showed that the concept has an extensive biological applicability. Furthermore, we presented a model of Active Inference that, while complete, made several approximations and has limitations. From a machine learning perspective, future work will have to investigate how this scheme can be extended to long sequences and large discrete or continuous state spaces in a scalable manner. Additionally, hierarchical internal states or recursive Markov-blanket-based systems are an interesting future research direction.

## Acknowledgements

I thank Karl Friston and Casper Hesp for valuable feedback and discussions.