<!-- Jekyll feed (http://louiskirsch.com/feed.xml), generated 2021-01-01 -->
<p>Louis Kirsch: Deep Learning and Reinforcement Learning researcher building RL agents that meta-learn their own learning algorithm. Currently pursuing a PhD in Artificial Intelligence at IDSIA with Jürgen Schmidhuber.</p>
<h1 id="general-meta-learning-and-variable-sharing">General Meta Learning and Variable Sharing</h1>
<p><em>2020-11-27, <a href="http://louiskirsch.com/neurips-2020">louiskirsch.com/neurips-2020</a></em></p>
<div class="video-wrapper">
<iframe src="https://www.youtube.com/embed/qwOqgrMaH88" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
<p>
<div class="card">
<div class="card-header">Invited talk abstract (NeurIPS 2020 Meta Learning Workshop)</div>
<div class="card-body">
<p>Humans develop learning algorithms that are incredibly general and can be applied across a wide range of tasks.
Unfortunately, this process is often tedious trial and error with numerous possibilities for suboptimal choices.
General Meta Learning seeks to automate many of these choices, generating new learning algorithms automatically.
Different from contemporary Meta Learning, where the generalization ability has been limited, these learning algorithms ought to be general-purpose.
This allows us to leverage data at scale for learning algorithm design that is difficult for humans to consider.
I present a General Meta Learner, MetaGenRL, that meta-learns novel Reinforcement Learning algorithms that can be applied to significantly different environments.
We further investigate how we can reduce inductive biases and simplify Meta Learning.
Finally, I introduce Variable Shared Meta Learning (VS-ML), a novel principle that generalizes Learned Learning Rules, Fast Weights, and Meta RNNs (learning in activations).
This enables (1) implementing backpropagation purely in the recurrent dynamics of an RNN and (2) meta-learning algorithms for supervised learning from scratch.</p>
</div>
</div>
</p>
<p><a class="btn btn-labeled btn-light" href="/metagenrl">
<i class="fas fa-robot"></i> Blog &amp; Paper on MetaGenRL
</a>
<a class="btn btn-labeled btn-light" href="https://arxiv.org/abs/2012.14905">
<i class="fas fa-file"></i> Paper on VS-ML
</a></p>
<h2 id="variable-shared-meta-learning-vs-ml">Variable Shared Meta Learning (VS-ML)</h2>
<p><img src="/assets/publications/vsml-poster.svg" alt="Variable Shared Meta Learning Poster" />
<a href="/assets/publications/vsml-poster.pdf">Poster PDF</a></p>
<h2 id="invited-talk">Invited talk</h2>
<p>My invited talk took place at the <a href="https://neurips.cc/virtual/2020/protected/workshop_16141.html">NeurIPS 2020 Meta Learning Workshop</a>.</p>
<p>Please cite my talk using</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{
kirsch2020generalmeta,
title={General Meta Learning},
author={Louis Kirsch},
howpublished={Meta Learning Workshop at Advances in Neural Information Processing Systems},
year={2020}
}
</code></pre></div></div>
<p>and Variable Shared Meta Learning using</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{
kirsch2020vsml,
title={Meta Learning Backpropagation And Improving It},
author={Louis Kirsch and Juergen Schmidhuber},
journal={Meta Learning Workshop at Advances in Neural Information Processing Systems},
year={2020}
}
</code></pre></div></div>

<h1 id="metagenrl-improving-generalization-in-meta-reinforcement-learning">MetaGenRL: Improving Generalization in Meta Reinforcement Learning</h1>
<p><em>2019-10-24, <a href="http://louiskirsch.com/metagenrl">louiskirsch.com/metagenrl</a></em></p>
<div class="video-wrapper">
<iframe src="https://www.youtube.com/embed/pPBV54ZjJBc" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
<p>
<div class="card">
<div class="card-header">Abstract (tldr)</div>
<div class="card-body">
<p>Biological evolution has distilled the experiences of many learners into the general learning algorithms of humans.
Our novel meta reinforcement learning algorithm MetaGenRL is inspired by this process.
MetaGenRL distills the experiences of many complex agents to meta-learn a low-complexity neural objective function that affects how future individuals will learn.
Unlike recent meta-RL algorithms, MetaGenRL can generalize to new environments that are entirely different from those used for meta-training.
In some cases, it even outperforms human-engineered RL algorithms.
MetaGenRL uses off-policy second-order gradients during meta-training that greatly increase its sample efficiency.</p>
</div>
</div>
</p>
<p><a class="btn btn-labeled btn-light" href="https://arxiv.org/abs/1910.04098">
<i class="fas fa-file"></i> Paper on ArXiv
</a>
<a class="btn btn-labeled btn-light" href="https://github.com/louiskirsch/metagenrl">
<i class="fab fa-github"></i> Code on Github
</a></p>
<h2 id="meta-learning-rl-algorithms">Meta-Learning RL algorithms</h2>
<p>Similar to many other researchers, my goal is to <strong>build intelligent general-purpose agents</strong> that can independently <a href="/ai/universal-ai#what-is-intelligence">solve a wide range of problems</a> and continuously improve.
At the core of this ability are learning algorithms.
Natural evolution, for instance, has equipped us humans with general learning algorithms that allow for quite intelligent behavior.
These learning algorithms are the result of distilling the collective experiences of many learners throughout the course of evolution into a compact genetic code.
In a sense, <strong>evolution is a learning algorithm that produced another learning algorithm</strong>.
This process is called Meta-Learning and our new paper <a href="https://arxiv.org/abs/1910.04098">MetaGenRL</a> for the first time shows that we can artificially learn quite general (albeit still simple) learning algorithms in a similar manner.</p>
<p>In contrast to this, most current Reinforcement Learning (RL) algorithms are the result of years of human engineering and design (such as <a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf">REINFORCE</a>, <a href="https://arxiv.org/abs/1707.06347">PPO</a>, or <a href="https://arxiv.org/abs/1509.02971">DDPG</a>).
The problem with this approach is that we don’t know what the best learning algorithm is or which learning algorithm to use in which context.
Thus, current algorithms are inherently limited by the ability of the researcher to make the right design choices.
This problem is also discussed in Jeff Clune’s <a href="https://arxiv.org/abs/1905.10985">AI generating algorithms paper</a>.</p>
<p>In Meta Reinforcement Learning, we not only learn to act in the environment but also learn how to learn, reducing the aforementioned problem.
This in principle allows us to meta-learn general learning algorithms that surpass human-engineered alternatives.
Of course, we are not the first to suggest this; a good overview of Meta-RL can be found on <a href="https://lilianweng.github.io/lil-log/2019/06/23/meta-reinforcement-learning.html">Lilian Weng’s blog</a>.
Unfortunately, in practice, <strong>Meta Reinforcement Learning algorithms have so far focused on ‘adaptation’ to very similar RL tasks or environments</strong>.
Thus, the learned algorithm would not be useful in a considerably different environment.
For example, it would be unreasonable to expect that the algorithm could first learn to walk and then later learn to steer a car.</p>
<h2 id="how-metagenrl-works">How MetaGenRL works</h2>
<p>The goal of <strong>MetaGenRL</strong> is to meta-learn algorithms that <strong>generalize to entirely different environments</strong>.
For this, we train RL agents in multiple environments (often called the environment or task distribution) and leverage their experience to learn an algorithm that allows learning in all of these (and new) environments.</p>
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/metagenrl/comparison.svg" alt="Previously Meta-RL focused on adaptation to very similar use-cases. For example, changing the target position an ant has to walk to or the physical properties of the ant. In contrast, in MetaGenRL we want to learn learning algorithms that work across very diverse environments, e.g. learn to run with a HalfCheetah based on a learning algorithm that has been trained on landing a lunar lander and jumping with the Hopper." />
<figcaption class="figure-caption">Previously Meta-RL focused on adaptation to very similar use-cases. For example, changing the target position an ant has to walk to or the physical properties of the ant. In contrast, in MetaGenRL we want to learn learning algorithms that work across very diverse environments, e.g. learn to run with a HalfCheetah based on a learning algorithm that has been trained on landing a lunar lander and jumping with the Hopper.</figcaption>
</figure>
<p>This process consists of:</p>
<ul>
<li><strong>Meta-Training</strong>: Improve the learning algorithm by using it in one or multiple environments and changing it such that an RL agent that learns with it obtains more reward</li>
<li><strong>Meta-Testing</strong>: Initialize a new RL agent from scratch, place it in a new environment, and use the learning algorithm that we meta-learned previously instead of a human-engineered alternative</li>
</ul>
<p>When using a human-engineered algorithm we have no Meta-Training and only a testing phase:</p>
<ul>
<li><strong>Testing</strong>: Initialize a new RL agent from scratch, place it in a new environment, and train it using a human-engineered learning algorithm</li>
</ul>
<p>We represent our learning algorithm as an objective function \(L_\alpha\) that is parameterized by a neural network with parameters \(\alpha\).
Many human-engineered RL algorithms are likewise defined by a specifically designed objective function, but in MetaGenRL we meta-learn this function instead of designing it.
When we minimize this objective function, the agent behavior improves to achieve higher rewards in an environment.
In MetaGenRL, <strong>we leverage the experience of a population of agents to improve a single randomly initialized objective function</strong>.
Each agent consists of a policy, a critic, and a replay buffer and acts in its own environment (schematic in the figure below).
Let’s say we use 20 agents and two environments, then we would equally distribute the agents such that there are 10 agents in each environment.</p>
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/metagenrl/scheme.svg" alt="Schematic of MetaGenRL. On the left, a population of agents (\(i \in 1, \ldots, N\)), where each member consists of a critic and a policy that interact with a particular environment and store collected data in a corresponding replay buffer. On the right, a meta-learned neural objective function \(L_\alpha\) that is shared across the population. Learning (dotted arrows) proceeds as follows: Each policy is updated by differentiating \(L_\alpha\), while the critic is updated using the usual TD-error (not shown). \(L_\alpha\) is meta-learned by computing second-order gradients by differentiating through the critic." />
<figcaption class="figure-caption">Schematic of MetaGenRL. On the left, a population of agents (\(i \in 1, \ldots, N\)), where each member consists of a critic and a policy that interact with a particular environment and store collected data in a corresponding replay buffer. On the right, a meta-learned neural objective function \(L_\alpha\) that is shared across the population. Learning (dotted arrows) proceeds as follows: Each policy is updated by differentiating \(L_\alpha\), while the critic is updated using the usual TD-error (not shown). \(L_\alpha\) is meta-learned by computing second-order gradients by differentiating through the critic.</figcaption>
</figure>
<p>During meta-training of the objective function, we will now:</p>
<ul>
<li>Have each agent interact with its environment and store this experience in its replay buffer</li>
<li>Improve the critics using data from their replay buffers</li>
<li>Improve the shared objective function that represents the learning algorithm using the current policies and critics</li>
<li>Improve the policy of each agent using the current objective function</li>
<li>Repeat the process</li>
</ul>
<p>Each step is done in parallel across all agents.</p>
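<p>The iteration above can be sketched structurally in a few lines of Python. Everything here is illustrative: the stub classes, method names, and 1-parameter "networks" are stand-ins, not the actual MetaGenRL implementation.</p>

```python
# Structural sketch of one MetaGenRL meta-training iteration.
# All names and the toy 1-parameter agents are illustrative only.

class Agent:
    def __init__(self):
        self.replay_buffer = []
        self.policy = 1.0   # stand-in for policy network parameters
        self.critic = 0.0   # stand-in for critic network parameters

    def interact(self):
        # Act in the environment; here a single dummy transition.
        return [self.policy]

    def update_critic(self):
        # Stand-in for the usual TD-error update.
        self.critic = sum(self.replay_buffer) / len(self.replay_buffer)

    def update_policy(self, objective_fn):
        # Gradient descent on the meta-learned objective L_alpha.
        self.policy -= 0.1 * objective_fn.grad(self.policy)

class ObjectiveFn:
    def __init__(self):
        self.alpha = 1.0    # stand-in for the objective network parameters

    def grad(self, policy):
        return self.alpha * policy

    def meta_update(self, agents):
        pass  # real MetaGenRL: second-order gradient through critic and policy

def meta_training_step(agents, objective_fn):
    for agent in agents:                              # in practice: in parallel
        agent.replay_buffer.extend(agent.interact())  # 1. collect experience
        agent.update_critic()                         # 2. improve critic
    objective_fn.meta_update(agents)                  # 3. improve shared L_alpha
    for agent in agents:
        agent.update_policy(objective_fn)             # 4. improve each policy
```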
<p>During meta-testing an agent is initialized from scratch and only the objective function is used for learning.
The environment we test on can be different from the original environments we used for meta-training, i.e. our objective functions should generalize.</p>
<p>How does meta-training intuitively work?
All agents interact with their environment according to their current policy.
The collected experiences are stored in the replay buffer, essentially a history of everything that has happened.
Using this replay buffer, one can train a separate neural network, the critic, that can estimate how good it would be to take a specific action in any given situation.
MetaGenRL now uses the current objective function to change the policy (‘learning’).
Then, this changed policy outputs an action for a given situation and the critic can tell how good this action is.
Based on this information we can change the objective function to lead to better actions in the future when used as a learning algorithm (‘meta-learning’).
This is done by using a second-order gradient, backpropagating through the critic and policy into the objective function parameters.</p>
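<p>The second-order idea can be illustrated in a one-dimensional toy setting. Everything below is a deliberately simplified sketch, not the paper’s implementation: the policy is a single parameter, the critic is a known function, the objective function has one parameter \(\alpha\), and finite differences stand in for backpropagation. Meta-learning ascends the critic value of the <em>updated</em> policy with respect to \(\alpha\):</p>

```python
# Toy scalar sketch of MetaGenRL's meta-gradient (illustrative only).
# Policy: a single parameter theta that directly emits action a = theta.
# Critic: Q(a) = -(a - a_star)^2, maximal at the unknown optimum a_star.
# Objective: L_alpha(theta) = alpha * theta, a 1-parameter "learning algorithm".

def meta_gradient(alpha, theta, a_star, lr=0.1, eps=1e-5):
    """d Q(theta') / d alpha, where theta' = theta - lr * dL_alpha/dtheta."""
    def q(a):
        return -(a - a_star) ** 2

    def inner_update(alpha_):
        # dL_alpha/dtheta = alpha, since L_alpha = alpha * theta
        return theta - lr * alpha_

    # Second-order gradient, here via central finite differences:
    # how does the critic value after the inner update change with alpha?
    return (q(inner_update(alpha + eps)) - q(inner_update(alpha - eps))) / (2 * eps)

# Meta-training: adjust alpha so that learning with L_alpha raises Q.
alpha, theta, a_star = 0.0, 0.0, 1.0
for _ in range(200):
    alpha += 5.0 * meta_gradient(alpha, theta, a_star)

# One inner update with the meta-learned objective now moves theta to a_star.
theta_after = theta - 0.1 * alpha
```

In the real method, the finite-difference stand-in is replaced by backpropagating through the critic, the policy, and the neural objective function.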
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/metagenrl/meta-learning-intuitive.svg" alt="An intuitive scheme of how meta-learning the objective function works in MetaGenRL." />
<figcaption class="figure-caption">An intuitive scheme of how meta-learning the objective function works in MetaGenRL.</figcaption>
</figure>
<h2 id="sample-efficiency-and-generalization">Sample efficiency and Generalization</h2>
<p>MetaGenRL is off-policy and thus requires fewer environment interactions, both for meta-training and for test-time training.
Unlike in evolution, there is no need to train multiple randomly initialized agents in their entirety to evaluate the objective function, thus speeding up the credit assignment.
Rather, at any point in time, any information that is deemed useful for future environment interactions can be directly incorporated into the objective function by making use of the critic.</p>
<p>Furthermore, the learned objective functions generalize to entirely different environments.
The figure below shows the test-time training (i.e. meta-testing) curve of agents being trained from scratch on the Hopper environment using the learned objective function.
In general, we can <strong>outperform human-engineered algorithms such as PPO and REINFORCE</strong>, but sometimes still struggle against DDPG.
Other Meta-RL baselines overfit to their training environments (see <a href="https://arxiv.org/abs/1611.02779">RL^2</a>) or do not even produce stable learning algorithms when we allow for 50 million environment interactions (twice as many compared to MetaGenRL, see <a href="https://arxiv.org/abs/1802.04821">EPG</a>).</p>
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/metagenrl/combined_ood.svg" alt="Objective functions meta-learned by MetaGenRL generalize to a different environment (here the <a href='https://gym.openai.com/envs/Hopper-v2/'>Hopper</a> environment). The blue curve was meta-trained with 20 agents distributed over the <a href='https://gym.openai.com/envs/HalfCheetah-v2/'>HalfCheetah</a> and <a href='https://gym.openai.com/envs/LunarLanderContinuous-v2/'>LunarLander</a> environments, the orange curve was only trained on LunarLander." />
<figcaption class="figure-caption">Objective functions meta-learned by MetaGenRL generalize to a different environment (here the <a href="https://gym.openai.com/envs/Hopper-v2/">Hopper</a> environment). The blue curve was meta-trained with 20 agents distributed over the <a href="https://gym.openai.com/envs/HalfCheetah-v2/">HalfCheetah</a> and <a href="https://gym.openai.com/envs/LunarLanderContinuous-v2/">LunarLander</a> environments, the orange curve was only trained on LunarLander.</figcaption>
</figure>
<figure class="text-center">
<img class="figure-img rounded " style="max-width: 400px;" src="/assets/posts/metagenrl/meta-table.png" alt="Mean return across 6 seeds of training randomly initialized agents during meta-test time on previously seen environments (<span style='color: #74cae4;'>cyan</span>) and on unseen environments (<span style='color: #debfa1;'>brown</span>). MetaGenRL generalizes much better compared to other Meta-RL approaches." />
<figcaption class="figure-caption">Mean return across 6 seeds of training randomly initialized agents during meta-test time on previously seen environments (<span style="color: #74cae4;">cyan</span>) and on unseen environments (<span style="color: #debfa1;">brown</span>). MetaGenRL generalizes much better compared to other Meta-RL approaches.</figcaption>
</figure>
<figure class="text-center">
<img class="figure-img rounded " style="max-width: 400px;" src="/assets/posts/metagenrl/human-table.png" alt="Agent mean return across seeds for meta-test training on previously seen environments (<span style='color: #74cae4;'>cyan</span>) and on unseen (different) environments (<span style='color: #debfa1;'>brown</span>) compared to human engineered baselines. MetaGenRL outperforms human-engineered algorithms such as PPO and REINFORCE but still struggles with DDPG." />
<figcaption class="figure-caption">Agent mean return across seeds for meta-test training on previously seen environments (<span style="color: #74cae4;">cyan</span>) and on unseen (different) environments (<span style="color: #debfa1;">brown</span>) compared to human engineered baselines. MetaGenRL outperforms human-engineered algorithms such as PPO and REINFORCE but still struggles with DDPG.</figcaption>
</figure>
<h2 id="future-work">Future work</h2>
<p>In future work, we aim to further improve the learning capabilities of the meta-learned objective functions, including better leveraging knowledge from prior experiences.
Indeed, in our current implementation, the objective function is unable to observe the environment or the hidden state of the (recurrent) policy.
These extensions are especially interesting as they may allow more complicated curiosity-based or model-based algorithms to be learned.
To this end, it will be important to develop introspection methods that analyze the learned objective function and to scale MetaGenRL to make use of many more environments and agents.</p>
<h2 id="further-reading">Further reading</h2>
<p>Have a look at <a href="https://arxiv.org/abs/1910.04098">the full paper on ArXiv</a>.</p>
<p>I also recommend reading <a href="https://arxiv.org/abs/1905.10985">Jeff Clune’s AI-GAs</a>.
He describes a similar quest for Artificial Intelligence Generating Algorithms (AI-GAs) with three pillars:</p>
<ul>
<li>Meta-Learning algorithms</li>
<li>Meta-Learning architectures</li>
<li>Generating environments</li>
</ul>
<p>Furthermore, there is a large body of work on meta-learning by my supervisor <a href="http://people.idsia.ch/~juergen/metalearner.html">Juergen Schmidhuber</a> (one good place to start is his first paper on <a href="http://people.idsia.ch/~juergen/fki198-94.pdf">Meta Learning for RL</a>).</p>
<p>Please cite this work using</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{
kirsch2020metagenrl,
title={Improving Generalization in Meta Reinforcement Learning using Learned Objectives},
author={Louis Kirsch and Sjoerd van Steenkiste and Juergen Schmidhuber},
booktitle={International Conference on Learning Representations},
year={2020}
}
</code></pre></div></div>

<h1 id="neurips-2018-updates-on-the-ai-road-map">NeurIPS 2018, Updates on the AI road map</h1>
<p><em>2019-01-10, <a href="http://louiskirsch.com/neurips-2018">louiskirsch.com/neurips-2018</a></em></p>
<p>In September, I published a technical report on what I consider the <a href="/assets/publications/contemporary-challenges-in-artificial-intelligence.pdf">most important challenges in Artificial Intelligence</a>.
I categorized them into four areas:</p>
<ul>
<li><strong>Scalability</strong><br />
Neural networks where compute / memory cost does not scale quadratically / linearly with the number of neurons.</li>
<li><strong>Continual Learning</strong><br />
Agents that have to learn continually from their environment without forgetting previously acquired skills and without the ability to reset the environment.</li>
<li><strong>Meta-Learning</strong><br />
Agents that are self-referential in order to modify their own learning algorithm.</li>
<li><strong>Benchmarks</strong><br />
Environments that have complex enough structure and diversity such that intelligent agents can emerge without hardcoding strong inductive biases.</li>
</ul>
<p>During the <strong>NeurIPS 2018 conference</strong> I investigated other <strong>researchers’ current approaches and perspectives</strong> on these issues.</p>
<h2 id="inductive-biases">Inductive biases?</h2>
<p>I think it is interesting to point out that this list contains little discussion of particular inductive biases that solve challenges we observe with current reinforcement learning agents.
Most of these challenges are absorbed into the Meta-Learning aspect of the system, similar to how evolution shaped a good learner.
It remains to be seen how feasible this approach is with strongly limited compute and time constraints.</p>
<h2 id="scalability">Scalability</h2>
<p>It is almost obvious that if we seek to implement 100 billion neurons, as found in the human brain, using artificial neural networks (ANNs), standard matrix-matrix multiplications will not take us very far.
The number of required operations is quadratic in the number of neurons.</p>
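<p>To make the scaling concrete, a fully connected layer mapping \(n\) neurons to \(n\) neurons performs on the order of \(n^2\) multiply-accumulate operations per input. A back-of-the-envelope sketch:</p>

```python
# Back-of-the-envelope cost of a dense (fully connected) layer:
# every one of n output neurons reads all n input activations.
def dense_macs(n_neurons):
    return n_neurons * n_neurons  # multiply-accumulate operations per input

# Doubling the neuron count quadruples the compute.
assert dense_macs(2_000) == 4 * dense_macs(1_000)
```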
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/modular-networks/modular-layer.gif" alt="The modular layer consists of a pool of modules and a controller that chooses the modules to execute based on the input." />
<figcaption class="figure-caption">The modular layer consists of a pool of modules and a controller that chooses the modules to execute based on the input.</figcaption>
</figure>
<p>To address this issue, I have worked on and published <a href="https://arxiv.org/abs/1811.05249">Modular Networks</a> at NeurIPS 2018.
Instead of evaluating the entire ANN for each input element, we decompose the network into a set of modules, where only a subset is used depending on the input.
This procedure is inspired by the human brain, where we can observe modularization that is also hypothesized to improve adaptation to changing environments and mitigate catastrophic forgetting.
In our approach, we jointly learn both the parameters of these modules and the decision of which modules to use.
Previous literature on conditional computation has had many issues with module collapse, i.e. the optimization process ignoring most of the available modules, leading to sub-optimal solutions.
Our Expectation-Maximization based approach prevents these kinds of issues.</p>
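<p>The core idea of a modular layer can be sketched in a few lines (a hypothetical minimal version, not the paper’s EM-based implementation): a controller scores the modules for a given input, and only the selected module is executed, so compute does not grow with the size of the module pool.</p>

```python
# Minimal sketch of conditional computation with modules (illustrative only).
def modular_layer(x, modules, controller):
    scores = controller(x)                       # one score per module
    best = max(range(len(modules)), key=lambda i: scores[i])
    return modules[best](x)                      # only this module is executed

modules = [lambda x: x + 1, lambda x: x * 2]
controller = lambda x: [1.0, 0.0] if x < 0 else [0.0, 1.0]

assert modular_layer(3, modules, controller) == 6    # routed to the doubling module
assert modular_layer(-3, modules, controller) == -2  # routed to the +1 module
```

Note that the hard part the paper addresses, learning the controller and modules jointly without module collapse, is not captured by this sketch.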
<p>Unfortunately, forcing this kind of separation into modules has its own issues that we discussed in the paper and in <a href="/modular-networks">this follow-up blog post</a> on modular networks.
Instead, we might seek to make use of sparsity and locality in weights and activations as discussed in <a href="/assets/publications/scale-through-sparsity.pdf">my technical report on sparsity</a>.
In short, we only want to perform operations on the few activations that are non-zero, discarding entire rows in the weight matrix.
If, furthermore, connectivity is highly sparse, we effectively reduce the quadratic cost to a small constant.
This kind of conditional computation and non-coalesced weight access is quite expensive to implement on current GPUs and usually not worth it.</p>
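<p>The sparsity argument in a nutshell (a simplified pure-Python sketch): with mostly-zero activations, entire rows of the weight matrix are never read, so cost scales with the number of non-zero units rather than with the full layer size.</p>

```python
# Sparse activation-times-matrix product (illustrative sketch).
def sparse_matvec(weights, activations):
    """Compute activations @ weights, skipping zero activations entirely."""
    n_out = len(weights[0])
    out = [0.0] * n_out
    for i, a in enumerate(activations):
        if a == 0.0:
            continue          # the whole weight row i is never touched
        for j in range(n_out):
            out[j] += a * weights[i][j]
    return out

# With one non-zero activation, only one of the two weight rows is read.
assert sparse_matvec([[1.0, 2.0], [3.0, 4.0]], [0.0, 1.0]) == [3.0, 4.0]
```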
<h3 id="nvidias-take-on-conditional-computation-and-sparsity">NVIDIA’s take on conditional computation and sparsity</h3>
<p>According to a software engineer and a manager at NVIDIA, there are no current plans to build hardware that can leverage conditional computation in the form of activation sparsity.
The main reason seems to be the trade-off of generality vs speed.
It is too expensive to build dedicated hardware for this use case because it might limit other (ML) applications.
Instead, NVIDIA is more focused on weight sparsity from a software perspective at the moment.
Weight sparsity, too, only becomes efficient at a high degree of sparsity.
<h3 id="graphcores-take-on-conditional-computation-and-sparsity">GraphCore’s take on conditional computation and sparsity</h3>
<p><a href="https://www.graphcore.ai/">GraphCore</a> builds hardware that allows storing activations during the forward pass in caches close to the processing units instead of global memory on GPUs.
It also can make use of sparsity and specific graph structure by compiling and setting up a computational graph on the device itself.
Unfortunately, due to the expensive compilation, this structure is fixed and does not allow for conditional computation.</p>
<p>As an overall verdict, it seems that there is no hardware solution for conditional computation on the horizon and we have to stick with heavily parallelizing across machines for the moment.
In that regard, <a href="https://arxiv.org/abs/1811.02084">Mesh-Tensorflow</a>, a novel method to distribute gradient calculation not just across the batch but also across the model was published at NeurIPS, allowing even larger models to be trained in a distributed fashion.</p>
<h2 id="continual-learning">Continual Learning</h2>
<p>I have long advocated for the need for deep learning based continual learning systems, i.e. systems that can learn continually from experience and accumulate knowledge that can then be used as prior knowledge when new tasks arise.
As such, they need to be capable of forward transfer, as well as preventing catastrophic forgetting.
The Continual Learning workshop at NeurIPS discussed exactly these issues.
Perhaps these two criteria are incomplete, though: multiple speakers (Mark Ring, Raia Hadsell) suggested a larger list of requirements:</p>
<ul>
<li>forward transfer</li>
<li>backward transfer</li>
<li>no catastrophic forgetting</li>
<li>no catastrophic interference</li>
<li>scalable (fixed memory / computation)</li>
<li>can handle unlabeled task boundaries</li>
<li>can handle drift</li>
<li>no episodes</li>
<li>no human control</li>
<li>no repeatable states</li>
</ul>
<p>In general, it seems to me that there are six categories of approaches to the problem:</p>
<ul>
<li>(partial) replay buffer</li>
<li>generative model that regenerates past experience</li>
<li>slowing down training of important weights</li>
<li>freezing weights</li>
<li>redundancy (bigger networks -> scalability)</li>
<li>conditional computation (-> scalability)</li>
</ul>
<p>None of these approaches handles all aspects of the continual learning list.
Unfortunately, this is also impossible in practice.
There is always a trade-off between transfer and memory / compute, and a trade-off between catastrophic forgetting and transfer / memory / compute.
Thus, it will be hard to purely quantitatively measure the success of an agent.
Instead, we should build benchmarks that require qualities we require from our continual learning agents, for instance, the <a href="https://marcpickett.com/cl2018/CL-2018_paper_48.pdf">Starcraft based environment</a> presented at the workshop.</p>
<p>Furthermore, Raia Hadsell argued that Continual Learning involves moving away from learning algorithms that rely on i.i.d. data to learning from a non-stationary distribution.
In particular, humans are good at learning incrementally rather than from i.i.d. data.
Thus, we might be able to unlock a more powerful ML paradigm by moving away from the i.i.d. requirement.</p>
<p>The paper <a href="https://arxiv.org/abs/1810.11910">Continual Learning by Maximizing Transfer and Minimizing Interference</a> showed an interesting connection between REPTILE (a MAML successor) and reducing catastrophic forgetting.
The dot product between gradients of datapoints drawn from a replay buffer (a term that appears in REPTILE) leads to gradient updates that minimize interference and reduce catastrophic forgetting.</p>
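<p>That alignment quantity is easy to state concretely (a minimal sketch of the measure, not the paper’s full algorithm):</p>

```python
# Transfer vs. interference between two datapoints, measured by the dot
# product of their loss gradients (the term that appears in REPTILE).
def gradient_alignment(grad_a, grad_b):
    return sum(ga * gb for ga, gb in zip(grad_a, grad_b))

assert gradient_alignment([1.0, 0.0], [0.5, 0.0]) > 0   # aligned: transfer
assert gradient_alignment([1.0, 0.0], [-0.5, 0.0]) < 0  # opposed: interference
```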
<p>The panel with Marc’Aurelio Ranzato, Richard Sutton, Juergen Schmidhuber, Martha White, and Chelsea Finn was also quite interesting.
It has been argued that we should experiment with lifelong learning in the control setting (if that is what we ultimately care about) instead of supervised and unsupervised learning to prevent any mismatch between algorithm development and actual area of application.
Discount factors, while having useful properties for Bellman-equation based learning, might be problematic for more realistic RL settings.
Returns with long time-horizons are what make humans inherently smarter than many other species.
Furthermore, any learning, in particular meta-learning, is inherently constrained due to credit assignment.
Thus, developing algorithms with cheap credit assignment is key to intelligent agents.</p>
<h2 id="meta-learning">Meta-Learning</h2>
<p>Meta-Learning is about modifying the learning algorithm itself.
This may be an outer optimization loop that modifies an inner optimization loop, or in its most universal form a self-referential algorithm that can modify itself.
Many researchers are also concerned with fast adaptation, i.e. forward transfer, to new tasks or environments.
This can be viewed as transfer learning, or meta-learning if we consider the initial parameters of a learning algorithm to be part of the learning algorithm.
One of the very recent algorithms by Chelsea Finn, <a href="https://arxiv.org/abs/1703.03400">MAML</a>, sparked great interest in this kind of fast adaptation algorithm.
This could, for instance, be used for model-based reinforcement learning, where the <a href="https://arxiv.org/abs/1803.11347">model is quickly updated</a> to changing dynamics.</p>
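<p>The inner/outer structure of MAML-style fast adaptation can be sketched in one dimension (an illustrative toy, with finite differences standing in for MAML’s exact second-order gradient): meta-learn an initialization such that a single gradient step on a new task already performs well.</p>

```python
# Scalar sketch of MAML-style fast adaptation (illustrative only).
# Each task is a quadratic loss (theta - opt)^2 with a task-specific optimum.

def adapt(theta0, task_opt, lr=0.4):
    grad = 2 * (theta0 - task_opt)        # d/dtheta of (theta - task_opt)^2
    return theta0 - lr * grad             # one inner gradient step

def maml_meta_step(theta0, task_opts, meta_lr=0.5, eps=1e-5):
    def meta_loss(t0):
        # Loss *after* adaptation, summed over tasks.
        return sum((adapt(t0, opt) - opt) ** 2 for opt in task_opts)
    grad = (meta_loss(theta0 + eps) - meta_loss(theta0 - eps)) / (2 * eps)
    return theta0 - meta_lr * grad

# Meta-train the initialization on two tasks with optima at -1 and +1.
theta0 = 5.0
for _ in range(200):
    theta0 = maml_meta_step(theta0, [-1.0, 1.0])
# theta0 ends up between the task optima, so one inner step adapts
# most of the way towards either task.
```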
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/neurips-2018/evolved-policy-gradients.png" alt="In EPG a loss function optimizes the parameters of a policy using SGD while the parameters of the loss function are evolved." />
<figcaption class="figure-caption">In EPG a loss function optimizes the parameters of a policy using SGD while the parameters of the loss function are evolved.</figcaption>
</figure>
<p>Another interesting idea is to learn differentiable loss functions of the agent’s trajectory and the policy output.
This allows evolving the few parameters of the loss function while training the policy using SGD.
Furthermore, the authors of <a href="https://arxiv.org/abs/1802.04821">Evolved Policy Gradients (EPG)</a> showed that the learned loss functions generalize across reward functions and allow for fast adaptation.
One major issue with EPG is that credit assignment is quite slow:
An agent has to be fully trained using a loss function to obtain an average return (fitness) for the meta-learner.</p>
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/neurips-2018/loss-landscape.png" alt="The loss landscape of a learned optimizer becomes harder to navigate the more update steps are being unrolled.<br/>Left: one-dimensional. Right: two-dimensional. Taken from <a href='https://arxiv.org/abs/1810.10180'>Metz et al</a>" />
<figcaption class="figure-caption">The loss landscape of a learned optimizer becomes harder to navigate the more update steps are being unrolled.<br />Left: one-dimensional. Right: two-dimensional. Taken from <a href="https://arxiv.org/abs/1810.10180">Metz et al</a></figcaption>
</figure>
<p>Another interesting discovery I made during the Meta-Learning workshop is the structure of loss landscapes of meta-learners.
In a paper on <a href="https://arxiv.org/abs/1810.10180">learning optimizers</a>, Luke Metz showed that the loss surface over the optimizer parameters becomes more complex the more update steps are unrolled.
I suspect that this is a general property of meta-learning algorithms: small changes in parameter values can cascade into massive changes in final performance.
I would be very interested in seeing such an analysis.
In the case of learning optimization Luke addressed the issue by smoothing the loss landscape through <a href="https://arxiv.org/abs/1212.4507">Variational Optimization</a>, a principled interpretation of evolutionary strategies.</p>
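<p>The smoothing trick itself can be sketched briefly. Below is a hedged illustration (my own toy objective, not the setup from the paper): direct gradient descent on a jagged 1-D loss gets trapped in its ripples, while the evolution-strategies estimator follows the gradient of the Gaussian-smoothed objective <code>E[f(x + sigma * eps)]</code> instead:</p>

```python
import numpy as np

def f(x):
    # A jagged toy "meta-loss": quadratic bowl plus high-frequency ripples.
    return x ** 2 + 0.5 * np.sin(40.0 * x)

rng = np.random.default_rng(2)
x, sigma, lr, pop = 2.0, 0.3, 0.05, 64

for _ in range(300):
    eps = rng.normal(size=pop)
    # ES / variational-optimization gradient estimator of the smoothed
    # objective (f(x) is subtracted as a variance-reducing baseline).
    g = np.mean((f(x + sigma * eps) - f(x)) * eps) / sigma
    x -= lr * g

print("approximate minimizer of the smoothed objective:", x)
```

<p>The ripples average out under the Gaussian perturbations, so the estimator descends the underlying bowl even though plain gradient descent would stall in a local ripple.</p>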
<h2 id="benchmarks">Benchmarks</h2>
<p>Most current RL algorithms are benchmarked on games or simulators such as ATARI or Mujoco.
These are simple environments that bear little resemblance to the richness our universe exhibits.
One major complaint researchers often voice is that our algorithms are sample-inefficient.
This can be fixed in part by using the existing data more efficiently through off-policy optimization and model-based RL.
However, a large factor is also that our algorithms have no prior experience to draw on in these benchmarks.
We can get around this by handcrafting inductive biases into our algorithms that reflect some kind of prior knowledge, but it might be much more interesting to <strong>build environments that allow the accumulation of knowledge</strong> that can be leveraged in the future.
To my knowledge, no such benchmark exists to date.
The <a href="https://github.com/Microsoft/malmo">Minecraft</a> simulator might be closest to such requirements.</p>
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/neurips-2018/starcraft.png" alt="The Continual Learning Starcraft environment is a curriculum starting with very simple tasks. Unfortunately, it still contains clear task boundaries and little possibilities for exploration." />
<figcaption class="figure-caption">The Continual Learning Starcraft environment is a curriculum starting with very simple tasks. Unfortunately, it still contains clear task boundaries and little possibilities for exploration.</figcaption>
</figure>
<p>An <strong>alternative</strong> to such rich environments is to build <strong>explicit curriculums</strong> such as the beforementioned <a href="https://marcpickett.com/cl2018/CL-2018_paper_48.pdf">Starcraft environment</a> that consists of a curriculum of tasks.
This is in part also what Shagun Sodhani asks for in his paper <a href="https://arxiv.org/abs/1811.10732">Environments for Lifelong Reinforcement Learning</a>.
Other aspects he puts on his wishlist are</p>
<ul>
<li>environment diversity</li>
<li>stochasticity</li>
<li>naturality</li>
<li>non-stationarity</li>
<li>multiple modalities</li>
<li>short-term and long-term goals</li>
<li>multiple agents</li>
<li>cause and effect interaction</li>
</ul>
<p>The game engine developer <a href="https://unity3d.com/">Unity3D</a> is also at the forefront of environment development.
It has released the <a href="https://unity3d.com/machine-learning">ML-Agents</a> toolkit to train and evaluate agents in environments built with Unity.
One of their new open-ended curriculum benchmarks is the <a href="https://twitter.com/awjuliani/status/1069048401596227584">Obstacle Tower</a>.
In general, a major problem for realistic environment construction is that the requirements differ inherently from those of game design:
to prevent overfitting, it is important that objects in a vast world do not all look alike and therefore cannot simply be replicated, as is often done in computer games.
This means that true generalization requires generated or carefully designed environments.</p>
<p>Finally, I believe it might be possible to use computation to generate non-stationary environments instead of building them manually.
For instance, this could be a physics simulator that has similar properties to our universe.
To save compute, we could also start with a simplification based on voxels.
If this simulation exhibits the right properties we might be able to simulate a process similar to evolution, bootstrapping a non-stationary environment that develops many forms of life that interact with each other.
This idea fits nicely with the <a href="https://en.wikipedia.org/wiki/Simulation_hypothesis">simulation hypothesis</a> and has connections to <a href="https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life">Conway’s Game of Life</a>.
One major issue with this approach might be that the resulting complexity has no resemblance to human-known concepts.
Furthermore, the resulting intelligent agents will not be able to transfer to our universe.
Recently, I found out that this idea has been realized in part by Stanley and Clune’s group at UBER in their paper <a href="https://eng.uber.com/poet-open-ended-deep-learning/">POET: Endlessly Generating Increasingly Complex and Diverse Learning Environments</a>.
The environment is non-stationary and can be viewed as an agent itself that maximizes complexity and agent learning progress.
They refer to this concept as open-ended learning, and I recommend reading <a href="https://www.oreilly.com/ideas/open-endedness-the-last-grand-challenge-youve-never-heard-of">this article</a>.</p>
<p>Please cite this work using</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{kirsch2019roadmap,
author = {Kirsch, Louis},
title = {{Updates on the AI road map}},
url = {http://louiskirsch.com/neurips-2018},
year = {2019}
}
</code></pre></div></div>Louis KirschI present an updated roadmap to AGI with four critical challenges: Continual Learning, Meta-Learning, Environments, and Scalability. I motivate the respective areas and discuss how research from NeurIPS 2018 has advanced them and where we need to go next.Is AI progress a function of knowledge or computational resources?2019-01-08T09:00:00+00:002019-01-08T09:00:00+00:00http://louiskirsch.com/ai-progress-computing<p>Deep Neural networks have rid us of feature engineering.
Meta-Learning will rid us of optimization objective engineering.</p>Louis KirschTODOThe Origins of Intelligence2019-01-07T09:00:00+00:002019-01-07T09:00:00+00:00http://louiskirsch.com/the-origins-of-intelligence<p><strong>How did intelligence come into existence?</strong>
What are the properties that make humans smart?
These questions have bothered me for a long while now, and I have tried to find my own answers and other perspectives in the field.
One method to get a grip on intelligence is to follow what kind of methods the field of Reinforcement Learning is developing.
This is essentially a bottom-up approach, aggregating all methods to see a pattern.
I just recently published a <a href="/maps/reinforcement-learning">huge mind map</a> that tried to do just that.
Other methods try to come up with principles top-down and then test them or prove that they are optimal.
<a href="/ai/universal-ai">Universal AI in the form of AIXI</a> defines the theoretically optimal reinforcement learning agent, while <a href="/ai/active-inference">Active Inference</a> is a physics- and neuroscience-inspired approach.</p>
<p>In this blog post I want to present my own, much simpler theory, summarized by a simple statement:</p>
<blockquote>
<p>Any dynamical system that acts such that it keeps existing or replicating is more likely to be observed in the future.</p>
</blockquote>
<p>For instance, any animal keeps <strong>regulating its body to survive and replicating</strong> for the continued existence of the species.
Thus, you are more likely to observe a functioning cat than a non-functioning one.
Both animals and species are dynamical systems.
These <strong>dynamical systems can be recursive</strong> in the sense of one containing the other.
One example is the recursion species -> animals -> organs -> cells -> physics.
The dynamics of these are informed by the <strong>seed information</strong> (e.g. genetics), and <strong>(pseudo-)random<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup> perturbations</strong> from the environment.
The lowest level in the recursion (i.e. physics) has fixed dynamics.
Some of these perturbations are resisted by the dynamical system, others are not.
For non-resisted perturbations, two cases can emerge.
Firstly, if these perturbations are <strong>harmful</strong>, the dynamical system might cease to exist, and with it its capability to replicate.
Secondly, if these perturbations are <strong>beneficial</strong>, the dynamical system might exist for prolonged periods of time or replicate such that another level of recursion is added on top.</p>
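<p>The statement can be turned into a toy simulation (entirely my own illustration, with made-up numbers): systems differ only in how well they resist perturbations, survivors replicate with small variations, and no explicit objective is ever optimized. What we observe after a while are the systems that kept existing and replicating:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
# "Seed information" of 100 dynamical systems: how well each one
# resists a random perturbation (probability of surviving it).
resistance = rng.uniform(0.0, 1.0, size=100)

for _ in range(50):
    # Random environmental perturbations: each system survives
    # with probability equal to its resistance.
    survived = rng.uniform(size=resistance.size) < resistance
    resistance = resistance[survived]
    # Survivors replicate with small random variation.
    children = np.clip(resistance + rng.normal(0.0, 0.05, resistance.size), 0.0, 1.0)
    # Limited resources cap the total population.
    resistance = np.concatenate([resistance, children])[:200]

print("mean resistance of the systems we observe:", resistance.mean())
```

<p>Selection here is implicit: nothing maximizes resistance, yet the population we observe at the end is dominated by systems that persist, mirroring the quoted principle.</p>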
<p>At any time in a universe, the laws of this universe (i.e. the laws of physics) lead to random or pseudo-random interactions between entities.
Most of this randomness will not lead to coherent systems, i.e. systems that maintain their own existence.
By chance, some of these interactions will be <strong>self-reinforcing and at some point replicating</strong>, from which perturbations create a wide variety of such dynamical systems.
Because these variations <strong>compete</strong> with each other for resources, <strong>additional pressure for self-modification</strong> at any level in the recursive hierarchy is applied.
This is reminiscent of <a href="https://en.wikipedia.org/wiki/On_the_Origin_of_Species">evolutionary adaptation (On the Origin of Species)</a>.</p>
<p>The proposed principle does not talk about intelligence per se, but rather about the existence of any system (e.g. a living being).
However, most living organisms are not perceived to be smart, but rather good at surviving in their niche.
For instance, there are many more bacteria and viruses on this planet than humans.
Thus, I am suggesting that <strong>intelligence is merely a side-effect</strong> of the existence/replication optimization objective in our universe (but a very powerful side-effect w.r.t. this optimization objective).
For artificial machines that shall not just create complexity according to this objective but get increasingly better at solving a specific task or any computable task it will encounter in the future, we will need to define extrinsic or intrinsic objectives, for instance see <a href="/ai/universal-ai">Universal AI</a>, <a href="http://people.idsia.ch/~juergen/interest.html">POWERPLAY and other artificial curiosity based systems</a>, or <a href="https://arxiv.org/abs/1310.1863">Empowerment</a>.</p>
<p><a href="/ai/active-inference">Active inference</a> is similar to this theory in the sense that any ‘living’ dynamical system optimizes to maintain certain attractor states by minimizing the free energy.
This makes three assumptions that we do not require:
Firstly, we do not assume ergodic behavior of the dynamical system, in the sense that states have to be revisited.
Thus, we do not have to define attractor states; instead, any state that supports the existence of the system is permitted.
Secondly, we do not assume that the internal dynamics of the system minimize a specific quantity, such as the free energy.
Thirdly, we do not assume that the system models its environment: the theory includes reactive systems that directly react to observed changes.
As such, modeling the environment is not a prerequisite, but potentially useful.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>random only if there is true randomness in the universe, otherwise pseudo-random, <a href="http://people.idsia.ch/~juergen/randomness.html">read more on Schmidhuber’s website</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Louis KirschTODOA Map of Reinforcement Learning2019-01-03T14:00:00+00:002019-01-03T14:00:00+00:00http://louiskirsch.com/maps/reinforcement-learning<p><strong>Reinforcement Learning</strong> promises to solve the problem of <strong>designing intelligent agents</strong> in a formal but simple framework.
At the same time, there exists a large pool of <strong>methods</strong> to optimize an agent’s policy to maximize return such as value-based methods, policy-based methods, imitation learning, and model-based approaches.
These methods themselves have many variants and incremental improvements, mostly driven by a set of major <strong>challenges</strong> in the field of reinforcement learning.
All in all, it is easy to get lost in the large number of publications and subfields of research.</p>
<p>This blog post aims at <strong>tackling this massive quantity of approaches and challenges</strong>, providing an overview of the different challenges researchers are working on and the methods they devised to solve these problems.
This mind map is far from complete and in large part driven by my interests; if you have any particular suggestions, please let me know!</p>
<p><a href="/assets/posts/map-reinforcement-learning/overview.pdf"><img src="/assets/posts/map-reinforcement-learning/overview.svg" alt="A map of reinforcement learning" /></a></p>
<h1 id="goal">Goal</h1>
<p>What is the goal of Reinforcement Learning?
We have introduced the framework to solve the problem of designing intelligent agents.
It can be further formalized as <em>‘an agent that maximizes reward in a particular environment’</em>.
Or in the context of AGI, Intelligence has been defined as <em>‘The ability to achieve goals in a wide range of environments’</em> by Marcus Hutter and Shane Legg.
Marcus Hutter formalized the optimal universal agent AIXI, as I have described in <a href="/ai/universal-ai">another blog post</a>.</p>
<p><a href="/assets/posts/map-reinforcement-learning/goal.pdf"><img src="/assets/posts/map-reinforcement-learning/goal.svg" alt="The goal of Reinforcement Learning" /></a></p>
<h1 id="methods">Methods</h1>
<p>There is a variety of methods to optimize an agent’s policy.
Here we look at different categories and specific implementations of methods to optimize RL policies.</p>
<p><a href="/assets/posts/map-reinforcement-learning/methods.pdf"><img src="/assets/posts/map-reinforcement-learning/methods.svg" alt="Methods of Reinforcement Learning" /></a></p>
<h1 id="challenges">Challenges</h1>
<p>There is a wide range of challenges that current major algorithms do not handle very well yet.
Thus, many new specialized algorithms have been developed.
To understand where and why specific research is happening, I tried to sort recent research into the respective challenges it addresses.</p>
<p><a href="/assets/posts/map-reinforcement-learning/challenges.pdf"><img src="/assets/posts/map-reinforcement-learning/challenges.svg" alt="Challenges of Reinforcement Learning" /></a></p>Louis KirschReinforcement Learning promises to solve the problem of designing intelligent agents in a formal but simple framework. This blog post aims at tackling the massive quantity of approaches and challenges in Reinforcement Learning, providing an overview of the different challenges researchers are working on and the methods they devised to solve these problems.NeurIPS 2018, Updates on the AI road map2018-12-13T10:00:00+00:002018-12-13T10:00:00+00:00http://louiskirsch.com/neurips-2018-confidential<p><strong>CONFIDENTIAL BLOG POST - DO NOT SHARE</strong></p>
<p>In September, I published a technical report on what I consider the <a href="/assets/publications/contemporary-challenges-in-artificial-intelligence.pdf">most important challenges in Artificial Intelligence</a>.
I categorized them into four areas</p>
<ul>
<li><strong>Scalability</strong><br />
Neural networks where compute / memory cost does not scale quadratically / linearly with the number of neurons.</li>
<li><strong>Continual Learning</strong><br />
Agents that have to continually learn from their environment without forgetting previously acquired skills, and without the ability to reset the environment.</li>
<li><strong>Meta-Learning</strong><br />
Agents that are self-referential in order to modify their own learning algorithm.</li>
<li><strong>Benchmarks</strong><br />
Environments that have complex enough structure and diversity such that intelligent agents can emerge without hardcoding strong inductive biases.</li>
</ul>
<p>During the <strong>NeurIPS 2018 conference</strong> I investigated other <strong>researchers’ current approaches and perspectives</strong> on these issues.</p>
<h2 id="inductive-biases">Inductive biases?</h2>
<p>I think it is interesting to point out that this list contains little discussion of particular inductive biases that solve challenges we observe with current reinforcement learning agents.
Most of these challenges are absorbed into the Meta-Learning aspect of the system, similar to how evolution shaped a good learner.
It remains to be seen how feasible this approach is with strongly limited compute and time constraints.</p>
<h2 id="scalability">Scalability</h2>
<p>It is almost obvious that if we seek to implement the 100 billion neurons found in the human brain using artificial neural networks (ANNs), standard matrix-matrix multiplications will not take us very far.
The number of required operations is quadratic in the number of neurons.</p>
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/modular-networks/modular-layer.gif" alt="The modular layer consists of a pool of modules and a controller that chooses the modules to execute based on the input." />
<figcaption class="figure-caption">The modular layer consists of a pool of modules and a controller that chooses the modules to execute based on the input.</figcaption>
</figure>
<p>To address this issue, I have worked on and published <a href="https://arxiv.org/abs/1811.05249">Modular Networks</a> at NeurIPS 2018.
Instead of evaluating the entire ANN for each input element, we decompose the network into a set of modules, where only a subset is used depending on the input.
This procedure is inspired by the human brain, where we can observe modularization that is also hypothesized to improve adaptation to changing environments and mitigate catastrophic forgetting.
In our approach, we jointly learn both the parameters of these modules and the decision which modules to use.
Previous literature on conditional computation has had many issues with module collapse, i.e. the optimization process ignoring most of the available modules, leading to sub-optimal solutions.
Our Expectation-Maximization based approach prevents these kinds of issues.</p>
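<p>The forward pass of such a modular layer is easy to sketch. The following is only an illustration of the conditional-computation idea, with random parameters and a simple top-k controller; it is not the EM-based learning algorithm from the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(4)
n_modules, d = 8, 16

# A pool of modules and a controller that scores them per input.
modules = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_modules)]
controller = rng.normal(size=(n_modules, d)) / np.sqrt(d)

def modular_layer(x, k=2):
    """Evaluate only the k modules the controller scores highest,
    instead of the whole network."""
    scores = controller @ x
    chosen = np.argsort(scores)[-k:]
    out = sum(np.tanh(modules[i] @ x) for i in chosen)
    return out, chosen

x = rng.normal(size=d)
y, chosen = modular_layer(x)
print("evaluated modules", sorted(chosen.tolist()), "out of", n_modules)
```

<p>In the paper, both the module parameters and the selection are learned jointly; module collapse corresponds to the controller always picking the same few indices regardless of the input.</p>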
<p>Unfortunately, forcing this kind of separation into modules has its own issues that we discussed in the paper and in <a href="/modular-networks">this follow-up blog post</a> on modular networks.
Instead, we might seek to make use of sparsity and locality in weights and activations as discussed in <a href="/assets/publications/scale-through-sparsity.pdf">my technical report on sparsity</a>.
In short, we only want to perform operations on the few activations that are non-zero, discarding entire rows in the weight matrix.
If, furthermore, connectivity is highly sparse, we effectively reduce the quadratic cost to a small constant.
This kind of conditional computation and non-coalesced weight access is quite expensive to implement on current GPUs and usually not worth it.</p>
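<p>A quick numpy sketch of why activation sparsity helps (my own illustration): in <code>y = a @ W</code>, a zero entry of <code>a</code> contributes nothing, so the corresponding row of <code>W</code> never needs to be read, reducing the work from O(n²) to O(kn) for k non-zero activations:</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
W = rng.normal(size=(n, n))

# Highly sparse activations: ~1% non-zero.
a = np.zeros(n)
active = rng.choice(n, size=10, replace=False)
a[active] = rng.normal(size=active.size)

# Dense product: touches all n*n weights.
y_dense = a @ W

# Sparse product: only the rows of W belonging to non-zero
# activations are read -- roughly 1% of the weights here.
y_sparse = a[active] @ W[active]

print("max abs difference:", np.abs(y_dense - y_sparse).max())
```

<p>The catch, as noted above, is that gathering arbitrary rows (<code>W[active]</code>) is exactly the kind of non-coalesced memory access that current GPUs handle poorly.</p>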
<h3 id="nvidias-take-on-conditional-computation-and-sparsity">NVIDIA’s take on conditional computation and sparsity</h3>
<p>According to a software engineer and a manager at NVIDIA, there are no current plans to build hardware that can leverage conditional computation in the form of activation sparsity.
The main reason seems to be the trade-off of generality vs speed.
It is too expensive to build dedicated hardware for this use case because it might limit other (ML) applications.
Instead, NVIDIA is more focused on weight sparsity from a software perspective at the moment.
For this to be efficient, the degree of weight sparsity must also be high.</p>
<h3 id="graphcores-take-on-conditional-computation-and-sparsity">GraphCore’s take on conditional computation and sparsity</h3>
<p><a href="https://www.graphcore.ai/">GraphCore</a> builds hardware that allows storing activations during the forward pass in caches close to the processing units instead of global memory on GPUs.
It can also make use of sparsity and specific graph structure by compiling and setting up a computational graph on the device itself.
Unfortunately, due to the expensive compilation, this structure is fixed and does not allow for conditional computation.</p>
<p>As an overall verdict, it seems that there is no hardware solution for conditional computation on the horizon and we have to stick with heavily parallelizing across machines for the moment.
In that regard, <a href="https://arxiv.org/abs/1811.02084">Mesh-TensorFlow</a>, a novel method to distribute gradient calculation not just across the batch but also across the model, was published at NeurIPS, allowing even larger models to be trained in a distributed fashion.</p>
<h2 id="continual-learning">Continual Learning</h2>
<p>I have long advocated for the need for deep learning based continual learning systems, i.e. systems that can learn continually from experience and accumulate knowledge that can then be used as prior knowledge when new tasks arise.
As such, they need to be capable of forward transfer, as well as preventing catastrophic forgetting.
The Continual Learning workshop at NeurIPS discussed exactly these issues.
Perhaps these two criteria are incomplete, though; multiple speakers (Mark Ring, Raia Hadsell) suggested a larger list of requirements:</p>
<ul>
<li>forward transfer</li>
<li>backward transfer</li>
<li>no catastrophic forgetting</li>
<li>no catastrophic interference</li>
<li>scalable (fixed memory / computation)</li>
<li>can handle unlabeled task boundaries</li>
<li>can handle drift</li>
<li>no episodes</li>
<li>no human control</li>
<li>no repeatable states</li>
</ul>
<p>In general, it seems to me that there are six categories of approaches to the problem</p>
<ul>
<li>(partial) replay buffer</li>
<li>generative model that regenerates past experience</li>
<li>slowing down training of important weights</li>
<li>freezing weights</li>
<li>redundancy (bigger networks -> scalability)</li>
<li>conditional computation (-> scalability)</li>
</ul>
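<p>As a concrete illustration of the first category, here is a minimal partial replay buffer (my own sketch, using reservoir sampling so that a fixed memory keeps an approximately uniform sample of everything seen so far):</p>

```python
import random

class ReplayBuffer:
    """Partial replay buffer based on reservoir sampling: a bounded
    memory holds a uniform sample over the whole experience stream."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.seen = 0

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Each item seen so far stays with probability capacity / seen.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        return random.sample(self.items, min(k, len(self.items)))

random.seed(0)
buf = ReplayBuffer(capacity=100)
for step in range(10_000):   # a stream of experience, possibly many tasks
    buf.add(step)

batch = buf.sample(8)        # mixes old and new experience for training
print(batch)
```

<p>Training on such mixed batches is what connects replay to reduced interference: gradients on old and new datapoints are combined instead of letting the new ones overwrite the old.</p>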
<p>None of these approaches handles all aspects of the continual learning list.
Unfortunately, handling all of them at once is likely impossible in practice:
there is always a trade-off between transfer and memory / compute, and a trade-off between catastrophic forgetting and transfer / memory / compute.
Thus, it will be hard to measure the success of an agent purely quantitatively.
Instead, we should build benchmarks that demand the qualities we expect from our continual learning agents, for instance, the <a href="https://marcpickett.com/cl2018/CL-2018_paper_48.pdf">Starcraft-based environment</a> presented at the workshop.</p>
<p>Furthermore, Raia Hadsell argued that Continual Learning involves moving away from learning algorithms that rely on i.i.d. data to learning from a non-stationary distribution.
In particular, humans are good at learning incrementally rather than from i.i.d. data.
Thus, we might be able to unlock a more powerful ML paradigm by moving away from the i.i.d. requirement.</p>
<p>The paper <a href="https://arxiv.org/abs/1810.11910">Continual Learning by Maximizing Transfer and Minimizing Interference</a> showed an interesting connection between REPTILE (a MAML successor) and reducing catastrophic forgetting.
The dot product between the gradients of datapoints drawn from a replay buffer (a term that appears in the REPTILE update) leads to gradient updates that minimize interference and reduce catastrophic forgetting.</p>
<p>The panel with Marc’Aurelio Ranzato, Richard Sutton, Juergen Schmidhuber, Martha White, and Chelsea Finn was also quite interesting.
It has been argued that we should experiment with lifelong learning in the control setting (if that is what we ultimately care about) instead of supervised and unsupervised learning to prevent any mismatch between algorithm development and actual area of application.
Discount factors, while having useful properties for Bellman-equation based learning, might be problematic for more realistic RL settings.
It was also argued that returns over long time horizons are part of what makes humans inherently smarter than many other species.
Furthermore, any learning, in particular meta-learning, is inherently constrained by credit assignment.
Thus, developing algorithms with cheap credit assignment is key to building intelligent agents.</p>
<h2 id="meta-learning">Meta-Learning</h2>
<p>Meta-Learning is about modifying the learning algorithm itself.
This may be an outer optimization loop that modifies an inner optimization loop, or in its most universal form a self-referential algorithm that can modify itself.
Many researchers are also concerned with fast adaptation, i.e. forward transfer, to new tasks / environments etc.
This can be viewed as transfer learning, or meta-learning if we consider the initial parameters of a learning algorithm to be part of the learning algorithm.
One of the most recent algorithms by Chelsea Finn, <a href="https://arxiv.org/abs/1703.03400">MAML</a>, sparked great interest in this kind of fast-adaptation algorithm.
This could, for instance, be used for model-based reinforcement learning, where the <a href="https://arxiv.org/abs/1803.11347">model is quickly updated</a> to changing dynamics.</p>
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/neurips-2018/evolved-policy-gradients.png" alt="In EPG a loss function optimizes the parameters of a policy using SGD while the parameters of the loss function are evolved." />
<figcaption class="figure-caption">In EPG a loss function optimizes the parameters of a policy using SGD while the parameters of the loss function are evolved.</figcaption>
</figure>
<p>Another interesting idea is to learn differentiable loss functions of the agent’s trajectory and the policy output.
This allows evolving the few parameters of the loss function while training the policy using SGD.
Furthermore, the authors of <a href="https://arxiv.org/abs/1802.04821">Evolved Policy Gradients (EPG)</a> showed that the learned loss functions generalize across reward functions and allow for fast adaptation.</p>
<p>I consider this paper a special case of a much more powerful paradigm of evolving arbitrary loss functions between any hidden representations of an RNN (unpublished idea).
This may include actions, observations, rewards, and any other transformations thereof.
Such loss functions could represent world models, intrinsic motivation, internal rewards, and many other optimization objectives.
These kinds of loss functions could also be self-referential in the sense that the evolved parameters for the loss are themselves updated through gradient-descent.
The result is an evolved initial loss function that is further refined by gradient descent.
One major issue with EPG is that credit assignment is quite slow:
An agent has to be fully trained using a loss function to obtain an average return (fitness) for the meta-learner.
One possibility to speed this up might be to output a premature ‘evaluation signal’, similar to the mechanism in the <a href="https://link.springer.com/article/10.1023/A:1007383707642">success story algorithm</a> that Juergen Schmidhuber talked about during the Continual Learning workshop.
I think learning loss functions is a very exciting direction and I would like to investigate it further.</p>
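<p>The inner/outer structure of EPG-style loss learning, and why its credit assignment is slow, can be sketched with a toy problem (entirely my own construction, not the paper’s algorithm): a one-parameter ‘policy’ w is trained by SGD on a learned loss, while the loss parameter is evolved against a true return that the inner loop never sees directly:</p>

```python
import numpy as np

# Learned loss L_phi(w) = (w - phi)^2 trains the policy parameter w,
# while phi is evolved to maximize the true return -|w_trained - 3|.

def inner_train(phi, steps=50, lr=0.2):
    w = 0.0
    for _ in range(steps):
        w -= lr * 2.0 * (w - phi)   # SGD on the learned loss
    return w

def fitness(phi):
    # Credit assignment is slow: a full inner training run is needed
    # for a single fitness evaluation of the loss parameters.
    return -abs(inner_train(phi) - 3.0)

rng = np.random.default_rng(6)
phi = 0.0
for _ in range(200):                # simple (1+1) evolution on phi
    cand = phi + rng.normal(0.0, 0.5)
    if fitness(cand) >= fitness(phi):
        phi = cand

print("evolved loss parameter:", phi, "trained policy:", inner_train(phi))
```

<p>Every candidate loss requires a complete inner optimization before the meta-learner receives any feedback, which is exactly the bottleneck a premature evaluation signal would try to shortcut.</p>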
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/neurips-2018/loss-landscape.png" alt="The loss landscape of a learned optimizer becomes harder to navigate the more update steps are being unrolled.<br/>Left: one-dimensional. Right: two-dimensional. Taken from <a href='https://arxiv.org/abs/1810.10180'>Metz et al</a>" />
<figcaption class="figure-caption">The loss landscape of a learned optimizer becomes harder to navigate the more update steps are being unrolled.<br />Left: one-dimensional. Right: two-dimensional. Taken from <a href="https://arxiv.org/abs/1810.10180">Metz et al</a></figcaption>
</figure>
<p>Another interesting discovery I made during the Meta-Learning workshop is the structure of loss landscapes of meta-learners.
In a paper on <a href="https://arxiv.org/abs/1810.10180">learning optimizers</a>, Luke Metz showed that the loss surface over the optimizer parameters becomes more complex the more update steps are unrolled.
I suspect that this is a general property of meta-learning algorithms: small changes in parameter values can cascade into massive changes in final performance.
I would be very interested in seeing such an analysis.
In the case of learning optimization Luke addressed the issue by smoothing the loss landscape through <a href="https://arxiv.org/abs/1212.4507">Variational Optimization</a>, a principled interpretation of evolutionary strategies.</p>
<h2 id="benchmarks">Benchmarks</h2>
<p>Most current RL algorithms are benchmarked on games or simulators such as ATARI or Mujoco.
These are simple environments that bear little resemblance to the richness our universe exhibits.
One major complaint researchers often voice is that our algorithms are sample-inefficient.
This can be fixed in part by using the existing data more efficiently through off-policy optimization and model-based RL.
However, a large factor is also that our algorithms have no prior experience to draw on in these benchmarks.
We can get around this by handcrafting inductive biases into our algorithms that reflect some kind of prior knowledge, but it might be much more interesting to <strong>build environments that allow the accumulation of knowledge</strong> that can be leveraged in the future.
To my knowledge, no such benchmark exists to date.
The <a href="https://github.com/Microsoft/malmo">Minecraft</a> simulator might be closest to such requirements.</p>
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/neurips-2018/starcraft.png" alt="The Continual Learning Starcraft environment is a curriculum starting with very simple tasks. Unfortunately, it still contains clear task boundaries and little possibilities for exploration." />
<figcaption class="figure-caption">The Continual Learning Starcraft environment is a curriculum starting with very simple tasks. Unfortunately, it still contains clear task boundaries and little possibilities for exploration.</figcaption>
</figure>
<p>An <strong>alternative</strong> to such rich environments is to build <strong>explicit curricula</strong> such as the aforementioned <a href="https://marcpickett.com/cl2018/CL-2018_paper_48.pdf">Starcraft environment</a>, which consists of a curriculum of tasks.
This is in part also what Shagun Sodhani asks for in his paper <a href="https://arxiv.org/abs/1811.10732">Environments for Lifelong Reinforcement Learning</a>.
Other aspects he puts on his wishlist are</p>
<ul>
<li>environment diversity</li>
<li>stochasticity</li>
<li>naturality</li>
<li>non-stationarity</li>
<li>multiple modalities</li>
<li>short-term and long-term goals</li>
<li>multiple agents</li>
<li>cause and effect interaction</li>
</ul>
<p>The game engine developer <a href="https://unity3d.com/">Unity3D</a> is also at the forefront of environment development.
It has released a toolkit, <a href="https://unity3d.com/machine-learning">ML-Agents</a>, to train and evaluate agents in environments built with Unity.
One of their new open-ended curriculum benchmarks is the <a href="https://twitter.com/awjuliani/status/1069048401596227584">Obstacle Tower</a>.
In general, a major problem for realistic environment construction is that the requirements are inherently different from those of game design:
to prevent overfitting, it is important that objects in a vast world do not all look alike, so they cannot simply be replicated as is often done in computer games.
This means that true generalization requires generated or carefully designed environments.</p>
<p>Finally, I believe it might be possible to use computation to generate non-stationary environments instead of building them manually.
For instance, this could be a physics simulator that has similar properties to our universe.
To save compute, we could also start with a simplification based on voxels.
If this simulation exhibits the right properties we might be able to simulate a process similar to evolution, bootstrapping a non-stationary environment that develops many forms of life that interact with each other.
This idea fits nicely with the <a href="https://en.wikipedia.org/wiki/Simulation_hypothesis">simulation hypothesis</a> and has connections to <a href="https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life">Conway’s Game of Life</a>.
One major issue with this approach might be that the resulting complexity bears no resemblance to human-known concepts.
Furthermore, the resulting intelligent agents will not be able to transfer to our universe.
Recently, I found out that this idea has been realized in part by Stanley and Clune's group at Uber in their paper <a href="https://eng.uber.com/poet-open-ended-deep-learning/">POET: Endlessly Generating Increasingly Complex and Diverse Learning Environments</a>.
The environment is non-stationary and can be viewed as an agent itself that maximizes complexity and agent learning progress.
They refer to this concept as open-ended learning, and I recommend reading <a href="https://www.oreilly.com/ideas/open-endedness-the-last-grand-challenge-youve-never-heard-of">this article</a>.</p>
<p>Please cite this work using</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{kirsch2019roadmap,
author = {Kirsch, Louis},
title = {{Updates on the AI road map}},
url = {http://louiskirsch.com/neurips-2018},
year = {2019}
}
</code></pre></div></div>Louis KirschI present an updated roadmap to AGI with four critical challenges: Continual Learning, Meta-Learning, Environments, and Scalability. I motivate the respective areas and discuss how research from NeurIPS 2018 has advanced them and where we need to go next.How to make your ML research more impactful2018-12-12T09:00:00+00:002018-12-12T09:00:00+00:00http://louiskirsch.com/ml-research-with-impact<p>We machine learning researchers all have a very <strong>limited amount of time</strong> to spend on reading research and there is only so few projects we can take on at a time.
Thus, it is paramount to understand what <strong>areas of research excite you</strong> and hold <strong>promise for the future</strong>.
In May 2018, I looked at precisely this question during a course at University College London:</p>
<ul>
<li>What makes ML research impactful?</li>
<li>And what can you do to <strong>increase your impact</strong>?</li>
</ul>
<p>I feel like everyone should at least ponder these questions for a bit, or better, <a href="/assets/publications/characteristics_of_ml_research_with_impact.pdf">read my technical report</a> on the topic and/or the summary I will provide in this blog post.</p>
<h1 id="what-can-you-do-to-increase-your-impact">What can you do to increase your impact?</h1>
<p>In my analysis, I focused on the <strong>field of deep learning</strong>.
I studied the co-author network of three important deep learning researchers and their publications.
I looked at the contents of these publications, the context in which these were published, and the changing citation count over time.
Here we interpret <strong>citation count as a metric for impact</strong>, which has its limits, in particular when looking at effects on society at scale.
In my paper, I also discuss different metrics of impact, but here we focus on citation count as a metric for <strong>impact within the scientific community</strong>.
One might argue that outside of the community a lot of machine learning research is quickly converted into applications by industry.
In the following, I present my most important findings and actionable items.</p>
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/ml-research-impact/lstm_citations_years.svg" alt="While the LSTM (or its most successful version) was published in 1997, it took several years, along with success in real-world applications, before it was widely recognized and used in other research." />
<figcaption class="figure-caption">While the LSTM (or its most successful version) was published in 1997, it took several years, along with success in real-world applications, before it was widely recognized and used in other research.</figcaption>
</figure>
<p><strong>Demonstration of large-scale practical success.</strong><br />
Many particularly successful papers received the majority of their citations only decades later, after the large-scale practical success of their methods became evident.
In general, it can be said that success depends on the ability of an approach to be scaled up to large problems.
Thus, always try to show that your algorithm works well at scale!</p>
<p><strong>Focus on novel techniques and large margin improvements.</strong><br />
Small improvements on benchmarks are quickly surpassed.
Working on applications only leads to high impact in the form of citations and adoption in the research community if it uses novel techniques that improve over existing (or non-existing) work by a large margin.</p>
<p><strong>Perseverance.</strong><br />
Because not all ideas will work out and demonstrating large-scale practical success is hard, great ideas in deep learning research often require perseverance for long periods of time.</p>
<p><strong>Do not follow the mainstream.</strong><br />
Backpropagation and LSTMs, as examples of impactful ML research, demonstrate that it was not mainstream research that established itself but rather novel ideas that were not generally accepted at the time.</p>
<p><strong>General learning algorithms over applications.</strong><br />
Effort should be focused on general learning algorithms over applications.
None of the impactful papers I investigated solely applied existing techniques to a new application.</p>
<p><strong>Trial and error.</strong><br />
Many publications of famous authors are barely cited; research is trial and error.
Don’t be frustrated!</p>
<p><strong>A good intuition.</strong><br />
The most cited publications are distributed among a very small number of authors, which means these researchers must have a good intuition about what kinds of problems are relevant and what a good solution should look like.
Learn from them!</p>
<h1 id="what-can-the-community-do">What can the community do?</h1>
<p><strong>Introduce a ‘Crazy Work Award’ and other incentives.</strong><br />
Many later very successful ideas were rather unpopular during their inception.
We should encourage research groups, conferences, and journals to nurture crazy and currently unpopular ideas.</p>
<h1 id="conclusions">Conclusions</h1>
<p>There you have it.
I hope we can learn from my findings and produce exceptional work that pushes our field way beyond what is possible today.</p>
<p><strong>A final word of caution.</strong><br />
Please be aware that many of my observations might not generalize well into the future.
Furthermore, while I tried to extract actionable items from my findings, it may very well be that great research is still mostly determined by chance.</p>Louis KirschWe machine learning researchers all have a very limited amount of time to spend on reading research and there is only so few projects we can take on at a time. Thus, it is paramount to understand what areas of research excite you and hold promise for the future. I present my analysis of the factors of impactful ML research and how to increase your impact.Theories of Intelligence (2/2): Active Inference2018-08-15T19:45:00+00:002018-08-15T19:45:00+00:00http://louiskirsch.com/ai/active-inference<p>Active Inference is a theoretical framework of perception and action from neuroscience that can explain many phenomena in the brain.
It aims to explain the behavior of cells, organs, animals, humans, and entire species.
Active Inference is structured such that it only requires <em>local</em> computation and plasticity.
In effect, it could be implemented on neural hardware.
Even though Active Inference has wide-reaching potential applications, for instance as an alternative to reinforcement learning, few people outside the neuroscience community are familiar with the framework.
In this blog post, I want to give a machine learning perspective on the framework, omitting many neuroscience details.
As such, this article is geared towards machine learning researchers familiar with probabilistic modeling and reinforcement learning.</p>
<p>
<div class="card">
<div class="card-header">Summary (tldr)</div>
<div class="card-body">
<ul>
<li>Active Inference (in the presented form) relies only on local computation and plasticity</li>
<li>Active Inference supports a hierarchy of spatiotemporal scales (cells, organs, animals, species, etc)</li>
<li>In contrast to (commonly used) Reinforcement Learning approaches
<ul>
<li>no reward has to be specified</li>
<li>exploration is optimally solved and</li>
<li>uncertainty is taken into account</li>
</ul>
</li>
<li>Active Inference relies solely on minimizing variational free energy</li>
<li>Both perception and action become inference (including planning)</li>
<li>Instead of rewards, agents have prior preferences over future observations that enter a prior over policies</li>
<li>These preferences are shaped by model selection (e.g. evolution)</li>
<li>Lower variational free energy corresponds to greater model evidence (e.g. adaptive fitness)</li>
</ul>
</div>
</div>
</p>
<h2 id="the-variational-free-energy">The variational free energy</h2>
<p>Active Inference is based on a single quantity: The variational free energy.
Minimization of the free energy will explain both perception and action in any organism, as we will see later.
The variational free energy has its origins in neuroscience but also has a very clean machine learning interpretation, and most researchers should therefore already be familiar with it.</p>
<figure class="text-center">
<img class="figure-img rounded small" style="" src="/assets/posts/active-inference/latent-variable-model.svg" alt="The variational free energy is often used to learn in such a latent variable model. In this model, \(x\) is an observed variable, while \(y\) is latent." />
<figcaption class="figure-caption">The variational free energy is often used to learn in such a latent variable model. In this model, \(x\) is an observed variable, while \(y\) is latent.</figcaption>
</figure>
<p>If you are entirely unfamiliar with the variational free energy, I would recommend reading <a href="https://blog.evjang.com/2016/08/variational-bayes.html">this blog post</a> before proceeding.
To recap, given a latent variable model as shown in the above figure, we often want to optimize the model evidence</p>
\[P(x) = \int P(x|y) P(y) d y\]
<p>This integral often has no analytical form, and numerical integration is hard.
Therefore, we resort to optimizing an alternative quantity:
the variational free energy.</p>
<p>We define the <em>negative</em> variational free energy</p>
\[F := \mathbb{E}_{Q(y)}[\log P(x|y)] - KL[Q(y)|P(y)]\]
<p>where \(Q\) is some approximate posterior distribution.
Note that when we refer to the variational free energy, we will often omit that we are talking about the negative free energy \(F\).</p>
<p>To see that \(F\) lower-bounds the model likelihood we can rewrite \(F\) such that</p>
\[F = \log P(x) - KL[Q(y)|P(y|x)]\]
<p>\(F\) is a lower bound because the \(KL\)-divergence is always larger or equal to zero.
Thus, we can maximize \(F\) instead of \(\log P(x)\).
The bound is tight when</p>
\[KL[Q(y)|P(y|x)] = 0\]
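<p>To make the bound concrete, here is a toy numerical check (the discrete model and its numbers are invented purely for illustration): any choice of \(Q(y)\) yields \(F \le \log P(x)\), and the exact posterior makes the bound tight.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete latent variable model: y takes K values, x takes M values.
K, M = 3, 4
P_y = np.array([0.5, 0.3, 0.2])                  # prior P(y)
P_x_given_y = rng.dirichlet(np.ones(M), size=K)  # likelihood P(x|y), rows sum to 1

x = 2                                            # a single observation
log_evidence = np.log(P_x_given_y[:, x] @ P_y)   # log P(x) by direct summation

def neg_free_energy(q):
    """F = E_Q[log P(x|y)] - KL[Q(y) | P(y)]."""
    accuracy = q @ np.log(P_x_given_y[:, x])
    complexity = np.sum(q * np.log(q / P_y))
    return accuracy - complexity

q_uniform = np.ones(K) / K            # an arbitrary approximate posterior
posterior = P_x_given_y[:, x] * P_y   # exact posterior P(y|x), up to normalization
posterior /= posterior.sum()
```

<p>Evaluating <code>neg_free_energy(q_uniform)</code> gives a lower bound on <code>log_evidence</code>, while <code>neg_free_energy(posterior)</code> equals it, matching \(F = \log P(x) - KL[Q(y)|P(y|x)]\).</p>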
<h2 id="the-variational-free-energy-in-the-context-of-perception">The variational free energy in the context of perception</h2>
<p>Active Inference uses the variational free energy both in the context of perception and action.
We begin by describing its role in perception.
What do we mean by the term perception?
Intuitively, an agent observes and understands its environment.
This means the agent can extract the underlying dynamics and causes in order to make sense of its surroundings.
Modeling the environment in this manner helps to predict the past, present, and future, which can be viewed either as an inherent property of any organism or as a tool to act and realize goals (formulated in Active Inference as extrinsic prior beliefs), as we will see later.
Note that the agent can never be certain about the underlying structure, but only updates beliefs based on what it has observed.
A very simplified version of perception is the modeling of these hidden causes and dynamics as a time series latent variable model.</p>
<p>We formalize the task of perception using the following notation:
An agent observes observations \(o_{1:T}\) over time \(t \in 1, \ldots, T\) and aims to model the underlying dynamics \(s_{1:T}\).
This is visualized in the following graphical model.</p>
<figure class="text-center">
<img class="figure-img rounded small" style="" src="/assets/posts/active-inference/latent-time.svg" alt="The latent variable model for perception of an agent. The agent observes \(o_t\) while the environment states \(s_t\) are latent." />
<figcaption class="figure-caption">The latent variable model for perception of an agent. The agent observes \(o_t\) while the environment states \(s_t\) are latent.</figcaption>
</figure>
<p>Similar to its original definition, the free energy for a specific point in time \(t\) then becomes</p>
\[F_t = \underbrace{\mathbb{E}_Q[\log P(o_t|s_t)]}_\text{accuracy} - \underbrace{KL[Q(s_t)|P(s_t|s_{t-1})]}_\text{complexity}\]
<p>The approximate posterior distribution \(Q(s_t)\) now simply describes beliefs about the latent state \(s_t\) based on the observation \(o_t\).
The task of perception is then simply optimizing over \(Q(s_t)\).
In other words, updating the beliefs about states \(s_t\) such that the observations the agent has made are most likely.</p>
\[Q(s_t) = \arg\max_Q F\]
<p>The two terms of the free energy are often also called the accuracy \(\mathbb{E}_Q[\log P(o_t|s_t)]\) and the complexity \(KL[Q(s_t)|P(s_t|s_{t-1})]\).
Increasing the first term increases the likelihood that the inferred states model the observations correctly and the \(KL\)-divergence describes how different the beliefs will have to be from the prior beliefs (and therefore how much information they contain).</p>
<p>So what does it mean to minimize the free energy in the context of perception?
It means to adapt (posterior) beliefs about the environment such that surprise of new observations is minimized.
Thus, the free energy is a measure of surprise.</p>
<p>The careful reader might have noticed that we currently only take into account the observation \(o_t\) to infer \(s_t\), but neither the past nor the future.
Also, we have not talked about learning the transitions from \(s_t\) to \(s_{t+1}\) or the observations these states generate (\(P(o_t|s_t)\)).
We will look at these issues more carefully when introducing the general framework of Active Inference.
Furthermore, while we have modeled a single sequence of latent variables \(s_{1:T}\) it is possible to stack latent variables in order to build a hierarchy of abstractions (i.e. deep generative models).</p>
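<p>As a minimal numerical sketch of perception (the two-state world and its matrices are invented for illustration), the belief update that maximizes \(F_t\) is simply a Bayesian filtering step:</p>

```python
import numpy as np

# Hypothetical two-state environment with a noisy binary observation channel.
A = np.array([[0.9, 0.2],   # P(o|s): rows index the observation o, columns the state s
              [0.1, 0.8]])
B = np.array([[0.7, 0.3],   # P(s_t|s_{t-1}): columns index the previous state
              [0.3, 0.7]])

def perceive(q_prev, o):
    """Exact posterior Q(s_t) proportional to P(o_t|s_t) P(s_t|s_{t-1}),
    i.e. the Q that makes the free-energy bound tight."""
    prior = B @ q_prev      # predictive prior over s_t
    q = A[o] * prior        # weight by the likelihood of the observation o_t
    return q / q.sum()

q = np.array([0.5, 0.5])    # initial beliefs
for o in [0, 0, 1, 0]:      # a short observation sequence
    q = perceive(q, o)
```

<p>After mostly observing \(o = 0\), the belief \(Q(s_t)\) concentrates on the state that is likely to emit it.</p>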
<h2 id="from-the-variational-free-energy-to-active-inference">From the variational free energy to Active Inference</h2>
<p>So far, we have discussed how an agent realizes perception by minimizing the free energy.
We will now extend this free-energy framework to include action, yielding Active Inference.
Whenever we talk about an agent acting in an environment, we usually use the reinforcement learning framework.
Active Inference offers an alternative to this framework, not driving action through reward but through preferred observations.
We will find that action is another result of the minimization of free energy.</p>
<p>Before we jump right in, recall the reinforcement learning framework, as depicted below.</p>
<figure class="text-center">
<img class="figure-img rounded small" style="" src="/assets/posts/active-inference/reinforcement-learning.svg" alt="The conventional reinforcement learning framework to describe the interaction between an environment and agent." />
<figcaption class="figure-caption">The conventional reinforcement learning framework to describe the interaction between an environment and agent.</figcaption>
</figure>
<p>We have an agent defined by its policy \(\pi\) that selects the action \(u_t\) to take at time \(t\) such that the return \(R = \sum_{\tau > t} r_\tau\) (the sum of rewards) is maximized.
The optimal action can be derived from Bellman’s Optimality Principle such that the optimal policy takes action \(u_t\) that maximizes the value function</p>
\[\begin{gather*}
u_t^* = \arg\max_{u_t} V(s_t, u_t) = \pi(s_t) \\
V(s_t, u_t) = \sum_{s_{t+1}} (r_{t+1} + \max_{u_{t+1}} V(s_{t+1}, u_{t+1})) P(s_{t+1}|s_t, u_t)
\end{gather*}\]
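<p>For contrast with what follows, Bellman's recursion can be sketched as plain value iteration (with a discount factor \(\gamma\) added for convergence); the three-state chain MDP below is a made-up toy example.</p>

```python
import numpy as np

# Toy chain MDP: actions step left/right, reward 1 for entering state 2.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))    # P[s, u, s']
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0                 # action 0: step left
    P[s, 1, min(s + 1, n_states - 1)] = 1.0      # action 1: step right
r = np.array([0.0, 0.0, 1.0])                    # reward r(s') on arrival

# Iterate V(s) = max_u sum_{s'} (r(s') + gamma * V(s')) P(s'|s, u)
V = np.zeros(n_states)
for _ in range(200):
    V = np.max(np.einsum('sut,t->su', P, r + gamma * V), axis=1)

# Greedy policy: always step right, towards the rewarding state.
pi = np.argmax(np.einsum('sut,t->su', P, r + gamma * V), axis=1)
```

<p>Note that everything here is driven by the reward \(r\); Active Inference will dispense with this quantity entirely.</p>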
<p>In Active Inference we will not have to define a reward \(r\), return \(R\), or value function \(V\).
There is a fundamental reason for this: Active Inference treats exploration and exploitation as two sides of the same coin – in terms of choosing actions to minimize surprise or uncertainty.
This brings something quite fundamental to the table; namely, the value function of states has to be replaced with a (free energy) functional of beliefs about states.
Similarly, reward is replaced by (prior) beliefs about preferred, unsurprising, outcomes.
This means we define the agent’s extrinsic preferences using priors on future observations \(P(o_\tau)\) with \(\tau > t\)
(in the following we will always denote future observations with the index \(\tau\) and past observations with \(\rho\)).
The agent will then try to follow a policy that realizes these prior expectations of the future (so-called self-evidencing behavior).
At the same time, we prefer observations \(o_\tau\) that are likely under our model of the environment.
In effect, both principles are a form of surprise minimization.
Previously, we have identified the minimization of surprise as the minimization of the free energy.
Thus, action can be integrated into our existing free energy model through priors over future observations \(P(o_\tau)\).
We then combine the prior \(P(o_\tau)\) with our posterior beliefs \(Q(s_\tau)\) to infer the best policy \(\pi\) that minimizes our expected surprise.
We describe our posterior belief over the policy we should follow with \(Q(\pi)\).
Using this strategy, we can essentially reduce action to inference by means of minimizing expected free energy.
This concept is also known as Hamilton’s Principle of Least Action.</p>
<p>The following figure visualizes Active Inference as a principle of self-evidencing.
An agent acts to generate outcomes that fulfill its prior and model expectations (left side).
At the same time, perception ensures that the model of the world is consistent with the observations (right side).</p>
<figure class="text-center">
<img class="figure-img rounded large" style="" src="/assets/posts/active-inference/self-evidencing.svg" alt="Active Inference extends the variational free energy framework with the principle of self-evidencing. A priori expected outcomes are achieved by inferring policies \(Q(\pi)\). The expected free energy \(G(\pi, \tau)\) takes the future and prior preferences over observations into account. The symbol \(\sigma\) denotes the softmax function. We will define the policy dependent free energy \(F(\pi, \rho)\) and expected free energy \(G(\pi, \tau)\) rigorously later." />
<figcaption class="figure-caption">Active Inference extends the variational free energy framework with the principle of self-evidencing. A priori expected outcomes are achieved by inferring policies \(Q(\pi)\). The expected free energy \(G(\pi, \tau)\) takes the future and prior preferences over observations into account. The symbol \(\sigma\) denotes the softmax function. We will define the policy dependent free energy \(F(\pi, \rho)\) and expected free energy \(G(\pi, \tau)\) rigorously later.</figcaption>
</figure>
<p>For the purpose of this blog post, it is very important to make a clear distinction between the free energy \(F\) minimized for perception and the expected (future) free energy \(G\) that reflects the amount of surprise in the future given a particular policy \(\pi\).
We will later present a precise formalization of \(G\), for now, it suffices to say that it is the expected free energy of the future when a particular policy is followed.
Thus, \(G\) is the quantity to optimize in order to realize future preferences \(P(o_\tau)\).
Note that the free energies \(F\) and \(G\) can, in principle, be reformulated to a single quantity (but we will not further investigate this angle here).</p>
<p>At this point, it is worth noting that \(G\) is a very universal quantity.
Parallels can be drawn to the Bayesian surprise, risk-sensitive control, expected utility theory, and the maximum entropy principle.
Having said that, we will not further investigate these similarities because they are not essential for understanding Active Inference.
More details for references on this topic can be found in the last section.</p>
<p>To summarize, we solve perception and action in a unifying framework that minimizes the free energy.
Furthermore, because we implement action through Bayesian probability theory, we obtain a Bayes-optimal agent if no approximations have to be used.
Before we turn to a precise mathematical definition of the different components that make up an Active Inference agent, we present some of the evidence for Active Inference being the fundamental principle underlying any form of living organism.</p>
<h2 id="active-inference-as-a-foundation-for-life">Active Inference as a foundation for life</h2>
<p>We have introduced Active Inference as a unifying framework for perception and action.
The reach of this framework is quite extensive.
It can be used to explain the self-organization of living systems, such as cells, neurons, organs, animals, and even entire species.
Minimizing the free energy via action and perception is the central principle - called the free energy principle (FEP).</p>
<p>According to the FEP, central to any organism is its Markov blanket.
In the statistical sense, a Markov blanket \(b\) of states \(s\) is the set of random variables that, when conditioned upon, renders \(s\) independent of all other variables.
Such a Markov blanket also exists for any living system.
Intuitively, no organism can directly observe or modify its environment.
Any interaction is through its sensors (reflecting only a reduced view of the environment) and its actuators (with limited capabilities to act upon the environment).
It is this boundary that separates its internal states from its external milieu, and without it, the system would cease to exist.
Formally, an organism receives sensory inputs \(o \in O\) through its sensory states.
Based on these inputs, it constructs a model of its environment \(s \in S\).
According to this model and in order to realize prior preferences, the organism takes actions \(u \in \Upsilon\), the so-called active states.
The only interaction with its environment \(\psi \in \Psi\) is through its Markov blanket consisting of sensory and active states.</p>
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/active-inference/markov-boundary.svg" alt="The Markov blanket of any living system, sustained by minimizing the free energy." />
<figcaption class="figure-caption">The Markov blanket of any living system, sustained by minimizing the free energy.</figcaption>
</figure>
<p>If an organism endures in a changing environment it must, by definition, retain its Markov blanket.
When an organism minimizes free energy, it minimizes surprise (of sensory input, i.e. observations).
Because under ergodic assumptions the long-term average of surprise is the entropy of the sensory states, retaining the Markov blanket is equivalent to placing an upper bound on this entropy.
Conversely, if the Markov blanket is not maintained, the entropy (disorder) of the sensory states diverges, subsequently leading to the disintegration and death of the organism.</p>
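<p>The ergodicity argument can be stated compactly: when time averages equal ensemble averages, the long-term average of surprise is exactly the entropy of the sensory states,</p>

\[\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} -\log P(o_t) = \mathbb{E}_{P(o)}[-\log P(o)] = H[P(o)]\]

<p>so keeping average surprise low is the same as bounding this entropy from above.</p>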
<p>The concept of Markov blankets can be used to model living systems recursively across spatial scales.
For instance, humans consist of different organs which in turn consist of countless cells.
Each system can be viewed as free-energy-minimizing, maintaining its own Markov blanket.</p>
<figure class="text-center">
<img class="figure-img rounded large" style="" src="/assets/posts/active-inference/recursive-markov-blanket.png" alt="
Markov blankets can be recursively composed across spatial scales.
Here, internal states are denoted as \(\mu\), while \(b := \{u, o\}\) is the Markov blanket.
Reproduced from [<a href='https://www.sciencedirect.com/science/article/pii/S1571064517301409'>Ramstead et. al</a>].
" />
<figcaption class="figure-caption">
Markov blankets can be recursively composed across spatial scales.
Here, internal states are denoted as \(\mu\), while \(b := \{u, o\}\) is the Markov blanket.
Reproduced from [<a href="https://www.sciencedirect.com/science/article/pii/S1571064517301409">Ramstead et. al</a>].
</figcaption>
</figure>
<p>Let’s look at an example.
How can an ensemble of cells form an entire organ? (This is also called morphogenesis.)
Each cell needs to assume a specific function at a specific location for the entire organ to work.
There is no central organization; thus, each cell must infer the function and position of all other cells just from the signals reaching its local Markov blanket.
Active Inference may solve this by assuming that each pluripotential cell starts out with a generative model of the entire ensemble.
Therefore a cell can predict which sensory input it would receive depending on its location in the organ.
Each cell only optimizes its local free energy, but because the generative model of each cell embodies a model of the entire organ, a local free energy minimum of each cell corresponds to the entire ensemble converging to a global free energy minimum.
Crucially, through this optimization, each cell will act upon the environment through \(u \in \Upsilon\) to help other cells reach their respective free energy minimum.</p>
<p>Finally, evolution plays a very central role in free energy minimization in biotic systems.
In a changing environment, the Markov blanket will eventually be destroyed, which results in the death of the organism.
Therefore, species have developed the ability to reproduce, effectively transferring genetic and epigenetic information to their descendants.
This information specifies the generative model (including prior preferences) in their descendants.
Crucially, information is not transmitted noise-free and each generation introduces slight variations.
These variations lead to changing generative models and prior preferences that create a selection process among population members.
The adaptive fitness of each organism is reflected in how well its model fits the niche of the species.
But it is not only these variations that drive species; each organism can also shape evolution through free energy minimization.
The adaptive pressure mostly depends on the niche of the species, and free energy minimization prefers predictable environments that behave according to prior preferences.
Therefore, the niche itself is also shaped through actions that lead to future free energy minimization.</p>
<p>Evolution, therefore, can be seen as a higher-level, more slowly moving process that defines <em>empirical priors</em> \(P(o_\tau)\).
This hierarchy of temporal scales can be constructed analogously to the hierarchy of spatial scales we constructed previously.
In this hierarchy, higher levels treat the preferences of lower layers as outcomes that need to be explained.</p>
<p>Furthermore, because free energy is an extensive property, hierarchical applications of free energy minimization at different spatial and temporal scales mean that there is an interesting circular causality – in which the minimization at one scale (e.g., evolution of a species) both creates and depends on free energy minimization at a lower scale (e.g., econiche construction by the conspecifics of a species).</p>
<h2 id="the-generative-model">The generative model</h2>
<p>We now introduce the generative model for Active Inference in all its mathematical detail.
In order to simplify explanations, we present a specific model for Active Inference that is a special case in some respects and thus no longer entirely assumption- and approximation-free.
We make the following assumptions:</p>
<ul>
<li>We have several finite discrete random variables for
<ul>
<li>outcomes \(o_t\)</li>
<li>actions \(u_t\)</li>
<li>states \(s_t\)</li>
<li>policies \(\pi\) <br />
(The policy itself is a function \(\pi(t) = u_t\), but we have a discrete set of such policies)</li>
</ul>
</li>
<li>The transitions between states are Markovian</li>
</ul>
<p>Furthermore, as described in more detail later, we use variational approximate inference to make the inference process tractable.</p>
<p>The following graphical model shows all the relationships between the variables in our model.
Alongside the latent variables \(s_t\) and \(\pi\), we have the transition matrix \(B\), the observation probabilities \(A\), and a variable \(D\) describing the initial distribution over states \(s_1\).
At the beginning of this blog post, we considered \(B\), \(A\), and \(D\) to be fixed.
Furthermore, the matrix \(U\) describes prior preferences over future observations \(P(o_\tau) = \sigma(U_\tau)\) where \(\sigma\) denotes the softmax function.
These prior preferences will be integrated into the expected free energy \(G(\pi)\) which in turn defines the prior over the policy \(\pi\).
This relationship is crucial to Active Inference!
We want prior preferences on \(P(o_\tau)\) but implement them into our generative model by expressing them as a prior \(P(\pi|\gamma)\).
Finally, \(\gamma\) is a temperature parameter that increases or decreases the confidence in the policy \(\pi\).</p>
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/active-inference/generative-model.svg" alt="The generative model for Active Inference. Note that the prior probability for a policy \(\pi\) depends on the expected free energy \(G\)." />
<figcaption class="figure-caption">The generative model for Active Inference. Note that the prior probability for a policy \(\pi\) depends on the expected free energy \(G\).</figcaption>
</figure>
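To make these components concrete, here is a minimal NumPy sketch of the model's building blocks. All shapes, seeds, and values below are illustrative assumptions for this post, not part of the theory:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_obs, n_actions, horizon = 4, 3, 2, 5

# A[o, s]: probability of outcome o given hidden state s (columns sum to 1)
A = rng.dirichlet(np.ones(n_obs), size=n_states).T

# B[u][s_next, s]: transition to s_next from s under action u
B = [rng.dirichlet(np.ones(n_states), size=n_states).T for _ in range(n_actions)]

# D[s]: initial distribution over states s_1
D = rng.dirichlet(np.ones(n_states))

# U[t, o]: prior preferences over outcomes; P(o_t) = softmax(U[t])
U = rng.normal(size=(horizon, n_obs))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

prior_preferences = np.stack([softmax(u) for u in U])
```

Each column of \(A\) and of every \(B(u)\) is a categorical distribution, which is why Dirichlet samples are a convenient way to generate them.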
<p>The policy \(\pi\) is different from Reinforcement Learning in the sense that it is not a function of previous states \(s_{t-1}\) but only on the current time \(t\).
Essentially, each possible policy describes a trajectory of actions taken.
This is sufficient because the state history \(s_{1:t}\) and future expected states \(s_{\rho}\) are taken into consideration automatically by minimizing the free energy to achieve preferred outcomes \(o_\rho\).
Note that, regarding tractability, Friston argues that an agent only entertains a handful of policies at a time.
This selection is optimized through a process further up in the temporal hierarchy (e.g. evolution).
To give an example, the brain has evolved to only consider a limited repertoire of eye movements with a short time horizon.</p>
<p>Analogous to our introduction of the free energy and perception, we now define the negative variational free energy for this generative model, which we will have to maximize (or minimize the positive free energy):</p>
\[F := \mathbb{E}_Q[\log P(o_{1:t}|x)] - KL[Q(x)|P(x)]\]
<p>where \(x\) are all our latent variables \(x = (s_{1:t}, \pi, A, B, D, \gamma)\).</p>
<p>In order to do inference, we have to define a posterior distribution \(Q(x)\).
We simplify our inference drastically by using a mean field approximation, such that the posterior factors</p>
\[Q(x) = \prod_{\rho=1}^{t} Q(s_\rho|\pi) Q(\pi) Q(A) Q(B) Q(D) Q(\gamma)\]
<p>Each factor can be represented by its sufficient statistics, denoted by the bar \(\bar x\)</p>
\[\begin{align*}
Q(s_\rho|\pi) &= Cat(\bar s_\rho^\pi) \\
Q(\pi) &= Cat(\bar \pi) \\
Q(A) &= Dir(\bar a) \\
Q(B) &= Dir(\bar b) \\
Q(D) &= Dir(\bar d) \\
Q(\gamma) &= \Gamma(1, \bar \beta) \\
\end{align*}\]
<p>Because our approximate posterior \(Q(x)\) factors, we can rewrite our free energy \(F\) such that it factors into policy dependent terms \(F(\pi, \rho)\) given by</p>
\[\begin{align*}
F &= \mathbb{E}_Q[\log P(o_{1:t}|x)] - KL[Q(x)|P(x)] \\
&= \sum_{\rho < t} F(\pi, \rho) - KL[Q(\pi)|P(\pi)] - KL[Q(\gamma)|P(\gamma)] \\
&\quad - KL[Q(A)|P(A)] - KL[Q(B)|P(B)] - KL[Q(D)|P(D)]
\end{align*}\]
<p>Our policy dependent free energy then only takes the expectation over a single hidden state \(s_\rho\) and conditioned on the policy:</p>
\[F(\pi, \rho) = \mathbb{E}_Q[\log P(o_\rho|s_\rho)] - KL[Q(s_\rho|\pi)|P(s_\rho| s_{\rho - 1}, \pi)]\]
<p>Having factored our approximate posterior distribution and free energy in this manner, we can use belief propagation (or variational message passing) to calculate the sufficient statistics \(\bar x\).
Based on these, we can finally choose our action.
However, recall that the prior \(P(\pi|\gamma)\) requires a quantity called the expected free energy \(G\) for each policy, which is what we define next.</p>
<h3 id="expected-free-energy-for-each-policy">Expected free energy for each policy</h3>
<p>As we have established already, Active Inference requires two kinds of free energy:
The free energy \(F\) optimized for perception (i.e. inference), and the expected free energy \(G\) for each policy \(\pi\) that defines the prior distribution over policies \(P(\pi)\); thereby informing the posterior \(Q(\pi)\).
Remember that we want to pick policies \(\pi\) that minimize \(-G\), essentially minimizing surprise.
This is directly reflected in the prior \(P(\pi|\gamma) = \sigma(\gamma \cdot G(\pi))\), making \(\pi\) more likely for larger \(G\).</p>
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/active-inference/expected-free-energy.svg" alt="The prior on \(\pi\) has a special interpretation. It requires a quantity called the expected free energy \(G\) that is based on expected future observations." />
<figcaption class="figure-caption">The prior on \(\pi\) has a special interpretation. It requires a quantity called the expected free energy \(G\) that is based on expected future observations.</figcaption>
</figure>
<p>\(G(\pi)\) is defined by the path integral over future timesteps \(\tau > t\).
This essentially means that just like the free energy, we factor \(G\) across time.
\(G(\pi)\) is given by</p>
\[G(\pi) = \sum_{\tau > t} G(\pi, \tau)\]
<p>where each \(G(\pi, \tau)\) is defined by the expected free energy at time \(\tau\) if policy \(\pi\) is pursued.
Recall that the variational free energy can be written as</p>
\[\begin{align*}
F &= \mathbb{E}_Q[\log P(o_{1:t}|x)] - KL[Q(x)|P(x)] \\
&= \mathbb{E}_Q[\log P(x, o_{1:t}) - \log Q(x)]
\end{align*}\]
<p>Because \(G\) models the expectations over the future, we define the predictive posterior \(\tilde Q\) that now also has observations \(o_\tau\) as latent variables.
Basically, we take the last posterior \(Q(s_t)\) and recursively apply transitions \(B\) according to policy \(\pi\) (and the observation likelihood \(A\) to yield observations \(o_\tau\)).
Thus, \(\tilde Q\) is given by</p>
\[\begin{align*}
\tilde Q &= Q(o_\tau, s_\tau | \pi) \\
&= \mathbb{E}_{Q(s_t)}[P(o_\tau, s_\tau| s_t, \pi)]
\end{align*}\]
<p>The expected variational free energy is now simply the free energy for each policy under the expectation of the predictive posterior \(\tilde Q\) instead of the posterior \(Q\):</p>
\[\begin{align*}
G(\pi, \tau) &= \mathbb{E}_{\tilde Q}[\log P(s_\tau, o_\tau|o_{1:t}, \pi) - \log Q(s_\tau|\pi)] \\
&= \mathbb{E}_{\tilde Q}[\log P(s_\tau|o_\tau, o_{1:t}, \pi) + \log P(o_\tau) - \log Q(s_\tau|\pi)]
\end{align*}\]
<p>The last equation now also makes the role of the prior \(P(o_\tau)\) apparent.
We simply interpret the marginal \(P(o_\tau)\) as a distribution over the sorts of outcomes the agent prefers.
Through this interpretation, the expected free energy \(G\) will be shaped by these preferences.
Because we maximize \(G(\pi, \tau)\) we will also maximize the prior probability \(\log P(o_\tau)\), in effect making prior preferences over observations more likely.
We can gain even more intuition from the expected free energy if we rewrite it in an approximate form by simply replacing \(P(s_\tau|o_\tau, o_{1:t}, \pi)\) with an approximate posterior \(Q(s_\tau|o_\tau, \pi)\), essentially dropping the dependence on the observation history \(o_{1:t}\).</p>
\[\begin{align*}
G(\pi, \tau) &\approx \underbrace{\mathbb{E}_{\tilde Q}[\log Q(s_\tau|o_\tau, \pi) - \log Q(s_\tau|\pi)]}_\text{epistemic value or information gain} + \underbrace{\mathbb{E}_{\tilde Q}[\log P(o_\tau)]}_\text{extrinsic value}
\end{align*}\]
<p>This form shows that maximizing \(G\) leads to behavior that either learns about the environment (epistemic value, information gain) or maximizes the extrinsic value.
In experiments, one observes that the epistemic term dominates initially; later, when little information can be gained by exploration, the extrinsic value is maximized.
Active Inference, therefore, provides a Bayes optimal exploration and exploitation trade-off.</p>
<p>To summarize, we have defined the generative model, the free energy that is minimized for perception, its approximate posterior, and the expected free energy that is used to drive action.</p>
<h2 id="the-algorithm-in-action">The algorithm in action</h2>
<p>We now have all the pieces to put together the algorithm consisting of inference, learning and taking action as depicted in the figure below.
All that remains is deriving the belief updates and picking an action.
Note that I will list all the belief updates in this section; feel free to skip the exact mathematical details.
I provide them only for a complete understanding of how Active Inference would have to be computed.</p>
<figure class="text-center">
<img class="figure-img rounded small" style="" src="/assets/posts/active-inference/algorithm.svg" alt="The different stages in the Active Inference algorithm." />
<figcaption class="figure-caption">The different stages in the Active Inference algorithm.</figcaption>
</figure>
<h3 id="inference">Inference</h3>
<p>Recall that inference is done by maximizing</p>
\[Q(x) = \arg\max_{Q(x)} F\]
<p>As mentioned before, we use a mean field approximation to factor our posterior \(Q(x)\).
Our beliefs \(Q(x)\) are then updated through belief propagation, iteratively updating our sufficient statistics \(\bar x\).
The belief updates are derived by differentiating the variational free energy \(F\) w.r.t. the sufficient statistics and setting the result to zero.
This finally yields the following update equations:</p>
\[\begin{align*}
\bar s_\rho^\pi &= \sigma(\log A \cdot o_\rho + \log B^\pi_{\rho -1} \cdot \bar s_{\rho-1}^\pi + \log B_\rho^\pi \cdot \bar s_{\rho + 1}^\pi) \\
\bar \pi &= \sigma(F + \bar\gamma \cdot G) \\
\bar \pi_0 &= \sigma(\bar\gamma \cdot G) \\
\bar \beta &= \beta + (\bar \pi_0 - \bar \pi) \cdot G
\end{align*}\]
<p>where \(F\) and \(G\) are vectors for the free energy \(F(\pi) = \sum_{\rho < t} F(\pi, \rho)\) and expected free energy \(G(\pi) = \sum_{\tau > t} G(\pi, \tau)\) of each policy.</p>
<p>\(F(\pi, \rho)\) and \(G(\pi, \tau)\) can be computed as follows.</p>
\[\begin{align*}
F(\pi, \rho) &= \mathbb{E}_Q[\log P(o_\rho|s_\rho)] - KL[Q(s_\rho|\pi)|P(s_\rho|s_{\rho - 1}, \pi)] \\
&= \bar s_\rho^\pi \cdot (\log A \cdot o_\rho + \log B_{\rho - 1}^\pi \bar s_{\rho - 1}^\pi - \log\bar s_\rho^\pi) \\
\\
G(\pi, \tau) &= -KL[Q(o_\tau|\pi)|P(o_\tau)] - \mathbb{E}_{\tilde Q}[\mathbb{H}[P(o_\tau|s_\tau)]] \\
&= - \underbrace{\bar o_\tau^\pi \cdot (\log\bar o_\tau^\pi - U_\tau)}_\text{risk} - \underbrace{\bar s_\tau^\pi \cdot H}_\text{ambiguity} \\
\bar o_\tau^\pi &= \tilde A \cdot \bar s_\tau^\pi \\
U_\tau &= \log P(o_\tau) \\
H &= -diag(\tilde A \cdot \hat A) \\
\hat A &= \mathbb{E}_Q[\log A] = \psi(\bar a) - \psi(\bar a_0) \qquad \text{where }\psi\text{ is the digamma function}\\
\tilde A &= \mathbb{E}_Q[A] = \bar a \times \bar a_0^{-1} \\
\bar a_{0j} &= \sum_i \bar a_{ij}
\end{align*}\]
<h3 id="learning">Learning</h3>
<p>Learning \(A\), \(B\), and \(D\) is simply inference as well,
so we derive their belief updates using the same method as above.</p>
\[\begin{align*}
\log A &= \psi(\bar a) - \psi(\bar a_0) & \bar a &= a + \sum_\rho o_\rho \otimes \bar s_\rho \\
\log B &= \psi(\bar b) - \psi(\bar b_0) & \bar b(u) &= b(u) + \sum_{\pi(\rho)=u} \bar \pi_\pi \cdot \bar s_\rho^\pi \otimes \bar s_{\rho - 1}^\pi \\
\log D &= \psi(\bar d) - \psi(\bar d_0) & \bar d &= d + \bar s_1
\end{align*}\]
<h3 id="choosing-action">Choosing action</h3>
<p>Finally, we choose an action simply by marginalizing over the posterior beliefs about policies to form a posterior distribution over the next action.
Generally, in simulating Active Inference, the most likely (a posteriori) action is selected and a new observation is sampled from the world.</p>
\[\begin{align*}
u_t &= \arg\max_u \mathbb{E}_{q(\pi)}[P(u|\pi)]
\end{align*}\]
<p>where</p>
\[P(u|\pi) =
\begin{cases}
1, & \text{if }u = \pi(t) \\
0, & \text{otherwise}
\end{cases}\]
<h2 id="conclusion">Conclusion</h2>
<p>We have seen that Active Inference unifies action and perception by minimizing a single quantity: the variational free energy.
Through this simple concept, we have reduced action to inference.
Compared to Reinforcement Learning, no reward has to be specified, exploration is solved optimally, and uncertainty is taken into account.
Instead of rewards, agents have prior preferences over observations that are shaped by evolution.
A brief look at Active Inference as the foundation of life showed that the concept has broad biological applicability.
Furthermore, we presented a model of Active Inference that, while complete, made several approximations and has limitations.
From a machine learning perspective, future work will have to investigate how this scheme can be extended to long sequences and large discrete or continuous state spaces in a scalable manner.
Additionally, hierarchical internal states or recursive Markov-blanket-based systems are an interesting future research direction.</p>
<p>Please leave a comment down below for discussions, ideas, and questions!</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>I thank Karl Friston and Casper Hesp for valuable feedback and discussions.</p>
<h2 id="learn-more">Learn more</h2>
<p>I hope this blog post gave you an intuitive and condensed perspective on Active Inference.
If you would like to learn more, in particular from the perspective of neuroscience, check out this selection of the original papers:</p>
<table>
<thead>
<tr>
<th>Title</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://www.researchgate.net/publication/325473101_A_Multi-scale_View_of_the_Emergent_Complexity_of_Life_A_Free-energy_Proposal">A Multi-scale View of the Emergent Complexity of Life</a></td>
<td>Easy to read article about the <strong>reach of free energy minimization</strong> (& Active Inference)</td>
</tr>
<tr>
<td><a href="https://www.sciencedirect.com/science/article/pii/S0022249615000759">A tutorial on the free-energy framework for modeling perception and learning</a></td>
<td><strong>Beginner friendly</strong> tutorial on the free energy in the neuroscience context (no Active Inference)</td>
</tr>
<tr>
<td><a href="https://www.youtube.com/watch?v=Y1egnoCWgUg&t=2029s">Youtube video on Active Inference</a></td>
<td>Great <strong>intuitive</strong> introduction to Active Inference</td>
</tr>
<tr>
<td><a href="http://openaccess.city.ac.uk/16683/">Active Inference: A Process Theory</a></td>
<td>Detailed & <strong>mathematical</strong> description of Active Inference</td>
</tr>
</tbody>
</table>
<p><em>Louis Kirsch</em></p>
<p><em>We look at Active Inference, a theoretical formulation of perception and action from neuroscience that can explain many phenomena in the brain. It aims to explain the behavior of cells, organs, animals, humans, and entire species. This article is geared towards machine learning researchers familiar with probabilistic modeling and reinforcement learning.</em></p>
<h1 id="universal-ai"><a href="http://louiskirsch.com/ai/universal-ai">Theories of Intelligence (1/2): Universal AI</a></h1>
<p><em>2018-07-12</em></p>
<p>Many AI researchers, in particular in the field of deep learning and reinforcement learning, pursue a bottom-up approach.
We reason by analogy to neuroscience and aim to solve current limitations of AI systems on a case-by-case basis.
While this allows for steady progress, it is unclear which properties of an artificial intelligence system are required to be engineered and which may be learned.
In fact, one might argue that any design decision by a human can be outperformed by a learned rule, given enough data.
The less data we have, the more inductive biases we have to build into our systems.
But how do we know which aspects of an artificial system are necessary and which are better left to learn?
I want to give insight into a <strong>theoretic top-down approach of universal artificial intelligence</strong>.
It promises to prove what an <strong>optimal agent for any set of environments</strong> would have to look like and <strong>how intelligence can be measured</strong>.
Furthermore, we will learn about theoretical limits of (non-)computable intelligence.</p>
<p>The post is structured as follows</p>
<ul>
<li>We try to define intelligence</li>
<li>We motivate how Epicurus’ principle and Occam’s razor relate to the problem of intelligence</li>
<li>We introduce optimal sequence prediction using Solomonoff induction</li>
<li>We extend our agent to active environments (Reinforcement Learning paradigm) and show its optimality</li>
<li>We define a formal measure of intelligence</li>
</ul>
<p>This blog post is, in most parts, based on Marcus Hutter’s book <a href="http://www.hutter1.net/ai/uaibook.htm">Universal Artificial Intelligence</a> and Shane Legg’s PhD thesis <a href="http://www.vetta.org/documents/Machine_Super_Intelligence.pdf">Machine Super Intelligence</a>.</p>
<p>
<div class="card">
<div class="card-header">Summary (tldr)</div>
<div class="card-body">
<ul>
<li>Most machine learning tasks can be reduced to sequence prediction in passive or active environments</li>
<li>Sequences, environments, and hypotheses have universal prior probabilities based on Epicurus’ principle and Occam’s razor</li>
<li>The central quantity is the Kolmogorov complexity, a measure of complexity</li>
<li>Using these prior probabilities we can construct universally optimal sequence prediction and agents in active environments</li>
<li>The intelligence of an agent can be formally defined to be its performance in all environments weighted by complexity</li>
<li>We can derive the universal agent AIXI that maximizes this intelligence measure</li>
<li>Any other agent that is better than AIXI in a specific environment will be at least as much worse in some other environment</li>
</ul>
</div>
</div>
</p>
<h2 id="what-is-intelligence">What is intelligence?</h2>
<p>What is intelligence?
While intelligence has many colloquial meanings, numerous intelligence tests have been developed that predict future academic or commercial performance very well.
But these tests are limited in scope because they only work for humans and might even be prone to cultural biases.
Additionally, how do we compare the intelligence of animals or machines with humans?
It seems that intelligence is not as easy to define as one might think.
If our goal is to design the most intelligent agent, we had better define the term as precisely and generally as possible.</p>
<p>Among many definitions of intelligence one can extract two very common aspects:
Firstly, intelligence is seen as a property of an actor interacting with an external environment.
Secondly, intelligence is related to this actor’s ability to succeed, implying the existence of some kind of goal.
It is therefore fair to say that the greater the capability of an actor to reach certain goals in an environment, the greater that actor’s intelligence.
It is also noteworthy that when describing intelligence, the focus is often on adaptation, learning, and experience.
Shane Legg argues that this is an indicator that the true environment in which these goals are pursued is not fully known and needs to be discovered first.
The actor, therefore, needs to quickly learn and adapt in order to perform as well as possible in a wide range of environments.
This leads us to a simple working definition that intuitively covers all aspects of intelligence:</p>
<blockquote>
<p>Intelligence measures an agent’s ability to achieve goals in a wide range of environments.</p>
</blockquote>
<h2 id="universal-artificial-intelligence">Universal Artificial Intelligence</h2>
<p>In the following, we will derive the universal artificial intelligence AIXI.
It will be based on three fundamental principles:
Epicurus’ principle of multiple explanations, Occam’s razor, and Bayes’ rule.
While Epicurus’ principle and Occam’s razor are important to motivate AIXI, none of the proofs rely on them as assumptions.</p>
<p>All intelligent agents will have to make predictions to achieve their goals.
Only if the underlying generating process can be (approximately) modeled by such an agent can it determine whether progress toward the goal is being made.
In this context, we model hypotheses \(h \in \mathcal{H}\) of the generating process that explain the data.
This may be the dynamics of an environment or the probability distribution over a sequence.
Epicurus’ principle of multiple explanations states</p>
<blockquote>
<p>Keep all hypotheses that are consistent with the data.</p>
</blockquote>
<p>Intuitively, this makes sense because future evidence may increase the likelihood of any current hypothesis.
This also implies that we will have to store the entire previous sequence or interaction history.
Any information that is discarded may be informative in the future.
Bayes’ rule defines how we need to integrate our observations \(D\) to yield a posterior probability \(P(h|D)\)</p>
\[P(h|D) = \frac{P(D|h)P(h)}{P(D)}\]
<p>While we can find \(P(D)\) by marginalization, we are faced with a problem: Where does the prior \(P(h)\) come from?
Bayesian statisticians argue that domain knowledge can be used to inform the prior \(P(h)\).
In the absence of such knowledge, one can simply impose a uniform prior (indifference principle).
This works great for finite sets \(\mathcal{H}\) but leads to problems with non-finite hypothesis spaces where the use of an improper prior is required.
In addition, the uniform prior highly depends on the problem formulation and is not invariant to reparameterization and group transformations.
A simple example illustrates that:
We could have three hypotheses \(\mathcal{H}_3 := \{\text{heads biased}, \text{tails biased}, \text{fair}\}\) for a coin flip.
Alternatively, we can regroup our hypotheses to \(\mathcal{H}_2 := \{\text{biased}, \text{fair}\}\).
Depending on the grouping chosen, the uniform prior will assign a different probability to the hypothesis \(h_b = \text{biased}\).
Occam’s razor will solve this issue. It roughly says</p>
<blockquote>
<p>Among all hypotheses that are consistent with the observations, the simplest is most likely.</p>
</blockquote>
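The coin-grouping example above can be checked in a few lines:

```python
from fractions import Fraction

# Uniform prior over H3 = {heads biased, tails biased, fair}
h3 = {h: Fraction(1, 3) for h in ('heads biased', 'tails biased', 'fair')}
p_biased_h3 = h3['heads biased'] + h3['tails biased']

# Uniform prior over the regrouped H2 = {biased, fair}
h2 = {h: Fraction(1, 2) for h in ('biased', 'fair')}
p_biased_h2 = h2['biased']

# The same hypothesis "biased" receives a different prior under each grouping
assert p_biased_h3 == Fraction(2, 3) and p_biased_h2 == Fraction(1, 2)
```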
<p>Occam’s razor can be motivated from several perspectives.
Empirically, hypotheses that make fewer assumptions are more testable and have led to theories that capture the underlying structure better and thus have better predictive performance.
Any human not adhering to this principle would be judged irrational.
In probability theory, by definition, all assumptions introduce possibilities of error.
If an assumption does not improve the accuracy of a theory, its only effect is to increase the probability that the overall theory is wrong.</p>
<p>There is one additional important observation before we can make use of Occam’s razor:
While we could define a prior on hypotheses \(P(h)\) as previously shown, there is an equivalent formulation where we use a prior for sequences \(P(x)\).
Let’s consider our data \(D\) to be a (binary) sequence \(x \in \mathbb{B}^*\).
Bayesian probability theory requires predictions over sequence continuations \(x_{t+1}\) given \(x_{1:t}\) to be weighted, i.e.</p>
\[P(x_{t+1}|x_{1:t}) = \sum_{h \in \mathcal{H}} P(h|x_{1:t}) P(x_{t+1}|x_{1:t}, h)\]
<p>where \(P(h|x_{1:t})\) is the posterior defined by Bayes’ rule, therefore requiring a prior \(P(h)\).
We can rewrite</p>
\[\begin{align}
P(x_{t+1}|x_{1:t}) &= \sum_{h \in \mathcal{H}} P(h|x_{1:t}) P(x_{t+1}|x_{1:t}, h) \\
&= \sum_{h \in \mathcal{H}} \frac{P(x_{1:t}|h)P(h)}{P(x_{1:t})} \frac{P(x_{1:t+1}|h)}{P(x_{1:t}|h)} \\
&= \frac{P(x_{1:t+1})}{P(x_{1:t})}
\end{align}\]
<p>so that we now require a prior over sequences.
Such a prior over sequences \(x \in \mathbb{B}^*\) is what we will have to find.</p>
<p>So far, we have shown the relevance of finding the right prior probability distribution which may either be expressed over hypothesis space or sequence space.
Furthermore, we motivated why Epicurus’ principle of multiple explanations and Occam’s razor may help us to solve this issue.</p>
<h3 id="solomonoffs-prior-and-kolmogorov-complexity">Solomonoff’s prior and Kolmogorov complexity</h3>
<p>Next, in order to solve our problem with prior probabilities, we will formalize a so-called universal prior probability for sequences, hypotheses, and environments based on Epicurus’ principle of multiple explanations and Occam’s razor.
We begin with the simplest case, directly defining a prior over sequences.</p>
<p>Let’s extend the sequence prediction problem we briefly introduced.
Let \(\mu\) be the true generating distribution over an infinite sequence \(\omega \in \mathbb{B}^{\infty}\).
Sequence prediction is quite a powerful framework because any supervised prediction problem can be reduced to it.
The task of predicting \(\omega_{t+1}\) after having seen \(\omega_{1:t}\) is known as induction.</p>
<p>Occam’s razor is formalized by Solomonoff’s prior probability for (binary) sequences.
Intuitively, a sequence \(x \in \mathbb{B}^*\) is more likely if it is generated by many short programs \(p\) fed to a universal Turing Machine \(U\).
You can think of the program \(p\) as a description of the sequence \(x\).
And the Turing Machine \(U\) executes your program \(p\) to generate \(x\).
If you can describe \(x\) with few instructions (say \(l(p)\) bits), it must be simple, and therefore likely.
We thus simply assign it the probability \(2^{-l(p)}\).
Also, if there are many explanations for \(x\), we just sum these probabilities.</p>
<p>We can now formally put these intuitions together:
We say Turing Machine \(U\) generates \(x\) if \(x\) is a prefix of the output of \(U(p)\).
For this prefix with any continuation, we write \(x*\).
Solomonoff’s prior probability that a sequence begins with binary string \(x \in \mathbb{B}^*\) is given by</p>
\[M(x) := \sum_{p:U(p)=x*} 2^{-l(p)}\]
<p>where \(l(p)\) is the length of program \(p\) and \(U\) is a prefix universal Turing Machine.
The fact that we use a prefix universal Turing Machine is a technicality.
Such a Turing Machine ensures that no valid program for \(U\) is a prefix of any other.
In this definition \(2^{-l(p)}\) can be interpreted as the probability that program \(p\) is sampled uniformly from all possible programs, halving the probability for each additional bit.</p>
<p>To summarize, Solomonoff’s prior probability assigns sequences generated by shorter programs higher probability.
We say sequences generated by shorter programs are less complex.</p>
<p>We can further formalize the complexity of a sequence.
Instead of taking into account all possible programs \(p\) that could generate a sequence \(x\) we just report the shortest program.
Formally, the Kolmogorov complexity of an infinite sequence \(\omega \in \mathbb{B}^\infty\) is the length of the shortest program that produces \(\omega\), given by</p>
\[K(\omega) := \min_{p \in \mathbb{B}^*}\{l(p): U(p) = \omega\}\]
<p>Similarly, the Kolmogorov complexity over finite strings \(x \in \mathbb{B}^*\) is the length of the shortest program that outputs \(x\) and then halts.</p>
<p>We have seen how to formalize Occam’s razor to derive a prior probability distribution.
In principle, we could now directly apply Solomonoff’s universal prior probability \(M(x)\) to perform induction.
But it will be useful to define an alternative variant that takes our hypotheses into account.</p>
<h3 id="solomonoff-levin-prior">Solomonoff-Levin prior</h3>
<p>Another way to look at the problem is to define a universal prior probability over hypotheses.
It will allow us to define an alternative prior over sequences known as the Solomonoff-Levin prior probability.</p>
<p>We start by defining our hypothesis space \(M_e\) over possible distributions and assume that the true generating distribution \(\mu \in M_e\).</p>
\[M_e := \{v_1, v_2, v_3, \ldots\}\]
<p>Each \(v_i \in M_e\) is a so-called enumerable probability semi-measure.
For now, it is sufficient to know that such a measure assigns each prefix sequence \(x \in \mathbb{B}^*\) of \(\omega\) a probability.
Therefore, with such a measure we can describe a distribution over sequences.
In order to make sure that \(\mu \in M_e\) we will pick a really large set of hypotheses.
In technical terms, \(M_e\) is a computable enumeration of enumerable probability semi-measures.
If you are interested in the mathematical details, refer to the optional section.</p>
<p>The index \(i\) can be used as a representation of the semi-measure.
To assign hypothesis \(v_i\) a probability, we can make use of the Kolmogorov complexity again!
If we can describe \(i\) with a short program \(p\), \(v_i\) must be simple.
Formally, we define the Kolmogorov complexity for each of the \(v_i \in M_e\) as</p>
\[K(v_i) := \min_{p \in \mathbb{B}^*} \{ l(p): U(p) = i \}\]
<p>where \(v_i\) is the \(i^{th}\) element in our enumeration.</p>
<p>This gives us all the tools to define a prior probability for hypotheses!
We just assign simpler hypotheses larger probabilities, prescribed by the Kolmogorov complexity from above.
Each hypothesis \(v \in M_e\) is assigned a universal algorithmic prior probability</p>
\[P_{M_e}(v) := 2^{-K(v)}\]
<p>Finally, we construct a mixture to define a prior over the space of sequences, arriving at the Solomonoff-Levin prior probability of a binary sequence beginning with string \(x \in \mathbb{B}^*\)</p>
\[\xi(x) := \sum_{v \in M_e} P_{M_e}(v) v(x)\]
<p>
<div class="card">
<div class="card-header clearfix collapse-header">
<h4 class="float-left">
<a data-toggle="collapse" href="#probability-measures,-enumerable-functions,-and-computable-enumerations-content" aria-expanded="true" aria-controls="probability-measures,-enumerable-functions,-and-computable-enumerations-content" id="probability-measures,-enumerable-functions,-and-computable-enumerations" class="d-block">
Probability measures, enumerable functions, and computable enumerations
</a>
</h4>
<i class="float-right">optional section</i>
</div>
<div id="probability-measures,-enumerable-functions,-and-computable-enumerations-content" class="collapse" aria-labelledby="probability-measures,-enumerable-functions,-and-computable-enumerations">
<div class="card-body">
<p>In order to derive AIXI we will require three concepts that may not be familiar to the reader:
Probability (semi-)measures, enumerable functions, and computable enumerations.
We denote a sequence \(xy\) as the concatenation of \(x \in \mathbb{B}^*\) and \(y \in \mathbb{B}^*\).</p>
<p>A probability measure is a function over binary sequences \(v: \mathbb{B}^* \to [0, 1]\) such that</p>
\[v(\epsilon) = 1, \forall x \in \mathbb{B}^*: v(x) = v(x0) + v(x1)\]
<p>where \(\epsilon\) is the empty string.
In other words, we assign each finite binary sequence \(x \in \mathbb{B}^*\) (which could be a prefix of \(\omega\)) a probability such that they consistently add up.</p>
<p>A generalization is the probability semi-measure</p>
\[v(\epsilon) \leq 1, \forall x \in \mathbb{B}^*: v(x) \geq v(x0) + v(x1)\]
<p>that can be normalized to a probability measure.</p>
<p>Intuitively, a function is enumerable if it can be progressively approximated from below with a Turing machine in finite time.
Conversely, a function is co-enumerable if it can be progressively approximated from above with a Turing machine in finite time.</p>
<p>We stated that we’d like to enumerate enumerable semi-measures.
Note that an enumeration is <em>different</em> from an enumerable function.
What is a computable enumeration?
Basically, we want to enumerate all semi-measures with a Turing Machine in a computable manner.
We cannot directly output probability measures with our Turing Machine and therefore use the index \(i\) as a description that can then be used to approximate \(v_i(x)\) from below.
More precisely, there exists a Turing Machine \(T\) such that for any enumerable semi-measure \(v \in M_e\) there exists an index \(i \in \mathbb{N}\) such that</p>
\[\forall x \in \mathbb{B}^*: v(x) = v_i(x) := \lim_{k \to \infty}T(i, k, x)\]
<p>with \(T\) increasing in \(k\).
In other words, we have a Turing Machine \(T\) and for a given sequence \(x\) and index \(i\) we approximate the value of the semi-measure \(v_i\) for \(x\) from below.
The argument \(k\) intuitively describes how exact this approximation shall be.
In the limit of \(\lim_{k \to \infty}\) we will yield the exact value \(v_i(x)\).</p>
<p>Why can the index \(i\) be used to define the Kolmogorov complexity for each hypothesis \(v_i \in M_e\)?
The more complex \(i\) is, the more complex the input to the Turing Machine \(T\) above becomes.
For given \(k\) and \(x\), the complexity of the output can therefore only depend on \(i\).</p>
</div>
</div>
</div>
</p>
<h3 id="solomonoff-induction">Solomonoff induction</h3>
<p>We now have two formulations for prior distributions over binary sequences:
The Solomonoff prior \(M\) and the Solomonoff-Levin prior \(\xi\).
Both priors can be shown to only be a multiplicative constant away from each other, i.e. \(M \overset{\times}{=} \xi\).
We will need the second formulation \(\xi\) to construct our universal agent AIXI.
But for now, let’s focus on the sequence prediction problem.</p>
<p>By the definition of conditional probability our prediction problem reduces to</p>
\[\xi(\omega_{t+1}|\omega_{1:t}) = \frac{\xi(\omega_{1:t+1})}{\xi(\omega_{1:t})}\]
<p>Because we used \(\xi\) as our universal prior, this is known as Solomonoff induction for sequence prediction.</p>
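The prediction rule is easy to sketch for a finite hypothesis class. In the toy Python example below, the three Bernoulli hypotheses and their "complexities" are invented stand-ins for the enumerable semi-measures and their Kolmogorov complexities:

```python
# Finite hypothesis class: (invented complexity K_i, Bernoulli parameter p_i).
hypotheses = [(1, 0.5), (2, 0.9), (3, 0.1)]

def bern(p, x):
    prob = 1.0
    for bit in x:
        prob *= p if bit == '1' else 1 - p
    return prob

def xi(x):
    """Mixture prior xi(x) = sum_i 2^{-K_i} v_i(x)."""
    return sum(2.0 ** -K * bern(p, x) for K, p in hypotheses)

def predict_one(x):
    """Conditional prediction xi(x_{t+1} = 1 | x) = xi(x1) / xi(x)."""
    return xi(x + '1') / xi(x)

# After observing many 1s the mixture concentrates on the p = 0.9 hypothesis.
print(round(predict_one('1' * 20), 3))  # ~0.9
```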
<p>Hooray, we’ve done it! But what do we gain from all this?
Given this scheme of induction, how good is the predictor \(\xi(\omega_{t+1}|\omega_{1:t})\) relative to the true \(\mu(\omega_{t+1}|\omega_{1:t})\)?
It turns out that as long as \(\mu\) can be described by a computable distribution (which is a very minor restriction) our estimator \(\xi\) will converge rapidly to the true \(\mu\), in effect solving the problem optimally!</p>
<p>
<div class="card">
<div class="card-header clearfix collapse-header">
<h4 class="float-left">
<a data-toggle="collapse" href="#convergence-of-solomonoff-induction-content" aria-expanded="true" aria-controls="convergence-of-solomonoff-induction-content" id="convergence-of-solomonoff-induction" class="d-block">
Convergence of Solomonoff induction
</a>
</h4>
<i class="float-right">optional section</i>
</div>
<div id="convergence-of-solomonoff-induction-content" class="collapse" aria-labelledby="convergence-of-solomonoff-induction">
<div class="card-body">
<p>We can measure the relative performance of \(\xi\) to \(\mu\) using the prediction error \(S_t\) given by</p>
\[S_t = \sum_{x \in \mathbb{B}^{t-1}} \mu(x)(\xi(x_t = 0|x) - \mu(x_t = 0|x))^2\]
<p>Consider the set of computable probability measures \(M_c \subset M_e\).
It can be shown that for any \(\mu \in M_c\) the sum of all prediction errors is bounded by a constant</p>
\[\sum_{t=1}^{\infty} S_t \leq \frac{\ln 2}{2} K(\mu)\]
<p>Due to this bound, it follows that the estimator \(\xi\) rapidly converges for any \(\mu\) that can be described by a computable distribution (see <a href="https://arxiv.org/abs/0709.1516">Hutter 2007</a> for what ‘rapid’ describes).</p>
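The bound can be checked numerically for a toy finite hypothesis class (the complexities below are invented stand-ins for \(K(\mu)\)). With i.i.d. Bernoulli hypotheses, \(\xi(x_t = 0|x)\) depends only on the counts of ones and zeros in \(x\), which keeps the sum over \(\mathbb{B}^{t-1}\) tractable:

```python
from math import comb, log

hyps = [(1, 0.5), (2, 0.9), (3, 0.1)]   # (invented K_i, p_i)
mu_p = 0.9                              # true environment: Bernoulli(0.9)

def xi_counts(n1, n0):
    """Mixture mass of any string with n1 ones and n0 zeros."""
    return sum(2.0 ** -K * p ** n1 * (1 - p) ** n0 for K, p in hyps)

def S(t):
    """Expected squared prediction error at step t, summed over count classes."""
    total = 0.0
    for n1 in range(t):                     # n1 ones among the first t-1 bits
        n0 = t - 1 - n1
        w = comb(t - 1, n1) * mu_p ** n1 * (1 - mu_p) ** n0  # total mu(x) mass
        xi0 = xi_counts(n1, n0 + 1) / xi_counts(n1, n0)      # xi(x_t = 0 | x)
        total += w * (xi0 - (1 - mu_p)) ** 2
    return total

cumulative = sum(S(t) for t in range(1, 200))
# With prior weight 2^{-2} on the true environment, the bound reads
# (ln 2 / 2) * 2 = ln 2, and the cumulative error stays below it.
print(cumulative < log(2))  # True
```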
</div>
</div>
</div>
</p>
<h3 id="active-environments">Active environments</h3>
<p>So far we only dealt with passive environments in the form of sequence predictions.
We will now extend the framework to active environments using the Reinforcement Learning framework as depicted below.</p>
<figure class="text-center">
<img class="figure-img rounded " style="" src="/assets/posts/universal-ai/rl-framework.svg" alt="The Reinforcement Learning framework." />
<figcaption class="figure-caption">The Reinforcement Learning framework.</figcaption>
</figure>
<p>We define an environment to be the tuple \((A, O, R, \mu)\):</p>
<ul>
<li>\(A\) is the set of all actions our agent can take</li>
<li>\(O\) is the set of observations the agent can receive</li>
<li>\(R\) is the set of possible rewards</li>
<li>\(\mu\) is the environment’s transition probability measure (as defined in the following)</li>
</ul>
<p>We express the concatenation of observation and reward as \(x := or\).
The index \(t\) of a sequence \(ax_t\) refers to both action \(a\) and input \(x\) at time \(t\).
Then \(\mu\) is simply the probability measure over transitions \(\mu(x_t|ax_{<t}a_t)\).
To yield a well-defined, optimally informed agent, we still have to specify two ingredients.
First, the stream of rewards needs to be specified.
This could be something like the pain or reward signals that humans experience.
Second, we need to specify a temporal preference.
This can be done in two ways.
We may specify a discount factor \(\gamma_t\) directly, for instance, a geometrically decreasing discount factor as commonly seen in the RL-literature:</p>
\[\forall i: \gamma_i := \alpha^i \qquad \text{for} \quad \alpha \in (0, 1)\]
<p>Alternatively, we require that the total reward is bounded, therefore directly specifying our preferences in the stream of rewards.
We call this set of environments reward-summable environments \(\mathbb{E} \subset E\).</p>
<p>It is a known result that the optimal agent in environment \(\mu\) with discounting \(\gamma\) would then be defined using the Bellman equation</p>
\[\pi^\mu := \arg\max_\pi V_\gamma^{\pi\mu} \\
V_\gamma^{\pi\mu}(ax_{<t}) = \sum_{ax_t} [\gamma_t r_t + V_\gamma^{\pi\mu}(ax_{1:t})]\overset{\pi}{\mu}(ax_t|ax_{<t})\]
<p>where \(\overset{\pi}{\mu}\) is the probability distribution jointly over policy and environment, e.g.</p>
\[\overset{\pi}{\mu}(ax_{1:2}) := \pi(a_1) \mu(x_1|a_1) \pi(a_2|ax_1) \mu(x_2|ax_1a_2)\]
<p>If the environment \(\mu\) was known we could directly infer the optimal action \(a_t^{\pi^\mu}\) in the \(t^{th}\) step</p>
\[a_t^{\pi^\mu} := \arg\max_{a_t} \lim_{m \to \infty} \sum_{x_t} \max_{a_{t+1}} \sum_{x_{t+1}} \ldots \max_{a_m} \sum_{x_m} [\gamma_t r_t + \ldots + \gamma_m r_m] \mu(x_{t:m}|ax_{<t}a_{t:m})\]
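The alternation of maximization over actions and expectation over percepts can be sketched for a tiny known environment. The toy \(\mu\) below is an assumption for illustration: percepts are \(x \in \{0, 1\}\), the reward equals the percept, and \(\mu(x_t = 1 | a_t)\) depends only on the current action:

```python
MU = {0: 0.3, 1: 0.8}    # mu(x_t = 1 | a_t) for actions 0 and 1 (invented)
GAMMA = 0.9              # geometric discounting gamma_t = GAMMA ** t

def value(t, m):
    """max_{a_t} sum_{x_t} ... [gamma_t r_t + ...] from step t to horizon m."""
    if t > m:
        return 0.0
    return max(
        sum(p * (GAMMA ** t * x + value(t + 1, m))
            for x, p in ((1, MU[a]), (0, 1 - MU[a])))
        for a in MU
    )

def best_action(t, m):
    """The optimal action a_t^{pi^mu} for a finite horizon m."""
    def q(a):
        return sum(p * (GAMMA ** t * x + value(t + 1, m))
                   for x, p in ((1, MU[a]), (0, 1 - MU[a])))
    return max(MU, key=q)

print(best_action(0, 5))  # 1 -- the action with the higher success probability
```

Since this toy environment is memoryless, the optimal action is simply the arm with the larger success probability; in a stateful environment the recursion would condition on the full history \(ax_{<t}\).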
<p>But it is easy to see that this agent does not fulfill our intelligence definition yet.
The agent needs to be able to cope with many different environments and learn from them, so it cannot have \(\mu\) built in.
We will take the agent from above and replace the environment dynamics \(\mu\) with the generalized universal prior distribution \(\xi\).</p>
<p>Again, we will have hypotheses, this time over possible environment probability measures.
Formally, our environment hypothesis space will now be the set of all enumerable chronological semi-measures</p>
\[E := \{v_1, v_2, \ldots\}\]
<p>We call these measures chronological because our sequence of interactions with the environment has a time component.
Again, we define the universal prior probability of a chronological environment \(v \in E\) using the Kolmogorov complexity</p>
\[P_E(v) := 2^{-K(v)}\]
<p>where \(K(v)\) is the length of the shortest program that computes the environment’s index.
We yield the prior over the agent’s observations by constructing another mixture</p>
\[\xi(x_{1:n}|a_{1:n}) := \sum_{v\in E} 2^{-K(v)} v(x_{1:n}|a_{1:n})\]
<p>By replacing \(\mu\) with \(\xi\) (and using conditional probability) we finally yield the universal AIXI agent \(\pi^\xi\):</p>
<p>
<div class="card">
<div class="card-header">Universal AIXI agent</div>
<div class="card-body">
\[a_t^{\pi^\xi} := \arg\max_{a_t} \lim_{m \to \infty} \sum_{x_t} \max_{a_{t+1}} \sum_{x_{t+1}} \ldots \max_{a_m} \sum_{x_m} [\gamma_t r_t + \ldots + \gamma_m r_m] \xi(x_{t:m}|ax_{<t}a_{t:m})\]
</div>
</div>
</p>
<h3 id="convergence">Convergence</h3>
<p>Can we show that \(\xi\) converges to the true environment \(\mu\) in the same way we have shown that our universal prior \(\xi\) for sequence prediction converged?
Indeed, we can, but with a strong limitation: The interaction history \(ax_{<t}\) must come from \(\pi^\mu\).
Of course, \(\mu\) is unknown and this is not feasible.
Even worse, it can be shown that it is in general impossible to match the performance of \(\pi^\mu\).
Different from the sequence prediction problem, possibly irreversible interaction with the environment is necessary.
To give a very simple example:
Let’s imagine an environment where an agent has to pick between two doors.
One door leads to hell (very little reward), the other to heaven (plenty of reward).
Crucially, after having chosen one door, the agent can not return.
But because the agent does not have any prior knowledge about what is behind the doors, there is no optimal behavior that guarantees that heaven is chosen.
Conversely, if \(\mu\) was known to the agent, it could infer which door leads to larger rewards.
We will investigate this problem again in a later section.
For now, let’s at least investigate whether we are as optimal as any agent can be in such a situation.</p>
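The two-door argument can be made concrete in a few lines. Under a uniform prior over the two assignments, every fixed choice wins in exactly one of them, so no policy guarantees heaven:

```python
# Heaven (reward 1) behind one door, hell (reward 0) behind the other;
# the choice is irreversible and the assignment is unknown a priori.
def episode_reward(chosen_door, heaven_door):
    return 1.0 if chosen_door == heaven_door else 0.0

for policy_door in (0, 1):
    rewards = [episode_reward(policy_door, heaven) for heaven in (0, 1)]
    assert min(rewards) == 0.0         # every fixed policy can end in hell
    assert sum(rewards) / 2 == 0.5     # expected reward under a uniform prior

# An informed agent that knows mu simply opens the heaven door and gets 1.0.
informed = max(episode_reward(d, heaven_door=1) for d in (0, 1))
print(informed)  # 1.0
```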
<p>Luckily, there is a key theorem proven by Hutter that says exactly that.
Let \(\pi^\zeta\) be the equivalent of agent \(\pi^\xi\) defined over \(\mathcal{E} \subseteq E\) instead of \(E\), then we have</p>
<p>
<div class="card">
<div class="card-header">Pareto optimality of AIXI [<a href="http://www.hutter1.net/ai/uaibook.htm">proof in Section 5.5</a>]</div>
<div class="card-body">
<p>For any \(\mathcal{E} \subseteq E\) the agent \(\pi^\zeta\) is Pareto optimal.</p>
<p>An agent \(\pi\) is Pareto optimal if there is no other agent \(\rho\) such that</p>
\[\forall \mu \in \mathcal{E} \quad \forall ax_{<t}: V_\gamma^{\rho\mu}(ax_{<t}) \geq V_\gamma^{\pi\mu}(ax_{<t})\]
<p>with strict inequality for at least one \(\mu\).</p>
</div>
</div>
</p>
<p>In other words, there exists no agent that is at least as good as \(\pi^\zeta\) in all environments in \(\mathcal{E}\), and strictly better in at least one.
An even stronger result can be shown:
\(\pi^\zeta\) is also balanced Pareto optimal, which means that any increase in performance in some environment due to switching to another agent is compensated for by an equal or greater decrease in performance in some other environment.</p>
<p>Great!
But even if we cannot construct a better agent, for what kind of environments is our agent \(\pi^\xi\) guaranteed to converge to the optimal performance of \(\pi^\mu\)?
Intuitively, the agent needs to be given time to learn about its environment \(\mu\) to reach optimal performance.
But due to irreversible interaction, not all environments permit this kind of behavior.
We formalize this concept of convergence by defining self-optimizing agents and categorizing environments into the ones admitting self-optimizing agents, and the ones that do not.
An agent \(\pi\) is self-optimizing in an environment \(\mu\) if,</p>
\[\frac{1}{\Gamma_t}V_\gamma^{\pi\mu}(ax_{<t}) \to \frac{1}{\Gamma_t} V_{\gamma}^{\pi^\mu\mu}(\hat a \hat x_{<t}) \qquad \text{where} \quad \Gamma_t := \sum_{i=t}^{\infty} \gamma_i\]
<p>with probability 1 as \(t \to \infty\).
The interaction histories \(ax_{<t}\) and \(\hat a \hat x_{<t}\) are sampled from \(\pi\) and \(\pi^\mu\) respectively.
We require normalization \(\frac{1}{\Gamma_t}\) because un-normalized expected future discounted reward always converges to zero.
Intuitively, the above statement says that with high probability the performance of a self-optimizing agent \(\pi\) converges to the performance of the optimal agent \(\pi^\mu\).
Furthermore, it can be proven that if there exists a sequence of self-optimizing agents \(\pi_m\) (the agent may be non-stationary) for a class of environments \(\mathcal{E}\), then the agent \(\pi^\zeta\) is also self-optimizing for \(\mathcal{E}\) [<a href="http://www.hutter1.net/ai/uaibook.htm">proof in Hutter 2005</a>].
This is a promising result: while not all environments admit self-optimizing agents, we can rest assured that for those that do, \(\pi^\zeta\) will be self-optimizing.</p>
<p>For the commonly used Markov Decision Process (MDP) it can be shown that if it is ergodic, then it admits self-optimizing agents.
We say that an MDP environment \((A, X, \mu)\) is ergodic iff there exists an agent \((A, X, \pi)\) such that \(\overset{\pi}{\mu}\) defines an <a href="https://math.dartmouth.edu/archive/m20x06/public_html/Lecture15.pdf">ergodic Markov chain</a>.
Other self-optimizing environments are summarized in the following figure.</p>
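A finite chain is ergodic exactly when some power of its transition matrix is strictly positive (the chain is "primitive"), which gives a simple check. The two example chains below are invented toys standing in for \(\overset{\pi}{\mu}\):

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_ergodic(P):
    """True iff some power of P is strictly positive (primitive chain),
    which for finite chains is equivalent to irreducible + aperiodic."""
    n = len(P)
    Q = [row[:] for row in P]
    for _ in range(n * n):          # primitivity index is at most (n-1)^2 + 1
        if all(x > 0 for row in Q for x in row):
            return True
        Q = mat_mul(Q, P)
    return False

P_good = [[0.5, 0.5], [0.2, 0.8]]   # induced chain of some policy: ergodic
P_bad = [[1.0, 0.0], [0.5, 0.5]]    # absorbing state: not ergodic

print(is_ergodic(P_good), is_ergodic(P_bad))  # True False
```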
<figure class="text-center">
<img class="figure-img rounded" style="max-height:500px;" src="/assets/posts/universal-ai/environments.png" alt="Taxonomy of environments" />
<figcaption class="figure-caption">Taxonomy of environments and whether they allow self-optimizing agents. Reproduced from [<a href="http://www.vetta.org/documents/Machine_Super_Intelligence.pdf">1</a>].</figcaption>
</figure>
<p>It is important to note, that being self-optimizing only tells us about the performance in the limit, not how quickly the agent will learn to perform well.
While we were able to upper-bound the error by a constant in the case of sequence prediction, such a general result is impossible for active environments.
Further research is required to prove convergence results for different classes of environments.
Due to the Pareto optimality of AIXI, we can at least be confident that no other agent could converge more quickly in all environments.</p>
<h3 id="what-about-exploration">What about exploration?</h3>
<p>You may wonder why the RL exploration problem does not appear in our universal agent.
The answer is that exploration is optimally encoded in our universal prior \(\xi\).
All possible environments are being considered, weighted by their complexity, such that actions are considered not just with respect to the optimal policy in a single environment, but all environments.</p>
<h2 id="measuring-intelligence">Measuring intelligence</h2>
<p>Up until now, we derived the universal agent AIXI based on our working definition of intelligence.</p>
<blockquote>
<p>Intelligence measures an agent’s ability to achieve goals in a wide range of environments.</p>
</blockquote>
<p>In the following, we reverse the process and aim to define intelligence itself mathematically such that it permits us to measure intelligence universally for humans, animals, machines or any other arbitrary forms of intelligence.
This may even allow us to directly optimize for the intelligence of our agents instead of surrogates such as spreading genes, survival or any other objective we specify as designers.</p>
<p>We have already derived a pretty powerful toolbox to define such a formal measure of intelligence.
We formalize the wide range of environments and any goal that could be defined in these with the class of reward-summable environments \(\mathbb{E}\).
The ability of the agent \(\pi\) to achieve these goals is then just its value function \(V_\mu^\pi\).
We arrive at the following very simple formulation:</p>
<p>
<div class="card">
<div class="card-header">Formal measure of intelligence</div>
<div class="card-body">
<p>The universal intelligence of an agent \(\pi\) is its expected performance with respect to the universal distribution \(2^{-K(\mu)}\) over the space of all computable reward-summable environments \(\mathbb{E} \subset E\), that is,</p>
\[\Upsilon(\pi) := \sum_{\mu \in \mathbb{E}} 2^{-K(\mu)} V_\mu^\pi\]
</div>
</div>
</p>
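For a finite stand-in environment class, the measure is a one-liner. All numbers below (complexities and values) are hypothetical; the real \(K(\mu)\) and \(V_\mu^\pi\) are not computable:

```python
def upsilon(agent_values, complexities):
    """Upsilon(pi) = sum_mu 2^{-K(mu)} V_mu^pi over the environment class."""
    return sum(2.0 ** -K * V for K, V in zip(complexities, agent_values))

K = [1, 3, 5]              # hypothetical environment complexities
V_pi = [0.9, 0.2, 0.7]     # agent pi's value in each environment
V_rho = [0.5, 0.8, 0.9]    # agent rho's value in each environment

print(upsilon(V_pi, K), upsilon(V_rho, K))
# Simple environments dominate the weighting: pi scores higher overall
# despite losing to rho in two of the three environments.
```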
<p>This very simple measure of intelligence is remarkably close to our working definition of intelligence!
All we have added to our informal definition is a preference over environments we care about in the form of the prior probability \(2^{-K(\mu)}\).</p>
<p>How does this measure relate to AIXI?
By the linearity of expectation the above definition is just the expected future discounted return \(V_\xi^\pi\) under the universal prior probability distribution \(\xi\):</p>
\[\begin{align}
V_\xi^\pi &= V_\xi^\pi(\epsilon) \\
&= \sum_{ax_1}[r_1 + V_\xi^\pi(ax_1)]\pi(a_1)\xi(x_1|a_1) \\
&= \sum_{ax_1}[r_1 + V_\xi^\pi(ax_1)]\pi(a_1)\sum_{\mu \in \mathbb{E}} 2^{-K(\mu)} \mu(x_1|a_1) \\
&= \sum_{\mu \in \mathbb{E}} 2^{-K(\mu)} \sum_{ax_1}[r_1 + V_\xi^\pi(ax_1)]\pi(a_1)\mu(x_1|a_1) \\
&= \sum_{\mu \in \mathbb{E}} 2^{-K(\mu)} V_\mu^\pi
\end{align}\]
<p>We have previously seen that AIXI maximizes \(V_\xi^\pi\), therefore, by construction, the upper bound on universal intelligence is given by \(\pi^\xi\)</p>
\[\overline\Upsilon := \max_\pi \Upsilon(\pi) = \Upsilon(\pi^\xi)\]
<h2 id="practical-considerations">Practical considerations</h2>
<p>While the definition of universal artificial intelligence is an interesting theoretical endeavor on its own, we would like to construct algorithms that are as close as possible to this optimum.
Unfortunately, Kolmogorov complexity and therefore \(\pi^\xi\) are not computable.
In fact, AIXI is only enumerable as summarized in the table below.
This means that while we can make arbitrarily good approximations, we cannot devise an \(\epsilon\)-approximation because the upper bound of the function value is unknown.</p>
<table>
<thead>
<tr>
<th>Quantity</th>
<th>C*</th>
<th>E*</th>
<th>Co-E*</th>
<th>Comment</th>
<th>Other properties</th>
</tr>
</thead>
<tbody>
<tr>
<td>Kolmogorov complexity</td>
<td></td>
<td></td>
<td>x</td>
<td></td>
<td></td>
</tr>
<tr>
<td>\(v \in E\)</td>
<td></td>
<td>x</td>
<td></td>
<td>By definition of enumeration \(E\) of all enumerable chronological semi-measures</td>
<td></td>
</tr>
<tr>
<td>\(v \in \mathbb{E}\)</td>
<td>x</td>
<td></td>
<td></td>
<td>By definition of enumeration \(\mathbb{E}\) of all computable chronological semi-measures</td>
<td></td>
</tr>
<tr>
<td>\(\xi(x_{1:t}|a_{1:t})\) <br /> \(P_E(v) := 2^{-K(v)}\)</td>
<td></td>
<td>x</td>
<td></td>
<td>Follows from Kolmogorov complexity</td>
<td>\(\xi \in E\)</td>
</tr>
<tr>
<td>\(\pi^\xi\), \(V_\xi^\pi\), \(\Upsilon(\pi)\)</td>
<td></td>
<td>x</td>
<td></td>
<td>Follows from \(\xi\)</td>
<td></td>
</tr>
</tbody>
</table>
<p>* C = Computable, E = Enumerable, Co-E = Co-Enumerable</p>
<p>In more detail, what are the parts that make \(\Upsilon(\pi)\) of an agent \(\pi\) (and therefore \(\pi^\xi\)) impossible to compute?</p>
<ul>
<li>Computation of Kolmogorov complexity \(K(v)\) for \(v \in \mathbb{E}\)</li>
<li>Sum over the non-finite set of chronological semi-measures (environments) \(\sum_{v \in \mathbb{E}}\)</li>
<li>Computation of the value function \(V_v^\pi\) over an infinite time horizon (assuming infinite episode length)</li>
</ul>
<p>To estimate the Kolmogorov complexity, techniques such as compression can be used.
AIXI could be Monte-Carlo approximated by sampling many environments (programs) and approximating the infinite sum over environments by a program length weighted finite sum over environments.
Similarly, we would have to limit the time horizon to compute the estimated discounted reward.</p>
<p>Other approximations and related approaches are HL(\(\lambda\)), AIXItl, Fitness Uniform Optimisation, the Speed prior, the Optimal Ordered Problem Solver and the Gödel Machine (Hutter, Legg, Schmidhuber, and others).</p>
<h2 id="conclusion">Conclusion</h2>
<p>Many interesting discussions emerge:
In the context of artificial intelligence by evolution, we aim to simulate evolution in order to produce intelligent agents.
Common approaches, inspired by evolution, optimize the spread of genes and survival instead of intelligence itself.
If the goal is intelligence over survival, can we directly (approximately) optimize for the intelligence of an agent?
Furthermore, which model should be chosen for \(\pi^\xi\)?
Any Turing Machine equivalent representation would be valid, be it neural networks or any symbolic programming language.
Of course, in practice, this choice makes a significant difference.
How are we going to make these design decisions?
Shane Legg argues that neuroscience is a good source for inspiration.</p>Louis KirschI give insight into a theoretic top-down approach of universal artificial intelligence. This may allow directing our future research by theoretical guidance, avoiding to handcraft properties of a system that may be learned by an intelligent agent. Furthermore, we will learn about theoretical limits of (non-)computable intelligence and introduce a universal measure of intelligence.