Jekyll2019-01-10T07:32:32+00:00http://louiskirsch.com/feed.xmlLouis KirschDeep Learning and Reinforcement Learning researcher striving to build life-long learning machines. Currently pursuing a PhD in Artificial Intelligence at IDSIA with Jürgen Schmidhuber.Louis KirschNeurIPS 2018, Updates on the AI road map2019-01-10T06:00:00+00:002019-01-10T06:00:00+00:00http://louiskirsch.com/neurips-2018<p>In September, I published a technical report on what I consider the <a href="/assets/publications/contemporary-challenges-in-artificial-intelligence.pdf">most important challenges in Artificial Intelligence</a>.
I categorized them into four areas:</p>
<ul>
<li><strong>Scalability</strong><br />
Neural networks where compute / memory cost does not scale quadratically / linearly with the number of neurons.</li>
<li><strong>Continual Learning</strong><br />
Agents that have to learn continually from their environment, without forgetting previously acquired skills and without the ability to reset the environment.</li>
<li><strong>Meta-Learning</strong><br />
Agents that are self-referential in order to modify their own learning algorithm.</li>
<li><strong>Benchmarks</strong><br />
Environments that have complex enough structure and diversity such that intelligent agents can emerge without hardcoding strong inductive biases.</li>
</ul>
<p>During the <strong>NeurIPS 2018 conference</strong> I investigated other <strong>researchers’ current approaches and perspectives</strong> on these issues.</p>
<h2 id="inductive-biases">Inductive biases?</h2>
<p>I think it is interesting to point out that this list contains little discussion of particular inductive biases that solve challenges we observe with current reinforcement learning agents.
Most of these challenges are absorbed into the Meta-Learning aspect of the system, similar to how evolution shaped a good learner.
It remains to be seen how feasible this approach is with strongly limited compute and time constraints.</p>
<h2 id="scalability">Scalability</h2>
<p>It is almost obvious that if we seek to implement the 100 billion neurons found in the human brain using artificial neural networks (ANNs), standard matrix-matrix multiplications will not take us very far.
The number of required operations is quadratic in the number of neurons.</p>
<figure class="text-center">
<img class="figure-img rounded " src="/assets/posts/modular-networks/modular-layer.gif" alt="The modular layer consists of a pool of modules and a controller that chooses the modules to execute based on the input." />
<figcaption class="figure-caption">The modular layer consists of a pool of modules and a controller that chooses the modules to execute based on the input.</figcaption>
</figure>
<p>To address this issue, I have worked on and published <a href="https://arxiv.org/abs/1811.05249">Modular Networks</a> at NeurIPS 2018.
Instead of evaluating the entire ANN for each input element, we decompose the network into a set of modules, where only a subset is used depending on the input.
This procedure is inspired by the human brain, where we can observe modularization that is also hypothesized to improve adaptation to changing environments and mitigate catastrophic forgetting.
In our approach, we jointly learn both the parameters of these modules and the decision of which modules to use.
Previous literature on conditional computation has had many issues with module collapse, i.e. the optimization process ignoring most of the available modules, leading to sub-optimal solutions.
Our Expectation-Maximization based approach prevents these kinds of issues.</p>
<p>Unfortunately, forcing this kind of separation into modules has its own issues that we discussed in the paper and in <a href="/modular-networks">this follow-up blog post</a> on modular networks.
Instead, we might seek to make use of sparsity and locality in weights and activations as discussed in <a href="/assets/publications/scale-through-sparsity.pdf">my technical report on sparsity</a>.
In short, we only want to perform operations on the few activations that are non-zero, discarding entire rows in the weight matrix.
If, furthermore, connectivity is highly sparse, we effectively reduce the quadratic cost to a small constant.
This kind of conditional computation and non-coalesced weight access is quite expensive to implement on current GPUs and usually not worth it.</p>
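<p>As a toy illustration (a sketch of the principle, not the implementation from the report): with a row-activation convention <code>y = a @ W</code>, every zero entry of the activation vector lets us skip the corresponding row of the weight matrix, so the work scales with the number of active units rather than the full layer size.</p>

```python
import numpy as np

def dense_matvec(a, W):
    # Standard dense product: touches every weight, O(n_in * n_out).
    return a @ W

def sparse_matvec(a, W):
    # Only rows of W whose activation is non-zero contribute, so the
    # cost is proportional to the number of non-zero activations.
    active = np.flatnonzero(a)
    return a[active] @ W[active, :]

rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 1000))
a = rng.normal(size=1000)
a[rng.random(1000) < 0.99] = 0.0  # ~99% of activations are zero

assert np.allclose(dense_matvec(a, W), sparse_matvec(a, W))
```

<p>On GPUs, the non-coalesced row gather in <code>sparse_matvec</code> is exactly the operation that is expensive in practice, which is why the hardware discussion below matters.</p>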
<h3 id="nvidias-take-on-conditional-computation-and-sparsity">NVIDIA’s take on conditional computation and sparsity</h3>
<p>According to a software engineer and a manager at NVIDIA, there are no current plans to build hardware that can leverage conditional computation in the form of activation sparsity.
The main reason seems to be the trade-off of generality vs speed.
It is too expensive to build dedicated hardware for this use case because it might limit other (ML) applications.
Instead, NVIDIA is more focused on weight sparsity from a software perspective at the moment.
To be efficient, this weight sparsity also needs to be of a high degree.</p>
<h3 id="graphcores-take-on-conditional-computation-and-sparsity">GraphCore’s take on conditional computation and sparsity</h3>
<p><a href="https://www.graphcore.ai/">GraphCore</a> builds hardware that allows storing activations during the forward pass in caches close to the processing units instead of global memory on GPUs.
It also can make use of sparsity and specific graph structure by compiling and setting up a computational graph on the device itself.
Unfortunately, due to the expensive compilation, this structure is fixed and does not allow for conditional computation.</p>
<p>As an overall verdict, it seems that there is no hardware solution for conditional computation on the horizon and we have to stick with heavily parallelizing across machines for the moment.
In that regard, <a href="https://arxiv.org/abs/1811.02084">Mesh-Tensorflow</a>, a novel method to distribute gradient calculation not just across the batch but also across the model, was published at NeurIPS, allowing even larger models to be trained in a distributed fashion.</p>
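<p>The idea of splitting a model (rather than just the batch) across devices can be sketched in a few lines; this is only a toy illustration of column-wise model parallelism, not Mesh-Tensorflow’s actual API.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# A dense layer y = x @ W where W is too large for a single device:
# split W column-wise across "devices". Each shard computes a slice
# of y independently; the slices are concatenated (no reduction needed).
x = rng.normal(size=(8, 512))
W = rng.normal(size=(512, 1024))

n_shards = 4
shards = np.split(W, n_shards, axis=1)  # each device holds a 512 x 256 shard
partials = [x @ w for w in shards]      # computed in parallel, one per device
y = np.concatenate(partials, axis=1)

assert np.allclose(y, x @ W)
```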
<h2 id="continual-learning">Continual Learning</h2>
<p>I have long advocated for the need for deep learning based continual learning systems, i.e. systems that can learn continually from experience and accumulate knowledge that can then be used as prior knowledge when new tasks arise.
As such, they need to be capable of forward transfer, as well as preventing catastrophic forgetting.
The Continual Learning workshop at NeurIPS discussed exactly these issues.
Perhaps these two criteria are incomplete, though: multiple speakers (Mark Ring, Raia Hadsell) suggested a larger list of requirements:</p>
<ul>
<li>forward transfer</li>
<li>backward transfer</li>
<li>no catastrophic forgetting</li>
<li>no catastrophic interference</li>
<li>scalable (fixed memory / computation)</li>
<li>can handle unlabeled task boundaries</li>
<li>can handle drift</li>
<li>no episodes</li>
<li>no human control</li>
<li>no repeatable states</li>
</ul>
<p>In general, it seems to me that there are six categories of approaches to the problem:</p>
<ul>
<li>(partial) replay buffer</li>
<li>generative model that regenerates past experience</li>
<li>slowing down training of important weights</li>
<li>freezing weights</li>
<li>redundancy (bigger networks -> scalability)</li>
<li>conditional computation (-> scalability)</li>
</ul>
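<p>To make the first category concrete, here is a minimal sketch of a fixed-size replay buffer (my own illustration, not a specific paper’s method); reservoir sampling keeps an approximately uniform sample of the stream under a fixed memory budget, which speaks to the scalability criterion above.</p>

```python
import random

class ReplayBuffer:
    """Fixed-capacity buffer over a data stream. Reservoir sampling
    keeps each seen item with equal probability, bounding memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.seen = 0

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / seen.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        # Mix replayed items into each new training batch.
        return random.sample(self.items, min(k, len(self.items)))

buf = ReplayBuffer(capacity=100)
for step in range(10_000):
    buf.add(step)  # in an RL setting: (state, action, reward, next_state)

assert len(buf.items) == 100
```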
<p>None of these approaches handle all aspects of the continual learning list.
Unfortunately, handling all of them at once also appears to be impossible in practice.
There is always a trade-off between transfer and memory / compute, and a trade-off between catastrophic forgetting and transfer / memory / compute.
Thus, it will be hard to purely quantitatively measure the success of an agent.
Instead, we should build benchmarks that test for the qualities we require from our continual learning agents, for instance, the <a href="https://marcpickett.com/cl2018/CL-2018_paper_48.pdf">Starcraft based environment</a> presented at the workshop.</p>
<p>Furthermore, Raia Hadsell argued that Continual Learning involves moving away from learning algorithms that rely on i.i.d. data to learning from a non-stationary distribution.
In particular, humans are good at learning incrementally rather than from i.i.d. data.
Thus, we might be able to unlock a more powerful ML paradigm by moving away from the i.i.d. requirement.</p>
<p>The paper <a href="https://arxiv.org/abs/1810.11910">Continual Learning by Maximizing Transfer and Minimizing Interference</a> showed an interesting connection between REPTILE (a MAML successor) and reducing catastrophic forgetting.
The dot product between gradients of data points drawn from a replay buffer (a term that appears in REPTILE) leads to gradient updates that minimize interference and reduce catastrophic forgetting.</p>
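<p>The quantity at the heart of this connection is easy to compute; the following sketch (my own, deliberately simplified illustration) shows the sign convention: a positive dot product between gradients indicates transfer, a negative one indicates interference.</p>

```python
import numpy as np

def gradient_alignment(grad_new, grad_replay):
    """Dot product between the gradient of a new example and the
    gradient of a replayed example: positive means an update on one
    also helps the other (transfer); negative means it hurts it
    (interference, i.e. catastrophic forgetting)."""
    return float(np.dot(grad_new, grad_replay))

g_new = np.array([1.0, -0.5, 0.2])
g_replay_aligned = np.array([0.8, -0.4, 0.1])
g_replay_conflicting = -g_replay_aligned

assert gradient_alignment(g_new, g_replay_aligned) > 0      # transfer
assert gradient_alignment(g_new, g_replay_conflicting) < 0  # interference
```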
<p>The panel with Marc’Aurelio Ranzato, Richard Sutton, Juergen Schmidhuber, Martha White, and Chelsea Finn was also quite interesting.
It has been argued that we should experiment with lifelong learning in the control setting (if that is what we ultimately care about) instead of supervised and unsupervised learning to prevent any mismatch between algorithm development and actual area of application.
Discount factors, while having useful properties for Bellman-equation based learning, might be problematic for more realistic RL settings.
It was also argued that handling returns with long time-horizons is part of what makes humans smarter than many other species.
Furthermore, any learning, in particular meta-learning, is inherently constrained due to credit assignment.
Thus, developing algorithms with cheap credit assignment is key to building intelligent agents.</p>
<h2 id="meta-learning">Meta-Learning</h2>
<p>Meta-Learning is about modifying the learning algorithm itself.
This may be an outer optimization loop that modifies an inner optimization loop, or in its most universal form a self-referential algorithm that can modify itself.
Many researchers are also concerned with fast adaptation, i.e. forward transfer, to new tasks / environments etc.
This can be viewed as transfer learning, or meta-learning if we consider the initial parameters of a learning algorithm to be part of the learning algorithm.
One of the most recent algorithms by Chelsea Finn, <a href="https://arxiv.org/abs/1703.03400">MAML</a>, sparked great interest in this kind of fast-adaptation algorithm.
This could, for instance, be used for model-based reinforcement learning, where the <a href="https://arxiv.org/abs/1803.11347">model is quickly updated</a> to changing dynamics.</p>
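<p>The mechanics of this kind of fast adaptation can be sketched on a toy regression problem. For brevity, the sketch below uses the first-order REPTILE-style update (move the initialization toward the weights obtained after a few inner SGD steps) rather than MAML’s second-order gradient; the toy task family and all names are my own illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Toy task family: fit y = a * x with a task-specific slope a."""
    a = rng.uniform(-2.0, 2.0)
    x = rng.normal(size=20)
    return x, a * x

def mse_grad(w, x, y):
    # Gradient of the mean squared error of the linear model w * x.
    return np.mean(2.0 * (w * x - y) * x)

def meta_train(meta_steps=200, inner_steps=5, inner_lr=0.1, meta_lr=0.1):
    w = 0.0  # the meta-learned initialization
    for _ in range(meta_steps):
        x, y = sample_task()
        w_task = w
        for _ in range(inner_steps):      # inner loop: adapt to the task
            w_task -= inner_lr * mse_grad(w_task, x, y)
        w += meta_lr * (w_task - w)       # outer loop: move the init
    return w

w0 = meta_train()

# Fast adaptation: a few gradient steps from w0 reduce the loss on a new task.
x, y = sample_task()
w = w0
for _ in range(5):
    w -= 0.1 * mse_grad(w, x, y)
assert np.mean((w * x - y) ** 2) <= np.mean((w0 * x - y) ** 2) + 1e-9
```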
<figure class="text-center">
<img class="figure-img rounded " src="/assets/posts/neurips-2018/evolved-policy-gradients.png" alt="In EPG a loss function optimizes the parameters of a policy using SGD while the parameters of the loss function are evolved." />
<figcaption class="figure-caption">In EPG a loss function optimizes the parameters of a policy using SGD while the parameters of the loss function are evolved.</figcaption>
</figure>
<p>Another interesting idea is to learn differentiable loss functions of the agent’s trajectory and the policy output.
This allows evolving the few parameters of the loss function while training the policy using SGD.
Furthermore, the authors of <a href="https://arxiv.org/abs/1802.04821">Evolved Policy Gradients (EPG)</a> showed that the learned loss functions generalize across reward functions and allow for fast adaptation.
One major issue with EPG is that credit assignment is quite slow:
An agent has to be fully trained using a loss function to obtain an average return (fitness) for the meta-learner.</p>
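<p>The structure of EPG, and why its credit assignment is expensive, can be seen in a deliberately tiny sketch (my own illustration with made-up names, not the paper’s algorithm): the inner loop trains a policy by gradient descent on a parameterized loss, and the outer loop evolves the loss parameters using the true return as fitness. Note that scoring a single loss candidate requires a full inner training run.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def true_return(p):
    # The environment's return; the agent never differentiates through it.
    return -(p - 2.0) ** 2

def train_policy(phi, steps=50, lr=0.2):
    """Inner loop: SGD on the *learned* loss L_phi(p) = (p - phi)^2."""
    p = 0.0
    for _ in range(steps):
        p -= lr * 2.0 * (p - phi)
    return p

def fitness(phi):
    # Credit assignment is slow: one full policy training run is
    # needed just to score a single loss-function candidate.
    return true_return(train_policy(phi))

# Outer loop: a simple evolutionary strategy over the loss parameter phi.
phi = 0.0
for _ in range(100):
    candidates = phi + 0.3 * rng.normal(size=8)
    phi = max(candidates, key=fitness)

assert abs(phi - 2.0) < 0.3  # the evolved loss pulls the policy to high return
```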
<figure class="text-center">
<img class="figure-img rounded " src="/assets/posts/neurips-2018/loss-landscape.png" alt="The loss landscape of a learned optimizer becomes harder to navigate the more update steps are being unrolled.<br/>Left: one-dimensional. Right: two-dimensional. Taken from <a href='https://arxiv.org/abs/1810.10180'>Metz et al</a>" />
<figcaption class="figure-caption">The loss landscape of a learned optimizer becomes harder to navigate the more update steps are being unrolled.<br />Left: one-dimensional. Right: two-dimensional. Taken from <a href="https://arxiv.org/abs/1810.10180">Metz et al</a></figcaption>
</figure>
<p>Another interesting discovery I made during the Meta-Learning workshop is the structure of loss landscapes of meta-learners.
In a paper on <a href="https://arxiv.org/abs/1810.10180">learning optimizers</a>, Luke Metz showed that the loss landscape over the optimizer’s parameters becomes more complex the more update steps are being unrolled.
I suspect that this is a general behavior of meta-learning algorithms: small changes in parameter values can cascade into massive changes in the final performance.
I would be very interested in such an analysis.
In the case of learned optimizers, Luke addressed the issue by smoothing the loss landscape through <a href="https://arxiv.org/abs/1212.4507">Variational Optimization</a>, a principled interpretation of evolutionary strategies.</p>
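<p>The smoothing trick can be written down compactly. The sketch below (my own simplified illustration) replaces a rugged loss <script type="math/tex">L(\theta)</script> with the Gaussian-smoothed objective <script type="math/tex">\mathbb{E}_{\epsilon \sim N(0,I)}[L(\theta + \sigma \epsilon)]</script>, whose gradient can be estimated from function evaluations alone — the evolutionary-strategies estimator that Variational Optimization derives in a principled way.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_grad(loss, theta, sigma=0.1, n_samples=200):
    """Estimate the gradient of E_eps[loss(theta + sigma * eps)] using
    only function values (antithetic sampling reduces variance)."""
    eps = rng.normal(size=(n_samples, theta.size))
    deltas = loss(theta + sigma * eps) - loss(theta - sigma * eps)
    return (deltas[:, None] * eps).mean(axis=0) / (2.0 * sigma)

def rugged_loss(theta):
    # A 1-D loss with fast oscillations; smoothing averages them out.
    return theta[..., 0] ** 2 + 0.1 * np.sin(50.0 * theta[..., 0])

theta = np.array([1.0])
for _ in range(100):
    theta -= 0.1 * smoothed_grad(rugged_loss, theta)

assert abs(theta[0]) < 0.3  # descended close to the smoothed minimum at 0
```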
<h2 id="benchmarks">Benchmarks</h2>
<p>Most current RL algorithms are benchmarked on games or simulators such as ATARI or Mujoco.
These are simple environments that bear little resemblance to the richness our universe exhibits.
One major complaint researchers often voice is that our algorithms are sample-inefficient.
This can be fixed in part by using the existing data more efficiently through off-policy optimization and model-based RL.
However, a large factor is also that our algorithms have no prior experience to draw on in these benchmarks.
We can get around this by handcrafting inductive biases into our algorithms that reflect some kind of prior knowledge but it might be much more interesting to <strong>build environments that allow the accumulation of knowledge</strong> that can be leveraged in the future.
To my knowledge, no such benchmark exists to date.
The <a href="https://github.com/Microsoft/malmo">Minecraft</a> simulator might be closest to such requirements.</p>
<figure class="text-center">
<img class="figure-img rounded " src="/assets/posts/neurips-2018/starcraft.png" alt="The Continual Learning Starcraft environment is a curriculum starting with very simple tasks. Unfortunately, it still contains clear task boundaries and little possibilities for exploration." />
<figcaption class="figure-caption">The Continual Learning Starcraft environment is a curriculum starting with very simple tasks. Unfortunately, it still contains clear task boundaries and little possibilities for exploration.</figcaption>
</figure>
<p>An <strong>alternative</strong> to such rich environments is to build <strong>explicit curriculums</strong> such as the aforementioned <a href="https://marcpickett.com/cl2018/CL-2018_paper_48.pdf">Starcraft environment</a> that consists of a curriculum of tasks.
This is in part also what Shagun Sodhani asks for in his paper <a href="https://arxiv.org/abs/1811.10732">Environments for Lifelong Reinforcement Learning</a>.
Other aspects he puts on his wishlist are:</p>
<ul>
<li>environment diversity</li>
<li>stochasticity</li>
<li>naturality</li>
<li>non-stationarity</li>
<li>multiple modalities</li>
<li>short-term and long-term goals</li>
<li>multiple agents</li>
<li>cause and effect interaction</li>
</ul>
<p>The game engine developer <a href="https://unity3d.com/">Unity3D</a> is also at the forefront of environment development.
It has released a toolkit, <a href="https://unity3d.com/machine-learning">ML-Agents</a>, to train and evaluate agents in environments built with Unity.
One of their new open-ended curriculum benchmarks is the <a href="https://twitter.com/awjuliani/status/1069048401596227584">Obstacle Tower</a>.
In general, a major problem for realistic environment construction is that its requirements differ inherently from those of game design:
To prevent overfitting, it is important that objects in a vast world do not all look alike and thus cannot simply be replicated, as is often done in computer games.
This means for true generalization we require generated or carefully designed environments.</p>
<p>Finally, I believe it might be possible to use computation to generate non-stationary environments instead of building them manually.
For instance, this could be a physics simulator that has similar properties to our universe.
To save compute, we could also start with a simplification based on voxels.
If this simulation exhibits the right properties we might be able to simulate a process similar to evolution, bootstrapping a non-stationary environment that develops many forms of life that interact with each other.
This idea fits nicely with the <a href="https://en.wikipedia.org/wiki/Simulation_hypothesis">simulation hypothesis</a> and has connections to <a href="https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life">Conway’s Game of Life</a>.
One major issue with this approach might be that the resulting complexity has no resemblance to human-known concepts.
Furthermore, the resulting intelligent agents will not be able to transfer to our universe.
Recently, I found out that this idea has been realized in part by Stanley and Clune’s group at UBER in their paper <a href="https://eng.uber.com/poet-open-ended-deep-learning/">POET: Endlessly Generating Increasingly Complex and Diverse Learning Environments</a>.
The environment is non-stationary and can be viewed as an agent itself that maximizes complexity and agent learning progress.
They refer to this concept as open-ended learning, and I recommend reading <a href="https://www.oreilly.com/ideas/open-endedness-the-last-grand-challenge-youve-never-heard-of">this article</a>.</p>
<p>Please cite this work using</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{kirsch2019roadmap,
author = {Kirsch, Louis},
title = {{Updates on the AI road map}},
url = {http://louiskirsch.com/neurips-2018},
year = {2019}
}
</code></pre></div></div>Louis KirschI present an updated roadmap to AGI with four critical challenges: Continual Learning, Meta-Learning, Environments, and Scalability. I motivate the respective areas and discuss how research from NeurIPS 2018 has advanced them and where we need to go next.A Map of Reinforcement Learning2019-01-03T14:00:00+00:002019-01-03T14:00:00+00:00http://louiskirsch.com/maps/reinforcement-learning<p><strong>Reinforcement Learning</strong> promises to solve the problem of <strong>designing intelligent agents</strong> in a formal but simple framework.
At the same time, there exists a large pool of <strong>methods</strong> to optimize an agent’s policy to maximize return such as value-based methods, policy-based methods, imitation learning, and model-based approaches.
These methods themselves have many variants and incremental improvements, mostly driven by a set of major <strong>challenges</strong> in the field of reinforcement learning.
All in all, it is easy to get lost in the large number of publications and subfields of research.</p>
<p>This blog post aims at <strong>tackling this massive quantity of approaches and challenges</strong>, providing an overview of the different challenges researchers are working on and the methods they devised to solve these problems.
This mind map is very far from complete and in large part driven by my interests; if you have any particular suggestions, please let me know!</p>
<p><a href="/assets/posts/map-reinforcement-learning/overview.pdf"><img src="/assets/posts/map-reinforcement-learning/overview.svg" alt="A map of reinforcement learning" /></a></p>
<h1 id="goal">Goal</h1>
<p>What is the goal of Reinforcement Learning?
We have introduced the framework to solve the problem of designing intelligent agents.
It can be further formalized as <em>‘an agent that maximizes reward in a particular environment’</em>.
Or in the context of AGI, Intelligence has been defined as <em>‘The ability to achieve goals in a wide range of environments’</em> by Marcus Hutter and Shane Legg.
Marcus Hutter formalized the optimal universal agent AIXI, as I have described in <a href="/ai/universal-ai">another blog post</a>.</p>
<p><a href="/assets/posts/map-reinforcement-learning/goal.pdf"><img src="/assets/posts/map-reinforcement-learning/goal.svg" alt="The goal of Reinforcement Learning" /></a></p>
<h1 id="methods">Methods</h1>
<p>There is a variety of methods to optimize an agent’s policy.
Here we look at different categories and specific implementations of methods to optimize RL policies.</p>
<p><a href="/assets/posts/map-reinforcement-learning/methods.pdf"><img src="/assets/posts/map-reinforcement-learning/methods.svg" alt="Methods of Reinforcement Learning" /></a></p>
<h1 id="challenges">Challenges</h1>
<p>There is a wide range of challenges that current major algorithms do not handle very well yet.
Thus, many new specialized algorithms have been developed.
To understand where and why specific research is happening, I tried to sort recent research into the respective challenges it addresses.</p>
<p><a href="/assets/posts/map-reinforcement-learning/challenges.pdf"><img src="/assets/posts/map-reinforcement-learning/challenges.svg" alt="Challenges of Reinforcement Learning" /></a></p>Louis KirschReinforcement Learning promises to solve the problem of designing intelligent agents in a formal but simple framework. This blog post aims at tackling the massive quantity of approaches and challenges in Reinforcement Learning, providing an overview of the different challenges researchers are working on and the methods they devised to solve these problems.How to make your ML research more impactful2018-12-12T09:00:00+00:002018-12-12T09:00:00+00:00http://louiskirsch.com/ml-research-with-impact<p>We machine learning researchers all have a very <strong>limited amount of time</strong> to spend on reading research and there is only so few projects we can take on at a time.
Thus, it is paramount to understand what <strong>areas of research excite you</strong> and hold <strong>promise for the future</strong>.
In May 2018, I looked at precisely this question during a course at University College London:</p>
<ul>
<li>What makes ML research impactful?</li>
<li>And what can you do to <strong>increase your impact</strong>?</li>
</ul>
<p>I feel like everyone should at least ponder this question for a bit, or better, <a href="/assets/publications/characteristics_of_ml_research_with_impact.pdf">read my technical report</a> on the topic and/or the summary I will provide in this blog post.</p>
<h1 id="what-can-you-do-to-increase-your-impact">What can you do to increase your impact?</h1>
<p>In my analysis, I focused on the <strong>field of deep learning</strong>.
I studied the co-author network of three important deep learning researchers and their publications.
I looked at the contents of these publications, the context in which these were published, and the changing citation count over time.
Here we interpret <strong>citation count as a metric for impact</strong>, which has its limits, in particular when looking at the effects on society at scale.
In my paper, I also discuss different metrics of impact, but here we focus on citation count as a metric for <strong>impact within the scientific community</strong>.
One might argue that outside of the community a lot of machine learning research is quickly converted to applications by industry.
In the following, I present my most important findings and actionable items.</p>
<figure class="text-center">
<img class="figure-img rounded " src="/assets/posts/ml-research-impact/lstm_citations_years.svg" alt="While the LSTM (or its most successful version) has been published in 1997 it took several years and success in the form applicability for real-world challenges to be widely recognized and used in other research." />
<figcaption class="figure-caption">While the LSTM (or its most successful version) has been published in 1997 it took several years and success in the form applicability for real-world challenges to be widely recognized and used in other research.</figcaption>
</figure>
<p><strong>Demonstration of large-scale practical success.</strong><br />
Many particularly successful papers got the majority of their citations only decades later, after the large-scale practical success of their methods became evident.
In general, it can be said that success depends on the ability of the approach to be scaled up to large problems.
Thus, always try to show that your algorithm works well at large scale!</p>
<p><strong>Focus on novel techniques and large margin improvements.</strong><br />
Small improvements on benchmarks are quickly surpassed.
Working on applications only leads to high impact in the form of citations and adoption in the research community if it uses novel techniques that improve over existing (or non-existing) work by a large margin.</p>
<p><strong>Perseverance.</strong><br />
Because not all ideas will work out and demonstrating large-scale practical success is hard, great ideas in deep learning research often require perseverance for long periods of time.</p>
<p><strong>Do not follow the mainstream.</strong><br />
Backpropagation and LSTMs, as examples of impactful ML research, demonstrate that it was not mainstream research that established itself, but novel ideas that were not generally accepted at the time.</p>
<p><strong>General learning algorithms over applications.</strong><br />
Effort should be focused on general learning algorithms over applications.
None of the impactful papers I investigated solely applied existing techniques to a new application.</p>
<p><strong>Trial and error.</strong><br />
Many publications of famous authors are barely cited; research is trial and error.
Don’t be frustrated!</p>
<p><strong>A good intuition.</strong><br />
The most cited publications are distributed among a very small number of authors.
This suggests that these researchers have a good intuition about which kinds of problems are relevant and what a good solution should look like.
Learn from them!</p>
<h1 id="what-can-the-community-do">What can the community do?</h1>
<p><strong>Introduce a ‘Crazy Work Award’ and other incentives.</strong><br />
Many later very successful ideas were rather unpopular during their inception.
We should encourage research groups, conferences, and journals to nurture crazy and currently unpopular ideas.</p>
<h1 id="conclusions">Conclusions</h1>
<p>There you have it.
I hope we can learn from my findings and produce exceptional work that pushes our field way beyond what is possible today.</p>
<p><strong>A final word of caution.</strong><br />
Please be aware that many of my observations might not generalize well into the future.
Furthermore, while I tried to extract actionable items from my findings, it may very well be that great research is still mostly determined by chance.</p>
<p>Please cite <a href="/assets/publications/characteristics_of_ml_research_with_impact.pdf">my report</a>, this blog post is based on, using</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@techreport{Kirsch2018impact,
author = {Kirsch, Louis},
institution = {University College London},
title = {{Characteristics of Machine Learning Research with Impact}},
year = {2018}
}
</code></pre></div></div>Louis KirschWe machine learning researchers all have a very limited amount of time to spend on reading research and there are only so many projects we can take on at a time. Thus, it is paramount to understand what areas of research excite you and hold promise for the future. I present my analysis of the factors of impactful ML research and how to increase your impact.Theories of Intelligence (2/2): Active Inference2018-08-15T19:45:00+00:002018-08-15T19:45:00+00:00http://louiskirsch.com/ai/active-inference<p>Active Inference is a theoretical framework of perception and action from neuroscience that can explain many phenomena in the brain.
It aims to explain the behavior of cells, organs, animals, humans, and entire species.
Active Inference is structured such that it requires only <em>local</em> computation and plasticity.
In effect, it could be implemented on neural hardware.
Even though Active Inference has wide-reaching potential applications, for instance as an alternative to reinforcement learning, few people outside the neuroscience community are familiar with the framework.
In this blog post, I want to give a machine learning perspective on the framework, omitting many neuroscience details.
As such, this article is geared towards machine learning researchers familiar with probabilistic modeling and reinforcement learning.</p>
<p>
<div class="card">
<div class="card-header">Summary (tldr)</div>
<div class="card-body">
<ul>
<li>Active Inference (in the presented form) relies only on local computation and plasticity</li>
<li>Active Inference supports a hierarchy of spatiotemporal scales (cells, organs, animals, species, etc)</li>
<li>In contrast to (commonly used) Reinforcement Learning approaches
<ul>
<li>no reward has to be specified</li>
<li>exploration is optimally solved and</li>
<li>uncertainty is taken into account</li>
</ul>
</li>
<li>Active Inference relies solely on minimizing variational free energy</li>
<li>Both perception and action become inference (including planning)</li>
<li>Instead of rewards, agents have prior preferences over future observations that enter a prior over policies</li>
<li>These preferences are shaped by model selection (e.g. evolution)</li>
<li>Lower variational free energy corresponds to greater model evidence (e.g. adaptive fitness)</li>
</ul>
</div>
</div>
</p>
<h2 id="the-variational-free-energy">The variational free energy</h2>
<p>Active Inference is based on a single quantity: The variational free energy.
Minimization of the free energy explains both perception and action in any organism, as we will see later.
The variational free energy has its origins in neuroscience but also has a very clean machine learning interpretation, so most machine learning researchers should already be familiar with it.</p>
<figure class="text-center">
<img class="figure-img rounded small" src="/assets/posts/active-inference/latent-variable-model.svg" alt="The variational free energy is often used to learn in such a latent variable model. In this model, \(x\) is an observed variable, while \(y\) is latent." />
<figcaption class="figure-caption">The variational free energy is often used to learn in such a latent variable model. In this model, \(x\) is an observed variable, while \(y\) is latent.</figcaption>
</figure>
<p>If you are entirely unfamiliar with the variational free energy, I would recommend reading <a href="https://blog.evjang.com/2016/08/variational-bayes.html">this blog post</a> before proceeding.
To recap, given a latent variable model as shown in the above figure, we often want to optimize the model evidence</p>
<script type="math/tex; mode=display">P(x) = \int P(x|y) P(y) d y</script>
<p>This integral often has no analytical form, and numerical integration is hard.
Therefore, we resort to optimizing an alternative quantity:
The variational free energy.</p>
<p>We define the <em>negative</em> variational free energy</p>
<script type="math/tex; mode=display">F := \mathbb{E}_{Q(y)}[\log P(x|y)] - KL[Q(y)|P(y)]</script>
<p>where <script type="math/tex">Q</script> is some approximate posterior distribution.
Note that when we refer to the variational free energy, we will often omit that we are talking about the negative free energy <script type="math/tex">F</script>.</p>
<p>To see that <script type="math/tex">F</script> lower-bounds the model likelihood we can rewrite <script type="math/tex">F</script> such that</p>
<script type="math/tex; mode=display">F = \log P(x) - KL[Q(y)|P(y|x)]</script>
<p><script type="math/tex">F</script> is a lower bound because the <script type="math/tex">KL</script>-divergence is always larger or equal to zero.
Thus, we can maximize <script type="math/tex">F</script> instead of <script type="math/tex">\log P(x)</script>.
The bound is tight when</p>
<script type="math/tex; mode=display">KL[Q(y)|P(y|x)] = 0</script>
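<p>To make the bound concrete, here is a minimal numerical sketch (with an invented two-state model): it evaluates the negative free energy for an arbitrary approximate posterior, checks that it lower-bounds the log evidence, and checks that the bound is tight at the exact posterior.</p>

```python
# A numerical check of the bound F <= log P(x) for a toy discrete model:
# a binary latent y and a single fixed observation x (all numbers invented).
import numpy as np

p_y = np.array([0.6, 0.4])          # prior P(y)
p_x_given_y = np.array([0.9, 0.2])  # likelihood P(x|y) for the observed x

p_x = np.sum(p_x_given_y * p_y)     # model evidence P(x)
post = p_x_given_y * p_y / p_x      # exact posterior P(y|x)

def negative_free_energy(q):
    """F = E_Q[log P(x|y)] - KL[Q(y)|P(y)]"""
    return np.sum(q * np.log(p_x_given_y)) - np.sum(q * np.log(q / p_y))

q = np.array([0.5, 0.5])            # some approximate posterior Q(y)
F = negative_free_energy(q)

assert F <= np.log(p_x)             # F lower-bounds the log evidence
# The bound is tight when Q(y) equals the true posterior P(y|x):
assert np.isclose(negative_free_energy(post), np.log(p_x))
```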
<h2 id="the-variational-free-energy-in-the-context-of-perception">The variational free energy in the context of perception</h2>
<p>Active Inference uses the variational free energy both in the context of perception and action.
We begin by describing its role in perception.
What do we mean by the term perception?
Intuitively, an agent observes and understands its environment.
This means the agent can extract the underlying dynamics and causes in order to make sense of its surroundings.
Modeling the environment in this manner helps to predict the past, present, and future, which can be viewed either as an inherent property of any organism or as a tool to act and realize goals (formulated in Active Inference as extrinsic prior beliefs), as we will see later.
Note that the agent can never be certain about the underlying structure, but only updates beliefs based on what it has observed.
A very simplified version of perception is the modeling of these hidden causes and dynamics as a time series latent variable model.</p>
<p>We formalize the task of perception using the following notation:
An agent observes observations <script type="math/tex">o_{1:T}</script> over time <script type="math/tex">t \in 1, \ldots, T</script> and aims to model the underlying dynamics <script type="math/tex">s_{1:T}</script>.
This is visualized in the following graphical model.</p>
<figure class="text-center">
<img class="figure-img rounded small" src="/assets/posts/active-inference/latent-time.svg" alt="The latent variable model for perception of an agent. The agent observes \(o_t\) while the environment states \(s_t\) are latent." />
<figcaption class="figure-caption">The latent variable model for perception of an agent. The agent observes \(o_t\) while the environment states \(s_t\) are latent.</figcaption>
</figure>
<p>Similar to its original definition, the free energy for a specific point in time <script type="math/tex">t</script> then becomes</p>
<script type="math/tex; mode=display">F_t = \underbrace{\mathbb{E}_Q[\log P(o_t|s_t)]}_\text{accuracy} - \underbrace{KL[Q(s_t)|P(s_t|s_{t-1})]}_\text{complexity}</script>
<p>The approximate posterior distribution <script type="math/tex">Q(s_t)</script> now simply describes beliefs about the latent state <script type="math/tex">s_t</script> based on the observation <script type="math/tex">o_t</script>.
The task of perception is then simply optimizing over <script type="math/tex">Q(s_t)</script>.
In other words, updating the beliefs about states <script type="math/tex">s_t</script> such that the observations the agent has made are most likely.</p>
<script type="math/tex; mode=display">Q(s_t) = \arg\max_{Q(s_t)} F_t</script>
<p>The two terms of the free energy are often also called the accuracy <script type="math/tex">\mathbb{E}_Q[\log P(o_t|s_t)]</script> and the complexity <script type="math/tex">KL[Q(s_t)|P(s_t|s_{t-1})]</script>.
Increasing the first term increases the likelihood that the inferred states explain the observations correctly, while the <script type="math/tex">KL</script>-divergence describes how much the beliefs have to deviate from the prior beliefs (and therefore how much information they contain).</p>
<p>So what does it mean to minimize the free energy in the context of perception?
It means to adapt (posterior) beliefs about the environment such that surprise of new observations is minimized.
Thus, the free energy is a measure of surprise.</p>
<p>The careful reader might have noticed that we currently only take into account the observation <script type="math/tex">o_t</script> to infer <script type="math/tex">s_t</script> but neither past nor future.
Also, we have not talked about learning the transitions from <script type="math/tex">s_t</script> to <script type="math/tex">s_{t+1}</script> and the observations these states generate (<script type="math/tex">P(o_t|s_t)</script>).
We will look at these issues more carefully when introducing the general framework of Active Inference.
Furthermore, while we have modeled a single sequence of latent variables <script type="math/tex">s_{1:T}</script> it is possible to stack latent variables in order to build a hierarchy of abstractions (i.e. deep generative models).</p>
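<p>A single perception step can be sketched as follows for a discrete model (all matrices below are invented for illustration). The free-energy-maximizing beliefs are the exact posterior, which makes the bound tight:</p>

```python
# A sketch of one perception step in a discrete model (numbers invented).
# The free-energy-maximizing Q(s_t) is the exact posterior
# Q(s_t) proportional to P(o_t|s_t) P(s_t|s_{t-1}), which makes the bound tight.
import numpy as np

A = np.array([[0.8, 0.1],     # A[o, s] = P(o|s): rows index observations,
              [0.2, 0.9]])    # columns index states
prior = np.array([0.7, 0.3])  # P(s_t|s_{t-1}) given the previous beliefs

o = 0                          # the observation made at time t
q = A[o] * prior               # unnormalized posterior beliefs
q /= q.sum()                   # Q(s_t), the updated beliefs about s_t

def F_t(q):
    """F_t = E_Q[log P(o_t|s_t)] - KL[Q(s_t)|P(s_t|s_{t-1})]"""
    return np.sum(q * np.log(A[o])) - np.sum(q * np.log(q / prior))

# The exact posterior achieves F_t = log P(o_t), the log evidence:
assert np.isclose(F_t(q), np.log(np.sum(A[o] * prior)))
```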
<h2 id="from-the-variational-free-energy-to-active-inference">From the variational free energy to Active Inference</h2>
<p>So far, we have discussed how an agent realizes perception by minimizing the free energy.
We will now extend this free-energy framework to include action, yielding Active Inference.
Whenever we talk about an agent acting in an environment, we usually use the reinforcement learning framework.
Active Inference offers an alternative to this framework, not driving action through reward but through preferred observations.
We will find that action is another result of the minimization of free energy.</p>
<p>Before we jump right in, recall the reinforcement learning framework, as depicted below.</p>
<figure class="text-center">
<img class="figure-img rounded small" src="/assets/posts/active-inference/reinforcement-learning.svg" alt="The conventional reinforcement learning framework to describe the interaction between an environment and agent." />
<figcaption class="figure-caption">The conventional reinforcement learning framework to describe the interaction between an environment and agent.</figcaption>
</figure>
<p>We have an agent defined by its policy <script type="math/tex">\pi</script> that selects the action <script type="math/tex">u_t</script> to take at time <script type="math/tex">t</script> such that the return <script type="math/tex">R = \sum_{\tau > t} r_\tau</script> (the sum of rewards) is maximized.
The optimal action can be derived from Bellman’s Optimality Principle such that the optimal policy takes action <script type="math/tex">u_t</script> that maximizes the value function</p>
<script type="math/tex; mode=display">\begin{gather*}
u_t^* = \arg\max_{u_t} V(s_t, u_t) = \pi(s_t) \\
V(s_t, u_t) = \sum_{s_{t+1}} (r_{t+1} + \max_{u_{t+1}} V(s_{t+1}, u_{t+1})) P(s_{t+1}|s_t, u_t)
\end{gather*}</script>
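<p>For contrast, the Bellman recursion above can be sketched as value iteration on a toy two-state, two-action MDP. All transition probabilities and rewards are invented, and a discount factor is added for convergence (a common variant not present in the equations above):</p>

```python
# Minimal value iteration on a toy 2-state, 2-action MDP (numbers invented),
# illustrating V(s,u) = sum_s' (r(s') + max_u' V(s',u')) P(s'|s,u),
# here with an added discount factor gamma for convergence.
import numpy as np

P = np.array([                  # P[u, s, s'] = P(s'|s, u)
    [[0.9, 0.1], [0.3, 0.7]],   # action u=0
    [[0.2, 0.8], [0.6, 0.4]],   # action u=1
])
r = np.array([0.0, 1.0])        # reward for landing in state s'
gamma = 0.9                     # discount factor (an added assumption)

V = np.zeros(2)                 # state values V(s) = max_u V(s, u)
for _ in range(200):
    # q_values[u, s] = sum_s' P(s'|s,u) (r(s') + gamma * V(s'))
    q_values = np.einsum('usk,k->us', P, r + gamma * V)
    V = q_values.max(axis=0)

policy = q_values.argmax(axis=0)  # greedy policy pi(s) = argmax_u V(s, u)
```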
<p>In Active Inference we will not have to define a reward <script type="math/tex">r</script>, return <script type="math/tex">R</script>, or value function <script type="math/tex">V</script>.
There is a fundamental reason for this: Active Inference treats exploration and exploitation as two sides of the same coin – in terms of choosing actions to minimize surprise or uncertainty.
This brings something quite fundamental to the table; namely, the value function of states has to be replaced with a (free energy) functional of beliefs about states.
Similarly, reward is replaced by (prior) beliefs about preferred, unsurprising, outcomes.
This means we define the agent’s extrinsic preferences using priors on future observations <script type="math/tex">P(o_\tau)</script> with <script type="math/tex">\tau > t</script>
(in the following we will always denote future observations with the index <script type="math/tex">\tau</script> and past observations with <script type="math/tex">\rho</script>).
The agent will then try to follow a policy that realizes these prior expectations of the future (so-called self-evidencing behavior).
At the same time, we prefer observations <script type="math/tex">o_\tau</script> that are likely under our model of the environment.
In effect, both principles are a form of surprise minimization.
Previously, we have identified the minimization of surprise as the minimization of the free energy.
Thus, action can be integrated into our existing free energy model through priors over future observations <script type="math/tex">P(o_\tau)</script>.
We then combine the prior <script type="math/tex">P(o_\tau)</script> with our posterior beliefs <script type="math/tex">Q(s_\tau)</script> to infer the best policy <script type="math/tex">\pi</script> that minimizes our expected surprise.
We describe our posterior belief over the policy we should follow with <script type="math/tex">Q(\pi)</script>.
Using this strategy, we can essentially reduce action to inference through the means of minimizing expected free energy.
This concept is also known as Hamilton’s Principle of Least Action.</p>
<p>The following figure visualizes Active Inference as a principle of self-evidencing.
An agent acts to generate outcomes that fulfill its prior and model expectations (left side).
At the same time, perception ensures that the model of the world is consistent with the observations (right side).</p>
<figure class="text-center">
<img class="figure-img rounded large" src="/assets/posts/active-inference/self-evidencing.svg" alt="Active Inference extends the variational free energy framework with the principle of self-evidencing. A priori expected outcomes are achieved by inferring policies \(Q(\pi)\). The expected free energy \(G(\pi, \tau)\) takes the future and prior preferences over observations into account. The symbol \(\sigma\) denotes the softmax function. We will define the policy dependent free energy \(F(\pi, \rho)\) and expected free energy \(G(\pi, \tau)\) rigorously later." />
<figcaption class="figure-caption">Active Inference extends the variational free energy framework with the principle of self-evidencing. A priori expected outcomes are achieved by inferring policies \(Q(\pi)\). The expected free energy \(G(\pi, \tau)\) takes the future and prior preferences over observations into account. The symbol \(\sigma\) denotes the softmax function. We will define the policy dependent free energy \(F(\pi, \rho)\) and expected free energy \(G(\pi, \tau)\) rigorously later.</figcaption>
</figure>
<p>For the purpose of this blog post, it is very important to make a clear distinction between the free energy <script type="math/tex">F</script> minimized for perception and the expected (future) free energy <script type="math/tex">G</script> that reflects the amount of surprise in the future given a particular policy <script type="math/tex">\pi</script>.
We will later present a precise formalization of <script type="math/tex">G</script>; for now, it suffices to say that it is the expected free energy of the future when a particular policy is followed.
Thus, <script type="math/tex">G</script> is the quantity to optimize in order to realize future preferences <script type="math/tex">P(o_\tau)</script>.
Note that the free energies <script type="math/tex">F</script> and <script type="math/tex">G</script> can, in principle, be reformulated to a single quantity (but we will not further investigate this angle here).</p>
<p>At this point, it is worth noting that <script type="math/tex">G</script> is a very universal quantity.
Parallels can be drawn to the Bayesian surprise, risk-sensitive control, expected utility theory, and the maximum entropy principle.
Having said that, we will not further investigate these similarities because they are not essential for understanding Active Inference.
More details for references on this topic can be found in the last section.</p>
<p>To summarize, we solve perception and action in a unifying framework that minimizes the free energy.
Furthermore, because we will implement action through Bayesian probability theory, we obtain a Bayes optimal agent if no approximations have to be used.
Before we turn to a precise mathematical definition of the different components that make up an Active Inference agent, we present some of the evidence for Active Inference being the fundamental principle underlying any form of living organism.</p>
<h2 id="active-inference-as-a-foundation-for-life">Active Inference as a foundation for life</h2>
<p>We have introduced Active Inference as a unifying framework for perception and action.
The reach of this framework is quite extensive.
It can be used to explain the self-organization of living systems, such as cells, neurons, organs, animals, and even entire species.
Minimizing the free energy via action and perception is the central principle - called the free energy principle (FEP).</p>
<p>According to the FEP, central to any organism is its Markov blanket.
In the statistical sense, a Markov blanket <script type="math/tex">b</script> of states <script type="math/tex">s</script> is the set of random variables that, when conditioned upon, renders <script type="math/tex">s</script> independent of all other variables.
Such a Markov blanket also exists for any living system.
Intuitively, no organism can directly observe or modify its environment.
Any interaction is through its sensors (only reflecting a reduced view on the environment) and its actuators (with limited capabilities to act upon the environment).
It is this boundary that separates its internal states from its external milieu, and without it, the system would cease to exist.
Formally, an organism receives sensory inputs <script type="math/tex">o \in O</script> through its sensory states.
Based on these inputs, it constructs a model of its environment <script type="math/tex">s \in S</script>.
According to this model and in order to realize prior preferences, the organism takes actions <script type="math/tex">u \in \Upsilon</script>, the so-called active states.
The only interaction with its environment <script type="math/tex">\psi \in \Psi</script> is through its Markov blanket consisting of sensory and active states.</p>
<figure class="text-center">
<img class="figure-img rounded " src="/assets/posts/active-inference/markov-boundary.svg" alt="The Markov blanket of any living system, sustained by minimizing the free energy." />
<figcaption class="figure-caption">The Markov blanket of any living system, sustained by minimizing the free energy.</figcaption>
</figure>
<p>If an organism endures in a changing environment it must, by definition, retain its Markov blanket.
When an organism minimizes free energy, it minimizes surprise (of sensory input, i.e. observations).
Because, under ergodic assumptions, the long-term average of surprise is the entropy of the sensory states, retaining its Markov blanket is equivalent to placing an upper bound on this entropy.
Conversely, if the Markov blanket is not maintained, entropy (disorder) of the sensory states diverges and subsequently leads to disintegration and death of the organism.</p>
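<p>The ergodic argument can be illustrated with a toy simulation (the sensory distribution below is invented): the long-run average of the surprise of sampled observations approaches the entropy of the sensory states.</p>

```python
# Toy check of the ergodic claim: the long-run average surprise -log P(o)
# of sampled observations approaches the entropy H[P(o)].
# The sensory distribution below is invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])               # P(o) over three sensory states

samples = rng.choice(3, size=200_000, p=p)   # a long stream of observations
avg_surprise = np.mean(-np.log(p[samples]))  # time average of surprise
entropy = -np.sum(p * np.log(p))             # H[P(o)]

assert abs(avg_surprise - entropy) < 0.01
```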
<p>The concept of Markov blankets can be used to model living systems recursively across spatial scales.
For instance, humans consist of different organs which in turn consist of countless cells.
Each system can be viewed as free-energy-minimizing, maintaining its own Markov blanket.</p>
<figure class="text-center">
<img class="figure-img rounded large" src="/assets/posts/active-inference/recursive-markov-blanket.png" alt="
Markov blankets can be recursively composed across spatial scales.
Here, internal states are denoted as \(\mu\), while \(b := \{u, o\}\) is the Markov blanket.
Reproduced from [<a href='https://www.sciencedirect.com/science/article/pii/S1571064517301409'>Ramstead et. al</a>].
" />
<figcaption class="figure-caption">
Markov blankets can be recursively composed across spatial scales.
Here, internal states are denoted as \(\mu\), while \(b := \{u, o\}\) is the Markov blanket.
Reproduced from [<a href="https://www.sciencedirect.com/science/article/pii/S1571064517301409">Ramstead et. al</a>].
</figcaption>
</figure>
<p>Let’s look at an example.
How can an ensemble of cells form entire organs (a process called morphogenesis)?
Each cell needs to assume a specific function at a specific location for the entire organ to function.
There is no central organization; thus, each cell must infer the function and position of all other cells just from the signals reaching its local Markov blanket.
Active Inference may solve this by assuming that each pluripotential cell starts out with a generative model of the entire ensemble.
Therefore a cell can predict which sensory input it would receive depending on its location in the organ.
Each cell only optimizes its local free energy, but because the generative model of each cell embodies a model of the entire organ, a local free energy minimum of each cell corresponds to the entire ensemble converging to a global free energy minimum.
Crucially, through this optimization, each cell will act upon the environment through <script type="math/tex">u \in \Upsilon</script> to help other cells reach their respective free energy minimum.</p>
<p>Finally, evolution plays a very central role in free energy minimization in biotic systems.
In a changing environment, the Markov blanket will eventually be destroyed, which results in the death of the organism.
Therefore, species have developed the ability to reproduce, effectively transferring genetic and epigenetic information to their descendants.
This information specifies the generative model (including prior preferences) in their descendants.
Crucially, information is not transmitted noise-free and each generation introduces slight variations.
These variations lead to changing generative models and prior preferences that create a selection process among population members.
The adaptive fitness of each organism is reflected in how well the model fits to the niche of the species.
These variations are not the only driving force: each organism can also shape evolution through free energy minimization.
The adaptive pressure mostly depends on the niche of the species, but free energy minimization favors predictable environments that behave according to prior preferences.
Therefore, the niche itself is also shaped through actions that lead to future free energy minimization.</p>
<p>Evolution, therefore, can be seen as a higher-level, more slowly moving, process that defines <em>empirical priors</em> <script type="math/tex">P(o_\tau)</script>.
Similarly, this hierarchy of temporal scales can be constructed analogous to the hierarchy of spatial scales we constructed previously.
In this hierarchy, higher levels treat the preferences of lower layers as outcomes that need to be explained.</p>
<p>Furthermore, because free energy is an extensive property, hierarchical applications of free energy minimization at different spatial and temporal scales mean that there is an interesting circular causality – in which the minimization at one scale (e.g., evolution of a species) both creates and depends on free energy minimization at a lower scale (e.g., econiche construction by the conspecifics of a species).</p>
<h2 id="the-generative-model">The generative model</h2>
<p>We now introduce the generative model for Active Inference in all its mathematical detail.
In order to simplify explanations, we present a specific model for Active Inference that is a special case in some respects and thus is no longer entirely free of assumptions and approximations.
We make the following assumptions:</p>
<ul>
<li>We have several finite discrete random variables for
<ul>
<li>outcomes <script type="math/tex">o_t</script></li>
<li>actions <script type="math/tex">u_t</script></li>
<li>states <script type="math/tex">s_t</script></li>
<li>policies <script type="math/tex">\pi</script> <br />
(The policy itself is a function <script type="math/tex">\pi(t) = u_t</script>, but we have a discrete set of such policies)</li>
</ul>
</li>
<li>The transitions between states are Markovian</li>
</ul>
<p>Furthermore, as described in more detail later, we use variational approximate inference to make the inference process tractable.</p>
<p>The following graphical model shows all the relationships between the variables in our model.
Alongside the latent variables <script type="math/tex">s_t</script> and <script type="math/tex">\pi</script> we have the transition matrix <script type="math/tex">B</script>, the observation probabilities <script type="math/tex">A</script>, and a variable <script type="math/tex">D</script> describing the initial distribution over states <script type="math/tex">s_1</script>.
At the beginning of this blog post, we considered <script type="math/tex">B</script>, <script type="math/tex">A</script>, and <script type="math/tex">D</script> to be fixed.
Furthermore, the matrix <script type="math/tex">U</script> describes prior preferences over future observations <script type="math/tex">P(o_\tau) = \sigma(U_\tau)</script> where <script type="math/tex">\sigma</script> denotes the softmax function.
These prior preferences will be integrated into the expected free energy <script type="math/tex">G(\pi)</script> which in turn defines the prior over the policy <script type="math/tex">\pi</script>.
This relationship is crucial to Active Inference!
We want prior preferences on <script type="math/tex">P(o_\tau)</script> but implement them into our generative model by expressing them as a prior <script type="math/tex">P(\pi|\gamma)</script>.
Finally, <script type="math/tex">\gamma</script> is a temperature parameter that increases or decreases the confidence in the policy <script type="math/tex">\pi</script>.</p>
<figure class="text-center">
<img class="figure-img rounded " src="/assets/posts/active-inference/generative-model.svg" alt="The generative model for Active Inference. Note that the prior probability for a policy \(\pi\) depends on the expected free energy \(G\)." />
<figcaption class="figure-caption">The generative model for Active Inference. Note that the prior probability for a policy \(\pi\) depends on the expected free energy \(G\).</figcaption>
</figure>
<p>The policy <script type="math/tex">\pi</script> differs from reinforcement learning in that it is not a function of the previous state <script type="math/tex">s_{t-1}</script> but only of the current time <script type="math/tex">t</script>.
Essentially, each possible policy describes a trajectory of actions taken.
This is sufficient because the state history <script type="math/tex">s_{1:t}</script> and future expected states <script type="math/tex">s_{\tau}</script> are taken into consideration automatically by minimizing the free energy to achieve preferred outcomes <script type="math/tex">o_\tau</script>.
Note that regarding tractability Friston argues that an agent only entertains a handful of policies at a time.
This selection is optimized through a process further up in the temporal hierarchy (e.g. evolution).
To give an example, the brain has evolved to only consider a limited repertoire of eye movements with a short time horizon.</p>
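<p>To make the notation concrete, here is a sketch that instantiates the components of such a generative model with invented numbers: a likelihood matrix A, action-dependent transitions B, an initial state distribution D, preferences U, and a small discrete set of policies.</p>

```python
# Concrete (invented) instances of the generative model's components for
# 2 states, 2 observations, 2 actions, and a horizon of T = 3.
import numpy as np

n_states, n_obs, n_actions, T = 2, 2, 2, 3

A = np.array([[0.9, 0.2],        # A[o, s] = P(o|s)
              [0.1, 0.8]])
B = np.array([                   # B[u, s', s] = P(s'|s, u)
    [[0.8, 0.3], [0.2, 0.7]],    # action u=0
    [[0.1, 0.6], [0.9, 0.4]],    # action u=1
])
D = np.array([0.5, 0.5])         # D[s] = P(s_1), the initial state prior

U = np.array([[2.0, 0.0],        # U[tau, o]: unnormalized log-preferences
              [2.0, 0.0],        # over future observations
              [2.0, 0.0]])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Prior preferences over future observations, P(o_tau) = softmax(U_tau):
P_o = np.apply_along_axis(softmax, 1, U)

# The agent entertains a discrete set of policies pi(t) = u_t,
# here simply all action sequences of length T:
policies = [(u1, u2, u3) for u1 in range(n_actions)
                         for u2 in range(n_actions)
                         for u3 in range(n_actions)]
```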
<p>Analogous to our introduction of the free energy and perception, we now define the negative variational free energy for this generative model, which we will have to maximize (or minimize the positive free energy):</p>
<script type="math/tex; mode=display">F := \mathbb{E}_Q[\log P(o_{1:t}|x)] - KL[Q(x)|P(x)]</script>
<p>where <script type="math/tex">x</script> are all our latent variables <script type="math/tex">x = (s_{1:t}, \pi, A, B, D, \gamma)</script>.</p>
<p>In order to do inference, we have to define a posterior distribution <script type="math/tex">Q(x)</script>.
We simplify our inference drastically by using a mean field approximation, such that the posterior factors</p>
<script type="math/tex; mode=display">Q(x) = Q(\pi)\, Q(A)\, Q(B)\, Q(D)\, Q(\gamma) \prod_{\rho=1}^{t} Q(s_\rho|\pi)</script>
<p>Each factor can be represented by its sufficient statistics, denoted by the bar <script type="math/tex">\bar x</script></p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
Q(s_\rho|\pi) &= Cat(\bar s_\rho^\pi) \\
Q(\pi) &= Cat(\bar \pi) \\
Q(A) &= Dir(\bar a) \\
Q(B) &= Dir(\bar b) \\
Q(D) &= Dir(\bar d) \\
Q(\gamma) &= \Gamma(1, \bar \beta) \\
\end{align*} %]]></script>
<p>Because our approximate posterior <script type="math/tex">Q(x)</script> factors, we can rewrite our free energy <script type="math/tex">F</script> such that it factors into policy dependent terms <script type="math/tex">F(\pi, \rho)</script> given by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
F &= \mathbb{E}_Q[\log P(o_{1:t}|x)] - KL[Q(x)|P(x)] \\
&= \sum_{\rho < t} F(\pi, \rho) - KL[Q(\pi)|P(\pi)] - KL[Q(\gamma)|P(\gamma)] \\
&\quad - KL[Q(A)|P(A)] - KL[Q(B)|P(B)] - KL[Q(D)|P(D)]
\end{align*} %]]></script>
<p>Our policy dependent free energy then only takes the expectation over a single hidden state <script type="math/tex">s_\rho</script> and conditioned on the policy:</p>
<script type="math/tex; mode=display">F(\pi, \rho) = \mathbb{E}_Q[\log P(o_\rho|s_\rho)] - KL[Q(s_\rho|\pi)|P(s_\rho| s_{\rho - 1}, \pi)]</script>
<p>Having factored our approximate posterior distribution and free energy in this manner, we can use belief propagation (or variational message passing) to calculate the sufficient statistics <script type="math/tex">\bar x</script>.
Based on these, we can finally choose our action.
Though, recall that the prior <script type="math/tex">P(\pi|\gamma)</script> required a quantity called the expected free energy <script type="math/tex">G</script> for each policy, which is what we will define next.</p>
<h3 id="expected-free-energy-for-each-policy">Expected free energy for each policy</h3>
<p>As we have established already, Active Inference requires two kinds of free energy:
The free energy <script type="math/tex">F</script> optimized for perception (i.e. inference), and the expected free energy <script type="math/tex">G</script> for each policy <script type="math/tex">\pi</script> that defines the prior distribution over policies <script type="math/tex">P(\pi)</script>; thereby informing the posterior <script type="math/tex">Q(\pi)</script>.
Remember that we want to pick policies <script type="math/tex">\pi</script> that maximize <script type="math/tex">G</script> (i.e. minimize the expected free energy <script type="math/tex">-G</script>), essentially minimizing expected surprise.
This is directly reflected in the prior <script type="math/tex">P(\pi|\gamma) = \sigma(\gamma \cdot G(\pi))</script>, making <script type="math/tex">\pi</script> more likely for larger <script type="math/tex">G</script>.</p>
<figure class="text-center">
<img class="figure-img rounded " src="/assets/posts/active-inference/expected-free-energy.svg" alt="The prior on \(\pi\) has a special interpretation. It requires a quantity called the expected free energy \(G\) that is based on expected future observations." />
<figcaption class="figure-caption">The prior on \(\pi\) has a special interpretation. It requires a quantity called the expected free energy \(G\) that is based on expected future observations.</figcaption>
</figure>
<p><script type="math/tex">G(\pi)</script> is defined by the path integral over future timesteps <script type="math/tex">\tau > t</script>.
This essentially means that just like the free energy, we factor <script type="math/tex">G</script> across time.
<script type="math/tex">G(\pi)</script> is given by</p>
<script type="math/tex; mode=display">G(\pi) = \sum_{\tau > t} G(\pi, \tau)</script>
<p>where each <script type="math/tex">G(\pi, \tau)</script> is defined by the expected free energy at time <script type="math/tex">\tau</script> if policy <script type="math/tex">\pi</script> is pursued.
Recall that the variational free energy can be written as</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
F &= \mathbb{E}_Q[\log P(o_{1:t}|x)] - KL[Q(x)|P(x)] \\
&= \mathbb{E}_Q[\log P(x, o_{1:t}) - \log Q(x)]
\end{align*} %]]></script>
<p>Because <script type="math/tex">G</script> models the expectations over the future, we define the predictive posterior <script type="math/tex">\tilde Q</script> that now also has observations <script type="math/tex">o_\tau</script> as latent variables.
Basically, we take the last posterior <script type="math/tex">Q(s_t)</script> and recursively apply the transitions <script type="math/tex">B</script> according to policy <script type="math/tex">\pi</script> (and the observation likelihood <script type="math/tex">A</script> to yield observations <script type="math/tex">o_\tau</script>).
Thus, <script type="math/tex">\tilde Q</script> is given by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\tilde Q &= Q(o_\tau, s_\tau | \pi) \\
&= \mathbb{E}_{Q(s_t)}[P(o_\tau, s_\tau| s_t, \pi)]
\end{align*} %]]></script>
<p>The expected variational free energy is now simply the free energy for each policy under the expectation of the predictive posterior <script type="math/tex">\tilde Q</script> instead of the posterior <script type="math/tex">Q</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
G(\pi, \tau) &= \mathbb{E}_{\tilde Q}[\log P(s_\tau, o_\tau|o_{1:t}, \pi) - \log Q(s_\tau|\pi)] \\
&= \mathbb{E}_{\tilde Q}[\log P(s_\tau|o_\tau, o_{1:t}, \pi) + \log P(o_\tau) - \log Q(s_\tau|\pi)]
\end{align*} %]]></script>
<p>The last equation now also makes the role of the prior <script type="math/tex">P(o_\tau)</script> apparent.
We simply interpret the marginal <script type="math/tex">P(o_\tau)</script> as a distribution over the sorts of outcomes the agent prefers.
Through this interpretation, the expected free energy <script type="math/tex">G</script> will be shaped by these preferences.
Because we maximize <script type="math/tex">G(\pi, \tau)</script> we will also maximize the prior probability <script type="math/tex">\log P(o_\tau)</script>, in effect making prior preferences over observations more likely.
We can gain even more intuition from the expected free energy if we rewrite it in an approximate form by simply replacing <script type="math/tex">P(s_\tau|o_\tau, o_{1:t}, \pi)</script> with an approximate posterior <script type="math/tex">Q(s_\tau|o_\tau, \pi)</script>, essentially dropping the dependence on the observation history <script type="math/tex">o_{1:t}</script>.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
G(\pi, \tau) &\approx \underbrace{\mathbb{E}_{\tilde Q}[\log Q(s_\tau|o_\tau, \pi) - \log Q(s_\tau|\pi)]}_\text{epistemic value or information gain} + \underbrace{\mathbb{E}_{\tilde Q}[\log P(o_\tau)]}_\text{extrinsic value}
\end{align*} %]]></script>
<p>This form shows that maximizing <script type="math/tex">G</script> leads to behavior that either learns about the environment (epistemic value, information gain) or maximizes the extrinsic value.
In experiments, it can be observed that the first term dominates initially; later, little information can be gained by exploration, so the extrinsic value is maximized.
Active Inference, therefore, provides a Bayes optimal exploration and exploitation trade-off.</p>
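<p>This decomposition can be sketched numerically for a single future time step (the likelihood matrix, state beliefs, and preferences below are invented): the epistemic term is the expected information gain about the state from the predicted observation, and the extrinsic term is the expected log preference.</p>

```python
# Sketch of the epistemic/extrinsic split of G(pi, tau) for one future step.
# A, the state belief, and the preferences are invented for illustration.
import numpy as np

A = np.array([[0.9, 0.2],            # A[o, s] = P(o|s)
              [0.1, 0.8]])
q_s = np.array([0.5, 0.5])           # Q(s_tau | pi), predicted state belief
log_prefs = np.log([0.8, 0.2])       # log P(o_tau), preferred observations

q_o = A @ q_s                        # predictive Q(o_tau | pi)
q_s_given_o = (A * q_s) / q_o[:, None]   # Q(s_tau | o_tau, pi) by Bayes' rule

# Epistemic value: expected information gain about s_tau from seeing o_tau
epistemic = np.sum(q_o[:, None] * q_s_given_o
                   * np.log(q_s_given_o / q_s))
# Extrinsic value: expected log prior preference of the predicted outcomes
extrinsic = np.sum(q_o * log_prefs)

G = epistemic + extrinsic            # the approximate G(pi, tau)
assert epistemic >= 0                # information gain is never negative
```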
<p>To summarize, we have defined the generative model, the free energy that is minimized for perception, its approximate posterior, and the expected free energy that is used to drive action.</p>
<h2 id="the-algorithm-in-action">The algorithm in action</h2>
<p>We now have all the pieces to put together the algorithm consisting of inference, learning and taking action as depicted in the figure below.
We are only left with how to derive the belief updates and pick an action.
Note that I will list all the belief updates in this section; feel free to skip the exact mathematical details.
I only provide these for a complete understanding of how Active Inference would have to be computed.</p>
<figure class="text-center">
<img class="figure-img rounded small" src="/assets/posts/active-inference/algorithm.svg" alt="The different stages in the Active Inference algorithm." />
<figcaption class="figure-caption">The different stages in the Active Inference algorithm.</figcaption>
</figure>
<h3 id="inference">Inference</h3>
<p>Recall that inference is done by maximizing</p>
<script type="math/tex; mode=display">Q(x) = \arg\max_{Q(x)} F</script>
<p>As mentioned before, we use a mean field approximation to factor our posterior <script type="math/tex">Q(x)</script>.
Our beliefs <script type="math/tex">Q(x)</script> are then updated through belief propagation, iteratively updating our sufficient statistics <script type="math/tex">\bar x</script>.
The belief updates are derived by differentiating the variational free energy <script type="math/tex">F</script> w.r.t. the sufficient statistics and setting the result to zero.
This yields the following update equations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\bar s_\rho^\pi &= \sigma(\log A \cdot o_\rho + \log B^\pi_{\rho -1} \cdot \bar s_{\rho-1}^\pi + \log B_\rho^\pi \cdot \bar s_{\rho + 1}^\pi) \\
\bar \pi &= \sigma(F + \bar\gamma \cdot G) \\
\bar \pi_0 &= \sigma(\bar\gamma \cdot G) \\
\bar \beta &= \beta + (\bar \pi_0 - \bar \pi) \cdot G
\end{align*} %]]></script>
<p>where <script type="math/tex">F</script> and <script type="math/tex">G</script> are vectors for the free energy <script type="math/tex">% <![CDATA[
F(\pi) = \sum_{\rho < t} F(\pi, \rho) %]]></script> and expected free energy <script type="math/tex">G(\pi) = \sum_{\tau > t} G(\pi, \tau)</script> of each policy.</p>
<p><script type="math/tex">F(\pi, \rho)</script> and <script type="math/tex">G(\pi, \tau)</script> can be computed as follows.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
F(\pi, \rho) &= \mathbb{E}_Q[\log P(o_\rho|s_\rho)] - KL[Q(s_\rho|\pi)|P(s_\rho|s_{\rho - 1}, \pi)] \\
&= \bar s_\rho^\pi \cdot (\log A \cdot o_\rho + \log B_{\rho - 1}^\pi \bar s_{\rho - 1}^\pi - \log\bar s_\rho^\pi) \\
\\
G(\pi, \tau) &= -KL[Q(o_\tau|\pi)|P(o_\tau)] - \mathbb{E}_{\tilde Q}[\mathbb{H}[P(o_\tau|s_\tau)]] \\
&= - \underbrace{\bar o_\tau^\pi \cdot (\log\bar o_\tau^\pi - U_\tau)}_\text{risk} - \underbrace{\bar s_\tau^\pi \cdot H}_\text{ambiguity} \\
\bar o_\tau^\pi &= \tilde A \cdot \bar s_\tau^\pi \\
U_\tau &= \log P(o_\tau) \\
H &= -diag(\tilde A \cdot \hat A) \\
\hat A &= \mathbb{E}_Q[\log A] = \psi(\bar a) - \psi(\bar a_0) \qquad \text{where }\psi\text{ is the digamma function}\\
\tilde A &= \mathbb{E}_Q[A] = \bar a \times \bar a_0^{-1} \\
\bar a_{0j} &= \sum_i \bar a_{ij}
\end{align*} %]]></script>
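To make this concrete, here is a minimal NumPy sketch of the one-step expected free energy <script type="math/tex">G(\pi, \tau)</script> above. It assumes a point estimate of the likelihood (so <script type="math/tex">\tilde A = A</script> and <script type="math/tex">\hat A = \log A</script>) and hypothetical inputs; it illustrates the risk and ambiguity terms, not the full belief-propagation scheme:

```python
import numpy as np

def expected_free_energy(A, s_bar, log_prior_o):
    """One-step expected free energy G(pi, tau) = -risk - ambiguity
    (matching the sign convention above).

    A           -- likelihood matrix, A[o, s] = P(o | s)
    s_bar       -- predicted state beliefs Q(s_tau | pi) under the policy
    log_prior_o -- prior preferences over outcomes, U_tau = log P(o_tau)
    """
    o_bar = A @ s_bar                               # predicted outcomes
    risk = o_bar @ (np.log(o_bar) - log_prior_o)    # KL[Q(o|pi) | P(o)]
    H = -np.sum(A * np.log(A), axis=0)              # outcome entropy per state
    ambiguity = s_bar @ H
    return -risk - ambiguity

# Hypothetical 2-outcome / 2-state example with uniform preferences:
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
G = expected_free_energy(A, np.array([0.5, 0.5]), np.log([0.5, 0.5]))
```

With uniform preferences the risk term vanishes here, so the value is dominated by the (negative) ambiguity of the likelihood mapping.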
<h3 id="learning">Learning</h3>
<p>Learning <script type="math/tex">A</script>, <script type="math/tex">B</script>, and <script type="math/tex">D</script> is simply inference as well.
As such, we derive belief updates using the same method as above.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\log A &= \psi(\bar a) - \psi(\bar a_0) & \bar a &= a + \sum_\rho o_\rho \otimes \bar s_\rho \\
\log B &= \psi(\bar b) - \psi(\bar b_0) & \bar b(u) &= b(u) + \sum_{\pi(\rho)=u} \bar \pi_\pi \cdot \bar s_\rho^\pi \otimes \bar s_{\rho - 1}^\pi \\
\log D &= \psi(\bar d) - \psi(\bar d_0) & \bar d &= d + \bar s_1
\end{align*} %]]></script>
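As a sketch, the count-based update for <script type="math/tex">A</script> can be written directly in NumPy. The variable names are hypothetical; the Dirichlet concentration parameters <script type="math/tex">\bar a</script> are accumulated from outer products, and <script type="math/tex">\log A</script> then follows via the digamma function as stated above:

```python
import numpy as np

def update_likelihood_counts(a, outcomes, state_beliefs):
    """Accumulate Dirichlet counts for the likelihood A:
        a_bar = a + sum_rho  o_rho (outer product) s_bar_rho
    where o_rho is a one-hot observation and s_bar_rho the posterior
    state belief at time rho.  log A is then psi(a_bar) - psi(a_bar_0)
    with the digamma function psi."""
    a_bar = np.asarray(a, dtype=float).copy()
    for o, s_bar in zip(outcomes, state_beliefs):
        a_bar += np.outer(o, s_bar)
    return a_bar

# Hypothetical episode: one observation of outcome 0 under an uncertain state.
a0 = np.ones((2, 2))                      # flat Dirichlet prior
a_bar = update_likelihood_counts(a0, [np.array([1.0, 0.0])],
                                      [np.array([0.7, 0.3])])
```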
<h3 id="choosing-action">Choosing action</h3>
<p>Finally, we choose an action simply by marginalizing over the posterior beliefs about policies to form a posterior distribution over the next action.
Generally, in simulating active inference, the most likely (a posteriori) action is selected and a new observation is sampled from the world.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
u_t &= \arg\max_u \mathbb{E}_{q(\pi)}[P(u|\pi)]
\end{align*} %]]></script>
<p>where</p>
<script type="math/tex; mode=display">% <![CDATA[
P(u|\pi) =
\begin{cases}
1, & \text{if }u = \pi(t) \\
0, & \text{otherwise}
\end{cases} %]]></script>
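In code, this action-selection step is a short marginalization. A sketch with hypothetical policies, each represented as a tuple of actions:

```python
import numpy as np

def select_action(policies, policy_posterior, t, num_actions):
    """Marginalize posterior beliefs over policies into a distribution
    over the next action, then pick the most likely one:
        u_t = argmax_u  E_{Q(pi)}[ P(u | pi) ]
    where P(u | pi) = 1 iff policy pi prescribes action u at time t."""
    action_probs = np.zeros(num_actions)
    for pi, q_pi in zip(policies, policy_posterior):
        action_probs[pi[t]] += q_pi
    return int(np.argmax(action_probs))

# Two hypothetical policies disagree at t = 0; the more probable one wins.
policies = [(0, 1), (1, 1)]
q = np.array([0.7, 0.3])
u = select_action(policies, q, t=0, num_actions=2)
```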
<h2 id="conclusion">Conclusion</h2>
<p>We have seen that Active Inference unifies action and perception by minimizing a single quantity - the variational free energy.
Through this simple concept, we have reduced action to inference.
Compared to Reinforcement Learning, no reward has to be specified, exploration is solved in a Bayes-optimal way, and uncertainty is taken into account.
Instead of rewards, agents have prior preferences over observations that are shaped by evolution.
A brief look at Active Inference as the foundation of life showed that the concept has extensive biological applicability.
Furthermore, we presented a model of Active Inference that, while complete, makes several approximations and has limitations.
From a machine learning perspective, future work will have to investigate how this scheme can be extended to long sequences and large discrete or continuous state spaces in a scalable manner.
Additionally, hierarchical internal states or recursive Markov-blanket-based systems are an interesting future research direction.</p>
<p>Please leave a comment down below for discussions, ideas, and questions!</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>I thank Karl Friston and Casper Hesp for valuable feedback and discussions.</p>
<h2 id="learn-more">Learn more</h2>
<p>I hope this blog post gave you an intuitive and condensed perspective on Active Inference.
If you would like to learn more, in particular from the perspective of neuroscience, check out this selection of the original papers:</p>
<table>
<thead>
<tr>
<th>Title</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://www.researchgate.net/publication/325473101_A_Multi-scale_View_of_the_Emergent_Complexity_of_Life_A_Free-energy_Proposal">A Multi-scale View of the Emergent Complexity of Life</a></td>
<td>Easy to read article about the <strong>reach of free energy minimization</strong> (& Active Inference)</td>
</tr>
<tr>
<td><a href="https://www.sciencedirect.com/science/article/pii/S0022249615000759">A tutorial on the free-energy framework for modeling perception and learning</a></td>
<td><strong>Beginner friendly</strong> tutorial on the free energy in the neuroscience context (no Active Inference)</td>
</tr>
<tr>
<td><a href="https://www.youtube.com/watch?v=Y1egnoCWgUg&t=2029s">Youtube video on Active Inference</a></td>
<td>Great <strong>intuitive</strong> introduction to Active Inference</td>
</tr>
<tr>
<td><a href="http://openaccess.city.ac.uk/16683/">Active Inference: A Process Theory</a></td>
<td>Detailed & <strong>mathematical</strong> description of Active Inference</td>
</tr>
</tbody>
</table>
<p><em>Louis Kirsch</em></p>
<p><em>We look at Active Inference, a theoretical formulation of perception and action from neuroscience that can explain many phenomena in the brain. It aims to explain the behavior of cells, organs, animals, humans, and entire species. This article is geared towards machine learning researchers familiar with probabilistic modeling and reinforcement learning.</em></p>
<h1 id="theories-of-intelligence-universal-ai">Theories of Intelligence (1/2): Universal AI</h1>
<p><em>2018-07-12 · <a href="http://louiskirsch.com/ai/universal-ai">http://louiskirsch.com/ai/universal-ai</a></em></p>
<p>Many AI researchers, in particular in the field of deep learning and reinforcement learning, pursue a bottom-up approach.
We reason by analogy to neuroscience and aim to solve current limitations of AI systems on a case-by-case basis.
While this allows for steady progress, it is unclear which properties of an artificial intelligence system are required to be engineered and which may be learned.
In fact, one might argue that any design decision by a human can be outperformed by a learned rule, given enough data.
The less data we have, the more inductive biases we have to build into our systems.
But how do we know which aspects of an artificial system are necessary and which are better left to learn?
I want to give insight into a <strong>theoretical top-down approach to universal artificial intelligence</strong>.
It promises to prove what an <strong>optimal agent for any set of environments</strong> would have to look like and <strong>how intelligence can be measured</strong>.
Furthermore, we will learn about theoretical limits of (non-)computable intelligence.</p>
<p>The post is structured as follows</p>
<ul>
<li>We try to define intelligence</li>
<li>We motivate how Epicurus’ principle and Occam’s razor relate to the problem of intelligence</li>
<li>We introduce optimal sequence prediction using Solomonoff induction</li>
<li>We extend our agent to active environments (Reinforcement Learning paradigm) and show its optimality</li>
<li>We define a formal measure of intelligence</li>
</ul>
<p>This blog post is, in most parts, based on Marcus Hutter’s book <a href="http://www.hutter1.net/ai/uaibook.htm">Universal Artificial Intelligence</a> and Shane Legg’s PhD thesis <a href="http://www.vetta.org/documents/Machine_Super_Intelligence.pdf">Machine Super Intelligence</a>.</p>
<p>
<div class="card">
<div class="card-header">Summary (tldr)</div>
<div class="card-body">
<ul>
<li>Most machine learning tasks can be reduced to sequence prediction in passive or active environments</li>
<li>Sequences, environments, and hypotheses have universal prior probabilities based on Epicurus’ principle and Occam’s razor</li>
<li>The central quantity is the Kolmogorov complexity, a measure of complexity</li>
<li>Using these prior probabilities we can construct universally optimal sequence prediction and agents in active environments</li>
<li>The intelligence of an agent can be formally defined to be its performance in all environments weighted by complexity</li>
<li>We can derive the universal agent AIXI that maximizes this intelligence measure</li>
<li>Any other agent that is better than AIXI in a specific environment will be at least as much worse in some other environment</li>
</ul>
</div>
</div>
</p>
<h2 id="what-is-intelligence">What is intelligence?</h2>
<p>What is intelligence?
While there are many colloquial meanings of intelligence, numerous intelligence tests have been developed that predict future academic or commercial performance very well.
But these tests are limited in scope because they only work for humans and might even be prone to cultural biases.
Additionally, how do we compare the intelligence of animals or machines with humans?
It seems that intelligence is not as easy to define as one might think.
If our goal is to design the most intelligent agent, we had better define the term as precisely and as generally as possible.</p>
<p>Among many definitions of intelligence one can extract two very common aspects:
Firstly, intelligence is seen as a property of an actor interacting with an external environment.
Secondly, intelligence is related to this actor’s ability to succeed, implying the existence of some kind of goal.
It is therefore fair to say that the greater the capability of an actor to reach certain goals in an environment, the greater the individual’s intelligence.
It is also noteworthy that when describing intelligence, the focus is often on adaptation, learning, and experience.
Shane Legg argues that this is an indicator that the true environment in which these goals are pursued is not fully known and needs to be discovered first.
The actor, therefore, needs to quickly learn and adapt in order to perform as well as possible in a wide range of environments.
This leads us to a simple working definition that intuitively covers all aspects of intelligence:</p>
<blockquote>
<p>Intelligence measures an agent’s ability to achieve goals in a wide range of environments.</p>
</blockquote>
<h2 id="universal-artificial-intelligence">Universal Artificial Intelligence</h2>
<p>In the following, we will derive the universal artificial intelligence AIXI.
It will be based on three fundamental principles:
Epicurus’ principle of multiple explanations, Occam’s razor, and Bayes’ rule.
While Epicurus’ principle and Occam’s razor are important to motivate AIXI, none of the proofs rely on them as assumptions.</p>
<p>All intelligent agents will have to make predictions to achieve their goals.
Only if the underlying generating process can be (at least approximately) modeled can the agent determine whether progress toward the goal is being made.
In this context, we model hypotheses <script type="math/tex">h \in \mathcal{H}</script> of the generating process that explain the data.
This may be the dynamics of an environment or the probability distribution over a sequence.
Epicurus’ principle of multiple explanations states</p>
<blockquote>
<p>Keep all hypotheses that are consistent with the data.</p>
</blockquote>
<p>Intuitively, this makes sense because future evidence may increase the likelihood of any current hypothesis.
This also implies that we will have to store the entire previous sequence or interaction history.
Any information that is discarded may be informative in the future.
Bayes’ rule defines how we need to integrate our observations <script type="math/tex">D</script> to yield a posterior probability <script type="math/tex">P(h|D)</script></p>
<script type="math/tex; mode=display">P(h|D) = \frac{P(D|h)P(h)}{P(D)}</script>
<p>While we can find <script type="math/tex">P(D)</script> by marginalization, we are faced with a problem: Where does the prior <script type="math/tex">P(h)</script> come from?
Bayesian statisticians argue that domain knowledge can be used to inform the prior <script type="math/tex">P(h)</script>.
In the absence of such knowledge, one can simply impose a uniform prior (indifference principle).
This works great for finite sets <script type="math/tex">\mathcal{H}</script> but leads to problems with non-finite hypothesis spaces where the use of an improper prior is required.
In addition, the uniform prior highly depends on the problem formulation and is not invariant to reparameterization and group transformations.
A simple example illustrates that:
We could have three hypotheses <script type="math/tex">\mathcal{H}_3 := \{\text{heads biased}, \text{tails biased}, \text{fair}\}</script> for a coin flip.
Alternatively, we can regroup our hypotheses to <script type="math/tex">\mathcal{H}_2 := \{\text{biased}, \text{fair}\}</script>.
Depending on the grouping chosen, the uniform prior will assign a different probability to the hypothesis <script type="math/tex">h_b = \text{biased}</script>.
Occam’s razor will solve this issue. It roughly says</p>
<blockquote>
<p>Among all hypotheses that are consistent with the observations, the simplest is most likely.</p>
</blockquote>
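The grouping dependence of the uniform prior described above can be made concrete with a two-line calculation (hypothetical uniform priors over each hypothesis set):

```python
# Uniform prior over H3 = {heads biased, tails biased, fair}:
p_biased_h3 = 1/3 + 1/3      # "biased" covers two of three hypotheses
# Uniform prior over H2 = {biased, fair}:
p_biased_h2 = 1/2
# The same hypothesis receives a different prior probability:
assert p_biased_h3 != p_biased_h2
```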
<p>Occam’s razor can be motivated from several perspectives.
Empirically, hypotheses that make fewer assumptions are more testable and have led to theories that capture the underlying structure better and thus have better predictive performance.
Any human not adhering to this principle would be judged irrational.
In probability theory, by definition, all assumptions introduce possibilities of error.
If an assumption does not improve the accuracy of a theory, its only effect is to increase the probability that the overall theory is wrong.</p>
<p>There is one additional important observation before we can make use of Occam’s razor:
While we could define a prior on hypotheses <script type="math/tex">P(h)</script> as previously shown, there is an equivalent formulation where we use a prior for sequences <script type="math/tex">P(x)</script>.
Let’s consider our data <script type="math/tex">D</script> to be a (binary) sequence <script type="math/tex">x \in \mathbb{B}^*</script>.
Bayesian probability theory requires predictions over sequence continuations <script type="math/tex">x_{t+1}</script> given <script type="math/tex">x_{1:t}</script> to be weighted, i.e.</p>
<script type="math/tex; mode=display">P(x_{t+1}|x_{1:t}) = \sum_{h \in \mathcal{H}} P(h|x_{1:t}) P(x_{t+1}|x_{1:t}, h)</script>
<p>where <script type="math/tex">P(h|x_{1:t})</script> is the posterior defined by Bayes’ rule, therefore requiring a prior <script type="math/tex">P(h)</script>.
We can rewrite</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
P(x_{t+1}|x_{1:t}) &= \sum_{h \in \mathcal{H}} P(h|x_{1:t}) P(x_{t+1}|x_{1:t}, h) \\
&= \sum_{h \in \mathcal{H}} \frac{P(x_{1:t}|h)P(h)}{P(x_{1:t})} \frac{P(x_{1:t+1}|h)}{P(x_{1:t}|h)} \\
&= \frac{P(x_{1:t+1})}{P(x_{1:t})}
\end{align} %]]></script>
<p>such as that we now require a prior over sequences.
Such a prior over sequences <script type="math/tex">x \in \mathbb{B}^*</script> is what we will have to find.</p>
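To illustrate, here is a toy Bayesian mixture over a hypothetical finite class of Bernoulli hypotheses. The prediction uses exactly the ratio <script type="math/tex">P(x_{1:t+1})/P(x_{1:t})</script> derived above; the priors and parameters are made up for the example:

```python
def mixture_prob(x, priors, thetas):
    """P(x) = sum_h P(h) P(x | h) for a binary sequence x, where each
    hypothesis is a Bernoulli parameter theta with P(bit = 1) = theta."""
    ones = sum(x)
    return sum(p * th**ones * (1 - th)**(len(x) - ones)
               for p, th in zip(priors, thetas))

def predict_next(x, priors, thetas):
    """P(x_{t+1} = 1 | x_{1:t}) = P(x_{1:t} followed by 1) / P(x_{1:t})."""
    return mixture_prob(list(x) + [1], priors, thetas) / mixture_prob(x, priors, thetas)

# After seeing 1, 1, 1 the prediction leans toward the theta = 0.8 hypothesis.
p_next = predict_next([1, 1, 1], priors=[0.5, 0.5], thetas=[0.2, 0.8])
```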
<p>So far, we have shown the relevance of finding the right prior probability distribution which may either be expressed over hypothesis space or sequence space.
Furthermore, we motivated why Epicurus’ principle of multiple explanations and Occam’s razor may help us to solve this issue.</p>
<h3 id="solomonoffs-prior-and-kolmogorov-complexity">Solomonoff’s prior and Kolmogorov complexity</h3>
<p>Next, in order to solve our problem with prior probabilities, we will formalize a so-called universal prior probability for sequences, hypotheses, and environments based on Epicurus’ principle of multiple explanations and Occam’s razor.
We begin with the simplest case, directly defining a prior over sequences.</p>
<p>Let’s extend the sequence prediction problem we briefly introduced.
Let <script type="math/tex">\mu</script> be the true generating distribution over an infinite sequence <script type="math/tex">\omega \in \mathbb{B}^{\infty}</script>.
Sequence prediction is quite a powerful framework because any supervised prediction problem can be reduced to it.
The task of predicting <script type="math/tex">\omega_{t+1}</script> after having seen <script type="math/tex">\omega_{1:t}</script> is known as induction.</p>
<p>Occam’s razor is formalized by Solomonoff’s prior probability for (binary) sequences.
Intuitively, a sequence <script type="math/tex">x \in \mathbb{B}^*</script> is more likely if it is generated by many short programs <script type="math/tex">p</script> fed to a universal Turing Machine <script type="math/tex">U</script>.
You can think of the program <script type="math/tex">p</script> as a description of the sequence <script type="math/tex">x</script>.
And the Turing Machine <script type="math/tex">U</script> executes your program <script type="math/tex">p</script> to generate <script type="math/tex">x</script>.
If you can describe <script type="math/tex">x</script> with few instructions (say bits <script type="math/tex">l(p)</script>) it must be simpler, and therefore more likely.
We thus simply describe this probability with <script type="math/tex">2^{-l(p)}</script>.
Also, if there are many explanations for <script type="math/tex">x</script>, we just sum these probabilities.</p>
<p>We can now formally put these intuitions together:
We say Turing Machine <script type="math/tex">U</script> generates <script type="math/tex">x</script> if <script type="math/tex">x</script> is a prefix of the output of <script type="math/tex">U(p)</script>.
For this prefix with any continuation, we write <script type="math/tex">x*</script>.
Solomonoff’s prior probability that a sequence begins with binary string <script type="math/tex">x \in \mathbb{B}^*</script> is given by</p>
<script type="math/tex; mode=display">M(x) := \sum_{p:U(p)=x*} 2^{-l(p)}</script>
<p>where <script type="math/tex">l(p)</script> is the length of program <script type="math/tex">p</script> and <script type="math/tex">U</script> is a prefix universal Turing Machine.
The fact that we use a prefix universal Turing Machine is a technicality.
Such a Turing Machine ensures that no valid program for <script type="math/tex">U</script> is a prefix of any other.
In this definition <script type="math/tex">2^{-l(p)}</script> can be interpreted as the probability that program <script type="math/tex">p</script> is sampled uniformly from all possible programs, halving the probability for each additional bit.</p>
<p>To summarize, Solomonoff’s prior probability assigns sequences generated by shorter programs higher probability.
We say sequences generated by shorter programs are less complex.</p>
<p>We can further formalize the complexity of a sequence.
Instead of taking into account all possible programs <script type="math/tex">p</script> that could generate a sequence <script type="math/tex">x</script> we just report the shortest program.
Formally, the Kolmogorov complexity of an infinite sequence <script type="math/tex">\omega \in \mathbb{B}^\infty</script> is the length of the shortest program that produces <script type="math/tex">\omega</script>, given by</p>
<script type="math/tex; mode=display">K(\omega) := \min_{p \in \mathbb{B}^*}\{l(p): U(p) = \omega\}</script>
<p>Similarly, the Kolmogorov complexity over finite strings <script type="math/tex">x \in \mathbb{B}^*</script> is the length of the shortest program that outputs <script type="math/tex">x</script> and then halts.</p>
<p>We have seen how to formalize Occam’s razor to derive a prior probability distribution.
In principle, we could now directly apply Solomonoff’s universal prior probability <script type="math/tex">M(x)</script> to perform induction.
But it will be useful to define an alternative variant that takes our hypotheses into account.</p>
<h3 id="solomonoff-levin-prior">Solomonoff-Levin prior</h3>
<p>Another way to look at the problem is to define a universal prior probability over hypotheses.
It will allow us to define an alternative prior over sequences known as the Solomonoff-Levin prior probability.</p>
<p>We start by defining our hypothesis space <script type="math/tex">M_e</script> over possible distributions and assume that the true generating distribution <script type="math/tex">\mu \in M_e</script>.</p>
<script type="math/tex; mode=display">M_e := v_1, v_2, v_3, \ldots</script>
<p>Each <script type="math/tex">v_i \in M_e</script> is a so-called enumerable probability semi-measure.
For now, it is sufficient to know that such a measure assigns each prefix sequence <script type="math/tex">x \in \mathbb{B}^*</script> of <script type="math/tex">\omega</script> a probability.
Therefore, with such a measure we can describe a distribution over sequences.
In order to make sure that <script type="math/tex">\mu \in M_e</script> we will pick a really large set of hypotheses.
In technical terms, <script type="math/tex">M_e</script> is a computable enumeration of enumerable probability semi-measures.
If you are interested in the mathematical details, refer to the optional section.</p>
<p>The index <script type="math/tex">i</script> can be used as a representation of the semi-measure.
To assign hypothesis <script type="math/tex">v_i</script> a probability, we can make use of the Kolmogorov complexity again!
If we can describe <script type="math/tex">i</script> with a short program <script type="math/tex">p</script>, <script type="math/tex">v_i</script> must be simple.
Formally, we define the Kolmogorov complexity for each of the <script type="math/tex">v_i \in M_e</script> as</p>
<script type="math/tex; mode=display">K(v_i) := \min_{p \in \mathbb{B}^*} \{ l(p): U(p) = i \}</script>
<p>where <script type="math/tex">v_i</script> is the <script type="math/tex">i^{th}</script> element in our enumeration.</p>
<p>This gives us all the tools to define a prior probability for hypotheses!
We just assign simpler hypotheses larger probabilities, prescribed by the Kolmogorov complexity from above.
Each hypothesis <script type="math/tex">v \in M_e</script> is assigned a universal algorithmic prior probability</p>
<script type="math/tex; mode=display">P_{M_e}(v) := 2^{-K(v)}</script>
<p>Finally, we construct a mixture to define a prior over the space of sequences, arriving at the Solomonoff-Levin prior probability of a binary sequence beginning with string <script type="math/tex">x \in \mathbb{B}^*</script></p>
<script type="math/tex; mode=display">\xi(x) := \sum_{v \in M_e} P_{M_e}(v) v(x)</script>
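Since <script type="math/tex">K(v)</script> is incomputable, any implementation must approximate it. The following sketch substitutes hand-chosen description lengths for <script type="math/tex">K(v_i)</script> over a tiny, hypothetical hypothesis class of Bernoulli distributions, just to show the shape of the mixture <script type="math/tex">\xi</script> (the prior is normalized here for convenience, unlike the unnormalized universal prior):

```python
import numpy as np

thetas   = [0.5, 0.25, 0.125]     # Bernoulli hypotheses v_i
desc_len = [1, 2, 3]              # assumed stand-ins for K(v_i), in bits
weights  = np.array([2.0 ** -k for k in desc_len])
weights /= weights.sum()          # normalize the toy prior

def xi(x):
    """xi(x) = sum_v P(v) v(x) for a binary string x."""
    ones = sum(x)
    return sum(w * th ** ones * (1 - th) ** (len(x) - ones)
               for w, th in zip(weights, thetas))
```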
<p>
<div class="card">
<div class="card-header clearfix collapse-header">
<h4 class="float-left">
<a data-toggle="collapse" href="#probability-measures,-enumerable-functions,-and-computable-enumerations-content" aria-expanded="true" aria-controls="probability-measures,-enumerable-functions,-and-computable-enumerations-content" id="probability-measures,-enumerable-functions,-and-computable-enumerations" class="d-block">
Probability measures, enumerable functions, and computable enumerations
</a>
</h4>
<i class="float-right">optional section</i>
</div>
<div id="probability-measures,-enumerable-functions,-and-computable-enumerations-content" class="collapse" aria-labelledby="probability-measures,-enumerable-functions,-and-computable-enumerations">
<div class="card-body">
<p>In order to derive AIXI we will require three concepts that may not be familiar to the reader:
Probability (semi-)measures, enumerable functions, and computable enumerations.
We denote a sequence <script type="math/tex">xy</script> as the concatenation of <script type="math/tex">x \in \mathbb{B}^*</script> and <script type="math/tex">y \in \mathbb{B}^*</script>.</p>
<p>A probability measure is a function over binary sequences <script type="math/tex">v: \mathbb{B}^* \to [0, 1]</script> such that</p>
<script type="math/tex; mode=display">v(\epsilon) = 1, \forall x \in \mathbb{B}^*: v(x) = v(x0) + v(x1)</script>
<p>where <script type="math/tex">\epsilon</script> is the empty string.
In other words, we assign each finite binary sequence <script type="math/tex">x \in \mathbb{B}^*</script> (which could be a prefix of <script type="math/tex">\omega</script>) a probability such that they consistently add up.</p>
<p>A generalization is the probability semi-measure</p>
<script type="math/tex; mode=display">v(\epsilon) \leq 1, \forall x \in \mathbb{B}^*: v(x) \geq v(x0) + v(x1)</script>
<p>that can be normalized to a probability measure.</p>
<p>Intuitively, a function is enumerable if it can be progressively approximated from below with a Turing machine in finite time.
Conversely, a function is co-enumerable if it can be progressively approximated from above with a Turing machine in finite time.</p>
<p>We stated that we’d like to enumerate enumerable semi-measures.
Note that an enumeration is <em>different</em> from an enumerable function.
What is a computable enumeration?
Basically, we want to enumerate all semi-measures with a Turing Machine in a computable manner.
We cannot directly output probability measures using our Turing Machine and therefore use its index <script type="math/tex">i</script> as a description that can then be used to approximate <script type="math/tex">v_i(x)</script> from below.
More precisely, there exists a Turing Machine <script type="math/tex">T</script> that for any enumerable semi-measure <script type="math/tex">v \in M_e</script> there exists an index <script type="math/tex">i \in \mathbb{N}</script> such that</p>
<script type="math/tex; mode=display">\forall x \in \mathbb{B}^*: v(x) = v_i(x) := \lim_{k \to \infty}T(i, k, x)</script>
<p>with <script type="math/tex">T</script> increasing in <script type="math/tex">k</script>.
In other words, we have a Turing Machine <script type="math/tex">T</script> and for a given sequence <script type="math/tex">x</script> and index <script type="math/tex">i</script> we approximate the value of the semi-measure <script type="math/tex">v_i</script> for <script type="math/tex">x</script> from below.
The argument <script type="math/tex">k</script> intuitively describes how exact this approximation shall be.
In the limit of <script type="math/tex">\lim_{k \to \infty}</script> we will yield the exact value <script type="math/tex">v_i(x)</script>.</p>
<p>Why can the index <script type="math/tex">i</script> be used to define the Kolmogorov complexity for each hypothesis <script type="math/tex">v_i \in M_e</script>?
The more complex <script type="math/tex">i</script> is, the more complex is the input to the above Turing machine <script type="math/tex">T</script>.
For a given <script type="math/tex">k</script> and <script type="math/tex">x</script> the complexity of the output can therefore only depend on <script type="math/tex">i</script>.</p>
</div>
</div>
</div>
</p>
<h3 id="solomonoff-induction">Solomonoff induction</h3>
<p>We now have two formulations for prior distributions over binary sequences:
The Solomonoff prior <script type="math/tex">M</script> and the Solomonoff-Levin prior <script type="math/tex">\xi</script>.
Both priors can be shown to only be a multiplicative constant away from each other, i.e. <script type="math/tex">M \overset{\times}{=} \xi</script>.
We will need the second formulation <script type="math/tex">\xi</script> to construct our universal agent AIXI.
But for now, let’s focus on the sequence prediction problem.</p>
<p>By the definition of conditional probability our prediction problem reduces to</p>
<script type="math/tex; mode=display">\xi(\omega_{t+1}|\omega_{1:t}) = \frac{\xi(\omega_{1:t+1})}{\xi(\omega_{1:t})}</script>
<p>Because we used <script type="math/tex">\xi</script> as our universal prior, this is known as Solomonoff induction for sequence prediction.</p>
<p>Hooray, we’ve done it! But what do we gain from all this?
Given this scheme of induction, how good is the predictor <script type="math/tex">\xi(\omega_{t+1}|\omega_{1:t})</script> relative to the true <script type="math/tex">\mu(\omega_{t+1}|\omega_{1:t})</script>?
It turns out that as long as <script type="math/tex">\mu</script> can be described by a computable distribution (which is a very minor restriction) our estimator <script type="math/tex">\xi</script> will converge rapidly to the true <script type="math/tex">\mu</script>, in effect solving the problem optimally!</p>
<p>
<div class="card">
<div class="card-header clearfix collapse-header">
<h4 class="float-left">
<a data-toggle="collapse" href="#convergence-of-solomonoff-induction-content" aria-expanded="true" aria-controls="convergence-of-solomonoff-induction-content" id="convergence-of-solomonoff-induction" class="d-block">
Convergence of Solomonoff induction
</a>
</h4>
<i class="float-right">optional section</i>
</div>
<div id="convergence-of-solomonoff-induction-content" class="collapse" aria-labelledby="convergence-of-solomonoff-induction">
<div class="card-body">
<p>We can measure the relative performance of <script type="math/tex">\xi</script> to <script type="math/tex">\mu</script> using the prediction error <script type="math/tex">S_t</script> given by</p>
<script type="math/tex; mode=display">S_t = \sum_{x \in \mathbb{B}^{t-1}} \mu(x)(\xi(x_t = 0|x) - \mu(x_t = 0|x))^2</script>
<p>Consider the set of computable probability measures <script type="math/tex">M_c \subset M_e</script>.
It can be shown that for any <script type="math/tex">\mu \in M_c</script> the sum of all prediction errors is bounded by a constant</p>
<script type="math/tex; mode=display">\sum_{t=1}^{\infty} S_t \leq \frac{\ln 2}{2} K(\mu)</script>
<p>Due to this bound, it follows that the estimator <script type="math/tex">\xi</script> rapidly converges for any <script type="math/tex">\mu</script> that can be described by a computable distribution (see <a href="https://arxiv.org/abs/0709.1516">Hutter 2007</a> for what ‘rapid’ describes).</p>
</div>
</div>
</div>
</p>
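This convergence can be illustrated empirically with a toy finite hypothesis class that contains the true distribution; the class and prior below are hypothetical stand-ins for the incomputable universal mixture:

```python
import numpy as np

rng = np.random.default_rng(0)
thetas = np.array([0.1, 0.3, 0.5, 0.7, 0.9])  # Bernoulli hypothesis class
posterior = np.full(5, 0.2)                   # toy prior (stands in for 2^-K)
true_theta = 0.7

total_sq_err = 0.0
for t in range(2000):
    pred = posterior @ thetas                 # mixture prediction for the next bit
    total_sq_err += (pred - true_theta) ** 2
    bit = rng.random() < true_theta           # sample from the true distribution
    posterior = posterior * (thetas if bit else 1 - thetas)
    posterior /= posterior.sum()
# The cumulative squared error stays bounded and the prediction
# approaches the true parameter, mirroring the bound above.
```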
<h3 id="active-environments">Active environments</h3>
<p>So far we only dealt with passive environments in the form of sequence predictions.
We will now extend the framework to active environments using the Reinforcement Learning framework as depicted below.</p>
<figure class="text-center">
<img class="figure-img rounded " src="/assets/posts/universal-ai/rl-framework.svg" alt="The Reinforcement Learning framework." />
<figcaption class="figure-caption">The Reinforcement Learning framework.</figcaption>
</figure>
<p>We define an environment to be the tuple <script type="math/tex">(A, O, R, \mu)</script>:</p>
<ul>
<li><script type="math/tex">A</script> is the set of all actions our agent can take</li>
<li><script type="math/tex">O</script> is the set of observations the agent can receive</li>
<li><script type="math/tex">R</script> is the set of possible rewards</li>
<li><script type="math/tex">\mu</script> is the environment’s transition probability measure (as defined in the following)</li>
</ul>
<p>We express the concatenation of observation and reward as <script type="math/tex">x := or</script>.
The index <script type="math/tex">t</script> of a sequence <script type="math/tex">ax_t</script> refers to both action <script type="math/tex">a</script> and input <script type="math/tex">x</script> at time <script type="math/tex">t</script>.
Then <script type="math/tex">\mu</script> is simply the probability measure over transitions <script type="math/tex">% <![CDATA[
\mu(x_t|ax_{<t}a_t) %]]></script>.
Depending on the design objective of the agent, we have to specify two ingredients to yield a well-defined, optimally informed agent.
Firstly, a communication stream of rewards needs to be specified.
This could be something like pain or reward signals similar to what humans experience.
Secondly, we need to specify a temporal preference.
This can be done in two ways.
We may specify a discount factor <script type="math/tex">\gamma_t</script> directly, for instance, a geometrically decreasing discount factor as commonly seen in the RL-literature:</p>
<script type="math/tex; mode=display">\forall i: \gamma_i := \alpha^i \qquad \text{for} \quad \alpha \in (0, 1)</script>
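<p>As a quick numeric sanity check (the value of <script type="math/tex">\alpha</script> is an arbitrary illustrative choice), geometric discounting has the closed-form tail sum <script type="math/tex">\sum_{i=t}^{\infty} \gamma_i = \alpha^t / (1 - \alpha)</script>, a quantity that reappears later as the normalizer <script type="math/tex">\Gamma_t</script>:</p>

```python
alpha = 0.9  # arbitrary illustrative choice

def Gamma(t, horizon=10000):
    # Truncated tail sum Gamma_t = sum_{i=t}^{inf} alpha^i
    return sum(alpha ** i for i in range(t, horizon))

# Verify against the closed form alpha^t / (1 - alpha)
for t in (0, 1, 5):
    closed_form = alpha ** t / (1 - alpha)
    assert abs(Gamma(t) - closed_form) < 1e-6
```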
<p>Alternatively, we require that the total reward is bounded, therefore directly specifying our preferences in the stream of rewards.
We call this set of environments reward-summable environments <script type="math/tex">\mathbb{E} \subset E</script>.</p>
<p>It is a known result that the optimal agent in environment <script type="math/tex">\mu</script> with discounting <script type="math/tex">\gamma</script> would then be defined using the Bellman equation</p>
<script type="math/tex; mode=display">% <![CDATA[
\pi^\mu := \arg\max_\pi V_\gamma^{\pi\mu} \\
V_\gamma^{\pi\mu}(ax_{<t}) = \sum_{ax_t} [\gamma_t r_t + V_\gamma^{\pi\mu}(ax_{1:t})]\overset{\pi}{\mu}(ax_t|ax_{<t}) \\ %]]></script>
<p>where <script type="math/tex">\overset{\pi}{\mu}</script> is the probability distribution jointly over policy and environment, e.g.</p>
<script type="math/tex; mode=display">\overset{\pi}{\mu}(ax_{1:2}) := \pi(a_1) \mu(x_1|a_1) \pi(a_2|ax_1) \mu(x_2|ax_1a_2)</script>
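<p>This factorization directly prescribes how an interaction history is generated: actions are drawn from the policy, percepts from the environment, each conditioned on the history so far. A minimal sketch with toy stand-ins for <script type="math/tex">\pi</script> and <script type="math/tex">\mu</script> (both are illustrative, not part of the formal setup):</p>

```python
import random

def sample_history(policy, env, steps, seed=0):
    # Generate a_1 x_1 a_2 x_2 ... exactly as the factorization
    # prescribes: a_t ~ pi(. | history), x_t ~ mu(. | history, a_t).
    rng = random.Random(seed)
    history = []
    for _ in range(steps):
        a = policy(history, rng)
        x = env(history, a, rng)
        history += [a, x]
    return history

# Toy stand-ins: a uniform random policy over two actions and a
# deterministic environment that emits 1 iff the action repeats.
policy = lambda history, rng: rng.choice([0, 1])
env = lambda history, a, rng: int(len(history) >= 2 and a == history[-2])

history = sample_history(policy, env, steps=3)
```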
<p>If the environment <script type="math/tex">\mu</script> was known we could directly infer the optimal action <script type="math/tex">a_t^{\pi^\mu}</script> in the <script type="math/tex">t^{th}</script> step</p>
<script type="math/tex; mode=display">% <![CDATA[
a_t^{\pi^\mu} := \arg\max_{a_t} \lim_{m \to \infty} \sum_{x_t} \max_{a_{t+1}} \sum_{x_{t+1}} \ldots \max_{a_m} \sum_{x_m} [\gamma_t r_t + \ldots + \gamma_m r_m] \mu(x_{t:m}|ax_{<t}a_{t:m}) %]]></script>
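<p>With <script type="math/tex">\mu</script> known, this expectimax can be unrolled directly over a finite horizon <script type="math/tex">m</script>. Below is a minimal sketch; the two-action toy environment, the convention that the percept doubles as the reward, and the chosen horizon are all illustrative assumptions:</p>

```python
def expectimax(mu, history, t, m, gamma):
    # Max over actions of the expected discounted reward-to-go
    # E[gamma_t r_t + ... + gamma_m r_m] for known dynamics mu.
    if t > m:
        return 0.0
    best = float('-inf')
    for a in (0, 1):               # max over actions a_t
        value = 0.0
        for x in (0, 1):           # expectation over percepts x_t
            p = mu(history, a, x)  # mu(x_t | history, a_t)
            if p > 0.0:
                r = x              # toy convention: the percept is the reward
                value += p * (gamma(t) * r
                              + expectimax(mu, history + [a, x], t + 1, m, gamma))
        best = max(best, value)
    return best

# Toy dynamics: percept 1 arrives with probability 0.9 iff a_t == 1.
mu = lambda history, a, x: 0.9 if x == a else 0.1
v = expectimax(mu, [], t=1, m=3, gamma=lambda t: 0.9 ** t)
```

<p>For these toy dynamics the optimal choice is always action 1, giving <script type="math/tex">\sum_{t=1}^{3} 0.9^t \cdot 0.9 = 2.1951</script>.</p>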
<p>But it is easy to see that this agent does not fulfill our intelligence definition yet.
The agent needs to cope with many different environments and learn from them; therefore, it cannot have <script type="math/tex">\mu</script> hardcoded.
We will take the agent from above and replace the environment dynamics <script type="math/tex">\mu</script> with the generalized universal prior distribution <script type="math/tex">\xi</script>.</p>
<p>Again, we will have hypotheses, this time over possible environment probability measures.
Formally, our environment hypothesis space will now be the set of all enumerable chronological semi-measures</p>
<script type="math/tex; mode=display">E := \{v_1, v_2, \ldots\}</script>
<p>We call these measures chronological because our sequence of interactions with the environment has a time component.
Again, we define the universal prior probability of a chronological environment <script type="math/tex">v \in E</script> using the Kolmogorov complexity</p>
<script type="math/tex; mode=display">P_E(v) := 2^{-K(v)}</script>
<p>where <script type="math/tex">K(v)</script> is the length of the shortest program that computes the environment’s index.
We yield the prior over the agent’s observations by constructing another mixture</p>
<script type="math/tex; mode=display">\xi(x_{1:n}|a_{1:n}) := \sum_{v\in E} 2^{-K(v)} v(x_{1:n}|a_{1:n})</script>
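<p>Restricted to a finite hypothesis class with known description lengths, the mixture and its Bayesian update are easy to write down. A sketch in which hand-picked "program lengths" stand in for <script type="math/tex">K(v)</script> and the hypotheses are i.i.d. for simplicity (all illustrative assumptions):</p>

```python
def mixture_posterior(hypotheses, lengths, observations):
    # Weights 2^{-K(v)} * v(x_{1:n}), renormalized after observing x_{1:n}.
    # hypotheses: per-symbol probability functions v(x) (i.i.d. sketch)
    # lengths:    stand-ins for the Kolmogorov complexities K(v)
    weights = []
    for v, k in zip(hypotheses, lengths):
        w = 2.0 ** -k
        for x in observations:
            w *= v(x)
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

# Two i.i.d. binary hypotheses: a fair coin (short program) and a
# biased coin (longer program, so a larger complexity penalty).
fair = lambda x: 0.5
biased = lambda x: 0.9 if x == 1 else 0.1
posterior = mixture_posterior([fair, biased], [3, 7], [1] * 20)
```

<p>After twenty 1s in a row, the biased hypothesis dominates the mixture despite its larger prior penalty: the complexity prior only delays, but never prevents, convergence to a hypothesis the data supports.</p>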
<p>By replacing <script type="math/tex">\mu</script> with <script type="math/tex">\xi</script> (and using conditional probability) we finally yield the universal AIXI agent <script type="math/tex">\pi^\xi</script>:</p>
<p>
<div class="card">
<div class="card-header">Universal AIXI agent</div>
<div class="card-body">
<script type="math/tex; mode=display">% <![CDATA[
a_t^{\pi^\xi} := \arg\max_{a_t} \lim_{m \to \infty} \sum_{x_t} \max_{a_{t+1}} \sum_{x_{t+1}} \ldots \max_{a_m} \sum_{x_m} [\gamma_t r_t + \ldots + \gamma_m r_m] \xi(x_{t:m}|ax_{<t}a_{t:m}) %]]></script>
</div>
</div>
</p>
<h3 id="convergence">Convergence</h3>
<p>Can we show that <script type="math/tex">\xi</script> converges to the true environment <script type="math/tex">\mu</script> in the same way we showed that our universal prior <script type="math/tex">\xi</script> converges for sequence prediction?
Indeed, we can, but with a strong limitation: The interaction history <script type="math/tex">% <![CDATA[
ax_{<t} %]]></script> must come from <script type="math/tex">\pi^\mu</script>.
Of course, <script type="math/tex">\mu</script> is unknown and this is not feasible.
Even worse, it can be shown that it is in general impossible to match the performance of <script type="math/tex">\pi^\mu</script>.
Different from the sequence prediction problem, possibly irreversible interaction with the environment is necessary.
To give a very simple example:
Let’s imagine an environment where an agent has to pick between two doors.
One door leads to hell (very little reward), the other to heaven (plenty of reward).
Crucially, after having chosen one door, the agent cannot return.
But because the agent does not have any prior knowledge about what is behind the doors, there is no optimal behavior that guarantees that heaven is chosen.
Conversely, if <script type="math/tex">\mu</script> was known to the agent, it could infer which door leads to larger rewards.
We will investigate this problem again in a later section.
For now, let’s at least investigate whether we are as optimal as any agent can be in such a situation.</p>
<p>Luckily, there is a key theorem proven by Hutter that says exactly that.
Let <script type="math/tex">\pi^\zeta</script> be the equivalent of agent <script type="math/tex">\pi^\xi</script> defined over <script type="math/tex">\mathcal{E} \subseteq E</script> instead of <script type="math/tex">E</script>, then we have</p>
<p>
<div class="card">
<div class="card-header">Pareto optimality of AIXI [<a href="http://www.hutter1.net/ai/uaibook.htm">proof in Section 5.5</a>]</div>
<div class="card-body">
<p>For any <script type="math/tex">\mathcal{E} \subseteq E</script> the agent <script type="math/tex">\pi^\zeta</script> is Pareto optimal.</p>
<p>An agent <script type="math/tex">\pi</script> is Pareto optimal if there is no other agent <script type="math/tex">\rho</script> such that</p>
<script type="math/tex; mode=display">% <![CDATA[
\forall \mu \in \mathcal{E} \quad \forall ax_{<t}: V_\gamma^{\rho\mu}(ax_{<t}) \geq V_\gamma^{\pi\mu}(ax_{<t}) %]]></script>
<p>with strict inequality for at least one <script type="math/tex">\mu</script>.</p>
</div>
</div>
</p>
<p>In other words, there exists no agent that is at least as good as <script type="math/tex">\pi^\zeta</script> in all environments in <script type="math/tex">\mathcal{E}</script>, and strictly better in at least one.
An even stronger result can be shown:
<script type="math/tex">\pi^\zeta</script> is also balanced Pareto optimal which means that any increase in performance in some environment due to switching to another agent is compensated for by an equal or greater decrease in performance in some other environment.</p>
<p>Great!
But even if we cannot construct a better agent, for what kind of environments is our agent <script type="math/tex">\pi^\xi</script> guaranteed to converge to the optimal performance of <script type="math/tex">\pi^\mu</script>?
Intuitively, the agent needs to be given time to learn about its environment <script type="math/tex">\mu</script> to reach optimal performance.
But due to irreversible interaction, not all environments permit this kind of behavior.
We formalize this concept of convergence by defining self-optimizing agents and categorizing environments into the ones admitting self-optimizing agents, and the ones that do not.
An agent <script type="math/tex">\pi</script> is self-optimizing in an environment <script type="math/tex">\mu</script> if,</p>
<script type="math/tex; mode=display">% <![CDATA[
\frac{1}{\Gamma_t}V_\gamma^{\pi\mu}(ax_{<t}) \to \frac{1}{\Gamma_t} V_{\gamma}^{\pi^\mu\mu}(\hat a \hat x_{<t}) \qquad \text{where} \quad \Gamma_t := \sum_{i=t}^{\infty} \gamma_i %]]></script>
<p>with probability 1 as <script type="math/tex">t \to \infty</script>.
The interaction histories <script type="math/tex">% <![CDATA[
ax_{<t} %]]></script> and <script type="math/tex">% <![CDATA[
\hat a \hat x_{<t} %]]></script> are sampled from <script type="math/tex">\pi</script> and <script type="math/tex">\pi^\mu</script>, respectively.
We require normalization <script type="math/tex">\frac{1}{\Gamma_t}</script> because un-normalized expected future discounted reward always converges to zero.
Intuitively, the above statement says that with high probability the performance of a self-optimizing agent <script type="math/tex">\pi</script> converges to the performance of the optimal agent <script type="math/tex">\pi^\mu</script>.
Furthermore, it can be proven that if there exists a sequence (the agent might be non-stationary) of self-optimizing agents <script type="math/tex">\pi_m</script> for a class of environments <script type="math/tex">\mathcal{E}</script>, then the agent <script type="math/tex">\pi^\zeta</script> is also self-optimizing for <script type="math/tex">\mathcal{E}</script> [<a href="http://www.hutter1.net/ai/uaibook.htm">proof in Hutter 2005</a>].
This is a promising result because, while not all environments admit self-optimizing agents, we can rest assured that for those that do, <script type="math/tex">\pi^\zeta</script> is self-optimizing.</p>
<p>For the commonly used Markov Decision Process (MDP) it can be shown that if it is ergodic, then it admits self-optimizing agents.
We say that an MDP environment <script type="math/tex">(A, X, \mu)</script> is ergodic iff there exists an agent <script type="math/tex">(A, X, \pi)</script> such that <script type="math/tex">\overset{\pi}{\mu}</script> defines an <a href="https://math.dartmouth.edu/archive/m20x06/public_html/Lecture15.pdf">ergodic Markov chain</a>.
Other self-optimizing environments are summarized in the following figure.</p>
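<p>Whether a given policy and MDP induce an ergodic Markov chain can be checked numerically: a finite chain is ergodic iff some power of its transition matrix has all entries strictly positive (irreducible and aperiodic). A minimal sketch; the two-state chains are illustrative:</p>

```python
def is_ergodic(P):
    # A finite chain is ergodic iff some power of its transition matrix
    # is strictly positive; n^2 powers suffice (Wielandt's bound is n^2 - 2n + 2).
    n = len(P)
    Q = [row[:] for row in P]
    for _ in range(n * n):
        if all(p > 0 for row in Q for p in row):
            return True
        # Q <- Q @ P (plain-Python matrix multiply)
        Q = [[sum(Q[i][k] * P[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    return False

# Induced chains of toy two-state MDPs under some fixed policy.
assert is_ergodic([[0.5, 0.5], [0.2, 0.8]])      # irreducible and aperiodic
assert not is_ergodic([[0.0, 1.0], [1.0, 0.0]])  # periodic, hence not ergodic
```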
<figure class="text-center">
<img class="figure-img rounded" style="max-height:500px;" src="/assets/posts/universal-ai/environments.png" alt="Taxonomy of environments" />
<figcaption class="figure-caption">Taxonomy of environments and whether they allow self-optimizing agents. Reproduced from [<a href="http://www.vetta.org/documents/Machine_Super_Intelligence.pdf">1</a>].</figcaption>
</figure>
<p>It is important to note that being self-optimizing only tells us about the performance in the limit, not how quickly the agent will learn to perform well.
While we were able to upper-bound the error by a constant in the case of sequence prediction, such a general result is impossible for active environments.
Further research is required to prove convergence results for different classes of environments.
Due to the Pareto optimality of AIXI, we can at least be confident that no other agent could converge more quickly in all environments.</p>
<h3 id="what-about-exploration">What about exploration?</h3>
<p>You may wonder why the RL exploration problem does not appear in our universal agent.
The answer is that exploration is optimally encoded in our universal prior <script type="math/tex">\xi</script>.
All possible environments are being considered, weighted by their complexity, such that actions are considered not just with respect to the optimal policy in a single environment, but all environments.</p>
<h2 id="measuring-intelligence">Measuring intelligence</h2>
<p>Up until now, we derived the universal agent AIXI based on our working definition of intelligence.</p>
<blockquote>
<p>Intelligence measures an agent’s ability to achieve goals in a wide range of environments.</p>
</blockquote>
<p>In the following, we reverse the process and aim to define intelligence itself mathematically such that it permits us to measure intelligence universally for humans, animals, machines or any other arbitrary forms of intelligence.
This may even allow us to directly optimize for the intelligence of our agents instead of surrogates such as spreading genes, survival or any other objective we specify as designers.</p>
<p>We have already derived a pretty powerful toolbox to define such a formal measure of intelligence.
We formalize the wide range of environments and any goal that could be defined in these with the class of reward-summable environments <script type="math/tex">\mathbb{E}</script>.
The ability of the agent <script type="math/tex">\pi</script> to achieve these goals is then just its value function <script type="math/tex">V_\mu^\pi</script>.
We arrive at the following very simple formulation:</p>
<p>
<div class="card">
<div class="card-header">Formal measure of intelligence</div>
<div class="card-body">
<p>The universal intelligence of an agent <script type="math/tex">\pi</script> is its expected performance with respect to the universal distribution <script type="math/tex">2^{-K(\mu)}</script> over the space of all computable reward-summable environments <script type="math/tex">\mathbb{E} \subset E</script>, that is,</p>
<script type="math/tex; mode=display">\Upsilon(\pi) := \sum_{\mu \in \mathbb{E}} 2^{-K(\mu)} V_\mu^\pi</script>
</div>
</div>
</p>
<p>This very simple measure of intelligence is remarkably close to our working definition of intelligence!
All we have added to our informal definition is a preference over environments we care about in the form of the prior probability <script type="math/tex">2^{-K(\mu)}</script>.</p>
<p>How does this measure relate to AIXI?
By the linearity of expectation the above definition is just the expected future discounted return <script type="math/tex">V_\xi^\pi</script> under the universal prior probability distribution <script type="math/tex">\xi</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
V_\xi^\pi &= V_\xi^\pi(\epsilon) \\
&= \sum_{ax_1}[r_1 + V_\xi^\pi(ax_1)]\pi(a_1)\xi(x_1|a_1) \\
&= \sum_{ax_1}[r_1 + V_\xi^\pi(ax_1)]\pi(a_1)\sum_{\mu \in \mathbb{E}} 2^{-K(\mu)} \mu(x_1|a_1) \\
&= \sum_{\mu \in \mathbb{E}} 2^{-K(\mu)} \sum_{ax_1}[r_1 + V_\xi^\pi(ax_1)] \pi(a_1) \mu(x_1|a_1) \\
&= \sum_{\mu \in \mathbb{E}} 2^{-K(\mu)} V_\mu^\pi
\end{align} %]]></script>
<p>We have previously seen that AIXI maximizes <script type="math/tex">V_\xi^\pi</script>, therefore, by construction, the upper bound on universal intelligence is given by <script type="math/tex">\pi^\xi</script></p>
<script type="math/tex; mode=display">\overline\Upsilon := \max_\pi \Upsilon(\pi) = \Upsilon(\pi^\xi)</script>
<h2 id="practical-considerations">Practical considerations</h2>
<p>While the definition of universal artificial intelligence is an interesting theoretical endeavor on its own, we would like to construct algorithms that are as close as possible to this optimum.
Unfortunately, Kolmogorov complexity and therefore <script type="math/tex">\pi^\xi</script> are not computable.
In fact, AIXI is only enumerable as summarized in the table below.
This means that while we can make arbitrarily good approximations, we cannot devise an <script type="math/tex">\epsilon</script>-approximation because the upper bound of the function value is unknown.</p>
<table>
<thead>
<tr>
<th>Quantity</th>
<th>C*</th>
<th>E*</th>
<th>Co-E*</th>
<th>Comment</th>
<th>Other properties</th>
</tr>
</thead>
<tbody>
<tr>
<td>Kolmogorov complexity</td>
<td></td>
<td></td>
<td>x</td>
<td></td>
<td></td>
</tr>
<tr>
<td>\(v \in E\)</td>
<td></td>
<td>x</td>
<td></td>
<td>By definition of enumeration \(E\) of all enumerable chronological semi-measures</td>
<td></td>
</tr>
<tr>
<td>\(v \in \mathbb{E}\)</td>
<td>x</td>
<td></td>
<td></td>
<td>By definition of enumeration \(\mathbb{E}\) of all computable chronological semi-measures</td>
<td></td>
</tr>
<tr>
<td>\(\xi(x_{1:t}|a_{1:t})\) <br /> \(P_E(v) := 2^{-K(v)}\)</td>
<td></td>
<td>x</td>
<td></td>
<td>Follows from Kolmogorov complexity</td>
<td>\(\xi \in E\)</td>
</tr>
<tr>
<td>\(\pi^\xi\), \(V_\xi^\pi\), \(\Upsilon(\pi)\)</td>
<td></td>
<td>x</td>
<td></td>
<td>Follows from \(\xi\)</td>
<td></td>
</tr>
</tbody>
</table>
<p>* C = Computable, E = Enumerable, Co-E = Co-Enumerable</p>
<p>In more detail, what are the parts that make <script type="math/tex">\Upsilon(\pi)</script> of an agent <script type="math/tex">\pi</script> (and therefore <script type="math/tex">\pi^\xi</script>) impossible to compute?</p>
<ul>
<li>Computation of Kolmogorov complexity <script type="math/tex">K(v)</script> for <script type="math/tex">v \in \mathbb{E}</script></li>
<li>Sum over the non-finite set of chronological semi-measures (environments) <script type="math/tex">\sum_{v \in \mathbb{E}}</script></li>
<li>Computation of the value function <script type="math/tex">V_v^\pi</script> over an infinite time horizon (assuming infinite episode length)</li>
</ul>
<p>To estimate the Kolmogorov complexity, techniques such as compression can be used.
AIXI could be Monte-Carlo approximated by sampling many environments (programs) and approximating the infinite sum over environments by a program length weighted finite sum over environments.
Similarly, we would have to limit the time horizon to compute the estimated discounted reward.</p>
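<p>This approximation strategy can be sketched in a few lines: keep a finite set of candidate environments, weight each by <script type="math/tex">2^{-\ell(v)}</script> where <script type="math/tex">\ell(v)</script> stands in for the program length, and score actions by their discounted return up to a finite horizon. The environment models, their lengths, and the fixed-action rollout are all toy assumptions of the sketch:</p>

```python
def approx_value(action, envs, horizon, alpha=0.9):
    # Finite, length-weighted stand-in for the AIXI mixture:
    # sum_v 2^{-len(v)} * sum_{t=1..horizon} alpha^t * E_v[r_t | action],
    # with the action held fixed over the rollout for simplicity.
    total, norm = 0.0, 0.0
    for env in envs:
        w = 2.0 ** -env['length']  # stand-in for 2^{-K(v)}
        ret = sum(alpha ** t * env['reward'](action)
                  for t in range(1, horizon + 1))
        total += w * ret
        norm += w
    return total / norm

# Two toy environment models with stand-in description lengths.
envs = [
    {'length': 4, 'reward': lambda a: 1.0 if a == 'left' else 0.0},
    {'length': 9, 'reward': lambda a: 1.0 if a == 'right' else 0.2},
]
best = max(['left', 'right'], key=lambda a: approx_value(a, envs, horizon=10))
assert best == 'left'  # the shorter (more probable) environment dominates
```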
<p>Other approximations and related approaches are HL(<script type="math/tex">\lambda</script>), AIXItl, Fitness Uniform Optimisation, the Speed prior, the Optimal Ordered Problem Solver and the Gödel Machine (Hutter, Legg, Schmidhuber, and others).</p>
<h2 id="conclusion">Conclusion</h2>
<p>Many interesting discussions emerge:
In the context of artificial intelligence by evolution, we aim to simulate evolution in order to produce intelligent agents.
Common approaches, inspired by evolution, optimize the spread of genes and survival instead of intelligence itself.
If the goal is intelligence over survival, can we directly (approximately) optimize for the intelligence of an agent?
Furthermore, which model should be chosen for <script type="math/tex">\pi^\xi</script>?
Any Turing Machine equivalent representation would be valid, be it neural networks or any symbolic programming language.
Of course, in practice, this choice makes a significant difference.
How are we going to make these design decisions?
Shane Legg argues that neuroscience is a good source for inspiration.</p>Louis KirschI give insight into a theoretic top-down approach of universal artificial intelligence. This may allow directing our future research by theoretical guidance, avoiding to handcraft properties of a system that may be learned by an intelligent agent. Furthermore, we will learn about theoretical limits of (non-)computable intelligence and introduce a universal measure of intelligence.Simple hyperparameter and architecture search in tensorflow with ray tune2018-06-20T11:00:00+00:002018-06-20T11:00:00+00:00http://louiskirsch.com/ai/ray-tune<p>In a previous <a href="/ai/population-based-training">blog post</a> I have shown a very clean method on how to implement efficient hyperparameter search in tensorflow from scratch.
I presented <a href="/ai/population-based-training#population-based-training-pbt">population-based training</a>, an evolutionary method that allows cheap and adaptive hyperparameter search: hyperparameters are changed during training, instead of training to convergence before the resulting performance statistics can inform the choice of hyperparameters.
That said, the vanilla version I presented had two major downsides: training was not distributed out of the box, and the graph had to be constructed upfront so that it could be reused across parameter settings.</p>
<p>In this blog post we want to look at the distributed computation framework <a href="https://github.com/ray-project/ray">ray</a> and its little brother <a href="http://ray.readthedocs.io/en/latest/tune.html">ray tune</a> that allow distributed and easy to implement hyperparameter search.
It not only supports population-based training but also other hyperparameter search algorithms.
Ray and ray tune support any autograd package, including tensorflow and PyTorch.</p>
<figure class="text-center">
<img class="figure-img rounded" style="max-height:200px;" src="/assets/posts/ray-tune/preview.png" alt="The architecture of ray tune" />
<figcaption class="figure-caption">The architecture of ray tune.</figcaption>
</figure>
<h2 id="setting-up-ray">Setting up ray</h2>
<p>Install ray on all your machines</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install ray
</code></pre></div></div>
<p>If you only have a single machine at your disposal (it will make use of all local CPU and GPU resources), initialize ray with</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">ray</span>
<span class="n">ray</span><span class="o">.</span><span class="n">init</span><span class="p">()</span>
</code></pre></div></div>
<p>Otherwise, pick a machine to be the head node</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ray start <span class="nt">--head</span> <span class="nt">--redis-port</span><span class="o">=</span>6379
</code></pre></div></div>
<p>and connect your other instances</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ray start <span class="nt">--redis-address</span><span class="o">=</span>HEAD_HOSTNAME:6379
</code></pre></div></div>
<p>Finally, run your code on any node</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">ray</span>
<span class="n">ray</span><span class="o">.</span><span class="n">init</span><span class="p">(</span><span class="n">redis_address</span><span class="o">=</span><span class="s">'HEAD_HOSTNAME:6379'</span><span class="p">)</span>
</code></pre></div></div>
<p>Having done that, we are ready to distribute tasks such as hyperparameter tuning.
More information on how to set up a cluster can be found <a href="http://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html">here</a>.</p>
<h2 id="implementing-your-model">Implementing your model</h2>
<p>We will need to implement a model using the following skeleton.
You should be able to plug in your existing tensorflow code easily:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">ray.tune</span> <span class="k">as</span> <span class="n">tune</span>
<span class="k">class</span> <span class="nc">Model</span><span class="p">:</span>
<span class="k">pass</span>  <span class="c"># TODO implement</span>
<span class="k">class</span> <span class="nc">MyTrainable</span><span class="p">(</span><span class="n">tune</span><span class="o">.</span><span class="n">Trainable</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">_setup</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="c"># Load your data</span>
<span class="bp">self</span><span class="o">.</span><span class="n">data</span> <span class="o">=</span> <span class="o">...</span>
<span class="c"># Setup your tensorflow model</span>
<span class="c"># Hyperparameters for this trial can be accessed in dictionary self.config</span>
<span class="bp">self</span><span class="o">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">data</span><span class="p">,</span> <span class="n">hyperparameters</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">config</span><span class="p">)</span>
<span class="c"># To save and restore your model</span>
<span class="bp">self</span><span class="o">.</span><span class="n">saver</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">train</span><span class="o">.</span><span class="n">Saver</span><span class="p">()</span>
<span class="c"># Start a tensorflow session</span>
<span class="bp">self</span><span class="o">.</span><span class="n">sess</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">_train</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="c"># Run your training op for n iterations</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">training_op</span><span class="p">)</span>
<span class="c"># Report a performance metric to be used in your hyperparameter search</span>
<span class="n">validation_loss</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">validation_loss</span><span class="p">)</span>
<span class="k">return</span> <span class="n">tune</span><span class="o">.</span><span class="n">TrainingResult</span><span class="p">(</span><span class="n">timesteps_this_iter</span><span class="o">=</span><span class="n">n</span><span class="p">,</span> <span class="n">mean_loss</span><span class="o">=</span><span class="n">validation_loss</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_stop</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">sess</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="c"># This function will be called if a population member</span>
<span class="c"># is good enough to be exploited</span>
<span class="k">def</span> <span class="nf">_save</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">checkpoint_dir</span><span class="p">):</span>
<span class="n">path</span> <span class="o">=</span> <span class="n">checkpoint_dir</span> <span class="o">+</span> <span class="s">'/save'</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">saver</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">sess</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">global_step</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">_timesteps_total</span><span class="p">)</span>
<span class="c"># Population members that perform very well will be</span>
<span class="c"># exploited (restored) from their checkpoint</span>
<span class="k">def</span> <span class="nf">_restore</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">checkpoint_path</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">saver</span><span class="o">.</span><span class="n">restore</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">sess</span><span class="p">,</span> <span class="n">checkpoint_path</span><span class="p">)</span>
</code></pre></div></div>
<p>A new Trainable will be instantiated and executed on an available GPU in your cluster / on your machine for each trial or population member (each having their own hyperparameters) in your population-based training.</p>
<h2 id="setting-up-ray-tune">Setting up ray tune</h2>
<p>Next, register your trainable and specify your experiments</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">tune</span><span class="o">.</span><span class="n">register_trainable</span><span class="p">(</span><span class="s">'MyTrainable'</span><span class="p">,</span> <span class="n">MyTrainable</span><span class="p">)</span>
<span class="n">train_spec</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'run'</span><span class="p">:</span> <span class="s">'MyTrainable'</span><span class="p">,</span>
<span class="c"># Specify the number of CPU cores and GPUs each trial requires</span>
<span class="s">'trial_resources'</span><span class="p">:</span> <span class="p">{</span><span class="s">'cpu'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s">'gpu'</span><span class="p">:</span> <span class="mi">1</span><span class="p">},</span>
<span class="s">'stop'</span><span class="p">:</span> <span class="p">{</span><span class="s">'timesteps_total'</span><span class="p">:</span> <span class="mi">20000</span><span class="p">},</span>
<span class="c"># All your hyperparameters (variable and static ones)</span>
<span class="s">'config'</span><span class="p">:</span> <span class="p">{</span>
<span class="s">'batch_size'</span><span class="p">:</span> <span class="mi">20</span><span class="p">,</span>
<span class="s">'units'</span><span class="p">:</span> <span class="mi">100</span><span class="p">,</span>
<span class="s">'l1_scale'</span><span class="p">:</span> <span class="k">lambda</span> <span class="n">cfg</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="mf">1e-5</span><span class="p">,</span> <span class="mf">1e-3</span><span class="p">),</span>
<span class="s">'learning_rate'</span><span class="p">:</span> <span class="n">tune</span><span class="o">.</span><span class="n">grid_search</span><span class="p">([</span><span class="mf">1e-3</span><span class="p">,</span> <span class="mf">1e-4</span><span class="p">])</span>
<span class="o">...</span>
<span class="p">},</span>
<span class="c"># Number of trials</span>
<span class="s">'repeat'</span><span class="p">:</span> <span class="mi">4</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The entry <em>‘repeat’</em> describes the number of trials / population members.
Each trial will sample its <em>‘config’</em> from the specification above (i.e. using the predefined values or running the specified function).
The instruction <code class="highlighter-rouge">tune.grid_search</code> will multiply the number of trials by the number of elements it is given, effectively running a grid search.</p>
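<p>To make the trial expansion concrete, here is a plain-Python sketch of how such a specification could resolve into concrete trial configs. This is a simplified illustration of the behavior, not ray tune's actual implementation; treating a plain list as a grid axis is a convention of the sketch:</p>

```python
import itertools
import random

def resolve_configs(config, repeat):
    # Expand each list-valued axis into a grid, then create
    # `repeat` trials per grid point; callables are sampled per trial.
    grid_keys = [k for k, v in config.items() if isinstance(v, list)]
    grids = [config[k] for k in grid_keys]
    trials = []
    for combo in itertools.product(*grids):
        for _ in range(repeat):
            trial = {}
            for key, value in config.items():
                if key in grid_keys:
                    trial[key] = combo[grid_keys.index(key)]
                elif callable(value):
                    trial[key] = value(trial)  # evaluated per trial
                else:
                    trial[key] = value         # static hyperparameter
            trials.append(trial)
    return trials

spec = {
    'batch_size': 20,
    'l1_scale': lambda cfg: random.uniform(1e-5, 1e-3),
    'learning_rate': [1e-3, 1e-4],  # grid axis: multiplies the trial count
}
trials = resolve_configs(spec, repeat=4)
assert len(trials) == 8  # 2 grid values x 4 repeats
```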
<p>Finally, we have to define the kind of hyperparameter tuning we would like to perform and start our experiments.
This is how it works for population-based training:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">ray.tune.schedulers</span> <span class="kn">import</span> <span class="n">PopulationBasedTraining</span>  <span class="c"># import path in recent ray versions</span>
<span class="n">pbt</span> <span class="o">=</span> <span class="n">PopulationBasedTraining</span><span class="p">(</span>
<span class="n">time_attr</span><span class="o">=</span><span class="s">'training_iteration'</span><span class="p">,</span>
<span class="n">reward_attr</span><span class="o">=</span><span class="s">'neg_mean_loss'</span><span class="p">,</span>
<span class="n">perturbation_interval</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">hyperparam_mutations</span><span class="o">=</span><span class="p">{</span>
<span class="s">'l1_scale'</span><span class="p">:</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="mf">1e-3</span><span class="p">,</span> <span class="mf">1e-5</span><span class="p">),</span>
<span class="s">'learning_rate'</span><span class="p">:</span> <span class="p">[</span><span class="mf">1e-2</span><span class="p">,</span> <span class="mf">1e-3</span><span class="p">,</span> <span class="mf">1e-4</span><span class="p">]</span>
<span class="p">}</span>
<span class="p">)</span>
<span class="n">tune</span><span class="o">.</span><span class="n">run_experiments</span><span class="p">({</span><span class="s">'population_based_training'</span><span class="p">:</span> <span class="n">train_spec</span><span class="p">},</span> <span class="n">scheduler</span><span class="o">=</span><span class="n">pbt</span><span class="p">)</span>
</code></pre></div></div>
<p>The above example will save, explore and exploit your population every time after <code class="highlighter-rouge">_train</code> has been called on your Trainable.
This is because we set <em>‘perturbation_interval’</em> to 1.
Furthermore, both <em>‘l1_scale’</em>, as well as <em>‘learning_rate’</em>, will be perturbed or resampled during explore operations according to the scheme specified.
You can even implement your own exploration function; to learn more, see <a href="http://ray.readthedocs.io/en/latest/pbt.html">here</a>.</p>
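<p>Conceptually, a custom exploration function just maps a copied config dictionary to a mutated one. The sketch below is hypothetical: the function name, noise factors, and clipping bounds are our own choices for illustration, not part of ray tune:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random

def custom_explore(config):
    # Hypothetical explore function: receives a copy of the winner's config
    # and returns a mutated version. Noise factors and clipping bounds are
    # illustrative choices.
    new = dict(config)
    scale = new['l1_scale'] * random.choice([0.8, 1.25])
    new['l1_scale'] = min(max(scale, 1e-6), 1e-2)
    return new

mutated = custom_explore({'l1_scale': 1e-4, 'learning_rate': 1e-3})
</code></pre></div></div>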
<p>Because at every mutation step of population-based training the entire model is saved to disk and restored (and possibly sent to a different machine), ray tune incurs significantly more overhead than our <a href="/ai/population-based-training">vanilla tensorflow version</a>.
Therefore, you might want to increase <em>perturbation_interval</em> depending on the length of each iteration.</p>
<h2 id="optional-network-morphisms">Optional: Network morphisms</h2>
<p>On the other hand, if your graph needs to be rebuilt anyway, for instance to make architectural changes based on hyperparameters, the graph reconstruction makes the process quite easy.
To give you an example, let’s say you want to increase the number of units in a layer.
You then could implement a network morphism that keeps your neural network function <script type="math/tex">f</script> identical but changes the number of units by padding your weight and bias variables with zeros.
You will have to redefine your <code class="highlighter-rouge">_restore</code> function</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_restore</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">checkpoint_path</span><span class="p">):</span>
<span class="n">reader</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">train</span><span class="o">.</span><span class="n">NewCheckpointReader</span><span class="p">(</span><span class="n">checkpoint_path</span><span class="p">)</span>
<span class="k">for</span> <span class="n">var</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">saver</span><span class="o">.</span><span class="n">_var_list</span><span class="p">:</span>
<span class="n">tensor_name</span> <span class="o">=</span> <span class="n">var</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">':'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">reader</span><span class="o">.</span><span class="n">has_tensor</span><span class="p">(</span><span class="n">tensor_name</span><span class="p">):</span>
<span class="k">continue</span>
<span class="n">saved_value</span> <span class="o">=</span> <span class="n">reader</span><span class="o">.</span><span class="n">get_tensor</span><span class="p">(</span><span class="n">tensor_name</span><span class="p">)</span>
<span class="n">resized_value</span> <span class="o">=</span> <span class="n">fit_to_shape</span><span class="p">(</span><span class="n">saved_value</span><span class="p">,</span> <span class="n">var</span><span class="o">.</span><span class="n">shape</span><span class="o">.</span><span class="n">as_list</span><span class="p">())</span>
<span class="n">var</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">resized_value</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">sess</span><span class="p">)</span>
</code></pre></div></div>
<p>where <code class="highlighter-rouge">fit_to_shape</code> is defined as</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">fit_to_shape</span><span class="p">(</span><span class="n">array</span><span class="p">,</span> <span class="n">target_shape</span><span class="p">):</span>
<span class="n">source_shape</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">array</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">target_shape</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">target_shape</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">target_shape</span><span class="p">)</span> <span class="o">!=</span> <span class="nb">len</span><span class="p">(</span><span class="n">source_shape</span><span class="p">):</span>
<span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">'Axes must match'</span><span class="p">)</span>
<span class="n">size_diff</span> <span class="o">=</span> <span class="n">target_shape</span> <span class="o">-</span> <span class="n">source_shape</span>
<span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="nb">all</span><span class="p">(</span><span class="n">size_diff</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span>
<span class="k">return</span> <span class="n">array</span>
<span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="nb">any</span><span class="p">(</span><span class="n">size_diff</span> <span class="o">></span> <span class="mi">0</span><span class="p">):</span>
<span class="n">paddings</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">target_shape</span><span class="p">),</span> <span class="mi">2</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">int32</span><span class="p">)</span>
<span class="n">paddings</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">size_diff</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">pad</span><span class="p">(</span><span class="n">array</span><span class="p">,</span> <span class="n">paddings</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s">'constant'</span><span class="p">)</span>
<span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="nb">any</span><span class="p">(</span><span class="n">size_diff</span> <span class="o"><</span> <span class="mi">0</span><span class="p">):</span>
<span class="n">slice_desc</span> <span class="o">=</span> <span class="p">[</span><span class="nb">slice</span><span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">target_shape</span><span class="p">]</span>
<span class="n">array</span> <span class="o">=</span> <span class="n">array</span><span class="p">[</span><span class="nb">tuple</span><span class="p">(</span><span class="n">slice_desc</span><span class="p">)]</span>
<span class="k">return</span> <span class="n">array</span>
</code></pre></div></div>
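<p>For a quick standalone check of the padding-and-truncation behavior, here is a compact re-statement of the same logic together with a tiny example (the array values are arbitrary):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def pad_and_truncate(array, target_shape):
    # Same behavior as fit_to_shape, re-stated compactly so this snippet
    # runs on its own: zero-pad axes where the target is larger,
    # truncate axes where it is smaller.
    target_shape = np.asarray(target_shape)
    pad = np.maximum(target_shape - np.asarray(array.shape), 0)
    array = np.pad(array, np.stack([np.zeros_like(pad), pad], axis=1),
                   mode='constant')
    return array[tuple(slice(d) for d in target_shape)]

a = np.arange(6).reshape(2, 3)
b = pad_and_truncate(a, [3, 2])  # adds a zero row, drops the last column
</code></pre></div></div>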
<p>Note that in the case where the number of units is reduced, the above code will not keep the function <script type="math/tex">f</script> identical!
One might, for instance, derive a more intelligent algorithm that removes only the weights and biases with zero magnitude.
Such sparsity can be encouraged by l1-regularization.</p>
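<p>Both directions can be checked numerically with a tiny two-layer network in numpy. The layer sizes and the l1-norm ranking used to pick units below are our own illustration, not part of the tensorflow implementation above: padding with zeros leaves <script type="math/tex">f</script> unchanged, and shrinking leaves it unchanged only if the removed units have exactly zero weights and bias.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def f(x, W1, b1, W2):
    # A small relu network, stand-in for the model being morphed
    return np.maximum(x @ W1 + b1, 0) @ W2

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
W1 = rng.normal(size=(8, 6))
b1 = rng.normal(size=6)
W2 = rng.normal(size=(6, 10))
ref_full = f(x, W1, b1, W2)

# Growing: zero padding adds hidden units with zero incoming weights and
# bias; they output relu(0) = 0 and are multiplied by zero outgoing rows,
# so f stays identical.
W1g = np.pad(W1, ((0, 0), (0, 2)), mode='constant')
b1g = np.pad(b1, (0, 2), mode='constant')
W2g = np.pad(W2, ((0, 2), (0, 0)), mode='constant')
grown = f(x, W1g, b1g, W2g)

# Shrinking: removing only hidden units whose incoming weights and bias are
# exactly zero keeps f intact, unlike naive truncation.
W1[:, [1, 4]] = 0.0
b1[[1, 4]] = 0.0
ref_sparse = f(x, W1, b1, W2)
norms = np.abs(W1).sum(axis=0) + np.abs(b1)
keep = np.sort(np.argsort(norms)[-4:])  # keep the 4 strongest units
shrunk = f(x, W1[:, keep], b1[keep], W2[keep, :])
</code></pre></div></div>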
<h2 id="waiting-for-results-and-visualizing">Waiting for results and visualizing</h2>
<p>Finally, lean back and let the magic happen.
You will see that ray tune outputs nice statistics along the way:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>== Status ==
PopulationBasedTraining: 42 checkpoints, 28 perturbs
Resources used: 3/12 CPUs, 3/3 GPUs
Result logdir: /home/louis/ray_results/population_based_training
PAUSED trials:
- Experiment_0: PAUSED [pid=20781], 971 s, 1600 ts, -1.64e+03 rew, 0.935 loss, 0.676 acc
RUNNING trials:
- Experiment_1: RUNNING [pid=23121], 1162 s, 1350 ts, -1.68e+03 rew, 0.994 loss, 0.665 acc
- Experiment_2: RUNNING [pid=18700], 979 s, 1550 ts, -1.63e+03 rew, 0.988 loss, 0.663 acc
- Experiment_3: RUNNING [pid=22593], 990 s, 1550 ts, -1.77e+03 rew, 0.959 loss, 0.671 acc
</code></pre></div></div>
<p>To visualize log data from both ray tune and your own tensorflow summaries, use TensorBoard:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tensorboard <span class="nt">--logdir</span> ~/ray_results/population_based_training
</code></pre></div></div>
<p>If you want to learn more about ray tune, have a look at the <a href="http://ray.readthedocs.io/en/latest/tune.html">documentation</a> and <a href="https://github.com/ray-project/ray/tree/master/python/ray/tune/examples">examples</a>.</p>Louis KirschIn this blog post we want to look at the distributed computation framework ray and its little brother ray tune that allow distributed and easy to implement hyperparameter search. It not only supports population-based training, but also other hyperparameter search algorithms. Ray and ray tune support any autograd package, including tensorflow and PyTorch.Hyperparameter search with population based training in tensorflow2018-06-07T06:00:00+00:002018-06-07T06:00:00+00:00http://louiskirsch.com/ai/population-based-training<p>Deep learning researchers often spend a large amount of time tuning their architecture and hyper-parameters.
A simple method to reduce this human effort is a grid search (or a random search) over a given range of possible hyper-parameter values.
The cost of grid search grows exponentially with the number of hyper-parameters, and each hyper-parameter setting requires training the entire neural network to convergence.</p>
<p>More advanced work such as <a href="https://arxiv.org/abs/1611.01578">reinforcement learning for architecture search</a> and <a href="https://arxiv.org/abs/1703.01041">evolutionary approaches</a> makes it possible to search not only the space of hyper-parameters but also the space of architectures.
On the downside, these methods are very expensive and require hundreds of GPUs.</p>
<p>Recently, a very simple method for hyper-parameter search has been proposed by DeepMind [<a href="https://arxiv.org/abs/1711.09846">1</a>].
In this article, I will showcase the method with a simple tensorflow example.
While this method is only suitable for fixed architectures with variable hyper-parameters, there are methods such as network morphisms [<a href="https://arxiv.org/abs/1711.04528">2</a>] that allow extending this approach to architecture search.
The implementation of these network morphisms is a bit more complicated, so I will leave an implementation of network morphisms to a follow-up blog post.</p>
<h2 id="population-based-training-pbt">Population based training (PBT)</h2>
<p>Population based training is based on the idea that we train a population of runs with different hyper-parameters, and every \(m\) iterations we exploit and explore.
Exploitation overwrites the parameters and hyper-parameters of the worst training runs with those of the best, while exploration perturbs the copied hyper-parameters with noise.
This procedure is visualized in the following figure.</p>
<figure class="text-center">
<img class="figure-img rounded" src="/assets/posts/population-based-training/overview.png" alt="A visualization of population based neural network training" />
<figcaption class="figure-caption">A visualization of population based neural network training. Reproduced from <a href="https://arxiv.org/abs/1711.09846">[1]</a>.</figcaption>
</figure>
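<p>Stripped of all framework details, one exploit-and-explore step can be sketched in a few lines of plain Python. The population structure and the bottom-/top-quartile scheme below are simplifications for illustration:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random

def pbt_step(population):
    # Toy exploit/explore step. Each member is a dict (structure assumed
    # for illustration): the bottom quartile copies the weights of the
    # top quartile, then perturbs the copied hyper-parameters with noise.
    ranked = sorted(population, key=lambda m: m['score'])
    n = max(len(ranked) // 4, 1)
    for loser, winner in zip(ranked[:n], ranked[-n:]):
        loser['weights'] = list(winner['weights'])          # exploit
        loser['hypers'] = {k: v * random.choice([0.8, 1.2])  # explore
                           for k, v in winner['hypers'].items()}
    return population

population = [{'score': s, 'weights': [s], 'hypers': {'lr': 1e-3}}
              for s in (1.0, 2.0, 3.0, 4.0)]
population = pbt_step(population)
worst = min(population, key=lambda m: m['score'])
</code></pre></div></div>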
<h2 id="implementation-in-tensorflow">Implementation in tensorflow</h2>
<p>We will do <a href="https://www.cs.toronto.edu/~kriz/cifar.html">CIFAR10</a> image classification in this example.
Our goal is nothing fancy - we just use standard fully connected layers.
But how do we choose the capacity for these layers?
This is where our population based training comes in - we just specify very large layers and then apply l1-regularization to essentially zero-out unnecessary connections.
This prevents overfitting.
The scale of this l1-regularizer will be a hyper-parameter and determined by our algorithm.</p>
<p>For this example, we will need tensorflow, numpy, matplotlib and a nice package called ‘observations’ that allows us to load common datasets in seconds.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">import</span> <span class="nn">tensorflow.contrib</span> <span class="k">as</span> <span class="n">tfc</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">observations</span>
<span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="n">lru_cache</span>
</code></pre></div></div>
<p>These few lines of code are literally everything we need to create two <code class="highlighter-rouge">tf.Dataset</code> instances of the <a href="https://www.cs.toronto.edu/~kriz/cifar.html">CIFAR10 dataset</a> for training and testing.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tf</span><span class="o">.</span><span class="n">reset_default_graph</span><span class="p">()</span>
<span class="n">train_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> <span class="n">observations</span><span class="o">.</span><span class="n">cifar10</span><span class="p">(</span><span class="s">'data/cifar'</span><span class="p">)</span>
<span class="n">test_data</span> <span class="o">=</span> <span class="n">test_data</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">test_data</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span> <span class="c"># Fix test_data dtype</span>
<span class="n">train</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">Dataset</span><span class="o">.</span><span class="n">from_tensor_slices</span><span class="p">(</span><span class="n">train_data</span><span class="p">)</span><span class="o">.</span><span class="n">repeat</span><span class="p">()</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="mi">10000</span><span class="p">)</span><span class="o">.</span><span class="n">batch</span><span class="p">(</span><span class="mi">64</span><span class="p">)</span>
<span class="n">test</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">Dataset</span><span class="o">.</span><span class="n">from_tensors</span><span class="p">(</span><span class="n">test_data</span><span class="p">)</span><span class="o">.</span><span class="n">repeat</span><span class="p">()</span>
</code></pre></div></div>
<p>We now create an iterator to iterate over either the training or test data.
For that, we use iterator handles as described in the <a href="https://www.tensorflow.org/programmers_guide/datasets">tensorflow documentation</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">handle</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">string</span><span class="p">,</span> <span class="p">[])</span>
<span class="n">itr</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">Iterator</span><span class="o">.</span><span class="n">from_string_handle</span><span class="p">(</span><span class="n">handle</span><span class="p">,</span> <span class="n">train</span><span class="o">.</span><span class="n">output_types</span><span class="p">,</span> <span class="n">train</span><span class="o">.</span><span class="n">output_shapes</span><span class="p">)</span>
<span class="n">inputs</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">itr</span><span class="o">.</span><span class="n">get_next</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">make_handle</span><span class="p">(</span><span class="n">sess</span><span class="p">,</span> <span class="n">dataset</span><span class="p">):</span>
<span class="n">iterator</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">make_initializable_iterator</span><span class="p">()</span>
<span class="n">handle</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">([</span><span class="n">iterator</span><span class="o">.</span><span class="n">string_handle</span><span class="p">(),</span> <span class="n">iterator</span><span class="o">.</span><span class="n">initializer</span><span class="p">])</span>
<span class="k">return</span> <span class="n">handle</span>
</code></pre></div></div>
<p>The two tensors <code class="highlighter-rouge">inputs</code> and <code class="highlighter-rouge">labels</code> now contain the data, we cast them to the right data types and shape.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">inputs</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span> <span class="o">/</span> <span class="mf">255.0</span>
<span class="n">inputs</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">flatten</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span>
<span class="n">labels</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">labels</span><span class="p">,</span> <span class="n">tf</span><span class="o">.</span><span class="n">int32</span><span class="p">)</span>
</code></pre></div></div>
<p>Next, we create the model we’ll be using to classify CIFAR10 images.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Model</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model_id</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">regularize</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">model_id</span> <span class="o">=</span> <span class="n">model_id</span>
<span class="bp">self</span><span class="o">.</span><span class="n">name_scope</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">get_default_graph</span><span class="p">()</span><span class="o">.</span><span class="n">get_name_scope</span><span class="p">()</span>
<span class="c"># Regularization</span>
<span class="k">if</span> <span class="n">regularize</span><span class="p">:</span>
<span class="n">l1_reg</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_create_regularizer</span><span class="p">()</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">l1_reg</span> <span class="o">=</span> <span class="bp">None</span>
<span class="c"># Network and loglikelihood</span>
<span class="n">logits</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_create_network</span><span class="p">(</span><span class="n">l1_reg</span><span class="p">)</span>
<span class="c"># We maximize the loglikelihood of the data as a training objective</span>
<span class="n">distr</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">distributions</span><span class="o">.</span><span class="n">Categorical</span><span class="p">(</span><span class="n">logits</span><span class="p">)</span>
<span class="n">loglikelihood</span> <span class="o">=</span> <span class="n">distr</span><span class="o">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">labels</span><span class="p">)</span>
<span class="c"># Define accuracy of prediction</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">output_type</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">int32</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">accuracy</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">reduce_mean</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">equal</span><span class="p">(</span><span class="n">prediction</span><span class="p">,</span> <span class="n">labels</span><span class="p">),</span> <span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">))</span>
<span class="c"># Loss and optimization</span>
<span class="bp">self</span><span class="o">.</span><span class="n">loss</span> <span class="o">=</span> <span class="o">-</span><span class="n">tf</span><span class="o">.</span><span class="n">reduce_mean</span><span class="p">(</span><span class="n">loglikelihood</span><span class="p">)</span>
<span class="c"># Retrieve all weights and hyper-parameter variables of this model</span>
<span class="n">trainable</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">get_collection</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">GraphKeys</span><span class="o">.</span><span class="n">TRAINABLE_VARIABLES</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">name_scope</span> <span class="o">+</span> <span class="s">'/'</span><span class="p">)</span>
<span class="c"># The loss to optimize is the negative loglikelihood + the l1-regularizer</span>
<span class="n">reg_loss</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">loss</span> <span class="o">+</span> <span class="n">tf</span><span class="o">.</span><span class="n">losses</span><span class="o">.</span><span class="n">get_regularization_loss</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">optimize</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">train</span><span class="o">.</span><span class="n">AdamOptimizer</span><span class="p">()</span><span class="o">.</span><span class="n">minimize</span><span class="p">(</span><span class="n">reg_loss</span><span class="p">,</span> <span class="n">var_list</span><span class="o">=</span><span class="n">trainable</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_create_network</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">l1_reg</span><span class="p">):</span>
<span class="c"># Our deep neural network will have two hidden layers with plenty of units</span>
<span class="n">hidden</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="mi">1024</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">relu</span><span class="p">,</span>
<span class="n">kernel_regularizer</span><span class="o">=</span><span class="n">l1_reg</span><span class="p">)</span>
<span class="n">hidden</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="n">hidden</span><span class="p">,</span> <span class="mi">1024</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">relu</span><span class="p">,</span>
<span class="n">kernel_regularizer</span><span class="o">=</span><span class="n">l1_reg</span><span class="p">)</span>
<span class="n">logits</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="n">hidden</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span>
<span class="n">kernel_regularizer</span><span class="o">=</span><span class="n">l1_reg</span><span class="p">)</span>
<span class="k">return</span> <span class="n">logits</span>
<span class="k">def</span> <span class="nf">_create_regularizer</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="c"># We will define the l1 regularizer scale in log2 space</span>
<span class="c"># This allows changing one unit to half or double the effective l1 scale</span>
<span class="bp">self</span><span class="o">.</span><span class="n">l1_scale</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">get_variable</span><span class="p">(</span><span class="s">'l1_scale'</span><span class="p">,</span> <span class="p">[],</span> <span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">trainable</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="n">initializer</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">constant_initializer</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">log2</span><span class="p">(</span><span class="mf">1e-5</span><span class="p">)))</span>
<span class="c"># We define a 'perturb' operation that adds some noise to our regularizer scale</span>
<span class="c"># We will use this perturbation during exploration in our population based training</span>
<span class="n">noise</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">([],</span> <span class="n">stddev</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">perturb</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">l1_scale</span><span class="o">.</span><span class="n">assign_add</span><span class="p">(</span><span class="n">noise</span><span class="p">)</span>
<span class="k">return</span> <span class="n">tfc</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">l1_regularizer</span><span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="bp">self</span><span class="o">.</span><span class="n">l1_scale</span><span class="p">)</span>
<span class="nd">@lru_cache</span><span class="p">(</span><span class="n">maxsize</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">copy_from</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other_model</span><span class="p">):</span>
<span class="c"># This method is used for exploitation. We copy all weights and hyper-parameters</span>
<span class="c"># from other_model to this model</span>
<span class="n">my_weights</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">get_collection</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">GraphKeys</span><span class="o">.</span><span class="n">GLOBAL_VARIABLES</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">name_scope</span> <span class="o">+</span> <span class="s">'/'</span><span class="p">)</span>
<span class="n">their_weights</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">get_collection</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">GraphKeys</span><span class="o">.</span><span class="n">GLOBAL_VARIABLES</span><span class="p">,</span> <span class="n">other_model</span><span class="o">.</span><span class="n">name_scope</span> <span class="o">+</span> <span class="s">'/'</span><span class="p">)</span>
<span class="n">assign_ops</span> <span class="o">=</span> <span class="p">[</span><span class="n">mine</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">theirs</span><span class="p">)</span><span class="o">.</span><span class="n">op</span> <span class="k">for</span> <span class="n">mine</span><span class="p">,</span> <span class="n">theirs</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">my_weights</span><span class="p">,</span> <span class="n">their_weights</span><span class="p">)]</span>
<span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="o">*</span><span class="n">assign_ops</span><span class="p">)</span>
</code></pre></div></div>
<p>We will have to create several models, one for each population member.
Each population member will have separate hyper-parameters and weights.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">create_model</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="s">'model'</span><span class="p">):</span>
<span class="k">return</span> <span class="n">Model</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
</code></pre></div></div>
<p>Now, let’s train a standard neural network without regularization to obtain a baseline we are competing with.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ITERATIONS</span> <span class="o">=</span> <span class="mi">50000</span>
<span class="n">nonreg_accuracy_hist</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">ITERATIONS</span> <span class="o">//</span> <span class="mi">100</span><span class="p">,))</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">create_model</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">regularize</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
<span class="n">train_handle</span> <span class="o">=</span> <span class="n">make_handle</span><span class="p">(</span><span class="n">sess</span><span class="p">,</span> <span class="n">train</span><span class="p">)</span>
<span class="n">test_handle</span> <span class="o">=</span> <span class="n">make_handle</span><span class="p">(</span><span class="n">sess</span><span class="p">,</span> <span class="n">test</span><span class="p">)</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">global_variables_initializer</span><span class="p">())</span>
<span class="n">feed_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">handle</span><span class="p">:</span> <span class="n">train_handle</span><span class="p">}</span>
<span class="n">test_feed_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">handle</span><span class="p">:</span> <span class="n">test_handle</span><span class="p">}</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">ITERATIONS</span><span class="p">):</span>
<span class="c"># Training</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">optimize</span><span class="p">,</span> <span class="n">feed_dict</span><span class="p">)</span>
<span class="c"># Evaluate</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">100</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">nonreg_accuracy_hist</span><span class="p">[</span><span class="n">i</span> <span class="o">//</span> <span class="mi">100</span><span class="p">]</span> <span class="o">=</span> <span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">accuracy</span><span class="p">,</span> <span class="n">test_feed_dict</span><span class="p">)</span>
</code></pre></div></div>
<p>The following is essentially the core of population based training.
We create a population of models, and repeatedly</p>
<ul>
<li><strong>Exploit</strong> the best models by discarding the worst models and replacing them with the weights and hyper-parameters of the best model</li>
<li><strong>Explore</strong> the search space of hyper-parameters by adding noise through the <code class="highlighter-rouge">perturb</code> operation</li>
<li>Train each population member for a certain amount of iterations</li>
<li><strong>Evaluate</strong> each population member in terms of their validation set accuracy</li>
</ul>
<p><strong>NOTE:</strong> In this example we actually used the test set for scoring instead of the validation set.
In practice, you should not do this because it might lead to overfitting to the test set.
In the case of l1 regularization, it is highly unlikely that we can overfit in a meaningful way, but it is certainly not best practice.
In principle, you could partition the training set into a training and validation set.</p>
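<p>Such a partition is a one-liner in numpy. A minimal sketch (the array names and the 10% split fraction are illustrative, not from the post):</p>

```python
import numpy as np

def train_val_split(x, y, val_fraction=0.1, seed=0):
    """Shuffle once, then hold out the last val_fraction of examples."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(len(x))
    n_val = int(len(x) * val_fraction)
    train_idx, val_idx = perm[:-n_val], perm[-n_val:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx])

# Toy stand-in for the dataset arrays used in the post
x = np.arange(100).reshape(100, 1)
y = np.arange(100)
(train_x, train_y), (val_x, val_y) = train_val_split(x, y)
```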
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">POPULATION_SIZE</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">BEST_THRES</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">WORST_THRES</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">POPULATION_STEPS</span> <span class="o">=</span> <span class="mi">500</span>
<span class="n">ITERATIONS</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">accuracy_hist</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">POPULATION_SIZE</span><span class="p">,</span> <span class="n">POPULATION_STEPS</span><span class="p">))</span>
<span class="n">l1_scale_hist</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">POPULATION_SIZE</span><span class="p">,</span> <span class="n">POPULATION_STEPS</span><span class="p">))</span>
<span class="n">best_accuracy_hist</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">POPULATION_STEPS</span><span class="p">,))</span>
<span class="n">best_l1_scale_hist</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">POPULATION_STEPS</span><span class="p">,))</span>
<span class="n">models</span> <span class="o">=</span> <span class="p">[</span><span class="n">create_model</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">POPULATION_SIZE</span><span class="p">)]</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
<span class="n">train_handle</span> <span class="o">=</span> <span class="n">make_handle</span><span class="p">(</span><span class="n">sess</span><span class="p">,</span> <span class="n">train</span><span class="p">)</span>
<span class="n">test_handle</span> <span class="o">=</span> <span class="n">make_handle</span><span class="p">(</span><span class="n">sess</span><span class="p">,</span> <span class="n">test</span><span class="p">)</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">global_variables_initializer</span><span class="p">())</span>
<span class="n">feed_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">handle</span><span class="p">:</span> <span class="n">train_handle</span><span class="p">}</span>
<span class="n">test_feed_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">handle</span><span class="p">:</span> <span class="n">test_handle</span><span class="p">}</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">POPULATION_STEPS</span><span class="p">):</span>
<span class="c"># Copy best</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">([</span><span class="n">m</span><span class="o">.</span><span class="n">copy_from</span><span class="p">(</span><span class="n">models</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">models</span><span class="p">[</span><span class="o">-</span><span class="n">WORST_THRES</span><span class="p">:]])</span>
<span class="c"># Perturb others</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">([</span><span class="n">m</span><span class="o">.</span><span class="n">perturb</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">models</span><span class="p">[</span><span class="n">BEST_THRES</span><span class="p">:]])</span>
<span class="c"># Training</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">ITERATIONS</span><span class="p">):</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">([</span><span class="n">m</span><span class="o">.</span><span class="n">optimize</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">models</span><span class="p">],</span> <span class="n">feed_dict</span><span class="p">)</span>
<span class="c"># Evaluate</span>
<span class="n">l1_scales</span> <span class="o">=</span> <span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">({</span><span class="n">m</span><span class="p">:</span> <span class="n">m</span><span class="o">.</span><span class="n">l1_scale</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">models</span><span class="p">})</span>
<span class="n">accuracies</span> <span class="o">=</span> <span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">({</span><span class="n">m</span><span class="p">:</span> <span class="n">m</span><span class="o">.</span><span class="n">accuracy</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">models</span><span class="p">},</span> <span class="n">test_feed_dict</span><span class="p">)</span>
<span class="n">models</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">m</span><span class="p">:</span> <span class="n">accuracies</span><span class="p">[</span><span class="n">m</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c"># Logging</span>
<span class="n">best_accuracy_hist</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">accuracies</span><span class="p">[</span><span class="n">models</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span>
<span class="n">best_l1_scale_hist</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">l1_scales</span><span class="p">[</span><span class="n">models</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span>
<span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">models</span><span class="p">:</span>
<span class="n">l1_scale_hist</span><span class="p">[</span><span class="n">m</span><span class="o">.</span><span class="n">model_id</span><span class="p">,</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">l1_scales</span><span class="p">[</span><span class="n">m</span><span class="p">]</span>
<span class="n">accuracy_hist</span><span class="p">[</span><span class="n">m</span><span class="o">.</span><span class="n">model_id</span><span class="p">,</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">accuracies</span><span class="p">[</span><span class="n">m</span><span class="p">]</span>
</code></pre></div></div>
<p>Let’s see how well we are doing.
In the following graph we compare our baseline with the best model throughout our population based training.
Clearly, our method overfits less than the baseline, achieving higher test accuracies at the end of training.</p>
<p><img src="/assets/posts/population-based-training/output_22_0.png" alt="png" /></p>
<p>Another interesting observation can be made by looking at the changing l1 scale for each of the models.
Instead of fluctuating around the initial value \(10^{-5}\), the scales follow an evolving pattern over time.
This means that, depending on the stage of training, a different amount of l1 regularization is optimal for maximal test accuracy.</p>
<p><img src="/assets/posts/population-based-training/output_24_0.png" alt="png" /></p>
<p>All code can be found in this <a href="https://gist.github.com/timediv/308ff50c7f15191c8fe6582be3c810f0">GitHub gist</a>.</p>
<h2 id="closing-remarks">Closing remarks</h2>
<p>Obviously, we could have done much better by using convolutions for CIFAR10 image classification.
Nevertheless, the concept of population based training can be applied to any machine learning problem involving hyper-parameters: supervised, unsupervised and reinforcement learning.
If you have an interesting application, leave it in the comments!</p>
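<p>To emphasize how framework-independent the idea is, here is the whole exploit/explore loop boiled down to plain Python. Everything in it is a toy stand-in: the quadratic <code class="highlighter-rouge">score</code> pretends to be validation accuracy peaking at <code class="highlighter-rouge">h = 3</code>, and the population sizes are arbitrary choices, not the values from the post.</p>

```python
import random

random.seed(0)

def score(h):
    # Stand-in objective: pretend validation accuracy peaks at h = 3.
    return -(h - 3.0) ** 2

population = [{'h': random.uniform(0.0, 10.0)} for _ in range(10)]
initial_best = max(score(m['h']) for m in population)

for step in range(50):
    # Evaluate and rank the population, best member first
    population.sort(key=lambda m: score(m['h']), reverse=True)
    best = population[0]
    # Exploit: the worst members copy the best hyper-parameter
    for member in population[-3:]:
        member['h'] = best['h']
    # Explore: everyone except the top members perturbs its hyper-parameter
    for member in population[3:]:
        member['h'] *= random.choice([0.8, 1.2])

final_best = max(score(m['h']) for m in population)
```

<p>Because the best member is never perturbed, the best score can only improve over the run, mirroring the behavior of the TensorFlow version above.</p>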
<p>Also, we’ve been training all population members on the same GPU, waiting for all of them to finish before exploring and exploiting.
In a larger setup, we will want to train each member on a separate GPU asynchronously.
This can be extended to a distributed setup with multiple machines, which I may explore in another blog post.
I’ve also mentioned architecture search and network morphisms already – that’s when all this stuff gets really exciting!</p>Louis KirschDeep learning researchers often spend a large amount of time tuning their architecture and hyper-parameters. A simple method to reduce this human effort is to perform a grid search, but this requires compute exponential in the number of hyper-parameters. We will investigate a cheap and powerful alternative called population based training.Modular Networks: Learning to decompose neural computation2018-05-26T11:00:00+00:002018-05-26T11:00:00+00:00http://louiskirsch.com/modular-networks<p><strong>UPDATE 17 December 2018</strong><br />
For a more detailed description of the <a href="#alternatives-for-modular-learning-and-conditional-computation">idea space</a> around conditional computation and sparsity, see <a href="/assets/publications/scale-through-sparsity.pdf">my new technical report (Kirsch, October 2018)</a>.</p>
<p>Recently, I thought a lot about the concept of modularity in artificial neural networks and wrote a <a href="https://arxiv.org/abs/1811.05249">paper</a> on the topic.
In this blog post, I will discuss my findings and possible future work in detail.</p>
<p>Currently, our largest neural networks have only billions of parameters, while the brain has trillions of synapses that are arguably even more complex than simple multiplications.
To date, we are not able to <strong>scale up artificial neural networks to trillions of parameters</strong> by performing the usual dense matrix multiplication.
It is questionable whether this dense approach would make sense anyway, given that the brain certainly has sparse connectivity and sparse activations.
Indeed, in a previous experiment, I was able to show that activations in artificial neural networks are already sparse and can be made even sparser by introducing L1 regularization.
The following figure visualizes this phenomenon for the task of image classification on the CIFAR10 dataset.</p>
<figure class="text-center">
<img class="figure-img rounded" src="/assets/posts/modular-networks/activity.png" alt="Sparsity of weights and activations on CIFAR10" />
<figcaption class="figure-caption">Sparsity of weights and activations on fully-connected CIFAR10.<br />The regularized version had an L1 regularizer applied to both weights and activations (with similar test accuracy).</figcaption>
</figure>
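<p>The measurement behind such a figure is simple to reproduce. A hedged numpy sketch (the random one-layer ReLU network here is purely illustrative; the post measured a trained CIFAR10 classifier):</p>

```python
import numpy as np

np.random.seed(0)

def activation_sparsity(activations):
    """Fraction of post-ReLU units that are exactly zero."""
    return float(np.mean(activations == 0.0))

# Toy forward pass: random weights and a ReLU nonlinearity
x = np.random.randn(128, 100)        # a mini-batch of inputs
w = np.random.randn(100, 256)
hidden = np.maximum(0.0, x @ w)      # ReLU zeroes all negative preactivations
sparsity = activation_sparsity(hidden)
```

<p>With random weights about half the units are inactive; the point of the figure is that training, and especially L1 regularization, pushes this fraction far higher.</p>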
<p>While the brain saves energy by not activating neurons, dense matrix multiplication yields no similar gains.
Furthermore, there is evidence that the brain develops a <strong>modular structure in order to optimize energy cost, improve adaptation to changing environments and mitigate catastrophic forgetting</strong>.
Sounds like we should try and implement some modularity for our artificial neural networks too!</p>
<h2 id="modular-networks">Modular Networks</h2>
<p>Inspired by these findings, I developed a generalized Expectation Maximization algorithm that allows decomposing neural computation into multiple modules.
At the core of it, we have the modular layer, as seen in the following figure.</p>
<figure class="text-center">
<img class="figure-img rounded" src="/assets/posts/modular-networks/modular-layer.gif" alt="The modular layer" />
<figcaption class="figure-caption">The modular layer consists of a pool of modules and a controller that chooses the modules to execute based on the input.</figcaption>
</figure>
<p>The controller picks from the modules M1 to M6 (which are arbitrary differentiable functions) based on the given input and executes them.
Therefore, depending on the input, we evaluate different parts of the network – in effect modularizing our neural network architecture.
Now we can just stack this modular layer or insert it arbitrarily into existing architectures, for instance, a recurrent neural network.
Both the decomposition of functionality into modules, as well as the parameters of the modules, are learned.
More details can be found in my <a href="https://arxiv.org/abs/1811.05249">publication</a>.</p>
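<p>To make the mechanics concrete, here is a heavily simplified numpy sketch of a single modular layer. This is not the paper's EM training procedure: the controller here is a plain argmax over learned scores, each module is a linear map, and all names and sizes are illustrative.</p>

```python
import numpy as np

np.random.seed(0)

class ModularLayer:
    def __init__(self, in_dim, out_dim, n_modules):
        # A pool of modules (small linear functions) and a controller
        # that scores each module given the input.
        self.modules = [np.random.randn(in_dim, out_dim) * 0.1
                        for _ in range(n_modules)]
        self.controller = np.random.randn(in_dim, n_modules) * 0.1

    def forward(self, x):
        # Pick one module per datapoint, conditioned on the input.
        scores = x @ self.controller              # shape (batch, n_modules)
        choice = scores.argmax(axis=1)            # hard module selection
        out = np.stack([x[i] @ self.modules[k]
                        for i, k in enumerate(choice)])
        return out, choice

layer = ModularLayer(in_dim=8, out_dim=4, n_modules=6)
x = np.random.randn(32, 8)
y, chosen = layer.forward(x)
```

<p>Stacking such layers, or placing one inside a recurrent cell, gives the architectures discussed above; the paper's contribution is learning both the controller and the modules jointly.</p>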
<h2 id="learnings-from-my-research">Learnings from my research</h2>
<p>Ultimately my goal was to make modular learning a tool for <strong>large, scalable architectures</strong> that are less prone to catastrophic forgetting and would, therefore, be suitable for <strong>life-long learning</strong>.
The idea was that if gradients need only be propagated through very few modules of a large pool of potentially thousands of modules, then it might be easier to prevent updates from damaging functionality learned for other tasks or datapoints.
Given that we’d like to optimize only the parameters of a small set of modules, one easy method might be to add to the mini-batch other data samples that are also assigned to these selected modules.
<p>But it turns out, learning this kind of modularity has problems in itself – both conceptually, as well as technically.</p>
<p><strong>Generalization vs Specialization.</strong>
Naturally, each module is supposed to specialize in certain aspects of the training data.
For instance, in my research, I showed how a word-level language model had modules that focused on different semantics such as the beginning of a sentence or quantitative words.
But of course, in machine learning, we also aim to generalize to unseen datapoints.
This kind of modularization leads to specialization that might hurt generalization by latching onto aspects of the data that exist only in the training set.
Indeed, we showed that modularization does not generalize very well for image classification on the CIFAR10 dataset.
Interestingly, this issue was less severe for language modeling and therefore depends on the modality of the data.
One open question remains: how do we make sure that specialization occurs in a way that gains computational efficiency while leaving generalization uninhibited?</p>
<p><strong>Not every problem might be a composition of modules.</strong>
By learning the composition of modules we are somewhat enforcing that the problem is decomposable – how do we know this is even true?
While it might make sense to split language understanding and visual understanding into two different brain regions, does it necessarily make sense to also subdivide vision further into discrete module choices?</p>
<p><strong>Reduced learning signal for each module for smaller datasets.</strong>
Because modules are only active for some datapoints, the number of samples that each of these modules is trained on is significantly reduced.
This means we either require even larger datasets, or the generalization ability of each module may be further reduced.
It might be worth investigating whether auxiliary objectives that effectively enlarge the dataset, such as unsupervised image reconstruction in the domain of images, can reduce this overfitting.</p>
<p><strong>Inefficiencies because of lack of batching.</strong>
This is not an inherent problem of modularity but a technical one.
We split a mini-batch of data and distribute its datapoints to different modules, ultimately resulting in smaller batch sizes for each module.
Because GPUs are efficient at parallelizing matrix multiplication, smaller batch sizes mean slower computation.
There are workarounds that increase the batch size dramatically and distribute computation to multiple workers in a distributed environment, but this requires significant engineering effort.
<a href="https://arxiv.org/abs/1701.06538">This paper</a> demonstrates how this can be done.</p>
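<p>The simplest mitigation is to regroup the split mini-batch: gather all datapoints assigned to the same module and run them through it as one matrix multiplication. A hedged numpy sketch (here the module assignments are given up front; in the real system they come from the controller):</p>

```python
import numpy as np

np.random.seed(0)

def run_modules_batched(x, assignment, modules):
    """Scatter datapoints to their module, batch per module, gather back."""
    out = np.empty((x.shape[0], modules[0].shape[1]))
    for k, w in enumerate(modules):
        idx = np.where(assignment == k)[0]
        if len(idx):
            out[idx] = x[idx] @ w    # one batched matmul per module
    return out

modules = [np.random.randn(8, 4) for _ in range(3)]
x = np.random.randn(16, 8)
assignment = np.random.randint(0, 3, size=16)
batched = run_modules_batched(x, assignment, modules)
# Reference: apply each datapoint's module individually
naive = np.stack([x[i] @ modules[k] for i, k in enumerate(assignment)])
```

<p>Within one device this only recovers matmul efficiency; the cited paper goes further and routes the per-module groups to different workers.</p>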
<h2 id="alternatives-for-modular-learning-and-conditional-computation">Alternatives for modular learning and conditional computation</h2>
<p>Putting modularity aside, there might be another way to achieve computational benefits through the observed sparsity by leveraging sparse matrix-matrix multiplication.
As we’ve seen previously, about 90% of all activations can be zero – effectively making 90% of the weight matrix irrelevant.
This might be a very interesting direction for future research.
Of course, this will still distribute computation over the entire network instead of localizing it into modules.
Therefore, it is unclear whether we can yield any mitigation of catastrophic forgetting in this way.</p>
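<p>The potential gain is easy to see in numpy: when an activation vector is about 90% zeros, only the weight-matrix rows matching the nonzero entries contribute to the product. A sketch with illustrative sizes (a real speedup requires sparse kernels, not this index trick, which merely shows the arithmetic is equivalent):</p>

```python
import numpy as np

np.random.seed(0)

def sparse_matvec(w, x):
    """Multiply using only the nonzero entries of the activation vector x."""
    nz = np.nonzero(x)[0]
    return x[nz] @ w[nz]    # touches ~10% of w when x is ~90% sparse

w = np.random.randn(1000, 200)
x = np.random.randn(1000)
x[np.random.rand(1000) < 0.9] = 0.0    # zero out ~90% of the activations
y_sparse = sparse_matvec(w, x)
y_dense = x @ w
```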
<p>Another interesting aspect I would like to investigate is whether modularity naturally occurs in artificial neural networks.
Could we create a graph of co-activity of neurons to see whether functionality clusters into modules automatically?
If this modularization does not occur, might it be beneficial to introduce a soft locality constraint that moves correlated activations together such that larger parts of the matrix need not be computed entirely because these correspond to ‘inactive modules’?</p>Louis KirschCurrently, our largest neural networks have only billions of parameters, while the brain has trillions of synapses. To this date, we are not able to scale up artificial neural networks to trillions of parameters by performing the usual dense matrix multiplication. We introduce Modular Networks, a way to make neural networks modular, similar to what has been observed in the brain. In our approach, we execute only some of these modules, conditioned on the input to the network. Furthermore, we look at future possibilities of leveraging sparse computation.Why work on artificial intelligence?2018-05-02T09:00:00+00:002018-05-02T09:00:00+00:00http://louiskirsch.com/anthropology/why-ai<p><img src="/assets/posts/why-ai/croatia.jpg" alt="Picture at the sea in croatia" /></p>
<p>This morning, you woke up and got out of bed. A human-made object that enables you to have an enjoyable and effective rest. You checked your smartphone to communicate with the entire world instantly. You received all important news and notifications without any effort. Every day you use thousands of objects and processes that are human inventions, designed to make our lives more enjoyable and effective. Inventions that enable us to truly focus on what’s important. They free us from the requirement to dedicate our entire day to survival. Instead, we can focus on social, creative or inventive issues. Behind all this is only a single factor: human intelligence. Our brains are constantly working on coming up with new ways to make life more effective and fun.</p>
<p>Obviously, our brains have limited capacity, constrained by energy consumption and the size of the body. How much more could we achieve if we were freed of these limitations? This makes us wonder: how can we build intelligent systems that are not human? The term artificial intelligence describes just that - inspired by the processes in the brain, we run programs that make computers do things we would describe as intelligent in a human. We have come quite far with specialized artificial intelligence. Just recently, an algorithm called <a href="https://deepmind.com/research/alphago/">AlphaGo</a> defeated the human champion in the board game Go, believed to be the hardest board game humans have ever invented for a computer to master. Nevertheless, we still have a long way to go to build an intelligent system that can cope with any challenge in the real world as humans do. Such a system is called an artificial general intelligence (AGI).</p>
<p>I argue the quest for building such a system is not only inevitable, it is the most important quest of this century. Why would we want to develop such generally intelligent systems? At present, we face a seemingly endless number of unsolved problems. Machine intelligence can help us solve many of these in an unprecedented manner.</p>
<p><strong>Automation</strong> is the first impact of artificial intelligence and machine learning research, and it is already ongoing. Today, we can automate production with robots, chat with virtual assistants and automatically detect criminal activity. Soon, autonomous driving will take over highways around the entire world. Automation will help us eliminate simple jobs people do not actually want but keep in order to survive. A <a href="http://news.gallup.com/poll/165269/worldwide-employees-engaged-work.aspx">study</a> from 2013 shows that 87% of us still go to work without being motivated and 24% are truly unhappy at their job. Automation could lead to further increases in productivity that allow us to pay every single individual an unconditional income, freeing us from the need to work in an unfulfilling job for survival and letting us <strong>focus on the things we really want to do</strong>. This concept is known as universal basic income (UBI) and may be the start of a new era in which artificial intelligence is omnipresent. In effect, it may allow us to <strong>eliminate poverty</strong> on a global scale.</p>
<p>Differences in <strong>education</strong> create wealth imbalances and social tensions that could become a thing of the past. Imagine having your own virtual teacher with you at all times, one that can answer your questions based on any information on the internet just like a human expert would, essentially eliminating the need to search through long forum threads or Wikipedia pages. This means personalized instruction for quicker and more effective learning instead of attending possibly dissatisfying school classes or university lectures, and even more importantly, free access to education for anyone.</p>
<p>Imagine your capabilities if you could directly <strong>extend your brain with an AI</strong> system. A company working on just that is called <a href="https://www.neuralink.com/">Neuralink</a>. In the future, this could allow us to capture our thoughts instantly, convert them to text and images without having to write them down, interact with other people without actually saying anything, or access the internet by pure thought.</p>
<p>If we chose to, our <strong>AI systems could become scientists of their own</strong>, achieving intelligence that goes far beyond ours. Researchers have already used machine learning to automatically <a href="https://www.forbes.com/sites/bridaineparnell/2016/05/17/ai-recreates-nobel-prize-physics-experiment/#7b0041b86678">reproduce the experimental results</a> of a physics experiment that won the Nobel Prize in 2001. Autonomous AI could get us to the next level, solving problems that currently seem very hard for mankind to achieve and, in the long run, probably surpassing our imagination. The main reason for these much greater capabilities is that intelligent machines are not constrained to a very limited amount of energy and space for their computations. Also, they can use any of their artificial sensors to measure input that goes beyond the limited human senses of touch, hearing and sight.</p>
<p><strong>Safe, clean and almost free energy</strong> would truly change our world. Solutions to this challenge may be achievable by human efforts, but AI certainly could speed up the process of finding new scientific approaches that could get us closer to that goal. AGI could also help us in finally taking care of our planet and all species living on it, researching innovative ways of reducing pollution and <strong>protecting our climate</strong>.</p>
<p>The crowning discipline of any AGI will be the <strong>capability of improving itself</strong>. This could ultimately lead to an intelligence explosion, creating exponentially smarter machines.</p>
<p>If you think all this sounded too good to be true, you might be right. While there are utopian scenarios for our future, the capability of self-improvement in particular has its risks. If machines become hundreds to thousands of times smarter than humans, and this is not an unlikely scenario, <strong>how can we dare to think we have control?</strong> Part of our research also has to be making sure the AI’s intentions are aligned with ours and there is no possibility of malformed objectives that end in a human extinction event. While we should be aware of these risks and invest efforts into avoiding catastrophic outcomes, current technology and research are far from self-awareness or self-preservation. If you’re interested in learning more, Nick Bostrom wrote an interesting <a href="https://en.wikipedia.org/wiki/Superintelligence:_Paths,_Dangers,_Strategies">book</a> about the dangers and strategies involved. A nicely illustrated blog post can be found <a href="https://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html">here</a>.</p>
<p>Beyond solving all these challenges, there is also the simply fascinating question of how our brain is capable of achieving all these things. We as humans pride ourselves on many scientific discoveries, yet we have still not unlocked the mysteries of our own brain. That is ironic, since the brain is the reason we can make scientific discoveries in the first place. AI research aims at understanding what intelligence is and how we can build it.</p>
<p>Ultimately, our goal should be <strong>keeping the human race alive</strong>. This means we will have to explore far <strong>beyond our own planet into the universe</strong>. Artificial intelligence can help us accelerate this endeavor so that we successfully inhabit space before a natural disaster wipes humanity out.</p>
<p>We live in marvelous times of big opportunities. If we can make AGI work safely, there is a potentially utopian future for all of us. We have the unique opportunity to make a life-changing impact by building machines that are capable of much more than we humans are at the moment. When we do reach that point, we truly are wizards, having technology that is indistinguishable from magic.</p>
<p>So what are we waiting for? Let’s start the next big revolution in human history.</p>Louis KirschThis morning, you woke up and got out of bed. A human-made object that enables you to have an enjoyable and effective rest. You checked your smartphone, to communicate with the entire world instantly. You received all important news and notifications without any effort. Every day you use thousands of objects and processes that are human inventions, to make our lives more enjoyable and effective. Inventions, that enable us to truly focus on what's important. They free us from the requirement to dedicate our entire day to survival. Instead, we can focus on social, creative or inventive issues. Behind all this is only a single factor, human intelligence.