# [R] Higher Spheres: Information Theory V: Mathematical Basics of Active Inference — Variational Bayes', Relative Entropy / KL-Divergence and Jensen's Inequality

# Introduction

# Motivation for a Student’s Tutorial on Active Inference

# 1 Bayes’ Rule with More than Two Dimensions

## 1.2 The Problem of the Intractability of the Model Evidence

# 2 Variational Bayes Inference

## 2.1 Minimizing Variational Free Energy

## 2.1.1 The Principle of Maximum Entropy and Minimum Inner Energy

## 2.2 Generative Model and Generative Process

# 3 KL Divergence / Relative Entropy / Information Gain

# 4 Jensen’s Inequality

Intermediate to Advanced Stat-o-Sphere

Published on Aug 26, 2023


**Follow this link to our previous part IV of this series, if you wish,** or have a look at other tutorials of our collection Stat-o-Sphere.

**Review:** This is a pre-released article, and we are currently looking for reviewers. Contact us via [email protected]. You can also directly send us your review, or use our open peer-review functions via pubpub (create an account here). In order to comment, mark some text and click on “Start discussion”, or just comment at the end of an article.

Pub preview image generated by Mel Andrews via midjourney.

*Corresponding R script:*

**Welcome to the fifth part of our series on information theory (NOT PART OF THE Summer of Math Exposition contest, since published after 18.08)!** You have come quite a way to get here, and we really hope it was worth it for you so far. Information theory is a fascinating topic and a good example of how a few mathematical formulas and concepts, which turn out to be rather simple in the end, can make quite a career.

**Note that you have again entered the intermediate to advanced Stat-o-Sphere.** This tutorial requires knowledge of the first part (Bayes’ rule) and especially of the second part (entropy/surprisal) of this tutorial series on information theory (we will get to advanced Markov processes, such as Markov blankets, some other time). This means **we are extending our understanding of both Bayes’ inference and statistical thermodynamics in this tutorial.** In the end, the below will especially be a gain for our intuitive understanding of Bayesian inference, since we will learn more mathematics that helps us better represent certain aspects of our current intuition on it. **So “advanced” again does not mean that this tutorial will be particularly harder to go through than most of the others we have published so far,** but it requires at least a stable intuition on conditional probability / Bayes’ rule and an understanding of why statistical thermodynamics is related to it.

For some orientation for those who have never heard of active inference, predictive processing or the free energy principle: it can be understood as a form of “machine learning / AI”, but unlike, e.g., ChatGPT, active inference is not a merely passive framework; it involves both action and perception. So ChatGPT output can be understood as mere “passive perception” in some way, whereas active inference adds a twist to classic modelling of perception. It was also shown by Isomura et al. in 2022 that canonical neural networks can be cast as performing active inference, which makes active inference mathematically universal in that respect. However, the basic concept was originally derived in the field of neuroscience in order to model action and perception in humans (mostly driven by schizophrenia research), as mentioned in earlier parts of this series (Friston 2010, amongst earlier formulations).

Recent advances in computational neuroscience concerning concepts such as *predictive processing*, *active inference* and the *free energy principle* have gained increasing attention in the last couple of years, as they provide vast interdisciplinary and intuitive explanatory power for both neurocognitive processes and experience itself.

Due to its growing popularity, an extensive and amazing effort has been put into increasing its accessibility to a wider public of researchers, both technically (programming a model) and conceptually (scientific and (neuro)philosophical background and discourse). Especially **“A Step-by-Step Tutorial on Active Inference and its Application to Empirical Data”** (ATUT), by Ryan Smith, Christopher Whyte and Karl Friston (2022) provides a thorough and detailed introduction into actually computationally modelling neurocognitive processes by oneself.

Using **active inference** to do so **involves simulating behavioral tasks and predicted neural responses, which can subsequently be compared to evaluated empirical data, gained from running the same behavioral tasks with actual participants.** The ATUT only requires a rather minimal background in mathematics and programming and also provides all the information needed, accompanied by heavily commented code for *Matlab* (*Mat*rix *Lab*oratory), in order to fully program behavioral tasks oneself. A *Python* package called *pymdp* (*py*thon *M*arkov *d*ecision *p*rocess) has been published as well, explicitly accompanying the above tutorial’s *Matlab* code, also making it available to open access communities. This tutorial series on information theory especially provides you with background knowledge on information theory and its relation to thermodynamics, which is not covered in the ATUT.

I personally also started to convert some of the Matlab scripts into R a while ago (see this Github repository). I definitely want to bring this further in the future, since it is the perfect way to learn how active inference works in action (thanks to Ryan Smith and Christopher Whyte for some feedback on the R conversion!).

**In case you are also interested in making active inference run on R, let me know if you want to team up for the project! You can contact me via my email:** [email protected]

The goal of this “student’s tutorial” is to accompany their work by providing a detailed and slow-paced overview of the mentioned ‘minimal background’ in programming and in mathematics required to get through the above-mentioned tutorial (especially for people outside of computer science related fields). As mentioned in previous parts, this series on information theory was initially all about active inference. Since the basics mostly consisted of classic information theoretic concepts, I split the initial version apart and added and re-wrote a lot of sections, in order to serve a more general purpose for our statistics tutorial project “Stat-o-Sphere” at NOS (e.g. as basics for understanding Akaike and Bayesian information criterion (AIC/BIC), MCMC…).

The motivation for this tutorial came after I built up a collection of notes, codes and math tutorials myself, in order to wrap my head around active inference, due to my personal interest in computational neuroscience and abductive inference (especially from a linguistic/semiotic and clinical perspective). This collection entailed scattered answers to the most important questions I asked myself, often still lacking the above requirements.

**We hope this series will serve as a decelerated support and addition to the above-mentioned ATUT for people (students and researchers new to the field) who want to understand and apply “Higher Sphere” methods of inference.**

There has also been an *open access* MIT publication on active inference by Thomas Parr, Giovanni Pezzulo and Karl Friston (2022) that we can recommend! It also comes with some code examples for Matlab.

**The fifth and current part of this series entails most of the major basic mathematical / physics knowledge needed to understand active inference / the free energy principle / predictive processing in computational neuroscience.** In the next part of this series we will give you a short introduction on active inference / FEP in general and most of all supply you with extra material on it (paper recommendations, videos, interviews etc.).

**You may wonder why we start with mathematics this time? What happened to our triad of Intuition/Concept—Mathematics —Code?** Well, in fact we have already covered most of the intuition we need to understand active inference / the free energy principle, since they can most of all be looked at as a direct extension of information theory:

Especially the rather **“self-experiential” approach of discussing abductive inference as a way to understand the basic idea behind “Bayesian-Brain” concepts has supplied us with a stable intuition** on current computational neuroscientific approaches to model neural activity as well as cognition and decision making. The below will just build upon what we have gone through so far already.

In order to understand why we need approximate Bayes inference methods to get to more complex applications of information theory and Bayes’ rule in general, we have to look at the formula of conditional probability / Bayes’ rule that entails three or more instead of only two variables and we will also introduce Bayes’ rule for continuous probability distributions.

Below you see the formula for classic conditional probability given two, three and n-dimensions / variables for discrete probability distributions.

To get a better overview what changed, note that conditional probability involving three variables **can be simplified via substitutions** of

The posterior probability with three variables can be spoken as:

**“the probability of **** given **** and ****”,** or as

**“the probability of **** given a model of (or between) **** and ****”**.

We can perform the same rearrangements as with regular conditional probability / Bayes’ rule — again via the chain and the product rule. However, since we could substitute

**On the other hand, we can also literally think of the below as a simple extension of a second dimension with another variable (similar to how we discussed the binomial distribution in the first part of this series).** So from the perspective of Bayes’ rule, you can just add “

We can also rearrange the above in order to have a look at the outwritten joint probabilities:

In the above

We could also reformulate the above in order to obtain one of the possible **marginal probabilities**:
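As a rough numerical companion to these rearrangements, the sketch below builds a small joint distribution over three binary variables and recovers a marginal and a posterior from it. The variable names `A`, `B`, `C` and all numbers are assumptions made up for illustration and are not taken from the formulas of this tutorial.

```r
# Hypothetical joint distribution P(A, B, C) over three binary variables,
# stored as a 2 x 2 x 2 array (random numbers, for illustration only).
set.seed(1)
joint_abc = array(runif(8), dim = c(2, 2, 2),
                  dimnames = list(A = c("a1", "a2"),
                                  B = c("b1", "b2"),
                                  C = c("c1", "c2")))
joint_abc = joint_abc / sum(joint_abc)  # normalize so that it sums to 1

# Marginal probabilities: sum out the variables we are not interested in
p_bc = apply(joint_abc, c(2, 3), sum)   # P(B, C), A summed out
p_ac = apply(joint_abc, c(1, 3), sum)   # P(A, C), B summed out
p_c  = apply(joint_abc, 3, sum)         # P(C), A and B summed out

# Posterior directly from the joint: P(A | B, C) = P(A, B, C) / P(B, C)
post_direct = sweep(joint_abc, c(2, 3), p_bc, "/")

# Same posterior via Bayes' rule: P(B | A, C) * P(A | C) / P(B | C)
lik_b   = sweep(joint_abc, c(1, 3), p_ac, "/")  # P(B | A, C)
prior_a = sweep(p_ac, 2, p_c, "/")              # P(A | C)
evid_b  = sweep(p_bc, 2, p_c, "/")              # P(B | C)
post_bayes = sweep(sweep(lik_b, c(1, 3), prior_a, "*"), c(2, 3), evid_b, "/")

# Both routes should agree, and every P(. | b, c) should sum to 1
all.equal(post_direct, post_bayes, check.attributes = FALSE)
apply(post_direct, c(2, 3), sum)
```

The extra variable only adds another array dimension; the operations (multiply, sum out, divide) stay the same as in the two-variable case.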

**Here is where it gets a little tricky!! :O** For discrete probabilities, we will still get an answer, but the more variables we add the more computational problems we will get. In technical terms, too many variables will result in an **intractable marginal likelihood**. The reason is that the more dimensions we add, the more terms have to be summed out, and the number of terms grows exponentially with the number of variables (the problem is even more pronounced with continuous variables).
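To get a feeling for this exponential growth, here is a minimal sketch that counts how many joint configurations would have to be summed over. The helper name `n_terms` and the chosen numbers of variables are assumptions for illustration only.

```r
# Number of joint configurations that have to be summed over when
# marginalizing n_vars discrete variables with n_values states each.
# (n_terms is a hypothetical helper, not part of any package.)
n_terms = function(n_vars, n_values = 2) n_values^n_vars

# Binary variables: 2, 3, 10, 20 and 30 of them
sapply(c(2, 3, 10, 20, 30), n_terms)

# For a small case we can list every configuration explicitly:
configs = expand.grid(rep(list(0:1), 3))  # all states of 3 binary variables
nrow(configs)                             # one summand per row
```

With 30 binary variables there are already more than a billion summands, and every additional variable doubles that number.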

First of all, the mentioned boundary can be thought of as a problem of computability. **Computability means *having an algorithm to perform a calculation with limited resources and time*.** **We will partially follow this short tutorial from Ben Lambert.**

So from a purely pragmatic perspective: the more variables and the more relations, the more computing power is needed. Below you will see the formula for Bayes’ rule with n variables:

**Let us also quickly go through Bayes’ rule considering continuous probability functions (which we haven’t done in any of our tutorials so far).** The only formal difference is that we use

**Before we get into more details of intractability**, the problem of computability may become more graspable when we think of a Markov chain graph that has **would theoretically take possibly forever to be computed — importantly also by computers**. Anyhow, it will take a while until tractability becomes a real problem in the discrete inference case. The problem of summation becomes more prevalent though considering continuous functions. A short explanation of intractability can be found in a tutorial on active inference by Smith, Whyte, Friston 2022.

While Bayesian inference represents the optimal way to infer posterior beliefs within a generative model, Bayes theorem is computationally intractable for anything but the simplest distributions. […] the marginal likelihood (denominator) in Bayes’ theorem – requires us to sum the probabilities of observations under all possible states in the generative model (i.e., based on the sum rule of probability […] ).

For discrete distributions, as the number of dimensions (and possible values) increases, the number of terms that must be summed increases exponentially. In the case of continuous distributions, it requires the evaluation of integrals that do not always have closed-form (analytic) solutions. As such, approximation techniques are required to solve this problem. (Smith, Whyte, Friston 2022)
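For the continuous case, the following sketch computes a model evidence by numerical integration. It assumes a made-up example (7 successes in 10 trials with a uniform prior on the success probability), chosen because this particular integral does have a known closed form, so the numerical result can be checked. All names are hypothetical.

```r
# Hypothetical data: k successes in n trials; theta is the success probability
k = 7; n = 10

prior_dens = function(theta) dbeta(theta, 1, 1)               # uniform prior
lik        = function(theta) dbinom(k, size = n, prob = theta)

# Model evidence: integral of likelihood * prior over all values of theta
evidence_numeric = integrate(function(theta) lik(theta) * prior_dens(theta),
                             lower = 0, upper = 1)$value

# For a uniform prior the closed-form solution is 1 / (n + 1)
evidence_exact = 1 / (n + 1)
c(numeric = evidence_numeric, exact = evidence_exact)

# Posterior density = likelihood * prior / evidence
post_dens = function(theta) lik(theta) * prior_dens(theta) / evidence_numeric
post_dens(0.5)       # should match the Beta(k + 1, n - k + 1) density
dbeta(0.5, k + 1, n - k + 1)
```

Here one integral over one parameter is easy. The difficulty described in the quote arises when many parameters have to be integrated out at once and no closed form is available to fall back on.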

**The problem of intractability in general** has a purely pragmatic dimension (efficiency), but also a deeper mathematical and computer scientific dimension that refers to the so-called **P-NP problem** from complexity theory. There is a great tutorial by Jade Tan-Holmes on her channel “Up and Atom” that we can recommend if you want to get a first hint on that topic. We decided to proceed directly to presenting a solution for our problem.

**So the big question is: can we find a way to work around and efficiently approximate the model evidence in order to normalize our joint probability with it and update our model (obtain the posterior)?** Intuitively, since we have repeatedly compared Bayes’ inference with “human” inference and related it to computational neuroscience, note that going through “everything”, i.e., every possible case / fact etc., when making a probabilistic decision is importantly not what we humans do, since it takes a lot of time (it is not pragmatic or even doable). We also wouldn’t want to do so, since we are always aiming for specific information that is relevant to us or to a specific matter. **In other words, you do not use or need a 1:1 map down to every atom in order to orientate within a city or any region (this comparison is made by Maxwell Ramstead and can be found here).**

The latter also implies a biological perspective on inference and what it takes. This also entails boundaries in terms of metabolism and complexity of structures. Concerning the latter, minimizing free energy is also considered an attempt of an organism as a whole to *overcome* *the second law of thermodynamics*, i.e., trying to *delay* *decay*. Decay does not only concern the physical but also the conceptual level: our model of the world, which we try to keep or try to confirm. In other words: Bayesian brain theories somewhat follow a scheme of *self-evidencing* (e.g., in “All thinking is wishful thinking” by Kruglanski, Jasko, Friston, 2020).

This is also where our thermodynamic analogy on a higher level of abstraction (active inference as a mathematical tool to describe decision making) becomes related to actual thermodynamics (the organism as such): the argument is that inferring on the world is done by minimizing free energy in order to stay on the *trajectory* of a *model*, meaning to keep the deviation between the true state and our expectations on present states within a generative model of those events *low,* in order to *save cost* or affordance when trying to not fall apart and handle life as a person at the same time. You will find a lot of recorded lectures on the free energy principle and active inference on various topics (e.g., this one called “Me and my Markov blanket”), in which Karl Friston often gives simple examples of models that involve a daily routine, e.g., getting up, making coffee, going to the subway station and so on. Such a model would consist of a steadiness in terms of periodically getting back to states, such as “getting up, making coffee”, but would *not* involve an equilibrium, where nothing changes. This dynamic homeostasis, or better *homeorhesis*, is referred to as a *non-equilibrium steady state* (NESS) distribution (the latest formulations of that discourse can be found here: Bayesian mechanics for stationary processes by Da Costa et al.).

However, as discussed before, strong deviations of a model (a lot of “questions that need to be asked”) will result in a lot of model updating. Models themselves are an attempt to keep such deviations within bounds. On the other hand, a NESS also implies the possibility of a desire for change as a whole, as in getting rid of a model that one is in joint with. This can be understood in the optimistic or pessimistic sense, or better as voluntary or non-voluntary: finding a way to fulfill a positive desire or change (changing peer group, social context, or going on an adventure etc.), or as a way to cope with undesired changes, even though it is hard to do so (trauma, escape from war etc.).

To answer the question of whether we can get a grip on the model evidence: yes, we can, given approximate Bayesian inference methods. **As noted in the introduction, this and the next part of this series will focus on the mathematical basics of active inference**, and we are after one particular method: the variational inference method, which goes back to the variational method defined by Richard Feynman (Statistical Mechanics, p. 86 (1972), also see Friston et al. (2006)).

**By the way, glad that you ask, yes,** we are going to have a closer look into the relation between physics and information theoretic methods again and expand our understanding of thermodynamics, discussing the **concept of inner and free energy and their relation to entropy**.

Now that we know that our previous method of computing a posterior probability distribution is not enough to perform inference on the level of complexity we are after, let us finally start to upgrade our known Bayes’ rule to be able to perform an approximate Bayes inference method, which overcomes the problem of possible intractability of regular Bayes’ theorem.

As mentioned, for this to happen we need to expand our understanding of statistical thermodynamics / mechanics. In general, what we will present below is actually not that difficult mathematically. The physics behind the concept of **free energy will also be much easier to understand than the concept of entropy by itself (which is a part of the calculation of free energy)**. Most of all we will actually get to know a very clever ‘trick’ in order to approximate the model evidence. A ‘trick’ that will also make our calculations, as well as our general concept of abductive Bayesian inference, more intuitive.

Below you will see the classic formula for computing the model evidence (as discussed before). We will also adopt the notation of the variables that are used in active inference, where **Importantly note that active inference actually works with conditional independence. However, for the sake of simplicity this tutorial will only operate with conditionally dependent relations, i.e., classic or also so-called exact Bayes’ inference.** **We will catch up with conditional independence in the next part of this series!**

**To relate the notion of conditional independence to our intractability/approximation problem and information theory: conditional independence means that we cannot fully infer the entropy of a message, i.e., on a message itself, since we lack certain “exact” or complete insights into the information that is around us.**

As we are unable to infer on the model evidence directly in order to then obtain the posterior, we could still try to play around a bit and compare our results, in order to optimize an approximation of **given the joint probability distribution from our prior and the likelihood.** So again, we cannot *directly* compare our approximation of the model evidence or the *true posterior* with the model evidence itself, due to the fact that we cannot compute it in case of higher dimensions. We have to find an indirect way to do so.

**For more simplicity we will also again go through the formulas for approx. Bayes with only two elements in joint with each other.** A threefold joint is used in the ATUT anyway, so here you will find a downgraded version of the formulas, in case you lose orientation.

However, we can say that epistemically *heuristically* make assumptions on **In case this does not fully make sense to you yet, we will get to a point where it gets really easy in a bit.**

**Our professional method of choice to get around the intractability problem will be the so-called *variational inference* method.** Roughly, this method is **used to approximate the *posterior* (!)** by heuristically trying out *possible* posterior distributions, denoted **The right orientation for such approximations is provided by making use of an *inequality* that can be derived from formulas that again reach back to statistical thermodynamics, namely *Helmholtz Free Energy*.** This is also the point where the information theoretic quantity of surprisal / entropy becomes important within approximate Bayes inference.
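The idea of “trying out” candidate posteriors can be sketched in a few lines of R. The example below assumes a two-state generative model with made-up numbers and the standard discrete form of the variational free energy, F(q) = Σ q · (log q − log joint). The names `free_energy`, `q1_grid` and `fe_grid` are hypothetical helpers, and a brute-force grid search stands in for the more refined optimization schemes used in practice.

```r
# Two-state generative model (numbers are assumptions for illustration)
prior      = c(.5, .5)
likelihood = c(.8, .2)
joint      = prior * likelihood
evidence   = sum(joint)        # tractable here, so we can check the result
true_post  = joint / evidence

# Variational free energy of a candidate posterior q (natural log)
free_energy = function(q, joint) sum(q * (log(q) - log(joint)))

# Try out candidate posteriors q = (q1, 1 - q1) on a grid
q1_grid = seq(0.01, 0.99, by = 0.01)
fe_grid = sapply(q1_grid, function(q1) free_energy(c(q1, 1 - q1), joint))

# The best candidate should sit at the true posterior ...
best_q1 = q1_grid[which.min(fe_grid)]
c(best_q1 = best_q1, true_post_1 = true_post[1])

# ... where the free energy touches the surprisal -log(evidence);
# for every other candidate it lies above that value.
c(min_free_energy = min(fe_grid), surprisal = -log(evidence))
```

Note that the evidence is never needed to evaluate `free_energy()`; it only appears here as a check. That is the point of the approach: the candidate that minimizes the free energy approximates the posterior without the normalization step.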

In general, the actual formula for variational free energy within the active inference framework looks much more complicated than it is, and I promise it entails nothing we couldn’t understand after all that we have gone through so far (*wiping-off-a-tiny-tear*). **The only difficulty lies in bringing in some additional facts from statistical thermodynamics (Helmholtz free energy, variational methods (Feynman)) and nowadays information theory (Jensen’s inequality, KL divergence, variational free energy).** However, a lot of the below will again consist of mere rearranging of formulas and alternating notations. As promised, we will again only work with two variables and within the realm of exact Bayes’ inference (even though variational Bayes’ would actually not be required in such a case).

In classic thermodynamics, the *free energy* represents the result of a *relation between the inner energy and entropy* of a system and is considered a quantity for *the amount of work that can be extracted from a system* at constant temperature and volume ─ as both, changes in temperature in the form of thermal energy being lost in a process (as in a steam engine), as well as changes in volume of a system (exploding steam engine) would subsequently change the *free energy* of a system as such (we can also recommend this paper by Gottwald and Braun (2020) on this topic).
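The relation between free energy, inner energy and entropy can be made concrete with a toy system. The sketch below assumes the textbook form F = U − T·S, a handful of made-up energy levels, and Boltzmann’s constant set to 1. None of these numbers come from this tutorial.

```r
# Hypothetical energy levels of a small system (arbitrary units),
# with Boltzmann's constant set to 1 for simplicity.
energies    = c(0, 1, 2, 5)
temperature = 1

# Boltzmann distribution over the states
Z = sum(exp(-energies / temperature))         # partition function
p = exp(-energies / temperature) / Z

inner_energy = sum(p * energies)              # U: expected energy
entropy      = -sum(p * log(p))               # S: entropy in nats
free_energy  = inner_energy - temperature * entropy   # F = U - T*S

# For the Boltzmann distribution, F equals -T * log(Z)
c(F = free_energy, minus_T_logZ = -temperature * log(Z))

# Any other distribution over the same states, e.g. a uniform one,
# yields a larger value of U - T*S
q = rep(1 / length(energies), length(energies))
free_energy_q = sum(q * energies) + temperature * sum(q * log(q))
free_energy_q >= free_energy
```

The last comparison is the thermodynamic counterpart of the variational idea: among all candidate distributions, the equilibrium distribution is the one with the lowest free energy.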

From a physical perspective a so-called *Stirling engine* is a very good example to understand what free energy is and how it converts to work:

The cylinder can be considered our system (exchange of heat). The heat at the bottom of the cylinder (red) heats up the air (gas) within it, which means that the atoms essentially accelerate and bump against the displacer plate (pink). In other words: the entropy rises, and with it the kinetic energy of the gas atoms. The atoms then push against the displacer plate (pink) in a way that makes it move up. At the top of the cylinder the air and the displacer cool down again, making the displacer move down again, since the pressure decreases. This also happens due to the involved mechanics that eventually push the displacer down again (a kind of competing difference). The pressure rises beneath the displacer plate, also due to the compression of the gas beneath the displacer plate as it moves down, and therefore the displacer eventually moves up again after some time, resulting in a cycle of the displacer being pushed up and down, up and down…. You can see that the structure is a little more complex, since there is another smaller cylinder (green) that also transfers energy into mechanical work, but with a slight latency, helping to push the displacer plate downwards again (so no constant volume in the example above). **If the above system were in an equilibrium state, such that no pressure is built up in the cylinder (no temperature difference), nothing would happen.** We will later see that the pressure that gets converted into mechanical work can be looked at as so-called free energy. **Note that the above also works the other way around:** if the bottom of the cylinder were cold (ice cube at the bottom) and the upper part of the cylinder were warmer (room temperature), the same thing would happen!

In the context of information theory or in general Bayes’ inference, this “amount of work” can also be seen as the amount of *effort* or *cost* of a system, especially when we keep in mind that the information theoretic perspective refers to statistical *inference* as a potentially costly process of gaining information (our intractability problem refers to that as well): **The less information gained, the more I knew already, and the less costly an inference was, e.g., the closer my prior had been matching the likelihood already, or the closer my approximate posterior will match the actual posterior.** Before we get back to our intuition on abductive inference, let us first look at the physics formula for free energy. **We know the relation between physics and abductive inference may not be all that clear yet, but we promise we will soon get there!**

The formula for the *Helmholtz Free Energy*, i.e., the maximum usable part of the energy

*inner* energy of a system, which is also an information theoretic quantity, as well as *free energy understood as information theoretic quantity* involves *constant temperature*, as well as volume (at least the basics). In other words, similar to the Boltzmann’s constant

At this point it made sense to me to recall the units used in physics to express

Inner energy = *Energy (J)* / Mass (kg)

Entropy = *Energy (J)* / Temperature (K)

*Both* *quantities* actually reflect on the *energy* of a system, just in relation to different factors ─ in a banal sense: this is the reason why the formula goes “*energy*. Recalling the units also sheds light to what inner energy reflects on when representing a quantity for the kinetic energy of the movement (Energy in Joule) of particles within a system with respect to their mass as something that can be used to do work with, or represents a potential cost in some way when looked at its dependent relation to the entropy of a system. Same goes for the disorder within a system that results from such movements (Energy in relation to the Temperature) and indirectly how it may change towards a state of maximum disorder as a *result* of the (change of the) inner energy of a system, as we will see below.

As mentioned, the relationship between free energy, inner energy and entropy relies on the second *law of thermodynamics,* stating that a system preferably occupies a state over time that has **maximum entropy** given the inner energy. We know what maximum entropy is in classic information theory: it states that the entropy is at its maximum in the case of equal probability of each microstate, which also makes sense in terms of
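The information theoretic side of this statement is easy to check numerically. The sketch below computes surprisal and entropy (in bits) for three made-up distributions over four microstates. The helper names `surprisal_bits` and `entropy_bits` are hypothetical.

```r
# Surprisal of a single outcome and entropy as expected (average) surprisal
surprisal_bits = function(p) -log2(p)
entropy_bits = function(p) {
  p = p[p > 0]                 # outcomes with p = 0 contribute nothing
  sum(p * surprisal_bits(p))
}

# Three hypothetical distributions over four microstates
uniform = rep(1/4, 4)          # equal probability of each microstate
skewed  = c(.7, .1, .1, .1)
certain = c(1, 0, 0, 0)

sapply(list(uniform = uniform, skewed = skewed, certain = certain),
       entropy_bits)
```

The uniform distribution reaches the maximum of log2(4) = 2 bits, the skewed one stays below it, and the distribution without any uncertainty has an entropy of zero.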

With all of this in mind, we can now mathematically represent a part of the free energy equation above already: the entropy. Recall that our information theoretic entropy formula also represented the expected value of surprisal (the *average surprisal*), where surprisal is obtained just by taking the negative log of

As hinted above, the *principle of maximum entropy* can also be inverted in the sense of a *principle of* **minimum inner energy**, stating that in the case of

In essence, looking at the formula for free energy again, and without an idea how to formulate the *expected value* of *conceptually* *know already* that when the value for the inner energy *the actual* entropy/surprisal that can also represent a *set* “maximum entropy”. In other words, changes in the values of

**We can extract some facts from that above intuition from physics to build up some mathematical constraints for our formula of ****:**

As the entropy is considered constant, a value of

Conceptually and intuitively, it also wouldn’t make sense that the value of *below* zero, as there has to be *energy* in terms of information in a system, if there is information in terms of *entropy* in a system ─ so a value of *a divergence from the entropy itself* (

This is also consistent with our previous understanding of information theory in general, where negative quantities of information do not make a lot of sense, if information is in general something phenomenologically given, so to speak. From the perspective of abductive Bayes’ inference again, free energy can here be seen as the *effort* or *cost* of an inference made (work), since it implies that there is a divergence that has to be overcome to obtain an equilibrium state, so the above also holds for

From what we have learned conceptually so far, we can derive our first *inequality*:

The above is the case, since the values of **In case this is all going too fast, next we will look at some familiar numerical examples!**

**Let us recall some of what we already know on that matter of inequalities from classic information theory:** here is an inequality we have encountered before. The code below entails all the code we need from Information Theory III on Markov chains:

```
### We will again use the last Markov chain example
### from Information Theory III:
# install.packages("markovchain")  # if not installed yet
library(markovchain)

MessageABC = c("A", "B", "C")
MessageABCTransMatrix = matrix(c(.0,.8,.2,
                                 .5,.5,.0,
                                 .5,.4,.1),
                               nrow = 3,
                               byrow = TRUE,
                               dimnames = list(MessageABC, MessageABC))
MCmessageABC = new("markovchain", states = MessageABC,
                   byrow = TRUE,
                   transitionMatrix = MessageABCTransMatrix,
                   name = "WritingMessage")
markovchainSequence(n = 20, markovchain = MCmessageABC, t0 = "A")

# Plot Markov Chain
plot(MCmessageABC, edge.arrow.size = 0.1)

# We will quickly write a function adding a tiny
# value to our inputs (avoids log2(0) = -Inf):
bit_log_nonzero = function(x) {
  log2(x+2^(-16))
} # End of function

### Joint matrix:
steady = steadyStates(MCmessageABC)
trans_mat = as.matrix(MessageABCTransMatrix)
# Initialize empty matrix:
joint_mat = matrix(0, ncol = ncol(trans_mat), nrow = nrow(trans_mat))
for (i in 1:length(steady)){
  for (j in 1:ncol(trans_mat)){
    joint_mat[[i,j]] = steady[[i]]*trans_mat[[i,j]]
  } # End for j
} # End for i

### CONDITIONAL ENTROPY H(y|x) (AMTC p. 11)
# Below we will work with the numbers of our last Markov chain example:
EntropyPOST = -sum(joint_mat*bit_log_nonzero(MessageABCTransMatrix))
# [1] 0.9340018 = H(y|x)

# ENTROPY OF A SINGLE EVENT OF A JOINT:
# (marginals: p(x) = row sums, p(y) = column sums of the joint)
p_x = rowSums(joint_mat)
p_y = colSums(joint_mat)
EntropyX = -sum(p_x*bit_log_nonzero(p_x))
EntropyY = -sum(p_y*bit_log_nonzero(p_y))

# H(y) is greater than or equal to H(y|x):
EntropyY>=EntropyPOST
```

The above essentially states that the amount of information I need to infer on a state, only having the prior at hand, is *greater* or equal to the amount of information entailed in the inference on a state given an observation **In other words: I know at least as much as *before* (prior), *after* I have been in joint, or have encountered an event (likelihood) and updated my model (posterior)**, which relies on several notions of physics that we have encountered before, e.g., information can’t be lost within a process (of inference), or gaining information can be expressed as cost (

From this we can also make sense of the following:

```
# Information gain: H(y) - H(y|x) >= 0
EntropyY-EntropyPOST >= 0
```
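The difference H(y) − H(y|x) is also known as information gain or mutual information, and it can equally be written as a KL divergence between the joint distribution and the product of its marginals. The following self-contained sketch checks this on a small made-up joint distribution; `joint_xy` and its numbers are assumptions and independent of the Markov chain example above.

```r
# Hypothetical joint distribution P(x, y): rows = x, columns = y
joint_xy = matrix(c(.30, .10,
                    .05, .55), nrow = 2, byrow = TRUE)
p_x = rowSums(joint_xy)                # marginal P(x)
p_y = colSums(joint_xy)                # marginal P(y)

# Entropy H(y) and conditional entropy H(y|x), both in bits
H_y = -sum(p_y * log2(p_y))
cond_y_given_x = joint_xy / p_x        # P(y|x): each row divided by P(x)
H_y_given_x = -sum(joint_xy * log2(cond_y_given_x))

# Information gain as a difference of entropies ...
info_gain = H_y - H_y_given_x

# ... and as KL divergence between the joint and the product of marginals
kl_joint_vs_indep = sum(joint_xy * log2(joint_xy / outer(p_x, p_y)))

c(info_gain = info_gain, kl = kl_joint_vs_indep)
info_gain >= 0
```

Both routes give the same non-negative number; it is zero only when x and y are independent, i.e., when observing x tells us nothing about y.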

We can actually relate this to our free energy formula, including the fact that *free* energy would have to follow the same rules, such that:

and

This again says that energy itself in any sense, inner or free, cannot be less than zero. This is also what is referred to as **free energy being an upper bound on entropy**, and is mathematically derived from

We can also derive an *equality* as a special case of the above, where:

From which we can derive that in such a case:

The above represents the case of *minimum inner energy*, *therefore* automatically maximum entropy.

We could now create a similar situation *just with the entropy* *in relation to itself (since *

Looking closely at the formula reveals a surprising chance to bypass the evaluation of the model evidence itself by using a simple but sophisticated *variational* trick (it goes back to *variational methods* in statistical mechanics, introduced by Richard Feynman, see Statistical Mechanics, p. 86f. (1972)).

From what we know so far,

Let us now do some rearrangement. **We will also add some intentionally redundant terms, which represent how far the free energy deviates from the actual entropy when the approximate posterior deviates from the actual posterior.** Note that the below just represents alternative arrangements of the same formula:

Combining the arrangement from the last two lines results in:

Here we just take the log *into* the sum:

The last line represents the *expected value notation* of the previous formula:

I have crossed out everything that was just redundantly added (exception in the last line), but see how we now got the posterior right in front of us? This will lead us to an important trick, that brings us beyond the special case, where

Let us quickly compute the above, to see if the equations hold, before reflecting on how to fully represent our free energy formula, resulting in

```
#### Generative Model
prior = c(.5,.5); likelihood = c(.8,.2)
joint = prior*likelihood
# True posterior (here calculated via Bayes' rule; for
# simulations think of a supervised situation,
# so the true state is known):
modelevidence = sum(joint)
Truepost = joint/modelevidence
# Expected model evidence = Entropy
# In our case Entropy and surprisal are equivalent.
Entropy = -sum(modelevidence*log(modelevidence)+((1-modelevidence)*log(1-modelevidence)))
Surprisal = -log(modelevidence)
Temperature = 1
# Going through all the lines:
# H = H
Entropy==Entropy
-log(sum(joint))==Entropy
# The below does not exactly work with vectors/matrices.
# For a single value result use: -log(.5*.8/.8)
-log(Truepost*modelevidence/Truepost)==Entropy
-log(sum(Truepost*(Truepost*modelevidence/Truepost)))==Entropy
# With expected value notation, i.e., average surprisal
-sum(Truepost*log(Truepost*modelevidence/Truepost))==Entropy
```
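Written out, the lines checked in the code above could look as follows. This is a reconstruction from the R code, not the original typesetting; the symbols $o$ (observation) and $s$ (state) are assumed here, following the usual notation of the ATUT, and $H$ denotes what the code calls `Entropy`:

$$
\begin{aligned}
H &= -\ln p(o) \\
  &= -\ln \sum_{s} p(o, s) \\
  &= -\ln \frac{p(s \mid o)\, p(o)}{p(s \mid o)} \\
  &= -\ln \sum_{s} p(s \mid o)\, \frac{p(s \mid o)\, p(o)}{p(s \mid o)} \\
  &= -\sum_{s} p(s \mid o)\, \ln \frac{p(s \mid o)\, p(o)}{p(s \mid o)}
\end{aligned}
$$

The fraction is the redundant term: it reduces to $p(o)$, so every line equals $-\ln p(o)$ as long as the true posterior $p(s \mid o)$ is used.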

All of the formulas above hold for the case of

Let us now substitute *Helmholtz Free Energy* formula, such that:

From what we know, we could try to extract the entropy from our

The latter line may look a little weird, but recall that *and* the logarithmic rule we applied ** all involve a subtraction for their own!** Also note that

```
# E[E] - H = H (for the example from ATUT p. 4)
Energy = -sum(Truepost*log(Truepost/Truepost))
HelmholtzFE = Energy-(-Temperature*Entropy)
# F >= H
HelmholtzFE>=Entropy
# F = H (minimized FE)
HelmholtzFE==Entropy
```

The important trick that is hidden in the formulas now is, that we can *approximate our posterior* and ** at the same time** are able to indirectly see how far our approximation

**This makes it conceptually look like that minimizing the value of** **will eventually lead us to a point at which we are equal to our** **, or at least as close to it as possible ─ the latter being enough for our approximation. Minimizing free energy therefore maximizes the model evidence of our approximate version of the actual** **, which will heuristically reveal to us the ideal approximate posterior that fits the true posterior, i.e., the true model evidence.** This is simply because that when our approximate **as the ***free energy is an upper bound*** on entropy and always greater than zero.** In other words: however far away we will speculate, we will never lose information in the sense of being below the actual model evidence.

Mathematically spoken, the process of minimizing free energy can be looked at as performing *gradient descent*. The code below for the example at ATUT p. 5 shows how the value of *recognition distribution*), denoted

The very left term on the left-hand side of the last line can also be understood as the KL divergence or relative entropy, i.e., how far does the entropy deviate from the actual entropy, when

Here is the code for the *variational free energy* example on ATUT p. 5, which is equivalent to the calculation done in the ATUT Matlab script “VFE_calculation_example” (see Ryan Smith’s GitHub repository):

```
# Minimizing Free Energy:
# Example ATUT p.5
Qs1 = c(.5,.5)
Energy1 = -sum(Qs1*log(Truepost/Qs1))
VFE1 = Energy1-(-Temperature*Entropy)
Qs2 = c(.6,.4)
Energy2 = -sum(Qs2*log(Truepost/Qs2))
VFE2 = Energy2-(-Temperature*Entropy)
Qs3 = c(.7,.3)
Energy3 = -sum(Qs3*log(Truepost/Qs3))
VFE3 = Energy3-(-Temperature*Entropy)
Qs4 = c(.8,.2)
Energy4 = -sum(Qs4*log(Truepost/Qs4))
VFE4 = Energy4-(-Temperature*Entropy)
# Plot that makes clear what descending a gradient means conceptually:
plot(x = 1:4, y = c(VFE1, VFE2, VFE3, VFE4), type = "l")
```
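The four hand-picked candidates above can be generalized to a whole grid of approximate posteriors. The following is a minimal sketch that assumes the same generative model as before (prior `c(.5,.5)`, likelihood `c(.8,.2)`); the function names `vfe` and `kl` and the grid `q1_grid` are hypothetical helpers introduced only for this illustration:

```r
# Generative model from the example above (values taken from the text)
prior      <- c(0.5, 0.5)
likelihood <- c(0.8, 0.2)
joint      <- prior * likelihood   # p(o, s)
evidence   <- sum(joint)           # p(o)
true_post  <- joint / evidence     # p(s|o)

# Variational free energy of a candidate approximate posterior q(s)
vfe <- function(q) sum(q * log(q / joint))

# KL divergence between two discrete distributions
kl <- function(q, p) sum(q * log(q / p))

# Hypothetical grid of candidates q = (q1, 1 - q1)
q1_grid  <- seq(0.05, 0.95, by = 0.05)
vfe_grid <- sapply(q1_grid, function(q1) vfe(c(q1, 1 - q1)))

# Decomposition: VFE = KL(q || true posterior) + surprisal
q <- c(0.6, 0.4)
decomposition_holds <- isTRUE(all.equal(vfe(q), kl(q, true_post) - log(evidence)))

# The candidate with the lowest VFE on the grid
best_q1 <- q1_grid[which.min(vfe_grid)]
```

On this grid the minimum lies at the true posterior, where the VFE equals the surprisal `-log(evidence)`; everywhere else the KL term adds a positive amount. The curve can be inspected with `plot(q1_grid, vfe_grid, type = "b")`.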

Here is the classic formula including the inequality of *one* formula, i.e., without extracting the entropy from *lower* bound (ELBO) of the *negative* VFE is renowned in machine learning).

The divergence we want to overcome (minimize) is that between our generative model and that of the actual generative process in the world.

Drawing a relation back to Shannon’s work and the discussion we had on Weaver’s work and semiotics, we can see that an exact inference problem is turned into an optimization problem. In general, we hope we could show that active inference can be seen as a direct extension of Claude Shannon’s information theoretic methods of inference.

Here is the same formula, just using three variables:

Below we will have a brief look into the general structure of an active inference model, discussing the difference between generative process and generative model (which we both briefly mentioned before). This is also very well explained in the ATUT though. After that we will further look into the KL divergence, the relative entropy, as well as Jensen’s inequality.

As mentioned previously in this tutorial, the joint probability is also referred to as the *generative model* and is distinguished from the **generative process**. This, again, reflects on the epistemics of inference:

A generative model, as discussed above, is constituted by *beliefs* about the world and can be inaccurate (sometimes referred to as ‘fictive’). In other words, explanations for (i.e., beliefs about) how observations are generated do not have to represent a veridical account of how they are actually generated. (ATUT, p. 4)

Apart from the possibility of counterfactual inference within causal inference (compare Corcoran, Pezzulo, Hohwy (2020), Pearl (2009)), this also addresses the fact that we do not experience or model the world on a 1:1 scale, so to speak (compare the previously mentioned video tutorial by Ramstead (2020), and ATUT p. 4). We are only interested in specific information related to a model of the world; therefore, active abductive inference is also cast as attention (Friston (2010)). In other words: a 1:1 map is not necessary to get from one point to another. More specifically, not every detail is needed in order to successfully relate a map to the actual trail it represents in the world. Such a map will not answer every question we could have about the world, but it will answer specific ones (Ramstead (2020)).

In contrast, the **generative process** refers to what is actually going on out in the world – that is, it describes the veridical ‘ground truth’ about the causes of sensory input. For example, a model might hold the prior belief that the probability of seeing a pigeon vs. a hawk while at a city park is

Both the generative process and the generative model represent joint probabilities. In that sense, beliefs in the form of policies, or anything related to beliefs being something internal, are actually not fully demarcated, but again stand in a joint relation with each other: a being in the world. In other words: active inference and the free energy principle inherit a phenomenological perspective. Including action as a way to change the state of the world, a graphical representation of our example above would look like this (similar to ATUT, p. 5):

I have chosen the example of a ball falling to the ground, with gravitation as the model, because it entails all the ingredients that Galileo used for his first measurements of the time an object needs to land when falling from a certain height. This is not supposed to be a random fact that I want to include in this tutorial, but an example of hierarchical abductive inference that applies not only to our intuition, but also to abductive inference as an actual scientific method, which also entails deduction and induction.

To perform his measurements, Galileo used a slide, as it made objects “fall” slower and more steadily. At that time, without modern technology, it was hard to measure the speed of a falling object. In order to provide a scaled measure, water was filled into a glass for as long as the ball was rolling down the slide (just think of tick marks for the amount of water after 1 m, 10 m etc., so that the time that has passed can be measured separately). After multiple attempts the measurement turned out to be very precise, indicating no, or at least no measurable, differences in gravity over time.

The structure of such experimental projects intuitively appears to be always the same. At first there may be a bunch of questions, an indirect hypothesis, e.g., things fall down, things fall down steadily, the mass makes things fall faster or not… Next up is the likelihood, in terms of an event that checks the hypothesis, i.e., checks mere prior assumptions (abduction). An experimental setting is designed and built in order to make the event comparable to a previous event. In the case of Galileo this meant that the same ball, the same slide etc. had to be used for every trial (deduction). Eventually Galileo obtained a whole set of events and was able to compare them with each other, finding that the fall time is stable under comparable conditions and can be generalized in mathematical formulas (induction), which can then be tested as an approximation for, e.g., predicting the behavior of other objects. This and some other experiments performed by Galileo eventually resulted in a measurement of the gravitational acceleration.

All of this is supposed to roughly demonstrate an intuitive ontogenesis of science and the plausibility of the development of scientific methods by humans, based on abductive inference being the prime form of every inference. This also shows why C. S. Peirce’s pragmatist concept of abductive inference and semiotics can be understood as a phenomenological approach to the description of the structures of inference as such. Active inference, though, was much more influenced by the work of Helmholtz on perception as a kind of inference, as well as by his formula for *free energy*, which we will get to know soon in its information theoretic form.

The *Kullback-Leibler* *divergence* is also known as the *relative entropy, Bayesian surprise* or *information gain* and represents

**Example:** Imagine having a set of colored balls that are spread within a space. The space is divided into two parts, and the question is now how good the split is, i.e., how likely it is to find each of the colors on one of the two sides *compared to before*, i.e., in the space without a split. As usual we will reflect on this using information theory. We have 5 blue and 5 green balls. The formula for the entropy is our expected value of either green or blue, which can be written as:

The code below follows our classic entropy formula in terms of the negative sum of the weighted log of the probability of each color appearing, when sampling from it.

```
# We will use a matrix to express p_i:
pColor = matrix(c(.5, .5))
# H_before:
Hbefore =-sum(pColor*(log2(pColor)))
```

Let us now look at the split:

```
# EXAMPLE:
# Imagine having 5 green and 5 blue balls:
plot(x = 1,
xlab = "X Label",
ylab = "Y Label",
xlim = c(0, 3),
ylim = c(0, 3),
main = "Blue and green balls",
type = "n")
# Blue balls
points(1,2, col = "Blue")
points(0.4,2.8, col = "Blue")
points(1.2,2.2, col = "Blue")
points(1.7,0.7, col = "Blue")
points(0.9,0.6, col = "Blue")
# Green balls
points(1.6,1.6, col = "Green")
points(1.9,2.6, col = "Green")
points(2.3,0.6, col = "Green")
points(2.8,2.85, col = "Green")
points(2.4,1.6, col = "Green")
# Split at X = 1.5.
abline(v=1.5, col="black")
```

The split at a certain point can be thought of as a kind of change in the entropy due to inference, but also as, e.g., evaluating compressed data. Concerning our use in active inference the relation between the entropies before and after the split can be thought of as changes in *the approximate* *and the actual* *posterior* that leads to

On the left side, all *four* of the balls are blue:

On the right side we find *five* green and *one* blue ball, together *six*.

```
# Hleft is simple
Hleft = -sum(1*log2(1))
# Hright probs as vector (1 blue, 5 green)
probRight = c(1/6,5/6)
Hright = -sum(probRight*log2(probRight))
```

We can now evaluate the quality of the split by weighting the entropies by the share of elements on the respective side via:

```
# Weighted entropy after the split (4 of 10 balls left, 6 of 10 right):
Hsplit = .4*Hleft + .6*Hright
# KL divergence / relative entropy:
Informationgain = Hbefore - Hsplit
```
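The same calculation can be wrapped into two small functions that work directly on the counts of balls. This is only a sketch; `entropy_bits` and `information_gain` are hypothetical helper names, and the counts are the ones from the example above:

```r
# Entropy in bits of a vector of counts
entropy_bits <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]              # 0 * log2(0) is treated as 0
  -sum(p * log2(p))
}

# Entropy before the split minus the weighted entropy after the split
information_gain <- function(before, parts) {
  weights <- sapply(parts, sum) / sum(before)
  h_after <- sum(weights * sapply(parts, entropy_bits))
  entropy_bits(before) - h_after
}

# Counts from the example: 5 blue and 5 green balls, split at X = 1.5
before <- c(blue = 5, green = 5)
parts  <- list(left  = c(blue = 4, green = 0),
               right = c(blue = 1, green = 5))

gain <- information_gain(before, parts)
```

With these counts the gain comes out at roughly 0.61 bit, which matches the rounded hand calculation above. Working with counts avoids hard-coding rounded entropies such as `.65`.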

As said, the relative entropy now represents a relation between the entropies of each state, i.e., before and after the split. In other words, and related to active inference: the *inner* energy represents the divergence from the actual *minimum inner* energy (*when the maximum entropy is given*), *after the posterior was approximated*. Relating this to classic exact Bayes inference, the difference between prior and posterior can be reflected on in the same way and is also termed *Bayesian surprise*.

This part focuses on the exact mathematics behind the inequality we encountered above: *Jensen’s inequality*. We will follow two more videos by Ben Lambert on the intuition and the proof of Jensen’s inequality. In general, it is hard to find good introductions to that topic. Here I will again provide the code for the example given by Ben Lambert.

To demonstrate Jensen’s inequality, we will play another round of dice (*payoff* value from playing the game. The game only consists of “winning” so to speak, so think of similar games such as *Yahtzee*. The *payoff* in our particular game will be calculated by *squaring the rolled dice value* (

Mathematically, the payoff can be understood as a parabola function, which we will call $g(x) = x^2$:

```
# Define function g(x) = x^2
g <- function(x) (x^2)
# Plot of g(x) ranging from x = -1 to x = 6.
curve(g, -1, 6, ylab = "g(x)")
abline(v = 0)
# Points marking all possible payoffs
points(1,g(1)) # Payoff for x = 1
points(2,g(2)) # ... for x = 2
points(3,g(3)) # ...
points(4,g(4)) #
points(5,g(5)) #
points(6,g(6)) #
```

A special characteristic of our function $g(x)$ is its *convexity*. I have found several ways to define the characteristics of a convex function:

A **rather** **graphical approach** of defining convexity involves choosing two random points in the space *above* the function, the so-called *epigraph*. When the points are connected via a straight line, the line will never cross the function.

Another approach is **describing the graph of a convex function** as continuously increasing its slope. A more mathematical approach to this would involve evaluating the *derivatives* of our function $g(x)$.

We will obtain the first and second derivative via these rules:

such that *in our case*:

Both derivatives point to convexity: the second derivative is a single positive value, and the first derivative grows with $x$. In other words, the *slope is constantly rising*.
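The derivatives can also be obtained in R itself. The sketch below uses the base R function `D()` for symbolic differentiation; the names `g_expr`, `g1` and `g2` are chosen for illustration only:

```r
# Symbolic derivatives of g(x) = x^2 via base R
g_expr <- quote(x^2)
g1 <- D(g_expr, "x")   # first derivative:  2 * x
g2 <- D(g1, "x")       # second derivative: 2

# Evaluate both at the possible dice values
x <- 1:6
slope     <- eval(g1)  # slope of g at x = 1, ..., 6
curvature <- eval(g2)  # constant and positive

# The slope increases from one dice value to the next
all(diff(slope) > 0)
```

A constant positive second derivative is exactly what "the slope is constantly rising" means for our parabola.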

**Now that we have a rough idea of what a convex function is**, let us see what Jensen’s inequality actually states. **Don’t worry, it may seem like a lot, but we will again decompose the inequality step by step.** In relation to our game, **Jensen’s inequality states** that the *expected payoff of playing the game*, denoted as the expected value of a function, $E[g(X)]$, is **greater than or equal** to the *payoff from the expected value of* $X$, i.e., $g(E[X])$.

Let us first evaluate our expected value of $X$, i.e., of the dice value:

`EX = sum(1/6*(1:6))`

The payoff from the expected value, i.e., the function $g$ applied to $E[X]$, is then:

`gEX = g(EX)`

**Translated, this delivers the answer to the question:** considering the *weighted average* of the possible dice values, what would the *payoff of that average value* be? Resulting in 12.25.

The left-hand side of Jensen’s inequality above represents the expected value of a function, i.e., our expected payoff from playing the game. The formula deviates only slightly from the one for $E[X]$: instead of weighting each dice value $x$ itself, we weight the *payoff* of that value, $g(x)$:

`Egx = sum(1/6*g(1:6))`

**Translated, this says** that the weighted average payoff from playing the game will be at a value of 15.16667, **and delivers an answer to the question**: if I play this game, what payoff will I get on *average* each time I play?

```
# Jensen's inequality
# E[g(x)] ≥ g[E(x)]
Egx >= gEX
```
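The comparison above can be turned into a small reusable check. The following sketch assumes a fair die; `jensen_gap` is a hypothetical helper that returns E[g(x)] minus g(E[x]), which must not be negative for a convex function:

```r
# Jensen gap: E[g(x)] - g(E[x]) for a discrete distribution
jensen_gap <- function(x, p, g) sum(p * g(x)) - g(sum(p * x))

x <- 1:6           # possible dice values
p <- rep(1/6, 6)   # fair die

# Convex payoff g(x) = x^2: the gap is positive
gap_square <- jensen_gap(x, p, function(x) x^2)

# Convex function -log(x), the form that appears in the free energy bound
gap_neglog <- jensen_gap(x, p, function(x) -log(x))

# Linear function: the inequality becomes an equality (gap of zero)
gap_linear <- jensen_gap(x, p, function(x) 3 * x + 1)
```

For the squared payoff the gap equals the variance of the dice value, 35/12. The linear case shows why a straight line is the boundary between "greater" and "equal" in the inequality.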

Now we are going to look at our graph again, in order to reflect on the above:

```
# gEX and Egx at x=EX
points(EX, gEX, col="blue")
points(EX,Egx, col="red")
```

We can see that at our coordinate **Egx** is greater than **gEX** at **EX**, i.e., where

We will now draw a line through the origin (0|0) and the point (EX|Egx), which we will call $f(x) = a + bx$.

We can evaluate the slope of our function via:

```
# Slope b of f(x)=y=a+bx via two points:
P1 = matrix(c(0, 0)) # x = 0, g(x) = 0
P2 = matrix(c(EX,Egx)) # x = EX, g(x) = Egx
b = (P2[2]-P1[2])/(P2[1]-P1[1])
```

We can now evaluate the intercept $a$:

```
# We can now evaluate a by filling in a point,
# say P1, in y = a + b*x => a = y - b*x
a = P1[2] - b*P1[1]
# f(x) = a + b*x = 0 + 4.33333*x
f = function(x) (a + b*x)
# add to plot
curve(f, -1, 6, add = TRUE)
```

We are now going to evaluate the *crossing points* graphically *and* mathematically to get a better overview. We will start with our new function *equality,* such that (first crossing point is at the origin (0|0)):

```
# In our case the value of x of the point where
# f(x) crosses g(x) is equal to the slope of
# f(x), so we can evaluate the value of y
# of our point via a shortcut.
y = b*b
# Crossing points:
# upper crossing point f(x) with g(x)
# where Egx=gEX!
segments(x0=0,y0=y, x1= b, y1=y, lty =3)
segments(x0=b,y0=0, x1= b, y1=y, lty =3)
text(x=5.3,y=18, label="E[g[4.33]]=g[E[4.33]]", srt = 3, col = "darkgreen")
text(x=-.6,y=y, label="g(x)=g(4.33)", srt = 3, col = "darkgreen")
text(x=b+.5,y=0.2, label="x=4.33", srt = 3, col = "darkgreen")
```
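The crossing points of the line and the parabola can also be computed instead of being read off the plot. The sketch below recomputes `EX`, `Egx`, `a` and `b` so that it runs on its own; setting g(x) = f(x) gives the quadratic x^2 - b*x - a = 0, which is solved here with the base R function `polyroot()`:

```r
# Recompute the quantities from above so the sketch is self-contained
EX  <- sum(1/6 * (1:6))     # expected dice value
Egx <- sum(1/6 * (1:6)^2)   # expected payoff
b   <- Egx / EX             # slope of the line through (0|0) and (EX|Egx)
a   <- 0                    # intercept

# g(x) = f(x)  <=>  x^2 - b*x - a = 0
# polyroot() expects the coefficients in increasing order of the power
roots <- sort(Re(polyroot(c(-a, -b, 1))))

# Payoff at the upper crossing point
y_upper <- roots[2]^2
```

The two roots are the origin and x = b, which is why the shortcut `y = b*b` in the code above works.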

Now we can check mathematically, if our assumptions are true:

```
# Check P1:
EXP1 = sum(1/6*(0))
gEXP1 = f(EXP1)
EgxP1 = sum(1/6*f(0))
# Is exactly equal?
EgxP1 == gEXP1
# Check P2:
EXP2 = sum(1/6*(b))
gEXP2 = f(EXP2)
EgxP2 = sum(1/6*f(b))
# Is equal (up to floating-point tolerance)?
isTRUE(all.equal(EgxP2, gEXP2))
```

Now we will add the other points and also add some annotation and lines to the plot to get further orientation over our results:

```
# Add rest of the points and some annotation:
# (EX|Egx)
segments(x0=0,y0=Egx, x1= EX, y1=Egx, lty =3)
segments(x0=EX,y0=0, x1= EX, y1=Egx, lty =3)
text(x=-.6,y=Egx, label="E[g[x]]", srt = 3, col = "red")
text(x=EX+.35,y=0.2, label="x=E[X]", srt = 3, col = "black")
# (EX|gEX)
segments(x0=0,y0=gEX, x1= EX, y1=gEX, lty =3)
segments(x0=EX,y0=0, x1= EX, y1=gEX, lty =3)
text(x=-.6,y=gEX, label="g[E[x]]", srt = 3, col = "blue")
```

I hope this chapter helped you to understand and get some orientation on Jensen’s inequality from a mathematical and graphical perspective. Also check out Oleg Solopchuk’s tutorial on Medium on the FEP for some more perspectives on Jensen’s inequality. His tutorial has very nice visualizations, but a much faster pacing at some points (and it also includes aspects such as conditional independence etc.).

In the next part of this series, we will also have a look at the concept of Markov blankets and other more advanced aspects of active inference. **Nevertheless, you should now already be prepared to proceed to the actual ATUT.**