[R] Higher Spheres: Information Theory V: Mathematical Basics of Active Inference — Variational Bayes', Relative Entropy / KL-Divergence and Jensen's Inequality

Intermediate to Advanced Stat-o-Sphere

Published on Aug 26, 2023

Follow this link to our previous part IV of this series, if you wish, or have a look at other tutorials of our collection Stat-o-Sphere.

Review: This is a pre-released article, and we are currently looking for reviewers. Contact us via [email protected]. You can also directly send us your review, or use our open peer-review functions via pubpub (create an account here). In order to comment mark some text and click on “Start discussion”, or just comment at the end of an article.

Fig. How to start a discussion. A pubpub account is needed to do so. Sign up here. Also have a look at articles all about our Open Science Agenda.

Pub preview image generated by Mel Andrews via midjourney.

Corresponding R script:


Welcome to the fifth part of our series on information theory (NOT PART OF THE Summer of Math Exposition contest, since it was published after 18.08)! You have come quite a way to get here, and we really hope it has been worth it for you so far. Information theory is a fascinating topic and a good example of how formulas and concepts that turn out to be rather simple in the end can make quite a career.

Note that you have again entered the intermediate to advanced Stat-o-Sphere. This tutorial requires knowledge of the first part (Bayes’ rule) and especially of the second part (entropy/surprisal) of this series on information theory (we will get to advanced Markov processes, such as Markov blankets, some other time). In this tutorial we extend our understanding of both Bayesian inference and statistical thermodynamics. In the end, the material below will mostly benefit our intuitive understanding of Bayesian inference, since we will learn more mathematics that helps us better represent certain aspects of our current intuition about it. So “advanced” again does not mean that this tutorial is particularly harder to get through than most of the others we have published so far; it does, however, require at least a stable intuition for conditional probability / Bayes’ rule and an understanding of why statistical thermodynamics is related to it.

Fig. Use the pubpub contents function on the upper right of the site to orientate within the script. The autogenerated output of documents is unfortunately not well tuned and nothing we have control over.

For some orientation for those who have never heard of active inference, predictive processing or the free energy principle: it can be understood as a form of “machine learning / AI”, but unlike, e.g., ChatGPT, active inference is not a merely passive framework; it involves both action and perception. ChatGPT output can thus be understood as mere “passive perception” in some way, whereas active inference adds a twist to the classic modelling of perception. Isomura et al. (2022) also showed that canonical neural networks can be cast as performing active inference, which makes active inference mathematically universal in that respect. However, the basic concept originally derives from neuroscience, where it was developed to model action and perception in humans (mostly driven by schizophrenia research), as mentioned in earlier parts of this series (Friston 2010, amongst earlier formulations).

Motivation for a Student’s Tutorial on Active Inference

Recent advances in computational neuroscience around concepts such as predictive processing, active inference and the free energy principle have gained increasing attention in the last couple of years, as they provide vast interdisciplinary and intuitive explanatory power for both neurocognitive processes and experience itself.

Due to its growing popularity, an extensive and amazing effort has been put into increasing its accessibility to a wider public of researchers, both technically (programming a model) and conceptually (scientific and (neuro)philosophical background and discourse). Especially “A Step-by-Step Tutorial on Active Inference and its Application to Empirical Data” (ATUT), by Ryan Smith, Christopher Whyte and Karl Friston (2022) provides a thorough and detailed introduction into actually computationally modelling neurocognitive processes by oneself.

Using active inference to do so involves simulating behavioral tasks and predicted neural responses, which can subsequently be compared to empirical data gained from running the same behavioral tasks with actual participants. The ATUT requires only a rather minimal background in mathematics and programming and provides all the information needed, accompanied by heavily commented code for Matlab (Matrix Laboratory), to fully program behavioral tasks oneself. A Python package called pymdp (Python Markov decision process) has been published as well, explicitly accompanying the tutorial’s Matlab code and making it available to open access communities. This tutorial series especially provides you with background knowledge on information theory and its relation to thermodynamics, which is not covered in the ATUT.

Fig. Output of the simplified simulation Matlab script corresponding to the ATUT by Smith, Whyte, Friston (2022). The ATUT offers a multi-armed bandit explore/exploit task as an exemplary application of active inference.

I personally also started to convert some of the Matlab scripts into R a while ago (see this Github repository). I definitely want to bring this further in the future, since it is the perfect way to learn how active inference works in action (thanks to Ryan Smith and Christopher Whyte for some feedback on the R conversion!).

In case you are also interested in making active inference run on R, let me know if you want to team up for the project! You can contact me via my email: [email protected]

The goal of this “student’s tutorial” is to accompany the ATUT by providing a detailed and slow-paced overview of the mentioned ‘minimal background’ in programming and mathematics required to get through it (especially for people outside of computer science related fields). As mentioned in previous parts, this series on information theory was initially all about active inference. Since the basics mostly consisted of classic information theoretic concepts, I split the initial version apart and added and rewrote a lot of sections, in order to serve a more general purpose for our statistics tutorial project “Stat-o-Sphere” at NOS (e.g., as a basis for understanding the Akaike and Bayesian information criteria (AIC/BIC), MCMC…).

The motivation for this tutorial arose after I had built up a collection of notes, code and math tutorials myself, in order to wrap my head around active inference, due to my personal interest in computational neuroscience and abductive inference (especially from a linguistic/semiotic and clinical perspective). This collection contained scattered answers to the most important questions I asked myself, often still lacking the requirements mentioned above.

We hope this series will serve as a decelerated support and addition to the above-mentioned ATUT for people (students and researchers new to the field) who want to understand and apply “Higher Sphere” methods of inference.

There has also been an open access MIT publication on active inference by Thomas Parr, Giovanni Pezzulo and Karl Friston (2022) that we can recommend! It also comes with some code examples for Matlab.

Fig. Cover of the open access book on active inference, published via MIT Press. The ring including the formula of the FEP is an intentional and self-ironic reference to “The Lord of the Rings”. The free energy principle can be understood as a generalization of active inference beyond “Mind, brain and behavior” that successively turned into a “theory of everything” based on mathematical physics. Karl Friston also invented SPM, an fMRI software to interpret data from neural activity, and therefore had a huge influence on neuroscience. The community around active inference / FEP is quite diverse and embraces open democratized discourse around it. As a neuroscientific theory, active inference is in general popular and successful, since it can, e.g., model decision processes (behavior/cognition) and neural activity at the same time by just rearranging the same formula, which is a great advantage over other methods that, e.g., just model neural activity.

The fifth and current part of this series entails most of the major basic mathematical / physics knowledge needed to understand active inference / the free energy principle / predictive processing in computational neuroscience. In the next part of this series we will give you a short introduction on active inference / FEP in general and most of all supply you with extra material on it (paper recommendations, videos, interviews etc.).

You may wonder why we start with mathematics this time? What happened to our triad of Intuition/Concept—Mathematics —Code? Well, in fact we have already covered most of the intuition we need to understand active inference / the free energy principle, since they can most of all be looked at as a direct extension of information theory:

Fig. Variational methods, most of all active inference / the free energy principle (see above), can be understood as a direct progression of information theory and statistical mechanics/thermodynamics, also describing dynamical systems. Note that the third level could be represented as involving so-called precision weighting. Intuitively this means something like asking how effective/important etc. a message is supposed to be (relying on some policy). We will get there in further parts of this series...

Especially the rather “self-experiential” approach of discussing abductive inference as a way to understand the basic idea behind “Bayesian-Brain” concepts has supplied us with a stable intuition for current computational neuroscientific approaches to modelling neural activity as well as cognition and decision making. The below will just build upon what we have gone through so far.

1 Bayes’ Rule with More than Two Dimensions

In order to understand why we need approximate Bayes inference methods to get to more complex applications of information theory and Bayes’ rule in general, we have to look at the formula of conditional probability / Bayes’ rule that entails three or more instead of only two variables and we will also introduce Bayes’ rule for continuous probability distributions.

Below you see the formula for classic conditional probability given two, three and n-dimensions / variables for discrete probability distributions.
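Written out under the standard definition of conditional probability, the three cases read as follows. This is a reconstruction; the indexed symbols B_1, ..., B_n for the n-variable case are an assumption of this sketch:

```latex
\begin{align}
P(A \mid B) &= \frac{P(A, B)}{P(B)} \\
P(A \mid B, C) &= \frac{P(A, B, C)}{P(B, C)} \\
P(A \mid B_1, \ldots, B_n) &= \frac{P(A, B_1, \ldots, B_n)}{P(B_1, \ldots, B_n)}
\end{align}
```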

To get a better overview of what changed, note that conditional probability involving three variables can be simplified via substitutions of B and C below:

The posterior probability with three variables can be read as:

“the probability of A given B and C”, or as

“the probability of A given a model of (or between) B and C”.

We can perform the same rearrangements as with regular conditional probability / Bayes’ rule, again via the chain and the product rule. However, since we could substitute P(B,C) with P(B|C)P(C) or with P(C|B)P(B) etc., it can all become a little hard to keep track of.

On the other hand, we can also literally think of the below as a simple extension by a second dimension with another variable (similar to how we discussed the binomial distribution in the first part of this series). So from the perspective of Bayes’ rule, you can just add “,C” to every probability term of the 2D form and you are good.
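To make the “just add C” idea tangible, here is a minimal R sketch. The joint distribution, its values and all variable names are made up for illustration; only base R functions (`array`, `apply`, `sweep`) are used.

```r
# Hypothetical joint distribution P(A, B, C) over three binary variables,
# stored as a 2 x 2 x 2 array (values are made up for illustration).
joint <- array(
  c(0.10, 0.05, 0.15, 0.10,   # C = c1: (a1,b1), (a2,b1), (a1,b2), (a2,b2)
    0.20, 0.05, 0.05, 0.30),  # C = c2: same order
  dim = c(2, 2, 2),
  dimnames = list(A = c("a1", "a2"), B = c("b1", "b2"), C = c("c1", "c2"))
)

# Marginal P(B, C): sum A out of the joint (sum rule).
p_bc <- apply(joint, c(2, 3), sum)

# Conditional P(A | B, C) = P(A, B, C) / P(B, C) for every (B, C) pair.
p_a_given_bc <- sweep(joint, c(2, 3), p_bc, "/")

# Each conditional slice P(. | b, c) has to sum to 1.
print(apply(p_a_given_bc, c(2, 3), sum))

# Chain rule check: P(B, C) = P(B | C) P(C) = P(C | B) P(B).
p_b <- apply(joint, 2, sum)
p_c <- apply(joint, 3, sum)
p_b_given_c <- sweep(p_bc, 2, p_c, "/")
p_c_given_b <- sweep(p_bc, 1, p_b, "/")
print(all.equal(sweep(p_b_given_c, 2, p_c, "*"), p_bc))
print(all.equal(sweep(p_c_given_b, 1, p_b, "*"), p_bc))
```

Storing the joint as an array with one dimension per variable means that every marginal is just a sum over the dimensions we are not interested in.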

We can also rearrange the above in order to have a look at the outwritten joint probabilities:

In the above P(B,C) and P(A,C) could also be replaced to represent Bayes’ rule (by applying the chain / product rule):

We could also reformulate the above in order to obtain one of the possible marginal probabilities:

Here is where it gets a little tricky!! :O For discrete probabilities, we will still get an answer, but the more variables we add, the more computational problems we will get. In technical terms, too many variables will result in an intractable marginal likelihood. The reason is that the more dimensions we add, the more values have to be summed out, and the number of terms grows exponentially with the number of variables (this is especially a problem with continuous variables).
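The exponential growth can be sketched in a few lines of R. The numbers of values and variables below are arbitrary assumptions, chosen only to show the trend: with k possible values per variable and n variables, k^n joint states have to be visited.

```r
# Number of joint states (terms to sum) for n discrete variables
# with k possible values each: k^n.
k <- 10                      # assumed number of values per variable
n <- c(1, 2, 5, 10, 20)      # assumed numbers of variables
terms <- k^n
print(data.frame(variables = n, terms_to_sum = terms))

# Brute-force illustration for a small case: enumerate all joint states.
states <- expand.grid(rep(list(1:3), 4))  # 4 variables with 3 values each
print(nrow(states))                       # 3^4 = 81 combinations
```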

1.2 The Problem of the Intractability of the Model Evidence

First of all, the mentioned boundary can be thought of as a problem of computability. Computability means having an algorithm that performs a calculation with limited resources and time. We will partially follow this short tutorial from Ben Lambert.

So from a mere pragmatic perspective, the more variables, the more relations, the more computing power is needed. Below you will see the formula for Bayes’ rule with n-variables:

Let us also quickly go through Bayes’ rule for continuous probability functions (which we haven’t done in any of our tutorials so far). The only formal difference is that we use an integral, ∫ ... dx, instead of a sum, Σ. Just keep in mind though that every probability is now represented via a continuous probability function, not a discrete vector of values as before.
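A minimal R sketch of the continuous case, assuming a Beta prior and a binomial likelihood (both chosen here only because the exact answer is known and can be used as a check). The model evidence is obtained by numerically integrating the joint density with base R's `integrate()`:

```r
# Assumed toy model: theta ~ Beta(a, b), data = k_succ successes in n_trials.
a <- 2
b <- 2
k_succ <- 7
n_trials <- 10

prior <- function(theta) dbeta(theta, a, b)
likelihood <- function(theta) dbinom(k_succ, n_trials, theta)
joint_density <- function(theta) likelihood(theta) * prior(theta)

# Model evidence: integrate the joint over all values of theta.
evidence <- integrate(joint_density, lower = 0, upper = 1)$value

# Posterior density: joint divided by the evidence.
posterior <- function(theta) joint_density(theta) / evidence

# The posterior should integrate to 1 ...
print(integrate(posterior, lower = 0, upper = 1)$value)

# ... and match the known conjugate result Beta(a + k, b + n - k).
theta_grid <- seq(0.05, 0.95, by = 0.05)
max_diff <- max(abs(posterior(theta_grid) -
                      dbeta(theta_grid, a + k_succ, b + n_trials - k_succ)))
print(max_diff)
```

With one parameter the integral is cheap. The trouble starts when the integral runs over many dimensions at once and no closed form exists.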

Before we get into more details of intractability, the problem of computability may become easier to grasp when we think of a Markov chain graph that has n nodes, not only three or so (compare the third part of this series). Summing over all of B could then theoretically take forever, importantly also for computers. That said, it will take a while until tractability becomes a real problem in the discrete inference case. The problem of summation becomes more prevalent when considering continuous functions. A short explanation of intractability can be found in the tutorial on active inference by Smith, Whyte and Friston (2022):

While Bayesian inference represents the optimal way to infer posterior beliefs within a generative model, Bayes theorem is computationally intractable for anything but the simplest distributions. […] the marginal likelihood (denominator) in Bayes’ theorem – requires us to sum the probabilities of observations under all possible states in the generative model (i.e., based on the sum rule of probability […] ). For discrete distributions, as the number of dimensions (and possible values) increases, the number of terms that must be summed increases exponentially. In the case of continuous distributions, it requires the evaluation of integrals that do not always have closed-form (analytic) solutions. As such, approximation techniques are required to solve this problem. (Smith, Whyte, Friston 2022)

The problem of intractability in general has a merely pragmatic dimension (efficiency), but also a deeper mathematical and computer scientific one, which relates to the so-called P vs. NP problem from complexity theory. There is a great tutorial by Jade Tan-Holmes on her channel “Up and Atom” that we can recommend if you want to get a first idea of that topic. We decided to proceed directly to presenting a solution for our problem.

So the big question is: can we find a way to work around and efficiently approximate the model evidence, in order to normalize our joint probability with it and update our model (obtain the posterior)? Since we have repeatedly compared Bayesian inference with “human” inference and related it to computational neuroscience, note that going through “everything”, i.e., every possible case, fact etc., when making a probabilistic decision is importantly not what we humans do, since it takes a lot of time (it is not pragmatic or even doable). We also wouldn’t want to do so, since we are always aiming for specific information that is relevant to us or to a specific matter. In other words, you do not use or need a 1:1 map down to every atom in order to find your way around a city or region (this comparison is made by Maxwell Ramstead and can be found here).

The latter also implies a biological perspective on inference and what it takes. This also entails boundaries in terms of metabolism and complexity of structures. Concerning the latter, minimizing free energy is also considered an attempt of an organism as a whole to overcome the second law of thermodynamics, i.e., trying to delay decay. Decay does not only concern the physical but also the conceptual level: our model of the world, which we try to keep or try to confirm. In other words: Bayesian brain theories somewhat follow a scheme of self-evidencing (e.g., in “All thinking is wishful thinking” by Kruglanski, Jaso, Friston, 2020).

This is also where our thermodynamic analogy on a higher level of abstraction (active inference as a mathematical tool to describe decision making) becomes related to actual thermodynamics (the organism as such): inferring on the world is done by minimizing free energy in order to stay on the trajectory of a model, meaning to keep the deviation between the true state and our expectations on present states within a generative model of those events low, in order to save cost or effort when trying to not fall apart and handle life as a person at the same time. You will find a lot of recorded lectures on the free energy principle and active inference on various topics (e.g., this one called “Me and my Markov blanket”), in which Karl Friston often gives simple examples of models that involve a daily routine, e.g., getting up, making coffee, going to the subway station and so on. Such a model would be steady in the sense of periodically getting back to states such as “getting up, making coffee”, but would not involve an equilibrium, where nothing changes. This dynamic homeostasis, or better homeorhesis, is referred to as a non-equilibrium steady state (NESS) distribution (the latest formulations in that discourse can be found here: Bayesian mechanics for stationary processes by Da Costa et al.).

However, as discussed before, strong deviations from a model (a lot of “questions that need to be asked”) will result in a lot of model updating. Models themselves are an attempt to keep such deviations within bounds. On the other hand, a NESS also implies the possibility of a desire for change as a whole, as in getting rid of a model one is bound up with. This can be understood in an optimistic or pessimistic sense, or better as voluntary or involuntary: finding a way to fulfill a positive desire for change (changing peer group, social context, or going on an adventure etc.), or coping with undesired changes, even though it is hard to do so (trauma, escape from war etc.).

To answer the question of whether we can get a grip on the model evidence: yes, we can, given approximate Bayesian inference methods. As noted in the introduction, this and the next part of this series will focus on the mathematical basics of active inference, and we are after one particular method: variational inference, which goes back to the variational method defined by Richard Feynman (Statistical Mechanics, p. 86 (1972); also see Friston et al. (2006)).

By the way, glad that you ask, yes, we are going to have a closer look into the relation between physics and information theoretic methods again and expand our understanding of thermodynamics, discussing the concept of inner and free energy and their relation to entropy.

2 Variational Bayes Inference

Now that we know that our previous method of computing a posterior probability distribution is not enough to perform inference on the level of complexity we are after, let us finally start to upgrade our known Bayes’ rule to be able to perform an approximate Bayes inference method, which overcomes the problem of possible intractability of regular Bayes’ theorem.

As mentioned, for this to happen we need to expand our understanding of statistical thermodynamics / mechanics. In general, what we will present below is actually not that difficult mathematically. The physics behind the concept of free energy will also be much easier to understand than the concept of entropy in itself (which is a part of the calculation of free energy). Most of all, we will get to know a very clever ‘trick’ to approximate the model evidence: a ‘trick’ that will also make our calculations, as well as our general concept of abductive Bayesian inference, more intuitive.

2.1 Minimizing Variational Free Energy

Below you will see the classic formula for computing the model evidence (as discussed before). We will also adopt the notation used in active inference, where theta is denoted s for sensory state (the state that we experience) and data is denoted o for observation (the data we observe and conditionally relate to our sensory states). In the context of perception, an observation would be, for example, the information that hits the retina, and the sensory state is what we try to make of such observations in relation to states we experience. Importantly, note that active inference actually works with conditional independence. However, for the sake of simplicity this tutorial will only operate with conditionally dependent relations, i.e., classic or so-called exact Bayesian inference. We will catch up with conditional independence in the next part of this series!
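In this notation, and assuming a discrete s so that the sum rule applies, the posterior and the model evidence can be written as follows (a reconstruction from the standard form of Bayes' rule):

```latex
p(s \mid o) = \frac{p(o \mid s)\, p(s)}{p(o)},
\qquad
p(o) = \sum_{s} p(o \mid s)\, p(s) = \sum_{s} p(o, s)
```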

To relate the notion of conditional independence to our intractability/approximation problem and information theory: conditional independence means that we cannot fully infer the entropy of a message, i.e., on a message itself, since we lack certain “exact” or complete insights into the information that is around us.

As we are unable to infer on the model evidence directly in order to then obtain the posterior, we could still try to play around a bit and compare our results, in order to optimize an approximation of p(o) or of our posterior p(s|o), given the joint probability distribution from our prior and the likelihood. However, we cannot directly compare our approximation of the model evidence or of the true posterior with the model evidence itself, due to the fact that we cannot compute it in the case of higher dimensions. We have to find an indirect way to do so.

For more simplicity we will again go through the formulas for approximate Bayes with only two jointly related elements. A threefold joint is used in the ATUT anyway, so here you will find a downgraded version of the formulas, in case you lose orientation.

However, we can say that epistemically o is something that is part of the joint probability, and the joint is given to us (since we are in fact in relation with the world). In other words, what is again given to us, since it can be set, is the prior and the likelihood (especially in an experimental setting these can be set). So theoretically we could heuristically make assumptions about p(o) or p(s|o), our only missing values, with the right technique that gets around our intractability problem (even though it would not be necessary in our case of conditional probability with two elements, since we could use regular Bayes for this…). In case this does not fully make sense to you yet, we will get to a point where it gets really easy in a bit.

Our professional method of choice to get around the intractability problem will be the so-called variational inference method. Roughly, this method approximates the posterior (!) by heuristically trying out possible posterior distributions, denoted q(s), eventually finding the one that fits the model p(s,o) best. The approximate posterior is denoted q(s) rather than q(s|o), since it is not actually conditioned on o; it is just an approximate guess, so to speak. The right orientation for such approximations is provided by an inequality that can be derived from formulas that again reach back to statistical thermodynamics, namely Helmholtz free energy. This is also the point where the information theoretic quantity of surprisal / entropy becomes important within approximate Bayesian inference.
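The “trying out” of candidate posteriors can be sketched in R for a two-state toy model. Everything below is an illustrative assumption: the prior and likelihood values are made up, the candidates q(s) are simply taken from a grid, and the score is the standard definition of variational free energy, F[q] = sum over s of q(s) * (log q(s) - log p(s,o)). Exact Bayes is still feasible here and serves only as a reference.

```r
# Assumed toy model: two hidden states s and one fixed observation o.
prior      <- c(s1 = 0.5, s2 = 0.5)   # p(s), made-up values
likelihood <- c(s1 = 0.8, s2 = 0.2)   # p(o | s) for the observed o, made up
joint      <- likelihood * prior      # p(s, o)

# Exact reference values (only possible because the model is tiny).
evidence        <- sum(joint)         # p(o)
exact_posterior <- joint / evidence   # p(s | o)

# Variational free energy of a candidate q(s) under the joint p(s, o).
free_energy <- function(q, joint) sum(q * (log(q) - log(joint)))

# Candidate posteriors q(s) = (w, 1 - w) on a grid.
w <- seq(0.01, 0.99, by = 0.01)
F_values <- sapply(w, function(wi) free_energy(c(wi, 1 - wi), joint))

# The candidate with the lowest free energy is our approximate posterior.
best <- which.min(F_values)
print(c(q_s1 = w[best], exact_s1 = unname(exact_posterior["s1"])))

# F never falls below the surprisal -log p(o) and touches it at the optimum.
print(c(min_F = F_values[best], surprisal = -log(evidence)))
```

Note that the search only ever evaluates the joint p(s,o), never the evidence p(o) itself; the evidence appears in the sketch solely for comparison.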

In general, the actual formula for variational free energy within the active inference framework looks much more complicated than it is, and I promise it entails nothing we couldn’t understand after all that we have gone through so far (*wiping-off-a-tiny-tear*). The only difficulty lies in bringing in some additional facts from statistical thermodynamics (Helmholtz free energy, variational methods (Feynman)) and from present-day information theory (Jensen’s inequality, KL divergence, variational free energy). However, a lot of the below will again consist of merely rearranging formulas and alternating notations. As promised, we will again only work with two variables and within the realm of exact Bayesian inference (even though variational Bayes would actually not be required in such a case).

2.1.1 The Principle of Maximum Entropy and Minimum Inner Energy

In classic thermodynamics, free energy results from a relation between the inner energy and the entropy of a system and is considered a quantity for the amount of work that can be extracted from a system at constant temperature and volume. Both changes in temperature, in the form of thermal energy being lost in a process (as in a steam engine), and changes in the volume of a system (an exploding steam engine) would subsequently change the free energy of the system as such (we can also recommend this paper by Gottwald and Braun (2020) on this topic).

From a physical perspective a so-called Stirling engine is a very good example to understand what free energy is and how it converts to work:

Fig. The simple but mind-blowing structure of a so-called Stirling engine. It uses a temperature difference to convert free energy into mechanical work. Here is a great short tutorial by Michael S. Source of the gif above: Wikimedia.

The cylinder can be considered our system (exchange of heat). The heat at the bottom of the cylinder (red) heats up the air (gas) within it, which means that the atoms essentially accelerate and bump against the displacer plate (pink). In other words: the entropy rises and with it the kinetic energy of the gas atoms. The atoms then push against the displacer plate (pink) in a way that makes it move up. At the top of the cylinder the air and the displacer cool down again, making the displacer move down again, since the pressure decreases. This also happens due to the involved mechanics, which eventually push the displacer down again (a kind of competing difference). The pressure rises beneath the displacer plate, also due to the compression of the gas beneath the displacer plate as it moves down, and therefore the displacer eventually moves up again after some time, resulting in a cycle of the displacer being pushed up and down, up and down…. You can see that the structure is a little more complex, since there is another, smaller cylinder (green) that also transfers energy into mechanical work, but with a slight latency, helping to push the displacer plate downwards again (so there is no constant volume in the example above). If the above system were in an equilibrium state, such that no pressure is built up in the cylinder (no temperature difference), nothing would happen. We will later see that the pressure that gets converted into mechanical work can be looked at as so-called free energy. Note that the above also works the other way around: if the bottom of the cylinder were cold (ice cube at the bottom) and the upper part of the cylinder were warmer (room temperature), the same thing would happen!

In the context of information theory, or of Bayesian inference in general, this “amount of work” can also be seen as the amount of effort or cost of a system, especially when we keep in mind that the information theoretic perspective refers to statistical inference as a potentially costly process of gaining information (our intractability problem refers to that as well): the less information gained, the more I knew already, and the less costly an inference was, e.g., the closer my prior had been matching the likelihood already, or the closer my approximate posterior will match the actual posterior. Before we get back to our intuition on abductive inference, let us first look at the physics formula for free energy. We know the relation between physics and abductive inference may not be all that clear yet, but we promise we will soon get there!

The formula for the Helmholtz free energy, i.e., the maximum usable part of the energy E of a system, goes like this:
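In the standard textbook form, and using the symbols F, E[E], T and S that the text works with, the equation can be written as (a reconstruction, not necessarily the exact typesetting of the original):

```latex
F = \mathbb{E}[E] - T \cdot S
```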

T refers to the temperature (in Kelvin) and S to the entropy of that system (equivalent to H). E[E] refers to the (expected) inner energy of a system, which, like F, is also an information theoretic quantity; this may appear a little confusing, since we have mostly spoken of information as entropy before, also in the context of expected values. Note that free energy understood as an information theoretic quantity involves constant temperature as well as constant volume (at least in the basics). In other words, similar to Boltzmann’s constant k_B, the temperature will simply be set to 1.
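As a small numerical sanity check from physics, the following R sketch computes the Helmholtz free energy of a system in a Boltzmann distribution with k_B = 1 and T = 1. The four energy levels are an arbitrary assumption, and the variable is called `temperature` rather than `T`, because `T` is shorthand for `TRUE` in R. The check uses the known identity F = E[E] - T * S = -T * log(Z), where Z is the partition function.

```r
# Assumed toy system: four microstates with made-up energy levels.
energies <- c(0, 1, 2, 3)
temperature <- 1                 # T, with Boltzmann's constant set to 1

# Boltzmann distribution over microstates.
Z <- sum(exp(-energies / temperature))   # partition function
p <- exp(-energies / temperature) / Z

inner_energy <- sum(p * energies)        # expected inner energy E[E]
entropy <- -sum(p * log(p))              # S, equivalent to H (in nats)

# Helmholtz free energy F = E[E] - T * S.
free_energy <- inner_energy - temperature * entropy
print(c(free_energy = free_energy, minus_T_log_Z = -temperature * log(Z)))
```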

At this point it made sense to me to recall the units used in physics to express E and S, to understand why E and F can also be information theoretic quantities without ruining our intuition about entropy and thermodynamics:

  • Inner energy = Energy (J) / Mass (kg)

  • Entropy = Energy (J) / Temperature (K)

Both quantities actually reflect on the energy of a system, just in relation to different factors. In a banal sense, this is the reason why the formula goes “F = ...”, reflecting on a relation between the inner energy and the entropy of a system, potentially resulting in free energy. Recalling the units also sheds light on what inner energy reflects: it represents a quantity for the kinetic energy of the movement of particles within a system (energy in Joule) with respect to their mass, as something that can be used to do work, or that represents a potential cost in some way when looked at in its dependent relation to the entropy of a system. The same goes for the disorder within a system that results from such movements (energy in relation to temperature) and, indirectly, for how it may change towards a state of maximum disorder as a result of (a change in) the inner energy of a system, as we will see below.

As mentioned, the relationship between free energy, inner energy and entropy relies on the second law of thermodynamics, stating that a system preferably occupies a state over time that has maximum entropy given the inner energy. We know what maximum entropy is in classic information theory: it states that the entropy is at its maximum in the case of equal probability of each microstate ─ which also makes sense in terms of maximum uncertainty (randomness) and minimum certainty when inferring on information. Note that all of this also relates to the equilibrium steady state distribution of our Markov chain ─ where equilibrium represents the notion of a consistency of (transition) probabilities over time (our eigenvector).
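That entropy peaks when every microstate is equally probable can be checked with a few lines of R. The three example distributions are made up; the entropy is computed in bits (log base 2), with zero probabilities dropped following the usual convention 0 * log(0) = 0.

```r
# Shannon entropy in bits; zero-probability states contribute nothing.
shannon_entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

uniform <- rep(1 / 4, 4)           # equal probability for all four states
skewed  <- c(0.7, 0.1, 0.1, 0.1)   # one state dominates
certain <- c(1, 0, 0, 0)           # no uncertainty at all

print(c(uniform = shannon_entropy(uniform),
        skewed  = shannon_entropy(skewed),
        certain = shannon_entropy(certain)))

# For n equally probable states the maximum is log2(n).
print(log2(length(uniform)))
```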

With all of this in mind, we can now mathematically represent a part of the free energy equation above already: the entropy. Recall that our information theoretic entropy formula also represented the expected value of -log p(o), i.e., the average surprisal, where surprisal is obtained just by taking the negative log of p(o). Next up is the inner energy:

As hinted above, the principle of maximum entropy can also be inverted in the sense of a principle of minimum inner energy stating that in the case of constant entropy (set maximum entropy) a system will occupy the state with the least amount of inner energy. Constrains, such as constant entropy within the definition of the principal of minimal inner energy will actually be helpful in a bit, in order to fully understand the actual calculation of FF using extended Bayes inference methods via simplified formulas from physics. To also clear that upfront: we will forget about volume in the first place and, as mentioned, set the temperature to be T=1T = 1. This will simplify our formula down to F=EHF = E - H, but we will stick with TT in the R code, to keep in mind the physics behind it ─ for some intuition on the analogy to Helmholtz free energy.

In essence, looking at the formula for free energy again, and without an idea how to formulate the expected value of EE in Bayesian terms, we still conceptually know already that when the value for the inner energy EE is getting lower, the resulting value of FF will somewhat decrease and will move towards HH, given that entropy is set to be constant. “Set constant” is essentially regarding to a certain amount of entropy, a certain amount of interest: the actual entropy/surprisal that can also represent a set “maximum entropy”. In other words, changes in the values of FF indirectly reflect on a “distance” of FF to the entropy value itself.

We can extract some facts from that above intuition from physics to build up some mathematical constraints for our formula of FF:

As the entropy is considered constant, a value of E=0E = 0, i.e., a minimum of inner energy (at least none, but not below!), would result in F= HF = \ H, as well as FH=0F - H = 0. In such a state nothing changes, therefore no kinetic energy in the form of free energy is given, so to speak (equilibrium).

Conceptually and intuitively, it also wouldn’t make sense that the value of EE would be below zero, as there has to be energy in terms of information in a system, if there is information in terms of entropy in a system ─ so a value of E=0E = 0 is logically consistent when inner energy is understood as a divergence from the entropy itself (EE is also referred to as relative entropy in information theory, as we will see). If EHE-H is positive, we therefore know that there is free energy given in the form of a left over, so to speak.

This is also consistent to our previous understanding of information theory in general, where negative quantities of information do not make a lot of sense, if information is in general something phenomenologically given, so to speak. From the perspective of abductive Bayes’ inference again, free energy can here be seen as the effort or cost of an inference made (work), since it implies that there is a divergence that has to be overcome to obtain an equilibrium state, so the above also holds for FF as well: free energy cannot be negative! Let us update our mathematical model on the relations between FF and HH given our thoughts above:

From what we have learned conceptually so far, we can derive our first inequality:

The above is the case, since the values of $F$, $E$ and $H$ cannot be negative, but the relation between $E$ and $H$ can result in deviating values between $F$ and $H$ respectively. In case this is all going too fast, next we will look at some familiar numerical examples!

Let us recall some of what we already know on that matter of inequalities from classic information theory: here is an inequality we have encountered before. The code below entails all the code we need from Information Theory III on Markov chains:

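### We will again use the last Markov chain example 
### from Information Theory III:
library(markovchain) # install.packages("markovchain")
MessageABC = c("A", "B", "C")
# NOTE: only the first row (.0,.8,.2) is given here. Rows two and
# three are an assumption, filled in from Shannon's (1948) example,
# which reproduces the value H(y|x) = 0.934 reported below.
MessageABCTransMatrix = matrix(c(.0,.8,.2,
                                 .5,.5,.0,
                                 .5,.4,.1),
                               nrow = 3,
                               byrow = TRUE,
                               dimnames = list(MessageABC, MessageABC))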

MCmessageABC = new("markovchain", states = MessageABC,
                   byrow = TRUE,
                   transitionMatrix = MessageABCTransMatrix,
                   name = "WritingMessage") 
markovchainSequence(n = 20, markovchain = MCmessageABC, t0 = "A")

# Plot Markov Chain
plot(MCmessageABC, edge.arrow.size = 0.1) 

# We will quickly write a function adding a tiny
# value to our inputs, to avoid log2(0) = -Inf:
bit_log_nonzero = function(x) {
  log2(x+2^(-16))
} # End of function

### Joint matrix:
steady = steadyStates(MCmessageABC)
trans_mat = as.matrix(MessageABCTransMatrix)

# Initialize empty matrix:
joint_mat = matrix(0, ncol = ncol(trans_mat), nrow = nrow(trans_mat))
for (i in 1:length(steady)){ 
  for (j in 1:ncol(trans_mat)){
    joint_mat[[i,j]] = steady[[i]]*trans_mat[[i,j]]
  } # end for j
} # End for i

# Below we will work with the numbers of our last Markov chain example:
EntropyPOST = -sum(joint_mat*bit_log_nonzero(MessageABCTransMatrix))
# [1] 0.9340018 = H(y|x)

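EntropyX = -sum(joint_mat*bit_log_nonzero(rowSums(joint_mat)))
# For H(y) the column sums are weighted by their own log
# (a plain vector would be recycled along the rows of joint_mat):
EntropyY = -sum(colSums(joint_mat)*bit_log_nonzero(colSums(joint_mat)))

# H(y) is greater than or equal to H(y|x):
EntropyY >= EntropyPOST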

The above essentially states that the amount of information I need to infer on a state, only having the prior at hand, is greater or equal to the amount of information entailed in the inference on a state given an observation, $H_{x}(y) = H(y|x)$, which reduced the uncertainty on the future in comparison to a prior, such that $H(prior) \geq H(posterior)$. In other words: I know at least as much as before (prior), after I have been in joint, or have encountered an event (likelihood) and updated my model (posterior). This relies on several notions of physics that we have encountered before, e.g., information can’t be lost within a process (of inference), or gaining information can be expressed as a cost ($F$) needed to make an inference (the amount of work that can be extracted from a system until it hits equilibrium, i.e., maximum entropy).
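To see this inequality at work without the Markov chain machinery, here is a minimal sketch on a small joint distribution. The 2×2 matrix `joint_xy` and the helper `entropy_bits` are made up for illustration only and are not part of the example from Information Theory III.

```r
# Hypothetical joint distribution p(x, y); rows = x, columns = y.
joint_xy = matrix(c(.4, .1,
                    .2, .3), nrow = 2, byrow = TRUE)

# Helper (an assumption of this sketch): entropy in bits,
# treating 0*log2(0) as 0.
entropy_bits = function(p) {
  p = p[p > 0]
  -sum(p * log2(p))
}

p_x = rowSums(joint_xy)   # marginal p(x)
p_y = colSums(joint_xy)   # marginal p(y)

H_y  = entropy_bits(p_y)                 # H(y), the "prior" uncertainty
H_xy = entropy_bits(joint_xy)            # joint entropy H(x, y)
H_y_given_x = H_xy - entropy_bits(p_x)   # chain rule: H(y|x) = H(x,y) - H(x)

# Conditioning never increases the entropy:
H_y >= H_y_given_x
# The difference is the mutual information I(x; y) >= 0:
H_y - H_y_given_x
```

The chain rule is used here instead of the nested loop from above, simply because it keeps the sketch short; both routes should give the same conditional entropy.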

From this we can also make sense of the following:

# EntropyY-EntropyPOST >= 0
EntropyY-EntropyPOST >= 0

We can actually relate this to our free energy formula, including the fact that $F \geq 0$, as subtracting the entropy from the free energy would have to follow the same rules, such that:


This again says that energy itself in any sense, inner or free, cannot be less than zero. This is also what is referred to as free energy being an upper bound on entropy, and is mathematically derived from Jensen’s inequality, which we will get to later in a separate chapter.

We can also derive an equality as a special case of the above, where:

From which we can derive that in such a case:

The above represents the case of minimum inner energy, $E = 0$, given constant and therefore automatically maximum entropy.

We could now create a similar situation just with the entropy in relation to itself (since $F = H$), i.e., subtracting $-\ln{p(o)}$ from the sum of the joint over all $s$, which is essentially saying surprisal minus surprisal is 0, such that:

Looking closely at the formula will reveal a surprising chance to bypass the evaluation of the model evidence itself using a simple, but sophisticated variational trick to do so (goes back to variational methods in statistical mechanics, introduced by Richard Feynman, see Statistical Mechanics, p. 86f. (1972)).

From what we know so far, $F$ can be seen as representing a deviation from our $-\log(p(o))$. Also recall the fact that $p(o|s)p(s) = p(o,s) = p(s|o)p(o)$.

Let us now do some rearrangement, based on our thoughts around $p(o,s)$ above. We will also add some intentionally redundant terms, which will represent a deviation of the free energy from the actual entropy, given that the approximate posterior deviates from the actual posterior that fits the actual entropy $p(o)$, such that $F = H$. Note that the below just represents alternative arrangements of the same formula:

Combining the arrangement from the last two lines results in:

Here we just take the log into the sum:

The last line represents the expected value notation of the previous formula:

I have crossed out everything that was just redundantly added (exception in the last line), but see how we now got the posterior right in front of us? This will lead us to an important trick that brings us beyond the special case where $F = H$.
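Written out in one chain, the rearrangement described above can be sketched as follows. This is a reconstruction in the notation of this tutorial, assuming that the true posterior $p(s|o)$ is used throughout, so that $F = H$; it relies only on the product rule $p(o,s) = p(s|o)p(o)$ and on the fact that $p(s|o)$ sums to one over all $s$:

```latex
\begin{aligned}
H = -\ln p(o)
  &= -\ln \frac{p(o,s)}{p(s|o)} \\
  &= -\sum_{s} p(s|o) \ln \frac{p(o,s)}{p(s|o)} \\
  &= -E_{p(s|o)}\left[ \ln \frac{p(o,s)}{p(s|o)} \right]
\end{aligned}
```

The fraction $p(o,s)/p(s|o)$ equals $p(o)$ for every single state $s$, which is why weighting by $p(s|o)$ and summing does not change the value.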

Let us quickly compute the above, to see if the equations hold, before reflecting on how to fully represent our free energy formula, resulting in $F \geq H$:

#### Generative Model
prior = c(.5,.5); likelihood = c(.8,.2)
joint = prior*likelihood

# Trueposterior (here calculated via Bayes; for
# simulations think of a supervised situation,
# so the true state is known):
modelevidence = sum(joint)
Truepost = joint/modelevidence

# Expected model evidence = Entropy
# In our case Entropy and surprisal are equivalent.
Entropy = -sum(modelevidence*log(modelevidence)+((1-modelevidence)*log(1-modelevidence)))
Surprisal = -log(modelevidence)
Temperature = 1

# Going through all the lines:
# H = H
# The below does not exactly work with vectors/matrices.
# For a single value result use: -log(.5*.8/.8)
# With expected value notation, i.e., average surprisal
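
The comments above can be checked numerically with a few lines. The following is a minimal sketch that repeats the generative model so that it runs on its own; the names `surprisal_direct`, `surprisal_per_state` and `surprisal_expected` are introduced here for illustration only.

```r
# Generative model as above:
prior = c(.5, .5); likelihood = c(.8, .2)
joint = prior * likelihood
modelevidence = sum(joint)
Truepost = joint / modelevidence

# H = -ln p(o):
surprisal_direct = -log(modelevidence)

# -ln( p(o,s) / p(s|o) ): evaluated on vectors, this returns the same
# surprisal once per state s instead of a single value.
surprisal_per_state = -log(joint / Truepost)

# Expected value notation: weighting by the true posterior and summing
# collapses the vector back into a single number.
surprisal_expected = -sum(Truepost * log(joint / Truepost))

all.equal(surprisal_direct, surprisal_expected)
```

Using `all.equal()` instead of `==` is a deliberate choice here, as it tolerates tiny floating point differences.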

All of the formulas above hold for the case of $F = H$, where Shannon entropy can be understood as a special case of (variational) free energy:

Let us now substitute $F$ by the Helmholtz free energy formula, such that:

From what we know, we could try to extract the entropy from our $F$ or $H$ above, recalling our product rule $p(s,o) = p(s|o)p(o)$, as well as the logarithmic rule $\ln\left( \frac{x}{y} \right) = \ln(x) - \ln(y)$. To do so we have to “flip around” our $E_{p(s|o)}$, resulting in:

The latter line may look a little weird, but recall that $E - H = F$, $H = -\log{p(o)}$ and the logarithmic rule we applied all involve a subtraction of their own! Also note that $E$ would be $0$ in this case, so it does not matter if the expected posterior is positive or negative, therefore the “$(-)$”. Getting rid of the symbolic redundancies, i.e., “redundant distributions”, results in the following subtraction. Now just recall that $-\log p(o)$ results in a positive value, i.e., $\geq 0$, and you will actually get an addition on the numeric level.

# E[E] - H = H (for the example from ATUT p. 4)
Energy = -sum(Truepost*log(Truepost/Truepost))
HelmholtzFE = Energy-(-Temperature*Entropy)

# F >= H

# F = H (minimized FE)

The important trick that is hidden in the formulas now is that we can approximate our posterior and at the same time are able to indirectly see how far our approximation $q(s)$ of the posterior $p(s|o)$ deviates from the posterior suitable for the actual model evidence, since $F = H$ when $E = 0$.

Conceptually, minimizing the value of $F$ will therefore eventually lead us to a point at which we are equal to our $-\ln p(o)$, or at least as close to it as possible, the latter being enough for our approximation. Minimizing free energy therefore maximizes the model evidence of our approximate version of the actual $E$, which will heuristically reveal to us the ideal approximate posterior that fits the true posterior, i.e., the true model evidence. This is simply because, when our approximate $q(s)$ does not fit the above requirements, we know that $F$ will be greater than the actual entropy, as the free energy is an upper bound on entropy and never below zero. In other words: however far away we will speculate, we will never lose information in the sense of being below the actual model evidence.

Mathematically speaking, the process of minimizing free energy can be looked at as performing gradient descent. The code below for the example at ATUT p. 5 shows how the value of $F$ is descending down a gradient towards the actual entropy when adjusting the posterior until it matches the “true or actual posterior” that fits the actual entropy $p(o)$ (following the principle of least action). The general formula for the free energy, which also entails a denotation for the approximate posterior (also called recognition distribution), denoted $q(s)$, will look like this. The approximate posterior is denoted to be unconditional, even though it is not an actual unconditional probability such as $p(o)$, but an approximation of $p(s|o)$. Since the approximation lacks an observation by logic, it is also considered unconditional on such observations (it’s a methodological speculation of the posterior / the future, in order to be able to learn, so to speak):

The leftmost term on the left-hand side of the last line can also be understood as the KL divergence or relative entropy, i.e., how far the entropy deviates from the actual entropy, where $F = H$, but we will get to that in detail in the next chapter. The notation for the KL divergence looks like this and is analogous to the above:
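For reference, here is a sketch of how this decomposition is usually written, using the symbols of this tutorial ($q(s)$ for the approximate posterior, $p(s|o)$ for the true posterior). The first term is the KL divergence and plays the role of $E$, the second term is the surprisal:

```latex
F = \underbrace{\sum_{s} q(s) \ln \frac{q(s)}{p(s|o)}}_{D_{KL}\left[ q(s)\,||\,p(s|o) \right]\ \geq\ 0} - \ln p(o),
\qquad
D_{KL}\left[ q(s)\,||\,p(s|o) \right] = E_{q(s)}\left[ \ln \frac{q(s)}{p(s|o)} \right]
```

Since the KL divergence is zero exactly when $q(s) = p(s|o)$, the free energy then reduces to $-\ln p(o)$, which matches the way `Energy` and `Entropy` are combined in the R code of this section.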

Here is the code for the variational free energy example on ATUT p. 5, which is equivalent to the calculation done in the ATUT Matlab script “VFE_calculation_example”(see Ryan Smith’s Github repository):

# Minimizing Free Energy: 
# Example ATUT p.5
Qs1 = c(.5,.5)
Energy1 = -sum(Qs1*log(Truepost/Qs1))
VFE1 = Energy1-(-Temperature*Entropy)
Qs2 = c(.6,.4)
Energy2 = -sum(Qs2*log(Truepost/Qs2))
VFE2 = Energy2-(-Temperature*Entropy)
Qs3 = c(.7,.3)
Energy3 = -sum(Qs3*log(Truepost/Qs3))
VFE3 = Energy3-(-Temperature*Entropy)
Qs4 = c(.8,.2)
Energy4 = -sum(Qs4*log(Truepost/Qs4))
VFE4 = Energy4-(-Temperature*Entropy)

# Plot that makes clear what descending a gradient means conceptually:
plot(x = 1:4, y = c(VFE1, VFE2, VFE3, VFE4), type = "l")

Fig. The plot of our variational free energy (VFE). We can see that adjusting our approximate posterior $q(s)$ successively minimized the VFE.
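The four hand-picked steps above can also be replaced by an actual gradient descent. The sketch below is only one possible way to do this and is not taken from ATUT: it assumes two states, writes the approximate posterior as $q(s) = (\theta, 1 - \theta)$ with $\theta$ given by the logistic function of an unconstrained parameter `z`, and uses a hand-picked step size `lr`. All names are hypothetical.

```r
# Generative model as above:
prior = c(.5, .5); likelihood = c(.8, .2)
joint = prior * likelihood

# VFE as a function of theta = q(s = 1):
vfe = function(theta) {
  q = c(theta, 1 - theta)
  sum(q * log(q / joint))
}

# Derivative of the VFE with respect to theta:
dvfe = function(theta) log(theta / joint[1]) - log((1 - theta) / joint[2])

z  = 0   # start at theta = 0.5, i.e., Qs1 = c(.5, .5)
lr = 1   # step size (an assumption of this sketch)
for (step in 1:200) {
  theta = 1 / (1 + exp(-z))   # logistic function keeps theta in (0, 1)
  # chain rule: dF/dz = dF/dtheta * dtheta/dz
  z = z - lr * dvfe(theta) * theta * (1 - theta)
}
theta = 1 / (1 + exp(-z))

theta        # approaches the true posterior value p(s = 1 | o) = 0.8
vfe(theta)   # approaches the surprisal -log(0.5)
```

Working on `z` instead of `theta` directly is a design choice that avoids stepping outside the interval between 0 and 1, where the logarithm would not be defined.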

Here is the classic formula including the inequality of $F$ with $H$ that you will find in the ATUT and literature on the FEP and active inference in general. Below, $F$ is directly calculated via one formula, i.e., without extracting the entropy from $F$. You will encounter more possible rearrangements with three joint elements within ATUT. The inequality, again, states that our indirect approximation of $H$ will always be greater or equal to $H$. Free energy is therefore an upper bound on entropy (the negative VFE is well known in machine learning as the evidence lower bound, ELBO).
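As a minimal sketch of this direct calculation, the VFE can be computed from the joint probability alone, $F = E_{q(s)}\lbrack \ln q(s) - \ln p(o,s) \rbrack$, without evaluating the model evidence first. The function name `vfe_direct` is an assumption made for this illustration; the numbers repeat the example from above.

```r
prior = c(.5, .5); likelihood = c(.8, .2)
joint = prior * likelihood   # p(o,s)

# F = sum over s of q(s) * ( ln q(s) - ln p(o,s) )
vfe_direct = function(q, joint) sum(q * (log(q) - log(joint)))

# -ln p(o); only needed here to check the bound:
surprisal = -log(sum(joint))

vfe_direct(c(.5, .5), joint)   # about 0.916, above the surprisal
vfe_direct(c(.8, .2), joint)   # equals the surprisal when q(s) = p(s|o)
vfe_direct(c(.5, .5), joint) >= surprisal
```

The point of the sketch is that `vfe_direct()` never touches $p(o)$ itself, which is what makes the approach attractive when the model evidence is hard to evaluate.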

The divergence we want to overcome (minimize) is that between our generative model and that of the actual generative process in the world.

Drawing a relation back to Shannon’s work and the discussion we had over Weaver’s work and semiotics, we can see that an exact inference problem is turned into an optimization problem. In general, we hope we could show that active inference can be seen as a direct extension of Claude Shannon’s information theoretic methods of inference.

Here is the same formula, just using three variables:

Below we will have a brief look into the general structure of an active inference model, discussing the difference between generative process and generative model (which we both briefly mentioned before). This is also very well explained in the ATUT though. After that we will further look into the KL divergence, the relative entropy, as well as Jensen’s inequality.

2.2 Generative Model and Generative Process

As mentioned previously in this tutorial, the joint probability is also referred to as generative model and is distinguished from the generative process. This, again, reflects on the epistemics of inference:

A generative model, as discussed above, is constituted by beliefs about the world and can be inaccurate (sometimes referred to as ‘fictive’). In other words, explanations for (i.e., beliefs about) how observations are generated do not have to represent a veridical account of how they are actually generated. (ATUT, p. 4)

Apart from the possibility of counterfactual inference within causal inference (compare Corcoran, Pezzulo, Hohwy (2020), Pearl (2009)), this also addresses the fact that we do not experience or model the world in terms of a 1:1 scale, so to speak (compare previously mentioned video tutorial by Ramstead (2020), and ATUT p. 4). We are only interested in specific information related to a model of the world; therefore, active abductive inference is also cast as attention (Friston (2010)). In other words: a 1:1 map would not be necessary to get from one point to another or more specific: not every detail is important or necessary in order to perform a successful relation between a map and the actual trail it represents in the actual world, even though it will not answer every question we could have on the world ─ but specific questions indeed (Ramstead (2020)).  

In contrast, the generative process refers to what is actually going on out in the world – that is, it describes the veridical ‘ground truth’ about the causes of sensory input. For example, a model might hold the prior belief that the probability of seeing a pigeon vs. a hawk while at a city park is $[.9\ .1]$, whereas the true probability in the generative process may instead be $[.7\ .3]$. This distinction is important in practical uses of modelling when one wants to simulate behavior under false beliefs and unexpected observations (e.g., when modelling delusions or hallucinations). (ATUT, p. 4)
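To put numbers on the pigeon and hawk example from the quote, the following sketch compares the model's prior belief with the true probabilities of the generative process. The variable names are made up for this illustration, and the KL divergence is used here only as one possible measure of how far the belief is off.

```r
model_prior  = c(pigeon = .9, hawk = .1)   # belief of the generative model
true_process = c(pigeon = .7, hawk = .3)   # generative process ("ground truth")

# Average surprisal (in nats) when observations follow the true process
# but are evaluated under the model's belief (cross-entropy):
avg_surprisal_model = -sum(true_process * log(model_prior))

# Average surprisal if the belief matched the process (entropy):
avg_surprisal_ideal = -sum(true_process * log(true_process))

# The excess is the KL divergence between process and model:
kl_process_model = sum(true_process * log(true_process / model_prior))
all.equal(avg_surprisal_model - avg_surprisal_ideal, kl_process_model)
```

In this reading, a model that underestimates hawks is on average more surprised by what it sees than a model whose belief matches the process.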

Both the generative process and the generative model represent joint probabilities. In that sense, beliefs in the form of policies, or anything related to beliefs being something internal, are actually not fully demarcated, but again in a joint relation with each other: a being in the world. In other words: active inference and the free energy principle inherit a phenomenological perspective. Including action as a way to change the state of the world, a graphical representation of our upper example would look like this (similar to ATUT, p. 5):

Fig. Distinction between generative process, representing the hidden states of the world (true states) causing observations $o$, and the generative model, representing a (generative) model of the hidden causes of sensory input (model inversion), given a model in general, e.g., in the form of a policy. Compared to the sender-receiver model before, the upper distinction represents the epistemic circumstances mathematically, as being unable to sum over all possible states, which means: we do not have information on all possible states in the world (or over all possible hypotheses). The variable policy is usually denoted $\pi$ within the active inference / FEP literature (we will adapt below). The graphic also entails action, denoted $u$, given a policy. Action in a simple context of perceptual inference can mean, e.g., turning one’s head to the side to better see what happens, in order to get a better model of the world (gaining information). Note that the action itself can be considered in two ways: as the result of an inference, but also in general as becoming part of the generative process, realizing an inference in terms of action in the world that we again perceive, eventually forming an action-perception-cycle. This refers to the notion of a Markov blanket, with action and sensory states being the blanket states that inferentially relate external and internal states. What is called intentionality would then be understood as adapting to the world and adapting the world for us. Source: Wikimedia.

I have chosen the example of a ball falling to the ground and gravitation as model, as it entails all the ingredients that Galileo used to do his first measurements of the time an object needs until it lands when falling from a certain height. This is not supposed to be a random fact that I want to include in this tutorial, but an example for hierarchical abductive inference that not only applies to our intuition, but also to abductive inference in terms of an actual scientific method, which also entails deduction and induction.

To perform his measurements, Galileo used a slide, as it made objects “fall slower” and steadily. It was hard to measure the speed of a falling object at that time, without any of today’s technology. In order to provide a scaled measure, water was filled into a glass for the time the ball was rolling down the slide (just think of tick marks for the amount of water after 1 m, 10 m etc. and the time that has passed that can be measured separately). After multiple attempts the measurement turned out to be very precise, indicating none or at least no measurable differences of gravity over time.

The structure of such experimental projects appears to be always the same by intuition. At first there may be a bunch of questions, an indirect hypothesis, e.g., things fall down, things fall down steady, the mass makes things fall faster or not… Next up is the likelihood in terms of an event to check on the hypothesis in terms of checking on mere prior assumptions (abduction). An experimental setting is designed and built up in order to make the event comparable to a previous event. In the case of Galileo this meant that the same ball, the same slide etc. had to be used for every trial (deduction). Eventually Galileo obtained a whole bunch of events and was able to compare them with each other, figuring that the fall time is stable under comparable conditions and can be generalized in mathematical formulas (induction), to further see how this works out for an approximation with, e.g., predicting behavior of other objects. This and some other experiments performed by Galileo eventually resulted in a measurement of  for the gravitational acceleration.

All the latter is supposed to roughly demonstrate an intuitive ontogenesis of science and the plausibility of the development of scientific methods by humans, based on abductive inference being the prime form of every inference. This also shows why C.S. Peirce’s pragmatist concept of abductive inference and semiotics can be understood as a phenomenological approach to the description of the structures of inference as such. Active inference though was much more influenced by the work of Helmholtz on perception as a kind of inference, as well as his formula for free energy, which we will get to know soon in its information theoretic form.

3 KL Divergence / Relative Entropy / Information Gain

The Kullback-Leibler divergence is also known as the relative entropy, Bayesian surprise or information gain and represents $E$ in our formula for variational free energy ($F$). To get more intuition on it, we will partially follow an online blog that gives a good example and to which I will provide some code. You will know most of what is discussed in the blog, so we are going to take the quick route.

Example: Imagine having a set of colored balls that are spread within space. The space is divided into two spaces and the question is now: how good the split is in terms of how likely it is to find each of the color in one of the two sides compared to before, i.e., the space without a split. As usual we will reflect on this using information theory. We have 5 blue and 5 green balls. The formula for the entropy is our expected value of either green or blue, which can be written as:

The code below follows our classic entropy formula in terms of the negative sum of the weighted log of the probability of each color appearing, when sampling from it.

# We will use a matrix to express p_i:
pColor = matrix(c(.5, .5))

# H_before:
Hbefore =-sum(pColor*(log2(pColor)))

Let us now look at the split:

Fig. Plot of our example, code below.

# Imagine having 5 green and 5 blue balls:
plot(x = 1,                 
     xlab = "X Label", 
     ylab = "Y Label",
     xlim = c(0, 3), 
     ylim = c(0, 3),
     main = "Blue and green balls",
     type = "n")

# Blue balls
points(1,2, col = "Blue")
points(0.4,2.8, col = "Blue")
points(1.2,2.2, col = "Blue")
points(1.7,0.7, col = "Blue")
points(0.9,0.6, col = "Blue")

# Green balls
points(1.6,1.6, col = "Green")
points(1.9,2.6, col = "Green")
points(2.3,0.6, col = "Green")
points(2.8,2.85, col = "Green")
points(2.4,1.6, col = "Green")

# Split at X = 1.5.
abline(v=1.5, col="black")

The split at a certain point can be thought of as a kind of change in the entropy due to inference, but also as, e.g., evaluating compressed data. Concerning our use in active inference, the relation between the entropies before and after the split can be thought of as changes in $E\lbrack E\rbrack$. This is the reason why the values of the inner energy had been zero in the case of no divergence between the approximate $H$ and the actual $H$, as we are aiming to find an approximate posterior that leads to $F = H$. In other words: we are looking for the maximum entropy / minimum inner energy being equivalent to the free energy.

On the left side, all four of the balls are blue:

On the right side we find five green and one blue ball, together six.

# Hleft is simple
Hleft = -sum(1*log2(1))

# Hright probs as vector
probRight = c(1/6,5/6)
Hright = -sum(probRight*log2(probRight))

We can now weight the quality of the split by weighting the entropies by the number of elements of the respective side via:

# Weighted entropy after the split
# (numerically .4*0 + .6*.65):
Hsplit = .4*Hleft + .6*Hright

# KL divergence / relative entropy:
Informationgain = Hbefore - Hsplit

As said, the relative entropy now represents a relation of the entropy of each state, i.e., before and after the split. In other words, and related to active inference: the inner energy represents the divergence to the actual minimum inner energy ($0$), when the maximum entropy is given, after the posterior was approximated. Relating this to classic exact Bayes inference, the difference between prior and posterior can also be reflected on in the same way and is also termed Bayesian surprise.
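The same calculation can be wrapped into a small helper that works directly on the ball counts. This is a sketch only: `entropy_bits`, `info_gain` and `kl_bits` are hypothetical names, and the second part illustrates, under the same assumptions, that the information gain equals the weighted KL divergence of each side's color distribution from the distribution before the split.

```r
# Entropy in bits from a vector of counts, treating 0*log2(0) as 0:
entropy_bits = function(counts) {
  p = counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

# Information gain of a split, given counts per color on each side:
info_gain = function(left, right) {
  n = sum(left) + sum(right)
  H_before = entropy_bits(left + right)
  H_split  = sum(left)/n * entropy_bits(left) +
             sum(right)/n * entropy_bits(right)
  H_before - H_split
}

# Counts as c(blue, green): 4 blue on the left,
# 1 blue and 5 green on the right.
left = c(4, 0); right = c(1, 5)
info_gain(left, right)   # about 0.61 bits

# KL divergence in bits between a side and the distribution before the split:
kl_bits = function(p, q) sum(ifelse(p > 0, p * log2(p / q), 0))
p_before = (left + right) / sum(left + right)
weighted_kl = sum(left)/10 * kl_bits(left / sum(left), p_before) +
              sum(right)/10 * kl_bits(right / sum(right), p_before)
all.equal(info_gain(left, right), weighted_kl)
```

Working with counts instead of hard-coded probabilities makes it easy to try other split positions, e.g., by moving one ball to the other side.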

4 Jensen’s Inequality

This part is focusing on the exact mathematics behind the inequality $F \geq H$ within information theory, namely Jensen’s inequality. We will follow two more videos by Ben Lambert on the intuition and the proof of Jensen’s inequality. It is in general hard to find good introductions on that topic. Here I will again provide the code for the example provided by Ben Lambert.

To demonstrate Jensen’s inequality, we will play another round of dice ($p_i$). Every round we play, we obtain a payoff value from playing the game. The game only consists of “winning” so to speak, so think of similar games such as Yahtzee. The payoff in our particular game will be calculated by squaring the rolled dice value ($x_i^2$).

Table: $x_i$: Possible values of $x$, the dice. $p_i$: Probability of a certain value of $x$, i.e., $P(X = x)$. $x_i^2$: Payoff from playing a round of the game, i.e., squared value of $x$ that was rolled.

Mathematically, the payoff can be understood as a parabola function, which we will call $g(x)$, with a minimal value of zero representing the vertex point of the parabola. Let us plot the payoff function, to get an overview. For this we will define a function in R in an actual mathematical sense for the first time:

# Define function g(x) = x^2
g <- function(x) (x^2)

# Plot of g(x) ranging from x = -1 to x = 6.
curve(g, -1, 6, ylab = "g(x)")
abline(v = 0)

# Points marking all possible payoffs
points(1,g(1)) # Payoff for x = 1
points(2,g(2)) #   ...  for x = 2
points(3,g(3)) #   ...
points(4,g(4)) # 
points(5,g(5)) # 
points(6,g(6)) #

Fig. Plot of the code above with markings for every possible ($i \in I$) payoff $x_i^2$.

A special characteristic of our function $g(x)$ that I want to draw attention to is its convexity. I have found several ways to define the characteristics of a convex function:

A rather graphical approach of defining convexity involves choosing two random points in the space above the function, the so-called epigraph. When the points are connected via a straight line, the line will never cross the function.

Another approach is by describing the graph of a convex function as continuously increasing its slope. A more mathematical approach to this would involve evaluating the derivatives of our function $g(x)$: a twice differentiable function is convex if its second derivative $g''(x)$ is non-negative everywhere, i.e., if its slope $g'(x)$ never decreases. For the non-negative dice values of our game, the first derivative $g'(x)$ is non-negative as well. Let’s see if that is true.

We will obtain the first and second derivative via these rules:

such that in our case:

Both derivatives are positive, either in the sense that they consist of a single positive value or in the sense that $x$ is non-negative. The first derivative tells us that the slope of the tangent at a point $x \geq 0$ is always positive or equal to zero, and the second derivative tells us that the slope is constantly rising.
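Both derivatives can also be checked in R. The sketch below uses the base R function `D()` for symbolic differentiation; by the power rule, $g'(x) = 2x$ and $g''(x) = 2$. The names `g_expr`, `g1` and `g2` are only used for this illustration.

```r
g_expr = quote(x^2)     # g(x) as an unevaluated expression
g1 = D(g_expr, "x")     # first derivative:  2 * x
g2 = D(g1, "x")         # second derivative: 2

x = 1:6                 # the possible dice values
eval(g1)                # slope of the tangent at each dice value
eval(g2)                # constant and positive, so g(x) is convex
all(eval(g1) >= 0) && eval(g2) > 0
```

Note that only the sign of the second derivative matters for convexity; the first derivative of $x^2$ becomes negative for $x < 0$, while the function stays convex there.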

Fig. Description of a function as a process, specifically referring to the computational abstraction used in programs such as R: A mathematical function, say $f(x) = x^2$, has an input, say $x = 2$, and delivers an output for $f(2)$, which in this case results in $2^2 = 4$ and resembles the y-coordinate of a point on that function with the respective x-coordinate value of $x = 2$. In essence, a function represents an algorithm to evaluate the relation between an independent variable $x$, which is often time, and a dependent variable $f(x)$.

Now that we have a rough idea of what a convex function is, let us see what Jensen’s inequality actually states. Don’t worry, it may seem a lot, but we will again decompose the inequality step by step. In relation to our game, Jensen’s inequality states that the expected payoff of playing the game, denoted as the expected value of a function, $E\lbrack g(x)\rbrack$, will always be greater or equal to the payoff from the expected value of $x$, which is a function of the expected value of $x$, denoted $g\lbrack E\lbrack X\rbrack\rbrack$. In addition: this holds if and only if the function $g(x)$ is convex. In active inference, this represents the VFE being always greater or equal to the entropy.
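In symbols, and for the discrete case of our dice game with probabilities $p_i$ and values $x_i$, the inequality described above can be sketched as:

```latex
E\left[ g(X) \right] \geq g\left( E\left[ X \right] \right)
\quad \Longleftrightarrow \quad
\sum_{i} p_i \, g(x_i) \geq g\left( \sum_{i} p_i \, x_i \right),
\qquad g \text{ convex}
```

The left-hand side weights the payoffs, the right-hand side takes the payoff of the weighted dice values.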

Let us first evaluate our expected value of $x$, $E\lbrack X\rbrack$, which we consider the weighted average of our possible dice values $x_i$, where $x \in X$ and $X = \{ 1,2,3,4,5,6\}$.

EX = sum(1/6*(1:6))

The payoff from the expected value, i.e., the function $g$ of our expected value of $x$, will then be:

gEX = g(EX)

Translated, this delivers the answer to the question: considering the weighted average of the possible dice values, $3.5$, what will the payoff of that average value be? The result is $3.5^2$, when payoff means a function $g(x) = x^2$, $x$ being the input value of that function, just as in R, so to speak.

The left-hand side of Jensen’s inequality above represents the expected value of a function, i.e., our expected payoff from playing the game. The formula just slightly deviates from $E\lbrack X\rbrack$, as we are now weighting the payoff of our value $x_i$, i.e., weighting $x_i^2$.

Egx = sum(1/6*g(1:6))

Translated this says, that the weighted average payoff from playing the game will be at a value of 15.16667, and delivers an answer to the question: If I play this game, what payoff from playing the game will I get on average each time I am playing the game?

# Jensen's inequality
# E[g(x)] ≥ g[E(x)]
Egx >= gEX

Now we are going to look at our graph again, in order to reflect on the above:

# gEX and Egx at x=EX
points(EX, gEX, col="blue")
points(EX,Egx, col="red")

We can see that at $x$ = EX the value $g(x)$ = Egx lies above the value $g(x)$ = gEX, i.e., where $x$ in the coordinate system is also the expected value of $x$.
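For the payoff $g(x) = x^2$ in particular, the gap between both sides of Jensen's inequality is exactly the variance of the dice, since $E\lbrack X^2\rbrack - (E\lbrack X\rbrack)^2 = Var(X) \geq 0$. The following sketch checks this and repeats the comparison on simulated rolls; the seed and the number of rolls are arbitrary choices made here, and the block redefines all values so that it runs on its own.

```r
g = function(x) x^2
x = 1:6; p = rep(1/6, 6)

EX  = sum(p * x)       # expected dice value, 3.5
Egx = sum(p * g(x))    # expected payoff
gEX = g(EX)            # payoff of the expected value

# The gap equals the variance of a fair dice:
var_dice = sum(p * (x - EX)^2)
all.equal(Egx - gEX, var_dice)

# Simulated rolls (arbitrary seed and sample size):
set.seed(1)
rolls = sample(x, size = 10000, replace = TRUE)
mean(g(rolls)) >= g(mean(rolls))
```

The simulated comparison holds for any sample, since the mean of squares can never fall below the squared mean.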

We will now draw a line going through $P_{1}(0|0)$, the origin, and another point $P_{2}(E\lbrack X\rbrack\ |\ E\lbrack g(x)\rbrack)$ to get some overview. For this we will set up a function:

We can evaluate the slope of our function via:

# Slope b of f(x)=y=a+bx via two points:
P1 = matrix(c(0, 0))    # x =  0, g(x) =   0
P2 = matrix(c(EX,Egx))  # x = EX, g(x) = Egx
b = (P2[2]-P1[2])/(P2[1]-P1[1])

We can now evaluate $a$ by filling in a given point, and then define as well as plot our function:

# We can now evaluate a by filling in a point,
# say P1, in y = a + bx => a = y - bx
a = P1[2] - b*P1[1]

# f(x) = 0 + 4.33333*x
f = function(x) a + b*x
# add to plot
curve(f, -1, 6, add=TRUE)

We will now evaluate the crossing points graphically and numerically to get a better overview. We will start with our new function $f(x)$. Note that each crossing point between $f(x)$ and $g(x)$ represents a point of equality, where $E[g(x)] = g[E[x]]$ (the first crossing point is at the origin $(0|0)$):

# In our case the value of x of the point where
# f(x) crosses g(x) is equal to the slope of 
# f(x), so we can evaluate the value of y 
# of our point via a shortcut. 
y = b*b

# Crossing points: 
# upper crossing point f(x) with g(x)
# where Egx=gEX!
segments(x0=0,y0=y, x1= b, y1=y, lty =3)
segments(x0=b,y0=0, x1= b, y1=y, lty =3)
text(x=5.3,y=18, label="E[g[4.33]]=g[E[4.33]]", srt = 3, col = "darkgreen")
text(x=-.6,y=y, label="g(x)=g(4.33)", srt = 3, col = "darkgreen")
text(x=b+.5,y=0.2, label="x=4.33", srt = 3, col = "darkgreen")
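The shortcut used above can be double-checked by solving $f(x) = g(x)$ directly, i.e., $x^{2} - bx = 0$. The following sketch uses base R's `polyroot()`, which expects the polynomial coefficients in increasing order of power; everything is recomputed here so that the block runs on its own, assuming $g(x) = x^{2}$ and $a = 0$ as in the text:

```r
# Assumption: payoff function as in the text, g(x) = x^2
g = function(x) x^2
EX  = sum(1/6*(1:6))
Egx = sum(1/6*g(1:6))

# Slope of the line through the origin and (EX|Egx)
b = Egx/EX

# f(x) = g(x)  <=>  0 - b*x + 1*x^2 = 0
# polyroot() returns complex numbers, so we keep the real part
roots = sort(Re(polyroot(c(0, -b, 1))))

roots      # x-coordinates of the crossing points
g(roots)   # corresponding y-coordinates
```

The two roots should correspond to the origin and to the upper crossing point at $x = b$.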

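Now we can check numerically whether our assumptions hold:

# Check P1: a constant "die" that always shows x = 0
xP1 = rep(0, 6)
EXP1 = sum(1/6*xP1)
gEXP1 = g(EXP1)
EgxP1 = sum(1/6*g(xP1))
# Equal (up to floating-point tolerance)?
isTRUE(all.equal(EgxP1, gEXP1))

# Check P2: a constant "die" that always shows x = b
xP2 = rep(b, 6)
EXP2 = sum(1/6*xP2)
gEXP2 = g(EXP2)
EgxP2 = sum(1/6*g(xP2))
# Equal (up to floating-point tolerance)?
isTRUE(all.equal(EgxP2, gEXP2))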

Now we will add the other points and also add some annotation and lines to the plot to get further orientation over our results:

# Add rest of the points and some annotation:
# (EX|Egx)
segments(x0=0,y0=Egx, x1= EX, y1=Egx, lty =3)
segments(x0=EX,y0=0, x1= EX, y1=Egx, lty =3)
text(x=-.6,y=Egx, label="E[g[x]]", srt = 3, col = "red")
text(x=EX+.35,y=0.2, label="x=E[X]", srt = 3, col = "black")

# (EX|gEX)
segments(x0=0,y0=gEX, x1= EX, y1=gEX, lty =3)
segments(x0=EX,y0=0, x1= EX, y1=gEX, lty =3)
text(x=-.6,y=gEX, label="g[E[x]]", srt = 3, col = "blue")

Fig. THIS NEEDS A LITTLE BIT OF CONCENTRATION TO FOLLOW: Plot of $g(x)$ and $f(x)$. Moving the parallel points of $E[g(x)]$ and $g[E[X]]$ within the range of $x = 0$ up to $x = 4.33$ will always meet the requirement $E[g(x)] \geq g[E[X]]$, with strict inequality everywhere except where both functions meet/cross at the points $P_{1}(E[0] \,|\, E[g(0)])$ and $P_{2}(E[4.33] \,|\, E[g(4.33)])$, where the outputs of both functions are exactly equal. If this is still too confusing to get a grip on, then just recall that the y-axis relates to both $g(x)$ and $f(x)$, such that $g(x) = g(4.33) = E[g[4.33]] = g[E[4.33]]$ when $x = 4.33$. Also note that $g(x) = g[E[X]]$ when $x = E[X] = 3.5$; we also see that $f(x) = E[g[x]]$ when $x = E[X]$, which is evidently unequal to $g(x)$ at that same $x = E[X]$ (again, apart from the cases $x = 4.33$ or $x = 0$, where equality is given).
So, the relation of $x = E[X]$ to the y-axis is divergent in the sense that the input value $x$ is the same for both functions, $x_{g} = x_{f}$, but the values on the y-axis diverge, such that the points of these functions at $x = E[X]$ do not share the same y-coordinate: $P_{E[g(x)]}(E[X] \,|\, E[g(x)] = 15.16667)$ and $P_{g[E[X]]}(E[X] \,|\, g[E[X]] = 12.25)$.
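One way to read the line $f(x)$ in the figure is as the expected payoff of a simpler game with only two outcomes. The following is merely an illustrative sketch of ours, not part of the original script: a hypothetical "coin" shows the value $b = 4.33$ with probability `w` and the value $0$ otherwise. Moving `w` from 0 to 1 moves $E[X]$ from $0$ to $4.33$, the expected payoff stays on the line $f(x)$, and the payoff of the expected value stays on the curve $g(x)$:

```r
# Assumption: payoff function as in the text, g(x) = x^2
g = function(x) x^2
EX  = sum(1/6*(1:6))
Egx = sum(1/6*g(1:6))
b = Egx/EX                 # slope of f(x), also the upper crossing point
f = function(x) b*x

# Hypothetical two-outcome game: value b with probability w, else 0
w = seq(0, 1, by = 0.25)
EX_w  = w*b + (1-w)*0        # expected value for each w
Egx_w = w*g(b) + (1-w)*g(0)  # expected payoff for each w
gEX_w = g(EX_w)              # payoff of the expected value for each w

# Expected payoff lies on the line f, at or above the curve g
cbind(w, EX_w, Egx_w, f_EX = f(EX_w), gEX_w)
all(Egx_w >= gEX_w)
```

Equality only occurs at `w = 0` and `w = 1`, i.e., when the outcome is certain, which matches the two crossing points in the plot.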

I hope this chapter helped you to understand Jensen's inequality and to get some orientation from a mathematical and graphical perspective. Also check out Oleg Solopchuk's tutorial on Medium on the FEP for some more perspectives on Jensen's inequality. His tutorial has very nice visualizations, but moves at a much faster pace at some points (and also includes aspects such as conditional independence etc.).

In the next part of this series, we will also have a look into the concept of Markov blankets and other more advanced aspects of active inference. Nevertheless, you should now already be prepared to proceed to the actual ATUT.

Ms. Miranda, longing for feedback. Did any of this make sense? We would like to hear from you! Similar to our open review process, every one of our articles can be commented on by you. Our journal still relies on the expertise of other students and professionals. However, the goal of our tutorial collection is especially to come in contact with you, helping us to create an open and transparent peer-teaching space within NOS. Original photo by Alfred Kenneally
