Intermediate to Advanced Stat-o-Sphere
Follow this link to our previous part IV of this series, if you wish, or have a look at other tutorials of our collection Stat-o-Sphere.
Review: This is a pre-released article, and we are currently looking for reviewers. Contact us via [email protected]. You can also send us your review directly, or use our open peer-review functions via PubPub (create an account here). In order to comment, mark some text and click on “Start discussion”, or just comment at the end of the article.
Pub preview image generated by Mel Andrews via midjourney.
Corresponding R script:
Welcome to the fifth part of our series on information theory! You have come quite a way to get here, and we really hope it was worth it for you so far. Information theory is a fascinating topic and a good example of how a few, in the end rather simple, mathematical formulas and concepts can make quite a career.
Note that you have again entered the intermediate to advanced Stat-o-Sphere. This tutorial especially requires knowledge of the first part (Bayes’ rule) and especially the second part (entropy/surprisal) of this tutorial series on information theory (we will get to advanced Markov processes, such as Markov blankets, some other time). This means we will extend both our understanding of Bayesian inference and of statistical thermodynamics in this tutorial. In the end, the below will especially be a gain for our intuitive understanding of Bayesian inference, since we will learn more mathematics that helps us better represent certain aspects of our current intuition on it. So “advanced” again does not mean that this tutorial will be particularly harder to get through than most of the others we have published so far, but it requires at least a stable intuition on conditional probability / Bayes’ rule and an understanding of why statistical thermodynamics is related to it.
For some orientation for those who have never heard of active inference, predictive processing or the free energy principle: active inference can be understood as a form of “machine learning / AI”, but in contrast to, e.g., ChatGPT, it does not refer to a merely passive framework, since it involves both action and perception. ChatGPT output can in some way be understood as mere “passive perception”; active inference adds a twist to classic modelling of perception. It was also shown by Isomura et al. in 2022 that canonical neural networks can be read as performing active inference, which makes active inference mathematically universal in that respect. However, the basic concept actually derives from the field of neuroscience, where it was developed to model action and perception in humans (mostly driven by schizophrenia research), as mentioned in earlier parts of this series (Friston 2010, amongst earlier formulations).
Recent advances in computational neuroscience concerning concepts such as predictive processing, active inference and the free energy principle have gained increasing attention in the last couple of years, as they provide vast interdisciplinary and intuitive explanatory power for both neurocognitive processes and experience itself.
Due to its growing popularity, an extensive and amazing effort has been put into increasing its accessibility to a wider public of researchers, both technically (programming a model) and conceptually (scientific and (neuro)philosophical background and discourse). Especially “A Step-by-Step Tutorial on Active Inference and its Application to Empirical Data” (ATUT), by Ryan Smith, Christopher Whyte and Karl Friston (2022), provides a thorough and detailed introduction to computationally modelling neurocognitive processes oneself.
Using active inference to do so involves simulating behavioral tasks and predicted neural responses, which can subsequently be compared to empirical data gained from running the same behavioral tasks with actual participants. The ATUT only requires a rather minimal background in mathematics and programming and provides all the information needed, accompanied by heavily commented code for Matlab (Matrix Laboratory), in order to fully program behavioral tasks oneself. A Python package called pymdp (Python Markov decision process) has been published as well, explicitly accompanying the above tutorial’s Matlab code and making it available to open access communities. This tutorial series on information theory especially provides you with background knowledge on information theory and its relation to thermodynamics, which is not covered in the ATUT.
I personally also started converting some of the Matlab scripts into R a while ago (see this Github repository). I definitely want to take this further in the future, since it is the perfect way to learn how active inference works in action (thanks to Ryan Smith and Christopher Whyte for some feedback on the R conversion!).
In case you are also interested in making active inference run on R, let me know if you want to team up for the project! You can contact me via my email: [email protected]
The goal of this “student’s tutorial” is to accompany their work by providing a detailed and slow-paced overview of the mentioned ‘minimal background’ in programming and mathematics required to get through the above-mentioned tutorial (especially for people outside of computer science related fields). As mentioned in previous parts, this series on information theory was initially all about active inference. Since the basics mostly consisted of classic information theoretic concepts, I split the initial version apart and added and re-wrote a lot of sections, so that it serves a more general purpose for our statistics tutorial project “Stat-o-Sphere” at NOS (e.g., as basics for understanding the Akaike and Bayesian information criteria (AIC/BIC), MCMC…).
The motivation for this tutorial came after I had built up a collection of notes, code and math tutorials myself in order to wrap my head around active inference, due to my personal interest in computational neuroscience and abductive inference (especially from a linguistic/semiotic and clinical perspective). This collection entailed scattered answers to the most important questions I asked myself, often still lacking the above-mentioned requirements.
We hope this series will serve as a decelerated support for, and addition to, the above-mentioned ATUT for people — students and researchers new to the field — who want to understand and apply “Higher Sphere” methods of inference.
There has also been an open access MIT publication on active inference by Thomas Parr, Giovanni Pezzulo and Karl Friston (2022) that we can recommend! It also comes with some code examples for Matlab.
The fifth and current part of this series entails most of the major basic mathematical / physics knowledge needed to understand active inference / the free energy principle / predictive processing in computational neuroscience. In the next part of this series we will give you a short introduction on active inference / FEP in general and most of all supply you with extra material on it (paper recommendations, videos, interviews etc.).
You may wonder why we start with mathematics this time? What happened to our triad of Intuition/Concept — Mathematics — Code? Well, in fact we have already covered most of the intuition we need to understand active inference / the free energy principle, since they can most of all be looked at as a direct extension of information theory:
Especially the rather “self-experiential” approach of discussing abductive inference as a way to understand the basic idea behind “Bayesian-Brain” concepts has supplied us with a stable intuition on current computational neuroscientific approaches to model neural activity as well as cognition and decision making. The below will just build upon what we have gone through so far already.
In order to understand why we need approximate Bayesian inference methods to get to more complex applications of information theory and Bayes’ rule in general, we have to look at the formula of conditional probability / Bayes’ rule that entails three or more instead of only two variables, and we will also introduce Bayes’ rule for continuous probability distributions.
Below you see the formula for classic conditional probability given two, three and n-dimensions / variables for discrete probability distributions.
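In standard notation, for two and for three variables this reads:

$$
p(x \mid y) = \frac{p(x,y)}{p(y)}, \qquad p(x \mid y,z) = \frac{p(x,y,z)}{p(y,z)} = \frac{p(y \mid x,z)\,p(x \mid z)}{p(y \mid z)}
$$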
To get a better overview of what changed, note that conditional probability involving three variables can be simplified via substitution, e.g., by treating the pair of variables $(y,z)$ as a single variable, which recovers the familiar two-variable form.
The posterior probability with three variables, $p(x \mid y,z)$, can be spoken as:

“the probability of $x$, given $y$ and $z$”.
We can perform the same rearrangements as with regular conditional probability / Bayes’ rule — again via the chain and the product rule. However, since we could substitute the joint of two of the variables with a single variable, nothing conceptually new happens here.
On the other hand, we can also literally think of the below as a simple extension by a second dimension with another variable (similar to how we discussed the binomial distribution in the first part of this series). So from the perspective of Bayes’ rule, you can just add “given $z$” to every term, i.e., additionally condition each probability on $z$.
We can also rearrange the above in order to have a look at the outwritten joint probabilities:
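Written out via the chain rule:

$$
p(x,y,z) = p(x \mid y,z)\,p(y,z) = p(x \mid y,z)\,p(y \mid z)\,p(z)
$$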
In the above, each factor follows from successively applying the product rule.
We could also reformulate the above in order to obtain one of the possible marginal probabilities:
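For instance, by summing the joint over the remaining variable(s):

$$
p(y,z) = \sum_x p(x,y,z), \qquad p(z) = \sum_x \sum_y p(x,y,z)
$$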
Here is where it gets a little tricky!! :O For discrete probabilities we will still get an answer, but the more variables we add, the more computational problems we will get. In technical terms, too many variables will result in an intractable marginal likelihood. The reason is that the more dimensions we add, the more numbers have to be summed out, and this number grows exponentially with the number of variables (which is especially a problem with continuous variables).
First of all, the mentioned boundary can be thought of as a problem of computability. Computability means having an algorithm to perform a calculation with limited resources and time. We will partially follow this short tutorial by Ben Lambert.
So from a mere pragmatic perspective, the more variables, the more relations, the more computing power is needed. Below you will see the formula for Bayes’ rule with n-variables:
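In standard notation (a reconstruction of the general form):

$$
p(x \mid y_1,\dots,y_n) = \frac{p(y_1,\dots,y_n \mid x)\,p(x)}{\sum_{x'} p(y_1,\dots,y_n \mid x')\,p(x')}
$$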
Let us also quickly go through Bayes’ rule for continuous probability functions (which we haven’t done in any of our tutorials so far). The only formal difference is that we use integrals instead of sums to marginalize, since a continuous variable cannot be summed out value by value:
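$$
p(x \mid y) = \frac{p(y \mid x)\,p(x)}{p(y)} = \frac{p(y \mid x)\,p(x)}{\int p(y \mid x')\,p(x')\,dx'}
$$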
Before we get into more details of intractability, the problem of computability may become more graspable when we think of a Markov chain graph in which a growing number of nodes / variables are connected with each other:
While Bayesian inference represents the optimal way to infer posterior beliefs within a generative model, Bayes theorem is computationally intractable for anything but the simplest distributions. […] the marginal likelihood (denominator) in Bayes’ theorem – requires us to sum the probabilities of observations under all possible states in the generative model (i.e., based on the sum rule of probability […] ). For discrete distributions, as the number of dimensions (and possible values) increases, the number of terms that must be summed increases exponentially. In the case of continuous distributions, it requires the evaluation of integrals that do not always have closed-form (analytic) solutions. As such, approximation techniques are required to solve this problem. (Smith, Whyte, Friston 2022)
The problem of intractability in general has a mere pragmatic (efficiency), but also a deeper mathematical and computer scientific dimension to it, referring to the so-called P-NP problem from complexity theory. There is a great tutorial by Jade Tan-Holmes on her channel “Up and Atom” that we can recommend if you want to get a first hint on that topic. We decided to proceed by presenting a solution for our problem.
So the big question is: can we find a way to work around this and efficiently approximate the model evidence, in order to normalize our joint probability with it and update our model (obtain the posterior)? Intuitively, since we have repeatedly compared Bayes’ inference with “human” inference and related it to computational neuroscience, note that going through “everything”, every possible case / fact etc., when making a probabilistic decision is importantly not what we humans do, since it takes a lot of time (is not pragmatic or even doable) — and we also wouldn’t want to do so, since we are always aiming for specific information in relevance to us or a specific matter. In other words, you do not use or need a 1:1 map down to every atom in order to orientate within a city or any region (this comparison is made by Maxwell Ramstead and can be found here).
The latter also implies a biological perspective on inference and what it takes. This also entails boundaries in terms of metabolism and complexity of structures. Concerning the latter, minimizing free energy is also considered an attempt of an organism as a whole to overcome the second law of thermodynamics, i.e., trying to delay decay. Decay does not only concern the physical but also the conceptual level: our model of the world, which we try to keep or try to confirm. In other words: Bayesian brain theories somewhat follow a scheme of self-evidencing (e.g., in “All thinking is wishful thinking” by Kruglanski, Jaso, Friston, 2020).
This is also where our thermodynamic analogy on a higher level of abstraction (active inference as a mathematical tool to describe decision making) becomes related to actual thermodynamics (the organism as such): inferring on the world is done by minimizing free energy to keep on the trajectory of a model, meaning to keep the deviation between the true states of the world and our expectations on present states within a generative model of those events low, in order to save cost or affordance while trying not to fall apart and handling life as a person at the same time. You will find a lot of recorded lectures on the free energy principle and active inference on various topics (e.g., this one called “Me and my Markov blanket”), in which Karl Friston often gives simple examples of models that involve a daily routine, e.g., getting up, making coffee, going to the subway station and so on. Such a model would consist of a steadiness in terms of periodically getting back to states, such as “getting up, making coffee”, but would not involve an equilibrium, where nothing changes. This dynamic homeostasis, or better homeorhesis, is referred to as a non-equilibrium steady state distribution (NESS; the latest formulations on that discourse can be found here: Bayesian mechanics for stationary processes by Da Costa et al.).
However, as discussed before, strong deviations from a model (a lot of “questions that need to be asked”) will result in a lot of model updating. Models themselves are an attempt to keep such deviations within boundaries. On the other hand, a NESS also implies the possibility of a desire for change as a whole, as in getting rid of a model with which one is in a joint relation. This can be understood in an optimistic or pessimistic sense, or better as voluntary or non-voluntary: finding a way to fulfill a positive desire for change (changing peer group, social context, or going on an adventure etc.), or as a way to cope with undesired changes, even though it is hard to do so (trauma, escape from war etc.).
To answer the question whether we can get a grip on the model evidence: yes, we can, given approximate Bayesian inference methods. As noted in the introduction, this and the next part of this series will focus on the mathematical basics of active inference — and we are after one particular method, the variational inference method, which goes back to the variational method defined by Richard Feynman (Statistical Mechanics, p. 86 (1972); also see Friston et al. (2006)).
By the way, glad that you ask, yes, we are going to have a closer look into the relation between physics and information theoretic methods again and expand our understanding of thermodynamics, discussing the concept of inner and free energy and their relation to entropy.
Now that we know that our previous method of computing a posterior probability distribution is not enough to perform inference at the level of complexity we are after, let us finally upgrade our known Bayes’ rule into an approximate Bayesian inference method, which overcomes the problem of possible intractability of regular Bayes’ theorem.
As mentioned, for this to happen we need to expand our understanding of statistical thermodynamics / mechanics. In general, what we will present below is actually not that difficult mathematically. The physics behind the concept of free energy will also be much easier to understand than the concept of entropy by itself (which is a part of the calculation of free energy). Most of all, we will get to know a very clever ‘trick’ in order to approximate the model evidence. A ‘trick’ that will also make our calculations, as well as our general concept of abductive Bayesian inference, more intuitive.
Below you will see the classic formula for computing the model evidence (as discussed before). We will also adapt the notation of the variables that are used in active inference, where $s$ denotes the (hidden) states and $o$ the observations.
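In that notation, the model evidence is obtained by summing the joint over all hidden states:

$$
p(o) = \sum_s p(o,s) = \sum_s p(o \mid s)\,p(s)
$$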
To relate the notion of conditional independence to our intractability/approximation problem and to information theory: conditional independence means that we cannot fully infer the entropy of a message, i.e., infer on the message itself, since we lack certain “exact” or complete insights into the information that is around us.
As we are unable to infer on the model evidence directly in order to then obtain the posterior, we could still try to play around a bit and compare our results, in order to optimize an approximation of the posterior.
For more simplicity we will also again go through the formulas for approximate Bayes with only two elements in joint with each other ─ a threefold joint is used in the ATUT anyway, so here you will find a downgraded version of the formulas, in case you lose orientation.
However, we can say that epistemically
Our professional method of choice to get around the intractability problem will be the so-called variational inference method. Roughly, this method is used to approximate the posterior (!) by heuristically trying out possible posterior distributions, denoted $Q(s)$, and evaluating how good each guess is.
In general, the actual formula for variational free energy within the active inference framework looks much more complicated than it is, and I promise it entails nothing we couldn’t understand after all that we have gone through so far (*wiping-off-a-tiny-tear*). The only difficulty lies in bringing in some additional facts from statistical thermodynamics (Helmholtz free energy, variational methods (Feynman)) and nowadays information theory (Jensen’s inequality, KL divergence, variational free energy). However, a lot of the below will again consist of mere rearranging of formulas and alternating notations. As promised, we will again only work with two variables and within the realm of exact Bayes’ inference (even though variational Bayes would actually not be required in such a case).
In classic thermodynamics, the free energy is the result of a relation between the inner energy and the entropy of a system, and is considered a quantity for the amount of work that can be extracted from a system at constant temperature and volume ─ as both changes in temperature, in the form of thermal energy being lost in a process (as in a steam engine), as well as changes in the volume of a system (exploding steam engine), would subsequently change the free energy of the system as such (we can also recommend this paper by Gottwald and Braun (2020) on this topic).
From a physical perspective, a so-called Stirling engine is a very good example to understand what free energy is and how it is converted into work:
The cylinder can be considered our system (exchange of heat). The heat at the bottom of the cylinder (red) heats up the air (gas) within it, which means that the atoms essentially accelerate and bump against the displacer plate (pink). In other words: the entropy, and therefore the kinetic energy of the gas atoms, rises. The atoms then push against the displacer plate (pink) in a way that makes it move up. At the top of the cylinder the air and the displacer cool down again, making the displacer move down again, since the pressure decreases. This also happens due to the involved mechanics that eventually push the displacer down again — a kind of competing difference. The pressure beneath the displacer plate then rises again, also due to the compression of the gas beneath the plate moving down, and therefore the displacer eventually moves up again after some time, resulting in a cycle of the displacer being pushed up and down, up and down… You can see that the structure is a little more complex, since there is another smaller cylinder (green) that also transfers energy into mechanical work, but with a slight latency, helping to push the displacer plate downwards again (so there is no constant volume in the example above). If the above system were in an equilibrium state, such that no pressure is built up in the cylinder (no temperature difference), nothing would happen. We will later see that the pressure that gets converted into mechanical work can be looked at as so-called free energy. Note that the above also works the other way around: if the bottom of the cylinder were cold (ice cube at the bottom) and the upper part of the cylinder warmer (room temperature), the same thing happens!
In the context of information theory, or in general Bayes’ inference, this “amount of work” can also be seen as the amount of effort or cost of a system ─ especially when we keep in mind that the information theoretic perspective refers to statistical inference as a potentially costly process of gaining information (our intractability problem refers to that as well): the less information gained, the more I knew already, the less costly an inference was ─ e.g., the closer my prior had been matching the likelihood already, or the closer my approximate posterior will match the actual posterior. Before we get back to our intuition on abductive inference, let us first look at the physics formula for free energy. We know the relation between physics and abductive inference may not be so clear already, but we promise we will soon get there!
The formula for the Helmholtz free energy, i.e., the maximum usable part of the energy of a system, relates the inner energy $U$, the temperature $T$ and the entropy $S$ of that system.
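In its standard form:

$$
F = U - TS
$$

with inner energy $U$, temperature $T$ and entropy $S$.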
At this point it made sense to me to recall the units used in physics to express these two quantities:

Inner energy (as specific energy) = Energy (J) / Mass (kg)

Entropy = Energy (J) / Temperature (K)
Both quantities actually reflect the energy of a system, just in relation to different factors ─ in a banal sense, this is the reason why the formula goes “$U - TS$”: multiplying the entropy by the temperature $T$ (K) turns it back into a plain energy (J), so that both terms can be related to each other.
As mentioned, the relationship between free energy, inner energy and entropy relies on the second law of thermodynamics, stating that over time a system preferably occupies a state of maximum entropy given its inner energy. We know what maximum entropy is in classic information theory: the entropy is at its maximum in the case of equal probability of each microstate ─ which also makes sense in terms of maximum uncertainty (randomness) and minimum certainty when inferring on information. Note that all of this also relates to the equilibrium steady state distribution of our Markov chain ─ where equilibrium represents the notion of a consistency of (transition) probabilities over time (our eigenvector).
With all of this in mind, we can already mathematically represent a part of the free energy equation above ─ the entropy. Recall that our information theoretic entropy formula also represented the expected value of the surprisal.
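In formulas:

$$
H = \mathbb{E}\big[-\log p(x)\big] = -\sum_x p(x)\,\log p(x)
$$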
As hinted above, the principle of maximum entropy can also be inverted, in the sense of a principle of minimum inner energy, stating that in the case of constant entropy (set maximum entropy) a system will occupy the state with the least amount of inner energy. Constraints such as constant entropy within the definition of the principle of minimum inner energy will actually be helpful in a bit, in order to fully understand the actual calculation of the free energy $F$.
In essence, looking at the formula for free energy again, and without an idea of how to formulate the expected value of the inner energy yet, we can still reason about how the involved quantities constrain each other.
We can extract some facts from the above intuition from physics to build up some mathematical constraints for our formula of $F$:
As the entropy is considered constant, changes in $F$ can only come from the (inner) energy term, and an energy below zero is not possible.

Conceptually and intuitively, it also wouldn’t make sense for the value of the inner energy to become negative.

This is also consistent with our previous understanding of information theory in general, where negative quantities of information do not make a lot of sense, if information is in general something phenomenologically given, so to speak. From the perspective of abductive Bayes’ inference again, free energy can here be seen as the effort or cost of an inference made (work), since it implies that there is a divergence that has to be overcome to obtain an equilibrium state, so the above also holds for $F$: it cannot become negative either.
From what we have learned conceptually so far, we can derive our first inequality: the free energy cannot become smaller than the entropy term, $F \geq H$ (with the temperature set to $T = 1$, as we will use below). The above is the case since the value of the energy term cannot become negative.
Let us recall some of what we already know on that matter of inequalities from classic information theory ─ here is an inequality we have encountered before (the code below entails all the code we need from Information Theory III on Markov chains):
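In formulas (conditioning can never increase entropy):

$$
H(y) \;\geq\; H(y \mid x), \qquad \text{where} \quad H(y \mid x) = -\sum_{x,\,y} p(x,y)\,\log_2 p(y \mid x)
$$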
### We will again use the last Markov chain example
### from Information Theory III:
library(markovchain) # provides new("markovchain", ...) and steadyStates()
MessageABC = c("A", "B", "C")
MessageABCTransMatrix = matrix(c(.0,.8,.2,
.5,.5,.0,
.5,.4,.1),
nrow = 3,
byrow = TRUE,
dimnames = list(MessageABC, MessageABC))
MCmessageABC = new("markovchain", states = MessageABC,
byrow = TRUE,
transitionMatrix = MessageABCTransMatrix,
name = "WritingMessage")
markovchainSequence(n = 20, markovchain = MCmessageABC, t0 = "A")
# Plot Markov Chain
plot(MCmessageABC, edge.arrow.size = 0.1)
# We will quickly write a function adding a tiny
# value to our inputs:
bit_log_nonzero = function(x) {
  log2(x + 2^(-16)) # tiny offset avoids log2(0) = -Inf
} # End of function
### Joint matrix:
steady = steadyStates(MCmessageABC)
trans_mat = as.matrix(MessageABCTransMatrix)
# Initialize empty matrix:
joint_mat = matrix(0, ncol = ncol(trans_mat), nrow = nrow(trans_mat))
for (i in 1:length(steady)){
for (j in 1:ncol(trans_mat)){
joint_mat[[i,j]] = steady[[i]]*trans_mat[[i,j]]
} # end for j
} # End for i
### CONDITIONAL ENTROPY H(y|x) (AMTC p. 11)
# Below we will work with the numbers of our last Markov chain example:
EntropyPOST = -sum(joint_mat*bit_log_nonzero(MessageABCTransMatrix))
# [1] 0.9340018 = H(y|x)
# ENTROPY OF A SINGLE EVENT OF A JOINT:
# (computed directly from the marginals of the joint matrix)
EntropyX = -sum(rowSums(joint_mat)*bit_log_nonzero(rowSums(joint_mat)))
EntropyY = -sum(colSums(joint_mat)*bit_log_nonzero(colSums(joint_mat)))
# H(y) is greater than or equal to H(y|x):
EntropyY>=EntropyPOST
The above essentially states that the amount of information I need to infer on a state when only having the prior at hand is greater than or equal to the amount of information entailed in the inference on a state given an observation, i.e., $H(y) \geq H(y \mid x)$: an observation can only reduce (or keep) our uncertainty.
From this we can also make sense of the following:
# EntropyY-EntropyPOST >= 0
EntropyY-EntropyPOST >= 0
We can actually relate this to our free energy formula, including the facts that the energy term cannot become negative and that the free energy therefore cannot drop below the entropy. This again says that energy itself in any sense, inner or free, cannot be less than zero. This is also what is referred to as free energy being an upper bound on entropy, and it is mathematically derived from Jensen’s inequality, to which we will get later in a separate chapter.
We can also derive an equality as a special case of the above, where the energy term becomes zero. From this we can derive that in such a case the free energy equals the entropy, $F = H$. The above represents the case of minimum inner energy, i.e., our minimized free energy, as we will verify in code further below.
We could now create a similar situation just with the entropy in relation to itself (since $H = H$ trivially holds).
Looking closely at the formula will reveal a surprising chance to bypass the evaluation of the model evidence itself using a simple, but sophisticated variational trick to do so (goes back to variational methods in statistical mechanics, introduced by Richard Feynman, see Statistical Mechanics, p. 86f. (1972)).
From what we know so far, the entropy can here be expressed as the surprisal of the model evidence, $H = -\log p(o)$ (in our example below, entropy and surprisal are equivalent).
Let us now do some rearrangement, based on our thoughts around the trivial equality $H = H$: we first write the model evidence as a sum over the joint, then multiply and divide by the posterior inside the sum, take the log into the sum, and finally express the result in expected value notation.
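Reconstructed step by step, matching the R code below:

$$
\begin{aligned}
H &= -\log p(o) \;=\; -\log \sum_s p(o,s)\\
  &= -\log \sum_s p(s \mid o)\,\frac{p(s \mid o)\,p(o)}{p(s \mid o)}\\
  &= -\sum_s p(s \mid o)\,\log \frac{p(s \mid o)\,p(o)}{p(s \mid o)}\\
  &= \mathbb{E}_{p(s \mid o)}\!\left[-\log \frac{p(o,s)}{p(s \mid o)}\right]
\end{aligned}
$$

Note that taking the log into the sum is in general only valid as an inequality (Jensen’s inequality, see further below); here equality holds because the fraction inside the sum is constant, namely $p(o)$.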
Everything that was just redundantly added cancels out again (with the exception of the last line), but see how we now got the posterior right in front of us? This will lead us to an important trick that brings us beyond the special case where the true posterior is already known.
Let us quickly compute the above, to see if the equations hold, before reflecting on how to fully represent our free energy formula, resulting in the variational free energy.
#### Generative Model
prior = c(.5,.5); likelihood = c(.8,.2)
joint = prior*likelihood
# True posterior (here calculated via Bayes; for
# simulations think of a supervised situation,
# so the true state is known):
modelevidence = sum(joint)
Truepost = joint/modelevidence
# Expected model evidence = Entropy
# In our case Entropy and surprisal are equivalent.
Entropy = -sum(modelevidence*log(modelevidence)+((1-modelevidence)*log(1-modelevidence)))
Surprisal = -log(modelevidence)
Temperature = 1
# Going through all the lines:
# H = H
Entropy==Entropy
-log(sum(joint))==Entropy
# The below does not exactly work with vectors/matrices.
# For a single value result use: -log(.5*.8/.8)
-log(Truepost*modelevidence/Truepost)==Entropy
-log(sum(Truepost*(Truepost*modelevidence/Truepost)))==Entropy
# With expected value notation, i.e., average surprisal
-sum(Truepost*log(Truepost*modelevidence/Truepost))==Entropy
All of the formulas above hold for the case where the distribution we infer with is the true posterior $p(s \mid o)$ itself, i.e., for exact inference.
Let us now substitute the true posterior with an approximate posterior distribution $Q(s)$:
From what we know, we could try to extract the entropy from our expected value formula, so that the remaining term can be treated as the (inner) energy:
The latter line may look a little weird, but recall that $p(o,s) = p(s \mid o)\,p(o)$, so the model evidence, and with it our entropy term, can be factored out of the expectation.
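Using $p(o,s) = p(s \mid o)\,p(o)$, and matching the R code below, the free energy decomposes as:

$$
F \;=\; -\sum_s Q(s)\,\log \frac{p(o,s)}{Q(s)} \;=\; \underbrace{-\sum_s Q(s)\,\log \frac{p(s \mid o)}{Q(s)}}_{\text{“energy” term}\;\geq\;0} \;+\; \underbrace{\big(-\log p(o)\big)}_{=\;H \text{ here}}
$$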
# F = Energy - (-T*H); for the example from ATUT p. 4
# the energy term is zero, since Q(s) equals the true posterior:
Energy = -sum(Truepost*log(Truepost/Truepost)) # = 0
HelmholtzFE = Energy-(-Temperature*Entropy)
# F >= H
HelmholtzFE>=Entropy
# F = H (minimized FE)
HelmholtzFE==Entropy
The important trick that is hidden in the formulas now is that we can approximate our posterior and at the same time indirectly see how far our approximation $Q(s)$ deviates from the true posterior, simply by monitoring the free energy.
This makes it conceptually look like minimizing the value of $F$ means minimizing the divergence between $Q(s)$ and the true posterior, and that is exactly what happens: the lower the free energy, the better our approximation.
Mathematically spoken, the process of minimizing free energy can be looked at as performing gradient descent. The code below for the example at ATUT p. 5 shows how the value of $F$ decreases the closer $Q(s)$ gets to the true posterior.
The very left term on the left-hand side of the last line can also be understood as the KL divergence or relative entropy, i.e., as how far the free energy deviates from the actual entropy when $Q(s)$ deviates from the true posterior.
Here is the code for the variational free energy example on ATUT p. 5, which is equivalent to the calculation done in the ATUT Matlab script “VFE_calculation_example” (see Ryan Smith’s Github repository):
# Minimizing Free Energy:
# Example ATUT p.5
Qs1 = c(.5,.5)
Energy1 = -sum(Qs1*log(Truepost/Qs1))
VFE1 = Energy1-(-Temperature*Entropy)
Qs2 = c(.6,.4)
Energy2 = -sum(Qs2*log(Truepost/Qs2))
VFE2 = Energy2-(-Temperature*Entropy)
Qs3 = c(.7,.3)
Energy3 = -sum(Qs3*log(Truepost/Qs3))
VFE3 = Energy3-(-Temperature*Entropy)
Qs4 = c(.8,.2)
Energy4 = -sum(Qs4*log(Truepost/Qs4))
VFE4 = Energy4-(-Temperature*Entropy)
# Plot that makes clear what descending a gradient means conceptually:
plot(x = 1:4, y = c(VFE1, VFE2, VFE3, VFE4), type = "l")
Here is the classic formula, including the inequality of free energy being an upper bound on the entropy (here: the surprisal of the model evidence).
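In expected value notation:

$$
F \;=\; \mathbb{E}_{Q(s)}\big[\log Q(s) - \log p(o,s)\big] \;=\; D_{KL}\big[Q(s)\,\|\,p(s \mid o)\big] - \log p(o) \;\;\geq\;\; -\log p(o)
$$

with equality exactly when $Q(s)$ matches the true posterior $p(s \mid o)$.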
The divergence we want to overcome (minimize) is that between our generative model and that of the actual generative process in the world.
Drawing a relation back to Shannon’s work and the discussion we had about Weaver’s work and semiotics, we can see that an exact inference problem is turned into an optimization problem. In general, we hope we could show that active inference can be seen as a direct extension of Claude Shannon’s information theoretic methods of inference.
Here is the same formula, just using three variables:
Below we will have a brief look into the general structure of an active inference model, discussing the difference between the generative process and the generative model (both of which we briefly mentioned before). This is also very well explained in the ATUT, though. After that we will further look into the KL divergence, i.e., the relative entropy, as well as Jensen’s inequality.
As mentioned previously in this tutorial, the joint probability is also referred to as the generative model, which is distinguished from the generative process. This, again, reflects on the epistemics of inference:
A generative model, as discussed above, is constituted by beliefs about the world and can be inaccurate (sometimes referred to as ‘fictive’). In other words, explanations for (i.e., beliefs about) how observations are generated do not have to represent a veridical account of how they are actually generated. (ATUT, p. 4)
Apart from the possibility of counterfactual inference within causal inference (compare Corcoran, Pezzulo, Hohwy (2020), Pearl (2009)), this also addresses the fact that we do not experience or model the world on a 1:1 scale, so to speak (compare the previously mentioned video tutorial by Ramstead (2020), and ATUT p. 4). We are only interested in specific information related to a model of the world; therefore, active abductive inference is also cast as attention (Friston (2010)). In other words: a 1:1 map would not be necessary to get from one point to another, or more specifically: not every detail is important or necessary in order to perform a successful relation between a map and the actual trail it represents in the actual world, even though such a map will not answer every question we could have about the world ─ but specific questions indeed (Ramstead (2020)).
In contrast, the generative process refers to what is actually going on out in the world – that is, it describes the veridical ‘ground truth’ about the causes of sensory input. For example, a model might hold the prior belief that the probability of seeing a pigeon vs. a hawk while at a city park is
Both the generative process and the generative model represent joint probabilities. In that sense, beliefs in the form of policies, or anything related to beliefs being something internal, are actually not fully demarcated, but again in a joint relation with each other: a being in the world. In other words: active inference and the free energy principle inherit a phenomenological perspective. Including action as a way to change the state of the world, a graphical representation of our upper example would look like this (similar to ATUT, p. 5):
I have chosen the example of a ball falling to the ground and gravitation as a model, as it entails all the ingredients that Galileo used to do his first measurements of the time an object needs until it lands when falling from a certain height. This is not supposed to be a random fact that I want to include in this tutorial, but an example of hierarchical abductive inference that not only applies to our intuition, but also to abductive inference in terms of an actual scientific method ─ also entailing deduction and induction.
To perform his measurements, Galileo used a ramp (an inclined plane), as it made objects “fall” slower and steadily: it was hard to measure the speed of a falling object without any of today’s technology. In order to provide a scaled measure, water was filled into a glass for the time the ball was rolling down the ramp (just think of tick marks for the amount of water after 1 m, 10 m etc., and the time that has passed, which can be measured separately). After multiple attempts the measurement turned out to be very precise, indicating no, or at least no measurable, differences of gravity over time.
By intuition, the structure of such experimental projects appears to be always the same. At first there may be a bunch of questions, an indirect hypothesis, e.g., things fall down, things fall down steadily, the mass makes things fall faster, or not… Next up is the likelihood, in terms of an event to check the hypothesis, i.e., checking on mere prior assumptions (abduction). An experimental setting is designed and built up in order to make the event comparable to a previous event. In the case of Galileo this meant that the same ball, the same ramp etc. had to be used for every trial (deduction). Eventually Galileo obtained a whole bunch of events and was able to compare them with each other, figuring that the fall time is stable under comparable conditions and can be generalized in mathematical formulas (induction), to further see how this works out as an approximation when, e.g., predicting the behavior of other objects. This and some other experiments performed by Galileo eventually resulted in the first measurement of gravitational acceleration.
All the latter is supposed to roughly demonstrate an intuitive ontogenesis of science and the plausibility of the development of scientific methods by humans, based on abductive inference being the prime form of every inference (no matter the hierarchical and heterarchical scale of relations). This also shows why C.S. Peirce’s pragmatist concept of abductive inference and semiotics can be understood as a phenomenological approach to the description of the structures of inference as such. Active inference, though, was much more influenced by the physical/mathematical work of Helmholtz on perception as a kind of inference, as well as his formula for free energy, which we got to know above in its information theoretic form.
The Kullback-Leibler divergence is also known as the relative entropy, Bayesian surprise or information gain, and represents a measure of how much one probability distribution diverges from a second, reference distribution.
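For discrete distributions $P$ and $Q$:

$$
D_{KL}\big[P \,\|\, Q\big] \;=\; \sum_x P(x)\,\log \frac{P(x)}{Q(x)}
$$

i.e., the expected log difference between the two distributions, taken with respect to $P$.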
Example: imagine having a set of colored balls that are spread within a space. The space is divided into two regions, and the question is now: how good is the split, in terms of how likely it is to find each of the colors on one of the two sides, compared to before, i.e., the space without a split? As usual, we will reflect on this using information theory. We have 5 blue and 5 green balls. The formula for the entropy is our expected value of either green or blue, which can be written as:
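With $p_{\text{blue}} = p_{\text{green}} = \tfrac{1}{2}$:

$$
H_{\text{before}} \;=\; -\sum_i p_i \log_2 p_i \;=\; -\left(\tfrac{1}{2}\log_2 \tfrac{1}{2} + \tfrac{1}{2}\log_2 \tfrac{1}{2}\right) \;=\; 1 \text{ bit}
$$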
The code below follows our classic entropy formula in terms of the negative sum of the weighted log of the probability of each color appearing, when sampling from it.
# We will use a matrix to express p_i:
pColor = matrix(c(.5, .5))
# H_before:
Hbefore =-sum(pColor*(log2(pColor)))
Let us now look at the split:
# EXAMPLE:
# Imagine having 5 green and 5 blue balls:
plot(x = 1,
xlab = "X Label",
ylab = "Y Label",
xlim = c(0, 3),
ylim = c(0, 3),
main = "Blue and green balls",
type = "n")
# Blue balls
points(1,2, col = "Blue")
points(0.4,2.8, col = "Blue")
points(1.2,2.2, col = "Blue")
points(1.7,0.7, col = "Blue")
points(0.9,0.6, col = "Blue")
# Green balls
points(1.6,1.6, col = "Green")
points(1.9,2.6, col = "Green")
points(2.3,0.6, col = "Green")
points(2.8,2.85, col = "Green")
points(2.4,1.6, col = "Green")
# Split at X = 1.5.
abline(v=1.5, col="black")
The split at a certain point can be thought of as a kind of change in the entropy due to inference, but also as, e.g., evaluating compressed data. Concerning our use in active inference, the relation between the entropies before and after the split can be thought of as the change in our (approximate) beliefs before and after an observation.
On the left side, all four of the balls are blue; on the right side we find five green balls and one blue ball, six in total.
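The respective entropies are then:

$$
H_{\text{left}} \;=\; -1 \cdot \log_2 1 \;=\; 0, \qquad H_{\text{right}} \;=\; -\left(\tfrac{1}{6}\log_2 \tfrac{1}{6} + \tfrac{5}{6}\log_2 \tfrac{5}{6}\right) \;\approx\; 0.65
$$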
# Hleft is simple
Hleft = -sum(1*log2(1))
# Hleft probs as vector
probRight = c(1/6,5/6)
Hright = -sum(probRight*log2(probRight))
We can now quantify the quality of the split by weighting the entropies by the proportion of elements on the respective side.
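With 4 of the 10 balls landing on the left and 6 on the right:

$$
H_{\text{split}} \;=\; \tfrac{4}{10}\,H_{\text{left}} + \tfrac{6}{10}\,H_{\text{right}} \;\approx\; 0.4 \cdot 0 + 0.6 \cdot 0.65 \;=\; 0.39
$$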
# Weighted entropy after the split:
Hsplit = .4*Hleft + .6*Hright
# KL divergence / relative entropy (information gain):
Informationgain = Hbefore - Hsplit
As said, the relative entropy now represents a relation between the entropies of each state, i.e., before and after the split. In other words, and related to active inference: the inner energy represents the divergence to the actual minimum inner energy (the case where the energy term is zero and the free energy equals the entropy).
This part focuses on the exact mathematics behind the inequality of free energy being an upper bound on entropy, namely Jensen’s inequality.
To demonstrate Jensen’s inequality, we will play another round of dice (rolling a fair six-sided die, where the payoff of a roll is the square of the number rolled).
Mathematically, the payoff can be understood as a parabola function, which we will call $g(x) = x^2$:
# Define the payoff function g(x) = x^2
g <- function(x) (x^2)
# Plot of g(x) ranging from x = -1 to 6.
curve(g, -1, 6, ylab = "g(x)")
abline(v = 0)
# Points marking all possible payoffs
points(1,g(1)) # Payoff for x = 1
points(2,g(2)) # ... for x = 2
points(3,g(3)) # ...
points(4,g(4)) #
points(5,g(5)) #
points(6,g(6)) #
A special characteristic of our function $g(x) = x^2$ is that it is a so-called convex function.
A rather graphical approach of defining convexity involves choosing two random points in the space above the function, the so-called epigraph. When the points are connected via a straight line, the line will never cross the function.
Another approach is describing the graph of a convex function as continuously increasing its slope. A more mathematical approach would involve evaluating the derivatives of our function $g(x)$: if the second derivative is positive everywhere, the function is convex.
We will obtain the first and second derivative via the power rule, such that in our case:
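$$
\frac{d}{dx}\,x^n = n\,x^{\,n-1} \quad\Rightarrow\quad g'(x) = 2x, \qquad g''(x) = 2
$$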
Both derivatives are positive, either in the sense that they consist of a single positive value ($g''(x) = 2$) or that the slope $g'(x) = 2x$ is positive for all positive $x$, such as our dice values.
Now that we have a rough idea of what a convex function is, let us now see what Jensen’s inequality actually states ─ and don’t worry, it may seem a lot, but we will again decompose the inequality step by step. In relation to our game, Jensen’s inequality states that the expected payoff of playing the game, denoted as the expected value of a function, $E[g(X)]$, is greater than or equal to the payoff of the expected dice value, $g(E[X])$:
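$$
\mathbb{E}\big[g(X)\big] \;\geq\; g\big(\mathbb{E}[X]\big) \qquad \text{for convex } g
$$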
Let us first evaluate our expected value of the dice roll $X$, i.e., the weighted average of the possible dice values, $E[X] = 3.5$:
EX = sum(1/6*(1:6))
The payoff from the expected value, i.e., the function applied to the expected value, is then $g(E[X]) = 3.5^2 = 12.25$:
gEX = g(EX)
Translated, this delivers the answer to the question: considering the weighted average of the possible dice values, what would be the payoff of that average value?
The left-hand side of Jensen’s inequality above represents the expected value of a function, i.e., our expected payoff from playing the game. The formula just slightly deviates from the plain expected value: the function $g$ is applied to each possible outcome before averaging:
Egx = sum(1/6*g(1:6))
Translated, this says that the weighted average payoff from playing the game will be 15.16667, and delivers an answer to the question: if I play this game, what payoff will I get on average each time I play?
# Jensen's inequality
# E[g(x)] ≥ g[E(x)]
Egx >= gEX
Now we are going to look at our graph again, in order to reflect on the above:
# gEX and Egx at x=EX
points(EX, gEX, col="blue")
points(EX,Egx, col="red")
We can see that at our coordinate $x = E[X] = 3.5$, the expected payoff $E[g(x)]$ (red point) lies above the payoff of the expected value $g(E[X])$ (blue point).
We will now draw a line going through the origin $P_1 = (0,0)$ and the point $P_2 = (E[X],\,E[g(x)])$.
We can evaluate the slope of our function via:
# Slope b of f(x)=y=a+bx via two points:
P1 = matrix(c(0, 0)) # x = 0, g(x) = 0
P2 = matrix(c(EX,Egx)) # x = EX, g(x) = Egx
b = (P2[2]-P1[2])/(P2[1]-P1[1])
We can now evaluate the intercept $a$ of our line $f(x) = a + bx$:
# We can now evaluate a by filling in a point,
# say P1 = (0,0), into y = a + b*x  =>  a = y - b*x
a = 0 - b*0 # = 0
# f(x) = a + 4.33333*x
f = function(x) (a + b*x)
# add to plot
curve(f, -1, 6, ylab = "x", add=TRUE)
We are now going to evaluate the crossing points graphically and mathematically to get a better overview. We will start with our new function $f(x)$ and the point where it crosses $g(x)$:
# In our case the value of x of the point where
# f(x) crosses g(x) is equal to the slope of
# f(x), so we can evaluate the value of y
# of our point via a shortcut.
y = b*b
# Crossing points:
# upper crossing point f(x) with g(x)
# where Egx=gEX!
segments(x0=0,y0=y, x1= b, y1=y, lty =3)
segments(x0=b,y0=0, x1= b, y1=y, lty =3)
text(x=5.3,y=18, label="E[g[4.33]]=g[E[4.33]]", srt = 3, col = "darkgreen")
text(x=-.6,y=y, label="g(x)=g(4.33)", srt = 3, col = "darkgreen")
text(x=b+.5,y=0.2, label="x=4.33", srt = 3, col = "darkgreen")
Now we can check mathematically, if our assumptions are true:
# Check P1:
EXP1 = sum(1/6*(0))
gEXP1 = f(EXP1)
EgxP1 = sum(1/6*f(0))
# Is exactly equal?
EgxP1 == gEXP1
# Check P2:
EXP2 = sum(1/6*(b))
gEXP2 = f(EXP2)
EgxP2 = sum(1/6*f(b))
# Is exactly equal?
EgxP2 == gEXP2
Now we will add the other points, as well as some annotation and lines, to the plot to get further orientation over our results. The most challenging aspect of orientating within the below plot is the fact that the y-axis refers to two different functions at once (here $g(x)$ and $f(x)$).
# Add rest of the points and some annotation:
# (EX|Egx)
segments(x0=0,y0=Egx, x1= EX, y1=Egx, lty =3)
segments(x0=EX,y0=0, x1= EX, y1=Egx, lty =3)
text(x=-.6,y=Egx, label="E[g[x]]", srt = 3, col = "red")
text(x=EX+.35,y=0.2, label="x=E[X]", srt = 3, col = "black")
# (EX|gEX)
segments(x0=0,y0=gEX, x1= EX, y1=gEX, lty =3)
segments(x0=EX,y0=0, x1= EX, y1=gEX, lty =3)
text(x=-.6,y=gEX, label="g[E[x]]", srt = 3, col = "blue")
I hope this chapter helped you to understand and get some orientation over Jensen’s inequality from a mathematical and graphical perspective. Also check out Oleg Solopchuk’s tutorial on Medium on the FEP for some more perspectives on Jensen’s inequality. His tutorial has very nice visualizations, but has a much stronger pacing at some points (and also includes aspects such as conditional independence etc.).
In the next part of this series, we will also have a look into the concept of Markov blankets and other more advanced aspects of active inference. Nevertheless, you should now already be prepared to proceed to the actual ATUT.