
[R] Higher Spheres: Information Theory II: The Relation Between Information Theory / Technology and Statistical Thermodynamics

Intermediate to Advanced Stat-o-Sphere

Published on Aug 18, 2023

Follow this link directly to part III of this series, if you wish, or have a look at other tutorials of our collection Stat-o-Sphere.

Review: This is a pre-released article, and we are currently looking for reviewers. Contact us via [email protected]. You can also directly send us your review, or use our open peer-review functions via pubpub (create an account here). In order to comment, mark some text and click on “Start discussion”, or just comment at the end of an article.

Fig. How to start a discussion. A pubpub account is needed to do so. Sign up here. Also have a look at articles all about our Open Science Agenda.

Pub preview image generated by Mel Andrews via midjourney.

Corresponding R script:


Welcome to the second part of our series on information theory. You have now entered the intermediate to advanced Stat-o-Sphere, so we recommend reading at least the first part of this series beforehand, in order to make sense of the below without being overwhelmed by it. However, the basic ideas introduced in this tutorial are actually not that difficult, since we again (and again and again) will just follow and extend the basic idea behind Bayes’ rule and conditional probability.

Note that this tutorial can also be helpful for, e.g., medical students who want a hint of the very basic idea behind thermodynamics, which is helpful for understanding physics and chemistry and, most of all, for gaining a very basic intuition for general biology and physiology.

In the first part of our series on information theory we tried to get some basic insights into Claude Shannon’s solutions for problems in the field of (telephone) communication that led to several different forms of information technology that we are now casually making use of every day.

Fig. — “Hello, Meowssissippi Cat Emergency speaking, how can I help you?” — “Krzzszzlszlllzszlslsz” — “Hello? Unfortunately I cannot hear you well due to a very noisy connection, can you please go outside for a better connection?” — “Krzzzzzzz” — ”Can you hear me now? Please state your emergency.” The call center cat having trouble communicating due to a noisy channel.

We also briefly discussed how information theory can be found in statistics and of course in several of today’s technologies, such as neuroscientific methods that use variational inference to, e.g., represent and interpret EEG and fMRI data (SPM and dynamic causal modelling), or to model decision processes and corresponding neuronal activity (predictive processing / active inference, i.e., “Bayesian Brain” hypotheses), and how it is used in current AI technology (such as ChatGPT or stable diffusion models in general). We also briefly discussed how Boolean logic is represented as circuitry and then focused on the relation between Bayes’ rule and the sender-receiver model defined by Claude Shannon. We also anticipated the relation between linguistics/semiotics and information theory, which we will briefly return to in this tutorial when further reflecting on why communication can be looked at as a stochastic process. Here is again a link to the paper “A mathematical theory of communication” by Claude Shannon, which we will continue to explore further below.

Fig. On the left we see a modified graph of the sender-receiver model, given a noiseless channel, from the paper “A mathematical theory of communication” by Claude Shannon. We also discussed how the relation between sender, channel and receiver can again be looked at from a triadic perspective and how this again relates to classic Bayes’ rule / conditional probability. On the right we see a depiction of the so-called binary symmetric channel, which can be used to understand noisy channels. We also discussed ways to deal with noise, such as adding redundancy, either in the form of a parity bit, or by inferring via majority vote.

In this tutorial we will discuss the relation and differences between statistical thermodynamics and information theory and will build up all the knowledge we need for the concept of Markov processes (Markov chains) in the third part of this series.

For those curious upfront: In statistics Markov chains are, e.g., used in the so-called “Markov Chain Monte Carlo” (MCMC) method. The relevant parts of this series will therefore also serve as basis for later tutorials within our Inferential Statistics series on that topic. Same goes for the Akaike and Bayesian information criterion. A Markov chain is a rather simple concept, since it is again just another way to represent and extend classic Bayes’ rule / conditional probability. We will also introduce two ways to calculate a Markov chain: a simple version directly involving Bayes’ rule and a more complex method using basics from linear algebra.

Fig. Original graph from AMTC, p. 8. A Markov chain can be represented as a graph. The graph essentially expresses the following: say we are obtaining a message that consists of three possible signs (A, B and C). Say the first sign we receive is an A (the upper “node” or connection). The graph tells us with what probability we will receive a B or a C next. The probability to receive a C next is denoted as .2, and as .8 for B. Say we now obtained a B. Looking at the graph we can now see that we have a probability of .5 to obtain another B and .5 to obtain an A (and a probability of 0 to obtain a C, since there is no probability assigned to that case). And so on… This probabilistic transition of states (signs we receive) is essentially what a Markov chain tells us, or can be used for. No worries, we will get there in detail later below.
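For those who like to see this in code right away, here is a minimal sketch in R of such a transition structure (the rows for A and B follow the probabilities just described; the row for C is a made-up placeholder, since the example above leaves it open):

```r
# Transition matrix: rows = current sign, columns = next sign.
P = matrix(c( 0, .8, .2,   # from A: to A, B, C (as in the figure)
             .5, .5,  0,   # from B (as in the figure)
             .5,  0, .5),  # from C (hypothetical values!)
           nrow = 3, byrow = TRUE,
           dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Each row is a conditional distribution, so it has to sum to 1:
rowSums(P)

# Probability of receiving a B right after an A:
P["A", "B"] # .8
```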

1 Statistical Mechanics / Thermodynamics and Information Theory

You may wonder why information theory is linked to statistical thermodynamics and therefore physics in the first place. In fact, we are also talking about a kind of physics that is often considered the rather crazy type, since statistical thermodynamics was actually the start of quantum mechanics. On the other hand, you have probably heard of quantum computers and terms such as qubits — so from that perspective you actually already know a little bit (hehe) about it and can connect it with information you obtained in the first part of this series (by the way, there is also an ongoing discourse on how to formulate a quantum information theory).

However, the easy answer to how information theory and physics are connected is: it is because thermodynamics is a statistical concept.

There are even major questions of life linked to thermodynamics that you probably encountered, such as: What is life? And how does life manage to maintain itself by overcoming entropy in space and time (or how does it minimize free energy in order to overcome entropy)? Such questions were evoked within a physics discourse by Erwin Schrödinger in his book “What is life?” from 1944 (link to an online archived facsimile can be found here). 

One answer to this question, which is being developed further and further, is given by the concept of active inference / the free energy principle (FEP) (Ramstead et al. 2018). The FEP as well as other “Bayesian Brain” concepts also historically rely on the concept of Helmholtz free energy and Helmholtz’s work on perception as “unconscious inference” — again in a statistical sense (compare the SEP on Helmholtz; by the way, this is also where Freud took his ideas from when starting as a neurologist). Active inference conceptually also relies on the idea of Helmholtz free energy, hence the name of the generalized concept behind active inference: the free energy principle (in its very general form it can be extended to a mathematical theory of physical (self-organizing) systems in general, e.g., discussed in Ramstead et al. 2022). We will provide some basics on it in further parts of this series.

Sidenote: The Helmholtz approach is different from, e.g., an ecological psychological approach to perception, influenced by James J. Gibson, which argues that perception does not involve inference, but direct (realist) interaction between the world and us as complex organisms, resulting in perception (we will not go into this in detail in this series, but it is good to know that this approach is also still discussed today!).

Fig. Shannon’s entropy formula, including weighted probabilities (based on Gibbs’ entropy), i.e., the average surprisal of a hypothesis on an event amongst several hypotheses on an event, or signs amongst an alphabet. The Z here stands for the German word ‘Zeichen’, i.e., the word ‘sign’ (Berlin, Kreuzberg 2020; photo taken and modified by the author, the cloud that represents a capital sigma (sum sign) was created by perspective and thermodynamics itself though).

Thermodynamics and concepts such as entropy can be difficult to understand, but we hope to show that there is an intuition behind it that we have encountered before, which makes it much easier to understand: Bayes’ rule / conditional probability (again…). The math behind it will therefore turn out to be rather simple, as we will see. However, there is a popular and rather ironic and sarcastic introduction line for a chapter on statistical mechanics in a textbook called States of Matter, by David L. Goodstein, reflecting on its difficulty. I have encountered this quote several times, so I do not want to withhold it from you: 

“Ludwig Boltzmann, who spent much of his life studying statistical mechanics, died in 1906, by his own hand. Paul Ehrenfest, carrying on the work, died similarly in 1933. Now it is our turn to study statistical mechanics.” ─ States of Matter (1975), by David L. Goodstein

I promise, we will not go that far and it is really a cynical exaggeration in a lot of ways. Going through thermodynamics will take us a bit of time and imagination, but we will see that it’s well worth it and nothing to be afraid of either! As you have seen above, this tutorial is also supported by the “Meowssissippi Cat Emergency Center”, so believe us, you will be all safe and sound.

Let’s get to it!!

At first let us distinguish statistical mechanics from statistical thermodynamics: Statistical mechanics describes ‘ensembles of microstates’, i.e., a range or number of different possible states a system can occupy based on the energies of the particles of a system (this involves quantum mechanics). A classic system that is described this way is a closed space with gas particles. For simplicity and “vividness” we will mostly talk about an example with ink particles within a glass of water.

Fig. As mentioned before, our tutorial series on information theory will indirectly reveal the basics around the question of how “diffusion models”, which imply physics (entropy, temperature, movement of molecules etc.), are related to large language models, image generation and to methods from computational neuroscience. Above we see ink within water that can be considered being in a low entropy state. If all ink particles were evenly / uniformly distributed, we would speak about a high or even a state of maximum entropy (details further below). Image by Chaozzy Lin.

These microstates (energy levels or frequencies of atoms in a system) are what merge into or correspond with macrostates (temperature etc.). Statistical thermodynamics can therefore be seen as a link between statistical mechanics and classical thermodynamics, drawing a relation from a macrostate (e.g., entropy) to its possible microstates: the energies of all the particles in a system (all microstates) in relation to, e.g., the temperature of a system (macrostate).

Macrostates in general are, e.g., changes in volume, temperature and the description of the entropy of a system ─ we will especially focus on the latter concept. Note that statistical thermodynamics can be looked at as a partially independent concept from statistical mechanics and classic thermodynamics (physics in general so to speak), as it can be used for anything that can be referred to as state space in an abstract way, suitable for a general statistical method ─ its use in information theory being a popular example for it.   

As mentioned, microstates are the number of possible states that all the elements of a system (particles) can occupy within that system (again, also referred to as the energies / frequencies of the particles of a system). This all comes with a fundamental assumption in statistical thermodynamics: an equal probability of each possible (micro)state, where a state, again, is importantly to be understood as, e.g., a distribution of (different) atoms with different frequencies and locations in a system’s space (e.g., atoms in a gas). So note that one microstate refers to all the elements (energies of the atoms / molecules) of a system, e.g., at a point in time (not only the microstate of one atom — always the system as a whole!). Assuming equal probability a priori, the probability of a single state of the elements within a state space can be expressed as p(x_state) = 1/n, where n refers to the total number of possible microstates ─ a 100% probability divided by n, resulting in equal probability for each state, or better: a uniform distribution over possible microstates (check Inferential Statistics IV if you want to know more about discrete and continuous uniform probability distributions).
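Expressed in R, this uniform prior over microstates is trivial to sketch (n = 4 is just an arbitrary example number of microstates):

```r
n = 4                 # number of possible microstates (arbitrary example)
p_state = rep(1/n, n) # equal probability 1/n for each microstate
p_state               # .25 .25 .25 .25
sum(p_state)          # all probabilities sum to 1 (100%)
```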

To make this sound a little more familiar in the context of information theory: see the relation to the math we have gone through before? “Number of possible states” is similar to, e.g., our 2^1 = 2 possible states of our relay switch, or of a received message containing a single sign/binary digit only. So, in information theory the state space is in essence the message and the number of possible configurations it can occupy (microstates). As hinted before, the number of possible states by itself is not what we want to quantify information with, but the exponent will be important soon (and this only for mathematical convenience, as we will see, operating with the logarithm to the base of 2).

Fig. In the first part of our series on information theory we also discussed the number of possible states of a NAND gate. The presence of two relay switches has doubled the number of possible states (2^2 = 4), compared to one switch, or to the number of possible states given one binary digit was sent (2^1 = 2), representing either 1 or 0 only. This of course does not concern the output, which is a single binary digit (column on the right side of the table).
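The doubling of states can be made concrete by enumerating the input combinations of a NAND gate in R (a small sketch; the variable names are our own):

```r
# All possible input microstates of two binary relay switches:
inputs = expand.grid(a = c(0, 1), b = c(0, 1))
nrow(inputs) # 2^2 = 4 possible states

# The single-digit NAND output: 1 unless both inputs are 1
inputs$out = as.integer(!(inputs$a & inputs$b))
inputs
```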

Another important aspect of the relation between macro- and microstates is the fact that changes in the microstate of a system can correspond to the same, i.e., stable macrostate of an isolated system. This means that no matter what microstate is given, say a message with either 0 or 1, in both cases we will still have an overview of the overall possibilities of that system (the macrostate of a state space or message): it will always have two possibilities!

A physical system is closed when it does not lose particles/atoms (matter) but can exchange energy, and a system is isolated when it can exchange neither matter nor energy, e.g., in the form of heat, which would change the set of possible microstates (so imagine no environmental influence / energy exchange; an open system exchanges both (see figure below); a biological membrane does both (partially) selectively). We will soon see what that means in mathematical formulas, and we are not so much looking for a complete physical or biochemical description of the thermodynamics of living systems — just the basics. For now, just remember that heat “speeds up” particles, such that interactions with other atoms cause greater fluctuations. If we were to feed a system with heat, this would result in more possibilities within a given timespan compared to a system with less heat, since more energy means more change happening within that timespan (more action, harder to track, more possible states over time, more entropy). There is also a steady state where nothing changes, an equilibrium state, e.g., a state of maximum entropy (and minimum inner energy ─ we will get there when we deal with approximate Bayesian inference in further parts of this series, e.g., when discussing the basics behind variational inference).

Time plays an important role in this case, as it is known in physics that everything (in a closed and isolated system) tends towards “chaos”, i.e., entropy over time (diffusion is driven by this). Entropy therefore also works as an index for the direction of time or information flow (the so-called time arrow of entropy; in the context of neuroscience compare Ryan Smith 2022 and Lynn et al. 2022).

1.1 Explaining the Concept of Entropy with a Glass of Water and a Drop of Ink

In order to fully draw a mathematical relation to information theory, let us first think of a more canonical intuitive description of thermodynamics in physics (some of them also entail some conceptual traps). We will focus on the phenomenon of entropy in everyday life, where in general entropy is often referred to as the disorder or randomness of a system and a tendency towards it (even though this brief definition entails some counterintuitive traps, as we will see). In order to get more physical, we will consider an everyday example: a drop of ink that fell into a glass of water. The glass will be our system and we want to know how the ink molecules behave within the water (system) over time (diffusion). For this we will look at two points in time, i.e., two states the molecules within the system can be in (two microstates):

Fig. a) a simplification of a drop of ink within water (as discrete cells) at time t or t₀; the ink already started to slightly dissolve; one ink molecule = one dark block. b) represents dissolved ink over time, i.e., at time t+x (some random time); now the molecules are equally distributed within the space of the system. The simplification rules out the constant motion of the ink molecules and other aspects of physics. As a side quest we recommend looking into the concept of cellular automata, with which discrete dynamical systems, such as our water/ink cells, could be modeled. We particularly recommend looking up “Conway’s Game of Life” - it is very fun to watch and explore!

The left part a) in the figure above represents a 2-dimensional abstraction of a state shortly after the ink fell into the water, denoted t for time; and the right image b) is the state of the system after some random time t+x. We will rule out a lot of physics in our example ─ as always for the sake of simplicity. In physics a microstate would represent the energy level of each particle (atom) in a system, i.e., its quantum states.

What we see is that the ink has completely dissolved over time (t+x), i.e., the molecules are now equally distributed within the system’s space (the simplification of a glass of water). The process from a) to b) is described in thermodynamics as moving from a lower entropy state to a higher entropy state, and eventually towards the maximum entropy of a system over time (given the inner energy), leading to an equilibrium state where the entropy doesn’t change anymore (the molecules can still move around, though; however, the macrostate (entropy) remains stable!). This is again something that may not be as intuitive as it seems, since a low level of entropy may not automatically be associated with a), since b) at least aesthetically seems to have a much more “ordered pattern”.

Fig. Entropy in the sense of “disorder” is not understood aesthetically but probabilistically. (Cartoons from MS Office)

To find a way around this misleading intuition of “disorder”, let’s have another look at the two systems above: in b) the ink molecules are equally distributed within space, so it is equally likely to encounter the same number of molecules in every segment or quadrant of that system (in equilibrium also steady over time). So essentially this single state represents a pretty ‘random’ or unspecific uniform distribution of ink molecules within a state space, if one was to infer on such a state. If in a) all water molecules were on one side and all ink molecules on the other, then the information content or entropy of that microstate would be very low, since the whole state space simplifies down to two options (ink on the left or on the right…). Most of us will ─ at least from experience ─ know that such a low entropy state is highly unlikely to be encountered in life, also due to factors such as gravitation, which pulls the ink down to some degree, given its greater density compared to water molecules… However, we hope that the above gives you a stable idea why a) in the graph above has lower entropy than b).

Thinking of the molecules being in motion (kinetic energy), one might also find it intuitively hard to think of a situation where b) moves back to a), i.e., leaves its equal distribution in space and especially leaves its high entropy state to go back to a lower entropy state by itself. However, this is what life does in various ways. A cell outlasts entropy for a timespan that represents its lifetime by (partially) selectively exchanging energy with its environment in various forms and mobilizing energy in various ways (e.g., via enzymatic catalysis of the biochemical process C6H12O6 + 6O2 -> 6CO2 + 6H2O, or vice versa via photosynthesis), all in order to “work against” decay ─ the entropy of a complex system.

In thermodynamics low entropy is mathematically expressed as observing one state of a system that is less likely to be observed than other possible states within the range of possible states of a system. As entropy rises over time (time arrow of entropy!), low entropy states automatically become less likely over time. In other words: within the range of possible configurations of molecules within a system, observing a state of the elements of a system where the ink is more dissolved than in state a) is in general more likely. Entropy is therefore essentially not described as the consequence of a certain power, i.e., causal in the classical sense, but as a description of the probabilities of observing a certain kind of state configuration within a system that corresponds to macrostate measurements (e.g., physical temperature) ─ this relation between energy and temperature, i.e., micro- and macrostate, is also reflected in the physical unit J/K, as we will see further below.
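A toy calculation in R can give a feeling for how unlikely such low entropy configurations are: if each of n ink particles independently ends up in the left or right half of the glass with probability .5, the chance that all of them gather on one side shrinks exponentially (this ignores all real physics, of course):

```r
n = 10             # number of ink particles (toy example)
p_all_left = .5^n  # probability that all particles sit in the left half
p_all_left         # 1/1024, i.e., roughly 0.001

# With more particles this becomes astronomically unlikely:
.5^100
```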

1.2 The Entropy of a Message

Before getting deeper into the actual physics formula for the concept of entropy, we will quickly recall some possible relations to Shannon and information theory. In information theory a system’s macrostate at a certain point in time represents the number of possibilities (the exponent!) of a message that consists of a number of binary digits, which together form the possible microstates, i.e., one amongst a range of possible messages. Shannon therefore states in the introduction of AMTC:

These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design. (AMTC, p. 1)

The entropy of such a message, which is also called the information content, also reflects the ‘randomness’ or ‘disorder’ of a system at a certain point in time, namely: how likely a certain message is amongst all other possible messages, or in other words: how uncertain the information of a message is, apart from noise. I have hinted before and throughout the first part of this series a couple of times at how information can be statistically quantified. One relay has 2^1 possible states, a NAND gate consists of two relays and has 4, i.e., 2^2 possible states and, again, 2^1 possible outputs, due to the logic restrictions set by the NAND gate. So the important part is the exponent.

Before going into the exact meaning of the exponent value and the mathematical formulas, let us first “visualize thermodynamics” again, now in the sense of its use in information theory:

Fig. Thermodynamics and information. The square represents a state space of a system. Here we have a 50% chance to encounter a zero or a one within the simplified state space. In other words: there are two possible states this system can occupy (2^1) ─ equivalent to two possible energy levels of a particle in physics (its frequency, so to speak). If we were to divide the space (“two atoms with respective energies”) it would result in a representation of the information of a NAND gate (11, 10, 01, 00), where each pair represents a message of two signs, with 2^2 = 4 possible combinations of ones and zeros, and each possible “message” (microstate of zeros and ones) would have a probability of 25% to be observed in general.

In order to recall the relation to simple Bayes’ rule, recall the Vanessa Holmes example, drawing a relation between abductive inference, Bayesian inference and investigative reasoning, by successively ruling out possibilities (updating a model). So, a way to look at the process of receiving a message and reflecting on its information content (entropy) is exactly that: investigative reasoning, or better: asking the suspect (the received message) questions about its configuration amongst a range of possibilities (being 0 or 1, or say, guilty or not guilty). Quantifying information will then lead to the following consideration: “Given that only one sign was sent, how many questions do I have to ask the suspect / the message in order to obtain the information that either a 0 or a 1 had been sent, i.e., that the suspect is guilty or not?” Played out:

─ “Was it a 0?” 

─ “No.”

Ergo: a 1 was sent!

Fig. Vanessa Holmes, doing her damn job.

How many questions did she have to ask in order to obtain the final information that a 1 was sent, not a zero? — One question. This value of 1 is equivalent to the exponent when the number of possible states is cast as an exponentiation to the base of 2 (2^1). The number of questions to be asked is what represents the entropy of a message and is measured in bits, which is just the exponent to the base of possible configurations a single sign can have (the base of 2 in the case of 0 or 1). This is equivalent to stating the level of uncertainty about a subject one wants to interrogate — or a message that one wants to infer on (the received message). Since we are uncertain of a message, we have to ask questions, so to speak.
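In R, the number of yes/no questions needed is just the logarithm to the base of 2 of the number of possible states (we will get to the logarithm in detail below):

```r
n_states = 2   # a single binary sign: 0 or 1
log2(n_states) # 1 question, i.e., 1 bit

# More possible states, more questions:
log2(4) # 2 bits (e.g., a message of two binary digits)
log2(8) # 3 bits
log2(1) # 0 bits: only one possible state, no question needed
```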

Another term often used to express the above is surprisal, where surprisal is to be understood as how unpredictable or “new” a piece of information is in a statistical sense of prediction (not in the sense of a ‘psychosomatic reaction’). The quantification of uncertainty, surprisal etc. is related to the number of possibilities by stating how little we know about the exact state, i.e., the exact message within a range of possible messages. A lower number of possible states leads to less uncertainty, as there is a higher chance when just guessing (e.g., via a uniform prior) to actually guess the right state. We will soon learn that mathematically surprisal is not exactly the same as entropy, due to some minor differences in their formulas (negligible for now).

Before we head on with the physics and the formula of entropy and surprisal, we will look at another important example of interrogation: how about a system equivalent to our system in the figure above, but with p(x) = 1? In this special case the probability of receiving a certain sign, and a sign in general, is 100%, as there is only one sign possible and therefore only one possible microstate ─ a microstate that subsequently cannot change. How many questions would I have to ask in order to obtain this information?

Fig. Vanessa Holmes, applying her precrime methods.

None, as I would already know the answer, so to speak. Within information theory this translates into: no information had been sent. To make this example comparable to the previous quantification: the number of possibilities given in this case is 2^0 = 1, stating that there is one possible state and that the entropy is equal to 0 bits of information.

In more general terms: sending a message that entails information in the above sense conceptually relies on a beautiful requirement: a minimal difference — a minimal uncertainty, so to speak — in order to be informative, in order to be information at all, in order to make sense and in order to have a reason to communicate in the first place.

Same goes for classic physical thermodynamic systems, by the way: p(x) = 1 would result in a system in which nothing happens, as there is no difference, i.e., no possibility of change / transition to another state ─ no uncertainty, no surprisal whatsoever. We wouldn’t exist in such a system, as we would automatically represent a difference if we were to exist (we would be a contradiction, so to speak, given that p(x) is supposed to be 1).

Apart from the ones I have mentioned, there are more terms related to the concept of entropy: self-information, Shannon information… A lot of them are often used interchangeably. The entropy, though, is actually the average self-information/surprisal; in the case of 2^1 possibilities with equal probability they are both the same, as we will see when we get to the actual math (a weighted average in the form of an expected value).
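To anticipate the math a little, here is a small sketch in R of surprisal and of entropy as the average surprisal (the function names are our own; the formula details follow below):

```r
# Surprisal of a single outcome with probability p, in bits:
surprisal = function(p) -log2(p)

# Entropy = probability-weighted average surprisal (expected value):
entropy = function(p) sum(p * -log2(p))

# For 2^1 equally likely states, both coincide:
surprisal(.5)      # 1 bit
entropy(c(.5, .5)) # 1 bit

# For unequal probabilities they differ:
entropy(c(.9, .1)) # ~0.47 bits (less average uncertainty)
```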

In the next chapter we will go through the concept of the logarithm, which can be used to determine the ominous exponent we discussed above and which will soon help us make things a little more convenient when handling the numbers.

The next chapter however can be considered optional, if you know about the logarithm well enough already.

1.3 The Logarithm and Euler’s Number (Optional)

In order to dig even deeper into entropy and to get a grip on its equation, we briefly have to recall what a logarithm is. When calculating exponents, it is common to use the logarithm for convenience and for keeping an overview of the results - this is all it actually does. In the case of information theory, the logarithm to the base of 2 is used (binary digits, bits) ─ the simplest representable way (minimal difference) to encode information. Let us recall what the logarithm and its components are:

Fig. The logarithm to the base of 2. The upper equation, spoken out, states: 2 raised to the power of “what?” (x) equals 4. The lower equation states: x equals the answer to the question of how many times one has to multiply 2 by itself (= “log to the base of 2”) in order to get a result of 4. The logarithm therefore represents the inverse operation of exponentiation.

The logarithm is in general also useful for, e.g., transforming multiplications into summations (for convenience, since summation is easy to do in one’s head or by hand). We will encounter some of the following rules again in more detail. Just always keep in mind that logarithmic formulas are just a way to better handle and orientate oneself within the math.
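We can quickly check these rules in R, e.g., the product rule that turns multiplications into summations:

```r
# log(a*b) = log(a) + log(b):
log2(4 * 8)       # 5
log2(4) + log2(8) # 2 + 3 = 5

# log(a^b) = b * log(a):
log2(2^10)   # 10
10 * log2(2) # 10
```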

Note that log_2(0) is not defined, as it is unclear how a result like base^exponent = 0 would be possible: there is just no combination of real numbers (ℝ) in base and exponent that equals 0. Even 0^0 = 1, but 0^(-1) = 1/0 is again not defined (what should 1 divided by zero mean?). R will output Inf, for infinity, or -Inf in such a case. When performing linear algebra one will sometimes operate with vectors and matrices that contain 0 values. As log_b(0) is not defined, one adds a small number to every log input to avoid Inf/-Inf: log_b(x + e^(-16)), where x is our input. It is not important for now, but here is what our modified function for the natural logarithm will look like:

nat_log_nonzero = function(x) {
  log(x + exp(-16)) # the last evaluated expression is returned
} # End of function
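As a quick sanity check (the function is redefined here so the snippet runs on its own): an input of 0 now yields a large, but finite, negative number instead of -Inf.

```r
# Guarded natural log: add a tiny offset so log(0) stays finite
nat_log_nonzero <- function(x) {
  log(x + exp(-16))
}

log(0)              # -Inf
nat_log_nonzero(0)  # -16
nat_log_nonzero(1)  # close to log(1) = 0
```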

In thermodynamics, binary digits (bits) are not used to quantify information; the natural logarithm is used instead (the unit is natural digits, nats). The natural logarithm operates to the base of e, Euler's number. This is again a reason why stating "computers are just 0 and 1" is neither witty nor really informed about what this is all about ─ since it is just a matter of unit (choosing the base of the logarithm). The focus should be on the stochastic approach to information.

The number e is a very important mathematical constant for several reasons. It appears foremost in the description of growth, such as growth in infections, or growth of algae in a pond ─ or a growth of entropy. I will leave further research on the origin of Euler's number to you. We are especially interested in a mathematical characteristic of e that also makes calculations easier:

A number of special characteristics of exponential functions are the reason why nats are often used as a unit ─ again for convenience:

a) The n-th derivative of e^x is again e^x.

b) The slope at a certain point x on f(x) = e^x is also e^x.

c) The integral up to a certain point x is again e^x. Note that the code below demonstrates that \int_{-\infty}^{x} e^t \, dt = e^x holds, since e^1 = e.

# Define a range of x values and f(x) = e^x first:
seq = seq(-3, 3, by = .01)
fx_pos = exp(seq)

plot(x = seq, y = fx_pos, type = "l")
# forming coordinate system (+x,-x,+y)
abline(v = 0)

# Point at x = 0, so f(0) = y = e^0 = 1
exp(0) # Now add point to plot:
points(y = 1,x = 0) # P (0|1)

# Point at x = 1
# y = f(x) = e^x; f(1) = e^1 = e

points(x = 1, y = exp(1)) # P (e|1), i.e., x = 1, y = e
# Slope of tangent at a point P(f(x)|x) := f'(x) = e^x,
# since f(x) = e^x = f'(x)

# Evaluate y coordinate of P(f(x)|x):
y_coordinate = exp(1)
# [1] 2.718282 = Euler's number e.

# We can add the respective tangent at exp(y)
# point with a slope = e^x. In case you wonder:
# the linear function x does not need to be defined in curve().
curve(exp(1)*x,-3,3, add = TRUE)

# Integral from -Inf to x = 1
expo_fun = function(x) {exp(x)}
integrate(expo_fun, lower = -Inf, upper = 1)
# 2.718282 with absolute error < 0.00015
# = e again.

Fig. Plot of an exponential function with a tangent at P(e|1) with the slope exe^{x}.

Another important piece of advice concerning the handling of logarithms comes from Shannon himself:

Change from the base a to base b merely requires multiplication by logb a. (AMTC, p. 2)
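We can verify Shannon's remark in R: to change, e.g., from base 2 (bits) to base e (nats), we multiply by ln(2). The values below are arbitrary illustrations:

```r
# Change of base: log_b(x) = log_a(x) * log_b(a)
x <- 8

# From base 2 to base e: multiply by ln(2)
log2(x) * log(2)  # equals log(x), the natural log of 8
log(x)

# R's log() also accepts an explicit base argument:
log(x, base = 2)  # equals log2(x) = 3
```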

1.4 Boltzmann’s Entropy Formula and Gibbs’ Generalization

Now let us get fully into it and start with the equation for entropy as proposed by the Austrian physicist Ludwig Boltzmann (note that Rudolf Clausius also played an often forgotten role in this!). The definition holds for closed systems in thermic equilibrium.

Fig. Three important figures in the early development of statistical thermodynamic theories: Rudolf Clausius, Ludwig Boltzmann and Josiah W. Gibbs. (Source Wikimedia, a), b), c)).

For those already exhausted: we will simplify things dramatically soon, when getting back to actual information theory.

However, the below entails nothing you wouldn't know from what was discussed so far. It is also mostly concerned with representing entropy as a physical/mathematical formula with correct notation.

Apart from that, most of the math below is actually just summing, weighting, taking the mean ─ so it will be rather simple, even if the notations may not look like it. Note that the below refers to the special case where surprisal is equal to entropy, due to a uniform distribution of weights.

In other, more physical terms: the Boltzmann entropy equation holds under the assumption that every microstate is equally probable. In Inferential Statistics IV we used p(x_{states}) = 1/n_{states} to express this circumstance. Following the upper conventions, we would write:
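Whatever the exact notation, the uniform assumption is easy to check in plain R: if p = 1/Ω for every microstate, then −ln(p) is simply ln(Ω) (Ω = 8 is an arbitrary example):

```r
# With p = 1/Omega for every microstate, -ln(p) = ln(Omega):
Omega <- 8
p <- 1 / Omega
-log(p)     # 2.079442
log(Omega)  # 2.079442
```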

You may wonder what a constant is. In general (mathematics), a constant just refers to a single fixed number (not changing in relation to another variable). In physics, a constant is the result of a measurement that appears everywhere in nature without changes in its value (over time), and constants are expressed in SI units (standardized units). A classic is the speed of light, 299792458 [m/s], where meter and second are the standardized units for length and time (kg for mass; Joule for energy transferred to another object with a force measured in Newton; Kelvin for temperature…). Establishing standardized units is in general done to make the results of scientific measurements comparable, at least up to a desired precision.

The Boltzmann constant expresses a relation between the (average kinetic) energy of particles in a system (microstates) and the temperature of a system, therefore the unit Joule/Kelvin. We will not go into more physics details here, as it is not needed in information theory and other applications of the presented concept.

We will still use the unit Joule/Kelvin when calculating Boltzmann's entropy below. In general, k_B is used only for the matter of units. Let us compute some examples of the Boltzmann entropy in R. In order to use the Boltzmann constant, we will install a package. In case you don't know what a package is: a package is a collection of functions, or even just data sets (e.g., datasauRus), so they do not have to be written or transferred into the script by oneself.

We could also define the constant in R ourselves, taking a value from some textbook of course, but we want to use the official CODATA constants (details below). Some instructions on how to use the package can be found on the author's GitHub page. In case you don't know: GitHub is an important website you may have encountered already, which is used in the (software) development field, where one can publish code, keep track of its development and so on.

# Install a package with the install.packages() function. 
# Keep in mind to use " " within ().
install.packages("constants")

# Now load the library you just installed, otherwise
# the functions of the package will not be known by R:
library(constants)
# Here you can check out which constants are available.
# The package uses the constants estimated by CODATA, i.e.,     
# the Committee on Data for Science and Technology in Paris
# (CODATA), founded by the International Council for Science. 
# CODATA keeps track of measurements of constants and delivers 
# standardized estimations of constants for science (quality 
# control and improving quality of estimations in general).
View(codata) # View full list of constants

# Use lookup() to search for constants
lookup("boltzmann", ignore.case = TRUE)

# Use with() to use a constant, in our case:
with(syms, syms$k) # can be used as a whole to do math!

# Boltzmann Entropy (syms_with_units adds unit to result)
# For R Version 4.1.0 use k instead of syms$k but both may work:
with(syms_with_units,syms$k)*log(1) # with Omega = 1 possibility
with(syms_with_units,syms$k)*log(2) # with 2 possibilities
with(syms_with_units,syms$k)*log(4) # with 4
with(syms_with_units,syms$k)*log(8) # with 8

# > with(syms_with_units,syms$k)*log(1) # with 1 possible states
# [1] 0
# > with(syms_with_units,syms$k)*log(2) # with 2 
# [1] 9.56993e-24
# > with(syms_with_units,syms$k)*log(4) # with 4
# [1] 1.913986e-23
# > with(syms_with_units,syms$k)*log(8) # with 8
# [1] 2.870979e-23

# NOTE: plain log() := natural logarithm := ln.

The American scientist Josiah W. Gibbs later generalized Boltzmann's formula, such that the states are mathematically treated to not necessarily have equal probability of occurrence, but can be unequally weighted in relation to each other. Boltzmann's equation can be seen as a special case of Gibbs' generalization of entropy.

Note that we take the negative sum in the equation below for mathematical reasons and reasons of logic: when taking the log of a number between 0 and 1, the result will be negative, since 1/2 = 2^{-1} = .5 and 1/4 = 2^{-2} = .25 ─ and the result of using the logarithm will be the exponent value, as we have learned before (note that Ω or n, the number of possible states, is equivalent to the denominator in the two given examples). As we will get a negative exponent out of these logarithms, we turn the result into a positive value by putting a minus sign up front, for reasons of units and logic: the amount of information / (average) entropy cannot be negative, as that would mean one loses information when receiving a message that entails information, which is a contradiction ─ the same goes for thermodynamics in physics: information cannot be "lost" (otherwise it results in a paradox, as in questions regarding the event horizon of black holes…).

Fig. Vanessa Holmes, slowly losing her patience.

We will later see that energy can be “free” though, as in terms of leaving the system (which is, e.g., used to do work, e.g., in a steam engine; we will get there in other parts of this series when approaching approximate Bayesian methods).

Let us now finally look at Gibbs’ entropy formula:

When comparing Gibbs' generalization with Boltzmann's entropy equation, we see that ln(p_i) is weighted by p_i itself (which is a fraction of 1!), which makes it possible to express different probabilities in terms of a non-equal a priori probability distribution of possible microstates ─ a weighted a priori. The upper equation can also be rewritten in the expected value notation (denoted E(X)). We will look into the details below, but let us first prove that the upper equation leads to equivalent results when p_i = 1/n, with n being the number of possible states.

# Gibbs' generalization of the entropy formula:
# We will try different n with equal prob. of microstates 
# (as in Boltzmann’s equation)
OMEGA1 = c(1) 		# one possible state
OMEGA2 = c(.5,.5)	# two states with equal prob.
OMEGA4 = c(.25,.25,.25,.25)
OMEGA8 = c(.125,.125,.125,.125,.125,.125,.125,.125)

# > -with(syms_with_units,syms$k)*(sum(OMEGA1*log(OMEGA1)))
# [1] 0
# > -with(syms_with_units,syms$k)*(sum(OMEGA2*log(OMEGA2)))
# [1] 9.56993e-24
# > -with(syms_with_units,syms$k)*(sum(OMEGA4*log(OMEGA4)))
# [1] 1.913986e-23
# > -with(syms_with_units,syms$k)*(sum(OMEGA8*log(OMEGA8))) 
# [1] 2.870979e-23

We see that this leads to equivalent results. Now mark and execute only the following part of the code, e.g., OMEGA4*log(OMEGA4), and see how summation is performed, then try just, e.g.: 

-log(.5);-log(.25) # etc.
# note that “;” is a delimiter that in this case
# is interpreted by R as a new independent line

We see that in the case of equal probability of all microstates no weighting by the individual probabilities is actually needed, so Boltzmann's equation of entropy can be seen as a special case of Gibbs' entropy.
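This reduction can also be stated as a one-line check in R (a sketch with n = 4, setting k_B = 1 just to compare the shape of the two formulas):

```r
# Gibbs' sum with uniform p_i = 1/n collapses to Boltzmann's ln(n):
n <- 4
p <- rep(1/n, n)
gibbs     <- -sum(p * log(p))
boltzmann <- log(n)
all.equal(gibbs, boltzmann)  # TRUE
```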

1.4.1 The Expected Value Notation of Entropy (Optional)

As mentioned, the upper equation can also be expressed via the expected value (E) notation. We will go through a simple dice example further below in order to understand what an expected value is in general, as it will also become important in later parts of this series. The following is again rather simple and more of a formal and conceptual matter, so I promise low values of surprisal on the math ahead of us. We have also thoroughly discussed the concept of an expected value and the often ambiguous use of the term, overlapping with the (population) mean / weighted arithmetic mean, in Inferential Statistics III.

Equivalent to the meaning of the summation within the previous formulas, an expected value can be understood as the weighted average of possible outcomes (a weighted mean). In the case of entropy, the expected value is again the exponent of e^{exponent}, which in nuce reflects on a microstate being one amongst a range of possible microstates ─ the lower the entropy of a system, i.e., the lower the expected value, the more likely a state is to be expected within the range of possible microstates.

Let us now consider a general mathematical example using a fair dice with six sides. This time our expected value will not be the value of entropy, but the value we gain from tossing a dice with values ranging from 1 to 6. This is different from before, where the expected value was the sum of the log of p_i, weighted by its own probability (from 0 to 1). Here it is a value from 1 to 6, weighted by its probability of occurrence. In case this appears confusing, think of it this way: weighting an event (1 to 6) by a probability, in order to weigh the influence of one possible outcome compared to another, makes sense, as p_i is a fraction of 1 and therefore represents a whole that can be related to its parts (similar to per cent)...

Fig. A fair dice. Values of x_i ∈ X = {1, 2, 3, 4, 5, 6} := values of a dice, and the respective weights, i.e., probabilities p_i for rolling the respective x_i.

Here is the formula of our expected value:

# E[x] = Sum(x*p(x)), where x := dice value
ExpX = 1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6
ExpX   # 3.5 or 7/2
# Or:
x = c(1,2,3,4,5,6)
ExpX = sum(x*(1/6))
# Shortest:
ExpX = sum(1:6*(1/6))

When equal probabilities are given, E[X], i.e., the expected value of x, technically represents the unweighted arithmetic mean in terms of: mean = sum(x_1 + ... + x_n)/n.

# Note: the function nrow() delivers the number of
# rows/elements in a matrix (counting trick). The parameter 
# "byrow =" adds the values by row, if FALSE, one would end up 
# with 6 columns; default = FALSE; in this case it actually  
# doesn’t matter, as there is only 1 column. 
# Our fair six-sided dice:
dice = matrix(c(1,2,3,4,5,6), byrow = TRUE)
# probs corresponding to the matrix "dice"
weight_dice = c(1/6,1/6,1/6,1/6,1/6,1/6) 

# Arithmetic mean (equal to using equivalent probs./weights)
arith_mean = sum(dice)/nrow(dice)

Below you can find the formula of the weighted arithmetic mean ─ and again: formally equivalent to E[X]. Since we are using probabilities as weights, which sum to 1, the denominator can be neglected:

# The formula of the weighted arithmetic mean is:
weight_mean = sum(weight_dice*dice)/sum(weight_dice)
weight_mean = sum(weight_dice*dice)

# Or use R function from the base package:
weight_mean = weighted.mean(dice, weight_dice)

One can also normalize weights, i.e., turn the values into a part-to-whole range between 0 and 1, with p_i representing a fraction of 100%. This is already given in our case. For reasons of completeness, we will still normalize our weights, which will result in equivalent values, i.e., in our initial, already normalized weights again (forming a kind of identity, so to speak). To prove this, here is the formula and the respective code:

# Normalizing weights:
normalized_weights = weight_dice/sum(weight_dice)
# [1] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 
#     0.1666667 #   where 1/6 = 0.1666667

This example will hopefully explain a lot about the math behind Gibbs' entropy equation concerning the weighting of a log probability with (the same) probability: weighting the information / entropy ln(p_i) of one microstate by its probability of occurrence amongst a whole of possible microstates.

Here is another example with non-equal weights ─ a phony dice.

# Weights of a phony dice
phony_dice = matrix(c(1,2,3,4,5,6), byrow = TRUE)
# probs. corresponding to "phony dice"
weights_phony_dice = c(2/6,.5/6,.5/6,1.5/6,.5/6,1/6) 
# Check if probs sum up to 1
sum(weights_phony_dice) # [1] 1

# Normalized weights (again equivalent to our weights)
norm_phony_weights = weights_phony_dice/sum(weights_phony_dice)
# Check on sum again
sum(norm_phony_weights) # [1] 1

# First we check on our weighted mean with our non-normalized 
# weights. Note that the w-mean is actually our expected value:
w_mean_phony_dice = sum(weights_phony_dice*phony_dice) /
  sum(weights_phony_dice)

# Now we check if we get the same expected value using our
# normalized weights:
n_w_mean_phony_dice = sum(norm_phony_weights*phony_dice) /
  sum(norm_phony_weights)

In both cases we get the same result: [1] 3.166667. Looking at the probabilities we set (our weights), we can see that there is a tendency towards lower outcomes when tossing our phony dice compared to our fair dice, which is reflected in our expected value: the expected value of our fair dice (3.5) deviates by +0.333333 from the expected value of our phony dice.
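Anticipating the Shannon entropy formula of the next section (this comparison is a forward-looking sketch, not part of the original dice example), the same weights also show that the phony dice carries less entropy, i.e., less average surprisal, than the fair one:

```r
# Average surprisal (in bits) of the fair vs. the phony dice:
fair  <- rep(1/6, 6)
phony <- c(2/6, .5/6, .5/6, 1.5/6, .5/6, 1/6)

-sum(fair * log2(fair))    # log2(6) = 2.584963 bits (the maximum)
-sum(phony * log2(phony))  # about 2.3554 bits, i.e., less uncertainty
```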

2 Shannon Information / Entropy

You made it through alive!! With this we have gained a solid mathematical basis and intuition on statistical mechanics and thermodynamics. We are now moving on to Shannon’s entropy formula and things will get, as promised, much easier. I chose to dip into actual statistical mechanics and thermodynamics, in order to make you, the reader, competent in recognizing similarities and differences between physics and information theory.  

From now on we will mostly focus on the information theoretic understanding of concepts within statistical thermodynamics. For us this means: simplification on various levels.  

Shannon states in the introduction of his paper “A mathematical theory of communication”:

If the number of messages in the set is finite then this number or any monotonic function of this number can be regarded as a measure of the information produced when one message is chosen from the set, all choices being equally likely. (AMTC, p. 1)

We are now able to clearly foresee what the mathematical work of Claude Shannon was all about. The function Shannon chose was the logarithm to the base of 2, as we know, for several reasons ─ Shannon also mentions intuition as being one of them:

For example, adding one relay to a group doubles the number of possible states of the relays. It adds 1 to the base 2 logarithm of this number. Doubling the time roughly squares the number of possible messages, or doubles the logarithm, etc. […] It is nearer to our intuitive feeling as to the proper measure […] since we intuitively measures (sic) entities by linear comparison with common standards. One feels, for example, that two punched cards should have twice the capacity of one for information storage, and two identical channels twice the capacity of one for transmitting information. (AMTC, p.1)

Furthermore, also on page 1, we see the beginning of how we understand information today, when Shannon is using the famous unit bits for the first time:

The choice of a logarithmic base corresponds to the choice of a unit for measuring information. If the base 2 is used the resulting units may be called binary digits, or more briefly bits, a word suggested by J. W. Tukey. A device with two stable positions, such as a relay or a flip-flop circuit, can store one bit of information. N such devices can store N bits, since the total number of possible states is 2^N and log_2(2^N) = N. (AMTC, p.1)

Here the argument of the logarithm is written as 2 to the power of N, as in log_2(2^2) = 2.
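This is easy to verify numerically; for example, 8 flip-flops give 2^8 = 256 possible states and therefore exactly 8 bits:

```r
# N two-state devices: 2^N possible states, log2(2^N) = N bits
N <- 8
states <- 2^N  # 256
log2(states)   # 8
```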

We now hopefully understand that the elementary discovery of Shannon was considering a source as something that produces messages (receiver vice versa) from a stochastic process, which made communication mathematically expressible and transparently representable in the form of information technology (computers nowadays).

Let us now plot a log-function of the entropy of a single sign occurring, which is specified as surprisal, self-information, information content or even Shannon information, of a single sign or event with a probability ranging from 0 to 1 ─ and measured in the unit bits. We will use the following formula:

Fig. Shannon entropy formula, including weighted probabilities, i.e., the average surprisal of a hypothesis on an event amongst several hypotheses on an event, or signs amongst an alphabet (recall: equal probability for each microstate results in average surprisal being equivalent to surprisal). Z stands here for the German word ‘Zeichen’, i.e., the word ‘sign’ (Berlin, Kreuzberg 2020). Later on, the difference between entropy as average surprisal and surprisal itself will become more important.

What is missing compared to Boltzmann's entropy formula is the constant k_B, which can here be understood as being set to the value k = 1 (a constant of 1), as our unit will now be bits. The unit bits can be seen as an abstract unit with intermediate character, as it has a representation in the physical world to some degree (e.g., current hindrance in a relay switch; this relation has been demonstrated several times, see "Landauer's Principle").

In the following we will again start by looking at unweighted log probabilities, i.e., the surprisal of a single sign with a given probability of occurrence. In some cases this will result in rather abstract values, as in the case of p(x) = .9, where we will get a fraction of a bit as the result (we would have to weigh the log probability in some way if we wanted full bits as the outcome, in order to represent the information via a relay switch operating with binary logic ─ or just simply add some bits to make space for the information).

# Surprisal /self-information (information content): 
# I(x) = -log2(p(x)) Bits
# Probability of a sign x = p(x)

# In order to plot a very smooth function, we are using 
# probabilities starting with (x) = 0.0001 and moving towards
# p(x) = 1 in respective steps. This can be seen as forming 
# a sequence. We will start with a probability of
# 0.0001, as -log2(0) = Infinite, i.e., not defined:

# We will use the sequence function seq()
px<- seq(0.0001, 1, by = 0.0001)

# Number of single probabilities:  
length(px) # [1] 10000

# Let us now plot the -log2 for all p(x), i.e., x = 0 to 1
plot(x = px, y = -log2(px), type = "l", col = "red", 
     ylab = "Amount of Bits = I(x) = -log2(p(x))", xlab = "p(x)",
     panel.first = grid(nx = NULL, ny = NULL, 
                        col = "lightgray", lty = "dotted")) 

# You can also add some points, e.g., for p(x) = .5
# The calculation can be done within the function points():
points(x = .5, y = -log2(.5)) # P (p(x)=.5 | 1 bit)

Fig. Log-function of p(x_i) = .0001 to 1. The marked points, in the scheme Point(x|y) (not P for probability here), are P(p(x)=.5 | 1 bit) and P(p(x)=1 | 0 bit).

It may seem as if we had neglected Bayes' theorem for a while, but we will see that this is not true at all, as there is a surprising connection between the amount of entropy/surprisal and the model evidence, hiding just in front of us. Recall that we have worked with equal probabilities and Bayes' theorem before, e.g., when inferring on a single received sign. Also recall that the model evidence is a relation between the joint of A ∩ B and Ā ∩ B, in the form of a sum over all possible A. In the case of a uniform prior, i.e., equal probability, the model evidence is equivalent to the row probabilities of the prior, each row entailing the probability of a possible state (when prior = [.5 .5], i.e., .5 probability for each of two states, the model evidence will be .5). We could also weight each probability, but in essence the model evidence expresses the average probability of a sign or an event, i.e., p(x) (I have also encountered the term average likelihood for model evidence). The surprisal then reflects the uncertainty of such a single event (a sign as microstate) amongst all possible events (signs or microstates of a message), given that their probabilities are uniform.

# Here we will represent two relay switches or two sent signs for 
# a change, i.e., four possible states with equal probability, 
# equivalent to the NAND truth table. 
prior = c(.25,.25,.25,.25)
likelihood = c(1, 0, 0, 0)   

# Calculate surprisal:
Surprisal = -log2(sum(prior*likelihood))
# [1] 2

We can also calculate and evaluate the maximum entropy of a single sign that was sent: p(x) = .5 results in a maximum entropy of 1 bit. In other words, entropy maxes out at the point of equal probability for every possible state of x. Recall that the model evidence is an important term, as it provides us with information on how to normalize our joint, i.e., when we conditionalize our joint event (posterior). It refers to how likely an event is for itself in general, regardless of the hypothesis (or another event).

# Calculate and plot the maximum entropy for 1 sign 
px = seq(.1,1, by = .1)
qx = 1-px

# The entropy is at its maximum for px = .5
MaxEntropy = -(px*log2(px)+qx*log2(qx))

# > MaxEntropy # it reaches maximum of 1.0000000 at px =.5!!
# [1] 0.4689956 0.7219281 0.8812909 0.9709506 1.0000000 0.9709506 0.8812909 0.7219281
# [9] 0.4689956       NaN

# Plot maximum entropy (with new px)
px = seq(0.0001, 1, by = 0.0001)
plot(x = px, y = -(px*log2(px)+((1-px)*log2(1-px))), type = "l")

Fig. Maximum entropy of one sign sent is given at p(x) = .5, with a probability ranging from 0 to 1.

We will get back to this again. Shannon explicitly mentions conditional probability rather late in AMTC, and also mostly in relation to what is called a Markov chain.

Let us first think a little more about the essence of Shannon's work, i.e., the argument that communication involves a stochastic process, before we look at what Markov chains are.

2.1 Language as a Result of Stochastic Processes (Optional)

Note that Shannon also goes beyond equal probability of simple binary digits, and with this also sets a lot of the basics of what is today called computational linguistics. Information theory is therefore far less technical than some might expect, as the math also applies to, and has something to do with, actual language. Shannon goes beyond mere binary machine code by looking at natural languages, which have the same probabilistic characteristics on different levels of abstraction: some letters or words occur more often than others, i.e., they have different probabilities in general; considering transitional dependencies, some letters appear more often after, e.g., the letter B than others… The same goes for words, elements of narratives... Shannon even evaluated the average entropy of the English language in his paper "Prediction and Entropy of Printed English" (1951).
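As a toy illustration of this idea (the letter frequencies below are made up, not Shannon's estimates), unequal letter probabilities already lower the average entropy per sign compared to the uniform case:

```r
# A hypothetical four-letter 'alphabet' with unequal frequencies:
letters4 <- c(E = .4, T = .3, A = .2, Q = .1)
-sum(letters4 * log2(letters4))  # about 1.8464 bits per letter
-log2(1/4)                       # 2 bits if all four were equally likely
```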

Looking at language and the process of forming a message via a stochastic process is more intuitive than one might think. Understanding messages primarily as one message within a set of possible messages is something we have all experienced ourselves. For example, we do not learn a whole sentence in order to only use it in "that" particular form/order or at only one particular point in time. We only do this if we consider ourselves lossless channels for a particular piece of information, as in remembering and reciting a poem ─ this concerns memory in general, including the respective noise in various forms.

In any case, language is contingent (similar to the concept of arbitrariness) and can therefore be looked at probabilistically in various ways. Still, one could ask where these contingencies essentially come from. In information and (dynamical) systems theory, explicitly in the latter, this is in general explained as a recursive process, setting both boundaries and contingencies at the same time. For a source in information theory to communicate, one needs code, and the equivalence of a code is what the source and receiver rely on when sending and receiving a message (incl. en- and decoding). A source can be understood as forming a recursion/identity on the matter of code, as it indirectly sets / expects the code convention at every discrete point in time and therefore also sets the boundaries and contingencies of possible messages (sets a whole).

We have also seen that the possibility of communication relies on a minimal difference, i.e., a minimal amount of surprisal, i.e., entropy or disorder, which represents information itself. This is why the specific reference, i.e., whatever specific aesthetic contingencies may be given, i.e., the specific code, as well as its "meaning" etc., is not what the possibility of communication essentially relies on. If everything were determined and known by us, we would not need communication in the first place, since -log_2(1) = 0.

Arguing that contingency or even arbitrariness is always the answer for everything may still seem somewhat unsatisfying, as if trying to bypass the question of their origin. Note that there is, e.g., work within active inference / predictive processing / dynamical systems theory research that answers the question by arguing the origin to be the phenotype, i.e., being human (Ramstead, Friston, Hipólito 2020), and our ontogenetically developed priors (starting as early as in the womb, see, e.g., Ciaunica 2021). One may have to recall in general that we are far from just popping up in the world, so we grow up with information that is kept in the world, such as language as a method of communication itself. Understanding information theory, however, is less about a "first reference" than about a "first principle/structure" of communication (surprise: essentially Bayes' rule again).

But what does the phenotype do to make communication possible or even likely? For example, the phenotype (being human) can play the role of a self-reference in the most general sense, which makes it more likely that communication is possible, due to similar experience in terms of ourselves as a body in the world, e.g., the range of sight, scaling, capabilities of inference in general... Our eyes (retina) are an example of how first levels of code are already set by the phenotype. Most of all, the phenotype therefore sort of "pre-sets" the possibility of a structural equivalence between one's own self-relation and the self-relation of others with the same phenotype: it is likely that one 'understands' the meaning of "My leg hurts.", thinking of it as experiencing it oneself, when also being a human. In other words: even just approximating the experience of someone else on the basis of ourselves will work quite well in a lot of cases ─ even though it is a kind of gambling or betting game in others.

One does not even need actual language itself, as the sight of someone in pain, performing certain actions (e.g., laying on the floor), also leads to a relation to our own experiences, making the actions of that person appear to us as index for some cause or for the pain itself. Therefore, meaning in the sense we presented here arises within the receiver and is nothing that is conveyed in the sense of a reification of meaning, being a kind of object in the world that may even move with the message from A to B via a channel, or so. 

This is why the approach to communication within information theory is to be considered much more general than linguistic approaches, as information theory explains the inferential capabilities and structures necessary for communication and therefore language itself as part of it. Information theory is therefore not to be confused as a mere linguistic theory.

We will look further into a general discourse on the relation between information theoretic concepts and human communication in general, as well as the linguistic side of information theory, in the fourth part of this series.

3 Brief Outlook into Information Theory III

In the next part of this series we will mathematically explore the basic concept of language as a stochastic process further, when discussing the mentioned Markov chains. No fear: Markov chains are just an extension of Bayes' rule, as we will see. We will also introduce two ways to calculate a Markov chain: a simple version directly involving Bayes' rule and a more complex method using basics from linear algebra.

Fig. Original graph from AMTC, p. 8. This Markov chain represents the probabilities of a single sign sent/received. The sign could be an A, B, C, D or E.

Ms. Miranda, longing for feedback. Did any of this make sense? We would like to hear from you! Similar to our open review process, every one of our articles can be commented on by you. Our journal still relies on the expertise of other students and professionals. However, the goal of our tutorial collection is especially to get in contact with you, helping us to create an open and transparent peer-teaching space within BEM.

