# [R] Higher Spheres: Information Theory II: The Relation Between Information Theory / Technology and Statistical Thermodynamics

# Introduction

# 1 Statistical Mechanics / Thermodynamics and Information Theory

## 1.1 Explaining the Concept of Entropy with a Glass of Water and a Drop of Ink

## 1.2 The Entropy of a Message

## 1.3 The Logarithm and Euler’s Number (Optional)

## 1.4 Boltzmann’s Entropy Formula and Gibbs’ Generalization

## 1.4.1 The Expected Value Notation of Entropy (Optional)

# 2 Shannon Information / Entropy

## 2.1 Language as a Result of Stochastic Processes (Optional)

# 3 Brief Outlook into Information Theory III

Intermediate to Advanced Stat-o-Sphere

Published on Aug 18, 2023


**Follow this link directly to part III of this series, if you wish,** or have a look at other tutorials of our collection Stat-o-Sphere.

**Review:** This is a pre-released article, and we are currently looking for reviewers. Contact us via [email protected]. You can also directly send us your review, or use our open peer-review functions via pubpub (create an account here). In order to comment, mark some text and click on “Start discussion”, or just comment at the end of an article.

Pub preview image generated by Mel Andrews via midjourney.

*Corresponding R script:*

Welcome to the second part of our series on information theory. **You have now entered the intermediate to advanced Stat-o-Sphere, so we recommend that you have read at least the first part of this series in order to make sense of the below without being overwhelmed by it.** **However, the basic ideas introduced in this tutorial are actually not that difficult, since we again (and again and again) will just follow and extend the basic idea behind Bayes’ rule and conditional probability.**

Note that this tutorial can also be helpful for, e.g., medical students to get a hint about what the very basic idea behind thermodynamics is, which is helpful for understanding physics, chemistry and most of all to get a very basic intuition for understanding general biology and physiology.

**In the first part of our series on information theory we tried to get some basic insights** into Claude Shannon’s solutions for problems in the field of (telephone) communication that led to several different forms of information technology that we are now casually making use of every day.

We also briefly discussed how information theory can be found in statistics and of course in several of today’s technologies, such as neuroscientific methods that use variational inference to, e.g., represent and interpret EEG and fMRI data (SPM and dynamic causal modelling), or to model decision processes and corresponding neuronal activity (predictive processing / active inference, i.e., “Bayesian Brain Hypotheses”), and how it is used in current AI technology (stable diffusion models). We also briefly discussed how Boolean logic is represented as circuitry and then focused on the relation between Bayes’ rule and the sender-receiver model defined by Claude Shannon. We also anticipated the relation between linguistics/semiotics and information theory, which we will come back to in this tutorial, when **further reflecting on why communication can be looked at as a stochastic process.** Here is again a link to the paper “A mathematical theory of communication” by Claude Shannon, which we will continue to explore further below.

**In this tutorial we will discuss the relation and differences between statistical thermodynamics and information theory** and will build up all the knowledge we need for the concept of **Markov processes (Markov chains) in the third part of this series**.

**For those curious upfront:** In statistics Markov chains are, e.g., used in the so-called “Markov Chain Monte Carlo” (MCMC) method. The relevant parts of this series will therefore also serve as basis for later tutorials within our Inferential Statistics series on that topic. Same goes for the Akaike and Bayesian information criterion. A Markov chain is a rather simple concept, since it is again just another way to represent and extend classic Bayes’ rule / conditional probability. We will also introduce two ways to calculate a Markov chain: a simple version directly involving Bayes’ rule and a more complex method using basics from linear algebra.

**You may wonder why information theory is linked to statistical thermodynamics and therefore physics in the first place.** In fact, we are also talking about a kind of physics that is often considered the rather crazy type, since statistical thermodynamics was actually the start of quantum mechanics. On the other hand you probably heard of quantum computers and terms such as qubits, so from that perspective you actually already know a little bit (hehe) about it and can connect it with information you obtained in the first part of this series (by the way there is also an ongoing discourse on how to formulate a quantum information theory).

**However, the easy answer how information theory and physics are connected is: it is because thermodynamics is a statistical concept.**

There are even major questions of life linked to thermodynamics that you probably encountered, such as: **What is life? And how does life manage to maintain itself by overcoming entropy in space and time (or how does it minimize free energy in order to overcome entropy)?** Such questions were evoked within a physics discourse by Erwin Schrödinger in his book “What is life?” from 1944 (link to an online archived facsimile can be found here).

One answer to this question that is further and further developed is given by the concept of active inference / the free energy principle (FEP) (Ramstead et al. 2018). The FEP as well as other “Bayesian Brain” concepts also historically rely on the concept of **Helmholtz Free Energy** and his work on **perception as “unconscious inference”**, again in a statistical sense (compare SEP on Helmholtz; by the way, this is also where Freud took his ideas from when starting as a neurologist). Active inference conceptually also relies on the idea of Helmholtz Free Energy, hence the name of the generalized concept behind active inference: the free energy principle (in its very general form it can be extended to a mathematical theory of physical (self-organizing) systems in general, e.g., discussed here in Ramstead et al. 2022). We will provide some basics on it in further parts of this series.

**Sidenote: The Helmholtz approach is different to, e.g., an ecological psychological approach to perception**, influenced by James J. Gibson, which argues that perception does not involve inference, but **direct (realist) interaction between the world and us as complex organisms resulting in perception** (we will not go into this in detail in this series, but it is good to know that this approach is also still discussed today!).

**Thermodynamics and concepts such as entropy can be difficult to understand, but we hope to show that there is an intuition behind it that we have encountered before, which makes it much easier to understand: Bayes’ rule / conditional probability (again…).** The math behind it will therefore turn out to be rather simple, as we will see. However, there is a popular and rather ironic, sarcastic and even cynical introduction line for a chapter on statistical mechanics in a textbook called *States of Matter*, by David L. Goodstein, reflecting on its difficulty. I have encountered this quote several times, so I do not want to withhold it from you:

“Ludwig Boltzmann, who spent much of his life studying statistical mechanics, died in 1906, by his own hand. Paul Ehrenfest, carrying on the work, died similarly in 1933. Now it is our turn to study statistical mechanics.” David L. Goodstein, *States of Matter* (1975)

**I promise, we will not go that far and it is really a cynical exaggeration in a lot of ways.** **Going through thermodynamics will take us a bit of time and imagination, but we will see that it’s well worth it and nothing to be afraid of either!** As you have seen above, this tutorial is also supported by the “Meowssissippi Cat Emergency Center”, so believe us, you will be all safe and sound.

*Let’s get to it!!*

**First, let us distinguish statistical mechanics from statistical thermodynamics:** *Statistical mechanics* describes ‘ensembles of *micro*states’, i.e., a range or number of different possible states a system can occupy based on the energies of the particles of a system (involves quantum mechanics). A classic system that is described this way is a closed space with gas particles. For simplicity and “vividness” we will mostly talk about an example with ink particles within a glass of water.

These *micro*states (energy levels or frequencies of atoms in a system) are what merge into or correspond with *macro*states.

**Macrostates in general** are, e.g., changes in volume, temperature and the description of the entropy of a system (we will especially focus on the latter concept). Note that statistical thermodynamics can be looked at as a partially independent concept from statistical mechanics and classic thermodynamics (physics in general so to speak), as it can be used for anything that can be referred to as state space in an abstract way, suitable for a general statistical method; its use in information theory is a popular example.

As mentioned, **microstates** are the number of possible states that all the elements of a system (particles) can occupy within that system (again, also referred to as the energies / frequencies of the particles of a system). This all comes with a fundamental assumption in statistical thermodynamics, assuming an

**To make this sound a little more familiar in the context of information theory:** See the relation to the math we have gone through before? “Number of possible states” is similar to, e.g., our **(and this only for mathematical convenience, as we will see, operating with the logarithm to the base of 2).**

Another important aspect of the relation between macro- and microstates is the fact that changes in the *micro*state of a system can correspond *to the same*, i.e., stable macrostate of an *isolated* system. This means that no matter what microstate is given, say a message with either 0 *or* 1, in both cases we will still have an overview of the overall possibilities of that system (possible macrostates of a state space or message): *it will always have two possibilities!*

**The physical system is closed** when it does not lose particles/atoms (matter) but exchanges energy, and **a system is isolated** when it can exchange neither matter nor energy, e.g., in the form of heat, which would change the set of possible microstates (so imagine no environmental influence / energy exchange; an open system exchanges both (**see figure below**); a biological membrane does both (partially) selectively). **We will soon see what that means in mathematical formulas, and we are not so much looking for a complete physical or a biochemical description of the thermodynamics of living systems, just the basics**. For now, just remember that heat “speeds up” particles, such that interactions with other atoms cause greater fluctuations, which means that if we were to feed a system with heat *this would result in more possibilities* during a time span compared to a system with *less heat* in the same time, as there is more change happening within a time span in a system due to more energy, so to speak (more action, harder to track, more possible states within time, more entropy). There is also a **steady state** where nothing changes, an **equilibrium state**, e.g., in a state of maximum entropy (and minimum inner energy; we will get there when we deal with approximate Bayes inference in further parts of this series, e.g., when discussing the basics behind variational inference).

**Time plays an important role in this case, as it is known in physics that everything (in a closed and isolated system) tends towards “chaos”, i.e., entropy over time (diffusion is driven by this)**. Entropy therefore also works as *an index* *for the direction of time* or *information flow* (the so-called time arrow of entropy; in the context of neuroscience compare Ryan Smith 2022 and Lynn et al. 2022).

In order to fully draw a mathematical relation to information theory, let us first think of a more canonical intuitive description of thermodynamics in physics (some of them also entail some conceptual traps). We will focus on **the phenomenon of entropy in everyday life**, where in general entropy is often referred to as the disorder or randomness of a system and a tendency towards it (even though this brief definition entails some counterintuitive traps, as we will see). In order to get more physical, we will consider an everyday example: *a drop of ink that fell into a glass of water*. The glass will be our system and we want to know how the ink molecules behave within the water (system) over time (diffusion). **For this we will look at ***two points in time***, i.e., two states the molecules within the system can be in (two microstates):**

**The left part a) in the figure above** represents a 2-dimensional abstraction of a state shortly after the ink fell in the water; **b)** is the state of the system after some random time.

**What we see is that the ink completely dissolved over time, i.e., the molecules are now *equally distributed* within the system’s space (the simplification of a glass of water)**. The process from **a)** to

**To find a way around this misleading intuition of “disorder”, let’s have another look at the two systems above:** in **b)** the ink molecules are equally distributed within space, so it is equally likely to encounter the same number of molecules in every segment or quadrant of that system (in equilibrium also steady over time). So essentially this single state represents a pretty

Thinking of the molecules being in motion (kinetic energy), one might also find it intuitively hard to think of a situation where **b)** moves back to

**In thermodynamics *low entropy* is mathematically expressed as observing one state of a system that is *less likely* to be observed than other possible states within a range of possible states of a system.** As entropy rises over time (time arrow of entropy!), low entropy states automatically become less likely over time. In other words: Within the range of possible configurations of molecules within a system, the possible observation of a state of the elements of a system where the ink is *more dissolved* than in state **a)** is

Before getting deeper into the actual physics formula for the concept of entropy, we will quickly recall some possible relations to Shannon and information theory. In information theory a system’s macrostate at a certain point in time represents the number of possibilities (exponent!) of a *message* that consists of a number of binary digits, which together form possible microstates, i.e., one amongst a range of *possible* messages. Shannon therefore states in the introduction of AMTC:

These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one *selected from a set* of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design. (AMTC, p. 1)

The entropy of such a message, which is also called the *information content*, also reflects on the ‘randomness’ or ‘disorder’ of a system at a certain point in time, namely: how likely a certain message is amongst all other possible messages, or in other words: how *uncertain* the information of a message is, apart from noise. I have hinted before and throughout the first part of this series a couple of times how information can be statistically quantified. One relay has $2^{1}$ possible states, a NAND gate consists of two relays and has 4, i.e., $2^{2}$ possible states and, again, $2^{1}$ possible outputs, due to the logic restrictions set by the NAND gate. **So the important part is the exponent.**

Before we are going into the exact meaning of the exponent value and mathematical formulas, let us first “visualize thermodynamics” again, now in the sense of its use in information theory:

**In order to recall the relation to simple Bayes’ rule, recall the Vanessa Holmes example**, drawing a relation between abductive inference, Bayes inference and **investigative reasoning**, by successively ruling out possibilities (updating a model). So, a way to look at the process of receiving a message and reflecting on its information content (entropy) is exactly that: investigative reasoning, or better: asking the suspect (the received message) questions on its configuration amongst a range of possibilities (being 0 or 1, or say, guilty or not guilty). Quantifying information will then lead to the following consideration: “Given that only one sign was sent, how many questions do I have to ask the suspect / the message, in order to obtain the information that either a 0 or 1 had been sent, i.e., the suspect had been guilty or not?” Played out:

─ “Was it a 0?”

─ “No.”

*Ergo:* a 1 was sent!

How many questions did she have to ask in order to obtain the final information that a 1 was sent, not a 0? One question. **This value of 1 is equivalent to the exponent when the number of possible states is cast as exponentiation to the base of 2** ($2^1 = 2$ possible states, i.e., 1 *bit*, which is just the exponent to the base of the possible configurations a single sign can have; the base is 2 in the case of 0 or 1). This is equivalent to stating the level of **uncertainty** on a subject one wants to interrogate, or on a message that one wants to infer on (received message). Since we are uncertain of a message, we have to ask questions, so to speak.
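The “number of questions” can also be checked numerically. The following is a minimal sketch in R, assuming that all possible messages are equally likely and that every yes/no question halves the remaining possibilities; the variable names are chosen for illustration only.

```r
# Number of equally likely possible messages (illustrative values)
n_states = c(2, 4, 8)
# Number of yes/no questions needed = exponent to the base of 2
n_questions = log2(n_states)
n_questions
# [1] 1 2 3
# Reverse check: 2 to the power of the number of questions
# returns the number of possible messages again
2^n_questions
# [1] 2 4 8
```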

Another term often used to express the above is **surprisal**, where surprisal is to be understood as

**Before we head on with the physics and the formula of entropy and surprisal, we will look at another important example of interrogation:** How about a system equivalent to our system in the figure above, but with only one possible state? How many questions would we have to ask?

**None, as I would already know the answer, so to speak.** Within information theory this translates into: no information had been sent. To make this example comparable to the previous quantification: the possibilities given in this case are $2^0 = 1$, i.e., **0 bits of information**.

In more general terms: **sending a message that entails information in the above sense conceptually relies on a beautiful requirement**: a minimal difference (a minimal uncertainty, so to speak) in order to be informative, in order to be information at all, in order to make sense and in order to have a reason to communicate in the first place.

Same goes for classic physical thermodynamic systems by the way: *uncertainty, *no* surprisal* whatsoever. We wouldn’t exist in such a system, as we would automatically represent a difference, if we were to exist (we would be a contradiction so to speak, given that

**Apart from the ones I have mentioned, there are more terms related to the concept of entropy: self-information, Shannon information…** A lot of them are often used interchangeably. The entropy though is actually the *average self-information/surprisal*, though in the case of a uniform distribution (all states being equally probable) both values are the same.

**In the next chapter we will go through the concept of the logarithm, which can be used to determine the ominous exponent we discussed above and which will soon help us to make things a little more convenient when handling the numbers.**

**The next chapter however can be considered optional, if you know about the logarithm well enough already.**

In order to dig even deeper into entropy and to get a grip on its equation, we briefly have to recall what a *logarithm* is. When calculating exponents, it is common to use the logarithm for convenience and clarity over the calculation of our results; this is all it actually does. In the case of information theory, the logarithm to the base of 2 is used (binary digits, bits), the simplest representable way (minimal difference) to encode information. Let us recall what the logarithm and its components are:

$$\log_b(x) = y \iff b^y = x$$

Here $b$ is the base, $y$ the exponent and $x$ the resulting number, e.g., $\log_2(8) = 3$, since $2^3 = 8$.

**The logarithm is in general also useful for, e.g., transforming multiplications into summations (for convenience, since summation is easy to do by head and hand).** We will encounter some of the following rules again in more detail. Just always keep in mind that logarithmic formulas are just a way to better handle and orientate within the math.
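The rule that turns multiplication into summation can be tried out directly. Below is a small sketch in R; the numbers `a` and `b` are arbitrary illustrative values (powers of 2, so the results are easy to check by head).

```r
# Two illustrative numbers: 8 = 2^3 and 32 = 2^5
a = 8
b = 32
# The log of a product...
log2(a * b)
# [1] 8
# ...is equal to the sum of the logs (3 + 5 = 8)
log2(a) + log2(b)
# [1] 8
# Similarly, an exponent becomes a simple factor:
log2(a^3)   # same as 3 * log2(a)
# [1] 9
```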

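Note that the logarithm of 0 is not defined; R returns **-Inf**, for negative infinity, in such a case. When performing linear algebra one will sometimes operate with vectors and matrices that contain 0 values. As **-Inf** cannot be used in further calculations, a common workaround is to add a very small value before taking the logarithm: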

```
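nat_log_nonzero = function(x) {
  # Add a tiny constant, exp(-16), so that log(0) does not
  # result in -Inf; the result is returned explicitly
  log(x + exp(-16))
} # End of function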
```
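To see what such a helper is good for, here is a small usage sketch. The function is repeated so that the example runs on its own, and the vector `p` is a made-up example containing a 0.

```r
# Helper as above: natural log with a tiny offset
nat_log_nonzero = function(x) {
  log(x + exp(-16))
}
p = c(0, .5, .5)    # hypothetical probability vector with a 0
log(p)              # the first element is -Inf
p * log(p)          # 0 * -Inf results in NaN
nat_log_nonzero(p)  # the first element is now finite (about -16)
# The weighted sum can now be evaluated without NaN:
sum(p * nat_log_nonzero(p))
```

The offset slightly changes every result, so this is a pragmatic workaround rather than an exact calculation.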

In thermodynamics, binary digits (bits) are not used to quantify information, but the *natural* logarithm is (the unit is *natural digits*, nats). The natural logarithm operates to the base of Euler’s number $e \approx 2.718$.

**The number $e$ is a very important mathematical constant for several reasons.**

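A number of special characteristics of the exponential function $f(x) = e^x$ is the reason why nats as a unit is often used (again for convenience):

a) The *derivative* of $e^x$ is again $e^x$, i.e., $f'(x) = f(x)$.

b) The *slope* at a certain point $x$ is therefore equal to the value of the function at that point, $e^x$.

c) The *integral* up to a certain point $x$ (starting from $-\infty$) is also $e^x$.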

```
# Define x values and f(x) = e^x (the range -3 to 3 is an
# assumption, chosen to match the curve() call below)
x_seq = seq(-3, 3, by = .1)
fx_pos = exp(x_seq)
plot(x = x_seq, y = fx_pos, type = "l")
# forming coordinate system (+x,-x,+y)
abline(v = 0)
# Point at x = 0, so f(0) = y = e^0 = 1
exp(0) # Now add point to plot:
points(x = 0, y = 1) # P (0|1)
# Point at x = 1
# y = f(x) = e^x; f(1) = e^1 = e
exp(1)
points(x = 1, y = exp(1)) # P (1|e)
# Slope of tangent at a point P(x|f(x)) := f'(x) = e^x,
# since f(x) = e^x = f'(x)
# Evaluate y coordinate of P(1|f(1)):
y_coordinate = exp(1)
# [1] 2.718282 = Euler's number e.
# We can add the respective tangent at the point P(1|e)
# with a slope = e^1 = e. In case you wonder:
# the linear function x does not need to be defined in curve().
curve(exp(1)*x,-3,3, add = TRUE)
# Integral from -Inf to x = 1
expo_fun = function(x) {exp(x)}
integrate(expo_fun, lower = -Inf, upper = 1)
# 2.718282 with absolute error < 0.00015
# = e again.

**Another important piece of advice concerning the handling of logarithms comes from Shannon himself:**

Change from the base $a$ to base $b$ merely requires multiplication by $\log_b a$. (AMTC, p. 2)
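Shannon’s rule can be written as $\log_b(x) = \log_a(x) \cdot \log_b(a)$. Here is a brief sketch in R that converts a value from nats (base $e$) to bits (base 2); the value `x = 8` is an arbitrary example.

```r
x = 8
# Information in nats (base a = e)
nats = log(x)
# Change to base b = 2: multiply by log_b(a) = log2(e)
bits = nats * log2(exp(1))
bits      # 3, as 2^3 = 8
# Direct calculation for comparison:
log2(x)
# log() also accepts an arbitrary base:
log(x, base = 2)
```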

Now let us get fully to it and start with the equation for entropy as proposed by the Austrian physicist Ludwig Boltzmann (note that Rudolf Clausius also had an often forgotten role to play in this!). The definition holds for closed systems in thermal equilibrium.

*For those already exhausted:* **we will simplify things dramatically soon, when getting back to actual information theory.**

**However, the below entails nothing you wouldn’t know from what was discussed so far.** The below is also mostly concerned with representing entropy as a physical/mathematical formula with correct notation.

Apart from that, most of the math below is actually just summing, weighting, taking the mean, so it’ll be rather simple, even if the notations may not look like it. Note that the below refers to the special case where *surprisal is equal to entropy*, due to uniform distribution of weights.

In other more physical terms: the Boltzmann entropy equation holds for the assumption that every microstate is equally probable. In Inferential Statistics IV we used:

**You may wonder what a constant is. In general (mathematics), a constant just refers to a single fixed number (not changing in relation to another variable).** In physics, a constant is the result of a measurement that appears everywhere in nature without changes in its value (over time) and is expressed in SI units (standardization of units). **A classic is the speed of light**, 299792458 [m/s], where meter and second are the standardized units for length and time (kg for mass; Joule for energy transferred to another object with a force measured in Newton; Kelvin for temperature…). **Establishing standardized units is in general performed in order to make the results of scientific measurements comparable, at least up to a desired precision.**

The **Boltzmann constant** expresses a relation between the (average kinetic) energy of particles in a system (microstates) and the temperature of a system, therefore the unit Joule/Kelvin. We will not go into more physics details here, as it is not needed in information theory and other applications of the presented concept.
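For reference, Boltzmann’s entropy formula in its common textbook form can be written as follows. The notation is an assumption, as the original formula is not shown here: $S$ denotes the entropy, $k_B$ the Boltzmann constant, $\Omega$ the number of equally probable microstates and $\ln$ the natural logarithm.

```latex
S = k_B \ln \Omega, \qquad k_B = 1.380649 \times 10^{-23}\ \mathrm{J/K}
```

With only one possible microstate, $\Omega = 1$, the entropy is $S = k_B \ln 1 = 0$.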

We will still use the unit Joule/Kelvin when calculating Boltzmann’s entropy below. **Let us compute some examples of the Boltzmann entropy in R.** In order to use the Boltzmann constant, we will **install a package**. **In case you don’t know what a package is:** A package is a collection of functions, or even just data sets (e.g., datasauRus), so they do not have to be written or transferred into the script by oneself.

We could also define the constant in R ourselves, taking a value from some textbook of course, but **we want to use the official CODATA constants (details below)**. Some instructions on how to use the package can be found on the author’s GitHub page: https://github.com/r-quantities/constants. **In case you don’t know: GitHub** is an important website you may have encountered already, which is used in the (software) development field, where one can publish code, keep track of the development and so on.

```
# Install a package with the install.packages() function.
# Keep in mind to use " " within ().
install.packages("constants")
# Now load the library you just installed, otherwise
# the functions of the package will not be known by R
library("constants")
# Here you can check out which constants are available.
# The package uses the constants estimated by CODATA, i.e.,
# the Committee on Data for Science and Technology in Paris
# (CODATA), founded by the International Council for Science.
# CODATA keeps track of measurements of constants and delivers
# standardized estimations of constants for science (quality
# control and improving quality of estimations in general).
View(codata) # View full list of constants
# Use lookup() to search for constants
lookup("boltzmann", ignore.case = TRUE)
# Use with() to use a constant, in our case:
with(syms, syms$k) # can be used as a whole to do math!
# Boltzmann entropy S = k*log(Omega), in Joule/Kelvin.
# Note: syms$k refers to the plain numeric value, so the results
# below are printed without a unit.
# For R Version 4.1.0 use k instead of syms$k but both may work:
with(syms_with_units,syms$k)*log(1) # with Omega = 1 possibility
with(syms_with_units,syms$k)*log(2) # with 2 possibilities
with(syms_with_units,syms$k)*log(4) # with 4
with(syms_with_units,syms$k)*log(8) # with 8
# RESULTS:
# > with(syms_with_units,syms$k)*log(1) # with 1 possible states
# [1] 0
# > with(syms_with_units,syms$k)*log(2) # with 2
# [1] 9.56993e-24
# > with(syms_with_units,syms$k)*log(4) # with 4
# [1] 1.913986e-23
# > with(syms_with_units,syms$k)*log(8) # with 8
# [1] 2.870979e-23
# NOTE: plain log() := natural logarithm := ln.
```

The American scientist Josiah W. Gibbs later generalized Boltzmann’s formula such that the states do not necessarily have equal probability of occurrence, but can be unequally *weighted* in relation to each other. Boltzmann’s equation can be seen as a special case of Gibbs’ generalization of entropy.

**Note that we take the negative sum in the equation below for mathematical and logical reasons:** when taking the log of a number between 0 and 1, the result will be negative, since a number below 1 can only be reached with a negative exponent (e.g., $2^{-1} = 0.5$). **As we will get a negative exponent out of these logarithms, we turn the result into a positive value by putting a minus sign up front, for reasons of units and logic:** the amount of information / (average) entropy cannot be negative, as it would mean one loses information when receiving a message that entails information, which is a contradiction. The same goes for thermodynamics in physics: information cannot be “lost” (otherwise it results in a paradox, as in questions regarding the event horizon in black holes…).

We will later see that energy can be “free” though, as in terms of leaving the system (which is, e.g., used to do work, e.g., in a steam engine; we will get there in other parts of this series when approaching approximate Bayesian methods).

**Let us now finally look at Gibbs’ entropy formula:**

$$S = -k_B \sum_{i} p_i \ln p_i$$

When comparing Gibbs’ generalization with Boltzmann’s entropy equation, we see that the logarithm is now *weighted* by the probability $p_i$ of each microstate, which allows for *non-equal* probability distributions *a priori* of possible microstates (a *weighted* *a priori*). The upper equation can also be rewritten in the expected value notation.

```
# Gibbs' generalization of the entropy formula:
# We will try different n with equal prob. of microstates
# (as in Boltzmann’s equation)
OMEGA1 = c(1) # one possible state
OMEGA2 = c(.5,.5) # two states with equal prob.
OMEGA4 = c(.25,.25,.25,.25)
OMEGA8 = c(.125,.125,.125,.125,.125,.125,.125,.125)
-with(syms_with_units,syms$k)*(sum(OMEGA1*log(OMEGA1)))
-with(syms_with_units,syms$k)*(sum(OMEGA2*log(OMEGA2)))
-with(syms_with_units,syms$k)*(sum(OMEGA4*log(OMEGA4)))
-with(syms_with_units,syms$k)*(sum(OMEGA8*log(OMEGA8)))
# RESULTS (EQUIVALENT TO THE ABOVE USE OF BOLTZMANN ENTROPY)
# > -with(syms_with_units,syms$k)*(sum(OMEGA1*log(OMEGA1)))
# [1] 0
# > -with(syms_with_units,syms$k)*(sum(OMEGA2*log(OMEGA2)))
# [1] 9.56993e-24
# > -with(syms_with_units,syms$k)*(sum(OMEGA4*log(OMEGA4)))
# [1] 1.913986e-23
# > -with(syms_with_units,syms$k)*(sum(OMEGA8*log(OMEGA8)))
# [1] 2.870979e-23
```

We see that this leads to equivalent results. Now mark and execute only the following part of the code, e.g., **sum(OMEGA4*log(OMEGA4))**, and see how summation is performed, then try just, e.g.: **OMEGA4*log(OMEGA4)**

```
-log(.5);-log(.25) # etc.
# note that “;” is a delimiter that in this case
# is interpreted by R as a new independent line
```

We see that in the case of equal probability of all microstates no weighting of each microstate by its own probability is needed, so Boltzmann’s equation of entropy can be seen as a special case of Gibbs’ entropy.
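The actual benefit of Gibbs’ generalization shows when the probabilities are *not* equal. Below is a minimal sketch under stated assumptions: the Boltzmann constant is entered by hand (its exact SI value) so that the example runs without the constants package, the function name `gibbs_entropy` is hypothetical, and the probabilities are made up.

```r
# Boltzmann constant in Joule/Kelvin (exact SI value, set by hand)
k_B = 1.380649e-23
# Gibbs entropy for a vector of probabilities p (no zeros assumed)
gibbs_entropy = function(p) {
  -k_B * sum(p * log(p))
}
p_equal   = c(.25, .25, .25, .25) # four equally probable states
p_unequal = c(.7, .1, .1, .1)     # one state dominates
# Equal probabilities: same as Boltzmann's k_B * log(4)
gibbs_entropy(p_equal)
k_B * log(4)
# Unequal probabilities: lower entropy, as the outcome is
# less uncertain
gibbs_entropy(p_unequal)
```

The uniform case marks the maximum: any deviation from equal probabilities over the same four states lowers the entropy.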

As mentioned, the upper equation can also be expressed via the expected value. **We have also thoroughly discussed the concept of an expected value and the often ambiguous use of the term, overlapping with (population) mean / weighted arithmetic mean, in Inferential Statistics III.**

Equivalent to the meaning of the summation within the previous formulas, an expected value can be understood as the *weighted average* of possible outcomes (weighted mean). In the case of entropy, the expected value is again an exponent, which *in nuce* reflects on a microstate being one amongst a range of possible microstates: the *lower the entropy* of a system, i.e., the lower the expected value, the *more likely a state is* to be expected within a range of possible microstates.

**Let us now consider a general mathematical example using a fair dice with six sides.** This time our expected value will not be the value of entropy, but the value we gain from rolling a dice with values ranging from 1 to 6. This is different to before, where the expected value was the sum of the log of each probability, weighted by that probability.

Here is the formula of our expected value:

$$E[X] = \sum_{i=1}^{6} x_i \, p(x_i)$$

```
# E[x] = Sum(x*p(x)), where x := dice value
ExpX = 1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6
ExpX # 3.5 or 7/2
# Or:
x = c(1,2,3,4,5,6)
ExpX = sum(x*(1/6))
# Shortest:
ExpX = sum(1:6*(1/6))
```

When equal probabilities are given, then the expected value is equivalent to the *unweighted arithmetic mean* in terms of: **mean = (x1 + ... + xn)/n**.

```
# Note: the function nrow() delivers the number of
# rows/elements in a matrix (counting trick). The parameter
# "byrow = TRUE" fills the matrix row by row (default = FALSE,
# i.e., column by column); in this case it actually doesn’t
# matter, as there is only 1 column.
# Our fair six-sided dice:
dice = matrix(c(1,2,3,4,5,6), byrow = TRUE)
# probs corresponding to the matrix "dice"
weight_dice = c(1/6,1/6,1/6,1/6,1/6,1/6)
# Arithmetic mean (equal to using equivalent probs./weights)
arith_mean = sum(dice)/nrow(dice)
```
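As an empirical cross-check, the expected value can also be approached by simulation. This is only a sketch: the seed, the number of tosses and the variable names (`tosses`, `sim_mean`, `rel_freq`, `sim_exp`) are arbitrary choices made here, not part of the original script.

```r
# Simulate many tosses of a fair six-sided dice (seed is arbitrary)
set.seed(1)
n_tosses = 100000
tosses = sample(1:6, size = n_tosses, replace = TRUE)
# The sample mean approaches the expected value of 3.5
sim_mean = mean(tosses)
# The relative frequencies approach the probabilities of 1/6 each
rel_freq = table(tosses)/n_tosses
# Using the observed frequencies as weights gives the same
# number as the plain arithmetic mean of all tosses
sim_exp = sum(1:6*as.numeric(rel_freq))
all.equal(sim_mean, sim_exp) # TRUE
```

The last line shows the equivalence from the other direction: the arithmetic mean of many observations is a weighted sum of the possible values, with the observed relative frequencies as weights.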

Below you can find the formula of the weighted arithmetic mean, which is again formally equivalent to the expected value. **Since we are using probabilities as weights, which sum to 1, the denominator can be neglected:**

```
# The formula of the weighted arithmetic mean is:
weight_mean = sum(weight_dice*dice)/sum(weight_dice)
weight_mean = sum(weight_dice*dice)
# Or use the R function weighted.mean() from the stats
# package (loaded by default):
weight_mean = weighted.mean(dice, weight_dice)
```

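One can also *normalize weights*, i.e., turn the values into a part-to-whole range between 0 and 1, by dividing each weight by the sum of all weights: $w_i / \sum_j w_j$. As our probabilities already sum to 1, normalizing them simply returns the *initial*, already normalized weights again (forming a kind of *identity*, so to speak). To demonstrate this, here is the respective code: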

```
# Normalizing weights:
normalized_weights = weight_dice/sum(weight_dice)
# [1] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
# 0.1666667 # where 1/6 = 0.1666667
```

**This example will hopefully explain a lot about the math behind Gibbs’ entropy equation concerning weighting a log of some probability with (the same) probability, i.e., weighting the information / entropy of one microstate by its probability of occurrence amongst *a whole* of possible microstates.**

Here is another example with *non*-equal weights ─ a phony dice.

```
# Values of a phony dice
phony_dice = matrix(c(1,2,3,4,5,6), byrow = TRUE)
# probs. corresponding to "phony dice"
weights_phony_dice = c(2/6,.5/6,.5/6,1.5/6,.5/6,1/6)
# Check if probs sum up to 1
sum(weights_phony_dice)
# Normalized weights (again equivalent to our weights)
norm_phony_weights = weights_phony_dice/sum(weights_phony_dice)
# Check on sum again
sum(norm_phony_weights)
# First we check on our weighted mean with our non-normalized
# weights. Note that the w-mean is actually our expected value:
w_mean_phony_dice = sum(weights_phony_dice*phony_dice) /
sum(weights_phony_dice)
# Now we check if we get the same expected value using our
# normalized weights:
n_w_mean_phony_dice = sum(norm_phony_weights*phony_dice) /
sum(norm_phony_weights)
```

In both cases we get the same result: **[1] 3.166667**. Looking at the probabilities we set (our weights), we can see that there is a tendency towards *lower outcomes* when tossing our phony dice compared to our fair dice, which is reflected in our expected value. The expected value of our *fair* dice (3.5) deviates by +0.333333 from the expected value of our *phony* dice.
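The same weighting logic carries over from dice values to log probabilities. The following sketch treats entropy as the expected value of the surprisal; it assumes the logarithm to the base of 2 (so the result is in bits), and the variable names are chosen here for illustration.

```r
# Probabilities of the fair and the phony dice (as set above)
p_fair = rep(1/6, 6)
p_phony = c(2/6,.5/6,.5/6,1.5/6,.5/6,1/6)
# Surprisal of each side (log to the base of 2 is an assumption)
surprisal_fair = -log2(p_fair)
surprisal_phony = -log2(p_phony)
# Entropy = expected value of the surprisal = weighted mean
H_fair = weighted.mean(surprisal_fair, p_fair)
H_phony = weighted.mean(surprisal_phony, p_phony)
# This is the same as writing out the weighted sum:
all.equal(H_phony, -sum(p_phony*log2(p_phony))) # TRUE
H_fair  # equal to log2(6), the uniform (Boltzmann-like) case
H_phony # lower than H_fair, as the outcomes are less uncertain
```

The fair dice reaches the largest possible value for six outcomes, while the phony dice, being more predictable, ends up with a lower entropy.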

**You made it through alive!!** **With this we have gained a solid mathematical basis and intuition on statistical mechanics and thermodynamics.** We are now moving on to Shannon’s entropy formula and things will get, as promised, much easier. I chose to dip into actual statistical mechanics and thermodynamics, in order to make you, the reader, competent in recognizing similarities and differences between physics and information theory.

**From now on we will mostly focus on the information theoretic understanding of concepts within statistical thermodynamics. For us this means: simplification on various levels.**

Shannon states in the introduction of his paper “A mathematical theory of communication”:

If the number of messages in the set is finite then this number or any monotonic function of this number can be regarded as a measure of the information produced when one message is chosen from the set, all choices being equally likely. (AMTC, p. 1)

We are now able to clearly foresee what the mathematical work of Claude Shannon was all about. **The function Shannon chose was the logarithm to the base of 2**, as we know, for several reasons ─ Shannon also mentions intuition as being one of them:

For example, adding one relay to a group doubles the number of possible states of the relays. It adds 1 to the base 2 logarithm of this number. Doubling the time roughly squares the number of possible messages, or doubles the logarithm, etc. […] It is nearer to our intuitive feeling as to the proper measure […] since we intuitively measures (sic) entities by linear comparison with common standards. One feels, for example, that two punched cards should have twice the capacity of one for information storage, and two identical channels twice the capacity of one for transmitting information. (AMTC, p.1)

Furthermore, also on page 1, we see the beginning of how we understand information today, when Shannon is using the famous unit bits for the first time:

The choice of a logarithmic base corresponds to the choice of a unit for measuring information. If the base 2 is used the resulting units may be called binary digits, or more briefly *bits*, a word suggested by J. W. Tukey. A device with two stable positions, such as a relay or a flip-flop circuit, can store one bit of information. $N$ such devices can store $N$ bits, since the total number of possible states is $2^N$ and $\log_2(2^N) = N$. (AMTC, p. 1)

Here the argument of the logarithm is written as 2 to the power of $N$.
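Shannon's relay example can be reproduced in a few lines. This is a small sketch; the range of 1 to 8 devices is an arbitrary choice for illustration.

```r
# Number of relays / flip-flops (hypothetical range 1 to 8)
N = 1:8
# Each device has two stable positions, so N devices
# can be in 2^N different states
states = 2^N
# The base 2 logarithm recovers the number of bits
bits = log2(states)
data.frame(N, states, bits)
# Adding one relay doubles the number of states,
# but adds just 1 to the base 2 logarithm:
diff(bits)
```

This is the linearity Shannon refers to: capacity grows by a fixed step per device, although the number of states grows exponentially.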

We now hopefully understand that the elementary discovery of Shannon was considering a source as something that produces messages (receiver vice versa) from a *stochastic* *process*, which made communication mathematically expressible and transparently representable in the form of *information* technology (computers nowadays).

**Let us now plot a log-function of the entropy of a single sign occurring, which is specified as surprisal, self-information, information content or even *Shannon information*, of a single sign or event with a probability ranging from 0 to 1, and measured in the unit bits.** We will use the following formula: $I(x) = -\log_2 p(x)$.

**What is missing compared to Boltzmann’s entropy formula is the constant $k_B$, which can here be understood as being set to the value $k_B = 1$ (a constant of 1), as our unit will now be *bits*, which only relies on the base of the logarithm; it is rather a mathematical unit, so to speak.** The unit bits can be seen as an abstract unit with *intermediate character* though, since it also has a representation in the physical world to some degree (e.g., current hindrance in a relay switch and the relay switch as such; this actual physical representation of information has been demonstrated several times, see “Landauer’s Principle”).

In the following we will start again with looking at unweighted log probabilities, i.e., the *surprisal* of a single sign with equal probability of occurrence. **In some cases, using equal probability will result in rather abstract values, where we will get a fraction of a bit as result** (we would have to weigh the log probability in some way, if we want full bits as outcome, in order to represent the information via a relay switch operating with binary logic, or just simply add some bits, to make space for the information).

```
# Surprisal /self-information (information content):
# I(x) = -log2(p(x)) Bits
# Probability of a sign x = p(x)
# In order to plot a very smooth function, we are using
# probabilities starting with p(x) = 0.0001 and moving towards
# p(x) = 1 in respective steps. This can be seen as forming
# a sequence. We will start with a probability of
# 0.0001, as -log2(0) = Inf, i.e., not a finite value:
# We will use the sequence function seq()
px <- seq(0.0001, 1, by = 0.0001)
# Number of single probabilities:
length(px) # [1] 10000
# Let us now plot the -log2 for all p(x) from 0.0001 to 1
plot(x = px, y = -log2(px), type = "l", col = "red",
ylab = "Amount of Bits = I(x) = -log2(p(x))", xlab = "p(x)",
panel.first = grid(nx = NULL, ny = NULL,
col = "lightgray", lty = "dotted"))
# You can also add some points, e.g., for p(x)=.5
# The calculation can be done within the function points():
points(.5,-log2(.5))
points(1,-log2(1))
```
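The remark on fractions of a bit can be made concrete with a few equally likely alphabets. This is only a sketch: the alphabet sizes and the names `n_signs`, `surprisal` and `full_bits` are assumptions chosen for illustration.

```r
# Surprisal of one sign among n equally likely signs: -log2(1/n)
n_signs = c(2, 3, 4, 6, 8)
surprisal = -log2(1/n_signs)
# Smallest number of full bits k such that 2^k >= n, i.e., the
# number of binary switches needed to give each sign its own code
full_bits = sapply(n_signs, function(n) min(which(2^(0:10) >= n)) - 1)
data.frame(n_signs, surprisal, full_bits)
```

Whenever the number of equally likely signs is a power of 2, the surprisal is a whole number of bits; for 3 or 6 signs it is fractional, and a binary device has to round up.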

**It may seem as if we neglected Bayes’ theorem for a while, but we will see that this is not true at all, as there is a surprising connection between the amount of entropy/surprisal and the *model evidence* hiding just in front of us.** Recall that we have worked with equal probabilities and Bayes’ theorem before, e.g., when inferring on a single received sign. Also recall that the model evidence is the sum over the joint probabilities of *all possible* events (hence the alternative term *average likelihood* for model evidence). The *surprisal* then reflects on the uncertainty of such a single event (a sign as microstate) amongst all possible events (signs or microstates of a message), given their probabilities are uniform.

```
# Here we will represent two relay switches or two sent signs for
# a change, i.e., four possible states with equal probability,
# equivalent to the NAND truth table.
prior = c(.25,.25,.25,.25)
likelihood = c(1, 0, 0, 0)
Bayes_Machine(prior,likelihood)
# Calculate surprisal:
Surprisal = -log2(sum(prior*likelihood))
# [1] 2
```
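Since `Bayes_Machine()` is not defined within this section, the following sketch spells out the steps by hand. It is an assumption about what such a function computes (joint, model evidence, posterior), not a copy of the original implementation; the variable names are chosen here.

```r
# Prior: four equally likely states of two relay switches
prior = c(.25,.25,.25,.25)
# Likelihood: only the first state is compatible with the data
likelihood = c(1, 0, 0, 0)
# Joint probability of each state and the data
joint = prior*likelihood
# Model evidence: sum over the joint of all possible states
evidence = sum(joint)
# Posterior via Bayes' rule
posterior = joint/evidence
# Surprisal of the data in bits
surprisal = -log2(evidence)
evidence  # 0.25
posterior # 1 0 0 0
surprisal # 2
```

With a uniform prior over four states, the model evidence equals the probability of a single state, so its negative log is exactly the 2 bits needed to single out one of four equally likely states.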

We can also calculate and evaluate the **maximum entropy** of a single sign that was sent, resulting in a maximum of 1 bit at $p(x) = 0.5$:

```
# Calculate and plot the maximum entropy for 1 sign
px = seq(.1,1, by = .1)
qx = 1-px
# The entropy is at its maximum for px = .5
MaxEntropy = -(px*log2(px)+qx*log2(qx))
# > MaxEntropy # it reaches maximum of 1.0000000 at px =.5!!
# [1] 0.4689956 0.7219281 0.8812909 0.9709506 1.0000000 0.9709506 0.8812909 0.7219281
# [9] 0.4689956 NaN
# Plot maximum entropy (with new px)
px = seq(0.0001, 1, by = 0.0001)
plot(x = px, y = -(px*log2(px)+((1-px)*log2(1-px))), type = "l")
```
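Instead of reading the maximum off a grid of values, it can also be located numerically. The sketch below uses `optimize()` from the stats package; the helper `binary_entropy` is a name introduced here, and it adopts the common convention 0*log2(0) = 0 to avoid the NaN seen above at p(x) = 1.

```r
# Binary entropy in bits, using the convention 0*log2(0) = 0
binary_entropy = function(p) {
  q = 1 - p
  h = -(ifelse(p > 0, p*log2(p), 0) + ifelse(q > 0, q*log2(q), 0))
  return(h)
}
# No NaN at the borders anymore:
binary_entropy(c(0, .5, 1)) # 0 1 0
# Search for the maximum within the interval from 0 to 1
opt = optimize(binary_entropy, interval = c(0, 1), maximum = TRUE)
opt$maximum   # approx. 0.5
opt$objective # approx. 1 bit
```

The numerical search only approximates the location of the maximum, so the result should be read as "very close to .5" rather than as an exact value.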

We will get back to this again. Shannon explicitly mentions conditional probability rather late in AMTC, and also mostly in relation to what is called a **Markov chain**.

**Let us first think a little more about the essence of Shannon’s work again, i.e., arguing that communication is involved in a stochastic process, before we look at what Markov chains are.**

Note that Shannon also goes beyond equal probability of simple binary digits, and with this also sets a lot of the basics of what is today called *computational linguistics*. **Information theory is therefore far less technical than some might expect, as the math also applies and has something to do with actual language.** Shannon also goes beyond mere binary machine code by looking at natural languages having the same probabilistic characteristics *on different levels of abstraction*: some letters or words occur more often than others, i.e., they have different probabilities in general; considering *transitional dependencies*, some letters appear more often than others *after*, e.g., the letter B… The same goes for words, elements of narratives... Shannon even evaluated the average entropy of the English language in his paper “Prediction and Entropy of Printed English” (1951).
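Both levels mentioned above, the frequency of single signs and the transitional dependencies between them, can be illustrated with a toy message. This is only a sketch: the message is a made-up example and far too short for any serious estimate of a language; all names are chosen here.

```r
# Toy message (hypothetical example)
txt = "the theory of the thing"
# Split into single signs (letters and the space)
signs = strsplit(txt, "")[[1]]
# Relative frequency of each sign
p_sign = table(signs)/length(signs)
# Average entropy per sign in bits, ignoring any dependencies
H_sign = -sum(p_sign*log2(p_sign))
# Transitional dependencies: count which sign follows which
transitions = table(from = head(signs, -1), to = tail(signs, -1))
# Row-wise relative frequencies estimate
# p(next sign | current sign)
p_next = prop.table(transitions, margin = 1)
p_next["t", "h"] # 1: in this message "t" is always followed by "h"
p_next["h", "e"] # 0.75: "h" is followed by "e" in 3 of 4 cases
```

Knowing the current sign makes the next one far more predictable here than the plain sign frequencies suggest, which is exactly the kind of dependency a conditional probability captures.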

**Looking at language and the process of forming a message via a stochastic process is more intuitive than one might think.** Understanding messages primarily as being one message within a set of possible messages is something we have all experienced ourselves. For example, we do not learn a whole sentence in order to only use it in “that” particular form/order or at only one particular point in time. We only do this if we consider ourselves as lossless channels for a particular piece of information, as in remembering and reciting a poem, which concerns memory in general, including the respective noise in various forms.

**In any case, language is contingent (similar to the concept of arbitrariness) and can therefore be looked at probabilistically in various ways.** Still, one could ask where these contingencies essentially come from. In information and (dynamical) systems theory, explicitly in the latter, this is in general explained as a recursive process, setting both boundaries and contingencies at the same time. For a source in information theory, in order to communicate, one needs a code, and the equivalence of the code is what the source and receiver rely on when sending and receiving a message (incl. en- and decoding). A source can be understood as forming a recursion/identity on the matter of code, as it indirectly sets / expects the code convention at every discrete point in time and therefore also sets the boundaries and contingencies of possible messages (sets a whole).

**We have also seen that the possibility of communication relies on a *minimal difference*, i.e., a minimal amount of surprisal, i.e., entropy or disorder, which represents information itself.** This is why the specific reference, i.e., what specific *aesthetic* contingencies may be given, i.e., the specific code, as well as its “meaning” etc. is *not* what the possibility of communication essentially relies on. If everything were determined and known by us, we would not need communication in the first place, since a certain event carries no surprisal: $-\log_2(1) = 0$.

**Arguing contingency or even arbitrariness to always be the answer for everything may still seem somewhat unsatisfying, as if trying to bypass the question of their origin.** Note that there is, e.g., work within active inference / predictive processing / dynamical systems theory research that answers the question by arguing the origin to be the phenotype, i.e., being human (Ramstead, Friston, Hipólito 2020), and our ontogenetically developed priors (starting as early as in the womb, see, e.g., Ciaunica 2021). One may have to recall in general that we are far from just popping up in the world, so we are growing up with information that is *kept* in the world, such as language as a method of communication itself. Understanding information theory, however, is less about a “first reference” than about a “first principle/structure” of communication (surprise: essentially Bayes’ rule again).

**But what does the phenotype do to make communication possible or even likely?** For example, the phenotype (being human) can play the role of a self-reference in the most general sense, which makes it more likely that communication is possible, due to similar experience in terms of ourselves as a body in the world, e.g., the range of sight, scaling, capabilities of inference in general... Our eyes (retina) are an example of how first levels of code are already set by the phenotype. Most of all, the phenotype therefore sort of “pre-sets” the possibility of a structural equivalence between one's own self-relation and the self-relation of others with the same phenotype: it is likely to ‘understand’ the meaning of “My leg hurts.”, thinking of it as experiencing it oneself, when *also being a human*. In other words: even just approximating the experience of someone else just on the basis of ourselves will work quite well in a lot of cases, even though it is a kind of gambling or betting game in others.

One does not even need actual language itself, as the sight of someone in pain, performing certain actions (e.g., lying on the floor), also leads to a relation to our own experiences, making the actions of that person appear to us as an index for some cause or for the pain itself. Therefore, meaning in the sense we presented here arises within the receiver and is not something that is conveyed in the sense of a reification of meaning, i.e., as a kind of object in the world that moves with the message from A to B via a channel.

**This is why the approach to communication within information theory is to be considered *much more general* than linguistic approaches, as information theory explains the inferential capabilities and structures necessary for communication and therefore language itself as part of it. Information theory is therefore not to be confused with a mere linguistic theory.**

**We will look further into a general discourse on the relation between information theoretic concepts and human communication in general, as well as the linguistic side of information theory, in later parts of this series.**

In the next part of this series we will further explore, **mathematically**, the basic concept of language as a stochastic process when discussing the mentioned Markov chains. No fear, Markov chains are just an extension of Bayes’ rule, as we will see.
