# [R] Higher Spheres: Information Theory III: Using Markov Chains to Describe Language as the Result of a Stochastic Process

# Introduction

# 1 Markov Chains Describing the Process of Generating and Receiving Messages

## 1.1 Basics in Linear Algebra — Method I to Obtain the Stationary Distribution

## 1.2 Calculating the Steady State Distribution via Linear Algebra

## 1.3 Summary and Shannon’s (Bayesian) Formalism in AMTC - Method II to Obtain the Stationary Distribution

# 2 Different Forms of Entropy

# 3 Algorithmic Information Theory

# 4 Brief Outlook into Information Theory IV

Intermediate to Advanced Stat-o-Sphere

Published on Aug 18, 2023


**Follow this link directly to part IV of this series, if you wish,** or have a look at other tutorials of our collection Stat-o-Sphere.

**Review:** This is a pre-released article, and we are currently looking for reviewers. Contact us via [email protected]. You can also directly send us your review, or use our open peer-review functions via pubpub (create an account here). In order to comment, mark some text and click on “Start discussion”, or just comment at the end of an article.

Pub preview image generated by Mel Andrews via midjourney.

*Corresponding R Script:*

Welcome to the third part of our series on information theory. **You have *again* entered the intermediate to advanced Stat-o-Sphere.**

**In this tutorial we will mostly introduce the concept of so-called Markov chains, which are also used for the so-called Markov chain Monte Carlo method (MCMC). Even though this tutorial mainly discusses information theory, it can mostly be understood even if you only aim for an understanding of the concept of a Markov chain itself. If so, we recommend you to be at least familiar with Inferential Statistics I, which covers the topic of conditional probability / Bayes’ rule.**

In any case, the **basic ideas introduced in this tutorial** are actually not that difficult, since we again (and again and again) will **just** follow and **extend the basic idea behind Bayes’ rule and conditional probability.**

**In the previous part of this series, we discussed the relation between statistical thermodynamics (entropy in J/K) and information theory (bits).** Instead of resting on the Boltzmann constant, **information theory** as a mathematical theory of communication reflected actual communication via language not in the sense of a metaphor, but **literally mathematically**. Again, Shannon was concerned with a very basic level of communication, the transmission of signs, not with meaning. However, information theory should not be considered faulty in that respect; **Shannon actually made a point with this: we can communicate via language *despite not* transmitting actual meaning.** Meaning is something we infer internally and individually: not randomly, but within our contingencies and based on our experiences and ourselves as such (e.g., ourselves as human (our phenotype)). What we can influence, or better, what gives us the intention / the drive to speak, is the attempt to evoke a meaning in others via transmitting signs.

**In this part of the series, before further looking into linguistic implications of information theory, we will at first essentially expand the classic sender-receiver model by adding time as a conceptual factor (the change or transition of probabilities over time).**

Shannon used simple Markov chains to demonstrate what the output of a source creating a message would look like when understood as a stochastic process. Early in his paper “A mathematical theory of communication”, Shannon provides mathematical examples of how messages entailing symbols of natural languages are stochastic, as certain letters appear more often than others.

**The probability of the occurrence of signs was also reflected in telegraphy, where common letters were encoded by sequences of dots and dashes that were shorter than those of less common letters (AMTC, p. 5), which in essence saved time.** In other words, the **capacity, i.e., the amount of information sent over time**, rises.

The channel capacity of our receiving channel is also what we indirectly reflect on when we see the bitrate of our downloads (recall 8 bits = 1 byte). **The capacity is in essence a certain average number of bits per time (seconds) that is sent over a channel, e.g., 5 bits per second.** This does not automatically mean that this is the channel’s maximum, as we know that the amount of actual information sent on another level of abstraction can vary. For example, if A = 0, B = 1 and X = 0011, then a message of six binary digits can carry different amounts of *symbolic* information: in binary digits the message AABBAA has an equal length to the message XAA ─ in consequence the channel capacity can vary. In such a case the weights, i.e., the probabilities, will vary in a set of symbols.
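This can be sketched in a few lines of R (the code book and the **encode()** helper are our own illustration of the A/B/X example above, not from AMTC):

```r
# Hypothetical code book: common symbols get short code words,
# the rarer symbol X gets a longer one (A = 0, B = 1, X = 0011)
codebook <- c(A = "0", B = "1", X = "0011")

# Encode a message (a vector of symbols) into one binary string
encode <- function(message) paste(codebook[message], collapse = "")

m1 <- encode(c("A", "A", "B", "B", "A", "A")) # AABBAA
m2 <- encode(c("X", "A", "A"))                # XAA

nchar(m1) # 6 binary digits carrying 6 symbols
nchar(m2) # 6 binary digits carrying only 3 symbols
```

Both messages even encode to the very same binary string of length six here, while carrying a different number of symbols.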

**So, what is a Markov chain now?** A Markov chain is in essence a stochastic process that reflects on the probabilities of possible *future* events *in relation to present events, or even past events*, e.g., how likely it is that a B will be received (future), given that an A was sent before (present), and maybe a space was sent before that (past). When having prior knowledge about the probability of such a *transition,* one can represent the Markov chain as a graph and the process as walking along the graph, transitioning from event to event. In general, this sequential dependency of the probability of the occurrences of future events is called a **Markov property.**

**There are also other and more complex forms of Markov processes.** For example, in active inference you will come across the concept of so-called partially observable Markov decision processes (POMDP), based on the concept of **Markov blankets**. Here and in this tutorial in general, we will just get to know the most basic form of a so-called Markov process, i.e., the **Markov chain**.

**Graphs are commonly used to depict such a stochastic relation of events, in our case: symbols.** **Shannon starts with a simple uniform distribution of events, which means that each possible event is equiprobable**, which is equivalent to saying: the events occur independently of each other. In other words, the probability of future events *does not actually rely on a present event* in this particular case, even though we can still speak of a transition from one event to another. Shannon’s first example below therefore represents a process that does *not* actually have a Markov property, but can still be represented by a Markov chain.

Suppose we have five letters A, B, C, D, E which are chosen each with probability .2, successive choices being independent. This would lead to a sequence of which the following is a typical example.

B D C B C E C C C A D C B D D A A E C E E A

A B B D A E E C A C E E B A E E C B C E A D (AMTC, p. 5)

**The corresponding graph of the Markov chain looks like this. **

**Imagine starting at the middle. The graph represents the probability of walking along one of the five graph loops.** Each time you do so, a symbol is produced (by the source (you)). Walking along one of the loops is equivalent to drawing a letter from a cup (with return, due to independence), i.e., randomly sampling from the set {A, B, C, D, E}.

The upper graph can also be represented via a so-called **transition matrix**.

The table can be read as follows: The probability of the event “E”(column) occurring *after* “A”(row) is 20% ─ and same goes for every other combination (so the matrix above actually consists of *posterior values*, as we will see). Let us represent our Markov chain in R.

First, install and load the R package **“markovchain”**:

```
# Install and load library:
# install.packages("markovchain") # uncomment this row and execute to install
library(markovchain)
# Now we set the possible events (message) and our transition matrix
MessageABCDE = c("A", "B", "C", "D", "E")
MessageTransitionMatrix = matrix(c(.2,.2,.2,.2,.2,
.2,.2,.2,.2,.2,
.2,.2,.2,.2,.2,
.2,.2,.2,.2,.2,
.2,.2,.2,.2,.2),
nrow = 5,
byrow = TRUE,
dimnames = list(MessageABCDE, MessageABCDE))
# This will define our actual Markov chain as a relation of
# states/events to their respective transition probabilities
MCmessage = new("markovchain", states = MessageABCDE,
byrow = TRUE,
transitionMatrix = MessageTransitionMatrix,
name = "WritingMessage")
# With this function you can evaluate a class of an object
class(MCmessage)
# Now let us “walk” on the chain, sample:
markovchainSequence(n = 20, markovchain = MCmessage, t0 = "A")
# My output looked like this (compare with Shannon’s results):
# > markovchainSequence(n = 20, markovchain = MCmessage, t0 = "A")
# [1] "B" "D" "C" "C" "B" "A" "D" "B" "C" "E" "A" "E" "B" "D" "C" "B"
# [16] "D" "B" "E" "C"
# Plot Markov Chain (very interesting result)
plot(MCmessage, edge.arrow.size = 0.1)
```

The result of the **plot(MCmessage…)** function relies on a somewhat different interpretation of the situation. Shannon’s graph is understood as one state in the middle representing the present output state of a source. The R plot treats every letter as a different state. The symmetry of the geometric structure tries to represent equal probabilities as equal distances between the lines of the graph.

The second example by Shannon looks like the one above, just with different probabilities.

Using the same five letters let the probabilities be .4, .1, .2, .2, .1, respectively, with successive

choices independent. A typical message from this source is then:

A A A C D C B D C E A A D A D A C E D A

E A D C A B E D A D D C E C A A A A A D (AMTC, p. 5)

**Below you will again see the corresponding graph:**

```
##### Second chain example:
MessageABCDE = c("A", "B", "C", "D", "E")
MessageTransitionMatrix = matrix(c(.4,.1,.2,.2,.1,
.4,.1,.2,.2,.1,
.4,.1,.2,.2,.1,
.4,.1,.2,.2,.1,
.4,.1,.2,.2,.1),
nrow = 5,
byrow = TRUE,
dimnames = list(MessageABCDE, MessageABCDE))
MCmessage = new("markovchain", states = MessageABCDE,
byrow = TRUE,
transitionMatrix = MessageTransitionMatrix,
name = "WritingMessage")
markovchainSequence(n = 20, markovchain = MCmessage, t0 = "A")
# Output Sequence:
# > markovchainSequence(n = 20, markovchain = MCmessage, t0 = "A")
# [1] "A" "D" "C" "A" "C" "C" "C" "D" "D" "A" "A" "B" "D" "A" "A" "E"
# [17] "C" "A" "E" "A"
# In the case of a MC without Markov property altern. code possible
tokens <- c("A", "B", "C", "D", "E")
probs <- c(0.4, 0.1, 0.2, 0.2, 0.1)
sample(tokens, size = 20, replace = TRUE, prob = probs)
## [1] "A" "B" "A" "B" "D" "B" "C" "D" "A" "D" "C" "E" "A" "A" "C" "E" "C" "D" "C" "C"
```

I got the alternative code with the **sample()** function from the website **stackoverflow**. **In case you don’t know stackoverflow:** It is a great website to ask any question related to code. I was uncertain if I had coded the right transition matrix to the graph above. Here is a link to my question I asked and the answer I got (people usually answer ridiculously fast, though make sure you ask your question well, and clearly present your coding attempts and your goal).

**Here is another exemplary Markov chain by Shannon:**

```
####### Third chain example:
MessageABC = c("A", "B", "C")
MessageABCTransMatrix = matrix(c(.0,.8,.2,
.5,.5,.0,
.5,.4,.1),
nrow = 3,
byrow = TRUE,
dimnames = list(MessageABC, MessageABC))
MCmessageABC = new("markovchain", states = MessageABC,
byrow = TRUE,
transitionMatrix = MessageABCTransMatrix,
name = "WritingMessage")
markovchainSequence(n = 20, markovchain = MCmessageABC, t0 = "A")
# Plot Markov Chain
plot(MCmessageABC, edge.arrow.size = 0.1)
```

**We can also determine our Markov chain’s stationary distribution.** Before we define what a stationary distribution actually is and how it is calculated, let us first get an intuition on it by reflecting on walking on our graph over a longer time period, called several *phases*.

Starting from A, we see in our transition matrix that the overall probability to move from A to B is much higher than the probability to move to the other states, as some states are in general (or on average) more likely than others. **This will subsequently *change* the transition probabilities related to the three states of our graph over time**, i.e., the probabilities in our matrix are changed and changed further and further over time.

A **stationary distribution (equilibrium)** is then defined as a set of probabilities that *doesn’t change anymore*, **i.e., stabilizes over time** ─ or was even stable from the beginning, as in our *0th approximation examples* with equal probabilities above. In such a case, all of the probabilities in every row *and in the diagonal* of the matrix are the same from the beginning and therefore indirectly constant over time. This “row distribution” is also called a **steady state distribution**, and represents an *equilibrium* of the Markov chain.

Before we dig into linear algebra, **let us first heuristically look at our chain after a couple of phases** before it hits steady state, to see how the probabilities change (i.e., looking at the transition of the transition probabilities). The respective matrix can be heuristically acquired via the **markovchain** package, simply by “squaring” the chain in R (we will not literally square it in the mathematical sense; this is just the set convention for the command code in the package).

```
################################################
# Steady state distribution - Heuristic method #
################################################
# Calculating steady state distribution:
# "Heuristic" method via the markovchain
# The respective matrix that entails the
# eigenvector can be heuristically acquired
# via the markovchain package, simply by
# “squaring” the chain in R.
MCmessageABC^1 # Initial transition matrix, i.e., phase 1
MCmessageABC^2 # Transition matrix 2. phase
MCmessageABC^10 # approaching steady state
MCmessageABC^28 # steady state probability (=row (eigen)vector)
# Execute this line of code to directly get the eigenvector
steadyStates(MCmessageABC)
```

**See how the values hit steady state very early?**

**And it is just the decimal places that become more and more precise over time, so we could even stop earlier, depending on the requested precision of the estimation of the values.**

**Next we will discuss how to actually calculate and visualize the steady state distribution.**

**You now have two options!** *You either* follow the linear algebra approach, which is much easier than it looks, since it is very visual and will give you robust knowledge on the topic. *Or you* proceed with Shannon’s approach further below, concerning conditional probability / Bayes’ rule right away. Shannon’s way of calculating the steady state makes sense from a probabilistic perspective, but you will miss out on a robust visual understanding of what a steady state / eigenvector is.

The calculation behind the **steadyStates()** function is done using linear algebra. We have encountered some basics in Inferential Statistics II, using the linear least squares method for solving a system of equations. Linear algebra can easily get complicated, but we will see that it can be well understood by visualizing the math, and we are only going to look at very basic applications of it.

**In this chapter we will gain all the very basics of linear algebra we need to calculate and understand the steady state**, also since it is a good chance to introduce matrix multiplication and other mathematical concepts you will need to have an idea of when approaching today’s applications of linear algebra, e.g., in machine learning.

**After this, we will look back into how Claude Shannon mathematically described the (changing) probabilities, resulting in equivalent results, by applying conditional probability / Bayes inference.**

For those interested in linear algebra in general and **to get more visualizations parallel to this tutorial**, we recommend the overview by Grant Sanderson (3blue1brown), as linear algebra involves a lot of visualizations that are not easy to compute instantly in one’s head. **I am also intentionally going to follow some parts of Grant Sanderson’s tutorial on linear algebra here as well.** Grant Sanderson is a mathematician working as a professional math tutor and produces beautiful, very patient and calming videos on mathematics, physics and computer science for social media platforms.

I definitely recommend checking out his videos in general, as they are both mathematically deep and profound, as well as explicitly directed at beginners seeking an intuition, all coming together in a very appealing voice and elaborate visualizations of mathematics. Grant Sanderson also shares all of the (Python) code he used to produce the visualizations in his videos!

**Linear algebra is in general all about vectors and vector fields in a coordinate system.** We have encountered a lot of vectors before using the **c()** function, which we figured out is an abstraction of vectors in R and can be used to set up *lists of any kind*, i.e., any class of symbols, such as the list of characters **c("A", "B", "C")**, but also in terms of a mathematical abstraction, e.g., in the form of a list that represents a *set* of possible dice values **c(1, 2, 3, 4, 5, 6)**. We have also used probability vectors to represent our prior and our likelihood, similar to our transition probability matrix.

**In linear algebra these “vector” lists are understood as a list entailing *coordinate* values.** Linear algebra can itself be seen as an abstraction from geometry, where the elements in a vector can be used to determine the position of a point in a coordinate system.

Another way of understanding a vector is by thinking of the values as commands: first walk a number of steps along the x-axis, then a number of steps along the y-axis.

Here is an example of the denotation of a vector located in two dimensions: **v = [2, 2]T** (the T indicating a transposed column vector, see the code comments below).

**Here is the respective R code to plot the vector within a coordinate system:**

```
## Set up a function that prints an empty coordinate system
# Note: no input for this function necessary!
grid2D = function() {
# Let us first plot an empty plot, where x and y
# range from -5 to 5. For this we will adjust the plot a little:
# These vectors represent a list of tick mark values, see below
posX = c(-25:25) # ":" creates a sequence from -25 to 25
posY = c(-25:25)
# Plot empty plot (asp = 1 for steady ratio)
plot(NA, xlim=c(-5,5), ylim=c(-5,5), xlab="X", ylab="Y", asp = 1,
axes = TRUE) # axes = FALSE would remove the pre-set axes
abline(h = 0,lty = 1) # adds horizontal line at ylim = 0
abline(v = 0,lty = 1) # adds vertical line at xlim = 0
abline(h = posY, lty=3) # adds grid both h and v
abline(v = posX, lty=3)
} # End of function
# Execute to get blank coordinate system the easy way.
# This is necessary as one cannot delete arrows from a plot,
# one has to re-run all lines of code in such cases
grid2D()
# Add arrow / vector in terms of (geometric) linear algebra.
# x0/y0 = origin, x1/y1 = point to draw arrow to
v = arrows(x0 = 0, y0 = 0,x1 = 2, y1 = 2, length = 0.1, lwd = 2)
# Add text. T for transpose, i.e., indicating that the vector was
# flipped to the side and still represents a column vector,
# not a row vector.
text(x=2.5,y=1, label="v = [2, 2]T", srt = 5)
```

Let us now try to **add vectors together**, to see what happens.

```
#### Adding vectors together:
grid2D() # get fresh grid
# Calculation of v1+v2 = v3
v1 = c(2, 2); v2 = c(2, -2); v3 = v1+v2
# Plot result
v1 = arrows(x0 = 0, y0 = 0,x1 = 2, y1 = 2, length = 0.1, lwd = 2)
v2 = arrows(x0 = 0, y0 = 0,x1 = 2, y1 = -2, length = 0.1, lwd = 2)
v3 = arrows(x0 = 0, y0 = 0,x1 = 4, y1 = 0, length = 0.1, lwd = 2)
text(x=1,y=2, label="v[1]", srt = 5)
text(x=1,y=-2, label="v[2]", srt = 5)
text(x=3,y=.5, label="v[1]+v[2]=v[3]", srt = 5)
```

This process of **adding vectors together** can also be understood as adding one vector upon another in the sense that, e.g., the “endpoint” of **v1** becomes the starting point of **v2** (tip-to-tail):

```
# Execute the following lines; adds to the previous plot
v2add = arrows(x0 = 2, y0 = 2,x1 = 2+2, y1 = 2+(-2), length = 0.1,
lwd = 2, col = "lightblue")
text(x=3.5,y=1.5, label="v[2]", srt = 5, col = "lightblue")
```

**Scalar multiplication**, i.e., a single value multiplied with a vector/matrix, is easier to understand than the above, since it just **results in a change in the length of a vector**. Given our vector **v2 = c(2, -2)**, we multiply it with the scalar 2:

**Add the new vector to upper plot via:**

```
#### Scalar multiplication
v4 = arrows(x0 = 0, y0 = 0,x1 = 2*2, y1 = 2*(-2), length = 0.1,
lwd = 2, col = "lightgreen")
text(x=3,y=-4, label="v[4]", srt = 5, col = "lightgreen")
```

**The resulting vector is stretched.** In the case of a fraction as scalar value, the vector would be *squished*, i.e., the length would be shortened. This is called scaling ─ therefore the term scalar (again, check out Grant Sanderson’s introduction to linear algebra for more visualizations). Negative values flip the vector’s direction by 180°, which is reasonable when thinking of vector values as coordinates of a point.
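As a small numeric sketch of these three cases (stretching, squishing, flipping), using the vector from the plot above:

```r
# Effect of different scalar values on the vector v2 = c(2, -2)
v2 <- c(2, -2)
2 * v2   # stretched: c(4, -4)
0.5 * v2 # squished:  c(1, -1)
-1 * v2  # flipped by 180 degrees: c(-2, 2)
```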

Next up is the so-called **linear transformation**, where we will **introduce the concept of a matrix in linear algebra**. Again, same as with vectors, we have used matrices before in the form of a list or a table in other tutorials. You have seen abstractions of matrices in the form of a conditional probability table (CPT), a list of character values (e.g., some text) or in the form of our transition matrix, drawing a sequential relation between two letters (the given and the future letter) given its transition probability, where the rows sum up to 1.

The latter concept of a **transition matrix** regarding our Markov chain can be understood as a list of *row probability vectors*, where a relation is always drawn between one given letter (row) and three possible future letters (columns), expressing values of a probabilistic tendency to transit towards each of the letters ─ so the row probabilities always represent the relation of one to a whole of possible states (resulting in a sum of 1).

We also have encountered matrices above where the row vectors are all the same and figured that the **stationary probability is equivalent to a row vector and the diagonal of a matrix that hit steady state.** If this is not initially the case, we found out that the whole matrix, i.e., all of the values change in a way over time that they converge to a state where all row vectors and the diagonal become equivalent (equilibrium steady state). **Mathematically this process involves a ***vector / matrix multiplication***, which is also called ***linear transformation***.**
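The first property can be quickly checked in R (using the transition matrix of the third chain example from above):

```r
# Every row of a transition matrix is a probability vector over the
# possible future letters, so each row has to sum up to 1
MessageABCTransMatrix <- matrix(c(.0, .8, .2,
                                  .5, .5, .0,
                                  .5, .4, .1),
                                nrow = 3, byrow = TRUE)
rowSums(MessageABCTransMatrix) # 1 1 1
```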

In essence, think of a **linear transformation** as the process of a function only *squishing* and *stretching* vectors (no curving), while the origin remains the same (compare with Grant Sanderson). The whole grid can here be understood as every possible vector in a field (our 2D grid), which are all changed in a certain way by such a function. Below is a function that plots a vector **v = c(x,y)** as an arrow for us.

```
# Function to plot vectors as arrows(x), including optional
# color option (y).
plotvec = function(x,y){
if (missing(y)){
vector = arrows(x0 = 0, y0 = 0,x1 = x[[1]], y1 = x[[2]],
length = 0.1, lwd = 2)
}
else{
vector = arrows(x0 = 0, y0 = 0,x1 = x[[1]], y1 = x[[2]],
length = 0.1, lwd = 2, col = y)
}
} # End of function
# Vector for each quadrant of the coordinate system
v1 = c(2,2);v2 = c(2,-2);v3 = c(-2,-2); v4 = c(-2,2)
grid2D()
plotvec(v1);plotvec(v2);plotvec(v3);plotvec(v4)
```

We are now going to introduce **two important vectors of a transition matrix**, the so-called **base vectors** or coordinate vectors. These are vectors that lie on the coordinate base, i.e., either directly on the x- or on the y-axis. With changes in these two vectors, we will be able to transform any vector in the vector space, i.e., the space of possible vectors, which in two dimensions represents the grid itself, considered as a plane of possible vector points. Let us plot the base vectors of the coordinate system:

```
# Base vectors
ibase = c(1,0); jbase = c(0,1)
plotvec(ibase); plotvec(jbase)
```

**Every “regular vector” can now be understood as** a *scaled change in the base*, such that **v = x·ibase + y·jbase**.
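We can check this in R for our vector **v1 = c(2, 2)**: it is nothing but 2 steps along **ibase** plus 2 steps along **jbase**:

```r
# v1 = c(2, 2) expressed as a scaled change in the base vectors
ibase <- c(1, 0); jbase <- c(0, 1)
v1 <- 2 * ibase + 2 * jbase
v1 # c(2, 2)
```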

Another way of understanding the concept of a change in the base is seeing it as a way to transform vectors into another position via a vector-matrix multiplication. First, we will look at a situation where the base is actually *not* changed. This can be achieved by multiplying the vector with a so-called *identity matrix* **I** that entails the base vectors of our coordinate system (we used it above as well; more details on **I** will follow below). Multiplying a vector with a matrix is equivalent to the process we have gone through above. **The below method used for multiplying a vector with a matrix is also referred to as taking the *dot product*:**

It is important to note here that we are actually multiplying the vector values with the *row values* and *not* the column values, even though the columns represent the coordinates of the two base vectors **ibase** and **jbase**.

```
# Identity matrix
matrixI = matrix(c(1,0,
0,1), ncol = 2)
# Alternative shortcut function for the identity matrix:
matrixI = diag(2) # where 2 refers to a 2x2 (square) matrix dimension
# v1 = c(2,2)
vTrans = v1%*%matrixI
```

Now imagine that *the base vectors are actually changed*, i.e., both base vectors are scaled by 2:

```
# Changed base vectors
ibase = c(2,0); jbase = c(0,2)
# Here is a function doing the transformation for us.
# The function adds the base vectors together to a matrix.
vectrans = function(vector,ibase,jbase){
matrixTrans = matrix(c(ibase,jbase),ncol=2)
vecOUT=vector%*%matrixTrans
print(vecOUT)
} #End of function
v1trans = vectrans(v1,ibase,jbase)
v2trans = vectrans(v2,ibase,jbase)
v3trans = vectrans(v3,ibase,jbase)
v4trans = vectrans(v4,ibase,jbase)
# Add to plot:
plotvec(v1trans);plotvec(v2trans);plotvec(v3trans);plotvec(v4trans)
```

Let us now change **ibase** and **jbase** in a way so everything is “rotated”. Again, the math below is not performing a *real* rotation, as this would involve a so-called *rotation matrix* (incl. sine and cosine terms). **The below vectors will still be rotated, but will *not* have an equal length (i.e., they are rotated and stretched as well).**

```
#### "Pseudo-Rotating" the vector field
grid2D()
# original ibase/jbase
ibase = c(1,0); jbase = c(0,1)
plotvec(ibase,"lightblue");plotvec(jbase,"lightblue")
v1trans = vectrans(v1,ibase,jbase)
v2trans = vectrans(v2,ibase,jbase)
v3trans = vectrans(v3,ibase,jbase)
v4trans = vectrans(v4,ibase,jbase)
plotvec(v1trans,"orange");plotvec(v2trans,"orange");
plotvec(v3trans,"orange");plotvec(v4trans,"orange")
# "Pseudo-Rotating" both base vectors 45°
ibase = c(1,-1); jbase = c(1,1)
plotvec(ibase,"blue");plotvec(jbase,"blue")
v1trans = vectrans(v1,ibase,jbase)
v2trans = vectrans(v2,ibase,jbase)
v3trans = vectrans(v3,ibase,jbase)
v4trans = vectrans(v4,ibase,jbase)
plotvec(v1trans,"green");plotvec(v2trans,"green");
plotvec(v3trans,"green");plotvec(v4trans,"green")
```

**Here is also a short example for a matrix-matrix multiplication ─ just for completeness:** A 2x3 matrix is multiplied with a 3x2 matrix. The inner values of the dimensions are equivalent, so multiplication is possible; the outer values provide information on the dimension of the *resulting matrix* ─ 2x2:

```
# Matrix matrix multiplication
# Matrices:
matrix_A = matrix(c(1,2,3,4,5,6),
ncol=3, nrow = 2, byrow = TRUE)
matrix_B = matrix(c(1,2,3,4,5,6),
ncol=2, nrow = 3, byrow = FALSE)
# Use %*% for matrix multiplication instead of *
matrix_A%*%matrix_B
```

At last, we will look at linear algebra from the perspective of *linear equations* ─ or more specifically: a **system of linear equations** (with one or more unknowns), as it involves plain linear equations of the form a·x + b·y = c.

**In general, systems of equations are** used to solve a wide range of very basic questions in science. We will start with a very simple example that most of us probably have formally encountered in riddles or so, without thinking of a system of linear equations as such:

**FURRY EXERCISE:** **Two cats, Mr. Midnight and Jurassica Parker, are together 15 years old. Their age deviates by 7 years ─ Jurassica Parker being the older sister of Mr. Midnight. How old is each of them?**

We can solve this equation by *substituting* one equation into the other, e.g., expressing x (Mr. Midnight’s age) as 15 - y.
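Worked through in R (the variable names are of course our own):

```r
# x + y = 15 (ages together) and y - x = 7 (age difference),
# with x = Mr. Midnight and y = Jurassica Parker.
# Substituting x = 15 - y into the second equation:
# y - (15 - y) = 7  =>  2*y = 22
y <- 22 / 2  # Jurassica Parker
x <- 15 - y  # Mr. Midnight
c(MrMidnight = x, JurassicaParker = y) # 4 and 11
```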

Another way of solving our exercise is via linear algebra: we can write the left side of the equations as a matrix and the right side as a vector, resulting in:

```
# Set up two elements, so that we can solve an
# equation in the form: left = right
left = matrix(c(1,1,
-1,1), ncol = 2, byrow = TRUE)
right = matrix(c(15,7), ncol = 1, byrow = TRUE)
# Use the solve function to solve the system of
# equations. It outputs a vector list = c(x,y)
ResultXY = solve(left,right)
```

**In linear algebra the stationary or steady state distribution is stating that there is an (eigen-)vector *v* that, when multiplied with the transition matrix *A*, delivers equivalent results to just multiplying *v* with a scalar *λ* (we will look at the equation below).** A *scalar*, again, is a single number value that can be used to, e.g., change the length and ─ when negative ─ also the direction of a vector. Here the scalar is also called an *eigenvalue* of an *eigenvector*.

The **eigenvector** can, e.g., be understood as an arrow in a plane within a 3D state space that stands for or lands on the axis of a rotatable plane and therefore does not change its direction when the plane is rotated.

Here is a formula commonly used to evaluate the steady state eigenvector: **vA = λv**, where in our case the eigenvalue is **λ = 1**.

What we see in the upper equation is a *vector/matrix multiplication* on the left-hand side of the equation, i.e., a vector multiplied with a matrix, and a *scalar multiplication* on the right-hand side of the equation, i.e., a single value **λ** multiplied with the vector **v**.

First let us prove that the above equation is true. Execute the following:

```
# Quick check if vA=lv is true:
vA = steadyStates(MCmessageABC)%*%MessageABCTransMatrix
lv = 1*steadyStates(MCmessageABC)
# Test for near equality (floating point comparison)
all.equal(vA,lv)
```

Note that solving the upper equation in order to obtain the eigenvector requires rewriting the scalar **λ** as a *factorized identity matrix* **λI**.

This **converts the scalar into a matrix**, so we can re-arrange the whole system of equations freely. The diagonalization with the respective zero values can be read as only addressing each event of transitioning from, e.g., A to A ─ making the *identity matrix* graspable from what we learned before on Markov chains. Note again that the values within that identity matrix are the *eigenvalues* of our *eigenvector*, so they *do not actually* refer to probabilities of a transition, but can for themselves be looked at as transitioning from a state to itself.

Filling in all elements we have gained from our last Markov chain example in the upper equation will result in:

The **T** stands for *transposed*, which essentially means that the vector is flipped to the side, as our steady state actually represents a *row vector*. One can use **T** *formally for vectors* in representations, in order to save space (transposing a matrix is just as simple, though). We could now again apply our substitution technique several times, until we have calculated each element of the *probability row vector*, representing the probability of, e.g., transitioning from the letter A.

We will not substitute, but go a different path, re-arranging our upper equations in order to fit them into a system of linear equations in the form of *column vectors* again, which refers to our previous use in both geometry and linear equations: the matrix consists of base (column) vectors, and the overall structure is that of a system of linear equations with the results on the right-hand and the unknowns on the left-hand side of the equation. Turning rows into columns and vice versa can again be achieved by the transposing technique, essentially flipping the vector by 90°, as introduced above. We will start with re-arranging the equation (compare these videos and articles: a), b), c) we partially followed).

Now that we have our identity matrix set up to convert our scalar product, we can move the right side of the equation to the left:

As we want to obtain our steady state distribution, i.e., a row vector, via a system of linear equations using *column* vectors, we transpose the matrix, and *pro forma* the zero vector $0$ as well, i.e., turning all rows into columns and vice versa:

So **structurally**:

**Here is a quick example of what the results can look like in such a case:** Given is a system of equations whose right-hand side is all zeros, i.e., $b = 0$. Such a homogeneous system is solved by the zero vector, but also by every scaled version of our eigenvector, so the system alone does not single out the steady state distribution.

**In order to be able to compute our system of equations, we have to do some adjusting** and add the fact that the elements of our steady state vector sum up to 1.

```
##### Re-arranging matrix, to solve it:
# We will use a simplified form of our transition matrix
# without renamed columns and rows c("A", "B", "C")
MessageABCTransMatrix = matrix(c(.0,.8,.2,
.5,.5,.0,
.5,.4,.1),
nrow = 3,
byrow = TRUE)
```
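As a quick sanity check (not part of the derivation), the rows of a transition matrix should each sum to 1, since from every state we have to transition *somewhere*. The matrix is repeated here so the snippet runs on its own:

```r
# Each row of a (right) stochastic matrix sums to 1:
MessageABCTransMatrix = matrix(c(.0,.8,.2,
                                 .5,.5,.0,
                                 .5,.4,.1),
                               nrow = 3, byrow = TRUE)
rowSums(MessageABCTransMatrix)
# [1] 1 1 1
```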

Here we will add the fact that the rows of the matrix sum up to 1, which in the case of our transposed matrix is done by adding another row (!) via the function **rbind()**, which combines rows. Also note that the function **t()** := transpose, and **diag(dimension)** = $I$:

```
# Here we will add the fact that the rows of the matrix sum up to 1,
# which in the case of our transposed matrix is done
# by adding another row (!) via rbind() which combines rows.
# Also note that the function t() := transpose, and diag(x) = I
A = rbind(t(MessageABCTransMatrix-diag(3)), c(1,1,1))
```

The transposed rows, i.e., the columns of $t(M - I)$, will not actually sum to 1 but to zero, as we have subtracted the identity $I$; only the appended row of ones makes each column of the matrix below sum to 1:

```
# The transposed rows, i.e., columns of t(M - I) will not
# actually sum to 1 but to zero, as we have subtracted
# the identity I; only the appended row of ones makes
# each column of A below sum to 1
# > A
# [,1] [,2] [,3]
# [1,] -1.0 0.5 0.5
# [2,] 0.8 -0.5 0.4
# [3,] 0.2 0.0 -0.9
# [4,] 1.0 1.0 1.0
```

We will also represent our result vector $b$, which, recall, is the zero vector. As our vector has to have the same number of rows as the matrix, we will also add a 1, which, again, just represents our extra information of **sum(row) = 1**, even though each *transposed row*, i.e., each column of $A$ before the appended ones, actually sums to zero, due to the subtracted identity:

```
# We will also represent our b, which, recall b = 0
# As our vector has to have the same number of
# rows as the matrix, we will also add a 1, which, again,
# just represents our extra information of sum(row)=1 ─
# even though the transposed row, i.e., column of A
# sums to zero, due to the subtracted identity
b = c(0,0,0,1)
```

Our whole system of equations will look like the following, using the **cbind()** function (combines by columns):

```
# Our whole system of equations will
# look like the following, using the
# cbind() function (combines by columns)
cbind(A,b)
# > cbind(A,b)
# b
# [1,] -1.0 0.5 0.5 0
# [2,] 0.8 -0.5 0.4 0
# [3,] 0.2 0.0 -0.9 0
# [4,] 1.0 1.0 1.0 1
```

**Solving this system of equations via the function ****solve(A,b)**** is unfortunately still not possible, as our matrix is not a ***square*** matrix :D** We can still solve our system of equations using the so-called QR-algorithm via **qr.solve()**. This algorithm is quite something to do by hand, so I will not go into it in detail. However, it can handle our issue via a special workaround, so we will finally obtain our steady state vector via linear algebra:

```
# Use QR-algorithm to solve the system of equations
qr.solve(A,b)
# > qr.solve(A,b)
# [1] 0.33333333 0.59259259 0.07407407
# Let us see if our steadyStates(x) and qr.solve(a,b) is equal
all.equal(as.vector(steadyStates(MCmessageABC)),qr.solve(A,b))
# [1] TRUE
```

Looking back into AMTC on p. 5f., we see that Shannon uses a different and **much simpler formalism**, which relates the linear algebra of our Markov process to our well-known conditional probability:

Let us go through all of them again using Shannon’s formalism, ultimately summarizing what we learned so far:

* **Transition matrix:** $p_i(j)$ denotes the probability of transitioning from state $i$ to state $j$, i.e., a conditional probability (posterior/likelihood). If it seems unclear why the posterior is already given, think of it as already being inferred and as something that is updated over time (heuristically via **MC^n**), until an (equilibrium) steady state is achieved.

**SIDE NOTE: In case letters are too abstract, one can also look at it from a temporal perspective, which is often done when using weather forecasting as an example for a Markov chain:** The posterior probability of rain tomorrow, *given* that it is sunny today, could be denoted by $p(\text{rain}|\text{sun})$.
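To make the temporal reading concrete, here is a small hypothetical weather chain in R (the probabilities are made up purely for illustration): the entry in row “sun”, column “rain” is the posterior probability of rain tomorrow, given sun today.

```r
# Hypothetical two-state weather chain (transition probabilities
# are made up for illustration):
Weather <- matrix(c(.9, .1,   # p(sun|sun),  p(rain|sun)
                    .5, .5),  # p(sun|rain), p(rain|rain)
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("sun","rain"), c("sun","rain")))
# Posterior probability of rain tomorrow given sun today:
Weather["sun", "rain"]
# [1] 0.1
# Two days ahead: simply square the matrix (cf. MC^n):
(Weather %*% Weather)["sun", "rain"]
# [1] 0.14
```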

* **Steady state vector:** the steady state vector is technically an eigenvector with its respective scalar / eigenvalue of 1, relating the vector to itself when multiplied with the transition matrix: $vP = v$.

* **Relative frequency table**, i.e., **joint probability matrix:** This represents our *joint probability* as a matrix or table, multiplying each *prior* with the *likelihood*, or the *posterior* with the *model evidence*, i.e., $p(i,j) = p(i)\,p_i(j)$. Before, we were mostly working with single joint probabilities at a time; here we obtain all of them at once.

**Before we can go any further:** Below you will find a function that calculates the relative frequencies, i.e., the joint probabilities, for us. The first part sets up a list, which will in the end be a matrix. The nested for-loop (a loop within a loop) multiplies each element of the steady state with the respective row of the transition matrix. Note that the function has to deal with different *classes* of input, e.g., our **qr.solve()** output, which for itself makes trouble when considered as a numeric vector.

```
#### Relative probabilities (joint probability matrix) function:
relProbs = function(Steady,TransMatrix){
# This will set up an empty matrix list:
relProbsMatrix = vector("list",
nrow(TransMatrix)*ncol(TransMatrix))
dim(relProbsMatrix) = matrix(c(nrow(TransMatrix),
ncol(TransMatrix)))
# Nested for-loop (loop within a loop)
# 1:length means: sequence of 1 to max length (=nr of elements):
for (i in 1:length(Steady)){
for (j in 1:ncol(TransMatrix)){
relProbsMatrix[[i,j]] = Steady[[i]]*TransMatrix[[i,j]]
}
}
# Set output as matrix (otherwise problems occur for colSums())
# as.numeric(), as.matrix() etc. changes the class of an object:
relProbsMatrix = as.numeric(relProbsMatrix)
# Output is a vector. dim() adjusts dimension back to a matrix
# with the square root "sqrt()" of the length as nr. of c(rows,cols)
# (can be done this way due to the fact that a square matrix is given)
dim(relProbsMatrix) = matrix(c(sqrt(length(relProbsMatrix)),
sqrt(length(relProbsMatrix))))
print(relProbsMatrix)
} # End of function
# We can now use both qr.solve(A,b), or steadyStates(MCmessageABC)
# for the first input “Steady”:
jointMatrix = relProbs(qr.solve(A,b),MessageABCTransMatrix)
```

As mentioned, the row and the column sums of the *joint probability matrix* are equivalent, which again is due to the steady state we are in, resulting in $p(i) = p(j)$ for corresponding states.

“If $p(i)$ is the probability of state $i$ and $p_i(j)$ the transition probability to state $j$, then for the process to be stationary it is clear that the $p(i)$ must satisfy equilibrium conditions” (AMTC, p. 10):

```
# Row and column sums for p(i) := model evidence and p(j) := prior.
# They are equivalent to our steady state (can be seen as its
# definition in the language of probabilities, as it entails the
# notion of steadiness, i.e., “not changing” or “no difference
# between past, present or future”):
pi = rowSums(jointMatrix)
pj = colSums(jointMatrix)
# Corresponding to the formula:
pi = rowSums(pi*MessageABCTransMatrix)
```

**Looking back to our initial heuristic approach to obtain the steady state**, we saw that the eigenvector will fill every row of the transition matrix over time (phases), and can therefore *also* be found in the matrix’ *diagonal* as well.

```
# Looking back at initial heuristic approach:
MCmessageABC^28
# Result:
# A B C
# A 0.3333333 0.5925926 0.07407407
# B 0.3333333 0.5925926 0.07407407
# C 0.3333333 0.5925926 0.07407407
```
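The same heuristic can be reproduced without the markovchain package; a minimal base-R sketch, repeatedly multiplying the transition matrix with itself:

```r
# Base-R equivalent of MCmessageABC^28, via repeated matrix products:
P  <- matrix(c(.0,.8,.2,
               .5,.5,.0,
               .5,.4,.1), nrow = 3, byrow = TRUE)
Pn <- diag(3)                   # start with the identity (phase 0)
for (phase in 1:28) Pn <- Pn %*% P
round(Pn, 7)
# Every row converges to the steady state
# c(0.3333333, 0.5925926, 0.0740741), matching the output above.
```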

Note that the “diagonal eigenvector” structurally implies an aspect of *identity*: the probability of transitioning from, e.g., $A$ to $A$ now equals the plain probability of $A$ itself, and the same holds for *all possible* states.

**This means that our initial dependencies, i.e., the dependency of the probability of the future state on its present/given state, vanish**, so that *prior and posterior essentially become the same* for every possible *particular* state.

We can also again compare a steady state with a state of *independence*: conditioning on the present state no longer changes the probability of the next state.

All of this again translates to what we learned from linear algebra: For an eigenvector it does not matter, if it is just scaled by 1, i.e., relating to itself, or if it is related to another state in terms of our transition matrix. **Note that all of this also holds for different phases, which is pretty crazy to discover:**

```
# Using linear algebra again, now with the transition matrix of
# phase 5. I didn’t find a way to change the class of the
# output of:
MCmessageABC^5
# ... so I had to create the matrix myself with the respective
# values, taken from the console output (a little messy, but it still worked!):
phase5 = matrix(c(0.31250, 0.60648, 0.08102,
0.34375, 0.58565, 0.07060,
0.34375, 0.58564, 0.07061), ncol = 3, byrow= TRUE)
A = rbind(t(phase5-diag(3)), c(1,1,1))
b = c(0,0,0,1)
cbind(A,b)
qr.solve(A,b)
# Let us check if our assumption vA=v holds for phase 1 and the
# exemplary phase 5, using a shortcut: a joint probability
# matrix can be built by just multiplying
# TransMatrix*eigenvector, which does the same as relProbs()
# (though previous output classes requested some additional coding).
# Set the eigenvector; here I retrieved it from the previous
# jointMatrix:
eigenvector = rowSums(jointMatrix)
# Check if v=vA holds for different phases:
INITphase1 = rowSums(MessageABCTransMatrix*eigenvector)
AFTERphase5 = rowSums(phase5*eigenvector)
# Note that we could also implement diag(length(eigenvector)) from
# the actual formulas above, but it does not make a difference.
```

**Again, we recommend watching a video on this, since it also makes a lot of sense to see the transitions being animated!**

We can now calculate several **different forms of information theoretic entropy**. Before, we argued the (Boltzmann) entropy to be the negative log *model evidence*. However, we can calculate different forms of entropy that refer to different kinds of probabilistic relations, as we will see.

**The following picks up where we ended in the ****second part**** of this series, namely p. 11 in ****AMTC****, where Shannon calculated the surprisal / entropy and the maximum entropy of a single sign sent.** He then goes on discussing the following insights:

At first he mentions the special case of the entropy always being greater than or equal to zero (which also goes for the surprisal), and only being zero in the case of full certainty, i.e., when one event has the probability 1 and all others 0.
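Both boundary cases are quickly checked numerically. A small sketch, here using the convention $0 \cdot \log_2 0 := 0$ (instead of the non-zero offset trick used elsewhere in this tutorial):

```r
# Entropy helper with the convention 0*log2(0) = 0:
bit_entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))
# Full certainty => zero entropy:
bit_entropy(c(1, 0, 0))
# [1] 0
# Uniform distribution => maximum entropy log2(n):
bit_entropy(c(1/3, 1/3, 1/3))   # equals log2(3), roughly 1.584963
```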

Shannon then extends the scheme to two events **(slightly confusing notation ahead, but we will get over it!!):**

**Since the above is a little difficult to read, we will slowly go through the notation:** Basically, what we now reflect is not just one sign sent, as when calculating plain surprisal, but a joint of events, i.e., two signs, or a joint of transitioning events (our Markov chain): **a first and a second event.** Shannon denotes these possible events $i$ and $j$. **It gets a little confusing when the variables $x$ and $y$ are introduced, which refer to the probabilities of events $i$ and $j$, not just to the unknown/possible events themselves (without their probabilities, which is a little redundant).** So the first line above expresses the entropy of the joint occurrence of $i$ and $j$: $H(x,y) = -\sum_{i,j} p(i,j) \log_2 p(i,j)$.

We can now calculate the entropy of a *joint event as such,* i.e., the *average* *entropy* of *all possible joint events*, and not in relation to a particular event (a posterior). **We will do so** **via R and using our last Markov chain example.**

```
# AVERAGE JOINT ENTROPY H(x,y)
# Calculating the entropy just via the following will
# result in NaN, i.e., “Not a Number”, as we have 0 values
# within our TransMatrix (recall log(0) := undefined/Inf.!),
# see for yourself:
EntropyXY = -sum(jointMatrix*log2(jointMatrix))
# [1] NaN
# We will quickly write a function adding a tiny
# value to our inputs:
bit_log_nonzero = function(x) {
nonzerolog = log2(x+2^(-16))
} # End of function
# Let's try again using our Markov chain example:
EntropyXY = -sum(jointMatrix*bit_log_nonzero(jointMatrix))
```

**Next:** The entropy formula *below* is structurally the same as the entropy we encountered in the second part of this series, just that this time the weighting is not equivalent to the log input, **as we are looking at the *average* of a particular posterior event (the second event) amongst all possible joint events**; in other words: we are looking at the **entropy of a *conditional* event amongst all joint events**:

```
# CONDITIONAL ENTROPY H(y|x) (AMTC p. 12)
# Below we will work with the numbers of our last Markov chain example:
EntropyPOST = -sum(jointMatrix*bit_log_nonzero(MessageABCTransMatrix))
# [1] 0.9340018 = H(y|x)
```

The average entropy of a joint event can thus be decomposed as $H(x,y) = H(x) + H_x(y)$, where $H_x(y)$ is the conditional entropy $H(y|x)$ from above.

In Shannon’s words: the uncertainty (or entropy) of the joint event is the uncertainty of $x$ plus the uncertainty of $y$ when $x$ is known (AMTC, p. 12).

**Before we calculate the above,** let us first look at the average entropy, or expected surprisal, of a single event: $H(x)$ and $H(y)$.

The only difference between both of them is that we either take the log of the *sum* of the rows or of the columns of our joint matrix, i.e., we sum over all $j$ (via **rowSums()**) or over all $i$ (via **colSums()**), before taking the log, weighting it and summing over the joint.

```
# ENTROPY OF A SINGLE EVENT OF A JOINT:
EntropyX = -sum(jointMatrix*bit_log_nonzero(rowSums(jointMatrix)))
EntropyY = -sum(jointMatrix*bit_log_nonzero(colSums(jointMatrix)))
# ALTERNATIVE formula for EntropyXY
EntropyXY = EntropyX + EntropyPOST
```

**It can also be shown that:**

This **inequality** essentially states that one cannot have more information on only a single event than when having information on two events, here both $x$ and $y$; equality only holds if the two events are independent.

```
# The operator “<=” checks on "less than or equal"
EntropyXY <= EntropyX + EntropyY
```

We can also alternatively obtain the conditional entropy $H_x(y)$ by re-arranging the decomposition above:

```
# Alternative to obtain EntropyPOST
EntropyPOST = EntropyXY-EntropyX
```

From what we now know, it can be shown that $H(x) + H(y) \geq H(x,y) = H(x) + H_x(y)$.

Which leads to $H(y) \geq H_x(y)$ …

**… which in Shannon’s words means:**

“The uncertainty of $y$ is never increased by knowledge of $x$. It will be decreased unless $x$ and $y$ are independent events, in which case it is not changed” (AMTC, p. 12).

Translated, this just means that in the case of a steady state the conditional probabilities (posterior/likelihood) and the probability of a single event (model evidence / prior) are equivalent, and therefore so are their entropies; otherwise they are not. In general, the upper *inequality* also repeats the fact that the more I know (the more information I have), the less uncertain I will be (the lower the entropy!): conditioning on a *given* event can only reduce, never increase, the uncertainty.
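We can verify the inequality for our chain. A self-contained sketch, rebuilding the joint matrix from the steady state of the example above (the variable names `H_post` and `H_y` are just local stand-ins for EntropyPOST and EntropyY):

```r
# Rebuild the joint probability matrix p(i,j) = p(i) * p_i(j):
P <- matrix(c(.0,.8,.2,
              .5,.5,.0,
              .5,.4,.1), nrow = 3, byrow = TRUE)
v <- c(1/3, 16/27, 2/27)                  # steady state from above
joint <- v * P                            # row-wise weighting by p(i)
nzlog2 <- function(x) log2(x + 2^(-16))   # non-zero log helper from above
# Conditional entropy H(y|x) vs. marginal entropy H(y):
H_post <- -sum(joint * nzlog2(P))
H_y    <- -sum(joint * nzlog2(colSums(joint)))
H_post <= H_y
# [1] TRUE
```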

**At last, we will calculate the entropy of the source per symbol of text**, which Shannon defines as follows:

“Consider a discrete source of the finite state type considered above. For each possible state $i$ there will be a set of probabilities $p_i(j)$ of producing the various possible symbols $j$. Thus, there is an entropy $H_i$ for each state. The entropy of the source will be defined as the average of these $H_i$ weighted in accordance with the probability of occurrence of the states in question” (AMTC, p. 13):

```
# (AVERAGE) ENTROPY OF THE SOURCE, i.e., PER POSSIBLE SYMBOL OF TEXT
EntropySOURCE = -sum(eigenvector*MessageABCTransMatrix*bit_log_nonzero(MessageABCTransMatrix))
```

Adding a frequency to it would give us bits/second.

**We will leave the exploration of further parts of Shannon’s paper to yourself.** **You will also encounter reflections on optimal coding and compression, a negative feedback correction conception, further insights on redundancy and many more concepts.**

**We will close this chapter mathematically for now** and will look into more general reflections on information theory, how it was received and, most of all, how it is often misunderstood, in the next part of this series, discussing an essay / tutorial on AMTC written by Warren Weaver. We will then again discuss more general implications for the field of computational linguistics and how information theory plays a role today.

Before we do so, we will look into one last thing, the concept of **algorithmic information theory**.

As a closing remark, note that there is also the so-called *algorithmic* information theory, which essentially states that one could dramatically reduce the length of a transmitted message by sending a command with which the receiver can reproduce the message, given an equivalent program or an understanding of the command code. In practice this is fairly simple. Think of a message that may look like this:

0000000000000000000000001

This message entails 25 symbols, 25 bits (not in terms of entropy!). A command to produce such a message in R could in this case be **c(numeric(24),1)**, which only involves 16 symbols compared to 25. With such a technique, huge amounts of information can be sent or compressed by substituting them with commands. Mathematically reflecting the above involves the so-called **Kolmogorov complexity**, which stands for the length of the shortest program (command) which produces the upper message. The only case in which you cannot compress a message is when the length of the message and of the program are equal. It is hard to find an example, as even the series 1,2,3,4,5 can be compressed to **1:5**, at least in R. In general, such an incompressible message would involve something like a non-repetitive series of signs (no logical series such as **1:5** that can be represented via a loop!), all of them appearing only once, independently of each other.

In the next part of this series we will have a look at the paper “Recent contributions to the mathematical theory of communication” by Warren Weaver. There we will pick up our discussion on the different levels of communication and will have a look at how the general discourse proceeded in the second half of the 20th century. The next part will therefore be a lot about the conceptual side of information theory, how it was received and misconceived in the field of linguistics and philosophy, and how it got extended mathematically.
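The length comparison itself can be sketched in R. The message string below is reconstructed from the command c(numeric(24),1), and eval(parse(...)) is used only to demonstrate that the short command indeed reproduces the long message:

```r
# The 25-symbol message: twenty-four 0s followed by a 1
message <- paste(c(rep("0", 24), "1"), collapse = "")
nchar(message)                     # 25 symbols
# The 16-symbol command that reproduces it:
command <- "c(numeric(24),1)"
nchar(command)                     # 16 symbols
# Evaluating the command indeed yields the message's digits:
reproduced <- paste(eval(parse(text = command)), collapse = "")
identical(reproduced, message)
# [1] TRUE
```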
