Skip to main content
SearchLoginLogin or Signup

[R] Higher Spheres: Information Theory III: Using Markov Chains to Describe Language as the Result of a Stochastic Process

Intermediate to Advanced Stat-o-Sphere

Published onAug 18, 2023
[R] Higher Spheres: Information Theory III: Using Markov Chains to Describe Language as the Result of a Stochastic Process

Follow this link directly to part IV of this series, if you wish, or have a look at other tutorials of our collection Stat-o-Sphere.

Review: This is a pre-released article, and we are currently looking for reviewers. Contact us via [email protected]. You can also directly send us your review, or use our open peer-review functions via pubpub (create an account here). In order to comment mark some text and click on “Start discussion”, or just comment at the end of an article.

Fig. How to start a discussion. A pubpub account is needed to do so. Sign up here. Also have a look at articles all about our Open Science Agenda.

Pub preview image generated by Mel Andrews via midjourney.

Corresponding R Script:


Welcome to the third part of our series on information theory. You have again entered the intermediate to advanced Stat-o-Sphere.

In this tutorial we will mostly introduce the concept of so called Markov chains, which are also used for the so-called Markov Chain Monte Carlo Method (MCMC). Even though this tutorial discusses mainly information theory, it can also be mostly understood if you only aim for an understanding of the concept of a Markov chain for itself. If so, we recommend you to be at least familiar with Inferential Statistics I, that covers the topic of conditional probability / Bayes’ rule.

In any case, the basic ideas introduced in this tutorial are actually not that difficult, since we again (and again and again) will just follow and extend the basic idea behind Bayes’ rule and conditional probability.

In the previous part of this series, we discussed the relation between statistical thermodynamics (entropy in J/K) and information theory (bits). Instead of the Boltzmann constant kBk_B the constant for entropy/surprisal in bits/nats etc. is set to k=1k=1. As already discussed in the second part of this series, Shannon introduced information theory as a mathematical theory of communication not in the sense of a metaphor, but literally mathematically reflected actual communication via language. Again, Shannon was concerned with the a very basic level of communication, the transmission of signs, not about meaning. However, information theory should not be considered faulty in that resepect, Shannon actually made a point with this: We can communicate via language, despite transmitting actual meaning. Meaning is something we infer internally and individually — not randomly, but within our contingencies and based on experiences and ourselves as such (e.g., ourselves as human (our phenotype)). What we can influence or better, what gives us the intention / the drive to speak is to evoke a meaning in others via the attempt of transmitting signs.

In this part of the series, before further looking into liguistic implications of information theory, we will at first essentially expand the classic sender-receiver model by adding time as a conceptual factor (the change or transition of probabilities over time).

1 Markov Chains Describing the Process of Generating and Receiving Messages

Shannon used simple Markov Chains to demonstrate how the results of a source creating a message would look like, when understood as a stochastic process. Early in his paper “A mathematical theory of communication”, Shannon provides mathematical examples of how messages entailing symbols of natural languages are stochastic, as certain letters appear more often than others.

Fig. Graphical representation of the constraints on telegraph symbols. From AMTC, p. 4. Since a telegraph is nothing we are fimilar with today, we will start with letters and language in general right away. However, the above graph shows what the options of a message are, as if it were a loop path that the the source/receiver walks on when a message is read or sent (generated). Dependeing on the options and their frequency, a probability could be assigned to, e.g., moving from DOT to LETTER SPACE. We will get far less abstract soon.

The probability of the occurence of signs was also reflected in telegraphy, where common letters were encoded by sequences of dots and dashes that were shorter than of those letters that were less common (AMTC, p.5), which in essence saved time. In other words the capacity, i.e., the amount of information sent over time, rises.

The channel capacity of our receiving channel is also what we indirectly reflect on, when we see the bitrate of our downloads (recall 8 bits = 1 byte). The capacity is in essence a certain average number of bits per time (seconds) that is sent over a channel, e.g., 5 bits per second. This does not automatically mean that this is the channels maximum, as we know that the amount of actual information sent on another level of abstraction can vary. For example, if A = 0, B = 1 and X = 0011, then a message of six binary digits can carry different amounts of symbolic information: in binary digits the message AABBAA has an equal length to the message XAA ─ in consequence the channel capacity can vary. In such a case the weights, i.e., the probabilities will vary in a set of symbols.

So, what is a Markov chain now? A Markov chain is in essence a stochastic process that reflects on the probabilities of possible future events in relation to present events, or even past events, e.g., how likely it is that a B will be received (future), given that an A was sent before (present), and maybe a space was sent before that (past). When having prior knowledge about the probability of such a transition, one can represent the Markov chain as a graph and the process as walking along the graph, transitioning from event to event. In general, this sequential dependency of the probability of the occurrences of future events is called a Markov property.

There are also other and more complex forms of Markov processes. For example, in active inference you will come across the concept of so-called partially observable Markov decision processes (POMDP), based on the concept of Markov blankets. Here and in this tutorial in general, we will just get to know the most basic forms of a so-called Markov process, i.e., the Markov chain.

Graphs are commonly used to depict such a stochastic relation of events, in our case: symbols. Shannon starts with a simple 0th order approximation0^{th} \ order \ approximation of events, which means that each possible event is equiprobable, which is equivalent to saying: the events occur independent of each other. In other words, the probability of future events does actually not rely on a present event in this particular case, even though we can still speak of a transition from one event to another. Shannon’s first example below therefore represents a process that does not actually have a Markov property, but can still be represented by a Markov chain.

Suppose we have five letters A, B, C, D, E which are chosen each with probability .2, successive choices being independent. This would lead to a sequence of which the following is a typical example.


A B B D A E E C A C E E B A E E C B C E A D     (AMTC, p. 5)

The corresponding graph of the Markov chain looks like this.

Fig. Adjusted graph from AMTC, p. 8. A source that produces a sequence of letters with an equal probability of occurrence for every sign, which in the case of five possible signs is: p(x)=2p(x)=2.

Imagine starting at the middle. The graph represents the probability of walking along one of the five graph loops. Each time you do so a symbol is produced (by the source (you)). Walking along one of the loops is equivalent to drawing a letter from a cup (with return, due to independence), i.e., randomly sampling from the set Ω={A,B,C,D,E}\Omega = \{A,B,C,D,E\} .

The upper graph can also be represented via a so-called transition matrix.

Table. Transition matrix represented as probability table, corresponding to our previous graph.

The table can be read as follows: The probability of the event “E”(column) occurring after “A”(row) is 20% ─ and same goes for every other combination (so the matrix above actually consists of posterior values, as we will see). Let us represent our Markov chain in R.

First, install and load the R package “markovchain”:

# Install and load library:
# install.packages("markovchain") # uncomment this row and execute to install

# Now we set the possible events (message) and our transition matrix
MessageABCDE = c("A", "B", "C", "D", "E")

MessageTransitionMatrix = matrix(c(.2,.2,.2,.2,.2,
                                 nrow = 5,
                                 byrow = TRUE,
                                 dimname = list(MessageABCDE, MessageABCDE))

# This will define our actual Markov chain as a relation of 
# states/events to their respective transition probabilities
MCmessage = new("markovchain", states = MessageABCDE,
                byrow = TRUE,
                transitionMatrix = MessageTransitionMatrix,
                name = "WritingMessage")
# With this function you can evaluate a class of an object

# Now let us “walk” on the chain, sample:
markovchainSequence(n = 20, markovchain = MCmessage, t0 = "A")

# My output looked like this (compare with Shannon’s results):
# > markovchainSequence(n = 20, markovchain = MCmessage, t0 = "A")
# [1] "B" "D" "C" "C" "B" "A" "D" "B" "C" "E" "A" "E" "B" "D" "C" "B" 
# [16] "D" "B" "E" "C"

# Plot Markov Chain (very interesting result)
plot(MCmessage, edge.arrow.size = 0.1) 

The result of the plot(MCmessage…) function relies on a different interpretation of the situation in some way. Shannon’s graph is understood as one state in the middle representing the present output state of a source. The R plot argues every letter to be a different state. The symmetry of the geometric structure is trying to represent equal probability as equal distances of the lines of the graph.

Fig. R outputs the above graph, different to Shannon’s approach. In this form of presenting graphs the length of the lines correlate with the respective probabilities. Since all probabilities are equal, all lines have a equal length. In case you are supersticious — it is just math and signs/symbols are contingent/arbitrary anyways.

The second example by Shannon looks like the one above, just with different probabilities.

Using the same five letters let the probabilities be .4, .1, .2, .2, .1, respectively, with successive

choices independent. A typical message from this source is then:


E A D C A B E D A D D C E C A A A A A D   (AMTC, p. 5)

 Below you will again see the corresponding graph:

Fig. Original graph from AMTC, p. 8.

##### Second chain example:
MessageABCDE = c("A", "B", "C", "D", "E")
MessageTransitionMatrix = matrix(c(.4,.1,.2,.2,.1,
                                 nrow = 5,
                                 byrow = TRUE,
                                 dimname = list(MessageABCDE, MessageABCDE))

MCmessage = new("markovchain", states = MessageABCDE,
                byrow = TRUE,
                transitionMatrix = MessageTransitionMatrix,
                name = "WritingMessage")
markovchainSequence(n = 20, markovchain = MCmessage, t0 = "A")
# Output Sequence:
# > markovchainSequence(n = 20, markovchain = MCmessage, t0 = "A")
# [1] "A" "D" "C" "A" "C" "C" "C" "D" "D" "A" "A" "B" "D" "A" "A" "E"    
# [17] "C" "A" "E" "A"

# In the case of a MC without Markov property altern. code possible
tokens <- c("A", "B", "C", "D", "E")
probs  <- c(0.4, 0.1, 0.2, 0.2, 0.1)

sample(tokens, size = 20, replace = TRUE, prob = probs)
## [1] "A" "B" "A" "B" "D" "B" "C" "D" "A" "D" "C" "E" "A" "A" "C" "E" "C" "D" "C" "C"

I got the alternative code with the sample() function from the website “stackoverflow”. In case you don’t know stackoverflow: It is a great website to ask any question related to code. I was uncertain, if I had coded the right transition matrix to the graph above. Here is a link to my question I asked and the answer got (people usually answer ridiculously fast, though make sure you ask your question well, and clearly present your coding attempts and your goal).

Here is another examplatory Markov chain by Shannon:

Fig. Original graph from AMTC, p. 8

####### Third chain example:
MessageABC = c("A", "B", "C")
MessageABCTransMatrix = matrix(c(.0,.8,.2,
                               nrow = 3,
                               byrow = TRUE,
                               dimname = list(MessageABC, MessageABC))

MCmessageABC = new("markovchain", states = MessageABC,
                   byrow = TRUE,
                   transitionMatrix = MessageABCTransMatrix,
                   name = "WritingMessage") 
markovchainSequence(n = 20, markovchain = MCmessageABC, t0 = "A")

# Plot Markov Chain
plot(MCmessage, edge.arrow.size = 0.1) 

We can also determine our Markov chain’s stationary distribution. Before we define what a stationary distribution actually is and how it is calculated, let us first get an intuition on it by reflecting on walking on our graph over a longer time period, called several phases.

Fig. Close-up on the previous graph. Say, the starting state is A. What are your guesses what would happen with the given probabilities over time?

Starting from A, we see that the overall probability to move from A to B is much higher in our transition matrix AtoA_{t_o} at time t0t_0 = 0 than moving from A to C. This will result in a disbalance walking our graph over time, as some states are in general (or in average) more likely than others. This will subsequently change the transition probabilities related to the three states of our graph, i.e., the probabilities in our matrix are changed and changed further and further over time.

A stationary distribution (equilibrium) is then defined as a set of probabilities that doesn’t change anymore, i.e., stabilizes over time ─ or was even stable from the beginning as in our 0th approximation examples with equal probabilities above. In such a case all of the probabilities in every row and in the diagonal of the matrix are the same from the beginning and therefore indirectly constant over time. This “row distribution” is also called a steady state distribution, and represents an equilibrium in terms of thermodynamics (maximum entropy and minimum inner energy). What one obtains mathematically with calculating the stationary distribution is a so-called eigenvector that fills every row and the diagonal of a matrix (again, as in the 0th approximation examples).

Before we dig into linear algebra, let us first heuristically look at our chain after a couple of phases before it hits steady state, to see how the probabilities change (i.e., looking at the transition of the transition probabilities). The respective matrix can be heuristically acquired via the markovchain package, simply by “squaring” the chain in R (we will not literally square it in the mathematical sense; this is just the set convention for the command code in the package).   

# Steady state distribution - Heuristic method #

# Calculating steady state distribution:

# "Heuristic" method via the markovchain 
# The respective matrix that entails the 
# eigenvector can be heuristically acquired
# via the markovchain package, simply by 
# “squaring” the chain in R.
MCmessageABC^1  # Initial transition matrix, i.e., phase 1
MCmessageABC^2  # Transition matrix 2. phase
MCmessageABC^10 # approaching steady state 
MCmessageABC^28 # steady state probability (=row (eigen)vector)
# Execute this line of code to directly get the eigenvector

See how the values hit steady state very early?

Fig. Overview of the R output. See how the values hit steady state very early?

And it is just the decimal places that become more and more precise over time, so we could even stop earlier, depending on the requested precision of the estimation of the values.

Fig. Our well known Ms. Miranda, again utterly surprised what’s possible with probability theory. ^^ Image by Kevin Mueller.

Next we will discuss how to actually calculate and visualize the steady state distribution.

You now have two options!!! You either follow the linear algebra appraoch, which is much easier than it looks, since it is very visual and will give you robust knowledge on the topic. Or you procede with Shannon’s approach further below, concerning conditional probability / Bayes’ rule right away. Shannon’s way of calculating the steady state makes sense from a probabilistic perspective, but you will miss to gain a robust visual understanding of what a steady state / eigenvector is.

Fig. Vanessa Holmes, making you an offer. You have the option between two ways of calculating and reflecting the steady state distribution mathematically. First Method (Turtle Gear): Uses linear algebra to solve for the steady state distribution (eigenvector). This method is much easier than it looks, since it is a very visual approach. It jus takes a little longer. This part can also be looked at as further reflecting and extending methods we used in Inferential Statistics II, where we used a kind of pre-linear-algebra when finding the minimum for the intercept and the slope of a linear model. Second Method (Bunny Gear): This method corresponds to Shannon’s approach in AMTC. It relates linear algebra to conditional probability / Bayes’ rule. Either way this is the method you will have to look into anyways, since we will then discuss the entropy of Markov chains.

1.1 Basics in Linear Algebra — Method I to Obtain the Stationary Distribution

The calculation behind the function steadyStates() is done using linear algebra. We have encountered some basics in Inferential Statistics II, using the linear least squares method for solving a system of equations. Linear algebra can easily get complicated, but we will see that linear algebra can be well understood by visualizing the math and we are only going to look at very basic applications of it.

In this chapter we will gain all the very basics of linear algebra we need to calculate and understand the steady state, also since it is a good chance to introduce matrix multiplication and other mathematical concepts you will need to have an idea when approaching todays applications of linear algebra, e.g., in machine learning.

After this, we will look back into how Claude Shannon mathematically described the (changing) probabilities, resulting in equivalent results, by applying conditional probability / Bayes inference.

For those interested in linear algebra in general and to get more visualizations parallel to this tutorial, we recommend the overview by Grant Sanderson (3blue1brown), as linear algebra involves a lot of visualization that are not easy to instantly compute. I am also intentionally going to follow some parts of Grant Sanderson’s Tutorial on linear algebra in general here as well. Grant Sanderson is a mathematician working as a professional math tutor and also produces beautiful and very patient and calming videos on mathematics, physics and computer science for social media platforms.

Fig. The 3blue1brown youtube channel. Always on board are the above pi-creatures ^^

I definitely recommend checking out his videos in general, as they are both mathematically deep and profound, as well as explicitly directed to beginners seeking for an intuition, all coming together in a very appealing voice and elaborate visualization of mathematics. Grant Sanderson also shares all of his (python) code he used to produce the visualizations in his videos!

Linear algebra is in general all about vectors and vector fields in a coordinate system. We have encountered a lot of vectors before using the c() function, which we figured out is an abstraction of vectors in R and can be used to set up lists of any kind, i.e., any class of symbols, such as the list of characters c( “A”, “B”, “C”), but also in terms of a mathematical abstraction, e.g., in the form of a list that represents a set of possible dice values c(1, 2, 3, 4, 5, 6). We have also used probability vectors to represent our prior and our likelihood, similar to our transition probability matrix.

In linear algebra these “vector” lists are understood as a list entailing coordinate values. Linear algebra can itself be seen as an abstraction from geometry, where the elements in a vector can be used to determine a point within a geometric space (another use of linear algebra is solving a system of linear equations, see further below). A vector in this sense is then represented as an arrow that is pointing to that coordinate-point within a coordinate system, starting from an origin, i.e., a point with zero values on every given axis (basic concept; vectors can also be locatated on a trajectory in a field, e.g., in physics).

Another way of understanding a vector is by thinking of the values as command to first walk xx steps on the x-axis and then the yy steps on the y-axis, in order to get to the point with respective coordinates (xy)(x|y).

Here is an example of the denotation of a vector located in two dimensions:

Here is the respective R code to plot the vector within a coordinate system:

## Set up a function that prints an empty coordinate system
# Note: no input for this function necessary!
grid2D = function() {
  # Let us first plot an empty plot, where x and y
  # range from -5 to 5. For this we will adjust the plot a little:
  # These vectors represent a list of tick mark values, see below
  posX = c(-25:25) # “:” states from x to y, abbreviates process
  posY = c(-25:25)
  # Plot empty plot (asp = 1 for steady ratio)
  plot(NA, xlim=c(-5,5), ylim=c(-5,5), xlab="X", ylab="Y", asp = 1, 
       axes = TRUE) # axes removes pre-set axes
  abline(h = 0,lty = 1) # adds horizontal line at ylim = 0 
  abline(v = 0,lty = 1) # adds vertical line at xlim = 0
  abline(h = posY, lty=3) # adds grid both h and v
  abline(v = posX, lty=3)
} # End of function

# Execute to get blank coordinate system the easy way.
# This is necessary as one cannot delete arrows from a plot,
# one has to re-run all lines of code in such cases

# Add arrow / vector in terms of (geometric) linear algebra. 
# x0/y0 = origin, x1/y1 = point to draw arrow to
v = arrows(x0 = 0, y0 = 0,x1 = 2, y1 = 2, length = 0.1, lwd = 2)

# Add text. T for transpose, i.e., indicating that the vector was 
# flipped to the side and still represents a column vector, 
# not a row vector.
text(x=2.5,y=1, label="v = [2, 2]T", srt = 5) 

Fig. Plot of our vector [2 2].

Let us now try to add vectors together, to see what happens.

#### Adding vectors together:
grid2D() # get fresh grid
# Calculation of v1+v2 = v3
v1 = c(2, 2); v2 =  c(2, -2); v3 = v1+v2

# Plot result
v1 = arrows(x0 = 0, y0 = 0,x1 = 2, y1 = 2, length = 0.1, lwd = 2)
v2 = arrows(x0 = 0, y0 = 0,x1 = 2, y1 = -2, length = 0.1, lwd = 2)
v3 = arrows(x0 = 0, y0 = 0,x1 = 4, y1 = 0, length = 0.1, lwd = 2)
text(x=1,y=2, label="v[1]", srt = 5) 
text(x=1,y=-2, label="v[2]", srt = 5) 
text(x=3,y=.5, label="v[1]+v[2]=v[3]", srt = 5)

This process of adding vectors together can also be understood as adding a vector upon another in the sense that, e.g., the “endpoint” of v1\overrightarrow{v_1} is considered as the origin of v2\overrightarrow{v_2}. See plot below.

# Execute the following lines; adds to the previous plot
v2add = arrows(x0 = 2, y0 = 2,x1 = 2+2, y1 = 2+(-2), length = 0.1, 
              lwd = 2, col = "lightblue")
text(x=3.5,y=1.5, label="v[2]", srt = 5, col = "lightblue")

Fig. Example vector addition. The addition resulted in a vector v3\overrightarrow{v_3}  located between v1\overrightarrow{v_1}  and v2\overrightarrow{v_2} and with an extended length.

Scalar multiplication, i.e., a single value multiplied with a vector/matrix, is easier to understand as the above, since it just results in a change in the length of a vector. Given our vector v2=22\overrightarrow{v_2} = \begin{matrix} 2 \\ -2 \end{matrix}  and a scalar λ:=lambda\lambda := lambda with a value of, e.g., λ=2\lambda = 2 , a multiplication can be achieved by multiplying each vector value with the scalar value. We will call the resulting vector v4\overrightarrow{v_4}.

Add the new vector to upper plot via:

#### Scalar multiplication
v4 = arrows(x0 = 0, y0 = 0,x1 = 2*2, y1 = 2*(-2), length = 0.1,   
            lwd = 2, col = "lightgreen")
text(x=3,y=-4, label="v[4]", srt = 5, col = "lightgreen")   

The resulting vector is stretched. In the case of a fraction as value of λ\lambda , e.g., λ=12\lambda = \frac{1}{2} , the vector would be squished, i.e., the length would be shortened. This is called scaling ─ therefore the term scalar (again, check out Grant Sanderson’s introduction on linear algebra for more visualizations of linear algebra). Negative values flip the vectors direction by 180°, which is reasonable thinking of vector values as coordinates of a point.

Next up is the so-called linear transformation, where we will introduce the concept of a matrix in linear algebra. Again, same as with vectors, we have used matrices before in the form of a list or a table in other tutorials. You have seen abstractions of matrices in the form of a list of values of a probability table (CPT), a list of character values (e.g., some text) or in the form of our transition matrix, drawing a sequential relation between two letters (the given and future letter) given its transition probability, where the rows sum up to 1…

The latter concept of a transition matrix regarding our Markov chain can be understood as a list of vertical row probability vectors, where a relation is always drawn between one given letter and three possible future letters (columns), expressing values of a probabilistic tendency to transit towards each of the letters ─ so the row probabilities always represent the relation of one to a whole of possible states (resulting in a sum of 1).

We also have encountered matrices above where the row vectors are all the same and figured that the stationary probability is equivalent to a row vector and the diagonal of a matrix that hit steady state. If this is not initially the case, we found out that the whole matrix, i.e., all of the values change in a way over time that they converge to a state where all row vectors and the diagonal become equivalent (equilibrium steady state). Mathematically this process involves a vector / matrix multiplication, which is also called linear transformation.

In essence think of a linear transformation as the process of a function f(v)f(\overrightarrow{v}) that has a vector v1\overrightarrow{v_1} as input and gives us a vector v2\overrightarrow{v_2} as output. In order to understand this process in terms of our geometric representation in a 2D state space, think of the 2D grid as a whole being morphed in a linear way, which means that it involves squishing and stretching of vectors only (no curving), and the origin remains the same (compare with Grand Sanderson). The whole grid can here be understood as every possible vector in a field (our 2D grid), which are all changed in a certain way by a function f(v)f(\overrightarrow{v}). Let us first look at our grid and four vectors within that space. In order to ease the process, let us write a function that plots vectors of the kind v = c(x,y) for us.

# Function to plot vectors as arrows(x), including optional 
# color option (y).
plotvec = function(x,y){
  if (missing(y)){
    vector = arrows(x0 = 0, y0 = 0,x1 = x[[1]], y1 = x[[2]], 
                    length = 0.1, lwd = 2)
    vector = arrows(x0 = 0, y0 = 0,x1 = x[[1]], y1 = x[[2]], 
                    length = 0.1, lwd = 2, col = y)  
} # End of function

# Vector for each quadrant of the coordinate system
v1 = c(2,2);v2 = c(2,-2);v3 = c(-2,-2); v4 = c(-2,2)

We are now going to introduce two important vectors of a transition matrix, the so-called base vectors or coordinate vectors. These are vectors that lie on the coordinate base, i.e., either directly on the x- or on the y-axis. With changes in these two vectors, we will be able to transform any vector in the vector space, i.e., the space of possible vectors, which in two dimensions represents the grid itself, considered as a plain of possible vector points. Let us plot the base vectors of the coordinate system:

# Base vectors
ibase = c(1,0); jbase = c(0,1)
plotvec(ibase); plotvec(jbase)

Every “regular vector” can now be understood as a scaled change in the base, such that:

Another way of understanding the concept of a change in the base is seeing it as a way to transform vectors via a vector-matrix multiplication into a nother position: First, we will look at a situation, where the base is actually not changed. This can be achieved by multiplying the vector  via a s so-called identity matrix that entails the base vectors of our coordinate system (we used it above as well!):  I=[1001]I= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} . We can see that the diagonal values are equivalent (more details on II will follow below). Multiplying a vector with a matrix is equivalent to the process we have gone through above. The below method used for multiplying a vector with a matrix is also referred to as taking the dot product:

It is important to note here that we are actually multiplying the vector values with the row values and not the column values, even though the columns represent the base vector coordinates ii and jj ─ and due to the symmetry of II it wouldn’t matter if multiplied by either column or row in this case. To still make sense of that rule: xx is multiplied with the xx-values of both base vectors and yy is multiplied with the yy-values of both base vectors. A general formula for a linear transformation could thus look like:

# Identity matrix
matrixI = matrix(c(1,0,
                   0,1), ncol = 2)
# Alternative shortcut function for the identity matrix: 
matrixI = diag(2) # where 2 refers to a 2x2 (square) matrix dimension

# v1 = c(2,2)
vTrans = v1%*%matrixI

Now imagine that the base vectors are actually changed, i.e., the xx and yy values differ from the identity matrix, and how this would result in a change in all the possible vectors of a vector space, e.g., our v1\overrightarrow{v_1}. Let’s try this out with our grid above and let us change the upper four vectors in a way that the whole vector space is “sorft of rotated” (no actual rotation, which involves a rotation matrix). We will change our base vectors and all the corresponding vectors v14\overrightarrow{v_{1-4}} respectively.

# Changed base vectors
ibase = c(2,0); jbase = c(0,2)
# Here is a function doing the transformation for us.
# The function adds the base vectors together to a matrix.  
vectrans = function(vector,ibase,jbase){
  matrixTrans = matrix(c(ibase,jbase),ncol=2)
} #End of function

v1trans = vectrans(v1,ibase,jbase)
v2trans = vectrans(v2,ibase,jbase)
v3trans = vectrans(v3,ibase,jbase)
v4trans = vectrans(v4,ibase,jbase)
# Add to plot:

Let us now change ibase and jbase in a way, so everything is “rotated”. Again, the math below is actually not performing a real rotation, as this would involve a so-called rotation matrix (incl. sinsin and coscos), which we will not need to go into. The below vectors will still be rotated, but will not have an equal length (i.e., rotated and stretched as well).  

#### "Pseudo-Rotating" the vector field
# original ibase/jbase
ibase = c(1,0); jbase = c(0,1)
plotvec(ibase," lightblue");plotvec(jbase," lightblue")
v1trans = vectrans(v1,ibase,jbase)
v2trans = vectrans(v2,ibase,jbase)
v3trans = vectrans(v3,ibase,jbase)
v4trans = vectrans(v4,ibase,jbase)

# "Pseudo-Rotating" both base vectors 45° 
ibase = c(1,-1); jbase = c(1,1)
v1trans = vectrans(v1,ibase,jbase)
v2trans = vectrans(v2,ibase,jbase)
v3trans = vectrans(v3,ibase,jbase)
v4trans = vectrans(v4,ibase,jbase)

Fig. Plot of our original vectors (orange) with original base vectors (light blue), including the changed base (blue) and the respective transformed vectors (green). It can be considered a pseudo-rotation, since the length of the vector changes and we did not use a rotation matrix (which entails sin/cossin/cos).

Here is also a short example for a matrix-matrix multiplication ─ just for completion: A 2x3 matrix is multiplied with a 3x2 matrix. The inner values are equivalent, so multiplication is possible, the outer values of the dimensions provide information on the dimension of the resulting matrix ─ 2x2:

# Matrix matrix multiplication

# Matrices:
matrix_A = matrix(c(1,2,3,4,5,6), 
                  ncol=3, nrow = 2, byrow = TRUE)
matrix_B = matrix(c(1,2,3,4,5,6), 
                  ncol=2, nrow = 3, byrow = FALSE)
# Use %*% for matrix multiplication instead of *

At last, we will look at linear algebra from the perspective of linear equations ─ or more specific: a system of linear equations (with one or more unknowns). As it involves plain linear equations of the form x+2=0 x+2 = 0, it will not be too difficult to understand, promise (you probably already know about it). With this we will finally know all the linear algebra we need to understand how the stationary probability is calculated. We are going to use the example we used in Inferential Statistics II

In general, systems of equations are used to solve a wide range of very basic questions in science. We will start with a very simple example that most of us probably have formally encountered in riddles or so, without thinking of a system of linear equations as such:

FURRY EXERCISE: Two cats, Mr. Midnight and Jurassica Parker, are together 15 years old. Their age deviates by 7 years ─ Jurassica Parker being the older sister of Mr. Midnight. How old is each of them?

Fig. Jurassica Parker and Mr. Midnight (don’t be fooled by the young looks of Jurassica, when solving the riddle ─ it may be a trap). Original photos by Avel Chuklanov and Dominika Roseclay.

We can solve this equation by substituting y=x+7y = x + 7 within x+y=15x + y = 15.

Another way of solving our exercise is via linear algebra: we can write the left side of the equations as a matrix and the right side as a vector, resulting in: [x1+y1x2+y2]= [157]\begin{bmatrix} x_{1} + y_{1} \\ {- x}_{2}{+ y}_{2} \\ \end{bmatrix} = \ \begin{bmatrix} 15 \\ 7 \\ \end{bmatrix}, which can be abstracted in R to: [1111]= [157]\begin{bmatrix} 1 & 1 \\ - 1 & 1 \\ \end{bmatrix} = \ \begin{bmatrix} 15 \\ 7 \\ \end{bmatrix}. We will now use R to solve this system of equations:

# Set up two elements, so that we can solve an
# equation in the form: left = right
left = matrix(c(1,1,
               -1,1), ncol = 2, byrow = TRUE)

right = matrix(c(15,7), ncol = 1, byrow = TRUE)

# Use the solve function to solve the system of
# equations. It outputs a vector list = c(x,y)
ResultXY = solve(left,right)

1.2 Calculating the Steady State Distribution via Linear Algebra   

In linear algebra the stationary or steady state distribution is stating that there is an (eigen-) vector v\overrightarrow{v} that when multiplied with the transition matrix delivers equivalent results to just multiplying v\overrightarrow{v} with a scalar λ=1\lambda = 1 (we will look at the equation below). A scalar, again, is a single number value that can be used to, e.g., change the length and ─ when negative ─ also the direction of a vector. Here the scalar is also called an eigenvalue of an eigenvector.

The eigenvector can be, e.g., understood as an arrow in a plane within a 3D state space that stands for or lands on the axis of a rotatable plane and therefore doesn’t change when the plane is changed. Thinking of this being in a 2D space is a little more abstract. In general, finding an eigenvector also involves the possibility of squishing a plane down to an unchangeable vector (arrow), i.e., the eigenvector ─ so again: no matter if you try to change the eigenvector itself by a transition matrix, the eigenvector will remain the same (I recommend again looking into Grant Sandersons videos for visualizations on this). Mathematically this, again, means: multiplying an initial transition matrix with an eigenvector will again result in that eigenvector.

Fig. It’s of the map! ^^ Image by Kevin Mueller.

Here is a formula commonly used to evaluate the steady state eigenvector:

What we see in the upper equation is a vector/matrix multiplication on the left-hand side of the equation, i.e., a vector multiplied with a matrix, and a scalar multiplication on the right-hand side of the equation, a single value for λ\lambda with a vector.

First let us prove that the above equation is true. Execute the following:

# Quick check if vA=lv is true:
vA = steadyStates(MCmessageABC)%*%MessageABCTransMatrix
lv = 1*steadyStates(MCmessageABC)
# Test for near equality (floating point comparison)

Note that solving the upper equation in order to obtain the eigenvector v\overrightarrow{v} involves using a factorized identity matrix for λ\lambda, e.g., in the form of the following 3x33x3 identity matrix, factorized by λ=1\lambda=1:

This converts the scalar into a matrix, so we can re-arrange the whole system of equations freely. The diagonalization with the respective zero values, can be read as only addressing each event of transitioning from,e.g., AAA \curvearrowright A, BBB \curvearrowright B and CCC \curvearrowright C, which will hopefully make the concept of an identity matrix graspable from what we learned before on Markov chains. Note again that the values within that identity matrix are the eigenvalues of our eigenvector, so they do not actually refer to probabilities of a transition, but can for itself be looked at as transitioning from AAA \curvearrowright A etc. with a probability of 100%. 

Filling in all elements we have gained from our last Markov chain example in the upper equation will result in: 

The TT stands for transposed, which essentially means that the vector is flipped to the side, as our steady state actually represents a row vector. One can use TT  formally for vectors in representations, in order to save space (transposing a matrix is as simple, though). We could now again apply our substitution technique several times, until we calculate each xix_i, also by using the fact that v=1\sum \overrightarrow{v} =1 . The latter results from our use of probability row vectors, representing the probability of, e.g., transitioning from the letter AA to AA, to BB or CC with values ranging from 0 to 1.

We will not substitute, but go a different path, re-arranging our upper equations in order to fit it into a system of linear equations in the form of Ax=bAx = b, where xamnxa_{mn} and bmb_m are column vectors again, which refers to our previous use in both, geometry and linear equations: the matrix consists of base (column) vectors, and the overall structure is that of a system of linear equations with results on the right-hand and the unknowns on the left-hand side of the equation. Turning rows into columns and vice versa can again be achieved by the transposing technique, essentially flipping the vector by 90°, as introduced above. We will start with re-arranging the equation (compare these videos and articles: a), b), c) we partially followed).

Now that we have our identity matrix set up to convert our scalar product, we can move the right side of the equation to the left:

As we want to obtain our steady state distribution, i.e., a row vector, via a system of linear equations using column vectors, we transpose the matrix (AλI)(A-\lambda I), as well as our row vector and pro forma 0 as well, i.e., turning all rows into columns and vice versa:

So structurally: (AλI)T=A(A-\lambda I)^T = A, vT=x\overrightarrow{v^T}=x  and b=0Tb=0^T within Ax=bAx = b. Unfortunately, this assumption does not hold in our case: Computing our Ax=bAx = b in R will lead to an error indicating that the system of equations cannot be solved. In essence, to us this just means that we need to include more information into the whole calculation in order to be able to solve our system of equations. Mathematically though this indicates that our system of equations is so far actually not a system of equations :O

Here is a quick example of how the results can look like in such a case: Given is a system of equations, where x+y=0x+y =0 is being one of the equations. Let us say, we have solved our set of equations already (by hand! R would give an error) and obtained the values x=1x= 1 and y=10y = 10. Inserting these values into the equation x+y=0x+y = 0 will results in 11=011=0, which indicates that this is not an equation, i.e., 11011 \neq 0, since ‘equation’ means: the left-hand side is equal to the right-hand side of the equation, i.e., x=xx=x. The equation above, x+y=0x+y=0, would hold in the case of, e.g., x=10x=10 and y=10y= -10.

In order to be able to compute our system of equations, we have to do some adjusting and add the fact that v=1\sum \overrightarrow{v} =1  (I have mostly followed this tutorial to perform all of the below in R).

##### Re-arranging matrix, to solve it:
# We will use a simplified form of our transition matrix
# without renamed columns and rows c("A", "B", "C")
MessageABCTransMatrix = matrix(c(.0,.8,.2,
                               nrow = 3,
                               byrow = TRUE)

Here we will add the fact that the rows of the matrix sum up to 1, which in the case of our transposed matrix is done by adding another row (!) via rbind() which combines rows. Also note that the function t() := transpose, and diag(dimension) = II.

# Here we will add the fact that the rows of the matrix sum up to 1,
# which in the case of our transposed matrix is done
# by adding another row (!) via rbind() which combines rows.
# Also note that the function t() := transpose, and diag(x) = I
A = rbind(t(MessageABCTransMatrix-diag(3)), c(1,1,1))

The transposed rows, i.e., columns will not actually sum to 1, as we have subtracted II from AA, resulting in a sum of zero on the columns and a sum of 1 on the rows in the below matrix (just a side effect of the trick we want to perform):

# The transposed rows, i.e., columns will not actually 
# sum to 1, as we have subtracted I from A,
# resulting in a sum of zero on the columns
# and a sum of 1 on the rows in the below matrix
# > A
#      [,1] [,2] [,3]
# [1,] -1.0  0.5  0.5
# [2,]  0.8 -0.5  0.4
# [3,]  0.2  0.0 -0.9
# [4,]  1.0  1.0  1.0

We will also represent our bb, which, recall b=0b=0! Here is our re-arranged equation again:

As our vector has to have the same number of rows as the matrix, we will also add a 1, which, again, just represents our extra information of sum(row)  ─ even though the transposed row, i.e., the column of  actually sums to zero, due to the subtracted identity:

# We will also represent our b, which, recall b = 0
# As our vector has to have the same number of
# rows as the matrix, we will also add a 1, which, again,
# just represents our extra information of sum(row)=1 ─
# even though the transposed row, i.e., column of A
# sums to zero, due to the subtracted identity 
b = c(0,0,0,1)

Our whole system of equations will look like the following, using the cbind() function (combines by column):

# Our whole system of equations will
# look like the following, using the
# cbind() function (combines by columns)

# > cbind(A,b) 
#                     b
# [1,] -1.0  0.5  0.5 0
# [2,]  0.8 -0.5  0.4 0
# [3,]  0.2  0.0 -0.9 0
# [4,]  1.0  1.0  1.0 1

Solving this system of equations via the function solve(A,b) is unfortunately still not possible as our matrix is not a square matrix :D We can still solve our system of equations using the so called QR-algorithm via qr.solve(). This algorithm is quite something to do by hand, so I will not go into it in detail. However, it can handle our issues via a special work around algorithm, so we will finally obtain our steady state vector via linear algebra:

# Use QR-algorithm to solve the system of equations
# > qr.solve(A,b)
# [1] 0.33333333 0.59259259 0.07407407

# Let us see if our steadyStates(x) and qr.solve(a,b) is equal
# [1] TRUE

1.3 Summary and Shannon’s (Bayesian) Formalism in AMTC - Method II to Obtain the Stationary Distribution

Looking back into AMTC on p. 5f., we see that Shannon uses a different and much simpler formalism, which relates linear algebra in the form of our Markov process also to our well-known conditional probability:

Fig. Transition matrix, steady state vector and relative frequency matrix (AMTC, p. 6). Same values as we got above, just represented as fractions.

Let us go through all of them again using Shannon’s formalism, ultimately summarizing what we learned so far:

 Transition matrix:  denotes the probability pi(j)p_i(j) of the idthi^{dth} symbol transitioning to the jdthj^{dth} symbol. This is essentially in Bayesian terms a posterior (p(ji)p(j|i)), denoting the probability of jj, given ii — Shannon is just using a different denotation formalism. Another way of formalizing pi(j)p_i(j) would thus be:

If it seems unclear why the posterior is already given, think of it as already being inferred and as something that is updated over time (heuristically via MC^n) ─ until a (equilibrium) steady state is achieved.

SIDE NOTE: In case letters are to abstract, one can also look at it from a temporal perspective, which is often done when using weather forecasting as an example for a Markov chain: The posterior probability of rain tomorrow, given that it is sunny today could be denoted by:

Steady state vector: the steady state vector is technically an eigenvector with its respective scalar / eigenvalue of 1 that relates ii to “ii over time” (phases), which can be understood, as we will see, as a joint between ii and jj that is summed out over all possible jj or ii─ which in other words results in the model evidence p(j)p(j) or the prior p(i)p(i). So in other words (with all the linear algebra asside) the prior will be — in the case of a possible equilibrium — equivalent to the steady state vector p(i)p(i) or p(j)p(j). Before we dig in deeper into the steady state, let us first look at the relative frequency matrix:

Relative frequency table, i.e., joint probability matrix: This represents our joint probability as a matrix or table for each ii in relation to each jj. Equivalent to what we learned about Bayes, the joint probability can be calculated by either multiplying the prior with the likelihood, or the posterior with the model evidence. Before we were mostly working with complements of AA and BB, here we have three different states. However, in this case the joint is technically obtained by multiplying a given posterior (transition matrix), here denoted pi(j)p_i(j), with the model evidence ─ the latter being p(j)p(j) (sum over all possible ii and the eigenvector / steady state vector).

Before we can go any further: Below you will find a function that calculates the relative frequencies, i.e., joint probability for us. The first part sets up a list, which will in the end be a matrix. The nested for-loop (loop within a loop) multiplies p(j)p(j) with pi(j)p_i(j), i.e., each value of the steady state vector with the respective row of the transition matrix (not via a dot-product!), i.e., the posterior is multiplied with the model evidence p(j)p(j)/p(i)p(i). The calculation is actually really simple, as you will see, though the function below can handle different classes of input, e.g., our qr.solve() output that for itself makes trouble when considered as numeric vector.  

#### Relative probabilities (joint probability matrix) function:
relProbs = function(Steady,TransMatrix){
  # This will set up an empty matrix list:
  relProbsMatrix = vector("list",   
  dim(relProbsMatrix) = matrix(c(nrow(TransMatrix), 
  # Nested for-loop (loop within a loop)
  # 1:length means: sequence of 1 to max length (=nr of elements):
  for (i in 1:length(Steady)){ 
    for (j in 1:ncol(TransMatrix)){
      relProbsMatrix[[i,j]] = Steady[[i]]*TransMatrix[[i,j]]
  # Set output as matrix (otherwise problems occur for colSums())
  # as.numeric(), as.matrix() etc. changes the class of an object:
  relProbsMatrix = as.numeric(relProbsMatrix) 
  # Output is a vector. dim() adjusts dimension back to a matrix
  # with the square root "sqrt()" of the length as nr. of c(rows,cols)
  # (can be done this way due to the fact that a square matrix is given)
  dim(relProbsMatrix) = matrix(c(sqrt(length(relProbsMatrix)),
} # End of function

# We can now use both qr.solve(A,b), or steadyStates(MCmessageABC) 
# for the first input “Steady”:
jointMatrix = relProbs(qr.solve(A,b),MessageABCTransMatrix)

As mentioned, the sum of ii over all jj and the sum of jj over all ii are equivalent ─ technically this means the row sum and the column sum of the joint probability matrix are equivalent ─ which again is due to the steady state we are in, resulting in p(i)p(i) and p(j)p(j) both representing the eigenvector. Shannon expresses the condition for stationary probability more mathematically:

If p(i)p(i) is the probability of state ii and pi(j)p_i(j) the transition probability to state jj, then for the process to be stationary it is clear that the p(i)p(i) must satisfy equilibrium conditions: (AMTC, p. 10)

# Row and column sums for p(i) := model evidence and p(j) := prior.
# They are equivalent to our steady state (can be seen as its 
# definition in the language of probabilities, as it entails the 
# notion of steadiness, i.e., “not changing” or “no difference  
# between past, present or future”):
pi = rowSums(jointMatrix)
pj = colSums(jointMatrix)

# Corresponding to the formula:
pi = rowSums(pi*MessageABCTransMatrix)

Looking back to our initial heuristic approach to obtain the steady state, we saw that the eigenvector will fill every row of the transition matrix over time (phases), and can therefore also be found in the matrix’ diagonal as well.

# Looking back at initial heuristic approach:
# Result:
#           A       B       C
# A 0.3333333 0.5925926 0.07407407
# B 0.3333333 0.5925926 0.07407407
# C 0.3333333 0.5925926 0.07407407

Fig. The eigenvector as shown in AMTC. It is equivalent to our results in R, it is just represented in fractions.

Note that the “diagonal eigenvector” structurally implies an aspect of identity, as, e.g., BtB_t to Bt+1B_{t+1} has a probability of 1627\frac{16}{27} ─ but also CC to BB and AA to BB, which extends the identity and relates the eigenvector to the result of a sum over all possible ii as well (our technical prior p(i)p(i)).

This means that our initial dependencies, i.e., the dependency of the probability of the future state by its present/given state vanishes, so that prior and posterior essentially become the same for every possible ii preceding a particular jj and every jj preceding a particular ii.

We can also again compare a steady state with a 0th approximation0^{th} \ approximation , i.e., and independency / equilibrium, just that this time it settled in over time and was not given from the start (compare with the transition matrices of Shannon’s first initially independent chain examples above!).

All of this again translates to what we learned from linear algebra: For an eigenvector it does not matter, if it is just scaled by 1, i.e., relating to itself, or if it is related to another state in terms of our transition matrix. Note that all of this also holds for different phases, which is pretty crazy to discover:

# Using linear algebra again, now with the tranisiton matrix of 
# phase 5. I didn’t find a way to change the class of the 
# output of: 

# ... so I had to create the matrix myself with the respective 
# values, taken from the console output (a little meassy but still worked!):
phase5 = matrix(c(0.31250, 0.60648, 0.08102,
                  0.34375, 0.58565, 0.07060,
                  0.34375, 0.58564, 0.07061), ncol = 3, byrow= TRUE)

A = rbind(t(phase5-diag(3)), c(1,1,1))
b = c(0,0,0,1)

# Let us check if our assumption vA=v holds for phase 1 and the 
# exemplary phase 5, using a shortcut. The shortcut relies on a 
# shortcut to build a joint probability matrix by just multiplying
# TransMatrix*eigenvector, which does the same as relProbs(), 
# though previous output classes requested some additional coding 

# Set eigenvector, here I retrieved it from the previous 
# jointMatrix: 
eigenvector = rowSums(jointMatrix)
# Check if v=vA holds for different phases:
INITphase1 = rowSums(MessageABCTransMatrix*eigenvector)
AFTERphase5 = rowSums(phase5*eigenvector)
# Note that we could also implement diag(length(eigenvector)) from 
# the actual formulas above, but it does not make a difference.

Again, we recommend to whatch a video on this, since it also makes a lot of sense when watching transitions being animated!

2 Different Forms of Entropy

We can now calculate several different forms of information theoretic entropy. Before we have argued the (Boltzmann) entropy to be the negative log model evidence. However, we can calculated different forms of entropy that refer to different kinds of probabilistic relations, as we will see.

The following picks up where we ended in the second part of second part, namely p. 11 in AMTC, where Shannon calculated the suprisal / entropy and the maximum entropy of a single sign sent. He then goes on discussing the following insights:

Fig. AMTC p.11.

At first he mentiones the special case of the entropy being always greater or equal to zero (also goes for surprisal) and only being zero in the case of full certainty (p(x)=1p(x) = 1). Then he describes what we also discussed in part II of this series: given equal probability, the formula reduces to log(n)log(n) and it is also where we encounter maximum entropy (e.g., given 1 sign sent the maximum entropy of in this case 1 will be at p(x)=.5p(x) = .5).

Fig. Maximum entropy of one sign sent is given at p(x)=.5p(x)=.5 with a probability ranging from 0 to 1.

Shannon then extends the scheme to two events (slightly confusing notation ahead, but we will get over it!!):

Since the above is a little difficult to read, we will slowly go through the above notation: Basically, what we now reflect is not just one sign sent, as when calculating plain surprisal, but reflecting a joint of event, i.e., two signs, or a joint of transitioning events (our Markov Chain): a first and a second event. Shannon denotes these possible events xx and yy. It gets a little confusing when the variables ii and jj are introduced, which refer to probabilties of events xx and yy, not just to the unknown/possible events themselves (without theor probabolities — a little redundant). So what the first line above says: H(x,y)H(x,y) is the joint entropy of two events xx and yy (first and second event) and this entropy is relfected via the probabilities of those events, ii for the first and jj for the second event (again, a little confusing, I know, but a good exercise for reading mathematical notation).

We can now calculate the entropy of a joint event as such, i.e., the average entropy of all possible joint events, and not in relation to a particular event (a posterior). We will do so via R and using our last Markov chain example.

# Calculating the entropy just via the following will
# result in NaN, i.e., “Not a Number”, as we have 0 values
# within our TransMatrix (recall log(0) := undefined/Inf.!),
# see for yourself:
EntropyXY = -sum(jointMatrix*log2(jointMatrix))
# [1] NaN

# We will quickly write a function adding a tiny
# value to our inputs:
bit_log_nonzero = function(x) {
  nonzerolog = log2(x+2^(-16))
} # End of function

# Let's try again using our Markov chain example:
EntropyXY = -sum(jointMatrix*bit_log_nonzero(jointMatrix))

Next: The entropy formula below is structurally the same as we have encountered the entropy before in the second part of this series, just this time the weighting is not equivalent with the log input ─ as we are looking at the average of a particular posterior event (second event) amongst all possible joint events, in other words: we are looking at the entropy of a conditional event amongst all joint events: H(yx)H(y|x).

# Below we will work with the numbers of our last Markov chain example:
EntropyPOST = -sum(jointMatrix*bit_log_nonzero(MessageABCTransMatrix))
# [1] 0.9340018 = H(y|x)

The average entropy of a joint event H(x,y)H(x,y) can also be calculated via:

In Shannon’s words “the uncertainty (or entropy) of the joint event (x,y)(x,y) is the uncertainty of xx plus the uncertainty of yy when xx is known” (AMTC, p. 12), the latter being our posterior H(yx)H(y|x) (or Hx(y)H_x(y)).

Before we calculate the above, let us firt look at the average entropy or expected surprisal of a single event xx or yy within a joint (x,y)(x,y) (they look complicated, but still follow our known formalism, as we will see!):

The only difference between both of them is that we either take the log of the sum of the rows or of the columns of our joint matrix, i.e., we sum over all jj (rowSums) or over all ii (colSums), before taking the log, weighting it and summing over the joint.  

EntropyX = -sum(jointMatrix*bit_log_nonzero(rowSums(jointMatrix)))
EntropyY = -sum(jointMatrix*bit_log_nonzero(colSums(jointMatrix)))

# ALTERNATIVE formula for EntropyXY
EntropyXY = EntropyX + EntropyPOST

It can also be shown that:

This inequality essentially states that one cannot have more information on only a single event than when having information on two events, here both H(x)H(x) and H(y)H(y)). This inequality also relates to what we will discuss as Bayesian surprise, KL-divergence or relative entropy (divergence between prior and posterior) in further parts of this series.

# The operator “<=” checks on "less than or equal"
EntropyXY <= EntropyX + EntropyY 

We can also alternatively obtain Hx(y)H_x(y) via:

# Alternative to obtain EntropyPOST
EntropyPOST = EntropyXY-EntropyX

From what we now know, it can be shown that:

Which leads to:

… which in Shannon’s words means:

The uncertainty of y is never increased by knowledge of x. It will be decreased unless x and y are independent events, in which case it is not changed (AMTC, p. 12).

Translated this just means that in the case of a steady state the conditional probabilities (posterior/likelihood) and the probability of a single event (model evidence / prior) are equivalent and therefore are their entropies ─ otherwise they are not. In general, the upper inequality also repeats the fact that the more I know (the more information I have), the less uncertain I will be (the lower the entropy!), as Hx(y)Hx(y), i.e., H(yx)H(y|x) not only entails yy, but both xx and yy, where xx is given

At last, we will calculate the entropy of the source per symbol of text, which Shannon defines as follows:

“Consider a discrete source of the finite state type considered above. For each possible state ii there will be a set of probabilities Pi(j)P_i(j) of producing the various possible symbols jj. Thus, there is an entropy HiH_i for each state. The entropy of the source will be defined as the average of these   HiH_i weighted in accordance with the probability of occurrence of the states in question” (AMTC, p. 13):

EntropySOURCE = -sum(eigenvector*MessageABCTransMatrix*bit_log_nonzero(MessageABCTransMatrix))

Adding a frequency to it would give us bits/second.

We will leave the exploration of further parts of Shannon’s paper to yourself. You will also encounter reflections on optimal coding and compression, a negative feedback correction conception, further insights on redundancy and many more concepts.

We will close this chapter mathematically for now and will look into more general reflections on information theory, how it was received and most of all, how it is often misunderstood in the next part of this series, discussing an essay / tutorial on AMTC, written by Warren Weaver. We will then again discuss more general implications for the field of computerlinguistics and how information theory plays a role today.

Before we do so, we will look into one last thing, the concept of algorithmic information theory.

3 Algorithmic Information Theory

As a closing remark, note that there is also the so-called algorithmic information theory, which essentially states that one could dramatically lessen the entropy of an information by sending a command with which the receiver can reproduce the message given an equivalent program or given a understanding of the command code. In practice this is fairly simply. Think of a message that may look like this:

This message entails 25 symbols, 25 bits (not in terms of entropy!). A command to produce such a message in R could be in this case the command c(numeric(24),1), which only involves 16 symbols compared to 25. With such a technique, huge amounts of information can be sent or compressed by substituting them with commands. Mathematically reflecting the above involves the so-called Kolmogorov complexity, which stands for the length of the shortest program (command) which produces the upper message. The only case when you cannot compress a message is when the length of the message and the program is equal. It is hard to find an example, as even a series of 1,2,3,4,5 can be compressed by 1:5 ─ at least in R this way. In general, such a message would involve something like a non-repetitive series of signs (no logical series as 1:5 that can be represented via a loop!), all of them appearing only once independently of each other.

4 Brief Outlook into Information Theory IV

In the next part of this series we will have a look at the paper “Recent contributions to the mathematical theory of communication.”, by Warren Weaver. There we will pick up our discussion on the different levels of communication and will have a look how the general discourse proceded in the second part of the 20th century. The next part will therefore be a lot about the conceptual part of information theory, how it was received and misconceived in the field of linguistics and philosophy, and how it got extended mathematically.

Ms. Miranda, longing for feedback. Did any of this makes sense? We would like to hear from you! Similar to our open review process, every of our articles can be commented by you. Our journal still relies on the expertise of other students and professionals. However, the goal of our tutorial collection is especially to come in contact with you, helping us to create an open and transparent peer-teaching space within BEM. Original photo by Alfred Kenneally

Pub preview image generated by Mel Andrews via midjourney.

No comments here
Why not start the discussion?