Skip to main content
SearchLoginLogin or Signup

[R] Higher Spheres: Information Theory I: Introduction into the World of Information Processing and its Technological Representation

Beginner's Stat-o-Sphere

Published onAug 18, 2023
[R] Higher Spheres: Information Theory I: Introduction into the World of Information Processing and its Technological Representation

Follow this link directly to part II of this series, if you wish, or have a look at other tutorials of our collection Stat-o-Sphere.

Review: This is a pre-released article, and we are currently looking for reviewers. Contact us via [email protected]. You can also directly send us your review, or use our open peer-review functions via pubpub (create an account here). In order to comment mark some text and click on “Start discussion”, or just comment at the end of an article.

Fig. How to start a discussion. A pubpub account is needed to do so. Sign up here. Also have a look at articles all about our Open Science Agenda.

Pub preview image generated by Mel Andrews via midjourney. Thanks for the permission!

Corresponding R script:


Welcome to “Higher Spheres”, our second tutorial series that aims to provide you with basics in computer science and mathematics in general that are used in fields such as computational neuroscience, (mathematical) psychology, sociology, statistics, physiology / physics, genetics, and in research and methodology on “artificial intelligence” in various forms.

Note that this series is not going to be particularly more difficult than any of the other tutorials that we have published so far, even though we are slowly moving towards advanced methods. We will also not speed up the pace and will again start from a beginner’s level. However, this series is still trying to introduce you to the basics of advanced scientific research methods that are often only used in particular fields (e.g. computational neuroscience), different to general statistical methodology. The goal is to especially make those kinds of methods further accessible that are hard to access in the first place (e.g., fMRI methods in neuroscience).

SIDENOTE: In general, for orientation within tutorials, we recommend making use of the contents function (upper right corner).

Fig. 2 After scrolling down until the title of an article disappears, the content function becomes available. It can help you to navigate within the script. It is entailed in every article you will find in our journal.

Some of our recently published tutorials will partially serve as background knowledge for this tutorial. Especially Inferential Statistics I, discussing conditional probability, Bayes’ rule and the logic behind frequentist statistics. No worries, we will go through most of its content again and we spiced it up a little and shifted the focus to a more neuro- and computer scientific perspective. So, we will still go through most of the basics again, but in a way that it serves new readers and hopefully also those that have already gone through our series on inferential statistics. We will also again work with the programming language R, so we assume some knowledge, which are however only very basic sections in the mentioned first part of our series on inferential statistics and that are rather simple and banal (install R, open script, execute some lines of code…).

However, the first part of this tutorial series on information theory is most of all written in order to convey a more general message on the meaning of information technology and processing and how to approach it best. It is intentionally written in a rather entertaining way and it will take us a while until we actually do some (very simple) calculations in R, so no worries on the programming part.

Fig. Intuition/Concept, Mathematics, Code — the general structure of our tutorials.

Even though our Series “Higher Spheres” intents to convay very specific methods, this will turn out to be only partially true for the case of information theory, since its applicability is actually very wide and also concerns methods used in general statistics, such as the Akaike and Bayesian information criterion or Markov chains (part three). The reason for this is simple: information theory is essentially just applied and advanced conditional probability / Bayes’ rule.

Fig. At the end, a lot of AI methodologies are just further developments of simple conditional probability / Bayes’ rule. This doesn’t make it boring though, on the contrary. Famous Scooby-Doo Meme, modified via a meme generator.

Despite its simple mathematical background, advanced information theoretic methods have become very popular and powerful recently, e.g., in the form of stable diffusion models, which use so-called variational inference methods (which can also be cast as a form of approximate Bayesian inference methods (part V of this series); note, we will not cover integrated information theory in any form (since I know only little about it and I am sceptic)). In the field of computational neuroscience those methods are, e.g., present in predictive processing / active inference / the free energy principle — concepts that are mostly applied to model decision processes, i.e., behavior of organisms of any kind and scale (cognitive processes, “Bayesian Brain Theories”) and more. Another method involving variational inference worth mentioning is Dynamic Causal Modelling which uses variational inference for testing neuronal dynamics / interpretation of fMRI data (SPM). This tutorial series was initially thought as an introduction into the very basics behind the mathematics of active inference, a concept initially introduced by Karl Friston that also invented SPM. However, we will get back to variational methods in the form of active inference and other methods in further parts of this series.

Fig. Basic concept behind active inference, one of several concepts under the umbrella of “Bayesian Brain Hypotheses”. The graphic above treats the relation between the world and oneself as organism as an (active) inferential process, basically representing abductive inference as the way we perceive and act upon the world under a policy. Those that have read part one of our series on inferential statistics will be familiar with the triadic Bayesian approach of a) generating a hypothesis, b) encountering (relating) to data (the world), and c) updating ones hypothesis as a consequence of such experience. The above concept is a generalization of the Bayes’ rule / abductive inference scheme, assuming that the brain has a generative model that is inverted and updated in the sense of Bayes’ rule for both perception and action (see this tutorial by Smith, Whyte and Friston for details (I have also translated some of its Matlab code into R). The above approach relies on so-called variational inference methods — (stable) diffusion models are also part of this class of modelling techniques. In general, such methods are used in neuroscience to model decision making, learning etc., comparing the model with EEG or fMRT data. We will further look intot this in fifth part of this series. Image (Brain) from Wikimedia Commons.

Together with the parallel development of cybernetics, information theory was also the beginning of a lot of fields as we know them today. In fact you probably regularly think in those conceptions every day, e.g., when talking about feedback/feedforward processes, the bio-psycho-social model in medicine, (dynamical) systems theory, DNA as genetic code, homeostasis (equilibrium steady state), homeorhesis (non-equilibrium steady state) etc.

The reason why such ideas can be found everywhere in science is rather simple: it all comes down to a generalization of the mathematics of statistical thermodynamics which is again just applied probability theory (conditional probability / Bayes’ rule). If you always had trouble to understand what (statistical) thermodynamics and in parts quantum physics is all about, this series could help you find a way, even though we will not go too deep into physics in this series.

Fig. Our tutorial series on information theory will indirectly reveal the basics around the question of how “diffusion models”, which imply physics (entropy, temperature, movement of molecules etc.), are related to text generation, image generation and to methods from computational neuroscience. We will discuss this in the second part of this series. Image by Chaozzy Lin.

When discovering the basics, information theory / systems theory starts to appear as a kind of hidden paradigm in vast parts of science, especially in the medical field. This is probably most of all the reason why evaluating the scope of applications of information theory actually becomes a little weird, or better: pretty weird, as it appears to be everywhere around us. Another reason is the simple fact that information theory concerns the basic idea and concept behind the very object right in front of you: a computer that nowadays feels more “natural” and “banal” to a lot of people then ever, even without actually knowing a single bit (haha-hehe) about it.

Therefore, another goal of this series is to overcome a mere consumers perspective on computers and information technology in general in order to overcome the partially dangerous tilting between over- and understating the meaning and capabilities of information technology and especially oneself using it. Anything else just serves misinformation on information technology in some form or another in the long run. Technology and science has advanced and spread widely and given a lot of current crises and their complexity, paired with more and more interaction with highly advanced technology, it becomes more and more easy to abuse a lack of knowledge within society for power longing spread of misinformation (even in various forms of costly campaigns agains facts, as we have experienced world wide in the last years, e.g., during the pandemic).

Fig. False information, mere belief systems and conspiracy theories distract people from real problems and real solutions for them. Irresponsible use of AI can worsen such social tendencies. Since all of us were literally flooded with false information and subsequent hatred during the COVID pandemic (same as with, e.g., the HIV pandemic), it is more then ever necessary to understand basic ideas behind information technological communication and science in general. Other topics casually blurred by misinformation are most of all the climate catastrophe and its estimated consequences. Image by Mika Baumeister.

Looking at problems in science in general, since AI technology has advanced quickly in the recent years, we may approach a time were an often opportunistic and irresponsible habitus of “just click your way around, you do not need to understand it, and importantly, write your name under it” will move on to: “just drop your data into the AI and don’t think about it — apart from writing your name under it”. A superficial approach to technology, its logic etc. comes with great danger for society, e.g., in the form of heavily biased data sets of machine learning tools that are as such not properly reflected when applied. Knowing the math of a simple neural machine learning network means knowing that its output sort of just “represents the bias” of the data it was trained on, since that is what it does when learning: training to predict given data, in order to predict new data.

Fig. A recent article in nature discusses the work of John Carlisle, an anesthetist that checks medical papers for flaws and fraud (for the latter as well as heavily flawed trials he uses the term zombie trials). There are two answers to the problem above: a) more open access, also including open data. b) most of all, people should spend more time and should also be given more time and guidance for understanding statistical methods, in order to take the respective responsibility. Methods Crlisle uses to do test papers, given that he as insight into data, are simple and well-known, e.g., the so called Benford’s Law (German video by arte / English video by numberphile). Science should not be based on superficial attitudes, nor a habitus that tries to become immune by discarding critical thinking on statistics as “not necessary”. Our open science agenda and open education projects are trying to make a difference in that respect, e.g. in form of our tutorial series, trying to work against a mere habitual approach to statistics and computer science which appears to be most of all interested in somehow making use of it, in order to be able to publish, instead of really trying understanding it, before publishing anything at all...

Again, not properly reflecting methods and their output are problems that are also commonly known from the application of classic statistics, but they appear to potentially double their blurring impact on critical thinking in the field of applied higher mathematics, especially due to their still often impressive results, despite being unreliable. We recommend, e.g., the work of Iris van Rooij that discusses the jeopardy of the use of AI systems without reflecting ethical and technical issues. Examples are, e.g., the use of AI as modern phrenology (headlines such as ‘ML predicts gender or personality via facial expression’) and questions of the copyright of training data. Is society hyping hybris?

Fig. Scene from Faust, Part I, which can be read as Mephisto’s attempt to seduce Faust into a hybris of gained power. Technology is often discussed in the context of a hybris (in general common as a trope), especially when it comes to industries heavily influencing our environment and society, naively putting people in danger. However, an anxious and generalized rejecting attitude towards technology is also not doing any better, since it has the tendency to, e.g., confuse industry with science and scientific thinking. Such approaches to science often end in opinionizing and fatalizing discourses, as well as possibilities and boundaries of technologies, instead of gaining and sharing knowledge, gaining and sharing information — such that everybody can progressively make informed and therefore more ethical decisions. This is what this series on information theory is at least trying to evoke in readers: start to understand your computer, it is a scientific object (quote inspired by Marx, page 59; German version here; inspired by a talk between Sophie Rois and Alexander Kluge).

Of course we all believe that science and technological advances in general should never be led by a mere attitude and no one should be forced to follow mere habits in order to flurish in science or in the world of technology. Still, this appears to be what is in fact already happening — even without AI. At least as a medical student I feel one often has to justify oneself more when having or longing for consistent knowledge on statistics and AI, instead if “just using it”. It therefore appears that problems of, e.g., faulty methodology are more an issue of taking statistical applications in general far more serious than, e.g., concerns about p-hacking. Understanding methodology of science and basic concepts of technology should in general be much more open / transparent, accessible and also appreciated by societies.

With this tutorial series we again want to motivate you to better just invest some hours going through scientific methodologies step-by-step, instead of hopping on a habitus that is focusing on performance / appearance instead of consistency and responsibility. The culture of publish-or-perish makes it clear to everyone that far too many people promote dividing attitudes and are gate-keeping their way up to some random and often undemocratic top. At the end it is only us that can make a difference in openly discussing and educating each other about methodologies, instead of just imitating a habit of neglecting scientific and social ideals, others in general, and at the end oneself in that respect.

Fig. Witnesses claim he afterwards started chanting: “The null-hypothesis was true all the way down!” … The problem of faulty and even fraudulent applications of statistics is probably often not so much a lack of knowledge on p-hacking, but a in general questionable opportunistic or heuristic approach to scientific methodology, which either discards basic knowledge as “not necessary” or results from pushing students and scientists much too far (e.g., existentially, educationally).

Note that we, students from various fields at NOS, think that discourses around open science, and within it open education, would make a difference in how people approach science in general, especially when introducing students to scientific research methods and ethics of consistently applying and sharing knowledge (link leads to our first policy paper). We hope to make a difference.

Getting back to information technology and how it is actually all around us, this series on information theory might — and hopefully!!— feel like a futuristic archeology adventure, since it entails a great part of history, dating back to the 1930s and 40s; on the other hand, understanding the basic logic behind information theory and therefore some of the most basic logic behind a computer and telephone network (computation in general) makes you feel like having lived in the middle ages all along by discovering how ridiculously futuristic the time that we live in actually is: our present.

Fig. Going full cyber via information technological devices. :D Image generated via midjourney, i.e., via information technology itself. In later parts we will see how methods such as ‘stable diffusion’ relate to information theory via statistical thermodynamics. Source here.

1 Brief Looks into the History of Information Technology in the 20th Century

Let us start with going back in time a little bit: One might think, when someone invents or discovers something very important that everyone uses and talks about every day, everyone will know about that person. Same goes for the discovery itself: if something is so important that everybody relies on it every day, and this something even still gains more and more importance with no end in sight, even though it never changed conceptually, then we may assume automatically that a great amount of people within a society will understand it and know about the inventors. This is not always the case, and at least cults around persons are probably not that important in general either, but it is still quite weird with Claude Shannon for instance and the development of a general information theory.

This is similar to other important figures in the field of information technology, e.g., Margarete Hamilton and her contribution in the development of the Apollo Guidance Computer, which actually flew the vessel to the moon (there is a great website on the restoration and emulation of the AGC. Don’t be irritated by the 90’s look of the website!). Or with Katherine Johnson, even though gender and being a PoC definitely has played a role in how much we were and are able to know about them.

Fig. Claude E. Shannon, Margarete Hamilton and Katherine Johnson, pioneers of modern information technology and science. a) Shannon and his robotic mouse “Theseus” that can solve an adjustable maze via a relay algorithm that represents memory and was able to do the inference necessary to solve a maze. b) Margarete Hamilton invented the concept of software engineering. In the photo she is standing next to a pile of the printed source code of the “Apollo Guidance Computer”. The AGC was in a lot of ways what actually flew humans to the moon. c) Katherine Johnson was also involved in the Apollo 11 program, calculating the orbit for the flight of the space craft. The film “Hidden Figures” from 2016 depicts parts of her life during the Mercury-Program (158-1963). The contributions of Dorothy Vaughan (first afro-american ingeneur at NASA) and Mary Jackson (first afro-american supervisor at NASA) are also wonderfully depicted in this film.

Image Sources: a) (dead Wikimedia commons link), b), c).

There had been many figures involved in the development of past and present technology — and it also didn’t all start with Shannon or any of the above when it comes to information technology. Here we will just take a brief look into a small selection of people that were part of further developing science and technology. Everyone with their own story and solutions to problems they faced in their time. We deeply encourage you to dive deeper into this topic, especially when you have a kind of aversion towards technology. As mentioned, it is becoming more and more important to leave a mere consumers perspective in that respect, since technologies such as, e.g., in the form of “AI”, will accompany every one of us in some form or another in the near future — and a lot of other futuristic technologies do so already casually every day. It is therefore pretty weird how many within our society treat science and technology more as a myth and a mystery.

Fig. The above is a still from arguably one of the most meaningful and touching scenes of the film “Hidden Figures” with a subtle but important message on several levels — the intellectual, but most of all on the social level: fighting to have the opportunity to gain and pass on knowledge, especially despite gender or color. Octavia Spencer is playing Dorothy Vaughan in this film. In this scene she reads out loud the above introduction line of a book about the programming language FORTRAN (FORMula TRANslation) to her kid.

Other interesting figures in the further development of the basics of information technology and what later became AI research are people like John von Neumann (cellular automata, “Von Neumann architecture”), Warren McCullach and Warren Pitts (concept of neural networks), Alen Turing (Turing machine, Halting problem, Turing test), Richard Feynman (Variational method) and many others.

Fig. Rosenblatt perceptron, based on the conceptual work of McCulloch and Pitts from 1943.

If you are interested in a maybe historically more relatable insight into the history of computing in the 20th century up to now, we recommend the BBC television show “Micro Live” that was aired between 1983 and 1987 — a time where the first affordable home computers thrived. This television show is particularly interesting, since it was planned in combination with an interesting and unique educational deal between the company Acron Computers and the television network BBC (British Broadcasting Corporation). The result was the so-called BBC Micro, a micro-computer (little computers that don’t fill up a room) that is conceptually equivalent to the architecture of todays smart phones and laptops: a single (printed) circuit board, different to a modular “multi-board” PC.

Fig. Still from the BBC show “Micro live”. A television show all about computer technology from the 1980s.

Acron Computers is particularly important since it brought up the so-called ARM chip-architecture, primarily developed by Steve Furber and Sophie Wilson. ARM stands for Acorn and later Advanced RISC Machine. RISC stands for “reduced instruction set (complexity) computer” (the first “c” is spared out in the acronym) and refers to a computer chip that is build to translate complex instruction sets, such as, e.g., a division within an arithmetic unit, into simpler machine instructions while still leading to the same result but being more efficient. The ARM chip in particular turned out to be extremely energy efficient — in fact the first chip just runs via leaking current. Have you ever wondered why Nokia mobile phones in the 90s / early 2000s had such ridiculous battery life? They were all equipped with ARM chips. Not only that, ARM chips and smaller and smaller microcomputers triggered a revolution that made the first and nowadays smartphones possible (which don’t have as much battery life as old Nokia models, since they are equipped with energy costly displays and are connected to the internet). Since this is a highly relatable development in information technology, it is a good way to get a hint on what was and what has changed technologically during the decades and how it impacts what kinds of technology we use today.

Apart from technological advances in general, ARM chips are also used in the field of computational neuroscience, e.g., the Human Brain Project to which Steve Furber made contributions. Here is also a great interview with Sophie Wilson about the history of ARM.

Fig. On the left we see Steve Furber that was also working on the Human Brain Project, which also involved ARM technology. Together with the wonderful Sophie Wilson in the right Furber developed the ARM architecture in 1983 (she is also transgender and therefore also pioneered in taking her rights). Sophie Wilson was most of all responsible for writing the instruction set of the original ARM chip. Sophie Wilson is standing in front of an enlarged microscopic image of the original ARM architecture and holds an original ARM chip in her hand for comparison. Image Sources: left, right.

The above are just some of many historic examples of people that greatly influenced the development of information technology and can be considered a personal selection in order to give you a brief insight into the realm of computation. It is not necessary to get into every detail of history, but it helps to reflect one’s own life as being accompanied by a history of technology.

2 Brief Looks into Claude E. Shannon’s Biography

We hope the above selection of historic figures of mathematics and computer science has given you some inspiring insight into the history of information technology. However, the next parts of this series will mostly focus on the work of Claude Shannon, which we briefly introduced above.

When approaching actual mathematics and some code, we are especially going to focus on two works of his: “A Symbolic Analysis of Relay and Switching Circuits”, his master thesis from 1937, where he essentially conceptualized a modern (digital) computer via a telephone network, and most of all his paper “A Mathematical Theory of Communication” from 1948, where he defined a general theory of information still applied today in seemingly uncountable forms. The latter work will be the one we will focus on most in this series. We will only take brief look into his master thesis in the next chapter. However, both works of his are actually written in a quite accessible way to some degree, addressing rather everyday problems that we all know of from habit and experience with communication devices in our lives (such as noise for example).

In general, it is hard to track the impact of Shannon’s work, as information theory was the beginning of what is called the “information age”. As mentioned he discovered a way to mathematically represent the basic logic of today’s digital computers by using just a telephone (network). He also found the mathematical basics for digital audio, together with Harry Nyquist; digital audio is literally the result of a mathematical calculation, also involving the beautiful concept “Fourier Series/Transformation”. Shannon also worked on basics in machine learning e.g., in the form of the mouse “Theseus”. He was also the first to contribute to the idea of a chess computer in the modern sense. In fact, due to his contribution in this field the yearly winner of the World Computer Chess Championship is given the so-called Shannon-Trophy.

Fig. The first Shannon-Trophy of the “World Computer Chess Championship” was given to Feng-hsiung Hsu for Deep Thought in 1989 and was handed over by Claude Shannon himself. IBMs Deep Blue was a further development of Deep Thought. The name Deep Thought was inspired by the work of Douglas Adams.

A lot of the above discoveries happened during his time in the famous Bell Telephone Labs. Other technological and scientific inventions and discoveries within the Bell Labs where radioastronomy, UNIX, the programming language C and C++, transistors, the laser, first actual solar cells, the statistical programming language S that inspired the syntax of R etc.; in its best times around five patents were filed a day by members of the Bell Lab (compare this lecture on the Bell Labs). Unfortunately the best times of the Bell Labs found a very ugly end. After several financial crises and a lot of changes (e.g., in owners), a German physicist “almost faked his way to a Nobel Prize” while working at Bell Labs (link leads to a great and thorough documentary by the youtuber BobbyBroccoli). This scandal eventually struck the reputation of the former Bell Labs hard. However, Shannon was working at Bell Labs in a time where it was a highly vibrant and booming idea factory with lots of influential scientist that seemed to come up with ideas instantly and also had a lot of freedom (different to todays standards).

Apart from his contribution to information theory, which we will get into in a bit in detail, Shannon was also famous for inventing a lot of fun stuff as well, such as a rocket powered frisbee, a joggling machine, and various other fun things. Apart from chess games, we also mentioned the mouse Theseus that does a maze. Similar to chess computer competitions, Shannon was also involved in triggering the so-called micro-mouse competition that is held since the late 70s by the IEEE (Institute of Electric and Electronical Engineering; I learned about it from this tutorial on graph theory that I reviewed as a part of the SoME competiton by James Schloss and Grant Sanderson, 2023). Here is also a nice introduction into micro-mouse competitions by “Veritasium”.

Fig. Example of a micro-mouse and a micro-mouse maze used for competition. Source Wikimedia commons: a), b).

I will not go into more details of Shannon’s biography here, as it would not justice the matter in brevity, but I deeply, deeply recommend looking for background information on Claude E. Shannon, e.g., documentaries, and I hope you will soon understand why this should be the case. For me, understanding the basics of information theory was in a lot of ways a relief, even touching, as it brought me closer to things I use every day, now understanding at least the basic concepts behind information technology.

Understanding technology helps to use it consciously, instead out of pure boredom and habit. Computers are fun, but they are actual conceptual universes we often miss to explore — even though they can help us grow this way.

3 “A Symbolic Analysis of Relay and Switching Circuits”

Before we get to know what “Information” actually means we will first take a very brief look into Shannon’s first groundbreaking work, his master’s thesis “A Symbolic Analysis of Relay and Switching Circuits” from 1937 (link leads to pdf), in which he showed that Boolean Algebra, which operates with 0 and 1 logically (true, false), can be represented by relay and switching circuits (he does it symbolically, not technically — therefore the name). With this and some other mathematical tricks, one can perform, e.g., non-binary arithmetic operations. So, the logic of Boolean algebra works as an interface for the logic of general mathematics, delivering an ‘equivalence’ that can be functionalized as binary code, making expressing and processing mathematics via a circuitry possible. As mentioned, one can say that Shannon conceptually (symbolically) turned a telephone relay/switching circuit network into a computer.

Fig. A large Bell System international switchboard from 1943. As you can see, this was a job that was here entirely given to women at that time. Women also invented programming, however this was widely forgotten, also due to tropes such es the male computer nerd/geek. There is also a great paper by Olivia Guest and Samuel Forbes on teaching programming inclusively that goes deeper into this topic. Image source Wikipedia.

Functionalizing binary logic with relay circuits is what later Konrad Zuse did, when he built the first computer in the modern sense in Kreuzberg, Berlin, in 1941 (Zuse Z3) ─ actually just around 100 Meters away from where I was starting to write this tutorial a while ago (Methfesselstraße 10 and 7 (both buildings destroyed) in 10965 Kreuzberg, Berlin). He developed very similar ideas to Shannon’s as well as von Neumann’s (“Von Neumann architecture” of computers) in 1938 and also realized them as actual machines. At a similar time the ENIAC (Electronical Numerical Integrator and Computer) was developed in the USA by John Eckert and John Mauchly. There is a discussion which one of the above was actually the first computer in the modern sense and under what definitions — we will not go into details any further here.

Fig. Replica of the Zuse 3, located at “Deutsches Museum” in Munich. The Zuse 3 machine was the first universal computer (programmable and digital). Image Source.

The motivation behind building computers at that time was finding a way out from being forced to do hundreds of hundreds of monotonous and boring calculations as only labor, in a time before there were (technical) computers. It is important to underline that a computer at that time was essentially the job title of a human doing calculations (often/mostly women) by using algorithms to solve mathematical problems and therefore all problems that can be solved in the form of a mathematical abstraction.

Fig. Relay Switch used in the Z3 and other models of Konrad Zuse. Image Source.

Now let us get to the core of what a relay switch conceptually does and have a brief look at some examples, to understand the general concept (you will not need more than that for this series!).

Fig. Original figures from ‘A Symbolic Analysis of Relay and Switching Circuits’ by Claude E. Shannon. The general representation of relay / switching circuits in this form indicates no defined state, in other words: e.g., figure 1 represents both 0 or 1 in this representation. In general 0 and 1 are related to the hindrance, such that 0 hindrance equals zero and 100% hindrance (the switch is disconnected!) represents a 1.

Let us have a look at the upper left “Figure 1”: If the relay switch is closed, there is 0% hindrance (!), the current can flow and it represents its respective binary digit, a 0. If the relay switch is open, there is a 100% hindrance, subsequently representing a 1 ─ the current cannot flow. Note that the common on/off analogy for 0/1 does not work here, since on = 0% hindrance and off = 100% hindrance (this is different to today transistors). Since this may not be as intuitive, as it seems, you may want to go through it a couple of times in your mind, just in case.

By putting switches / relays in series or in relation to each other, one can, e.g., perform an addition, following Boolean logic, where, e.g., 0 hindrance plus 0 hindrance equals 0, 0 hindrance plus 1 hindrance equals 1. See all the postulates below for details (note that it is enough to get the very general idea behind it to get through this tutorial!).

To jump a few steps ahead: From there it was possible to create and make use of several different so-called logic gates to form circuits that perform a certain logical or subsequent mathematical task, represented by a hardware relay inference structure. Again, as circuits can perform binary logic, and binary logic is also a way of general abstraction on the basis of the principle of equivalence (code), anything can be computed theoretically. This is also where software comes in play, and differences in machine code, assembly code, source code. etc. ─ which at the end are just further hierarchical abstractions of the same logic of en- and decoding binary digits. Note that we will later see that binary logic is used, since it is mostly just the easiest way to produce a technical representation of logical inference, e.g., via relay switches (so “bits” is actually more of a randomly picked and also convertible unit, referring to the base of the log-algorithm (2) of possible states, as we will see later below).

There are two logic gates that had been famous before Shannon developed information theory: the NAND gate (not-“and”) and the NOR-gate (not-“or”). Both of them have the special quality that one can replicate any possible logic gate with a circuit of NAND or NOR gates. Both, as well as their special feature, and many other concepts in logic, were initially defined / further developed by C. S. Peirce. We mentioned the work of Peirce in Inferential Statistics I, since he also defined the concept of abductive inference, in contrast to (or as basis of) deductive and indictive inference. There is also evidence that Peirce had also foreseen the possibilities that Shannon used for his circuit analogy (e.g., in “Logical machines” in 1887). His work had in many ways a great influence on logic in general and on further developments of Boolean logic (here is a link to a great math tutorial on that subject of logic gates for further details (part I of III) by N. J. Wildberger).

Fig. The NAND gate. a) The NAND truth table. If AA is false, similar to theta\overline{theta} is given, and B is false as well, the operation not-and outputs Y=1Y=1. Same goes for the case when A is false and B is true, because the logic assumption holds: A and B are not the case, so NAND = true. At last: A is true and B is true ─ therefore the assumption doesn’t hold: NAND = false. Note that the presence of two relay switches has doubled the number of possible states (22), compared to one switch (21), representing either 1 or 0 only. This of course does not concern the output, which is a single binary. We will get to this again, when we start to quantify information. b) The symbol for a NAND gate used in DIN 40700 ─ conventions alter. What we see is A and B as inputs and a “D” shaped form that merges into Y, the output, processed by the corollary logic. Most important: the black dot distinguishes the NAND gate from the AND gate, symbolically representing “not”.  c) NAND relay logic: The input relays, set either to 1 or 0, either break, or do not break the flow of a current (100%, or 0% hindrance). If the hindrance measured in Y is only 50%, as only either A or B is 0 and the other switch is in 1 or so, the output Y will not lead to a flip in the subsequent relay switch according to our truth table, as the hindrance is not 100%.

There is of course a lot more to it, especially when creating complex circuits. However, this chapter should just give you a general idea of the physical representation of information processing. Nowadays we are of course using tiny transistors. A chip from the Apple M1 series for example carries 16 to 114 billion transistors (the functional equivalent of relay switches).

If you want to learn more about the technical side of information technology and engineering, we recommend looking into tutorials that introduce you into the field of microcontroller such as a raspberry pi or others. These were explicitly invented to be used for affordable educational purposes and are quite powerful little computers.

Here is a German tutorial for making plants curse at you, when they need water. ^^ And here is an English tutorial to build and ECG (to also motivate medical students to further dig into the topic).

Fig. German tutorial at the great youtube channel “achNina”. Nina has a collection of really fun tutorials and makes very humorous short videos, so we definitely recommend to give her a visit!! ^^

4 The Meaning of Communication as a Mathematical Theory

The second important paper, and the one that is especially important for us, is called “A Mathematical Theory of Communication” (AMTC) from 1948 (full paper can be found here). In this paper Shannon first of all solved the problem of how and in what optimal form to send information from one point to another, e.g., via a telephone network, without losing or changing the infoꓤⴠⱥⱡⱼⱺn, which means, without noise. Noise means, in its simplest form: one sends a 0 and the receiver receives a 1 to a certain probabilistic chance within a series of sent bits ─ one also speaks of “flipped bits” in such cases. In this series on information theory, we will essentially replicate a lot of the math of the first 12 pages of AMTC.

Fig. The original schematic diagram of a general communicational system by Claude E. Shannon, 1948. 

Note that the word “bit”, yes, stands for binary digit. However, most people think of the concept of bits in the simple sense of storage, where e.g. 8 bits (one byte) refers to a combination of 8 zeros and ones. Later bits will also be the name for a quantity, which not only refers to its essential symbolic possibilities (two) or a representation of stored information, but extends the implication of the concept ‘possibilities’ to this, a mathematical and most of all probabilistic quantity and argues communication to be a process of probabilistic inference. In the next part of this series we will also find out that it is not even important which ‘unit’ we will use apart from binary, so reductions like “computers are only 0 or 1” are missing the point in several ways. Note that the relation and differences between humans and computers, at least if drawn in computational neuroscience, lay somewhere else, as we will see. ChatGPT for example is for one just a very passive AI system that lacks a lot of general structural abilities to perform or represent complex reasoning; it will necessarily come up with stupid, biased, i.e., dangerous suggestions at some point or another, which need further reflection — bluntly spoken; on the other hand, this circumstance is not “that” different from common human fallacies or at least problems of communication (which we will look further into in the forth part of this series)... Comparisons with humans are in general a little difficult and strongly depends on how literal we are drawing such a comparison (a computer and a human can solve 2+2 — but they may do so in very different ways).

In general, the term computation however just means: solving a problem / answering a question via an algorithm — any algorithm, not a specific one in a specific hardware representation or so... This issue of misunderstanding the concept of information, information processing and computation is, e.g., discussed in this paper by Richards and Lillycrap in 2022. So again, “computation is just 0 and 1” or “input and output” is neither witty, nor does it add anything to the story of what computers have to do with humans and what not, apart from again starting an old discourse on a lack of information about the concept of information and information technology itself…

5 Bayes’ Theorem / Conditional Probability (Optional)

In this chapter we are starting with basics on Bayes’ rule / conditional probability, before getting right into the paper AMTC— which is again just applied statistics, so to speak. If you feel you have enough background on Bayes’ rule or if you have read our tutorial Inferential Statistics I, you may want to skip this next chapter.

However, the approach on cond. prob. and Bayes’ rule in this chapters will differ a little from our previous approach: The focus will be shifted a little more towards a “subjective” perspective on general “inference” (or a subjective interpretation of probability), i.e., an intuition / concept that understands “cognition” — and even ourself as whole (in the sense of a body / organism in the world)— as performing probabilistic inference (in terms of sensory and action). This is supposed to help you understand the actual sender-receiver model and is most of all supposed to give you an idea about concepts in computational neuroscience that are relatable to or direct “descendants” of information theory (such as “Bayesian Brain Theories” or the bio-psycho-social model), which we will go into in future parts of our series “Higher Spheres”. Another goal is to obtain a critically reflected understanding why people compare, e.g., neural networks with human inference in the first place (in more or less sensible ways).

We will also further extend our discourse on ideas from C. S. Peirce, e.g., the concept of abduction and further intuitions around it. As we mentioned before, C. S. Peirce had a great contribution in the further development of Boolean algebra and logic. Apart from being influential in mathematics he also greatly contributed to other fields, such as philosophy (discourses which we today may cast as epistemological, phenomenological (philosophy) — or as part of cognitive science in general).


5.1 Abductive Inference in General (Optional)

In Inferential Statistics I, we have discussed Bayes’ rule / Conditional Probability mostly as a kind of scientific learning algorithm with a triadic structure in the context of hypothesis testing in general (here we mostly skip the comparison between Bayesian and frequentist statistics). We also logically related this process of probabilistic inference to the so-called abductive inference. Later we will see that Bayes’ rule/abductive inference can also be used to reflect on the relation between a sender and a receiver, i.e., communication.

Fig. From the webcomic ‘xkcd’, titled “Hypothesis testing”, which is also referred to as abductive inference (xkcd: Hypothesis Generation). Any inference we make as humans can be referred to as abductive inference: making a guess on the world (a prior information) and adjusting it, given new information on the world in order to update a model, cast a new prior for further inference. The posterior probability will serve as a new prior, such that it can be understood as a learning algorithm, close to the concept of classic and operant conditioning for example (which was directly influenced by information theory!). Statistical inference can be understood as a way to extend the method by making its steps transparent via mathematics: a mathematical representation of an argument on the world.

Abduction can be defined as the process of generating a hypothesis to optimally explain a new observation, based on what is given (e.g. in general datadata, as discussed in Inf. Stat. I).

To give you an intuitive furry example, let’s say you are observing water on the ground ─ indoors(!!), which appears ‘surprising’ to you in the sense that you are ‘uncertain why’ this is the case. Evaluating previous evidence spiced with some intuition/confidence (prior beliefs) and gaining present insights (likelihood of present events, given a prior belief) may lead to the cat (Mr. Midnight, see below) being the most probable explanation for it (posterior belief): a high probability on “the cat being the cause, given that there is water on the ground”. Let’s now say, you may still be too uncertain to draw such a conclusion (low probability of the posterior), maybe because your cat has not been destructive in quite a while (or ever), or because you haven’t actually seen Mr. Midnight do it (^°^) — so you start to gather more information and you want to make your inference more precise. After doing so, you may eventually find out that it wasn’t the cat ─ but a squirrel that got lost in your rooms (updated hypothesis, e.g., due to direct sight of a squirrel or a lack of matching pawprints…).

Fig. Probably innocent Mr. Midnight, since there are no (matching) pawprints to be found at the “crime scene”. Lyrics from a KLR animal voice over meme. Image from Anita Jankovic and Dominica Roseclay.

Note that references we make, here cast as the specific content/variables of inference (cat, water on the ground), remain modally contingent, which means that the principle of abductive inference in the sense of conditional probability / Bayes’ rule does not rely on its content ─ can potentially be used for anything (contingent = similar to ‘arbitrary’).

As we have discussed in other tutorials, we can always perform some random mathematical inference in the relation between the above variables — our hypothesis (e.g., cat was the cause) and the data we get (e.g., water on the ground) — using Bayes’ rule. However, doing so does not cover a more general approach in reflecting variables and how they are influenced, e.g., in the sense of a complex system with many, many relations and inferences being made (e.g. confounders), in other words: the ad hoc theory we may come up with in order to learn how well it can be applied on the world or even ourselves, may turn out to be all bonkers in the long run, gathering more information or insight into possible relations — so there is always the chance that temporal and spatial expansions of inquiry (gaining further data, looking further into populations over time) overturns our confidence, our model of the relation between us and the world… However, this is what learning is all about, and this is what the above circumstance makes possible in the first place. In other words, when looking at abductive inference in a positive way that does not focus on the mere fallibility aspect of inference, we can at least say: We are able to adjust our hypothesis, we are able to overcome faulty models over time, we are able to further learn!

So again, abduction can be defined as the process of generating a hypothesis to optimally explain a new observation, based on what is given. Note that “optimally” can mean “best” as in “best fits a model of self and world”. Thinking of it in this way may seem intuitively reasonable, but also entails the risk of overfitting, i.e., inferring something to be existing or in general the case, even though it is actually not, due to overly putting confidence in a belief, e.g., due to high reward, or in data that fits the belief. Another reason to be hesitant using the term ‘best’ to describe a form of inference itself:  The term ‘best’ may imply the presence of values, concepts of relevance etc. (priors), in other words: results of abductive inference itself, which are at the same time not as universal as the form of inference itself.

What seems to be a tautology within the concept of abductive inference is actually an interesting spatio-temporal twist within the process of abductive inference: abduction is better to be understood as “inference on a better hypothesis”, compared to previous ones ─ essentially moving from past, to future (prior to posterior), creating a new or updated hypothesis and therefore a necessary latency and modal contingency concerning the epistemics of a hypothesis over time.

Fig. Peirce saw in the existence of optical illusions (above in the form of tilting figures) a proof for abduction being the prime form of every inference (see his work on textlog): The cubes spatiotemporal perspective starts to tilt, when the hypothesis loses its character of singular necessity and becomes contingent (one possibility amongst others). At the same time model updating is literally experienced, as the hypothesis on the perspective has changed and therefore the experience (Image sources: left, right).

Another way to understand abduction, especially cast as prime form of inference, is in a row with induction and deduction. Different to the latter two forms of inference, abductive inference actually generates something new, a hypothesis. Abduction can therefore be understood as initially providing the relationality, or better: contiguity of, e.g., two events within space and time, represented as a prior hypothesis on data that can subsequently be evaluated in a mereological way, i.e., describing the result of an abductive inference as a part-to-whole relation: induction from a part (data) to a whole, a generalization (theta), and deduction vice versa ─ abduction being the cause, or the process that makes and therefore sets mereological relations in the first place.            

Fig. Abductive inference as the process of forming a hypothesis (theta) on the world (data), i.e., an initial (new) relation, which can subsequently be evaluated mereologically (=part-to-whole relation) via deductive and inductive methods of inference (1. Abduction, 2. Deduction, 3. Induction).

As mentioned, this fits into several aspects of our everyday experience: our experience of fallibility of inference, as well as the possibility of updating a model/hypothesis, which, over time, understood as a method, will automatically lead to falsifications to some degree, i.e., learning (this can be understood as a reason that abduction itself is not falsifiable, as it states to be the major principle of inference itself). This also fits well to our ability to actively attenuate to certain aspects of interest and subsequently create new and eventually more complex interests, ideas, hypotheses, perspectives, e.g., by intentionally experiencing new environmental situations (selection, planning, action).

In general, one can say, casting abduction as prime form of inference is arguing that the way we relate and approach our environment is in its essence a speculative and rather adventures act, though again: it will conceptually lead to knowledge over time.

5.2 The Formula of Bayes’ Rule and Conditional Probability (Optional)

In information theory we will discuss Bayes’ rule / conditional probability in the context of a sender-receiver relation. However, the variables as such are contingent, in other words: it can be discussed in many contexts, with potentially any variable we assign, so to speak. Even though we are mainly after understanding the relation between sender and receiver, we are going to give you a quick general overview on the use of Bayes’ rule and conditional probability.

Again, in Inferential Statistics I, we have discussed Bayes’ rule / conditional Probability mostly as a kind of scientific learning algorithm with a triadic structure in the context of hypothesis testing. However, we also figured that there are many applications, such as the power analysis, the p-value in frequ. statistics, analogies to investigative reasoning… In all cases we were able to orientate ourselves via the below triadic structure of forming a hypothesis (prior), gathering/encountering data (present) and updating (the probability of) a hypothesis / model (even the frequentist approach).

Fig. Ms Miranda (she knows-it-owl) and a reader of her work (on sounds that implicate snowing well). In other tutorials we already had furry and feathery support, e.g., explaining hypothesis testing in the general sense of abductive inference and other concepts of statistics. Some of them will reappear in this tutorial series as well. ^^ Images by Kevin Mueller, Alfred Kenneally and Ray Hennessy.

Note the temporal aspect of the above steps, representing the process of updating a prior hypothesis/belief/model of the world, given new observations, i.e., new experiences. The goal is therefore, after observing, i.e., given new data, to adjust our concepts (thetatheta) in order to fit well to the actual data in the world we encounter in the future (speaking of plain passive perception only).

To wrap this up: abduction is the (inferential) process of finding/adjusting a model/belief/hypothesis that predicts future states well in terms of “better”, so that the overall belief does not have to change much in the future when encountering new observations of the same or similar kind (fitting our theta to data).

Fig. Bayes’ rule reflected as a course of inference or as a course of conditions. This notion is present in both frequentist and Bayesian statistics, as the likelihood becomes equivalent to the posterior, given a uniform prior (frequentist approach). Inference can also be thought of as a course or flow of information.

In case you have an idea of conditional probability from school: Note that the usage of Bayes’ rule / conditional probability for any kind of hypothesis testing is contextually special, since it is about the “prediction of an event” in relation to an “actual event” — in other words, it is about one event only, just in two “forms”: prediction of an event and the experience of an event (essentially testing for a redundancy). This is contextually and practically different with causal event/event relations, even though the formula is the same. For example: the probability of getting COVID, given that you are in a hospital, P(COVIDhospital)P(COVID|hospital), is a contextually different relation as landing in the hospital, given that one has COVID, P(hospitalCOVID)P(hospital|COVID) (thanks to the biometric Institute at Charité for this advice).

Let us again go through all the conceptual steps and find out how they are related to actual mathematical formulas:

Step 1) Our fist element of Bayes’ rule is the prior (belief), denoted P(theta)P(theta), where PP stands for probability ─ the probability of a belief/hypothesis (thetatheta) about the world or an event, regardless any new events / observations, or ex negativo: a belief in relation to all past, or possible observations (here all binary options datadata and notdata=datanot-data = \overline{data} are meant with all possible observations!).

The phrase “all past or possible observations” represents linguistically what is technically called the sum rule of probability (details later below). We will see that this rule is most of all applied to calculate P(data)P(data), not P(theta)P(theta), as we can set our initial priors ourselves, but it conceptually stands for the same, just summed out over all possible hypotheses (instead of all possible observations) ─ we will get back to this soon below.

Step 2) The second element of Bayes’ theorem is the joint probability, which is sometimes also referred to as the model in general. The joint probability consists of a likelihood in relation to the prior and is denoted P(theta,data)P(theta,data), which can be read as: the probability of the actual theta “and” (“,”) the actual data occurring together at the same time, i.e., within the present.

The concept behind the joint probability can also be expressed set theoretically via a Venn diagram. P(theta)P(theta) and P(data)P(data) then represent a set of two overlapping circles. The overlapping forms a joint probability, cast as a set of true data and theta, i.e., P(theta,data)P(theta , data). A joint can also be cast as a logical conjunction, spoken “theta and data”, equivalent to the sign “,” in P(theta,data)P(theta , data).   

Fig. A so-called Venn diagram. The circles each represent theta and data for itself as ‘areas’ (compare set theory for details). The overlapping indicates that both theta and data are true. The comma “,” in “theta , data” is here considered a mathematical symbol and is spoken “and”. Probability theory can be understood as an expansion of classic propositional logic, which operates with the modal necessities of 0 and 1. Probability theory expands this by allowing a logic of modal contigencies of inference, i.e., values between 0 and 1.

The only difference between Bayes rule’ and classic conditional probability is that in Bayes’ rule the joint probability can also be obtained by weighting the invers of the posterior conditional probability (i.e., weighting the likelihood). Upfront you can find the formal difference between classic conditional probability and Bayes’ rule below (here using the variables AA and BB instead of thetatheta and datadata).

To understand what weighting the inverse posterior means, let us have a look at the likelihood for itself: The likelihood is the probability of making an observation, given that our belief/hypothesis is set (a priori) and is denoted P(datatheta)P(data|theta), where “|” means “given” or “under the condition of” (therefore the name conditional probability). The likelihood represents our new observation, the present.

Again, the joint probability represents a “relation (conjunction) between a prior and a likelihood” and is also denoted: P(datatheta)P(theta)=P(data,theta)P(data|theta)*P(theta) = P(data,theta). In other words, the weighted likelihood or “weighted inverse posterior” results in the joint. The phrase ‘relation between prior and likelihood’ linguistically also represents what is called the product rule of probability in mathematics (also known as chain rule).

Fig. Product or chain rule of probability. The graph represents a particular chain of decisions. A classic decision tree also essentially refers to what is called the chain or product rule of probability. The above also entails P(datatheta)P(data∣theta), which is also referred to as the likelihood, and is ─ as we know ─ just the inverse condition of our posterior probability P(thetadata)P(theta∣data), i.e., the likelihood is the data under the condition of or after we formed a hypothesis.

Note that in classic (or exact) Bayes, P(datatheta)P(theta)=P(theta,data)P(data|theta)*P(theta) = P(theta,data) is equivalent to P(thetadata)P(data)=P(theta,data)P(theta|data)*P(data) = P(theta,data), which in some way represents the possibility of what is called the model inversion of the upper formula (however, we will see this only works numerically in a special case of a uniform prior, see further below). So far just think of model inversion representing the attempt to move, or invert or tilt the actual observation, the likelihood P(datatheta)P(data|theta), to become our new hypothesis given datadata (the posterior, P(thetadata)P(theta|data), see below): Because it is the world we want to predict and understand, we adapt or move towards the likelihood, i.e., again, the actual observation, in order to make better inference in the future.  

Step 3) As mentioned, the third element of Bayes’ theorem is the posterior, denoted P(thetadata)P(theta|data), the probability of a belief / hypothesis, after observing, or given some datadata that fits the thetatheta. This is the result of a calculation that involves another term that we encountered before, the marginal likelihood, or model evidence, denoted P(data)P(data), i.e., the probability of the datadata or an event, regardless any thetatheta, any belief, or, ex negativo, concerning every possible thetatheta (meaning thetatheta and its complement theta\overline{theta}). As mentioned before, this is linguistically defining the sum rule of probability:

P(data)=P(datatheta)P(theta)+P(datatheta)P(theta)P(data) = P(data|theta)*P(theta) + P(data|\overline{theta})*P(\overline{theta})

Fig. This figure demonstrates that the sum rule reverses the actions of the chain rule (which can be looked at as a decision tree). Note that this time the figure above refers to a different path of inference, a different chain, starting with P(data)P(data) this time. The sum rule leads to P(data)P(data) only, as data is set “constant”, so to speak (another tree of decisions would lead to data\overline{data}).

The model evidence is also called a normalization term. Since the joint probability is not summing up to 1 by itself, it has to be normalized to do so.

So how do we get to Bayes’ rule now? One way to look at it: It simply follows from the above equivalent ways to obtain the joint probability. Dividing all sides by P(data)P(data), and therefore normalizing the joint, will leaves us with the posterior probability (the joint was crossed out, to argue that we have decided for a specific course of inference — however, I could have divided the joint by the model evidence in order to represent classic conditional probability):

Another way to look at the above is as a weighting and counter-weighting of the likelihood, in order to obtain the posterior probability and update the model.

Without normalization, the posterior probability and the joint probability are “only” considered proportional to each other:

In Inferential Statistics III we also figured out that there is a relation between calculating the regression coefficient (the slope) of a linear model and the scheme of conditional probability, just for statistics (values of measurements):

The left part is spoken: the estimate of yy (dependent variable) given a change in xx (independent variable) by 1 is equal to the joint variance (or covariance), divided by the sum of squares of xx. The above can also be understood as representing the difference between statistics and probability (measurements and probabilities).

In Inferential Statistics I we have also discussed the difference between the frequentist and Bayesian “interpretation” of probability in more detail. However, the difference is not that difficult to understand, since the frequentist approach always operates with a uniform prior and therefore never updates a model (compares a series/frequency of likelihoods so to speak and works with plain frequencies of events in general). This means that given a binary of thetatheta and theta\overline{theta} we would always assign a prior probability of .5 for each possible case. It turns out that in such cases, the posterior is always equivalent to the likelihood. Recall the perspective of P(theta)P(theta) and P(data)P(data) as weight and counter-weight of the likelihood above. In the case of a uniform distribution, P(theta)P(theta) and P(data)P(data) will sort of cancel each other out. This can be interpreted as always capturing the present moment, so to speak. The frequentist appraoch therefore does not actually update a model due to the above circumstance, but compares a lot of “present” moments with each other (a frequency, e.g., via a confusion matrix within power analysis or a meta analysis). Note that Bayesian statistics (e.g. a Bayesian regression model) are computationally nothing that can be done by hand easily, different to classic approaches.

Fig. “(Yet another) history of life as we know it….” the meme was found here.

This is one reason why statistics is often taught from a frequentist perspective on statistics: it can even be done by hand. However, the actual difference to Bayesian statistics is still minor and every approach has its pros and cons, so to speak. For further details on this matter, especially on the “philosophical aspects” (which is actually not philosophical) of the above difference please see Inferential Statistics I.

That was it, you have mastered the basics of Bayes’ rule and conditional probability!

To end this chapter with a brief numeric example: Investigative reasoning has often been related to Bayes’ rule(a structural recursion / iteration of hypothesis testing; the following example can also found in Inferential Statistics I):

Fig. The innocent Sally the Coon was murdered and the famous Vanessa Holmes is investigating on the case using Bayesian inference methods. Image by Nima Sarram.

Vanessa Holmes’ new case involves four suspects, one of them being the murder of an innocent racoon. A prior probability could for now look something like [.25 .25 .25 .25]\lbrack.25\ .25\ .25\ .25\rbrack for each of four suspects, when there are no specific prior assumptions on any suspect given so far (no clues) ─ staying elementary neutral and letting the facts speak first, so to speak (initial uniform prior). Holmes checked on one of the suspects, but the first possible suspect has a clear alibi. With this new observation, our prior probabilities will subsequently change to [.33 .33 .33 .0][.33 \ .33 \ .33 \ .0], so our model was subsequently updated. Below you will find the code to do more iterations. Another example would be reasoning in the medical field in terms of differential diagnostics, as well as evidence-based medicine in general...

Fig. Vanessa Holmes, doing her damn job!

Here is some code to replicate the Vanessa Holmes example in the programming language R:

# First iteration:
joint = c(.25,.25,.25,25)*c(.33,.33,.33,0) # prior*likelihood
model_evidence = sum(c(.25,.25,.25,.25)*c(.33,.33,.33,0))
posterior = joint/model_evidence

# Result:
# [1] 0.3333333  0.3333333  0.3333333  0

# Second iteration (another suspect ruled out):
joint = c(.33,.33,.33,0)*c(.5,.5,0,0)
model_evidence = sum(c(.33,.33,.33,0)*c(.5,.5,0,0))
posterior = joint/model_evidence

# Result:
# [1] 0.5  0.5  0  0

# Third iteration - let's say all of the suspects are ruled out:
joint = c(.5,.5,0,0)*c(0,0,0,0)
# [1] 0 0 0 0
model_evidence = sum(c(.5,.5,0,0)*c(0,0,0,0))
# [1] 0 
posterior = joint/model_evidence # Note that it says: 0 divided by 0!
# [1] NaN NaN NaN NaN

# NaN = Not a number.
# This is due to the fact that 0/0 is not defined. At this point
# Vanessa Holmes would need to find another suspect to be able 
# to do further inference... 

In case you want to play around with Bayes’ rule a little more, below you will find the code for two simple functions you can use to do so. The second version of the function below also works with examples where P(data)P(data) is provided and not calculated via summing out, e.g., for correcting a test (you may have encountered examples of that kind on the web). Note that the below functions uses sum(x, nar.rm = TRUE) instead of the plain sum function, so that the 0 row in a probability vector is deleted (the sum() function doesn’t like vectors of that kind). With this slight adjustment you can also input a prior or likelihood with [0 1] as probability vector. You can also input just single values, instead of a probability vector with, e.g., thetatheta and theta\overline{theta}. However, a probability vector is what we are going to work with in the rest of this article.

#### First simple version of our Bayes_Machine() function:
Bayes_Machine = function (prior,likelihood) {
  joint = prior*likelihood
  # na.rm = TRUE in sum() deletes 0 rows if given
  modelevidence = sum(joint, na.rm = TRUE) 
  posterior = joint/modelevidence
  # Needed for console output and adds text to it
  # using a matrix, which works similar as c()
  postprint = matrix(c("Posterior",posterior,
                       "Joint probability", joint,
                       "Model evidence", modelevidence))
}  # end of function

# Give it a try with defined prior and likelihood:
prior = c(.5,.5)
likelihood = c(.5,.5)

# Try these inputs and contemplate the results:
prior = c(.1,.9)
likelihood = c(.9,.1)

#### Here is the extended function that can also handle 
#### the model evidence as input:
Bayes_Machine = function (prior,likelihood,modelevidence) {
  if (missing(modelevidence)){
    joint = prior*likelihood
    # na.rm = TRUE in sum() deletes 0 rows if given
    modelevidence = sum(joint, na.rm = TRUE) 
    posterior = joint/modelevidence
    # Needed for console output
    postprint = matrix(c("Posterior",posterior,
                         "Joint probability", joint,
                         "Model evidence", modelevidence)) 
  else {
    joint = prior*likelihood
    posterior = joint/modelevidence  
    postprint = matrix(c("Posterior",posterior,
                         "Joint probability", joint,
                         "Model evidence", modelevidence)) 
} # End of function

Bayes_Machine(likelihood = likelihood, prior = prior)

Next we are going to look at how Bayes’ rule is applied in information theory.

6 Communication as Bayes’ Inference - the Relation between Sender and Receiver as a Probabilistic Relation

Another way to understand abductive probabilistic inference — the perspective we are actually looking for — is to see it as an attempt to communicate with the world, being either a sender or a receiver of information / a message (perception/action).

As mentioned, the problem that Shannon mainly solved was rather technical: how to send a message (in the form of text or sound) to a receiver without being corrupted by noise. Let us say we send a message that only consists of a single binary, 0 or 1. A corrupted message would mean that a 1 was sent and a 0 was received. In other words, a correctly sent message can be looked at as an equivalence at two spatially distant points: sending a 1, receiving a 1 on the other end (a kind of referential redundancy). Before we get into the concept of noise, we will first discuss the idealized relation between sender and receiver without noise.

As mentioned, Shannon figured out that information entails a statistical property. You might wonder, what does it mean exactly, to understand a message as having a probability? Let us look what Shannon wrote himself in the first paragraph of ‘A mathematical theory of communication’:

“The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is, they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design.” (AMTC, p.1) 

The semantics are irrelevant!!!??? We will discuss this matter in the next chapter, and will first answer the question why a message can have a statistical property: For a message with one symbol only (binary digit), we know a priori that there are two possible messages. Given that we are the receiver and only one binary digit will be sent, we can safely say that the a priori probability that either a 1 or a 0 was sent (without noise) is 50%. Mathematically we can express this circumstance in the following way:

Let us now have a look at the complete sender receiver model (without noise!):

Think of the channel in the middle that carries the signal between the sender and the receiver as a joint (the channel that makes our connection possible). The steps between the information source and the transmitter, as well as the receiver and the destination can be understood as en- and decoding processes (Morse code etc.). Note that the steps of encoding and decoding (a binary sign for a text sign) can themselves be looked at as Bayes’ inference and they could also be corrupted by noise — within the circuitry of the en- and decoder for example. The reason could be leaking current, which represents itself as a hindrance (such that a 0 flips to 1). We will take a closer look at the concept of code in the next chapter.

Proceeding with the math, let us say that we are the receiver. What we then want to infer on is the probability of a source signal, given a received signal, P(senderreceiver)P(sender|receiver). Note that as we work with binaries below, we are inferring on discrete random variables, where discrete means that there is a ‘minimal distance’ between one and another state on which we infer (continuous variables imply that there is continuous information on infinite states between them). So let us do some math and coding, using our Bayes_Machine() function.

# Example for sending one binary digit via a noiseless (!) channel:
prior = c(.5,.5)
likelihood = c(1,0)

Bayes_Machine(prior = prior, likelihood = likelihood)
# Posterior = [1,0]

If the result surprised you, remember the fact that we are dealing with a noiseless channel, such that there will be a 100% chance that a 0 was sent, given that a 0 was received. This may of course sound a little too simple or boring from a mathematical perspective. Before we proceed to noisy channels right away, let us take a look at the concept of code and the fact that meaning is not transmitted (also concerns “human communication”, since we, e.g., do not “hear the meaning” of a message, it is an individual inference, primed on previous use of a language we learned before).

6.1 Receiving Encoded Messages without Meaning

Let us again take a look at the quote from the previous chapter:

These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design.” (AMTC, p.1)

Shannon importantly argues that the meaning of the message that is sent is not the concern of sending or receiving a message, in other words: something does not need meaning to be sent and received as a message and its meaning remains contingent (contingency essentially means: a set of possibilities). In other words, meaning is always something that is internally inferred by a system on each end, but never something that is literally sent. This is also the reason that we cannot be sure to be understood when speaking or writing to someone, even though we may have a specific intention behind sending a specific message (or to synchronize with others in general). We can only (or still) try to evoke a certain inference on the receiver side, so to speak.

If this feels odd, think of your phone or computer in front of you right now: this text just consists of letters. The fact that you may feel addressed by this text or that you assume this text to be meaningful comes from your prior beliefs and your model of the English language and your inference on this text, as well as maybe even most of all the belief that a communicational relation is possible in the first place, given something that was identified as a message.

“Receiving a message without meaning” might sound empty as if communication is an empty process, but it can be seen more as a necessary contingency, i.e., a lack of the necessity of meaning as a necessity for the possibility of communication itself (similar to the concept of arbitrariness, however the term contingency refers to a space of possibilities, not to randomness in the sense of “endless possibilities”). In consequence: It does not need an equivalence of meaning between receiver and sender upfront or in general in order to communicate (apart from the fact that personal intentions/beliefs will generate different outcomes than those of another person anyways). Making a random sound or pointing somewhere can even be used to establish new code from experience, in other words: language is not a fixed set of “meanings” that is communicated in the literal sense. At the end actual transmission of meaning makes no sense, since we do not fully share all of the same prior beliefs, we establish, learn and use language in order to try to evoke meaning in other persons. In other words, meaning is nothing that is sent and received in terms of a reification of meaning or so, i.e., an object of some kind moving through space being its own entity or so.

In the forth part of this series we will also discuss the introduction / tutorial on information theory by Warren Weaver, which was part of “THE mathematical theory of communication” in 1949. Weaver presents three levels of communication, which can be referred to the syntactic or level of mere symbols, the semantic level and the pragmatic level of communication. There we will also discuss the difference between semantic and pragmatic meaning. However, the fact that meaning is not transmitted is present in Weavers discussions as well. The first “symbolic” level of communication is the one we are going to focus on in the first part of this series.

However, what we still need in order to communicate is an agreement on the code. Such an agreement can also be understood as a special case of meaning — though if so, then this kind of “meaning” just repeats the structure of the first level of communication: a sign is equivalent with a sign (e.g. a sound as a sign).

Below you find a classic example of code equivalence: the morse code. Again, only signs are translated into other signs — so no transmission of meaning is involved in en- or decoding either.

Fig. Morse code. A classic example for code. Given the code scheme above one can encode and decode message bi-directionally (both sender and receiver). Source Wikipedia.

For humans this is similar to a phonetic sound related to a letter (symbolic meaning), and even there: noise or different coding conventions can make things ambiguous. Such code conventions are something that need to be given a priori on both sides (sender and receiver) in order to correctly decode the message ─ otherwise the code of the received signal would need to be decrypted (Shannon also wrote a paper on that topic).

Another example of code you probably heard of is the ASCII-code (American Standard Code for Information Interchange), which assigns 7 bits to each symbol (e.g. 100 0001 for the letter A) and comes with one so-called parity bit. A parity bit is used to check if noise changed the message. For the letter A in the ASCII code we would assign a zero, since the sum of all seven numbers is even (we will get to this further below after talking about noisy channels). We can also recommend looking into en- and decoding processes that involve translating a continuous function into a discrete function and back again (which essentially led to Digital Audio (CD’s, MP3) and the JPEG image format). This entails the usage of the beautiful Fourier Transformation (used for the Shannon-Nyquist Sampling theorem), which is also a very good way to understand the essentials of Quantum Mechanics (uncertainty principle) without unnecessary philosophical fuzz about it. We recommend a tutorial by 3blue1brown on this matter, as well as this tutorial on the concept of bandwidth.

Mathematically we are just going to operate with binary code in this tutorial, which also cast as machine code in the context of a computer hardware system. Programming languages such as R do not operate on that level but a higher level of symbolic abstraction which is translated into code that can be “read” and executed by a machine (e.g., higher level = sum(x), gets translated into lower level languages down to machine code, gets processed by the machine, which outputs a binary that is then retranslated into code that is easy to read for humans: the numerical output of the sum function).

Fig. Grace Hopper was the first person to develop the idea of a programming language and has developed the first compiler in 1952 and therefore paved the way for languages such as FORTRAN (FORMula TRANslation). A compiler translates a command of a programming language (like R) into a lower level language that actually does the job (machine code or later: assembly code, which is the language used to give the specific hardware the necessary commands to process the compiled higher order language commands (like FORTRAN code)). Image Source Wikimedia.

Even though a message and a code does not entail meaning for itself, a code is still something that makes meaning possible in the first place. Though, hearing a random word for itself does not provide us with the meaning of a word and it does not necessarily have to mean anything at all and also nothing necessarily intentional.

The important part — which brings us back to communication as a stochastic process — is that before any code convention, before any set equivalence, what one needs ─ as a very first ─ is a set of possibilities in order to form a communicational relation, which implies a minimal difference necessary in order to be able to relate in the first place. This is why it must be at least binary. Think of a case where there is only a 1 as a possible sign that can be sent or received:

How much information will you possibly be able to send, when there is a 100% chance to receive a 1, with or without noise? You may think of sending and receiving a frequency of ones — this would already include a difference between on and off again…

The above is often compared with a coin with only heads or tails. A game with such a coin is absolutely predictable. As mentioned, bits can also be understood as a quantity for information, and in the case of having P(x=1)=1P(x=1)=1, the amount of information is mathematically also 0, as we will see (20=12^0 = 1; we will get back to this in the second part of this series on information theory!).

6.2 Noisy Channels (incl. Binominal Distributions)

This part follows examples from a wonderful lecture held by David McKay, a famous physicist and mathematician that published on Bayes’ inference and information theory, and who unfortunately died very young, age 48, on cancer (the preserved lecture we are following in this chapter can be found here!).

Fig. David McKay. Source Wikimedia.

Adding noise, as mentioned, occasionally leads to flipped bits. Noise in the technical sense can appear anywhere within the hardware. Recall our relay switch and let’s say the isolation material of one of the cables ─ of the channel ─ became obsolete over time and now occasionally leads to electrical currents “leaking” and getting lost along the way, resulting in a 100% hindrance where there should be zero percent.

The chance of such an occurrence can be described probabilistically by sampling multiple times and comparing the received signal with the original sent message. This process of gaining information on noise in such a way can be understood as part of a negative feedback, which was also described by Norbert Wiener in his concept of cybernetics (1948), i.e., the control and regulation of processes, such as measuring and adjusting heat within a circumscribe space (thermostat) ─ closely related to information theory (developed in parallel).   

Let us say, we already obtained information on the ratio between signal and noise and we now know that in 10% of the cases the bit is flipped. To get some orientation, we will adapt a toy model, the so-called binary symmetric channel:

Fig. The binary symmetric channel. A classic toy model for understanding noisy channels.

What may look intimidating is to us just falling back to regular Bayes! Here the confidence in an output to really be 1, when a 1 was sent, and vice versa, represents a 100%(11), minus the chance of a bit flipping, say 10% (ff).

David McKay gives an exercise that goes like this: How many bits are flipped, after sending or storing 10.000 of them, where each bit has an equal probability of appearance (.5) and the chance of a bit flipping is 10% (this would be a question for quality control of, e.g., a producer of CD burners (neg. feedback): how noisy is the output of their hardware).

The first half of the calculation needed to get an answer is simple:

In order to get the mentioned values, we will use a binominal distribution. A binominal distribution is used to evaluate a series of trials, in our case 10.000, that share an equal probability of occurrence (independent of each other, or with returning the pick), as well as a binary output (true or false). All of this is given in our case; the formula for our binominal distribution of interest can be written as follows:     

Note that success is here considered observing an error, since we set its probability to p=.1p= .1. Also note that the structure of B(kp,n)B(k|p,n) is spoken: what number of successes (kk) is most likely within a range of possible successes (0 to 10.000), given a probability of a success in one trial (pp), in joint (“,”) with a number of trials (10.000). In other words, the upper formular represents a likelihood of k given p and n, as p is in joint with another factor: n (same rules of probability that we previously learned about apply, as we will see in later parts of this seies). The binominal probability distribution computed below will therefore show us what number of flipped bits is most likely to be observed, given nn and pp, i.e., it shows us the maximum likelihood estimate (of a number of trials), which will be at a 1000 flipped bits (different to maximum likelihood itself, which concerns probabilities). We have already observed the plain likelihood itself above, which is in this case ff ─ equivalent to P(rs)P(r|\overline{s}). Looking at the extended formula of the binominal dsitribution above, it just says that s\overline{s} is in joint with a number of trials or a number of possible s\overline{s}. So again, we want to infer on / estimate the number of successes kk (flipped bits), given pp and nn is set.

Let us compute the example with R. Below the parameter x equals the range of kk, size the number of trials and prob our likelihood for ff. Executing flippedbits below will lead to a list of 10.000 numbers that we will plot as a graph.

# Use dbinom() for probability distributions
flippedbits = dbinom(x = 0:10000, size = 10000, prob = .1)

# delete [1500]-[10000] for better plot overview 
plotbits = flippedbits[-c(1500:10000)] 

# Plot plotbits
plot(plotbits, typ = "l",
     		xlab = "Number of flipped bits", 
     		ylab = "Density")

Fig. R plot of a binominal distribution representing the number of flipped bits (f=.1f =.1) in 10.000 trials as a probability distribution with a maximum likelihood estimate of 1000, which states outspoken: it is most likely to observe 1000 (kk) flipped bits in 10.000 trials, when the chance of bit flipping is 10% ─ B(kp=.1,n=10.000)B(k|p=.1,n=10.000).

As things don’t always come as expected, what is missing is the standard deviation. Recall that the standard deviation, σ\sigma, is the root of the variance, σ2\sigma^2. In case the concept of standard deviation is new to you, we recommend Inferential Statistics III. However, the variance is essentially the summed, squared and averaged centralized deviation of the values of a set from the mean (centralization := xixx_i-\overline{x}). In more detail: The variance is calculated by the following steps: a) take the mean of a set of numbers, b) subtract the mean from each number within the set and take the square of each result and sum them up (squaring results in all positive number, so this is why SD is cast as +/- the mean; squaring is necessary since otherwise positive and negative centralized values would cancel each other out, when summing them up). c) finally, calculate the mean (deviation) of all of the latter results and you will get the variance, σ2\sigma^2. Take the square root of the variance in order to get the standard deviation, σ\sigma. Why take the square root? Adding the variance to a plot would result in adding a squared area to the plot. By the taking the square root we reduce the variance to one dimension, such that it only relates to the x-axis of a plot (see further below), which then becomes directly relatable to the mean again.

Go through the above in your mind, to get an intuition on what variance and standard deviation are conceptually and mathematically ─ and then we will take the short cut to obtain these values, starting with the mean, which is denoted μ\mu below.

Let’s compute the upper formulas in R:

# Number of trials
n = 10000
# P(r|not-s) = f
p = 0.1
# P(r|s) = 1-f
q = 1-p

# Mean
mu = n * p
# [1] 1000

# Variance
sigmaSQ = n * p * q
# [1] 900

# Standard Deviation
sigma = sqrt(sigmaSQ) 
# [1] 30

# Use abline() to add lines to an existing plot.
# "v =": x value(s) of vertical line(s) to be added
# Add mean
abline(v = mu, col = "red") 

# Add standard deviation to plot
abline(v = mu - sigma, col = "blue") # minus sd
abline(v = mu + sigma, col = "blue") # plus sd

Summing up the upper results, giving an answer to David McKay’s question “How many bits are flipped?”: 1000+/301000 +/- 30 (SDSD) bits.

6.2 Overcoming Noise by Adding Redundancy to a Message

In order to overcome the problem of noise, one can add redundancy. There are several ways to do so. One way is to use a so-called parity bit, which we breifly mentioned before, i.e., an extra (redundant) bit, indicating if the sum of the bits is even (1) or odd (0), therefore indicating a flipped bit (doesn’t work for two bits flipped in a set of bits though!). To give an example: let us say we encode A = 0000, then we will add an extra bit, indicating: “even”, as 0 is defined to be even. So, our encoded message that will be sent is: A = 0000 1 ─ we added redundancy, and this is what encoding in the technical sense can also essentially mean. What redundancy can tell us is that if a bit flips, then the redundant parity bit will not fit to the rest of the code sequence, summing to an odd instead of an even number. Encoded redundancy can also be used to specifically track and correct flipped bits, in order to overcome noise. Hamming-Code is a fascinating example for this and we deeply recommend the tutorial by 3blue1brown on this topic.

Redundancy and degeneracy are also commonly found in the genetic code concerning codons. The genetic code typically ─ but scientifically hard earned ─ also doesn’t involve “en- or decoding of meaning” in any common sense.

Fig. Code sun / Amino acids table. Redundancy is given in the form that various amino acids are decoded by several codons. Noise would be a flipped (changed) base, resulting in a different amino acid, but not necessarily, as due to redundancy UUU and UUC, still leading to the same amino acid (note that the code sun refers to mRNA, therefore uracil instead of thymine is used, different to the DNA code). At the same time these codons also entail degeneracy in the form of many-to-one, as the redundancy does not lead to an ambiguity concerning the encoded amino acid itself, such as in the redundancy of AGG and AGA for arginine Arg(R), both leading to the same amino acid (no ambiguity) ─ and therefore these codons are both redundant and degenerate at the same time (Image source: Wikipedia).

Another classic technique is adding redundancy and deciding for the correct code via majority vote. Let us say we have a code where A=1A = 1. Adding redundancy in order to decode via majority vote would mean we would encode a 1 for A multiple times, e.g., A = 1 => 111. If a bit flips, one can still evaluate the sent signal via majority vote: 101 = 1. See the lecture of David McKay on how this can also be calculated via Bayes’ theorem. Again: Also check on Hemming code, which is a beautiful way of encoding redundancy within huge amounts of code, with very little extra bits, and also entails a method to exactly evaluate where a bit has flipped just via binary logic!

7 Brief Outlook into Information Theory II

In the next part of this series, we will get further into the concept of information theory and will discuss the mentioned relation between statistical thermodynamics and information theory concerning the beautiful concept of entropy. Up front, the relation is pretty straight forward, since it relies on a simplification of the concept of Boltzmann/Gibbs’ entropy. The only difference between Shannon’s and the general physics approach is essentially the unit, in other words: Shannon established a definition of information in bits that relies on the formula of entropy, just without using/including the Boltzmann’s constant (drawing a relation between kinetic and thermal energy in physics). So in the next part we will mostly look at the relation/difference between the computer scientific and physics use of the concept of entropy.

Fig. Shannon’s entropy formula, including weighted probabilities (based on Gibbs’ entropy), i.e., the average surprisal of a hypothesis on an event amongst several hypotheses on an event, or signs amongst an alphabet. The Z here stands for the German word ‘Zeichen’, i.e., the word ‘sign’ (Berlin, Kreuzberg 2020; photo taken and modified by the author, the cloud that represents a capital sigma (sum sign) was created by perspective and thermodynamics itself though).

Follow this link directly to part II of this series or have a look at other tutorials of our collection Stat-o-Sphere.

Ms. Miranda, longing for feedback. Did any of this makes sense? We would like to hear from you! Similar to our open review process, every of our articles can be commented by you. Our journal still relies on the expertise of other students and professionals. However, the goal of our tutorial collection is especially to come in contact with you, helping us to create an open and transparent peer-teaching space within NOS.

Steffen Schwerdtfeger:

Second release: Thanks to the feedback of reviewers of the Summer of Math exposition I was able to correct some typos and further explained a few terms. I also made clear that the use of the midjourney images (including the pub preview images) is only done in order to relate them to information theory (stable diffusion). I also added a passage on the micromouse competion, which I learned from another SoME entry (reference given; fitted well to Shannon’s mouse Theseus that I mentioned). In general though only minor changes were made.