Review: This is a pre-release article, and we are currently looking for reviewers. Contact us via [email protected]. You can also directly send us your review, or use our open peer-review functions via PubPub (create an account here). To comment, mark some text and click on "Start discussion", or simply comment at the end of the article.

Pub preview image generated by Mel Andrews via midjourney. Thanks for the permission!

Corresponding R script:

Introduction

Welcome to "Higher Spheres", our second tutorial series, which aims to provide you with basics of computer science and mathematics that are used in fields such as computational neuroscience, (mathematical) psychology, sociology, statistics, physiology / physics, genetics, and in research and methodology on "artificial intelligence" in its various forms.

Note that this series is not going to be particularly more difficult than any of the other tutorials we have published so far, even though we are slowly moving towards advanced methods. We will not speed up the pace and will again start from a beginner's level. However, this series does try to introduce you to the basics of advanced scientific research methods that are often only used in particular fields (e.g., computational neuroscience), as opposed to general statistical methodology. The goal is especially to make accessible those methods that are otherwise hard to access in the first place (e.g., fMRI methods in neuroscience).

SIDENOTE: In general, for orientation within tutorials, we recommend making use of the contents function (upper right corner).

Some of our recently published tutorials will partially serve as background knowledge for this tutorial, especially Inferential Statistics I, which discusses conditional probability, Bayes' rule, and the logic behind frequentist statistics. No worries: we will go through most of its content again, spiced up a little and with the focus shifted to a more neuro- and computer-scientific perspective. So we will still go through most of the basics again, in a way that serves new readers and hopefully also those who have already gone through our series on inferential statistics. We will also again work with the programming language R, so we assume some basic knowledge of it, which is covered in the rather simple introductory sections of the mentioned first part of our series on inferential statistics (install R, open a script, execute some lines of code…).

However, the first part of this tutorial series on information theory is most of all written to convey a more general message on the meaning of information technology and information processing, and how best to approach it. It is intentionally written in a rather entertaining way, and it will take us a while until we actually do some (very simple) calculations in R, so no worries regarding the programming part.

Even though our series "Higher Spheres" intends to convey very specific methods, this will turn out to be only partially true in the case of information theory, since its applicability is actually very wide and also concerns methods used in general statistics, such as the Akaike and Bayesian information criteria or Markov chains (part three). The reason for this is simple: information theory is essentially just applied and advanced conditional probability / Bayes' rule.

Despite its simple mathematical background, advanced information-theoretic methods have recently become very popular and powerful, e.g., in the form of stable diffusion models, which use so-called variational inference methods (which can also be cast as a form of approximate Bayesian inference; see part V of this series). Note that we will not cover integrated information theory in any form (since I know only little about it and I am skeptical). In the field of computational neuroscience such methods are present, e.g., in predictive processing / active inference / the free energy principle: concepts that are mostly applied to model decision processes, i.e., the behavior of organisms of any kind and scale (cognitive processes, "Bayesian brain theories") and more. Another method involving variational inference worth mentioning is Dynamic Causal Modelling, which uses variational inference for testing neuronal dynamics / interpreting fMRI data (SPM). This tutorial series was initially conceived as an introduction to the very basics behind the mathematics of active inference, a concept introduced by Karl Friston, who also invented SPM. We will get back to variational methods in the form of active inference and other methods in further parts of this series.

Together with the parallel development of cybernetics, information theory also marked the beginning of a lot of fields as we know them today. In fact, you probably think in these conceptions regularly, e.g., when talking about feedback/feedforward processes, the bio-psycho-social model in medicine, (dynamical) systems theory, DNA as genetic code, homeostasis (equilibrium steady state), homeorhesis (non-equilibrium steady state), etc.

The reason why such ideas can be found everywhere in science is rather simple: it all comes down to a generalization of the mathematics of statistical thermodynamics, which is again just applied probability theory (conditional probability / Bayes' rule). If you have always had trouble understanding what (statistical) thermodynamics, and in parts quantum physics, is all about, this series could help you find a way in, even though we will not go too deep into physics.

Once you discover the basics, information theory / systems theory starts to appear as a kind of hidden paradigm in vast parts of science, especially in the medical field. This is probably the main reason why evaluating the scope of applications of information theory actually becomes a little weird, or better: pretty weird, as it appears to be everywhere around us. Another reason is the simple fact that information theory concerns the basic idea and concept behind the very object right in front of you: a computer, which nowadays feels more "natural" and "banal" to a lot of people than ever, even without them actually knowing a single bit (haha-hehe) about it.

Therefore, another goal of this series is to move beyond a mere consumer's perspective on computers and information technology in general, in order to overcome the partially dangerous tilting between over- and understating the meaning and capabilities of information technology, and especially of oneself using it. Anything else just serves misinformation about information technology in some form or another in the long run. Technology and science have advanced and spread widely, and given a lot of current crises and their complexity, paired with more and more interaction with highly advanced technology, it becomes easier and easier to abuse a lack of knowledge within society for the power-hungry spread of misinformation (even in various forms of costly campaigns against facts, as we have experienced worldwide in recent years, e.g., during the pandemic).

Looking at problems in science in general: since AI technology has advanced quickly in recent years, we may be approaching a time where an often opportunistic and irresponsible habitus of "just click your way around, you do not need to understand it, and importantly, write your name under it" will move on to: "just drop your data into the AI and don't think about it, apart from writing your name under it". A superficial approach to technology and its logic comes with great danger for society, e.g., in the form of heavily biased data sets of machine learning tools that are not properly reflected upon when applied. Knowing the math of a simple neural network means knowing that its output sort of just "represents the bias" of the data it was trained on, since that is what it does when learning: training to predict given data, in order to predict new data.

Again, not properly reflecting on methods and their output is a problem also commonly known from the application of classic statistics, but such problems appear to potentially double their blurring impact on critical thinking in the field of applied higher mathematics, especially due to their still often impressive results, despite being unreliable. We recommend, e.g., the work of Iris van Rooij, who discusses the jeopardy of using AI systems without reflecting on ethical and technical issues. Examples are, e.g., the use of AI as modern phrenology (headlines such as 'ML predicts gender or personality via facial expression') and questions about the copyright of training data. Is society hyping hubris?

Of course we all believe that science and technological advances in general should never be led by mere attitude, and no one should be forced to follow mere habits in order to flourish in science or in the world of technology. Still, this appears to be what is in fact already happening, even without AI. At least as a medical student, I feel one often has to justify oneself more when having or longing for consistent knowledge of statistics and AI, instead of "just using it". It therefore appears that problems of, e.g., faulty methodology are more an issue of how seriously statistical applications are taken in general than of specific concerns such as p-hacking. Understanding the methodology of science and basic concepts of technology should in general be much more open / transparent, accessible, and also appreciated by societies.

With this tutorial series we again want to motivate you to simply invest some hours going through scientific methodologies step by step, instead of hopping onto a habitus that focuses on performance / appearance instead of consistency and responsibility. The culture of publish-or-perish makes it clear to everyone that far too many people promote divisive attitudes and gate-keep their way up to some random and often undemocratic top. In the end it is only we who can make a difference by openly discussing and educating each other about methodologies, instead of just imitating a habit of neglecting scientific and social ideals, others in general, and in the end oneself in that respect.

Getting back to information technology and how it is actually all around us: this series on information theory might (and hopefully will!) feel like a futuristic archeology adventure, since it entails a great deal of history, dating back to the 1930s and 40s. On the other hand, understanding the basic logic behind information theory, and therefore some of the most basic logic behind a computer and a telephone network (computation in general), makes you feel like you have been living in the middle ages all along, by discovering how ridiculously futuristic the time we live in actually is: our present.

1 Brief Looks into the History of Information Technology in the 20th Century

Let us start by going back in time a little bit. One might think that when someone invents or discovers something very important that everyone uses and talks about every day, everyone will know about that person. The same goes for the discovery itself: if something is so important that everybody relies on it every day, and it still gains more and more importance with no end in sight, even though it never changed conceptually, then we may automatically assume that a great number of people within a society will understand it and know about its inventors. This is not always the case, and cults around persons are probably not that important in general either, but it is still quite weird in the case of Claude Shannon, for instance, and the development of a general information theory.

This is similar to other important figures in the field of information technology, e.g., Margaret Hamilton and her contribution to the development of the Apollo Guidance Computer, which actually flew the vessel to the moon (there is a great website on the restoration and emulation of the AGC; don't be irritated by its 90s look!). Or Katherine Johnson, even though gender and being a PoC have definitely played a role in how much we were and are able to know about them.

There have been many figures involved in the development of past and present technology, and it also didn't all start with Shannon or any of the above when it comes to information technology. Here we will just take a brief look at a small selection of people who were part of further developing science and technology, each with their own story and solutions to the problems they faced in their time. We deeply encourage you to dive deeper into this topic, especially if you have a kind of aversion towards technology. As mentioned, it is becoming more and more important to leave a mere consumer's perspective behind in that respect, since technologies such as "AI" will accompany every one of us in some form or another in the near future, and a lot of other futuristic technologies already do so casually every day. It is therefore pretty weird how many within our society treat science and technology more as a myth and a mystery.

Other interesting figures in the further development of the basics of information technology and what later became AI research are people like John von Neumann (cellular automata, "von Neumann architecture"), Warren McCulloch and Walter Pitts (concept of neural networks), Alan Turing (Turing machine, halting problem, Turing test), Richard Feynman (variational method), and many others.

If you are interested in a historically perhaps more relatable insight into the history of computing from the 20th century up to now, we recommend the BBC television show "Micro Live", which aired between 1983 and 1987, a time when the first affordable home computers thrived. This television show is particularly interesting, since it was planned in combination with an interesting and unique educational deal between the company Acorn Computers and the television network BBC (British Broadcasting Corporation). The result was the so-called BBC Micro, a micro-computer (a little computer that doesn't fill up a room) that is conceptually equivalent to the architecture of today's smartphones and laptops: a single (printed) circuit board, as opposed to a modular "multi-board" PC.

Acorn Computers is particularly important since it brought up the so-called ARM chip architecture, primarily developed by Steve Furber and Sophie Wilson. ARM stands for Acorn, and later Advanced, RISC Machine. RISC stands for "reduced instruction set (complexity) computer" (the first "c" is left out of the acronym) and refers to a computer chip built around a small set of simple machine instructions, from which complex operations, such as a division within an arithmetic unit, are composed, leading to the same result while being more efficient. The ARM chip in particular turned out to be extremely energy efficient: the first chip famously ran on leakage current alone. Have you ever wondered why Nokia mobile phones in the 90s / early 2000s had such ridiculous battery life? They were all equipped with ARM chips. Not only that: ARM chips and smaller and smaller microcomputers triggered a revolution that made the first and today's smartphones possible (which don't have as much battery life as old Nokia models, since they are equipped with energy-costly displays and are connected to the internet). Since this is a highly relatable development in information technology, it is a good way to get a hint of what was and what has changed technologically over the decades, and how it impacts the kinds of technology we use today.

The above are just some of many historic examples of people that greatly influenced the development of information technology and can be considered a personal selection in order to give you a brief insight into the realm of computation. It is not necessary to get into every detail of history, but it helps to reflect one’s own life as being accompanied by a history of technology.

2 Brief Looks into Claude E. Shannon’s Biography

We hope the above selection of historic figures of mathematics and computer science has given you some inspiring insight into the history of information technology. However, the next parts of this series will mostly focus on the work of Claude Shannon, whom we briefly introduced above.

When approaching the actual mathematics and some code, we are especially going to focus on two of his works: "A Symbolic Analysis of Relay and Switching Circuits", his master's thesis from 1937, where he essentially conceptualized a modern (digital) computer via a telephone network, and most of all his paper "A Mathematical Theory of Communication" from 1948, where he defined a general theory of information still applied today in seemingly uncountable forms. The latter work will be the one we focus on most in this series; we will only take a brief look into his master's thesis in the next chapter. However, both works are actually written in a quite accessible way to some degree, addressing rather everyday problems that we all know from habit and experience with communication devices in our lives (such as noise, for example).

In general, it is hard to track the impact of Shannon's work, as information theory was the beginning of what is called the "information age". As mentioned, he discovered a way to mathematically represent the basic logic of today's digital computers by using just a telephone (network). Together with Harry Nyquist, he also found the mathematical basics of digital audio; digital audio is literally the result of a mathematical calculation, also involving the beautiful concept of the Fourier series/transform. Shannon also worked on the basics of machine learning, e.g., in the form of the mouse "Theseus". He was also the first to contribute to the idea of a chess computer in the modern sense; in fact, due to his contribution in this field, the yearly winner of the World Computer Chess Championship is awarded the so-called Shannon Trophy.

A lot of the above discoveries happened during his time at the famous Bell Telephone Labs. Other technological and scientific inventions and discoveries within the Bell Labs were radio astronomy, UNIX, the programming languages C and C++, the transistor, the laser, the first practical solar cells, the statistical programming language S that inspired the syntax of R, etc.; in its best times, around five patents were filed a day by members of the Bell Labs (compare this lecture on the Bell Labs). Unfortunately, the best times of the Bell Labs came to a very ugly end. After several financial crises and a lot of changes (e.g., in owners), a German physicist "almost faked his way to a Nobel Prize" while working at Bell Labs (the link leads to a great and thorough documentary by the YouTuber BobbyBroccoli). This scandal eventually hit the reputation of the former Bell Labs hard. However, Shannon worked at Bell Labs at a time when it was a highly vibrant and booming idea factory, with lots of influential scientists who seemed to come up with ideas instantly and also had a lot of freedom.

Apart from his contribution to information theory, which we will get into in detail in a bit, Shannon was also famous for inventing a lot of fun stuff, such as a rocket-powered frisbee, a juggling machine, and various other playful things. Apart from chess games, we also mentioned the mouse Theseus that solves a maze. Similar to chess computer competitions, Shannon was also involved in triggering the so-called micromouse competition that has been held since the late 70s by the IEEE (Institute of Electrical and Electronics Engineers; I learned about it from this tutorial on graph theory that I reviewed as a part of the SoME competition by James Schloss and Grant Sanderson). Here is also a nice introduction to micromouse competitions by "Veritasium".

I will not go into more details of Shannon's biography here, as brevity would not do justice to the matter, but I deeply, deeply recommend looking for background information on Claude E. Shannon, e.g., documentaries, and I hope you will soon understand why. For me, understanding the basics of information theory was in a lot of ways a relief, even touching, as it brought me closer to things I use every day, now understanding at least the basic concepts behind information technology.

Understanding technology helps us to use it consciously, instead of out of pure boredom and habit. Computers are fun, but they are actual conceptual universes we often fail to explore, even though they can help us grow this way.

3 “A Symbolic Analysis of Relay and Switching Circuits”

Before we get to know what "information" actually means, we will first take a very brief look into Shannon's first groundbreaking work, his master's thesis "A Symbolic Analysis of Relay and Switching Circuits" from 1937 (link leads to a PDF), in which he showed that Boolean algebra, which operates logically with 0 and 1 (true, false), can be represented by relay and switching circuits (he does so symbolically, not technically, hence the name). With this and some other mathematical tricks, one can perform, e.g., non-binary arithmetic operations. So the logic of Boolean algebra works as an interface for the logic of general mathematics, delivering an 'equivalence' that can be functionalized as binary code, making it possible to express and process mathematics via circuitry. As mentioned, one can say that Shannon conceptually (symbolically) turned a telephone relay/switching circuit network into a computer.

Functionalizing binary logic with relay circuits is what Konrad Zuse later did when he built the first computer in the modern sense in Kreuzberg, Berlin, in 1941 (Zuse Z3) ─ actually just around 100 meters away from where I started writing this tutorial a while ago (Methfesselstraße 10 and 7 (both buildings destroyed) in 10965 Kreuzberg, Berlin). He developed ideas very similar to Shannon's as well as von Neumann's ("von Neumann architecture" of computers) in 1938 and also realized them as actual machines. Around the same time, the ENIAC (Electronic Numerical Integrator and Computer) was developed in the USA by John Presper Eckert and John Mauchly. There is a discussion about which of the above was actually the first computer in the modern sense and under what definitions; we will not go into further details here.

The motivation behind building computers at that time was finding a way out of being forced to do hundreds upon hundreds of monotonous and boring calculations as one's only labor, in a time before there were computers. It is important to underline that a "computer" at that time was essentially the job title of a human doing calculations (often women), using algorithms to solve mathematical problems and therefore all problems that can be solved in the form of a mathematical abstraction.

Now let us get to the core of what a relay switch conceptually does and have a brief look at some examples, to understand the general concept (you will not need more than that for this series!).

Let us have a look at the upper left "Figure 1": if the relay switch is closed, there is 0% hindrance (!), the current can flow, and it represents its respective binary digit, a 0. If the relay switch is open, there is 100% hindrance, subsequently representing a 1 ─ the current cannot flow. Note that the common on/off analogy for 0/1 does not work here, since on = 0% hindrance and off = 100% hindrance (this is different from today's transistors). Since this may not be as intuitive as it seems, you may want to go through it a couple of times in your mind, just in case.

By putting switches / relays in series or in relation to each other, one can, e.g., perform an addition following Boolean logic, where, e.g., 0 hindrance plus 0 hindrance equals 0, and 0 hindrance plus 1 hindrance equals 1. See all the postulates below for details (note that it is enough to get the very general idea behind it to get through this tutorial!).
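To make this concrete, here is a minimal sketch in R (the function names are ours, purely for illustration) of Shannon's hindrance convention, where a series connection of switches corresponds to addition of hindrances and a parallel connection to multiplication:

```r
# Shannon's hindrance convention: 0 = closed switch (current flows),
# 1 = open switch (current blocked).

# Series connection: current flows only if ALL switches are closed,
# so the total hindrance follows 0 + 0 = 0, 0 + 1 = 1, 1 + 1 = 1.
series <- function(x, y) as.integer(x | y)

# Parallel connection: current flows if AT LEAST ONE switch is closed,
# so the total hindrance follows 0 * 1 = 0, 1 * 1 = 1.
parallel <- function(x, y) as.integer(x & y)

series(0, 0)    # both closed: current flows (hindrance 0)
series(0, 1)    # one open: current blocked (hindrance 1)
parallel(1, 0)  # current flows through the closed branch (hindrance 0)
parallel(1, 1)  # both open: current blocked (hindrance 1)
```

Note how "addition" here saturates at 1 (1 + 1 = 1), exactly as in the postulates.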

To jump a few steps ahead: from there it was possible to create and make use of several different so-called logic gates to form circuits that perform a certain logical or subsequent mathematical task, represented by a hardware relay inference structure. Again, as circuits can perform binary logic, and binary logic is also a way of general abstraction on the basis of the principle of equivalence (code), anything can theoretically be computed. This is also where software comes into play, and differences between machine code, assembly code, source code, etc. ─ which in the end are just further hierarchical abstractions of the same logic of en- and decoding binary digits. Note that we will later see that binary logic is used because it is mostly just the easiest way to produce a technical representation of logical inference, e.g., via relay switches (so the "bit" is actually more of an arbitrarily picked and also convertible unit, referring to the base of the logarithm (2) of possible states, as we will see later below).

There are two logic gates that had been famous before Shannon developed information theory: the NAND gate (not-"and") and the NOR gate (not-"or"). Both of them have the special quality that one can replicate any possible logic gate with a circuit of NAND or NOR gates alone. Both gates, as well as their special feature and many other concepts in logic, were initially defined / further developed by C. S. Peirce. We mentioned the work of Peirce in Inferential Statistics I, since he also defined the concept of abductive inference, in contrast to (or as a basis of) deductive and inductive inference. There is also evidence that Peirce had foreseen the possibilities that Shannon used for his circuit analogy (e.g., in "Logical machines" from 1887). His work had in many ways a great influence on logic in general and on further developments of Boolean logic (here is a link to a great math tutorial on the subject of logic gates for further details (part I of III) by N. J. Wildberger).
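To see what this universality means in practice, here is a small R sketch (the function names are again just illustrative) that builds NOT, AND, OR, and XOR out of a single NAND primitive:

```r
# NAND as the only primitive gate; a and b are bits (0 or 1).
nand <- function(a, b) as.integer(!(a & b))

# Every other gate can be wired up from NAND alone:
not_ <- function(a)    nand(a, a)             # NOT(a)  = NAND(a, a)
and_ <- function(a, b) not_(nand(a, b))       # AND     = NOT(NAND)
or_  <- function(a, b) nand(not_(a), not_(b)) # OR via De Morgan's law
xor_ <- function(a, b) {                      # XOR from four NANDs
  h <- nand(a, b)
  nand(nand(a, h), nand(b, h))
}

or_(0, 1)   # 1
xor_(1, 1)  # 0
```

The same construction works with NOR as the sole primitive, which is why both gates are called "functionally complete".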

There is of course a lot more to it, especially when creating complex circuits. However, this chapter should just give you a general idea of the physical representation of information processing. Nowadays we are of course using tiny transistors: a chip from the Apple M1 series, for example, carries 16 to 114 billion transistors (the functional equivalent of relay switches).

If you want to learn more about the technical side of information technology and engineering, we recommend looking into tutorials that introduce you to the field of single-board computers and microcontrollers, such as the Raspberry Pi and others. These were explicitly invented for affordable educational purposes and are quite powerful little computers.

Here is a German tutorial for making plants curse at you when they need water. ^^ And here is an English tutorial to build an ECG (to also motivate medical students to dig further into the topic).

4 The Meaning of Communication as a Mathematical Theory

The second important paper, and the one that is especially important for us, is called "A Mathematical Theory of Communication" (AMTC) from 1948 (the full paper can be found here). In this paper, Shannon first of all solved the problem of how, and in what optimal form, to send information from one point to another, e.g., via a telephone network, without losing or changing the infoꓤⴠⱥⱡⱼⱺn, which means: without noise. Noise means, in its simplest form: one sends a 0 and the receiver receives a 1 with a certain probability within a series of sent bits ─ one also speaks of "flipped bits" in such cases. In this series on information theory, we will essentially replicate a lot of the math of the first 12 pages of AMTC.
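As a toy illustration (our own, not taken from the paper itself), we can simulate such a noisy channel in R by flipping each transmitted bit with some probability p ─ the textbook "binary symmetric channel":

```r
# A crude model of a noisy channel: each transmitted bit is flipped
# independently with probability p (a "binary symmetric channel").
set.seed(42)  # for reproducibility of the random noise

send_through_channel <- function(bits, p = 0.1) {
  flips <- runif(length(bits)) < p        # TRUE where noise strikes
  as.integer(xor(bits, flips))            # xor with TRUE flips the bit
}

msg      <- rep(c(0, 1), 10)              # 20 bits to transmit
received <- send_through_channel(msg, p = 0.1)
sum(msg != received)                      # number of flipped bits
```

With p = 0 the message arrives untouched; with p = 0.5 the output is pure noise and carries no information about the input, which is exactly the kind of situation Shannon's theory quantifies.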

Note that the word "bit", yes, stands for binary digit. However, most people think of bits in the simple sense of storage, where, e.g., 8 bits (one byte) refers to a combination of 8 zeros and ones. Later, "bit" will also be the name of a quantity that not only refers to its essential symbolic possibilities (two), or to a representation of stored information, but extends the implication of the concept of 'possibilities' to a mathematical and most of all probabilistic quantity, casting communication as a process of probabilistic inference. In the next part of this series we will also find out that it is not even important which 'unit' we use apart from binary, so reductions like "computers are only 0 or 1" miss the point in several ways. Note that the relations and differences between humans and computers, at least as drawn in computational neuroscience, lie somewhere else, as we will see. ChatGPT, for example, is for one just a very passive AI system that lacks a lot of the general structural abilities needed to perform or represent complex reasoning; bluntly spoken, it will necessarily come up with stupid, biased, i.e., dangerous suggestions at some point or another, which need further reflection. On the other hand, this circumstance is not "that" different from common human fallacies, or at least from problems of communication (which we will look further into in the fourth part of this series)... Comparisons with humans are in general a little difficult and strongly depend on how literally we draw such a comparison (a computer and a human can both solve 2+2, but they may do so in very different ways).
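As a tiny preview of this probabilistic reading of the bit (which we will derive properly in the next part of the series), the information content of an event with probability p can be measured as -log2(p) bits; a quick R sketch (the function name "surprisal" is a common term, not a built-in):

```r
# Information content ("surprisal") of an event with probability p,
# measured in bits: the rarer the event, the more informative it is.
surprisal <- function(p) -log2(p)

surprisal(1/2)  # 1 bit  (a fair coin flip)
surprisal(1/8)  # 3 bits (one of 8 equally likely outcomes)
surprisal(1)    # 0 bits (a certain event carries no information)
```

Changing the base of the logarithm merely changes the unit (base e gives "nats"), which is one way to see why "binary" is a convenient choice rather than a deep necessity.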

In general, the term computation just means: solving a problem / answering a question via an algorithm ─ any algorithm, not a specific one in a specific hardware representation. This issue of misunderstanding the concepts of information, information processing, and computation is, e.g., discussed in this paper by Richards and Lillicrap from 2022. So again, "computation is just 0 and 1" or "input and output" is neither witty, nor does it add anything to the story of what computers have to do with humans and what not, apart from again starting an old discourse on a lack of information about the concept of information and information technology itself…

5 Bayes’ Theorem / Conditional Probability (Optional)

In this chapter we start with the basics of Bayes' rule / conditional probability, before getting right into the paper AMTC, which is again just applied statistics, so to speak. If you feel you have enough background on Bayes' rule, or if you have read our tutorial Inferential Statistics I, you may want to skip this chapter.

However, the approach to conditional probability and Bayes' rule in this chapter will differ a little from our previous one: the focus will be shifted a little more towards a "subjective" perspective on general "inference" (or a subjective interpretation of probability), i.e., an intuition / concept that understands "cognition", and even ourselves as a whole (in the sense of a body / organism in the world), as performing probabilistic inference (in terms of sensation and action). This is supposed to help you understand the actual sender-receiver model and, most of all, to give you an idea of concepts in computational neuroscience that are relatable to, or direct "descendants" of, information theory (such as "Bayesian brain theories"), which we will go into in future parts of our series "Higher Spheres". Another goal is to obtain a critically reflected understanding of why people compare, e.g., neural networks with human inference in the first place (in more or less sensible ways).

We will also further extend our discourse on ideas from C. S. Peirce, e.g., the concept of abduction and further intuitions around it. As mentioned before, C. S. Peirce contributed greatly to the further development of Boolean algebra and logic. Apart from being influential in mathematics, he also greatly contributed to other fields, such as philosophy (discourses which we today may cast as epistemological or phenomenological (philosophy), or as part of cognitive science in general).

LET’S GET STARTED!

5.1 Abductive Inference in General (Optional)

In Inferential Statistics I, we have discussed Bayes’ rule / Conditional Probability mostly as a kind of scientific learning algorithm with a triadic structure in the context of hypothesis testing in general (here we mostly skip the comparison between Bayesian and frequentist statistics). We also logically related this process of probabilistic inference to the so-called abductive inference. Later we will see that Bayes’ rule/abductive inference can also be used to reflect on the relation between a sender and a receiver, i.e., communication.

Abduction can be defined as the process of generating a hypothesis to optimally explain a new observation, based on what is given ($data$, as discussed in Inf. Stat. I).

To give you an intuitive furry example, let’s say you are observing water on the ground ─ indoors(!!), which appears ‘surprising’ to you in the sense that you are ‘uncertain why’ this is the case. Evaluating previous evidence spiced with some intuition/confidence (prior beliefs) and gaining present insights (likelihood of present events, given a prior belief) may lead to the cat (Mr. Midnight, see below) being the most probable explanation for it (posterior belief): a high probability on “the cat being the cause, given that there is water on the ground”. Let’s now say, you may still be too uncertain to draw such a conclusion (low probability of the posterior), maybe because your cat has not been destructive in quite a while (or ever), or because you haven’t actually seen Mr. Midnight do it (^°^) — so you start to gather more information and you want to make your inference more precise. After doing so, you may eventually find out that it wasn’t the cat ─ but a squirrel that got lost in your rooms (updated hypothesis, e.g., due to direct sight of a squirrel or a lack of matching pawprints…).
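As a numeric sketch of the furry example (all numbers, as well as the choice of “pawprint” evidence, are made up for illustration):

```r
# Two competing hypotheses for the water on the ground:
# "cat" = Mr. Midnight did it, "other" = some other cause (e.g., a squirrel).
prior      = c(cat = .7, other = .3)  # assumed prior beliefs
likelihood = c(cat = .2, other = .8)  # assumed P(no matching pawprints | hypothesis)
joint      = prior * likelihood       # product rule
posterior  = joint / sum(joint)       # normalize by the model evidence
round(posterior, 3)
#   cat other
# 0.368 0.632  -> the alternative cause is now the more probable explanation
```

Gathering further evidence would simply mean feeding the posterior back in as the next prior.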

Note that the references we make here, cast as the specific content/variables of inference (cat, water on the ground), remain modally contingent. This means that the principle of abductive inference in the sense of conditional probability / Bayes’ rule does not rely on its content ─ it can potentially be applied to anything (contingent = similar to ‘arbitrary’).

As we have discussed in other tutorials, we can always perform some mathematical inference on the relation between the above variables — our hypothesis (e.g., the cat was the cause) and the data we get (e.g., water on the ground) — using Bayes’ rule. However, doing so does not cover a more general approach to reflecting on variables and how they are influenced, e.g., in the sense of a complex system with many, many relations and inferences being made (e.g., confounders). In other words: the ad hoc theory we come up with, in order to learn how well it can be applied to the world or even ourselves, may turn out to be all bonkers in the long run, once we gather more information or insight into possible relations. So there is always the chance that temporal and spatial expansions of inquiry (gaining further data, looking further into populations over time) overturn our confidence, our model of the relation between us and the world… However, this is what learning is all about; this is what the above circumstance makes possible in the first place. In other words, when looking at abductive inference in a positive way that does not focus on the mere fallibility of inference, we can at least say: we are able to adjust our hypothesis, we are able to overcome faulty models over time, we are able to keep learning!

So again, abduction can be defined as the process of generating a hypothesis to optimally explain a new observation, based on what is given. Note that “optimally” can mean “best” as in “best fits a model of self and world”. Thinking of it this way may seem intuitively reasonable, but it also entails the risk of overfitting, i.e., inferring something to exist, or in general to be the case, even though it actually is not ─ due to putting too much confidence in a belief (e.g., due to high reward) or in data that fits the belief. Another reason to be hesitant to use the term ‘best’ to describe a form of inference itself: the term ‘best’ may imply the presence of values, concepts of relevance etc. ─ in other words, results of abductive inference itself, which are at the same time not as universal as the form of inference itself.

What seems to be a tautology within the concept of abductive inference is actually an interesting spatio-temporal twist within the process: abduction is better understood as “inference on a better hypothesis”, compared to previous ones ─ essentially moving from past to future (prior to posterior), creating a new or updated hypothesis, and therefore a necessary latency and modal contingency concerning the epistemics of a hypothesis over time.

Another way to understand abduction, especially when cast as the prime form of inference, is alongside induction and deduction. Different from the latter two forms of inference, abductive inference actually generates something new: a hypothesis. Abduction can therefore be understood as initially providing the relationality, or better: contiguity of, e.g., two events within space and time, represented as a prior hypothesis on data that can subsequently be evaluated in a mereological way, i.e., describing the result of an abductive inference as a part-to-whole relation: induction from a part (data) to a whole, a generalization (theta), and deduction vice versa ─ abduction being the cause, or the process that makes, and therefore sets, mereological relations in the first place.

As mentioned, this fits several aspects of our everyday experience: our experience of the fallibility of inference, as well as the possibility of updating a model/hypothesis, which, over time, understood as a method, will automatically lead to falsifications to some degree, i.e., learning (this can be understood as a reason why abduction itself is not falsifiable, as it claims to be a principle of inference itself). This also fits well with our ability to actively attend to certain aspects of interest and subsequently create new and eventually more complex interests, ideas, hypotheses, perspectives, e.g., by intentionally experiencing new environmental situations (selection, planning, action).

In general, one can say: casting abduction as the prime form of inference amounts to arguing that the way we relate to and approach our environment is in its essence a speculative and rather adventurous act ─ though again: it will conceptually lead to knowledge over time.

5.2 The Formula of Bayes’ Rule and Conditional Probability (Optional)

In information theory we will discuss Bayes’ rule / conditional probability in the context of a sender-receiver relation. However, the variables as such are contingent, in other words: it can be discussed in many contexts, with potentially any variable we assign, so to speak. Even though we are mainly after understanding the relation between sender and receiver, we are going to give you a quick general overview on the use of Bayes’ rule and conditional probability.

Again, in Inferential Statistics I, we have discussed Bayes’ rule / conditional probability mostly as a kind of scientific learning algorithm with a triadic structure in the context of hypothesis testing. However, we also figured that there are many applications, such as power analysis, the p-value in frequentist statistics, analogies to investigative reasoning… In all cases we were able to orient ourselves via the below triadic structure of forming a hypothesis (prior), gathering/encountering data (present) and updating (the probability of) a hypothesis / model (even in the frequentist approach).

Note the temporal aspect of the above steps, representing the process of updating a prior hypothesis/belief/model of the world, given new observations, i.e., new experiences. The goal is therefore, after observing, i.e., given new data, to adjust our concepts ($theta$) in order to fit well to the actual data in the world we encounter in the future (speaking of plain passive perception only).

To wrap this up: abduction is the (inferential) process of finding/adjusting a model/belief/hypothesis that predicts future states well in terms of “better”, so that the overall belief does not have to change much in the future when encountering new observations of the same or similar kind (fitting our theta to data).

In case you have an idea of conditional probability from school: Note that the below usage of Bayes’ rule / conditional probability for any kind of hypothesis testing is contextually special, since it is about the prediction of an event in relation to an actual event — in other words, it is about one event, just in two “forms”: the prediction of an event and the experience of an event (essentially testing for a redundancy). This is contextually and practically different with causal event/event relations, even though the formula is the same. For example: the probability of getting COVID, given that you are in a hospital, $P(COVID|hospital)$, is a contextually different relation than landing in the hospital, given that one has COVID, $P(hospital|COVID)$.
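A quick numeric sketch of this asymmetry in R (all rates are invented for illustration, not real epidemiology):

```r
p_covid            = .05  # assumed base rate of COVID
p_hospital         = .10  # assumed base rate of being in a hospital
p_hospital_g_covid = .40  # assumed P(hospital | COVID)
# Bayes' rule gives the inverse conditional:
p_covid_g_hospital = p_hospital_g_covid * p_covid / p_hospital
p_covid_g_hospital
# [1] 0.2  -> not the same as P(hospital | COVID) = 0.4
```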

Let us again go through all the conceptual steps and find out how they are related to actual mathematical formulas:

Step 1) Our first element of Bayes’ rule is the prior (belief), denoted $P(theta)$, where $P$ stands for probability ─ the probability of a belief/hypothesis ($theta$) about the world or an event, regardless of any new events / observations, or ex negativo: a belief in relation to all past or possible observations (here both binary options, $data$ and $not-data = \overline{data}$, are meant by “all possible observations”!).

The phrase “all past or possible observations” represents linguistically what is technically called the sum rule of probability (details later below). We will see that this rule is most of all applied to calculate $P(data)$, not $P(theta)$, as we can set our initial priors ourselves, but it conceptually stands for the same, just summed out over all possible hypotheses (instead of all possible observations) ─ we will get back to this soon below.
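Written out for a binary $theta$, the sum rule looks like this in R (the numbers are toy values):

```r
p_theta            = .3   # prior P(theta)
p_data_g_theta     = .8   # assumed likelihood P(data | theta)
p_data_g_not_theta = .1   # assumed P(data | not-theta)
# Sum rule: marginalize over theta and its complement
p_data = p_data_g_theta * p_theta + p_data_g_not_theta * (1 - p_theta)
p_data
# [1] 0.31
```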

Step 2) The second element of Bayes’ theorem is the joint probability, which is sometimes also referred to as the model in general. The joint probability consists of a likelihood in relation to the prior and is denoted $P(theta,data)$, which can be read as: the probability of the actual theta “and” (“,”) the actual data occurring together at the same time, i.e., within the present.

The concept behind the joint probability can also be expressed set theoretically via a Venn diagram. $P(theta)$ and $P(data)$ then represent a set of two overlapping circles. The overlapping forms a joint probability, cast as a set of true data and theta, i.e., $P(theta , data)$. A joint can also be cast as a logical conjunction, spoken “theta and data”, equivalent to the sign “,” in $P(theta , data)$.
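The set-theoretic reading can also be sketched with a small Monte Carlo simulation in R (the probabilities are made up): drawing many samples and counting how often $theta$ and $data$ are jointly TRUE approximates the joint probability, i.e., the overlap of the two circles.

```r
set.seed(1)                   # for reproducibility
n     = 1e5
theta = runif(n) < .3         # assumed P(theta) = .3
# data is TRUE with probability .8 if theta holds, .1 otherwise (assumed):
data  = ifelse(theta, runif(n) < .8, runif(n) < .1)
mean(theta & data)            # approximates P(theta, data) = .3 * .8 = .24
```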

The only difference between Bayes’ rule and classic conditional probability is that in Bayes’ rule the joint probability can also be obtained by weighting the inverse of the posterior conditional probability (i.e., weighting the likelihood). Below you can find the formal difference between classic conditional probability and Bayes’ rule (here using the variables $A$ and $B$ instead of $theta$ and $data$).

To understand what weighting the inverse posterior means, let us have a look at the likelihood for itself: The likelihood is the probability of making an observation, given that our belief/hypothesis is set (a priori) and is denoted $P(data|theta)$, where “|” means “given” or “under the condition of” (therefore the name conditional probability). The likelihood represents our new observation, the present.

Again, the joint probability represents a “relation (conjunction) between a prior and a likelihood” and is also denoted: $P(data|theta)*P(theta) = P(data,theta)$. In other words, the weighted likelihood or “weighted inverse posterior” results in the joint. The phrase ‘relation between prior and likelihood’ linguistically also represents what is called the product rule of probability in mathematics (also known as the chain rule).

Note that in classic (or exact) Bayes, $P(data|theta)*P(theta) = P(theta,data)$ is equivalent to $P(theta|data)*P(data) = P(theta,data)$, which in some way represents the possibility of what is called the model inversion of the upper formula. So far just think of model inversion representing the attempt to move, or invert or tilt the actual observation, the likelihood $P(data|theta)$, to become our new hypothesis given $data$ (the posterior, $P(theta|data)$, see below): Because it is the world we want to predict and understand, we adapt or move towards the likelihood, i.e., again, the actual observation, in order to make better inference in the future.
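The stated equivalence of the two factorizations of the joint can be checked numerically (toy numbers):

```r
p_theta        = .3
p_data_g_theta = .8                       # assumed likelihood
p_data         = .8 * .3 + .1 * (1 - .3)  # sum rule -> 0.31
p_theta_g_data = p_data_g_theta * p_theta / p_data  # Bayes' rule
# Both sides yield the same joint probability:
p_data_g_theta * p_theta   # [1] 0.24
p_theta_g_data * p_data    # [1] 0.24
```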

Step 3) As mentioned, the third element of Bayes’ theorem is the posterior, denoted $P(theta|data)$: the probability of a belief / hypothesis after observing, or given, some $data$ that fits the $theta$. This is the result of a calculation that involves another term that we encountered before, the marginal likelihood, or model evidence, denoted $P(data)$, i.e., the probability of the $data$ or an event regardless of any $theta$, any belief, or, ex negativo, concerning every possible $theta$ (meaning $theta$ and its complement $\overline{theta}$). As mentioned before, this linguistically defines the sum rule of probability:

The model evidence is also called a normalization term: since the joint probability does not sum to 1 by itself, it has to be normalized to do so.
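A couple of lines of R make the normalization visible (toy numbers):

```r
prior      = c(.3, .7)            # assumed prior over theta and its complement
likelihood = c(.8, .1)            # assumed likelihoods
joint      = prior * likelihood
sum(joint)                        # [1] 0.31 -> not a proper distribution
posterior  = joint / sum(joint)   # divide by the model evidence
sum(posterior)                    # [1] 1   -> now it sums to 1
```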

So how do we get to Bayes’ rule now? One way to look at it is: it simply follows from the above equivalent ways to obtain the joint probability. Dividing all sides by $P(data)$, and therefore normalizing the joint, leaves us with the posterior probability:

Another way is to look at it as a weighting and counter-weighting of the likelihood, in order to obtain the posterior probability and update the model.

Without normalization, the posterior probability and the joint probability are “only” considered proportional to each other:

In Inferential Statistics III we also figured out that there is a relation between calculating the regression coefficient (the slope) of a linear model and the scheme of conditional probability, just for statistics (values of measurements):

The left part is spoken: the estimate of $y$ (dependent variable) given a change in $x$ (independent variable) by 1 is equal to the joint variance (or covariance), divided by the sum of squares of $x$. The above can also be understood as representing the difference between statistics and probability (measurements and probabilities).
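This can be checked with a few lines of R (the data points are arbitrary toy values); base R’s lm() serves as a sanity check for the hand-computed slope:

```r
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 6)
# Slope as joint variance (covariance) over the sum of squares of x:
slope = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
slope                    # [1] 0.8
coef(lm(y ~ x))[["x"]]   # [1] 0.8 -> same estimate from lm()
```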

In Inferential Statistics I we have also discussed the difference between the frequentist and Bayesian “interpretation” of probability in more detail. However, the difference is not that difficult to understand, since the frequentist approach always operates with a uniform prior and therefore never updates a model. This means that, given a binary of $theta$ and $\overline{theta}$, we would always assign a prior probability of .5 to each possible case. It turns out that in such cases the posterior is always equivalent to the likelihood. Recall the perspective of $P(theta)$ and $P(data)$ as weight and counter-weight of the likelihood above: in the case of a uniform distribution, $P(theta)$ and $P(data)$ sort of cancel each other out. This can be interpreted as always capturing the present moment, so to speak. The frequentist approach therefore does not actually update a model, but compares a lot of “present” moments with each other (a frequency, e.g., via a confusion matrix within a power analysis or a meta-analysis). Note that Bayesian statistics (e.g., a Bayesian regression model) are computationally nothing that can easily be done by hand, different from classic approaches.
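The claim that a uniform prior leaves the posterior equal to the likelihood can be verified in a couple of lines (the likelihood values are arbitrary):

```r
prior      = c(.5, .5)    # uniform prior over theta and its complement
likelihood = c(.9, .1)    # arbitrary likelihoods
posterior  = (prior * likelihood) / sum(prior * likelihood)
posterior
# [1] 0.9 0.1  -> identical to the likelihood
```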

This is one reason why statistics is often taught from a frequentist perspective: it can even be done by hand. However, the actual difference to Bayesian statistics is still minor, and each approach has its pros and cons, so to speak. For further details on this matter, especially on the “philosophical aspects” of the above difference, please see Inferential Statistics I.

That was it, you have mastered the basics of Bayes’ rule and conditional probability!

To end this chapter with a brief numeric example: investigative reasoning has often been related to Bayes’ rule (a structural recursion / iteration of hypothesis testing; the following example can also be found in Inferential Statistics I):

Vanessa Holmes’ new case involves four suspects, one of them being the murderer of an innocent raccoon. A prior probability could for now look something like $\lbrack.25\ .25\ .25\ .25\rbrack$ for each of the four suspects, as there are no specific prior assumptions on any suspect given so far (no clues) ─ staying elementarily neutral and letting the facts speak first, so to speak (initial uniform prior). Holmes checked on one of the suspects, and this first suspect turned out to have a clear alibi. With this new observation, our probabilities subsequently change to $[.33 \ .33 \ .33 \ 0]$, i.e., our model was updated. Below you will find the code to do more iterations. Another example would be reasoning in the medical field in terms of differential diagnostics, as well as evidence-based medicine in general...

Here is some code to replicate the Vanessa Holmes example in the programming language R:

# First iteration:
joint = c(.25,.25,.25,.25)*c(.33,.33,.33,0) # prior*likelihood
model_evidence = sum(c(.25,.25,.25,.25)*c(.33,.33,.33,0))
posterior = joint/model_evidence
# Result:
# [1] 0.3333333 0.3333333 0.3333333 0
# Second iteration (another suspect ruled out):
joint = c(.33,.33,.33,0)*c(.5,.5,0,0)
model_evidence = sum(c(.33,.33,.33,0)*c(.5,.5,0,0))
posterior = joint/model_evidence
# Result:
# [1] 0.5 0.5 0 0
# Third iteration - let's say all of the suspects are ruled out:
joint = c(.5,.5,0,0)*c(0,0,0,0)
# [1] 0 0 0 0
model_evidence = sum(c(.5,.5,0,0)*c(0,0,0,0))
# [1] 0
posterior = joint/model_evidence # Note that it says: 0 divided by 0!
# [1] NaN NaN NaN NaN
# NaN = Not a number.
# This is due to the fact that 0/0 is not defined. At this point
# Vanessa Holmes would need to find another suspect to be able
# to do further inference...

In case you want to play around with Bayes’ rule a little more, below you will find the code for two simple functions you can use to do so. The second version of the function also works with examples where $P(data)$ is provided and not calculated via summing out, e.g., for correcting a test (you may have encountered examples of that kind on the web). Note that the functions below use sum(x, na.rm = TRUE) instead of the plain sum() function, so that possible NA entries in a probability vector are ignored (the plain sum() function returns NA for vectors of that kind). With this slight adjustment you can also input a prior or likelihood such as [0 1] as a probability vector. You can also input just single values instead of a probability vector with, e.g., $theta$ and $\overline{theta}$. However, a vector is what we are going to work with in the rest of this article.

#### First simple version of our Bayes_Machine() function:
Bayes_Machine = function(prior, likelihood) {
  joint = prior * likelihood
  # na.rm = TRUE in sum() ignores possible NA entries
  modelevidence = sum(joint, na.rm = TRUE)
  posterior = joint / modelevidence
  # Needed for console output; a matrix adds text labels
  # to the values (works similar to c())
  postprint = matrix(c("Posterior", posterior,
                       "Joint probability", joint,
                       "Model evidence", modelevidence))
  print(postprint)
} # end of function
# Give it a try with defined prior and likelihood:
prior = c(.5,.5)
likelihood = c(.5,.5)
Bayes_Machine(prior,likelihood)
# Try these inputs and contemplate the results:
prior = c(.1,.9)
likelihood = c(.9,.1)
Bayes_Machine(prior,likelihood)
#### Here is the extended function that can also handle
#### the model evidence as input:
Bayes_Machine = function(prior, likelihood, modelevidence) {
  if (missing(modelevidence)) {
    joint = prior * likelihood
    # na.rm = TRUE in sum() ignores possible NA entries
    modelevidence = sum(joint, na.rm = TRUE)
    posterior = joint / modelevidence
    # Needed for console output
    postprint = matrix(c("Posterior", posterior,
                         "Joint probability", joint,
                         "Model evidence", modelevidence))
    print(postprint)
  }
  else {
    joint = prior * likelihood
    posterior = joint / modelevidence
    postprint = matrix(c("Posterior", posterior,
                         "Joint probability", joint,
                         "Model evidence", modelevidence))
    print(postprint)
  }
} # End of function
Bayes_Machine(likelihood = likelihood, prior = prior)

Next we are going to look at how Bayes’ rule is applied in information theory.

6 Communication as Bayes’ Inference - the Relation between Sender and Receiver as a Probabilistic Relation

Another way to understand abductive probabilistic inference — the perspective we are actually looking for — is to see it as an attempt to communicate with the world, being either a sender or a receiver of information / a message (perception/action).

As mentioned, the problem that Shannon mainly solved was rather technical: how to send a message (in the form of text or sound) to a receiver without being corrupted by noise. Let us say we send away a message that only consists of a binary. A corrupted message would mean that a 1 was sent and a 0 was received. In other words, a correctly sent message can be looked at as an equivalence at two spatially distant points: sending a 1, receiving a 1 on the other end. Before we get into the concept of noise, we will first discuss the idealized relation between sender and receiver without noise.

As mentioned, Shannon figured out that information entails a statistical property. You might wonder: what does it mean, exactly, to understand a message as having a probability? Let us look at what Shannon wrote himself in the first paragraph of ‘A mathematical theory of communication’:

“The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is, they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design.” (AMTC, p.1)

The semantics are irrelevant!!!??? We will discuss this matter in the next chapter. For now, let us first answer the question why a message can have a statistical property: For a message with one symbol only (a binary digit), we know a priori that there are two possible messages. Given that we are the receiver and only one binary digit will be sent, we can safely say that the a priori probability that either a 1 or a 0 was sent (without noise) is 50%. Mathematically we can express this circumstance in the following way:

Let us now have a look at the complete sender receiver model:

Think of the channel in the middle that carries the signal between the sender and the receiver as a joint (the channel that makes our connection possible). The steps between the information source and the transmitter, as well as the receiver and the destination can be understood as en- and decoding processes (Morse code etc.). Note that the steps of encoding and decoding (a binary sign for a text sign) can themselves be looked at as Bayes’ inference and they could also be corrupted by noise — within the circuitry of the en- and decoder for example. The reason could be leaking current, which represents itself as a hindrance (such that a 0 flips to 1). We will take a closer look at the concept of code in the next chapter.

Proceeding with the math, let us say that we are the receiver. What we then want to infer on is the probability of a source signal, given a received signal, $P(sender|receiver)$. Note that as we work with binaries below, we are inferring on discrete random variables, where discrete means that there is a ‘minimal distance’ between one and another state on which we infer, and continuous implies that there is continuous information on infinite states between them. So let us do some math and coding, using our Bayes_Machine():

# Example for sending one binary digit via a noiseless (!) channel:
prior = c(.5,.5)
likelihood = c(1,0)
Bayes_Machine(prior = prior, likelihood = likelihood)
# Posterior = [1,0]

If the result surprised you, remember that we are dealing with a noiseless channel, such that there is a 100% chance that a 0 was sent, given that a 0 was received. This may of course be a little too simple or boring from a mathematical perspective. Before we proceed to noisy channels right away, let us take a look at the concept of code and the fact that meaning is not transmitted.

6.1 Receiving Encoded Messages without Meaning

Let us again take a look at the quote from the previous chapter:

“These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design.” (AMTC, p.1)

Shannon importantly argues that the meaning of the message that is sent is not the concern of sending or receiving a message, in other words: something does not need meaning to be sent and received as a message and its meaning remains contingent (contingency essentially means: a set of possibilities). In other words, meaning is always something that is internally inferred by a system on each end, but never something that is literally sent. This is also the reason that we cannot be sure to be understood when speaking or writing to someone, even though we may have a specific intention behind sending a specific message. We can only try to evoke a certain inference on the receiver side, so to speak.

If this feels odd, think of your phone or computer in front of you right now: this text just consists of letters. The fact that you may feel addressed by this text or that you assume this text to be meaningful comes from your prior beliefs and your model of the English language and your inference on this text, as well as maybe even most of all the belief that a communicational relation is possible in the first place, given something that was identified as a message.

“Receiving a message without meaning” might sound as if communication were an empty process, but it is better seen as a necessary contingency, i.e., the absence of meaning as a necessity for the possibility of communication itself (similar to the concept of arbitrariness). In consequence: no equivalence of meaning between receiver and sender is needed upfront, or in general, in order to communicate (apart from the fact that personal intentions/beliefs will generate different outcomes for different persons anyway). Making a random sound or pointing somewhere can even be used to establish new code from experience. In other words: language is not a fixed set of “meanings” that is communicated in the literal sense. In the end, an actual transmission of meaning makes no sense, since we do not fully share the same prior beliefs; we establish, learn and use language in order to try to evoke meaning in other persons. Meaning is therefore nothing that is sent and received in the sense of a reification, i.e., an object of some kind moving through space as its own entity.

In the fourth part of this series we will also discuss the introduction / tutorial on information theory by Warren Weaver, which was part of “THE mathematical theory of communication” in 1949. Weaver presents three levels of communication: the syntactic level of mere symbols, the semantic level, and the pragmatic level of communication. There we will also discuss the difference between semantic and pragmatic meaning. However, the fact that meaning is not transmitted is present in Weaver’s discussion as well. The first, “symbolic” level of communication is the one we are going to focus on in this first part of the series.

However, what we still need in order to communicate is an agreement on the code. Such an agreement can also be understood as a special case of meaning — though if so, then this kind of “meaning” just repeats the structure of the first level of communication: a sign is equivalent with a sign.

Below you find a classic example of code equivalence: Morse code. Again, only symbols are translated into other symbols — so no transmission of meaning is involved in en- or decoding either.

For humans this is similar to a phonetic sound related to a letter (symbolic meaning), and even there: noise or different coding conventions can make things ambiguous. Such code conventions are something that need to be given a priori on both sides (sender and receiver) in order to correctly decode the message ─ otherwise the code of the received signal would need to be decrypted (Shannon also wrote a paper on that topic).

Another example of code you have probably heard of is the ASCII code (American Standard Code for Information Interchange), which assigns 7 bits to each symbol (e.g., 100 0001 for the letter A) and comes with one so-called parity bit. A parity bit is used to check whether noise changed the message. For the letter A in ASCII we would append a zero, since the sum of all seven digits is even (we will get to this further below, after talking about noisy channels). We can also recommend looking into en- and decoding processes that involve translating a continuous function into a discrete function and back again (which essentially led to digital audio (CDs, MP3) and the JPEG image format). This entails the usage of the beautiful Fourier transformation (used for the Shannon-Nyquist sampling theorem), which is also a very good way to understand the essentials of quantum mechanics (uncertainty principle) without unnecessary philosophical fuss about it. We recommend a tutorial by 3blue1brown on this matter, as well as this tutorial on the concept of bandwidth.
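A minimal sketch of the even-parity check in R (the 7-bit pattern for ‘A’ is standard ASCII; the corrupted vector is an invented example):

```r
bits   = c(1, 0, 0, 0, 0, 0, 1)  # ASCII 'A' = 100 0001
parity = sum(bits) %% 2          # even number of ones -> parity bit 0
parity
# [1] 0
# A single flipped bit makes the overall sum odd, so the receiver can
# detect (though not locate) the error:
corrupted = c(1, 0, 0, 1, 0, 0, 1)
sum(c(corrupted, parity)) %% 2
# [1] 1 -> parity check fails
```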

Mathematically we are just going to operate with binary code in this tutorial, which can also be cast as machine code in the context of a computer hardware system. Programming languages such as R do not operate on that level but on a higher level of symbolic abstraction, which is translated into code that can be “read” and executed by a machine (e.g., the higher-level sum(x) gets translated into lower-level languages down to machine code, gets processed by the machine, which outputs a binary that is then retranslated into code that is easy to read for humans: the numerical output of the sum function).

Even though a message and a code do not entail meaning in themselves, a code is still something that makes meaning possible in the first place. However, hearing a random word by itself does not provide us with the meaning of that word; it does not necessarily have to mean anything at all, and nothing necessarily intentional either.

The important part — which brings us back to communication as a stochastic process — is that before any code convention, before any set equivalence, what one needs ─ as a very first ─ is a set of possibilities in order to form a communicational relation, which implies a minimal difference necessary in order to be able to relate in the first place. This is why it must be at least binary. Think of a case where there is only a 1 as a possible sign that can be sent or received:

How much information will you possibly be able to send, when there is a 100% chance to receive a 1, with or without noise? You may think of sending and receiving a frequency of ones — this would already include a difference between on and off again…

The above is often compared with a coin that shows only heads (or only tails) on both sides. A game with such a coin is absolutely predictable. As mentioned, bits can also be understood as a quantity of information, and in the case of $P(x=1)=1$ the amount of information is mathematically also 0, as we will see ($2^0 = 1$; we will get back to this in the second part of this series on information theory!).
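The claim that a single certain outcome carries 0 bits can already be checked numerically: the number of bits needed to distinguish $m$ equally likely possibilities is $\log_2(m)$ (a sketch; the full entropy formula follows in part II):

```r
# Bits needed for m equally likely possibilities: log2(m)
log2(1)  # 0 bits: a "coin" that always lands on the same side
log2(2)  # 1 bit: a fair binary choice (heads/tails, 0/1)
log2(8)  # 3 bits: e.g. three fair coin flips in a row
```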

This part follows examples from a wonderful lecture held by David MacKay, a famous physicist and mathematician who published on Bayesian inference and information theory, and who unfortunately died of cancer at the young age of 48 (the preserved lecture we are following in this chapter can be found here!).

Adding noise, as mentioned, occasionally leads to flipped bits. Noise in the technical sense can appear anywhere within the hardware. Recall our relay switch, and let us say the insulation material of one of the cables ─ of the channel ─ degraded over time and now occasionally leads to electrical currents “leaking” and getting lost along the way, resulting in a 100% hindrance of the signal where there should be none: a sent 1 occasionally arrives as a 0.

The chance of such an occurrence can be described probabilistically by sampling multiple times and comparing the received signal with the originally sent message. Gaining information on noise in such a way can be understood as part of a negative feedback loop, which was also described by Norbert Wiener in his concept of cybernetics (1948), i.e., the control and regulation of processes, such as measuring and adjusting heat within a circumscribed space (thermostat) ─ closely related to information theory (the two were developed in parallel).
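This sampling procedure can be sketched in a small, hypothetical simulation (all names and the simulation setup are ours, not from the lecture): send known bits through a noisy channel, compare sent and received, and estimate the flip rate.

```r
# Estimate the flip probability f by comparing sent and received bits.
set.seed(1)                                  # for reproducibility
f    <- 0.1                                  # true (to-be-estimated) flip probability
sent <- sample(0:1, 10000, replace = TRUE)   # random message bits
flip <- rbinom(10000, size = 1, prob = f)    # 1 = this bit gets flipped by noise
received <- (sent + flip) %% 2               # flipping a bit = adding 1 modulo 2
mean(sent != received)                       # estimate of f, close to 0.1
```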

Let us say we have already obtained information on the ratio between signal and noise, and we now know that in 10% of the cases a bit is flipped. To get some orientation, we will adopt a toy model, the so-called binary symmetric channel:

What may look intimidating is to us just falling back on regular Bayes! Here, the confidence that an output really is a 1 when a 1 was sent ─ and vice versa ─ is 100% ($1$) minus the chance of a bit flipping, say 10% ($f$): $P(r=1|s=1) = 1-f = .9$.
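Falling back on regular Bayes can be made concrete with a short sketch (variable names are ours): with $f = .1$ and equal priors for sending a 0 or a 1, the posterior confidence that a received 1 was really sent as a 1 is again .9.

```r
# The binary symmetric channel in terms of Bayes' rule.
f <- 0.1
prior_s1  <- 0.5       # P(s=1), equal prior for sending a 1
lik_r1_s1 <- 1 - f     # P(r=1|s=1): transmitted correctly
lik_r1_s0 <- f         # P(r=1|s=0): a flipped 0
# Posterior P(s=1|r=1) = P(r=1|s=1)P(s=1) / P(r=1)
post <- lik_r1_s1 * prior_s1 / (lik_r1_s1 * prior_s1 + lik_r1_s0 * (1 - prior_s1))
post                   # 0.9: confidence that a received 1 was really sent as 1
```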

David MacKay gives an exercise that goes like this: how many bits are flipped after sending or storing 10,000 of them, where each bit has an equal probability of appearance (.5) and the chance of a bit flipping is 10%? (This would be a question for the quality control of, e.g., a producer of CD burners (negative feedback): how noisy is the output of their hardware?)

The first half of the calculation needed to get an answer is simple: on average we expect $n \cdot f = 10{,}000 \cdot .1 = 1{,}000$ flipped bits.

In order to get the mentioned values, we will use a binomial distribution. A binomial distribution is used to evaluate a series of trials, in our case 10,000, that share an equal probability of occurrence (independent of each other, i.e., drawing with replacement), as well as a binary output (success or failure). All of this is given in our case; the formula for our binomial distribution of interest can be written as follows:

$$B(k|p,n) = \binom{n}{k} \, p^k \, (1-p)^{n-k}$$

Note that “success” is here considered observing an error, since we set its probability to $p= .1$. Also note that the structure of $B(k|p,n)$ is spoken: what number of successes ($k$) is most likely within the range of possible successes (0 to 10,000), given a probability of a success in one trial ($p$), in joint (“,”) with a number of trials ($n$ = 10,000). In other words, the formula above represents a likelihood of $k$ given $p$ and $n$, as $p$ is in joint with another factor: $n$ (the same rules of probability that we previously learned about apply, as we will see later). The binomial probability distribution computed below will therefore show us what number of flipped bits is most likely to be observed, given $n$ and $p$, i.e., it shows us the maximum likelihood estimate (of a number of flipped bits), which will be at 1,000 flipped bits (different to the maximum likelihood itself, which concerns probabilities). We have already observed the plain likelihood itself above, which is in this case $f$ ─ equivalent to $P(r|\overline{s})$. Looking at the extended formula of the binomial distribution above, it just says that $\overline{s}$ is in joint with a number of trials, or a number of possible $\overline{s}$. So again, we want to infer on / estimate the number of successes $k$ (flipped bits), given that $p$ and $n$ are set.
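The formula can also be checked against R’s built-in dbinom() by writing it out by hand (a sketch; we use a small $n$ here, since choose(10000, k) overflows for our large example, while dbinom() handles it internally):

```r
# The binomial formula written out by hand, compared with dbinom().
binom_manual <- function(k, p, n) choose(n, k) * p^k * (1 - p)^(n - k)
binom_manual(2, 0.1, 10)          # probability of exactly 2 flips in 10 bits
dbinom(2, size = 10, prob = 0.1)  # same value
```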

Let us compute the example with R. Below, the parameter x equals the range of $k$, size the number of trials, and prob our likelihood $f$. Executing flippedbits below will lead to a vector of 10,001 numbers (one probability for each $k$ from 0 to 10,000) that we will plot as a graph.

# Use dbinom() for binomial probability distributions
flippedbits = dbinom(x = 0:10000, size = 10000, prob = .1)
# Keep only the first 1500 values for a better plot overview
plotbits = flippedbits[1:1500]
# Plot plotbits as a line graph
plot(plotbits, type = "l",
     xlab = "Number of flipped bits",
     ylab = "Density")

As things don’t always come as expected, what is still missing is the standard deviation. Recall that the standard deviation, $\sigma$, is the square root of the variance, $\sigma^2$. In case the concept of standard deviation is new to you, we recommend Inferential Statistics III. In essence, the variance is the summed, squared and averaged centralized deviation of the values of a set from their mean (centralization := $x_i-\overline{x}$). In more detail, the variance is calculated by the following steps: a) take the mean of a set of numbers, b) subtract the mean from each number within the set, square each result and sum them up (squaring results in all positive numbers, which is why the SD is cast as +/- around the mean), and c) finally, take the mean of all of the latter results and you will get the variance, $\sigma^2$. Take the square root of the variance in order to get the standard deviation, $\sigma$. Why take the square root? Adding the variance to a plot would amount to adding a squared area to the plot. By taking the square root we reduce the variance to one dimension, such that it only relates to the x-axis of a plot (see further below), which then becomes directly relatable to the mean again.
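The steps a) to c) can be followed line by line on a small toy set (our own example values, chosen so that the results come out as whole numbers):

```r
# Variance and standard deviation computed step by step.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
m <- mean(x)                         # a) mean of the set: 5
centralized <- x - m                 # b) centralized deviations x_i - mean
variance <- mean(centralized^2)      # c) average of the squared deviations
variance                             # 4  (sigma squared)
sqrt(variance)                       # 2  (sigma, the standard deviation)
# Note: R's var() divides by n-1 (sample variance), so it differs slightly.
```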

Go through the above in your mind to get an intuition of what variance and standard deviation are conceptually and mathematically ─ and then we will take the shortcut to obtain these values, starting with the mean, which is denoted $\mu$.

For a binomial distribution these values can be obtained directly: $\mu = np$, $\sigma^2 = npq$ and $\sigma = \sqrt{npq}$, where $q = 1-p$. Let’s compute these formulas in R:

# Number of trials
n = 10000
# P(r|not-s) = f
p = 0.1
# P(r|s) = 1-f
q = 1-p
# Mean
mu = n * p
# [1] 1000
# Variance
sigmaSQ = n * p * q
# [1] 900
# Standard Deviation
sigma = sqrt(sigmaSQ)
# [1] 30
# Use abline() to add lines to an existing plot.
# "v =": x value(s) of vertical line(s) to be added
# Add mean
abline(v = mu, col = "red")
# Add standard deviation to plot
abline(v = mu - sigma, col = "blue") # minus sd
abline(v = mu + sigma, col = "blue") # plus sd

Summing up the above results and giving an answer to David MacKay’s question “How many bits are flipped?”: $1000 \pm 30$ ($SD$) bits.

6.2 Overcoming Noise by Adding Redundancy to a Message

In order to overcome the problem of noise, one can add redundancy. There are several ways to do so. One way is to use a so-called parity bit, which we mentioned above, i.e., an extra (redundant) bit indicating whether the sum of the bits is even (0) or odd (1), therefore making a flipped bit detectable (this doesn’t work when two bits flip within a set of bits though!). To give an example: let us say we encode A = 0000, then we will add an extra bit indicating “even”, as 0 is defined to be even. So, our encoded message that will be sent is: A = 0000 0 ─ we added redundancy, and this is what encoding in the technical sense can also essentially mean. What redundancy can tell us is that if a bit flips, the parity bit will no longer fit the rest of the code sequence: the bits will sum to an odd instead of an even number. Encoded redundancy can also be used to specifically locate and correct flipped bits, in order to overcome noise. Hamming code is a fascinating example of this and we deeply recommend the tutorial by 3blue1brown on this topic.
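A minimal sketch of even-parity encoding and checking (the function names are ours), including the mentioned blind spot for two flips:

```r
# Even-parity encoding and single-flip detection.
encode_parity <- function(bits) c(bits, sum(bits) %% 2)  # append parity bit
check_parity  <- function(word) sum(word) %% 2 == 0      # TRUE = no error detected
msg  <- c(0, 0, 0, 0)
word <- encode_parity(msg)   # 0 0 0 0 0
check_parity(word)           # TRUE: nothing flipped
word[2] <- 1                 # noise flips one bit
check_parity(word)           # FALSE: the flip is detected
word[3] <- 1                 # a second flip...
check_parity(word)           # TRUE again: two flips go unnoticed!
```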

Redundancy and degeneracy are also commonly found in the genetic code, concerning codons. Notably ─ and this was scientifically hard-earned knowledge ─ the genetic code also doesn’t involve “en- or decoding of meaning” in any common sense.

Another classic technique is adding redundancy and deciding on the correct code via majority vote. Let us say we have a code where $A = 1$. Adding redundancy in order to decode via majority vote would mean encoding the 1 for A multiple times, e.g., A = 1 = 111. If a bit flips, one can still evaluate the sent signal via majority vote: 101 = 1. See the lecture by David MacKay on how this can also be calculated via Bayes’ theorem. Again: also check out Hamming code, which is a beautiful way of encoding redundancy within huge amounts of code with very few extra bits, and which also entails a method to evaluate exactly where a bit has flipped, just via binary logic!
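The repetition code with majority vote can be sketched in a few lines (function names are ours; this is the code MacKay calls R3 in his lecture):

```r
# Repetition code: send each bit three times, decode by majority vote.
encode_r3 <- function(bit) rep(bit, 3)
decode_r3 <- function(word) as.integer(sum(word) >= 2)  # at least 2 of 3
sent     <- encode_r3(1)   # 1 1 1
received <- c(1, 0, 1)     # one bit flipped by noise
decode_r3(received)        # 1: the single flip is corrected
```

Note that two flips within one triplet would still defeat the majority vote, which is why the residual error rate only drops from $f$ to roughly $3f^2$.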

7 Brief Outlook into Information Theory II

In the next part of this series, we will get further into information theory and will discuss the mentioned relation between statistical thermodynamics and information theory concerning the beautiful concept of entropy. Up front, the relation is pretty straightforward, since it relies on a simplification of the concept of Boltzmann/Gibbs entropy. The only difference between Shannon’s and the general physics approach is essentially the unit; in other words, Shannon established a definition of information in bits that relies on the formula of entropy, just without using/including Boltzmann’s constant (which draws a relation between kinetic and thermal energy in physics). So in the next part we will mostly look at the relation/difference between the computer science and physics use of the concept of entropy.

Second release: Thanks to the feedback of reviewers of the Summer of Math Exposition, I was able to correct some typos and further explain a few terms. I also made clear that the use of the midjourney images (including the pub preview image) is only done in order to relate them to information theory (stable diffusion). I also added a passage on the micromouse competition, which I learned about from another SoME entry (reference given; it fitted well with Shannon’s mouse Theseus that I mentioned). In general though, only minor changes were made.