# [R] Inferential Statistics I: Hypothesis Testing in the Basic Form of Conditional Probability / Bayes' Rule

# Introduction ─ Structure of the Series

# 1 Hypothesis Testing

## 1.1 The Prior Hypothesis

## 1.2 Gathering Evident Data

## 1.3 Evaluating and Adjusting the Hypothesis to Fit the Data

# 2 Conditional Probability and Bayes’ Theorem

# 3 Computing Conditional Probability / Bayes’ Rule in R

# 4 The Bayesian and the Frequentist Approach to (Conditional) Probability

# Appendix: Our Submission for the >>Summer of Math Exposition II<<

Absolute Beginner's Sphere

Published on Aug 11, 2022


**Review:** This tutorial was internally reviewed by the BEM-Editors Rico Schmitt, Phillippa Schunk and Joëlle Lousberg.

*PDF-Version and Summary:*

Welcome to our first tutorial series within our editorial collection “Stat-o-Sphere”. **You have just entered the absolute beginner’s sphere, so we hope your stay will not come with any turbulent inconveniences and we wish you a pleasant and insightful trip through our tutorial**. As this is our first editorial series within this collection, we decided to give you a thorough overview of two concepts that we believe are most important for understanding scientific reasoning as such and that are also often heavily misunderstood, namely the concept of **hypothesis testing** (this tutorial, concerning probabilities) and **statistical modelling** (starting in part two of this series, concerning estimates of measurements). Knowing at least these two concepts in detail is especially important for those of you who seek a *stable* heuristic overview of what statistical inference is all about, without digging into every possible method. On the way we will also dip into the basics of descriptive statistics, distributions of various kinds and other topics such as probability density functions.

As mentioned in the introduction of our tutorial collection, we are trying to provide you with *slow-paced* tutorials that potentially entail the *conceptual / intuitive*, the *mathematical* and the *computer* *scientific* perspective (programming) on statistical methods. In this first tutorial series, we will provide a *thorough* introduction into inferential statistics on all of the above ‘levels of abstraction’ at one place.

Following the above scheme, the tutorials are structured more or less hierarchically, moving from the intuition to the mathematics and eventually to the code in R, using functions such as `lm()` for linear models, all coming together in the interpretation of the output results (most important for review). **Note that the first part of the series introduces R only to do calculations that can be done by hand or with a calculator as well!** **So in case you are scared to get into programming languages, fear no more**. We are trying to introduce R as what it essentially is: a very sophisticated calculator.

Not every area might be of your immediate interest. This is also the reason why we established **a short summary at the end of every chapter** that can also be used as a chapter overview for the impatient reader. In the future, we will also add chapters going through the statistics of open data papers, which will provide you with a wider range of examples with different levels of complexity and further insight to the application of statistical methods in the wild, so to speak.

Corresponding to a modular educational expansion of actual scientific work, published via our student journal, we will also add condensed chapters that will focus on aspects that are not only important for the application, but mostly for the *interpretation* of the results ─ e.g. within review processes. **This will not be the case for this article though, as we are here introducing a rather general concept of mathematics, present in the field of statistics in various ways.**

As the following tutorials also introduce coding as such, we chose to start from an “absolute” beginner’s level. This comes with the advantage that this tutorial can theoretically also be understood and **mastered by school students to some degree** (we may test for that in the future). We also want to offer a full recap of the mathematics to those who did not start studying soon after school, or who are not that familiar with the matter for any other reason, especially a lack of interest or an aversion. In the end, statistics paves the path of *every* medical inference nowadays and it does so for very good and even *intuitive* reasons, as we will see. It is easy to click through a computer program nowadays in order to obtain mathematical values as a kind of power of speech within society, marking one’s argumentation as being evident. But for that very reason it is also dangerous when those who claim such powers have only little understanding of what they do and what the consequences of false or untransparent information can be. In the end, statistics is the logic of science and it can therefore be demanded that scientists know their methods by heart, at least those that can be considered basic.

Statistical modelling will be the core of our first series of tutorials on inferential statistics, discussed on the basic example of a *linear regression model* (eventually moving to other methods in the future, such as linear mixed effects models). Hypothesis testing in the mathematical sense will be important especially when comparing means or when evaluating the results of our linear regression, i.e., the full output that we have obtained using the programming language R, especially when discussing the *p-value*, obtained via a z-test/t-test as a specific form of hypothesis testing on, e.g., linear regression models or a difference in means...

However, *hypothesis testing in its conceptual and mathematical sense is what statistical modelling is for in the first place*, so we will start our journey into the *Stat-o-sphere* by going through its steps. This will give us a clear view on what the basic structure of a p-value *actually* is in terms of probability theory as well as conceptually, and will show us the rather banal difference between the *frequentist* and *Bayesian interpretation* of conditional probability. There is also a vast variety of mathematical algorithms, called “tests”, that all result in and reflect on p-values of a model, so we decided that it is best to *not* start with a specific test for a specific type of model, but with explaining the concept of hypothesis testing itself and how the p(robability)-value represents not just any, but a specific *conditional probability*.

We believe that discussing conditional probability as a start is the most economic approach to statistics and data science in general, as it provides a conceptually consistent overview of a vast variety of methods involved in statistical reasoning, apart from the p-value (to name a few: positive predictive value, even thermodynamics, information theory (Shannon), AIC, BIC, ‘machine learning’ (e.g., diffusion models), computational neuroscience (“Bayesian Brain Hypothesis”) etc.). Another reason is that hypothesis testing understood as conditional probability is actually really simple and surprisingly intuitive *and doesn’t need any mathematical background at all to be understood*.

**However, it appears as if most tutorials leave out the topic of conditional probability in the first place** and jump right into using concepts such as the variance, t-value, confidence intervals, standard normal distributions etc., so our approach will hopefully close an important gap for those seeking further insight and a *clear intuition* on each of the concepts by themselves and in relation to each other. To cut it short: the p-value is a distinct conceptual composition, which however still begins with conditional probability / Bayes’ rule.

**If all of this still sounds like a lot:** The webcomic further above shows that **hypothesis testing involves nothing you don’t know or wouldn’t do already** and we hope that our tutorial will leave you as surprised as we are, when realizing how easily the steps of testing a hypothesis can be represented by mathematical terms, without much of an effort and without losing any of the conceptual intuition and magic behind it.

In general, every scientific study, every experiment, every process of ‘testing’ involves going through the following three steps in some way or another (even in descriptive statistics we explore data in relation to what we expect, the full argumentation is just not represented via mathematics):

1. formulating a **(prior) hypothesis**: that “what-ever something” is the case
2. **gathering (new) data** that is *related* to our hypothesis
3. **evaluating the results** (testing the hypothesis) and adjusting the prior hypothesis in order to better predict new data (**updating the hypothesis**)

So far this may not reflect on all the formulas and processes that pre-informed readers may expect, when performing statistical analysis. Nevertheless, it makes up the core of every statistical analysis in some way or another. Let’s go through them in detail:

In the beginning of every scientific evaluation there is a claim, i.e., a *prior* **hypothesis**. A prior hypothesis can be understood as a belief about the world, *regardless* of any present or future experiences, or in other words: *before an experience was made that could prove or disprove the hypothesis* (the prior hypothesis can also be looked at as the *sum of all past experiences* on a hypothesis). An “experience” in statistics is often called an event. In general, an “experience” or an event is termed *data* in statistics, which represents events in the form of the outcome of measurements of any kind; at this point of inquiry, these measurements have not yet been made.

The next step in any scientific investigation is a process of gathering experiences, observing events: **gathering data**. This can be done in various ways.

The most important constraint, with which readers may be familiar to some degree, is the constraint of events being **evident**. The term

Another way evidence as an attribute is commonly expressed is by saying “*data is given*”. The phrase “*data is given*” is etymologically redundant, as *data* originated in the Latin language and also means “that which is given”. It stands in contrast to *what is set in advance*, i.e., our (hypo)theta, the prior hypothesis (the term *hypothesis* originated in Greek and means “to place under”, or *set before* in the sense of our temporal hierarchy of the three steps of scientific inquiry).

The above terminological use of the term ‘evident’ may appear confusing, considering the everyday use of the term evidence: Note that the practice of saying that a study has shown evidence for a belief or hypothesis often implicitly jumps from the prior constraint on data acquisition to the interpretation of the outcome of the hypothesis testing in the mathematical sense. Both involve evidence, either as a constraint, or as a possible interpretation of statistical results. This still partially overlaps with the concept of the *significance of a model* in an unfortunate way, as the significance is not the only marker for ‘evidence’ of any kind, as we will see. As the COVID pandemic and developments in recent years have often massively blurred the view on science, we believe that it is important to note here that **evidence-based medicine** is not just an advocacy on how to interpret the outcome of a (statistical) hypothesis test correctly (the significance of a model), but also a discourse on the evidence of inquiry as such. This approach to infer “on the world” therefore

**In comparison to our first step**, consisting of our prior hypothesis only, **the second step of “gathering data” can now be looked at as an actual or *present relation* between the *data* and our (hypo)*theta***. In probability theory this relation can be looked at logically as a

Those with prior knowledge may recall that the so-called *null hypothesis* refers to a case where the hypothesis is set to be *false* (

However, in probability theory the conjunction between data and theta is also called a *joint (probability)*. Later we will add actual probabilities to the Venn diagram, suggesting that for any of the possible ‘joint combinations’, there is a certain probability ranging from 0 to 1 assigned to it. Technically this joint probability is what can be called a *probabilistic model*: a model of the relation of *theta* and *data*. However, our linear model in the next part of this tutorial series will mathematically **not be** exactly the same, as it is “not made out of” probability values, but out of the values of a measurement (e.g., tea drunk within time). This is where hypothesis tests in the mathematical sense come into play, as a t-test can be seen as a method to obtain values (such as the t-value) that can be used to obtain a p-value for a linear model. In other words: there are ways to “look at” or evaluate the results of a linear model under the consideration of probabilistic relations.

**We will get back to all of this in detail soon. For now, just hold on to the idea of an overlapping** of our hypothesis and data that fits to it in the sense that the assumptions hold in both “areas” ─ *theta* and *data* both being *true* or ‘the case’.

The consequence of gathering *data* ─ or “making experiences” ─ is usually that **we evaluate and eventually adjust our hypothesis to (better) fit the actual experience**, the actual *data*. We have to say “usually” as it is unfortunately not a “self-evident” practice, given a high tendency to produce positive results only in science (‘publication bias’, resulting in a lot of studies not even getting close to the third step of statistical inference, which is unfortunate in a lot of ways).

The result of our evaluation will eventually become our *new prior* hypothesis. The better our hypothesis, the better we are to predict future events. Note that adjusting or updating the hypothesis in the mathematical sense concerns updating the *numerical probability value* of a hypothesis, representing a change in the “confidence” towards a certain hypothesis (details in the next chapter), *not* inventing a completely new thesis (e.g., no changes in the variables or so, which could be referred to as HARKing (**H**ypothesizing **A**fter the **R**esults are **K**nown), which is essentially faking a course of inference). However, in a wider sense, updating a hypothesis in terms of changing our beliefs still represents what we do as a consequence of gathering and evaluating data in the long run: we change/update our model of the world, gathering new insights over time (Bayes’ rule therefore represents “learning”, as we will see).

To give you another simple non-feathery example of what updating a hypothesis means, given a *negative* result: if we were to experience that a ball *never* falls down by itself, we would adjust or “doubt” our hypothesis of gravitation in the long run.

Now that we have roughly gone through all the steps of hypothesis testing, let us look at them from a mathematical perspective.

**We will leave a summary at the end of every chapter, to give you an overview of what we have learned so far:**

Translating our discussion above into mathematics is fairly easy. There are only two *minor* features we have to include into our intuition on hypothesis testing to make it work smoothly.

1. **Categorical variables:** The binary distinction *true* and *false* can be looked at as a categorical distinction. Other categories are also possible, such as *heads* or *tails*. In the latter case it depends on what you choose first as $\theta$, either heads or tails, to translate the coin into a binary. However, there is no restriction to binary categories. The categories that were chosen are in themselves contingent, but of course still depend on pre-set decisions *we make* (e.g., the number of categories). Below we will mostly work with binary outcome options.
2. **Probability values:** Our confidence in a hypothesis will be represented in the form of probability values. Before, we were just working with the categorical distinction of *true* and *false* (which could be represented by a 1 and a 0 only). Each of them will now just get assigned a probability *between* 0 and 1, making inference a lot more dynamic. This is importantly *not the same* as the binary distinction *true* and *false*, as these are just *categories*, such that there could be a 0.1 probability assigned to *theta*. However, for the addition of probabilities to our variables to be recognizable, our variables *theta* and *data* will now be *symbolically marked* by a *P* for ‘probability’ standing in front of them, such that our prior hypothesis, i.e., our hypothesis regardless of any (new) data, will be written as $P(\theta)$.

**To get an intuition on probabilities in general:** we could theoretically assign everything we experience a probability, especially as a very ‘prior relation’ we have with the world in order to orientate ourselves within it is time and space (fitting well to inference in the sense of learning and inference as a form of making better guesses (predictions) or ‘bets on reality’ over time). Only if something were to never change, it would be assigned a 100% chance to be the case and such an estimate would involve infinite observation, if someone wants to be certain over a wide timespan as well. So one general way of looking at probability is as a representation of a frequency of events, e.g., in a certain context of time for example.

Now that we are set, we can simply go through the same three steps again, including the mathematical symbols used to address what we have gone through conceptually already. It is also important to highlight the simple and intuitive temporal hierarchy corresponding to the steps.

The mathematical denotation for the prior and the joint probability does not add much to what we have gone through so far; only the third step, the *posterior probability*, should be symbolically new to us. Spoken out loud, the posterior probability is read as “the probability of theta, given or under the *condition* of data”, where the sign “|” means *given* or *under the condition of*. **Hence the name: conditional probability**. The temporal course from theta to data (our condition) may appear inverted to you, and this is partially true! The posterior probability actually refers to the data being a ‘*prior condition*’ to the hypothesis this time, as the **posterior** probability reflects

Let us go *one step back* and fully forget about the posterior for now, in order to focus on what a joint probability actually is. So far, we referred to the second step as an overlapping or a *joint* between theta and data. Both the joint probability and the conditional probability reflect a relation between two variables. However, the **difference between a conditional probability**, such as the posterior, **and the joint probability is simple**: A joint probability is considered a probability where the specific course of conditions *is not yet defined or decided*, such that either “theta as prior condition for data” or “data as prior condition for theta” could be obtained from evaluating the joint (note that this refers to a prior only in terms of our temporal hierarchy, as the prior is usually denoted

The joint probability, at least the way it is denoted above, therefore reflects the probability of encountering theta *and* data in general, *not* under a particular condition of our temporal course of inference, so to speak. In other words, the course of inference is *potentially* **bi-directional**: we could retrieve

There is still another way of representing and especially calculating the joint probability, **and this is where Bayes’ theorem comes into play**:

**Note that Bayes’ rule and conditional probability are essentially the same.** However, classic conditional probability does not reflect the joint probability by itself as the result of a *weighted conditional probability* that can also be represented via a decision tree, as we will see below (this is the only real difference, apart from the historic remarks referring to Thomas Bayes, which we will not get into here). Bayes’ rule just extends conditional probability on that matter, so to speak. Note that we will in general not work with the complement of data, as making no observations, or making observations other than the ones we actually make, is not what we can logically achieve when doing inference (there is another more mathematical answer to this, also concerning set theory and the Venn diagram, but we will not go into further details in this tutorial).

To get a closer look at **one specific way of obtaining the joint probability**, let us zoom in on one particular *chain of decisions* made.

With the chain rule we have introduced two things: a mathematical algorithm to calculate the joint probability from one course of inference only, and the already mentioned likelihood, which is itself a *conditional* probability.

To make sense of the likelihood: Remember Ms. Miranda, when she was on her way to make new observations? This is what the **likelihood** is actually about: gathering data, *under the condition* of theta, **moving from step I to step II**. The evaluation of the joint eventually resulted in the **posterior**, representing the inverse condition, **moving from step II to step III** ─ the probability of theta, *after* we observed data.

What Bayes’ rule does mathematically is what we literally just have gone through conceptually: the process of hypothesis testing more or less inverts the likelihood to become the posterior. “More or less” as this is numerically only the case under special prior conditions, as we will see, but it is still true: Bayes’ rule is a method to invert *the experience under the condition of a hypothesis* to become *the hypothesis under the condition of (new) experiences* made (which eventually becomes the future prior and therefore influences or changes the next joint probability formed with the new prior, i.e., the model is updated).

**To wrap this up:** The chain rule above shows that the joint probability not only consists of the likelihood, moving from step I to step II, but also of the prior itself, which can importantly be understood as a *weight* in the form of an expectation on the likelihood. The likelihood again represents the observation, the experiences we make, our *data, given our hypothesis* (or the prob. of data *after* we formed a hypothesis). But what does *weighting* actually mean? The concept is actually simple: something can be weighted to have a higher probability of occurrence in general (over time) than something else, irrespective of the observation (regardless of any data!). In other words: a model can entail a prior “confidence” in a hypothesis which may be higher than the prior confidence in its *complement* and eventually influences how an event is evaluated *a posteriori (after making an experience)*. If this sounds abstract, think of ‘classical conditioning’ in psychology as a classic example, where a conditioned weight, i.e., a prior expectation, influences the behavior/inference of a subject (built upon weighted inference) in relation to certain events (the behavior is therefore different from that of unconditioned subjects over time; in other words: classic Bayes’ rule represents a “learning algorithm”).

Note up front that a probability of .5 for theta and .5 for its complement refers to a special kind of “balanced prior” weight, which is also referred to as “uniform” prior ─ we will get there soon.
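To make the idea of weighting a little more tangible, here is a minimal sketch that uses R purely as a calculator (R itself is introduced properly in chapter 3). All numbers and object names (`likelihood`, `prior_uniform`, `prior_weighted`) are assumptions chosen for illustration only; they are not taken from the tutorial's example.

```r
# Hypothetical values, chosen only to illustrate the idea of weighting
likelihood     = .8   # P(data | theta): probability of the data, given theta
prior_uniform  = .5   # "balanced" prior: theta and its complement weigh the same
prior_weighted = .9   # prior confidence clearly in favour of theta

# Chain rule: joint = likelihood * prior
joint_uniform  = likelihood * prior_uniform
joint_weighted = likelihood * prior_weighted

joint_uniform   # 0.4
joint_weighted  # 0.72
```

The likelihood is the same in both cases; only the prior weight differs, and with it the resulting joint probability.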

**In order to fully make sense of all of the above, we can now finally discuss Bayes’ rule via actual mathematical formulas**. Let us start with the fact that there are two approaches to obtain one and the same joint probability, since:
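Written out in the notation used so far (a reconstruction from the surrounding description, with $\theta$ for the hypothesis and $data$ for the observations), the two routes to one and the same joint probability can be stated as:

$$
P(\theta \mid data) \cdot P(data) \;=\; P(\theta, data) \;=\; P(data \mid \theta) \cdot P(\theta)
$$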

On the far-left we see the posterior multiplied or weighted by

Above we can see how Bayes’ rule can be understood as a way to obtain a joint probability *from* (weighted) conditional probabilities. However, Bayes’ rule is essentially about a specific conditional probability: the posterior. Let us first take the simplest route to Bayes’ theorem just by taking basic rules of equations into consideration. In order to obtain the *posterior* probability from the formula above, we could just divide the whole line of equation by

Just in case it appears confusing, the joint probability is crossed out too, as the result of our actions leaves us with a *conditional* probability only (deciding for the course of inference). The conditional relation is mathematically represented by the division (our inferential “counter weight”). The last line eventually represents what is called **Bayes’ rule**.

**An easy way to remember Bayes’ rule is to follow the variables clockwise, starting with the posterior, and chant them, going: theta-data // data-theta // theta --- data** (with a little break in-between the prior and p(data), which stand for themselves). As it just repeats an inversion of the pair theta and data, it is not too hard to remember (also conceptually).
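Following the chant clockwise, and again reconstructing the notation from the text (posterior, then likelihood, then prior and model evidence), the rule reads:

$$
P(\theta \mid data) \;=\; \frac{P(data \mid \theta) \cdot P(\theta)}{P(data)}
$$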

As mentioned, the posterior can be understood as the result of *the joint* *being divided by the weight* of

Our prior *model evidence* works as a kind of counter weight to the joint as a whole (not just the likelihood), resulting in the posterior probability.
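For readers who already want to see the “counter weight” at work, the following is a minimal sketch in R (introduced properly in chapter 3). The numbers are hypothetical, and the object names `prior`, `likelihood`, `joint`, `evidence` and `posterior` are assumptions made for this illustration. The model evidence is obtained here by summing the joint over theta and its complement.

```r
# Hypothetical numbers for theta and its complement (illustration only)
prior      = c(theta = .5, complement = .5)  # P(theta), P(not theta)
likelihood = c(theta = .8, complement = .3)  # P(data | theta), P(data | not theta)

# Joint probabilities via the chain rule (element-wise multiplication)
joint = likelihood * prior

# Model evidence P(data): summing the joint over theta and its complement
evidence = sum(joint)

# Posterior: the joint divided by the "counter weight" P(data)
posterior = joint / evidence

posterior       # roughly 0.73 for theta and 0.27 for its complement
sum(posterior)  # dividing by the evidence makes the posterior sum to 1
```

Dividing by the model evidence is what turns the two joint values into proper probabilities again, which is one way of reading the term “counter weight”.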

To make the above relation between weight and “counter weight” clearer, we could also rearrange Bayes’ rule in the following way as a relation between likelihood and weights:

Another way of reflecting on the joint probabilities is via a conditional probability table, which also reveals a way how we can obtain

**Note that $P(data)$ is a value that can also be provided**, in order to correct or compare it with the assumed model evidence of just one event (or the events a model works with). You may have already found respective examples during your own research, where the model evidence is not calculated via summing out, but provided. A common example is a weather forecast: What is the probability of tomorrow being sunny, say, given that it is rainy today? In mathematical terms this would be denoted as:

Apart from that, Bayes’ rule is in general used to test a test, i.e., the evidence of a test: how well it predicts. It does so by considering a longer range of observations over time, or a wider population in the form of a general frequency of *two conditional dimensions*: the *actual data (set as hypothesis)*, i.e., true and false, and the *predicted data*: positive and negative. The model evidence is here more clearly used to reflect on the evidence of a statistical model, i.e., our joint probability. The far-right refers to probabilities of the joint and the model evidence, to clearly relate it to Bayes’ rule, even though the middle part of the equation does not yet indicate probabilities (is not *normalized* in this case), but the number of events of a certain kind.

A CPT for the binary dimensions “true/false” and “positive/negative” is also referred to as a *confusion matrix*. Such a confusion matrix can in a lot of cases be literally confusing. Recall that the classic p-value is mostly referred to as *not* the null hypothesis being false and significant. We believe that the redundant *ex negativo* approach is one reason why people fail in understanding p-values and conditional probability in inferential statistics in general, even though it just represents the steps of an argumentation we make.

In general, in case you want to explain the above either to yourself or to others, this is unfortunately something you have to be certain about. This also concerns being certain about how the columns and rows are labelled by yourself or others (there is in principle no convention). We will get back to this topic in the fifth part of this tutorial series, discussing what is commonly known as power analysis, a term you may have encountered before.
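As a small illustration of why labelling matters, here is a sketch of a confusion matrix in R with explicitly named rows and columns. The counts are made up for this example (they do not stem from any study), and the layout (rows = predicted, columns = actual) is just one possible choice, since there is no fixed convention.

```r
# Hypothetical counts; rows = predicted data, columns = actual data
confusion = matrix(c(40, 10,   # predicted positive: 40 actually true, 10 actually false
                      5, 45),  # predicted negative:  5 actually true, 45 actually false
                   nrow = 2, byrow = TRUE,
                   dimnames = list(predicted = c("positive", "negative"),
                                   actual    = c("true", "false")))

# Normalize the counts into joint probabilities
joint = confusion / sum(confusion)

# Model evidence P(positive): sum over the row "positive"
p_positive = sum(joint["positive", ])

# Posterior P(true | positive), also called the positive predictive value
ppv = joint["positive", "true"] / p_positive
ppv  # 0.8
```

Because the rows and columns carry names, the code reads the same way the conditional probability is spoken: the joint of “positive” and “true”, divided by the evidence for “positive”.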

That was it, **you have now mastered the essentials of hypothesis testing reflected as conditional probability and Bayes’ rule respectively,** both on the conceptual and on the mathematical level. In the next chapter we will do some calculations to gain some experience with applications of the math that we have learned so far using R. After that we will introduce the difference between the Bayesian and the frequentist interpretation of conditional probability aka Bayes’ rule and finally reveal what a *p-value* actually is (we have indirectly encountered it before).

Below you see the posterior on the far-left side of the equation, the classic formula to obtain such a conditional probability as

After elaborations on the concept and the mathematics behind hypothesis testing (its mathematical logic of inference), we will now do some calculations in R for a change (the application of the mathematical logic). This will help us to get a clear orientation on what the results of the upper formula may look like and we will also numerically prove some assumptions we made above. **Note that using R will be much *easier* than it might sound, as we will be using it more as a simple calculator for now (note that all of the following can also be done ‘by hand’, which also holds for the linear regression, as we will see).** We will also introduce R again in the second part of this tutorial series, which also involves things such as plotting, so no worries if this appears either too much or too basic to you for now.

**Below you will find code for the programming language R** that can be used to calculate the posterior probability. Note again, **if you are new to R** you can either just read this tutorial and with it read the code, as we will provide the output for every line of commands, or you **just download R and RStudio, install them, open a new script and copy and paste the code provided below along the text into your script, or you download and open the R script we provided below** (or type it in yourself, if you wish to do so).

**In case you are trying to install R on an Apple Computer**, you have to decide for a version for Apple silicon (M1/M2) or older Intel Macs. If you don’t know which version, just try out which works for you.

**R script, corresponding to the tutorial:**

The first lines of our script will be just a test and look like this:

```
# This is a test, which will also be the name of the ‘object’
test = 2 + 5 # Execute this line!
```

Note that lines that start with `#`, as well as anything *within a line* placed *after* a `#`, are considered a “comment” and will not be ‘understood as code’ by R (so you can also mark and execute them and it will not mess up anything). Any other text will be interpreted as code, such as a calculation; if it is not valid code, this will lead to (mostly enigmatic) errors presented in the console below the script (see figure below for details on the interface of RStudio).

**Mark the lines you want to** **execute** and **press ALT+ENTER**. **In case you are using Apple hardware, use the command key instead of ALT.** We also recommend using the keyboard when operating within the script: use **SHIFT+ARROW** to mark and demark letters (left/right) or whole lines (up/down).

You can also execute comments, so if a script is set as a whole, one can also mark all and then execute the script as whole. **The result** can be seen in the *environment* *tab* on the upper right side of RStudio (see figure below). If you ever feel that your script is presenting a funny output (especially after a series of errors), **clear the environment via the** **brush tool** ─ there is another brush-icon placed in the console tab to clear the console. **Now mark the name of an object only** **and press ALT+ENTER again to obtain the results in the console** (below the script) ─ you won’t need to know much more for this tutorial for now, believe us!

Your actions should result in the following console output (ignore the `[1]` for a moment).

```
# Console output:
# [1] 7
```

Note that we can represent our probability variables *either as single value or as a probability vector that sums to 1*, such that we will always consider theta and its complement at the same time. We will start with using single values first.
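As a small sketch of the two representations mentioned above (the object names `theta` and `theta_vector` are ours and only serve as an illustration), a single probability value can be turned into a probability vector by adding its complement:

```r
# Single value: probability of theta (e.g., heads)
theta = .5
# Probability vector: theta and its complement (1 - theta)
theta_vector = c(theta, 1 - theta)
# A probability vector always has to sum to 1
sum(theta_vector)
# [1] 1
```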

The following example will refer to anything you consider a *theta* and for which (evident) data is given. For a simple example, we will start with flipping a coin, arguing that the coin is fair, such that a 50%/50% chance is given to encounter either heads or tails (or 0 or 1; true or false…). **As this is an example, we can provide ourselves with data and set the likelihood ourselves.** Recall that we need the likelihood to calculate the joint probability via Bayes’ rule and that the likelihood is a conditional probability in itself ─ reflecting the transition from step I to step II within hypothesis testing (I. forming a hypothesis; II. making (new) experiences / gathering data). The likelihood can therefore be understood as the probability of the data *after* we have formed a hypothesis (in contrast to the inverted posterior, which reflects moving from step II to step III and expresses the probability of theta *after* data was obtained).

```
# Define your prior, e.g., .5 for heads.
# Note that R is a case sensitive language (“prior” not same as “Prior”).
prior = .5
# Likelihood
likelihood = .5
```

Now we can define the likelihood corresponding to our prior hypothesis. It can either be .5 again, which would represent a truly fair coin (after some rounds of flipping the coin) ─ and suggests that our hypothesis holds. Or you assign a value of .6 or any other value deviating from .5, which would suggest that the coin is phony and that our prior assumptions about the coin are wrong.

Either way, we can now calculate the joint probability, as well as the model evidence. Again, copy the code below into the script. Everything after a **#** will be cast as a comment:

```
# Joint probability:
joint = prior*likelihood
# Output:
# [1] 0.25
```

Now we can calculate the model evidence. Recall that the mathematical definition is as follows (discussed in chapter two: sum rule, e.g., Fig. 11):
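Written out for the binary coin example, the sum rule adds up the joint probability of theta and the joint probability of its complement. The symbols used here ($\theta$ for the hypothesis, $\neg\theta$ for its complement and $d$ for the data) are our own shorthand and may differ from the notation of the figures in chapter two:

$$
P(d) = P(d \mid \theta)\,P(\theta) + P(d \mid \neg\theta)\,P(\neg\theta) = .5 \cdot .5 + .5 \cdot .5 = .25 + .25 = .5
$$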

```
# Model evidence (note that R does not allow spacing within names!):
model_evidence = .25 + .25
```

The color marking refers to the result of each of the joint probabilities ─ the sum of both results in the model evidence of .5.

We can now calculate the posterior:

```
# Posterior:
posterior = joint/model_evidence
# [1] 0.5
```

In this case the prior hypothesis was *not* *really* updated, such that *prior and posterior remain the same*! What if the likelihood or prior changes?
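To explore this question, the single-value steps above can be wrapped into a small helper. This is only a sketch: the function name `posterior_single` is hypothetical, and it assumes a binary case in which the complement has the prior `1 - prior` and the likelihood `1 - likelihood`, as in the coin example.

```r
# Hypothetical helper: Bayes' rule for a single binary hypothesis.
# Assumption: the complement has prior 1 - prior and likelihood 1 - likelihood.
posterior_single = function(prior, likelihood) {
  joint = prior * likelihood
  joint_complement = (1 - prior) * (1 - likelihood)
  model_evidence = joint + joint_complement  # sum rule
  joint / model_evidence
}

posterior_single(prior = .5, likelihood = .5)  # fair coin, as above
# [1] 0.5
posterior_single(prior = .5, likelihood = .6)  # the likelihood changes
# [1] 0.6
posterior_single(prior = .8, likelihood = .6)  # the prior changes as well
# [1] 0.8571429
```

Changing one input at a time and re-running the respective line makes it easy to see which of the two values moves the posterior.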

To get a better overview of the process above, we are now going to slightly expand the math: next we are not only using single probability values, but probability *vectors* to apply Bayes’ rule, which lets us calculate the posterior of both theta and its complement at once. In R, a vector is created with the function **c()**, with which objects of any kind can be combined as a list of values, objects or even a list of lists (note that there is also a function called **list()**, which is not structured in rows and columns, but sequentially enumerates elements ─ also of various kinds of classes (vector, matrix, single values, whole data frames, all in one list)).

However, the formula and code using probability vectors differ *just slightly* from what we have gone through so far. Note that for non-binary outcomes the vector would just be expanded to hold 3+ values. The prior probability distribution you see below is also referred to as a *uniform prior* **(will be important in a bit)**. Let us first go through the formulas and then do some computation with R. **Play around with the input values and you might notice a special characteristic of the posterior for yourself when changing the values of the likelihood but keeping the prior equally distributed (uniform).** The formula below formally does not include the complement, but it does so implicitly by using the probability vectors. Keep in mind that every vector has to sum to 1 when changing the input values below. **Also keep in mind that if you change a line you have to execute the respective line and every line involved *again* to obtain the new results.**

```
# Define prior
prior = c(.5, .5)
# Likelihood
likelihood = c(.5, .5)
# Joint probability
joint = likelihood*prior
# Model evidence
model_evidence = sum(joint)
# Posterior
posterior = joint / model_evidence
```

Above we have calculated the **model_evidence** using the **sum()** function. Basically, this function does what it says: it takes an input, such as our joint probability vector with two values, and adds up its elements. **Keep in mind that an R script is executed from top to bottom.** The R script we provided can theoretically be executed as a whole (mark all and execute). However, it may be that a variable, e.g., with the name **prior**, gets *redefined* in lower parts of a script (changes in the environment!). In other words: the content of the previous object with the same name **prior** will be “overwritten”, so to speak.

**Have you figured out what happened, when using a uniform prior, changing the likelihood only?** You are right! *The posterior will always match the likelihood*. How is this possible? The reason is that in such a case the weight and the “counter weight” eliminate each other. Let us take a look at the formula to understand what that means mathematically:
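A sketch of that cancellation for the binary case, using our own shorthand ($\theta$ for the hypothesis, $\neg\theta$ for its complement, $d$ for the data) and assuming, as in the code above, that the likelihood vector sums to 1:

$$
P(\theta \mid d) = \frac{P(d \mid \theta)\cdot .5}{P(d \mid \theta)\cdot .5 + P(d \mid \neg\theta)\cdot .5} = \frac{P(d \mid \theta)}{P(d \mid \theta) + P(d \mid \neg\theta)} = P(d \mid \theta)
$$

The uniform prior of .5 appears in every term and cancels out; what remains in the denominator is the sum of the likelihood vector, which is 1.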

The above can also be generalized to the following, where

**With this we have already revealed the most essential difference between the frequentist and Bayesian interpretation of Bayes’ rule, as the frequentist approach always assumes a uniform prior, such that the posterior will be equivalent to the maximum likelihood (which we will come to in detail in further parts of this tutorial series).**
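The equivalence between posterior and likelihood under a uniform prior can also be checked numerically. The sketch below uses made-up likelihood values and a hypothetical helper `bayes_vector()`; it only assumes, as before, that every probability vector sums to 1.

```r
# Hypothetical helper: Bayes' rule for probability vectors
bayes_vector = function(prior, likelihood) {
  joint = prior * likelihood
  joint / sum(joint)  # divide by the model evidence
}

# Uniform prior, binary case: the posterior matches the likelihood
likelihood = c(.7, .3)
posterior = bayes_vector(prior = c(.5, .5), likelihood)
all.equal(posterior, likelihood)
# [1] TRUE

# Uniform prior with three categories: same result
likelihood_3 = c(.2, .5, .3)
posterior_3 = bayes_vector(prior = rep(1/3, 3), likelihood_3)
all.equal(posterior_3, likelihood_3)
# [1] TRUE

# Informed (non-uniform) prior: the posterior no longer matches the likelihood
posterior_informed = bayes_vector(prior = c(.9, .1), likelihood = c(.7, .3))
round(posterior_informed, 3)
# [1] 0.955 0.045
```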

As we know, the likelihood is *almost* what is typically referred to as **the p-value, testing for the null hypothesis** ─ which would be denoted as “the probability of the data, given that the null hypothesis is *true*” ─ don’t get confused, it is the same as saying *theta* to be *false*. **There are also some other minor twists concerning the t-test, the confidence interval, the power of a study, density functions etc. that we have to go through when trying to fully understand how we get to a p-value *given a linear regression model* (again this will be done in the *further* parts of this series).** However, we are still well prepared to discuss the general difference between the Bayesian and the frequentist approach to probability in the next chapter.

How does the special case of equivalence between posterior and likelihood fit our intuition? In general, a uniform prior can be considered as taking a “neutral position a priori” ─ at least neutral regarding the weight of the *pre-set contingencies* of our categories (above it was binary). Equivalent to our coin example we assume a kind of ‘fairness’ *a priori*. This makes sense in a lot of ways ─ in others it absolutely doesn’t and contradicts a neutral position due to ‘false balance’, due to a uniform prior. In order to get a grip on what that means, we will have a look at another famous xkcd webcomic:

The **far-left part** argues that a machine does some measurements (gathers data) regarding the state of the sun (Nova / not-Nova) and then reflects on how likely the measurement is to be true amongst a series of measurements, where in rare cases (

In the **middle part** the frequentist statistician reflects on how rare such an event would be, in order to check if the null hypothesis can be discarded (again, does not have to be a zero likelihood, but a defined cut-off value, such as 0.05, or 5%; the calculation would also include reflections on the sample size). **CAVE: The definition of the p-value being the probability that something is happening by chance is unfortunately wrong / misleading (see part five, chapter 3.4 for details).** **Upfront for those interested right away:** The p-value is the probability of obtaining a difference (e.g., in means) at least as extreme as the observed one, given the null hypothesis, or the probability of sampling a mean from a population / second sample, given that we sampled randomly (randomization as a constraint of the design of an experimental trial, since otherwise one could not work with / assume, e.g., a normal distribution (Gaussian distribution); the sample variable with a set of values is also called a random variable).

However, the important part in this webcomic is the reaction of the Bayesian statistician in the **far-right part** of the comic. You may ask: how can anyone be so certain? In general, the comic beautifully conveys the problem of overfitting. Recall that the frequentist approach involves a uniform prior in order to stay neutral, i.e., this approach implicitly argues that the states “Nova” and “Not-Nova” each have a probability of .5 assigned *a priori*. As we know, this results in a model that will fully adapt to the data (our measurement) ─ the posterior being equal to the likelihood. The Bayesian approach on the other hand allows weighting the likelihood with a non-uniform prior (also called *informed prior*). Since a Nova is in general a very rare event, the frequentist statistician indirectly overweights the event “Sun has gone Nova”. The Bayesian statistician, aware of this circumstance, therefore immediately bets against the frequentist model. On the other hand, this also implies that an overconfident prior can lead to overfitting in both interpretations of Bayes’ rule. All of this also relates well to the difference between evidence and significance, as discussed in chapter 1.2.
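The comic’s argument can be sketched with the vector code from above. In the comic the detector rolls two dice and lies only if both come up six, i.e., with a probability of 1/36. The informed prior for a Nova used below (one in a million) is a made-up value for illustration only, and the object names are ours.

```r
# Likelihood of the detector answering "yes", given c(Nova, Not-Nova):
# it tells the truth with 35/36 and lies with 1/36
likelihood = c(35/36, 1/36)

# Uniform prior: "Nova" and "Not-Nova" are treated as equally probable
prior_uniform = c(.5, .5)
joint = prior_uniform * likelihood
posterior_uniform = joint / sum(joint)
round(posterior_uniform, 3)
# [1] 0.972 0.028

# Informed prior (hypothetical value): a Nova is extremely rare
prior_informed = c(1e-6, 1 - 1e-6)
joint = prior_informed * likelihood
posterior_informed = joint / sum(joint)
posterior_informed[1] < .001  # "Nova" stays very improbable
# [1] TRUE
```

With the uniform prior the posterior simply repeats the likelihood, whereas the informed prior keeps the posterior for “Nova” tiny despite the detector’s answer.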

There is more to overfitting and we will probably come up with a tutorial on this in the future too. The take-home message that we intended to convey is that both interpretations of Bayes’ rule / conditional probability lead to problems in similar forms, when trying to gain evidence of any kind from a statistical analysis. The Bayesian approach is trying to get hold of issues such as overfitting, by setting informed priors (e.g., on prior knowledge from previous research). The frequentist approach will have similar issues, just more related to the interpretation of the results and less to the way data is gathered (when likelihood is unweighted). There is a great number of methods trying to overlook and overcome such boundaries in any of the two “fields”, both mathematically and intellectually (‘What is evidence?’). In the end, the attribute “fields” is somewhat over the top, as both approaches refer to the same equation, just under different prior assumptions. The ‘tilting effect’ between likelihood and posterior in combination with the *ex negativo* likelihood

When doing your own research, you may have come across further distinctions between the Bayesian and the frequentist approach. E.g., in Bayesian statistics it is said that the hypothesis is dynamic or changes, the data being something constant. In frequentist statistics it is said that the data or likelihood changes, and the hypotheses stay stable (i.e., binary: 0% and 100%). This may seem complicated or even enigmatic, but it essentially just refers to the denotation of the variables:

As mentioned, the difference between the Bayesian and frequentist approach is often cast as a philosophical discussion on probability as such: Bayes’ rule assumes that experience changes the way we hypothesize or expect the world to be *a priori in the future* (posterior becoming the new prior), whereas the frequentist approach assumes the possibility of a stable neutral position *a priori* (always uniform, never updated!) and casts evidence as a frequency of an assumption (being able to “frequently discard” the null hypothesis). Given a uniform prior, we will obtain an equivalence between posterior and likelihood, **so the course of inference does not matter in such a case**. So again, the difference is rather synthetic and it logically makes, at least to me, much more sense to always consider the outcome probability of our statistical inference as the probability of a hypothesis after we gather data, under specific prior conditions (either uniform or informed prior). The p-value under frequentist considerations is still always denoted a likelihood, even though given a bi-directionality of inference so to speak; this can also be understood as trying to represent the “present” only (focusing on the data only, though: a uniform prior is still a prior assumption!).

However, arguing this to be a “philosophical” discussion is somewhat misleading, as we learned that the above is again just a reflection on Bayes’ rule (in particular likelihood and posterior). So in general, just keep in mind that the frequentist approach to probability just reflects a special case of Bayes’ theorem.

I have also found this illustration in an article on the difference between Bayes’ and frequentist statistics that illustrates the above in a similar way:

Being aware of different approaches to probability theory does not involve choosing a specific side in that discourse. We rather believe it to be essential to be aware of the prior considerations of hypothesis testing, no matter what the chosen prior weight may look like. Still, in a lot of cases it is not just a matter of style or opinion which side or method we choose, as we are all Bayesians when it comes to, e.g., the positive predictive values, or when performing differential diagnostics (“investigative reasoning”, see below).

**One last thing before we close our reflections on Bayes and frequentist stats:** As mentioned, the Bayesian counterpart of a “frequency” is the posterior becoming the prior. This can be understood as a structural recursion / iteration of hypothesis testing. Investigative reasoning has therefore often been related to Bayes’ rule: **Vanessa Holmes’** new case involves four suspects, one of them being the murderer of an innocent raccoon. A *prior probability* could for now look something like .25 for each of the four suspects:

```
# First iteration (one suspect ruled out):
joint = c(.25,.25,.25,.25)*c(.33,.33,.33,0) # prior*likelihood
model_evidence = sum(joint)
posterior = joint/model_evidence
# Result:
# [1] 0.3333333 0.3333333 0.3333333 0.0000000
# Second iteration (another suspect ruled out):
joint = c(.33,.33,.33,0)*c(.5,.5,0,0)
model_evidence = sum(joint)
posterior = joint/model_evidence
# Result:
# [1] 0.5 0.5 0.0 0.0
# Third iteration - let's say all of the suspects are ruled out:
joint = c(.5,.5,0,0)*c(0,0,0,0)
# [1] 0 0 0 0
model_evidence = sum(joint)
# [1] 0
posterior = joint/model_evidence # Note that it says: 0 divided by 0!
# [1] NaN NaN NaN NaN
# NaN = Not a number.
# This is due to the fact that 0/0 is not defined. At this point
# Vanessa Holmes would need to find another suspect to be able
# to do further inference...
```
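The same recursion can be written more compactly by letting the posterior of one iteration become the prior of the next. The following is only a sketch under the assumptions of the example above; the helper `bayes_vector()` and the objects `evidence` and `belief` are hypothetical names.

```r
# Hypothetical helper: Bayes' rule for probability vectors
bayes_vector = function(prior, likelihood) {
  joint = prior * likelihood
  joint / sum(joint)  # divide by the model evidence
}

# Likelihoods of the first two iterations (one suspect ruled out each time)
evidence = list(c(.33, .33, .33, 0),
                c(.5, .5, 0, 0))

belief = c(.25, .25, .25, .25)  # uniform prior over four suspects
for (likelihood in evidence) {
  belief = bayes_vector(belief, likelihood)  # posterior becomes the new prior
  print(belief)
}
# [1] 0.3333333 0.3333333 0.3333333 0.0000000
# [1] 0.5 0.5 0.0 0.0
```

Adding a likelihood of `c(0, 0, 0, 0)` to the list would again end in `NaN` values, since the model evidence becomes 0.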

**The first part of our series on statistical inference has come to an end.** We hope that this has given you a stable overview of what hypothesis testing is in general all about and how it is related to conditional probability. You will definitely come across conditional probability in various forms when further digging into statistics. A lot of our section members are also interested in computational neuroscience, bioinformatics and data science in general, so we will soon also provide tutorials on topics such as information theory (Shannon, Akaike; both rely on reflections of thermodynamic analogies, which also involve conditional probability (Boltzmann and Gibbs’ entropy, free energy etc.)), predictive processing / active inference (disconnection hypothesis arguing schizophrenia to be a weighting problem) and many more, which all rely on the basic concept of hypothesis testing in the sense of Bayes’ rule (tutorials on these topics are also just about to be finished, so stay tuned!).

The information theoretic use of conditional probability is one of the most fascinating and mind-blowing ones and refers to the “bit” as a statistical quantity, showing that communication is a stochastic process and comes without conveying meaning in the actual sense (Shannon). This is something that especially research in the humanities has struggled a lot with conceptually when trying to understand information technology (unfortunately there are a lot of heavily faulty interpretations of information theory and of what a computer actually does).

However, note that hypothesis testing is often also referred to as **abductive inference** **in contrast to deductive and inductive reasoning** and was defined by the mathematician and philosopher C.S. Peirce, who had a great influence on the development of information theory (logic gates, further development of Boolean algebra, abductive inference and communication; the temporal triad we referred to is essentially referring to the triadic structure of abductive inference (related to semiotics)). The difference to deductive and inductive reasoning is essentially that abductive inference / hypothesis testing involves developing and evaluating something *new*, a hypothesis, and does not just evaluate a pre-set part-to-whole relation between a rule and a single event as in the other two forms of inference. If we argue that this can be looked at as a hierarchical and recursive process, such that a deduction is just an abduction of an abduction, then one can say that statistical reasoning in the sense of a mathematical method can be referred to as deductive reasoning (also under “neutral assumptions” in particular), since then a process of forming a hypothesis has already been performed cognitively before and therefore on a lower level of the hierarchy, so to speak. **In other words: Bayesianism eventually descriptively starts from a subjective perspective of inference.**

This tutorial was submitted for the Summer of Math Exposition II, organized by Grant Sanderson and James Schloss. We are proud that we made it into the top 10% of all entries (the top 100 entries, of which 25 were non-video entries), especially with such a basic topic as Bayes’ rule.

**To give another submission of the competition some attention, here is one that fits the topic of our tutorial very well:** The article by Jeffrey Wang discusses misinterpretations of conditional probability by a medical doctor who played the role of an “expert” in a court trial. The misinterpretation led to a mother being falsely accused of causing the death of her infant, who had actually “just” died from sudden infant death syndrome (SIDS). The false accusations and imprisonment eventually led to trauma and subsequent lethal self-destruction of the mother, Sally Clark (who eventually died of alcohol intoxication). It is a drastic example of the misconception of statistics by medical personnel in this case (who are unfortunately not necessarily experts on that matter), and especially by legal personnel.

Note that the mathematical example also involves conditional *independence*. **Above we have only gone through examples of conditional dependence, but we will eventually get back to the topic some other time as well.**