Skip to main content# [R] Inferential Statistics V: Probabilistically Evaluating Statistical Models via the Z- and T-Test

# 1 Introduction:

# 2 Recap on Z-Scores and an Overview on the Regular Z-Test / Gauß-Test and its P-Value

## 2.1 The Z-Value/-Statistics and the Effect Size in Detail

## 2.2 Standard Deviation/Error of the Mean

# 3 Possible Hypotheses of the One- and Two-Sample Z-Test and their Respective P-Values

**Example for the following chapters: **

**3.1 One-Tailed One-Sample Z-Test**

**3.2 Two-Tailed One-Sample Z-Test**

**3.3 (Independent and Dependent/Paired) Two-Sample Z-Test**

## 3.4 Common Misleading or Confusing Definitions of the P-Value and the Significance

# 4 The Student’s T-Test

## 4.1 The T-Distribution, its PDF and its Degrees of Freedom

# 5 Testing a Linear Model

## 5.1 Overview of the output of

## 5.2 T-Test on the Slope and the Intercept of a Linear Model

## 5.2.1 Testing the Slope

## 5.2.2 Testing the Intercept

# 6 Z-/T-Test and the Signal Detection Theory

# 7 Update on our

Intermediate Stat-o-Sphere

Published onAug 18, 2023

[R] Inferential Statistics V: Probabilistically Evaluating Statistical Models via the Z- and T-Test

** Review:** This is a pre-released article and we are currently looking for reviewers. Contact us via [email protected]. You can also directly send us your review, or use our open peer-review functions via pubpub (create an account here). In order to comment mark some text and click on “Start discussion”, or just comment at the end of an article.

*Corresponding R script: *

*Summary:*

Welcome to the fifth part of our tutorial series on inferential statistics! You have come quite a way and we hope you get the feeling that is well worth the wait! The following tutorial requires a basic understanding in probability theory, linear regression and in essential descriptive measures, such as the variance, the standard deviation and the mean (Inf. Stat. I-IV). On the other hand, we are not going to speed up the pace, as we still aim to consistently replicate the math in R as far as possible. We again recommend to make use of the **contents function (upper right corner)**, to reorient oneself in the script. Since we are approaching higher spheres, we will not provide summaries at the end of every chapter, but we still provided a downloadable summary above!

**As mentioned in part four of this series, we will now cover the regular z-/t-tests, comparing simple means only, before applying the same principle of hypothesis testing to the intercept and slope of a fitted linear model.**

**Let us briefly catch up on what we have gone through so far:** In the first two parts of this tutorial series, we focused on the general concept of hypothesis testing (conditional probability) and linear modelling (conditional linear relations). In the third part we looked at our linear model under a descriptive perspective and at alternative methods to obtain the optimal intercept and regression coefficient of a linear model via such descriptive measures (variance, covariance etc.). In the fourth part of this series we looked at basic distributions / functions such as the uniform and several forms of Gaussian distributions, evaluated the z-score and the effect size as a special case of z-score (where the mean of the sample is the input; we will talk about the effect size again, since the z-statistics/value can be looked at as a weighted effect size…).

**In the fifth part of this series, we are going to fully combine the concept of probabilistic relations with the concept of mean estimate and linear relations. In other words: we will look at methods to evaluate samples as well as linear models from a probabilistic perspective (p-value). **

Apart from evaluating p-values there are also *other methods* that we can use to evaluate our model in general, such as the effect size etc., which is also not part of the

output in R, but important. However, we will get to further measures such as the F-statistics in the **summary(lm(y~x))***sixth* part of this series.

In the context of R this tutorial can be understood as the first part of understanding the output of the

function as well as the **summary(lm(y~x))**

function. In the **t.test(y~x)***sixth* part we will eventually replicate the output of the mentioned R functions in full and we will also look into the topic of power analysis, confidence intervals and confusion matrices (essentially conditional probability tables) more closely. **In other words: this tutorial is all about p-values.**

**The general question we will ask when comparing the mean of two distributions will be:** how different are the two distributions in question? More precise: how likely is it that the sample was obtained from sampling from a population (or from another sample in comparison). What is the difference and is it different enough?

**On the other hand, the general question we are going to ask, when evaluating a ***linear model*** will be:** How good, how fit is our model of Lady Meow drinking tea under a probabilistic perspective? Is the difference between the independent variable and the dependent variable enough to consider a conditional (cor-)relation between them?

So the goal is now to not only descriptively, ** but also **probabilistically evaluate how well our linear model predicts given / new data. Again, we will do so by computationally replicating most parts of the math of the

**summary(lm(y~x))**

function and other functions in R, such as the infamous p-value. We are finally approaching our first statistical hypothesis test using a PDF: the **Gauß-test** or also more commonly called **z-test**. As mentioned a couple of times: the difference to a t-test is formally marginal, but not too trivial as we will see further below.

As mentioned: a z-test tests for a **difference between the sample mean and the population mean (or another sample)** in a way that we can use the result of a z-test to make assumptions about the probability that a sample distribution is part of a population distribution or another sample (when taking multiple samples, since backed by the central limit theorem).

All this is done via a z-test, e.g., by using the mean of the sample and the population, given the var/sd of the population. **To make things a lot** **easier upfront, the** **core of what a z-test does is simply the following (similar to the z-score that we encountered in the fourth part of this series):**

Different to the z-score, we now do not just standardize a set of sample values but the mean of a sample only. Similar to the effect size, you may recon — we will get there! ** **

**In any way, now we are really comparing for a difference in means. Doing so can essentially again be understood as a kind of centralization, just not of vector **** but the mean only. **

The above difference will eventually be divided by the standard deviation of the population — but again not of the whole set but the mean only to obtain the z-value/-statistics**. This is different to the effect size, you may recon. Below you can see that the standard deviation of the mean of a sample brings in the ****. Don’t worry why this is the case for now, we set up a whole chapter to clarify where **** actually comes from and it will be very enlightening, promised. **

**For now just think of** ** for itself as a weight that reflects the central limit theorem: **the higher the sample size, the less likely will a high difference in means be the result of sampling from the actual population.

**Below you see a comparison between the z-score (discussed in ****part four**** of this series), Cohen’s d effect size and the z-value (the result of a z-test). Recall that Cohen’s d is just a s special case of the z-score! Looking at the z-value/statistics we can see that this is on the other hand just a special case of effect size, in some way:**

For a z-test we can either compare the sample mean with a population mean or the means of two samples (**one or two sample z-test**, details later below). The rest is essentially again just a method of converting the above difference into a standard normal PDF. **This “standardized” perspective can be helpful to get a general grip on the difference in means, as we have seen in the forth part of this series. **

**For example:** testing medication that is supposed to lower the blood pressure, how can we interpret a difference between -10 mmHg, between the control group and those that took the medication? Maybe it is way below the standard deviation of the population?

Technically, standardization also helps to simplify the process of integration (via z-tables), structurally equivalent to what we did when obtaining z-scores, but again, now with a focus on converting the difference *in means* only. Not exactly only, since we can see above, that there is a weight of

**If we include ****, the result of our actions will be a ***specific*** ***z-score***, the z-statistics/***value*** of the above difference of means.** Again, the rest is formally equivalent to the process of evaluating z-scores or the effect size: the process of standardization translates a simple numeric difference of means into a numerical relation that represents the difference of the sample mean from the perspective of a *standardized* (population) mean of 0 in measures of standard deviation (recall that the values of the x-axis of a standard normal PDF are in the unit standard deviation). Therefore: A test score / z-value of -1 states that the elaborated *standardized* *difference of the sample mean* is -1*pop.-sd away from the population mean. **Looking above at the formulas, we can see that this is not exactly true, since the z-statistics also involves the **** and kind of weighs in on the effect size:** The higher the

**In the figure further below,** we will compare two distributions but we kept the **100% different** to each other would indicate **no overlapping of the ***sample*** ***mean*** with the population distribution at any area of that distribution.** An **overlapping** of the mean with the distribution of the population on the other hand **indicates that it is possible to draw a sample from a population with the respective mean when taking multiple (random) samples from the population in measures. **This possibility is represented either in per-cent or as a certain probability. **From the probabilistic perspective we will therefore later ask if the sample is likely/unlikely from the same distribution, when integrating up to a point of difference.** Our hypothesis will be that there is **no difference (null-hypothesis), meaning: the sample is a sample from a population (or another sample)**. This means that when we get a certain z-score and we integrate up to that score it tells us to what probability we would have obtained our sample mean from the population. Since we argue that the sample mean is from the population a priori (null-hypothesis), we take the population distribution as our base line, so to speak. When we compare two samples, we set one sample to be the base line (with a standardized mean of 0) and compare the other sample with it.

The hypothesis of a z-test is also specified to address a specific tail of a normal distribution: The term **tail** refers to the outer **left or right parts of a normal distribution.** Our sample mean can either be higher or lower than the mean (lower and upper tail). There is also a two-tailed version checking for overall equality/inequality (both tails at the same time). We will get into the details further below…

**Usually, we do not aim for a 100% difference. **Therefore, a **threshold is commonly set to define significance**. This threshold is often .05, i.e., the lower/upper 5% of the tail of the normal distribution, as getting down to 0% is rather unlikely.** Note that we are not going through all the details of confidence intervals, power analysis etc. in this tutorial, since we do not need it in order to define and calculate a p-value. This may surprise you, which may be due to the fact that the significance test and the p-value as a result of z- or t-test are often conflated with each other (see chapter 3.4 for details on common misleading and faulty definitions of the p-value).**

**In the plot below** you will see a z-value for a difference of the sample and the population mean that is below .05 threshold, indicating that it is at least *very unlikely* that the sample is from the population or “the same population”. **NOTE:** **Even though we are just operating with the effect size in the plot below, not the z-statistics/value, we can still see that the difference is significant. This is due to the fact that a z-statistics with **** would result in the effect size, such that any other higher value than **** for **** will just turn out to be more significant!!**

So inn this case we can already argue that we can still discard the null-hypothesis (no difference), which in this case means: there is a significant difference given in the sample and it is therefore unlikely that there is no difference as stated by the **significance testing**: comparing the p-value (obtained from integration up to a z-statistics) with a threshold value that argues for a kind of “enough difference”.

Code for the plot above.

```
# General formula for a probability density function (PDF).
# Insert mean = 0 and var = 1 to obtain the values for input x
# given a so-called standard normal distribution:
prob_dens = function(x,mean,var){
fx = (1/sqrt(2*pi*var))*exp((-((x-mean)^2))/(2*var))
return(fx)
} # End of Function
# Comparing distributions:
# We chose equal variance, so the form will be the same
# Sample:
mean_samp = 5
var_samp = 12
# Population:
mean_pop = 15
var_pop = 12
# Plot:
seq_samp = seq(-7,40,by =.01)
plot(x = seq_samp, y = prob_dens(seq_samp,mean_samp,var_samp), type = "l", xlim=c(-8,30),lty =2)
seq_pop = seq(-7,40,by =.01)
lines(x = seq_pop,y = prob_dens(seq_pop,mean_pop,var_pop), type = "l")
abline(v=mean_pop,col="lightblue")
abline(v=mean_samp,col="orange")
abline(h=0) # line for x-axis at y = 0
# Quantile function for .05 of pop:
pop_threshold = qnorm(.05,mean=15,sd=sqrt(12))
# Add line to mark lower tail threshold of .05 (5%):
segments(x0=pop_threshold,x1=pop_threshold, y0 = 0, y1 = prob_dens(pop_threshold,mean_pop,var_pop))
```

Let’s go through all of the above in detail:

**As we have seen before, the formula of the z-tests z-value is nearly the same as the formula of the z-score. There are just two minor differences. However, it is the same principal at work as in regular z-score:** centralization in the numerator (evaluating the difference) and then standardization via the sd (similar to the process of normalization in probability theory). **Again, a z-test is just a z-score to articulate for a specific difference: a difference of mean values only!** Note: the formula below refers to a one-sample z-test (we will catch up on the slightly different formula for two samples further below).

**Differences to the z-score (redundant overview, potentially optional):**

**a)** For the z-table we used a sequence of *possible* values of *value* (also called z-statistics) only: our sample mean as input **of the mean** (standardized), see b) below:

**b)** To perform a comparison of means in such a way we have to take the so-called **standard error of the mean (****)** into account, which is essentially just the population standard deviation,* but here again: the mean only will be the input x*. So, again: not a whole set of data or possible mean values is looked at and centralized, but the mean of a set itself in relation to the population mean. **Importantly**, on the far-right side of the equation above you will find a rearrangement of the formula that can be read as the **z-score of the sample mean** **weighted by the square root of **** (see next chapter why sqrt(n) is used; it is rather banal but enlightening, so we still gave it an extra chapter).** This can be looked at the same as the effect size (Cohen’s d) weighted by the **But what does weighting by **** do?** **Numerically weighting by the **** will eventually lead to a rise in confidence in the comparison, so to speak, since the higher **** will be, the higher the affect will appear in the form of a z-statistics, which will eventually lead to an automatic lower p-value (and therefore higher significance in the end).** This is not what we want, when we simply want to compare the difference in the means in terms of a standardized perspective, as we will see (effect size, see next paragraph). **Put simply the weighting by the **** argues that if we have a difference in means, such a given difference will have a much higher meaning when the sample size is, e.g., 1000, compared to the confidence we could gain from a much smaller sample with the same difference in means. In the next chapter we will explain where the **** actually comes from mathematically, looking at the details of the standard error of the mean. **

As we know, the z-statistics is very close to **Cohen’s d effect size**, which also standardizes a difference in means, but not using the standard error of the mean, but the regular standard error/deviation. From another perspective the formula below represents the z-score of the mean only, not a whole set

The effect size is the perspective we want for mere standardized numerical comparison, where the plain mean difference may not be informative (just converts the mean difference into measures of SD). Say we, we have a measurement of 140 mmHg systolic blood pressure and a difference of 10 mmHg — is that much? The standardization in the fashion of the plain effect size can help to get a grip on how big the actual effect ist. Cohen’s d can be looked at as equivalent to the corresponding z-statistics/value, given that

```
# Comparison z-stat and effect size:
x=c(0:10)
y=c(0:10)*3
pop_mean = mean(y)
pop_sd = sum((y-mean(y))^2)/length(y)
sem_x = pop_sd/sqrt(length(x))
z_stat = (mean(x)-mean(y))/sem_x
effect_size = (mean(x)-mean(y))/pop_sd
# Same with n = 1
sem_x_2 = pop_sd/sqrt(1)
z_stat_2 = (mean(x)-mean(y))/sem_x_2
effect_size_2 = (mean(x)-mean(y))/pop_sd
# Check for equality of z_stat2 and effect_size_2:
all.equal(z_stat_2,effect_size_2)
```

**The z-statistics/value may still appear a little out of order**, but let us take a closer look at the

The **standard error of the mean **can be a **little tricky**, even though it actually doesn’t introduce anything new to us (see part three). Nevertheless, we decided to give the topic some space for an own chapter. **We promise it is worth it**, as it again relies on assumptions of the central limit theorem, i.e., it essentially relies on the relation between the sample and the population size (**However, you can also skip this chapter if you want to get right into calculating p-values.**

Per definition, the *mean* from the actual population *mean*. Again: the standard error (of the mean) is actually the same as the standard deviation, it just includes a reflection on the size of the sample. However, the formula of the **So even though we will go through all of it in detail, just keep in mind upfront about what input we are speaking of:**

**STANDARD DEVIATION** = Average deviation of a **set** of data(**from the mean of that sample or a population mean (centralization, recall ****Inferential Statistics III****).**

**==>**

**STANDARD DEVIATION/ERROR OF THE MEAN** = Average deviation of the **sample mean** **from the population mean** (not a set, just the sample mean in relation to the population mean is centralized!!).

**==> **

**Given a slight difference only, the regular standard deviation is often confused with the standard error of the mean. The reason is again that it is essentially the standard deviation, just of the mean only.** Note, in R we *can’t* just use

to compute the standard error of the mean, as we will see.**sd(mean(x))**

For the z-test the **Let us look at the formula of the ****:**

**As written above: a special characteristic of the **** is that the higher the **** of the sample mean, the lower the deviation from the population will be (due to central limit…). **

**CAVE: This is exactly what Cohen’s d effect size is trying to avoid, since samples with a large **** tend to become significant more easily. **Still, it is also understandable, given the central limit theorem, to weigh in the **. **

To get more insight why weighting in makes sense, let us ask again: **Where does the square root come from anyway? The answer is actually quite intuitive:** The mean as input brings in

**In the decomposition above we can see how the sample mean actually disappeared completely as input, essentially reduced to the sd, when **** is taken out, now representing a kind of redundant averaging. **This comes in handy, when only given sample size

**Note upfront that the above bounded relation between **** also holds for the expected value used in a t-test:** we can still draw relations between a lower and a (potentially) higher size of

**Below we tried to mathematically express the **** particularly as an inequality, dependent on the size of ****.** On the far left side

Looking at the formulas above also helps to understand how the *sample* variance corrects for a bias (Bessel’s correction), since

**Below you will find another decomposition** using the variance only **(optional)**, again showing where the square root of

Below you will find some code to calculate the

```
# Example for SEM:
# Our population:
x = c(rep(c(0:10),100))
# Population variance:
var = sum((x-mean(x))^2)/length(x)
# Evaluate SEM for several different sizes of n:
n_sample = 20
sem_20 = sqrt(var)/sqrt(n_sample)
# [1] 0.7071068
n_sample = 200
sem_200 = sqrt(var)/sqrt(n_sample)
# [1] 0.2236068
n_sample = 1000
sem_1k = sqrt(var)/sqrt(n_sample)
# [1] 0.1
n_sample = length(x) # = 1100
sem_N = sqrt(var)/sqrt(n_sample)
# [1] 0.09534626
n_sample = 10000
sem_10k = sqrt(var)/sqrt(n_sample)
# [1] 0.0003015113
sem_20 > sem_200 & sem_200 > sem_1k & sem_1k > sem_N & sem_N > sem_10k
# [1] TRUE
```

**Next (optional) you will find another more complex simulation in order to prove the mathematics above: **Now we are going to use some code we used to demonstrate the central limit theorem, including actual sampling, created by using the `rnorm()`

function (rnorm = random normal distribution).

We will create a set of 5000 means from samples with a size of **with a mean of 10 and a sd of 2 (population parameters)**. We are then going to compare the

Even though we have knowledge about the populations mean and sd, below we are **first going to just use the sample sd and later the population sd**, since at least in general the sample sd can also be used for an approximation of the

```
# Standard error of the mean including actual sampling:
# How the SEM becomes smaller, the more samples we take:
set.seed(1) # for reproducability
data = rnorm(n = 1000, mean = 10, sd = 2)
# We can again plot our data as histogram
hist(data, col = "lightblue")
# Use a for loop to take 5000 random samples of size n = 50 each:
n_samp = 5000
sample_mean = c() # create empty vector
sample_sd = c()
# The below function will also takes samples of length 50
# and then mean of that sample stores in our vector sample_mean.
# Running this loop will always result in different samples:
for (i in 1:n_samp){
# Store mean of each sample with length 50:
sample = sample(data, 50, replace=TRUE)
sample_mean[i] = mean(sample)
} # End for i
# Look at the histogram of the sample_mean:
hist(sample_mean, col = "lightblue")
# Now we compare the SEM of the first 20 sample
# with the SEM of the first 250 sample. Recall
# the vector sample_n_50 contains 5000 means
# with a sample size of 50 each:
sample_mean
length(sample_mean)
# [1] 5000
#### SEM of the first 20 samples:
samp_n20 = sample_mean[1:20]
length(samp_n20)
# [1] 20
# SEM_n20
SEM_n20 = sd(samp_n20)/sqrt(length(samp_n20))
# [1] 0.05782984
#### SEM of the first 250 samples:
samp_n250 = sample_mean[1:250]
length(samp_n250)
# [1] 250
# SEM_n20
SEM_n250 = sd(samp_n250)/sqrt(length(samp_n250))
# [1] 0.01723958
#### SEM of all 5000 samples:
SEM_n5000 = sd(sample_mean)/sqrt(length(sample_mean))
# [1] 0.003924541
# The numerical values will deviate, each time you run the
# loop, but it will always hold the logic of:
SEM_n20 > SEM_n250 & SEM_n250 > SEM_n5000
# [1] TRUE
# For a z-test we would have to use the population sd,
# which we set to be 2 in rnorm() function above. Given
# the sd we can simply just adjust a size of n:
# pop_sd/sqrt(n):
2/sqrt(20) > 2/sqrt(250) & 2/sqrt(250) > 2/sqrt(10000)
# [1] TRUE
```

For some more examples on the

Conceptually we have already gone through most of what is ahead of us. We have previously obtained an overview on what possible hypotheses of a z-test roughly look like, in this chapter we will learn about all of them in detail.

In general a z-test can be differentiated in two categories (the formula however is essentially the same):

=> The **one-sample z-test** compares a sample mean from a population with the population mean / expected value.

=> The **two-sample** **z-test** compares two samples und checks if they are related to the same population or different populations.

Apart from that, there are **three types of hypotheses for the z-test** that can be drawn: **one-tailed (lower/upper) and the two-tailed test hypotheses**.

The important message of the previous chapter: effect size and z-statistics/value are both doing the same: standardizing the difference in means. We know that the sample size is not weight into the effect size (or is treated as if

The example we will be using in the next chapers will be (mostly) an comparison of the tea drinking behavior of happy cats and sad cats. We will also look into a catnip treatment at one point (dep. two sample z-test), so stay tuned! However, **brace yourselves, Chat Noir:**

A **two sample** z-test **could ask if sad cats drink less tea than regularly happy cats, e.g., such as Lady Meow.** The samples are distinct by each other since they imply inclusion criteria: cats are tested for being sad and regularly happy (on whatever basis) and then put into their respective category. To answer our above question, these two samples would be compared with each other. The probability value (p-value) would state how likely it is that the samples are samples from the same population. A **one-sample** z-test is similar, just that not the mean of a second sample is used, but the given mean of a population of all other cats that are at least not sad. The difference to the two-sample z-test may appear minor in this case, however it is a good lesson to conceptually think of what a population actually represents. However, the exact **question for a one sample z-test of our example would be: does the “chat noir” drink significantly less tea then any other cat (cats in general)? **So in nuce, the category “all other cats” is not as specifically circumscribed by inclusion criteria as in the two sample test. Such a test can still be performed, but the result of this one-sample test may be that no difference is perceived, due to too many confounding factors or other biases, which are all present in the population (since no further differentiation via inclusion criteria was performed). Keep that in mind!!

**In case the term population becomes fuzzy thought as representing a category:** Note that the concept of a population is not bounded by a certain “real” category, such as e.g. species. *In nuce*, it can also mean a population within a population (sub-population). In the example for the one-sample test above we looked at the tea drinking behavior of a sample of a subpopulation of sad cats amongst a population of “regular” or “all other” cats (of course its all made up, but since we are doing a bit…). So here the difference between what a sample is and what a population is in the form of a category is not set/differed by the species (cats) but by their mood and potential tea drinking behavior. We could also compare the behavior of different species… We recommend to keep such conceptual/linguistic categorical issues in mind when trying to understand what variables are used. And of course: In order to make a distinction between sad and happy cats inclusion criteria would need to be discussed in detail (since it is just a made up example, we will skip this part; however, an example of an exclusion criteria for a second sample could be cats that have trouble drinking in general, e.g., due to some illness like a tumor that influences the ability to properly swallow liquids).

For the lower-tail we can formulate the null hypothesis that the sample has either an equal mean or is *not* (significantly) lower than the population mean (meaning: it could be a sample from the population if *not* lower than a defined threshold). Below we also included different denotations of the essential logical binary relation between alternative and null hypothesis. Recall from Inferential Statistics I that the use of the term null hypothesis can be confusing (e.g., due to often expressed *redundant* binary relation when, saying things such as “the null-hypothesis being true”). **Also recall** that the p-value below is essentially a (uniformly weighted) likelihood of the form **Bayesian regression** involves more than just one function: the multiplication of the prior probability function with likelihood function results in a posterior probability distribution... Since the prior is uniform in frequ. statistics, we can just operate with the PDF, which is not the same as a likelihood function (which would be a regular prob. function), but the integral of our PDF still represents a likelihood.

You may wonder what role the *ex negativo* approach in parts of statistics. Mathematically the role of the

**As mentioned in the ****first part**** of this series, the use of the null-hypothesis is very unfortunate,** since this is also not the way we express the evidence in every day life: When we find evidence for the effect of a certain treatment, we do not say: we did *not not*-find any evidence, we positively say: there is evidence for an effect.

This *ex negativo* approach is sometimes also related to falsification (Karl Popper), e.g., in [this paper]. In general it refers to a discourse on epistemics, i.e., the question on what can be known and what can be tested. Personally, I don’t find this logically very convincing, since falsification is better to be understood in relation to temporal and spatial boundaries of experience/examination, as in the classic black swan example, and less by overinterpreting an *ex negativo* approach in setting a category in conditional probability. The latter appears to be more of an epistemological reminder, apart from being most of all mathematically convenient. Mathematically tilting the category in advance via the null-hypothesis approach is not solving epistemic issues of any kind of abductive inference concerning time and space (recall part I of this series). Tilting the category is at the end just tilting numbers, since the relation between variable and its complement is already part of probability as well as propositional logic. Again, apart from that we always communicate the results in the positive sense: evidence was found that medication X *has* an effect (we don’t say: evidence was found that medication X did not not-have an effect). **Most of all, evidence is not obtained by one study alone, so significance for itself does not relate to the world anyways.**

Apart from *numerically* tilting from *ex negativo* approach in general, it unfortunately gets a bit more confusing when evaluating a difference in mean being higher, lower or equal/inequal to the population mean, as it also tilts and some values are actually equivalent, no matter the tail we look at. However, this tilting appears due to the symmetry of a normal distribution when switching from the sample mean being lower to being higher than the population mean (same trick: 1-value). See code further below, to fully understand what this refers to.

Let us start with the lower tail:

**When do I know when to use the lower and not the upper tail? In general it is dictated by the question that is asked:** If we decided in advance that our hypothesis is that depressed cats drink less tea than others, then we indirectly decided to perform a lower tail z-test. What if the inverse is the case: drink more tea than others? Performing a hypothesis test correctly, requires to set a hypothesis in advance. In the best case, you register your trial before you collect data and then perform your statistical analysis. Not doing so can be essentially considered HARKing, i.e., formulating or adjusting a hypothesis after the results are known (bad practice). Also, when the result does not match our a priori hypothesis (either higher or lower), but turns out to be significantly different, we technically would again need to perform a new trial (especially when preregistration was done, we can’t and shouldn’t just change the hypothesis). However, we will see below that this is why most people use a two-tailed test.

We can use the **same formula as above for the inverted case with a sample mean that is ***higher*** than the population mean (in the sense that it is not from the same population)**. Here we use the same integral as for the “lower-tail” and due to the symmetry of a standard normal PDF we can use the same p-value as for the lower-tail as well. As mentioned, we just need to tilt the result, due to the symmetry of Gaussian functions. Formally we actually integrate up to the **absolute value** of our given z-statistics and subtract the result from 1 (see code below). Integrating up to the absolute value of the z-statistics by itself (without subtracting it from 1) would only results in the likelihood that follows the alternative hypothesis of a lower-tail test…

Below you will find some code to perform a lower and upper tail one-sample z-test:

```
### Chat Noir One-Sample Z test:
# We will compare sad cats with a population
# of cats ("all cats"):
# Happy cats like Lady Meow drink 10 cups of tea on average:
pop_cats = c(rep(seq(0,10, length(11)),10))
# Sad cat, which drank only 2 cups of tea in 10 hours:
sample_sad_cats = seq(0,2,length = 11)
# [1] 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0
# Population var and sd:
var = sum((pop_cats-mean(pop_cats))^2)/length(pop_cats)
sd = sqrt(var)
# SEM:
sem = sd/sqrt(length(sample_sad_cats))
# Z-test:
z_stat = (mean(sample_sad_cats)-mean(pop_cats))/sem
# [1] -4.195235
# P-Value LOWER TAIL
p_val_low = pnorm(z_stat,mean = 0,sd = 1)
# [1] 1.362942e-05
p_val_low < .05
# [1] TRUE
# P-Value UPPER TAIL:
# The below is only theoretically the case and we
# have to pretend to have gotten a positive z-statistics.
# This spares us the time of creating another set of data.
# This time the below actually argues that sad cats drink
# more tea than happy cats (we made thinks up anyways, but just that
# you know):
p_val_upper = pnorm(abs(z_stat),mean = 0,sd = 1)
1-p_val_upper
# [1] 1.362942e-05
1-p_val_upper < .05
# [1] TRUE
```

Awww :( Turns out sad cats truly drink significantly less tea :’( The p-value is way below 0.001, so we can write the p-value as: < 0.001. **This is also the way we would report the p-value in a paper as well.** One can also report the z-value to the corresponding p-value, as it tells us how many sd’s our sample mean is from the population mean.

As the name already says: now we are going to consider both tails of the distribution *a priori* and are looking at a *total*-threshold of 5% (2.5% each tail) when comparing it with the p-value for testing the significance (again, we will cover the whole topic of significance testing in detail in the next part of this series).

**This time we are asking for the equality/inequality of two distributions.** In other words: we are going to perform a lower and upper tailed z-test at the same time, so to speak.

**When should I use the two-tailed z-test?** In general it is a safe prior hypothesis to argue for a difference that is either higher or lower than the population. Conceptually it can be understood as a rather exploratory approach: catnip may result in less depressed cats, but maybe also more depressed cats? Maybe we don’t know yet, since current research is ambivalent or simply not given. The cost though is a double p-value and a higher chance that the result will turn out to be insignificant. **However, in case this appears confusing:** we do not actually get a double difference or so, since we are still looking at a single difference of means only that can’t be at two sides at once: one set of samples, one mean, ergo: a difference *either* + or - will be given. So in any case there will always be only a “one-tailed difference”, even when performing a two-tailed test.

**When testing for significance only** we can assume they are from the same distribution if the result of the z-test lays in an, e.g., 95% area of our population function (confidence interval), with an upper and lower tale of 2.5% each (total of 5% threshold).

As we are calculating the p-value for a z-statistics that only refers to one side, when performing integration, we have to **multiply the p-value by 2 for the two-tailed test. **We can simply do so due to the symmetry of a Gaussian function.

```
# P-Value TWO-TAIL
p_val_two = 2*pnorm(z_stat,mean = 0,sd = 1)
# [1] 2.725883e-05
p_val_two < .05
# [1] TRUE
```

The formula for the (*independent/dependent)* two-sample z-test is nearly the same as for the one-sample z-test. Here we compare the means of two samples, given their respective population sd and population mean. **A two-sample test is considered **** independent**, when it argues that there is no influential relation of the samples to each other. To be less abstract: the group of happy cats and sad cats, each sorted by inclusion criteria, are considered independent when their tea drinking behavior is not influenced by each other: sad cats don’t drink less tea because of happy cats, but because they are sad. Happy cats are also not withholding any tea from sad cats…

To give an example for the **dependent two-sample test:** two samples are, e.g., considered dependent when we compare sad cats that drink tea with themselves after getting a dose of catnip.

**The complete formula of the ***independent*** two-sample z-test** **is:**

**…which can importantly be simplified to using the population standard deviations only, discarding the population mean:**

When forming a hypothesis, we can now do the same as with the one sample z-test, just with two samples: is the sample mean significantly lower, higher — or equal/unequal to the population mean? However, for brevity we will just ask for the lower tail below:

```
# Chat Noir Independent Two-Sample Z-test:
# We will compare happy cats such as Lady Meow with
# unfortunate cats that are evidently sad (inclusion
# criteria will be left out for now):
# Happy Cats:
sample_cats = c(0:10)
pop_cats = seq(0,10.5,length = 11)
pop_cats = c(rep(pop_cats,100))
# Sad cats, which drink only 1 cup of tea in 10 hours :'(
sample_sad_cats = seq(0,2.5,length = 11)
# [1] 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50
pop_sad_cats = c(rep(seq(0,2.5,length = 11),100))
# SEM of each sample:
happy_var = sum((pop_cats-mean(pop_cats))^2)/length(pop_cats)
happy_pop_sd = sqrt(happy_var)
happy_sem = happy_pop_sd/length(sample_cats)
sad_var_pop = sum((pop_sad_cats-mean(pop_sad_cats))^2)/length(pop_sad_cats)
sad_pop_sd = sqrt(sad_var_pop)
sad_sem = sad_pop_sd/length(sample_sad_cats)
# z-Statistics:
z_statistics_num = (mean(sample_sad_cats)-mean(sample_cats))-(mean(pop_sad_cats)-mean(pop_cats))
z_statistics_denom = sad_sem + happy_sem
z_statistics = z_statistics_num/z_statistics_denom
# [1] -10.03415
# We can simplify the formula and will get the same results:
z_statistics_num = (mean(sample_sad_cats)-mean(sample_cats))
z_statistics_denom = sad_sem + happy_sem
z_statistics = z_statistics_num/z_statistics_denom
# [1] -10.03415 # == -10.034 sd's away from the pop. mean
# P-value:
pnorm(z_statistics)
# [1] 5.394214e-24
```

However, the above only holds for **independent samples. **We have covered the difference between the terms independent and dependent when talking about respective variables in the second part of our series. In general: **independent** means, something is not influenced by something else (e.g. the probability of **Dependent** means, there is an influential relation. We also discussed **dependent** **probabilistic relations** when discussing conditional probability (but haven’t discussed independent conditional probability yet — we will do so some other time).

A **dependent relation** between samples **often concerns before/after relations**. **Following our furry example** on sad cats (it’s so sad!), we could, e.g., check if cat nip makes cats happy again by checking if they drink more tea again in a certain amount of time, **after they received some catnip.**

**Outspoken the formula goes as follows:** z for paired samples (dependent relation) equals the mean of differences, divided by the standard error of the mean differences. We could also add “minus the expected difference” in the numerator, but since we are looking for the p-value that assumes the null hypothesis, this value will be zero anyways (since

Below is some code for the example of testing **if** **sad cats drink more tea again after consuming catnip: **this represents an **upper-tail paired or dependent z-test.**

```
#### Chat Noir Dependent/Paired Two-Sample Z-test:
# Sad cats, which drink only 1 cup of tea in 10 hours :'(
# We can also consider this as timestep t_1:
sample_sad_cats = seq(0,2.5,length = 11)
# [1] 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50
# Timestep t_2, i.e., after cat was on catnip:
sample_catnip = c(0:10)
# [1] 0 1 2 3 4 5 6 7 8 9 10
# Evaluate the difference between t_1 and t_2 via
# diff = t_2 - t_1
diff = sample_catnip-sample_sad_cats
# Set up data table:
data = cbind(sample_sad_cats, sample_catnip, diff)
data = as.data.frame(data) # in order to be able to use $
# sample_sad_cats sample_catnip diff
# [1,] 0.00 0 0.00
# [2,] 0.25 1 0.75
# [3,] 0.50 2 1.50
# [4,] 0.75 3 2.25
# [5,] 1.00 4 3.00
# [6,] 1.25 5 3.75
# [7,] 1.50 6 4.50
# [8,] 1.75 7 5.25
# [9,] 2.00 8 6.00
# [10,] 2.25 9 6.75
# [11,] 2.50 10 7.50
# Caluclate the mean difference
mean_diff = sum(data$diff)/length(diff)
# Total sum of squared mean differences
sum_sq_diff = sum(((diff-mean_diff)^2))
# Standard error of the mean differences
sd_mean_diff = sqrt(sum_sq_diff/length(diff))/sqrt(length(diff))
# Z_value:
z_value = mean_diff/sd_mean_diff
# Upper tail z-test, arguing that catnip has the effect that
# sad cats are drinking more tea than before again:
p_value = 1-pnorm(z_value)
# [1] 7.854725e-08
```

**YOU DID IT!! You have completed your first statistical hypothesis test!! Whoooop!!!** Before we move on to the t-test, we will discuss some common misleading, false or at least confusing definitions of the p-value.

You have probability figured out yourself that handling the p-value, e.g., when reading a paper is actually rather simple: If it is below a threshold, we are allowed to state that a (alternative) hypothesis finds statistical evidence due to significant difference in, e.g., means (since we can discard the null-hypothesis in such a case).

However, you probably also figured out that **defining the p-value easily becomes complicated, since it involves quite a bunch of mathematical concepts to be obtained**. **The general message of this tutorial: it may not even be a good idea to try to fit the concept into one phrase.**

There are many definitions you probably came across already, but may still appear rather enigmatic and confusing to you, when thinking about it, after going through the upper math. **In this short chapter we are trying to anticipate and entangle some of the possible confusions in common definitions of the p-value, before proceeding with the t-test and the integration of the t-statistics/-value, in order to obtain a p-value again.**

**You can also skip this part for now and come back to this chapter later, as it does not add anything essential to the math.** **In this chapter we just discuss different approaches of linguistically expressing mathematical concepts such as the p-value and the significance.**

You probably have come across statements such as: **“The p-value is the probability that the** **difference in the mean of the sample and population occurred by chance**”. **To me** **this particular statement always appeared terminologically quite confusing**, as the term chance implies that the probability (p-value) redundantly refers to another probability / chance, such that the p-value could understood as a probability of a probability (prob. of a chance of something…)? I really have trouble to relate this definition to the process of standardization and integration. If chance is here understood as referring to the threshold of <5%, then the p-value can either be below / outside the “area of chance” or within it, but the p-value itself mathematically does not involve the threshold, nor does it tell us upfront that a difference is in a certain area by some probability (since it is just integration to a certain z-score/-value, as we already know). The p-value does not probabilistically quantify a second order chance of some kind (the power analysis however does, as we will see in the next part of this series). However, we have seen that the sample size is weighted in, making a high difference, given high sample size less likely to be the result of the sample of a population.

**However, the notion of chance is also related to the process of random sampling (randomization) from a distribution and the difference being just one of potentially many random possible differences from the population mean, when taking a sample (with size n) from one and the same population and potentially a different population. Again, weighting in the sample size into the effect size provides this perspective, since a greater sample size makes a high difference more likely to be (gives it more confidence).** Speaking of randomness: it is not recommended to understand “random” as uniform or so, since a Gaussian PDF says not every mean of a sample is uniformly likely, but we can still take samples from that distribution in a randomized way. The process of sampling is considered random if we did not intentionally decide for a specific sample or had control in the choice whatsoever, in order to intentionally influence the mean of the sample therefore the process of randomization in clinical trials. Randomization (clinical design) is therefore not a nicety that helps to maintain some higher order quality standard, but also makes sure that we can actually consistently apply mathematical inference methods that work with random variables. Still, randomness in the above sense is *not* what defines the p-value either, it only defines specific circumstances of how the sampling took place (randomization), i.e., a characteristic of variables (random variables). This definition for itself at least leaves a lot of questions, specially for beginners.

**You may have a different opinion on this, but we recommend to keep in mind that the use of the term chance is neither a good intuition, nor is it terminologically consistent looking at the actual math. At least exchange chance by “consequence of random sampling from one and the same distribution, ***not*** a different one (= ****)”.** The fact that there are differences in the first place is due to the sampling from a normal distribution, where differences are possible, but overall less likely when the difference is far from the population mean. This is importantly due to the *form* of a Gaussian.

**The p-value for a z- or t-test defines** how likely it is to obtain a *particular* *standardized / studentized* *mean* *difference* (sample minus population/expectation) given that we were (randomly) sampling from an expected (normal) distribution (population with respective mean and sd for the z-test; same goes for two samples, where one sample is considered a base line, so to speak). It is not the same as the plain standardized difference in means, since the sample size is weighted in. Weighting in the sample size can give rise to a higher confidence in a particular difference in means compared to a much smaller sample. Intuitively more data provides more confidence/credibility. However, in the next part of this series we will look at examples where this becomes a problem for interpreting p-values and their “evidence”, but for a narrow scale it still makes sense: the mean of just two events is certainly less credible than the mean of 100 events, under the condition that we randomly sampled and didn’t selectively pick events...

**We could also modify and reformulate our initial statement on p-values at the beginning of this chapter into a certain condition of ***in***significance:** *If* the p-value is ABOVE a threshold of 5% we can interpret the given difference or overlapping of the sample mean with the population PDF as coincidence / chance (result of random sampling) and *cannot* discard the null-hypothesis. *a priori* that the sample *is* part of the population; the

**Importantly: **We do not even have to calculate the p-value to know if the difference in means is significant, given the z-/t-score of the threshold probability. This threshold z-score is also called a **critical z-score** that results in a p-value of the threshold, e.g., .05: After calculating a t-value or z-value of a data set, we could simply compare those with the z-/t-value of a threshold probability of 5% (obtained via a quantile function) and argue for significance *if* the value is outside the critical threshold score, without integrating a PDF at all. . . However, in the above statement on the insignificance of a difference, the p-value is *still* not defined, just compared with the threshold. In other words: the condition of that statement (*if* below a threshold) has nothing to do with the condition and definition of the conditional probability we are after, our p-value.

**Here is another statement that is commonly used but rather confusing, even though there is a true story behind it:** **“the p-value is the probability of observing an effect at least as extreme or more extreme as when the null-hypothesis was true”** **(****Wikipedia****, 02.2023)**. Here I have a lot of trouble to understand what is meant with “extreme” and “at least… or more extreme”. The term “effect” refers to the difference in means. But why does it say “at least as extreme or more extreme”? We also don’t even know if the difference is in any sense given at all (could be t = 0).“At least as… or more extreme” again sounds like confusing the p-value itself with the comparison of p-value and the threshold (significance). **The actual idea that is behind this definition of the p-value (as far as I have encountered) addresses the fact that we integrated a PDF (interval prob. estimate, e.g., from infinity to z-value). So in other words, the integral technically addresses all other values that are possible in the sense that they are even more unlikely (we can say so due to the form of a Gaussian function, therefore the effect: regression to the middle…).** The problem though lays in the following question: Does a p-value obtained from integrating a probability *density* function really address other cases differently than a point-probability estimate of a regular and discrete probability function, since we approach a z-/t-value from -/+ infinity? Or in other words: don’t both probability values (interval estimate and point estimate) always automatically account for a range that is more extreme in the sense that any particular probability value (no matter point or interval estimate) refer to all other possible probability values that are lower than the probability value one got indirectly? A PF and a PDF are not the same kind of function and there are technical reasons that a point estimate of a continuous function results in 0 and therefore we use a PDF. Still, the form of the function and the fact that we looking at a tail of a Gaussian up to infinity may play an equal role in what is considered “extreme and more extreme” and which probabilistic assumptions we can make beyond a point estimate. **EXAMPLE:** A point-probability of .1 (point **In other words, I am not convinced that this is worth highlighting in the definition of a p-value, since the fact that we are integrating is not changing anything about the concept of probability as such, always indirectly including a range of less or more likely occurrences, given a probability at hand and its corresponding function (and therefore its form). **Definitions of the p-value of a z- or t-test using the terms “extreme” at least need a lot of extra reflections and knowledge to be understood, so other definitions of the p-value of a z- or t-test that address the tested difference in means explicitly are probably a better choice in order to explain it to people.

In any way, the above Wikipedia definition still indirectly confuses the p-value with the significance test, as it brings in a threshold with the words like ‘at least as extreme *as when the null-hypothesis was true*’. The problem is here that the null-hypothesis in a z-test is always true (trap of unnecessary redundant binary). Only when we perform a significance test we decide for a threshold — this is not part of the p-value calculation but the power analysis / the significance test. The probability *then* the *can be discarded*”. Now it makes sense both linguistically and conceptually, but is still not referring to the correct definition of a p-value for a z- or t-test,

**Maybe I am missing something above (let us know if so), but to me a lot of common definitions appear rather confusing and they are often not even necessary to understand the p-value, since there are other more straight forward definitions to rely on. **

You may also have come across definitions stating that **the p-value is the probability that the null hypothesis is true**. We have discussed before (especially in the first part of this series) that stating “the null-hypothesis is true” is just adding an unnecessary redundant binary and means the same as stating “the alternative-hypothesis is false” or “*true* that there is *no* difference”. However, such a redundant binary only refers to what is set in advance as hypothesis (condition) and does not refer to the probability of an observation or data (sample mean) given the condition of the *alternative hypothesis,* **However,** **the false expression above would actually be correct when stating:** the p-value is the probability that the null hypothesis *cannot* be discarded, when *no* difference in means. Is that probability significantly low (arbitrary threshold), then we can say: it is *unlikely* that we *cannot* discard the *ex negativo* approach). Still, this defines the conditions of significance, not the p-value itself.

**We hope we were able to address most of the general issues around defining, explaining and understanding the p-value and significance test.** As a final note: keep in mind that p-values do not represent the world in the sense that they represent a constant value, so that we can just rely on its “significance”. The major downside of frequentist statistics is essentially that its models do not learn and reflects less what the role of a hypothesis and the theory behind it is (weights). Also recall that the frequentist approach comes with a uniform prior, so to speak. This may be good and safe spot in a lot of situations (most of all when no prior knowledge is given), however, in the long run it nakedly exposes science to a game of randomness and pretends to be neutral, which is at the end a dangerous lie, making people pretend to not be responsible in making at least major conceptual assumptions about the world in the first place, since some significance (meta-studies therefore become more and more important over the years).

**IMPORTANT TAKE HOME MESSAGE: Don’t let yourself get confused by lousy definitions of the p-value, even if they come from professionals!!! Any mathematical statement has to have a mathematically formulated representation. A mathematical statement is not about an effort of correctly interpreting words, but the instructions of an algorithm represented as formula. In the end, the formula is all we got in order to confirm our definitions, so it is probably best to start there as reference. **

As mentioned before, the t-test only slightly differs from a regular z-test. For a regular t-test the only difference is that we will rely on the *sample* standard deviation / variance not on the population sd / var. In other words we will operate with **Note that if we were two compare three samples with each other, we would perform a analysis of variance (ANOVA) — a topic which we will cover in future tutorials.** At the end of this tutorial, we will also perform a t-test on our linear model, which will look at two relations. The only difference will be that we will operator with **degrees of freedom (which is a kind of general approach to Besserl’s correction, as we will see)**. We will get there later below, since we don’t need to understand the concept yet for our regular t-test looking at one relation only (comparing means, where we can just use Bessel’s correction).

For the **one sample t-test** at least the expected value has to be given, which can be the mean of a bunch of previous samples (previous trials) or a theoretical value (another model, e.g., from physics, i.e., an expectation backed by other (kinds) of research and its respective formulas and predictions they make). **It will not be the actual population mean, since the population is considered unknown, when performing a t-test, different to the z-test.** The t-test is also called Student’s t-test (we will not cover the history of the t-test, but we recommend doing some research on why it is called that way!).

Note that this test is considered robust, given **equal variance and sample size**. We could also use a **Welch’s test** in case of **inequal variance and sample size.** **However,** **we will not cover the Welch’s test here (we will do so some other time; note that the ****t-test()**** function and other functions actually often are able to decide for the correct/optimal test itself, based on the input data).**

**One-sample t-test:**

In the formulas below we essentially just exchanged z with t, that’s all: We centralize in the numerator and use the *sample* standard error of the mean for scaling or *studentization *(instead of standardization).

**Independent Two-sample t-test:**

**Dependent/Paired Two-sample t-test:**

Recall that we could also add “minus the expected mean difference” in order to represent a centralization in the common form. But again, since the expected difference under the null hypothesis will be zero anyways, we can abbreviate the formula.

**A t-distribution is very similar to a standard normal distribution and approaches a normal distribution the higher the sample size will be (similar to what we encountered when discussing the central limit theorem).** The formula of a PDF of the t-distribution is a little complicated and deriving its formula would take a separate tutorial for sure. We will therefore not cover all the details here. The formula of the PDF of a t-distribution goes as follows:

As you can see, the formula looks rather nasty. The variable *nu*) refers to the **degrees of freedom** and the twisted “L” is referring to a capital gamma and a **gamma function**. We will not go into the details of a gamma function, but to at least mention it: A **gamma function** is essentially a function of the *factorial* of **degrees of freedom:**

We didn’t discuss Bessel’s correction in detail, but however, one way of understanding Bessel’s correction is as representing the **degrees of freedom**, where the 1 refers to one relation/parameter looked at (difference in the mean). A linear regression model on the other hand operates with two parameters that are evaluated: the intercept and the regression coefficient, such that the **But what are degrees of freedom now exactly? **

**Let us think of two numerical examples:** Say we have taken a sample from a population and end up with a set of values of 10, 20 and 30 (**This is exactly what the degrees of freedom refers to: given the mean 20 and **** of 3, there are only two numbers that are free in the sense that they can vary. After these two values are set, the third number won’t have any freedom or variability anymore, since it is indirectly dictated by the given mean!** As mentioned, the degrees of freedom of a linear model with the two parameters

**Our first numerical example** is special in the sense that the value of the mean (below 20) is entailed as a value of the set of

The same holds for a set of

**To formulate a definition of the degrees of freedom in maybe more common words:** The concept of “freedomness” refers to the number of values that can vary and are not dictated by the parameters such as the mean themselves.

**To wrap this up**, we will now look at a t-distribution with several different degrees of freedom and check if it is true that **the higher the df (and therefore indirectly ****) the more it approaches a standard normal PDF.** Below you will find the code for the function of the PDF of a t-distribution and we will plot the standard normal distribution for our comparison using our

function:**prob_dens(x,mean=0,var=1)**

```
# PDF of a T-Distribution:
# df = n-1 # or 2 for two estimates/relations
t_distr = function(x,df){
t_1 = gamma((df+1)/2)/(sqrt(df*pi)*gamma(df/2))
t_2 = (1 + (x^2/df))^(-(df+1)/2)
t_distr = t_1*t_2
return(t_distr)
} # End of function
# Initialize Sequence:
seq = seq(-4,4,by = .1)
# Plot standard normal and t-distribution
# with df = 1, df = 2, df = 10, df = 100:
plot(x=seq,y=prob_dens(seq,0,1), type ="l")
lines(x=seq,y=t_distr(seq,1),col="red") # df 1
lines(x=seq,y=t_distr(seq,2),col="green") # df 2
lines(x=seq,y=t_distr(seq,10),col="blue") # df 10
# Here you can see that given a df = 30 the t-distribution
# gets very close to standard normal distribution, hence
# the z-score is used, given n>30:
lines(x=seq,y=t_distr(seq,30),col="lightblue") # df 30
# Moving to df = 100 the light blue line will essentially overlap
# with the standard normal PDF
plot(x=seq,y=prob_dens(seq,0,1), type ="l")
lines(x=seq,y=t_distr(seq,100),col="lightblue") # df 100
```

So again, we haven’t covered Bessel’s correction, but we should now understand it in the sense of degrees of freedom and how it affects approaching a standard normal PDF when

At last, looking at the plot above **we can eventually see** **why it is said that given a df of over 30 and given the population sd, one can use the z-score, since the t-distribution approaches the standard normal distribution.** Below you will find a plot with df = 100, resulting in a diminishing difference:

**In other words: if we want to make sure to correct for a bias, then in the case of a sufficiently high number of samples / df, it makes only little difference to working with a regular old standard normal PDF.** There is more behind it mathematically and the number n>30 is still somewhat arbitrary, but we hope this gave you enough information to understand and make it plausible why the sample size dictates which kind of test score is to be used.

**summary(lm(y~x))**

Before we discuss the t-test for a linear model, we are going to go through the complete output of the function

. However, this part can be considered optional, if you want to get right into testing the slope and intercept of a linear model. **summary(lm(y~x))**

In general, the output of

extends the output of **summary(lm(y~x))**

and, e.g., performs a t-test and several other things on our linear model. Note that the **lm(y~x)**

function can be used for several purposes. Its output depends on the type of input (check **summary()**

for a documentation of the function; note that documentations for R functions are sometimes a little hard to read and sometimes incomplete). However, we will only use it together with the **?summary()**

function for now. **lm(y~x)**

Let us first use the second synthetic data set from Inferential Statistics II to look at the output of

, in order to get a basic **summary(lm(y~x))****overview of what’s ahead:**

```
### Second synthetic data set from Inf. Stat. II
# Lady Meow drinking tea ^^
# Cups of Tea:
cups = c(0, 1, 2, 3.5, 4.8, 5.2, 6, 6.9, 8.5, 9.1, 9.9, # run 1
0, .6, 2.6, 3.1, 4.8, 6.9, 7.1, 7.5, 8.5, 9.9, 9.9, # 2
0, .2, 2.9, 2.9, 4.9, 5.2, 6.9, 7.1, 7.3, 9, 9.8) # 3
# Time passed for each run of measurements (cups), abbreviated:
time = c(0:10,0:10,0:10)
# Summary(lm(y~x)) => y under the condition (~) of x
summary(lm(cups~time))
```

Executing the line

will give us the following console output:**summary(lm(cups~time))**

The first line of output just shows the code formula. The next part looks at the **residuals** from the perspective of **quantiles**, which includes the **median**. This part can also be recreated via:

```
# Residual quantiles, median and mean:
summary(residuals(lm(cups~time)))
```

We have used the function

before in the second part of this series. However, the formula that the function **residuals(lm(y~x))**

uses is simply representing the data value of the dependent variable (**residuals()**

The part **“Coefficients”** entails the output of the

function, i.e., the **lm()****estimates** for the optimal **standard error (of the mean)**, which is, as we know, not the same as the standard deviation (its the sd of the mean not a whole set of values, see details later below). Other columns entail **t-values** and **p-values** (the Pr stands for probability). **In this tutorial, we will focus on this part/table of the output of ****summary(lm(y~x))****. The above output only entails a t-test though (we will also only cover the t-test for a linear model).**

The next part of the R output concerns the **“Significance Codes”** and just entails a legend for the number of stars that are associated with several p-value thresholds. Three stars means that the p-value is approaching zero and therefore very small.

The **last segment of the console output entails a couple of values**, such as the **F-statistics**, again our p-value (of the slope) and others. These are additional values to get a grip on the quality of our model, which we will focus on in the **next ****sixth**** part of this tutorial series**. **However, the only values that will not be covered in this tutorial is multiple and adjusted R^2 and the F-statistics.** In other words: after this tutorial we nearly completely replicated the output of

by just using mathematical formulas, whoop!!**summary(lm(y~x))**

We are now approaching the end if this tutorial. Our last topic will be hypothesis testing the slope and the intercept of a linear model. After this last chapter we will have replicated most of the output of

! Let’s get to it: **summary(lm(y~x))**

**Our data set:** First of all, because Lady Meow is adorably predictable in her tea drinking behavior, the p-value we will receive for the slope of our optimal linear model is so small that R will just output a plain 0. We therefore decided that we will use a random data set this time. **We chose the integrated data set “mtcars”, which can be loaded into the environment via ****data(“mtcars”)****.** Loading the data set will make it possible to call columns of that data set via “mtcars$…”:

We intentionally chose a column of the data set “mtcars” that led to a model with a very *high* p-value, such that it will be above our significance threshold of 5%! For the independent variable we just created a set of

. **length(column)****In general, let us go abstract and just ignore the meaning of the variables (cars are evil anyways) and let us look at the plot right away:**

```
data("mtcars")
dep = mtcars$disp
indep = c(1:length(dep))
plot(indep,dep)
abline(a=alpha,b=beta)
```

Let us look at the output of

, where we can clearly see that with a p-value of 0.905 we **summary(lm(y~x))***can’t discard* the **null-hypothesis that the slope is essentially zero and there is no conditional linear relation between the variables, or only very little change in the data when **** changes (the variable we have control over: our independent variable)**:

```
# The p-value is 0.9047, suggesting no significant
# result, given a slope of -0.2913, a very low
# slope, barely different to 0:
compare = summary(lm(dep~indep))
# For more detailed numbers look at:
compare$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 235.528226 45.597118 5.1654192 1.460149e-05
# indep -0.291294 2.411567 -0.1207903 9.046625e-01
# Our optimal linear model:
beta = cov(indep,dep)/var(indep)
# [1] -0.291294
alpha = mean(dep)-beta*mean(indep)
# [1] 235.5282
fx = alpha+beta*indep
```

**Let us first replicate the p-value of the slope of our linear model:**

We mentioned it before, but the *mean slope* of 0. It is again the same principle as in a regular z-test/t-test: centralization and then complete studentization via the sd. This time we do not subtract a population mean / expected value from a sample, but the *expected slope* assuming the null-hypothesis from the evaluated slope of our model. Since that expected slope will be zero, you will often find formulas entailing just the beta term in the numerator as default.

The **formula for the standard error of the slope looks a little complicated** at first, but we will get there. **Note that the standard error of the slope is in nuce just again**

and its form below just results from rearrangements. The below formula says: the standard error of the slope equals the standard error of the sum of squared residuals/errors divided by the sum of squares of **sqrt(var(beta))**

You may wonder where the classic

has gone in this formula. Here it comes:**sqrt(n)**

The above refers to the standard error of the residuals/errors or “residual standard error” (term used in the R output; SSE = sum of squared errors). It is essentially just our formula for calculating the sum of squared errors, which we discussed in the **second part of this series**, and then averaging them over the degrees of freedom: so it represents the *standard deviation*** of the errors/residuals**. **Without taking the square root the above is also called the mean squared error or variance of the errors (****).** The above will then be divided by the sum of squares of **third part of this series**:

Another way to represent the standard error of the slope is via the following formula (the second formula below is our outwritten first approach from above):

Here we can clearly see how a relation between

There is **yet another way of calculating the standard error** of the slope via an alternative approach of calculating the sum of squared errors/residuals (SSE) [2]:

Here we have the parameter beta itself within the function and the denominator still represents a centralization. However, the formula is now completely brought into the realm of variability under the perspective of sum of squares.

**We hope this helped you to understand the rather tricky formula for the SE of the slope.** We have still trouble to give this part a detailed intuition, but we will get back to this in the future (feel free to comment or contact us if you have a good take on this!).

Below you will find the code to replicate the above and to evaluate the p-value in comparison with the

function, leading to equivalent output values (whooop!!):**summary(lm(y~x))**

```
# Sum of square residuals:
SumR2 = sum((dep-fx)^2)
# [1] 475953.3
SumR2_alt = sum(as.numeric(residuals(lm(dep~indep))^2))
SumR2_alt2 = sum((dep-mean(dep))^2) - beta*sum((dep-mean(dep))*(indep-mean(indep)))
all.equal(SumR2,SumR2_alt,SumR2_alt2)
# [1] TRUE
# Residual SE:
residual_standard_error = sqrt(SumR2/(length(indep)-2))
# [1] 125.9568
compare
# => Residual standard error: 126
# summary(lm(y~x)) rounds the value in the output,
# but since we will get to equivalent results we can
# savely assume that the output value is just rounded.
# Standard error of the slope for t-distribution:
se_denom = sqrt(sum((indep - mean(indep))^2))
se_slope_t = residual_standard_error/se_denom
# t-value:
t_value = beta/se_slope_t
# [1] -0.1207903
# p-value for two-tail == to standard output of summary(lm(y~x)):
2*integrate(t_distr, df = length(indep)-2, lower = -Inf, upper = t_value)$value
# [1] 0.9046625
# Alternative via basic R function pt():
2*pt(t_value, df = length(indep)-2)
# [1] 0.9046625
# t-value and p-value is equivalent with the output of
# summary(lm(y~x)):
compare$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 235.528226 45.597118 5.1654192 1.460149e-05
# indep -0.291294 2.411567 -0.1207903 9.046625e-01
```

We did it!! We replicated the most important part of the hypothesis test on a linear model: the slope. Let’s move on to test our intercept:

We mentioned briefly before that testing the slope is often not as important. **Under the null hypothesis a t-test of the intercept asks if the crossing point with the y-axis is equal to zero, given that **** is ****.** Different to the slope, we do not ask for a relational value (a value related to the change of **For example: A human with the size of 0 cm and age 25 just does not exist. The test on the intercept therefore just does not make a lot of sense.**

In the case of Lady Meow, it actually makes sense that the intercept is around zero, since we argued she sleeps 14h and is observed 10 per day starting with 0 cups and 0 hours. **IMPORTANTLY:** **In this example we would not want to discard the null hypothesis, since a ***in***significant p-value on the intercept would make sense.** And it is true, looking back on the output we the discussed in the introduction of this tutorial using our Lady Meow data set: The p-value of the intercept is insignificant and we can also see that our alpha of 0.2318 is very low and close to zero. **In the case of Lady Meow this is totally plausible. Let us look at the R output using our secon synth. Lady Meow data set:**

In comparison, when looking at the output of “mtcars” we can see that our intercept is way above 0 and we get a p-value way below a threshold of 5% (i.e. we can discard the null-hypothesis that the intercept is 0). Note that this output **again makes sense**, when related to the assumption of the *null-hypothesis with a* *slope being zero* being the case and therefore having values of **see plot of mtcars and look at the crossing point of the model function with the y-axis**).

**So in general keep in mind: The p-value of both the slope and the intercept do not have to be zero for a model to be considered “good” — it is also a question of plausibility.**

Below we see the *detailed* output of a t-test on “mtcars”:

```
# summary(lm(y~x)):
compare$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 235.528226 45.597118 5.1654192 1.460149e-05
# indep -0.291294 2.411567 -0.1207903 9.046625e-01
```

**Below you will find another example of a simplified model with a slope of zero.** Imagine you would want to measure some random change and things would just never change when you measure. **The x-axis could be the time again and y-axis could refer to some growth with a starting value of 5 that remains constant (note that the interept will also be equivalent to the mean of the set of **** and ****!). In other words: the model below represents growth that in this case is just not happening:**

```
# Example of a model with slope of zero:
x = c(0:10)
y = rep(5,11)
plot(x,y)
abline(h=5)
```

**What we are aiming at is that the relation between the test for the slope and the intercept is not completely apart from each other**. **Here we can assume that a slope of zero will automatically be related to an intercept that is close to the mean of the data itself, such that ***and* matches the mean of the set **.**

**Now that we have solved the conceptual part of interpreting a p-value of an intercept, let us get to the formula.** It is again the same spiel as before: the expected intercept will be 0, so we can abbreviate the formula:

**The standard error of the intercept however looks a little messy again. Note again: in essence the standard error of the slope is just ****sqrt(var(mean(y)-b*mean(x))****.(the formula for the alternative way to obtain the intercept, discussed in ****part three**** of this series). Rearrangements and including the SSE will lead to the form below. Brace yourselves:**

The formula is really hard to read, even though the general message is the same as always: standard deviation just of the intercept, such that

Below you will find code to calculate the above and to replicate the

output of our “mtcars” example:**summary(lm(y~x))**

```
# Standard error of the intercept:
SE_a = sqrt(SumR2/(length(indep)-2))*sqrt((1/length(indep))+((mean(indep)^2)/sum((indep-mean(indep))^2)))
# [1] 45.59712
# t-value intercept:
t_value = alpha/SE_a
# [1] 5.165419
# p-value intercept:
2*(1-(integrate(t_distr, df = length(indep)-2, lower = -Inf, upper = t_value)$value))
2*(1-pt(t_value,df = length(indep)-2))
# [1] 1.460149e-05
```

We did it!! Let us update our linear_least_square() function that we wrote in part II of this series.

As a short sidenote, we recommend the second chapter (p.13) of the open access stats book “Understanding Statistics and Experimental Design” by Herzog, Francis and Clarke (2019). In the third part of this series, we hinted at one point that the variance/sd can also be understood as precision of an estimate of a mean, gathered from sampling. On the example of a sonar sensor from a submarine, the chapter introduces the t-test as a method to distinguish signal (mean difference) from noise (standard deviation), e.g., distiguishing rock from ‘no rock’, given a noisy channel. The smaller the noise the higher the precision.

**linear_least_square()**

FunctionBelow you will find the code for a complete update of our

function from the second part of this series, to perform the equivalent output of the **linear_least_square()**

of what we have gone through so far. Note that the actual **summary(lm(y~x))**

function has a bunch more options than written in code below. Our function again just serves an educational purpose. Have fun with it!**lm()**

```
# Go-go-gadgeto linear_least_square!!!!
linear_least_square = function(indep,dep){ # Start of function
# Evaluating coefficients for and optimal linear model
# given a set of dependent and independent variabels:
beta = cov(indep,dep)/var(indep)
alpha = mean(dep)-beta*mean(indep)
fx = alpha+beta*indep
# Sum of Squared Errors/Residuals:
SumR2 = sum((dep-fx)^2)
# Residual Standard Error/Deviation:
residual_standard_error = sqrt(SumR2/(length(indep)-2))
# Standard error of the slope for t-distribution:
se_denom = sqrt(sum((indep - mean(indep))^2))
se_slope_t = residual_standard_error/se_denom
# Standard error of the incercept for t-distribution:
se_a = sqrt(SumR2/(length(indep)-2))*sqrt((1/length(indep))+((mean(indep)^2)/sum((indep-mean(indep))^2)))
# t-value of the slope:
t_value_b = beta/(se_slope_t+exp(-32))
# t-value of the intercept:
t_value_a = alpha/(se_a+exp(-32))
### p-value of the slope via integrating the PDF of a t-distribution
# up to the t-value calculated above:
t_distr = function(x,df){
t_1 = gamma((df+1)/2)/(sqrt(df*pi)*gamma(df/2))
t_2 = (1 + (x^2/df))^(-(df+1)/2)
t_distr = t_1*t_2
return(t_distr)
} # End of function
# Two-Tail P(t=T|H_0):
p_b_2t = 2*integrate(t_distr, df = length(indep)-2, lower = -Inf, upper = t_value_b)$value
### p-value of the intercept:
# Two-Tail P(t=T|H_0):
p_a_2t = 2*(1-(integrate(t_distr, df = length(indep)-2, lower = -Inf, upper = t_value_a)$value))
# Results for two tail
Results_a = c(round(alpha,4), round(se_a,4), round(t_value_a,4), p_a_2t)
Results_b = c(round(beta,4), round(se_slope_t,4), round(t_value_b,4), p_b_2t)
Res_data = as.data.frame(cbind(Results_a,Results_b))
rownames(Res_data) = c("Estimate","Std. Error","t value","Pr(>|t|)")
colnames(Res_data) = c("Intercept", "Reg. coeff.")
# Nice output using cat() function:
cat(" Linear least square method in R","\n","\n",
"Independent variable:", "\t", deparse(substitute(indep)),"\n",
"Dependent variable:", "\t", deparse(substitute(dep)),"\n","\n",
"alpha", "\t",alpha,"\n",
"beta","\t",beta, "\t","SumR2", "\t", SumR2, "\n","\n")
print(Res_data)
cat("\n","Residual Standard Error:",round(residual_standard_error),
"on", (length(indep)-2), "degrees of freedom","\n","\n")
# Let us also plot our results:
# We will also use deparse(substitute(x)) for the
# labels of our plot axes.
plot(x=indep,y=dep, ylab = deparse(substitute(dep)),
xlab = deparse(substitute(indep)))
abline(a=alpha, b=beta, col = "darkblue")
} # End of function
# Test:
linear_least_square(indep, dep)
# Linear least square method in R
#
# Independent variable: indep
# Dependent variable: dep
#
# alpha 235.5282
# beta -0.291294 SumR2 475953.3
#
# Estimate Std. Error t value Pr(>|t|)
# Intercept 235.5282 45.5971 5.1654 1.460149e-05
# Reg. coeff. -0.2913 2.4116 -0.1208 9.046625e-01
#
# Residual Standard Error: 126 on 30 degrees of freedom
```

For comparison, here again the output of

:**summary(lm(dep~indep))**

It is done, we have successfully replicated most parts of the

function, whoop-whoop!! **summary(lm(y~x))****In the next part of this series, we will look at other measures, such as the effect size, F-statistics and we will take a closer look at confidence intervals and confusion matrices.**