
[R] Inferential Statistics V: Probabilistically Evaluating Statistical Models via the Z- and T-Test

Intermediate Stat-o-Sphere

Published on Aug 18, 2023

Review: This is a pre-released article and we are currently looking for reviewers. Contact us via [email protected]. You can also directly send us your review, or use our open peer-review functions via pubpub (create an account here). In order to comment, mark some text and click on “Start discussion”, or just comment at the end of an article.

Fig. How to start a discussion. A pubpub account is needed to do so. Sign up here. Also have a look at articles all about our Open Science Agenda.

Corresponding R script:

Summary:

1 Introduction:

Welcome to the fifth part of our tutorial series on inferential statistics! You have come quite a way and we hope you get the feeling that it is well worth the wait! The following tutorial requires a basic understanding of probability theory, linear regression and essential descriptive measures, such as the variance, the standard deviation and the mean (Inf. Stat. I-IV). On the other hand, we are not going to speed up the pace, as we still aim to consistently replicate the math in R as far as possible. We again recommend making use of the contents function (upper right corner) to reorient yourself in the script. Since we are approaching higher spheres, we will not provide summaries at the end of every chapter, but we still provide a downloadable summary above!

Fig. After scrolling down until the title of an article disappears, the contents function becomes available. It can help you to navigate within the script. It is included in every article you will find in our journal.

As mentioned in part four of this series, we will now cover the regular z-/t-tests, comparing simple means only, before applying the same principle of hypothesis testing to the intercept and slope of a fitted linear model.

Let us briefly catch up on what we have gone through so far: In the first two parts of this tutorial series, we focused on the general concept of hypothesis testing (conditional probability) and linear modelling (conditional linear relations). In the third part we looked at our linear model from a descriptive perspective and at alternative methods to obtain the optimal intercept and regression coefficient of a linear model via such descriptive measures (variance, covariance etc.). In the fourth part of this series we looked at basic distributions / functions such as the uniform and several forms of Gaussian distributions, and evaluated the z-score and the effect size as a special case of the z-score (where the mean of the sample is the input; we will talk about the effect size again, since the z-statistics/value can be looked at as a weighted effect size…).

In the fifth part of this series, we are going to fully combine the concept of probabilistic relations with the concept of mean estimate and linear relations. In other words: we will look at methods to evaluate samples as well as linear models from a probabilistic perspective (p-value).

Apart from evaluating p-values there are also other methods that we can use to evaluate our model in general, such as the effect size, which is not part of the summary(lm(y~x)) output in R, but important nonetheless. However, we will get to further measures such as the F-statistics in the sixth part of this series.

In the context of R this tutorial can be understood as the first part of understanding the output of the summary(lm(y~x)) function as well as the t.test(y~x) function. In the sixth part we will eventually replicate the output of the mentioned R functions in full and we will also look into the topic of power analysis, confidence intervals and confusion matrices (essentially conditional probability tables) more closely. In other words: this tutorial is all about p-values.

The general question we will ask when comparing the means of two distributions will be: how different are the two distributions in question? More precisely: how likely is it that the sample was obtained by sampling from a population (or from another sample in comparison)? What is the difference and is it different enough?

On the other hand, the general question we are going to ask, when evaluating a linear model will be: How good, how fit is our model of Lady Meow drinking tea under a probabilistic perspective? Is the difference between the independent variable and the dependent variable enough to consider a conditional (cor-)relation between them?

Fig. Ms. Miranda, checking on our model of Lady Meow drinking her beloved tea (dependent) during a period of time (independent variable). ^^

So the goal is now to not only descriptively, but also probabilistically evaluate how well our linear model predicts given / new data. Again, we will do so by computationally replicating most parts of the math of the summary(lm(y~x)) function and other functions in R, such as the infamous p-value.

2 Recap on Z-Scores and an Overview on the Regular Z-Test / Gauß-Test and its P-Value

We are finally approaching our first statistical hypothesis test using a PDF: the Gauß-test or also more commonly called z-test. As mentioned a couple of times: the difference to a t-test is formally marginal, but not too trivial as we will see further below.

As mentioned: a z-test tests for a difference between the sample mean and the population mean (or the mean of another sample), in a way that we can use the result of a z-test to make assumptions about the probability that a sample distribution is part of a population distribution or of another sample (when taking multiple samples; this is backed by the central limit theorem).

All this is done via a z-test, e.g., by using the mean of the sample and the population, given the var/sd of the population. To make things a lot easier upfront, the core of what a z-test does is simply the following (similar to the z-score that we encountered in the fourth part of this series):

Different to the z-score, we now do not standardize a whole set of sample values but the mean of a sample only. Similar to the effect size, you may reckon; we will get there!

In any case, now we are really comparing a difference in means. Doing so can essentially again be understood as a kind of centralization, just not of a vector $x_i$ but of the mean only.

The above difference will eventually be divided by the standard deviation of the population (again not of the whole set, but of the mean only) to obtain the z-value/-statistics. This is different to the effect size, you may reckon. Below you can see that the standard deviation of the mean of a sample brings in the $\sqrt{n}$. Don’t worry about why this is the case for now; we set up a whole chapter to clarify where $\sqrt{n}$ actually comes from and it will be very enlightening, promised.

For now just think of $n$ itself as a weight that reflects the central limit theorem: the higher the sample size, the less likely it is that a large difference in means is the result of sampling from the actual population.

Below you see a comparison between the z-score (discussed in part four of this series), Cohen’s d effect size and the z-value (the result of a z-test). Recall that Cohen’s d is just a special case of the z-score! Looking at the z-value/statistics we can see that it is, in turn, just a special case of the effect size, in some way:
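As a compact reference, here is a sketch of that comparison in the notation used so far ($x_i$ a single sample value, $\bar{x}$ the sample mean, $\mu$ and $\sigma$ the population mean and sd):

$$
\text{z-score: } \frac{x_i - \mu}{\sigma} \qquad \text{Cohen's d: } d = \frac{\bar{x} - \mu}{\sigma} \qquad \text{z-statistics: } z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = d \cdot \sqrt{n}
$$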

For a z-test we can either compare a sample mean with a population mean or the means of two samples (one- or two-sample z-test, details further below). The rest is essentially again just a method of converting the above difference into a standard normal PDF. This “standardized” perspective can be helpful to get a general grip on the difference in means, as we have seen in the fourth part of this series.

For example: testing a medication that is supposed to lower blood pressure, how can we interpret a difference of -10 mmHg between the control group and those that took the medication? Maybe it is way below the standard deviation of the population?

Technically, standardization also helps to simplify the process of integration (via z-tables), structurally equivalent to what we did when obtaining z-scores, but now with a focus on converting the difference in means only. Not exactly only, since we can see above that there is a weight of $n$ involved in the z-statistics as well! We can also see that Cohen’s d effect size is a composition of the denominator of the z-score and the numerator of the z-statistics. Essentially, Cohen’s d is the effect size standardized with the regular sd.

If we include $\sqrt{n}$, the result of our actions will be a specific z-score, the z-statistics/value of the above difference of means. Again, the rest is formally equivalent to the process of evaluating z-scores or the effect size: the process of standardization translates a simple numeric difference of means into a numerical relation that represents the difference of the sample mean from the perspective of a standardized (population) mean of 0, in measures of standard deviation (recall that the values of the x-axis of a standard normal PDF are in the unit standard deviation). Therefore: a test score / z-value of -1 states that the standardized difference of the sample mean is -1 * pop.-sd away from the population mean. Looking at the formulas above, we can see that this is not exactly true, since the z-statistics also involves $\sqrt{n}$ and thereby weighs in on the effect size: the higher the $n$, the higher the z-statistics will be. This circumstance is also the reason why there is no point in testing with an exorbitant sample size, hoping to make things more evident, since it becomes a self-fulfilling prophecy, so to speak (we will look further into this circumstance in the next part of this series!)
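To see this weighting in plain numbers, here is a minimal sketch in R with made-up values: for one and the same standardized difference in means (effect size), the z-statistics and the resulting lower-tail p-value change with $n$ alone.

# Minimal sketch (made-up numbers): same effect size, growing n:
d = -0.3              # hypothetical standardized mean difference (effect size)
n = c(5, 50, 500)     # hypothetical, increasing sample sizes
z = d*sqrt(n)         # z-statistics = effect size weighted by sqrt(n)
p = pnorm(z)          # corresponding lower-tail p-values
round(cbind(n, z, p), 6) # p shrinks with growing n, although d never changed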

In the figure further below, we will compare two distributions, but we kept the x-axis non-standardized in this case. Arguing that two distributions are 100% different from each other would indicate no overlap of the sample mean with the population distribution at any area of that distribution. An overlap of the mean with the distribution of the population, on the other hand, indicates that it is possible to draw a sample with the respective mean when taking multiple (random) samples from the population. This possibility is represented either in percent or as a certain probability. From the probabilistic perspective we will therefore later ask if the sample is likely/unlikely to come from the same distribution, when integrating up to a point of difference. Our hypothesis will be that there is no difference (null hypothesis), meaning: the sample is a sample from the population (or another sample). This means that when we get a certain z-score and we integrate up to that score, it tells us with what probability we would have obtained our sample mean from the population. Since we argue a priori that the sample mean is from the population (null hypothesis), we take the population distribution as our baseline, so to speak. When we compare two samples, we set one sample to be the baseline (with a standardized mean of 0) and compare the other sample with it.

The hypothesis of a z-test is also specified to address a specific tail of the normal distribution: the term tail refers to the outer left or right parts of a normal distribution. Our sample mean can either be lower or higher than the population mean (lower and upper tail). There is also a two-tailed version checking for overall equality/inequality (both tails at the same time). We will get into the details further below…

Usually, we do not aim for a 100% difference. Therefore, a threshold is commonly set to define significance. This threshold is often .05, i.e., the lower/upper 5% tail of the normal distribution, as getting down to 0% is rather unlikely. Note that we are not going through all the details of confidence intervals, power analysis etc. in this tutorial, since we do not need them in order to define and calculate a p-value. This may surprise you, which may be due to the fact that the significance test and the p-value as the result of a z- or t-test are often conflated with each other (see chapter 3.4 for details on common misleading and faulty definitions of the p-value).

In the plot below you will see a z-value for a difference of the sample and the population mean that is below the .05 threshold, indicating that it is at least very unlikely that the sample is from the population or “the same population”. NOTE: Even though we are just operating with the effect size in the plot below, not the z-statistics/value, we can still see that the difference is significant. This is due to the fact that a z-statistics with $n=1$ would result in the effect size, such that any value of $n$ higher than 1 will just turn out to be even more significant!

So in this case we can already argue that we can discard the null hypothesis (no difference), which in this case means: there is a significant difference given in the sample and it is therefore unlikely that there is no difference as stated by the $H_0$, or again in other words: it is unlikely or too unlikely (threshold) that the sample is from the same distribution as the population. This is essentially what is called significance testing: comparing the p-value (obtained from integration up to a z-statistics) with a threshold value that argues for a kind of “enough difference”.

Fig. Plot of two non-standardized normal distributions. The blue line marks the population mean, the orange line the sample mean. Standardization would essentially just “translate” the x-axis into a general relation. The plot above shows what can be called a one-tailed z-test (technically not quite, since above we only operated with the effect size, but it is already significant). On a standardized x-axis the population mean would be at zero and the effect size (recall: not the same as the z-statistics, but close) of the sample mean would be the standardized deviation of the sample mean from the population mean (the value of the effect size). We also included the confidence interval / threshold with which we can visually perform a significance test, since we can see that the sample mean is below the critical value already (even without weighting in $\sqrt{n}$). Note that a hypothesis test can be performed without the need of a threshold / confidence interval. We will discuss the full meaning of confidence intervals, confusion matrices, power analysis etc. in the sixth part of this series.

Code for the plot above.

# General formula for a probability density function (PDF).
# Insert mean = 0 and var = 1 to obtain the values for input x
# given a so-called standard normal distribution:
prob_dens = function(x,mean,var){
  fx = (1/sqrt(2*pi*var))*exp((-((x-mean)^2))/(2*var))
  return(fx)
} # End of Function

# Comparing distributions:
# We chose equal variance, so the form will be the same

# Sample:
mean_samp = 5
var_samp = 12

# Population:
mean_pop = 15
var_pop = 12 

# Plot:
seq_samp = seq(-7,40,by =.01)
plot(x = seq_samp, y = prob_dens(seq_samp,mean_samp,var_samp), type = "l", xlim=c(-8,30),lty =2)
seq_pop = seq(-7,40,by =.01)
lines(x = seq_pop,y = prob_dens(seq_pop,mean_pop,var_pop), type = "l")
abline(v=mean_pop,col="lightblue")
abline(v=mean_samp,col="orange")
abline(h=0) # line for x-axis at y = 0

# Quantile function for .05 of pop:
pop_threshold = qnorm(.05,mean=15,sd=sqrt(12))

# Add line to mark lower tail threshold of .05 (5%):
segments(x0=pop_threshold,x1=pop_threshold, y0 = 0, y1 = prob_dens(pop_threshold,mean_pop,var_pop))

Let’s go through all of the above in detail:

2.1 The Z-Value/-Statistics and the Effect Size in Detail

As we have seen before, the formula of the z-test’s z-value is nearly the same as the formula of the z-score. There are just two minor differences. However, it is the same principle at work as in the regular z-score: centralization in the numerator (evaluating the difference) and then standardization via the sd (similar to the process of normalization in probability theory). Again, a z-test is just a z-score used to articulate a specific difference: a difference of mean values only! Note: the formula below refers to a one-sample z-test (we will catch up on the slightly different formula for two samples further below).
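A sketch of that formula, with the rearrangement mentioned in b) below on its right-hand side:

$$
z = \frac{\bar{x} - \mu}{SE_M} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{\bar{x} - \mu}{\sigma} \cdot \sqrt{n}
$$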

Differences to the z-score (redundant overview, potentially optional):

a) For the z-table we used a sequence of possible values of $x$ of our standard normal distribution and therefore calculated a series of possible p-values, which resulted in our CDF. Now we are interested in inputting one specific value of $x$ to calculate its z-value (also called z-statistics) only: our sample mean as input $x$, which will be centralized by the population mean and then divided by the population sd of the mean (standardized), see b) below:

b) To perform a comparison of means in such a way we have to take the so-called standard error of the mean ($SE_M$) into account, which is essentially just the population standard deviation, but here again: the mean only will be the input x. So, again: not a whole set of data or possible mean values is looked at and centralized, but the mean of a set itself in relation to the population mean. Importantly, on the far-right side of the equation above you will find a rearrangement of the formula that can be read as the z-score of the sample mean weighted by the square root of $n$ (see the next chapter for why $\sqrt{n}$ is used; it is rather banal but enlightening, so we still gave it an extra chapter). This can be looked at as the effect size (Cohen’s d) weighted by $\sqrt{n}$, where the effect size shows the mere standardized difference in means only. In other words: for the z-statistics/value the deviation of the sample mean from the population mean is again put into proportions of the size $n<N$ (backed by the central limit theorem). But what does weighting by $\sqrt{n}$ do? Numerically, weighting by $\sqrt{n}$ will eventually lead to a rise in confidence in the comparison, so to speak, since the higher $n$ is, the higher the effect will appear in the form of a z-statistics, which will eventually lead to an automatically lower p-value (and therefore higher significance in the end). This is not what we want when we simply want to compare the difference in means from a standardized perspective, as we will see (effect size, see next paragraph). Put simply, the weighting by $\sqrt{n}$ argues that if we have a difference in means, such a difference will have a much higher meaning when the sample size is, e.g., 1000, compared to the confidence we could gain from a much smaller sample with the same difference in means. In the next chapter we will explain where $\sqrt{n}$ actually comes from mathematically, looking at the details of the standard error of the mean.

As we know, the z-statistics is very close to Cohen’s d effect size, which also standardizes a difference in means, but does not use the standard error of the mean, only the regular standard deviation. From another perspective, the formula below represents the z-score of the mean only, not of a whole set $x_i$.

Fig. A low standardized difference in means indicates a low effect and a large effect size stands for a large effect. This scale plays a role in the power analysis in chapter six. The unit of the effect size is sd, just as for the z-statistics. So the numbers below are not some scale that arises from the formula, but just the standardized difference of the mean (using the regular standard deviation of the population, not the standard error of the mean).

The effect size is the perspective we want for a mere standardized numerical comparison, where the plain mean difference may not be informative (it just converts the mean difference into measures of sd). Say we have a measurement of 140 mmHg systolic blood pressure and a difference of 10 mmHg: is that much? The standardization in the fashion of the plain effect size can help to get a grip on how big the actual effect is. Cohen’s d can be looked at as equivalent to the corresponding z-statistics/value, given that $n = 1$.

# Comparison z-stat and effect size:
x = c(0:10)
y = c(0:10)*3
pop_mean = mean(y)
pop_var = sum((y-mean(y))^2)/length(y) # population variance
pop_sd = sqrt(pop_var)                 # population sd
sem_x = pop_sd/sqrt(length(x))
z_stat = (mean(x)-mean(y))/sem_x
effect_size = (mean(x)-mean(y))/pop_sd

# Same with n = 1
sem_x_2 = pop_sd/sqrt(1)
z_stat_2 = (mean(x)-mean(y))/sem_x_2
effect_size_2 = (mean(x)-mean(y))/pop_sd

# Check for equality of z_stat2 and effect_size_2:
all.equal(z_stat_2,effect_size_2)

The z-statistics/value may still appear a little out of order, but let us take a closer look at the $SE_M$, the denominator of the z-value, in order to understand where $\sqrt{n}$ comes from!

2.2 Standard Deviation/Error of the Mean

The standard error of the mean can be a little tricky, even though it actually doesn’t introduce anything new to us (see part three). Nevertheless, we decided to give the topic some space in its own chapter. We promise it is worth it, as it again relies on assumptions of the central limit theorem, i.e., it essentially relies on the relation between the sample and the population size ($n<N$). However, you can also skip this chapter if you want to get right into calculating p-values.

Per definition, the $SE_M$ evaluates or estimates the standard deviation of the sample mean from the actual population mean. Again: the standard error (of the mean) is actually the same as the standard deviation, it just includes a reflection on the size of the sample. However, the formula of the $SE_M$ with its $\sqrt{n}$ blurs away the simple fact that we know the complete formula already, as we will see in a bit. So even though we will go through all of it in detail, just keep in mind upfront what input we are speaking of:

STANDARD DEVIATION = Average deviation of a set of data ($x_i \in X$) from the mean of that sample or a population mean (centralization, recall Inferential Statistics III).

==> $\sqrt{Var(X)}$

STANDARD DEVIATION/ERROR OF THE MEAN = Average deviation of the sample mean from the population mean (not a set, just the sample mean in relation to the population mean is centralized!!).

==> $\sqrt{Var(\overline{X})}$, i.e., $\sqrt{Var(Mean(X))}$

Given only a slight difference, the regular standard deviation is often confused with the standard error of the mean. The reason is again that it essentially is the standard deviation, just of the mean only. Note that in R we can’t just use sd(mean(x)) to compute the standard error of the mean, as we will see.

For the z-test the $SE_M$ is evaluated using the population standard deviation of the mean only. In general, however, it is mostly estimated using the sample sd of the mean (t-test), as the population sd is usually not given (but is necessary for the z-test); the principle is the same. Let us look at the formula of the $SE_M$:
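Sketched in the notation used so far (for the t-test the population sd is replaced by the sample sd):

$$
SE_M = \frac{\sigma_{pop}}{\sqrt{n}}
$$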

As written above: a special characteristic of the $SE_M$ is that the higher the $n$ behind the sample mean, the lower the deviation from the population will be (due to the central limit theorem).

CAVE: This is exactly what Cohen’s d effect size is trying to avoid, since samples with a large $n$ tend to become significant more easily. Still, it is also understandable, given the central limit theorem, to weigh in $\sqrt{n}$. Neglecting $\sqrt{n}$ is the only difference between the z-statistics and Cohen’s d effect size (note that there are other types of effect size as well; we will go into details in the next part of this series).

To get more insight into why this weighting makes sense, let us ask again: where does the square root come from anyway? The answer is actually quite intuitive: the mean as input brings in the $n$ of the sample and therefore a relation between $n$ and $N$! Via the decomposition below we can derive the commonly used formula of the $SE_M$:
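A sketch of that decomposition, using the fact (derived in the variance-only version further below) that $Var(\overline{X}) = Var(X)/n$:

$$
SE_M = \sqrt{Var(\overline{X})} = \sqrt{\frac{Var(X)}{n}} = \frac{\sqrt{Var(X)}}{\sqrt{n}} = \frac{\sigma_{pop}}{\sqrt{n}}
$$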

In the decomposition above we can see how the sample mean actually disappears completely as input: once $\sqrt{n}$ is taken out, what remains is just the population sd, now representing a kind of redundant averaging. This comes in handy when we are only given the sample size $n$ and the population sd as such, but no further details (no information on the size $N$ of the population is needed; only its parameters mean and sd enter the simplified form of the $SE_M$!).

Note upfront that the above bounded relation between $n < N$ also holds for the expected value used in a t-test: we can still draw relations between a lower and a (potentially) higher size of $n$ in the same sense of an inequality, representing a relation between $n$ and an approximate $N$ (backed by the central limit theorem).

Below we tried to mathematically express the $SE_M$ as an inequality, dependent on the size of $n$. On the far left side $n = N$, i.e., we essentially compare the population mean with the population mean, which will result in an $SE_M$ of 0 (and never below 0, given the population sd and its $N$).

Looking at the formulas above also helps to understand how the sample variance corrects for a bias (Bessel’s correction), since $n/n$ would cancel out and we would subsequently just lose the perspective on our inequality between $n<N$. The factor $(n-1)/n$ will always result in a fraction below 1 (we will get back to this when discussing the concept of degrees of freedom, which is the best way to understand why minus 1 and not, e.g., minus 1.4 is used…). This makes the use of the $SE_M$ also possible for the t-test!
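As a small aside, a minimal sketch in R of what Bessel’s correction does numerically (made-up data; note that R’s built-in var() and sd() already divide by n-1):

# Population vs. Bessel-corrected (sample) variance:
x = c(0:10)
pop_var  = sum((x - mean(x))^2) / length(x)        # divides by n
samp_var = sum((x - mean(x))^2) / (length(x) - 1)  # divides by n-1
all.equal(samp_var, var(x)) # TRUE, var() uses n-1 by default
pop_var < samp_var          # TRUE, the corrected estimate is slightly larger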

Below you will find another decomposition, using the variance only (optional), again showing where the square root of $n$ comes from. Take the square root of the formula below and you will end up with the $SE_M$ again. For more details: the solution below was taken from StackExchange. However, in the decomposition below it does not instantly become obvious that $N$ is involved too when calculating the variance, so keep in mind the implicit relation between $n$ and $N$ in the standard error / deviation of the mean:
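A sketch of such a decomposition, assuming $n$ independent draws $x_1, \dots, x_n$ that each carry the population variance $\sigma^2$:

$$
Var(\overline{X}) = Var\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(x_i) = \frac{n\,\sigma^2}{n^2} = \frac{\sigma^2}{n}
$$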

Below you will find some code to calculate the $SE_M$ using a made-up population and different sizes of $n$, in order to show that the above inequality, dependent on the size of $n$, holds!

# Example for SEM:

# Our population:
x = c(rep(c(0:10),100))

# Population variance:
var = sum((x-mean(x))^2)/length(x)

# Evaluate SEM for several different sizes of n:
n_sample = 20
sem_20 = sqrt(var)/sqrt(n_sample) 
# [1] 0.7071068 

n_sample = 200
sem_200 = sqrt(var)/sqrt(n_sample) 
# [1] 0.2236068

n_sample = 1000
sem_1k = sqrt(var)/sqrt(n_sample) 
# [1] 0.1

n_sample = length(x) # = 1100
sem_N = sqrt(var)/sqrt(n_sample) 
# [1] 0.09534626

n_sample = 10000
sem_10k = sqrt(var)/sqrt(n_sample) 
# [1] 0.03162278

sem_20 > sem_200 & sem_200 > sem_1k & sem_1k > sem_N & sem_N > sem_10k
# [1] TRUE

Next (optional) you will find another, more complex simulation in order to illustrate the mathematics above: now we are going to use some code we used to demonstrate the central limit theorem, including actual sampling, created by using the rnorm() function (rnorm = random normal distribution).

We will create a set of 5000 means from samples with a size of $n = 50$ each, sampled from a normal distribution with a mean of 10 and a sd of 2 (population parameters). We are then going to compare the $SE_M$ of the first 20, the first 250, and the $SE_M$ of all 5000 sample means. Running the loop below will result in different values for the $SE_M$ each time, dependent on the size of $n$. Nevertheless, the logic will always hold: $SE_{M,\,n=20} > SE_{M,\,n=250} > SE_{M,\,n=5000}$.

Even though we have knowledge about the population’s mean and sd, below we are first going to just use the sample sd and later the population sd, since at least in general the sample sd can also be used for an approximation of the $SE_M$: as we have learned in part three of this series, the sample sd entails a bias correction of $n-1$ (the error between the estimation and the actual error will therefore still get lower the higher the length $n$ of the sample…).

# Standard error of the mean including actual sampling:
# How the SEM becomes smaller, the more samples we take:

set.seed(1) # for reproducibility
data = rnorm(n = 1000, mean = 10, sd = 2)

# We can again plot our data as histogram
hist(data, col = "lightblue")

# Use a for loop to take 5000 random samples of size n = 50 each:
n_samp = 5000
sample_mean = c() # create empty vector to store the sample means

# The loop below takes samples of length 50 and stores
# the mean of each sample in our vector sample_mean.
# Running this loop will result in different samples each time:
for (i in 1:n_samp){
  # Store mean of each sample with length 50:
  sample = sample(data, 50, replace=TRUE)
  sample_mean[i] = mean(sample)
} # End for i

# Look at the histogram of the sample_mean:
hist(sample_mean, col = "lightblue")

# Now we compare the SEM of the first 20 sample means
# with the SEM of the first 250 sample means. Recall
# that the vector sample_mean contains 5000 means
# with a sample size of 50 each:
sample_mean
length(sample_mean) 
# [1] 5000


#### SEM of the first 20 samples:
samp_n20 = sample_mean[1:20]
length(samp_n20)
# [1] 20

# SEM_n20
SEM_n20 = sd(samp_n20)/sqrt(length(samp_n20))
# [1] 0.05782984


#### SEM of the first 250 samples:
samp_n250 = sample_mean[1:250]
length(samp_n250)
# [1] 250

# SEM_n250
SEM_n250 = sd(samp_n250)/sqrt(length(samp_n250))
# [1] 0.01723958


#### SEM of all 5000 samples:
SEM_n5000 = sd(sample_mean)/sqrt(length(sample_mean))
# [1] 0.003924541

# The numerical values will deviate, each time you run the
# loop, but it will always hold the logic of:
SEM_n20 > SEM_n250 & SEM_n250 > SEM_n5000
# [1] TRUE

# For a z-test we would have to use the population sd,
# which we set to be 2 in the rnorm() function above. Given
# that sd we can simply plug in different sizes of n:
# pop_sd/sqrt(n):
2/sqrt(20) > 2/sqrt(250) & 2/sqrt(250) > 2/sqrt(10000)
# [1] TRUE

For some more examples on the $SE_M$ in general, we recommend this tutorial by The Organic Chemistry Tutor.

3 Possible Hypotheses of the One- and Two-Sample Z-Test and their Respective P-Values

Conceptually, we have already gone through most of what is ahead of us. We have previously obtained an overview of what possible hypotheses of a z-test roughly look like; in this chapter we will learn about all of them in detail.

In general a z-test can be differentiated into two categories (the formula, however, is essentially the same):

=> The one-sample z-test compares a sample mean from a population with the population mean / expected value.

=> The two-sample z-test compares two samples and checks whether they relate to the same population or to different populations.

Apart from that, there are three types of hypotheses for the z-test that can be drawn: one-tailed (lower/upper) and the two-tailed test hypotheses.

The important message of the previous chapter: effect size and z-statistics/value both do the same thing: standardizing the difference in means. We know that the sample size is not weighted into the effect size (or is treated as if $n=1$), and that makes sense from the perspective of the z-score, translating values into the unit pop.-sd. However, it also makes sense to consider $n$, since the input of the $SE_M$ is the actual sample mean. In any case, it is important to look at and understand both of them to interpret your results or those from papers...

Example for the following chapters:

The example we will be using in the next chapters will (mostly) be a comparison of the tea drinking behavior of happy cats and sad cats. We will also look into a catnip treatment at one point (dependent two-sample z-test), so stay tuned! However, brace yourselves, Chat Noir:

Fig. Chat noir, from the looks of it probably significantly sad. :’( Original photo by Raoul Droog.

A two-sample z-test could ask if sad cats drink less tea than regularly happy cats, such as Lady Meow. The samples are distinct from each other since they imply inclusion criteria: cats are tested for being sad or regularly happy (on whatever basis) and then put into their respective category. To answer our above question, these two samples would be compared with each other. The probability value (p-value) would state how likely it is that the samples are samples from the same population. A one-sample z-test is similar, just that not the mean of a second sample is used, but the given mean of a population of all other cats that are at least not sad. The difference to the two-sample z-test may appear minor in this case; however, it is a good lesson to conceptually think of what a population actually represents. The exact question for a one-sample z-test in our example would be: does the “chat noir” drink significantly less tea than any other cat (cats in general)? So in nuce, the category “all other cats” is not as specifically circumscribed by inclusion criteria as in the two-sample test. Such a test can still be performed, but the result of this one-sample test may be that no difference is perceived, due to too many confounding factors or other biases, which are all present in the population (since no further differentiation via inclusion criteria was performed). Keep that in mind!

In case the term population becomes fuzzy when thought of as representing a category: note that the concept of a population is not bounded by a certain “real” category, such as, e.g., species. In nuce, it can also mean a population within a population (sub-population). In the example for the one-sample test above we looked at the tea drinking behavior of a sample of a subpopulation of sad cats amongst a population of “regular” or “all other” cats (of course it’s all made up, but since we are doing a bit…). So here the difference between what a sample is and what a population is, in the form of a category, is not set by the species (cats) but by the mood and potential tea drinking behavior. We could also compare the behavior of different species… We recommend keeping such conceptual/linguistic categorical issues in mind when trying to understand what variables are used. And of course: in order to make a distinction between sad and happy cats, inclusion criteria would need to be discussed in detail (since it is just a made-up example, we will skip this part; however, an example of an exclusion criterion for a second sample could be cats that have trouble drinking in general, e.g., due to some illness like a tumor that influences the ability to properly swallow liquids).

3.1 One-Tailed One-Sample Z-Test

For the lower tail we can formulate the null hypothesis that the sample mean is either equal to or not (significantly) lower than the population mean (meaning: it could be a sample from the population if not lower than a defined threshold). Below we also included different denotations of the essential logical binary relation between alternative and null hypothesis. Recall from Inferential Statistics I that the use of the term null hypothesis can be confusing (e.g., due to the often redundantly expressed binary relation when saying things such as “the null hypothesis being true”). Also recall that the p-value below is essentially a (uniformly weighted) likelihood of the form $P(data|\overline{\theta})$, the probability of observing the data (difference of the means), given the $H_0$ (no difference assumed). For some methodological orientation in general: performing Bayesian regression involves more than just one function: the multiplication of the prior probability function with the likelihood function results in a posterior probability distribution... Since the prior is uniform in frequentist statistics, we can just operate with the PDF, which is not the same as a likelihood function (which would be a regular probability function), but the integral of our PDF still represents a likelihood.

You may wonder what role the $H_0$ plays mathematically, since we can relate to the z-values as numbers and to a hypothesis as something that is given. But what about the negation of the hypothesis, or a category representing “everything else being the case except the hypothesis”? Conceptually it may appear to be a never-ending task to grasp the usual ex negativo approach in parts of statistics. Mathematically, the role of the $H_0$ reveals itself as a simple binary complement when subtracting the p-value from 1, which results in the regular likelihood $P(data|\theta)$, i.e., the data given the alternative hypothesis. Since we want to discard the null hypothesis, we hope that the p-value will turn out to be very low.

As mentioned in the first part of this series, the use of the null hypothesis is very unfortunate, since this is also not the way we express evidence in everyday life: when we find evidence for the effect of a certain treatment, we do not say “we did not not-find any evidence”, we positively say “there is evidence for an effect”.

This ex negativo approach is sometimes also related to falsification (Karl Popper), e.g., in [this paper]. In general it refers to a discourse on epistemics, i.e., the question of what can be known and what can be tested. Personally, I don’t find this logically very convincing, since falsification is better understood in relation to temporal and spatial boundaries of experience/examination, as in the classic black swan example, and less by overinterpreting an ex negativo approach in setting a category in conditional probability. The latter appears to be more of an epistemological reminder, apart from being most of all mathematically convenient. Mathematically tilting the category in advance via the null-hypothesis approach does not solve epistemic issues of any kind of abductive inference concerning time and space (recall part I of this series). Tilting the category is in the end just tilting numbers, since the relation between a variable and its complement is already part of probability as well as propositional logic. Again, apart from that, we always communicate the results in the positive sense: evidence was found that medication X has an effect (we don’t say: evidence was found that medication X did not not-have an effect). Most of all, evidence is not obtained by one study alone, so significance by itself does not relate to the world anyway.

Apart from numerically tilting from $H_0$ to $H_1$ and handling the ex negativo approach in general, it unfortunately gets a bit more confusing when evaluating a difference in means as being higher, lower or equal/unequal to the population mean, as this also tilts, and some values are actually equivalent no matter the tail we look at. However, this tilting appears due to the symmetry of a normal distribution when switching from the sample mean being lower to being higher than the population mean (same trick: 1 - value). See the code further below to fully understand what this refers to.

Let us start with the lower tail:
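Sketched in the notation used so far, the lower-tail hypotheses and the corresponding p-value (the integral of the standard normal PDF $\varphi$ up to the z-statistics) can be written as:

$$
H_0\!: \mu_{sample} \geq \mu_{pop} \qquad H_1\!: \mu_{sample} < \mu_{pop} \qquad p = P(Z \leq z \mid H_0) = \int_{-\infty}^{z} \varphi(x)\, dx
$$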

When do I know whether to use the lower and not the upper tail? In general it is dictated by the question that is asked: if we decided in advance that our hypothesis is that depressed cats drink less tea than others, then we indirectly decided to perform a lower-tail z-test. What if the inverse is the case: they drink more tea than others? Performing a hypothesis test correctly requires setting a hypothesis in advance. In the best case, you register your trial before you collect data and then perform your statistical analysis. Not doing so can essentially be considered HARKing, i.e., formulating or adjusting a hypothesis after the results are known (bad practice). Also, when the result does not match our a priori hypothesis (either higher or lower) but turns out to be significantly different, we technically would again need to perform a new trial (especially when preregistration was done, we can’t and shouldn’t just change the hypothesis). However, we will see below that this is why most people use a two-tailed test.

We can use the same formula as above for the inverted case, with a sample mean that is higher than the population mean (in the sense that it is not from the same population). Here we use the same integral as for the lower tail, and due to the symmetry of a standard normal PDF we can use the same p-value as for the lower tail as well. As mentioned, we just need to tilt the result, due to the symmetry of Gaussian functions. Formally, we actually integrate up to the absolute value of our given z-statistics and subtract the result from 1 (see code below). Integrating up to the absolute value of the z-statistics by itself (without subtracting it from 1) would only result in the likelihood that follows the alternative hypothesis of a lower-tail test…
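In the same notation, a sketch of the upper-tail p-value described above:

$$
p_{upper} = 1 - P(Z \leq |z| \mid H_0) = 1 - \int_{-\infty}^{|z|} \varphi(x)\, dx
$$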

Below you will find some code to perform a lower and upper tail one-sample z-test:

### Chat Noir One-Sample Z test:

# We will compare sad cats with a population
# of cats ("all cats"):

# Happy cats like Lady Meow drink 10 cups of tea on average:
pop_cats = c(rep(seq(0,10,length = 11),10))

# Sad cat, which drank only 2 cups of tea in 10 hours:
sample_sad_cats = seq(0,2,length = 11)
# [1] 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0


# Population var and sd:
var = sum((pop_cats-mean(pop_cats))^2)/length(pop_cats)
sd = sqrt(var)
# SEM:
sem = sd/sqrt(length(sample_sad_cats))

# Z-test:
z_stat = (mean(sample_sad_cats)-mean(pop_cats))/sem
# [1] -4.195235

# P-Value LOWER TAIL
p_val_low = pnorm(z_stat,mean = 0,sd = 1)
# [1] 1.362942e-05
p_val_low < .05
# [1] TRUE

# P-Value UPPER TAIL:
# The below is only theoretically the case and we
# have to pretend to have gotten a positive z-statistics.
# This spares us the time of creating another set of data.
# This time the test below actually argues that sad cats drink
# more tea than happy cats (we made things up anyway, but just so
# you know):
p_val_upper = pnorm(abs(z_stat),mean = 0,sd = 1)
1-p_val_upper
# [1] 1.362942e-05
1-p_val_upper < .05
# [1] TRUE

Awww :( Turns out sad cats truly drink significantly less tea :’( The p-value is way below 0.001, so we can report it as: p < .001. This is also the way we would report the p-value in a paper. One can also report the z-value alongside the corresponding p-value, as it tells us how many sd’s our sample mean is away from the population mean.

3.2 Two-Tailed One-Sample Z-Test

As the name already says: now we are going to consider both tails of the distribution a priori and are looking at a total threshold of 5% (2.5% for each tail) when comparing it with the p-value for testing significance (again, we will cover the whole topic of significance testing in detail in the next part of this series).

This time we are asking for the equality/inequality of two distributions. In other words: we are going to perform a lower and upper tailed z-test at the same time, so to speak.

When should I use the two-tailed z-test? In general it is a safe prior hypothesis to argue for a difference that is either higher or lower than the population. Conceptually it can be understood as a rather exploratory approach: catnip may result in less depressed cats, but maybe also in more depressed cats? Maybe we don’t know yet, since current research is ambivalent or simply not given. The cost, though, is a doubled p-value and a higher chance that the result will turn out to be insignificant. However, in case this appears confusing: we do not actually get a double difference, since we are still looking at a single difference of means only that can’t be on two sides at once: one set of samples, one mean, ergo a difference that is either + or -. So in any case there will always be only a “one-tailed difference”, even when performing a two-tailed test.

When testing for significance only, we can assume the two are from the same distribution if the result of the z-test lies within, e.g., a 95% area of our population function (confidence interval), with an upper and lower tail of 2.5% each (5% threshold in total).

Fig. The plot above again only represents the effect size. However, we know that if the effect size is already below the threshold, then the z-statistics and the p-value will be for sure. A two-tailed test tests for both cases, a sample mean significantly lower or higher than the population mean. The threshold is therefore divided in half in order to sum to 5%, i.e., 2.5% for each tail. However, as we know: the threshold does not actually concern the p-value itself.

As we are calculating the p-value for a z-statistics that only refers to one side when performing integration, we have to multiply the p-value by 2 for the two-tailed test. We can simply do so due to the symmetry of a Gaussian function.

# P-Value TWO-TAIL
p_val_two = 2*pnorm(z_stat,mean = 0,sd = 1)
# [1] 2.725883e-05
p_val_two < .05
# [1] TRUE

3.3 (Independent and Dependent/Paired) Two-Sample Z-Test

The formula for the (independent/dependent) two-sample z-test is nearly the same as for the one-sample z-test. Here we compare the means of two samples, given their respective population sd and population mean. A two-sample test is considered independent when it argues that there is no influential relation between the samples. To be less abstract: the group of happy cats and the group of sad cats, each sorted by inclusion criteria, are considered independent when their tea drinking behavior is not influenced by each other: sad cats don’t drink less tea because of happy cats, but because they are sad. Happy cats are also not withholding any tea from sad cats…

To give an example for the dependent two-sample test: two samples are, e.g., considered dependent when we compare sad cats that drink tea with themselves after getting a dose of catnip.

The complete formula of the independent two-sample z-test is:

…which importantly can be simplified by discarding the difference in population means (which is assumed to be zero under the null hypothesis), keeping only the population standard deviations:
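A sketch of both versions in the standard textbook form (assuming two samples of sizes $n_1, n_2$ with population parameters $\mu_1, \sigma_1$ and $\mu_2, \sigma_2$):

$$
z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \qquad \text{and, with } \mu_1 - \mu_2 = 0 \text{ under } H_0: \qquad z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}
$$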

When forming a hypothesis, we can now do the same as with the one-sample z-test, just with two samples: is the one sample mean significantly lower, higher, or equal/unequal to the mean of the other sample? However, for brevity we will just ask for the lower tail below:

# Chat Noir Independent Two-Sample Z-test:

# We will compare happy cats such as Lady Meow with 
# unfortunate cats that are evidently sad (inclusion 
# criteria will be left out for now):

# Happy Cats:
sample_cats = c(0:10)
pop_cats = seq(0,10.5,length = 11)
pop_cats = c(rep(pop_cats,100))

# Sad cats, which drink only 1 cup of tea in 10 hours :'(
sample_sad_cats = seq(0,2.5,length = 11)
# [1] 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50
pop_sad_cats = c(rep(seq(0,2.5,length = 11),100)) 


# SEM of each sample (population sd divided by sqrt(n), see chapter 2.2):
happy_var = sum((pop_cats-mean(pop_cats))^2)/length(pop_cats)
happy_pop_sd = sqrt(happy_var)
happy_sem = happy_pop_sd/sqrt(length(sample_cats))

sad_var_pop = sum((pop_sad_cats-mean(pop_sad_cats))^2)/length(pop_sad_cats)
sad_pop_sd = sqrt(sad_var_pop)
sad_sem = sad_pop_sd/sqrt(length(sample_sad_cats))

# z-Statistics (complete formula): the difference in sample means minus
# the expected difference in population means, divided by the combined
# SEM of both samples:
z_statistics_num = (mean(sample_sad_cats)-mean(sample_cats))-(mean(pop_sad_cats)-mean(pop_cats))
z_statistics_denom = sqrt(sad_sem^2 + happy_sem^2)
z_statistics = z_statistics_num/z_statistics_denom
# approx. 0.24 (our made-up sad-cat population already has a much lower
# mean, so relative to that expected difference the samples are unremarkable)

# Simplified formula: under H_0 both samples come from the same population,
# so the expected difference in population means is 0 and the numerator
# reduces to the plain difference in sample means:
z_statistics_num = (mean(sample_sad_cats)-mean(sample_cats))
z_statistics_denom = sqrt(sad_sem^2 + happy_sem^2)
z_statistics = z_statistics_num/z_statistics_denom
# approx. -3.64, i.e., about 3.64 combined-SEM units below an expected difference of 0

# P-value (lower tail):
pnorm(z_statistics)
# approx. 1.3e-04, well below .05

However, the above only holds for independent samples. We covered the difference between the terms independent and dependent when talking about the respective variables in the second part of our series. In general: independent means something is not influenced by something else (e.g., the probability of $a$ does not influence the probability of $b$). Dependent means there is an influential relation. We also discussed dependent probabilistic relations when discussing conditional probability (but haven’t discussed independent conditional probability yet; we will do so some other time).

A dependent relation between samples often concerns before/after relations. Following our furry example on sad cats (it’s so sad!), we could, e.g., check if catnip makes cats happy again by checking if they drink more tea again within a certain amount of time after they received some catnip.

Spelled out, the formula goes as follows: z for paired samples (dependent relation) equals the mean of the differences, divided by the standard error of the mean differences. We could also add “minus the expected difference” in the numerator, but since we are looking for the p-value that assumes the null hypothesis, this value will be zero anyway (since $H_0$ argues for zero difference). This is why it is often more convenient to work under the null hypothesis, since we do not need information on an expected difference…
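A sketch of that formula, writing $d_i$ for the paired differences, $\bar{d}$ for their mean and $\sigma_d$ for their standard deviation:

$$
z = \frac{\bar{d}}{\sigma_d / \sqrt{n}}
$$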

Below is some code for the example of testing if sad cats drink more tea again after consuming catnip: this represents an upper-tail paired or dependent z-test.

#### Chat Noir Dependent/Paired Two-Sample Z-test:

# Sad cats, which drink only 1 cup of tea in 10 hours :'(
# We can also consider this as timestep t_1:
sample_sad_cats = seq(0,2.5,length = 11)
# [1] 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50

# Timestep t_2, i.e., after cat was on catnip:
sample_catnip = c(0:10)
# [1]  0  1  2  3  4  5  6  7  8  9 10

# Evaluate the difference between t_1 and t_2 via
# diff = t_2 - t_1
diff = sample_catnip-sample_sad_cats

# Set up data table:
data = cbind(sample_sad_cats, sample_catnip, diff)
data = as.data.frame(data) # in order to be able to use $

#       sample_sad_cats  sample_catnip diff
#  [1,]            0.00              0 0.00
#  [2,]            0.25              1 0.75
#  [3,]            0.50              2 1.50
#  [4,]            0.75              3 2.25
#  [5,]            1.00              4 3.00
#  [6,]            1.25              5 3.75
#  [7,]            1.50              6 4.50
#  [8,]            1.75              7 5.25
#  [9,]            2.00              8 6.00
# [10,]            2.25              9 6.75
# [11,]            2.50             10 7.50

# Calculate the mean difference
mean_diff = sum(data$diff)/length(diff)
# Total sum of squared mean differences
sum_sq_diff = sum(((diff-mean_diff)^2))
# Standard error of the mean differences
sd_mean_diff = sqrt(sum_sq_diff/length(diff))/sqrt(length(diff))

# Z_value:
z_value = mean_diff/sd_mean_diff 

# Upper tail z-test, arguing that catnip has the effect that
# sad cats are drinking more tea than before again:
p_value = 1-pnorm(z_value)
# [1] 7.854725e-08

Fig. “Still think drugs are cool?” :D Cat on catnip meme (origin unknown to us). The above cat may be an outlier, since it appears that this cat won’t be able to drink tea for a few hours or so ;) So in case you study depressed cats, keep your stash of catnip in bounds. ^^

YOU DID IT!! You have completed your first statistical hypothesis test!! Whoooop!!! Before we move on to the t-test, we will discuss some common misleading, false or at least confusing definitions of the p-value.

3.4 Common Misleading or Confusing Definitions of the P-Value and the Significance

You have probably figured out yourself that handling the p-value, e.g., when reading a paper, is actually rather simple: if it is below a threshold, we are allowed to state that an (alternative) hypothesis finds statistical evidence due to a significant difference in, e.g., means (since we can discard the null hypothesis in such a case).

However, you probably also figured out that defining the p-value easily becomes complicated, since quite a bunch of mathematical concepts is involved in obtaining it. The general message of this tutorial: it may not even be a good idea to try to fit the concept into one phrase.

There are many definitions you probably came across already, but which may still appear rather enigmatic and confusing to you when thinking about them after going through the math above. In this short chapter we are trying to anticipate and disentangle some of the possible confusions in common definitions of the p-value, before proceeding with the t-test and the integration of the t-statistics/-value, in order to obtain a p-value again.

You can also skip this part for now and come back to this chapter later, as it does not add anything essential to the math. In this chapter we just discuss different approaches of linguistically expressing mathematical concepts such as the p-value and the significance.

Fig. Monty Python doing a bit on being “lost in translation”. CAVE: Some common linguistic definitions of concepts in statistics don’t match the mathematics behind it.

You probably have come across statements such as: “The p-value is the probability that the difference in the mean of the sample and population occurred by chance”. To me this particular statement always appeared terminologically quite confusing, as the term chance implies that the probability (p-value) redundantly refers to another probability / chance, such that the p-value could be understood as a probability of a probability (the probability of a chance of something…)? I really have trouble relating this definition to the process of standardization and integration. If chance is here understood as referring to the threshold of <5%, then the p-value can either be below / outside the “area of chance” or within it, but the p-value itself mathematically does not involve the threshold, nor does it tell us upfront that a difference is in a certain area with some probability (since it is just integration up to a certain z-score/-value, as we already know). The p-value does not probabilistically quantify a second-order chance of some kind (the power analysis however does, as we will see in the next part of this series). However, we have seen that the sample size is weighted in, making a high difference, given a high sample size, less likely to be the result of sampling from the actual population.

However, the notion of chance is also related to the process of random sampling (randomization) from a distribution, the obtained difference being just one of potentially many possible random differences from the population mean when taking a sample (with size n) from one and the same population or potentially a different population. Again, weighting the sample size into the effect size provides this perspective, since a greater sample size makes a high difference more credible (gives it more confidence). Speaking of randomness: it is not recommended to understand “random” as uniform or so, since a Gaussian PDF says that not every mean of a sample is uniformly likely, but we can still take samples from that distribution in a randomized way. The process of sampling is considered random if we did not intentionally decide for a specific sample or had no control in the choice whatsoever that could intentionally influence the mean of the sample; hence the process of randomization in clinical trials. Randomization (clinical design) is therefore not a nicety that helps to maintain some higher-order quality standard, but also makes sure that we can actually consistently apply mathematical inference methods that work with random variables. Still, randomness in the above sense is not what defines the p-value either; it only defines specific circumstances of how the sampling took place (randomization), i.e., a characteristic of variables (random variables). This definition by itself at least leaves a lot of questions, especially for beginners.

You may have a different opinion on this, but we recommend keeping in mind that the use of the term chance is neither a good intuition, nor is it terminologically consistent with the actual math. At least exchange chance for “consequence of random sampling from one and the same distribution, not a different one (= $H_0$)”. The fact that there are differences in the first place is due to sampling from a normal distribution, where differences are possible, but overall less likely the further they are from the population mean. This is importantly due to the form of a Gaussian.

The p-value for a z- or t-test defines how likely it is to obtain a particular standardized / studentized mean difference (sample minus population/expectation), given that we were (randomly) sampling from an expected (normal) distribution (a population with respective mean and sd for the z-test; the same goes for two samples, where one sample is considered the baseline, so to speak). It is not the same as the plain standardized difference in means, since the sample size is weighted in. Weighting in the sample size can give rise to a higher confidence in a particular difference in means compared to a much smaller sample. Intuitively, more data provides more confidence/credibility. However, in the next part of this series we will look at examples where this becomes a problem for interpreting p-values and their “evidence”, but on a narrow scale it still makes sense: the mean of just two events is certainly less credible than the mean of 100 events, under the condition that we randomly sampled and didn’t selectively pick events...

Fig. Common denotation of the p-value. “Z” refers to the standardized population mean and “z” to the standardized sample mean. The term tail just refers to which end of the normal PDF or which kind of difference we are looking at: a difference lower/higher than the population mean / the other sample (one direction = one-tailed), or an inequality of sample and population / the other sample in either direction (both directions = two-tailed).

Fig. You may recall, we showed this figure in the first part of this series as a preview of what’s ahead. We went through all of what we need to fully understand what the figure above tells us in detail.

We could also modify and reformulate our initial statement on p-values at the beginning of this chapter into a certain condition of insignificance: If the p-value is ABOVE a threshold of 5% we can interpret the given difference or overlapping of the sample mean with the population PDF as coincidence / chance (result of random sampling) and cannot discard the null-hypothesis. $H_0$ states no difference, i.e., it assumes a priori that the sample is part of the population; $H_0$ is therefore the hypothesis that we want to discard, in order to be able to assume a difference/effect of, e.g., some medication. However, in order to distinguish between no difference and enough difference or between “coincidence and not coincidence”, a threshold is needed (but again not for calculating the p-value itself).

Importantly: we do not even have to calculate the p-value to know if the difference in means is significant, given the z-/t-score of the threshold probability. This threshold z-score is also called a critical z-score, one that results in a p-value exactly at the threshold, e.g., .05: after calculating a t- or z-value of a data set, we could simply compare it with the z-/t-value of a threshold probability of 5% (obtained via a quantile function) and argue for significance if the value lies beyond the critical score, without integrating a PDF at all… However, in the above statement on the insignificance of a difference, the p-value is still not defined, just compared with the threshold. In other words: the condition of that statement (being above or below a threshold) has nothing to do with the condition and definition of the conditional probability we are after, our p-value.
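As a minimal sketch of this shortcut (the 5% level and df = 10 are just example choices), the critical values can be obtained directly from the quantile functions qnorm() and qt():

# Significance via critical values, without computing the p-value itself:
alpha_level = 0.05

# Two-tailed critical z-value of the standard normal:
qnorm(1 - alpha_level/2)
# [1] 1.959964

# Two-tailed critical t-value, e.g., for df = 10:
qt(1 - alpha_level/2, df = 10)
# [1] 2.228139

# A computed z-/t-value is significant at the 5% level (two-tailed)
# if its absolute value lies beyond the respective critical value.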

Here is another statement that is commonly used but rather confusing, even though there is a true story behind it: “the p-value is the probability of observing an effect at least as extreme or more extreme as when the null-hypothesis was true” (Wikipedia, 02.2023). Here I have a lot of trouble understanding what is meant by “extreme” and “at least… or more extreme”. The term “effect” refers to the difference in means. But why does it say “at least as extreme or more extreme”? We do not even know if there is a difference at all (it could be t = 0). “At least as… or more extreme” again sounds like confusing the p-value itself with the comparison of p-value and threshold (significance). The actual idea behind this definition of the p-value (as far as I have encountered it) addresses the fact that we integrated a PDF (an interval probability estimate, e.g., from infinity to the z-value). So in other words, the integral technically addresses all other values that are possible in the sense that they are even more unlikely (we can say so due to the form of a Gaussian function, hence the effect: regression to the middle…). The problem, though, lies in the following question: does a p-value obtained from integrating a probability density function really address other cases differently than a point-probability estimate of a regular and discrete probability function, given that we approach a z-/t-value from -/+ infinity? Or in other words: don’t both probability values (interval estimate and point estimate) always automatically account for a range that is more extreme, in the sense that any particular probability value (point or interval estimate) indirectly refers to all other possible probability values that are lower than the one obtained? A PF and a PDF are not the same kind of function, and there are technical reasons why a point estimate of a continuous function results in 0, which is why we use a PDF. Still, the form of the function and the fact that we are looking at a tail of a Gaussian up to infinity may play an equal role in what is considered “extreme and more extreme” and in which probabilistic assumptions we can make beyond a point estimate. EXAMPLE: a point probability of .1 (a point $f(x)$ on a discrete Gaussian PMF) and a p-value of .1 (integral of a Gaussian PDF) both cover the fact that lower or higher values on the $x$-axis (mean difference) are respectively less or more likely, when a difference of 0 on the $x$-axis denotes the mean and mode of a Gaussian. In other words, I am not convinced that this is worth highlighting in the definition of a p-value, since the fact that we are integrating does not change anything about the concept of probability as such, which always indirectly includes a range of less or more likely occurrences, given a probability at hand and its corresponding function (and therefore its form). Definitions of the p-value of a z- or t-test using the term “extreme” at least need a lot of extra reflection and knowledge to be understood, so other definitions of the p-value of a z- or t-test that address the tested difference in means explicitly are probably a better choice when explaining it to people.

In any case, the above Wikipedia definition still indirectly confuses the p-value with the significance test, as it brings in a threshold with words like ‘at least as extreme as when the null-hypothesis was true’. The problem here is that the null-hypothesis in a z-test is always assumed to be true (trap of an unnecessary redundant binary). Only when we perform a significance test do we decide on a threshold; this is not part of the p-value calculation but of the power analysis / the significance test. The probability $P(positive = significant|true)$ is not the same as $P(z<Z|H_0)$... A correct definition of the significance test could be: “Given that a p-value is at least as different/extreme or more different/extreme than a threshold suggests (extreme in the sense of further away from the population mean on the $x$-axis), then $H_0$ can be discarded”. Now it makes sense both linguistically and conceptually, but it is still not referring to the correct definition of the p-value of a z- or t-test, $P(t<T|H_0)$.

Maybe I am missing something above (let us know if so), but to me a lot of common definitions appear rather confusing, and they are often not even necessary to understand the p-value, since there are other, more straightforward definitions to rely on.

Fig. Which definition would you choose? The second definition is more or less the same as on Wikipedia. In general, defining the p-value is most of all difficult because it includes a lot of concepts to be understood already: conditional probability, integrating Gaussian PDFs, z-scores/-values…

You may also have come across definitions stating that the p-value is the probability that the null hypothesis is true. We have discussed before (especially in the first part of this series) that stating “the null-hypothesis is true” is just adding an unnecessary redundant binary and means the same as stating “the alternative-hypothesis is false” or “$H_0$ states that it is true that there is no difference”. However, such a redundant binary only refers to what is set in advance as the hypothesis (the condition) and does not refer to the probability of an observation or data (sample mean) given the condition of $H_0$. Recall that the probability of a difference that relies on the alternative hypothesis, $P(z<Z|H_1)$, is just $1-P(z<Z|H_0)$. However, the false expression above would actually be correct when stating: the p-value is the probability that the null hypothesis cannot be discarded, when $H_0$ states no difference in means. If that probability is significantly low (arbitrary threshold), then we can say: it is unlikely that we cannot discard $H_0$; again it is just tilting a binary relation (an ex negativo approach). Still, this defines the conditions of significance, not the p-value itself.

We hope we were able to address most of the general issues around defining, explaining and understanding the p-value and significance test. As a final note: keep in mind that p-values do not represent the world in the sense of a constant value, such that we could just rely on their “significance”. The major downside of frequentist statistics is essentially that its models do not learn and reflect less what the role of a hypothesis and the theory behind it is (weights). Also recall that the frequentist approach comes with a uniform prior, so to speak. This may be a good and safe spot in a lot of situations (most of all when no prior knowledge is given); however, in the long run it exposes science to a game of randomness while pretending to be neutral, which is in the end a dangerous lie, as it lets people pretend they are not responsible for the major conceptual assumptions about the world they make in the first place, as long as some significance is reached (meta-studies therefore become more and more important over the years).

IMPORTANT TAKE HOME MESSAGE: Don’t let yourself get confused by lousy definitions of the p-value, even if they come from professionals!!! Any mathematical statement has to have a mathematically formulated representation. A mathematical statement is not about an effort of correctly interpreting words, but about the instructions of an algorithm represented as a formula. In the end, the formula is all we have in order to confirm our definitions, so it is probably best to start there as a reference.

4 The Student’s T-Test

As mentioned before, the t-test only slightly differs from a regular z-test. For a regular t-test the only difference is that we will rely on the sample standard deviation / variance, not on the population sd / var. In other words, we will operate with $n-1$ in the denominator, Bessel’s correction. There is again the possibility of a one- and a two-sample t-test. Note that if we were to compare three samples with each other, we would perform an analysis of variance (ANOVA), a topic which we will cover in future tutorials. At the end of this tutorial, we will also perform a t-test on our linear model, which looks at two relations. The only difference will be that we will operate with $n-2$ for the sd, since we are looking at two relations: the intercept and the slope. The number of parameters in question is why the choice of $n-1$ is not arbitrary. The case of $n-2$ is not Bessel’s correction anymore, but an application of the concept of degrees of freedom (which is a kind of generalization of Bessel’s correction, as we will see). We will get there later below, since we don’t need the concept yet for our regular t-test looking at one relation only (comparing means, where we can just use Bessel’s correction).
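As a small aside and minimal sketch (with a made-up set of values): R’s var() and sd() already apply Bessel’s correction, i.e., they divide by n-1 rather than n:

# Bessel's correction in R's built-in functions:
x = c(2, 4, 6, 8)             # hypothetical sample
n = length(x)

sum((x - mean(x))^2)/n        # biased version, divides by n
# [1] 5
sum((x - mean(x))^2)/(n - 1)  # Bessel's correction, divides by n-1
# [1] 6.666667
var(x)                        # matches the n-1 version
# [1] 6.666667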

For the one-sample t-test, at least the expected value has to be given, which can be the mean of a bunch of previous samples (previous trials) or a theoretical value (another model, e.g., from physics, i.e., an expectation backed by other kinds of research and the respective formulas and predictions they make). It will not be the actual population mean, since the population is considered unknown when performing a t-test, unlike with the z-test. The t-test is also called Student’s t-test (we will not cover the history of the t-test, but we recommend doing some research on why it is called that way!).

Note that this test is considered robust, given equal variance and sample size. We could also use a Welch’s test in case of unequal variance and sample size. However, we will not cover the Welch’s test here (we will do so some other time; note that the t.test() function and other functions are often able to decide on the correct/optimal test themselves, based on the input data).

One-sample t-test:

In the formulas below we essentially just exchanged z with t, that’s all: We centralize in the numerator and use the sample standard error of the mean for scaling or studentization (instead of standardization).

Independent Two-sample t-test:

Dependent/Paired Two-sample t-test:

Recall that we could also add “minus the expected mean difference” in order to represent a centralization in the common form. But again, since the expected difference under the null hypothesis will be zero anyways, we can abbreviate the formula.
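Since the formulas above were kept abstract, here is a minimal sketch of the one-sample case with made-up measurements and an expected value of 5 (both purely hypothetical), comparing the manual calculation with the built-in t.test() function:

# One-sample t-test by hand (hypothetical data):
x  = c(4.1, 5.3, 4.8, 5.9, 5.2, 4.6)   # made-up measurements
mu = 5                                 # expected value under H_0

# Centralize, then studentize via the sample standard error of the
# mean (sd() already uses n-1, i.e., Bessel's correction):
t_value = (mean(x) - mu)/(sd(x)/sqrt(length(x)))

# Two-tailed p-value via the t-distribution with df = n-1:
2*pt(-abs(t_value), df = length(x) - 1)

# Compare with the built-in function:
t.test(x, mu = 5)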

4.1 The T-Distribution, its PDF and its Degrees of Freedom

A t-distribution is very similar to a standard normal distribution and approaches a normal distribution as the sample size increases (similar to what we encountered when discussing the central limit theorem). The formula of the PDF of the t-distribution is a little complicated and deriving it would take a separate tutorial for sure. We will therefore not cover all the details here. The formula of the PDF of a t-distribution goes as follows:

As you can see, the formula looks rather nasty. The variable $\nu$ (Greek letter nu) refers to the degrees of freedom and the twisted “L” is a capital gamma, referring to the gamma function. We will not go into the details of the gamma function, but to at least mention it: the gamma function essentially generalizes the factorial, such that for a positive integer $n$ we get $\Gamma(n) = (n-1)!$. What we are going to cover though is the concept of degrees of freedom:

We didn’t discuss Bessel’s correction in detail, but one way of understanding Bessel’s correction is as representing the degrees of freedom, where the 1 refers to the one relation/parameter looked at (the difference in the mean). A linear regression model on the other hand operates with two parameters that are evaluated, the intercept and the regression coefficient, such that $df = n-2$. Also note that a two-sample t-test, which involves the sd of both samples, also technically involves two parameters ($n-1$ for each sample). But what are degrees of freedom now exactly?

Let us think of two numerical examples: Say we have taken a sample from a population and end up with a set of values of 10, 20 and 30 ($n=3$). We can now calculate the mean and will obtain a value of 20. Let us say we know the mean already, and 10 will be the first value that we obtain and 30 will be the second value that we obtain. If we followed that order of obtaining values, what will be the third value? Since we know the mean and the size $n$ already, there will be no other option for the third value than 20. This is exactly what the degrees of freedom refer to: given the mean 20 and an $n$ of 3, there are only two numbers that are free in the sense that they can vary. After these two values are set, the third number won’t have any freedom or variability anymore, since it is indirectly dictated by the given mean! As mentioned, the degrees of freedom of a linear model with the two parameters $a$ and $b$ is $df = n-2$. It may appear weird that we assume that the mean is given, but recall that this is always the case for the formula of the sum of squares / variance / sd, when performing centralization.

Our first numerical example is special in the sense that the value of the mean (below, 20) is entailed as a value of the set of $x$ already:

The same holds for a set of $x$ that doesn’t numerically entail the mean itself:

To formulate a definition of the degrees of freedom in maybe more common words: the concept of “freedomness” refers to the number of values that can vary and are not dictated by parameters such as the mean.
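The numerical example above can also be written out directly; a minimal sketch:

# Degrees of freedom: given the mean and n, only n-1 values are free.
given_mean = 20
n = 3
first  = 10
second = 30

# The last value is dictated by the mean and n, so it is not free:
third = n*given_mean - first - second
third
# [1] 20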

To wrap this up, we will now look at a t-distribution with several different degrees of freedom and check if it is true that the higher the df (and therefore indirectly $n$), the more it approaches a standard normal PDF. Below you will find the code for the function of the PDF of a t-distribution and we will plot the standard normal distribution for our comparison using our prob_dens(x,mean=0,var=1) function:

Fig. Plot of several PDFs of a t-distribution compared to a standard normal PDF (black). We can see that a df of simply 1 (red) has the effect that there is a much wider spread of values and a lower density in the area close to the mean. A rise in the df up to two (green) and 10 (blue) clearly shows how the PDF of a t-distribution starts to approach the standard normal PDF.

# PDF of a T-Distribution:
# df = n-1 # or 2 for two estimates/relations

t_distr = function(x,df){
  t_1 = gamma((df+1)/2)/(sqrt(df*pi)*gamma(df/2))
  t_2 = (1 + (x^2/df))^(-(df+1)/2)
  t_distr = t_1*t_2
  return(t_distr)
} # End of function

# Initialize Sequence: 
seq = seq(-4,4,by = .1)

# Plot standard normal and t-distribution
# with df = 1, df = 2, df = 10, df = 100:
plot(x=seq,y=prob_dens(seq,0,1), type ="l") 
lines(x=seq,y=t_distr(seq,1),col="red")    # df 1
lines(x=seq,y=t_distr(seq,2),col="green")  # df 2
lines(x=seq,y=t_distr(seq,10),col="blue")  # df 10
# Here you can see that given a df = 30 the t-distribution
# gets very close to standard normal distribution, hence 
# the z-score is used, given n>30:
lines(x=seq,y=t_distr(seq,30),col="lightblue") # df 30

# Moving to df = 100 the light blue line will essentially overlap
# with the standard normal PDF
plot(x=seq,y=prob_dens(seq,0,1), type ="l")
lines(x=seq,y=t_distr(seq,100),col="lightblue") # df 100

So again, we haven’t covered Bessel’s correction in detail, but we should now understand it in the sense of degrees of freedom and how it affects approaching a standard normal PDF as $n$ rises. We can also understand why it is used to correct for a bias: the uncorrected sample variance (dividing by $n$) systematically underestimates the population variance, since the deviations are taken from the sample mean, which is itself estimated from the data, i.e., from a parameter already determined within the calculation.

Finally, looking at the plot above we can see why it is said that, given a df of over 30 and given the population sd, one can use the z-score, since the t-distribution approaches the standard normal distribution. Below you will find a plot with df = 100, resulting in a diminishing difference:

Fig. Plot of a PDF of a t-distribution given a df of 100 (light blue) compared to a standard normal PDF (black). There is only a minor difference remaining between the two PDFs.

In other words: if we want to correct for a bias, then with a sufficiently high number of samples / df it makes only a small difference whether we work with the t-distribution or with a regular standard normal PDF. There is more behind it mathematically, and the number n > 30 is still somewhat arbitrary, but we hope this gave you enough information to make it plausible why the sample size dictates which kind of test score is to be used.
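To put a number on that diminishing difference, here is a minimal sketch comparing lower-tail probabilities of the t-distribution and the standard normal at an example value of 2:

# Tail probabilities: t-distribution vs. standard normal:
pt(-2, df = 30)     # lower tail under a t-distribution with df = 30, ~0.027
pnorm(-2)           # lower tail under the standard normal, ~0.023
pt(-2, df = 100)    # with df = 100 the difference shrinks further, ~0.024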

5 Testing a Linear Model

5.1 Overview of the output of summary(lm(y~x))

Before we discuss the t-test for a linear model, we are going to go through the complete output of the function summary(lm(y~x)). However, this part can be considered optional, if you want to get right into testing the slope and intercept of a linear model.

In general, the output of summary(lm(y~x)) extends the output of lm(y~x) and, e.g., performs a t-test and several other things on our linear model. Note that the summary() function can be used for several purposes. Its output depends on the type of input (check ?summary() for a documentation of the function; note that documentations for R functions are sometimes a little hard to read and sometimes incomplete). However, we will only use it together with the lm(y~x) function for now.

Let us first use the second synthetic data set from Inferential Statistics II to look at the output of summary(lm(y~x)), in order to get a basic overview of what’s ahead:

    ### Second synthetic data set from Inf. Stat. II

# Lady Meow drinking tea ^^ 

# Cups of Tea: 
cups = c(0,  1, 2,   3.5, 4.8, 5.2, 6,   6.9, 8.5, 9.1, 9.9, # run 1
         0, .6, 2.6, 3.1, 4.8, 6.9, 7.1, 7.5, 8.5, 9.9, 9.9, #     2
         0, .2, 2.9, 2.9, 4.9, 5.2, 6.9, 7.1, 7.3, 9,   9.8) #     3

# Time passed for each run of measurements (cups), abbreviated:
time = c(0:10,0:10,0:10)

# Summary(lm(y~x)) => y under the condition (~) of x
summary(lm(cups~time))

Executing the line summary(lm(cups~time)) will give us the following console output:

The first line of output just shows the code formula. The next part looks at the residuals from the perspective of quantiles, which includes the median. This part can also be recreated via:

# Residual quantiles, median and mean:
summary(residuals(lm(cups~time)))

We have used the function residuals(lm(y~x)) before in the second part of this series. However, the formula that the function residuals() uses simply represents the data value of the dependent variable ($y_i$) minus the model estimate for the respective data point given the independent variable ($f(x_i)$). We will not go into further details in this tutorial (see Inferential Statistics II and VI for details).
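As a quick check (a minimal sketch using the cups/time data from above), the residuals can be computed by hand from the fitted values and compared with the output of residuals():

# Residuals by hand: data values minus the model estimates f(x_i):
model = lm(cups ~ time)
manual_residuals = cups - fitted(model)

all.equal(as.numeric(manual_residuals), as.numeric(residuals(model)))
# [1] TRUE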

The part “Coefficients” entails the output of the lm() function, i.e., the estimates for the optimal $a$ and $b$ in $f(x)=a+bx$ in the first column. This part also entails a column for the standard error (of the mean), which is, as we know, not the same as the standard deviation (it is the sd of the mean, not of a whole set of values, see details later below). Other columns entail t-values and p-values (the Pr stands for probability). In this tutorial, we will focus on this part/table of the output of summary(lm(y~x)). The above output only entails a t-test though (we will also only cover the t-test for a linear model).

The next part of the R output concerns the “Significance Codes” and just entails a legend for the number of stars that are associated with several p-value thresholds. Three stars means that the p-value is approaching zero and therefore very small.

The last segment of the console output entails a couple of values, such as the F-statistics, again our p-value (of the slope) and others. These are additional values to get a grip on the quality of our model, which we will focus on in the sixth part of this tutorial series. The only values that will not be covered in this tutorial are the multiple and adjusted R^2 and the F-statistics. In other words: after this tutorial we will have nearly completely replicated the output of summary(lm(y~x)) by just using mathematical formulas, whoop!!

5.2 T-Test on the Slope and the Intercept of a Linear Model

We are now approaching the end of this tutorial. Our last topic will be hypothesis testing the slope and the intercept of a linear model. After this last chapter we will have replicated most of the output of summary(lm(y~x))! Let’s get to it:

Our data set: First of all, because Lady Meow is adorably predictable in her tea drinking behavior, the p-value we will receive for the slope of our optimal linear model is so small that R will just output a plain 0. We therefore decided that we will use a random data set this time. We chose the integrated data set “mtcars”, which can be loaded into the environment via data(“mtcars”). Loading the data set will make it possible to call columns of that data set via “mtcars$…”:

We intentionally chose a column of the data set “mtcars” that leads to a model with a very high p-value, such that it will be above our significance threshold of 5%! For the independent variable we just created a set of $x$ that entails the numbers from 1 to the length of that column. In general, let us go abstract and just ignore the meaning of the variables (cars are evil anyways) and let us look at the plot right away:

Fig. Whatever the variables are, we can see right away that deciding where to draw a linear model function through the data points results in a rather flat line.

data("mtcars")
dep = mtcars$disp
indep = c(1:length(dep))
plot(indep,dep)
abline(a=alpha,b=beta)

Let us look at the output of summary(lm(y~x)), where we can clearly see that with a p-value of 0.905 we can’t discard the null-hypothesis that the slope is essentially zero and there is no conditional linear relation between the variables, or only very little change in the data when $x$ changes (the variable we have control over: our independent variable):

# The p-value is 0.9047, suggesting no significant
# result, given a slope of -0.2913, a very low
# slope, barely different to 0:
compare = summary(lm(dep~indep))

# For more detailed numbers look at:
compare$coefficients
#               Estimate Std. Error    t value     Pr(>|t|)
# (Intercept) 235.528226  45.597118  5.1654192 1.460149e-05
# indep        -0.291294   2.411567 -0.1207903 9.046625e-01

# Our optimal linear model:
beta  =  cov(indep,dep)/var(indep)
# [1] -0.291294
alpha =  mean(dep)-beta*mean(indep)
# [1] 235.5282
fx = alpha+beta*indep

Let us first replicate the p-value of the slope of our linear model:

5.2.1 Testing the Slope

We mentioned it before, but the $H_0$ when testing the slope of a linear model is arguing that there is no difference, no relation between the dependent and independent variable (a slope of 0). We again have to think in the realm of distributions, so we are testing if the slope of our optimal model, evaluated via sample data, is likely to be part of a population with a mean slope of 0. It is again the same principle as in a regular z-test/t-test: centralization and then complete studentization via the sd. This time we do not subtract a population mean / expected value from a sample, but the expected slope assuming the null-hypothesis from the evaluated slope of our model. Since that expected slope will be zero, you will often find formulas entailing just the beta term in the numerator as default.

The formula for the standard error of the slope looks a little complicated at first, but we will get there. Note that the standard error of the slope is in nuce just again sqrt(var(beta)) and its form below just results from rearrangements. The formula below says: the standard error of the slope equals the residual standard error divided by the square root of the sum of squares of $x$.

You may wonder where the classic sqrt(n) has gone in this formula. Here it comes:

The above refers to the standard error of the residuals/errors or “residual standard error” (the term used in the R output; SSE = sum of squared errors). It is essentially just our formula for calculating the sum of squared errors, which we discussed in the second part of this series, averaged over the degrees of freedom: it represents the standard deviation of the errors/residuals. Without taking the square root, the above is also called the mean squared error or variance of the errors (MSE). The above will then be divided by (the square root of) the sum of squares of $x$ to put the errors between $y_i$ and our model estimate $f(x_i)$ into proportion to a change in $x$, resulting in the sd of the slope. Where does the sum of squares come from? We will not go through it in detail, but recall that the sd of the slope could also be written as $sd(SS_{XY}/SS_{XX})$, since we can calculate the regression coefficient via the sums of squares as well (recall part three of this series). Here is again the formula of the sum of squares of $x$, which we also discussed in the third part of this series:

Another way to represent the standard error of the slope is via the following formula (the second formula below is our written-out first approach from above):

Here we can clearly see how a relation between $y$ and $x$ is present in the formula, in a similar way as when calculating the slope itself as the result of a difference in $y$ given a difference in $x$. Above it is present as the (squared) errors of $y$ divided by the (squared) deviations of $x$ from the mean of $x$ (the sum of squares of $x$). Think of $f(x_i)$ as an expected value (the optimal model estimate) around which we centralize our data values $y_i$. In other words: the above represents a centralization ($y$ of the data around the optimal estimate of $y$, i.e., $f(x)$) divided by a centralization of $x$ for itself (with the errors additionally put in relation to the degrees of freedom).

There is yet another way of calculating the standard error of the slope via an alternative approach of calculating the sum of squared errors/residuals (SSE) [2]:

Here we have the parameter beta itself within the function and the denominator still represents a centralization. However, the formula is now completely brought into the realm of variability under the perspective of sum of squares.

We hope this helped you to understand the rather tricky formula for the SE of the slope. We still have trouble giving this part a detailed intuition, but we will get back to this in the future (feel free to comment or contact us if you have a good take on this!).

Below you will find the code to replicate the above and to evaluate the p-value in comparison with the summary(lm(y~x)) function, leading to equivalent output values (whooop!!):

# Sum of square residuals:
SumR2     = sum((dep-fx)^2)
# [1] 475953.3
SumR2_alt = sum(as.numeric(residuals(lm(dep~indep))^2))
SumR2_alt2 = sum((dep-mean(dep))^2) - beta*sum((dep-mean(dep))*(indep-mean(indep)))

all.equal(SumR2, SumR2_alt)
# [1] TRUE
all.equal(SumR2, SumR2_alt2)
# [1] TRUE

# Residual SE:
residual_standard_error = sqrt(SumR2/(length(indep)-2)) 
# [1] 125.9568
compare
# => Residual standard error: 126
# summary(lm(y~x)) rounds the value in the output,
# but since we will get to equivalent results we can
# safely assume that the output value is just rounded.

# Standard error of the slope for t-distribution:
se_denom = sqrt(sum((indep - mean(indep))^2))
se_slope_t = residual_standard_error/se_denom

# t-value:
t_value = beta/se_slope_t 
# [1] -0.1207903

# p-value for two-tail == to standard output of summary(lm(y~x)):
2*integrate(t_distr, df = length(indep)-2, lower = -Inf, upper = t_value)$value
# [1] 0.9046625
# Alternative via basic R function pt():
2*pt(t_value, df = length(indep)-2) 
# [1] 0.9046625

# t-value and p-value is equivalent with the output of
# summary(lm(y~x)):
compare$coefficients
#               Estimate Std. Error    t value     Pr(>|t|)
# (Intercept) 235.528226  45.597118  5.1654192 1.460149e-05
# indep        -0.291294   2.411567 -0.1207903 9.046625e-01

We did it!! We replicated the most important part of the hypothesis test on a linear model: the slope. Let’s move on to test our intercept:

5.2.2 Testing the Intercept

We mentioned briefly before that testing the intercept is often not as important. Under the null hypothesis, a t-test of the intercept asks if the crossing point with the y-axis is equal to zero, given that $x$ is 0. Different to the slope, we do not ask for a relational value (a value related to the change of $y$ given a change in $x$ (indep.)). The intercept refers to an explicit data value on the y-axis only, where $x$ is just 0. Depending on the numerical values used for the independent variable (the scaling of statistical values), a crossing of the y-axis at $x = 0$, or at any other point, may not be of interest. This may be the case if the measured values lie in a range between, say, 40 and 300, due to the specific scale. Here a measurement with $x = 0$ may just not be relevant, since it is implausible. For example: a human with a size of 0 cm and age 25 just does not exist. A test on the intercept then just does not make a lot of sense.

In the case of Lady Meow, it actually makes sense that the intercept is around zero, since we argued she sleeps 14h a day and is observed 10h per day, starting with 0 cups at 0 hours. IMPORTANTLY: in this example we would not want to discard the null hypothesis, since an insignificant p-value for the intercept makes sense. And it is true, looking back at the output we discussed in the introduction of this tutorial using our Lady Meow data set: the p-value of the intercept is insignificant and we can also see that our alpha of 0.2318 is very low and close to zero. In the case of Lady Meow this is totally plausible. Let us look at the R output using our second synthetic Lady Meow data set:

In comparison, when looking at the output for “mtcars” we can see that our intercept is way above 0 and we get a p-value way below the threshold of 5% (i.e., we can discard the null-hypothesis that the intercept is 0). Note that this output again makes sense: under the null-hypothesis of a slope of zero, the values of $y$ would be overall constant and non-zero, so a non-zero intercept is the plausible result (see the plot of mtcars and look at the crossing point of the model function with the y-axis).

So in general keep in mind: the p-values of both the slope and the intercept do not have to be (close to) zero for a model to be considered “good”; it is also a question of plausibility.

Below we see the detailed output of a t-test on “mtcars”:

# summary(lm(y~x)):
compare$coefficients
#               Estimate Std. Error    t value     Pr(>|t|)
# (Intercept) 235.528226  45.597118  5.1654192 1.460149e-05
# indep        -0.291294   2.411567 -0.1207903 9.046625e-01

Below you will find another example of a simplified model with a slope of zero. Imagine you wanted to measure some random change and things just never change when you measure. The x-axis could again be the time and the y-axis could refer to some growth with a starting value of 5 that remains constant (note that the intercept will also be equivalent to the mean of the set of $y$ and of $f(x)$!). In other words: the model below represents growth that in this case is just not happening:

Fig. Model with a slope of zero: $f(x) = 5 + 0*x$.

# Example of a model with slope of zero:
x = c(0:10)
y = rep(5,11)
plot(x,y)
abline(h=5)

What we are aiming at is that the tests for the slope and the intercept are not completely independent of each other. Here we can assume that a slope of zero will automatically be related to an intercept that is close to the mean of the data itself, such that $a = y_i = mean(Y)$ ($a$ matches every $i^{th}$ value of $y$ in the set $Y$ and matches the mean of the set $Y$).
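A minimal check of this relation, reusing the flat example from above (note that summary() would complain about an essentially perfect fit here, so we only look at the coefficients):

# For the flat example above, the fitted intercept equals the mean of y
# and the slope is (numerically) zero:
coef(lm(y ~ x))
mean(y)
# [1] 5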

Now that we have solved the conceptual part of interpreting a p-value of an intercept, let us get to the formula. It is again the same spiel as before: the expected intercept will be 0, so we can abbreviate the formula:

The standard error of the intercept however looks a little messy again. Note again: in essence, the standard error of the intercept is just sqrt(var(mean(y)-b*mean(x))) (the formula for the alternative way to obtain the intercept, discussed in part three of this series). Rearrangements and including the SSE will lead to the form below. Brace yourselves:

The formula is really hard to read, even though the general message is the same as always: the standard deviation just of the intercept, such that sd(mean(y)-b*mean(x)). Here we really had trouble finding a clear intuition for the formula. Again, comment or contact us if you have a good take on this. You may find this discussion on StackExchange enlightening. It entails some decompositions of the standard errors of the slope and intercept, which we left out here.

Below you will find code to calculate the above and to replicate the summary(lm(y~x)) output of our “mtcars” example:

# Standard error of the intercept:
SE_a = sqrt(SumR2/(length(indep)-2))*sqrt((1/length(indep))+((mean(indep)^2)/sum((indep-mean(indep))^2)))
# [1] 45.59712

# t-value intercept:
t_value = alpha/SE_a
# [1] 5.165419

# p-value intercept:
2*(1-(integrate(t_distr, df = length(indep)-2, lower = -Inf, upper = t_value)$value))
2*(1-pt(t_value,df = length(indep)-2))
# [1] 1.460149e-05

We did it!! Let us update our linear_least_square() function that we wrote in part II of this series.

6 Z-/T-Test and the Signal Detection Theory

As a short side note, we recommend the second chapter (p. 13) of the open access stats book “Understanding Statistics and Experimental Design” by Herzog, Francis and Clarke (2019). In the third part of this series, we hinted at one point that the variance/sd can also be understood as the precision of an estimate of a mean, gathered from sampling. Using the example of a sonar sensor on a submarine, the chapter introduces the t-test as a method to distinguish signal (mean difference) from noise (standard deviation), e.g., distinguishing rock from ‘no rock’, given a noisy channel. The smaller the noise, the higher the precision.

7 Update on our linear_least_square() Function

Below you will find the code for a complete update of our linear_least_square() function from the second part of this series, producing output equivalent to summary(lm(y~x)) for what we have gone through so far. Note that the actual lm() function has a bunch more options than written in the code below. Our function again just serves an educational purpose. Have fun with it!

# Go-go-gadgeto linear_least_square!!!!
linear_least_square = function(indep,dep){ # Start of function
  
  # Evaluating coefficients for an optimal linear model
  # given a set of dependent and independent variables:
  beta  =  cov(indep,dep)/var(indep)
  alpha =  mean(dep)-beta*mean(indep)
  fx = alpha+beta*indep
  
  # Sum of Squared Errors/Residuals:
  SumR2     = sum((dep-fx)^2)
  
  # Residual Standard Error/Deviation:
  residual_standard_error = sqrt(SumR2/(length(indep)-2)) 
  
  # Standard error of the slope for t-distribution:
  se_denom = sqrt(sum((indep - mean(indep))^2))
  se_slope_t = residual_standard_error/se_denom
  
  # Standard error of the intercept for t-distribution:
  se_a = sqrt(SumR2/(length(indep)-2))*sqrt((1/length(indep))+((mean(indep)^2)/sum((indep-mean(indep))^2)))
  
  # t-value of the slope (exp(-32) is a tiny constant added to the
  # denominator, here and below, to avoid division by zero):
  t_value_b = beta/(se_slope_t+exp(-32)) 
  
  # t-value of the intercept:
  t_value_a = alpha/(se_a+exp(-32))

  ### p-value of the slope via integrating the PDF of a t-distribution
  # up to the t-value calculated above:

  t_distr = function(x,df){
    t_1 = gamma((df+1)/2)/(sqrt(df*pi)*gamma(df/2))
    t_2 = (1 + (x^2/df))^(-(df+1)/2)
    t_distr = t_1*t_2
    return(t_distr)
  } # End of function
  
  # Two-Tail P(t=T|H_0):
  p_b_2t = 2*integrate(t_distr, df = length(indep)-2, lower = -Inf, upper = t_value_b)$value
  
  ### p-value of the intercept:
  
  # Two-Tail P(t=T|H_0):
  p_a_2t = 2*(1-(integrate(t_distr, df = length(indep)-2, lower = -Inf, upper = t_value_a)$value))

  # Results for two tail
  Results_a = c(round(alpha,4), round(se_a,4), round(t_value_a,4), p_a_2t)
  Results_b = c(round(beta,4), round(se_slope_t,4), round(t_value_b,4), p_b_2t)
  Res_data = as.data.frame(rbind(Results_a,Results_b))
  colnames(Res_data) = c("Estimate","Std. Error","t value","Pr(>|t|)")
  rownames(Res_data) = c("Intercept", "Reg. coeff.")

  # Nice output using cat() function:  
  cat(" Linear least square method in R","\n","\n",
      "Independent variable:", "\t", deparse(substitute(indep)),"\n", 
      "Dependent   variable:", "\t", deparse(substitute(dep)),"\n","\n",
      "alpha", "\t",alpha,"\n",
      "beta","\t",beta, "\t","SumR2", "\t", SumR2, "\n","\n")
  print(Res_data)
  cat("\n","Residual Standard Error:",round(residual_standard_error),
      "on", (length(indep)-2), "degrees of freedom","\n","\n") 
  
  # Let us also plot our results:
  # We will also use deparse(substitute(x)) for the 
  # labels of our plot axes.
  plot(x=indep,y=dep, ylab = deparse(substitute(dep)),       
       xlab = deparse(substitute(indep)))
  abline(a=alpha, b=beta, col = "darkblue")

} # End of function


# Test:
linear_least_square(indep, dep)

# Linear least square method in R 
# 
# Independent variable: 	 indep 
# Dependent   variable: 	 dep 
#
# alpha 	 235.5282 
# beta 	    -0.291294 	 SumR2 	 475953.3 
#
#             Estimate Std. Error t value     Pr(>|t|)
# Intercept   235.5282    45.5971  5.1654 1.460149e-05
# Reg. coeff.  -0.2913     2.4116 -0.1208 9.046625e-01
#
# Residual Standard Error: 126 on 30 degrees of freedom 

For comparison, here again the output of summary(lm(dep~indep)):

It is done, we have successfully replicated most parts of the summary(lm(y~x)) function, whoop-whoop!! In the next part of this series, we will look at other measures, such as the effect size, F-statistics and we will take a closer look at confidence intervals and confusion matrices.

Fig. Ms. Miranda, longing for feedback. Did any of this make sense? We would like to hear from you! Similar to our open review process, every one of our articles can be commented on by you. Our journal still relies on the expertise of other students and professionals. However, the goal of our tutorial collection is especially to come into contact with you, helping us to create an open and transparent peer-teaching space within NOS.
