Intermediate Stat-o-Sphere
Review: This is a pre-released article and we are currently looking for reviewers. Contact us via [email protected]. You can also directly send us your review, or use our open peer-review functions via pubpub (create an account here). In order to comment mark some text and click on “Start discussion”, or just comment at the end of an article.
Corresponding R script:
Summary:
Welcome to the fifth part of our tutorial series on inferential statistics! You have come quite a way and we hope you get the feeling that it is well worth the wait! The following tutorial requires a basic understanding of probability theory, linear regression and essential descriptive measures, such as the variance, the standard deviation and the mean (Inf. Stat. I-IV). On the other hand, we are not going to speed up the pace, as we still aim to consistently replicate the math in R as far as possible. We again recommend making use of the contents function (upper right corner) to reorient yourself in the script. Since we are approaching higher spheres, we will not provide summaries at the end of every chapter, but we still provide a downloadable summary above!
As mentioned in part four of this series, we will now cover the regular z-/t-tests, comparing simple means only, before applying the same principle of hypothesis testing to the intercept and slope of a fitted linear model.
Let us briefly catch up on what we have gone through so far: In the first two parts of this tutorial series, we focused on the general concept of hypothesis testing (conditional probability) and linear modelling (conditional linear relations). In the third part we looked at our linear model under a descriptive perspective and at alternative methods to obtain the optimal intercept and regression coefficient of a linear model via such descriptive measures (variance, covariance etc.). In the fourth part of this series we looked at basic distributions / functions such as the uniform and several forms of Gaussian distributions, evaluated the z-score and the effect size as a special case of z-score (where the mean of the sample is the input; we will talk about the effect size again, since the z-statistics/value can be looked at as a weighted effect size…).
In the fifth part of this series, we are going to fully combine the concept of probabilistic relations with the concept of mean estimate and linear relations. In other words: we will look at methods to evaluate samples as well as linear models from a probabilistic perspective (p-value).
Apart from evaluating p-values there are also other methods that we can use to evaluate our model in general, such as the effect size, which is also not part of the summary(lm(y~x)) output in R, but still important. However, we will get to further measures such as the F-statistics in the sixth part of this series.
In the context of R this tutorial can be understood as the first part of understanding the output of the summary(lm(y~x))
function as well as the t.test(y~x)
function. In the sixth part we will eventually replicate the output of the mentioned R functions in full and we will also look into the topic of power analysis, confidence intervals and confusion matrices (essentially conditional probability tables) more closely. In other words: this tutorial is all about p-values.
The general question we will ask when comparing the means of two distributions will be: how different are the two distributions in question? More precisely: how likely is it that the sample was obtained by sampling from a population (or from another sample in comparison)? What is the difference and is it different enough?
On the other hand, the general question we are going to ask, when evaluating a linear model will be: How good, how fit is our model of Lady Meow drinking tea under a probabilistic perspective? Is the difference between the independent variable and the dependent variable enough to consider a conditional (cor-)relation between them?
So the goal is now to not only descriptively, but also probabilistically evaluate how well our linear model predicts given / new data. Again, we will do so by computationally replicating most parts of the math of the summary(lm(y~x))
function and other functions in R, such as the infamous p-value.
We are finally approaching our first statistical hypothesis test using a PDF: the Gauß-test or also more commonly called z-test. As mentioned a couple of times: the difference to a t-test is formally marginal, but not too trivial as we will see further below.
As mentioned: a z-test tests for a difference between the sample mean and the population mean (or the mean of another sample) in such a way that we can use the result of a z-test to make assumptions about the probability that a sample distribution is part of a population distribution or of another sample (when taking multiple samples, backed by the central limit theorem).
All this is done via a z-test, e.g., by using the mean of the sample and of the population, given the var/sd of the population. To make things a lot easier upfront, the core of what a z-test does is simply the following (similar to the z-score that we encountered in the fourth part of this series):
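In symbols (a minimal placeholder for the expression the test is built around, with $\bar{x}$ denoting the sample mean and $\mu$ the population mean):

$$\bar{x} - \mu$$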
Different to the z-score, we now do not standardize a whole set of sample values but the mean of a sample only. Similar to the effect size, you may reckon — we will get there!
In any way, now we are really comparing a difference in means. Doing so can essentially again be understood as a kind of centralization, just not of a whole vector of values, but of the sample mean only. The above difference will eventually be divided by the standard deviation of the population — but again not of the whole set, but of the mean only — to obtain the z-value/statistics. This is different to the effect size, you may reckon. Below you can see that the standard deviation of the mean of a sample brings in the sample size $n$, in the form of $\sigma/\sqrt{n}$, the so-called standard error of the mean. For now just think of it as the standard deviation of the sample mean itself; we will go through the details in the next chapter.
Below you see a comparison between the z-score (discussed in part four of this series), Cohen's d effect size and the z-value (the result of a z-test). Recall that Cohen's d is just a special case of the z-score! Looking at the z-value/statistics we can see that it is, in turn, in some way just a special case of the effect size:
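Since these three formulas are central to the rest of this tutorial, here they are side by side (a compact sketch using $x$ for a single value, $\bar{x}$ for the sample mean, $\mu$ and $\sigma$ for the population mean and standard deviation, and $n$ for the sample size):

$$z\text{-score} = \frac{x-\mu}{\sigma} \qquad\qquad d = \frac{\bar{x}-\mu}{\sigma} \qquad\qquad z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}}$$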
For a z-test we can either compare the sample mean with a population mean or the means of two samples (one- or two-sample z-test, details further below). The rest is essentially again just a method of converting the above difference into a standard normal PDF. This "standardized" perspective can be helpful to get a general grip on the difference in means, as we have seen in the fourth part of this series.
For example: when testing a medication that is supposed to lower blood pressure, how can we interpret a difference of -10 mmHg between the control group and those that took the medication? Maybe it is way below the standard deviation of the population?
Technically, standardization also helps to simplify the process of integration (via z-tables), structurally equivalent to what we did when obtaining z-scores, but again, now with a focus on converting the difference in means only. Not exactly only, since we can see above that there is a weight of $\sqrt{n}$ involved. By including the sample size, a given difference in means is weighted by how much data it is based on (more on this further below). In the figure further below, we compare two distributions, but we kept the variance equal, so both curves have the same form.
The hypothesis of a z-test is also specified to address a specific tail of a normal distribution: the term tail refers to the outer left or right parts of a normal distribution. Our sample mean can either be lower or higher than the population mean (lower and upper tail). There is also a two-tailed version checking for overall equality/inequality (both tails at the same time). We will get into the details further below…
Usually, we do not aim for a 100% difference. Therefore, a threshold is commonly set to define significance. This threshold is often .05, i.e., the lower/upper 5% tail of the normal distribution, as getting down to 0% is rather unlikely. Note that we are not going through all the details of confidence intervals, power analysis etc. in this tutorial, since we do not need them in order to define and calculate a p-value. This may surprise you, which may be due to the fact that the significance test and the p-value as the result of a z- or t-test are often conflated with each other (see chapter 3.4 for details on common misleading and faulty definitions of the p-value).
In the plot below you will see a z-value for a difference between the sample and the population mean that is below the .05 threshold, indicating that it is at least very unlikely that the sample is from the population or "the same population". NOTE: Even though we are just operating with the effect size in the plot below, not the z-statistics/value, we can still see that the difference is significant. This is due to the fact that a z-statistics with $n \geq 1$ can only be further out in the tail than the plain effect size (dividing by $\sigma/\sqrt{n}$ instead of $\sigma$ can only increase the absolute value). So in this case we can already argue that we can discard the null-hypothesis (no difference), which in this case means: there is a significant difference given in the sample and it is therefore unlikely that there is no difference as stated by the null-hypothesis.
Code for the plot above.
# General formula for a probability density function (PDF).
# Insert mean = 0 and var = 1 to obtain the values for input x
# given a so-called standard normal distribution:
prob_dens = function(x,mean,var){
fx = (1/sqrt(2*pi*var))*exp((-((x-mean)^2))/(2*var))
return(fx)
} # End of Function
# Comparing distributions:
# We chose equal variance, so the form will be the same
# Sample:
mean_samp = 5
var_samp = 12
# Population:
mean_pop = 15
var_pop = 12
# Plot:
seq_samp = seq(-7,40,by =.01)
plot(x = seq_samp, y = prob_dens(seq_samp,mean_samp,var_samp), type = "l", xlim=c(-8,30),lty =2)
seq_pop = seq(-7,40,by =.01)
lines(x = seq_pop,y = prob_dens(seq_pop,mean_pop,var_pop), type = "l")
abline(v=mean_pop,col="lightblue")
abline(v=mean_samp,col="orange")
abline(h=0) # line for x-axis at y = 0
# Quantile function for .05 of pop:
pop_threshold = qnorm(.05,mean=15,sd=sqrt(12))
# Add line to mark lower tail threshold of .05 (5%):
segments(x0=pop_threshold,x1=pop_threshold, y0 = 0, y1 = prob_dens(pop_threshold,mean_pop,var_pop))
Let’s go through all of the above in detail:
As we have seen before, the formula of the z-test's z-value is nearly the same as the formula of the z-score. There are just two minor differences. However, it is the same principle at work as in the regular z-score: centralization in the numerator (evaluating the difference) and then standardization via the sd (similar to the process of normalization in probability theory). Again, a z-test is just a z-score articulating a specific difference: a difference of mean values only! Note: the formula below refers to a one-sample z-test (we will catch up on the slightly different formula for two samples further below).
Differences to the z-score (redundant overview, potentially optional):
a) For the z-table we used a sequence of possible values of $x$ as input; for the z-test only the mean of the sample is standardized.
b) To perform a comparison of means in such a way we have to take the so-called standard error of the mean ($\sigma/\sqrt{n}$, i.e., the population standard deviation divided by the square root of the sample size) instead of the plain standard deviation.
As we know, the z-statistics is very close to Cohen's d effect size, which also standardizes a difference in means, but using the regular standard deviation instead of the standard error of the mean. From another perspective the formula below represents the z-score of the mean only, not of a whole set of values.
The effect size is the perspective we want for a mere standardized numerical comparison, where the plain mean difference may not be informative (it just converts the mean difference into measures of SD). Say we have a measurement of 140 mmHg systolic blood pressure and a difference of 10 mmHg — is that much? The standardization in the fashion of the plain effect size can help to get a grip on how big the actual effect is. Cohen's d can be looked at as equivalent to the corresponding z-statistics/value, given that $n = 1$ (see the code below):
# Comparison z-stat and effect size:
x=c(0:10)
y=c(0:10)*3
pop_mean = mean(y)
pop_sd = sqrt(sum((y-mean(y))^2)/length(y)) # population sd (sqrt of the variance)
sem_x = pop_sd/sqrt(length(x))
z_stat = (mean(x)-mean(y))/sem_x
effect_size = (mean(x)-mean(y))/pop_sd
# Same with n = 1
sem_x_2 = pop_sd/sqrt(1)
z_stat_2 = (mean(x)-mean(y))/sem_x_2
effect_size_2 = (mean(x)-mean(y))/pop_sd
# Check for equality of z_stat_2 and effect_size_2:
all.equal(z_stat_2,effect_size_2)
The z-statistics/value may still appear a little out of order, but let us first take a closer look at the standard error of the mean.
The standard error of the mean can be a little tricky, even though it actually doesn't introduce anything new to us (see part three). Nevertheless, we decided to give the topic some space in its own chapter. We promise it is worth it, as it again relies on assumptions of the central limit theorem, i.e., it essentially relies on the relation between the sample size and the population size ($n$ and $N$).
Per definition, the standard error of the mean is nothing but the standard deviation of the sample mean itself (not of a whole set of values):
STANDARD DEVIATION = Average deviation of a set of data ($x$) from its mean.
==>
STANDARD DEVIATION/ERROR OF THE MEAN = Average deviation of the sample mean from the population mean (not a set, just the sample mean in relation to the population mean is centralized!!).
==> In formula terms: $SEM = \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{\sigma^2}{n}}$
Given only a slight difference, the regular standard deviation is often confused with the standard error of the mean. The reason is again that it essentially is a standard deviation, just of the mean only. Note that in R we can't just use sd(mean(x)) to compute the standard error of the mean, as we will see.
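A minimal sketch of why sd(mean(x)) does not work (mean(x) is a single number, so there is nothing left to deviate) and how the SEM is computed instead; the vector x_example is just a made-up illustration:
# sd() of a single number returns NA:
x_example = c(2, 4, 6, 8, 10)
sd(mean(x_example))
# [1] NA
# The SEM instead divides the sd of the whole set by sqrt(n):
sd(x_example)/sqrt(length(x_example))
# [1] 1.414214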
For the z-test the standard error of the mean is computed with the population standard deviation (for the t-test further below, the sample standard deviation will be used instead).
As written above: a special characteristic of the standard error of the mean is that the sample size $n$ is weighted in: the larger the sample, the smaller the SEM and therefore the larger the resulting z-value for one and the same difference in means.
CAVE: This is exactly what Cohen's d effect size is trying to avoid, since samples with a large $n$ will otherwise produce large z-values even for rather small differences in means.
To get more insight into why weighting in the sample size makes sense, let us ask again: where does the square root come from anyway? The answer is actually quite intuitive: the mean as input brings in a division by $n$ (the variance of a mean of $n$ values is $\sigma^2/n$), and taking the square root of that variance turns the $n$ into $\sqrt{n}$.
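As a compact sketch of this decomposition (assuming $n$ independent, identically distributed draws $x_i$, each with variance $\sigma^2$):

$$Var(\bar{x}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(x_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \;\;\Rightarrow\;\; SD(\bar{x}) = \frac{\sigma}{\sqrt{n}}$$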
In the decomposition above we can also see how the sample mean itself actually disappears completely as input: the SEM only depends on $\sigma$ and $n$, and for $n = 1$ it simply reduces to the plain sd.
Note upfront that this bounded relation between $n$ and the SEM (the larger $n$, the smaller the SEM) will become important again when we discuss the interpretation of p-values further below.
Below we tried to mathematically express the SEM, once using the population sd and once using the (bias-corrected) sample sd:

$$SEM_{pop} = \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{\sum_i (x_i-\mu)^2}{N}}\cdot\frac{1}{\sqrt{n}} \qquad\qquad SEM_{sample} = \frac{s}{\sqrt{n}} = \sqrt{\frac{\sum_i (x_i-\bar{x})^2}{n-1}}\cdot\frac{1}{\sqrt{n}}$$
Looking at the formulas above also helps to understand how the sample variance corrects for a bias (Bessel's correction): dividing by $n-1$ instead of $n$ slightly increases the estimated variance, compensating for the fact that a sample tends to underestimate the variance of the population.
Below you will find another decomposition using the variance only (optional), again showing where the square root of $n$ in the denominator of the SEM comes from.
Below you will find some code to calculate the SEM for several different sample sizes $n$:
# Example for SEM:
# Our population:
x = c(rep(c(0:10),100))
# Population variance:
var = sum((x-mean(x))^2)/length(x)
# Evaluate SEM for several different sizes of n:
n_sample = 20
sem_20 = sqrt(var)/sqrt(n_sample)
# [1] 0.7071068
n_sample = 200
sem_200 = sqrt(var)/sqrt(n_sample)
# [1] 0.2236068
n_sample = 1000
sem_1k = sqrt(var)/sqrt(n_sample)
# [1] 0.1
n_sample = length(x) # = 1100
sem_N = sqrt(var)/sqrt(n_sample)
# [1] 0.09534626
n_sample = 10000
sem_10k = sqrt(var)/sqrt(n_sample)
# [1] 0.03162278
sem_20 > sem_200 & sem_200 > sem_1k & sem_1k > sem_N & sem_N > sem_10k
# [1] TRUE
Next (optional) you will find another, more complex simulation in order to demonstrate the mathematics above: now we are going to use some code we used to demonstrate the central limit theorem, including actual sampling, created by using the rnorm() function (rnorm = random normal distribution).
We will create a set of 5000 means from samples with a size of $n = 50$ each.
Even though we have knowledge about the population's mean and sd, below we are first going to just use the sample sd and later the population sd, since at least in general the sample sd can also be used as an approximation when computing the SEM.
# Standard error of the mean including actual sampling:
# How the SEM becomes smaller, the more samples we take:
set.seed(1) # for reproducibility
data = rnorm(n = 1000, mean = 10, sd = 2)
# We can again plot our data as histogram
hist(data, col = "lightblue")
# Use a for loop to take 5000 random samples of size n = 50 each:
n_samp = 5000
sample_mean = c() # create empty vector
sample_sd = c()
# The loop below takes samples of length 50 and stores the
# mean of each sample in our vector sample_mean.
# Re-running just the loop (without resetting the seed) will
# result in different samples:
for (i in 1:n_samp){
# Store mean of each sample with length 50:
sample = sample(data, 50, replace=TRUE)
sample_mean[i] = mean(sample)
} # End for i
# Look at the histogram of the sample_mean:
hist(sample_mean, col = "lightblue")
# Now we compare the SEM of the first 20 sample means
# with the SEM of the first 250 sample means. Recall that
# the vector sample_mean contains 5000 means,
# each from a sample of size 50:
sample_mean
length(sample_mean)
# [1] 5000
#### SEM of the first 20 samples:
samp_n20 = sample_mean[1:20]
length(samp_n20)
# [1] 20
# SEM_n20
SEM_n20 = sd(samp_n20)/sqrt(length(samp_n20))
# [1] 0.05782984
#### SEM of the first 250 samples:
samp_n250 = sample_mean[1:250]
length(samp_n250)
# [1] 250
# SEM_n250
SEM_n250 = sd(samp_n250)/sqrt(length(samp_n250))
# [1] 0.01723958
#### SEM of all 5000 samples:
SEM_n5000 = sd(sample_mean)/sqrt(length(sample_mean))
# [1] 0.003924541
# The numerical values will deviate, each time you run the
# loop, but it will always hold the logic of:
SEM_n20 > SEM_n250 & SEM_n250 > SEM_n5000
# [1] TRUE
# For a z-test we would have to use the population sd,
# which we set to be 2 in rnorm() function above. Given
# the sd we can simply adjust the size of n:
# pop_sd/sqrt(n):
2/sqrt(20) > 2/sqrt(250) & 2/sqrt(250) > 2/sqrt(10000)
# [1] TRUE
For some more examples on the SEM, have a look at the corresponding R script.
Conceptually we have already gone through most of what is ahead of us. We have previously obtained a rough overview of what possible hypotheses of a z-test look like; in this chapter we will learn about all of them in detail.
In general a z-test can be differentiated into two categories (the formula, however, is essentially the same):
=> The one-sample z-test compares a sample mean from a population with the population mean / expected value.
=> The two-sample z-test compares two samples and checks if they are related to the same population or to different populations.
Apart from that, there are three types of hypotheses for the z-test that can be drawn: one-tailed (lower/upper) and the two-tailed test hypotheses.
The important message of the previous chapter: effect size and z-statistics/value are both doing the same thing: standardizing the difference in means. We know that the sample size is not weighted into the effect size (or is treated as if $n = 1$).
The example we will be using in the next chapters will be (mostly) a comparison of the tea drinking behavior of happy cats and sad cats. We will also look into a catnip treatment at one point (dependent two-sample z-test), so stay tuned! However, brace yourselves, Chat Noir:
A two-sample z-test could ask if sad cats drink less tea than regularly happy cats, e.g., such as Lady Meow. The samples are distinct from each other since they imply inclusion criteria: cats are tested for being sad or regularly happy (on whatever basis) and then put into their respective category. To answer our above question, these two samples would be compared with each other. The probability value (p-value) would state how likely it is that the samples are samples from the same population. A one-sample z-test is similar, just that not the mean of a second sample is used, but the given mean of a population of all other cats that are at least not sad. The difference to the two-sample z-test may appear minor in this case; however, it is a good lesson to conceptually think of what a population actually represents. The exact question for a one-sample z-test of our example would be: does the "chat noir" drink significantly less tea than any other cat (cats in general)? So in nuce, the category "all other cats" is not as specifically circumscribed by inclusion criteria as in the two-sample test. Such a test can still be performed, but the result of this one-sample test may be that no difference is detected, due to too many confounding factors or other biases, which are all present in the population (since no further differentiation via inclusion criteria was performed). Keep that in mind!!
In case the term population becomes fuzzy when thought of as representing a category: note that the concept of a population is not bounded by a certain "real" category, such as, e.g., a species. In nuce, it can also mean a population within a population (sub-population). In the example for the one-sample test above we looked at the tea drinking behavior of a sample of a subpopulation of sad cats amongst a population of "regular" or "all other" cats (of course it's all made up, but since we are doing a bit…). So here the difference between what a sample is and what a population is, in the form of a category, is not set by the species (cats) but by the cats' mood and potential tea drinking behavior. We could also compare the behavior of different species… We recommend keeping such conceptual/linguistic categorical issues in mind when trying to understand what variables are used. And of course: in order to make a distinction between sad and happy cats, inclusion criteria would need to be discussed in detail (since it is just a made-up example, we will skip this part; however, an example of an exclusion criterion for a second sample could be cats that have trouble drinking in general, e.g., due to some illness like a tumor that influences the ability to properly swallow liquids).
For the lower tail we can formulate the null hypothesis that the sample mean is either equal to or not (significantly) lower than the population mean (meaning: it could be a sample from the population if not lower than a defined threshold). Below we also included different denotations of the essential logical binary relation between alternative and null hypothesis. Recall from Inferential Statistics I that the use of the term null hypothesis can be confusing (e.g., due to the often redundantly expressed binary relation when saying things such as "the null-hypothesis being true"). Also recall that the p-value below is essentially a (uniformly weighted) likelihood of the form $p(\text{data} \mid H_0)$, i.e., the probability of the observed difference given the null hypothesis.
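One compact way to write the lower-tail hypotheses and the corresponding p-value (a sketch using standard notation, with $\mu$ for the population mean and $\mu_{sample}$ for the mean of the population the sample actually stems from):

$$H_0: \mu_{sample} \geq \mu \qquad \text{vs.} \qquad H_1: \mu_{sample} < \mu, \qquad p = P(Z \leq z \mid H_0)$$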
You may wonder what role the null hypothesis plays in the formula at all: essentially it just sets the expected difference we compare against, which under the null hypothesis is zero.
As mentioned in the first part of this series, the use of the null-hypothesis is very unfortunate, since this is also not the way we express evidence in everyday life: when we find evidence for the effect of a certain treatment, we do not say "we did not not-find any evidence", we positively say: there is evidence for an effect.
This ex negativo approach is sometimes also related to falsification (Karl Popper), e.g., in [this paper]. In general it refers to a discourse on epistemics, i.e., the question of what can be known and what can be tested. Personally, I don't find this logically very convincing, since falsification is better understood in relation to temporal and spatial boundaries of experience/examination, as in the classic black swan example, and less by overinterpreting an ex negativo approach to setting a category in conditional probability. The latter appears to be more of an epistemological reminder, apart from being most of all mathematically convenient. Mathematically tilting the category in advance via the null-hypothesis approach does not solve epistemic issues of any kind of abductive inference concerning time and space (recall part I of this series). Tilting the category is in the end just tilting numbers, since the relation between a variable and its complement is already part of probability as well as propositional logic. Again, apart from that, we always communicate the results in the positive sense: evidence was found that medication X has an effect (we don't say: evidence was found that medication X did not not-have an effect). Most of all, evidence is not obtained by one study alone, so significance by itself does not relate to the world anyway.
Apart from numerically tilting from $H_1$ to $H_0$ (i.e., from $p$ to $1-p$), the actual calculation of the p-value stays the same for every type of hypothesis.
Let us start with the lower tail:
When do I know whether to use the lower or the upper tail? In general it is dictated by the question that is asked: if we decided in advance that our hypothesis is that depressed cats drink less tea than others, then we indirectly decided to perform a lower-tail z-test. What if the inverse is the case: they drink more tea than others? Performing a hypothesis test correctly requires setting a hypothesis in advance. In the best case, you register your trial before you collect data and then perform your statistical analysis. Not doing so can essentially be considered HARKing, i.e., formulating or adjusting a hypothesis after the results are known (bad practice). Also, when the result does not match our a priori hypothesis (either higher or lower), but turns out to be significantly different, we technically would need to perform a new trial (especially when preregistration was done, we can't and shouldn't just change the hypothesis). However, we will see below that this is why most people use a two-tailed test.
We can use the same formula as above for the inverted case with a sample mean that is higher than the population mean (in the sense that it is not from the same population). Here we use the same integral as for the "lower tail" and due to the symmetry of a standard normal PDF we can use the same p-value as for the lower tail as well. As mentioned, we just need to tilt the result, due to the symmetry of Gaussian functions. Formally we actually integrate up to the absolute value of our given z-statistics and subtract the result from 1 (see code below). Integrating up to the absolute value of the z-statistics by itself (without subtracting it from 1) would only result in the likelihood that follows the alternative hypothesis of a lower-tail test…
Below you will find some code to perform a lower and upper tail one-sample z-test:
### Chat Noir One-Sample Z test:
# We will compare sad cats with a population
# of cats ("all cats"):
# Happy cats like Lady Meow drink 10 cups of tea on average:
pop_cats = c(rep(seq(0,10, length = 11),10))
# Sad cat, which drank only 2 cups of tea in 10 hours:
sample_sad_cats = seq(0,2,length = 11)
# [1] 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0
# Population var and sd:
var = sum((pop_cats-mean(pop_cats))^2)/length(pop_cats)
sd = sqrt(var)
# SEM:
sem = sd/sqrt(length(sample_sad_cats))
# Z-test:
z_stat = (mean(sample_sad_cats)-mean(pop_cats))/sem
# [1] -4.195235
# P-Value LOWER TAIL
p_val_low = pnorm(z_stat,mean = 0,sd = 1)
# [1] 1.362942e-05
p_val_low < .05
# [1] TRUE
# P-Value UPPER TAIL:
# The below is only theoretically the case and we
# have to pretend to have gotten a positive z-statistics.
# This spares us the time of creating another set of data.
# This time the below actually argues that sad cats drink
# more tea than happy cats (we made things up anyways, but just that
# you know):
p_val_upper = pnorm(abs(z_stat),mean = 0,sd = 1)
1-p_val_upper
# [1] 1.362942e-05
1-p_val_upper < .05
# [1] TRUE
Awww :( Turns out sad cats truly drink significantly less tea :'( The p-value is way below 0.001, so we can report it as: p < 0.001. This is also how we would report the p-value in a paper. One can also report the z-value alongside the corresponding p-value, as it tells us how many standard errors (of the mean) our sample mean is away from the population mean.
As the name already says: now we are going to consider both tails of the distribution a priori and are looking at a total-threshold of 5% (2.5% each tail) when comparing it with the p-value for testing the significance (again, we will cover the whole topic of significance testing in detail in the next part of this series).
This time we are asking for the equality/inequality of two distributions. In other words: we are going to perform a lower and upper tailed z-test at the same time, so to speak.
When should I use the two-tailed z-test? In general it is a safe prior hypothesis to argue for a difference that is either higher or lower than the population. Conceptually it can be understood as a rather exploratory approach: catnip may result in less depressed cats, but maybe also in more depressed cats? Maybe we don't know yet, since current research is ambivalent or simply not given. The cost, though, is a doubled p-value and a higher chance that the result will turn out to be insignificant. However, in case this appears confusing: we do not actually get a double difference, since we are still looking at a single difference of means only that can't be on two sides at once: one set of samples, one mean, ergo a difference that is either + or -. So in any case there will always be only a "one-tailed difference", even when performing a two-tailed test.
When testing for significance only, we can assume they are from the same distribution if the result of the z-test lies within, e.g., a 95% area of our population function (confidence interval), with an upper and lower tail of 2.5% each (total threshold of 5%).
As the p-value for a given z-statistics only refers to one side when performing integration, we have to multiply the p-value by 2 for the two-tailed test. We can simply do so due to the symmetry of a Gaussian function.
# P-Value TWO-TAIL
p_val_two = 2*pnorm(z_stat,mean = 0,sd = 1)
# [1] 2.725883e-05
p_val_two < .05
# [1] TRUE
The formula for the (independent/dependent) two-sample z-test is nearly the same as for the one-sample z-test. Here we compare the means of two samples, given their respective population sd and population mean. A two-sample test is considered independent when it assumes that there is no influential relation between the samples. To be less abstract: the group of happy cats and the group of sad cats, each sorted by inclusion criteria, are considered independent when their tea drinking behavior is not influenced by each other: sad cats don't drink less tea because of happy cats, but because they are sad. Happy cats are also not withholding any tea from sad cats…
To give an example for the dependent two-sample test: two samples are, e.g., considered dependent when we compare sad cats that drink tea with themselves after getting a dose of catnip.
The complete formula of the independent two-sample z-test is:

$$z = \frac{(\bar{x}_1-\bar{x}_2)-(\mu_1-\mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}$$

…which can importantly be simplified by discarding the difference in population means, since under the null hypothesis this expected difference is zero, leaving only the population standard deviations in the denominator:

$$z = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}$$
When forming a hypothesis, we can now do the same as with the one-sample z-test, just with two samples: is one sample mean significantly lower, higher — or equal/unequal compared to the other? However, for brevity we will just ask for the lower tail below:
# Chat Noir Independent Two-Sample Z-test:
# We will compare happy cats such as Lady Meow with
# unfortunate cats that are evidently sad (inclusion
# criteria will be left out for now):
# Happy Cats:
sample_cats = c(0:10)
pop_cats = seq(0,10.5,length = 11)
pop_cats = c(rep(pop_cats,100))
# Sad cats, which drink only 1 cup of tea in 10 hours :'(
sample_sad_cats = seq(0,2.5,length = 11)
# [1] 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50
pop_sad_cats = c(rep(seq(0,2.5,length = 11),100))
# SEM of each sample (population sd divided by sqrt(n)):
happy_var = sum((pop_cats-mean(pop_cats))^2)/length(pop_cats)
happy_pop_sd = sqrt(happy_var)
happy_sem = happy_pop_sd/sqrt(length(sample_cats))
sad_var_pop = sum((pop_sad_cats-mean(pop_sad_cats))^2)/length(pop_sad_cats)
sad_pop_sd = sqrt(sad_var_pop)
sad_sem = sad_pop_sd/sqrt(length(sample_sad_cats))
# z-statistics. Under the null hypothesis the expected difference
# in population means is zero, so the numerator reduces to the
# plain difference in sample means:
z_statistics_num = (mean(sample_sad_cats)-mean(sample_cats))
z_statistics_denom = sqrt(happy_sem^2 + sad_sem^2)
z_statistics = z_statistics_num/z_statistics_denom
# [1] -3.643885   # == about 3.64 standard errors below
# P-value (lower tail):
pnorm(z_statistics)
# approx. 0.000134, i.e., clearly below .05
However, the above only holds for independent samples. We have covered the difference between the terms independent and dependent when talking about the respective variables in the second part of our series. In general, independent means that something is not influenced by something else (e.g., the probability of one event does not change given the other event).
A dependent relation between samples often concerns before/after relations. Following our furry example on sad cats (it's so sad!), we could, e.g., check if catnip makes cats happy again by checking if they drink more tea again in a certain amount of time after they received some catnip.
Spelled out, the formula goes as follows: z for paired samples (dependent relation) equals the mean of the differences, divided by the standard error of the mean differences, i.e., $z = \frac{\bar{d}}{\sigma_d/\sqrt{n}}$. We could also add "minus the expected difference" in the numerator, but since we are looking for the p-value that assumes the null hypothesis, this value will be zero anyways (since under the null hypothesis no difference is expected).
Below is some code for the example of testing if sad cats drink more tea again after consuming catnip: this represents an upper-tail paired or dependent z-test.
#### Chat Noir Dependent/Paired Two-Sample Z-test:
# Sad cats, which drink only 1 cup of tea in 10 hours :'(
# We can also consider this as timestep t_1:
sample_sad_cats = seq(0,2.5,length = 11)
# [1] 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50
# Timestep t_2, i.e., after cat was on catnip:
sample_catnip = c(0:10)
# [1] 0 1 2 3 4 5 6 7 8 9 10
# Evaluate the difference between t_1 and t_2 via
# diff = t_2 - t_1
diff = sample_catnip-sample_sad_cats
# Set up data table:
data = cbind(sample_sad_cats, sample_catnip, diff)
data = as.data.frame(data) # in order to be able to use $
# sample_sad_cats sample_catnip diff
# [1,] 0.00 0 0.00
# [2,] 0.25 1 0.75
# [3,] 0.50 2 1.50
# [4,] 0.75 3 2.25
# [5,] 1.00 4 3.00
# [6,] 1.25 5 3.75
# [7,] 1.50 6 4.50
# [8,] 1.75 7 5.25
# [9,] 2.00 8 6.00
# [10,] 2.25 9 6.75
# [11,] 2.50 10 7.50
# Calculate the mean difference:
mean_diff = sum(data$diff)/length(diff)
# Total sum of squared mean differences
sum_sq_diff = sum(((diff-mean_diff)^2))
# Standard error of the mean differences
sd_mean_diff = sqrt(sum_sq_diff/length(diff))/sqrt(length(diff))
# Z_value:
z_value = mean_diff/sd_mean_diff
# Upper tail z-test, arguing that catnip has the effect that
# sad cats are drinking more tea than before again:
p_value = 1-pnorm(z_value)
# [1] 7.854725e-08
YOU DID IT!! You have completed your first statistical hypothesis test!! Whoooop!!! Before we move on to the t-test, we will discuss some common misleading, false or at least confusing definitions of the p-value.
You have probably figured out yourself that handling the p-value, e.g., when reading a paper, is actually rather simple: if it is below a threshold, we are allowed to state that an (alternative) hypothesis finds statistical evidence due to a significant difference in, e.g., means (since we can discard the null-hypothesis in such a case).
However, you probably also figured out that defining the p-value easily becomes complicated, since quite a bunch of mathematical concepts are involved in obtaining it. The general message of this tutorial: it may not even be a good idea to try to fit the concept into one phrase.
There are many definitions you probably came across already that may still appear rather enigmatic and confusing to you when thinking about them after going through the math above. In this short chapter we are trying to anticipate and disentangle some of the possible confusions in common definitions of the p-value, before proceeding with the t-test and the integration of the t-statistics/value, in order to obtain a p-value again.
You can also skip this part for now and come back to this chapter later, as it does not add anything essential to the math. In this chapter we just discuss different approaches of linguistically expressing mathematical concepts such as the p-value and the significance.
You probably have come across statements such as: "The p-value is the probability that the difference in the mean of the sample and population occurred by chance". To me this particular statement always appeared terminologically quite confusing, as the term chance implies that the probability (p-value) redundantly refers to another probability / chance, such that the p-value could be understood as a probability of a probability (prob. of a chance of something…)? I really have trouble relating this definition to the process of standardization and integration. If chance is here understood as referring to the threshold of <5%, then the p-value can either be below / outside the "area of chance" or within it, but the p-value itself mathematically does not involve the threshold, nor does it tell us upfront that a difference is in a certain area by some probability (since it is just integration up to a certain z-score/value, as we already know). The p-value does not probabilistically quantify a second-order chance of some kind (the power analysis however does, as we will see in the next part of this series). However, we have seen that the sample size is weighted in, making a large difference, given a large sample size, less likely to be just the result of randomly sampling from the population.
However, the notion of chance is also related to the process of random sampling (randomization) from a distribution, the obtained difference being just one of potentially many possible random differences from the population mean when taking a sample (of size n) from one and the same population, or potentially from a different population. Again, weighting the sample size into the effect size provides this perspective, since a greater sample size makes a given difference more credible (gives it more confidence). Speaking of randomness: it is not recommended to understand "random" as uniform or so, since a Gaussian PDF says that not every mean of a sample is uniformly likely, but we can still take samples from that distribution in a randomized way. The process of sampling is considered random if we did not intentionally decide for a specific sample or had any control in the choice whatsoever, in order to intentionally influence the mean of the sample; therefore the process of randomization in clinical trials. Randomization (clinical design) is therefore not just a nicety that helps to maintain some higher-order quality standard, but also makes sure that we can actually consistently apply mathematical inference methods that work with random variables. Still, randomness in the above sense is not what defines the p-value either; it only defines specific circumstances of how the sampling took place (randomization), i.e., a characteristic of variables (random variables). This definition by itself at least leaves a lot of questions open, especially for beginners.
You may have a different opinion on this, but we recommend keeping in mind that the use of the term chance is neither a good intuition, nor is it terminologically consistent when looking at the actual math. At least exchange chance by "consequence of random sampling from one and the same distribution, not a different one (= the null hypothesis)".
The p-value for a z- or t-test defines how likely it is to obtain a particular standardized / studentized mean difference (sample minus population/expectation), given that we were (randomly) sampling from an expected (normal) distribution (the population with its respective mean and sd for the z-test; the same goes for two samples, where one sample is considered a baseline, so to speak). It is not the same as the plain standardized difference in means, since the sample size is weighted in. Weighting in the sample size can give rise to a higher confidence in a particular difference in means compared to a much smaller sample. Intuitively, more data provides more confidence/credibility. However, in the next part of this series we will look at examples where this becomes a problem for interpreting p-values and their "evidence", but on a narrow scale it still makes sense: the mean of just two events is certainly less credible than the mean of 100 events, under the condition that we randomly sampled and didn't selectively pick events...
We could also modify and reformulate our initial statement on p-values at the beginning of this chapter into a certain condition of insignificance: If the p-value is ABOVE a threshold of 5% we can interpret the given difference or overlapping of the sample mean with the population PDF as coincidence / chance (result of random sampling) and cannot discard the null-hypothesis.
Importantly: we do not even have to calculate the p-value to know if the difference in means is significant, given the z-/t-score of the threshold probability. This threshold z-score is also called a critical z-score: the score that results in a p-value exactly at the threshold, e.g., .05. After calculating a t-value or z-value of a data set, we could simply compare it with the z-/t-value of the threshold probability of 5% (obtained via a quantile function) and argue for significance if our value lies outside the critical threshold score, without integrating a PDF at all. However, in the above statement on the insignificance of a difference, the p-value is still not defined, just compared with the threshold. In other words: the condition of that statement (if below a threshold) has nothing to do with the condition and definition of the conditional probability we are after, our p-value.
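A minimal sketch of this critical-value shortcut in R, reusing the z_stat from the one-sample sad-cats example above (the critical score for a lower-tail threshold of 5% is obtained via the quantile function qnorm()):
# Critical z-score for a lower-tail threshold of .05:
z_crit = qnorm(.05)
# [1] -1.644854
# Comparing the z-value with the critical score leads to the same
# decision as comparing the p-value with .05:
z_stat < z_crit
# [1] TRUE
pnorm(z_stat) < .05
# [1] TRUE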
Here is another statement that is commonly used but rather confusing, even though there is a true story behind it: "the p-value is the probability of observing an effect at least as extreme or more extreme as when the null-hypothesis was true" (Wikipedia, 02.2023). Here I have a lot of trouble understanding what is meant by "extreme" and "at least… or more extreme". The term "effect" refers to the difference in means. But why does it say "at least as extreme or more extreme"? We also don't even know if the difference is given in any sense at all (it could be t = 0). "At least as… or more extreme" again sounds like confusing the p-value itself with the comparison of the p-value and the threshold (significance). The actual idea behind this definition of the p-value (as far as I have encountered) addresses the fact that we integrated a PDF (an interval probability estimate, e.g., from -∞ to the z-value). So in other words, the integral technically addresses all other values that are possible in the sense that they are even more unlikely (we can say so due to the form of a Gaussian function, therefore the effect: regression to the middle…). The problem though lies in the following question: does a p-value obtained from integrating a probability density function really address other cases differently than a point-probability estimate of a regular and discrete probability function, given that we approach a z-/t-value from -/+ infinity? Or in other words: don't both probability values (interval estimate and point estimate) always automatically account for a range that is more extreme, in the sense that any particular probability value (no matter whether point or interval estimate) indirectly refers to all other possible probability values that are lower than the one obtained? A PF and a PDF are not the same kind of function and there are technical reasons why a point estimate of a continuous function results in 0, which is why we use a PDF in the first place. Still, the form of the function and the fact that we are looking at a tail of a Gaussian up to infinity may play an equal role in what is considered "extreme and more extreme" and which probabilistic assumptions we can make beyond a point estimate. EXAMPLE: A point-probability of .1 (point estimate) …
In any way, the above Wikipedia definition still indirectly confuses the p-value with the significance test, as it brings in a threshold with words like 'at least as extreme as when the null-hypothesis was true'. The problem here is that the null-hypothesis in a z-test is always assumed to be true (trap of an unnecessary redundant binary). Only when we perform a significance test do we decide for a threshold — this is not part of the p-value calculation but of the power analysis / the significance test. The probability we actually compute is in any case conditional on the null hypothesis; comparing it with a threshold is a separate step.
Maybe I am missing something above (let us know if so), but to me a lot of common definitions appear rather confusing and they are often not even necessary to understand the p-value, since there are other, more straightforward definitions to rely on.
You may also have come across definitions stating that the p-value is the probability that the null hypothesis is true. We have discussed before (especially in the first part of this series) that stating "the null-hypothesis is true" just adds an unnecessary redundant binary and means the same as stating "the alternative-hypothesis is false" or "not the alternative hypothesis".
We hope we were able to address most of the general issues around defining, explaining and understanding the p-value and significance test. As a final note: keep in mind that p-values do not represent the world in the sense that they represent a constant value, so that we could just rely on their "significance". The major downside of frequentist statistics is essentially that its models do not learn and reflect less what the role of a hypothesis and the theory behind it is (weights). Also recall that the frequentist approach comes with a uniform prior, so to speak. This may be a good and safe spot in a lot of situations (most of all when no prior knowledge is given); however, in the long run it nakedly exposes science to a game of randomness and pretends to be neutral, which is in the end a dangerous lie, making people pretend to not be responsible for making at least major conceptual assumptions about the world in the first place, just because some significance was obtained (meta-studies therefore become more and more important over the years).
IMPORTANT TAKE HOME MESSAGE: Don't let yourself get confused by lousy definitions of the p-value, even if they come from professionals!!! Any mathematical statement has to have a mathematically formulated representation. A mathematical statement is not about an effort of correctly interpreting words, but about the instructions of an algorithm represented as a formula. In the end, the formula is all we have in order to confirm our definitions, so it is probably best to start there as reference.
As mentioned before, the t-test only slightly differs from a regular z-test. For a regular t-test the only difference is that we will rely on the sample standard deviation / variance, not on the population sd / var. In other words, we will operate with the sample sd $s$ (using Bessel's correction) instead of $\sigma$, and correspondingly with $s/\sqrt{n}$ as the estimated standard error of the mean.
For the one-sample t-test at least the expected value has to be given, which can be the mean of a bunch of previous samples (previous trials) or a theoretical value (another model, e.g., from physics, i.e., an expectation backed by other (kinds of) research and the respective formulas and predictions they make). It will not be the actual population mean, since the population is considered unknown when performing a t-test, different to the z-test. The t-test is also called Student's t-test (we will not cover the history of the t-test, but we recommend doing some research on why it is called that way!).
Note that this test is considered robust, given equal variance and sample size. We could also use a Welch's test in case of unequal variance and sample size. However, we will not cover the Welch's test here (we will do so some other time; note that the t.test() function and other functions in R are actually often able to decide on the correct/optimal test themselves, based on the input data).
One-sample t-test:
In the formulas below we essentially just exchanged z with t, that's all: we centralize in the numerator and use the sample standard error of the mean for scaling or studentization (instead of standardization), i.e., $t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}$, with $\mu_0$ being the expected value and $s$ the sample standard deviation.
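A minimal sketch of a one-sample t-test in R, using a made-up vector samp and an assumed expected value of 10 (the result can be compared with the output of the basic R function t.test()):
# Hypothetical sample and expected value under the null hypothesis:
samp = c(8, 9, 10, 11, 12, 14)
mu_0 = 10
n = length(samp)
s = sd(samp) # sample sd, i.e., with Bessel's correction (n-1)
t_value = (mean(samp)-mu_0)/(s/sqrt(n))
# Two-tailed p-value via the t-distribution with df = n-1:
p_value = 2*pt(-abs(t_value), df = n-1)
# Compare with the built-in function:
t.test(samp, mu = mu_0)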
Independent Two-sample t-test:
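The independent two-sample t-test follows the same scheme; in its common textbook form with a pooled sample variance it can be written as follows (a sketch of the standard notation, where $s_p$ denotes the pooled standard deviation of the two samples):

$$t = \frac{\bar{x}_1-\bar{x}_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}$$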
Dependent/Paired Two-sample t-test:
Recall that we could also add "minus the expected mean difference" in the numerator in order to represent a centralization in the common form, i.e., $t = \frac{\bar{d}}{s_d/\sqrt{n}}$, with $\bar{d}$ being the mean of the paired differences and $s_d$ their sample standard deviation. But again, since the expected difference under the null hypothesis will be zero anyways, we can abbreviate the formula.
A t-distribution is very similar to a standard normal distribution and approaches a normal distribution the higher the sample size gets (similar to what we encountered when discussing the central limit theorem). The formula of the PDF of the t-distribution is a little complicated and deriving it would take a separate tutorial for sure. We will therefore not cover all the details here. The formula of the PDF of a t-distribution (matching the R function we define below) goes as follows:

$$f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$$
As you can see, the formula looks rather nasty. The variable $\nu$ (nu) stands for the degrees of freedom (df = n - 1 for a simple mean comparison), and $\Gamma$ denotes the gamma function (available in R as gamma()).
We didn't discuss Bessel's correction in detail, but one way of understanding it is as representing the degrees of freedom, where the 1 refers to the one relation/parameter looked at (the difference in the mean). A linear regression model on the other hand operates with two parameters that are evaluated: the intercept and the regression coefficient, such that the degrees of freedom become $n-2$ in that case.
Let us think of two numerical examples: say we have taken a sample from a population and end up with a set of values of 10, 20 and 30 ($n = 3$, mean $= 20$).
Our first numerical example is special in the sense that the value of the mean (here: 20) is itself entailed as a value of the set. Still, given the mean and any two of the three values, the third value is fully determined and no longer free to vary; hence df = n - 1 = 2.
The same holds for a set of values that does not entail its own mean: as soon as the mean is fixed, only $n-1$ of the values can be chosen freely; the last one is dictated by the mean.
To formulate a definition of the degrees of freedom in maybe more common words: the concept of "freedomness" refers to the number of values that can vary and are not dictated by parameters such as the mean themselves.
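A minimal sketch of the first numerical example in R: once the mean and two of the three values are known, the remaining value is fully determined (no longer free to vary):
# Known: sample size, mean, and two of the three values:
n = 3
mean_x = 20
known = c(10, 30)
# The last value is dictated by the mean:
last_value = n*mean_x - sum(known)
# [1] 20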
To wrap this up, we will now look at a t-distribution with several different degrees of freedom and check if it is true that the higher the df (and therefore indirectly the sample size $n$), the closer the t-distribution gets to the standard normal distribution, which we plot via our prob_dens(x,mean=0,var=1) function:
# PDF of a T-Distribution:
# df = n-1 # or 2 for two estimates/relations
t_distr = function(x,df){
t_1 = gamma((df+1)/2)/(sqrt(df*pi)*gamma(df/2))
t_2 = (1 + (x^2/df))^(-(df+1)/2)
t_distr = t_1*t_2
return(t_distr)
} # End of function
# Initialize Sequence:
seq = seq(-4,4,by = .1)
# Plot standard normal and t-distribution
# with df = 1, df = 2, df = 10, df = 100:
plot(x=seq,y=prob_dens(seq,0,1), type ="l")
lines(x=seq,y=t_distr(seq,1),col="red") # df 1
lines(x=seq,y=t_distr(seq,2),col="green") # df 2
lines(x=seq,y=t_distr(seq,10),col="blue") # df 10
# Here you can see that given a df = 30 the t-distribution
# gets very close to standard normal distribution, hence
# the z-score is used, given n>30:
lines(x=seq,y=t_distr(seq,30),col="lightblue") # df 30
# Moving to df = 100 the light blue line will essentially overlap
# with the standard normal PDF
plot(x=seq,y=prob_dens(seq,0,1), type ="l")
lines(x=seq,y=t_distr(seq,100),col="lightblue") # df 100
So again, we haven't covered Bessel's correction in full, but we should now understand it in the sense of degrees of freedom and see how a growing $n$ (and therefore a growing df) lets the t-distribution approach a standard normal PDF.
At last, looking at the plot above we can eventually see why it is said that given a df of over 30 and given the population sd, one can use the z-score, since the t-distribution approaches the standard normal distribution. Below you will find a plot with df = 100, resulting in a diminishing difference:
In other words: if we want to make sure to correct for a bias, then in the case of a sufficiently high number of samples / df it makes only little difference compared to working with the regular old standard normal PDF. There is more behind it mathematically and the number n > 30 is still somewhat arbitrary, but we hope this gave you enough information to make it plausible why the sample size dictates which kind of test score is to be used.
summary(lm(y~x))
Before we discuss the t-test for a linear model, we are going to go through the complete output of the function summary(lm(y~x))
. However, this part can be considered optional, if you want to get right into testing the slope and intercept of a linear model.
In general, the output of summary(lm(y~x))
extends the output of lm(y~x)
and, e.g., performs a t-test and several other things on our linear model. Note that the summary()
function can be used for several purposes. Its output depends on the type of input (check ?summary()
for a documentation of the function; note that documentations for R functions are sometimes a little hard to read and sometimes incomplete). However, we will only use it together with the lm(y~x)
function for now.
Let us first use the second synthetic data set from Inferential Statistics II to look at the output of summary(lm(y~x))
, in order to get a basic overview of what’s ahead:
### Second synthetic data set from Inf. Stat. II
# Lady Meow drinking tea ^^
# Cups of Tea:
cups = c(0, 1, 2, 3.5, 4.8, 5.2, 6, 6.9, 8.5, 9.1, 9.9, # run 1
0, .6, 2.6, 3.1, 4.8, 6.9, 7.1, 7.5, 8.5, 9.9, 9.9, # 2
0, .2, 2.9, 2.9, 4.9, 5.2, 6.9, 7.1, 7.3, 9, 9.8) # 3
# Time passed for each run of measurements (cups), abbreviated:
time = c(0:10,0:10,0:10)
# Summary(lm(y~x)) => y under the condition (~) of x
summary(lm(cups~time))
Executing the line summary(lm(cups~time))
will give us the following console output:
The first line of the output just shows the model formula (the call). The next part looks at the residuals from the perspective of quantiles, which includes the median. This part can also be recreated via:
# Residual quantiles, median and mean:
summary(residuals(lm(cups~time)))
We have used the function residuals(lm(y~x)) before in the second part of this series. The formula that the function residuals() uses simply subtracts the fitted value of our model from the data value of the dependent variable, i.e., $y_i - \hat{y}_i$ (observed minus predicted).
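A quick check of this in R, using the Lady Meow model from above (fitted() returns the predicted values of the model):
# Residuals are just observed minus fitted values:
model = lm(cups~time)
all.equal(as.numeric(residuals(model)), cups - as.numeric(fitted(model)))
# [1] TRUE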
The part "Coefficients" entails the output of the lm() function, i.e., the estimates for the optimal intercept and slope, extended by their standard errors, t-values and p-values in summary(lm(y~x)). The above output only entails a t-test though (we will also only cover the t-test for a linear model).
The next part of the R output concerns the “Significance Codes” and just entails a legend for the number of stars that are associated with several p-value thresholds. Three stars means that the p-value is approaching zero and therefore very small.
The last segment of the console output entails a couple of values, such as the F-statistics, again our p-value (of the slope) and others. These are additional values to get a grip on the quality of our model, which we will focus on in the sixth part of this tutorial series. The only values that will not be covered in this tutorial are the multiple and adjusted R^2 and the F-statistics. In other words: after this tutorial we will have nearly completely replicated the output of summary(lm(y~x))
by just using mathematical formulas, whoop!!
We are now approaching the end of this tutorial. Our last topic will be hypothesis testing the slope and the intercept of a linear model. After this last chapter we will have replicated most of the output of summary(lm(y~x))
! Let’s get to it:
Our data set: First of all, because Lady Meow is adorably predictable in her tea drinking behavior, the p-value we would receive for the slope of our optimal linear model is so small that R will just output a plain 0. We therefore decided to use another data set this time. We chose the integrated data set "mtcars", which can be loaded into the environment via data("mtcars")
. Loading the data set will make it possible to call columns of that data set via “mtcars$…”:
We intentionally chose a column of the data set "mtcars" that leads to a model with a very high p-value, such that it will be above our significance threshold of 5%! For the independent variable we just created a sequence from 1 to length(column). In general, let us go abstract and just ignore the meaning of the variables (cars are evil anyways) and let us look at the plot right away:
data("mtcars")
dep = mtcars$disp
indep = c(1:length(dep))
plot(indep,dep)
abline(a=alpha,b=beta) # alpha and beta are computed further below
Let us look at the output of summary(lm(y~x))
, where we can clearly see that with a p-value of 0.905 we can't discard the null-hypothesis that the slope is essentially zero and that there is no conditional linear relation between the variables, or only very little change in the dependent variable when the independent variable changes:
# The p-value is 0.9047, suggesting no significant
# result, given a slope of -0.2913, a very low
# slope, barely different to 0:
compare = summary(lm(dep~indep))
# For more detailed numbers look at:
compare$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 235.528226 45.597118 5.1654192 1.460149e-05
# indep -0.291294 2.411567 -0.1207903 9.046625e-01
# Our optimal linear model:
beta = cov(indep,dep)/var(indep)
# [1] -0.291294
alpha = mean(dep)-beta*mean(indep)
# [1] 235.5282
fx = alpha+beta*indep
Let us first replicate the p-value of the slope of our linear model:
We mentioned it before, but the t-value of the slope follows the very same scheme as before: the estimate (our optimal beta) is divided by its standard error and the result is then related to a t-distribution.
The formula for the standard error of the slope looks a little complicated at first, but we will get there. Note that the standard error of the slope is in nuce just again sqrt(var(beta)) and its form below just results from rearrangements. The formula says: the standard error of the slope equals the standard error of the residuals divided by the square root of the sum of squared deviations of $x$ from its mean:

$$SE_{\beta} = \frac{\sqrt{\frac{\sum_i (y_i-\hat{y}_i)^2}{n-2}}}{\sqrt{\sum_i (x_i-\bar{x})^2}}$$
You may wonder where the classic sqrt(n) has gone in this formula. Here it comes: the numerator above is the standard error of the residuals/errors or "residual standard error" (the term used in the R output; SSE = sum of squared errors), i.e., $\sqrt{SSE/(n-2)}$. It is essentially just our formula for calculating the sum of squared errors, which we discussed in the second part of this series, averaged over the degrees of freedom: so it represents the standard deviation of the errors/residuals. Without taking the square root, the above is also called the mean squared error (MSE) or variance of the errors.
Another way to represent the standard error of the slope is via the following formula (the second formula below is our written-out first approach from above):
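Using the fact that the sum of squares of the centered x-values equals (n - 1) times var(x), a sketch of such a representation is:

SE(\hat{\beta}) = \frac{s_e}{s_x\sqrt{n-1}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2/(n-2)}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}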
Here we can clearly see how a relation between the standard deviation of the residuals and the spread (variance) of the independent variable determines the standard error of the slope: the wider the x-values are spread out, the smaller the standard error of the slope becomes.
There is yet another way of calculating the standard error of the slope via an alternative approach of calculating the sum of squared errors/residuals (SSE) [2]:
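This is the form of the SSE that the variable SumR2_alt2 in the code below uses (total sum of squares of y minus beta times the sum of the cross products):

SE(\hat{\beta}) = \sqrt{\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2 - \hat{\beta}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{(n-2)\sum_{i=1}^{n}(x_i-\bar{x})^2}}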
Here we have the parameter beta itself within the function and the denominator still represents a centralization. However, the formula is now completely brought into the realm of variability under the perspective of sum of squares.
We hope this helped you to understand the rather tricky formula for the SE of the slope. We still have trouble giving this part a detailed intuition, but we will get back to it in the future (feel free to comment or contact us if you have a good take on this!).
Below you will find the code to replicate the above and to evaluate the p-value in comparison with the summary(lm(y~x))
function, leading to equivalent output values (whooop!!):
# Sum of square residuals:
SumR2 = sum((dep-fx)^2)
# [1] 475953.3
SumR2_alt = sum(as.numeric(residuals(lm(dep~indep))^2))
SumR2_alt2 = sum((dep-mean(dep))^2) - beta*sum((dep-mean(dep))*(indep-mean(indep)))
all.equal(SumR2,SumR2_alt)
# [1] TRUE
all.equal(SumR2,SumR2_alt2)
# [1] TRUE
# Residual SE:
residual_standard_error = sqrt(SumR2/(length(indep)-2))
# [1] 125.9568
compare
# => Residual standard error: 126
# summary(lm(y~x)) rounds the value in the output,
# but since we will get to equivalent results we can
# safely assume that the output value is just rounded.
# Standard error of the slope for t-distribution:
se_denom = sqrt(sum((indep - mean(indep))^2))
se_slope_t = residual_standard_error/se_denom
# t-value:
t_value = beta/se_slope_t
# [1] -0.1207903
# p-value for two-tail == to standard output of summary(lm(y~x)):
2*integrate(t_distr, df = length(indep)-2, lower = -Inf, upper = t_value)$value
# [1] 0.9046625
# Alternative via basic R function pt():
2*pt(t_value, df = length(indep)-2)
# [1] 0.9046625
# t-value and p-value is equivalent with the output of
# summary(lm(y~x)):
compare$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 235.528226 45.597118 5.1654192 1.460149e-05
# indep -0.291294 2.411567 -0.1207903 9.046625e-01
We did it!! We replicated the most important part of the hypothesis test on a linear model: the slope. Let’s move on to test our intercept:
We mentioned briefly before that testing the intercept is often not as important as testing the slope. Under the null hypothesis a t-test of the intercept asks if the crossing point with the y-axis is equal to zero, i.e., the expected intercept is set to 0.
In the case of Lady Meow, it actually makes sense that the intercept is around zero, since we argued she sleeps 14 h and is observed for 10 h per day, starting with 0 cups and 0 hours. IMPORTANTLY: In this example we would not want to discard the null hypothesis, since an insignificant p-value for the intercept would make sense. And it is true: looking back at the output we discussed in the introduction of this tutorial using our Lady Meow data set, the p-value of the intercept is insignificant and we can also see that our alpha of 0.2318 is very low and close to zero. In the case of Lady Meow this is totally plausible. Let us look at the R output using our second synthetic Lady Meow data set:
In comparison, when looking at the output for “mtcars” we can see that our intercept is way above 0 and we get a p-value way below a threshold of 5% (i.e., we can discard the null hypothesis that the intercept is 0). Note that this output again makes sense when related to the null hypothesis of a slope of zero: with a flat line the fitted values would just scatter around the mean of the dependent variable, which here is clearly far above zero.
So in general keep in mind: the p-values of the slope and the intercept do not both have to be significant for a model to be considered “good”; it is also a question of plausibility.
Below we see the detailed output of a t-test on “mtcars”:
# summary(lm(y~x)):
compare$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 235.528226 45.597118 5.1654192 1.460149e-05
# indep -0.291294 2.411567 -0.1207903 9.046625e-01
Below you will find another example of a simplified model with a slope of zero. Imagine you wanted to measure some random change and things just never changed when you measured. The x-axis could again be the time and the y-axis could refer to some growth with a starting value of 5 that remains constant (note that the intercept will then also be equivalent to the mean of the set of y-values):
# Example of a model with slope of zero:
x = c(0:10)
y = rep(5,11)
plot(x,y)
abline(h=5)
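As a quick check, using our estimators from above (the names beta0 and alpha0 are only chosen here so that we do not overwrite the “mtcars” model):

# Coefficients of the constant example via our formulas:
beta0 = cov(x,y)/var(x)         # cov(x,y) is 0, since y never varies
# [1] 0
alpha0 = mean(y)-beta0*mean(x)  # intercept equals mean(y)
# [1] 5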
What we are aiming at is that the test for the slope and the test for the intercept are not completely apart from each other: a slope of zero will automatically be related to an intercept that equals the mean of the data itself, such that alpha = mean(y) - 0*mean(x) = mean(y).
Now that we have solved the conceptual part of interpreting a p-value of an intercept, let us get to the formula. It is again the same spiel as before: the expected intercept will be 0, so we can abbreviate the formula:
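Just as with the slope, the hypothesized value of 0 drops out of the numerator:

t = \frac{\hat{\alpha} - 0}{SE(\hat{\alpha})} = \frac{\hat{\alpha}}{SE(\hat{\alpha})}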
The standard error of the intercept however looks a little messy again. Note again: in essence the standard error of the intercept is just sqrt(var(mean(y)-b*mean(x))) (the formula for the alternative way to obtain the intercept, discussed in the third part of this series). Rearrangements and including the SSE will lead to the form below. Brace yourselves:
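This is the formula that the variable SE_a in the code below implements:

SE(\hat{\alpha}) = \sqrt{\frac{SSE}{n-2}}\cdot\sqrt{\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}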
The formula is really hard to read, even though the general message is the same as always: it is just the standard deviation (standard error) of the intercept estimate, such that the t-value again divides the estimate by its standard error.
Below you will find code to calculate the above and to replicate the summary(lm(y~x))
output of our “mtcars” example:
# Standard error of the intercept:
SE_a = sqrt(SumR2/(length(indep)-2))*sqrt((1/length(indep))+((mean(indep)^2)/sum((indep-mean(indep))^2)))
# [1] 45.59712
# t-value intercept:
t_value = alpha/SE_a
# [1] 5.165419
# p-value intercept:
2*(1-(integrate(t_distr, df = length(indep)-2, lower = -Inf, upper = t_value)$value))
2*(1-pt(t_value,df = length(indep)-2))
# [1] 1.460149e-05
We did it!! Let us update our linear_least_square() function that we wrote in part II of this series.
As a short sidenote, we recommend the second chapter (p. 13) of the open access stats book “Understanding Statistics and Experimental Design” by Herzog, Francis and Clarke (2019). In the third part of this series, we hinted at one point that the variance/sd can also be understood as the precision of an estimate of a mean, gathered from sampling. Using the example of a sonar sensor of a submarine, the chapter introduces the t-test as a method to distinguish signal (mean difference) from noise (standard deviation), e.g., distinguishing rock from ‘no rock’, given a noisy channel. The smaller the noise, the higher the precision.
The linear_least_square() function
Below you will find the code for a complete update of our linear_least_square() function from the second part of this series, producing output equivalent to summary(lm(y~x)) for everything we have gone through so far. Note that the actual lm() function has a bunch more options than the code written below. Our function again just serves an educational purpose. Have fun with it!
# Go-go-gadgeto linear_least_square!!!!
linear_least_square = function(indep,dep){ # Start of function
# Evaluating coefficients for an optimal linear model
# given a set of dependent and independent variables:
beta = cov(indep,dep)/var(indep)
alpha = mean(dep)-beta*mean(indep)
fx = alpha+beta*indep
# Sum of Squared Errors/Residuals:
SumR2 = sum((dep-fx)^2)
# Residual Standard Error/Deviation:
residual_standard_error = sqrt(SumR2/(length(indep)-2))
# Standard error of the slope for t-distribution:
se_denom = sqrt(sum((indep - mean(indep))^2))
se_slope_t = residual_standard_error/se_denom
# Standard error of the intercept for t-distribution:
se_a = sqrt(SumR2/(length(indep)-2))*sqrt((1/length(indep))+((mean(indep)^2)/sum((indep-mean(indep))^2)))
# t-value of the slope (adding the tiny constant exp(-32)
# avoids dividing by a standard error of exactly 0):
t_value_b = beta/(se_slope_t+exp(-32))
# t-value of the intercept:
t_value_a = alpha/(se_a+exp(-32))
### p-value of the slope via integrating the PDF of a t-distribution
# up to the t-value calculated above:
t_distr = function(x,df){
t_1 = gamma((df+1)/2)/(sqrt(df*pi)*gamma(df/2))
t_2 = (1 + (x^2/df))^(-(df+1)/2)
t_distr = t_1*t_2
return(t_distr)
} # End of function
# Two-Tail P(t=T|H_0):
p_b_2t = 2*integrate(t_distr, df = length(indep)-2, lower = -Inf, upper = t_value_b)$value
### p-value of the intercept:
# Two-Tail P(t=T|H_0):
p_a_2t = 2*(1-(integrate(t_distr, df = length(indep)-2, lower = -Inf, upper = t_value_a)$value))
# Results for two tail
Results_a = c(round(alpha,4), round(se_a,4), round(t_value_a,4), p_a_2t)
Results_b = c(round(beta,4), round(se_slope_t,4), round(t_value_b,4), p_b_2t)
Res_data = as.data.frame(rbind(Results_a,Results_b))
colnames(Res_data) = c("Estimate","Std. Error","t value","Pr(>|t|)")
rownames(Res_data) = c("Intercept", "Reg. coeff.")
# Nice output using cat() function:
cat(" Linear least square method in R","\n","\n",
"Independent variable:", "\t", deparse(substitute(indep)),"\n",
"Dependent variable:", "\t", deparse(substitute(dep)),"\n","\n",
"alpha", "\t",alpha,"\n",
"beta","\t",beta, "\t","SumR2", "\t", SumR2, "\n","\n")
print(Res_data)
cat("\n","Residual Standard Error:",round(residual_standard_error),
"on", (length(indep)-2), "degrees of freedom","\n","\n")
# Let us also plot our results:
# We will also use deparse(substitute(x)) for the
# labels of our plot axes.
plot(x=indep,y=dep, ylab = deparse(substitute(dep)),
xlab = deparse(substitute(indep)))
abline(a=alpha, b=beta, col = "darkblue")
} # End of function
# Test:
linear_least_square(indep, dep)
# Linear least square method in R
#
# Independent variable: indep
# Dependent variable: dep
#
# alpha 235.5282
# beta -0.291294 SumR2 475953.3
#
# Estimate Std. Error t value Pr(>|t|)
# Intercept 235.5282 45.5971 5.1654 1.460149e-05
# Reg. coeff. -0.2913 2.4116 -0.1208 9.046625e-01
#
# Residual Standard Error: 126 on 30 degrees of freedom
For comparison, here again the output of summary(lm(dep~indep))
:
It is done, we have successfully replicated most parts of the summary(lm(y~x))
function, whoop-whoop!! In the next part of this series we will look at further measures, such as the effect size and the F-statistic, and we will take a closer look at confidence intervals and confusion matrices.