Intermediate Stat-o-Sphere
Review: This is a pre-release article and we are currently looking for reviewers. Contact us via [email protected]. You can also directly send us your review, or use our open peer-review functions via pubpub (create an account here). In order to comment, mark some text and click on “Start discussion”, or just comment at the end of an article.
Corresponding R script:
Summary:
Welcome to the fourth part of our series! In this part we are going to focus on several kinds of probability functions that we need to understand in order to be able to finally perform a z- and t-test.
The goal of this tutorial is to introduce you to the basic probability functions needed to perform and understand the process behind a z- and t-test. We will actually do most of the math of a z-test already, as you will see. Note that we will cover t-distributions in the next part of this series, along with the z- and t-test.
We will start with simple uniform distributions, in order to completely demystify what a probability density function actually is. However, the focus of this tutorial will lie on Gaussian bell-curved normal distributions. We will also take a look at the semantics of the term “normal” and how its use in probability theory differs from the everyday cultural use of the word “normal”.
Fig. Famous meme on normal and paranormal distributions. :D Origin is unknown to us. Image was obtained from Starecat.
Apart from that, we will learn how to integrate a Gaussian function and what z- and t-tables are. These tables can be used to skip the step of actually integrating a normal distribution, which was a common procedure when performing a z- or t-test by hand before computers became sufficiently powerful. Since we are living in the 21st century, we will also provide R code to create a z-table for yourself.
We again recommend making use of the contents function (upper right corner) to reorient yourself in the script. Since we are approaching higher spheres, we will not provide summaries at the end of every chapter, but we still (as always!) provide a downloadable summary above!
Fig. After scrolling down until the title of an article disappears, the contents function becomes available. It can help you to navigate within the script. It is included in every article you will find in our journal.
Note that we are not going to talk about linear modelling in this tutorial. All examples will only concern regular distributions of data, which is equivalent to dropping the independent variable and only focusing on the (mean of the) dependent variable. We will still occasionally use Lady Meow as an example, but if we do so, we will only look at the average amount of tea Lady Meow drank (mean of a sample) and will not consider an independent relation to time! In the next part of this series, we will discuss the z-test and t-test for both cases: regular distributions and hypothesis testing for linear models!
Until now we haven’t talked much about probability functions, so first we have to untangle some terminology. However, we already know what a discrete and a continuous distribution is: in the first part of this series, we worked with discrete and mostly binary (categorical) probability distributions (e.g., theta and its complement, or ‘heads or tails’ as variables; we didn’t represent them as a function of several trials though). Our linear model in the second part of this series, however, was a continuous function, but it did not consist of probabilistic values on either the x- or the y-axis (but of estimates of measurements).
Next we are going to go through three special types of probability functions: the probability mass function (PMF), the probability density function (PDF), as well as the cumulative distribution function (CDF) of a PDF (in nuce, a CDF is a function of possible p-values; no worries, we will get there and it is far simpler than it sounds).
There are also simple probability functions, such as a likelihood function, binomial functions and others. However, we will stick with the above three for now. You may also have heard of the maximum likelihood method and estimate. We will get to it in further parts of this tutorial series and then explicitly discuss regular probability functions.
In case the term likelihood appears confusing up front: note that the classic p-value obtained via integrating a probability density function (PDF) also represents a likelihood, but the PDF itself is not a simple likelihood function, it is a probability density function. Again, no worries, it is all much easier than it sounds, promised. Especially the concept of a density instantly makes sense when looking at a uniform distribution.
So, for the sake of simplicity, and to really get hold of some basic differences between PMFs, PDFs and CDFs, we decided to start with a uniform distribution before approaching the standard normal probability density function, which is commonly used to evaluate a z-value (only slightly different from a t-value) and subsequently a p-value. This will take a few steps, but we promise that it is totally worth it. As mentioned, especially the meaning of the concept “density” (y-axis) can be confusing and is easier to understand given a uniform distribution first (e.g., via a six-sided dice with a probability of 1/6 for each value).
However, you can of course skip this part if you want to get right into Gaussian functions.
For our discrete uniform distribution we can operate similarly as we did in the third part of this series, where we discussed the weighted arithmetic mean: we can obtain the probability of each dice value occurring by dividing 1 by the number of possible dice values, i.e., p(x_i) = 1/6 for a fair six-sided dice.
# Probability mass function for p(x_i) = 1/6:
dice = c(1:6)
px = 1/length(dice) # 1/n == for uniform probability distr.
# Written as vector of length(dice_value_x):
px = c(rep(1/length(dice),length(dice)))
# Plot discrete probabilities:
plot(x=dice,y=px, ylim= c(0,1), xlim = c(0,7))
Our plot will look like this and is just one small step away from representing a probability mass function!
Fig. Probabilities for discrete events, i.e., the occurring dice values.
Mathematically, a probability mass function can now be defined as follows: if the dice value x lies within the range of possible values (here 1 to 6), then p(x) = 1/6; for any other value of x, p(x) = 0.
Using R, we can translate the above into a function using the conditional operators if() and else if(), as well as a for-loop, to check every possible value and calculate its probability. This means that our function will also output a probability of 0 for input values that lie outside the range of possible dice values.
The below function is also written to work with any number of dice values, treating them all as uniform or fair dice. This may not always correctly correspond to real circumstances, thinking of, e.g., a uniform 19-sided dice, but mathematically expanding the number of sides via n_sides works just fine.
CAVE: Being able to understand logic operators, such as if and else, is a universal skill that can be applied to any kind of programming language (mostly only a slightly different “syntax” is used, i.e., a different set of signs for programming, so to speak).
# Probability mass function for fair dice with n_sides:
dice_px = function(x,n_sides){ # x = sequence / range of possible values
dice_val = c(1:n_sides)
px = x # initialize output vector of length(x)
for(i in 1:length(x)){ # := for every object of the vector x
if(x[i] < min(dice_val)){
px[i] = 0
}
else if(x[i] >= min(dice_val) & x[i] <= max(dice_val)){
px[i] = 1/n_sides
}
else if(x[i] > max(dice_val)){
px[i] = 0
}
} # End for i
# Plot as probability mass function:
plot(x=x,y=px, ylim = c(0,1))
for(i in 1:n_sides){
segments(x0=dice_val[i],y0=0,x1=dice_val[i],y1=1/n_sides, col = "blue")
} # End for i
# Return px in console:
return(px)
} # End of function
# Let us test our function:
seq = c(0:7)
px = dice_px(x=seq,n_sides=6)
# 0.1666667 == 1/6
# [1] 0.0000000 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
# [8] 0.0000000
# Plot discrete probabilities only:
plot(x=seq,y=px, ylim = c(0:1))
Fig. Our probability mass function for the sequence 0 to 7 in blue, given a fair six-sided dice. Think of the lines as an indication of a lack of width. In the case of a PMF, the height of the lines (y-axis) is equivalent to the probability p(x) (i.e., f(x) == probability of x). Confusingly, this is not the case for the continuous density of a probability density function, but we will see that there is a simple reason why the y-axis of a density plot is hard to interpret and especially not equal to the corresponding probability of the value of x (the x-axis).
To wrap this up: in the case of a PMF, the y-axis refers to the probability of an event occurring and the x-axis to the corresponding event (in our case an event is a single dice value of a six-sided dice). However, different from “regular” probability functions, and different from our probability mass function, the y-axis (density) of a continuous density function does not refer to the corresponding probabilities, due to rather simple matters of integration, as we will see (a PDF sort of blurs the meaning of the y-axis).
You might have also come across examples regarding functions where the y-axis refers to the number of occurrences of a certain value of x, i.e., to frequencies (a histogram) instead of probabilities.
Let us say we rolled our dice 600 times and ended up with the below (idealized) uniform distribution. The formula that lets us translate frequencies into probabilities is simply p(x_i) = n_i / n, i.e., the number of times a value occurred divided by the total number of rolls (e.g., 100/600 = 1/6).
Fig. The far-left y-axis refers to the number of times an event x occurred, i.e., its frequency within the 600 rolls.
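To make the translation from frequencies to probabilities concrete, here is a minimal sketch in R; the 600 rolls are simulated via sample(), so the counts are an assumption for illustration and not the exact data behind the figure:
# Simulate 600 rolls of a fair six-sided dice:
set.seed(1) # makes the simulation reproducible
rolls = sample(1:6, size = 600, replace = TRUE)
# Absolute frequencies per dice value:
freq = table(rolls)
# Relative frequencies, i.e., p(x_i) = n_i / n:
px_hat = freq/length(rolls)
px_hat # each value will be roughly 1/6 = 0.1666667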
So now, what can we do with a PMF, apart from calculating the probability of one certain dice value occurring? Say we want to roll a 3, a 4 or a 5 for some reason during a game where one six-sided dice is involved. What is the probability that we roll either a 3, a 4 or a 5? A PMF can help us evaluate the probability of that “range of discrete values”. The terms “area” and sometimes also “range” are commonly used for continuous functions: the integral of a function as the “area under the curve”. Just keep that in mind!
Fig. Modified plot of our PMF.
In order to evaluate the probability of rolling a 3, 4 or a 5, we can simply sum the respective probabilities:
# Probability of rolling a 3, 4 or 5, i.e., a range so to speak:
px_3to5 = px[3]+px[4]+px[5]
px_3to5 = sum(px[3:5])
# [1] 0.5
We could also add the mean and the standard deviation of our uniform PMF, but it would not add much to the story.
Let us now directly move to the topic of a uniform probability density function (continuous).
Defining a PMF was rather easy. As mentioned, the relation between the value of the y-axis and the x-axis totally makes sense there, since the y-axis just refers to the probability of each individual x (f(x) == p(x)).
Let us first use basic R functions to plot a uniform probability density function via dunif(). Mathematically, we will partially follow a video on uniform PDFs from the channel “The Organic Chemistry Tutor”.
# Plot Uniform PDF using basic R functions:
x = seq(0,7, length = 10000) # try length 50 and plot again!
# Uniform density for each x:
y = dunif(x, min = 1, max = 6)
# Plot uniform PDF:
plot(x,y, type = "l", ylim = c(0,1))
Fig. Our uniform probability density function (continuous). CAVE: Note that the height of our density function, the value of y or f(x), does not correspond to the discrete probabilities of our PMF anymore! We will find out in a bit why this is the case.
The definition of a uniform PDF goes as follows and is not that far away from the definition of our PMF, as we will see:
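In the notation of the unif_PDF() code further below, with a as the minimum and b as the maximum, the definition reads:
f(x) = \begin{cases} \frac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{otherwise} \end{cases}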
To understand why the density is defined as 1/(b − a) within the interval from a to b, recall that the total area under a probability density function always has to sum to 1.
Since we are working with a simple rectangle, we can obtain the area of that rectangle simply by multiplying its width (b − a) by its height f(x).
Fig. Modified plot of our PDF. Note that the area under the curve will also sum to 1!
In case you wonder: it may sound weird, but yes, a continuous uniform function can be referred to as a curve in mathematics.
So, now we know that the width is (b − a) = 6 − 1 = 5 and that the total area under the curve has to sum to 1. The height of our rectangle therefore has to be 1/(b − a) = 1/5 = 0.2, which is exactly the density we see in the plot above.
Moving forward to PDFs, we also already know that the area under our uniform curve (the total integral) has to sum to 1, and this will turn out to be the reason why the density is not directly related to the probability, while the integrated area under the curve indeed is. In other words, the fact that the integral has to sum to 1 dictates the values of the y-axis in a rather abstract way only. Still not clear? Let’s go through it step by step:
Let’s write down what we know about a uniform PDF, this time using a for the minimum and b for the maximum: the area, i.e., width times height, has to equal 1, so (b − a) * f(x) = 1.
Doing some very, very simple rearrangements will get us to the already mentioned definition of a uniform probability density function. Below you will also find some code to play around with:
# Uniform PDF
unif_PDF = function(a,b,x){
fx = x # initialize output vector of length(x), so we just use x
for(i in 1:length(x)){ # := for every object of the vector x
if(x[i] < a){
fx[i] = 0
}
else if(x[i] >= a & x[i] <= b){
fx[i] = 1/(b-a)
}
else if(x[i] > b){
fx[i] = 0
}
} # End for i
return(fx)
} # End of function
# Test:
test = unif_PDF(a = 1,b = 6,x = x)
plot(y = test, x = x, type = "l", ylim = c(0,1))
Different from our PMF, we can now determine a certain area under the curve concerning infinitely small values in-between discrete values. NOTE: We are now leaving the dice example as a graspable “real example” for the purpose of mathematical abstraction, since a “continuous but bounded dice” would mean that our dice ranges from values between 1 and 6 and at the same time actually has infinitely many sides in-between that range of values. Of course, we are leaving the realm of simple facts here.
However, this problem is actually somewhat addressed in the math, as we will see: the integral of one specific single point c turns out to be zero, since a single point has no width (we will calculate this case further below).
However, regardless of how realistic rolling a dice with infinite sides may be, let us again determine the area between 3 and 5. Our chosen values 3 and 5 can also be understood as the lower and upper bound of our desired integral (not the same as the min./max. of our whole curve). In the plot below we will just call the variables of our desired area x_1 and x_2.
Fig. Modified plot of our uniform PDF. The purple area indicates our area of interest: P(x_1 <= X <= x_2).
In order to calculate the area of interest, we can simply use the formulas we already know, again treating our uniform curve as a rectangle. Doing some rearrangements will result in a simple ratio between areas under the curve (see last line below):
# Area under the curve via multiplication:
upper_bo = 5 # set upper bound
low_bo = 3 # set lower bound
a_min = 1 # set minimum
b_max = 6 # set maximum, else f(x) = 0 anyways...
unif_area = (upper_bo-low_bo)*(1/(b_max-a_min))
# [1] 0.4
# Integration of a uniform prob. density function via ratio
# area of interest divided by total area:
# P(x_1<=X<=x_2):
P_area = (5-3)/(6-1)
# [1] 0.4
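As a cross-check, the same result can be obtained numerically with the built-in functions integrate() and punif() (a minimal sketch; both functions are part of base R):
# Numerical integration of the uniform density between 3 and 5:
integrate(dunif, lower = 3, upper = 5, min = 1, max = 6)$value
# [1] 0.4
# Same result via the built-in uniform CDF punif():
punif(5, min = 1, max = 6) - punif(3, min = 1, max = 6)
# [1] 0.4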
You may wonder why the PMF of a six-sided dice gives a probability of .5 for the same range, while the PDF gives .4: the discrete sum counts three whole values (3, 4 and 5) out of six possible values, whereas the continuous integral from 3 to 5 only covers a width of 2 out of the total width of 5.
Let us now check for the special case where the area of interest is equal to the total area of our uniform PDF, which should result in 1:
# P(x_1=a<=X<=x_2=b) := area of interest == total area:
P_area = (b_max-a_min)/(b_max-a_min)
P_area = (6-1)/(6-1)
# [1] 1
As mentioned already, when asking for the probability of one specific individual value of x, say x = 5, the result will be 0, since a single point has no width:
# P(c<=X<=c) := probability of only one point of, e.g., x = 5:
P_area = (5-5)/(6-1)
# Results in 0/5 = 0
# [1] 0
For completion we will now quickly calculate the standard deviation and the mean as well. The formula for the sd of a uniform distribution deviates a little from the formula we used in part three of this series, but we will not go into detail here about why this is the case. As said, we do this just for completion, since we will not need uniform distributions any longer (we will get back to them when discussing Bayesian regression models some time in the future):
# Mean for uniform distribution:
a = 1
b = 6
# For uniform distribution you may come across this formula:
mean = (a+b)/2
# .. but we can also use our regular old mean:
# mean = sum(dice)/length(dice):
mean = sum(dice)/length(dice)
mean = mean(dice)
# The standard deviation for uniform distributions:
# sd() does not deliver equivalent results:
sd = sqrt(((b-a)^2)/12)
# Add mean and sd to plot:
plot(y = test, x = x, type = "l", ylim = c(0,1))
abline(v = mean)
abline(v=mean+sd)
abline(v=mean-sd)
Fig. Resulting plot including the mean and sd of the above uniform distribution.
The cumulative distribution function of a uniform PDF (and of PDFs in general) is simply explained, as it just reflects the integral of the PDF. We have therefore already looked at the values of a CDF, so to speak (also check out this tutorial, which I partially followed). The CDF is called cumulative, as it accumulates (sums up) the probabilities the further it integrates the PDF.
We could also integrate from -Inf to +Inf, but for uniform distributions we can simply integrate from min. to max. The CDF can be formally defined as follows and is often denoted with a capital F, i.e., F(x), in contrast to the PDF f(x):
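Matching the unif_CDF() implementation below, the definition reads:
F(x) = \begin{cases} 0 & \text{for } x < a \\ \frac{x-a}{b-a} & \text{for } a \le x \le b \\ 1 & \text{for } x > b \end{cases}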
The above formula says: if x is smaller than a, then F(x) = 0; if x lies between a and b, then F(x) = (x − a)/(b − a); and if x is greater than b, then F(x) = 1. In R:
unif_CDF = function(a,b,x){
Fx = x # initialize output vector of length(x): just use x
for(i in 1:length(x)){ # := for every object of the vector x
if(x[i] < a){
Fx[i] = 0
}
else if(x[i] >= a & x[i] <= b){
Fx[i] = (x[i]-a)/(b-a)
}
else if(x[i] > b){
Fx[i] = 1
}
} # End for i
return(Fx)
} # End of function
# Initialize a sequence of possible x:
x = seq(0,7,by = 1)
# Fx:
Fx = unif_CDF(a = 1, b = 6, x = x)
# Plot CDF:
plot(x = x, y = Fx)
Fig. Plot of our CDF. The y-axis refers to the probability and the x-axis to the value of x up to which we integrated. The plot only shows some discrete points of the CDF of our continuous PDF in this case.
As you may have expected, the probability rises linearly the more we integrate, until we have integrated the whole density function, numerically summing to 1. The x-axis elegantly does not refer to the density but to our variable x itself, which makes a CDF much easier to read than the corresponding PDF.
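If you want to double-check our hand-written unif_CDF() against base R, here is a minimal sketch using the x and Fx from the code above:
# Compare our unif_CDF() results with the built-in punif():
all.equal(Fx, punif(x, min = 1, max = 6))
# [1] TRUE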
Now that we have an idea what a PMF, PDF and CDF is, given a simple uniform distribution, we can proceed to probably the most commonly used form of functions: the bell-curved Gaussian function, i.e., the (standard) normal probability (density) distribution.
Next, we are going through the same steps as above, just not with a uniform, but a (standard) normal function.
You may wonder why bell curved standard normal distributions are so popular and important in the first place?
You may even feel conflicted hearing the word “normal”?
It may even appear to you as some dark, mystical power, influenced or even controlled by a secret political agenda? :O
To get settled and leave some uncertainties behind, we will first go through some conceptual basics around the famous Gaussian bell function, in order to understand how this special function is related to the world and the way we perform statistical inference.
One way to approach the concept of “normal distributions” is by treating it as an evident phenomenon, namely: the regression to the middle. And yes, this is why it is called a regression model! Linguistically, “regression” comes from Latin and means “going back” or “going backwards”. So, in general, the concept of regression to the middle first of all refers to a process of “going back to the middle”. Mathematically, the middle refers to the mean of a distribution. A classic example is the (population) mean of the height of humans. Have you ever wondered why there aren’t any people just randomly growing to a height of 4, 6 or 9 meters? One way to answer this question is by saying: regression to the middle, i.e., the further away a measurement is from the average, the more unlikely such a measurement will be (until its probability approaches zero).
Note that the Gaussian function depicted below refers to a regular probability function, not a probability density function yet!
Fig. Abstract example of a normal probability distribution (not a PDF!), representing the relation between a certain random data value x and the probability of measuring it.
Below you will find an actual example from real data via the website “Our World in Data”:
Fig. From the website “Our World in Data”. Whatever the exact reasons for the “restrictions to the middle” are on the biological level, they will always manifest themselves as shadows within distributions of measurements. We are also aware that the binary categorization is controversial. The categories we decide on are of course influencing the outcome and overall perspective of a trial/experiment. Especially the concept of gender follows a debatable deterministic pars-pro-toto approach: from chromosomes to the organism as a whole, which includes a huge categorical change in the scale of complexity we refer to: from a chromosome to a cell, to tissue and organs… up to cognition/experience and behavior. In nuce: such a complex determinism does not work out. However, measuring the height and observing the growth of organisms is still an important classic in biology for several reasons, so we chose height as an example anyway.
It is hard to answer why such principal restrictions concerning, e.g., the height of an organism, exist in the first place. In general, most brief answers will probably refer to some concept of environmental adaptation, and to some degree even to boundaries dictated by physics as such. However, no matter whether you know the exact reason, such effects/boundaries still often represent themselves in the form of a bell-curved function.
Historically, the concept of a regression (to the middle) was introduced by Francis Galton, a cousin of Charles Darwin, in the 19th century, initially in the context of the genetic phenotype of peas, as well as in the context of heredity and height (our example above). However, Galton did not discover the mathematical concept behind such an effect, i.e., the normal distribution, but he figured out that it can be used to empirically describe and predict nature’s behavior to some degree.
The take-home message here is: the mathematical method of regression analysis is, interestingly, partially rooted in nature itself. Whether you like it or not, normal distributions are not just a concept, but also a phenomenon to some degree (e.g., humans don’t randomly grow to a height of 10 m, and even if one did, it would happen very rarely).
Still, keep in mind that:
a) Not every phenomenon can be described by a normal distribution, as it is not the only principle at work in statistical inference and nature, so to speak. In other words: there are also other kinds of distributions (logistic, binomial etc.). Below we will also see that normal distributions are a common effect of multiple sampling (a mathematical effect). In other words: a statistical model does not represent reality by itself (e.g., due to bias, confounders etc.). This representational power is something that Galton (and also Pearson) completely overestimated, and by doing so they influenced the grim idea of eugenics. There is a great article by the science magazine “Nautilus” discussing how eugenics shaped statistics in its early stages.
Also keep in mind that:
b) The normal distribution is distinct from the concept of a “norm” in the psycho-social (or cultural) sense. The deviation of height, for example, did not result from a conceptual (!) intention set by human minds/cognition (such as cultural norms). Being “normal” is not even necessarily a goal in that respect, such that some random cultural ideal might even be the improbable case (e.g., very tall people, if considered an ideal)…
Fig. Karl Pearson (1857-1936) and Francis Galton (1822-1911), pioneers of statistical inference with a grim history of influencing eugenic ideals.
So keep in mind: facts like the average height, or any other value of some arbitrary measurement, can still be instrumentalized to, e.g., opportunistically argue against the worth of others out of some self-interest. It is therefore important to distinguish between the descriptive norm, e.g., of people’s height, and the “normative norm”, which not only involves a description of how things are, but arbitrarily sets how they are supposed to be in the first place (mere opinion, which from a cultural perspective often represents itself as, e.g., tradition).
Apart from examples in biology or certain phenomena in general: another, mathematical reason why Gaussian functions play such a great role in (frequentist) statistics is the so-called central limit theorem. The central limit theorem states that if we take several samples from a population with any kind of distribution and take the mean of each sample, we will get a normally distributed set of means (!!) over time. It may sound crazy, but it’s true and can be demonstrated with R! See the next chapter for details and code.
Again, the central limit theorem states that if we take several samples from a population with any kind of distribution and take the mean of each sample, we will get a normal distribution of means (!!) over time; in other words: the more samples we take (and the larger our sample size n), the closer the distribution of sample means gets to a normal distribution.
Note that in this chapter we will mostly follow this R tutorial on statology.org. I can also recommend this StatQuest tutorial by Josh Starmer on that topic (we can always recommend StatQuest!) and this video by 3blue1brown (Grant Sanderson), which was just recently published and which covers a lot of other topics discussed in this tutorial as well!
Here is some code to demonstrate the effect of the central limit theorem in R, given a uniform distribution. In the downloadable R script of this tutorial we did the same with a logarithmic function, to show that this does not only hold for sampling from uniform distributions (which are, e.g., also symmetric).
# Initialize data via runif, which will
# give us a random but uniformly distributed
# set of data. Each time you run this line, the result will
# be another set of data points.
# Use set.seed(0) or another value to make results reproducible
# (we will not get further into that topic here):
data = runif(n = 1000, min = 1, max = 10)
# We can also plot our data as histogram
hist(data, col = "lightblue")
# Use a for loop to take 5000 random samples of size n = 5 each:
n_samp = 5000
sample_n_5 = c() # create empty vector
# The below function will also take the mean of the sample and
# will store the output in our vector sample_n_5:
for (i in 1:n_samp){
# Store mean of each sample with length 5:
sample_n_5[i] = mean(sample(data, 5, replace=TRUE))
} # End for i
# Look at the histogram of the sample_n_5:
hist(sample_n_5, col = "lightblue")
Fig. On the left we see a histogram of a uniform distribution of values of x ranging from 1 to 10. The y-axis refers to the frequency of each value of x (total number of data points: n = 1000). On the right we see the distribution of the means of the samples taken from that uniform distribution. The y-axis again refers to the frequency of each sample mean (x-axis) occurring. We can see that even though we sampled from a uniformly distributed set, the histogram of the means of our samples clearly tends towards a normal distribution.
As a consequence of the central limit theorem (the distribution of a bunch of sample means tends towards a normal distribution over time), we can use the normal distribution right away for (frequentist) hypothesis testing (integrating a standard normal PDF). The rule of thumb for relying on the central limit theorem is a sample size of at least 30; otherwise, it is recommended to work with a t-distribution, which is very similar to a Gaussian. We will get there in the next part of this series, no worries.
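To get a first visual impression of the difference mentioned above, here is a minimal sketch comparing the standard normal PDF with a t-distribution; the choice of 5 degrees of freedom is arbitrary and just for illustration:
# Standard normal PDF (blue) vs. t-distribution with df = 5 (red):
z_seq = seq(-4, 4, by = .01)
plot(x = z_seq, y = dnorm(z_seq), type = "l", col = "blue")
lines(x = z_seq, y = dt(z_seq, df = 5), col = "red")
# The t-distribution has heavier tails; with growing df
# it approaches the standard normal distribution.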
However, the above only works in the case of independent random sampling and finite variance. Note that there can be cases where the independence breaks apart, e.g., due to some global event: let us say the tea market breaks down (a kind of bias is introduced by that) and Lady Meow is not able to get hold of any fresh tea. Other cats will then also have the problem of getting hold of some tea, so the overall mean of tea drunk during a specific time will suddenly change and we will obtain mean values that were previously thought to be rather rare…
We hope chapter 4 and 4.1 have given you a good overview of the mathematical and conceptual importance and the meaning of working with normal distributions and how to distinguish the use of the term “normal” from the term / concept of a “norm”.
Next we will move on to the mathematical description of normally distributed functions as such, which are also referred to as Gaussian functions.
The formula for a basic Gaussian function is actually fairly simple, and we will look into why the formula below results in the respective plot (we will partially follow the reasonably well-written intro of the Wikipedia article on Gaussian functions for this part (December 2022)):
# Gaussian base form:
gaussian_fun = function(x){
fx = exp(-x^2)
return(fx)
} # End of function
# Plot of basic Gaussian:
seq = seq(-4,4,by = .1)
fx = gaussian_fun(seq)
plot(x = seq, y = fx)
Fig. Plot of the Gaussian base form f(x) = exp(-x^2).
You can probably identify the above function as bell curved, but how did this actually result from the equation?
In general, the number e (Euler’s number, approx. 2.718) is just a constant serving as the base of the exponential function, so exp(-x^2) is simply an exponential function with a negated and squared exponent.
However, before we get into the special role of e, let us first look at what squaring (and negating) the exponent actually does:
# Recall a similar effect when squaring the deviation of
# x_i-mean(x) and looking at the plot:
x = c(0:10)
plot(x = x, y = (x-mean(x))) # deviation x_i from mean(x)
abline(h=0)
plot(x = x, y = abs(x-mean(x))) # absolute value of x_i-mean(x)
plot(x = x, y = (x-mean(x))^2) # plot of squared deviation
# Recall: summing the squared deviation results in
# the total sum of squares...
Fig. Recap of evaluating the structure of the total sum of squares before actually summing. The plot shows very well what squaring actually does to the deviations of x_i from the mean: all values become positive and the initially linear relation turns into a smooth parabola.
Again: squaring turned a linear function with negative values into a smooth quadratic parabola with positive values only. Recall that centralizing the values of x was the first step towards obtaining the total sum of squares / variance / sd (relatable to a parametric Gaussian, see further below). We know that a parabola can have a maximum, similar to a Gaussian function. We also already know that only a negated parabola such as -x^2 has a maximum instead of a minimum.
Think of the step of squaring the exponent of a regular exponential function exp(-x) as performing a similar process as above with our initially linear centralized values: squaring the exponent of an exponential function turns a steady relation into a (at least in its middle part) rather parabolic relation of values, with either a minimum in exp(x^2) or a maximum in exp(-x^2).
We can also plot exp(-abs(x)), which shows how a proper Gaussian function incorporates a parabola-like exponent: taking the absolute value already produces a symmetric, bell-like shape, and squaring additionally smooths out the kink at the peak.
Fig. Plot of exp(-abs(x)), which is already very close to a classic Gaussian, showing a similar effect as we observed before when working with the centralized values.
# Plot of exp(-abs(x)):
test_abs = function(x){
fx = exp(-abs(x))
return(fx)
} # End of function
# Recall that we will use the absolute value of a sequence
# consisting of values ranging from -4 to 4 (by a step of .1):
plot(x = seq, y = test_abs(seq), type = "l")
As mentioned, in the case of a regular (non-parametric) Gaussian, we could even use a different base, and we will still get a Gaussian form. Below we will use base 2.
# It does not even matter that we use the number e as base:
test_gen = function(x){
fx = 2^(-x^2)
return(fx)
} # End of function
plot(x = seq, y = test_gen(seq), type = "l")
Even though the number e is not strictly necessary to obtain a bell-curved shape, exp(x) is the standard choice of base. For completion, here are some optional facts on why exp(x) plays such an important role in mathematics and physics. Up front: it is essentially convenience, as the following three quantities are all identical, namely e^x:
a) The value of the function at a certain point x, i.e., f(x) = e^x.
b) The slope at a certain point x, since the derivative of e^x is again e^x, i.e., f'(x) = f(x).
c) The integral up to a certain point x (integrating from -Inf), since the antiderivative of e^x is also e^x (and e^x approaches 0 for x approaching -Inf).
Demonstration of all of the above in R:
# fx_pos: the exponential function with a positive exponent:
fx_pos = exp(seq)
plot(x = seq, y = fx_pos, type = "l")
# forming coordinate system (+x,-x,+y)
abline(v = 0)
# Point at x = 0, so f(0) = y = e^0 = 1
exp(0) # Now add point to plot:
points(y = 1,x = 0) # P (0|1)
# Point at x = 1
# y = f(x) = e^x; f(1) = e^1 = e
exp(1)
points(x = 1, y = exp(1)) # P (e|1)
# Slope of tangent at a point P(f(x)|x) := f'(x) = e^x,
# since f(x) = e^x = f'(x)
# Evaluate y coordinate of P(f(x)|x):
y_coordinate = exp(1)
# [1] 2.718282 = Euler's number e.
# We can add the respective tangent at the point P(e|1),
# which has the slope f'(1) = e^1 = e. In case you wonder:
# the linear function x does not need to be defined in curve().
curve(exp(1)*x,-3,3, add = TRUE)
# Integral from -Inf to x = 1
expo_fun = function(x) {exp(x)}
integrate(expo_fun, lower = -Inf, upper = 1)
# 2.718282 with absolute error < 0.00015
# = e again.
Fig. Plot of an exponential function with a tangent at P(e|1) with the slope e (approx. 2.718).
In the next chapters, we will see that the use of e also pays off when integrating a Gaussian function, i.e., when turning it into a proper probability density function.
Next, we are going to look into the parametrized form of a Gaussian function. The formula may look complicated, but it is actually really simple:
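Written out, using the same parameter names as the code below, the parametrized Gaussian is:
f(x) = a \cdot \exp\!\left(-\frac{(x-b)^2}{2c^2}\right)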
The parameters a, b and c determine the height of the peak of the curve, the position of the peak on the x-axis, and the width of the bell, respectively.
So here we finally get to a point where we can clearly relate the mean and the sd to a normal / Gaussian distribution. Whoop!!
Looking at the mean b, we can see that it simply shifts the peak of the curve along the x-axis; the term (x − b) in the exponent again centralizes the values of x around b, just as subtracting the mean did when we calculated the sum of squares.
As mentioned, the height of the peak a can be understood as a parameter for scaling the y-axis. However, keep in mind that the above formula does not yet refer to a density, since for an arbitrary a the area under the curve does not sum to 1.
For the numeric example below, we used the above data from the height example (“Our World in Data”) and tried to figure out a distribution for humans in general.
# Gaussian with parametric extension:
seq = seq(100,220,by=.1) # height in cm.
a = .5 # height of curve peak
b = (178.4+164.7)/2 # x of peak = mean = 50%
c = ((178.4-170.8)+(164.7-157.6))/2
# width of the bell = sd
# + turning point of function!
gaussian_fun_para = function(x, a, b, c){
fx = a*exp(-(((x-b)^2)/(2*c^2)))
return(fx)
} # End of function
fx = gaussian_fun_para(seq,a,b,c)
plot(x = seq, y = fx, xlim=c(135,205), type ="l")
abline(v=b, col = "lightblue")
abline(v=b+c, lty = 2)
abline(v=b-c, lty = 2)
Fig. Plot of the parametrized Gaussian function with a = .5, b = the mean (light blue line) and c = the sd (dashed lines at b ± c).
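Note that the area under this parametrized curve does not sum to 1, so it is not a probability density yet. Here is a quick numerical check, a minimal sketch using integrate() over a finite range that safely covers the whole bell (the analytical value of this integral is derived further below):
# Area under the parametrized Gaussian (a, b, c as set above):
integrate(gaussian_fun_para, a = a, b = b, c = c, lower = 100, upper = 250)$value
# roughly a*sqrt(2*pi*c^2) = 0.5*sqrt(2*pi*7.35^2), i.e., about 9.2,
# which is clearly not 1.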
To understand the exponent of the above parametrized form of the Gaussian, we will play around a bit with the formula to see what each value actually does. R offers a lot of possibilities, e.g., plotting, to heuristically explore functions by visualizing data:
Fig. The more you fuck around (independent), the more you find out (dependent variable). Meme by Roger Skaer.
Here is some code where we manipulated the formula, including the respective plots. Below you will find a corresponding figure with the formulas that were used for each plot. Note up front that the first and the second test in the R code below (test_1 and test_2) result in equivalent values, since c = 1 and therefore c^2 = c.
# Parameters
a=1
b=0
c=1
# Sequence
seq = seq(-4,4,by=.1)
# Test 1:5
test_1=exp(-(((seq-b)^2)/(2*c^2)))
test_2=exp(-(((seq-b)^2)/(2*c)))
test_3=exp(-(((seq-b)^2)))
test_4=exp(-(((seq-b))))
test_5=exp(-(((seq-b))/(2*c)))
# Plot of Test 1:5
plot(x=seq,y=test_1)
plot(x=seq,y=test_2)
plot(x=seq,y=test_3)
plot(x=seq,y=test_4)
plot(x=seq,y=test_5)
Fig. Plots of test_1 to test_5. The plots of the first and second test on the upper left are the same, since c = 1 and therefore c^2 = c.
You may have figured it out for yourself already, but the parametrized Gaussian probability function is structurally the same as the Gaussian or normal probability density function. The only difference is, again, that the area will now sum to 1 (we will soon learn why). The parametrized normal PDF is also referred to as the standard normal PDF when a mean of 0 and an sd (or variance) of 1 is given (sqrt(1) == 1).
What can we do with it? The standard normal PDF with a mean of 0 and an sd of 1 represents the population (or another sample) and can essentially be used to compare that population with a standardized sample distribution, i.e., with a sample distribution that has been transformed into standard normal form. More precisely, the sample mean will be translated into units of the population standard deviation: if the mean of the population is 10, the standard deviation of the population is 2 and the standardized sample mean equals +1, then we can reclaim the sample mean to be 12, since it deviates +1 population sd from the population mean of 10 (recall, we set the pop. sd to be 2 in this example).
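In code, this back-translation from a standardized value to the original scale is just one line (the numbers are the ones from the example above):
# Reclaim the sample mean from its standardized value:
pop_mean = 10
pop_sd = 2
z = 1 # standardized sample mean
sample_mean = pop_mean + z*pop_sd
# [1] 12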
To give an example of what can be achieved by comparing the means of two distributions in a standardized way: say we want to test whether a new drug changes the blood pressure of a patient. What we could do is compare the sample mean of those patients that received the new drug with the population mean, i.e., a group that did not receive the new drug (or “all other people”). However, looking at the difference of means by itself does not tell us much. Say we end up with a difference in systolic blood pressure of 5 mmHg. Is that much? Looking back at part three of this series, you may ask: couldn’t 5 mmHg just be the standard deviation of the mean? In other words, could it be that such a difference is quite common when sampling patients’ systolic blood pressure? The difference in means by itself does not tell us how to assess such a difference. Standardization will help us to evaluate such individual differences by transforming individual values of measurements into a general relation. Again, this means that the deviation of the sample mean from the population mean will be looked at in units of the standard deviation of the population.
This is essentially what the formula of a z-test does: translating the sample mean such that it becomes relatable to a Gaussian PDF. However, translating the sample mean into a standardized scheme by itself does not tell us enough either, since it remains unclear how likely such an obtained difference may turn out to be. We also need to integrate our PDF up to our standardized sample mean in order to obtain a probabilistic perspective, as we will see further below. So again, different from a regular probability function, we cannot just read out the probability by looking at the y-axis, since we are handling a (standard normal) probability density function. Note that the process of standardization is only slightly different from what is called studentization, using a Student’s t-test, i.e., a t-distribution (similar to a standard normal PDF, see further below). The standard normal PDF is also referred to as the z-distribution. You may have already come across the terms t-score or z-score before.
Before we get all crazy with hypothesis testing, and since integrating a Gaussian is not that trivial, let us first go through the formula of a Gaussian PDF step-by-step, especially in order to understand how a parametric Gaussian function becomes a density function. German readers, depending on their age, may have come across one of these in their past:
Fig. The Gaussian function is named after Carl Friedrich Gauß (1777-1855). We will not go into historic details here, especially as Gauß played such a great role in so many fields of mathematics, physics and beyond that it is truly hard to sum up his influence in a few sentences. Gauß also established the least squares method, by the way! The image above shows the effigy of Gauß on the old German 10 Mark bill, including a graph and the general parametrized formula of the normal PDF.
The general parametrized formula for the normal (non-standardized) probability density function that follows the mean and the variance of a distribution is denoted as follows (recall that sigma-squared is essentially the variance; the formula intentionally deviates from the formula from the 10 Mark bill):
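Written out, with mu denoting the mean and sigma^2 the variance (this is exactly what the prob_dens() function further below implements):
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)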
The left part of the formula is spoken out as: one divided by the square root of two times pi times sigma squared. It acts as a normalization factor, scaling the height of the curve such that the total area under it becomes 1.
However, we will stick with the non-simplified formula, in order to fully understand what it actually says and for what it can be used.
To understand the basic structure of the formula, recall that our uniform density function was denoted 1/(b − a), i.e., 1 divided by the total width, which guaranteed that the total area sums to 1. The normalization factor of the normal PDF plays an analogous role, as we will see when integrating further below.
The reason that the number pi appears in this factor will become clear when we integrate a Gaussian further below: the integral of the basic Gaussian exp(-x^2) turns out to be sqrt(pi).
Again, the so-called standard normal distribution is defined as having a mean of 0 and a variance of 1. Below you will find some code for the above. You can also plug in a different mean and variance, but recall that the result will then no longer be a standardized PDF:
# General formula for a probability density function (PDF).
# Insert mean = 0 and var = 1 to obtain the values for input x
# given a so-called standard normal distribution:
prob_dens = function(x,mean,var){
fx = (1/sqrt(2*pi*var))*exp((-((x-mean)^2))/(2*var))
return(fx)
} # End of function
# In case placing () in the above becomes an issue when
# writing a function, you can also decompose the formula:
prob_dens_decomp = function(x,mean,var){
fx1 = 1/sqrt(2*pi*var)
fx2 = (-((x-mean)^2))/(2*var)
fx = fx1*exp(fx2)
return(fx)
} # End of function
# Initialize sequence:
x = seq(-4,4,by=.01)
# Given standard normal distribution:
norm_PDF = prob_dens(x, mean = 0, var = 1)
# Using basic R function dnorm():
norm_PDF_alt = dnorm(x, mean = 0, sd = 1)
# Check for equality:
all.equal(norm_PDF,norm_PDF_alt)
# [1] TRUE
# Plot
plot(x=x,y=norm_PDF)
plot(x=x,y=norm_PDF_alt)
Fig. Modified plot of our standard normal probability density function. The x-axis is scaled in units of the population sd, given a mean of 0.
Note that the x-axis now no longer refers to raw measurement values, but to deviations from the mean in units of standard deviation (i.e., to what we will later call z-scores).
We noticed previously, when discussing the parametric Gaussian function, that changing the width parameter c (the sd) stretches or compresses the bell; the same holds for the variance in our PDF. Let us plot the output of our prob_dens() function using a variance of 1 and a variance of 3 to see the effect:
As we figured out in part three of this series, the value of the variance by itself does not tell us much about the quality of a function, but the comparison with other functions does (see below). The sd, on the other hand, can directly be related to the mean and delivers a value that we can interpret descriptively right away (turning points of the function).
Fig. Plot of two Normal PDFs with variance = 1 (blue) and variance = 3 (red).
# Plot with var = 1 and var = 3
plot(x=x,y=prob_dens(x,mean = 0,var = 1), type="l", col = "blue")
# The lines() function adds another curve to an existing plot:
lines(x=x,y=prob_dens(x,mean = 0,var = 3), col = "red")
Again: the formal reason for the above deformations of the Gaussian is of course the fact that the sd/variance appears in the exponent (taking the role of our former parameter c), as well as in the normalization factor of the PDF.
Note that most of the time we will operate with equal variances in this tutorial, so the standardized sample distribution will have the same form as the standard normal PDF, only the mean will deviate. We did not yet compare a sample distribution with a population distribution, so no worries if this note does not ring a bell yet.
Next up we will integrate Gaussian functions and finally understand a little better why pi and e appear in the formula of the normal PDF.
This chapter is rather detailed and entails some aspects that could be considered optional, so in case you have trouble understanding all of it, it does not matter too much, as long as you get the general idea of how we can turn a parametric Gaussian into a parametric Gaussian PDF. We will also recommend some video tutorials to follow in parallel.
The formula for the integral of a non-param. Gaussian goes as follows:
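For the basic Gaussian exp(-x^2) it reads as follows (the R code further below confirms this numerically):
\int_{-\infty}^{+\infty} e^{-x^2}\, dx = \sqrt{\pi}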
We already noticed that the number pi appears in the normalization factor of the normal PDF; here we can see where it comes from: the total area under the basic Gaussian exp(-x^2) is sqrt(pi).
This does not yet answer why the integral turns out to be exactly sqrt(pi), though. We will try to get at least some intuition for this below.
Below you will find some code to replicate the integral of the basic Gaussian function using the integrate() function in R. By using our gaussian_fun() function together with the basic R function integrate(), we can determine a specific area we want to integrate, e.g., the whole curve via lower = -Inf, upper = +Inf.
# Gaussian base form:
gaussian_fun = function(x){
fx = exp(-x^2)
return(fx)
} # End of function
# Using integrate function:
gaussian_integral = integrate(gaussian_fun, lower = -Inf, upper = +Inf)
# Use gaussian_integral$value to get value only, as integrate()
# returns a message and a list:
all.equal(gaussian_integral$value,sqrt(pi))
# [1] TRUE
To finally understand why the number pi shows up in this integral, we have to switch from thinking in areas to thinking in volumes.
The first thing I thought of when I saw that the integral of a non-parametric Gaussian is the square root of pi was: where is the circle? As it turns out, the circle appears once we rotate the Gaussian around its peak.
We are only going to touch the surface here, but there are a lot of videos on that matter on the internet. Here are two videos I watched to write this part of the tutorial: one in German by Weitz / HAW Hamburg, another English video by vcubingx. This article by Ryan Brideau also discusses the same topic in a similar way.
Calculating the volume can also be written as taking the integral of an integral (a double integral), where importantly the second variable y is completely independent of x (we are essentially just adding a redundant second dimension).
In general, there are some difficulties when integrating a Gaussian (especially by hand), and there are several ways to do so. We will only dip into the topic to get some intuition and will not replicate the whole math. One way to integrate a Gaussian is by using polar coordinates instead of cartesian coordinates, eventually rotating the 2D integral around its peak, which essentially results in the volume of a 3D Gaussian (this is easy due to the perfectly symmetric structure of a 3D Gaussian, see plot further below).
The below formula may appear complicated, but think of the first integral (from 0 to 2*pi) as the rotation around the peak (360°) and of the second integral (from 0 to infinity) as covering the distance r from the center axis outwards.
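Written out, the trick consists of squaring the desired integral and then switching to polar coordinates:
\left(\int_{-\infty}^{+\infty} e^{-x^2}\,dx\right)^{2} = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} e^{-(x^2+y^2)}\,dx\,dy = \int_{0}^{2\pi}\int_{0}^{\infty} e^{-r^2}\, r \,dr\,d\theta = \pi
Taking the square root on both sides then delivers the sqrt(pi) from above.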
The part where the integrand exp(-r^2) is multiplied by r (a consequence of switching to polar coordinates) is what makes the integral solvable by hand in the first place, in contrast to exp(-x^2) alone.
To wrap this up: what we get by essentially adding a redundant dimension is a perfectly symmetric 3D Gaussian (see plot below). There are lots of videos on the internet on that topic, so we will leave it at that, and we will also not replicate the various other existing mathematical methods to integrate a Gaussian in R here (we will stick with the built-in integrate() function).
The plot below is supposed to show that rotation is easily doable, since the 3D Gaussian is perfectly symmetric.
Fig. Plot of a 3D Gaussian from two perspectives: from the upper right side (left plot) and from the top (right plot). The latter plot was modified to include arrows marking that a perfectly circular rotation is possible. Relating the previous equation to the above plot: the first integral on the right side of the equation can be related to the rotation: 360° = 2*pi rad = 1 period of a sine wave. “Rad” relates to a circle with a radius of 1, where pi radians = 180° (we will not go into further details here). The second integral can be related to the area from the center axis (the “peak axis”) out to its infinite outer realm (from 0 to infinity).
Here is a more detailed plot. It is not the best, since it was hard to integrate all the information into the above plot, but I hope you can make sense of it:
Fig. The r (orange) in the formula can be looked at as a vector with the coordinates (x,y) (light blue = x, light green = y). This vector is then rotated by 360° = 2*pi. This is done for every circle (green) the vector r can draw; we have included only two vectors as an example. So in nuce, we get the volume by slicing the 3D Gaussian up into concentric rings with a thickness approaching 0, which we integrate and sum up in order to obtain the volume of the whole (like a 3D Gaussian sliced into a histogram!).
Grant Sanderson (3blue1brown) recently published a video on all of the above, in a series that also entails the video on the central limit theorem we mentioned before. Also note that Grant Sanderson provides all the Python code for the visualizations in his videos, in case you are interested in Python as well!
However, below you will find the code to replicate the above 3D plot in R:
# 3D Plot of a Gaussian:
x = seq(-4,4,by=.1)
y = x
# We will not go into full detail why we need the outer product for z,
# but the result is a matrix and z needs to be related to two
# coordinates: x and y.
# However, outer() allows to write the function of our 3D Gaussian
# within the outer() function:
z = outer(x, y, function(x,y)(exp(-(x^2+y^2))))
# The persp() function lets us plot in 3D; adjust theta and phi
# in order to change perspective:
persp(x = x,y = y,z = z, theta =20,phi=20)
# From the top:
persp(x = x,y = y,z = z, theta = 0, phi = 90)
The integral of the parametric form looks a little messier, but it is fairly simple when decomposed.
The next part is a short replication of parts of the Wikipedia article on Gaussian functions. It explains how our parameters a, b and c relate to the integral of the parametrized Gaussian.
Upfront: the value of the integral of the parametrized Gaussian turns out to be a*sqrt(2*pi*c^2).
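Written out, this is:
\int_{-\infty}^{+\infty} a \, \exp\!\left(-\frac{(x-b)^2}{2c^2}\right) dx = a\sqrt{2\pi c^2}
The code below checks this numerically, using the values of a, b and c that are still set from the previous code chunks: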
# Integration using our parametrized Gaussian function:
gaussian_fun_para = function(x, a, b, c){
fx = a*exp(-(((x-b)^2)/(2*c^2)))
return(fx)
} # End of function
gauss_para_integ = integrate(gaussian_fun_para,a=a,b=b,c=c,lower=-Inf,upper=Inf)
all.equal(gauss_para_integ$value,(a*sqrt(2*pi*c^2)))
# [1] TRUE
So, now we know why we use 1/sqrt(2*pi*sigma^2) as the normalization factor of the normal PDF: setting the height a to 1 divided by the total area sqrt(2*pi*c^2) (with c^2 corresponding to the variance sigma^2) makes the total integral equal to exactly 1.
You may wonder why
We now have all of the knowledge to finally calculate the whole integral of a normal PDF, and therefore nearly all the knowledge we need to calculate the infamous p-value, which only concerns a specific integrated area under the curve… Note that below we will just integrate from -Inf to +Inf and over half of the curve, i.e., we will not perform an actual z-/t-test in this tutorial.
Integrating is essential to obtain probabilistic values from our PDF. However, we are not testing anything yet. Testing would additionally involve integrating up to a specific value (a z- or t-value). We will get there soon.
A short way of integrating our standard normal PDF in R is again by using our prob_dens() or the built-in dnorm() function together with the R function integrate(). In case you wonder about the code below: the integrate() function has a special structural feature in that it takes a function such as prob_dens as input, and the parameters of that function (mean = and var =) can be passed as additional arguments within integrate(), see below:
# Integrate stand. norm. probability density function:
integrate(prob_dens, mean = 0, var = 1, -Inf, +Inf)
# Result:
# 1 with absolute error < 9.4e-05
# ... or via using dnorm(), now with the parameters "mean =" and "sd =":
integrate(dnorm, mean = 0, sd = 1, -Inf, +Inf)
# Result:
# 1 with absolute error < 9.4e-05
However, the shortest way may be by using the built-in pnorm() function, which evaluates the p-value given a normal distribution:
# We could also use the function pnorm():
p_value_Inf = pnorm(Inf,mean = 0,sd = 1)
# [1] 1
# p-value at the mean = 0,
# given a mean of 0 and sd of 1:
p_value_pnorm = pnorm(0,mean = 0,sd = 1)
p_value = integrate(prob_dens, mean = 0, var =1, -Inf, 0)$value
# [1] 0.5
# Test for equality:
all.equal(p_value_pnorm,p_value)
# [1] TRUE
We can now go on and calculate and plot the respective normal CDF of our normal PDF, which cumulatively adds up the p-values of the respective z-scores (the values on the x-axis). We will discuss the z-score and its formula in the next chapter, but so far we already know that it is just a way to translate any value on the x-axis into a standardized form, representing a deviation from the mean in units of standard deviation (the mean itself being 0). The function below entails a for-loop that calculates the p-value for every value of our sequence (-4 to 4, by .1) via the integrate() function:
# CDF of a normal PDF:
norm_CDF = function(seq, mean, var){
# Initialize vector length(density)
Fx = seq
for(i in 1:length(seq)){
Fx[i] = integrate(prob_dens, mean = 0, var = 1, lower = -Inf, upper = seq[i])$value
} # End for i
# Plot result:
plot(x=seq, y = Fx, type = "l")
return(Fx)
} # End of function
# Test our function and plot our CDF (code for plotting is entailed
# in function above already):
seq = seq(-4,4,by=.1)
norm_CDF(seq,0,1)
Fig. Compared to the CDF of a uniform distribution we now get a smoothed out sigmoid (s-shaped) function as a result.
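As with the uniform case, we can double-check our hand-written norm_CDF() against base R's pnorm(); a minimal sketch (a small tolerance is used, since integrate() works numerically):
# Compare our norm_CDF() with the built-in pnorm():
all.equal(norm_CDF(seq, 0, 1), pnorm(seq, mean = 0, sd = 1), tolerance = 1e-6)
# [1] TRUE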
We added the possibility to adjust the mean and variance in the prob_dens() function, but note that this is not the same as evaluating the previously mentioned z-score of our sample in order to perform the complete standardization. However, evaluating the z-score of a given set of values is fairly easy, since its formula has nearly the same structure as the exponent of the normal PDF: we subtract the mean and divide by the standard deviation.
Obtaining a t-score is again very, very similar to obtaining a z-score, as we will see. So we will start with the z-score, perform a z-test and then move on to the t-test.
So far we have looked at the (standard) normal PDF as such and have neither performed any actual standardization of data nor any hypothesis testing. Up front: testing in a z- and t-test will just mean comparing two means in order to evaluate whether they deviate enough that we can argue that they likely belong to two different distributions and are not just samples from one and the same distribution (recall the blood pressure example from further above: the difference in means by itself does not tell us much).
So far, it was rather easy to integrate a uniform distribution, and doing so helped us to understand the formula of our normal PDF, which, as we figured out, requires some deeper mathematical insight to integrate. The fact that integration here involves a complicated process is also important for error control: a computer scientific solution always indirectly aims to keep (human) calculation errors low. Computers give us the opportunity to operate with complex math without doing any calculations “by hand” or via cognition at all. This is of course also dangerous, since we do not need to actually understand a z-test in order to execute some lines of code; of course, scientists should be able to understand and explain more than just some lines of code.
This is one of the reasons we chose to create thorough tutorials for you before coming up with heuristic quick guides of some kind, as it is unfortunately rather common to use statistical methods without actually understanding them properly. People with a merely heuristic approach also get easily agitated when one asks basic questions, which is sad, since it means you may have to justify yourself for having knowledge instead of them justifying a lack of knowledge. Note that this indeed has the character of a classic opportunistic gate-keeping tactic and especially explains why statistical analysis is often not well performed in papers. Simply talking about p-hacking etc. is not enough to tackle attitude-driven science.
However, by using modern computer software such as R, the focus can lie on the concept and the formula, and a little less on the representational realization of an algorithm (calculating by head and hand). As you have hopefully figured out by now: this makes the mathematics intellectually a lot more graspable for us and also allows for smooth experimentation and various forms of visualization of mathematics/statistics…
Apart from a computer scientific solution for our integral problem, you may already be familiar with t-tables, or with the standard normal table, also called a z-score table (we will replicate a z-score table in R further below!!). These tables deliver the integral up to a certain bound value (z-/t-score), so it does not have to be calculated by hand (which was rather daunting for most of the 20th century)! In other words: t-/z-tables were the error control of the 20th century, when computers were not as common and powerful as they are today. You might have also guessed it already, but yes, a z-table is essentially just a CDF in tabular form (see further below).
Fig. Evolution of mathematical tools, from the abacus to calculators to programs such as R. Certain methods are dictated by a kind of principle of least action, i.e., they are popular and widely used because they are easy to calculate by hand or can be performed with only a few simple tools. Now that we are living in the 21st century, technology casually allows us to flexibly play around with a wide variety of highly sophisticated and even open access software tools that can be run on probably any of today's computers. Original photos by Recha Oktaviani and Crissy Jarvis.
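As a first taste of the z-table we will replicate in more detail later on: in essence, such a table is just pnorm() evaluated on a grid of z-scores, with rows for the first decimal and columns for the second decimal of the z-score (the layout below is one common convention, chosen here for illustration):
# Minimal z-table via pnorm():
z_rows = seq(0, 3.9, by = .1)   # first decimal of the z-score
z_cols = seq(0, .09, by = .01)  # second decimal of the z-score
z_table = outer(z_rows, z_cols, function(row, col) pnorm(row + col))
rownames(z_table) = format(z_rows, nsmall = 1)
colnames(z_table) = format(z_cols, nsmall = 2)
round(z_table[1:4, ], 4) # show the first few rows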
Let us now look at the formula of the z-score, which is used to standardize any value of x: the z-score is the deviation of x from the population mean, divided by the population standard deviation, i.e., z = (x − μ)/σ.
So, keep in mind, we are still not at the stage of actual testing, but we will at least be able to convert any numerical value of a measurement onto the scale of a standard normal PDF. This is again done for convenience and comparability of our data under a generalized (standardized) perspective:
IMPORTANT NOTE: With this formula, not only is the location and scale of the values adjusted (similar to what plugging a different mean and variance into our prob_dens() function did), but the values themselves are transformed such that they fully correspond to a standard normal PDF with a mean of 0 and an sd of 1 and can therefore be properly compared with a population or another sample. The integral up to a certain standardized value (z-score) then directly delivers the corresponding cumulative probability.
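Here is a minimal sketch of what this standardization does to a set of values; the sample values and population parameters below are made up just for illustration:
# Hypothetical sample and (assumed known) population parameters:
sample_x = c(3, 5, 5, 6, 8)
pop_mean = 5
pop_sd = 2
# Standardize every value of x via the z-score formula:
z_scores = (sample_x - pop_mean)/pop_sd
# [1] -1.0 0.0 0.0 0.5 1.5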
The figure below shows what we can do using the z-score formula. IMPORTANTLY, apart from that, you may wonder how it comes that we are comparing two distributions: the sample by itself is also considered a distribution. In fact, performing a z-test requires that the sample itself is normally distributed. We mentioned the Shapiro-Wilk test as a quick side quest in part three of this series.
Fig. On the left we see a normal PDF with a population mean of 15 (blue line) and a sample mean of 5 (orange line). On the right we see the standardized sample mean (orange line) deviating, in units of standard deviations, from the standardized population mean of 0. In case this appears confusing: the standardized population mean is 0 no matter what the actual value of the (non-standardized) population mean is, since it acts as the standardized baseline. Why all this fuss about translation again? Apart from mathematical convenience, the goal of standardization is that we can look at the sample mean from the general perspective of standard deviations, i.e., as a generalized difference from the population mean. Mathematically, evaluating the z-score translates all values of x from our sample into values of such a standardized perspective. In contrast to the formula of the z-score above, the z-test formula (also referred to as the z-statistic or z-value) does not translate all possible values of x, but the sample mean of x as a deviation from the population mean in particular. The difference between the formula of the z-score and the z-statistic is rather marginal, as we will see in the next part of this series.
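As a hedged preview of that marginal difference (fully discussed in the next part of this series): the z-statistic relates the deviation of the sample mean from the population mean to the standard error, i.e., the population sd divided by the square root of the sample size. Below is a minimal sketch, where the function name z_statistic is our own choice:
# Minimal sketch (preview only): the z-statistic of a whole sample x,
# in contrast to the z-score of a single value x:
z_statistic = function(x, pop_mean, pop_sd){
  (mean(x) - pop_mean) / (pop_sd / sqrt(length(x)))
} # End of function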
Let us do some calculations in order to figure out what the above actually means. At first we will look at an example where the sample mean of a sample of values is equivalent to the population mean. This can be looked at as a special case of the z-test. In this case we can represent the difference between sample mean and population mean just by the formula of the z-score, since the difference is zero anyway. In this special case we simply end up with the standard normal distribution itself, representing our baseline by itself. When integrating, we can represent and calculate the probability of all possible differences of the (standardized) sample mean from the (standardized) population mean. This baseline approach is essentially what a z-score table consists of, as we will see.
Fig. Lady Meow — such a lovely cat!
Numeric example: So, let us say we have knowledge of the population mean of the number of cups of tea Lady Meow drank in a whole day (say pop. mean = 5). Say we also took a sample recently, i.e., measured the tea-drinking behavior of Lady Meow again, and want to compare the sample with the population, and the sample turns out to also have a mean of 5. We will therefore end up with a difference of 0.
The integral up to a z-score of 0, representing the standardized baseline population mean, will be .5, i.e., half of the total area under the curve. By doing so, we are essentially asking how likely it is that our sample mean is lower than or equal to the population mean (later discussed as a lower-tail z-test with a 0 difference in means). Upfront: when calculating a p-value that asks for the sample mean being either lower or higher than the population mean (equality/inequality as such), we have to double the above p-value (later discussed as a two-tailed z-test). In the case of sample mean == population mean, this double-sided approach will result in a 100% chance (probability of 1) that the sample is equal to the population with a non-standardized mean of 5 (again, this is equivalent to a z-test, at least for the case where sample mean == population mean).
How would we interpret a real-life case where the result of a clinical trial turns out to show a 0 difference in means? It depends on the question we ask when performing a hypothesis test, but most of the time we want this probability to be very low, aiming for a difference in means, so that we can argue that the sample is actually different from the population. To give a medical example: when comparing sample participants that were given a medication with a population that obtained no medication, we mostly hope for a difference between these two groups, since we hope for an effective new medication that makes a significant difference compared to doing nothing at all (or to a placebo effect, which is essentially no conditional effect of a medication; claiming anything else is considered manipulative product/procedure binding). In other words, we hope to find medication that has an evident effect and deserves the name medication in the first place.
As a side note: mathematically we already know from the first chapters on uniform distributions that a PDF in general allows us to integrate any region, e.g., when asking how likely it is to obtain a sample with a mean that lies between -1*sd and +1*sd around the population mean.
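A minimal sketch of such a region integral on the standardized scale, using pnorm() (this yields the well-known roughly 68% of the probability mass within one standard deviation):
# Integrating the region between -1 and +1 standard deviations
# of the standard normal PDF:
pnorm(1) - pnorm(-1)
# [1] 0.6826895
# Between -2 and +2 standard deviations:
pnorm(2) - pnorm(-2)
# [1] 0.9544997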
Below you will find some code for the special case of a 0 difference between population and sample mean, as well as a plot of the standardization of the dependent variable of our second synthetic data set observing Lady Meow ^°^!
###### Special case of pop. mean == sample mean:
pop_cups = c(0:10)
# Population mean, variance and standard deviation:
# Recall that var() and sd() calculate the sample var/sd
# including Bessel's correction, therefore we use
# the formulas below:
pop_mean = mean(pop_cups)
# [1] 5
pop_var = sum((pop_cups-mean(pop_cups))^2)/length(pop_cups)
# [1] 10
pop_sd = sqrt(pop_var)
# [1] 3.162278
# Function for evaluating the z-score of a
# non-standardized data set:
z_score = function(x,pop_mean,pop_sd){
z = (x-pop_mean)/pop_sd
return(z)
} # End of function
# Standardized Z-Score for a particular value of x:
# Say we want to standardize a value of x that is equal
# to our mean itself, as a special case where a z-test formula
# is not needed yet:
z_score(pop_mean, pop_mean, pop_sd)
# [1] 0
# Integrating will reasonably result in
# 0.5, i.e., half of the area under the curve:
pnorm(0)
# [1] 0.5
# For a two-tailed test, checking on equality/inequality:
2*pnorm(0)
# [1] 1
# Let us now look at the z-score for our
# value of mean+/-sd in pop_cups:
z_score((pop_mean-pop_sd), pop_mean, pop_sd)
# [1] -1
z_score((pop_mean+pop_sd), pop_mean, pop_sd)
# [1] 1
z_score((pop_mean+2*pop_sd), pop_mean, pop_sd)
# [1] 2
# Worked!
#### Plot standardization of the dependent variable of
#### second synthetic data set observing Lady Meow
# Second synthetic data set:
# Cups of Tea:
cups = c(0, 1, 2, 3.5, 4.8, 5.2, 6, 6.9, 8.5, 9.1, 9.9, # run 1
0, .6, 2.6, 3.1, 4.8, 6.9, 7.1, 7.5, 8.5, 9.9, 9.9, # 2
0, .2, 2.9, 2.9, 4.9, 5.2, 6.9, 7.1, 7.3, 9, 9.8) # 3
# Time passed for each run of measurements (cups), abbreviated:
time = c(0:10,0:10,0:10)
# Plot corresponding normal and standard norm. PDF:
# Normal PDF (non-standardized) of our data:
seq = seq(-10,20,by=.1)
plot(x=seq,y=prob_dens(seq,mean(cups),pop_var),type ="l")
# Translating all possible values on the x-axis of our
# normal PDF into the format of a standard(!) normal PDF
# (note that z_score() expects the standard deviation, not the variance):
z_seq = z_score(seq,mean(cups),pop_sd)
plot(x=z_seq,y=prob_dens(seq,mean(cups),pop_var),type ="l")
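As a purely optional cross-check (our own addition, not part of the corresponding R script): base R's scale() performs the same centering and scaling in one step, as long as we hand it the population mean and sd explicitly, since by default it would divide by the sample sd including Bessel's correction.
# Optional cross-check with base R's scale():
# By default scale() divides by the sample sd (Bessel's correction),
# so we pass the population sd explicitly to mimic z_score():
z_seq_alt = scale(seq, center = mean(cups), scale = pop_sd)
# head(as.vector(z_seq_alt)) should match head(z_seq) from above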
We can also create and use a z-table via R, which again is essentially just our CDF in tabular form. Here is some code to create a standard normal distribution table / z-table yourself (partially following this R tutorial).
# Standard normal distribution table / z-table:
# Initialize a sequence of z-scores (negative z-values
# are covered via symmetry, see below):
z = seq(0,9.99,by = 0.01)
# Calculate the corresponding p-values (CDF values) for
# all z-scores from 0 to 9.99:
p_z = pnorm(z)
# Plotting the above shows that we calculated the
# upper half of the normal CDF:
plot(x=z,y=p_z)
# Set as matrix/table:
z_table = matrix(p_z,ncol = 10,byrow = TRUE)
# Add row and column names to matrix:
rownames(z_table) = seq(0,9.99,by = .1)
colnames(z_table) = seq(0,.09,by = .01)
# Evaluate z and search the table for the corresponding p-value.
# Indexing works as [["row","col"]], i.e., z = 3.2 + 0.03 = 3.23:
z_table[["3.2","0.03"]]
# [1] 0.999381
# ...or use the following way without having to separate the digits:
# Change class to data.frame:
z_df = as.data.frame(z_table)
# Define a z-value to look up (note: we avoid naming it z_score,
# since that would overwrite our z_score() function from above):
z_val = 3.23 # only use values with up to two decimal places
# The following lines separate the first three digits into the
# ["row","col"] structure, so you don't need to do it yourself:
# substr(x,start,stop):
rowDigit = as.numeric(substr(z_val*100, 1,2))/10
# [1] 3.2
colDigit = as.numeric(substr(z_val*100, 3,3))/100
# [1] 0.03
# Execute the next line to find the respective value
# in the table:
z_df[which(rownames(z_df) == rowDigit),which(colnames(z_df) == colDigit)]
# [1] 0.999381
pnorm(3.23,0,1)
# [1] 0.999381
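If you want to reuse this lookup, the digit splitting can be wrapped into a small helper function. This is a minimal sketch under the same assumptions as above (z-values between 1.00 and 9.99 with at most two decimal places); the name lookup_z is our own, hypothetical choice:
# Hypothetical helper wrapping the z-table lookup from above:
lookup_z = function(z, tbl){
  rowDigit = as.numeric(substr(z*100, 1, 2))/10
  colDigit = as.numeric(substr(z*100, 3, 3))/100
  tbl[which(rownames(tbl) == rowDigit), which(colnames(tbl) == colDigit)]
} # End of function
lookup_z(3.23, z_df)
# [1] 0.999381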
Fig. The plot on the left shows that the z-table basically consists of the upper part of our normal CDF. On the right side you can see parts of our z-table.
Only the upper half of the normal CDF is used for the table. Due to the symmetry of Gaussian functions, 1-pnorm(z) delivers the corresponding probabilities for negative z-values, i.e., for the left half of our Gaussian (pnorm(-z) == 1 - pnorm(z)).
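A quick numerical check of this symmetry:
# Symmetry of the standard normal CDF:
pnorm(-1.96)
# [1] 0.0249979
1 - pnorm(1.96)
# [1] 0.0249979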
We can also invert our CDF into a so-called quantile function. We will not reproduce the whole math for that here, but a quantile function delivers an important perspective. We can use the qnorm()
function in R for this. In a nutshell: with the quantile function we can evaluate a specific z-score given a specific p-value or percentage (e.g., the z-score for a confidence interval, see further tutorials). In other words, it is just a “flipped” CDF, where the z-score is now located on the y-axis and the probability on the x-axis (not vice versa):
# Use the qnorm function for the inverse CDF:
# Note that we feed in the rounded p-value from the
# z-table example above (pnorm(3.23) = 0.999381):
qnorm(0.999381,0,1)
# [1] 3.229977
seq = seq(0,1,by=0.0001)
quant_func = c() # empty vector
for(i in 1:length(seq)){
quant_func[i] = qnorm(seq[i],mean=0,sd=1)
} # End for i
# Plot Quantile function:
plot(x=seq,y=quant_func, type = "l")
Fig. Plot of the inverse CDF, i.e., the quantile function. On the x-axis we have p-values ranging from 0 to 1. On the y-axis we see the corresponding z-score values (the x-axis of our norm. PDF).
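One common use of the quantile function, which we will need for confidence intervals in later tutorials, is obtaining critical z-scores for a given probability. A minimal sketch (the chosen probabilities are just common examples):
# Critical z-scores via the quantile function:
qnorm(0.975) # two-tailed 95% threshold
# [1] 1.959964
qnorm(0.95)  # one-tailed 95% threshold
# [1] 1.644854
# Side note: qnorm() is vectorized, so the for-loop above could
# also be replaced by quant_func = qnorm(seq, mean = 0, sd = 1)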
Apart from the z-score table there are also so-called t-tables. We will get to the t-test in the next part of this series, but we want to explain the main difference upfront: the t-score uses the sample variance/sd in its calculation, whereas the z-score uses the population variance/sd.
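To recall the computational difference with the pop_cups example from above (a minimal sketch; R's sd() divides by n-1, the population formula by n):
# Sample sd with Bessel's correction (divides by n-1) vs.
# population sd (divides by n), using pop_cups from above:
sd(pop_cups)
# [1] 3.316625
sqrt(sum((pop_cups - mean(pop_cups))^2)/length(pop_cups))
# [1] 3.162278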
The decision of which test to choose also involves the size of the sample:
Fig. This flow chart can be used to decide which test to choose. Of course, an R function can be written in a way that it infers the choice based on the above logic by itself. The flow chart was adapted from Statology. The constraint of n > 30 is essentially a rule of thumb, again in order to be able to argue for the use of a standard normal PDF via the central limit theorem: as we will see in later chapters, a t-distribution, which has a very similar shape to a Gaussian but entails Bessel's bias correction on the variance/sd, also approaches a normal distribution the higher n is. The choice of n > 30 remains rather arbitrary, but above that point the conceptual difference to a standard normal PDF is low, as we will see.
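A small numerical illustration of this convergence, using R's built-in t-distribution quantile function (the printed values are rounded; the chosen degrees of freedom are just examples):
# The t-distribution approaches the standard normal distribution
# the larger the degrees of freedom (roughly n-1) get. Compare
# the 97.5% quantiles:
qt(0.975, df = 5)     # ~2.571
qt(0.975, df = 30)    # ~2.042
qt(0.975, df = 1000)  # ~1.962
qnorm(0.975)          # ~1.960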
In the next part of this series we will fully go into discussing z- and t-tests, both for regular sets of values and for linear regression models. In the last chapter we essentially already touched on the topic of one-sample z-tests, i.e., comparing a sample with a population (as opposed to two-sample tests, where two samples are compared with each other). We will also thoroughly discuss what one-tailed and two-tailed tests are and when to use them. Plenty of furry examples will await you, in order to give you an overview of when and how to use certain configurations of z- and t-tests. ^^
Ms. Miranda, longing for feedback. Did any of this make sense? We would like to hear from you! Similar to our open review process, you can comment on every one of our articles. Our journal still relies on the expertise of other students and professionals. However, the goal of our tutorial collection is especially to come into contact with you, helping us to create an open and transparent peer-teaching space within BEM.