Skip to main content
SearchLoginLogin or Signup

[R] Inferential Statistics IV: Uniform and (Standard) Normal Probability (Density) Functions

Intermediate Stat-o-Sphere

Published onAug 18, 2023
[R] Inferential Statistics IV: Uniform and (Standard) Normal Probability (Density) Functions

Review: This is a pre-released article and we are currently looking for reviewers. Contact us via [email protected]. You can also directly send us your review, or use our open peer-review functions via pubpub (create an account here). In order to comment mark some text and click on “Start discussion”, or just comment at the end of an article.

Fig. How to start a discussion. A pubpub account is needed to do so. Sign up here. Also have a look at articles all about our Open Science Agenda.

Corresponding R script:



Welcome to the fourth part of our series! In this part we are going to focus on several kinds of probability functions that we need to understand in order to be able to finally perform a z- and t-test (we will actually perform most of the steps in this tutorial already). Note that we will cover t-distributions in the next part of this series, along with the z- and t-test.

We will start with simple uniform distributions, in order to completly demystify what a probability density function actually is. Plotting density functions is usually recommended to give a visual/descriptive persepctive if a normal distribution (or some other) is given in the data. However, the focus of this tutorial will lay on Gaussian bell-curved normal distributions. We will also take a look at the semantics of the term “normal” and how its use in probability theory differs from the every day cultural use of the word “normal”.

Fig. Famous meme on normal and paranormal distributions. :D Origin is unknown to us. Image was obtained from Starecat.

Apart from that, we will learn how to integrate a Gaussian function and will learn what z- and t-tables are. These tables can be used to sort of skip the process of actually performing such an integration of a normal distribution, which is a common procedure when performing a z- or t-test by hand, before computers became suffiently powerful. Since we are living in the 21. century, we will also provide R code to create a z-table for yourself.

We again recommend to make use of the contents function (upper right corner), to reorient yourself within the script. Since we are approaching higher spheres, we will not provide summaries at the end of every chapter, but we still (as always!) provided a downloadable summary above.

Fig. After scrolling down until the title of an article disappears, the content function becomes available. It can help you to navigate within the script. It is entailed in every article you will find in our journal.

Note that we are not going to talk about linear modelling in this tutorial. All examples will only concern regular distributions of data, which is equivalent to dropping the independent variable by only focussing on the (mean of the) dependent variable. We will still occasionally use Lady Meow as an example, but if we do so, we will only look at the average amount of tea Lady Meow drank (mean of a sample) and will not consider an independet relation to time! In the next part of this series, we will discuss the z-test and t-test for both cases, regular distributions and hypothesis testing for linear models!

1 Uniform PMFs, PDFs and CDFs ─ Uniform Probability Mass Functions (Discrete), Probability Density Functions (Continuous) and the Cumulative Distribution Function (Continuous)

Until now we haven’t talked much about probability functions, yet. So, at first we have to entangle some terminology. We already know what a discrete and a continuous distribution is: In the first part of this series, we worked with discrete and mostly binary (categorical) probability distributions (e.g. theta and its complement or ‘heads or tails’ as variables; we didn’t represent it as a function of several trials though). Our linear model in the second part of this series however was a continuous function, but did not consist of probabilistic values on neither the x- nor the y-axis (but estimates of measurements).

Next we are going to go through three special types of probability functions: the probability mass function (PMF), the probability density functions (PDF), as well as the cumulative distribution function (CDF) of a PDF (in nuce a CDF is a function of possible p-values — no worries, we will get there and it is far more simple then it sounds).

There are also simple probability functions, such as a likelihood function, binominal functions and others. However, we will at first stick with the above three for now. You may also have heard of the maximum likelihood method and estimate. We will get to it in further parts of this tutorial series and then explicitly discuss regular probability functions.

In case it appears confusing upfront concerning the term likelihood: note that the classic p-value obtained via integrating a probability density function (PDF) also represents a likelihood, but the PDF itself is not a simple probability likelihood function, it is a probability density function. Again, no worries it is all much easier than it sounds, promised. Especially the concept of a density instantly makes sense when looking at a uniform distribution.

So, for the sake of simplicity and to really get hold of some basic differences between PMFs, PDFs and CDFs, we decided to start with a uniform distribution, before approaching the standard normal probability density function, commonly used to evaluate a z-value (only slightly different to a t-value) and subsequently a p-value. This will take a few steps, but we promise that it is totally worth it. As mentioned, especially the meaning of the concept “density” (y-axis) can be confusing and is easier to understand given a uniform distribution first (e.g., via a six-sided dice with a p(x)=1/6p(x) =1/6 for each dice value (uniform) as example).

However, you can of course skip this part if you want to get right into Gaussian functions, if you wish!

1.1 Uniform Probability Mass Function (Discrete)

For our discrete uniform distribution we can operate similarly as we did in the third part of this series, where we discussed the weighted arithmetic mean: we can obtain the probability of each dice value occurring via dividing 1 by the length nn of our set of dice values, i.e., by 6. This will result in a uniform or constant discrete set of probabilities that sum to 1.

# Probability mass function for p(x_i) = 1/6:
dice = c(1:6)
px = 1/length(dice)  # 1/n == for uniform probability distr.

# Written as vector of length(dice_value_x):
px = c(rep(1/length(dice),length(dice))) 

# Plot discrete probabilities:
plot(x=dice,y=px, ylim= c(0,1), xlim = c(0,7))

Our plot will look like this and is just one small step away of representing a probability mass function!

Fig. Probabilities for discrete events, i.e., the occurring dice values.

Mathematically a probability mass function now can be defined as follows: If the dice value xx is smaller than 1, then the probability of such a zero dice value occurring is 0. If the dice value has a (discrete) range from 1 to 6, then the probability will always be 1/61/6 each. In the case of a value above 6, it will also result in a probability of zero. In words, the above can be read as code: if the dice value xx is either greater or equal to one and smaller or equal to six, then p(x)=1/6p(x) = 1/6 , else 0.

Using R, we can translate the above into a function using the conditional operators if() and else if() as well as a for-loop to check every possible value and calculate its probability! This means that our function will also be able to output a p(x)p(x) for counterfactual dice values such as 0 or 7 (which will always have probability of 0 in this example). Calculating a uniform distribution via 1/nsides1/n_{sides} just gave us the probability of factual values of our six-sided dice, not those that are always zero (since counter-factual). This is also useful for our code and plotting, since the input values of xx for the function below can therefore be a sequence of values below and beyond the number of possible dice values, without returning an error.

The below function is also written to work with any length of dice values, treating them all as uniform or fair dice. This may not always correctly correspond to real circumstances, thinking of, e.g., a uniform 19-sided dice, but expanding the length of xx values will become important later, when we are approaching continuous functions (entailing potentially infinite sides, so to speak).

CAVE: Being able to understand logic operators, such as if and else, is an universal ability that can be applied to any kind of programming language (mostly only slightly different “syntax” is used (= different use of signs for programing, so to speak)).

# Probability mass function for fair dice with n_sides:
dice_px = function(x,n_sides){ # x = sequence / range of possible values
  dice_val = c(1:n_sides)
  px = seq # initialize vector for loop with 1:length(x)

  for(i in 1:length(x)){ # := for every object of the vector x
    if(x[i] < min(dice_val)){ 
      px[i] = 0
    else if(x[i] >= min(dice_val) & x[i] <= max(dice_val)){
      px[i] = 1/n_sides
    else if(x[i] > max(dice_val)){
      px[i] = 0
  } # End for i

  # Plot as probability mass function:
  plot(x=seq,y=px, ylim = c(0:1))
  for(i in 1:n_sides){
    segments(x0=dice_val[i],y0=0,x1=dice_val[i],y1=1/n_sides, col = "blue")
  } # End for i

  # Return px in console:
} # End of function

# Let us test our function:
seq = c(0:7)
px = dice_px(x=seq,n_sides=6)
# 0.1666667 == 1/6
# [1] 0.0000000 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
# [8] 0.0000000

# Plot discrete probabilities only:
plot(x=seq,y=px, ylim = c(0:1))

Fig. Our probability mass function for the sequence 0 to 7 in blue, given a fair six-sided dice. Think of the lines as indication of a lack of width. The height of the lines (y-axis) is in the case of PMFs equivalent to the probability of p(x) (i.e., f(x) == probability of x). This is confusingly not the case for the continuous density of a probability density function, but we will see that there is a simple reason why the y-axis of a density plot is hard to interpret and especially not equal to the corresponding probability of the value of x (the x-axis).

To wrap this up: In the case of a PMF the y-axis refers to the probability of the event occurring and the x-axis to the corresponding event — in our case an event is cast as a single dice value of a six-sided dice. However, we will see that different to “regular” probability functions, and different to our probability mass function, the y-axis (density) of a density functions (continuous) does not refer to the corresponding probabilities, due to rather simple matters of integration (a PDF sort of blurs the meaning of the y-axis).

You might have also come across examples regarding functions where the y-axis refers to a certain number of occurrences of a certain value of xx, not a probability by itself (e.g., the number of occurrences of each dice value, often plotted in the form of a histogram). However, dividing the number of individual occurrences by the total number nn of occurrences will scale the y-axis to a percentage or probabilistic value without changing the appearance of the graph (just normalizes the values of the y-axis).

Let us say, we rolled our dice 600 times and ended up with the below (idealized) uniform distribution. Below you will find the formula that lets us translate frequencies into probabilities!

Fig. The far-left y-axis refers to the number of times an event xi​​​x_i​​ occurred (=a certain dice value was rolled). The near-left y-axis of our PMF refers to p(xi)p(x_i​). We will always be able to translate the number of occurrences into probabilities via the formula above. It essentially just divides the number of occurrences of a single event xix_i​ by the total number of all possible events nX​​n_X​​ (capital XX is referring to the whole set of all xix_i). Since we are handling a uniform function with 6 possible events (six-sided dice) and every value from 1 to 6 occurred a hundred times in this example, the total number of n events equals nX​​​n_X​​=600. A 100 divided by 600 = 1/6 = p(xi)p(x_i).

So now, what can we do with a PMF, apart from calculating the probability of one certain dice value occurring? Say, we want to roll a 3, a 4 or a 5 for some reason during a game where one six-sided dice is involved. What is the probability that we are rolling either a 3, a 4 or a 5? A PMF can help us evaluate the probability of that “range of discrete values”. Note that the term “area”, and sometimes also “range”, is commonly associated to continuous functions: the integral of a function as the “area under the curve”. Just keep in mind!

Fig. Modified plot of our PMF.

In order to evaluate the probability of rolling a 3, 4 or a 5, we can simply sum the respective probabilities:

# Probability of rolling a 3, 4 or 5, i.e., a range so to speak:
px_3to5 = px[3]+px[4]+px[5]
px_3to5 = sum(px[3:5])
# [1] 0.5

We could also add the mean and the standard deviation of our uniform PMF, but it would not add much to the story.

Let us now directly move to the topic of a uniform probability density function (continuous).

1.2 Uniform Probability Density Function (continuous)

Defining a PMF was rather easy. As mentioned, the relation between the value of the y-axis and x-axis totally makes sense, since the y-axis just refers to the probability of each individual x (xix_i). It also makes sense how to calculate the probability of rolling a value within a “range of values” occurring, since it just involves summing the individual probabilities of the events in question. However, the value for the density is unfortunately not just the probability of a discrete event occurring, but there is a simple reason why this is the case: the constrain of a PDF is that the integral of a PDF hast to sum to 1!

Let us first use basic R functions to plot a uniform probability density function via dunif(). Mathematically, we will partially follow a video on uniform PDFs from the channel “The Organic Chemistry Tutor”.

# Plot Uniform PDF using basic R functions:
x = seq(0,7, length = 10000) # try length 50 and plot again!

# Uniform density for each x:
y = dunif(x, min = 1, max = 6)

# Plot uniform PDF:
plot(x,y, type = "l", ylim = c(0,1))

Fig. Our uniform probability density function (continuous). CAVE: Note that the height of our density function, the value of y or f(x) does not correspond to our discrete probabilities of our PMF anymore! We will find out in a bit why this is the case.

The definition of a uniform PDF goes as follows and is not that far away from the definition of our PMF, as we will see:

To understand why f(x)=1/(ba)f(x) = 1/(b-a), let as recall some basic formulas in geometry.

Since we are working with a simple rectangle, we can obtain the area of that rectangle simply by multiplying bab-a by f(x)f(x). In essence this will result in the complete integral of our uniform curve. See plot below:

Fig. Modified plot of our PDF. Note that the area under the curve will also sum to 1!

In case you wonder: It may sound weird, but yes, a continuous uniform function can be referred to as curve in mathematics.

So, now we know that the width is bab - a and that the height refers to our unknown f(x)f(x). Since b-a results in 6, the above formula is not that different from the formula we used for our PMF. However, working with continuous values adds some spice to it, as we will see.

Moving forward to PDFs we also already know that the area under our uniform curve (total integral) has to sum to 1 ─ and this will turn out as reason why the density will not be related to the probability, but the integrated area under the curve indeed does. In other words, the fact that the integral has to sum to 1 dictates what the values of the y-axis will look like in a rather abstract way only. Still not clear? Let’s go through it step-by-step:

Let’s write down what we know about a uniform PDF, this time using x2x_2 and x1x_1 for a more general approach. We will treat aa and bb as the min.min. and max.max. of our range of possible values (1 to 6 for a six-sided dice)!

Doing some very, very simple rearrangements will get us to the already mentioned definition of a uniform probability density function. Below you will also find some code to play around with:

# Uniform PDF
unif_PDF = function(a,b,x){
fx = x # initialize output vector of length(x), so we just use x
  for(i in 1:length(x)){ # := for every object of the vector x
    if(x[i] < a){
      fx[i] = 0
    else if(x[i] >= a & x[i] <= b){
      fx[i] = 1/(b-a)
    else if(x[i] > b){
      fx[i] = 0
  } # End for i
} # End of function

# Test:
test = unif_PDF(a = 1,b = 6,x = x)
plot(y = test, x = x, type = "l", ylim = c(0,1)) 

Different to our PMF, we can now determine a certain area under the curve concerning infinitely small values in-between discrete values. NOTE: We are now leaving the dice example as a graspable “real example” for the purpose of mathematical abstraction, since a “continuous but bounded dice” would mean that our dice would range from values between 1 to 6 and at the same time actually has infinitely small sides in-between the mentioned range of values. Of course, we are leaving the realm of simple facts here.

However, this problem is actually somewhat addressed to in the math, as we will see: the integral of a specific point xx on our continous PDF will be actually just 0 (code later below). Only the integral of an actual area (a difference of e.g. bab-a) under a PDF will lead to a reasonable probability. This is again different to simple continuous probability function: the integral of such simple probability functions doesn’t necessarily sum to 1 (since such a regular probability function is not normalized to do so). However, in contrast to a PDF, we can read out the probability of a certain singular event from the y-axis of a regular probability function right away, without the need of any integration. So both serve their purpose, so to speak.

However, nevertheless how realistic rolling a dice with infinite sides may be, let us still again determine the area between 3 and 5. Our chosen values 3 and 5 can also be understood as a lower and upper bound of our desired integral (not the same as min. / max. of our whole curve). In the plot below we will just call the variables of our desired area x1x_1 and x2x_2 and aa and bb will be our minimumminimum and maximummaximum again.

Fig. Modified plot of our uniform PDF. The purple area indicates our area of interest: 3<=x<=53<=x<=5.

In order to calculate the area of interest, we can simply use the formulas we already know, again treating our uniform curve as a rectangle. Doing some rearrangements will result in a simple ratio between areas under the curve (see last line below):

# Area under the curve via multiplication:
upper_bo = 5 # set upper bound
low_bo   = 3 # set lower bound
a_min    = 1 # set minimum
b_max    = 6 # set maximum, else f(x) = 0 anyways...
unif_area = (upper_bo-low_bo)*(1/(b_max-a_min)) 
# [1] 0.4

# Integration of a uniform prob. density function via ratio
# area of interest divided by total area:
# P(x_1<=X<=x_2):
P_area = (5-3)/(6-1)
# [1] 0.4

You may wonder why the PMF of a six-sided dice in the same area gives a probability of .5.5 and with a PDF we get .4.4 as a result? The reason is again rather abstract, as we have to recall that in contrast to our PMF our “continuous dice” would need to have infinite sides to relate to a continuous density function, while at the same time only leading to values between 1 to 6 in total. Due to the characteristic of infinity even the area between 3 and 5 itself would also already conceptually entail infinite sides of a dice(!!). Without getting to fuzzy about it: our PDF above just does not represent a discrete six sided-dice anymore, keep in mind. :D

Let us now check for the special case where the area of interest is equal to the total area of our uniform PDF, which should result in 1:

# P(x_1=a<=X<=x_2=b) := area of interest == total area:
P_area = (b_max-a_min)/(b_max-a_min)
P_area = (6-1)/(6-1)
# [1] 1

As mentioned already, when asking for the probability of a specific individual value of xx, given a continuous density function, the answer will always be: a probability of 0, different to a regular probability function (or our PMF). Looking at the above formula and recalling some math from the second part of this tutorial series, you may instantly spot why this is the case: we are again working with a relation (division) of differences (similar to the slope), so we do not want to encounter a 0 in the numerator in any case:

# P(c<=X<=c) := probability of only one point of, e.g., x = 5:
P_area = (5-5)/(6-1)
# Results in 0/5 = 0
# [1] 0

For completion we will now quickly calculate the standard deviation and the mean as well. The formula for the sd of uniform a distribution deviates a little from the formula we have used previously in part three of this series, but we will not go into details here, why this is the case. As said, we do this just for completion, since we will not need uniform distributions any longer (we will get back to uniform distributions when discussing Bayesian model regression in future some time):

# Mean for uniform distribution:
a = 1
b = 6
# For uniform distribution you may come across this formula:
mean = (a+b)/2

# .. but we can also use our regular old mean:
# mean = sum(dice)/length(dice):
mean = sum(dice)/length(dice)
mean = mean(dice)

# The standard deviation for uniform distributions:
# sd() does not deliver equivalent results:
sd = sqrt(((b-a)^2)/12)

# Add mean and sd to plot:
plot(y = test, x = x, type = "l", ylim = c(0,1)) 
abline(v = mean)

Fig. Resulting plot including the mean and sd of the above uniform distribution.

1.3 The Cumulative Distribution Function of a Uniform PDF

The cumulative distribution function of a uniform PDF as well as PDFs in general is simply explained, as it just reflects the integral of the PDF. We have therefore already looked at the values of a CDF, so to speak (also check out this tutorial, which I partially followed). The CDF is called cumulative, as it accumulates (sums up) the probabilities the further it integrates the PDF.

We could also integrate from - Inf to + Inf, but for uniform distributions we can also simply integrate from min. to max. The CDF can be formally defined as follows and is often denoted F(x)F(x) in contrast to f(x)f(x), our PDF:

The above formula says, if xx is greater than bb than it will always result in 1. If xx is below aa it will result in zero…. In the form code:

unif_CDF = function(a,b,x){
  Fx = x # initialize output vector of length(x): just use x 
  for(i in 1:length(x)){ # := for every object of the vector x
    if(x[i] < a){
      Fx[i] = 0
    else if(x[i] >= a & x[i] <= b){
      Fx[i] = (x[i]-a)/(b-a) 
    else if(x[i] > b){
      Fx[i] = 1
  } # End for i
} # End of function

# Initialize a sequence of possible x:
x = seq(0,7,by = 1) 
# Fx:
Fx = unif_CDF(a = 1, b = 6, x = x)

# Plot CDF:
plot(x = x, y = Fx)

Fig. Plot of our CDF. The y-axis refers to the probability and the x-axis to the respective area of x. The plot only shows some discrete points of the CDF of our continous PDF in this case.

As you may have expected, the probability raises linearly the more we integrate, until we integrated the area of the whole density function, numerically summing to 1. The x-axis elegantly does not refer to the density, but our variable xx itself. A CDF therefore translates the y-axis into probabilities, but keeps the relation of xx, our dice values (values of a variable in the form of an vector in R). It therefore has the same structure as a regular probability function (y-axis = probability and x-axis to the values of a variable).

Now that we have an idea what a PMF, PDF and CDF is, given a simple uniform distribution, we can now precede to probably the most commonly used form of functions: the bell curved Gaussian function, i.e., the (standard) normal probability (density) distribution.

2 The (Standard) Normal Probability Distribution ─ Gaussian PDFs and their corresponding CDFs

Next, we are going through the same steps as above, just not with a uniform, but a (standard) normal function.

You may wonder why bell curved standard normal distributions are so popular and important in the first place?

You may even feel conflicted hearing the word “normal”?

It may even appear as a dark mystique power to you influenced or even under the control of a secret political agenda? :O

To get settled and leave some uncertainties behind, we will first go through some conceptual basics around the famous Gaussian bell function, in order to understand how this special function is related to the world and the way we perform statistical inference.

Fig. Famous meme on normal and paranormal distributions. :D Origin is unknown to us. Image was obtained from Starecat.

One way to approach the concept of “normal distributions” is by treating it as an evident phenomenon, namely: the regression to the middle. And yes, this is why it is called regression model! Linguistically, “regression” is Latin and stands for “going back” or “backwards”. So, in general the concept regression to the middle therefore first of all refers to a process of “going back to the middle”. Mathematically, the middle is referring to the mean, the mean of a distribution. A classic example is the (population) mean of the height of humans. Have you ever wondered why there aren’t any people just randomly growing up to a height of 4, 6 or 9 meters? One way to answer this question is by saying: regression to the middle: the further away a measurement is from the average, the more unlikely such a measurement will be (until it approaches zero).

Note that the Gaussian function depicted below refers to a regular probability function, not a probability density function yet!

Fig. Abstract example for a normal probability distribution (not a PDF!), representing the relation between a certain random data value xx (e.g., height) and a corresponding probability yy (probability of encountering a certain height; again, not the density is used here for the y-axis!). The graph indicates a regression or “going back to the middle” concerning the probability of occurrence: The further xx deviates from the expected mean (marked orange) to both sides (blue arrows), the lower the probability of occurrence of that value of xx will be. The graph also indicates that there are values of x with 0 probability (maybe because a full-grown human with a height of 10 cm becomes counterfactual to specific biological constrains of an organism at some point).

Below you will find an actual example from real data via the website “Our World in Data”:

Fig. From the website “Our World in Data”. Whatever the exact reasons for the “restrictions to the middle” are on the biological level, they will always manifest themselves as shadows within distributions of measurements. We are also aware that the binary categorization is controverse. The categories we decide for are of course influencing the outcome and overall perspective of a trial/experiment. Especially the concept of gender is following a discussable deterministic pars-pro-toto approach: from chromosomes to the organism as a whole, which includes a huge categorical change in the scale of complexity we refer to: from a chromosome to a cell, to tissue and organs… up to cognition / experience and behavior. In nuce: such a complex determinism is not working out. However, it is still an important classic in biology for several reasons in general to measure the height and observe the growth of organisms, so we still chose height as an example.

It is hard to answer why such principal restrictions concerning, e.g., the height of an organism, are exactly the case. In general, most brief answers will probably refer to some concept of environmental adaption and even boundaries dictated by physics as such to some degree. However, no matter if you know the exact reason, the work of such effects/boundaries still often represent themselves in the form of a bell curved function.

Historically, the concept of a regression (to the middle) was initially introduced in the context of the genetic phenotype of peas by Francis Galton, a cousin of Charles Darwin, in the 19th century, as well as in the context of heredity and height (our example above). However, Galton did not discover the mathematical concept of such an effect, i.e., the normal distribution, but he figured out it can be used to empirically describe and predict natures behavior to some degree.

The take home message is here: the mathematical method of regression analysis as such is interestingly partially rooted in nature itself. If you like it or not, normal distributions are not just a concept, but also a phenomenon to some degree (e.g., humans don’t randomly grow 10m of height and even if, it would be something happening very seldomly).

Still, keep in mind that:

a) not every phenomenon can be described by a normal distribution, as it is not the only principle at work in statistical inference and nature, so to speak. In other words: there are also other kinds of distributions (logistic, binominal etc.). Below we will also see that normal distributions are a common effect of multiple sampling (a mathematical effect). In other words: a statistical model does not represent reality by itself (e.g., due to bias, confounders etc.). This is something that Galton (and also Pearson) completely overestimated and by doing so influenced the grim idea of eugenics. There is a great article by the science magazine “Nautilus” discussing how eugenics shaped statistics in its early stages.

Also keep in mind that:

b) The normal distribution is distinct from a concept of “norm” in the psycho-social (or cultural) sense. The deviation of height for example did not result from a conceptual (!) intention set by human minds / cognition (such as cultural norms). Being normal is not even necessarily a goal in that respect, such that some random cultural ideal might even be rather the unprobeable case (e.g., very tall people, if considered an ideal). . .

Fig. Karl Pearson (1857-1936) and Fracis Galton (1822-1911), pioneers of statistical inference with a grim history of influencing eugenic ideals.

So keep in mind, facts like the average height, or any other value of some arbitrary measurement, can still be instrumentalized to, e.g., opportunistically argue against the worth of others for some self-interest, so it is important to distinguish between the descriptive norm, e.g., of people’s height, and the “normative norm”, which not only involves a description of how things are, but arbitrarily sets how they are supposed to be in the first place (mere opinion, which from a cultural perspective often represents itself as, e.g., tradition).

Apart from examples in biology or certain phenomena in general: another mathematical reason why Gaussian functions play a great role in (frequentist) statistics is the so-called central limit theorem. The central limit theorem states that if we take several samples of a population with any kind of function as form and take the mean of each sample, we will get a normally distributed function of means (!!) over time. It may sound crazy, but it’s true and can be proven with R! See next chapter for details and code.

2.1 The Central Limit Theorem

Again, the central limit theorem states that if we take several samples of a population with any kind of function as form and take the mean of each sample, we will get a normally distributed distribution of means (!!) over time, in other words: when our sample size of nmeansn*means grows it will automatically start approaching a normal distribution, no matter the form of the actual distribution. It is important to address the fact that we are talking about the means of the samples, in order to make sense of “the effect” of the central limit theorem below. This can be thought of as approaching a population with a normal distribution and with size NN. In other words: the normal distribution that we start apporaching serves as a kind of bound or limit.

Note that in this chapter we will mostly follow this R tutorial on I can also recommend this StatQuest Tutorial by Josh Starmer on that topic (can’t always recommend StatQuest!) and this video by 3blue1brown (Grant Sanderson), which was just recently published and which covers a lot of other topics discussed in this tutorial, as well!

Here is some code to demonstrate the effect of the central limit theorem in R given a uniform distribution. In the downloadable R script of this tutorial we did the same with a logarithmic function, to prove this doesn’t only hold for sampling from uniform distributions (which is, e.g., also symmetric).

# Initialize data via runif, which will
# give us a random but uniformly distributed
# set of data. Each time you run this line, the result will
# be another set of data points.
# Uses set.seed(0) or other values to make results reproducible
# (we will not get further into the the topic here):
data = runif(n = 1000, min = 1, max = 10)

# We can also plot our data as histogram
hist(data, col = "lightblue")

# Use a for loop to take 5000 random samples of size n = 5 each:
n_samp = 5000
sample_n_5 = c() # create empty vector

# The below function will also take the mean of the sample and 
# will store the output in our vector sample_n_5:
for (i in 1:n_samp){
  # Store mean of each sample with length 5:
  sample_n_5[i] = mean(sample(data, 5, replace=TRUE))
} # End for i

# Look at the histogram of the sample_n_5:
hist(sample_n_5, col = "lightblue")

Fig. On the left we see a histogram of a uniform distribution of values of x ranging from 0 to 10. The y-axis refers to the frequency of each value of x (total of the frequencies is: n = 1000). On the right we see the distributions of the means of various samples taken from that uniform distribution. The y-axis again refers to the frequency of each sample mean (x-axis) occurring. We can see that even though we sampled from a uniform distributed set, the histogram of the means of our samples clearly tends towards a normal distribution over time.

In consequence of the central limit theorem, because the distribution of a bunch of sample means tends towards a normal distribution over time, we can assume a normal distribution right away in a lot of cases for performing (frequentist) hypothesis testing (integrating a standard normal PDF).

In the next part of this series the central limit theorem will get important when thinking of sampling from a normal distribution right away: when we sample from a normal distribution, the higher the sample size, the closer our sample mean will approach the actual mean of the population. In consequence, when we have a high sample size and a very high difference between the population mean and the sample mean, we can assume that such an event is very unlikely explainable by actually sampling from the population (null-hypothesis) but more likely the result of sampling from another distribution (the alternative hypohtesis), due to the central limit theorem (again, assuming that we sampled from the normally distributed population right away, i.e., from the perspective of the null-hypothesis). This will be the basic idea behind a z-/ and very similarly a t-test in the next part of this series.

In order to be able to follow the central limit theorem and a classic normal and not a t-distribution, the rule of thumb is said to be a sample size of at least 30. Otherwise, it is recommended to work with a t-distribution, which is very similar to a regular Gaussian and is also bell-curved. We will also get there in the next part of this series, no worries.

However, the above only works in the case of independent random sampling and finite variance. Note that there can be cases where the independence breakes apart, e.g., due to some global event: Let us say the tea market brakes down (a kind of bias is introduced by that) and Lady Meow is not able to get hold of any fresh tea. Other cats will then also have the problem of getting hold of some tea, so the overall mean of tea drank during a specific time will suddenly change and we will obtain mean values that were previously thought to be rather seldom…

We hope chapter 4 and 4.1 have given you a good overview of the mathematical and conceptual importance and the meaning of working with normal distributions and how to distinguish the use of the term “normal” from the term / concept of a “norm”.

Next we will move on to the mathematical description of normally distributed functions as such, which are also referred to as Gaussian functions.

2.2 The Basic Gaussian Function

The formula for a basic Gaussian function is actually fairly simple and we will look into why the formula below results in the respective plot (we will partially follow the partially well written intro of the Wikipedia article on Gaussian functions for this part (December 2022):

# Gaussian base form:
gaussian_fun = function(x){
  fx = exp(-x^2)
} # End of function

# Plot of basic Gaussian:
seq = seq(-4,4,by = .1)
fx = gaussian_fun(seq)
plot(x = seq, y = fx)

Fig. Plot of the Gaussian base form exp(x2)exp(-x^2).

You can probably identify the above function as bell curved, but how did this actually result from the equation?

In general, the number ee is a very important mathematical constant for several reasons, which we will not go into here. The number ee also appears in the description of growth, such as growth in infections, or growth of algae in a pond ─ or a growth of entropy… However, using the number ee is actually not necessary, as we could also use the base of 2, in order to get a bell curved function. However, the number ee has some convenient benifits (which we will not get into) that makes it possible to parametrize a Gaussian function (see further below).

However, before we get into ex2e^{-x^2} let us first recall what squaring did to our centralized values of x before summing (!) them (discussed in part three of this series). Remember? Look again at the plot below and try to make out the pattern:

# Recall a similar effect when squaring the deviation of 
# x_i-mean(x) and looking at the plot:
x = c(0:10)
plot(x = x, y = (x-mean(x)))    # deviation x_i from mean(x)
plot(x = x, y = abs(x-mean(x))) # absolute value of x_i-mean(x)
plot(x = x, y = (x-mean(x))^2)  # plot of squared deviation
# Recall: summing the squared deviation results in 
#         the total sum of squares...

Fig. Recap of evaluating the structure of the total sum of squares before actually summing. The plot shows very well what squaring actually does to the deviation of ximean(x)x_i-mean(x). As squaring also indirectly has the same effect as taking the absolute value, resulting in all values being positive, we also included a plot of the absolute value of the deviation of ximean(x)x_i−mean(x). We can therefore distinctively evaluate what squaring does in advance: it curves or smooths or parabolizes the centralized values, so to speak. IN NUCE: Squaring turned a linear function into a smooth parabola function (with positive values only).

Again: squaring turned a linear function with negative values into a smooth quadratic parabola function with positive values only. Recall that centralizing the values of x was the first step to obtain the total sum of squares / variance / sd (relateable to a parametric Gaussian, see further below). We know that a parabola function can have a maximum, similar to a Gaussian function. We also already know that only a parabola like g(x)=x2g(x) = -x^2 would result in a parabola that has a maximum, not a minimum (f(x)=+x2f(x)=+x^2). So in nuce: we know that there is a hidden parabola within a Gaussian function (as briefly mentioned in the last part of this series).

Think of the step of squaring the exponent of a regular exponential function exp(-x) as performing a similar process as above with our initially linear centralized values: Squaring the exponent of an exponential function turns a steady into a — at least in its middle part — rather parabolic relation of values with either a minimum in exp(x^2) or a maximum in exp(-x^2).

We can also plot the exp(-(abs(x)), which shows how a proper Gaussian function integrates a parabola function x2-x^2 into the “bi-exponential” function below:

Fig. Plot of exp(-abs(x)), which already is very close to a classic Gaussian and showing a similar effect as we have observed before when altering centralization.

# Plot of exp(-abs(x)):
test_abs = function(x){
  fx = exp(-abs(x))
} # End of function

# Recall that we will use the absolute value of a sequence
# consisting of values ranging from -4 to 4 (by a step of .1):
plot(x = seq, y = test_abs(seq), typ = "l") 

As mentioned, in the case of a regular (non-parametric) Gaussian, we could even use a different base, and we will still get a Gaussian form. Below we will use base 2.

# It does not even matter that we use the number e as base:
test_gen = function(x){
  fx = 2^(-x^2)
} # End of function

plot(x = seq, y = test_gen(seq), typ = "l")  

Even though the number ee plays a very important role in a lot of math and physics, the above shows that the form of Gaussian is more due to general mathematical reasons, not due to particular characteristics of the number ee. As hinted, the use of the numer ee is more a convenience, especially when it comes to PDFs and the concept of standardization (we will not focus on the number ee any further in this tutorial; below you will find some additional information on using ee as a matter of convenience and readability in general).

2.2.1 Optional Facts on exp(x)

For completion, here are some optional facts why exp(x) plays an important role in mathematics and physics. Up front: it is essentially convenience, as ee has some mathematical characteristics that makes a lot of calculations easier:

a) The nthn^{th} derivate of exe^{x} is again exe^{x}.

b) The slope at a certain point xx on f(x)=exf(x) = e^{x} is also exe^{x}.

c) The integral up to a certain point xx is again exe^{x}. Note that the code below proves that the assumption xexdx= ex\int_{- \infty}^{x}{e^{x}\text{dx}} = \ e^{x} holds, since e1=ee^{1} = e.

Proof of all above the above in R:

plot(x = seq, y = fx_pos, type = "l") # fx_pos from above
# forming coordinate system (+x,-x,+y)
abline(v = 0)

# Point at x = 0, so f(0) = y = e^0 = 1
exp(0) # Now add point to plot:
points(y = 1,x = 0) # P (0|1)

# Point at x = 1
# y = f(x) = e^x; f(1) = e^1 = e

points(exp(1)) # P (e|1)
# Slope of tangent at a point P(f(x)|x) := f'(x) = e^x,
# since f(x) = e^x = f'(x)

# Evaluate y coordinate of P(f(x)|x):
y_coordinate = exp(1)
# [1] 2.718282 = Euler's number e.

# We can add the respective tangent at exp(y)
# point with a slope = e^x. In case you wonder:
# the linear function x does not need to be defined in curve().
curve(exp(1)*x,-3,3, add = TRUE)

# Integral from -Inf to x = 1
expo_fun = function(x) {exp(x)}
integrate(expo_fun, lower = -Inf, upper = 1)
# 2.718282 with absolute error < 0.00015
# = e again.

Fig. Plot of an exponential function with a tangent at P(e|1) with the slope ex.e^{x}.

In the next chapters, we will see that the use of ee as base serves the purpose of making standardization possible, since we can plug in the mean and the variance/sd as a parameter into an equation. We will not focus on the number ee any further. However, the recently published video tutorial by 3blue1brown on the central limit theorem that we mentioned before gives some answers starting at minute 17:10. However, most of the math on that topic in the video by Grant Sanderson will be presented below as well via R. In general, using ee as base gives us the ability to parametrize and therefore generalize our Gaussian function.

2.3 The Parametrized Gaussian Function

Next, we are going to look into the parametrized form of a Gaussian function. The formula may look complicated, but it is actually really simple:

The parameters aa, bb and cc stand for: the height of the peak (aa), which can be considered as scaling the y-axis; the value of xx at the peak of the Gaussian (bb, or the mean); and cc stands for the width of the curve (cc, or just the standard deviation!), which refers to the turning points of the tails of a Gaussian (see plot further below). This is how the sd indirectly controls the width of the bell curve.

So here we finally get to a point where we can clearly relate the mean and the sd to a normal / Gaussian distribution. Whoop!!

Looking at the mean bb, we see that our set of xx in the numerator of the fraction in the exponent of ee is again centralized! With some slight adjustments this whole exponent will later be referred to as z-score for standard normal PDFs, as we will see!!

As mentioned, the height of the peak can be understood as a parameter for scaling the y-axis. However, keep in mind that the above formula does not yet refer to a density for f(x)f(x). Recall that for a density function our y-axis would be scaled in a way, such that the integral of the function will always sum to 1.

For the numeric example below, we used the above data from the height example (“Our World in Data”) and tried to figure out a distribution for humans in general.

 # Gaussian with parametric extension:
seq = seq(100,220,by=.1) # height in cm. 
a = .5                 # height of curve peak  
b = (178.4+164.7)/2    # x of peak = mean = 50%
c = ((178.4-170.8)+(164.7-157.6))/2                   
                       # width of the bell = sd 
                       # + turning point of function!

gaussian_fun_para = function(x, a, b, c){
  fx = a*exp(-(((x-b)^2)/(2*c^2)))
} # End of function

fx = gaussian_fun_para(seq,a,b,c)
plot(x = seq, y = fx, xlim=c(135,205), type ="l")
abline(v=b, col = "lightblue")
abline(v=b+c, lty = 2)
abline(v=b-c, lty = 2)

Fig. Plot of the parametrized Gaussian function with a=.5a=.5 and a=1a=1 for comparison. We could also scale the y-axis in a way that it refers to the number of occurrences of each x, as we did before in chapter 3.1, when discussing the possibility to translate the number of occurrences of an event xix_i into the probability of an event xix_i (p(xi)p(x_i)) occurring.

To understand the exponent of the above parametrized form of the Gaussian, we will further play around a bit with the formula to understand what each value actually does. R offers a lot of possibilities to use, e.g., plotting in order to heuristically explore functions via visualizing data:

Fig. The more you fuck around (independent), the more you find out (dependent variable). Meme by Roger Skaer.

Here is some code where we manipulated the formulas including the respective plot. Below you will find a corresponding figure including the formulas that were used for each plot. Note upfront that the first and the second test in R code below (test_1 and test_2) resulted in equivalent values, since cc is here set to be equal to 11 — and 111*1 or 121^2 remains 11. Feel free to play around with the values yourself!

# Parameters

# Sequence
seq = seq(-4,4,by=.1)
# Test 1:5

# Plot of Test 1:5

Fig. Plot of test_1 to test_5. The plot of the first and second test on the upper left are the same, since sd=1sd=1 and 11=12=11*1 =1^2 =1 as well. In the second plot on the lower left we can see that the curve is slightly more narrow, indicating an sd of 0 in this case, though there is still a turning point in the function — so what happened here is, we just gave up control over the parameter c (the sd). The upper right plot shows what happens when we get rid of the squaring again and the lower left includes 2c2∗c again (squaring was left out). We can see that the y-axis has a different scale in this case. Further evaluations are possible, but we think this has given us enough insight into what the formula actually does for now.

2.4 The Parametrized Gaussian Probability Density Function (Normal PDF)

You may figured out for yourself already but the parametrized Gaussian probability function is structurally the same as the Gaussian or normal probability density function. The only difference is again that the area will now sum to 1 (we will soon learn why). The param. normal PDF is also referred to as standard normal PDF, when a mean of 0 and a sd (or variance) of 1 is given (sqrt(1) == 1).

What can we do with it? The standard normal PDF with a mean of 0 and sd of 1 is representing the population (or another sample) and can essentially be used to compare a population it with a standardized sample distribution, i.e., with a sample distribution that is transformed into a standard normal form. More precise, the sample mean will be translated into measures of population standard deviation: If the mean of the population is 10, the standard deviation of the population is 2 and the standardized sample mean equals +1, then we can reclaim the sample mean to be 12, since it is deviating +1 population-sd from the population mean of 10 (recall, we set the pop-sd to be 2 in this example).

To give an example what can be achieved by comparing the means of two distributions in a standardized way: say we want to test if a new drug changes the blood pressure of a patient. What we could do is, we could compare the sample mean of those patients that received the new drug with the population mean, i.e., a group that did not receive the new drug (or “all other people”). However, looking at the difference of means for itself, does not tell us much. Say, we end up with a difference of systolic blood pressure of 5 mmHg, is that much? Looking back at part three of this series, you may ask: couldn’t 5mmHg just be the standard deviation of the mean? In other words, could it be that such a difference is quite common when sampling patients’ systolic blood pressure? In other words: the difference in means for itself does not tell us much about how to asses such a difference. Standardization will help us to evaluate such individual differences by transforming individual values of measurements into a general relation. Again, this means that the deviation of the sample mean from the population mean will be looked at in units of standard deviation of the population mean.

This is essentially what the formula of a z-test does: translating the sample mean, such that it becomes relatable to a Gaussian PDF. However, translating the sample mean into a standardized scheme for itself does not tell us enough either, since it remains unclear how likely such a certain obtained difference may turn out to be. We also need to integrate our PDF up to our standadized sample mean in order to obtain a probabilistic perspective, as we will see further below. So again, different to a regular probability function, we can not just read out the probability value by looking at the y-axis, since we are handling a (standard normal) probabiltiy density function. Note that the process of standardization is only slightly different to what is called studentization, using a Student’s t-test, i.e., a t-distribution (similar to a standard normal PDF, see further below). The standard normal PDF is also referred to as z-distribution. You may have already come across the term t-score or z-score before.

Before we get all crazy with hypothesis testing, and since integrating a Gaussian is not that trivial, let us first go through the formula of a Gaussian PDF step-by-step, especially in order to understand how a parametric Gaussian function becomes a density function. German readers, depending on their age, may have come across one of these in their past:

Fig. The Gaussian function is named after Karl Friedrich Gauß (1777-1855). We will not go into historic details here, especially as Gauß has played such a great role in many fields of mathematics, physics and others, such that it is truly hard to sum his influence up in a few sentences. Gauß also established the least squares method by the way! The image above shows the effigy of Gauß on the old German 10 Marks bill, including a graph and the general parametrized formula of the normal PDF.

The general parametrized formula for the normal (non-standardized) probability density function that follows the mean and the variance of a distribution is denoted as follows (recall that sigma-squared is essentially the variance; the formula intentionally deviates from the formula from the 10 Mark bill):

The left part of the formula is spoken out: ff is the name of a function of xx, given (“|”) the mean and (“,”) the variance (the latter two can be referred to as set parameters). We could also use the sd, however as we know, it is just a matter of squaring and some minor adjustments of the above formula. When doing research, you might encounter formulas with other slightly different rearranged formalism (as in the 10 Mark bill above), apart from using the standard deviation instead of the variance. Don’t worry, as long as they all result in the same. Especially the formula for the standard normal PDF could also be simplified further, since it has a mean of 0 and a sd of 1 (purple formula):

However, we will stick with the non-simplified formula, in order to fully understand what it actually says and for what it can be used.

To understand the basic structure of the formula, recall that our uniform density function was denoted f(x)=1/(x2x1)f(x)=1/(x_2-x_1), which resulted from the constrain to sum to 1. In the normal PDF above, we see a similar pattern for our parameter a=1/sqrt(2πvariance)a = 1/sqrt(2*π*variance). The use of this particular complex looking aa is in general again for reasons of integration: it is supposed to sum to 1 and the funny looking denominator of our parameter aa is actually the integral of the regular param. Gaussian, as we will see! Note that the funny looking denominator is also referred to as normalization factor (it normalizes the y-axis to form a density).

The reason that the number ππ turns up may seem intuitive since ππ is often used as scale of the x-axis, e.g., for plotting and integrating sine waves (which have a period of 2π), or when rotating “things” in general. This is actually true, since the use of ππ also has to do with the calculation of the integral of a regular Gaussian that applies some mathematical tricks that involve ππ, but we will get there in the next chapter! Let es first do some descriptive work on our function:

Again, the so called standard normal distribution is denoted as having a mean of 0 and a variance of 1. Below you will find some code for the above. You can also plug in a different mean and variance below, however, recall that it will not result in a standardized PDF anymore:

# General formula for a probability density function (PDF).
# Insert mean = 0 and var = 1 to obtain the values for input x
# given a so-called standard normal distribution:
prob_dens = function(x,mean,var){
  fx = (1/sqrt(2*pi*var))*exp((-((x-mean)^2))/(2*var))
} # End of function

# In case placing () in the above becomes an issue when
# writing a function, you can also decompose the formula:
prob_dens_decomp = function(x,mean,var){
  fx1 = 1/sqrt(2*pi*var)
  fx2 = (-((x-mean)^2))/(2*var)
  fx = fx1*exp(fx2)
} # End of function

# Initlize seq:
x = seq(-4,4,by=.01)

# Given standard normal distribution:
norm_PDF     = prob_dens(x, mean = 0, var = 1)

# Using basic R function dnorm():
norm_PDF_alt = dnorm(x, mean = 0, sd = 1)

# Check for equality:
# [1] TRUE

# Plot

Fig. Modified plot of our standard normal probability function. The x-axis is scaled for the population sd, given a mean of 0.

Note that the x-axis 4-4 to 44 can also be read as mean4sdmean - 4*sd to mean+4sdmean + 4*sd, in other words: the xaxisx-axis is scaled to represent the standard deviation from the mean(recall that the sd = 1 in the standard normal PDF, so the turning points will be located at x=+/1x = +/- 1). In case you have trouble understanding why: As the mean=0mean = 0 in a standard normal PDF, the x-axis therefore refers to the standard deviation only (0 +/- sd).

We noticed previously when discussing the parametric Gaussian function that changing aa did have effect on the height of our function but not on the general form of the function. We also noticed that the sd did have effect on where the turning points are located and therefore how narrow our function will be, but the sd did not affect the height. Looking at our formula above, you might have noticed that things have changed by now, since the sd is now also entailed in the general parameter of aa. A relation between the height and the sd of a function is also plausible, since the area has to sum to 1, no matter what variance/sd is given (only changing the sd but not the height would result in an area under the curve above 1).

Let us plot the output of our prob_dens() function using a variance of 11 and a variance of 33 and a mean of 0 in both cases. Here we can see why a low variance is essentially considered a good thing, since a high variance leads to a wider spread of values and subsequently to a higher variability when taking samples from the respective distribution (e.g., it becomes more likely to get a deviation of +/-4*sd when taking a sample from a distribution with higher variance). Our assumptions, our predictions will therefore be more messy, given a higher variance. In other words: A higher variance here means that, e.g., trying to predict new events via the mean of a sample may come with higher errors. The yaxisy-axis does not indicate a probability, but however the shape still does, so by looking at the plot it is still true when saying: with a high variance, values further away of the mean (errors) become moremore likely as with data that has a lower variance / sd.

As we figured out in part three of this series, the value of the variance itself does not tell us much about the quality of a function, but the comparision with other functions does (see below). The sd on the other hand can directly be related to the mean and delivers a value that we can interpret descriptively right away (turning points of the function).

Fig. Plot of two Normal PDFs with variance = 1 (blue) and variance = 3 (red).

# Plot with var = 1 and var = 3
plot(x=x,y=prob_dens(x,mean = 0,var = 1), type="l", col = "blue")

# The lines function adds another cure to existing plot:
lines(x=x,y=prob_dens(x,mean = 0,var = 3), col = "red")

Again: The formal reason for the above deformations of the Gaussian by the sd is of course the fact that the sd/variance appears in our parameter aa in the normal PDF (we will learn the exact reason in the next chapter).

Note that most of the time we will operate with equal variances in this tutorial, so the standardized sample distribution will have the same form as the standard normal PDF, only the mean will deviate. We did not yet compare a sample distribution with a population distribution, so no worries if this note does not ring a bell yet.

Next up we will integrate Gaussian functions and finally understand a little better why ππ and the sd is involved in normalizing the integral of our normal PDF.

2.5 Integrating Non-Param. and Param. Gaussian Functions (Optional)

This chapter is rather detailed and entails some aspects that could be considered optional, so in case you have trouble understanding all of it, it does not matter too much, as long as you get the general idea how we can turn a param. Gaussian into a param. Gaussian PDF. We will also recommend some video tutorials to follow in parallel.

The formula for the integral of a non-param. Gaussian goes as follows:

We already noticed that the number ππ can be involved in a standard normal PDF as well. The general reason is again that if we want the above integral to result in a value of 11 on the right side of the equation (since we integrate from - Inf to Inf), we would need to divide both sides of the equation by sqrt(π)sqrt(π) in order to obtain a density function that sums to one (see decomposition below). In a standard normal PDF this is done in advance, so to speak. So, one way to answer the question why ππ is involved: it has to do with making the complete integral sum to a value of 1 (normalization), as the integral of a regular Gaussian is equal to sqrt(π)sqrt(π) and not a value from 0 to 1 by itself.

This does not answer why ππ appears in the first place in the integral yet. Though I hope, that so far the formula of our standard normal PDF makes sense to you again. The decomposition below derives the formula of the basic (non-parametric!) Gaussian PDF:

Below you will find some code to replicate the integral of the basic Gaussian function using the integrate() function in R. By using our gaussian_fun() function together with the basic R function integrate() , we can determine a specific area we want to integrate, e.g., the whole curve via lower = -Inf, upper = +Inf.

# Gaussian base form:
gaussian_fun = function(x){
  fx = exp(-x^2)
} # End of function

# Using integrate function:
gaussian_integral = integrate(gaussian_fun, lower = -Inf, upper = +Inf)

# Use gaussian_integral$value to get value only, as integrate()
# returns a message and a list:
# [1] TRUE

To finally understand why the number ππ is involved in the first place, we have to dig a little deeper into the math:

The first thing I thought of, when I saw that the integral of a non-parametric Gaussian is the square root of ππ, was: “Interesting!! Then it results in ππ if we were to square it”. This was actually important to notice as I figured out (the more you…), as a common way to calculate and explain the Gaussian integral is by actually calculating the volume of a 3D Gaussian, resulting in ππ, instead of just the area under the 2D curve (resulting in sqrt(π)sqrt(π)). This is again done for mathematical convenience, especially when calculating “by hand”.

We are only going to touch the surface here, but there are a lot of videos on that matter on the internet. Here are two videos I watched to write this part of the tutorial: one in German by Weitz / HAW Hamburg, another English video by vcubingx. This article by Ryan Brideau also discusses the same topic in a similar way.

Calculating the volume can also be written as taking the integral of an integral, where importantly xx and yy actually refer to the same set(!!).

In general, there are some difficulties when integrating a Gaussian (especially by hand) and there are several ways to do so. We will only dip into the topic to get some intuition and will not replicate the whole math, but however: One way to integrate a Gaussian is by using polar coordinates instead of cartesian coordinates, eventually rotating the 2D integral around its peak, which essentially results in the volume of a 3D Gaussian (it is easy, due to the symmetric structure of 3D Gaussian, see plot later below).

The below formula may appear complicated, but think of the integral from 00 to 2π as rotating the 2D integral 360°. There is 2π2*π involved as the radiant is typically used in such cases, where the radius of a circle is 1, which results in 2π2*π being equal to one period of a sine wave (360°) — and not just ππ for itself (which represents a full rotation, given a radius that is not 1).

The part where the integral of exp(x2)exp(-x^2) and exp(y2)exp(-y^2) are separated may appear confusing but indicates a geometric intuition in the sense of a square as well, especially since xx and yy are actually the same set in the case of a standard normal distribution, it actally refers to a square in the sense of a2a^2. So keep in kind that yy does importantly not refer to a dependent variable! It is just the same set multiplied by itself in order to add a third dimension to the 2D Gaussian.

To wrap this up: What we get by essentially adding a redundant dimension is a perfectly symmetric 3D gaussian (see plot below). There are lot of videos on the internet on that topic, so we will leave it by that and we will also not replicate the various other existing mathematical methods to integrate a Gaussian via R here (we will stick with the built-in integrate() function).

The plot below is supposed to show that rotation is easily doable, since the 3D Gaussian is perfectly symmetric.

Fig. Plot of a 3D Gaussian from two perspectives: From upper right side (left plot) and from the top (right plot). The latter plot was modified to include arrows to mark that a perfectly circular rotation is possible. Relating the previous equation to the above plot: The first integral of the right side of the equation of the plot can be related to the rotation: 360° = 2*pi*rad = 1 period of a sine wave. “Rad” relates to a circle with a radius of 1, where pi radians = 180° (we will not go into further details here). The second integral can be related to the integral of the area from the center axis (the “peak axis”) to its infinite outer realm (from 0 to infinity).

Here is a more detailed plot. Not the best, since it was hard to integrate the information into the above plot but I hope you can make sense of it:

Fig. The r (orange) in the formula can be looked at as a vector with the coordinates (x,y) (light blue = x, light green = y). This vector will then be rotated by 360°=2pi. This is then done for every circle (green) the vector r can draw — we have included only two vectors as an example. So in nuce, we get the volume by slicing the 3D Gaussian up into cylinders with an diameter approaching 0 that we all integrate and sum in order to obtain the integral of the whole (like a 3D Gaussian sliced into a histogram!).

Grant Sanderson (3blue1brown) just recently published a video on all of the above in a series that also entails the video on the central limit theorem that we mentioned before. Also note that Grant Sanderson provides all the python code for his the visualization of the math in his videos — in case you are interested in python as well!

However, below you will find the code to replicate the above 3D plot in R:

# 3D Plot of a Gaussian:
x = seq(-4,4,by=.1)
y = x

# We will not go into full detail why we need the outer product for z,
# but the result is a matrix and z needs to be related to two 
# coordinates: x and y.
# However, outer() allows to write the function of our 3D Gaussian 
# within the outer() function:
z = outer(x, y, function(x,y)(exp(-(x^2+y^2))))

# The persp() function lets us plot in 3D; adjust theta and phi
# in order to change perspective:
persp(x = x,y = y,z = z, theta =20,phi=20)

# From the top:
persp(x = x,y = y,z = z, theta = 0, phi = 90)

The integral of the parametric form looks a little messier, but it is fairly simple when decomposed.

The next part is a short replication of parts of the Wikipedia article on Gaussian functions. It explains how our parameters aa and cc turn out to be part of the above result of integrating a param. Gaussian. The below is again just doing some rearrangements. Note that moving from line 3 to 4 may appear bumpy, but it is again just serving normalization, such that the last line results in 1=11=1. Later we will also further look into the specific purpose of the substitution of yy and zz below.

Upfront: The value of zz below in a slightly altered form is also referred to as z-score, which is important for converting any set into a standard normal form, as we will see later below. Note that the substitution of yy and zz is possible overall as it involves variables that are given (mean, sd, as well as the sample values xix_i).

# Integration using our parametrized Gaussian function:
gaussian_fun_para = function(x, a, b, c){
  fx = a*exp(-(((x-b)^2)/(2*c^2)))
} # End of function

gauss_para_integ = integrate(gaussian_fun_para,a=a,b=b,c=c,lower=-Inf,upper=Inf)
# [1] TRUE

So, now we know why we use 1/sqrt(2pic2)1/sqrt(2*pi*c^2) and why ππ is involved in the first place.

You may wonder why aa is not included in the standard normal PDF. Look at the first line of the equations above and you will notice that when we divide the left side of the equation by its integral asqrt(2pisd2)a*sqrt(2*pi*sd^2), the parameter aa will cancel itself out! On the other hand the term 1/sqrt(2pisd2)1/sqrt(2*pi*sd^2) can be looked at as aa as well, since it indeed scales the y-axis / changes the height of the Gaussian, as we have seen above when comparing different variances of a Gaussian, given an equivalent mean.

We now have all of the knowledge to finally calculate the whole integral of a normal PDF and therefore nearly all the knowledge we need to calculate the infamous p-value, which only concerns a specific integrated area under the curve. . . Note that below we will just integrate from - Inf to + Inf and half of a Gaussian function, i.e., we will not perform an actual z-/t-test in this tutorial.

2.6 Integrating the (Standard) Normal PDF (including CDF)

Integrating is essential to obtain probabilistic values from our PDF. However, we were not testing for anything yet. Testing would again involve integrating a specific value (a z- or t-value). We will get there soon.

A short way of integrating our standard normal PDF in R is again by using our prob_dens() or the built-in dnorm() function together with the R function integrate(). In case you wonder about the code below, the integrate() function has a special structural feature: it takes a function such as prob_dens as input for itself and the parameters of that function (mean = and var =) can be called as parameters within integrate(), see below:

# Integrate stand. norm. probability density function:
integrate(prob_dens, mean = 0, var = 1, -Inf, +Inf)
# Result:
# 1 with absolute error < 9.4e-05

# ... or via using dnorm(), now with the parameters "mean =" and "sd =":
integrate(dnorm, mean = 0, sd = 1, -Inf, +Inf)
# Result:
# 1 with absolute error < 9.4e-05

However, the shortest way may be by using the built-in pnorm() function that evaluates the p-value given a normal distribution:

# We could also use the function pnorm():
p_value_Inf = pnorm(Inf,mean = 0,sd = 1)
# [1] 1

# p-value at the mean = 0, 
# given a mean of 0 and sd of 1:
p_value_pnorm = pnorm(0,mean = 0,sd = 1) 
p_value       = integrate(prob_dens, mean = 0, var =1, -Inf, 0)$value
# [1] 0.5

# Test for equality:
# [1] TRUE

We can now go on and calculate and plot the respective normal CDF of our normal PDF that cumulatively adds up the p-values of the respective z-scores (the values on the x-axis). We will discuss the z-score and its formula in the next chapter, but so far we know already that it is just a way to translate any values of the x-axis into a standardized form of values representing a standard deviation from the mean (which itself is 0). The function entails a for-loop to calculate the p-value for every possible value of our sequence (-4 to 4, by .1) via the integrate function:

# CDF of a normal PDF:
norm_CDF = function(seq, mean, var){
  # Initialize vector length(density)
  Fx = seq  
  for(i in 1:length(seq)){
    Fx[i] = integrate(prob_dens, mean = 0, var = 1, lower = -Inf, upper = seq[i])$value
  } # End for i
  # Plot result:
  plot(x=seq, y = Fx, type = "l")

} # End of function

# Test our function and plot our CDF (code for plotting is entailed 
# in function above already):
seq = seq(-4,4,by=.1)

Fig. Compared to the CDF of a uniform distribution we now get a smoothed out sigmoid (s-shaped) function as a result.

We added the possibility to adjust the mean and variance in the prob_dens() function, but note that this is not the same as evaluating the previously mentioned z-score of our sample in order to perform the complete standardization. However, evaluating the z-score of a given set of values is fairly easy, since it has nearly the same structure as the exponent of ee in the standard normal PDF. In case you are confused: above we actually did not standardize a set of values, but looked at the standardized PDF itself.

Obtaining a t-score is again very, very similar to obtaining a z-score, as we will see, so we will start with the z-score; perform a z-test and will then move on to the t-test.

3 Standardization ─ Z-Scores, Z-Tables (Standard Normal Tables)

So far we have looked at the (standard) normal PDF as such and did not perform any actual standardization of data and also did not perform any hypothesis testing yet. Up front: testing in a z- and t-test will just mean: comparing two means in order to evaluate if they deviate enough that we can argue that they are likely two different distributions and not just samples from one and the same distribution. To give an example what can be achieved by comparing the means of two distributions: say we want to test if a new drug changes the blood pressure of a patient. What we could do is, we could compare the sample mean of those patients that received the new drug with the population mean, i.e., a group that did not receive the new drug (or “all other people”). However, looking at the difference of means for itself, does not tell us much.

So far, it was rather easy to integrate a uniform distribution and doing so helped us to understand the formula of our normal PDF, which we figured out needed some deeper insight into mathematics in order to integrate such functions. However, the fact that integration here involves a complicated process is also important for error control. A computer scientific solution therefore also always indirectly aims to keep (human) errors in calculations low. Computers give us the opportunity to operate with complex math without doing any calculations “by hand” or via cognition at all. This is of course also dangerous, since we do not need to actually understand a z-test in order to execute some lines of code — of course scientists should be able to understand and explain more than just some lines of code.

This is one of the reasons we chose to create thorough tutorials for you, before we come up with heuristic quick guides of some kind, as it is unfortunately rather common to use statistical methods without actually understanding them porperly. People with a mere heuristic approach also get easily agitated when one is asking basic questions, which is sad, since that means: you may have to justify yourself because you have knowledge instead justifying a lack of knowledge. Note that this indeed has the charachter of a a classic opportunist gate-keeping tactic and especially explains why statistical analysis is often not well performed in papers. Simply talking about p-hacking etc. is not enough to tackle attitude driven science.

However, by using nowadays computer software such as R, the focus can lay on the concept, the formula, and a little less on the representational realization of an algorithm (calculating by head and hand). As you have hopefully figured out by now: This makes the mathematics intellectually a lot more graspable for us and also allows for smooth experimentation and various forms of visualization of mathematics/statistics…

Apart from a computer scientific solution for our integral problem, you may already be familiar with t-tables, or a standard normal table, also called a z-score table in that respect (we will replicate a z-score table in R further below!!). These tables deliver you the integral up to a certain bound value (z-/t-score), so they don’t have to be calculated by hand (which was rather daunting most of the time in the 20th century)! In other words: t-/z-tables were the error control in the 20th century, where computers where not as common and powerful as computers are today. You might have also guessed already, but yes, a z-table is essentially just a CDF in a tabular form (see further below).

Fig. Evolution of mathematical tools, from abacus to calculators to programs such as R. Certain methods are dictated by a kind of principle of least action, i.e., are popular and widely used, since they are easy to calculate by hand or can be performed with only few and simple tools. Now that we are living in the 21. century technology casually allows us to flexibly play around with a wide variety of highly sophisticated and even open access software tools that can be used by probably any of todays computers. Original Photos by Recha Oktaviani and Crissy Jarvis.

Let us now look at the formula of the z-score which is used to standardize any value of xx (our measurements). For the actual z-test (testing for differences in means), we will essentially use the same z-score formula, just using the mean of our sample as input xx and a special type of standard deviation: the standard deviation/error of the mean. The result of a z-test results in a special kind of z that will be called z-value instead of z-score.

So, keep in mind, we are still not at a stage of actual testing, but will at least be able to convert any numerical value of a measurement into a standard normal PDF. This is again done for convenience and comparability of our data under a generalized (standardized) perspective:

IMPORTANT NOTE: With this formula not only the xaxisx-axis of our sample gets properly scaled (as our prob_dens() function did), but also the density in a way that it fully corresponds to a standard normal PDF with mean 0 and sd of 1 and can therefore be properly compared with a population or another sample. Now the integral to a point of zz will correspond to the p-values of z-score tables, such that neither we, nor our computer has to perform an actual integration!

The figure below shows what we can do using the z-score formula. IMPORTANTLY, apart from that you may wonder how it comes that we are comparing two distributions? The sample for itself is also considered a distribution. In fact, performing a z-test requires that the sample itself has a normal distribution. We mentioned the Shapiro-Wilck test in part three of this series as a quick side quest before.

Fig. On the left we see a normal PDF with population mean of 15 (blue line) and a sample mean of 5 (orange line). On the right we see the standardized sample mean (orange line) deviating in measures of standard deviation from the population mean of 0. In case this appears dizzy: The population mean is 0 no matter what the actual value of the (non-standardized) population mean is, since it acts as the standardized baseline. Why all this fuzz about translation again? Apart from mathematical convenience, the goal of standardization is that we can look at the sample mean from the general perspective / units of standard deviations or simply generalized difference from the population mean! Mathematically: evaluating the z-score essentially translates all values of x from our sample into values of such a standardized perspective. In contrast to the formula of the z-score above, the z-test formula (also referred to as z-statistics or z-value) does not translate all possible values of x, but the sample mean of x as a deviation from the pop. mean in particular. The formula slightly deviates, the difference between the formula of the z-score and z-value/statistics is rather marginal, as we will see in the next part of this series.

Let us do some calculations in order to figure out what the above actually means. At first we will look at an example where the sample mean of a sample of values is equivalent with the population mean. This can be looked at as a special case of the z-test. In this case we can represent the difference between sample mean and population mean just by the formula of the z-score, since the difference is zero anyways. In this special case we just end up with standard normal distribution itself, representing our baseline by itself. When integrating we can represent and calculate the probability of all possible differences of the (standardized) sample mean from the (standardized) population mean. This baseline approach is essentially what a z-score table consists of, as we will see.

Fig. Lady Meow — such a lovely cat!

Numeric example: So, let us say we have knowledge on the population mean of the number of cups of tea Lady Meow drank in a whole day (say pop. mean = 5). Say, we also took a sample recently, i.e., measured the tea drinking behavior of Lady Meow again and want to compare the sample with the population and the sample turns out to also have a mean of 5. We will therefore end up with a difference of 0.

The integral up to a z-score of 0, representing the standardized baseline population mean, will be .5, which is essentially just half of the integral. By doing so, we are essentially asking how likely it is that our sample mean is significantly lower or equal to the population mean (later discussed as lower tail z-test of a 0 difference in means). Upfront: when calculating a p-value, asking for both being either lower or higher in difference from the population mean (equality/inequality as such), then we would have to double the above p-value (later discussed again as two-tailed z-test). In the case of sample mean == population mean, this double sided approach will result in a 100% chance (probability of 1) that the sample is equal to the population with a non-standardized mean of 5 (again, this is equivalent to a z-test at least for the case where the sample mean == population mean).

How would we in general interpret the real life case where the result of a clinical trial turns out to be showing a 0 difference in means? It depends on the question we ask when performing a hypothesis test, but most of the time we essentially want this value to be very low, aiming for a difference in means, so that we can argue that the sample is actually different from the population. To give a medical example: when comparing sample participants that were given medication with a population that obtained no medication, then we mostly hope for a difference when comparing these two groups, since we probably hope for an effective new medication that makes a significant difference to doing nothing at all (or placebo effect, which is essentially no conditional effect of a medication — claiming anything else is considered manipulative product/procedure binding)! In other words, we hope to find medication that has an evident effect and deserves the name medication in the first place.

As a side note: Mathematically we already know from the first chapters when looking at uniform distributions, that a PDF also allows us in general to integrate any region, e.g., when asking how likely it is to obtain a sample with a mean that is between -1*sd and +1*sd. . .

Below you will find some Code for the special case of a 0 difference between population and sample mean, as well as a plot of the standardization of the dependent variable of our second synthetic data set observing Lady Meow ^°^!

###### Special case of pop. mean == sample mean:
pop_cups = c(0:10)

# Population mean, variance and standard deviation:
# Recall that var() and sd() calculates teh sample var/sd
# including Bessel's correction, therefore we use
# the formulas below:
pop_mean = mean(pop_cups)
# [1] 5
pop_var = sum((pop_cups-mean(pop_cups))^2)/length(pop_cups)
# [1] 10
pop_sd = sqrt(pop_var)
# [1] 3.162278

# Function for evaluation the z-value of a 
# non-standardized data set:
z_score = function(x,pop_mean,pop_sd){
  z = (x-pop_mean)/pop_sd
} # End of function

# Standardized Z-Score for a particular value of x:
# Say we want to standardize a value of x that is equal
# to our mean itself as a special where a z-test formula
# is not needed yet:
z_score(pop_mean, pop_mean, pop_sd)
# [1] 0

# Integrating will reasonably result in
# 0.5, i.e., half of the area under the curve:
# [1] 0.5
# For a two-tailed test, checking on equality/inequality:
# [1] 1

# Let us now look at the z-score for our
# value of mean+/-sd in pop_cups:
z_score((pop_mean-pop_sd), pop_mean, pop_sd)
# [1] -1
z_score((pop_mean+pop_sd), pop_mean, pop_sd)
# [1] +1
z_score((pop_mean+2*pop_sd), pop_mean, pop_sd)
# [1] +2
# Worked!

#### Plot standardization of the dependent variable of 
####     second synth. observing Lady Meow

# Second synthetic data set:
# Cups of Tea: 
cups = c(0,  1, 2,   3.5, 4.8, 5.2, 6,   6.9, 8.5, 9.1, 9.9, # run 1
         0, .6, 2.6, 3.1, 4.8, 6.9, 7.1, 7.5, 8.5, 9.9, 9.9, #     2
         0, .2, 2.9, 2.9, 4.9, 5.2, 6.9, 7.1, 7.3, 9,   9.8) #     3

# Time passed for each run of measurements (cups), abbreviated:
time = c(0:10,0:10,0:10)

# Plot corresponding normal and standard norm. PDF:
# Normal PDF (non- standardized) of our data:
seq = seq(-10,20,by=.1)
plot(x=seq,y=prob_dens(seq,mean(cups),pop_var),type ="l")

# Translating all possible values on the x axis into our 
# normal PDF into the format of a standard(!) normal PDF:
z_seq = z_score(seq,mean(cups),pop_var)
plot(x=z_seq,y=prob_dens(seq,mean(cups),pop_var),type ="l")

We can also create and use a z-table via R, which is essentially again just our CDF in tabular form: Here is some code to create a standard normal distribution table / z-table yourself (partially following this R tutorial).

# Standard normal distribution table / z-table:

# Initialize seq for z-scores (+/-)
z = seq(0,9.99,by = 0.01) 

# Calculate all possible z-scores from 0 to 9.99:
p_z = pnorm(z)

# Plotting the above shows that we calculated the
# upper half of the normal CDF:

# Set as matrix/table:
z_table = matrix(p_z,ncol = 10,byrow = TRUE)

# Add row and column names to matrix:
rownames(z_table) = seq(0,9.99,b = .1)
colnames(z_table) = seq(0,.09,by = .01)

# Evaluate z and search table for corresponding p-value:
#       [["col","row"]]

# ...or use the following way without having to separate the digits:

# Change class to data_frame:
table =

# Define random z:
z_score = 3.23 # only use value up to two decimal places

# The following lines separates the first 3 digits into the
# [["col","row"]] structure, so you don't need to do yourself:
# substr(x,start,stop):
rowDigit = as.numeric(substr(z_score*100, 1,2))/10 
# [1] 3.2
colDigit = as.numeric(substr(z_score*100, 3,3))/100
# [1] 0.03

# Execute the next line to find the respective value
# in the table:
table[which(rownames(table) == rowDigit),which(colnames(table) == colDigit)]
# [1] 0.999381
# [1] 0.999381

Fig. The plot on the left shows that the z-table basically consists of the upper part of our normal CDF. On the right side you can see parts of our z-table.

Only the upper part of a normal CDF is used for the table. 1-pnorm(z) delivers a list for negative z-values for the left half of our Gaussian, so to speak (recall the symmetry of Gaussian functions).

We can also invert our CDF into a so-called quantile function. We will not reproduce the whole math here for that, but a quantile function delivers an important perspective. We can use the qnorm() function in R. In nuce: with the quantile function we can evaluate a specific z-score given a specific p-value or percentage (e.g., z-score for a confidence interval, see further tutorials). In other words: it is just a “tilted” CDF, where the z-score is now located on the y-axis and the probability on the x-axis (not vice versa):

# Use the qnorm function for the inverse CDF:
# Note that we used a rounded score as example
# before:
qnorm(0.990,0,1) # for a p-value threshold of 1%
# [1] 3.229977

seq = seq(0,1,by=0.0001)
quant_func = c() # empty vector
for(i in 1:length(seq)){
  quant_func[i] = qnorm(seq[i],mean=0,sd=1)
} # End for i

# Plot Quantile function:
plot(x=seq,y=quant_func, type = "l")

Fig. Plot of the inverse CDF, i.e., the quantile function. On the x-axis we have p-values ranging from 0 to 1. On the y-axis we see the corresponding z-score values (the x-axis of our norm. PDF).

Apart from a z-score table there are also so called t-tables. We will get to the t-test in the next part of this series, but we want to explain the main difference upfront: The main difference between a z-table and t-table is the fact that the t-score uses the sample variance/sd and the z-table uses the population variance/sd in its calculations.

The decision which test to choose also involves the size of a sample:

Fig. This flow chart can be used to decide which test to take. Of course, a R function can be written in a way that it infers based on the above logic by itself. The flow chart was adapted from Statology. The constrain of n>30 is essentially a rule of thumb, again in order to be able to argue for the use of a standard normal PDF via the central limit theorem: as we will see in later chapters, a t-distribution, which has a very similar shape to a Gaussian, but entails Bessel’s bias correction on the variance/sd, also approaches a normal distribution, the higher n is; however, the choice of n>30 is still rather arbitrary, but the conceptual difference to a standard normal PDF is low, as we will see.

A special case of the z-score is the effect size (Cohen’s d effect size), which does not standardize a whole set, but the mean of the sample as input only:

In the next part of this series we will further discover the difference and the relation between the z-score, the z-statistics and Cohen’s d effect size.

4 Brief Outlook into Inferential Statistics V

In the next part of this series we will go fully into discussing z- an t-tests for both regular sets of values and for linear regression models. In the last chapter we essentially touched the topic of one sample z-tests, i.e., comparing a sample with a population (different to two-sample tests, where two samples are compared with each other). We will also thoroughly discuss what a one-tail and two-tail test is and when to use them. Plenty of furry examples will await you, in order to get an overview of when and how to use certain configurations of z- and t-tests. ^^

Ms. Miranda, longing for feedback. Did any of this makes sense? We would like to hear from you! Similar to our open review process, every of our articles can be commented by you. Our journal still relies on the expertise of other students and professionals. However, the goal of our tutorial collection is especially to come in contact with you, helping us to create an open and transparent peer-teaching space within NOS.

No comments here
Why not start the discussion?