Skip to main content# [R] Inferential Statistics III: Obtaining an Optimal Linear Model via Descriptively Evaluating Conditional Linear Relations

# 1 Introduction:

# 2 Recap on Linear Modelling and some Thoughts on why Alternative Mathematical Methods exist and come in handy

# 3 Descriptive Statistics and Conditionally Linear Relations ─ Mean**, **Sum of Squares, Variance and Covariance

## 3.1 Arithmetic Mean / Weighted Arithmetic Mean / Expected Value

## 3.1.1 The Ambiguous Use of the Term Expected Value

## 3.1.2 Probabilistic and Non-Probabilistic Weights

#### Non-Probabilistic Weights:

#### Probabilistic Weights:

## 3.2 Sum of Squares

## 3.3 Variance

## 3.4 Covariance

## 3.5 Conditional Probabilistic and Conditional Linear Relations

## 3.6 Standard Deviation

# 4 Super Brief Outlook on Inferential Statistics IV

Beginner's Stat-o-Sphere

Published onJun 30, 2023

[R] Inferential Statistics III: Obtaining an Optimal Linear Model via Descriptively Evaluating Conditional Linear Relations

** Review:** This article was internally reviewed by the BEM-Editors Rico Schmitt and Philippa Schunk.

*Corresponding R script: *

*Summary:*

Welcome to the third part of our tutorial series on inferential statistics! Note that you are still in the **beginner’s spheres**. We are also including optional recaps on the first two tutorials. So, in case you missed the first two parts or if you felt they were too basic or too complicated (previous tutorial) for your needs, **up to here** **it should not be too much of an issue starting with this part of our series right away**. We will still increase the pacing a bit in this tutorial and will get closer to hypothesis testing in the probabilistic sense and with working with probability distributions, which will be a matter of the *fourth* *part* of this series.

In the first two parts of this tutorial series, we focused on the general concept of hypothesis testing (conditional probability) and linear modelling (conditional linear relations). **In the third part of this series, we are going to look at structural similarities between both concepts by taking a closer look at what we called a conditional linear relation.**

The **first goal** of this tutorial is to learn about ways to **just** **descriptively evaluate our statistical linear model via** concepts such as the **mean**, the **variance** and the **standard deviation** (in the future we may also provide a tutorial on descriptive statistics from a more general perspective). In the fourth and fifth part of this series we will relate these mathematical concepts to the concept of distributions of various kinds. Namely uniform and especially (standard) normal distributions.

The **second goal** of this tutorial is, as mentioned: taking a closer look into what we have called a **conditionally linear relation **so far and **how it can be evaluated via the variance and the covariance of our linear model, resulting in the regression coefficient of our optimal model (its optimal slope)**. In other words: **evaluating the optimal slope** **of our model can alternatively be done via descriptive measures such as the mentioned forms of variance, as well. This method will also turn out to be mathematically less laborious. **Our alternative approach will also result in a surprising but intuitive formal relation between conditional probability and on statistical relations, which concern (estimate) values of measurements and not probabilities. However, the downside of this approach will be that it is not clear at first sight how the formula relates to our “real life problem” of drawing an optimal straight line through data ourselves. Therefore we started with the classic linear least square method in this series, which is more laborious, but allows for a much better “visual intuition”. You may think different, In this case we hope starting with this tutorial for itself will serve you well.

In the second part of this series, we have used one of the simplest methods to obtain a fitted linear model. It was simple in the sense that we could relate the process very well to what we wanted to achieve: draw a straight line through data with optimal values for the intercept and the slope. By formulating our error function (our partial differential equation), taking the derivates, setting them to zero and doing some rearrangements we obtained a generalized system of equations that we can use to obtain a optimal model with the least squared errors which can be used for *any set* of **least sum of squared errors** when **predicting (new) data points**. We also discussed how this process can be related to classic neural networks performing gradient descent.

We have also written our own

function that does all the calculating for us and which delivers an equivalent output to the basic function of **linear_least_square()**

in R. To do so we translated the above system of equations into the form of code. We therefore don’t have to do all the calculating by hand. **lm()**

Below you will find the code for the complete

function: **linear_least_square()**

```
# Go-go-gadgeto linear_least_square!!!!
# Replication of the basic math of the lm() function:
linear_least_square = function (x,y){ # Start of function
# "Guess function"
lmRandom = function(x) (x)
# Setting up system of equations:
left_a_r1c1 = sum(a*length(x)); left_b_r1c2 = b*sum(x)
left_a_r2c1 = a*sum(x) ; left_b_r2c2 = b*sum(x^2)
right_r1c1 = sum(y)
right_r2c1 = sum(x*y)
# Now we will set up the above as matrix (left) and
# vector (right) objects:
left = matrix(c(left_a_r1c1 , left_b_r1c2,
left_a_r2c1 , left_b_r2c2),
ncol=2, nrow=2, byrow = TRUE)
right = c(right_r1c1, right_r2c1)
# Now we can solve our system of equations via:
Result = solve(left,right)
# Fitted function:
lmFitted = function(x) (Result[1]+Result[2]*x)
SumError2fitted = sum((y-lmFitted(x))^2)
# Code for nice console output
cat(" Linear least square method in R","\n","\n",
"Independent variable:", "\t", deparse(substitute(x)),"\n",
"Dependent variable:", "\t", deparse(substitute(y)),"\n","\n",
"alpha", "\t",Result[1],"\n",
"beta","\t",Result[2], "\t","SumR2", "\t", SumError2fitted)
# Plot Results:
plot(x=x,y=y, ylab = deparse(substitute(y)),
xlab = deparse(substitute(x)))
abline(a=Result[1], b=Result[2], col = "darkblue")
} # End of function
```

We can again use our furry data set to compare the output of our

function with the output of the regular **linear_least_square()**

function:**lm()**

```
# lm(dependent~independent)
lm(cup~time)
```

```
# linear_least_square(independent,dependent)
linear_least_square(time,cups)
```

The output of our function also entails the sum of the errors squared (SumR2) of our optimal fitted function. Recall that R^2 in the

output is not the same as errors squared or residuals squared (has a different formula, but we will get there in the later parts of this tutorial series). The results of both functions will lead to the same plot output, suggesting that Lady Meow ^°^ is drinking around 1 cup of tea per hour.**summary(lm(y~x))**

**However, there are other ways, other formulas to obtain the optimal parameters of a fitted linear model function.** The reason why alternative methods exist is manifold. One reason is the **universality of different mathematical concepts**: Probability theory for example can be looked at from the perspective of linear algebra (e.g., conditional probability (Markov processes)); we dipped into some very basic linear algebra when solving our system of equations in the former part of this series. Another complex example is a basic mathematical neural network that can be generalized down to become a regular logic circuit (NAND-gates etc.) and can therefore be used to do any kind of math equivalent to a regular computer. In the above two cases alternative methods are *at least* used to prove a point or to expand methods and perspectives on methods.

**Other reasons are based on pure pragmatism:** some calculations are easier to do “by hand”, other formulas and algorithms may be especially convenient to be processed by a computer, i.e., by a certain information technological architecture (such as graphic chips (GPU) which are e.g. used for matrix multiplications (CUDA), as it can do a lot of calculations in parallel instead of in series (in a CPU); in computer games graphic chips are used to obtain a high frame rate when playing the game; in the field of machine learning CUDA just lower the processing duration in general).

In other words: some algorithms run *faster* and therefore more efficient on (certain) computers (with certain hardware and software). A lot of nowadays common procedures in statistics also go back to a time where computers weren’t common (not that long ago). A classic example we will learn about in the fifth part of this tutorial series is the use of z-tables/t-tables. These tables are use to read out the integral of a standard normal or t-distribution up to certain value of z/t (the standardized/studentized sample mean), so that the actual integration does not have to be performed and can be just read out of the table...

**We are going to mostly look at one particular alternative formula that has the same conditional structure as conditional probability** and that is at least conceptually known to us: **conditional linear relations** in terms of the value of the *dependent* variable, given that the value of the *independent* variable is/has changed by 1 (*regression coefficient / slope*).

**Via the** **covariance of **** and **** and the variance of **** we will be able to calculate our slope **** in **** with the same formalism as used in conditional probability** ─ the covariance serving as joint estimate and the variance of

**For some orientation:** In the fifth part, when approaching the z-test / t-test, used to obtain a p-value, we will be able to check how likely such a conditionally linear relationship actually is (or unlikely, given that we test the relation under the condition of a *null hypothesis *and some extra constraints we will get into later). **What will be the hypothesis?** A t-test / z-test for linear regression models essentially checks if the slope is greater than zero. No slope: ─ no change: ─ no linear difference in the data, ergo:─ *not* dependent on the variable

**Next up we are therefore going to look at very basic values in statistics, such as the mean, the variance and the standard deviation (also part of descriptive statistics).**

In this chapter we will go through some very basic concepts in statistics, **some of them being part of what is called** **descriptive statistics**, such as the mean and the standard deviation.

Along the way we will get to know the mentioned alternative method of calculating our regression coefficient **mean**, the **variance** and the **standard deviation** when we approach the topic of **probability density functions**, which will be a part of evaluating our linear model under a “probabilistic perspective” (a part of calculating the t-value/z-value and subsequently the famous p-value).

**We will start with an absolute classic of classics: the (arithmetic) mean.**

There are several different kinds of “mean values”, even though the basic formula remains the same. We will start with the general concept of an *arithmetic mean*, which can be understood as evaluating an average or central tendency of a value of a data set (set of values; distribution of values). It is referred to as arithmetic, since it just operates with most basic arithmetic operations: sum and division. There is also the *harmonic mean (is the arith. mean of the reciprocal of the values of a set) *and the *geometric mean (uses the product not the sum of values of a set)*. All of the three mentioned kinds of mean are part of what is called the *Pythagorean means*.[1] However, we will stick to the **(weighted)** **arithmetic mean in this tutorial**.

**There are two kinds of arithmetic mean: the sample and the population mean.** A population is a rather abstract concept, as it refers to an absolute quantity of, e.g., “all humans” or “observed over all time” and is therefore in general something that is seldomly given in the form of a data set: It is quite difficult to obtain data from all humans and it is also not necessary to test all humans in order to obtain evidence for an outcome. In general and philosophically spoken the attributes “all” or “everything” may never really fit an empirical scheme, thinking about it (reflecting on boundaries of space and time when performing inference).

However, **the mathematical formula for the sample and populations mean is the same**, just the size of

**Recall that our second synthetic data set can be understood as three samples, so to speak: It is not a population (observing Lady Meow for her whole life or every tea she drank), but however more than a single sample data point (three runs/days of observing Lady Meow ^^). Note that the sample size orientates on the length n of the events (or relations between y = cups of tea and x= time in hours). So the sample size of our synthetic data sets on observing Lady Meow will be n=33 each (33 events that we observed in each of our synthetic data sets). **

```
### First set with idealized linear relation f(x) = x:
# Cups of tea
cups_ideal = c(0:10,0:10,0:10)
# Time
time_ideal = c(0:10,0:10,0:10)
# Sample size n
length(time_ideal)
# [1] 33
### Second set:
# Cups of Tea:
cups = c(0, 1, 2, 3.5, 4.8, 5.2, 6, 6.9, 8.5, 9.1, 9.9, # run 1
0, .6, 2.6, 3.1, 4.8, 6.9, 7.1, 7.5, 8.5, 9.9, 9.9, # 2
0, .2, 2.9, 2.9, 4.9, 5.2, 6.9, 7.1, 7.3, 9, 9.8) # 3
# Time
time = c(0:10,0:10,0:10)
# Sample size n
length(time)
# [1] 33
```

**In the code below, we will again use the second set of data for the sample as well as the population for simplicity,** as the concept of a whole population is a rather abstract matter anyways, since we are at least theoretically able to observe it, but practically it would be at least very hard to do so (e.g., observing or testing all humans is hardly possible and there are also mathematical reasons why we should not and do not need to do so in order to obtain or boost evidence).

**The formulas for the two kinds of arithmetic mean are as follows:**

The population mean is often denoted as *μ (mu) *in contrast to the sample mean *.*

```
# Arithmetic mean for the sample x and y of our second synth. data set,
# i.e., the sample mean:
x = time
y = cups
n = length(x)
mean = sum(x)/length(x)
mean = (1/n)*sum(x)
# Using the built in R function mean(x):
mean = mean(x)
# Note that in the case of x = c(0:10,0:10,0:10)
# the mean is equivalent to the median:
mean(x) == median(x)
# [1] TRUE
```

In the formulas above we saw that the population mean is denoted as *μ *in contrast to the sample mean *. *However, the population mean *μ *is often also referred to as the **expected value, even though it is not always the same.** In general, the expected value or expectation value is mathematically also denoted as:

This kind of expected value notation is usually related to the use of the weighted arithmetic mean using probabilistic weights and is not related to the use of the term expected value as population mean, determent on **However, before we discuss the weighted arithmetic mean and probabilistic weights,** **let us first look at the relation between the term expected value and population mean:**

Another way to define the expected value is *not* precisely as an actual population mean, but as representing the mean of previous studies that form a “population of means”, since observing the actual population is usually not the case. In that sense all previous studies can be understood as an *approximation of the population mean*, since further observations makes the gap between *new* sample and *bunch* of samples ‘heuristically’ serves the inequality between population and sample

**In the next part of this series we will show how this heuristic is grounded in the so-called central limit theorem, concerning normal distributions (parametric Gaussian functions). The central limit theorem says** that when you sample and take the mean of a sample from a distribution of any kind of form (logarithmic functions, uniform functions, linear functions…) the more sample *means* we obtain from that randomly formed distribution, the more it will apporach a normal distribution of sample means (over time, so to speak). This is one of the reasons why normal distributions are so common in (frequentist) statistics, since it makes no difference at some point or size of **A truly fascinating circumstance that we will discuss and widely make use of in the next part of this series.**

**To wrap this up:** Think of the difference between a sample and a population and the difference between a sample and a bunch of samples *of a* population as a *part-to-whole relationship*, all the way down serving the same inequality of

**Furry example:** Say, we have observed the tea drinking behavior of Lady Meow for a couple of years. The mean of all the previous study outcomes on the amount of tea lady meow drinks on average could be used as expected value. In other words, all previous studies of the same kind (observing Lady Meow ^^) can be understood as population, again in the sense of a part-to-whole relationship: the current sample in relation to all previous study samples. Evaluating a difference between these two is essentially what is evaluated in a z-/t-test: a test for a difference in means. **What could we do with such an expected value of cups of tea in a day of Lady Meow?** Well, let us say that Lady Meow drinks less tea when she is sad. When our sample mean deviates a lot from the expected value (from years of experience) it could mean that Lady Meow needs help, since she is sad and therefore not feeling well enough to go after her favorite habit.

Same goes for the slope and intercept of a linear model: we may expect a certain difference, given a change in

**So, if some of what we discussed above concerning the meaning of the term expected value still appears confusing to you: recall that the definition is actually already in the name: **the expected value is the expected *mean/average* or in general **our expectation of how things will turn out**, since we have been fully absorbed by Lady Meow and have watched her many, many times drinking tea in the past, so we can say we know her quite well already. Therefore, we will have an expectation of her behavior, given previous data (a population of samples so to speak)...

**In a more general usage of the expected value** it not only reflects the “regular” average but also the **weighted average**.[2] This will not be so important for our simple linear model, but we will at least show how this is done with estimates and with a probability distribution as weight:

**If the weights are equally distributed (equivalent to a uniform prior as a weight on the likelihood for instance) the weighted arithmetic and the regular arithmetic mean will result in the same. Think of a uniform weight as not adding any weight at all, since every value has the same weight (so it is actually unweighted). **

Note that the

function in R can also handle weights of estimates (weighted linear regression!). Check **lm()****?lm()**** **concerning the input parameter “weights”.

**The next two chapters can be considered optional,** since we will not need to understand the below to understand the remaining parts of this tutorial, as well as further parts of this series.

**As mentioned: a weight can be probabilistic or non-probabilistic. Let us start with non-probabilistic weights.** A non-probabilistic weight could be, e.g., the differing number of participants in two groups.

For a numeric example we will go through the rather simple example on the Wikipedia page on the weighted arithmetic mean (03.2023). The example is about the test grades of two classes with different sizes of **In such a case we can weight **** every grade of each group by the number of participating students**.

In the Wikipedia example one class had 30 the other 20 students. **However, the calculation of the numeric example on Wikipedia will slightly differ from the formula we provided above (which is also used by the ****weighted.mean()**** function). It is probably wrong on Wikipedia, but teaches a good lesson, we think.** On Wikipedia they first calculated the mean of each sample / each class and *then* weighted it. As mentioned, the *actual* general weighted mean would weight each student’s grade with the respective length of its sample group (the **then** average it out by the sum of all weights.

We will only go through the correct use of the formula of the weighted arithmetic mean. **In the script of this tutorial, we also provided the code to calculate the actual Wikipedia example. The fact that the Wikipedia example is deviating from the actual use of the formula is a good example for gaining awareness to be cautious when using open educational resources on the internet (especially when there are no peer-review options; Wikipedia is in general not well reviewed, I’d say, even though when there are some articles or parts of them that are done well). However, R can be used well as a tool to evaluate numeric examples, as it has built in functions to check on both the formula and the output of an example. **

Apart from the formula, the R function

can also be used (see code below). This is how I was able to check if my use of the formula was correct:**weighted.mean()**

```
# Weighted arithmetic mean - Modified Wikipedia example:
# Samples:
grades_class_1 = c(62, 67, 71, 74, 76, 77, 78, 79, 79, 80, 80, 81, 81,
82, 83, 84, 86, 89, 93, 98)
grades_class_2 = c(81, 82, 83, 84, 85, 86, 87, 87, 88, 88, 89, 89, 89,
90, 90, 90, 90, 91, 91, 91, 92, 92, 93, 93, 94, 95,
96, 97, 98, 99)
# Regular arithemtic mean of the grades of each sample / class:
mean_class_1 = mean(grades_class_1)
# [1] 80
mean_class_2 = mean(grades_class_2)
# [1] 90
# Regular arithmetic mean of the grades of both classes:
mean_grades = (mean_class_1 + mean_class_2)/2
# [1] 85
# Weights (here weight == n):
n_class_1 = length(grades_class_1)
n_class_2 = length(grades_class_2)
# Now we will turn the above weights and grades into a vector
# of weights and grades. To do so we can use the rep() function that
# repeats a value up to a defined length:
weights = c(rep(n_class_1,length(grades_class_1)),
rep(n_class_2,length(grades_class_2)))
grades = c(grades_class_1,grades_class_2)
# Evaluate weighted mean of both samples(!!):
weighted_mean = sum(weights*grades)/sum(weights)
# [1] 86.92308
# We can also normalize weights (that they sum to 1):
norm_weights = weights/sum(weights)
norm_weighted_mean = sum(norm_weights*grades)/sum(norm_weights)
# Alternative via R basic stat. functions:
R_basic = weighted.mean(grades,weights)
# Check:
all.equal(weighted_mean, norm_weighted_mean, R_basic)
# [1] TRUE
```

A simple example for **probabilistic weighting **is the weight of the dice values of a *fair dice*, **weighted by their respectively uniform probability of occurrence of each dice value**.

Arguing that the dice is truly fair also lets us at least logically relate the expected value with the population mean: fairness argues that after infinite observations, the occurrence of each dice value will turn out to be weighted fairly. We will get into phony dices some other time… **In the case of probabilistic weights, the formula of the expected value (weighted arithmetic mean) can also be abbreviated and written as follows:**

It is the same formula as before, just lacking the denominator. Why? Take a look: With the denominator the formula would be as follows:

**Now, recall that we know that the probability vector **** will alway sum to 1, in other words: the denominator will always be 1 and can therefore be abbreviated** **in the case of probabilities**, which may blur the obvious relation to the weighted arithmetic mean a little at first glance.

```
# Expected value as weighting a regular distribution of data
# points (here dice values) by a uniform probability
# distribution:
# prob. vector
dice_prob = c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
1 == sum(dice_prob)
# [1] TRUE
dice_value = c(1:6)
EX_dice = sum(dice_prob*dice_value)
EX_dice_alt = sum(dice_prob*dice_value)/sum(dice_prob)
# [1] 3.5
all.equal(EX_dice, mean(dice_value),EX_dice_alt)
# [1] TRUE
```

In case your heart only beats in the pace of the time that Lady Meow is drinking her tea (<3), we can do the same with the independent variable of our second synthetic data set, i.e.,

.**x = c(0:10) # hours**

```
x = c(0:10) # hours we will use 0:10 only once
# Obtain prob. value by calculating 1/n
# (a kind of normalization):
probX = 1/length(x) # uniform probability distribution
probXvec = c(rep(1/length(x),length(x))) # alternatively represented as
# probability vector that
# sums to 1:
1 == sum(probXvec)
# [1] TRUE
# Weighted values of x:
weighted_x = x*(1/length(x))
weighted_x_vec = x*probXvec # alt. via prob. vector
weighted_x == weighted_x_vec
# [1] TRUE
# weighted_x (uniform weight):
# [1] 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909
# [7] 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909
# Expected value as the sum of weighted values of x:
EX = sum(x*(1/length(x))) # sum of weighted_x is in this
# special case...
EX = sum(probXvec*x)
# [1] 5
EX == mean(x) # ... equivalent to the unweighted
# arith. mean of x
```

In order to obtain the variance and the standard deviation, we have to look at the so-called sum of squares. We have worked with the *sum of (least) squares* before, or more correctly: with the sum of* squared errors / residuals*. The *sum of squares* is not the same as the residual errors. **The sum of squares refers to the ***variability*** or deviation of the values of a set from the mean of that set ─ only slightly different as the variance or the standard deviation does, as we will see. **

**CAVE: Importantly the variance and standard deviation actually only really make sense, when understanding its basic structure: the total sum of squares.** **We promise that the next chapters will be very enlightening when it comes to understanding and never forgetting the relation and difference between the variance and the sd. **

The *sum of squares* is also referred to as *total sum of squares. *Below you will find the formula of the total sum of squares as presented as the summed and squared deviations of

Here is some code, using our second synthetic data set again. The numeric value of the sum of squares does not make a lot of sense for itself, so we will dig in deeper into the formula.

```
# Using the second synthetic data set from Inf. Stat. II again:
x=time
y=cups
### (Total) Sum of Squares (different forms):
# SumSqXX = ∑(x_i-E[X])^2,
# where E[X] = 〈x〉 = μ_x = ∑(x_i)/n = expected value
SumSqXX = sum((x-(sum(x)/length(x)))^2)
SumSqXX = sum((x-mean(x))^2)
# SumSqYY = ∑(y_i-E[Y])^2
SumSqYY = sum((y-(sum(y)/length(y)))^2)
SumSqYY = sum((y-mean(y))^2)
# SumSqXY = ∑((x_i - E[X])*(y_i - E[y]))
# We will dig into this one deeper later:
SumSqXY = sum((x-mean(x))*(y-mean(y)))
```

**Note that the process of subtracting the mean from a set of values (without summing) is also called centralization.** It results in a set of values that “centralize around the mean”, i.e., the closer the initial positive/negative value has been to the mean, the lower/higher the centralized value will be, until it approaches zero, when the initial value of the set was equivalent to the mean. In some way, we can look at it as a kind of specific error function as well (figure below).

**In contrast to the squared residuals**, where we looked at the deviation of

**Note that some values in the plot below, showing the result of an unsquared centralization, are negative and some values turn out to be positive**. In our idealized example below (

), we would actually obtain a total sum of the **c(0:10) - mean(c(0:10))***un-*squared deviations from the mean of exactly 0, which makes no sense, looking at the plot of it below (indicating a lot of deviations). In order to avoid that positive and negative values just cancel each other out, we have to square the centralized values before summing.

```
# Pure deviation from the mean (without summing and squaring!):
dev_of_mean_X = x-mean(x)
# Plot of dev_of_mean:
plot(x = x, y = dev_of_mean_X)
abline(h=0)
# Total sum
sum(x-mean(x))
# [1] 0
```

Let us do some further plotting of the individual deviations from the mean, before summing them: As mentioned, squaring leads to values that will be all positive or equal to zero (the standard deviation, which starts with the sum of squares, is therefore always cast as **absolute value** where, e.g.,

Plotting will lead to a graph with a discrete distribution where the height (y-axis) indicates how much each value of

```
# Pure deviation from the mean (without summing!):
dev_of_mean_X = x-mean(x)
# Plot using absolute values via function abs():
plot(x = x, y = abs(dev_of_mean_X))
# Note: Only 0 deviation given when x_i = 5 = mean(x)
# Plot using the square, which smoothes the graph:
plot(x = x, y = dev_of_mean_X^2)
```

**The above plot is also supposed to make it somewhat easier to numerically understand from step 1 (the sum of squares) how values such as the mean, the variance and the standard deviation are in the end related to a (bell curved) Gaussian probability distribution. Specifically, the characteristic symmetry of the plot above makes it hopefully somewhat intuitively plausible how values such as the standard deviation can be related to a normally distributed probability density function. However, we will get into all details of probability (density) functions in the ****next part of this series****. **

**Upfront for those impatient getting an idea on Gaussian functions:** The number

As the total sum of squares is the basis of the variance and the standard deviation, we can conclude that all of these values actually reflect on deviations from the mean in some way. It is therefore not totally wrong when you have difficulties to semantically differentiate the variance from the standard deviation (since it both implies variability of some kind, when just reflecting on the words variability and deviation).

However, **different to the variance, the total sum of squares does ***not*** directly reflect on the ***sample ***or*** population size***,** when reflecting on the variability of values around the mean (centralization), even though the size of the sample/population is indirectly entailed (when summing all *total sum of squares* by the length of

**The total sum of squares can therefore in general be understood as the first step of calculating the variance. The variance on the other hand can then be understood as the ***average*** squared deviation from the mean. Even though the term variance is often in general used to indicate variability, recall the variability starts with the step of ****x-mean(x)**** as essential part of the sum of squares (centralization). **

**QUICK OPTIONAL SIDE QUEST:** We can also test our vector

for normality via the **x****Shapiro-Wilck test** for example (we won’t get into the actual math of this test). However, this is what we would check on when choosing a hypothesis testing method. In the case that the data turns out to be not normally distributed, we would use a rank test instead of a t- or z-test. We will catch up on what rank tests are some other time…

```
shapiro.test(x)
# OUTPUT
# Shapiro-Wilk normality test
# data: x
# W = 0.94191, p-value = 0.07723
# IMPORTANT: A p-value above 0.05 indicates that normality
# is given!! Here the null-hypothesis is that
# the data is normal. We cannot discard the
# null hypothesis, so normality is given.
```

As we already know: Formally the (population) variance is just the total sum of squares divided by N. **Therefore, you can** **conceptually think of the variance as the average squared deviation from the mean, or “the mean of squares”, not only its sum!** As mentioned before, it is quite hard to interpret the numeric value of the sum of squares by itself. Same goes for the variance. If we were to plot the variance, we would also need to plot an area, since it is squared (we would need to plot a literal square — this is why we take the square root for the standard deviation later below). **Nevertheless, understanding the concept of variance in terms of (averaged) variability, we can in general say: the lower the value of the variance, or in other words: the less variability the better, as a low variability means it will stick more closely to our expectations**. The variance for itself though does not come with a numerical base line, as it depends on the values of the set. A variance of 100 — is that low? We could compare the variance of several samples though, in order to get some insight…

Note: The possibility of some form of comparison does not hold for the total sum of squares for itself, as it lacks an interpretation of

Similar to the arithmetic mean, the variance comes in two forms: the **population and the sample variance:**

**The difference between sample and population variance is again rather simple:** The population refers to a complete set of

**a) **Instead of the sample mean **mean of the whole population 𝜇 for centralization in the population variance**, which can deviate from

**a) **A *higher* than the population variance, in order to compensate for the lack of information we have on the behavior / data of a population. [4][5] This is also referred to as **Bessel’s correction**. The choice of *not*** arbitrary**, but we will not go into further details on why it is carefully chosen here (we will discuss it in the form of degrees of freedom when discussing the t-test).

**Note that you will come across several types of bias correcting modifications in other contexts of statistics.** Also note that such corrections also do not correct for every kind, but specific types of biases.

Let’s do some coding:

```
# Variance := σ^2 = SS/n := the variance is the mean of
# the sum of squares
# Variance of x:
sampVarX = SumSqXX/(length(x)-1) # := sample variance == var(x)
sampVarX = var(x) # default var(x) == sample variance
popVarX = SumSqXX/(length(x)) # := population variance
# Variance of y:
sampVarY = SumSqYY/(length(y)-1) # := sample variance == var(y)
sampVarY = var(y) # default var(y) == sample variance
popVarY = SumSqYY/(length(y)) # := population variance
```

**Next **we are going to evaluate the **covariance**, which will lead us to the mentioned alternative way of obtaining the **optimal slope** **of our linear model**.

**For the covariance we now need to talk about the** **sum of “squares” of **** ***and*** ****.** You might have noticed in the code examples above that we have actually already calculated the total sum of squares regarding *or* **instead of just taking the square! Taking the square is not needed in this case of “co-variability”. **

**In case you wonder why this is the case**, recall what a square is in the simple geometric sense: a perfect square forms an area with equal length on each side, such that the area of such a two-dimensional square can be calculated via *not* actually involve a square with equivalent side length, but a rectangle: Calculating the area with unequal length of the sides is simply obtained via multiplying **explains the whole ****, **** and **** notation so far. **

*Most import:*** **different to the variance and later the standard deviation, the **covariance does not necesserily have to be positive**, since the values are not just squared! We will get to the meaning of this special characteristic again in a bit!

```
# SumSqXY = ∑(x_i - E[X])*(y_i - E[y])
SumSqXY = sum((x-mean(x))*(y-mean(y)))
```

Using our “co-sum of squares” to obtain its variance is simply called the covariance; apart from that it has the same formalism as the variance: We can again distinguish between a **sample** **and** **a** **population covariance (i.e., including bias correction on the sample covariance)**:

```
# Covariance:
covarXYsamp = SumSqXY / (length(x)-1) # sample covariance
# [1] 10.39687
covarXYpop = SumSqXY / length(x) # population covariance
# [1] 10.08182
covarXYsamp = cov(x,y) # samp cov := cov(x,y)
# [1] 10.39687
```

**Recall all the great time we had with Lady Meow and take a close look at the population covariance:** **Do you notice anything unusually coincidental with the population covariance? **You are right, it is numerically preeetty close to our beta estimate of our linear model (second synthetic data set)!** :O**

To understand why this is the case let us first have a look if this is only coincidence: We can do so by checking different linear models of different data sets and see how their slope relates to the value of the covariance. After some time, we might notice the following relation: **A positive covariance relates to a positive value of beta and a negative value to a negative slope.**

Below you will find the code to replicate the graphs above:

```
# Example scatter plot for cov(x,y) > 0:
plot(x,y)
cov(x,y) > 0
# [1] TRUE
# Simple mirroring of our data as an
# example scatter plot for cov(x,y) < 0
plot(x,-y)
cov(x,-y) < 0
# [1] TRUE
# Here another (synthetic) example for cov(x,y) = 0:
cross_x = c(5, 5.5, 4.5, 5 , 5 , 5.25, 4.75, 5 , 5 )
cross_y = c(5, 5 , 5 , 4.5, 5.5, 5 , 5 , 4.75, 5.25)
# Plot data:
plot(cross_x,cross_y, xlim = c(0,10), ylim = c(0,10))
cov(cross_x, cross_y) == 0
# [1] TRUE
```

Below you will find the output for a linear model of the latter data set (the cross).

```
linear_least_square(cross_x, cross_y)
# Linear least square method in R
# Independent variable: cross_x
# Dependent variable: cross_y
# alpha 5
# beta 0 SumR2 0.625
# Using lm(y~x) only:
lm(cross_y~cross_x)
# plot(cross_x, cross_y)
# abline(lm(cross_y~cross_x)) # color default = "black"
```

The output of

is equivalent to the one of our own **lm(y~x)**

function:**linear_least_square()**

Below you find the plot of the function

An alternative of a cross would be a ball shaped set of data points. The Wikipedia article on covariance [6] presents it like this:

**Thinking of a model with a slope of zero, we can now already answer what the hypothesis for our linear model will be regarding the slope, when testing our linear model for significance:** Think about observing Lady Meow drinking tea again. How useful would a model be with a slope of 0? Let us say, we start observing Lady Meow at

```
### Alt. universe where Lady Meow is drinking no tea at all :'(
# Setting up a respective data set: The rep() function is used
# to obtain a vector c() with 11 times the value 0. We need
# 11 values, as x goes from 0h to 10h, resulting in 11 values:
data = cbind(x=c(0:10),y=c(rep(0,11)))
data = as.data.frame(data)
cov(x=c(0:10), y=c(rep(0,11)))
# Plot data:
plot(data)
# Or use:
plot(x=c(0:10), y=c(rep(0,11)))
# Model of that data set:
lm(data$y~data$x)
# Call:
# lm(formula = data$y ~ data$x)
#
# Coefficients:
# (Intercept) data$x
# 0 0
```

**Of course, such a model with a slope of zero would be useless, as it shows that **** never changes when **** changes, so no dependent relation is given (**** is also independent in this case). The ****summary(lm())**** function even returns a ****NaN**** for “not a number” as p-value in this case. If we had a vector **** with only ones, the output would essentially tell us that the model is unreliable. **

We will therefore always hypothesize a slope that will be *greater* than zero (alternative hypothesis), when doing hypothesis testing on the slope (the null hypothesis will be that the slope is actually zero). Probabilistically evaluating the intercept also assumes a non-zero crossing point as alternative hypothesis. **Testing the intercept however is not always important and a question of logic and concept what to make of it.** When observing Lady Meow, we know that we will always start with a value of zero cups of tea, right after Lady Meow woke up. A value of zero in this case is therefore totally sensible. A model where the independent variable is referring, e.g., to the IQ of participants does not make sense being 0 (0 IQ can’t be measured), so it is rather abstract to interpret the intercept for itself in many cases, so **the intercept and its p-value are therefore often neglected.**

Closely related to the covariance is the so called **‘correlation’ (also referred to as Pearson correlation coefficient)**. The formula only slightly deviates to the covariance, but we will catch up on that in later parts of this tutorial series.

In this chapter we are essentially again going to evaluate the optimal **In case you want to precede to the standard deviation, you can also skip this part. However, we recommend to at least read the summary of this chapter to get the idea behind it.**

As mentioned before, conditional linearity and conditional probability are closely related to each other by formula. We have already figured out that the population covariance is very close to the optimal slope of our model. Looking at a population covariance of 10.08182 and a regression coefficient of 1.008182, we can argue that the **covariance** appears to be **proportional** in some way **to the regression coefficient (we will get to the intercept later below)**. If we were to divide the covariance by 10 (“put things into proportions”, see below), we would obtain the value for the optimal slope of our model. :O

**Now the twisty part:** Giving up probabilities, and sticking to actual values (estimates), we will see that a conditional linear relation in terms of the optimal slope of our function (regression coefficient) is by structure the same as the posterior probability

**What does proportionality exactly mean mathematically?** Our well known joint probability for example can also be understood as being in a **proportional relation** to the posterior probability. Mathematically this essentially just means that the joint *hasn’t* been normalized yet (or “counter-weighted” by

The mathematical symbols used to denote proportionality are either the tilde (~) or ∝, which looks similar to alpha (α), just more symmetric. The (“un-counter-weighted” or non-normalized) **proportional relation** **of the joint probability and the posterior** in conditional probability / Bayes’ rule is denoted as:

**Again, mathematically proportionality can be understood as actually formally missing but indicating a (counter-) weight that can “put things into proportions”.** In other words: proportionality in mathematics aims for a more general relational fact as the posterior (estimate), so to speak. However, the complete formula of Bayes’ rule / conditional probability is what actually does the “putting into proportions” via a “counter-”weight (model evidence), so to speak (this repetition is meant to make sure that the linguistic semantics fit the mathematical semantics).

Note that proportionality in mathematics also indicates what is called a **constant ratio** (a linear relation in the case of a linear model). In the case of conditional probability, the model evidence indicates a constant, as it normalizes the joint to sum to 1 (normalizing constant). In the second part of this series, we also once briefly termed the constant ratio of a linear model as having a **constant numerical gradient (see plot below)**, where the ratio of change is always the same (constant) at every step or point of the function. As a linear function is its own tangent, the slope will be constant automatically. In other words: given that *monotonically* increasing).

We can also calculate the numerical gradient for each step by formula, however it is not that important, **but we included the necessary code in the downloadable script that we prepared for you at the beginning of this tutorial.**

**To wrap this up:** Equivalent to the joint probability, **the covariance is the joint variability of **** and ****the** **covariance is** **proportional to the estimate of the change of **** (dependent) given a change in **** (independent; change by 1):**

Below you will find the complete formula to obtain the so-called **regression coefficient (the optimal slope of our model)**, structurally analogue to conditional probability. As mentioned, the “counter-weight” is simply the variance of the independent variable:

**Outspoken, the above formula says:** the slope or regression coefficient can be defined as the estimate of the dependent variable **joint variability estimate**, i.e., the covariance of

**If you still have trouble understanding the formal relation above between the regression coefficient **** and conditional probability, think of this:** as the slope is equivalent to the concept of a posterior probability, they both indicate the concept of *change or updating*, equivalent to the steps of probabilistic inference we discussed in the first part of this series! The value of *no* change in

**In sum:** above we were talking about the same formal structure, just the one structure results in probability values and the other in estimate values of measurements: In the case of Lady Meow (second synth. data set) the regression coefficient was around 1.008182 cups of tea per hour (we can argue explicitly *per hour*, because at the numerical gradient of a linear function is steady at any point of

Below you will find some code for calculating the optimal slope of the model function via the above formula, again regarding our second synthetic data set, meow (^°^) !

```
# Estimate of the optimal slope of our model function via
# beta = cov(y,x)/var(x):
beta_meow = cov(y,x)/var(x)
# [1] 1.008182
```

**The same can also be achieved via using the total sum of squares only.** The total sum of squares of *joint variability*, it is just not averaged by **But how can we just skip the step of taking the average of the total sum of squares?** The below formula works, as

```
# beta = SumSqXY / SumSqXX
beta_meow = SumSqXY/SumSqXX
# [1] 1.008182
```

**The message of this chapter is:** there is a way to convert the structural idea of probabilistic relations (conditional probability) into the concept of an estimation of measurements by structure (conditional linear relations). Personally, I find this mind-blowing. The crucial point is also that given the fact that conditional probability can be understood as intuitive logical inference (abductive inference), statistics is doing the same, just with real numbers, but in the same way we rather naively assume how the world would be, when bluntly forming a hypothesis on it.

**There are a lot of other ways to perform the above estimation of **** given a change in ****, especially when bringing in actual linear algebra for a more visual understanding (including even methods such as gradient decent).**

**Wait!! **

**What happened to our intercept??? **

In the last part of this series, we used a system of equations in order to obtain **both** **intercept estimate,** **via some rearrangements of the so-called** **point-slope formalism**.

**Let us say, we know the parameters **** and **** of our optimal model already (otherwise we wouldn’t know points on that function): **we could then just pick two specific points that lay on our model function.

The **first point** will entail the variable **second point** will be our regression coefficient:

Below you will find some code to rearrange the above equation in various ways in order to solve for

function (built-in R function):**predict()**

```
# Deriving the intercept from the slope-intercept formula.
# To do so we will start with obtaining two
# points that lay on our model function. We
# will take x = 1 and the second will be the y at
# x = 0, our intercept itself. For the first point we
# need the value of alpha already, so we will quickly
# calculate it via our second approach:
alpha_meow = mean(y)-beta_meow*mean(x) # Details later below.
# Our fitted function:
lmFitted = function(x) (alpha_meow+beta_meow*x)
# Value of y of our fitted model function,
# given that x = 1:
y1 = lmFitted(1)
# Alternatively we can also use the predict() function in R.
# However, it outputs a list of values where the first element
# is our alpha and the second relates to x=1. Therefore we use
# [2] to only get the second element in the output vector of
# predict(lm(y~x)):
y1 = predict(lm(y~x))[2]
# Check for equality:
all.equal(lmFitted(1),as.numeric(predict(lm(y~x))[2]))
# Calculating beta with the points (formula above):
beta_meow = (alpha_meow-lmFitted(1))/(0-1)
# Rearrange formula by multiplying beta_meow by (0-1):
beta_meow*(0-1) == alpha_meow-lmFitted(1)
# We will invert the upper line and do some rearrangements:
alpha_meow-lmFitted(1) == beta_meow*(0-1)
alpha_meow = lmFitted(1) + beta_meow*(0-1)
#[1] 0.2318182
```

However, **a simplified way to obtain the intercept is given via the mean of **** and **

Below some code that explains, why the above formula holds:

```
# The reason we can use the mean is due to the following fact:
alpha_meow-lmFitted(mean(y)) == -beta_meow*lmFitted(mean(x))
# We can simplify the above by just using the respective means
# and do some rearrangements:
alpha_meow-mean(y) == -beta_meow*mean(x)
# alpha = mean(y)-beta*mean(x)
alpha_meow = (sum(y)-(beta_meow*sum(x)))/(length(x))
alpha_meow = mean(y)-beta_meow*(0-mean(x)) # x = 0
# The simplest form is:
alpha_meow = mean(y)-beta_meow*mean(x)
# Checking for equality:
all.equal((alpha_meow-mean(y)),(-beta_meow*mean(x)),
(beta_meow*(0-mean(x))))
```

Mathematically there is still a lot more to discover about the above mathematics of linear regression models. As mentioned before, we could also use methods commonly used in machine learning, such as gradient descent... **However, for now we are just going to move on to a simpler topic: the standard deviation.**

**Finally, we get to the standard deviation. We know most about it already, nut we will go through it in detail again anyways. **

**So ***in nuce***, the standard deviation is just the square root of the variance.** **It tells us the average deviation from the mean of a distribution, when we were to take multiple samples from it.**

Recall that we started with the total sum of squares. In other words, we started with the squared sum of the individual deviations of the elements of our list, set or vector of **In other words, the variance turned out to be just the (bias corrected) average of the total sum of squares (which itself was the squared sum of the deviations to the mean).**

As mentioned, we can now obtain the standard deviation simply by taking the square root of the variance, in other words: **we simply get rid again of the initial squaring we performed **when we calculated the total sum of squares! Taking the root brings the dimensions down to one (the x-axis). That is why the sd becomes relatable to the mean (which is also to be found on the x-axis).This makes the deviation or variability again relatable to the actual mean of a set, which only relates to the input values and not its square. The sd is therefore rather easy relate to the actually mean as

**Why +/- again?** Taking the root does not eliminate the effect that all of the centralized values were turned into positive values only! This is again the reason why the standard deviation is denoted to come with a +/- deviation to the mean. The standard deviation of normal distribution is special, since the +/- sd is related to the turning points of a Gaussian function:

For brevity, you will find the formula for the standard deviation for the **sample sd below only**, which is also sometimes referred to as empirical sd (the following code remains comprehensive though). **Note that the standard deviation is commonly denoted as **** and the variance as ****. **

```
# Standard deviation := sqrt(variance), where
# sqrt() == square root:
# Sd for x given the sample variance
sdXsamp = sqrt((sum((x-mean)^2))/(length(x)-1))
sdXsamp = sqrt(sampVarX)
sdXsamp = sqrt(var(x))
sdXsamp = sd(x)
# Sd for x given population variance:
sdXpop = sqrt((sum((x-mean)^2))/(length(x)))
sdXpop = sqrt(popVarX)
# Sd of y given sample variance:
sdYsamp = sqrt((sum((y-mean(y))^2))/(length(y)-1))
sdYsamp = sqrt(sampVarY)
sdYsamp = sqrt(var(y))
sdYsamp = sd(y)
# Sd for y given population variance:
sdYpop = sqrt((sum((y-mean)^2))/(length(y)))
sdYpop = sqrt(popVarY)
```

**The resulting standard deviation can now be clearly related to the mean value as a standard or an “average deviation from the mean”.**

**In the next part of this series we will also get to know specials kinds of standard deviation:** the standard error/deviation of the mean, the incept and the slope of a model.

In order to fully visualize the meaning of the standard deviation, we have to take a closer look on what (uniform and normal) **distributions** are and what is called a **probability (density) function**.

We will get there in the next part of this series, where we will also discuss the the mentioned probability density functions, that are used in a **z-test (which is** **very similar to the t-test as a form of statistical hypothesis test).**

.