Skip to main content# [R] Inferential Statistics II: Linear Regression via the Linear Least Square Method

# Introduction and Recap of Inferential Statistics I

# 1 Dependent and Independent Variables

# 2 Obtaining Linear Models via the Linear Least Square Method

## 2.1 Linear Functions - Idealized Linear Relations between Variables

## 2.2 Obtaining a Linear Model from *Non-Idealized* Data Sets using lm() only

## 2.3 Errors / Residuals

## 2.4 Finding a Model with a Minimum of *Error^2* via a Partial Differential Equation

## 2.4.1 Brief Excurse on Evaluating the Minimum/Maximum in a Parabola Function (Optional)

## 2.4.2 How to Obtain the Partial Derivative and the Difference between $\partial$ , $d$ and $\mathrm{\Delta}$ .

## 2.4.3 Setting Partial Derivatives to Zero and how to Rearrange its Equations to Represent a System of Equations

## 2.4.4 Approaching Results: Solving our System of Equations

# 3 How to Write your own Linear Model Function - Replicating the Complete Mathematics in R

Beginner's Sphere

Published onMar 30, 2023

[R] Inferential Statistics II: Linear Regression via the Linear Least Square Method

** Review:** This article was internally reviewed by the BEM-Editors Rico Schmitt and Philippa Schunk.

**Corresponding R script (use download button to obtain the file; clicking on the file name will only open a new tab in your browser showing a .txt version of the script):**

**Summary and PDF-Version:**

In the *first* part of our tutorial series on inferential statistics we explored the general idea of science as iterative hypothesis testing in its conceptual and intuitive, as well as its most basic mathematical form: conditional probability / Bayes’ theorem (essentially used as a ‘learning algorithm’ in science, representing the inference made and its course). We have also learned about the difference between the frequentist and the Bayesian interpretation or usage of Bayes’ theorem ─ and that the p-value is essentially a likelihood *second* and further parts of this series of tutorials will rather focus on a frequentist approach to statistics, but we will catch up with *Bayesian linear regression models* in the future, eventually comparing the two methods under previous and more specific considerations.

**Before we fully get to linear models** as the subject of *this* tutorial, **let us first recall** that hypothesis testing (t-test, Chi-squared…) in the mathematical sense not only refers to plain conditional probability, but implicitly also to methods for **obtaining a ***probabilistic*** ***model*** from a ***non-probabilistic*** **** model**, such as a linear model ─ as the latter consists of measurements and

**So, this time we are essentially going to focus on step II (gathering data / formulating a model) of the three steps of hypothesis testing, but under a ***non-probabilistic*** perspective first. This is also what distinguishes probability from statistics: estimates not probabilities, both being part of what is called stochastics. **We will again use R to plot data and calculate the results. This time we will go in full cyber, beyond simple arithmetic exercises, and will also start with some plotting, whoop! We will also fully introduce R in this tutorial, so in case you missed it, it is not mandatory to have read Inferential statistics I for the R part.

**In case you are worried: you have still entered the beginner´s sphere, so no relevant prior knowledge should be required and we will also work with a certain degree of redundancy in order to avoid turbulences on your journey through our tutorial.**

**Note that some chapters, e.g., on the decomposition of some formulas can also be considered optional side-quests, if you feel you don’t want to go through all the details. **As before, we have also provided a summary of this tutorial as a whole, as well as a short summary at the end of every chapter ─ for the rather impatient reader or for those seeking orientation in the text. We still wanted to give you the chance to follow every step taken in the math without missing links and used the opportunity to introduce R as an abstraction of mathematics in general (at least for our first series of tutorials for beginner’s).

**Mathematical abstraction? Doesn’t that sound kind of redundant, a little dizzy? N o w o r r i e s ! Mathematically we will mostly just recall and especially ***rediscover*** some basic math from our time in school, when going through the math of linear modelling (the essence is a plain linear function and a parabola). We hope that this tutorial will leave you as surprised as we are, considering that we actually possess nearly every tool to solve our problem of drawing an optimal line through data already, given basic education**. Still, every step will be fully introduced, so in case you feel dizzy thinking of a linear function already, fear no more.

**The structure of this tutorial will be as follows: **As a first step into the sphere of linear modelling we will gain a stable grip on two already mentioned terms: the infamous, but mostly just misunderstood concept of *independent* and *dependent* variables. After that we will look at simple linear functions, given idealized data points that are located on the linear function already. From there we will dig in deeper into the math and will apply the *linear least square method* on further non-idealized synthetic data sets. We will do so both mathematically, as well as computationally. In order to give you a general understanding on R as a programming language, we will also write our own

function doing the mathematics of a linear regression and will eventually compare the results of our function with the built in **linear_least_square()**

function. All of the above will be the preparation needed for the further parts of this tutorial series that will go through the math of the output of the **lm()**

of the **summary()**

function (as said, including F-statistics, p-values etc.). **lm()**

In general, for orientation within tutorials, we recommend to make use of the **contents function (upper right corner)**, to reorientate in the script.

The distinction and definition of dependent and independent variables is ─ even with some experience ─ not always a trivial task. Nevertheless, we are going to see that there are “algorithmic ways” to clearly orientate within both the terminology and the process of defining or evaluating the dependencies of actual variables.

**To get some intuition, we will start with a simple example.** Let us say our life is accompanied by **a doggo**. We want to please our dog and we have experienced that our dog had “health problems” recently, so we thought about checking if the dog *food* is causing the problems (influencing the dog in some way). We could now check on the weight, take a blood sample etc., but for simplicity we will just check, if the dog is showing signs of “*feeling good”*, or “*feeling not good” *─ his mood* *(whatever it may mean). In other words: let us not focus on the details of our chosen variables for now, just on the characteristic relation the variables may have to each other.

The specification *dependent* and *independent* can intuitively be understood in terms of the influence that variables have *to each* *other ─ *either thought of as *conditional or causal relation or ‘course of inference’ ***(again, inferring on an estimate of a measurement, not a probability this time)**:

*Conditional**:*If the dog food is changed, the mood of the dog changes.: Changing the dog’s food (cause) results in a change in the dog’s mood (consequence; similarly discussed here).*Causal*

Note that there are experimental settings in which terms or phrases such as “causal/conditional”, “influenced by”, “changed by”, “in relation with” or “dependent on” may appear counterintuitive and / or rises philosophical questions, e.g., because time for example measured via a clock, as well as the clock itself is *not* what influences an event *in the literal sense* of a cause (time considered as an object that acts upon things (reification of time)). The upper example concerning the dog food and the dog’s mood makes more sense in the context of “literal” causal influence, as the dog food is “actual matter” that evidently influences the dog as an organism that metabolizes matter.

The most neutral way to address a relation between variables is probably by just saying:

Apart from correctly describing what is meant by a “relation” of variables, note that the form of dependency is also something not always “objectively” clear, as there might be “*hidden”* variables influencing the situation (e.g., discussed as **proxy** or **confounder**, see below), leading to further questions that may have to be evaluated in order to get further insight into possible dependencies influencing the results. **In other words: just because one thinks, defines or makes it look plausible by concept / belief that something is independent, it does not actually have to be the case ─ assuming so is essentially setting a hypothesis. **Note that a measurement that leads to data itself can also be understood as a small experimental design itself, usually evaluated and rated by its precision (of a clock for example, e.g., atomic clocks).

**The task of designing an experiment (gathering data) is therefore always** to either find or create a situation, where dependencies of variables can be distinguished as being either independent or dependent to each other, so that no other variable influences the result (there are also multivariate models and other concepts, but we are going to stick with the simple scheme for now). This process again involves the goal of gathering *evident* data (see Inferential statistics I), in other words: to operate with evident variables in the form of data *that represent events in the world *in some way (can be cast as data in the first place). To cut it short: one’s own confidence is not data, so we have to find ways to intersubjectively make those experience / data accessible via an experiment for instance.

Questions concerning the evidence and the epistemics of inference (“what can be known?”) are especially present in conducting psychological experiments: ─ does a certain behavior of a person (dep. var.) really represent the consequence of a *certain* cause (indep. var.)? **However,** **no matter how far reflections on “relationality” as such have dragged you away, it is always good to recall** that the two terms in question are actually *mathematical* *terms* and refer to the *characteristics *of a *relation* of variables defining, e.g., the probabilistic influence of one event on the probabilities of another event occurring, e.g., a certain type of dog food (indep.) makes the probability of encountering a dog feeling good (dep.) more likely (probabilistic model), or that, e.g., a rise in time corresponds with a rise in the number of cups of coffee, e.g., in a linear way (non-probabilistic model; conditionally linear relation).

**Another way of reflecting all of the above difficulties in understanding the dependencies of variables is by becoming aware of an implicit twist in the definition of the dependency of variables, entailing ***two*** ***concepts*** **** at once**:

**I.**The terms dep./indep. address the*characteristics*of the dependencies of variables, e.g., within an experimental setting (as discussed above).**II.**The term variable and the*chosen*variables are pointedly defined/set*independently from the definition of the characteristic dependency of any variables.*

The latter also implies the need of an *evaluation* of such a model of the relation of variables in the first place, as we can set a pair of variables and hypothesize on a dependency, but the assumed dependency does not have to be supported by the data / experience. A dependency may be supported by previous observations or plausible for some other reason, but is at first always a mere hypothesis and therefore also set.

**In other words**: A major difficulty in getting a stable grip on the dependencies of variables relies on the fact that **the dependency of variables will never be something that is ***necessarily*** given in a variable ***itself ***(that’s why we have to do research in the first place)***. ***This is again due to the fact that variables are in general something ***we*** set and their dependency something ***we*** hypothesize and evaluate. To give an example: **the variable “dog’s mood” itself does not entail information on any possible or in general necessary dependency to another possible variable, apart from those *we* potentially associate the variable with (our hypotheses).

Evaluating the dependency of variables therefore involves a **twofold process**: **a)** *defining/setting* the variables of a model (or finding them in a study design, reading or reviewing a paper) and **b)** *checking* on their dependencies, *given* a definition of both, variables and the various possible forms of dependency. This twofold process is involved in *both* designing (includes reflections on evidence as such), as well as conducting an experiment and evaluating its results (therefore this process is involved *before* and *after* gathering data within in our triadic scheme of hypothesis testing).

**To wrap this up:** Operating with the concept of the “dependencies of variables”, either when reading a paper or designing and conducting an experiment oneself, should always involve an ** algorithm of some kind **running through your mind

**In case of philosophical vertigos:** One should not overestimate philosophical perspectives on a relation of variables as not everything we experience is just a matter of interpreting a term or language in the first place (its a tool for communication) and it is also not the goal to represent as much detail as possible in the form of language in order to “speak evidently”, or so. For science I’d say, it is not even important to find a stable definition for a term, due to the fact that language is not where we actually start when getting involved with ‘evidence’ (I find gravitation a very evident experience, especially when tired or exhausted, no matter how gravitation is further discussed in physics or what connotation it may have or what it is called in the first place). So finding an exclusion of a definitional rule, such as contemplating on the meaning of “causal relation”, “influence”, “conditional relation” et cetera is often not as witty as it may seem, especially when the concept of a variable allows for potentially anything (is contingent) to be in conditional relation with anything else, even though it may require a (physical) measurement to be an object / event of discussion in the first place. So the goal of this tutorial is to make people aware of different levels of abstraction and how they should be part of planning and evaluating an experiment / study, but we want to also clearly distinct and present the boundaries of different perspectives on science as such (e.g. philosophy).

**We hope this chapter has given you a stable grip on all the diverse aspects involved when operating with the dependencies of variables. What’s next?** We will dig in deeper into the actual mathematical representation of the *relation* between our dependent and our independent variable**.** The easiest way such a mathematical relation can be drawn is of course via a *linear model *(conditional linearity), i.e., a linear function with ** **(y-axis) being the dependent and

**If this feels a lot already, no worries**, the tutorial is intentionally structured in a redundant way and will soon provide you with some tools in the form of code that empowers you to visualize and literally experience the mathematics behind the concept of statistical inference in the form of linear modelling.

As we know already: The main goal of a linear regression is to draw a relation between different data points in a way that one can literally *draw* a linear conclusion between the two variables represented as a line in a graph, such that, e.g., a *steady* rise in the amount of time results in a *steady* rise in the amount of, say, cups of coffee drank ─ or a change in dogfood (categorical “rise”) resulting in a change in the dog’s behavior.

As mentioned before, the purpose of such a model is in general to do **predictions on new data**! We intuitively know that a rock-solid linear relation may be rather unrealistic, as there might always be something slightly influencing the result (e.g., referred to as standard deviation). In other words: our predictions may never be exact ─ but maybe good enough! The goal is therefore to find a model that results in the *least amount of error* to actual data points (events we encounter). We are looking for a compromise, so to speak, given that our model is a simplification of reality.

**If this sounds enigmatic, don´t worry, as the general idea of a linear regression model is actually really simple and intuitive: it is ***literally*** just about drawing a line into a graph with data points in a way that the linear model doesn´t deviate too much from the data points, when the data points are seen as a whole of some kind. Try yourself: **

**For our first idealized example below, we will ***not*** work with categorical variables this time.** We will also change the species: This time we want to evaluate how many cups of tea **Lady Meow **drinks in a period of 10h.

To get some orientation and in order to wrap our head around linear relations, we will plot the relation between cups of tea drunk by Lady Meow and the time passed as data points *discrete* data points, i.e., single events in time (each hour), *not* a continuous relation between time and cups (values for “cups” for potentially infinitely small values of “time”). **Think of these data points more as snapshots of reality and our linear function as a ***continuous*** ***solution*** to the gaps in our discrete data** [1].

In our new furry example, the **time is considered the **** independent variable**, which can be found intuitively as

**Further below you will again find code for the programming language R** that can be used to plot some **idealized data points**, i.e., where every point is located on a linear function already, as an example. **If you are new to R** you can either just read this tutorial and with it read the code (we will provide all the results), or you **just download ****R**** and ****RStudio**. Start R Studio and **download and open the R script we provided at the beginning of this tutorial, which contains all of the code we are going to use**. **We recommend to use our script**, instead of copying line by line this time, in case you lose track of copying code, resulting in errors due to missing lines of code. In other words: each code block we added in this tutorial does not necessarily stand and works for itself!

**Mark the lines you want to** **execute** and **press ALT+ENTER**. The result can be seen in the environment tab on the upper right side in RStudio. If you ever feel that your script is presenting a funny output (especially after a series of errors), **clear the environment via the** **brush tool** ─ there is another brush-icon placed in the console tab to clear the console. If this does not help close R without saving the workspace. **In case you are using any Apple hardware, use the command key instead of ALT.**

**We also recommend using the keyboard when operating within the script**: use **SHIFT+ARROW** to mark and demark letters (left/right) or whole lines (up/down).

**Mark** the **name** of an object **only** and press **ALT+ENTER again** to obtain the **results in the console** (below the script) ─ **you again won’t need to know much more for this tutorial and all the rest will be introduced along the way.**

**Note that** **R scripts are executed ***from*** top ***to*** bottom**. The R script we provided can theoretically be executed as a whole (mark all and execute). However, it may be that a variable, e.g., with the name

, gets **test***redefined* in lower parts of a script with a different value (results in changes in the environment!). In other words: The content of the previous object with the same name

will be “overwritten” by the new content assigned to the name **test****test****.** **Keep that in mind, when playing around with the script.** Just download the original script again if you have the feeling that you changed to much in the script and lost track or so.

We will represent our data points via a **vector** in R. Think of a column of a data table as a (column) vector. In R a vector is in general used as an abstraction of mathematics, where vectors can be seen as, e.g., a *list* of coordinate values

. In R it can list anything that way, such as a set of values of measurements. The function **c(x,y)**

is called the combine function, with which objects of any kind can be combined as a list of objects, or even a list of lists (note that there is also a function called **c()**

as well, which is not structured in rows and columns, but sequentially numerates elements ─ also of various kinds of classes (vector, matrix, single values, whole data frames, all in one list)).**list()**

**Below you will find our first lines of code for plotting a set of synthetic data.** Below `#`

marks lines as *comments*, so they won’t be interpreted as command code by the program (as plain text will result in various errors).

```
# Variables:
# time_passed = indep. = y-axis
time_passed = c(0:10) # 0:10 = 0 to 10 = c(1,2,3,4 … 10)
# cups_of_tea = dep. = x-axis
cups_of_tea = c(0:10)
# Let us plot our data points
plot(x = time_passed, y = cups_of_tea)
```

You might remember that the information that is entailed in the formula of a linear function refers to **a)** the **slope** or “steepness” of a function and **b)** the **point where it crosses the y-axis**, when

**Before we look at the actual formula of a classic linear function as a whole, we will start with calculating the slope.** First recall that a function is nothing else than a *relation* *of* *two* *variables* (here a linear relation). The slope is therefore also a relation, just a relation within a time-space in our case, or within a *difference* in time (*relation* *of* *the* *change* *of variables*: a relational difference. Considering our furry example, the slope reflects the difference between, e.g., the amount of tea Lady Meow drank after 1h compared to the amount after, say 10h (the ratio of the “change” of these to snapshots of reality so to speak).

We can mathematically evaluate the slope via two randomly chosen points of our data set, specifying and evaluating “change” as a “differing relation” (denoted *one* *point* in time (we will go through an example later below). Also note that the below formula is also defined as the quotient of differences or ** difference quotient** (only

Below you will find a modified plot that will graphically explain the formula above. It also includes a drawn line representing the value of

**Let us now do the above calculation of the slope via R.** To give you an example for the “logical flexibility” of R, we will use two different methods. There are often several possible ways to do a certain kind of operation. Try to fully figure out what is done in the code below, as the translation of the formula into reasonable code sometimes comes with the necessity of additional abstractions that have to be overlooked.

```
# Calculating the slope, directly via the objects of the variables
# that we used above already. The [] are used to call a certain value
# of the vector, i.e., element 1 of an object is object[1] etc. If
# this is too abstract, check on the whole object again and play
# around a little by adding [] with and without values to the object.
beta = (cups_of_tea[2]-cups_of_tea[1])/(time_passed[2]-time_passed[1])
# beta = 1
# Alternative via defining the points represented
# as vectors P = c(x,y),
# first. Console output: # x y
P1 = c(time_passed[1],cups_of_tea[1]) # [1] 0 0
P2 = c(time_passed[2],cups_of_tea[2]) # [1] 1 1
# Note that here [] refers to the vector list element of P now,
# not the vector list of our data for x and y!
beta = (P2[2]-P1[2])/(P2[1]-P1[1])
# beta = 1
```

**Now what about the point where our linear function crosses the y-axis?** We can see in our graph further above (and by logic) that the minimum value of cups and the minimum *input* *value* of time measured (passed) will be set to *awake* *during one day* ─ as she sleeps around 14h a day on average (^ ◦ ^). We will therefore start with zero and will not use a 24h scale for one day of measurements, just 0:10.

Now let’s finally have a look at the complete **general formula of a linear function** itself (various common forms displayed, do not get confused!).

**It is also very important to understand (!!!!)** that the concept of functions, such as

, in programming is an abstraction of the mathematical concept of a function such as the linear function above:**lm(y~x)**

**To get a better orientation on the variables **** and **** of a function, we will look at another modified plot.** Below you will see how changes in the values of *linear model*. The model is again the relation between *linear relation* means that the relation can be represented as having a constant numerical gradient in terms of a steady differential ratio throughout the whole graph (different to a pond like parabola function, as we will see later below), which just means that the slope of a linear function will be the same at every point of that function.

Again, our function will relate the number of cups of tea that Lady Meow drank (**In our case we already know that the function that fits our data best will also have the final form of:**

With this information we can now add our function to the plot of our* idealized data set*. There are again several different ways to do so. We will mostly use the

function in R for such cases:**abline()**

```
# We can add a line representing y = a + bx
# to our plot. The line crosses the y-axis
# at 0 (a = 0), as we are starting with y = 0 at x = 0.
# a(lpha) = 0, b(eta) = 1
abline(a = 0, b = 1)
```

Another way of representing the function via R would be as actual “R function”, referring to what we have learned on the abstraction of functions above:

```
# Define function f(x) = 0 + x * 1 = x.
f = function(x)(x)
plot(y = f(0:10), x = 0:10, type = "l", ylab = "Cups", xlab = "Time")
points(0:10, 0:10) # add data points to plot
```

Note that the function above is spoken out “**f** is the name of a function with one input **x** and with that **x** some math is done, i.e., it remains **x**” ─ which may appear confusing, as it says, nothing is done with it. This is due to the fact that in the case of our function **f** the input value is equivalent to the output value. A parabola function,

```
# Example parabola function:
g = function(x)(x^2)
g(2) # = 4
```

**With this we have visually/graphically proven to ourselves that there is a linear relation between the time that had passed and the amount of tea measured in cups, without using any linear regression math or so**. We can also use the

function, i.e., the **lm()****l**inear **m**odel function in R in order to evaluate the optimal

```
# Using lm() to evaluate alpha (intercept) and beta of x:
# Formula: lm(y~x), saying: a linear model of y (dep.) in
# conditional linear relation (~) to or dependent
# on x (indep.). Note that ~ can also be read as proportional
# which we will get into in Inf. Stat. III:
lm(cups_of_tea~time_passed)
# The result in the console will look like this
# CONSOLE OUTPUT (DO NOT EXECUTE):
# Call:
# lm(formula = cups_of_tea ~ time_passed)
# Coefficients:
# (Intercept) time_passed
# 1.071e-15 1.000e+00
## α β
# Note that one can now also add the line to a plot of data points via:
plot(x = cups_of_tea, y = time_passed)
abline(lm(cups_of_tea~time_passed)) # does not work with plot(f(0:10)...)
```

The *intercept* “cups_of_tea” (**The y-intercept is therefore equivalent to our ****The slope essentially represents the value of** **, when** **is raised or changes** **by the value of** **─ which for linear models** **is ***always***In other words:** this tells us that Lady Meow drinks about 1 cup of tea per hour! **This is again the same as asking for the value of **** under the condition of (a change in) ****, i.e., what we called a conditional linear relation (in later parts of this series we will also call it regression coefficient, but it does not add much to the story). **

You might have noticed that the slope appears much more intuitive than interpreting ** ** will appear rather abstract, as it it may not lay in the field of relevance / plausibility ─ just keep in mind.

Don’t let yourself get confused by unknown terms such as *intercept*, used in the console output of the

function, as they do not add much to the story. **lm()****In case you get lost, try to always recall the basics of a simple linear function to reorientate within the math.**

Our linear function in the previous chapter represented our so-called “model” function, i.e., a function that served as a model of the relation between variables, such as “time passed” and “cups of tea”, in order to predict new observations (new data). Note that other functions such as logistic functions are also used (see further tutorials in the future). In the far most of the cases we will *not* obtain an idealized linear relation between variables, representing a perfect line of data points. Again, the goal is to find a model function with which we will make the *least* number of mistakes, or get the *least* possible deviation on average from “reality” (phenomenologically evident events), when trying to *predict* future events ─ new data. In general, minimizing an (prediction) *error* or *cost* in terms of “how much more information would I need to get to the right answer, i.e., how do I have to change the parameters of my model to better predict the future?” is considered an *optimization*** **** problem** in mathematics ─ the

**Let us now work with data points that do not already perfectly align** with a linear function. Below you will find some random data points I chose, this time slightly deviating from our linear sequence of 0:10, where the numerical gradient, the difference between each step of time, was always 1, i.e., linear by itself already. The data below will represent ** three measurements** of

```
# Cups of Tea:
cups = c(0, 1, 2, 3.5, 4.8, 5.2, 6, 6.9, 8.5, 9.1, 9.9,
0, .6, 2.6, 3.1, 4.8, 6.9, 7.1, 7.5, 8.5, 9.9, 9.9,
0, .2, 2.9, 2.9, 4.9, 5.2, 6.9, 7.1, 7.3, 9, 9.8)
# Time passed for each run of measurements (cups), abbreviated:
time = c(0:10,0:10,0:10)
```

The three runs can be understood as different participants or just the same person observed over

With the

, i.e., the column bind function, we can combine our two (column-) vectors, resulting in an object that is something more of a data table (class of that object is actually a combination of a matrix and an array; see the web or **cbind()***future* R-Basic tutorials within this collection on that matter). We will then convert the result of

and turn it into an object of the class **cbind()**

. With this conversion we are now able to make use of the function of the **data frame**

sign in R, to call only a certain column (-vector), such as **$**

within the **data_table$time**

function. There are also other ways, such as again using **lm()**

and **y = cups**

within the function **x = time**

. I chose the following, as a proper data table should be a more familiar and less abstract format to you and using the **lm()**

sign is common when operating with .csv files for example. **$****It is also a good chance to learn about the circumstance that not every object class is compatible with every function, often resulting in enigmatic errors in the console. The function ****lm()**** for example cannot handle our “matrix, array” object for itself, therefore the conversion; you will definitely come across such issues in the future). By the way, you can evaluate the class of an object via ****class(object)****. Give it a try!**

What we get below is a data table, where each row represents a possible relation between the number of cups and the amount of time that has passed. The numbers on the far-left side of the console output (see below) just represents the number of elements as row number. I did not include an extra column differentiating the different runs of measurement any further, so the values of all three runs are entailed in one column. In nuce, each row of the data table entails the y-value with its respective value of x.

```
data_table = as.data.frame(cbind(cups,time))
# Output of "data_table":
#
# cups time
# 1 0.0 0
# 2 1.0 1
# 3 2.0 2
# 4 3.5 3
# 5 4.8 4
# 6 5.2 5
# ...
# ...
# Let us now plot our new set of data points. By using the $ sign
# a certain column or object in a list only will be called:
plot(x=data_table$time, y=data_table$cups,
ylab = "Cups of Tea", xlab= "Hours passed")
```

We can now use the

function to again easily obtain a **lm()***fitted linear model* between the variables time and cups of tea:

```
# As mentioned, we could just use:
linear_model = lm(cups ~ time)
# or the following, using $ to refer to a
# name of a column in this case:
linear_model = lm(data_table$cups ~ data_table$time)
# OUTPUT CONSOLE – Do note execute
# lm(formula = cups ~ time)
# Coefficients:
# (Intercept) time
# 0.2318 1.0082
## α β
```

**Comparing the result of our new model with the previous**, we can see that *idealized* data points, where the slope was 1. Let’s have a look via the plot:

```
# Add linear model to plot:
abline(linear_model, col = "blue")
# or again via:
abline(a = 0.2318, b = 1.0082, col = "blue")
```

We now know how to perform a simple linear regression via the

function in R. In our first synthetic data set all of the data points were perfectly in line with the function already. But how do we decide where to draw the line in the second case **lm()***mathematically*, without taking the easy path via R? **We will now start to go through the simplest mathematical method to make a compromising but optimal decision**, in order to find a model, which will make the *fewest* errors of all possible models, when making a prediction: **the (linear) least square method**.

Let us first look at how these “errors” mentioned above are represented mathematically, before we start to figure out how to optimize them, eventually obtaining a model which makes the *fewest* errors of all models. **In general, solving our optimization problem involves setting up a “guess function” in order to optimize its parameters **** and **** with respect to the errors made when comparing the model’s prediction with actual data points of a data set. Below we will just use our well known function **** for a “guess function”**.

The error is a mathematical value that represents the deviation of the value of our model function,

We will now look at some randomly chosen values

```
# Example residuals:
# x = indep. var.; y = dep. var.
x = c(0:5)
y = c(0, 1.5, 1.8, 3.4, 4.4, 5.9)
```

We can now evaluate the errors, defining the function

in R, representing **f**** guess function” that we want to optimize** to better fit the given data.

```
# Our guess function:
f = function(x)(x)
# Calculate errors (here x and y refer to the above data vectors!):
error = y-f(x)
# Console output of the object "error", entailing the elements 0-5:
# [1] 0.0 0.5 -0.2 0.4 0.4 0.9
```

Let us now plot our data points and our linear model

```
# Plot of our data points:
plot(x=x,y=y)
# Plot of our guessed function:
abline(a=0,b=1)
# Adds lines representing the residuals, i.e., the error between our
# "guessed function" and the actual data points, drawn along the y-axis.
# x0 is the x-coordinate of the start of the line and x1 the x-coordinate
# of the end of the dotted line (lty=3):
segments(x0=x,y0=f(x), x1= x, y1=y, lty =3, col = "blue")
```

Recall that we are after finding a function that produces a *minimum* of summed errors *compared* to the errors of all other possible functions. However, below we will also calculate the *squared* error, *square* method in detail, but think of it visually and linguistically and recall some prior knowledge: a minimum of something is usually graphed in a way that it mathematically at least looks something like a parabola function **If we were to represent our minimization problem as a ***linear problem***, the minimum of a ***linear*** error function would be minus infinity for **** ─ which is not what we want for an estimate for the crossing point with the y-axis.** **What we are definitely still looking for though is a ***linear solution*** in the sense that the result of our calculations leaves us with parameters **** and **** of an optimal ***linear*** function (not a parabola function).**

In order to get a function that has at least something like a parabolic minimum as an attribute, we are mathematically just bluntly going to *turn our linear problem, i.e., deciding for a linear function with the lowest errors (r), into a quadratic problem, deciding for a particular**and**that leads to the lowest values of errors /reiduals squared (*** ).** We will get to this in detail in a bit, but definitely keep in mind that intuitively a linear function itself will not be something we are going to use to find a

**Now let us say, we are again working with our previous **** idealized values** of

We can do the same with our *second synthetic dataset***, **** using the same guess function again (theoretically any function can be optimized)**:

Here is the respective code for the above and a possibility to plot the residuals:

```
# Following the plain math:
error = (cups_of_tea - time_passed) # here f(x) = x = time_passed itself
sumError = sum(error)
sumError2 = sum(error^2)
# Via lm() and residuals() of our idealized data set:
idealized = lm(cups_of_tea~time_passed)
residuals_fun = residuals(idealized) # all basically 0 values
# Checks for (near) equality of the result of the function
# residuals() with our object "error" that we obtained
# by applying the plain math (change object class of
# residuals_fun via as.numeric() to make it work):
all.equal(error, as.numeric(residuals_fun))
# [1] TRUE
# Let us plot our error / residuals
plot(error)
```

Let us now do the same with our second synthetic data set and **compare the residuals of our first guess with the residuals of our already fitted function** obtained via

. We will start with the errors of our guessed function:**lm()**

```
# Guessed function
lmRandom = function(x) (x)
# These are the errors we still have to minimize
errorRandom = (data_table$cups-lmRandom(data_table$time))
sumErrorRan = sum(errorRandom)
sumErrorRan2 = sum(errorRandom^2)
# Plot errorRandom
plot(errorRandom)
# adds (h)orizontal line at y-axis = error = 0 (:= no error)
abline(h=0)
```

```
# This is our fitted model function
lmFitted = function(x) (0.2318+1.0082*x)
# These are the errors / residuals that are already fitted
# In other words: this is where we want to get to.
errorFitted = (data_table$cups-lmFitted(data_table$time))
# The function residuals() does the same with our model
residualsFitted = residuals(lm(data_table$cups~data_table$time))
# Let us check if our method delivers same results as
# the function residual():
all.equal(errorFitted, residualsFitted)
# [1] "Mean relative difference: 0.0001735146" = neglectable difference
# Plot residuals:
plot(residualsFitted)
abline(h=0)
```

As promised, we will now compare how the residuals have changed, after optimizing the errors:

In order to represent our minimization problem formally, we are going to use a partial differential equation. The field of partial differential equations (PDEs) can be complicated, due to a lack of solutions for *non*-linear PDEs. We are lucky though, as we are looking for a *linear solution* only anyways, and our PDE can be solved, e.g., via a system of linear equations ─ the method we are after.

**Note that there are other methods to find a linear function fitting our data**, i.e., with the least amount of squared error. For example: in the third part of this series, we will obtain our optimal β just by dividing the so-called covariance of

**If any of this sounds too unfamiliar, no worries!** This tutorial is not going to introduce the whole field of ordinary and partial differential equations (ODEs and PDEs) or any complex linear algebra ─ you will still be able to fully understand, follow and reproduce all of the calculations below (even by hand). Apart from that it is just important *to at least be aware* that different methods and formulas are used to evaluate a fitted function, *so that you will not become insecure when doing your own research, knowing only one of the methods*. [2] **However,** **solving optimization problems is essential for any kind of statistical modelling, so we believe it is more than helpful to at least know the poetics of a linear model in detail, to get a stable overview on statistical modelling. **

After we mastered the mathematics, we are also going to reproduce the essentials of the original

function to show you how easy it can be written by oneself and in order to demystify what the code of the original **lm()**

function actually does.**lm()**

Let us now finally look at our PDE:

What we see on the far-right-hand side of the equation is essentially our squared error formula above

So, essentially our PDE in the current form above is a function with respect to two variables, **and** *not its minimum yet!!*). What we now want to use this formula for is to get the values of *minimum value* of

**Finding the minimum can easily be done by evaluating the ***partial*** derivatives of our PDE and setting each of them to zero, in order to evaluate the optimal **** and **** given minimum squared error ─ our compromise.**

“Setting the partial derivatives of a PDE to zero” has a similar purpose as with regular linear functions, where setting * However, *we already know that we have to turn our linear problem into a *quadratic* problem, so we are looking for a minimum of a ‘pond’ as a result of setting something to zero (and we have to take the derivative, which “downgrades” a parabola function to a linear function). Before we start with setting the partial derivatives to zero, let us recall some basics on *quadratic functions* that you will probably remember from school, to slowly build up complexity. **Note that** **the next chapter can be considered optional.**

In a parabola function *first* *derivative*, then setting it to zero, subsequently solving the equation for

In general, the derivative gives us the slope of a *tangent* at a certain point

The derivative of *values* without *coefficient*, i.e., *without* *a corresponding factor variable*, such as

Note that the *derivative* **Below you will find code to plot an example of a parabola function that does ***not*** have its minimum/maximum at the origin of the coordinate system.**

**In case you may wonder or get confused**, note that setting the derivative to zero, is *not* the same as setting *particular* point with a *particular* slope: a slope of zero, indicating a minimum or maximum of a function ─ therefore we set to zero

Conceptually it is also important to keep in mind that

Below you will find some code to do some calculations and plotting:

```
# Slightly more complex parabola function:
# Defining our function g(x) = x^2 + x + 2
g = function(x) (x^2+x+2)
# Plot g(x)
plot(g, xlim = c(-5,5), ylim=c(-1,10), asp=1 )
abline(h=0, v=0)
# Derivative g'(x) of g(x)
gdev = function(x) (2*x + 1)
# Evaluate if minimum or maximum is given (none if below = 0)
# a) Input neg. and pos. test values
# b) Output: changes from negative to positive => maximum is given,
# else a minimum is given (i.e., when input == output)
gdev(-Inf); gdev(+Inf) # Minimum is given
# Add derivative function (it crosses x at our xMin!)
# Here we graphically see that x refers to both g’(x) and g(x)
# g’(x)=2*x+1
abline(a=1,b=2) # adds our linear function to plot
# Setting g'(x) to zero, to get the x coordinate of that minimum
# which has a slope of zero
# g'(x) = 2*x + 1
# 0 = 2*x + 1
# -2x = 1
xMin = solve(-2,1)
# x = -.5
# Let's check if this is true, since g'(-.5) = 0
# In other words: the slope at g(-.5) is zero
gdev(xMin) == 0 # [1] TRUE
g(xMin) # y-coordinate of minimum
points(x = xMin, y = g(xMin))
# Result: P(-.5| 1.75)
# Add vertical line at x = -.5
segments(x0=-.5,y0=0, x1=-.5,y1=g(-.5), col="darkgreen", lty = 3)
```

**Now that we know how to find the minimum in a parabola function, let’s precede to PDEs.** The relation in our function *partially*, i.e., first either for

**In other words: we cannot evaluate **** and **** at the same time, and what we will get when taking the derivative of a PDE is actually two equations that we have to solve ─ after doing some rearrangements. Both **** and **** will be related to the least squared errors, but it will be a rather abstract visualization compared to finding the minimum of a simple parabola function (so we will skip plotting on that part).**

**Most of the below will be more about formalism and will be conceptually familiar to you, promised. It is also enough do get an overview of this chapter. However, we again still want to give you an optional overview of the process with the least number of missing links (at least for simple linear regression models).** **Same goes for the decomposition in the next chapter, so don’t be afraid of what’s ahead.**

The below is supposed to show that a PDE is not too different from an ordinary differential equation and from the concept of a difference quotient. We will therefore again rediscover some math basics from our time in school. By looking into the difference between

So, let´s get to it. This is our PDE again:

The derivatives are formulated as follows. The purple part refers to the outer derivative, and the blue to the inner derivative. Obtaining the outer derivate is simple:

The fraction is spoken “partial derivative of

The symbol *al* equations, and as *difference*, just different kinds of differences, so to speak. In order to get a stable grip on the differences of differences, recall that the slope of a linear function is calculated by the difference of

However, the Greek capital delta, *finite*** differences** only. We will see in a bit what that means.

A **differen***tial*** equation** on the other hand can also be used to evaluate the slope of a function, but with a twist to it. Equivalent to the difference quotient, a differential equation evaluates the difference, but substitutes all * on a lot of other types of functions*, such as a parabola function, not only linear functions. In differenti*al* equations the difference is denoted by the Latin letter *also* holds for ** infinite differences** as well. For the numerical example below, we will work with finite values, i.e., where the difference

**Short remark on the history of calculus:** The above formula was a huge deal in the time where people like Isaac Newton and G. W. Leibniz lived (both discovered calculus independently), since it allows to translate a summation of discrete values into an integration of continuous values up to certain precision! A pragmatic simplification, so to speak. Later on Planck showed that a continuous functions of light can be cast as discrete “quant” packages, which is essentially the inverese, or better: a tilting from continuous to discrete, since the whole wave-particle dualism in physics is mathematically grounded by the mention translation, so to speak.

**Let us now check on the slope of our previously evaluated parabola minimum** ** using a differential equation of the ordinary kind this time.** Intuitively the below does not work when the *difference* is actually *equal to* 0, as it always ends up in a situation where 0 is divided by 0 (*approaching 0 difference, but never being in such a state*. To make it work in R, we have to add a tiny term (

Here ** approaching** the value of

```
# Output using different x => x = x +.1 x = x +.001
betaNumerator = (g(-.5+.1)-g(-.5)) # 0.01 1e-06
betaDenominator = ((-.5+.1)+.5) # 0.1 1e-03 = 0.001
betaNumerator/betaDenominator # 0.1 1e-03 = 0.001
```

The output in R could be coincidental, so let us check with some other value of

```
# Choose a value for x
xTest = 1
# Using g(x) from above:
g = function(x) (x^2+x+2)
betaTest = (g(xTest+1e-7)-g(xTest))/((xTest+1e-7)-xTest)
# Result for xTest = 1 => beta = 3
# Some values come with a remaining difference
# (not the case with, e.g., xTest = 9):
gdev = function(x) (2*x + 1)
all.equal(betaTest,gdev(xTest))
```

With an added value of

to our chosen **1e-7**

. Using **gedev(x)**

leads to false values and at some point, you will get a **1e-12**

(Not a Number) as output, as the precision is not high enough or due to some other reason within the function (there may be ways in R, but it is not important for us now). **NaN****We are not going to need the calculations above for the actual linear regression, but in case a PDE is still too abstract, always just recall the general idea from a conceptually downgraded perspective. **

**We have learned:** In differential equations *approaching* a difference of some limit, such as 0, in order to obtain the approximate slope at the respective point

We are now going to move on to taking the derivative of our PDE, setting them to zero, in order to find the minimum of *partial* differential equations, in our case the minimum of the sum of our squared errors of our guessed function. The goal is again the same as with ordinary differential equation, just a little more abstract: **we will be handling two equations at once and will solve them via a ***system of equations***, when using the derivatives to obtain the optimal values for **** and ****, given minimum residuals. **

**The next steps of decomposing our derivatives are a preparation to fit our derivatives into an easily solvable system of equations. Note that the rearranging part is supposed to give you the opportunity to follow all of the steps by hand, but is not necessary to understand in detail. The important part will be our resulting system of equations that can be used for any set of **** and **** in the future.**

In case you my wonder: **We will get into what a system of equation is in a bit** ─ we could have explained it now or before, but as it is rather simple, we decided to first rearrange the equations of our derivatives into its respective form. For now, just think of more than one equation, *each* looking something like *together* to solve for

**Let us start decomposing our derivatives:** As we are looking for a minimum value of

Now we get the factor **the sum sign looks spiky, but as it turns out it is actually harmless**). Contemplate on this and convince yourself via R (double “=” checks for equality; mark only one side of the equation and execute to obtain the single results):

`2*sum(1:6) == sum(2*1:6)`

We can also get rid of the factor

Resulting in:

Now we get rid of the product, *within the sum*:

Resulting in:

Next, we can split up the sum to get closer to the appearance of a system of equations:

Now we bring the negative sum of

Next, we do the same as with the factor 2 above, this time bringing the variable factors *are given to us* and that can be summed out easily. In the special case of

**From here everything becomes a lot easier**, again as

```
# The above just represented in R (DO NOT EXECUTE IN THIS FORM)
# sum(a*length(x)) + b*sum(x) = sum(y)
# a*sum(x) + b*sum(x^2) = sum(x*y)
```

Numerically, the result of applying our second synthetic data set to the upper formulas will leave us with the following system of equations (code will follow in a bit):

With the above **we have obtained our desired structure of a system of linear equations and are confidently approaching higher spheres of inference**. We can see two equations, each with the variables

One might already have an idea on how to get to a result by substitution… For those who don’t know, or only roughly remember what a system of linear equations is, let us go through a quick example below.

**In general, systems of equations are** used to solve a wide range of very basic questions in science. We will start with a very simple example that most of us probably have formally encountered in riddles or so, without thinking of a system of linear equations as such:

**FURRY EXERCISE:** **Two cats, Mr. Midnight and Jurassica Parker, are together 15 years old. Their age deviates by 7 years ─ Jurassica Parker being the older sister of Mr. Midnight. How old is each of them?**

We can solve this equation by *substituting*

Another way of solving our exercise is via linear algebra: we can write the left side of the equations as a matrix and the right side as a vector, resulting in:

```
# Set up two elements, so that we can solve an
# equation in the form: left = right
left = matrix(c(1,1,
-1,1), ncol = 2, byrow = TRUE)
right = matrix(c(15,7), ncol = 1, byrow = TRUE)
# Use the solve function to solve the system of
# equations. It outputs a vector list = c(x,y)
ResultXY = solve(left,right)
```

**Let us now finally solve** the system of equations regarding our second synthetic data set **by hand**:

Resulting in:

Now we can smoothly insert

Resulting in:

And now we can insert

**Whoop-whoop!** The values for

function! **lm()****You have now mastered your first linear regression by hand!** Let us now replicate the above fully via R and write our own function to perform a linear regression.

Next up we will replicate all of the math above using R and subsequently write our own

function. In other words: we will replicate the essential **linear_least_square()***mathematical* part of the

function ourselves. You may be surprised that only a few lines of code are needed to do so. Apart from putting together all the math from previous chapters, we will also include some fancy code for some nice-looking console output. ^^ **lm()**

First of all, we will again define our random “guess function” and our

```
regRandom = function(x) (x)
x = data_table$time; y = data_table$cups
```

Next up we will optimize our guessed function, by looking for the minimum of the values of

We have been to the part where we rearranged the derivatives to resemble a system of equations. Our insights can now be used to *abbreviate* the code that performs the math dramatically. To do so, we will also define

and **a**

, due to an abstraction of linear algebra in R, when using the function **b**

─ you’ll see in a bit.**solve()**

`a = 1; b = 1 # ";" marks a new line (delimitter)`

Below you will find our rearranged system of equations that can be looked at as forming a 2x2 matrix (left side) and a 2x1 (column-) vector (right side of the equation). We will use these objects as input for our

function. The formula of the rearranged equations of our partial derivatives was:**solve(left,right)**

In terms of code, the above can be written as:

```
# Formally in code (DO NOT EXECUTE):
# sum(a*length(x)) + b*sum(x) = sum(y)
# a*sum(x) + b*sum(x^2) = sum(x*y)
```

To get the above set for solving a system of linear equations we will create objects of the parts of the above and then fit them into a *matrix* and a *vector *in R. The

function will treat each element separately, so something like **solve()**

evaluates **length()**

```
# Named in the form r = row, c = column
left_a_r1c1 = sum(a*length(x)); left_b_r1c2 = b*sum(x) ;
left_a_r2c1 = a*sum(x) ; left_b_r2c2 = b*sum(x^2)
right_r1c1 = sum(y)
right_r2c1 = sum(x*y)
```

Now we will set up the above as actual matrix (left) and vector (right) objects:

```
# matrix
left = matrix(c(left_a_r1c1 , left_b_r1c2,
left_a_r2c1 , left_b_r2c2),
ncol=2, nrow=2, byrow = TRUE)
# vector
right = c(right_r1c1, right_r2c1)
```

*Now we can finally solve our system of equations via:*

```
Result = solve(left,right)
# [1] 0.2318182
# [2] 1.008182
# Fitted function:
fitfun = function(x) (Result[1]+Result[2]*x)
SumError2fitted = sum((y-fitfun(x))^2)
# [1] 9.703364
```

**In order to get a nice looking output for our function below**, we will use the

function. The **cat()**

function **cat()***concatenates* elements of any kind in the console (!), such as text and our result values. Use

for a line break, and **"\n"**

for the tab separator. We also have to apply a trick, to get the actual names of the input objects into our console output using **"\t"**

:**deparse(substitute(x))**

```
# DO NOT EXECUTE. Code is exclusively meant for our function below:
# cat("Linear least square method in R","\n","\n",
# "Independent variable:", "\t", deparse(substitute(x)),"\n",
# "Dependent variable:", "\t", deparse(substitute(y)),"\n","\n",
# "alpha", "\t",Result[1],"\n",
# "beta","\t",Result[2], "\t","SumR2", "\t", SumError2fitted)
```

Let us now also plot our results. We will also use

for the labels of our plot axes.**deparse(substitute(x))**

```
plot(x=x,y=y, ylab = deparse(substitute(y)),
xlab = deparse(substitute(x)))
abline(a=Result[1], b=Result[2], col = "darkblue")
```

**Voilà, that was it!** We can now put all of the above together, execute and subsequently use our own

function ─ **linear_least_square()****whoooooop!!!**

```
# Go-go-gadgeto linear_least_square!!!!
# Replication of the lm() function:
linear_least_square = function (x,y){ # Start of function
# Setting up system of equations:
left_a_r1c1 = sum(a*length(x)); left_b_r1c2 = b*sum(x)
left_a_r2c1 = a*sum(x) ; left_b_r2c2 = b*sum(x^2)
right_r1c1 = sum(y)
right_r2c1 = sum(x*y)
# Now we will set up the above as matrix (left) and
# vector (right) objects:
left = matrix(c(left_a_r1c1 , left_b_r1c2,
left_a_r2c1 , left_b_r2c2),
ncol=2, nrow=2, byrow = TRUE)
right = c(right_r1c1, right_r2c1)
# Now we can solve our system of equations via:
Result = solve(left,right)
# Fitted function:
lmFitted = function(x) (Result[1]+Result[2]*x)
SumError2fitted = sum((y-lmFitted(x))^2)
# Code for nice console output
cat(" Linear least square method in R","\n","\n",
"Independent variable:", "\t", deparse(substitute(x)),"\n",
"Dependent variable:", "\t", deparse(substitute(y)),"\n","\n",
"alpha", "\t",Result[1],"\n",
"beta","\t",Result[2], "\t","SumR2", "\t", SumError2fitted)
# Plot Results:
plot(x=x,y=y, ylab = deparse(substitute(y)),
xlab = deparse(substitute(x)))
abline(a=Result[1], b=Result[2], col = "darkblue")
} # End of function
```

**We could actually also name our function “lm”, though it would conflict with the built in ****lm()**** function, so we do not recommend to rename the above function in that way!!! We also have not tried doing so, i.e., the consequences, at least to us, are not even impetuously figured by a scanner darkly.**

However, let's see what our function can do:

```
# Our function using our second synthetic data set:
linear_least_square(data_table$time, data_table$cups)
# OUTPUT:
# Linear least square method in R
# Independent variable: data_table$time
# Dependent variable: data_table$cups
# alpha 0.2318182
# beta 1.008182 SumR2 9.703364
# With our idealized data set:
linear_least_square(time_passed, cups_of_tea)
# Compare with the regular old lm() function:
lm(data_table$cups~data_table$time)
```

**Note that the actual ****lm()**** function includes much more features and the code will look quite different from the one we provided above, as it is, e.g., compatible with a wide range of input classes, and partially executes code in C (which runs faster). The function above is also not an official comparable function, such as ****lm()**** that can be cited (here you can find the original source code for the lm() function of the pre-installed library “stat”: **[4]**. It was written for didactic purposes only. However, the mathematics will be the same, when it comes to simple linear regression models. **

We hope this has given you a stable intuition and understanding of how a simple linear regression actually works. Multivariate models, linear mixed effect models and other forms of regression analysis are essentially based on the same principle. Importantly: the output of e.g.

or **glm()**

will look more or less the same. This article is therefore our basis for the following tutorials on that matter of regression analysis in general.**lmer()**

So far we have looked into the regular output of the

function. Next up we are going to look at the output of **lm()**

and will replicate and integrate its math into our own function.**summary(lm(y~x))**