Beginner's to Intermediate Sphere
Review: So far, this article has been internally reviewed by the NOS editors Rico Schmitt and Philippa Schunk.
Corresponding R script (use the download button to obtain the file; clicking on the file name will only open a new tab in your browser showing a .txt version of the script):
Summary:
In the first part of our tutorial series on inferential statistics we explored the general idea of science as iterative hypothesis testing, in its conceptual and intuitive as well as its most basic mathematical form: conditional probability / Bayes’ theorem (essentially used as a ‘learning algorithm’ in science, representing the inference made and its course). We have also learned about the difference between the frequentist and the Bayesian interpretation or usage of Bayes’ theorem ─ and that the p-value is essentially a likelihood.
Before we fully get to linear models as the subject of this tutorial, we want to shed light on the fact that hypothesis testing, e.g., in the sense of a t-test, not only refers to plain conditional probability, but implicitly also to methods for obtaining a probabilistic model from a non-probabilistic model, such as a linear model ─ as the latter by itself consists of measurements, not probability values. Also recall that the term ‘model’ in general just means “relation”. A “non-probabilistic” model can therefore also be looked at as a “joint” in the sense of a relation ─ a relation between an independent and a dependent variable. A linear relation is the most basic form of such “non-probabilistic” models, mathematically expressed as a linear function (we will get there in detail below).
So, this time we are essentially going to focus on step II (gathering data / formulating a model) of the three steps of hypothesis testing, but from a non-probabilistic perspective first. This is also what actually distinguishes statistics from probability theory: estimates instead of probabilities, both being part of what is called stochastics. Statistical analyses involve both, but usually start with statistical measurements (e.g., mere frequencies).
We will again use R to plot data and calculate the results. This time we will go in full cyber, beyond simple arithmetic exercises, and will also start with some plotting, whoop! We will also fully introduce R in this tutorial, so in case you missed it, it is not mandatory to have read Inferential statistics I for the R part.
In case you are worried: you have still entered the beginner's sphere, so no relevant prior knowledge should be required, and we will also work with a certain degree of redundancy in order to avoid turbulence on your journey through our tutorial.
Note that some chapters, e.g., those on the decomposition of some formulas, can also be considered optional side quests, in case you feel you don’t want to go through all the details. As before, we have also provided a summary of this tutorial as a whole, as well as a short summary at the end of every chapter ─ for the rather impatient reader or for those seeking orientation in the text. We still wanted to give you the chance to follow every step taken in the math without missing links, and used the opportunity to introduce R as an abstraction of mathematics in general (at least for our first series of tutorials for beginners).
Mathematical abstraction? Doesn’t that sound kind of redundant, a little dizzying? N o w o r r i e s ! Mathematically we will mostly just recall and especially rediscover some basic math from our time in school when going through the math of linear modelling (the essence is a plain linear function and a parabola). We hope that this tutorial will leave you as surprised as we are, considering that, given some basic education, we actually possess nearly every tool already to solve our problem of drawing an optimal line through data. Still, every step will be fully introduced, so in case you feel dizzy thinking of a linear function already, fear no more.
The structure of this tutorial will be as follows: As a first step into the sphere of linear modelling we will gain a stable grip on two already mentioned terms: the infamous, but mostly just misunderstood, concepts of independent and dependent variables. After that we will look at simple linear functions, given idealized data points that are already located on the linear function. From there we will dig deeper into the math and will apply the linear least square method to further, non-idealized synthetic data sets. We will do so both mathematically and computationally. In order to give you a general understanding of R as a programming language, we will also write our own linear_least_square() function doing the mathematics of a linear regression and will eventually compare the results of our function with the built-in lm() function. All of the above will be the preparation needed for the further parts of this tutorial series that will go through the math of the output of the summary() of the lm() function (as said, including F-statistics, p-values etc.).
In general, for orientation within our tutorials, we recommend making use of the contents function (upper right corner) to reorientate within the script.
The distinction and definition of dependent and independent variables is ─ even with some experience ─ not always a trivial task. Nevertheless, we are going to see that there are “algorithmic ways” to clearly orientate within both the terminology and the process of defining or evaluating the dependencies of actual variables.
To get some intuition, we will start with a simple example. Let us say our life is accompanied by a doggo. We want to please our dog and we have experienced that our dog had “health problems” recently, so we thought about checking if the dog food is causing the problems (influencing the dog in some way). We could now check on the weight, take a blood sample etc., but for simplicity we will just check, if the dog is showing signs of “feeling good”, or “feeling not good” ─ his mood (whatever it may mean). In other words: let us not focus on the details of our chosen variables for now, just on the characteristic relation the variables may have to each other.
The specification dependent and independent can intuitively be understood in terms of the influence that variables have on each other ─ either thought of as a conditional or causal relation, or as a ‘course of inference’ (again, this time inferring on an estimate of a measurement, not a probability):
Conditional: If the dog food is changed, the mood of the dog changes.
Causal: Changing the dog’s food (cause) results in a change in the dog’s mood (consequence; similarly discussed here).
You may also be familiar with the term correlation, which refers to a special kind of dependent relation — so again: dependency in terms of the dependence of variables is meant.
Note that there are experimental settings in which terms or phrases such as “causal/conditional”, “influenced by”, “changed by”, “in relation with” or “dependent on” may appear counterintuitive and/or raise philosophical questions, e.g., because time, measured via a clock, as well as the clock itself, is not what influences an event in the literal sense of a cause (time considered as an object that acts upon things would be a reification of time). The above example concerning the dog food and the dog’s mood makes more sense in the context of “literal” causal influence, as the dog food is “actual matter” that evidently influences the dog as an organism that metabolizes matter.
The most neutral way to address a relation between variables is probably by just saying:
Apart from correctly describing what is meant by a “relation” of variables, note that the form of dependency is also something that is not always “objectively” clear, as there might be “hidden” variables influencing the situation (e.g., discussed as proxy or confounder, see below), leading to further questions that may have to be evaluated in order to get further insight into possible dependencies influencing the results. In other words: just because one thinks, defines or makes it look plausible by concept / belief that something is independent, it does not actually have to be the case ─ assuming so is essentially setting a hypothesis (which can again be evaluated via a model to see if there is evidence for such assumptions). Note that a measurement that leads to data can also be understood as a small experimental design itself, usually evaluated and rated by its precision (of a clock for example, e.g., atomic clocks).
The task of designing an experiment (gathering data) is therefore always to either find or create a situation where the dependencies of variables can be distinguished as being either independent or dependent on each other, so that no other variable influences the result (there are also multivariate models and other concepts, but we are going to stick with the simple scheme for now). This process again involves the goal of gathering evident data (see Inferential statistics I), in other words: to operate with evident variables in the form of data that represent events in the world in some way (can be cast as data in the first place). To cut it short: one’s own confidence is not data, so we have to find ways to make those experiences / data intersubjectively accessible, via an experiment for instance. In later parts of this series we will also see that we can create more complex models that can include a range of variables. However, statistics can also be used to evaluate whether a factor of context really has a relevant effect or can be neglected after all.
Questions concerning the evidence and the epistemics of inference (“what can be known?”) are especially present in conducting psychological experiments: ─ does a certain behavior of a person (dep. var.) really represent the consequence of a certain cause (indep. var.)? However, no matter how far reflections on “relationality” as such have dragged you away, it is always good to recall that the two terms in question are actually mathematical terms and refer to the characteristics of a relation of variables defining, e.g., the probabilistic influence of one event on the probabilities of another event occurring, e.g., a certain type of dog food (indep.) makes the probability of encountering a dog feeling good (dep.) more likely (probabilistic model), or that, e.g., a rise in time corresponds with a rise in the number of cups of coffee, e.g., in a linear way (non-probabilistic model; conditionally linear relation).
Another way of reflecting all of the above difficulties in understanding the dependencies of variables is by becoming aware of an implicit twist in the definition of the dependency of variables, entailing two concepts at once:
I. The terms dep./indep. address the characteristics of the dependencies of variables, e.g., within an experimental setting (as discussed above).
II. The term variable and the chosen variables are pointedly defined/set independently from the definition of the characteristic dependency of any variables.
The latter also implies the need of an evaluation of such a model of the relation of variables in the first place, as we can set a pair of variables and hypothesize on a dependency, but the assumed dependency does not have to be supported by the data / experience. A dependency may be supported by previous observations or plausible for some other reason, but is at first always a mere hypothesis and therefore also set.
In other words: a major difficulty in getting a stable grip on the dependencies of variables lies in the fact that the dependency of variables is never something that is necessarily given in a variable itself (that’s why we have to do research in the first place). Variables are in general something we set, and their dependency something we hypothesize and evaluate. To give an example: the variable “dog’s mood” itself does not entail information on any possible or in general necessary dependency on another possible variable, apart from those we potentially associate the variable with (our hypotheses).
Evaluating the dependency of variables therefore involves a twofold process: a) defining/setting the variables of a model (or finding them in a study design, reading or reviewing a paper) and b) checking on their dependencies, given a definition of both, variables and the various possible forms of dependency (dep./indep.). This twofold process is involved in both designing (includes reflections on evidence as such), as well as conducting an experiment and evaluating its results.
To wrap this up: Operating with the concept of the “dependencies of variables”, either when reading a paper or designing and conducting an experiment oneself, should always involve an algorithm of some kind running through your mind, moving from the linguistic/mathematical definition in general to the set variables and then to their hypothesized and evaluated dependencies respectively (this is essentially a deductive process: from the “rule”, to the “event”). Such an “algorithm” of evaluating the dependencies of set variables is therefore not “only” an abstract mathematical or mere definitional process, somewhat unintuitive or even robotic. It rather represents what makes up creating and testing an experimental design in the first place.
In case of philosophical vertigo: one should not overestimate philosophical perspectives on a relation of variables, as not everything we experience is just a matter of interpreting a term or language in the first place (language is a tool for communication), and it is also not the goal to represent as much detail as possible in the form of language in order to “speak evidently”, or so. So finding an exception to a definitional rule, such as contemplating the meaning of “causal relation”, “influence”, “conditional relation” et cetera, is often not as witty as it may seem, especially when the concept of a variable allows for potentially anything (is contingent) to be in conditional relation with anything else. This also goes for phrases such as “correlation is not causation”, which is nowadays more often used in order to argue against scientific methods and simple evidence as such (a kind of hegemonic relativism at play, e.g., against climate science, in politics etc.). See also the article „Correlation is not causation. Except when it is.“ by Edward Hearn.
We hope this chapter has given you a stable grip on all the diverse aspects involved when operating with the dependencies of variables. What’s next? We will dig deeper into the actual mathematical representation of the relation between our dependent and our independent variable and go through some basic examples. The easiest way such a relation can be drawn is of course via a linear model (conditional linearity), i.e., a linear function with an intercept (α) and a slope (β).
If this feels a lot already, no worries, the tutorial is intentionally structured in a redundant way and will soon provide you with some tools in the form of code that empowers you to visualize and literally experience the mathematics behind the concept of statistical inference in the form of linear modelling.
As we know already: The main goal of a linear regression is to draw a relation between different data points in a way that one can literally draw a linear conclusion between the two variables represented as a line in a graph, such that, e.g., a steady rise in the amount of time results in a steady rise in the amount of, say, cups of coffee drank ─ or a change in dogfood (categorical “rise”) resulting in a change in the dog’s behavior.
As mentioned before, the purpose of such a model is in general to make predictions on new data! We intuitively know that a rock-solid linear relation may be rather unrealistic, as there might always be something slightly influencing the result (e.g., referred to as standard deviation). In other words: our predictions may never be exact ─ but maybe good enough! The goal is therefore to find a model that results in the least amount of error with respect to actual data points (events we encountered before, serving as reference for our predictions of the future). We are looking for a compromise, so to speak, given that our model is a simplification of reality. If ‘simplification of reality’ worries you somehow, or lets statistics look a little reductionist, then think of it as a map: when you use a map trying to get from one point to another, you would not need a 1:1 scaled map, or even information on every atom. So simplification is always given, since we are asking specific questions in specific contexts…
If this still sounds enigmatic, don't worry, as the general idea of a linear regression model is actually really simple and intuitive: it is literally just about drawing a line into a graph with data points in a way that the linear model doesn't deviate too much from the data points, when the data points are seen as a whole of some kind. Try yourself:
For our first idealized example below, we will not work with categorical variables this time. We will also change the species: This time we want to evaluate how many cups of tea Lady Meow drinks in a period of 10h.
To get some orientation and in order to wrap our head around linear relations, we will first plot the relation between the cups of tea drunk by Lady Meow and the time passed as data points.
In our new furry example, the time is considered the independent variable. This can be found intuitively, as time is something that is usually not influenced by any other relevant variable, apart from maybe failures in measurement, and therefore represents the independent variable in most cases. Exceptions are physics experiments on, e.g., the relativity of time and space in relation to mass, or measuring reading times of a text item, where the time passed, i.e., the time needed to read the item, may change (dependent) when changing the item (independent, as the item itself is not influenced by the time needed to read it). The amount of tea that Lady Meow drank represents the dependent variable ─ dependent on the amount of time (passed).
Further below you will again find code for the programming language R that can be used to plot some idealized data points, i.e., points that are all located on a linear function already, as an example. If you are new to R you can either just read this tutorial and with it the code (we will provide all the results), or you can download and install R and RStudio. Start RStudio and download and open the R script we provided at the beginning of this tutorial, which contains all of the code we are going to use. We recommend using our script instead of copying line by line this time, in case you lose track when copying code, which would result in errors due to missing lines of code. In other words: each code block we added in this tutorial does not necessarily stand and work for itself (see below for why this is the case)!
In case you are trying to install R on an Apple computer, you have to decide between a version for Apple silicon (M1/M2) and one for older Intel Macs. If you don’t know which version to choose, just try out which one works for you.
Mark the lines you want to execute and press ALT+ENTER. The result can be seen in the environment tab on the upper right side in RStudio. If you ever feel that your script is presenting a funny output (especially after a series of errors), clear the environment via the brush tool ─ there is another brush-icon placed in the console tab to clear the console. If this does not help close R without saving the workspace. In case you are using any Apple hardware, use the command key instead of ALT.
We also recommend using the keyboard when operating within the script: use SHIFT+ARROW to mark and demark letters (left/right) or whole lines (up/down).
Mark the name of an object only and press ALT+ENTER again to obtain the results in the console (below the script) ─ you again won’t need to know much more for this tutorial and all the rest will be introduced along the way.
Note that R scripts are executed from top to bottom. The R script we provided can theoretically be executed as a whole (mark all and execute). However, it may be that a variable, e.g., one with the name test, gets redefined in lower parts of the script with a different value (which results in changes in the environment!). In other words: the content of the previous object with the name test will be “overwritten” by the new content assigned to that name. Keep that in mind when playing around with the script. Just download the original script again if you have the feeling that you changed too much in the script and lost track.
We will represent our data points via a vector in R. Think of a column of a data table as a (column) vector. In R a vector is in general used as an abstraction of the mathematical concept, where vectors can be seen as, e.g., a list of coordinate values c(x,y). In R a vector can list anything that way, such as a set of measurement values. The function c() is called the combine function, with which objects of any kind can be combined into a list of objects, or even a list of lists (note that there is also a function called list(), which is not structured in rows and columns, but sequentially numerates elements ─ also of various kinds of classes (vector, matrix, single values, whole data frames, all in one list)).
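If you want to see the difference yourself, here is a small, optional sketch (the object names are just examples of ours and not part of the provided script):

# A vector made with c() holds elements of one kind:
measurements = c(1.5, 2.0, 2.5)
# A list made with list() can hold objects of different classes at once:
mixed_list = list(measurements, "a character string", matrix(1:4, nrow = 2))
mixed_list[[2]] # double brackets call the second element of the list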
Below you will find our first lines of code for plotting a set of synthetic data. The symbol # marks lines as comments, so they won’t be interpreted as command code by the program (plain text would otherwise result in various errors).
# Variables:
# time_passed = indep. = x-axis
time_passed = c(0:10) # 0:10 = 0 to 10 = c(0,1,2,3 … 10)
# cups_of_tea = dep. = y-axis
cups_of_tea = c(0:10)
# Let us plot our data points
plot(x = time_passed, y = cups_of_tea)
You might remember that the information that is entailed in the formula of a linear function refers to a) the slope or “steepness” of the function and b) the point where it crosses the y-axis, i.e., its value when x = 0.
Before we look at the actual formula of a classic linear function as a whole, we will start with calculating the slope. First recall that a function is nothing else than a relation of two variables (here a linear relation). The slope is therefore also a relation ─ in our case a change in cups of tea in relation to a change (difference) in time.
We can mathematically evaluate the slope via two randomly chosen points of our data set, specifying and evaluating “change” as a “differing relation” (denoted β): β = (y2 − y1) / (x2 − x1) = Δy / Δx.
Below you will find a modified plot that graphically explains the formula above. It also includes drawn lines representing Δy and Δx.
Let us now do the above calculation of the slope via R. To give you an example of the “logical flexibility” of R, we will use two different methods ─ there are often several possible ways to do a certain kind of operation. Try to fully figure out what is done in the code below, as translating a formula into reasonable code sometimes comes with additional abstractions that have to be kept in view.
# Calculating the slope, directly via the objects of the variables
# that we used above already. The [] are used to call a certain value
# of the vector, i.e., element 1 of an object is object[1] etc. If
# this is too abstract, check on the whole object again and play
# around a little by adding [] with and without values to the object.
beta = (cups_of_tea[2]-cups_of_tea[1])/(time_passed[2]-time_passed[1])
# beta = 1
# Alternative via defining the points represented
# as vectors P = c(x,y),
# first. Console output: # x y
P1 = c(time_passed[1],cups_of_tea[1]) # [1] 0 0
P2 = c(time_passed[2],cups_of_tea[2]) # [1] 1 1
# Note that here [] refers to the vector list element of P now,
# not the vector list of our data for x and y!
beta = (P2[2]-P1[2])/(P2[1]-P1[1])
# beta = 1
Now what about the point where our linear function crosses the y-axis? We can see in our graph further above (and by logic) that the minimum value of cups and the minimum input value of time measured (passed) are both 0 ─ so our function crosses the y-axis at y = 0 (α = 0).
Now let’s finally have a look at the complete general formula of a linear function itself (various common forms displayed, do not get confused!).
It is also very important to understand (!!!!) that the concept of functions in programming, such as lm(y~x), is an abstraction of the mathematical concept of a function, such as the linear function above: y = f(x) = α + β·x.
To get a better orientation on the variables: again, our function will relate the number of cups of tea that Lady Meow drank (y, the dependent variable) to the amount of time that has passed (x, the independent variable).
With this information we can now add our function to the plot of our idealized data set. There are again several different ways to do so. We will mostly use the abline()
function in R for such cases:
# We can add a line representing y = a + bx
# to our plot. The line crosses the y-axis
# at 0 (a = 0), as we are starting with y = 0 at x = 0.
# a(lpha) = 0, b(eta) = 1
abline(a = 0, b = 1)
Another way of representing the function via R would be as actual “R function”, referring to what we have learned on the abstraction of functions above:
# Define function f(x) = 0 + x * 1 = x.
f = function(x)(x)
plot(y = f(0:10), x = 0:10, type = "l", ylab = "Cups", xlab = "Time")
points(0:10, 0:10) # add data points to plot
Note that the function above is spoken out as “f is the name of a function with one input x, and with that x some math is done, i.e., it remains x” ─ which may appear confusing, as it says that nothing is done with it. This is due to the fact that in the case of our function f the input value is equivalent to the output value. A parabola function, e.g., g(x) = x², in contrast does change its input (it squares it):
# Example parabola function:
g = function(x)(x^2)
g(2) # = 4
With this we have visually/graphically proven to ourselves that there is a linear relation between the time that had passed and the amount of tea measured in cups, without using any linear regression math or so. We can also use the lm() function, i.e., the linear model function in R, in order to evaluate the optimal α and β for us:
# Using lm() to evaluate alpha (intercept) and beta of x:
# Formula: lm(y~x), saying: a linear model of y (dep.) in
# conditional linear relation (~) to or dependent
# on x (indep.). Note that ~ can also be read as proportional
# which we will get into in Inf. Stat. III:
lm(cups_of_tea~time_passed)
# The result in the console will look like this
# CONSOLE OUTPUT (DO NOT EXECUTE):
# Call:
# lm(formula = cups_of_tea ~ time_passed)
# Coefficients:
# (Intercept) time_passed
# 1.071e-15 1.000e+00
## α β
# Note that one can now also add the line to a plot of data points via:
plot(x = time_passed, y = cups_of_tea)
abline(lm(cups_of_tea~time_passed)) # does not work with plot(f(0:10)...)
The intercept of “cups_of_tea” (α) is essentially 0: the value 1.071e-15 only deviates from exact zero due to the way computers handle floating point numbers. The coefficient of “time_passed” is our slope β, which is 1, as expected from our idealized data.
You might have noticed that interpreting the slope appears much more intuitive than interpreting an intercept given as 1.071e-15 ─ which is just a numerical way of expressing (nearly) 0.
Don’t let yourself get confused by unknown terms such as intercept, used in the console output of the lm() function, as they do not add much to the story. In case you get lost, try to always recall the basics of a simple linear function to reorientate within the math.
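If you want to convince yourself that the intercept is only a floating point artifact, you can extract and round the coefficients ─ a small optional check that is not part of the original script:

# Extract the coefficients of the idealized model and round them:
coef(lm(cups_of_tea ~ time_passed))
round(coef(lm(cups_of_tea ~ time_passed)), digits = 10)
# (Intercept) time_passed
#           0           1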
Our linear function in the previous chapter represented our so-called “model” function, i.e., a function that served as a model of the relation between variables, such as “time passed” and “cups of tea”, in order to predict new observations (new data). Note that other functions, such as logistic functions, are also used (see further tutorials in the future). In most cases we will not obtain an idealized linear relation between variables, representing a perfect line of data points. Again, the goal is to find a model function with which we will make the least number of mistakes, or get the least possible deviation on average from “reality” (phenomenologically evident events), when trying to predict future events ─ new data. In general, minimizing a (prediction) error or cost in terms of “how much more information would I need to get to the right answer, i.e., how do I have to change the parameters of my model to better predict the future?” is considered an optimization problem in mathematics ─ the least square method being a very simple representation of solving such problems.
Let us now work with data points that do not already perfectly align with a linear function. Below you will find some random data points I chose, this time slightly deviating from our linear sequence of 0:10, where the numerical gradient, the difference between each step of time, was always 1, i.e., linear by itself already. The data below will represent three runs of measurements of the cups of tea drunk over a period of 10 hours each.
# Cups of Tea:
cups = c(0, 1, 2, 3.5, 4.8, 5.2, 6, 6.9, 8.5, 9.1, 9.9,
0, .6, 2.6, 3.1, 4.8, 6.9, 7.1, 7.5, 8.5, 9.9, 9.9,
0, .2, 2.9, 2.9, 4.9, 5.2, 6.9, 7.1, 7.3, 9, 9.8)
# Time passed for each run of measurements (cups), abbreviated:
time = c(0:10,0:10,0:10)
The three runs can be understood as different participants, or just the same individual observed over three separate periods of 10 hours.
With the cbind() function, i.e., the column bind function, we can combine our two (column-) vectors, resulting in an object that is something more of a data table (the class of that object is actually a combination of a matrix and an array; see the web or future R basics tutorials within this collection on that matter). We will then convert the result of cbind() into an object of the class data frame. With this conversion we are able to make use of the $ sign in R, to call only a certain column (-vector), such as data_table$time, within the lm() function. There are also other ways, such as again using y = cups and x = time within the function lm(). I chose the following, as a proper data table should be a more familiar and less abstract format to you, and using the $ sign is common when operating with .csv files, for example. It is also a good chance to learn about the circumstance that not every object class is compatible with every function, which often results in enigmatic errors in the console. The function lm(), for example, cannot handle our “matrix, array” object by itself, therefore the conversion; you will definitely come across such issues in the future. By the way, you can evaluate the class of an object via class(object). Give it a try!
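As a quick, optional check (not part of the script itself), you can compare the classes before and after the conversion; in recent R versions the first call returns both “matrix” and “array”:

# Class of the combined object before and after conversion:
class(cbind(cups, time))                # "matrix" "array"
class(as.data.frame(cbind(cups, time))) # "data.frame"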
What we get below is a data table, where each row represents a possible relation between the number of cups and the amount of time that has passed. The numbers on the far-left side of the console output (see below) just represent the row numbers of the elements. I did not include an extra column differentiating the different runs of measurement any further, so the values of all three runs are entailed in one column. In nuce: each row of the data table entails a y-value with its respective value of x.
data_table = as.data.frame(cbind(cups,time))
# Output of "data_table":
#
# cups time
# 1 0.0 0
# 2 1.0 1
# 3 2.0 2
# 4 3.5 3
# 5 4.8 4
# 6 5.2 5
# ...
# ...
# Let us now plot our new set of data points. By using the $ sign
# a certain column or object in a list only will be called:
plot(x=data_table$time, y=data_table$cups,
ylab = "Cups of Tea", xlab= "Hours passed")
We can now use the lm() function to again easily obtain a fitted linear model between the variables time and cups of tea:
# As mentioned, we could just use:
linear_model = lm(cups ~ time)
# or the following, using $ to refer to a
# name of a column in this case:
linear_model = lm(data_table$cups ~ data_table$time)
# OUTPUT CONSOLE – Do not execute
# lm(formula = cups ~ time)
# Coefficients:
# (Intercept) time
# 0.2318 1.0082
## α β
Comparing the result of our new model with the previous one, we can see that α (0.2318) and β (1.0082) now slightly deviate from the idealized values of 0 and 1.
# Add linear model to plot:
abline(linear_model, col = "blue")
# or again via:
abline(a = 0.2318, b = 1.0082, col = "blue")
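Since the purpose of such a model is prediction, here is a small optional sketch of how the fitted model could be used to predict a new, unobserved value (the value of 11 hours is just an arbitrary example of ours):

# Predict the cups of tea after 11 hours, using the fitted model:
predict(lm(cups ~ time), newdata = data.frame(time = 11))
# which is (approximately) the same as computing alpha + beta * 11 by hand:
0.2318 + 1.0082 * 11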
We now know how to perform a simple linear regression via the lm() function in R. In our first synthetic data set all of the data points were perfectly in line with the function already. But how do we decide where to draw the line in the second case mathematically, without taking the easy path via R? We will now start to go through the simplest mathematical method to make a compromising but optimal decision, in order to find the model which will make the fewest errors of all possible models when making a prediction: the (linear) least square method.
Let us first look at how these “errors” mentioned above are represented mathematically, before we start to figure out how to optimize them, eventually obtaining the model which makes the fewest errors of all models. In general, solving our optimization problem involves setting up a “guess function” in order to then optimize its parameters α and β.
The error is a mathematical value that represents the deviation of the value of our model function, f(x), from the actually measured value y at a given x. These deviations are also called residuals.
We will now look at some randomly chosen values of x and y:
# Example residuals:
# x = indep. var.; y = dep. var.
x = c(0:5)
y = c(0, 1.5, 1.8, 3.4, 4.4, 5.9)
We can now evaluate the errors, defining the function f in R, representing our guess function f(x) = x:
# Our guess function:
f = function(x)(x)
# Calculate errors (here x and y refer to the above data vectors!):
error = y-f(x)
# Console output of the object "error", i.e., the errors for x = 0 to 5:
# [1] 0.0 0.5 -0.2 0.4 0.4 0.9
Let us now plot our data points and our linear model, including the residuals as dotted lines:
# Plot of our data points:
plot(x=x,y=y)
# Plot of our guessed function:
abline(a=0,b=1)
# Adds lines representing the residuals, i.e., the error between our
# "guessed function" and the actual data points, drawn along the y-axis.
# x0 is the x-coordinate of the start of the line and x1 the x-coordinate
# of the end of the dotted line (lty=3):
segments(x0=x,y0=f(x), x1= x, y1=y, lty =3, col = "blue")
Recall that we are after finding a function that produces a minimum of summed errors compared to the errors of all other possible functions. However, below we will also calculate the squared error, i.e., the sum of the squared residuals.
In order to get a function that has at least something like a parabolic minimum as an attribute, we are mathematically just bluntly going to turn our linear problem, i.e., deciding for a linear function with the lowest errors (r), into a quadratic problem, deciding for the particular α and β that minimize the sum of the squared errors (r²).
Another perspective on why we need to take the square of our error function: the values of the plain errors can be negative as well as positive, so that they can cancel each other out when summed up ─ squaring them makes every deviation count as a positive quantity.
In case you wonder: we cannot simply use the absolute value instead, since we cannot define the derivative of such a function in full (which is needed in order to find the minimum, e.g., of a parabola).
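To see the cancellation problem in action, here is a tiny optional example (the numbers are made up purely for illustration):

# Two errors of equal size but opposite sign cancel out when summed:
toy_errors = c(-2, 2)
sum(toy_errors)   # 0, although the model is off by 2 at both points
sum(toy_errors^2) # 8, squaring keeps every deviation visible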
Now let us say we are again working with our previous idealized values of time_passed and cups_of_tea: since every data point is located on the guessed function already, every error ─ and therefore also the sum of squared errors ─ is 0.
We can do the same with our second synthetic dataset, using the same guess function again (theoretically any function can be optimized):
Here is the respective code for the above and a possibility to plot the residuals:
# Following the plain math:
error = (cups_of_tea - time_passed) # here f(x) = x = time_passed itself
sumError = sum(error)
sumError2 = sum(error^2)
# Via lm() and residuals() of our idealized data set:
idealized = lm(cups_of_tea~time_passed)
residuals_fun = residuals(idealized) # all basically 0 values
# Checks for (near) equality of the result of the function
# residuals() with our object "error" that we obtained
# by applying the plain math (change object class of
# residuals_fun via as.numeric() to make it work):
all.equal(error, as.numeric(residuals_fun))
# [1] TRUE
# Let us plot our error / residuals
plot(error)
Let us now do the same with our second synthetic data set and compare the residuals of our first guess with the residuals of our already fitted function obtained via lm(). We will start with the errors of our guessed function:
# Guessed function
lmRandom = function(x) (x)
# These are the errors we still have to minimize
errorRandom = (data_table$cups-lmRandom(data_table$time))
sumErrorRan = sum(errorRandom)
sumErrorRan2 = sum(errorRandom^2)
# Plot errorRandom
plot(errorRandom)
# adds (h)orizontal line at y-axis = error = 0 (:= no error)
abline(h=0)
# This is our fitted model function
lmFitted = function(x) (0.2318+1.0082*x)
# These are the errors / residuals that are already fitted
# In other words: this is where we want to get to.
errorFitted = (data_table$cups-lmFitted(data_table$time))
# The function residuals() does the same with our model
residualsFitted = residuals(lm(data_table$cups~data_table$time))
# Let us check if our method delivers same results as
# the function residual():
all.equal(errorFitted, residualsFitted)
# [1] "Mean relative difference: 0.0001735146" = neglectable difference
# Plot residuals:
plot(residualsFitted)
abline(h=0)
As promised, we will now compare how the residuals have changed, after optimizing the errors:
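One way to compare them, besides looking at the plots, is to compare the summed squared errors directly ─ a short optional check using the objects defined above:

# Summed squared errors of our first guess vs. the fitted model:
sumErrorRan2       # guessed function f(x) = x
sum(errorFitted^2) # fitted function, should be smaller (ca. 9.7)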
In order to represent our minimization problem formally, we are going to use a partial differential equation. The field of partial differential equations (PDEs) can be complicated, due to a lack of solutions for non-linear PDEs. We are lucky though, as we are looking for a linear solution only anyways, and our PDE can be solved, e.g., via a system of linear equations ─ the method we are after.
Note that there are other methods to find a linear function fitting our data, i.e., with the least amount of squared error. For example: in the third part of this series, we will obtain our optimal β just by dividing the so-called covariance of x and y by the variance of x.
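As a small preview (and an optional check that is not needed for the rest of this tutorial), this alternative route can already be tried out with the objects from above ─ it should reproduce the coefficients of lm():

# beta as covariance of x and y divided by the variance of x:
beta_alt  = cov(data_table$time, data_table$cups) / var(data_table$time)
# alpha follows from the fact that the fitted line passes through the means:
alpha_alt = mean(data_table$cups) - beta_alt * mean(data_table$time)
# beta_alt ~ 1.0082, alpha_alt ~ 0.2318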
If any of this sounds too unfamiliar, no worries! This tutorial is not going to introduce the whole field of ordinary and partial differential equations (ODEs and PDEs) or any complex linear algebra ─ you will still be able to fully understand, follow and reproduce all of the calculations below (even by hand). Apart from that, it is just important to at least be aware that different methods and formulas are used to evaluate a fitted function, so that you will not become insecure when doing your own research, knowing only one of the methods. [2] However, solving optimization problems is essential for any kind of statistical modelling, so we believe it is more than helpful to know the mechanics of a linear model in detail, to get a stable overview on statistical modelling.
After we have mastered the mathematics, we are also going to reproduce the essentials of the original lm() function, to show you how easily it can be written by oneself and in order to demystify what the code of the original lm() function actually does.
Let us now finally look at our PDE, i.e., the sum of squared errors written as a function of the two parameters α and β:

F(α, β) = Σ (y_i − (α + β·x_i))²

What we see inside the sum on the right-hand side is essentially our squared error formula from above ─ the only difference is that the value of our model function f(x_i) is now written out as α + β·x_i.
So, essentially our PDE in the current form above is a function with respect to two variables, α and β ─ the data values x and y are given and fixed by our measurements.
Finding the minimum can easily be done by evaluating the partial derivatives of our PDE and setting each of them to zero, in order to evaluate the optimal α and β.
“Setting the partial derivatives of a PDE to zero” has a similar purpose as with regular functions, where setting the derivative to zero lets us find the point(s) at which the slope is zero ─ a minimum or maximum.
In a parabola function there is exactly one such point with a slope of zero: its minimum (or maximum, if the parabola opens downwards).
In general, the derivative gives us the slope of a tangent at a certain point x of a function.
The derivative of the simple parabola g(x) = x², for example, is g′(x) = 2·x.
Note that the derivative is itself a function of x again: it tells us the slope of g(x) at any chosen point x.
In case you may wonder or get confused, note that setting the derivative to zero is not the same as setting x to zero ─ we are solving for the x at which the slope becomes zero.
Conceptually it is also important to keep in mind that x refers to the same value in both g(x) and its derivative g′(x).
Below you will find some code to do some calculations and plotting:
# Slightly more complex parabola function:
# Defining our function g(x) = x^2 + x + 2
g = function(x) (x^2+x+2)
# Plot g(x)
plot(g, xlim = c(-5,5), ylim=c(-1,10), asp=1 )
abline(h=0, v=0)
# Derivative g'(x) of g(x)
gdev = function(x) (2*x + 1)
# Evaluate whether a minimum or a maximum is given:
# a) Input a negative and a positive test value
# b) Output: the derivative changes from negative to positive => a minimum
#    is given; from positive to negative => a maximum is given
gdev(-Inf); gdev(+Inf) # changes from -Inf to +Inf => minimum is given
# Add the derivative function to the plot (it crosses the x-axis at our xMin!)
# Here we graphically see that x refers to both g'(x) and g(x)
# g'(x) = 2*x + 1
abline(a=1, b=2) # adds the derivative (a linear function) to the plot
# Setting g'(x) to zero, to get the x coordinate of that minimum
# which has a slope of zero
# g'(x) = 2*x + 1
# 0 = 2*x + 1
# -2x = 1
xMin = solve(-2,1)
# x = -.5
# Let's check if this is true, since g'(-.5) = 0
# In other words: the slope at g(-.5) is zero
gdev(xMin) == 0 # [1] TRUE
g(xMin) # y-coordinate of minimum
points(x = xMin, y = g(xMin))
# Result: P(-.5| 1.75)
# Add vertical line at x = -.5
segments(x0=-.5,y0=0, x1=-.5,y1=g(-.5), col="darkgreen", lty = 3)
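In case you want to double-check the minimum numerically, base R also offers the optimize() function for one-dimensional minimization ─ an optional aside, not needed for what follows:

# Numerically search for the minimum of g(x) within a chosen interval:
optimize(g, interval = c(-5, 5))
# $minimum ~ -0.5 (the x-coordinate), $objective ~ 1.75 (the y-coordinate)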
Now that we know how to find the minimum of a parabola function, let’s proceed to PDEs. The relation in our PDE involves two unknown parameters, α and β, that both have to be optimized at once.
In other words: we cannot evaluate an optimal α without taking β into account, and vice versa ─ which is why we take two partial derivatives, one with respect to α and one with respect to β.
Most of the below will be more about formalism and will be conceptually familiar to you, promised. It is also enough to just get an overview of this chapter. However, we again still want to give you an optional overview of the process with the least number of missing links (at least for simple linear regression models). The same goes for the decomposition in the next chapter, so don’t be afraid of what’s ahead.
The below is supposed to show that a PDE is not too different from an ordinary differential equation and from the concept of a difference quotient. We will therefore again rediscover some math basics from our time in school, by looking into the difference between a plain difference quotient, a differential quotient and a partial derivative.
So, let´s get to it. This is our PDE again:
The derivatives are obtained via the chain rule: the outer derivative of the square, times the inner derivative of the bracket (y_i − (α + β·x_i)) with respect to α and β respectively. Obtaining the outer derivative is simple:

∂F/∂α = Σ 2·(y_i − (α + β·x_i)) · (−1)
∂F/∂β = Σ 2·(y_i − (α + β·x_i)) · (−x_i)
The fraction ∂F/∂α is spoken “partial derivative of the function F with respect to α” (and ∂F/∂β accordingly for β).
The symbol ∂ is a stylized, curved d and is used to distinguish a partial derivative from an ordinary derivative (written with a regular d) and from a plain difference.
However, the Greek capital delta, Δ, denotes a plain difference, as in the difference quotient Δy/Δx = (y2 − y1)/(x2 − x1), which we used for the slope before.
A differential equation on the other hand can also be used to evaluate the slope of a function, but with a twist to it. Equivalent to the difference quotient, a differential equation evaluates the difference, but makes the difference between the two x values infinitesimally small, i.e., it lets the difference approach zero (the limit of the difference quotient).
Short remark on the history of calculus: the above was a huge deal in the times when people like Isaac Newton and G. W. Leibniz lived (both discovered calculus independently), since it allows us to translate a summation of discrete values into an integration of continuous values up to a certain precision ─ a pragmatic simplification, so to speak. Later on, Planck showed that a continuous function of light can be cast as discrete “quanta” packages, which is essentially the inverse, or better: a tilting from continuous to discrete, since the whole wave-particle dualism in physics is mathematically grounded by the mentioned translation, so to speak.
Let us now check on the slope at our previously evaluated parabola minimum x = −.5, using a difference quotient with a small (but not infinitesimal) difference.
Here we add a small value (.1, and then .001) to x and look at what happens to the quotient: the smaller the difference, the closer we get to a slope of 0, the true slope at the minimum.
# Output using different x => x = x +.1 x = x +.001
betaNumerator = (g(-.5+.1)-g(-.5)) # 0.01 1e-06
betaDenominator = ((-.5+.1)+.5) # 0.1 1e-03 = 0.001
betaNumerator/betaDenominator # 0.1 1e-03 = 0.001
The output in R could be coincidental, so let us check with some other value of x, comparing the result of the difference quotient with our analytic derivative gdev(x) = 2*x + 1:
# Choose a value for x
xTest = 1
# Using g(x) from above:
g = function(x) (x^2+x+2)
betaTest = (g(xTest+1e-7)-g(xTest))/((xTest+1e-7)-xTest)
# Result for xTest = 1 => beta = 3
# Some values come with a remaining difference
# (not the case with, e.g., xTest = 9):
gdev = function(x) (2*x + 1)
all.equal(betaTest,gdev(xTest))
With an added value of 1e-7 to our chosen xTest, the result is (nearly) equal to gdev(xTest). Using 1e-12 leads to false values, and at some point you will get a NaN (Not a Number) as output, as the precision is not high enough or due to some other reason within the function (there may be ways around this in R, but it is not important for us now). We are not going to need the calculations above for the actual linear regression, but in case a PDE is still too abstract, always just recall the general idea from a conceptually downgraded perspective.
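The reason for such failures is the finite precision of floating point numbers. If you are curious, you can inspect the machine precision R works with (an optional aside):

# Smallest positive number x such that 1 + x != 1 in floating point arithmetic:
.Machine$double.eps # roughly 2.22e-16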
We have learned: in differential equations the difference is made infinitesimally small, turning the difference quotient into the derivative, i.e., the exact slope at a point.
We are now going to move on to taking the derivatives of our PDE and setting them to zero, in order to find the minimum of our partial differential equation ─ in our case the minimum of the sum of the squared errors of our guessed function. The goal is again the same as with an ordinary differential equation, just a little more abstract: we will be handling two equations at once and will solve them via a system of equations, using the derivatives to obtain the optimal values for α and β.
The next steps of decomposing our derivatives are a preparation to fit our derivatives into an easily solvable system of equations. Note that the rearranging part is supposed to give you the opportunity to follow all of the steps by hand, but is not necessary to understand in detail. The important part will be our resulting system of equations, which can be used for any set of x and y values.
In case you may wonder: we will get into what a system of equations is in a bit ─ we could have explained it now or before, but as it is rather simple, we decided to first rearrange the equations of our derivatives into their respective form. For now, just think of two (or more) equations that share the same unknowns.
Let us start decomposing our derivatives: as we are looking for a minimum value of our summed and squared errors, we set both partial derivatives to zero.
Now we get the factor 2 out of the sum: multiplying every element by 2 and then summing is the same as summing first and then multiplying by 2 ─ you can verify this in R:
2*sum(1:6) == sum(2*1:6)
We can also get rid of the factor 2 completely, since the right-hand side of each equation is 0 and dividing both sides by 2 changes nothing.
Resulting in:
Now we get rid of the product with the inner derivative (−1 and −x_i respectively), multiplying it into the bracket.
Resulting in:
Next, we can split up the sum to get closer to the appearance of a system of equations:
Now we bring the negative sum of the y-values (and of x·y respectively) to the other side of each equation:
Next, we do the same as with the factor 2 above, this time bringing the constant factors α and β in front of the sums:
From here everything becomes a lot easier again, as the remaining sums (Σx, Σx², Σy, Σx·y) are just numbers we can directly compute from our data:
# The above just represented in R (DO NOT EXECUTE IN THIS FORM)
# sum(a*length(x)) + b*sum(x) = sum(y)
# a*sum(x) + b*sum(x^2) = sum(x*y)
Numerically, applying our second synthetic data set to the upper formulas leaves us with the following system of equations (code will follow in a bit):

α·33 + β·165 = 174
α·165 + β·1155 = 1202.7
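If you want to verify these numbers yourself, the building blocks of the system can be computed directly from our data objects ─ a short optional check:

# The building blocks of our system of equations:
length(data_table$time)                # n = 33
sum(data_table$time)                   # 165
sum(data_table$time^2)                 # 1155
sum(data_table$cups)                   # 174
sum(data_table$time * data_table$cups) # 1202.7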
With the above we have obtained our desired structure of a system of linear equations and are confidently approaching higher spheres of inference. We can see two equations, each with the unknowns α and β ─ everything else is just a number computed from our data.
One might already have an idea on how to get to a result by substitution… For those who don’t know, or only roughly remember what a system of linear equations is, let us go through a quick example below.
In general, systems of equations are used to solve a wide range of very basic questions in science. We will start with a very simple example that most of us probably have formally encountered in riddles or so, without thinking of a system of linear equations as such:
FURRY EXERCISE: Two cats, Mr. Midnight and Jurassica Parker, are together 15 years old. Their age deviates by 7 years ─ Jurassica Parker being the older sister of Mr. Midnight. How old is each of them?
We can solve this exercise by substitution: the two equations are x + y = 15 and y − x = 7 (with x being Mr. Midnight’s and y being Jurassica Parker’s age). Solving the second equation for y gives y = x + 7; substituting this into the first equation gives x + (x + 7) = 15, so x = 4 and y = 11.
Another way of solving our exercise is via linear algebra: we can write the left side of the equations as a matrix and the right side as a vector, resulting in:
# Set up two elements, so that we can solve an
# equation in the form: left = right
left = matrix(c(1,1,
-1,1), ncol = 2, byrow = TRUE)
right = matrix(c(15,7), ncol = 1, byrow = TRUE)
# Use the solve function to solve the system of
# equations. It outputs a vector list = c(x,y)
ResultXY = solve(left,right)
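Executing the above should give us the ages we are after (an optional check of the result):

ResultXY
#      [,1]
# [1,]    4
# [2,]   11
# i.e., Mr. Midnight (x) is 4 and Jurassica Parker (y) is 11 years old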
Let us now finally solve the system of equations regarding our second synthetic data set by hand:
Resulting in:
Now we can smoothly insert the value obtained for β into one of our two original equations, in order to obtain α:
Resulting in:
And now we can insert both values into our linear model function f(x) = α + β·x:
Whoop-whoop! The values for α and β match the output of the lm() function! You have now mastered your first linear regression by hand! Let us now replicate the above fully via R and write our own function to perform a linear regression.
Next up we will replicate all of the math above using R and subsequently write our own linear_least_square() function. In other words: we will replicate the essential mathematical part of the lm() function ourselves. You may be surprised that only a few lines of code are needed to do so. Apart from putting together all the math from previous chapters, we will also include some fancy code for some nice-looking console output. ^^
At first, we defined a random “guess function” and our x and y values:
regRandom = function(x) (x)
x = data_table$time; y = data_table$cups
Next up, we optimized our guessed function by looking for the minimum of the sum of squared errors, via the partial derivatives of our PDE with respect to α and β.
We have already been through the part where we rearranged the derivatives to resemble a system of equations, so we can jump right to that point ─ our insights can now be used to dramatically abbreviate the code that performs the math. To do so, we will also define objects a and b (both set to 1), due to an abstraction of linear algebra in R when using the function solve() ─ you’ll see in a bit.
a = 1; b = 1 # ";" marks a new line (delimiter)
Below you will find our rearranged system of equations that can be looked at as forming a 2x2 matrix (left side) and a 2x1 (column-) vector (right side of the equation). We will use these objects as input for our solve(left,right)
function. The formula of the rearranged equations of our partial derivatives was:
In terms of code, the above can be written as:
# Formally in code (DO NOT EXECUTE):
# sum(a*length(x)) + b*sum(x) = sum(y)
# a*sum(x) + b*sum(x^2) = sum(x*y)
To get the above set up for solving a system of linear equations, we will create objects for the parts of the equations and then fit them into a matrix and a vector in R. The solve() function will treat each element separately; something like length(x) evaluates to the number of data points n (the factor of α in the first equation).
# Named in the form r = row, c = column
left_a_r1c1 = sum(a*length(x)); left_b_r1c2 = b*sum(x) ;
left_a_r2c1 = a*sum(x) ; left_b_r2c2 = b*sum(x^2)
right_r1c1 = sum(y)
right_r2c1 = sum(x*y)
Now we will set up the above as actual matrix (left) and vector (right) objects:
# matrix
left = matrix(c(left_a_r1c1 , left_b_r1c2,
left_a_r2c1 , left_b_r2c2),
ncol=2, nrow=2, byrow = TRUE)
# vector
right = c(right_r1c1, right_r2c1)
Now we can finally solve our system of equations via:
Result = solve(left,right)
# [1] 0.2318182
# [2] 1.008182
# Fitted function:
fitfun = function(x) (Result[1]+Result[2]*x)
SumError2fitted = sum((y-fitfun(x))^2)
# [1] 9.703364
In order to get a nice-looking output for our function below, we will use the cat() function. The cat() function concatenates elements of any kind in the console (!), such as text and our result values. Use "\n" for a line break and "\t" for the tab separator. We also have to apply a trick to get the actual names of the input objects into our console output, using deparse(substitute(x)):
# DO NOT EXECUTE. Code is exclusively meant for our function below:
# cat("Linear least square method in R","\n","\n",
# "Independent variable:", "\t", deparse(substitute(x)),"\n",
# "Dependent variable:", "\t", deparse(substitute(y)),"\n","\n",
# "alpha", "\t",Result[1],"\n",
# "beta","\t",Result[2], "\t","SumR2", "\t", SumError2fitted)
Let us now also plot our results. We will again use deparse(substitute(x)) for the labels of our plot axes.
plot(x=x,y=y, ylab = deparse(substitute(y)),
xlab = deparse(substitute(x)))
abline(a=Result[1], b=Result[2], col = "darkblue")
Voilà, that was it! We can now put all of the above together, execute it and subsequently use our own linear_least_square() function ─ whoooooop!!!
# Go-go-gadgeto linear_least_square!!!!
# Replication of the lm() function:
linear_least_square = function (x,y){ # Start of function
  # Setting up the system of equations (a and b are just placeholders
  # for the unknowns alpha and beta, see above):
  a = 1; b = 1
  left_a_r1c1 = sum(a*length(x)); left_b_r1c2 = b*sum(x)
left_a_r2c1 = a*sum(x) ; left_b_r2c2 = b*sum(x^2)
right_r1c1 = sum(y)
right_r2c1 = sum(x*y)
# Now we will set up the above as matrix (left) and
# vector (right) objects:
left = matrix(c(left_a_r1c1 , left_b_r1c2,
left_a_r2c1 , left_b_r2c2),
ncol=2, nrow=2, byrow = TRUE)
right = c(right_r1c1, right_r2c1)
# Now we can solve our system of equations via:
Result = solve(left,right)
# Fitted function:
lmFitted = function(x) (Result[1]+Result[2]*x)
SumError2fitted = sum((y-lmFitted(x))^2)
# Code for nice console output
cat(" Linear least square method in R","\n","\n",
"Independent variable:", "\t", deparse(substitute(x)),"\n",
"Dependent variable:", "\t", deparse(substitute(y)),"\n","\n",
"alpha", "\t",Result[1],"\n",
"beta","\t",Result[2], "\t","SumR2", "\t", SumError2fitted)
# Plot Results:
plot(x=x,y=y, ylab = deparse(substitute(y)),
xlab = deparse(substitute(x)))
abline(a=Result[1], b=Result[2], col = "darkblue")
} # End of function
We could actually also name our function “lm”, though it would conflict with the built-in lm() function, so we do not recommend renaming the above function in that way!!! We also have not tried doing so, i.e., the consequences, at least to us, are not even impetuously figured by a scanner darkly.
However, let's see what our function can do:
# Our function using our second synthetic data set:
linear_least_square(data_table$time, data_table$cups)
# OUTPUT:
# Linear least square method in R
# Independent variable: data_table$time
# Dependent variable: data_table$cups
# alpha 0.2318182
# beta 1.008182 SumR2 9.703364
# With our idealized data set:
linear_least_square(time_passed, cups_of_tea)
# Compare with the regular old lm() function:
lm(data_table$cups~data_table$time)
Note that the actual lm() function includes many more features, and its code will look quite different from the one we provided above, as it is, e.g., compatible with a wide range of input classes and partially executes code in C (which runs faster). The function above is also not an official, citable alternative to lm() (here you can find the original source code of the lm() function of the pre-installed package “stats”: [4]); it was written for didactic purposes only. However, the mathematics will be the same when it comes to simple linear regression models.
We hope this has given you a stable intuition and understanding of how a simple linear regression actually works. Multivariate models, linear mixed effect models and other forms of regression analysis are essentially based on the same principle. Importantly: the output of, e.g., glm() or lmer() will look more or less the same. This article is therefore our basis for the following tutorials on regression analysis in general.
So far we have looked into the regular output of the lm() function. Next up, we are going to look at the output of summary(lm(y~x)) and will replicate and integrate its math into our own function.