
[Python] Inferential Statistics III: Obtaining an Optimal Linear Model via Descriptively Evaluating Conditional Linear Relations

Beginner's Stat-o-Sphere

Published on Jan 29, 2024

Corresponding Python script:

Summary:

1 Introduction:

Welcome to the third part of our tutorial series on inferential statistics! Note that you are still in the beginner’s spheres. We also include optional recaps of the first two tutorials, so in case you missed the first two parts, or felt they were too basic or too complicated (previous tutorial) for your needs, it should not be much of an issue to start right away with this part of our series. We will still increase the pacing a bit in this tutorial and get closer to hypothesis testing in the probabilistic sense, i.e., to working with probability distributions, which will be the subject of the fourth part of this series.

In the first two parts of this tutorial series, we focused on the general concept of hypothesis testing (conditional probability) and linear modelling (conditional linear relations). In the third part of this series, we are going to look at structural similarities between both concepts by taking a closer look at what we called a conditional linear relation.

The first goal of this tutorial is to learn about ways to just descriptively evaluate our statistical linear model via concepts such as the mean, the variance and the standard deviation (in the future we may also provide a tutorial on descriptive statistics from a more general perspective). In the fourth and fifth part of this series we will relate these mathematical concepts to the concept of distributions of various kinds. Namely uniform and especially (standard) normal distributions.

The second goal of this tutorial is, as mentioned, to take a closer look at what we have called a conditional linear relation so far, and at how it can be evaluated via the variance and the covariance of our linear model, resulting in the regression coefficient of our optimal model (its optimal slope). In other words: evaluating the optimal slope of our model can alternatively be done via descriptive measures such as the mentioned forms of variance. This method will also turn out to be mathematically less laborious. Our alternative approach will also result in a surprising but intuitive formal relation between conditional probability and statistical relations that concern (estimated) values of measurements rather than probabilities. However, the downside of this approach is that it is not clear at first sight how the formula relates to our “real life problem” of drawing an optimal straight line through data ourselves. This is why we started with the classic linear least square method in this series, which is more laborious, but allows for a much better “visual intuition”. You may see this differently; in that case we hope that starting with this tutorial by itself will serve you well enough.

2 Recap on Linear Modelling and some Thoughts on why Alternative Mathematical Methods exist and come in handy

In the second part of this series, we used one of the simplest methods to obtain a fitted linear model. It was simple in the sense that we could relate the process very well to what we wanted to achieve: draw a straight line through data with optimal values for the intercept and the slope. By formulating our error function (our partial differential equation), taking the derivatives, setting them to zero and doing some rearrangements, we obtained a generalized system of equations that we can use to obtain an optimal model with the least squared errors for any set of y (dependent) and x (independent variable). Solving that system of equations results in the optimal a (intercept) and b (slope / regression coefficient) for our fitted linear function. Those particular optimal values of a and b in f(x) = a + b*x will result in a model with the least sum of squared errors when predicting (new) data points. We also discussed how this process can be related to classic neural networks performing gradient descent.
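In matrix form (matching the objects built in the code further below), the system of equations reads:

```latex
\begin{pmatrix}
n & \sum_i x_i \\
\sum_i x_i & \sum_i x_i^2
\end{pmatrix}
\begin{pmatrix} a \\ b \end{pmatrix}
=
\begin{pmatrix}
\sum_i y_i \\
\sum_i x_i y_i
\end{pmatrix}
```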

We have also written our own linear_least_square() function that does all the calculating for us and delivers an output equivalent to that of the ols() function in Python (from the statsmodels library). To do so, we translated the above system of equations into code. We therefore don’t have to do all the calculating by hand.

Below you will find the code for the complete linear_least_square() function:

import numpy as np
import matplotlib.pyplot as plt

def linear_least_square(x, y):
    # Here we define our guessed input function:
    lmRandom = lambda x: x
    
    # Linear least square method:
    
    # Next up we will optimize our guessed function
    # by looking for the minimum of the squared error with respect to
    # a (α) and b (β).
    
    # Our PDE goes as follows:
    # Error2(a,b) = sum(y-f(x))^2 = sum(y-a-b*x)^2
    # We have already rearranged the whole equation to resemble a system of
    # equations, such that we can abbreviate the code.
    # To do so we define a and b as placeholders (set to 1), since
    # np.linalg.solve() works with the coefficients of a and b:
    a = 1
    b = 1
    
    # Below you will find our rearranged system of equations in the form of a matrix (left side) and a vector (right side of the equation).
    # We will use these objects as input for our np.linalg.solve(left, right) function.
    
    # From (DO NOT EXECUTE, just formally):
       #   sum(a*length(x)) + b*sum(x)     = sum(y)
       #       a*sum(x)     + b*sum(x^2)   = sum(x*y)

    
    # To get the above set for solving a system of linear equations we will create objects of the parts of the above and then fit them into a matrix and a vector.
    
    left_a_r1c1 = a * len(x)
    left_b_r1c2 = b * np.sum(x)
    left_a_r2c1 = a * np.sum(x)
    left_b_r2c2 = b * np.sum(x**2)
    
    right_r1c1 = np.sum(y)
    right_r2c1 = np.sum(x*y)
    
    # Now we will set up the above as matrix (left) and vector (right) objects:
    left = np.array([[left_a_r1c1, left_b_r1c2], [left_a_r2c1, left_b_r2c2]])
    right = np.array([right_r1c1, right_r2c1])
    
    # Now we can solve our system of equations via:
    Result = np.linalg.solve(left, right)
    
    # Fitted function:
    lmFitted = lambda x: Result[0]+Result[1]*x
    SumError2fitted =  np.sum((y-lmFitted(x))**2)
    
    # In order to get a nice looking output, we will use the print() function.
    # The print() function concatenates elements of any kind, such as text
    # and our result values. Use "\n" for a line break and "\t" for a tab
    # separator:
    
    print(" Linear least square method in Python","\n","\n",
          "Independent variable:", "\t", "x", "\n", 
          "Dependent   variable:", "\t", "y", "\n","\n",
          "alpha", "\t",Result[0],"\n",
          "beta", "\t",Result[1], "\t","SumR2", "\t", SumError2fitted)
    
    # Create a scatter plot of the data
    plt.scatter(x, y, color='black', label='Data')
    
    # Plot the fitted line
    plt.plot(x, [Result[0] + Result[1]*xi for xi in x], color='darkblue', label='Fitted Line')
    
    # Set the x and y axis labels
    plt.xlabel("x")
    plt.ylabel("y")
    
    # Display a legend
    plt.legend()
    
    # Display the plot
    plt.show()

We can again use our furry data set to compare the output of our linear_least_square() function with the output of the regular ols() function:

# linear_least_square(independent, dependent)
linear_least_square(time, cups)

The output of our function also includes the sum of squared errors (SumR2) of our optimal fitted function. Recall that the R^2 reported in a regression summary (e.g., summary(lm(y~x)) in R) is not the same as the sum of squared errors / residuals (it has a different formula, but we will get there in the later parts of this tutorial series). The results of both functions will lead to the same plot output, suggesting that Lady Meow ^°^ is drinking around 1 cup of tea per hour.

Fig. 1 Fitted linear model function for our second synthetic data set from Inferential Statistics II.
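As an independent cross-check (a sketch, not part of the tutorial’s script), the normal-equation solution can be compared against NumPy’s built-in least-squares fit, np.polyfit, using the same data:

```python
import numpy as np

# Second synthetic data set (cups of tea over time, three runs):
cups = np.array([0, 1, 2, 3.5, 4.8, 5.2, 6, 6.9, 8.5, 9.1, 9.9,
                 0, .6, 2.6, 3.1, 4.8, 6.9, 7.1, 7.5, 8.5, 9.9, 9.9,
                 0, .2, 2.9, 2.9, 4.9, 5.2, 6.9, 7.1, 7.3, 9, 9.8])
time = np.tile(np.arange(11), 3)

# Solve the normal equations, as done inside linear_least_square():
left = np.array([[len(time), np.sum(time)],
                 [np.sum(time), np.sum(time**2)]])
right = np.array([np.sum(cups), np.sum(time * cups)])
a, b = np.linalg.solve(left, right)

# np.polyfit(x, y, 1) returns the coefficients [slope, intercept]:
b_check, a_check = np.polyfit(time, cups, 1)

print(a, b)  # the slope b should come out close to 1 cup per hour
```

Both approaches solve the same least-squares problem, so the results agree up to floating point precision.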

However, there are other ways, other formulas, to obtain the optimal parameters of a fitted linear model function. The reasons why alternative methods exist are manifold. One reason is the universality of different mathematical concepts: probability theory, for example, can be looked at from the perspective of linear algebra (e.g., conditional probability in Markov processes); we dipped into some very basic linear algebra when solving our system of equations in the former part of this series. Another, more complex example is a basic mathematical neural network, which can be generalized down to a regular logic circuit (NAND-gates etc.) and can therefore be used to do any kind of math, equivalent to a regular computer. In the above two cases, alternative methods are at least used to prove a point or to expand methods and perspectives on methods.

Other reasons are based on pure pragmatism: some calculations are easier to do “by hand”, while other formulas and algorithms are especially convenient to process on a computer, i.e., on a certain information technological architecture. Graphics chips (GPUs), for example, are used for matrix multiplications (e.g., via CUDA), as they can do a lot of calculations in parallel instead of in series (as a CPU does); in computer games, graphics chips are used to obtain a high frame rate; in the field of machine learning, CUDA simply lowers the processing duration in general.

In other words: some algorithms run faster and therefore more efficiently on (certain) computers (with certain hardware and software). A lot of common procedures in statistics also go back to a time when computers weren’t common (not that long ago). A classic example, which we will encounter in the fifth part of this tutorial series, is the use of z-tables / t-tables. These tables are used to read out the integral of a standard normal or t-distribution up to a certain value of z/t (the standardized/studentized sample mean), so that the actual integration does not have to be performed and can just be read out of the table…
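Nowadays the table lookup can be replaced by a direct function call; a minimal sketch, assuming SciPy is available:

```python
from scipy.stats import norm

# What a z-table row provides: the integral (CDF) of the standard
# normal distribution up to a given z-value, e.g. z = 1.96:
print(round(norm.cdf(1.96), 4))  # approx. 0.975
```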

We are mostly going to look at one particular alternative formula that has the same conditional structure as conditional probability and that is at least conceptually known to us: conditional linear relations in terms of the value of the dependent variable, given that the value of the independent variable has changed by 1 (regression coefficient / slope).

Via the covariance of y and x and the variance of x we will be able to calculate our slope b in f(x) = a + b*x with the same formalism as used in conditional probability ─ the covariance serving as the joint estimate and the variance of x as the unconditional or marginal estimate of x.
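Written out, the slope formula has the same “joint over marginal” structure as conditional probability, p(y|x) = p(x,y) / p(x):

```latex
b = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}
  = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}
```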

For some orientation: in the fifth part, when approaching the z-test / t-test, used to obtain a p-value, we will be able to check how likely such a conditionally linear relationship actually is (or how unlikely, given that we test the relation under the condition of a null hypothesis and some extra constraints we will get into later). What will be the hypothesis? A t-test / z-test for linear regression models essentially checks whether the slope is greater than zero. No slope ─ no change ─ no linear difference in the data; ergo: y is not dependent on the variable x (the intercept can also be tested, but it is often not that important, as we will see…).

Next up we are therefore going to look at very basic values in statistics, such as the mean, the variance and the standard deviation (also part of descriptive statistics).

3 Descriptive Statistics and Conditionally Linear Relations ─ Mean, Sum of Squares, Variance and Covariance

In this chapter we will go through some very basic concepts in statistics, some of them being part of what is called descriptive statistics, such as the mean and the standard deviation.

Along the way we will get to know the mentioned alternative method of calculating our regression coefficient bb, the optimal slope of our model function, via the covariance and variance. We will also need a clear understanding of the mean, the variance and the standard deviation when we approach the topic of probability density functions, which will be a part of evaluating our linear model under a “probabilistic perspective” (a part of calculating the t-value/z-value and subsequently the famous p-value).

We will start with an absolute classic of classics: the (arithmetic) mean.

3.1 Arithmetic Mean / Weighted Arithmetic Mean / Expected Value

There are several different kinds of “mean values”, even though the basic formula remains the same. We will start with the general concept of an arithmetic mean, which can be understood as evaluating an average or central tendency of a value of a data set (set of values; distribution of values). It is referred to as arithmetic, since it just operates with most basic arithmetic operations: sum and division. There is also the harmonic mean (this is the arith. mean of the reciprocal of the values of a set) and the geometric mean (uses the product not the sum of values of a set). All of the three mentioned kinds of mean are part of what is called the Pythagorean means.[1] However, we will stick to the (weighted) arithmetic mean in this tutorial.

There are two kinds of arithmetic mean: the sample mean and the population mean. A population is a rather abstract concept, as it refers to an absolute quantity of, e.g., “all humans” or “observed over all time”, and is therefore in general something that is seldom given in the form of a data set: it is quite difficult to obtain data from all humans, and it is also not necessary to test all humans in order to obtain evidence for an outcome. Generally and philosophically speaking, the attributes “all” or “everything” may never really fit an empirical scheme, thinking about it (reflecting on the boundaries of space and time when performing inference).

However, the mathematical formula for the sample and the population mean is the same; just the size N of a population (the number of given data values) will always be greater than the size n of a sample of that population (which is just a subset of the population). This inequality refers to a simple but very important fact that we will discuss and make further use of in the next part of this series, when discussing the central limit theorem.

Recall that our second synthetic data set can be understood as three samples, so to speak: it is not a population (observing Lady Meow for her whole life or every tea she drank), but it is more than a single sample data point (three runs/days of observing Lady Meow ^^). Note that the sample size is determined by the length n of the events (or relations between y = cups of tea and x = time in hours). So the sample size of our synthetic data sets on observing Lady Meow will be n = 33 each (33 events that we observed in each of our synthetic data sets).

### First set with idealized linear relation f(x) = x:

# Cups of tea
cups = np.array([0,  1, 2,   3.5, 4.8, 5.2, 6,   6.9, 8.5, 9.1, 9.9, # run 1
           0, .6, 2.6, 3.1, 4.8, 6.9, 7.1, 7.5, 8.5, 9.9, 9.9, #     2
           0, .2, 2.9, 2.9, 4.9, 5.2, 6.9, 7.1, 7.3, 9,   9.8]) #     3


# Time passed for each run of measurements (cups), abbreviated:
# np.arange is used to generate a sequence of numbers for
# a numpy array.
# np.tile repeats this sequence 3 times.
# This is less clumsy than writing the sequence out manually.

time = np.tile(np.arange(11), 3)

# Sample size n
# Please note that Python is different here than R in which 
# you get the variable printed in your terminal without a 
# "print" statement
print(len(time)) 
# Output: 33

In the code below, we will again use the second data set for both the sample and the population, for simplicity. The concept of a whole population is a rather abstract matter anyway: we are at least theoretically able to observe it, but practically it would be very hard to do so (e.g., observing or testing all humans is hardly possible, and there are also mathematical reasons why we do not need to do so in order to obtain or boost evidence).

The formulas for the two kinds of arithmetic mean are as follows:

sample mean:      x̄ = (1/n) · Σᵢ xᵢ
population mean:  μ = (1/N) · Σᵢ xᵢ

The population mean is often denoted as μ (mu), in contrast to the sample mean x̄.

# Arithmetic mean for the sample x and y of our second synth. data set,
# i.e., the sample mean: 
x = time
y = cups

# Arithmetic mean for the sample x and y of our second synth. data set,
# i.e., the sample mean. Recall: the formula for the population mean
# is the same as for the sample mean, just with a different size
# of n, i.e., a different (greater!) length of x than in the sample, 
# since N>n for the population (N).

n = len(x)
mean = sum(x) / n
mean = (1/n) * sum(x)

# Using the built-in numpy function np.mean():
mean = np.mean(x)


print(np.mean(x) == np.median(x))
# Expected output: True

Fig. Meme by Andy_math_dot_com.

3.1.1 The Ambiguous Use of the Term Expected Value

In the formulas above we saw that the population mean is denoted as μ, in contrast to the sample mean x̄. However, the population mean μ is often also referred to as the expected value, even though it is not always the same. In general, the expected value or expectation value is mathematically denoted as:

E[X] = Σᵢ xᵢ · pᵢ

This kind of expected value notation is usually related to the use of the weighted arithmetic mean with probabilistic weights, and is not related to the use of the term expected value as population mean, which depends on N. However, before we discuss the weighted arithmetic mean and probabilistic weights, let us first look at the relation between the term expected value and the population mean:

Another way to define the expected value is not precisely as an actual population mean, but as representing the mean of previous studies that form a “population of means”, since observing the actual population is usually not possible. In that sense, all previous studies can be understood as an approximation of the population mean, since further observations make the gap between the n of a bunch of samples and the N of a population smaller. So the difference between the n of a single new sample and N as representing a bunch of samples ‘heuristically’ serves the inequality between population and sample, N > n, even though we did not actually observe the whole population. As long as we do more studies of the same kind, our approximation of a population with size N will approach the “actual population of observations”, so to speak.

In the next part of this series we will show how this heuristic is grounded in the so-called central limit theorem, concerning normal distributions (parametric Gaussian functions). The central limit theorem says that when you repeatedly sample from a distribution of any form (logarithmic, uniform, linear…) and take the mean of each sample, then the more sample means we obtain from that arbitrarily formed distribution, the more their distribution will approach a normal distribution of sample means (over time, so to speak). This is one of the reasons why normal distributions are so common in (frequentist) statistics, since at some point, or size of n, it makes no difference what the actual distribution has been. A truly fascinating circumstance that we will discuss and widely make use of in the next part of this series.

Fig. On the left we see a histogram of a uniform distribution of values of x ranging from 0 to 10. The y-axis refers to the frequency of each value of x (total of the frequencies is: n = 1000). On the right we see the distributions of the means of various samples. The y-axis again refers to the frequency of each sample mean (x-axis) occurring. We can see that even though we sampled from a uniform distributed set, the histogram of the means of our samples clearly tended towards a normal distribution over time.
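The behavior shown in the figure can be reproduced with a few lines (a minimal simulation sketch, not the code that generated the figure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 1000 samples of size n = 50 each from a uniform
# distribution on [0, 10) and take each sample's mean:
sample_means = rng.uniform(0, 10, size=(1000, 50)).mean(axis=1)

# The sample means cluster tightly around the population mean (5),
# even though the underlying values are spread uniformly:
print(round(float(sample_means.mean()), 1))
print(float(sample_means.std()))  # much smaller than the uniform spread
```

Plotting a histogram of sample_means would show the bell shape from the right panel of the figure.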

To wrap this up: think of the difference between a sample and a population, and the difference between a sample and a bunch of samples of a population, as a part-to-whole relationship, all the way down serving the same inequality N > n. Importantly, when thinking of a z-/t-test, we can argue that when we sample not from any distribution, but from a normal distribution right away, we will approach the mean of the population. In consequence, when we don’t sample from the assumed population but from another population, the mean will diverge from the mean of the assumed population. The latter can then serve as evidence for a difference in means, e.g., when comparing a verum group (a sample that takes a medical drug) with a population that did not get the medication. This circumstance is taken advantage of when evaluating a difference in means in the z-/t-test.

Fig. Ms. Miranda, evaluating what to expect.

OPTIONAL furry example for the understanding of the expected value: say we have observed the tea drinking behavior of Lady Meow for a couple of years and therefore don’t need to set a random expected value ourselves. The mean of all the previous study outcomes on the amount of tea Lady Meow drinks on average could be used as the expected value. In other words, all previous studies of the same kind (observing Lady Meow ^^) can be understood as a population, again in the sense of a part-to-whole relationship: the current sample in relation to all previous study samples. Evaluating a difference between these two is essentially what a z-/t-test does: it tests for a difference in means. What could we do with such an expected value of cups of tea in a day of Lady Meow? Well, let us say that Lady Meow drinks less tea when she is sad. When our sample mean deviates a lot from the expected value (from years of experience), it could mean that Lady Meow needs help, since she is sad and therefore not feeling well enough to go after her favorite habit.

The same goes for the slope and intercept of a linear model: we may expect a certain difference, given a change in x of 1, i.e., we may expect a certain tendency of change in the amount of tea drunk during a certain amount of time throughout the day. If this is not the case, maybe it is because Lady Meow is not feeling well (more generally: shows certain symptoms).

So, if some of what we discussed above concerning the meaning of the term expected value still appears confusing to you: recall that the definition is actually already in the name: the expected value is the expected mean/average or in general our expectation of how things will turn out, since we have been fully absorbed by Lady Meow and have watched her many, many times drinking tea in the past, so we can say we know her quite well already. Therefore, we will have an expectation of her behavior, given previous data (a population of samples so to speak), even if we were to set an expected value by coming up with it ourselves.

In a more general usage of the expected value it not only reflects the “regular” average but also the weighted average.[2] This will not be so important for our simple linear model, but we will at least show how this is done with estimates and with a probability distribution as weight:

If the weights are equally distributed (equivalent to a uniform prior as a weight on the likelihood for instance) the weighted arithmetic and the regular arithmetic mean will result in the same. Think of a uniform weight as not adding any weight at all, since every value has the same weight (so it is actually unweighted).
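This equivalence can be checked directly with NumPy’s np.average(), which accepts a weights argument (a quick sketch with toy values, not the tutorial’s data set):

```python
import numpy as np

x = np.arange(0, 11)

# Uniform weights: every value gets the same weight, so the
# weighted mean equals the plain arithmetic mean:
uniform_w = np.full(len(x), 1 / len(x))
print(np.isclose(np.average(x, weights=uniform_w), np.mean(x)))  # True

# Non-uniform weights pull the mean toward the heavier values;
# here the weights increase with x, so the weighted mean is larger:
increasing_w = np.linspace(1, 2, len(x))
print(np.average(x, weights=increasing_w) > np.mean(x))  # True
```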

In Python the ols() function does not directly work with weights for estimates / weighted linear regression. This is achieved by using the WLS class, which is designed for this purpose. If you are specifically interested in WLS, you can look up its documentation here.

3.1.2 Probabilistic and Non-Probabilistic Weights

The next two chapters can be considered optional: understanding them is not required for the remaining parts of this tutorial, nor for further parts of this series.

As mentioned: a weight can be probabilistic or non-probabilistic. Let us start with non-probabilistic weights. A non-probabilistic weight could be, e.g., the differing number of participants in two groups.

Non-Probabilistic Weights:

For a numeric example we will go through the rather simple example on the Wikipedia page on the weighted arithmetic mean (11.2023). The example is about the test grades of two classes with different sizes n in terms of the number of participating students. In such a case we can weight every grade by the number of students in its class.

In the Wikipedia example one class had 30, the other 20 students. However, the calculation of the numeric example on Wikipedia slightly differs from the formula we provided above. It is probably wrong on Wikipedia, but it teaches a good lesson, we think. On Wikipedia they first calculated the mean of each sample / each class and then weighted it. As mentioned, the actual general weighted mean would weight each student’s grade with the respective length of its sample group (the n of each class) and then average it out by the sum of all weights.

We will only go through the correct use of the formula of the weighted arithmetic mean. In the script of this tutorial, we also provide the code to calculate the actual Wikipedia example. The fact that the Wikipedia example deviates from the actual use of the formula is a good example for why one should be cautious when using open educational resources on the internet (especially when there are no peer-review options; Wikipedia is in general not well reviewed, I’d say, even though some articles, or parts of them, are done well). However, Python can be used well as a tool to evaluate numeric examples, as it has built-in functions to check both the formula and the output of an example.

Apart from the formula, the NumPy mean() method can also be used (see code below). We call the mean() method on the arrays we created for the grades, which calculates the arithmetic mean of each class. This enables us to verify our usage of the formula:

### 3.1.2 Probabilistic and Non-Probabilistic weights #

#### Weighted arithmetic mean - (modified) Wikipedia example:

# Samples:
grades_class_1 = np.array([62, 67, 71, 74, 76, 77, 78, 79, 79, 80, 80, 81, 81,
                           82, 83, 84, 86, 89, 93, 98])
grades_class_2 = np.array([81, 82, 83, 84, 85, 86, 87, 87, 88, 88, 89, 89, 89,
                           90, 90, 90, 90, 91, 91, 91, 92, 92, 93, 93, 94, 95,
                           96, 97, 98, 99])

# Regular arithmetic mean of each sample:
mean_class_1 = grades_class_1.mean()
mean_class_2 = grades_class_2.mean()
print("Mean of Class 1:", mean_class_1)  # Expected output: 80
print("Mean of Class 2:", mean_class_2)  # Expected output: 90

# Regular arithmetic mean of the grades of both classes:
mean_grades = (mean_class_1 + mean_class_2) / 2
print("Mean of Grades", mean_grades)  # Expected output: 85

# Weights (here weight == n):
n_class_1 = len(grades_class_1)
n_class_2 = len(grades_class_2)
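For completeness (the full version is in the tutorial’s script), applying the general formula, i.e., weighting every single grade by the size of its class and dividing by the sum of all weights, might look like the following sketch; the last lines also compute the Wikipedia variant, which weights the two class means instead:

```python
import numpy as np

grades_class_1 = np.array([62, 67, 71, 74, 76, 77, 78, 79, 79, 80, 80, 81, 81,
                           82, 83, 84, 86, 89, 93, 98])
grades_class_2 = np.array([81, 82, 83, 84, 85, 86, 87, 87, 88, 88, 89, 89, 89,
                           90, 90, 90, 90, 91, 91, 91, 92, 92, 93, 93, 94, 95,
                           96, 97, 98, 99])

# Weight every single grade by the size of its class, then
# divide by the sum of all weights (general weighted mean):
all_grades = np.concatenate([grades_class_1, grades_class_2])
weights = np.concatenate([np.full(len(grades_class_1), len(grades_class_1)),
                          np.full(len(grades_class_2), len(grades_class_2))])
weighted_mean = np.sum(weights * all_grades) / np.sum(weights)
print(weighted_mean)

# Wikipedia variant: weight the two class means by the class sizes:
wiki_mean = (len(grades_class_1) * grades_class_1.mean()
             + len(grades_class_2) * grades_class_2.mean()) / len(all_grades)
print(wiki_mean)  # 86.0
```

The two results differ slightly, which is exactly the deviation discussed above.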

Probabilistic Weights:

A simple example of probabilistic weighting is weighting the values of a fair dice by their respective (uniform) probability of occurrence.

Fig. The possible values of a six-sided dice can be weighted by their respective probability of occurrence, i.e., p(x)=1/6p(x)=1/6. Photo by Alois Kommenda.

Arguing that the dice is truly fair also lets us at least logically relate the expected value to the population mean: fairness implies that after infinite observations, the occurrence of each dice value will turn out to be weighted fairly. We will get into phony dice some other time… In the case of probabilistic weights, the formula of the expected value (weighted arithmetic mean) can also be abbreviated and written as follows:

E[X] = Σᵢ xᵢ · pᵢ

It is the same formula as before, just lacking the denominator. Why? Take a look: with the denominator the formula would be as follows:

E[X] = (Σᵢ xᵢ · pᵢ) / (Σᵢ pᵢ)

Now, recall that the probability vector pᵢ will always sum to 1; in other words, the denominator will always be 1 and can therefore be dropped in the case of probabilities, which may blur the obvious relation to the weighted arithmetic mean a little at first glance.

# Expected value as weighting a regular distribution of data
# points (here dice values) by a uniform probability distribution:

# Dice values of a fair dice weighted by the prob. of 
# occurence of each dice value:
dice_prob = np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6]) # prob. vector


# Check if sum is 1
# If we just print the sum of the dice_prob vector, we get a floating point error
# This is due to limitations of floating point numbers in Python
sum_dice_prob = sum(dice_prob)  
print("Sum dice_prob with floating point error", sum_dice_prob)  # Expected output: 0.9999999999999999


# Therefore, we need to fix this by defining a tolerance for floating point errors:
tolerance = 1e-10 # tolerance for floating point errors
is_one = abs(sum(dice_prob) - 1) < tolerance # check if sum is 1
print("is_one:", is_one) # Expected output: True

# Alternative to tolerance level to check if sum is 1 with np.isclose():
#print(np.isclose(np.sum(dice_prob), 1))  # Expected output: True


# Define dice values
dice_value = np.arange(1, 7)

# Calculate expected value of dice
EX_dice = np.sum(dice_prob * dice_value)

# Alternative way to calculate expected value
EX_dice_alt = np.sum(dice_prob * dice_value) / np.sum(dice_prob)

print(EX_dice)  # Expected output: 3.5

# Check if all values are equal
print("All values equal?:", np.isclose(EX_dice, np.mean(dice_value)) and np.isclose(EX_dice, EX_dice_alt))  # Expected output: True

In case your heart only beats in the pace of the time that Lady Meow is drinking her tea (<3), we can do the same with the independent variable of our second synthetic data set, i.e., x = np.arange(0, 11) # hours.

x = np.arange(0, 11)  # hours; we will use 0:10 only once

# Obtain prob. value by calculating 1/n (a kind of normalization):
probX = 1/len(x)          # 1/n == uniform probability distribution
probXvec = np.repeat(1/len(x), len(x))  # alternatively represented as 
                                        # probability vector that 
                                        # sums to 1

print("Prob. vector that sums to 1 which is equivalent to 'True':", np.isclose(np.sum(probXvec), 1))  # Expected output: True

# Actual weighted values of x:
weighted_x = x * (1/len(x))
weighted_x_vec = x * probXvec  # alt. via prob. vector
print(np.all(np.isclose(weighted_x, weighted_x_vec)))  # Expected output: True

# weighted_x (uniformly weighted values of x):
# array([0.        , 0.09090909, 0.18181818, 0.27272727, 0.36363636,
#        0.45454545, 0.54545455, 0.63636364, 0.72727273, 0.81818182,
#        0.90909091])
print("weighted_x:", weighted_x)

# Expected value as the sum of weighted values of x:
EX = np.sum(x * (1/len(x)))  # sum of weighted_x is in this special case...
EX = np.sum(probXvec * x)
print(EX)  # Expected output: 5

print(np.isclose(EX, np.mean(x)))  # ... equivalent to the unweighted arithmetic  
                                  # mean of x; Expected output: True

3.2 Sum of Squares

In order to obtain the variance and the standard deviation, we have to look at the so-called sum of squares. We have worked with the sum of (least) squares before, or more correctly: with the sum of squared errors / residuals. The sum of squares, however, is not the same as the sum of squared residuals. The sum of squares refers to the variability or deviation of the values of a set from the mean of that set ─ only slightly different from what the variance or the standard deviation does, as we will see.

CAVE: Importantly, the variance and the standard deviation only really make sense when understanding their basic building block: the total sum of squares. We promise that the next chapters will be very enlightening when it comes to understanding, and never again forgetting, the relation and difference between the variance and the sd.

The sum of squares is also referred to as total sum of squares. Below you will find the formula of the total sum of squares, presented as the summed and squared deviations of x from the mean of all given x (below we are using the sample mean). [3] “Why is the deviation squared?”, you may ask. We will get there after the short code example below.
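In plain LaTeX, the total sum of squares (here for x, using the sample mean) reads:

```latex
SS_{xx} = \sum_{i=1}^{n}\left(x_i - \overline{x}\right)^2,
\qquad \overline{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```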

Here is some code, using our second synthetic data set again. The numeric value of the sum of squares does not make much sense by itself, so we will dig deeper into the formula.

# Using the second synthetic data set from Inf. Stat. II again:
x = time
y = cups

# Mean (same formula for population mean, sample mean, and expected value).
# The difference lies in the concept. The expected value can be considered
# the mean of previous studies (overall mean):
n = len(x)
mean_x = np.sum(x) / n

### (Total) Sum of Squares + Population and Sample Variance ###

# SumSqXX = ∑(x_i-E[X])^2, 
# where E[X] = 〈x〉 = μ_x = ∑(x_i)/n = expected value 

# SumSqXX:
SumSqXX = np.sum((x - mean_x)**2)
# SumSqYY:
SumSqYY = np.sum((y - np.mean(y))**2)

# SumSqXY:
SumSqXY = np.sum((x - mean_x) * (y - np.mean(y)))

Note that the process of subtracting the mean from a set of values (without summing) is also called centralization. It results in a set of values that “centralize around the mean”, i.e., the closer the initial positive/negative value has been to the mean, the lower/higher the centralized value will be, until it approaches zero, when the initial value of the set was equivalent to the mean. In some way, we can look at it as a kind of specific error function as well (figure below).

In contrast to the squared residuals, where we looked at the deviation of y_i from f(x_i), we now look at the deviations from the mean as “errors”. However, apart from that, it follows the same structure and logic.

Note that some values in the plot below, showing the result of an unsquared centralization, are negative and some turn out to be positive. In our idealized example below (np.arange(0, 11) - np.mean(np.arange(0, 11))), the total sum of the un-squared deviations from the mean is exactly 0, which makes no sense when looking at the plot of it below (indicating plenty of deviation). In order to avoid that positive and negative values simply cancel each other out, we have to square the centralized values before summing.
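We can verify this cancellation numerically; a small sketch of our own (x_demo is just a stand-in name for the idealized 0 to 10 example):

```python
import numpy as np

x_demo = np.arange(0, 11)  # the idealized 0..10 example from the text

# Unsquared deviations from the mean cancel out to exactly zero:
print(np.sum(x_demo - np.mean(x_demo)))       # 0.0 (up to floating point)

# Squared deviations do not cancel:
print(np.sum((x_demo - np.mean(x_demo))**2))  # 110.0
```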

# Pure deviation from the mean (without summing!):
dev_of_mean_X = x - mean_x

# Plot the centralized values:
fig, ax = plt.subplots()
ax.plot(x, dev_of_mean_X, 'o')
plt.show()  # Please note that we made some adjustments for the plot,
            # which would take up too much space here; if you are
            # interested in them, look it up in the script.

Fig. Plot of x-mean(x) only, i.e., without summing and squaring yet. The closer the initial value had been (x-axis) to the mean, the closer to zero the subsequent centralized value will be (y-axis). The centralized value will be zero for that specific value of x where the initial x is equal to the mean(x).

Let us do some further plotting of the individual deviations from the mean, before summing them: As mentioned, squaring leads to values that are all positive or zero (the standard deviation, which starts with the sum of squares, is therefore always cast as +/- sd). For comparison, we can achieve the same effect of having positive values only by taking the absolute value, where, e.g., |-7| = 7. By taking the absolute value of the deviations from the mean, we just look at the deviations as such, regardless of sign.

Plotting will lead to a graph with a discrete distribution where the height (y-axis) indicates how much each value of x deviates (no matter whether +/-) from the mean. The x-axis just refers to the respective values of x (see below):

# Note: dot_size and customize_plot() below are defined in the full
# script accompanying this tutorial.

# Plot using absolute values:
fig, ax = plt.subplots()
ax.plot(x, np.abs(dev_of_mean_X), 'o', markersize=dot_size)
customize_plot(ax)
plt.show()

# Plot using the square, which smoothes the graph:
fig, ax = plt.subplots()
ax.plot(x, dev_of_mean_X**2, 'o', markersize=dot_size)
customize_plot(ax)
plt.show()

Fig. The y-axis relates to the absolute value of, e.g., x-mean(x). The x-axis relates to x. Recall that x is in this case just a sequence of 0 to 10, raised by 1 each step.

Fig. The y-axis relates to (x-mean(x))^2. As an effect of the squaring, we can now see a similar but clearly smoothed out, parabolic discrete distribution of points.

The above plot is also supposed to make it somewhat easier to numerically understand from step 1 (the sum of squares) how values such as the mean, the variance and the standard deviation are in the end related to a (bell curved) Gaussian probability distribution. Specifically, the characteristic symmetry of the plot above makes it hopefully somewhat intuitively plausible how values such as the standard deviation can be related to a normally distributed probability density function. However, we will get into all details of probability (density) functions in the next part of this series.

Upfront for those impatient to get an idea of Gaussian functions: The number e to the power of -x^2 results in a symmetric Gaussian function. In other words: a normal distribution can be thought of as a combination of two exponential functions e^{-x} and e^x (approaching and crossing a middle from the left and the right side), and a parabola -x^2 (the bell-like curve in the middle, connecting the approaching exponential functions e^x and e^{-x}). Again, we will go through all the details in the fourth part of this series.
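A quick numerical sketch of our own to illustrate the symmetry claim (the function name gauss is just an illustrative choice):

```python
import numpy as np

def gauss(x):
    # Unnormalized Gaussian: e to the power of -x^2
    return np.exp(-x**2)

xs = np.linspace(-3, 3, 7)

# Symmetric around 0: f(x) == f(-x) for every x:
print(np.allclose(gauss(xs), gauss(-xs)))  # True
```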

As the total sum of squares is the basis of the variance and the standard deviation, we can conclude that all of these values reflect deviations from the mean in some way. It is therefore not totally wrong if you have difficulties semantically differentiating the variance from the standard deviation (since both imply variability of some kind, when just reflecting on the words variability and deviation).

However, different from the variance, the total sum of squares does not directly reflect the sample or population size when expressing the variability of values around the mean (centralization), even though the size of the sample/population is indirectly entailed (when summing over all n or N): “Reflecting on” or “in relation to the size of N” can in mathematical terms mean: dividing the total sum of squares by N, which already results in the population variance (we have to divide by n-1 for the sample variance, see next chapter). Think of this process of putting deviations into an actual proportion of N as averaging, similar to what we did when calculating the mean.

The total sum of squares can therefore in general be understood as the first step of calculating the variance. The variance on the other hand can then be understood as the average squared deviation from the mean. Even though the term variance is often used in general to indicate variability, recall that the variability starts with the step x - mean(x), the essential part of the sum of squares (centralization).

QUICK OPTIONAL SIDE QUEST: We can also test our vector x for normality, for example via the Shapiro-Wilk test (we won’t get into the actual math of this test). However, this is what we would check when choosing a hypothesis testing method. In case the data turns out not to be normally distributed, we would use a rank test instead of a t- or z-test. We will catch up on what rank tests are some other time…

# OPTIONAL SIDE QUEST: Testing for normality of x:
from scipy.stats import shapiro  # if not already imported at the top

stat, p = shapiro(x)

print("Shapiro-Wilk normality test")
print(f"W = {stat}, p-value = {p}")

# IMPORTANT: A p-value above 0.05 means we cannot reject normality!!
#            Here the null hypothesis is that the data is normally
#            distributed. We cannot discard the null hypothesis, so
#            we proceed as if normality were given.

3.3 Variance

As we already know: Formally the (population) variance is just the total sum of squares divided by N. Therefore, you can conceptually think of the variance as the average squared deviation from the mean, or “the mean of squares”, not only its sum! As mentioned before, it is quite hard to interpret the numeric value of the sum of squares by itself. Same goes for the variance. If we were to plot the variance, we would also need to plot an area, since it is squared (we would need to plot a literal square — this is why we take the square root for the standard deviation later below). Nevertheless, understanding the concept of variance in terms of (averaged) variability, we can in general say: the lower the value of the variance, or in other words: the less variability the better, as a low variability means it will stick more closely to our expectations. The variance for itself though does not come with a numerical base line, as it depends on the values of the set. A variance of 100 — is that low? We could compare the variance of several samples though, in order to get some insight…

Note: The possibility of some form of comparison does not hold for the total sum of squares by itself, as it lacks N in the denominator (see formula below). N acts as a kind of counterweight to the total sum of squares, so to speak. Numerically, a high total sum of squares by itself might still lead to a surprisingly low variance when a certain number of participants / events N is given. This is probably the reason why the sum of squares by itself is not discussed as often as the variance or the standard deviation.

Similar to the arithmetic mean, the variance comes in two forms: the population and the sample variance:
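In plain LaTeX, the two formulas read (population variance on the left, sample variance on the right):

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \overline{x}\right)^2
```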

The difference between sample and population variance is again rather simple: The population refers to a complete set of N, such as ‘all humans’, which is however rarely available (collecting data from all humans is rather laborious and costly). The formula of the sample variance is equivalent to the population variance, apart from two differences:

a) Instead of the sample mean x̄, the population variance uses the mean of the whole population 𝜇 for centralization, which can deviate from x̄ due to the different number N of values of x. Recall that the expected value is also sometimes denoted 𝜇 instead of E[X].

b) A 1 is subtracted from the sample size n, which makes the sample variance always higher than the plain division by n would yield, in order to compensate for the lack of information we have on the behavior / data of the population. [4][5] This is also referred to as Bessel’s correction. The choice of -1 is not arbitrary, but we will not go into further details on why it is carefully chosen here (we will discuss it in the form of degrees of freedom when discussing the t-test).

Note that you will come across several types of bias-correcting modifications in other contexts of statistics. Also note that such corrections do not correct for every kind of bias, but only for specific types.
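To see Bessel’s correction at work, here is a small simulation sketch of our own (the seed, the distribution parameters, and the sample size of 5 are arbitrary illustrative choices): repeatedly drawing tiny samples from a known “population” shows that dividing by n systematically underestimates the population variance, while dividing by n - 1 does not.

```python
import numpy as np

rng = np.random.default_rng(42)

# A synthetic "population" with a known spread:
population = rng.normal(loc=0, scale=2, size=100_000)
pop_var = np.var(population)  # population variance (divide by N)

# Draw many small samples at once and average both variance estimators:
samples = rng.choice(population, size=(5_000, 5))
biased = np.var(samples, axis=1, ddof=0).mean()    # divide by n
unbiased = np.var(samples, axis=1, ddof=1).mean()  # divide by n - 1

print(round(pop_var, 2), round(biased, 2), round(unbiased, 2))
# The ddof=1 average lands much closer to the true population variance.
```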

Let’s do some coding:

# Variance := σ^2 = SS/n := the variance is the mean of
# the squared deviations ("mean of squares")

# Variance of x:
sampVarX = SumSqXX / (len(x) - 1)  # Sample variance
sampVarX_alternative = np.var(x, ddof=1)  # Using numpy with ddof=1 for sample variance
popVarX = SumSqXX / len(x)  # Population variance

# Variance of y:
sampVarY = SumSqYY / (len(y) - 1)  # Sample variance
sampVarY_alternative = np.var(y, ddof=1)  # Using numpy with ddof=1 for sample variance
popVarY = SumSqYY / len(y)  # Population variance

Next we are going to evaluate the covariance, which will lead us to the mentioned alternative way of obtaining the optimal slope of our linear model.

3.4 Covariance

For the covariance we now need to talk about the sum of “squares” of x and y. You might have noticed in the code examples above that we have actually already calculated the total sum of squares regarding x and y (and not only regarding x or y): Different to, e.g., the sum of squares of x, where x is related to the mean of x and then squared, here a relation of y to the mean of y is drawn instead of just taking the square! Taking an actual square is not needed in this case of “co-variability”.

In case you wonder why this is the case, recall what a square is in the simple geometric sense: a perfect square forms an area with equal side lengths, such that the area of such a two-dimensional square can be calculated via a*a, which is equivalent to a^2. The total “co-sum” of squares (so to speak), understood in the geometric sense, does not actually involve a square with equal side lengths, but a rectangle: the area with unequal side lengths is simply obtained via multiplying a*b. We hope this helps and explains the whole XX, YY and XY notation so far.

Most important: different from the variance and, later, the standard deviation, the covariance does not necessarily have to be positive, since the values are not just squared! We will get to the meaning of this special characteristic in a bit!

# SumSqXY = ∑(x_i - E[X])*(y_i - E[y])
SumSqXY = np.sum((x - np.mean(x)) * (y - np.mean(y)))

The variance obtained from our “co-sum of squares” is simply called the covariance; apart from that, it follows the same formalism as the variance: we can again distinguish between a sample and a population covariance (i.e., with bias correction for the sample covariance):

# Covariance:
covarXYsamp = SumSqXY / (len(x) - 1)  # sample covariance
# 10.39687
covarXYpop = SumSqXY / len(x)  # population covariance
# 10.08182
covarXYsamp_alternative = np.cov(x, y, ddof=1)[0, 1]  # using numpy
# 10.39687

Recall all the great time we had with Lady Meow and take a close look at the population covariance: Do you notice anything unusually coincidental with the population covariance? You are right, it is numerically preeetty close to our beta estimate of our linear model (second synthetic data set)! :O

Fig. Ms Miranda, utterly surprised. ^^

To understand why this is the case, let us first check whether this is only a coincidence: We can do so by fitting linear models to different data sets and seeing how their slope relates to the value of the covariance. After some time, we might notice the following relation: a positive covariance relates to a positive value of beta, and a negative covariance to a negative slope.

Fig. We can see that the covariance is related to the slope of the data. As we will see, the covariance is actually related to the optimal slope of our model function, given a set of data points!

Note: These plots were created for our original inferential statistics tutorials in R. As you can see they look a bit different than the Python plots we provided above but contain the same information. This holds true for all the calculations we show here: They might look a bit different, but the results are the same since they are based on the same math.

Below you will find the code to replicate the graphs above:

# Example scatter plot for cov(x,y) > 0:
plt.scatter(x, y)
plt.show()
print(np.cov(x, y)[0, 1] > 0)  # Expected output: True

# Simple vertical mirroring of our data as an 
# example scatter plot for cov(x,y) < 0
plt.scatter(x, -np.array(y))
plt.show()
print(np.cov(x, -np.array(y))[0, 1] < 0)  # Expected Output: True

# Here another (synthetic) example for cov(x,y) = 0:
cross_x = np.array([5, 5.5, 4.5, 5, 5, 5.25, 4.75, 5, 5])
cross_y = np.array([5, 5, 5, 4.5, 5.5, 5, 5, 4.75, 5.25])

# Plot data:
plt.scatter(cross_x, cross_y)
plt.xlim(0, 10)
plt.ylim(0, 10)
plt.show()
print(np.isclose(np.cov(cross_x, cross_y)[0, 1], 0))  
# Expected Output: True

Below you will find the output for a linear model of the latter data set (the cross).

# Model for the last data set (cross):
# Linear least square method in Python
slope = np.cov(cross_x, cross_y)[0, 1] / np.var(cross_x, ddof=1)
# Note: np.cov defaults to ddof=1, so np.var needs ddof=1 as well
# for a consistent ratio.

intercept = np.mean(cross_y) - slope * np.mean(cross_x)

print(f"Independent variable: cross_x\nDependent variable: cross_y")
print(f"alpha: {intercept}\nbeta: {slope}")

The output here is equivalent to that of our own linear_least_square() function:

Below you will find the plot of the function f(x) = 5 + 0*x. We can clearly see that a covariance of zero leads to a slope of zero as well.

Fig. Plot output of our linear_least_square(x,y) function: f(x) = 5 + 0*x.

An alternative to the cross would be a ball-shaped set of data points. The Wikipedia article on covariance [6] presents it like this:

Fig. Ball shaped Wikipedia example of a covariance of zero in a given set of data points.

Thinking of a model with a slope of zero, we can now already answer what the hypothesis for our linear model will be regarding the slope, when testing our linear model for significance: Think about observing Lady Meow drinking tea again. How useful would a model with a slope of 0 be? Let us say we start observing Lady Meow at time = 0h and she does not drink any tea at all within 10h. What would it mean for us, trying to predict the amount of tea she drinks, given such a set of data points? Let’s do some coding:

# Alt. universe where Lady Meow is drinking no tea at all :'(
# Setting up a respective data set. The np.zeros() function is used
# to obtain an array with 11 times the value 0. We need
# 11 values, as x goes from 0h to 10h, resulting in 11 numbers:
data_x = np.arange(0, 11)
data_y = np.zeros(11)

print(np.cov(data_x, data_y)[0, 1])  # Expected Output: 0

# Plot data:
plt.scatter(data_x, data_y)
plt.show()  

Fig. Plot of our data set with Lady Meow drinking zero cups of tea in 10h. :’(

Of course, such a model with a slope of zero would be useless, as it shows that y never changes when x changes, so no dependent relation is given (y is also independent in this case). If we had a vector y with only ones, the output would essentially tell us the same: that the model is unreliable.

The only case where such a model would turn out useful is one where something should always change, but suddenly stopped changing. Say you are testing some fertilizer and want to compare linear models with each other, looking for the one with the highest slope, i.e., the highest growth. You will probably expect that plants will always grow to some degree, but think of a situation where plants suddenly stopped growing when given the fertilizer. This would still be a very worthy discovery!

However, mathematically we will always hypothesize a slope greater than zero (alternative hypothesis) when doing hypothesis testing on the slope (the null hypothesis will be that the slope is actually zero). Probabilistically evaluating the intercept also assumes a non-zero crossing point as alternative hypothesis. Testing the intercept, however, is not always important; it is a question of logic and concept what to make of it. When observing Lady Meow, we know that we will always start with a value of zero cups of tea, right after Lady Meow woke up. A value of zero in this case is therefore totally sensible. A model where the independent variable refers, e.g., to the IQ of participants does not make sense at a value of 0 (an IQ of 0 can’t be measured), so interpreting the intercept by itself is rather abstract in many cases; the intercept and its p-value are therefore often neglected.

Closely related to the covariance is the so-called ‘correlation’ (also referred to as the Pearson correlation coefficient). The formula only slightly deviates from the covariance, but we will catch up on that in later parts of this tutorial series.
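As a quick preview, a sketch of our own: the Pearson correlation is just the covariance rescaled by both standard deviations (x_demo and y_demo are illustrative stand-ins, not the tutorial’s data set):

```python
import numpy as np

rng = np.random.default_rng(1)
x_demo = np.arange(0, 11)                                 # hours
y_demo = 0.23 + 1.008 * x_demo + rng.normal(0, 0.5, 11)   # noisy cups

# Pearson correlation = covariance divided by both standard deviations:
r = np.cov(x_demo, y_demo, ddof=1)[0, 1] / (
    np.std(x_demo, ddof=1) * np.std(y_demo, ddof=1))

print(np.isclose(r, np.corrcoef(x_demo, y_demo)[0, 1]))  # True
```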

3.5 Conditional Probabilistic and Conditional Linear Relations

In this chapter we are essentially again going to evaluate the optimal a and b of our linear model function. In case you want to proceed to the standard deviation, you can also skip this part. However, we recommend at least reading the summary of this chapter to get the idea behind it.

As mentioned before, conditional linearity and conditional probability are closely related to each other by formula. We have already figured out that the population covariance is very close to the optimal slope of our model. Looking at a population covariance of 10.08182 and a regression coefficient of 1.008182, we can argue that the covariance appears to be proportional in some way to the regression coefficient (we will get to the intercept later below). If we were to divide the covariance by 10 (“put things into proportions”, see below), we would obtain the value for the optimal slope of our model. :O

Now the twisty part: Giving up probabilities and sticking to actual values (estimates), we will see that a conditional linear relation, in terms of the optimal slope of our function (regression coefficient), has by structure the same form as the posterior probability P(Y|X). “By structure” here means: not using probability values, but estimates of measurements (which don’t have to sum to 1).

What does proportionality exactly mean mathematically? Our well-known joint probability, for example, can also be understood as being in a proportional relation to the posterior probability. Mathematically this essentially just means that the joint hasn’t been normalized yet (or “counter-weighted” by P(B)).

The mathematical symbols used to denote proportionality are either the tilde (~) or ∝, which looks similar to alpha (α), just more symmetric. The (“un-counter-weighted” or non-normalized) proportional relation of the joint probability and the posterior in conditional probability / Bayes’ rule is denoted as:

Again, mathematically, proportionality can be understood as formally omitting, but still indicating, a (counter-)weight that can “put things into proportions”. In other words: proportionality in mathematics states a more general relational fact than the posterior (estimate), so to speak. However, it is the complete formula of Bayes’ rule / conditional probability that actually does the “putting into proportions” via a “counter-”weight (the model evidence), so to speak (this repetition is meant to make sure that the linguistic semantics fit the mathematical semantics).

Note that proportionality in mathematics also indicates what is called a constant ratio (a linear relation in the case of a linear model). In the case of conditional probability, the model evidence indicates a constant, as it normalizes the joint to sum to 1 (normalizing constant). In the second part of this series, we also briefly described the constant ratio of a linear model as a constant numerical gradient (see plot below), where the ratio of change is the same (constant) at every step or point of the function. As a linear function is its own tangent, the slope is automatically constant. In other words: given that Δy/Δx = y/x = β, the actual size of the differences Δx and Δy does not matter in the case of linear functions: we could raise x by a factor of 10 for a difference Δx, but simply dividing y by x would result in the same slope as with a difference of 10 for each x and y. The numerical gradient (the difference at each step) will therefore just be a constant series of 1.008182 that is added upon a at each step of x, when evaluating f(x) (in other words: it is monotonically increasing).

Fig. A constant ratio of change or constant numerical gradient just means that we start with a + 0 and for each step of x we can just add another 1.008182 to obtain the respective value of f(x) on the y-axis. Each step changes by 1.008182, so to speak, starting with a + 1.008182*0 = a + 0. Another way of saying this is that a linear function is its own tangent, in contrast to, e.g., a parabola, where the slope, and therefore the change, is not a constant value.

We can also calculate the numerical gradient for each step by formula; however, it is not that important here, so we only included the necessary code in the downloadable script that we prepared for you at the beginning of this tutorial.
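Still, a minimal sketch of the idea (our own, using numpy’s gradient function and the tutorial’s estimated parameter values a = 0.2318182, b = 1.008182):

```python
import numpy as np

# Estimated parameters of the tutorial's model (second synthetic data set):
a, b = 0.2318182, 1.008182

x_demo = np.arange(0, 11)
fx = a + b * x_demo

# Numerical gradient (the difference at each step) of a linear function
# is the same constant everywhere, namely the slope b:
grad = np.gradient(fx, x_demo)
print(grad)  # constant 1.008182 at every step
```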

To wrap this up: Equivalent to the joint probability, the covariance is the joint variability of Y and X, regardless of the variance of X, i.e., without actually putting things into proportion via the “model evidence”, in terms of the variance of X by itself. This is essentially the same as saying the covariance is proportional to the estimate of the change of Y (dependent) given a change in X (independent; change by 1):

Below you will find the complete formula to obtain the so-called regression coefficient (the optimal slope of our model), structurally analogue to conditional probability. As mentioned, the “counter-weight” is simply the variance of the independent variable:
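In plain LaTeX, the formula reads:

```latex
b = \frac{\operatorname{cov}(X,Y)}{\operatorname{var}(X)}
  = \frac{\sum_{i=1}^{n}\left(x_i - \overline{x}\right)\left(y_i - \overline{y}\right)}
         {\sum_{i=1}^{n}\left(x_i - \overline{x}\right)^2}
```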

Spelled out, the above formula says: the slope or regression coefficient can be defined as the estimate of the dependent variable Y given a change in X (or simply given X = 1), which is equivalent to the joint variability estimate, i.e., the covariance of X and Y, divided by the variance of the independent variable X itself (the mean of squares of X, in other words).

If you still have trouble understanding the formal relation above between the regression coefficient b and conditional probability, think of this: as the slope is equivalent to the concept of a posterior probability, both indicate the concept of change or updating, equivalent to the steps of probabilistic inference we discussed in the first part of this series! The value of a also essentially just indicates an estimate of the starting point of Y given X = 0, i.e., where no change in X is given.

In sum: above we were talking about the same formal structure; one results in probability values, the other in estimated values of measurements. In the case of Lady Meow (second synthetic data set) the regression coefficient was around 1.008182 cups of tea per hour (we can explicitly argue per hour, because the numerical gradient of a linear function is constant at any point of X).

Below you will find some code for calculating the optimal slope of the model function via the above formula, again regarding our second synthetic data set, meow (^°^) !

# Estimate of the optimal slope of our model function
# from the second synthetic data set via:
# beta = cov(y, x) / var(x)
beta_meow = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
# Note: np.cov defaults to ddof=1, so np.var needs ddof=1 as well;
# otherwise the mismatched denominators distort the slope.
print("beta_meow", beta_meow)  # 1.008182

The same can also be achieved via the total sum of squares only. The total sum of squares of x and y also relates to the concept of a joint variability; it is just not averaged by n (which would lead to the (co-)variance). But how can we just skip the step of taking the average of the total sum of squares? The below formula works because n is entailed in both the numerator and the denominator as a balanced ratio, due to the equal vector length of x and y. In other words: n cancels out and can therefore be left out in the first place, when calculating the regression coefficient:

# beta = SumSqXY / SumSqXX
SumSqXY = np.sum((x - np.mean(x)) * (y - np.mean(y)))
SumSqXX = np.sum((x - np.mean(x)) ** 2)
beta_meow = SumSqXY / SumSqXX
print("beta_meow", beta_meow)  # [1] 1.008182

The message of this chapter is: there is a way to convert the structural idea of probabilistic relations (conditional probability) into the concept of an estimation of measurements (conditional linear relations). Personally, I find this mind-blowing. The crucial point is: given that conditional probability can be understood as intuitive logical inference (abductive inference), statistics is doing the same, just with real numbers, in the same way we rather naively assume how the world would be when bluntly forming a hypothesis on it.

Fig. Mind blown Ms. Miranda. Original photo by Alfred Kenneally and Casey Horner.

There are a lot of other ways to perform the above estimation of Y given a change in X, especially when bringing in actual linear algebra for a more visual understanding (including even methods such as gradient descent).

Wait!!

What happened to our intercept???

In the last part of this series, we used a system of equations in order to obtain both a and b at the same time. However, given the regression coefficient only, we can also obtain the value of a, our intercept estimate, via some rearrangements of the so-called point-slope formula.

Let us say we know the parameters a and b of our optimal model already (otherwise we wouldn’t know points on that function): we could then just pick two specific points that lie on our model function.

The first point will entail the variable a, where we at least know that x is equal to 0: P_a = (0, a). The second point will involve our regression coefficient: P_b = (1, f(1)).

Below you will find some code to rearrange the above equation in various ways in order to solve for a, since it is the only variable missing now (!!!).

# Deriving the intercept from the slope-intercept formula:
alpha_meow = np.mean(y) - beta_meow * np.mean(x) 

# Our fitted function:
lmFitted = lambda x: alpha_meow + beta_meow * x

# Value of y of our fitted model function, given that x = 1:
y1 = lmFitted(1)
print("y1", y1)

# Check for equality:
print("Equality?", np.allclose(lmFitted(1), y1))

# Calculating beta with the points:
beta_meow = (alpha_meow - lmFitted(1)) / (0 - 1)

However, a simplified way to obtain the intercept is given via the means of x and y, instead of using given point estimates of the fitted function. The formula goes as follows:
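In plain LaTeX:

```latex
a = \overline{y} - b\,\overline{x}
```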

Below is some code that explains why the above formula holds:

print("beta_meow * (0 - 1) == alpha_meow - lmFitted(1):", np.isclose(beta_meow * (0 - 1), alpha_meow - lmFitted(1)))
# np.isclose() is used instead of == to avoid floating point issues
alpha_meow = lmFitted(1) + beta_meow * (0 - 1)
print("alpha_meow", alpha_meow)  # [1] 0.2318182

alpha_meow = np.mean(y) - beta_meow * (0 - np.mean(x))
print("alpha_meow", alpha_meow)  # alpha_meow = 0.2318182

# The simplest form is:
alpha_meow = np.mean(y) - beta_meow * np.mean(x)

# Checking for equality:
print("Equality?", np.allclose((alpha_meow - np.mean(y)), -beta_meow * np.mean(x)))

Mathematically there is still a lot more to discover about the above mathematics of linear regression models. As mentioned before, we could also use methods commonly used in machine learning, such as gradient descent... However, for now we are just going to move on to a simpler topic: the standard deviation.

3.6 Standard Deviation

Finally, we get to the standard deviation. We know most about it already, but we will go through it in detail again anyway.

So, in nuce, the standard deviation is just the square root of the variance. It tells us the average deviation from the mean of a distribution, if we were to take multiple samples from it.

Recall that we started with the total sum of squares. In other words, we started with the summed squared deviations of the individual elements of our list, set or vector X from the mean of that set X (squared and summed centralization). In order to obtain the variance, we just had to divide the total sum of squares by the size of our population (in the case of the sample variance we used n - 1, including Bessel’s correction). In other words, the variance turned out to be just the (bias-corrected) average of the total sum of squares (which itself was the sum of squared deviations from the mean).

As mentioned, we can now obtain the standard deviation simply by taking the square root of the variance; in other words: we simply get rid of the initial squaring we performed when calculating the total sum of squares! Taking the root brings the dimensions down to one (the x-axis). That is why the sd becomes relatable to the mean (which is also found on the x-axis). This makes the deviation or variability again relatable to the actual mean of a set, which only relates to the input values and not their squares. The sd is therefore rather easy to relate to the actual mean, as mean + sd or mean - sd.

Why +/- again? Taking the root does not undo the fact that centralizing and squaring turned all deviations into positive values only! This is again the reason why the standard deviation is denoted as a +/- deviation from the mean. The standard deviation of a normal distribution is special, since mean +/- sd marks the turning points of the Gaussian function:
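To see numerically that the turning points of the bell curve really sit at mean ± sd, we can locate the sign changes of the (numerically approximated) second derivative of the standard normal density. This is a small sketch of our own, not part of the original tutorial code:

```python
import numpy as np

# Standard normal density (mean 0, sd 1):
def gauss(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Numerical second derivative via repeated central differences:
xs = np.linspace(-4, 4, 8001)
f = gauss(xs)
f2 = np.gradient(np.gradient(f, xs), xs)

# The curvature changes sign (turning/inflection points) where f'' crosses
# zero, which for the standard normal happens at mean ± one sd:
signs = np.sign(f2)
flips = xs[1:][signs[1:] != signs[:-1]]
print(np.round(flips, 2))  # ≈ [-1.  1.]
```

Analytically the same result follows from f''(x) = (x² − 1)·f(x), which is zero exactly at x = ±1 for the standard normal.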

Fig. Standard normal Gaussian function, with a standard deviation of 1 and a mean of 0. The blue lines mark the standard deviation, which corresponds to the turning points of a Gaussian / normal / bell-curved distribution. In a standard normal Gaussian function the unit of the x-axis is itself in standard deviations. The mean is 0, since it represents no deviation from the mean. In a z- or t-test the difference of two means is standardized, meaning that the z- or t-value represents the deviation of the sample mean in measures of population standard deviations; in other words, a z- or t-value of 2 means that the sample deviates 2 population standard deviations from the population mean. We will go through the details in the next part of this series.

For brevity, you will find below only the formula for the sample sd, which is also sometimes referred to as the empirical sd (the following code remains comprehensive, though). Note that the standard deviation is commonly denoted as σ and the variance as σ².

# Standard deviation := sqrt(variance), where sqrt() == square root:

# Note: Assuming `x` and `y` are numpy arrays, and `sampVarX`, `popVarX`, `sampVarY`, and `popVarY` have been computed previously.

# Sd for x given the sample variance
sdXsamp = np.sqrt(np.sum((x - np.mean(x))**2) / (len(x) - 1))
sdXsamp = np.sqrt(sampVarX)
sdXsamp = np.sqrt(np.var(x, ddof=1))  # sample variance
sdXsamp = np.std(x, ddof=1)  # sample standard deviation

# Sd for x given population variance:
sdXpop = np.sqrt(np.sum((x - np.mean(x))**2) / len(x))
sdXpop = np.sqrt(popVarX)

# Sd of y given sample variance:
sdYsamp = np.sqrt(np.sum((y - np.mean(y))**2) / (len(y) - 1))
sdYsamp = np.sqrt(sampVarY)
sdYsamp = np.sqrt(np.var(y, ddof=1))  # sample variance
sdYsamp = np.std(y, ddof=1)  # sample standard deviation

# Sd for y given population variance:
sdYpop = np.sqrt(np.sum((y - np.mean(y))**2) / len(y))
sdYpop = np.sqrt(popVarY)

The resulting standard deviation can now be clearly related to the mean value as a standard or "average deviation from the mean".
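As a tiny illustration of such reporting (with made-up numbers, not the tutorial's data), the mean and sd sit on the same scale and can be printed side by side:

```python
import numpy as np

# Hypothetical sample values (not the tutorial's x or y):
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = np.mean(sample)  # 40 / 8 = 5.0
sd = np.std(sample)     # population sd: sqrt(32 / 8) = 2.0

# The sd lives on the same scale as the data, so it can be
# reported directly alongside the mean:
print(f"mean ± sd: {mean} ± {sd}")  # → mean ± sd: 5.0 ± 2.0
```

Reporting the squared counterpart (the variance, here 4.0) alongside the mean would not be meaningful in the same way, since it lives on the squared scale of the data.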

In the next part of this series we will also get to know special kinds of standard deviations: the standard error/deviation of the mean, and of the intercept and the slope of a model.

4 Super Brief Outlook on Inferential Statistics IV

In order to fully visualize the meaning of the standard deviation, we have to take a closer look at what (uniform and normal) distributions are and what is called a probability (density) function.

Fig. Formula and graph of a (standard) normal probability density function, established by Carl Friedrich Gauß (1777-1855). His effigy, including the famous formula and corresponding graph, was depicted on the last 10 Mark bill in Germany until 2001. In nuce, the difference to a regular probability function will be that the complete area under the curve sums to 1. We will again go through all the details step-by-step to fully make sense of the formula (e.g., why use an exponential function? How is π involved in all of this? …)

We will get there in the next part of this series, where we will also discuss the mentioned probability density functions that are used in a z-test (which is very similar to the t-test as a form of statistical hypothesis test).

Ms. Miranda, longing for feedback. Did any of this make sense? We would like to hear from you! Similar to our open review process, each of our articles can be commented on by you. Our journal still relies on the expertise of other students and professionals. However, the goal of our tutorial collection is especially to come in contact with you, helping us to create an open and transparent peer-teaching space within BEM.

Review: This article was internally reviewed by the BEM-Editor Philippa Schunk.

Contribution declaration: Original R article by Steffen Schwerdtfeger; Python code conversion and implementation of relevant Python segments: Rico Schmitt and Moritz Thiele. Code conversion was assisted by GPT by OpenAI. The responsibility of the authors for the content of the code remains unaffected by this.
