The Stat-o-Sphere is an open educational editorial collection within the journal of the Student Network for Open Science (NOS) that provides low-threshold, easy-access and slow-paced statistics tutorials for absolute beginners, written by students, for students and potentially also for scholars. Our goal is to build and strengthen readers' research and reviewing abilities in order to shed more light on the basics of scientific methodology as well as science communication. We are eager to further expand our peer-teaching spaces within NOS together with you in order to gather reliable methodological knowledge over time and promote scientific integrity.
Most of our tutorials are also accompanied by executable code, so far mostly for the open-source programming language R. We are also working on Python versions of our current articles (Inferential Statistics I and III so far, see further below). Others such as SPSS, Stata or MATLAB may follow.
A general ideal of this project is to avoid gate-keeping heuristics that discourage looking at the actual conceptual and mathematical processes behind statistical inference and dismiss actual and, most of all, lasting knowledge as, e.g., “not necessary”. Our main goal is therefore to bring back critical thinking around the topic of statistics, which is most of all applied mathematics. Nevertheless, we will also provide Quick-Guides of our tutorials in the future, and every statistics tutorial is accompanied by a summary of its content, in order to supply you with knowledge on several levels of detail (Note: Summaries are currently not up to date).
AI approaches have advanced rapidly in recent years, and functionalized misinformation on science has grown vastly. We therefore believe it is more important than ever to create and share consistent material on statistics in any form, in order to avoid habits of applying methods without properly reflecting on them, and in order to avoid giving a platform only to those people who appear to be most of all convinced of themselves and their approach to scientific methodology without being in any way accessible. We also believe that students should take, and should in general be given, more time and guidance in that respect, since statistics paves the path of inference in modern evidence-based science and therefore potentially has an impact on all of us. Inference performed via sloppy or even fraudulent statistics, e.g., used as a mere tool to spread one’s own interests and beliefs, as well as, most of all, hegemonic relativism against science as such, are probably the most dangerous tendencies against scientific reasoning today. We believe that such challenges can only be met with open access to education.
In the future, we will also provide you with educational content in the form of an optional extension of submissions within NOS, giving authors the chance to share their methods educationally and making it possible for readers with only little background to fully understand them (this especially concerns submissions providing open data and code).
Learn more about our project and its goals in our introductory article “Into the Stat-o-Sphere” below.
We are currently looking for reviewers / test readers and people with a background in statistics / data science and with an interest in Open Education (as part of Open Science). Also have a look at articles all about our Open Science Agenda in that respect!
Contact us via [email protected] if you want to participate in any form. Every reviewer will be credited with a mention at the beginning of the tutorials. You can send us your review directly via the email address above, or you can use our open peer-review functions via PubPub (create an account here). How to make use of it?
To comment, mark some text and click on “Start discussion”, or just comment at the end of an article. We are looking forward to your feedback!
>>>>Note that the autogenerated download functions via PubPub are unfortunately not very well tuned and are nothing we have control over. <<<<
Our tutorial series on R-Basics will provide you with everything you need to know about installing R/RStudio, creating a script or project, importing and exporting data files, essentials of data cleaning and so on. Most of the first part of this series consists of basics that are also included in the first two articles of our series on inferential statistics. However, this series will focus more on the technical side of R and less on using R in order to understand and perform statistical analyses. Join us in becoming a Cyber!
Part I
starts with a very brief history of programming languages and then focuses on the first steps of using R/RStudio, such as installation, changing the appearance of RStudio, understanding classes of objects, functions and more. We will also introduce you to some exemplary data sets (such as the Datasaurus Dozen), and will go through some basic trivial and non-trivial examples of data cleaning. Optionally you can also learn how to write a function. Note that going through this tutorial is not mandatory in order to start with our core series on inferential statistics.
Our series on R-Basics and Inferential Statistics was also turned into an in-person peer-teaching tutorial at the CIPom/Lernzentrum at Charité (for students at Charité, you will find more information on how to register in the article below, starting 28.04.2024; last material update 24.05.24). However, the material is still open to explore for anyone interested, and feedback is always welcome! Note that this material is currently in German only. We are also planning to turn this tutorial in some form or another into an online version as part of NOS once a year, after the material has been thoroughly tested. Stay tuned for updates!
Inferential Statistics was our first tutorial series and aims to provide thorough tutorials on fundamental methodological basics in statistics on three levels of perspective and representation: intuition/conceptuality, mathematics and computer science (currently in R for computers with 64-bit architecture). We also included general science theoretic discussions and concepts, and in general followed a rather pragmatic approach to introduce statistics / probability theory (e.g., a general concept of abductive inference / hypothesis testing). Every tutorial has a length of 8k-12k words and also comes with a short summary, downloadable right at the beginning of every tutorial (currently not up to date; it will be replaced by a general summary tutorial in the future). The general educational goal of this series is to provide you with lasting and consistent knowledge on the most commonly used methods of statistics and programming.
This series will also be the thorough basis for providing additional educational material for other peer-teaching formats within NOS, such as the Review Crash Course and the Educational Journal Club (JCed). For those sceptical about going through the basics of statistics in detail: doing so will turn out to be a great advantage when moving on to advanced methods, such as multivariate or logistic models, since these involve only rather minor extensions of the basics once those are understood consistently. Unlike a lot of tutorial series on inferential statistics, we do not start with frequencies of events and build up from there, but with probability theory, as well as epistemology and science theory.
Part I
introduces the basic concept of probability and the scientific logic behind probabilistic inference (e.g., p-values, a general concept of abductive inference / hypothesis testing). We also provide a general overview of paradigms in statistics: Bayesianism and Frequentism. This tutorial also includes a very basic introduction to R: how to install R, how to open a script, how to execute code…
Part II
presents the linear least squares method to obtain the parameters (slope and intercept) of an optimal linear model. This tutorial also introduces the concept of dependent and independent variables and discusses the difference between probabilistic and statistical modelling. We also reproduce the complete math behind the basic lm(y~x) function in R and learn how to write our own functions (in Part III we will introduce another method, which is easier to calculate, but less intuitive and more abstract to understand).
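To give a first impression of what such a fit involves, here is a minimal Python sketch (not the tutorial's own R code). It assumes the standard normal equations obtained by setting the partial derivatives of the summed squared errors to zero, and solves the resulting 2x2 system with Cramer's rule. The function name `least_squares_fit` and the example data are hypothetical.

```python
def least_squares_fit(x, y):
    """Return (intercept, slope) minimising the sum of squared errors.

    Solves the normal equations
        n * a      + sum(x)   * b = sum(y)
        sum(x) * a + sum(x^2) * b = sum(x*y)
    for intercept a and slope b via Cramer's rule.
    """
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))

    det = n * sum_xx - sum_x ** 2
    if det == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    intercept = (sum_y * sum_xx - sum_x * sum_xy) / det
    slope = (n * sum_xy - sum_x * sum_y) / det
    return intercept, slope


# Hypothetical example data lying exactly on the line y = 1 + 2x.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]
intercept, slope = least_squares_fit(x, y)
print(intercept, slope)
```

For a single predictor, R's lm(y~x) should report the same two coefficients, although it computes them by a different numerical route.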
Part III
provides you with an alternative method to obtain the intercept and the slope of an optimized linear model by using descriptive parameters, such as the variance, the covariance and others. This tutorial also goes through classics in statistics such as the (weighted arithmetic) mean / expected value, the standard deviation and the difference between a sample and a population. Note that this tutorial is mathematically the easiest so far. The presented alternative method for linear modelling also shows a surprising formal relation to conditional probability. However, this method is not as intuitive as the previous linear least squares method, which we recommend starting with, at least for a stable intuition on linear modelling.
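As a rough illustration of this descriptive route, the following Python sketch assumes the textbook identities slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x), using sample (n - 1) estimates. The function name `fit_from_descriptives` and the data are hypothetical and not taken from the tutorial.

```python
from statistics import mean


def fit_from_descriptives(x, y):
    """Return (intercept, slope) from the covariance and the variance."""
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)

    # Sample covariance of x and y, and sample variance of x (both use n - 1).
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)

    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical example data lying exactly on the line y = 1 + 2x.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]
print(fit_from_descriptives(x, y))
```

Because the factor (n - 1) cancels in the ratio, population and sample estimates give the same slope here.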
Part IV
introduces the concepts of probability mass, probability density and cumulative distribution functions for uniform distributions, as well as (standard) normal functions (parametric and non-parametric Gaussian bell-curved functions). We will also look into how to calculate the z-score in order to standardize any given value of a measurement (not the same as the z-statistic/value of a z-test!). We also take a look at the so-called “central limit theorem”, the effect of the so-called “regression to the mean”, as well as the important difference between the “cultural” and mathematical use of the term “normal”.
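A small Python sketch of the z-score may help here. It assumes the usual definition z = (x - mean) / standard deviation and uses `statistics.NormalDist` from the standard library for the standard normal density and cumulative distribution function. The measurement values are hypothetical.

```python
from statistics import NormalDist


def z_score(value, mean, sd):
    """Standardise a single measurement: distance from the mean in units of sd."""
    return (value - mean) / sd


# Hypothetical measurement scale with mean 100 and standard deviation 15.
z = z_score(130, mean=100, sd=15)

standard_normal = NormalDist(mu=0.0, sigma=1.0)
density_at_z = standard_normal.pdf(z)     # height of the bell curve at z
probability_below = standard_normal.cdf(z)  # area under the curve up to z

print(z, density_at_z, probability_below)
```

The z-score standardises one value against a distribution, whereas the z-statistic of a z-test standardises a sample mean against its standard error.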
Part V
discusses the similarity and difference between the z- and t-test for regular distributions (difference in means) as well as the t-test for a linear regression model (testing the difference of the slope/intercept). We will investigate all the different hypotheses possible with a one-sample (one- and two-tail) as well as the two-sample z- and t-test (independent / dependent (paired)) and will eventually replicate most of the summary(lm(y~x)) function.
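The difference between the two test statistics can be sketched in a few lines of Python. This assumes the common one-sample forms: the z-statistic uses a known population standard deviation, while the t-statistic estimates it from the sample. The function names and the sample are hypothetical, and p-values are left out because the Python standard library has no t-distribution.

```python
from math import sqrt
from statistics import mean, stdev


def z_statistic(sample, mu0, sigma):
    """One-sample z-statistic with a known population standard deviation."""
    standard_error = sigma / sqrt(len(sample))
    return (mean(sample) - mu0) / standard_error


def t_statistic(sample, mu0):
    """One-sample t-statistic; the standard deviation is estimated (n - 1)."""
    standard_error = stdev(sample) / sqrt(len(sample))
    return (mean(sample) - mu0) / standard_error


# Hypothetical sample tested against a hypothesised mean of 5.0.
sample = [5.1, 4.9, 5.3, 5.2, 5.0]
print(z_statistic(sample, mu0=5.0, sigma=0.2))
print(t_statistic(sample, mu0=5.0))
```

Both statistics share the same numerator; only the source of the standard deviation, and therefore the reference distribution, differs.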
PYTHON VERSION:
Note that the first and third parts of our series on inferential statistics are now also available in a version introducing Python, translated by Rico Schmitt and Moritz Thiele. More translations of our tutorial series will follow in the near future!
“Higher Spheres” is the name of our second tutorial series covering more advanced or special methods and topics within statistics, computational neuroscience and other closely related fields. The structure of the tutorials however will follow the structure of our series on inferential statistics above, so don’t be afraid to dip into new spheres of mathematics and programming.
Our first series within “Higher Spheres” will be about basics and advanced methods of information theory. This series is supposed to give you a general insight into the history of computer science and its adoption in fields such as statistics, computational neuroscience and physics. In general, this series will provide you with the basics needed in order to understand concepts such as the Akaike and Bayesian information criteria, Markov processes as well as more advanced methods from computational neuroscience and AI research, such as stable diffusion models (see title images below as demonstration), predictive processing / active inference, dynamic causal modelling and other methods linked to classic information theory. In case you have background knowledge: The below parts I to IV essentially go through the first 12 pages of Claude Shannon’s paper “A Mathematical Theory of Communication” from 1948. Part IV mostly discusses Warren Weaver’s essay on “Recent Contributions to the Mathematical Theory of Communication” from 1949. Later parts will also look into more recent applications of information theoretic concepts in neuroscience, as mentioned above (e.g. predictive processing, active inference).
Parts I-IV were our contribution to the Summer of Math Exposition 2023 (thanks to all participants for the positive and useful feedback!).
Part I
introduces information theory and technology from a very basic perspective. We will start with a general, selective and brief look into the history of information technology and the importance it has gained since computers became affordable. After a summary of conditional probability and Bayes’ rule we will introduce Claude Shannon’s original sender-receiver model for noiseless and noisy channels and discuss why communication can be looked at as an inferential relationism based on probability theory.
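To make the link between Bayes’ rule and a noisy channel more tangible, here is a minimal Python sketch. It assumes a hypothetical binary channel that flips each sent bit with a fixed probability (a simplification chosen for illustration, not a model taken from the tutorial) and computes how certain the receiver can be that a 1 was sent after receiving a 1.

```python
def posterior_sent_one(prior_one, flip_prob):
    """P(sent = 1 | received = 1) for a channel that flips bits with flip_prob."""
    likelihood_if_one = 1.0 - flip_prob  # P(received = 1 | sent = 1)
    likelihood_if_zero = flip_prob       # P(received = 1 | sent = 0)

    # Evidence: total probability of receiving a 1 at all.
    evidence = prior_one * likelihood_if_one + (1.0 - prior_one) * likelihood_if_zero
    if evidence == 0:
        raise ValueError("receiving a 1 has probability zero under these settings")

    # Bayes' rule: posterior = prior * likelihood / evidence.
    return prior_one * likelihood_if_one / evidence


# Hypothetical settings: a noiseless channel and a channel with 10 % bit flips.
print(posterior_sent_one(prior_one=0.2, flip_prob=0.0))
print(posterior_sent_one(prior_one=0.2, flip_prob=0.1))
```

With a noiseless channel the received sign determines the sent sign completely; with noise, the receiver can only infer the sent sign with some probability.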
Part II
introduces the remarkable relation between the unit of “bits” of information (the entropy/surprisal of a message) and statistical thermodynamics (the entropy of a physical system). We will go through both in detail and will discuss several intuitions on the concept of entropy and how it can be misunderstood.
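As a minimal sketch of the information-theoretic side, the following Python code assumes Shannon's definitions of surprisal, -log2(p), and entropy, the expected surprisal of a distribution, both measured in bits. The function names and the example distributions are hypothetical.

```python
from math import log2


def surprisal(p):
    """Surprisal of a single event with probability p, in bits."""
    return -log2(p)


def entropy(probabilities):
    """Shannon entropy in bits: the expected surprisal of a distribution.

    Events with probability 0 contribute nothing and are skipped.
    """
    return sum(p * surprisal(p) for p in probabilities if p > 0)


# Hypothetical sources: a fair coin, a biased coin and four equally likely signs.
print(entropy([0.5, 0.5]))
print(entropy([0.9, 0.1]))
print(entropy([0.25, 0.25, 0.25, 0.25]))
```

The biased coin has a lower entropy than the fair one, since its outcome is less uncertain on average.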
Part III
introduces the concept of Markov processes in the form of a simple Markov chain in order to simulate a source generating a message given a set of possible signs. This article provides you with two methods of mathematically representing Markov chains: linear algebra (more difficult, but very visual) and probability theory (intuitive in the sense of “updating a model”, but less visual). We will also get to know what a stationary distribution / (equilibrium) steady state is and how it can be calculated in the above two ways. This tutorial will also serve as a basis for future tutorials on the Markov Chain Monte Carlo method (MCMC).
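The idea of a stationary distribution can be sketched in Python by repeatedly updating a distribution with the transition probabilities until it stops changing. This is only one simple way to approximate it and is not necessarily the procedure used in the tutorial. The transition matrix, the function names and the tolerance are hypothetical.

```python
def update(distribution, transition):
    """One step of the chain: new[j] = sum over i of distribution[i] * transition[i][j]."""
    n = len(transition)
    return [sum(distribution[i] * transition[i][j] for i in range(n)) for j in range(n)]


def stationary_distribution(transition, tolerance=1e-12, max_steps=10_000):
    """Approximate the stationary distribution by iterating from a uniform start."""
    n = len(transition)
    distribution = [1.0 / n] * n
    for _ in range(max_steps):
        updated = update(distribution, transition)
        if max(abs(a - b) for a, b in zip(updated, distribution)) < tolerance:
            return updated
        distribution = updated
    return distribution


# Hypothetical two-sign source; each row holds the probabilities of the next sign.
transition = [
    [0.9, 0.1],
    [0.5, 0.5],
]
print(stationary_distribution(transition))
```

In the linear algebra view, the same distribution is the left eigenvector of the transition matrix with eigenvalue 1, normalised so that its entries sum to 1.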
Part IV
discusses the expository introduction to information theory by Warren Weaver. We will focus on the implications of Shannon’s sender-receiver model for the understanding of communication in general and will mostly discuss Weaver’s “three levels of communication problems”. We will then have a look at how information theory and technology have influenced the life sciences and humanities in the 20th and early 21st century up to now.
Part V
introduces some mathematical basics behind active inference, most of all the variational method as a way to overcome the intractability of exact Bayesian methods. We will also take a look into the relation between the so-called variational free energy and Helmholtz free energy. This part therefore extends the physics we have discussed in the second part of this series and explores the relation between free energy, internal energy and entropy. We will also take a closer look at Jensen’s inequality and the KL-divergence / relative entropy / information gain.
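As a small illustration of the last concept, here is a Python sketch of the KL-divergence for discrete distributions. It assumes the standard definition KL(p || q) = sum of p * log2(p / q), measured in bits. The function name and the two distributions are hypothetical.

```python
from math import log2


def kl_divergence(p, q):
    """KL-divergence KL(p || q) in bits for two discrete distributions.

    Terms with p[i] == 0 contribute nothing; q[i] must be positive wherever
    p[i] is positive, otherwise the divergence is infinite.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return float("inf")
        total += pi * log2(pi / qi)
    return total


# Hypothetical distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))
print(kl_divergence(q, p))
print(kl_divergence(p, p))
```

The divergence is never negative (a consequence of Jensen’s inequality), it is zero when both distributions are identical, and it is not symmetric in p and q.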