The Stat-o-Sphere is an open educational editorial collection within the students’ journal Berlin Exchange Medicine (BEM) that provides low-threshold, easy-access and slow-paced statistics tutorials for absolute beginners ─ written by students, for students. Our goal is to foster both research and reviewing skills in our readers. We are eager to further expand our peer-teaching spaces within BEM together with you, in order to build solid methodological knowledge over time and promote scientific integrity.
Most of our tutorials are also accompanied by executable code ─ so far for the open-source programming language R; others such as Python, SPSS, Stata or MATLAB may follow.
A general ideal of this project is to avoid gatekeeping heuristics that discourage looking at the actual conceptual and mathematical processes behind statistical inference, dismissing real and, most of all, lasting knowledge as, e.g., “not necessary”. Our main goal is therefore to bring back critical thinking around the topic of statistics, which is, at its core, applied mathematics. Nevertheless, we will also provide Quick Guides to our tutorials in the future, and every statistics tutorial is accompanied by a summary of its content, in order to supply you with knowledge at several levels of detail.
Especially since AI approaches have advanced rapidly in recent years, we believe it is more important than ever to create and share consistent material on statistics, in order to avoid the habit of applying methods without properly reflecting on them, and to avoid giving a platform in science only to those who appear most of all convinced of themselves and their approach to scientific methodology. We also believe that students should take, and in general be given, more time and guidance in this respect, since statistics paves the path of inference in evidence-based science and therefore potentially has an impact on us all. Inference performed via sloppy or even fraudulent statistics, used as a mere tool to spread one’s own interests and beliefs, is probably the most dangerous threat to responsible scientific reasoning of our time.
In the future, we will also provide educational content in the form of an optional extension of submissions within BEM, giving authors the chance to share their methods in an educational way and making it possible for readers with only little background to fully understand them (this especially concerns submissions providing open data and code).
Learn more about our project and its goals in our introductory article “Into the Stat-o-Sphere” below.
We are currently looking for reviewers / test readers, as well as people with a background in statistics / data science and an interest in Open Education (as part of Open Science). Also have a look at our articles on our Open Science Agenda in that respect!
Contact us via [email protected] if you want to participate in any form. Every reviewer will be credited with a mention at the beginning of the tutorials. You can send us your review directly via the email address above, or you can use our open peer-review functions via PubPub (create an account here). How to make use of them?
To comment, mark some text and click on “Start discussion”, or simply comment at the end of an article. We are looking forward to your feedback!
Note that the autogenerated download functions via PubPub are unfortunately not very well tuned; this is nothing we have control over.
Inferential Statistics is our first tutorial series and aims to provide thorough tutorials on fundamental methodological basics in statistics on three levels of perspective and representation: intuition/conceptuality, mathematics and computer science (currently in R for computers with 64-bit architecture). Every tutorial has a length of 8,000 to 12,000 words and also comes with a short summary, downloadable right at the beginning of each tutorial. The general educational goal of this series is to provide you with lasting and consistent knowledge on the most commonly used methods of statistics and programming.
This series will also be the thorough basis for providing additional educational material for other peer-teaching formats within BEM, such as the Review Crash Course and the Educational Journal Club (JCed).
Part I introduces the basic concept of probability and the scientific logic behind probabilistic inference (e.g., p-values). We also provide a general overview of paradigms in statistics: Bayesianism and Frequentism. This tutorial also includes a very basic introduction to R: how to install R, how to open a script, and how to execute code.
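To give a first impression of what executing R code looks like, here is a minimal sketch; the data and the relative-frequency reading of probability are our own made-up example, not taken from the tutorial itself:

```r
# A first R script: assign a small sample, then compute with it
x <- c(1, 2, 3, 4, 5)

mean(x)                 # arithmetic mean of the sample: 3
sum(x > 2) / length(x)  # relative frequency of values above 2: 0.6
```

Running a script line by line like this (e.g., with Ctrl+Enter in RStudio) is all that "executing code" means here.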
Part II presents the linear least squares method to obtain the parameters (slope and intercept) of an optimal linear model. This tutorial also introduces the concept of dependent and independent variables and discusses the difference between probabilistic and statistical modelling. We also reproduce the complete math behind the basic lm(y~x) function in R and learn how to write our own functions (in Part III we will introduce another method, which is easier to calculate, but less intuitive and more abstract to understand).
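As a taste of what reproducing lm(y~x) amounts to, here is a minimal sketch of the least squares solution via the normal equations; the data values are hypothetical and the variable names are our own:

```r
# Hypothetical toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Least squares via the normal equations: beta = (X'X)^-1 X'y
X <- cbind(1, x)                         # design matrix with intercept column
beta <- solve(t(X) %*% X) %*% t(X) %*% y
beta                                     # intercept and slope "by hand"

# The built-in function returns the same parameters:
coef(lm(y ~ x))
```

That both results agree is the whole point: lm() is not magic, just linear algebra.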
Part III provides you with an alternative method to obtain the intercept and the slope of an optimized linear model by using descriptive parameters, such as the variance, the covariance and others. This tutorial also goes through classics in statistics such as the (weighted arithmetic) mean / expected value, the standard deviation and the difference between a sample and a population. Note that this tutorial is mathematically the easiest so far. The presented alternative method for linear modelling also shows a surprising formal relation to conditional probability. However, this method is not as intuitive as the previous linear least squares method, which we recommend starting with, at least for a stable intuition on linear modelling.
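A minimal sketch of this descriptive-parameter route, using the same kind of hypothetical toy data as above (values and names are our own example, not from the tutorial):

```r
# Hypothetical toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Slope and intercept from descriptive parameters alone:
slope     <- cov(x, y) / var(x)           # covariance over variance
intercept <- mean(y) - slope * mean(x)
c(intercept, slope)

coef(lm(y ~ x))   # same result as the least squares route
```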
Part IV introduces the concept of probability mass, probability density and cumulative distribution functions for uniform distributions, as well as (standard) normal functions (parametric and non-parametric Gaussian bell-curved functions). We will also look into how to calculate z-scores in order to standardize any given value of a measurement (not the same as the z-statistic/value of a z-test!). We also take a look at the so-called “central limit theorem”, the effect of the so-called “regression to the mean”, as well as the important difference between the “cultural” and mathematical use of the term “normal”.
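Standardization via z-scores can be sketched in two lines of R; the measurements below are made-up example values:

```r
# Hypothetical measurements
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-scores "by hand": center at the mean, divide by the standard deviation
z <- (x - mean(x)) / sd(x)

# R's built-in scale() does the same standardization:
all.equal(z, as.numeric(scale(x)))

# The standard normal CDF then gives cumulative probabilities, e.g.:
pnorm(1.96)   # P(Z <= 1.96), roughly 0.975
```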
Part V discusses the similarity and difference between the z- and t-test for regular distributions (difference in means) as well as the t-test for a linear regression model (testing the difference of the slope/intercept). We will investigate all the different hypotheses possible with a one-sample (one- and two-tailed) as well as the two-sample z- and t-test (independent / dependent (paired)) and will eventually replicate most of the
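The t-test variants mentioned above all live in R's single t.test() function; here is a minimal sketch with hypothetical samples (the values and the reference mean mu = 5 are our own example):

```r
# Hypothetical samples
a <- c(5.1, 4.9, 5.6, 5.2, 4.8, 5.4)
b <- c(4.4, 4.7, 4.2, 4.9, 4.3, 4.6)

t.test(a, b)                          # two-sample (Welch) t-test, two-tailed
t.test(a, mu = 5)                     # one-sample t-test against mu = 5
t.test(a, mu = 5, alternative = "greater")  # one-tailed variant
t.test(a, b, paired = TRUE)           # dependent (paired) t-test
```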
“Higher Spheres” is the name of our second tutorial series, covering more advanced or special methods and topics within statistics, computational neuroscience and other closely related fields. The structure of the tutorials, however, will follow the structure of our series on inferential statistics above, so don’t be afraid to dip into new spheres of mathematics and programming.
Our first series within “Higher Spheres” will be about basic and advanced methods of information theory. This series is supposed to give you a general insight into the history of computer science and its adoption in fields such as statistics, computational neuroscience and physics. In general, this series will provide you with the basics needed to understand concepts such as the Akaike and Bayesian information criterion, Markov processes, as well as more advanced methods from computational neuroscience and AI research, such as stable diffusion models (see title images below as demonstration), predictive processing / active inference, dynamic causal modelling and other methods linked to classic information theory. In case you have background knowledge: Parts I to IV below essentially go through the first 12 pages of Claude Shannon’s paper “A Mathematical Theory of Communication” from 1948. Part IV mostly discusses Warren Weaver’s essay on “Recent Contributions to the Mathematical Theory of Communication” from 1949. Later parts will also look into more recent applications of information theoretic concepts in neuroscience, as mentioned above.
Parts I-IV were our contribution to the Summer of Math Exposition 2023.
Part I introduces information theory and technology from a very basic perspective. We will start with a general, selective and brief look into the history of information technology and the importance it has gained since computers became affordable. After a summary of conditional probability and Bayes’ rule, we will introduce Claude Shannon’s original sender-receiver model, given noiseless and noisy channels, and discuss why communication can be looked at as an inferential relationism based on probability theory.
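Bayes’ rule itself fits in a few lines of R; the probabilities below are purely hypothetical placeholders (a rare event A and some evidence B), chosen only to show the arithmetic:

```r
# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B), with hypothetical numbers
p_A    <- 0.01    # prior probability of A
p_B_A  <- 0.90    # likelihood P(B | A)
p_B_nA <- 0.05    # P(B | not A)

# Marginal P(B) via the law of total probability:
p_B   <- p_B_A * p_A + p_B_nA * (1 - p_A)

# Posterior:
p_A_B <- p_B_A * p_A / p_B
p_A_B
```

Note how the posterior stays well below the likelihood: the small prior dominates, which is exactly the kind of "updating a model" intuition the series builds on.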
Part II introduces the remarkable relation between the unit of “bits” of information (the entropy/surprisal of a message) and statistical thermodynamics (the entropy of a physical system). We will go through both in detail and will discuss several intuitions on the concept of entropy and how it can be misunderstood.
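Shannon’s entropy formula can be sketched directly in R; the distribution below is a made-up example over three signs:

```r
# Shannon entropy H = -sum(p * log2(p)) for a hypothetical sign distribution
p <- c(0.5, 0.25, 0.25)

H <- -sum(p * log2(p))
H   # 1.5 bits: the average surprisal of a message from this source
```

A uniform distribution over the same three signs would yield a higher entropy (log2(3) ≈ 1.585 bits), since uncertainty is maximal when all signs are equally likely.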
Part III introduces the concept of Markov processes in the form of a simple Markov chain in order to simulate a source generating a message given a set of possible signs. This article provides you with two methods of mathematically representing Markov chains ─ linear algebra (more difficult, but very visual) and probability theory (intuitive in the sense of “updating a model”, but less visual). We will also get to know what a stationary distribution / (equilibrium) steady state is and how it can be calculated in the above two ways. This tutorial will also serve as a basis for future tutorials on the Markov Chain Monte Carlo (MCMC) method.
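The linear-algebra view of a stationary distribution can be sketched in a few lines; the two-state transition matrix below is a hypothetical example of our own:

```r
# Hypothetical two-state transition matrix (each row sums to 1)
P <- matrix(c(0.9, 0.1,
              0.5, 0.5), nrow = 2, byrow = TRUE)

# Power iteration: repeatedly multiply a start distribution by P
p_stat <- c(1, 0)                     # start entirely in state 1
for (i in 1:100) p_stat <- p_stat %*% P

p_stat          # converges to the stationary distribution (5/6, 1/6)
p_stat %*% P    # applying P again changes nothing: the steady state
```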
Part IV discusses the expository introduction into information theory by Warren Weaver. We will focus on the implications of Shannon’s sender-receiver model for the understanding of communication in general and will mostly discuss Weaver’s “three levels of communication problems”. We will then have a look at how information theory and technology have influenced the life sciences and humanities in the 20th and early 21st centuries.
Part V introduces some mathematical basics behind active inference, most of all the variational method as a way to overcome the intractability of exact Bayesian methods. We will also take a look at the relation between the so-called variational free energy and Helmholtz free energy. This part therefore extends the physics we discussed in the second part of this series and explores the relation between free energy, internal energy and entropy. We will also take a closer look at Jensen’s inequality and the KL-divergence / relative entropy / information gain.
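The KL-divergence itself is a one-liner in R; the two distributions below are hypothetical examples of our own (an "approximate" q against a "true" p):

```r
# KL-divergence D(p || q) = sum(p * log2(p / q)) for hypothetical distributions
p <- c(0.5, 0.5)   # "true" distribution
q <- c(0.9, 0.1)   # approximate distribution

KL <- sum(p * log2(p / q))
KL   # positive whenever p and q differ (a consequence of Jensen's inequality)

sum(p * log2(p / p))   # 0: the divergence of a distribution from itself
```

Note that D(p || q) is not symmetric: swapping p and q gives a different value, which is why it is a divergence rather than a distance.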