Absolute Beginner's to Intermediate Stat-o-Sphere
Review: This is a pre-released article, and we are currently looking for reviewers. Contact us via [email protected]. You can also directly send us your review, or use our open peer-review functions via pubpub (create an account here). In order to comment, mark some text and click on “Start discussion”, or just comment at the end of an article.
In general, for orientation within tutorials, we recommend making use of the contents function (upper right corner) to reorient yourself within the script.
Title-background image by Markus Spiske. Preview-image made by the author.
Corresponding R-Script:
Welcome to our Stat-o-Sphere R-Basics series. The following tutorial on basics in R is an extended collection of all the basic material that we provided in our series on inferential statistics and other tutorials. We will also briefly introduce the concept of data cleaning as a usual preparation procedure before the statistical analysis of data, and we will explain functions that are used to import and export .csv or Excel files into R — which we mostly didn’t mention or discuss so far in other tutorials. We will also start with a little introduction into the history of programming languages, to give you a basic understanding of computers and computer science in general.
R is in general the ideal programming language for beginners, since R and RStudio are rather simple to use, open source (unlike SPSS or Matlab), and there is plenty of open-access material on the internet — maintained by a wide and consistent community of users and developers. Since the syntax of R is very similar to Python or Matlab, it is also easy to advance and learn those and other programming languages as well. R is also the best language to start with when you are mostly interested in applying statistical methods of any kind in, e.g., medical or psychological / social scientific research.
As mentioned, programs such as R can also be used to adjust data sets, change entries (correcting typos) and more, similar to programs such as Access or Excel. The great benefit in that respect is that an R script is essentially a documentation of all changes that were made to a data table. For example: necessary adjustments of entries, especially in long-term databases, due to typos or because diagnoses have changed over the years etc., can be done in a very transparent way by writing a script of change commands. The search-and-replace function in Access/Excel can do the same; however, the changes are not documented in the form of a code script, so that approach remains rather untransparent.
To start our journey into the realms of R and RStudio, let us take a brief look into what the function of a computer actually is, before we try to “speak” with a computer / give a computer commands via a programming language ourselves.
For those that are new to the field of computer science but are eager to become a Cyber, we will start with a brief history of programming languages, starting with the simple question: what is the function of a computer?
A very general answer to the above question with a little twist is: Computers are a representation of the abstraction of the (mathematical/logical) concept of a function itself:
Taking the function that was used to transform our input data to be a parabola, we can look at the above relation in the form of a plot of values:
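As a minimal sketch of this input / function / output idea in R (all names below are chosen for illustration, not taken from any particular script):

# Input -> function -> output, using a parabola f(x) = x^2:
input = seq(-3, 3, by = 0.5) # the input values
output = input^2             # the "function" applied to the input
plot(x = input, y = output)  # look at the relation as a plot of values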
In general, all of the above three levels of computation can be manipulated. Importantly, the input or the type of input does not have to be numeric. The function also does not have to be mathematical, but could, e.g., be a function that capitalizes the first letter of every entry of a column in a data table, such that the output will be a respectively modified data table.
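A hedged sketch of such a non-mathematical “function” could look like this (toupper(), substr(), nchar() and paste0() are base R; the column values are made up):

# Hypothetical column of a data table:
col = c("alice", "bob")
# "Function": capitalize the first letter of every entry:
paste0(toupper(substr(col, 1, 1)), substr(col, 2, nchar(col)))
# [1] "Alice" "Bob"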
Crucially, the above levels of computation can also be thought of as a control algorithm for every statistical analysis:
Nowadays it is rather simple to communicate with a computer, i.e., to give a computer commands. This has not always been the case and only changed with the invention of programming languages. In the past, one also had to tell the computer how to use its hardware when implementing an algorithm. To get an idea of what that means in terms of effort, see the below comparison for “printing” the phrase “Hello, World!” on the screen:
“Higher-level” programming languages were developed to simplify the use of computers. In nuce, languages such as FORTRAN (FORmula TRANslation) were built to make computers usable for those that do not have insight into the hardware of a computer and do not know how to communicate with it, but still want to print the phrase “Hello, World!”. A fundamental concept invented in order to bypass the above steps necessary in assembly are compilers and interpreters. We will not go into the difference between them, but what they essentially do is translate a higher-level command into a lower-level command, such as in the form of assembly code (or find a possible interpretation to represent the command in assembly/machine code). Assembly code is also naturally tied to a specific hardware, so the code needed for a program such as the above for printing a phrase will differ between machines. So compilers and interpreters also made it easier to talk the same higher-level language with different computers, translating code not only into one but into different kinds of lower-level hardware assembly languages…
The downside of so-called “higher-level” programming languages such as R is the fact that they run “slower” than “lower-level” languages, like assembly. This also goes for code within one language: some code runs faster than other code. However, it is important to note that this is mostly not a problem for any kind of regular empirical statistical analysis, since computers have advanced so much that it would mostly take millions of entries in a data set to take longer than a second when performing, e.g., a linear or multivariate regression. What should matter to you is only that you are able to really understand and check whether the code you have written really does what it is supposed to do (remember the triadic algorithm of the general concept of a function).
However, due to performance constraints, assembly code was still used to write, e.g., computer games that heavily rely on hardware performance. A classic example of extraordinary programming in assembly code is the game “RollerCoaster Tycoon” from 1999 (99% assembly, the rest in C). The benefit of doing so was that the game ran on a lot of regular PCs of that time without the need for expensive graphics cards. Classic Game Boy games were also often coded in assembly and C for that purpose. This is an issue that has mostly vanished, since graphics cards have advanced rapidly in the last two decades.
There is a lot more to discover, but we hope that this and other tutorials within our collection Stat-o-Sphere will feel like a kind of futuristic archeology, since computers and programming languages have been around for quite a while already and are by far the most advanced invention humans have ever come up with. However, all of our tutorials will mostly focus on computational statistical modelling of some kind. This also goes for our series on information theory and technology, which we want to recommend to those that want to learn more about the history of computing and the logic behind computers. Our series on information theory also provides some basics in mathematical psychology, psychophysics, cognitive science, AI/machine learning and computational neuroscience.
The goal of this chapter was to make you aware of the fact that there are several hierarchical levels of programming languages working at the same time in a computer. This can, e.g., help to understand the existence of certain errors a little bit better, keeping in mind that there are several processes that are not represented via the code that one typed into a script in R — and other structural circumstances that relate to the underlying structure of computers and programming languages.
Since we did not look into the history of R, we recommend this short lecture by Roger Peng, in case you want to learn more.
The installation of R is rather simple. The only thing is, there are two programs you will need to download and install: R and RStudio (Desktop Version, details below). RStudio is in general used to ease the use of R itself, as we will see.
When you are on the website to download R, click on CRAN (see image below) and choose a server you want to download R from (e.g., a server from your country, but it actually doesn’t matter from where you get the file).
Next choose your operating system:
In case you are trying to install R on an Apple computer, you have to decide between a version for Apple silicon (M1/M2) and one for older Intel Macs. If you don’t know which version to pick, check which kind of computer you own, or just try out which one works by installing and opening R. To find out which chipset your Mac uses, click on the Apple symbol and then on “About This Mac”, which should give you the desired information.
The website for RStudio Desktop Version is easier to use, so you should be fine finding a way to download the setup file by yourself.
RStudio is an IDE, i.e., an integrated development environment, in general used to ease the use of R itself, as mentioned. On the upper left in the figure above you will find the window for the script (containing unrelated content in the image above). You can open several scripts at a time (tabs). A tab with blue highlighted letters indicates that the script has been changed and not saved yet. When you close R, it usually reopens all the previously opened scripts. However, sometimes it doesn’t, e.g., due to crashing (wait a few seconds before reopening). You can also create so-called projects in R, which save a whole session of several scripts — we will go into projects in a separate chapter further below. On the upper right side you will find the environment (also called the workspace!). It keeps track of what you executed. You can also find the results in the console output, which is usually found on the lower left side.
The appearance of RStudio can be changed via:
Tools -> Global Options -> Appearance.
The upper theme I am using is called “Dracula”.
You can also customize RStudio themes. Unfortunately, there is currently no easy-to-use website available to do so (there used to be one…).
However, you can also add external themes (e.g. taken from this github repository):
Below is a slightly altered version of the Dracula theme with lighter color for comments (I called it the “daywalker” version, since it helps to read the whole code with comments at daytime as well, while keeping up the whole Dracula flair, haha):
Another simple but important basic feature of RStudio that can come in handy is knowing how to adjust the size of the appearance of RStudio:
Press CTRL and + or - to see the effect on the size of the interface elements in RStudio:
Start RStudio and click on:
File -> New File -> R Script to open a new script.
After that, click File -> Save As in order to give the script a title and to save future changes. This will be helpful in case you want to reboot R, e.g., when faulty code messed up your environment.
Mark the lines you want to execute and press ALT+ENTER (or mark lines and press “Run” in the RStudio interface). In case you are using any Apple hardware, use the command key instead of ALT. In general, we recommend using the keyboard as much as possible, e.g., hold SHIFT and use the arrow keys to mark letters (left/right) or whole lines (up/down).
Note that hitting ALT alone triggers the option to operate the RStudio interface via the keyboard — try it yourself and move the arrow keys to see what we mean! In case this happens, you have to click on the script again, otherwise hitting ENTER will just open the “File” menu of RStudio.
The result of running our test code can be seen in the environment tab on the upper right side in RStudio (see image below). If you ever feel that your script is presenting a funny or implausible output (especially after a series of errors), clear the environment / workspace via the brush tool — there is another brush icon placed in the console tab to clear the console. If this does not help, close R without saving the workspace (but save your scripts first!!).
Note that lines that begin with #, and anything after a #, are marked as comments and will not be read by R as actual code. Also note that R is a case-sensitive language, so an object with the name Test will not be treated the same as test, and spacing in object names is not allowed (test_1, test1 or test.1 are possible alternatives).
# Mark line and execute via ALT+ENTER or Cmnd+ENTER (Mac)
test = 2 + 5
To also obtain the result of an object in the console after you have executed the whole line, such as test = 2+5, mark only the name of the object (here “test”) and execute it (again, the whole line has to be executed and present in the environment first).
# Console output
# [1] 7
You can also just execute a line with 2+5, i.e., without giving it a name, and you will obtain the result in the console output too. However, the result will not be saved in the environment — only when you give the object (the output vector) a name! You can also type the name of an object in a separate line and execute that line (or the whole script) to view the output in the console.
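A quick sketch of the difference:

2 + 5        # the result is printed in the console...
# [1] 7      # ...but nothing is saved in the environment
test = 2 + 5 # saved in the environment under the name "test"
test         # executing the name prints the stored value
# [1] 7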
Note that we generally recommend using the keyboard when operating within the script: use SHIFT+ARROW to mark and demark letters (left/right) or whole lines (up/down).
Note that R scripts are executed from top to bottom. The R script we provided can theoretically be executed as a whole (mark all and execute). However, it may be that a variable, e.g., one with the name test, gets redefined in lower parts of a script with a different value (which results in changes in the environment!). In other words: the content of the previous object with the name test will be “overwritten” by the new content assigned to that name. Keep that in mind when playing around with the script.
EXAMPLE: I executed another line “test = 5+3”. Since I also named it test, the previous result gets overwritten in the environment:
Note that when operating with larger scripts and testing and changing a lot, confusion and possibly errors can easily arise (especially when opening the script again later and it suddenly doesn’t work anymore). We recommend making frequent use of the brush function and re-testing your code after changes were made, in order to recognize inconsistencies in a script as early as possible. It can also help to restart R, especially when writing functions…
The last general “rule” we want to make you aware of in this chapter is the fact that you can only define one object and execute one function per line:
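As a side note, the brush tool also has a code equivalent, in case you prefer clearing the environment from the console (a common base R idiom):

# Remove all objects from the environment (same effect as the brush tool):
rm(list = ls())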
# Only one object per line!
# The below will give you an error:
test1 = 2 test2 = 3
# Error: unexpected symbol in "test1 = 2 test2"
In case you want to write a more compact script of commands, you can use a delimiter to demarcate where a line ends within one line of an R script. We will also get to know the role of delimiters when importing and exporting data sets in R. However, the concept of a delimiter is a general concept of demarcation, and its use and function are therefore context dependent, so to speak. Within R scripts, the semicolon ; is used as a delimiter for lines:
# We can use the delimiter ";" within R scripts though:
test1 = 2; test2 = 3
# Delimiters are a general concept of demarcation. In R
# it demarcates lines. So using ";" is the same as starting a new line!
Congrats, you have now officially become a Cyber!! <3
Knowing the basics of a programming language is truly something, and the further one gets an idea of what computers actually are, the more it feels as if one has lived in the middle ages all along, we promise.
As mentioned, especially when you are working with several scripts, or have several scripts open, creating a project can be helpful in a lot of ways. Projects are also helpful when you want to make sure that everything is backed up in case of a software crash. Usually R opens the previous session when restarting. Sometimes it doesn’t. It can help to re-restart R, but in seldom cases a session remains lost. The scripts are usually still backed up, such that the tab’s title appears blue (indicating unsaved changes of the script) when reopened. However, creating a project is a safe way to organize and back up your work. Note that when you close RStudio, it usually asks you if you want to save the session and the scripts. Saving the workspace is usually not necessary, since everything in the environment is the result of executed code that we usually still possess. You may also want to restart RStudio in order to completely clear the environment when checking if the code actually works (since there may be old variables still active in the environment…). However, you should always save your scripts when asked! If you were asked to save scripts / the workspace and just clicked “cancel”, RStudio will close immediately the next time you try to do so and will back up the current session automatically, except for the workspace / environment.
The steps of creating a project are pretty straightforward:
In the field “Directory name” you can choose the name of a folder that will be created in the directory you choose below. In other words, don’t make the same mistake I did above, creating a folder called “test” beforehand (in the Windows file explorer) and then creating another folder within that folder called “Test” via the field “Directory name”… In other words: the field “Directory name” creates a folder in the directory chosen below that field.
Speaking of file explorers, have a look at the bottom right of RStudio, after you have successfully created a project:
You now have a file explorer of your project folder integrated in RStudio (tab “Files”, next to “Plots”…). I personally hardly ever use it, but you may find a use for it in the future!
As a last remark, you can also easily change projects, by clicking on the R Symbol next to the project title on the upper-right of the IDE:
Data values as well as the output of calculations done by R are usually represented as a vector, which is an object type in R. Think of a column of a data table as a (column) vector. In R a vector is in general used as an abstraction of mathematics (from linear algebra), where vectors can be seen as, e.g., a list of coordinate values c(x,y) (the latter then being an actual mathematical vector represented in R). In R a vector can list anything that way, such as a set of measurement values, or a list of names — just like in programs such as Excel or Access.
However, from another, set theoretic perspective, a column vector of values can be looked at as a set of values c(). Via the column and row bind functions cbind() and rbind(), vectors can be combined column or row wise to form a matrix / data table.
Part of the analogy of such a “matrix table” is the demand for dimensional integrity, meaning: every column has to have the same number of rows. Whenever there is simply no entry, an NA (or N/A) is assigned, which means “not available or not applicable” — a reserved empty field is still given, so to speak. The same goes for the inverted case with a constant number of rows but a variable number of columns.
However, an object of combined vectors — via cbind() — can also be turned into a data type called “data frame” via as.data.frame(). The columns can then be given names, and the data table is thereby turned into a format that can also be exported as a .CSV file and used with several commonly used functions (which demand such a data type). You can also turn a vector into a matrix via as.matrix(). It will then have additional indices to represent a row or column vector of a matrix. Here is some code to demonstrate how we can get from a vector to a matrix to a data table:
vec = c(1,2,3)
# [1] 1 2 3 # NOTE that an output is also a vector
# therefore the [1] at the beginning.
vec[2] # index 2 == second element
# [1] 2 # second element in the vector is also 2
mat = as.matrix(vec)
# [,1]
# [1,] 1
# [2,] 2
# [3,] 3
mat[2]
# or
mat[2,1]
# [1] 2
mat2 = as.matrix(t(vec))
# t() == transpose == tilting a vector (different with matrices!!!)
# [,1] [,2] [,3]
# [1,] 1 2 3
mat2[1,2]
# Technically a column vector with only one element, when thought of as
# an element of the matrix above:
# [1] 2
mat_bind = cbind(c(1,2,3),c(1,2,3),c(1,2,3))
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 2 2 2
# [3,] 3 3 3
mat_bind = as.data.frame(mat_bind)
# V1 V2 V3 # V stands for variable
# 1 1 1 1
# 2 2 2 2
# 3 3 3 3
# Column names can be changed via:
colnames(mat_bind) = c("One", "Two", "Three") # rownames() exists too
#   One Two Three
# 1   1   1     1
# 2   2   2     2
# 3   3   3     3
The mat2 example above uses the transpose function t() to flip the vector. Note that this function does something else to matrices (it then doesn’t just rotate a matrix; I just used this function here for demonstrative purposes).
The programming language Matlab (short for Matrix laboratory) operates with matrices as the primary class, i.e., any list of entries always comes with indices for abstracted rows and columns, even if the object is conceptually a vector (matrix with only one row).
In R there are actually six different classes of vectors, so-called atomic vectors (see list below). They are called atomic and not “molecular”, since they only consist of a series of entries that are all of the same type or “class” (all entries are, e.g., numbers or words) and therefore do not represent a “molecule” of different “atoms”, or simply classes (multiple entries with different classes, following the comparison). This is different with an object of the “molecular” type data.frame, which can consist of different types/classes of entries in each column, different to a mere object of the type matrix. The columns themselves are still assigned a certain class though, such that it is possible to, e.g., type text into a column of numbers, but the class of the column vector will then be adjusted respectively and the numbers will be treated as symbols from then on (the class changes to character in this case; we will get there again in detail below). For now just keep in mind: an object of the class data frame is a composition of atomic vectors (the class is determined column-wise).
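A brief sketch of this difference (the names are made up): each column of a data frame carries its own class, while a matrix coerces everything into one class:

df = data.frame(id = c(1, 2), name = c("a", "b"))
class(df$id)   # [1] "numeric"
class(df$name) # [1] "character" (in R >= 4.0 strings stay characters by default)
# A matrix instead coerces everything into one class:
class(cbind(c(1, 2), c("a", "b"))[1, 1]) # [1] "character"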
You may ask why there are different classes anyway. First of all, different R functions are “naturally” associated with certain types of information: numbers can be used to do a calculation, and words may have to be adjusted (e.g., capitalized) or exchanged with another word for a whole column. In other words, different types of information involve different processing algorithms and are represented very differently within a computer as well. We will not go into it too deeply here, but we hope this helps to make sense of the fact that such classes exist in a program in the first place and what their boundaries are. Below you see a list of the six atomic vector classes in R:
logical (TRUE/FALSE),
integer (whole numbers, such as 1, 4, 100),
double (64-bit floating point format to represent decimal numbers in computers; we will not go any further into the topic here),
complex (for complex numbers),
character (for so-called character strings, i.e., any sign as text) and
raw (entails raw bytes as elements).
Objects with numerical content of any kind are also classified as “numeric”, which can be integers or single/double precision floating point numbers (which refers to the represented precision of decimal numbers; there is also the class “single” for 32-bit numbers, but 32 bit is actually not used much anymore in data science). We will not go any further into the topic of classes, but you can find more information here on the “Comprehensive R Archive Network” (CRAN). There are more details about it; however, for general basic usage you won’t need to know much more, apart from the fact that not every function works with every class, but classes can eventually be changed, as we have seen with the matrix that was turned into a data frame.
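To make the list above a little more tangible, here is a minimal sketch that checks one example per atomic vector class via class():

class(TRUE)      # [1] "logical"
class(4L)        # [1] "integer" (the suffix L forces an integer)
class(4.2)       # [1] "numeric" (stored as a double)
class(2 + 3i)    # [1] "complex"
class("four")    # [1] "character"
class(as.raw(4)) # [1] "raw"
is.numeric(4L)   # [1] TRUE -- integers also count as "numeric"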
Mathematically, the general category for vectors and matrices is actually tensors, such that a vector is a 1D tensor and a matrix a 2D tensor. A tensor can have 3 or more dimensions — so can the object type array() in R. We are not interested in the mathematical structure of a tensor, but in how it can be used in an abstract way in programs such as R. A classic application can be found in the theory of relativity, representing four dimensions: three space coordinates (x, y, z) and one dimension for time… In general, a vector or matrix in R is also an array from a general computer scientific perspective. The term array is derived from French and is related to words like arrange (see the online etymology dictionary here). It more or less means “(to put in) sequence, order”, in other words a list that is equipped with indices. Python uses lists/arrays as its primary data structure (as R uses vectors and Matlab matrices…).
Array example: imagine you have five tables with five columns and 50 entries each (50 rows). Say you want to “loop over” all five tables adding +1 to each entry, where “loop over” means exactly that: you want to add +1 to each entry of each column of each of the five tables… You could then use an array with the dimensions array[row = 50, col = 5, table = 5] to target such a computation via indices (we will look into “for loops” further below)… However, for the above example, where an addition is performed on the whole array, you can actually just add +1 behind it, without formally telling the computer to do so for every single entry of that array (see code below).
# Exemplary array with
# dim = c(rows, columns, further_dimension)
array(1, dim = c(50,5,5))
# Add a 1 to every entry:
array(1, dim = c(50,5,5)) + 1
The array() and c() functions can also store a character string together with numbers, but the class is then automatically changed to “character” and the numbers are understood as character values (as mere symbols, so to speak). If we were to add a 1 to such a number in the string, it won’t work, since it is not considered a number in such a case:
string = c("One",2,3)
# Try to add a 1 to the second element in the string above:
string[2]+1
# Error in string[2] + 1 : non-numeric argument to binary operator
R automatically turns numbers into a string if there is a text element in the “list” of values. Text without quotation marks is considered an object from the environment. If there is no object with the respective name, an error states that the object is missing:
c(object,2,3)
# Error: object 'object' not found
c("object",2,3)
# [1] "object" "2" "3"
Note that a (character) string is also processed as a kind of array, where every element of a word translates into a set/list whose elements are single symbolic codes for letters on another, more general level of abstraction, such as byte code that translates a series of bits (8 bit) into symbols (e.g., ASCII code). This means you can expect to find ways to target, e.g., always the first letter, since there is always another list in the background, indexing every symbol of a character string as well. From the perspective of the script itself, though, the word of a string is just understood as one element. In other words, you can target, e.g., the second word in a list and then its second letter, as sketched below. See a further example for string manipulation in the data cleaning chapter below.
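A short sketch of the above (the words are made up): a word is one element of a vector, but its letters can still be targeted via substr() from base R:

words = c("alpha", "beta")
words[2]               # the second word is one element of the vector
# [1] "beta"
substr(words[2], 2, 2) # ...but its second letter can still be targeted
# [1] "e"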
We will not go into the details of information technology, but it is important to get a hint of the circumstance that a higher-level, human-readable language such as R is already the product of several layers of abstraction working beneath it (recall the chapter on the history of programming languages).
There is also the data type “list” itself, which can be created via list(). A list can store items of different kinds. Different to an array, a list() can consist of a series of elements that are of different classes (without automatically coercing everything into one homogeneous atomic vector class). Below there are no more automatic quotation marks around the numbers; test the class of the elements yourself via class(test_list[[i]]), where i stands for index — the index number of the elements in their respective order. Note that the index is always an integer number. Languages such as Matlab, FORTRAN and R start with 1 as the first index; however, there are also languages such as Python which start with a zero index (there is a debate about which form is better, but we will not go into it further).
test_list = list("test",2,3)
# [[1]]
# [1] "test"
# [[2]]
# [1] 2
# [[3]]
# [1] 3
test_list[[2]]
# [1] 2
A list can also be structured as a matrix and so on (matrix list)… However, we hope this chapter has given you enough basic insights into different data types and classes.
In order not to get too confused by the logic of code as such, recall that a) there is something going on concerning mathematical abstractions, and b) there is something going on concerning general computer scientific processes, e.g., in the case of an array as a very general data type, or classes that only work with certain functions, even though the code itself looks like an abstraction of mathematics...
You can also change classes, e.g., via as.numeric(). With this function you can change characters into numbers, given that all characters in the vector are actually numbers:
# Character string with numeric symbols to numeric vector:
test = c("1","2","3")
is.character(test)
# [1] TRUE
test = as.numeric(test)
is.numeric(test)
# [1] TRUE
# Check what happens if one element is an actual character string
# not just number symbols treated as characters:
test = c("test","2","3")
as.numeric(test)
# [1] NA 2 3
# Warning message:
# NAs introduced by coercion
Before we get to importing and exporting data itself: any changes made to a data set that you loaded into R are only “virtual”. In other words, they will not affect the actual data file, since R only loads a copy of the data into the environment and then does something with or to it.
EXAMPLE: Say you changed every entry in a column of names that were all capitalized (“NAME”) into a format where only the first letter is capitalized (“Name”) — then you never did so on the actual data set, unless you export it and explicitly overwrite it using the same name (which you shouldn’t, btw). So keep in mind that there is no need to be scared that you are messing up your actual data set!!
We will start with .CSV files. In case you didn’t know, CSV stands for comma separated (delimited) values — however, it is not always a comma but sometimes a semicolon, as we will see. In general, we recommend exporting Excel files as .CSV, since it is mostly the easiest and safest way to get data files into R without issues.
There are two functions for loading .CSV data into the environment in R:
read.csv("Your file path")
read.csv2("Your file path")
To find your file path you can use the explorer in Windows and copy it.
However, after you copied the path and pasted it into R, you have to change every “\” into “/”, otherwise you will get an error (these are just different conventions for the use of “\” and “/” in Windows and other operating systems).
Mac users can find their path by going to the folder with the file, then click on the file and then hold the OPTION key. Now choose “Copy file path”.
Another way to handle file paths is via setting a working directory. The below function will tell you your current working directory.
getwd() # no input needed!!
The function getwd() will tell you your working directory, which is set to your project folder when working with a project. When the data file you want to import is in your working directory folder, then it is enough to type read.csv("dino.csv"), without writing out the full path. In case you want to change your working directory, you can do so via setwd().
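A hedged sketch of this workflow (the paths and file names below are hypothetical):

getwd() # e.g. "C:/Users/you/projects/r_basics"
# setwd("C:/Users/you/data")  # change the working directory if needed
# dino = read.csv("dino.csv") # path relative to the working directory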
You can even load a .CSV from the internet, using the url() function!
read.csv(url("http://www.website.net/data_file.csv"))
The above functions for reading .CSV files both work the same but differ in the delimiter that is used (either comma or semicolon). What are delimiters good for? A program needs a sign to know how to differentiate the entries of one column/row from another. When you open a .CSV file with a regular text editor, you will see that your data table is delimited either via “,” or via “;”. To test this, try to load the below .CSV files from the Datasaurus Dozen dataset (see next chapter for details).
Note that within an R script, decimal values are written with a dot, not a comma, i.e., 2.5 not 2,5 in R.
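To summarize the difference between the two functions in code (the file names below are hypothetical):

# read.csv():  delimiter "," and decimal point "."
# dino = read.csv("dino_comma.csv")
# read.csv2(): delimiter ";" and decimal comma "," (common in, e.g., German locales)
# dino = read.csv2("dino_semicolon.csv")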
Try to load the below files into your environment!
Plot the above data set via the code below. The $ sign is used to address a certain column of a data table by its respective name. RStudio usually offers you the names of the columns automatically after typing data_table$:
We need the column for the values of the x and y-axis for our plot:
# I called the object "dino", such that plotting is done by:
plot(x = dino$x,y = dino$y)
The export of files works pretty much the same; you just have to be very careful not to name your export file after your input file (when using the same path as before), otherwise the latter will be overwritten. This can become an issue in case you, e.g., wrongly changed entries of a data set and want to reverse your actions. So keep in mind to choose the export file’s name carefully:
# install.packages("readr")
library("readr") # package also within "tidyverse"
# Structure: write.csv(objectname, "YOUR File-Path")
# write.csv()
# write.csv2()
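A hedged usage sketch (the object and file names are made up; note that write.csv() and write.csv2() are actually base R functions, while the readr package additionally offers write_csv()):

# Export under a NEW name, so the original file is not overwritten:
# write.csv(dino, "dino_cleaned.csv", row.names = FALSE)
# write.csv2(dino, "dino_cleaned_semicolon.csv", row.names = FALSE)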
The above also shows how to load and install packages (uncomment the first line to install the package). Packages/libraries are collections of functions; some of them are already pre-installed and sometimes already loaded, others are not. In case you want to use a function of a certain package that is not pre-loaded, you have to load it. Some packages also consist of or depend on other packages, which makes the concept a little confusing (packages within packages — then they are also called libraries, somehow…). However, in the end they consist of a variety of functions that become available after a package / library is loaded.
Before we begin with a short introduction into data cleaning, we want to make you aware of a variety of integrated exemplary data sets for R!
You can use the data() function to import exemplary data sets into your environment! You will gain additional data sets when installing and loading the package library("dplyr"), e.g., a Star Wars data set.
# Install package for further data sets:
# install.packages("dplyr") # (dplyr also used for data cleaning)
library(dplyr) # open/activate/load package
data(starwars) # load data set
View(starwars) # view via RStudio viewer
?starwars # view documentation
Another data set, we briefly introduced before is the Datasaurus Dozen.
You can also find more information on the above in the author’s article on Autodesk. The Datasaurus Dozen is also available as an R package!
install.packages("datasauRus")
You can then also use ggplot2, an advanced plotting package, to play around with the data set. See this website with a documentation of the datasauRus dozen package for further information.
# Another fun and insightful package:
# install.packages("datasauRus")
library(datasauRus)
datasaurus_dozen
# Plot using ggplot2:
#install.packages("ggplot2")
library(ggplot2)
# Plot all:
ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset))+
geom_point()+
theme_void()+ # deletes coordinate system
theme(legend.position = "none")+
facet_wrap(~dataset, ncol = 3)
# Plot only certain set:
dino = filter(datasaurus_dozen, dataset == "dino")
ggplot(dino, aes(x = x, y = y, colour = dataset))+
geom_point()+
theme(legend.position = "none")
The Datasaurus Dozen essentially builds upon the famous Anscombe data set, with the same message: you can calculate a linear model from any values, but always visually check whether such a model is actually the right model for the given data (e.g., due to simple plausibility).
As mentioned, the benefit of using R for any kind of data cleaning / manipulation is the fact that every step taken remains transparent, since every step is represented as a line of code in the script. We have also learned that R will always load a copy of the original data set into the environment, so the original can remain the same and you only export a changed version, never overwriting the original (well, we have seen that you actually can, but we highly recommend you always pick a different name than the original file when using the write.csv() function, or find some other backup strategy!).
In any case, using R for data cleaning allows for high transparency and potentially makes corrections possible even in the far future — in case that becomes necessary.
First, let us have a look at how data sets in science can look. Below you will find an example from the medical field (code further below). Let us say we have measured the systolic blood pressure and we checked if a medication lowered this value (note that this example is medically not really accurate, but a choice of simplicity!). We therefore measured the blood pressure of all patients before (t1 = time point 1) and after the medication was given (t2). Below we have a data set with four columns: patient_id, time, measurement_sysRR and fam. The table is structured in the following way: we have a column with patient IDs, and we can see that each patient ID always appears twice in that column. This is due to the fact that the column “time” relates all patients to two events: t1 and t2. In other words, it is actually a 3D array that was packed into a 2D structure, so we have to filter for the third dimension, represented in a distinguishable way via the redundant column “time”!
The column fam is also redundant and is used to note whether a family anamnesis was performed (e.g., to indicate a genetic disposition regarding blood pressure). The duplication of such an entry appears unnecessary, since such a procedure is usually done only once per patient (the family anamnesis always relates to the past only). The reason why there are duplicates is not necessarily that someone made this note twice by hand; it can happen when data tables are combined — say a table with patient_id and fam, and another table with patient_id, time and the measurement (we will not show an example for this in this tutorial).
# Example for the structure of data tables:
# Creating an examplatory data set with vectors:
patient_id = c(1,1,2,2,3,3,4,4,5,5,6,6,7,7)
fam = c("yes", "yes", "no", "no", NA, NA, "n","n","no","no","ys", "ys", NA, NA)
time = c("t1","t2","t1","t2","t1","t2","t1","t2","t1","t2","t1","t2","t1","t2")
measurement_sysRR = c(130,122,132,123,133,121,129,125,135,119,134,127,140,125)
# Format into a data frame:
table = as.data.frame(cbind(patient_id,time,measurement_sysRR,fam))
However, since our example represents a test, what we now want to do with the above data set is essentially compare the measurements of t1 and t2 and see if the blood pressure changed, e.g., via a dependent two-sample t-test, in order to evaluate the difference in means between the blood pressure at t1 (before medication) and t2 (after medication)… To do so, we would have to evaluate, e.g., the mean of t1 and of t2 as part of the process. The mean() function is rather simple and demands only one input, e.g., the values of t1 only.
Well, this should sound rather simple, I guess, but you may already wonder: how do we extract only those values of the column “measurement_sysRR” that are horizontally related to rows that have the entry “t1” in the column “time”? One simple way to do so is to filter out all rows with the entry “t2” in the column “time”! The same goes for the case where we only want to keep those rows that have the entry “t2” in the column “time”:
In R this is mostly done via the package “dplyr” and its filter() function!
# install.packages("dplyr") # install package
library(dplyr) # load/activate package
# Filter function: filter(data_object, columnname == "entry") or != for unequal
t1 = filter(table, time == "t1")
t1 = filter(table, time != "t2") # alternative
t2 = filter(table, time == "t2")
The logic of the so-called “input parameters” of the filter() function is as follows: filter(data_set, column == "column_entry"), which can be read as: keep all rows from data_set that have the specified entry in the given column. You can also use != instead of == — see the code above for the difference.
You can now calculate the mean of t1 and t2 each. Note that mean() expects a numeric vector, so we have to address the measurement column directly and, since cbind() coerced the numbers into characters, convert it back via as.numeric(), as sketched below.
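A minimal sketch, given the tables t1 and t2 from above:

# The measurement column has to be converted back to numbers first:
mean(as.numeric(t1$measurement_sysRR))
# [1] 133.2857
mean(as.numeric(t2$measurement_sysRR))
# [1] 123.1429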
Next we will slightly change the previous synthetic data set in order to look at another, more advanced example of data cleaning. This chapter is particularly important, since it makes clear why it is important to understand the indexical logic of tables. It will therefore thoroughly go through several possible solutions for a rather common but also non-trivial problem below.
# SLIGHTLY DIFFERENT, including NA in the measurements:
patient_id = c(1,1,2,2,3,3,4,4,5,5,6,6,7,7)
fam = c("yes", "yes", "no", "no", NA, NA, "n","n","no","no","ys", "ys", NA, NA)
time = c("t1","t2","t1","t2","t1","t2","t1","t2","t1","t2","t1","t2","t1","t2")
measurement_sysRRalt = c(130,122,132,NA,133,121,NA,125,135,119,134,127,140,125)
# Format into a data frame:
new_table = as.data.frame(cbind(patient_id,time,measurement_sysRRalt,fam))
The situation is now a little different compared to before, since the column measurement_sysRRalt now entails <NA> entries (NA = not assigned / not available). In general, mathematical functions have a problem with NA values and usually output a warning or error message in such cases:
# Mean sysRR at t1:
t1alt = filter(new_table, time == "t1")
# Problem with NA
mean(t1alt$measurement_sysRRalt)
# [1] NA
# Warning message:
# In mean.default(t1alt$measurement_sysRRalt) :
# argument is not numeric or logical: returning NA
The problem is in general that a mathematical function has to assume that the input is numeric (single, double, integer) or Boolean (0 or 1) to be computable. NA is neither of them, so it is unclear what to do with such entries from a mathematical perspective. To solve this issue, we have to filter out the NA values. This could again be done via the filter() function; however, looking closely at the above data set, you may notice an important issue:
Since columns such as measurement_sysRR are related horizontally (or row wise) to the entry t1 or t2 in the column “time” (vertically), it would be implausible to just filter out all lines with NA entries in the column measurement_sysRR and call it done. But why?
The general problem is that above, one NA entry is associated with t1 and the other with t2 — essentially lacking comparability for those patients. In other words, we now have to filter out all rows associated with a participant (patient_id) for which either t1, t2 or both are missing in the column measurement_sysRR. The final decision to do so essentially relies on the chosen study design. In the case of a “per protocol” procedure, only those participants will be included in the statistical calculation that have actually been present at both time points. However, in the case of the “intention to treat” design choice, you may want to also evaluate and consider the fact that a lot of patients didn’t manage to be present at the second time point. In some cases this may be due to an aversion against the treatment, e.g., due to side effects… We will not go into further details here, but we wanted to make you aware that this is a point where decisions of this kind have to be considered when performing data cleaning and subsequent statistical analysis.
However, let us say we want to only consider those patients that have an entry in both t1 and t2: how do we manage to filter our data table correctly under this constraint?
Given the small size of our data table below, you may be tempted to just filter each participant via their ID by hand — and in fact, if your data set is not too big, you can actually do so. In the case of larger data sets, this easily gets laborious, so the question is: can we find a representation for an algorithm in R that performs exactly what we would do “by hand”?
The challenge is at first to find a way to correctly express what we want a function or a series of functions to do with our data set. Below you will find two possible “linguistic” solutions (representations of algorithms) to our problem, including their representation in R. Again, the issue is that not every participant of our study has an entry in both t1 and t2. The resulting table should look like the image below (patients with the ID Nr. 2 and Nr. 4 were deleted):
Solution/Algorithm I:
a) In which row positions are the NAs within the column measurement_sysRR found?
b) Which patient IDs are related to these rows, i.e., which patient IDs are in the same line as the NAs?
c) Given the information of a) and b), filter all rows related to patient IDs that are related to NAs in the column measurement_sysRR (which are given in either t1 or t2 — or both).
Solution/Algorithm II:
This solution takes advantage of redundancies within the above table, e.g., the fact that the patient ID appears twice, related to each t1 and t2:
a) Delete all lines related to NAs in the column measurement_sysRR.
b) Evaluate which individual patient IDs only appear once.
c) Delete all rows of patient IDs that only occur once.
In the chapters below we will present you two ways how to represent the above two algorithms we have formulated in R.
— The first representation represents each step for itself using basic functions and for loops.
— The second representation only concerns the second solution for our problem and is done via tidyverse functions (shorter code, but the functions are a little less accessible!).
Using functions from the package tidyverse in principle yields shorter code that is computed faster (which is actually not an issue anymore for regular statistics, since computers have advanced so much in recent years). However, the reason we present a variety of computable solutions is to make you aware of the basic table logic and of how directly we can represent what we formulated in advance. Sometimes you may also encounter a problem for which you may not find a simple solution within the functions of a package, so you have to represent each step yourself. In general, not every one of your needs will fit a function written in advance for you. When searching the web for answers, it can even be hard to formulate the problem you have in general, so we recommend writing down the necessary data cleaning steps you want to go through, to orient yourself within your data cleaning process and to be able to clearly express your issues (e.g., as a comment via # in your script!). You can then start thinking of how to represent them in R step by step. We hope that this chapter will help you with most of the data cleaning problems you’ll encounter in the future. In general, there are often several ways to get to the desired goal, and especially when you do further research on the internet, it is good to be aware of the fact that this is the case.
The below might look a little complicated at first glance — and a little bit like assembly code — but the code below is actually a really direct and simple representation of what we would do if we were to follow “Solution/Algorithm I” by hand. Below we will go through every function that is used and will also introduce the concept of for loops. For loops are a very basic concept in all of computer science and can come in handy in several ways. Let’s get to it:
Solution/Algorithm I:
a) In which row positions are the NAs found within the column measurement_sysRR?
b) Which patient IDs are related to those rows (are in the same line as the NAs)?
c) Given the information of a) and b), filter all rows related to patient IDs that are related to NAs in the column measurement_sysRR in either t1 or t2 — or both.
### POSSIBLE SOLUTION I to get rid of patients' data with only t1 or t2, not both:
# In which lines are NAs? (row / line number, not patient id!)
is.na(new_table$measurement_sysRRalt)
# [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
na_lines = which(is.na(new_table$measurement_sysRRalt) == TRUE)
# [1] 4 7 # Index of the lines with NA entry
# Initialize vector:
pat_id_na = c()
# Determine which patient_id is in the lines from the list na_lines:
for(i in 1:length(na_lines)){
pat_id_na[i] = new_table$patient_id[na_lines[i]]
} # End for i
pat_id_na
#[1] "2" "4" # Patient ids with NA in either t1 or t2
# Initialize table:
fin_table = new_table
# Delete every line with the patient_id given in the list
# pat_id_na:
for(i in 1:length(pat_id_na)){ # or length(na_lines)
fin_table = filter(fin_table, patient_id != pat_id_na[i])
} # End for i
fin_table
STEP A: So, at first we want to determine the row numbers of those rows with NAs in the column measurement_sysRR. We can do so using the is.na() function. The output will be of the class “logical” (TRUE/FALSE). We can now combine this function with the which() function in order to determine the row numbers with an NA entry (belonging to either t1 or t2). Further below we will look at examples of character string manipulation, where the which() function can also be used to evaluate and change specific entries (e.g., change entries with “ys” into “yes”). The resulting object/list with the respective row numbers is called na_lines.
STEP B: For step b) a for loop is used, which can be understood as a command to repeat a defined step of an algorithm. In our case we use a for loop to go through all the entries in the list of row numbers na_lines from the previous step, in order to check which patient ID is associated with the NAs, and to then put each resulting patient ID into another list called pat_id_na. Again, this list entails the patient IDs of those patients with NA values in the column measurement_sysRR.
Before we discuss and write a for loop, note that we always have to initialize an empty list to store the results in such a case, e.g., via object_name = c(). This object can then be addressed within the for loop in order to store results.
The basic form of a for loop in general looks like this:
for(i in 1:n){something}
“Something” here stands for any process we want to perform within a loop. The i stands for index (class integer), but is symbolically arbitrary (it could be k, f or a word).
In a very general way a for loop says: “do something with every i”, where i is further defined via 1:n or 1:length(list). Note that the vector na_lines further above has a length of two in this case, so we could theoretically also write 1:2 instead of 1:length(na_lines). Here is a quick and simple example of what a for loop is in general (related to a previous instance where we added +1 to an array — now we do the same with a vector, using a for loop):
# Define a object you want to loop over:
object = c(1,1,1,1)
# Initialize a list to store the results:
list_results = c()
# For loop that adds +1 to every element of the above object:
for(i in 1:length(object)){
list_results[i] = object[i]+1
} # End for i
# Analog:
list_results[1] = object[1]+1
list_results[2] = object[2]+1
list_results[3] = object[3]+1
list_results[4] = object[4]+1
The above can theoretically also be done via object + 1, just as with the previous array_object + 1, but we hope you get the idea: a for loop is used to perform an operation on a whole list/array/vector or a whole column in a very direct way.
The for loop we are now going to use works a little differently and also uses an important trick, as we will see.
Recall: In the case of our example, we want to obtain and store those patient IDs that had NAs in the column measurement_sysRR. What we before called “SOMETHING” as a place holder for the process that a for loop is supposed to repeat now looks like this:
pat_id_na[i] = new_table$patient_id[na_lines[i]]
The right side now uses a trick: again, the object new_table$patient_id entails the column vector of the patient IDs. Adding a respective index in brackets behind the object name, e.g., new_table$patient_id[4], relates to only one value of that vector at a time. Now, importantly, note that the index of the column vector with the patient IDs also represents the row number of the whole data set!! In step A we already obtained a list of row numbers of those lines with NA values in the column measurement_sysRR, and we now want to find the corresponding patient IDs. To do so we can simply insert na_lines[i] into the brackets behind the object new_table$patient_id, since na_lines entails row numbers itself.
STEP C: We can then store those ID values in another list, and then just have to filter the respective IDs from the data table using our filter() function on the column patient_id.
IN CASE THIS IS STILL CONFUSING: Again, the above loop will in principle do what we would do by hand, i.e., it is just a representation of the solution/algorithm we discussed in the previous chapter: we go through each value in the list na_lines, which consists of row numbers, and then look at the entry of the column patient_id at the respective row number. We do so in order to find out which patients (with respective IDs) we have to exclude from the data table, and subsequently from further calculations, due to NA values in the column measurement_sysRR.
TRY TO GO THROUGH EACH STEP YOURSELF to get a grip on how each step of Solution/Algorithm I was represented in R.
The below solution/algorithm takes advantage of a simple circumstance: some ID values will only appear once after we simply delete all rows with NAs in the column measurement_sysRR. However, it takes some more lines to represent it via basic functions. The only new function we will introduce is the unique() function. The unique function evaluates which individual values are given in an input vector; e.g., unique(c("A", "B", "B", "C")) will have the output [1] "A" "B" "C".
Solution/Algorithm II:
This solution takes advantage of redundancies within the above table, e.g., the fact that the patient ID appears twice, related to each t1 and t2:
a) Delete all lines related to NAs in the column measurement_sysRR.
b) Evaluate which individual patient IDs only appear once.
c) Delete all rows of patient IDs that only occur once.
#### POSSIBLE SOLUTION II:
# Filter NA in measurement_sysRRalt
new_table_alt = filter(new_table, is.na(measurement_sysRRalt) == FALSE)
# Use circumstance that patient_id occurs only 1x in such cases:
# Function unique() outputs a vector with individual entries:
unique_id = unique(new_table_alt$patient_id)
# [1] "1" "2" "3" "4" "5" "6" "7"
# Initialize vector:
num_of_id = c()
for(i in 1:length(unique(new_table_alt$patient_id))){
num_of_id[i] = length(which(new_table_alt$patient_id == unique_id[i]))
} # End for i
num_of_id
# [1] 2 1 2 1 2 2 2
# Create table:
table_num_of_id = as.data.frame(cbind(unique_id,num_of_id))
colnames(table_num_of_id) = c("unique_id", "num_of_id")
table_num_of_id = as.data.frame(table_num_of_id)
# Filter all IDs that only occur once:
table_single_id = filter(table_num_of_id, num_of_id == "1")
table_single_id
fin_table_alt = new_table
# The same as before: filter single id from new_table:
for(i in 1:length(table_single_id$unique_id)){ # or length(na_lines)
fin_table_alt = filter(fin_table_alt, patient_id != table_single_id$unique_id[i])
} # End for i
fin_table_alt
STEP A: At first, every line/row in the data table that has an NA in the column measurement_sysRR is deleted. We then use the unique() function to determine the individual/unique patient IDs. In this case it is simply a series of integers 1 to 7, but patient IDs often look a lot more complicated.
STEP B: We are using a for loop to determine how many times each individual patient ID occurs in the data table that was filtered for NAs in the column measurement_sysRR.
Let us take a look at the first loop:
for(i in 1:length(unique(new_table_alt$patient_id))){
num_of_id[i] = length(which(new_table_alt$patient_id == unique_id[i]))
} # End for i
The first line defines the range as the length of the object unique_id. Here I didn’t write 1:length(unique_id), but just evaluated the unique values again within the function length() (for no particular reason, that is just how I did it; I could clean up the script, but since it does not change the results, it is also ok to leave things as they are). The next line needs a little more understanding: we already know what the length() function does, and we have encountered an instance where we used the which() function as well before. Let us start with the which() function and let us look at the output:
which(new_table_alt$patient_id == unique_id[1])
# [1] 1 2
The output lists those lines/rows in the column patient_id whose entry is equal (denoted ==) to the first value in our list of unique ID values. The first value in unique_id is 1. The output of the which() function now tells us that the ID 1 can be found in lines/rows 1 and 2. Since we are not interested in the specific locations but in the number of matches, we can wrap all of the above into the length() function. Adding the length() function for the above case of ID no. 1, the output would be:
length(which(new_table_alt$patient_id == unique_id[1]))
# [1] 2
The output now says: the ID value no. 1 appears twice in the column patient_id. Great!! The for loop now does the above for every individual value in the vector unique_id, and we get a list of frequencies, which we call num_of_id. Next we will relate both of the latter vectors in a table:
# Create table:
table_num_of_id = as.data.frame(cbind(unique_id,num_of_id))
colnames(table_num_of_id) = c("unique_id", "num_of_id")
table_num_of_id = as.data.frame(table_num_of_id)
Next up we can filter out all the IDs that have a frequency of 1 in our table table_num_of_id. The column unique_id of the resulting table table_single_id now only shows those IDs that appear just once! We can use this vector to filter the original data table, such that only those IDs remain that have an entry for both t1 and t2. For the latter filtering part another for loop was used, which is special in one way: different to before, we didn't add an index behind the object name on the left side for fin_table_alt:
fin_table_alt = filter(fin_table_alt, patient_id != table_single_id$unique_id[i])
If you leave out an index behind the object name within the loop, the object will be overwritten every time the loop takes a step. In this case that is totally fine! What the for loop essentially does "by hand" is this: the goal is to filter the table such that it only keeps those values that are not in table_single_id$unique_id (a kind of negative filtering via !=). The first value in our list of IDs we want to get rid of is the ID no. 2, so we filter the table such that only those ID values are kept that do not match the no. 2. The table is now overwritten with the result of that step, and we can enter the next step of the loop. The second and last value in our list of IDs we want to cancel is the no. 4… we filter it… the table is again overwritten with the new result, and since we have come to the end of the loop, we are done!
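By the way, the whole loop can also be replaced by a single filter() call (a small sketch, not part of the original script; the object name fin_table_alt2 is just a hypothetical placeholder), using the %in% operator to check whether an ID is contained in the list of single IDs:
# Hypothetical one-line alternative to the loop via %in%:
fin_table_alt2 = filter(new_table, !(patient_id %in% table_single_id$unique_id))
# fin_table_alt2 should match fin_table_alt from above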
At last we can check whether both of the above solutions/algorithms yield the same result! The code below compares only the column patient_id, and the entries match, whoop!!
# Check for equality of both methods:
fin_table$patient_id == fin_table_alt$patient_id
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
We hope this chapter has given you a detailed and flexible insight into what can be done in R and what is meant by “representing formulated algorithms”.
In the next chapter we are going to introduce functions from the package tidyverse, which is often used in data science and offers very short code solutions.
The R package tidyverse offers a wide range of functions that can be used for data cleaning purposes and more. In this first R basics tutorial, we are going to introduce some basics to get a first look at how it is used.
At first we are going to introduce the so-called pipe operator %>%. There is also a similar operator, |>, available in base R since version 4.1, which you may encounter in your own research. The pipe operator essentially reads as "and then". It chains several operations and executes them at once, similar to executing operation by operation line by line, but it skips the updates in between the steps (e.g., putting each intermediate output into the environment) and only returns the final output (it does a speedrun and doesn't save in between, roughly speaking). This can make the code run a little faster, though the difference is usually not noticeable; the main benefit is convenience. The function of the "+" symbol used with ggplot2 functions is similar to the function of a pipe.
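To get a feel for how a pipe reads, here is a minimal sketch (the example values are made up; the %>% variant requires the tidyverse to be loaded, the |> variant requires R >= 4.1):
# "Take x, and then sum it, and then take the square root":
x = c(1, 4, 9)
sqrt(sum(x))           # nested base R style
x %>% sum() %>% sqrt() # tidyverse/magrittr pipe
x |> sum() |> sqrt()   # native base R pipe
# [1] 3.741657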
Below you will see an example of how tidyverse can be used. The first line starts with defining a name for the final object, here test1, and states that we want to adjust our table called new_table (the synthetic data set from Example 2 as initial reference table). Then we see the first pipe, stating "and then". The next line entails the select() function, which works similar to the cbind() function. The only difference here is that we already referenced our data (new_table), so we do not need to reference it each time via the $ operator! We can directly select every column we want / need for the "cleaning" part. Next we filter the data set, such that only rows are kept that don't have the entry t1 in the column "time".
#### Pipe operator %>% using tidyverse:
# install.packages("tidyverse")
library("tidyverse")
# Test 1: filter for t2 and delete rows with NA
test1 = new_table %>%
select(patient_id,time,measurement_sysRRalt) %>% # select columns like cbind()
filter(time != "t1") # Filter for t1
# alternative:
# na.omit() # be careful with this function
# and no %>% at the end, otherwise "+" in console!
test1
# patient_id time measurement_sysRRalt
# 1 1 t2 122
# 3 3 t2 121
# 4 4 t2 125
# 5 5 t2 119
# 6 6 t2 127
# 7 7 t2 125
Note that executing a line of piped commands, and in general any code that spans several lines (delimited by brackets), can be done without marking everything you want to execute. It is enough when the cursor is blinking in the first line before executing via ALT+ENTER!
Next we are going to look into "Solution/Algorithm II" via tidyverse.
The only new function that is introduced is the so-called group_by() function. Together with filter(n() > 1), this function essentially avoids having to write a loop that goes over every patient ID and checks if it occurs more than once. The function n() has no input: it simply returns the number of rows of each group that group_by() formed in the previous line of the pipe!
### POSSIBLE SOLUTION II Simple dplyr/tidyverse alternative:
new_table_alt %>% # new_table_alt was filtered for NAs already
select(patient_id,time,measurement_sysRRalt)%>%
group_by(patient_id) %>% # sorts individ. entries into indiv. groups
# when executing next line!
filter(n()>1) # filters if n of each grouped entry is greater
# than 1 such that all single entries get filtered
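As a quick sanity check of the grouping logic (a small base R sketch, not part of the original script), the function table() counts how often each ID occurs, which is exactly what group_by() plus n() evaluate above:
# Count occurrences per patient_id with base R:
table(new_table_alt$patient_id)
# IDs with a count of 2 are exactly those kept by filter(n() > 1).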
As a last remark, we want to briefly introduce the na.omit() function, since it can lead to errors you wouldn't want, i.e., the function can be a little dangerous when not fully understood.
Basically, the function na.omit() deletes all rows that entail an NA. However, it doesn't do so under a specific condition such as "NAs in a specific column"; it considers any column. See the example below.
# Example na.omit()
test_na_omit = cbind(c(1,NA,3,4), c(1,2,NA,4))
# [,1] [,2]
# [1,] 1 1
# [2,] NA 2
# [3,] 3 NA
# [4,] 4 4
na.omit(test_na_omit)
# [,1] [,2]
# [1,] 1 1
# [2,] 4 4
# attr(,"na.action")
# [1] 2 3
# attr(,"class")
# [1] "omit"
The message is: when you want to use na.omit(), make sure to only select those columns where the above behavior doesn't make a difference. You wouldn't want to discard measurements just because the patient_id was missing in the respective lines. In such a case, try to remember how this was possible if you acquired the data yourself, or contact those responsible for the documentation of the data. It could be that there was a mistake in the export of the data, etc. In any case, make sure you really know what a function does, e.g., check via a small synthetic data set…
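A safer alternative (a small sketch, assuming the tables from above): restrict the NA-dropping to a specific column, e.g., via the function drop_na() from the package tidyr (part of the tidyverse):
# Only drops rows with NA in the one column we actually care about:
new_table %>% drop_na(measurement_sysRRalt)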
As a last brief non-numeric data cleaning example, we will look into ways to adjust columns with character string entries (essentially text, or symbols in general).
The first case we are going to look at is the column "fam" from our previous synthetic data sets. The column fam is supposed to contain the information whether a family anamnesis was performed. We can already tell that the entries must have been made by hand, since there are a few typos.
We can use the unique() function again to see which individual entries we have in the column "fam". Via the which() function we can evaluate in which lines we will find entries that wrongly say "ys" instead of "yes".
###### Character string manipulation:
unique(fin_table_alt$fam)
# [1] "yes" NA "no" "ys"
which(fin_table_alt$fam =="ys")
# [1] 7 8
Next we want to correct the respective entries. Below we inserted the which() function into the square brackets behind the object name itself. This works very well, since it corrects all instances where "ys" is given into "yes" at once! Below we also exchange NA with "not specified"…
# Correct entries e.g. via: the below is a composition of the functions
# fin_table_alt$fam[] and which() and is.na()
fin_table_alt$fam[which(is.na(fin_table_alt$fam) == TRUE)] = "not specified"
fin_table_alt$fam[which(fin_table_alt$fam == "ys")] = "yes"
unique(fin_table_alt$fam)
# [1] "yes" "not specified" "no"
Another brief case we are going to look at is targeting certain parts of character string entries. Below we will look at how we can separate the name and the surname, given that they are in the same column. For this we will use the word() function (from the package stringr, which is loaded with the tidyverse):
# Get first name only!
name = c("Name Surname","Name Surname","Name Surname")
surname = c("","","")
names = cbind(name,surname)
# name surname
# [1,] "Name Surname" ""
# [2,] "Name Surname" ""
# [3,] "Name Surname" ""
names = as.data.frame(names)
first_name = word(names$name,1) # the 1 stands for the first word of
                                # the character string; the spacing
                                # acts as the delimiter, so to speak
first_name
# [1] "Name" "Name" "Name"
This chapter offers you a few examples of how to write a function yourself. This can come in handy in several ways, but in general it is good to see at least once how it is done, in order to get a stable understanding of what a function is in general.
In our series on inferential statistics we once showed a way to write a function that computes Bayes' rule / CP in R. Below you will find two versions.
The first version only relies on the prior and the likelihood; the model evidence is calculated via the sum rule, i.e., by summing over the joint probability (see inferential statistics I or information theory I for an explanation of Bayes' rule and CP in detail).
The first line of the below function can be read as follows: "Bayes_Machine" is the name of a function (= function()) with the input parameters prior and likelihood, and within that function some math is done with these input variables. Within the function all steps of Bayes' rule are performed, the results are put into a small matrix table and then returned to the console via print(). The output layout is nothing too fancy, but in case you want to learn more about nice looking output, have a look at how we replicated the whole output of the summary(lm(y~x)) function with a more sophisticated layout (see inferential statistics II and further parts of that series).
# First simple version of our Bayes_Machine() function:
Bayes_Machine = function (prior,likelihood) {
joint = prior*likelihood
# na.rm = TRUE in sum() deletes 0 rows if given
modelevidence = sum(joint, na.rm = TRUE)
posterior = joint/modelevidence
# Needed for console output and adds text to it
# using a matrix, which works similar as c()
postprint = matrix(c("Posterior",posterior,
"Joint probability", joint,
"Model evidence", modelevidence))
print(postprint)
} # end of function
# Give it a try with defined prior and likelihood:
prior = c(.5,.5)
likelihood = c(.5,.5)
Bayes_Machine(prior,likelihood)
# Try these inputs and contemplate the results:
prior = c(.1,.9)
likelihood = c(.9,.1)
Bayes_Machine(prior,likelihood)
The second version can also handle the model evidence / marginal likelihood as input, instead of calculating it via summing the joint probability. You will find this kind of application, e.g., in the context of diagnostic testing (see inferential statistics VI for more information and examples).
The below function also entails if and else operators and the function missing(). The latter outputs TRUE if the input modelevidence is missing; in that case the function proceeds in the fashion of our first version. For the case that the model evidence is not missing but given (else case), a slightly different math is applied.
# Here is the extended function that can also handle
# the model evidence as input:
Bayes_Machine = function (prior,likelihood,modelevidence) {
if (missing(modelevidence)){
joint = prior*likelihood
# na.rm = TRUE in sum() deletes 0 rows if given
modelevidence = sum(joint, na.rm = TRUE)
posterior = joint/modelevidence
# Needed for console output
postprint = matrix(c("Posterior",posterior,
"Joint probability", joint,
"Model evidence", modelevidence))
print(postprint)
}
else {
joint = prior*likelihood
posterior = joint/modelevidence
postprint = matrix(c("Posterior",posterior,
"Joint probability", joint,
"Model evidence", modelevidence))
print(postprint)
}
} # End of function
Bayes_Machine(likelihood = likelihood, prior = prior)
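For completeness, here is a brief example of the else branch, i.e., calling the function with an explicitly given model evidence (the value .18 is simply the sum of the joint probabilities of the example above, so the result is the same):
# Same example, but with the model evidence passed as input:
Bayes_Machine(prior = c(.1,.9), likelihood = c(.9,.1), modelevidence = .18)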
Our second exemplary function will be a sum function. It is built rather simply, and the first line is read: sum_alt is the name of a function with the input x… We called the function sum_alt(), since there already is a sum() function and we don't want to overwrite it (we want to compare our function with the built-in one).
At the beginning an object with a single 0 is pre-defined, called "result". This object is constantly updated in the for loop, which has a recursive structure to it: the assignment result = result + x[[i]] re-enters its own previous output into the calculation at the next step of the loop.
In a nutshell, the for loop goes over every entry of a vector and successively adds them together. The result is then returned to the console.
# Exemplary code for a sum function (do not name it "sum",
# since it conflicts with the integrated sum() function)
sum_alt = function(x){ # Start of function
result = 0 # initialize object for recursive addition;
for(i in 1:length(x)){ # successively add the values of a vector;
result = result + x[[i]] # recursive addition; i is element of a set
} # End for i # I = {1 to length(vec)}; n = length(vec)
return(result) # return result in the console
} # End of function
# Example:
sum_alt(c(1,2,3))
# Test for equal results:
sum(c(1,2,3)) == sum_alt(c(1,2,3))
# [1] TRUE
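Note that, strictly speaking, sum_alt() updates an accumulator within a loop; a genuinely recursive function calls itself by name. Here is a small sketch of such a version (not part of the original script):
# A truly recursive sum: the function calls itself on the rest of x:
sum_rec = function(x){
  if (length(x) == 0) return(0) # base case: an empty vector sums to 0
  x[[1]] + sum_rec(x[-1])       # first element + sum of the remaining
} # End of function
sum_rec(c(1,2,3)) == sum(c(1,2,3))
# [1] TRUE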
The function below replicates most of the output of the summary(lm()) function (incl. the two-tailed t-test for linear regression). This was done as part of our series inferential statistics II to V.
The math part is straightforward and thoroughly explained in the respective tutorials. Here we will only briefly discuss the concatenate function, i.e., the cat() function. It can be used to produce nice looking output in the console. For the layout, "\n" and "\t" can be used as a line separator and a tab separator respectively!
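As a tiny sketch of these two separators before we get to the big function:
# "\n" breaks the line, "\t" inserts a tab stop:
cat("alpha", "\t", 1.23, "\n", "beta", "\t", 0.45, "\n")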
# Go-go-gadgeto linear_least_square!!!!
linear_least_square = function(indep,dep){ # Start of function
# Evaluating coefficients for an optimal linear model
# given a set of dependent and independent variables:
beta = cov(indep,dep)/var(indep)
alpha = mean(dep)-beta*mean(indep)
fx = alpha+beta*indep
# Sum of Squared Errors/Residuals:
SumR2 = sum((dep-fx)^2)
# Residual Standard Error/Deviation:
residual_standard_error = sqrt(SumR2/(length(indep)-2))
# Standard error of the slope for t-distribution:
se_denom = sqrt(sum((indep - mean(indep))^2))
se_slope_t = residual_standard_error/se_denom
# Standard error of the intercept for t-distribution:
se_a = sqrt(SumR2/(length(indep)-2))*sqrt((1/length(indep))+((mean(indep)^2)/sum((indep-mean(indep))^2)))
# t-value of the slope:
t_value_b = beta/(se_slope_t+exp(-32))
# t-value of the intercept:
t_value_a = alpha/(se_a+exp(-32))
### p-value of the slope via integrating the PDF of a t-distribution
# up to the t-value calculated above:
t_distr = function(x,df){
t_1 = gamma((df+1)/2)/(sqrt(df*pi)*gamma(df/2))
t_2 = (1 + (x^2/df))^(-(df+1)/2)
t_distr = t_1*t_2
return(t_distr)
} # End of function
# Two-Tail P(t=T|H_0): double the tail probability beyond |t|
# (using abs() makes this correct for positive and negative t-values):
p_b_2t = 2*(1-integrate(t_distr, df = length(indep)-2, lower = -Inf, upper = abs(t_value_b))$value)
### p-value of the intercept:
# Two-Tail P(t=T|H_0):
p_a_2t = 2*(1-(integrate(t_distr, df = length(indep)-2, lower = -Inf, upper = abs(t_value_a))$value))
# Results for two tail
Results_a = c(round(alpha,4), round(se_a,4), round(t_value_a,4), p_a_2t)
Results_b = c(round(beta,4), round(se_slope_t,4), round(t_value_b,4), p_b_2t)
Res_data = as.data.frame(rbind(Results_a,Results_b))
colnames(Res_data) = c("Estimate","Std. Error","t value","Pr(>|t|)")
rownames(Res_data) = c("Intercept", "Reg. coeff.")
# Nice output using cat() function:
cat(" Linear least square method in R","\n","\n",
"Independent variable:", "\t", deparse(substitute(indep)),"\n",
"Dependent variable:", "\t", deparse(substitute(dep)),"\n","\n",
"alpha", "\t",alpha,"\n",
"beta","\t",beta, "\t","SumR2", "\t", SumR2, "\n","\n")
print(Res_data)
cat("\n","Residual Standard Error:",round(residual_standard_error),
"on", (length(indep)-2), "degrees of freedom","\n","\n")
# Let us also plot our results:
# We will also use deparse(substitute(x)) for the
# labels of our plot axes.
plot(x=indep,y=dep, ylab = deparse(substitute(dep)),
xlab = deparse(substitute(indep)))
abline(a=alpha, b=beta, col = "darkblue")
} # End of function
# Test:
linear_least_square(indep = c(0:10), dep = c(0:10)*3)
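As a rough cross-check (just a suggestion, not part of the original script), the built-in function can be called on the same toy data; note that this perfectly linear example has residuals of about 0, so summary.lm() will warn about an essentially perfect fit:
# Compare with the built-in version:
# summary(lm(c(0:10)*3 ~ c(0:10)))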
At last, here are two famous and fun functions to give another example of why programming languages such as R are the ideal tool for exploring mathematics / physics. One function produces a logistic map and the other the famous Mandelbrot set.
Logistic Map taken from this article by Markus Gesmann [1].
logistic.map <- function(r, x, N, M){
## r: bifurcation parameter
## x: initial value
## N: number of iterations
## M: number of iteration points to be returned
z <- 1:N
z[1] <- x
for(i in c(1:(N-1))){
z[i+1] <- r *z[i] * (1 - z[i])
}
## Return the last M iterations
z[c((N-M):N)]
}
# Set scanning range for bifurcation parameter r (calculation
# may take a while)
my.r <- seq(.9, 4, by=0.003) # alternatively start at 2.5 for a more vivid plot
system.time(Orbit <- sapply(my.r, logistic.map, x=0.1, N=1000, M=300))
Orbit <- as.vector(Orbit)
r <- sort(rep(my.r, 301))
plot(Orbit ~ r, pch=".") ## Execute up to here to plot the logistic map
We also recommend this video by Numberphile on the topic! Below you will find some code and notes I took while watching the video, in order to show how easy it is to reproduce results from mathematics videos.
# We are following this great explanation by the youtube channel
# Numberphile:
# https://www.youtube.com/watch?v=ETrYE4MdoLQ
# Feigenbaum constant = 4.669... (presumably a transcendental number, like pi or e...)
# Logistic map: simplified without K
# CORRESPONDING TO THE NUMBERPHILE VIDEO:
# Let's say we are talking about a function for a population, a population
# of rabbits.
# Population we start with = x1 = .5 (as in a range of 0-1)
# x_{n+1} = LAMBDA * x_n * (1 - x_n)
# next year's population (x_{n+1}) = fertility (LAMBDA) times the
# existing/previous population (x_n, the "LIFE" part) times the part
# of the population that died (1 - x_n, the "DEATH" part)
# LAMBDA has to be between 0 and 4 (the reasons are eventually
# complicated, says the tutorial, so no explanation here,
# but we will see an intuitive reason soon!)
# So let us say LAMBDA is 2.3, x = .5, and the death rate is therefore
# 1-.5...
# YEAR 2, as x1 = year 1
xnPLUS1 = 2.3*.5*(1-.5)
# Result on the upper right as a value or for the console:
xnPLUS1
# 0.575 => population has increased!
# So what about the following year?
# Year 3
xnPLUS2 = 2.3*.575*(1-.575)
xnPLUS2
# 0.5620625 => population dropped!
xnPLUS3 = 2.3*.5620625*(1-.5620625)
xnPLUS3
# 0.566141 => population increases again!!
xnPLUS4 = 2.3*0.566141*(1-0.566141)
xnPLUS4
# 0.5649383 => AWWWW!!! Slightly decreased again!
# We will go on, until something special happens:
xnPLUS5 = 2.3*xnPLUS4*(1-xnPLUS4)
xnPLUS5
# 0.5653009
xnPLUS6 = 2.3*xnPLUS5*(1-xnPLUS5)
xnPLUS6
# 0.5651923
xnPLUS7 = 2.3*xnPLUS6*(1-xnPLUS6)
xnPLUS7
# 0.5652249
xnPLUS8 = 2.3*xnPLUS7*(1-xnPLUS7)
xnPLUS8
# 0.5652151
xnPLUS9 = 2.3*xnPLUS8*(1-xnPLUS8)
xnPLUS9
# 0.5652181
xnPLUS10 = 2.3*xnPLUS9*(1-xnPLUS9)
xnPLUS10
# 0.5652172
# DID YOU SEE IT?
# It starts stagnating!!! :O So the population starts to equalize,
# which is called a fixed point of the iteration.
# NOW let's pick a LAMBDA of .65 this time! Before it was 2.3 ...
xnPLUS1 = .65*.5*(1-.5)
xnPLUS1
# Result is .1625, so the population heavily decreased!!!!
xnPLUS2 = .65*xnPLUS1*(1-xnPLUS1)
xnPLUS2
# .08846094 decreasing more and more!
xnPLUS3 = .65*xnPLUS2*(1-xnPLUS2)
xnPLUS3
# .05241314 => further decrease!
# If we went on with this for a couple of years, around 15,
# the population would die out.
# So eventually, if one plays around with LAMBDA (remember: between 0 and 4),
# at some values of LAMBDA funky things will happen!
xnPLUS1 = 3.2*.5*(1-.5)
xnPLUS1
# .8, so increase
xnPLUS2 = 3.2*xnPLUS1*(1-xnPLUS1)
xnPLUS2
# .512, so "competintion kicks in", as they put it in the tutorial
xnPLUS3 = 3.2*xnPLUS2*(1-xnPLUS2)
xnPLUS3
# 0.7995392
xnPLUS4 = 3.2*xnPLUS3*(1-xnPLUS3)
xnPLUS4
# 0.5128841
xnPLUS5 = 3.2*xnPLUS4*(1-xnPLUS4)
xnPLUS5
# 0.7994688
xnPLUS6 = 3.2*xnPLUS5*(1-xnPLUS5)
xnPLUS6
# 0.513019
xnPLUS7 = 3.2*xnPLUS6*(1-xnPLUS6)
xnPLUS7
# 0.7994576
# AS YOU CAN SEE IT STARTS TO BOUNCE and balance between two values :O
# so we get a MULTIPLE FIXED POINT!!!
# If you do the same thing with LAMBDA being 3.5, then you will get a
# cycle of 4!! When you go roughly above 3.59 for LAMBDA, the
# behaviour gets chaotic!! The closer you get to 3.59, the faster
# new cycles appear, and the number of cycles actually doubles each time!
# So Feigenbaum got involved in the 70s and wanted to know how
# far LAMBDA has to change for each doubling to happen.
# When he was looking at the ratio of these changes, i.e., of the
# intervals of LAMBDA in which a cycle stays stable, he figured out a
# constant, which is now called the Feigenbaum constant ..... 4.669...
# SO WHAT THE NUMBER SAYS is that each interval of LAMBDA in which a
# cycle remains stable (before doubling again) is 4.669 times smaller
# than the previous interval!!
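A small addition to these notes (not from the video): the fixed point observed above for LAMBDA = 2.3 can also be derived analytically. Setting x = LAMBDA * x * (1 - x) and solving for x gives x* = 1 - 1/LAMBDA, which we can quickly check:
# Analytic fixed point for LAMBDA = 2.3:
1 - 1/2.3
# [1] 0.5652174 => matches the value the iteration converged to!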
Below is a function that plots the so-called Mandelbrot set. Benoît Mandelbrot was working for IBM (which he joined in the late 1950s) when he discovered that noise has a fractal structure, meaning that no matter whether you look at the signal over an hour or over a minute, the structure of the signal turns out to be the same. The so-called Mandelbrot set is closely related to the logistic map (it is equivalent to the logistic map, just using complex numbers; also note that a logistic map showing saturation technically results in a logistic function).
The code below can be found here at dandelbrot.com. For further information on the mathematics, see the above Numberphile video on the logistic map, as well as this video on the Mandelbrot set by the same account.
mandelbrot_generator <- function(
p = 2,
q = 1,
xmin = -2.1, # minimum x value
xmax = 0.8, # maximum x value
nx = 500,
ymin = -1.3, # minimum y value
ymax = 1.3, # maximum y value
ny = 500,
n = 100,
showplot = TRUE, # If TRUE then display image,
showvals = FALSE, # Turn values off/on
cols = colorRampPalette(c("black","cyan","cyan3","black"))(11))
{
# variables
x <- seq(xmin, xmax, length.out=nx)
y <- seq(ymin, ymax, length.out=ny)
c <- outer(x,y*1i,FUN="+")
z <- matrix(0.0, nrow=length(x), ncol=length(y))
k <- matrix(0.0, nrow=length(x), ncol=length(y))
for (rep in 1:n) {
index <- which(Mod(z) < 2)
z[index] <- z[index]^p + c[index]*q
k[index] <- k[index] + 1
}
if (showplot==TRUE) { image(x,y,k,col=cols, xlab="Re(c)", ylab="Im(c)")}
if (showvals==TRUE) {return(k)}
}
mandelbrot_generator(p=2, q=1)