A very brief introduction
September 10, 2024
R (Version 4.4+)
RStudio (Version 4.4+)
R is the programming language
RStudio is software to help you efficiently write R code
if/else, for, and the *apply family of functions
Code formatted like this represents R code you type into the editor or console, for example: “you can use the mean() function to find the average of a vector of numbers. Sometimes median() is a better summary, though.”
Lines beginning with [1] represent the output, or result, of running the code in the code chunk
Features of RStudio include:
Auto-complete code, format code, and highlight syntax
Organize your code, output, plots, and objects into one window
View data and objects in readable and searchable format
Seamless creation of RMarkdown documents
RStudio can also…
Interact with git and github
Run interactive tutorials
Help you write in other languages like python, javascript, C++, and more!
There are several ways to run R code in RStudio:
You can also write comments (text that R does not run as code) by starting the text with a “#”.
And, if you put four dashes at the end of a comment line, the line automatically becomes a heading. You can see these headings in the outline bar.
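As a small illustration of comments and section headings, here is a sketch of what a script might look like. The heading text and the object name x are made up for this example.

```r
# Load data ----
# The four trailing dashes above turn that comment into a section heading
# that shows up in the RStudio outline.

# This is an ordinary comment: R ignores everything after the "#".
x <- 2 + 2  # comments can also follow code on the same line
x
```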
If you try to run incomplete code (e.g. leave off a parenthesis), R might show a + sign prompting you to finish the command:
Finish the command by typing ) and hitting Enter, or hit Esc to cancel code execution
There are several arithmetic operators in R that are extremely useful:
+ addition, - subtraction, * multiplication, / division
^ (less commonly **) exponentiation
%*% matrix multiplication
%/% integer division (9 %/% 2 is 4)
%% modulo (returns the remainder of the left number divided by the right)
A handy use of modulo is checking whether a number is even or odd with number %% 2. If the return value is 0 it is even, otherwise it is odd.
Now, in your blank R document in the editor, type the line sqrt(25) and either click Run or hit Ctrl + Enter or ⌘ + Enter.
sqrt() is an example of a function in R.
If we didn’t have a good guess as to what sqrt() will do, we can type ?sqrt in the console (help(sqrt) will also work) and look at the Help panel on the bottom right.
Arguments are the inputs to a function. In this case, the only argument to sqrt() is x, which can be a number or a vector of numbers.
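To make the arithmetic operators and sqrt() concrete, here is a short sketch you could run line by line. The object names (is_even, m, roots) are chosen only for this illustration.

```r
7 + 2      # addition: 9
2 ^ 3      # exponentiation: 8
9 %/% 2    # integer division: 4
9 %% 2     # modulo: 1 (the remainder)

# Parity check with modulo: a remainder of 0 means the number is even
is_even <- 10 %% 2 == 0

# Matrix multiplication of a 2 x 2 matrix with itself
m <- matrix(1:4, nrow = 2)
m %*% m

# sqrt() accepts a single number or a whole vector of numbers
sqrt(25)
roots <- sqrt(c(25, 16, 9))
roots
```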
To create an object, you can use:
<- and <<- leftwards assignment
-> and ->> rightwards assignment
= assignment
99.9% of the time you will use <-:
You should never use = for object assignment. Depending on the context, = has different meanings.
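A minimal sketch of the assignment operators in practice. The object names are invented for this example.

```r
my_number <- 42          # leftwards assignment (the usual choice)
"hello" -> my_greeting   # rightwards assignment (rarely used)

# Inside a function call, = names an argument rather than creating an object
rounded <- round(3.14159, digits = 2)

my_number
my_greeting
rounded
```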
You can call (i.e., use) an object simply by using its name:
Object names have to start with a letter (or a dot that is not followed by a number). Object names can include letters, numbers, periods, and underscores.
Good
Bad
If you name an object the same name as an existing object, it will overwrite it:
You can create a copy of the object by assigning it a new name
The copy is independent of the original (i.e., changing one does not change the other)
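The overwriting and copying behavior can be checked with a few lines. This is only a sketch, and score and score_copy are hypothetical names.

```r
score <- 10
score <- 20                    # reusing the name overwrites the old value

score_copy <- score            # copy the object under a new name
score_copy <- score_copy + 5   # modify only the copy

score        # the original is unchanged
score_copy   # the copy holds the new value
```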
Packages are bundles of R functions that help you perform some kind of task. For example, lavaan is a package used for structural equation modeling, psych is a package used for estimating inter-rater reliability, and rvest is a package used for scraping data from the web.
You only need to install a package once per R installation.
Use the console to install a package (don’t put this in your script)
To use a package in your script, use the library() function:
The only thing library() does for you is load your packages. If you don’t have the package installed, you will get an error. Most of the time this is what you want!
You can also use a function from a package without loading that package by prepending the function call with the package name + ::. For example:
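A minimal sketch of the package::function() syntax. It uses stats and utils, which ship with every R installation, so nothing needs to be installed; the object names are made up.

```r
# Call median() from the stats package explicitly
middle <- stats::median(c(3, 1, 10))
middle

# Call head() from the utils package explicitly
first_letters <- utils::head(letters, 3)
first_letters
```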
Most of the packages you will use are stored on CRAN (the Comprehensive R Archive Network), which is maintained by the folks at the R Project. By default, install.packages() will look at a CRAN mirror to download and install the requested packages.
You can install packages from other sources as well. For example, to install a package from github:
To install a package from your local computer:
A vector is a series of elements, such as numbers.
You can create a vector and store it as an object in the same way as we just did. To do this, use the function c(), which stands for “combine”
You can provide a vector as an argument for many functions (more on this later in the course)
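Here is a brief sketch of creating a vector with c() and passing it to functions. The heights values are invented for illustration.

```r
# Combine four numbers into one vector
heights <- c(150, 162, 171, 180)

mean(heights)     # a vector as a function argument
length(heights)   # number of elements
heights * 2       # arithmetic is applied to every element
```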
Numeric vectors (more on these later) are numbers, either integers or doubles (i.e., floating point numbers, decimals)
You can check directly whether something is numeric, an integer, or a double. You can also coerce a value (i.e., tell R to store it) as one of these types:
is.numeric(x), as.numeric(x)
is.integer(x), as.integer(x)
is.double(x), as.double(x)
Character vectors contain strings of characters wrapped in quotation marks. For example:
To check directly if something is of type character:
is.character(x)
You can also coerce values into characters using:
as.character(x)
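The checking and coercion functions for numeric and character vectors can be tried like this. It is a sketch with made-up vectors x and y.

```r
x <- c(1.5, 2, 3)
is.numeric(x)   # TRUE
is.integer(x)   # FALSE: numbers are stored as doubles by default
is.double(x)    # TRUE
x_int <- as.integer(x)   # coercion to integer drops the decimal part

y <- c("1", "2", "3")
is.character(y)          # TRUE
y_num <- as.numeric(y)   # character strings become numbers
x_chr <- as.character(x) # numbers become character strings
```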
Logical values are TRUE, FALSE, and NA.
Logical types must be capitalized. True is not the same as TRUE.
Logical types are also commonly represented as one uppercase letter:
TRUE, T
FALSE, F
You can check directly whether something is of logical type, and you can also coerce values into logical types:
is.logical(x), as.logical(x)
Logical operators are foundational to programming in R and allow you to compare two values to control your programming logic.
Logical operators always return a logical value (TRUE, FALSE, or NA), and are most commonly used to subset data (more on this later) and to control the flow of your code with if/else statements (more on this later as well!)
Relational Operators:
<, >, >=, <=, ==, !=
> and < return TRUE if the left side is greater than (>) or less than (<) the right side, otherwise they return FALSE
! (negation)
Also known as the “bang” operator
Converts TRUE into FALSE and FALSE into TRUE
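A short sketch of the relational operators and the bang operator. The ages vector is invented for this example.

```r
5 > 3     # TRUE
5 == 3    # FALSE
5 != 3    # TRUE
!TRUE     # FALSE: the bang operator flips a logical value

# Comparisons work element by element on vectors
ages <- c(15, 22, 30)
is_adult <- ages >= 21
is_adult

# Comparing against a missing value returns NA
NA > 1
```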
Missing values in R are represented as NA (without quotes)
Even one NA poisons the well. Your calculations will return NA unless you handle missing values properly:
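One common way to handle missing values is the na.rm argument that many summary functions accept. A minimal sketch, with a made-up scores vector:

```r
scores <- c(90, NA, 80)

mean(scores)                 # NA: the missing value poisons the result
mean(scores, na.rm = TRUE)   # drop NAs before averaging

is.na(scores)        # which elements are missing?
sum(is.na(scores))   # how many are missing?
```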
Matrices are basically two-dimensional vectors with rows and columns and are made with the matrix() function
     [,1] [,2] [,3]
[1,] "A"  "C"  "E"
[2,] "B"  "D"  "F"
You can also make matrices by binding vectors together with rbind() (row bind) and cbind() (column bind). For instance,
Matrices are subset with [rows, columns]
     [,1] [,2] [,3]
[1,] "a"  "b"  "c"
[2,] "d"  "e"  "f"
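The two matrices printed above could be produced along the following lines. This is a reconstruction under the assumption that the first came from matrix() and the second from rbind(); the names m1, m2, and m3 are hypothetical.

```r
# matrix() fills column by column by default
m1 <- matrix(LETTERS[1:6], nrow = 2)

# rbind() stacks vectors as rows; cbind() places them side by side as columns
m2 <- rbind(c("a", "b", "c"), c("d", "e", "f"))
m3 <- cbind(1:3, 4:6)

# Subset with [rows, columns]; leave one side blank to keep everything
m1[2, 3]   # row 2, column 3
m1[1, ]    # all of row 1
m2[, 2]    # all of column 2
dim(m3)    # number of rows, then number of columns
```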
lists
Lists are objects that can store multiple types of data and are created with list()
my_list <- list("groceries" = c("Soy Sauce", "Rice", "Tofu"),
                "numbers" = 1:7,
                "mydata" = matrix(8:11, nrow = 2),
                "linearmod" = lm(mpg ~ disp, data = mtcars))
print(my_list)
$groceries
[1] "Soy Sauce" "Rice" "Tofu"
$numbers
[1] 1 2 3 4 5 6 7
$mydata
[,1] [,2]
[1,] 8 10
[2,] 9 11
$linearmod
Call:
lm(formula = mpg ~ disp, data = mtcars)
Coefficients:
(Intercept) disp
29.59985 -0.04122
You can access a list element by its name or number in [[ ]] (note the double square brackets) or $ followed by its name:
[ ] versus [[ ]]
[x] chooses elements but keeps the list while [[x]] extracts the element from the list
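The difference between [ ] and [[ ]] is easiest to see by checking the class of what comes back. The sketch below uses a shortened, two-element version of my_list under the hypothetical name small_list.

```r
small_list <- list(groceries = c("Soy Sauce", "Rice", "Tofu"),
                   numbers = 1:7)

small_list[["groceries"]]   # extract by name
small_list[[2]]             # extract by position
small_list$numbers          # extract by name with $

kept <- small_list["numbers"]        # single brackets keep the list wrapper
pulled <- small_list[["numbers"]]    # double brackets return the element itself

class(kept)     # "list"
class(pulled)   # "integer"
```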
library(datasets)
data(mtcars)
my_regression <- lm(mpg ~ hp + wt + cyl, data = mtcars)
str(my_regression, list.len = 7)
List of 12
$ coefficients : Named num [1:4] 38.752 -0.018 -3.167 -0.942
..- attr(*, "names")= chr [1:4] "(Intercept)" "hp" "wt" "cyl"
$ residuals : Named num [1:32] -1.82 -1.013 -3.16 0.464 1.532 ...
..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
$ effects : Named num [1:32] -113.65 -26.05 -15.89 4.29 1.61 ...
..- attr(*, "names")= chr [1:32] "(Intercept)" "hp" "wt" "cyl" ...
$ rank : int 4
$ fitted.values: Named num [1:32] 22.8 22 26 20.9 17.2 ...
..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
$ assign : int [1:4] 0 1 2 3
$ qr :List of 5
..$ qr : num [1:32, 1:4] -5.657 0.177 0.177 0.177 0.177 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
.. .. ..$ : chr [1:4] "(Intercept)" "hp" "wt" "cyl"
.. ..- attr(*, "assign")= int [1:4] 0 1 2 3
..$ qraux: num [1:4] 1.18 1.08 1.09 1.03
..$ pivot: int [1:4] 1 2 3 4
..$ tol : num 1e-07
..$ rank : int 4
..- attr(*, "class")= chr "qr"
[list output truncated]
- attr(*, "class")= chr "lm"
dataframes
Dataframes are a special type of list where all elements of the list are the same length and are bound together
Unlike matrices, dataframes can hold data of different atomic types (but each column needs to be the same type)
Dataframes are subset in the same way as matrices ([rows, columns])
You can also use the $ operator to target a single column
You can view an entire dataframe by print()ing it out in the console. However, it often just spams your console and is too large to meaningfully read.
To view the first n rows of a dataframe use head() (defaults to 6 rows):
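A small sketch of building, subsetting, and previewing a dataframe. The people dataframe is invented for illustration; mtcars ships with R.

```r
people <- data.frame(name = c("Ann", "Bo", "Cy"),
                     age = c(31, 25, 40))

people[2, "age"]            # row 2 of the age column
people$name                 # a single column with $
older <- people[people$age > 30, ]   # rows that meet a condition

first_rows <- head(mtcars)       # first 6 rows by default
first_three <- head(mtcars, 3)   # ask for a specific number of rows
```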
dplyr Basics
Your working directory is the folder on your computer where R will look for and save things by default
You can find your current working directory with getwd()
You can change your working directory using setwd()
R Markdown (.Rmd) documents are run from the folder they are saved in, and launching RStudio by opening a .R or .Rmd file typically sets your working directory to that file’s folder. When in doubt, check with getwd().
Note: Windows users need to change backslashes (\) to forward slashes (/) in file paths
In your working directory, you can (and should!) refer to files using relative paths:
. refers to your current working directory
.. refers to the folder your working directory is located in
Examples:
./data/my_data.csv refers to a file called “my_data.csv” located in the “data” subfolder of my working directory
../../figure1.png refers to a file called “figure1.png” located two folders “up” from my working directory
R has the ability to read and write data in a number of formats. Although much of this functionality is built into Base R, several packages help as well:
haven (SPSS, Stata, and SAS files)
foreign (SPSS, Stata, SAS, and other files)
readxl (MS Excel files)
googlesheets4 (communicate directly with Google Sheets)
readr (enhances base R functionality)
The most common way to read/write data in R is with a .csv file!
.csv Files
A .csv file looks something like this:
"mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
21,6,160,110,3.9,2.62,16.46,0,1,4,4
21,6,160,110,3.9,2.875,17.02,0,1,4,4
22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
17.3,8,275.8,180,3.07,3.73,17.6,0,0,3,3
15.2,8,275.8,180,3.07,3.78,18,0,0,3,3
10.4,8,472,205,2.93,5.25,17.98,0,0,3,4
10.4,8,460,215,3,5.424,17.82,0,0,3,4
14.7,8,440,230,3.23,5.345,17.42,0,0,3,4
32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1
15.5,8,318,150,2.76,3.52,16.87,0,0,3,2
15.2,8,304,150,3.15,3.435,17.3,0,0,3,2
13.3,8,350,245,3.73,3.84,15.41,0,0,3,4
19.2,8,400,175,3.08,3.845,17.05,0,0,3,2
27.3,4,79,66,4.08,1.935,18.9,1,1,4,1
26,4,120.3,91,4.43,2.14,16.7,0,1,5,2
30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
15.8,8,351,264,4.22,3.17,14.5,0,1,5,4
19.7,6,145,175,3.62,2.77,15.5,0,1,5,6
15,8,301,335,3.54,3.57,14.6,0,1,5,8
21.4,4,121,109,4.11,2.78,18.6,1,1,4,2
# my_data is a placeholder name for the object holding your data
my_data <- read.csv("data/my_data.csv")
write.csv(my_data, "data/my_data_cleaned.csv")
Alternatively you can use readr…
my_data <- readr::read_csv("data/my_data.csv")
readr::write_csv(my_data, "data/my_data_cleaned.csv")
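To try reading and writing without touching your project folders, here is a sketch that round-trips the built-in mtcars data through a temporary file. The names csv_path and cars_back are made up, and row.names = FALSE is a choice made here so the file has no extra row-name column.

```r
# A throwaway file location that R cleans up when the session ends
csv_path <- tempfile(fileext = ".csv")

# Write the dataframe first, then the path; skip the row names column
write.csv(mtcars, csv_path, row.names = FALSE)

# Read the file back into a new dataframe
cars_back <- read.csv(csv_path)

dim(cars_back)    # same number of rows and columns as mtcars
head(cars_back, 3)
```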
Control structures allow you to control the flow of your R program (script) and are critical to programming in R. There are three components of a program’s flow:
Sequential execution: code runs line by line, in order
Conditional execution: code runs only if a condition is TRUE. Like if/else
Iteration: code runs repeatedly. Like for, while, repeat, break, next
if() otherwise else
if() and else statements allow you to conditionally execute code
For example, write a program that tells a cashier whether or not they should sell alcohol to a customer. The cashier enters the customer’s birthday into their POS, which needs to display the appropriate message:
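One way the cashier program could be sketched with if/else is shown below. The function name can_sell_alcohol, its arguments, and the legal age of 21 are all assumptions made for this illustration, not part of the original slides.

```r
# Hypothetical helper; legal_age = 21 is an assumed threshold
can_sell_alcohol <- function(birthday, today = Sys.Date(), legal_age = 21) {
  # Latest birthday a customer can have and still be of legal age today
  cutoff <- seq(today, length.out = 2,
                by = paste0("-", legal_age, " years"))[2]

  if (birthday <= cutoff) {
    "Sell alcohol"
  } else {
    "Do not sell alcohol"
  }
}

# Fix "today" so the example gives the same answer whenever it is run
can_sell_alcohol(as.Date("1990-05-01"), today = as.Date("2024-09-10"))
can_sell_alcohol(as.Date("2010-01-01"), today = as.Date("2024-09-10"))
```

Passing today as an argument rather than calling Sys.Date() inside the condition keeps the function easy to test.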
ifelse() and if_else()
if/else takes one TRUE or FALSE value, but sometimes we want to evaluate multiple values at once
ifelse() is a vectorized version of if/else that can operate over vectors:
[1] "Sell alcohol" "Do not sell alcohol" "Sell alcohol"
[4] "Do not sell alcohol" "Do not sell alcohol"
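Output like the above could come from a call along these lines. The ages vector and the threshold of 21 are assumptions for this sketch, since the original code chunk is not shown.

```r
# Hypothetical customer ages
ages <- c(30, 18, 45, 20, 16)

# ifelse(test, value if TRUE, value if FALSE) checks every element at once
decisions <- ifelse(ages >= 21, "Sell alcohol", "Do not sell alcohol")
decisions
```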
Computers are really good at repeating the same task over and over, and loops are the way to accomplish it
From Wikipedia: “A loop is a sequence of statements which is specified once but which may be carried out several times in succession. The code “inside” the loop is obeyed a specified number of times, or once for each of a collection of items, or until some condition is met, or indefinitely.”
There are three types of loops in R:
for loops
while loops
repeat loops
Bad repetition: Let’s say you wanted to take the mean of all columns in the swiss dataset:
Can you spot the problems with this code?
How frustrated would you be if swiss had 200 columns instead of 6?
DRY: do not repeat yourself! If you are writing the same code over several lines, there’s probably a more efficient way to write it
WET: write every time, write everything twice, we enjoy typing, waste everyone’s time
Writing DRY code reduces the risk of making typos in your code, substantially reduces the time and effort involved in processing large volumes of data, and is more readable and easier to troubleshoot
for Loop
for loops iterate over a vector of values (any atomic type!) and execute instructions (R code) on each iteration
In English: “for each of these values, in this order, execute this set of instructions”
General structure of a for loop:
var is an index variable that holds the current value in seq (You can call this whatever you want! In most cases it is customary to call it i, but there are meaningful exceptions to this)
seq is a vector of values that you want to iterate over
expr is the R code you want to run for each iteration
for Loop: Diagram
Given a set of values:
for Loop Example
while
repeat
next
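The three loop types, along with next and break, can be sketched as follows. The for loop computes the column means of the built-in swiss dataset; the while and repeat examples, and all object names, are invented for illustration.

```r
# for: take the mean of every column in swiss without repeating code
col_means <- numeric(ncol(swiss))      # pre-allocate the result vector
names(col_means) <- names(swiss)
for (i in seq_along(swiss)) {          # i is the index variable (var)
  col_means[i] <- mean(swiss[[i]])     # the code run on each iteration (expr)
}
col_means

# while: keep looping for as long as the condition is TRUE
value <- 20
steps <- 0
while (value >= 1) {
  value <- value / 2
  steps <- steps + 1
}
steps

# repeat: loops until break is reached; next skips to the next iteration
odd_total <- 0
counter <- 0
repeat {
  counter <- counter + 1
  if (counter > 10) break         # stop the loop entirely
  if (counter %% 2 == 0) next     # skip even numbers
  odd_total <- odd_total + counter
}
odd_total
```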
Issues around preparing a dataset for the analyses you want to run:
dplyr
package in tidyverse
To demonstrate much of dplyr’s functionality, we will use the starwars data that is loaded with dplyr and originally from SWAPI (a Star Wars API)
Rows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
$ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
The dplyr package uses verbs to name the functions within. As a result, they work very nicely with the pipe (%>%) syntax
group_by()
group_by() is a special function that controls the behavior of other functions as they operate on the data
It returns a tibble with the following classes: grouped_df, tbl_df, tbl, and data.frame
For example, group_by() characters’ eye_color
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
Notice that this dataset has exactly the same data as the ungrouped version, except it now controls the output of other function calls
To remove a grouping structure use the ungroup() function (if left blank, all grouping is removed, otherwise just the specified groups are ungrouped):
group_by() example
Mean mass by character gender:
starwars %>%
  # Center mass by sample average
  mutate(mass_gmc = mass - mean(mass, na.rm = T)) %>%
  group_by(gender) %>%
  # Center mass by group average
  mutate(mass_pmc = mass - mean(mass, na.rm = T)) %>%
  select(name, gender, mass, mass_gmc, mass_pmc)
# A tibble: 87 × 5
# Groups: gender [3]
name gender mass mass_gmc mass_pmc
<chr> <chr> <dbl> <dbl> <dbl>
1 Luke Skywalker masculine 77 -20.3 -29.5
2 C-3PO masculine 75 -22.3 -31.5
3 R2-D2 masculine 32 -65.3 -74.5
4 Darth Vader masculine 136 38.7 29.5
5 Leia Organa feminine 49 -48.3 -5.69
6 Owen Lars masculine 120 22.7 13.5
7 Beru Whitesun Lars feminine 75 -22.3 20.3
8 R5-D4 masculine 32 -65.3 -74.5
9 Biggs Darklighter masculine 84 -13.3 -22.5
10 Obi-Wan Kenobi masculine 77 -20.3 -29.5
# ℹ 77 more rows
filter()
filter() is used to subset rows from a dataframe
Similar to [x, ] except that it drops NAs
.data is the data to subset on
... are the condition(s) that specify the subset
.preserve controls the grouping of the returned dataframe
filter() Example
# A tibble: 10 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Darth V… 202 136 none white yellow 41.9 male mascu…
2 Owen La… 178 120 brown, gr… light blue 52 male mascu…
3 Chewbac… 228 112 brown unknown blue 200 male mascu…
4 Jabba D… 175 1358 <NA> green-tan… orange 600 herm… mascu…
5 Jek Ton… 180 110 brown fair blue NA <NA> <NA>
6 IG-88 200 140 none metal red 15 none mascu…
7 Bossk 190 113 none green red 53 male mascu…
8 Dexter … 198 102 none brown yellow NA male mascu…
9 Grievous 216 159 none brown, wh… green, y… NA male mascu…
10 Tarfful 234 136 brown brown blue NA male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Notice that you don’t need to refer to the mass column as starwars$mass because filter() already knows the context from the .data argument.
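Output like the table above is consistent with filtering on mass greater than 100, although the original call is not shown, so treat the condition as an assumption. The sketch also contrasts filter() with base [x, ] subsetting to show how NAs are handled; the object names are hypothetical.

```r
library(dplyr)

# Keep only rows where the condition is TRUE; rows with NA mass are dropped
heavy <- filter(starwars, mass > 100)
heavy

# Several conditions separated by commas must all be TRUE
tall_and_heavy <- starwars %>% filter(mass > 100, height > 200)

# Base R [x, ] keeps a row of NAs for every missing mass
sw <- as.data.frame(starwars[, c("name", "mass")])
base_rows <- sw[sw$mass > 100, ]
nrow(base_rows) > nrow(heavy)
```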
filter() Example
This is
👇 same as
select()
select() is used to subset columns from a dataframe (similar to [, x], but variable names don’t need to be quoted or passed as a vector)
: is used to select a range of consecutive columns
! is used to negate a selection of columns
- is also used to negate a selection of columns
You can rename variables while selecting them with select():
[1] "character" "weight" "height"
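The selection helpers and the renaming shown in the output above could be written roughly like this. The exact original calls are not shown, so this is a sketch, and the object names are made up.

```r
library(dplyr)

# Pick columns by name, unquoted
picked <- starwars %>% select(name, height, mass)

# : selects a range of consecutive columns
ranged <- starwars %>% select(name:mass)

# ! negates a selection (drops these columns)
no_lists <- starwars %>% select(!c(films, vehicles, starships))

# new_name = old_name renames while selecting
renamed <- starwars %>% select(character = name, weight = mass, height)
names(renamed)
```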
mutate()
mutate()
allows you to create a new column of data or modify existing columns
mutate()
We can also convert height to meters first, then calculate BMI:
👈 multiple statements can be placed in the same mutate() call, and just like in tibble(), they build on each other dynamically
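A sketch of the height-then-BMI calculation, assuming height is recorded in centimeters and mass in kilograms. The column names height_m and bmi are invented for this example.

```r
library(dplyr)

bmi_data <- starwars %>%
  mutate(
    height_m = height / 100,     # convert centimeters to meters first
    bmi = mass / height_m^2      # then reuse the new column right away
  ) %>%
  select(name, height_m, mass, bmi)

bmi_data
```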
dplyr functions for manipulation
across()
if_else()
case_when()
lag()
lead()
arrange()
distinct()
summarize()
summarize()
summarize() returns a dataframe with specified summary statistic(s) of your data with 1+ rows for each combination of grouping variables (1 for no grouping structure) and 1 column for each summary statistic (much like tapply(), but much more flexible with cleaner output)
When we group the data before calling summarize() we will get summary statistics for each group:
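A minimal sketch of summarize() with and without grouping. The choice of summary statistics and the object names are assumptions for illustration.

```r
library(dplyr)

# No grouping: one row summarizing the whole dataset
overall <- starwars %>%
  summarize(mean_mass = mean(mass, na.rm = TRUE),
            n = n())
overall

# Grouped: one row per gender
by_gender <- starwars %>%
  group_by(gender) %>%
  summarize(mean_mass = mean(mass, na.rm = TRUE),
            n = n())
by_gender
```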
When merging datasets A and B:
Which rows do you want to keep from each dataframe?
Which columns do you want to keep from each dataframe?
Which variable(s) determine whether rows match?
To keep things simple, let’s use the following data to practice merging:
# A tibble: 7 × 3
ID x y
<int> <dbl> <dbl>
1 0 1.37 1.43
2 1 2.18 1.67
3 2 1.16 0.277
4 3 3.60 3.31
5 4 5.33 3.05
6 5 4.18 1.47
7 6 5.49 0.529
# A tibble: 7 × 3
ID group age
<int> <dbl> <int>
1 1 1 24
2 2 1 21
3 3 1 29
4 4 2 44
5 5 2 31
6 6 2 34
7 7 2 20
Notice that ID == 0 is not in B and ID == 7 is not in A
We will use the ID column to merge the data (in the by = argument)
left_join()
left_join(A, B) keeps all rows from A, all cols from A and B
dplyr
# A tibble: 7 × 5
ID x y group age
<int> <dbl> <dbl> <dbl> <int>
1 0 1.37 1.43 NA NA
2 1 2.18 1.67 1 24
3 2 1.16 0.277 1 21
4 3 3.60 3.31 1 29
5 4 5.33 3.05 2 44
6 5 4.18 1.47 2 31
7 6 5.49 0.529 2 34
☝️ ID == 7
from B
not in merged data ☝️
right_join()
inner_join()
full_join()
semi_join()
anti_join()
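The join functions can be compared side by side with the practice data. In this sketch A and B are rebuilt by hand from the rounded values printed earlier, using plain data frames rather than tibbles, which is an assumption made for this example.

```r
library(dplyr)

A <- data.frame(ID = 0:6,
                x = c(1.37, 2.18, 1.16, 3.60, 5.33, 4.18, 5.49),
                y = c(1.43, 1.67, 0.277, 3.31, 3.05, 1.47, 0.529))
B <- data.frame(ID = 1:7,
                group = c(1, 1, 1, 2, 2, 2, 2),
                age = c(24L, 21L, 29L, 44L, 31L, 34L, 20L))

left  <- left_join(A, B, by = "ID")    # all rows from A
right <- right_join(A, B, by = "ID")   # all rows from B
inner <- inner_join(A, B, by = "ID")   # only IDs found in both
full  <- full_join(A, B, by = "ID")    # every ID from either dataframe
semi  <- semi_join(A, B, by = "ID")    # rows of A with a match in B (A's columns only)
anti  <- anti_join(A, B, by = "ID")    # rows of A with no match in B

# Compare how many rows each join keeps
sapply(list(left = left, right = right, inner = inner,
            full = full, semi = semi, anti = anti), nrow)
```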
by =
by = takes a character vector of columns upon which to match. This is useful when merging nested data (e.g., data from clinic visits nested within patients)
To make your RStudio look like mine…
Tools > Global Options > Pane Layout and make top right your Console
Tools > Global Options > Appearance and select Chaos as your Editor theme
Tools > Global Options > Code > Display and select: Highlight selected line, Allow scroll past end of document, Highlight R function calls
Open up RStudio now and choose File > New File > R Script.
Then, let’s get oriented with the interface:
Top left: Code editor pane, data viewer (browse with tabs)
Top right: Console for running code (> prompt)
Bottom left: List of objects in environment, code history tab.
Bottom right: Browsing files, viewing plots, managing packages, and viewing help files.