R Programming Language

A very brief introduction

Ozlem Tuncel

Georgia State University

September 10, 2024

Research Data Services (RDS)

Workshops and more 👀

Before we begin – Why learn R?

R skills mean more money when you graduate (from O’Reilly 2021 Data/AI Salary Survey)

Before we begin – Why learn R?

Changing trends in the job market (particularly in the US)

Let’s talk about how we can learn R 🧑‍🏫

General advice

  • Workshops and courses are great starting points! ⭐
    • They provide a foundation and introduce you to essential concepts and tools.
  • Practice makes perfect. ✏️
    • The more time you spend using R, the more proficient you’ll become.
  • Embrace mistakes. 😍
    • Making mistakes is a valuable part of the learning process and helps you understand the software better.

R generally has a steeper learning curve!

And learning software is a lifelong practice (from O’Reilly 2021 Data/AI Salary Survey)

Some resources from RDS

Some resources from Ozlem

Persistence is the key!

Introduction to R

Required Software

  • R (Version 4.4+)

  • RStudio (a recent release; RStudio versions are numbered by release date, not like R versions)

  • R is the programming language

    • Comes with its own GUI on Windows and Mac, but it’s not great!
  • RStudio is software to help you efficiently write R code

    • Called an integrated development environment (IDE)

Lecture Plan

  1. R, RStudio, and Packages
  2. Fundamentals of R
  3. Introduction to control structures (if/else, for)
  4. Functions, user-defined functions, and the apply family of functions
  5. Importing/exporting data, and data cleaning
  6. Data manipulation and summarizing

Formatting in Slides ⚠️

  • Code represents R code you type into the editor or console, for example: “you can use the mean() function to find the average of a vector of numbers. Sometimes median() is a better summary, though.”
  • Code chunks that span the page represent actual R code embedded in the slides.
# Sometimes important lines of code will be highlighted! #<<
10*9
[1] 90
  • The lines preceded by [1] represent the output, or result, of running the code in the code chunk

R and RStudio

Why R?

  • R is a programming language built for statistical computing
  • Why use R if you are already familiar with other statistical software? 😟
  • R is free
  • R has a very large community
  • R is a programming language, so you can do almost anything you want with it (you don’t have to rely on whatever is implemented in the software you use)
  • R can handle almost any data format
  • R makes it easy to create reproducible analysis
  • R is becoming a norm in a lot of fields!
  • R is a transferable skill! Many stats-related jobs require proficiency in R

RStudio

  • RStudio is an integrated development environment (IDE) for R that will make your life much easier

Features of RStudio include:

  • Auto-complete code, format code, and highlight syntax

  • Organize your code, output, plots, and objects into one window

  • View data and objects in readable and searchable format

  • Seamless creation of RMarkdown documents

RStudio can also…

  • Interact with Git and GitHub

  • Run interactive tutorials

  • Help you write in other languages like Python, JavaScript, C++, and more!

Program Conventions

  • There are various file formats to work with …

Saving a .R file

  1. Click on the floppy disk icon highlighted in the red circle.
  2. Use the Ctrl+S shortcut on Windows or the ⌘+S shortcut on Mac

Editing and Running Code

There are several ways to run R code in RStudio:

  1. Highlight lines in the editor window and click Run at the top or hit Ctrl+Enter or ⌘+Enter to run them all
    • You can change these (and other hotkeys) in Tools > Global Options > Code > Modify Keyboard Shortcuts
  2. With your cursor on a line you want to run, hit Ctrl+Enter or ⌘+Enter.
    • The console will show the lines you ran followed by any printed output.

Making comments

  • By starting a line with #, we can write notes and comments that will appear green. R does not run comments as code.

Making comments

  • You can also write blocks of text that do not appear as code by also starting the text with a “#”.

  • And, if you end a comment line with four or more dashes (for example, # Load data ----), RStudio treats the line as a section heading. You can see these headings in the document outline.

Incomplete Code

If you try to run incomplete code (e.g. leave off a parenthesis), R might show a + sign prompting you to finish the command:

> (11-2
+

Finish the command by typing ) and hitting Enter or hit Esc to cancel code execution

Basic R Syntax and Operations

Arithmetic operations in R

There are several arithmetic operators in R that are extremely useful:

  • + addition, - subtraction, * multiplication, / division

  • ^ (less commonly **) exponentiation

  • %*% matrix multiplication

  • %/% integer division (9 %/% 2 is 4)

  • %% modulo (returns the remainder of the left number divided by the right)

    • e.g., To check that a number is even, you can do number %% 2. If the return value is 0 it is even, otherwise it is odd.
2^2
[1] 4
4 %% 2
[1] 0
2**2
[1] 4
5 %% 2
[1] 1
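
As a small illustration of how %/% and %% work together, the sketch below splits a duration into whole hours and leftover minutes. The variable names and the value 135 are made up for this example.

```r
# Hypothetical example: split 135 minutes into hours and minutes
total_minutes <- 135
hours   <- total_minutes %/% 60   # integer division: whole hours (2)
minutes <- total_minutes %% 60    # modulo: leftover minutes (15)
hours
minutes

# Check whether a number is even, as described above
is_even <- (10 %% 2) == 0
is_even
```

The same pattern applies whenever you need both a quotient and a remainder from one division.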

Operations

Now, in your blank R document in the editor, type the line sqrt(25) and either click Run or hit Ctrl+Enter or ⌘+Enter.

sqrt(25)
[1] 5

R uses the PEMDAS order of operations:

# 2*3 then / 2 then + 25
25 + 2 * 3 / 2
[1] 28

You can tell R to evaluate chunks of the equation together by using parentheses:

# 25 + 2 then *3 then /2
(25 + 2) * 3 / 2
[1] 40.5

Functions and Help

sqrt() is an example of a function in R.

If we didn’t have a good guess as to what sqrt() will do, we can type ?sqrt in the console (help(sqrt) will also work) and look at the Help panel on the bottom right.

?sqrt

Arguments are the inputs to a function. In this case, the only argument to sqrt() is x which can be a number or a vector of numbers.
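
To make the idea of arguments concrete, here is a short sketch using sqrt() and round(), both part of base R. The numbers are arbitrary.

```r
# sqrt() accepts a single number or a whole vector
sqrt(16)
roots <- sqrt(c(4, 9, 16))
roots

# Arguments can be matched by position or by name
by_position <- round(3.14159, 2)               # x = 3.14159, digits = 2
by_name     <- round(x = 3.14159, digits = 2)  # same call with named arguments
by_position
by_name

# args() lists the arguments a function accepts
args(round)
```

Naming arguments is optional, but it often makes a call easier to read.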

R is an object-oriented language ⚠️

Objects

To create an object, you can use:

  • <- <<- leftwards assignment

  • -> ->> rightwards assignment

  • = assignment

99.9% of the time you will use <-:

my_object <- c(3) # object on the left and value on the right of the assignment operator

Avoid = for object assignment. Depending on the context, = has different meanings: inside a function call it names an argument rather than creating an object.
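
A minimal sketch of that difference, assuming a fresh R session. The object name y is arbitrary.

```r
# Inside a function call, = only names an argument;
# no object called x is created in your environment
mean(x = c(2, 4, 6))

# <- creates (or overwrites) an object that you can reuse
y <- c(2, 4, 6)
y_mean <- mean(y)
y_mean
exists("y")   # TRUE: y now lives in your environment
```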

Creating Objects

You can call (i.e., use) an object simply by using its name:

luckynumber <- 13
luckynumber + 1
[1] 14

Object names must start with a letter (or a period that is not followed by a number). They can include letters, numbers, periods, and underscores.

Good

myobject <- "My New Object!"

Bad

1stobjectever! <- "My first object ever!"

Overwriting Objects

If you name an object the same name as an existing object, it will overwrite it:

age <- 30
print(age)
[1] 30
age <- 40
print(age)
[1] 40

Anything can be an object in R

  • figures
  • tables
  • datasets
  • variables

Avoid Overwriting Objects

You can create a copy of the object by assigning it a new name

object1 <- 100
object2 <- object1
print(object1)
[1] 100
print(object2)
[1] 100

The copy is independent of the original (i.e., changing one does not change the other)

object2 <- object2 + 1
print(object1)
[1] 100
print(object2)
[1] 101

Packages

Packages

  • A lot of R’s abilities come packaged with your base R installation, and you will use many of these in your work.
  • However, you will also need to do things that aren’t included in Base R’s default functionality.
  • You could write all these functions yourself, but this is difficult and time consuming.

Packages are bundles of R functions that help you perform some kind of task. For example, lavaan is a package used for structural equation modeling, psych is a package used for estimating inter-rater reliability, and rvest is a package used for scraping data from the web.

Installing packages is easy

You only need to install a package once per R installation.

Use the console to install a package (don’t put this in your script)

install.packages("psych")
install.packages(c("psych", "tidyverse")) # Install several packages at once with a character vector

Installing is not enough!

To use a package in your script, use the library() function:

# The package name can be in quotes, but doesn't have to be #<<
library(psych)

The only thing library() does for you is load your packages. If you don’t have the package installed, you will get an error. Most of the time this is what you want!

You can also use a function from a package without loading that package by prepending the function call with the package name + ::. For example:

psych::ICC()
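
The sketch below shows the :: syntax with a package that ships with R (stats), so it runs on any installation. It also uses requireNamespace() to check whether psych is installed. The object name has_psych is made up for this example.

```r
# Call a function through :: without library()
middle <- stats::median(c(1, 3, 10))
middle

# Check whether a package is installed
# (this loads its namespace but does not attach it)
has_psych <- requireNamespace("psych", quietly = TRUE)
has_psych

# See which packages are currently attached in this session
search()
```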

Package Repositories

Most of the packages you will use are stored on CRAN (the Comprehensive R Archive Network), which is maintained by the folks at the R Project. By default, install.packages() will look at a CRAN mirror to download and install the requested packages.

Packages from GitHub

You can install packages from other sources as well. For example, to install a package from GitHub:

library(devtools)
# install_github("DeveloperName/RepoName")
install_github("tidyverse/dplyr")

To install a package from your local computer:

install.packages(path_to_file, repos = NULL, type="source")

Elements in R

Creating Vectors

A vector is a series of elements, such as numbers.

You can create a vector and store it as an object in the same way as we just did. To do this, use the function c() which stands for “combine”

myvector <- c(4, 9, 16, 25, 36)
print(myvector)
[1]  4  9 16 25 36

You can provide a vector as an argument for many functions (more on this later in the course)

mean(myvector)
[1] 18

Hierarchy of R’s vector types

Source: R for Data Science

1. Numeric vectors

Numeric vectors (more on these later) are numbers, either integers or doubles (i.e., floating point numbers, decimals)

  • Integers: 1, 2, 3, 4, etc.
  • Doubles: 1.0, 2.43, 3.92, 4.0934853409

You can check directly whether something is numeric, an integer, or a double. You can also coerce a value (i.e., tell R to store it) as one of these types:

  • is.numeric(x), as.numeric(x)
  • is.integer(x), as.integer(x)
  • is.double(x), as.double(x)
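
A short sketch of these checks. One detail worth knowing: a number typed without a suffix is stored as a double, and the L suffix creates an integer. The object names are arbitrary.

```r
x <- 5     # stored as a double
y <- 5L    # the L suffix makes it an integer

is.double(x)    # TRUE
is.integer(x)   # FALSE
is.integer(y)   # TRUE
is.numeric(y)   # TRUE: integers and doubles are both numeric

# Coercing a double to an integer drops the decimal part
truncated <- as.integer(3.9)
truncated       # 3

typeof(x)
typeof(y)
```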

2. Character

Character vectors contain strings of characters enclosed in quotation marks. For example:

my_string <- "This is a string"

my_string
[1] "This is a string"

To check directly if something is of type character:

  • is.character(x)

You can also coerce values into characters using:

  • as.character(x)

3. Logical

Logical values are TRUE, FALSE, and NA.

Logical types must be capitalized. True is not the same as TRUE.

Logical types are also commonly represented as one uppercase letter:

  • TRUE, T
  • FALSE, F

You can check directly whether something is of type logical, and you can also coerce values into logical types:

  • is.logical(x), as.logical(x)
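
A brief sketch of how logical values behave in calculations and coercion. The vector answers is made up for this example.

```r
answers <- c(TRUE, FALSE, TRUE, TRUE)

# In arithmetic, TRUE counts as 1 and FALSE as 0
n_true    <- sum(answers)    # 3
prop_true <- mean(answers)   # 0.75

# Coercing numbers: 0 becomes FALSE, any other number becomes TRUE
from_numbers <- as.logical(c(0, 1, 2))

# Coercing text: strings R does not recognize become NA
from_text <- as.logical(c("TRUE", "false", "yes"))
from_text
```

Counting and averaging logical vectors is a common way to find how many values meet a condition.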

Relational Operators

What are operators?

Relational operators are foundational to programming in R. They compare two values and let you control your programming logic.

Relational operators always return a logical value (TRUE, FALSE, or NA) and are most commonly used to subset data (more on this later) and to control the flow of your code with if/else statements (more on this later as well!)

Relational Operators:

  • <, >
  • >=, <=
  • ==, !=

Relational Operators

> and < return TRUE if the left side is greater than (>) or less than (<) the right side, otherwise they return FALSE

200 > 300
[1] FALSE
300 > 200
[1] TRUE
200 < 300
[1] TRUE
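
The remaining operators work the same way, and all of them compare element by element when given a vector. A small sketch with made-up ages:

```r
ages <- c(18, 25, 21, 30)   # hypothetical values

at_least_21 <- ages >= 21   # FALSE TRUE TRUE TRUE
exactly_21  <- ages == 21   # FALSE FALSE TRUE FALSE
not_21      <- ages != 21   # TRUE TRUE FALSE TRUE

# A logical vector can be used to keep only matching elements
adults <- ages[ages >= 21]
adults                      # 25 21 30
```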

not!

Also known as the “bang” operator

Converts TRUE into FALSE and FALSE into TRUE

!TRUE
[1] FALSE
!FALSE
[1] TRUE

Missing Values

Missing values in R are represented as NA (without quotes)

Even one NA poisons the well. Your calculations will return NA unless you handle missing values properly:

vector_with_NAs <- c(1, 2, 3, 4, NA)
mean(vector_with_NAs)
[1] NA
mean(vector_with_NAs,
     na.rm = TRUE)
[1] 2.5

The na.rm argument in mean() removes missing values prior to calculating the mean.
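
Before removing missing values, it often helps to find out how many there are. The sketch below uses is.na() for that. The vector scores is made up for this example.

```r
scores <- c(10, NA, 30, NA, 50)   # hypothetical values

missing   <- is.na(scores)        # FALSE TRUE FALSE TRUE FALSE
n_missing <- sum(missing)         # 2

# Comparing with == does not work for NA: the result is itself NA
NA == NA

# Keep only the observed values, or let the function skip NAs
observed <- scores[!is.na(scores)]
total    <- sum(scores, na.rm = TRUE)
observed
total
```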

Matrices

Making Matrices

Matrices are basically two dimensional vectors with rows and columns and are made with the matrix() function

# LETTERS is a built-in vector in R w/ elements A-Z
matrix(LETTERS[1:6], nrow = 2, ncol = 3)
     [,1] [,2] [,3]
[1,] "A"  "C"  "E" 
[2,] "B"  "D"  "F" 

You can also make matrices by binding vectors together with rbind() (row bind) and cbind() (column bind). For instance,

rbind(1:3, 4:6, 7:9)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Subsetting Matrices

Matrices are subset with [rows, columns]

letters_matrix <- matrix(letters[1:6], nrow = 2,
                         ncol = 3, byrow = T)
print(letters_matrix)
     [,1] [,2] [,3]
[1,] "a"  "b"  "c" 
[2,] "d"  "e"  "f" 
# Row 2, Column 3
letters_matrix[2, 3]
[1] "f"
# Row 1, Columns 2 and 3
letters_matrix[1, c(2, 3)]
[1] "b" "c"
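
Leaving the row or column position empty selects everything along that dimension. A short sketch that rebuilds the same matrix:

```r
letters_matrix <- matrix(letters[1:6], nrow = 2, ncol = 3, byrow = TRUE)

dim(letters_matrix)                 # 2 rows, 3 columns
first_row  <- letters_matrix[1, ]   # all columns of row 1: "a" "b" "c"
second_col <- letters_matrix[, 2]   # all rows of column 2: "b" "e"
first_row
second_col

# t() transposes the matrix (rows become columns)
t(letters_matrix)
```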

Lists

What are lists?

Lists are objects that can store multiple types of data and are created with list()

my_list <- list("groceries" = c("Soy Sauce", "Rice", "Tofu"),
               "numbers" = 1:7,
               "mydata" = matrix(8:11, nrow = 2),
               "linearmod" = lm(mpg ~ disp, data = mtcars))
print(my_list)
$groceries
[1] "Soy Sauce" "Rice"      "Tofu"     

$numbers
[1] 1 2 3 4 5 6 7

$mydata
     [,1] [,2]
[1,]    8   10
[2,]    9   11

$linearmod

Call:
lm(formula = mpg ~ disp, data = mtcars)

Coefficients:
(Intercept)         disp  
   29.59985     -0.04122  

Accessing List Elements

You can access a list element by its name or number in [[ ]] (note the double square brackets) or $ followed by its name:

my_list[[1]]
[1] "Soy Sauce" "Rice"      "Tofu"     
my_list[["groceries"]]
[1] "Soy Sauce" "Rice"      "Tofu"     

[ ] versus [[ ]]

Source: Hadley Wickham

[x] chooses elements but keeps the list while [[x]] extracts the element from the list
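
A small sketch of the difference, using a shortened version of the list from the earlier slide:

```r
my_list <- list(groceries = c("Soy Sauce", "Rice", "Tofu"),
                numbers = 1:7)

kept_as_list <- my_list["numbers"]     # still a list (of length 1)
extracted    <- my_list[["numbers"]]   # the integer vector itself

class(kept_as_list)    # "list"
class(extracted)       # "integer"

# $ extracts the element by name, like [[ ]]
identical(my_list$numbers, extracted)

# Functions that expect a vector need the extracted element
numbers_total <- sum(my_list[["numbers"]])
numbers_total          # 28
```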

Regression Output is a List!

library(datasets)
data(mtcars)
my_regression <- lm(mpg ~ hp + wt + cyl, data = mtcars)

str(my_regression, list.len = 7)
List of 12
 $ coefficients : Named num [1:4] 38.752 -0.018 -3.167 -0.942
  ..- attr(*, "names")= chr [1:4] "(Intercept)" "hp" "wt" "cyl"
 $ residuals    : Named num [1:32] -1.82 -1.013 -3.16 0.464 1.532 ...
  ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ effects      : Named num [1:32] -113.65 -26.05 -15.89 4.29 1.61 ...
  ..- attr(*, "names")= chr [1:32] "(Intercept)" "hp" "wt" "cyl" ...
 $ rank         : int 4
 $ fitted.values: Named num [1:32] 22.8 22 26 20.9 17.2 ...
  ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ assign       : int [1:4] 0 1 2 3
 $ qr           :List of 5
  ..$ qr   : num [1:32, 1:4] -5.657 0.177 0.177 0.177 0.177 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
  .. .. ..$ : chr [1:4] "(Intercept)" "hp" "wt" "cyl"
  .. ..- attr(*, "assign")= int [1:4] 0 1 2 3
  ..$ qraux: num [1:4] 1.18 1.08 1.09 1.03
  ..$ pivot: int [1:4] 1 2 3 4
  ..$ tol  : num 1e-07
  ..$ rank : int 4
  ..- attr(*, "class")= chr "qr"
  [list output truncated]
 - attr(*, "class")= chr "lm"
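
Because the model is a list, its pieces can be pulled out with $ or [[ ]], just like any other list. A brief sketch using the same regression:

```r
my_regression <- lm(mpg ~ hp + wt + cyl, data = mtcars)

# Names of the list elements stored in the model object
names(my_regression)

# Extract the coefficients, then one coefficient by name
all_coefs <- my_regression$coefficients
wt_coef   <- all_coefs["wt"]
all_coefs
wt_coef

# First three residuals
head(my_regression$residuals, 3)

# coef() is the accessor function for the same information
identical(coef(my_regression), all_coefs)
```

Accessor functions such as coef() are generally preferred when they exist, but knowing the list structure helps when they do not.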

Dataframes

What are dataframes?

Dataframes are a special type of list where all elements of the list are the same length and are bound together

Unlike matrices, dataframes can hold data of different atomic types (but each column needs to be the same type)

my_data <- data.frame("name" = c("Ozlem", "Angela", "Bill", "Mary", "Joe", "Jane"),
                          "grads" = c(1, 0, 4, 3, 2, 3),
                          "fullprof" = c(F, F, T, T, T, T))
print(my_data)
    name grads fullprof
1  Ozlem     1    FALSE
2 Angela     0    FALSE
3   Bill     4     TRUE
4   Mary     3     TRUE
5    Joe     2     TRUE
6   Jane     3     TRUE

Subsetting dataframes

Dataframes are subset in the same way as matrices ([rows, columns])

my_data[, 1]
[1] "Ozlem"  "Angela" "Bill"   "Mary"   "Joe"    "Jane"  
my_data[c(1, 3, 5), ]
   name grads fullprof
1 Ozlem     1    FALSE
3  Bill     4     TRUE
5   Joe     2     TRUE

You can also use the $ operator to target a single column

my_data$name
[1] "Ozlem"  "Angela" "Bill"   "Mary"   "Joe"    "Jane"  
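
Rows can also be selected with a logical condition, and columns by name. A short sketch that rebuilds the same dataframe:

```r
my_data <- data.frame(name = c("Ozlem", "Angela", "Bill", "Mary", "Joe", "Jane"),
                      grads = c(1, 0, 4, 3, 2, 3),
                      fullprof = c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE))

# Rows where grads is at least 3, all columns
many_grads <- my_data[my_data$grads >= 3, ]
many_grads

# Rows where fullprof is TRUE, only two named columns
profs <- my_data[my_data$fullprof, c("name", "grads")]
profs

# Dimensions of the dataframe
nrow(my_data)
ncol(my_data)
```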

Viewing dataframes

You can view an entire dataframe by printing it with print() in the console. However, a large dataframe often floods your console and is too long to read meaningfully.

To view the first n rows of a dataframe use head() (defaults to 6 rows):

head(my_data, 3)
    name grads fullprof
1  Ozlem     1    FALSE
2 Angela     0    FALSE
3   Bill     4     TRUE

To view the last n rows of a dataframe use tail() (defaults to 6 rows):

tail(my_data, 4)
  name grads fullprof
3 Bill     4     TRUE
4 Mary     3     TRUE
5  Joe     2     TRUE
6 Jane     3     TRUE

dplyr Basics

Working Directory

Your working directory is the folder on your computer where R will look for and save things by default

You can find your current working directory with getwd()

getwd()
[1] "C:/Users/otuncelgurlek1/OneDrive - Georgia State University/r-introduction"

You can change your working directory using setwd()

setwd("/home/ozlem/Desktop")

When you launch RStudio by opening a .R or .Rmd file, the working directory is usually set to that file's folder. If RStudio is already open, opening a file does not change the working directory, so check it with getwd().

Note: Windows users need to change backslashes (\) to forward slashes (/) in file paths

Relative Paths

In your working directory, you can (and should!) refer to files using relative paths:

  • . refers to your current working directory
  • .. refers to the folder your working directory is located in

Examples:

./data/my_data.csv refers to a file called “my_data.csv” located in the “data” subfolder of my working directory

../../figure1.png refers to a file called “figure1.png” located two folders “up” from my working directory
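
One way to build relative paths is file.path(), which joins folder and file names with forward slashes. This is a sketch. The folder and file names are the ones from the example above and may not exist on your computer.

```r
# Build a relative path from its parts
data_path <- file.path("data", "my_data.csv")
data_path                  # "data/my_data.csv"

# Check whether the file exists relative to the working directory
file.exists(data_path)

# Show the full (absolute) path of the current working directory
normalizePath(".")
```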

Helpful Packages

R has the ability to read and write data in a number of formats. Although much of this functionality is built into Base R, several packages help as well:

  • haven (SPSS, Stata, and SAS files)
  • foreign (SPSS, Stata, SAS, and other files)
  • readxl (MS Excel files)
  • googlesheets4 (communicate directly with Google Sheets)
  • readr (enhances base R functionality)

The most common way to read/write data in R is with a .csv file!

.csv Files

A .csv file looks something like this:

"mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
21,6,160,110,3.9,2.62,16.46,0,1,4,4
21,6,160,110,3.9,2.875,17.02,0,1,4,4
22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
17.3,8,275.8,180,3.07,3.73,17.6,0,0,3,3
15.2,8,275.8,180,3.07,3.78,18,0,0,3,3
10.4,8,472,205,2.93,5.25,17.98,0,0,3,4
10.4,8,460,215,3,5.424,17.82,0,0,3,4
14.7,8,440,230,3.23,5.345,17.42,0,0,3,4
32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1
15.5,8,318,150,2.76,3.52,16.87,0,0,3,2
15.2,8,304,150,3.15,3.435,17.3,0,0,3,2
13.3,8,350,245,3.73,3.84,15.41,0,0,3,4
19.2,8,400,175,3.08,3.845,17.05,0,0,3,2
27.3,4,79,66,4.08,1.935,18.9,1,1,4,1
26,4,120.3,91,4.43,2.14,16.7,0,1,5,2
30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
15.8,8,351,264,4.22,3.17,14.5,0,1,5,4
19.7,6,145,175,3.62,2.77,15.5,0,1,5,6
15,8,301,335,3.54,3.57,14.6,0,1,5,8
21.4,4,121,109,4.11,2.78,18.6,1,1,4,2
  • Read .csv files with: read.csv("data/my_data.csv")
  • Write .csv files with: write.csv(my_data, "data/my_data_cleaned.csv")

Alternatively you can use readr

  • Read with: readr::read_csv("data/my_data.csv")
  • Write with: readr::write_csv(my_data, "data/my_data_cleaned.csv")
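
A minimal round trip with base R, written to a temporary file so that it runs without a data folder. The object names csv_path and cars_from_disk are made up for this example.

```r
# Write the built-in mtcars data to a temporary .csv file
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)  # row.names = FALSE skips the row labels

# Read it back in as a dataframe
cars_from_disk <- read.csv(csv_path)
dim(cars_from_disk)        # 32 rows, 11 columns
head(cars_from_disk, 3)
```

Note that the first argument to write.csv() is the object to save and the second is the file path.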

Control Structures

Control structures allow you to control the flow of your R program (script) and are critical to programming in R. There are three components of a program’s flow:

  1. Sequential: the order in which the R code is executed (“do this first, then that”)
  2. Selection: which path of an algorithm R will execute based on certain criteria (“do that, but only if this is TRUE”). Like if/else
  3. Iteration: how many times should a certain algorithm be repeated? (“do this 100 times, then move on to that”). Like for, while, repeat, break, next

if() otherwise else

if() and else statements allow you to conditionally execute code

For example, write a program that tells a cashier whether or not they should sell alcohol to a customer. The cashier enters the customer’s birthday into their POS, which needs to display the appropriate message:

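# Example input; in practice the POS would supply the customer's birthday
birthday <- "2001-05-17"

# Calculate age in years (365.25 days per year approximates leap years)
age <- as.numeric(difftime(Sys.Date(), as.Date(birthday), units = "days")) / 365.25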

if(age >= 21){
  print(paste("Age:", floor(age), "(Sell)"))
} else {
  print(paste("Age:", floor(age), "(Do not sell)"))
}

ifelse() and if_else()

if/else takes one TRUE or FALSE value, but sometimes we want to evaluate multiple values at once

ifelse() is a vectorized version of if/else that can operate over vectors:

ages <- c(35, 12, 82, 21, 15)
ifelse(ages >= 21, "Sell alcohol", "Do not sell alcohol")
[1] "Sell alcohol"        "Do not sell alcohol" "Sell alcohol"       
[4] "Sell alcohol"        "Do not sell alcohol"

ifelse() is very useful inside a dataframe to transform your data. 👉

uwclinpsych <- data.frame("name" = c("Corey", "Angela", "Bill", "Mary", "Jane", "Lori"),
                          "grads" = c(1, 0, 4, 3, 2, 3),
                          "fullprof" = c(F, F, T, T, T, T))

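As a sketch of that idea, the code below adds two new columns to the dataframe with ifelse(). The column names rank and has_grads and their labels are made up for this example.

```r
uwclinpsych <- data.frame(name = c("Corey", "Angela", "Bill", "Mary", "Jane", "Lori"),
                          grads = c(1, 0, 4, 3, 2, 3),
                          fullprof = c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE))

# Label each person based on the logical fullprof column
uwclinpsych$rank <- ifelse(uwclinpsych$fullprof, "Full professor", "Other rank")

# Flag whether each person has any graduate students
uwclinpsych$has_grads <- ifelse(uwclinpsych$grads > 0, "yes", "no")

uwclinpsych
```

Each ifelse() call returns a vector as long as the column it tests, so the result fits back into the dataframe as a new column.
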
Loops

Computers are really good at repeating the same task over and over, and loops are the way to accomplish it

From Wikipedia: “A loop is a sequence of statements which is specified once but which may be carried out several times in succession. The code ‘inside’ the loop is obeyed a specified number of times, or once for each of a collection of items, or until some condition is met, or indefinitely.”

There are three types of loops in R:

  • for loops
  • while loops
  • repeat loops

Without loops

Bad repetition: Let’s say you wanted to take the mean of all columns in the swiss dataset:

mean1 <- mean(swiss$Fertility)
mean2 <- mean(swiss$Agriculture)
mean3 <- mean(swissExamination)
mean4 <- mean(swiss$Fertility)
mean5 <- mean(swiss$Catholic)
mean5 <- mean(swiss$Infant.Mortality)
c(mean1, mean2 mean3, mean4, mean5, man6)

Can you spot the problems with this code? How frustrated would you be if swiss had 200 columns instead of 6?

DRY vs. WET Programming

DRY: do not repeat yourself! If you are writing the same code over several lines, there’s probably a more efficient way to write it

WET:

  • write every time
  • write everything twice
  • we enjoy typing
  • waste everyone’s time

Writing DRY code reduces the risk of making typos, substantially reduces the time and effort involved in processing large volumes of data, and makes your code more readable and easier to troubleshoot

for Loop

for loops iterate over a vector of values (any atomic type!) and execute instructions (R code) once for each value

In English: “for each of these values, in this order, execute this set of instructions”

General structure of a for loop:

for(var in seq){ #<<
  expr
}
  • var is an index variable that holds the current value in seq (You can call this whatever you want! It is customary to call it i, but there are meaningful exceptions to this)
  • seq is a vector of values that you want to iterate over
  • expr is the R code you want to run for each iteration

for Loop: Diagram

Given a set of values:

Diagram of a for loop

for Loop Example

for(i in 1:10){
  print(i^2)
}

same as 👉

i <- 1
print(i^2)
i <- 2
print(i^2)
i <- 3
print(i^2)

and so on…
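
Returning to the swiss example, a for loop can compute every column mean without repeating a line per column. This is a sketch. The name swiss_means is made up for this example.

```r
# Create an empty numeric vector with one slot per column
swiss_means <- numeric(ncol(swiss))
names(swiss_means) <- colnames(swiss)

# Loop over the column positions and store each mean
for (i in seq_along(swiss)) {
  swiss_means[i] <- mean(swiss[[i]])
}
swiss_means

# Base R also has a built-in shortcut that gives the same result
colMeans(swiss)
```

The loop works unchanged whether the dataset has 6 columns or 200.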

Other loop constructs we are not covering

  • while
  • repeat
  • next

Data Manipulation and Summary

“Data Engineer Work”

Issues around preparing a dataset for the analyses you want to run:

  • Subsetting data
  • Performing operations across rows and columns
  • Creating new variables
  • Creating rich summaries of your data
  • Merging multiple datasets together

Manipulation

  • Base R can be really challenging for data manipulation
  • Instead, we are going to use the dplyr package from the tidyverse

Starwars Data

To demonstrate much of dplyr’s functionality, we will use the starwars data that is loaded with dplyr and originally from SWAPI (a Star Wars API)

glimpse(starwars)
Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

A Brief Reminder About Pipes

The dplyr package uses verbs to name the functions within. As a result, they work very nicely with the pipe (%>%) syntax

take_these_data %>%
    do_first_thing(with = this_value) %>%
    do_next_thing(using = that_value)
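
A concrete version of this template, as a sketch that assumes dplyr is installed. It keeps the four-cylinder cars in the built-in mtcars data and then averages their mpg. The name piped_result is made up for this example.

```r
library(dplyr)

# Read top to bottom: take mtcars, keep 4-cylinder cars, then summarize
piped_result <- mtcars %>%
  filter(cyl == 4) %>%
  summarize(mean_mpg = mean(mpg))
piped_result

# The same steps without the pipe must be read from the inside out
nested_result <- summarize(filter(mtcars, cyl == 4), mean_mpg = mean(mpg))
identical(piped_result, nested_result)
```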

group_by()

  • group_by() is a special function that controls the behavior of other functions as they operate on the data
  • It returns a tibble with the following classes: grouped_df, tbl_df, tbl, and data.frame
  • Most functions called on grouped data operate within each group rather than on the entire dataset
  • Data are typically grouped by variables that are characters, factors, or integers, not continuous data

group_by()

For example, use group_by() to group the characters by eye_color:

starwars_grouped <- starwars %>%
  group_by(eye_color)

class(starwars_grouped)
[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

Notice that this dataset has exactly the same data as the ungrouped version, except it now controls the output of other function calls

group_by()

Not grouped

dim(starwars)
[1] 87 14

Grouped

dim(starwars_grouped)
[1] 87 14

group_by()

To remove a grouping structure use the ungroup() function (if left blank, all grouping is removed, otherwise just the specified groups are ungrouped):

starwars_ungrouped <- starwars_grouped %>%
  ungroup()

class(starwars_ungrouped)
[1] "tbl_df"     "tbl"        "data.frame"

group_by() example

Centering mass on the overall mean and on the mean within each gender:

starwars %>%
  # Center mass by sample average
  mutate(mass_gmc = mass - mean(mass, na.rm = T)) %>% #<<
  group_by(gender) %>% #<<
  # Center mass by group average
  mutate(mass_pmc = mass - mean(mass, na.rm = T)) %>% #<<
  select(name, gender, mass, mass_gmc, mass_pmc)
# A tibble: 87 × 5
# Groups:   gender [3]
   name               gender     mass mass_gmc mass_pmc
   <chr>              <chr>     <dbl>    <dbl>    <dbl>
 1 Luke Skywalker     masculine    77    -20.3   -29.5 
 2 C-3PO              masculine    75    -22.3   -31.5 
 3 R2-D2              masculine    32    -65.3   -74.5 
 4 Darth Vader        masculine   136     38.7    29.5 
 5 Leia Organa        feminine     49    -48.3    -5.69
 6 Owen Lars          masculine   120     22.7    13.5 
 7 Beru Whitesun Lars feminine     75    -22.3    20.3 
 8 R5-D4              masculine    32    -65.3   -74.5 
 9 Biggs Darklighter  masculine    84    -13.3   -22.5 
10 Obi-Wan Kenobi     masculine    77    -20.3   -29.5 
# ℹ 77 more rows

filter()

filter() is used to subset rows from a dataframe. It is similar to [x, ], except that it drops rows where the condition evaluates to NA

filter(.data, ..., .preserve = FALSE)
  • .data is the data to subset on
  • ... are the condition(s) that specify the subset
  • .preserve controls the grouping of the returned dataframe

filter() Example

starwars %>%
  filter(mass > mean(mass, na.rm = T))
# A tibble: 10 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 2 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 3 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 4 Jabba D…    175  1358 <NA>       green-tan… orange         600   herm… mascu…
 5 Jek Ton…    180   110 brown      fair       blue            NA   <NA>  <NA>  
 6 IG-88       200   140 none       metal      red             15   none  mascu…
 7 Bossk       190   113 none       green      red             53   male  mascu…
 8 Dexter …    198   102 none       brown      yellow          NA   male  mascu…
 9 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
10 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Notice that you don’t need to refer to the mass column as starwars$mass because filter() already knows the context from the .data argument.

filter() Example

This is

starwars %>%
  filter(mass > mean(mass, na.rm = T))

👇 almost the same as (base R subsetting also returns an all-NA row for every character whose mass is NA)

starwars[starwars$mass > mean(starwars$mass, na.rm = T), ]
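
The difference in how NAs are handled is easiest to see on a tiny dataset. The sketch below assumes dplyr is installed. The dataframe toy and its values are made up for this example.

```r
library(dplyr)

toy <- data.frame(name = c("a", "b", "c"),
                  mass = c(50, NA, 120))

# filter() keeps only rows where the condition is TRUE
with_filter <- toy %>% filter(mass > 60)
nrow(with_filter)      # 1 row: "c"

# Base R subsetting returns an extra all-NA row for the missing mass
with_brackets <- toy[toy$mass > 60, ]
nrow(with_brackets)    # 2 rows: one all-NA row and "c"

# Wrapping the condition in which() drops the NA in base R
with_which <- toy[which(toy$mass > 60), ]
nrow(with_which)       # 1 row: "c"
```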

select()

select() is used to subset columns from a dataframe (similar to [, x], but variable names don’t need to be quoted or passed as a vector)

starwars %>%
  select(name, mass, skin_color)
  • : is used to select a range of consecutive columns
starwars %>%
  select(birth_year:species)
  • ! is used to negate a selection of columns
starwars %>%
  select(!c(birth_year, vehicles, starships))
  • - is also used to negate a selection of columns
starwars %>%
  select(-birth_year, -vehicles, -starships)
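
A quick check, assuming dplyr is installed, that the two ways of dropping columns return the same columns when ! is applied to all of them at once with c(). The names with_bang and with_minus are made up for this example.

```r
library(dplyr)

# Negate the whole set of columns with !c(...)
with_bang <- starwars %>%
  select(!c(birth_year, vehicles, starships))

# Drop the same columns one by one with -
with_minus <- starwars %>%
  select(-birth_year, -vehicles, -starships)

ncol(starwars)     # 14 columns to start with
ncol(with_bang)    # 11 columns remain
identical(colnames(with_bang), colnames(with_minus))
```

Writing several separate negations, such as !a, !b, combines them as "everything except a" or "everything except b", which keeps every column.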

Column Renaming

You can rename variables while selecting them with select():

starwars %>%
  select(character = name, weight = mass, height) %>%
  colnames()
[1] "character" "weight"    "height"   

However, renaming is more explicit with rename():

starwars %>%
  rename(character = name, weight = mass) %>% #<<
  select(character, weight, height) %>%
  colnames()
[1] "character" "weight"    "height"   

Creating New Columns: mutate()

mutate() allows you to create a new column of data or modify existing columns

starwars %>% 
  mutate(bmi = mass / (height/100)^2) %>% # convert cm to m #<<
  select(mass, height, bmi) %>%
  head(3)
# A tibble: 3 × 3
   mass height   bmi
  <dbl>  <int> <dbl>
1    77    172  26.0
2    75    167  26.9
3    32     96  34.7

Creating New Columns: mutate()

We can also convert height to meters first, then calculate BMI:

starwars %>%
  mutate(height = height / 100, #<<
         bmi = mass / height^2) %>% #<<
  select(mass, height, bmi) %>%
  slice(1, 3)
# A tibble: 2 × 3
   mass height   bmi
  <dbl>  <dbl> <dbl>
1    77   1.72  26.0
2    32   0.96  34.7

👈 Multiple statements can be placed in the same mutate() call and, just like in tibble(), later statements can use columns created by earlier ones

Other dplyr functions for manipulation

  • across()
  • if_else()
  • case_when()
  • lag()
  • lead()
  • arrange()
  • distinct()
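
A brief sketch of three of these functions on the starwars data, assuming dplyr is installed. The size column and its 200 cm cutoff are made up for this example.

```r
library(dplyr)

tallest <- starwars %>%
  # case_when() checks conditions in order and uses the first match
  mutate(size = case_when(
    is.na(height) ~ "unknown",
    height >= 200 ~ "tall",
    TRUE          ~ "not tall"
  )) %>%
  # arrange() sorts rows; desc() puts the largest values first
  arrange(desc(height)) %>%
  select(name, height, size) %>%
  head(3)
tallest

# distinct() keeps one row per unique value
eye_colors <- starwars %>% distinct(eye_color)
nrow(eye_colors)
```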

Summarizing data with summarize()

summarize() returns a dataframe of the summary statistic(s) you specify, with one row for each combination of grouping variables (a single row if the data are not grouped) and one column for each summary statistic (much like tapply(), but more flexible and with cleaner output)

mtcars %>%
  summarize(nobs = n(),
            mpg_mean = mean(mpg, na.rm = T),
            mpg_sd = sd(mpg, na.rm = T))
  nobs mpg_mean   mpg_sd
1   32 20.09062 6.026948

Summarizing data with summarize()

When we group the data before calling summarize() we will get summary statistics for each group:

mtcars %>%
  group_by(cyl) %>% #<<
  summarize(nobs = n(),
            mpg_mean = mean(mpg, na.rm = T),
            mpg_sd = sd(mpg, na.rm = T))
# A tibble: 3 × 4
    cyl  nobs mpg_mean mpg_sd
  <dbl> <int>    <dbl>  <dbl>
1     4    11     26.7   4.51
2     6     7     19.7   1.45
3     8    14     15.1   2.56
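
Grouping by more than one variable gives one row per combination of groups. A sketch, assuming dplyr is installed, that groups mtcars by cyl and am (transmission type). The name by_cyl_am is made up for this example.

```r
library(dplyr)

by_cyl_am <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(nobs = n(),
            mpg_mean = mean(mpg, na.rm = TRUE),
            .groups = "drop")   # return an ungrouped result
by_cyl_am

nrow(by_cyl_am)        # 6 rows: 3 cylinder counts x 2 transmission types
sum(by_cyl_am$nobs)    # 32: every car is counted exactly once
```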

Merging data

Questions to Ask Yourself When Merging

When merging datasets A and B:

  • Which rows do you want to keep from each dataframe?

  • Which columns do you want to keep from each dataframe?

  • Which variable(s) determine whether rows match?

Data Example for Merging

To keep things simple, let’s use the following data to practice merging:

Dataset A:
# A tibble: 7 × 3
     ID     x     y
  <int> <dbl> <dbl>
1     0  1.37 1.43 
2     1  2.18 1.67 
3     2  1.16 0.277
4     3  3.60 3.31 
5     4  5.33 3.05 
6     5  4.18 1.47 
7     6  5.49 0.529
Dataset B:
# A tibble: 7 × 3
     ID group   age
  <int> <dbl> <int>
1     1     1    24
2     2     1    21
3     3     1    29
4     4     2    44
5     5     2    31
6     6     2    34
7     7     2    20
  • Notice that ID == 0 is not in B and ID == 7 is not in A

  • We will use the ID column to merge the data (in the by = argument)

left_join()

left_join(A, B) keeps all rows from A and all columns from A and B

dplyr

left_join(A, B, by = "ID")
# A tibble: 7 × 5
     ID     x     y group   age
  <int> <dbl> <dbl> <dbl> <int>
1     0  1.37 1.43     NA    NA
2     1  2.18 1.67      1    24
3     2  1.16 0.277     1    21
4     3  3.60 3.31      1    29
5     4  5.33 3.05      2    44
6     5  4.18 1.47      2    31
7     6  5.49 0.529     2    34

☝️ ID == 7 from B not in merged data ☝️

Base R Equivalent

merge(A, B, by = "ID", all.x = T)
  ID        x         y group age
1  0 1.373546 1.4250978    NA  NA
2  1 2.183643 1.6676030     1  24
3  2 1.164371 0.2767973     1  21
4  3 3.595281 3.3094216     1  29
5  4 5.329508 3.0545971     2  44
6  5 4.179532 1.4685252     2  31
7  6 5.487429 0.5290146     2  34

Other joins

  • right_join()
  • inner_join()
  • full_join()
  • semi_join()
  • anti_join()
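
The sketch below rebuilds A and B from the rounded values printed earlier (so the decimals are approximate) and shows how many rows each of the other joins returns. It assumes dplyr is installed.

```r
library(dplyr)

# Values copied from the printed tibbles above (rounded)
A <- data.frame(ID = 0:6,
                x = c(1.37, 2.18, 1.16, 3.60, 5.33, 4.18, 5.49),
                y = c(1.43, 1.67, 0.277, 3.31, 3.05, 1.47, 0.529))
B <- data.frame(ID = 1:7,
                group = c(1, 1, 1, 2, 2, 2, 2),
                age = c(24L, 21L, 29L, 44L, 31L, 34L, 20L))

nrow(inner_join(A, B, by = "ID"))   # 6: only IDs found in both (1 to 6)
nrow(right_join(A, B, by = "ID"))   # 7: all rows from B (IDs 1 to 7)
nrow(full_join(A, B, by = "ID"))    # 8: all rows from both (IDs 0 to 7)

# Filtering joins keep only A's columns
semi_join(A, B, by = "ID")          # rows of A that have a match in B
anti_join(A, B, by = "ID")          # rows of A with no match in B (ID 0)
```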

Matching With by =

  • You can pass by = a character vector of columns upon which to match. This is useful when merging nested data (e.g., data from clinic visits nested within patients)
left_join(A, B, by = c("ID", "Date"))
merge(A, B, by = c("ID", "Date"), all.x = T)
  • If the by columns used for merging don’t have the same name (e.g., “PatientNum” and “PatientID”):
left_join(A, B, by = c("PatientNum" = "PatientID"))
merge(A, B, by.x = "PatientNum", by.y = "PatientID", all.x = T)

Practice Exercise

Any Questions?

Getting Started

To make your RStudio look like mine…

  1. Tools > Global Options > Pane Layout and make top right your Console

  2. Tools > Global Options > Appearance and select Chaos as your Editor theme

  3. Tools > Global Options > Code > Display and select:
    • Highlight selected line
    • Allow scroll past end of document
    • Highlight R function calls

To make your RStudio look like mine…

Open up RStudio now and choose File > New File > R Script.

Then, let’s get oriented with the interface:

  • Top left: Code editor pane, data viewer (browse with tabs)

  • Top right: Console for running code (> prompt)

  • Bottom left: List of objects in environment, code history tab.

  • Bottom right: Browsing files, viewing plots, managing packages, and viewing help files.