Data Visualization in R

A very brief introduction

Ozlem Tuncel

Georgia State University

August 27, 2024

Research Data Services (RDS)

Workshops and more 👀

Before we begin – Why learn R?

R skills mean more money when you graduate (from O’Reilly 2021 Data/AI Salary Survey)

Before we begin – Why learn R?

Changing trends in the job market (particularly in the US)

Let’s talk about how we can learn more about data visualization 🧑‍🏫

General advice

  • Workshops and courses are great starting points! ⭐

    • They provide a foundation and introduce you to essential concepts and tools.
  • Practice makes perfect. ✏️

    • The more time you spend using R, the more proficient you’ll become.
  • Embrace mistakes. 😍

    • Making mistakes is a valuable part of the learning process and helps you understand the software better.

R generally has a steeper learning curve!

And learning software is a lifelong practice (from O’Reilly 2021 Data/AI Salary Survey)

Some resources from Ozlem

Why use R for data visualization?

  • R data visualization is extremely flexible! Almost any data visualization you can think of is possible to create in R.
  • Many users find that R graphs look more polished than those produced with Stata or Python
  • There are dozens of packages that make it easier to create complex figures, including ggplot2, patchwork, lattice, DiagrammeR, and more!
  • Create interactive visualizations with plotly, ggvis, htmlwidgets, leaflet, shiny apps, and other R tools

Some examples

Different Types of Visualizations

Today’s agenda

  • generic plot() function
  • tidyverse environment and ggplot2

ggplot2 exists under tidyverse

Why is ggplot2 better than plot()?

There is a huge debate over base R and tidyverse

Pros and cons of each approach

  • plot() - Cons
    • Creating legends
    • Grouping lines, faceting
    • Losing potential conclusions from EDA
  • plot() - Pros
    • exploratory plots
    • quick and dirty approach
  • ggplot2 - Cons
    • Could be verbose
    • Steep learning curve
  • ggplot2 - Pros
    • Pretty
    • Easy to use
    • Advanced options

Let’s start with plot() function

Generic plot function

Many data visualizations created in R start with the same function: plot()

plot() knows how to handle several different types of objects because it is a generic function with lots of methods:

 [1] plot,ANY-method     plot,color-method   plot.acf*          
 [4] plot.data.frame*    plot.decomposed.ts* plot.default       
 [7] plot.dendrogram*    plot.density*       plot.ecdf          
[10] plot.factor*        plot.formula*       plot.function      
[13] plot.ggplot*        plot.gtable*        plot.hcl_palettes* 
[16] plot.hclust*        plot.histogram*     plot.HoltWinters*  
[19] plot.isoreg*        plot.lm*            plot.medpolish*    
[22] plot.mlm*           plot.ppr*           plot.prcomp*       
[25] plot.princomp*      plot.profile*       plot.profile.nls*  
[28] plot.R6*            plot.raster*        plot.spec*         
[31] plot.stepfun        plot.stl*           plot.table*        
[34] plot.transform*     plot.ts             plot.tskernel*     
[37] plot.TukeyHSD*     
see '?methods' for accessing help and source code
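
The listing above is what methods(plot) prints; the exact set depends on which packages are loaded in your session. A minimal sketch, using only base R and the built-in mtcars data, of how the class of the first argument decides which method runs (the object names plot_methods and fit are illustrative):

```r
# List the methods registered for the plot() generic in this session
plot_methods <- methods(plot)
print(plot_methods)

# plot() dispatches on the class of its first argument
fit <- lm(mpg ~ wt, data = mtcars)   # a fitted linear model
class(mtcars)                        # a data frame: plot(mtcars) calls plot.data.frame()
class(fit)                           # an "lm" object: plot(fit) calls plot.lm()
```

This is why the same plot() call can produce a scatterplot matrix for a data frame and diagnostic plots for a model.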

plot() arguments

There are a lot of arguments to plot()! Several of these arguments will be discussed in these slides, but not all of them. That means that making plots often involves teaching yourself something new each time with the help pages, Stack Overflow, and various other websites and blogs.

plot(x, y = NULL, type = "p",  xlim = NULL, ylim = NULL,
     log = "", main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
     ann = par("ann"), axes = TRUE, frame.plot = axes,
     panel.first = NULL, panel.last = NULL, asp = NA,
     xgap.axis = NA, ygap.axis = NA,
     ...)

Scatterplot

# Vectors of coordinates
plot(x = mtcars$wt,
     y = mtcars$mpg)

  • What do we notice?
    • Plots points (type = "p") of a specific shape (pch = 1)
    • Axis labels (R code supplied to the arguments)
    • No header (main = NULL)
    • Chooses axis ticks for you
    • …and hundreds more!
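
As a small sketch of overriding the defaults noted above, the same scatterplot can be drawn with a different point shape and color. The specific values (pch = 19, "steelblue") are arbitrary choices for illustration:

```r
# Same scatterplot with two defaults overridden
plot(x = mtcars$wt,
     y = mtcars$mpg,
     type = "p",         # "p" = points (the default); "l" = lines, "b" = both
     pch = 19,           # solid circle instead of the default open circle (pch = 1)
     col = "steelblue")  # point color
```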

Change the main title

plot(x = mtcars$wt, y = mtcars$mpg,
     main = "Vehicle Efficiency by Weight") # adding the title

Change the axis titles

plot(x = mtcars$wt, y = mtcars$mpg,
     main = "Vehicle Efficiency by Weight",
     xlab = "Vehicle Weight (1000 lbs)", # adding x-axis title
     ylab = "Miles Per Gallon (MPG)")    # adding y-axis title

You can create other types of graphs.

For instance, histogram

hist(x = mtcars$mpg, # data to plot
     breaks = 15, # suggested number of bins (R may adjust it)
     xlim = c(10, 35), # set x-axis limits
     main = "", # no main title
     xlab = "Miles Per Gallon (MPG)", # x-axis title
     las = 1, # axis tick labels horizontal
     border = "darkblue", # bar border color
     col = "lightblue") # bar fill color

plot() works great for EDA but not for advanced stuff

Something like this is not possible in plot()

Animated GIFs are possible with the gganimate extension

Let’s learn ggplot2

ggplot2

  • The ggplot2 package is based on a philosophy outlined in The Grammar of Graphics

  • Understanding the philosophy is 90% of understanding how to create figures with ggplot2

  • The remaining 10% is learning the various functions that correspond with each part of the philosophy

The Grammar of Graphics

Central Idea: Instead of creating a function for every single type of plot, decompose graphics into their separate components/layers, which can be used flexibly to create (almost) any type of plot you want.

Example Data: gapminder

We will be using the gapminder data from the gapminder package for this lecture

str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
  • 142 countries (country)
  • 5 continents (continent)
  • 12 discrete years from 1952 to 2007 (year)
  • life expectancy (lifeExp)
  • population estimate (pop)
  • GDP per capita (gdpPercap)
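
The slides do not show the setup step, so here is a minimal sketch of what the later examples appear to assume: ggplot2, dplyr, and gapminder are installed and attached. The three checks simply confirm the bullets above.

```r
# Packages assumed by the examples in these slides
# install.packages(c("ggplot2", "dplyr", "gapminder"))  # run once if needed
library(ggplot2)    # plotting
library(dplyr)      # %>%, group_by(), summarize()
library(gapminder)  # provides the gapminder data frame

dim(gapminder)                 # number of rows and columns
levels(gapminder$continent)    # the continent categories
range(gapminder$year)          # first and last year
```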

Constructing a ggplot

ggplot figures are created starting with the ggplot() function

ggplot(data = NULL, mapping = aes(), ...)
  • data defined within a call to ggplot() are defined globally, which means each layer will use these data for plotting by default
  • mapping defined within a call to ggplot() are also defined globally for each layer to use by default
# A blank ggplot template
ggplot()

Constructing a ggplot

The + sign is used to add layers

ggplot() +
  geom_layer() + 
  another_geom_layer() + 
  facet_layer() +
  theme_layer()

Layers are added on top of each other, so the order matters

Unlike base R plots, ggplots can be saved as objects

p <- ggplot() + ...
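
A minimal sketch of what storing a plot as an object makes possible, assuming the ggplot2 and gapminder packages are installed. The object names (p_base, p_points, p_smooth, out_file) are illustrative, and the file is written to a temporary location:

```r
library(ggplot2)
library(gapminder)

# Store the data and mappings once...
p_base <- ggplot(data = gapminder,
                 mapping = aes(x = gdpPercap, y = lifeExp))

# ...then add different layers to the same stored object
p_points <- p_base + geom_point(alpha = 0.3)
p_smooth <- p_points + geom_smooth(method = "lm")

# Nothing is drawn until the object is printed
print(p_smooth)

# Write the stored plot to disk (a temporary file here)
out_file <- tempfile(fileext = ".pdf")
ggsave(filename = out_file, plot = p_smooth, width = 6, height = 4)
```

Because p_base is never modified, it can be reused as the starting point for any number of variations.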

Let’s go over each layer

Layer: Data

  • This layer refers to the data that go into your figure

  • Can be one dataframe (defined globally) or many dataframes (defined at each layer)

  • Creating graphics is mostly about getting your data cleaned and in the format you need

    • Most of the time your data need to be in “tidy” format
# Set global data as gapminder
# (from the gapminder package)
ggplot(data = gapminder) # gapminder %>% ggplot() does the same thing

☝️ Because we do not have any geometry, the plot is still blank

Layer: Aesthetics

  • Once you have your data, you need to inform the graphics function how those data fit into the plot you want to create

  • In other words, you need to tell the graphics which variable represents x, y, etc.

  • There are tons of different aesthetic specifications, which can be found in this documentation. Some of the most common are:

    • x, y (x and y axis)
    • color, fill
    • shape
    • linetype

Layer: Aesthetics

  • Mapping occurs inside a function called aes(), which stands for aesthetics
# aes function should look something like this
aes(x = my_x_var, y = my_y_var)
# Map the x-axis to `year`
# color to `continent`
ggplot(data = gapminder,
       mapping = aes(x = year, color = continent))

☝️ The plot is getting some shape, but still no data are plotted because we have not added a geometry layer

Layer: Aesthetics

  • Calls to aes() are always made within other ggplot2 functions (i.e., they are attributes of a layer, not their own layer)

  • Most of the time mapping will take column names for your data that you want to map on to each aesthetic of a plot

  • However, aes() can also take expressions (i.e., R code) that determine the axes, color groups, etc. For example:

ggplot(data = gapminder,
       mapping = aes(x = year,
                     y = gdpPercap * pop,
                     color = continent == "Asia"))

☝ From the gapminder data:

  • map year onto the x-axis
  • total GDP (gdpPercap * pop) on the y-axis
  • different colors where continent == and != "Asia"

Layer: Aesthetics

  • Now that we know how to create geometry and statistics layers we can understand aesthetic mapping more completely

No color mapping

ggplot(data = gapminder,
       mapping = aes(x = year,
                     y = lifeExp)) + 
  # No color mapping
  geom_line(stat = "summary")

Color mapped to continent

ggplot(data = gapminder,
       mapping = aes(x = year,
                     y = lifeExp,
                     color = continent)) +
  geom_line(stat = "summary")

Layer: Aesthetics vs Geometries

# Mapping (wrong): inside aes(), "red" is treated as a data value, not a color
ggplot(data = gapminder,
       mapping = aes(x = year,
                     y = lifeExp,
                     color = "red"))

# Setting (correct way): a constant color goes outside aes(), as an argument to the geom
ggplot(data = gapminder,
       mapping = aes(x = year,
                     y = lifeExp)) +
  geom_line(stat = "smooth", color = "red")
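
To see the difference between mapping and setting concretely, here is a small sketch that inspects the colors ggplot2 actually draws. It assumes ggplot2 and gapminder are installed; p_mapped and p_set are illustrative names, and geom_point() is used only to keep the example simple.

```r
library(ggplot2)
library(gapminder)

# Mapped: "red" is a one-level variable, so the default palette picks the color
p_mapped <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  geom_point(mapping = aes(color = "red"))

# Set: the constant is passed to the geom and used literally
p_set <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  geom_point(color = "red")

# layer_data() returns the values ggplot2 actually draws for a layer
unique(layer_data(p_mapped)$colour)  # a palette color, not "red"
unique(layer_data(p_set)$colour)     # the literal value "red"
```

The mapped version also adds a legend with a single entry labeled "red", which is usually the first sign that a constant ended up inside aes().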

Layer: Geometries

  • Mostly what you think about in ggplot2
  • Geometries take the scaled values that come from the mapping (and may have been transformed by statistics) and interpret/plot them in some way
  • e.g., a line geometry (geom_line()) interprets the data one way and creates lines on your figure, while a boxplot geometry (geom_boxplot()) interprets the data another way
ggplot(data = gapminder,
       mapping = aes(x = year,
                     y = lifeExp)) +
  geom_line(stat = "summary")

ggplot(data = gapminder,
       mapping = aes(x = year)) +
  geom_boxplot() # No `y` mapping needed for boxplot

Some examples

The geoms

Geoms (short for geometric objects) specify which type of graphic you want to produce (boxplot, barplot, scatterplot, histogram, …). All ggplot2 geoms start with the geom_ prefix.

 [1] "geom_abline"            "geom_area"              "geom_bar"              
 [4] "geom_bin_2d"            "geom_bin2d"             "geom_blank"            
 [7] "geom_boxplot"           "geom_col"               "geom_contour"          
[10] "geom_contour_filled"    "geom_count"             "geom_crossbar"         
[13] "geom_curve"             "geom_density"           "geom_density_2d"       
[16] "geom_density_2d_filled" "geom_density2d"         "geom_density2d_filled" 
[19] "geom_dotplot"           "geom_errorbar"          "geom_errorbarh"        
[22] "geom_freqpoly"          "geom_function"          "geom_hex"              
[25] "geom_histogram"         "geom_hline"             "geom_jitter"           
[28] "geom_label"             "geom_line"              "geom_linerange"        
[31] "geom_map"               "geom_path"              "geom_point"            
[34] "geom_pointrange"        "geom_polygon"           "geom_qq"               
[37] "geom_qq_line"           "geom_quantile"          "geom_raster"           
[40] "geom_rect"              "geom_ribbon"            "geom_rug"              
[43] "geom_segment"           "geom_sf"                "geom_sf_label"         
[46] "geom_sf_text"           "geom_smooth"            "geom_spoke"            
[49] "geom_step"              "geom_text"              "geom_tile"             
[52] "geom_violin"            "geom_vline"            

Layer: Geometries

  • Geometries are intimately intertwined with statistics, and each geom_*() has a default statistic (stat) assigned
  • The default statistic for geom_line() is identity, which means “leave the data as is”.
ggplot(data = gapminder,
       mapping = aes(x = year,
                     y = lifeExp)) +
  geom_line() # stat = "identity"

☝ Plots every single point across every single year

Layer: Geometries

  • When we change the statistic from the default to summary, we get a plot of mean values (the summary function can be changed with fun)
ggplot(data = gapminder,
       mapping = aes(x = year,
                     y = lifeExp)) +
  geom_line(stat = "summary")

☝ Plots mean values for all observations within each year
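
As a sketch of changing the summary function with fun, the plot below draws the yearly mean and the yearly median in the same figure. It assumes ggplot2 (version 3.3.0 or later, where the fun argument is available) and gapminder are installed; the colors and line type are arbitrary.

```r
library(ggplot2)
library(gapminder)

# Mean life expectancy per year (default color) and median per year (dashed red)
p_summary <- ggplot(data = gapminder,
                    mapping = aes(x = year, y = lifeExp)) +
  geom_line(stat = "summary", fun = mean) +
  geom_line(stat = "summary", fun = median,
            color = "red", linetype = "dashed")

p_summary
```

Supplying fun explicitly also avoids the message ggplot2 prints when it has to fall back on its default summary function.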

Multiple Geometries

  • Different geometries do not necessarily share the same mapping. For example, geom_point() needs (at minimum) an x and y mapping, but geom_histogram() only needs an x mapping (the statistic determines the y-axis)
  • There is an “Aesthetics” section in the help page for each geom that describes the required and optional mapping parameters
  • You can (and often will) have multiple layers of geometries in the same figure
  • ⚠️ The order of your geometries matters, because each layer is plotted on top of the previous layers
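
The points above can be sketched with two short plots, assuming ggplot2 and gapminder are installed (p_hist and p_layers are illustrative names). The first uses a geom that needs only x; the second gives one layer its own mapping.

```r
library(ggplot2)
library(gapminder)

# geom_histogram() needs only x; its default statistic ("bin") computes the counts for y
p_hist <- ggplot(data = gapminder, mapping = aes(x = lifeExp)) +
  geom_histogram(bins = 30)

# Aesthetics given inside a geom apply to that layer only
p_layers <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(mapping = aes(color = continent), alpha = 0.4) +  # colored by continent
  geom_smooth(method = "lm", color = "black")                  # one line for all the data
```

Because the color mapping lives inside geom_point(), the smoothing layer is not split by continent and is drawn on top of the points.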

Adding geometries example

ggplot(data = gapminder,
       mapping = aes(x = year,
                     y = lifeExp)) +
  geom_bar(stat = "summary", width = 3, fill = "red") + # adding bar graph
  geom_line(stat = "summary", color = "blue")           # adding line graph

Layer: Statistics

  • Your data do not always have the required statistics for each type of figure
  • For example, plotting a boxplot requires calculating the 25th, 50th, and 75th percentiles of your data and the interquartile range
  • Sometimes your data are exactly what is needed (e.g., creating a scatterplot), in which case you set your statistic to identity which just passes your data on to that layer

Statistics: Errorbars example

  • However, sometimes you do need to manipulate your data to get the correct aesthetic mapping for a geom (e.g., creating error bars)

Step 1: create summary statistics from our data (mean and standard error)

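library(dplyr) # provides %>%, group_by(), summarize(), and n()

gp_summary <- gapminder %>%
  group_by(year) %>%
  summarize(mean_lifeExp = mean(lifeExp, na.rm = TRUE),          # mean per year
            se_lifeExp = sd(lifeExp, na.rm = TRUE) / sqrt(n()))  # standard error per year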

Statistics: Errorbars example

Step 2: Supply these data to the geom_errorbar() layer to control the height of the errorbars

ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) +
  geom_point(stat = "summary") +
  geom_errorbar(data = gp_summary,
                mapping = aes(x = year, y = mean_lifeExp,
                              ymin = mean_lifeExp - se_lifeExp,
                              ymax = mean_lifeExp + se_lifeExp))
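
An alternative sketch, not taken from the slides: ggplot2 ships a helper, mean_se(), that returns the mean plus and minus one standard error, so the summary can also be computed inside the layers. This assumes ggplot2 (3.3.0 or later), dplyr, and gapminder are installed. The manual summary is repeated only to cross-check the two approaches.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# mean_se() returns the mean (y) and mean -/+ one standard error (ymin, ymax)
mean_se(c(1, 2, 3))

# The same figure with the summary computed inside each layer
p_err <- ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) +
  geom_point(stat = "summary", fun = mean) +
  geom_errorbar(stat = "summary", fun.data = mean_se)

# Manual summary, for comparison with what the layer computes
gp_summary <- gapminder %>%
  group_by(year) %>%
  summarize(mean_lifeExp = mean(lifeExp),
            se_lifeExp = sd(lifeExp) / sqrt(n()))
```

The manual route is still useful when the error bars come from somewhere else, such as model estimates.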

Layer: Statistics – complete list

  • Statistics are linked to geometries such that each geometry requires a statistic (and vice versa: each statistic requires a geometry)

  • Thus, geometries have default statistics that try to guess what you want to plot but that can also be changed

  • Defaults for common geometries:

  • geom_point(stat = "identity")

  • geom_count(stat = "sum")

  • geom_jitter(stat = "identity")

  • geom_bar(stat = "count")

  • geom_density(stat = "density")

  • geom_histogram(stat = "bin")

  • geom_boxplot(stat = "boxplot")

  • geom_violin(stat = "ydensity")

  • geom_rug(stat = "identity")

  • geom_freqpoly(stat = "bin")

  • geom_quantile(stat = "quantile")

  • geom_smooth(stat = "smooth")
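
One way to check these defaults yourself, sketched under the assumption that each geom_*() call returns a layer object whose stat element holds the statistic (this is how current ggplot2 versions behave, but it is an internal detail rather than a documented interface):

```r
library(ggplot2)

# Each geom_*() call returns a layer whose `stat` element holds its statistic
geoms <- list(point     = geom_point(),
              bar       = geom_bar(),
              histogram = geom_histogram(),
              boxplot   = geom_boxplot(),
              smooth    = geom_smooth())

# The first class name of each statistic, e.g. "StatCount" for stat = "count"
default_stats <- sapply(geoms, function(layer) class(layer$stat)[1])
default_stats
```

The help page for each geom lists the same default in its Usage section, which is the more stable place to look.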

Layer: Scales

  • How are properties of the axes, colors, and other aesthetics determined?
  • Scales control the details of how data values are translated to visual properties (e.g., plot Africa with #F8766D, Americas with #B79F00, etc.)
  • All geometries are given default scales which you can override with the scale_*() function

Layer: Scales

  • Scale functions have the syntax: scale_<aesthetic>_<type> where <aesthetic> refers to each aesthetic mapping (x, y, color, etc.) and <type> refers to the type of scale (continuous, discrete, log10, etc.)

Axes scales

  • scale_x_continuous(), scale_y_continuous()
  • scale_x_log10(), scale_y_log10()
  • scale_x_sqrt(), scale_y_sqrt()
  • scale_x_reverse(), scale_y_reverse()
  • scale_x_discrete(), scale_y_discrete()
  • scale_x_binned(), scale_y_binned()

Color, shape, size scales

  • scale_color_continuous(), scale_shape_continuous(), scale_size_continuous()
  • scale_color_discrete(), scale_shape_discrete(), scale_size_discrete()
  • scale_color_binned(), scale_shape_binned(), scale_size_binned()
  • scale_color_brewer(), scale_fill_brewer() (ColorBrewer palettes apply to color and fill only)
  • There are dozens of different types of scales in ggplot2, all of which can be found in this documentation.
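
A short sketch of the scale_<aesthetic>_<type> naming pattern in use, assuming ggplot2 and gapminder are installed. The axis names and the "Dark2" palette are arbitrary choices for illustration:

```r
library(ggplot2)
library(gapminder)

p_scales <- ggplot(data = gapminder,
                   mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.5) +
  scale_x_log10(name = "GDP per capita (log scale)") +       # <aesthetic> = x, <type> = log10
  scale_y_continuous(name = "Life expectancy (years)") +     # <aesthetic> = y, <type> = continuous
  scale_color_brewer(name = "Continent", palette = "Dark2")  # <aesthetic> = color, <type> = brewer

p_scales
```

Each scale function controls one aesthetic, so the three calls are independent and can be added in any order.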

Layer: Scales – example

ggplot(data = gapminder,
       mapping = aes(x = year, y = gdpPercap * pop, color = continent)) + 
  geom_line(stat = "summary", linewidth = 1.5) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point(stat = "summary", shape = 21, fill = "white", size = 2) +
  scale_x_continuous(name = "Year", breaks = unique(gapminder$year)) +
  scale_y_continuous(name = "Gross Domestic Product (USD)") + 
  scale_color_brewer(palette = "Set1") + 
  ggtitle("Gross Domestic Product Over Time by Continent")

Layer: Facets

  • Often we are focused on creating one figure per plotting area, but we are not constrained to this and may want to create multiple subplots when looking at our data
  • Facets are multiple panels of plots, with the same plotting logic, on different groups of your data
  • Use facets to prevent overplotting (plotting too much data in one figure)
  • Two different kinds of facets: facet_wrap() and facet_grid()

Layer: Facets – facet_wrap()

  • facet_wrap() takes a column from your data with a grouping structure and creates several subplots for each group
  • Like this: (facet_wrap(~ groupvar))
ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) +
  geom_line(mapping = aes(group = country), stat = "summary", linewidth = .25) + # one thin line per country
  geom_line(stat = "summary", linewidth = 2, alpha = .5, color = "blue") +       # mean within each panel
  facet_wrap(~ continent)

Layer: Facets – facet_grid()

  • facet_grid() takes two grouping variables and creates plots that show the intersection between them
  • Like this: (facet_grid(group1 ~ group2))
# NOTE: gdprelative is not a column of gapminder; it must be added to the data first
# (it flags GDP larger/smaller than the continent mean)
ggplot(data = gapminder,
       mapping = aes(x = year, y = lifeExp)) +
  geom_line(mapping = aes(group = country), stat = "summary", linewidth = .25) +
  geom_line(stat = "summary", linewidth = 2, alpha = .5, color = "blue") +
  facet_grid(continent ~ gdprelative)
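
The slides do not show how gdprelative was built. One possible construction is sketched below, under the assumption that it flags whether a country's average GDP per capita lies above or below the average for its continent. The name gapminder_rel, the helper column continent_mean, and the two labels are hypothetical. The sketch assumes ggplot2 (3.4.0 or later, for linewidth), dplyr, and gapminder are installed.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# ASSUMPTION: `gdprelative` flags whether a country's average GDP per capita
# is above or below the average for its continent
gapminder_rel <- gapminder %>%
  group_by(continent) %>%
  mutate(continent_mean = mean(gdpPercap)) %>%
  group_by(country) %>%
  mutate(gdprelative = if_else(mean(gdpPercap) > continent_mean,
                               "Above continent mean",
                               "Below continent mean")) %>%
  ungroup()

p_grid <- ggplot(data = gapminder_rel, mapping = aes(x = year, y = lifeExp)) +
  geom_line(mapping = aes(group = country), linewidth = 0.25) +       # one line per country
  geom_line(stat = "summary", fun = mean,
            linewidth = 2, alpha = 0.5, color = "blue") +             # mean within each panel
  facet_grid(continent ~ gdprelative)

p_grid
```

Defining the flag per country (rather than per country-year) keeps each country's line inside a single panel.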

Layer: Coordinates

  • The coordinate system represents a physical mapping of the plot’s aesthetics onto the screen
  • Many types of coordinate systems, but we are most used to the Cartesian system (x, y value pairs)
  • Types of coordinate systems in ggplot2:
  • coord_cartesian(): Cartesian coordinates
  • coord_trans(): Transformed Cartesian coordinate system
  • coord_fixed(): Cartesian coordinates with a fixed aspect ratio
  • coord_flip(): Cartesian coordinates with x and y flipped
  • coord_polar(): Polar coordinates
  • coord_map(), coord_quickmap(): map projections (latitude, longitude)

Layer: Coordinates – coord_flip()

ggplot(data = gapminder,
       mapping = aes(x = continent,
                     y = gdpPercap)) +
  geom_bar(stat = "summary")

ggplot(data = gapminder,
       mapping = aes(x = continent,
                     y = gdpPercap)) +
  geom_bar(stat = "summary") +
  coord_flip()

Layer: Coordinates – coord_map()

Cartesian System

ggplot(data = map_data("world"),
       mapping = aes(x = long,
                     y = lat,
                     group = group)) +
  geom_path() +
  scale_y_continuous(breaks = (-2:2) * 30) +
  scale_x_continuous(breaks = (-4:4) * 45)

Azimuthal (orthographic) Projection

ggplot(data = map_data("world"),
       mapping = aes(x = long,
                     y = lat, 
                     group = group)) +
  geom_path() +
  scale_y_continuous(breaks = (-2:2) * 30) +
  scale_x_continuous(breaks = (-4:4) * 45) +
  coord_map(projection = "ortho",
            orientation = c(30, -94, 0))

Layer: Theme

  • The theme encompasses every part of the graphic that is not part of the data (i.e., has no mapping to the data)

  • There are several pre-made themes that come with ggplot2:

  • theme_grey() 👈 default

  • theme_bw()

  • theme_linedraw()

  • theme_light()

  • theme_dark()

  • theme_minimal()

  • theme_classic()

  • theme_void()

  • theme_test()
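
A minimal sketch of applying the pre-made themes, assuming ggplot2 and gapminder are installed (p_theme is an illustrative name). A complete theme is added to a plot like any other layer:

```r
library(ggplot2)
library(gapminder)

p_theme <- ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) +
  geom_line(stat = "summary", fun = mean)

p_theme                                   # theme_grey(), the default
p_theme + theme_bw()                      # white panel with a border
p_theme + theme_minimal()                 # no panel border or background
p_theme + theme_classic(base_size = 14)   # axis lines only, larger base font
```

Switching themes changes only the non-data parts of the figure; the plotted values stay the same.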

Layer: Theme

  • To tweak other aspects of your plot’s theme, you can use the theme() function, which has 94 arguments to give you complete control over all elements of your plot
  • To demonstrate, we’ll use the following plot from previous slides:
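
The object p used on the next slides is not defined in the text. A plausible reconstruction, assuming it is the GDP-by-continent plot from the Scales example (and that ggplot2 3.4.0 or later and gapminder are installed):

```r
library(ggplot2)
library(gapminder)

# ASSUMPTION: `p` is the GDP-by-continent plot built in the Scales example
p <- ggplot(data = gapminder,
            mapping = aes(x = year, y = gdpPercap * pop, color = continent)) +
  geom_line(stat = "summary", fun = mean, linewidth = 1.5) +
  geom_point(stat = "summary", fun = mean, shape = 21, fill = "white", size = 2) +
  scale_x_continuous(name = "Year", breaks = unique(gapminder$year)) +
  scale_y_continuous(name = "Gross Domestic Product (USD)") +
  scale_color_brewer(palette = "Set1")

# Theme tweaks can then be added to the stored object
p + theme(legend.position = "bottom")
```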

Layer: Theme – panel features

p +
  # linewidth replaces size for line and border elements in ggplot2 >= 3.4.0
  theme(panel.grid.major = element_line(color = "black", linetype = 2, linewidth = 0.25),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "white"),
        panel.border = element_rect(color = "black", fill = NA, linewidth = 1))

Layer: Theme – axes features

p +
  theme(title = element_text(family = "Ubuntu Mono", face = "bold"),
        axis.title.y = element_text(family = "Ubuntu Mono"),
        axis.title.x = element_blank(),
        axis.text = element_text(family = "Ubuntu Mono", color = "black", size = 11),
        axis.text.x = element_text(angle = 45, hjust = 1))

Layer: Theme – legend features

p +
  theme(legend.key = element_blank(),
        legend.title = element_text(family = "Ubuntu Mono"),
        legend.text = element_text(family = "Ubuntu Mono")) + 
  guides(color = guide_legend(title = "Continent"))

Layer: Theme – create your own

  • Just like ggplot2 comes with pre-made themes, you can create your own themes to use repeatedly throughout your data visualizations
theme_monotype <- theme(
  # Grid theme (linewidth replaces size for line and border elements in ggplot2 >= 3.4.0)
  panel.grid.major = element_line(color = "grey", linetype = 2, linewidth = 0.5),
  panel.grid.minor = element_blank(),
  panel.background = element_rect(fill = "white"),
  panel.border = element_rect(color = "black", fill = NA, linewidth = 1),

  # Axis theme
  title = element_text(family = "Ubuntu Mono", face = "bold"),
  axis.title.y = element_text(family = "Ubuntu Mono"),
  axis.title.x = element_blank(),
  axis.text = element_text(family = "Ubuntu Mono", color = "black", size = 11),
  axis.text.x = element_text(angle = 45, hjust = 1),

  # Legend theme
  legend.key = element_blank(),
  legend.title = element_text(family = "Ubuntu Mono"),
  legend.text = element_text(family = "Ubuntu Mono")
)

Layer: Theme – let’s use our new theme

# Without the custom theme
ggplot(data = gapminder,
       mapping = aes(x = year, y = gdpPercap * pop, color = continent)) + 
  geom_line(stat = "summary", linewidth = 1.5) +
  geom_point(stat = "summary", shape = 21, fill = "white", size = 2) +
  scale_x_continuous(name = "Year",
                     breaks = unique(gapminder$year)) +
  scale_y_continuous(name = "Gross Domestic Product (USD)") + 
  scale_color_brewer(palette = "Set1", name = "Continent") + 
  coord_cartesian(ylim = c(0, 8e+11)) +
  ggtitle("Gross Domestic Product Over Time by Continent")

# With the custom theme added as the last layer
ggplot(data = gapminder,
       mapping = aes(x = year, y = gdpPercap * pop, color = continent)) + 
  geom_line(stat = "summary", linewidth = 1.5) +
  geom_point(stat = "summary", shape = 21, fill = "white", size = 2) +
  scale_x_continuous(name = "Year",
                     breaks = unique(gapminder$year)) +
  scale_y_continuous(name = "Gross Domestic Product (USD)") + 
  scale_color_brewer(palette = "Set1", name = "Continent") + 
  coord_cartesian(ylim = c(0, 8e+11)) +
  ggtitle("Gross Domestic Product Over Time by Continent") +
  theme_monotype

Is there more?

YES! 🥳

ggplot2 extensions

There are a lot of extensions!

ggplot2 extensions

  • My favorites:
    • patchwork
    • ggpubr
    • GGally
    • ggstats

A final note…

There is a lot to learn about ggplot2 and data visualization, but I hope you have learned something today. If it is not the case…


TidyTuesday

TidyTuesday is a weekly challenge where people use (mostly) ggplot2 to explore a new dataset.

A weekly data project aimed at the R ecosystem. As this project was borne out of the R4DS Online Learning Community and the R for Data Science textbook, an emphasis was placed on understanding how to summarize and arrange data to make meaningful charts with ggplot2, tidyr, dplyr, and other tools in the tidyverse ecosystem.
