12 Data Visualisation I

What is R? - R is a language and environment for statistical computing and graphics.

Data visualisation serves two purposes:

Discovery (this session)
Presentation (next session)

Recommended readings

Formulae for statistical models
- Chapter 11, Statistical models in R
Base R graphics
- Chapter 12, Graphical procedures, An Introduction to R
Better graphics
- Ten Simple Rules for Better Figures
- https://github.com/rougier/ten-rules
- http://blogs.nature.com/methagora/2013/07/data-visualization-points-of-view.html

12.1 Anscombe’s quartet

Among the many datasets included in R is one named anscombe. It comes from the article Graphs in Statistical Analysis by F. J. Anscombe, published in The American Statistician in 1973.

There are eight columns in anscombe giving four pairs of x and y values:

(x1, y1)
(x2, y2)
(x3, y3)
(x4, y4)

Let’s take a look at some of the descriptive statistics of the dataset:

head(anscombe)

  x1 x2 x3 x4   y1   y2    y3   y4
1 10 10 10  8 8.04 9.14  7.46 6.58
2  8  8  8  8 6.95 8.14  6.77 5.76
3 13 13 13  8 7.58 8.74 12.74 7.71
4  9  9  9  8 8.81 8.77  7.11 8.84
5 11 11 11  8 8.33 9.26  7.81 8.47
6 14 14 14  8 9.96 8.10  8.84 7.04

attributes(anscombe)

$names
[1] "x1" "x2" "x3" "x4" "y1" "y2" "y3" "y4"

$class
[1] "data.frame"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11

summary(anscombe)

       x1             x2             x3             x4           y1        
 Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8   Min.   : 4.260  
 1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8   1st Qu.: 6.315  
 Median : 9.0   Median : 9.0   Median : 9.0   Median : 8   Median : 7.580  
 Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9   Mean   : 7.501  
 3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8   3rd Qu.: 8.570  
 Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19   Max.   :10.840  
       y2              y3              y4        
 Min.   :3.100   Min.   : 5.39   Min.   : 5.250  
 1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170  
 Median :8.140   Median : 7.11   Median : 7.040  
 Mean   :7.501   Mean   : 7.50   Mean   : 7.501  
 3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190  
 Max.   :9.260   Max.   :12.74   Max.   :12.500

sapply(anscombe, var)    # variance

       x1        x2        x3        x4        y1        y2        y3        y4 
11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620  4.123249

sapply(anscombe, median) # median

  x1   x2   x3   x4   y1   y2   y3   y4 
9.00 9.00 9.00 8.00 7.58 8.14 7.11 7.04

# correlation
cor(anscombe[["x1"]], anscombe[["y1"]])

[1] 0.8164205

cor(anscombe[["x2"]], anscombe[["y2"]])

[1] 0.8162365

cor(anscombe[["x3"]], anscombe[["y3"]])

[1] 0.8162867

cor(anscombe[["x4"]], anscombe[["y4"]])

[1] 0.8165214
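The least-squares regression lines agree as well. The following is a minimal sketch, assuming an ordinary linear fit with lm(); the names fits and coefs are illustrative and not part of the original material.

```r
# Fit y_i ~ x_i by ordinary least squares for each of the four pairs
fits <- lapply(1:4, function(i) {
  fm_i <- reformulate(paste0("x", i), response = paste0("y", i)) # e.g. y1 ~ x1
  lm(fm_i, data = anscombe)
})

# Collect intercepts and slopes into a 2 x 4 matrix, one column per pair
coefs <- sapply(fits, coef)
dimnames(coefs) <- list(c("intercept", "slope"), paste0("set", 1:4))
round(coefs, 2) # every pair gives roughly y = 3 + 0.5 x
```

All four fits have an intercept close to 3 and a slope close to 0.5, so the fitted line cannot tell the pairs apart either.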

Judging by these summary statistics, the four sets of x-y pairs look almost identical, so we might expect to see four very similar shapes once they are plotted:

#?anscombe
# PAR_DEFAULTS <- par(no.readonly = TRUE)

ff <- y ~ x
par(mfrow = c(2, 2), mar = 0.1 + c(4, 4, 1, 1), oma =  c(0, 0, 2, 0))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y", "x"), i), as.name)
  plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13))
}
mtext("Anscombe's quartet", outer = TRUE, cex = 1.5)

# par(PAR_DEFAULTS)

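This dataset, published in 1973, demonstrates the importance of data visualisation. Despite the near-identical summary statistics, the four plots look nothing alike: the first is a roughly linear scatter, the second follows a smooth curve, the third is a straight line with a single outlier, and in the fourth every x value is 8 except for one point at x = 19. Graphics are useful not only for presenting results, but also for data analysis and discovery.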

Original publication: F. J. Anscombe, Graphs in Statistical Analysis. The American Statistician
Wiki: Anscombe’s quartet
Datasaurus is a modern iteration of Anscombe’s quartet.

12.2 Formulae, `y ~ x`

?formula - An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model.

Some familiarity with statistical concepts and terminology is needed to understand how formulae are defined in R. We will talk about statistical models in the following sessions, and more details will be given in the Statistics module later in the term. The brief introduction given in this session is sufficient for the purposes of this module.

Suppose x1 and x2 are independent variables and y is the response (dependent) variable. A formula fm may then be defined as follows:

fm <- y ~ x1 + x2 # y is a function of x1 and x2
terms(fm)
typeof(fm)
class(fm)
attributes(fm)
str(fm)
length(fm)
fm[[1]]
fm[[2]]
fm[[3]]

#fm <- ~ x1 + x2 # one-sided formula
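For reference, here is a short sketch of what some of these inspection calls return. A formula is stored as an unevaluated call to `~`, so y, x1 and x2 do not need to exist as objects. The name one_sided is only illustrative.

```r
fm <- y ~ x1 + x2

class(fm)    # "formula"
typeof(fm)   # "language": a formula is an unevaluated call to `~`
length(fm)   # 3: the operator, the left-hand side, the right-hand side
fm[[1]]      # `~`
fm[[2]]      # y
fm[[3]]      # x1 + x2
all.vars(fm) # "y" "x1" "x2"

# A one-sided formula has no response, so it has only two components
one_sided <- ~ x1 + x2
length(one_sided) # 2
```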

A formula expresses a relationship between variables
~ (tilde) separates the left- and right-hand sides of a model formula
The LHS of ~ is the dependent variable (a.k.a. response, outcome, label)
The RHS of ~ holds the independent variables (a.k.a. predictors, controlled variables, features)

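+, add a term
-, remove a term
*, crossing: a * b expands to a + b + a:b, where a:b is the interaction
%in%, nesting
^, crossing to the specified degree: (a + b)^2 is equivalent to a * b
., all other variables in the data that have not been included in the formula
poly(x, degree = d), the orthogonal polynomials of degree 1 to d over x
I(x), x is treated as is, so arithmetic operators keep their usual meaning: x + I(x^2) is equivalent to poly(x, 2, raw = TRUE), and gives the same fitted values as the orthogonal poly(x, 2) but different coefficients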
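One way to check how an operator expands is to ask terms() for the term labels. The sketch below assumes a small helper, term_labels, which is hypothetical and defined here only for illustration; a, b and c are placeholder variable names.

```r
# Hypothetical helper: list the terms that a formula expands to
term_labels <- function(f, ...) attr(terms(f, ...), "term.labels")

term_labels(y ~ a + b)             # "a" "b"
term_labels(y ~ a * b)             # "a" "b" "a:b" (main effects plus interaction)
term_labels(y ~ a * b - a:b)       # "a" "b"       (interaction removed again)
term_labels(y ~ (a + b + c)^2)     # main effects and all two-way interactions
term_labels(dist ~ ., data = cars) # "speed"       (every other column of cars)
```

The dot needs a data argument so that R knows which columns "all other variables" refers to.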

fm <- as.formula("y ~ x1 + x2") # string to formula
typeof(fm)
terms(fm)
all.vars(fm)

fm <- update(fm, ~. + x3)
fm

fm <- y ~ x + I(x^2)  # I(), as-is operator
fm
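To see why I() matters, the sketch below fits a quadratic model to the built-in cars data in three ways. The model names (m_orth, m_raw, m_asis) are illustrative, and lm() is used only as a convenient way to compare the formulae.

```r
# Three ways to write a quadratic model of stopping distance on speed
m_orth <- lm(dist ~ poly(speed, 2), data = cars)             # orthogonal basis
m_raw  <- lm(dist ~ poly(speed, 2, raw = TRUE), data = cars) # raw powers
m_asis <- lm(dist ~ speed + I(speed^2), data = cars)         # powers via I()

# The fitted curve is the same in all three cases
all.equal(fitted(m_orth), fitted(m_asis))
all.equal(fitted(m_raw), fitted(m_asis))

# Raw powers and I() also share coefficients; the orthogonal basis does not
all.equal(unname(coef(m_raw)), unname(coef(m_asis)))
coef(m_orth)

# Without I(), ^ is read as formula crossing, so speed^2 collapses to speed
attr(terms(dist ~ speed + speed^2), "term.labels") # "speed"
```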

12.3 Base R graphics

Base R includes the graphics package (?graphics), which contains the functions for base graphics. These functions fall into three groups:

High-level commands - create a new plot
- plot
- barplot
- hist
- curve
- coplot
- …
Low-level commands - add information / graphics to existing plot
- points
- lines
- text
- abline
- polygon
- arrows
- legend
- title(main, sub)
- axis
- …
Interactive commands - respond to mouse clicks on the graph
- locator
- identify

These commands are all well documented, e.g. ?curve. In this session, we give a few examples of commonly used plotting commands.

To list the functions available in the graphics package:

ls("package:graphics")
lsf.str("package:graphics")

12.3.1 High-level commands

12.3.1.1 `plot`

# Plot three data points
plot(x = c(1, 2, 3), y = c(2, 5, 4))

We use the built-in dataset cars, which contains data for two variables, speed and stopping distance (dist):

summary(cars)

     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00

The simplest form:

plot(cars)

# as long as the data is in matrix form
plot(as.matrix(cars)) # matrix works just fine

# specify the columns in formula form
plot(dist ~ speed, data = cars)

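# specify x and y values as vectors
plot(x = cars$dist, y = cars$speed)
# Compared with the plots above:
# - the axes are swapped (dist is now on the x axis)
# - the axis labels read cars$dist and cars$speed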

Now let’s play spot the difference:

#PAR_DEFAULTS <- par(no.readonly = TRUE)

par(mfrow = c(2, 1))
plot(cars) # top plot
plot(      # bottom plot
  cars$dist ~ cars$speed, type = "b",
  main = "cars dataset", xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
  xlim = c(30, 0), ylim = c(0, 140), pch = 16, lty = "dashed", col = "blue",
  xaxs = "i", yaxs = "i"
)

col, the colour; to see all colour names, run either
- colors()
- demo("colors"), which also defines the helper showCols1()
type can be one of the following:
- p for points
- l for lines
- b for both points and lines
- c for the lines part of b alone (lines with gaps where the points would be)
- o for overplotted points and lines
- s and S for stair steps (s moves horizontally first, S vertically first)
- h for histogram-like vertical lines
- n for no plotting (only the axes are set up)
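The type options are easiest to compare side by side. The following sketch draws the first 15 rows of cars once per type on a 3 x 3 grid; the subset size and the margins are arbitrary choices made for readability.

```r
types <- c("p", "l", "b", "c", "o", "s", "S", "h", "n")

op <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1)) # 3 x 3 grid, tight margins
for (tp in types) {
  plot(dist ~ speed, data = cars[1:15, ], type = tp,
       main = paste0('type = "', tp, '"'))
}
par(op) # restore the previous settings
```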

12.3.1.2 `barplot`

dta <- c("red" = 10, "black" = 13, "silver" = 20, "yellow" = 4)
barplot(dta)

Ideal for plotting the result of table:

barplot(
  table(c("banana", "apple", "coconut", "apple", "apple", "banana", "banana", "banana")),
  horiz = TRUE
)

Values can be stacked:

barplot(cbind(Employed, Unemployed) ~ Year, data = longley)

barplot(height = 1:10, names.arg = letters[1:10]) # names.arg labels the bars
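barplot also returns the x position of each bar's midpoint, which is handy when adding labels with a low-level command. A minimal sketch, reusing the colour counts from above; the fill colour and the ylim value are arbitrary choices.

```r
dta <- c("red" = 10, "black" = 13, "silver" = 20, "yellow" = 4)

# barplot() invisibly returns the x position of each bar's midpoint
mids <- barplot(dta, ylim = c(0, 25), col = "grey80")

# Low-level text() call: print each value just above its bar
text(x = mids, y = dta, labels = dta, pos = 3)
mids
```

With the default bar width of 1 and spacing of 0.2, the midpoints are 0.7, 1.9, 3.1 and 4.3.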

12.3.1.3 `hist`

hist(1:10) # Compare this to the figure produced by `barplot`

hist implicitly tabulates the data: it sorts the values into bins and plots the count (or density) in each bin:

dta <- c(1, 1, 2, 2, 3, 3, 3, 4, 5, 6, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9)
par(mfrow = c(3, 1))
hist(x = dta)
hist(x = dta, breaks = c(0, 1, 2, 3, 5, 8, 9)) # note the change in y label
hist(x = dta, breaks = 2)
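The binning can be inspected without drawing anything by passing plot = FALSE. The sketch below assumes explicit breaks at 0:9 so that the comparison with table and cut is exact; the names h_equal and h_uneq are illustrative.

```r
dta <- c(1, 1, 2, 2, 3, 3, 3, 4, 5, 6, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9)

# Equal-width bins (0,1], (1,2], ..., (8,9]; plot = FALSE returns the numbers only
h_equal <- hist(dta, breaks = 0:9, plot = FALSE)
h_equal$counts
table(cut(dta, breaks = 0:9)) # the same counts, tabulated by hand

# Unequal bins: hist() switches the y axis from frequency to density so that
# bar areas, not bar heights, are comparable across bins
h_uneq <- hist(dta, breaks = c(0, 1, 2, 3, 5, 8, 9), plot = FALSE)
h_uneq$equidist                           # FALSE
sum(h_uneq$density * diff(h_uneq$breaks)) # bar areas sum to 1
```

This is why the y label changes from Frequency to Density when the breaks are unevenly spaced.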

Histograms are not so reliable when the dataset is small
They are good for representing a distribution when there is plenty of data:

n <- 10000
par(mfrow = c(3, 1))
hist(rnorm(n), breaks = 40)
hist(runif(n), breaks = 40)

d <- density(rnorm(n))
plot(d)

12.3.1.4 `stripchart`

stripchart treats each column of data as measurements of one variable.

stripchart(1:10)

Use jitter to distinguish overlapping points

dta <- rnorm(100)
par(mfrow = c(2, 1), mar = c(2, 2, 1, 1))
stripchart(dta, pch = 20)
stripchart(dta, method = "jitter", pch = 20)

A strip is created for each variable (column) in the dataset:

set.seed(1)
n <- 500
dtf <- data.frame(
  normal = rnorm(n),
  uniform = runif(n, min = -1, max = 1)
)
stripchart(
  dtf,
  at = c(1, 2), xlim = c(0.5, 2.5), ylim = c(-3, 3),
  method = "jitter", pch = 20,
  group.names = c("Normal", "Uniform"), xlab = "Distribution", ylab = "Value",
  vertical = TRUE,
  col = c("blue", "red")
)
abline(h = c(-1, 1), lty = "dashed")

12.3.1.5 `boxplot`

Descriptive statistics of data
Good for big datasets

boxplot(dtf) # compare this with stripchart

# abline(h = c(-1, 1), lty = "dashed")

Quantiles aren’t everything, so don’t rely on boxplots alone:

boxplot(anscombe)
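The numbers behind each box can be retrieved with plot = FALSE. The sketch below labels the rows of the returned stats matrix for readability; the row labels are added here and are not part of what boxplot returns.

```r
# plot = FALSE returns the statistics behind each box instead of drawing them
b <- boxplot(anscombe, plot = FALSE)
stats <- b$stats
dimnames(stats) <- list(
  c("lower whisker", "lower hinge", "median", "upper hinge", "upper whisker"),
  b$names
)
stats

# x1, x2 and x3 hold the same values, so their boxes are identical,
# yet y1, y2 and y3 relate to them in very different ways
identical(stats[, "x1"], stats[, "x2"])
```

A boxplot summarises each column on its own, so it cannot show how a y column depends on its x column.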

12.3.1.6 `pairs`

Correlation between pairs of variables

pairs(swiss, panel = panel.smooth, pch = ".")

12.3.1.7 `coplot`

Multiple variables involved
Focus on correlation between two variables

coplot(conc ~ Time | Subject, data = Theoph, show.given = FALSE)

# coplot(lat ~ long | depth * mag, data = quakes)

12.3.1.8 `matplot`

matplot plots the columns of a matrix or data frame as separate series:

n <- 20
ld <- c(3, 5, 8, 10)
dta <- data.frame(
  "a" = cumsum(rpois(n, lambda = ld[1])),
  "b" = cumsum(rpois(n, lambda = ld[2])),
  "c" = cumsum(rpois(n, lambda = ld[3])),
  "d" = cumsum(rpois(n, lambda = ld[4]))
)
matplot(dta, type = "o", pch = names(dta))

12.3.1.9 `dotchart`

Useful when:
the number of values to be plotted is manageable
all values are numeric and fall within a common range
all values are to be compared
both column and row names are important

dotchart(c("a" = 1:3, "b" = 4:6))

With row and column names (a matrix):

dotchart(VADeaths)

# dotchart(mtcars)

12.3.1.10 `curve`

Plot value of a function over an interval

{
par(pty="s")
curve(x^2 - 2, -3, 3, col = "blue", asp = 1, ylab = "", lwd = 2,
      main = expression(paste("Curve shifting ", x^{2})))
curve(x^2, add = TRUE, col = "black", lwd = 2)
curve((x + 1)^2 + 1, add = TRUE, col = "violet", lwd = 2)
abline(h = 0, v = 0)
abline(h = c(1, 2), v = -1, lty = "dashed")
axis(side = 2, at = 1)
axis(side = 1, at = -1)
arrows(x0 = 0, y0 = 0, x1 = 0, y1 = -2, code = 2, length = 0.2, angle = 15, col = "blue", lwd = 3)
arrows(x0 = 0, y0 = 0, x1 = -1, y1 = 0, code = 2, length = 0.2, angle = 15, col = "violet", lwd = 3)
arrows(x0 = -1, y0 = 0, x1 = -1, y1 = 1, code = 2, length = 0.2, angle = 15, col = "violet", lwd = 3)
legend("bottomright",
       legend = c(
          expression(x^{2}),
          expression(x^{2} - 2),
          expression((x + 1)^{2} + 1)
        ),
       lty = 1,
       lwd = 2,
       col = c("black", "blue", "violet")
)
}

See Section 12.2.1, Mathematical annotation, of An Introduction to R

From this example we see that

A plot can be added to the existing plot with the add option (only available in some high-level plotting commands, e.g. ?barplot, ?stripchart, ?curve)
Low-level commands can modify elements of the existing plot

12.3.2 Low-level commands

We’ve seen in the previous example how extra information can be added to plots created by high-level commands. We give another example:

{
xs <- seq(0, 1, by = 0.01)
ys <- sqrt(1 - xs^2)

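n <- 1000 # number of random points; must be defined before it is used
set.seed(1)
xp <- runif(n)
yp <- runif(n)

# TRUE for points that fall inside the quarter circle
ic <- xp^2 + yp^2 < 1
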
par(pty="s")
plot(0:1, 0:1, type = "n", asp = 1, xaxs = "i", yaxs = "i",
     xlab = "", ylab = "")
lines(x = xs, y = ys)
# curve(sqrt(1 - x^2), 0, 1, add = TRUE)
#curve(-sqrt(1 - x^2), -1, 1, add = TRUE)
polygon(c(xs, 1, 0), c(ys, 0, 0), col = "lightblue")
points(xp[ic], yp[ic], col = "blue", pch = 19)
points(xp[!ic], yp[!ic], col = "black", pch = 1)
text(0.4, 0.4, expression(pi/4), cex = 4, bg = "red")
title(main = "Monte Carlo integration", sub = expression(paste("estimating ", pi)))

}
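The figure illustrates the idea; the estimate itself takes only a few lines. The following is a minimal sketch of the calculation, assuming 1000 uniform points in the unit square. The names inside, pi_hat and se are illustrative.

```r
set.seed(1)
n <- 1000
xp <- runif(n) # uniform points in the unit square
yp <- runif(n)

inside <- xp^2 + yp^2 < 1 # TRUE when a point lies inside the quarter circle

# The quarter circle has area pi / 4, so the hit fraction estimates pi / 4
pi_hat <- 4 * mean(inside)
pi_hat

# Approximate standard error of the estimate (binomial proportion, scaled by 4)
se <- 4 * sqrt(mean(inside) * (1 - mean(inside)) / n)
se
```

The standard error shrinks with the square root of n, so each extra decimal place of accuracy needs roughly 100 times as many points.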

points, add scattered points
lines, add lines
text, annotate the graph with text
abline, add a straight line, given by intercept and slope or, as a shorthand, horizontal (h =) or vertical (v =)
polygon, add a shaded area
arrows, add arrows
legend, add a legend
title(main =, sub =), add a title and subtitle to the plot
axis, add an axis with custom tick marks or labels

12.4 `stats` graphics

The commands demonstrated so far are from the graphics package, which is part of base R. Another key package in base R is the stats package, which also includes some plotting commands. Take a look at ls("package:stats") for more. We give a short example:

scatter.smooth(x = cars$speed, y = cars$dist)
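The same figure can be rebuilt from a high-level and a low-level command. This sketch uses loess.smooth from the stats package to compute the curve; the line colour and width are arbitrary choices.

```r
# loess.smooth() computes the kind of curve that scatter.smooth() draws
sm <- loess.smooth(x = cars$speed, y = cars$dist)
str(sm) # a list of x and y coordinates along the fitted curve

# Rebuild the figure by hand: high-level plot(), then low-level lines()
plot(dist ~ speed, data = cars)
lines(sm, col = "red", lwd = 2)
```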

12.5 Graphical Parameters

Many graph settings are set via the par command. For a full list, refer to ?par. We give lookup tables for the two most frequently used parameters.
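Settings changed with par stay in effect for the rest of the graphics device's life, so it is good practice to save and restore them. A minimal sketch, assuming a freshly opened device with default settings; the name op is a common convention, not a requirement.

```r
# par() returns the previous values of the settings it changes
op <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

plot(cars, main = "cars")
hist(cars$dist, main = "Stopping distance")

par(op)      # restore the saved settings
par("mfrow") # back to 1 1 on a freshly opened device
```

The commented-out PAR_DEFAULTS <- par(no.readonly = TRUE) lines in the earlier examples follow the same idea, but save every modifiable setting at once.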

12.5.1 Point shapes `pch`

pch in R: How to Use Plot Character in R, R-Lang

ggpubr::show_point_shapes()

12.5.2 Line Type `lty`

ggpubr::show_line_types()
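The ggpubr calls above need that package to be installed. As an alternative, the sketch below draws similar lookup charts with base graphics only; the grid layout and label positions are arbitrary choices.

```r
op <- par(mfrow = c(1, 2), mar = c(1, 1, 2, 1))

# Point shapes: pch codes 0 to 25 on a grid, each labelled with its code
pch_codes <- 0:25
gx <- pch_codes %% 7
gy <- -(pch_codes %/% 7)
plot(gx, gy, pch = pch_codes, cex = 2, bg = "orange", axes = FALSE,
     xlab = "", ylab = "", main = "pch",
     xlim = c(-0.5, 6.5), ylim = c(-3.6, 0.4))
text(gx, gy - 0.35, labels = pch_codes, cex = 0.8)

# Line types: lty codes 1 to 6 with their names
lty_names <- c("solid", "dashed", "dotted", "dotdash", "longdash", "twodash")
plot(NULL, xlim = c(0, 1), ylim = c(0.5, 6.5), axes = FALSE,
     xlab = "", ylab = "", main = "lty")
segments(x0 = 0.45, y0 = 6:1, x1 = 1, lty = 1:6, lwd = 2)
text(0, 6:1, labels = paste(1:6, lty_names), adj = 0)

par(op) # restore the previous settings
```

The bg colour only affects shapes 21 to 25, which have a separate fill and border.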

Exercises

See Section 15.3 for exercises on plotting with Base R functions.
