12  Data Visualisation I

What is R? - R is a language and environment for statistical computing and graphics.

Data visualisation serves two purposes: presenting results, and supporting data analysis and discovery.

Recommended readings

12.1 Anscombe’s quartet

Among the many datasets included in R, there is one named anscombe. This dataset comes from the article Graphs in Statistical Analysis by F. J. Anscombe, published in The American Statistician in 1973.

There are eight columns in anscombe giving four pairs of x and y values:

  • (x1, y1)
  • (x2, y2)
  • (x3, y3)
  • (x4, y4)

Let’s take a look at some of the descriptive statistics of the dataset:

head(anscombe)
  x1 x2 x3 x4   y1   y2    y3   y4
1 10 10 10  8 8.04 9.14  7.46 6.58
2  8  8  8  8 6.95 8.14  6.77 5.76
3 13 13 13  8 7.58 8.74 12.74 7.71
4  9  9  9  8 8.81 8.77  7.11 8.84
5 11 11 11  8 8.33 9.26  7.81 8.47
6 14 14 14  8 9.96 8.10  8.84 7.04
attributes(anscombe)
$names
[1] "x1" "x2" "x3" "x4" "y1" "y2" "y3" "y4"

$class
[1] "data.frame"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11
summary(anscombe)
       x1             x2             x3             x4           y1        
 Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8   Min.   : 4.260  
 1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8   1st Qu.: 6.315  
 Median : 9.0   Median : 9.0   Median : 9.0   Median : 8   Median : 7.580  
 Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9   Mean   : 7.501  
 3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8   3rd Qu.: 8.570  
 Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19   Max.   :10.840  
       y2              y3              y4        
 Min.   :3.100   Min.   : 5.39   Min.   : 5.250  
 1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170  
 Median :8.140   Median : 7.11   Median : 7.040  
 Mean   :7.501   Mean   : 7.50   Mean   : 7.501  
 3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190  
 Max.   :9.260   Max.   :12.74   Max.   :12.500  
sapply(anscombe, var)    # variance
       x1        x2        x3        x4        y1        y2        y3        y4 
11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620  4.123249 
sapply(anscombe, median) # median
  x1   x2   x3   x4   y1   y2   y3   y4 
9.00 9.00 9.00 8.00 7.58 8.14 7.11 7.04 
# correlation
cor(anscombe[["x1"]], anscombe[["y1"]])
[1] 0.8164205
cor(anscombe[["x2"]], anscombe[["y2"]])
[1] 0.8162365
cor(anscombe[["x3"]], anscombe[["y3"]])
[1] 0.8162867
cor(anscombe[["x4"]], anscombe[["y4"]])
[1] 0.8165214

Judging by these summary statistics, the four sets of x-y pairs are almost indistinguishable. So we might expect to see four very similar shapes once they are plotted:

#?anscombe
# PAR_DEFAULTS <- par(no.readonly = TRUE)

ff <- y ~ x
par(mfrow = c(2, 2), mar = 0.1 + c(4, 4, 1, 1), oma =  c(0, 0, 2, 0))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y", "x"), i), as.name)
  plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13))
}
mtext("Anscombe's quartet", outer = TRUE, cex = 1.5)

# par(PAR_DEFAULTS)

The four plots, however, look very different, even though the summary statistics nearly coincide. This dataset, published in 1973, demonstrates the importance of data visualisation. Graphics are useful not only for presenting results, but also for data analysis and discovery.
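
The same point can be made with a fitted model. The following is a minimal sketch, not part of the original analysis: it fits a straight line to each pair and collects the coefficients. The names fits and set1 to set4 are chosen here for illustration only.

```r
# Fit y_i ~ x_i for each of the four pairs and collect the coefficients
fits <- sapply(1:4, function(i) {
  # reformulate() builds the formula y1 ~ x1, y2 ~ x2, ... from strings
  fm <- reformulate(paste0("x", i), response = paste0("y", i))
  coef(lm(fm, data = anscombe))
})
rownames(fits) <- c("intercept", "slope")
colnames(fits) <- paste0("set", 1:4)
round(fits, 2)  # each column is approximately intercept 3, slope 0.5
```

The fitted lines are practically the same for all four sets, so the regression output alone would not reveal how differently the points are arranged.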

12.2 Formulae, y ~ x

?formula - An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model.

Familiarity with statistical concepts and terms is needed to understand formula definitions in R. We will talk about statistical models in the following sessions, and more details will be given in the Statistics module later in the term. We give a brief introduction in this session, which is sufficient for the purpose of this module.

Suppose x1 and x2 are independent variables and y is the response (dependent) variable. A formula fm may then be defined as follows:

fm <- y ~ x1 + x2 # y is a function of x1 and x2
terms(fm)
typeof(fm)
class(fm)
attributes(fm)
str(fm)
length(fm)
fm[[1]]
fm[[2]]
fm[[3]]

#fm <- ~ x1 + x2 # one-sided formula
  • A formula expresses a relationship between variables
  • ~ (tilde) separates the left- and right-hand sides of a model formula
  • The LHS of ~ is the dependent variable (a.k.a. response, outcome, label)
  • The RHS of ~ holds the independent variables (a.k.a. predictors, controlled variables, features)

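  • +, add a variable (term) to the model
  • -, remove a variable (term)
  • *, crossing: a * b expands to a + b + a:b
  • %in%, nesting
  • ^, crossing to the specified degree, e.g. (a + b)^2 is the same as a * b
  • ., all other variables in the data that are not already in the formula
  • poly(x, degree = d), the orthogonal polynomials of degree d over x
  • I(x), x is treated as is, e.g. I(x^2) is the arithmetic square of x; poly(x, 2, raw = TRUE) yields the same terms as x + I(x^2)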
fm <- as.formula("y ~ x1 + x2") # string to formula
typeof(fm)
terms(fm)
all.vars(fm)

fm <- update(fm, ~. + x3)
fm

fm <- y ~ x + I(x^2)  # I(), as-is operator
fm
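
To see how the operators listed above expand, the term labels of a formula can be inspected without fitting anything. This is a small sketch; the data frame dat and the variable names a, b and c are made up for illustration.

```r
# Operators expand into terms; terms() does not need the variables to exist
attr(terms(y ~ a * b), "term.labels")          # "a" "b" "a:b"
attr(terms(y ~ (a + b + c)^2), "term.labels")  # "a" "b" "c" "a:b" "a:c" "b:c"
attr(terms(y ~ a * b - a:b), "term.labels")    # "a" "b"

# Inside a formula ^ means crossing, so x^2 collapses to x;
# I() is needed to get the arithmetic square
dat <- data.frame(x = 1:5, y = c(1, 4, 9, 16, 25))  # hypothetical data
cols_plain <- colnames(model.matrix(y ~ x + x^2, data = dat))
cols_asis  <- colnames(model.matrix(y ~ x + I(x^2), data = dat))
cols_plain  # "(Intercept)" "x"
cols_asis   # "(Intercept)" "x" "I(x^2)"
```

The intercept column appears because a model formula includes the constant term 1 unless it is removed with - 1 or + 0.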

12.3 Base R graphics

Part of base R is the graphics package (?graphics) which contains functions for base graphics. The graphic functions are divided into three groups:

  • High-level commands - create a new plot
    • plot
    • barplot
    • hist
    • curve
    • coplot
  • Low-level commands - add information / graphics to existing plot
    • points
    • lines
    • text
    • abline
    • polygon
    • arrows
    • legend
    • title(main, sub)
    • axis
  • Interactive - reactive to mouse clicks on the graph
    • locator
    • identify

These commands are all well documented, e.g. ?curve. In this session, we give a few examples of commonly used plotting commands.

To list the functions available in the graphics package:

  • ls("package:graphics")
  • lsf.str("package:graphics")

12.3.1 High-level commands

12.3.1.1 plot

# Plot three data points
plot(x = c(1, 2, 3), y = c(2, 5, 4))

We use the built-in dataset cars, which contains data for two variables:

summary(cars)
     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00  

The simplest form:

plot(cars)

# as long as the data is in matrix form
plot(as.matrix(cars)) # matrix works just fine

# specify the columns in formula form
plot(dist ~ speed, data = cars)

# specify xy values by vector
plot(x = cars$dist, y = cars$speed)
# - the axes are swapped compared with plot(cars)
# - the axis labels become cars$dist and cars$speed

Now let’s play spot the difference:

#PAR_DEFAULTS <- par(no.readonly = TRUE)

par(mfrow = c(2, 1))
plot(cars) # top plot
plot(      # bottom plot
  cars$dist ~ cars$speed, type = "b",
  main = "cars dataset", xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
  xlim = c(30, 0), ylim = c(0, 140), pch = 16, lty = "dashed", col = "blue",
  xaxs = "i", yaxs = "i"
)

  • col, to get all colour names, run either

    • colors()
    • demo("colors"), which also defines the helper showCols1()
  • type can be one of the following:

    • p for points
    • l for lines
    • b for both points and lines
    • c for empty points joined by lines
    • o for overplotted points and lines
    • s and S for stair steps
    • h for histogram-like vertical lines
    • n does not produce any points or lines
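
The effect of each type value is easiest to judge side by side. The sketch below draws the same five made-up points once per type; the vectors types, xv and yv are illustrative names, not from the original text.

```r
types <- c("p", "l", "b", "c", "o", "s", "S", "h", "n")
xv <- 1:5
yv <- c(2, 5, 3, 4, 1)  # arbitrary values, for illustration only

op <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))  # 3 x 3 grid, keep old settings
for (tp in types) {
  plot(xv, yv, type = tp, main = paste0('type = "', tp, '"'))
}
par(op)  # restore the previous layout and margins
```

The last panel, type = "n", sets up the axes only, which is handy as a blank canvas for low-level commands.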

12.3.1.2 barplot

dta <- c("red" = 10, "black" = 13, "silver" = 20, "yellow" = 4)
barplot(dta)

Ideal for plotting the result of table:

barplot(
  table(c("banana", "apple", "coconut", "apple", "apple", "banana", "banana", "banana")),
  horiz = TRUE
)

  • Values can be stacked
barplot(cbind(Employed, Unemployed) ~ Year, data = longley)

barplot(height = 1:10, names.arg = letters[1:10])
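
When height is a matrix, each column becomes one bar and the rows are stacked within it; beside = TRUE places them next to each other instead. A short sketch using the built-in VADeaths matrix (the variable mids is an illustrative name):

```r
op <- par(mfrow = c(1, 2), mar = c(6, 3, 2, 1), las = 2)  # las = 2: rotate labels
barplot(VADeaths, main = "stacked")
# barplot() returns the bar midpoints, useful for adding text or lines later
mids <- barplot(VADeaths, beside = TRUE, main = "beside",
                legend.text = rownames(VADeaths),
                args.legend = list(x = "topleft", cex = 0.6))
par(op)
dim(mids)  # one midpoint per bar: 5 age groups x 4 population groups
```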

12.3.1.3 hist

hist(1:10) # Compare this to the figure produced by `barplot`

hist bins the values and counts how many fall into each bin, so a table-like tabulation is implied:

dta <- c(1, 1, 2, 2, 3, 3, 3, 4, 5, 6, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9)
par(mfrow = c(3, 1))
hist(x = dta)
hist(x = dta, breaks = c(0, 1, 2, 3, 5, 8, 9)) # note the change in y label
hist(x = dta, breaks = 2)
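
The tabulation behind hist can be checked directly. The sketch below, added for illustration, compares the counts returned by hist with those from table and cut on the same breaks, and shows why the y label changes when the bins have unequal widths. The names h, tab and hu are chosen here.

```r
dta <- c(1, 1, 2, 2, 3, 3, 3, 4, 5, 6, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9)

# plot = FALSE returns the histogram object without drawing it
h <- hist(dta, breaks = 0:9, plot = FALSE)
h$counts                               # 2 2 3 1 1 1 2 4 4

# The same counts from cut() + table(); bins are (0,1], (1,2], ...
tab <- table(cut(dta, breaks = 0:9))
as.vector(tab)

# With unequal bin widths hist() plots densities, which integrate to 1
hu <- hist(dta, breaks = c(0, 1, 2, 3, 5, 8, 9), plot = FALSE)
sum(hu$density * diff(hu$breaks))
```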

  • Not so reliable when the sample size is small
  • Good for representing a distribution
n <- 10000
par(mfrow = c(3, 1))
hist(rnorm(n), breaks = 40)
hist(runif(n), breaks = 40)

d <- density(rnorm(n))
plot(d)

12.3.1.4 stripchart

  • stripchart treats each column of data as measurements of one variable.
stripchart(1:10)

  • Use jitter to distinguish overlapping points
dta <- rnorm(100)
par(mfrow = c(2, 1), mar = c(2, 2, 1, 1))
stripchart(dta, pch = 20)
stripchart(dta, method = "jitter", pch = 20)

  • A strip is created for each variable (column) in the dataset:
set.seed(1)
n <- 500
dtf <- data.frame(
  normal = rnorm(n),
  uniform = runif(n, min = -1, max = 1)
)
stripchart(
  dtf,
  at = c(1, 2), xlim = c(0.5, 2.5), ylim = c(-3, 3),
  method = "jitter", pch = 20,
  group.names = c("Normal", "Uniform"), xlab = "Distribution", ylab = "Value",
  vertical = TRUE,
  col = c("blue", "red")
)
abline(h = c(-1, 1), lty = "dashed")

12.3.1.5 boxplot

  • Descriptive statistics of data
  • Good for big datasets
boxplot(dtf) # compare this with stripchart

# abline(h = c(-1, 1), lty = "dashed")
  • Quantiles aren’t everything; don’t rely on boxplots alone
boxplot(anscombe)
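
One way to keep the summary and the raw data in view together is to overlay a stripchart on a boxplot with add = TRUE. This is a sketch under assumed data: dtf2 and bp are illustrative names, and the sample is simulated in the same way as the stripchart example.

```r
set.seed(1)
dtf2 <- data.frame(
  normal  = rnorm(200),
  uniform = runif(200, min = -1, max = 1)
)

# outline = FALSE hides the outlier symbols; the stripchart shows every point
bp <- boxplot(dtf2, outline = FALSE, ylim = c(-3.5, 3.5))
stripchart(dtf2, add = TRUE, vertical = TRUE, method = "jitter",
           pch = 20, col = adjustcolor("blue", alpha.f = 0.3))

bp$stats  # the five numbers drawn for each box, one column per variable
```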

12.3.1.6 pairs

  • Correlation between pairs of variables
pairs(swiss, panel = panel.smooth, pch = ".") 

12.3.1.7 coplot

  • Multiple variables involved
  • Focus on correlation between two variables
coplot(conc ~ Time | Subject, data = Theoph, show.given = FALSE)

# coplot(lat ~ long | depth * mag, data = quakes)

12.3.1.8 matplot

  • matplot, plot columns of matrices
n <- 20
ld <- c(3, 5, 8, 10)
dta <- data.frame(
  "a" = cumsum(rpois(n, lambda = ld[1])),
  "b" = cumsum(rpois(n, lambda = ld[2])),
  "c" = cumsum(rpois(n, lambda = ld[3])),
  "d" = cumsum(rpois(n, lambda = ld[4]))
)
matplot(dta, type = "o", pch = names(dta))

12.3.1.9 dotchart

  • the number of values to be plotted is manageable
  • all values are numeric and share a common range
  • all values are to be compared
  • both column and row names are important
dotchart(c("a" = 1:3, "b" = 4:6))

  • with row names
dotchart(VADeaths)

# dotchart(mtcars)

12.3.1.10 curve

  • Plot value of a function over an interval
{
par(pty="s")
curve(x^2 - 2, -3, 3, col = "blue", asp = 1, ylab = "", lwd = 2,
      main = expression(paste("Curve shifting ", x^{2})))
curve(x^2, add = TRUE, col = "black", lwd = 2)
curve((x + 1)^2 + 1, add = TRUE, col = "violet", lwd = 2)
abline(h = 0, v = 0)
abline(h = c(1, 2), v = -1, lty = "dashed")
axis(side = 2, at = 1)
axis(side = 1, at = -1)
arrows(x0 = 0, y0 = 0, x1 = 0, y1 = -2, code = 2, length = 0.2, angle = 15, col = "blue", lwd = 3)
arrows(x0 = 0, y0 = 0, x1 = -1, y1 = 0, code = 2, length = 0.2, angle = 15, col = "violet", lwd = 3)
arrows(x0 = -1, y0 = 0, x1 = -1, y1 = 1, code = 2, length = 0.2, angle = 15, col = "violet", lwd = 3)
legend("bottomright",
       legend = c(
          expression(x^{2}),
          expression(x^{2} - 2),
          expression((x + 1)^{2} + 1)
        ),
       lty = 1,
       lwd = 2,
       col = c("black", "blue", "violet")
)
}

See 12.2.1 Mathematical annotation

From this example we see that

  • Plots can be added to an existing plot with the add option (only available in some high-level plotting commands, e.g. ?barplot, ?stripchart, ?curve)
  • Low-level commands can modify elements in the existing plot

12.3.2 Low-level commands

We’ve seen in the previous example how extra information can be added to plots created by high-level commands. We give another example:

{
xs <- seq(0, 1, by = 0.01)
ys <- sqrt(1 - xs^2)

set.seed(1)
n <- 1000             # number of random points; must be set before runif(n)
xp <- runif(n)
yp <- runif(n)

ic <- xp^2 + yp^2 < 1 # TRUE for points inside the quarter circle
par(pty="s")
plot(0:1, 0:1, type = "n", asp = 1, xaxs = "i", yaxs = "i",
     xlab = "", ylab = "")
lines(x = xs, y = ys)
# curve(sqrt(1 - x^2), 0, 1, add = TRUE)
#curve(-sqrt(1 - x^2), -1, 1, add = TRUE)
polygon(c(xs, 1, 0), c(ys, 0, 0), col = "lightblue")
points(xp[ic], yp[ic], col = "blue", pch = 19)
points(xp[!ic], yp[!ic], col = "black", pch = 1)
text(0.4, 0.4, expression(pi/4), cex = 4, bg = "red")
title(main = "Monte Carlo integration", sub = expression(paste("estimating ", pi)))

}
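
The figure illustrates the idea; the estimate itself follows from the share of points inside the quarter circle, which approximates its area pi/4. A minimal sketch using the same seed and sample size as the plot (pi_hat is a name introduced here):

```r
set.seed(1)
n  <- 1000
xp <- runif(n)
yp <- runif(n)

ic <- xp^2 + yp^2 < 1   # inside the quarter circle of radius 1
pi_hat <- 4 * mean(ic)  # share of points inside, scaled by 4
pi_hat
abs(pi_hat - pi)        # absolute error of the estimate
```

The error shrinks slowly, roughly in proportion to 1 / sqrt(n), so many more points are needed for each extra digit of accuracy.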

  • points, add scattered points
  • lines, add lines
  • text, annotate the graph with text
  • abline, add a straight line; h = and v = are shorthand for horizontal and vertical lines
  • polygon, add a shaded area
  • arrows, add arrows
  • legend, add a legend
  • title(main =, sub =), add a title to the plot
  • axis, add or customise an axis

12.4 stats graphics

The commands we demonstrated so far are from the graphics package, which is part of base R. Another key package in base R is the stats package, which also includes some plotting commands. Take a look at ls("package:stats") for more. We give a short example:

scatter.smooth(x = cars$speed, y = cars$dist)
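
Functions from stats also combine well with the low-level commands. As a sketch (not from the original text), the following adds a least-squares line and a lowess smoother to a scatter plot of cars; fit is a name chosen for illustration.

```r
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

fit <- lm(dist ~ speed, data = cars)  # straight-line fit from stats
abline(fit, col = "red", lwd = 2)     # abline() accepts a fitted lm object
lines(lowess(cars$speed, cars$dist), col = "blue", lwd = 2, lty = "dashed")

legend("topleft", legend = c("lm", "lowess"),
       col = c("red", "blue"), lwd = 2, lty = c("solid", "dashed"))
coef(fit)
```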

12.5 Graphical Parameters

Many of the graph settings are set via the par command. For a full list, refer to ?par. We give a lookup table for the two most frequently used parameters.

12.5.1 Point shapes pch

pch in R: How to Use Plot Character in R, R-Lang

ggpubr::show_point_shapes()

12.5.2 Line Type lty

ggpubr::show_line_types()
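
If ggpubr is not installed, similar lookup charts can be drawn with base graphics alone. The sketch below is one possible layout, not the ggpubr implementation; it also shows the usual pattern of saving the value returned by par and restoring it afterwards.

```r
op <- par(mfrow = c(1, 2), mar = c(1, 1, 2, 1))  # op holds the old settings

# Point shapes 0 to 25, laid out on a 6-column grid
i <- 0:25
plot(i %% 6, i %/% 6, pch = i, cex = 2, bg = "orange",
     xlim = c(-0.5, 5.5), ylim = c(4.8, -0.5),  # reversed y: 0 at the top
     axes = FALSE, xlab = "", ylab = "", main = "pch")
text(i %% 6, i %/% 6 + 0.4, labels = i, cex = 0.8)

# Line types 1 to 6
plot(0, 0, type = "n", xlim = c(0, 1), ylim = c(6.5, 0.5),
     axes = FALSE, xlab = "", ylab = "", main = "lty")
for (k in 1:6) {
  lines(c(0.2, 1), c(k, k), lty = k, lwd = 2)
}
text(0.05, 1:6, labels = 1:6)

par(op)  # restore the previous settings
```

Shapes 21 to 25 are the only ones that use bg as a fill colour, which is why the orange shows up in those symbols alone.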

Exercises

See Section 15.3 for exercises on plotting with Base R functions.