Visualizing data

The Grammar of Graphics

As explained in the lecture, the Grammar of Graphics, developed by Leland Wilkinson [1], is a grand conception of the nature of scientific graphics. It relies on an analogy with linguistic grammar: a plot has parts just as a sentence has a subject, a predicate, and subordinate clauses.

From this point of view, what is a plot?

A representation of data using objects drawn on a 2D surface, e.g., a page or screen.
The representation may involve statistical transformations.
Properties of the data are mapped onto perceptible qualities, e.g., position, color, shape, size, and transparency.
All representations involve a choice of scale for each of the perceptible qualities, and a choice of coordinates for the 2D page.

According to the grammar, statistical graphics are composed of elements analogous to the parts of speech:

the data itself,
the geometrical objects that actually represent the data (geoms),
the mappings of data variables onto perceptible qualities called aesthetics,
and statistical transformations of the data (stats).

In this conception, a plot is composed of one or more layers. Each layer has the components above. In addition to the layers, plots must have

a set of scales modifying the data-to-aesthetics mapping,
a coordinate system mapping the plot onto the page, and
a faceting specification, if there are multiple plots in the graphic.
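
As a rough sketch of how these components can appear in ggplot2 code, the call below names each one explicitly. The data frame `toy` and its values are invented for illustration and are not the course data; the particular choices of geom, stat, scale, and facets are arbitrary.

```r
library(ggplot2)

# Toy data invented for illustration (not the course data).
toy <- data.frame(
  lifespan = c(12, 15, 15, 20, 22, 22, 22, 30),
  type     = rep(c("transgenic", "wildtype"), each = 4)
)

g <- ggplot(data = toy,                      # the data
            mapping = aes(x = lifespan,      # mappings of variables
                          fill = type)) +    #   onto aesthetics
  layer(geom = "bar",                        # geometrical object
        stat = "count",                      # statistical transformation
        position = "stack") +                # position adjustment
  scale_fill_manual(values = c(transgenic = "red",
                               wildtype = "blue")) +  # a scale
  coord_cartesian() +                        # coordinate system
  facet_wrap(~type)                          # faceting specification

print(g)
```

In everyday use, most of these pieces are left at their defaults and only the data, the mapping, and one or more geoms are written out.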

Data visualization with ggplot2

The package ggplot2 implements the grammar of graphics in an elegant way.

Let’s look at some basic plots. Make sure you have the ggplot2, plyr, reshape2, and magrittr packages installed. Whenever you’re working with ggplot2, it’s helpful to have a browser open and pointed at docs.ggplot2.org.

We’ll start by exploring the transgenic mosquito data.

library(plyr)
library(reshape2)
library(magrittr)
options(stringsAsFactors=FALSE)

course.url <- "https://kinglab.eeb.lsa.umich.edu/202/data/"

paste0(course.url,"mosquitoes.csv") %>%
  read.csv(comment.char="#") -> dat
sapply(dat,class)
head(dat)

These are the raw data, just the observations themselves: there are no derived variables in the dataset.

To construct a plot, we begin with a call to ggplot(), which establishes the data and mapping components:

library(ggplot2)
dat %>% ggplot(aes(x=lifespan)) -> p

Note that this has no side effects: it simply stores some information in p.
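
To see what such an object holds, one can inspect it directly. This is a minimal sketch using a toy data frame invented for illustration (not the course data); the names `toy` and `p0` are hypothetical.

```r
library(ggplot2)

# Toy data invented for illustration (not the course data).
toy <- data.frame(lifespan = c(12, 15, 15, 20, 22, 22, 22, 30))

p0 <- ggplot(toy, aes(x = lifespan))  # nothing is drawn here

inherits(p0, "ggplot")   # the result is a plot object
length(p0$layers)        # number of layers added so far
p0$mapping               # the stored aesthetic mapping
```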

Now we can begin to add layers. Let’s add a barplot layer:

p+layer(geom='bar',params=list(color='black',fill='blue',width=0.5),
        position="dodge",stat='count')

Because each geom is associated with a default stat (and vice versa), and has defaults for most or all of the settings, we can usually use simpler notation:

p+geom_bar()

This draws the same bars, with default colors and widths, because count is the default stat associated with geom_bar. As usual, we can choose different settings for various attributes of the geom:

p+geom_bar(fill="darkblue",width=0.5)
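
One way to check which stat a geom uses by default is to inspect the layer object that the geom function returns. The sketch below relies on the `stat` and `geom` fields of ggplot2 layer objects; these are internal details of the package and may change between versions, so treat this as an exploratory aid rather than a stable interface.

```r
library(ggplot2)

# Each geom_*() call returns a layer object that records its stat.
bar_layer <- geom_bar()
class(bar_layer$stat)[1]     # class name of the default stat for geom_bar

# Conversely, each stat_*() call records a default geom.
count_layer <- stat_count()
class(count_layer$geom)[1]   # class name of the default geom for stat_count
```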

Exercise

What does the following code do?

dat %>%
  ggplot(aes(x=lifespan,fill=type))+
  geom_bar()

What about the following?

dat %>%
  ggplot(aes(x=lifespan,fill=type))+
  geom_bar(position='dodge')

It’s important to note that ggplot2 graphics differ from base graphics: none of the plot-construction commands has the side effect of drawing anything. Rather, the graphic is generated only when the constructed object is printed. Thus

q <- p+geom_bar(color=NA,fill='blue',width=0.5)

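produces no side effect. To display the graphic, we must print it. (At the top level of an interactive session, typing the name of the object prints it automatically.)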

print(q)
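
The explicit call to print() matters most inside loops and functions, where automatic printing does not happen. The following sketch writes one page per group to a PDF file; the toy data are invented for illustration (not the course data) and the output file is a temporary, hypothetical one.

```r
library(ggplot2)

# Toy data invented for illustration (not the course data).
toy <- data.frame(
  lifespan = c(12, 15, 15, 20, 22, 22, 22, 30),
  type     = rep(c("transgenic", "wildtype"), each = 4)
)

out <- tempfile(fileext = ".pdf")   # hypothetical output file
pdf(out)
for (ty in unique(toy$type)) {
  g <- ggplot(subset(toy, type == ty), aes(x = lifespan)) +
    geom_bar() +
    ggtitle(ty)
  print(g)   # without print(), nothing would be drawn inside the loop
}
dev.off()
```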

Let’s look more closely at how the lifespans differ between the treatments. We can bin the lifespans to make the plot less noisy:

dat %>%
  ggplot(aes(x=lifespan,group=type,fill=type))+
  geom_histogram(position='dodge',binwidth=5)

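Alternatively, we can estimate a smooth probability density for the lifespan. This avoids the need to choose a bin width, although the smoothness of the estimate still depends on a kernel bandwidth, which is selected automatically by default.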

dat %>%
  ggplot(aes(x=lifespan,group=type,fill=type))+
  geom_density(alpha=0.5) -> pl
pl

Note the transparency setting (alpha). We can add a rug to the plot to show the actual data:

pl+geom_rug()
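
The smoothness of a density estimate like the one above can be adjusted through the `adjust` argument of geom_density, which multiplies the default bandwidth. The sketch below uses simulated lifespans (random draws made up for illustration, not the course data) to compare a narrower and a wider bandwidth.

```r
library(ggplot2)

# Simulated data for illustration only (not the course data).
set.seed(1)
toy <- data.frame(
  lifespan = c(rexp(100, rate = 1/20), rexp(100, rate = 1/30)),
  type     = rep(c("transgenic", "wildtype"), each = 100)
)

base <- ggplot(toy, aes(x = lifespan, fill = type))

narrow <- base + geom_density(alpha = 0.5, adjust = 0.5)  # half the default bandwidth
wide   <- base + geom_density(alpha = 0.5, adjust = 2)    # twice the default bandwidth

print(narrow)
print(wide)
```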

Another way to look at how the distributions of lifespans differ is to use an empirical cumulative distribution function.

dat %>%
  ggplot(aes(x=lifespan,group=type,color=type))+
  stat_ecdf()
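
For intuition about what the ECDF plot shows, the same quantity can be computed by hand with the base R function ecdf(). The lifespans below are toy values invented for illustration, not the course data.

```r
# Toy lifespans invented for illustration (not the course data).
x <- c(12, 15, 15, 20, 22, 22, 22, 30)

Fhat <- ecdf(x)    # base R: returns a step function
Fhat(20)           # fraction of observations less than or equal to 20
mean(x <= 20)      # the same quantity computed directly
```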

Exercise

What are some other ways to visualize these data? What about scatterplots, boxplots, and violin plots? [Hint: check out geom_point, geom_boxplot, and geom_violin.]

Scales

In the grammar of graphics, scales modify the mapping of data variables onto aesthetics. For example, we can put the x-axis on a logarithmic scale and choose the fill colors by hand:

dat %>%
  ggplot(aes(x=lifespan,group=type,fill=type))+
  geom_density(alpha=0.5)+
  scale_x_log10() -> pl
pl

pl+scale_fill_manual(values=c(transgenic="red",wildtype="blue"))

Similar scaling functions exist for all the aesthetics (see docs.ggplot2.org).
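
As an illustration of scale functions for other aesthetics, the sketch below sets the axis breaks of a position scale and the palette of a color scale. The toy data are invented for illustration (not the course data), and the palette and legend title are arbitrary choices.

```r
library(ggplot2)

# Toy data invented for illustration (not the course data).
toy <- data.frame(
  lifespan = c(12, 15, 15, 20, 22, 22, 22, 30),
  type     = rep(c("transgenic", "wildtype"), each = 4)
)

g <- ggplot(toy, aes(x = lifespan, color = type)) +
  stat_ecdf() +
  scale_x_continuous(breaks = seq(10, 30, by = 5)) +           # position scale
  scale_color_brewer(palette = "Set1", name = "mosquito type") # color scale

print(g)
```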

Faceting

We can break a plot into facets using facet_wrap and facet_grid. For example:

pl+facet_wrap(~type)

pl+facet_wrap(~type,ncol=1)

pl+facet_grid(type~.)
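
With two grouping variables, facet_grid lays the panels out in rows and columns. The sketch below assumes a second grouping variable, `batch`, which is hypothetical and does not appear in the course data; all values are invented for illustration.

```r
library(ggplot2)

# Toy data invented for illustration; `batch` is a hypothetical second
# grouping variable that does not appear in the course data.
toy <- data.frame(
  lifespan = c(12, 15, 15, 20, 22, 22, 22, 30),
  type     = rep(c("transgenic", "wildtype"), each = 4),
  batch    = rep(c("A", "B"), times = 4)
)

g <- ggplot(toy, aes(x = lifespan)) +
  geom_histogram(binwidth = 5) +
  facet_grid(type ~ batch)   # rows: type, columns: batch

print(g)
```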

Fine tuning

It is possible to produce beautiful, publication-quality graphs using ggplot2. In doing so, you will usually want to fine-tune the figure through choice of fonts, colors, labels, legend positions, etc. Investigate the following topics in the ggplot2 documentation to get a start on this.

Standard themes: theme_bw
Customizing themes: theme
Setting labels and limits: labs, lims
Legends: guides
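
A minimal sketch that touches each of the topics listed above is given below. The toy data, label text, limits, and theme settings are all invented for illustration; none of them is a recommendation.

```r
library(ggplot2)

# Toy data invented for illustration (not the course data).
toy <- data.frame(
  lifespan = c(12, 15, 15, 20, 22, 22, 22, 30),
  type     = rep(c("transgenic", "wildtype"), each = 4)
)

g <- ggplot(toy, aes(x = lifespan, fill = type)) +
  geom_histogram(binwidth = 5, position = "dodge") +
  labs(x = "lifespan", y = "number of mosquitoes",
       fill = "type", title = "Toy example") +     # labels
  lims(y = c(0, 5)) +                              # axis limits
  theme_bw() +                                     # a standard theme
  theme(legend.position = "bottom",                # customizations go after
        axis.title = element_text(size = 12)) +    #   the standard theme
  guides(fill = guide_legend(nrow = 1))            # legend layout

print(g)
```

Note the order: a call to theme() placed before theme_bw() would be overridden, since a complete theme replaces all earlier theme settings.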

Exploring the measles data

Download and pre-process the data by doing

paste0(course.url,"ewcitmeas.csv") %>%
  read.csv(comment.char="#",colClasses=c(date="Date")) %>%
  melt(id.vars="date",value.name="cases",variable.name="town") -> meas
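
The call to melt converts the data from wide format (one column per town) to long format (one row per date and town), which is the layout ggplot2 works with most easily. The sketch below shows the same operation on a tiny table; the town names and case counts are invented for illustration and are not the measles data.

```r
library(reshape2)

# Toy wide-format table invented for illustration (not the course data):
# one row per date, one column per town.
wide <- data.frame(
  date  = as.Date(c("1950-01-07", "1950-01-14")),
  TownA = c(3, 5),
  TownB = c(10, 8)
)

long <- melt(wide, id.vars = "date",
             variable.name = "town", value.name = "cases")
long   # one row per (date, town) pair
```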

These are time series data. It’s traditional (and useful) to plot them using line plots:

meas %>%
  ggplot(aes(x=date,y=cases,group=town))+
  geom_line()+
  facet_wrap(~town)

The towns are of very different sizes, which obscures any similarities there might be in the patterns. To counteract this, we can give each panel its own y-axis scale.

meas %>%
  ggplot(aes(x=date,y=cases,group=town))+
  geom_line()+
  facet_wrap(~town,scales="free_y")

Alternatively, we could log-transform the y-axis:

meas %>%
  ggplot(aes(x=date,y=cases,group=town))+
  geom_line()+
  facet_wrap(~town)+
  scale_y_log10()

There appears to be a cyclical pattern in the pre-vaccine period. Is this due to seasonality? To find out, we can plot cases against week of the year, with one line per year:

meas %>%
  subset(date<"1968-01-01") %>%
  mutate(week=as.integer(format(date,"%V")),
         year=as.integer(format(date,"%Y"))) %>%
  ggplot(aes(x=week,group=year,color=year,y=cases))+
  geom_line()+
  facet_wrap(~town)+
  scale_y_log10()
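
The week and year variables in the code above come from formatting each date. The sketch below applies the same conversion to three arbitrary dates chosen for illustration. One caveat is worth knowing: "%V" gives the ISO 8601 week number, so a date in the first days of January can fall in the last ISO week of the previous year, while "%Y" still reports the calendar year. The format "%G" gives the ISO week-based year instead.

```r
# Arbitrary dates chosen for illustration.
d <- as.Date(c("1960-01-04", "1960-07-01", "1961-01-01"))

wk <- data.frame(
  date     = d,
  week     = as.integer(format(d, "%V")),  # ISO 8601 week of the year
  year     = as.integer(format(d, "%Y")),  # calendar year
  iso_year = as.integer(format(d, "%G"))   # ISO week-based year
)
wk
```

In the result, 1961-01-01 (a Sunday) is assigned to ISO week 52, with calendar year 1961 but ISO year 1960.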

Exercise

There seems to be a hint that the cycle is not the same in every year. Do even-numbered years differ in their seasonal pattern from odd-numbered ones? Design and create a visualization to answer this question.

Exploring the energy production data

Download and preprocess the data:

paste0(course.url,"energy_production.csv") %>%
  read.csv(comment.char="#") %>%
  mutate(source=abbreviate(source,5),
         region=abbreviate(region,3)) %>%
  mutate(region=revalue(region,c(MdE="MidE",AaO="AO",CaSA="LatA",
                                 Erp="Eur",NrA="NA"))) -> nrg
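
The relabeling step uses revalue from plyr, which replaces the values named in its second argument and leaves all others unchanged. The sketch below applies it to a short character vector; the input labels are hypothetical stand-ins rather than values taken from the energy data.

```r
library(plyr)

# Hypothetical labels for illustration (not read from the course data).
x <- c("Erp", "NrA", "Asia")

y <- revalue(x, c(Erp = "Eur", NrA = "NA"))
y           # "Asia" is untouched because it is not named in the mapping
is.na(y)    # the label "NA" is an ordinary string, not a missing value
```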

Visualizing data

and the grammar of graphics

Aaron A. King

References