© 2018 Aaron A. King.
As explained in the lecture, the Grammar of Graphics, developed by Leland Wilkinson [1], is a grand conception of the nature of scientific graphics. It relies on an analogy with linguistic grammar: a plot has parts just as a sentence has a subject, predicates, and subordinate clauses.
From this point of view, what is a plot?
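According to the grammar, statistical graphics are composed of elements analogous to the parts of speech: the data, a mapping of variables onto aesthetic attributes, geometric objects (geoms), statistical transformations (stats), and position adjustments.
In this conception, a plot is composed of one or more layers. Each layer has the components above. In addition to the layers, a plot must have scales, a coordinate system, and a faceting specification.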
The package ggplot2 implements the grammar of graphics in an elegant way.
Let’s look at some basic plots. Make sure you have the ggplot2, plyr, reshape2, and magrittr packages installed. Whenever you’re working with ggplot2, it’s helpful to have a browser open and pointed at docs.ggplot2.org.
We’ll start by exploring the transgenic mosquito data.
library(plyr)
library(reshape2)
library(magrittr)
options(stringsAsFactors=FALSE)
course.url <- "https://kinglab.eeb.lsa.umich.edu/202/data/"
file.path(course.url,"mosquitoes.csv") %>%
  read.csv(comment.char="#") -> dat
sapply(dat,class)
head(dat)
These are the raw data, just the observations themselves: there are no derived variables in the dataset.
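To see what the comment.char argument does in the call above, here is a minimal sketch that reads a small, made-up CSV from a string rather than from the course URL. The column names type and lifespan match those used in the plots in this document; the values, the comment line, and the name dat_demo are hypothetical.

```r
# Hypothetical CSV text: the first line is a comment that read.csv should skip.
csv_text <- "# made-up example, not the course data
type,lifespan
wildtype,20
transgenic,15
wildtype,31
"

# comment.char="#" drops everything from '#' to the end of the line
dat_demo <- read.csv(text=csv_text,comment.char="#",stringsAsFactors=FALSE)

sapply(dat_demo,class)   # one class per column
head(dat_demo)           # first few rows
```

Each row is a single observation, which is the shape ggplot2 expects.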
To construct a plot, we begin with a call to ggplot(), which establishes the data and mapping components:
library(ggplot2)
dat %>% ggplot(aes(x=lifespan)) -> p
Note that this has no side effects: it simply stores some information in p.
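As a rough illustration of what is stored, the sketch below builds the same kind of object from a small, made-up data frame and inspects its pieces. The data frame demo and the name p_demo are hypothetical stand-ins for dat and p; the sketch assumes the plot object exposes its data, mapping, and layers components through $.

```r
library(ggplot2)

# Hypothetical stand-in for the mosquito data
demo <- data.frame(
  type=rep(c("wildtype","transgenic"),each=4),
  lifespan=c(20,22,22,30,12,15,15,21)
)

p_demo <- ggplot(demo,aes(x=lifespan))   # nothing is drawn here

names(p_demo$mapping)    # the aesthetics that have been mapped
nrow(p_demo$data)        # the data travel with the plot object
length(p_demo$layers)    # no layers have been added yet
```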
Now we can begin to add layers. Let’s add a barplot layer:
p+layer(geom='bar',params=list(color='black',fill='blue',width=0.5),
        position="dodge",stat='count')
Because each geom is associated with a default stat (and vice-versa), and has defaults for most or all of the settings, we can usually use simpler notation:
p+geom_bar()
This gives an equivalent result because count is the default stat associated with geom_bar. As usual, we can choose different settings for various attributes of the geom:
p+geom_bar(fill="darkblue",width=0.5)
What does the following code do?
dat %>%
  ggplot(aes(x=lifespan,fill=type))+
  geom_bar()
What about the following?
dat %>%
  ggplot(aes(x=lifespan,fill=type))+
  geom_bar(position='dodge')
It’s important to note that ggplot2 graphics differ from base graphics: none of the plot-construction commands has the side effect of drawing anything. Rather, the graphic is generated only when the constructed object is printed. Thus
q <- p+geom_bar(color=NA,fill='blue',width=0.5)
produces no side effect. To display the graphic, we must
print(q)
Let’s look more closely at how the lifespans differ between the treatments. We can bin the lifespans to make the plot less noisy:
dat %>%
  ggplot(aes(x=lifespan,group=type,fill=type))+
  geom_histogram(position='dodge',binwidth=5)
Alternatively, we can estimate a smooth probability density for the lifespan. This obviates the need to choose the binwidth, although the kernel bandwidth (chosen automatically by default) plays a similar role.
dat %>%
  ggplot(aes(x=lifespan,group=type,fill=type))+
  geom_density(position='dodge',alpha=0.5) -> pl
pl
Note the transparency setting (alpha). We can add a rug to the plot to show the actual data:
pl+geom_rug()
Another way to look at how the distributions of lifespans differ is to use an empirical cumulative distribution function.
dat %>%
  ggplot(aes(x=lifespan,group=type,color=type))+
  stat_ecdf()
What are some other ways to visualize these data? What about scatterplots, boxplots, and violinplots? [Hint: check out geom_point, geom_boxplot, and geom_violin.]
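As one possible starting point for the hint above, the following sketch overlays a narrow boxplot on a violin plot. It uses simulated lifespans rather than the mosquito data: the data frame demo, the exponential distributions, and their rates are all assumptions made purely for illustration.

```r
library(ggplot2)

# Simulated stand-in for the mosquito data (not the real observations)
set.seed(1)
demo <- data.frame(
  type=rep(c("wildtype","transgenic"),each=50),
  lifespan=c(rexp(50,rate=1/25),rexp(50,rate=1/18))
)

# One violin and one boxplot per treatment group
bp <- ggplot(demo,aes(x=type,y=lifespan,fill=type))+
  geom_violin(alpha=0.5)+
  geom_boxplot(width=0.2,fill="white")
print(bp)
```

Here the grouping variable moves to the x axis and lifespan to the y axis, in contrast to the histograms and densities above.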
In the grammar of graphics, scales modify the mapping of data variables onto aesthetics. For example
dat %>%
  ggplot(aes(x=lifespan,group=type,fill=type))+
  geom_density(alpha=0.5)+
  scale_x_log10() -> pl
pl
pl+scale_fill_manual(values=c(transgenic="red",wildtype="blue"))
Similar scaling functions exist for all the aesthetics (see docs.ggplot2.org).
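To make the effect of the two scales concrete, the sketch below builds a similar plot from simulated data and looks at the values ggplot2 computes for the layer, using layer_data(). The data frame demo and its simulated lifespans are hypothetical; only the scale calls mirror the code above.

```r
library(ggplot2)

# Simulated stand-in for the mosquito data; lifespans are kept >= 1
set.seed(2)
demo <- data.frame(
  type=rep(c("wildtype","transgenic"),each=50),
  lifespan=c(rexp(50,rate=1/25),rexp(50,rate=1/18))+1
)

pl_demo <- ggplot(demo,aes(x=lifespan,group=type,fill=type))+
  geom_density(alpha=0.5)+
  scale_x_log10()+
  scale_fill_manual(values=c(transgenic="red",wildtype="blue"))

built <- layer_data(pl_demo)   # the data as the geom sees them

unique(built$fill)             # fill colors assigned by the manual scale
range(built$x)                 # x values on the log10 scale
range(log10(demo$lifespan))    # for comparison
```

The fill scale replaces the levels of type with the named colors, and the x scale transforms lifespan before the density is estimated.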
We can break a plot into facets using facet_wrap and facet_grid. For example:
pl+facet_wrap(~type)
pl+facet_wrap(~type,ncol=1)
pl+facet_grid(type~.)
It is possible to produce very beautiful publication-quality graphs using ggplot2. In doing so, you will usually want to fine-tune the figure through your choice of fonts, colors, labels, legend positions, etc. Investigate the following topics in the ggplot2 documentation to get started.
theme_bw
theme
labs, lims
guides
Download and pre-process the data by doing
file.path(course.url,"ewcitmeas.csv") %>%
  read.csv(comment.char="#",colClasses=c(date="Date")) %>%
  melt(id="date",value.name="cases",variable.name="town") -> meas
These are time series data. It’s traditional (and useful) to plot them using line plots:
meas %>%
  ggplot(aes(x=date,y=cases,group=town))+
  geom_line()+
  facet_wrap(~town)
The towns are of very different sizes, which obscures any similarities in the pattern there might be. To counteract this, we can plot them on different scales.
meas %>%
  ggplot(aes(x=date,y=cases,group=town))+
  geom_line()+
  facet_wrap(~town,scales="free_y")
Alternatively, we could log-transform the y-axis:
meas %>%
  ggplot(aes(x=date,y=cases,group=town))+
  geom_line()+
  facet_wrap(~town)+
  scale_y_log10()
There appears to be a cyclical pattern in the prevaccine period. Is this due to seasonality?
meas %>%
  subset(date<"1968-01-01") %>%
  mutate(week=as.integer(format(date,"%V")),
         year=as.integer(format(date,"%Y"))) %>%
  ggplot(aes(x=week,group=year,color=year,y=cases))+
  geom_line()+
  facet_wrap(~town)+
  scale_y_log10()
There seems to be a hint that the cycle is not the same in every year. Do even-numbered years differ in their seasonal pattern from odd-numbered ones? Design and create a visualization to answer this question.
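The preprocessing used for these data (reshaping from wide to long with melt, then deriving week and year from the dates) can be tried out on a small, made-up table. In the sketch below, the data frame wide, the town names TownA and TownB, and the case counts are all hypothetical; only the melt and mutate calls follow the code above.

```r
library(plyr)
library(reshape2)

# Hypothetical wide table: one row per date, one column of case counts per town
wide <- data.frame(
  date=as.Date("1950-01-07")+7*(0:3),
  TownA=c(10,12,9,15),
  TownB=c(100,140,90,160)
)

# Wide to long: one row per (date, town) pair
long <- melt(wide,id.vars="date",value.name="cases",variable.name="town")

# Derive the week of the year ("%V") and the calendar year ("%Y") from each date
long <- mutate(long,
               week=as.integer(format(date,"%V")),
               year=as.integer(format(date,"%Y")))
head(long)
```

The long format is what lets town be used directly in group, color, and facet_wrap.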
Download and preprocess the data:
paste0(course.url,"energy_production.csv") %>%
  read.csv(comment.char="#") %>%
  mutate(source=abbreviate(source,5),
         region=abbreviate(region,3)) %>%
  mutate(region=revalue(region,c(MdE="MidE",AaO="AO",CaSA="LatA",Erp="Eur",NrA="NA"))) -> nrg
Pose some questions about these data and design visualizations to answer them.
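The relabeling step in the preprocessing above uses revalue from plyr, which replaces the values named in its second argument and leaves the others untouched. A minimal sketch, using a hypothetical vector codes rather than the energy data:

```r
library(plyr)

# Hypothetical abbreviated labels, in the style produced by abbreviate()
codes <- c("MdE","Erp","AaO")

# Names are the old values, elements are the new ones;
# "AaO" has no replacement here, so it is kept as is
recoded <- revalue(codes,c(MdE="MidE",Erp="Eur"))
recoded

# abbreviate() shortens strings to at least the requested number of characters
abbreviate(c("Middle East","Europe"),3)
```

Note that the replacement "NA" in the original code is an ordinary string, not a missing value.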
1. Wilkinson L (2005) The grammar of graphics. 2nd ed. New York: Springer.