© 2018 Aaron A. King.
As explained in the lecture, the Grammar of Graphics, developed by Leland Wilkinson [1], is a grand conception of the nature of scientific graphics. It relies on an analogy with linguistic grammar: a plot has parts just as a sentence has subject, predicates, and subordinate clauses.
From this point of view, what is a plot?
According to the grammar, statistical graphics are composed of elements analogous to the parts of speech:
In this conception, a plot is composed of one or more layers. Each layer has the components above. In addition to the layers, plots must have
The package ggplot2 implements the grammar of graphics in an elegant way.
Let’s look at some basic plots. Make sure you have the ggplot2, plyr, reshape2, and magrittr packages installed (magrittr provides the %>% pipe used throughout). Whenever you’re working with ggplot2, it’s helpful to have a browser open and pointed at docs.ggplot2.org.
We’ll start by exploring the transgenic mosquito data.
library(plyr)
library(reshape2)
library(magrittr)
options(stringsAsFactors=FALSE)
course.url <- "https://kinglab.eeb.lsa.umich.edu/202/data/"
file.path(course.url,"mosquitoes.csv") %>%
read.csv(comment.char="#") -> dat
sapply(dat,class)
head(dat)
These are the raw data, just the observations themselves: there are no derived variables in the dataset.
To construct a plot, we begin with a call to ggplot(), which establishes the data and mapping components:
library(ggplot2)
dat %>% ggplot(aes(x=lifespan)) -> p
Note that this has no side effects: it simply stores some information in p.
Now we can begin to add layers. Let’s add a barplot layer:
p+layer(geom='bar',params=list(color='black',fill='blue',width=0.5),
position="dodge",stat='count')
Because each geom is associated with a default stat (and vice versa), and has defaults for most or all of the settings, we can usually use simpler notation:
p+geom_bar()
This gives an equivalent result, apart from the color, fill, and width settings, because count is the default stat associated with geom_bar. As usual, we can choose different settings for various attributes of the geom:
p+geom_bar(fill="darkblue",width=0.5)
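One way to check which defaults a geom carries is to inspect the layer object it returns. The following is a minimal sketch; it relies on the geom, stat, and position fields of the layer object, which are internal details of ggplot2 and may change between versions.

```r
library(ggplot2)

# A layer is an ordinary R object; constructing it draws nothing.
lyr <- geom_bar()

# Each field holds an object whose first class names the component.
class(lyr$geom)[1]      # the geometric object
class(lyr$stat)[1]      # the default statistical transformation
class(lyr$position)[1]  # the default position adjustment
```

For geom_bar() these should name the bar geom, the count stat, and the stack position, consistent with the stat='count' setting in the explicit layer() call above.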
What does the following code do?
dat %>%
ggplot(aes(x=lifespan,fill=type))+
geom_bar()
What about the following?
dat %>%
ggplot(aes(x=lifespan,fill=type))+
geom_bar(position='dodge')
It’s important to note that ggplot2 graphics differ from base graphics in that none of the plot-construction commands draws anything. Rather, the graphic is generated when the constructed object is printed. Thus
q <- p+geom_bar(color=NA,fill='blue',width=0.5)
produces no side effect. To display the graphic, we must
print(q)
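At the console, typing the name of a plot object prints it automatically. Inside a loop or a function there is no automatic printing, so the call to print() must be explicit. A minimal sketch, using a small made-up data frame (toy) as a stand-in for the mosquito data:

```r
library(ggplot2)

# Hypothetical stand-in for the mosquito data.
toy <- data.frame(lifespan=c(3,5,5,8,8,8,13))

# Stores the data and mapping; draws nothing.
p0 <- ggplot(toy,aes(x=lifespan))

# Inside a loop, nothing is drawn unless print() is called explicitly.
for (w in c(0.5,0.9)) {
  print(p0+geom_bar(width=w))
}
```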
Let’s look more closely at how the lifespans differ between the treatments. We can bin the lifespans to make the plot less noisy:
dat %>%
ggplot(aes(x=lifespan,group=type,fill=type))+
geom_histogram(position='dodge',binwidth=5)
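The choice of binwidth changes how the same observations are grouped. As a sketch, layer_data() returns the data frame that a layer actually draws, so it can be used to look at the bins directly. The toy data frame below is made up for illustration.

```r
library(ggplot2)

# Made-up lifespans, for illustration only.
toy <- data.frame(lifespan=c(1,2,3,6,7,12,14,21))

# Same data, two bin widths.
narrow <- ggplot(toy,aes(x=lifespan))+geom_histogram(binwidth=2)
wide <- ggplot(toy,aes(x=lifespan))+geom_histogram(binwidth=10)

# One row per bin: xmin and xmax are the bin edges, count is the bar height.
layer_data(narrow)[,c("xmin","xmax","count")]
layer_data(wide)[,c("xmin","xmax","count")]
```

Wider bins give fewer, taller bars; in both cases the counts sum to the number of observations.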
Alternatively, we can estimate a smooth probability density for the lifespan. This removes the need to choose a binwidth, although the estimate has its own smoothing bandwidth, for which geom_density supplies a default.
dat %>%
ggplot(aes(x=lifespan,group=type,fill=type))+
geom_density(position='dodge',alpha=0.5) -> pl
pl
Note the transparency setting (alpha). We can add a rug to the plot to show the actual data:
pl+geom_rug()
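The smoothness of a density estimate is controlled by its bandwidth. geom_density() chooses one by default, and its adjust argument multiplies that default. A sketch with simulated data (the toy data frame, its group labels, and its parameters are invented for illustration):

```r
library(ggplot2)

set.seed(1)
toy <- data.frame(
  lifespan=c(rexp(100,rate=1/10),rexp(100,rate=1/20)),
  type=rep(c("A","B"),each=100)
)

ggplot(toy,aes(x=lifespan,fill=type))+
  geom_density(alpha=0.5,adjust=0.5)+   # half the default bandwidth
  geom_rug(aes(color=type)) -> rough

ggplot(toy,aes(x=lifespan,fill=type))+
  geom_density(alpha=0.5,adjust=2)+     # twice the default bandwidth
  geom_rug(aes(color=type)) -> smooth

print(rough)
print(smooth)
```

Comparing the two figures against the rug gives a sense of how much of the shape is due to the data and how much to the smoothing.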
Another way to look at how the distributions of lifespans differ is to use an empirical cumulative distribution function.
dat %>%
ggplot(aes(x=lifespan,group=type,color=type))+
stat_ecdf()
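The empirical cumulative distribution function at x is the fraction of observations less than or equal to x. Base R's ecdf() computes this quantity directly, which can help in reading the plot. A small made-up example:

```r
# Made-up lifespans, for illustration only.
x <- c(3,5,5,8,13)

# ecdf() returns a step function.
Fhat <- ecdf(x)

Fhat(5)           # fraction of values <= 5 (here 3 of 5)
Fhat(c(2,8,20))   # evaluate at several points at once
```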
What are some other ways to visualize these data? What about scatterplots, boxplots, and violin plots? [Hint: check out geom_point, geom_boxplot, and geom_violin.]
In the grammar of graphics, scales modify the mapping of data variables onto aesthetics. For example:
dat %>%
ggplot(aes(x=lifespan,group=type,fill=type))+
geom_density(alpha=0.5)+
scale_x_log10() -> pl
pl
pl+scale_fill_manual(values=c(transgenic="red",wildtype="blue"))
Similar scaling functions exist for all the aesthetics (see docs.ggplot2.org).
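One way to see what a scale does is to compare the data a layer draws with and without it. In the sketch below (made-up data), layer_data() suggests that scale_x_log10() transforms x before plotting, while the axis labels remain in the original units.

```r
library(ggplot2)

# Made-up data spanning two orders of magnitude.
toy <- data.frame(x=c(1,10,100),y=c(1,2,3))

p0 <- ggplot(toy,aes(x=x,y=y))+geom_point()

layer_data(p0)$x                   # positions on the default linear scale
layer_data(p0+scale_x_log10())$x   # positions after the log10 transformation
```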
We can break a plot into facets using facet_wrap and facet_grid. For example:
pl+facet_wrap(~type)
pl+facet_wrap(~type,ncol=1)
pl+facet_grid(type~.)
It is possible to produce very beautiful, publication-quality graphs using ggplot2. In so doing, you usually want to fine-tune the figure through the choice of fonts, colors, labels, legend positions, etc. Investigate the following topics in the ggplot2 documentation to get a start on this.
theme_bw
theme
labs, lims
guides
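As a starting point, the sketch below touches each of these functions once. The data frame, labels, limits, and font size are arbitrary choices made for illustration, not recommendations.

```r
library(ggplot2)

# Made-up data, for illustration only.
toy <- data.frame(
  lifespan=c(3,5,8,13,4,9,15,22),
  type=rep(c("A","B"),each=4)
)

ggplot(toy,aes(x=lifespan,fill=type))+
  geom_density(alpha=0.5)+
  labs(x="lifespan",y="density",fill="treatment")+  # axis and legend titles
  lims(x=c(0,30))+                                  # axis limits
  guides(fill=guide_legend(reverse=TRUE))+          # legend details
  theme_bw()+                                       # a complete theme
  theme(legend.position="bottom",                   # individual adjustments
        text=element_text(size=14)) -> fig
print(fig)
```

Note that theme() is added after theme_bw(): a complete theme replaces earlier adjustments, so individual tweaks go last.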
Next, download and pre-process a second dataset (ewcitmeas.csv), which records case counts by date for a number of towns:
file.path(course.url,"ewcitmeas.csv") %>%
read.csv(comment.char="#",colClasses=c(date="Date")) %>%
melt(id="date",value.name="cases",variable.name="town") -> meas
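The call to melt() converts the table from wide format (one column per town) to long format (one row per date and town), which is the shape ggplot2 expects. A small sketch of the same operation on a made-up table; the town names and counts are invented:

```r
library(reshape2)

# Made-up wide table: one row per date, one column per town.
wide <- data.frame(
  date=as.Date(c("1950-01-01","1950-01-08")),
  London=c(10,12),
  Bristol=c(1,0)
)

# Long format: one row per date-town combination.
long <- melt(wide,id.vars="date",value.name="cases",variable.name="town")
long
```

The meas data frame constructed above has the same three columns: date, town, and cases.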
These are time series data. It’s traditional (and useful) to plot them using line plots:
meas %>%
ggplot(aes(x=date,y=cases,group=town))+
geom_line()+
facet_wrap(~town)
The towns are of very different sizes, so on a shared y-axis the smaller towns are compressed and any similarities in pattern are hard to see. To counteract this, we can give each panel its own y-axis scale.
meas %>%
ggplot(aes(x=date,y=cases,group=town))+
geom_line()+
facet_wrap(~town,scales="free_y")
Alternatively, we could log-transform the y-axis:
meas %>%
ggplot(aes(x=date,y=cases,group=town))+
geom_line()+
facet_wrap(~town)+
scale_y_log10()
There appears to be a cyclical pattern in the prevaccine period. Is this due to seasonality?
meas %>%
subset(date<"1968-01-01") %>%
mutate(week=as.integer(format(date,"%V")),
year=as.integer(format(date,"%Y"))) %>%
ggplot(aes(x=week,group=year,color=year,y=cases))+
geom_line()+
facet_wrap(~town)+
scale_y_log10()
There seems to be a hint that the cycle is not the same in every year. Do even-numbered years differ in their seasonal pattern from odd-numbered ones? Design and create a visualization to answer this question.
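A caveat on the week and year variables computed above: %V is the ISO 8601 week number, whereas %Y is the calendar year, and the two can disagree for dates near the new year. The ISO week-based year is given by %G. A small check, on a date chosen for illustration and assuming a platform where these output formats are supported:

```r
# 31 December 1951 was a Monday.
d <- as.Date("1951-12-31")

format(d,"%Y")   # calendar year
format(d,"%V")   # ISO 8601 week number
format(d,"%G")   # ISO 8601 week-based year
```

This date belongs to the first ISO week of 1952 although its calendar year is 1951, so pairing %V with %Y would place it at the start of the 1951 curve. Whether this matters depends on the dates in the data; using %G for the year is one way to avoid the mismatch.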
Download and pre-process a third dataset (energy_production.csv):
paste0(course.url,"energy_production.csv") %>%
read.csv(comment.char="#") %>%
mutate(source=abbreviate(source,5),
region=abbreviate(region,3)) %>%
mutate(region=revalue(region,c(MdE="MidE",AaO="AO",CaSA="LatA",Erp="Eur",NrA="NA"))) -> nrg
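The revalue() function from plyr replaces selected values of a vector and leaves the rest untouched. In its second argument, the names are the old values and the elements are the new ones. A minimal sketch with made-up labels:

```r
library(plyr)

# Made-up labels, for illustration only.
x <- c("Erp","Asi","Erp")

# Replace "Erp" by "Eur"; values not mentioned are left as they are.
revalue(x,c(Erp="Eur"))

# The replacement "NA" is an ordinary character string, not a missing value.
is.na(revalue("NrA",c(NrA="NA")))
```

The string "NA" deserves some care: it is not missing here, but it may be read back as a missing value if the data are written to a file and re-imported with default settings.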
Pose some questions of these data and design visualizations to answer them.
1. Wilkinson L (2005) The grammar of graphics. 2nd ed. New York: Springer.