© 2018 Aaron A. King.
This is an extremely condensed introduction to the powerful data-munging tools developed by Hadley Wickham and friends and contained in the packages plyr, reshape2, and magrittr. Run the code shown and study the output to learn about these tools. For your convenience, the R code for this document is provided in a script that you can download, edit, and run.
We’ll practice on three datasets. The first contains the results of an experiment on the lifespan of transgenic mosquitoes.
course.url <- "https://kinglab.eeb.lsa.umich.edu/202/data/"
read.csv(paste0(course.url,"mosquitoes.csv")) -> mos
The second contains data on primary energy production by source, year, and region.
read.csv(paste0(course.url,"energy_production.csv"),comment.char="#") -> nrg
The third dataset consists of measles cases in the 7 largest cities in England and Wales during the decades immediately before and after the introduction of the vaccine in 1968.
read.csv(paste0(course.url,"ewcitmeas.csv"),comment.char="#",
colClasses=c(date="Date")) -> meas
We can obtain an idea of the nature of these datasets as follows:
head(mos)
## type lifespan
## 1 wildtype 10
## 2 transgenic 39
## 3 wildtype 38
## 4 wildtype 47
## 5 wildtype 11
## 6 wildtype 16
sapply(mos,class)
## type lifespan
## "factor" "integer"
head(nrg)
## year source region TJ
## 1 1900 Coal Asia and Oceania 542522.6
## 2 1901 Coal Asia and Oceania 612609.6
## 3 1902 Coal Asia and Oceania 653338.2
## 4 1903 Coal Asia and Oceania 694323.1
## 5 1904 Coal Asia and Oceania 731250.6
## 6 1905 Coal Asia and Oceania 780040.2
sapply(nrg,class)
## year source region TJ
## "integer" "factor" "factor" "numeric"
range(nrg$year)
## [1] 1900 2014
levels(nrg$source)
## [1] "Coal" "Gas" "Hydro"
## [4] "Nuclear" "Oil" "Other Renewables"
levels(nrg$region)
## [1] "Africa" "Asia and Oceania"
## [3] "Central and South America" "Eurasia"
## [5] "Europe" "Middle East"
## [7] "North America"
summary(nrg)
## year source region
## Min. :1900 Coal :766 Africa :439
## 1st Qu.:1942 Gas :609 Asia and Oceania :532
## Median :1971 Hydro :686 Central and South America:516
## Mean :1967 Nuclear :302 Eurasia :489
## 3rd Qu.:1993 Oil :784 Europe :596
## Max. :2014 Other Renewables:336 Middle East :338
## North America :573
## TJ
## Min. : 4
## 1st Qu.: 86461
## Median : 929178
## Mean : 5758038
## 3rd Qu.: 8293176
## Max. :105601013
##
head(meas)
## date London Bristol Liverpool Manchester Newcastle Birmingham
## 1 1948-01-10 NA 3 40 22 58 78
## 2 1948-01-17 240 4 51 19 52 84
## 3 1948-01-24 284 3 54 23 34 65
## 4 1948-01-31 340 5 54 31 25 106
## 5 1948-02-07 511 1 89 66 27 142
## 6 1948-02-14 649 3 73 60 47 143
## Sheffield
## 1 9
## 2 11
## 3 11
## 4 4
## 5 7
## 6 3
The reshape2 package works with a metaphor of melting and casting.
Melting takes a wide data frame and makes it long. Multiple columns are combined into one value column with a variable column keeping track of which column the different values came from. Only the columns containing measure variables are reshaped; those containing identifier variables are left alone.
library(reshape2)
melt(meas,
measure.vars=c("London","Bristol","Liverpool","Manchester",
"Newcastle","Birmingham","Sheffield")) -> ml
head(ml)
## date variable value
## 1 1948-01-10 London NA
## 2 1948-01-17 London 240
## 3 1948-01-24 London 284
## 4 1948-01-31 London 340
## 5 1948-02-07 London 511
## 6 1948-02-14 London 649
Every variable is either an identifier or a measure variable. Thus, the following gives the same result as the first melt operation above.
melt(meas,id.vars=c("date")) -> ml
head(ml)
## date variable value
## 1 1948-01-10 London NA
## 2 1948-01-17 London 240
## 3 1948-01-24 London 284
## 4 1948-01-31 London 340
## 5 1948-02-07 London 511
## 6 1948-02-14 London 649
It is possible to override the default variable-value naming scheme:
melt(meas,id.vars=c("date"),
value.name="cases",variable.name="town") -> ml
head(ml)
## date town cases
## 1 1948-01-10 London NA
## 2 1948-01-17 London 240
## 3 1948-01-24 London 284
## 4 1948-01-31 London 340
## 5 1948-02-07 London 511
## 6 1948-02-14 London 649
One can also melt an array:
a <- array(LETTERS[1:15],dim=c(3,5)); a
## [,1] [,2] [,3] [,4] [,5]
## [1,] "A" "D" "G" "J" "M"
## [2,] "B" "E" "H" "K" "N"
## [3,] "C" "F" "I" "L" "O"
melt(a)
## Var1 Var2 value
## 1 1 1 A
## 2 2 1 B
## 3 3 1 C
## 4 1 2 D
## 5 2 2 E
## 6 3 2 F
## 7 1 3 G
## 8 2 3 H
## 9 3 3 I
## 10 1 4 J
## 11 2 4 K
## 12 3 4 L
## 13 1 5 M
## 14 2 5 N
## 15 3 5 O
b <- array(LETTERS[1:18],dim=c(2,3,3)); b
## , , 1
##
## [,1] [,2] [,3]
## [1,] "A" "C" "E"
## [2,] "B" "D" "F"
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] "G" "I" "K"
## [2,] "H" "J" "L"
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] "M" "O" "Q"
## [2,] "N" "P" "R"
melt(b)
## Var1 Var2 Var3 value
## 1 1 1 1 A
## 2 2 1 1 B
## 3 1 2 1 C
## 4 2 2 1 D
## 5 1 3 1 E
## 6 2 3 1 F
## 7 1 1 2 G
## 8 2 1 2 H
## 9 1 2 2 I
## 10 2 2 2 J
## 11 1 3 2 K
## 12 2 3 2 L
## 13 1 1 3 M
## 14 2 1 3 N
## 15 1 2 3 O
## 16 2 2 3 P
## 17 1 3 3 Q
## 18 2 3 3 R
c <- array(1:6,dim=c(3,2),
dimnames=list(letters[1:3],LETTERS[1:2]))
melt(c)
## Var1 Var2 value
## 1 a A 1
## 2 b A 2
## 3 c A 3
## 4 a B 4
## 5 b B 5
## 6 c B 6
The result is always a data frame. To override the default naming scheme for the variables, one can name the dimnames:
b <- array(LETTERS[1:18],dim=c(2,3,3),
dimnames=list(one=1:2,two=c("a","b","c"),three=c("x","y","z"))); b
## , , three = x
##
## two
## one a b c
## 1 "A" "C" "E"
## 2 "B" "D" "F"
##
## , , three = y
##
## two
## one a b c
## 1 "G" "I" "K"
## 2 "H" "J" "L"
##
## , , three = z
##
## two
## one a b c
## 1 "M" "O" "Q"
## 2 "N" "P" "R"
melt(b)
## one two three value
## 1 1 a x A
## 2 2 a x B
## 3 1 b x C
## 4 2 b x D
## 5 1 c x E
## 6 2 c x F
## 7 1 a y G
## 8 2 a y H
## 9 1 b y I
## 10 2 b y J
## 11 1 c y K
## 12 2 c y L
## 13 1 a z M
## 14 2 a z N
## 15 1 b z O
## 16 2 b z P
## 17 1 c z Q
## 18 2 c z R
Casting turns a long data frame into a wide one. A single column (called the value column) is separated into multiple columns according to the specification given. Use dcast or acast according to whether you want the result as a data frame or an array.
dcast(ml,date~town) -> d1; head(d1)
## date London Bristol Liverpool Manchester Newcastle Birmingham
## 1 1948-01-10 NA 3 40 22 58 78
## 2 1948-01-17 240 4 51 19 52 84
## 3 1948-01-24 284 3 54 23 34 65
## 4 1948-01-31 340 5 54 31 25 106
## 5 1948-02-07 511 1 89 66 27 142
## 6 1948-02-14 649 3 73 60 47 143
## Sheffield
## 1 9
## 2 11
## 3 11
## 4 4
## 5 7
## 6 3
class(d1)
## [1] "data.frame"
dcast(nrg,source+year~region) -> d2; head(d2)
## source year Africa Asia and Oceania Central and South America Eurasia
## 1 Coal 1900 24048.3 542522.6 10798.4 389922.0
## 2 Coal 1901 40355.5 612609.6 18657.2 398876.0
## 3 Coal 1902 60157.1 653338.2 21673.0 397403.8
## 4 Coal 1903 81631.5 694323.1 24982.4 407106.0
## 5 Coal 1904 92145.9 731250.6 23518.5 473259.5
## 6 Coal 1905 106125.0 780040.2 25142.5 450572.7
## Europe Middle East North America
## 1 11541836 NA 6750472
## 2 11187846 NA 7349399
## 3 11262939 NA 7686815
## 4 11750772 NA 8955323
## 5 11876122 NA 8829374
## 6 12544596 NA 9839131
dcast(nrg,year+source~region) -> d3; head(d3)
## year source Africa Asia and Oceania Central and South America Eurasia
## 1 1900 Coal 24048.3 542522.6 10798.4 389922.0
## 2 1900 Gas NA NA NA NA
## 3 1900 Hydro NA NA 50.4 NA
## 4 1900 Oil NA 23822.9 1591.0 432998.9
## 5 1901 Coal 40355.5 612609.6 18657.2 398876.0
## 6 1901 Gas NA NA NA NA
## Europe Middle East North America
## 1 11541835.5 NA 6750471.5
## 2 NA NA 265825.5
## 3 396.0 NA 9849.6
## 4 27172.3 NA 360232.3
## 5 11187846.1 NA 7349398.6
## 6 NA NA 294581.0
dcast(nrg,year~source+region) -> d4; head(d4)
## year Coal_Africa Coal_Asia and Oceania Coal_Central and South America
## 1 1900 24048.3 542522.6 10798.4
## 2 1901 40355.5 612609.6 18657.2
## 3 1902 60157.1 653338.2 21673.0
## 4 1903 81631.5 694323.1 24982.4
## 5 1904 92145.9 731250.6 23518.5
## 6 1905 106125.0 780040.2 25142.5
## Coal_Eurasia Coal_Europe Coal_Middle East Coal_North America Gas_Africa
## 1 389922.0 11541836 NA 6750472 NA
## 2 398876.0 11187846 NA 7349399 NA
## 3 397403.8 11262939 NA 7686815 NA
## 4 407106.0 11750772 NA 8955323 NA
## 5 473259.5 11876122 NA 8829374 NA
## 6 450572.7 12544596 NA 9839131 NA
## Gas_Asia and Oceania Gas_Central and South America Gas_Eurasia
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## Gas_Europe Gas_Middle East Gas_North America Hydro_Africa
## 1 NA NA 265825.5 NA
## 2 NA NA 294581.0 NA
## 3 NA NA 323336.5 NA
## 4 NA NA 352092.0 NA
## 5 NA NA 380847.5 NA
## 6 NA NA 409603.1 NA
## Hydro_Asia and Oceania Hydro_Central and South America Hydro_Eurasia
## 1 NA 50.4 NA
## 2 NA 57.6 NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA 226.8 NA
## Hydro_Europe Hydro_Middle East Hydro_North America Nuclear_Africa
## 1 396 NA 9849.6 NA
## 2 576 NA 10908.0 NA
## 3 792 NA 12312.0 NA
## 4 1080 NA 14137.2 NA
## 5 1260 NA 16131.6 NA
## 6 1440 NA 18194.4 NA
## Nuclear_Asia and Oceania Nuclear_Central and South America
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## Nuclear_Eurasia Nuclear_Europe Nuclear_Middle East Nuclear_North America
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## Oil_Africa Oil_Asia and Oceania Oil_Central and South America
## 1 NA 23822.9 1591.0
## 2 NA 37262.5 1632.9
## 3 NA 29684.4 1339.8
## 4 NA 52628.1 1549.1
## 5 NA 64225.5 1632.9
## 6 NA 75781.1 2093.4
## Oil_Eurasia Oil_Europe Oil_Middle East Oil_North America
## 1 432998.9 27172.3 NA 360232.3
## 2 484161.6 31484.7 NA 391675.1
## 3 459878.1 37806.8 NA 498689.7
## 4 431700.9 48315.7 NA 563962.0
## 5 448992.4 60792.3 NA 657495.1
## 6 312251.5 65607.2 NA 757099.0
## Other Renewables_Africa Other Renewables_Asia and Oceania
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## Other Renewables_Central and South America Other Renewables_Eurasia
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## Other Renewables_Europe Other Renewables_Middle East
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## Other Renewables_North America
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
acast(nrg,source+year~region) -> a1; head(a1)
## Africa Asia and Oceania Central and South America Eurasia
## Coal_1900 24048.3 542522.6 10798.4 389922.0
## Coal_1901 40355.5 612609.6 18657.2 398876.0
## Coal_1902 60157.1 653338.2 21673.0 397403.8
## Coal_1903 81631.5 694323.1 24982.4 407106.0
## Coal_1904 92145.9 731250.6 23518.5 473259.5
## Coal_1905 106125.0 780040.2 25142.5 450572.7
## Europe Middle East North America
## Coal_1900 11541836 NA 6750472
## Coal_1901 11187846 NA 7349399
## Coal_1902 11262939 NA 7686815
## Coal_1903 11750772 NA 8955323
## Coal_1904 11876122 NA 8829374
## Coal_1905 12544596 NA 9839131
class(a1); dim(a1)
## [1] "matrix"
## [1] 618 7
acast(nrg,year~source~region) -> a2;
class(a2); dim(a2)
## [1] "array"
## [1] 115 6 7
dimnames(a2)
## [[1]]
## [1] "1900" "1901" "1902" "1903" "1904" "1905" "1906" "1907" "1908" "1909"
## [11] "1910" "1911" "1912" "1913" "1914" "1915" "1916" "1917" "1918" "1919"
## [21] "1920" "1921" "1922" "1923" "1924" "1925" "1926" "1927" "1928" "1929"
## [31] "1930" "1931" "1932" "1933" "1934" "1935" "1936" "1937" "1938" "1939"
## [41] "1940" "1941" "1942" "1943" "1944" "1945" "1946" "1947" "1948" "1949"
## [51] "1950" "1951" "1952" "1953" "1954" "1955" "1956" "1957" "1958" "1959"
## [61] "1960" "1961" "1962" "1963" "1964" "1965" "1966" "1967" "1968" "1969"
## [71] "1970" "1971" "1972" "1973" "1974" "1975" "1976" "1977" "1978" "1979"
## [81] "1980" "1981" "1982" "1983" "1984" "1985" "1986" "1987" "1988" "1989"
## [91] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999"
## [101] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [111] "2010" "2011" "2012" "2013" "2014"
##
## [[2]]
## [1] "Coal" "Gas" "Hydro"
## [4] "Nuclear" "Oil" "Other Renewables"
##
## [[3]]
## [1] "Africa" "Asia and Oceania"
## [3] "Central and South America" "Eurasia"
## [5] "Europe" "Middle East"
## [7] "North America"
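As a small self-contained check of the melting and casting metaphor, the following sketch melts a toy wide data frame and casts it back again. The data frame wide and its column names are made up for illustration; they are not part of the course data.

```r
library(reshape2)

# hypothetical toy data: one identifier column (id), two measure columns (a, b)
wide <- data.frame(id=1:3,a=c(10,20,30),b=c(1,2,3))

# melt: 3 rows x 2 measure columns become 6 rows in long format
melt(wide,id.vars="id") -> long
long

# cast: spread the value column back out, one column per level of 'variable'
dcast(long,id~variable,value.var="value") -> back
back
```

If the round trip works as expected, back should hold the same values as wide.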
plyr implements a very flexible and intuitive syntax for split-apply-combine computations. That is, it allows you to split data according to a wide range of criteria, apply some operation to each piece, then recombine the pieces.
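To make the three steps concrete, here is a minimal sketch of split-apply-combine written out by hand in base R. The data frame toy and its columns g and x are hypothetical; this only illustrates the pattern and is not how plyr is implemented.

```r
# hypothetical toy data
toy <- data.frame(g=c("a","b","a","b","a"),x=c(1,2,3,4,8))

# split: one data frame per distinct value of g
pieces <- split(toy,toy$g)

# apply: compute a one-row summary of each piece
results <- lapply(pieces,function(df) data.frame(g=df$g[1],mean=mean(df$x)))

# combine: stack the summaries into a single data frame
combined <- do.call(rbind,results)
combined
```

With plyr, the same computation can be written in one call as ddply(toy,~g,summarize,mean=mean(x)).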
In the following, we first detail the “basic” functions that make up the “apply” piece of split-apply-combine. Then, we discuss the “split” and “combine” pieces.
The following are the basic functions for manipulating data using plyr.
arrange
arrange sorts the rows of a data frame by one or more of its variables.
library(plyr)
arrange(mos,type)
arrange(ml,date,town)
arrange(nrg,-year,source,region)
arrange(nrg,-TJ,year)
count
count(x) counts how often each combination of the specified variables occurs and returns the counts as a data frame.
count(mos,~type)
count(ml,~town)
count(nrg,~region+source)
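Because the calls above are shown without their output, a toy example may help in seeing what arrange and count return. The data frame toy is invented for this sketch.

```r
library(plyr)

# hypothetical toy data
toy <- data.frame(g=c("b","a","b","a","b"),x=c(5,2,9,7,1))

# sort by g (ascending), breaking ties by x (descending)
arrange(toy,g,desc(x)) -> sorted
sorted

# tabulate how often each value of g occurs; the counts are in column 'freq'
count(toy,~g) -> tab
tab
```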
summarise and summarize
Given a data frame, summarise (synonym summarize) produces a new data frame of summaries computed from its variables.
summarize(ml,mean=mean(cases,na.rm=TRUE),
sd=sd(cases,na.rm=TRUE),
midpoint=mean(date))
## mean sd midpoint
## 1 122.1738 291.2688 1967-12-25
summarize(nrg,tot=sum(TJ),n=length(TJ))
summarize(nrg,range(year))
summarize(nrg,min(year),max(year),interval=diff(range(year)))
mutate
Given a data frame, mutate modifies, adds, or removes variables.
mutate(mos,lsw=lifespan/7)
mutate(ml,
year=as.integer(format(date,"%Y")),
month=format(date,"%b"),
week=as.integer(format(date,"%V")),
time=seq_along(date)) -> ml
mutate(nrg,
region=abbreviate(region,3),
carbon=source %in% c("Coal","Gas","Oil"),
renewable=source %in% c("Hydro","Other Renewables")) -> nrg
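The difference between summarize and mutate is easiest to see side by side. The following is a sketch on made-up data (toy and its column x are hypothetical): summarize collapses the rows, whereas mutate keeps them.

```r
library(plyr)

# hypothetical toy data
toy <- data.frame(x=c(1,2,3,4))

# summarize collapses the data frame: here, to a single row
summarize(toy,total=sum(x),n=length(x)) -> s
s

# mutate keeps every row; later arguments can use columns created by earlier ones
mutate(toy,twice=2*x,plus1=twice+1) -> m
m

# setting a variable to NULL removes it
mutate(m,twice=NULL) -> r
r
```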
subset
subset doesn’t belong to plyr, but would if it didn’t already exist in the base package. This function allows you to choose a subset of rows and/or columns. The subset argument specifies a logical condition: those rows that satisfy it are chosen. The select argument picks out which columns to keep or throw away.
subset(ml,date<"1949-01-01")
subset(nrg,select=c(year,source))
subset(nrg,carbon,select=-carbon)
subset(mos,type=="wildtype" & lifespan>10)
join
join merges two data frames. It can perform a left join, a right join, an inner join, or a full join. Read the documentation (?join) for explanations.
data.frame(
source=c("Coal","Gas","Hydro","Nuclear","Oil","Other Renewables"),
category=c("dirty","dirty","clean","dirty","dirty","clean")
) -> cats
join(cats,nrg,by="source",type="right") -> x; head(x)
## source category year region TJ carbon renewable
## 1 Coal dirty 1900 AaO 542522.6 TRUE FALSE
## 2 Coal dirty 1901 AaO 612609.6 TRUE FALSE
## 3 Coal dirty 1902 AaO 653338.2 TRUE FALSE
## 4 Coal dirty 1903 AaO 694323.1 TRUE FALSE
## 5 Coal dirty 1904 AaO 731250.6 TRUE FALSE
## 6 Coal dirty 1905 AaO 780040.2 TRUE FALSE
join(nrg,cats,by="source",type="left") -> nrg; head(nrg)
## year source region TJ carbon renewable category
## 1 1900 Coal AaO 542522.6 TRUE FALSE dirty
## 2 1901 Coal AaO 612609.6 TRUE FALSE dirty
## 3 1902 Coal AaO 653338.2 TRUE FALSE dirty
## 4 1903 Coal AaO 694323.1 TRUE FALSE dirty
## 5 1904 Coal AaO 731250.6 TRUE FALSE dirty
## 6 1905 Coal AaO 780040.2 TRUE FALSE dirty
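The four join types differ in which rows they keep. A minimal sketch on two invented tables (left, right, and their columns are hypothetical) shows the difference:

```r
library(plyr)

# hypothetical toy tables sharing the key column 'id'
left  <- data.frame(id=c(1,2,3),x=c("a","b","c"))
right <- data.frame(id=c(2,3,4),y=c(TRUE,FALSE,TRUE))

join(left,right,by="id",type="left")  -> jl  # all rows of left:   ids 1,2,3
join(left,right,by="id",type="right") -> jr  # all rows of right:  ids 2,3,4
join(left,right,by="id",type="inner") -> ji  # ids in both tables: 2,3
join(left,right,by="id",type="full")  -> jf  # ids in either:      1,2,3,4
```

Rows with no match in the other table are filled with NA; for example, y is NA for id 1 in jl.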
-ply functions
plyr provides a systematic, intuitive, and regular expansion of base R’s apply family (apply, lapply, sapply, tapply, mapply) and replicate. Collectively, these functions implement the split-apply-combine pattern of computation. They first split the data up according to some criterion, then apply some function, then combine the results. The functions are all named according to the scheme XYply, where X tells about the class of the source object and Y the class of the desired target object. In particular, X and Y can each be d (data frames), a (arrays), or l (lists). In addition, Y can be _ (null: the result is discarded), and X can be r (replicate) or m (multi-argument, as in mlply).
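As a quick illustration of the naming scheme, the following sketch applies the same split to a made-up data frame (toy, g, x, and the helper total are hypothetical) and asks for the result in three different forms.

```r
library(plyr)

# hypothetical toy data and helper function
toy <- data.frame(g=c("a","a","b"),x=c(1,2,10))
total <- function (df) sum(df$x)

ddply(toy,~g,summarize,total=sum(x)) -> out.d   # d -> d: a data frame
dlply(toy,~g,total) -> out.l                    # d -> l: a list, one element per group
daply(toy,~g,total) -> out.a                    # d -> a: an array (here one-dimensional)
```

Only the second letter changes between the three calls; the split (~g) and the computation are the same.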
ddply
This is probably the most useful of the lot. It splits a data frame according to some criterion, conveniently expressed as a formula involving the variables of the data frame, applies a specified function, and combines the results back into a data frame. It is best to use a function that returns a data frame, but if the function returns something else, ddply will attempt to coerce the value into a data frame. Here are some examples:
ddply(nrg,~source,subset,TJ==max(TJ))
## year source region TJ carbon renewable category
## 1 2011 Coal AaO 105601013 TRUE FALSE dirty
## 2 2014 Gas NrA 35538636 TRUE FALSE dirty
## 3 2013 Hydro AaO 4756182 FALSE TRUE clean
## 4 2004 Nuclear Erp 10587219 FALSE FALSE dirty
## 5 2012 Oil MdE 59215428 TRUE FALSE dirty
## 6 2014 Other Renewables NrA 2562705 FALSE TRUE clean
ddply(nrg,~category+region,summarize,TJ=mean(TJ))
## category region TJ
## 1 clean AaO 666330.12
## 2 clean Afr 86670.82
## 3 clean CaSA 543343.88
## 4 clean Erp 572569.53
## 5 clean Ers 323347.13
## 6 clean MdE 21563.08
## 7 clean NrA 883269.50
## 8 dirty AaO 9038474.91
## 9 dirty Afr 3536241.58
## 10 dirty CaSA 2559243.08
## 11 dirty Erp 7230392.98
## 12 dirty Ers 8286548.41
## 13 dirty MdE 10285141.64
## 14 dirty NrA 14217332.73
ddply(nrg,~source+region,summarize,TJ=sum(TJ)) -> x
ddply(x,~region,summarize,source=source,frac=TJ/sum(TJ)) -> x
mutate(x,frac=round(frac,3)) -> x
dcast(x,region~source) -> x
arrange(x,region)
## region Coal Gas Hydro Nuclear Oil Other Renewables
## 1 AaO 0.631 0.108 0.027 0.040 0.191 0.003
## 2 Afr 0.196 0.151 0.009 0.003 0.641 0.000
## 3 CaSA 0.052 0.151 0.066 0.005 0.712 0.015
## 4 Erp 0.559 0.165 0.035 0.104 0.129 0.007
## 5 Ers 0.235 0.352 0.012 0.025 0.376 0.000
## 6 MdE 0.001 0.129 0.001 0.000 0.869 0.000
## 7 NrA 0.314 0.277 0.021 0.049 0.334 0.004
Notice that only combinations of the variables that exist are included in the result by default.
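The default can be changed with ddply’s .drop argument. The sketch below uses a made-up data frame (toy, with hypothetical columns g, h, and x) in which one combination of the grouping variables never occurs.

```r
library(plyr)

# hypothetical toy data: the combination (g="b", h="v") never occurs
toy <- data.frame(g=factor(c("a","a","b")),
  h=factor(c("u","v","u")),
  x=c(1,2,3))

# default (.drop=TRUE): one row per combination present in the data
ddply(toy,~g+h,summarize,n=length(x)) -> present
present

# .drop=FALSE: also keep combinations that are absent from the data
ddply(toy,~g+h,summarize,n=length(x),.drop=FALSE) -> all.combos
all.combos
```

With .drop=FALSE, the function is also applied to the empty pieces, so it needs to cope with a data frame that has no rows.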
daply
This one is very similar to ddply, except that (as the name implies) the result is returned as an array:
s <- function (df) sum(df$TJ)
daply(nrg,~region,s)
daply(nrg,~region+source,s) -> y; y
dlply
This splits the data according to the given specifications, applies the function, and returns each result (as its name implies) as a distinct element of a list.
dlply(nrg,~region,summarize,TJ=sum(TJ))
adply, aaply, alply
These take arrays and, like the base function apply, divide the array up into slices along specified directions. They then apply a function to each slice and return the results in the desired form (if possible). As an example, we first create an array from nrg, then act on it with aaply.
mutate(nrg,time=year-min(year)) -> x
daply(x,~source+region,function(df) min(df$time)) -> A; A
## region
## source AaO Afr CaSA Erp Ers MdE NrA
## Coal 0 0 0 0 0 39 0
## Gas 22 46 29 22 22 55 0
## Hydro 14 29 0 0 13 37 0
## Nuclear 63 84 66 56 66 111 57
## Oil 0 11 0 0 0 6 0
## Other Renewables 59 80 75 16 89 90 60
aaply(A,1,max)
## Coal Gas Hydro Nuclear
## 39 55 37 111
## Oil Other Renewables
## 11 90
aaply(y,1,function(x)x/sum(x))
##
## region Coal Gas Hydro Nuclear Oil
## AaO 0.6313563678 0.1075899 0.0267581794 0.04014686169 0.1909662
## Afr 0.1959709264 0.1508423 0.0088339808 0.00326077904 0.6407904
## CaSA 0.0515765633 0.1514090 0.0657369730 0.00475165703 0.7115233
## Erp 0.5593898932 0.1651974 0.0354909302 0.10357590283 0.1293587
## Ers 0.2350606877 0.3524597 0.0120168576 0.02478184174 0.3755739
## MdE 0.0005844682 0.1292670 0.0007873494 0.00004092059 0.8693124
## NrA 0.3144170820 0.2770508 0.0212464315 0.04912379344 0.3338705
##
## region Other Renewables
## AaO 0.003182454260
## Afr 0.000301592264
## CaSA 0.015002504869
## Erp 0.006987204623
## Ers 0.000106998188
## MdE 0.000007842489
## NrA 0.004291397089
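The examples above use only aaply. For comparison, here is a sketch that slices one small made-up matrix (m and its dimnames are hypothetical) and returns the results in each of the three forms.

```r
library(plyr)

# hypothetical toy matrix: 2 rows, 3 columns
m <- matrix(1:6,nrow=2,
  dimnames=list(row=c("r1","r2"),col=c("c1","c2","c3")))

aaply(m,2,max) -> col.max                                   # a -> a: one value per column
adply(m,1,function(r) data.frame(total=sum(r))) -> row.tot  # a -> d: one row per matrix row
alply(m,1,range) -> row.rng                                 # a -> l: one element per matrix row
```

As with apply, the second argument gives the margin: 1 slices by rows, 2 by columns.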
Use the d-ply and a-ply functions to compute some interesting summary statistics on the energy data.
llply, laply, ldply
These functions are generalizations of lapply and sapply.
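A minimal sketch on a made-up list (x and its element names are hypothetical) shows the three output forms:

```r
library(plyr)

# hypothetical toy list
x <- list(a=1:3,b=4:6)

llply(x,mean) -> out.l                                  # l -> l, much like lapply
laply(x,mean) -> out.a                                  # l -> a, much like sapply
ldply(x,function(v) data.frame(mean=mean(v))) -> out.d  # l -> d: one row per list element
```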
mlply, maply, mdply
These work with multi-argument functions.
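Here, each row of a data frame supplies one set of arguments, matched to the function’s parameters by column name. The following sketch uses an invented argument table (pars, with hypothetical columns x and y):

```r
library(plyr)

# hypothetical argument table: each row supplies one (x, y) pair
pars <- expand.grid(x=1:2,y=1:3)

# call the function once per row, matching columns to arguments by name
mdply(pars,function(x,y) data.frame(prod=x*y)) -> out.d  # result: a data frame
maply(pars,function(x,y) x*y) -> out.a                   # result: an array
mlply(pars,function(x,y) x*y) -> out.l                   # result: a list
```

In the mdply result, the argument columns are kept alongside the computed values, which makes it convenient for tabulating a function over a grid of inputs.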
rename, revalue, mapvalues
rename helps one to change the (column) names of a data frame.
rename(nrg,c(TJ='energy',year="time"))
revalue allows you to change one or more of the levels of a factor without worrying about how the factor is coded. mapvalues does the same, but works on vectors of any type.
mutate(nrg,
region=revalue(region,c(AaO="Asia",CaSA="Latin.America")),
source=mapvalues(source,
from=c("Coal","Gas","Oil"),
to=c("Carbon","Carbon","Carbon")))
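The effect of each of the three functions can be checked on a small made-up data frame (toy and its columns grp and val are hypothetical):

```r
library(plyr)

# hypothetical toy data
toy <- data.frame(grp=factor(c("a","b","a")),val=c(1,2,3))

rename(toy,c(val="value")) -> toy2                # column 'val' becomes 'value'
revalue(toy$grp,c(a="alpha")) -> f                # relabel one factor level
mapvalues(toy$val,from=c(1,3),to=c(10,30)) -> v   # recode values of a plain vector

names(toy2); levels(f); v
```

In rename and revalue, the old name is given on the left of each = and the new name on the right.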
René Magritte, La Trahison des Images
magrittr gives a set of “pipe” operators. These allow one to chain operations together. When calculations get complex, it is easier and more natural to view them as a chain of operations instead of using nested function calls or defining intermediate variables.
%>% operator
f(g(data, a, b, c, ...), d, e, ...)
is equivalent to
data %>% g(a, b, c, ...) %>% f(d, e, ...)
%<>% operator
x %>% f(a, b, c, ...) -> x
is equivalent to
x %<>% f(a, b, c, ...)
For example, we can replace the long series of computations above with the following pipeline.
library(magrittr)
nrg %>%
ddply(~source+region,summarize,TJ=sum(TJ)) %>%
ddply(~region,summarize,source=source,frac=TJ/sum(TJ)) %>%
mutate(frac=round(frac,3)) %>%
dcast(region~source) %>%
mutate(region=revalue(region,c(AaO="AO",CaSA="LA",Erp="Eur",NrA="NA"))) %>%
arrange(region)
## region Coal Gas Hydro Nuclear Oil Other Renewables
## 1 Afr 0.196 0.151 0.009 0.003 0.641 0.000
## 2 AO 0.631 0.108 0.027 0.040 0.191 0.003
## 3 Ers 0.235 0.352 0.012 0.025 0.376 0.000
## 4 Eur 0.559 0.165 0.035 0.104 0.129 0.007
## 5 LA 0.052 0.151 0.066 0.005 0.712 0.015
## 6 MdE 0.001 0.129 0.001 0.000 0.869 0.000
## 7 NA 0.314 0.277 0.021 0.049 0.334 0.004
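The two equivalences stated above can be verified on a small example that needs nothing beyond magrittr and base R; the vector x is made up for the purpose.

```r
library(magrittr)

# a nested call and the corresponding pipeline give the same result
sum(sqrt(c(1,4,9)))          # 6
c(1,4,9) %>% sqrt %>% sum    # 6

# %<>% pipes x into sort and assigns the result back to x
x <- c(3,1,2)
x %<>% sort
x                            # 1 2 3
```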
%$% operator
d %$% e exposes the variables in data frame (or list) d to the expression e. Thus:
ml %$% levels(town)
## [1] "London" "Bristol" "Liverpool" "Manchester" "Newcastle"
## [6] "Birmingham" "Sheffield"
mos %$% mean(lifespan)
## [1] 18.54011
nrg %>%
subset(year==2014) %$%
sum(TJ)
## [1] 527258533
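The exposition operator plays much the same role as base R’s with. A minimal sketch on made-up data (toy, x, and y are hypothetical) compares the two:

```r
library(magrittr)

# hypothetical toy data
toy <- data.frame(x=c(1,2,3),y=c(2,4,6))

toy %$% cor(x,y) -> r1     # x and y are looked up inside toy
with(toy,cor(x,y)) -> r2   # the base-R way of writing the same thing
c(r1,r2)
```

This is useful at the end of a pipeline when the final function takes vectors rather than a data frame as its arguments.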