Advanced data munging

How to use this document.

This is an extremely condensed introduction to the powerful data-munging tools developed by Hadley Wickham and friends and contained in the packages plyr, reshape2, and magrittr. Run the code shown and study the output to learn about these tools. For your convenience, the R code for this document is provided in a script which you can download, edit, and run.

Importing data

We’ll practice on three datasets. The first contains the results of an experiment on the lifespan of transgenic mosquitoes.

course.url <- "https://kinglab.eeb.lsa.umich.edu/202/data/"
read.csv(paste0(course.url,"mosquitoes.csv")) -> mos

The second contains data on primary energy production by source, year, and region.

read.csv(paste0(course.url,"energy_production.csv"),comment.char="#") -> nrg

The third dataset consists of measles cases in the 7 largest cities in England and Wales during the decades immediately before and after the introduction of the vaccine in 1968.

read.csv(paste0(course.url,"ewcitmeas.csv"),comment.char="#",
         colClasses=c(date="Date")) -> meas

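We can obtain an idea of the nature of these datasets as follows. The output shown comes from a version of R in which read.csv converts strings to factors by default; in R 4.0.0 and later, add stringsAsFactors=TRUE to the read.csv calls to reproduce it.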

head(mos)

##         type lifespan
## 1   wildtype       10
## 2 transgenic       39
## 3   wildtype       38
## 4   wildtype       47
## 5   wildtype       11
## 6   wildtype       16

sapply(mos,class)

##      type  lifespan 
##  "factor" "integer"

head(nrg)

##   year source           region       TJ
## 1 1900   Coal Asia and Oceania 542522.6
## 2 1901   Coal Asia and Oceania 612609.6
## 3 1902   Coal Asia and Oceania 653338.2
## 4 1903   Coal Asia and Oceania 694323.1
## 5 1904   Coal Asia and Oceania 731250.6
## 6 1905   Coal Asia and Oceania 780040.2

sapply(nrg,class)

##      year    source    region        TJ 
## "integer"  "factor"  "factor" "numeric"

range(nrg$year)

## [1] 1900 2014

levels(nrg$source)

## [1] "Coal"             "Gas"              "Hydro"           
## [4] "Nuclear"          "Oil"              "Other Renewables"

levels(nrg$region)

## [1] "Africa"                    "Asia and Oceania"         
## [3] "Central and South America" "Eurasia"                  
## [5] "Europe"                    "Middle East"              
## [7] "North America"

summary(nrg)

##       year                   source                          region   
##  Min.   :1900   Coal            :766   Africa                   :439  
##  1st Qu.:1942   Gas             :609   Asia and Oceania         :532  
##  Median :1971   Hydro           :686   Central and South America:516  
##  Mean   :1967   Nuclear         :302   Eurasia                  :489  
##  3rd Qu.:1993   Oil             :784   Europe                   :596  
##  Max.   :2014   Other Renewables:336   Middle East              :338  
##                                        North America            :573  
##        TJ           
##  Min.   :        4  
##  1st Qu.:    86461  
##  Median :   929178  
##  Mean   :  5758038  
##  3rd Qu.:  8293176  
##  Max.   :105601013  
##

head(meas)

##         date London Bristol Liverpool Manchester Newcastle Birmingham
## 1 1948-01-10     NA       3        40         22        58         78
## 2 1948-01-17    240       4        51         19        52         84
## 3 1948-01-24    284       3        54         23        34         65
## 4 1948-01-31    340       5        54         31        25        106
## 5 1948-02-07    511       1        89         66        27        142
## 6 1948-02-14    649       3        73         60        47        143
##   Sheffield
## 1         9
## 2        11
## 3        11
## 4         4
## 5         7
## 6         3

Reshaping data with reshape2

The reshape2 package works with a metaphor of melting and casting.

Melting

Melting a data frame

Melting takes a wide data frame and makes it long. Multiple columns are combined into one value column with a variable column keeping track of which column the different values came from. Only the columns containing measure variables are reshaped; those containing identifier variables are left alone.

library(reshape2)

melt(meas,
     measure.vars=c("London","Bristol","Liverpool","Manchester",
                    "Newcastle","Birmingham","Sheffield")) -> ml
head(ml)

##         date variable value
## 1 1948-01-10   London    NA
## 2 1948-01-17   London   240
## 3 1948-01-24   London   284
## 4 1948-01-31   London   340
## 5 1948-02-07   London   511
## 6 1948-02-14   London   649

Every variable is either an identifier or a measure variable. Thus, the following gives the same result as the first melt operation above.

melt(meas,id.vars=c("date")) -> ml
head(ml)

##         date variable value
## 1 1948-01-10   London    NA
## 2 1948-01-17   London   240
## 3 1948-01-24   London   284
## 4 1948-01-31   London   340
## 5 1948-02-07   London   511
## 6 1948-02-14   London   649

It is possible to override the default variable/value naming scheme:

melt(meas,id.vars=c("date"),
     value.name="cases",variable.name="town") -> ml
head(ml)

##         date   town cases
## 1 1948-01-10 London    NA
## 2 1948-01-17 London   240
## 3 1948-01-24 London   284
## 4 1948-01-31 London   340
## 5 1948-02-07 London   511
## 6 1948-02-14 London   649

Melting an array

One can also melt an array:

a <- array(LETTERS[1:15],dim=c(3,5)); a

##      [,1] [,2] [,3] [,4] [,5]
## [1,] "A"  "D"  "G"  "J"  "M" 
## [2,] "B"  "E"  "H"  "K"  "N" 
## [3,] "C"  "F"  "I"  "L"  "O"

melt(a)

##    Var1 Var2 value
## 1     1    1     A
## 2     2    1     B
## 3     3    1     C
## 4     1    2     D
## 5     2    2     E
## 6     3    2     F
## 7     1    3     G
## 8     2    3     H
## 9     3    3     I
## 10    1    4     J
## 11    2    4     K
## 12    3    4     L
## 13    1    5     M
## 14    2    5     N
## 15    3    5     O

b <- array(LETTERS[1:18],dim=c(2,3,3)); b

## , , 1
## 
##      [,1] [,2] [,3]
## [1,] "A"  "C"  "E" 
## [2,] "B"  "D"  "F" 
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,] "G"  "I"  "K" 
## [2,] "H"  "J"  "L" 
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,] "M"  "O"  "Q" 
## [2,] "N"  "P"  "R"

melt(b)

##    Var1 Var2 Var3 value
## 1     1    1    1     A
## 2     2    1    1     B
## 3     1    2    1     C
## 4     2    2    1     D
## 5     1    3    1     E
## 6     2    3    1     F
## 7     1    1    2     G
## 8     2    1    2     H
## 9     1    2    2     I
## 10    2    2    2     J
## 11    1    3    2     K
## 12    2    3    2     L
## 13    1    1    3     M
## 14    2    1    3     N
## 15    1    2    3     O
## 16    2    2    3     P
## 17    1    3    3     Q
## 18    2    3    3     R

c <- array(1:6,dim=c(3,2),
           dimnames=list(letters[1:3],LETTERS[1:2]))
melt(c)

##   Var1 Var2 value
## 1    a    A     1
## 2    b    A     2
## 3    c    A     3
## 4    a    B     4
## 5    b    B     5
## 6    c    B     6

The result is always a data frame. To override the default naming scheme for the variables, one can name the dimnames:

b <- array(LETTERS[1:18],dim=c(2,3,3),
           dimnames=list(one=1:2,two=c("a","b","c"),three=c("x","y","z"))); b

## , , three = x
## 
##    two
## one a   b   c  
##   1 "A" "C" "E"
##   2 "B" "D" "F"
## 
## , , three = y
## 
##    two
## one a   b   c  
##   1 "G" "I" "K"
##   2 "H" "J" "L"
## 
## , , three = z
## 
##    two
## one a   b   c  
##   1 "M" "O" "Q"
##   2 "N" "P" "R"

melt(b)

##    one two three value
## 1    1   a     x     A
## 2    2   a     x     B
## 3    1   b     x     C
## 4    2   b     x     D
## 5    1   c     x     E
## 6    2   c     x     F
## 7    1   a     y     G
## 8    2   a     y     H
## 9    1   b     y     I
## 10   2   b     y     J
## 11   1   c     y     K
## 12   2   c     y     L
## 13   1   a     z     M
## 14   2   a     z     N
## 15   1   b     z     O
## 16   2   b     z     P
## 17   1   c     z     Q
## 18   2   c     z     R

Casting

Casting turns a long data frame into a wide one. A single column (called the value column) is separated into multiple columns according to the specification given. Use dcast or acast according to whether you want the result as a data frame or an array.

dcast(ml,date~town) -> d1; head(d1)

##         date London Bristol Liverpool Manchester Newcastle Birmingham
## 1 1948-01-10     NA       3        40         22        58         78
## 2 1948-01-17    240       4        51         19        52         84
## 3 1948-01-24    284       3        54         23        34         65
## 4 1948-01-31    340       5        54         31        25        106
## 5 1948-02-07    511       1        89         66        27        142
## 6 1948-02-14    649       3        73         60        47        143
##   Sheffield
## 1         9
## 2        11
## 3        11
## 4         4
## 5         7
## 6         3

class(d1)

## [1] "data.frame"

dcast(nrg,source+year~region) -> d2; head(d2)

##   source year   Africa Asia and Oceania Central and South America  Eurasia
## 1   Coal 1900  24048.3         542522.6                   10798.4 389922.0
## 2   Coal 1901  40355.5         612609.6                   18657.2 398876.0
## 3   Coal 1902  60157.1         653338.2                   21673.0 397403.8
## 4   Coal 1903  81631.5         694323.1                   24982.4 407106.0
## 5   Coal 1904  92145.9         731250.6                   23518.5 473259.5
## 6   Coal 1905 106125.0         780040.2                   25142.5 450572.7
##     Europe Middle East North America
## 1 11541836          NA       6750472
## 2 11187846          NA       7349399
## 3 11262939          NA       7686815
## 4 11750772          NA       8955323
## 5 11876122          NA       8829374
## 6 12544596          NA       9839131

dcast(nrg,year+source~region) -> d3; head(d3)

##   year source  Africa Asia and Oceania Central and South America  Eurasia
## 1 1900   Coal 24048.3         542522.6                   10798.4 389922.0
## 2 1900    Gas      NA               NA                        NA       NA
## 3 1900  Hydro      NA               NA                      50.4       NA
## 4 1900    Oil      NA          23822.9                    1591.0 432998.9
## 5 1901   Coal 40355.5         612609.6                   18657.2 398876.0
## 6 1901    Gas      NA               NA                        NA       NA
##       Europe Middle East North America
## 1 11541835.5          NA     6750471.5
## 2         NA          NA      265825.5
## 3      396.0          NA        9849.6
## 4    27172.3          NA      360232.3
## 5 11187846.1          NA     7349398.6
## 6         NA          NA      294581.0

dcast(nrg,year~source+region) -> d4; head(d4)

##   year Coal_Africa Coal_Asia and Oceania Coal_Central and South America
## 1 1900     24048.3              542522.6                        10798.4
## 2 1901     40355.5              612609.6                        18657.2
## 3 1902     60157.1              653338.2                        21673.0
## 4 1903     81631.5              694323.1                        24982.4
## 5 1904     92145.9              731250.6                        23518.5
## 6 1905    106125.0              780040.2                        25142.5
##   Coal_Eurasia Coal_Europe Coal_Middle East Coal_North America Gas_Africa
## 1     389922.0    11541836               NA            6750472         NA
## 2     398876.0    11187846               NA            7349399         NA
## 3     397403.8    11262939               NA            7686815         NA
## 4     407106.0    11750772               NA            8955323         NA
## 5     473259.5    11876122               NA            8829374         NA
## 6     450572.7    12544596               NA            9839131         NA
##   Gas_Asia and Oceania Gas_Central and South America Gas_Eurasia
## 1                   NA                            NA          NA
## 2                   NA                            NA          NA
## 3                   NA                            NA          NA
## 4                   NA                            NA          NA
## 5                   NA                            NA          NA
## 6                   NA                            NA          NA
##   Gas_Europe Gas_Middle East Gas_North America Hydro_Africa
## 1         NA              NA          265825.5           NA
## 2         NA              NA          294581.0           NA
## 3         NA              NA          323336.5           NA
## 4         NA              NA          352092.0           NA
## 5         NA              NA          380847.5           NA
## 6         NA              NA          409603.1           NA
##   Hydro_Asia and Oceania Hydro_Central and South America Hydro_Eurasia
## 1                     NA                            50.4            NA
## 2                     NA                            57.6            NA
## 3                     NA                              NA            NA
## 4                     NA                              NA            NA
## 5                     NA                              NA            NA
## 6                     NA                           226.8            NA
##   Hydro_Europe Hydro_Middle East Hydro_North America Nuclear_Africa
## 1          396                NA              9849.6             NA
## 2          576                NA             10908.0             NA
## 3          792                NA             12312.0             NA
## 4         1080                NA             14137.2             NA
## 5         1260                NA             16131.6             NA
## 6         1440                NA             18194.4             NA
##   Nuclear_Asia and Oceania Nuclear_Central and South America
## 1                       NA                                NA
## 2                       NA                                NA
## 3                       NA                                NA
## 4                       NA                                NA
## 5                       NA                                NA
## 6                       NA                                NA
##   Nuclear_Eurasia Nuclear_Europe Nuclear_Middle East Nuclear_North America
## 1              NA             NA                  NA                    NA
## 2              NA             NA                  NA                    NA
## 3              NA             NA                  NA                    NA
## 4              NA             NA                  NA                    NA
## 5              NA             NA                  NA                    NA
## 6              NA             NA                  NA                    NA
##   Oil_Africa Oil_Asia and Oceania Oil_Central and South America
## 1         NA              23822.9                        1591.0
## 2         NA              37262.5                        1632.9
## 3         NA              29684.4                        1339.8
## 4         NA              52628.1                        1549.1
## 5         NA              64225.5                        1632.9
## 6         NA              75781.1                        2093.4
##   Oil_Eurasia Oil_Europe Oil_Middle East Oil_North America
## 1    432998.9    27172.3              NA          360232.3
## 2    484161.6    31484.7              NA          391675.1
## 3    459878.1    37806.8              NA          498689.7
## 4    431700.9    48315.7              NA          563962.0
## 5    448992.4    60792.3              NA          657495.1
## 6    312251.5    65607.2              NA          757099.0
##   Other Renewables_Africa Other Renewables_Asia and Oceania
## 1                      NA                                NA
## 2                      NA                                NA
## 3                      NA                                NA
## 4                      NA                                NA
## 5                      NA                                NA
## 6                      NA                                NA
##   Other Renewables_Central and South America Other Renewables_Eurasia
## 1                                         NA                       NA
## 2                                         NA                       NA
## 3                                         NA                       NA
## 4                                         NA                       NA
## 5                                         NA                       NA
## 6                                         NA                       NA
##   Other Renewables_Europe Other Renewables_Middle East
## 1                      NA                           NA
## 2                      NA                           NA
## 3                      NA                           NA
## 4                      NA                           NA
## 5                      NA                           NA
## 6                      NA                           NA
##   Other Renewables_North America
## 1                             NA
## 2                             NA
## 3                             NA
## 4                             NA
## 5                             NA
## 6                             NA

acast(nrg,source+year~region) -> a1; head(a1)

##             Africa Asia and Oceania Central and South America  Eurasia
## Coal_1900  24048.3         542522.6                   10798.4 389922.0
## Coal_1901  40355.5         612609.6                   18657.2 398876.0
## Coal_1902  60157.1         653338.2                   21673.0 397403.8
## Coal_1903  81631.5         694323.1                   24982.4 407106.0
## Coal_1904  92145.9         731250.6                   23518.5 473259.5
## Coal_1905 106125.0         780040.2                   25142.5 450572.7
##             Europe Middle East North America
## Coal_1900 11541836          NA       6750472
## Coal_1901 11187846          NA       7349399
## Coal_1902 11262939          NA       7686815
## Coal_1903 11750772          NA       8955323
## Coal_1904 11876122          NA       8829374
## Coal_1905 12544596          NA       9839131

class(a1); dim(a1)

## [1] "matrix"

## [1] 618   7

acast(nrg,year~source~region) -> a2;
class(a2); dim(a2)

## [1] "array"

## [1] 115   6   7

dimnames(a2)

## [[1]]
##   [1] "1900" "1901" "1902" "1903" "1904" "1905" "1906" "1907" "1908" "1909"
##  [11] "1910" "1911" "1912" "1913" "1914" "1915" "1916" "1917" "1918" "1919"
##  [21] "1920" "1921" "1922" "1923" "1924" "1925" "1926" "1927" "1928" "1929"
##  [31] "1930" "1931" "1932" "1933" "1934" "1935" "1936" "1937" "1938" "1939"
##  [41] "1940" "1941" "1942" "1943" "1944" "1945" "1946" "1947" "1948" "1949"
##  [51] "1950" "1951" "1952" "1953" "1954" "1955" "1956" "1957" "1958" "1959"
##  [61] "1960" "1961" "1962" "1963" "1964" "1965" "1966" "1967" "1968" "1969"
##  [71] "1970" "1971" "1972" "1973" "1974" "1975" "1976" "1977" "1978" "1979"
##  [81] "1980" "1981" "1982" "1983" "1984" "1985" "1986" "1987" "1988" "1989"
##  [91] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999"
## [101] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [111] "2010" "2011" "2012" "2013" "2014"
## 
## [[2]]
## [1] "Coal"             "Gas"              "Hydro"           
## [4] "Nuclear"          "Oil"              "Other Renewables"
## 
## [[3]]
## [1] "Africa"                    "Asia and Oceania"         
## [3] "Central and South America" "Eurasia"                  
## [5] "Europe"                    "Middle East"              
## [7] "North America"
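
As a compact check on the melting and casting metaphor, the following sketch melts a small data frame and casts it back again. The data frame wide and its columns are hypothetical, not taken from the course datasets. Passing value.var explicitly is optional; it only avoids having dcast guess which column holds the values.

```r
library(reshape2)

# Hypothetical toy data: weekly counts in two towns
wide <- data.frame(
  week = 1:3,
  Leeds = c(5, 7, 2),
  York = c(1, 0, 4)
)

# wide -> long: one row per (week, town) pair
long <- melt(wide, id.vars = "week",
             variable.name = "town", value.name = "cases")

# long -> wide: one column per town, as in the original
wide2 <- dcast(long, week ~ town, value.var = "cases")
wide2
```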

Split-apply-combine with plyr

plyr implements a very flexible and intuitive syntax for split-apply-combine computations. That is, it allows you to split data according to a wide range of criteria, apply some operation to each piece, then recombine the pieces.

In the following, we first detail the “basic” functions that make up the “apply” piece of split-apply-combine. Then, we discuss the “split” and “combine” pieces.

Basic plyr functions

The following are the basic functions for manipulating data using plyr.

`arrange`

arrange sorts a data frame according to specifications.

library(plyr)

arrange(mos,type)
arrange(ml,date,town)
arrange(nrg,-year,source,region)
arrange(nrg,-TJ,year)
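
To see the effect of arrange without printing a whole dataset, here is a minimal sketch on a hypothetical four-row data frame. It uses plyr's desc, which plays the same role as the minus sign in the calls above but also works for non-numeric variables.

```r
library(plyr)

# Hypothetical toy data frame
df <- data.frame(
  name = c("a", "b", "c", "d"),
  group = c("y", "x", "y", "x"),
  score = c(3, 1, 2, 4)
)

# sort by group, then by decreasing score within each group
sorted <- arrange(df, group, desc(score))
sorted
```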

`count`

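count tallies the distinct combinations of the specified variables that occur in a data frame and returns them as a data frame, with the number of occurrences in a column named freq.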

count(mos,~type)
count(ml,~town)
count(nrg,~region+source)
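
The following sketch shows what count returns, using a hypothetical five-row data frame rather than the course data.

```r
library(plyr)

# Hypothetical toy data frame
df <- data.frame(
  type = c("wt", "tg", "wt", "wt", "tg"),
  sex = c("f", "f", "m", "f", "m")
)

# one row per combination of type and sex that occurs, with its frequency
counts <- count(df, ~type + sex)
counts
```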

`summarise` and `summarize`

Given a data frame, summarise (synonym: summarize) produces a new data frame containing only the summaries requested.

summarize(ml,mean=mean(cases,na.rm=TRUE),
          sd=sd(cases,na.rm=TRUE),
          midpoint=mean(date))

##       mean       sd   midpoint
## 1 122.1738 291.2688 1967-12-25

summarize(nrg,tot=sum(TJ),n=length(TJ))
summarize(nrg,range(year))
summarize(nrg,min(year),max(year),interval=diff(range(year)))

`mutate`

Given a data frame, mutate modifies, adds, or removes variables.

mutate(mos,lsw=lifespan/7)
mutate(ml,
       year=as.integer(format(date,"%Y")),
       month=format(date,"%b"), 
       week=as.integer(format(date,"%V")),
       time=seq_along(date)) -> ml
mutate(nrg,
       region=abbreviate(region,3),
       carbon=source %in% c("Coal","Gas","Oil"),
       renewable=source %in% c("Hydro","Other Renewables")) -> nrg
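
One feature of mutate worth illustrating is that the new variables are computed in order, so a later expression can use a variable created earlier in the same call. A minimal sketch, on a hypothetical data frame:

```r
library(plyr)

# Hypothetical toy data frame
df <- data.frame(id = 1:3, days = c(7, 14, 21))

out <- mutate(df,
              weeks = days / 7,        # add a variable
              fortnights = weeks / 2)  # use the variable just created
out
```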

`subset`

subset doesn’t belong to plyr, but would if it didn’t already exist in the base package. This function allows you to choose a subset of rows and/or columns. The subset argument specifies a logical condition: those rows that satisfy it are chosen. The select argument picks out which columns to keep or throw away.

subset(ml,date<"1949-01-01")
subset(nrg,select=c(year,source))
subset(nrg,carbon,select=-carbon)
subset(mos,type=="wildtype" & lifespan>10)
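
The two arguments can be combined in one call. The sketch below uses a hypothetical data frame shaped like mos, with an extra column cage invented for illustration.

```r
# Hypothetical toy data frame; the column cage is invented for this example
df <- data.frame(
  type = c("wt", "tg", "wt"),
  lifespan = c(10, 39, 38),
  cage = c(1, 1, 2)
)

# keep the rows satisfying the condition, then drop the cage column
kept <- subset(df, type == "wt" & lifespan > 10, select = -cage)
kept
```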

`join`

join can be used to merge two data frames together. This can be done in several ways. It can perform a left join, a right join, an inner join, or a full join. Read the documentation (?join) for explanations.

data.frame(
  source=c("Coal","Gas","Hydro","Nuclear","Oil","Other Renewables"),
  category=c("dirty","dirty","clean","dirty","dirty","clean")
) -> cats
join(cats,nrg,by="source",type="right") -> x; head(x)

##   source category year region       TJ carbon renewable
## 1   Coal    dirty 1900    AaO 542522.6   TRUE     FALSE
## 2   Coal    dirty 1901    AaO 612609.6   TRUE     FALSE
## 3   Coal    dirty 1902    AaO 653338.2   TRUE     FALSE
## 4   Coal    dirty 1903    AaO 694323.1   TRUE     FALSE
## 5   Coal    dirty 1904    AaO 731250.6   TRUE     FALSE
## 6   Coal    dirty 1905    AaO 780040.2   TRUE     FALSE

join(nrg,cats,by="source",type="left") -> nrg; head(nrg)

##   year source region       TJ carbon renewable category
## 1 1900   Coal    AaO 542522.6   TRUE     FALSE    dirty
## 2 1901   Coal    AaO 612609.6   TRUE     FALSE    dirty
## 3 1902   Coal    AaO 653338.2   TRUE     FALSE    dirty
## 4 1903   Coal    AaO 694323.1   TRUE     FALSE    dirty
## 5 1904   Coal    AaO 731250.6   TRUE     FALSE    dirty
## 6 1905   Coal    AaO 780040.2   TRUE     FALSE    dirty
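
Because every source in nrg has a match in cats, the examples above do not show how the join types differ. The following sketch uses two hypothetical data frames whose keys only partly overlap.

```r
library(plyr)

# Hypothetical toy data frames with partly overlapping keys
left  <- data.frame(key = c("a", "b", "c"), x = 1:3)
right <- data.frame(key = c("b", "c", "d"), y = c(20, 30, 40))

j_left  <- join(left, right, by = "key", type = "left")   # all rows of left
j_inner <- join(left, right, by = "key", type = "inner")  # only keys in both
j_full  <- join(left, right, by = "key", type = "full")   # keys in either
j_left
```

In the left join, the row with key "a" has no partner in right, so its y is NA.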

The `-ply` functions

plyr provides a systematic, intuitive, and regular expansion of base R’s apply family (apply, lapply, sapply, tapply, mapply) and replicate. Collectively, these functions implement the split-apply-combine pattern of computation. They first split the data up according to some criterion, then apply some function, then combine the results. The functions are all named according to the scheme XYply, where X tells about the class of the source object and Y the class of the desired target object. In particular, X can be d (data frame), a (array), l (list), m (arguments to a multi-argument function), or r (replicate), and Y can be d, a, l, or _ (null, i.e., the result is discarded).
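
The two less obvious letters in this scheme, r and _, can be sketched as follows. The expressions are placeholders chosen only to make the results easy to check.

```r
library(plyr)

# r = replicate: evaluate an expression 3 times and
# collect the results in an array (here a plain vector)
reps <- raply(3, 2 + 2)

# _ = null: call a function on each element and discard the results;
# this is useful when the function is wanted only for its side effects
out <- l_ply(1:3, function(i) i^2)
```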

`ddply`

This is probably the most useful of the lot. It splits a data frame according to some criterion, conveniently expressed as a formula involving the variables of the data frame, applies a specified function, and combines the results back into a data frame. It is best to use a function that returns a data frame, but if the function returns something else, ddply will attempt to coerce the value into a data frame. Here are some examples:

ddply(nrg,~source,subset,TJ==max(TJ))

##   year           source region        TJ carbon renewable category
## 1 2011             Coal    AaO 105601013   TRUE     FALSE    dirty
## 2 2014              Gas    NrA  35538636   TRUE     FALSE    dirty
## 3 2013            Hydro    AaO   4756182  FALSE      TRUE    clean
## 4 2004          Nuclear    Erp  10587219  FALSE     FALSE    dirty
## 5 2012              Oil    MdE  59215428   TRUE     FALSE    dirty
## 6 2014 Other Renewables    NrA   2562705  FALSE      TRUE    clean

ddply(nrg,~category+region,summarize,TJ=mean(TJ))

##    category region          TJ
## 1     clean    AaO   666330.12
## 2     clean    Afr    86670.82
## 3     clean   CaSA   543343.88
## 4     clean    Erp   572569.53
## 5     clean    Ers   323347.13
## 6     clean    MdE    21563.08
## 7     clean    NrA   883269.50
## 8     dirty    AaO  9038474.91
## 9     dirty    Afr  3536241.58
## 10    dirty   CaSA  2559243.08
## 11    dirty    Erp  7230392.98
## 12    dirty    Ers  8286548.41
## 13    dirty    MdE 10285141.64
## 14    dirty    NrA 14217332.73

ddply(nrg,~source+region,summarize,TJ=sum(TJ)) -> x
ddply(x,~region,summarize,source=source,frac=TJ/sum(TJ)) -> x
mutate(x,frac=round(frac,3)) -> x
dcast(x,region~source) -> x
arrange(x,region)

##   region  Coal   Gas Hydro Nuclear   Oil Other Renewables
## 1    AaO 0.631 0.108 0.027   0.040 0.191            0.003
## 2    Afr 0.196 0.151 0.009   0.003 0.641            0.000
## 3   CaSA 0.052 0.151 0.066   0.005 0.712            0.015
## 4    Erp 0.559 0.165 0.035   0.104 0.129            0.007
## 5    Ers 0.235 0.352 0.012   0.025 0.376            0.000
## 6    MdE 0.001 0.129 0.001   0.000 0.869            0.000
## 7    NrA 0.314 0.277 0.021   0.049 0.334            0.004

Notice that only combinations of the variables that exist are included in the result by default.

`daply`

This one is very similar, except that, as the name implies, the result is returned as an array:

s <- function (df) sum(df$TJ)
daply(nrg,~region,s)
daply(nrg,~region+source,s) -> y; y
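
As a small illustration of the shape of the result, the sketch below applies daply to a hypothetical four-row data frame. With two splitting variables, the result is a matrix with one dimension per variable.

```r
library(plyr)

# Hypothetical toy data frame shaped like nrg
df <- data.frame(
  region = c("N", "N", "S", "S"),
  source = c("Coal", "Oil", "Coal", "Coal"),
  TJ = c(1, 2, 3, 4)
)

# total TJ for each region-by-source combination
tot <- daply(df, ~region + source, function(d) sum(d$TJ))
tot
```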

`dlply`

This splits the data according to the given specifications, applies the function, and returns each result (as its name implies) as a distinct element of a list.

dlply(nrg,~region,summarize,TJ=sum(TJ))
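
A list is the natural target when the function's result is not a single number or a data frame row. A minimal sketch, on a hypothetical data frame:

```r
library(plyr)

# Hypothetical toy data frame
df <- data.frame(g = c("a", "a", "b", "b", "b"), x = c(1, 2, 3, 4, 5))

# one list element per group, named after the group
res <- dlply(df, ~g, function(d) range(d$x))
names(res)
res$b
```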

`adply`, `aaply`, `alply`

These take arrays and, like the base function apply, divide the array up into slices along specified directions. They then apply a function to each slice and return the results in the desired form (if possible). As an example, we first create an array from nrg using daply, then act on it with aaply.

mutate(nrg,time=year-min(year)) -> x
daply(x,~source+region,function(df) min(df$time)) -> A; A

##                   region
## source             AaO Afr CaSA Erp Ers MdE NrA
##   Coal               0   0    0   0   0  39   0
##   Gas               22  46   29  22  22  55   0
##   Hydro             14  29    0   0  13  37   0
##   Nuclear           63  84   66  56  66 111  57
##   Oil                0  11    0   0   0   6   0
##   Other Renewables  59  80   75  16  89  90  60

aaply(A,1,max)

##             Coal              Gas            Hydro          Nuclear 
##               39               55               37              111 
##              Oil Other Renewables 
##               11               90

aaply(y,1,function(x)x/sum(x))

##       
## region         Coal       Gas        Hydro       Nuclear       Oil
##   AaO  0.6313563678 0.1075899 0.0267581794 0.04014686169 0.1909662
##   Afr  0.1959709264 0.1508423 0.0088339808 0.00326077904 0.6407904
##   CaSA 0.0515765633 0.1514090 0.0657369730 0.00475165703 0.7115233
##   Erp  0.5593898932 0.1651974 0.0354909302 0.10357590283 0.1293587
##   Ers  0.2350606877 0.3524597 0.0120168576 0.02478184174 0.3755739
##   MdE  0.0005844682 0.1292670 0.0007873494 0.00004092059 0.8693124
##   NrA  0.3144170820 0.2770508 0.0212464315 0.04912379344 0.3338705
##       
## region Other Renewables
##   AaO    0.003182454260
##   Afr    0.000301592264
##   CaSA   0.015002504869
##   Erp    0.006987204623
##   Ers    0.000106998188
##   MdE    0.000007842489
##   NrA    0.004291397089
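
The examples above use only aaply. The following sketch applies all three functions to the same hypothetical matrix m, so that the three output types can be compared.

```r
library(plyr)

# Hypothetical 2 x 3 matrix with named dimnames
m <- matrix(1:6, nrow = 2,
            dimnames = list(row = c("r1", "r2"), col = c("A", "B", "C")))

row_sums_a <- aaply(m, 1, sum)   # slices along rows, array out
col_sums_l <- alply(m, 2, sum)   # slices along columns, list out
row_sums_d <- adply(m, 1, function(v) data.frame(total = sum(v)))  # data frame out
row_sums_d
```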

Exercise

Use the d-ply and a-ply functions to compute some interesting summary statistics on the energy data.

`llply`, `laply`, `ldply`

These functions are generalizations of lapply and sapply.
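
A minimal sketch, applying the same function to a hypothetical named list and asking for each of the three output types in turn:

```r
library(plyr)

# Hypothetical input: a named list of integer vectors
pieces <- list(a = 1:3, b = 4:6)

as_list  <- llply(pieces, sum)   # list in, list out (compare lapply)
as_array <- laply(pieces, sum)   # list in, array out (compare sapply)
as_df    <- ldply(pieces, function(v) data.frame(total = sum(v)))  # list in, data frame out
as_df
```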

`mlply`, `maply`, `mdply`

These work with multi-argument functions.
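
Here, each row of a data frame supplies one set of arguments, matched to the function's parameters by column name. A minimal sketch, with a hypothetical data frame args and a placeholder function:

```r
library(plyr)

# Hypothetical argument table: each row is one call, columns are argument names
args <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# call the function once per row; collect the results in a data frame
res_d <- mdply(args, function(x, y) data.frame(total = x + y))

# the same calls, with the results collected in a list
res_l <- mlply(args, function(x, y) x + y)
res_d
```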

Other functions

`rename`, `revalue`, `mapvalues`

rename helps one to change the (column) names of a data frame.

rename(nrg,c(TJ='energy',year="time"))

revalue allows you to change one or more of the levels of a factor without worrying about how the factors are coded.

mapvalues does the same, but works on vectors of any type.

mutate(nrg,
       region=revalue(region,c(AaO="Asia",CaSA="Latin.America")),
       source=mapvalues(source,
                        from=c("Coal","Gas","Oil"),
                        to=c("Carbon","Carbon","Carbon")))
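
The three functions can be compared side by side on small hypothetical inputs:

```r
library(plyr)

# revalue: recode levels of a factor; levels given the same new name are merged
f <- factor(c("Coal", "Gas", "Hydro", "Coal"))
f2 <- revalue(f, c(Coal = "Carbon", Gas = "Carbon"))

# mapvalues: the same idea for a vector of any type, here numeric
v2 <- mapvalues(c(1, 2, 3, 2), from = c(2, 3), to = c(20, 30))

# rename: change column names, given as c(old = "new")
d2 <- rename(data.frame(TJ = 1, year = 2000), c(TJ = "energy"))
```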

Pipelines with magrittr

ceci n’est pas une pipe
René Magritte, La Trahison des Images

magrittr gives a set of “pipe” operators. These allow one to chain operations together. When calculations get complex, it is easier and more natural to view them as a chain of operations instead of using nested function calls or defining intermediate variables.

The `%>%` operator

f(g(data, a, b, c, ...), d, e, ...)

is equivalent to

data %>% g(a, b, c, ...) %>% f(d, e, ...)
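
As a concrete instance of this equivalence, the following sketch computes the same quantity with nested calls and with a pipeline. The vector x is a placeholder.

```r
library(magrittr)

x <- c(4, 9, 16)

# nested calls: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# pipeline: read from left to right
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```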

The `%<>%` operator

x %>% f(a, b, c, ...) -> x

is equivalent to

x %<>% f(a, b, c, ...)

For example, we can replace the long series of computations above with the following pipeline.

library(magrittr)

nrg %>%
  ddply(~source+region,summarize,TJ=sum(TJ)) %>%
  ddply(~region,summarize,source=source,frac=TJ/sum(TJ)) %>%
  mutate(frac=round(frac,3)) %>%
  dcast(region~source) %>%
  mutate(region=revalue(region,c(AaO="AO",CaSA="LA",Erp="Eur",NrA="NA"))) %>%
  arrange(region)

##   region  Coal   Gas Hydro Nuclear   Oil Other Renewables
## 1    Afr 0.196 0.151 0.009   0.003 0.641            0.000
## 2     AO 0.631 0.108 0.027   0.040 0.191            0.003
## 3    Ers 0.235 0.352 0.012   0.025 0.376            0.000
## 4    Eur 0.559 0.165 0.035   0.104 0.129            0.007
## 5     LA 0.052 0.151 0.066   0.005 0.712            0.015
## 6    MdE 0.001 0.129 0.001   0.000 0.869            0.000
## 7     NA 0.314 0.277 0.021   0.049 0.334            0.004
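
The pipeline above uses only %>%. As a minimal sketch of the compound assignment operator, on a hypothetical vector v:

```r
library(magrittr)

v <- c(3, 1, 2)

# pipe v into sort() and assign the result back to v
v %<>% sort()
v
```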

The `%$%` operator

d %$% e exposes the variables in the data frame (or list) d to the expression e. For example:

ml %$% levels(town)

## [1] "London"     "Bristol"    "Liverpool"  "Manchester" "Newcastle" 
## [6] "Birmingham" "Sheffield"

mos %$% mean(lifespan)

## [1] 18.54011

nrg %>% 
  subset(year==2014) %$%
  sum(TJ)

## [1] 527258533

Advanced data munging

with plyr, reshape2, and magrittr

Aaron A. King

