Chapter 21 The Hadleyverse
The Hadleyverse, short for “Hadley Wickham’s universe”, is a set of packages that make it easier to handle data. If you are developing packages, you should be careful since using these packages may create many dependencies and compatibility issues. If you are analyzing data, and the portability of your functions to other users, machines, and operating systems is not of a concern, you will LOVE these packages. The term Hadleyverse refers to all of Hadley’s packages, but here, we mention only a useful subset, which can be collectively installed via the tidyverse package:
- ggplot2 for data visualization. See the Plotting Chapter 12.
- dplyr for data manipulation.
- tidyr for data tidying.
- readr for data import.
- stringr for character strings.
- anytime for time data.
21.1 readr
The readr package (Wickham, Hester, and Francois 2016) replaces base functions for importing and exporting data such as read.table
.
It is faster, with a cleaner syntax.
We will not go into the details and refer the reader to the official documentation here and the R for data sciecne book.
21.2 dplyr
When you think of data frame operations, think dplyr (Wickham and Francois 2016). Notable utilities in the package include:
select()
Select columns from a data frame.filter()
Filter rows according to some condition(s).arrange()
Sort / Re-order rows in a data frame.mutate()
Create new columns or transform existing ones.group_by()
Group a data frame by some factor(s) usually in conjunction to summary.summarize()
Summarize some values from the data frame or across groups.inner_join(x,y,by="col")
return all rows from ‘x’ where there are matching values in ‘x’, and all columns from ‘x’ and ‘y’. If there are multiple matches between ‘x’ and ‘y’, all combination of the matches are returned.left_join(x,y,by="col")
return all rows from ‘x’, and all columns from ‘x’ and ‘y’. Rows in ‘x’ with no match in ‘y’ will have ‘NA’ values in the new columns. If there are multiple matches between ‘x’ and ‘y’, all combinations of the matches are returned.right_join(x,y,by="col")
return all rows from ‘y’, and all columns from ‘x’ and y. Rows in ‘y’ with no match in ‘x’ will have ‘NA’ values in the new columns. If there are multiple matches between ‘x’ and ‘y’, all combinations of the matches are returned.anti_join(x,y,by="col")
return all rows from ‘x’ where there are not matching values in ‘y’, keeping just columns from ‘x’.
The following example involve data.frame
objects, but dplyr can handle other classes.
In particular data.table
s from the data.table package (Dowle and Srinivasan 2017), which is designed for very large data sets.
dplyr can work with data stored in a database. In which case, it will convert your command to the appropriate SQL syntax, and issue it to the database. This has the advantage that (a) you do not need to know the specific SQL implementation of your database, and (b), you can enjoy the optimized algorithms provided by the database supplier. For more on this, see the databses vignette.
The following examples are taken from Kevin Markham.
The nycflights13::flights
has delay data for US flights.
library(nycflights13)
flights
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # … with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, ind <int>
The data is of class tbl_df
which is an extension of the data.frame
class, designed for large data sets.
Notice that the printing of flights
is short, even without calling the head
function. This is a feature of the tbl_df
class ( print(data.frame)
would try to load all the data, thus take a long time).
class(flights) # a tbl_df is an extension of the data.frame class
## [1] "tbl_df" "tbl" "data.frame"
Let’s filter the observations from the first day of the first month. Notice how much better (i.e. readable) is the dplyr syntax, with piping, compared to the basic syntax.
flights[flights$month == 1 & flights$day == 1, ] # old style
library(dplyr)
filter(flights, month == 1, day == 1) #dplyr style
flights %>% filter(month == 1, day == 1) # dplyr with piping.
More filtering.
filter(flights, month == 1 | month == 2) # First OR second month.
slice(flights, 1:10) # selects first ten rows.
arrange(flights, year, month, day) # sort
arrange(flights, desc(arr_delay)) # sort descending
select(flights, year, month, day) # select columns year, month, and day
select(flights, year:day) # select column range
select(flights, -(year:day)) # drop columns
rename(flights, c(tail_num = "tailnum")) # rename column
# add a new computed colume
mutate(flights,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60)
# you can refer to columns you just created! (gain)
mutate(flights,
gain = arr_delay - dep_delay,
gain_per_hour = gain / (air_time / 60)
)
# keep only new variables, not all data frame.
transmute(flights,
gain = arr_delay - dep_delay,
gain_per_hour = gain / (air_time / 60)
)
# simple statistics
summarise(flights,
delay = mean(dep_delay, na.rm = TRUE)
)
# random subsample
sample_n(flights, 10)
sample_frac(flights, 0.01)
We now perform operations on subgroups.
we group observations along the plane’s tail number (tailnum
), and compute the count, average distance traveled, and average delay.
We group with group_by
, and compute subgroup statistics with summarise
.
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
count = length(),
avg.dist = mean(distance, na.rm = TRUE),
avg.delay = mean(arr_delay, na.rm = TRUE))
delay
We can group along several variables, with a hierarchy. We then collapse the hierarchy one by one.
daily <- group_by(flights, year, month, day)
per_day <- summarise(daily, flights = n())
per_month <- summarise(per_day, flights = sum(flights))
per_year <- summarise(per_month, flights = sum(flights))
Things to note:
- Every call to
summarise
collapses one level in the hierarchy of grouping. The output ofgroup_by
recalls the hierarchy of aggregation, and collapses along this hierarchy.
We can use dplyr for two table operations, i.e., joins.
For this, we join the flight data, with the airplane data in airplanes
.
library(dplyr)
airlines
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
# select the subset of interesting flight data.
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
# join on left table with automatic matching.
flights2 %>% left_join(airlines)
## # A tibble: 336,776 x 9
## year month day hour origin dest tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA United Air Lines I…
## 2 2013 1 1 5 LGA IAH N24211 UA United Air Lines I…
## 3 2013 1 1 5 JFK MIA N619AA AA American Airlines …
## 4 2013 1 1 5 JFK BQN N804JB B6 JetBlue Airways
## 5 2013 1 1 6 LGA ATL N668DN DL Delta Air Lines In…
## 6 2013 1 1 5 EWR ORD N39463 UA United Air Lines I…
## 7 2013 1 1 6 EWR FLL N516JB B6 JetBlue Airways
## 8 2013 1 1 6 LGA IAD N829AS EV ExpressJet Airline…
## 9 2013 1 1 6 JFK MCO N593JB B6 JetBlue Airways
## 10 2013 1 1 6 LGA ORD N3ALAA AA American Airlines …
## # … with 336,766 more rows
flights2 %>% left_join(weather)
## # A tibble: 336,776 x 18
## year month day hour origin dest tailnum carrier temp dewp humid
## <dbl> <dbl> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
## 2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
## 3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
## 4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
## 5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
## 6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
## 7 2013 1 1 6 EWR FLL N516JB B6 37.9 28.0 67.2
## 8 2013 1 1 6 LGA IAD N829AS EV 39.9 25.0 54.8
## 9 2013 1 1 6 JFK MCO N593JB B6 37.9 27.0 64.3
## 10 2013 1 1 6 LGA ORD N3ALAA AA 39.9 25.0 54.8
## # … with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## # wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
# join with named matching
flights2 %>% left_join(planes, by = "tailnum")
## # A tibble: 336,776 x 16
## year.x month day hour origin dest tailnum carrier year.y type
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixe…
## 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixe…
## 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixe…
## 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixe…
## 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixe…
## 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixe…
## 7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixe…
## 8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixe…
## 9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixe…
## 10 2013 1 1 6 LGA ORD N3ALAA AA NA <NA>
## # … with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## # model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
# join with explicit column matching
flights2 %>% left_join(airports, by= c("dest" = "faa"))
## # A tibble: 336,776 x 15
## year month day hour origin dest tailnum carrier name lat lon
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA Geor… 30.0 -95.3
## 2 2013 1 1 5 LGA IAH N24211 UA Geor… 30.0 -95.3
## 3 2013 1 1 5 JFK MIA N619AA AA Miam… 25.8 -80.3
## 4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA
## 5 2013 1 1 6 LGA ATL N668DN DL Hart… 33.6 -84.4
## 6 2013 1 1 5 EWR ORD N39463 UA Chic… 42.0 -87.9
## 7 2013 1 1 6 EWR FLL N516JB B6 Fort… 26.1 -80.2
## 8 2013 1 1 6 LGA IAD N829AS EV Wash… 38.9 -77.5
## 9 2013 1 1 6 JFK MCO N593JB B6 Orla… 28.4 -81.3
## 10 2013 1 1 6 LGA ORD N3ALAA AA Chic… 42.0 -87.9
## # … with 336,766 more rows, and 4 more variables: alt <int>, tz <dbl>,
## # dst <chr>, tzone <chr>
Types of join with SQL equivalent.
# Create simple data
(df1 <- data_frame(x = c(1, 2), y = 2:1))
## # A tibble: 2 x 2
## x y
## <dbl> <int>
## 1 1 2
## 2 2 1
(df2 <- data_frame(x = c(1, 3), a = 10, b = "a"))
## # A tibble: 2 x 3
## x a b
## <dbl> <dbl> <chr>
## 1 1 10 a
## 2 3 10 a
# Return only matched rows
df1 %>% inner_join(df2) # SELECT * FROM x JOIN y ON x.a = y.a
## # A tibble: 1 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
# Return all rows in df1.
df1 %>% left_join(df2) # SELECT * FROM x LEFT JOIN y ON x.a = y.a
## # A tibble: 2 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
## 2 2 1 NA <NA>
# Return all rows in df2.
df1 %>% right_join(df2) # SELECT * FROM x RIGHT JOIN y ON x.a = y.a
## # A tibble: 2 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
## 2 3 NA 10 a
# Return all rows.
df1 %>% full_join(df2) # SELECT * FROM x FULL JOIN y ON x.a = y.a
## # A tibble: 3 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
## 2 2 1 NA <NA>
## 3 3 NA 10 a
# Like left_join, but returning only columns in df1
df1 %>% semi_join(df2, by = "x") # SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a)
## # A tibble: 1 x 2
## x y
## <dbl> <int>
## 1 1 2
21.3 tidyr
21.4 reshape2
21.5 stringr
21.6 anytime
21.7 Biblipgraphic Notes
21.8 Practice Yourself
References
Dowle, Matt, and Arun Srinivasan. 2017. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.
Wickham, Hadley, and Romain Francois. 2016. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, Jim Hester, and Romain Francois. 2016. Readr: Read Tabular Data. https://CRAN.R-project.org/package=readr.