Chapter 4 data.table
data.table
is an excellent extension of the data.frame
class6.
If used as a data.frame
it will look and feel like a data frame.
If, however, it is used with it’s unique capabilities, it will prove faster and easier to manipulate.
This is because data.frame
s, like most of R objects, make a copy of themselves when modified.
This is known as passing by value, and it is done to ensure that object are not corrupted if an operation fails (if your computer shuts down before the operation is completed, for instance).
Making copies of large objects is clearly time and memory consuming.
A data.table
can make changes in place.
This is known as passing by reference, which is considerably faster than passing by value.
Let’s start with importing some freely available car sales data from Kaggle.
library(data.table)
library(magrittr)
auto <- fread('data/autos.csv')
View(auto)
dim(auto) # Rows and columns
## [1] 371824 20
names(auto) # Variable names
## [1] "dateCrawled" "name" "seller"
## [4] "offerType" "price" "abtest"
## [7] "vehicleType" "yearOfRegistration" "gearbox"
## [10] "powerPS" "model" "kilometer"
## [13] "monthOfRegistration" "fuelType" "brand"
## [16] "notRepairedDamage" "dateCreated" "nrOfPictures"
## [19] "postalCode" "lastSeen"
class(auto) # Object class
## [1] "data.table" "data.frame"
file.info('data/autos.csv') # File info on disk
## size isdir mode mtime ctime
## data/autos.csv 68439217 FALSE 644 2019-02-24 21:52:04 2019-02-24 21:52:04
## atime uid gid uname grname
## data/autos.csv 2019-10-10 14:54:41 1000 1000 rstudio rstudio
gdata::humanReadable(68439217)
## [1] "65.3 MiB"
object.size(auto) %>% print(units = 'auto') # File size in memory
## 103.3 Mb
Things to note:
- The import has been done with
fread
instead ofread.csv
. This is more efficient, and directly creates adata.table
object. - The import is very fast.
- The data after import is slightly larger than when stored on disk (in this case). The extra data allows faster operation of this object, and the rule of thumb is to have 3 to 5 times more RAM than file size (e.g.: 4GB RAM for 1GB file)
auto
has two classes. It means that everything that expects adata.frame
we can feed it adata.table
and it will work.
Let’s start with verifying that it behaves like a data.frame
when expected.
auto[,2] %>% head
## name
## 1: Golf_3_1.6
## 2: A5_Sportback_2.7_Tdi
## 3: Jeep_Grand_Cherokee_"Overland"
## 4: GOLF_4_1_4__3T\xdcRER
## 5: Skoda_Fabia_1.4_TDI_PD_Classic
## 6: BMW_316i___e36_Limousine___Bastlerfahrzeug__Export
auto[[2]] %>% head
## [1] "Golf_3_1.6"
## [2] "A5_Sportback_2.7_Tdi"
## [3] "Jeep_Grand_Cherokee_\"Overland\""
## [4] "GOLF_4_1_4__3T\xdcRER"
## [5] "Skoda_Fabia_1.4_TDI_PD_Classic"
## [6] "BMW_316i___e36_Limousine___Bastlerfahrzeug__Export"
auto[1,2] %>% head
## name
## 1: Golf_3_1.6
But notice the difference between data.frame
and data.table
when subsetting multiple rows. Uhh!
auto[1:3] %>% dim # data.table will exctract *rows*
## [1] 3 20
as.data.frame(auto)[1:3] %>% dim # data.frame will exctract *columns*
## [1] 371824 3
Just use columns (,
) and be explicit regarding the dimension you are extracting…
Now let’s do some data.table
specific operations.
The general syntax has the form DT[i,j,by]
.
SQL users may think of i
as WHERE
, j
as SELECT
, and by
as GROUP BY
.
We don’t need to name the arguments explicitly.
Also, the Tab
key will typically help you to fill in column names.
auto[,vehicleType,] %>% table # Exctract column and tabulate
## .
## andere bus cabrio coupe kleinwagen
## 37899 3362 30220 22914 19026 80098
## kombi limousine suv
## 67626 95963 14716
auto[vehicleType=='coupe',,] %>% dim # Exctract rows
## [1] 19026 20
auto[,gearbox:model,] %>% head # exctract column range
## gearbox powerPS model
## 1: manuell 0 golf
## 2: manuell 190
## 3: automatik 163 grand
## 4: manuell 75 golf
## 5: manuell 69 fabia
## 6: manuell 102 3er
auto[,gearbox,] %>% table
## .
## automatik manuell
## 20223 77169 274432
auto[vehicleType=='coupe' & gearbox=='automatik',,] %>% dim # intersect conditions
## [1] 6008 20
auto[,table(vehicleType),] # uhh? why would this even work?!?
## vehicleType
## andere bus cabrio coupe kleinwagen
## 37899 3362 30220 22914 19026 80098
## kombi limousine suv
## 67626 95963 14716
auto[, mean(price), by=vehicleType] # average price by car group
## vehicleType V1
## 1: 20124.688
## 2: coupe 25951.506
## 3: suv 13252.392
## 4: kleinwagen 5691.167
## 5: limousine 11111.107
## 6: cabrio 15072.998
## 7: bus 10300.686
## 8: kombi 7739.518
## 9: andere 676327.100
The .N
operator is very useful if you need to count the length of the result.
Notice where I use it:
auto[.N,,] # will exctract the *last* row
## dateCrawled name seller
## 1: 2016-03-07 19:39:19 BMW_M135i_vollausgestattet_NP_52.720____Euro privat
## offerType price abtest vehicleType yearOfRegistration gearbox powerPS
## 1: Angebot 28990 control limousine 2013 manuell 320
## model kilometer monthOfRegistration fuelType brand notRepairedDamage
## 1: m_reihe 50000 8 benzin bmw nein
## dateCreated nrOfPictures postalCode lastSeen
## 1: 2016-03-07 00:00:00 0 73326 2016-03-22 03:17:10
auto[,.N] # will count rows
## [1] 371824
auto[,.N, vehicleType] # will count rows by type
## vehicleType N
## 1: 37899
## 2: coupe 19026
## 3: suv 14716
## 4: kleinwagen 80098
## 5: limousine 95963
## 6: cabrio 22914
## 7: bus 30220
## 8: kombi 67626
## 9: andere 3362
You may concatenate results into a vector:
auto[,c(mean(price), mean(powerPS)),]
## [1] 17286.2996 115.5414
This c()
syntax no longer behaves well if splitting:
auto[,c(mean(price), mean(powerPS)), by=vehicleType]
## vehicleType V1
## 1: 20124.68801
## 2: 71.23249
## 3: coupe 25951.50589
## 4: coupe 172.97614
## 5: suv 13252.39182
## 6: suv 166.01903
## 7: kleinwagen 5691.16738
## 8: kleinwagen 68.75733
## 9: limousine 11111.10661
## 10: limousine 132.26936
## 11: cabrio 15072.99782
## 12: cabrio 145.17684
## 13: bus 10300.68561
## 14: bus 113.58137
## 15: kombi 7739.51760
## 16: kombi 136.40654
## 17: andere 676327.09964
## 18: andere 102.11154
Use a list()
instead of c()
, within data.table
commands:
auto[,list(mean(price), mean(powerPS)), by=vehicleType]
## vehicleType V1 V2
## 1: 20124.688 71.23249
## 2: coupe 25951.506 172.97614
## 3: suv 13252.392 166.01903
## 4: kleinwagen 5691.167 68.75733
## 5: limousine 11111.107 132.26936
## 6: cabrio 15072.998 145.17684
## 7: bus 10300.686 113.58137
## 8: kombi 7739.518 136.40654
## 9: andere 676327.100 102.11154
You can add names to your new variables:
auto[,list(Price=mean(price), Power=mean(powerPS)), by=vehicleType]
## vehicleType Price Power
## 1: 20124.688 71.23249
## 2: coupe 25951.506 172.97614
## 3: suv 13252.392 166.01903
## 4: kleinwagen 5691.167 68.75733
## 5: limousine 11111.107 132.26936
## 6: cabrio 15072.998 145.17684
## 7: bus 10300.686 113.58137
## 8: kombi 7739.518 136.40654
## 9: andere 676327.100 102.11154
You can use .()
to replace the longer list()
command:
auto[,.(Price=mean(price), Power=mean(powerPS)), by=vehicleType]
## vehicleType Price Power
## 1: 20124.688 71.23249
## 2: coupe 25951.506 172.97614
## 3: suv 13252.392 166.01903
## 4: kleinwagen 5691.167 68.75733
## 5: limousine 11111.107 132.26936
## 6: cabrio 15072.998 145.17684
## 7: bus 10300.686 113.58137
## 8: kombi 7739.518 136.40654
## 9: andere 676327.100 102.11154
And split by multiple variables:
auto[,.(Price=mean(price), Power=mean(powerPS)), by=.(vehicleType,fuelType)] %>% head
## vehicleType fuelType Price Power
## 1: benzin 11820.443 70.14477
## 2: coupe diesel 51170.248 179.48704
## 3: suv diesel 15549.369 168.16115
## 4: kleinwagen benzin 5786.514 68.74309
## 5: kleinwagen diesel 4295.550 76.83666
## 6: limousine benzin 6974.360 127.87025
Compute with variables created on the fly:
auto[,sum(price<1e4),] # Count prices lower than 10,000
## [1] 310497
auto[,mean(price<1e4),] # Proportion of prices lower than 10,000
## [1] 0.8350644
auto[,.(Power=mean(powerPS)), by=.(PriceRange=price>1e4)]
## PriceRange Power
## 1: FALSE 101.8838
## 2: TRUE 185.9029
Things to note:
- The term
price<1e4
creates on the fly a binary vector of TRUE=1 / FALSE=0 for prices less than 10k and then sums/means this vector, hencesum
is actually a count, andmean
is proportion=count/total - Summing all prices lower than 10k is done with the command
auto[price<1e4,sum(price),]
You may sort along one or more columns
auto[order(-price), price,] %>% head # Order along price. Descending
## [1] 2147483647 99999999 99999999 99999999 99999999 99999999
auto[order(price, -lastSeen), price,] %>% head# Order along price and last seen . Ascending and descending.
## [1] 0 0 0 0 0 0
You may apply a function to ALL columns using a Subset of the Data using .SD
count.uniques <- function(x) length(unique(x))
auto[,lapply(.SD, count.uniques), vehicleType]
## vehicleType dateCrawled name seller offerType price abtest
## 1: 36714 32891 1 2 1378 2
## 2: coupe 18745 13182 1 2 1994 2
## 3: suv 14549 9707 1 1 1667 2
## 4: kleinwagen 75591 49302 2 2 1927 2
## 5: limousine 89352 58581 2 1 2986 2
## 6: cabrio 22497 13411 1 1 2014 2
## 7: bus 29559 19651 1 2 1784 2
## 8: kombi 64415 41976 2 1 2529 2
## 9: andere 3352 3185 1 1 562 2
## yearOfRegistration gearbox powerPS model kilometer monthOfRegistration
## 1: 101 3 374 244 13 13
## 2: 75 3 414 117 13 13
## 3: 73 3 342 122 13 13
## 4: 75 3 317 163 13 13
## 5: 83 3 506 210 13 13
## 6: 88 3 363 95 13 13
## 7: 65 3 251 106 13 13
## 8: 64 3 393 177 13 13
## 9: 81 3 230 162 13 13
## fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode
## 1: 8 40 3 65 1 6304
## 2: 8 35 3 51 1 5159
## 3: 8 37 3 61 1 4932
## 4: 8 38 3 68 1 7343
## 5: 8 39 3 82 1 7513
## 6: 7 38 3 70 1 5524
## 7: 8 33 3 63 1 6112
## 8: 8 38 3 75 1 7337
## 9: 8 38 3 41 1 2220
## lastSeen
## 1: 32813
## 2: 16568
## 3: 13367
## 4: 59354
## 5: 65813
## 6: 19125
## 7: 26094
## 8: 50668
## 9: 3294
Things to note:
.SD
is the data subset after splitting along theby
argument.- Recall that
lapply
applies the same function to all elements of a list. In this example, to all columns of.SD
.
If you want to apply a function only to a subset of columns, use the .SDcols
argument
auto[,lapply(.SD, count.uniques), by=vehicleType, .SDcols=price:gearbox]
## vehicleType price abtest vehicleType yearOfRegistration gearbox
## 1: 1378 2 1 101 3
## 2: coupe 1994 2 1 75 3
## 3: suv 1667 2 1 73 3
## 4: kleinwagen 1927 2 1 75 3
## 5: limousine 2986 2 1 83 3
## 6: cabrio 2014 2 1 88 3
## 7: bus 1784 2 1 65 3
## 8: kombi 2529 2 1 64 3
## 9: andere 562 2 1 81 3
4.1 Make your own variables
It is very easy to compute new variables
auto[,log(price/powerPS),] %>% head # This makes no sense
## [1] Inf 4.567632 4.096387 2.995732 3.954583 1.852000
And if you want to store the result in a new variable, use the :=
operator
auto[,newVar:=log(price/powerPS),]
Or create multiple variables at once.
The syntax c("A","B"):=.(expression1,expression2)
is read “save the list of results from expression1 and expression2 using the vector of names A, and B”.
auto[,c('newVar','newVar2'):=.(log(price/powerPS),price^2/powerPS),]
4.2 Join
data.table can be used for joining. A join is the operation of aligning two (or more) data frames/tables along some index. The index can be a single variable, or a combination thereof.
Here is a simple example of aligning age and gender from two different data tables:
DT1 <- data.table(Names=c("Alice","Bob"), Age=c(29,31))
DT2 <- data.table(Names=c("Alice","Bob","Carl"), Gender=c("F","M","M"))
setkey(DT1, Names)
setkey(DT2, Names)
DT1[DT2,,]
## Names Age Gender
## 1: Alice 29 F
## 2: Bob 31 M
## 3: Carl NA M
DT2[DT1,,]
## Names Gender Age
## 1: Alice F 29
## 2: Bob M 31
Things to note:
- A join with
data.tables
is performed by indexing onedata.table
with another. Which is the outer and which is the inner will affect the result. - The indexing variable needs to be set using the
setkey
function.
There are several types of joins:
- Inner join: Returns the rows along the intersection of keys, i.e., rows that appear in all data sets.
- Outer join: Returns the rows along the union of keys, i.e., rows that appear in any of the data sets.
- Left join: Returns the rows along the index of the “left” data set.
- Right join: Returns the rows along the index of the “right” data set.
Assuming DT1
is the “left” data set, we see that DT1[DT2,,]
is a right join, and DT2[DT1,,]
is a left join.
For an inner join use the nomath=0
argument:
DT1[DT2,,,nomatch=0]
## Names Age Gender
## 1: Alice 29 F
## 2: Bob 31 M
DT2[DT1,,,nomatch=0]
## Names Gender Age
## 1: Alice F 29
## 2: Bob M 31
4.3 Reshaping data
Data sets (i.e. frames or tables) may arrive in a “wide” form or a “long” form.
The difference is best illustrated with an example.
The ChickWeight
data encodes the weight of various chicks. It is “long” in that a variable encodes the time of measurement, making the data, well, simply long:
ChickWeight %>% head
## Grouped Data: weight ~ Time | Chick
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
The mtcars
data encodes 11 characteristics of 32 types of automobiles. It is “wide” since the various characteristics are encoded in different variables, making the data, well, simply wide.
mtcars %>% head
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## type
## Mazda RX4 Mazda RX4
## Mazda RX4 Wag Mazda RX4 Wag
## Datsun 710 Datsun 710
## Hornet 4 Drive Hornet 4 Drive
## Hornet Sportabout Hornet Sportabout
## Valiant Valiant
Most of R’s functions, with exceptions, will prefer data in the long format.
There are thus various facilities to convert from one format to another.
We will focus on the melt
and dcast
functions to convert from one format to another.
4.3.1 Wide to long
melt
will convert from wide to long.
dimnames(mtcars)
## [[1]]
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
##
## [[2]]
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb" "type"
mtcars$type <- rownames(mtcars)
melt(mtcars, id.vars=c("type")) %>% head
## type variable value
## 1 Mazda RX4 mpg 21.0
## 2 Mazda RX4 Wag mpg 21.0
## 3 Datsun 710 mpg 22.8
## 4 Hornet 4 Drive mpg 21.4
## 5 Hornet Sportabout mpg 18.7
## 6 Valiant mpg 18.1
Things to note:
- The car type was originally encoded in the rows’ names, and not as a variable. We thus created an explicit variable with the cars’ type using the
rownames
function. - The
id.vars
of themelt
function names the variables that will be used as identifiers. All other variables are assumed to be measurements. These can have been specified using their index instead of their name. - If not all variables are measurements, we could have names measurement variables explicitly using the
measure.vars
argument of themelt
function. These can have been specified using their index instead of their name. - By default, the molten columns are automatically named
variable
andvalue
.
We can replace the automatic namings using variable.name
and value.name
:
melt(mtcars, id.vars=c("type"), variable.name="Charachteristic", value.name="Measurement") %>% head
## type Charachteristic Measurement
## 1 Mazda RX4 mpg 21.0
## 2 Mazda RX4 Wag mpg 21.0
## 3 Datsun 710 mpg 22.8
## 4 Hornet 4 Drive mpg 21.4
## 5 Hornet Sportabout mpg 18.7
## 6 Valiant mpg 18.1
4.3.2 Long to wide
dcast
will convert from long to wide:
dcast(ChickWeight, Chick~Time, value.var="weight")
## Chick 0 2 4 6 8 10 12 14 16 18 20 21
## 1 18 39 35 NA NA NA NA NA NA NA NA NA NA
## 2 16 41 45 49 51 57 51 54 NA NA NA NA NA
## 3 15 41 49 56 64 68 68 67 68 NA NA NA NA
## 4 13 41 48 53 60 65 67 71 70 71 81 91 96
## 5 9 42 51 59 68 85 96 90 92 93 100 100 98
## 6 20 41 47 54 58 65 73 77 89 98 107 115 117
## 7 10 41 44 52 63 74 81 89 96 101 112 120 124
## 8 8 42 50 61 71 84 93 110 116 126 134 125 NA
## 9 17 42 51 61 72 83 89 98 103 113 123 133 142
## 10 19 43 48 55 62 65 71 82 88 106 120 144 157
## 11 4 42 49 56 67 74 87 102 108 136 154 160 157
## 12 6 41 49 59 74 97 124 141 148 155 160 160 157
## 13 11 43 51 63 84 112 139 168 177 182 184 181 175
## 14 3 43 39 55 67 84 99 115 138 163 187 198 202
## 15 1 42 51 59 64 76 93 106 125 149 171 199 205
## 16 12 41 49 56 62 72 88 119 135 162 185 195 205
## 17 2 40 49 58 72 84 103 122 138 162 187 209 215
## 18 5 41 42 48 60 79 106 141 164 197 199 220 223
## 19 14 41 49 62 79 101 128 164 192 227 248 259 266
## 20 7 41 49 57 71 89 112 146 174 218 250 288 305
## 21 24 42 52 58 74 66 68 70 71 72 72 76 74
## 22 30 42 48 59 72 85 98 115 122 143 151 157 150
## 23 22 41 55 64 77 90 95 108 111 131 148 164 167
## 24 23 43 52 61 73 90 103 127 135 145 163 170 175
## 25 27 39 46 58 73 87 100 115 123 144 163 185 192
## 26 28 39 46 58 73 92 114 145 156 184 207 212 233
## 27 26 42 48 57 74 93 114 136 147 169 205 236 251
## 28 25 40 49 62 78 102 124 146 164 197 231 259 265
## 29 29 39 48 59 74 87 106 134 150 187 230 279 309
## 30 21 40 50 62 86 125 163 217 240 275 307 318 331
## 31 33 39 50 63 77 96 111 137 144 151 146 156 147
## 32 37 41 48 56 68 80 83 103 112 135 157 169 178
## 33 36 39 48 61 76 98 116 145 166 198 227 225 220
## 34 31 42 53 62 73 85 102 123 138 170 204 235 256
## 35 39 42 50 61 78 89 109 130 146 170 214 250 272
## 36 38 41 49 61 74 98 109 128 154 192 232 280 290
## 37 32 41 49 65 82 107 129 159 179 221 263 291 305
## 38 40 41 55 66 79 101 120 154 182 215 262 295 321
## 39 34 41 49 63 85 107 134 164 186 235 294 327 341
## 40 35 41 53 64 87 123 158 201 238 287 332 361 373
## 41 44 42 51 65 86 103 118 127 138 145 146 NA NA
## 42 45 41 50 61 78 98 117 135 141 147 174 197 196
## 43 43 42 55 69 96 131 157 184 188 197 198 199 200
## 44 41 42 51 66 85 103 124 155 153 175 184 199 204
## 45 47 41 53 66 79 100 123 148 157 168 185 210 205
## 46 49 40 53 64 85 108 128 152 166 184 203 233 237
## 47 46 40 52 62 82 101 120 144 156 173 210 231 238
## 48 50 41 54 67 84 105 122 155 175 205 234 264 264
## 49 42 42 49 63 84 103 126 160 174 204 234 269 281
## 50 48 39 50 62 80 104 125 154 170 222 261 303 322
Things to note:
dcast
uses a formula interface (~
) to specify the row identifier and the variables. The LHS is the row identifier, and the RHS for the variables to be created.- The measurement of each LHS at each RHS, is specified using the
value.var
argument.
4.4 Bibliographic Notes
data.table
has excellent online documentation.
See here.
See here for joining.
See here for more on reshaping.
See here for a comparison of the data.frame
way, versus the data.table
way.
For some advanced tips and tricks see Andrew Brooks’ blog.
4.5 Practice Yourself
- Create a matrix of ones with
1e5
rows and1e2
columns. Create adata.table
using this matrix.- Replace the first column of each, with the sequence \(1,2,3,\dots\).
- Create a column which is the sum of all columns, and a \(\mathcal{N}(0,1)\) random variable.
- Use the cars dataset used in this chapter from kaggle Kaggle.
- Import the data using the function
fread
. What is the class of your object? - Use
system.time()
to measure the time to sort along “seller”. Do the same after converting the data todata.frame
. Are data tables faster?
- Import the data using the function
Also, see DataCamp’s Data Manipulation in R with data.table, by Matt Dowle, the author of data.table for more self practice.
Not to be confused with
DT::datatable()
which is an interface for interactive inspection of data tables in your browser.↩