Introduction
Let’s say we wish to group some data by a variable, then for each group we wish to find the row of the maximum value of another variable, and then finally extract the entire row. This is a fairly common task and in fact I’ve had to do this exact data exploration technique on several occasions in the last week using different syntax, data.table
and sparklyr
; so I thought I would share my code with you so you can compare the different options. In fact for this blog entry, I will be using base R before looking at data.table
, dplyr
and sparklyr
in subsequent posts.
For this exercise, I will be using the classic datasets::mtcars
data. The aim here is to find the row for each cylinder group (cyl
) where the miles per gallon (mpg
) value is at its highest. I am not interested in finding multiple rows, I just want one row for each group even if there are cars which share the same mpg
.
Base R Solution
For this blog entry, I am going to look at using base R. This is a classic split()
+ lapply()
problem. First I’ll do some pre-processing to mtcars
just so that we can see which cars have the maximum mpg for each cylinder group.
# Create a column of car names
mtcars_colnames <- colnames(mtcars)
mtcars[, "car"] <- rownames(mtcars)
mtcars <- mtcars[, c("car", mtcars_colnames)]
Now we can find the rows we want.
max_mpg <- do.call(rbind, lapply(
# Split the data into groups for each cylinder
split(mtcars, mtcars[, "cyl"]),
function(x) {
# For the group `x`, select the row which has the maximum mpg
x[which.max(x[, "mpg"]), ]
}
))
rownames(max_mpg) <- NULL
max_mpg
# car mpg cyl disp hp drat wt qsec vs am gear carb
# 1 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# 2 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
# 3 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
So what we did here is split()
the data.frame
into separate lists, one for each cylinder group, and then used lapply()
(list apply) to apply a function which, for each group, selects the row which has the maximum mpg. Finally we rbind()
each list back together into a data.frame
.
You may be curious about my use of which.max(x[, "mpg"])
over say mtcars[mtcars$mpg == max(mtcars$mpg), ]
, well it’s because the former will only return a single row, whereas the latter will return multiple matching rows (check out ?which.max
) and in this instance I was only interested in a single row for each group.
Another approach for base R is to use the by
function.
do.call(rbind, by(datasets::mtcars, mtcars$cyl, function(x) x[which.max(x$mpg), ]))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 4 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# 6 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
# 8 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
This acts in a similar way to the split
+ lapply
approach in that the data.frame
is split by row into data.frame
s subset by the values of one or more factors, in this case cyl
, and a function is then applied to each subset in turn. However as we can see from the benchmarks below, it isn’t quite as fast (in this case) as the former approach.
Benchmarks
Conclusion
To conclude, should you need to perform an operation on a data.frame
in R, you can split()
your data into lists of data.frame
s of the required groups, lapply()
a function to each data.frame
in the list; and finally rbind()
those data.frame
s back into a single data.frame
.
Are you aware of any other base R solutions to this problem? If so, let me know in the comments!
In my next blog entry, we will be looking at how to perform this task using the data.table
package.