Introduction

Let’s say we wish to group some data by a variable, then for each group we wish to find the row of the maximum value of another variable, and then finally extract the entire row. This is a fairly common task and in fact I’ve had to do this exact data exploration technique on several occasions in the last week using different syntax, data.table and sparklyr; so I thought I would share my code with you so you can compare the different options. In fact for this blog entry, I will be using base R before looking at data.table, dplyr and sparklyr in subsequent posts.

For this exercise, I will be using the classic datasets::mtcars data. The aim here is to find the row for each cylinder group (cyl) where the miles per gallon (mpg) value is at its highest. I am not interested in finding multiple rows, I just want one row for each group even if there are cars which share the same mpg.

Base R Solution

For this blog entry, I am going to look at using base R. This is a classic split() + lapply() problem. First I’ll do some pre-processing to mtcars just so that we can see which cars have the maximum mpg for each cylinder group.

# Create a column of car names
mtcars_colnames <- colnames(mtcars)
mtcars[, "car"] <- rownames(mtcars)
mtcars <- mtcars[, c("car", mtcars_colnames)]

Now we can find the rows we want.

max_mpg <- do.call(rbind, lapply(
# Split the data into groups for each cylinder
split(mtcars, mtcars[, "cyl"]),
function(x) {
# For the group x, select the row which has the maximum mpg
x[which.max(x[, "mpg"]), ]
}
))
rownames(max_mpg) <- NULL
max_mpg
#                car  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# 1   Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# 2   Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# 3 Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2

So what we did here is split() the data.frame into separate lists, one for each cylinder group, and then used lapply() (list apply) to apply a function which, for each group, selects the row which has the maximum mpg. Finally we rbind() each list back together into a data.frame.

You may be curious about my use of which.max(x[, "mpg"]) over say mtcars[mtcars$mpg == max(mtcars$mpg), ], well it’s because the former will only return a single row, whereas the latter will return multiple matching rows (check out ?which.max) and in this instance I was only interested in a single row for each group.

Another approach for base R is to use the by function.

do.call(rbind, by(datasets::mtcars, mtcars$cyl, function(x) x[which.max(x$mpg), ]))
#    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# 4 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# 6 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# 8 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2

This acts in a similar way to the split + lapply approach in that the data.frame is split by row into data.frames subset by the values of one or more factors, in this case cyl, and a function is then applied to each subset in turn. However as we can see from the benchmarks below, it isn’t quite as fast (in this case) as the former approach.

Conclusion

To conclude, should you need to perform an operation on a data.frame in R, you can split() your data into lists of data.frames of the required groups, lapply() a function to each data.frame in the list; and finally rbind() those data.frames back into a single data.frame.

Are you aware of any other base R solutions to this problem? If so, let me know in the comments!

In my next blog entry, we will be looking at how to perform this task using the data.table package.