
poorman: Version 0.2.1 Release

Introduction

Welcome to my series of blog posts about my data manipulation package, {poorman}. For those of you who don’t know, {poorman} aims to replicate {dplyr} using only {base} R, and is therefore completely dependency free. What’s nice about this series is that if you would rather just use {dplyr}, then that’s absolutely OK! By highlighting {poorman} functionality, this series of blog posts simultaneously highlights {dplyr} functionality too! However, I sometimes also describe how I developed the internals of {poorman}, often highlighting useful {base} R tips and tricks.

Today marks the release of v0.2.1 of {poorman} and with it a whole host of new functions and features. In today’s blog post we will be taking a look at some of these new features. Given the sheer number of features this release brings, we won’t be focusing on the internals of any of these functions; the internals will be saved for another post. Instead, we will simply be taking a look at what some of them can do.

Selecting Distinct Rows

The first function we will take a look at is distinct(). If you want to select only the distinct, or unique, rows from your data.frame, distinct() will help you do that. Let’s create some fake data in which some rows are duplicated.

df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 1, 2, 7, 1, 4, 6),
  age = c(26, 24, 26, 22, 23, 24, 26, 24, 22, 26, 22, 25),
  score = c(85, 63, 55, 74, 31, 77, 85, 63, 42, 85, 74, 78)
)
df
#    id age score
# 1   1  26    85
# 2   2  24    63
# 3   3  26    55
# 4   4  22    74
# 5   5  23    31
# 6   6  24    77
# 7   1  26    85
# 8   2  24    63
# 9   7  22    42
# 10  1  26    85
# 11  4  22    74
# 12  6  25    78

Now we wish to see the distinct records from this data.

library(poorman, warn.conflicts = FALSE)
df %>% distinct()
#    id age score
# 1   1  26    85
# 2   2  24    63
# 3   3  26    55
# 4   4  22    74
# 5   5  23    31
# 6   6  24    77
# 9   7  22    42
# 12  6  25    78

We now have only 8 records out of the original 12 because the duplicates have been removed. We can also obtain the distinct values of a particular column, returning just that column.

df %>% distinct(age)
#    age
# 1   26
# 2   24
# 4   22
# 5   23
# 12  25

But if you need the other variables still, you can choose to keep those too.

df %>% distinct(age, .keep_all = TRUE)
#    id age score
# 1   1  26    85
# 2   2  24    63
# 4   4  22    74
# 5   5  23    31
# 12  6  25    78
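
For readers curious how far {base} R alone gets you here, the following is a rough sketch of the same three results using duplicated(). It is only an illustration of the idea and is not necessarily how {poorman} implements distinct(); the object names (distinct_rows, distinct_age, distinct_age_all) are made up for this example.

```r
# Same toy data as above
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 1, 2, 7, 1, 4, 6),
  age = c(26, 24, 26, 22, 23, 24, 26, 24, 22, 26, 22, 25),
  score = c(85, 63, 55, 74, 31, 77, 85, 63, 42, 85, 74, 78)
)

# duplicated() flags every row that repeats an earlier row,
# so negating it keeps the first occurrence of each record
distinct_rows <- df[!duplicated(df), ]
nrow(distinct_rows)
# [1] 8

# Distinct values of a single column, kept as a one-column data.frame
distinct_age <- df[!duplicated(df$age), "age", drop = FALSE]

# The same rows, but keeping every column
distinct_age_all <- df[!duplicated(df$age), ]
```

Because duplicated() keeps the first occurrence, the original row names (1, 2, 4, 5, 12 for age) survive, which matches the output shown above.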

Slicing Data

{dplyr} provides a couple of ways to select a subset of rows. It has the functions top_n() and top_frac() as well as the slice_*() family of functions. The former functions have now been superseded by the latter, so {poorman} skipped implementing them. So what exactly does the slice_*() family do? Let’s take a look at some examples using the mtcars dataset.

slice_head() returns the first n rows (defaults to 1). slice_tail() returns the last n rows (not shown here).

slice_head(mtcars, n = 3)
#                mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

slice_sample() randomly selects rows with or without replacement.

slice_sample(mtcars, n = 3, replace = TRUE)
#                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
# Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
# Merc 280C     17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4

slice_min() and slice_max() select the rows with the lowest or highest values of a variable, respectively.

mtcars %>% slice_min(mpg, n = 3)
#                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
# Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
# Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
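
If you want a feel for what these helpers are doing, here is a minimal {base} R sketch of the same kinds of selections. It is an approximation for illustration only: it is not taken from the {poorman} source, it makes no attempt to handle ties at the cut-off, and the object names (sampled, lowest_mpg) and the seed are arbitrary choices for this example.

```r
# First and last three rows
first_three <- head(mtcars, 3)
last_three <- tail(mtcars, 3)

# Three rows sampled with replacement; the seed is an arbitrary choice
# made only so the sketch is reproducible
set.seed(42)
sampled <- mtcars[sample(nrow(mtcars), 3, replace = TRUE), ]

# Three rows with the lowest mpg: sort by mpg, then take the top of the result
lowest_mpg <- head(mtcars[order(mtcars$mpg), ], 3)
rownames(lowest_mpg)
# [1] "Cadillac Fleetwood"  "Lincoln Continental" "Camaro Z28"
```

Passing decreasing = TRUE to order() would give the rows with the highest values instead.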

Selecting With Predicates

It is now possible to select columns in your data.frame which match a predicate such as is.numeric(). where() takes a function and returns all variables for which the function returns TRUE.

df <- data.frame(
  col1 = c(1, 2, 3),
  col2 = c("x", "y", "z"),
  col3 = c(TRUE, FALSE, TRUE)
)
df %>% select(where(is.numeric))
#   col1
# 1    1
# 2    2
# 3    3
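
The idea behind predicate selection can be sketched in {base} R by applying the predicate to every column and keeping the columns for which it returns TRUE. This is an illustration under my own assumptions rather than a description of how where() is implemented; keep and numeric_cols are names invented for the example.

```r
df <- data.frame(
  col1 = c(1, 2, 3),
  col2 = c("x", "y", "z"),
  col3 = c(TRUE, FALSE, TRUE)
)

# Apply the predicate to each column; vapply() guarantees one logical per column
keep <- vapply(df, is.numeric, logical(1))

# Subset the columns, using drop = FALSE so a single column stays a data.frame
numeric_cols <- df[, keep, drop = FALSE]
names(numeric_cols)
# [1] "col1"
```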

Working With NA Values

Finding the First Non-Missing Element

Given a set of vectors, the coalesce() function finds the first non-missing value at each position.

# Use a single value to replace all missing values
x <- sample(c(1:5, NA, NA, NA))
coalesce(x, 0L)
# [1] 4 0 5 0 1 2 0 3

# Or match together a complete vector from missing pieces
y <- c(1, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, 5)
coalesce(y, z)
# [1] 1 2 3 4 5
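
To make the "first non-missing value at each position" rule concrete, here is a small {base} R sketch. coalesce_sketch() is a hypothetical helper written for this post, not the function shipped in {poorman}, and it only recycles replacement vectors in the simplest way.

```r
# Hypothetical helper: fill the gaps in the first vector from each later vector in turn
coalesce_sketch <- function(...) {
  Reduce(function(x, y) {
    y <- rep_len(y, length(x))  # recycle a single replacement value
    missing <- is.na(x)         # positions that still need a value
    x[missing] <- y[missing]
    x
  }, list(...))
}

y <- c(1, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, 5)
coalesce_sketch(y, z)
# [1] 1 2 3 4 5

coalesce_sketch(c(1, NA, 3), 0)
# [1] 1 0 3
```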

Convert Values To NA

We can convert values in a vector x to NA if they match the values at the same positions in a second vector y.

na_if(1:5, 5:1)
# [1]  1  2 NA  4  5

This is particularly useful in a data.frame if you need to replace a particular value.

df <- data.frame(a = c("a", "b", "c", "BAD_VALUE"))
df %>% mutate(a = na_if(a, "BAD_VALUE"))
#      a
# 1    a
# 2    b
# 3    c
# 4 <NA>
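
In {base} R terms, the behaviour amounts to an element-wise comparison followed by an assignment. The helper below, na_if_sketch(), is a hypothetical name used only for this illustration and is not the {poorman} implementation.

```r
# Hypothetical helper: blank out the positions where x equals y
na_if_sketch <- function(x, y) {
  # which() drops any NA comparisons, so existing missing values are left alone
  x[which(x == y)] <- NA
  x
}

na_if_sketch(1:5, 5:1)
# [1]  1  2 NA  4  5

na_if_sketch(c("a", "b", "c", "BAD_VALUE"), "BAD_VALUE")
# [1] "a" "b" "c" NA
```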

Replacing NA Values

Within a data.frame we often have missing values in multiple columns. We sometimes wish to replace these values, which is where replace_na() comes in. replace_na() is actually a function from the {tidyr} package, but I decided to add it to {poorman} as it is extremely useful. Let’s take a look.

df <- data.frame(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% replace_na(list(x = 0, y = "unknown"))
#   x       y
# 1 1       a
# 2 2 unknown
# 3 0       b
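
The same replacement can be sketched in {base} R by looping over the named list and filling the NA positions column by column. This is an illustrative approximation, not the {poorman} source; replacements is a name chosen for the example, and stringsAsFactors = FALSE is set explicitly so the sketch behaves the same on older versions of R.

```r
df <- data.frame(
  x = c(1, 2, NA),
  y = c("a", NA, "b"),
  stringsAsFactors = FALSE
)

# One replacement value per column name
replacements <- list(x = 0, y = "unknown")

for (col in names(replacements)) {
  # Overwrite only the missing entries of this column
  df[[col]][is.na(df[[col]])] <- replacements[[col]]
}
df
#   x       y
# 1 1       a
# 2 2 unknown
# 3 0       b
```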

Recoding Values

If we wish to replace values within a vector or a column of a data.frame, we can use recode(). This is a vectorised version of base::switch(): you can replace numeric values based on their position or their name, and character or factor values only by their name.

char_vec <- sample(c("a", "b", "c"), 10, replace = TRUE)
recode(char_vec, a = "Apple")
#  [1] "b"     "c"     "b"     "c"     "Apple" "Apple" "Apple" "c"     "Apple" "b"
recode(char_vec, a = "Apple", b = "Banana")
#  [1] "Banana" "c"      "Banana" "c"      "Apple"  "Apple"  "Apple"  "c"      "Apple" 
# [10] "Banana"
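
For character vectors, the by-name replacement can be approximated in {base} R with a named lookup vector. The sketch below uses a fixed input rather than sample() so the result is predictable; lookup, matched and recoded are names invented for this illustration, and this is not how recode() itself is written.

```r
char_vec <- c("b", "c", "b", "a")

# Names are the old values, elements are the new values
lookup <- c(a = "Apple", b = "Banana")

# Only touch the elements that have an entry in the lookup
matched <- char_vec %in% names(lookup)
recoded <- char_vec
recoded[matched] <- unname(lookup[char_vec[matched]])
recoded
# [1] "Banana" "c"      "Banana" "Apple"
```

Values without an entry in the lookup, such as "c", are left unchanged, which mirrors the output above.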

Group Details

The final group (no pun intended) of features is focussed solely on grouped data. Given how many there are, I am not going to go into detail; instead, I provide a brief overview here for the reader. The plan is to detail these functions in a separate blog post, since a lot of work went on under the hood that may be interesting to discuss.

  • Functions for splitting data.frames: group_split(), group_keys()
  • Extract grouping metadata: group_data(), group_indices(), group_vars(), group_rows(), group_size(), n_groups(), groups()
  • Extract information about the current group: cur_data(), cur_group(), cur_group_id(), cur_group_rows(), cur_column()
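
To give a flavour of the kind of information these functions expose, here is a loose {base} R analogy using split() on mtcars grouped by cyl. It is only an analogy: the return types of the real functions differ, and the object names (by_cyl, group_sizes, rows_by_cyl) are made up for this sketch.

```r
# Split the data into one data.frame per group
by_cyl <- split(mtcars, mtcars$cyl)

# The group keys and the number of groups
names(by_cyl)
# [1] "4" "6" "8"
length(by_cyl)
# [1] 3

# The number of rows in each group (11, 7 and 14 for 4, 6 and 8 cylinders)
group_sizes <- vapply(by_cyl, nrow, integer(1))

# The row positions that belong to each group
rows_by_cyl <- split(seq_len(nrow(mtcars)), mtcars$cyl)
```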

Conclusion

You made it this far, great! I won’t keep you much longer. This post has demonstrated some of the capabilities of the {poorman} (and therefore {dplyr}) package. The v0.2.1 release actually includes a slew of other features and functions, so be sure to check out the release page for a full list.

As this blog post is quite long, I haven’t gone into any further details of the internals of {poorman}. However, if you are interested in taking a closer look at how I handle the different input types, you can see the code on the relevant {poorman} GitHub page. {poorman} is still a work in progress but, as you can see, it already has a lot of the functionality you know and love from {dplyr}. So if you are working on a new project and don’t want to have to deal with dependency management, especially if you are sharing work with colleagues, why not give {poorman} a try?

If you’d like to show your support for {poorman}, please consider giving the package a Star on GitHub, as it gives me the boost of dopamine needed to continue development.