Introduction
Welcome to my series of blog posts about my data manipulation package, {poorman}. For those of you that don’t know, {poorman} aims to replicate {dplyr} using only {base} R, and is therefore completely dependency free. What’s nice about this series is that if you would rather just use {dplyr}, then that’s absolutely OK! By highlighting {poorman} functionality, this series of blog posts simultaneously highlights {dplyr} functionality too! However, I sometimes also describe how I developed the internals of {poorman}, often highlighting useful {base} R tips and tricks.
Today marks the release of v0.2.1 of {poorman} and with it a whole host of new functions and features. In today’s blog post we will be taking a look at some of these new features. Given the sheer number of features this release brings, we won’t be focusing on the internals of any of these functions; the internals will be saved for another post. Instead, we will simply take a look at what some of them can do.
Selecting Distinct Rows
The first function we will take a look at is distinct(). Let’s say you want to select only the distinct, or unique, rows from your data.frame; distinct() will help you do that. Let’s create some fake data in which some rows are duplicated.
df <- data.frame(
id = c(1, 2, 3, 4, 5, 6, 1, 2, 7, 1, 4, 6),
age = c(26, 24, 26, 22, 23, 24, 26, 24, 22, 26, 22, 25),
score = c(85, 63, 55, 74, 31, 77, 85, 63, 42, 85, 74, 78)
)
df
# id age score
# 1 1 26 85
# 2 2 24 63
# 3 3 26 55
# 4 4 22 74
# 5 5 23 31
# 6 6 24 77
# 7 1 26 85
# 8 2 24 63
# 9 7 22 42
# 10 1 26 85
# 11 4 22 74
# 12 6 25 78
Now we wish to see the distinct records from this data.
library(poorman, warn.conflicts = FALSE)
df %>% distinct()
# id age score
# 1 1 26 85
# 2 2 24 63
# 3 3 26 55
# 4 4 22 74
# 5 5 23 31
# 6 6 24 77
# 9 7 22 42
# 12 6 25 78
So we see that we now only have 8 records out of the original 12 because the duplicates have been removed. We can also obtain the distinct values of a particular column, returning just that column.
df %>% distinct(age)
# age
# 1 26
# 2 24
# 4 22
# 5 23
# 12 25
But if you need the other variables still, you can choose to keep those too.
df %>% distinct(age, .keep_all = TRUE)
# id age score
# 1 1 26 85
# 2 2 24 63
# 4 4 22 74
# 5 5 23 31
# 12 6 25 78
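For anyone curious how the same results could be reached with {base} R alone, here is a minimal sketch built on duplicated(). It is only an illustration of the idea, not necessarily how {poorman} implements distinct() internally, and the object names (distinct_rows, distinct_age, distinct_age_all) are made up for this example.

```r
# Same fake data as in the example above
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 1, 2, 7, 1, 4, 6),
  age = c(26, 24, 26, 22, 23, 24, 26, 24, 22, 26, 22, 25),
  score = c(85, 63, 55, 74, 31, 77, 85, 63, 42, 85, 74, 78)
)

# Distinct rows across all columns: drop every row flagged as a repeat
distinct_rows <- df[!duplicated(df), ]

# Distinct values of one column, kept as a one-column data.frame
distinct_age <- df[!duplicated(df$age), "age", drop = FALSE]

# Distinct on one column while keeping the others (first occurrence wins)
distinct_age_all <- df[!duplicated(df$age), ]
```

Because duplicated() flags the second and later occurrences, the first row seen for each value is the one that survives, which matches the row names shown in the output above.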
Slicing Data
{dplyr} provides a couple of ways to select a subset of rows. It has the functions top_n() and top_frac() as well as the slice_*() family of functions. The former functions have now been superseded by the latter and so {poorman} skipped the implementation of the former. So what exactly do they do? Let’s take a look at some examples using the mtcars dataset.
slice_head() returns the first n rows (defaults to 1). slice_tail() returns the last n rows (not shown here).
slice_head(mtcars, n = 3)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
slice_sample() randomly selects rows with or without replacement.
slice_sample(mtcars, n = 3, replace = TRUE)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
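Since the rows are drawn at random, the output above will differ from run to run. A rough {base} R sketch of the same idea is shown here, with a seed set so that repeated runs return the same rows. The seed value and the names idx and sampled are arbitrary choices for this illustration.

```r
# Fix the random number generator state (42 is an arbitrary choice)
set.seed(42)

# Draw three row positions, allowing the same row to be picked twice
idx <- sample(nrow(mtcars), size = 3, replace = TRUE)

# Subset the data.frame by those positions
sampled <- mtcars[idx, ]
sampled
```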
slice_min() and slice_max() select rows with the lowest or highest values of a variable, respectively.
mtcars %>% slice_min(mpg, n = 3)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
# Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
# Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
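As a point of comparison, the deterministic slices can be approximated in {base} R with head(), tail() and order(). This is a sketch rather than a drop-in replacement: it always returns exactly three rows and does nothing special about ties, so results can differ from slice_min() and slice_max() when values are tied at the cut-off. The object names are made up for this example.

```r
# First and last three rows, like slice_head() and slice_tail() with n = 3
first3 <- head(mtcars, 3)
last3 <- tail(mtcars, 3)

# Three lowest mpg values: order() sorts ascending by default
lowest_mpg <- mtcars[order(mtcars$mpg)[1:3], ]

# Three highest mpg values: sort descending instead
highest_mpg <- mtcars[order(mtcars$mpg, decreasing = TRUE)[1:3], ]

rownames(lowest_mpg)
```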
Selecting With Predicates
It is now possible to select columns in your data.frame which match a predicate such as is.numeric(). where() takes a function and returns all variables for which the function returns TRUE.
df <- data.frame(
col1 = c(1, 2, 3),
col2 = c("x", "y", "z"),
col3 = c(TRUE, FALSE, TRUE)
)
df %>% select(where(is.numeric))
# col1
# 1 1
# 2 2
# 3 3
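The idea behind predicate selection can be sketched in {base} R by applying the predicate to every column and keeping those that return TRUE. This is an illustration of the concept, not a claim about how where() is written inside {poorman}; keep and numeric_cols are names chosen for this example.

```r
df <- data.frame(
  col1 = c(1, 2, 3),
  col2 = c("x", "y", "z"),
  col3 = c(TRUE, FALSE, TRUE)
)

# Apply the predicate to each column; vapply() guarantees one logical per column
keep <- vapply(df, is.numeric, logical(1))

# Keep only the matching columns, staying a data.frame even if one column remains
numeric_cols <- df[, keep, drop = FALSE]
numeric_cols
```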
Working With NA Values
Finding the First Non-Missing Element
Given a set of vectors, the coalesce() function finds the first non-missing value at each position.
# Use a single value to replace all missing values
x <- sample(c(1:5, NA, NA, NA))
coalesce(x, 0L)
# [1] 4 0 5 0 1 2 0 3
# Or match together a complete vector from missing pieces
y <- c(1, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, 5)
coalesce(y, z)
# [1] 1 2 3 4 5
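To make the "first non-missing value at each position" rule concrete, here is a small {base} R sketch that fills gaps from left to right with Reduce(). The function coalesce_sketch() and its helper fill_missing() are hypothetical names for this post, and the real coalesce() may well be implemented differently.

```r
coalesce_sketch <- function(...) {
  fill_missing <- function(current, candidate) {
    # Recycle a single replacement value to the full length
    if (length(candidate) == 1L) candidate <- rep(candidate, length(current))
    # Only positions that are still missing take a value from the candidate
    is_missing <- is.na(current)
    current[is_missing] <- candidate[is_missing]
    current
  }
  # Work through the vectors in order, filling the remaining gaps each time
  Reduce(fill_missing, list(...))
}

y <- c(1, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, 5)
coalesce_sketch(y, z)
```

Each vector is only consulted where every earlier vector was missing, which is why the order of the arguments matters.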
Convert Values To NA
We can convert values in a vector x to NA if they match values in a second vector y.
na_if(1:5, 5:1)
# [1] 1 2 NA 4 5
This is particularly useful in a data.frame if you need to replace a particular value.
df <- data.frame(a = c("a", "b", "c", "BAD_VALUE"))
df %>% mutate(a = na_if(a, "BAD_VALUE"))
# a
# 1 a
# 2 b
# 3 c
# 4 <NA>
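The comparison behind this can be sketched in a couple of lines of {base} R. The function na_if_sketch() is a hypothetical stand-in written for this post, not the actual {poorman} source.

```r
na_if_sketch <- function(x, y) {
  # which() drops NA comparisons, so values that are already NA are left alone
  x[which(x == y)] <- NA
  x
}

# Element-wise comparison: only the third position matches (3 == 3)
na_if_sketch(1:5, 5:1)

# A single value is recycled against the whole column
df <- data.frame(a = c("a", "b", "c", "BAD_VALUE"), stringsAsFactors = FALSE)
df$a <- na_if_sketch(df$a, "BAD_VALUE")
df
```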
Replacing NA Values
Within a data.frame we often have missing values in multiple columns. We sometimes wish to replace these values, which is where replace_na() comes in. replace_na() is actually a function from the {tidyr} package but I decided to add it to {poorman} as it is extremely useful. Let’s take a look.
df <- data.frame(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% replace_na(list(x = 0, y = "unknown"))
# x y
# 1 1 a
# 2 2 unknown
# 3 0 b
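For comparison, the same per-column replacement can be written as a short loop in {base} R. This is a sketch of the idea under the assumption that every name in the list matches a column; the variable names replacements and is_missing are chosen for this example.

```r
df <- data.frame(x = c(1, 2, NA), y = c("a", NA, "b"), stringsAsFactors = FALSE)

# One replacement value per column, keyed by column name
replacements <- list(x = 0, y = "unknown")

for (col in names(replacements)) {
  # Find the missing entries in this column and overwrite only those
  is_missing <- is.na(df[[col]])
  df[[col]][is_missing] <- replacements[[col]]
}
df
```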
Recoding Values
If we wish to replace values within a vector or a column of a data.frame, we can use recode(). This is a vectorised version of base::switch(): you can replace numeric values based on their position or their name, and character or factor values only by their name.
char_vec <- sample(c("a", "b", "c"), 10, replace = TRUE)
recode(char_vec, a = "Apple")
# [1] "b" "c" "b" "c" "Apple" "Apple" "Apple" "c" "Apple" "b"
recode(char_vec, a = "Apple", b = "Banana")
# [1] "Banana" "c" "Banana" "c" "Apple" "Apple" "Apple" "c" "Apple"
# [10] "Banana"
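Recoding character values by name can also be pictured as a lookup in a named vector. The sketch below uses a fixed input rather than sample() so the result is reproducible; lookup, recoded and unmatched are names made up for this illustration, and recode() itself may work differently under the hood.

```r
char_vec <- c("b", "c", "a", "a", "c")

# Old values are the names, new values are the elements
lookup <- c(a = "Apple", b = "Banana")

# Indexing by name returns NA for anything without an entry ("c" here)
recoded <- unname(lookup[char_vec])

# Fall back to the original value wherever no replacement was found
unmatched <- is.na(recoded)
recoded[unmatched] <- char_vec[unmatched]
recoded
```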
Group Details
The final group (no pun intended) of features is focussed solely on grouped data. Given how many there are, I am not going to go into detail and instead I provide a brief overview here for the reader. The plan is to detail these functions in a separate blog post since a lot of work went on under the hood that may be interesting to discuss.
- Functions for splitting data.frames: group_split(), group_keys()
- Extract grouping metadata: group_data(), group_indices(), group_vars(), group_rows(), group_size(), n_groups(), groups()
- Extract information about the current group: cur_data(), cur_group(), cur_group_id(), cur_group_rows(), cur_column()
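To give a feel for what a few of the helpers listed above report, here are some rough {base} R analogues using mtcars grouped by cyl. These are loose approximations for orientation only; the return types and ordering differ from the grouped functions themselves, and the object names are made up for this example.

```r
# One data.frame per group, roughly what group_split() hands back
pieces <- split(mtcars, mtcars$cyl)

# The distinct grouping values, roughly group_keys() (in order of appearance here)
keys <- unique(mtcars["cyl"])

# Row positions belonging to each group, roughly group_rows()
rows <- split(seq_len(nrow(mtcars)), mtcars$cyl)

# Number of rows per group and number of groups, roughly group_size() and n_groups()
sizes <- lengths(rows)
n <- length(rows)
sizes
```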
Conclusion
You made it this far, great! I won’t keep you much longer. This post has demonstrated some of the capabilities of the {poorman} (and therefore {dplyr}) package. The v0.2.1 release actually includes a slew of other features and functions so be sure to check out the release page for a full list.
As this blog post is quite long, I haven’t gone into any further details of the internals of {poorman}. However, if you are interested in taking a closer look at how I handle the different input types, you can see the code on the relevant {poorman} GitHub page. {poorman} is still a work in progress but, as you can see, it already has a lot of functionality you know and love from {dplyr}. So if you are working on a new project and don’t want to have to deal with dependency management, especially if you are sharing work with colleagues, why not give {poorman} a try?
If you’d like to show your support for {poorman}, please consider giving the package a Star on GitHub as it gives me that boost of dopamine needed to continue development.