Why the join family of data verbs always have a data table as one of the arguments inside the parentheses?(In R)
lmost all joins between 2 data.table
s use a
notation where one of them is used as i
in a frame
applied to the other, and the joining columns are specified with
the on
parameter. However, in addition to the “basic”
joins, data.table
allows for special cases like
rolling joins, summarizing while joining, non-equi joins, etc. This
vignette will describe the notation to apply these joins with verbs
defined in table.express
, which, like the single-table
verbs, build data.table
expressions.
Basic joins
We’ll consider most of the dplyr
joining verbs in
this section:
inner_join
left_join
right_join
anti_join
semi_join
full_join
A <- data.table::data.table(x = rep(c("b", "a", "c"), each = 3),
y = c(1, 3, 6),
v = 1:9)
B <- data.table::data.table(x = c("c", "b"),
v2 = 8:7,
foo = c(4, 2))
A
#> x y v
#> 1: b 1 1
#> 2: b 3 2
#> 3: b 6 3
#> 4: a 1 4
#> 5: a 3 5
#> 6: a 6 6
#> 7: c 1 7
#> 8: c 3 8
#> 9: c 6 9
B
#> x v2 foo
#> 1: c 8 4
#> 2: b 7 2
The methods defined in table.express
accept the
on
part of the expression in their ellipsis:
A %>%
inner_join(B, x)
#> x y v v2 foo
#> 1: c 1 7 8 4
#> 2: c 3 8 8 4
#> 3: c 6 9 8 4
#> 4: b 1 1 7 2
#> 5: b 3 2 7 2
#> 6: b 6 3 7 2
A %>%
inner_join(B, x, v = v2)
#> x y v foo
#> 1: c 3 8 4
An important thing to note in the second example above is the
order in which the columns are given, i.e. that v
is
written before v2
, since the order is relevant for
data.table
. We can remember the correct order simply
by looking at which data.table
appears first in the
expression, and knowing that said data.table
’s columns
must appear first in the on
expressions. In this case,
A
appears before B
, so writing v2 =
v
would not work.
Get Answers For Free
Most questions answered within 1 hours.