The CROSS JOIN returns all combinations of rows in x and y, i.e. a dataset whose row count is the number of rows in the first dataset multiplied by the number of rows in the second dataset. This kind of result is called the Cartesian product.
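As a minimal sketch of the row-count multiplication (assuming dplyr >= 1.1.0, which exports cross_join() for data frames; the letters_df and numbers_df names are just illustrative), crossing a 2-row table with a 3-row table yields 2 * 3 = 6 rows:

library(dplyr)

letters_df <- data.frame(letter = c("a", "b"))
numbers_df <- data.frame(number = 1:3)

cross_join(letters_df, numbers_df)  # 6 rows: every (letter, number) pair
#>   letter number
#> 1      a      1
#> 2      a      2
#> 3      a      3
#> 4      b      1
#> 5      b      2
#> 6      b      3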
cross_join(x, y, copy = FALSE, suffix = c("_x", "_y"), ..., na_matches = c("never", "na"))

# S3 method for tbl_lazy
cross_join(x, y, copy = FALSE, suffix = c("_x", "_y"), ..., na_matches = c("never", "na"))

# S3 method for data.frame
cross_join(x, y, copy = FALSE, suffix = c("_x", "_y"), ..., na_matches = c("na", "never"))
| Argument | Description |
|---|---|
| x, y | A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). |
| copy | If x and y are not from the same data source, and copy is TRUE, then y will be copied into the same src as x. This allows you to join tables across srcs, but it is a potentially expensive operation, so you must opt into it. |
| suffix | If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2. |
| ... | Other parameters passed onto methods. |
| na_matches | Should NA (NULL) values match one another? The default, "never", is how databases usually work. |
From Spark 2.1, the prerequisite for using a cross join is that spark.sql.crossJoin.enabled must be set to true; otherwise an exception will be thrown. Cartesian products are very slow. More importantly, they can consume a lot of memory and trigger an OOM error. If the join type is not Inner, Spark SQL may use a Broadcast Nested Loop Join even if neither side of the join is small enough to broadcast, which can also cause a lot of unwanted network traffic.
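A minimal sketch of how to opt in from sparklyr (assumes a local Spark installation; Spark 3.0 and later enable cross joins by default):

library(sparklyr)

# Enable cross joins (required on Spark 2.1 through 2.x) before connecting.
config <- spark_config()
config[["spark.sql.crossJoin.enabled"]] <- "true"

sc <- spark_connect(master = "local", config = config)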
x <- data.frame(
  id = c("id1", "id2", "id3", "id4", "id5"),
  val = c(2, 7, 11, 13, 17),
  stringsAsFactors = FALSE
)

cross_join(x, x)
#>    id_x val_x id_y val_y
#> 1   id1     2  id1     2
#> 2   id1     2  id2     7
#> 3   id1     2  id3    11
#> 4   id1     2  id4    13
#> 5   id1     2  id5    17
#> 6   id2     7  id1     2
#> 7   id2     7  id2     7
#> 8   id2     7  id3    11
#> 9   id2     7  id4    13
#> 10  id2     7  id5    17
#> 11  id3    11  id1     2
#> 12  id3    11  id2     7
#> 13  id3    11  id3    11
#> 14  id3    11  id4    13
#> 15  id3    11  id5    17
#> 16  id4    13  id1     2
#> 17  id4    13  id2     7
#> 18  id4    13  id3    11
#> 19  id4    13  id4    13
#> 20  id4    13  id5    17
#> 21  id5    17  id1     2
#> 22  id5    17  id2     7
#> 23  id5    17  id3    11
#> 24  id5    17  id4    13
#> 25  id5    17  id5    17