The CROSS JOIN returns all combinations of the rows of x and y, i.e. a dataset whose row count is the number of rows in the first dataset multiplied by the number of rows in the second dataset. This kind of result is called the Cartesian product.

cross_join(
  x,
  y,
  copy = FALSE,
  suffix = c("_x", "_y"),
  ...,
  na_matches = c("never", "na")
)

# S3 method for tbl_lazy
cross_join(
  x,
  y,
  copy = FALSE,
  suffix = c("_x", "_y"),
  ...,
  na_matches = c("never", "na")
)

# S3 method for data.frame
cross_join(
  x,
  y,
  copy = FALSE,
  suffix = c("_x", "_y"),
  ...,
  na_matches = c("na", "never")
)

Arguments

x, y

A pair of tbl_sparks or data.frames.

copy

If x and y are not from the same data source, and copy is TRUE, then y will be copied into a temporary table in the same database as x. *_join() will automatically run ANALYZE on the created table in the hope that this will make your queries as efficient as possible by giving more data to the query planner.

This allows you to join tables across srcs, but it's a potentially expensive operation, so you must opt into it.

suffix

If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2.

...

Other parameters passed on to methods.

na_matches

Should NA (NULL) values match one another? The default, "never", is how databases usually work. "na" makes the joins behave like the dplyr join functions, merge(), match(), and %in%.

Details

Since Spark 2.1, spark.sql.crossJoin.enabled must be set to true before a cross join can be used; otherwise an exception will be thrown. Cartesian products are very slow. More importantly, they can consume a lot of memory and trigger an out-of-memory (OOM) error. If the join type is not Inner, Spark SQL may use a Broadcast Nested Loop Join even when neither table is small enough to broadcast, which can also cause a lot of unwanted network traffic.
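One way the setting above might be supplied is through the connection configuration. The sketch below assumes the session is opened with sparklyr's spark_config() and spark_connect(), and that a local Spark installation is available; neither assumption is confirmed by this page, and the local master is only a placeholder.

```r
# Sketch only: assumes sparklyr is installed and Spark is available locally.
library(sparklyr)

# Start from the default configuration and enable cross joins.
config <- spark_config()
config$spark.sql.crossJoin.enabled <- "true"

# "local" is a placeholder master; substitute your own cluster.
sc <- spark_connect(master = "local", config = config)
```

Because the option is part of the session configuration, it has to be in place before any cross join is submitted.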

Examples

x <- data.frame(
  id = c("id1", "id2", "id3", "id4", "id5"),
  val = c(2, 7, 11, 13, 17),
  stringsAsFactors = FALSE
)
cross_join(x, x)
#>    id_x val_x id_y val_y
#> 1   id1     2  id1     2
#> 2   id1     2  id2     7
#> 3   id1     2  id3    11
#> 4   id1     2  id4    13
#> 5   id1     2  id5    17
#> 6   id2     7  id1     2
#> 7   id2     7  id2     7
#> 8   id2     7  id3    11
#> 9   id2     7  id4    13
#> 10  id2     7  id5    17
#> 11  id3    11  id1     2
#> 12  id3    11  id2     7
#> 13  id3    11  id3    11
#> 14  id3    11  id4    13
#> 15  id3    11  id5    17
#> 16  id4    13  id1     2
#> 17  id4    13  id2     7
#> 18  id4    13  id3    11
#> 19  id4    13  id4    13
#> 20  id4    13  id5    17
#> 21  id5    17  id1     2
#> 22  id5    17  id2     7
#> 23  id5    17  id3    11
#> 24  id5    17  id4    13
#> 25  id5    17  id5    17
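The row-count rule from the description can be checked without this package. The sketch below uses base R's merge() with by = NULL, which also produces a Cartesian product; it is a stand-in for illustration, not cross_join() itself, and the second data frame y (with its grp column) is a hypothetical example.

```r
# Base R sketch: merge() with by = NULL returns the Cartesian product.
x <- data.frame(
  id = c("id1", "id2", "id3", "id4", "id5"),
  val = c(2, 7, 11, 13, 17),
  stringsAsFactors = FALSE
)
# Hypothetical second table with three rows.
y <- data.frame(grp = c("a", "b", "c"), stringsAsFactors = FALSE)

cp <- merge(x, y, by = NULL)

# The result has nrow(x) * nrow(y) rows: 5 * 3 = 15.
n_rows <- nrow(cp)
stopifnot(n_rows == nrow(x) * nrow(y))
```

Since x and y share no column names here, no suffixes are needed; every row of x is paired with every row of y.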