The CROSS JOIN returns all combinations of rows in x and y, i.e. a dataset whose row count is the number of rows in the first dataset multiplied by the number of rows in the second dataset. This kind of result is called the Cartesian product.
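As a minimal sketch of the row-count multiplication (assuming dplyr >= 1.1.0, which exports cross_join() for data frames; the letters_df and numbers_df names are just illustrative), crossing a 2-row table with a 3-row table yields 2 * 3 = 6 rows:

library(dplyr)

letters_df <- data.frame(letter = c("a", "b"))
numbers_df <- data.frame(number = 1:3)

cross_join(letters_df, numbers_df)  # 6 rows: every (letter, number) pair
#>   letter number
#> 1      a      1
#> 2      a      2
#> 3      a      3
#> 4      b      1
#> 5      b      2
#> 6      b      3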
cross_join(x, y, copy = FALSE, suffix = c("_x", "_y"), ..., na_matches = c("never", "na"))

# S3 method for tbl_lazy
cross_join(x, y, copy = FALSE, suffix = c("_x", "_y"), ..., na_matches = c("never", "na"))

# S3 method for data.frame
cross_join(x, y, copy = FALSE, suffix = c("_x", "_y"), ..., na_matches = c("na", "never"))
| Argument | Description |
|---|---|
| x, y | A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). |
| copy | If x and y are not from the same data source, and copy is TRUE, then y will be copied into the same src as x. This allows you to join tables across srcs, but it is a potentially expensive operation, so you must opt into it. |
| suffix | If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2. |
| ... | Other parameters passed onto methods. |
| na_matches | Should NA (NULL) values match one another? The default, "never", is how databases usually work. |
From Spark 2.1, the prerequisite for using a cross join is that spark.sql.crossJoin.enabled must be set to true; otherwise an exception will be thrown. Cartesian products are very slow. More importantly, they can consume a lot of memory and trigger an OOM error. If the join type is not Inner, Spark SQL may use a Broadcast Nested Loop Join even if neither side of the join is small enough to broadcast, which can also cause a lot of unwanted network traffic.
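A minimal sketch of how to opt in from sparklyr (assumes a local Spark installation; Spark 3.0 and later enable cross joins by default):

library(sparklyr)

# Enable cross joins (required on Spark 2.1 through 2.x) before connecting.
config <- spark_config()
config[["spark.sql.crossJoin.enabled"]] <- "true"

sc <- spark_connect(master = "local", config = config)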
x <- data.frame(
  id = c("id1", "id2", "id3", "id4", "id5"),
  val = c(2, 7, 11, 13, 17),
  stringsAsFactors = FALSE
)

cross_join(x, x)
#>    id_x val_x id_y val_y
#> 1   id1     2  id1     2
#> 2   id1     2  id2     7
#> 3   id1     2  id3    11
#> 4   id1     2  id4    13
#> 5   id1     2  id5    17
#> 6   id2     7  id1     2
#> 7   id2     7  id2     7
#> 8   id2     7  id3    11
#> 9   id2     7  id4    13
#> 10  id2     7  id5    17
#> 11  id3    11  id1     2
#> 12  id3    11  id2     7
#> 13  id3    11  id3    11
#> 14  id3    11  id4    13
#> 15  id3    11  id5    17
#> 16  id4    13  id1     2
#> 17  id4    13  id2     7
#> 18  id4    13  id3    11
#> 19  id4    13  id4    13
#> 20  id4    13  id5    17
#> 21  id5    17  id1     2
#> 22  id5    17  id2     7
#> 23  id5    17  id3    11
#> 24  id5    17  id4    13
#> 25  id5    17  id5    17