
Memory efficient cartesian join in PySpark

I have a large dataset of string ids that fits into memory on a single node of my Spark cluster, but it consumes most of that node's memory.

These ids are about 30 characters long. For example:

ids
O2LWk4MAbcrOCWo3IVM0GInelSXfcG
HbDckDXCye20kwu0gfeGpLGWnJ2yif
o43xSMBUJLOKDxkYEQbAEWk4aPQHkm

I want to write a list of all pairs of ids to a file. For example:

id1,id2
O2LWk4MAbcrOCWo3IVM0GInelSXfcG,HbDckDXCye20kwu0gfeGpLGWnJ2yif
O2LWk4MAbcrOCWo3IVM0GInelSXfcG,o43xSMBUJLOKDxkYEQbAEWk4aPQHkm
HbDckDXCye20kwu0gfeGpLGWnJ2yif,O2LWk4MAbcrOCWo3IVM0GInelSXfcG
# etc...

So I need to cross join the dataset with itself. I was hoping to do this in PySpark on a 10-node cluster, but it needs to be memory efficient.

mgoldwasser asked Feb 06 '17 14:02


People also ask

How do you optimize cross join IN Spark?

To make the computation faster, reduce the number of partitions of the input DataFrames before the cross join, so that the resulting cross joined DataFrame doesn't have too many partitions.
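As a rough sketch of that idea, the helper below coalesces both inputs before the cross join. The function name and the partition counts are hypothetical, and it assumes Spark runs a plain Cartesian product, where the output has roughly left_parts * right_parts partitions; other join strategies may partition the result differently.

```python
def cross_join_with_fewer_partitions(left, right, left_parts, right_parts):
    """Coalesce both DataFrames, then cross join them.

    left and right are PySpark DataFrames; left_parts and right_parts are
    hypothetical target partition counts chosen by the caller.
    """
    # coalesce() merges existing partitions without a full shuffle.
    left_small = left.coalesce(left_parts)
    right_small = right.coalesce(right_parts)
    # Assumption: with a plain Cartesian product the result has about
    # left_parts * right_parts partitions.
    return left_small.crossJoin(right_small)
```

Fewer partitions also means each one holds more rows, so on a memory-constrained cluster the counts would need tuning rather than simply minimizing.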

How do you avoid Cartesian join in Spark?

To avoid a Cartesian product, a SQL query that joins N tables must have at least N-1 join conditions. A Cartesian product results when a join condition is missing or trivial.

What is Cartesian join in Spark?

In Spark, the cartesian function generates the Cartesian product of two datasets and returns all possible pairs. Each element of one dataset is paired with each element of the other.
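To make the pairing concrete, here is a small plain-Python illustration using itertools.product on three placeholder ids (the ids are made up for the example). On an RDD, rdd.cartesian(rdd) should yield the same set of pairs, distributed across partitions and in no guaranteed order.

```python
from itertools import product

# Placeholder ids, purely for illustration.
ids = ["a", "b", "c"]

# Every element paired with every element, self-pairs included: 3 * 3 = 9 pairs.
all_pairs = list(product(ids, repeat=2))

# Dropping self-pairs such as ("a", "a") leaves 3 * 2 = 6 ordered pairs.
distinct_pairs = [(x, y) for x, y in all_pairs if x != y]
```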


1 Answer

PySpark will handle your dataset easily and in a memory-efficient way, but it will take time to process 10^8 * 10^8 records (the estimated size of the cross join result). See sample code:

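from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('input.csv', header=True, schema=StructType([StructField('id', StringType())]))
# show() prints only the first 20 rows of the cross join
df.withColumnRenamed('id', 'id1').crossJoin(df.withColumnRenamed('id', 'id2')).show()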
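Since the goal is a file rather than a preview, a sketch along the following lines might be closer to what is needed. It is only an illustration under stated assumptions: the function names and the output path are hypothetical, self-pairs such as (a, a) are assumed to be unwanted, and ids are assumed to be 30 ASCII characters. The first helper estimates the CSV size before any job is launched; for 10^8 ids it gives roughly 6.2 * 10^17 bytes, so the available storage matters as much as memory.

```python
def estimated_output_bytes(n_ids, id_len=30):
    """Estimate the CSV size of all ordered pairs of distinct ids."""
    # Each line holds two ids, one comma and one newline.
    bytes_per_line = 2 * id_len + 2
    # n * (n - 1) ordered pairs once self-pairs are excluded.
    return n_ids * (n_ids - 1) * bytes_per_line


def write_all_pairs(spark, input_path, output_path):
    """Write every ordered pair of distinct ids as CSV part files.

    spark is an existing SparkSession; both paths are supplied by the caller.
    """
    # A DDL schema string avoids importing the type classes.
    ids = spark.read.csv(input_path, header=True, schema="id STRING")
    left = ids.withColumnRenamed("id", "id1")
    right = ids.withColumnRenamed("id", "id2")
    # Assumption: self-pairs are not wanted; drop the filter to keep them.
    pairs = left.crossJoin(right).filter("id1 != id2")
    # Executors write their partitions as separate part files under
    # output_path, so nothing is collected on the driver.
    pairs.write.csv(output_path, header=True, mode="overwrite")
```

It could be called as write_all_pairs(spark, 'input.csv', 'pairs_out'), where spark comes from SparkSession.builder.getOrCreate() and pairs_out is a placeholder directory name.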

Mariusz answered Sep 22 '22 13:09