I have a large dataset of string ids that can fit into memory on a single node in my Spark cluster. The issue is that it consumes most of the memory of a single node.
These ids are about 30 characters long. For example:
ids
O2LWk4MAbcrOCWo3IVM0GInelSXfcG
HbDckDXCye20kwu0gfeGpLGWnJ2yif
o43xSMBUJLOKDxkYEQbAEWk4aPQHkm
I am looking to write a list of all of the pairs of ids to a file. For example:
id1,id2
O2LWk4MAbcrOCWo3IVM0GInelSXfcG,HbDckDXCye20kwu0gfeGpLGWnJ2yif
O2LWk4MAbcrOCWo3IVM0GInelSXfcG,o43xSMBUJLOKDxkYEQbAEWk4aPQHkm
HbDckDXCye20kwu0gfeGpLGWnJ2yif,O2LWk4MAbcrOCWo3IVM0GInelSXfcG
# etc...
So I need to cross join the dataset with itself. I was hoping to do this in PySpark on a 10-node cluster, but it needs to be memory efficient.
To make the computation faster, reduce the number of partitions of the input DataFrames before the cross join, so that the resulting cross-joined DataFrame doesn't end up with too many partitions (the output has roughly the product of the two inputs' partition counts), as sketched below.
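A minimal sketch of that idea. The file name input.csv, the column name id, and the partition count of 10 are assumptions for illustration, not part of the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
ids = spark.read.csv('input.csv', header=True)  # hypothetical single-column file of ids

# A cross join yields roughly left_partitions * right_partitions output partitions,
# so coalesce each side first to keep the result's partition count manageable.
left = ids.withColumnRenamed('id', 'id1').coalesce(10)    # 10 is an illustrative choice
right = ids.withColumnRenamed('id', 'id2').coalesce(10)
pairs = left.crossJoin(right)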
By default, Spark blocks implicit Cartesian products: a SQL query that joins N tables is expected to have N-1 join conditions, and a missing or trivial join condition raises a "Join condition is missing or trivial" error. To produce the product deliberately, either use the explicit crossJoin API or set spark.sql.crossJoin.enabled=true in your Spark driver application; both options are sketched below.
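A hedged sketch of both escape hatches. spark.sql.crossJoin.enabled is the Spark 2.x setting (Spark 3.0+ allows Cartesian products by default), and the tiny in-memory DataFrame is only for illustration:

from pyspark.sql import SparkSession

# Option 1: allow implicit cartesian products via configuration (Spark 2.x).
spark = SparkSession.builder.config('spark.sql.crossJoin.enabled', 'true').getOrCreate()

# Option 2: request the product explicitly, which works regardless of that setting.
df = spark.createDataFrame([('a',), ('b',), ('c',)], ['id'])
df.crossJoin(df.withColumnRenamed('id', 'id2')).show()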
In Spark, the cartesian() transformation generates the Cartesian product of two datasets and returns all possible pairs: each element of one dataset is paired with every element of the other.
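A tiny RDD-level illustration of cartesian() with made-up values (not the questioner's data), reusing a SparkSession obtained as in the previous snippet:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(['a', 'b', 'c'])
# cartesian() pairs every element of one RDD with every element of the other,
# so the result here has 3 * 3 = 9 tuples.
print(rdd.cartesian(rdd).collect())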
PySpark will handle your dataset easily and memory-efficiently, but it will take time to process roughly 10^8 * 10^8 = 10^16 records (the estimated size of the cross join result). See the sample code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()  # spark is already defined in the pyspark shell
df = spark.read.csv('input.csv', header=True, schema=StructType([StructField('id', StringType())]))
df.withColumnRenamed('id', 'id1').crossJoin(df.withColumnRenamed('id', 'id2')).show()
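Since the goal is to write the pairs to a file rather than display them, a possible follow-up to the snippet above. The output path and the id1 != id2 filter for dropping self-pairs are assumptions based on the example output in the question:

from pyspark.sql import functions as F

pairs = df.withColumnRenamed('id', 'id1').crossJoin(df.withColumnRenamed('id', 'id2'))
# Drop self-pairs such as (x, x), which the example output above omits,
# and write the result as CSV from the executors instead of collecting it to the driver.
pairs.filter(F.col('id1') != F.col('id2')).write.csv('id_pairs_output', header=True)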