I'm trying to use Spark DataFrames instead of RDDs, since they appear to be more high-level than RDDs and tend to produce more readable code.

On a 14-node Google Dataproc cluster, I have about 6 million names that are translated to ids by two different systems: sa and sb. Each Row contains name, id_sa and id_sb. My goal is to produce a mapping from id_sa to id_sb such that, for each id_sa, the corresponding id_sb is the most frequent id among all the names attached to that id_sa.
Let's try to clarify with an example. If I have the following rows:
    [Row(name='n1', id_sa='a1', id_sb='b1'),
     Row(name='n2', id_sa='a1', id_sb='b2'),
     Row(name='n3', id_sa='a1', id_sb='b2'),
     Row(name='n4', id_sa='a2', id_sb='b2')]
My goal is to produce a mapping from a1 to b2. Indeed, the names associated with a1 are n1, n2 and n3, which map respectively to b1, b2 and b2, so b2 is the most frequent mapping among the names associated with a1. In the same way, a2 will be mapped to b2. It's OK to assume that there will always be a winner: no need to break ties.
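For reproducibility, here is a minimal sketch that builds such a toy DataFrame (the SparkSession variable spark is an assumption on my part; the rest of the question refers to this DataFrame as df):

    from pyspark.sql import Row

    # Toy data matching the example above
    df = spark.createDataFrame([
        Row(name='n1', id_sa='a1', id_sb='b1'),
        Row(name='n2', id_sa='a1', id_sb='b2'),
        Row(name='n3', id_sa='a1', id_sb='b2'),
        Row(name='n4', id_sa='a2', id_sb='b2'),
    ])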
I was hoping that I could use groupBy(df.id_sa) on my DataFrame, but I don't know what to do next. I was hoping for an aggregation that could produce, in the end, the following rows:

    [Row(id_sa='a1', max_id_sb='b2'), Row(id_sa='a2', max_id_sb='b2')]
But maybe I'm trying to use the wrong tool and I should just go back to using RDDs.
Using join (it will result in more than one row per group in case of ties):
    import pyspark.sql.functions as F
    from pyspark.sql.functions import count, col

    cnts = df.groupBy("id_sa", "id_sb").agg(count("*").alias("cnt")).alias("cnts")
    maxs = cnts.groupBy("id_sa").agg(F.max("cnt").alias("mx")).alias("maxs")

    cnts.join(
        maxs,
        (col("cnt") == col("mx")) & (col("cnts.id_sa") == col("maxs.id_sa"))
    ).select(col("cnts.id_sa"), col("cnts.id_sb"))
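As a follow-up, the question ultimately asks for a mapping from id_sa to id_sb, so here is a hedged sketch of collecting the join result into a plain Python dict (only sensible when the number of distinct id_sa values fits in driver memory; the variable names best and mapping are mine):

    # Same join as above, bound to a name so it can be collected driver-side
    best = cnts.join(
        maxs,
        (col("cnt") == col("mx")) & (col("cnts.id_sa") == col("maxs.id_sa"))
    ).select(col("cnts.id_sa"), col("cnts.id_sb"))

    # Build a Python dict; on the toy data this gives {'a1': 'b2', 'a2': 'b2'}
    mapping = {row.id_sa: row.id_sb for row in best.collect()}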
Using window functions (will drop ties):
    from pyspark.sql.functions import row_number
    from pyspark.sql.window import Window

    w = Window().partitionBy("id_sa").orderBy(col("cnt").desc())

    (cnts
        .withColumn("rn", row_number().over(w))
        .where(col("rn") == 1)
        .select("id_sa", "id_sb"))
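A small design note on this variant: row_number keeps exactly one row per id_sa, but when there is a tie the surviving row depends on which one the window happens to order first. The question says ties won't occur, but if you want the result to be reproducible anyway, a secondary sort key can be added. A sketch, not part of the original answer:

    # Tie-break deterministically on id_sb after sorting by count
    w = Window().partitionBy("id_sa").orderBy(col("cnt").desc(), col("id_sb"))

    (cnts
        .withColumn("rn", row_number().over(w))
        .where(col("rn") == 1)
        .select("id_sa", "id_sb"))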
Using struct ordering:
    from pyspark.sql.functions import struct

    (cnts
        .groupBy("id_sa")
        .agg(F.max(struct(col("cnt"), col("id_sb"))).alias("max"))
        .select(col("id_sa"), col("max.id_sb")))
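Why the struct trick works: Spark orders structs field by field, so F.max(struct(cnt, id_sb)) picks the largest cnt first and only falls back to id_sb on ties. On the toy data, the group for a1 holds the structs (1, 'b1') and (2, 'b2'); their max is (2, 'b2'), so max.id_sb is 'b2' as required. A quick way to inspect the result (expected output sketched from the question's own example; row order may vary):

    # Same aggregation as above, with show() appended to inspect the result
    (cnts
        .groupBy("id_sa")
        .agg(F.max(struct(col("cnt"), col("id_sb"))).alias("max"))
        .select(col("id_sa"), col("max.id_sb"))
        .show())

    # +-----+-----+
    # |id_sa|id_sb|
    # +-----+-----+
    # |   a1|   b2|
    # |   a2|   b2|
    # +-----+-----+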
See also How to select the first row of each group?