 

How to do opposite of explode in PySpark?

Let's say I have a DataFrame with a column for users and another column for words they've written:

Row(user='Bob', word='hello')
Row(user='Bob', word='world')
Row(user='Mary', word='Have')
Row(user='Mary', word='a')
Row(user='Mary', word='nice')
Row(user='Mary', word='day')

I would like to aggregate the word column into a vector:

Row(user='Bob', words=['hello','world'])
Row(user='Mary', words=['Have','a','nice','day'])

It seems I can't use any of Spark's grouping functions, because they expect a subsequent aggregation step. My use case is that I want to feed these data into Word2Vec, not apply other Spark aggregations.

asked Apr 11 '17 by Evan Zamir

People also ask

What is opposite of explode in Spark?

The requirement is to reverse the explode operation, i.e. to convert the exploded values back into an array column on a Spark DataFrame.

How do you flatten an array in PySpark?

If you want to flatten nested arrays, use the flatten function, which converts an array-of-arrays column into a single array column on a DataFrame.
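
For instance, a minimal sketch (assuming an existing SparkSession named spark; the column names are only illustrative):

from pyspark.sql import functions as F

# A column holding an array of arrays
nested_df = spark.createDataFrame([(1, [[1, 2], [3, 4]])], ['id', 'nested'])

# flatten collapses the nested arrays into a single array per row
print(nested_df.select(F.flatten('nested').alias('flat')).collect())
# [Row(flat=[1, 2, 3, 4])]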

What does PySpark explode do?

Returns a new row for each element in the given array or map. Uses the default column name col for elements in the array and key and value for elements in the map unless specified otherwise.
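
As an illustration (again assuming a SparkSession named spark), explode turns one row per array into one row per element:

from pyspark.sql import functions as F

words_df = spark.createDataFrame([('Bob', ['hello', 'world'])], ['user', 'words'])

# Each element of the words array becomes its own row
print(words_df.select('user', F.explode('words').alias('word')).collect())
# [Row(user='Bob', word='hello'), Row(user='Bob', word='world')]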

Is Spark explode expensive?

In one benchmark, the alternative approach averaged a run time of 0.22 s, around 8x faster than using explode. For those skimming this post, a short summary: explode is an expensive operation, and you can often find a more performance-oriented solution (it might not be as easy to write, but it will run faster) instead of this standard Spark method.
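
One hedged illustration of that advice: when the end result can be computed directly on the array column, the explode/groupBy round trip can often be skipped entirely. The DataFrame and column names below are assumed for the example:

from pyspark.sql import functions as F

words_df = spark.createDataFrame([('Bob', ['hello', 'world']),
                                  ('Mary', ['Have', 'a', 'nice', 'day'])],
                                 ['user', 'words'])

# Expensive pattern: explode, then group back up to count words per user
exploded_counts = (words_df.select('user', F.explode('words').alias('word'))
                           .groupBy('user').count())

# Cheaper pattern: operate on the array column directly, no shuffle needed
direct_counts = words_df.select('user', F.size('words').alias('count'))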


1 Answer

Thanks to @titipat for giving the RDD solution. I realized shortly after my post that there is actually a DataFrame solution using collect_set (or collect_list):

from pyspark.sql import Row
from pyspark.sql.functions import collect_set

# Build the example DataFrame of (user, word) pairs
rdd = spark.sparkContext.parallelize([Row(user='Bob', word='hello'),
                                      Row(user='Bob', word='world'),
                                      Row(user='Mary', word='Have'),
                                      Row(user='Mary', word='a'),
                                      Row(user='Mary', word='nice'),
                                      Row(user='Mary', word='day')])
df = spark.createDataFrame(rdd)

# Group by user and collect the word column back into an array
group_user = df.groupBy('user').agg(collect_set('word').alias('words'))
print(group_user.collect())

>[Row(user='Mary', words=['Have', 'nice', 'day', 'a']), Row(user='Bob', words=['world', 'hello'])]
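
One caveat worth noting: collect_set de-duplicates words and does not guarantee order, as the output above shows. If duplicates and word order matter for the downstream Word2Vec step, collect_list is usually the better choice. A minimal sketch of wiring the result into pyspark.ml's Word2Vec follows (the vectorSize and minCount values are only illustrative):

from pyspark.sql.functions import collect_list
from pyspark.ml.feature import Word2Vec

# collect_list keeps duplicate words (ordering after a shuffle is still not strictly guaranteed)
group_user = df.groupBy('user').agg(collect_list('word').alias('words'))

# Train Word2Vec directly on the aggregated words column
word2vec = Word2Vec(vectorSize=50, minCount=1, inputCol='words', outputCol='vectors')
model = word2vec.fit(group_user)
print(model.getVectors().collect())
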
answered Sep 21 '22 by Evan Zamir