I have a dataset in the following way:
FieldA FieldB ArrayField
1      A      {1,2,3}
2      B      {3,5}
I would like to explode the data on ArrayField so the output will look in the following way:
FieldA FieldB ExplodedField
1      A      1
1      A      2
1      A      3
2      B      3
2      B      5
That is, I want to generate one output row for each item in the array in ArrayField, while keeping the values of the other fields. How would you implement this in Spark? Note that the input dataset is very large.
Spark SQL's explode function splits an array or map column into rows. Spark defines several flavors of this function: explode_outer, which also emits a row for null or empty arrays; posexplode, which additionally returns each element's position; and posexplode_outer, which combines both behaviors.
If you instead want to flatten nested arrays, use the flatten function, which converts an array-of-arrays column into a single array column.
explode(col): Returns a new row for each element in the given array or map. Uses the default column name col for elements in the array, and key and value for elements in the map, unless specified otherwise.
The explode function should get that done.
pyspark version:
>>> from pyspark.sql.functions import explode
>>> df = spark.createDataFrame([(1, "A", [1,2,3]), (2, "B", [3,5])], ["col1", "col2", "col3"])
>>> df.withColumn("col3", explode(df.col3)).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   A|   1|
|   1|   A|   2|
|   1|   A|   3|
|   2|   B|   3|
|   2|   B|   5|
+----+----+----+
Scala version
scala> val df = Seq((1, "A", Seq(1,2,3)), (2, "B", Seq(3,5))).toDF("col1", "col2", "col3")
df: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 1 more field]

scala> df.withColumn("col3", explode($"col3")).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   A|   1|
|   1|   A|   2|
|   1|   A|   3|
|   2|   B|   3|
|   2|   B|   5|
+----+----+----+