
How to explode columns?

After running:

val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")

I have this DataFrame in Apache Spark:

+------+---------+
| Col1 | Col2    |
+------+---------+
|  1   |[2, 3, 4]|
|  1   |[2, 3, 4]|
+------+---------+

How do I convert this into:

+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
|  1   |  2   |  3   |  4   |
|  1   |  2   |  3   |  4   |
+------+------+------+------+
asked May 23 '16 by Jorge Machado



2 Answers

A solution that doesn't convert to and from RDD:

df.select($"Col1", $"Col2"(0) as "Col2", $"Col2"(1) as "Col3", $"Col2"(2) as "Col4")

Or, arguably nicer:

val nElements = 3
df.select(($"Col1" +: Range(0, nElements).map(idx => $"Col2"(idx).as("Col" + (idx + 2)))):_*)

The size of a Spark array column is not fixed; you could, for instance, have:

+----+------------+
|Col1|        Col2|
+----+------------+
|   1|   [2, 3, 4]|
|   1|[2, 3, 4, 5]|
+----+------------+

So there is no general way to determine the number of columns up front and generate them automatically. If you know the size is always the same, you can compute nElements from the first row:

val nElements = df.select("Col2").first.getList(0).size
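To make the transformation itself concrete, here is a plain-Python sketch of what the select above does, using an ordinary list of tuples in place of a Spark DataFrame (the data and variable names are illustrative, not part of the Spark API):

```python
# Rows mirroring the example DataFrame: (Col1, Col2) where Col2 is an array.
rows = [(1, [2, 3, 4]), (1, [2, 3, 4])]

# Take the array width from the first row, analogous to
# df.select("Col2").first.getList(0).size in the Scala answer.
n_elements = len(rows[0][1])

# Widen each row: Col1 followed by one column per array element.
widened = [(col1, *col2[:n_elements]) for col1, col2 in rows]
# widened == [(1, 2, 3, 4), (1, 2, 3, 4)]
```

As in Spark, this silently assumes every array has at least n_elements entries; shorter arrays would simply produce narrower rows here, where Spark would give you nulls.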
answered Oct 22 '22 by sgvd


Just to give the PySpark version of sgvd's answer. If the array column is Col2, then this select statement will move the first nElements of each array in Col2 to their own columns, keeping Col1 and naming the new columns to match the desired output:

from pyspark.sql import functions as F
df.select(['Col1'] + [F.col('Col2').getItem(i).alias('Col' + str(i + 2)) for i in range(nElements)])
answered Oct 22 '22 by Shane Halloran