 

Split large array columns into multiple columns - Pyspark

Tags:

pyspark

I have:

+---+-------+-------+
| id|   var1|   var2|
+---+-------+-------+
|  a|[1,2,3]|[1,2,3]|
|  b|[2,3,4]|[2,3,4]|
+---+-------+-------+

I want:

+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
|  a|      1|      2|      3|      1|      2|      3|
|  b|      2|      3|      4|      2|      3|      4|
+---+-------+-------+-------+-------+-------+-------+

The solution provided in How to split a list to multiple columns in Pyspark?

df1.select('id', df1.var1[0], df1.var1[1], ...).show()

works, but some of my arrays are very long (up to 332 elements).

How can I write this so that it handles arrays of any length?

Microsim asked Aug 02 '18


People also ask

How do you split columns in PySpark?

PySpark SQL provides the split() function to convert a delimiter-separated string to an array (StringType to ArrayType) column on a DataFrame. The string column is split on a delimiter such as a space, comma, or pipe.
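A minimal sketch, assuming an existing SparkSession named spark and a made-up DataFrame:

from pyspark.sql import functions as F

df = spark.createDataFrame([("a", "1,2,3")], ["id", "csv"])

# split() turns the StringType column into an ArrayType column
df.select("id", F.split("csv", ",").alias("arr")).show()
+---+---------+
| id|      arr|
+---+---------+
|  a|[1, 2, 3]|
+---+---------+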

How do you explode an array in PySpark?

To split multiple array column data into rows, PySpark provides a function called explode(). Using explode, we get a new row for each element in the array.
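For example, a quick sketch with made-up data:

from pyspark.sql import functions as F

df = spark.createDataFrame([("a", [1, 2, 3])], ["id", "var1"])

# explode() emits one output row per array element
df.select("id", F.explode("var1").alias("val")).show()
+---+---+
| id|val|
+---+---+
|  a|  1|
|  a|  2|
|  a|  3|
+---+---+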

How do I split a single column into multiple columns in Python?

We can use str.split() to split one column into multiple columns by specifying the expand=True option. We can use str.extract() to extract multiple columns using a regex expression in which multiple capturing groups are defined.
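A minimal pandas sketch of both approaches (the column names here are made up):

import pandas as pd

pdf = pd.DataFrame({"var1": ["1,2,3", "2,3,4"]})

# str.split() with expand=True yields one column per piece
pdf[["v0", "v1", "v2"]] = pdf["var1"].str.split(",", expand=True)

# str.extract() yields one column per regex capturing group
pdf[["first", "rest"]] = pdf["var1"].str.extract(r"(\d+),(.*)")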


1 Answer

This solution works no matter the number of initial columns or the size of your arrays. Moreover, if a column contains arrays of different sizes (e.g. [1,2] and [3,4,5]), it produces as many columns as the longest array, with null values filling the gaps.

from pyspark.sql import functions as F

df = spark.createDataFrame([['a', [1,2,3], [1,2,3]], ['b', [2,3,4], [2,3,4]]],
                           ["id", "var1", "var2"])

# All array columns (everything except id)
columns = df.drop('id').columns

# Per-row array sizes, then the maximum size of each column
df_sizes = df.select(*[F.size(col).alias(col) for col in columns])
df_max = df_sizes.agg(*[F.max(col).alias(col) for col in columns])
max_dict = df_max.collect()[0].asDict()

# One output column per array position, up to each column's maximum size
df_result = df.select('id', *[df[col][i] for col in columns for i in range(max_dict[col])])
df_result.show()
+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
|  a|      1|      2|      3|      1|      2|      3|
|  b|      2|      3|      4|      2|      3|      4|
+---+-------+-------+-------+-------+-------+-------+
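To illustrate the ragged-array behaviour mentioned above: indexing past the end of a Spark array returns null rather than raising an error, so shorter rows are padded with nulls. A quick sketch with made-up data:

df2 = spark.createDataFrame([['a', [1, 2]], ['b', [3, 4, 5]]], ["id", "var1"])

# Longest array in the column, then one output column per position
max_size = df2.agg(F.max(F.size('var1'))).collect()[0][0]
df2.select('id', *[df2['var1'][i] for i in range(max_size)]).show()
+---+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|
+---+-------+-------+-------+
|  a|      1|      2|   null|
|  b|      3|      4|      5|
+---+-------+-------+-------+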
Pierre Gourseaud answered Sep 28 '22