Split column of list into multiple columns in the same PySpark dataframe

Question

I have the following dataframe which contains 2 columns:

1st column has column names

2nd Column has list of values.

+--------------------+--------------------+
|              Column|            Quantile|
+--------------------+--------------------+
|                rent|[4000.0, 4500.0, ...|
|     is_rent_changed|[0.0, 0.0, 0.0, 0...|
|               phone|[7.022372888E9, 7...|
|          Area_house|[1000.0, 1000.0, ...|
|       bedroom_count|[1.0, 1.0, 1.0, 1...|
|      bathroom_count|[1.0, 1.0, 1.0, 1...|
|    maintenance_cost|[0.0, 0.0, 0.0, 0...|
|            latitude|[12.8217605, 12.8...|
|            Max_rent|[9000.0, 10000.0,...|
|                Beds|[2.0, 2.0, 2.0, 2...|
|                Area|[1000.0, 1000.0, ...|
|            Avg_Rent|[3500.0, 4000.0, ...|
|      deposit_amount|[0.0, 0.0, 0.0, 0...|
|          commission|[0.0, 0.0, 0.0, 0...|
|        monthly_rent|[0.0, 0.0, 0.0, 0...|
|is_min_rent_guara...|[0.0, 0.0, 0.0, 0...|
|min_guarantee_amount|[0.0, 0.0, 0.0, 0...|
|min_guarantee_dur...|[1.0, 1.0, 1.0, 1...|
|        furnish_cost|[0.0, 0.0, 0.0, 0...|
|  owner_furnish_part|[0.0, 0.0, 0.0, 0...|
+--------------------+--------------------+

How do I split the second column into Multiple Columns Preserving the same dataset.

I can access the values using :

univar_df10.select("Column", univar_df10.Quantile[0],univar_df10.Quantile[1],univar_df10.Quantile[2]).show()

+--------------------+-------------+-------------+------------+
|              Column|  Quantile[0]|  Quantile[1]| Quantile[2]|
+--------------------+-------------+-------------+------------+
|                rent|       4000.0|       4500.0|      5000.0|
|     is_rent_changed|          0.0|          0.0|         0.0|
|               phone|7.022372888E9|7.042022842E9|7.07333021E9|
|          Area_house|       1000.0|       1000.0|      1000.0|
|       bedroom_count|          1.0|          1.0|         1.0|
|      bathroom_count|          1.0|          1.0|         1.0|
|    maintenance_cost|          0.0|          0.0|         0.0|
|            latitude|   12.8217605|   12.8490502|   12.863517|
|            Max_rent|       9000.0|      10000.0|     11500.0|
|                Beds|          2.0|          2.0|         2.0|
|                Area|       1000.0|       1000.0|      1000.0|
|            Avg_Rent|       3500.0|       4000.0|      4125.0|
|      deposit_amount|          0.0|          0.0|         0.0|
|          commission|          0.0|          0.0|         0.0|
|        monthly_rent|          0.0|          0.0|         0.0|
|is_min_rent_guara...|          0.0|          0.0|         0.0|
|min_guarantee_amount|          0.0|          0.0|         0.0|
|min_guarantee_dur...|          1.0|          1.0|         1.0|
|        furnish_cost|          0.0|          0.0|         0.0|
|  owner_furnish_part|          0.0|          0.0|         0.0|
+--------------------+-------------+-------------+------------+
only showing top 20 rows

I want my new dataframe to to split my 2nd column of lists into multiple columns like the above dataset. Thanks in advance.

desertnaut · Accepted Answer

Assuming (your question is flagged for closure as unclear what you're asking) that your issue is that the lists in your Quantile column are of some length, and so it is not convenient to build the respective command by hand, here is a solution using list addition and comprehension as an argument to select:

spark.version
# u'2.2.1'

# make some toy data
from pyspark.sql import Row
df = spark.createDataFrame([Row([0,45,63,0,0,0,0]),
                            Row([0,0,0,85,0,69,0]),
                            Row([0,89,56,0,0,0,0])],
                            ['features'])

df.show()
# result:
+-----------------------+
|features               |
+-----------------------+
|[0, 45, 63, 0, 0, 0, 0]|
|[0, 0, 0, 85, 0, 69, 0]|
|[0, 89, 56, 0, 0, 0, 0]|
+-----------------------+

# get the length of your lists, if you don't know it already (here is 7):
length = len(df.select('features').take(1)[0][0])
length
# 7

df.select([df.features] + [df.features[i] for i in range(length)]).show()
# result:
+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|            features|features[0]|features[1]|features[2]|features[3]|features[4]|features[5]|features[6]|  
+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|[0, 45, 63, 0, 0,...|          0|         45|         63|          0|          0|          0|          0| 
|[0, 0, 0, 85, 0, ...|          0|          0|          0|         85|          0|         69|          0|
|[0, 89, 56, 0, 0,...|          0|         89|         56|          0|          0|          0|          0|
+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+

So, in your case,

univar_df10.select([univar_df10.Column] + [univar_df10.Quantile[i] for i in range(length)])

should do the job, after you have calculated length as

length = len(univar_df10.select('Quantile').take(1)[0][0])

Deepak · Answer

Here's the pseudo code to do it in scala :-

import org.apache.spark.sql.functions.split 
import org.apache.spark.sql.functions.col

#Create column which you wanted to be .
val quantileColumn = Seq("quantile1","qunatile2","quantile3")

#Get the number of columns
val numberOfColums = quantileColumn.size

#Create a list of column
val columList = for (i <- 0 until numberOfColums ) yield split(col("Quantile"),",").getItem(i).alias(quantileColumn(i))

#Just perfom Select operation.
df.select(columList: _ *)

# If you want some columns to be added or dropped , use withColumn & dropp on df.

Split column of list into multiple columns in the same PySpark dataframe

Tags:

apache-spark-sql

pyspark

kcvizer

2 Answers

desertnaut

Deepak

Recent Activity

Donate For Us

Split column of list into multiple columns in the same PySpark dataframe

Tags:

apache-spark-sql

pyspark

kcvizer

2 Answers

desertnaut

Deepak

Related questions

Recent Activity

Donate For Us