I have the following dataframe, which contains two columns:
the 1st column has column names,
the 2nd column has lists of values.
+--------------------+--------------------+
| Column| Quantile|
+--------------------+--------------------+
| rent|[4000.0, 4500.0, ...|
| is_rent_changed|[0.0, 0.0, 0.0, 0...|
| phone|[7.022372888E9, 7...|
| Area_house|[1000.0, 1000.0, ...|
| bedroom_count|[1.0, 1.0, 1.0, 1...|
| bathroom_count|[1.0, 1.0, 1.0, 1...|
| maintenance_cost|[0.0, 0.0, 0.0, 0...|
| latitude|[12.8217605, 12.8...|
| Max_rent|[9000.0, 10000.0,...|
| Beds|[2.0, 2.0, 2.0, 2...|
| Area|[1000.0, 1000.0, ...|
| Avg_Rent|[3500.0, 4000.0, ...|
| deposit_amount|[0.0, 0.0, 0.0, 0...|
| commission|[0.0, 0.0, 0.0, 0...|
| monthly_rent|[0.0, 0.0, 0.0, 0...|
|is_min_rent_guara...|[0.0, 0.0, 0.0, 0...|
|min_guarantee_amount|[0.0, 0.0, 0.0, 0...|
|min_guarantee_dur...|[1.0, 1.0, 1.0, 1...|
| furnish_cost|[0.0, 0.0, 0.0, 0...|
| owner_furnish_part|[0.0, 0.0, 0.0, 0...|
+--------------------+--------------------+
How do I split the second column into multiple columns, preserving the same dataset?
I can access the values using:
univar_df10.select("Column", univar_df10.Quantile[0],univar_df10.Quantile[1],univar_df10.Quantile[2]).show()
+--------------------+-------------+-------------+------------+
| Column| Quantile[0]| Quantile[1]| Quantile[2]|
+--------------------+-------------+-------------+------------+
| rent| 4000.0| 4500.0| 5000.0|
| is_rent_changed| 0.0| 0.0| 0.0|
| phone|7.022372888E9|7.042022842E9|7.07333021E9|
| Area_house| 1000.0| 1000.0| 1000.0|
| bedroom_count| 1.0| 1.0| 1.0|
| bathroom_count| 1.0| 1.0| 1.0|
| maintenance_cost| 0.0| 0.0| 0.0|
| latitude| 12.8217605| 12.8490502| 12.863517|
| Max_rent| 9000.0| 10000.0| 11500.0|
| Beds| 2.0| 2.0| 2.0|
| Area| 1000.0| 1000.0| 1000.0|
| Avg_Rent| 3500.0| 4000.0| 4125.0|
| deposit_amount| 0.0| 0.0| 0.0|
| commission| 0.0| 0.0| 0.0|
| monthly_rent| 0.0| 0.0| 0.0|
|is_min_rent_guara...| 0.0| 0.0| 0.0|
|min_guarantee_amount| 0.0| 0.0| 0.0|
|min_guarantee_dur...| 1.0| 1.0| 1.0|
| furnish_cost| 0.0| 0.0| 0.0|
| owner_furnish_part| 0.0| 0.0| 0.0|
+--------------------+-------------+-------------+------------+
only showing top 20 rows
I want my new dataframe to split the 2nd column of lists into multiple columns, like the above dataset. Thanks in advance.
Assuming (since your question is flagged for closure as unclear what you're asking) that your issue is that the lists in your Quantile column can be of arbitrary length, so it is not convenient to build the respective command by hand, here is a solution using list addition and a list comprehension as the argument to select:
spark.version
# u'2.2.1'
# make some toy data
from pyspark.sql import Row
df = spark.createDataFrame([Row([0, 45, 63, 0, 0, 0, 0]),
                            Row([0, 0, 0, 85, 0, 69, 0]),
                            Row([0, 89, 56, 0, 0, 0, 0])],
                           ['features'])
df.show()
# result:
+-----------------------+
|features |
+-----------------------+
|[0, 45, 63, 0, 0, 0, 0]|
|[0, 0, 0, 85, 0, 69, 0]|
|[0, 89, 56, 0, 0, 0, 0]|
+-----------------------+
# get the length of your lists, if you don't know it already (here it is 7):
length = len(df.select('features').take(1)[0][0])
length
# 7
df.select([df.features] + [df.features[i] for i in range(length)]).show()
# result:
+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| features|features[0]|features[1]|features[2]|features[3]|features[4]|features[5]|features[6]|
+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|[0, 45, 63, 0, 0,...| 0| 45| 63| 0| 0| 0| 0|
|[0, 0, 0, 85, 0, ...| 0| 0| 0| 85| 0| 69| 0|
|[0, 89, 56, 0, 0,...| 0| 89| 56| 0| 0| 0| 0|
+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
So, in your case,
univar_df10.select([univar_df10.Column] + [univar_df10.Quantile[i] for i in range(length)])
should do the job, after you have calculated length
as
length = len(univar_df10.select('Quantile').take(1)[0][0])
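The select-list trick itself is plain Python: list addition concatenates the base column with the generated per-index expressions. A minimal Spark-free sketch of how that argument to select is assembled (the names here are illustrative, and the string expressions stand in for the Column objects Spark would build):

```python
# Build the argument that select() receives:
# one base column plus one generated expression per array index.
length = 3  # in Spark: len(univar_df10.select('Quantile').take(1)[0][0])

base = ["Column"]
generated = ["Quantile[{}]".format(i) for i in range(length)]
select_args = base + generated

print(select_args)
# ['Column', 'Quantile[0]', 'Quantile[1]', 'Quantile[2]']
```

Because range(length) adapts to whatever length you compute, the same line works no matter how long the lists are.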
Here's the pseudocode to do it in Scala:
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions.col
// Create the column names you want.
val quantileColumns = Seq("quantile1", "quantile2", "quantile3")
// Get the number of columns.
val numberOfColumns = quantileColumns.size
// Build the list of columns (note: split expects Quantile to be a
// comma-separated string; if it is already an array, use
// col("Quantile").getItem(i) directly).
val columnList = for (i <- 0 until numberOfColumns)
  yield split(col("Quantile"), ",").getItem(i).alias(quantileColumns(i))
// Just perform the select operation.
df.select(columnList: _*)
// If you want some columns to be added or dropped, use withColumn & drop on df.
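For intuition, here is the same split-and-label logic in plain Python, under the same assumption the Scala sketch makes (that each Quantile cell is a comma-separated string); the quantile1..quantile3 names are illustrative:

```python
# Plain-Python analogue of split(col("Quantile"), ",").getItem(i).alias(...):
# split the string once, then label each piece with its target column name.
quantile_names = ["quantile1", "quantile2", "quantile3"]

row = "4000.0,4500.0,5000.0"  # one cell of the Quantile column
parts = row.split(",")

named = {name: float(parts[i]) for i, name in enumerate(quantile_names)}
print(named)
# {'quantile1': 4000.0, 'quantile2': 4500.0, 'quantile3': 5000.0}
```

getItem(i) in Spark plays the role of parts[i] here, and alias attaches the readable column name, exactly as the dict keys do.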