I have a dataframe with the following columns:
groupid,unit,height
----------------------
1,in,55
2,in,54
I want to create another dataframe with additional rows where unit=cm and height=height*2.54.
Resulting dataframe:
groupid,unit,height
----------------------
1,in,55
2,in,54
1,cm,139.7
2,cm,137.16
I'm not sure how to use a Spark UDF and explode here. Any help is appreciated. Thanks in advance.
You can create another dataframe with the changes you need using withColumn, and then union the two dataframes:
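// Assumes a SQLContext named sqlContext is in scope; with a SparkSession named spark, import spark.implicits._ instead
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
  (1, "in", 55),
  (2, "in", 54)
).toDF("groupid", "unit", "height")
// Copy of df with the unit relabelled to "cm" and the height converted from inches to centimetres
val df2 = df.withColumn("unit", lit("cm")).withColumn("height", col("height")*2.54)
// union matches columns by position and appends the converted rows after the original ones
df.union(df2).show(false)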
You should get:
+-------+----+------+
|groupid|unit|height|
+-------+----+------+
|1      |in  |55.0  |
|2      |in  |54.0  |
|1      |cm  |139.7 |
|2      |cm  |137.16|
+-------+----+------+
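Since the question asks about explode: here is a rough sketch of how the same rows could be produced without a second dataframe or a UDF. It assumes Spark 2.x or later with a local SparkSession, and that every input row is in inches. The names target_unit and result are made up for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for illustration; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("explode-units").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "in", 55),
  (2, "in", 54)
).toDF("groupid", "unit", "height")

val result = df
  // Duplicate each row: one copy keeps the original unit, the other gets "cm"
  .withColumn("target_unit", explode(array(col("unit"), lit("cm"))))
  .select(
    col("groupid"),
    col("target_unit").as("unit"),
    // Convert only the "cm" copies; cast the untouched copies so both branches are double
    when(col("target_unit") === "cm", col("height") * 2.54)
      .otherwise(col("height").cast("double"))
      .as("height")
  )

result.show(false)
```

Note that the rows come out grouped per input row (the in and cm rows for groupid 1, then those for groupid 2) rather than all the in rows followed by all the cm rows, so add an orderBy if the order matters. For a one-off conversion like this, the withColumn plus union version above is probably easier to read.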