Trouble With Pyspark Round Function


I'm having some trouble getting the round function in PySpark to work. In the block of code below I'm trying to round the new_bid column to 2 decimal places and rename it as bid afterwards. For reference, I'm importing pyspark.sql.functions as func and using the round function contained within it:

from pyspark.sql.functions import col  # col is used below alongside func

output = output.select(col("ad").alias("ad_id"),
                       col("part").alias("part_id"),
                       func.round(col("new_bid"), 2).alias("bid"))

The new_bid column here is of type float. The resulting dataframe does not have the newly named bid column rounded to 2 decimal places as intended; instead the values still show 8 or 9 decimal places.

I've tried various things but can't seem to get the resulting dataframe to show the rounded values. Any pointers would be greatly appreciated, thanks!

asked Nov 01 '17 by dave

People also ask

How do you use the round function in python PySpark?

PySpark's round function rounds the values of a column to a given number of decimal places. Its result can be used to create new columns in a DataFrame, and the related ceil and floor functions are available when you need to always round up or down.
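
For instance, a minimal sketch of func.round (assuming an active SparkSession named spark and toy values of my own choosing):

import pyspark.sql.functions as func

df = spark.createDataFrame([(3.45631,), (2.82945,)], ["new_bid"])

# round to 2 decimal places and expose the result as "bid"
df.select(func.round(func.col("new_bid"), 2).alias("bid")).show()
# +----+
# | bid|
# +----+
# |3.46|
# |2.83|
# +----+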

How do you set decimal places in PySpark?

You can use format_number to format a number to the desired number of decimal places, as stated in the official API documentation: it formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.
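
For example, a small sketch with format_number (again assuming a SparkSession named spark and illustrative values); note that it returns a string column:

from pyspark.sql.functions import col, format_number

df = spark.createDataFrame([(3.45631,), (9.87984,)], ["col3"])

# format col3 to 2 decimal places; the result is a string, not a numeric type
df.select(format_number(col("col3"), 2).alias("col3_fmt")).show()
# +--------+
# |col3_fmt|
# +--------+
# |    3.46|
# |    9.88|
# +--------+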

How do I round a number in SQL spark?

To always round down, you can use the floor() function: floor(x) gives the largest integer less than or equal to x, so floor(5.8) returns 5 while floor(-5.8) returns -6. To round to the nearest value instead, use round().
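
A short sketch of the difference between floor and round (hypothetical values, assuming a SparkSession named spark):

from pyspark.sql.functions import col, floor, round as spark_round

df = spark.createDataFrame([(5.8,), (-5.8,)], ["x"])

# floor always rounds toward negative infinity; round goes to the nearest value
df.select(col("x"),
          floor(col("x")).alias("floor_x"),
          spark_round(col("x"), 0).alias("round_x")).show()
# +----+-------+-------+
# |   x|floor_x|round_x|
# +----+-------+-------+
# | 5.8|      5|    6.0|
# |-5.8|     -6|   -6.0|
# +----+-------+-------+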


1 Answer

Here are a couple of ways to do it with some toy data:

spark.version
# u'2.2.0'

import pyspark.sql.functions as func

df = spark.createDataFrame(
        [(0.0, 0.2, 3.45631),
         (0.4, 1.4, 2.82945),
         (0.5, 1.9, 7.76261),
         (0.6, 0.9, 2.76790),
         (1.2, 1.0, 9.87984)],
         ["col1", "col2", "col3"])

df.show()
# +----+----+-------+ 
# |col1|col2|   col3|
# +----+----+-------+
# | 0.0| 0.2|3.45631| 
# | 0.4| 1.4|2.82945|
# | 0.5| 1.9|7.76261| 
# | 0.6| 0.9| 2.7679| 
# | 1.2| 1.0|9.87984| 
# +----+----+-------+

# round 'col3' in a new column:
df2 = df.withColumn("col4", func.round(df["col3"], 2)).withColumnRenamed("col4","new_col3")
df2.show()
# +----+----+-------+--------+ 
# |col1|col2|   col3|new_col3|
# +----+----+-------+--------+
# | 0.0| 0.2|3.45631|    3.46|
# | 0.4| 1.4|2.82945|    2.83|
# | 0.5| 1.9|7.76261|    7.76|
# | 0.6| 0.9| 2.7679|    2.77|
# | 1.2| 1.0|9.87984|    9.88|
# +----+----+-------+--------+

# round & replace existing 'col3':
df3 = df.withColumn("col3", func.round(df["col3"], 2))
df3.show()
# +----+----+----+ 
# |col1|col2|col3| 
# +----+----+----+ 
# | 0.0| 0.2|3.46| 
# | 0.4| 1.4|2.83| 
# | 0.5| 1.9|7.76| 
# | 0.6| 0.9|2.77| 
# | 1.2| 1.0|9.88| 
# +----+----+----+ 

It's a matter of personal taste, but I am not a great fan of either col or alias; I prefer withColumn and withColumnRenamed instead. Nevertheless, if you would like to stick with select and col, here is how you could adapt your own code snippet:

from pyspark.sql.functions import col

df4 = df.select(col("col1").alias("new_col1"), 
                col("col2").alias("new_col2"), 
                func.round(df["col3"], 2).alias("new_col3"))
df4.show()
# +--------+--------+--------+ 
# |new_col1|new_col2|new_col3| 
# +--------+--------+--------+
# |     0.0|     0.2|    3.46|
# |     0.4|     1.4|    2.83|
# |     0.5|     1.9|    7.76|
# |     0.6|     0.9|    2.77|
# |     1.2|     1.0|    9.88|
# +--------+--------+--------+

answered Sep 16 '22 by desertnaut