Having some trouble getting the round function in PySpark to work - I have the below block of code, where I'm trying to round the new_bid column to 2 decimal places and rename the column as bid afterwards. I'm importing pyspark.sql.functions as func for reference, and using the round function contained within it:
output = output.select(col("ad").alias("ad_id"),
col("part").alias("part_id"),
func.round(col("new_bid"), 2).alias("bid"))
The new_bid column here is of type float - the resulting dataframe does not have the newly named bid column rounded to 2 decimal places as I am trying to do; rather, it is still 8 or 9 decimal places out.
I've tried various things but can't seem to get the resulting dataframe to have the rounded value - any pointers would be greatly appreciated! Thanks!
PySpark's round function rounds a column's values to a given number of decimal places; ties are rounded half up rather than the value simply being pushed up with ceil or down with floor. Because it returns a column expression, the result can be used to create new columns in the DataFrame.
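For instance, a quick check of that half-up behaviour on a toy DataFrame (the values below are made up purely for illustration):
import pyspark.sql.functions as func
df = spark.createDataFrame([(2.5,), (2.4,)], ["x"])
df.select(func.round("x", 0).alias("rounded")).show()
# +-------+
# |rounded|
# +-------+
# |    3.0|
# |    2.0|
# +-------+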
Alternatively, you can use format_number to format a number to the desired number of decimal places, as stated in the official API documentation: it formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.
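For example, applied to the column from the question (new_bid is assumed to already exist in output; note that the result is a string column, not a numeric one):
import pyspark.sql.functions as func
# format_number rounds to 2 decimal places and returns a string such as '3.46'
# (large values get thousands separators, e.g. '1,234.56')
output = output.withColumn("bid", func.format_number("new_bid", 2))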
You can also use the floor() function, which rounds its argument down. Note that floor(x) gives the largest integer ≤ x, so floor(5.8) returns 5, but floor(-5.8) returns -6.
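If you really do want to round down rather than to the nearest value, here is a minimal sketch with made-up values:
import pyspark.sql.functions as func
df = spark.createDataFrame([(5.8,), (-5.8,)], ["x"])
df.select(func.floor("x").alias("floored")).show()
# +-------+
# |floored|
# +-------+
# |      5|
# |     -6|
# +-------+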
Here are a couple of ways to do it with some toy data:
spark.version
# u'2.2.0'
import pyspark.sql.functions as func
df = spark.createDataFrame(
    [(0.0, 0.2, 3.45631),
     (0.4, 1.4, 2.82945),
     (0.5, 1.9, 7.76261),
     (0.6, 0.9, 2.76790),
     (1.2, 1.0, 9.87984)],
    ["col1", "col2", "col3"])
df.show()
# +----+----+-------+
# |col1|col2| col3|
# +----+----+-------+
# | 0.0| 0.2|3.45631|
# | 0.4| 1.4|2.82945|
# | 0.5| 1.9|7.76261|
# | 0.6| 0.9| 2.7679|
# | 1.2| 1.0|9.87984|
# +----+----+-------+
# round 'col3' in a new column:
df2 = df.withColumn("col4", func.round(df["col3"], 2)).withColumnRenamed("col4","new_col3")
df2.show()
# +----+----+-------+--------+
# |col1|col2| col3|new_col3|
# +----+----+-------+--------+
# | 0.0| 0.2|3.45631| 3.46|
# | 0.4| 1.4|2.82945| 2.83|
# | 0.5| 1.9|7.76261| 7.76|
# | 0.6| 0.9| 2.7679| 2.77|
# | 1.2| 1.0|9.87984| 9.88|
# +----+----+-------+--------+
# round & replace existing 'col3':
df3 = df.withColumn("col3", func.round(df["col3"], 2))
df3.show()
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | 0.0| 0.2|3.46|
# | 0.4| 1.4|2.83|
# | 0.5| 1.9|7.76|
# | 0.6| 0.9|2.77|
# | 1.2| 1.0|9.88|
# +----+----+----+
It's a matter of personal taste, but I am not a great fan of either col or alias - I prefer withColumn and withColumnRenamed instead. Nevertheless, if you would like to stick with select and col, here is how you should adapt your own code snippet:
from pyspark.sql.functions import col
df4 = df.select(col("col1").alias("new_col1"),
col("col2").alias("new_col2"),
func.round(df["col3"],2).alias("new_col3"))
df4.show()
# +--------+--------+--------+
# |new_col1|new_col2|new_col3|
# +--------+--------+--------+
# | 0.0| 0.2| 3.46|
# | 0.4| 1.4| 2.83|
# | 0.5| 1.9| 7.76|
# | 0.6| 0.9| 2.77|
# | 1.2| 1.0| 9.88|
# +--------+--------+--------+