I have a DataFrame in PySpark like the one below.
import pyspark.sql.functions as func

df = sqlContext.createDataFrame(
    [(0.0, 0.2, 3.45631),
     (0.4, 1.4, 2.82945),
     (0.5, 1.9, 7.76261),
     (0.6, 0.9, 2.76790),
     (1.2, 1.0, 9.87984)],
    ["col1", "col2", "col3"])
df.show()
+----+----+-------+
|col1|col2| col3|
+----+----+-------+
| 0.0| 0.2|3.45631|
| 0.4| 1.4|2.82945|
| 0.5| 1.9|7.76261|
| 0.6| 0.9| 2.7679|
| 1.2| 1.0|9.87984|
+----+----+-------+
# round 'col3' in a new column:
df2 = df.withColumn("col4", func.round(df["col3"], 2))
df2.show()
+----+----+-------+----+
|col1|col2| col3|col4|
+----+----+-------+----+
| 0.0| 0.2|3.45631|3.46|
| 0.4| 1.4|2.82945|2.83|
| 0.5| 1.9|7.76261|7.76|
| 0.6| 0.9| 2.7679|2.77|
| 1.2| 1.0|9.87984|9.88|
+----+----+-------+----+
In the above data frame, col4 is a double. Now I want to convert col4 to an integer:
df2 = df.withColumn("col4", func.round(df["col3"], 2).cast('integer'))
df2.show()
+----+----+-------+----+
|col1|col2| col3|col4|
+----+----+-------+----+
| 0.0| 0.2|3.45631| 3|
| 0.4| 1.4|2.82945| 2|
| 0.5| 1.9|7.76261| 7|
| 0.6| 0.9| 2.7679| 2|
| 1.2| 1.0|9.87984| 9|
+----+----+-------+----+
But I want to round the col4 values to the nearest integer instead. Expected result:
+----+----+-------+----+
|col1|col2| col3|col4|
+----+----+-------+----+
| 0.0| 0.2|3.45631| 3|
| 0.4| 1.4|2.82945| 3|
| 0.5| 1.9|7.76261| 8|
| 0.6| 0.9| 2.7679| 3|
| 1.2| 1.0|9.87984| 10|
+----+----+-------+----+
How can I do that?
You should use the round function and then cast to integer type. However, do not pass a second argument to round: with 2 there, the value is first rounded to 2 decimal places, and the subsequent cast to integer simply drops the decimal part (truncating toward zero) instead of rounding to the nearest whole number.
Instead use:
df2 = df.withColumn("col4", func.round(df["col3"]).cast('integer'))
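Without the scale argument, round rounds to the nearest whole number (Spark's round uses HALF_UP rounding), so on the sample frame above this gives exactly the result you expected:
df2.show()
+----+----+-------+----+
|col1|col2|   col3|col4|
+----+----+-------+----+
| 0.0| 0.2|3.45631|   3|
| 0.4| 1.4|2.82945|   3|
| 0.5| 1.9|7.76261|   8|
| 0.6| 0.9| 2.7679|   3|
| 1.2| 1.0|9.87984|  10|
+----+----+-------+----+
You can confirm the column type with df2.printSchema(), which should show col4 as integer after the cast.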