How do you set the display precision in PySpark when calling .show()?
Consider the following example:
from math import sqrt
import pyspark.sql.functions as f
data = zip(
    map(lambda x: sqrt(x), range(100, 105)),
    map(lambda x: sqrt(x), range(200, 205))
)
df = sqlCtx.createDataFrame(data, ["col1", "col2"])
df.select([f.avg(c).alias(c) for c in df.columns]).show()
Which outputs:
#+------------------+------------------+
#| col1| col2|
#+------------------+------------------+
#|10.099262230352151|14.212583322380274|
#+------------------+------------------+
How can I change it so that it only displays 3 digits after the decimal point?
Desired output:
#+------+------+
#| col1| col2|
#+------+------+
#|10.099|14.213|
#+------+------+
This is a PySpark version of this Scala question. I'm posting it here because I could not find an answer when searching for PySpark solutions, and I think it may be helpful to others in the future.
The easiest option is to use pyspark.sql.functions.round():
from pyspark.sql.functions import avg, round
df.select([round(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#| col1| col2|
#+------+------+
#|10.099|14.213|
#+------+------+
This will maintain the values as numeric types.
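For example, inspecting the schema of the rounded result (a quick sanity check using the same df as above) shows the columns are still doubles:
from pyspark.sql.functions import avg, round
df.select([round(avg(c), 3).alias(c) for c in df.columns]).printSchema()
#root
# |-- col1: double (nullable = true)
# |-- col2: double (nullable = true)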
The functions are the same for Scala and Python; the only difference is the import.
You can use format_number to format a number to the desired number of decimal places, as stated in the official API documentation:
Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.
from pyspark.sql.functions import avg, format_number
df.select([format_number(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#| col1| col2|
#+------+------+
#|10.099|14.213|
#+------+------+
The transformed columns would be of StringType, and a comma is used as a thousands separator:
#+-----------+--------------+
#| col1| col2|
#+-----------+--------------+
#|500,100.000|50,489,590.000|
#+-----------+--------------+
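If you want to verify the types, checking dtypes on the formatted result (a small sketch, again assuming the same df as above) shows string columns:
from pyspark.sql.functions import avg, format_number
df.select([format_number(avg(c), 3).alias(c) for c in df.columns]).dtypes
#[('col1', 'string'), ('col2', 'string')]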
As stated in the Scala version of this answer, you can use regexp_replace to replace the , with any string you want. From the docs:
Replace all substrings of the specified string value that match regexp with rep.
from pyspark.sql.functions import avg, format_number, regexp_replace
df.select(
    [regexp_replace(format_number(avg(c), 3), ",", "").alias(c) for c in df.columns]
).show()
#+----------+------------+
#| col1| col2|
#+----------+------------+
#|500100.000|50489590.000|
#+----------+------------+
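If you need numeric values again at the end, one further option (not covered in the quoted answer, so treat it as a sketch) is to cast the cleaned string back to a double; note that casting gives up the fixed three-decimal display, at which point round() above is the simpler choice:
from pyspark.sql.functions import avg, format_number, regexp_replace
df.select(
    # the .cast("double") step is the addition here; everything else matches the answer above
    [regexp_replace(format_number(avg(c), 3), ",", "").cast("double").alias(c)
     for c in df.columns]
).show()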