I would like to display the entire contents of an Apache Spark SQL DataFrame using the Scala API. I can use the show() method:

myDataFrame.show(Int.MaxValue)

Is there a better way to display an entire DataFrame than passing Int.MaxValue?
Spark DataFrame show() displays the contents of a DataFrame in a row-and-column table format. By default, whether you use Scala, Java, or Python (PySpark), it fetches only the first 20 rows and truncates column values at 20 characters. To display more than 20 rows, or full column values, you need to pass arguments to show().
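As a minimal sketch of passing those arguments, assuming a local SparkSession (the session setup and the column names here are illustrative, not from the original question):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; in a real job `spark` would already exist.
val spark = SparkSession.builder().master("local[*]").appName("show-demo").getOrCreate()
import spark.implicits._

// 30 rows, so the default show() would cut the output off at 20.
val myDataFrame = Seq.tabulate(30)(i => (i, s"some-longer-value-number-$i")).toDF("id", "text")

myDataFrame.show(30)                    // up to 30 rows, values still truncated at 20 characters
myDataFrame.show(30, truncate = false)  // up to 30 rows, full column values
```

The second call uses the show(numRows: Int, truncate: Boolean) overload, which prints the complete cell contents at the cost of wider output lines.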
It is generally not advisable to display an entire DataFrame to stdout, because doing so pulls the entire DataFrame (all of its values) to the driver (unless the DataFrame is already local, which you can check with df.isLocal).

Unless you know ahead of time that your dataset is small enough for the driver JVM process to hold all of its values in memory, this is not safe. That is why the DataFrame API's show() displays only the first 20 rows by default.
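One defensive pattern, sketched here with an arbitrary threshold (maxSafeRows is an illustrative name, not a Spark API), is to gate the full display on the DataFrame being local or known to be small:

```scala
// Sketch: only print everything when the DataFrame is known to be small.
// `maxSafeRows` is an arbitrary illustrative threshold; note that
// df.count() itself triggers a full Spark job.
val maxSafeRows = 10000L

if (df.isLocal || df.count() <= maxSafeRows)
  df.show(df.count().toInt, truncate = false) // safe-ish: full contents
else
  df.show() // fall back to the default first 20 rows
```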
You could use df.collect, which returns Array[Row], and then iterate over each row and print it:

df.collect.foreach(println)

but you lose all the formatting implemented in df.showString(numRows: Int) (which show() uses internally).
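If the concern is driver memory rather than formatting, a related sketch uses df.toLocalIterator, which streams the rows to the driver one partition at a time instead of materializing the whole array; it still loses show()'s table formatting:

```scala
import scala.collection.JavaConverters._

// Sketch: print every row while holding at most one partition's worth
// of data in driver memory at a time.
df.toLocalIterator().asScala.foreach(println)
```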
So no, I guess there is no better way.