Improve PySpark DataFrame.show output to fit Jupyter notebook

Using PySpark in a Jupyter notebook, the output of Spark's DataFrame.show is low-tech compared to how Pandas DataFrames are displayed. I thought "Well, it does the job", until I got this:

[Screenshot: DataFrame.show() output whose lines wrap across the notebook cell]

The output is not adjusted to the width of the notebook, so that the lines wrap in an ugly way. Is there a way to customize this? Even better, is there a way to get output Pandas-style (without converting to pandas.DataFrame obviously)?

asked May 25 '18 by clstaudt


People also ask

How do I show full output in Jupyter?

To show the full data without anything hidden, raise pandas' display limits, for example with pd.set_option('display.max_rows', 500) and pd.set_option('display.max_columns', 500).
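As a sketch, the usual pandas options look like this (the specific limits are arbitrary; pick values that suit your notebook):

import pandas as pd

pd.set_option('display.max_rows', 500)     # show up to 500 rows before collapsing
pd.set_option('display.max_columns', 500)  # show up to 500 columns
pd.set_option('display.width', 1000)       # let the rendered table use more horizontal space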

How do you show full column content in a PySpark Dataframe?

Use the show() function with its truncate parameter. show() prints the DataFrame; n sets the number of rows to display; truncate controls whether long values are cut off. By default truncate is True (values are cut at 20 characters), so set truncate=False to display the full column content.
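For example (df is any PySpark DataFrame; in recent Spark versions truncate also accepts an integer character limit):

df.show(n=10, truncate=False)  # print 10 rows with full column content
df.show(truncate=50)           # or cut each value at 50 characters instead of 20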

How do you get the size of a Dataframe in PySpark?

PySpark DataFrames have no shape attribute; to obtain the shape, get the number of rows with df.count() and the number of columns with len(df.columns).
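A pandas-style shape, as a sketch:

rows = df.count()        # triggers a Spark job to count rows across the cluster
cols = len(df.columns)   # reads schema metadata only, no job is triggered
print((rows, cols))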


2 Answers

This is now possible natively as of Spark 2.4.0 by setting spark.sql.repl.eagerEval.enabled to True: DataFrames are then evaluated eagerly and rendered as HTML tables in Jupyter, much like pandas:

[Screenshot: the DataFrame rendered as an HTML table in the notebook with eager evaluation enabled]
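A minimal sketch of turning this on when creating the session (the eagerEval config keys are real Spark settings; the sample data is made up for illustration):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('eager-eval-demo')
    .config('spark.sql.repl.eagerEval.enabled', True)   # render DataFrames as HTML tables (Spark >= 2.4.0)
    .config('spark.sql.repl.eagerEval.maxNumRows', 20)  # rows to render eagerly
    .config('spark.sql.repl.eagerEval.truncate', 100)   # character limit per cell
    .getOrCreate()
)

df = spark.createDataFrame([(1, 'alpha'), (2, 'beta')], ['id', 'name'])
df  # evaluating the bare expression in a Jupyter cell now shows an HTML table

On an existing session you can also set it with spark.conf.set('spark.sql.repl.eagerEval.enabled', True).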

answered Sep 30 '22 by Kyle Barron


After playing around with my table, which has a lot of columns, I decided the best way to get a feel for the data is to use:

df.show(n=5, truncate=False, vertical=True) 

This prints each row vertically without truncation, which is the cleanest view I can come up with.
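For reference, vertical mode prints each row as a block of column/value pairs, roughly like this (the column names and values here are hypothetical):

-RECORD 0----------
 id   | 1
 name | alpha
-RECORD 1----------
 id   | 2
 name | beta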

answered Sep 30 '22 by user1761806