
Pyspark: display a spark data frame in a table format

I am using pyspark to read a parquet file like below:

my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**') 

Then when I do my_df.take(5), it shows [Row(...)] instead of a table format like a pandas DataFrame.

Is it possible to display the data frame in a table format like a pandas data frame? Thanks!

asked Aug 21 '16 by Edamame

People also ask

How do you show the DataFrame in PySpark?

Spark DataFrame show() is used to display the contents of the DataFrame in a table (row-and-column) format. By default, it shows only 20 rows, and column values are truncated at 20 characters.

How do I Preview Spark data frame?

You can visualize a Spark dataframe in Jupyter notebooks by using the display(<dataframe-name>) function. The display() function is supported only on PySpark kernels. The Qviz framework supports 1000 rows and 100 columns. By default, the dataframe is visualized as a table.


2 Answers

The show method does what you're looking for.

For example, given the following dataframe of 3 rows, I can print just the first two rows like this:

df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
df.show(n=2)

which yields:

+---+---+
|  k|  v|
+---+---+
|foo|  1|
|bar|  2|
+---+---+
only showing top 2 rows
answered Sep 21 '22 by eddies


As mentioned by @Brent in a comment on @maxymoo's answer, you can try

df.limit(10).toPandas() 

to get a prettier table in Jupyter. But this can take some time to run if the Spark DataFrame is not cached. Also, .limit() will not preserve the order of the original Spark DataFrame.

answered Sep 23 '22 by Louis Yang