I am using Spark 2 and Scala 2.11 in a Zeppelin 0.7 notebook. I have a dataframe that I can print like this:
dfLemma.select("text", "lemma").show(20,false)
and the output looks like:
+---------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |text |lemma | +---------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |RT @Dope_Promo: When you and your crew beat your high scores on FUGLY FROG 😍🔥 https://time.com/Sxp3Onz1w8 |[rt, @dope_promo, :, when, you, and, you, crew, beat, you, high, score, on, FUGLY, FROG, https://time.com/sxp3onz1w8] | |RT @axolROSE: Did yall just call Kermit the frog a lizard? https://time.com/wDAEAEr1Ay |[rt, @axolrose, :, do, yall, just, call, Kermit, the, frog, a, lizard, ?, https://time.com/wdaeaer1ay] |
I am trying to make the output nicer in Zeppelin, by:
val printcols= dfLemma.select("text", "lemma") println("%table " + printcols)
which gives this output:
printcols: org.apache.spark.sql.DataFrame = [text: string, lemma: array<string>]
and a new blank Zeppelin paragraph headed
[text: string, lemma: array]
Is there a way of getting the dataframe to show as a nicely formatted table? TIA!
Similar to Python Pandas you can get the Size and Shape of the PySpark (Spark with Python) DataFrame by running count() action to get the number of rows on DataFrame and len(df. columns()) to get the number of columns.
You can visualize a Spark dataframe in Jupyter notebooks by using the display(<dataframe-name>) function. The display() function is supported only on PySpark kernels. The Qviz framework supports 1000 rows and 100 columns. By default, the dataframe is visualized as a table.
The where() filter can be used on array collection column using array_contains(), Spark SQL function that checks if the array contains a value if present it returns true else false. The filter condition is applied on the dataframe consist of nested struct columns to filter the rows based on a nested column.
In Spark, DataFrames are the distributed collections of data, organized into rows and columns. Each column in a DataFrame has a name and an associated type. DataFrames are similar to traditional database tables, which are structured and concise.
In Zeppelin you can use z.show(df)
to show a pretty table. Here's an example:
val df = Seq( (1,1,1), (2,2,2), (3,3,3) ).toDF("first_column", "second_column", "third_column") z.show(df)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With