
Saving result of DataFrame show() to string in pyspark

I would like to capture the output of show() in PySpark as a string, similar to here and here. I was only able to find a solution for Scala, not for PySpark.

df.show()
#+----+-------+
#| age|   name|
#+----+-------+
#|null|Michael|
#|  30|   Andy|
#|  19| Justin|
#+----+-------+

The ultimate goal is to capture this output as a string and pass it to logger.info. I tried logger.info(df.show()), but that only prints the table to the console.

Kenny asked Apr 12 '19 14:04



1 Answer

You can build a helper function using the same approach as in the post you linked, Capturing the result of explain() in pyspark. If you examine the source code for show(), you will see that it calls self._jdf.showString().

The answer depends on which version of Spark you are using, as the number of arguments to show() has changed over time.

Spark Versions 2.3 and above

In version 2.3, the vertical argument was added.

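def getShowString(df, n=20, truncate=True, vertical=False):
    # truncate=True means the default width of 20 characters;
    # any other value is cast to int (False becomes 0).
    if isinstance(truncate, bool) and truncate:
        return df._jdf.showString(n, 20, vertical)
    else:
        return df._jdf.showString(n, int(truncate), vertical)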

Spark Versions 1.5 through 2.2

The truncate argument was added in version 1.5.

def getShowString(df, n=20, truncate=True):
    # Same truncate handling as above, without the vertical argument.
    if isinstance(truncate, bool) and truncate:
        return df._jdf.showString(n, 20)
    else:
        return df._jdf.showString(n, int(truncate))

Spark Versions 1.3 through 1.4

The show function was first introduced in version 1.3.

def getShowString(df, n=20):
    return df._jdf.showString(n)
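
If you need one helper that runs against several Spark versions, the three variants above could be folded together roughly like this. This is only a sketch: show_string_args and get_show_string are names I made up for illustration, the version cut-offs are simply the ones listed above, and you have to pass the version string in yourself (for example from spark.version on an existing session).

```python
def show_string_args(spark_version, n=20, truncate=True, vertical=False):
    """Build the positional arguments for df._jdf.showString().

    Hypothetical helper that mirrors the three version-specific functions
    above; spark_version is a string such as "2.4.0".
    """
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    # truncate=True means the default width of 20 characters; any other
    # value is cast to int, so False becomes 0.
    if isinstance(truncate, bool) and truncate:
        width = 20
    else:
        width = int(truncate)
    if (major, minor) >= (2, 3):
        return (n, width, vertical)
    if (major, minor) >= (1, 5):
        return (n, width)
    return (n,)


def get_show_string(df, spark_version, **kwargs):
    """Call showString() with the argument list that fits spark_version."""
    return df._jdf.showString(*show_string_args(spark_version, **kwargs))
```

Keeping the argument selection in its own function means it can be checked without a running Spark session. Note that _jdf is a private attribute, so any of these helpers may need adjusting for versions newer than the ones covered here.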

Now use the helper function as follows:

x = getShowString(df)  # default arguments
print(x)
#+----+-------+
#| age|   name|
#+----+-------+
#|null|Michael|
#|  30|   Andy|
#|  19| Justin|
#+----+-------+

Or in your case:

logger.info(getShowString(df))
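
If you would rather not depend on the private _jdf attribute or on the Spark version, another option is to capture whatever show() prints. The sketch below assumes that show() writes through Python's sys.stdout, which holds when the Python wrapper simply prints the string returned by showString(); capture_show is a hypothetical name, not part of PySpark.

```python
import io
import logging
from contextlib import redirect_stdout


def capture_show(df, *args, **kwargs):
    """Run df.show(*args, **kwargs) and return what it printed as a string."""
    buffer = io.StringIO()
    # Temporarily point sys.stdout at an in-memory buffer so the table
    # ends up in the buffer instead of on the console.
    with redirect_stdout(buffer):
        df.show(*args, **kwargs)
    return buffer.getvalue()


logger = logging.getLogger(__name__)

# Usage, assuming `df` is an existing DataFrame:
# logger.info("\n%s", capture_show(df, truncate=False))
```

Arguments are passed straight through to show(), so this works with whatever signature your Spark version supports. The leading newline in the log message just keeps the table borders aligned below the logger's prefix.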
pault answered Oct 17 '22 01:10