
PySpark Logging?

I want my Spark driver program, written in Python, to output some basic logging information. There are three ways I can see to do this:

  1. Using the PySpark py4j bridge to get access to the Java log4j logging facility used by Spark:

```python
log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("pyspark script logger initialized")
```

  2. Just use standard console print.

  3. The `logging` Python standard library module. This seems ideal and the most Pythonic approach; however, at least out of the box, it doesn't work, and logged messages don't seem to be recoverable. Of course, it can be configured to log to py4j->log4j and/or to the console.
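For what it's worth, option 3 can be made to work on the driver side with a little explicit configuration. This is a minimal sketch, not a Spark-documented recipe; the function name, format string, and handler choice are my own assumptions:

```python
import logging
import sys

def get_driver_logger(name="pyspark-driver"):
    """Configure a stdlib logger so driver-side messages reach the console."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking duplicate handlers on re-import
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

logger = get_driver_logger()
logger.info("driver logging configured")
```

Note this only covers code running in the driver process; log calls made inside executor tasks land in the executor logs (or nowhere), which is the harder part of the problem.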

So, the official programming guide (https://spark.apache.org/docs/1.6.1/programming-guide.html) doesn't mention logging at all. That's disappointing; there should be a standard, documented, recommended way to log from a Spark driver program.

I searched for this issue and found this: How do I log from my Python Spark script

But the contents of that thread were unsatisfactory.

Specifically, I have the following questions:

  • Am I missing a standard way to log from a PySpark driver program?
  • Are there any pros/cons to logging to py4j->log4j vs console?
asked May 18 '16 by clay



1 Answer

A cleaner solution is to use the standard Python logging module with a custom distributed handler that collects log messages from all nodes of the Spark cluster.

See the "Logging in PySpark" section of this Gist.
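To make the idea concrete: one common way to build such a distributed handler is the stdlib `SocketHandler` pattern, where executors forward pickled `LogRecord`s over TCP to a listener on the driver. The sketch below is illustrative, not the Gist's exact code; the class and function names, the port, and routing everything to a fixed collector logger are all my assumptions:

```python
import logging
import logging.handlers
import pickle
import socketserver
import struct
import threading

class LogRecordHandler(socketserver.StreamRequestHandler):
    """Receive length-prefixed pickled LogRecords (SocketHandler wire format)."""
    def handle(self):
        while True:
            header = self.rfile.read(4)
            if len(header) < 4:
                break
            length = struct.unpack(">L", header)[0]
            record = logging.makeLogRecord(pickle.loads(self.rfile.read(length)))
            # Route everything to one collector logger on the driver.
            logging.getLogger("spark-driver-collector").handle(record)

def start_listener(host="0.0.0.0", port=9020):
    """Run on the driver: listen for log records from executors."""
    server = socketserver.ThreadingTCPServer((host, port), LogRecordHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def worker_logger(driver_host, port=9020, name="pyspark-worker"):
    """Call inside executor tasks: logger that forwards records to the driver."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        logger.addHandler(logging.handlers.SocketHandler(driver_host, port))
        logger.setLevel(logging.INFO)
    return logger
```

Usage would be roughly: call `start_listener()` in the driver before submitting jobs, then obtain `worker_logger(driver_host)` inside `mapPartitions` or similar closures. The driver host must be reachable from the executors, and in production you would want reconnect handling and security around the unpickling step.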

answered Sep 22 '22 by user1944010