 

How do I log from my Python Spark script

I have a Python Spark program which I run with spark-submit. I want to put logging statements in it.

logging.info("This is an informative message.")
logging.debug("This is a debug message.")

I want to use the same logger that Spark is using so that the log messages come out in the same format and the level is controlled by the same configuration files. How do I do this?

I've tried putting the logging statements in the code and starting out with a logging.getLogger(). In both cases I see Spark's log messages but not mine. I've been looking at the Python logging documentation, but haven't been able to figure it out from there.

Not sure if this is something specific to scripts submitted to Spark or just me not understanding how logging works.

asked Aug 20 '14 by W.P. McNeill

2 Answers

You can get the logger from the SparkContext object:

log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("pyspark script logger initialized")
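For context, a self-contained sketch of how this might look in a script run with spark-submit (the app name is illustrative, and note that sc._jvm is a private PySpark attribute, so this relies on internals):

from pyspark import SparkContext

sc = SparkContext(appName="logging-example")  # illustrative app name

# Reach into the JVM gateway and grab Spark's own log4j logger; messages
# logged this way use Spark's format and respect its log4j configuration.
log4j = sc._jvm.org.apache.log4j
logger = log4j.LogManager.getLogger(__name__)

logger.info("pyspark script logger initialized")
logger.warn("this message is emitted by Spark's log4j, not Python's logging")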
answered Oct 29 '22 by Alex Q


You need to get the logger for Spark itself; by default, getLogger() returns the logger for your own module. Try something like:

logger = logging.getLogger('py4j')
logger.info("My test info statement")

It might also be 'pyspark' instead of 'py4j'.
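For example, a minimal sketch (assuming Spark routes its Python-side messages through the 'py4j' logger, as described above):

import logging

# Basic configuration so messages actually reach a handler; in a real job
# Spark's own configuration may already have set this up.
logging.basicConfig(level=logging.INFO)

logger = logging.getLogger('py4j')  # try 'pyspark' if 'py4j' shows nothing
logger.info("My test info statement")
logger.debug("Only visible if the level is lowered to DEBUG")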

If the function you use in your Spark program (the one that does the logging) is defined in the same module as the main function, you will get a serialization error.

This is explained here, and an example by the same person is given here.

I also tested this on Spark 1.3.1.
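To illustrate the separate-module workaround, a hypothetical sketch (the file names my_udfs.py and driver.py and the double_and_log function are made up):

# my_udfs.py -- helper module kept separate from the script you spark-submit
import logging

logger = logging.getLogger('py4j')

def double_and_log(x):
    logger.info("processing %s", x)
    return x * 2

# driver.py -- the script passed to spark-submit
from pyspark import SparkContext
from my_udfs import double_and_log

sc = SparkContext(appName="logging-example")
print(sc.parallelize([1, 2, 3]).map(double_and_log).collect())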

EDIT:

To change logging from STDERR to STDOUT you will have to remove the current StreamHandler and add a new one.

Find the existing StreamHandler (this line can be removed when finished):

print(logger.handlers) # will look like [<logging.StreamHandler object at 0x7fd8f4b00208>] 

There will probably be only one, but if not you will have to adjust the index accordingly.

logger.removeHandler(logger.handlers[0]) 

Add new handler for sys.stdout

import sys  # put at top if not already there
sh = logging.StreamHandler(sys.stdout)
sh.setLevel(logging.DEBUG)
logger.addHandler(sh)
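Put together, the whole swap might look like this (still assuming the 'py4j' logger from above):

import logging
import sys

logger = logging.getLogger('py4j')

# Drop whatever handlers are currently attached (usually a single
# StreamHandler writing to stderr).
for handler in list(logger.handlers):
    logger.removeHandler(handler)

# Attach a new handler that writes to stdout instead.
sh = logging.StreamHandler(sys.stdout)
sh.setLevel(logging.DEBUG)
logger.addHandler(sh)

logger.info("now logging to stdout")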
answered Oct 29 '22 by CasualDemon