I have tested that neither logger nor print can print a message from inside a pandas_udf, in either cluster mode or client mode.
Test code:
import sys
import logging

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

logger = logging.getLogger('test')

spark = (SparkSession
         .builder
         .appName('test')
         .getOrCreate())

df = spark.createDataFrame(pd.DataFrame({
    'y': np.random.randint(1, 10, (20,)),
    'ds': np.random.randint(1000, 9999, (20,)),
    'store_id': ['a'] * 10 + ['b'] * 7 + ['q'] * 3,
    'product_id': ['c'] * 5 + ['d'] * 12 + ['e'] * 3,
}))
@pandas_udf('y int, ds int, store_id string, product_id string', PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    print('#' * 100)
    logger.info('$' * 100)
    logger.error('&' * 100)
    return pd.DataFrame([], columns=['y', 'ds', 'store_id', 'product_id'])

df1 = df.groupby(['store_id', 'product_id']).apply(train_predict)
# note: nothing executes until an action (e.g. df1.count()) is triggered
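One workaround that avoids logging altogether (a sketch of my own, reusing the imports and the df from the test code above; the msg column and the train_predict_debug name are illustrative, not part of the original test) is to carry diagnostic text back to the driver as ordinary data, by declaring an extra message column in the UDF's return schema:

# Sketch: smuggle log text back to the driver as an extra column.
# The returned DataFrame must match the declared schema, so 'msg'
# has to be declared there too.
@pandas_udf('store_id string, product_id string, msg string',
            PandasUDFType.GROUPED_MAP)
def train_predict_debug(pdf):
    msg = 'rows in group: %d' % len(pdf)
    return pd.DataFrame({
        'store_id': [pdf['store_id'].iloc[0]],
        'product_id': [pdf['product_id'].iloc[0]],
        'msg': [msg],
    })

df.groupby(['store_id', 'product_id']).apply(train_predict_debug).show(truncate=False)

Since the messages travel with the result rows, they show up wherever the data does, at the cost of changing the output schema.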
Also note:

log4jLogger = spark.sparkContext._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info('#' * 50)

You can't use this in a pandas_udf either, because the log4j logger hangs off the SparkContext, which lives in the driver JVM, and you can't reference the Spark session/context inside a UDF.
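It is worth spelling out why print and logger appear silent: the body of a pandas_udf runs in a Python worker on an executor, so anything it writes goes to that executor's stdout/stderr, which you only see in the executor logs (for example via the Spark UI), never on the driver console. Handlers configured on the driver are not shipped to the workers either, so the logger inside the UDF has no handler at all. Here is a sketch of configuring logging inside the UDF itself, reusing the script's imports; whether you can conveniently read the executor logs depends on your deployment:

@pandas_udf('y int, ds int, store_id string, product_id string', PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    # This code runs in the Python worker on the executor, so the
    # handler must be attached here, not on the driver.
    worker_logger = logging.getLogger('test')
    if not worker_logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
        worker_logger.addHandler(handler)
        worker_logger.setLevel(logging.INFO)
    worker_logger.info('&' * 100)  # lands in the executor's stderr log
    return pd.DataFrame([], columns=['y', 'ds', 'store_id', 'product_id'])

df.groupby(['store_id', 'product_id']).apply(train_predict).count()  # force execution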
The only way I know of is to raise an Exception, as in the answer I wrote below. But it is hacky and has a serious drawback. I want to know whether there is any way to simply print messages from inside a pandas_udf.
A pandas user-defined function (UDF), also known as a vectorized UDF, is a user-defined function that uses Apache Arrow to transfer data and pandas to work with the data. Pandas UDFs allow vectorized operations that can improve performance by up to 100x compared to row-at-a-time Python UDFs, and are defined by using pandas_udf as a decorator or to wrap the function; no additional configuration is required.
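For contrast with a row-at-a-time UDF, here is a minimal scalar pandas UDF in Spark 2.4 syntax; the toy plus_one function is my own illustration, not part of the question:

from pyspark.sql.functions import col, pandas_udf, PandasUDFType

# The function receives a whole Arrow batch as a pandas Series and
# returns a Series: one Python call per batch instead of one per row.
@pandas_udf('long', PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1

spark.range(5).select(plus_one(col('id'))).show()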
I have tried every approach I could find in Spark 2.4. Without logs it is hard to debug a faulty pandas_udf. The only workable way I know to get a message out of a pandas_udf is to raise an Exception, so debugging this way costs a lot of time, but I don't know a better one.
@pandas_udf('y int, ds int, store_id string, product_id string', PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    print('#' * 100)
    logger.info('$' * 100)
    logger.error('&' * 100)
    raise Exception('@' * 100)  # the only way I know to print a message, but it breaks execution
    return pd.DataFrame([], columns=['y', 'ds', 'store_id', 'product_id'])
The drawback is that you can't keep Spark running after the message is printed: raising the exception aborts the job.
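If you do fall back on the Exception trick, one refinement (my own sketch, not from the original answer) is to collect messages in a list and only raise on a real failure, attaching the collected messages and the full traceback to the exception, so a single failed run carries all the context:

import traceback

@pandas_udf('y int, ds int, store_id string, product_id string', PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    messages = []
    try:
        messages.append('group %s/%s: %d rows'
                        % (pdf['store_id'].iloc[0], pdf['product_id'].iloc[0], len(pdf)))
        # ... the real training/prediction code would go here ...
        return pd.DataFrame([], columns=['y', 'ds', 'store_id', 'product_id'])
    except Exception:
        # Re-raise with everything collected so far plus the original
        # traceback; one failure then reports full context, though the
        # job still aborts.
        raise Exception('\n'.join(messages) + '\n' + traceback.format_exc())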