
PySpark print to console

When running a PySpark job on a Dataproc cluster like this:

gcloud --project <project_name> dataproc jobs submit pyspark --cluster <cluster_name> <python_script>

my print statements don't show up in my terminal.

Is there any way to output data onto the terminal in PySpark when running jobs on the cloud?

Edit: I would like to print/log info from within my transformation. For example:

def print_funct(l):
    print(l)    # print the element, then pass it through unchanged
    return l

# apply to every element and force evaluation with collect()
rddData.map(lambda l: print_funct(l)).collect()

This should print every line of data in the RDD rddData.

Doing some digging, I found this answer for logging; however, testing it gives me the results of this question, whose answer states that logging isn't possible within the transformation.

Roman asked May 24 '16 07:05




1 Answer

Printing or logging inside of a transform will end up in the Spark executor logs, which can be accessed through your application's ApplicationMaster or HistoryServer via the YARN ResourceManager web UI.
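If you prefer the command line, a hedged alternative is to pull the aggregated executor logs on the master node with the standard YARN CLI; this only works if YARN log aggregation is enabled on the cluster, and the application ID below is a placeholder:

# fetch aggregated container logs for a finished application (placeholder ID)
yarn logs -applicationId application_1464070000000_0001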

You could alternatively collect the information you are printing alongside your output (e.g. in a dict or tuple). You could also stash it away in an accumulator and then print it from the driver.
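For instance, here is a minimal sketch of the accumulator approach; the sample data, names, and the local SparkContext setup are illustrative placeholders, not part of the original post:

from pyspark import SparkContext

# hypothetical local context; a submitted Dataproc job would use its existing SparkContext
sc = SparkContext.getOrCreate()

# placeholder data standing in for rddData from the question
rddData = sc.parallelize(["spark", "print", "debug"])

# accumulator that executors add to and the driver reads after an action
line_count = sc.accumulator(0)

def tag_with_length(l):
    line_count.add(1)     # updated on the executors during the transformation
    return (l, len(l))    # carry the info you wanted to print alongside the output

result = rddData.map(tag_with_length).collect()

# these run on the driver, so they do appear in your terminal
print(result)
print("elements seen:", line_count.value)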

If you are doing a lot of print-statement debugging, you might find it faster to SSH into your master node and use the pyspark REPL or IPython to experiment with your code. This would also let you pass the --master local flag, which makes your print statements appear in stdout.
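As a rough sketch of that workflow (the -m suffix is the usual Dataproc naming convention for the master VM; the project and zone values are placeholders):

# SSH into the Dataproc master node (master VMs are typically named <cluster_name>-m)
gcloud compute ssh <cluster_name>-m --project <project_name> --zone <zone>

# on the master node, start a local-mode REPL so print() output appears in your terminal
pyspark --master local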

Patrick Clay answered Oct 16 '22 11:10