 

View RDD contents in Python Spark?

I'm running a simple app in PySpark.

f = sc.textFile("README.md")
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)

I want to view the RDD contents using the foreach action:

wc.foreach(print) 

This throws a syntax error:

SyntaxError: invalid syntax 

What am I missing?

lmart999 asked Aug 13 '14 20:08





1 Answer

This error occurs because print is a statement, not a function, in Python 2.6, so it can't be passed as an argument to foreach.

You can either define a helper function that performs the print, or import print_function from the __future__ module so that print is treated as a function:

>>> from operator import add
>>> f = sc.textFile("README.md")
>>> wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
>>> def g(x):
...     print x
...
>>> wc.foreach(g)

or

>>> from __future__ import print_function
>>> wc.foreach(print)

However, I think it would be better to use collect() to bring the RDD contents back to the driver. foreach executes on the worker nodes, so its output may not appear in your driver / shell (it probably will in local mode, but not when running on a cluster).

>>> for x in wc.collect():
...     print x
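As a side note, collect() ships the entire RDD to the driver, which can be a problem for large datasets. A minimal sketch of a bounded alternative follows, assuming you only need a quick look at a few records. The helper name `preview` and its default of 10 are hypothetical, not part of PySpark; the only RDD method it relies on is take(n).

```python
def preview(rdd, n=10):
    """Print and return up to n elements of an RDD on the driver.

    `preview` is a hypothetical helper, not a PySpark API. It only calls
    rdd.take(n), which brings at most n elements back to the driver.
    """
    rows = rdd.take(n)
    for row in rows:
        # This loop runs on the driver, so the output appears in your shell.
        print(row)
    return rows
```

With the `wc` RDD from the question, `preview(wc, 5)` would print up to five (word, count) pairs. Writing `print(row)` with a single argument works both as a Python 2 print statement and as a Python 3 function call.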
Josh Rosen answered Sep 29 '22 15:09
