 

How to print out snippets of an RDD in the spark-shell / pyspark?

When working in the spark-shell, I frequently want to inspect RDDs (similar to using head in Unix).

For example:

scala> val readmeFile = sc.textFile("input/tmp/README.md")
scala> // how to inspect the readmeFile?

and ...

scala> val linesContainingSpark = readmeFile.filter(line => line.contains("Spark"))
scala> // how to inspect linesContainingSpark?
asked Jun 29 '15 by Chris Snow

People also ask

Can we print an RDD?

To print RDD contents, we can use the RDD collect or foreach actions. RDD.collect() returns all the elements of the dataset as an array at the driver program, and we can loop over this array to print each element.
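
For example, a minimal sketch in the spark-shell (smallRdd is a hypothetical RDD built with sc.parallelize; collect() is only safe when the data fits in driver memory):

scala> val smallRdd = sc.parallelize(Seq("apple", "banana", "cherry"))
scala> smallRdd.collect().foreach(println)  // collect() pulls every element back to the driver
apple
banana
cherry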


1 Answer

I found out how to do this (here) and thought it would be useful for other users, so I'm sharing it here. take(x) selects the first x items and foreach prints them:

scala> val readmeFile = sc.textFile("input/tmp/README.md")
scala> readmeFile.take(5).foreach(println)
# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a

and ...

scala> val linesContainingSpark = readmeFile.filter(line => line.contains("Spark"))
scala> linesContainingSpark.take(5).foreach(println)
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
rich set of higher-level tools including Spark SQL for SQL and structured
and Spark Streaming.
You can find the latest Spark documentation, including a programming

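Note that take(5) returns a plain Array[String] on the driver, so any ordinary Scala collection operation works on the result. A minimal sketch, continuing from the readmeFile RDD above:

scala> val firstLines = readmeFile.take(5)   // a local Array[String], not an RDD
scala> println(firstLines.mkString("\n"))    // same effect as firstLines.foreach(println)

One caveat: calling readmeFile.foreach(println) directly on the RDD runs the println on the executors, so on a real cluster the output lands in the executors' stdout rather than in your shell; take or collect first.
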
The examples below are equivalent but use pyspark:

>>> readmeFile = sc.textFile("input/tmp/README.md")
>>> for line in readmeFile.take(5): print(line)
... 
# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a

and ...

>>> linesContainingSpark = readmeFile.filter(lambda line: "Spark" in line)
>>> for line in linesContainingSpark.take(5): print(line)
... 
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
rich set of higher-level tools including Spark SQL for SQL and structured
and Spark Streaming.
You can find the latest Spark documentation, including a programming
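
Note: print is a statement in Python 2 but a function in Python 3. The print(line) form used above works under both, so these snippets run in either version of pyspark.
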
answered Sep 18 '22 by Chris Snow