When working in the spark-shell, I frequently want to inspect RDDs (similar to using head in Unix).
For example:
scala> val readmeFile = sc.textFile("input/tmp/README.md")
scala> // how to inspect the readmeFile?
and ...
scala> val linesContainingSpark = readmeFile.filter(line => line.contains("Spark"))
scala> // how to inspect linesContainingSpark?
To print the contents of an RDD, we can use the collect or foreach actions. collect() returns all elements of the dataset as an array to the driver program, and we can then loop over that array to print each element.
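As a minimal sketch in the spark-shell, using the readmeFile RDD from the question above (note that collect() pulls the whole dataset back to the driver, so this is only safe for small RDDs):
scala> // collect() returns an Array[String] on the driver; print each element
scala> readmeFile.collect().foreach(println)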
I found out how to do this (here) and thought it would be useful for other users, so I'm sharing it here. take(x)
returns the first x elements and foreach prints them:
scala> val readmeFile = sc.textFile("input/tmp/README.md")
scala> readmeFile.take(5).foreach(println)
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
and ...
scala> val linesContainingSpark = readmeFile.filter(line => line.contains("Spark"))
scala> linesContainingSpark.take(5).foreach(println)
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
rich set of higher-level tools including Spark SQL for SQL and structured
and Spark Streaming.
You can find the latest Spark documentation, including a programming
The examples below are equivalent, but use pyspark:
>>> readmeFile = sc.textFile("input/tmp/README.md")
>>> for line in readmeFile.take(5): print(line)
...
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
and
>>> linesContainingSpark = readmeFile.filter(lambda line: "Spark" in line)
>>> for line in linesContainingSpark.take(5): print(line)
...
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
rich set of higher-level tools including Spark SQL for SQL and structured
and Spark Streaming.
You can find the latest Spark documentation, including a programming
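For completeness, collect() works the same way from pyspark; as above, this is just a sketch and it loads the entire RDD onto the driver, so reserve it for datasets you know are small:
>>> # collect() returns a Python list on the driver; fine for a small filtered RDD
>>> for line in linesContainingSpark.collect(): print(line)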