If you want to view the contents of an RDD, one way is to use collect():
myRDD.collect().foreach(println)
That's not a good idea, though, when the RDD has billions of lines. Use take() to grab just a few elements to print:
myRDD.take(n).foreach(println)
The map function is a transformation, which means that Spark will not actually evaluate your RDD until you run an action on it. To print it, you can use foreach (which is an action):
linesWithSessionId.foreach(println)
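As a minimal sketch of that transformation/action distinction (the input path and the parsing step are illustrative assumptions, not from the original question):
// Nothing executes here yet: textFile and map are both lazy
val linesWithSessionId = sc.textFile("access.log").map(line => line.split(" ")(0))
// foreach is an action, so this line triggers the actual computation
linesWithSessionId.foreach(println)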
To write it to disk you can use one of the saveAs... functions (still actions) from the RDD API.
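For example, saveAsTextFile writes one part-file per partition under a directory (the output path here is just a placeholder):
linesWithSessionId.saveAsTextFile("/tmp/session-lines")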
You can also convert your RDD to a DataFrame and then show() it:
// For implicit conversion from RDD to DataFrame
import spark.implicits._

// Build a pair RDD, convert it to a DataFrame, then show it
val fruits = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("orange", 17)))
fruits.toDF().show()
show() displays the first 20 rows by default, so the size of your data should not be an issue.
+------+---+
| _1| _2|
+------+---+
| apple| 1|
|banana| 2|
|orange| 17|
+------+---+
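If you want more meaningful headers than the default _1/_2, toDF also accepts explicit column names (the names here are arbitrary):
fruits.toDF("fruit", "count").show()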
If you're running this on a cluster then println won't print back to your driver; the output ends up in the executors' stdout instead. You need to bring the RDD data back to your session first, by collecting it into a local array and then printing it:
linesWithSessionId.collect().foreach(line => println(line))
(Older answers use toArray(), which did the same thing but has since been deprecated in favor of collect().)
There are also behavioral differences between myRDD.foreach(println) and myRDD.collect().foreach(println) (and not only with collect, but with other actions too). One of the differences is that myRDD.foreach(println) prints each partition from whichever executor happens to process it, so the output comes out in a nondeterministic order. For example, if my RDD comes from a text file where each line has a number, the output will be in a different order on each run. But with myRDD.collect().foreach(println), collect() returns the partitions to the driver in partition order, so the output order matches the text file.
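A quick sketch to see this for yourself in spark-shell (the element count and partition count are arbitrary):
// With several partitions, per-partition printing interleaves nondeterministically
val nums = sc.parallelize(1 to 10, numSlices = 4)
nums.foreach(println)            // order varies run to run (and prints on executors in cluster mode)
nums.collect().foreach(println)  // always 1 through 10, in order, on the driver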