If you want to view the contents of an RDD, one way is to use collect():
myRDD.collect().foreach(println)
That's not a good idea, though, when the RDD has billions of lines. Use take() to grab just a few elements to print:
myRDD.take(n).foreach(println)
The map function is a transformation, which means that Spark will not actually evaluate your RDD until you run an action on it. To print it, you can use foreach (which is an action):
linesWithSessionId.foreach(println)
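As a minimal sketch of that transformation/action distinction (the input path and the parsing step are illustrative assumptions, not from the original question):
// Nothing executes here yet: textFile and map are both lazy
val linesWithSessionId = sc.textFile("access.log").map(line => line.split(" ")(0))
// foreach is an action, so this line triggers the actual computation
linesWithSessionId.foreach(println)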
To write it to disk you can use one of the saveAs... functions (still actions) from the RDD API.
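For example, saveAsTextFile writes one part-file per partition under a directory (the output path here is just a placeholder):
linesWithSessionId.saveAsTextFile("/tmp/session-lines")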
You can also convert your RDD to a DataFrame and then show() it:
// For implicit conversion from RDD to DataFrame
import spark.implicits._

// Build a pair RDD, convert it to a DataFrame, then show it
val fruits = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("orange", 17)))
fruits.toDF().show()
show() displays the first 20 rows by default, so the size of your data should not be an issue.
+------+---+
| _1| _2|
+------+---+
| apple| 1|
|banana| 2|
|orange| 17|
+------+---+
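If you want more meaningful headers than the default _1/_2, toDF also accepts explicit column names (the names here are arbitrary):
fruits.toDF("fruit", "count").show()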
If you're running this on a cluster then println won't print back to your driver; the output ends up in the executors' stdout instead. You need to bring the RDD data back to your session first, by collecting it into a local array and then printing it:
linesWithSessionId.collect().foreach(line => println(line))
(Older answers use toArray(), which did the same thing but has since been deprecated in favor of collect().)
There are also behavioral differences between myRDD.foreach(println) and myRDD.collect().foreach(println) (and not only with collect, but with other actions too). One of the differences is that myRDD.foreach(println) prints each partition from whichever executor happens to process it, so the output comes out in a nondeterministic order. For example, if my RDD comes from a text file where each line has a number, the output will be in a different order on each run. But with myRDD.collect().foreach(println), collect() returns the partitions to the driver in partition order, so the output order matches the text file.
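A quick sketch to see this for yourself in spark-shell (the element count and partition count are arbitrary):
// With several partitions, per-partition printing interleaves nondeterministically
val nums = sc.parallelize(1 to 10, numSlices = 4)
nums.foreach(println)            // order varies run to run (and prints on executors in cluster mode)
nums.collect().foreach(println)  // always 1 through 10, in order, on the driver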