I am trying to read a simple text file into a Spark RDD, and I see that there are two ways of doing so:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext
textRDD1 = sc.textFile("hobbit.txt")
textRDD2 = spark.read.text("hobbit.txt").rdd
Then, when I look at the data, I see that the two RDDs are structured differently:
textRDD1.take(5)
['The king beneath the mountain',
'The king of carven stone',
'The lord of silver fountain',
'Shall come unto his own',
'His throne shall be upholden']
textRDD2.take(5)
[Row(value='The king beneath the mountain'),
Row(value='The king of carven stone'),
Row(value='The lord of silver fountain'),
Row(value='Shall come unto his own'),
Row(value='His throne shall be upholden')]
Based on this, all subsequent processing has to be changed to reflect the presence of the 'value' field.
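For example, a transformation that works directly on textRDD1 needs an extra unwrapping step on textRDD2 (a minimal sketch; the lambdas are my own illustration):

# textRDD1 holds plain strings, so transformations apply directly
upper1 = textRDD1.map(lambda line: line.upper())

# textRDD2 holds Row objects, so the string must first be pulled
# out of the 'value' field
upper2 = textRDD2.map(lambda row: row.value.upper())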
My questions are: (a) why are the two RDDs structured differently, and (b) which method should I use?
To answer (a), sc.textFile(...) returns an RDD[String]. From the SparkContext docs:

textFile(String path, int minPartitions)
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
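In PySpark the second argument is optional and can be passed as a keyword to request a minimum level of parallelism (a quick sketch; the value 4 is arbitrary):

textRDD1 = sc.textFile("hobbit.txt", minPartitions=4)
print(textRDD1.getNumPartitions())  # typically at least 4 for a splittable file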
spark.read.text(...) returns a Dataset[Row], i.e. a DataFrame. From the DataFrameReader docs:

text(String path)
Loads text files and returns a DataFrame whose schema starts with a string column named "value", followed by partitioned columns if there are any.
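Note that if you keep the DataFrame instead of dropping down to an RDD, you can work with the "value" column directly through the DataFrame API (a minimal sketch):

from pyspark.sql.functions import length

df = spark.read.text("hobbit.txt")
df.filter(df.value.contains("king")).show(truncate=False)  # lines mentioning "king"
df.select(length(df.value).alias("line_length")).show(5)   # length of each line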
For (b), it really depends on your use case. Since you are trying to create an RDD here, you should go with sc.textFile. You can always convert a DataFrame to an RDD and vice versa.
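For instance (a sketch; toDF expects rows of tuples, hence the extra map):

# DataFrame -> RDD of plain strings
plainRDD = spark.read.text("hobbit.txt").rdd.map(lambda row: row.value)

# RDD of strings -> DataFrame: wrap each string in a one-element tuple first
df = plainRDD.map(lambda s: (s,)).toDF(["value"])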