I have a simple line:
line = "Hello, world"
I would like to convert it to an RDD with only one element. I have tried
sc.parallelize(line)
But I get:
sc.parallelize(line).collect()
['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd']
Any ideas?
To create a text-file RDD, we can use SparkContext's textFile method. It takes the URL of a file and reads it as a collection of lines. The URL can be a local path on the machine or an hdfs://, s3n://, etc. URI. Note that when a local path is used, the file must be available at the same path on the worker nodes.
A Dataset is a strongly typed DataFrame, so both a Dataset and a DataFrame can be converted to an RDD with .rdd.
Try passing a List as the parameter (Scala):
sc.parallelize(List(line)).collect()
It returns
res1: Array[String] = Array(Hello, world)
The code below works fine in Python:
sc.parallelize([line]).collect()
['Hello, world']
Here we wrap "line" in a list, so parallelize sees a collection with a single element (the whole string) instead of a sequence of characters.
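To see why the original call splits the string, it may help to look at how plain Python iterates over the two arguments. The sketch below needs no Spark. It only illustrates that parallelize treats its argument as a collection of elements, and that a bare string is a collection of characters. The variable names chars and single are just for illustration.

```python
line = "Hello, world"

# Iterating over a bare string yields one item per character,
# which mirrors what parallelize(line) ends up distributing.
chars = list(line)

# Wrapping the string in a list yields a single item: the whole string.
single = list([line])

print(chars)
print(single)  # ['Hello, world']
```

The same reasoning applies to the Scala answer: List(line) is a collection with one element.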