I have read this SO post, but I still need a random sample.
I have a dataset like the following:
123456789
23458ef12
ef12345ea
111223345
I want to get some random lines from it, so I wrote the following PySpark code:
rdd = spark_context.textFile('a.tx').takeSample(False, 3)
rdd.saveAsTextFile('b.tx')
But takeSample returns a list, so this raises an error:
'list' object has no attribute 'saveAsTextFile'
takeSample() returns a plain Python list, so you need to parallelize it before you can save it:
rdd = spark_context.textFile('a.tx')
spark_context.parallelize(rdd.takeSample(False, 3)).saveAsTextFile('b.tx')
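If you need the same sample on every run, takeSample also accepts an optional seed argument. A minimal sketch reusing the paths above (seed=42 is just an example value):
rdd = spark_context.textFile('a.tx')
spark_context.parallelize(rdd.takeSample(False, 3, seed=42)).saveAsTextFile('b.tx')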
But the best way is to use sample(), which returns an RDD (here I am taking 30%):
rdd.sample(False, 0.3).saveAsTextFile('b.tx')
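Note that sample() takes a fraction rather than an exact count, so the number of lines you get back is only approximately 30% of the input. It also accepts an optional seed for reproducibility. A minimal sketch (again, seed=42 is just an example value):
rdd.sample(False, 0.3, seed=42).saveAsTextFile('b.tx')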
If you need to start from a plain Python list, such as the result of calling .collect() on a PySpark DataFrame, I use the following function:
def write_lists_to_hdfs_textfile(ss, python_list, hdfs_filename):
    '''
    :param ss: SparkSession object
    :param python_list: a plain Python list, e.g. the result of .collect() on a PySpark DataFrame
    :param hdfs_filename: the HDFS path to write to
    :return: None
    '''
    # First convert the list to a parallelized RDD
    rdd_list = ss.sparkContext.parallelize(python_list)
    # Write one element per line, coalescing to a single partition so that
    # all elements end up in a single part file
    rdd_list.coalesce(1).map(lambda row: str(row)).saveAsTextFile(hdfs_filename)
    return None
E.g.:
write_lists_to_hdfs_textfile(ss,[5,4,1,18],"/test_file.txt")
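Keep in mind that despite the .txt suffix, saveAsTextFile creates /test_file.txt as a directory containing a part file (a single one here, because of the coalesce(1)). A minimal sketch of reading the result back:
result = ss.sparkContext.textFile("/test_file.txt").collect()
# ['5', '4', '1', '18'] (the elements come back as strings)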