
How to save a list to a file in Spark?

I have read this SO post, but I still need to take a random sample.

I have a dataset like the following:

123456789
23458ef12
ef12345ea
111223345

I want to get some random lines from it, so I wrote the following PySpark code:

rdd = spark_context.textFile('a.tx').takeSample(False, 3)
rdd.saveAsTextFile('b.tx')

But takeSample returns a list, so this fails with the error:

'list' object has no attribute 'saveAsTextFile'
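The error makes sense once you check what takeSample() actually hands back: a plain Python list on the driver, not an RDD. A quick illustration (no Spark needed; the sample values are just made up from the data above):

```python
# What takeSample(False, 3) returns is an ordinary Python list.
sample = ["123456789", "ef12345ea", "111223345"]

print(type(sample).__name__)              # list
print(hasattr(sample, "saveAsTextFile"))  # False -> hence the AttributeError
```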
asked Dec 26 '16 by thinkerou

People also ask

How do I save a file in Spark?

Saving text files: Spark provides a method called saveAsTextFile(), which takes a path and writes the contents of the RDD to that location. The path is treated as a directory, and multiple part files will be produced in that directory; this is how Spark writes output from multiple partitions in parallel.

How do I save a list in Databricks?

From Azure Databricks home, you can go to “Upload Data” (under Common Tasks)→ “DBFS” → “FileStore”. DBFS FileStore is where you create folders and save your data frames into CSV format.

How do I save a Spark DataFrame to a CSV file?

In Spark, you can save (write) a DataFrame to a CSV file on disk using dataframeObj.write.csv("path"). This also lets you write the DataFrame to AWS S3, Azure Blob Storage, HDFS, or any Spark-supported file system.

How do you write data into a text file in PySpark?

Use df.write.text("path") to write to a text file. When reading a text file, each line becomes a row with a single string column named "value" by default.


2 Answers

takeSample() returns a list (a local collection on the driver), not an RDD, so you need to parallelize it back into an RDD before saving:

rdd = spark_context.textFile('a.tx')
spark_context.parallelize(rdd.takeSample(False, 3)).saveAsTextFile('b.tx')
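Since takeSample() already pulls the sampled lines onto the driver, another option for a small sample is to skip Spark for the write and use plain Python file I/O. A local sketch using only the standard library (random.sample stands in for takeSample, and the file name is arbitrary):

```python
import os
import random
import tempfile

lines = ["123456789", "23458ef12", "ef12345ea", "111223345"]
sample = random.sample(lines, 3)  # local analogue of takeSample(False, 3)

# Write the sampled lines to an ordinary local file.
path = os.path.join(tempfile.mkdtemp(), "b.txt")
with open(path, "w") as f:
    f.write("\n".join(sample) + "\n")
```

Note that unlike saveAsTextFile, this produces a single ordinary file rather than a directory of part files, and it only works when the sample fits in driver memory.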

But the better approach is to use sample() (here I take 30%), which returns an RDD:

rdd.sample(False, 0.3).saveAsTextFile('b.tx')
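One thing to keep in mind: with withReplacement=False, sample() keeps each element independently with the given probability, so the result size is only approximately 30% of the input, not an exact count. A plain-Python illustration of that behaviour (seeded so the run is repeatable):

```python
import random

random.seed(42)  # fixed seed for repeatability
lines = [str(i) for i in range(1000)]

# Each line is kept independently with probability 0.3,
# similar to what RDD.sample(False, 0.3) does per element.
kept = [line for line in lines if random.random() < 0.3]

fraction = len(kept) / len(lines)
print(fraction)  # roughly 0.3, but not exactly 300 lines
```

If you need an exact number of lines, takeSample(False, n) is the right tool; sample() is for approximate fractions at scale.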
answered Oct 17 '22 by mrsrinivas


If you need to start from a pure Python list, such as the result of calling .collect() on a PySpark DataFrame, I use the following function:

def write_lists_to_hdfs_textfile(ss, python_list, hdfs_filename):
    '''
    :param ss: SparkSession object
    :param python_list: a plain Python list, e.g. the result of .collect() on a PySpark DataFrame
    :param hdfs_filename: the HDFS path to write to
    :return: None
    '''

    # First need to convert the list to parallel RDD
    rdd_list = ss.sparkContext.parallelize(python_list)

    # Use the map function to write one element per line and write all elements to a single file (coalesce)
    rdd_list.coalesce(1).map(lambda row: str(row)).saveAsTextFile(hdfs_filename)

    return None

For example:

write_lists_to_hdfs_textfile(ss,[5,4,1,18],"/test_file.txt")

answered Oct 17 '22 by Bikash Gyawali