I have read this SO post, but I still need a random sample.
I have a dataset like the following:
123456789
23458ef12
ef12345ea
111223345
I want to get some random lines from it, so I wrote the following PySpark code:
rdd = spark_context.textFile('a.tx').takeSample(False, 3)
rdd.saveAsTextFile('b.tx')
But takeSample returns a list, so this raises an error:
'list' object has no attribute 'saveAsTextFile'
takeSample() returns a plain Python list, so you need to parallelize it before you can save it:
rdd = spark_context.textFile('a.tx')
spark_context.parallelize(rdd.takeSample(False, 3)).saveAsTextFile('b.tx')
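If you need the same sample on every run, takeSample also accepts an optional seed argument. A minimal sketch reusing the paths above (seed=42 is just an example value):
rdd = spark_context.textFile('a.tx')
spark_context.parallelize(rdd.takeSample(False, 3, seed=42)).saveAsTextFile('b.tx')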
But the best way is to use sample(), which returns an RDD (here I am taking 30%):
rdd.sample(False, 0.3).saveAsTextFile('b.tx')
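Note that sample() takes a fraction rather than an exact count, so the number of lines you get back is only approximately 30% of the input. It also accepts an optional seed for reproducibility. A minimal sketch (again, seed=42 is just an example value):
rdd.sample(False, 0.3, seed=42).saveAsTextFile('b.tx')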
If you need to start from a plain Python list, such as the result of calling .collect() on a PySpark DataFrame, I use the following function:
def write_lists_to_hdfs_textfile(ss, python_list, hdfs_filename):
    '''
    :param ss: SparkSession object
    :param python_list: a plain Python list, e.g. the result of .collect() on a PySpark DataFrame
    :param hdfs_filename: the HDFS path to write to
    :return: None
    '''
    # First convert the list to a parallelized RDD
    rdd_list = ss.sparkContext.parallelize(python_list)
    # Write one element per line, coalescing to a single partition so that
    # all elements end up in a single part file
    rdd_list.coalesce(1).map(lambda row: str(row)).saveAsTextFile(hdfs_filename)
    return None
E.g.:
write_lists_to_hdfs_textfile(ss,[5,4,1,18],"/test_file.txt")
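Keep in mind that despite the .txt suffix, saveAsTextFile creates /test_file.txt as a directory containing a part file (a single one here, because of the coalesce(1)). A minimal sketch of reading the result back:
result = ss.sparkContext.textFile("/test_file.txt").collect()
# ['5', '4', '1', '18'] (the elements come back as strings)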