Spark using Python : save RDD output into text files

Tags: python, apache-spark, pyspark

I am working on the word count problem in Spark using Python, but I run into a problem when I try to save the output RDD to a text file with the .saveAsTextFile command. Here is my code. Please help, I am stuck. I appreciate your time.

import re

from pyspark import SparkConf , SparkContext

def normalizewords(text):
    return re.compile(r'\W+',re.UNICODE).split(text.lower())

conf=SparkConf().setMaster("local[2]").setAppName("sorted result")
sc=SparkContext(conf=conf)

input=sc.textFile("file:///home/cloudera/PythonTask/sample.txt")

words=input.flatMap(normalizewords)

wordsCount=words.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)

sortedwordsCount=wordsCount.map(lambda (x,y):(y,x)).sortByKey()

results=sortedwordsCount.collect()

for result in results:
    count=str(result[0])
    word=result[1].encode('ascii','ignore')

    if(word):
        print word +"\t\t"+ count

results.saveAsTextFile("/var/www/myoutput")

asked Dec 04 '15 11:12

RACHITA PATRO

2 Answers

Since you called results=sortedwordsCount.collect(), results is no longer an RDD. It is an ordinary Python list (here, a list of tuples).

As you know, a list is a Python object/data structure, and append is its method for adding an element.

>>> x = []
>>> x.append(5)
>>> x
[5]

Similarly, an RDD is Spark's object/data structure, and saveAsTextFile is its method for writing to files. The important point is that an RDD is a distributed data structure.

So we cannot use append on an RDD or saveAsTextFile on a list. collect is the RDD method that brings the RDD's contents into the driver's memory.

As mentioned in the comments, either save sortedwordsCount with saveAsTextFile, or open a file in Python and write results to it.
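For the second option (writing the collected list from the driver), a minimal sketch could look like the one below. The helper name write_counts, the sample data, and the temporary output path are made up for illustration; sample_results only stands in for the list that sortedwordsCount.collect() returns. It is written so that it should run on both Python 2 and Python 3.

```python
import io
import os
import tempfile


def write_counts(results, path):
    """Write (count, word) pairs to path, one 'word<TAB>count' per line."""
    with io.open(path, "w", encoding="utf-8") as out:
        for count, word in results:
            if word:  # skip the empty token the regex split can produce
                out.write(u"%s\t%d\n" % (word, count))


# Hypothetical stand-in for results = sortedwordsCount.collect()
sample_results = [(1, u"spark"), (2, u"python"), (3, u"")]

# Hypothetical output location; any writable local path would do
output_path = os.path.join(tempfile.mkdtemp(), "wordcount.txt")
write_counts(sample_results, output_path)
```

This only makes sense when the collected list fits comfortably in the driver's memory, since collect pulls everything to one machine.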

answered Nov 03 '22 21:11

WoodChopper

Change results=sortedwordsCount.collect() to results=sortedwordsCount, because calling .collect() turns results into a list, and a list has no saveAsTextFile method.
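If you go this route, keep in mind that the printing loop still needs a collected list, because an RDD cannot be iterated over directly with a for loop. One way to keep both is to leave results as it is for printing and call saveAsTextFile on the RDD itself. The sketch below assumes the (count, word) pairs from the question; format_pair and save_counts are hypothetical helper names, and the path in the usage comment is only an example.

```python
def format_pair(pair):
    """Turn a (count, word) tuple into a 'word<TAB>count' line."""
    count, word = pair
    return u"%s\t%d" % (word, count)


def save_counts(sorted_counts_rdd, output_dir):
    """Save the RDD itself, so nothing is collected to the driver."""
    (sorted_counts_rdd
        .filter(lambda pair: pair[1])  # drop the empty word, if any
        .map(format_pair)
        .saveAsTextFile(output_dir))


# Usage inside the question's script (example path, adjust as needed):
# save_counts(sortedwordsCount, "file:///home/cloudera/PythonTask/wordcount_out")
```

Two things to expect: saveAsTextFile creates a directory containing part files rather than a single text file, and it fails if that directory already exists. Also, a path without a scheme such as file:// is resolved against the cluster's default file system, which may not be the local disk.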

answered Nov 03 '22 19:11

Derrick wang
