Iterating through a Spark RDD

I am starting with a Spark DataFrame and want to create a vector matrix for further analytics processing.

feature_matrix_vectors = feature_matrix1.map(lambda x: Vectors.dense(x)).cache()
feature_matrix_vectors.first()

The output is an array of vectors. Some of those vectors have a null in them:

>>> DenseVector([1.0, 31.0, 5.0, 1935.0, 24.0])
...
>>> DenseVector([1.0, 1231.0, 15.0, 2008.0, null])

From this I want to iterate through the vector matrix and create a LabeledPoint array, labelled 1.0 if the vector contains a null and 0.0 otherwise.

def f(row):
    # label 1.0 if the vector contains a null, 0.0 otherwise
    if None in row:
        return LabeledPoint(1.0, row)
    else:
        return LabeledPoint(0.0, row)

I have tried to iterate through the vector matrix using

feature_matrix_labeledPoint = (f(row) for row in feature_matrix_vectors)  # create a generator of labeled points
next(feature_matrix_labeledPoint)  # run the iteration protocol

but this doesn't work:

TypeError: 'PipelinedRDD' object is not iterable

Any help would be great.




1 Answer

RDDs are not a drop-in replacement for Python lists. You have to use the actions or transformations available on a given RDD. Here you can simply use map:

from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.regression import LabeledPoint


feature_matrix_vectors = sc.parallelize([
    DenseVector([1.0, 31.0, 5.0, 1935.0, 24.0]),
    DenseVector([1.0, 1231.0, 15.0, 2008.0, None])
])

(feature_matrix_vectors
    .map(lambda v: LabeledPoint(1.0 if None in v else 0.0, v))
    .collect())
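
One caveat worth noting: DenseVector stores its values as a NumPy float64 array, so a None passed in at construction time is usually coerced to NaN, and NaN never compares equal to anything (not even to itself or to None). If the membership test above doesn't fire on your data, a NaN-based check is a more robust variant. A minimal sketch, assuming the same RDD of DenseVectors as above:

import numpy as np

from pyspark.mllib.regression import LabeledPoint

# label 1.0 when the vector holds a NaN (a coerced null), 0.0 otherwise
labeled = feature_matrix_vectors.map(
    lambda v: LabeledPoint(1.0 if np.isnan(v.toArray()).any() else 0.0, v)
)
labeled.collect()
# e.g. [LabeledPoint(0.0, [1.0,31.0,5.0,1935.0,24.0]),
#       LabeledPoint(1.0, [1.0,1231.0,15.0,2008.0,nan])]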