
Combining Spark Streaming + MLlib

I've tried to use a Random Forest model to predict a stream of examples, but it appears that I cannot use that model to classify them. Here is the code used in PySpark:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.tree import RandomForest

sc = SparkContext(appName="App")

# trainingData: RDD of LabeledPoint built beforehand from static data
model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     impurity='gini', numTrees=150)


ssc = StreamingContext(sc, 1)
lines = ssc.socketTextStream(hostname, int(port))

parsedLines = lines.map(parse)
parsedLines.pprint()

predictions = parsedLines.map(lambda event: model.predict(event.features))

and the error returned when running it on the cluster:

  Error : "It appears that you are attempting to reference SparkContext from a broadcast "
    Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
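
SPARK-5063 boils down to serialization: anything referenced inside a `map` closure must be pickled and shipped to the workers, and a PySpark MLlib model holds a handle to the driver's JVM (through the SparkContext) that cannot be pickled. A toy stand-in with no PySpark involved can show the same failure mode; the class and the lock here are purely illustrative, playing the roles of the model and its SparkContext handle:

```python
import pickle
import threading

class JvmBackedModel:
    """Toy stand-in for an MLlib model: it keeps a handle that
    cannot be pickled (a lock here, playing the role of the
    SparkContext/JVM gateway a real model references)."""
    def __init__(self):
        self._ctx = threading.Lock()  # unpicklable, like a SparkContext

    def predict(self, x):
        return 1 if x > 0 else 0

model = JvmBackedModel()

# Calling predict on the driver works fine:
print(model.predict(2.5))  # 1

# But a closure like `lambda event: model.predict(event.features)`
# forces Spark to pickle the model for the workers -- and that fails:
try:
    pickle.dumps(model)
    print("pickled OK")
except TypeError:
    print("cannot pickle")  # this branch runs
```

This is why the error appears even though the code looks like an ordinary `map`: the problem is where `predict` runs, not the streaming itself.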

Is there a way to use a model generated from static data to predict streaming examples?

Thanks, I really appreciate it!

testing asked Apr 25 '16



1 Answer

Yes, you can use a model generated from static data. The problem you are experiencing is not related to streaming at all. You simply cannot use a JVM-based model inside an action or a transformation (see How to use Java/Scala function from an action or a transformation? for an explanation of why). Instead, you should apply the predict method to a complete RDD, for example using transform on the DStream:

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from operator import attrgetter


sc = SparkContext("local[2]", "foo")
ssc = StreamingContext(sc, 1)

data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
trainingData, testData = data.randomSplit([0.7, 0.3])

model = RandomForest.trainClassifier(
    trainingData, numClasses=2, categoricalFeaturesInfo={}, numTrees=3
)

(ssc
    .queueStream([testData])
    # Extract features
    .map(attrgetter("features"))
    # Predict 
    .transform(lambda _, rdd: model.predict(rdd))
    .pprint())

ssc.start()
ssc.awaitTerminationOrTimeout(10)
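
The key move above is predicting per batch: `transform` hands `model.predict` a whole RDD on the driver side, so the model is never serialized into a per-record closure. Outside Spark, the shape of that pattern can be sketched in plain Python with a toy model and a queue of batches; the names here are illustrative, not PySpark API:

```python
class ToyModel:
    """Stand-in for the trained classifier; it predicts a whole
    batch at once, mirroring model.predict(rdd) in the answer."""
    def predict(self, batch):
        return [1 if x > 0 else 0 for x in batch]

model = ToyModel()

# Each inner list plays the role of one micro-batch RDD from the DStream.
stream_of_batches = [[-2.0, 3.5], [0.5], [-1.0, -0.5, 4.0]]

# The model is applied batch-by-batch where it lives, never shipped
# inside a per-record closure.
predictions = [model.predict(batch) for batch in stream_of_batches]
print(predictions)  # [[0, 1], [1], [0, 0, 1]]
```

The same structure is what `queueStream(...).transform(lambda _, rdd: model.predict(rdd))` gives you in the answer: one prediction call per micro-batch instead of one per record.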

zero323 answered Sep 30 '22