Can I create a model in spark batch and use it on Spark streaming for real-time processing?
I have seen various examples on the Apache Spark site where both training and prediction are built on the same type of processing (linear regression).
Note that Spark Streaming is not true record-by-record real-time processing. The arriving live stream of data is divided into batches of a pre-defined interval, and each batch of data is treated as a Spark Resilient Distributed Dataset (RDD). These RDDs are then processed with operations such as map, reduce, and join.
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results, also in batches. Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.
Apache Spark, an Apache top-level project since 2014, is an open-source, multi-language data processing engine that lets you implement distributed stream and batch processing over large-scale data workloads.
The batch interval tells Spark for what duration to collect the data: if it is 1 minute, each batch contains the data received during the last minute (source: spark.apache.org). So the data keeps arriving as a continuous sequence of such batches, and this continuous stream of batches is the DStream.
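To make the micro-batch model concrete, here is a minimal sketch (mine, not from the Spark docs or the answers below) of a Java driver that uses a 1-minute batch interval; the app name, host, and port are placeholder assumptions, and the statements would live inside the driver's main method:
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf sparkConf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]");
// Each micro-batch covers 60 seconds of incoming data
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(60 * 1000));

// Every batch of lines arrives as an RDD wrapped in this DStream
JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
JavaDStream<Integer> lengths = lines.map(s -> s.length()); // per-record transformation
lengths.print();

jssc.start();            // start receiving data and processing batches
jssc.awaitTermination(); // block until the streaming job is stopped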
Can I create a model in spark batch and use it on Spark streaming for real-time processing?
Of course, yes. In the Spark community this is called offline training, online prediction: many training algorithms in Spark let you save the model to a file system such as HDFS or S3, and the same model can then be loaded by a streaming application. You simply call the model's predict method to make predictions.
See the Streaming + MLlib section in this link.
For example, if you want to train a DecisionTree offline and do predictions online...
In the batch application:
// Train a DecisionTree model on an RDD[LabeledPoint] and save it to a path both jobs can reach
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
In the streaming application:
// Load the previously saved model and call predict on new feature vectors
val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")
sameModel.predict(newData) // newData: a Vector or an RDD[Vector] of features
Here is one more solution, which I just implemented.
I created a model in a Spark batch job. Suppose the final model object is named regmodel:
// algorithm is the configured trainer (e.g. LinearRegressionWithSGD); run() returns the fitted model
final LinearRegressionModel regmodel = algorithm.run(JavaRDD.toRDD(parsedData));
and the Spark context is named sc:
JavaSparkContext sc = new JavaSparkContext(sparkConf);
Now, in the same code, I create a Spark streaming context using the same sc:
// Duration takes the batch interval in milliseconds (read here from the application's own config)
final JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(Integer.parseInt(conf.getWindow().trim())));
and do the prediction like this:
// dist1 is a JavaDStream<LabeledPoint>; for each record, pair the actual label
// with the value predicted by the batch-trained model.
JavaPairDStream<Double, Double> predictvalue = dist1.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Tuple2<Double, Double> call(LabeledPoint v1) throws Exception {
        Double p = v1.label();                       // actual label
        Double q = regmodel.predict(v1.features());  // prediction from the batch-trained model
        return new Tuple2<Double, Double>(p, q);
    }
});
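As a follow-up sketch (not part of the original answer), the resulting pair DStream can be printed and the streaming context started so the predictions are actually produced; predictvalue and jssc are the variables defined above:
predictvalue.print();    // print a few (label, prediction) pairs for every batch

jssc.start();            // start consuming the stream and applying the model
jssc.awaitTermination(); // block until the streaming job is stopped
Note that regmodel is captured by the anonymous PairFunction, so it is serialized with the task and shipped to the executors; the batch-trained model therefore has to be serializable, which MLlib's linear models are.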