Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark Streaming get warn "replicated to only 0 peer(s) instead of 1 peers"

I use spark streaming to receive twitts from twitter. I get many warning that says:

replicated to only 0 peer(s) instead of 1 peers

what is this warning for?

my code is:

    SparkConf conf = new SparkConf().setAppName("Test");
    JavaStreamingContext sc = new JavaStreamingContext(conf, Durations.seconds(5));
    sc.checkpoint("/home/arman/Desktop/checkpoint");

    ConfigurationBuilder cb = new ConfigurationBuilder();
    cb.setOAuthConsumerKey("****************")
        .setOAuthConsumerSecret("**************")
        .setOAuthAccessToken("*********************")
        .setOAuthAccessTokenSecret("***************");


    JavaReceiverInputDStream<twitter4j.Status> statuses = TwitterUtils.createStream(sc, 
            AuthorizationFactory.getInstance(cb.build()));

    JavaPairDStream<String, Long> hashtags = statuses.flatMapToPair(new GetHashtags());
    JavaPairDStream<String, Long> hashtagsCount = hashtags.updateStateByKey(new UpdateReduce());
    hashtagsCount.foreachRDD(new saveText(args[0], true));

    sc.start();
    sc.awaitTerminationOrTimeout(Long.parseLong(args[1]));
    sc.stop();
like image 978
Arman Avatar asked Sep 15 '15 10:09

Arman


1 Answers

When reading data with Spark Streaming, incoming data blocks are replicated to at least one another node/worker because of fault-tolerance. Without that it may happen that in case the runtime reads data from stream and then fails this particular piece of data would be lost (it's already read and erased from stream and it's also lost at the worker side because of failure).

Referring to the Spark documentation :

While a Spark Streaming driver program is running, the system receives data from various sources and and divides it into batches. Each batch of data is treated as an RDD, that is, an immutable parallel collection of data. These input RDDs are saved in memory and replicated to two nodes for fault-tolerance.

The warning in your case means that incoming data from stream are not replicated at all. The reason for that may be that you run the app with just one instance of Spark worker or running in local mode. Try to start more Spark workers and see if the warning is gone.

like image 75
vanekjar Avatar answered Oct 31 '22 17:10

vanekjar