I'm using Spark to calculate the PageRank of user reviews, but I keep getting java.lang.StackOverflowError when I run my code on a big dataset (40k entries). When I run the code on a small number of entries it works fine, though.
Entry example:
product/productId: B00004CK40 review/userId: A39IIHQF18YGZA review/profileName: C. A. M. Salas review/helpfulness: 0/0 review/score: 4.0 review/time: 1175817600 review/summary: Reliable comedy review/text: Nice script, well acted comedy, and a young Nicolette Sheridan. Cusak is in top form.
The Code:
public void calculatePageRank() {
    sc.clearCallSite();
    sc.clearJobGroup();

    JavaRDD<String> rddFileData = sc.textFile(inputFileName).cache();
    sc.setCheckpointDir("pagerankCheckpoint/");

    JavaRDD<String> rddMovieData = rddFileData.map(new Function<String, String>() {
        @Override
        public String call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            String movieId = data[0].split(":")[1].trim();
            String userId = data[1].split(":")[1].trim();
            return movieId + "\t" + userId;
        }
    });

    JavaPairRDD<String, Iterable<String>> rddPairReviewData = rddMovieData.mapToPair(new PairFunction<String, String, String>() {
        @Override
        public Tuple2<String, String> call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            return new Tuple2<String, String>(data[0], data[1]);
        }
    }).groupByKey().cache();

    JavaRDD<Iterable<String>> cartUsers = rddPairReviewData.map(f -> f._2());
    List<Iterable<String>> cartUsersList = cartUsers.collect();

    JavaPairRDD<String, String> finalCartesian = null;
    int iterCounter = 0;
    for (Iterable<String> out : cartUsersList) {
        JavaRDD<String> currentUsersRDD = sc.parallelize(Lists.newArrayList(out));
        if (finalCartesian == null) {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD);
        } else {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD).union(finalCartesian);
            if (iterCounter % 20 == 0) {
                finalCartesian.checkpoint();
            }
        }
    }

    JavaRDD<Tuple2<String, String>> finalCartesianToTuple = finalCartesian.map(m -> new Tuple2<String, String>(m._1(), m._2()));
    finalCartesianToTuple = finalCartesianToTuple.filter(x -> x._1().compareTo(x._2()) != 0);
    JavaPairRDD<String, String> userIdPairs = finalCartesianToTuple.mapToPair(m -> new Tuple2<String, String>(m._1(), m._2()));

    JavaRDD<String> userIdPairsString = userIdPairs.map(new Function<Tuple2<String, String>, String>() {
        @Override
        public String call(Tuple2<String, String> t) throws Exception {
            return t._1 + " " + t._2;
        }
    });

    try {
        // calculate pagerank using this
        // https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java
        JavaPageRank.calculatePageRank(userIdPairsString, 100);
    } catch (Exception e) {
        e.printStackTrace();
    }
    sc.close();
}
Increase the thread stack size (-Xss). Increasing the stack size can be useful, for example, when the program calls a large number of methods or uses many local variables. Setting the thread stack size to 4 MB should prevent the JVM from throwing a java.lang.StackOverflowError.
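As a minimal sketch of how that could be wired into a Spark job (the 4m value and app name are just examples; the exact way the options take effect depends on your deploy mode, so treat this as an assumption about your setup):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: raise the executor thread stack size to 4 MB.
// The driver's own stack size usually has to be set when its JVM starts,
// e.g. spark-submit --driver-java-options "-Xss4m" or java -Xss4m on the command line.
SparkConf conf = new SparkConf()
        .setAppName("pagerank")
        .set("spark.executor.extraJavaOptions", "-Xss4m");
JavaSparkContext sc = new JavaSparkContext(conf);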
A java.lang.StackOverflowError indicates a serious problem that an application cannot sensibly catch: the thread's call stack has run out of space. It is usually caused by a recursive call with no terminating condition.
A stack overflow is a type of buffer overflow error that occurs when a program tries to use more space in the call stack than has been allocated to it.
The most common way to get a StackOverflowError is long or infinite recursion in a recursive function. You can often avoid deep recursion by changing your design to use an explicit, stackable data structure instead.
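For illustration only (plain Java, unrelated to Spark), a minimal example of the difference: deep recursion consumes one call-stack frame per call and eventually overflows, while the same work kept in a Deque lives on the heap.

import java.util.ArrayDeque;
import java.util.Deque;

public class StackOverflowDemo {
    // Deep recursion: each call uses a stack frame, so a large n
    // overflows the default thread stack and throws StackOverflowError.
    static long sumRecursive(long n) {
        if (n == 0) return 0;
        return n + sumRecursive(n - 1);
    }

    // Same computation with a "stackable data object":
    // the pending work is stored on the heap in a Deque, not on the call stack.
    static long sumIterative(long n) {
        Deque<Long> work = new ArrayDeque<>();
        for (long i = 1; i <= n; i++) work.push(i);
        long total = 0;
        while (!work.isEmpty()) total += work.pop();
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumIterative(1_000_000));   // fine
        System.out.println(sumRecursive(1_000_000));   // likely StackOverflowError
    }
}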
I have multiple suggestions which will help you to greatly improve the performance of the code in your question.
An example is RDD.count: to tell you the number of lines in a file, the file needs to be read. So if you write RDD.count, at that point the file will be read, the lines will be counted, and the count will be returned. What if you call RDD.count again? The same thing: the file will be read and counted again. So what does RDD.cache do? It tells Spark to keep the RDD's data around after it is first computed. Now, if you run RDD.count the first time, the file will be loaded, cached, and counted. If you call RDD.count a second time, the operation will use the cache: it will just take the data from the cache and count the lines, with no recomputation. Read more about caching in the Spark documentation on RDD persistence.
In your code sample you are not reusing anything that you've cached, so you can remove the .cache calls.
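To make that concrete, here is a minimal sketch (it assumes the sc and inputFileName from your code):

JavaRDD<String> lines = sc.textFile(inputFileName);   // lazy: nothing is read yet

long first = lines.count();    // reads the file and counts the lines
long again = lines.count();    // reads and counts the whole file again

JavaRDD<String> cached = sc.textFile(inputFileName).cache();
long c1 = cached.count();      // reads the file, caches the partitions, counts
long c2 = cached.count();      // served from the cache, no recomputation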
Combine your rddFileData, rddMovieData and rddPairReviewData steps so that the parsing happens in one go (see the sketch below). Get rid of .collect, since that brings all the results back to the driver and may be the actual reason for your error.
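One possible sketch (not a drop-in replacement, and the pair-generation strategy is my assumption about your intent): parsing, grouping, and pair generation all stay in a single distributed pipeline, with no collect() and no driver-side loop.

// Requires: import java.util.ArrayList; import java.util.List; import scala.Tuple2;
// Parse and group in one pass: movieId -> all userIds that reviewed it.
JavaPairRDD<String, Iterable<String>> usersByMovie = sc.textFile(inputFileName)
        .mapToPair(line -> {
            String[] data = line.split("\t");
            String movieId = data[0].split(":")[1].trim();
            String userId = data[1].split(":")[1].trim();
            return new Tuple2<>(movieId, userId);
        })
        .groupByKey();

// Emit every ordered pair of distinct users who reviewed the same movie,
// already formatted as "userA userB" edges for the PageRank step.
JavaRDD<String> userIdPairsString = usersByMovie.flatMap(t -> {
    List<String> pairs = new ArrayList<>();
    for (String u1 : t._2()) {
        for (String u2 : t._2()) {
            if (!u1.equals(u2)) {
                pairs.add(u1 + " " + u2);
            }
        }
    }
    return pairs.iterator(); // Spark 2.x flatMap expects an Iterator; 1.x expects an Iterable
});

The distinct-user check replaces the later compareTo filter, and userIdPairsString can then be handed to JavaPageRank.calculatePageRank as before.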