Count number of rows in an RDD

I'm using Spark with Java, and I have an RDD of 5 million rows. Is there a solution that allows me to calculate the number of rows in my RDD? I've tried RDD.count(), but it takes a lot of time. I've seen that I can use the function fold, but I couldn't find any Java documentation for this function. Could you please show me how to use it, or show me another way to get the number of rows in my RDD?

Here is my code:

    JavaPairRDD<String, String> lines = getAllCustomers(sc).cache();
    JavaPairRDD<String, String> CFIDNotNull = lines.filter(notNull()).cache();
    JavaPairRDD<String, Tuple2<String, String>> join = lines.join(CFIDNotNull).cache();

    // I want to get the counts of these three RDDs
    double count_ctid = (double) join.count();
    double all = (double) lines.count();
    double count_cfid = all - CFIDNotNull.count();
    System.out.println("********** :" + count_cfid * 100 / all + "% and now : " + count_ctid * 100 / all + "%");

Thank you.

asked Feb 09 '15 15:02

Amine CHERIFI

1 Answer

You had the right idea: use rdd.count() to count the number of rows. There is no faster way.

I think the question you should have asked is: why is rdd.count() so slow?

The answer is that rdd.count() is an "action". It is an eager operation, because it has to return an actual number. The RDD operations you performed before count() were "transformations": they lazily turned one RDD into another. In effect, the transformations were not actually performed, just queued up. When you call count(), you force all of the previous lazy operations to run. Only now are the input files loaded, the map()s and filter()s executed, the shuffles performed, and so on, until we finally have the data and can say how many rows it has.
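You can see the same lazy/eager split without a cluster. The sketch below is an analogy using java.util.stream, not Spark code: map() and filter() play the role of transformations, and count() plays the role of the action. The class name LazyCountDemo and the mapCalls counter are made up for illustration. It also counts the rows "by hand" with a fold-style reduce, to show that this walks the same pipeline and does the same work as count().

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

class LazyCountDemo {
    // Counts how often the (pretend-expensive) map function actually runs.
    static final AtomicInteger mapCalls = new AtomicInteger();

    // Only describes the work: map() and filter() are lazy, like transformations.
    static Stream<Integer> pipeline(List<Integer> rows) {
        return rows.stream()
                .map(x -> { mapCalls.incrementAndGet(); return x * 2; })
                .filter(x -> x > 2);
    }

    // Returns {map calls after building, after count(), after the reduce-based count}.
    static int[] run() {
        mapCalls.set(0);
        List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5);

        Stream<Integer> built = pipeline(rows);
        int afterBuild = mapCalls.get();   // nothing has run yet

        long viaCount = built.count();     // eager: forces the whole pipeline to run
        int afterCount = mapCalls.get();

        // A fold-style count has to walk a freshly built pipeline all over again.
        long viaReduce = pipeline(rows).map(x -> 1L).reduce(0L, Long::sum);
        int afterReduce = mapCalls.get();

        if (viaCount != viaReduce) {
            throw new IllegalStateException("counts differ");
        }
        return new int[] { afterBuild, afterCount, afterReduce };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run()));   // prints [0, 5, 10]
    }
}
```

The map function runs zero times while the pipeline is being built, five times for count(), and five more times for the reduce-based count. Replacing count() with a fold does not save any of the upstream work.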

Note that if you call count() twice, all of this happens twice: once the count is returned, the computed data is discarded! To avoid this, call cache() on the RDD. cache() is itself lazy, so the first action still pays the full cost, but the second call to count() will be fast, and RDDs derived from it will also be faster to compute. The price is that the RDD has to be kept in memory (or on disk, if you use persist() with a disk-backed storage level).
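To make the recompute-versus-cache behaviour concrete, here is a toy model in plain Java. It is not how Spark is implemented, and ToyRdd, CacheDemo and the computations field are hypothetical names used only for this sketch. The toy object remembers how to compute its rows rather than the rows themselves, so every count() recomputes them unless cache() was requested first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy stand-in for an RDD: it stores the recipe for its rows, not the rows.
class ToyRdd<T> {
    private final Supplier<List<T>> compute;
    private boolean cacheRequested = false;
    private List<T> cached = null;
    int computations = 0;   // how many times the rows were actually computed

    ToyRdd(Supplier<List<T>> compute) {
        this.compute = compute;
    }

    // Like cache(): only marks the data to be kept; computes nothing by itself.
    ToyRdd<T> cache() {
        cacheRequested = true;
        return this;
    }

    // Like count(): an action that forces the computation.
    long count() {
        if (cached != null) {
            return cached.size();        // served from the stored copy
        }
        computations++;
        List<T> rows = compute.get();
        if (cacheRequested) {
            cached = rows;               // keep the rows for later actions
        }
        return rows.size();              // without cache, rows are dropped here
    }
}

class CacheDemo {
    // Pretend-expensive load step.
    static List<Integer> load() {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(i);
        }
        return rows;
    }

    // Returns {computations without cache, computations with cache} after two counts each.
    static int[] run() {
        ToyRdd<Integer> plain = new ToyRdd<Integer>(CacheDemo::load);
        plain.count();
        plain.count();

        ToyRdd<Integer> kept = new ToyRdd<Integer>(CacheDemo::load).cache();
        kept.count();
        kept.count();

        return new int[] { plain.computations, kept.computations };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println(r[0] + " vs " + r[1]);   // prints "2 vs 1"
    }
}
```

Without cache() the rows are computed twice; with it they are computed once, at the cost of holding all 1000 rows in memory between calls.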

answered Sep 28 '22 13:09

Daniel Darabos
