A question about inconsistency in Spark calculations: does this really happen? For example, I am running EXACTLY the same command twice, e.g.:
imp_sample.where(col("location").isNotNull()).count()
And I am getting slightly different results every time I run it (141,830, then 142,314)! Or this:
imp_sample.where(col("location").isNull()).count()
and getting 2,587,013, and then 2,586,943. How is it even possible? Thank you!
To get the number of rows in a PySpark DataFrame, use the count() function. count() is an action: calling it triggers execution of every transformation that produced the DataFrame, counts the resulting rows, and returns the total to the driver. It is commonly used to check row counts before and after a data-analysis step.
If your dataset is large, a count can take quite some time, so consider caching to speed it up. This is especially true when caching is not enabled and Spark has to start by reading the input data from a remote source, such as a database cluster or cloud object storage like S3.
In PySpark, there are two ways to get a distinct count. You can chain the DataFrame's distinct() and count() functions, or you can use the countDistinct() aggregate function, which returns the distinct count over the selected columns.
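A minimal sketch of the above, assuming a SparkSession called spark and a hypothetical input file data.parquet (neither appears in the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, countDistinct

spark = SparkSession.builder.appName("count-example").getOrCreate()
df = spark.read.parquet("data.parquet")  # hypothetical input

total = df.count()  # count() is an action: it executes the whole plan

df.cache()  # cache so repeated counts reuse the materialised data
non_null = df.where(col("location").isNotNull()).count()

distinct_a = df.select("location").distinct().count()              # distinct() + count()
distinct_b = df.select(countDistinct("location")).collect()[0][0]  # countDistinct() aggregate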
As per your comment, you are using sampleBy in your pipeline. sampleBy doesn't guarantee you'll get the exact fractions of rows: each record is included with a probability equal to the fraction for its stratum, so the result can vary from run to run.
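For illustration, a hedged sketch of sampleBy (the DataFrame df, the country column, and the fractions are hypothetical): the fractions are per-key inclusion probabilities, not exact proportions, so the sampled row count drifts between runs unless a seed is fixed, and even then only if the input data and partitioning are unchanged.

fractions = {"US": 0.1, "EU": 0.2}                      # per-key sampling probabilities
sample_a = df.sampleBy("country", fractions)            # no seed: counts vary run to run
sample_b = df.sampleBy("country", fractions, seed=42)   # fixed seed: repeatable given identical input
print(sample_a.count(), sample_b.count())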
Regarding your monotonically_increasing_id question in the comments, it only guarantees that each id is larger than the previous one; it does not guarantee that the ids are consecutive (i, i+1, i+2, etc.).
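A small sketch (assuming an existing SparkSession named spark) that makes the gaps visible: the current implementation encodes the partition index in the upper bits of the 64-bit id, so ids increase but jump between partitions.

from pyspark.sql.functions import monotonically_increasing_id

ids = spark.range(6).repartition(3).withColumn("row_id", monotonically_increasing_id())
ids.show()
# typical output: 0, 1 in one partition, then 8589934592, 8589934593 in the next;
# increasing, but clearly not consecutive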
Finally, you can persist a data frame by calling persist() on it.
Ok, I have suffered majorly from this in the past. I had a seven or eight stage pipeline that normalised a couple of tables, added ids, joined them and grouped them. Consecutive runs of the same pipeline gave different results, although not in any coherent pattern I could understand.
Long story short, I traced this behaviour to my usage of the function monotonically_increasing_id, supposedly resolved by this JIRA ticket, but still evident in Spark 2.2.
I do not know exactly what your pipeline does, but please understand that my fix was to force Spark to persist results after calling monotonically_increasing_id. I never saw the issue again after I started doing this.
Let me know if a judicious persist resolves this issue.
To persist an RDD or DataFrame, call either df.cache() (which uses the default storage level) or df.persist(<storage level>), for example:
from pyspark import StorageLevel
df.persist(StorageLevel.DISK_ONLY)
Again, it may not help you, but in my case it forced Spark to flush out and write id values which were behaving non-deterministically given repeated invocations of the pipeline.
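Putting that together, a minimal sketch of the workaround, under the assumption that imp_sample is the DataFrame from the question and that a row id is being added to it (the column names are hypothetical):

from pyspark import StorageLevel
from pyspark.sql.functions import col, monotonically_increasing_id

with_ids = imp_sample.withColumn("row_id", monotonically_increasing_id())
with_ids.persist(StorageLevel.DISK_ONLY)  # or .cache()
with_ids.count()                          # action: materialises the persisted ids

# subsequent counts and joins now see one stable set of ids
with_ids.where(col("location").isNotNull()).count()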