Spark: monotonically increasing id not working as expected in a DataFrame?

I have a dataframe df in Spark which looks something like this:

scala> df.show()
+--------+--------+
|columna1|columna2|
+--------+--------+
|     0.1|     0.4|
|     0.2|     0.5|
|     0.1|     0.3|
|     0.3|     0.6|
|     0.2|     0.7|
|     0.2|     0.8|
|     0.1|     0.7|
|     0.5|     0.5|
|     0.6|    0.98|
|     1.2|     1.1|
|     1.2|     1.2|
|     0.4|     0.7|
+--------+--------+

I tried to add an id column with the following code:

val df_id = df.withColumn("id", monotonicallyIncreasingId)

but the id column is not what I expect:

scala> df_id.show()
+--------+--------+----------+
|columna1|columna2|        id|
+--------+--------+----------+
|     0.1|     0.4|         0|
|     0.2|     0.5|         1|
|     0.1|     0.3|         2|
|     0.3|     0.6|         3|
|     0.2|     0.7|         4|
|     0.2|     0.8|         5|
|     0.1|     0.7|8589934592|
|     0.5|     0.5|8589934593|
|     0.6|    0.98|8589934594|
|     1.2|     1.1|8589934595|
|     1.2|     1.2|8589934596|
|     0.4|     0.7|8589934597|
+--------+--------+----------+

As you can see, the ids run correctly from 0 to 5, but the next id is 8589934592 instead of 6, and so on.

So what is wrong here? Why isn't the id column indexed consecutively?

asked Dec 19 '17 by antonioACR1


1 Answer

It works as expected. This function is not intended to generate consecutive values; instead, it encodes the partition ID and the record index within each partition. From the API docs:

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.

As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs:

0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
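To see the layout concretely, here is a minimal sketch (the derived column names are my own, for illustration) that splits each generated id back into its two components, following the bit layout quoted above:

import org.apache.spark.sql.functions.{col, shiftRight}

// df_id is the frame from the question: upper 31 bits -> partition ID,
// lower 33 bits -> record number within that partition
df_id
  .withColumn("partitionId", shiftRight(col("id"), 33))
  .withColumn("recordInPartition", col("id").bitwiseAND((1L << 33) - 1))
  .show()

On the data above, rows 0 to 5 decode to partition 0, records 0 to 5, and the remaining rows decode to partition 1, records 0 to 5, which is exactly where the jump to 8589934592 comes from.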

If you want consecutive numbers, use RDD.zipWithIndex.
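For instance, a minimal sketch (assuming the df from the question and a spark session in scope, as in the shell) that appends a consecutive id via zipWithIndex:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.LongType

// zip each row with its global index, then append the index as a new column
val rdd_indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}

val schema = df.schema.add("id", LongType, nullable = false)
val df_consecutive = spark.createDataFrame(rdd_indexed, schema)
df_consecutive.show()

Note that zipWithIndex triggers an extra Spark job when the RDD has more than one partition (to compute per-partition offsets), so it is more expensive than monotonically_increasing_id.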

answered Oct 16 '22 by Alper t. Turker