 

Scala: How to get a range of rows in a dataframe

I have a DataFrame created by reading a Parquet file with sqlContext.read.

The DataFrame consists of 300M rows. I need to use these rows as input to another function, but I want to do it in smaller batches to prevent an OOM error.

Currently, I am using df.head(1000000) to read the first 1M rows, but I cannot find a way to read the subsequent rows. I tried df.collect(), but it gives me a Java OOM error.

I want to iterate over this dataframe. I tried adding another column with the withColumn() API to generate a unique set of values to iterate over, but none of the existing columns in the dataframe have solely unique values.

For example, I tried val df = df1.withColumn("newColumn", df1("col") + 1) as well as val df = df1.withColumn("newColumn", lit(i+=1)), neither of which returns a sequential set of values.

Is there any other way to get the first n rows of a DataFrame and then the next n rows, something that works like the range function of SQLContext?
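For illustration, one possible way to get this kind of range-based access is to attach a sequential index with zipWithIndex and then filter on it. This is only a sketch; the row_idx column name and the batch bounds below are placeholders:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach a stable sequential index to every row, then filter by range.
val indexedRdd = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
val indexedSchema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
val indexedDf = sqlContext.createDataFrame(indexedRdd, indexedSchema)

// Second batch of 1M rows: indices 1,000,000 to 1,999,999
val batch = indexedDf.filter(indexedDf("row_idx") >= 1000000L && indexedDf("row_idx") < 2000000L)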

newbie_learner asked Sep 02 '16


People also ask

How do I iterate through Spark DataFrame rows?

For looping through each row using map(), first convert the PySpark DataFrame into an RDD, because map() can only be performed on RDDs. Then call map() with a lambda function that processes each row; the result is a new RDD, which can be converted back into a DataFrame if needed.
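In Scala, the same pattern looks roughly like this (someColumn is a placeholder column name):

// Drop to the RDD, transform each Row with map(), and collect or convert back as needed.
val mapped = df.rdd.map { row =>
  row.getAs[String]("someColumn")
}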

How do you select a range of rows from a PySpark DataFrame?

By using a SQL query with the between() operator, we can get a range of rows.
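For example, assuming the DataFrame has a numeric id column to filter on, a Scala sketch could look like this (the table name t is also a placeholder):

// Both forms return the rows whose id falls in [100, 200].
val rangeDf = df.filter(df("id").between(100, 200))

df.registerTempTable("t")
val rangeDf2 = sqlContext.sql("SELECT * FROM t WHERE id BETWEEN 100 AND 200")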

How do you select 10 rows in PySpark DataFrame?

In Spark/PySpark, you can use the show() action to get the top/first N (5, 10, 100, ...) rows of the DataFrame and display them on a console or in a log. There are also several Spark actions, such as take(), tail(), collect(), head(), and first(), that return the top or last n rows as a list of Rows (Array[Row] for Scala).
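For example, in Scala (a sketch, assuming df is an existing DataFrame):

df.show(10)                  // prints the first 10 rows to the console
val firstTen = df.take(10)   // Array[Row] with the first 10 rows
val headTen  = df.head(10)   // equivalent to take(10)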

How do I find the number of rows in a DataFrame Spark?

To get the number of rows in a PySpark DataFrame, use the count() function. It returns the total number of rows in the DataFrame.
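For example, in Scala:

val total = df.count()       // total number of rows, returned as a Long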


1 Answer

You can simply use the limit and except APIs of Dataset or DataFrame as follows:

long count = df.count();
int limit = 50;
while (count > 0) {
    Dataset<Row> df1 = df.limit(limit);   // next batch of up to 50 rows
    df1.show();                           // prints the first 50 rows, then the next 50, and so on
    df = df.except(df1);                  // drop the rows that were just processed
    count = count - limit;
}
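A rough Scala equivalent of the same idea (variable names are placeholders; note that except() rescans the remaining data on every iteration, so this can be slow on very large inputs):

var remaining = df
var count = remaining.count()
val limit = 50

while (count > 0) {
  val batch = remaining.limit(limit)
  batch.show()                        // first 50 rows, then the next 50, and so on
  remaining = remaining.except(batch)
  count -= limit
}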
Sandeep Purohit answered Sep 23 '22