 

Scala: How to get a range of rows in a dataframe

I have a DataFrame created by reading a Parquet file with sqlContext.read.

The DataFrame consists of 300M rows. I need to use these rows as input to another function, but I want to do it in smaller batches to prevent an OOM error.

Currently, I am using df.head(1000000) to read the first 1M rows, but I cannot find a way to read the subsequent rows. I tried df.collect(), but it gives me a Java OOM error.

I want to iterate over this dataframe. I tried adding another column with the withColumn() API to generate a unique set of values to iterate over, but none of the existing columns in the dataframe have solely unique values.

For example, I tried val df = df1.withColumn("newColumn", df1("col") + 1) as well as val df = df1.withColumn("newColumn", lit(i+=1)), neither of which returns a sequential set of values.

Is there any other way to get the first n rows of a DataFrame and then the next n rows, something that works like the range function of SQLContext?
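For illustration, one possible way to get this kind of range-based access is to attach a sequential index with zipWithIndex and then filter on it. This is only a sketch; the row_idx column name and the batch bounds below are placeholders:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach a stable sequential index to every row, then filter by range.
val indexedRdd = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
val indexedSchema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
val indexedDf = sqlContext.createDataFrame(indexedRdd, indexedSchema)

// Second batch of 1M rows: indices 1,000,000 to 1,999,999
val batch = indexedDf.filter(indexedDf("row_idx") >= 1000000L && indexedDf("row_idx") < 2000000L)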

newbie_learner asked Sep 02 '16


People also ask

How do I iterate through Spark DataFrame rows?

For looping through each row using map(), first convert the PySpark DataFrame into an RDD, because map() can only be performed on RDDs. Then call map() with a lambda function that processes each row; the result is a new RDD, which can be converted back into a DataFrame if needed.
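In Scala, the same pattern looks roughly like this (someColumn is a placeholder column name):

// Drop to the RDD, transform each Row with map(), and collect or convert back as needed.
val mapped = df.rdd.map { row =>
  row.getAs[String]("someColumn")
}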

How do you select a range of rows from a PySpark DataFrame?

By using a SQL query with the between() operator, we can get a range of rows.
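For example, assuming the DataFrame has a numeric id column to filter on, a Scala sketch could look like this (the table name t is also a placeholder):

// Both forms return the rows whose id falls in [100, 200].
val rangeDf = df.filter(df("id").between(100, 200))

df.registerTempTable("t")
val rangeDf2 = sqlContext.sql("SELECT * FROM t WHERE id BETWEEN 100 AND 200")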

How do you select 10 rows in PySpark DataFrame?

In Spark/PySpark, you can use the show() action to get the top/first N (5, 10, 100, ...) rows of the DataFrame and display them on a console or in a log. There are also several Spark actions, such as take(), tail(), collect(), head(), and first(), that return the top or last n rows as a list of Rows (Array[Row] for Scala).
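For example, in Scala (a sketch, assuming df is an existing DataFrame):

df.show(10)                  // prints the first 10 rows to the console
val firstTen = df.take(10)   // Array[Row] with the first 10 rows
val headTen  = df.head(10)   // equivalent to take(10)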

How do I find the number of rows in a DataFrame Spark?

To get the number of rows in a PySpark DataFrame, use the count() function. It returns the total number of rows in the DataFrame.
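For example, in Scala:

val total = df.count()       // total number of rows, returned as a Long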


1 Answer

You can simply use the limit and except APIs of Dataset or DataFrame as follows:

long count = df.count();
int limit = 50;
while (count > 0) {
    Dataset<Row> df1 = df.limit(limit);   // next batch of up to 50 rows
    df1.show();                           // prints the first 50 rows, then the next 50, and so on
    df = df.except(df1);                  // drop the rows that were just processed
    count = count - limit;
}
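A rough Scala equivalent of the same idea (variable names are placeholders; note that except() rescans the remaining data on every iteration, so this can be slow on very large inputs):

var remaining = df
var count = remaining.count()
val limit = 50

while (count > 0) {
  val batch = remaining.limit(limit)
  batch.show()                        // first 50 rows, then the next 50, and so on
  remaining = remaining.except(batch)
  count -= limit
}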
Sandeep Purohit answered Sep 23 '22