
Drop first row of Spark DataFrame

I have a variable rawData of type DataFrame in my Spark/Scala code.

I would like to drop the first element, something like this:

rawData.drop(1)

However, the drop function is not available.

What's the simplest way of dropping the first element?

asked Jul 12 '16 19:07 by octavian




1 Answer

To answer the question, we must first clarify what exactly the first element of a DataFrame is. A DataFrame is not an ordered collection held on a single machine; it is a distributed collection with no particular order between partitions, so the answer is not obvious.

If you want to drop the first element from every partition, you can use:

// Works on the underlying RDD[Row], so no Encoder is needed; the result is an RDD[Row].
df.rdd.mapPartitions(iterator => iterator.drop(1))

If you want to drop the first element from the first partition only, you can use:

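// Drop the first row of partition 0 only; all other partitions pass through unchanged.
val rdd = df.rdd.mapPartitionsWithIndex {
  case (index, iterator) => if (index == 0) iterator.drop(1) else iterator
}
// Rebuild a DataFrame from the remaining rows, reusing the original schema.
val result = sqlContext.createDataFrame(rdd, df.schema)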
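
The difference between the two variants above can be seen without a cluster by modeling partitions as plain Scala collections. This is only a sketch under that assumption: the names `DropFirstSketch`, `partitions`, `dropFirstOfEach`, and `dropFirstOfFirst` are hypothetical, and a nested `Vector` stands in for the RDD's partitions.

```scala
object DropFirstSketch {
  // Hypothetical stand-in for an RDD: each inner Vector plays the role of one partition.
  val partitions: Vector[Vector[Int]] =
    Vector(Vector(1, 2, 3), Vector(4, 5), Vector(6))

  // Mirrors mapPartitions(_.drop(1)): every partition loses its first element.
  def dropFirstOfEach[A](parts: Vector[Vector[A]]): Vector[Vector[A]] =
    parts.map(_.iterator.drop(1).toVector)

  // Mirrors mapPartitionsWithIndex: only partition 0 loses its first element.
  def dropFirstOfFirst[A](parts: Vector[Vector[A]]): Vector[Vector[A]] =
    parts.zipWithIndex.map {
      case (part, index) => if (index == 0) part.drop(1) else part
    }

  def main(args: Array[String]): Unit = {
    println(dropFirstOfEach(partitions).flatten)  // Vector(2, 3, 5)
    println(dropFirstOfFirst(partitions).flatten) // Vector(2, 3, 4, 5, 6)
  }
}
```

With three partitions, the first variant removes three rows in total, while the second removes exactly one.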

Neither solution is very elegant, and both seem like bad practice. It would be interesting to know the complete use case; there may be a better approach.
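
One commonly used alternative, which is not part of the answer as originally written, is to index the rows with `RDD.zipWithIndex` and filter out index 0. The sketch below assumes Spark 2.x or later (it uses `SparkSession` rather than `sqlContext`); `DropFirstRow` and `dropFirst` are hypothetical names. "First" still means first in partition order, so the result is only meaningful if the DataFrame's row order is.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropFirstRow {
  // Drops the row that zipWithIndex assigns index 0, i.e. the first row
  // in partition order (partition index first, then position within it).
  def dropFirst(spark: SparkSession, df: DataFrame): DataFrame = {
    val rows = df.rdd
      .zipWithIndex()                          // RDD[(Row, Long)]
      .filter { case (_, index) => index > 0 } // keep everything except index 0
      .map { case (row, _) => row }
    spark.createDataFrame(rows, df.schema)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local single-threaded session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("drop-first-row")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("value")
    dropFirst(spark, df).show()
    spark.stop()
  }
}
```

Unlike the partition-index check, this still removes one row when partition 0 happens to be empty, at the cost of an extra Spark job that `zipWithIndex` runs when the RDD has more than one partition.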

answered Oct 12 '22 23:10 by Michael Kopaniov