I have a variable rawData
of type DataFrame in my Spark/Scala code.
I would like to drop the first element, something like this:
rawData.drop(1)
However, no such drop
function is available (DataFrame.drop removes columns, not rows).
What's the simplest way of dropping the first element?
To answer the question, we must first clarify what exactly the "first element" of a DataFrame is: we are not dealing with an ordered collection placed on a single machine, but with a distributed collection with no particular order between partitions, so the answer is not obvious.
In case you want to drop the first element from every partition you can use:
sqlContext.createDataFrame(df.rdd.mapPartitions(_.drop(1)), df.schema)
In case you want to drop the first element from the first partition only, you can use:
val rdd = df.rdd.mapPartitionsWithIndex {
  case (index, iterator) => if (index == 0) iterator.drop(1) else iterator
}
sqlContext.createDataFrame(rdd, df.schema)
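If "first" should instead mean the first row in the RDD's partition order, wherever it happens to live, a third sketch (assuming the same df and sqlContext as above) is to tag every row with a global index via zipWithIndex and filter out index 0:

```scala
// Sketch: drop the row at global index 0, assuming the RDD's
// partition order is what defines "first" for this DataFrame.
val withoutFirst = sqlContext.createDataFrame(
  df.rdd
    .zipWithIndex()                       // (Row, Long) pairs with a global index
    .filter { case (_, idx) => idx > 0 }  // keep everything but index 0
    .map { case (row, _) => row },        // back to Rows
  df.schema
)
```

Note that zipWithIndex triggers an extra Spark job to compute partition sizes, so this costs more than the per-partition variants.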
Neither solution is very graceful, and both feel like bad practice; it would be interesting to know the complete use case, since there may be a better approach.