How do I skip a header from CSV files in Spark?

Tags: csv, scala, apache-spark

Suppose I give three file paths to a Spark context to read, and each file has its schema in the first row. How can we skip these schema (header) lines?

val rdd=sc.textFile("file1,file2,file3")

Now, how can we skip header lines from this rdd?

asked Jan 09 '15 06:01

Hafiz Mujadid

4 Answers

val data = sc.textFile("path_to_data")
val header = data.first()                       // extract the header line
val rows = data.filter(row => row != header)    // filter out every line equal to the header
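One property of this approach is worth spelling out: the filter compares every line against the header, so it also removes the header lines of the second and third files, as long as all files share an identical header. It would equally drop any data row that happens to match the header text exactly. The following is a minimal Spark-free sketch in plain Python. The list `lines` and its sample values are made up, standing in for what `sc.textFile("file1,file2,file3")` would yield.

```python
# Hypothetical stand-in for the lines of three files read as one RDD;
# each file starts with the same header row (sample data is made up).
lines = [
    "id,name", "1,a", "2,b",   # file1
    "id,name", "3,c",          # file2
    "id,name", "4,d",          # file3
]

header = lines[0]                                  # plays the role of data.first()
rows = [line for line in lines if line != header]  # plays the role of data.filter(...)

print(rows)
```

If the files could have different headers, this comparison would only remove the first file's header, and a per-file approach would be needed instead.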

answered Oct 02 '22 16:10

Jimmy

If there were just one header line in the first record, then the most efficient way to filter it out would be:

rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}

Of course, this doesn't help if there are many files, each with its own header line, because only the first line of the first partition is dropped. In that case you can build one RDD per file this way and union the three.

You could also just write a filter that matches only a line that could be a header. This is quite simple, but less efficient.

Python equivalent:

from itertools import islice

rdd.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it
)
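To make the multi-file caveat concrete, here is a small simulation that needs no Spark installation. It is only a sketch: `map_partitions_with_index` is a hypothetical helper that mimics the RDD method on a plain list of lists, the sample rows are made up, and each file is assumed to occupy exactly one partition.

```python
from itertools import chain, islice


def map_partitions_with_index(partitions, f):
    """Hypothetical stand-in for RDD.mapPartitionsWithIndex on a list of lists."""
    return [list(f(idx, iter(part))) for idx, part in enumerate(partitions)]


def drop_first_of_partition_zero(idx, it):
    # Same logic as the answer: skip one element, but only in partition 0.
    return islice(it, 1, None) if idx == 0 else it


# Made-up sample data: each "file" is a list of partitions (one partition each).
file1 = [["id,name", "1,a", "2,b"]]
file2 = [["id,name", "3,c"]]
file3 = [["id,name", "4,d"]]

# All files read as one dataset: only partition 0 loses its header.
combined = map_partitions_with_index(
    file1 + file2 + file3, drop_first_of_partition_zero
)
combined_rows = list(chain.from_iterable(combined))

# Each file processed on its own, then concatenated (the "union" idea).
per_file = [
    map_partitions_with_index(f, drop_first_of_partition_zero)
    for f in (file1, file2, file3)
]
union_rows = [row for parts in per_file for part in parts for row in part]

print(combined_rows)
print(union_rows)
```

In the combined case the headers of the second and third files survive, while the per-file version removes all three.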

answered Oct 02 '22 17:10

Sean Owen

As of Spark 2.0, a CSV reader is built into Spark, so you can easily load a CSV file as follows:

spark.read.option("header","true").csv("filePath")

answered Oct 02 '22 17:10

Sandeep Purohit

From Spark 2.0 onwards, you can use SparkSession to do this in one line. First create the session:

val spark = SparkSession.builder.config(conf).getOrCreate()

and then as @SandeepPurohit said:

val dataFrame = spark.read.format("CSV").option("header","true").load(csvfilePath)

I hope this answers your question!

P.S.: SparkSession is the new entry point introduced in Spark 2.0. It lives in the spark-sql module, as org.apache.spark.sql.SparkSession.

answered Oct 02 '22 15:10

Shivansh
