How to strip headers from all files in RDD, where RDD = sc.textFile("s3n://bucket/*.csv")?

Question

I am trying to find the best way to do this. The only approach I can think of is to read the headers from all files into an array and then filter those header lines out of the RDD.

Is there a simpler way?

NOTE: I am reading all CSV files from an S3 bucket, and the files all have different headers.

Dan Osipov · Accepted Answer

One option is to use Spark SQL with the spark-csv package, which can load CSV files and has an option to treat the first line as a header instead of data. Take a look: https://github.com/databricks/spark-csv. From its documentation:

header: when set to true the first line of files will be used to name columns and will not be included in data. All types will be assumed string. Default value is false.
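
For reference, here is a minimal PySpark sketch of two ways this could look. It is an illustration under assumptions, not a tested recipe for the asker's bucket. The helper names `drop_header`, `strip_headers`, and `load_with_spark_csv` are hypothetical. The second function uses the spark-csv reader mentioned above and assumes the package is available to the Spark job. The first stays with plain RDDs by using `wholeTextFiles`, which is a different technique from the one in the answer: it yields one `(path, content)` pair per file, so the first line of each file can be dropped without collecting the headers first.

```python
def drop_header(path_and_content):
    """Return the data lines of one file, skipping its first (header) line.

    Expects a (path, content) pair, the record shape produced by
    SparkContext.wholeTextFiles.
    """
    _path, content = path_and_content
    return content.splitlines()[1:]


def strip_headers(sc, pattern):
    # Hypothetical helper: each file arrives as a single record, so the
    # header of every file is dropped independently of the others.
    return sc.wholeTextFiles(pattern).flatMap(drop_header)


def load_with_spark_csv(sql_context, pattern):
    # Hypothetical helper for the spark-csv route; assumes the
    # com.databricks:spark-csv package is on the classpath.
    return (sql_context.read
            .format("com.databricks.spark.csv")
            .option("header", "true")
            .load(pattern))
```

With an existing SparkContext `sc`, `strip_headers(sc, "s3n://bucket/*.csv")` would return an RDD of data lines only. Note that `wholeTextFiles` holds each file in memory as one string, so it is better suited to many small files than to a few very large ones.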

Tags:

csv

header

amazon-s3

apache-spark

rdd

3xCh1_23
