I am trying to find the best way to do this, but the only approach I can think of is to read the headers from all of the files into an array and then filter those header lines out of the RDD.
Is there a simpler way?
NOTE: I am reading all of the CSV files from an S3 bucket, and each file has a different header.
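One option is to use Spark SQL with the spark-csv package, which can treat the first line of a CSV file as a header so that it is not included in the data. Take a look: https://github.com/databricks/spark-csv. Its documentation describes the relevant option as follows: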
header: when set to true the first line of files will be used to name columns and will not be included in data. All types will be assumed string. Default value is false.
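As a minimal sketch of how that option might be used from PySpark: the helper name `load_csv_skipping_header` and the example path are hypothetical, and `sql_context` is assumed to be a `pyspark.sql.SQLContext` started with the spark-csv package available. Since every file here has a different header, it is worth checking on a small sample that each file's header line is actually dropped with the version you are running.

```python
def load_csv_skipping_header(sql_context, path):
    """Load CSV files as a DataFrame using spark-csv's "header" option.

    Assumptions: sql_context is a pyspark.sql.SQLContext with the spark-csv
    package on the classpath; path is any location Spark can read, for
    example a hypothetical "s3n://my-bucket/data/*.csv".
    """
    return (
        sql_context.read
        .format("com.databricks.spark.csv")
        # "header" = "true": the first line names the columns and is not
        # returned as a data row; all columns are read as strings.
        .option("header", "true")
        .load(path)
    )
```

If an RDD is still needed afterwards, the returned DataFrame exposes one through its `rdd` attribute, e.g. `load_csv_skipping_header(sqlContext, path).rdd`.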