I am reading a dataset as follows:
f = sc.textFile("s3://test/abc.csv")
My file contains 50+ fields and I want to assign a column header to each field so I can reference them later in my script.
How do I do that in PySpark? Is a DataFrame the way to go here?
PS - I'm a newbie to Spark.
The solution really depends on the version of Spark you are running. Assuming you are on Spark 2.0+, you can read the CSV in as a DataFrame and name its columns with toDF, which works both when converting an RDD to a DataFrame and when renaming the columns of an existing DataFrame.
filename = "/path/to/file.csv"
# toDF takes one name per column in the file, in order
df = spark.read.csv(filename).toDF("col1", "col2", "col3")
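With 50+ fields, it is usually cleaner to keep the names in a list and unpack it into toDF rather than typing them inline. A minimal sketch, assuming the s3 path from the question and hypothetical placeholder names (substitute your real field names):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-headers").getOrCreate()

# Hypothetical header names; replace with your actual 50+ field names.
column_names = ["col{}".format(i) for i in range(1, 51)]

# Read the CSV and apply one name per column via unpacking.
df = spark.read.csv("s3://test/abc.csv").toDF(*column_names)

# Columns can now be referenced by name later in the script.
df.select("col1", "col2").show(5)

Note that toDF requires exactly as many names as the DataFrame has columns. If the file already has a header row, spark.read.csv(filename, header=True) will pick up the existing names instead.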