How to assign and use column headers in Spark?

I am reading a dataset as below:

 f = sc.textFile("s3://test/abc.csv")

My file contains 50+ fields and I want to assign a column header to each field so I can reference them later in my script.

How do I do that in PySpark? Is a DataFrame the way to go here?

PS - Newbie to Spark.

asked Apr 13 '16 by GoldenPlatinum
1 Answer

The solution to this question really depends on the version of Spark you are running. Assuming you are on Spark 2.0+, you can read the CSV in as a DataFrame and assign column names with toDF, which works both for converting an RDD to a DataFrame and for renaming the columns of an existing DataFrame.

filename = "/path/to/file.csv"
df = spark.read.csv(filename).toDF("col1","col2","col3")
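
As a minimal sketch of how this plays out (the path and column names below are placeholders, and header=True is a standard spark.read.csv option): if the file's first row already contains the header names, Spark can pick them up directly; otherwise assign them with toDF, and either way the columns become addressable by name.

filename = "s3://test/abc.csv"

# Option 1: the CSV's first row holds the headers.
df = spark.read.csv(filename, header=True)

# Option 2: no header row, so assign names yourself (only three shown;
# with 50+ fields you would list them all, or build the list programmatically).
df = spark.read.csv(filename).toDF("col1", "col2", "col3")

# Columns can then be referenced by name in later transformations.
df.select("col1").show()
df.filter(df["col2"].isNotNull()).show()

The textFile RDD from the question can be converted the same way after splitting each line, e.g. rdd.map(lambda line: line.split(",")).toDF(["col1", "col2", "col3"]), once a SparkSession is available (on Spark < 2.0, you would go through a SQLContext and createDataFrame instead).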
answered Oct 05 '22 by BushMinusZero