I am reading a dataset as follows:
f = sc.textFile("s3://test/abc.csv")
My file contains 50+ fields and I want to assign a column header to each field so I can reference them later in my script.
How do I do that in PySpark? Is a DataFrame the way to go here?
PS - I'm a newbie to Spark.
The solution really depends on the version of Spark you are running. Assuming you are on Spark 2.0+, you can read the CSV in as a DataFrame and name its columns with toDF, which works both when converting an RDD to a DataFrame and when renaming the columns of an existing DataFrame.
filename = "/path/to/file.csv"
# toDF takes one name per column in the file, in order
df = spark.read.csv(filename).toDF("col1", "col2", "col3")
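With 50+ fields, it is usually cleaner to keep the names in a list and unpack it into toDF rather than typing them inline. A minimal sketch, assuming the s3 path from the question and hypothetical placeholder names (substitute your real field names):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-headers").getOrCreate()

# Hypothetical header names; replace with your actual 50+ field names.
column_names = ["col{}".format(i) for i in range(1, 51)]

# Read the CSV and apply one name per column via unpacking.
df = spark.read.csv("s3://test/abc.csv").toDF(*column_names)

# Columns can now be referenced by name later in the script.
df.select("col1", "col2").show(5)

Note that toDF requires exactly as many names as the DataFrame has columns. If the file already has a header row, spark.read.csv(filename, header=True) will pick up the existing names instead.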