RDD to DataFrame in pyspark (columns from rdd's first element)

Question

I have created an RDD from a CSV file, and the first row of that file is the header line. Now I want to create a DataFrame from that RDD, using the first element of the RDD as the column names.

The problem: I can create the DataFrame with columns taken from rdd.first(), but the resulting DataFrame also contains the header as its first row. How do I remove it?

lines = sc.textFile('/path/data.csv')
rdd = lines.map(lambda x: x.split('#####'))  # the separator is multi-character (it can also be #### or #@#), so I can't read the CSV directly into a DataFrame
# rdd: [[u'mailid', u'age', u'address'], [u'satya', u'23', u'Mumbai'], [u'abc', u'27', u'Goa']]  # the first element is the header
df = rdd.toDF(rdd.first())  # take the column names from rdd.first()
df.show()
#mailid  age  address
 mailid  age  address   ####I don't want this as dataframe data
 satya    23  Mumbai
 abc      27  Goa

How can I keep that first element out of the DataFrame's data? Is there an option I can pass to rdd.toDF(rdd.first()) to do this?

Note: I can't collect the RDD into a list, remove the first item, parallelize the list back into an RDD, and then call toDF().

Please suggest. Thanks!

eliasah · Accepted Answer

You will have to remove the header from your RDD. One way to do it, using your rdd variable, is the following:

>>> header = rdd.first()
>>> header
# ['mailid', 'age', 'address']
>>> data = rdd.filter(lambda row : row != header).toDF(header)
>>> data.show()
# +------+---+-------+
# |mailid|age|address|
# +------+---+-------+
# | satya| 23| Mumbai|
# |   abc| 27|    Goa|
# +------+---+-------+
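
Note that the filter above compares every row with the header, so a data row that happens to be identical to the header would be dropped as well. As a hedged sketch of an alternative, the header can be removed by position with zipWithIndex instead. The helper names split_fields and split_header, and the SEPARATOR pattern, are hypothetical and not part of the original answer; the separator set is only assumed from the comment in the question.

```python
import re

# Assumed separator set, taken from the question's comment.
# Longer alternatives come first so '#####' is not split as '####' + '#'.
SEPARATOR = re.compile(r"#####|####|#@#")


def split_fields(line):
    """Split one text line on any of the assumed separators."""
    return SEPARATOR.split(line)


def split_header(rdd):
    """Return (header, rows): the first element and an RDD of all other elements."""
    header = rdd.first()
    rows = (
        rdd.zipWithIndex()                   # pairs of (row, position)
        .filter(lambda pair: pair[1] > 0)    # drop only the element at position 0
        .keys()                              # back to plain rows
    )
    return header, rows


# Usage, assuming the `sc` and the path from the question:
# lines = sc.textFile('/path/data.csv')
# header, rows = split_header(lines.map(split_fields))
# df = rows.toDF(header)
```

One trade-off to keep in mind: zipWithIndex has to run a Spark job to count elements when the RDD has more than one partition, so for a file with a single header line the simple filter may still be the cheaper choice.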

Tags: python-2.7, apache-spark, rdd, pyspark, pyspark-sql

Satya
