
RDD to DataFrame in PySpark (columns from the RDD's first element)

I have created an RDD from a CSV file whose first row is the header line. Now I want to create a DataFrame from that RDD and take the column names from the RDD's first element.

The problem is that I can create the DataFrame with the column names from rdd.first(), but the resulting DataFrame also contains the header as its first data row. How can I remove it?

lines = sc.textFile('/path/data.csv')
# The separator can be several characters (#### or #@#), so I can't read the CSV directly into a DataFrame.
rdd = lines.map(lambda x: x.split('#####'))
# rdd: [[u'mailid', u'age', u'address'], [u'satya', u'23', u'Mumbai'], [u'abc', u'27', u'Goa']]  # first element is the header
df = rdd.toDF(rdd.first())  # take the column names from rdd.first()
df.show()
# mailid  age  address
# mailid  age  address   <- I don't want this row as DataFrame data
# satya    23  Mumbai
# abc      27  Goa

How can I keep that first element from ending up in the DataFrame's data? Is there an option I can pass to rdd.toDF(rdd.first()) to do that?

Note: I can't collect the RDD into a list, remove the first item from that list, parallelize the list back into an RDD, and then call toDF().

Please suggest. Thanks!

Satya asked Feb 05 '23 20:02


1 Answer

You will have to remove the header from your RDD. One way to do it, using your rdd variable, is the following:

>>> header = rdd.first()
>>> header
# ['mailid', 'age', 'address']
>>> data = rdd.filter(lambda row: row != header).toDF(header)
>>> data.show()
# +------+---+-------+
# |mailid|age|address|
# +------+---+-------+
# | satya| 23| Mumbai|
# |   abc| 27|    Goa|
# +------+---+-------+ 
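
Note that the filter above compares every row with the header, so a data row that happens to be identical to the header would be dropped as well. As a rough alternative sketch, the header can be dropped by position instead, assuming the RDD comes from a single file so that its first line sits at the start of partition 0. The names split_line, drop_first_line and rdd_to_df are made up for this illustration, and the separator pattern is only a guess based on the separators mentioned in the question:

```python
import re
from itertools import islice

# Assumed separators, based on the question: '#####', '####' or '#@#'.
SEPARATOR = re.compile(r"#{4,5}|#@#")


def split_line(line):
    """Split one raw text line on any of the assumed separators."""
    return SEPARATOR.split(line)


def drop_first_line(partition_index, rows):
    """Skip the first row of partition 0 and pass other partitions through.

    Assumes a single input file, so the header is the first row of partition 0.
    """
    return islice(rows, 1, None) if partition_index == 0 else rows


def rdd_to_df(lines):
    """Hypothetical helper: `lines` is the RDD returned by sc.textFile(...)."""
    rdd = lines.map(split_line)
    header = rdd.first()  # column names come from the first row
    return rdd.mapPartitionsWithIndex(drop_first_line).toDF(header)
```

With the variables from the question this would be called as rdd_to_df(sc.textFile('/path/data.csv')). Nothing is collected to the driver except the single header row fetched by first().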
eliasah answered Feb 08 '23 15:02