
PySpark - passing list/tuple to toDF function

I have a dataframe and want to rename its columns using toDF, passing the column names from a list. The column list is dynamic. When I do it as below I get an error. How can I achieve this?

>>> df.printSchema()
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- dept: string (nullable = true)

columns = ['NAME_FIRST', 'DEPT_NAME']

df2 = df.toDF('ID', 'NAME_FIRST', 'DEPT_NAME')
(or) 
df2 = df.toDF('id', columns[0], columns[1])

This does not work if we don't know how many columns the input dataframe will have, so I want to pass the list to toDF. I tried the following:

df2 = df.toDF('id', columns)
pyspark.sql.utils.IllegalArgumentException: u"requirement failed: The number of columns doesn't match.\nOld column names (3): id, name, dept\nNew column names (2): id, name_first, dept_name"

Here it treats the list as a single item. How do I pass the columns from the list?

user491 asked May 02 '17 21:05



1 Answer

df2 = df.toDF(columns) does not work. Add a * before the list, as below:

columns = ['NAME_FIRST', 'DEPT_NAME']

# 'id' plus the two unpacked names gives three names for the three columns of df
df2 = df.toDF('id', *columns)

"*" is the "splat" (unpacking) operator: it takes a list, or any other iterable, and expands it into individual positional arguments in the function call.
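
To see the difference without a Spark session, here is a minimal sketch in plain Python. Note that rename_all is a hypothetical stand-in for a variadic method such as toDF, not part of PySpark; it only returns the arguments it received, so you can count them.

```python
def rename_all(*cols):
    """Hypothetical stand-in for a variadic method like DataFrame.toDF(*cols)."""
    # Inside the function, cols is a tuple of the positional arguments.
    return cols


columns = ['NAME_FIRST', 'DEPT_NAME']

# Without *, the whole list arrives as one argument: 2 arguments in total.
without_star = rename_all('id', columns)

# With *, each list element becomes its own argument: 3 arguments in total.
with_star = rename_all('id', *columns)

print(len(without_star), without_star)
print(len(with_star), with_star)
```

Applied to the question, df.toDF('id', *columns) therefore passes three names, which matches the three columns of df, while df.toDF('id', columns) passes only two arguments, one of them a list.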

Pushkr answered Feb 07 '23 03:02
