I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful with this simple command:
df.columns = new_column_name_list
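For example, the whole flow in pandas is just (a minimal sketch, assuming a headerless, tab-delimited data.txt and a new_column_name_list of matching length):
import pandas as pd

# read a headerless, tab-delimited file and rename every column at once
df = pd.read_csv("data.txt", sep="\t", header=None)
df.columns = new_column_name_list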
However, the same doesn't work for a PySpark dataframe created using sqlContext. The only way I could figure out to do this is the following:
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
# rename each field of the inferred schema in place
for i, k in enumerate(oldSchema.fields):
    k.name = new_column_name_list[i]
# load the file a second time, now with the renamed schema
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)
This is basically defining the variable twice: first inferring the schema, then renaming the columns, and then loading the dataframe again with the updated schema.
Is there a better and more efficient way to do this like we do in pandas?
My Spark version is 1.5.0.
There are many ways to do that:
Option 1. Using selectExpr.
data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)],
                                  ["Name", "askdaosdka"])
data.show()
data.printSchema()
# Output
#+-------+----------+
#| Name|askdaosdka|
#+-------+----------+
#|Alberto| 2|
#| Dakota| 2|
#+-------+----------+
#root
# |-- Name: string (nullable = true)
# |-- askdaosdka: long (nullable = true)
df = data.selectExpr("Name as name", "askdaosdka as age")
df.show()
df.printSchema()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
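For a larger number of columns you can build the "old as new" expressions programmatically (a sketch, assuming a new_names list in the same order as data.columns):
new_names = ["name", "age"]
# one "old as new" expression per column
exprs = ["{} as {}".format(old, new) for old, new in zip(data.columns, new_names)]
df = data.selectExpr(*exprs)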
Option 2. Using withColumnRenamed; notice that this method allows you to "overwrite" the same column. (The code below uses range; in Python 2 you can use xrange instead.)
from functools import reduce
oldColumns = data.schema.names
newColumns = ["name", "age"]
df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]),
            range(len(oldColumns)), data)
df.printSchema()
df.show()
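The reduce is equivalent to a plain loop over the name pairs, which some find easier to read (same technique, just unrolled):
df = data
for old, new in zip(oldColumns, newColumns):
    df = df.withColumnRenamed(old, new)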
Option 3. Using alias; in Scala you can also use as.
from pyspark.sql.functions import col
data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
data.show()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
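If there are many columns, the same alias idea generalizes with a list comprehension (a sketch, assuming the old and new names line up positionally):
from pyspark.sql.functions import col

newColumns = ["name", "age"]
# alias every column in one select
data = data.select(*[col(old).alias(new) for old, new in zip(data.columns, newColumns)])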
Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.
sqlContext.registerDataFrameAsTable(data, "myTable")
df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
df2.show()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+