I have a Spark DataFrame with whitespace in some of the column names, which has to be replaced with underscores.
I know a single column can be renamed using withColumnRenamed()
in Spark SQL, but to rename 'n' columns, this function has to be chained 'n' times (to my knowledge).
To automate this, I have tried:
val old_names = df.columns // array of existing column names
val new_names = old_names.map { x =>
  if (x.contains(" ")) x.replaceAll("\\s", "_")
  else x
} // array of new column names with whitespace replaced
Now, how do I replace df's header with new_names?
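For reference, the substitution I am after can be sketched in plain Python, independent of Spark (clean_names is a hypothetical helper name, not Spark API):

```python
import re

def clean_names(names):
    """Replace each whitespace character in every name with an underscore."""
    return [re.sub(r"\s", "_", name) for name in names]

old_names = ["id name", "cities venezuela", "ok"]
new_names = clean_names(old_names)
# new_names == ["id_name", "cities_venezuela", "ok"]
```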
As a best practice, you should prefer expressions and immutability: use val rather than var as much as possible. Thus, it's preferable to use the foldLeft operator, in this case:
val newDf = df.columns
.foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))
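The fold pattern itself can be illustrated without Spark. Below is a minimal pure-Python sketch where functools.reduce plays the role of foldLeft, and a plain dict (toy_df, an assumption for illustration) stands in for the DataFrame:

```python
from functools import reduce

# Toy stand-in for a DataFrame: a dict mapping column names to values.
toy_df = {"id name": [1, 2], "cities venezuela": ["Caracas", "Merida"]}

def with_column_renamed(df, old, new):
    """Return a copy of the toy frame with one column renamed."""
    return {(new if k == old else k): v for k, v in df.items()}

# Fold over the column names, threading the frame through each rename,
# just as foldLeft threads the DataFrame through withColumnRenamed above.
renamed = reduce(
    lambda curr, name: with_column_renamed(curr, name, name.replace(" ", "_")),
    list(toy_df),
    toy_df,
)
# renamed keys: ["id_name", "cities_venezuela"]
```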
In Python, this can be done with the following code:
# Importing sql types
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import col
# Building a simple dataframe:
schema = StructType([
    StructField("id name", StringType(), True),
    StructField("cities venezuela", StringType(), True)
])
column1 = ['A', 'A', 'B', 'B', 'C', 'B']
column2 = ['Maracaibo', 'Valencia', 'Caracas', 'Barcelona', 'Barquisimeto', 'Merida']
# Dataframe:
df = sqlContext.createDataFrame(list(zip(column1, column2)), schema=schema)
df.show()
exprs = [col(column).alias(column.replace(' ', '_')) for column in df.columns]
df.select(*exprs).show()
You can do the exact same thing in Python:
raw_data1 = raw_data
for col in raw_data.columns:
    raw_data1 = raw_data1.withColumnRenamed(col, col.replace(" ", "_"))
Or, in Scala:
var newDf = df
for (col <- df.columns) {
  newDf = newDf.withColumnRenamed(col, col.replaceAll("\\s", "_"))
}
You can encapsulate this in a method so it doesn't pollute your code too much.
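Encapsulating the loop might look like the following Python sketch; replace_whitespace_in_columns and the FakeFrame stand-in are assumptions for illustration, not Spark API. The helper only relies on a pyspark-style .columns attribute and .withColumnRenamed(old, new) method:

```python
import re

def replace_whitespace_in_columns(df):
    """Rename every column of df, replacing whitespace with underscores.

    df can be any object exposing pyspark-style `.columns` and
    `.withColumnRenamed(old, new)`.
    """
    for name in df.columns:
        df = df.withColumnRenamed(name, re.sub(r"\s", "_", name))
    return df

# Tiny stand-in used only to demonstrate the helper without a Spark session.
class FakeFrame:
    def __init__(self, columns):
        self.columns = list(columns)

    def withColumnRenamed(self, old, new):
        return FakeFrame([new if c == old else c for c in self.columns])

cleaned = replace_whitespace_in_columns(FakeFrame(["id name", "cities venezuela"]))
# cleaned.columns == ["id_name", "cities_venezuela"]
```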