I have a Spark DataFrame with whitespace in some of the column names, which has to be replaced with underscores.
I know a single column can be renamed using withColumnRenamed()
in Spark SQL, but to rename 'n' columns, this function has to be chained 'n' times (to my knowledge).
To automate this, I have tried:
val old_names = df.columns // array of existing column names
val new_names = old_names.map { x =>
  if (x.contains(" ")) x.replaceAll("\\s", "_")
  else x
} // array of new column names with whitespace replaced
Now, how do I replace df's header with new_names?
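For reference, the substitution I am after can be sketched in plain Python, independent of Spark (clean_names is a hypothetical helper name, not Spark API):

```python
import re

def clean_names(names):
    """Replace each whitespace character in every name with an underscore."""
    return [re.sub(r"\s", "_", name) for name in names]

old_names = ["id name", "cities venezuela", "ok"]
new_names = clean_names(old_names)
# new_names == ["id_name", "cities_venezuela", "ok"]
```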
As a best practice, you should prefer expressions and immutability: use val rather than var as much as possible. Thus, it's preferable to use the foldLeft operator, in this case:
val newDf = df.columns
.foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))
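The fold pattern itself can be illustrated without Spark. Below is a minimal pure-Python sketch where functools.reduce plays the role of foldLeft, and a plain dict (toy_df, an assumption for illustration) stands in for the DataFrame:

```python
from functools import reduce

# Toy stand-in for a DataFrame: a dict mapping column names to values.
toy_df = {"id name": [1, 2], "cities venezuela": ["Caracas", "Merida"]}

def with_column_renamed(df, old, new):
    """Return a copy of the toy frame with one column renamed."""
    return {(new if k == old else k): v for k, v in df.items()}

# Fold over the column names, threading the frame through each rename,
# just as foldLeft threads the DataFrame through withColumnRenamed above.
renamed = reduce(
    lambda curr, name: with_column_renamed(curr, name, name.replace(" ", "_")),
    list(toy_df),
    toy_df,
)
# renamed keys: ["id_name", "cities_venezuela"]
```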
In Python, this can be done with the following code:
# Importing sql types
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import col
# Building a simple dataframe:
schema = StructType([
    StructField("id name", StringType(), True),
    StructField("cities venezuela", StringType(), True)
])
column1 = ['A', 'A', 'B', 'B', 'C', 'B']
column2 = ['Maracaibo', 'Valencia', 'Caracas', 'Barcelona', 'Barquisimeto', 'Merida']
# Dataframe:
df = sqlContext.createDataFrame(list(zip(column1, column2)), schema=schema)
df.show()
exprs = [col(column).alias(column.replace(' ', '_')) for column in df.columns]
df.select(*exprs).show()
You can do the exact same thing in Python:
raw_data1 = raw_data
for col in raw_data.columns:
    raw_data1 = raw_data1.withColumnRenamed(col, col.replace(" ", "_"))
Or, in Scala:
var newDf = df
for (col <- df.columns) {
  newDf = newDf.withColumnRenamed(col, col.replaceAll("\\s", "_"))
}
You can encapsulate this in a method so it doesn't pollute your code too much.
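Encapsulating the loop might look like the following Python sketch; replace_whitespace_in_columns and the FakeFrame stand-in are assumptions for illustration, not Spark API. The helper only relies on a pyspark-style .columns attribute and .withColumnRenamed(old, new) method:

```python
import re

def replace_whitespace_in_columns(df):
    """Rename every column of df, replacing whitespace with underscores.

    df can be any object exposing pyspark-style `.columns` and
    `.withColumnRenamed(old, new)`.
    """
    for name in df.columns:
        df = df.withColumnRenamed(name, re.sub(r"\s", "_", name))
    return df

# Tiny stand-in used only to demonstrate the helper without a Spark session.
class FakeFrame:
    def __init__(self, columns):
        self.columns = list(columns)

    def withColumnRenamed(self, old, new):
        return FakeFrame([new if c == old else c for c in self.columns])

cleaned = replace_whitespace_in_columns(FakeFrame(["id name", "cities venezuela"]))
# cleaned.columns == ["id_name", "cities_venezuela"]
```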