Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replacing whitespace in all column names in spark Dataframe

I have spark dataframe with whitespaces in some of column names, which has to be replaced with underscore.

I know a single column can be renamed using withColumnRenamed() in sparkSQL, but to rename 'n' number of columns, this function has to chained 'n' times (to my knowledge).

To automate this, i have tried:

val old_names = df.columns()        // contains array of old column names

val new_names = old_names.map { x => 
   if(x.contains(" ") == true) 
      x.replaceAll("\\s","_") 
   else x 
}                    // array of new column names with removed whitespace.

Now, how to replace df's header with new_names

like image 895
vdep Avatar asked Mar 15 '16 17:03

vdep


4 Answers

As best practice, you should prefer expressions and immutability. You should use val and not var as much as possible.

Thus, it's preferable to use the foldLeft operator, in this case :

val newDf = df.columns
              .foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))
like image 144
kanielc Avatar answered Oct 23 '22 19:10

kanielc


In Python, this can be done by the following code:

# Importing sql types
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import col

# Building a simple dataframe:
schema = StructType([
             StructField("id name", StringType(), True),
             StructField("cities venezuela", StringType(), True)
         ])

column1 = ['A', 'A', 'B', 'B', 'C', 'B']
column2 = ['Maracaibo', 'Valencia', 'Caracas', 'Barcelona', 'Barquisimeto', 'Merida']

# Dataframe:
df = sqlContext.createDataFrame(list(zip(column1, column2)), schema=schema)
df.show()

exprs = [col(column).alias(column.replace(' ', '_')) for column in df.columns]
df.select(*exprs).show()
like image 41
Hugo Reyes Avatar answered Oct 23 '22 21:10

Hugo Reyes


You can do the exact same thing in python:

raw_data1 = raw_data
for col in raw_data.columns:
  raw_data1 = raw_data1.withColumnRenamed(col,col.replace(" ", "_"))
like image 41
Victor Kironde Avatar answered Oct 23 '22 21:10

Victor Kironde


  var newDf = df
  for(col <- df.columns){
    newDf = newDf.withColumnRenamed(col,col.replaceAll("\\s", "_"))
  }

You can encapsulate it in some method so it won't be too much pollution.

like image 37
Igor Berman Avatar answered Oct 23 '22 21:10

Igor Berman