
Pyspark: Convert column to lowercase

Tags:

pyspark

I want to convert the values in a column to lowercase. Currently, if I call the lower() method on the column, it complains that Column objects are not callable. Since SQL has a lower() function, I assume there is a native Spark solution that doesn't involve UDFs or writing any SQL.

wlad asked Nov 08 '17 12:11


2 Answers

Import lower alongside col:

from pyspark.sql.functions import lower, col 

Combine them as lower(col("bla")). In a complete query:

spark.table('bla').select(lower(col('bla')).alias('bla')) 

which is equivalent to the SQL query

SELECT lower(bla) AS bla FROM bla 

To keep the other columns, do

spark.table('foo').withColumn('bar', lower(col('bar'))) 

This approach is better than using a UDF: a Python UDF has to ship the data out to a Python worker process, which is slow (and Python itself is slow), whereas lower() is evaluated inside Spark. It is also arguably more elegant than writing the query in SQL.

wlad answered Sep 22 '22 23:09


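For an array-of-strings column (here named arr), you can use a combination of concat_ws and split: join the elements into one string, lowercase it, then split it back into an array.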

from pyspark.sql.functions import concat_ws, lower, split

df.withColumn('arr_str', lower(concat_ws('::', 'arr'))).withColumn('arr', split('arr_str', '::')).drop('arr_str')
smishra answered Sep 22 '22 23:09