Pyspark: Convert column to lowercase

Tags: pyspark

I want to convert the values in a column to lowercase. When I call lower() as a method on the column, I get an error saying that Column objects are not callable. Since SQL has a lower() function, I assume there is a native Spark solution that involves neither UDFs nor writing SQL.


asked Nov 08 '17 12:11

wlad

2 Answers

Import lower alongside col:

from pyspark.sql.functions import lower, col

Combine them together using lower(col("bla")). In a complete query:

spark.table('bla').select(lower(col('bla')).alias('bla'))

which is equivalent to the SQL query

SELECT lower(bla) AS bla FROM bla

To keep the other columns, do

spark.table('foo').withColumn('bar', lower(col('bar')))

This approach is faster than a UDF, because a UDF has to hand every value to a Python worker and read the result back, which is slow, and Python itself is slow. It is also more elegant than writing the query in SQL.

answered Sep 22 '22 23:09

wlad

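If the column holds an array of strings, you can use a combination of concat_ws and split (this assumes the separator '::' never occurs in the values):

from pyspark.sql.functions import concat_ws, lower, split

df.withColumn('arr_str', lower(concat_ws('::', 'arr'))).withColumn('arr', split('arr_str', '::')).drop('arr_str')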

answered Sep 22 '22 23:09

smishra
