Trim string column in PySpark dataframe

After creating a Spark DataFrame from a CSV file, I would like to trim a column. I've tried:

df = df.withColumn("Product", df.Product.strip())

Here df is my DataFrame and Product is one of its columns.

But I get the error:

Column object is not callable

Any suggestions?

asked Feb 02 '16 14:02

2 Answers

from pyspark.sql.functions import col, trim

df = df.withColumn("Product", trim(col("Product")))

answered Nov 12 '22 00:11

Since version 1.5, Spark SQL has provided ltrim and rtrim, which strip leading and trailing whitespace respectively (search for "trim" in the DataFrame documentation); import them from pyspark.sql.functions first. Here is an example:

from pyspark.sql import SQLContext
from pyspark.sql.functions import *

sqlContext = SQLContext(sc)

# create a dataframe - notice the extra whitespaces in the date strings
df = sqlContext.createDataFrame([(' 2015-04-08 ',' 2015-05-10 ')], ['d1', 'd2'])
df.collect()
# [Row(d1=u' 2015-04-08 ', d2=u' 2015-05-10 ')]

df = df.withColumn('d1', ltrim(df.d1))  # trim left whitespace from column d1
df.collect()
# [Row(d1=u'2015-04-08 ', d2=u' 2015-05-10 ')]

df = df.withColumn('d1', rtrim(df.d1))  # trim right whitespace from d1
df.collect()
# [Row(d1=u'2015-04-08', d2=u' 2015-05-10 ')]

answered Nov 12 '22 00:11

desertnaut

Tags:

trim

apache-spark

apache-spark-sql

pyspark

minh-hieu.pham

Maniganda Prakash
