
Trim string column in PySpark dataframe

After creating a Spark DataFrame from a CSV file, I would like to trim a column. I've tried:

df = df.withColumn("Product", df.Product.strip()) 

Here df is my DataFrame and Product is one of its columns.

But I get the error:

Column object is not callable 

Any suggestions?

minh-hieu.pham asked Feb 02 '16 14:02


People also ask

How do I apply trim on all columns in a Spark Dataframe?

You can use the DataFrame's dtypes attribute to get the list of column names along with their data types, and then apply the trim function to every string column.
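A minimal sketch of that dtypes-based approach, in case it helps. The helper names string_columns and trim_string_columns are made up for illustration and are not part of PySpark; the sketch also assumes the column names contain no dots or other characters that would need backtick quoting.

```python
def string_columns(dtypes):
    """Return the names of the columns whose type is 'string'.

    `dtypes` is a list of (name, type) pairs, i.e. the shape of DataFrame.dtypes.
    """
    return [name for name, dtype in dtypes if dtype == "string"]


def trim_string_columns(df):
    """Return a new DataFrame with every string column trimmed on both sides.

    Hypothetical helper: non-string columns are passed through unchanged and
    the original column order is kept.
    """
    # Imported here so that string_columns() stays usable without Spark.
    from pyspark.sql.functions import col, trim

    to_trim = set(string_columns(df.dtypes))
    return df.select(
        [
            trim(col(name)).alias(name) if name in to_trim else col(name)
            for name in df.columns
        ]
    )
```

With the DataFrame from the question, df = trim_string_columns(df) would trim Product along with any other string column, without having to list the columns by hand.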

How do you find the length of a string in Pyspark?

Use the length function. It computes the character length of string data or the number of bytes of binary data. The length of character data includes trailing spaces.


2 Answers

from pyspark.sql.functions import col, trim

# trim() removes both leading and trailing whitespace from the column
df = df.withColumn("Product", trim(col("Product")))

Maniganda Prakash answered Nov 12 '22 00:11


Starting from version 1.5, Spark SQL provides two specific functions for trimming white space, ltrim and rtrim (search for "trim" in the DataFrame documentation); you'll need to import them from pyspark.sql.functions first. Here is an example:

from pyspark.sql import SQLContext
from pyspark.sql.functions import ltrim, rtrim

sqlContext = SQLContext(sc)  # sc is the already running SparkContext

# create a dataframe - notice the extra whitespaces in the date strings
df = sqlContext.createDataFrame([(' 2015-04-08 ', ' 2015-05-10 ')], ['d1', 'd2'])
df.collect()
# [Row(d1=u' 2015-04-08 ', d2=u' 2015-05-10 ')]

df = df.withColumn('d1', ltrim(df.d1))  # trim left whitespace from column d1
df.collect()
# [Row(d1=u'2015-04-08 ', d2=u' 2015-05-10 ')]

df = df.withColumn('d1', rtrim(df.d1))  # trim right whitespace from d1
df.collect()
# [Row(d1=u'2015-04-08', d2=u' 2015-05-10 ')]

desertnaut answered Nov 12 '22 00:11