After creating a Spark DataFrame from a CSV file, I would like to trim a column. I've tried:
df = df.withColumn("Product", df.Product.strip())
df is my data frame, Product is a column in my table.
But I get the error:
Column object is not callable
Any suggestions?
You can use the dtypes attribute of the DataFrame API to get the list of column names along with their data types, and then apply the "trim" function to every string column.
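A minimal sketch of that approach, assuming df.dtypes yields (name, type) pairs with the type reported as 'string' for string columns. The helper names string_columns and trim_string_columns are made up for illustration and are not part of the Spark API:

```python
def string_columns(dtypes):
    """Return the names of columns whose Spark type is 'string'.

    `dtypes` is a list of (name, type) pairs, the shape of df.dtypes.
    """
    return [name for name, dtype in dtypes if dtype == "string"]


def trim_string_columns(df):
    """Trim leading and trailing whitespace from every string column of df."""
    # Imported here so string_columns() stays usable without Spark installed.
    from pyspark.sql.functions import col, trim

    for name in string_columns(df.dtypes):
        # Overwrite each string column with its trimmed version.
        df = df.withColumn(name, trim(col(name)))
    return df
```

With a DataFrame like the one in the question, the call would be df = trim_string_columns(df). Non-string columns are left untouched.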
length. Computes the character length of string data or number of bytes of binary data. The length of character data includes the trailing spaces.
from pyspark.sql.functions import col, trim
df = df.withColumn("Product", trim(col("Product")))
Starting from version 1.5, Spark SQL provides two specific functions for trimming white space, ltrim and rtrim (search for "trim" in the DataFrame documentation); you'll need to import pyspark.sql.functions first. Here is an example:
from pyspark.sql import SQLContext
from pyspark.sql.functions import ltrim, rtrim

sqlContext = SQLContext(sc)

# create a dataframe - notice the extra whitespaces in the date strings
df = sqlContext.createDataFrame([(' 2015-04-08 ',' 2015-05-10 ')], ['d1', 'd2'])
df.collect()
# [Row(d1=u' 2015-04-08 ', d2=u' 2015-05-10 ')]

# trim left whitespace from column d1
df = df.withColumn('d1', ltrim(df.d1))
df.collect()
# [Row(d1=u'2015-04-08 ', d2=u' 2015-05-10 ')]

# trim right whitespace from d1
df = df.withColumn('d1', rtrim(df.d1))
df.collect()
# [Row(d1=u'2015-04-08', d2=u' 2015-05-10 ')]