I have a data frame (business_df) of schema:
|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
| |-- element: string (containsNull = true)
|-- city: string (nullable = true)
|-- full_address: string (nullable = true)
|-- hours: struct (nullable = true)
|-- name: string (nullable = true)
I want to make a new data frame (new_df) so that the values in the 'name' column do not contain any blank spaces.
My code is:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
udf = UserDefinedFunction(lambda x: x.replace(' ', ''), StringType())
new_df = business_df.select(*[udf(column).alias(name) if column == name else column for column in business_df.columns])
new_df.registerTempTable("vegas")
new_df.printSchema()
vegas_business = sqlContext.sql("SELECT stars, name from vegas limit 10").collect()
I keep receiving this error:
NameError: global name 'replace' is not defined
What's wrong with this code?
While the problem you've described is not reproducible with the provided code, using Python UDFs to handle simple tasks like this is rather inefficient. If you simply want to remove spaces from the text, use regexp_replace:
from pyspark.sql.functions import regexp_replace, col

df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, " ")
]).toDF(["k", "v"])

# Replace every space character with an empty string
df.select(regexp_replace(col("v"), " ", "").alias("v")).show()
If you want to normalize whitespace-only values to empty strings (by stripping leading and trailing whitespace), use trim:

from pyspark.sql.functions import trim

# Removes leading and trailing whitespace from each value
df.select(trim(col("v")).alias("v")).show()
If you want to keep leading / trailing spaces on non-blank values and only blank out strings made up entirely of whitespace, you can adjust the regexp_replace pattern:

# ^\s+$ matches only values consisting entirely of whitespace
df.select(regexp_replace(col("v"), "^\\s+$", "").alias("v")).show()
As @zero323 said, you've probably shadowed the replace function somewhere. I tested your code and it works perfectly.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = sqlContext.createDataFrame([("aaa 111",), ("bbb 222",), ("ccc 333",)], ["names"])
spaceDeleteUDF = udf(lambda s: s.replace(" ", ""), StringType())
df.withColumn("names", spaceDeleteUDF("names")).show()
#+------+
#| names|
#+------+
#|aaa111|
#|bbb222|
#|ccc333|
#+------+
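For reference, here's a minimal sketch of the kind of mistake that would produce that exact error (an assumption about what happened in the asker's session, not something visible in the posted code): referencing replace as a free function instead of as a string method inside the lambda.

# Hypothetical: `replace` used as a bare global name rather than a str method
badUDF = udf(lambda s: replace(s, " ", ""), StringType())
df.withColumn("names", badUDF("names")).show()
# Fails inside the worker with: NameError: global name 'replace' is not defined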
Here's a function that removes all whitespace in a string:

import pyspark.sql.functions as F

def remove_all_whitespace(col):
    # \s+ matches any run of whitespace (spaces, tabs, newlines)
    return F.regexp_replace(col, "\\s+", "")
You can use the function like this:

actual_df = source_df.withColumn(
    "words_without_whitespace",
    remove_all_whitespace(F.col("words"))
)
The remove_all_whitespace function is also defined in the quinn library. quinn additionally defines single_space and anti_trim methods to manage whitespace. PySpark itself provides ltrim, rtrim, and trim functions to manage whitespace.
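For completeness, here's a quick sketch of those built-in trimming functions (the sample value is made up for illustration):

from pyspark.sql.functions import ltrim, rtrim, trim, col

df = sqlContext.createDataFrame([("  hi  ",)], ["s"])
df.select(
    ltrim(col("s")).alias("l"),  # strips leading spaces:  "hi  "
    rtrim(col("s")).alias("r"),  # strips trailing spaces: "  hi"
    trim(col("s")).alias("t")    # strips both:            "hi"
).show()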