I need to remove accents from characters in Spanish and other languages in different datasets.
I already wrote a function, based on the code provided in this post, that removes the accents. The problem is that the function is slow because it uses a UDF.
I'm just wondering if I can improve its performance to get results in less time, because this approach works for small dataframes but not for big ones.
Thanks in advance.
Here is the code; you should be able to run it as presented:
# Importing sql types
from pyspark.sql.types import StringType, IntegerType, StructType, StructField
from pyspark.sql.functions import udf, col
import unicodedata

# Building a simple dataframe:
schema = StructType([StructField("city", StringType(), True),
                     StructField("country", StringType(), True),
                     StructField("population", IntegerType(), True)])

countries = ['Venezuela', 'US@A', 'Brazil', 'Spain']
cities = ['Maracaibó', 'New York', ' São Paulo ', '~Madrid']
population = [37800000, 19795791, 12341418, 6489162]

# Dataframe:
df = sqlContext.createDataFrame(list(zip(cities, countries, population)), schema=schema)
df.show()
class Test():

    def __init__(self, df):
        self.df = df

    def clearAccents(self, columns):
        """Removes accents from the string columns of the dataFrame;
        it does not remove the base characters, only the combining accent marks.

        :param columns: string or list of column names, or "*" for all string columns.
        """
        # Filter all string columns in the dataFrame
        validCols = [c for (c, t) in filter(lambda t: t[1] == 'string', self.df.dtypes)]

        # If "*" is provided as the columns parameter, use all string columns:
        if columns == "*":
            columns = validCols[:]

        # Receives a string as an argument
        def remove_accents(inputStr):
            # First, normalize the string:
            nfkdStr = unicodedata.normalize('NFKD', inputStr)
            # Keep chars that have no other char combined with them (i.e. drop accent chars)
            withOutAccents = u"".join([c for c in nfkdStr if not unicodedata.combining(c)])
            return withOutAccents

        function = udf(lambda x: remove_accents(x) if x is not None else x, StringType())
        exprs = [function(col(c)).alias(c) if (c in columns) and (c in validCols) else c
                 for c in self.df.columns]
        self.df = self.df.select(*exprs)


foo = Test(df)
foo.clearAccents(columns="*")
foo.df.show()
One possible improvement is to build a custom Transformer, which will handle Unicode normalization, and a corresponding Python wrapper. It should reduce the overall overhead of passing data between the JVM and Python, and it doesn't require any modifications in Spark itself or access to a private API.
On the JVM side you'll need a transformer similar to this one:
package net.zero323.spark.ml.feature

import java.text.Normalizer
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param._
import org.apache.spark.ml.util._
import org.apache.spark.sql.types.{DataType, StringType}

class UnicodeNormalizer(override val uid: String)
  extends UnaryTransformer[String, String, UnicodeNormalizer] {

  def this() = this(Identifiable.randomUID("unicode_normalizer"))

  private val forms = Map(
    "NFC" -> Normalizer.Form.NFC, "NFD" -> Normalizer.Form.NFD,
    "NFKC" -> Normalizer.Form.NFKC, "NFKD" -> Normalizer.Form.NFKD
  )

  val form: Param[String] = new Param(this, "form", "unicode form (one of NFC, NFD, NFKC, NFKD)",
    ParamValidators.inArray(forms.keys.toArray))

  def setN(value: String): this.type = set(form, value)

  def getForm: String = $(form)

  setDefault(form -> "NFKD")

  override protected def createTransformFunc: String => String = {
    val normalizerForm = forms($(form))
    (s: String) => Normalizer.normalize(s, normalizerForm)
  }

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == StringType, s"Input type must be string type but got $inputType.")
  }

  override protected def outputDataType: DataType = StringType
}
Corresponding build definition (adjust Spark and Scala versions to match your Spark deployment):
name := "unicode-normalization"
version := "1.0"
crossScalaVersions := Seq("2.11.12", "2.12.8")
organization := "net.zero323"
val sparkVersion = "2.4.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)
On the Python side you'll need a wrapper similar to this one:
from pyspark.ml.param.shared import *
# from pyspark.ml.util import keyword_only  # in Spark < 2.0
from pyspark import keyword_only
from pyspark.ml.wrapper import JavaTransformer


class UnicodeNormalizer(JavaTransformer, HasInputCol, HasOutputCol):

    @keyword_only
    def __init__(self, form="NFKD", inputCol=None, outputCol=None):
        super(UnicodeNormalizer, self).__init__()
        self._java_obj = self._new_java_obj(
            "net.zero323.spark.ml.feature.UnicodeNormalizer", self.uid)
        self.form = Param(self, "form",
                          "unicode form (one of NFC, NFD, NFKC, NFKD)")
        # kwargs = self.__init__._input_kwargs  # in Spark < 2.0
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, form="NFKD", inputCol=None, outputCol=None):
        # kwargs = self.setParams._input_kwargs  # in Spark < 2.0
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setForm(self, value):
        return self._set(form=value)

    def getForm(self):
        return self.getOrDefault(self.form)
Build the Scala package:
sbt +package
Include it when you start the shell or submit a job. For example, for a Spark build with Scala 2.11:
bin/pyspark --jars path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar \
    --driver-class-path path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar
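The same flags work for spark-submit; a sketch, where your_script.py is only a placeholder for your own application:
bin/spark-submit --jars path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar \
    --driver-class-path path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar \
    your_script.py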
and you should be ready to go. All that is left is a little bit of regexp magic:
from pyspark.sql.functions import regexp_replace

normalizer = UnicodeNormalizer(form="NFKD",
                               inputCol="text", outputCol="text_normalized")

df = sc.parallelize([
    (1, "Maracaibó"), (2, "New York"),
    (3, " São Paulo "), (4, "~Madrid")
]).toDF(["id", "text"])

(normalizer
    .transform(df)
    .select(regexp_replace("text_normalized", r"\p{M}", ""))
    .show())
## +--------------------------------------+
## |regexp_replace(text_normalized,\p{M},)|
## +--------------------------------------+
## | Maracaibo|
## | New York|
## | Sao Paulo |
## | ~Madrid|
## +--------------------------------------+
Please note that this follows the same conventions as the built-in text transformers and is not null safe. You can easily correct for that by checking for null in createTransformFunc.
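A minimal sketch of such a null check, assuming only the createTransformFunc shown above needs to change (the rest of the class stays as is):
  override protected def createTransformFunc: String => String = {
    val normalizerForm = forms($(form))
    // Pass nulls through unchanged instead of throwing a NullPointerException
    (s: String) => if (s == null) null else Normalizer.normalize(s, normalizerForm)
  }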
Another way of doing it is using the Python Unicode database:
import unicodedata
import sys

from pyspark.sql.functions import translate, regexp_replace


def make_trans():
    """Build a translation table mapping every 'X WITH Y' character
    (e.g. LATIN CAPITAL LETTER A WITH ACUTE) to its base character (e.g. 'A')."""
    matching_string = ""
    replace_string = ""
    for i in range(ord(" "), sys.maxunicode):
        name = unicodedata.name(chr(i), "")
        if "WITH" in name:
            try:
                base = unicodedata.lookup(name.split(" WITH")[0])
                matching_string += chr(i)
                replace_string += base
            except KeyError:
                pass
    return matching_string, replace_string


def clean_text(c):
    matching_string, replace_string = make_trans()
    return translate(
        regexp_replace(c, r"\p{M}", ""),
        matching_string, replace_string
    ).alias(c)
Now let's test it:
df = sc.parallelize([
    (1, "Maracaibó"), (2, "New York"),
    (3, " São Paulo "), (4, "~Madrid"),
    (5, "São Paulo"), (6, "Maracaibó")
]).toDF(["id", "text"])

df.select(clean_text("text")).show()
## +---------------+
## | text|
## +---------------+
## | Maracaibo|
## | New York|
## | Sao Paulo |
## | ~Madrid|
## | Sao Paulo|
## | Maracaibo|
## +---------------+
Acknowledgements to @zero323.