 

Remove trailing white space from elements in a list

I have a Spark DataFrame where a given column contains some text. I'm attempting to clean the text and split it on commas, which would output a new column containing a list of words.

The problem that I'm having is that some of the elements in that list contain trailing whitespace that I would like to remove.

Code:

# Libraries
# Standard Libraries
from typing import Dict, List, Tuple

# Third Party Libraries
import pyspark
from pyspark.ml.feature import Tokenizer
from pyspark.sql import SparkSession
import pyspark.sql.functions as s_function


def tokenize(sdf, input_col="text", output_col="tokens"):
    # Remove email addresses
    sdf_temp = sdf.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"[\w\.-]+@[\w\.-]+\.\w+", ""))
    # Remove digits
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"\d", ""))
    # Replace any run of characters that are not word characters or
    # commas(,) with a single space, since we still want to split on commas(,)
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"[^a-zA-Z0-9,]+", " "))
    # Split the affiliation string on a comma followed by a space
    sdf_temp = sdf_temp.withColumn(
        colName=output_col,
        col=s_function.split(sdf_temp[input_col], ", "))

    return sdf_temp


if __name__ == "__main__":
    # Sample data
    a_1 = "Department of Bone and Joint Surgery, Ehime University Graduate"\
        " School of Medicine, Shitsukawa, Toon 791-0295, Ehime, Japan."\
        " [email protected]." 
    a_2 = "Stroke Pharmacogenomics and Genetics, Fundació Docència i Recerca"\
        " Mútua Terrassa, Hospital Mútua de Terrassa, 08221 Terrassa, Spain."
    a_3 = "Neurovascular Research Laboratory, Vall d'Hebron Institute of Research,"\
        " Hospital Vall d'Hebron, 08035 Barcelona, Spain;[email protected]"\
        " (C.C.). [email protected]."

    data = [(1, a_1), (2, a_2), (3, a_3)]

    spark = SparkSession\
        .builder\
        .master("local[*]")\
        .appName("My_test")\
        .config("spark.ui.port", "37822")\
        .getOrCreate()
    sc = spark.sparkContext
    sc.setLogLevel("WARN")

    af_data = spark.createDataFrame(data, ["index", "text"])
    sdf_tokens = tokenize(af_data)
    # sdf_tokens.select("tokens").show(truncate=False)

Output

|[Department of Bone and Joint Surgery, Ehime University Graduate School of Medicine, Shitsukawa, Toon , Ehime, Japan ]                                                |
|[Stroke Pharmacogenomics and Genetics, Fundaci Doc ncia i Recerca M tua Terrassa, Hospital M tua de Terrassa, Terrassa, Spain ]                                       |
|[Neurovascular Research Laboratory, Vall d Hebron Institute of Research, Hospital Vall d Hebron, Barcelona, Spain C C ]  

Desired Output:

|[Department of Bone and Joint Surgery, Ehime University Graduate School of Medicine, Shitsukawa, Toon, Ehime, Japan]                                                |
|[Stroke Pharmacogenomics and Genetics, Fundaci Doc ncia i Recerca M tua Terrassa, Hospital M tua de Terrassa, Terrassa, Spain]                                       |
|[Neurovascular Research Laboratory, Vall d Hebron Institute of Research, Hospital Vall d Hebron, Barcelona, Spain C C]  

so that:

  1. in the 1st line: 'Toon ' -> 'Toon', 'Japan ' -> 'Japan'
  2. in the 2nd line: 'Spain ' -> 'Spain'
  3. in the 3rd line: 'Spain C C ' -> 'Spain C C'

Note

The trailing white spaces do not appear only with the last element of the list; they can occur with any element.

Lukasz asked Dec 03 '25 10:12

1 Answer

Update

The original solution won't work because trim only operates on the beginning and the end of the entire string, whereas you need it to work on each token.

@PatrickArtner's solution works, but an alternative is to use RegexTokenizer.
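PatrickArtner's code isn't reproduced here, but the per-token idea can be sketched as a plain function that strips each element, which could then be wrapped as a Spark UDF (a sketch under that assumption, not his exact code):

```python
def strip_tokens(tokens):
    """Strip leading/trailing whitespace from every token in a list."""
    return [t.strip() for t in tokens]

# Whole-string trim only touches the ends, so inner trailing spaces survive:
print("  Toon , Ehime  ".strip())                  # 'Toon , Ehime' (inner space kept)
print(strip_tokens(["Toon ", "Ehime", "Japan "]))  # ['Toon', 'Ehime', 'Japan']

# In Spark this could be registered along the lines of:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import ArrayType, StringType
#   strip_udf = udf(strip_tokens, ArrayType(StringType()))
```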

Here is an example of how you can modify your tokenize() function:

from pyspark.ml.feature import RegexTokenizer

def tokenize(sdf, input_col="text", output_col="tokens"):

    # Remove email addresses
    sdf_temp = sdf.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"[\w\.-]+@[\w\.-]+\.\w+", ""))
    # Remove digits
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"\d", ""))
    # Replace any run of characters that are not word characters or
    # commas(,) with a single space, since we still want to split on commas(,)
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"[^a-zA-Z0-9,]+", " "))

    # Call trim to remove any leading or trailing spaces
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.trim(sdf_temp[input_col]))

    # Use RegexTokenizer to split on commas optionally surrounded by whitespace
    myTokenizer = RegexTokenizer(
        inputCol=input_col,
        outputCol=output_col,
        pattern="( +)?, ?")

    sdf_temp = myTokenizer.transform(sdf_temp)

    return sdf_temp

Essentially, call trim on your string to take care of any leading or trailing spaces. Then use the RegexTokenizer to split using the pattern "( +)?, ?".

  • ( +)?: optionally match one or more spaces before the comma (i.e. zero or more spaces)
  • ,: match a comma literally
  •  ?: optionally match one space after the comma
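The effect of that pattern can be checked with Python's `re` (using a non-capturing group here, since `re.split` would otherwise include the captured spaces in the result; Spark's RegexTokenizer doesn't have that quirk):

```python
import re

pattern = r"(?: +)?, ?"   # non-capturing version of "( +)?, ?"

# After trim, splitting consumes the spaces around each comma:
print(re.split(pattern, "Toon , Ehime, Japan"))   # ['Toon', 'Ehime', 'Japan']

# Without the trim, a trailing space at the very end would still survive:
print(re.split(pattern, "Toon , Ehime, Japan "))  # ['Toon', 'Ehime', 'Japan ']
```

This is why the trim step comes before the tokenizer: the split pattern only sees spaces that sit next to a comma.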

Here is the output of

sdf_tokens.select("tokens", s_function.size("tokens").alias("size")).show(truncate=False)

You can see that the length of the array (number of tokens) is correct, but all of the tokens are lower case, because RegexTokenizer (like Tokenizer) lowercases its output by default; pass toLowercase=False if you need to preserve case.

+------------------------------------------------------------------------------------------------------------------------------+----+
|tokens                                                                                                                        |size|
+------------------------------------------------------------------------------------------------------------------------------+----+
|[department of bone and joint surgery, ehime university graduate school of medicine, shitsukawa, toon, ehime, japan]          |6   |
|[stroke pharmacogenomics and genetics, fundaci doc ncia i recerca m tua terrassa, hospital m tua de terrassa, terrassa, spain]|5   |
|[neurovascular research laboratory, vall d hebron institute of research, hospital vall d hebron, barcelona, spain c c]        |5   |
+------------------------------------------------------------------------------------------------------------------------------+----+

Original Answer

As long as you're using Spark version 1.5 or greater, you can use pyspark.sql.functions.trim() which will:

Trim the spaces from both ends for the specified string column.

So one way would be to add:

sdf_temp = sdf_temp.withColumn(
    colName=input_col,
    col=s_function.trim(sdf_temp[input_col]))

at the end of your tokenize() function.

But you may want to instead look into the pyspark.ml.feature.Tokenizer or pyspark.ml.feature.RegexTokenizer. One idea could be to use your function to clean up your strings, and then use the Tokenizer to make the tokens. (I see you've imported it, but don't seem to be using it).
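The limitation the Update describes can be seen with plain Python string operations: trimming the whole string first and then splitting still leaves any space that sits before an inner comma:

```python
s = " Toon , Ehime, Japan "

# Trim the whole string, then split on comma + space (the original approach)
tokens = s.strip().split(", ")
print(tokens)  # ['Toon ', 'Ehime', 'Japan'] -- 'Toon ' keeps its trailing space
```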

pault answered Dec 04 '25 23:12


