I have a Spark dataframe where a given column contains some text. I'm attempting to clean the text and split it on commas, which should output a new column containing a list of words.
The problem that I'm having is that some of the elements in that list contain trailing white spaces that I would like to remove.
Code:
# Libraries
# Standard Libraries
from typing import Dict, List, Tuple

# Third Party Libraries
import pyspark
from pyspark.ml.feature import Tokenizer
from pyspark.sql import SparkSession
import pyspark.sql.functions as s_function


def tokenize(sdf, input_col="text", output_col="tokens"):
    # Remove email addresses
    sdf_temp = sdf.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"[\w\.-]+@[\w\.-]+\.\w+", ""))
    # Remove digits
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"\d", ""))
    # Replace any run of characters that are not word characters with a single
    # space, keeping commas (,), since we still want to split on commas (,)
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), "[^a-zA-Z0-9,]+", " "))
    # Split the affiliation string on a comma
    sdf_temp = sdf_temp.withColumn(
        colName=output_col,
        col=s_function.split(sdf_temp[input_col], ", "))
    return sdf_temp


if __name__ == "__main__":
    # Sample data
    a_1 = "Department of Bone and Joint Surgery, Ehime University Graduate"\
          " School of Medicine, Shitsukawa, Toon 791-0295, Ehime, Japan."\
          " [email protected]."
    a_2 = "Stroke Pharmacogenomics and Genetics, Fundació Docència i Recerca"\
          " Mútua Terrassa, Hospital Mútua de Terrassa, 08221 Terrassa, Spain."
    a_3 = "Neurovascular Research Laboratory, Vall d'Hebron Institute of Research,"\
          " Hospital Vall d'Hebron, 08035 Barcelona, Spain;[email protected]"\
          " (C.C.). [email protected]."
    data = [(1, a_1), (2, a_2), (3, a_3)]

    spark = SparkSession\
        .builder\
        .master("local[*]")\
        .appName("My_test")\
        .config("spark.ui.port", "37822")\
        .getOrCreate()
    sc = spark.sparkContext
    sc.setLogLevel("WARN")

    af_data = spark.createDataFrame(data, ["index", "text"])
    sdf_tokens = tokenize(af_data)
    sdf_tokens.select("tokens").show(truncate=False)
Output:
|[Department of Bone and Joint Surgery, Ehime University Graduate School of Medicine, Shitsukawa, Toon , Ehime, Japan ] |
|[Stroke Pharmacogenomics and Genetics, Fundaci Doc ncia i Recerca M tua Terrassa, Hospital M tua de Terrassa, Terrassa, Spain ] |
|[Neurovascular Research Laboratory, Vall d Hebron Institute of Research, Hospital Vall d Hebron, Barcelona, Spain C C ] |
Desired Output:
|[Department of Bone and Joint Surgery, Ehime University Graduate School of Medicine, Shitsukawa, Toon, Ehime, Japan] |
|[Stroke Pharmacogenomics and Genetics, Fundaci Doc ncia i Recerca M tua Terrassa, Hospital M tua de Terrassa, Terrassa, Spain] |
|[Neurovascular Research Laboratory, Vall d Hebron Institute of Research, Hospital Vall d Hebron, Barcelona, Spain C C] |
so that, for example, 'Toon ' -> 'Toon', 'Japan ' -> 'Japan', 'Spain ' -> 'Spain', and 'Spain C C ' -> 'Spain C C'.
Note
The trailing white spaces do not appear only on the last element of the list; they can occur on any element.
Update
The original solution won't work because trim only operates on the beginning and the end of the entire string, whereas you need it to work on each token.
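To see why, here is a minimal standalone sketch (the column value is made up for illustration): trim() strips the whitespace at the outer ends of the whole string, but the space sitting in front of the internal comma survives and ends up attached to the first token after the split.

import pyspark.sql.functions as s_function
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
demo = spark.createDataFrame([(" Toon , Ehime ",)], ["text"])
demo.select(
    s_function.split(s_function.trim(s_function.col("text")), ", ").alias("tokens")
).show(truncate=False)
# +--------------+
# |tokens        |
# +--------------+
# |[Toon , Ehime]|   <- the first token is still "Toon " with a trailing space
# +--------------+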
@PatrickArtner's solution works, but an alternative is to use RegexTokenizer.
Here is an example of how you can modify your tokenize() function:
from pyspark.ml.feature import RegexTokenizer

def tokenize(sdf, input_col="text", output_col="tokens"):
    # Remove email addresses
    sdf_temp = sdf.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"[\w\.-]+@[\w\.-]+\.\w+", ""))
    # Remove digits
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"\d", ""))
    # Replace any run of characters that are not word characters with a single
    # space, keeping commas (,), since we still want to split on commas (,)
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), "[^a-zA-Z0-9,]+", " "))
    # Call trim to remove any leading or trailing spaces
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.trim(sdf_temp[input_col]))
    # Use RegexTokenizer to split on commas optionally surrounded by whitespace
    myTokenizer = RegexTokenizer(
        inputCol=input_col,
        outputCol=output_col,
        pattern="( +)?, ?")
    sdf_temp = myTokenizer.transform(sdf_temp)
    return sdf_temp
Essentially, call trim on your string to take care of any leading or trailing spaces. Then use the RegexTokenizer to split using the pattern "( +)?, ?", where:
( +)?: match zero or more spaces
,: match a comma exactly
 ?: match an optional trailing space
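As a quick sanity check outside Spark, the split can be reproduced with Python's re module. A non-capturing variant of the pattern is used here because re.split also returns captured groups, whereas RegexTokenizer relies on Java's regex split, which does not; the demo also ignores the lower-casing that RegexTokenizer applies.

import re

# Non-capturing variant of "( +)?, ?" for use with re.split
print(re.split(r"(?: +)?, ?", "Shitsukawa, Toon , Ehime, Japan"))
# ['Shitsukawa', 'Toon', 'Ehime', 'Japan']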
Here is the output of

sdf_tokens.select('tokens', s_function.size('tokens').alias('size')).show(truncate=False)
You can see that the length of the array (number of tokens) is correct, but all of the tokens are lower case (because that's what Tokenizer and RegexTokenizer do).
+------------------------------------------------------------------------------------------------------------------------------+----+
|tokens |size|
+------------------------------------------------------------------------------------------------------------------------------+----+
|[department of bone and joint surgery, ehime university graduate school of medicine, shitsukawa, toon, ehime, japan] |6 |
|[stroke pharmacogenomics and genetics, fundaci doc ncia i recerca m tua terrassa, hospital m tua de terrassa, terrassa, spain]|5 |
|[neurovascular research laboratory, vall d hebron institute of research, hospital vall d hebron, barcelona, spain c c] |5 |
+------------------------------------------------------------------------------------------------------------------------------+----+
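If you want to keep the original casing, RegexTokenizer also exposes a toLowercase parameter (True by default), so something along these lines should preserve it:

myTokenizer = RegexTokenizer(
    inputCol=input_col,
    outputCol=output_col,
    pattern="( +)?, ?",
    toLowercase=False)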
Original Answer
As long as you're using Spark version 1.5 or greater, you can use pyspark.sql.functions.trim() which will:
Trim the spaces from both ends for the specified string column.
So one way would be to add the following at the end of your tokenize() function:

sdf_temp = sdf_temp.withColumn(
    colName=input_col,
    col=s_function.trim(sdf_temp[input_col]))
But you may want to instead look into the pyspark.ml.feature.Tokenizer or pyspark.ml.feature.RegexTokenizer. One idea could be to use your function to clean up your strings, and then use the Tokenizer to make the tokens. (I see you've imported it, but don't seem to be using it).
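For reference, here is a minimal sketch (the sample string is made up, and the spark session from above is assumed) of what the plain Tokenizer produces: it always lowercases and splits on whitespace, so the commas stay attached to the words, which is why the RegexTokenizer route above is the better fit for comma-delimited tokens.

from pyspark.ml.feature import Tokenizer

demo = spark.createDataFrame([("Shitsukawa, Toon, Ehime, Japan",)], ["text"])
Tokenizer(inputCol="text", outputCol="words").transform(demo)\
    .select("words").show(truncate=False)
# [shitsukawa,, toon,, ehime,, japan]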