I have some data in which column 'X' contains strings. I am writing a function in pyspark that takes a search_word and filters out all rows whose column 'X' string does not contain the substring search_word. The function must also allow for misspellings of the word, i.e. fuzzy matching. I have loaded the data into a pyspark dataframe and written a function using the NLTK and fuzzywuzzy Python libraries that returns True or False depending on whether the string contains search_word.
My problem is that I cannot map the function to the dataframe correctly. Am I approaching this problem incorrectly? Should I be trying to do the fuzzy match through some kind of SQL query, or using an RDD perhaps?
I am new to pyspark so I feel like this question must have been answered before but I cannot find the answer anywhere. I have never done any NLP with SQL and I have never heard of SQL being capable of fuzzy matching a substring.
Update #1
The function looks like:
wf = WordFinder(search_word='some_substring')
result1 = wf.find_word_in_string(string_to_search='string containing some_substring or misspelled some_sibstrung')
result2 = wf.find_word_in_string(string_to_search='string not containing the substring')
result1 is True
result2 is False
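For reference, a minimal sketch of such a find_word_in_string, using the standard library's difflib as a stand-in scorer for fuzzywuzzy (fuzz.ratio is roughly SequenceMatcher.ratio scaled to 0-100); the token-level comparison and the 0.8 threshold are assumptions:

```python
from difflib import SequenceMatcher


class WordFinder:
    def __init__(self, search_word, threshold=0.8):
        self.search_word = search_word
        self.threshold = threshold  # minimum similarity (0.0-1.0), an assumed default

    def find_word_in_string(self, string_to_search):
        # A token close enough to the search word counts as a
        # (possibly misspelled) occurrence of the substring.
        return any(
            SequenceMatcher(None, self.search_word, token).ratio() >= self.threshold
            for token in string_to_search.split()
        )
```

With this sketch, 'some_sibstrung' scores about 0.86 against 'some_substring', so it passes the 0.8 threshold, while no token of the second example string does.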
An easy way is to use the built-in levenshtein function. For example,
from pyspark.sql import functions as func  # assumes an active SparkSession named `spark`

(
    spark.createDataFrame([("apple",), ("aple",), ("orange",), ("pear",)], ["fruit"])
    .withColumn("substring", func.lit("apple"))
    .withColumn("levenshtein", func.levenshtein("fruit", "substring"))
    .filter("levenshtein <= 1")
    .toPandas()
)

returns

   fruit substring  levenshtein
0  apple     apple            0
1   aple     apple            1
If you want to use a vanilla Python function, like something from the NLTK or fuzzywuzzy packages, you'll have to define a UDF that takes a string and returns a boolean.