Fuzzy matching a word inside a pyspark dataframe string

I have some data in which column 'X' contains strings. I am writing a pyspark function that takes a search_word and filters out every row whose column 'X' string does not contain search_word as a substring. The function must also tolerate misspellings of the word, i.e. it must do fuzzy matching. I have loaded the data into a pyspark dataframe and, using the NLTK and fuzzywuzzy Python libraries, written a function that returns True if a string contains the search_word and False otherwise.

My problem is that I cannot map the function to the dataframe correctly. Am I approaching this problem incorrectly? Should I be trying to do the fuzzy match through some kind of SQL query, or using an RDD perhaps?

I am new to pyspark, so I feel this question must have been answered before, but I cannot find the answer anywhere. I have never done any NLP with SQL, and I have never heard of SQL being capable of fuzzy matching a substring.

Update #1

The function looks like:

wf = WordFinder(search_word='some_substring')
result1 = wf.find_word_in_string(string_to_search='string containing some_substring or misspelled some_sibstrung')
result2 = wf.find_word_in_string(string_to_search='string not containing the substring')

result1 is True

result2 is False

Dónal Flanagan asked Jan 03 '18 09:01


1 Answer

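An easy way is to use the built-in levenshtein function, which computes the edit distance between two string columns. For example,

from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.getOrCreate()

(
    spark.createDataFrame([("apple",), ("aple",), ("orange",), ("pear",)], ["fruit"])
    .withColumn("substring", func.lit("apple"))
    # edit distance between each fruit and the search word
    .withColumn("levenshtein", func.levenshtein("fruit", "substring"))
    .filter("levenshtein <= 1")  # keep rows at most one edit away
    .toPandas()
)

returns

   fruit substring  levenshtein
0  apple     apple            0
1   aple     apple            1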

If you want to use a vanilla Python function, such as one built on NLTK, you'll have to define a UDF that takes a string and returns a boolean.
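As a rough sketch of the UDF route: the built-in distance compares whole column values, so finding a misspelled word inside a longer string needs a per-word comparison. The example below does that with a plain-Python edit distance rather than NLTK or fuzzywuzzy. The names contains_fuzzy, filter_fuzzy and max_distance are hypothetical, and splitting on whitespace is an assumption about how words are delimited in column 'X'.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def contains_fuzzy(text, search_word, max_distance=1):
    """True if any whitespace-separated word in text is within
    max_distance edits of search_word (case-insensitive)."""
    if text is None:  # Spark passes nulls to Python UDFs as None
        return False
    target = search_word.lower()
    return any(levenshtein(word.lower(), target) <= max_distance
               for word in text.split())


def filter_fuzzy(df, column, search_word, max_distance=1):
    """Hypothetical helper: keep rows of df whose column fuzzily contains search_word."""
    # Imported here so the pure-Python functions above work without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    match = F.udf(lambda s: contains_fuzzy(s, search_word, max_distance),
                  BooleanType())
    return df.filter(match(F.col(column)))
```

With a dataframe df, a call such as filter_fuzzy(df, "X", "some_substring", max_distance=2) should keep the row containing "some_sibstrung", since that word is two substitutions away from the search word. Python UDFs run row by row outside the JVM, so they are usually slower than built-in column functions.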

santon answered Sep 17 '22 01:09