Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

most efficient way to find partial string matches in large file of strings (python)

I downloaded the Wikipedia article titles file which contains the name of every Wikipedia article. I need to search for all the article titles that may be a possible match. For example, I might have the word "hockey", but the Wikipedia article for hockey that I would want is "Ice_hockey". It should be a case-insensitive search too.

I'm using Python, and is there a more efficient way than to just do a line by line search? I'll be performing this search like 500 or a 1000 times per minute ideally. If line by line is my only option, are there some optimizations I can do within this?

I think there are several million lines in the file.

Any ideas?

Thanks.

like image 202
apexdodge Avatar asked Jan 29 '11 21:01

apexdodge


People also ask

How do you partially match text in Python?

Use the in operator for partial matches, i.e., whether one string contains the other string. x in y returns True if x is contained in y ( x is a substring of y ), and False if it is not. If each character of x is contained in y discretely, False is returned.

How do I check if a string matches in Python?

Use re. match() to check if a string matches a pattern A pattern specifies a sequence of characters that can be matched to a string and follows the regular expression syntax in Python. re. match(pattern, string) tests if the regex pattern matches string .

How do you find the substring of a string in Python?

Using the 'in' operator: The in operator is the easiest and pythonic way to check if a python string contains a substring. The in and not in are membership operators, they take in two arguments and evaluate if one is a member of the other. They return a boolean value.


1 Answers

If you've got a fixed data set and variable queries, then the usual technique is to reorganise the data set into something that can be searched more easily. At an abstract level, you could break up each article title into individual lowercase words, and add each of them to a Python dictionary data structure. Then, whenever you get a query, convert the query word to lower case and look it up in the dictionary. If each dictionary entry value is a list of titles, then you can easily find all the titles that match a given query word.

This works for straightforward words, but you will have to consider whether you want to do matching on similar words, such as finding "smoking" when the query is "smoke".

like image 189
Greg Hewgill Avatar answered Nov 07 '22 03:11

Greg Hewgill