I downloaded the Wikipedia article titles file which contains the name of every Wikipedia article. I need to search for all the article titles that may be a possible match. For example, I might have the word "hockey", but the Wikipedia article for hockey that I would want is "Ice_hockey". It should be a case-insensitive search too.
I'm using Python, and is there a more efficient way than to just do a line by line search? I'll be performing this search like 500 or a 1000 times per minute ideally. If line by line is my only option, are there some optimizations I can do within this?
I think there are several million lines in the file.
Any ideas?
Thanks.
Use the in operator for partial matches, i.e., whether one string contains the other string. x in y returns True if x is contained in y ( x is a substring of y ), and False if it is not. If each character of x is contained in y discretely, False is returned.
Use re. match() to check if a string matches a pattern A pattern specifies a sequence of characters that can be matched to a string and follows the regular expression syntax in Python. re. match(pattern, string) tests if the regex pattern matches string .
Using the 'in' operator: The in operator is the easiest and pythonic way to check if a python string contains a substring. The in and not in are membership operators, they take in two arguments and evaluate if one is a member of the other. They return a boolean value.
If you've got a fixed data set and variable queries, then the usual technique is to reorganise the data set into something that can be searched more easily. At an abstract level, you could break up each article title into individual lowercase words, and add each of them to a Python dictionary data structure. Then, whenever you get a query, convert the query word to lower case and look it up in the dictionary. If each dictionary entry value is a list of titles, then you can easily find all the titles that match a given query word.
This works for straightforward words, but you will have to consider whether you want to do matching on similar words, such as finding "smoking" when the query is "smoke".
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With