The Problem: A large static list of strings is provided as A, and a long string is provided as B. The strings in A are all very short (it's a keyword list). I want to check, for each string in A, whether it is a substring of B, and collect all the ones that are.
Now I use a simple loop like:
result = []
for word in A:
    if word in B:
        result.append(word)
But it's crazy slow when A contains ~500,000 or more items.
Is there any library or algorithm that fits this problem? I've tried my best to search but no luck.
Thank you!
You can use any:

a_string = "A string is more than its parts!"
matches = ["more", "wholesome", "milk"]

if any(x in a_string for x in matches):
    print("At least one match found")

Similarly, to check whether all of the strings from the list are found, use all instead of any.
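To also collect the matches (the question asks to "get them"), a minimal equivalent of the loop in the question as a list comprehension:

result = [word for word in A if word in B]

This is tidier, but it still performs one substring scan of B per keyword, so it won't fix the performance problem on its own.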
The in operator returns a Boolean (either True or False). To check if a string contains a substring in Python using the in operator, we simply invoke it on the superstring:

fullstring = "StackAbuse"
substring = "tack"

if substring in fullstring:
    print("Found!")
else:
    print("Not found!")
For simple tasks, we can use the find() or index() methods to search for substrings of Python strings. find() is only available on strings and returns -1 if there is no match. index() also works on lists and tuples, and raises a ValueError when there is no match.
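A quick demonstration of the difference between the two:

haystack = "StackAbuse"
print(haystack.find("tack"))   # 1
print(haystack.find("xyz"))    # -1 (no match)
print(haystack.index("tack"))  # 1
try:
    haystack.index("xyz")
except ValueError:
    print("no match raises ValueError")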
Also, you can use the mmap module to find a string in a huge file. The mmap.mmap() method creates a memory-mapped object that exposes the underlying file like a bytearray, so you can search it without reading the whole file into memory.
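A minimal sketch of that approach, assuming a hypothetical file big.txt; note that mmap works on bytes, so the pattern must be a byte string:

import mmap

# Memory-map the file read-only; the OS pages data in on demand,
# so the whole file never needs to be loaded into memory at once.
with open("big.txt", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(b"keyword")  # -1 if not found
        if pos != -1:
            print(f"found at byte offset {pos}")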
In this section, we'll see a fast and memory-efficient way to search for a string in a large text file: loop over the file line by line with enumerate() to get each line together with its line number. The enumerate() function adds a counter to an iterable and returns an enumerate object; pass it the file object returned by open().
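A sketch of that loop, again assuming a hypothetical big.txt:

# Stream the file line by line; only one line is held in memory at a time.
with open("big.txt", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        if "keyword" in line:
            print(line_number, line.rstrip())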
Python already uses a pretty fast C-level implementation of substring search for in ("a mix between Boyer-Moore and Horspool"), so implementing a different string-search algorithm at the Python level is probably going to be several orders of magnitude slower. I do not see how to make the individual comparisons more optimal, to be honest.
Your problem is large enough that you probably need to hit it with the algorithm bat.
Take a look at the Aho-Corasick algorithm. Your problem statement is a paraphrase of the problem this algorithm tackles: it matches a static set of keywords against a text in a single pass.
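A minimal sketch using the third-party pyahocorasick package (pip install pyahocorasick); the variable names A and B mirror the question:

import ahocorasick

A = ["he", "she", "his", "hers"]          # the static keyword list
B = "ushers say she sells seashells"      # the long string to scan

# Build the automaton once; this is the expensive step, but A is static.
automaton = ahocorasick.Automaton()
for word in A:
    automaton.add_word(word, word)
automaton.make_automaton()

# One pass over B finds every occurrence of every keyword.
result = sorted({word for _end, word in automaton.iter(B)})
print(result)  # ['he', 'hers', 'she']

The build cost is paid once for the static list A, and each query over B is a single linear scan, regardless of how many keywords there are.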
Also, look into the work by Nicholas Lehuen with his PyTST package.
There are also references in a related Stack Overflow question that mention other algorithms such as Rabin-Karp: Algorithm for linear pattern matching?
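For reference, a minimal single-pattern Rabin-Karp sketch; the base and modulus here are arbitrary illustrative choices:

def rabin_karp(pattern: str, text: str, base: int = 256, mod: int = 1_000_003) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)          # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):                    # hash the pattern and the first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    for i in range(n - m + 1):
        # On a hash hit, confirm with a direct comparison to rule out collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:                     # roll the window one character right
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1

print(rabin_karp("tack", "StackAbuse"))  # 1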