Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

High performance mass short string search in Python

The Problem: A large static list of strings is provided as A, A long string is provided as B, strings in A are all very short (a keywords list), I want to check if every string in A is a sub-string of B and get them.

Now I use a simple loop like:

result = []
for word in A:
    if word in B:
        result.append(word)

But it's crazy slow when A contains ~500,000 or more items.

Is there any library or algorithm that fits this problem? I've tried my best to search but no luck.

Thank you!

like image 906
Felix Yan Avatar asked Jan 13 '12 02:01

Felix Yan


People also ask

How do I search for multiple strings in python?

You can use any : a_string = "A string is more than its parts!" matches = ["more", "wholesome", "milk"] if any(x in a_string for x in matches): Similarly to check if all the strings from the list are found, use all instead of any .

What is the most efficient way to concatenate many strings together python?

Practical Data Science using Python The best way of appending a string to a string variable is to use + or +=. This is because it's readable and fast. They are also just as fast.

What is Strcmp in python?

String Equals Check in Python In python programming we can check whether strings are equal or not using the “==” or by using the “. __eq__” function. Example: s1 = 'String' s2 = 'String' s3 = 'string' # case sensitive equals check if s1 == s2: print('s1 and s2 are equal.

How do you compare a substring to a string in python?

It returns a Boolean (either True or False ). To check if a string contains a substring in Python using the in operator, we simply invoke it on the superstring: fullstring = "StackAbuse" substring = "tack" if substring in fullstring: print("Found!") else: print("Not found!")

How to search substrings of a string in Python?

For simple tasks, we can use the find () or index () function to search substrings of Python strings. The find () can only be used for strings and will return -1 if there is no match. The index () function can also be used for lists or tuples and will raise an exception when there is no match.

How to find a string in a huge file in Python?

Also, you can use the mmap module to find a string in a huge file. The mmap.mmap () method creates a bytearray object that checks the underlying file instead of reading the whole file in memory. Sometimes you want to search a string in multiple files present in a directory. Use the below steps to search a text in all files of a directory.

How to search a string in a large text file?

In this section, we’ll see the fastest and most memory-efficient way to search a string in a large text file. Use for loop with enumerate () function to get a line and its number. The enumerate () function adds a counter to an iterable and returns it in enumerate object. Pass the file pointer returned by the open () function to the enumerate ().

Is it possible to implement a string search algorithm in Python?

Python already uses a pretty fast C-level implementation of "a mix between boyer-moore and horspool", so implementing a different string search algorithm at Python level is probably going to be several orders of magnitude slower. Show activity on this post. I do not see how to make it more optimal on the comparison, to be honest.


1 Answers

Your problem is large enough that you probably need to hit it with the algorithm bat.

Take a look into the Aho-Corasick Algorithm. Your problem statement is a paraphrase of the problem that this algorithm tackles.

Also, look into the work by Nicholas Lehuen with his PyTST package.

There are also references in a related Stack Overflow message that mention other algorithms such as Rabin-Karp: Algorithm for linear pattern matching?

like image 148
dyoo Avatar answered Sep 25 '22 18:09

dyoo