What is the fastest technology/algorithm for looking up a list of strings in quite large text files (up to 1 GB each)? For starters, I'm using C# and was able to implement the logic (simply by matching a file against the list of strings, one string at a time, which means each file is read once per search string), but since I'm dealing with a lot of files, it takes forever to run through them all and collect the matches. I'm open to any suggestion, even if it isn't C#.
To elaborate: I have a text file that contains many numbers (A), and I have a lot of large files (B). I'm taking every element in (A) and checking whether it matches anywhere in (B), line by line. If there is a match, I write the whole line to an output text file. The way I'm doing it is really traditional, and it takes a long time to get through a single file, while I have hundreds of them with sizes up to 1 GB.
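Roughly, my current approach looks like this (a simplified sketch; the file and folder names are placeholders, not my actual paths):

using System.IO;

// Naive approach: every large file is re-read once per search string,
// so the total work is (number of strings) x (total size of all files).
// "numbers_A.txt" and "big_files" are placeholder names.
var searchWords = File.ReadAllLines("numbers_A.txt");
var listOfFiles = Directory.GetFiles("big_files");

using (var output = new StreamWriter("matches.txt"))
{
    foreach (var searchWord in searchWords)
    {
        foreach (var fileName in listOfFiles)
        {
            foreach (var line in File.ReadLines(fileName))
            {
                if (line.Contains(searchWord))
                {
                    output.WriteLine(line); // write the whole matching line
                }
            }
        }
    }
}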
I appreciate your time.
In Java, you can use the Scanner class's findWithinHorizon method. Scanner will internally create a FileChannel to read the file, and for pattern matching it ends up using a Boyer-Moore algorithm for efficient string searching.
The standard way to do this is to implement the Aho-Corasick algorithm. It reads the file one time and finds all occurrences of all the strings you give it. See https://www.informit.com/guides/content.aspx?g=dotnet&seqNum=869 for an article that provides an implementation and some examples.
Assuming that the list of numbers in your file A is small enough to fit in memory, here's what you'd do, using the implementation in the above-linked article:
// Construct the automaton from the search words in file A
AhoCorasickStringSearcher matcher = new AhoCorasickStringSearcher();
foreach (var searchWord in File.ReadLines(File_a))
{
    matcher.AddItem(searchWord);
}
matcher.CreateFailureFunction();

// And then do the search on each file
foreach (var fileName in listOfFiles)
{
    foreach (var line in File.ReadLines(fileName))
    {
        var matches = matcher.Search(line);
        foreach (var m in matches)
        {
            // output match
        }
    }
}
Note that it only makes one pass through each file, and it never has to load more than one line of the file into memory at any time. The limiting factor here is the memory it takes to build the automaton.
I've used this to search files that totaled over 100 gigabytes, for about 15 million different strings. It takes a few minutes to build the automaton, but then it searches very quickly. One really nice property of the algorithm is that its complexity is O(n + m), where n is the size of the input files, and m is the number of matched items. The number of strings it's searching for doesn't matter. It can search for a million different strings just as quickly as it can search for one or two.
100 gigabytes will take you ... something on the order of about 40 minutes to read. I'd be really surprised if it took an hour for this to find all occurrences of 15 million different strings in 100 gigabytes of data.
Another option, if you're searching for whole words, is to ditch the Aho-Corasick algorithm. Instead, load all of the numbers you're looking for into a HashSet<string>. Then read each line, use a regular expression to find all of the numbers in the line, and check whether each one exists in the hash set. For example:
// Load the values from file A into a hash set for O(1) lookups
var hashSetOfValues = new HashSet<string>(File.ReadLines(File_a));

Regex re = new Regex(@"\w+");
foreach (var line in File.ReadLines(fileName))
{
    var matches = re.Matches(line);
    foreach (Match m in matches)
    {
        if (hashSetOfValues.Contains(m.Value))
        {
            // output match
        }
    }
}
This will likely be somewhat slower than the Aho-Corasick algorithm, but it still makes only one pass through the data. This assumes, of course, that you have enough memory to hold all of those numbers in a hash set.
There are other options for whole words, as I mention in the comments.
Another option, if you know that the words you're looking for are always separated by spaces, is to add spaces to the start and end of the words you add to the automaton. Or, with some modification to the implementation itself, you could force the matcher's Search method to return only matches that occur on whole-word boundaries. That would more easily handle matches at the start and end of lines, as well as other non-word characters.
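Here's a minimal sketch of the space-padding variant, reusing the AhoCorasickStringSearcher API from the article linked above and the same File_a / listOfFiles placeholders; padding the line itself is my addition, so that words at the very start and end of a line still match:

// Build the automaton with space-padded keywords so they can only
// match as whole words. Assumes the AhoCorasickStringSearcher API
// from the linked article; File_a and listOfFiles are the same
// placeholders as in the earlier snippets.
var matcher = new AhoCorasickStringSearcher();
foreach (var searchWord in File.ReadLines(File_a))
{
    matcher.AddItem(" " + searchWord + " ");
}
matcher.CreateFailureFunction();

foreach (var fileName in listOfFiles)
{
    foreach (var line in File.ReadLines(fileName))
    {
        // Pad the line as well, so keywords at the start or end of
        // the line are still surrounded by spaces.
        var matches = matcher.Search(" " + line + " ");
        foreach (var m in matches)
        {
            // output match
        }
    }
}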