Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Matching a string in a Large text file?

I have a list of strings containing about 7 million items in a text file of size 152MB. I was wondering what could be best way to implement the a function that takes a single string and returns whether it is in that list of strings.

like image 506
Tasawer Khan Avatar asked Apr 19 '10 08:04

Tasawer Khan


People also ask

How do I find a specific string in a file?

Grep is a Linux / Unix command-line tool used to search for a string of characters in a specified file. The text search pattern is called a regular expression. When it finds a match, it prints the line with the result. The grep command is handy when searching through large log files.


2 Answers

Are you going to have to match against this text file several times? If so, I'd create a HashSet<string>. Otherwise, just read it line by line (I'm assuming there's one string per line) and see whether it matches.

152MB of ASCII will end up as over 300MB of Unicode data in memory - but in modern machines have plenty of memory, so keeping the whole lot in a HashSet<string> will make repeated lookups very fast indeed.

The absolute simplest way to do this is probably to use File.ReadAllLines, although that will create an array which will then be discarded - not great for memory usage, but probably not too bad:

HashSet<string> strings = new HashSet<string>(File.ReadAllLines("data.txt"));
...

if (strings.Contains(stringToCheck))
{
    ...
}
like image 99
Jon Skeet Avatar answered Oct 01 '22 23:10

Jon Skeet


Depends what you want to do. When you want to repeat the search for matches again and again, I'd load the whole file into memory (into a HashSet). There it is very easy to search for matches.

like image 35
tanascius Avatar answered Oct 01 '22 23:10

tanascius