
How to search for multiple strings in a text file

I am working with text files and want to implement a search algorithm in Java. I have a text file that I need to search.

If I want to find one word, I can do it by putting all the text into a HashMap and storing each word's occurrences. But is there an algorithm I can use if I want to search for two strings (or maybe more)? Should I hash the strings in pairs?

Arjit asked Oct 04 '11 12:10



1 Answer

It depends a lot on the size of the text file. There are usually several cases you should consider:

  1. Lots of queries on very short documents (web pages, essay-length texts, etc.), with text distributed like natural language. A simple algorithm with an O(n^2) worst case is fine. For a query of length n, take a window of length n and slide it over the text, comparing and moving the window until you find a match. This algorithm does not care about words, so it treats the whole text as one big string (including spaces). This is probably what most browsers do. KMP or Boyer-Moore is not worth the effort, since the O(n^2) case is very rare.

  2. Lots of queries on one large document. Preprocess your document and store the preprocessed form. Common storage options are suffix trees and inverted lists. If you have multiple documents, you can build one document from them by concatenating them and storing the document boundaries separately. This is the way to go for document databases where the collection is almost constant.

  3. If you have several documents with high redundancy and your collection changes often, use KMP or Boyer-Moore. For example, if you want to find certain sequences in DNA data and you often get new sequences to find as well as new DNA from experiments, the O(n^2) part of the naive algorithm would kill your running time.
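To make case 1 concrete, here is a rough Java sketch of the sliding-window search, run once per query string. The class name NaiveMultiSearch and its methods are made up for illustration, and treating an empty query as "no matches" is my own assumption:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the class and method names are illustrative only.
final class NaiveMultiSearch {

    // Slides a window of query.length() over the text and records every
    // start index at which the window equals the query.
    static List<Integer> findAll(String text, String query) {
        List<Integer> hits = new ArrayList<>();
        if (query.isEmpty()) {
            return hits; // assumption: an empty query matches nothing
        }
        for (int start = 0; start + query.length() <= text.length(); start++) {
            if (text.regionMatches(start, query, 0, query.length())) {
                hits.add(start);
            }
        }
        return hits;
    }

    // Runs the single-query search once per query string.
    static Map<String, List<Integer>> findEach(String text, List<String> queries) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String query : queries) {
            result.put(query, findAll(text, query));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog";
        List<String> queries = new ArrayList<>();
        queries.add("the");
        queries.add("fox");
        queries.add("cat");
        System.out.println(findEach(text, queries));
    }
}
```

Like the description above, this ignores word boundaries: searching for "the" would also hit inside "other". For a handful of queries on short texts that is usually acceptable.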

There are probably lots more possibilities that need different algorithms and data structures, so you should figure out which one is best in your case.
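If your queries are whole words, as in the HashMap idea from the question, an inverted list may already be enough, and you would not need to hash pairs. A minimal Java sketch follows, assuming the file has already been read into a list of lines; the class name InvertedIndexSketch, the 0-based line numbers, and the lowercase word splitting are my own choices, not anything prescribed:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of an inverted list: word -> set of 0-based line numbers.
final class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    InvertedIndexSketch(List<String> lines) {
        for (int lineNo = 0; lineNo < lines.size(); lineNo++) {
            // Assumption: words are runs of letters, digits, or underscores.
            for (String word : lines.get(lineNo).toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!word.isEmpty()) {
                    postings.computeIfAbsent(word, k -> new TreeSet<>()).add(lineNo);
                }
            }
        }
    }

    // Returns the lines that contain every given word (an AND query),
    // computed by intersecting the per-word sets.
    Set<Integer> linesContainingAll(String... words) {
        Set<Integer> result = null;
        for (String word : words) {
            Set<Integer> hits = postings.getOrDefault(
                    word.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch(
                Arrays.asList("the quick brown fox", "the lazy dog", "a quick dog"));
        System.out.println(index.linesContainingAll("quick", "dog"));
    }
}
```

Each word is stored once and a multi-word query is just a set intersection, so searching for two or more words does not require any extra preprocessing per combination.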

LiKao answered Nov 01 '22 00:11