I have a text document and a query (the query could be more than one word). I want to find the position of all occurrences of the query in the document.
I thought of the documentText.indexOf(query)
or using regular expression but I could not make it work.
I end up with the following method:
First, I have create a dataType called QueryOccurrence
public class QueryOccurrence implements Serializable{
public QueryOccurrence(){}
private int start;
private int end;
public QueryOccurrence(int nameStart,int nameEnd,String nameText){
start=nameStart;
end=nameEnd;
}
public int getStart(){
return start;
}
public int getEnd(){
return end;
}
public void SetStart(int i){
start=i;
}
public void SetEnd(int i){
end=i;
}
}
Then, I have used this datatype in the following method:
public static List<QueryOccurrence>FindQueryPositions(String documentText, String query){
// Normalize do the following: lower case, trim, and remove punctuation
String normalizedQuery = Normalize.Normalize(query);
String normalizedDocument = Normalize.Normalize(documentText);
String[] documentWords = normalizedDocument.split(" ");;
String[] queryArray = normalizedQuery.split(" ");
List<QueryOccurrence> foundQueries = new ArrayList();
QueryOccurrence foundQuery = new QueryOccurrence();
int index = 0;
for (String word : documentWords) {
if (word.equals(queryArray[0])){
foundQuery.SetStart(index);
}
if (word.equals(queryArray[queryArray.length-1])){
foundQuery.SetEnd(index);
if((foundQuery.End()-foundQuery.Start())+1==queryArray.length){
//add the found query to the list
foundQueries.add(foundQuery);
//flush the foundQuery variable to use it again
foundQuery= new QueryOccurrence();
}
}
index++;
}
return foundQueries;
}
This method return a list of all occurrence of the query in the document each one with its position.
Could you suggest any easer and faster way to accomplish this task.
Thanks
The indexOf() method returns the position of the first occurrence of specified character(s) in a string. Tip: Use the lastIndexOf method to return the position of the last occurrence of specified character(s) in a string.
group() returns the substring that was matched by the RE. start() and end() return the starting and ending index of the match. span() returns both start and end indexes in a single tuple. Since the match() method only checks if the RE matches at the start of a string, start() will always be zero.
Put brackets ( [ ] ) in the pattern string, and inside the brackets put the lowest and highest characters in the range, separated by a hyphen ( – ). Any single character within the range makes a successful match.
Your first approach was a good idea, but String.indexOf does not support regular expressions.
Another easier way which uses a similar approach, but in a two step method, is as follows:
List<Integer> positions = new ArrayList();
Pattern p = Pattern.compile(queryPattern); // insert your pattern here
Matcher m = p.matcher(documentText);
while (m.find()) {
positions.add(m.start());
}
Where positions will hold all the start positions of the matches.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With