
Lucene highlighter

How does the Lucene 4.3.1 highlighter work? I want to print each search result as the matched word plus the 8 words that follow it in the document. How can I use the Highlighter class to do that? I have indexed full txt, html and xml documents, and I have the search code below, to which I will presumably add the highlighting:

String index = "index";
String field = "contents";
String queries = null;
int repeat = 1;
boolean raw = true; //not sure what raw really does???
String queryString = null; //keep null, prompt user later for it
int hitsPerPage = 10; //leave it at 10, go from there later

//need to add all files to same directory
index = "C:\\Users\\plib\\Documents\\index";
repeat = 4;


IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);

BufferedReader in = null;
if (queries != null) {
  in = new BufferedReader(new InputStreamReader(new FileInputStream(queries), "UTF-8"));
} else {
  in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
}
QueryParser parser = new QueryParser(Version.LUCENE_43, field, analyzer);
while (true) {
  if (queries == null && queryString == null) {                        // prompt the user
    System.out.println("Enter query. 'quit' = quit: ");
  }

  String line = queryString != null ? queryString : in.readLine();

  if (line == null) {   // end of input; length() can never be -1, so that check was dead code
    break;
  }

  line = line.trim();
  if (line.length() == 0 || line.equalsIgnoreCase("quit")) {
    break;
  }

  Query query = parser.parse(line);
  System.out.println("Searching for: " + query.toString(field));

  if (repeat > 0) {                           // repeat & time as benchmark
    Date start = new Date();
    for (int i = 0; i < repeat; i++) {
      searcher.search(query, null, 100);
    }
    Date end = new Date();
    System.out.println("Time: "+(end.getTime()-start.getTime())+"ms");
  }

  doPagingSearch(in, searcher, query, hitsPerPage, raw, queries == null && queryString == null);

  if (queryString != null) {
    break;
  }
}
reader.close();

}

abitnew asked Jul 08 '13 20:07


2 Answers

I had the same question, and finally stumbled upon this post.

http://vnarcher.blogspot.ca/2012/04/highlighting-text-with-lucene.html

The key part is that, as you iterate over your results, you call getHighlightedField on the field value that you want to highlight.

private String getHighlightedField(Query query, Analyzer analyzer, String fieldName, String fieldValue) throws IOException, InvalidTokenOffsetsException {
    // Wrap every matched term in <span class="MatchedText">...</span>.
    Formatter formatter = new SimpleHTMLFormatter("<span class=\"MatchedText\">", "</span>");
    QueryScorer queryScorer = new QueryScorer(query);
    Highlighter highlighter = new Highlighter(formatter, queryScorer);
    // Integer.MAX_VALUE effectively keeps the whole field value as a single fragment.
    highlighter.setTextFragmenter(new SimpleSpanFragmenter(queryScorer, Integer.MAX_VALUE));
    highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);
    // Use the analyzer passed in; the result is null when nothing in fieldValue matches the query.
    return highlighter.getBestFragment(analyzer, fieldName, fieldValue);
}

In this case it assumes the output is going to be HTML, and it simply wraps the highlighted text in a <span> with the CSS class MatchedText. You can then define a CSS rule for that class to style the highlight however you want.
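
To make the resulting markup concrete, here is a plain-Java sketch of the kind of string that comes back. It does not call Lucene at all: SpanWrapDemo and wrap are hypothetical names, and the regex match only stands in for what the analyzer and QueryScorer decide in the real highlighter. It also assumes the text needs no HTML escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: imitates the markup produced by a
// formatter configured with <span class="MatchedText"> and </span>.
class SpanWrapDemo {
    static String wrap(String text, String term) {
        // Match the term as a whole word, ignoring case.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(term) + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Keep the original casing of the matched text inside the span.
            String marked = "<span class=\"MatchedText\">" + matcher.group() + "</span>";
            matcher.appendReplacement(out, Matcher.quoteReplacement(marked));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrap("The Lucene highlighter marks each hit", "highlighter"));
    }
}
```

Running main prints: The Lucene <span class="MatchedText">highlighter</span> marks each hit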

stuckless answered Oct 22 '22 05:10


For the Lucene highlighter to work you need to add two fields to the document that you are indexing: one with term vectors enabled and another without term vectors. For simplicity, here is a code snippet:

    FieldType type = new FieldType();
    type.setIndexed(true);
    type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    type.setStored(true);
    type.setStoreTermVectors(true);
    type.setTokenized(true);
    type.setStoreTermVectorOffsets(true);
    Field field = new Field("content", "This is fragment. Highlters", type);
    doc.add(field);  //this field has term Vector enabled.

    //without term vector enabled.
    doc.add(new StringField("ncontent","This is fragment. Highlters", Field.Store.YES));

After adding both fields, add the document to your index. To make use of the Lucene highlighter, use the method given below (it uses Lucene 4.2; I have not tested it with Lucene 4.3.1):

    public void highLighter() throws IOException, ParseException, InvalidTokenOffsetsException {
        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("INDEXDIRECTORY")));
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser(Version.LUCENE_42, "content", analyzer);
        Query query = parser.parse("Highlters"); //your search keyword
        TopDocs hits = searcher.search(query, reader.maxDoc());
        System.out.println(hits.totalHits);
        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
        Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
        // Loop over the hits actually returned; looping up to reader.maxDoc()
        // overruns scoreDocs whenever fewer documents match.
        for (int i = 0; i < hits.scoreDocs.length; i++) {
            int id = hits.scoreDocs[i].doc;
            Document doc = searcher.doc(id);
            //Without term vector
            String text = doc.get("ncontent");
            TokenStream tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "ncontent", analyzer);
            TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 4);
            for (int j = 0; j < frag.length; j++) {
                if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                    System.out.println((frag[j].toString()));
                }
            }
            //Term vector
            text = doc.get("content");
            tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "content", analyzer);
            frag = highlighter.getBestTextFragments(tokenStream, text, false, 10);
            for (int j = 0; j < frag.length; j++) {
                if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                    System.out.println((frag[j].toString()));
                }
            }

            System.out.println("-------------");
        }
        reader.close();
    }
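
For the exact output the question asks for (the matched word plus the 8 words after it), a word-counted window may be easier to build by hand from the stored field text, since, as far as I know, the default fragmenter sizes fragments in characters rather than words. Below is a minimal plain-Java sketch of that idea. It does not use the Highlighter; WordWindowDemo and windows are hypothetical names, and it assumes whitespace-separated words and an exact, case-insensitive match on a single term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: collects each occurrence of a term
// together with up to `wordsAfter` words that follow it.
class WordWindowDemo {
    static List<String> windows(String text, String term, int wordsAfter) {
        String[] words = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            // Strip leading and trailing punctuation before comparing.
            String bare = words[i].replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (bare.equalsIgnoreCase(term)) {
                // Clamp the window at the end of the text.
                int end = Math.min(words.length, i + 1 + wordsAfter);
                result.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "one two three four five six seven eight nine ten eleven twelve";
        System.out.println(windows(text, "two", 8));
    }
}
```

Running main prints [two three four five six seven eight nine ten]. In the method above, text would be the stored value from doc.get("content"), and the window could then be passed through the highlighter or printed as is.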

user1234 answered Oct 22 '22 07:10