
How to use a Lucene Analyzer to tokenize a String?

Is there a simple way I could use any subclass of Lucene's Analyzer to parse/tokenize a String?

Something like:

    String to_be_parsed = "car window seven";
    Analyzer analyzer = new StandardAnalyzer(...);
    List<String> tokenized_string = analyzer.analyze(to_be_parsed);
Felipe Hummel asked Jun 13 '11 18:06




2 Answers

Based on the other answer, slightly modified to work with Lucene 4.0.

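    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class LuceneUtil {

      private LuceneUtil() {}

      public static List<String> tokenizeString(Analyzer analyzer, String string) {
        List<String> result = new ArrayList<String>();
        try {
          TokenStream stream = analyzer.tokenStream(null, new StringReader(string));
          stream.reset(); // must be called before the first incrementToken()
          while (stream.incrementToken()) {
            result.add(stream.getAttribute(CharTermAttribute.class).toString());
          }
          // end() and close() complete the TokenStream workflow so the analyzer can reuse the stream
          stream.end();
          stream.close();
        } catch (IOException e) {
          // not thrown b/c we're using a string reader...
          throw new RuntimeException(e);
        }
        return result;
      }
    }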
Ben McCann answered Sep 28 '22 11:09


As far as I know, you have to write the loop yourself. Something like this (taken straight from my source tree):

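    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public final class LuceneUtils {

        public static List<String> parseKeywords(Analyzer analyzer, String field, String keywords) {

            List<String> result = new ArrayList<String>();
            TokenStream stream = analyzer.tokenStream(field, new StringReader(keywords));

            try {
                while (stream.incrementToken()) {
                    // TermAttribute is the pre-4.0 API; Lucene 4.0 removed it in favor of CharTermAttribute
                    result.add(stream.getAttribute(TermAttribute.class).term());
                }
            }
            catch (IOException e) {
                // not thrown b/c we're using a string reader...
            }

            return result;
        }
    }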
stevevls answered Sep 28 '22 13:09