Is there a simple way I could use any subclass of Lucene's Analyzer
to parse/tokenize a String
?
Something like:
String to_be_parsed = "car window seven"; Analyzer analyzer = new StandardAnalyzer(...); List<String> tokenized_string = analyzer.analyze(to_be_parsed);
Analysers are mainly the factory class for TokenStreams and in particular the EnglishAnalyzer is thread-safe.
Overview. Lucene Analyzers are used to analyze text while indexing and searching documents.
Tokenization is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded.
The string tokenizer class allows an application to break a string into tokens. The tokenization method is much simpler than the one used by the StreamTokenizer class. The StringTokenizer methods do not distinguish among identifiers, numbers, and quoted strings, nor do they recognize and skip comments.
Based off of the answer above, this is slightly modified to work with Lucene 4.0.
public final class LuceneUtil { private LuceneUtil() {} public static List<String> tokenizeString(Analyzer analyzer, String string) { List<String> result = new ArrayList<String>(); try { TokenStream stream = analyzer.tokenStream(null, new StringReader(string)); stream.reset(); while (stream.incrementToken()) { result.add(stream.getAttribute(CharTermAttribute.class).toString()); } } catch (IOException e) { // not thrown b/c we're using a string reader... throw new RuntimeException(e); } return result; } }
As far as I know, you have to write the loop yourself. Something like this (taken straight from my source tree):
public final class LuceneUtils { public static List<String> parseKeywords(Analyzer analyzer, String field, String keywords) { List<String> result = new ArrayList<String>(); TokenStream stream = analyzer.tokenStream(field, new StringReader(keywords)); try { while(stream.incrementToken()) { result.add(stream.getAttribute(TermAttribute.class).term()); } } catch(IOException e) { // not thrown b/c we're using a string reader... } return result; } }
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With