
What's the difference between Lucene StandardAnalyzer and EnglishAnalyzer?

Tags:

lucene

I'm indexing English-language tweets with Lucene 4.3, but I'm not sure which Analyzer to use. What's the difference between Lucene's StandardAnalyzer and EnglishAnalyzer?

I also tested the StandardAnalyzer with this text: "XY&Z Corporation - xyz@example.com". The output is: [xy] [z] [corporation] [xyz] [example.com], but I expected: [XY&Z] [Corporation] [xyz@example.com]

Am I doing something wrong?

Jack Twain asked Jun 09 '13 16:06




1 Answer

Take a look at the source. Analyzers are generally quite readable. You just need to look at the createComponents method to see which Tokenizer and Filters an analyzer uses. Here is EnglishAnalyzer's:

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    // prior to this we get the classic behavior, standardfilter does it for us.
    if (matchVersion.onOrAfter(Version.LUCENE_31))
      result = new EnglishPossessiveFilter(matchVersion, result);
    result = new LowerCaseFilter(matchVersion, result);
    result = new StopFilter(matchVersion, result, stopwords);
    if(!stemExclusionSet.isEmpty())
      result = new KeywordMarkerFilter(result, stemExclusionSet);
    result = new PorterStemFilter(result);
    return new TokenStreamComponents(source, result);
}

By comparison, StandardAnalyzer is just a StandardTokenizer, StandardFilter, LowerCaseFilter, and StopFilter. EnglishAnalyzer adds an EnglishPossessiveFilter, a KeywordMarkerFilter, and a PorterStemFilter.

In short, EnglishAnalyzer adds English-specific possessive stripping and stemming, which should work well for plain English text.
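To make the effect of those extra filters concrete, here is a rough sketch in plain Java that mimics the two chains. It is a toy model and not Lucene code: the class name, the short stop word list, the regex tokenizer, and the one-suffix "stemmer" are all assumptions made for illustration, and the real StandardTokenizer and PorterStemFilter are considerably more sophisticated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of the two filter chains. This is NOT Lucene code.
class AnalyzerChainSketch {

    // Hypothetical short stop list; Lucene's default English set is longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "are", "is", "of", "the", "to"));

    // Crude stand-in for StandardTokenizer: split on runs of characters that
    // are not letters, digits, or apostrophes (much simpler than UAX #29).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for EnglishPossessiveFilter: drop a trailing 's.
    static String stripPossessive(String token) {
        return token.endsWith("'s") || token.endsWith("'S")
                ? token.substring(0, token.length() - 2)
                : token;
    }

    // Crude stand-in for PorterStemFilter: strip one common suffix.
    // The real Porter algorithm has many more rules.
    static String crudeStem(String token) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (token.endsWith(suffix) && token.length() > suffix.length() + 2) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // StandardAnalyzer-like chain: tokenize, lowercase, remove stop words.
    static List<String> standardLike(String text) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                out.add(lower);
            }
        }
        return out;
    }

    // EnglishAnalyzer-like chain: possessive stripping, lowercase, stop words,
    // then stemming unless the token is in the exclusion set.
    static List<String> englishLike(String text, Set<String> stemExclusions) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            String lower = stripPossessive(token).toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(lower)) {
                continue;
            }
            out.add(stemExclusions.contains(lower) ? lower : crudeStem(lower));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The user's tweets are indexed";
        System.out.println(standardLike(text));                                 // [user's, tweets, indexed]
        System.out.println(englishLike(text, Collections.<String>emptySet())); // [user, tweet, index]
    }
}
```

With the EnglishAnalyzer-like chain, a query for "tweet" would match a document containing "tweets", at the cost of losing the exact surface form. Passing a non-empty exclusion set plays the role of the stemExclusionSet in the source above: those tokens skip the stemming step.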

For StandardAnalyzer, the only assumption I'm aware of that ties it directly to English is the default stopword set, which is just a default and can be changed. StandardAnalyzer now implements the word-break rules of Unicode Standard Annex #29, which aim to provide language-independent text segmentation. Under those rules, characters such as & and @ act as token boundaries, which is why your test text is split the way it is.

femtoRgon answered Oct 07 '22 16:10