What's the difference between Lucene StandardAnalyzer and EnglishAnalyzer?

Tags:

lucene

I'm indexing English-language tweets with Lucene 4.3, but I'm not sure which Analyzer to use. What's the difference between Lucene's StandardAnalyzer and EnglishAnalyzer?

I also tested the StandardAnalyzer with this text: "XY&Z Corporation - xyz@example.com". The output is: [xy] [z] [corporation] [xyz] [example.com], but I expected: [XY&Z] [Corporation] [xyz@example.com]

Am I doing something wrong?

asked Jun 09 '13 16:06

Jack Twain

1 Answer

Take a look at the source. Analyzers are generally quite readable: you only need to look at the createComponents method to see the Tokenizer and filters an analyzer uses. Here it is for EnglishAnalyzer:

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    // prior to this we get the classic behavior, standardfilter does it for us.
    if (matchVersion.onOrAfter(Version.LUCENE_31))
      result = new EnglishPossessiveFilter(matchVersion, result);
    result = new LowerCaseFilter(matchVersion, result);
    result = new StopFilter(matchVersion, result, stopwords);
    if(!stemExclusionSet.isEmpty())
      result = new KeywordMarkerFilter(result, stemExclusionSet);
    result = new PorterStemFilter(result);
    return new TokenStreamComponents(source, result);
}

StandardAnalyzer, by contrast, is just a StandardTokenizer, StandardFilter, LowerCaseFilter, and StopFilter. EnglishAnalyzer adds an EnglishPossessiveFilter, a KeywordMarkerFilter, and a PorterStemFilter on top of that.

Mainly, the EnglishAnalyzer rolls in some English stemming enhancements, which should work well for plain English text.
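To make the difference between the two chains concrete, here is a toy model in plain Java. It is only a sketch and not Lucene code: the class ToyChains, its regex tokenizer, the six-word stop list, and toyStem (which merely strips one trailing "s") are hypothetical stand-ins for StandardTokenizer, the default stop set, and PorterStemFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of the two filter chains. This is NOT Lucene code.
final class ToyChains {
    // Hypothetical mini stop list; Lucene's default English set is larger.
    static final Set<String> STOP =
            new HashSet<String>(Arrays.asList("a", "an", "and", "the", "of", "is"));

    // Stand-in tokenizer: split on anything that is not a letter, digit, or apostrophe.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.split("[^\\p{L}\\p{Nd}']+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Stand-in for EnglishPossessiveFilter: drop a trailing 's or 'S.
    static String stripPossessive(String t) {
        return (t.endsWith("'s") || t.endsWith("'S")) ? t.substring(0, t.length() - 2) : t;
    }

    // Crude stand-in for PorterStemFilter: strips one plural "s". Not the Porter algorithm.
    static String toyStem(String t) {
        boolean plural = t.length() > 3 && t.endsWith("s") && !t.endsWith("ss");
        return plural ? t.substring(0, t.length() - 1) : t;
    }

    // StandardAnalyzer-like chain: tokenize, lowercase, remove stop words.
    static List<String> standardLike(String text) {
        List<String> out = new ArrayList<String>();
        for (String token : tokenize(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP.contains(lower)) out.add(lower);
        }
        return out;
    }

    // EnglishAnalyzer-like chain: same steps plus possessive removal and stemming.
    // Terms in stemExclusions are passed through unstemmed, as KeywordMarkerFilter arranges.
    static List<String> englishLike(String text, Set<String> stemExclusions) {
        List<String> out = new ArrayList<String>();
        for (String token : tokenize(text)) {
            String t = stripPossessive(token).toLowerCase(Locale.ROOT);
            if (STOP.contains(t)) continue;
            out.add(stemExclusions.contains(t) ? t : toyStem(t));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The Dog's toys";
        System.out.println(standardLike(text));                        // [dog's, toys]
        System.out.println(englishLike(text, new HashSet<String>()));  // [dog, toy]
    }
}
```

The practical consequence is that with the English-style chain a query for "toy" can match a document containing "toys", at the cost of indexing stems rather than the original word forms.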

For StandardAnalyzer, the only assumption I'm aware of that ties it directly to English is the default stopword set, which is just a default and can be changed. StandardAnalyzer now implements the word-break rules of Unicode Standard Annex #29, which aim to provide language-independent text segmentation.
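The UAX #29 rules are also why the sample string in the question is split at "&" and "@". A minimal sketch for checking this yourself, assuming Lucene 4.3 with lucene-core and lucene-analyzers-common on the classpath: it prints the tokens that several analyzers produce for the same text. The class name TokenDump, the helper tokens, and the field name "body" are made up for illustration, and the sample string assumes the address in the question was xyz@example.com, which is what the reported tokens suggest. ClassicAnalyzer (the pre-3.1 grammar) and UAX29URLEmailAnalyzer are included because both are documented to recognize e-mail addresses as single tokens; note that every one of these analyzers lowercases its output.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.UAX29URLEmailAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical helper class for inspecting analyzer output.
final class TokenDump {

    // Runs the analyzer over the text and collects the emitted terms in order.
    static List<String> tokens(Analyzer analyzer, String text) {
        List<String> out = new ArrayList<String>();
        try {
            // "body" is an arbitrary field name; these analyzers ignore it.
            TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
            try {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();                      // required before the first incrementToken()
                while (ts.incrementToken()) {
                    out.add(term.toString());
                }
                ts.end();
            } finally {
                ts.close();
            }
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "XY&Z Corporation - xyz@example.com";
        Analyzer[] analyzers = {
            new StandardAnalyzer(Version.LUCENE_43),
            new EnglishAnalyzer(Version.LUCENE_43),
            new ClassicAnalyzer(Version.LUCENE_43),
            new UAX29URLEmailAnalyzer(Version.LUCENE_43)
        };
        for (Analyzer analyzer : analyzers) {
            System.out.println(analyzer.getClass().getSimpleName() + " -> " + tokens(analyzer, text));
        }
    }
}
```

For tweets, comparing the four outputs on a handful of real messages is probably the quickest way to decide, since mentions, hashtags, and URLs are handled differently by each tokenizer.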

answered Oct 07 '22 16:10

femtoRgon
