I am new to Solr. I want to know when to use StandardTokenizerFactory and when to use KeywordTokenizerFactory. I read the docs on the Apache wiki, but I still don't get it. Can anybody explain the difference between StandardTokenizerFactory and KeywordTokenizerFactory?


Difference between StandardTokenizerFactory and KeywordTokenizerFactory in Solr?

1 Answer

StandardTokenizerFactory :-
It splits text on whitespace and punctuation, and discards the punctuation characters.

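Documentation (this describes the tokenizer as of Solr 3.0 and earlier; later versions follow the Unicode UAX#29 word-boundary rules, and the older behaviour is available as ClassicTokenizerFactory) :-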

Splits words at punctuation characters, removing punctuations. However, a dot that's not followed by whitespace is considered part of a token. Splits words at hyphens, unless there's a number in the token. In that case, the whole token is interpreted as a product number and is not split. Recognizes email addresses and Internet hostnames as one token.

Use this for fields where you want to search within the field data, such as full-text fields.

e.g. -

http://example.com/I-am+example?Text=-Hello

would generate 7 tokens (shown here separated by commas) -

http,example.com,I,am,example,Text,Hello
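For reference, a field type that uses this tokenizer might be declared in the schema roughly as follows. This is only a minimal sketch: the names `text_search` and `url_text` are made up for illustration, and the lowercase filter is an optional addition that is not part of the tokenizer itself.

```xml
<!-- Hypothetical field type: splits the value into tokens for full-text search -->
<fieldType name="text_search" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Optional: lowercases each token so matching is case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field that uses the type above -->
<field name="url_text" type="text_search" indexed="true" stored="true"/>
```

You can check which tokens a field type actually produces for a given input on the Analysis screen of the Solr admin UI.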

KeywordTokenizerFactory :-

Keyword Tokenizer does not split the input at all.
No processing is performed on the string; it does no real tokenization and returns the original text as a single term.

It is mainly used for sorting and faceting. Faceting needs the exact value, so that a multi-word facet matches as a whole when you filter on it, and sorting does not work on tokenized fields.

e.g.

http://example.com/I-am+example?Text=-Hello

would generate a single token -

http://example.com/I-am+example?Text=-Hello
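A common setup, sketched here under assumptions, keeps one tokenized copy of a value for searching and one untokenized copy for sorting and faceting. The names `string_exact`, `title`, `title_exact` and `text_search` are hypothetical; `text_search` stands for any field type built on StandardTokenizerFactory.

```xml
<!-- Hypothetical field type: the whole value is indexed as one term -->
<fieldType name="string_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Tokenized field for searching (assumes a StandardTokenizerFactory-based type) -->
<field name="title" type="text_search" indexed="true" stored="true"/>

<!-- Untokenized copy for sorting and faceting -->
<field name="title_exact" type="string_exact" indexed="true" stored="false"/>
<copyField source="title" dest="title_exact"/>
```

Queries would then search on `title` and sort or facet on `title_exact`. If no filters are needed at all, `solr.StrField` also stores the value as a single untokenized term; a KeywordTokenizerFactory-based type is useful when you still want to add filters such as lowercasing.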

answered Oct 22 '22 08:10

Jayendra
