I'm looking for feedback on which analyzer to use with an index that has documents in multiple languages. Currently I am using the SimpleAnalyzer, as it seems to handle the broadest range of languages. Most of the documents to be indexed will be English, but there will be the occasional double-byte language indexed as well.
Are there any other suggestions, or should I just stick with the SimpleAnalyzer?
Thanks
From your description, I presume you have documents in multiple languages, but each document has text in only one language.
For this case, you can use Nutch's language identification to get the language of the document, then use the respective language analyzer to index it. To get correct results at search time, you need to apply language identification to the search query as well and use the matching analyzer.
The upside here is that you will be able to use language-specific stemmers and stop words, pushing the quality of search up. The extra overhead while indexing should be acceptable. Search queries where language identification fails to identify the correct language may suffer, though. I used this a couple of years back and the results were better than expected.
For CJK, you can apply a similar technique, but the tools might be different.
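To make the idea concrete, here is a minimal sketch of the "identify the language, then pick the analyzer" flow. It is plain Python, not Nutch or Lucene code: `detect_language`, `analyze_en` and `analyze_cjk` are hypothetical names, the detector is a toy that only looks at the script of the characters, and stemming is left out. The point is only that the same `analyze` function is used for both documents and queries.

```python
import re

# Toy stand-in for a real language identifier: it only checks for
# kana, CJK ideographs, or Hangul syllables (an assumption for illustration).
CJK_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]+")

# Tiny illustrative stop-word list, not a real one.
EN_STOPWORDS = {"the", "a", "an", "of", "and", "is"}


def detect_language(text):
    """Return 'cjk' if the text contains any CJK character, else 'en'."""
    return "cjk" if CJK_RUN.search(text) else "en"


def analyze_en(text):
    """Lower-case, split on non-letters, drop stop words (no stemming)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in EN_STOPWORDS]


def analyze_cjk(text):
    """Emit overlapping character bigrams for each run of CJK characters.

    Non-CJK text in the input is ignored by this toy analyzer.
    """
    tokens = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            tokens.append(run)
        tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


ANALYZERS = {"en": analyze_en, "cjk": analyze_cjk}


def analyze(text):
    """Pick the analyzer by detected language; use for documents AND queries."""
    return ANALYZERS[detect_language(text)](text)
```

If the detector guesses wrong for a short query, the query terms will not line up with the indexed terms, which is the failure mode mentioned above.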
I've used the StandardAnalyzer with non-English words and it works ok. It even deals with accented characters. If the language is CJK (Chinese, Japanese, Korean), Russian or German it may have problems, but I suspect most of the problems will be related to the stemming of words. If you don't have stemming enabled, it will probably be adequate.
SimpleAnalyzer really is simple: all it does is split the text on non-letter characters and lower-case the terms. I'd have thought that the StandardAnalyzer would give better results than SimpleAnalyzer even with non-English data. You could perhaps improve it slightly by supplying a custom list of stop words in addition to the default English ones.
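For anyone wondering what that means in practice, here is a rough approximation in plain Python. It is not Lucene code: `simple_analyze` is a made-up name, and it assumes the analyzer's behaviour is "take runs of letters, lower-case them", with an optional stop-word set along the lines suggested above.

```python
import re

# Runs of Unicode letters: word characters minus digits and underscore.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def simple_analyze(text, stopwords=frozenset()):
    """Split on non-letters, lower-case, and drop any supplied stop words."""
    terms = (run.lower() for run in LETTER_RUN.findall(text))
    return [t for t in terms if t not in stopwords]
```

Note that a run of CJK characters with no spaces or punctuation counts as one long run of letters here, so it comes out as a single term. That is one reason a purely letter-based approach tends to be weak for those languages.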
Purely anecdotal evidence, but we use a (customised, but not in any relevant way) version of StandardAnalyzer for our system. Our documents may not only be in different languages to each other, but documents may contain chunks of different languages (for example, imagine an article written in Japanese with comments in English), so language-sniffing is difficult.
The majority of our documents are in English, but significant numbers are in Chinese and Japanese, with a smaller number in French, Spanish, Portuguese and Korean.
End result? We use StandardAnalyzer, and have very few complaints from people using the system in non-Roman languages about the way our searching works. Our system is somewhat 'enforced' on its users, by the way, so it's not like people are not complaining but moving elsewhere; if they're unhappy, we generally know.
So based on the fact that I'm not swamped with user complaints (very occasional ones, mainly about Chinese, but nothing serious and they're easily explained) it seems to be 'good enough' for many cases.
The correct answer depends on your main language (if any).
For best cross-language IR performance I'd go with a 4- or 5-gram analyzer; it has been shown to work well on many languages. It might even work better than SimpleAnalyzer for English too. See http://www.eecs.qmul.ac.uk/~christof/html/publications/inrt142.pdf for example.
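In case the n-gram idea is unfamiliar, here is a minimal sketch of character n-gram tokenizing in plain Python. It is only an illustration, not the setup from the linked paper or a Lucene tokenizer: `char_ngrams` is a hypothetical helper, and keeping words shorter than `n` whole is my own assumption.

```python
import re


def char_ngrams(text, n=4):
    """Return overlapping character n-grams for each word in the text.

    Words with n or fewer characters are emitted unchanged.
    """
    grams = []
    for word in re.findall(r"\w+", text.lower()):
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams
```

Because the index terms are just sliding windows of characters, no language-specific stemmer or word segmenter is needed, at the cost of a larger index.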
I have looked into this, but from another angle. It seems like there isn't a catch-all analyzer - each language needs its own approach for the best results.