Is it OK to create a term for each number in a text? Example text:
I got 2295910 unique terms.
The numbers can be timestamps, port numbers, anything. The unique numbers lead to a very large number of unique terms. It does not feel right to have the same number of unique terms as documents. Lucene memory usage grows with the number of unique terms.
Is there a special analyzer or a trick for texts with numbers? The StandardAnalyzer creates a term for each unique number.
The needs:
The numbers should remain searchable. There could be multiple numbers in a document. The memory usage is the issue. I have 800M documents in multiple index directories. The memory usage forces me to close the least recently used IndexSearchers.
Untested ideas:
Maybe I'm reinventing the wheel. Was it solved by somebody already?
Are you currently having a memory problem? It is true that Lucene memory usage grows with the number of unique terms, but it's still a relatively minuscule amount of memory even for indices that have a lot of terms.
If memory is an issue and you've profiled your code to ensure that it is indeed Lucene that is the problem, you can create another Analyzer that throws away numeric terms. If you do that, obviously, you won't be able to search for documents using numbers.
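To make the trade-off concrete, here is a minimal sketch of the "throw away numeric terms" idea in plain Java. It is not Lucene code: the class and method names are hypothetical, and the split on non-alphanumeric characters only roughly approximates what StandardAnalyzer does. In a real index the same check would presumably live in a custom token filter placed after the tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical standalone illustration; not part of the Lucene API.
class NumericTermDropper {

    // True if the token consists only of ASCII digits (e.g. a port or timestamp).
    static boolean isNumeric(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    // Lowercases the text, splits it on anything that is not a letter or digit,
    // and keeps only the tokens that are not purely numeric.
    static List<String> dropNumericTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !isNumeric(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [port, opened, at, by, host42]
        System.out.println(dropNumericTerms("Port 8080 opened at 1325376000 by host42"));
    }
}
```

Note that mixed tokens such as `host42` survive, while `8080` and `1325376000` never become terms, which is exactly why they stop being searchable.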