
Indexing texts with many numbers in Lucene

Is it OK to create a term for each number in a text? Example text:

I got 2295910 unique terms.

The numbers can be timestamps, port numbers, anything. The unique numbers lead to a very large number of unique terms. It does not feel right to have the same number of unique terms as documents. Lucene memory usage grows with the number of unique terms.

Is there a special analyzer or a trick for texts with numbers? The StandardAnalyzer creates a term for each unique number.

The needs:

  • The numbers should remain searchable.
  • There could be multiple numbers in a document.
  • Memory usage is the issue. I have 800M documents in multiple index directories, and the memory usage forces me to close the least recently used IndexSearchers.

Untested ideas:

  • Use a special analyzer. It would split the numbers into chunks. 123456 would become "123 456". The query parser would use a phrase search to find a number.
  • Change Lucene code to use a bigger termInfosIndexDivisor when seeing numeric terms.
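The first idea can be tried outside Lucene before writing a real analyzer. The following is a minimal sketch in plain Java, assuming a chunk size of 3 as in the "123 456" example; `NumberChunker` and its methods are hypothetical names, not Lucene APIs. In an actual index the same splitting would have to happen in the analyzer's token stream, with each chunk getting its own position so that a phrase query can match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Lucene: splits digit runs into fixed-size chunks. */
class NumberChunker {
    private static final Pattern DIGITS = Pattern.compile("\\d+");
    static final int CHUNK = 3; // assumed chunk size, taken from the "123 456" example

    /** Splits one number into chunks of up to CHUNK digits, left to right. */
    static List<String> chunks(String number) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < number.length(); i += CHUNK) {
            out.add(number.substring(i, Math.min(i + CHUNK, number.length())));
        }
        return out;
    }

    /** Index side: rewrites every digit run in the text as space-separated chunks. */
    static String chunkText(String text) {
        Matcher m = DIGITS.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // The replacement holds only digits and spaces, so no escaping is needed.
            m.appendReplacement(sb, String.join(" ", chunks(m.group())));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    /** Query side: a number becomes a quoted phrase over the same chunks. */
    static String toPhraseQuery(String number) {
        return "\"" + String.join(" ", chunks(number)) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(chunkText("port 123456 opened")); // port 123 456 opened
        System.out.println(toPhraseQuery("123456"));         // "123 456"
    }
}
```

With 3-digit chunks there can be at most 10 + 100 + 1000 = 1110 distinct numeric terms, however many unique numbers the documents contain. The price is precision: a phrase search for "123 456" would also match a document containing 1234567, because that number is indexed as "123 456 7".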

Maybe I'm reinventing the wheel. Has somebody already solved this?

Ivo Danihelka asked Jan 28 '26 03:01



1 Answer

Are you currently having a memory problem? It is true that Lucene memory usage grows with the number of unique terms, but it's still a relatively minuscule amount of memory even for indices that have a lot of terms.
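To put a rough number on that, here is a back-of-the-envelope sketch. It assumes the pre-4.0 Lucene layout, where only every termIndexInterval-th term (default 128) is held in memory, and where a termInfosIndexDivisor of N keeps only every N-th of those entries. `TermIndexEstimate` is a hypothetical helper, and the result is an approximation per index, not a measurement.

```java
/** Hypothetical estimate of how many terms the in-memory term index holds. */
class TermIndexEstimate {
    /** One in-memory entry per (indexInterval * divisor) terms, rounded up. */
    static long inMemoryTerms(long uniqueTerms, int indexInterval, int divisor) {
        long step = (long) indexInterval * divisor;
        return (uniqueTerms + step - 1) / step;
    }

    public static void main(String[] args) {
        long terms = 2_295_910L; // unique term count quoted in the question
        System.out.println(inMemoryTerms(terms, 128, 1)); // default interval, no divisor
        System.out.println(inMemoryTerms(terms, 128, 4)); // divisor of 4 (assumed value)
    }
}
```

Under these assumptions, about 18,000 of the 2,295,910 terms would sit in memory for that index, and a divisor of 4 would cut that to roughly 4,500 at the cost of slower term lookups.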

If memory is an issue and you've profiled your code to ensure that it is indeed Lucene that is the problem, you can create another Analyzer that throws away numeric terms. If you do that, obviously, you won't be able to search for documents using numbers.
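The filtering step of such an Analyzer could look like the sketch below. It is plain Java over an already tokenized list, so `NumericTokenDropper` is a hypothetical stand-in; in Lucene the same check would live inside a custom TokenFilter wrapped around the tokenizer's output.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a token filter that discards purely numeric tokens. */
class NumericTokenDropper {
    /** True if the token is non-empty and consists only of (Unicode) digits. */
    static boolean isNumeric(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isDigit(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** Returns the tokens in their original order, without the numeric ones. */
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!isNumeric(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Mixed tokens such as "eth0" are kept, since only tokens made up entirely of digits are dropped.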

bajafresh4life answered Jan 31 '26 19:01



