I am trying to implement an index of documents (roughly corresponding to DB rows), where one of the fields is an integer. I'm adding them to the index like this:
Document doc = new Document();
doc.add(new StringField("ticket_number", rs.getString("ticket_number"), Field.Store.YES));
doc.add(new IntField("ticket_id", rs.getInt("ticket_id"), Field.Store.YES));
doc.add(new StringField("id_s", rs.getString("ticket_id"), Field.Store.YES));
w.addDocument(doc);
It seems I can't query the ticket_id field at all, while id_s works just fine.
One of the documents is (I added whitespace for readability):
Document<
stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<ticket_number:230114W>
stored<ticket_id:152>
stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<id_s:152>>
So my int field is stored, but not indexed. This query works as expected: id_s:152, while this one never returns anything: ticket_id:152.
What am I doing wrong? How can I add such a field to the index and make it searchable?
Below works for me:
RAMDirectory idx = new RAMDirectory();
IndexWriter writer = new IndexWriter(
    idx,
    new IndexWriterConfig(Version.LUCENE_40, new ClassicAnalyzer(Version.LUCENE_40))
);

// Index one document with the same three fields as in the question.
Document document = new Document();
document.add(new StringField("ticket_number", "t123", Field.Store.YES));
document.add(new IntField("ticket_id", 234, Field.Store.YES));
document.add(new StringField("id_s", "234", Field.Store.YES));
writer.addDocument(document);
writer.commit();

IndexReader reader = DirectoryReader.open(idx);
IndexSearcher searcher = new IndexSearcher(reader);

// The string field matches with a plain TermQuery.
Query q1 = new TermQuery(new Term("id_s", "234"));
TopDocs td1 = searcher.search(q1, 1);
System.out.println(td1.totalHits); // prints "1"

// The int field needs a numeric query; min == max (both inclusive) gives an exact match.
// Arguments: field, precisionStep, min, max, minInclusive, maxInclusive.
Query q2 = NumericRangeQuery.newIntRange("ticket_id", 1, 234, 234, true, true);
TopDocs td2 = searcher.search(q2, 1);
System.out.println(td2.totalHits); // prints "1"
As femtoRgon pointed out, for numeric values (longs, dates, floats, etc.) you need to use NumericRangeQuery and specify a precision step. Otherwise Lucene has no idea how you want the value to be matched.
Another answer comes from this thread (third answer): Lucene 4.0 IndexWriter updateDocument for Numeric Term
Basically, you create a Term with your int value like this:
String field = "myfield";
int value = 4711;
BytesRef bytes = new BytesRef(NumericUtils.BUF_SIZE_INT);
NumericUtils.intToPrefixCoded(value, 0, bytes);
Term term = new Term(field, bytes);
Then you can use this term for searching, or deleting/updating your index. In a first test, this worked fine for me. I can't tell if this is the "right" way to do things however. I've used the NumericRangeFilter before for filtering IntFields, but now I'm inclined to use this approach and use regular TermsFilter, or TermQueries instead.
Numeric Fields can be queried with a NumericRangeQuery. For an exact match, simply set the max and min to equal values.
Your output indicating the field is not indexed could be due to the differences in how a numeric value is indexed, compared to a text value. Considering that the field is transformed into Lucene's numeric representation, the literal value 152 will indeed not be indexed.
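To make the point about the numeric representation concrete, here is a minimal stdlib-only sketch. It is not Lucene's code: the class SortableIntEncoding, its encode method, and the marker byte value are hypothetical stand-ins, loosely modeled on the idea of a prefix-coded, sortable int, and the exact bytes Lucene writes may differ. It only shows why a term built from the text "152" cannot match a term built from the encoded integer 152.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative stand-in, NOT Lucene's API: a simplified prefix-coded int encoding.
class SortableIntEncoding {

    // Assumed marker value for the leading "shift" byte (chosen for illustration).
    static final int SHIFT_START = 0x60;

    // Encodes value as: one prefix byte (marker + shift), then the sign-flipped
    // bits shifted right by `shift`, split into 7-bit chunks, most significant first.
    static byte[] encode(int value, int shift) {
        if (shift < 0 || shift > 31) {
            throw new IllegalArgumentException("shift must be in 0..31");
        }
        int nChunks = (31 - shift) / 7 + 1;
        byte[] out = new byte[nChunks + 1];
        out[0] = (byte) (SHIFT_START + shift);
        // Flipping the sign bit makes negative values sort before positive ones.
        int sortableBits = (value ^ 0x80000000) >>> shift;
        for (int i = nChunks; i > 0; i--) {
            out[i] = (byte) (sortableBits & 0x7f);
            sortableBits >>>= 7;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] numericTerm = encode(152, 0);
        byte[] textTerm = "152".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(numericTerm));
        System.out.println(Arrays.toString(textTerm));
        System.out.println(Arrays.equals(numericTerm, textTerm)); // prints "false"
    }
}
```

Under these assumptions the two byte sequences share nothing, which is consistent with ticket_id:152 finding no match while a numeric query on the same field does.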
At a glance, however, it's possible that your handling of id_s may be the better alternative. IDs are not usually handled as numeric values, but rather as just simple identifiers that happen to be represented with digits. If you don't need numeric sorting or range querying on the field, indexing as a StringField certainly makes more sense.
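As a rough illustration of the trade-off, the stdlib-only sketch below (the class IdOrdering and its methods are hypothetical, and no Lucene is involved) shows how IDs kept as strings order differently from the same IDs treated as numbers. This is the behavior you give up on sorting and ranges when you pick a string field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper contrasting string ordering with numeric ordering of IDs.
class IdOrdering {

    // Sorts IDs lexicographically, the way plain string terms compare.
    static List<String> sortAsStrings(List<String> ids) {
        List<String> copy = new ArrayList<>(ids);
        Collections.sort(copy);
        return copy;
    }

    // Sorts IDs by their integer value.
    static List<String> sortAsNumbers(List<String> ids) {
        List<String> copy = new ArrayList<>(ids);
        copy.sort((String a, String b) ->
                Integer.compare(Integer.parseInt(a), Integer.parseInt(b)));
        return copy;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("23", "152", "9");
        System.out.println(sortAsStrings(ids)); // prints "[152, 23, 9]"
        System.out.println(sortAsNumbers(ids)); // prints "[9, 23, 152]"
    }
}
```

If exact lookups by ID are all you need, the string ordering is harmless; it only matters once you want results sorted by ID or a query such as "IDs between 10 and 100".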