
Adding a multi-valued string field to a Lucene Document, do commas matter?

Tags:

java

lucene

I'm building a Lucene Index and adding Documents.

I have a multi-valued field; for this example I'll use Categories.

An Item can have many categories, for example, Jeans can fall under Clothing, Pants, Men's, Women's, etc.

When adding the field to a document, do commas make a difference? Will Lucene simply ignore them? If I change commas to spaces, will there be a difference? Does this automatically make the field multi-valued?

String categoriesForItem = getCategories(); // returns "category1, category2, cat3" from a DB call

categoriesForItem = categoriesForItem.replaceAll(",", " ").trim(); // not sure if to remove comma

doc.add(new StringField("categories", categoriesForItem , Field.Store.YES)); // doc is a Document

Am I doing this correctly, or is there another way to create multi-valued fields?

Any help/advice is appreciated.

SoluableNonagon asked Jan 08 '14 17:01


1 Answer

This would be a better way to index multi-valued fields per document:

String categoriesForItem = getCategories(); // get "category1, category2, cat3" from a DB call

String[] categoriesForItems = categoriesForItem.split(",");
for (String cat : categoriesForItems) {
    // trim() drops the space left after each comma; StringField indexes the value verbatim
    doc.add(new StringField("categories", cat.trim(), Field.Store.YES)); // doc is a Document
}
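To see why the commas themselves are not what makes a field multi-valued, recall that StringField is indexed without tokenization: the entire string becomes a single term. The following is a minimal plain-Java sketch (no Lucene involved) that mimics exact-term lookup with list membership; the class name CategoryValues and the helper split are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Lucene: models each field value as one exact term.
class CategoryValues {

    /** Splits a comma-separated string into trimmed, non-empty values. */
    static List<String> split(String raw) {
        return Arrays.stream(raw.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String raw = "category1, category2, cat3";

        // One untokenized value: only the full string would match exactly.
        List<String> single = List.of(raw.replaceAll(",", " ").trim());

        // One value per category: each category matches on its own.
        List<String> multi = split(raw);

        System.out.println(single.contains("category2")); // false
        System.out.println(multi.contains("category2"));  // true
        System.out.println(multi);                        // [category1, category2, cat3]
    }
}
```

Under this model, replacing commas with spaces changes the single stored term but still leaves one value, so a lookup for an individual category finds nothing.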

Whenever multiple fields with the same name appear in one document, both the inverted index and term vectors will logically append the tokens of the field to one another, in the order the fields were added.

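Also, for analyzed fields (for example TextField), two different values can be separated during the analysis phase by a position increment gap, which the analyzer supplies through getPositionIncrementGap(). Lucene's base Analyzer returns 0 by default, so override it if you need a real gap. Let me explain why this is needed.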

Say your field "categories" in Document D1 has two values, "foo bar" and "foo baz". If you were to run the phrase query "bar foo", D1 should not come up. This is ensured by adding an extra position increment between two values of the same field.

If you concatenate the field values yourself and rely on the analyzer to split them into multiple values, "bar foo" would return D1, which would be incorrect.
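The effect of the gap can be illustrated with a small simulation. This is a simplified plain-Java sketch, not Lucene code: it assumes whitespace tokenization, assigns consecutive positions to tokens, and skips `gap` extra positions between values. The class PositionGapDemo and its methods are hypothetical names, and the gap of 100 is just an example value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of token positions in a multi-valued field (no Lucene dependency).
class PositionGapDemo {

    /** Maps each term to its positions; adds `gap` extra positions between values. */
    static Map<String, List<Integer>> index(List<String> values, int gap) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int next = 0;
        for (int v = 0; v < values.size(); v++) {
            if (v > 0) {
                next += gap; // the position increment gap between two values
            }
            for (String term : values.get(v).trim().split("\\s+")) {
                positions.computeIfAbsent(term, k -> new ArrayList<>()).add(next++);
            }
        }
        return positions;
    }

    /** True if some occurrence of `second` directly follows an occurrence of `first`. */
    static boolean matchesPhrase(Map<String, List<Integer>> positions, String first, String second) {
        List<Integer> secondPositions = positions.getOrDefault(second, List.of());
        for (int p : positions.getOrDefault(first, List.of())) {
            if (secondPositions.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> values = List.of("foo bar", "foo baz");
        System.out.println(matchesPhrase(index(values, 0), "bar", "foo"));   // true
        System.out.println(matchesPhrase(index(values, 100), "bar", "foo")); // false
        System.out.println(matchesPhrase(index(values, 100), "foo", "baz")); // true
    }
}
```

With no gap, "bar" at the end of the first value sits right next to "foo" at the start of the second, so the phrase matches across the value boundary; with a gap, phrases inside a single value still match while the cross-boundary match disappears.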

varunthacker answered Oct 18 '22 16:10