Do documents in Lucene have to contain the same fields?

I'm working on implementing a search engine for our company's various content types, and am attempting to wrap my head around Lucene (specifically the .NET flavor).

For the moment, my primary question is whether or not documents one indexes have to contain the same fields.

For instance:

Document1:

  • Title: "I'm a document, baby"
  • Body: "Here are some important things"
  • Latitude: 26.12224
  • Longitude: -65.23124
  • Brand: Toshiba

Document2:

  • Title: "Another Document by Me"
  • Body: "Lorem ipsum and all that jazz"
  • Category: Articles
  • Author: Sir Loin

...and so forth

Matt, asked Jan 14 '10 19:01


People also ask

What is a document in Lucene?

A Document is a set of fields. Each field has a name and a textual value. A field may be stored with the document, in which case it is returned with search hits on the document. Thus each document should typically contain one or more stored fields which uniquely identify it.

How does Lucene store data?

The general answer is that Lucene implements an inverted index. The specifics of how Lucene stores it are described in its file formats documentation, but the general idea is that it stores an inverted index data structure plus other auxiliary data structures that help answer queries quickly.

What are Lucene fields?

A field is a section of a Document. Each field has three parts: name, type and value. Values may be text (String, Reader or pre-analyzed TokenStream), binary (byte[]), or numeric (a Number). Fields are optionally stored in the index, so that they may be returned with hits on the document.

How does Lucene index work?

In a nutshell, when Lucene indexes a document it breaks it down into a number of terms. It then stores the terms in an index file where each term is associated with the documents that contain it. You could think of it as a bit like a hash table keyed by term.


1 Answer

Nothing in Lucene forces uniformity. Each document can carry its own set of fields.

If you search on a field named 'fred', and not all docs have 'fred,' that search will not find the fredless docs.
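To make that behavior concrete, here is a minimal sketch in Python of a per-field inverted index. It is a toy model and not Lucene's API: the `ToyIndex` class, its `add` and `search` methods, and the whitespace/punctuation tokenizer are all hypothetical names and choices made up for illustration. The two documents are trimmed versions of the ones in the question.

```python
import re
from collections import defaultdict


class ToyIndex:
    """Toy per-field inverted index (illustrative only, not Lucene's API)."""

    def __init__(self):
        # Maps (field name, term) -> set of ids of documents containing it.
        self.postings = defaultdict(set)
        self.docs = []

    def add(self, fields):
        """Index a document given as a dict of field name -> value."""
        doc_id = len(self.docs)
        self.docs.append(fields)
        for name, value in fields.items():
            # Assumed tokenizer: lowercase, split on non-word characters.
            for term in re.findall(r"\w+", str(value).lower()):
                self.postings[(name, term)].add(doc_id)
        return doc_id

    def search(self, field, term):
        """Return sorted ids of documents whose `field` contains `term`."""
        return sorted(self.postings.get((field, term.lower()), set()))


index = ToyIndex()

# Two documents with different field sets; nothing requires them to match.
index.add({
    "Title": "I'm a document, baby",
    "Body": "Here are some important things",
    "Brand": "Toshiba",
})
index.add({
    "Title": "Another Document by Me",
    "Body": "Lorem ipsum and all that jazz",
    "Category": "Articles",
})

print(index.search("Title", "document"))     # both documents have a Title
print(index.search("Brand", "toshiba"))      # only the first has a Brand
print(index.search("Category", "articles"))  # only the second has a Category
print(index.search("fred", "anything"))      # no document has 'fred'
```

A search on a shared field such as `Title` can match either document, while a search on `Brand` or `Category` can only match the documents that were indexed with that field. Documents lacking the field simply never appear in its postings, so no error is raised and no schema is needed.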

bmargulies, answered Nov 06 '22 05:11