I'm considering / working on implementing a search engine for our company's various content types, and am attempting to wrap my head around Lucene (specifically the .net flavor).
For the moment, my primary question is whether or not documents one indexes have to contain the same fields.
For instance:
Document1:
Document2:
...and so forth
A Document is a set of fields. Each field has a name and a textual value. A field may be stored with the document, in which case it is returned with search hits on the document. Thus each document should typically contain one or more stored fields which uniquely identify it.
But the more general answer is that they use/implement a Inverted Index. The specifics of how Lucene stores it you can find in file formats (as milan said). But the general idea is that they store a Inverted Index data structure and other auxiliar data structures to help answer queries quickly.
A field is a section of a Document. Each field has three parts: name, type and value. Values may be text (String, Reader or pre-analyzed TokenStream), binary (byte[]), or numeric (a Number). Fields are optionally stored in the index, so that they may be returned with hits on the document.
In a nutshell, when lucene indexes a document it breaks it down into a number of terms. It then stores the terms in an index file where each term is associated with the documents that contain it. You could think of it as a bit like a hashtable.
Nothing in lucene forces uniformity.
If you search on a field named 'fred', and not all docs have 'fred,' that search will not find the fredless docs.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With