I need to perform text searches on documents based on the following scopes:
Is it possible to index a document so that I can filter the scope of the query based on this requirement?
Edit in response to the answers
I have now created the following index:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  },
  "mappings": {
    "books": {
      "properties": {
        "content": {
          "type": "string",
          "fields": {
            "english": {
              "type": "string",
              "analyzer": "english"
            },
            "folded": {
              "type": "string",
              "analyzer": "folding"
            }
          }
        },
        "author": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "language": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "source": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "title": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "fileType": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    },
    "sections": {
      "_parent": { "type": "books" },
      "properties": {
        "content": {
          "type": "string",
          "fields": {
            "english": {
              "type": "string",
              "analyzer": "english"
            },
            "folded": {
              "type": "string",
              "analyzer": "folding"
            }
          }
        },
        "paragraphs": {
          "type": "nested",
          "properties": {
            "paragraph": {
              "properties": {
                "page": { "type": "integer" },
                "number": { "type": "integer" },
                "html_tag": { "type": "string" },
                "content": { "type": "string" }
              }
            }
          }
        },
        "author": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "language": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "source": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "title": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "fileType": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    },
    "messages": {
      "properties": {
        "content": {
          "type": "string",
          "fields": {
            "english": {
              "type": "string",
              "analyzer": "english"
            },
            "folded": {
              "type": "string",
              "analyzer": "folding"
            }
          }
        },
        "paragraphs": {
          "type": "nested",
          "properties": {
            "paragraph": {
              "properties": {
                "page": { "type": "integer" },
                "number": { "type": "integer" },
                "html_tag": { "type": "string" },
                "content": { "type": "string" }
              }
            }
          }
        },
        "author": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "language": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "source": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "title": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "fileType": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
This gives me the following types: Books, Sections (with parent Books) and Messages. Sections and Messages have the nested type Paragraphs, and I have skipped the sentence level.
I can now search on content at the book level and at the section level, which allows me to match words that sit in different paragraphs. I can also search directly at the paragraph level, which is helpful if I want to match two words within the same paragraph.
Example: Let's say I have the following document:
paragraph 1: It is a beautiful warm day.
paragraph 2: The cloud is clear.
I can now search for beautiful AND cloud at the content level and get back the document. However, I do not get the document back if I search for beautiful AND cloud at the paragraph level using a nested search, which is what I wanted.
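To make the difference between the two scopes concrete, here is a minimal Python sketch. It builds the two request bodies as plain dictionaries (no client library) and mimics the AND semantics in memory. The field path `paragraphs.paragraph.content` follows the mapping above; the helper names and the crude regex tokenizer are assumptions for illustration and do not reproduce how Elasticsearch analyzes text.

```python
import re


def content_query(words):
    """Match query on the whole `content` field: all words must occur somewhere."""
    return {
        "query": {
            "match": {"content": {"query": " ".join(words), "operator": "and"}}
        }
    }


def paragraph_query(words):
    """Nested query: all words must occur inside the same nested paragraph."""
    return {
        "query": {
            "nested": {
                "path": "paragraphs",
                "query": {
                    "match": {
                        "paragraphs.paragraph.content": {
                            "query": " ".join(words),
                            "operator": "and",
                        }
                    }
                },
            }
        }
    }


def tokens(text):
    # Crude stand-in for an analyzer (assumption): lowercase word tokens.
    return set(re.findall(r"\w+", text.lower()))


def matches_content(paragraphs, words):
    # Content scope: words may be spread over different paragraphs.
    all_tokens = set().union(*(tokens(p) for p in paragraphs))
    return all(w in all_tokens for w in words)


def matches_paragraph(paragraphs, words):
    # Paragraph scope: a single paragraph must contain every word.
    return any(all(w in tokens(p) for w in words) for p in paragraphs)


doc = ["It is a beautiful warm day.", "The cloud is clear."]
```

With the example document, `matches_content(doc, ["beautiful", "cloud"])` holds while `matches_paragraph(doc, ["beautiful", "cloud"])` does not, which mirrors the behaviour described above.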
The problems I see with this solution are:
To achieve this, you can index every sentence as its own document and, along with the sentence text, include information about its enclosing context, i.e. the paragraph, chapter and book the sentence belongs to.
Querying for terms then returns sentences together with their chapter and book information, so you know which sentence, paragraph, chapter or book each hit belongs to.
Then you simply use whichever scope you're interested in.
Example document to index:
{
  "book": <book-id>,
  "chapter": <chapter-id>,
  "paragraph": <paragraph-id>,
  "sentence": <sentence-id>,
  "sentence_text": "Here comes the text from a sentence in the indexed book"
}
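As a rough illustration of this approach, the Python sketch below produces one such document per sentence and then picks the scopes in which all search words occur. The function names, the naive sentence splitter, the id numbering (paragraphs counted per chapter, sentences per book) and the in-memory grouping are assumptions made for the example; in practice the grouping would be applied to the hits returned by Elasticsearch.

```python
import re
from collections import defaultdict


def sentence_docs(book_id, chapters):
    """Yield one flat document per sentence, tagged with its enclosing scopes.

    `chapters` is a list of chapters, each a list of paragraph strings.
    """
    sentence_id = 0
    for chapter_id, paragraphs in enumerate(chapters, start=1):
        for paragraph_id, paragraph in enumerate(paragraphs, start=1):
            # Naive splitter (assumption): break after ., ! or ? plus whitespace.
            for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
                if not sentence:
                    continue
                sentence_id += 1
                yield {
                    "book": book_id,
                    "chapter": chapter_id,
                    "paragraph": paragraph_id,
                    "sentence": sentence_id,
                    "sentence_text": sentence,
                }


def scopes_matching_all(docs, words, scope):
    """Return the keys of `scope` in which every word occurs.

    `scope` is one of "book", "chapter", "paragraph" or "sentence"; a key is
    the tuple of ids from the book level down to that scope.
    """
    levels = ["book", "chapter", "paragraph", "sentence"]
    key_fields = levels[: levels.index(scope) + 1]
    seen = defaultdict(set)
    for d in docs:
        key = tuple(d[k] for k in key_fields)
        seen[key] |= set(re.findall(r"\w+", d["sentence_text"].lower()))
    return sorted(k for k, toks in seen.items() if all(w in toks for w in words))


docs = list(
    sentence_docs(
        1,
        [["It is a beautiful warm day. The cloud is clear."], ["Rain came later."]],
    )
)
```

Here `scopes_matching_all(docs, ["beautiful", "cloud"], "paragraph")` finds the first paragraph, while the same call with `"sentence"` finds nothing, because the two words sit in different sentences.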
Additional answer after question clarification
To achieve this, you could use different document types stored in the same index. One query can then return documents of different types (paragraphs, books, etc.). Afterwards, by filtering the results by type, you get what you want. Here is an example:
Entire book:
POST /books/book/1
{
  "text": "It is a beautiful warm day. The cloud is clear."
}
1st paragraph:
POST /books/para/1
{
  "text": "It is a beautiful warm day."
}
2nd paragraph:
POST /books/para/2
{
  "text": "The cloud is clear."
}
Query to retrieve documents:
POST /books/_search
{
  "query": {
    "match": {
      "text": {
        "query": "beautiful cloud",
        "operator": "and"
      }
    }
  }
}
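The "filtering the result type" step could look roughly like the Python sketch below, which works on the `hits.hits` array of a search response. It assumes an Elasticsearch version that still supports several types per index, where each hit carries a `_type` field. `hits_of_type` is a hypothetical helper, and the sample response is hand-written for illustration (as if the search had been for "cloud" alone), not real output.

```python
def hits_of_type(response, doc_type):
    """Keep only the hits whose `_type` equals `doc_type`."""
    return [h for h in response["hits"]["hits"] if h.get("_type") == doc_type]


# Hand-written stand-in for a search response; not real Elasticsearch output.
sample_response = {
    "hits": {
        "hits": [
            {
                "_index": "books",
                "_type": "book",
                "_id": "1",
                "_source": {"text": "It is a beautiful warm day. The cloud is clear."},
            },
            {
                "_index": "books",
                "_type": "para",
                "_id": "2",
                "_source": {"text": "The cloud is clear."},
            },
        ]
    }
}

# Paragraph scope: keep only the `para` documents.
paragraph_hits = hits_of_type(sample_response, "para")
```

In those versions the same restriction can also be applied on the server side by putting the type in the search URL, for example `POST /books/para/_search`.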
Does this solve your problem?
Another alternative is to have a single document per book with many nested documents within it, so that they all share the same "book" context at the root level. It is up to you whether you have one level of hierarchy (all sentences as nested documents) or more (chapter => paragraph => sentence). A single level would keep queries simpler to write.
{
  "book": 123,
  "author": "Harry",
  "written": 1995,
  "sentences": [
    {
      "chapter": 1,
      "paragraph": 2,
      "sentence": 3,
      "text": "abc def"
    },
    {
      "chapter": 2,
      "paragraph": 3,
      "sentence": 4,
      "text": "ghi jkl"
    },
    { ... }
  ]
}
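A query against this single-level layout might be built as in the following Python sketch. It assumes that `sentences` is mapped with `"type": "nested"` (the mapping is not shown in this answer) and that `inner_hits` is used to report which sentences matched; `sentences_query` is a hypothetical helper that only builds the request body.

```python
def sentences_query(words, chapter=None):
    """Build a nested query over `sentences`, optionally limited to one chapter."""
    must = [
        # All words must occur within the same nested sentence.
        {"match": {"sentences.text": {"query": " ".join(words), "operator": "and"}}}
    ]
    if chapter is not None:
        # Restrict the scope using the context stored on each sentence.
        must.append({"term": {"sentences.chapter": chapter}})
    return {
        "query": {
            "nested": {
                "path": "sentences",
                "query": {"bool": {"must": must}},
                # Ask for the matching nested sentences, not only the book.
                "inner_hits": {},
            }
        }
    }


body = sentences_query(["abc", "def"], chapter=1)
```

The resulting `body` can be sent as the request body of a `_search` call on the index that holds the book documents.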