Is it possible to search for specific scopes with elasticsearch?

Tags:

elasticsearch

I need to perform text searches on documents based on the following scopes:

Whole document
Chapters
Paragraphs
Sentences

Is it possible to index a document so that I can filter the scope of the query based on this requirement?

Edit due to the answers

I have now created the following index

{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  },
  "mappings": {
    "books": {
      "properties": {
        "content": {
          "type": "string",
          "fields": {
            "english": {
              "type": "string",
              "analyzer": "english"
            },
            "folded": {
              "type": "string",
              "analyzer": "folding"
            }
          }
        },
        "author": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "language": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "source": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "title": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "fileType": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    },
    "sections": {
      "_parent": { "type": "books" },
      "properties": {
        "content": {
          "type": "string",
          "fields": {
            "english": {
              "type": "string",
              "analyzer": "english"
            },
            "folded": {
              "type": "string",
              "analyzer": "folding"
            }
          }
        },
        "paragraphs": {
          "type": "nested",
          "properties": {
            "paragraph": {
              "properties": {
                "page": { "type": "integer" },
                "number": { "type": "integer" },
                "html_tag": { "type": "string" },
                "content": { "type": "string" }
              }
            }
          }
        },
        "author": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "language": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "source": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "title": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "fileType": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    },
    "messages": {
      "properties": {
        "content": {
          "type": "string",
          "fields": {
            "english": {
              "type": "string",
              "analyzer": "english"
            },
            "folded": {
              "type": "string",
              "analyzer": "folding"
            }
          }
        },
        "paragraphs": {
          "type": "nested",
          "properties": {
            "paragraph": {
              "properties": {
                "page": { "type": "integer" },
                "number": { "type": "integer" },
                "html_tag": { "type": "string" },
                "content": { "type": "string" }
              }
            }
          }
        },
        "author": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "language": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "source": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "title": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "fileType": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

This gives me the following types: Books, Sections (with Books as parent) and Messages. Sections and Messages both have the nested type Paragraphs, and I have skipped the sentence level.

I can now search on content at the book level and at the section level, which allows me to match words that occur in different paragraphs. I can also search directly at the paragraph level, which is helpful if I want to match two words within the same paragraph.

Example: Let's say I have the following document:

paragraph 1: It is a beautiful warm day.
paragraph 2: The cloud is clear.

I can now search for beautiful AND cloud on the content level and get back the document. If I run the same search on the paragraph level using a nested query, I do not get the document back, which is the behaviour I wanted.
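The two searches described above could be built roughly as follows. This is only a sketch: the helper names are made up for illustration, and the field path paragraphs.paragraph.content is an assumption derived from the mapping above (a nested paragraphs field containing a paragraph object).

```python
import json

def section_scope_query(text, field="content"):
    """All words must occur somewhere in the section's content field."""
    return {"query": {"match": {field: {"query": text, "operator": "and"}}}}

def paragraph_scope_query(text, field="paragraphs.paragraph.content"):
    """All words must occur inside ONE nested paragraph (assumed field path)."""
    return {
        "query": {
            "nested": {
                "path": "paragraphs",
                "query": {
                    "match": {field: {"query": text, "operator": "and"}}
                },
            }
        }
    }

# Print the request body that would be sent to the _search endpoint.
print(json.dumps(paragraph_scope_query("beautiful cloud"), indent=2))
```

The only difference between the two bodies is the nested wrapper, which makes the match apply to each paragraph separately instead of to the flattened text.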

The problems I see with this solution are:

I need to index the same paragraph three times: once at the paragraph level, once in the section-level content and once in the book-level content.
I do not understand what benefit I am getting from the parent/child relationship between Books and Sections. I haven't found any way of searching both at the same time using highlighting.
I need a separate Message type which is exactly the same as the Section type but without a parent. Is there no way of having a child type without a parent, so that I can avoid the extra type?


asked Jan 01 '16 13:01

Alex Lyman

2 Answers

To achieve this you can index every sentence as its own document and, along with the words of the sentence, include information about the enclosing context, i.e. which paragraph, chapter and book the sentence belongs to.

Querying for terms will then return sentences together with their chapter and book information, so you know which sentence, paragraph, chapter or book each hit belongs to.

Then you simply use whatever scope you're interested in.

Example document to index:

{
    "book": <book-id>,
    "chapter": <chapter-id>,
    "paragraph": <paragraph-id>,
    "sentence": <sentence-id>,
    "sentence_text": "Here comes the text from a sentence in the indexed book"
}
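As a rough illustration of this idea, the sketch below produces one such document per sentence and then works out, on the client side, which scopes contain all search terms. The function names, the naive regex sentence splitter and the client-side grouping are assumptions made for this example, not something Elasticsearch provides.

```python
import re
from collections import defaultdict

def sentence_docs(book_id, chapters):
    """Yield one document per sentence, tagged with its enclosing scopes.

    `chapters` is a list of chapters, each a list of paragraph strings.
    The regex splitter is naive and used only for illustration.
    """
    for c_id, paragraphs in enumerate(chapters, start=1):
        for p_id, paragraph in enumerate(paragraphs, start=1):
            parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
            for s_id, sentence in enumerate([p for p in parts if p], start=1):
                yield {
                    "book": book_id,
                    "chapter": c_id,
                    "paragraph": p_id,
                    "sentence": s_id,
                    "sentence_text": sentence,
                }

def scopes_matching_all(docs, terms, scope_keys):
    """Return the scope ids whose sentences together contain every term."""
    wanted = set(terms)
    found = defaultdict(set)
    for doc in docs:
        words = set(re.findall(r"\w+", doc["sentence_text"].lower()))
        found[tuple(doc[k] for k in scope_keys)] |= words & wanted
    return sorted(key for key, hit in found.items() if hit == wanted)

docs = list(sentence_docs(1, [["It is a beautiful warm day.", "The cloud is clear."]]))
terms = ["beautiful", "cloud"]
print(scopes_matching_all(docs, terms, ("book", "chapter")))               # [(1, 1)]
print(scopes_matching_all(docs, terms, ("book", "chapter", "paragraph")))  # []
```

The same two words match at chapter scope but not at paragraph scope, because the grouping key decides how far apart the sentences are allowed to be.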

Additional answer after question clarification

To achieve this you could use different document types stored in the same index. Then you can use one query which will return documents of possibly different types (paragraphs, books, etc). Afterwards by filtering the result type, you get what you want. Here is an example:

Entire book:

POST /books/book/1
{
    "text": "It is a beautiful warm day. The cloud is clear."
}

1st paragraph:

POST /books/para/1
{
    "text": "It is a beautiful warm day."
}

2nd paragraph:

POST /books/para/2
{
    "text": "The cloud is clear."
}

Query to retrieve documents:

POST /books/_search
{
    "query": {
        "match": {
           "text": {
                "query": "beautiful cloud",
                "operator": "and"
           }
        }
    }
}
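Filtering the result by type could look like the sketch below. It assumes each hit carries a _type field, as in the search responses of the Elasticsearch versions that support multiple types per index; the sample response is hand-written for illustration and is not real output.

```python
def hits_of_type(response, doc_type):
    """Keep only the hits whose `_type` equals `doc_type`."""
    return [hit for hit in response["hits"]["hits"] if hit["_type"] == doc_type]

# Hand-written stand-in for a search response (not real output); only the
# fields used by the filter are included.
sample_response = {
    "hits": {
        "hits": [
            {"_index": "books", "_type": "book", "_id": "1"},
            {"_index": "books", "_type": "para", "_id": "1"},
        ]
    }
}

paragraph_hits = hits_of_type(sample_response, "para")
print([hit["_id"] for hit in paragraph_hits])
```

Depending on the version, it may also be possible to restrict the search to one type on the server side by putting the type in the URL, e.g. POST /books/para/_search, which avoids fetching hits that are then thrown away.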

Does this solve your problem?


answered Oct 03 '22 12:10

paweloque

Another alternative is to have a single document per book with many nested documents inside it, so that they all share the same "book" context at the root level. It is up to you whether you have one level of hierarchy (all sentences as nested documents) or more (chapter => paragraph => sentence). A single level keeps queries simpler to write.

{
    "book": 123,
    "author": "Harry",
    "written": 1995,
    "sentences": [
        {
            "chapter": 1,
            "paragraph": 2,
            "sentence": 3,
            "text": "abc def"
        },
        {
            "chapter": 2,
            "paragraph": 3,
            "sentence": 4,
            "text": "ghi jkl"
        },
        { ... }
    ]
}
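A query against such a document might be built as in the sketch below. It assumes the sentences field is mapped with type nested, and it uses inner_hits so that the response shows which nested sentences matched; the helper name and the optional chapter filter are made up for this example.

```python
import json

def sentence_query(text, chapter=None):
    """Match all words of `text` inside a single nested sentence.

    Assumes `sentences` is mapped as a nested field. If `chapter` is given,
    only sentences from that chapter are considered.
    """
    must = [{"match": {"sentences.text": {"query": text, "operator": "and"}}}]
    if chapter is not None:
        must.append({"term": {"sentences.chapter": chapter}})
    return {
        "query": {
            "nested": {
                "path": "sentences",
                "query": {"bool": {"must": must}},
                # Ask for the matching nested sentences, not just the book.
                "inner_hits": {},
            }
        }
    }

print(json.dumps(sentence_query("abc def", chapter=1), indent=2))
```

With a single nested level, widening the scope from sentence to paragraph or chapter has to be done by grouping the inner hits on their chapter and paragraph fields after the search.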

answered Oct 03 '22 13:10

NikoNyrh
