Is it possible to search for specific scopes with elasticsearch?

I need to perform text searches on documents based on the following scopes:

  1. Whole document
  2. Chapters
  3. Paragraphs
  4. Sentences

Is it possible to index a document so that I can restrict the scope of a query to one of these levels?

Edit in response to the answers

I have now created the following index:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  },
  "mappings": {
    "books": {
      "properties": {
        "content": {
          "type": "string",
          "fields": {
            "english": {
              "type": "string",
              "analyzer": "english"
            },
            "folded": {
              "type": "string",
              "analyzer": "folding"
            }
          }
        },
        "author": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "language": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "source": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "title": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "fileType": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    },
    "sections": {
      "_parent": { "type": "books" },
      "properties": {
        "content": {
          "type": "string",
          "fields": {
            "english": {
              "type": "string",
              "analyzer": "english"
            },
            "folded": {
              "type": "string",
              "analyzer": "folding"
            }
          }
        },
        "paragraphs": {
          "type": "nested",
          "properties": {
            "paragraph": {
              "properties": {
                "page": { "type": "integer" },
                "number": { "type": "integer" },
                "html_tag": { "type": "string" },
                "content": { "type": "string" }

              }
            }
          }
        },
        "author": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "language": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "source": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "title": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "fileType": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    },
    "messages": {
      "properties": {
        "content": {
          "type": "string",
          "fields": {
            "english": {
              "type": "string",
              "analyzer": "english"
            },
            "folded": {
              "type": "string",
              "analyzer": "folding"
            }
          }
        },
        "paragraphs": {
          "type": "nested",
          "properties": {
            "paragraph": {
              "properties": {
                "page": { "type": "integer" },
                "number": { "type": "integer" },
                "html_tag": { "type": "string" },
                "content": { "type": "string" }

              }
            }
          }
        },
        "author": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "language": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "source": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "title": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "fileType": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

This gives me the following types: Books, Sections (with Books as parent) and Messages. Sections and Messages both have the nested type Paragraphs, and I have skipped the sentence level.

I can now search the content on the book level and on the section level, which lets me match words that occur in different paragraphs. I can also search directly on the paragraph level, which is helpful if I want to match two words within the same paragraph.

Example: Let's say I have the following document

paragraph 1: It is a beautiful warm day.
paragraph 2: The cloud is clear.

I can now search for beautiful AND cloud on the content level and get the document back. As intended, the same search on the paragraph level using a nested query does not return the document.
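
As a rough sketch of the paragraph-level search described above, a nested query against the mapping shown earlier might look like the following, sent as the body of POST /library/sections/_search. The index name library is a placeholder that does not appear in the original. The field path paragraphs.paragraph.content follows from the paragraphs and paragraph levels in the mapping.

```json
{
  "query": {
    "nested": {
      "path": "paragraphs",
      "query": {
        "match": {
          "paragraphs.paragraph.content": {
            "query": "beautiful cloud",
            "operator": "and"
          }
        }
      }
    }
  }
}
```

Because each nested paragraph is matched on its own, both words have to occur in the same paragraph for the section to be returned.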

The problems I see with this solution are:

  1. I need to index the same paragraph 3 times: once on the Paragraph level, once in the content on the Section level and once in the content on the Book level.
  2. I do not understand what benefit I am getting from the parent/child relationship between Books and Sections. I haven't found any way of searching both at the same time with highlighting.
  3. I need a separate Message type which is exactly the same as the Section type but without a parent. Is there no way of having a child type without a parent so that I can avoid the extra type?

Alex Lyman asked Jan 01 '16 13:01


2 Answers

To achieve this, you can index every sentence as its own document and store, alongside the words of the sentence, information about its enclosing context, i.e. which paragraph, chapter and book the sentence belongs to.

Querying for terms then returns sentences together with the information about their paragraph, chapter and book, so you know which sentence, paragraph, chapter or book each hit belongs to.

You then simply use whichever scope you're interested in.

Example document to index:

{
    "book": <book-id>,
    "chapter": <chapter-id>,
    "paragraph": <paragraph-id>,
    "sentence": <sentence-id>,
    "sentence_text": "Here comes the text from a sentence in the indexed book"
}
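
A minimal sketch of scoping a search with this layout, assuming Elasticsearch 2.x (for the bool filter clause), a hypothetical index library with a type sentence, and a numeric chapter id of 3 chosen purely for illustration. Sent as the body of POST /library/sentence/_search, it should return only the sentences of that chapter that contain the word:

```json
{
    "query": {
        "bool": {
            "must": {
                "match": { "sentence_text": "beautiful" }
            },
            "filter": {
                "term": { "chapter": 3 }
            }
        }
    }
}
```

Each hit carries its book, chapter, paragraph and sentence fields, so the same result can be grouped by any of the coarser scopes on the client side.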

Additional answer after question clarification

To achieve this you could use different document types stored in the same index. A single query then returns documents of possibly different types (paragraphs, books, etc.), and filtering the results by type gives you the scope you want. Here is an example:

Entire book:

POST /books/book/1
{
    "text": "It is a beautiful warm day. The cloud is clear."
}

1st paragraph:

POST /books/para/1
{
    "text": "It is a beautiful warm day."
}

2nd paragraph:

POST /books/para/2
{
    "text": "The cloud is clear."
}

Query to retrieve documents:

POST /books/_search
{
    "query": {
        "match": {
           "text": {
                "query": "beautiful cloud",
                "operator": "and"
           }
        }
    }
}
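
The type filtering can also be pushed into the request itself. One option is to send the same body to POST /books/para/_search. Another, sketched here under the assumption of Elasticsearch 2.x, is to add a type query as a filter to the body sent to POST /books/_search:

```json
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "text": {
                        "query": "beautiful cloud",
                        "operator": "and"
                    }
                }
            },
            "filter": {
                "type": { "value": "para" }
            }
        }
    }
}
```

With the three example documents above, neither paragraph contains both words, so this version finds nothing, whereas changing the value to book matches the entire book.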

Does this solve your problem?

paweloque answered Oct 03 '22 12:10


Another alternative is to have a single document per book with many nested documents inside it, so that they all share the same "book" context at the root level. It is up to you whether you use one level of hierarchy (all sentences as nested documents) or more (chapter => paragraph => sentence). A single level keeps queries simpler to write.

{
    "book": 123,
    "author": "Harry",
    "written": 1995,
    "sentences": [
        {
            "chapter": 1,
            "paragraph": 2,
            "sentence": 3,
            "text": "abc def"
        },
        {
            "chapter": 2,
            "paragraph": 3,
            "sentence": 4,
            "text": "ghi jkl"
        },
        { ... }
    ]
}
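
A possible query against such a document, as a sketch only: it assumes the sentences field is mapped with "type": "nested" and that the books live in a hypothetical index library under a type book, neither of which is spelled out above. Sent as the body of POST /library/book/_search, it looks for a word within the sentences of chapter 1 and uses inner_hits to report which nested sentences matched:

```json
{
    "query": {
        "nested": {
            "path": "sentences",
            "query": {
                "bool": {
                    "must": [
                        { "match": { "sentences.text": "abc" } },
                        { "term": { "sentences.chapter": 1 } }
                    ]
                }
            },
            "inner_hits": {}
        }
    }
}
```

The top-level hits are whole books, and the inner_hits section of each hit lists the matching sentences with their chapter, paragraph and sentence numbers.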

NikoNyrh answered Oct 03 '22 13:10