Storing HTML Documents in Elasticsearch

Scenario

I have HTML documents, say emails. I want to store them in Elasticsearch and search only the plain text of those HTML emails.

Problem

Elasticsearch would index all the HTML tags and attributes, too, which I don't want. I want a search for span to match only when "span" appears as plain text, not as an HTML element. For example, <span>span</span> should be a hit, but <span>some other content</span> should not.

Question

Would you recommend storing both an HTML-stripped field and the raw HTML field in a document? Or should I store the HTML document on S3 and keep only a stripped version in Elasticsearch? Does that even make sense?

I honestly don't know what happens when Elasticsearch indexes an HTML document, but I imagine it also indexes the divs, the spans, and all the attributes. Those are things I never search for. So any suggestion for solving this problem would be great!

What am I doing now?

Right now, before I store a document in ES, I check whether the index for the document type exists. If not, I create the index with a given mapping. The mapping looks like this:

{
    "analysis": {
        "analyzer": {
            "htmlStripAnalyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": "standard",
                "char_filter": [
                    "html_strip"
                ]
            }
        }
    },
    "mappings": {
        "${type}": {
            "dynamic_templates": [
                {
                    "_metadata": {
                        "path_match": "_metadata.*",
                        "mapping": {
                            "type": "keyword"
                        }
                    }
                }
            ],
            "properties": {
                "_tags": {
                    "type": "nested",
                    "dynamic": true
                }
            }
        }
    }
}

Warning: Ignore the existing mappings. They have nothing to do with what I am trying to achieve; they are just there.

I replace ${type} with the document type, let's say emails. What would the mapping look like to tell ES not to index the HTML markup?

AmazingTurtle asked Apr 07 '17 09:04

2 Answers

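A complete test case (for Elasticsearch versions that still use mapping types). The html_strip character filter removes the markup before tokenization, so only the text content is indexed: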

DELETE /test
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "htmlStripAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "html": {
          "type": "text",
          "analyzer": "htmlStripAnalyzer"
        }
      }
    }
  }
}

POST /test/test/1
{
  "html": "<td><tr>span<td></tr>"
}
POST /test/test/2
{
  "html": "<span>whatever</span>"
}
POST /test/test/3
{
  "html": "<html> <body> <h1 style=\"font-family: Arial\">Test</h1> <span>More test</span> </body> </html>"
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "span"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "body"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "more"
    }
  }
}

Update for Elasticsearch >=7 (removal of types)

DELETE /test
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "htmlStripAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "html": {
        "type": "text",
        "analyzer": "htmlStripAnalyzer"
      }
    }
  }
}

POST /test/_doc/1
{
  "html": "<td><tr>span<td></tr>"
}
POST /test/_doc/2
{
  "html": "<span>whatever</span>"
}
POST /test/_doc/3
{
  "html": "<html> <body> <h1 style=\"font-family: Arial\">Test</h1> <span>More test</span> </body> </html>"
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "span"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "body"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "more"
    }
  }
}
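To see why the three searches above behave differently, here is a rough local emulation in Python. It is only a sketch, not Elasticsearch's actual implementation: it assumes that dropping tags with the standard library's HTMLParser, splitting on word characters, and lowercasing approximates what html_strip, the standard tokenizer, and the lowercase filter do for these simple inputs. The names TextOnly, analyze, and search are made up for illustration.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects character data and drops tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def analyze(html):
    """Approximate html_strip + standard tokenizer + lowercase filter."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return [token.lower() for token in re.findall(r"\w+", text)]


# The same three documents as in the test case above.
docs = {
    1: "<td><tr>span<td></tr>",
    2: "<span>whatever</span>",
    3: '<html> <body> <h1 style="font-family: Arial">Test</h1> '
       "<span>More test</span> </body> </html>",
}


def search(term):
    """Return the ids of documents whose analyzed tokens contain the term."""
    return sorted(i for i, html in docs.items() if term.lower() in analyze(html))


print(search("span"))  # [1]  (only the document with "span" as text)
print(search("body"))  # []   (tags are never indexed)
print(search("more"))  # [3]  ("More" is lowercased at index time)
```

Under these assumptions, "span" matches only document 1, "body" matches nothing, and "more" matches document 3, which is the behavior the question asks for.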
Andrei Stefan answered Nov 15 '22 21:11

By default, Elasticsearch dynamically adds any new fields it finds during indexing. From the dynamic mapping documentation:

When Elasticsearch encounters a previously unknown field in a document, it uses dynamic mapping to determine the datatype for the field and automatically adds the new field to the type mapping.

To disable this behavior (see the documentation for more details), the simplest option is to set dynamic to false (new fields are ignored and not added to the mapping) or to strict (an exception is thrown and the document is rejected). In that case, you need to explicitly write the mapping for the tags you wish to keep inside your _tags section, and pre-parse the HTML document so that only the tags you are interested in are fed to Elasticsearch.

So let's say you have the following HTML document:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>A simple example</title>
</head>
<body>
  <div>
    <p><span class="ref">A sentence I want to reference from this HTML document</span></p>
    <p><span class="">Something less important</span></p>
  </div>
</body>
</html>

The first thing you want is a static mapping in Elasticsearch. I would do the following (assuming ref is a string):

PUT html
{
  "mappings": {
    "test": {
      "dynamic": "strict",
      "properties": {
        "ref": {
          "type": "text"
        }
      }
    }
  }
}

Now if you try adding a document this way, it will succeed:

PUT html/test/1
{
  "ref": "A sentence I want to reference from this HTML document"
}

But this won't succeed, because some_field is not declared in the strict mapping:

PUT html/test/2
{
  "ref": "A sentence I want to reference from this HTML document",
  "some_field": "Some field"
}

Now the only thing remaining is to parse the HTML to retrieve the "ref" field and build the index request shown above (use whatever language you like: Java, Python, ...).
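As one possible way to do that parsing step, here is a minimal Python sketch using only the standard library. It assumes the text of interest lives in span elements whose class attribute contains ref, as in the example document; the names RefExtractor and build_document are hypothetical, and sending the resulting body to Elasticsearch is left out.

```python
import json
from html.parser import HTMLParser


class RefExtractor(HTMLParser):
    """Collects the text found inside <span class="ref"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside a matching span
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            # Count nested spans too, so the matching </span> is found.
            if self.depth or "ref" in classes:
                self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def build_document(html):
    """Return the JSON body to index: only the 'ref' field is kept."""
    parser = RefExtractor()
    parser.feed(html)
    parser.close()
    return {"ref": "".join(parser.parts).strip()}


html_doc = """<!DOCTYPE html>
<html lang="en">
<body>
  <div>
    <p><span class="ref">A sentence I want to reference from this HTML document</span></p>
    <p><span class="">Something less important</span></p>
  </div>
</body>
</html>"""

# This body can then be sent with PUT html/test/1 as in the request above.
print(json.dumps(build_document(html_doc)))
```

Because the body contains only the ref field, it is accepted by the strict mapping; any other extracted field would have to be declared in the mapping first.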

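Edit: Actually, to store the HTML without indexing it, you simply need to set index to no in your mapping (see here). Note that "no" is the spelling used before Elasticsearch 5.x; from 5.x onward the value is the boolean false: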

"_tags": {
          "type": "nested",
          "dynamic": true,
          "index": "no"
         }
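For newer versions, a minimal sketch of the same idea, written as a Python dict that serializes to the index creation body. It assumes Elasticsearch 7 or later (no mapping types) and a hypothetical text field named raw_html; with "index": false the value is kept in _source but cannot be searched.

```python
import json

# Hypothetical index body: raw_html is kept in _source but is not searchable.
index_body = {
    "mappings": {
        "properties": {
            "raw_html": {
                "type": "text",
                "index": False,  # serialized as the JSON boolean false
            }
        }
    }
}

print(json.dumps(index_body, indent=2))
```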
Adonis answered Nov 15 '22 22:11