Searching subtitle data in elasticsearch

Given the following data (a simple SRT file):

1
00:02:17,440 --> 00:02:20,375
Senator, we're making our final

2
00:02:20,476 --> 00:02:22,501
approach into Coruscant.

...

What would be the best way to index it in Elasticsearch? Here's the catch: I want each search result highlight to link to the exact time its timestamp indicates. Also, some phrases span multiple SRT rows (such as "final approach" in the example above).

My ideas are:

Index the SRT file as a list type, with the timestamps as the indexes. I believe this would not match phrases that span multiple keys.
Create a custom tokenizer that only indexes the text part. I'm not sure how well Elasticsearch can highlight the original content then.
Index only the text part and map it back to the timestamps outside of Elasticsearch.

Or is there yet another option that would solve this in an elegant way?

asked Feb 10 '15 12:02

Mikulas Dite

1 Answer

Interesting question. Here's my take on it.

In essence, the subtitles "don't know" about each other, so it'd be best for each doc to also contain the text of the previous and the next subtitle (n - 1, n, n + 1) wherever applicable.

As such, you'd be gunning for a doc structure similar to:

{
  "sub_id" : 0,
  "start" : "00:02:17,440",
  "end" : "00:02:20,375",
  "text" : "Senator, we're making our final",
  "overlapping_text" : "Senator, we're making our final approach into Coruscant."
}

To arrive at such a doc structure I used the following (inspired by this excellent answer):

from itertools import groupby
from collections import namedtuple


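def parse_subs(fpath):
    """Parse an SRT file into a list of dicts ready for indexing."""
    # "chunk" our input file, delimited by blank lines;
    # utf-8-sig also strips a leading BOM, which SRT files often carry
    with open(fpath, encoding='utf-8-sig') as f:
        res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]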

    Subtitle = namedtuple('Subtitle', 'sub_id start end text')

    subs = []

    # turn each chunk into a Subtitle
    for sub in res:
        if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
            sub = [x.strip() for x in sub]
            sub_id, start_end, *content = sub  # py3 syntax
            start, end = start_end.split(' --> ')

            # ints only
            sub_id = int(sub_id)

            # join multi-line text
            text = ', '.join(content)

            subs.append(Subtitle(
                sub_id,
                start,
                end,
                text
            ))

    es_ready_subs = []

    for index, sub_object in enumerate(subs):
        prev_sub_text = ''
        next_sub_text = ''

        if index > 0:
            prev_sub_text = subs[index - 1].text + ' '

        if index < len(subs) - 1:
            next_sub_text = ' ' + subs[index + 1].text

        es_ready_subs.append(dict(
            **sub_object._asdict(),
            overlapping_text=prev_sub_text + sub_object.text + next_sub_text
        ))

    return es_ready_subs
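
To see what the neighbor window buys you, here is a minimal, self-contained sketch that applies the same n - 1, n, n + 1 idea to an in-memory list instead of a file. The helper name with_neighbors and the sample rows are illustrative assumptions, not part of the parser above.

```python
def with_neighbors(texts):
    """Return, for each text, the text joined with its previous and next neighbor."""
    windows = []
    for i, text in enumerate(texts):
        # Slicing clamps at the list boundaries, so the first and last
        # entries simply get a shorter window.
        windows.append(' '.join(texts[max(i - 1, 0):i + 2]))
    return windows


rows = ["Senator, we're making our final", "approach into Coruscant."]
windows = with_neighbors(rows)

# The phrase spans two rows, so no single row contains it ...
print(any("final approach" in row for row in rows))   # False
# ... but every window that covers both rows does.
print(all("final approach" in w for w in windows))    # True
```

This is why the phrase query further down needs the overlapping_text field: a phrase match against text alone can never succeed for a phrase split across two rows.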

Once the subtitles are parsed, they can be ingested into ES. Before that's done, set up the following mapping so that your timestamps are properly searchable and sortable:

PUT my_subtitles_index
{
  "mappings": {
    "properties": {
      "start": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      },
      "end": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      }
    }
  }
}

Once that's done, proceed to ingest:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

from utils.parse import parse_subs

# adjust the URL to your cluster; 8.x clients require an explicit host
es = Elasticsearch("http://localhost:9200")

es_ready_subs = parse_subs('subs.txt')

actions = [
    {
        "_index": "my_subtitles_index",
        "_id": sub_group['sub_id'],
        "_source": sub_group
    } for sub_group in es_ready_subs
]

bulk(es, actions)
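
If the subtitle file is large, the actions do not have to be materialized as a list first, since bulk accepts any iterable. A minimal sketch, assuming the same document shape as above; gen_actions is a hypothetical helper name, and the sample dict only stands in for the output of parse_subs:

```python
def gen_actions(subs, index_name="my_subtitles_index"):
    """Lazily yield one bulk action per parsed subtitle dict."""
    for sub in subs:
        yield {
            "_index": index_name,
            "_id": sub["sub_id"],
            "_source": sub,
        }


# Usage (needs a reachable cluster, so it is left commented out):
# bulk(es, gen_actions(parse_subs('subs.txt')))

sample = [{
    "sub_id": 0,
    "start": "00:02:17,440",
    "end": "00:02:20,375",
    "text": "Senator, we're making our final",
    "overlapping_text": "Senator, we're making our final approach into Coruscant.",
}]
actions = list(gen_actions(sample))
print(actions[0]["_id"])  # 0
```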

Once ingested, you can target the original subtitle text and boost it if it directly matches your phrase. In addition, add a fallback on the overlapping text, which ensures that both "overlapping" subtitles are returned when the phrase spans two of them.
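
If you issue the search from Python, the same boost-plus-fallback idea can be expressed as a plain dict. This is only a sketch: build_phrase_query is a hypothetical helper, and the highlight section is an assumption added because the question asks about highlights; the answer itself does not use it.

```python
def build_phrase_query(phrase, boost=2):
    """Build a search body: boosted match on `text`, fallback on `overlapping_text`."""
    return {
        "query": {
            "bool": {
                "should": [
                    # a direct hit in the subtitle's own text scores higher
                    {"match_phrase": {"text": {"query": phrase, "boost": boost}}},
                    # fallback for phrases that span neighboring subtitles
                    {"match_phrase": {"overlapping_text": {"query": phrase}}},
                ]
            }
        },
        # Assumption, not part of the original answer: ask Elasticsearch to
        # return highlighted fragments for whichever field matched.
        "highlight": {"fields": {"text": {}, "overlapping_text": {}}},
    }


body = build_phrase_query("final approach")
# With the 7.x client this could be sent as:
# es.search(index="my_subtitles_index", body=body)
```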

Before returning, you can make sure that the hits are ordered by start, ascending. That somewhat defeats the purpose of boosting, but if you do sort, you can pass track_scores=true as a URI parameter to make sure the originally calculated scores are returned too.

Putting it all together:

POST my_subtitles_index/_search?track_scores&filter_path=hits.hits
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "text": {
              "query": "final approach",
              "boost": 2
            }
          }
        },
        {
          "match_phrase": {
            "overlapping_text": {
              "query": "final approach"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "start.as_timestamp": {
        "order": "asc"
      }
    }
  ]
}

yields:

{
  "hits" : {
    "hits" : [
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "0",
        "_score" : 6.0236287,
        "_source" : {
          "sub_id" : 0,
          "start" : "00:02:17,440",
          "end" : "00:02:20,375",
          "text" : "Senator, we're making our final",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant."
        },
        "sort" : [
          137440
        ]
      },
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 5.502407,
        "_source" : {
          "sub_id" : 1,
          "start" : "00:02:20,476",
          "end" : "00:02:22,501",
          "text" : "approach into Coruscant.",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant. Very good, Lieutenant."
        },
        "sort" : [
          140476
        ]
      }
    ]
  }
}
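
The sort values in the response (137440 and 140476) are the start timestamps expressed in milliseconds, which is what a link to the exact position needs. A minimal sketch of that last step, assuming hits shaped like the ones above; srt_to_millis, hit_to_link and the example URL are hypothetical, and the '#t=' suffix follows the W3C Media Fragments convention, which your player may or may not honor.

```python
def srt_to_millis(timestamp):
    """Convert an SRT timestamp such as '00:02:17,440' to milliseconds."""
    clock, millis = timestamp.split(',')
    hours, minutes, seconds = (int(part) for part in clock.split(':'))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def hit_to_link(hit, base_url):
    """Build a link that jumps to the start of the subtitle in `hit`."""
    seconds = srt_to_millis(hit["_source"]["start"]) / 1000
    # Assumption: the player behind base_url understands '#t=<seconds>'.
    return f"{base_url}#t={seconds:.3f}"


hit = {"_source": {"start": "00:02:17,440"}, "sort": [137440]}
print(srt_to_millis(hit["_source"]["start"]) == hit["sort"][0])  # True
print(hit_to_link(hit, "https://example.com/video.mp4"))
# https://example.com/video.mp4#t=137.440
```

When the phrase only matched via overlapping_text, linking to the earliest returned hit is a reasonable default, since the hits are already sorted by start.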

answered Sep 20 '22 06:09

Joe - Elasticsearch Handbook

Tags: elasticsearch, elasticsearch-mapping, elasticsearch-query, elasticsearch-model
