Given the following data (a simple SRT file):
1
00:02:17,440 --> 00:02:20,375
Senator, we're making our final
2
00:02:20,476 --> 00:02:22,501
approach into Coruscant.
...
what would be the best way to index it in Elasticsearch? Here's the catch: I want search result highlights to link to the exact time the timestamp indicates. Also, some phrases span multiple SRT rows (such as "final approach" in the example above).
My ideas are
Or is there yet another option that would solve this in an elegant way?
Interesting question. Here's my take on it.
In essence, the subtitles "don't know" about each other, meaning that it'd be best to include the previous and subsequent subtitle text in each doc (n - 1, n, n + 1) whenever applicable.
As such, you'd be gunning for a doc structure similar to:
{
  "sub_id" : 0,
  "start" : "00:02:17,440",
  "end" : "00:02:20,375",
  "text" : "Senator, we're making our final",
  "overlapping_text" : "Senator, we're making our final approach into Coruscant."
}
To arrive at such a doc structure I used the following (inspired by this excellent answer):
from itertools import groupby
from collections import namedtuple

def parse_subs(fpath):
    # "chunk" our input file, delimited by blank lines
    with open(fpath) as f:
        res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

    Subtitle = namedtuple('Subtitle', 'sub_id start end text')
    subs = []

    # grouping
    for sub in res:
        if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
            sub = [x.strip() for x in sub]
            sub_id, start_end, *content = sub  # py3 syntax
            start, end = start_end.split(' --> ')
            # ints only
            sub_id = int(sub_id)
            # join multi-line text
            text = ', '.join(content)
            subs.append(Subtitle(
                sub_id,
                start,
                end,
                text
            ))

    es_ready_subs = []
    for index, sub_object in enumerate(subs):
        prev_sub_text = ''
        next_sub_text = ''
        if index > 0:
            prev_sub_text = subs[index - 1].text + ' '
        if index < len(subs) - 1:
            next_sub_text = ' ' + subs[index + 1].text
        es_ready_subs.append(dict(
            **sub_object._asdict(),
            overlapping_text=prev_sub_text + sub_object.text + next_sub_text
        ))

    return es_ready_subs
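To see the windowing step in isolation, here is a minimal sketch that works on a plain list of subtitle texts instead of a file. The helper name with_neighbors is hypothetical and not part of the code above; it only mirrors the n - 1, n, n + 1 concatenation.

```python
def with_neighbors(texts):
    """Return each text joined with its previous and next neighbor, if any."""
    windows = []
    for i in range(len(texts)):
        # The lower bound is clamped at 0 and slicing past the end is safe,
        # so the first and last items simply get one neighbor fewer.
        window = texts[max(i - 1, 0):i + 2]
        windows.append(' '.join(window))
    return windows


texts = ["Senator, we're making our final", "approach into Coruscant."]
for window in with_neighbors(texts):
    print(window)
```

With only two subtitles, both windows contain the full sentence, which is why a phrase split across the two rows can match either doc.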
Once the subtitles are parsed, they can be ingested into ES. Before that's done, set up the following mapping so that your timestamps are properly searchable and sortable:
PUT my_subtitles_index
{
  "mappings": {
    "properties": {
      "start": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      },
      "end": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      }
    }
  }
}
Once that's done, proceed to ingest:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

from utils.parse import parse_subs

es = Elasticsearch()

es_ready_subs = parse_subs('subs.txt')

actions = [
    {
        "_index": "my_subtitles_index",
        "_id": sub_group['sub_id'],
        "_source": sub_group
    } for sub_group in es_ready_subs
]

bulk(es, actions)
Once ingested, you can target the original subtitle text and boost it if it directly matches your phrase. Otherwise, add a fallback on overlapping_text, which ensures that both "overlapping" subtitles are returned.
Before returning, you can make sure that the hits are ordered by start, ascending. That kind of defeats the purpose of boosting, but if you do sort, you can specify track_scores=true in the URI to make sure the originally calculated scores are returned too.
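If the phrase comes from user input, the same two-clause body can be built in Python. This is only a sketch: build_phrase_query is a hypothetical helper, and it sets track_scores in the request body rather than in the URI. How the resulting dict is passed to the client's search call depends on the client version, so that part is left out.

```python
def build_phrase_query(phrase, boost=2):
    """Build a search body that prefers direct hits on `text`
    and falls back to `overlapping_text`."""
    return {
        # keep the computed scores even though the hits are sorted by time
        "track_scores": True,
        "query": {
            "bool": {
                "should": [
                    {"match_phrase": {"text": {"query": phrase, "boost": boost}}},
                    {"match_phrase": {"overlapping_text": {"query": phrase}}},
                ]
            }
        },
        "sort": [{"start.as_timestamp": {"order": "asc"}}],
    }


body = build_phrase_query("final approach")
print(body["query"]["bool"]["should"][0])
```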
Putting it all together:
POST my_subtitles_index/_search?track_scores&filter_path=hits.hits
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "text": {
              "query": "final approach",
              "boost": 2
            }
          }
        },
        {
          "match_phrase": {
            "overlapping_text": {
              "query": "final approach"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "start.as_timestamp": {
        "order": "asc"
      }
    }
  ]
}
yields:
{
  "hits" : {
    "hits" : [
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "0",
        "_score" : 6.0236287,
        "_source" : {
          "sub_id" : 0,
          "start" : "00:02:17,440",
          "end" : "00:02:20,375",
          "text" : "Senator, we're making our final",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant."
        },
        "sort" : [
          137440
        ]
      },
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 5.502407,
        "_source" : {
          "sub_id" : 1,
          "start" : "00:02:20,476",
          "end" : "00:02:22,501",
          "text" : "approach into Coruscant.",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant. Very good, Lieutenant."
        },
        "sort" : [
          140476
        ]
      }
    ]
  }
}
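To link each hit to the exact time, the start value can be converted into an offset on the client side. The following is a rough sketch under a few assumptions: the function names, the example.com base URL, and the ?t= query parameter are all hypothetical, and the response dict is a trimmed-down copy of the one above.

```python
def srt_timestamp_to_ms(timestamp):
    """Convert an SRT timestamp such as '00:02:17,440' to milliseconds."""
    hms, millis = timestamp.split(',')
    hours, minutes, seconds = (int(part) for part in hms.split(':'))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def hits_to_links(response, base_url):
    """Pair each hit's text with a URL carrying the start offset in whole seconds."""
    links = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        offset_s = srt_timestamp_to_ms(source["start"]) // 1000
        links.append((source["text"], f"{base_url}?t={offset_s}"))
    return links


# trimmed-down copy of the search response
response = {"hits": {"hits": [
    {"_source": {"start": "00:02:17,440", "text": "Senator, we're making our final"}},
    {"_source": {"start": "00:02:20,476", "text": "approach into Coruscant."}},
]}}

for text, url in hits_to_links(response, "https://example.com/watch"):
    print(url, text)
```

The millisecond values agree with the sort values in the response (137440 and 140476), so those could be used directly instead of re-parsing the start string.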