Revisiting a stalled project and looking for advice on modernizing thousands of "old" documents and making them available via the web.
Documents exist in various formats, some obsolete: (.doc, PageMaker, hardcopy (OCR), PDF, etc.). Funds are available to migrate the documents into a 'modern' format, and many of the hardcopies have already been OCR'd into PDFs - we had originally assumed that PDF would be the final format but we're open to suggestions (XML?).
Once all docs are in a common format we would like to make their contents available and searchable via a web interface. We'd like the flexibility to return only the portions (pages?) of a document where a search 'hit' is found (I believe Lucene/ElasticSearch makes this possible?). Might it be more flexible if the content were all XML? If so, how/where should the XML be stored - directly in a database, or as discrete files in the filesystem? And what about images/graphs embedded in the documents?
Curious how others might approach this. There is no "wrong" answer; I'm just looking for as many inputs as possible to help us proceed.
Thanks for any advice.
PDF/A is the ISO standard for archiving electronic documents. PDF itself is widespread globally and is used in both the public and private sectors for a wide range of purposes; the PDF/A standard is designed to ensure long-term preservation and reproducibility of documents over extended periods.
A searchable PDF file is a PDF that includes a text layer that can be searched using the standard Adobe Reader "search" functionality; the text can also be selected and copied from the PDF.
"Image-only" (scanned) PDFs, by contrast, are not searchable, and their text usually cannot be modified or marked up. An image-only PDF can be made searchable by applying OCR, which adds a text layer, normally underneath the page image.
In summary: I'm going to be recommending ElasticSearch, but let's break the problem down and talk about how to implement it:
There are a few parts to this: making the text of each document available as full text search, returning highlighted snippets, knowing whereabouts in a doc a hit occurs, and returning just the relevant parts of a doc.
What can ElasticSearch provide? You could just send the whole doc to ElasticSearch as an attachment, and you'd get full text search. But the sticking points are the last two parts: knowing where you are in a doc, and returning parts of a doc.
Storing individual pages is probably sufficient for your where-am-I purposes (although you could equally go down to paragraph level), but you want them grouped in a way that a doc would be returned in the search results, even if search keywords appear on different pages.
First the indexing part: storing your docs in ElasticSearch:
Index each page as a "page" doc, which contains the parent doc's ID (doc_id), the page number (page), and the text of that page (text).
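As a sketch, a single "page" document might look like the following. The field names (doc_id, page, text) are illustrative, chosen to match the queries used later in this answer; any consistent naming works:

```python
# One "page" document as it might be indexed into ElasticSearch.
# Values here are made up for illustration.
page_doc = {
    "doc_id": 123,  # ID of the parent "doc" this page belongs to
    "page": 2,      # page number within the document
    "text": "The full extracted text of page 2 goes here...",
}
```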
Now for searching. How you do this depends on how you want to present your results - by page, or grouped by doc.
Results by page are easy. This query returns a list of matching pages (each page is returned in full) plus a list of highlighted snippets from the page:
curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1' -d '
{
   "query" : {
      "text" : { "text" : "interesting keywords" }
   },
   "highlight" : {
      "fields" : { "text" : {} }
   }
}'
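The response has the standard ElasticSearch shape: matching pages under hits.hits, each with its _source and a highlight section. A small sketch of pulling out the snippets for display (the sample response below is hypothetical and truncated):

```python
# Hypothetical, truncated response from the per-page query above.
response = {
    "hits": {
        "hits": [
            {
                "_source": {"doc_id": 1, "page": 3, "text": "..."},
                "highlight": {"text": ["some <em>interesting</em> snippet"]},
            }
        ]
    }
}

# Collect (doc_id, page, snippets) tuples for display.
results = [
    (hit["_source"]["doc_id"],
     hit["_source"]["page"],
     hit.get("highlight", {}).get("text", []))
    for hit in response["hits"]["hits"]
]
```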
Displaying results grouped by "doc" with highlights from the text is a bit trickier. It can't be done with a single query, but a little client-side grouping will get you there. One approach might be:
Step 1: Do a top-children-query to find the parent ("doc") whose children ("page") best match the query:
curl -XGET 'http://127.0.0.1:9200/my_index/doc/_search?pretty=1' -d '
{
   "query" : {
      "top_children" : {
         "query" : { "text" : { "text" : "interesting keywords" } },
         "score" : "sum",
         "type" : "page",
         "factor" : "5"
      }
   }
}'
Step 2: Collect the "doc" IDs from the above query and issue a new query to get the snippets from the matching "page" docs:
curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1' -d '
{
   "query" : {
      "filtered" : {
         "query" : { "text" : { "text" : "interesting keywords" } },
         "filter" : {
            "terms" : { "doc_id" : [1, 2, 3] }
         }
      }
   },
   "highlight" : {
      "fields" : { "text" : {} }
   }
}'
Step 3: In your app, group the results from the above query by doc and display them.
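That client-side grouping can be sketched in a few lines of Python; the hits list below is hypothetical, shaped like the fields returned by the query above:

```python
from collections import OrderedDict

def group_pages_by_doc(hits):
    """Group page hits by their doc_id, preserving the doc order
    returned by the first (top_children) query."""
    grouped = OrderedDict()
    for hit in hits:
        grouped.setdefault(hit["doc_id"], []).append(hit)
    return grouped

# Hypothetical page hits from step 2:
hits = [
    {"doc_id": 1, "page": 4, "snippet": "first <em>match</em>"},
    {"doc_id": 2, "page": 1, "snippet": "another <em>match</em>"},
    {"doc_id": 1, "page": 9, "snippet": "second <em>match</em>"},
]
by_doc = group_pages_by_doc(hits)
```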
With the search results from the second query, you already have the full text of the page which you can display. To move to the next page, you can just search for it:
curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1' -d '
{
   "query" : {
      "constant_score" : {
         "filter" : {
            "and" : [
               { "term" : { "doc_id" : 1 } },
               { "term" : { "page" : 2 } }
            ]
         }
      }
   },
   "size" : 1
}'
Or alternatively, give the "page" docs an ID consisting of $doc_id_$page_num (eg 123_2); then you can just retrieve that page directly:
curl -XGET 'http://127.0.0.1:9200/my_index/page/123_2'
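Building that composite ID on the client side is a one-liner; a sketch (the helper name is mine):

```python
def page_id(doc_id, page_num):
    # Composite ID in the $doc_id_$page_num form suggested above,
    # e.g. doc 123, page 2 -> "123_2".
    return f"{doc_id}_{page_num}"
```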
Parent-child relationship:
Normally, in ES (and most NoSQL solutions) each doc/object is independent - there are no real relationships. By establishing a parent-child relationship between the "doc" and the "page", ElasticSearch makes sure that the child docs (ie the "page") are stored on the same shard as the parent doc (the "doc").
This enables you to run the top-children-query which will find the best matching "doc" based on the content of the "pages".
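With the parent/child API of that era (the _parent mapping field, which later ElasticSearch versions replaced with "join" fields), the "page" mapping might be sketched as:

```shell
# Declare each "page" as a child of a "doc" (legacy _parent mapping;
# must be set when the "page" type is created).
curl -XPUT 'http://127.0.0.1:9200/my_index/page/_mapping?pretty=1' -d '
{
   "page" : {
      "_parent" : { "type" : "doc" }
   }
}'
```

Each "page" is then indexed with a parent parameter naming its "doc", which is what routes it to the parent's shard.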