
How to extract text from a PDF document within a specific rectangular region? [closed]

Tags: c#, pdf

I need to extract text from a PDF document within a specific rectangular region. The workflow is as follows. First, the PDF is converted to a JPG image. The user then draws a selection rectangle on top of the picture. Finally, I need to extract all text from the PDF document that lies within that selection region. Can anyone suggest free PDF libraries that are usable from C#?

mmierins asked Nov 28 '10 17:11



1 Answer

This code extracts the text that lies within a given rectangle on a PDF page, using iTextSharp:

    using iTextSharp.text.pdf;          // PdfReader
    using iTextSharp.text.pdf.parser;   // text extraction strategies and render filters

    PdfReader reader = new PdfReader(pdfFilename);
    // Region in PDF user-space points, origin at the page's bottom-left: (llx, lly, urx, ury).
    iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(coordinate1, coordinate2, coordinate3, coordinate4);
    // Keep only the text segments that intersect the rectangle.
    RenderFilter[] renderFilter = { new RegionTextRenderFilter(rect) };
    ITextExtractionStrategy textExtractionStrategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
    string text = PdfTextExtractor.GetTextFromPage(reader, 1, textExtractionStrategy); // page numbers are 1-based
    reader.Close();
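
Since the selection in the question is drawn on a JPG rendering of the page, its pixel coordinates have to be converted to PDF points before they can be used as the rectangle. The following is only a rough sketch of that conversion. `SelectionMapper` and `ToPdfRect` are hypothetical names, not part of iTextSharp, and the code assumes the image shows the whole unrotated page with the page's lower-left corner at (0, 0).

```csharp
using System;

public static class SelectionMapper
{
    // Hypothetical helper: maps a selection drawn on the page image
    // (pixels, origin top-left, y grows downward) to PDF user space
    // (points, origin bottom-left, y grows upward).
    // Assumption: the image covers the full, unrotated page.
    public static (float Llx, float Lly, float Urx, float Ury) ToPdfRect(
        float selX, float selY, float selWidth, float selHeight,
        float imageWidthPx, float imageHeightPx,
        float pageWidthPt, float pageHeightPt)
    {
        // Points per pixel along each axis.
        float scaleX = pageWidthPt / imageWidthPx;
        float scaleY = pageHeightPt / imageHeightPx;

        float llx = selX * scaleX;
        float urx = (selX + selWidth) * scaleX;
        // Flip the y axis: the top edge of the selection becomes the upper y.
        float ury = pageHeightPt - selY * scaleY;
        float lly = pageHeightPt - (selY + selHeight) * scaleY;

        return (llx, lly, urx, ury);
    }
}
```

For example, for a 612 x 792 pt page rendered as a 1224 x 1584 px image, `SelectionMapper.ToPdfRect(100, 200, 300, 50, 1224, 1584, 612, 792)` yields the four values to pass as the rectangle's coordinates. The page size in points can probably be read with `reader.GetPageSize(1)`; pages that are rotated or whose origin is offset would need extra handling that this sketch does not cover.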
shailendra answered Sep 20 '22 08:09