
I need to extract text from a PDF file using iText7 or iTextSharp and put HTML bold tags around all words that use a bold font

I am using iText7 and want to extract all the text from a PDF, put the HTML bold tag ( <b>...</b> ) around every word that uses a bold font, and save the result in a text file. Any pointers? I can extract the text, and I can separately extract all the bold words, but I am not able to correlate the two. Here is the code snippet I am using for extracting the text:

PdfDocument MyDocument = new PdfDocument(new PdfReader("C:\\MyTest.pdf"));
string MyText = PdfTextExtractor.GetTextFromPage(MyDocument.GetPage(1),
    new SimpleTextExtractionStrategy());

Here is the code I am using for extracting all the words using the bold font:

// Region of the page to inspect
Rectangle MyRectangle = new Rectangle(0, 0, 50, 100);
CustomFontFilter fontFilter = new CustomFontFilter(MyRectangle);
FilteredEventListener listener = new FilteredEventListener();
// Only events accepted by the font filter reach the extraction strategy
LocationTextExtractionStrategy extractionStrategy =
    listener.AttachEventListener(new LocationTextExtractionStrategy(), fontFilter);
PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
parser.ProcessPageContent(MyDocument.GetPage(1));
string MyBoldTextList = extractionStrategy.GetResultantText();
//------
class CustomFontFilter : TextRegionEventFilter
{
    public CustomFontFilter(iText.Kernel.Geom.Rectangle filterRect) : base(filterRect){ }
    public override bool Accept(IEventData data, EventType type)
    {
        if (type == EventType.RENDER_TEXT){
            TextRenderInfo renderInfo = (TextRenderInfo)data;
            PdfFont font = renderInfo.GetFont();
            if (font!=null)
                return font.GetFontProgram().GetFontNames().GetFontName().Contains("Bold");
        }
        return false;
    }
}

The problem is that the PDF in question is a multi-column document. SimpleTextExtractionStrategy returns the text in perfect order, but LocationTextExtractionStrategy scrambles it by jumping from one column to the next on each line. I cannot find any way to get the list of bold words using SimpleTextExtractionStrategy, and the list I get from LocationTextExtractionStrategy is not in the right order, so I am unable to correlate it with the full text.

Manoj Misran asked Sep 17 '25 14:09


1 Answer

So to summarize:

  • You want to extract all the text from a pdf and put the html tag for bold (<b>...</b>) around all the text that uses bold fonts.

  • Your PDFs allow normal text extraction (without those <b> tags) using the SimpleTextExtractionStrategy. The LocationTextExtractionStrategy on the other hand cannot be used as it messes up the order of the multi-column text.

  • Bold text in your PDFs can properly be recognized by your CustomFontFilter, i.e. by the condition

    font.GetFontProgram().GetFontNames().GetFontName().Contains("Bold")

Thus, one way to implement your task would be to extend the SimpleTextExtractionStrategy to check every chunk received using the CustomFontFilter condition and insert <b> tags where required.

For example like this:

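using System.Reflection;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public class BoldTaggingSimpleTextExtractionStrategy : SimpleTextExtractionStrategy
{
    // TextRenderInfo has no public setter for its text, so the private field is set via reflection.
    FieldInfo textField = typeof(TextRenderInfo).GetField("text", BindingFlags.NonPublic | BindingFlags.Instance);
    // True while a <b> tag has been opened and not yet closed.
    bool currentlyBold = false;

    public override void EventOccurred(IEventData data, EventType type)
    {
        if (type.Equals(EventType.RENDER_TEXT))
        {
            TextRenderInfo renderInfo = (TextRenderInfo)data;
            string fontName = renderInfo.GetFont()?.GetFontProgram()?.GetFontNames()?.GetFontName();
            if (fontName != null && fontName.Contains("Bold"))
            {
                if (!currentlyBold)
                {
                    // First bold chunk of a run: prefix its text with the opening tag.
                    textField.SetValue(renderInfo, "<b>" + renderInfo.GetText());
                    currentlyBold = true;
                }
            }
            else if (currentlyBold)
            {
                // First non-bold chunk after a bold run: close the tag before the chunk is appended.
                AppendTextChunk("</b>");
                currentlyBold = false;
            }
        }
        base.EventOccurred(data, type);
    }

    // If the text ends while still in a bold run, close the open tag in the result.
    public override string GetResultantText()
    {
        string text = base.GetResultantText();
        return currentlyBold ? text + "</b>" : text;
    }
}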

As you see, I used reflection here. I did so because (A) TextRenderInfo does not allow public setting of the text and (B) AppendTextChunk must not be used before the first chunk is processed by base.EventOccurred: there, the length of a StringBuilder containing the collected text chunks is used to check whether the chunk currently processed is the first one or not; if something is in that builder before at least one chunk has been processed, one gets a NullReferenceException. There are other workarounds for that, but reflection here costs only one more line of code.
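The tagging logic in the strategy is a small state machine over the sequence of text chunks, and it can be looked at in isolation from iText. The following is only a minimal sketch of that idea: the `BoldTagger` class, its `Tag` method, and the `(Text, IsBold)` tuple shape are hypothetical names invented for illustration and are not part of iText. It opens a tag on the first bold chunk of a run, closes it on the first non-bold chunk after the run, and also closes a tag that is still open when the input ends.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of iText.
public static class BoldTagger
{
    // Concatenates the chunks in order, wrapping each run of bold chunks in <b>...</b>.
    public static string Tag(IEnumerable<(string Text, bool IsBold)> chunks)
    {
        var result = new StringBuilder();
        bool currentlyBold = false;

        foreach (var (text, isBold) in chunks)
        {
            if (isBold && !currentlyBold)
            {
                // Transition normal -> bold: open the tag.
                result.Append("<b>");
            }
            else if (!isBold && currentlyBold)
            {
                // Transition bold -> normal: close the tag.
                result.Append("</b>");
            }
            currentlyBold = isBold;
            result.Append(text);
        }

        // Input ended inside a bold run: close the tag that is still open.
        if (currentlyBold)
        {
            result.Append("</b>");
        }
        return result.ToString();
    }
}
```

With the strategy class from this answer, a call of the same shape as in the question, i.e. PdfTextExtractor.GetTextFromPage(MyDocument.GetPage(1), new BoldTaggingSimpleTextExtractionStrategy()), should return the page text with the tags in place; the result can then be written to a text file, for example with File.WriteAllText.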

mkl answered Sep 19 '25 06:09