I am using iText7 and I want to extract all the texts from a pdf and put html tag for bold ( <b>...</b> ) around all the words that uses bold fonts and save it in text file. Any pointers? I am able to independently extract text and also extract all the bold words but not able to co-relate the two. Here is the code snippet I am using for extracting the text:
PdfDocument MyDocument = new PdfDocument(new PdfReader("C:\\MyTest.pdf"));
string MyText = PdfTextExtractor.GetTextFromPage(MyDocument.GetPage(1), new
SimpleTextExtractionStrategy());
Here is the code I am using for extracting all the words using the bold font:
MyRectangle = new Rectangle(0, 0, 50, 100);
CustomFontFilter fontFilter = new CustomFontFilter(MyRectangle);
FilteredEventListener listener = new FilteredEventListener();
LocationTextExtractionStrategy extractionStrategy =
listener.AttachEventListener(new LocationTextExtractionStrategy(), fontFilter);
PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
parser.ProcessPageContent(MyDocument.GetPage(1));
String MyBoldTextList = extractionStrategy.GetResultantText();
//------
class CustomFontFilter : TextRegionEventFilter
{
public CustomFontFilter(iText.Kernel.Geom.Rectangle filterRect) : base(filterRect){ }
override public bool Accept(IEventData data, EventType type)
{
if (type == EventType.RENDER_TEXT){
TextRenderInfo renderInfo = (TextRenderInfo)data;
PdfFont font = renderInfo.GetFont();
if (font!=null)
return font.GetFontProgram().GetFontNames().GetFontName().Contains("Bold");
}
return false;
}
}
The problem is that the pdf in question here is a multi-column document. SimpleTextExtractionStrategy brings the text in perfect order but if I use the LocationStrategy, it messes up texts by jumping from one column to next column in each line. I am not able to find any way to get the list of bold words using SimpleTextExtractionStrategy. In LocationStrategy, the list that I get is not in the right order so I am unable to co-relate it.
So to summarize:
You want to extract all the text from a pdf and put the html tag for bold (<b>...</b>
) around all the text that uses bold fonts.
Your PDFs allow normal text extraction (without those <b>
tags) using the SimpleTextExtractionStrategy
. The LocationTextExtractionStrategy
on the other hand cannot be used as it messes up the order of the multi-column text.
Bold text in your PDFs can properly be recognized by your CustomFontFilter
, i.e. by the
font.GetFontProgram().GetFontNames().GetFontName().Contains("Bold")
condition.
Thus, one way to implement your task would be to extend the SimpleTextExtractionStrategy
to check every chunk received using the CustomFontFilter
condition and insert <b>
tags where required.
For example like this:
public class BoldTaggingSimpleTextExtractionStrategy : SimpleTextExtractionStrategy
{
FieldInfo textField = typeof(TextRenderInfo).GetField("text", BindingFlags.NonPublic | BindingFlags.Instance);
bool currentlyBold = false;
public override void EventOccurred(IEventData data, EventType type)
{
if (type.Equals(EventType.RENDER_TEXT))
{
TextRenderInfo renderInfo = (TextRenderInfo)data;
string fontName = renderInfo.GetFont()?.GetFontProgram()?.GetFontNames()?.GetFontName();
if (fontName != null && fontName.Contains("Bold"))
{
if (!currentlyBold)
{
textField.SetValue(renderInfo, "<b>" + renderInfo.GetText());
currentlyBold = true;
}
}
else if (currentlyBold)
{
AppendTextChunk("</b>");
currentlyBold = false;
}
}
base.EventOccurred(data, type);
}
}
As you see I used reflection here. I did so because (A) TextRenderInfo
does not allow public setting of the text and (B) AppendTextChunk
must not be used before the first chunk is processed by base.EventOccurred
- there the size of a StringBuilder
containing the collected text chunks is used to check whether the chunk currently processed is the first one or not; if something is in that builder before at least one chunk has been processed, one gets a NullReferenceException
. There are other work-arounds for that but reflection here means but one more line of code.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With