
How to get a particular paragraph from a PDF file using iTextSharp in C#?

I am using iTextSharp in my C# WinForms application. I want to get a particular paragraph from a PDF file. Is this possible with iTextSharp?

Saravanan asked Nov 28 '22 09:11



1 Answer

Yes and no.

First the no. The PDF format doesn't have a concept of text structures such as paragraphs, sentences or even words; it just has runs of text. The fact that two runs of text sit near each other, so that we read them as structured, is a human thing. When you see something that looks like a three-line paragraph in a PDF, the program that generated the PDF actually chopped the text into three unrelated lines and then drew each line at specific x,y coordinates. Even worse, depending on what the designer wants, each line of text could be composed of smaller runs that could be words or even single characters. So it might be draw "the cat in the hat" at 10,10, or it might be draw "t" at 10,10, then draw "h" at 14,10, then draw "e" at 18,10 and so on. This is actually pretty common with PDFs from design-oriented programs like Adobe InDesign.

Now the yes. Actually, it's a maybe. If you are willing to put in a little work you might be able to get iTextSharp to do what you are looking for. There is a class called PdfTextExtractor that has a method called GetTextFromPage that will get all of the raw text from a page. The last parameter to this method is an object that implements the ITextExtractionStrategy interface. If you create your own class that implements this interface you can process each run of text and apply your own logic.
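Before writing a custom strategy, it may help to see the call with one of the strategies that ship with the library. The following is a minimal sketch, assuming iTextSharp 5.x; the class name RawTextDump and the method ReadPage are hypothetical names chosen for illustration, while LocationTextExtractionStrategy is one of the built-in ITextExtractionStrategy implementations.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of iTextSharp.
public static class RawTextDump {
    // Returns the raw text of a single page (pages are numbered from 1).
    public static string ReadPage(string path, int pageNumber) {
        PdfReader reader = new PdfReader(path);
        try {
            // Use a fresh strategy per page, since strategies accumulate state.
            ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            return PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
        } finally {
            reader.Close();
        }
    }
}
```

The string that comes back is just text with line breaks; any notion of paragraphs still has to come from your own ITextExtractionStrategy implementation.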

In this interface there's a method called RenderText which gets called for every run of text. You'll be given an iTextSharp.text.pdf.parser.TextRenderInfo object from which you can get the raw text of the run as well as other things like the coordinates it starts at, the current font, etc. Since a visual line of text can be composed of multiple runs, you can use this method to compare the run's baseline (the y coordinate of its starting point) to the previous run's to determine whether it is part of the same visual line.

Below is an example of an implementation of that interface:

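    using System;
    using System.Collections.Generic;
    using System.Text;
    using iTextSharp.text.pdf.parser;

    public class TextAsParagraphsExtractionStrategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy {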
        //Text buffer
        private StringBuilder result = new StringBuilder();

        //Store last used properties
        private Vector lastBaseLine;

        //Buffer of lines of text and their Y coordinates. NOTE, these should be exposed as properties instead of fields but are left as is for simplicity's sake
        public List<string> strings = new List<String>();
        public List<float> baselines = new List<float>();

        //This is called whenever a run of text is encountered
        public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo) {
            //This code assumes that if the baseline's Y coordinate changes then we're on a new line
            Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();

            //See if the baseline has changed
            if ((this.lastBaseLine != null) && (curBaseline[Vector.I2] != lastBaseLine[Vector.I2])) {
                //See if we have text and not just whitespace
                if ((!String.IsNullOrWhiteSpace(this.result.ToString()))) {
                    //Mark the previous line as done by adding it to our buffers
                    this.baselines.Add(this.lastBaseLine[Vector.I2]);
                    this.strings.Add(this.result.ToString());
                }
                //Reset our "line" buffer
                this.result.Clear();
            }

            //Append the current text to our line buffer
            this.result.Append(renderInfo.GetText());

            //Reset the last used line
            this.lastBaseLine = curBaseline;
        }

        public string GetResultantText() {
            //One last time, see if there's anything left in the buffer
            if ((!String.IsNullOrWhiteSpace(this.result.ToString()))) {
                this.baselines.Add(this.lastBaseLine[Vector.I2]);
                this.strings.Add(this.result.ToString());
            }
            //We're not going to use this method to return a string; instead, callers should inspect this class's strings and baselines fields afterwards.
            return null;
        }

        //Not needed, part of interface contract
        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderImage(ImageRenderInfo renderInfo) { }
    }

To call it we'd do:

        PdfReader reader = new PdfReader(workingFile);
        TextAsParagraphsExtractionStrategy S = new TextAsParagraphsExtractionStrategy();
        iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
        for (int i = 0; i < S.strings.Count; i++) {
            Console.WriteLine("Line {0,-5}: {1}", S.baselines[i], S.strings[i]);
        }

We're actually throwing away the return value of GetTextFromPage and instead inspecting the worker's baselines and strings list fields. The next step would be to compare the baselines and try to determine how to group lines together into paragraphs.
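One possible way to do that grouping is sketched here, under the assumption that a paragraph break shows up as a vertical gap noticeably larger than the typical line spacing. The class ParagraphGrouper, its Group method and the gapFactor threshold are all hypothetical names and values chosen for illustration; this is a heuristic, not something iTextSharp provides.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of iTextSharp.
public static class ParagraphGrouper {
    // Joins lines into paragraphs. A new paragraph starts when the gap to the
    // previous baseline exceeds the median gap multiplied by gapFactor (an assumed threshold).
    public static List<string> Group(IList<string> lines, IList<float> baselines, float gapFactor = 1.5f) {
        if (lines.Count != baselines.Count)
            throw new ArgumentException("lines and baselines must have the same length");

        var paragraphs = new List<string>();
        if (lines.Count == 0) return paragraphs;

        // Distance between consecutive baselines; Math.Abs ignores the direction of the Y axis.
        var gaps = new List<float>();
        for (int i = 1; i < baselines.Count; i++)
            gaps.Add(Math.Abs(baselines[i - 1] - baselines[i]));

        // The median gap stands in for the normal line spacing.
        float typical = gaps.Count > 0 ? gaps.OrderBy(g => g).ElementAt(gaps.Count / 2) : 0f;

        var current = new StringBuilder(lines[0].Trim());
        for (int i = 1; i < lines.Count; i++) {
            if (gaps[i - 1] > typical * gapFactor) {
                // Large gap: close the current paragraph and start a new one.
                paragraphs.Add(current.ToString());
                current.Clear();
            } else {
                // Normal gap: same paragraph, so join with a space.
                current.Append(' ');
            }
            current.Append(lines[i].Trim());
        }
        paragraphs.Add(current.ToString());
        return paragraphs;
    }
}
```

With the strategy above this could be called as ParagraphGrouper.Group(S.strings, S.baselines). When every line is spaced identically, no gap exceeds the threshold and the whole page comes back as a single paragraph.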

I should note that not all paragraphs have spacing that's different from the spacing between individual lines of text. For instance, if you run the PDF created below through the code above you'll see that each line of text is 18 points from the next, regardless of whether the line starts a new paragraph or not. If you open the PDF it creates in Acrobat and cover everything but the first letter of each line, you'll see that your eye can't even tell the difference between a line break and a paragraph break.

        using (FileStream fs = new FileStream(workingFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
            using (Document doc = new Document(PageSize.LETTER)) {
                using (PdfWriter writer = PdfWriter.GetInstance(doc, fs)) {
                    doc.Open();
                    doc.Add(new Paragraph("Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo magna eros quis urna."));
                    doc.Add(new Paragraph("This"));
                    doc.Add(new Paragraph("Is"));
                    doc.Add(new Paragraph("A"));
                    doc.Add(new Paragraph("Test"));
                    doc.Close();
                }
            }
        }

Chris Haas answered Dec 05 '22 02:12
