Extract text by line from PDF using iTextSharp c#

I need to run some analysis by extracting data from a PDF document.

Using iTextSharp, I called the PdfTextExtractor.GetTextFromPage method to extract the contents of a PDF document, and it returned the text as a single long line.

Is there a way to get the text line by line so that I can store the lines in an array? Analyzing the data by line would be more flexible.

Below is the code I used:

        string urlFileName1 = "pdf_link";
        PdfReader reader = new PdfReader(urlFileName1);
        string text = string.Empty;
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            text += PdfTextExtractor.GetTextFromPage(reader, page);
        }
        reader.Close();
        candidate3.Text = text;

617

asked Apr 01 '13 18:04

Xander

2 Answers

I know this is an old post, but I spent a lot of time trying to figure this out, so I'm sharing it for anyone who finds this through Google in the future:

using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFApp2
{
    class Program
    {
        static void Main(string[] args)
        {
            string filePath = @"Your said path\the file name.pdf";
            string outPath = @"the output said path\the text file name.txt";
            int pagesToScan = 2;

            try
            {
                PdfReader reader = new PdfReader(filePath);

                // Never ask for more pages than the document has.
                // Use reader.NumberOfPages on its own to scan every page of the PDF.
                int lastPage = Math.Min(pagesToScan, reader.NumberOfPages);

                // Open the output file once, in append mode (StreamWriter writes UTF-8 by default).
                using (StreamWriter file = new StreamWriter(outPath, true))
                {
                    for (int page = 1; page <= lastPage; page++)
                    {
                        // LocationTextExtractionStrategy orders the text by position
                        // and separates the lines it finds with '\n'.
                        ITextExtractionStrategy its = new LocationTextExtractionStrategy();
                        string strText = PdfTextExtractor.GetTextFromPage(reader, page, its);

                        // Store the page line by line in a string array.
                        string[] lines = strText.Split('\n');
                        foreach (string line in lines)
                        {
                            file.WriteLine(line);
                        }
                    }
                }

                reader.Close();
            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
        }
    }
}

I had the program read a PDF from a fixed path and write the lines to a text file, but you can adapt that to do anything else with them. This builds on Snziv Gupta's response.
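If you want the lines in an array rather than in a file, as the question asks, the splitting step could be pulled out into a small helper. This is only a minimal sketch: the PdfLineHelper class and its SplitLines method are hypothetical names, not part of iTextSharp, and it assumes the extracted text separates lines with "\n" or "\r\n".

```csharp
using System;
using System.Linq;

static class PdfLineHelper
{
    // Hypothetical helper: splits the text of one extracted page into
    // trimmed, non-empty lines.
    public static string[] SplitLines(string pageText)
    {
        if (string.IsNullOrEmpty(pageText))
        {
            return new string[0];
        }

        return pageText
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.None)
            .Select(line => line.Trim())        // drop leading/trailing whitespace
            .Where(line => line.Length > 0)     // skip lines that are now empty
            .ToArray();
    }
}
```

Inside the page loop you could then call string[] lines = PdfLineHelper.SplitLines(strText); and analyze each entry instead of writing it out. Dropping blank lines is a design choice; remove the Where call if empty lines matter for your analysis.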

130

answered Sep 19 '22 18:09

supersoka

None of the other code samples here worked for me, probably because the API changed in iText 7.

This minimal example works OK:

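var pdfReader = new iText.Kernel.Pdf.PdfReader(fileName);
var pdfDocument = new iText.Kernel.Pdf.PdfDocument(pdfReader);
// Extract the text of the first page only.
var contents = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(pdfDocument.GetFirstPage());
// Close the document (and its reader) once the text has been read.
pdfDocument.Close();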
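To get the text of every page line by line, as the question asks, something along these lines may work with iText 7. Treat it as an untested sketch: the Itext7LineExtractor class and ExtractLines method are names made up for illustration, it assumes the iText 7 package is referenced, and it assumes that LocationTextExtractionStrategy separates the lines it detects with '\n'.

```csharp
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

static class Itext7LineExtractor
{
    // Hypothetical helper: returns the text of all pages as a list of lines.
    public static List<string> ExtractLines(string fileName)
    {
        var lines = new List<string>();
        var pdfDocument = new PdfDocument(new PdfReader(fileName));
        try
        {
            for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
            {
                // Use a fresh strategy per page so text from earlier pages
                // is not carried over into the next result.
                var strategy = new LocationTextExtractionStrategy();
                string pageText = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(page), strategy);

                // Assumption: the strategy joins detected lines with '\n'.
                lines.AddRange(pageText.Split('\n'));
            }
        }
        finally
        {
            // Closing the document also releases the underlying reader.
            pdfDocument.Close();
        }
        return lines;
    }
}
```

Calling Itext7LineExtractor.ExtractLines(fileName).ToArray() would then give you the array to analyze.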

answered Sep 18 '22 18:09

dodgy_coder

Tags: c#, pdf, extract, carriage-return, itext