How to convert PDF to text file in iTextSharp

Question

I have to retrieve text from PDF file. But using the following code I only get empty text file.

for (int i = 0; i < n; i++)
{
    pagenumber = i + 1;
    filename = pagenumber.ToString();
    while (filename.Length < digits) filename = "0" + filename;
    filename = "_" + filename;
    filename = splitFile + name + filename;
    // step 1: creation of a document-object
    document = new Document(reader.GetPageSizeWithRotation(pagenumber));
    // step 2: we create a writer that listens to the document
    PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(filename + ".pdf", FileMode.Create));

    // step 3: we open the document
    document.Open();

    PdfContentByte cb = writer.DirectContent;
    PdfImportedPage page = writer.GetImportedPage(reader, pagenumber);
    int rotation = reader.GetPageRotation(pagenumber);
    if (rotation == 90 || rotation == 270)
    {
        cb.AddTemplate(page, 0, -1f, 1f, 0, 0, reader.GetPageSizeWithRotation(pagenumber).Height);
    }
    else
    {
        cb.AddTemplate(page, 1f, 0, 0, 1f, 0, 0);
    }
    // step 5: we close the document

    document.Close();
    PDFParser parser = new PDFParser();
    parser.ExtractText(filename + ".pdf", filename + ".txt");
}

What am I doing wrong and how should I extract text from PDF?

mkl · Accepted Answer

For text extraction with iTextSharp, take a current version of that library and use

PdfTextExtractor.GetTextFromPage(reader, pageNumber);

Beware, there is a bug in the text extraction code in some 5.3.x version which has meanwhile been fixed in trunk. You, therefore, might want to checkout the trunk revision.

Miguel · Answer

using System;
using System.IO;
using System.Linq;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace Pdf2Text
{
    class Program
    {
        static void Main(string[] args)
        {
            if (!args.Any()) return;

            var file = args[0];
            var output = Path.ChangeExtension(file, ".txt");
            if (!File.Exists(file)) return;

            var bytes = File.ReadAllBytes(file);
            File.WriteAllText(output, ConvertToText(bytes), Encoding.UTF8);
        }

        private static string ConvertToText(byte[] bytes)
        {
            var sb = new StringBuilder();

            try
            {
                var reader = new PdfReader(bytes);
                var numberOfPages = reader.NumberOfPages;

                for (var currentPageIndex = 1; currentPageIndex <= numberOfPages; currentPageIndex++)
                {
                    sb.Append(PdfTextExtractor.GetTextFromPage(reader, currentPageIndex));
                }
            }
            catch (Exception exception)
            {
                Console.WriteLine(exception.Message);
            }

            return sb.ToString();
        }
    }
}

How to convert PDF to text file in iTextSharp

Tags:

c#

pdf

itextsharp

Manik

2 Answers

mkl

Miguel

Recent Activity

Donate For Us

How to convert PDF to text file in iTextSharp

Tags:

c#

pdf

itextsharp

Manik

2 Answers

mkl

Miguel

Related questions

Recent Activity

Donate For Us