PDF text search and split library

Question

I am looking for a server-side PDF library (or command-line tool) which can:

split a multi-page PDF file into individual PDF files, based on
a search result of the PDF file content

Examples:

Search for a "Page ???" pattern in the text and split the big PDF into 001.pdf, 002.pdf, ... ???.pdf

A server program will scan the PDF, look for the search pattern, and save the page(s) that match the pattern as files on disk.

Integration with PHP / Ruby would be nice, but a command-line tool is also acceptable. It will run as a server-side (Linux or Win32) batch processing tool, so no GUI or interactive login is available. i18n support would be nice but is not required. Thanks~

plinth · Accepted Answer

My company, Atalasoft, has just released some PDF manipulation tools that run on .NET. There is a text extract class that you can use to find the text and determine how you will split your document and a very high level document class that makes the splitting trivial. Suppose you have a Stream to your source PDF and an increasingly ordered List that describes the starting page of each split, then the code to generate your split files looks like this:

public void SplitPdf(Stream stm, List<int> pageStarts, string outputDirectory)
{
    PdfDocument mainDoc = new PdfDocument(stm);
    int lastPage = mainDoc.Pages.Count - 1;

    for (int i = 0; i < pageStarts.Count; i++) {
        int startPage = pageStarts[i];
        // Each split ends just before the next start, or at the last page.
        int endPage = (i < pageStarts.Count - 1) ?
            pageStarts[i + 1] - 1 :
            lastPage;
        if (startPage > endPage) throw new ArgumentException("list is not ordered properly", "pageStarts");
        PdfDocument splitDoc = new PdfDocument();
        for (int j = startPage; j <= endPage; j++)
            splitDoc.Pages.Add(mainDoc.Pages[j]);

        // Output files are named 001.pdf, 002.pdf, ...
        string outputPath = Path.Combine(outputDirectory,
                                         string.Format("{0:D3}.pdf", i + 1));
        splitDoc.Save(outputPath);
    }
}

If you generalize this into a page range struct:

public struct PageRange {
    public int StartPage;
    public int EndPage;
}

where StartPage and EndPage inclusively describe a range of pages, then the code is simpler:

public void SplitPdf(Stream stm, List<PageRange> ranges, string outputDirectory)
{
    PdfDocument mainDoc = new PdfDocument(stm);

    int outputDocCount = 1;
    foreach (PageRange range in ranges) {
        int startPage = Math.Min(range.StartPage, range.EndPage); // assume not in order
        int endPage = Math.Max(range.StartPage, range.EndPage);
        PdfDocument splitDoc = new PdfDocument();
        for (int i=startPage; i <= endPage; i++)
            splitDoc.Pages.Add(mainDoc.Pages[i]);
        string outputPath = Path.Combine(outputDirectory, 
                                         string.Format("{0:D3}.pdf", outputDocCount));
        splitDoc.Save(outputPath);
        outputDocCount++;
    }
}

Steve Claridge · Answer

PDFBox is a Java library but it does have some command line tools as well:

http://pdfbox.apache.org/

PDFBox can extract text and also rebuild/split PDFs.
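
Neither answer shows the search step itself, so here is a minimal sketch of it in plain Java. It assumes the text of each page has already been extracted into a list of strings, one per page (with PDFBox this could be done using PDFTextStripper restricted to a single page via setStartPage/setEndPage). The class name PageSplitPlanner, its methods, and the "Page" + digits pattern are hypothetical and only illustrate the idea; page indexes are 0-based, and pages before the first match are not assigned to any range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PageSplitPlanner {
    // Assumed marker: the word "Page", whitespace, then one or more digits.
    private static final Pattern MARKER = Pattern.compile("Page\\s+\\d+");

    /** Returns the 0-based indexes of pages whose text contains the marker. */
    public static List<Integer> findPageStarts(List<String> pageTexts) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (MARKER.matcher(pageTexts.get(i)).find()) {
                starts.add(i);
            }
        }
        return starts;
    }

    /** Converts ordered start indexes into inclusive {start, end} ranges. */
    public static List<int[]> toRanges(List<Integer> starts, int pageCount) {
        List<int[]> ranges = new ArrayList<>();
        for (int i = 0; i < starts.size(); i++) {
            int start = starts.get(i);
            // A range ends just before the next start, or at the last page.
            int end = (i < starts.size() - 1) ? starts.get(i + 1) - 1 : pageCount - 1;
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }
}
```

Each resulting range can then be handed to whichever library does the actual splitting, writing the pages of range number n to a file named with a zero-padded counter such as 001.pdf.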

Tags:

search

pdf

ohho