
PDF text search and split library

Tags: search, pdf

I am looking for a server-side PDF library (or command-line tool) which can:

  • split a multi-page PDF file into individual PDF files, based on the result of searching the PDF's text content

Examples:

  • Search for a "Page ???" pattern in the text and split the big PDF into 001.pdf, 002.pdf, ... ???.pdf

A server program will scan the PDF, look for the search pattern, collect the page(s) that match the pattern, and save them as a file on disk.

Integration with PHP or Ruby would be nice, but a command-line tool is also acceptable. It will run as a server-side (Linux or Win32) batch process, so no GUI or interactive login is available. i18n support would be nice but is not required. Thanks~

ohho asked May 11 '26 05:05


2 Answers

My company, Atalasoft, has just released some PDF manipulation tools that run on .NET. There is a text extraction class that you can use to find the text and decide how to split your document, and a very high-level document class that makes the splitting trivial. Suppose you have a Stream for your source PDF and a List, sorted in ascending order, that holds the starting page of each split. Then the code to generate your split files looks like this:

// Needs System, System.Collections.Generic, System.IO and the Atalasoft PDF namespace.
public void SplitPdf(Stream stm, List<int> pageStarts, string outputDirectory)
{
    PdfDocument mainDoc = new PdfDocument(stm);
    int lastPage = mainDoc.Pages.Count - 1;

    for (int i = 0; i < pageStarts.Count; i++) {
        int startPage = pageStarts[i];
        // Each split ends just before the next start; the last one runs to the end.
        int endPage = (i < pageStarts.Count - 1) ?
            pageStarts[i + 1] - 1 :
            lastPage;
        if (startPage > endPage) throw new ArgumentException("list is not ordered properly", "pageStarts");
        PdfDocument splitDoc = new PdfDocument();
        for (int j = startPage; j <= endPage; j++)
            splitDoc.Pages.Add(mainDoc.Pages[j]);

        // Output files are named 001.pdf, 002.pdf, ...
        string outputPath = Path.Combine(outputDirectory,
                                         string.Format("{0:D3}.pdf", i + 1));
        splitDoc.Save(outputPath);
    }
}

If you generalize this into a page range struct:

public struct PageRange {
    public int StartPage;
    public int EndPage;
}

where StartPage and EndPage inclusively describe a range of pages, then the code is simpler:

public void SplitPdf(Stream stm, List<PageRange> ranges, string outputDirectory)
{
    PdfDocument mainDoc = new PdfDocument(stm);

    int outputDocCount = 1;
    foreach (PageRange range in ranges) {
        int startPage = Math.Min(range.StartPage, range.EndPage); // assume not in order
        int endPage = Math.Max(range.StartPage, range.EndPage);
        PdfDocument splitDoc = new PdfDocument();
        for (int i=startPage; i <= endPage; i++)
            splitDoc.Pages.Add(mainDoc.Pages[i]);
        string outputPath = Path.Combine(outputDirectory, 
                                         string.Format("{0:D3}.pdf", outputDocCount));
        splitDoc.Save(outputPath);
        outputDocCount++;
    }
}

plinth answered May 17 '26 16:05


PDFBox is a Java library but it does have some command line tools as well:

http://pdfbox.apache.org/

PDFBox can extract text and can also rebuild or split PDFs.
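
As a rough illustration of the search step only, here is a minimal Java sketch that decides where to split once the text of each page is available. It does not call PDFBox itself: the per-page strings are assumed to come from a text extractor such as PDFBox's PDFTextStripper, and the class name PageSplitPlanner, its methods, and the "Page" plus digits marker format are all hypothetical names chosen for this example. Page indices are zero-based, and pages before the first marker are not assigned to any output file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class PageSplitPlanner {

    /** Inclusive, zero-based range of pages that would go into one output file. */
    static final class PageRange {
        final int start;
        final int end;

        PageRange(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Assumed marker format: the word "Page" followed by digits, e.g. "Page 001".
    private static final Pattern MARKER = Pattern.compile("Page\\s+\\d+");

    /** Returns the zero-based indices of the pages whose text contains the marker. */
    static List<Integer> findPageStarts(List<String> pageTexts) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (MARKER.matcher(pageTexts.get(i)).find()) {
                starts.add(i);
            }
        }
        return starts;
    }

    /** Turns ascending start indices into ranges; each range ends just before the next start. */
    static List<PageRange> toRanges(List<Integer> starts, int pageCount) {
        List<PageRange> ranges = new ArrayList<>();
        for (int i = 0; i < starts.size(); i++) {
            int end = (i < starts.size() - 1) ? starts.get(i + 1) - 1 : pageCount - 1;
            ranges.add(new PageRange(starts.get(i), end));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Made-up page texts standing in for the output of a real text extractor.
        List<String> pageTexts = Arrays.asList(
            "Page 001 Invoice", "continued", "Page 002 Invoice", "Page 003 Invoice", "continued");

        List<PageRange> ranges = toRanges(findPageStarts(pageTexts), pageTexts.size());
        for (int i = 0; i < ranges.size(); i++) {
            PageRange r = ranges.get(i);
            System.out.println(String.format("%03d.pdf", i + 1) + " <- pages " + r.start + ".." + r.end);
        }
    }
}
```

Each resulting range could then be handed to whatever does the actual splitting, whether that is PDFBox or another library.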

Steve Claridge answered May 17 '26 17:05


