Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split a PDF by Bookmarks?

I am to process single PDFs that have each been created by 'merging' multiple PDFs. Each of the merged PDF has the places where the PDF parts start displayed with a bookmark.

Is there any way to automatically split this up by bookmarks with a script?

We only have the bookmarks to indicate the parts, not the page numbers, so we would need to infer the page numbers from the bookmarks. A Linux tool would be best.

like image 362
Jason Avatar asked Apr 08 '10 16:04

Jason


4 Answers

There's a command line tool written in Java called Sejda where you can find the splitbybookmarks command that does exactly what you asked. It's Java so it runs on Linux and being a command line tool you can write script to do that.

Disclaimer
I'm one of the authors

like image 177
Andrea Vacondio Avatar answered Nov 17 '22 19:11

Andrea Vacondio


pdftk can be used to split the PDF file and extract the page numbers of the bookmarks.

To get the page numbers of the bookmarks do

pdftk in.pdf dump_data

and make your script read the page numbers from the output.

Then use

pdftk in.pdf cat A-B output out_A-B.pdf

to get the pages from A to B into out_A-B.pdf.

The script could be something like this:

#!/bin/bash

infile=$1 # input pdf
outputprefix=$2

[ -e "$infile" -a -n "$outputprefix" ] || exit 1 # Invalid args

pagenumbers=( $(pdftk "$infile" dump_data | \
                grep '^BookmarkPageNumber: ' | cut -f2 -d' ' | uniq)
              end )

for ((i=0; i < ${#pagenumbers[@]} - 1; ++i)); do
  a=${pagenumbers[i]} # start page number
  b=${pagenumbers[i+1]} # end page number
  [ "$b" = "end" ] || b=$[b-1]
  pdftk "$infile" cat $a-$b output "${outputprefix}"_$a-$b.pdf
done
like image 41
Tuomas Avatar answered Nov 17 '22 20:11

Tuomas


you have programs that are built like pdf-split that can do that for you:

A-PDF Split is a very simple, lightning-quick desktop utility program that lets you split any Acrobat pdf file into smaller pdf files. It provides complete flexibility and user control in terms of how files are split and how the split output files are uniquely named. A-PDF Split provides numerous alternatives for how your large files are split - by pages, by bookmarks and by odd/even page. Even you can extract or remove part of a PDF file. A-PDF Split also offers advanced defined splits that can be saved and later imported for use with repetitive file-splitting tasks. A-PDF Split represents the ultimate in file splitting flexibility to suit every need.

A-PDF Split works with password-protected pdf files, and can apply various pdf security features to the split output files. If needed, you can recombine the generated split files with other pdf files using a utility such as A-PDF Merger to form new composite pdf files.

A-PDF Split does NOT require Adobe Acrobat, and produces documents compatible with Adobe Acrobat Reader Version 5 and above.

edit*

also found a free open sourced program Here if you do not want to pay.

like image 3
Justin Gregoire Avatar answered Nov 17 '22 20:11

Justin Gregoire


Here's a little Perl program I use for the task. Perl isn't special; it's just a wrapper around pdftk to interpret its dump_data output to turn it into page numbers to extract:

#!perl
use v5.24;
use warnings;

use Data::Dumper;
use File::Path qw(make_path);
use File::Spec::Functions qw(catfile);

my $pdftk = '/usr/local/bin/pdftk';
my $file = $ARGV[0];
my $split_dir = $ENV{PDF_SPLIT_DIR} // 'pdf_splits';

die "Can't find $ARGV[0]\n" unless -e $file;

# Read the data that pdftk spits out.
open my $pdftk_fh, '-|', $pdftk, $file, 'dump_data';

my @chapters;
while( <$pdftk_fh> ) {
    state $chapter = 0;
    next unless /\ABookmark/;

    if( /\ABookmarkBegin/ ) {
        my( $title ) = <$pdftk_fh> =~ /\ABookmarkTitle:\s+(.+)/;
        my( $level ) = <$pdftk_fh> =~ /\ABookmarkLevel:\s+(.+)/;

        my( $page_number ) = <$pdftk_fh> =~ /\BookmarkPageNumber:\s+(.+)/;

        # I only want to split on chapters, so I skip higher
        # level numbers (higher means more nesting, 1 is lowest).
        next unless $level == 1;

        # If you have front matter (preface, etc) then this numbering
        # will be off. Chapter 1 might be called Chapter 3.
        push @chapters, {
            title         => $title,
            start_page    => $page_number,
            chapter       => $chapter++,
            };
        }
    }

# The end page for one chapter is one before the start page for
# the next chapter. There might be some blank pages at the end
# of the split for PDFs where the next chapter needs to start on
# an odd page.
foreach my $i ( 0 .. $#chapters - 1 ) {
    my $last_page = $chapters[$i+1]->{start_page} - 1;
    $chapters[$i]->{last_page} = $last_page;
    }
$chapters[$#chapters]->{last_page} = 'end';

make_path $split_dir;
foreach my $chapter ( @chapters ) {
    my( $start, $end ) = $chapter->@{qw(start_page last_page)};

    # slugify the title so use it as a filename
    my $title = lc( $chapter->{title} =~ s/[^a-z]+/-/gri );

    my $path = catfile( $split_dir, "$title.pdf" );
    say "Outputting $path";

    # Use pdftk to extract that part of the PDF
    system $pdftk, $file, 'cat', "$start-$end", 'output', $path;
    }
like image 1
brian d foy Avatar answered Nov 17 '22 21:11

brian d foy