Search and replace placeholder text in PDF with Python

Tags: python, pdf

I need to generate a customized PDF copy of a template document. The easiest way, I thought, was to create a source PDF with placeholder text where customization needs to happen, i.e. <first_name> and <last_name>, and then replace these with the correct values.

I've searched high and low, but is there really no way of taking the source template PDF, replacing the placeholders with actual values, and writing the result to a new PDF?

I looked at PyPDF2 and ReportLab, but neither seems to be able to do so. Any suggestions? Most of my searches lead to a Perl module, CAM::PDF, but I'd prefer to keep it all in Python.

asked Sep 26 '16 21:09

uncrase

2 Answers

There is no direct way to do this that will work reliably. PDFs are not like HTML: they specify the positioning of text character-by-character. They may not even include the whole font used to render the text, just the characters needed to render the specific text in the document. No library I've found will do nice things like re-wrap paragraphs after updating the text. PDFs are for the most part a display-only format, so you'll be much better off using a tool that turns markup into a PDF than updating the PDF in-place.
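To illustrate the character-by-character positioning described above, here is a minimal sketch using a made-up content stream fragment (the stream, font name, and kerning values are assumptions for illustration, not taken from a real file). It shows why a plain search for the placeholder can fail even when the text is visibly on the page:

```python
import re

# Hypothetical content stream fragment: the placeholder is drawn by a single
# TJ operator, but kerning adjustments split it into three string fragments.
content_stream = b"BT /F1 12 Tf 72 720 Td [(<fir) -20 (st_na) 15 (me>)] TJ ET"

placeholder = b"<first_name>"

# A naive byte search fails because the placeholder is not contiguous.
found_naively = placeholder in content_stream

# Joining the literal strings inside the TJ array recovers the visible text.
# This regex ignores escaped parentheses and hex strings, so it is only an
# illustration, not a real content stream parser.
fragments = re.findall(rb"\(([^()]*)\)", content_stream)
visible_text = b"".join(fragments)

print(found_naively)  # False
print(visible_text)   # b'<first_name>'
```

Real files add further obstacles, such as compressed streams and subset fonts with custom encodings, which this sketch does not attempt to handle.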

If generating the PDF from markup is not an option, you can create a PDF form in something like Acrobat, then fill it in with a PDF manipulation library such as iText (AGPL) or PDFBox; the latter has a nice Clojure wrapper called pdfboxing that can handle some of that.

From my experience, Python's support for writing to PDFs is pretty limited. Java has, by far, the best language support. Also, you get what you pay for, so it would probably be worth paying for an iText license if you're using this for commercial purposes. I've had pretty good results writing Python wrappers around PDF-manipulation CLI tools like pdfboxing and Ghostscript. That will probably be much easier for your use case than trying to shoehorn this into Python's PDF ecosystem.
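As a rough sketch of the wrapper approach, the following wraps a Ghostscript call with subprocess. The function names are hypothetical, and it assumes the Ghostscript executable is installed as gs on PATH. It only rewrites a PDF through the pdfwrite device; it does not replace text itself, so treat it as the shape of a wrapper rather than a solution to the placeholder problem:

```python
import shutil
import subprocess
from pathlib import Path


def build_gs_command(src, dst, executable="gs"):
    """Return the argument list for rewriting src to dst with Ghostscript."""
    return [
        executable,
        "-q",                 # suppress the startup banner
        "-sDEVICE=pdfwrite",  # produce PDF output
        "-o", str(dst),       # output file; also implies -dBATCH -dNOPAUSE
        str(src),
    ]


def rewrite_pdf(src, dst, executable="gs"):
    """Run Ghostscript and raise if it is missing or exits with an error."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} not found on PATH")
    subprocess.run(build_gs_command(src, dst, executable), check=True)
    return Path(dst)


print(build_gs_command("template.pdf", "copy.pdf"))
```

Keeping the command construction separate from the subprocess call makes the wrapper easy to check without having the tool installed.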

answered Sep 28 '22 11:09

RecursivelyIronic

There is no definitive solution, but I found two approaches that work most of the time.

In Python, https://github.com/JoshData/pdf-redactor gives good results. Here is the example code:

import re

import pdf_redactor

options = pdf_redactor.RedactorOptions()

# Each content filter is a (compiled regex, replacement function) pair.
options.content_filters = [
        # Replace a literal name with X's.
        (
                re.compile(u"Tom Xavier"),
                lambda m : "XXXXXXX"
        ),

        # Redact things that look like social security numbers.
        # See https://github.com/opendata/SSN-Redaction for why this regex is complicated.
        (
                re.compile(r"(?<!\d)(?!666|000|9\d{2})([OoIli0-9]{3})([\s-]?)(?!00)([OoIli0-9]{2})\2(?!0{4})([OoIli0-9]{4})(?!\d)"),
                lambda m : "XXX-XX-XXXX"
        ),
]

# Perform the redaction, reading the PDF from standard input and writing to standard output.
pdf_redactor.redactor(options)

A full example can be found in the pdf-redactor repository.
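Applied to the placeholders from the question, the same filter mechanism might look like the sketch below. The helper names are hypothetical, and fill_template assumes that RedactorOptions accepts input_stream and output_stream attributes as described in the project's README. Replacement text may also fail to render if the embedded font lacks the needed characters.

```python
import re


def placeholder_filters(values):
    """Build (regex, replacement function) pairs in the content_filters shape."""
    filters = []
    for name, value in values.items():
        pattern = re.compile(re.escape("<" + name + ">"))
        # Bind value as a default argument so each lambda keeps its own value.
        filters.append((pattern, lambda m, value=value: value))
    return filters


def fill_template(src_path, dst_path, values):
    """Sketch: run pdf-redactor over src_path using the placeholder filters."""
    import pdf_redactor  # third-party; imported here so the helper above runs without it

    options = pdf_redactor.RedactorOptions()
    options.content_filters = placeholder_filters(values)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        options.input_stream = src    # assumed attribute, see lead-in
        options.output_stream = dst   # assumed attribute, see lead-in
        pdf_redactor.redactor(options)


# The filters can be checked on plain text without touching a PDF.
text = "Dear <first_name> <last_name>,"
for pattern, replace in placeholder_filters({"first_name": "Ada", "last_name": "Lovelace"}):
    text = pattern.sub(replace, text)
print(text)  # Dear Ada Lovelace,
```

Using re.escape keeps the angle brackets and any other special characters in a placeholder name from being interpreted as regex syntax.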

In Ruby, https://github.com/gettalong/hexapdf works for blacking out text. Example code:

require 'hexapdf'

class ShowTextProcessor < HexaPDF::Content::Processor

  def initialize(page, to_hide_arr)
    super()
    @canvas = page.canvas(type: :overlay)
    @to_hide_arr = to_hide_arr
  end

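  def show_text(str)
    boxes = decode_text_with_positioning(str)
    return if boxes.string.empty?

    if @to_hide_arr.include? boxes.string
      # The rectangles are filled, so set the fill color (black), not the stroke color.
      @canvas.fill_color(0, 0, 0)

      # Draw one filled rectangle over each glyph box of the matching text.
      boxes.each do |box|
        x, y = *box.lower_left
        tx, ty = *box.upper_right
        @canvas.rectangle(x, y, tx - x, ty - y).fill
      end
    end
  end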
  alias :show_text_with_positioning :show_text

end

file_name = ARGV[0]
strings_to_black = ARGV[1].split("|")

doc = HexaPDF::Document.open(file_name)
puts "Blacken strings [#{strings_to_black}], inside [#{file_name}]."
doc.pages.each.with_index do |page, index|
  processor = ShowTextProcessor.new(page, strings_to_black)
  page.process_contents(processor)
end

new_file_name = "#{file_name.split('.').first}_updated.pdf"
doc.write(new_file_name, optimize: true)

puts "Writing updated file [#{new_file_name}]."

Note that this only draws black boxes over the text: the underlying text is still in the file and becomes visible when you select it.

answered Sep 28 '22 11:09

Priyanshu Jain
