
Search and replace placeholder text in PDF with Python

Tags: python, pdf

I need to generate a customized PDF copy of a template document. The easiest way, I thought, was to create a source PDF with placeholder text where customization needs to happen, e.g. <first_name> and <last_name>, and then replace these with the correct values.

I've searched high and low, but is there really no way to take the source template PDF, replace the placeholders with actual values, and write the result to a new PDF?

I looked at PyPDF2 and ReportLab, but neither seems to be able to do this. Any suggestions? Most of my searches lead to a Perl app, CAM::PDF, but I'd prefer to keep it all in Python.

uncrase asked Sep 26 '16 21:09




2 Answers

There is no direct way to do this that will work reliably. PDFs are not like HTML: they specify the positioning of text character-by-character. They may not even include the whole font used to render the text, just the characters needed to render the specific text in the document. No library I've found will do nice things like re-wrap paragraphs after updating the text. PDFs are for the most part a display-only format, so you'll be much better off using a tool that turns markup into a PDF than updating the PDF in-place.
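To illustrate why a plain search-and-replace on the file contents tends to fail, here is a minimal sketch using only the standard library. The content-stream fragment is made up for illustration, not taken from a real file, and visible_text is a hypothetical helper. It only handles simple literal strings inside a TJ array. It does not handle escape sequences, hex strings, or custom font encodings, and real content streams are usually compressed as well.

```python
import re

# A made-up fragment of a PDF content stream. The TJ operator draws an array
# of string pieces, with numbers in between that adjust the position (kerning).
# What a reader sees as "<first_name>" is stored here as three separate pieces.
CONTENT_STREAM = b"BT /F1 12 Tf 72 700 Td [(Dear <fir) -20 (st_na) 15 (me>,)] TJ ET"


def visible_text(stream: bytes) -> str:
    """Join the literal string pieces of every TJ array in the stream.

    Simplified on purpose: ignores escapes, hex strings, and font encodings.
    """
    text = []
    for array in re.findall(rb"\[(.*?)\]\s*TJ", stream, flags=re.DOTALL):
        pieces = re.findall(rb"\(([^()]*)\)", array)
        text.append(b"".join(pieces).decode("latin-1"))
    return "".join(text)


# The placeholder is not present as one contiguous byte sequence...
print(b"<first_name>" in CONTENT_STREAM)
# ...even though it is part of what the page displays.
print("<first_name>" in visible_text(CONTENT_STREAM))
```

Even when the pieces can be joined like this, writing a longer or shorter value back means recomputing the positioning, which is the part no library handles nicely.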

If that's not an option, you can create a PDF form in something like Acrobat, then fill in its fields with a PDF manipulation library like iText (AGPL) or PDFBox, which has a nice Clojure wrapper called pdfboxing that can handle some of that.

In my experience, Python's support for writing to PDFs is pretty limited. Java has, by far, the best language support. Also, you get what you pay for, so it would probably be worth paying for an iText license if you're using this for commercial purposes. I've had pretty good results writing Python wrappers around PDF-manipulation CLI tools like pdfboxing and Ghostscript. That will probably be much easier for your use case than trying to shoehorn this into Python's PDF ecosystem.
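As a rough sketch of the wrapper approach, the following builds and runs a Ghostscript command from Python. It assumes the executable is called gs and is on the PATH (on Windows the name differs). The function names and file names are hypothetical. The flags are standard Ghostscript options for rewriting a file with the pdfwrite device. This only shows the wrapping pattern and does not replace any text by itself.

```python
import shutil
import subprocess
from typing import List


def build_gs_command(src: str, dst: str, gs_binary: str = "gs") -> List[str]:
    """Build the argument list for rewriting src to dst with the pdfwrite device."""
    return [
        gs_binary,
        "-q",                 # suppress startup messages
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit after processing the input
        "-sDEVICE=pdfwrite",  # produce a PDF as output
        f"-sOutputFile={dst}",
        src,
    ]


def rewrite_pdf(src: str, dst: str) -> None:
    """Run Ghostscript; raises if it is missing or exits with an error."""
    if shutil.which("gs") is None:
        raise FileNotFoundError("Ghostscript executable 'gs' not found on PATH")
    subprocess.run(build_gs_command(src, dst), check=True)


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(build_gs_command("template.pdf", "copy.pdf"))
```

Keeping the command construction separate from the subprocess call makes the wrapper easy to check without the tool installed.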

RecursivelyIronic answered Sep 28 '22 11:09


There is no definitive solution, but I found two approaches that work most of the time.

In Python, https://github.com/JoshData/pdf-redactor gives good results. Here is the example code:

import re

import pdf_redactor

options = pdf_redactor.RedactorOptions()

# Each content filter is a (regex, replacement function) pair applied to the text.
options.content_filters = [
        # First replace a specific name with X's.
        (
                re.compile(u"Tom Xavier"),
                lambda m : "XXXXXXX"
        ),

        # Then redact things that look like social security numbers (SSNs).
        # See https://github.com/opendata/SSN-Redaction for why this regex is complicated.
        (
                re.compile(r"(?<!\d)(?!666|000|9\d{2})([OoIli0-9]{3})([\s-]?)(?!00)([OoIli0-9]{2})\2(?!0{4})([OoIli0-9]{4})(?!\d)"),
                lambda m : "XXX-XX-XXXX"
        ),
]

# Perform the redaction using PDF on standard input and writing to standard output.
pdf_redactor.redactor(options)

The full example can be found in that repository.
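For the placeholder use case in the question, the same content_filters mechanism could be given one pattern per placeholder. This is a sketch that has not been tested against a real template. The helpers build_placeholder_filters and preview are hypothetical and the values are made up. Whether the replacement renders correctly still depends on how the text is laid out and which glyphs the PDF embeds.

```python
import re


def build_placeholder_filters(values):
    """Return (compiled regex, replacement function) pairs, one per placeholder.

    The pairs have the same shape as the content_filters entries shown above.
    """
    filters = []
    for name, value in values.items():
        pattern = re.compile(re.escape("<" + name + ">"))
        # Bind value as a default argument so each lambda keeps its own value.
        filters.append((pattern, lambda m, value=value: value))
    return filters


def preview(text, filters):
    """Apply the filters to a plain string to check them before touching a PDF."""
    for pattern, replacement in filters:
        text = pattern.sub(replacement, text)
    return text


filters = build_placeholder_filters({"first_name": "Ada", "last_name": "Lovelace"})
print(preview("Dear <first_name> <last_name>,", filters))  # prints: Dear Ada Lovelace,

# With pdf-redactor installed, the list would be assigned the same way as above:
# options.content_filters = filters
```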

In Ruby, https://github.com/gettalong/hexapdf works for blacking out text. Example code:

require 'hexapdf'

class ShowTextProcessor < HexaPDF::Content::Processor

  def initialize(page, to_hide_arr)
    super()
    @canvas = page.canvas(type: :overlay)
    @to_hide_arr = to_hide_arr
  end

  def show_text(str)
    boxes = decode_text_with_positioning(str)
    return if boxes.string.empty?
    if @to_hide_arr.include? boxes.string
        # The rectangles are filled, so set the fill color (not the stroke color) to black.
        @canvas.fill_color(0, 0, 0)

        boxes.each do |box|
          x, y = *box.lower_left
          tx, ty = *box.upper_right
          @canvas.rectangle(x, y, tx - x, ty - y).fill
        end
    end

  end
  alias :show_text_with_positioning :show_text

end

file_name = ARGV[0]
strings_to_black = ARGV[1].split("|")

doc = HexaPDF::Document.open(file_name)
puts "Blacken strings [#{strings_to_black}], inside [#{file_name}]."
doc.pages.each.with_index do |page, index|
  processor = ShowTextProcessor.new(page, strings_to_black)
  page.process_contents(processor)
end

new_file_name = "#{file_name.split('.').first}_updated.pdf"
doc.write(new_file_name, optimize: true)

puts "Writing updated file [#{new_file_name}]."

This draws a black rectangle over each matching string, but the text itself is still in the file and becomes visible when selected.

Priyanshu Jain answered Sep 28 '22 11:09