I need to generate a customized PDF copy of a template document.
The easiest way - I thought - was to create a source PDF that has some placeholder text where customization needs to happen , ie <first_name>
and <last_name>
, and then replace these with the correct values.
I've searched high and low, but is there really no way of basically taking the source template PDF, replace the placeholders with actual values and write to a new PDF?
I looked at PyPDF2 and ReportLab but neither seem to be able to do so. Any suggestions? Most of my searches lead to using a Perl app, CAM::PDF, but I'd prefer to keep it all in Python.
Choose Edit > Find (Ctrl/Command+F). Type the text you want to search for in the text box on the Find toolbar. To replace text, click Replace With to expand the toolbar, then type the replacement text in the Replace With text box. Finds only occurrences of the complete word you type in the text box.
On the PDF file, press “Ctrl+F” on your keyboard and input the text you would like to be replaced. Then type in new text in the input field of Replace to modify the current one to this new text. Click on “Replace” to start replacing PDF texts.
Step 1: Import all libraries. Step 2: Convert PDF file to txt format and read data. Step 3: Use “. findall()” function of regular expressions to extract keywords.
There is no direct way to do this that will work reliably. PDFs are not like HTML: they specify the positioning of text character-by-character. They may not even include the whole font used to render the text, just the characters needed to render the specific text in the document. No library I've found will do nice things like re-wrap paragraphs after updating the text. PDFs are for the most part a display-only format, so you'll be much better off using a tool that turns markup into a PDF than updating the PDF in-place.
If that's not an option, you can create a PDF form in something like Acrobat, then use a PDF manipulation library like iText (AGPL) or pdfbox, which has a nice clojure wrapper called pdfboxing that can handle some of that.
From my experience, Python's support for writing to PDFs is pretty limited. Java has, by far, the best language support. Also, you get what you pay for, so it would probably be worth paying for a iText license if you're using this for commercial purposes. I've had pretty good results writing python wrappers around PDF-manipulation CLI tools like pdfboxing and ghostscript. That will probably be much easier for your use case than trying to shoehorn this into Python's PDF ecosystem.
There is no definite solution but I found 2 solutions that works most of the time.
In python https://github.com/JoshData/pdf-redactor gives good results. Here is the example code:
# Redact things that look like social security numbers, replacing the
# text with X's.
options.content_filters = [
# First convert all dash-like characters to dashes.
(
re.compile(u"Tom Xavier"),
lambda m : "XXXXXXX"
),
# Then do an actual SSL regex.
# See https://github.com/opendata/SSN-Redaction for why this regex is complicated.
(
re.compile(r"(?<!\d)(?!666|000|9\d{2})([OoIli0-9]{3})([\s-]?)(?!00)([OoIli0-9]{2})\2(?!0{4})([OoIli0-9]{4})(?!\d)"),
lambda m : "XXX-XX-XXXX"
),
]
# Perform the redaction using PDF on standard input and writing to standard output.
pdf_redactor.redactor(options)
Full Example can be found here
In ruby https://github.com/gettalong/hexapdf works for black out text. Example code:
require 'hexapdf'
class ShowTextProcessor < HexaPDF::Content::Processor
def initialize(page, to_hide_arr)
super()
@canvas = page.canvas(type: :overlay)
@to_hide_arr = to_hide_arr
end
def show_text(str)
boxes = decode_text_with_positioning(str)
return if boxes.string.empty?
if @to_hide_arr.include? boxes.string
@canvas.stroke_color(0, 0 , 0)
boxes.each do |box|
x, y = *box.lower_left
tx, ty = *box.upper_right
@canvas.rectangle(x, y, tx - x, ty - y).fill
end
end
end
alias :show_text_with_positioning :show_text
end
file_name = ARGV[0]
strings_to_black = ARGV[1].split("|")
doc = HexaPDF::Document.open(file_name)
puts "Blacken strings [#{strings_to_black}], inside [#{file_name}]."
doc.pages.each.with_index do |page, index|
processor = ShowTextProcessor.new(page, strings_to_black)
page.process_contents(processor)
end
new_file_name = "#{file_name.split('.').first}_updated.pdf"
doc.write(new_file_name, optimize: true)
puts "Writing updated file [#{new_file_name}]."
In this you can black out text on select text will be visible.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With