
PyPDF2: Concatenating pdfs in memory

I want to concatenate (append) a bunch of small pdfs together efficiently in memory, in pure Python. A typical case is 500 single-page pdfs, each about 400 kB in size, merged into one. Let's say the pdfs are available as an iterable in memory, say a list:

my_pdfs = [pdf1_fileobj, pdf2_fileobj, ..., pdfn_fileobj]  # type is BytesIO

where each pdf_fileobj is of type BytesIO. The baseline memory usage is then about 200 MB (500 pdfs, 400 kB each).

Ideally, I would want the following code to do the concatenation using no more than 400-500 MB of memory in total (including my_pdfs). However, that doesn't seem to be the case: the debugging statement on the last line reports a maximum memory usage of almost 700 MB, and the macOS Activity Monitor shows about 600 MB allocated when the last line is reached.

Running gc.collect() reduces this to 350 MB (almost too good?). Why do I have to run garbage collection manually to get rid of the merging garbage in this case? I have seen this (probably) cause memory build-up in a slightly different scenario that I'll skip for now.

import io
import resource  # For debugging

from PyPDF2 import PdfFileMerger


def merge_pdfs(iterable):
    """Merge pdfs in memory"""
    merger = PdfFileMerger()
    for pdf_fileobj in iterable:
        merger.append(pdf_fileobj)

    myio = io.BytesIO()
    merger.write(myio)
    merger.close()

    myio.seek(0)
    return myio


my_concatenated_pdf = merge_pdfs(my_pdfs)

# Print the maximum memory usage
print("Memory usage: %s (kB)" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

Question summary

  • Why does the code above need almost 700 MB of memory to merge 200 MB worth of pdfs? Shouldn't 400 MB + overhead be enough? How do I optimize it?
  • Why do I need to run garbage collection manually to get rid of PyPDF2 merging junk when the variables in question should already be out of scope?
  • What about this general approach? Is BytesIO suitable to use in this case? merger.write(myio) does seem to run kind of slow, given that everything happens in RAM.

Thank you!

Andreas asked Aug 13 '17 at 16:08


1 Answer

Q: Why does the code above need almost 700 MB of memory to merge 200 MB worth of pdfs? Shouldn't 400 MB + overhead be enough? How do I optimise it?

A: Because .append creates a new stream object for each appended pdf, and merger.write(myio) then creates yet another stream object; together with the 200 MB of source pdfs you already hold in memory, that adds up to roughly 3 × 200 MB at the same time.
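
If the merged result does not have to stay in memory, one of those copies can be avoided by letting merger.write() target a file on disk instead of a second BytesIO (write() accepts a file object or a path). A minimal sketch, with out_path as an illustrative name:

from PyPDF2 import PdfFileMerger


def merge_pdfs_to_disk(iterable, out_path="merged.pdf"):
    """Merge in-memory pdfs, writing the result straight to disk
    so the merged output is not held in RAM on top of the inputs."""
    merger = PdfFileMerger()
    for pdf_fileobj in iterable:
        merger.append(pdf_fileobj)

    with open(out_path, "wb") as output_file:
        merger.write(output_file)
    merger.close()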


Q: Why do I need to run garbage collection manually to get rid of PyPDF2 merging junk when the variables in question should already be out of scope?

A: It is a known issue in PyPDF2.
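
Until that is fixed upstream, the practical workaround is the one already used in the question: trigger a collection explicitly once the merge is done, e.g.:

import gc

my_concatenated_pdf = merge_pdfs(my_pdfs)
gc.collect()  # explicitly reclaim PyPDF2's intermediate objects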


Q: What about this general approach? Is BytesIO suitable to use in this case?

A: Considering the memory issues, you might want to try a different approach: merge the pdfs in smaller batches, temporarily saving the intermediate results to disk, and clearing the already-merged ones from memory as you go.
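
A rough sketch of that idea; the chunk size and the use of tempfile are illustrative choices, not part of the original suggestion:

import gc
import os
import tempfile

from PyPDF2 import PdfFileMerger


def merge_chunk_to_disk(pdf_fileobjs):
    """Merge one chunk of in-memory pdfs into a temporary file on disk."""
    merger = PdfFileMerger()
    for fileobj in pdf_fileobjs:
        merger.append(fileobj)

    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    merger.write(tmp)
    merger.close()
    tmp.close()
    gc.collect()  # reclaim PyPDF2's intermediate objects right away
    return tmp.name


def merge_pdfs_via_disk(pdf_fileobjs, out_path="merged.pdf", chunk_size=50):
    """Merge pdfs in chunks (expects a list), so only one chunk's worth of
    merging overhead is in memory at a time; chunk results live on disk."""
    chunk_files = [
        merge_chunk_to_disk(pdf_fileobjs[i:i + chunk_size])
        for i in range(0, len(pdf_fileobjs), chunk_size)
    ]

    final = PdfFileMerger()
    for name in chunk_files:
        final.append(name)  # append() also accepts a path to a pdf file
    final.write(out_path)
    final.close()

    for name in chunk_files:
        os.remove(name)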

spedy answered Nov 05 '22 at 14:11