
How to properly utilize the multiprocessing module in Python?

I have 110 PDFs that I'm trying to extract images from. Once the images are extracted, I'd like to remove any duplicates and delete images that are less than 4KB. My code to do that looks like this:

import os
import shutil
from glob import glob
from subprocess import call

import md5  # Python 2's md5 module; hashlib.md5 is the modern equivalent
import pandas as pd
from PIL import Image

def extract_images_from_file(pdf_file):
    # Use the PDF's base name as the image root that pdfimages prefixes
    # onto every extracted file name.
    file_name = os.path.splitext(os.path.basename(pdf_file))[0]
    call(["pdfimages", "-png", pdf_file, file_name])
    os.remove(pdf_file)

def dedup_images():
    os.mkdir("unique_images")
    md5_library = []
    images = glob("*.png")
    print "Deleting images smaller than 4KB and generating the MD5 hash values for all other images..."
    for image in images:
        if os.path.getsize(image) <= 4000:
            os.remove(image)
        else:
            # Hash the decoded pixel data rather than the file bytes, so
            # visually identical images match even if the files differ.
            m = md5.new()
            image_data = list(Image.open(image).getdata())
            image_string = "".join(["".join([str(tpl[0]), str(tpl[1]), str(tpl[2])]) for tpl in image_data])
            m.update(image_string)
            md5_library.append([image, m.digest()])
    headers = ['image_file', 'md5']
    # DataFrame.sort() was the pandas API at the time; newer pandas uses sort_values().
    dat = pd.DataFrame(md5_library, columns=headers).sort(['md5'])
    dat.drop_duplicates(subset="md5", inplace=True)

    print "Extracting the unique images."
    unique_images = dat.image_file.tolist()
    for image in unique_images:
        old_file = image
        new_file = "unique_images\\" + image
        shutil.copy(old_file, new_file)

This process can take a while, so I've started dabbling in multiprocessing. Feel free to interpret that as me saying I have no idea what I'm doing. I thought image extraction would be easy to parallelise, but not the deduplication, since that does a lot of I/O on a single file and I have no idea how to parallelise that. So here's my attempt at the parallel process:

import sys
from multiprocessing import Pool

if __name__ == '__main__':
    filepath = sys.argv[1]
    folder_name = os.getcwd() + "\\all_images\\"
    if not os.path.exists(folder_name):
        os.mkdir(folder_name)
    pdfs = glob("*.pdf")
    print "Copying all PDFs to the images folder..."
    for pdf in pdfs:
        shutil.copy(pdf, ".\\all_images\\")
    os.chdir("all_images")
    pool = Pool(processes=8)
    print "Extracting images from PDFs..."
    # map() blocks until every worker finishes, so dedup_images() below
    # runs serially in the parent process, not on the pool.
    pool.map(extract_images_from_file, pdfs)
    print "Extracting unique images into a new folder..."
    dedup_images()
    print "All images have been extracted and deduped."
Everything seemed to work fine while extracting the images, but then it all went haywire. So here are my questions:

1) Am I setting up the parallel process correctly?
2) Does it continue to try to use all 8 processors on dedup_images()?
3) Is there anything I'm missing and/or not doing correctly?

Thanks in advance!

EDIT: Here is what I mean by "haywire". The errors start with a bunch of lines like this:

I/O Error: Couldn't open image If/iOl eE r'rSourb:p oICe/onOua l EdNrner'wot r Y:oo prCekon u Cliodmunan'gttey   of1pi0e
l2ne1  1i'4mS auogbiepl o2fefinrlaee e N@'egSwmu abYipolor ekcn oaCm o Nupentwt  y1Y -o18r16k11 8.C1po4nu gn3't4
y7 5160120821143  3p4t7I 9/49O-8 88E78r81r.3op rnp:gt ' C
3o-u3l6d0n.'ptn go'p
en image file 'Ia/ ON eEwr rYoorr:k  CCIoo/uuOln dtEnyr' rt1o 0ro2:p1 e1Cn4o  uiolmidalng2'eft r m '
ai gpceoo emfn iapl teN  e1'w-S 8uY6bo2pr.okpe nnCgao' u
Nnetwy  Y1o0r2k8 1C4o u3n4t7y9 918181881134  3p4t7 536-1306211.3p npgt'
4-879.png'
I/O Error: CoulId/nO' tE rorpoern:  iCmoaugled nf'itl eo p'eub piomeangae  fNielwe  Y'oSrukb pCooeunnat yN e1w0 2Y8o1r
4k  3C4o7u9n9t8y8 811032 1p1t4  3o-i3l622f pt 1-863.png'

The output then gets more readable, with multiple lines like this:

I/O Error: Couldn't open image file 'pt 1-864.png'
I/O Error: Couldn't open image file 'pt 1-865.png'
I/O Error: Couldn't open image file 'pt 1-866.png'
I/O Error: Couldn't open image file 'pt 1-867.png'

This repeats for a while, going back and forth between the garbled error text and the readable lines.

Finally, it gets to here:

Deleting images smaller than 4KB and generating the MD5 hash values for all other images...
Extracting unique images into a new folder...

which implies that the code picks back up and continues on with the process. What could be going wrong?

asked Oct 02 '15 by tblznbits




1 Answer

Your code is basically fine.

The garbled text is several processes writing their own copies of the I/O Error message to the console at the same time, so the lines come out interleaved character by character. The messages themselves are generated by the pdfimages command, probably because two runs executing at once conflict with each other, possibly over temporary files, or by both writing an output file with the same name.
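One thing you can do right away, independent of the underlying conflict, is stop the messages from interleaving. Here is a minimal sketch (assuming the same worker function as in the question; the .log file name is just an illustration) that redirects each pdfimages run's output to its own log file:

import os
from subprocess import call

def extract_images_from_file(pdf_file):
    file_name = os.path.splitext(os.path.basename(pdf_file))[0]
    # Send this run's stdout/stderr to a per-PDF log file so concurrent
    # workers cannot interleave their messages on the shared console.
    with open(file_name + ".log", "w") as log:
        call(["pdfimages", "-png", pdf_file, file_name], stdout=log, stderr=log)
    os.remove(pdf_file)

That won't cure the underlying I/O errors, but it makes each worker's diagnostics readable.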

Try using a different image root for each separate PDF file.
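One way to implement that advice (a sketch; the per-PDF directory layout is my own assumption, not something from the question) is to give each PDF its own output directory, so no two concurrent pdfimages runs can ever write to the same file name:

import os
from subprocess import call

def extract_images_from_file(pdf_file):
    file_name = os.path.splitext(os.path.basename(pdf_file))[0]
    # One output directory per PDF guarantees unique output paths even
    # if several PDFs would otherwise produce the same image names.
    out_dir = file_name + "_images"
    if not os.path.exists(out_dir):
        os.mkdir(out_dir)
    call(["pdfimages", "-png", pdf_file, os.path.join(out_dir, file_name)])
    os.remove(pdf_file)

If you go this route, dedup_images() would need to collect the images with glob(os.path.join("*_images", "*.png")) instead of glob("*.png").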

answered Sep 30 '22 by strubbly