I have 110 PDFs that I'm trying to extract images from. Once the images are extracted, I'd like to remove any duplicates and delete images that are less than 4KB. My code to do that looks like this:
import os
import shutil
import sys
import hashlib
from glob import glob
from multiprocessing import Pool
from subprocess import call

import pandas as pd
from PIL import Image

def extract_images_from_file(pdf_file):
    file_name = os.path.splitext(os.path.basename(pdf_file))[0]
    call(["pdfimages", "-png", pdf_file, file_name])
    os.remove(pdf_file)

def dedup_images():
    os.mkdir("unique_images")
    md5_library = []
    images = glob("*.png")
    print "Deleting images smaller than 4KB and generating the MD5 hash values for all other images..."
    for image in images:
        if os.path.getsize(image) <= 4000:
            os.remove(image)
        else:
            m = hashlib.md5()
            # Hash the decoded pixel data (assumes RGB images) so duplicates
            # match even if the PNG files differ byte-for-byte.
            image_data = list(Image.open(image).getdata())
            image_string = "".join("".join([str(tpl[0]), str(tpl[1]), str(tpl[2])]) for tpl in image_data)
            m.update(image_string)
            md5_library.append([image, m.digest()])
    headers = ['image_file', 'md5']
    dat = pd.DataFrame(md5_library, columns=headers).sort_values(['md5'])
    dat.drop_duplicates(subset="md5", inplace=True)
    print "Extracting the unique images."
    unique_images = dat.image_file.tolist()
    for image in unique_images:
        shutil.copy(image, os.path.join("unique_images", image))
This process can take a while, so I've started to dabble in multithreading. Feel free to interpret that as me saying I have no idea what I'm doing. I thought extracting the images would be easy to parallelise, but not the deduping, since that involves a lot of I/O on a single file and I have no idea how to parallelise that. So here's my attempt at the parallel process:
if __name__ == '__main__':
    filepath = sys.argv[1]
    folder_name = os.getcwd() + "\\all_images\\"
    if not os.path.exists(folder_name):
        os.mkdir(folder_name)
    pdfs = glob("*.pdf")
    print "Copying all PDFs to the images folder..."
    for pdf in pdfs:
        shutil.copy(pdf, ".\\all_images\\")
    os.chdir("all_images")
    pool = Pool(processes=8)
    print "Extracting images from PDFs..."
    pool.map(extract_images_from_file, pdfs)
    print "Extracting unique images into a new folder..."
    dedup_images()
    print "All images have been extracted and deduped."
Everything seems to have worked fine when extracting the images, but then it all went haywire. So here are my questions:
1) Am I setting up the parallel process correctly?
2) Does it continue to try to use all 8 processes on dedup_images()?
3) Is there anything I'm missing and/or not doing correctly?
Thanks in advance!
EDIT: Here is what I mean by "haywire". The errors start out with a bunch of lines like this:
I/O Error: Couldn't open image If/iOl eE r'rSourb:p oICe/onOua l EdNrner'wot r Y:oo prCekon u Cliodmunan'gttey of1pi0e
l2ne1 1i'4mS auogbiepl o2fefinrlaee e N@'egSwmu abYipolor ekcn oaCm o Nupentwt y1Y -o18r16k11 8.C1po4nu gn3't4
y7 5160120821143 3p4t7I 9/49O-8 88E78r81r.3op rnp:gt ' C
3o-u3l6d0n.'ptn go'p
en image file 'Ia/ ON eEwr rYoorr:k CCIoo/uuOln dtEnyr' rt1o 0ro2:p1 e1Cn4o uiolmidalng2'eft r m '
ai gpceoo emfn iapl teN e1'w-S 8uY6bo2pr.okpe nnCgao' u
Nnetwy Y1o0r2k8 1C4o u3n4t7y9 918181881134 3p4t7 536-1306211.3p npgt'
4-879.png'
I/O Error: CoulId/nO' tE rorpoern: iCmoaugled nf'itl eo p'eub piomeangae fNielwe Y'oSrukb pCooeunnat yN e1w0 2Y8o1r
4k 3C4o7u9n9t8y8 811032 1p1t4 3o-i3l622f pt 1-863.png'
And then gets more readable with multiple lines like this:
I/O Error: Couldn't open image file 'pt 1-864.png'
I/O Error: Couldn't open image file 'pt 1-865.png'
I/O Error: Couldn't open image file 'pt 1-866.png'
I/O Error: Couldn't open image file 'pt 1-867.png'
This repeats for a while, going back and forth between the garbled error text and the readable.
Finally, it gets to here:
Deleting images smaller than 4KB and generating the MD5 hash values for all other images...
Extracting unique images into a new folder...
which implies that the code picks back up and continues on with the process. What could be going wrong?
Your code is basically fine.
The garbled text is all of the processes writing their own versions of the I/O Error message to the console at the same time, so the lines come out interleaved. The message itself is generated by the pdfimages command, probably because two instances running at once conflict, possibly over temporary files, or by writing to the same output file name.
Try using a different image root for each separate PDF file.
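If two PDFs share a basename, or pdfimages races on its output names, one way to guarantee distinct roots is to prefix each root with the worker process's PID. A sketch under that assumption, where `image_root` is a hypothetical helper, not part of the original script:

```python
import os
from subprocess import call

def image_root(pdf_file):
    """Hypothetical helper: build an image root that is unique per worker process."""
    base = os.path.splitext(os.path.basename(pdf_file))[0]
    return "%d-%s" % (os.getpid(), base)

def extract_images_from_file(pdf_file):
    # Same shape as the original worker, but pdfimages now writes names like
    # "12345-report-000.png", so two workers can never collide on output files.
    call(["pdfimages", "-png", pdf_file, image_root(pdf_file)])
    os.remove(pdf_file)
```

The PID is just one convenient unique token; an index passed in with the PDF path would work equally well.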