I am trying to find a way to look in a folder and search the contents of all of the PowerPoint documents within that folder for specific strings, preferably using Python. When those strings are found, I want to report the text after that string as well as which document it was found in. I would like to compile the information and report it in a CSV file.
So far I've only come across the olefile package, https://bitbucket.org/decalage/olefileio_pl/wiki/Home. This provides all of the text contained in a specific document, which is not what I am looking to do. Please help.
python-pptx can be used to do what you propose. At a high level, you would do something like this (a rough sketch, just an idea of the overall approach):
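    import glob
    from pptx import Presentation

    # Visit every .pptx file in the current directory
    for pptx_filename in glob.glob("*.pptx"):
        prs = Presentation(pptx_filename)
        for slide in prs.slides:
            for shape in slide.shapes:
                # Pictures and similar shapes have no text frame, so skip them
                if shape.has_text_frame:
                    print(shape.text)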
You'd need to add the bits about searching shape text for key strings and adding them to a CSV file or whatever, but this general approach should work just fine. I'll leave it to you to work out the finer points :)
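For what it's worth, here is one way the searching and CSV bits might look. This is only a sketch under some assumptions: the function names (`text_after`, `scan_folder`, `write_report`) and the CSV column names are made up for illustration, "the text after that string" is taken to mean the rest of the line on which the key appears, and only shapes with a text frame are scanned (text inside tables or grouped shapes would need extra handling).

```python
import csv
import glob
import os


def text_after(text, key):
    """Return the text following the first occurrence of key on each line of text."""
    found = []
    for line in text.splitlines():
        _, sep, rest = line.partition(key)
        if sep:  # key occurs on this line
            found.append(rest.strip())
    return found


def scan_folder(folder, keys):
    """Yield (file name, slide number, key, text after key) for each match in folder."""
    # Imported here so the helpers in this sketch can be used without python-pptx.
    from pptx import Presentation

    for path in sorted(glob.glob(os.path.join(folder, "*.pptx"))):
        prs = Presentation(path)
        for slide_number, slide in enumerate(prs.slides, start=1):
            for shape in slide.shapes:
                if not shape.has_text_frame:
                    continue  # e.g. pictures; tables and groups are not covered here
                for key in keys:
                    for match in text_after(shape.text, key):
                        yield os.path.basename(path), slide_number, key, match


def write_report(rows, csv_path):
    """Write the matches to a CSV file with a header row (hypothetical column names)."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "slide", "key", "text_after_key"])
        writer.writerows(rows)
```

Something like `write_report(scan_folder("slides", ["Owner:"]), "report.csv")` would then produce the report, where the folder name, the key and the output file name are placeholders for your own values.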
If you want to extract text:
    from pptx import Presentation
    import glob

    for eachfile in glob.glob("*.pptx"):
        prs = Presentation(eachfile)
        print(eachfile)
        print("----------------------")
        for slide in prs.slides:
            for shape in slide.shapes:
                if hasattr(shape, "text"):
                    print(shape.text)