Extracting text from multiple PowerPoint files using Python

I am trying to find a way to look in a folder and search the contents of all of the PowerPoint documents within that folder for specific strings, preferably using Python. When one of those strings is found, I want to report the text that follows it as well as the document it was found in. I would like to compile the information and report it in a CSV file.

So far I've only come across the olefile package, https://bitbucket.org/decalage/olefileio_pl/wiki/Home. This provides all of the text contained in a specific document, which is not what I am looking to do. Please help.

kacey asked Sep 09 '16 19:09


2 Answers

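python-pptx can be used to do what you propose. At a high level, you would do something like this (a rough sketch of the overall approach, not a complete solution):

import glob

from pptx import Presentation

# Assumes the .pptx files sit in the current working directory
for pptx_filename in glob.glob("*.pptx"):
    prs = Presentation(pptx_filename)
    for slide in prs.slides:
        for shape in slide.shapes:
            # Pictures, tables and the like have no text frame; skip them
            if shape.has_text_frame:
                print(shape.text_frame.text)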

You'd need to add the bits about searching shape text for key strings and adding them to a CSV file or whatever, but this general approach should work just fine. I'll leave it to you to work out the finer points :)
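
For the searching and CSV parts, one possible sketch is below. It uses only the standard library and runs on stand-in text, so the names text_after, write_report, the sample file names and the key strings are all made up for illustration; in practice the text would come from shape.text for each shape. It assumes the "text after the string" means the remainder of the same line.

```python
import csv
import io


def text_after(text, key):
    """Return the text following the first occurrence of key, or None if absent."""
    index = text.find(key)
    if index == -1:
        return None
    return text[index + len(key):].strip()


def write_report(rows, out_file):
    """Write (file, key, text_after_key) rows to an open file-like object as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["file", "key", "text_after_key"])
    writer.writerows(rows)


# Hypothetical stand-in for (filename, shape text) pairs gathered from slides
sample = [
    ("deck1.pptx", "Owner: Alice\nStatus: Done"),
    ("deck2.pptx", "No match here"),
]
keys = ["Owner:", "Status:"]  # hypothetical search strings

rows = []
for filename, text in sample:
    # Search line by line so only the rest of the matching line is reported
    for line in text.splitlines():
        for key in keys:
            found = text_after(line, key)
            if found is not None:
                rows.append((filename, key, found))

# An in-memory buffer keeps the sketch self-contained; a real run would use
# open("report.csv", "w", newline="") instead
buffer = io.StringIO()
write_report(rows, buffer)
print(buffer.getvalue())
```

Swapping the sample list for the loop over files, slides and shapes gives the full pipeline. If matching should ignore case, comparing lowercased copies of the line and the key is one simple option.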

scanny answered Sep 24 '22 20:09


Actually working

If you want to extract text:

  • import Presentation from pptx (install it with pip install python-pptx)
  • loop over each file in the directory (using the glob module)
  • look at every slide, and at every shape on each slide
  • if a shape has a text attribute, print shape.text

from pptx import Presentation
import glob

for eachfile in glob.glob("*.pptx"):
    prs = Presentation(eachfile)
    print(eachfile)
    print("----------------------")
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
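
One caveat: a group shape has no text attribute of its own, so text inside grouped shapes is skipped by the loop above. A possible way to reach it is to recurse through the group's .shapes collection, as sketched here. The helper names iter_shape_text and fake_text_shape are made up, and the stand-in objects only mimic the attributes the helper reads so that the sketch runs without a real presentation.

```python
from types import SimpleNamespace


def iter_shape_text(shapes):
    """Yield the text of each shape, descending into group shapes."""
    for shape in shapes:
        if hasattr(shape, "shapes"):
            # Group shapes expose their members through .shapes
            yield from iter_shape_text(shape.shapes)
        elif getattr(shape, "has_text_frame", False):
            yield shape.text_frame.text


def fake_text_shape(text):
    """Build a stand-in object that looks like a shape holding text."""
    return SimpleNamespace(has_text_frame=True, text_frame=SimpleNamespace(text=text))


# Stand-ins for a text box, a picture, and a group containing a text box
picture = SimpleNamespace(has_text_frame=False)
group = SimpleNamespace(shapes=[fake_text_shape("inside group")], has_text_frame=False)
shapes = [fake_text_shape("title"), picture, group]

texts = list(iter_shape_text(shapes))
print(texts)
```

With python-pptx, the same helper would be called as iter_shape_text(slide.shapes) for each slide.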

PythonProgrammi answered Sep 23 '22 20:09