
How to extract text from a text shape within a group shape in PowerPoint, using python-pptx

My PowerPoint slide has a number of group shapes in which there are child text shapes.

Earlier I was using this code, but it doesn't handle Group shapes.

from pptx import Presentation

text_list = []
for eachfile in files:
    prs = Presentation(eachfile)

    textrun = []
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
                textrun.append(shape.text)
    # one joined string per file
    new_list = " ".join(textrun)
    text_list.append(new_list)

I am trying to extract the text from these child text boxes. I tried to reach the child elements through GroupShape.shapes, but I get an error saying they are of type 'property', so I cannot access the text or iterate over them (TypeError: 'property' object is not iterable).

from pptx.shapes.group import GroupShape
from pptx import Presentation

for eachfile in files:
    prs = Presentation(eachfile)

    textrun = []
    for slide in prs.slides:
        for shape in slide.shapes:
            # fails here: GroupShape is the class, so .shapes is a bare property object
            for text in GroupShape.shapes:
                print(text)

I would then like to catch the text and append to a string for further processing.

So my question is: how do I access the child text elements and extract the text from them?

I have spent a lot of time going through the documentation and source code, but haven't been able to figure it out. Any help would be appreciated.

sjm20066 asked Jan 27 '23 15:01


2 Answers

I think you need something like this:

from pptx.enum.shapes import MSO_SHAPE_TYPE

for slide in prs.slides:
    # ---only operate on group shapes---
    group_shapes = [
        shp for shp in slide.shapes
        if shp.shape_type == MSO_SHAPE_TYPE.GROUP
    ]
    for group_shape in group_shapes:
        for shape in group_shape.shapes:
            if shape.has_text_frame:
                print(shape.text)

A group shape contains other shapes, accessible on its .shapes property. It does not itself have a .text property. So you need to iterate the shapes in the group and get the text from each of those.

Note that this solution only goes one level deep. A recursive approach could be used to walk the tree depth-first and get text from groups containing groups if there were any.

Also note that not all shapes have text, so you must check the .has_text_frame property to avoid raising an exception on, say, a picture shape.
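
The recursive, depth-first walk mentioned above could look roughly like the sketch below. The function name iter_text is made up for illustration, and the group test is an assumption: it treats any shape that exposes a .shapes collection as a group, rather than comparing shape_type against MSO_SHAPE_TYPE.GROUP as in the code above.

```python
def iter_text(shapes):
    """Yield the text of every text-bearing shape, descending into groups."""
    for shape in shapes:
        # Assumption: only group shapes expose a `.shapes` collection.
        if hasattr(shape, "shapes"):
            # depth-first: finish this group before moving to the next sibling
            yield from iter_text(shape.shapes)
        elif getattr(shape, "has_text_frame", False):
            yield shape.text
```

For one slide this would be called as " ".join(iter_text(slide.shapes)), which gives a single string in the order the shapes are stored.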

scanny answered Jan 31 '23 21:01


The earlier answer misses deeper "group in group" cases. A group shape may contain many levels of shapes, including other group shapes, so real-world files often need a recursive search.

The previous answer only looks one level into each group, that is, at the direct children of the top-level group shapes. Those children may themselves be groups that contain further groups, so we need a recursive search strategy. This is best shown by reusing the question's code, keeping the first part:

from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

for eachfile in files:
    prs = Presentation(eachfile)

    textrun = []
    for slide in prs.slides:

then we replace the inner loops ("for shape in slide.shapes:" and "for text in GroupShape.shapes:") with a single call to the recursive function for each slide:

        textrun = checkrecursivelyfortext(slide.shapes, textrun)

and also insert the definition of the recursive function (for example, right after the import statements). To make comparison easier, the function uses the same code as above and only adds the recursive part:

# requires: from pptx.enum.shapes import MSO_SHAPE_TYPE
def checkrecursivelyfortext(shpthissetofshapes, textrun):
    for shape in shpthissetofshapes:
        if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
            # descend into the group and collect text from its children
            textrun = checkrecursivelyfortext(shape.shapes, textrun)
        elif hasattr(shape, "text"):
            print(shape.text)
            textrun.append(shape.text)
    return textrun
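
If you would rather not recurse or print while collecting, the same traversal can be sketched with an explicit stack and then joined into one string per presentation, which is what the question asks for. The names collect_text and presentation_text are hypothetical, and the group test again assumes that only group shapes expose a .shapes collection.

```python
def collect_text(shapes):
    """Return the text of all text-bearing shapes in stored order, without recursion."""
    texts = []
    # The stack holds shapes still to visit; reversed so pop() returns them in order.
    stack = list(reversed(list(shapes)))
    while stack:
        shape = stack.pop()
        if hasattr(shape, "shapes"):  # assumption: only group shapes have `.shapes`
            # push the children so they are visited before the group's later siblings
            stack.extend(reversed(list(shape.shapes)))
        elif getattr(shape, "has_text_frame", False):
            texts.append(shape.text)
    return texts


def presentation_text(prs):
    """Join the text of every slide of a presentation-like object into one string."""
    parts = []
    for slide in prs.slides:
        parts.extend(collect_text(slide.shapes))
    return " ".join(parts)
```

With the question's variables, the per-file list would then be built as text_list = [presentation_text(Presentation(f)) for f in files].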

Mats Bengtsson answered Jan 31 '23 22:01