
[Python] Extracting Text from PPT Using the python-pptx Library

Last Updated on 2024-10-04 by Clay

Introduction

Recently, while handling some work-related matters, I noticed that the client might need a way to extract text from PPT files. I discussed this with the PM and my supervisor, and they said that the client could simply copy the text from the slides manually unless the client explicitly asks us to extract it programmatically.

However, I don't think this task is particularly difficult (last year I worked on a similar task that involved extracting information from PPT files, although that one was about extracting labels of different colors). I also worry that if the client does make such a request later, it might arrive at the same time as other tasks and everything would pile up, which wouldn't be ideal.

So I researched how to accomplish this task and found that a package for it already exists: python-pptx. We only need to install it and run some sample code, and the text from the slides is extracted directly.

The only additional work we may need is some post-processing, depending on the requirements.
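
As a rough illustration of what such post-processing might look like, here is a minimal sketch that tidies a string after it has been extracted. The function name `clean_extracted_text` and the particular cleanup rules (collapsing whitespace, dropping a leading bullet-like dash) are my own assumptions for this example; real requirements will differ.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Hypothetical cleanup step: normalize whitespace and drop leading bullets."""
    # Collapse any run of whitespace (spaces, tabs, newlines) into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove bullet-like characters (hyphen, en dash, bullet) at the very start.
    return text.lstrip("-–• ").strip()


if __name__ == "__main__":
    print(clean_extracted_text("– Text  Summarization\n Sequence to Sequence "))
```

It only uses the standard library, so it can be adjusted or replaced without touching the extraction code.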


Installation

First, we use the pip command to install the python-pptx library. Note that it reads the .pptx format; legacy .ppt files need to be converted to .pptx first.

pip3 install python-pptx

Usage

The usage is as follows, which should be easy to understand for anyone with some programming background.

from pptx import Presentation


def get_text_from_pptx(file_path: str) -> str:
    """Return all run-level text in a .pptx file as one space-separated string."""
    presentation = Presentation(file_path)
    text = []

    for slide in presentation.slides:
        for shape in slide.shapes:
            # Skip shapes without a text frame, such as pictures
            if shape.has_text_frame:
                for paragraph in shape.text_frame.paragraphs:
                    # A paragraph is split into runs of uniformly formatted text
                    for run in paragraph.runs:
                        text.append(run.text)

    return " ".join(text)


def main() -> None:
    text = get_text_from_pptx("./raw_data/clay.pptx")
    print(text)


if __name__ == "__main__":
    main()



Output:

– Text Summarization Long short-term memory, Gated Recurrent Unit, TextCNN Extractive summarization, Abstractive summarization Sequence to Sequence

This PPT file is a sample I put together quickly, and as you can see, its text has been extracted successfully.
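
One limitation worth knowing: the function above only reads shapes that sit directly on a slide and have a text frame, so text inside tables or grouped shapes is skipped. The following is a sketch of one way to cover those cases and keep the text separated per slide. The names `iter_shape_text` and `get_text_by_slide` are hypothetical, and detecting a group by checking for a `shapes` attribute is a simplifying assumption on my part rather than the library's recommended approach.

```python
from typing import Iterator, List


def iter_shape_text(shape) -> Iterator[str]:
    """Yield text from one shape, descending into groups and tables."""
    # Assumption: only group shapes expose a nested .shapes collection.
    if hasattr(shape, "shapes"):
        for child in shape.shapes:
            yield from iter_shape_text(child)
    # Graphic frames that hold a table report has_table = True.
    elif getattr(shape, "has_table", False):
        for row in shape.table.rows:
            for cell in row.cells:
                if cell.text:
                    yield cell.text
    elif getattr(shape, "has_text_frame", False):
        for paragraph in shape.text_frame.paragraphs:
            # Join the runs of one paragraph back into a single line.
            line = "".join(run.text for run in paragraph.runs)
            if line:
                yield line


def get_text_by_slide(file_path: str) -> List[List[str]]:
    """Return one list of text lines per slide."""
    # Imported here so iter_shape_text can be used without python-pptx installed.
    from pptx import Presentation

    presentation = Presentation(file_path)
    return [
        [text for shape in slide.shapes for text in iter_shape_text(shape)]
        for slide in presentation.slides
    ]
```

Calling `get_text_by_slide("./raw_data/clay.pptx")` would give a list with one entry per slide, which is easier to post-process than a single joined string. Unlike the original function, this sketch joins the runs of a paragraph without spaces, so its output will not match the sample output above word for word.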

