Last Updated on 2024-10-04 by Clay
Introduction
Recently, while handling some work-related matters, I noticed that the client might need a way to extract text from PPT files. I discussed this with the PM and my supervisor, and they said the client could simply copy the text from the slides manually, unless the client explicitly asks us to extract it programmatically.
However, I don't think this task is particularly difficult (last year, I worked on a similar task that extracted information from PPT files, though that one involved labels in different colors). I also worry that if the client does eventually make such a request, it could arrive at the same time as other tasks and everything would collide, which wouldn't be ideal.
So, I researched how to accomplish this task and found that there is already a package available. We only need to install it, run some sample code, and the text from the PPT slides will be directly extracted.
The only extra work we may need to do is some post-processing, depending on the requirements.
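As a rough idea of what such post-processing could look like, here is a minimal sketch that only uses the standard library. It assumes the text has already been extracted into a plain string. The helper name clean_extracted_text is hypothetical, and the cleanup rules (collapsing whitespace, removing stray spaces before punctuation) are only one possible requirement, not something the client has asked for.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Hypothetical helper: tidy up a string extracted from slides."""
    # Collapse tabs, newlines, and repeated spaces into a single space.
    text = re.sub(r"\s+", " ", text)
    # Remove the space that joining text pieces can leave before punctuation.
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    return text.strip()


if __name__ == "__main__":
    raw = "Long short-term memory ,  Gated\nRecurrent Unit ."
    print(clean_extracted_text(raw))
```

Real requirements may differ (for example, keeping line breaks between bullet points), so the rules would need to be adjusted case by case.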
Installation
First, we use the pip command to install the python-pptx library.
pip3 install python-pptx
Usage
The usage is as follows, which should be easy to understand for anyone with some programming background.
from pptx import Presentation


def get_text_from_pptx(file_path: str) -> str:
    presentation = Presentation(file_path)
    text = []

    for slide in presentation.slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                for paragraph in shape.text_frame.paragraphs:
                    for run in paragraph.runs:
                        text.append(run.text)

    return " ".join(text)


def main() -> None:
    text = get_text_from_pptx("./raw_data/clay.pptx")
    print(text)


if __name__ == "__main__":
    main()
Output:
– Text Summarization Long short-term memory, Gated Recurrent Unit, TextCNN Extractive summarization, Abstractive summarization Sequence to Sequence
This PPT file is a sample I created casually, and as you can see, the textual information has been successfully extracted.