[Python][PDF] How to Extract Text from a PDF with Python

2023.03.29 / 2023.03.28

PyPDF2

Welcome to PyPDF2 — PyPDF2 documentation

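This uses PyPDF2. Version 3.0.0 or later is recommended, since the older PdfFileReader-style names were removed in 3.0.0. Install it the usual way.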

pip install PyPDF2

Honestly, for this kind of thing it's getting faster to just ask ChatGPT.

Here is the whole code in one go:

import re

import PyPDF2

# Open the PDF file in read-binary mode
with open('yourfile.pdf', 'rb') as pdf_file:

    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Get the total number of pages in the PDF document
    num_pages = len(pdf_reader.pages)

    # Initialize an empty string variable to store the text
    text = ""

    # Loop through each page and extract the text
    for page_num in range(num_pages):
        page = pdf_reader.pages[page_num]
        page_text = page.extract_text()
        text += page_text

# Replace each line break that is not preceded by whitespace with a space
text = re.sub(r'(?<!\s)\n', ' ', text)
# Print the extracted text
print(text)
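
The re.sub step is the easiest part to misread, so here is a small sketch of what it does on a plain string, with no PDF needed. The helper name join_wrapped_lines and the sample text are made up for illustration. The pattern is the same one used above.

```python
import re


def join_wrapped_lines(text: str) -> str:
    """Replace every newline that is not preceded by whitespace with a space.

    The lookbehind (?<!\\s) checks the character just before the newline,
    so a newline that follows a space or another newline is left alone.
    """
    return re.sub(r'(?<!\s)\n', ' ', text)


# Hypothetical sample resembling text extracted from a PDF
sample = "PDF text is often\nwrapped in the middle. \nThis break is kept."
print(join_wrapped_lines(sample))
```

Only the first line break is joined. The second one follows a space, so it stays. Likewise, in a blank-line paragraph break ("\n\n") only the first newline is replaced, so paragraphs still end up on separate lines.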