Effortlessly Extract Text from PDFs with PyPDF

How to Work With a PDF in Python

The Portable Document Format, or PDF, is a file format that is widely used for presenting and exchanging documents. Working with PDF files in Python can be accomplished using the PyPDF2 package. In this article, we will explore different operations that can be performed on PDFs using PyPDF2.

By the end of this article, you will learn how to:

Extract document information from a PDF in Python
Rotate pages in a PDF
Merge multiple PDFs into a single document
Split a PDF into multiple documents
Add watermarks to a PDF
Encrypt a PDF file

History of PyPDF2

The original pyPdf package was released in 2005, but it stopped being actively maintained after 2010. To keep the project alive, a company called Phasit sponsored a fork called PyPDF2 in 2011. PyPDF2 aimed to be backwards compatible with pyPdf and introduced new features. However, the development of PyPDF2 also ceased in 2016.

A short series of releases followed under the name PyPDF3, and that project was later renamed PyPDF4. All of these packages do much the same job as PyPDF2. The main difference from the original pyPdf is that PyPDF2 and its successors support Python 3. A separate Python 3 fork of pyPdf also exists, but it has not been maintained in recent years.

Although PyPDF2 was abandoned in 2016, it was revived in 2022, and development has since continued under the name pypdf. An alternative package, pdfrw, created by Patrick Maupin, offers similar functionality. pdfrw also integrates with the ReportLab package, which lets you build a new PDF from an existing one.

Installation

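To install PyPDF2 using pip, run the command below. The examples in this article use the legacy class names (PdfFileReader, PdfFileWriter, PdfFileMerger), which were deprecated in PyPDF2 2.x and removed in 3.0, so the command pins a release below 3.0:

$ pip install "pypdf2<3"

If you are using Anaconda, you can install PyPDF2 with conda instead. PyPDF2 is a pure-Python package, so the installation is straightforward either way.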
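Because the class names used in this article depend on the installed release, it can help to check the version before running the examples. The following is a minimal sketch using only the standard library (Python 3.8 or later); supports_legacy_names is a hypothetical helper, not part of PyPDF2.

```python
from importlib.metadata import PackageNotFoundError, version


def supports_legacy_names(version_string):
    """Return True if a PyPDF2 version string is below 3.0.

    Legacy names such as PdfFileReader are not usable from 3.0 onward.
    """
    major = int(version_string.split(".")[0])
    return major < 3


try:
    installed = version("PyPDF2")
except PackageNotFoundError:
    installed = None  # PyPDF2 is not installed in this environment

if installed is None:
    print("PyPDF2 is not installed")
elif supports_legacy_names(installed):
    print(f"PyPDF2 {installed}: the legacy class names are available")
else:
    print(f"PyPDF2 {installed}: use a release below 3.0 for these examples")
```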

Extracting Document Information from a PDF

To extract document information from a PDF file, follow these steps:

Import the necessary classes from the PyPDF2 package.
Open the PDF file using the PdfFileReader class.
Access the document information using the getDocumentInfo method.
Retrieve specific information such as the title, author, subject, and creation date.

from PyPDF2 import PdfFileReader

def extract_document_info(pdf_file):
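    pdf = PdfFileReader(pdf_file)
    info = pdf.getDocumentInfo()  # None when the PDF has no /Info dictionary
    if info is None:
        return None, None, None, None

    title = info.title
    author = info.author
    subject = info.subject
    # The legacy API has no creation-date attribute, so read the raw
    # "/CreationDate" entry (a string such as "D:20230115103000Z").
    creation_date = info.get("/CreationDate")

    return title, author, subject, creation_date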

# Usage
title, author, subject, creation_date = extract_document_info("sample.pdf")
print(f"Title: {title}")
print(f"Author: {author}")
print(f"Subject: {subject}")
print(f"Creation Date: {creation_date}")
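PDF files store the creation date as a raw string of the form D:YYYYMMDDHHmmSS followed by an optional time zone offset, where every field after the year may be missing. If you need a real datetime, something like the following sketch could convert it. The helper name parse_pdf_date is hypothetical and not part of PyPDF2, and the parser assumes the string follows the standard PDF date layout.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS plus an optional zone: Z, +HH'mm' or -HH'mm'.
_PDF_DATE = re.compile(
    r"^(?:D:)?(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<tz>[Zz]|[+-]\d{2}'?(?:\d{2}'?)?)?$"
)


def parse_pdf_date(raw):
    """Return a datetime for a raw PDF date string, or None if it cannot be parsed."""
    if not raw:
        return None
    match = _PDF_DATE.match(str(raw).strip())
    if match is None:
        return None
    parts = match.groupdict()
    try:
        tzinfo = None
        tz = parts["tz"]
        if tz:
            if tz.upper() == "Z":
                tzinfo = timezone.utc
            else:
                digits = re.sub(r"\D", "", tz)  # "+01'00'" -> "0100"
                offset = timedelta(hours=int(digits[:2]),
                                   minutes=int(digits[2:4] or 0))
                tzinfo = timezone(-offset if tz[0] == "-" else offset)
        # Missing fields fall back to the first month/day and to midnight.
        return datetime(
            int(parts["year"]),
            int(parts["month"] or 1),
            int(parts["day"] or 1),
            int(parts["hour"] or 0),
            int(parts["minute"] or 0),
            int(parts["second"] or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        return None  # out-of-range field, e.g. month 13


print(parse_pdf_date("D:20230115103000+01'00'"))
```

The function returns None rather than raising, so it can be applied directly to whatever the PDF metadata happens to contain.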

Rotating Pages in a PDF

To rotate pages in a PDF file, call the rotateClockwise or rotateCounterClockwise method on each page object returned by getPage, then add the rotated page to a PdfFileWriter. The angle must be a multiple of 90 degrees. Here is an example:

from PyPDF2 import PdfFileReader, PdfFileWriter

def rotate_pages(pdf_file, output_file, degrees=90):
    pdf = PdfFileReader(pdf_file)
    writer = PdfFileWriter()

    for page in range(pdf.getNumPages()):
        rotated_page = pdf.getPage(page).rotateClockwise(degrees)
        writer.addPage(rotated_page)

    with open(output_file, "wb") as file:
        writer.write(file)

# Usage
rotate_pages("sample.pdf", "rotated.pdf", degrees=180)
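PyPDF2 only accepts rotation angles that are multiples of 90, so it can be convenient to validate and normalize the degrees argument before looping over the pages. A minimal sketch follows; normalize_rotation is a hypothetical helper, not part of PyPDF2.

```python
def normalize_rotation(degrees):
    """Map any multiple of 90 onto the clockwise angles 0, 90, 180, or 270."""
    if degrees % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {degrees}")
    # Python's % always returns a non-negative result for a positive modulus,
    # so a counter-clockwise angle such as -90 becomes 270 clockwise.
    return degrees % 360


print(normalize_rotation(-90))
print(normalize_rotation(450))
```

The normalized value could then be passed as the degrees argument of rotate_pages.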

Merging PDFs

To merge multiple PDF files into a single document, you can use the PdfFileMerger class from the PyPDF2 package. Here’s how you can do it:

from PyPDF2 import PdfFileMerger

def merge_pdfs(pdf_files, output_file):
    merger = PdfFileMerger()

    for file in pdf_files:
        merger.append(file)

    merger.write(output_file)
    merger.close()

# Usage
pdf_files = ["file1.pdf", "file2.pdf", "file3.pdf"]
merge_pdfs(pdf_files, "merged.pdf")
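The order of pdf_files determines the page order of the merged document. When the list comes from a directory listing rather than being typed by hand, plain string sorting places file10.pdf before file2.pdf. One way to avoid that is a natural sort key, sketched below; natural_key is a hypothetical helper and the file names are only illustrative.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so 'file10' sorts after 'file2'."""
    return [int(chunk) if chunk.isdecimal() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]


unsorted_files = ["file10.pdf", "file2.pdf", "file1.pdf"]
pdf_files = sorted(unsorted_files, key=natural_key)
print(pdf_files)
```

The sorted list can be passed to merge_pdfs unchanged.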

Splitting PDFs

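PyPDF2 has no dedicated split method. To split a PDF into multiple documents, read it with PdfFileReader and add each page to its own PdfFileWriter. The example below writes one page per output file, so output_files should not list more files than the PDF has pages: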

from PyPDF2 import PdfFileReader, PdfFileWriter

def split_pdf(pdf_file, output_files):
    pdf = PdfFileReader(pdf_file)

    for i, output_file in enumerate(output_files):
        writer = PdfFileWriter()
        writer.addPage(pdf.getPage(i))

        with open(output_file, "wb") as file:
            writer.write(file)

# Usage
output_files = ["page1.pdf", "page2.pdf", "page3.pdf"]
split_pdf("sample.pdf", output_files)
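The split_pdf function above writes exactly one page per output file. To split a document into larger chunks instead, you would first work out which page indexes belong to each output file. The following sketch shows that calculation only; page_ranges is a hypothetical helper, not part of PyPDF2.

```python
def page_ranges(num_pages, pages_per_file):
    """Return (start, stop) page-index pairs covering num_pages; stop is exclusive."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [(start, min(start + pages_per_file, num_pages))
            for start in range(0, num_pages, pages_per_file)]


# A 10-page document split into files of at most 4 pages each.
for start, stop in page_ranges(10, 4):
    print(f"pages {start}..{stop - 1}")
```

Each pair can drive a loop such as for i in range(start, stop), adding pdf.getPage(i) to a fresh PdfFileWriter for that chunk.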

Adding Watermarks to a PDF

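Adding watermarks to a PDF can be achieved by overlaying a watermark page on each page with the mergePage method of the page object. The example below draws the watermark text on an in-memory page with ReportLab, which must be installed separately, and then merges that page onto every page of the input:

from io import BytesIO

from PyPDF2 import PdfFileReader, PdfFileWriter
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def make_watermark(watermark_text):
    # Draw the text in the middle of a single letter-size page held in memory.
    buffer = BytesIO()
    c = canvas.Canvas(buffer, pagesize=letter)
    c.setFont("Helvetica", 40)
    c.setFillGray(0.75)  # light gray so the page content stays readable
    c.drawCentredString(letter[0] / 2, letter[1] / 2, watermark_text)
    c.save()
    buffer.seek(0)
    return PdfFileReader(buffer).getPage(0)

def add_watermark(pdf_file, watermark_text, output_file):
    pdf = PdfFileReader(pdf_file)
    watermark = make_watermark(watermark_text)

    writer = PdfFileWriter()

    for page in range(pdf.getNumPages()):
        pdf_page = pdf.getPage(page)
        pdf_page.mergePage(watermark)  # overlay the watermark on top of the page
        writer.addPage(pdf_page)

    with open(output_file, "wb") as file:
        writer.write(file)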

# Usage
add_watermark("sample.pdf", "Confidential", "watermarked.pdf")

Encrypting a PDF

To encrypt a PDF file, you can use the encrypt method of the PdfFileWriter class. Here’s how:

from PyPDF2 import PdfFileReader, PdfFileWriter

def encrypt_pdf(input_file, output_file, password):
    pdf = PdfFileReader(input_file)
    writer = PdfFileWriter()

    for page in range(pdf.getNumPages()):
        writer.addPage(pdf.getPage(page))

    writer.encrypt(password)

    with open(output_file, "wb") as file:
        writer.write(file)

# Usage
encrypt_pdf("file.pdf", "encrypted.pdf", "password")
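The usage above passes the literal string "password" for illustration. In practice you would want a password that is hard to guess. One possible approach, sketched here with the standard library's secrets module, is to generate a random one; generate_password and its default length are assumptions made for this example.

```python
import secrets
import string


def generate_password(length=20):
    """Return a random password of letters and digits from a cryptographic source."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


password = generate_password()
print(len(password))
```

The generated value can be passed as the password argument of encrypt_pdf, and it needs to be stored somewhere safe, because the encrypted file cannot be opened without it.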

In conclusion, working with PDF files in Python is made easy with the PyPDF2 package. You can extract document information, rotate pages, merge or split PDFs, add watermarks, and encrypt PDF files using the provided methods. Additionally, you can explore alternative packages such as pdfrw, which offers similar functionality.