Effortlessly Extract Text from PDFs with PyPDF

How to Work With a PDF in Python

The Portable Document Format, or PDF, is a file format that is widely used for presenting and exchanging documents. Working with PDF files in Python can be accomplished using the PyPDF2 package. In this article, we will explore different operations that can be performed on PDFs using PyPDF2.

By the end of this article, you will learn how to:

  • Extract document information from a PDF in Python
  • Rotate pages in a PDF
  • Merge multiple PDFs into a single document
  • Split a PDF into multiple documents
  • Add watermarks to a PDF
  • Encrypt a PDF file

History of PyPDF2

The original pyPdf package was released in 2005, but it stopped being actively maintained after 2010. To keep the project alive, a company called Phasit sponsored a fork called PyPDF2 in 2011. PyPDF2 aimed to be backwards compatible with pyPdf and introduced new features. However, the development of PyPDF2 also ceased in 2016.

A brief series of releases followed under the name PyPDF3, and that project was later renamed PyPDF4. These forks serve much the same purpose as PyPDF2. The main difference between the original pyPdf and its successors (PyPDF2 and later) is that the successors support Python 3. A separate Python 3 fork of the original pyPdf also exists, but it has not been maintained for years.

Although development of PyPDF2 stalled in 2016, the project was revived in 2022, and its maintainers have since continued the work under the package name pypdf. An alternative package, pdfrw, created by Patrick Maupin, offers similar functionality. pdfrw integrates with the ReportLab package and lets you build a new PDF from an existing one.

Installation

To install PyPDF2 using pip, run the following command:

$ pip install pypdf2

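If you are using Anaconda, you can install PyPDF2 with conda instead. PyPDF2 is a pure-Python package with few, if any, extra dependencies, so installation is quick either way. Note that PyPDF2 3.0 removed the older camelCase names such as PdfFileReader and PdfFileWriter in favor of names such as PdfReader and PdfWriter, so the names you can import depend on the installed version.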
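
Because class and method names differ between major PyPDF2 versions, it can help to confirm which version is installed before running any examples. The snippet below is a minimal sketch that uses only the standard library; `installed_version` is a hypothetical helper name, not part of PyPDF2.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the installed PyPDF2 version, if any.
version = installed_version("PyPDF2")
if version is None:
    print("PyPDF2 is not installed")
else:
    major = int(version.split(".")[0])
    print(f"PyPDF2 {version} (major version {major})")
```

Looking the version up through package metadata avoids importing the package itself, so the check works even when the import would fail.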

Extracting Document Information from a PDF

To extract document information from a PDF file, follow these steps:

  1. Import the PdfReader class from the PyPDF2 package.
  2. Open the PDF file by passing its path to PdfReader.
  3. Access the document information through the metadata attribute.
  4. Retrieve specific fields such as the title, author, subject, and creation date.

from PyPDF2 import PdfReader

def extract_document_info(pdf_file):
    reader = PdfReader(pdf_file)
    # metadata is None when the PDF has no document information dictionary.
    info = reader.metadata
    if info is None:
        return None, None, None, None
    title = info.title
    author = info.author
    subject = info.subject
    # Raw PDF date string (for example "D:20230115103000Z"), or None if absent.
    creation_date = info.get("/CreationDate")
    return title, author, subject, creation_date

# Usage
title, author, subject, creation_date = extract_document_info("sample.pdf")
print(f"Title: {title}")
print(f"Author: {author}")
print(f"Subject: {subject}")
print(f"Creation Date: {creation_date}")
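
The creation date in a PDF's document information is stored as a text string of the form D:YYYYMMDDHHmmSS followed by an optional UTC offset, with everything after the year being optional. If you want a real datetime object, a small parser along the following lines may help. This is a sketch under those assumptions: `parse_pdf_date` is a hypothetical helper, not a PyPDF2 function, and it returns None for strings it does not recognize.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS, then optionally "Z" or an offset such as +02'00'
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)(?:\d{2}'?\d{2}'?)?|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(raw):
    """Convert a PDF date string to a datetime, or return None if it is malformed."""
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, utc, sign, tz_h, tz_m = match.groups()
    tzinfo = None  # stays naive when the string carries no offset
    if utc:
        tzinfo = timezone.utc
    elif sign:
        offset = timedelta(hours=int(tz_h), minutes=int(tz_m or 0))
        tzinfo = timezone(-offset if sign == "-" else offset)
    try:
        # Missing components fall back to the first month/day and to midnight.
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        return None  # for example month 13 or day 32


print(parse_pdf_date("D:20230115103000+02'00'"))
```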

Rotating Pages in a PDF

To rotate pages in a PDF file, call the rotate method on each page object (not on the writer). It rotates the page clockwise and accepts only multiples of 90 degrees; pass a negative angle to rotate counterclockwise. Here is an example:

from PyPDF2 import PdfReader, PdfWriter

def rotate_pages(pdf_file, output_file, degrees=90):
    reader = PdfReader(pdf_file)
    writer = PdfWriter()
    for page in reader.pages:
        # rotate() turns the page clockwise; degrees must be a multiple of 90.
        page.rotate(degrees)
        writer.add_page(page)
    with open(output_file, "wb") as file:
        writer.write(file)

# Usage
rotate_pages("sample.pdf", "rotated.pdf", degrees=180)
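
PDF page rotation is stored in whole quarter turns, and PyPDF2 rejects angles that are not multiples of 90. If the angle comes from user input, you may want to validate it before touching the file. The helper below is a minimal sketch; `normalize_rotation` is a hypothetical name, not part of PyPDF2.

```python
def normalize_rotation(degrees):
    """Return the equivalent clockwise rotation as one of 0, 90, 180, or 270."""
    if degrees % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {degrees}")
    # Python's % always returns a non-negative result for a positive modulus,
    # so a counterclockwise angle such as -90 becomes 270.
    return degrees % 360


print(normalize_rotation(450))  # 90
print(normalize_rotation(-90))  # 270
```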

Merging PDFs

To merge multiple PDF files into a single document, you can use the PdfMerger class from the PyPDF2 package. Here’s how you can do it:

from PyPDF2 import PdfMerger

def merge_pdfs(pdf_files, output_file):
    merger = PdfMerger()
    # Files are appended in list order, so the list determines the page order.
    for pdf_file in pdf_files:
        merger.append(pdf_file)
    merger.write(output_file)
    merger.close()

# Usage
pdf_files = ["file1.pdf", "file2.pdf", "file3.pdf"]
merge_pdfs(pdf_files, "merged.pdf")
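
The merged document follows the order of the input list, so the way you sort the filenames matters. A plain alphabetical sort puts file10.pdf before file2.pdf. One possible fix is a "natural" sort key, sketched below with the standard library only; `natural_key` is a hypothetical helper name.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so 'file2' sorts before 'file10'."""
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", name)
    ]


pdf_files = ["file10.pdf", "file2.pdf", "file1.pdf"]
ordered = sorted(pdf_files, key=natural_key)
print(ordered)  # ['file1.pdf', 'file2.pdf', 'file10.pdf']
```

You could then pass the sorted list to a merge function such as the one above.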

Splitting PDFs

To split a PDF into multiple documents, copy each page into its own PdfWriter and write it to a separate file. Here’s an example that writes one page per output file:

from PyPDF2 import PdfReader, PdfWriter

def split_pdf(pdf_file, output_files):
    reader = PdfReader(pdf_file)
    # zip() stops at the shorter sequence, so surplus output names are ignored.
    for page, output_file in zip(reader.pages, output_files):
        writer = PdfWriter()
        writer.add_page(page)
        with open(output_file, "wb") as file:
            writer.write(file)

# Usage
output_files = ["page1.pdf", "page2.pdf", "page3.pdf"]
split_pdf("sample.pdf", output_files)
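
The example above writes one page per file. If you would rather split a document into chunks of several pages, you first need the page index range for each chunk. The sketch below computes those ranges with plain Python; `chunk_ranges` is a hypothetical helper, and each (start, stop) pair is meant to be read as "pages start up to, but not including, stop".

```python
def chunk_ranges(num_pages, pages_per_file):
    """Return (start, stop) index pairs covering num_pages in fixed-size chunks."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [
        (start, min(start + pages_per_file, num_pages))
        for start in range(0, num_pages, pages_per_file)
    ]


print(chunk_ranges(7, 3))  # [(0, 3), (3, 6), (6, 7)]
```

For each pair, you would add the pages in that range to one writer and save it as a separate file.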

Adding Watermarks to a PDF

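Adding a watermark to a PDF can be achieved by overlaying a one-page watermark PDF on each page with the merge_page method of the page object. The example below draws the watermark text with ReportLab, which must be installed separately:

import io

from PyPDF2 import PdfReader, PdfWriter
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def add_watermark(pdf_file, watermark_text, output_file):
    # Draw the watermark text onto a one-page, in-memory PDF with ReportLab.
    # The font, size, and position here are arbitrary choices.
    buffer = io.BytesIO()
    overlay = canvas.Canvas(buffer, pagesize=letter)
    overlay.setFont("Helvetica", 40)
    overlay.drawString(100, 400, watermark_text)
    overlay.save()
    buffer.seek(0)
    watermark = PdfReader(buffer).pages[0]

    reader = PdfReader(pdf_file)
    writer = PdfWriter()
    for page in reader.pages:
        # Overlay the watermark on top of the existing page content.
        page.merge_page(watermark)
        writer.add_page(page)
    with open(output_file, "wb") as file:
        writer.write(file)

# Usage
add_watermark("sample.pdf", "Confidential", "watermarked.pdf")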

Encrypting a PDF

To encrypt a PDF file, you can use the encrypt method of the PdfWriter class. Here’s how:

from PyPDF2 import PdfReader, PdfWriter

def encrypt_pdf(input_file, output_file, password):
    reader = PdfReader(input_file)
    writer = PdfWriter()
    # Copy every page into the writer before applying encryption.
    for page in reader.pages:
        writer.add_page(page)
    # The password will be required to open the output file.
    writer.encrypt(password)
    with open(output_file, "wb") as file:
        writer.write(file)

# Usage
encrypt_pdf("file.pdf", "encrypted.pdf", "password")
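
The usage example above passes the literal string "password", which is fine for a demo but not for a real document. One option is to generate a random password with the standard library's secrets module, as in the sketch below. `generate_password` is a hypothetical helper, and the default length and alphabet are assumptions you may want to adjust.

```python
import secrets
import string


def generate_password(length=20):
    """Return a random password made of ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    # secrets.choice draws from a cryptographically strong random source.
    return "".join(secrets.choice(alphabet) for _ in range(length))


password = generate_password()
print(len(password))  # 20
```

Remember to store the generated password somewhere safe, because the encrypted file cannot be opened without it.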

In conclusion, working with PDF files in Python is made easy with the PyPDF2 package. You can extract document information, rotate pages, merge or split PDFs, add watermarks, and encrypt PDF files using the provided methods. Additionally, you can explore alternative packages such as pdfrw, which offers similar functionality.