Effortlessly Extract Text from PDFs with PyPDF
How to Work With a PDF in Python
The Portable Document Format, or PDF, is a file format that is widely used for presenting and exchanging documents. Working with PDF files in Python can be accomplished using the PyPDF2 package. In this article, we will explore different operations that can be performed on PDFs using PyPDF2.
By the end of this article, you will learn how to:
- Extract document information from a PDF in Python
- Rotate pages in a PDF
- Merge multiple PDFs into a single document
- Split a PDF into multiple documents
- Add watermarks to a PDF
- Encrypt a PDF file
History of PyPDF2
The original pyPdf package was released in 2005, but it stopped being actively maintained after 2010. To keep the project alive, a company called Phasit sponsored a fork called PyPDF2 in 2011. PyPDF2 aimed to be backwards compatible with pyPdf and introduced new features. However, the development of PyPDF2 also ceased in 2016.
A short series of releases then appeared under the name PyPDF3, after which that project was renamed PyPDF4. All of these packages serve much the same purpose. The main difference between the original pyPdf and its successors is that the later packages added Python 3 support. A separate Python 3 fork of the original pyPdf was also created, but it has not been maintained in recent years.
Although PyPDF2 was abandoned in 2016, it was revived in 2022 and is now actively maintained. There is also an alternative package called pdfrw, created by Patrick Maupin, which offers similar functionality to PyPDF2. pdfrw integrates with the ReportLab package, so you can build a new PDF in ReportLab from some or all of an existing one.
Installation
To install PyPDF2 using pip, run pip install PyPDF2 in your terminal.
If you are using Anaconda, you can install PyPDF2 using conda instead. Either way, PyPDF2 is a pure-Python package, so the installation is straightforward.
Extracting Document Information from a PDF
To extract document information from a PDF file, follow these steps:
- Import the necessary classes from the PyPDF2 package.
- Open the PDF file using the PdfFileReader class (spelled PdfReader in newer PyPDF2 releases).
- Access the document information using the documentInfo attribute (spelled metadata in newer releases).
- Retrieve specific information such as the title, author, subject, and creation date.
Rotating Pages in a PDF
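To rotate pages in a PDF file, call the rotateClockwise or rotateCounterClockwise method on each page object that you read with PdfFileReader. Both methods take an angle that must be a multiple of 90 degrees. Then add the rotated pages to a PdfFileWriter and write the result to a new file.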
Merging PDFs
To merge multiple PDF files into a single document, you can use the PdfFileMerger class from the PyPDF2 package. Append each input file in the order you want, then write the merged result to a new file.
Splitting PDFs
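PyPDF2 has no dedicated method for splitting a PDF into multiple documents. Instead, read the source file with the PdfFileReader class, copy the pages you want into a new PdfFileWriter with addPage, and write each writer out as its own file.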
Adding Watermarks to a PDF
Adding a watermark to a PDF means overlaying a watermark page, which holds an image or text, on each page of the document. To do this, call the mergePage method on each page object, passing in the watermark page, and then add the merged page to a PdfFileWriter.
Encrypting a PDF
To encrypt a PDF file, copy its pages into a PdfFileWriter, call the encrypt method of the PdfFileWriter class with a password, and then write the output file.
In conclusion, working with PDF files in Python is made easy with the PyPDF2 package. You can extract document information, rotate pages, merge or split PDFs, add watermarks, and encrypt PDF files using the provided methods. Additionally, you can explore alternative packages such as pdfrw, which offers similar functionality.