Extracting images from PDF files

Happy new year!

Ever needed to extract images from PDFs and found both online and offline tools lacking? I certainly have, so here is my Python code to extract images from PDFs as JPGs or PNGs, using PyMuPDF.

PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

There are many Python PDF libraries out there, but I found PyMuPDF easy to use and well documented, with rich built-in image manipulation capabilities (Pixmaps) that do not require external dependencies like Pillow / PIL or anything else.

The other PDF libraries I tried either extract images only in their default color space and cannot convert CMYK properly, do not handle masks, or require understanding and iterating over every single resource object. PyMuPDF, by contrast, is very simple to use: I just had to glance at the docs and a sample to figure it out!

Python Code

I am using the default Python 3.9.6 runtime that comes with the latest version of macOS. No error handling is done, and I have only tested this on macOS on a small number of PDFs.

My usual disclaimers apply! Don’t run anything you don’t understand! Again... no error handling!

In a new directory, set up a virtual environment and install PyMuPDF, which at the time of writing is v1.23.8:

python3 -m venv v
source v/bin/activate
pip install PyMuPDF

Without further ado, here is the code:

# getImages.py v0.2 8 Jan 2024 (c) C.Y. Wong, myByways.com

import os
import sys

import fitz  # PyMuPDF

if len(sys.argv) == 1 or not os.path.isfile(sys.argv[1]):
    print('Export images from PDFs as JPG or, for images with mask, as PNG')
    print(f' Usage: {sys.argv[0]} filename.pdf [page_number...]')
    sys.exit(1)

file = sys.argv[1]
folder = os.path.splitext(os.path.basename(file))[0]
os.makedirs(folder, exist_ok=True)

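# Optional page arguments are 1-based; convert them to 0-based page numbers.
# An empty list means "export from every page".
try:
    pgnums = [int(arg) - 1 for arg in sys.argv[2:]]
except ValueError:
    print(f'{sys.argv[0]}: Invalid page numbers', file=sys.stderr)
    sys.exit(2)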

exported = []
doc = fitz.open(file)
print(f'PDF "{file}" has {doc.page_count} page{"" if doc.page_count == 1 else "s"}:')

for page in doc.pages():
    if pgnums and page.number not in pgnums:
        continue

    refs = page.get_images()
    print(f' Page {page.number + 1} has {len(refs)} image{"" if len(refs) == 1 else "s"}:')

    for i, ref in enumerate(refs):
        # Each entry starts with (xref, smask, width, height, ...)
        xref, smask, w, h = ref[:4]
        if xref in exported:
            print(f'  Image {i + 1} ({w}x{h}) {"+ mask " if smask else ""}-> duplicate')
            continue

        print(f'  Image {i + 1} ({w}x{h}) {"+ mask " if smask else ""}-> ', end = '')
        # Name the file after the 1-based page number and the image's xref
        output = f'{page.number + 1:03}-{xref:05}'

        if smask:
            # Load the mask and, if needed, scale it to the image size
            mask = fitz.Pixmap(doc, smask)
            if (mask.width != w) or (mask.height != h):
                mask = fitz.Pixmap(mask, w, h, None)
            # Merge the mask in as an alpha channel, then convert to RGB for PNG
            image = fitz.Pixmap(doc, xref)
            image = fitz.Pixmap(image, mask)
            image = fitz.Pixmap(fitz.csRGB, image)
            output = f'{os.path.join(folder, output)}.png'
        else:
            # No mask: save the decoded image as a JPG
            image = fitz.Pixmap(doc, xref)
            output = f'{os.path.join(folder, output)}.jpg'

        image.save(output)
        print(f'"{output}"')
        exported.append(xref)

print(f'All done: Exported {len(exported)} image{"" if len(exported) == 1 else "s"}!')

Usage:

python getImages.py filename.pdf [page_number...]
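The optional page_number arguments are 1-based, while PyMuPDF numbers pages from 0, so the script subtracts one from each. A minimal sketch of that conversion follows; parse_page_args is a hypothetical helper, and the range check against the page count is my own addition that the script above does not perform.

```python
def parse_page_args(args, page_count):
    """Turn 1-based page-number strings into sorted, unique 0-based indices.

    Raises ValueError if an argument is not an integer or is out of range.
    """
    indices = set()
    for arg in args:
        number = int(arg)  # raises ValueError for non-numeric input
        if not 1 <= number <= page_count:
            raise ValueError(f"page {number} is outside 1..{page_count}")
        indices.add(number - 1)
    return sorted(indices)


# Pages 5, 10 and 11 of a 20-page document
print(parse_page_args(["5", "10", "11"], page_count=20))  # → [4, 9, 10]
```

Rejecting out-of-range numbers up front would avoid a run that silently exports nothing because the requested page does not exist.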

Usage notes:

Only images are supported: typically JPGs and JPGs with masks, in layman's terms. Paths and other objects are not exported.
For images with a mask, the code merges the mask and converts the result to an RGB-colorspace PNG.
To export images from all pages, provide the PDF filename as the only input argument, e.g. python getImages.py my_document.pdf.
To export images only from specific pages, add one or more page numbers to the argument list, e.g. python getImages.py my_document.pdf 5 10 11.
Once run, the code goes page by page and saves (exports) each image it encounters into a folder in the current directory, named after the document (without its extension).
The filenames consist of a three-digit page number (to make cross-referencing easier), followed by a five (or more) digit PDF xref number (in case you ever need it).
Be warned! Any existing files with the same filename will be overwritten.
If the same image is used on multiple pages (i.e. same xref), it will only be exported once.
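The naming and de-duplication rules above can be sketched without touching a PDF. In this minimal example, output_name and plan_exports are hypothetical helpers, and the sample tuples only mimic the first four fields (xref, smask, width, height) that the script reads from page.get_images(). It plans filenames rather than writing any files.

```python
import os


def output_name(folder, page_index, xref, has_mask):
    """Build '<folder>/<page>-<xref>.<ext>' using a 1-based, 3-digit page number."""
    ext = "png" if has_mask else "jpg"
    return os.path.join(folder, f"{page_index + 1:03}-{xref:05}.{ext}")


def plan_exports(folder, pages):
    """Return output paths for each distinct xref, in order of first appearance.

    pages is an iterable of (page_index, refs); each ref starts with
    (xref, smask, ...), where smask is 0 when the image has no mask.
    """
    seen = set()
    plan = []
    for page_index, refs in pages:
        for ref in refs:
            xref, smask = ref[0], ref[1]
            if xref in seen:
                continue  # same image reused on a later page
            seen.add(xref)
            plan.append(output_name(folder, page_index, xref, bool(smask)))
    return plan


# Made-up sample data: xref 12 appears on pages 1 and 5, xref 15 has a mask
sample_pages = [
    (0, [(12, 0, 100, 50), (15, 16, 20, 20)]),
    (4, [(12, 0, 100, 50)]),
]
for path in plan_exports("my_document", sample_pages):
    print(path)
```

A set is used for the seen xrefs because membership tests stay fast even for documents with many images.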

Simple code that works well for my PDFs!

I’ve not managed to find any other code or tool that works as well as this. Or rather, Google has not managed to reveal any other code or tool that works as well as this! :)

Update 8 Jan 24: Added support to extract images from selected pages only.

Update 19 Mar 24: May have left a bug in if no page numbers are specified... either that or I just introduced a new bug.
