I recently had a bunch of screenshots that I wanted to OCR so that I could search their text content in the future. I found an open source solution called OCRmyPDF (created by jbarlow83 over on GitHub) to be very simple to use. It’s also very well documented with many usage examples, including my preferred option: running it from a Docker container.

OCR Background

Optical Character Recognition (OCR) converts an image containing text into searchable text. OCRmyPDF uses the open source OCR engine called Tesseract, originally created by HP and currently maintained by Google. OCRmyPDF is at v9.0.3 at the time of writing. The screenshot below is from the GitHub project page (I like the logo):

As usual, proceed with caution! Don’t blindly download software or run code snippets from the Internet without thoroughly reviewing them. That goes for anything on this site too!

In my case, my content was already in PDF. However, OCRmyPDF can take images (JPEG and PNG) and convert them to PDF with an OCR text layer. Note that accuracy is dependent on the quality of the image and font used.

Usage

I created a simple shell script ocrmypdf.sh to either convert a single PDF or all PDFs in my folder. It also names the output sensibly with the extension .ocr.pdf:

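#!/bin/bash
# Pull the image only if it is not already present locally.
if [[ $(docker images | grep -c jbarlow83/ocrmypdf) -lt 1 ]]; then
  docker pull jbarlow83/ocrmypdf
fi
if [[ -z "$1" ]]; then
  # No arguments: print usage. The quotes matter, because unquoted
  # parentheses are a bash syntax error.
  echo "Usage: $0 filename(s)"
else
  for f in "$@"; do
    echo "**** PROCESSING $f"
    # Feed the PDF on stdin and write stdout to NAME.ocr.pdf.
    docker run --rm -i jbarlow83/ocrmypdf -l eng --redo-ocr - - <"$f" >"${f%.*}.ocr.pdf"
  done
fi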

Remember to make your script executable (chmod +x ocrmypdf.sh), and make sure Docker is running (on macOS, that is Docker.app). Otherwise you will get an error like: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

So, to OCR a PDF document, simply run the script with the input PDF file or files (use wildcards like *):

./ocrmypdf.sh input*.pdf

Explanation

The first part just checks if the docker image is present, and if not, pulls it from Docker Hub.

Note that once the image has been downloaded the first time, the script never refreshes it to the latest version. So periodically run docker pull manually.
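A minimal sketch of that manual refresh, wrapped in a function so it can be called before a batch run. The function name refresh_image and the DOCKER override variable are my own additions for illustration (the override only exists so the function can be dry-run with DOCKER=echo); the image name is the one used in the script.

```shell
#!/bin/bash
# Image name as used in ocrmypdf.sh.
IMAGE="jbarlow83/ocrmypdf"

# refresh_image is a hypothetical helper, not part of OCRmyPDF.
# DOCKER is a hypothetical override for dry runs; by default the
# real docker CLI is called.
refresh_image() {
  "${DOCKER:-docker}" pull "$IMAGE"
}
```

Calling refresh_image contacts the registry each time, so it is probably best run occasionally rather than on every file.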

Instead of copying files into and out of the container, I redirected the input file to the container program’s stdin and redirected the container program’s stdout to an output file. That’s the two dashes - -, followed by <input.pdf and >output.ocr.pdf.
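To make the redirection and the output naming concrete, here is a small sketch that runs without Docker. It assumes a stand-in filter (cat) in place of the container, and the helper names ocr_name and filter_file are hypothetical. Any program that reads stdin and writes stdout can be wired up the same way.

```shell
#!/bin/bash
# Derive the output name the same way the script does:
# strip the last extension, then append .ocr.pdf.
ocr_name() {
  printf '%s\n' "${1%.*}.ocr.pdf"
}

# Stand-in for the container: cat replaces
# `docker run --rm -i jbarlow83/ocrmypdf ... - -` so the sketch runs
# anywhere. Input arrives on stdin, output leaves on stdout.
filter_file() {
  cat <"$1" >"$(ocr_name "$1")"
}
```

For example, ocr_name scan.pdf prints scan.ocr.pdf, and only the last extension is replaced, so my.report.v2.pdf becomes my.report.v2.ocr.pdf.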

Docker runs the container with -i (keep stdin open) and --rm (remove the container when it exits), and the command mounts no volumes. So I believe that the container cannot interact with my file system at all (although you still have to trust that OCRmyPDF does not transmit your PDF anywhere).
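If the remaining trust question is a concern, one option is Docker's --network none flag, which starts the container without network access. The sketch below is a variant of the script's command and not something I have verified against every image version; it assumes OCRmyPDF needs no network while processing. The function name ocr_offline and the DOCKER override are hypothetical, added only so the command can be dry-run.

```shell
#!/bin/bash
# Same invocation as ocrmypdf.sh, plus --network none so the container
# has no network access. ocr_offline and DOCKER are hypothetical names;
# DOCKER defaults to the real docker CLI.
ocr_offline() {
  "${DOCKER:-docker}" run --rm -i --network none \
    jbarlow83/ocrmypdf -l eng --redo-ocr - - <"$1" >"${1%.*}.ocr.pdf"
}
```

Usage would be ocr_offline input.pdf, which writes input.ocr.pdf next to the original.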

OCRmyPDF Parameters

I used a couple of parameters with OCRmyPDF:

-l eng sets the Tesseract language to English.
For a page with both an image and text, I found --redo-ocr worked best to OCR the image but retain the text. The other options didn’t work for me: -s skipped the page entirely, while -f converted the text into an image and then OCR’ed that instead.

Other parameters are listed by running docker run --rm -i jbarlow83/ocrmypdf --help. Here are some that I find useful, depending on the input file type:

-l LANGUAGE, --language LANGUAGE
                      Language(s) of the file to be OCRed (see tesseract
                      --list-langs for all language packs installed in your
                      system). Use -l eng+deu for multiple languages.

-r, --rotate-pages    Automatically rotate pages based on detected text
                      orientation
--remove-background   Attempt to remove background from gray or color pages,
                      setting it to white
-d, --deskew          Deskew each page before performing OCR
-c, --clean           Clean pages from scanning artifacts before performing
                      OCR, and send the cleaned page to OCR, but do not
                      include the cleaned page in the output
-i, --clean-final     Clean page as above, and incorporate the cleaned image
                      in the final PDF. Might remove desired content.

-f, --force-ocr       Rasterize any text or vector objects on each page,
                      apply OCR, and save the rastered output (this rewrites
                      the PDF)
-s, --skip-text       Skip OCR on any pages that already contain text, but
                      include the page in final output; useful for PDFs that
                      contain a mix of images, text pages, and/or previously
                      OCRed pages
--redo-ocr            Attempt to detect and remove the hidden OCR layer from
                      files that were previously OCRed with OCRmyPDF or
                      another program. Apply OCR to text found in raster
                      images. Existing visible text objects will not be
                      changed. If there is no existing OCR, OCR will be
                      added.

--pages PAGES         Limit OCR to the specified pages (ranges or comma
                      separated), skipping others
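As a sketch of how these options could be combined with the stdin/stdout approach, the function below passes any extra arguments straight through to OCRmyPDF. The name ocr_with and the DOCKER override are hypothetical. Whether a given language such as deu works depends on the language packs installed in the image, which I am assuming here rather than asserting.

```shell
#!/bin/bash
# Usage: ocr_with FILE [OCRmyPDF options...]
# ocr_with and DOCKER are hypothetical names; DOCKER defaults to the
# real docker CLI and exists only to allow a dry run.
ocr_with() {
  f=$1
  shift
  # Remaining arguments are handed to OCRmyPDF unchanged; the two
  # dashes select stdin for input and stdout for output.
  "${DOCKER:-docker}" run --rm -i jbarlow83/ocrmypdf "$@" - - <"$f" >"${f%.*}.ocr.pdf"
}
```

For a skewed, rotated scan in two languages, a call might look like ocr_with scan.pdf -l eng+deu --rotate-pages --deskew.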

Hope this helps you get going quickly!

Note: this article was submitted on 8 October 2019, but was not properly published during my last migration! I just noticed the omission today!
