I recently had a bunch of screenshots that I wanted to OCR, so that I could search their text content in the future. I found an open source solution called OCRmyPDF (created by jbarlow83 over on GitHub) that is very simple to use. It’s also very well documented with many usage examples, including my preferred option: running it from a Docker container.
OCR Background
Optical Character Recognition (OCR) converts an image containing text into searchable text. OCRmyPDF uses the open source OCR engine called Tesseract, originally created by HP and currently maintained by Google. OCRmyPDF is at v9.0.3 at the time of writing; the screenshot below is from the GitHub project page (I like the logo):
As usual, proceed with caution! Don’t blindly download software or run code snippets from the Internet without thoroughly reviewing them. That goes for anything on this site too!
In my case, my content was already in PDF. However, OCRmyPDF can take images (JPEG and PNG) and convert them to PDF with an OCR text layer. Note that accuracy is dependent on the quality of the image and font used.
Usage
I created a simple shell script, ocrmypdf.sh, to convert either a single PDF or all the PDFs in my folder. It also names the output sensibly, with the extension .ocr.pdf:
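#!/bin/bash
# Pull the image if it is not available locally
if [[ $(docker images | grep -c jbarlow83/ocrmypdf) -lt 1 ]]; then
  docker pull jbarlow83/ocrmypdf
fi
if [[ -z "$1" ]]; then
  # Quoted, because bare parentheses are a syntax error in bash
  echo "Usage: $0 filename(s)"
else
  for f in "$@"; do
    echo "**** PROCESSING $f"
    # Stream the PDF through the container: stdin in, stdout out
    docker run --rm -i jbarlow83/ocrmypdf -l eng --redo-ocr - - <"$f" >"${f%.*}.ocr.pdf"
  done
fi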
Remember to make your script executable (chmod +x ocrmypdf.sh). And make sure the docker.app is running, otherwise you will get an error like: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
So, to OCR a PDF document, simply run the script with the input PDF file or files (wildcards like * work):
./ocrmypdf.sh input*.pdf
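The output name comes from Bash parameter expansion: `${f%.*}` strips the last extension, and `.ocr.pdf` is appended. Here is a minimal sketch of just that step; the function name `ocr_output_name` is my own and is not part of the script above:

```shell
#!/bin/bash
# Hypothetical helper: derive the output name the same way the script does.
ocr_output_name() {
  local f="$1"
  # ${f%.*} removes the shortest trailing ".something" (the last extension)
  printf '%s\n' "${f%.*}.ocr.pdf"
}

ocr_output_name "invoice.pdf"           # invoice.ocr.pdf
ocr_output_name "scans/2019.10.08.pdf"  # scans/2019.10.08.ocr.pdf
```

One consequence: a wildcard like input*.pdf also matches earlier outputs on a second run, so input1.ocr.pdf would become input1.ocr.ocr.pdf.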
Explanation
The first part just checks if the docker image is present, and if not, pulls it from Docker Hub.
Note that once the image has been downloaded the first time, the script never refreshes it, so periodically do a manual docker pull jbarlow83/ocrmypdf to get the latest version.
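If you would rather have the script handle the refresh, one possible sketch follows. It assumes a hypothetical environment variable, REFRESH, that I made up for this example (it is not a Docker or OCRmyPDF option), and it checks for the image with docker image inspect rather than grepping the image list:

```shell
#!/bin/bash
IMAGE="jbarlow83/ocrmypdf"

# Pull the image when it is missing locally, or when a refresh is requested.
# REFRESH is a hypothetical variable used only by this sketch.
ensure_image() {
  if [ "${REFRESH:-0}" = "1" ] || ! docker image inspect "$IMAGE" >/dev/null 2>&1; then
    docker pull "$IMAGE"
  fi
}
```

Calling ensure_image at the top of the script and running REFRESH=1 ./ocrmypdf.sh input*.pdf would then force a fresh pull before processing.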
Instead of copying files into and out of the container, I redirected the input file to the container program’s stdin and redirected the container program’s stdout to an output file. That’s what the two dashes (- -) are for: they tell OCRmyPDF to read from stdin and write to stdout, and they are followed by <input.pdf and >output.ocr.pdf.
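One side effect of redirecting stdout is that the shell creates the output file before the container even starts, so a failed run can leave an empty or truncated .ocr.pdf behind. A minimal sketch of one way around that, assuming a temporary .part suffix and a helper name (ocr_one) that I picked for this example:

```shell
#!/bin/bash
# Hypothetical helper: OCR one file via stdin/stdout, keeping the output only on success.
ocr_one() {
  local src="$1"
  local out="${src%.*}.ocr.pdf"
  local tmp="$out.part"    # temporary name, an assumption of this sketch

  if docker run --rm -i jbarlow83/ocrmypdf -l eng --redo-ocr - - <"$src" >"$tmp"; then
    mv "$tmp" "$out"       # success: publish the finished file
  else
    rm -f "$tmp"           # failure: do not leave a truncated PDF behind
    echo "OCR failed for $src" >&2
    return 1
  fi
}
```

In the script, the body of the for loop could then call ocr_one "$f" instead of running docker directly.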
Docker runs the container in interactive and transient / ephemeral mode (--rm -i): -i keeps stdin open so the PDF can be piped in, and --rm removes the container when it exits. Since no volume is mounted either, I believe that the container cannot interact with my file system at all (although you still have to trust that OCRmyPDF does not transmit your PDF anywhere).
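If the network is the remaining worry, Docker's --network none option starts the container without any external network access. This is only a sketch: it assumes OCRmyPDF needs no network to OCR a local file, which I have not verified for every option, and the function name ocr_offline is made up for this example:

```shell
#!/bin/bash
# Hypothetical variant of the docker run line with networking disabled.
ocr_offline() {
  local src="$1"
  # --network none: the container gets no network besides loopback
  docker run --rm -i --network none jbarlow83/ocrmypdf -l eng --redo-ocr - - \
    <"$src" >"${src%.*}.ocr.pdf"
}
```

The image has to be pulled beforehand, since the option only affects the running container.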
OCRmyPDF Parameters
I used a couple of parameters with OCRmyPDF:
- -l eng sets the Tesseract language to English.
- For a page with both an image and text, I found --redo-ocr worked best to OCR the image but retain the text. The other options didn’t work for me: -s skipped the page entirely, while -f converted the text into an image and then OCR’ed that instead.
Other parameters are listed by running docker run --rm -i jbarlow83/ocrmypdf --help. Here are some that I find useful, depending on the input file type:
  -l LANGUAGE, --language LANGUAGE
                        Language(s) of the file to be OCRed (see tesseract
                        --list-langs for all language packs installed in your
                        system). Use -l eng+deu for multiple languages.
  -r, --rotate-pages    Automatically rotate pages based on detected text
                        orientation
  --remove-background   Attempt to remove background from gray or color pages,
                        setting it to white
  -d, --deskew          Deskew each page before performing OCR
  -c, --clean           Clean pages from scanning artifacts before performing
                        OCR, and send the cleaned page to OCR, but do not
                        include the cleaned page in the output
  -i, --clean-final     Clean page as above, and incorporate the cleaned image
                        in the final PDF. Might remove desired content.
  -f, --force-ocr       Rasterize any text or vector objects on each page,
                        apply OCR, and save the rastered output (this rewrites
                        the PDF)
  -s, --skip-text       Skip OCR on any pages that already contain text, but
                        include the page in final output; useful for PDFs that
                        contain a mix of images, text pages, and/or previously
                        OCRed pages
  --redo-ocr            Attempt to detect and remove the hidden OCR layer from
                        files that were previously OCRed with OCRmyPDF or
                        another program. Apply OCR to text found in raster
                        images. Existing visible text objects will not be
                        changed. If there is no existing OCR, OCR will be
                        added.
  --pages PAGES         Limit OCR to the specified pages (ranges or comma
                        separated), skipping others
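To try different combinations without editing the script each time, the options can be passed straight through to the container. A minimal sketch, assuming a wrapper function named ocr_with (my own name) that takes the input file first and hands every remaining argument to OCRmyPDF unchanged:

```shell
#!/bin/bash
# Hypothetical wrapper: first argument is the input file,
# everything after it is passed to OCRmyPDF as-is.
ocr_with() {
  local src="$1"
  shift
  docker run --rm -i jbarlow83/ocrmypdf "$@" - - <"$src" >"${src%.*}.ocr.pdf"
}

# Example calls (file names are placeholders):
#   ocr_with scan.pdf -l eng+deu --rotate-pages --deskew
#   ocr_with report.pdf -l eng --redo-ocr
```

As far as I know, not every combination is accepted (I believe --redo-ocr cannot be combined with --deskew, for example), so check the --help output for your version.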
Hope this helps you get going quickly!
Note: this article was submitted on 8 October 2019, but was not properly published during my last migration! I just noticed the omission today!