Extracting embedded ZIPs from Office documents

I have a plethora of Bash scripts littered throughout my filesystem, many written for one specific purpose and promptly forgotten about (often just named go.sh!). I thought I’d share one today: a script to extract embedded ZIP files from Microsoft Office documents created in Windows.

A few things to notice (all links below to Wikipedia):

  • Modern Microsoft Office documents are in Office Open XML format, denoted by .pptx, .xlsx and .docx extensions.
  • Each file is actually a ZIP archive, which encapsulates a bunch of other files, many of them XML.
  • When embedding some file types, like ZIP archives, MS Office for Windows for some reason wraps them as OLE objects, saved in Compound File Binary Format v3, whose header starts (in hex) with d0 cf 11 e0 (i.e. DOCFILE, haha).
  • Also, you should know that ZIP files start with the bytes 50 4b (i.e. the PK signature) and end with a well-defined directory structure.
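
To make the two signatures above concrete, here is a minimal sketch of a Bash helper that reads the first four bytes of a file and reports which signature it finds. The function name magic_of is made up for illustration, and it assumes od is available (it should be on both macOS and Linux).

```shell
#!/bin/bash
# magic_of FILE: print "ole", "zip" or "unknown" based on the leading bytes.
# (Hypothetical helper for illustration; not part of the original script.)
magic_of() {
    local head
    # Read the first 4 bytes as a plain lowercase hex string, e.g. "d0cf11e0".
    head=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case "$head" in
        d0cf11e0) echo ole ;;      # Compound File Binary (OLE) header
        504b*)    echo zip ;;      # "PK" ZIP signature
        *)        echo unknown ;;
    esac
}
```

Called as magic_of oleObject1.bin, this should report ole for a wrapped embedding, while the .pptx itself should report zip.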

On macOS, double-clicking the icon of the embedded ZIP file in a PowerPoint / Word / Excel file won’t open it, but will instead yield this unhelpful and misleading error.

Error opening an embedded ZIP with PowerPoint on macOS

So how to extract the embedded ZIP file? One could write a VBA macro... but me, I like my shell scripts.

Assuming one has a PowerPoint file called foo.pptx, one can do this to check if the embedded ZIP can be extracted manually via UNIX tools:

  • unzip -l foo.pptx to look at the content - specifically I am interested in files named something like ppt/embeddings/oleObject1.bin.
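  • unzip -j foo.pptx '*oleObject1.bin' to extract the embedded file into the current directory (-j drops the ppt/embeddings/ path, and the pattern is quoted so the shell leaves it for unzip to match).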
  • hexdump -C -n2800 oleObject1.bin or xxd -l2800 oleObject1.bin to view the bytes and confirm the OLE and ZIP signatures e.g.
    00000000: d0cf 11e0 a1b1 1ae1 0000 0000 0000 0000  ................
    ...
    00000###: 00f6 2b00 0050 4b03 0414 0008 0808 00e9  ..+..PK.........
  • If you see this, fortunately, unzip -l oleObject1.bin can list the contents of the embedded ZIP (and plain unzip oleObject1.bin extracts them). Unzip is smart enough to skip the OLE preamble; you’ll just see a warning:
    warning [oleObject1.bin]:  2789 extra bytes at beginning or within zipfile
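
The "extra bytes" count in that warning should correspond to where the ZIP data starts inside the OLE wrapper. As a rough illustration, the sketch below searches for the first PK\x03\x04 signature (the start of a ZIP local file header, as seen in the hex dump) and prints its byte offset. The name zip_offset is made up, and the -b/-o combination is assumed to behave as in GNU grep; other grep implementations may report offsets differently. It can also be fooled if those four bytes happen to occur earlier in the wrapper.

```shell
#!/bin/bash
# zip_offset FILE: print the byte offset of the first ZIP local file header
# (PK\x03\x04) found in FILE. Hypothetical helper; assumes GNU grep.
zip_offset() {
    # -a: treat binary data as text
    # -b -o: print the byte offset of each match itself (GNU grep behaviour)
    LC_ALL=C grep -abo "$(printf 'PK\003\004')" "$1" | head -n 1 | cut -d: -f1
}
```

If all goes well, zip_offset oleObject1.bin prints a number that lines up with the one in unzip’s warning (2789 in the example above).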

Usual disclaimer: Don’t run this code. I barely test stuff. No error handling. Intended for education only.

If you are lazy like I am, then this script will extract the contents of all embedded ZIPs for you:

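#!/bin/bash
# Usage: go.sh foo.pptx
# Extracts the contents of every ZIP embedded (as an OLE object) in the given file.
if [[ $# -ge 1 ]]; then
    # unzip -Z1 prints bare member names, one per line, so no column cutting is needed.
    # Word splitting is fine here: the oleObjectN.bin names Office generates have no spaces.
    for a in $(unzip -Z1 "$1" '*oleObject*.bin'); do
        o="$(uuidgen).zip"
        # -p streams the OLE blob to stdout; save it under a throwaway name
        unzip -p "$1" "$a" > "$o"
        # unzip skips the OLE preamble (with a warning) and extracts the ZIP inside
        unzip "$o"
        rm "$o"
    done
fi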
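
If several embedded ZIPs contain files with the same names, extracting them all into the current directory will make them collide. The following is a sketch of one possible variant, not a tested replacement: it unpacks each embedded ZIP into its own directory named after the OLE object. The function name extract_embedded and the directory naming are assumptions for illustration. It uses unzip -Z1 to list member names, and relies on unzip’s documented convention that exit code 1 means "warnings only".

```shell
#!/bin/bash
# extract_embedded FILE: unpack every embedded ZIP in FILE into ./oleObjectN/
# (Hypothetical variant of the script above, with a few basic checks.)
extract_embedded() {
    local doc=$1 tmp member name rc
    [[ -r $doc ]] || { echo "cannot read: $doc" >&2; return 1; }
    tmp=$(mktemp -d) || return 1
    for member in $(unzip -Z1 "$doc" '*oleObject*.bin' 2>/dev/null); do
        name=$(basename "$member" .bin)          # e.g. oleObject1
        unzip -p "$doc" "$member" > "$tmp/$name.zip"
        rc=0
        # -q: quiet, -d: extract into a per-object directory
        unzip -q "$tmp/$name.zip" -d "$name" || rc=$?
        # Exit code 1 is only a warning (such as the extra-bytes one);
        # treat anything above that as a failure.
        if (( rc > 1 )); then
            echo "unzip failed for $member (exit $rc)" >&2
        fi
    done
    rm -rf "$tmp"
}
```

With this in a file that you source, extract_embedded foo.pptx would leave the contents of the first embedded ZIP under oleObject1/, the second under oleObject2/, and so on.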

Hope this helps!