I have a plethora of Bash scripts littered throughout my filesystem, many written for one specific purpose and promptly forgotten about (often just named go.sh!). I thought I'd share one today: a script to extract embedded ZIP files from Microsoft Office documents created in Windows.
A few things to notice (all links below to Wikipedia):
- Modern Microsoft Office documents are in Office Open XML format, denoted by the .pptx, .xlsx and .docx extensions.
- Each file is actually a ZIP archive, which encapsulates a bunch of other files, many in XML.
- When embedding some file types, like ZIP archives, MS Office for Windows for some reason wraps them as OLE objects, saved in Compound File Binary Format v3 with the header (in hex) d0 cf 11 e0 (i.e. DOCFILE, haha).
- Also, you should know that ZIP files start with the bytes 50 4b (i.e. the PK signature) and end with a well-defined directory structure.
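To make the two signatures concrete, here is a minimal sketch that classifies a file by its first bytes. The function name magic_kind is my own invention for illustration, and it assumes head -c and od are available (they are on macOS and Linux as far as I know).

```shell
# Sketch: classify a file by its leading bytes.
# magic_kind is a hypothetical helper name, not part of any tool.
magic_kind() {
    # Read the first four bytes as lowercase hex without separators.
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        d0cf11e0) echo "ole" ;;      # Compound File Binary (OLE) header
        504b*)    echo "zip" ;;      # ZIP files start with "PK" (50 4b)
        *)        echo "unknown" ;;
    esac
}
```

Calling magic_kind on an oleObject file pulled out of a document should print ole when the embedded file has been wrapped, and zip when it is a plain archive.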
On macOS, double-clicking the icon of the embedded ZIP file in a PowerPoint / Word / Excel file won’t open it, but instead will yield an unhelpful and misleading error.
So how to extract the embedded ZIP file? One could write a VBA macro... but me, I like my shell scripts.
Assuming one has a PowerPoint file called foo.pptx, one can do this to check if the embedded ZIP can be extracted manually via UNIX tools:
- unzip -l foo.pptx to look at the content - specifically I am interested in files named something like ppt/embeddings/oleObject1.bin.
- unzip -p foo.pptx \*oleObject1.bin > oleObject1.bin to extract the embedded file into the current directory.
- hexdump -C -n2800 oleObject1.bin or xxd -l2800 oleObject1.bin to view the bytes and confirm the OLE and ZIP signatures, e.g.
  00000000: d0cf 11e0 a1b1 1ae1 0000 0000 0000 0000 ................
  ...
  00000###: 00f6 2b00 0050 4b03 0414 0008 0808 00e9 ..+..PK.........
- If you see this, fortunately, unzip -l oleObject1.bin can list the contents of the file (and similarly -x to extract). Unzip is smart enough to ignore the OLE preamble; you’ll just see a warning:
  warning [oleObject1.bin]: 2789 extra bytes at beginning or within zipfile
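If you would rather not rely on unzip skipping the preamble, the offset of the ZIP data can be found by hand. The following is only a sketch: zip_offset and carve_zip are hypothetical helper names, and it assumes a grep that supports -a, -b, -o and -F together (GNU grep does; I have not checked every BSD variant).

```shell
# Sketch: find where the embedded ZIP starts inside an OLE object and carve it out.
# zip_offset and carve_zip are hypothetical helper names.
zip_offset() {
    # "PK" followed by bytes 03 04, as seen in the hex dump of the OLE object.
    sig=$(printf 'PK\003\004')
    # -a: treat binary as text, -b: print byte offset, -o: only the match, -F: fixed string.
    LC_ALL=C grep -aboF "$sig" "$1" | head -n 1 | cut -d: -f1
}

carve_zip() {
    off=$(zip_offset "$1")
    [ -n "$off" ] || return 1
    # tail -c +N starts output at byte N (1-based), hence the +1.
    tail -c "+$((off + 1))" "$1" > "$2"
}
```

For the sample dump, I would expect zip_offset oleObject1.bin to print 2789, matching the number of extra bytes in the unzip warning. carve_zip oleObject1.bin embedded.zip then writes everything from that offset onward to embedded.zip (an output name chosen here purely for illustration).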
Usual disclaimer: Don’t run this code. I barely test stuff. No error handling. Intended for education only.
If you are lazy like I am, then this script will extract the contents of all embedded ZIPs for you:
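#!/bin/bash
# Extract the contents of every ZIP embedded as an OLE object in an Office document.
shopt -s extglob  # required for the +(...) pattern used below
if [[ $# -ge 1 ]]; then
    # cut -c28- keeps the name column of the unzip -l listing (possibly with leading spaces)
    for a in $(unzip -l "$1" \*oleObject\*.bin | grep oleObject | cut -c28-); do
        o=$(uuidgen).zip                              # temporary file name
        unzip -p "$1" "${a##+([[:space:]])}" > "$o"   # copy the OLE object out of the document
        unzip -x "$o"                                 # unzip skips the OLE preamble
        rm "$o"
    done
fi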
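For the less lazy, here is a slightly more defensive variant, offered as a sketch rather than a tested replacement. It assumes Info-ZIP's unzip, whose -Z1 (zipinfo) mode prints one member name per line, and it unpacks each embedded ZIP into its own directory. The function name extract_embedded and the ole.XXXXXX temporary file template are my own choices.

```shell
# Sketch: extract every embedded ZIP from an Office document into its own directory.
# extract_embedded is a hypothetical name; $1 is the document path.
extract_embedded() {
    if [ $# -lt 1 ]; then
        echo "usage: extract_embedded document" >&2
        return 2
    fi
    # unzip -Z1 (zipinfo mode) prints one member name per line, so no column cutting is needed.
    unzip -Z1 "$1" '*oleObject*.bin' 2>/dev/null |
    while IFS= read -r entry; do
        tmp=$(mktemp "${TMPDIR:-/tmp}/ole.XXXXXX") || exit 1
        outdir=$(basename "$entry" .bin)
        # Copy the OLE object out of the document; unzip then skips the OLE preamble.
        unzip -p "$1" "$entry" > "$tmp"
        rc=0
        unzip -o -q "$tmp" -d "$outdir" </dev/null || rc=$?
        # Exit status 1 only signals a warning (such as extra leading bytes).
        [ "$rc" -le 1 ] || echo "failed to unzip $entry (status $rc)" >&2
        rm -f "$tmp"
    done
}
```

Running extract_embedded foo.pptx should leave directories such as oleObject1 in the current directory. OLE objects that do not wrap a ZIP (an embedded spreadsheet, say) will simply be reported as failures.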
Hope this helps!