There has been a lot of buzz about Stable Diffusion for text-to-image synthesis, which saw its public release around 22 Aug 2022. You can read more on the Stability.AI blog and try it at Hugging Face. What’s groundbreaking is that it is open source, with a pre-trained downloadable model and modest system requirements, so anyone can try it on their own computer... anyone... like me!

Background

If you’ve not heard of text-to-image AI like DALL-E 2, Midjourney, or Disco Diffusion before (you must’ve been living under a rock): these are machine learning models that generate digital images from a natural-language text prompt. They are not image-search or copy-paste jobs, but truly unique, never-before-seen “creative” works of “art”, though we won’t debate this further :)

For background and discussion (non-technical):

For technical information to go beyond the basics in this post, check out:

As before when I posted about Running GFPGAN Face Restoration in a container, my maths is not up to par for understanding AI or Machine Learning (AI/ML). Be warned.

Running Stable Diffusion txt2img

To get Stable Diffusion running on my M1 MacBook Pro, I followed Ben Firshman’s guide, “Run Stable Diffusion on your M1 Mac’s GPU” to install a modified version of Stable Diffusion from his bfirsh/apple-silicon-mps-support GitHub branch.

I did a quick comparison against the official CompVis stable-diffusion repository - all changes seem to be related to replacing code that uses NVIDIA’s cuda (and dropping cudatoolkit altogether) in favour of Apple’s Metal Performance Shaders (mps) in PyTorch.
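
For illustration only (a rough sketch of the idea, not the actual diff from that branch), the device selection boils down to something like this in PyTorch:

import torch

def pick_device() -> torch.device:
    # Prefer Apple's Metal Performance Shaders (MPS) when this PyTorch build supports it,
    # then NVIDIA CUDA, otherwise fall back to plain CPU.
    if torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

print(pick_device())  # prints "mps" on an M1 Mac with a suitable PyTorch build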

It’s not possible to use MPS within a macOS VM or Docker container, so I can’t isolate it in UTM or Multipass. Instead, you will need Homebrew already installed on your macOS host.

Summarizing the one-time setup from Ben Firshman’s guide:

brew update
brew install python cmake protobuf rust
git clone -b apple-silicon-mps-support https://github.com/bfirsh/stable-diffusion.git
cd stable-diffusion
mkdir -p models/ldm/stable-diffusion-v1/
python3 -m pip install virtualenv
python3 -m virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

At this point, notice that your shell prompt starts with (venv), which indicates that you are using a Python virtual environment. Remember to run source venv/bin/activate every time you start a new session (i.e. whenever you open a new terminal).

Next, to download the weights the Stable Diffusion model was built with:

If this is too much trouble, I found a download available on Google Storage, but I cannot vouch for it: clicking on this link will immediately download the model.
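
Whichever way you get them, the repository expects the checkpoint at models/ldm/stable-diffusion-v1/model.ckpt (the directory we created earlier). Assuming the downloaded file is called sd-v1-4.ckpt and landed in ~/Downloads (adjust the path to suit):

mv ~/Downloads/sd-v1-4.ckpt models/ldm/stable-diffusion-v1/model.ckpt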

Finally, it’s time to generate an image from a text prompt! After more downloads (about 2.8 GB, but only the first time) and a short wait, you’ll find the AI-generated image in outputs/txt2img-samples/. (--n_samples 1 is required, and --plms uses Katherine Crowson’s PLMS sampler implementation instead of DDIM.)

python scripts/txt2img.py --n_samples 1 --plms \
  --prompt "a sad woman holding a blue pug wearing a hat, in the style of gustav klimt"

If you encounter an issue with the latest Protobuf v3.20.x, along the lines of ImportError: dlopen(protobuf/pyext/_message.cpython-310-darwin.so, 0x0002): symbol not found in flat namespace, then you could try downgrading it to a known working version, i.e. pip install "protobuf==3.19.4"
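
For convenience, the same fix as copy-pasteable commands (run inside the activated venv):

pip show protobuf                # check which version is currently installed
pip install "protobuf==3.19.4"   # pin the known-working release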

Ta-da!

Stable Diffusion Output Sample "a sad woman holding a blue pug wearing a hat, in the style of gustav klimt"

Not sure about the blue hat (perhaps it’s my phrasing), but isn’t it amazing?!

One final thing I tried (so far): to generate more than one image (all in a single PNG), just increase the number of iterations, e.g. --n_iter 3.

python scripts/txt2img.py --n_samples 1 --plms --n_iter 3 \
  --prompt "a sad woman holding a blue pug wearing a hat, in the style of gustav klimt"

Stable Diffusion 3x Output Sample "a sad woman holding a blue pug wearing a hat, in the style of gustav klimt"

I love the first image! It’s beautiful and perfectly Klimt in his Golden Phase. Can I copyright them?

The Internet is about to be flooded with machine-generated images. Entering “Klimt” in a search engine may soon turn up these images instead of his paintings; future training sets will need to be carefully curated; and perhaps art will have less value, since anyone can generate an image for free; or perhaps art will have more value, in cases where human authorship is established?

I think there is a watermark embedded in generated images with invisible-watermark. Also, if you get Rickrolled, or get a black image (if you have disabled the check_safety() function call), then Stable Diffusion has determined the image to be NSFW. No idea why I keep getting this.
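
Out of curiosity, here is a minimal sketch of how you might try to read that watermark back with the imwatermark package (cv2 and invisible-watermark are already used by the repo). It assumes the script embeds the 17-byte string "StableDiffusionV1" via the dwtDct method, so treat it as an experiment rather than gospel:

import cv2
from imwatermark import WatermarkDecoder

# the path below is just an example output file; point it at one of your own
bgr = cv2.imread("outputs/txt2img-samples/grid-0000.png")
decoder = WatermarkDecoder("bytes", 136)  # 17 bytes x 8 bits
watermark = decoder.decode(bgr, "dwtDct")
print(watermark.decode("utf-8", errors="replace"))  # expect "StableDiffusionV1"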

Update 28 Sep 2022: The recent PyTorch nightly v1.13.0.dev20220927 (or thereabouts) increased my render time from 1-2 minutes (MPS) to 10-15 minutes! Your mileage may vary, but to downgrade to the version I used:

pip3 uninstall torch torchvision
pip3 install --pre torch==1.13.0.dev20220915 torchvision==0.14.0.dev20220915 --extra-index-url https://download.pytorch.org/whl/nightly/cpu