There has been a lot of buzz about Stable Diffusion for text-to-image synthesis, which saw its public release around 22 August 2022. You can read more on the Stability.AI blog and try it at Hugging Face. What’s groundbreaking is that it is open source, with a pre-trained downloadable model and modest system requirements, so anyone can try it on their own computer... anyone... like me!
If you’ve not heard of text-to-image AI like DALL-E 2, Midjourney or Disco Diffusion before (you must’ve been living under a rock): these are machine learning models that generate digital images from a natural-language text prompt. These are not image-search or copy-paste jobs, but truly unique, never-before-seen “creative” works of “art” — though we won’t debate this further :)
For background and discussion (non-technical):
- Cleo Abram’s “The REAL fight over AI art” explainer on Huge if true,
- Dagogo Altraide’s “How This A.I. Draws Anything You Describe [DALL-E 2]” on ColdFusion.
For technical information to go beyond the basics in this post, check out:
- Edan Meyer’s “Stable Diffusion - What, Why, How?” very informative video, without getting too deep into the math. I’ll probably try out more of his code in the future...
- Any-Winter-4079’s “StableDiffusion RUNS on M1 chips” Reddit post for even more things to try like
- Dr. Lincoln Stein’s ”Stable Diffusion Dream Script” with “for-dummies” installation guides for Windows, Linux and macOS and examples of major features. I was too lazy to install Anaconda / miniconda, so maybe that’s a future to-do too...
As before when I posted about Running GFPGAN Face Restoration in a container, my maths is not up to par to understand A.I. or Machine Learning (AI/ML). Be warned.
Running Stable Diffusion txt2img
To get Stable Diffusion running on my M1 MacBook Pro, I followed Ben Firshman’s guide, “Run Stable Diffusion on your M1 Mac’s GPU”, to install a modified version of Stable Diffusion from his bfirsh/apple-silicon-mps-support GitHub branch. I did a quick comparison against the official CompVis stable-diffusion repository - all changes seem to be related to replacing code that uses NVIDIA’s CUDA (and dropping `cudatoolkit` altogether) in favour of Apple’s Metal Performance Shaders (`mps`) in PyTorch.
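To illustrate the kind of substitution the branch makes, here’s a rough sketch of backend selection - this is my own illustrative helper, not the branch’s actual diff:

```python
# Hypothetical sketch: prefer CUDA, then Apple's MPS, falling back to CPU.
# This mirrors the sort of cuda -> mps replacement made in the branch.
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# With PyTorch installed, the flags would come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
print(pick_device(False, True))  # on an M1 Mac without CUDA -> "mps"
```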
Summarizing the one-time setup from Ben Firshman’s guide:
```shell
brew update
brew install python Cmake protobuf rust
git clone -b apple-silicon-mps-support https://github.com/bfirsh/stable-diffusion.git
cd stable-diffusion
mkdir -p models/ldm/stable-diffusion-v1/
python3 -m pip install virtualenv
python3 -m virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```
At this point, notice that your shell prompt starts with `(venv)`, which indicates that you are using a Python virtual environment. Therefore, remember to run `source venv/bin/activate` every time you start a new session (i.e. when you open a new terminal).
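If you’re ever unsure whether the virtual environment is active, Python itself can tell you: inside a venv, `sys.prefix` points into the environment rather than the base installation. A small sketch:

```python
import sys

def in_virtualenv() -> bool:
    # In a venv/virtualenv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print(in_virtualenv())
```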
Next, to download the weights the Stable Diffusion model was built with:
- Head to the Hugging Face stable-diffusion-v-1-4-original repository,
- Click Access Repository,
- Create an account, accept the license,
- Download sd-v1-4.ckpt - FYI, the file is about 4 GB,
- Save it as
If this is too much trouble, I found a download available on Google Storage, but I cannot vouch for it: clicking on this link will immediately download the model.
Finally, it’s time to generate an image from a text prompt! After more downloads (about 2.8 GB, but this only happens once) and a short wait, you’ll find the AI-generated image in `outputs/txt2img-samples/`. Note that `--n_samples 1` is required, and `--plms` uses Katherine Crowson’s PLMS sampler implementation instead of DDIM.

```shell
python scripts/txt2img.py --n_samples 1 --plms \
  --prompt "a sad woman holding a blue pug wearing a hat, in the style of gustav klimt"
```
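To check what came out without clicking around, a quick sketch that lists the generated PNGs (assuming the default output directory mentioned above; the helper name is mine):

```python
from pathlib import Path

def list_samples(outdir: str = "outputs/txt2img-samples") -> list:
    # Return the generated PNGs sorted by name; empty if nothing
    # has been generated yet (or the directory doesn't exist).
    p = Path(outdir)
    return sorted(p.glob("*.png")) if p.is_dir() else []
```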
If you encounter an issue with the latest Protobuf v3.20.x, along the lines of `ImportError: dlopen(protobuf/pyext/_message.cpython-310-darwin.so, 0x0002): symbol not found in flat namespace`, then you could try downgrading it to a known working version, i.e.

```shell
pip install "protobuf==3.19.4"
```
Not sure about the blue hat - perhaps it’s my phrasing - but isn’t it amazing?!
One final thing I tried (so far): to generate more than one image (all in a single PNG), just increase the number of iterations, e.g.

```shell
python scripts/txt2img.py --n_samples 1 --plms --n_iter 3 \
  --prompt "a sad woman holding a blue pug wearing a hat, in the style of gustav klimt"
```
I love the first image! It’s beautiful and perfectly Klimt in his Golden Phase. Can I copyright them?
The Internet is about to be flooded with machine-generated images - entering “Klimt” in a search engine may turn up these images instead of his paintings; future training sets will need to be carefully curated; and perhaps art will have less value, since anyone can generate an image, free; or perhaps art will have more value, in cases where human authorship is established?
I think there is a watermark embedded in generated images with invisible-watermark. Also, if you get Rickrolled (or get a black image, if you have disabled the `check_safety()` function call), then Stable Diffusion has determined the image to be NSFW - no idea why I keep getting this.
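For illustration only (not the repository’s actual code), the safety gate behaves roughly like this sketch: a flagged image is replaced wholesale, which is why a “hit” shows up as a substitute image or, with the replacement image removed, plain black:

```python
def apply_safety(pixels, flagged: bool):
    # Rough sketch: if the checker flags the image, return an all-black
    # image of the same dimensions instead of the generated one.
    if flagged:
        return [[(0, 0, 0) for _ in row] for row in pixels]
    return pixels

img = [[(255, 0, 0), (0, 255, 0)]]  # a tiny 1x2 "image"
print(apply_safety(img, True))      # [[(0, 0, 0), (0, 0, 0)]]
```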
Update 28 Sep 2022: The recent PyTorch nightly v1.13.0.dev20220927 (or thereabouts) increased my render time from 1-2 minutes (MPS) to 10-15 minutes! Your mileage may vary, but to downgrade to the version I used:

```shell
pip3 uninstall torch torchvision
pip3 install --pre torch==1.13.0.dev20220915 torchvision==0.14.0.dev20220915 \
  --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```