SDXL 1.0 uses two text prompts to guide image generation. In my first post, SDXL 1.0 with ComfyUI, I referred to the second text prompt as a “style” but I wonder if I am correct. I have no idea! So let’s test out both prompts...

About CLIPs

Image generation with Stable Diffusion starts with the need to “understand” the input text as it relates to images. The neural-network “encoder” that does this is OpenAI’s Contrastive Language-Image Pre-training (CLIP). SDXL pairs OpenAI’s original CLIP with OpenCLIP, an open-source implementation of CLIP trained on an open dataset of captioned images.

In ComfyUI these input parameters are represented as:

  • CLIP_L = text_encoder (CLIPTextModel)
  • CLIP_G = text_encoder_2 (CLIPTextModelWithProjection)

ComfyUI CLIPTextEncodeSDXL node

Checking the SDXL documentation, the two text inputs are described as:

text_encoder (CLIPTextModel) — Frozen text-encoder. Stable Diffusion XL uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.

text_encoder_2 (CLIPTextModelWithProjection) — Second frozen text-encoder. Stable Diffusion XL uses the text and pool portion of CLIP, specifically the laion/CLIP-ViT-bigG-14-laion2B-39B-b160k variant.
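To make the two encoders a little more concrete, here is a minimal shape sketch (NumPy stand-ins, not real model calls) of how their outputs combine, assuming the published hidden sizes of 768 for clip-vit-large-patch14 and 1280 for CLIP-ViT-bigG-14:

```python
import numpy as np

SEQ_LEN = 77    # CLIP's fixed token window
DIM_L = 768     # clip-vit-large-patch14 hidden size (text_encoder)
DIM_G = 1280    # CLIP-ViT-bigG-14 hidden size (text_encoder_2)

# Stand-ins for the per-token embeddings each encoder produces.
emb_l = np.zeros((SEQ_LEN, DIM_L))
emb_g = np.zeros((SEQ_LEN, DIM_G))

# SDXL concatenates the two along the feature axis, giving the
# 2048-dim cross-attention context its UNet conditions on.
context = np.concatenate([emb_l, emb_g], axis=-1)
print(context.shape)  # (77, 2048)

# The "pool portion" mentioned in the docs: text_encoder_2 also emits
# one pooled vector per prompt, used as a global conditioning signal.
pooled = np.zeros((DIM_G,))
print(pooled.shape)   # (1280,)
```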

That doesn’t clarify things at all! I did check out SDXL Prompt Tests by Markus Pobitzer... but that just confused me further.

Time to experiment! In all cases I used the Base SDXL 1.0 model (no refiner) and identical parameters:

  • image size = 1024 x 1024
  • sampler = dpmpp_2m, scheduler = karras
  • steps = 30, cfg = 8.0
  • seed = fixed 674690372
  • CLIP_G positive = “an angel with wings outstretched landing in the middle of a busy road”
  • negative prompt = “text, watermark, b&w” (except the B&W output in the last experiment set)
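The settings above can be sketched as a rough diffusers equivalent (this is not my ComfyUI graph, just an illustration using the StableDiffusionXLPipeline parameter names; note that in diffusers `prompt` feeds the ViT-L encoder and `prompt_2` feeds bigG, and the style value here is an example):

```python
# Fixed settings from the experiments, mapped to diffusers parameter names.
settings = {
    "width": 1024,
    "height": 1024,
    "num_inference_steps": 30,   # steps = 30
    "guidance_scale": 8.0,       # cfg = 8.0
    # CLIP_G (bigG) holds the image description -> prompt_2 in diffusers
    "prompt_2": ("an angel with wings outstretched "
                 "landing in the middle of a busy road"),
    # CLIP_L (ViT-L) holds the style -> prompt in diffusers
    "prompt": "cinematic",
    "negative_prompt": "text, watermark, b&w",
}

def run(pipe, seed=674690372):
    """Invoke a pre-loaded StableDiffusionXLPipeline with the fixed seed.
    `pipe` is assumed; nothing is downloaded by this sketch itself."""
    import torch  # local import so the sketch has no hard dependency
    generator = torch.Generator("cpu").manual_seed(seed)
    return pipe(generator=generator, **settings).images[0]
```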

Experiments with “Official” Styles in CLIP_L

By official, I mean that these are the so-called “styles” in the free-to-use Clipdrop SDXL image generator:

Clipdrop Styles screenshot

I’ll just call them the “Official” styles:

  • no style - which I assume means leaving CLIP_L blank (although literally typing “no style” produces strikingly similar results)
  • anime
  • photographic
  • digital art
  • comic book
  • fantasy art
  • analog film
  • neon punk
  • isometric
  • low poly
  • origami
  • line art
  • cinematic
  • 3d model
  • pixel art
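For anyone scripting this sweep rather than clicking through it, the list above drops straight into code (a sketch; “no style” is modeled as an empty CLIP_L string):

```python
# The "official" Clipdrop styles, copied from the list above.
OFFICIAL_STYLES = [
    "anime", "photographic", "digital art", "comic book", "fantasy art",
    "analog film", "neon punk", "isometric", "low poly", "origami",
    "line art", "cinematic", "3d model", "pixel art",
]

# One CLIP_L value per grid cell; "" stands in for "no style".
clip_l_values = [""] + OFFICIAL_STYLES
print(len(clip_l_values))  # 15 variants, one image each
```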

As you can see below, using a single “official” style in the CLIP_L field faithfully produces the expected stylistic representation. Excellent!

CLIP_L Prompt Experiments 1: Official Style Descriptors

This means my CLIP_G can be a textual description of the image (adjectives and nouns), and a style can be applied (projected) independently via CLIP_L.

More Experiments with Styles

I want to make sure styles belong in CLIP_L - what happens if I place styles in CLIP_G instead? What if I use only one prompt to embed styles, similar to Stable Diffusion 1.0/1.5? How about using shortened (“clipped”) forms, or words with similar meanings?

  • In the first row:

    • if both prompts are the same, i.e. the prompt with “low poly” appended, then low poly is strongly emphasized compared to the 1st image on the 4th row previously.
    • if the prompts are swapped so that the style is in CLIP_G, then evidently the CLIP_L encoder is unable to capture the essence of the prompt, and this is the wrong way to use the text prompts.
    • if the style “low poly” is appended to CLIP_G and CLIP_L is left blank, the output is passable but less dynamic somehow...
  • In the second row:

    • Per Clipdrop, photographic is the “official” style (a rather odd word),
    • However, the clipped form photo produces worse results,
    • And photograph is oddly the worst.
  • And similarly, in the final row:

    • Per Clipdrop cinematic is “official,”
    • It seems cinema looks worse,
    • But film is the worst.

CLIP_L Prompt Experiments 2: Shortened Style Descriptors

It’s possible that the last two scenarios are specific to my prompt. Still, I’d best stick to the “official” styles for CLIP_L first, then append more style descriptors after that in a chaotic, undirected word salad.
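That composition rule (“official style first, extra descriptors after”) is easy to encode; here is a tiny helper sketch (the function name and descriptor examples are mine, not from any library):

```python
def build_clip_l(official_style, extra_descriptors):
    """Compose a CLIP_L prompt: the official style term first,
    then any extra style descriptors, comma-separated."""
    return ", ".join([official_style, *extra_descriptors])

prompt = build_clip_l("cinematic", ["dramatic lighting", "film grain"])
print(prompt)  # cinematic, dramatic lighting, film grain
```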

Experiments with Alternative Styles

WARNING: NSFW

What about other “typical” style prompts? I tried many... and here are the most interesting results!

CLIP_L Prompt Experiments 3: Alternative Style Descriptors

Bonus! I am using the ComfyUI extension ImagesGrid to generate the image grid.

  • For every LoadImage, use ImageCombine to append each image to the previous ones (it’s probably accumulating them into an array / list).
  • Since my images are added left-to-right, then top-to-bottom (a Z pattern), I use ImagesGridByColumn.
  • To label the columns and rows, use the GridAnnotation node - the first text box sets column titles (delimited by ;), and the second text box sets row titles (again delimited by ;).