How to Generate Customized AI Art Using VQGAN and CLIP
Knowing how AI art is made is the key to making even better AI art.
The latest and greatest AI content generation trend is AI generated art. In January 2021, OpenAI demoed DALL-E, a GPT-3 variant which creates images instead of text. However, it can create images in response to a text prompt, allowing for some very fun output.
However the generated images are not always coherent, so OpenAI also demoed CLIP, which can be used to translate an image into text and therefore identify which generated images were actually avocado armchairs. CLIP was then open-sourced, although DALL-E was not.
Since CLIP is essentially an interface between representations of text and image data, clever hacking can allow anyone to create their own pseudo-DALL-E. The first implementation was Big Sleep by Ryan Murdock/@advadnoun, which combined CLIP with an image generating GAN named BigGAN. Then open source worked its magic: the GAN base was changed to VQGAN, a newer model architecture Patrick Esser and Robin Rombach and Björn Ommer which allows more coherent image generation. The core CLIP-guided training was improved and translated to a Colab Notebook by Katherine Crawson/@RiversHaveWings and others in a special Discord server. Twitter accounts like @images_ai and @ai_curio which leverage VQGAN + CLIP with user-submitted prompts have gone viral and received mainstream press. @ak92501 created a fork of that Notebook which has a user-friendly UI, to which I became aware of how far AI image generation has developed in a few months.
From that, I forked my own Colab Notebook, and streamlined the UI a bit to minimize the number of clicks needs to start generating and make it more mobile-friendly.
The VQGAN + CLIP technology is now in a good state such that it can be used for more serious experimentation. Some say art is better when there’s mystery, but my view is that knowing how AI art is made is the key to making even better AI art.
A Hello World to AI Generated Art
Let’s jump right into it with something fantastical: how well can AI generate a cyberpunk forest?
The TL;DR of how VQGAN + CLIP works is that VQGAN generates an image, CLIP scores the image according to how well it can detect the input prompt, and VQGAN uses that information to iteratively improve its image generation. Lj Miranda has a good detailed technical writeup.
Now let’s do the same prompt as before, but with an added author from a time well before the cyberpunk genre existed and see if the AI can follow their style. Let’s try Salvador Dali.
It’s definitely a cyberpunk forest, and it’s definitely Dali’s style.
One trick the community found to improve generated image quality is to simply add phrases that tell the AI to make a good image, such as
trending on /r/art. Trying that here:
In this case, it’s unclear if the
artstationHQ part of the prompt gets higher priority than the
Salvador Dali part. Another trick that VQGAN + CLIP can do is take multiple input text prompts, which can add more control. Additionally, you can assign weights to these different prompts. So if we did
cyberpunk forest by Salvador Dali:3 | artstationHQ, the model will try three times as hard to ensure that the prompt follows a Dali painting than
Much better! Lastly, we can use negative weights for prompts such that the model targets the opposite of that prompt. Let’s do the opposite of
green and white to see if the AI tries to remove those two colors from the palette and maybe make the final image more cyberpunky.
Now we’re getting to video game concept art quality generation. Indeed, VQGAN + CLIP rewards the use of clever input prompt engineering.
Initial Images and Style Transfer
Normally with VQGAN + CLIP, the generation starts from a blank slate. However, you can optionally provide an image to start from instead. This provides both a good base for generation and speeds it up since it doesn’t have to learn from empty noise. I usually recommend a lower learning rate as a result.
So let’s try an initial image of myself, naturally.
One of the earliest promising use cases of AI Image Generation was neural style transfer, where an AI could take the “style” of one image and transpose it to another. Can it follow the style of a specific painting, such as Starry Night by Vincent Van Gogh?
Well, it got the colors and style, but the AI appears to have taken the “Van Gogh” part literally and gave me a nice beard.
Of course, with the power of AI, you can do both prompts at the same time for maximum chaos.
Icons and Generating Images With A Specific Shape
While I was first experimenting with VQGAN + CLIP, I saw an interesting tweet by AI researcher Mark Riedl:
Playing with controlling figure placement in VQGAN + CLIP— Mark Loki Variant Riedl (@mark_riedl) July 31, 2021
prompt: "a wizard duel | disney style" pic.twitter.com/9goW2l0oJf
Intrigued, I adapted some icon generation code I had handy from another project and created icon-image, a Python tool to programmatically generate an icon using Font Awesome icons and paste it onto a noisy background.
This icon can be used as an initial image, as above. Adjusting the text prompt to accomidate the icon can result in very cool images, such as
a black and white evil robot by Junji Ito.
The background and icon noise is the key, as AI can shape it much better than solid colors. Omitting the noise results in a more boring image that doesn’t reflect the prompt as well, although it has its own style.
Another fun prompt addition is
rendered in unreal engine (with an optional
high quality), which instructs the AI to create a three-dimensional image and works especially well with icons.
icon-image can also generate brand images, such as the Twitter logo, which can be good for comedy, especially if you tweak the logo/background colors as well. What if we turn the Twitter logo into Mordor, which is an fair metaphor?
So that didn’t turn out well as the Twitter logo got overpowered by the prompt (you can see outlines of the logo’s bottom). However, there’s a trick to force the AI to respect the logo: set the icon as the initial image and the target image, and apply a high weight to the prompt (the weight can be lowered iteratively to preserve the logo better).
More Fun Examples
Here’s a few more good demos of what VQGAN + CLIP can do using the ideas and tricks above:
@kingdomakrillic released an album with many more examples of prompt augmentations and their results.
Making Money Off of VQGAN + CLIP
Can these AI generated images be commercialized as software-as-a-service? It’s unclear. In contrast to StyleGAN2 images (where the license is explicitly noncommercial), all aspects of the VQGAN + CLIP pipeline are MIT Licensed which does support commericalization. However, the ImageNet 16384 VQGAN used in this Colab Notebook and many other VQGAN+CLIP Notebooks was trained on ImageNet, which has famously complicated licensing, and whether finetuning the VQGAN counts as sufficiently detached from an IP perspective hasn’t been legally tested to my knowledge. There are other VQGANs available such as ones trained on the Open Images Dataset or COCO, both of which have commercial-friendly CC-BY-4.0 licenses, although in my testing they had substantially lower image generation quality.
Granted, the biggest blocker to making money off of VQGAN + CLIP in a scalable manner is generation speed; unlike most commercial AI models which use inference and can therefore be optimized to drastically increase performance, VQGAN + CLIP requires training, which is much slower and can’t allow content generation in real time like GPT-3. Even with expensive GPUs and generating at small images sizes, training takes a couple minutes at minimum, which correlates with a higher cost-per-image and annoyed users. It’s still cheaper per image than what OpenAI charges for their GPT-3 API, though, and many startups have built on that successfuly.
Of course, if you just want make NFTs from manual usage of VQGAN + CLIP, go ahead.
The Next Steps for AI Image Generation
CLIP itself is just the first practical iteration of translating text-to-images, and I suspect this won’t be the last implementation of such a model (OpenAI may pull a GPT-3 and not open-source the inevitable CLIP-2 since now there’s a proven monetizeable use case).
However, the AI Art Generation industry is developing at a record pace, especially on the image-generating part of the equation. Just the day before this article was posted, Katherine Crawson released a Colab Notebook for CLIP with Guided Diffusion, which generates more realistic images (albeit less fantastical), and Tom White released a pixel art generating Notebook which doesn’t use a VQGAN variant.
The possibilities with just VQGAN + CLIP alone are endless.