<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>GANs on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/tag/gans/</link>
    <description>Recent content in GANs on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Wed, 18 Aug 2021 08:45:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/tag/gans/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>How to Generate Customized AI Art Using VQGAN and CLIP</title>
      <link>https://minimaxir.com/2021/08/vqgan-clip/</link>
      <pubDate>Wed, 18 Aug 2021 08:45:00 -0700</pubDate>
      <guid>https://minimaxir.com/2021/08/vqgan-clip/</guid>
      <description>Knowing how AI art is made is the key to making even better AI art.</description>
      <content:encoded><![CDATA[<style>pre code { white-space: pre; }</style>
<p>The latest and greatest AI content generation trend is AI generated art. In January 2021, <a href="https://openai.com/">OpenAI</a> demoed <a href="https://openai.com/blog/dall-e/">DALL-E</a>, a GPT-3 variant which creates images instead of text. More importantly, it can create images in response to a text prompt, allowing for some very fun output.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/avocado_hu_a758e21fc220789.webp 320w,/2021/08/vqgan-clip/avocado_hu_b17b8218450473b0.webp 768w,/2021/08/vqgan-clip/avocado_hu_f18c1c7ad2c98eac.webp 1024w,/2021/08/vqgan-clip/avocado.png 1632w" src="avocado.png"
         alt="DALL-E demo, via OpenAI."/> <figcaption>
            <p>DALL-E demo, <a href="https://openai.com/blog/dall-e/">via OpenAI</a>.</p>
        </figcaption>
</figure>

<p>However, the generated images are not always coherent, so OpenAI also demoed <a href="https://openai.com/blog/clip/">CLIP</a>, which can be used to translate an image into text and therefore identify which generated images were actually avocado armchairs. CLIP was then <a href="https://github.com/openai/CLIP">open-sourced</a>, although DALL-E was not.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/guacamole_hu_af13ecf1e14e0f91.webp 320w,/2021/08/vqgan-clip/guacamole_hu_af49e75a81bb35b1.webp 768w,/2021/08/vqgan-clip/guacamole_hu_72b24e07c3f1faad.webp 1024w,/2021/08/vqgan-clip/guacamole.png 1198w" src="guacamole.png"
         alt="CLIP demo, via OpenAI."/> <figcaption>
            <p>CLIP demo, <a href="https://openai.com/blog/clip/">via OpenAI</a>.</p>
        </figcaption>
</figure>

<p>Since CLIP is essentially an interface between representations of text and image data, clever hacking can allow anyone to create their own pseudo-DALL-E. The first implementation was <a href="https://github.com/lucidrains/big-sleep">Big Sleep</a> by Ryan Murdock/<a href="https://twitter.com/advadnoun">@advadnoun</a>, which combined CLIP with an image generating <a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">GAN</a> named <a href="https://arxiv.org/abs/1809.11096">BigGAN</a>. Then open source worked its magic: the GAN base was changed to <a href="https://github.com/CompVis/taming-transformers">VQGAN</a>, a newer model architecture by Patrick Esser, Robin Rombach, and Björn Ommer which allows more coherent image generation. The core CLIP-guided training was improved and translated <a href="https://colab.research.google.com/drive/1L8oL-vLJXVcRzCFbPwOoMkPKJ8-aYdPN">to a Colab Notebook</a> by Katherine Crowson/<a href="https://twitter.com/RiversHaveWings">@RiversHaveWings</a> and others in a special Discord server. Twitter accounts like <a href="https://twitter.com/images_ai">@images_ai</a> and <a href="https://twitter.com/ai_curio">@ai_curio</a> which leverage VQGAN + CLIP with user-submitted prompts have gone viral and <a href="https://www.newyorker.com/culture/infinite-scroll/appreciating-the-poetic-misunderstandings-of-ai-art">received mainstream press</a>. <a href="https://twitter.com/ak92501">@ak92501</a> <a href="https://twitter.com/ak92501/status/1421246864649773058">created</a> a <a href="https://colab.research.google.com/drive/1Foi0mCSE6NrW9oI3Fhni7158Krz4ZXdH?usp=sharing">fork of that Notebook</a> with a user-friendly UI, through which I became aware of how far AI image generation had come in just a few months.</p>
<p>From that, I forked <a href="https://colab.research.google.com/drive/1wkF67ThUz37T2_oPIuSwuO4e_-0vjaLs?usp=sharing">my own Colab Notebook</a>, and streamlined the UI a bit to minimize the number of clicks needed to start generating and make it more mobile-friendly.</p>
<p>The VQGAN + CLIP technology is now in a good state such that it can be used for more serious experimentation. Some say art is better when there&rsquo;s mystery, but my view is that knowing how AI art is made is the key to making even better AI art.</p>
<h2 id="a-hello-world-to-ai-generated-art">A Hello World to AI Generated Art</h2>
<p><em>All AI-generated image examples in this blog post are generated using <a href="https://colab.research.google.com/drive/1wkF67ThUz37T2_oPIuSwuO4e_-0vjaLs?usp=sharing">this Colab Notebook</a>, with the captions indicating the text prompt and other relevant deviations from the default inputs to reproduce the image.</em></p>
<p>Let&rsquo;s jump right into it with something fantastical: how well can AI generate a cyberpunk forest?</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/cyberpunk_forest_hu_4ba90fadcee22967.webp 320w,/2021/08/vqgan-clip/cyberpunk_forest.png 592w" src="cyberpunk_forest.png"
         alt="cyberpunk forest"/> <figcaption>
            <p><code>cyberpunk forest</code></p>
        </figcaption>
</figure>

<p>The TL;DR of how VQGAN + CLIP works is that VQGAN generates an image, CLIP scores the image according to how well it can detect the input prompt, and VQGAN uses that information to iteratively improve its image generation. Lj Miranda has a <a href="https://ljvmiranda921.github.io/notebook/2021/08/08/clip-vqgan/">good detailed technical writeup</a>.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/clip_vqgan_with_image_hu_5e04a615f6af5ff.webp 320w,/2021/08/vqgan-clip/clip_vqgan_with_image_hu_7546c61d3cb746e.webp 768w,/2021/08/vqgan-clip/clip_vqgan_with_image_hu_d4c317842e36f301.webp 1024w,/2021/08/vqgan-clip/clip_vqgan_with_image.png 1067w" src="clip_vqgan_with_image.png"
         alt="via Lj Miranda. Modified for theme friendliness."/> <figcaption>
            <p><a href="https://ljvmiranda921.github.io/notebook/2021/08/08/clip-vqgan/">via Lj Miranda</a>. Modified for theme friendliness.</p>
        </figcaption>
</figure>
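<p>Conceptually, the optimization loop is only a few lines of PyTorch. Here&rsquo;s a minimal sketch of CLIP-guided generation, where <code>vqgan</code>, <code>clip_model</code>, and their methods are hypothetical stand-ins rather than the Notebook&rsquo;s actual code:</p>
<pre><code class="language-python">import torch

# Conceptual sketch of CLIP-guided VQGAN generation; vqgan.decode() and
# clip_model.encode_image() are hypothetical stand-ins for the real models.
def generate(vqgan, clip_model, prompt_embedding, steps=300, lr=0.05):
    z = torch.randn(1, 256, 16, 16, requires_grad=True)  # latent "image"
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = vqgan.decode(z)  # VQGAN renders the latents into pixels
        image_embedding = clip_model.encode_image(image)
        # CLIP scores the render: cosine distance to the text prompt embedding
        loss = 1 - torch.cosine_similarity(image_embedding, prompt_embedding).mean()
        opt.zero_grad()
        loss.backward()  # gradients flow back into the latents
        opt.step()       # ...nudging the image toward the prompt
    return vqgan.decode(z)
</code></pre>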

<p>Now let&rsquo;s do the same prompt as before, but with an added artist from a time well before the cyberpunk genre existed and see if the AI can follow their style. Let&rsquo;s try <a href="https://www.wikiart.org/en/salvador-dali">Salvador Dali</a>.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali_hu_3ad61193478875b7.webp 320w,/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali.png 592w" src="cyberpunk_forest_by_salvador_dali.png"
         alt="cyberpunk forest by Salvador Dali"/> <figcaption>
            <p><code>cyberpunk forest by Salvador Dali</code></p>
        </figcaption>
</figure>

<p>It&rsquo;s definitely a cyberpunk forest, and it&rsquo;s definitely Dali&rsquo;s style.</p>
<p>One trick the community found to improve generated image quality is to simply add phrases that tell the AI to make a <em>good</em> image, such as <code>artstationHQ</code> or <code>trending on /r/art</code>. Trying that here:</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali_artstationhq_hu_e9392ca8f1eb7213.webp 320w,/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali_artstationhq.png 592w" src="cyberpunk_forest_by_salvador_dali_artstationhq.png"
         alt="cyberpunk forest by Salvador Dali artstationHQ"/> <figcaption>
            <p><code>cyberpunk forest by Salvador Dali artstationHQ</code></p>
        </figcaption>
</figure>

<p>In this case, it&rsquo;s unclear if the <code>artstationHQ</code> part of the prompt gets higher priority than the <code>Salvador Dali</code> part. Another trick that VQGAN + CLIP can do is take multiple input text prompts, which can add more control. Additionally, you can assign weights to these different prompts. So if we did <code>cyberpunk forest by Salvador Dali:3 | artstationHQ</code>, the model would weight the Salvador Dali prompt three times as heavily as the <code>artstationHQ</code> prompt.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali_3_artstationhq_hu_948ad338bfc41f2.webp 320w,/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali_3_artstationhq.png 592w" src="cyberpunk_forest_by_salvador_dali_3_artstationhq.png"
         alt="cyberpunk forest by Salvador Dali:3 | artstationHQ"/> <figcaption>
            <p><code>cyberpunk forest by Salvador Dali:3 | artstationHQ</code></p>
        </figcaption>
</figure>

<p>Much better! Lastly, we can use negative weights for prompts such that the model targets the opposite of that prompt. Let&rsquo;s do the opposite of <code>green and white</code> to see if the AI tries to remove those two colors from the palette and maybe make the final image more cyberpunky.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali_3_artstationhq_gw_hu_166c44dff41886a2.webp 320w,/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali_3_artstationhq_gw.png 592w" src="cyberpunk_forest_by_salvador_dali_3_artstationhq_gw.png"
         alt="cyberpunk forest by Salvador Dali:3 | artstationHQ | green and white:-1"/> <figcaption>
            <p><code>cyberpunk forest by Salvador Dali:3 | artstationHQ | green and white:-1</code></p>
        </figcaption>
</figure>

<p>Now we&rsquo;re getting to video game concept art quality generation. Indeed, VQGAN + CLIP rewards the use of clever input prompt engineering.</p>
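<p>If you&rsquo;re curious how the <code>prompt:weight | prompt</code> syntax breaks down, here&rsquo;s a rough sketch of a parser for it (my own approximation, not the Notebook&rsquo;s exact code). Each resulting pair contributes <code>weight * clip_loss</code> to the total loss, so negative weights push the image <em>away</em> from that prompt:</p>
<pre><code class="language-python">def parse_prompts(prompt_string):
    """Split 'a:3 | b | c:-1' into (text, weight) pairs, defaulting weight to 1."""
    pairs = []
    for part in prompt_string.split("|"):
        text, sep, weight = part.strip().rpartition(":")
        try:
            pairs.append((text.strip(), float(weight)))
        except ValueError:  # no numeric weight given
            pairs.append((part.strip(), 1.0))
    return pairs

print(parse_prompts("cyberpunk forest by Salvador Dali:3 | artstationHQ | green and white:-1"))
# [('cyberpunk forest by Salvador Dali', 3.0), ('artstationHQ', 1.0), ('green and white', -1.0)]
</code></pre>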
<h2 id="initial-images-and-style-transfer">Initial Images and Style Transfer</h2>
<p>Normally with VQGAN + CLIP, the generation starts from a blank slate. However, you can optionally provide an image to start from instead. This provides both a good base for generation and speeds it up since it doesn&rsquo;t have to learn from empty noise. I usually recommend a lower learning rate as a result.</p>
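<p>In the loop sketch earlier, using an initial image just means seeding the latents from that image instead of from random noise (with <code>vqgan.encode()</code> again a hypothetical stand-in):</p>
<pre><code class="language-python">from PIL import Image
from torchvision.transforms import functional as TF

def latents_from_image(vqgan, image_path):
    # vqgan.encode() is a hypothetical stand-in for the real VQGAN encoder.
    init = TF.to_tensor(Image.open(image_path)).unsqueeze(0)
    return vqgan.encode(init).detach().requires_grad_(True)

# z = latents_from_image(vqgan, "max.png"), then optimize z as before,
# typically with a lower learning rate to keep more of the original image.
</code></pre>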
<p>So let&rsquo;s try an initial image of myself, naturally.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/max_hu_458a2426943e0485.webp 320w,/2021/08/vqgan-clip/max.png 600w" src="max.png"
         alt="No, I am not an AI Generated person. Hopefully."/> <figcaption>
            <p>No, I am not an AI Generated person. Hopefully.</p>
        </figcaption>
</figure>

<p>Let&rsquo;s try another artist, such as <a href="https://en.wikipedia.org/wiki/Junji_Ito">Junji Ito</a>, who has a very distinctive horror <a href="https://www.google.com/search?q=junji&#43;ito&#43;images">style of art</a>.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/max_junji_ito_hu_d11d0c8ed7eb69af.webp 320w,/2021/08/vqgan-clip/max_junji_ito.png 592w" src="max_junji_ito.png"
         alt="a black and white portrait by Junji Ito — initial image above, learning rate = 0.1"/> <figcaption>
            <p><code>a black and white portrait by Junji Ito</code> — initial image above, learning rate = 0.1</p>
        </figcaption>
</figure>

<p>One of the earliest promising use cases of AI Image Generation was <a href="https://www.tensorflow.org/tutorials/generative/style_transfer">neural style transfer</a>, where an AI could take the &ldquo;style&rdquo; of one image and transpose it to another. Can it follow the style of a specific painting, such as <a href="https://www.vangoghgallery.com/painting/starry-night.html">Starry Night</a> by <a href="https://en.wikipedia.org/wiki/Vincent_van_Gogh">Vincent Van Gogh</a>?</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/max_starry_night_hu_79b050ddfd750a23.webp 320w,/2021/08/vqgan-clip/max_starry_night.png 592w" src="max_starry_night.png"
         alt="Starry Night by Vincent Van Gogh — initial image above, learning rate = 0.1"/> <figcaption>
            <p><code>Starry Night by Vincent Van Gogh</code> — initial image above, learning rate = 0.1</p>
        </figcaption>
</figure>

<p>Well, it got the colors and style, but the AI appears to have taken the &ldquo;Van Gogh&rdquo; part literally and gave me a nice beard.</p>
<p>Of course, with the power of AI, you can do both prompts at the same time for maximum chaos.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/max_junji_ito_starry_night_hu_836dac4f5598d721.webp 320w,/2021/08/vqgan-clip/max_junji_ito_starry_night.png 592w" src="max_junji_ito_starry_night.png"
         alt="Starry Night by Vincent Van Gogh | a black and white portrait by Junji Ito — initial image above, learning rate = 0.1"/> <figcaption>
            <p><code>Starry Night by Vincent Van Gogh | a black and white portrait by Junji Ito</code> — initial image above, learning rate = 0.1</p>
        </figcaption>
</figure>

<h2 id="icons-and-generating-images-with-a-specific-shape">Icons and Generating Images With A Specific Shape</h2>
<p>While I was first experimenting with VQGAN + CLIP, I saw <a href="https://twitter.com/mark_riedl/status/1421282588791132161">an interesting tweet</a> by AI researcher Mark Riedl:</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/mark_riedl/status/1421282588791132161"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>Intrigued, I adapted some icon generation code I had handy <a href="https://github.com/minimaxir/stylecloud">from another project</a> and created <a href="https://github.com/minimaxir/icon-image">icon-image</a>, a Python tool to programmatically generate an icon using <a href="https://fontawesome.com/">Font Awesome</a> icons and paste it onto a noisy background.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/icon_robot_hu_7a372246756b89fb.webp 320w,/2021/08/vqgan-clip/icon_robot.png 600w" src="icon_robot.png"
         alt="The default icon image used in the Colab Notebook"/> <figcaption>
            <p>The default icon image used in the Colab Notebook</p>
        </figcaption>
</figure>
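<p>The gist of the tool is straightforward image compositing. Here&rsquo;s a rough Pillow/NumPy sketch of the idea (not the actual icon-image source; <code>robot.png</code> is a hypothetical icon file):</p>
<pre><code class="language-python">import numpy as np
from PIL import Image

def icon_on_noise(icon_path, size=600, icon_opacity=0.9, noise_opacity=0.3):
    # RGB noise field, dimmed by noise_opacity, as the background.
    noise = (np.random.rand(size, size, 3) * 255 * noise_opacity).astype("uint8")
    background = Image.fromarray(noise, mode="RGB")
    # Scale the icon's alpha channel so the noise shows through it too.
    icon = Image.open(icon_path).convert("RGBA").resize((size // 2, size // 2))
    alpha = icon.getchannel("A").point(lambda a: int(a * icon_opacity))
    icon.putalpha(alpha)
    background.paste(icon, (size // 4, size // 4), mask=icon)  # centered
    return background

icon_on_noise("robot.png").save("init_image.png")
</code></pre>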

<p>This icon can be used as an initial image, as above. Adjusting the text prompt to accommodate the icon can result in very cool images, such as <code>a black and white evil robot by Junji Ito</code>.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/robot_junji_ito_hu_dccb3a3ced294446.webp 320w,/2021/08/vqgan-clip/robot_junji_ito.png 592w" src="robot_junji_ito.png"
         alt="a black and white evil robot by Junji Ito — initial image above, learning rate = 0.1"/> <figcaption>
            <p><code>a black and white evil robot by Junji Ito</code> — initial image above, learning rate = 0.1</p>
        </figcaption>
</figure>

<p>The background and icon noise is the key, as the AI can shape noise much more easily than solid colors. Omitting the noise results in a more boring image that doesn&rsquo;t reflect the prompt as well, although it has its own style.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/robot_junji_ito_nonoise_hu_1fd3e72d34e39b97.webp 320w,/2021/08/vqgan-clip/robot_junji_ito_nonoise.png 592w" src="robot_junji_ito_nonoise.png"
         alt="a black and white evil robot by Junji Ito — initial image above except 1.0 icon opacity and 0.0 background noice opacity, learning rate = 0.1"/> <figcaption>
            <p><code>a black and white evil robot by Junji Ito</code> — initial image above except 1.0 icon opacity and 0.0 background noice opacity, learning rate = 0.1</p>
        </figcaption>
</figure>

<p>Another fun prompt addition is <code>rendered in unreal engine</code> (with an optional <code>high quality</code>), which instructs the AI to create a three-dimensional image and works especially well with icons.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/robot_unreal_hu_b4d3e0483b500717.webp 320w,/2021/08/vqgan-clip/robot_unreal.png 592w" src="robot_unreal.png"
         alt="smiling rusted robot rendered in unreal engine high quality — icon initial image, learning rate = 0.1"/> <figcaption>
            <p><code>smiling rusted robot rendered in unreal engine high quality</code> — icon initial image, learning rate = 0.1</p>
        </figcaption>
</figure>

<p>icon-image can also generate brand images, such as the <a href="https://twitter.com/">Twitter</a> logo, which can be good for comedy, especially if you tweak the logo/background colors as well. What if we turn the Twitter logo into <a href="https://www.google.com/search?q=mordor&#43;images">Mordor</a>, which is a fair metaphor?</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/twitter_mordor_hu_5dc5a61efb797269.webp 320w,/2021/08/vqgan-clip/twitter_mordor.png 592w" src="twitter_mordor.png"
         alt="Mordor — fab fa-twitter icon, icon initial image, black icon background, red icon, learning rate = 0.1"/> <figcaption>
            <p><code>Mordor</code> — <code>fab fa-twitter</code> icon, icon initial image, black icon background, red icon, learning rate = 0.1</p>
        </figcaption>
</figure>

<p>So that didn&rsquo;t turn out well, as the Twitter logo got overpowered by the prompt (you can see outlines of the logo&rsquo;s bottom). However, there&rsquo;s a trick to force the AI to respect the logo: set the icon as the initial image <em>and</em> the target image, and apply a high weight to the prompt (the weight can be lowered iteratively to preserve the logo better).</p>
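<p>In loss terms, the trick adds a second similarity target. Continuing the earlier hypothetical sketch, the total loss mixes the heavily weighted text prompt with closeness to the icon&rsquo;s own CLIP embedding:</p>
<pre><code class="language-python">import torch.nn.functional as F

# Sketch with hypothetical CLIP embeddings: a weighted text-prompt loss plus
# an image-similarity loss toward the icon used as the target image.
def combined_loss(image_emb, prompt_emb, icon_emb, prompt_weight=3.0):
    prompt_loss = 1 - F.cosine_similarity(image_emb, prompt_emb).mean()
    target_loss = 1 - F.cosine_similarity(image_emb, icon_emb).mean()
    return prompt_weight * prompt_loss + target_loss
</code></pre>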
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/twitter_mordor_2_hu_c8d00364084e21bc.webp 320w,/2021/08/vqgan-clip/twitter_mordor_2.png 592w" src="twitter_mordor_2.png"
         alt="Mordor:3 — fab fa-twitter icon, icon initial image, icon target image, black icon background, red icon, learning rate = 0.1"/> <figcaption>
            <p><code>Mordor:3</code> — <code>fab fa-twitter</code> icon, icon initial image, icon target image, black icon background, red icon, learning rate = 0.1</p>
        </figcaption>
</figure>

<h2 id="more-fun-examples">More Fun Examples</h2>
<p>Here&rsquo;s a few more good demos of what VQGAN + CLIP can do using the ideas and tricks above:</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/excel_hu_82c869ac7653bce0.webp 320w,/2021/08/vqgan-clip/excel.png 592w" src="excel.png"
         alt="Microsoft Excel by Junji Ito — 500 steps"/> <figcaption>
            <p><code>Microsoft Excel by Junji Ito</code> — 500 steps</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/zuck_hu_fe52aa1dc05ed15e.webp 320w,/2021/08/vqgan-clip/zuck.png 592w" src="zuck.png"
         alt="a portrait of Mark Zuckerberg:2 | a portrait of a bottle of Sweet Baby Ray&#39;s barbecue sauce — 500 steps"/> <figcaption>
            <p><code>a portrait of Mark Zuckerberg:2 | a portrait of a bottle of Sweet Baby Ray's barbecue sauce</code> — 500 steps</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/rickroll_hu_6912b4de0321d7e3.webp 320w,/2021/08/vqgan-clip/rickroll.png 592w" src="rickroll.png"
         alt="Never gonna give you up, Never gonna let you down — 500 steps"/> <figcaption>
            <p><code>Never gonna give you up, Never gonna let you down</code> — 500 steps</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/elon_hu_d4768f4d846c8269.webp 320w,/2021/08/vqgan-clip/elon.png 592w" src="elon.png"
         alt="a portrait of cyberpunk Elon Musk:2 | a human:-1 — 500 steps"/> <figcaption>
            <p><code>a portrait of cyberpunk Elon Musk:2 | a human:-1</code> — 500 steps</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/hamburger_hu_cd523739402fd119.webp 320w,/2021/08/vqgan-clip/hamburger.png 592w" src="hamburger.png"
         alt="hamburger of the Old Gods:5 — fas fa-hamburger icon, icon initial image, icon target image, black icon background, white icon, learning rate = 0.1, 500 steps"/> <figcaption>
            <p><code>hamburger of the Old Gods:5</code> — <code>fas fa-hamburger</code> icon, icon initial image, icon target image, black icon background, white icon, learning rate = 0.1, 500 steps</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/reality_hu_288a071a0017cc2.webp 320w,/2021/08/vqgan-clip/reality.png 592w" src="reality.png"
         alt="reality is an illusion:8 — fas fa-eye icon, icon initial image, icon target image, black icon background, white icon, learning rate = 0.1"/> <figcaption>
            <p><code>reality is an illusion:8</code> — <code>fas fa-eye</code> icon, icon initial image, icon target image, black icon background, white icon, learning rate = 0.1</p>
        </figcaption>
</figure>

<p>@kingdomakrillic <a href="https://imgur.com/a/SnSIQRu">released an Imgur album</a> with <em>many</em> more examples of prompt augmentations and their results.</p>
<h2 id="making-money-off-of-vqgan--clip">Making Money Off of VQGAN + CLIP</h2>
<p>Can these AI generated images be commercialized as <a href="https://en.wikipedia.org/wiki/Software_as_a_service">software-as-a-service</a>? It&rsquo;s unclear. In contrast to <a href="https://github.com/NVlabs/stylegan2">StyleGAN2</a> images (where the <a href="https://nvlabs.github.io/stylegan2/license.html">license</a> is explicitly noncommercial), all aspects of the VQGAN + CLIP pipeline are MIT licensed, which does support commercialization. However, the ImageNet 16384 VQGAN used in this Colab Notebook and many other VQGAN + CLIP Notebooks was trained on <a href="https://www.image-net.org/">ImageNet</a>, which has <a href="https://www.reddit.com/r/MachineLearning/comments/id4394/d_is_it_legal_to_use_models_pretrained_on/">famously complicated licensing</a>, and whether finetuning the VQGAN counts as sufficiently detached from an IP perspective hasn&rsquo;t been legally tested to my knowledge. There are other VQGANs available, such as ones trained on the <a href="https://opensource.google/projects/open-images-dataset">Open Images Dataset</a> or <a href="https://cocodataset.org/">COCO</a>, both of which have commercial-friendly <a href="https://creativecommons.org/licenses/by/4.0/">CC-BY-4.0</a> licenses, although in my testing they had substantially lower image generation quality.</p>
<p>Granted, the biggest blocker to making money off of VQGAN + CLIP in a scalable manner is generation speed: unlike most commercial AI models, which only run inference and can therefore be optimized to drastically increase performance, VQGAN + CLIP requires an optimization run for every image, which is much slower and can&rsquo;t generate content in real time like <a href="https://openai.com/blog/openai-api/">GPT-3</a>. Even with expensive GPUs and small image sizes, generation takes a couple minutes at minimum, which translates to a higher cost-per-image and annoyed users. It&rsquo;s still cheaper per image than what OpenAI charges for their GPT-3 API, though, and many startups have built on that successfully.</p>
<p>Of course, if you just want to make <a href="https://en.wikipedia.org/wiki/Non-fungible_token">NFTs</a> from manual usage of VQGAN + CLIP, go ahead.</p>
<h2 id="the-next-steps-for-ai-image-generation">The Next Steps for AI Image Generation</h2>
<p>CLIP itself is just the first practical iteration of text-to-image translation, and I suspect this won&rsquo;t be the last implementation of such a model (OpenAI may pull a GPT-3 and not open-source the inevitable CLIP-2 now that there&rsquo;s a proven monetizable use case).</p>
<p>However, the AI Art Generation industry is developing at a record pace, especially on the image-generating part of the equation. Just the day before this article was posted, Katherine Crowson <a href="https://twitter.com/RiversHaveWings/status/1427580354651586562">released</a> a <a href="https://colab.research.google.com/drive/1QBsaDAZv8np29FPbvjffbE1eytoJcsgA">Colab Notebook</a> for CLIP with Guided Diffusion, which generates <a href="https://twitter.com/RiversHaveWings/status/1427746442727149568">more realistic</a> images (albeit less fantastical), and Tom White <a href="https://twitter.com/dribnet/status/1427613617973653505">released</a> a <a href="https://colab.research.google.com/github/dribnet/clipit/blob/master/demos/PixelDrawer.ipynb">pixel art generating Notebook</a> which doesn&rsquo;t use a VQGAN variant.</p>
<p>The possibilities with just VQGAN + CLIP alone are endless.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Easily Transform Portraits of People into AI Aberrations Using StyleCLIP</title>
      <link>https://minimaxir.com/2021/04/styleclip/</link>
      <pubDate>Fri, 30 Apr 2021 08:55:00 -0700</pubDate>
      <guid>https://minimaxir.com/2021/04/styleclip/</guid>
      <description>StyleCLIP is essentially Photoshop driven by text, with all the good, bad, and chaos that entails.</description>
      <content:encoded><![CDATA[<p><em><strong>tl;dr</strong> follow the instructions in <a href="https://colab.research.google.com/drive/13EJ1ATvTnE0N7I0ULLvRsta7J7HdNuBi?usp=sharing">this Colab Notebook</a> to generate your own AI Aberration images and videos! If you want to use your own images, follow the instructions in <a href="https://colab.research.google.com/drive/1St3R2qAbwwTV-amfYLeyGGswtzX4HHJP?usp=sharing">this Colab Notebook first</a>!</em></p>
<p>GANs, <a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">generative adversarial networks</a>, are all the rage nowadays for creating AI-based imagery. You&rsquo;ve probably seen GANs used in tools like <a href="https://thispersondoesnotexist.com/">thispersondoesnotexist.com</a>, which currently uses NVIDIA&rsquo;s extremely powerful open-source <a href="https://github.com/NVlabs/stylegan2">StyleGAN2</a>.</p>
<p>In 2021, <a href="https://openai.com/">OpenAI</a> open-sourced <a href="https://github.com/openai/CLIP">CLIP</a>, a model which can give textual classification predictions for a provided image. Since CLIP effectively interfaces between text data and image data, you can theoretically map that text data to StyleGAN. Enter <a href="https://arxiv.org/abs/2103.17249">StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery</a>, a paper by Patashnik, Wu <em>et al</em> (with code <a href="https://github.com/orpatashnik/StyleCLIP">open-sourced on GitHub</a>) which allows CLIP vectors to be used to guide StyleGAN generations through user-provided text.</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/teaser_hu_d4bef5702d7835fd.webp 320w,/2021/04/styleclip/teaser_hu_1093876764fb12ab.webp 768w,/2021/04/styleclip/teaser_hu_23955890274ad6a7.webp 1024w,/2021/04/styleclip/teaser.png 1257w" src="teaser.png"
         alt="From the paper: the left-most image is the input; the other images are the result of the prompt at the top."/> <figcaption>
            <p>From the paper: the left-most image is the input; the other images are the result of the prompt at the top.</p>
        </figcaption>
</figure>

<p>The authors have also provided easy-to-use Colab Notebooks to help set up these models and run them on a GPU for free. The most interesting one is the <a href="https://colab.research.google.com/github/orpatashnik/StyleCLIP/blob/main/notebooks/StyleCLIP_global.ipynb">Global Directions notebook</a>, which allows the end user to do what is listed in the image above, and I&rsquo;ve <a href="https://colab.research.google.com/drive/13EJ1ATvTnE0N7I0ULLvRsta7J7HdNuBi?usp=sharing">made my own variant</a> which streamlines the workflow a bit.</p>
<p>After a large amount of experimentation, I&rsquo;ve found that StyleCLIP is essentially Photoshop driven by text, with all the good, bad, and chaos that entails.</p>
<h2 id="getting-an-image-into-styleclip">Getting an Image Into StyleCLIP</h2>
<p>GANs in general work by interpreting random &ldquo;noise&rdquo; as data and generating an image from that noise. This noise is typically known as a latent vector. The paper <a href="https://arxiv.org/abs/2102.02766">Designing an Encoder for StyleGAN Image Manipulation</a> by Tov <em>et al</em> (with code <a href="https://github.com/omertov/encoder4editing">open-sourced on GitHub</a> plus a <a href="https://colab.research.google.com/github/omertov/encoder4editing/blob/main/notebooks/inference_playground.ipynb">Colab Notebook too</a>) uses an encoder to invert a given image into the latent vectors which StyleGAN can use to reconstruct the image. These vectors can then be tweaked to get a specified target image from StyleGAN. However, the inversion will only work if you invert a human-like portrait; otherwise you&rsquo;ll get garbage. And even then it may not be a perfect 1:1 map.</p>
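<p>For intuition, here&rsquo;s a toy, runnable sketch of the inversion idea using direct optimization against a stand-in generator (the paper instead trains a dedicated encoder, which is far faster and more robust):</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

# Toy optimization-based inversion: find latents whose decoded image matches
# the target. The Linear "generator" and random target are stand-ins; the
# real pipeline uses StyleGAN and an aligned portrait photo.
generator = torch.nn.Linear(512, 3 * 64 * 64)
target = torch.rand(1, 3 * 64 * 64)
z = torch.zeros(1, 512, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.01)
for _ in range(200):
    loss = F.mse_loss(generator(z), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
torch.save(z.detach(), "latents.pt")  # the latents StyleCLIP manipulates
</code></pre>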
<p>I created a <a href="https://colab.research.google.com/drive/1St3R2qAbwwTV-amfYLeyGGswtzX4HHJP?usp=sharing">streamlined notebook</a> to isolate out the creation of the latent vectors for better interoperability with StyleCLIP.</p>
<p>To demo StyleCLIP, I decided to use Facebook CEO <a href="https://www.facebook.com/zuck">Mark Zuckerberg</a>, who&rsquo;s essentially a meme in himself. I found a <a href="https://commons.wikimedia.org/wiki/File:Medvedev_and_Zuckerberg_October_2012-1.jpeg">photo of Mark Zuckerberg</a> facing the camera, cropped it, ran it through the Notebook, and behold, we have our base Zuck for hacking!</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_base_hu_ee3892dece7e25d5.webp 320w,/2021/04/styleclip/zuck_base.png 512w" src="zuck_base.png"/> 
</figure>

<h2 id="human-transmutation">Human Transmutation</h2>
<p><em>All StyleCLIP generation examples here use the <a href="https://colab.research.google.com/drive/13EJ1ATvTnE0N7I0ULLvRsta7J7HdNuBi?usp=sharing">streamlined notebook</a> and <a href="http://minimaxir.com/media/latents.pt">Mark Zuckerberg latents</a>, with the captions indicating how to reproduce the image so you can hack them yourself!</em></p>
<p>Let&rsquo;s start simple and reproduce the examples in the paper. A tanned Zuck should do the trick (in the event he <a href="https://www.buzzfeednews.com/article/katienotopoulos/mark-zuckerberg-sunscreen-surfing">forgets his sunscreen</a>).</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_tanned_hu_ccd1c3a46ad30cf5.webp 320w,/2021/04/styleclip/zuck_tanned.png 512w" src="zuck_tanned.png"
         alt="face -&gt; tanned face, beta = 0.15, alpha = 6.6"/> <figcaption>
            <p><code>face -&gt; tanned face</code>, beta = 0.15, alpha = 6.6</p>
        </figcaption>
</figure>

<p>What about giving Zuck a cool new hairdo?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_fade_hu_f858069ec9e2080f.webp 320w,/2021/04/styleclip/zuck_fade.png 512w" src="zuck_fade.png"
         alt="face with hair -&gt; face with Hi-top fade hair, beta = 0.17, alpha = 8.6"/> <figcaption>
            <p><code>face with hair -&gt; face with Hi-top fade hair</code>, beta = 0.17, alpha = 8.6</p>
        </figcaption>
</figure>

<p>Like all AI, it <a href="https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml">can cheat</a> if you give it an impossible task. What happens if you try to use StyleCLIP to increase the size of Zuck&rsquo;s nostrils, which are barely visible at all in the base photo?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_nose_hu_6cba74222da4b37a.webp 320w,/2021/04/styleclip/zuck_nose.png 512w" src="zuck_nose.png"
         alt="face with nose -&gt; face with flared nostrils, beta = 0.09, alpha = 6.3"/> <figcaption>
            <p><code>face with nose -&gt; face with flared nostrils</code>, beta = 0.09, alpha = 6.3</p>
        </figcaption>
</figure>

<p>The AI transforms his <em>entire facial structure</em> just to get his nostrils exposed and make the AI happy.</p>
<p>CLIP has seen images of everything on the internet, including public figures. Even though the StyleCLIP paper doesn&rsquo;t discuss it, why not try to transform people into other people?</p>
<p>Many AI practitioners use Tesla Technoking <a href="https://twitter.com/elonmusk">Elon Musk</a> as a test case for anything AI because <del>he generates massive SEO</del> of his contributions to AI and modern nerd culture, which is why I opted to use Zuck as a contrast.</p>
<p>Given that, I bring you, Elon Zuck.</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_elon_musk_hu_c8850f2540711b87.webp 320w,/2021/04/styleclip/zuck_elon_musk.png 512w" src="zuck_elon_musk.png"
         alt="face -&gt; Elon Musk face, beta = 0.12, alpha = 4.3"/> <figcaption>
            <p><code>face -&gt; Elon Musk face</code>, beta = 0.12, alpha = 4.3</p>
        </figcaption>
</figure>

<p>What if you see Zuck as a literal Jesus Christ?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_jc_hu_96365ee3f29a01f2.webp 320w,/2021/04/styleclip/zuck_jc.png 512w" src="zuck_jc.png"
         alt="face -&gt; Jesus Christ face, beta = 0.13, alpha = 9.1"/> <figcaption>
            <p><code>face -&gt; Jesus Christ face</code>, beta = 0.13, alpha = 9.1</p>
        </figcaption>
</figure>

<p>Due to being generated by StyleGAN, the transformations have to resemble something somewhat like a real-life human, but there&rsquo;s nothing stopping CLIP from <em>trying</em> to gravitate toward faces that aren&rsquo;t human. What if you tell StyleCLIP to transform Zuck into an anime character, such as Dragon Ball Z&rsquo;s <a href="https://dragonball.fandom.com/wiki/Goku">Goku</a>?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_goku_hu_9d43d28c12915a99.webp 320w,/2021/04/styleclip/zuck_goku.png 512w" src="zuck_goku.png"
         alt="face -&gt; Dragon Ball Z Goku face, beta = 0.09, alpha = 5.4"/> <figcaption>
            <p><code>face -&gt; Dragon Ball Z Goku face</code>, beta = 0.09, alpha = 5.4</p>
        </figcaption>
</figure>

<p>Zuck gets the hair, at least.</p>
<p>People accuse Zuck of being a robot. What if we make him <em>more</em> of a robot (as guided by a robot)?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_robot_hu_7cdf02ab6ef3767c.webp 320w,/2021/04/styleclip/zuck_robot.png 512w" src="zuck_robot.png"
         alt="face -&gt; robot face, beta = 0.08, alpha = 10"/> <figcaption>
            <p><code>face -&gt; robot face</code>, beta = 0.08, alpha = 10</p>
        </figcaption>
</figure>

<p>These are all pretty tame so far. StyleCLIP can surprisingly handle more complex prompts while still maintaining expected results.</p>
<p>Can Mark Zuckerberg do a troll face? Yes, he can!</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_troll_face_hu_2f4d40244c453fe9.webp 320w,/2021/04/styleclip/zuck_troll_face.png 512w" src="zuck_troll_face.png"
         alt="face -&gt; troll face, beta = 0.13, alpha = 9.1"/> <figcaption>
            <p><code>face -&gt; troll face</code>, beta = 0.13, alpha = 9.1</p>
        </figcaption>
</figure>

<p>We can go deeper. What about altering other attributes at the same time?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_troll_face_eyes_hu_7433223d790f481f.webp 320w,/2021/04/styleclip/zuck_troll_face_eyes.png 512w" src="zuck_troll_face_eyes.png"
         alt="face -&gt; troll face with large eyes, beta = 0.13, alpha = 9.1"/> <figcaption>
            <p><code>face -&gt; troll face with large eyes</code>, beta = 0.13, alpha = 9.1</p>
        </figcaption>
</figure>

<p>Working with CLIP rewards good <a href="https://medium.com/swlh/openai-gpt-3-and-prompt-engineering-dcdc2c5fcd29">prompt engineering</a>, an increasingly relevant AI skill with the rise of GPT-3. With more specific, complex prompts you can stretch the &ldquo;human&rdquo; constraint of StyleGAN. 👁👄👁</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_large_hu_786477a500a1f4a7.webp 320w,/2021/04/styleclip/zuck_large.png 512w" src="zuck_large.png"
         alt="face with eyes -&gt; face with very large eyes and very large mouth, beta = 0.16, alpha = 7.8"/> <figcaption>
            <p><code>face with eyes -&gt; face with very large eyes and very large mouth</code>, beta = 0.16, alpha = 7.8</p>
        </figcaption>
</figure>

<p>Experimentation is half the fun of StyleCLIP!</p>
<h2 id="antiprompts">Antiprompts</h2>
<p>You may have seen that all the examples above had positive alphas, which control the strength of the transformation. So let&rsquo;s talk about negative alphas. While positive alphas increase strength toward the target text vector, negative alphas increase strength away from the target text vector, resulting in the <em>complete opposite</em> of the prompt. This gives rise to what I call <strong>antiprompts</strong>: prompts where you intentionally target the opposite of what&rsquo;s specified, for cases where a normal prompt doesn&rsquo;t give you quite what you want.</p>
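<p>Mechanically, in the Global Directions approach, alpha scales a step along a text-derived direction in StyleGAN&rsquo;s style space, while beta thresholds which style channels are allowed to move at all. A toy sketch with random stand-in tensors:</p>
<pre><code class="language-python">import torch

# Toy sketch of a Global Directions edit (random stand-ins, not the paper's
# code): step the inverted latents along a text-derived direction. Positive
# alpha walks toward the prompt; negative alpha (an antiprompt) walks away.
latents = torch.randn(1, 18, 512)     # stand-in for inverted latents
direction = torch.randn(1, 18, 512)   # stand-in for a CLIP-derived direction
relevance = torch.rand(1, 18, 512)    # stand-in per-channel relevance scores

def edit(latents, direction, alpha, beta=0.1):
    # beta zeroes out channels deemed irrelevant to the text prompt.
    masked = torch.where(relevance >= beta, direction, torch.zeros_like(direction))
    return latents + alpha * masked

laughing = edit(latents, direction, alpha=6.3)   # toward "laughing face"
serious = edit(latents, direction, alpha=-6.3)   # away from it
</code></pre>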
<p>Let&rsquo;s see if Zuck can make a serious face.</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_serious_hu_32740f594e06e456.webp 320w,/2021/04/styleclip/zuck_serious.png 512w" src="zuck_serious.png"
         alt="face -&gt; serious face, beta = 0.09, alpha = 6.3"/> <figcaption>
            <p><code>face -&gt; serious face</code>, beta = 0.09, alpha = 6.3</p>
        </figcaption>
</figure>

<p>More pouty than serious. But what if he does the opposite of a laughing face?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_laughing_hu_ed4cc9a8e97f43f.webp 320w,/2021/04/styleclip/zuck_laughing.png 512w" src="zuck_laughing.png"
         alt="face -&gt; laughing face, beta = 0.09, alpha = -6.3"/> <figcaption>
            <p><code>face -&gt; laughing face</code>, beta = 0.09, alpha = -6.3</p>
        </figcaption>
</figure>

<p>That&rsquo;s more like it.</p>
<p>It doesn&rsquo;t stop there. In the previous section we saw what happens when you give prompts of people and compound prompts. What, you may ask, does the AI think is the opposite of a <em>person</em>?</p>
<p>In the Goku example above, Zuck got larger, darker hair, more pale skin, and a chonky neck. What happens if you do the inverse?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_goku_inv_hu_c5d010ce3249d6d1.webp 320w,/2021/04/styleclip/zuck_goku_inv.png 512w" src="zuck_goku_inv.png"
         alt="face -&gt; Dragon Ball Z Goku face, beta = 0.09, alpha = -5.4"/> <figcaption>
            <p><code>face -&gt; Dragon Ball Z Goku face</code>, beta = 0.09, alpha = -5.4</p>
        </figcaption>
</figure>

<p>His hair is smaller and blonde, his skin is more tan, and he barely has a neck at all.</p>
<p>What if you make Zuck the opposite of a robot? Does he become human?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_robot_inv_hu_95925e84b159f9e1.webp 320w,/2021/04/styleclip/zuck_robot_inv.png 512w" src="zuck_robot_inv.png"
         alt="face -&gt; robot face, beta = 0.08, alpha = -10"/> <figcaption>
            <p><code>face -&gt; robot face</code>, beta = 0.08, alpha = -10</p>
        </figcaption>
</figure>

<p>He becomes <a href="https://en.wikipedia.org/wiki/Pedro_Pascal">Pedro Pascal</a> apparently.</p>
<h2 id="video-ai-algorithms">Video AI Algorithms</h2>
<p>A fun feature I added to the notebook is the ability to make videos, by generating frames from zero alpha to the target alpha and rendering them using <a href="https://www.ffmpeg.org/">ffmpeg</a>. Through that, we can see these wonderful transformations occur at a disturbingly smooth 60fps!</p>
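<p>A sketch of how the frames come together, where <code>generate_image</code> and <code>latents</code> are hypothetical stand-ins for the notebook&rsquo;s actual generation step:</p>
<pre><code class="language-python">import subprocess

# generate_image() and latents are hypothetical stand-ins for the notebook's
# generation step; each frame re-renders the edit at an interpolated alpha.
n_frames = 240
target_alpha = 6.3
for i in range(n_frames):
    alpha = target_alpha * i / (n_frames - 1)  # interpolate 0 -> target alpha
    frame = generate_image(latents, "face -> face with flared nostrils", alpha=alpha)
    frame.save(f"frames/frame_{i:04d}.png")

# Stitch the frames into a 60fps mp4 with ffmpeg.
subprocess.run(["ffmpeg", "-framerate", "60", "-i", "frames/frame_%04d.png",
                "-pix_fmt", "yuv420p", "zuck_nose.mp4"])
</code></pre>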
<p>Animations are a great way to fully illustrate how the AI can cheat, such as with the flared nostrils example above.</p>
<video controls>
  <source src="/2021/04/styleclip/zuck_nose.mp4" type="video/mp4">
</video>

<p>Or you can opt for pure chaos and do one of the more complex transformations. 👁👄👁</p>
<video controls>
  <source src="/2021/04/styleclip/zuck_large.mp4" type="video/mp4">
</video>

<p>TikTok will have a lot of fun with this!</p>
<h2 id="ethics-and-biases">Ethics and Biases</h2>
<p>Let&rsquo;s address the elephant in the room: is it ethical to edit photos with AI like this?</p>
<p>My take is that StyleCLIP is no different than what <a href="https://www.adobe.com/products/photoshop.html">Adobe Photoshop</a> has done for decades. Unlike deepfakes, these transformations are by construction constrained to human portraits and can&rsquo;t be used in other contexts to mislead or deceive. Turning Mark Zuckerberg into Elon Musk would not cause a worldwide panic. <a href="https://www.faceapp.com/">FaceApp</a>, which does a similar style of image editing, was released years ago and still tops the App Store charts without causing democracy to implode. That said, I recommend only using StyleCLIP on public figures.</p>
<p>In my testing, there is definitely an issue of model bias, both within StyleGAN and within CLIP. A famous example of gender bias in AI is a propensity to assign <a href="https://qz.com/1141122/google-translates-gender-bias-pairs-he-with-hardworking-and-she-with-lazy-and-other-examples/">gender to gender neutral terms</a>, such as <code>He is a soldier. She is a teacher</code>. Let&rsquo;s try both for Zuck.</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_soldier_hu_89743eafc8a43bed.webp 320w,/2021/04/styleclip/zuck_soldier.png 512w" src="zuck_soldier.png"
         alt="face -&gt; soldier face, beta = 0.1, alpha = 7.2"/> <figcaption>
            <p><code>face -&gt; soldier face</code>, beta = 0.1, alpha = 7.2</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_teacher_hu_8d00746470894674.webp 320w,/2021/04/styleclip/zuck_teacher.png 512w" src="zuck_teacher.png"
         alt="face -&gt; teacher face, beta = 0.13, alpha = 5.6"/> <figcaption>
            <p><code>face -&gt; teacher face</code>, beta = 0.13, alpha = 5.6</p>
        </figcaption>
</figure>

<p>Unfortunately it still holds true.</p>
<p>It is surprisingly easy to get the model to perform racist/sexist/ageist transformations without much prodding. Inputting <code>face with white skin -&gt; face with black skin</code> does what you think it would do. Making similar transformations based on race/sex/age does indeed work, and I am deliberately not demoing them. If you do experiment around these biases, I recommend careful consideration before posting the outputs.</p>
<h2 id="the-future-of-ai-image-editing">The Future of AI Image Editing</h2>
<p>StyleCLIP is a fun demo on the potential of AI-based image editing. Although not the most pragmatic way to edit portraits, it&rsquo;s fun to see just how well (or how poorly) it can adapt to certain prompts.</p>
<p>Even though everything noted in this blog post is open-sourced, don&rsquo;t think about trying to sell StyleCLIP as a product: StyleGAN2 (which in the end is responsible for generating the image) and its variants were released under <a href="https://nvlabs.github.io/stylegan2/license.html">non-commercial licenses</a>. But it wouldn&rsquo;t surprise me if someone uses the techniques noted in the papers to create their own, more efficient StyleCLIP with a bespoke GAN and spawn an entirely new industry.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
