<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>GANs on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/tag/gans/</link>
    <description>Recent content in GANs on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Wed, 18 Aug 2021 08:45:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/tag/gans/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>How to Generate Customized AI Art Using VQGAN and CLIP</title>
      <link>https://minimaxir.com/2021/08/vqgan-clip/</link>
      <pubDate>Wed, 18 Aug 2021 08:45:00 -0700</pubDate>
      <guid>https://minimaxir.com/2021/08/vqgan-clip/</guid>
      <description>Knowing how AI art is made is the key to making even better AI art.</description>
      <content:encoded><![CDATA[<style>pre code { white-space: pre; }</style>
<p>The latest and greatest AI content generation trend is AI generated art. In January 2021, <a href="https://openai.com/">OpenAI</a> demoed <a href="https://openai.com/blog/dall-e/">DALL-E</a>, a GPT-3 variant which creates images instead of text. More importantly, it can create images in response to a text prompt, allowing for some very fun output.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/avocado_hu_a758e21fc220789.webp 320w,/2021/08/vqgan-clip/avocado_hu_b17b8218450473b0.webp 768w,/2021/08/vqgan-clip/avocado_hu_f18c1c7ad2c98eac.webp 1024w,/2021/08/vqgan-clip/avocado.png 1632w" src="avocado.png"
         alt="DALL-E demo, via OpenAI."/> <figcaption>
            <p>DALL-E demo, <a href="https://openai.com/blog/dall-e/">via OpenAI</a>.</p>
        </figcaption>
</figure>

<p>However, the generated images are not always coherent, so OpenAI also demoed <a href="https://openai.com/blog/clip/">CLIP</a>, which can be used to translate an image into text and therefore identify which generated images were actually avocado armchairs. CLIP was then <a href="https://github.com/openai/CLIP">open-sourced</a>, although DALL-E was not.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/guacamole_hu_af13ecf1e14e0f91.webp 320w,/2021/08/vqgan-clip/guacamole_hu_af49e75a81bb35b1.webp 768w,/2021/08/vqgan-clip/guacamole_hu_72b24e07c3f1faad.webp 1024w,/2021/08/vqgan-clip/guacamole.png 1198w" src="guacamole.png"
         alt="CLIP demo, via OpenAI."/> <figcaption>
            <p>CLIP demo, <a href="https://openai.com/blog/clip/">via OpenAI</a>.</p>
        </figcaption>
</figure>

<p>Since CLIP is essentially an interface between representations of text and image data, clever hacking can allow anyone to create their own pseudo-DALL-E. The first implementation was <a href="https://github.com/lucidrains/big-sleep">Big Sleep</a> by Ryan Murdock/<a href="https://twitter.com/advadnoun">@advadnoun</a>, which combined CLIP with an image generating <a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">GAN</a> named <a href="https://arxiv.org/abs/1809.11096">BigGAN</a>. Then open source worked its magic: the GAN base was changed to <a href="https://github.com/CompVis/taming-transformers">VQGAN</a>, a newer model architecture by Patrick Esser, Robin Rombach, and Björn Ommer which allows more coherent image generation. The core CLIP-guided training was improved and translated <a href="https://colab.research.google.com/drive/1L8oL-vLJXVcRzCFbPwOoMkPKJ8-aYdPN">to a Colab Notebook</a> by Katherine Crowson/<a href="https://twitter.com/RiversHaveWings">@RiversHaveWings</a> and others in a special Discord server. Twitter accounts like <a href="https://twitter.com/images_ai">@images_ai</a> and <a href="https://twitter.com/ai_curio">@ai_curio</a> which leverage VQGAN + CLIP with user-submitted prompts have gone viral and <a href="https://www.newyorker.com/culture/infinite-scroll/appreciating-the-poetic-misunderstandings-of-ai-art">received mainstream press</a>. <a href="https://twitter.com/ak92501">@ak92501</a> <a href="https://twitter.com/ak92501/status/1421246864649773058">created</a> a <a href="https://colab.research.google.com/drive/1Foi0mCSE6NrW9oI3Fhni7158Krz4ZXdH?usp=sharing">fork of that Notebook</a> with a user-friendly UI, through which I became aware of how far AI image generation had come in just a few months.</p>
<p>From that, I forked <a href="https://colab.research.google.com/drive/1wkF67ThUz37T2_oPIuSwuO4e_-0vjaLs?usp=sharing">my own Colab Notebook</a>, and streamlined the UI a bit to minimize the number of clicks needed to start generating and make it more mobile-friendly.</p>
<p>The VQGAN + CLIP technology is now in a good state such that it can be used for more serious experimentation. Some say art is better when there&rsquo;s mystery, but my view is that knowing how AI art is made is the key to making even better AI art.</p>
<h2 id="a-hello-world-to-ai-generated-art">A Hello World to AI Generated Art</h2>
<p><em>All AI-generated image examples in this blog post are generated using <a href="https://colab.research.google.com/drive/1wkF67ThUz37T2_oPIuSwuO4e_-0vjaLs?usp=sharing">this Colab Notebook</a>, with the captions indicating the text prompt and other relevant deviations from the default inputs to reproduce the image.</em></p>
<p>Let&rsquo;s jump right into it with something fantastical: how well can AI generate a cyberpunk forest?</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/cyberpunk_forest_hu_4ba90fadcee22967.webp 320w,/2021/08/vqgan-clip/cyberpunk_forest.png 592w" src="cyberpunk_forest.png"
         alt="cyberpunk forest"/> <figcaption>
            <p><code>cyberpunk forest</code></p>
        </figcaption>
</figure>

<p>The TL;DR of how VQGAN + CLIP works is that VQGAN generates an image, CLIP scores the image according to how well it can detect the input prompt, and VQGAN uses that information to iteratively improve its image generation. Lj Miranda has a <a href="https://ljvmiranda921.github.io/notebook/2021/08/08/clip-vqgan/">good detailed technical writeup</a>.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/clip_vqgan_with_image_hu_5e04a615f6af5ff.webp 320w,/2021/08/vqgan-clip/clip_vqgan_with_image_hu_7546c61d3cb746e.webp 768w,/2021/08/vqgan-clip/clip_vqgan_with_image_hu_d4c317842e36f301.webp 1024w,/2021/08/vqgan-clip/clip_vqgan_with_image.png 1067w" src="clip_vqgan_with_image.png"
         alt="via Lj Miranda. Modified for theme friendliness."/> <figcaption>
            <p><a href="https://ljvmiranda921.github.io/notebook/2021/08/08/clip-vqgan/">via Lj Miranda</a>. Modified for theme friendliness.</p>
        </figcaption>
</figure>
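<p>Conceptually, the optimization loop is only a few lines of PyTorch. Here&rsquo;s a minimal sketch of CLIP-guided generation, where <code>vqgan</code>, <code>clip_model</code>, and their methods are hypothetical stand-ins rather than the Notebook&rsquo;s actual code:</p>
<pre><code class="language-python">import torch

# Conceptual sketch of CLIP-guided VQGAN generation; vqgan.decode() and
# clip_model.encode_image() are hypothetical stand-ins for the real models.
def generate(vqgan, clip_model, prompt_embedding, steps=300, lr=0.05):
    z = torch.randn(1, 256, 16, 16, requires_grad=True)  # latent "image"
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = vqgan.decode(z)  # VQGAN renders the latents into pixels
        image_embedding = clip_model.encode_image(image)
        # CLIP scores the render: cosine distance to the text prompt embedding
        loss = 1 - torch.cosine_similarity(image_embedding, prompt_embedding).mean()
        opt.zero_grad()
        loss.backward()  # gradients flow back into the latents
        opt.step()       # ...nudging the image toward the prompt
    return vqgan.decode(z)
</code></pre>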

<p>Now let&rsquo;s do the same prompt as before, but with an added artist from a time well before the cyberpunk genre existed and see if the AI can follow their style. Let&rsquo;s try <a href="https://www.wikiart.org/en/salvador-dali">Salvador Dali</a>.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali_hu_3ad61193478875b7.webp 320w,/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali.png 592w" src="cyberpunk_forest_by_salvador_dali.png"
         alt="cyberpunk forest by Salvador Dali"/> <figcaption>
            <p><code>cyberpunk forest by Salvador Dali</code></p>
        </figcaption>
</figure>

<p>It&rsquo;s definitely a cyberpunk forest, and it&rsquo;s definitely Dali&rsquo;s style.</p>
<p>One trick the community found to improve generated image quality is to simply add phrases that tell the AI to make a <em>good</em> image, such as <code>artstationHQ</code> or <code>trending on /r/art</code>. Trying that here:</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali_artstationhq_hu_e9392ca8f1eb7213.webp 320w,/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali_artstationhq.png 592w" src="cyberpunk_forest_by_salvador_dali_artstationhq.png"
         alt="cyberpunk forest by Salvador Dali artstationHQ"/> <figcaption>
            <p><code>cyberpunk forest by Salvador Dali artstationHQ</code></p>
        </figcaption>
</figure>

<p>In this case, it&rsquo;s unclear if the <code>artstationHQ</code> part of the prompt gets higher priority than the <code>Salvador Dali</code> part. Another trick that VQGAN + CLIP can do is take multiple input text prompts, which can add more control. Additionally, you can assign weights to these different prompts. So if we did <code>cyberpunk forest by Salvador Dali:3 | artstationHQ</code>, the model would weight the Salvador Dali prompt three times as heavily as the <code>artstationHQ</code> prompt.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali_3_artstationhq_hu_948ad338bfc41f2.webp 320w,/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali_3_artstationhq.png 592w" src="cyberpunk_forest_by_salvador_dali_3_artstationhq.png"
         alt="cyberpunk forest by Salvador Dali:3 | artstationHQ"/> <figcaption>
            <p><code>cyberpunk forest by Salvador Dali:3 | artstationHQ</code></p>
        </figcaption>
</figure>

<p>Much better! Lastly, we can use negative weights for prompts such that the model targets the opposite of that prompt. Let&rsquo;s do the opposite of <code>green and white</code> to see if the AI tries to remove those two colors from the palette and maybe make the final image more cyberpunky.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali_3_artstationhq_gw_hu_166c44dff41886a2.webp 320w,/2021/08/vqgan-clip/cyberpunk_forest_by_salvador_dali_3_artstationhq_gw.png 592w" src="cyberpunk_forest_by_salvador_dali_3_artstationhq_gw.png"
         alt="cyberpunk forest by Salvador Dali:3 | artstationHQ | green and white:-1"/> <figcaption>
            <p><code>cyberpunk forest by Salvador Dali:3 | artstationHQ | green and white:-1</code></p>
        </figcaption>
</figure>

<p>Now we&rsquo;re getting to video game concept art quality generation. Indeed, VQGAN + CLIP rewards the use of clever input prompt engineering.</p>
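<p>If you&rsquo;re curious how the <code>prompt:weight | prompt</code> syntax breaks down, here&rsquo;s a rough sketch of a parser for it (my own approximation, not the Notebook&rsquo;s exact code). Each resulting pair contributes <code>weight * clip_loss</code> to the total loss, so negative weights push the image <em>away</em> from that prompt:</p>
<pre><code class="language-python">def parse_prompts(prompt_string):
    """Split 'a:3 | b | c:-1' into (text, weight) pairs, defaulting weight to 1."""
    pairs = []
    for part in prompt_string.split("|"):
        text, sep, weight = part.strip().rpartition(":")
        try:
            pairs.append((text.strip(), float(weight)))
        except ValueError:  # no numeric weight given
            pairs.append((part.strip(), 1.0))
    return pairs

print(parse_prompts("cyberpunk forest by Salvador Dali:3 | artstationHQ | green and white:-1"))
# [('cyberpunk forest by Salvador Dali', 3.0), ('artstationHQ', 1.0), ('green and white', -1.0)]
</code></pre>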
<h2 id="initial-images-and-style-transfer">Initial Images and Style Transfer</h2>
<p>Normally with VQGAN + CLIP, the generation starts from a blank slate. However, you can optionally provide an image to start from instead. This provides both a good base for generation and speeds it up since it doesn&rsquo;t have to learn from empty noise. I usually recommend a lower learning rate as a result.</p>
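<p>In the loop sketch earlier, using an initial image just means seeding the latents from that image instead of from random noise (with <code>vqgan.encode()</code> again a hypothetical stand-in):</p>
<pre><code class="language-python">from PIL import Image
from torchvision.transforms import functional as TF

def latents_from_image(vqgan, image_path):
    # vqgan.encode() is a hypothetical stand-in for the real VQGAN encoder.
    init = TF.to_tensor(Image.open(image_path)).unsqueeze(0)
    return vqgan.encode(init).detach().requires_grad_(True)

# z = latents_from_image(vqgan, "max.png"), then optimize z as before,
# typically with a lower learning rate to keep more of the original image.
</code></pre>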
<p>So let&rsquo;s try an initial image of myself, naturally.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/max_hu_458a2426943e0485.webp 320w,/2021/08/vqgan-clip/max.png 600w" src="max.png"
         alt="No, I am not an AI Generated person. Hopefully."/> <figcaption>
            <p>No, I am not an AI Generated person. Hopefully.</p>
        </figcaption>
</figure>

<p>Let&rsquo;s try another artist, such as <a href="https://en.wikipedia.org/wiki/Junji_Ito">Junji Ito</a>, who has a very distinctive horror <a href="https://www.google.com/search?q=junji&#43;ito&#43;images">style of art</a>.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/max_junji_ito_hu_d11d0c8ed7eb69af.webp 320w,/2021/08/vqgan-clip/max_junji_ito.png 592w" src="max_junji_ito.png"
         alt="a black and white portrait by Junji Ito — initial image above, learning rate = 0.1"/> <figcaption>
            <p><code>a black and white portrait by Junji Ito</code> — initial image above, learning rate = 0.1</p>
        </figcaption>
</figure>

<p>One of the earliest promising use cases of AI Image Generation was <a href="https://www.tensorflow.org/tutorials/generative/style_transfer">neural style transfer</a>, where an AI could take the &ldquo;style&rdquo; of one image and transpose it to another. Can it follow the style of a specific painting, such as <a href="https://www.vangoghgallery.com/painting/starry-night.html">Starry Night</a> by <a href="https://en.wikipedia.org/wiki/Vincent_van_Gogh">Vincent Van Gogh</a>?</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/max_starry_night_hu_79b050ddfd750a23.webp 320w,/2021/08/vqgan-clip/max_starry_night.png 592w" src="max_starry_night.png"
         alt="Starry Night by Vincent Van Gogh — initial image above, learning rate = 0.1"/> <figcaption>
            <p><code>Starry Night by Vincent Van Gogh</code> — initial image above, learning rate = 0.1</p>
        </figcaption>
</figure>

<p>Well, it got the colors and style, but the AI appears to have taken the &ldquo;Van Gogh&rdquo; part literally and gave me a nice beard.</p>
<p>Of course, with the power of AI, you can do both prompts at the same time for maximum chaos.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/max_junji_ito_starry_night_hu_836dac4f5598d721.webp 320w,/2021/08/vqgan-clip/max_junji_ito_starry_night.png 592w" src="max_junji_ito_starry_night.png"
         alt="Starry Night by Vincent Van Gogh | a black and white portrait by Junji Ito — initial image above, learning rate = 0.1"/> <figcaption>
            <p><code>Starry Night by Vincent Van Gogh | a black and white portrait by Junji Ito</code> — initial image above, learning rate = 0.1</p>
        </figcaption>
</figure>

<h2 id="icons-and-generating-images-with-a-specific-shape">Icons and Generating Images With A Specific Shape</h2>
<p>While I was first experimenting with VQGAN + CLIP, I saw <a href="https://twitter.com/mark_riedl/status/1421282588791132161">an interesting tweet</a> by AI researcher Mark Riedl:</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/mark_riedl/status/1421282588791132161"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>Intrigued, I adapted some icon generation code I had handy <a href="https://github.com/minimaxir/stylecloud">from another project</a> and created <a href="https://github.com/minimaxir/icon-image">icon-image</a>, a Python tool to programmatically generate an icon using <a href="https://fontawesome.com/">Font Awesome</a> icons and paste it onto a noisy background.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/icon_robot_hu_7a372246756b89fb.webp 320w,/2021/08/vqgan-clip/icon_robot.png 600w" src="icon_robot.png"
         alt="The default icon image used in the Colab Notebook"/> <figcaption>
            <p>The default icon image used in the Colab Notebook</p>
        </figcaption>
</figure>
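<p>The gist of the tool is straightforward image compositing. Here&rsquo;s a rough Pillow/NumPy sketch of the idea (not the actual icon-image source; <code>robot.png</code> is a hypothetical icon file):</p>
<pre><code class="language-python">import numpy as np
from PIL import Image

def icon_on_noise(icon_path, size=600, icon_opacity=0.9, noise_opacity=0.3):
    # RGB noise field, dimmed by noise_opacity, as the background.
    noise = (np.random.rand(size, size, 3) * 255 * noise_opacity).astype("uint8")
    background = Image.fromarray(noise, mode="RGB")
    # Scale the icon's alpha channel so the noise shows through it too.
    icon = Image.open(icon_path).convert("RGBA").resize((size // 2, size // 2))
    alpha = icon.getchannel("A").point(lambda a: int(a * icon_opacity))
    icon.putalpha(alpha)
    background.paste(icon, (size // 4, size // 4), mask=icon)  # centered
    return background

icon_on_noise("robot.png").save("init_image.png")
</code></pre>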

<p>This icon can be used as an initial image, as above. Adjusting the text prompt to accommodate the icon can result in very cool images, such as <code>a black and white evil robot by Junji Ito</code>.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/robot_junji_ito_hu_dccb3a3ced294446.webp 320w,/2021/08/vqgan-clip/robot_junji_ito.png 592w" src="robot_junji_ito.png"
         alt="a black and white evil robot by Junji Ito — initial image above, learning rate = 0.1"/> <figcaption>
            <p><code>a black and white evil robot by Junji Ito</code> — initial image above, learning rate = 0.1</p>
        </figcaption>
</figure>

<p>The background and icon noise is the key, as the AI can shape noise much more easily than solid colors. Omitting the noise results in a more boring image that doesn&rsquo;t reflect the prompt as well, although it has its own style.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/robot_junji_ito_nonoise_hu_1fd3e72d34e39b97.webp 320w,/2021/08/vqgan-clip/robot_junji_ito_nonoise.png 592w" src="robot_junji_ito_nonoise.png"
         alt="a black and white evil robot by Junji Ito — initial image above except 1.0 icon opacity and 0.0 background noice opacity, learning rate = 0.1"/> <figcaption>
            <p><code>a black and white evil robot by Junji Ito</code> — initial image above except 1.0 icon opacity and 0.0 background noice opacity, learning rate = 0.1</p>
        </figcaption>
</figure>

<p>Another fun prompt addition is <code>rendered in unreal engine</code> (with an optional <code>high quality</code>), which instructs the AI to create a three-dimensional image and works especially well with icons.</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/robot_unreal_hu_b4d3e0483b500717.webp 320w,/2021/08/vqgan-clip/robot_unreal.png 592w" src="robot_unreal.png"
         alt="smiling rusted robot rendered in unreal engine high quality — icon initial image, learning rate = 0.1"/> <figcaption>
            <p><code>smiling rusted robot rendered in unreal engine high quality</code> — icon initial image, learning rate = 0.1</p>
        </figcaption>
</figure>

<p>icon-image can also generate brand images, such as the <a href="https://twitter.com/">Twitter</a> logo, which can be good for comedy, especially if you tweak the logo/background colors as well. What if we turn the Twitter logo into <a href="https://www.google.com/search?q=mordor&#43;images">Mordor</a>, which is a fair metaphor?</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/twitter_mordor_hu_5dc5a61efb797269.webp 320w,/2021/08/vqgan-clip/twitter_mordor.png 592w" src="twitter_mordor.png"
         alt="Mordor — fab fa-twitter icon, icon initial image, black icon background, red icon, learning rate = 0.1"/> <figcaption>
            <p><code>Mordor</code> — <code>fab fa-twitter</code> icon, icon initial image, black icon background, red icon, learning rate = 0.1</p>
        </figcaption>
</figure>

<p>So that didn&rsquo;t turn out well, as the Twitter logo got overpowered by the prompt (you can see outlines of the logo&rsquo;s bottom). However, there&rsquo;s a trick to force the AI to respect the logo: set the icon as the initial image <em>and</em> the target image, and apply a high weight to the prompt (the weight can be lowered iteratively to preserve the logo better).</p>
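<p>In loss terms, the trick adds a second similarity target. Continuing the earlier hypothetical sketch, the total loss mixes the heavily weighted text prompt with closeness to the icon&rsquo;s own CLIP embedding:</p>
<pre><code class="language-python">import torch.nn.functional as F

# Sketch with hypothetical CLIP embeddings: a weighted text-prompt loss plus
# an image-similarity loss toward the icon used as the target image.
def combined_loss(image_emb, prompt_emb, icon_emb, prompt_weight=3.0):
    prompt_loss = 1 - F.cosine_similarity(image_emb, prompt_emb).mean()
    target_loss = 1 - F.cosine_similarity(image_emb, icon_emb).mean()
    return prompt_weight * prompt_loss + target_loss
</code></pre>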
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/twitter_mordor_2_hu_c8d00364084e21bc.webp 320w,/2021/08/vqgan-clip/twitter_mordor_2.png 592w" src="twitter_mordor_2.png"
         alt="Mordor:3 — fab fa-twitter icon, icon initial image, icon target image, black icon background, red icon, learning rate = 0.1"/> <figcaption>
            <p><code>Mordor:3</code> — <code>fab fa-twitter</code> icon, icon initial image, icon target image, black icon background, red icon, learning rate = 0.1</p>
        </figcaption>
</figure>

<h2 id="more-fun-examples">More Fun Examples</h2>
<p>Here&rsquo;s a few more good demos of what VQGAN + CLIP can do using the ideas and tricks above:</p>
<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/excel_hu_82c869ac7653bce0.webp 320w,/2021/08/vqgan-clip/excel.png 592w" src="excel.png"
         alt="Microsoft Excel by Junji Ito — 500 steps"/> <figcaption>
            <p><code>Microsoft Excel by Junji Ito</code> — 500 steps</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/zuck_hu_fe52aa1dc05ed15e.webp 320w,/2021/08/vqgan-clip/zuck.png 592w" src="zuck.png"
         alt="a portrait of Mark Zuckerberg:2 | a portrait of a bottle of Sweet Baby Ray&#39;s barbecue sauce — 500 steps"/> <figcaption>
            <p><code>a portrait of Mark Zuckerberg:2 | a portrait of a bottle of Sweet Baby Ray's barbecue sauce</code> — 500 steps</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/rickroll_hu_6912b4de0321d7e3.webp 320w,/2021/08/vqgan-clip/rickroll.png 592w" src="rickroll.png"
         alt="Never gonna give you up, Never gonna let you down — 500 steps"/> <figcaption>
            <p><code>Never gonna give you up, Never gonna let you down</code> — 500 steps</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/elon_hu_d4768f4d846c8269.webp 320w,/2021/08/vqgan-clip/elon.png 592w" src="elon.png"
         alt="a portrait of cyberpunk Elon Musk:2 | a human:-1 — 500 steps"/> <figcaption>
            <p><code>a portrait of cyberpunk Elon Musk:2 | a human:-1</code> — 500 steps</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/hamburger_hu_cd523739402fd119.webp 320w,/2021/08/vqgan-clip/hamburger.png 592w" src="hamburger.png"
         alt="hamburger of the Old Gods:5 — fas fa-hamburger icon, icon initial image, icon target image, black icon background, white icon, learning rate = 0.1, 500 steps"/> <figcaption>
            <p><code>hamburger of the Old Gods:5</code> — <code>fas fa-hamburger</code> icon, icon initial image, icon target image, black icon background, white icon, learning rate = 0.1, 500 steps</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2021/08/vqgan-clip/reality_hu_288a071a0017cc2.webp 320w,/2021/08/vqgan-clip/reality.png 592w" src="reality.png"
         alt="reality is an illusion:8 — fas fa-eye icon, icon initial image, icon target image, black icon background, white icon, learning rate = 0.1"/> <figcaption>
            <p><code>reality is an illusion:8</code> — <code>fas fa-eye</code> icon, icon initial image, icon target image, black icon background, white icon, learning rate = 0.1</p>
        </figcaption>
</figure>

<p>@kingdomakrillic <a href="https://imgur.com/a/SnSIQRu">released an Imgur album</a> with <em>many</em> more examples of prompt augmentations and their results.</p>
<h2 id="making-money-off-of-vqgan--clip">Making Money Off of VQGAN + CLIP</h2>
<p>Can these AI generated images be commercialized as <a href="https://en.wikipedia.org/wiki/Software_as_a_service">software-as-a-service</a>? It&rsquo;s unclear. In contrast to <a href="https://github.com/NVlabs/stylegan2">StyleGAN2</a> images (where the <a href="https://nvlabs.github.io/stylegan2/license.html">license</a> is explicitly noncommercial), all aspects of the VQGAN + CLIP pipeline are MIT licensed, which does support commercialization. However, the ImageNet 16384 VQGAN used in this Colab Notebook and many other VQGAN + CLIP Notebooks was trained on <a href="https://www.image-net.org/">ImageNet</a>, which has <a href="https://www.reddit.com/r/MachineLearning/comments/id4394/d_is_it_legal_to_use_models_pretrained_on/">famously complicated licensing</a>, and whether finetuning the VQGAN counts as sufficiently detached from an IP perspective hasn&rsquo;t been legally tested to my knowledge. There are other VQGANs available, such as ones trained on the <a href="https://opensource.google/projects/open-images-dataset">Open Images Dataset</a> or <a href="https://cocodataset.org/">COCO</a>, both of which have commercial-friendly <a href="https://creativecommons.org/licenses/by/4.0/">CC-BY-4.0</a> licenses, although in my testing they had substantially lower image generation quality.</p>
<p>Granted, the biggest blocker to making money off of VQGAN + CLIP in a scalable manner is generation speed: unlike most commercial AI models, which only run inference and can therefore be optimized to drastically increase performance, VQGAN + CLIP requires an optimization run for every image, which is much slower and can&rsquo;t generate content in real time like <a href="https://openai.com/blog/openai-api/">GPT-3</a>. Even with expensive GPUs and small image sizes, generation takes a couple minutes at minimum, which translates to a higher cost-per-image and annoyed users. It&rsquo;s still cheaper per image than what OpenAI charges for their GPT-3 API, though, and many startups have built on that successfully.</p>
<p>Of course, if you just want to make <a href="https://en.wikipedia.org/wiki/Non-fungible_token">NFTs</a> from manual usage of VQGAN + CLIP, go ahead.</p>
<h2 id="the-next-steps-for-ai-image-generation">The Next Steps for AI Image Generation</h2>
<p>CLIP itself is just the first practical iteration of text-to-image translation, and I suspect this won&rsquo;t be the last implementation of such a model (OpenAI may pull a GPT-3 and not open-source the inevitable CLIP-2 now that there&rsquo;s a proven monetizable use case).</p>
<p>However, the AI Art Generation industry is developing at a record pace, especially on the image-generating part of the equation. Just the day before this article was posted, Katherine Crowson <a href="https://twitter.com/RiversHaveWings/status/1427580354651586562">released</a> a <a href="https://colab.research.google.com/drive/1QBsaDAZv8np29FPbvjffbE1eytoJcsgA">Colab Notebook</a> for CLIP with Guided Diffusion, which generates <a href="https://twitter.com/RiversHaveWings/status/1427746442727149568">more realistic</a> images (albeit less fantastical), and Tom White <a href="https://twitter.com/dribnet/status/1427613617973653505">released</a> a <a href="https://colab.research.google.com/github/dribnet/clipit/blob/master/demos/PixelDrawer.ipynb">pixel art generating Notebook</a> which doesn&rsquo;t use a VQGAN variant.</p>
<p>The possibilities with just VQGAN + CLIP alone are endless.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Easily Transform Portraits of People into AI Aberrations Using StyleCLIP</title>
      <link>https://minimaxir.com/2021/04/styleclip/</link>
      <pubDate>Fri, 30 Apr 2021 08:55:00 -0700</pubDate>
      <guid>https://minimaxir.com/2021/04/styleclip/</guid>
      <description>StyleCLIP is essentially Photoshop driven by text, with all the good, bad, and chaos that entails.</description>
      <content:encoded><![CDATA[<p><em><strong>tl;dr</strong> follow the instructions in <a href="https://colab.research.google.com/drive/13EJ1ATvTnE0N7I0ULLvRsta7J7HdNuBi?usp=sharing">this Colab Notebook</a> to generate your own AI Aberration images and videos! If you want to use your own images, follow the instructions in <a href="https://colab.research.google.com/drive/1St3R2qAbwwTV-amfYLeyGGswtzX4HHJP?usp=sharing">this Colab Notebook first</a>!</em></p>
<p>GANs, <a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">generative adversarial networks</a>, are all the rage nowadays for creating AI-based imagery. You&rsquo;ve probably seen GANs used in tools like <a href="https://thispersondoesnotexist.com/">thispersondoesnotexist.com</a>, which currently uses NVIDIA&rsquo;s extremely powerful open-source <a href="https://github.com/NVlabs/stylegan2">StyleGAN2</a>.</p>
<p>In 2021, <a href="https://openai.com/">OpenAI</a> open-sourced <a href="https://github.com/openai/CLIP">CLIP</a>, a model which can give textual classification predictions for a provided image. Since CLIP effectively interfaces between text data and image data, you can theoretically map that text data to StyleGAN. Enter <a href="https://arxiv.org/abs/2103.17249">StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery</a>, a paper by Patashnik, Wu <em>et al</em> (with code <a href="https://github.com/orpatashnik/StyleCLIP">open-sourced on GitHub</a>) which allows CLIP vectors to be used to guide StyleGAN generations through user-provided text.</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/teaser_hu_d4bef5702d7835fd.webp 320w,/2021/04/styleclip/teaser_hu_1093876764fb12ab.webp 768w,/2021/04/styleclip/teaser_hu_23955890274ad6a7.webp 1024w,/2021/04/styleclip/teaser.png 1257w" src="teaser.png"
         alt="From the paper: the left-most image is the input; the other images are the result of the prompt at the top."/> <figcaption>
            <p>From the paper: the left-most image is the input; the other images are the result of the prompt at the top.</p>
        </figcaption>
</figure>

<p>The authors have also provided easy-to-use Colab Notebooks to help set up these models and run them on a GPU for free. The most interesting one is the <a href="https://colab.research.google.com/github/orpatashnik/StyleCLIP/blob/main/notebooks/StyleCLIP_global.ipynb">Global Directions notebook</a>, which allows the end user to do what is listed in the image above, and I&rsquo;ve <a href="https://colab.research.google.com/drive/13EJ1ATvTnE0N7I0ULLvRsta7J7HdNuBi?usp=sharing">made my own variant</a> which streamlines the workflow a bit.</p>
<p>After a large amount of experimentation, I&rsquo;ve found that StyleCLIP is essentially Photoshop driven by text, with all the good, bad, and chaos that entails.</p>
<h2 id="getting-an-image-into-styleclip">Getting an Image Into StyleCLIP</h2>
<p>GANs in general work by interpreting random &ldquo;noise&rdquo; as data and generating an image from that noise. This noise is typically known as a latent vector. The paper <a href="https://arxiv.org/abs/2102.02766">Designing an Encoder for StyleGAN Image Manipulation</a> by Tov <em>et al</em> (with code <a href="https://github.com/omertov/encoder4editing">open-sourced on GitHub</a> plus a <a href="https://colab.research.google.com/github/omertov/encoder4editing/blob/main/notebooks/inference_playground.ipynb">Colab Notebook too</a>) uses an encoder to invert a given image into the latent vectors which StyleGAN can use to reconstruct the image. These vectors can then be tweaked to get a specified target image from StyleGAN. However, the inversion will only work if you invert a human-like portrait; otherwise you&rsquo;ll get garbage. And even then it may not be a perfect 1:1 map.</p>
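<p>For intuition, here&rsquo;s a toy, runnable sketch of the inversion idea using direct optimization against a stand-in generator (the paper instead trains a dedicated encoder, which is far faster and more robust):</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

# Toy optimization-based inversion: find latents whose decoded image matches
# the target. The Linear "generator" and random target are stand-ins; the
# real pipeline uses StyleGAN and an aligned portrait photo.
generator = torch.nn.Linear(512, 3 * 64 * 64)
target = torch.rand(1, 3 * 64 * 64)
z = torch.zeros(1, 512, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.01)
for _ in range(200):
    loss = F.mse_loss(generator(z), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
torch.save(z.detach(), "latents.pt")  # the latents StyleCLIP manipulates
</code></pre>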
<p>I created a <a href="https://colab.research.google.com/drive/1St3R2qAbwwTV-amfYLeyGGswtzX4HHJP?usp=sharing">streamlined notebook</a> to isolate out the creation of the latent vectors for better interoperability with StyleCLIP.</p>
<p>To demo StyleCLIP, I decided to use Facebook CEO <a href="https://www.facebook.com/zuck">Mark Zuckerberg</a>, who&rsquo;s essentially a meme in himself. I found a <a href="https://commons.wikimedia.org/wiki/File:Medvedev_and_Zuckerberg_October_2012-1.jpeg">photo of Mark Zuckerberg</a> facing the camera, cropped it, ran it through the Notebook, and behold, we have our base Zuck for hacking!</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_base_hu_ee3892dece7e25d5.webp 320w,/2021/04/styleclip/zuck_base.png 512w" src="zuck_base.png"/> 
</figure>

<h2 id="human-transmutation">Human Transmutation</h2>
<p><em>All StyleCLIP generation examples here use the <a href="https://colab.research.google.com/drive/13EJ1ATvTnE0N7I0ULLvRsta7J7HdNuBi?usp=sharing">streamlined notebook</a> and <a href="http://minimaxir.com/media/latents.pt">Mark Zuckerberg latents</a>, with the captions indicating how to reproduce the image so you can hack them yourself!</em></p>
<p>Let&rsquo;s start simple and reproduce the examples in the paper. A tanned Zuck should do the trick (in the event he <a href="https://www.buzzfeednews.com/article/katienotopoulos/mark-zuckerberg-sunscreen-surfing">forgets his sunscreen</a>).</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_tanned_hu_ccd1c3a46ad30cf5.webp 320w,/2021/04/styleclip/zuck_tanned.png 512w" src="zuck_tanned.png"
         alt="face -&gt; tanned face, beta = 0.15, alpha = 6.6"/> <figcaption>
            <p><code>face -&gt; tanned face</code>, beta = 0.15, alpha = 6.6</p>
        </figcaption>
</figure>

<p>What about giving Zuck a cool new hairdo?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_fade_hu_f858069ec9e2080f.webp 320w,/2021/04/styleclip/zuck_fade.png 512w" src="zuck_fade.png"
         alt="face with hair -&gt; face with Hi-top fade hair, beta = 0.17, alpha = 8.6"/> <figcaption>
            <p><code>face with hair -&gt; face with Hi-top fade hair</code>, beta = 0.17, alpha = 8.6</p>
        </figcaption>
</figure>

<p>Like all AI, it <a href="https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml">can cheat</a> if you give it an impossible task. What happens if you try to use StyleCLIP to increase the size of Zuck&rsquo;s nostrils, which are barely visible at all in the base photo?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_nose_hu_6cba74222da4b37a.webp 320w,/2021/04/styleclip/zuck_nose.png 512w" src="zuck_nose.png"
         alt="face with nose -&gt; face with flared nostrils, beta = 0.09, alpha = 6.3"/> <figcaption>
            <p><code>face with nose -&gt; face with flared nostrils</code>, beta = 0.09, alpha = 6.3</p>
        </figcaption>
</figure>

<p>The AI transforms his <em>entire facial structure</em> just to get his nostrils exposed and make the AI happy.</p>
<p>CLIP has seen images of everything on the internet, including public figures. Even though the StyleCLIP paper doesn&rsquo;t discuss it, why not try to transform people into other people?</p>
<p>Many AI practitioners use Tesla Technoking <a href="https://twitter.com/elonmusk">Elon Musk</a> as a test case for anything AI because <del>he generates massive SEO</del> of his contributions to AI and modern nerd culture, which is why I opted to use Zuck as a contrast.</p>
<p>Given that, I bring you, Elon Zuck.</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_elon_musk_hu_c8850f2540711b87.webp 320w,/2021/04/styleclip/zuck_elon_musk.png 512w" src="zuck_elon_musk.png"
         alt="face -&gt; Elon Musk face, beta = 0.12, alpha = 4.3"/> <figcaption>
            <p><code>face -&gt; Elon Musk face</code>, beta = 0.12, alpha = 4.3</p>
        </figcaption>
</figure>

<p>What if you see Zuck as a literal Jesus Christ?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_jc_hu_96365ee3f29a01f2.webp 320w,/2021/04/styleclip/zuck_jc.png 512w" src="zuck_jc.png"
         alt="face -&gt; Jesus Christ face, beta = 0.13, alpha = 9.1"/> <figcaption>
            <p><code>face -&gt; Jesus Christ face</code>, beta = 0.13, alpha = 9.1</p>
        </figcaption>
</figure>

<p>Due to being generated by StyleGAN, the transformations have to resemble something somewhat like a real-life human, but there&rsquo;s nothing stopping CLIP from <em>trying</em> to gravitate toward faces that aren&rsquo;t human. What if you tell StyleCLIP to transform Zuck into an anime character, such as Dragon Ball Z&rsquo;s <a href="https://dragonball.fandom.com/wiki/Goku">Goku</a>?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_goku_hu_9d43d28c12915a99.webp 320w,/2021/04/styleclip/zuck_goku.png 512w" src="zuck_goku.png"
         alt="face -&gt; Dragon Ball Z Goku face, beta = 0.09, alpha = 5.4"/> <figcaption>
            <p><code>face -&gt; Dragon Ball Z Goku face</code>, beta = 0.09, alpha = 5.4</p>
        </figcaption>
</figure>

<p>Zuck gets the hair, at least.</p>
<p>People accuse Zuck of being a robot. What if we make him <em>more</em> of a robot (as guided by a robot)?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_robot_hu_7cdf02ab6ef3767c.webp 320w,/2021/04/styleclip/zuck_robot.png 512w" src="zuck_robot.png"
         alt="face -&gt; robot face, beta = 0.08, alpha = 10"/> <figcaption>
            <p><code>face -&gt; robot face</code>, beta = 0.08, alpha = 10</p>
        </figcaption>
</figure>

<p>These are all pretty tame so far. StyleCLIP can surprisingly handle more complex prompts while still maintaining expected results.</p>
<p>Can Mark Zuckerberg do a troll face? Yes, he can!</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_troll_face_hu_2f4d40244c453fe9.webp 320w,/2021/04/styleclip/zuck_troll_face.png 512w" src="zuck_troll_face.png"
         alt="face -&gt; troll face, beta = 0.13, alpha = 9.1"/> <figcaption>
            <p><code>face -&gt; troll face</code>, beta = 0.13, alpha = 9.1</p>
        </figcaption>
</figure>

<p>We can go deeper. What about altering other attributes at the same time?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_troll_face_eyes_hu_7433223d790f481f.webp 320w,/2021/04/styleclip/zuck_troll_face_eyes.png 512w" src="zuck_troll_face_eyes.png"
         alt="face -&gt; troll face with large eyes, beta = 0.13, alpha = 9.1"/> <figcaption>
            <p><code>face -&gt; troll face with large eyes</code>, beta = 0.13, alpha = 9.1</p>
        </figcaption>
</figure>

<p>Working with CLIP rewards good <a href="https://medium.com/swlh/openai-gpt-3-and-prompt-engineering-dcdc2c5fcd29">prompt engineering</a>, an increasingly relevant AI skill with the rise of GPT-3. With more specific, complex prompts you can stretch the &ldquo;human&rdquo; constraint of StyleGAN. 👁👄👁</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_large_hu_786477a500a1f4a7.webp 320w,/2021/04/styleclip/zuck_large.png 512w" src="zuck_large.png"
         alt="face with eyes -&gt; face with very large eyes and very large mouth, beta = 0.16, alpha = 7.8"/> <figcaption>
            <p><code>face with eyes -&gt; face with very large eyes and very large mouth</code>, beta = 0.16, alpha = 7.8</p>
        </figcaption>
</figure>

<p>Experimentation is half the fun of StyleCLIP!</p>
<h2 id="antiprompts">Antiprompts</h2>
<p>You may have seen that all the examples above had positive alphas, which control the strength of the transformation. So let&rsquo;s talk about negative alphas. While positive alphas increase strength toward the target text vector, negative alphas increase strength away from the target text vector, resulting in the <em>complete opposite</em> of the prompt. This gives rise to what I call <strong>antiprompts</strong>: prompts where you intentionally target the opposite of what&rsquo;s specified, for cases where a normal prompt doesn&rsquo;t give you quite what you want.</p>
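<p>Mechanically, in the Global Directions approach, alpha scales a step along a text-derived direction in StyleGAN&rsquo;s style space, while beta thresholds which style channels are allowed to move at all. A toy sketch with random stand-in tensors:</p>
<pre><code class="language-python">import torch

# Toy sketch of a Global Directions edit (random stand-ins, not the paper's
# code): step the inverted latents along a text-derived direction. Positive
# alpha walks toward the prompt; negative alpha (an antiprompt) walks away.
latents = torch.randn(1, 18, 512)     # stand-in for inverted latents
direction = torch.randn(1, 18, 512)   # stand-in for a CLIP-derived direction
relevance = torch.rand(1, 18, 512)    # stand-in per-channel relevance scores

def edit(latents, direction, alpha, beta=0.1):
    # beta zeroes out channels deemed irrelevant to the text prompt.
    masked = torch.where(relevance >= beta, direction, torch.zeros_like(direction))
    return latents + alpha * masked

laughing = edit(latents, direction, alpha=6.3)   # toward "laughing face"
serious = edit(latents, direction, alpha=-6.3)   # away from it
</code></pre>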
<p>Let&rsquo;s see if Zuck can make a serious face.</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_serious_hu_32740f594e06e456.webp 320w,/2021/04/styleclip/zuck_serious.png 512w" src="zuck_serious.png"
         alt="face -&gt; serious face, beta = 0.09, alpha = 6.3"/> <figcaption>
            <p><code>face -&gt; serious face</code>, beta = 0.09, alpha = 6.3</p>
        </figcaption>
</figure>

<p>More pouty than serious. But what if he does the opposite of a laughing face?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_laughing_hu_ed4cc9a8e97f43f.webp 320w,/2021/04/styleclip/zuck_laughing.png 512w" src="zuck_laughing.png"
         alt="face -&gt; laughing face, beta = 0.09, alpha = -6.3"/> <figcaption>
            <p><code>face -&gt; laughing face</code>, beta = 0.09, alpha = -6.3</p>
        </figcaption>
</figure>

<p>That&rsquo;s more like it.</p>
<p>It doesn&rsquo;t stop there. In the previous section we saw what happens when you give prompts of people and compound prompts. What, you may ask, does the AI think is the opposite of a <em>person</em>?</p>
<p>In the Goku example above, Zuck got larger, darker hair, more pale skin, and a chonky neck. What happens if you do the inverse?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_goku_inv_hu_c5d010ce3249d6d1.webp 320w,/2021/04/styleclip/zuck_goku_inv.png 512w" src="zuck_goku_inv.png"
         alt="face -&gt; Dragon Ball Z Goku face, beta = 0.09, alpha = -5.4"/> <figcaption>
            <p><code>face -&gt; Dragon Ball Z Goku face</code>, beta = 0.09, alpha = -5.4</p>
        </figcaption>
</figure>

<p>His hair is smaller and blonde, his skin is more tan, and he barely has a neck at all.</p>
<p>What if you make Zuck the opposite of a robot? Does he become human?</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_robot_inv_hu_95925e84b159f9e1.webp 320w,/2021/04/styleclip/zuck_robot_inv.png 512w" src="zuck_robot_inv.png"
         alt="face -&gt; robot face, beta = 0.08, alpha = -10"/> <figcaption>
            <p><code>face -&gt; robot face</code>, beta = 0.08, alpha = -10</p>
        </figcaption>
</figure>

<p>He becomes <a href="https://en.wikipedia.org/wiki/Pedro_Pascal">Pedro Pascal</a> apparently.</p>
<h2 id="video-ai-algorithms">Video AI Algorithms</h2>
<p>A fun feature I added to the notebook is the ability to make videos, by generating frames from zero alpha to the target alpha and rendering them using <a href="https://www.ffmpeg.org/">ffmpeg</a>. Through that, we can see these wonderful transformations occur at a disturbingly smooth 60fps!</p>
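<p>A sketch of how the frames come together, where <code>generate_image</code> and <code>latents</code> are hypothetical stand-ins for the notebook&rsquo;s actual generation step:</p>
<pre><code class="language-python">import subprocess

# generate_image() and latents are hypothetical stand-ins for the notebook's
# generation step; each frame re-renders the edit at an interpolated alpha.
n_frames = 240
target_alpha = 6.3
for i in range(n_frames):
    alpha = target_alpha * i / (n_frames - 1)  # interpolate 0 -> target alpha
    frame = generate_image(latents, "face -> face with flared nostrils", alpha=alpha)
    frame.save(f"frames/frame_{i:04d}.png")

# Stitch the frames into a 60fps mp4 with ffmpeg.
subprocess.run(["ffmpeg", "-framerate", "60", "-i", "frames/frame_%04d.png",
                "-pix_fmt", "yuv420p", "zuck_nose.mp4"])
</code></pre>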
<p>Animations are a great way to fully illustrate how the AI can cheat, such as with the flared nostrils example above.</p>
<video controls>
  <source src="/2021/04/styleclip/zuck_nose.mp4" type="video/mp4">
</video>

<p>Or you can opt for pure chaos and do one of the more complex transformations. 👁👄👁</p>
<video controls>
  <source src="/2021/04/styleclip/zuck_large.mp4" type="video/mp4">
</video>

<p>TikTok will have a lot of fun with this!</p>
<h2 id="ethics-and-biases">Ethics and Biases</h2>
<p>Let&rsquo;s address the elephant in the room: is it ethical to edit photos with AI like this?</p>
<p>My take is that StyleCLIP is no different than what <a href="https://www.adobe.com/products/photoshop.html">Adobe Photoshop</a> has done for decades. Unlike deepfakes, these transformations are by construction constrained to human portraits and can&rsquo;t be used in other contexts to mislead or deceive. Turning Mark Zuckerberg into Elon Musk would not cause a worldwide panic. <a href="https://www.faceapp.com/">FaceApp</a>, which does a similar style of image editing, was released years ago and still tops the App Store charts without causing democracy to implode. That said, I recommend only using StyleCLIP on public figures.</p>
<p>In my testing, there is definitely an issue of model bias, both within StyleGAN and within CLIP. A famous example of gender bias in AI is a propensity to assign <a href="https://qz.com/1141122/google-translates-gender-bias-pairs-he-with-hardworking-and-she-with-lazy-and-other-examples/">gender to gender neutral terms</a>, such as <code>He is a soldier. She is a teacher</code>. Let&rsquo;s try both for Zuck.</p>
<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_soldier_hu_89743eafc8a43bed.webp 320w,/2021/04/styleclip/zuck_soldier.png 512w" src="zuck_soldier.png"
         alt="face -&gt; soldier face, beta = 0.1, alpha = 7.2"/> <figcaption>
            <p><code>face -&gt; soldier face</code>, beta = 0.1, alpha = 7.2</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2021/04/styleclip/zuck_teacher_hu_8d00746470894674.webp 320w,/2021/04/styleclip/zuck_teacher.png 512w" src="zuck_teacher.png"
         alt="face -&gt; teacher face, beta = 0.13, alpha = 5.6"/> <figcaption>
            <p><code>face -&gt; teacher face</code>, beta = 0.13, alpha = 5.6</p>
        </figcaption>
</figure>

<p>Unfortunately it still holds true.</p>
<p>It is surprisingly easy to get the model to perform racist/sexist/ageist transformations without much prodding. Inputting <code>face with white skin -&gt; face with black skin</code> does what you think it would do. Making similar transformations based on race/sex/age does indeed work, and I am deliberately not demoing them. If you do experiment around these biases, I recommend careful consideration before posting the outputs.</p>
<h2 id="the-future-of-ai-image-editing">The Future of AI Image Editing</h2>
<p>StyleCLIP is a fun demo on the potential of AI-based image editing. Although not the most pragmatic way to edit portraits, it&rsquo;s fun to see just how well (or how poorly) it can adapt to certain prompts.</p>
<p>Even though everything noted in this blog post is open-sourced, don&rsquo;t think about trying to sell StyleCLIP as a product: StyleGAN2 (which in the end is responsible for generating the image) and its variants were released under <a href="https://nvlabs.github.io/stylegan2/license.html">non-commercial licenses</a>. But it wouldn&rsquo;t surprise me if someone uses the techniques noted in the papers to create their own, more efficient StyleCLIP with a bespoke GAN and spawn an entirely new industry.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
