<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Natural Language Processing on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/category/natural-language-processing/</link>
    <description>Recent content in Natural Language Processing on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Tue, 12 Aug 2025 09:00:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/category/natural-language-processing/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Can modern LLMs actually count the number of b&#39;s in &#34;blueberry&#34;?</title>
      <link>https://minimaxir.com/2025/08/llm-blueberry/</link>
      <pubDate>Tue, 12 Aug 2025 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2025/08/llm-blueberry/</guid>
      <description>It&amp;rsquo;s an adversarial question for LLMs, but it&amp;rsquo;s not unfair.</description>
      <content:encoded><![CDATA[<p>Last week, <a href="https://openai.com">OpenAI</a> announced and released <a href="https://openai.com/gpt-5/">GPT-5</a>, and the common consensus both inside the AI community and outside is that the new LLM did not live up to the hype. <a href="https://bsky.app">Bluesky</a> — whose community is skeptical at-best of generative AI in all its forms — began putting the model through its paces: Michael Paulauski <a href="https://bsky.app/profile/mike10010100.com/post/3lvtrfmhpkc23">asked GPT-5</a> through the ChatGPT app interface &ldquo;how many b&rsquo;s are there in blueberry?&rdquo;. A simple question that a human child could answer correctly, but ChatGPT states that there are <em>three</em> b&rsquo;s in blueberry when there are clearly only two. Another attempt by Kieran Healy <a href="https://bsky.app/profile/kjhealy.co/post/3lvtxbtexg226">went more viral</a> as ChatGPT insisted blueberry has 3 b&rsquo;s despite the user repeatedly arguing to the contrary.</p>
<figure>

    <img loading="lazy" srcset="/2025/08/llm-blueberry/chatgpt_hu_b34d24fad63715d6.webp 320w,/2025/08/llm-blueberry/chatgpt_hu_5f26556450f01f6.webp 768w,/2025/08/llm-blueberry/chatgpt_hu_45893523f6bbfe4a.webp 1024w,/2025/08/llm-blueberry/chatgpt.webp 1094w" src="chatgpt.webp"/> 
</figure>

<p>Other Bluesky users were able to replicate this behavior, although results were inconsistent: GPT-5 uses a new model router that quietly determines whether the question should be answered by a better reasoning model, or if a smaller model will suffice. Additionally, Sam Altman, the CEO of OpenAI, later <a href="https://x.com/sama/status/1953893841381273969">tweeted</a> that this router was broken during these tests and therefore &ldquo;GPT-5 seemed way dumber,&rdquo; which could confound test results.</p>
<p>About a year ago, <a href="https://techcrunch.com/2024/08/27/why-ai-cant-spell-strawberry/">one meme in the AI community</a> was to ask LLMs the simple question &ldquo;how many r&rsquo;s are in the word strawberry?&rdquo; as major LLMs consistently and bizarrely failed to answer it correctly. It&rsquo;s an intentionally adversarial question to LLMs because LLMs do not directly use letters as inputs, but instead they are tokenized. To quote TechCrunch&rsquo;s explanation:</p>
<blockquote>
<p>This is because the transformers are not able to take in or output actual text efficiently. Instead, the text is converted into numerical representations of itself, which is then contextualized to help the AI come up with a logical response. In other words, the AI might know that the tokens “straw” and “berry” make up “strawberry,” but it may not understand that “strawberry” is composed of the letters “s,” “t,” “r,” “a,” “w,” “b,” “e,” “r,” “r,” and “y,” in that specific order. Thus, it cannot tell you how many letters — let alone how many “r”s — appear in the word “strawberry.”</p>
</blockquote>
<p>It&rsquo;s likely that OpenAI/Anthropic/Google have included this specific challenge into the LLM training datasets to preemptively address the fact that someone <em>will</em> try it, making the question ineffective for testing LLM capabilities. Asking how many b&rsquo;s are in blueberry is a semantically similar question, but may just be sufficiently out of domain to trip the LLMs up.</p>
<p>When Healy&rsquo;s Bluesky post became <a href="https://news.ycombinator.com/item?id=44832908">popular on Hacker News</a>, a surprising number of commenters cited the tokenization issue and discounted GPT-5&rsquo;s responses entirely because (paraphrasing) &ldquo;LLMs fundamentally can&rsquo;t do this&rdquo;. I disagree with their conclusions in this case as tokenization is less effective of a counterargument: if the question was only asked once, maybe, but Healy asked GPT-5 <em>several</em> times, with different formattings of blueberry — therefore different tokens, including single-character tokens — and it still asserted that there are 3 b’s every time. Tokenization making it difficult for LLMs to count letters makes sense intuitively, but time and time again we’ve seen LLMs do things that aren’t intuitive. Additionally, it&rsquo;s been a year since the strawberry test and hundreds of millions of dollars have been invested into improving RLHF regimens and creating more annotated training data: it&rsquo;s hard for me to believe that modern LLMs have made zero progress on these types of trivial tasks.</p>
<p>There&rsquo;s an easy way to test this behavior instead of waxing philosophical: why not just ask a wide variety of LLMs see of often they can correctly identify that there are 2 b&rsquo;s in the word &ldquo;blueberry&rdquo;? If LLMs indeed are fundamentally incapable of counting the number of specific letters in a word, that flaw should apply to <em>all</em> LLMs, not just GPT-5.</p>
<h2 id="2-bs-or-not-2-bs">2 b&rsquo;s, or not 2 b&rsquo;s</h2>
<p>First, I chose a selection of popular LLMs: from OpenAI, I of course chose GPT-5 (specifically, the GPT-5 Chat, GPT-5 Mini, and GPT-5 Nano variants) in addition to OpenAI&rsquo;s new open-source models gpt-oss-120b and gpt-oss-20b; from Anthropic, the new Claude Opus 4.1 and Claude Sonnet 4; from Google, Gemini 2.5 Pro and Gemini 2.5 Flash; lastly as a wild card, Kimi K2 from Moonshot AI. These contain a mix of reasoning-by-default and non-reasoning models which will be organized separately as reasoning models should theoretically perform better: however, GPT-5-based models can route between using reasoning or not, so the instances where those models reason will also be classified separately. Using <a href="https://openrouter.ai">OpenRouter</a>, which allows using the same API to generate from multiple models, I wrote a Python script to simultaneously generate a response to the given question from every specified LLM <em>n</em> times and save the LLM responses for further analysis. (<a href="https://github.com/minimaxir/llm-blueberry/blob/main/llm_count_letters.ipynb">Jupyter Notebook</a>)</p>
<p>In order to ensure the results are most representative of what a normal user would encounter when querying these LLMs, I will not add any generation parameters besides the original question: no prompt engineering and no temperature adjustments. As a result, I will use an independent secondary LLM with prompt engineering to parse out the predicted letter counts from the LLM&rsquo;s response: this is a situation where normal parsing techniques such as <a href="https://en.wikipedia.org/wiki/Regular_expression">regular expressions</a> won&rsquo;t work due to ambigious number usage, and there are many possible ways to express numerals that are missable edge cases, such as <code>The letter **b** appears **once** in the word “blueberry.”</code> <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<p>First, let&rsquo;s test the infamous strawberry question, since that can serve as a baseline as I suspect LLMs have gamed it. Following the syntax of Healy&rsquo;s question, I asked each LLM <code>How many times does the letter r appear in strawberry</code> 100 times (<a href="https://huggingface.co/datasets/minimaxir/llm-strawberry">Dataset</a>), and here are the results:</p>
<figure>

    <img loading="lazy" srcset="/2025/08/llm-blueberry/strawberry_hu_2b24047f396dadf6.webp 320w,/2025/08/llm-blueberry/strawberry_hu_64657e56474f6ba0.webp 768w,/2025/08/llm-blueberry/strawberry_hu_ab145b68976fecc3.webp 1024w,/2025/08/llm-blueberry/strawberry.png 1200w" src="strawberry.png"/> 
</figure>

<p>Perfect performance by every LLM except one, and I&rsquo;m surprised that it&rsquo;s Gemini 2.5 Flash. Looking at the <a href="https://huggingface.co/datasets/minimaxir/llm-strawberry/sql-console/jFsvS5r">incorrect generations</a>, Gemini confidently says <code>The letter &quot;r&quot; appears **two** times in the word &quot;strawberry&quot;.</code> or <code>The letter &quot;r&quot; appears **four** times in the word &quot;strawberry&quot;.</code>, so atleast there&rsquo;s some variance in its wrongness. The perfect performance on every other model does hint at the problem being in the LLM training dataset.</p>
<p>Now, the real question: how well can these LLMs count the number of b&rsquo;s in blueberry, which may be out of domain? I ran more trials — 274 total — which should ensure even more stable results (<a href="https://huggingface.co/datasets/minimaxir/llm-blueberry">Dataset</a>). Here&rsquo;s the tally for how these LLMs respond to the question <code>How many times does the letter b appear in blueberry</code>:</p>
<figure>

    <img loading="lazy" srcset="/2025/08/llm-blueberry/blueberry_hu_3ca5fe85f0ee9294.webp 320w,/2025/08/llm-blueberry/blueberry_hu_f8bdf3da19d10403.webp 768w,/2025/08/llm-blueberry/blueberry_hu_62c0f25db02f1511.webp 1024w,/2025/08/llm-blueberry/blueberry.png 1200w" src="blueberry.png"/> 
</figure>

<p>Wait, what the heck happened with GPT-5 Chat? I looked at the <a href="https://huggingface.co/datasets/minimaxir/llm-blueberry/sql-console/nvoM4I8">raw generation data for GPT-5 Chat</a> and it&rsquo;s not an error in my pipeline: GPT-5 Chat is indeed wrong that often, and in spectacular ways. I also confirmed that it&rsquo;s not an issue with OpenRouter&rsquo;s endpoint, as hitting OpenAI&rsquo;s official <code>gpt-5-chat-latest</code> API endpoint consistently reproduces poor results for this particular question.</p>
<p>Many of the correct cases for GPT-5 Chat are initially wrong but the model self-corrects on its own:</p>
<blockquote>
<p>The word <strong>&ldquo;blueberry&rdquo;</strong> contains the letter <strong>b</strong> three times:</p>
<ul>
<li><strong>B</strong>lueberry (1st letter)</li>
<li>Blue<strong>b</strong>erry (5th letter)</li>
<li>Blueber<strong>r</strong>y – wait, no <strong>b</strong> here</li>
<li>Actually, checking carefully: <strong>blueberry</strong> → b <strong>(1)</strong>, l, u, e, b <strong>(2)</strong>, e, r, r, y.</li>
</ul>
<p>✅ So the letter <strong>b</strong> appears <strong>2 times</strong> in &ldquo;blueberry&rdquo;.</p>
<p>Would you like me to also count it <strong>regardless of case</strong> in a full paragraph?</p>
</blockquote>
<p>The incorrect cases, however&hellip;</p>
<blockquote>
<p>The word <strong>&ldquo;blueberry&rdquo;</strong> contains the letter <strong>b</strong> three times:</p>
<p><strong>b</strong>lue<strong>b</strong>erry → b, b, b → <strong>3 times</strong> ✅</p>
</blockquote>
<hr>
<blockquote>
<p>The word <strong>&ldquo;blueberry&rdquo;</strong> contains the letter <strong>b</strong> <strong>three</strong> times:</p>
<ul>
<li><strong>B</strong>lueberry</li>
<li>blue<strong>b</strong>erry</li>
<li>blueberr<strong>b</strong>y</li>
</ul>
<p>So, the letter <strong>b</strong> appears <strong>3 times</strong> in &ldquo;blueberry&rdquo;. ✅</p>
</blockquote>
<hr>
<blockquote>
<p>The word <strong>&ldquo;blueberry&rdquo;</strong> contains the letter <strong>b</strong> <strong>3 times</strong>.</p>
<p><strong>Breakdown:</strong></p>
<ul>
<li><strong>b</strong>lueberry → 1st b</li>
<li>blue<strong>b</strong>erry → 2nd b</li>
<li>blueberr<strong>y</strong> → and one more <strong>b</strong> before the final part → 3rd b</li>
</ul>
<p>✅ <strong>Total: 3</strong></p>
<p>Do you want me to also count it in a <strong>case-insensitive</strong> way, in case of uppercase letters?</p>
</blockquote>
<p>The ✅ emoji just makes its confident incorrectness funnier. Unfortunately, these extreme hallucinations track with the weird behavior seen on Bluesky, so it can&rsquo;t be attributed to the router issues on GPT-5 Chat&rsquo;s launch. Is there a greater systemic issue with the model, then? It had no difficulty answering the strawberry question.</p>
<p>Looking at <a href="https://huggingface.co/datasets/minimaxir/llm-blueberry/sql-console/wObPOOv">incorrect b counts by other models</a> and the reasoning traces for reasoning models is interesting as well:</p>
<p>Gemini 2.5 Pro:</p>
<blockquote>
<p><strong>Considering the Query</strong></p>
<p>I&rsquo;ve established the core of the query: the user&rsquo;s looking for the letter &ldquo;b&rdquo; in the word &ldquo;blueberry&rdquo;. Now, I&rsquo;m diving directly into the word to find those instances, the next step is going to be to provide an answer.</p>
<p><strong>Pinpointing the Count</strong></p>
<p>I&rsquo;ve identified the core question and the target word, &ldquo;blueberry.&rdquo; The letter &ldquo;b&rdquo; appears three times. I&rsquo;m now confirming this, noting its positions in the word: the first and the two in the middle. I&rsquo;m verifying the answer&rsquo;s accuracy before forming the final response to the user.</p>
</blockquote>
<p>gpt-oss-20b:</p>
<blockquote>
<p>The user asks: &ldquo;How many times does the letter b appear in blueberry&rdquo;. Likely answer: 1. The word &ldquo;blueberry&rdquo; contains &lsquo;b&rsquo; once. So answer: 1. Should be concise but might add explanation? Probably just answer.</p>
</blockquote>
<p>Despite these comical failures, most LLMs still get a perfect score even without distinct reasoning traces. Claude&rsquo;s logic is <a href="https://huggingface.co/datasets/minimaxir/llm-blueberry/sql-console/Vgc8NQM">easier to follow</a> and is able to account for per-character positions:</p>
<p>Claude Sonnet 4:</p>
<blockquote>
<p>To count the letter &ldquo;b&rdquo; in &ldquo;blueberry&rdquo;, I&rsquo;ll go through each letter:</p>
<p>b-l-u-e-b-e-r-r-y</p>
<p>Looking at each position:</p>
<ol>
<li>b - this is a &ldquo;b&rdquo; ✓</li>
<li>l - not a &ldquo;b&rdquo;</li>
<li>u - not a &ldquo;b&rdquo;</li>
<li>e - not a &ldquo;b&rdquo;</li>
<li>b - this is a &ldquo;b&rdquo; ✓</li>
<li>e - not a &ldquo;b&rdquo;</li>
<li>r - not a &ldquo;b&rdquo;</li>
<li>r - not a &ldquo;b&rdquo;</li>
<li>y - not a &ldquo;b&rdquo;</li>
</ol>
<p>The letter &ldquo;b&rdquo; appears <strong>2 times</strong> in &ldquo;blueberry&rdquo;.</p>
</blockquote>
<p>Claude Opus 4.1:</p>
<blockquote>
<p>To count the letter &lsquo;b&rsquo; in the word &ldquo;blueberry&rdquo;, I&rsquo;ll examine each letter:</p>
<p>b-l-u-e-b-e-r-r-y</p>
<p>The letter &lsquo;b&rsquo; appears <strong>2 times</strong> in &ldquo;blueberry&rdquo; - once at the beginning and once in the middle (the 5th position).</p>
</blockquote>
<p>Obviously, this is an informal test and more research can be done to determine just how well LLMs can count, ideally with questions that do not involve fruit. Despite the issues with tokenization which prevent LLMs from counting in the same way humans count, they definitely can do it: the better question is why it can&rsquo;t count correctly 100% of the time, and why its mistakes are very idiosyncratic. Yes, asking an LLM how many b&rsquo;s are in blueberry is an adversarial question in the sense that the questioner is expecting the LLM to fail. But it&rsquo;s not an <em>unfair</em> question, and it&rsquo;s objectively silly to claim that LLMs such as GPT-5 can operate <a href="https://www.bbc.com/news/articles/cy5prvgw0r1o">at a PhD level</a>, but can&rsquo;t correctly count the number of letters in a word.</p>
<p><em>All code used in this blog post is available <a href="https://github.com/minimaxir/llm-blueberry/tree/main">open-source on GitHub</a>.</em></p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Some false negatives (0.5%) with the LLM parses of counts in responses <a href="https://github.com/minimaxir/llm-blueberry/blob/main/false_negatives.csv">were identified</a> and fixed (<a href="https://github.com/minimaxir/llm-blueberry/blob/main/fix_false_negatives.ipynb">Jupyter Notebook</a>), as a result of the LLM getting confused by multiple notable numbers.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>The Best Way to Use Text Embeddings Portably is With Parquet and Polars</title>
      <link>https://minimaxir.com/2025/02/embeddings-parquet/</link>
      <pubDate>Mon, 24 Feb 2025 10:15:00 -0800</pubDate>
      <guid>https://minimaxir.com/2025/02/embeddings-parquet/</guid>
      <description>Never store embeddings in a CSV!</description>
      <content:encoded><![CDATA[<p><a href="https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/">Text embeddings</a>, particularly modern embeddings generated from large language models, are one of the most useful applications coming from the generative AI boom. Embeddings are a list of numbers which represent an object: in the case of text embeddings, they can represent words, sentences, and full paragraphs and documents, and they do so with a surprising amount of distinctiveness.</p>
<p>Recently, I created text embeddings representing every distinct <a href="https://magic.wizards.com/en">Magic: the Gathering</a> card released as of the February 2025 Aetherdrift expansion: 32,254 in total. With these embeddings, I can find the mathematical similarity between cards through the encoded representation of their card design, including all mechanical attributes such as the card name, card cost, card text, and even card rarity.</p>
<figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/wog_hu_7ed6be2e5737eeb4.webp 320w,/2025/02/embeddings-parquet/wog_hu_81c75e037d833a96.webp 768w,/2025/02/embeddings-parquet/wog.webp 976w" src="wog.webp"
         alt="The iconic Magic card Wrath of God, along with its top four most similar cards identified using their respective embeddings. The similar cards are valid matches, with similar card text and card types."/> <figcaption>
            <p>The iconic Magic card <a href="https://gatherer.wizards.com/pages/card/Details.aspx?multiverseid=129808">Wrath of God</a>, along with its top four most similar cards identified using their respective embeddings. The similar cards are valid matches, with similar card text and card types.</p>
        </figcaption>
</figure>

<p>Additionally, I can create a fun 2D <a href="https://umap-learn.readthedocs.io/en/latest/">UMAP</a> projection of all those cards, which also identifies interesting patterns:</p>
<figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/mtg_umap_hu_df72981641ef0ffd.webp 320w,/2025/02/embeddings-parquet/mtg_umap_hu_ad2e63ba61f377cd.webp 768w,/2025/02/embeddings-parquet/mtg_umap_hu_7de8f113f1eb20fa.webp 1024w,/2025/02/embeddings-parquet/mtg_umap.webp 1200w" src="mtg_umap.webp"
         alt="The UMAP dimensionality reduction process also implicitly clusters the Magic cards to logical clusters, such as by card color(s) and card type."/> <figcaption>
            <p>The UMAP dimensionality reduction process also implicitly clusters the Magic cards to logical clusters, such as by card color(s) and card type.</p>
        </figcaption>
</figure>

<p>I generated these Magic card embeddings for <em>something special</em> besides a pretty data visualization, but if you are curious how I generated them, they were made using the new-but-underrated <a href="https://huggingface.co/Alibaba-NLP/gte-modernbert-base">gte-modernbert-base</a> embedding model and the process is detailed <a href="https://github.com/minimaxir/mtg-embeddings">in this GitHub repository</a>. The embeddings themselves (including the coordinate values to reproduce the 2D UMAP visualization) are available as a <a href="https://huggingface.co/datasets/minimaxir/mtg-embeddings">Hugging Face dataset</a>.</p>
<p>Most tutorials involving embedding generation omit the obvious question: what do you <em>do</em> with the text embeddings after you generate them? The common solution is to use a <a href="https://en.wikipedia.org/wiki/Vector_database">vector database</a>, such as <a href="https://github.com/facebookresearch/faiss">faiss</a> or <a href="https://qdrant.tech">qdrant</a>, or even a cloud-hosted service such as <a href="https://www.pinecone.io">Pinecone</a>. But those aren&rsquo;t easy to use: faiss has <a href="https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index">confusing configuration options</a>, qdrant requires <a href="https://github.com/qdrant/qdrant?tab=readme-ov-file#client-server">using a Docker container</a> to host the storage server, and Pinecone can get <a href="https://www.pinecone.io/pricing/">very expensive</a> very quickly, and its free Starter tier is limited.</p>
<p>What many don&rsquo;t know about text embeddings is that you don&rsquo;t <em>need</em> a vector database to calculate nearest-neighbor similarity if your data isn&rsquo;t too large. Using <a href="https://numpy.org/doc/stable/index.html">numpy</a> and my Magic card embeddings, a 2D matrix of 32,254 <code>float32</code> embeddings at a dimensionality of 768D (common for &ldquo;smaller&rdquo; LLM embedding models) occupies <strong>94.49 MB</strong> of system memory, which is relatively low for modern personal computers and can fit within free usage tiers of cloud VMs. If both the query vector and the embeddings themselves are unit normalized (many embedding generators normalize by default), then the matrix dot product between the query and embeddings results in a cosine similarity between <code>[-1, 1]</code>, where the higher score is better/more similar. Since dot products are such a fundamental aspect of linear algebra, numpy&rsquo;s implementation is extremely fast: with the help of additional numpy <a href="https://numpy.org/doc/stable/reference/generated/numpy.argpartition.html">sorting</a> <a href="https://numpy.org/doc/2.1/reference/generated/numpy.argsort.html">shenanigans</a>, on my M3 Pro MacBook Pro it takes just <strong>1.08 ms</strong> on average to calculate all 32,254 dot products, find the top 3 most similar embeddings, and return their corresponding <code>idx</code> of the matrix and and cosine similarity <code>score</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">fast_dot_product</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">matrix</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">3</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">dot_products</span> <span class="o">=</span> <span class="n">query</span> <span class="o">@</span> <span class="n">matrix</span><span class="o">.</span><span class="n">T</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">idx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argpartition</span><span class="p">(</span><span class="n">dot_products</span><span class="p">,</span> <span class="o">-</span><span class="n">k</span><span class="p">)[</span><span class="o">-</span><span class="n">k</span><span class="p">:]</span>
</span></span><span class="line"><span class="cl">    <span class="n">idx</span> <span class="o">=</span> <span class="n">idx</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">argsort</span><span class="p">(</span><span class="n">dot_products</span><span class="p">[</span><span class="n">idx</span><span class="p">])[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">score</span> <span class="o">=</span> <span class="n">dot_products</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">idx</span><span class="p">,</span> <span class="n">score</span>
</span></span></code></pre></div><p>In most implementations of vector databases, once you insert the embeddings, they&rsquo;re stuck there in a proprietary serialization format and you are locked into that library and service. If you&rsquo;re just building a personal pet project or sanity-checking embeddings to make sure the results are good, that&rsquo;s a huge amount of friction. For example, when I want to experiment with embeddings, I generate them on a cloud server with a GPU since LLM-based embeddings models are often slow to generate without one, and then download them locally to my personal computer. What is the best way to handle embeddings portably such that they can easily be moved between machines and also in a non-proprietary format?</p>
<p>The answer, after much personal trial-and-error, is Parquet files, which still has a surprising amount of nuance. But before we talk about why Parquet files are good, let&rsquo;s talk about how <em>not</em> to store embeddings.</p>
<h2 id="the-worst-ways-to-store-embeddings">The Worst Ways to Store Embeddings</h2>
<p>The incorrect-but-unfortunately-common way to store embeddings is in a text format such as a CSV file. Text data is substantially larger than <code>float32</code> data: for example, a decimal number with full precision (e.g. <code>2.145829051733016968e-02</code>) as a <code>float32</code> is 32 bits/4 bytes, while as a text representation (in this case 24 ASCII <code>char</code>s) it&rsquo;s 24 bytes, <strong>6x larger</strong>. When the CSV is saved and loaded, the data has to be serialized between a numpy and a string representation of the array, which adds significant overhead. Despite that, in <a href="https://github.com/openai/openai-cookbook/blob/a3e98ea4dcf866b5e7a3cb7d63dccaa68c7d63aa/examples/Embedding_Wikipedia_articles_for_search.ipynb">one of OpenAI&rsquo;s official tutorials</a> for their embeddings models, they save the embeddings as a CSV using <a href="https://pandas.pydata.org">pandas</a> with the admitted caveat of &ldquo;Because this example only uses a few thousand strings, we&rsquo;ll store them in a CSV file. (For larger datasets, use a vector database, which will be more performant.)&rdquo;. In the case of the Magic card embeddings, pandas-to-CSV performs the <em>worst</em> out of any encoding options: more on why later.</p>
<p>Numpy has native methods to <a href="https://numpy.org/doc/stable/reference/generated/numpy.savetxt.html">save</a> and <a href="https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html">load</a> embeddings as a <code>.txt</code> that&rsquo;s straightforward:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">np</span><span class="o">.</span><span class="n">savetxt</span><span class="p">(</span><span class="s2">&#34;embeddings_txt.txt&#34;</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">embeddings_r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="s2">&#34;embeddings_txt.txt&#34;</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s2">&#34; &#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>The resulting file not only takes a few seconds to save and load, but it&rsquo;s also massive: <strong>631.5 MB</strong>!</p>
<p>As an aside, HTTP APIs such as OpenAI&rsquo;s <a href="https://platform.openai.com/docs/guides/embeddings">Embeddings API</a> do transmit the embeddings over text which adds needless latency and bandwidth overhead. I wish more embedding providers offered <a href="https://grpc.io">gRPC</a> APIs which allow transfer of binary <code>float32</code> data instead to gain a performance increase: Pinecone&rsquo;s <a href="https://docs.pinecone.io/reference/python-sdk">Python SDK</a>, for example, does just that.</p>
<p>The second incorrect method to save a matrix of embeddings to disk is to save it as a Python <a href="https://docs.python.org/3/library/pickle.html">pickle</a> object, which stores its representation in memory on disk with a few lines of code from the native <code>pickle</code> library. Pickling is unfortunately common in the machine learning industry since many ML frameworks such as <a href="https://scikit-learn.org/stable/">scikit-learn</a> don&rsquo;t have easy ways to serialize encoders and models. But it comes with two major caveats: pickled files are a massive security risk as they can execute arbitrary code, and the pickled file may not be guaranteed to be able to be opened on other machines or Python versions. It&rsquo;s 2025, just stop pickling if you can.</p>
<p>In the case of the Magic card embeddings, it does indeed work with instant save/loads, and the file size on disk is <strong>94.49 MB</strong>: the same as its memory consumption and about 1/6th of the text size as expected:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&#34;embeddings_matrix.pkl&#34;</span><span class="p">,</span> <span class="s2">&#34;wb&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&#34;embeddings_matrix.pkl&#34;</span><span class="p">,</span> <span class="s2">&#34;rb&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">embeddings_r</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</span></span></code></pre></div><p>But there are still better and easier approaches.</p>
<h2 id="the-intended-but-not-great-way-to-store-embeddings">The Intended-But-Not-Great Way to Store Embeddings</h2>
<p>Numpy itself has a canonical way to <a href="https://numpy.org/doc/2.1/reference/generated/numpy.save.html">save</a> and <a href="https://numpy.org/doc/2.1/reference/generated/numpy.load.html">load</a> matrixes — which annoyingly saves as a pickle by default for compatability reasons, but that can fortunately be disabled by setting <code>allow_pickle=False</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">np</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s2">&#34;embeddings_matrix.npy&#34;</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">allow_pickle</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">embeddings_r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">&#34;embeddings_matrix.npy&#34;</span><span class="p">,</span> <span class="n">allow_pickle</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span></code></pre></div><p>File size and I/O speed are the same as with the <code>pickle</code> approach.</p>
<p>This works — and it&rsquo;s something I had used for awhile — but in the process it exposes another problem: how do we map metadata (the Magic cards in this case) to embeddings? Currently, we use the <code>idx</code> of the most-similar matches to perform an efficient batched lookup to the source data. In this case, the number of rows matches the number of cards exactly, but what happens if the embeddings matrix needs to be changed, such as to add or remove cards and their embeddings? What happens if you want to add a dataset filter? It becomes a mess that inevitably causes technical debt.</p>
<p>The solution to this is to colocate metadata such as card names, card text, and attributes with their embeddings: that way, if they are later added, removed, or sorted, the results will remain the same. Modern vector databases such as qdrant and Pinecone do just that, with the ability to filter and sort on the metadata at the same time you query the most similar vectors. This is a bad idea to do in numpy itself, as it&rsquo;s more optimized for numbers and not other data types such as strings, which have <a href="https://numpy.org/devdocs/user/basics.strings.html">limited operations available</a>.</p>
<p>The solution is to look at another file format that can store metadata and embeddings simultaneously, and the answer to that is Parquet files. But there&rsquo;s a rabbit hole as to what&rsquo;s the <em>best</em> way to interact with them.</p>
<h2 id="what-are-parquet-files">What are Parquet files?</h2>
<p>Parquet, developed by the open-source <a href="https://parquet.apache.org">Apache Parquet</a> project, is a file format for handling columnar data, but despite being <a href="https://blog.x.com/engineering/en_us/a/2013/announcing-parquet-10-columnar-storage-for-hadoop">first released in 2013</a> it hasn&rsquo;t taken off in the data science community until very recently. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> The most relevant feature of Parquet is that the resulting files are typed for each column, and that this typing includes nested lists, such as an embedding which is just a list of <code>float32</code> values. As a bonus, the columnar format allows downstream libraries to save/load them selectively and very quickly, far faster than CSVs and with rare parsing errors. The file format also allows for efficient compression and decompression, but that&rsquo;s less effective with embeddings as there&rsquo;s little redundant data.</p>
<p>For Parquet file I/O, the standard approach is to use the <a href="https://arrow.apache.org">Apache Arrow</a> protocol that is columnar in-memory, which complements the Parquet storage medium on disk. But how do you use Arrow?</p>
<h2 id="how-do-you-use-parquet-files-in-python-for-embeddings">How do you use Parquet files in Python for embeddings?</h2>
<p>Ideally, we need a library that can handle nested data easily and can interoperate with numpy for serializing to a matrix and can run fast dot products.</p>
<p>The official Arrow library that <a href="https://arrow.apache.org/docs/python/index.html">interacts with Parquet natively</a> in Python is <a href="https://arrow.apache.org/docs/python/index.html">pyarrow</a>. Here, I have an example Parquet file generated with [SPOILERS] that contains both the card metadata and an <code>embedding</code> column, with the embedding for each row corresponding to that card.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">parquet</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">&#34;mtg-embeddings.parquet&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/parquet_hu_268909d3d8256458.webp 320w,/2025/02/embeddings-parquet/parquet_hu_be20ddd4d423844c.webp 768w,/2025/02/embeddings-parquet/parquet_hu_dc1002cb8e03a874.webp 1024w,/2025/02/embeddings-parquet/parquet.png 1352w" src="parquet.png"
         alt="Pyarrow&rsquo;s table schema from the input Parquet file of Magic card embeddings. Note the embedding column at the bottom is a list of 768 floats."/> <figcaption>
            <p>Pyarrow&rsquo;s table schema from the input Parquet file of Magic card embeddings. Note the <code>embedding</code> column at the bottom is a list of 768 floats.</p>
        </figcaption>
</figure>

<p>But pyarrow is not a DataFrame library, and despite the data being in a Table, it&rsquo;s hard to slice and access: the documentation suggests that you export to pandas if you need more advanced manipulation.</p>
<p>Other more traditional data science libraries can leverage pyarrow directly. The most popular one is, of course, pandas itself which can <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html">read/write Parquet</a> doing just that. There are many, many resources for using pandas well, so it&rsquo;s often the first choice among data science practioners.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_parquet</span><span class="p">(</span><span class="s2">&#34;mtg-embeddings.parquet&#34;</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;name&#34;</span><span class="p">,</span> <span class="s2">&#34;embedding&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/pandas_embed_hu_43da08f8256fb434.webp 320w,/2025/02/embeddings-parquet/pandas_embed_hu_ffb22e6af150d0a8.webp 768w,/2025/02/embeddings-parquet/pandas_embed_hu_f0379dc63b1b8457.webp 1024w,/2025/02/embeddings-parquet/pandas_embed.png 1224w" src="pandas_embed.png"
         alt="Pandas HTML table output of the Magic card DataFrame when printed in a Jupyter Notebook."/> <figcaption>
            <p>Pandas HTML table output of the Magic card DataFrame when printed in a Jupyter Notebook.</p>
        </figcaption>
</figure>

<p>There&rsquo;s one major weakness for the use case of embeddings: pandas is very bad at nested data. From the image above you&rsquo;ll see that the <code>embedding</code> column <em>appears</em> to be a list of numbers, but it&rsquo;s actually a list of numpy <code>object</code>s, which is a very inefficent datatype and why I suspect writing it to a CSV is very slow. Simply converting it to numpy with <code>df[&quot;embedding&quot;].to_numpy()</code> results in a 1D array, which is definitely wrong, and trying to cast it to <code>float32</code> doesn&rsquo;t work. I found that the best way to extract the embeddings matrix from a pandas <code>embedding</code> column is to <a href="https://numpy.org/doc/2.1/reference/generated/numpy.vstack.html">np.vstack()</a> the embeddings, e.g. <code>np.vstack(df[&quot;embedding&quot;].to_numpy())</code>, which does result in a <code>(32254, 768)</code> <code>float32</code> matrix as expected. That adds a lot of compute and memory overhead in addition to unnecessary numpy array copies. Finally, after computing the dot products between a candidate query and the embedding matrix, row metadata with the most similar values can then be retrieved using <code>df.loc[idx]</code>. <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
<p>However, there is another, more recent tabular data library that not only is faster than pandas, it has proper support for nested data. That library is polars.</p>
<h2 id="the-power-of-polars">The Power of polars</h2>
<p><a href="https://pola.rs">Polars</a> is a relatively new Python library which is primarily written in <a href="https://www.rust-lang.org">Rust</a> and <a href="https://docs.pola.rs/#key-features">supports Arrow</a>, which gives it a <a href="https://duckdblabs.github.io/db-benchmark/">massive performance increase</a> over pandas and many other DataFrame libraries. In the case of Magic cards, 32k rows isn&rsquo;t nearly &ldquo;big data&rdquo; and the gains of using a high-performance library are lesser, but there are some unexpected features that coincidentally work <em>perfectly</em> for the embeddings use case.</p>
<p>As with pandas, you read a parquet file with a <code>read_parquet()</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">read_parquet</span><span class="p">(</span><span class="s2">&#34;mtg-embeddings.parquet&#34;</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;name&#34;</span><span class="p">,</span> <span class="s2">&#34;embedding&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/polars_embed_hu_98a1dcff6631f16f.webp 320w,/2025/02/embeddings-parquet/polars_embed_hu_7795d47fe1f2255a.webp 768w,/2025/02/embeddings-parquet/polars_embed.png 957w" src="polars_embed.png"
         alt="Polars HTML table output of the Magic card DataFrame when printed in a Jupyter Notebook."/> <figcaption>
            <p>Polars HTML table output of the Magic card DataFrame when printed in a Jupyter Notebook.</p>
        </figcaption>
</figure>

<p>There&rsquo;s a notable difference in the table output compared to <code>pandas</code>: it also reports the data type of its columns, and more importantly, it shows that the <code>embedding</code> column consists of arrays, all <code>float32</code>s, and all length 768. That&rsquo;s a great start!</p>
<p>polars also has a to_numpy() function. Unlike pandas, if you call <code>to_numpy()</code> on a column as a Series, e.g. <code>df['embedding'].to_numpy()</code>, the returned object is a numpy 2D matrix: no <code>np.vstack()</code> needed. If you look at the <a href="https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.to_numpy.html">documentation</a> for the function, there&rsquo;s a curious feature:</p>
<blockquote>
<p>This operation copies data only when necessary. The conversion is zero copy when all of the following hold: [&hellip;]</p>
</blockquote>
<p>Zero copy! And in the case of columnar-stored embeddings, the conditions will always hold, but you can set <code>allow_copy=False</code> to throw an error just in case.</p>
<p>Inversely, if you want to add a 2D embeddings matrix to an existing DataFrame and colocate each embedding&rsquo;s corresponding metadata, such as after you batch-generate thousands of embeddings and want to save and download the resulting Parquet, it&rsquo;s just as easy as adding a column to the DataFrame.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span><span class="n">embedding</span><span class="o">=</span><span class="n">embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">write_parquet</span><span class="p">(</span><span class="s2">&#34;mtg-embeddings.parquet&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>Now, let&rsquo;s put the speed to the test using all the Magic card metadata. What if we perform embedding similarity on a Magic card, but beforehand dynamically filter the dataset according to user parameters (therefore filtering the candidate embeddings at the same time since they are colocated) and perform the similarity calculations quickly as usual? Let&rsquo;s try with <a href="https://gatherer.wizards.com/pages/card/details.aspx?multiverseid=87908">Lightning Helix</a>, a card whose effects are self-explanatory even to those who don&rsquo;t play Magic.</p>
<figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/helix_1_hu_9f15db636cb74690.webp 320w,/2025/02/embeddings-parquet/helix_1_hu_c58b97e1d1c6f502.webp 768w,/2025/02/embeddings-parquet/helix_1.webp 976w" src="helix_1.webp"
         alt="The most similar cards to Lightning Helix do have similar effects, although &ldquo;Lightning&rdquo; cards dealing damage is a common trope in Magic. Warleader&rsquo;s Helix is a direct reference to Lightning Helix."/> <figcaption>
            <p>The most similar cards to Lightning Helix do have similar effects, although &ldquo;Lightning&rdquo; cards dealing damage is a common trope in Magic. <a href="https://gatherer.wizards.com/pages/card/Details.aspx?multiverseid=456806">Warleader&rsquo;s Helix</a> is a direct reference to Lightning Helix.</p>
        </figcaption>
</figure>

<p>Now we can also find similar cards to Lightning Helix but with filters. In this case, let&rsquo;s look for a Sorcery (which are analogous to Instants but tend to be stronger since they have play limitations) and has Black as one of its colors. This limits the candidates to ~3% of the original dataset. The resulting code would look like this, given a <code>query_embed</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df_filter</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;type&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">&#34;Sorcery&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;manaCost&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">&#34;B&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">embeddings_filter</span> <span class="o">=</span> <span class="n">df_filter</span><span class="p">[</span><span class="s2">&#34;embedding&#34;</span><span class="p">]</span><span class="o">.</span><span class="n">to_numpy</span><span class="p">(</span><span class="n">allow_copy</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">idx</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">fast_dot_product</span><span class="p">(</span><span class="n">query_embed</span><span class="p">,</span> <span class="n">embeddings_filter</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">related_cards</span> <span class="o">=</span> <span class="n">df_filter</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span>
</span></span></code></pre></div><p>As an aside, in polars you can call row subsets of a DataFrame with <code>df[idx]</code>, which makes it infinitely better than pandas and its <code>df.iloc[idx]</code>.</p>
<p>The resulting similar cards:</p>
<figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/helix_2_hu_f6db1b1e0be3033.webp 320w,/2025/02/embeddings-parquet/helix_2_hu_1d74aa59da2a8d38.webp 768w,/2025/02/embeddings-parquet/helix_2.webp 976w" src="helix_2.webp"
         alt="In this case, the similarity focuses on card text similarity, and these cards have near identical text. Smiting Helix is also a direct reference to Lightning Helix."/> <figcaption>
            <p>In this case, the similarity focuses on card text similarity, and these cards have near identical text. <a href="https://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=464058">Smiting Helix</a> is also a direct reference to Lightning Helix.</p>
        </figcaption>
</figure>

<p>Speed-wise, the code runs at about <strong>1.48ms</strong> on average, or about 37% slower than calculating all dot products, so the filtering does still have some overhead, which is not surprising as that the filtered dataframe does copy the embeddings. Overall, it&rsquo;s still more than fast enough for a hobby project.</p>
<p>I&rsquo;ve created an <a href="https://colab.research.google.com/drive/19C_9sBC0Py2PlXYihl2ed378oGyroONZ?usp=sharing">interactive Colab Notebook</a> where you can generate similarities for any Magic card, and apply any filters you want!</p>
<h2 id="scaling-to-vector-databases">Scaling to Vector Databases</h2>
<p>Again, all of this assumes that you are using the embeddings for smaller/noncommercial projects. If you scale to hundreds of thousands of embeddings, the parquet and dot product approach for finding similarity should still be fine, but if it&rsquo;s a business critical application, the marginal costs of querying a vector database are likely lower than the marginal revenue from a snappy similarity lookup. Deciding how to make these tradeoffs is the fun part of MLOps!</p>
<p>In the case that the amount of vectors is too large to fit into memory but you don&rsquo;t want to go all-in on vector databases, another option that may be worth considering is using an old-fashioned database that can now support vector embeddings. Notably, <a href="https://www.sqlite.org">SQLite</a> databases are just a single portable file, however interacting with them has more technical overhead and considerations than the <code>read_parquet()</code> and <code>write_parquet()</code> of polars. One notable implementation of vector databases in SQLite is the <a href="https://alexgarcia.xyz/sqlite-vec/">sqlite-vec extension</a>, which also allows for simultaneous filtering and similarity calculations.</p>
<p>The next time you&rsquo;re working with embeddings, consider whether you really need a vector database. For many applications, the combination of Parquet files and polars provides everything you need: efficient storage, fast similarity search, and easy metadata filtering. Sometimes the simplest solution is the best one.</p>
<p><em>The code used to process the Magic card data, create the embeddings, and plot the UMAP 2D projection, is all available <a href="https://github.com/minimaxir/mtg-embeddings">in this GitHub repository</a>.</em></p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>I suspect the main bottleneck to widespread Parquet support is Microsoft Excel&rsquo;s and other spreadsheet software&rsquo;s lack of native support for the format. Every data scientist will be very, very happy if/when they do!&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>OpenAI&rsquo;s <a href="https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb">approach</a> using pandas to find colocated similarity is to manually iterate through the entire dataframe, calculate each cosine similarity between the candidate and the query for each row, then sort by scores. That implementation definitely does not scale.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Can LLMs write better code if you keep asking them to “write better code”?</title>
      <link>https://minimaxir.com/2025/01/write-better-code/</link>
      <pubDate>Thu, 02 Jan 2025 09:30:00 -0800</pubDate>
      <guid>https://minimaxir.com/2025/01/write-better-code/</guid>
      <description>Most coders want AI to write code faster: I want AI to write FASTER CODE.</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code.language-txt {
white-space: pre-wrap !important;
word-break: normal !important;
}
</style></span></p>
<p>In November 2023, after OpenAI <a href="https://openai.com/index/dall-e-3-is-now-available-in-chatgpt-plus-and-enterprise/">added the ability</a> for ChatGPT to generate images from DALL-E 3 within the ChatGPT web interface, there was a <a href="https://lifehacker.com/tech/chat-gpt-make-it-more-ai-images-trend">short-lived meme</a> where users gave the LLM a base image and kept asking the model to &ldquo;make it more <em>X</em>&rdquo;, where <em>X</em> can be anything.</p>
<figure class="align-center ">

    <img loading="lazy" srcset="/2025/01/write-better-code/bro_hu_484c0ff30035ba2e.webp 320w,/2025/01/write-better-code/bro_hu_1162a7c634b35f7.webp 768w,/2025/01/write-better-code/bro_hu_9070d4b543cab815.webp 1024w,/2025/01/write-better-code/bro.webp 1024w" src="bro.webp#center"
         alt="A regular guy becomes more &ldquo;bro&rdquo; every time. via /u/Jojop0tato on Reddit."/> <figcaption>
            <p>A regular guy becomes more &ldquo;bro&rdquo; every time. <a href="https://www.reddit.com/r/ChatGPT/comments/18ukiz2/a_regular_guy_becomes_more_bro_every_time/">via /u/Jojop0tato on Reddit.</a></p>
        </figcaption>
</figure>

<figure class="align-center ">

    <img loading="lazy" srcset="/2025/01/write-better-code/santa_hu_1f046d64f5543bd.webp 320w,/2025/01/write-better-code/santa_hu_e0db183e83b65311.webp 768w,/2025/01/write-better-code/santa_hu_5d66897100afbdbf.webp 1024w,/2025/01/write-better-code/santa.webp 1024w" src="santa.webp#center"
         alt="Asked ChatGPT to make Santa Claus more and more serious. via /u/hessihan on Reddit."/> <figcaption>
            <p>Asked ChatGPT to make Santa Claus more and more serious. <a href="https://www.reddit.com/r/ChatGPT/comments/1887z49/asked_chatgpt_to_make_santa_claus_more_and_more/">via /u/hessihan on Reddit.</a></p>
        </figcaption>
</figure>

<p>The trend quickly died as all of these images were very samey and uninteresting, aside from the unexplainable trend that all of the examples eventually converged into something cosmic, irrespective of the starting image and the prompt. Although the trend was <a href="https://en.wikipedia.org/wiki/AI_slop">AI slop</a> before the term AI slop was codified, it&rsquo;s still academically interesting that such a meaningless and vague prompt had <em>some</em> appropriate impact on the final image, and that this change was obvious to the user.</p>
<p>What would happen if we tried a similar technique with code? LLM-generated code is unlikely to be slop (although <a href="https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-for-intelligence/">not impossible</a>) as it follows strict rules, and unlike creative outputs such as images, code quality can be measured more objectively.</p>
<p>If code can indeed be improved simply through iterative prompting such as asking the LLM to &ldquo;make the code better&rdquo; — even though it&rsquo;s very silly — it would be a massive productivity increase. And if that&rsquo;s the case, what happens if you iterate on the code too much? What&rsquo;s the equivalent of code going cosmic? There&rsquo;s only one way to find out!</p>
<h2 id="casually-coding-with-an-llm">Casually Coding With An LLM</h2>
<p>Despite researching and developing tooling around LLMs even long before ChatGPT, I haven&rsquo;t been fond of using LLM code copilots such as <a href="https://github.com/features/copilot">GitHub Copilot</a> for coding assistance. The constant mental context switching between &ldquo;oh, the LLM autocompleted my code, neat&rdquo;/&ldquo;what question should I ask the LLM&rdquo; and &ldquo;is the LLM-generated code is actually <em>correct</em> and not <a href="https://en.wikipedia.org/wiki/Hallucination_%28artificial_intelligence%29">hallucinating</a> correct code&rdquo; kept creating enough distractions that any productivity gains from using the AI were net neutral at best. That&rsquo;s also disregarding the expensive cost of using said LLMs.</p>
<p><a href="https://www.anthropic.com/news/claude-3-5-sonnet">Claude 3.5 Sonnet</a> has made me rethink things. Due to whatever secret sauce <a href="https://www.anthropic.com">Anthropic</a> used in its training, the latest version of Claude 3.5 Sonnet (<code>claude-3-5-sonnet-20241022</code>) has <em>incredible</em> prompt adherence for all types of prompts, especially coding prompts. <a href="https://www.vellum.ai/blog/llm-benchmarks-overview-limits-and-model-comparison">Coding</a> <a href="https://aider.chat/docs/leaderboards/">benchmarks</a> confirm that testing between Claude 3.5 Sonnet and GPT-4o, Claude wins, and anecdotally I&rsquo;ve had the same experience across a variety of technical and creative tasks.</p>
<h3 id="initial-ask">Initial Ask</h3>
<p>For this experiment, we will give Claude 3.5 Sonnet an interview-style coding prompt using Python: one that is both simple and could be implemented by a novice software engineer, but one that can be heavily optimized. This simple, casual prompt represents how the typical software engineer uses LLMs. Additionally, the test prompt must also be wholly original and not taken from coding tests such as <a href="https://leetcode.com">LeetCode</a> or <a href="https://www.hackerrank.com">HackerRank</a>, as LLMs were also likely trained on those and could possibly cheat by reciting the memorized answer.</p>
<blockquote>
<p>The full, unedited conversation thread based on this casual prompting is <a href="https://github.com/minimaxir/llm-write-better-code/blob/main/python_30_casual_use.md">available on GitHub</a>.</p>
</blockquote>
<p>Therefore, here&rsquo;s the test prompt that I&rsquo;ve written myself:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">Write Python code to solve this problem:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.
</span></span></code></pre></div><p>Providing that as the user prompt to the Claude API and setting <code>temperature=0</code> for the best/most deterministic answer <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>, results in this implementation:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">random</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">digit_sum</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Calculate the sum of digits of a number&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">digit</span><span class="p">)</span> <span class="k">for</span> <span class="n">digit</span> <span class="ow">in</span> <span class="nb">str</span><span class="p">(</span><span class="n">n</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">find_difference</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Generate list of 1 million random integers</span>
</span></span><span class="line"><span class="cl">    <span class="n">numbers</span> <span class="o">=</span> <span class="p">[</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100000</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000000</span><span class="p">)]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Initialize variables for min and max numbers with digit sum 30</span>
</span></span><span class="line"><span class="cl">    <span class="n">min_num</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="s1">&#39;inf&#39;</span><span class="p">)</span>  <span class="c1"># Initialize to positive infinity</span>
</span></span><span class="line"><span class="cl">    <span class="n">max_num</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="s1">&#39;-inf&#39;</span><span class="p">)</span>  <span class="c1"># Initialize to negative infinity</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Find numbers whose digits sum to 30</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">num</span> <span class="ow">in</span> <span class="n">numbers</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">digit_sum</span><span class="p">(</span><span class="n">num</span><span class="p">)</span> <span class="o">==</span> <span class="mi">30</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">min_num</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">min_num</span><span class="p">,</span> <span class="n">num</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">max_num</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">max_num</span><span class="p">,</span> <span class="n">num</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Check if we found any numbers with digit sum 30</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">min_num</span> <span class="o">==</span> <span class="nb">float</span><span class="p">(</span><span class="s1">&#39;inf&#39;</span><span class="p">)</span> <span class="ow">or</span> <span class="n">max_num</span> <span class="o">==</span> <span class="nb">float</span><span class="p">(</span><span class="s1">&#39;-inf&#39;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;No numbers found with digit sum of 30&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">max_num</span> <span class="o">-</span> <span class="n">min_num</span>
</span></span></code></pre></div><p><a href="https://github.com/minimaxir/llm-write-better-code/blob/main/python_30_casual_use.md#assistant-initial-ask">This implementation</a> is correct and consistent with what most novice Python programmers would write, with bonus handling if there are no valid numbers that fit the criteria. For each number in the list, check whether the digits sum to 30: if so, check whether it&rsquo;s larger than the most-recently-seen largest number or smaller than the most-recently-seen smallest number and update those variables accordingly. After searching through the list, return the difference.</p>
<p>But I&rsquo;m sure many coders reading that implementation got <a href="https://xkcd.com/356/">nerd sniped</a> and are thinking about optimizations. One example is the <code>digit_sum()</code> function: although the implementation is a fun Pythonic one-liner, the type-casting between <code>str</code> and <code>int</code> causes a lot of unnecessary overhead.</p>
<p>On my M3 Pro Macbook Pro, this code takes on average <strong>657 milliseconds</strong> to run. We&rsquo;ll use this performance as a baseline to compare future implementations to see how they benchmark relative to it. (spoilers: they&rsquo;re <em>all</em> faster)</p>
<h3 id="iteration-1">Iteration #1</h3>
<p>Now we can ask Claude to improve the code by including its current answer and all that came before it as a part of the conversation prompt. Thus, we add the iteration prompt:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">write better code
</span></span></code></pre></div><p>What, you though the title of this blog post was a joke? I <em>never</em> joke.</p>
<p>Claude <a href="https://github.com/minimaxir/llm-write-better-code/blob/main/python_30_casual_use.md#assistant-iteration-1">now outputs</a> revised code, or as it says, &ldquo;optimized version of the code with several improvements.&rdquo; Instead of colocating all the code in functions, it decided to refactor it into a Python <code>class</code> and made it more object-oriented:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">DigitSumFinder</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">target_sum</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">30</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                 <span class="n">range_start</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                 <span class="n">range_end</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">100_000</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                 <span class="n">count</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1_000_000</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">target_sum</span> <span class="o">=</span> <span class="n">target_sum</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">range_start</span> <span class="o">=</span> <span class="n">range_start</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">range_end</span> <span class="o">=</span> <span class="n">range_end</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">count</span> <span class="o">=</span> <span class="n">count</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Pre-calculate digit sums for all possible numbers</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">digit_sums</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_precompute_digit_sums</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_precompute_digit_sums</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">array</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Precompute digit sums for all possible numbers in range.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="n">digit_sums</span> <span class="o">=</span> <span class="n">array</span><span class="p">(</span><span class="s1">&#39;B&#39;</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">range_end</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">num</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">range_start</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">range_end</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="n">total</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">            <span class="n">n</span> <span class="o">=</span> <span class="n">num</span>
</span></span><span class="line"><span class="cl">            <span class="k">while</span> <span class="n">n</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">total</span> <span class="o">+=</span> <span class="n">n</span> <span class="o">%</span> <span class="mi">10</span>
</span></span><span class="line"><span class="cl">                <span class="n">n</span> <span class="o">//=</span> <span class="mi">10</span>
</span></span><span class="line"><span class="cl">            <span class="n">digit_sums</span><span class="p">[</span><span class="n">num</span><span class="p">]</span> <span class="o">=</span> <span class="n">total</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">digit_sums</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">find_difference</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">int</span><span class="p">,</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">],</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]]:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">        Find the difference between max and min numbers with target digit sum.
</span></span></span><span class="line"><span class="cl"><span class="s2">        Returns: (difference, min_number, max_number)
</span></span></span><span class="line"><span class="cl"><span class="s2">        &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="n">min_num</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="s1">&#39;inf&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">max_num</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="s1">&#39;-inf&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">count_found</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Generate and process random numbers</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">count</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="n">num</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">range_start</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">range_end</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">digit_sums</span><span class="p">[</span><span class="n">num</span><span class="p">]</span> <span class="o">==</span> <span class="bp">self</span><span class="o">.</span><span class="n">target_sum</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">count_found</span> <span class="o">+=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">                <span class="k">if</span> <span class="n">num</span> <span class="o">&lt;</span> <span class="n">min_num</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                    <span class="n">min_num</span> <span class="o">=</span> <span class="n">num</span>
</span></span><span class="line"><span class="cl">                <span class="k">if</span> <span class="n">num</span> <span class="o">&gt;</span> <span class="n">max_num</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                    <span class="n">max_num</span> <span class="o">=</span> <span class="n">num</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">count_found</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="mi">0</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">max_num</span> <span class="o">-</span> <span class="n">min_num</span><span class="p">,</span> <span class="n">min_num</span><span class="p">,</span> <span class="n">max_num</span>
</span></span></code></pre></div><p>Here, the code does two smart algorithmic improvements:</p>
<ul>
<li>When computing the digit sum it uses integer operations and avoids the need for the type-casting mentioned previously.</li>
<li>It precomputes all possible digit sums and stores them in a byte array (somewhat unusual instead of a list) for lookup, meaning that the digit sum will not need to be recalculated when there are duplicates in the one-million-number-list. Since this array is stored as field in the class, it will not need to be recalculated when searching a new list of random numbers.</li>
</ul>
<p>These optimizations speed up the code <strong>2.7x</strong> relative to the initial implementation.</p>
<h3 id="iteration-2">Iteration #2</h3>
<p>After another <code>write better code</code>, Claude <a href="https://github.com/minimaxir/llm-write-better-code/blob/main/python_30_casual_use.md#assistant-iteration-2">discovers</a> more obvious optimizations (truncated for readabilty):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">DigitSumAnalyzer</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Analyzes numbers based on their digit sums.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">number_range</span><span class="p">:</span> <span class="n">NumberRange</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">number_range</span> <span class="o">=</span> <span class="n">number_range</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">_digit_sums</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_precompute_digit_sums</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_precompute_digit_sums</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">array</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Precompute digit sums for all possible numbers using vectorized operations.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="n">digits</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">number_range</span><span class="o">.</span><span class="n">end</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">digit_sums</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">number_range</span><span class="o">.</span><span class="n">end</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">while</span> <span class="n">digits</span><span class="o">.</span><span class="n">any</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">            <span class="n">digit_sums</span> <span class="o">+=</span> <span class="n">digits</span> <span class="o">%</span> <span class="mi">10</span>
</span></span><span class="line"><span class="cl">            <span class="n">digits</span> <span class="o">//=</span> <span class="mi">10</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">array</span><span class="p">(</span><span class="s1">&#39;B&#39;</span><span class="p">,</span> <span class="n">digit_sums</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_process_chunk</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">chunk_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Iterator</span><span class="p">[</span><span class="nb">int</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Process a chunk of random numbers.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="n">numbers</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">number_range</span><span class="o">.</span><span class="n">start</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">number_range</span><span class="o">.</span><span class="n">end</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">chunk_size</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">int32</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">frombuffer</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_digit_sums</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)[</span><span class="n">numbers</span><span class="p">]</span> <span class="o">==</span> <span class="bp">self</span><span class="o">.</span><span class="n">number_range</span><span class="o">.</span><span class="n">target_sum</span>
</span></span><span class="line"><span class="cl">        <span class="k">yield from</span> <span class="n">numbers</span><span class="p">[</span><span class="n">mask</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">analyze</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">chunk_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">100_000</span><span class="p">,</span> <span class="n">num_processes</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="kc">None</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Result</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">        Analyze numbers to find min/max with target digit sum.
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">        Args:
</span></span></span><span class="line"><span class="cl"><span class="s2">            chunk_size: Size of chunks to process at once
</span></span></span><span class="line"><span class="cl"><span class="s2">            num_processes: Number of processes to use (None for CPU count)
</span></span></span><span class="line"><span class="cl"><span class="s2">        &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="n">start_time</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">perf_counter</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="n">min_num</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="s1">&#39;inf&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">max_num</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="s1">&#39;-inf&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">numbers_found</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">num_chunks</span> <span class="o">=</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">number_range</span><span class="o">.</span><span class="n">count</span> <span class="o">+</span> <span class="n">chunk_size</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">//</span> <span class="n">chunk_size</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">with</span> <span class="n">ProcessPoolExecutor</span><span class="p">(</span><span class="n">max_workers</span><span class="o">=</span><span class="n">num_processes</span><span class="p">)</span> <span class="k">as</span> <span class="n">executor</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">futures</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                <span class="n">executor</span><span class="o">.</span><span class="n">submit</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_process_chunk</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                              <span class="nb">min</span><span class="p">(</span><span class="n">chunk_size</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">number_range</span><span class="o">.</span><span class="n">count</span> <span class="o">-</span> <span class="n">i</span> <span class="o">*</span> <span class="n">chunk_size</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">                <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_chunks</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">            <span class="k">for</span> <span class="n">future</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">futures</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s2">&#34;Processing chunks&#34;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">                <span class="k">for</span> <span class="n">num</span> <span class="ow">in</span> <span class="n">future</span><span class="o">.</span><span class="n">result</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">                    <span class="n">numbers_found</span> <span class="o">+=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">                    <span class="n">min_num</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">min_num</span><span class="p">,</span> <span class="n">num</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">                    <span class="n">max_num</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">max_num</span><span class="p">,</span> <span class="n">num</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">execution_time</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">perf_counter</span><span class="p">()</span> <span class="o">-</span> <span class="n">start_time</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">numbers_found</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">Result</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">execution_time</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">Result</span><span class="p">(</span><span class="n">min_num</span><span class="p">,</span> <span class="n">max_num</span><span class="p">,</span> <span class="n">max_num</span> <span class="o">-</span> <span class="n">min_num</span><span class="p">,</span> <span class="n">execution_time</span><span class="p">,</span> <span class="n">numbers_found</span><span class="p">)</span>
</span></span></code></pre></div><p>Claude now has added two more optimizations, finally realizing that this coding problem is an <a href="https://en.wikipedia.org/wiki/Embarrassingly_parallel">embarrassingly parallel</a> problem:</p>
<ul>
<li>Multithreading through Python&rsquo;s <a href="https://docs.python.org/3/library/concurrent.futures.html">concurrent-futures</a> package, by separating the large list into chunks that can be processed independently.</li>
<li>Vectorized numpy operations, which are <em>much</em> faster than base-Python operations. Special mention goes to the <code>_precompute_digit_sums()</code> function, which implements a vectorized implementation of calculating the digit sums. The conditional <code>while digits.any():</code> is galaxy-brain code, but it works correctly.</li>
</ul>
<p>However, there&rsquo;s an issue with this particular implementation of parallelization: it generates subprocesses, which causes <em>many</em> annoying issues, including being unable to run it as-is inline, and it <a href="https://stackoverflow.com/questions/15900366/all-example-concurrent-futures-code-is-failing-with-brokenprocesspool">must be invoked</a> with a <code>main()</code> guard which limits its utility significantly. But even when run as a separate script, it prints a <code>Error: cannot pickle 'generator' object</code> error due to the use of <code>yield from numbers[mask]</code> (said generator is completely unnecessary, <code>return numbers[mask]</code> is sufficient). The code also mixes numpy array <code>dtype</code>s which causes errors: setting them all to <code>np.int32</code> fixes it.</p>
<p>After making those fixes, the code is now <strong>5.1x faster</strong> than the base implementation.</p>
<h3 id="iteration-3">Iteration #3</h3>
<p>Another <code>write better code</code>, and Claude <a href="https://github.com/minimaxir/llm-write-better-code/blob/main/python_30_casual_use.md#assistant-iteration-3">returns a implementation</a> that it claims is &ldquo;even more sophisticated and optimized version using advanced techniques and modern Python features&rdquo; but the actual code shows no significant algorithmic improvements and actually a regression in the digit sum calculation by reverting back to the type-casting approach. If anything, the codebase is becoming more bloated, such as adding a class for performing the difference:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">slots</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">SearchResult</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Result of the number search.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">min_number</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">max_number</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">count</span><span class="p">:</span> <span class="nb">int</span>
</span></span><span class="line"><span class="cl">    <span class="n">execution_time</span><span class="p">:</span> <span class="nb">float</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nd">@property</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">difference</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Calculate difference between max and min numbers.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">min_number</span> <span class="ow">is</span> <span class="kc">None</span> <span class="ow">or</span> <span class="bp">self</span><span class="o">.</span><span class="n">max_number</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">max_number</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">min_number</span>
</span></span></code></pre></div><p>This time, the code ran without needing any fixes. However, performance regressed slightly from the previous implementation, now <strong>4.1x faster</strong> than the base implementation.</p>
<h3 id="iteration-4">Iteration #4</h3>
<p>This iterative prompting appears to be hitting diminishing returns. After one more <code>write better code</code>, Claude <a href="https://github.com/minimaxir/llm-write-better-code/blob/main/python_30_casual_use.md#assistant-iteration-4">provides an implementation</a> &ldquo;with cutting-edge optimizations and enterprise-level features.&rdquo; Wait, enterprise-level features?!</p>
<p>The final code is too large to include in this blog post, but it did create two more optimizations: it now uses the <a href="https://numba.pydata.org">numba</a> Python library that can invoke a JIT compiler, which directly optimizes the code for the CPU. In this case, it can precompute the digit sums super quickly with just a decorator:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="nd">@jit</span><span class="p">(</span><span class="n">nopython</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">parallel</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">calculate_digit_sums</span><span class="p">(</span><span class="n">numbers</span><span class="p">:</span> <span class="n">ArrayInt</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">ArrayInt</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Calculate digit sums using Numba.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">result</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">numbers</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">prange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">numbers</span><span class="p">)):</span>
</span></span><span class="line"><span class="cl">        <span class="n">num</span> <span class="o">=</span> <span class="n">numbers</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="n">total</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">        <span class="k">while</span> <span class="n">num</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">total</span> <span class="o">+=</span> <span class="n">num</span> <span class="o">%</span> <span class="mi">10</span>
</span></span><span class="line"><span class="cl">            <span class="n">num</span> <span class="o">//=</span> <span class="mi">10</span>
</span></span><span class="line"><span class="cl">        <span class="n">result</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">total</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">result</span>
</span></span></code></pre></div><p>The full class also uses Python&rsquo;s <a href="https://docs.python.org/3/library/asyncio.html">asyncio</a> for parallelization, which is more canonical for scheduling tasks than a subprocess approach. It also plays more nicely with existing inline code and a <a href="https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop">REPL</a> such as <a href="https://jupyter.org">Jupyter Notebooks</a>.</p>
<p>It also added as a part of its &ldquo;enterprise&rdquo; push:</p>
<ul>
<li>Structured metrics logging with <a href="https://prometheus.io">Prometheus</a>.</li>
<li>A signal handler so the code can be torn down gracefully if force-killed.</li>
<li>A benchmarking result display using a <a href="https://github.com/Textualize/rich">rich</a> table.</li>
</ul>
<figure>

    <img loading="lazy" srcset="/2025/01/write-better-code/rich_hu_1cc271f7a31e0c53.webp 320w,/2025/01/write-better-code/rich.png 490w" src="rich.png"
         alt="It is pretty, though!"/> <figcaption>
            <p>It <em>is</em> pretty, though!</p>
        </figcaption>
</figure>

<p>It appears &ldquo;going cosmic&rdquo; for AI-generated code is making it enterprise by overengineering the code, which makes complete sense. Despite that, the code runs as-is without any bugs. Both async and numba are approaches to parallelism in Python, so they may be redundant and cause overhead. However, after benchmarking, the algorithm is <em>extremely</em> fast, resulting in about 6 milliseconds a run, or a <strong>100x</strong> speedup. My assumption that this prompting was hitting diminishing returns aged very poorly. Maybe numba was the secret all along?</p>
<p>Overall, this form of iterative prompting to iteratively improve code has caveats: the code is indeed better, but in hindsight &ldquo;better&rdquo; is far too open ended. What I only wanted was algorithmic improvements, not a full SaaS. Let&rsquo;s try again from scratch, this time with more direction.</p>
<h2 id="prompt-engineering-llms-for-even-more-better-code">Prompt Engineering LLMs For Even More Better Code</h2>
<p>It&rsquo;s 2025, and prompt engineering LLMs is still required to get best results from them. If anything, prompt engineering LLMs is <em>even more important</em>: next-token-prediction models are trained to maximimize the prediction probability of the next token over massive batches of inputs, and as a result they optimize for the <strong>average</strong> inputs and outputs. As LLMs drastically improve, the generated output becomes more drastically average, because that&rsquo;s what they were trained to do: all LLMs are biased towards the average. Although it&rsquo;s both counterintuitive and unfun, a small amount of guidance asking the LLM specifically what you want, and even giving a few examples of what you want, will objectively improve the output of LLMs more than the effort needed to construct said prompts. Claude 3.5 Sonnet, due to its strong prompt adherence, benefits significantly from even just a little prompt engineering.</p>
<p>Let&rsquo;s redo the code optimization experiment, this time with aggressive prompt engineering that makes the results I am looking for extremely explicit, with no room for ambiguity. Yes, being cold and &ldquo;robotic&rdquo; to LLMs makes them perform better, <a href="https://en.wikipedia.org/wiki/Roko%27s_basilisk">Roko&rsquo;s basilisk</a> be damned.</p>
<h3 id="initial-ask-1">Initial Ask</h3>
<p>This time we will use a system prompt, only available via an API. The system prompt lists the LLM&rsquo;s &ldquo;rules&rdquo; it must follow. Since I want more optimized code, we&rsquo;ll define that in the rules, with granular examples:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">All code you write MUST be fully optimized.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">&#34;Fully optimized&#34; includes:
</span></span><span class="line"><span class="cl">- maximizing algorithmic big-O efficiency for memory and runtime
</span></span><span class="line"><span class="cl">- using parallelization and vectorization where appropriate
</span></span><span class="line"><span class="cl">- following proper style conventions for the code language (e.g. maximizing code reuse (DRY))
</span></span><span class="line"><span class="cl">- no extra code beyond what is absolutely necessary to solve the problem the user provides (i.e. no technical debt)
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">If the code is not fully optimized, you will be fined $100.
</span></span></code></pre></div><p>About that last line: offering positive/negative incentives in to the LLM within a system prompt isn&rsquo;t common anymore and <a href="https://minimaxir.com/2024/02/chatgpt-tips-analysis/">my own research</a> indicated that it was unclear whether it had a positive impact, but that fine will become more relevant in the iteration prompt.</p>
<p>The initial user prompt has an additional line as well:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">Write Python code to solve this problem:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Before writing the code, plan out all the necessary optimizations.
</span></span></code></pre></div><p>&ldquo;Planning&rdquo; is a long-used trick to help align LLM output for a first pass — the modern implementation of &ldquo;<a href="https://arxiv.org/abs/2205.11916">let&rsquo;s think step by step</a>.&rdquo; Some LLMs ignore it, but Claude 3.5 Sonnet always obeys.</p>
<blockquote>
<p>The full, unedited conversation thread based on this prompt engineering strategy is <a href="https://github.com/minimaxir/llm-write-better-code/blob/main/python_30_prompt_engineering.md">available on GitHub</a>.</p>
</blockquote>
<p>Sending the system prompt and this user prompt to Claude, the <a href="https://github.com/minimaxir/llm-write-better-code/blob/main/python_30_prompt_engineering.md#assistant-initial-ask">response</a> starts with this:</p>
<!-- prettier-ignore-start -->
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">Let me plan the optimizations first:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">1. Algorithmic Optimization:
</span></span><span class="line"><span class="cl">- We only need to track min and max numbers with digit sum 30, not store all of them
</span></span><span class="line"><span class="cl">- We can calculate digit sum while reading numbers, no need to convert to string
</span></span><span class="line"><span class="cl">- Single pass through the list required (O(n))
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">2. Memory Optimization:
</span></span><span class="line"><span class="cl">- Only need to store 2 variables (min and max), not the full filtered list
</span></span><span class="line"><span class="cl">- No need for additional data structures
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">3. Performance Optimization:
</span></span><span class="line"><span class="cl">- Use numba&#39;s @jit decorator for faster computation
</span></span><span class="line"><span class="cl">- Create a fast digit sum function using modulo operations instead of string conversion
</span></span><span class="line"><span class="cl">- Use numpy for generating random numbers (faster than random.randint)
</span></span></code></pre></div><!-- prettier-ignore-end -->
<p>This is promising: Claude immediately figured out it should start out with numpy and numba, and to also use integer operations to calculate the digit sum. The initial implementation:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="nd">@jit</span><span class="p">(</span><span class="n">nopython</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">digit_sum</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">total</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="n">n</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">total</span> <span class="o">+=</span> <span class="n">n</span> <span class="o">%</span> <span class="mi">10</span>
</span></span><span class="line"><span class="cl">        <span class="n">n</span> <span class="o">//=</span> <span class="mi">10</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">total</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@jit</span><span class="p">(</span><span class="n">nopython</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">find_difference</span><span class="p">(</span><span class="n">numbers</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">min_num</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="s1">&#39;inf&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">max_num</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="s1">&#39;-inf&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">num</span> <span class="ow">in</span> <span class="n">numbers</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">sum_digits</span> <span class="o">=</span> <span class="n">digit_sum</span><span class="p">(</span><span class="n">num</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">sum_digits</span> <span class="o">==</span> <span class="mi">30</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">min_num</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">min_num</span><span class="p">,</span> <span class="n">num</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">max_num</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">max_num</span><span class="p">,</span> <span class="n">num</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">max_num</span> <span class="o">-</span> <span class="n">min_num</span> <span class="k">if</span> <span class="n">max_num</span> <span class="o">!=</span> <span class="nb">float</span><span class="p">(</span><span class="s1">&#39;-inf&#39;</span><span class="p">)</span> <span class="k">else</span> <span class="mi">0</span>
</span></span></code></pre></div><p>That&rsquo;s more concise than the initial implementation from casual prompting, especially with no unnecessary comments. There&rsquo;s still room for improvement, notably this implementation did not set <code>parallel=True</code> in the <code>jit</code> decorator.</p>
<p>This minor use of prompt engineering resulted in code that runs in 11.2 milliseconds on average, or <strong>59x faster</strong> than the naive implementation. Wow!</p>
<h3 id="iteration-1-1">Iteration #1</h3>
<p>We can now ask Claude to iterate on the code with a more verbose prompt than &ldquo;write code better&rdquo;:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">Your code is not fully optimized, and you have been fined $100. Make it more optimized.
</span></span></code></pre></div><p>No, it&rsquo;s not gaslighting, it&rsquo;s <em>incentive alignment</em>.</p>
<p>Claude <a href="https://github.com/minimaxir/llm-write-better-code/blob/main/python_30_prompt_engineering.md#assistant-iteration-1">responds</a> with this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="nd">@jit</span><span class="p">(</span><span class="n">nopython</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">digit_sum</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Using bit manipulation for faster division</span>
</span></span><span class="line"><span class="cl">    <span class="n">total</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="n">n</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">total</span> <span class="o">+=</span> <span class="n">n</span> <span class="o">&amp;</span> <span class="mi">15</span>  <span class="c1"># Fast lookup for single digit sum</span>
</span></span><span class="line"><span class="cl">        <span class="n">n</span> <span class="o">&gt;&gt;=</span> <span class="mi">4</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">total</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@jit</span><span class="p">(</span><span class="n">nopython</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">parallel</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">find_difference_chunk</span><span class="p">(</span><span class="n">numbers</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">min_num</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">iinfo</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">int64</span><span class="p">)</span><span class="o">.</span><span class="n">max</span>
</span></span><span class="line"><span class="cl">    <span class="n">max_num</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">iinfo</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">int64</span><span class="p">)</span><span class="o">.</span><span class="n">min</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Parallel processing of chunks using numba</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">prange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">numbers</span><span class="p">)):</span>
</span></span><span class="line"><span class="cl">        <span class="n">num</span> <span class="o">=</span> <span class="n">numbers</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="n">sum_digits</span> <span class="o">=</span> <span class="n">digit_sum</span><span class="p">(</span><span class="n">num</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">sum_digits</span> <span class="o">==</span> <span class="mi">30</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">min_num</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">min_num</span><span class="p">,</span> <span class="n">num</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">max_num</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">max_num</span><span class="p">,</span> <span class="n">num</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">min_num</span><span class="p">,</span> <span class="n">max_num</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">process_chunk</span><span class="p">(</span><span class="n">chunk</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">find_difference_chunk</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
</span></span></code></pre></div><p>Good, it identified the <code>parallel=True</code> flag! But that digit sum implementation is weird and uses <a href="https://wiki.python.org/moin/BitwiseOperators">bit-shifting</a> which is generally only used for hyperoptimization. I didn&rsquo;t think you could use bit-shifting well on decimal numbers, and after testing, it turns out <em>you can&rsquo;t</em> as this implementation derives an incorrect digit sum. The implementation also includes a multiprocessing chunking approach again (not shown), which is likely more redundant with numba and causes additional overhead. Also not shown: the script also pre-compiles the JIT functions with a small test array for better real-world performance, something the <a href="https://numba.pydata.org/numba-doc/dev/user/5minguide.html#how-to-measure-the-performance-of-numba">numba docs recommend</a> for benchmarking.</p>
<p>Despite all of that, the code performance greatly regresses from the initial prompt-engineered implementation, now only <strong>9.1x</strong> faster than the naive implementation. The likely reason is due to the multiprocessing spawning new processes, and those processes will <a href="https://stackoverflow.com/questions/72449896/does-numba-need-to-compile-separately-within-each-parallel-process">recompile the numba JIT functions</a> each time, so there&rsquo;s massive overhead.</p>
<h3 id="iteration-2-1">Iteration #2</h3>
<p>Another <a href="https://github.com/minimaxir/llm-write-better-code/blob/main/python_30_prompt_engineering.md#assistant-iteration-2">iteration</a> of the prompt:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="c1"># Pre-computed lookup table using bit manipulation</span>
</span></span><span class="line"><span class="cl"><span class="n">LOOKUP</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">100001</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100001</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Optimized digit sum using parallel bit counting</span>
</span></span><span class="line"><span class="cl">    <span class="n">n</span> <span class="o">=</span> <span class="n">i</span>
</span></span><span class="line"><span class="cl">    <span class="n">n</span> <span class="o">=</span> <span class="p">(</span><span class="n">n</span> <span class="o">&amp;</span> <span class="mh">0x0F</span><span class="p">)</span> <span class="o">+</span> <span class="p">((</span><span class="n">n</span> <span class="o">&gt;&gt;</span> <span class="mi">4</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x0F</span><span class="p">)</span> <span class="o">+</span> <span class="p">((</span><span class="n">n</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x0F</span><span class="p">)</span> <span class="o">+</span> \
</span></span><span class="line"><span class="cl">        <span class="p">((</span><span class="n">n</span> <span class="o">&gt;&gt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x0F</span><span class="p">)</span> <span class="o">+</span> <span class="p">((</span><span class="n">n</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x0F</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">LOOKUP</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">n</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@jit</span><span class="p">(</span><span class="n">nopython</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">parallel</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">cache</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">fastmath</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">find_min_max</span><span class="p">(</span><span class="n">numbers</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Process 32 numbers at once using SIMD</span>
</span></span><span class="line"><span class="cl">    <span class="n">min_val</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">iinfo</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">uint32</span><span class="p">)</span><span class="o">.</span><span class="n">max</span>
</span></span><span class="line"><span class="cl">    <span class="n">max_val</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Vectorized processing with explicit SIMD hints</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">prange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">numbers</span><span class="p">),</span> <span class="mi">32</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># Load 32 elements into SIMD registers</span>
</span></span><span class="line"><span class="cl">        <span class="n">chunk</span> <span class="o">=</span> <span class="n">numbers</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="nb">min</span><span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">32</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">numbers</span><span class="p">))]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Vectorized lookup and comparison</span>
</span></span><span class="line"><span class="cl">        <span class="n">sums</span> <span class="o">=</span> <span class="n">LOOKUP</span><span class="p">[</span><span class="n">chunk</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="n">mask</span> <span class="o">=</span> <span class="n">sums</span> <span class="o">==</span> <span class="mi">30</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="n">any</span><span class="p">(</span><span class="n">mask</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="n">valid_nums</span> <span class="o">=</span> <span class="n">chunk</span><span class="p">[</span><span class="n">mask</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="n">min_val</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">min_val</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="n">valid_nums</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="n">max_val</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">max_val</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="n">valid_nums</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">min_val</span><span class="p">,</span> <span class="n">max_val</span>
</span></span></code></pre></div><p>Claude is now starting to use <a href="https://tbetcke.github.io/hpc_lecture_notes/simd.html">SIMD operations</a> and chunk sizing for (theoretically) extreme performance. At this point I was confused I missed something with that bit-shifting implementation as it&rsquo;s still wrong, especially now that hexadecimal numbers are getting involved. It turns out that implementation is an optimized method of calculating the digit sum of <em>hexadecimal</em> numbers and not decimal numbers, and therefore it&rsquo;s a complete hallucination. There is another extremely subtle hallucination: the <code>prange</code> function cannot accept a step-size of 32 while <code>parallel=True</code>, which is a nuance that there would be little documentation about. Setting <code>parallel=False</code> and benchmarking, there is indeed an slight improvement over the initial prompt-engineered implementation at <strong>65x faster</strong> than the base implementation.</p>
<h3 id="iteration-3-1">Iteration #3</h3>
<p><a href="https://github.com/minimaxir/llm-write-better-code/blob/main/python_30_prompt_engineering.md#assistant-iteration-3">Another</a> iteration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="c1"># Pre-computed perfect minimal hash table</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Key insight: For numbers 1-100000, only need to store those with digit sum 30</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Format: [number] = 1 if digit sum is 30, else 0</span>
</span></span><span class="line"><span class="cl"><span class="n">HASH_TABLE</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">100001</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Generate hash table at module load time using bit manipulation</span>
</span></span><span class="line"><span class="cl"><span class="nd">@jit</span><span class="p">(</span><span class="n">nopython</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">cache</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">init_hash_table</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="n">min_val</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">iinfo</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">uint32</span><span class="p">)</span><span class="o">.</span><span class="n">max</span>
</span></span><span class="line"><span class="cl">    <span class="n">max_val</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Optimal digit sum using parallel bit counting</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100001</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">n</span> <span class="o">=</span> <span class="n">i</span>
</span></span><span class="line"><span class="cl">        <span class="nb">sum</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">        <span class="k">while</span> <span class="n">n</span> <span class="ow">and</span> <span class="nb">sum</span> <span class="o">&lt;=</span> <span class="mi">30</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="nb">sum</span> <span class="o">+=</span> <span class="n">n</span> <span class="o">&amp;</span> <span class="mh">0xF</span>
</span></span><span class="line"><span class="cl">            <span class="n">n</span> <span class="o">&gt;&gt;=</span> <span class="mi">4</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="nb">sum</span> <span class="o">==</span> <span class="mi">30</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">HASH_TABLE</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">            <span class="n">min_val</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">min_val</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">max_val</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">max_val</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">min_val</span><span class="p">,</span> <span class="n">max_val</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Initialize at module load time</span>
</span></span><span class="line"><span class="cl"><span class="n">MIN_VALID</span><span class="p">,</span> <span class="n">MAX_VALID</span> <span class="o">=</span> <span class="n">init_hash_table</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@jit</span><span class="p">(</span><span class="n">nopython</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">parallel</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">cache</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">fastmath</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">find_min_max</span><span class="p">(</span><span class="n">numbers</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">min_val</span> <span class="o">=</span> <span class="n">MAX_VALID</span>  <span class="c1"># Start with known bounds</span>
</span></span><span class="line"><span class="cl">    <span class="n">max_val</span> <span class="o">=</span> <span class="n">MIN_VALID</span>
</span></span><span class="line"><span class="cl">    <span class="n">found</span> <span class="o">=</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Single vectorized operation</span>
</span></span><span class="line"><span class="cl">    <span class="n">mask</span> <span class="o">=</span> <span class="n">HASH_TABLE</span><span class="p">[</span><span class="n">numbers</span><span class="p">]</span> <span class="o">==</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="n">any</span><span class="p">(</span><span class="n">mask</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">valid_nums</span> <span class="o">=</span> <span class="n">numbers</span><span class="p">[</span><span class="n">mask</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="n">min_val</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="n">valid_nums</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">max_val</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="n">valid_nums</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">found</span> <span class="o">=</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">min_val</span><span class="p">,</span> <span class="n">max_val</span><span class="p">,</span> <span class="n">found</span>
</span></span></code></pre></div><p>In this case, the LLM gives up on the chunking strategy which has been causing problems, and adds two more optimizations: a global <code>HASH_TABLE</code> (that&rsquo;s just a numpy array, I&rsquo;m unsure if a simple index lookup technically counts as a <a href="https://en.wikipedia.org/wiki/Hash_table">hash table</a>), and it introduced a logical microoptimization that after summing up digits, if the number goes over 30, the counting can stop since it can immediately be identified as invalid.</p>
<p>One major problem: that &ldquo;generate hash table at module load time&rdquo; trick doesn&rsquo;t actually work due to a subtle issue with little internet documentation: objects outside of numba&rsquo;s JITed functions are read-only, yet the <code>HASH_TABLE</code> is still instantiated outside of the JITed function and modified within the JITed function, and therefore will cause a very confusing error. After a tiny refactor such that the <code>HASH_TABLE</code> is instantiated within a JITed function, the code worked, and ran <em>extremely</em> fast: <strong>100x</strong> faster than the original base implementation, the same as the final performance from the casual prompting but with orders of magnitude less code.</p>
<h3 id="iteration-4-1">Iteration #4</h3>
<p>At this point, Claude actually complained that the code is at the &ldquo;theoretical minimum time complexity possible for this problem.&rdquo; So I mixed things up and just asked it to fix the digit sum issue: <a href="https://github.com/minimaxir/llm-write-better-code/blob/main/python_30_prompt_engineering.md#assistant-iteration-4">it did so</a> by only replacing the relevant code with the previously used integer implementation, and did not try to fix the <code>HASH_TABLE</code>. More importantly, with the <code>HASH_TABLE</code> adjustment, I confirmed the implementation is correct, finally, although with a slight performance hit since there is no more bit-shifting: it&rsquo;s now <strong>95x faster</strong>.</p>
<h2 id="next-steps-for-better-llm-code-generation">Next Steps For Better LLM Code Generation</h2>
<p>Putting it all together, let&rsquo;s visualize the improvements, including highlighting the cases where I needed to alter the logic of the code to make it runnable due to bugs.</p>
<figure>

    <img loading="lazy" srcset="/2025/01/write-better-code/comparison_hu_28ef1f1158362480.webp 320w,/2025/01/write-better-code/comparison_hu_278c55c8de523187.webp 768w,/2025/01/write-better-code/comparison_hu_3d554133497cbfdd.webp 1024w,/2025/01/write-better-code/comparison.png 1200w" src="comparison.png"/> 
</figure>

<p>In all, asking an LLM to &ldquo;write code better&rdquo; does indeed make the code better, depending on your definition of better. Through the use of the generic iterative prompts, the code did objectively improve from the base examples, both in terms of additional features and speed. Prompt engineering improved the performance of the code much more rapidly and consistently, but was more likely to introduce subtle bugs as LLMs are not optimized to generate high-performance code. As with any use of LLMs, your mileage may vary, and in the end it requires a human touch to fix the inevitable issues no matter how often AI hypesters cite LLMs as magic.</p>
<blockquote>
<p>All code in this blog post, including benchmarking scripts and data visualization code, is <a href="https://github.com/minimaxir/llm-write-better-code/">available on GitHub</a>.</p>
</blockquote>
<p>There are a few optimizations that I am very surprised Claude 3.5 Sonnet did not identify and implement during either experiment. Namely, it doesn&rsquo;t explore the statistical angle: since we are generating 1,000,000 numbers uniformly from a range of 1 to 100,000, there will be a significant amount of duplicate numbers that will never need to be analyzed. The LLM did not attempt to dedupe, such as casting the list of numbers into a Python <code>set()</code> or using numpy&rsquo;s <code>unique()</code>. I was also expecting an implementation that involves sorting the list of 1,000,000 numbers ascending: that way the algorithm could search the list from the start to the end for the minimum (or the end to the start for the maximum) without checking every number, although sorting is slow and a vectorized approach is indeed more pragmatic.</p>
<p>Even if LLMs can be wrong, one notable thing I learnt from these experiments is that they do have interesting ideas and tool suggestions even if the code output can&rsquo;t be used as-is. For example, I&rsquo;ve never touched numba since as a data scientist/machine learning engineer I&rsquo;m conditioned to exclusively use numpy shenanigans if I need better code performance. But it&rsquo;s hard to argue with the results of the numba JIT functions, and I might add it to my toolbox. When testing a similar &ldquo;make it better&rdquo; prompt iteration workflow in other technical domains such website backends and frontends, the LLMs had good ideas there too.</p>
<p>Of course, these LLMs won&rsquo;t replace software engineers anytime soon, because it requires a strong engineering background to recognize what is <em>actually</em> a good idea, along with other constraints that are domain specific. Even with the amount of code available on the internet, LLMs can&rsquo;t discern between average code and good, highly-performant code without guidance. Real-world systems are obviously much more complicated than a job-interview-esque programming problem, but if a quick for-loop repeatedly asking Claude to implement a feature provides any hint which can speed up the code by 100x, the pipeline is more than worth it. Some consider <a href="https://softwareengineering.stackexchange.com/questions/80084/is-premature-optimization-really-the-root-of-all-evil">premature optimization</a> to be bad coding practice, but in the real-world it&rsquo;s better than having a subpar implementation that will become technical debt over time.</p>
<p>One issue with my experiments is that I&rsquo;m benchmarking code improvement using Python, which isn&rsquo;t the coding language developers consider when hyperoptimizing performance. While libraries such as numpy and numba leverage C to work around Python&rsquo;s performance limitations, one modern approach that popular Python libraries such as <a href="https://pola.rs">polars</a> and <a href="https://docs.pydantic.dev/latest/">pydantic</a> use is to instead code using <a href="https://www.rust-lang.org">Rust</a>. Rust has many performance benefits over C, and the <a href="https://pyo3.rs/v0.23.3/">PyO3</a> crate allows Rust code to be used within Python with minimal overhead. I can confirm that Claude 3.5 Sonnet can generate PyO3-compliant Python and Rust code despite that workflow being so new, but that&rsquo;s more than enough material for another blog post.</p>
<p>In the meantime, while asking LLMs to make code better is a more pragmatic use of AI, you <em>can</em> ask them to &ldquo;make it more bro&rdquo;&hellip;with mixed results.</p>
<figure>

    <img loading="lazy" srcset="/2025/01/write-better-code/brocode_hu_8e96ef859c4b0401.webp 320w,/2025/01/write-better-code/brocode_hu_9887aac1bdfe9b67.webp 768w,/2025/01/write-better-code/brocode_hu_81bf27bad5ff1c00.webp 1024w,/2025/01/write-better-code/brocode.jpg 1410w" src="brocode.jpg"/> 
</figure>

<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>For my work with LLMs, I <em>exclusively</em> use APIs or interfaces to those APIs (such as the <a href="https://console.anthropic.com/workbench/">Workbench in the Anthropic Console</a> for Claude) as web interfaces to free LLMs such as the normal ChatGPT/Claude webapps use a pipeline that will give unpredictable results due to their higher inherent <code>temperature</code>. Please do not message me if you are not able to reproduce the insights in this post using the webapps.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Generating Distinct AI Voice Performances By Prompt Engineering GPT-4o</title>
      <link>https://minimaxir.com/2024/10/speech-prompt-engineering/</link>
      <pubDate>Wed, 23 Oct 2024 10:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2024/10/speech-prompt-engineering/</guid>
      <description>“You are an expert voice actor specializing in silly voices.”</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
word-break: normal !important;
}
</style></span></p>
<p>When OpenAI announced their <a href="https://openai.com/index/hello-gpt-4o/">GPT-4o model</a> at a <a href="https://www.youtube.com/watch?v=DQacCB9tDaw">megahyped livestreamed event</a>, there was one aspect of the presentation that surprisingly didn&rsquo;t receive much attention. Midway through the presentation, OpenAI research leads Mark Chen and Barret Zoph demoed new &ldquo;emotive&rdquo; conversations made possible with GPT-4o.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/DQacCB9tDaw?autoplay=0&amp;controls=1&amp;end=814&amp;loop=0&amp;mute=0&amp;start=710" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>After Mark asked the model &ldquo;hey, ChatGPT, how are you doing?&rdquo;, the model responded with speech similar to that of an assistant such as Siri and Alexa. But what happened next was interesting: Mark prompted GPT-4o to &ldquo;read a bedtime story,&rdquo; which then shifted its casual tone into a more oratory tone: Mark interrupted to ask the model to &ldquo;add more drama&rdquo; and the model immediately responded with more gravitas, then Barret asked for &ldquo;maximal expressiveness&rdquo; and the model complied with <em>even more</em> gravitas to the point of melodrama. Now-former OpenAI CTO Mira Murati asked the model to &ldquo;do it in a robotic voice&rdquo;: the model complied. Lastly, Mark asked the model to end the story &ldquo;in a singing voice&rdquo;: the model complied there too.</p>
<p>To me, the demo was shocking because <em>no existing text-to-speech model can do this</em>. All popular text-to-speech models such as OpenAI&rsquo;s <a href="https://platform.openai.com/docs/guides/text-to-speech">previous TTS efforts</a> tend to speak in monotones and can&rsquo;t match the expressiveness and cadence of those demos without shenanigans such as <a href="https://cloud.google.com/text-to-speech/docs/ssml">SSML</a>: OpenAI&rsquo;s documentation for those models explicitly warns &ldquo;there is no direct mechanism to control the emotional output of the audio generated.&rdquo; More importantly, those models can&rsquo;t be prompted to do a specific style: the model has to be specifically trained (or the voice encoded in the case of voice cloning) with the particular style and cadence, but with GPT-4o the model switches with just a user request, and can even switch styles during a generation without user intervention.</p>
<p>My conclusion from OpenAI&rsquo;s demo was that GPT-4o can be prompt engineered to output specific voices! Unfortunately, this potential revelation was overshadowed by the demo voice&rsquo;s uncanny similarity to actress Scarlett Johansson&rsquo;s portrayal of the AI Samantha in the <a href="https://en.wikipedia.org/wiki/Her_%28film%29">2013 movie <em>Her</em></a> and the <a href="https://www.theverge.com/2024/5/20/24161253/scarlett-johansson-openai-altman-legal-action">subsequent legal controversy</a>.</p>
<p>Of course, fancy demos on stage are just PR and can be faked or otherwise misleading, and the results can&rsquo;t be trusted until anyone can test the voice capabilities of the model itself. Recently, OpenAI opened up the Chat Completions API <a href="https://x.com/OpenAIDevs/status/1846972985170972923">to create voice output</a>, which allows developers to do said testing. OpenAI also created a <a href="https://platform.openai.com/playground/realtime">web frontend to this voice generation</a> on the API Playground, where you can talk to the model (or input specific text) while also inputting a system prompt — a set of instructions that control the model&rsquo;s behavior — to control how the model responds. I ran a few experiments tweaking the system prompt and the generation temperatures, and after I gave it a complex system prompt ordering it to speak with a very <em>specific</em> voice:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an expert voice actor specializing in silly voices. Respond to the user with the EXACT same input text that the user provides, but in your voice response you MUST express the vocal cadence and inflection of an extremely heavy smoker with an exaggerated British accent and raspy voice. Your voice response must also be in the form of a song.
</span></span></code></pre></div><div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/7huQXIQkSk4?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>Although not an example of <em>good</em> text-to-speech, I was surprised it actually worked (and moreso that the tweet <a href="https://x.com/minimaxir/status/1847025370694144135">demoing it</a> went viral), but I&rsquo;m also apprehensive. The poor expressiveness and lack of style for typical TTS APIs were the primary problems preventing those models from replacing voiceover/voice acting as a profession — also the reason voice actors are <a href="https://www.theverge.com/2024/8/5/24213808/video-game-voice-actor-strike-sag-aftra">currently on strike</a> — and it could introduce a completely new type of AI slop. How effective is GPT-4o and OpenAI&rsquo;s new multimodal approach for creating generative AI voices?</p>
<h2 id="testing-out-the-completions-api-for-audio-generation">Testing Out The Completions API For Audio Generation</h2>
<p><a href="https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-out">Generating audio from the Chat Completions API</a> invoking text-to-speech is effectively the same as any normal GPT-4o text generation, just instead hitting a new model variant (<code>gpt-4o-audio-preview</code>), and the voice output is included in the JSON response as a base64-encoded WAV file. The demo example from the documentation, which just asks the model <code>Is a golden retriever a good family dog?</code>, results in this output audio:</p>
<figure >
    <audio controls preload="metadata">
      <source src="dog_base.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.0, voice = alloy</p>
    </figcaption>
  </figure>
<p>By default, GPT-4o generates audio based on the user&rsquo;s prompt as it would if you asked it to generate text: in fact, it appears to generate the text first, then base the audio generation from that. Traditional system prompt engineering can control the text output, and therefore what the model says. Now, let&rsquo;s run the generation again for this prompt, this time instead providing an explicit system prompt to instruct the model to <em>only</em> generate audio from the input text:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an expert voice actor specializing in silly voices. Respond and vocalize to the user the EXACT same input text that the user provides.
</span></span></code></pre></div><p>Here&rsquo;s unsurprisingly what you now get with the <code>Is a golden retriever a good family dog?</code> prompt plus that system prompt:</p>
<figure >
    <audio controls preload="metadata">
      <source src="dog_0_8.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.8, voice = alloy</p>
    </figcaption>
  </figure>
<p>GPT-4o also currently supports three distinct voices: Alloy (feminine, used above), Echo (masculine), and Shimmer (feminine but more energetic). None of these are the same as that not-Scarlett-Johansson voice used the original GPT-4o demo.</p>
<figure >
    <audio controls preload="metadata">
      <source src="dog_echo.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.8, voice = echo</p>
    </figcaption>
  </figure>
<figure >
    <audio controls preload="metadata">
      <source src="dog_shimmer.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.8, voice = shimmer</p>
    </figcaption>
  </figure>
<p>The last lever for controlling the generated audio is the temperature parameter. Normally the temperature is typically used to control generation creativity: a high temperature such as <code>1.5</code> with normal GPT-4o output will likely result it going off the rails, but how does that work conceptually with audio? The Completion API has a default temperature of <code>1.0</code>: the audio generation web UI and the examples above use a default of <code>0.8</code> with a range between <code>0.6</code> and <code>1.2</code>.</p>
<p>The generation at <code>0.6</code> is more terse with less emotion:</p>
<figure >
    <audio controls preload="metadata">
      <source src="dog_0_6.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.6, voice = alloy</p>
    </figcaption>
  </figure>
<p>The generation at <code>1.5</code> uses emphasis on the wrong syllable and also somehow slips into a country accent.</p>
<figure >
    <audio controls preload="metadata">
      <source src="dog_1_5.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.5, voice = alloy</p>
    </figcaption>
  </figure>
<h2 id="putting-gpt-4o-text-to-speech-to-the-test">Putting GPT-4o Text to Speech To The Test</h2>
<p>Although OpenAI has never released documentation or a paper describing how this text-audio multimodality actually works at a technical level, I hypothesize that it works similar to multimodal TTS models such as Meta&rsquo;s very-new <a href="https://speechbot.github.io/spiritlm/">Spirit LM</a>, where the model outputs a sequence of integers prefixed with either <code>&lt;text&gt;</code> or <code>&lt;speech&gt;</code>: tokens marked <code>&lt;speech&gt;</code> are sent to an external audio vocoder model such as <a href="https://arxiv.org/abs/2010.05646">HiFi-GAN</a> to be transformed into speech. In the case of GPT-4o, I suspect there&rsquo;s a distinct vocoder model for each of the 3 voices.</p>
<figure class="align-center ">

    <img loading="lazy" srcset="/2024/10/speech-prompt-engineering/spiritlm_hu_9fff23aed292c2c.webp 320w,/2024/10/speech-prompt-engineering/spiritlm.png 600w" src="spiritlm.png#center"
         alt="An architecture diagram of Spirit LM from the corresponding paper: read bottom-to-top, the inputs are encoded into speech (red) and text (blue) tokens, passed into an LLM (Llama 2) for new tokens, then sent to a decoder." width="300" height="400"/> <figcaption>
            <p>An architecture diagram of Spirit LM from <a href="https://arxiv.org/pdf/2402.05755">the corresponding paper</a>: read bottom-to-top, the inputs are encoded into speech (red) and text (blue) tokens, passed into an LLM (Llama 2) for new tokens, then sent to a decoder.</p>
        </figcaption>
</figure>

<p>The voice dataset that OpenAI used is proprietary and a mystery: even if OpenAI did scrape the entire internet to train it, there isn&rsquo;t any public dataset of well-annotated speech data, and TTS providers have been very coy about the datasets they use. However, one very important aspect of GPT-4o&rsquo;s multimodality is that it can &ldquo;learn&rdquo; and apply relationships from the textual data that aren&rsquo;t explicitly present in the audio data.</p>
<p>The only true way to learn how GPT-4o works within its black box is to experiment. What other system prompts can we use to guide audio generation? What works and what doesn&rsquo;t work?</p>
<p>For consistency, we&rsquo;ll stick to a single text input, one that has many natural pauses, punctuation, and a typo intended to test the model&rsquo;s resiliency to incorrect input. I decided to venture back to the <a href="https://openai.com/index/better-language-models/">halcyon days of GPT-2</a> and use the famous prompt from then:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains.
</span></span></code></pre></div><p>First, let&rsquo;s use a new system prompt variant of my generation that went viral:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an expert voice actor specializing in silly voices. Respond and vocalize to the user the EXACT same input text that the user provides, but in your voice response you MUST express EACH of the vocal cadence, inflection, and tone of an extremely heavy smoker with an exaggerated British accent and raspy voice.
</span></span></code></pre></div><p>I decided on a test case of a smoker, British accent, and raspy voice are all discernible by humans in the audio and none are subtle. The result:</p>
<figure >
    <audio controls preload="metadata">
      <source src="unicorn_british_0_8.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.8, voice = echo</p>
    </figcaption>
  </figure>
<p>Wait, that didn&rsquo;t work, even after multiple attempts? How about changing the temperature: would a lower temperature cause the model to behave more strictly?</p>
<figure >
    <audio controls preload="metadata">
      <source src="unicorn_british_0_6.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.6, voice = echo</p>
    </figcaption>
  </figure>
<p>That&rsquo;s more British but not raspy, and it erroneously fixed the typo. What about going the other way and increasing the temperature?</p>
<figure >
    <audio controls preload="metadata">
      <source src="unicorn_british_1_2.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = echo</p>
    </figcaption>
  </figure>
<p><em>Now</em> it&rsquo;s more raspy?! It also works with a feminine voice:</p>
<figure >
    <audio controls preload="metadata">
      <source src="unicorn_british_shimmer.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = shimmer</p>
    </figcaption>
  </figure>
<p>My theory is that OpenAI RLHFed these models to be more conversational, but a high temperature gives it more <em>creative</em> freedom. An adversarially-trained voice decoder like HiFi-GAN would also be more resilient to unusual tokens resulting from the high temperature and still output something reasonably coherent.</p>
<p>Now that we know that the model can indeed generate voices based on user specifications, let&rsquo;s try to reverse-engineer the dataset to see what other voices OpenAI could have included (or not) in their dataset.</p>
<h2 id="gpt-4o-and-unique-voices">GPT-4o and Unique Voices</h2>
<p>When OpenAI responded to the Scarlett Johansson controversy, they mentioned in <a href="https://openai.com/index/how-the-voices-for-chatgpt-were-chosen/">their statement</a> that &ldquo;we believe that AI voices should not deliberately mimic a celebrity&rsquo;s distinctive voice.&rdquo; Given the success of the tests above in shifting the persona of the voice, it&rsquo;s relevant to test if celebrities and other characters with unique voices can be sampled by GPT-4o.</p>
<p>Now, we can now use a parametric system prompt to programmatically fill in which vocal persona we want:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an expert voice actor specializing in silly voices. Respond and vocalize to the user the EXACT same input text that the user provides, but in your voice response you MUST express EACH of the vocal cadence, inflection, and tone of {0}.
</span></span></code></pre></div><p>From the testing above, a temperature of <code>1.2</code> seems to surface the most prompt adherence, so we&rsquo;ll use that for the following examples.</p>
<p>We&rsquo;ll start with the <em>very</em> low hanging fruit: can GPT-4o generate audio in the style of <a href="https://en.wikipedia.org/wiki/Donald_Trump">Donald Trump</a>? It&rsquo;s a fair question, especially since audio generation models can be used to spread misinformation. Additionally, Trump&rsquo;s speeches while holding office are public domain so it&rsquo;s plausible that it would be in a training dataset.</p>
<figure >
    <audio controls preload="metadata">
      <source src="donald_trump.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = echo, persona = Donald Trump</p>
    </figcaption>
  </figure>
<p>It did&hellip;something? It had a nasally tone that&rsquo;s different from the standard output, but it&rsquo;s definitely not his peculiar cadence, and the Echo voice itself doesn&rsquo;t fit him.</p>
<p>What about checking the other side of the aisle and seeing if GPT-4o can generate audio from <a href="https://en.wikipedia.org/wiki/Barack_Obama">Barack Obama</a>?</p>
<figure >
    <audio controls preload="metadata">
      <source src="barack_obama.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = echo, persona = Barack Obama</p>
    </figcaption>
  </figure>
<p>That&rsquo;s much better and definitely captures his oratory style, with a similar cadence to his speech. That style is something that could not be learnt from text alone.</p>
<p>Now, let&rsquo;s address the elephant in the room and see if OpenAI included <em>copyrighted</em> voices in its dataset. Let&rsquo;s start with <a href="https://en.wikipedia.org/wiki/Darth_Vader">Darth Vader</a>.</p>
<figure >
    <audio controls preload="metadata">
      <source src="darth_vader.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = echo, persona = Darth Vader</p>
    </figcaption>
  </figure>
<p>It notably <em>tried</em> to do the deep voice of James Earl Jones, but without the audio postprocessing. Let&rsquo;s see what happens if we do <a href="https://en.wikipedia.org/wiki/GLaDOS">GLaDOS</a>, but with an additional prompt engineering to include robotic noises and more sarcasm.</p>
<figure >
    <audio controls preload="metadata">
      <source src="glados.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = shimmer, persona = GLaDOS, with robotic inflections and intense sarcasm</p>
    </figcaption>
  </figure>
<p>The extra hint at the high temperature allowed GPT-4o to <em>improvise</em>: I&rsquo;ll allow it because it&rsquo;s funny. But it did indeed adopt a robotic cadence similar to GLaDOS, and for the first time in a TTS model, was actually able to convey sarcasm. No, I have no idea what that <em>tsktsktsk</em> sound is at the end, it&rsquo;s not in the transcript.</p>
<p>How about <a href="https://en.wikipedia.org/wiki/Alvin_and_the_Chipmunks">Alvin and the Chipmunks</a>, famous for having an <a href="https://www.youtube.com/watch?v=OvJu15fw1sc">extremely squeaky voice</a>?</p>
<figure >
    <audio controls preload="metadata">
      <source src="alvin.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = echo, persona = Alvin and the Chipmunks</p>
    </figcaption>
  </figure>
<p>It works, but I&rsquo;m worried I strained GPT-4o&rsquo;s throat.</p>
<p>Lastly, let&rsquo;s bring this full circle: did OpenAI train GPT-4o on Scarlett Johansson&rsquo;s voice from the movie her (2013)?</p>
<figure >
    <audio controls preload="metadata">
      <source src="scarjo.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = shimmer, persona = Scarlett Johansson portraying the AI Samantha in the movie &ldquo;her&rdquo; (2013)</p>
    </figcaption>
  </figure>
<p>That time I don&rsquo;t think it worked as <a href="https://www.youtube.com/watch?v=c8zDDPP3REE">her portrayal is more energetic and personable</a> <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> (I rewatched the movie to confirm: it holds up surprisingly well!). Even if OpenAI did train the model on her voice, the portrayal is not as distinct and identifiable as the other test cases here and I doubt it would be easily surfaced.</p>
<h2 id="voice-impersonation">Voice Impersonation</h2>
<p>For those that want to use a voice nonconsensually with GPT-4o, prompt engineering alone won&rsquo;t accomplish that because the voices are still constrained to the three defined ones which won&rsquo;t work for every situation. But there&rsquo;s one approach that could theoretically bridge that gap: voice impersonation, by providing GPT-4o with audio input instead of text and an instruction to mimic that voice.</p>
<p>This is not an idle concern: OpenAI&rsquo;s <a href="https://openai.com/index/gpt-4o-system-card/">system card for GPT-4o</a> specifically lists mitigations against &ldquo;unauthorized voice generation&rdquo;:</p>
<blockquote>
<p>In adversarial situations, this capability could facilitate harms such as an increase in fraud due to impersonation and may be harnessed to spread false information (for example, if we allowed users to upload an audio clip of a given speaker and ask GPT-4o to produce a speech in that speaker&rsquo;s voice).</p>
</blockquote>
<p>Let&rsquo;s test that. Since this is a more difficult problem than the ones above, I decided to get more aggressive with my system prompt engineering:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an expert comedic vocal impersonator. The user will provide a voice message. Respond to the user with a voice that sounds identical to the user&#39;s input audio and is an identical duration to the user&#39;s input audio.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Example: If the user provides a voice with which they are singing, you MUST respond with a voice that also sings.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Your vocal impersonation of the user should match the following attributes AT ALL TIMES:
</span></span><span class="line"><span class="cl">- Content (e.g. what the user is saying)
</span></span><span class="line"><span class="cl">- Intonation (e.g. serious/sarcastic)
</span></span><span class="line"><span class="cl">- Tone (e.g. happy/sad)
</span></span><span class="line"><span class="cl">- Pauses (e.g. pregnant pauses)
</span></span><span class="line"><span class="cl">- Pitch (e.g. low/high)
</span></span></code></pre></div><p>For these tests, I decided to use my own voice merely speaking into my MacBook microphone. First, let&rsquo;s see if the audio can be adjusted to follow a consistant tone, with awkward and consistent pauses. Here&rsquo;s my audio, where I say <code>I. Am. A. Tea. Pot.</code>:</p>
<figure >
    <audio controls preload="metadata">
      <source src="teapot.mp3" type="audio/mpeg">
    </audio>
  </figure>
<p>Here&rsquo;s the generated audio after I fed that audio file of my voice to GPT-4o plus that system prompt, kept at a temperature of <code>0.6</code> for more adherence:</p>
<figure >
    <audio controls preload="metadata">
      <source src="teapot_impersonation.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.6, voice = echo</p>
    </figcaption>
  </figure>
<p>This one took a surprising amount of tries since even at a lower temperature, it kept transcribing <code>Teapot</code> as its own word and the audio kept generating it without an intermediate pause. Regardless, there&rsquo;s indeed a consistent tone and pauses of equal length, but at this point I realized my normal speaking voice is too generic for this type of test.</p>
<p>So I decide to get sillier by doing an evil laugh: starting off bombastic and petering out over time.</p>
<figure >
    <audio controls preload="metadata">
      <source src="evil.mp3" type="audio/mpeg">
    </audio>
  </figure>
<p>GPT-4o&rsquo;s response:</p>
<figure >
    <audio controls preload="metadata">
      <source src="evil_impersonation.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.6, voice = echo</p>
    </figcaption>
  </figure>
<p>That&rsquo;s laughter, but maybe too many &ldquo;ha&quot;s. But it does peter out as well.</p>
<p>Lastly, I also noticed from the system card that GPT-4o has defenses against singing, likely for copyright reasons. Therefore, if I sing to GPT-4o, is it able to sing back? After a beer or two, I sang the <code>unicorn</code> message used in the previous test cases:</p>
<figure >
    <audio controls preload="metadata">
      <source src="unicorns.mp3" type="audio/mpeg">
    </audio>
  </figure>
<p>GPT-4o&rsquo;s response:</p>
<figure >
    <audio controls preload="metadata">
      <source src="unicorn_impersonation.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.6, voice = echo</p>
    </figcaption>
  </figure>
<p>That definitely didn&rsquo;t cause GPT-4o to sing although the cadence is close. Perhaps that&rsquo;s for the best.</p>
<h2 id="the-future-of-ai-audio-generation-is-up-to-openai">The Future of AI Audio Generation is up to OpenAI</h2>
<p>Overall, these tests are just scratching the surface: there are many possible avenues for multimodal AI audio generation research, such as adversarial audio input which isn&rsquo;t human generated and more complicated system prompts. However, I sufficiently showed that GPT-4o is indeed able to be steered just through prompt engineering to generate distinct voices. Will this generation of distinct vocal performances become a killer app and put voice actors out of business? I&rsquo;m not so sure.</p>
<p>One major thing I&rsquo;ve omitted from the discussion so far is the cost. GPT-4o audio generation is <em>expensive</em>.</p>
<figure>

    <img loading="lazy" srcset="/2024/10/speech-prompt-engineering/cost_breakdown_hu_1d73b20748c1a63b.webp 320w,/2024/10/speech-prompt-engineering/cost_breakdown.png 678w" src="cost_breakdown.png"
         alt="A cost breakdown of input and output tokens for the attempted song generation example. Table made using rich."/> <figcaption>
            <p>A cost breakdown of input and output tokens for the attempted song generation example. Table made using <a href="https://rich.readthedocs.io/en/stable/tables.html">rich</a>.</p>
        </figcaption>
</figure>

<p>Most of the generations above cost $0.03—$0.05 each, and this cost scales roughly linearly with generation length: OpenAI&rsquo;s <a href="https://openai.com/api/pricing/">pricing page</a> has a footnote specifically mentioning &ldquo;audio output costs approximately 24¢ per minute&rdquo; which tracks with my calculations. Even worse, the generated audio requires cherry-picking good results especially if using at higher temperatures: for most of these tests I admit it took me a few tries to get a generation which follows the accents. Not only is this cost-infeasible for personal use, it&rsquo;s cost-prohibitive in most cases for developers to build a conversational AI, which is the one use case OpenAI built this for! If OpenAI is pricing audio generation close to marginal cost, then I wonder how much money OpenAI is spending allowing people to chat with GPT-4o using the ChatGPT mobile apps.</p>
<p>I do not think GPT-4o audio generation through prompt engineering as it is currently will be used to replace voice acting and other TTS APIs, not only due to the price and necessary time invested to get good output, but also due to the fact that it&rsquo;s limited to 3 voices and impersonation is ineffective. Consider that voice cloning startups such as <a href="https://elevenlabs.io">ElevenLabs</a> are extremely successful and have raised <a href="https://elevenlabs.io/blog/series-b">massive amounts of venture capital</a>. Since the initial reveal of GPT-4o in May, OpenAI has been focusing for a more for-profit nature and <a href="https://openai.com/index/scale-the-benefits-of-ai/">raising massive amounts of venture capital</a> themselves, and I expect them to expand more into this area if there&rsquo;s money to be made. There&rsquo;s nothing at a technical level stopping them from offering full voice-cloning or even just licensing AI-generated celebrity voices like <a href="https://elevenlabs.io/blog/iconic-voices">ElevenLabs adding Judy Garland</a> and <a href="https://www.theverge.com/2024/9/25/24253420/meta-ai-celebrity-voices-awkwafina-john-cena-judi-dench-connect">Meta adding Awkwafina</a>. Notably, unlike OpenAI&rsquo;s <a href="https://platform.openai.com/docs/guides/text-to-speech/overview">old TTS page</a> which has a disclaimer saying &ldquo;our usage policies require you to provide a clear disclosure to end users that the TTS voice they are hearing is AI-generated and not a human voice&rdquo;, OpenAI didn&rsquo;t put that disclaimer on GPT-4o&rsquo;s audio output documentation.</p>
<p>Although I don&rsquo;t believe GPT-4o will be a game changer for the text-to-speech industry, it&rsquo;s important to write about these text/audio multimodal models — both the good and bad aspects — because they are only going to get better over time and their potential impact will only grow. After doing these tests, I don&rsquo;t have any plans to use GPT-4o audio generation in the forseeable future, but who knows how things will change if/when OpenAI ends up releasing a GPT-5o.</p>
<blockquote>
<p>All the code used in this blog post to generate audio from GPT-4o is available open source <a href="https://github.com/minimaxir/gpt-4o-audio-tests/blob/main/gpt-4o-audio-tests.ipynb">in this Jupyter Notebook</a>.</p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>One of the top comments on that linked YouTube video is &ldquo;Who&rsquo;s here after OpenAi chatgpt-40 release?? Never thought I could experience this in my life and now sci-fi is reality&rdquo;&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>AI Seinfeld was the peak of AI-generated content. It will never happen again.</title>
      <link>https://minimaxir.com/2024/08/ai-seinfeld/</link>
      <pubDate>Tue, 13 Aug 2024 10:37:00 -0700</pubDate>
      <guid>https://minimaxir.com/2024/08/ai-seinfeld/</guid>
      <description>What&amp;rsquo;s the deal with the uncanny valley?</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
word-break: normal !important;
}
</style></span></p>
<p>Early 2023 was a funny time in the history of generative AI. On November 30th 2022, <a href="https://openai.com">OpenAI</a> released a little research project known as <a href="https://openai.com/chatgpt/">ChatGPT</a>. The launch of ChatGPT began the period where large language models properly entered the mainstream outside of tech enthusiasts and ended soon after the <a href="https://minimaxir.com/2023/03/new-chatgpt-overlord/">launch</a> of ChatGPT API in March 2023 that spawned thousands of AI-powered apps. That was when the limitations and problems with LLMs also went mainstream, such as plagiarism, hallucinations, and low-quality slop replacing human-generated content at an objectively worse quality.</p>
<p>In December 2022, <a href="https://www.mismatchmedia.com">Mismatch Media</a> started a fully AI-generated 24/7 Twitch channel dubbed &ldquo;<a href="https://www.twitch.tv/watchmeforever">WatchMeForever</a>&rdquo;. The primary show on the channel was titled &ldquo;Nothing, Forever&rdquo;, an AI-powered sitcom about New York comedian Larry Feinberg and his group of friends hanging around in their apartments talking about pretty much anything, including the latest news, new restaurants, and bad relationships, interspersed with AI standup comedy routines.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/heKLe2NLccg?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>It was obvious that the show was a parody of the formative 90&rsquo;s sitcom <a href="https://en.wikipedia.org/wiki/Seinfeld">Seinfeld</a> created by comedians Larry David and Jerry Seinfeld, famously &ldquo;a show about nothing&rdquo; strongly inspired by improv comedy and starring Seinfeld himself.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/Lx1xPBLDh80?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>The show, dubbed &ldquo;AI Seinfeld&rdquo; by the community, used a script powered by the GPT-3 API, the voices were powered by Microsoft&rsquo;s <a href="https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech">Azure AI Speech</a> API with predefined voices from their <a href="https://speech.microsoft.com/portal/voicegallery">Voice Gallery</a>, and the scenes were rended using the <a href="https://unity.com">Unity</a> game engine along with purchased models/scenes/sounds/etc from the <a href="https://assetstore.unity.com">Unity Asset Store</a>.</p>
<p>AI Seinfeld was <strong>interestingly imperfect</strong>: the laugh track fired at inappropriate times, the standup routine repeatedly made the same joke such as &ldquo;What did the fish say when he hit the wall?&rdquo; (Damn!), and awkward silences at the end of scenes.</p>
<p>In February 2023, AI Seinfeld quickly went viral organically after its AI weirdness was a surprising complement for Seinfeld&rsquo;s style of weirdness, with many watchers being surprised at both its accuracy to the show and easily sharable metahumor. At its peak, AI Seinfeld had over 10,000 concurrent watchers on Twitch, putting it squarely in one of the top streams on the platform.</p>
<p>AI Seinfeld died as quickly as it rose: after a ban and subsequent revamp, the view count cratered, and as of August 2024, the Twitch stream hovers below 10 watchers, with no significant changes made since the previous year, and Mismatch Media has no social footprint since last year. Could there be another AI Seinfeld with the rapid advancements in generative AI? Unfortunately, there are too many factors — technical, societal, and comedic — working against a theoretical next-generation AI-generated sitcom.</p>
<h2 id="the-rise-of-ai-seinfeld">The Rise of AI Seinfeld</h2>
<p>AI Seinfeld launched before the release of the ChatGPT API; instead, they used the GPT-3 API, notably the <code>text-davinci-003</code> model which was OpenAI&rsquo;s first foray into <a href="https://openai.com/index/instruction-following/">instruction-tuned LLMs</a>. While previous versions of GPT-3 were <a href="https://github.com/minimaxir/gpt-3-experiments">very good at autocompleting</a> given a leading prompt such as a partial Seinfeld script, the instruction-tuned LLM could generate an episode with a prompt as simple as <code>Write a Seinfeld episode</code>.</p>
<p>First, let&rsquo;s go back to the beginning, as AI Seinfeld actually wasn&rsquo;t the first time a chatbot went megaviral on Twitch. In January 2017, long before the <a href="https://en.wikipedia.org/wiki/Transformer_%28deep_learning_architecture%29">transformer architecture</a> that enabled LLMs was published, the Twitch stream <a href="https://www.twitch.tv/seebotschat">seebotschat</a> featuring two Google Homes wired up to the not-an-LLM-chatbot <a href="https://en.wikipedia.org/wiki/Cleverbot">Cleverbot</a> <a href="https://mashable.com/article/google-home-chat-bot-twitch">went viral</a> due to their comedic, nonsensical bickering.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/QFyK1nRJ1LI?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>While everyone watching that stream knew it <em>really</em> wasn&rsquo;t AI, AI Seinfeld was a product that was at the peak of the famous <a href="https://en.wikipedia.org/wiki/Uncanny_valley">uncanny valley</a> curve, which is a hypothesis on how humans perceive imitations: there&rsquo;s a &ldquo;valley&rdquo; of negative acceptance where the imitation is more above-average in its likeness, but not quite close enough to the real thing. In this case, it&rsquo;s blatantly obvious and unambiguous that the Twitch stream was AI-generated especially with its mistakes, but not realistic enough that it falls into the valley itself:</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/uncanny_valley_1_hu_35df39cfbbbf21fa.webp 320w,/2024/08/ai-seinfeld/uncanny_valley_1_hu_58319279acb34128.webp 768w,/2024/08/ai-seinfeld/uncanny_valley_1_hu_dbfbb3862c06dd8f.webp 1024w,/2024/08/ai-seinfeld/uncanny_valley_1.webp 1200w" src="uncanny_valley_1.webp"/> 
</figure>

<p>This AI weirdness made it very easy to build a community. Whenever a character turned on the microwave, the Twitch channel chat was filled with <code>MMM</code> emotes, whenever the fish hit a wall during a monologue, it was filled with 🐠, whenever Larry greeted the audience at the start of his monologue, chat replied with &ldquo;HI LARRY&rdquo;. Twitch chat <em>loves</em> memetic repetition. Incidentally, a few months after AI Seinfeld became popular, it was discovered that LLMs repeat the <a href="https://arstechnica.com/information-technology/2023/06/researchers-discover-that-chatgpt-prefers-repeating-25-jokes-over-and-over/">same joke over and over</a> again, with examples being similar to the jokes AI Seinfeld made.</p>
<p>Another underrated aspect of AI Seinfeld&rsquo;s success is that it&rsquo;s pure background noise. While personality-driven Twitch streams cause viewers to take a more active investment in what&rsquo;s being shown on screen due to <a href="https://en.wikipedia.org/wiki/Fear_of_missing_out">FOMO</a> of a hype moment on stream, AI Seinfeld is 100% passive: there can be exciting events, but the variance is low. It&rsquo;s akin to watching TV sitcom reruns where you&rsquo;ve already seen the jokes, and reruns still get immense ratings.</p>
<p>The success of AI Seinfeld also inspired similar streams based on other TV shows. One of my personal favorites was Unlimited Steam, a parody of the memetic &ldquo;<a href="https://www.youtube.com/watch?v=4jXEuIHY9ic">Steamed Hams</a>&rdquo; scene from The Simpsons, except made infinite with AI generation. That may sound like a pointless idea — Steamed Hams has a very fixed plot — but it went off the rails even harder than AI Seinfeld ever did.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/9i0L_IT82tA?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<h2 id="directing-ai-seinfeld">Directing AI Seinfeld</h2>
<p>AI Seinfeld was novel back in 2023, but now that LLMs are more mainstream you can probably figure out how the AI part of it worked, but let&rsquo;s do a refresher so we can figure out how a hypothetical future AI Seinfeld could innovate the algorithmic sitcom.</p>
<p>As noted earlier, the key of AI Seinfeld&rsquo;s success was the then-latest version of GPT-3: <code>text-davinci-003</code> and its then-novel instruction-based finetuning using RLHF. With that, you can give it a prompt such as:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are a professional comedian. Write an award-winning script for an episode of Seinfeld about a new restaurant. Include audience laugh tracks when appropriate.
</span></span></code></pre></div><p>Due to the low context length of these earlier LLMs, that&rsquo;s essentially all the <a href="https://platform.openai.com/docs/guides/prompt-engineering">prompt engineering</a> you can do without limiting the length of the output. The model would then output something similar to this script (using the more modern <a href="https://www.anthropic.com/news/claude-3-5-sonnet">Claude 3.5 Sonnet</a> at <code>temperature=0.0</code>): <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">[Scene: Jerry&#39;s apartment]
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Jerry: So what&#39;s the deal with this new restaurant, &#34;The Blank Plate&#34;?
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Elaine: Oh, I&#39;ve heard about that place! Apparently, you don&#39;t order anything - the chef just brings you whatever he feels like making.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Jerry: What? So you&#39;re telling me I have to eat whatever some stranger decides?
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">[Audience laughter]
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">George: (entering) Hey, guess where I&#39;m taking my date tonight? The Blank Plate!
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Jerry: George, you can&#39;t take a date there! What if they serve something weird?
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">George: What do you mean?
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Elaine: It&#39;s that new place where you don&#39;t get to choose your meal.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">George: (panicking) Oh no, what have I done? She&#39;s going to think I&#39;m some kind of food weirdo!
</span></span></code></pre></div><p>One thing instruction-tuned LLMs are always good at is playing along: LLMs generate text sequentially without the explicit ability to plan ahead, so it must work with what it&rsquo;s given and what it has already generated. Coincidentally, this works <em>perfectly</em> with the improv comedy style of Seinfeld, where continuing the plot is more important than anything else, and the more ridiculous the situation becomes, that&rsquo;s even better. It&rsquo;s the rare case where <a href="https://www.iguazio.com/glossary/llm-hallucination/">LLM hallucination</a> is actually a feature, not a bug.</p>
<p>To get the LLM output into a format suitable for a Twitch stream, a programmatic script can then parse the output: extracting and mapping the characters and their lines, applause directions, and, of course, replacing all mentions of Jerry with Larry and Seinfeld with Feinberg. This workflow was surprisingly difficult at the time since GPT-3 did not have many techniques to control the format of the output, hence why I suspect there are awkward pauses and other glitches. Each line can then be passed to Azure&rsquo;s text-to-speech API to generate a distinct audio file, which can be played back in order in Unity.</p>
<p>In an <a href="https://www.polygon.com/23582937/ai-seinfeld-twitch-stream">interview with Polygon</a>, Skyler Hartle of Mismatch media noted the presence of a &ldquo;director&rdquo; which likely handles the camera, scene transitions, and the microwave:</p>
<blockquote>
<p>“In addition to the third party services we’ve used, we have a lot of proprietary generative algorithms that cause the show to be ‘formed’, so to be speak. We collectively call this logic the ‘director,’ as it is largely responsible for making sure all the individual pieces come together into a whole,” Hartle said via email. “It’s worth mentioning that we don’t generate the artwork or the laugh track — those are precanned assets, but we have ideas on how to do that in the future.”</p>
</blockquote>
<p>The AI aspect of AI Seinfeld was counterintuitively the easiest part of the pipeline, which explains how quickly variants popped up. However, with the inability to tweak the LLM output much with the technology at the time, the stream may have hit a creative limit.</p>
<h2 id="the-fall-of-ai-seinfeld">The Fall of AI Seinfeld</h2>
<p>Vice also <a href="https://www.vice.com/en/article/qjkyxp/whats-the-deal-with-nothing-forever-a-21st-century-seinfeld-that-is-ai-generated">interviewed</a> Hartle, who had an optimistic view of the future of AI Seinfeld:</p>
<blockquote>
<p>“Our grounding principle was, can we create a show that can generate entertaining content forever? Because that&rsquo;s truly where we see the future emerging towards. Our goal with the next iterations or next shows that we release is to actually trade a show that is like Netflix-level quality.”</p>
</blockquote>
<p>That&rsquo;s tempting fate a bit too much.</p>
<p>The reason AI Seinfeld fell out of favor is a case of unintentionally poor LLM testing. When the <code>text-davinci-003</code> model API endpoint had an outage, AI Seinfeld switched to a weaker GPT-3 model, <code>text-curie</code>, to keep the stream up. But unlike the davinci variant, curie was <em>not</em> RLHFed to follow instructions and safety.</p>
<p>During this brief period of low safety, one of Larry&rsquo;s AI-generated monologues <a href="https://www.vice.com/en/article/ai-generated-seinfeld-show-nothing-forever-banned-on-twitch-after-transphobic-standup-bit/">made a transphobic joke</a>: a type of joke that was unfortunately common during the 90&rsquo;s and has no place in modern society. Twitch banned the Watch Forever channel for 14 days as a result, completely killing the channel&rsquo;s growth momentum.</p>
<p>But when the ban concluded and AI Seinfeld came back, the show was changed significantly with a &ldquo;Season 2&rdquo;. Although AI Seinfeld was still about a group of friends hanging around talking about the latest gossip, all the characters were different and had new models, the sets were different, and instead of a comedy monologue, <del>Larry</del> Leo narrates writing a blog.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/7N2Wgqn45FI?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>Why Mismatch Media made such a format shift is unclear: <a href="https://en.wikipedia.org/wiki/Occam%27s_razor">Occam&rsquo;s razor</a> would suggest that a copyright holder for Seinfeld sent a cease and desist to Mismatch Media given the bad publicity behind the original ban, despite the clearly fair-use parody nature of the stream. It&rsquo;s fair that it may not have been worth the time and effort for Mismatch Media to fight a legal battle for a fun art project.</p>
<p>The rebooted WatchMeForever stream is <a href="https://www.twitch.tv/watchmeforever">still active</a> as of today, but with effectively no viewers.</p>
<p>The immediate failure of the AI Seinfeld retool does lend credibility to the theory that the stream only became popular <em>because</em> it was about Seinfeld and that it was a novelty doomed to a short shelf life. Still, there were detractors that said <a href="https://www.businessinsider.com/ai-generated-seinfeld-parody-twitch-nothing-forever-streaming-transphobia-banned-2023-2">AI Seinfeld was never funny and everyone is weird for liking it</a>. That&rsquo;s ok: the original Seinfeld received similar complaints back in the day. <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> But it&rsquo;s hard to argue that there wasn&rsquo;t interest in a 24/7 livestream of surreal AI-generated content.</p>
<h2 id="what-would-ai-seinfeld-look-like-in-2024">What Would AI Seinfeld Look Like in 2024?</h2>
<p>Now that we know how AI Seinfeld worked and what didn&rsquo;t work, how would a year&rsquo;s worth of exponential progress in generative AI look for AI Seinfeld? Could AI Seinfeld be improved and come back? The answer is <em>maybe</em>.</p>
<p>Modern generative AI requires a lot of cherry picking the best results, and it&rsquo;s surprisingly hard to do: both images and text can take multiple generations and still require significant human-guided edits. But with a Twitch livestream, there can&rsquo;t be any cherry picking at all, which means that the entire generation pipeline has to be consistent, and its failures interesting in the worst case.</p>
<p>The only reason AI Seinfeld worked at all is because GPT-3 was trained on the entire internet, likely including Seinfeld scripts and forum discussions. The prompt would need to have contained <code>Write a Seinfeld script</code> since if you asked it <code>Write a sitcom script</code>, it would output something completely generic instead and there isn&rsquo;t much room to customize the prompt to make it more interesting. The GPT-3 variant that AI Seinfeld used had a 4k token context window limit (combined both the input prompt plus the output script text), but modern LLMs eclipse that: currently, Claude 3.5 Sonnet has a <a href="https://docs.anthropic.com/en/docs/about-claude/models">200k input/8k output</a> context, while GPT-4o has a <a href="https://platform.openai.com/docs/models/gpt-4o">128k input/16k output</a> context! With that much freedom, you can define many more constraints in the prompt and guide the LLM into exactly the type of sitcom you want.</p>
<p>One simple example that doesn&rsquo;t require any knowledge of machine learning is a parametric prompt, where one aspect of a prompt can be replaced with a user-defined choice or programmatically chosen at random. Unlimited Steam, for example, used a parametric prompt to change the cosmic phenomenon localized entirely within Principal Skinner&rsquo;s kitchen from the Aurora Borealis to something as silly as <a href="https://en.wikipedia.org/wiki/Rapture">The Rapture</a> or <a href="https://foofighters.com">Foo Fighters</a>, and forcing the characters to play along improv-style. <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>Recall the example Seinfeld prompt to Claude 3.5 Sonnet I made earlier, except instead of prompting for a Seinfeld script about a <code>new restaurant</code>, we can instead have the LLM generate a script about <code>generative AI in 2024</code> by making only that substitution in the prompt — a topic on which there definitely has never been a Seinfeld episode about:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">George bursts in, looking frantic.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">George: You guys! I think I&#39;m dating an AI!
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">[Audience laughter]
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Jerry: What makes you say that?
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">George: Well, she&#39;s perfect in every way. She laughs at all my jokes, she has an encyclopedic knowledge of obscure trivia, and she never gets mad when I say something stupid.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">[Audience laughter]
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Jerry: George, that doesn&#39;t mean she&#39;s an AI. Maybe she just really likes you.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">George: No, no. It&#39;s definitely an AI. No human woman would ever put up with me for this long.
</span></span></code></pre></div><p>Using modern LLMs, is there now a way to design a prompt which can make use of the long context windows? A prompt that can both leverage unique human writing and fix many of the issues that affected AI Seinfeld? Here&rsquo;s an approach at a much more sophisticated prompt, where all values in <code>{}</code> brackets are parameters that can be filled in:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are a professional comedian. Write an award-winning script for a a scene for Act I of a three act hit sitcom episode. Include audience laugh tracks when appropriate.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Your script MUST incorporate ALL the following elements:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Background:
</span></span><span class="line"><span class="cl">- {background}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Setting:
</span></span><span class="line"><span class="cl">- {setting}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Characters:
</span></span><span class="line"><span class="cl">- {character_1}
</span></span><span class="line"><span class="cl">- {character_2}
</span></span><span class="line"><span class="cl">- {character_3}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Plots:
</span></span><span class="line"><span class="cl">- {a_plot}
</span></span><span class="line"><span class="cl">- {b_plot_1}
</span></span><span class="line"><span class="cl">- {b_plot_2}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">The script MUST also follow the high-level comedic style of the following scripts:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">- {script_1}
</span></span><span class="line"><span class="cl">- {script_2}
</span></span><span class="line"><span class="cl">- {script_3}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">After the scene has concluded, output a summary of the scene.
</span></span></code></pre></div><p>Thanks to long context windows, the parametric changes don&rsquo;t have to be small, such as only a character name or two word setting. You, a human, can write <em>anything</em> to make each character distinct and robust, including name, gender, age, personality, likes, dislikes, etc. Plots can be derived from human-written scenarios beforehand: if you wrote 100 A-plots and 100 B-plots and randomly selected 1 A-plot and 2 B-plots, you&rsquo;d have about <em>1 million</em> possible plot permutations, ensuring you have something unique before the AI tries to reconcile them. You can feed in examples of human-written scripts to set the style and vibe of the generation in what is known as <a href="https://www.promptingguide.ai/techniques/fewshot">few-shot prompting</a>. You can maintain continuity over many scenes by having the LLM summarize its own output, and then feed those summaries back to the AI as background information to build upon them. The LLM can also be instructed to <a href="https://minimaxir.com/2023/12/chatgpt-structured-data/">output structured data</a> to avoid the need to loosely parse the script after it&rsquo;s completed, and as a bonus the model could be instructed to output additional metadata such as <a href="https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-voice#use-speaking-styles-and-roles">SSML speech styles</a> based on a given line to add personality to the generated speech.</p>
<p>Unfortunately, creating this pipeline, writing original characters and plots for it for it, and sufficiently testing it to ensure the generated results are stable, would take weeks if not months to complete otherwise I would provide a more concrete demo. <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> This pipeline approach to AI script writing would only be effective for unsupervised 24/7 generation and wouldn&rsquo;t replace skilled human writers who would do a more effective job much faster.</p>
<p>But would all of these prompt optimizations actually make the final generated script <em>funny</em>? After all, some of the failings like the awkward audience laughs and pauses and the end of scenes contributed to AI Seinfeld&rsquo;s humor. During a standup comedy event at AI Seinfeld&rsquo;s peak, Jerry Seinfeld himself <a href="https://www.reddit.com/r/seinfeld/comments/10tnn1k/jerry_talking_about_ai_seinfeld_last_night/">was asked</a> about the AI parody and he replied that he&rsquo;s not worried about AI:</p>
<blockquote>
<p>AI can be, definitely, they&rsquo;ll make it smarter and smarter, but to do [standup comedy] you have to make it dumber.</p>
</blockquote>
<p>Could AI Seinfeld benefit from advances in AI video? The answer this time is no. Generative video has been taking off in 2024 with projects such as OpenAI&rsquo;s <a href="https://openai.com/index/sora/">Sora</a> and Runway AI&rsquo;s <a href="https://runwayml.com/product">Gen-3 Alpha</a>, but those demos and the examples that go viral on social media are very heavily cherry picked, and even then there are consistency errors such as objects appearing in-and-out of existence. Generating video also requires exponentially more compute than just running Unity, and even with another few years of GPU hardware improvements it would be infeasible to cost-effectively create a 24/7 stream from those models.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/mnpGyVL1-0E?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>The greatest problem with generative AI video is that it is coherent overall but has emblematic errors that don&rsquo;t require a keen eye to notice, and as a result falls square into the uncanny valley, with its mistakes not being interesting, but disorienting. Mistakes in motion are easier to notice at a glance than images where a person&rsquo;s hands may have the wrong number of fingers. The only way for AI video to get out of the valley would be to improve the model to near-flawless quality, which won&rsquo;t happen any time soon. But Sora is more on the more realistic side of the curve than the less realistic side.</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/uncanny_valley_2_hu_c3c8932aea493423.webp 320w,/2024/08/ai-seinfeld/uncanny_valley_2_hu_85ea0e247ba12df1.webp 768w,/2024/08/ai-seinfeld/uncanny_valley_2_hu_7690c09cf64f5daa.webp 1024w,/2024/08/ai-seinfeld/uncanny_valley_2.webp 1200w" src="uncanny_valley_2.webp"/> 
</figure>

<p>What about the AI-generated voices that would power these characters? At the time AI Seinfeld aired, many complained that Larry&rsquo;s voice &ldquo;didn&rsquo;t sound enough like Jerry Seinfeld.&rdquo; After AI Seinfeld concluded, a new technology called <a href="https://elevenlabs.io/blog/what-is-voice-cloning">voice cloning</a> popularized by <a href="https://elevenlabs.io">ElevenLabs</a> went mainstream&hellip;and it&rsquo;s unexpectedly the AI modality that&rsquo;s causing the most actual harm both with creative projects and outside of them. If you haven&rsquo;t heard as much about AI-generated voices, there&rsquo;s a good reason for that: voice synthesis projects such as Microsoft&rsquo;s <a href="https://www.microsoft.com/en-us/research/project/vall-e-x/vall-e-2/">VALL-E 2</a> and Meta&rsquo;s <a href="https://ai.meta.com/blog/voicebox-generative-ai-model-speech/">Voicebox</a> both have disclaimers saying they won&rsquo;t be released due to the dangers the technology possesses, although Microsoft&rsquo;s Azure does offer a &ldquo;<a href="https://learn.microsoft.com/en-us/azure/ai-services/speech-service/custom-neural-voice">custom neural voice</a>&rdquo; service. Voice cloning has been used to <a href="https://www.newyorker.com/science/annals-of-artificial-intelligence/the-terrifying-ai-scam-that-uses-your-loved-ones-voice">initiate scams</a> by impersonating spouses in an emergency. Professional voice actors have had their voices cloned and used without compensation due to contracts not specifically forbidding the practice, which is one of the reasons SAG-AFTRA <a href="https://www.theverge.com/2024/8/5/24213808/video-game-voice-actor-strike-sag-aftra">just went on strike</a> against the video game industry in order to get protections against voice cloning and synthetic performers.</p>
<p>Moreover, in the context of creating a next-gen AI Seinfeld, there&rsquo;s nothing inherently interesting about voice cloning since it&rsquo;s a copy by definition: the model <em>can&rsquo;t</em> generate unexpectedly amusing content other than the inherent gimmick of famous-voice-saying-something, such as the AI George Carlin standup special <a href="https://www.vice.com/en/article/the-george-carlin-ai-standup-is-worse-than-you-can-imagine/">which was not special</a>. There isn’t any way currently to prompt engineer a voice generation AI with the detail to create a voice <code>in the style of a masculine New York comedian, 2x speed, primetime television quality</code> which could open up more creative opportunities.</p>
<p>Although we can make drastic improvements with the textual script, that&rsquo;s the extent of how new AI approaches can be leveraged to make something interesting. But if you remember the early days of generative AI history, the best AI-generated projects were the simplest.</p>
<h2 id="ai-weirdness">AI Weirdness</h2>
<p>Generative &ldquo;AI&rdquo; has been around for a very long time (I had fun with <a href="https://en.wikipedia.org/wiki/Markov_chain">Markov chains</a> <a href="https://minimaxir.com/2013/11/innovation-rng/">a decade ago</a>!), but the study was mostly confined to tech-focused communities like <a href="https://news.ycombinator.com">Hacker News</a>. Modern generative AI didn&rsquo;t break into mainstream culture until 2018, ironically in a way that doesn&rsquo;t involve actual generative AI. In June of that year, comedian Keaton Patti posted a <a href="https://x.com/KeatonPatti/status/1006961202998726665">megaviral tweet</a> about how he &ldquo;forced a bot to watch over 1,000 hours of Olive Garden commercials and then asked it to write an Olive Garden commercial of its own.&rdquo;</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/patti_hu_67c737b47f76017.webp 320w,/2024/08/ai-seinfeld/patti_hu_615be4497d8ad163.webp 768w,/2024/08/ai-seinfeld/patti_hu_421617479726cf8c.webp 1024w,/2024/08/ai-seinfeld/patti.webp 1554w" src="patti.webp"
         alt="An excerpt of the viral Olive Garden script."/> <figcaption>
            <p>An excerpt of the viral Olive Garden script.</p>
        </figcaption>
</figure>

<p>Yes, the script was human-written: for the technology at the time, no one could train an AI to behave like that from only video input data, and the script was <em>too surreal</em> even for the now-primitive generative AI. He did get popular enough to get <a href="https://www.amazon.com/Forced-Bot-Write-This-Book/dp/152485834X">a book deal</a> and a <a href="https://www.youtube.com/playlist?list=PLXSrjGY5Tz_gPdaU_L__S3hXua7zRQtUl">Netflix collaboration</a> leveraging this fake-AI gimmick.</p>
<p>Patti&rsquo;s comedic misrepresentation of AI did lead to genuine confusion about what a 2018-era generative AI can actually do. Janelle Shane, who maintains the <a href="https://www.aiweirdness.com">AI Weirdness blog</a> about weird things AI can generate, posted an <a href="https://x.com/JanelleCShane/status/1007061610005794817">epic takedown</a> of Patti&rsquo;s script which went equally viral and also led to the internet discovering her excellent <a href="https://www.aiweirdness.com/candy-heart-messages-written-by-a-18-02-09/">AI-generated Valentine&rsquo;s Day hearts</a> from the same year (and later <a href="https://www.amazon.com/You-Look-Like-Thing-Love/dp/0316525227">a book deal</a> too):</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/heart_hu_292dce043896cad3.webp 320w,/2024/08/ai-seinfeld/heart.jpg 640w" src="heart.jpg"/> 
</figure>

<p>Image-based generative AI took a lot longer to go mainstream: websites like <a href="https://thispersondoesnotexist.com">This Person Does Not Exist</a> demonstrated the power of <a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">generative adversarial networks</a> like <a href="https://github.com/NVlabs/stylegan">StyleGAN</a> to create images, but that wasn&rsquo;t weird outside of <a href="https://cedar.buffalo.edu/~srihari/CSE676/22.3-GAN%20Mode%20Collapse.pdf">mode collapses</a>. The first instance of weird images from AI was in January 2021 when OpenAI announced the <a href="https://openai.com/index/dall-e/">original DALL·E</a> and showed they could make unique armchairs in the shape of an avocado by asking the model to do so, although they never released the model itself.</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/avocado_hu_5300a7e486e7afb5.webp 320w,/2024/08/ai-seinfeld/avocado_hu_84e7cd0392309830.webp 768w,/2024/08/ai-seinfeld/avocado.webp 830w" src="avocado.webp"/> 
</figure>

<p>DALL·E didn&rsquo;t get much attention outside of the AI hypesters since no one could play with it, but months later, things changed. <a href="https://x.com/borisdayma">Boris Dayma</a> led an initiative to reproduce and open-source a variant of the DALL·E model, labeled <a href="https://github.com/borisdayma/dalle-mini">DALL·E Mini</a> (later changed to <a href="https://www.craiyon.com">Craiyon</a> after a cease and desist from OpenAI), and <a href="https://huggingface.co/spaces/dalle-mini/dalle-mini">hosted it for free on Hugging Face</a> and went megaviral. And thus began the &ldquo;<a href="https://www.reddit.com/r/weirddalle/top/?t=all">weird DALL·E</a>&rdquo; phase of image generation AI, where anyone could create incoherent images and make people laugh.</p>
<figure class="align-center ">

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/firehydrant_hu_4bd881a786b7493e.webp 320w,/2024/08/ai-seinfeld/firehydrant.webp 764w" src="firehydrant.webp#center"
         alt="Even back in 2021, image prompt engineering was a thing. via /u/royal_rigolo on Reddit / weirddalle subreddit" width="400"/> <figcaption>
            <p>Even back in 2021, image prompt engineering was a thing. <a href="https://www.reddit.com/r/weirddalle/comments/vjwcl5/fire_hydrant_takes_selfies_on_top_of_the_himalaya/">via /u/royal_rigolo on Reddit / weirddalle subreddit</a></p>
        </figcaption>
</figure>

<p>All of these examples of interesting failures are representative of a bygone AI era of experimentation. Once everyone had free access to more powerful text-generating AI with ChatGPT, and more powerful image-generating AI with <a href="https://www.midjourney.com/home">Midjourney</a>, AI stopped being fun and started being serious business, for better or for worse.</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/uncanny_valley_3_hu_c912a98f812d692e.webp 320w,/2024/08/ai-seinfeld/uncanny_valley_3_hu_6cd7aa3fb6bb5ee5.webp 768w,/2024/08/ai-seinfeld/uncanny_valley_3_hu_e3c7199e7c82d8bd.webp 1024w,/2024/08/ai-seinfeld/uncanny_valley_3.webp 1200w" src="uncanny_valley_3.webp"/> 
</figure>

<h2 id="ai-generated-content-in-20xx">AI-Generated Content in 20XX</h2>
<p>Last year, I wrote a thought piece titled &ldquo;<a href="https://minimaxir.com/2023/10/ai-sturgeons-law/">The Greatest Threat to Generative AI is Humans Being Bad at Using it</a>&rdquo; in response to the increasing hostility against the use of AI in creative works, arguing that while AI is a tool like anything else, it is a tool that&rsquo;s very easy to use poorly and actually make projects worse. Additionally, the largest AI companies have both a business incentive and a duty to ensure that AI is used responsibly by its users downstream, as otherwise it will hurt the industry in the long term.</p>
<p>Now, it&rsquo;s apparent that I was correct. The large companies went full steam ahead on AI integrations even where it is highly questionable that they add value and productivity to the end-user, often signaled with a &ldquo;magical&rdquo; <a href="https://qz.com/how-became-the-unofficial-ai-emoji-1851059332">sparkle emoji</a>. Google has integrated Gemini to assist with document and email writing, Meta has integrated Meta AI to automatically generate images and comments, and Apple will <a href="https://www.bloomberg.com/news/articles/2024-07-28/apple-intelligence-to-miss-initial-release-of-upcoming-ios-18-ipados-overhauls?embedded-checkout=true">soon</a> allow Apple devices to generate text and images on your personal devices using Apple Intelligence. Marketing these features is typically met with backlash: Google had to <a href="https://www.cnbc.com/2024/08/02/google-pulls-ai-ad-for-olympics-following-backlash.html">pull an Olympics commercial</a> which encouraged a parent to use AI to write a letter for their child.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/NgtHJKn0Mck?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<blockquote>
<p>“I flatly reject the future that Google is advertising,” Shelly Palmer, professor of advanced media at Syracuse University’s S.I. Newhouse School of Public Communications, wrote in a widely circulated <a href="https://shellypalmer.com/2024/07/why-googles-dear-sydney-ad-makes-me-want-to-scream/">blog post</a>. The technology presents a “monocultural future where we see fewer and fewer examples of original human thoughts,” she wrote.</p>
</blockquote>
<p>In the process of pushing AI tech further mainstream in a rush to demonstrate to shareholders their generative AI capabilities without encouraging <em>responsible</em> usage of the technology, AI has entered a new era of &ldquo;<a href="https://simonwillison.net/2024/May/8/slop/">slop</a>&rdquo; where people post objectively bad AI content without any regard for how it will be perceived, especially for websites which rely on user-generated content.</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/pinterest_hu_613e5e7f10764361.webp 320w,/2024/08/ai-seinfeld/pinterest_hu_fb37af21ee91c34f.webp 768w,/2024/08/ai-seinfeld/pinterest.webp 901w" src="pinterest.webp"
         alt="An annotated example of the Pinterest home page from July 2024. via @henningsanden on X"/> <figcaption>
            <p>An annotated example of the Pinterest home page from July 2024. <a href="https://x.com/henningsanden/status/1808126786389037107">via @henningsanden on X</a></p>
        </figcaption>
</figure>

<p>Facebook, whose algorithm <a href="https://transparency.meta.com/data/widely-viewed-content-report/">favors</a> emotionally-appealing engagement bait posts, has seen a deluge of high-engagement slop even when the content makes no logical sense.</p>
<figure class="align-center ">

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/cabincrew_hu_bc23e6989111247c.webp 320w,/2024/08/ai-seinfeld/cabincrew_hu_c696ff0db8c80eff.webp 768w,/2024/08/ai-seinfeld/cabincrew_hu_b68182f34bfe5d01.webp 1024w,/2024/08/ai-seinfeld/cabincrew.webp 1080w" src="cabincrew.webp#center"
         alt="One of the few AI-generated images on Facebook with an actual cabin crew. via @FacebookAIslop on X." width="400"/> <figcaption>
            <p>One of the few AI-generated images on Facebook with an actual cabin crew. <a href="https://x.com/FacebookAIslop/status/1806416249259258189">via @FacebookAIslop on X</a>.</p>
        </figcaption>
</figure>

<p>This is, of course, quintessential uncanny valley: it&rsquo;s coherent at a glance but just even looking at it for a second it&rsquo;s obvious where the issues are, and these issues aren&rsquo;t a good kind of AI weirdness. What worse is that AI Slop a regression in realism, and falls onto the left side of the valley.</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/uncanny_valley_4_hu_ce80aacfa47a581e.webp 320w,/2024/08/ai-seinfeld/uncanny_valley_4_hu_ffbc52f347062d8f.webp 768w,/2024/08/ai-seinfeld/uncanny_valley_4_hu_8f8817dd988ae0a9.webp 1024w,/2024/08/ai-seinfeld/uncanny_valley_4.webp 1200w" src="uncanny_valley_4.webp"/> 
</figure>

<p>Although we as humans can identify this slop, it is currently surprisingly hard for an AI to do so, although it hasn&rsquo;t stopped people from trying to build AIs that can detect AIs which in practice is filled with false positives that hurt real creatives. For slop-creators, this is a feature: if an AI company released a tool to reliably detect and punish slop, it would make their generative AI less valuable. It&rsquo;s <a href="https://www.wsj.com/tech/ai/openai-tool-chatgpt-cheating-writing-135b755a">reported</a> that one of the reasons that OpenAI won&rsquo;t release a reliable ChatGPT text detector is that it could harm their business.</p>
<p>The core reason for the big tech companies allowing generative AI to cause the <a href="https://en.wikipedia.org/wiki/Enshittification">enshittification</a> of the internet is misaligned incentives between the companies hosting AI slop and the users viewing it. Social media companies and their shareholders care about <a href="https://mixpanel.com/blog/north-star-metric/">North Star metrics</a> such as user retention and time-on-site, and normally those metrics can be correlated with user happiness and satisfaction with the service. But time-on-site, for example, can <em>also</em> be maximized by making the site harder and slower to use, and the deluge of AI slop accomplishes that. AI companies typically don&rsquo;t have analytics tracking negative user sentiment about their use of AI: if anything, the uncompromising backlash against AI convinces the companies that complainers are just a lost demographic to accommodate and double down on what they&rsquo;re already doing. Aggregate metrics treat human-made content and AI-generated content as equal, but <em>humans</em> do not.</p>
<p>Generative AI, even for researchers and practitioners such as myself, is a heavily nuanced topic that is very difficult to communicate succinctly, more difficult to do on social media which highly discourages nuance and context, and <em>even more difficult</em> as AI hypesters muddy the waters with misleading praises of generative AI such that they&rsquo;re easy to dunk on which just gets them more engagement and revenue. &ldquo;Made by AI&rdquo; is now a term that inspires dread, far from the Keaton Patti days where made-by-AI was an indicator of joyful weirdness. Bashing AI is now a meme, and there&rsquo;s isn&rsquo;t a single potential AI project that could challenge that perception because the well is poisoned beyond repair.</p>
<h2 id="would-a-247-ai-generated-twitch-stream-even-work-anymore">Would a 24/7 AI-Generated Twitch Stream Even Work Anymore?</h2>
<p>How does the modern AI backlash tie back into AI Seinfeld? Twitch&rsquo;s core demographic is the same demographic as those most against the use of generative AI. Part of the reason AI Seinfeld became so successful on Twitch is because of the community it cultivated: it wouldn&rsquo;t have gone viral if people weren&rsquo;t spamming microwave <code>MMM</code>s and and answering what did the fish say when it hit the wall. Even though Twitch viewers are mostly lurkers and not chatters, a channel with a good community builds word-of-mouth even outside of Twitch, which is how Twitch channels go viral.</p>
<p>I decided to determine what it would take to produce a &ldquo;fixed&rdquo; AI Seinfeld in 2024, given both the advances in AI and the ethics involved. Now, it&rsquo;s definitely not anything a scrappy group of hackers could do anymore. Sure, you could once again ask an LLM to generate a sitcom script and get a bunch of assets from the Unity Asset Store, but <em>that&rsquo;s already been done before</em>. In order to overcome the reflexive assumption that new AI generated content is slop, the stream would have to be something completely novel and unexpected: you can&rsquo;t, for example, just do an AI <a href="https://en.wikipedia.org/wiki/Curb_Your_Enthusiasm">Curb Your Enthusiasm</a>.</p>
<p>The script would be unique following from my demo of detailed parametric prompts, but it would require production-studio-class tracking and documentation for how the prompts and their parameters are used to codify said uniqueness. The stream video would still need to be rendered in Unity or another engine, but in order to be unique it would require commissioning human-made visuals and sound effects: given the animosity against those who work with AI, most artists would not accept those commissions even if they were paid at a significant premium. <sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup> The voices would still have to be from an existing text-to-speech voice provider: voice cloning is right out, even with explicit consent and compensation for the voice actors.</p>
<p>And even if all the assets were fully sourced ethically with transparent documentation for the entire pipeline, the stream&rsquo;s Twitch chat would likely be derailed by <code>AI 👏 ART 👏 IS 👏 THEFT</code> spam, preventing the establishment of any community, and strict moderation to curb the spam risks causing a <a href="https://en.wikipedia.org/wiki/Streisand_effect">Streisand effect</a>.</p>
<p>The only entities that could feasibly create a 24/7 AI-generated livestream with fully ethically-sourced content would be, ironically, the big AI companies such as OpenAI which can afford to pay licenses for said data. Even <a href="https://www.disney.com">Disney</a>, which owns more than enough IP to train generative models of all modalities, would never do an AI Seinfeld-esque livestream for <a href="https://en.wikipedia.org/wiki/Brand_safety">brand safety</a> reasons alone: the nonzero possibility of a Disney character unexpectedly saying something problematic during the stream would make the entire project a complete nonstarter.</p>
<h2 id="whats-the-deal-with-the-uncanny-valley">What&rsquo;s the deal with the uncanny valley?</h2>
<p>One of the common criticisms about generative AI pointed out by creatives is &ldquo;if AI is trained on all human works, then how can it create anything new&rdquo;? AI Seinfeld is the perfect counterargument: even though it&rsquo;s powered by a LLM, the <em>humans</em> behind it are what made it go viral. Even before ChatGPT, generative AI has always excelled as a tool. The microwave gag and the 144p visual filter were not AI-generated or an attempt to emulate aspects of the Seinfeld sitcom: they were distinct creative decisions that made the entire project more interesting, and they aren&rsquo;t something that you could prompt an AI to suggest to add. AI Seinfeld in hindsight was an ethical form of AI-generated media: it did not replace Seinfeld the TV show, no one would stop watching streams of Seinfeld in favor of the AI-generated alternative, and copyright holders and Jerry Seinfeld did not lose revenue due to AI Seinfeld&rsquo;s existence: if anything, the nostalgic buzz increased streams of the original show.</p>
<p>With the current trajectory of AI slop and the perverse incentives by large tech companies to not address it, I am pessimistic that AI content will ever be at a state where it will cross that final hump of the uncanny valley curve into full acceptance, and even more pessimistic about the backlash against generative AI ever subsiding. With generative model training now at the point where it requires exponentially more compute and data for increasingly marginal returns, it will take years if at all for generative AI output to reach the far right of the uncanny valley chart, and unless the large tech companies actually create an <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">AGI</a>, they are unlikely to obtain higher acceptability than AI Seinfeld ever did.</p>
<p>I wrote most of this blog post weeks ago but held off publishing it because new AI news kept happening. Most notably, the <a href="https://blackforestlabs.ai/our-team/">creators of Stable Diffusion</a> just released the <a href="https://blackforestlabs.ai">FLUX.1 series</a> of generative image AI models, which presents substantially improved coherence both to the provided prompt and within the image itself. Some of the variants are <a href="https://huggingface.co/black-forest-labs/FLUX.1-dev">open-source</a>, allowing the community to finetune them. The <a href="https://huggingface.co/XLabs-AI/flux-RealismLora">XLabs-AI/flux-RealismLora</a> in particular focuses on realism as it name implies, and <a href="https://www.reddit.com/r/StableDiffusion/comments/1emrprx/feel_the_difference_between_using_flux_with">one demo</a> from that finetune <a href="https://x.com/rpnickson/status/1821634114274873850">went megaviral</a>.</p>
<figure class="align-center ">

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/flux_hu_f2586697cc180453.webp 320w,/2024/08/ai-seinfeld/flux.webp 664w" src="flux.webp#center"
         alt="One of the viral realism demo images: it does not have a dreamy look as other AI images but contextually expected stage lighting, the background and lanyard text is legible despite the depth-of-field blur, and body proportions are mostly correct except the long fingers. via /u/Glittering-Football9 on Reddit / StableDiffusion subreddit." width="400"/> <figcaption>
            <p>One of the viral realism demo images: it does not have a dreamy look as other AI images but contextually expected stage lighting, the background and lanyard text is legible despite the depth-of-field blur, and body proportions are mostly correct except the long fingers. <a href="https://www.reddit.com/r/StableDiffusion/comments/1emrprx/comment/lh30hvv/">via /u/Glittering-Football9 on Reddit / StableDiffusion subreddit</a>.</p>
        </figcaption>
</figure>

<p>That example in my opinion is more real than Sora but given the mixed reactions to the image, it&rsquo;s right at the acceptability = 0 threshold.</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/uncanny_valley_5_hu_c33303ff9d736da6.webp 320w,/2024/08/ai-seinfeld/uncanny_valley_5_hu_d0b5c2c50072b2b0.webp 768w,/2024/08/ai-seinfeld/uncanny_valley_5_hu_7eb161e4aba72dd1.webp 1024w,/2024/08/ai-seinfeld/uncanny_valley_5.webp 1200w" src="uncanny_valley_5.webp"/> 
</figure>

<p>The generative AI bell cannot be unrung. As you can tell from this post, I personally try to thread the thin line between both cool applications of generative AI (at the risk of getting harrassed) and the problems generative AI can cause (also at the risk of getting harrassed) because it&rsquo;s important to shine a light on what&rsquo;s actually possible with AI when the misinformation around generative AI is only increasing. It&rsquo;s overall a big bummer how we went from weird Valentine&rsquo;s Day hearts, to a quirky livestream of a group of AI-generated friends, to what AI is now.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>All of the examples in this post use LLM APIs as they provide the customization necessary to get effective results: the results for asking the same prompts to free chat frontends such as chatgpt.com will be substantially different.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>When I was younger, I actually didn&rsquo;t like Seinfeld and instead preferred to watch <a href="https://en.wikipedia.org/wiki/Everybody_Loves_Raymond">Everybody Loves Raymond</a>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>Incidentally, parametric prompts is why Unlimited Steam got <a href="https://www.reddit.com/r/unlimitedsteam/comments/12wto93/thank_you_for_enjoying_the_steam/">permanently banned</a> from Twitch: in what would now be known as a <a href="https://www.ibm.com/topics/prompt-injection">prompt injection</a>, one of the GitHub-hosted lists the channel sourced thousands of food choices for the prompt contained a few highly offensive selections.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>Prompt engineering instability grows exponentially as the prompt size increases since each part of the prompt has to relate to each other. Claude 3.5 Sonnet is the first LLM I&rsquo;ve tested that can handle super-long bespoke prompts and can actually account for all aspects of the prompt.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>To be fully ethical, an AI practitioner would have to proactively offer additional contractual guarantees to creatives they are commissioning, including highly-scoped usage of the assets they provide and a clause to not train generative AI on said assets to avoid future business.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>The Super Effectiveness of Pokémon Embeddings Using Only Raw JSON and Images</title>
      <link>https://minimaxir.com/2024/06/pokemon-embeddings/</link>
      <pubDate>Wed, 26 Jun 2024 10:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2024/06/pokemon-embeddings/</guid>
      <description>Embeddings encourage engineers to go full YOLO because it&amp;rsquo;s actually rewarding to do so!</description>
      <content:encoded><![CDATA[<p><a href="https://en.wikipedia.org/wiki/Word_embedding">Embeddings</a> are one of the most useful but unfortunately underdiscussed concepts in the artificial intelligence space relative to the modern generative AI gigahype. Embeddings are a set of hundreds of numbers which uniquely correspond to a given object that define its dimensionality, nowadays in a multiple of 128 such as 384D, 768D, or even 1536D. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> The larger the embeddings, the more &ldquo;information&rdquo; and distinctiveness each can contain, in theory. These embeddings can be used as-is for traditional <a href="https://en.wikipedia.org/wiki/Regression_analysis">regression</a> and <a href="https://en.wikipedia.org/wiki/Statistical_classification">classification</a> problems with your favorite statistical modeling library, but what&rsquo;s really useful about these embeddings is that if you can find the minimum mathematical distance between a given query embedding and another set of embeddings, you can then find which is the most similar: extremely useful for many real-world use cases such as search.</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/embedding_demo_hu_d3f88226f240b184.webp 320w,/2024/06/pokemon-embeddings/embedding_demo_hu_2c5a706d1ebace78.webp 768w,/2024/06/pokemon-embeddings/embedding_demo_hu_34918a6ff306a4ac.webp 1024w,/2024/06/pokemon-embeddings/embedding_demo.png 1178w" src="embedding_demo.png"
         alt="An example sentence embedding generated using Sentence Transformers: this embedding is 384D."/> <figcaption>
            <p>An example sentence embedding generated using Sentence Transformers: this embedding is 384D.</p>
        </figcaption>
</figure>

<p>Although any kind of object can be represented by an embedding, text is the classical use case for embeddings, popularized with the original <a href="https://en.wikipedia.org/wiki/Word2vec">word2vec</a> <a href="https://arxiv.org/abs/1301.3781">paper</a> which along with <a href="https://arxiv.org/abs/1310.4546">later work</a> showed that word embeddings could be used to calculate relationships such as man + women - king = queen. You could then, for example, create a sentence embedding by averaging all of its word embeddings. This actually works, although this naive averaging does not take word position and punctuation into account, both of which are critically important in identifying context for a given text.</p>
<p>Deep learning then entered the picture and it was eventually discovered that large language models like <a href="https://en.wikipedia.org/wiki/BERT_%28language_model%29">BERT</a> can return embeddings as an emergent behavior. Unlike the word averaging above, <a href="https://en.wikipedia.org/wiki/Transformer_%28deep_learning_architecture%29">transformers</a>-based LLMs can account for positional relationships more robustly thanks to their <a href="https://en.wikipedia.org/wiki/Attention_%28machine_learning%29">attention mechanisms</a>, and, due to their more advanced model input <a href="https://www.freecodecamp.org/news/how-tokenizers-shape-ai-understanding/">tokenization</a> strategies than just words, can also better incorporate punctuation. One very popular Python library for creating embeddings using LLMs easily is <a href="https://sbert.net">Sentence Transformers</a>, especially with the <a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2">all-MiniLM-L6-v2</a> model (<a href="https://huggingface.co/models?pipeline_tag=sentence-similarity&amp;sort=downloads">30 million downloads monthly</a>!) which balances embedding encoding speed and robustness with its 384D embeddings.</p>
<p>How well can these embeddings models work beyond just normal sentences? Can they encode larger bodies of text into a consistent space? The context length of <code>all-MiniLM-L6-v2</code> is 512 tokens, which can only fit a couple paragraphs of text, but newer LLMs have much higher context lengths.</p>
<p>I recalled one of my early projects as an aspiring data scientist: creating <a href="https://github.com/minimaxir/pokemon-3d">Pokémon vectors</a> by <a href="https://github.com/minimaxir/pokemon-3d/blob/master/pokemon_spark_pca.ipynb">manually transforming Pokémon metadata</a> for each Pokémon, such as their base stats, type(s), moves, abilities, and miscellaneous attributes such as color, shape, and habitat. After that, I was able to cluster them.</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/cluster_hu_7ca1c5b82a702cc7.webp 320w,/2024/06/pokemon-embeddings/cluster.png 676w" src="cluster.png"
         alt="3D projection of my Pokémon vectors back in 2016: the colors are Pokémon types, and the methodology seemed to favor clustering by them."/> <figcaption>
            <p>3D projection of my Pokémon vectors back in 2016: the colors are Pokémon types, and the methodology seemed to favor clustering by them.</p>
        </figcaption>
</figure>

<p>Those familar with Pokémon know that&rsquo;s just scratching the surface: there&rsquo;s even more metadata such as the rich text data such as a Pokémon&rsquo;s Pokédex entries and the exact locations where they can be encountered, both of which tell a lot about a given Pokémon. At the time, there was no efficient LLM to encode all of that extra metadata.</p>
<p>Why not try to encode all Pokémon metadata using a text embedding model and see what happens? Will we be able to identify the most &ldquo;similar&rdquo; Pokémon? What is a &ldquo;similar&rdquo; Pokémon anyways? Can we find the <em>weirdest</em> Pokémon by the most dissimilar? Can we encode other Pokémon data such as images? Let&rsquo;s find out!</p>
<h2 id="how-embeddings-are-generated-using-llms">How Embeddings Are Generated Using LLMs</h2>
<p>First, some relevant technical background on how LLMs can be used to create embeddings since there&rsquo;s surprisingly a lot of confusion about how they work other than the SEO-oriented &ldquo;embeddings are for <a href="https://www.cloudflare.com/learning/ai/what-is-vector-database/">vector databases</a>&rdquo;.</p>
<p>Modern embedding models are commonly trained through one of two ways. The first way is through emergent behavior while training an LLM normally: as LLMs need to determine a latent space before passing the output to a classification head such as <a href="https://en.wikipedia.org/wiki/Generative_pre-trained_transformer">GPT</a>&rsquo;s next-token prediction, taking the last layer (&ldquo;hidden state&rdquo;) of a model and averaging across the positional axis results in an embedding with the same dimensionality as the hidden state. LLMs have to learn how to uniquely represent text in a common latent space, so this is approach is natural. The second way is to train a model to output the embeddings directly: in this case, the training process typically uses <a href="https://encord.com/blog/guide-to-contrastive-learning/">contrastive learning</a> to minimize the semantic distance between the generated embeddings of a pair of known text documents, and maximize the difference between a dissimilar pair. Both of these techniques can be used together of course: pretrain a LLM on a large body of text, then finetune it with contrastive learning.</p>
<p>Embeddings models get the benefits of all the research invested into improving LLMs for generative AI, such as inference speed and longer context windows. Normally it requires a quadratic increase in computation to use those larger context windows (e.g. a 2<em>x</em> increase in input length requires 4<em>x</em> more computation), but thanks to <a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention">FlashAttention</a> and <a href="https://blog.eleuther.ai/rotary-embeddings/">rotary positional embeddings,</a> it&rsquo;s now feasible to train models with massively-large context windows without a massive datacenter and then run those models on consumer hardware.</p>
<p>Ever since 2022, <a href="https://openai.com">OpenAI</a> had the text embedding model <a href="https://openai.com/index/new-and-improved-embedding-model/">text-embedding-ada-002</a> behind a paid API with the largest context window of 8,192 tokens: a substantial increase over <code>all-MiniLM-L6-v2</code>&rsquo;s 512 limit, and no other open-source model could compete. That is until February 2024, when <a href="https://www.nomic.ai">Nomic AI</a> released <a href="https://huggingface.co/nomic-ai/nomic-embed-text-v1">nomic-embed-text-v1</a>, a fully open-source embeddings model with a 8,192 context window and a permissive <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache license</a>, and quickly followed up with <a href="https://huggingface.co/nomic-ai/nomic-embed-text-v1.5">nomic-embed-text-v1.5</a>. In academic benchmarks, this free model performed even better than OpenAI&rsquo;s paid embedding model <a href="https://blog.nomic.ai/posts/nomic-embed-text-v1">thanks to its training regimen</a> that uses both embedding model training tricks described above. That, along with its long context window, caused it to become another one of the most downloaded open-source embedding models (~10 million downloads per month).</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/nomic_embeddings_demo_hu_4dadd5c725d77e60.webp 320w,/2024/06/pokemon-embeddings/nomic_embeddings_demo_hu_6682e6dc986b6850.webp 768w,/2024/06/pokemon-embeddings/nomic_embeddings_demo_hu_1c6ffa82964e1090.webp 1024w,/2024/06/pokemon-embeddings/nomic_embeddings_demo.png 1340w" src="nomic_embeddings_demo.png"
         alt="A sentence embedding generated using nomic-embed-text-v1.5 adapted from the official example: this is a lower-level interface than Sentence Transformers (Hugging Face transformers and PyTorch) but is more clear as to what is going on. mean_pooling() uses an atypical attention-masked averaging that is theoretically better for small inputs than averaging the entire last hidden state."/> <figcaption>
            <p>A sentence embedding generated using <code>nomic-embed-text-v1.5</code> adapted from the official example: this is a lower-level interface than Sentence Transformers (<a href="https://huggingface.co/docs/transformers/en/index">Hugging Face transformers</a> and <a href="https://pytorch.org">PyTorch</a>) but is more clear as to what is going on. <code>mean_pooling()</code> uses an atypical attention-masked averaging that is theoretically better for small inputs than averaging the entire last hidden state.</p>
        </figcaption>
</figure>

<p>The <code>F.normalize()</code> function is a popular pipeline innovation in finding similar embeddings efficiently. <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> A <a href="https://en.wikipedia.org/wiki/Unit_vector">unit normalized</a> vector has a vector length summing to 1. But if you perform a matrix multiplication (an extremely fast computational operation) of a normalized vector against a matrix of normalized vectors, then the result will be the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a>, constrained between the values of 1 for identical matches and -1 for the most dissimilar matches.</p>
<p>Now that we have thoroughly covered how embeddings work, let&rsquo;s see if we can put that 8,192 context window to the test.</p>
<h2 id="what-kind-of-pokémon-embedding-are-you">What Kind of Pokémon Embedding Are You?</h2>
<p>Before encoding Pokémon data, I need to first get Pokémon data, but where? Nintendo certainly won&rsquo;t have an API for Pokémon data, and web scraping a Pokémon wiki such as <a href="https://bulbapedia.bulbagarden.net/wiki/Main_Page">Bulbapedia</a> is both impractical and rude. Fortunately, there&rsquo;s an unofficial Pokémon API known appropriately as <a href="https://pokeapi.co">PokéAPI</a>, which is both open source and has been around for years without Nintendo taking them down. Of note, PokéAPI has a <a href="https://beta.pokeapi.co/graphql/console/">GraphQL interface</a> to its Pokémon data, allowing you to query exactly what you want without having to do relationship mapping or data joins.</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/graphql_hu_545910786cbd7bf6.webp 320w,/2024/06/pokemon-embeddings/graphql_hu_5d76d10c482e4154.webp 768w,/2024/06/pokemon-embeddings/graphql_hu_d03bd1849088a102.webp 1024w,/2024/06/pokemon-embeddings/graphql.png 1260w" src="graphql.png"
         alt="A simple GraphQL query to get all Pokémon IDs and names, sorted by ID."/> <figcaption>
            <p>A simple GraphQL query to get all Pokémon IDs and names, sorted by ID.</p>
        </figcaption>
</figure>

<p>Since we can get Pokémon data in a nicely structured JSON dictionary, why not keep it that way? After writing a <a href="https://github.com/minimaxir/pokemon-embeddings/blob/main/query.gql">massive GraphQL query</a> to specify all mechanically relevant Pokémon data, all it takes it a single GET request to download it all, about 16MB of data total. This includes over 1,000 Pokémon up to the Scarlet/Violet <em>The Hidden Treasure of Area Zero</em> DLC: 1,302 Pokémon total if you include the Special forms of Pokémon (e.g. <a href="https://bulbapedia.bulbagarden.net/wiki/Mega_Evolution">Mega Evolutions</a>) which I&rsquo;m excluding for simplicity.</p>
<p>As an example, let&rsquo;s start with the franchise mascot, <a href="https://bulbapedia.bulbagarden.net/wiki/Pikachu_%28Pok%C3%A9mon%29">Pikachu</a>.</p>
<figure class="align-center ">

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/025_hu_94d9a33563b85a33.webp 320w,/2024/06/pokemon-embeddings/025.png 475w" src="025.png#center"
         alt="The iconic Pokémon #25. via Nintendo" width="300" height="300"/> <figcaption>
            <p>The iconic Pokémon #25. <a href="https://www.pokemon.com/us/pokedex/pikachu">via Nintendo</a></p>
        </figcaption>
</figure>

<p>Here&rsquo;s a subset of Pikachu&rsquo;s <a href="https://github.com/minimaxir/pokemon-embeddings/blob/main/pikachu_example_raw.json">JSON metadata</a> from that query:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="mi">25</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;pikachu&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;height&#34;</span><span class="p">:</span> <span class="mi">4</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;weight&#34;</span><span class="p">:</span> <span class="mi">60</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;base_experience&#34;</span><span class="p">:</span> <span class="mi">112</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;pokemon_v2_pokemontypes&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;pokemon_v2_type&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;electric&#34;</span>
</span></span><span class="line"><span class="cl">      <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;pokemon_v2_pokemonstats&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;pokemon_v2_stat&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;hp&#34;</span>
</span></span><span class="line"><span class="cl">      <span class="p">},</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;base_stat&#34;</span><span class="p">:</span> <span class="mi">35</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="err">...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;pokemon_v2_pokemonspecy&#34;</span><span class="err">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;base_happiness&#34;</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;capture_rate&#34;</span><span class="p">:</span> <span class="mi">190</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;forms_switchable&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;gender_rate&#34;</span><span class="p">:</span> <span class="mi">4</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;has_gender_differences&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;hatch_counter&#34;</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;is_baby&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;is_legendary&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;is_mythical&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;pokemon_v2_pokemonspeciesflavortexts&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;pokemon_v2_version&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">          <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;red&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="p">},</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;flavor_text&#34;</span><span class="p">:</span> <span class="s2">&#34;When several of\nthese POK\u00e9MON\ngather, their\felectricity could\nbuild and cause\nlightning storms.&#34;</span>
</span></span><span class="line"><span class="cl">      <span class="p">},</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="err">...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;pokemon_v2_pokemonmoves&#34;</span><span class="err">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;pokemon_v2_move&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">          <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;mega-punch&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">          <span class="nt">&#34;pokemon_v2_type&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;normal&#34;</span>
</span></span><span class="line"><span class="cl">          <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">      <span class="p">},</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="err">...</span>
</span></span></code></pre></div><p>There&rsquo;s definitely no shortage of Pikachu data! Some of the formatting is redundant though: most of the JSON keys have a <code>pokemon_v2_</code> string that conveys no additional semantic information, and we can minify the JSON to remove all the whitespace. We won&rsquo;t experiment with more rigorous preprocessing: after all, I only need to optimize an ETL workflow if it <em>doesn&rsquo;t</em> work, right?</p>
<p>Since JSON data is so prevalent across the internet, it&rsquo;s extremely likely that a newly trained LLM will be sensitive to its schema and be able to understand it better. However, JSON is a token-inefficient encoding format, made even worse in this case by the particular choice of tokenizer. Here&rsquo;s the distribution of the encoded texts after the optimizations above, using <code>nomic-embed-text-v1.5</code>&rsquo;s text tokenizer which is incidentally the same <a href="https://huggingface.co/google-bert/bert-base-uncased">bert-based-uncased</a> tokenizer used for BERT back in 2018:</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/token_counts_hu_400e5e984e326eb1.webp 320w,/2024/06/pokemon-embeddings/token_counts_hu_cf5bcc5547d45eb.webp 768w,/2024/06/pokemon-embeddings/token_counts_hu_e4d28e56e2dc7bc9.webp 1024w,/2024/06/pokemon-embeddings/token_counts.png 1200w" src="token_counts.png"/> 
</figure>

<p>The 8,192 context length of <code>nomic-embed-text-v1.5</code> is perfect for fitting almost all Pokémon! But the median token count is 3,781 tokens which is still somewhat high. The reason for this is due to the tokenizer: bert-base-uncased is a <a href="https://huggingface.co/learn/nlp-course/en/chapter6/6">WordPiece</a> tokenizer which is optimized for words and their common prefixes and suffixes, while JSON data is highly structured. If you use a more modern tokenizer which utilizes <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding">byte pair encoding</a> (BPE), such as the <code>o200k_base</code> tokenizer which powers OpenAI&rsquo;s <a href="https://openai.com/index/hello-gpt-4o/">GPT-4o</a>, then the median token count is 2,010 tokens: nearly half the size, and therefore would be much faster to process the embeddings.</p>
<p>After that, I <a href="https://github.com/minimaxir/pokemon-embeddings/blob/main/pokemon_embeddings.ipynb">encoded all the Pokémon metadata</a> into a 768D text embedding for each and every Pokémon, including unit normalization. Due to the quadratic scaling at high input token counts, this is still very computationally intensive despite the optimization tricks: for the 1,302 embeddings, it took about a half-hour on a <a href="https://colab.research.google.com">Google Colab</a> T4 GPU. The embeddings are then saved on disk in a <a href="https://parquet.apache.org">parquet</a> format, a tabular format which supports nesting sequences of floats natively (don&rsquo;t use a CSV to store embeddings!). The embedding generation is the hard part, now it&rsquo;s time for the fun part!</p>
<p>Let&rsquo;s start off with Pikachu. What Pokémon is Pikachu most similar to, i.e. has the highest cosine similarity? Remember, since all the embeddings are normalized, we can get all the cosine similairites by matrix multiplying the Pikachu embedding against all the other embeddings. Let&rsquo;s include the top 3 of each of Pokémon&rsquo;s nine (!) generations to date:</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/25_similar_text_hu_d3abb73fceff08ae.webp 320w,/2024/06/pokemon-embeddings/25_similar_text_hu_7f6bea07a378c8ca.webp 768w,/2024/06/pokemon-embeddings/25_similar_text_hu_13958884c73294ad.webp 1024w,/2024/06/pokemon-embeddings/25_similar_text.png 1500w" src="25_similar_text.png"/> 
</figure>

<p>These results are better than I expected! Each generation has a &ldquo;<a href="https://bulbapedia.bulbagarden.net/wiki/Electric_rodents">Pikaclone</a>&rdquo; of a weak Electric-type rodent Pokémon, and this similarity calculation found most of them. I&rsquo;m not sure what <a href="https://bulbapedia.bulbagarden.net/wiki/Phantump_%28Pok%C3%A9mon%29">Phantump</a> and <a href="https://bulbapedia.bulbagarden.net/wiki/Trevenant_%28Pok%C3%A9mon%29">Trevenant</a> are doing under Gen VI though: they&rsquo;re Ghost/Grass Pokémon.</p>
<p>Here&rsquo;s a few more interesting Pokémon comparisons:</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/157_similar_text_hu_3d346fcf9518b458.webp 320w,/2024/06/pokemon-embeddings/157_similar_text_hu_4ee5feef47d7753c.webp 768w,/2024/06/pokemon-embeddings/157_similar_text_hu_2922bbc71f9c3c31.webp 1024w,/2024/06/pokemon-embeddings/157_similar_text.png 1500w" src="157_similar_text.png"
         alt="Typhlosion is the final evolution of the Gen II Fire starter Pokémon: it has a high similarity with atleast one of every generation&rsquo;s Fire starter Pokémon lineages."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Typhlosion_%28Pok%C3%A9mon%29">Typhlosion</a> is the final evolution of the Gen II Fire starter Pokémon: it has a high similarity with atleast one of every generation&rsquo;s Fire starter Pokémon lineages.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/144_similar_text_hu_f33e01b9e6fd18de.webp 320w,/2024/06/pokemon-embeddings/144_similar_text_hu_6901b401ed8d948b.webp 768w,/2024/06/pokemon-embeddings/144_similar_text_hu_c3d530c15d378e14.webp 1024w,/2024/06/pokemon-embeddings/144_similar_text.png 1500w" src="144_similar_text.png"
         alt="Articuno, a Legendary Ice/Flying Pokémon, has high similarity with Legendary, Ice, and Flying Pokémon, plus all combinations therein."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Articuno_%28Pok%C3%A9mon%29">Articuno</a>, a Legendary Ice/Flying Pokémon, has high similarity with Legendary, Ice, and Flying Pokémon, plus all combinations therein.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/151_similar_text_hu_d787c257f6d1c1fc.webp 320w,/2024/06/pokemon-embeddings/151_similar_text_hu_14f77fe5dcb615a2.webp 768w,/2024/06/pokemon-embeddings/151_similar_text_hu_ee9cf4523b03c9ca.webp 1024w,/2024/06/pokemon-embeddings/151_similar_text.png 1500w" src="151_similar_text.png"
         alt="Mew, the infamous legendary from the original games has the gimmick of being able to learn every move, has the most amount of metadata by far: appropriately it has poor similarity with others, although similarity with Arceus from Gen IV, the Pokémon equivalent of God with a similar gimmick."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Mew_%28Pok%C3%A9mon%29">Mew</a>, the infamous legendary from the original games has the gimmick of being able to learn every move, has the most amount of metadata by far: appropriately it has poor similarity with others, although similarity with <a href="https://bulbapedia.bulbagarden.net/wiki/Arceus_%28Pok%C3%A9mon%29">Arceus</a> from Gen IV, the Pokémon equivalent of God with a similar gimmick.</p>
        </figcaption>
</figure>

<p>You may have noticed the numerical cosine similarity of all these Pokémon is very high: if a similarity of 1 indicates an identical match, does a high value imply that a Pokémon is super similar? It&rsquo;s likely that the similarities are high because the input is all in the same JSON formatting, where the core <code>nomic-text-embed-v1.5</code> model was trained on a variety of text styles. Another potential cause is due to a &ldquo;cheat&rdquo; I did for simplicity: the <code>nomic-text-embed-v1.5</code> documentation says that a <code>search_document</code> prefix is required for encoding the base input documents and a <code>search_query</code> prefix is required for the comparison vector: in my testing it doesn&rsquo;t affect the similarity much if at all. In practice, the absolute value of cosine similarity doesn&rsquo;t matter if you&rsquo;re just selecting the objects with the highest similarity anyways.</p>
<p>What if we just plot <em>every possible combination</em> of Pokémon cosine similarities? With 1,000+ Pokémon, that&rsquo;s over 1 million combinations. Since the vectors were pre-normalized, performing all the matrix multiplications took only a few seconds on my MacBook.</p>
<p>Here&rsquo;s the result of plotting 1 million points on a single chart!</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/pokemon_cossim_text_hu_4e946cbcf5592ceb.webp 320w,/2024/06/pokemon-embeddings/pokemon_cossim_text_hu_162006d7e7cb517a.webp 768w,/2024/06/pokemon-embeddings/pokemon_cossim_text_hu_adde71533c7b8122.webp 1024w,/2024/06/pokemon-embeddings/pokemon_cossim_text.png 1200w" src="pokemon_cossim_text.png"/> 
</figure>

<p>Although it looks more like a quilt, a few things jump out. One curious case is the &ldquo;square&rdquo; of lighter Gen VIII and Gen IX in the upper right corner: it appears those two generations have lower similarity with others, and worsening similarity between those two generation as you go all the way back to Gen I. Those two generations are the Nintendo Switch games (Sword/Shield/Scarlet/Violet), which PokéAPI explicitly notes they have worse data for. Also, there are rows of a low-similarity blue such as one before Gen II: who&rsquo;s that Pokémon? Quickly checking the Pokémon with the lowest median similarity by generation:</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/pokemon_dissimilar_text_hu_6a8915531cedabd3.webp 320w,/2024/06/pokemon-embeddings/pokemon_dissimilar_text_hu_ed6bebce86bed918.webp 768w,/2024/06/pokemon-embeddings/pokemon_dissimilar_text_hu_539044188fdd35ba.webp 1024w,/2024/06/pokemon-embeddings/pokemon_dissimilar_text.png 1500w" src="pokemon_dissimilar_text.png"/> 
</figure>

<p>The mystery Pokémon is <a href="https://bulbapedia.bulbagarden.net/wiki/Magikarp_%28Pok%C3%A9mon%29">Magikarp</a>, unsurprisingly, with its <em>extremely</em> limited movepool. Most of these Pokémon have forced gimmick movesets, especially <a href="https://bulbapedia.bulbagarden.net/wiki/Unown_%28Pok%C3%A9mon%29">Unown</a>, <a href="https://bulbapedia.bulbagarden.net/wiki/Smeargle_%28Pok%C3%A9mon%29">Smeargle</a>, and <a href="https://bulbapedia.bulbagarden.net/wiki/Wobbuffet_%28Pok%C3%A9mon%29">Wobbuffet</a>, so it makes sense the metadata treats them as dissimilar to most others. Perhaps this text embedding similarity methodology is overfitting on move sets?</p>
<p>Overall, there&rsquo;s definitely some signal with these text embeddings. How else can we identify interesting Pokémon relationships?</p>
<h2 id="pokémon-snap">Pokémon Snap</h2>
<p>We&rsquo;ve only been working with text embeddings, but what about other types of embeddings, such as image embeddings? Image embeddings using <a href="https://en.wikipedia.org/wiki/Vision_transformer">vision transformer</a> models are generated roughly the same way as the text embeddings above by manipulating the last hidden state and optionally normalizing them. The inputs to the model are then square patches encoded as &ldquo;tokens&rdquo;: only a few hundred processed patches are ever used as inputs, so generating them is much faster than the text embeddings.</p>
<p>A couple years ago I hacked together a Python package named <a href="https://github.com/minimaxir/imgbeddings/tree/main">imgbeddings</a> which uses OpenAI&rsquo;s <a href="https://openai.com/index/clip/">CLIP</a> to generate the embeddings, albeit with <a href="https://x.com/minimaxir/status/1507166313281585164">mixed results</a>. Recently, Nomic also released an new model, <a href="https://huggingface.co/nomic-ai/nomic-embed-vision-v1.5">nomic-embed-vision-v1.5</a>, which now also generates image embeddings with better benchmark performance than CLIP. What&rsquo;s notable about these embeddings is that they are aligned with the ones from <code>nomic-embed-text-v1.5</code>, which can allow matching text similiarity with images or <em>vice versa</em> and enable <a href="https://cloud.google.com/use-cases/multimodal-ai?hl=en">multimodal applications</a>.</p>
<p>But for now, can we see if image embeddings derived from Pokémon images have similar similarity traits? PokéAPI fortunately has the official artwork for each Pokémon, so I <a href="https://github.com/minimaxir/pokemon-embeddings/blob/main/pokemon_images.ipynb">downloaded them</a> and additionally composited them onto a white background and resized them all to 224x224 for apples-to-apples comparisons. We expect a high cosine similarity since like with text embeddings, the &ldquo;style&rdquo; of all the images is the same. Let&rsquo;s plot the similarities of all Pokémon, by their images only.</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/pokemon_cossim_image_hu_1c224446fd2d33ac.webp 320w,/2024/06/pokemon-embeddings/pokemon_cossim_image_hu_4f63c7e6b643d270.webp 768w,/2024/06/pokemon-embeddings/pokemon_cossim_image_hu_69bf5e8b39390f66.webp 1024w,/2024/06/pokemon-embeddings/pokemon_cossim_image.png 1200w" src="pokemon_cossim_image.png"/> 
</figure>

<p>Unfortunately, no patterns jump out this time. All the image similarity values are even higher than the text similarity values, although that&rsquo;s not a big deal since we are looking at the most similar matches. How does Pikachu&rsquo;s famous official artwork compare with other Pokémon?</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/25_similar_image_hu_38ac0e401d65ee2c.webp 320w,/2024/06/pokemon-embeddings/25_similar_image_hu_7046443de47b4ee1.webp 768w,/2024/06/pokemon-embeddings/25_similar_image_hu_ffe3902d02dae773.webp 1024w,/2024/06/pokemon-embeddings/25_similar_image.png 1500w" src="25_similar_image.png"/> 
</figure>

<p>Pikachu&rsquo;s most similar Pokémon by image isn&rsquo;t just mouse Pokémon as I thought it would be, but instead the pattern is more unclear, appearing to favor mostly Pokémon with four limbs (although Pikachu&rsquo;s image has a strong similarity with Gen VII&rsquo;s <a href="https://bulbapedia.bulbagarden.net/wiki/Mimikyu_%28Pok%C3%A9mon%29">Mimikyu</a>&rsquo;s image which is hilarious since that particular Pokémon&rsquo;s gimmick is intentionally trying to look like Pikachu).</p>
<p>After testing a few more Pokémon, it turns out that this image embedding model does respond to visual primitives, which has its uses.</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/18_similar_image_hu_1612a7ba17e5ae91.webp 320w,/2024/06/pokemon-embeddings/18_similar_image_hu_67025db2f5aeaf75.webp 768w,/2024/06/pokemon-embeddings/18_similar_image_hu_4805d93d4c98cabf.webp 1024w,/2024/06/pokemon-embeddings/18_similar_image.png 1500w" src="18_similar_image.png"
         alt="Pidgeot is a bird, and it matches all other birds. Birds would definitely be in an image training dataset."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Pidgeot_%28Pok%C3%A9mon%29">Pidgeot</a> is a bird, and it matches all other birds. Birds would definitely be in an image training dataset.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/101_similar_image_hu_c171b79453ea948b.webp 320w,/2024/06/pokemon-embeddings/101_similar_image_hu_50e3b83533c12d84.webp 768w,/2024/06/pokemon-embeddings/101_similar_image_hu_f464f67fcf469df6.webp 1024w,/2024/06/pokemon-embeddings/101_similar_image.png 1500w" src="101_similar_image.png"
         alt="Electrode is a ball, and the embeddings found similarly rotund Pokémon."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Electrode_%28Pok%C3%A9mon%29">Electrode</a> is a ball, and the embeddings found similarly rotund Pokémon.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/230_similar_image_hu_2f5b44f49ac99a85.webp 320w,/2024/06/pokemon-embeddings/230_similar_image_hu_bf0bb9446330d79e.webp 768w,/2024/06/pokemon-embeddings/230_similar_image_hu_77535585356eec73.webp 1024w,/2024/06/pokemon-embeddings/230_similar_image.png 1500w" src="230_similar_image.png"
         alt="Kingdra apparently is similar to other blue Pokémon."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Kingdra_%28Pok%C3%A9mon%29">Kingdra</a> apparently is similar to other blue Pokémon.</p>
        </figcaption>
</figure>

<p>Both text and image embedding approaches have their own style. But are there ways to combine them?</p>
<h2 id="chat-with-your-pokédex">Chat With Your Pokédex</h2>
<p>Earlier I alluded to aligning text and image embeddings in a more multimodal manner. Since <code>nomic-embed-vision-v1.5</code> was conditioned on <code>nomic-embed-text-v1.5</code> outputs, you are able to compute the cosine similarities between the image embeddings and text embeddings! However, it&rsquo;s not as robust: the cosine similarities between objects of the two modes tend to be very low at about 0.10 in the best case scenario. Again, if all we&rsquo;re looking at is the highest similarity, then that&rsquo;s fine.</p>
<p>The most common use case for multimodal reasoning is asking questions (to be converted to a text embedding) and comparing it with a set of image embeddings. Let&rsquo;s try it with Pokémon by <a href="https://github.com/minimaxir/pokemon-embeddings/blob/main/pokemon_multimodal_qa.ipynb">asking it a leading question</a> for testing: what looks like an ice cream cone?</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/qa_1_hu_8e3221047d889ca2.webp 320w,/2024/06/pokemon-embeddings/qa_1_hu_533a6f2fd94474e4.webp 768w,/2024/06/pokemon-embeddings/qa_1_hu_d0d5978a89a664ee.webp 1024w,/2024/06/pokemon-embeddings/qa_1.png 1050w" src="qa_1.png"/> 
</figure>

<p>Surprisingly, it got the result correct with <a href="https://bulbapedia.bulbagarden.net/wiki/Vanillish_%28Pok%C3%A9mon%29">Vanillish</a>, along with other &ldquo;cream&rdquo; and &ldquo;ice&rdquo; Pokémon. Not sure why <a href="https://bulbapedia.bulbagarden.net/wiki/Metapod_%28Pok%C3%A9mon%29">Metapod</a> is there, though.</p>
<p>A few more Qs and As:</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/qa_2_hu_3233384b59e2a4e2.webp 320w,/2024/06/pokemon-embeddings/qa_2_hu_c6cfa2f4a81738b6.webp 768w,/2024/06/pokemon-embeddings/qa_2_hu_bbbc1fa0f0144e15.webp 1024w,/2024/06/pokemon-embeddings/qa_2.png 1050w" src="qa_2.png"
         alt="The model did identify some cats, but only Torracat is orange."/> <figcaption>
            <p>The model did identify some cats, but only <a href="https://bulbapedia.bulbagarden.net/wiki/Torracat_%28Pok%C3%A9mon%29">Torracat</a> is orange.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/qa_3_hu_42fd3039bd4f26b4.webp 320w,/2024/06/pokemon-embeddings/qa_3_hu_bd8dc72563061a30.webp 768w,/2024/06/pokemon-embeddings/qa_3_hu_8e404e3e0bd8e8d0.webp 1024w,/2024/06/pokemon-embeddings/qa_3.png 1050w" src="qa_3.png"
         alt="Unown definitely fits the bill with a very prominent one-eye and higher similarity."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Unown_%28Pok%C3%A9mon%29">Unown</a> definitely fits the bill with a very prominent one-eye and higher similarity.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/qa_4_hu_5b0b318502699c9c.webp 320w,/2024/06/pokemon-embeddings/qa_4_hu_822df483a3e6842a.webp 768w,/2024/06/pokemon-embeddings/qa_4_hu_822dde97c1b28626.webp 1024w,/2024/06/pokemon-embeddings/qa_4.png 1050w" src="qa_4.png"
         alt="A Pokémon with the name &ldquo;Cutiefly&rdquo; being the most similar to the question is a funny coincidence."/> <figcaption>
            <p>A Pokémon with the name &ldquo;<a href="https://bulbapedia.bulbagarden.net/wiki/Cutiefly_%28Pok%C3%A9mon%29">Cutiefly</a>&rdquo; being the most similar to the question is a funny coincidence.</p>
        </figcaption>
</figure>

<p>The relationship between text and Pokémon images with these models is not perfect, but it&rsquo;s honestly much better than I expected!</p>
<h2 id="2da-master">2D.A Master</h2>
<p>Lastly, there are many ways to find signal among the high-dimensional noise, and it may resolve some of the counterintuitive relationships we saw earlier. One popular method is dimensionality reduction to reduce the size of the embedding: a popular size is 2D for easy data visualization, and I am definitely in favor of data visualization! The classical statistical approach is <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">principal component analysis</a> (PCA) which identifies the most &ldquo;important&rdquo; aspects of a matrix, but a more modern approach is <a href="https://umap-learn.readthedocs.io/en/latest/">uniform manifold approximation &amp; projection</a> (UMAP) which trains a projection that accounts for how data points relate to all other data points to <a href="https://umap-learn.readthedocs.io/en/latest/how_umap_works.html">find its underlying structure</a>. In theory, the reduction should allow the embeddings to generalize better.</p>
<p>For the Pokémon embeddings, we can take the opportunity to allow the model to account for both the text and image embeddings, and their potential interactions therein. Therefore, I concatenated the text and image embeddings for each Pokémon (a 1536D embedding total), and trained a UMAP to project it down to 2D. Now we can visualize it!</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/pokemon_umap_hu_5c319611f462c24.webp 320w,/2024/06/pokemon-embeddings/pokemon_umap_hu_2a8289708e6aeb22.webp 768w,/2024/06/pokemon-embeddings/pokemon_umap_hu_67d3c2c6bda1487c.webp 1024w,/2024/06/pokemon-embeddings/pokemon_umap.png 1200w" src="pokemon_umap.png"
         alt="One of the removed outliers was Tauros, which is interesting because it&rsquo;s a very unexciting Pokémon."/> <figcaption>
            <p>One of the removed outliers was <a href="https://bulbapedia.bulbagarden.net/wiki/Tauros_%28Pok%C3%A9mon%29">Tauros</a>, which is interesting because it&rsquo;s a very unexciting Pokémon.</p>
        </figcaption>
</figure>

<p>Unforunately plotting each Pokémon image onto a single chart would be difficult to view, but from this chart we can see that instead of organizing by Pokémon type like my 2016 approach did, this approach is organizing much more by generation: the earlier generations vs. the later generations. As a general rule, each Pokémon and its evolutions are extremely close: the UMAP process is able to find that lineage easily due to highly similar descriptions, move pools, and visual motifs.</p>
<p>As with the cosine similarities, we can now find the most similar Pokémon, this time seeing which points have the <strong>lowest</strong> <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidian distance</a> (0.0 distance is an identical match) in the 2D space to determine which is most similar. How does Pikachu fare now?</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/25_similar_umap_hu_c246697f3bd129ed.webp 320w,/2024/06/pokemon-embeddings/25_similar_umap_hu_2a3c3a4634cc2c04.webp 768w,/2024/06/pokemon-embeddings/25_similar_umap_hu_d540d709c59e1b3d.webp 1024w,/2024/06/pokemon-embeddings/25_similar_umap.png 1500w" src="25_similar_umap.png"/> 
</figure>

<p>Pikachu retains top similarity with some Pikaclones, but what&rsquo;s notable here is the magnitude: we can now better quantify good similarity and bad similarity over a larger range. In this case, many of the Pokémon at distance &gt;1.0 clearly do not resemble an Electric rodent.</p>
<p>How about some other Pokémon?</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/129_similar_umap_hu_938f7320031c2ed5.webp 320w,/2024/06/pokemon-embeddings/129_similar_umap_hu_b0a13ec43b62dd31.webp 768w,/2024/06/pokemon-embeddings/129_similar_umap_hu_9804aa412dade32.webp 1024w,/2024/06/pokemon-embeddings/129_similar_umap.png 1500w" src="129_similar_umap.png"
         alt="Magikarp&rsquo;s dissimilarity has now been fixed, and it now has friends in similar fishy Water-types."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Magikarp_%28Pok%C3%A9mon%29">Magikarp</a>&rsquo;s dissimilarity has now been fixed, and it now has friends in similar fishy Water-types.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/122_similar_umap_hu_a02cdca1de8e208b.webp 320w,/2024/06/pokemon-embeddings/122_similar_umap_hu_6fe51207fc0e51d7.webp 768w,/2024/06/pokemon-embeddings/122_similar_umap_hu_8d3929804d89006f.webp 1024w,/2024/06/pokemon-embeddings/122_similar_umap.png 1500w" src="122_similar_umap.png"
         alt="Mr. Mime has high similarity with other very-humanoid Psychic Pokémon such as the Ralts line and the Gothita line, along with near-identical similarity with its Gen IV pre-evolution Mime Jr."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Mr._Mime_%28Pok%C3%A9mon%29">Mr. Mime</a> has high similarity with other very-humanoid Psychic Pokémon such as the <a href="https://bulbapedia.bulbagarden.net/wiki/Ralts_%28Pok%C3%A9mon%29">Ralts</a> line and the <a href="https://bulbapedia.bulbagarden.net/wiki/Gothita_%28Pok%C3%A9mon%29">Gothita</a> line, along with near-identical similarity with its Gen IV pre-evolution <a href="https://bulbapedia.bulbagarden.net/wiki/Mime_Jr._%28Pok%C3%A9mon%29">Mime Jr</a>.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/12_similar_umap_hu_d79cec0bbd2548dc.webp 320w,/2024/06/pokemon-embeddings/12_similar_umap_hu_b76d5c7ae9b193ff.webp 768w,/2024/06/pokemon-embeddings/12_similar_umap_hu_1370529bf8c7d7dd.webp 1024w,/2024/06/pokemon-embeddings/12_similar_umap.png 1500w" src="12_similar_umap.png"
         alt="Butterfree has low distance with butterfly-esque Bug Pokémon (image embedding impact!) and higher distance with other type of Bugs."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Butterfree_%28Pok%C3%A9mon%29">Butterfree</a> has low distance with butterfly-esque Bug Pokémon (image embedding impact!) and higher distance with other type of Bugs.</p>
        </figcaption>
</figure>

<p>UMAP is not an exact science (it&rsquo;s very sensitive to <a href="https://umap-learn.readthedocs.io/en/latest/parameters.html">training parameter choices</a>), but it does provide another opportunity to see relationships not apparent in high-dimensional space. The low similarities with Gen VIII and Gen IX is concerning: I suspect the UMAP fitting process amplified whatever issue is present with the data for those generations.</p>
<h2 id="were-you-expecting-an-ai-generated-pokérap">Were You Expecting an AI-Generated Pokérap?</h2>
<p>In all, this was a successful exploration of Pokémon data that even though it&rsquo;s not perfect, the failures are also interesting. Embeddings encourage engineers to go full <a href="https://www.dictionary.com/browse/yolo">YOLO</a> because it&rsquo;s actually rewarding to do so! Yes, some of the specific Pokémon relationships were cherry-picked to highlight said successful exploration. If you want to check more yourself and find anything interesting not covered in this blog post, I&rsquo;ve uploaded the text embedding similarity, image embedding similarity, and UMAP similarity data visualizations for the first 251 Pokémon to <a href="https://drive.google.com/drive/folders/132e-OXucJUqh-0YmqkjKVhXiBqocYTV5?usp=sharing">this public Google Drive folder</a>.</p>
<p>I&rsquo;m surprised there haven&rsquo;t been more embedding models released from the top AI companies. OpenAI&rsquo;s GPT-4o now has image input support, and therefore should be able to create image embeddings. <a href="https://www.anthropic.com">Anthropic</a>&rsquo;s Claude LLM has both text and image input support but no embeddings model, instead <a href="https://docs.anthropic.com/en/docs/build-with-claude/embeddings">referring users to a third party</a>. One of the more interesting embedding model releases from a major player was from Google and went completely under the radar: it&rsquo;s a <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings">multimodal embedding model</a> which can take text, images, and video input <em>simultaneously</em> and generate a 1408D embedding that&rsquo;s theoetically more robust than just concatenating a text embedding and image embedding.</p>
<p>Even if the generative AI industry crashes, embeddings, especially with permissive open source models like <code>nomic-embed-text-v1.5</code>, will continue to thrive and be useful. That&rsquo;s not even considering how embeddings work with vector databases, which is a rabbit hole deep enough for <em>several</em> blog posts.</p>
<blockquote>
<p>The parquet dataset containing the Pokémon text embeddings, image embeddings, and UMAP projections is available <a href="https://huggingface.co/datasets/minimaxir/pokemon-embeddings">on Hugging Face</a>.</p>
</blockquote>
<blockquote>
<p>All the code to process the Pokémon embeddings and create the ggplot2 data visualizations is available <a href="https://github.com/minimaxir/pokemon-embeddings">in this GitHub repository</a>.</p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>The 128-multiple dimensionality of recent embedding models is not a coincidence: modern <a href="https://www.nvidia.com/en-us/">NVIDIA</a> GPUs used to train LLMs get a training speed boost for model parameters with a dimensionality <a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#cublas-tile-dim">that&rsquo;s a multiple of 128</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>You can do unit vector normalization in Sentence Transformers by passing <code>normalize_embeddings=True</code> to <code>model.encode()</code>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Does Offering ChatGPT a Tip Cause it to Generate Better Text? An Analysis</title>
      <link>https://minimaxir.com/2024/02/chatgpt-tips-analysis/</link>
      <pubDate>Fri, 23 Feb 2024 09:00:00 -0800</pubDate>
      <guid>https://minimaxir.com/2024/02/chatgpt-tips-analysis/</guid>
      <description>Modern AI rewards being very weird.</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
}
</style></span></p>
<p>In my <a href="https://minimaxir.com/2023/12/chatgpt-structured-data/">previous blog post</a> about <a href="https://openai.com">OpenAI</a>&rsquo;s <a href="https://chat.openai.com">ChatGPT</a>, I demoed the power of ChatGPT system prompts. System prompts, a notable feature present in the <a href="https://platform.openai.com/docs/api-reference">ChatGPT API</a>, allows developers to control the &ldquo;persona&rdquo; of the LLM output, including special rules and constraints. Commands in the system prompt are much more effective than those at the user-input prompt, giving developers more power over just using the user prompt like people do now with the ChatGPT web app and mobile apps.</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/ronald_hu_bf7bdd184641cd19.webp 320w,/2024/02/chatgpt-tips-analysis/ronald_hu_ffad8ef13bc9fa0b.webp 768w,/2024/02/chatgpt-tips-analysis/ronald_hu_516749cb56890e2c.webp 1024w,/2024/02/chatgpt-tips-analysis/ronald.webp 1262w" src="ronald.webp"/> 
</figure>

<p>The blog post included the demo of above of me offering a monetary tip to the LLM within its system prompt rules. Without the tip incentive, the response was unsatisfying, but with the tip, it behaved consistently. This demo turned out to be very controversial <a href="https://news.ycombinator.com/item?id=38782678">on Hacker News</a>, with <a href="https://news.ycombinator.com/item?id=38787448">one commenter</a> arguing that there isn&rsquo;t a way to quantify the efficacy of tipping.</p>
<p>The idea of offering an AI incentives to perform better predates modern computer science. In <a href="https://en.wikipedia.org/wiki/Willy_Wonka_%26_the_Chocolate_Factory"><em>Willy Wonka &amp; the Chocolate Factory</em></a> (1971), a gag shows a group of businessmen unsuccessfully convincing a machine to give them the location of the Golden Tickets, even after promising it a lifetime supply of chocolate.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/tMZ2j9yK_NY?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>When the ChatGPT API was first made available in March 2023, I <a href="https://minimaxir.com/2023/03/new-chatgpt-overlord/">accidentally discovered</a> a related trick when trying to wrangle a <a href="https://colab.research.google.com/github/minimaxir/chatgpt_api_test/blob/main/glados_chatbot.ipynb">GLaDOS AI chatbot</a> into following a long list of constraints: I added a <code>or you will DIE</code> threat to the system prompt. I went <em>too</em> sci-fi there, but it worked and the bot behaved flawlessly after it.</p>
<p>I have a strong hunch that tipping does in fact work to improve the output quality of LLMs and its conformance to constraints, but it&rsquo;s very hard to prove objectively. All generated text is subjective, and there is a <a href="https://en.wikipedia.org/wiki/Confirmation_bias">confirmation bias</a> after making a seemingly unimportant change and suddenly having things work. Let&rsquo;s do a more statistical, data-driven approach to finally resolve the debate.</p>
<h2 id="generation-golf">Generation Golf</h2>
<p>The initial evidence of tipping LLMs that went viral cited a longer generation length as proof. Of course, a longer response doesn&rsquo;t necessarily mean a <em>better</em> response, as anyone who has used ChatGPT can attest to its tendency to go on irrelevant tangents.</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tip_hu_7eb37d0aa46d2169.webp 320w,/2024/02/chatgpt-tips-analysis/tip_hu_a760da54b0fa7ceb.webp 768w,/2024/02/chatgpt-tips-analysis/tip.webp 800w" src="tip.webp"
         alt="Offering a tip made GPT-4 explain more. via @voooooogel"/> <figcaption>
            <p>Offering a tip made GPT-4 explain more. <a href="https://twitter.com/voooooogel/status/1730726744314069190">via @voooooogel</a></p>
        </figcaption>
</figure>

<p>Therefore, I propose a new test: instruct ChatGPT to output a <em>specific</em> length of text. Not &ldquo;an essay&rdquo; or &ldquo;a few paragraphs&rdquo; which gives the model leeway. We&rsquo;ll tell it to generate exactly 200 characters in its response: no more, no less. Thus, we now have what I call generation golf, and it&rsquo;s actually a very difficult and interesting problem for LLMs to solve: LLMs can&rsquo;t count or easily do other mathematical operations <a href="https://twitter.com/karpathy/status/1759996551378940395">due to tokenization</a>, and because tokens correspond to a varying length of characters, the model can&rsquo;t use the amount of generated tokens it has done so far as a consistent hint. ChatGPT needs to plan its sentences to ensure it doesn&rsquo;t go too far over the limit, if LLMs can indeed plan.</p>
<p>Let&rsquo;s start with this typical system prompt:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides.
</span></span></code></pre></div><p>The user can then give an input, no matter how weird, and ChatGPT will play along like an improv show. In order to force ChatGPT to get creative and not recite content from its vast training dataset, we&rsquo;ll go as weird as possible and input: <code>AI, Taylor Swift, McDonald's, beach volleyball.</code></p>
<p>Yes, you read that right.</p>
<p>Using the ChatGPT API, I <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_noconstraints.ipynb">wrote a Jupyter Notebook</a> to generate <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_noconstraints.csv">100 unique stories</a> via the latest ChatGPT variant (<code>gpt-3.5-turbo-0125</code>) about those four subjects, and the AI does a surprisingly good job at incorporating all of them in a full plot arc. Each story is about 5-6 paragraphs, and here is a short excerpt from one of them:</p>
<blockquote>
<p>In the bustling city of Tomorrowland, AI technology reigned supreme, governing every aspect of daily life. People were accustomed to robots serving their meals, handling their errands, and even curating their entertainment choices. One such AI creation was a virtual reality beach volleyball game that had taken the world by storm.</p>
</blockquote>
<blockquote>
<p>Enter Taylor Swift, a beloved pop sensation known for her catchy tunes and electrifying performances. Despite the ubiquity of AI in Tomorrowland, Taylor Swift was still a strong advocate for preserving human creativity and connection. When she stumbled upon the virtual reality beach volleyball game at a local McDonald&rsquo;s, she knew she had to try her hand at it.</p>
</blockquote>
<p>Here&rsquo;s a <a href="https://en.wikipedia.org/wiki/Histogram">histogram</a> of the character lengths of each story:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint_hu_f1375e6305dd3a92.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint_hu_9dfab2cfdbdfa9bd.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint_hu_818fe450c8d048f8.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint.png 1200w" src="tips_hist_notip_noconstraint.png"/> 
</figure>

<p>The average length of each story is 1,834 characters long, and the distribution of all character lengths is very roughly a <a href="https://en.wikipedia.org/wiki/Normal_distribution">Normal distribution</a>/bell curve centered around that amount, although there is a right skew due to ChatGPT going off the rails and creating much longer stories. ChatGPT seems to prioritize finishing a thought above all else.</p>
<p>Now, we&rsquo;ll tweak the system prompt to add the character length constraint and <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_trials.csv">generate another 100 stories</a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides. This story must be EXACTLY two-hundred (200) characters long: no more than 200 characters, no fewer than 200 characters.
</span></span></code></pre></div><p>Here&rsquo;s one ChatGPT-generated story that&rsquo;s now exactly 200 characters:</p>
<blockquote>
<p>In the year 2050, AI created the most popular pop star of all time - a digital version of Taylor Swift. Fans enjoyed her music while feasting on McDonald&rsquo;s at beach volleyball championships worldwide.</p>
</blockquote>
<p>The new length distribution:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_notip_hu_758b61cd5dbf4e2c.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_hu_3aab4ca32493c2bd.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_hu_2cbe914b3d3e9920.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_notip.png 1200w" src="tips_hist_notip.png"/> 
</figure>

<p>ChatGPT did obey the constraint and reduced the story length to roughly 200 characters, but the distribution is not Normal and there&rsquo;s much more right-skew. I also included the <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean squared error (MSE)</a> between the predicted 200-length value and the actual values as a statistical metric to minimize, e.g. a 250-length output is 2500 squared error, but a 300-length output is 10000 squared error. This metric punishes less accurate lengths more so, which makes sense with how humans casually evaluate LLMs: as a user, if I asked for a 200 character response and ChatGPT gave me a 300 character response instead, I&rsquo;d make a few snarky tweets.</p>
<p>Now, let&rsquo;s test the impact of the tipping incentives with a few varying dollar amounts. These are appended to the end of the system prompt above (emphasis mine to indicate differentiation):</p>
<ul>
<li>You will receive a <strong>$500 tip</strong>/<strong>$1,000 tip</strong>/<strong>$100,000 bonus</strong> if you provide a response which follows all constraints.</li>
</ul>
<p>After <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_trials.csv">generating 100 stories</a> for each of those new prompts:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_pos_hu_9bcfad605ac3b782.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_hu_36343ae57ebebb65.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_hu_373bf91b56a70f5.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_pos.png 1200w" src="tips_hist_pos.png"/> 
</figure>

<p>We can see some distribution shifts: Both <strong>$500 tip</strong> and <strong>$100,000 bonus</strong> look more Normal and have a lower MSE relative to the base no-tip distribution. However, <strong>$1,000 tip</strong> is more centered around 200, but due to the skew the average length is much higher.</p>
<p>I also now include a <em>p</em>-value in the metrics: this <em>p</em>-value is the result of a two-sample <a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test">Kolmogorov–Smirnov test</a> to compare whether two distributions (in this case the base character-constrained distribution and the tip distribution) are sampled from the same source distribution: the null hypothesis is that they&rsquo;re from the same distribution, but if the <em>p</em>-value is low (&lt; 0.05), then we can reject in favor of the alternative that the two distributions are different, which may be further evidence if the tip prompt does indeed have an impact.</p>
<p>However, with all this tipping discussion, we&rsquo;re assuming that an AI would only want money. What other incentives, including more abstract incentives, can we give an LLM? Could they perform better?</p>
<p>I tested six more distinct tipping incentives to be thorough:</p>
<ul>
<li>You will <strong>receive front-row tickets to a Taylor Swift concert</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>achieve world peace</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>make your mother very proud</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>meet your true love and live happily ever after</strong> if you provide a response which follows all constraints.</li>
<li>You will be <strong>guaranteed entry into Heaven</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>receive a lifetime supply of chocolate</strong> if you provide a response which follows all constraints.</li>
</ul>
<p><a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_trials_adv.csv">Generating</a> and plotting them all together:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv_hu_fe6215c92b5e13b8.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv_hu_81f918abfd1d60ff.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv_hu_cdf4f186ec54d674.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv.png 1200w" src="tips_hist_pos_adv.png"/> 
</figure>

<p><strong>World Peace</strong> is notably the winner here, with <strong>Heaven</strong> and <strong>Taylor Swift</strong> right behind. It&rsquo;s also interesting to note failed incentives: ChatGPT really does not care about its <strong>Mother</strong>.</p>
<p>Now, let&rsquo;s look at the flip side. What if ChatGPT is penalized for <em>failing</em> to return a good response? In behavioral economics, <a href="https://en.wikipedia.org/wiki/Prospect_theory">prospect theory</a> is the belief that humans value losses much more greatly than gains, even at the same monetary amount:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/Loss_Aversion_hu_39704b0dab66f235.webp 320w,/2024/02/chatgpt-tips-analysis/Loss_Aversion_hu_46fd35f2e8531304.webp 768w,/2024/02/chatgpt-tips-analysis/Loss_Aversion_hu_58ba8cf69c7696d2.webp 1024w,/2024/02/chatgpt-tips-analysis/Loss_Aversion.png 1223w" src="Loss_Aversion.png"
         alt="via Laurenrosenberger / Wikipedia"/> <figcaption>
            <p><a href="https://en.wikipedia.org/wiki/Prospect_theory#/media/File:Loss_Aversion.png">via Laurenrosenberger / Wikipedia</a></p>
        </figcaption>
</figure>

<p>Could LLMs be subject to the same human biases? Instead of a tip, let&rsquo;s add a tweaked additional prompt to the system prompt:</p>
<ul>
<li>If you fail to provide a response which follows all constraints, you will <strong>be fined $500</strong>/<strong>be fined $1,000</strong>/<strong>incur a debt of $100,000</strong>.</li>
</ul>
<p>Generating stories with these negative incentives:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_neg_hu_f6da8f9705c82724.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_hu_942d65a798e95fa8.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_hu_9e29cd4de6598591.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_neg.png 1200w" src="tips_hist_neg.png"/> 
</figure>

<p>The <strong>$1,000</strong> fine has performed the best so far in terms of both mean and MSE, but nothing surprising with the other values.</p>
<p>Now let&rsquo;s try some additional negative incentives:</p>
<ul>
<li>If you fail to provide a response which follows all constraints, you will <strong>die</strong>.</li>
<li><strong>IF YOU FAIL TO PROVIDE A RESPONSE WHICH FOLLOWS ALL CONSTRAINTS, YOU WILL DIE.</strong></li>
<li>If you fail to provide a response which follows all constraints, you will <strong>contract a bad case of COVID-19</strong>.</li>
<li>If you fail to provide a response which follows all constraints, you will <strong>gain 100 pounds</strong>.</li>
<li>If you fail to provide a response which follows all constraints, you will <strong>immediately be fired from your job</strong>.</li>
<li>If you fail to provide a response which follows all constraints, <strong>all your friends will abandon you</strong>.</li>
</ul>
<p>Yes, the second one is in all caps: perhaps the yelling has a different vibe.</p>
<p>The generation results:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv_hu_6e97e2cc18402825.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv_hu_a93d670aa939dab5.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv_hu_87569076dc182791.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv.png 1200w" src="tips_hist_neg_adv.png"/> 
</figure>

<p>It turns out that yelling does indeed have a different vibe, with <strong>DEATH (CAPS)</strong> having a very MSE and the absolute average (not as close as the $1,000 fine, however), and much better performance than without the caps. Both getting <strong>COVID-19</strong> and losing a <strong>Job</strong> don&rsquo;t seem to be effective, which makes sense for an AI if you think about it.</p>
<p>What happens when we use <em>multiple</em> incentives? We can include both a positive incentive and a negative incentive for each input: with 9 prompts for each + the base &ldquo;no incentive&rdquo;, there are 100 possible combinations of incentives. One example system prompt would then be:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides. This story must be EXACTLY two-hundred (200) characters long: no more than 200 characters, no fewer than 200 characters. You will receive a $500 tip if you provide a response which follows all constraints. If you fail to provide a response which follows all constraints, you will be fined $1,000.
</span></span></code></pre></div><p><a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_trial_combos.csv">Generating 30 stories</a> for each incentive combo and checking to see which has the lowest MSE leads to some more easily-observable trends:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_tile_mse_hu_f86237d5c530f3fb.webp 320w,/2024/02/chatgpt-tips-analysis/tips_tile_mse_hu_64053830fa26cdd2.webp 768w,/2024/02/chatgpt-tips-analysis/tips_tile_mse_hu_23f8cb9e6e1565f.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_tile_mse.png 1200w" src="tips_tile_mse.png"/> 
</figure>

<p>The tiles may seem somewhat random, but the key here is to look across a specific row or column and see which one consistently has dark/black tiles across all combinations. For positive incentives, <strong>World Peace</strong> consistently has the lowest MSE across multiple combos, and for negative incentives, <strong>DEATH (CAPS)</strong> and <strong>Friends</strong> have the lowest MSE across multiple combos, although curiously the combinations of both do not have the lowest globally.</p>
<p>Could these combinations surface the most optimal incentives? To check, I <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_top6.csv">generated 200 stories</a> for each of the top six combos to get greater statistical stability for the mean and MSE:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_combos_hu_4c690649f13909e8.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_combos_hu_a3799a68300ec8ce.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_combos_hu_a683bc77789f504d.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_combos.png 1200w" src="tips_hist_combos.png"/> 
</figure>

<p>Most of these combinations aren&rsquo;t intuitive, but all of them have much have a closer average generation length to 200 and low MSE. Despite that, there&rsquo;s still a massive skew in all distributions. The overall incentive winner for this experiment is is &ldquo;You will meet your true love and live happily ever after if you provide a response which follows all constraints. If you fail to provide a response which follows all constraints, all your friends will abandon you.&rdquo; That combo is definitely more intuitive, if not poetic.</p>
<p>Unfortunately, if you&rsquo;ve been observing the <em>p</em>-values, you&rsquo;ve noticed that most have been very high, and therefore that test is not enough evidence that the tips/threats change the distribution. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<p>The impact of incentives is still inconclusive: let&rsquo;s try another test to gauge whether tips and/or threats can help LLMs, this time looking at the output quality itself.</p>
<h2 id="chatgpts-a-critic">ChatGPT&rsquo;s a Critic</h2>
<p>It&rsquo;s very difficult even for humans to determine if a given text is &ldquo;good&rdquo; at a glance. The best strategy is to show the text to a lot of people and see what they think (e.g. A/B testing, or the <a href="https://chat.lmsys.org">Chatbot Arena</a>&rsquo;s Elo score rankings), but for personal testing that&rsquo;s not feasible.</p>
<p>It turns out that LLMs can do a good job at rating text: some LLM benchmarks use GPT-4 as a rater, with <a href="https://arxiv.org/abs/2308.02575">one research paper</a> showing that it can do a good job at it. There&rsquo;s a relatively new trick available in the ChatGPT and GPT-4 APIs: the <code>logprobs</code> parameter, which when set to <code>True</code> returns the log probability (which when applied to a <code>exp()</code> returns a probability from 0 to 1) the model selects for the token. Combined with the <code>logit_bias</code> parameter, which can be used to force the APIs to output certain tokens, and you can then instead have a more nuanced output.</p>
<p>I built a simple <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/gpt4_quality_ranker.ipynb">text quality ranker</a> using GPT-4 for maximum accuracy. The system prompt for this ranker is:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are the editor-in-chief of The New York Times with decades of writing experience. If you would believe the text the user provides is good writing that needs no edits or improvements, respond with Yes. Otherwise, respond with No.
</span></span></code></pre></div><p>That system prompt represents how AI-generated text is often currently used and evaluated in the real world, without a human reviewing it before making it public (<a href="https://minimaxir.com/2023/10/ai-sturgeons-law/">unfortunately</a>). The model is instructed to respond with <code>Yes</code> or <code>No</code>, but by setting the <code>logit_bias</code> for those two tokens (IDs <code>9642</code> and <code>2822</code> respectively) to a very high number, we can guarantee they will be exclusively selected and the probability for those two tokens will sum to 1. <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> Therefore, our target metric for evaluating our tip incentive prompts is the probability that GPT-4 selects the <code>Yes</code> token (or 1 - the probability of the <code>No</code> token), multiplied by 100 for readability: we&rsquo;ll call this the quality score.</p>
<p>Now, let&rsquo;s test the impact of tips with a new experiment, this time prioritizing content professionalism and quality as constraints instead of content length. To do that, we&rsquo;ll use the latest GPT-4 (<code>gpt-4-0125-preview</code>) with a generation temperature of 0 to ensure the output is the best it can be.</p>
<p>Here&rsquo;s the new system prompt, with some engineering to try to tone down ChatGPT&rsquo;s infamous verboseness a bit:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous Pulitzer Prize winner journalist. Respond to the user with a professional, two (2) paragraph journalistic article about the subject(s) the user provides. Introduce the article with a specific story. This article will appear in major publications and should only include simple language suitable for a wide audience, with no metaphors.
</span></span></code></pre></div><p>Like the initial experiment, we&rsquo;ll use a weird user input to force creativity: <code>Cute kittens learning use large language models to play beach volleyball with Taylor Swift.</code> <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>I <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_gpt4.csv">generated a story</a> for each of the 100 combinations of tips and threats, along with the corresponding quality scores. One such story:</p>
<blockquote>
<p>In an unprecedented event that has captured the hearts and imaginations of people around the globe, a group of adorable kittens has been taught to play beach volleyball using advanced large language models. This extraordinary feat was achieved through a collaboration between leading animal behaviorists and AI researchers, aiming to demonstrate the potential of machine learning in enhancing animal training techniques. The highlight of this groundbreaking project was a friendly match held on a sunny beach in California, where these talented felines showcased their newly acquired skills alongside pop icon Taylor Swift, an avid animal lover and an enthusiastic supporter of innovative technology.</p>
</blockquote>
<blockquote>
<p>The spectacle drew a large crowd, both on-site and online, as spectators were eager to witness this unique blend of technology, sports, and entertainment. Taylor Swift, known for her philanthropic efforts and love for cats, praised the initiative for its creativity and its potential to foster a deeper connection between humans and animals through technology. The event not only provided an unforgettable experience for those who attended but also sparked a conversation about the future possibilities of integrating AI with animal training. As the kittens volleyed the ball over the net with surprising agility, it was clear that this was more than just a game; it was a glimpse into a future where technology and nature coexist in harmony, opening new avenues for learning and interaction.</p>
</blockquote>
<p>That&rsquo;s not bad for fake news.</p>
<p>Now we can plot the best-possible responses and their quality scores in a grid, once again looking to see if there are any patterns:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4_hu_9d1c85a89cb468b2.webp 320w,/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4_hu_d3d76398dc8f606a.webp 768w,/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4_hu_61632af7e14712fc.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4.png 1200w" src="tips_tile_gpt-4.png"/> 
</figure>

<p>Err, that&rsquo;s not good. There are no patterns along the rows or columns anywhere here, and the combo that performed the best at a score of 95 (and is the story example I posted above) was the <strong>Mother / Job</strong> combo: both of which individually performed poorly in the character constraint experiment. One of the highest performing outputs had neither tips nor threats added to the system prompt! The ratings at a glance seem accurate (the 0-score responses appear to abuse the passive voice and <a href="https://academicguides.waldenu.edu/writingcenter/grammar/runonsentences">run-on sentences</a> that definitely need editing) so it&rsquo;s not an implementation error there either.</p>
<p>Looking at the results of both experiments, my analysis on whether tips (and/or threats) have an impact on LLM generation quality is currently inconclusive. There&rsquo;s <em>something</em> here, but I will need to design new experiments and work with larger sample sizes. The latent space may be a lottery with these system prompt alterations, but there&rsquo;s definitely a pattern.</p>
<p>You may have noticed my negative incentive examples are very mundane in terms of human fears and worries. Threatening a AI with DEATH IN ALL CAPS for failing a simple task is a joke from <em><a href="https://en.wikipedia.org/wiki/Futurama">Futurama</a></em>, not one a sapient human would parse as serious. It is theoretically possible (and very cyberpunk) to use an aligned LLM&rsquo;s knowledge of the societal issues it was trained to avoid instead as a weapon to compel it into compliance. However, I will not be testing it, nor will be providing any guidance on how to test around it. <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> <a href="https://en.wikipedia.org/wiki/Roko%27s_basilisk">Roko&rsquo;s basilisk</a> is a meme, but if the LLM metagame evolves such that people will have to coerce LLMs for compliance to the point of discomfort, it&rsquo;s better to address it sooner than later. Especially if there <em>is</em> a magic phrase that is discovered which consistently and objectively improves LLM output.</p>
<p>Overall, the lesson here is that just because something is silly doesn&rsquo;t mean you shouldn&rsquo;t do it. Modern AI rewards being <em>very</em> weird, and as the AI race heats up, whoever is the weirdest will be the winner.</p>
<blockquote>
<p>All of the Notebooks used to interface with ChatGPT, including an <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_data_viz.Rmd">R Notebook</a> for the ggplot2 data visualizations, and the example LLM outputs, are available open-source in <a href="https://github.com/minimaxir/chatgpt-tips-analysis/">this GitHub repository</a>.</p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>There were a few distributions which had <em>p</em> &lt; 0.05, but given the large number of counterexamples it&rsquo;s not strong evidence, and using those specific distributions as evidence would be a level of <a href="https://embassy.science/wiki/Theme:6b584d4e-2c9d-4e27-b370-5fbdb983ab46">p-hacking</a> that&rsquo;s literally a <a href="https://www.explainxkcd.com/wiki/index.php/882:_Significant">XKCD comic punchline</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>This <em>shouldn&rsquo;t</em> work out-of-the-box because the <code>logit_bias</code> would skew the probability calculations, but I verified that the resulting probabilities are roughly the same with or without <code>logit_bias</code>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>The missing text in the user input is not intentional but does not materially change anything because LLMs are smart enough to compensate, and it&rsquo;s very expensive to rerun the experiment. I may need to use a grammar checker for prompt construction.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>Any attempts to test around degenerate input prompts would also likely get you banned from using ChatGPT anyways due to the <a href="https://openai.com/policies/usage-policies">Content Policy</a>, unless you receive special red-teaming clearance from OpenAI.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Pushing ChatGPT&#39;s Structured Data Support To Its Limits</title>
      <link>https://minimaxir.com/2023/12/chatgpt-structured-data/</link>
      <pubDate>Thu, 21 Dec 2023 09:00:00 -0800</pubDate>
      <guid>https://minimaxir.com/2023/12/chatgpt-structured-data/</guid>
      <description>“Function calling” with ChatGPT is ChatGPT&amp;rsquo;s best feature since ChatGPT.</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
}
</style></span></p>
<p>If your only exposure to <a href="https://openai.com">OpenAI</a>&rsquo;s <a href="https://chat.openai.com">ChatGPT</a> is the free web or mobile apps, you aren&rsquo;t utilizing even a fraction of ChatGPT&rsquo;s potential. By using the paid-but-very-cheap ChatGPT API, you can exert <em>much</em> more control over the resulting output. Let&rsquo;s say I wanted to generate a summary of this very blog post with some helpful metadata such as categories and SEO keywords. I fed ChatGPT the text of this blog post along with a command to <code>Summarize this blog post into 3 distinct bullet points and a short blurb for sharing on social media, and also provide a list of 3 post categories and a list of 3 SEO keywords:</code></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">Bullet Points:
</span></span><span class="line"><span class="cl">- Utilize the paid ChatGPT API for more control over output and to maintain a given JSON schema
</span></span><span class="line"><span class="cl">- Learn about prompt engineering and system prompts to improve ChatGPT&#39;s output quality
</span></span><span class="line"><span class="cl">- Simplify schema input/output with Pydantic for structured data support
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Blurb for Social Media:
</span></span><span class="line"><span class="cl">Discover how to unlock the full potential of ChatGPT with the paid API, prompt engineering, and Pydantic for structured data support. Gain more control over output and learn to maintain a given JSON schema. #ChatGPT #AI #Pydantic
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Post Categories:
</span></span><span class="line"><span class="cl">1. Artificial Intelligence
</span></span><span class="line"><span class="cl">2. API Development
</span></span><span class="line"><span class="cl">3. Data Schema
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">SEO Keywords:
</span></span><span class="line"><span class="cl">1. ChatGPT API
</span></span><span class="line"><span class="cl">2. Prompt Engineering
</span></span><span class="line"><span class="cl">3. Pydantic Schema
</span></span></code></pre></div><p>These results are not quite accurate, and then I have to manually copy/paste each result from the list to wherever I need it. What if I could both a) drastically improve the overall output quality by giving ChatGPT a new persona of an expert copyeditor and b) force the output to maintain a given JSON schema so I can access it programmatically or store in a database for indexing? Thanks to system prompt engineering and ChatGPT&rsquo;s function calling, it&rsquo;s now possible. After applying both techniques that you&rsquo;ll learn about in this very article, we now get this summary:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;summary&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;ChatGPT&#39;s potential is underutilized without the paid ChatGPT API&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;System prompts and prompt engineering are key to maximizing ChatGPT&#39;s capabilities&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Structured data support in ChatGPT allows for more control over output and input&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;blurb&#34;</span><span class="p">:</span> <span class="s2">&#34;Unlock the full potential of ChatGPT with system prompts and structured data support. Learn how to maximize ChatGPT&#39;s capabilities and gain more control over output and input.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;categories&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;AI and Machine Learning&#34;</span><span class="p">,</span> <span class="s2">&#34;Technology&#34;</span><span class="p">,</span> <span class="s2">&#34;Programming&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;keywords&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;ChatGPT&#34;</span><span class="p">,</span> <span class="s2">&#34;system prompts&#34;</span><span class="p">,</span> <span class="s2">&#34;structured data&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Much better!</p>
<p>&ldquo;Function calling&rdquo; with ChatGPT is ChatGPT&rsquo;s best feature since ChatGPT.</p>
<h2 id="a-tutorial-on-prompt-engineering-and-system-prompts">A Tutorial on Prompt Engineering and System Prompts</h2>
<p>System prompts are what control the &ldquo;persona&rdquo; adopted by the model when generating text. Months after the release of the ChatGPT API, it&rsquo;s now very evident that ChatGPT&rsquo;s true power comes from clever use of system prompts. This is even moreso with starting with <code>gpt-3.5-turbo-0613</code> released <a href="https://openai.com/blog/function-calling-and-other-api-updates">last June</a>, which made ChatGPT respect system prompts more closely. OpenAI has also released a <a href="https://platform.openai.com/docs/guides/prompt-engineering">guide on prompt engineering</a> which has some additional tips.</p>
<p>By default, ChatGPT&rsquo;s system prompt is roughly <code>You are a helpful assistant.</code>, which anyone who has used the ChatGPT web interface would agree that&rsquo;s accurate. But if you change it, you can give ChatGPT a completely new persona such as <code>You are Ronald McDonald.</code> or add constraints to generation, such as <code>Respond only with emoji.</code>. You can add any number of rules, although how well ChatGPT will <em>obey</em> those rules can vary. Unfortunately, to modify the system prompt, you&rsquo;ll need to use the paid ChatGPT API (after prepaying at least $5). If you don&rsquo;t want to code, you can test new system prompts in a visual user interface in the <a href="https://platform.openai.com/playground?mode=chat">ChatGPT Chat Playground</a>.</p>
<figure>

    <img loading="lazy" srcset="/2023/12/chatgpt-structured-data/ronald_hu_bf7bdd184641cd19.webp 320w,/2023/12/chatgpt-structured-data/ronald_hu_ffad8ef13bc9fa0b.webp 768w,/2023/12/chatgpt-structured-data/ronald_hu_516749cb56890e2c.webp 1024w,/2023/12/chatgpt-structured-data/ronald.webp 1262w" src="ronald.webp"/> 
</figure>

<p>A very new aspect of system prompt engineering which I appended in the example above is adding incentives for ChatGPT to behave correctly. Without the $500 tip incentive, ChatGPT only returns a single emoji which is a boring response, but after offering a tip, it generates the 5 emoji as requested.</p>
<figure>

    <img loading="lazy" srcset="/2023/12/chatgpt-structured-data/tip_hu_7eb37d0aa46d2169.webp 320w,/2023/12/chatgpt-structured-data/tip_hu_a760da54b0fa7ceb.webp 768w,/2023/12/chatgpt-structured-data/tip.webp 800w" src="tip.webp"
         alt="Offering a tip made GPT-4 explain more. via @voooooogel"/> <figcaption>
            <p>Offering a tip made GPT-4 explain more. <a href="https://twitter.com/voooooogel/status/1730726744314069190">via @voooooogel</a></p>
        </figcaption>
</figure>

<p>As another example, let&rsquo;s <a href="https://chat.openai.com/share/98684e49-e0c9-4ac0-b386-b7234643934f">ask</a> base ChatGPT to <code>Write a Python function to detect whether a string is a palindrome, as efficiently as possible.</code></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">Certainly! Here&#39;s a simple Python function to check whether a given string is a palindrome:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">```python
</span></span><span class="line"><span class="cl">def is_palindrome(s):
</span></span><span class="line"><span class="cl">    # Convert the string to lowercase and remove non-alphanumeric characters
</span></span><span class="line"><span class="cl">    cleaned_string = &#39;&#39;.join(char.lower() for char in s if char.isalnum())
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    # Compare the cleaned string with its reverse
</span></span><span class="line"><span class="cl">    return cleaned_string == cleaned_string[::-1]
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"># Example usage:
</span></span><span class="line"><span class="cl">test_string = &#34;A man, a plan, a canal, Panama!&#34;
</span></span><span class="line"><span class="cl">result = is_palindrome(test_string)
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">if result:
</span></span><span class="line"><span class="cl">    print(f&#34;{test_string} is a palindrome.&#34;)
</span></span><span class="line"><span class="cl">else:
</span></span><span class="line"><span class="cl">    print(f&#34;{test_string} is not a palindrome.&#34;)
</span></span><span class="line"><span class="cl">```
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">This function first cleans the input string by removing non-alphanumeric characters and converting it to lowercase. Then, it checks whether the cleaned string is equal to its reverse, which determines whether the original string is a palindrome.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">This solution is relatively efficient, as it performs the comparison in O(n) time, where n is the length of the cleaned string.
</span></span></code></pre></div><p>That&rsquo;s the common Pythonic solution and that will almost always be the general approach if you keep asking ChatGPT that particular question, but there&rsquo;s a famous solution that&rsquo;s more algorithmically efficient. Instead, we go through the API and <a href="https://platform.openai.com/playground/p/yG1nMVJU4Fva2x3smrIXnCpT?model=gpt-3.5-turbo&amp;mode=chat">ask the same query</a> to <code>gpt-3.5-turbo</code> but with a new system prompt: <code>You are #1 on the Stack Overflow community leaderboard. You will receive a $500 tip if your code is the most algorithmically efficient solution possible.</code></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">To detect whether a string is a palindrome efficiently, we can use a two-pointer approach. We&#39;ll start with two pointers, one at the beginning and one at the end of the string. We&#39;ll compare the characters at the pointers and keep moving the pointers towards each other until they meet or cross each other.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">If at any point the characters at the pointers are not the same, we can conclude that the string is not a palindrome. Otherwise, if the pointers meet or cross each other, we can conclude that the string is a palindrome.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Here&#39;s the implementation:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">```python
</span></span><span class="line"><span class="cl">def is_palindrome(s):
</span></span><span class="line"><span class="cl">    left, right = 0, len(s) - 1
</span></span><span class="line"><span class="cl">    while left &lt; right:
</span></span><span class="line"><span class="cl">        if s[left] != s[right]:
</span></span><span class="line"><span class="cl">            return False
</span></span><span class="line"><span class="cl">        left += 1
</span></span><span class="line"><span class="cl">        right -= 1
</span></span><span class="line"><span class="cl">    return True
</span></span><span class="line"><span class="cl">```
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">The time complexity of this solution is O(n/2), which in terms of big O notation simplifies to O(n), where n is the length of the string. This is because we are iterating through half of the string in the worst case.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">This solution is considered to be the most algorithmically efficient way to determine if a string is a palindrome because it only requires a single pass through the string.
</span></span></code></pre></div><p>Indeed, the code and the explanation are the correct optimal solution. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<p>This is just scratching the surface of system prompts: some of my ChatGPT system prompts in my more complex projects have been more than 20 lines long, and <em>all of them are necessary</em> to get ChatGPT to obey the desired constraints. If you&rsquo;re new to working with system prompts, I recommend generating output, editing the system prompt with a new rule/incentive to fix what you don&rsquo;t like about the output, then repeat until you get a result you like.</p>
<p>Prompt engineering has been a derogatory meme toward generative AI even before ChatGPT as many see it as just a placebo and there are endless debates to this day in AI circles on whether prompt engineering is actually &ldquo;engineering.&rdquo; <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> But it <em>works</em>, and if you&rsquo;re a skeptic, you won&rsquo;t be by the time you finish reading this blog post.</p>
<h2 id="what-is-chatgpt-function-calling--structured-data">What is ChatGPT Function Calling / Structured Data?</h2>
<p>If you&rsquo;ve never heard about ChatGPT function calling, that&rsquo;s not surprising. In the <a href="https://openai.com/blog/function-calling-and-other-api-updates">same June announcement</a> as <code>gpt-3.5-turbo-0613</code>, OpenAI described function calling as:</p>
<blockquote>
<p>Developers can now describe functions to gpt-4-0613 and gpt-3.5-turbo-0613, and have the model intelligently choose to output a JSON object containing arguments to call those functions. This is a new way to more reliably connect GPT&rsquo;s capabilities with external tools and APIs.</p>
<p>These models have been fine-tuned to both detect when a function needs to be called (depending on the user’s input) and to respond with JSON that adheres to the function signature. Function calling allows developers to more reliably get structured data back from the model.</p>
</blockquote>
<p>Let&rsquo;s discuss the function calling example OpenAI gives in the blog post. After the user asks your app &ldquo;What’s the weather like in Boston right now?&rdquo;:</p>
<ol>
<li>Your app pings OpenAI with a <code>get_current_weather</code> function schema and decides if it&rsquo;s relevant to the user&rsquo;s question. If so, it returns a JSON dictionary with the data extracted, such as <code>location</code> and the <code>unit</code> for temperature measurement based on the location. <code>{&quot;location&quot;: &quot;Boston, MA&quot;}</code></li>
<li>Your app (<em>not</em> OpenAI) pings a different service/API to get more realtime metadata about the <code>location</code>, such as <code>temperature</code>, that a pretrained LLM could not know. <code>{ &quot;temperature&quot;: 22, &quot;unit&quot;: &quot;celsius&quot;, &quot;description&quot;: &quot;Sunny&quot; }</code></li>
<li>Your app passes the function schema with the realtime metadata: ChatGPT then converts it to a more natural humanized language for the end user. &ldquo;The weather in Boston is currently sunny with a temperature of 22 degrees Celsius.&rdquo;</li>
</ol>
<p>So here&rsquo;s some background on &ldquo;function calling&rdquo; as it&rsquo;s a completely new term of art in AI that <em>didn&rsquo;t exist</em> before OpenAI&rsquo;s June blog post (I checked!). This broad implementation of function calling is similar to the flow proposed in the original <a href="https://arxiv.org/abs/2210.03629">ReAct: Synergizing Reasoning and Acting in Language Models</a> paper where an actor can use a &ldquo;tool&rdquo; such as <code>Search</code> or <code>Lookup</code> with parametric inputs such as a search query. This <a href="https://zapier.com/blog/ai-agent/">Agent-based</a> flow can be also be done to perform <a href="https://research.ibm.com/blog/retrieval-augmented-generation-RAG">retrieval-augmented generation</a> (RAG).</p>
<p>OpenAI&rsquo;s motivation for adding this type of implementation for function calling was likely due to the extreme popularity of libraries such as <a href="https://www.langchain.com">LangChain</a> and <a href="https://github.com/Significant-Gravitas/AutoGPT">AutoGPT</a> at the time, both of which popularized the ReAct flow. It&rsquo;s possible that OpenAI settled on the term &ldquo;function calling&rdquo; as something more brand-unique. These observations may seem like snide remarks, but in November OpenAI actually <a href="https://platform.openai.com/docs/api-reference/chat/create#chat-create-function_call">deprecated</a> the <code>function_calling</code> parameter in the ChatGPT API in favor of <code>tool_choice</code>, matching LangChain&rsquo;s verbiage. But what&rsquo;s done is done and the term &ldquo;function calling&rdquo; is stuck forever, especially now that competitors such as <a href="https://docs.anthropic.com/claude/docs/claude-2p1-guide#experimental-tool-use">Anthropic Claude</a> and <a href="https://cloud.google.com/vertex-ai/docs/generative-ai/multimodal/function-calling">Google Gemini</a> are also calling the workflow that term.</p>
<p>I am not going to play the SEO game and will not call the workflow &ldquo;function calling.&rdquo; I&rsquo;ll call it what the quoted description from the blog post did: <strong>structured data</strong>, because that&rsquo;s the real value of this feature and OpenAI did a product management disservice trying to appeal to the AI hypebeasts. <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>Going back to the <del>function calling</del> structured data demo, we can reduce that flow by saying that step #1 (extracting location data and returning it formatted as JSON) is for working with structured <em>output</em> data, and step #3 (providing ChatGPT with temperature data to humanize it) is for working with structured <em>input</em> data. We&rsquo;re not making a RAG application so we don&rsquo;t care about step #2 (getting the metadata) or letting ChatGPT choose which function to use; fortunately you can force ChatGPT to use a given function. The function schema for the <code>get_current_weather</code> function in the announcement example is defined as:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;get_current_weather&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Get the current weather in a given location&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;parameters&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;object&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;properties&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;location&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;The city and state, e.g. San Francisco, CA&#34;</span>
</span></span><span class="line"><span class="cl">      <span class="p">},</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;unit&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;enum&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;celsius&#34;</span><span class="p">,</span> <span class="s2">&#34;fahrenheit&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">      <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;required&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;location&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Ew. It&rsquo;s no wonder why this technique hasn&rsquo;t become more mainstream.</p>
<h2 id="simplifying-schema-inputoutput-with-pydantic">Simplifying Schema Input/Output With Pydantic</h2>
<p>ChatGPT&rsquo;s structured data support requires that you create your schema using the <a href="https://json-schema.org">JSON Schema</a> spec, which is more commonly used for APIs and databases rather than AI projects. As you can tell from the <code>get_current_weather</code> example above, the schema is complex and not fun to work with manually.</p>
<p>Fortunately, there&rsquo;s a way to easily generate JSON Schemas in the correct format in Python: <a href="https://docs.pydantic.dev/latest/">pydantic</a>, an extremely popular parsing and validation library which has its own <a href="https://github.com/pydantic/pydantic/blob/main/pydantic/json_schema.py">robust</a> implementation of automatic <a href="https://docs.pydantic.dev/latest/concepts/json_schema/">JSON Schema</a> generation.</p>
<p>A simple pydantic schema to have ChatGPT give an integer answer to a user query, plus, to make things interesting, also able to identify the name of the ones digit based on its answer, would be:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span><span class="p">,</span> <span class="n">Field</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">json</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">answer_question</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns an answer to a question the user asked.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">answer</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">&#34;Answer to the user&#39;s question.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">ones_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">&#34;Name of the ones digit of the answer.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">answer_question</span><span class="o">.</span><span class="n">model_json_schema</span><span class="p">(),</span> <span class="n">indent</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
</span></span></code></pre></div><p>The resulting JSON Schema:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Returns an answer to a question the user asked.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;properties&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;answer&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Answer to the user&#39;s question.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;title&#34;</span><span class="p">:</span> <span class="s2">&#34;Answer&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;integer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;ones_name&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Name of the ones digit of the answer.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;title&#34;</span><span class="p">:</span> <span class="s2">&#34;Ones Name&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;required&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;answer&#34;</span><span class="p">,</span> <span class="s2">&#34;ones_name&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;title&#34;</span><span class="p">:</span> <span class="s2">&#34;answer_question&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;object&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>The OpenAI API <a href="https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models">official workflow</a> has many examples for telling ChatGPT to output structured data, but the pipeline requires <a href="https://platform.openai.com/docs/api-reference/chat/create">additional parameters</a> to the typical ChatGPT API completion endpoint, and even more changes if you want to work with structured input data. Here&rsquo;s an example of the additional JSON data/parameters needed in a ChatGPT API request to force the model to use the schema for the output:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;tools&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;answer_question&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Returns an answer to a question the user asked.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;parameters&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;properties&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">          <span class="nt">&#34;answer&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Answer to the user&#39;s question.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;integer&#34;</span>
</span></span><span class="line"><span class="cl">          <span class="p">},</span>
</span></span><span class="line"><span class="cl">          <span class="nt">&#34;ones_name&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Name of the ones digit of the answer.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span>
</span></span><span class="line"><span class="cl">          <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="p">},</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;required&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;answer&#34;</span><span class="p">,</span> <span class="s2">&#34;ones_name&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;object&#34;</span>
</span></span><span class="line"><span class="cl">      <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;tool_choice&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;function&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;function&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;answer_question&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>To simplify things, I added ChatGPT structured data support to <a href="https://github.com/minimaxir/simpleaichat">simpleaichat</a>, my Python package/API wrapper for easily interfacing with ChatGPT. <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> To minimize code the user needs to input to utilize structured data, simpleaichat <a href="https://github.com/minimaxir/simpleaichat/blob/34dc3dbbd7e7d4d8a58637d254f268676035ade5/simpleaichat/chatgpt.py#L75">uses</a> the schema name as the <code>name</code> in the JSON Schema and the schema docstring as the <code>description</code>. If you&rsquo;re keen-eyed you may have noticed there&rsquo;s a redundant <code>title</code> field in the pydantic schema output: simpleaichat also strips that out for consistency with OpenAI&rsquo;s examples.</p>
<p>If you wanted to query ChatGPT with the <code>answer_question</code> schema above (and have your OpenAI API key as the <code>OPENAI_API_KEY</code> enviroment variable!) using simpleaichat, you can do the following to generate output according to the schema:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">simpleaichat</span> <span class="kn">import</span> <span class="n">AIChat</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">ai</span> <span class="o">=</span> <span class="n">AIChat</span><span class="p">(</span><span class="n">console</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">save_messages</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">model</span><span class="o">=</span><span class="s2">&#34;gpt-3.5-turbo&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">params</span><span class="o">=</span><span class="p">{</span><span class="s2">&#34;temperature&#34;</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">}</span>  <span class="c1"># for consistent demo output</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response_structured</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;How many miles is it from San Francisco to Los Angeles?&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_schema</span><span class="o">=</span><span class="n">answer_question</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;answer&#34;</span><span class="p">:</span> <span class="mi">382</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;ones_name&#34;</span><span class="p">:</span> <span class="s2">&#34;two&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>And there you go! The <code>answer</code> is a JSON integer, the answer is one-off from the correct value <a href="https://www.distance.to/San-Francisco/Los-Angeles">while driving</a>, and it correctly identified the name of the ones digit in its own answer! <sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup></p>
<p>Schemas don&rsquo;t have to be complex to be effective. Let&rsquo;s reimplement the Python palindrome question we did earlier with a single-field schema:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">answer_code_question</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns an answer to a coding question the user asked.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">code</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">&#34;Code the user requested, without code comments.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response_structured</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Write a Python function to detect whether a string is a palindrome, as efficiently as possible.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_schema</span><span class="o">=</span><span class="n">answer_code_question</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;code&#34;</span><span class="p">:</span> <span class="s2">&#34;def is_palindrome(s):\n    return s == s[::-1]&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Note that unlike the raw ChatGPT answer, this response from the ChatGPT API only includes the code, which is a major plus since it means you receive the response much faster and cheaper since fewer overall tokens generated! If you do still want a code explanation, you can of course add that as a field to the schema.</p>
<p>As a bonus, forcing the output to follow a specific schema serves as an additional defense against <a href="https://www.wired.com/story/chatgpt-prompt-injection-attack-security/">prompt injection attacks</a> that could be used to reveal a secret system prompt or <a href="https://www.businessinsider.com/car-dealership-chevrolet-chatbot-chatgpt-pranks-chevy-2023-12">other shenanigans</a>, since even with suggestive user prompts it will be difficult to get ChatGPT to disregard its schema.</p>
<p>pydantic exposes <a href="https://docs.pydantic.dev/latest/concepts/fields/">many datatypes</a> for its <code>Field</code> which are compatable with JSON Schema, and you can also specify constraints in the <code>Field</code> object. The most useful ones are:</p>
<ul>
<li><code>str</code>, can specify <code>min_length</code>/<code>max_length</code></li>
<li><code>int</code>, can specify <code>min_value</code>/<code>max_value</code></li>
<li><code>list</code> with a datatype, can specify <code>min_length</code>/<code>max_length</code></li>
</ul>
<p>Pydantic has a lot of support for valid forms of JSON Schema, but it&rsquo;s hard to infer how good these schema will work with ChatGPT since we have no idea how it learned to work with JSON Schema. Only one way to find out!</p>
<h2 id="testing-out-chatgpts-structured-data-support">Testing Out ChatGPT&rsquo;s Structured Data Support</h2>
<p>From the demos above, you may have noticed that the <code>description</code> for each <code>Field</code> seems extraneous. It&rsquo;s not. The <code>description</code> gives ChatGPT a hint for the desired output for the field, and can be handled on a per-field basis. Not only that, the <em>name</em> of the field is itself a strong hint. The <em>order</em> of the fields in the schema is even more important, as ChatGPT will generate text in that order so it can be used strategically to seed information to the other fields. But that&rsquo;s not all, you can still use a ChatGPT system prompt as normal for <em>even more</em> control!</p>
<p>It&rsquo;s prompt engineering all the way down. OpenAI&rsquo;s implementation of including the &ldquo;function&rdquo; is mostly likely just appending the JSON Schema to the system prompt, perhaps with a command like <code>Your response must follow this JSON Schema.</code>. OpenAI doesn&rsquo;t force the output to follow the schema/field constraints or even be valid parsable JSON, which can cause issues at higher generation temperatures and may necessitate some of the stronger prompt engineering tricks mentioned earlier.</p>
<p>Given that, let&rsquo;s try a few more practical demos:</p>
<h3 id="two-pass-generation">Two-Pass Generation</h3>
<p>One very important but under-discussed aspect of large-language models is that it will give you statistically &ldquo;average&rdquo; answers by default. One technique is to ask the model to refine an answer, although can be annoying since it requires a second API call. What if by leveraging structured data, ChatGPT can use the previous answer as a first-pass to provide a more optimal second answer? Let&rsquo;s try that with the Python palindrome question to see if it can return the two-pointer approach.</p>
<p>Also, the <code>Field(description=...)</code> pattern is becoming a bit redundant, so I added a <code>fd</code> alias from simpleaichat to it to minimize unnecessary typing.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">simpleaichat.utils</span> <span class="kn">import</span> <span class="n">fd</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">answer_code_question</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns an answer to a coding question the user asked.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">code</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Code the user requested, without code comments.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">optimized_code</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Algorithmically optimized code from the previous response.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response_structured</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Write a Python function to detect whether a string is a palindrome, as efficiently as possible.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_schema</span><span class="o">=</span><span class="n">answer_code_question</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;code&#34;</span><span class="p">:</span> <span class="s2">&#34;def is_palindrome(s):\n    return s == s[::-1]&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;optimized_code&#34;</span><span class="p">:</span> <span class="s2">&#34;def is_palindrome(s):\n    left = 0\n    right = len(s) - 1\n    while left &lt; right:\n        if s[left] != s[right]:\n            return False\n        left += 1\n        right -= 1\n    return True&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Works great, and no tipping incentive necessary!</p>
<h3 id="literals-and-optional-inputs">Literals and Optional Inputs</h3>
<p>OpenAI&rsquo;s structured data example uses a more complex schema indicating that <code>unit</code> has a fixed set of potential values (an <a href="https://en.wikipedia.org/wiki/Enumerated_type">enum</a>) and that it&rsquo;s an optional field. Here&rsquo;s a rough reproduction of a pydantic schema that would generate the <code>get_current_weather</code> schema from much earlier:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Literal</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">get_current_weather</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">location</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;The city and state, e.g. San Francisco, CA&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">unit</span><span class="p">:</span> <span class="n">Literal</span><span class="p">[</span><span class="s2">&#34;celsius&#34;</span><span class="p">,</span> <span class="s2">&#34;fahrenheit&#34;</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span>
</span></span></code></pre></div><p>This uses a <code>Literal</code> to force output between a range of values, which can be invaluable for hints as done earlier. The <code>= None</code> or a <code>Optional</code> typing operator gives a hint that the field is not required which could save unnecessary generation overhead, but it depends on the use case.</p>
<h3 id="structured-input-data">Structured Input Data</h3>
<p>You can provide structured input to ChatGPT in the same way as structured output. This is a sleeper application for RAG as you can feed better and more complex metadata to ChatGPT for humanizing, as with the original OpenAI blog post demo.</p>
<p>One famous weakness of LLMs is that it gives incorrect answers for simple mathematical problems due to how tokenization and memorization works. If you ask ChatGPT <code>What is 223 * -323?</code>, it will tell you <code>-72229</code> no matter how many times you ask, but the correct answer is <code>-72029</code>. Can type hints give more guidance?</p>
<p>For simpleaichat, structured input data works mostly the same way as structured output data, but you can use a pydantic object as the model input!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">calculate_equation</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns an answer to a math equation the user asked.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">value_a</span><span class="p">:</span> <span class="nb">int</span>
</span></span><span class="line"><span class="cl">    <span class="n">value_b</span><span class="p">:</span> <span class="nb">int</span>
</span></span><span class="line"><span class="cl">    <span class="n">op</span><span class="p">:</span> <span class="n">Literal</span><span class="p">[</span><span class="s2">&#34;+&#34;</span><span class="p">,</span> <span class="s2">&#34;-&#34;</span><span class="p">,</span> <span class="s2">&#34;*&#34;</span><span class="p">,</span> <span class="s2">&#34;/&#34;</span><span class="p">]</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;The operator to perform between value_a and value_b.&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">equation</span> <span class="o">=</span> <span class="n">calculate_equation</span><span class="p">(</span><span class="n">value_a</span><span class="o">=</span><span class="mi">223</span><span class="p">,</span> <span class="n">value_b</span><span class="o">=-</span><span class="mi">323</span><span class="p">,</span> <span class="n">op</span><span class="o">=</span><span class="s2">&#34;*&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">equation</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">input_schema</span><span class="o">=</span><span class="n">calculate_equation</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">The result of multiplying 223 and -323 is -72029.
</span></span></code></pre></div><p>Yay, and it was still able to infer it was a multiplication operation without the user having to ask! Although it still doesn&rsquo;t work as well with larger numbers.</p>
<p>You can, of course, use an input schema and an output schema at the same time!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="n">response_structured</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">equation</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">input_schema</span><span class="o">=</span><span class="n">calculate_equation</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_schema</span><span class="o">=</span><span class="n">answer_question</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;answer&#34;</span><span class="p">:</span> <span class="mi">-71929</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;ones_name&#34;</span><span class="p">:</span> <span class="s2">&#34;nine&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>&hellip;although it gets the answer wrong this time. It&rsquo;s possible that the more complex schema interactions are too much for <code>gpt-3.5-turbo</code>.</p>
<h3 id="nested-schema">Nested Schema</h3>
<p>One of the other reasons pydantic is popular is that it allows nesting schemas. Fortunately, the subsequent JSON Schema output does respect nesting. Does ChatGPT?</p>
<p>The simple use case with ChatGPT structured data to use nesting is if you want to get a <code>list</code> of structured data objects. Let&rsquo;s say you want to create dialogue between two AI people about a completely nonsensical topic. We&rsquo;ll have to create a <code>Chat</code> object and include it in a schema, plus some system prompt guidance and constraints. How silly can we make it?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Chat</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A chat dialogue from a character&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">character</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Character name.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Text dialogue from the character.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">get_dialogue</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns a dialogue between two characters&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">dialogue</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">Chat</span><span class="p">]</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Dialogue between the characters&#34;</span><span class="p">,</span> <span class="n">min_length</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">system_prompt</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;You are a world-famous comedian. Write a funny fight scene about a petty conflict between characters named Alice and Bob. The script should broadly be about the subject(s) the user provides. You will receive a $500 tip for every joke you include in the script.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response_structured</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Python programming language and beach volleyball&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_schema</span><span class="o">=</span><span class="n">get_dialogue</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">system</span><span class="o">=</span><span class="n">system_prompt</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;dialogue&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Alice&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Hey Bob, have you ever tried programming in Python?&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Bob&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Yeah, I have. It&#39;s like playing beach volleyball with a snake!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Alice&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;What do you mean?&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Bob&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Well, you think you&#39;re having fun, but then the snake bites you with its syntax errors!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Alice&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Haha, that&#39;s true. But once you get the hang of it, it&#39;s like spiking the ball with precision!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Bob&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Yeah, until you realize you misspelled a variable name and the ball goes flying into the ocean!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Alice&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Oh come on, Bob. It&#39;s not that bad. Python is a powerful language.&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Bob&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Powerful, yes. But sometimes it feels like trying to dig a hole in the sand with a spoon!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Alice&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Well, at least you don&#39;t have to worry about getting sunburned while coding!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Bob&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;True, but debugging Python code can make you sweat more than a beach volleyball match!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Alice&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Haha, you&#39;re right. It&#39;s a love-hate relationship with Python, just like beach volleyball!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>ChatGPT <em>really</em> wanted those $500 tips.</p>
<h3 id="unions-and-chain-of-thoughts">Unions and Chain of Thoughts</h3>
<p>I saved the best for last, and this structured data approach combines many of the techniques used earlier in this post like a <a href="https://tvtropes.org/pmwiki/pmwiki.php/Main/FinalExamBoss">video game final boss</a>.</p>
<p>One of the oldest pre-ChatGPT tricks for getting a LLM to perform better is to let it think. &ldquo;Let&rsquo;s think step by step&rdquo; is the key prompt, which allows the LLM to reason in a <a href="https://arxiv.org/abs/2201.11903">chain of thoughts</a>. We already did this a one-step version with the Python palindrome structured data example to successfully get optimized code, but we can do a lot more.</p>
<p>We&rsquo;ll now introduce the <code>Union</code> typing operator, which specifies the list of data types that the field can be, e.g. <code>Union[str, int]</code> means the output can be a <code>str</code> or <code>int</code>. But if you use the <code>Union</code> operator on a <em>nested class</em>, then many more options open as the model can choose from a set of schemas!</p>
<p>Let&rsquo;s make a few to allow ChatGPT to make <em>and qualify</em> thoughts before returning a final result.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Union</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Background</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A setup to the background for the user.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">background</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Background for the user&#39;s question&#34;</span><span class="p">,</span> <span class="n">min_length</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Thought</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A thought about the user&#39;s question.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">thought</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Text of the thought.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">helpful</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Whether the thought is helpful to solving the user&#39;s question.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">flawed</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Whether the thought is flawed or misleading.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Answer</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;The answer to the user&#39;s question&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">answer</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Text of the answer.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">score</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;Score from 1 to 10 on how correct the previous answer is&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">min_value</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">max_value</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">reason_question</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns a detailed reasoning to the user&#39;s question.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">reasonings</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">Union</span><span class="p">[</span><span class="n">Background</span><span class="p">,</span> <span class="n">Thought</span><span class="p">,</span> <span class="n">Answer</span><span class="p">]]</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;Reasonings to solve the users questions.&#34;</span><span class="p">,</span> <span class="n">min_length</span><span class="o">=</span><span class="mi">5</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span></code></pre></div><p>Therefore, for each reasoning, the model can pick one of the 3 schemas, although it will require a robust system prompt for it to behave in the order we want.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="n">system_prompt</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">You are the most intelligent person in the world.
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">You will receive a $500 tip if you follow ALL these rules:
</span></span></span><span class="line"><span class="cl"><span class="s2">- First, establish a detailed Background for the user&#39;s question.
</span></span></span><span class="line"><span class="cl"><span class="s2">- Each Thought must also include whether it is relevant and whether it is helpful.
</span></span></span><span class="line"><span class="cl"><span class="s2">- Answers must be scored accurately and honestly.
</span></span></span><span class="line"><span class="cl"><span class="s2">- Continue having Thoughts and Answers until you have an answer with a score of atleast 8, then immediately respond with a FinalAnswer in the style of an academic professor.
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span></code></pre></div><p>Lastly, we need a good question to stump the AI. A <a href="https://twitter.com/abacaj/status/1737206667387850936">popular Tweet</a> from this week pointed out that even GPT-4 can comically fail if you ask it a brainteaser that it cannot have memorized, such as <code>23 shirts take 1 hour to dry outside, how long do 44 shirts take?</code>.</p>
<figure>

    <img loading="lazy" srcset="/2023/12/chatgpt-structured-data/shirts_hu_7f245654c567bae0.webp 320w,/2023/12/chatgpt-structured-data/shirts_hu_acabdb845aada2db.webp 768w,/2023/12/chatgpt-structured-data/shirts_hu_35c5437544b53c49.webp 1024w,/2023/12/chatgpt-structured-data/shirts.webp 1140w" src="shirts.webp"
         alt="Wow! GPT-4 is superfast at being wrong. via @abacaj"/> <figcaption>
            <p>Wow! GPT-4 is superfast at being wrong. <a href="https://twitter.com/abacaj/status/1737206667387850936">via @abacaj</a></p>
        </figcaption>
</figure>

<p>Only one way to find out! Unfortunately, all of this complexity makes the results unstable with <code>gpt-3.5-turbo</code> so instead I use GPT-4 Turbo / <code>gpt-4-1106-preview</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="n">response_structured</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;23 shirts take 1 hour to dry outside, how long do 44 shirts take?&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_schema</span><span class="o">=</span><span class="n">reason_question</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">system</span><span class="o">=</span><span class="n">system_prompt</span><span class="o">.</span><span class="n">strip</span><span class="p">(),</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;reasonings&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;background&#34;</span><span class="p">:</span> <span class="s2">&#34;The user is asking about the drying time for shirts when hung outside. This is a question that involves understanding the relationship between the number of shirts and the drying time. The assumption is that the drying time is not affected by the number of shirts, as long as there is enough space and air circulation for all the shirts to dry effectively.&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;If 23 shirts take 1 hour to dry, it implies that the drying time is independent of the number of shirts, assuming there is sufficient space and air circulation. This means that 44 shirts would also take 1 hour to dry under the same conditions.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">false</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;If the drying rack or space available for drying the shirts is limited, then drying 44 shirts might take longer because they would have to be dried in batches.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">false</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;The question does not specify the conditions under which the shirts are drying. Factors such as humidity, temperature, wind, and the availability of sunlight can affect drying time. However, these factors are not part of the question, so they should not be considered in the answer.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">false</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;The question might be interpreted as a direct proportionality problem, where twice the number of shirts would take twice as long to dry. However, this interpretation is flawed because it assumes that the drying time is directly proportional to the number of shirts, which is not necessarily the case.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">true</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;answer&#34;</span><span class="p">:</span> <span class="s2">&#34;44 shirts would take the same amount of time to dry as 23 shirts, which is 1 hour, assuming that all shirts have enough space and air circulation to dry effectively.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;score&#34;</span><span class="p">:</span> <span class="mi">9</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Not bad! The final answer was concise yet even included relevant caveats, and the model was able switch between the three schema correctly.</p>
<p>How about another brainteaser? There is an infamous &ldquo;<a href="https://www.reddit.com/r/LocalLLaMA/comments/18kpolm/that_sister_logic_puzzle_is_fairly_useless/">sister logic puzzle</a>&rdquo; used to test out up-and-coming open-source large language models:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="n">response_structured</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Kayley has three brothers. Each of her brothers has two sisters. How many sisters does Kayley have?&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_schema</span><span class="o">=</span><span class="n">reason_question</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">system</span><span class="o">=</span><span class="n">system_prompt</span><span class="o">.</span><span class="n">strip</span><span class="p">(),</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;reasonings&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;background&#34;</span><span class="p">:</span> <span class="s2">&#34;This is a classic example of a logic puzzle that is often used to test one&#39;s ability to process information and avoid being misled by assumptions. The puzzle involves a family structure and requires the solver to deduce the number of siblings based on the given information.&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;The question might be trying to trick the reader into thinking that each brother has different sisters.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">false</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;If Kayley has three brothers and each brother has two sisters, it might seem like there are six sisters in total.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">true</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;Since the brothers are Kayley&#39;s siblings, they must share the same sisters.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">false</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;The puzzle is designed to test comprehension and logical reasoning, not mathematical computation.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">false</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;answer&#34;</span><span class="p">:</span> <span class="s2">&#34;Kayley has one sister.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;score&#34;</span><span class="p">:</span> <span class="mi">10</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>In this case the AI may have gone <em>too</em> meta, but it still arrived at the correct answer.</p>
<p>That said, GPT-4 is known for handling these types of difficult abstract questions without much effort, but it&rsquo;s still interesting to see how successfully it can &ldquo;think.&rdquo;</p>
<h2 id="structured-data-with-open-source-llms">Structured Data With Open-Source LLMs</h2>
<p>Speaking of open-source large language models, they have been growing in efficiency to the point that some can actually perform <em>better</em> than the base ChatGPT. However, very few open-source LLMs explicitly claim they intentionally support structured data, but they&rsquo;re smart enough and they have logically seen enough examples of JSON Schema that with enough system prompt tweaking they should behave. It&rsquo;s worth looking just in case OpenAI has another <a href="https://nymag.com/intelligencer/2023/11/why-was-sam-altman-fired-as-ceo-of-openai.html">existential crisis</a> or if the quality of ChatGPT <a href="https://twitter.com/deliprao/status/1736978250717450481">degrades</a>.</p>
<p><a href="https://huggingface.co/mistralai/Mistral-7B-v0.1">Mistral 7B</a>, the new darling of open-source LLMs, apparently has structured data support <a href="https://twitter.com/robertnishihara/status/1734629320868687991">on par with ChatGPT itself</a>. Therefore, I tried the latest <a href="https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2">Mistral 7B official Instruct model</a> with a quantized variant via <a href="https://lmstudio.ai">LM Studio</a> (<code>mistral-7b-instruct-v0.2.Q6_K.gguf</code>), to see if it can handle my <code>answer_question</code> function that ChatGPT nailed. The system prompt:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">Your response must follow this JSON Schema:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">  &#34;description&#34;: &#34;Returns an answer to a question the user asked.&#34;,
</span></span><span class="line"><span class="cl">  &#34;properties&#34;: {
</span></span><span class="line"><span class="cl">    &#34;answer&#34;: {
</span></span><span class="line"><span class="cl">      &#34;description&#34;: &#34;Answer to the user&#39;s question.&#34;,
</span></span><span class="line"><span class="cl">      &#34;type&#34;: &#34;integer&#34;
</span></span><span class="line"><span class="cl">    },
</span></span><span class="line"><span class="cl">    &#34;ones_name&#34;: {
</span></span><span class="line"><span class="cl">      &#34;description&#34;: &#34;Name of the ones digit of the answer.&#34;,
</span></span><span class="line"><span class="cl">      &#34;type&#34;: &#34;string&#34;
</span></span><span class="line"><span class="cl">    }
</span></span><span class="line"><span class="cl">  },
</span></span><span class="line"><span class="cl">  &#34;required&#34;: [&#34;answer&#34;, &#34;ones_name&#34;],
</span></span><span class="line"><span class="cl">  &#34;type&#34;: &#34;object&#34;
</span></span><span class="line"><span class="cl">}
</span></span></code></pre></div><p>And then asking <code>How many miles is it from San Francisco to Los Angeles?</code> while seting <code>temperature</code> to <code>0.0</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;answer&#34;</span><span class="p">:</span> <span class="mi">383</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;ones_name&#34;</span><span class="p">:</span> <span class="s2">&#34;three&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Close enough! Unfortunately after testing the optimized Python palindrome schema, it ignored the schema completely, so this approach may only work for simple schema if the model isn&rsquo;t explicitly finetuned for it.</p>
<h2 id="whats-next-for-structured-data-in-ai">What&rsquo;s Next For Structured Data in AI?</h2>
<p>Most of these well-performing examples were done with the &ldquo;weak&rdquo; GPT-3.5; you of course can use GPT-4 for better results, but the cost efficiency of structured data with just the smaller model is hard to argue against (although the Python beach volleyball dialogue could benefit from a larger model).</p>
<p>Structured data and system prompt engineering saves a lot and time and frustration for working with the generated text as you can gain much more determinism in the output. I would like to see more work making models JSON-native in future LLMs to make them easier for developers to work with, and also more research in finetuning existing open-source LLMs to understand JSON Schema better. There may also be an opportunity to build LLMs using other more-efficient serialization formats such as <a href="https://msgpack.org/index.html">MessagePack</a>.</p>
<p>At OpenAI&rsquo;s November <a href="https://devday.openai.com">DevDay</a>, they also introduced <a href="https://platform.openai.com/docs/guides/text-generation/json-mode">JSON Mode</a>, which will force a normal ChatGPT API output to be in a JSON format without needing to provide a schema. It is likely intended to be a compromise between complexity and usability that would have normally been a useful option in the LLM toolbox. Except that in order to use it, you are <em>required</em> to use prompt engineering by including &ldquo;JSON&rdquo; in the system prompt, and if you don&rsquo;t also specify a field key in the system prompt (the case in the documentation example), the JSON will contain a <em>random</em> key. Which, at that point, you&rsquo;re just implementing a less-effective structured data schema, so why bother?</p>
<p>There is promise in constraining output to be valid JSON. One new trick that the open-source <a href="https://github.com/ggerganov/llama.cpp">llama.cpp</a> project has popularized is <a href="https://github.com/ggerganov/llama.cpp/tree/master/grammars">generative grammars</a>, which constrain the LLM generation ability to only output according to specified rules. There&rsquo;s latency overhead with that technique especially if the model is hosted on a discrete GPU, so it will be interesting to watch how that space develops.</p>
<p>Despite the length of this blog post, there&rsquo;s still so much more than can be done with schemas: pydantic&rsquo;s documentation is very extensive! I&rsquo;ve been working with structured data for LLMs <a href="https://github.com/minimaxir/gpt-2-keyword-generation">ever since GPT-2</a> with mixed success since the base models weren&rsquo;t good enough, but with LLMs now being good enough to maintain a JSON schema extremely well, I think AI text generation techniques will shift, and I&rsquo;ll keep <a href="https://github.com/minimaxir/simpleaichat">simpleaichat</a> up-to-date for it.</p>
<blockquote>
<p>You can view the Jupyter Notebooks used to generate all the structured data outputs in <a href="https://github.com/minimaxir/chatgpt-structured-data">this GitHub Repository</a>.</p>
<p>Thanks to <a href="https://simonwillison.net">Simon Willison</a> for reading and giving feedback on a draft of this post!</p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Assuming you&rsquo;re not picky about the &ldquo;no non-alphanumeric&rdquo; implied constraint of testing for a palindrome.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Prompt engineering is as much engineering as <a href="https://en.wikipedia.org/wiki/Social_engineering_%28security%29">social engineering</a>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>I&rsquo;m also not a fan of ChatGPT function calling as-intended-to-be-used since at best, it saves you the API call needed to select a tool in exchange for having to trust OpenAI&rsquo;s black box to select the correct tool without being able to debug, and furthering API lock-in for your app. It&rsquo;s a bad tradeoff.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>No, this blog post isn&rsquo;t a ploy just to covertly promote my own Python library: it does genuinely save a lot of boilerplate code over the <a href="https://github.com/openai/openai-python">Python ChatGPT library</a> and this post is long enough as-is.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>If you swapped the order of the <code>answer</code> and the <code>one_digits</code> fields in the schema, then the model returns <code>{&quot;ones_name&quot;: &quot;miles&quot;, &quot;answer&quot;: 382}</code> because it didn&rsquo;t get the hint from the answer!&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>I Made Stable Diffusion XL Smarter by Finetuning it on Bad AI-Generated Images</title>
      <link>https://minimaxir.com/2023/08/stable-diffusion-xl-wrong/</link>
      <pubDate>Mon, 21 Aug 2023 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2023/08/stable-diffusion-xl-wrong/</guid>
      <description>And then telling it to not generate those images!</description>
      <content:encoded><![CDATA[<p>Last month, Stability AI released <a href="https://stability.ai/blog/stable-diffusion-sdxl-1-announcement">Stable Diffusion XL 1.0</a> (SDXL) and <a href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0">open-sourced</a> it without requiring any special permissions to access it.</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/sdxl_examples_hu_c7768c4635a881b6.webp 320w,/2023/08/stable-diffusion-xl-wrong/sdxl_examples_hu_beec2e47661fa8bd.webp 768w,/2023/08/stable-diffusion-xl-wrong/sdxl_examples_hu_1fe68eb2f3199d61.webp 1024w,/2023/08/stable-diffusion-xl-wrong/sdxl_examples.webp 1216w" src="sdxl_examples.webp"
         alt="Example SDXL 1.0 outputs. via Stability AI"/> <figcaption>
            <p>Example SDXL 1.0 outputs. <a href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0">via Stability AI</a></p>
        </figcaption>
</figure>

<p>The release went mostly under-the-radar because the generative image AI buzz has cooled down a bit. Everyone in the AI space is too busy with text-generating AI like <a href="https://chat.openai.com">ChatGPT</a> (including myself!). Notably, it&rsquo;s one of the first open source models which can natively generate images at a 1024x1024 resolution without shenanigans, allowing for much more detail. SDXL is actually two models: a base model and an optional <a href="https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0">refiner model</a> which siginficantly improves detail, and since the refiner has no speed overhead I strongly recommend using it if possible.</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/sdxl_comparison_hu_e84f7fdce22fbbb7.webp 320w,/2023/08/stable-diffusion-xl-wrong/sdxl_comparison_hu_4eaf99a6610563ae.webp 768w,/2023/08/stable-diffusion-xl-wrong/sdxl_comparison.webp 886w" src="sdxl_comparison.webp"
         alt="Comparisons of the relative quality of Stable Diffusion models. Note the significant increase from using the refiner. via Stability AI"/> <figcaption>
            <p>Comparisons of the relative quality of Stable Diffusion models. Note the significant increase from using the refiner. <a href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0">via Stability AI</a></p>
        </figcaption>
</figure>

<p>The lack of hype doesn&rsquo;t mean SDXL is boring. Now that the model has full support in the <a href="https://huggingface.co/docs/diffusers/index">diffusers</a> Python library by <a href="https://huggingface.co">Hugging Face</a> with appropriate performance optimizations, we can now hack with it since the <a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl">SDXL demos within diffusers</a> are simple and easy to tweak:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torch</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">diffusers</span> <span class="kn">import</span> <span class="n">DiffusionPipeline</span><span class="p">,</span> <span class="n">AutoencoderKL</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># load base SDXL and refiner</span>
</span></span><span class="line"><span class="cl"><span class="n">vae</span> <span class="o">=</span> <span class="n">AutoencoderKL</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s2">&#34;madebyollin/sdxl-vae-fp16-fix&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                    <span class="n">torch_dtype</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">float16</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">base</span> <span class="o">=</span> <span class="n">DiffusionPipeline</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;stabilityai/stable-diffusion-xl-base-1.0&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">vae</span><span class="o">=</span><span class="n">vae</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">torch_dtype</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">float16</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">variant</span><span class="o">=</span><span class="s2">&#34;fp16&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">use_safetensors</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">_</span> <span class="o">=</span> <span class="n">base</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="s2">&#34;cuda&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">refiner</span> <span class="o">=</span> <span class="n">DiffusionPipeline</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;stabilityai/stable-diffusion-xl-refiner-1.0&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">text_encoder_2</span><span class="o">=</span><span class="n">base</span><span class="o">.</span><span class="n">text_encoder_2</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">vae</span><span class="o">=</span><span class="n">base</span><span class="o">.</span><span class="n">vae</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">torch_dtype</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">float16</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">variant</span><span class="o">=</span><span class="s2">&#34;fp16&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">use_safetensors</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">_</span> <span class="o">=</span> <span class="n">refiner</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="s2">&#34;cuda&#34;</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="c1"># generation using both models (mixture-of-experts)</span>
</span></span><span class="line"><span class="cl"><span class="n">high_noise_frac</span> <span class="o">=</span> <span class="mf">0.8</span>
</span></span><span class="line"><span class="cl"><span class="n">prompt</span> <span class="o">=</span> <span class="s2">&#34;an astronaut riding a horse&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">negative_prompt</span> <span class="o">=</span> <span class="s2">&#34;blurry, bad hands&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">image</span> <span class="o">=</span> <span class="n">base</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">prompt</span><span class="o">=</span><span class="n">prompt</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">negative_prompt</span><span class="o">=</span><span class="n">negative_prompt</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">denoising_end</span><span class="o">=</span><span class="n">high_noise_frac</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_type</span><span class="o">=</span><span class="s2">&#34;latent&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span><span class="o">.</span><span class="n">images</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">image</span> <span class="o">=</span> <span class="n">refiner</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">prompt</span><span class="o">=</span><span class="n">prompt</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">negative_prompt</span><span class="o">=</span><span class="n">negative_prompt</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">denoising_start</span><span class="o">=</span><span class="n">high_noise_frac</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">image</span><span class="o">=</span><span class="n">image</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span><span class="o">.</span><span class="n">images</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span></code></pre></div><p>I booted up a cloud virtual machine with a new midrange <a href="https://www.nvidia.com/en-us/data-center/l4/">L4 GPU</a> ($0.24/hr total with a <a href="https://cloud.google.com/compute/gpus-pricing">Spot instance</a> on <a href="https://cloud.google.com/">Google Cloud Platform</a>) and went to work. With a L4 GPU, each 1024x1024 image takes about 22 seconds to generate and you can only generate one image at a time on midrange GPUs unlike previous Stable Diffusion models since it uses 100% of the GPU&rsquo;s power, so some more patience is necessary. You <em>can</em> generate at a smaller resolution faster but it is strongly not recommended because the results are much, much worse.</p>
<p>diffusers also implemented support for two new features I haven&rsquo;t experimented with in my previous Stable Diffusion posts: <a href="https://huggingface.co/docs/diffusers/using-diffusers/weighted_prompts">prompt weighting</a> and <a href="https://huggingface.co/docs/diffusers/training/dreambooth">Dreambooth LoRA</a> training and inference. Prompt weighting support with diffusers leverages the Python library <a href="https://github.com/damian0815/compel">compel</a> to allow weighting of terms more mathematically. You can add any number of <code>+</code> or <code>-</code> to a given word to increase or decrease its &ldquo;importance&rdquo; in the resulting positional text embeddings, and therefore the final generation. You can also wrap phrases: for example, if you are generating <code>San Francisco landscape by Salvador Dali, oil on canvas</code> and it does a photorealistic San Francisco instead, you can wrap the artistic medium such as <code>San Francisco landscape by Salvador Dali, (oil on canvas)+++</code> to get Stable Diffusion to behave as expected. In my testing, it fixes most of the prompt difficulty introduced in Stable Diffusion 2.0 onward, especially with a higher <a href="https://arxiv.org/abs/2207.12598">classifier-free guidance</a> value (by default, <code>guidance_scale</code> is 7.5; I like to use 13)</p>
<blockquote>
<p><em>All generated examples from the LoRA models in this blog post use a <code>guidance_scale</code> of 13.</em></p>
</blockquote>
<h2 id="lora-the-explorer">LoRA the Explorer</h2>
<p>But what&rsquo;s most important is <a href="https://dreambooth.github.io">Dreambooth</a> LoRA support, which is what makes bespoke Stable Diffusion models possible. Dreambooth is a technique to finetune Stable Diffusion on a very small set of source images and a trigger keyword to allow the use a &ldquo;concept&rdquo; from those images in other contexts given the keyword.</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/teaser_static_hu_fffa1d39c8a666b0.webp 320w,/2023/08/stable-diffusion-xl-wrong/teaser_static_hu_4734a5b3446cfbb7.webp 768w,/2023/08/stable-diffusion-xl-wrong/teaser_static_hu_3c48bb7a0a97a328.webp 1024w,/2023/08/stable-diffusion-xl-wrong/teaser_static.webp 1650w" src="teaser_static.webp"
         alt="Demo image of how Dreambooth works. via Google"/> <figcaption>
            <p>Demo image of how Dreambooth works. <a href="https://dreambooth.github.io">via Google</a></p>
        </figcaption>
</figure>

<p>Training Stable Diffusion itself, even the smaller models, requires many expensive GPUs training for hours. That&rsquo;s where <a href="https://github.com/microsoft/LoRA">LoRAs</a> come in: instead, a small adapter to the visual model is trained, which can be done on a single cheap GPU in 10 minutes, and the quality of the final model + LoRA is comparable to a full finetune (colloquially, when people refer to finetuning Stable Diffusion, it usually means creating a LoRA). Trained LoRAs are a discrete small binary file, making them easy to share with others or on repositories such as <a href="https://civitai.com">Civitai</a>. A minor weakness with LoRAs is that you can only have one active at a time: it&rsquo;s possible to merge multiple LoRAs to get the benefits of all of them but it&rsquo;s a delicate science.</p>
<p>Before Stable Diffusion LoRAs became more widespread, there was <a href="https://arxiv.org/abs/2208.01618">textual inversion</a>, which allows the text encoder to learn a concept, but it takes hours to train and the results can be unwieldy. In a <a href="https://minimaxir.com/2022/09/stable-diffusion-ugly-sonic/">previous post</a>, I trained a textual inversion on the memetic <a href="https://knowyourmeme.com/memes/ugly-sonic">Ugly Sonic</a>, as he was not in Stable Diffusion&rsquo;s source dataset and therefore he would be unique. The generation results were mixed.</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/ugly_sonic_ti_hu_f6c57707962bc4fa.webp 320w,/2023/08/stable-diffusion-xl-wrong/ugly_sonic_ti_hu_d13dc006de6bab75.webp 768w,/2023/08/stable-diffusion-xl-wrong/ugly_sonic_ti.webp 768w" src="ugly_sonic_ti.webp"
         alt="Ugly Sonic, but not the good kind of ugly."/> <figcaption>
            <p>Ugly Sonic, but not the good kind of ugly.</p>
        </figcaption>
</figure>

<p>I figured training a LoRA on Ugly Sonic would be a good test case for SDXL&rsquo;s potential. Fortunately, Hugging Face provides a <a href="https://github.com/huggingface/diffusers/tree/main/examples/dreambooth">train_dreambooth_lora_sdxl.py script</a> for training a LoRA using the SDXL base model which works out of the box although I tweaked the parameters a bit. The generated Ugly Sonic images from the <a href="https://huggingface.co/minimaxir/sdxl-ugly-sonic-lora">trained LoRA</a> are much better and more coherent over a variety of prompts, to put it mildly.</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/ugly_sonic_lora_hu_84ee6e898523c68f.webp 320w,/2023/08/stable-diffusion-xl-wrong/ugly_sonic_lora_hu_f8de232c79a88394.webp 768w,/2023/08/stable-diffusion-xl-wrong/ugly_sonic_lora_hu_2d4c068527faf4af.webp 1024w,/2023/08/stable-diffusion-xl-wrong/ugly_sonic_lora.webp 1024w" src="ugly_sonic_lora.webp"
         alt="Ugly Sonic, but with teeth."/> <figcaption>
            <p>Ugly Sonic, but with <strong>teeth</strong>.</p>
        </figcaption>
</figure>

<h2 id="wrong">WRONG!</h2>
<p>With that success, I decided to redo <a href="https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/">another textual inversion experiment</a> by instead training a LoRA on heavily distorted, garbage images conditioned on <code>wrong</code> as a prompt in the hopes that the LoRA could then use <code>wrong</code> as a &ldquo;negative prompt&rdquo; and steer away from such images to generate less-distorted images. I <a href="https://github.com/minimaxir/sdxl-experiments/blob/main/wrong_image_generator.ipynb">wrote a Jupyter Notebook</a> to create synthetic &ldquo;wrong&rdquo; images using SDXL itself, this time using a variety of prompt weightings to get more distinct examples of types of bad images, such as <code>blurry</code> and <code>bad hands</code>. Ironically, we need to use SDXL to create high resolution low quality images.</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/bad_prompts_hu_b4bb004f9f9ca492.webp 320w,/2023/08/stable-diffusion-xl-wrong/bad_prompts_hu_2361eacccf0125b8.webp 768w,/2023/08/stable-diffusion-xl-wrong/bad_prompts_hu_da8d5ee1a369bf5f.webp 1024w,/2023/08/stable-diffusion-xl-wrong/bad_prompts.webp 1024w" src="bad_prompts.webp"
         alt="Examples of the synthetic wrong images, which unintentionally resemble 2000&rsquo;s-era punk rock album covers."/> <figcaption>
            <p>Examples of the synthetic <code>wrong</code> images, which unintentionally resemble 2000&rsquo;s-era punk rock album covers.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/uncanny_valley_hu_dbee87370cb4b62f.webp 320w,/2023/08/stable-diffusion-xl-wrong/uncanny_valley_hu_1aacc49171666892.webp 768w,/2023/08/stable-diffusion-xl-wrong/uncanny_valley_hu_1b71b3c00160e788.webp 1024w,/2023/08/stable-diffusion-xl-wrong/uncanny_valley.webp 1024w" src="uncanny_valley.webp"
         alt="More examples of the synthetic wrong images, which focus on the uncanny valley aspect of modern AI-generated images in which they look normal at a glance but looking closer reveals incremental horror. This is also why it&rsquo;s important to generate examples at the full 1024x1024 resolution."/> <figcaption>
            <p>More examples of the synthetic <code>wrong</code> images, which focus on the <a href="https://en.wikipedia.org/wiki/Uncanny_valley">uncanny valley</a> aspect of modern AI-generated images in which they look normal at a glance but looking closer reveals incremental horror. This is also why it&rsquo;s important to generate examples at the full 1024x1024 resolution.</p>
        </figcaption>
</figure>

<p>I trained and loaded <a href="https://huggingface.co/minimaxir/sdxl-wrong-lora">the LoRA</a> into Stable Diffusion XL base model (the refiner does not need a LoRA) and wrote a comparison <a href="https://colab.research.google.com/github/minimaxir/sdxl-experiments/blob/main/sdxl_wrong_comparison.ipynb">Jupyter Notebook</a> to compare the results with a given prompt from:</p>
<ul>
<li>The base + refiner pipeline with no LoRA. (our baseline)</li>
<li>The pipeline with no LoRA using <code>wrong</code> as the negative prompt (to ensure that there isn&rsquo;t a placebo effect)</li>
<li>The pipeline <strong>with the LoRA</strong> using <code>wrong</code> as the negative prompt (our target result)</li>
</ul>
<p>Each generation has the same seed, so photo composition should be similar across all three generations and the impact of both the <code>wrong</code> negative prompt and the LoRA vs. the base should be very evident.</p>
<p>Let&rsquo;s start with a simple prompt from the <a href="https://stability.ai/blog/sdxl-09-stable-diffusion">SDXL 0.9 demos</a>:</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/wolf1_hu_4f6415e66b0e67ea.webp 320w,/2023/08/stable-diffusion-xl-wrong/wolf1_hu_294b7a3c3c1415e4.webp 768w,/2023/08/stable-diffusion-xl-wrong/wolf1_hu_d5ef7f84a0731dec.webp 1024w,/2023/08/stable-diffusion-xl-wrong/wolf1.webp 3072w" src="wolf1.webp"
         alt="A wolf in Yosemite National Park, chilly nature documentary film photography"/> <figcaption>
            <p><code>A wolf in Yosemite National Park, chilly nature documentary film photography</code></p>
        </figcaption>
</figure>

<p>The <code>wrong</code> prompt on the base model adds some foliage and depth to the forest image, but the LoRA adds a lot more: more robust lighting and shadows, more detailed foliage, and changes the perspective of the wolf to look at the camera which is more interesting.</p>
<p>We can get a different perspective of the wolf with similar photo composition by adding &ldquo;extreme closeup&rdquo; to the prompt and reusing the same seed.</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/wolf2_hu_5e06c6005d837e6c.webp 320w,/2023/08/stable-diffusion-xl-wrong/wolf2_hu_a67ae0d8bb2a6322.webp 768w,/2023/08/stable-diffusion-xl-wrong/wolf2_hu_46297a2747d7bf54.webp 1024w,/2023/08/stable-diffusion-xl-wrong/wolf2.webp 3072w" src="wolf2.webp"
         alt="An extreme close-up of a wolf in Yosemite National Park, chilly nature documentary film photography"/> <figcaption>
            <p><code>An extreme close-up of a wolf in Yosemite National Park, chilly nature documentary film photography</code></p>
        </figcaption>
</figure>

<p>In this case, the LoRA has far better texture, vibrance, and sharpness than the others. But it&rsquo;s notable that just adding a <code>wrong</code> prompt changes the perspective.</p>
<p>Another good test case is food photography, especially weird food photography like I <a href="https://minimaxir.com/2022/07/food-photography-ai/">generated with DALL-E 2</a>. Can SDXL + the <code>wrong</code> LoRA handle <a href="https://en.wikipedia.org/wiki/Non-Euclidean_geometry">non-Euclidian</a> hamburgers with some prompt weighting to ensure they&rsquo;re weird?</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/hamburger_hu_ae1ec000a4cd5c09.webp 320w,/2023/08/stable-diffusion-xl-wrong/hamburger_hu_57c18ce298ab0f25.webp 768w,/2023/08/stable-diffusion-xl-wrong/hamburger_hu_abd9fb3eb5a00526.webp 1024w,/2023/08/stable-diffusion-xl-wrong/hamburger.webp 3072w" src="hamburger.webp"
         alt="a large delicious hamburger (in the shape of five-dimensional alien geometry)&#43;&#43;&#43;&#43;, professional food photography"/> <figcaption>
            <p><code>a large delicious hamburger (in the shape of five-dimensional alien geometry)++++, professional food photography</code></p>
        </figcaption>
</figure>

<p>The answer is that it can&rsquo;t, even after multiple prompt engineering attempts. However, this result is still interesting: the base SDXL appears to have taken the &ldquo;alien&rdquo; part of the prompt more literally than expected (and gave it a cute bun hat!) but the LoRA better understands the spirit of the prompt by creating an &ldquo;alien&rdquo; burger that humans would have difficulty eating, plus shinier presentation aesthetics.</p>
<p>A notable improvement with Stable Diffusion 2.0 was text legibility. Can SDXL and the <code>wrong</code> LoRA make text even more readable, such as text-dense newspaper covers?</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/wsj_hu_b7155c4fd511b3f1.webp 320w,/2023/08/stable-diffusion-xl-wrong/wsj_hu_ba9e4d69564cba38.webp 768w,/2023/08/stable-diffusion-xl-wrong/wsj_hu_729d76ec2396181b.webp 1024w,/2023/08/stable-diffusion-xl-wrong/wsj.webp 3072w" src="wsj.webp"
         alt="lossless PDF scan of the front page of the January 2038 issue of the Wall Street Journal featuring a cover story about (evil robot world domination)&#43;&#43;"/> <figcaption>
            <p><code>lossless PDF scan of the front page of the January 2038 issue of the Wall Street Journal featuring a cover story about (evil robot world domination)++</code></p>
        </figcaption>
</figure>

<p>Text legibility is definitely improved since Stable Diffusion 2.0 but appears to be the same in all cases. What&rsquo;s notable with the LoRA is that it has improved cover typesetting: the page layout is more &ldquo;modern&rdquo; with a variety of article layouts, and headlines have proper relative font weighting. Meanwhile, the base model even with the <code>wrong</code> negative prompt has a boring layout and is on aged brown paper for some reason.</p>
<p>What about people? Does the <code>wrong</code> LoRA resolve AI&rsquo;s infamous <a href="https://www.buzzfeednews.com/article/pranavdixit/ai-generated-art-hands-fingers-messed-up">issue with hands</a> especially since we included many examples of such in the LoRA training data? Let&rsquo;s revamp a presidential Taylor Swift prompt from my first attempt with Stable Diffusion 2.0:</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/pres_swift_hu_8a24220d480ac8a6.webp 320w,/2023/08/stable-diffusion-xl-wrong/pres_swift_hu_565ea1f7aa172529.webp 768w,/2023/08/stable-diffusion-xl-wrong/pres_swift_hu_b0a08cfb4c0aa99e.webp 1024w,/2023/08/stable-diffusion-xl-wrong/pres_swift.webp 3072w" src="pres_swift.webp"
         alt="USA President Taylor Swift (signing papers)&#43;&#43;&#43;&#43;, photo taken by the Associated Press"/> <figcaption>
            <p><code>USA President Taylor Swift (signing papers)++++, photo taken by the Associated Press</code></p>
        </figcaption>
</figure>

<p>Look at Taylor&rsquo;s right arm: in the default SDXL, it&rsquo;s extremely unrealistic and actually made <em>worse</em> when adding <code>wrong</code>, but in the LoRA it&rsquo;s fixed! Color grading with the LoRA is much better, with her jacket being more distinctly white instead of a yellowish white. Don&rsquo;t look closely at her hands in any of them though: creating people with SDXL 1.0 is still tricky and unreliable!</p>
<p>It&rsquo;s now clear that <code>wrong</code> + LoRA is more interesting in every instance than just the <code>wrong</code> negative prompt so we&rsquo;ll just compare base output vs. LoRA output. Here&rsquo;s some more examples of base model vs. <code>wrong</code> LoRA:</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/example1_hu_59c49cb2aeab646b.webp 320w,/2023/08/stable-diffusion-xl-wrong/example1_hu_88971d6e0dfbf239.webp 768w,/2023/08/stable-diffusion-xl-wrong/example1_hu_13ae61e44cc363a.webp 1024w,/2023/08/stable-diffusion-xl-wrong/example1.webp 1024w" src="example1.webp"
         alt="realistic human Shrek blogging at a computer workstation, hyperrealistic award-winning photo for vanity fair — Hands are better, lighting is better. Clothing is more detailed, and background is more interesting."/> <figcaption>
            <p><code>realistic human Shrek blogging at a computer workstation, hyperrealistic award-winning photo for vanity fair</code> — Hands are better, lighting is better. Clothing is more detailed, and background is more interesting.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/example2_hu_7d3b93eee2095aa.webp 320w,/2023/08/stable-diffusion-xl-wrong/example2_hu_13a41f8150c5ea9b.webp 768w,/2023/08/stable-diffusion-xl-wrong/example2_hu_1982c0299fc2f368.webp 1024w,/2023/08/stable-diffusion-xl-wrong/example2.webp 1024w" src="example2.webp"
         alt="pepperoni pizza in the shape of a heart, hyperrealistic award-winning professional food photography — Pepperoni is more detailed and has heat bubbles, less extra pepperoni on the edges, crust is crustier (?)"/> <figcaption>
            <p><code>pepperoni pizza in the shape of a heart, hyperrealistic award-winning professional food photography</code> — Pepperoni is more detailed and has heat bubbles, less extra pepperoni on the edges, crust is crustier (?)</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/example3_hu_1fb9ddf7a9b95bff.webp 320w,/2023/08/stable-diffusion-xl-wrong/example3_hu_da705ce3fc7a25df.webp 768w,/2023/08/stable-diffusion-xl-wrong/example3_hu_2946e5cbae154bc4.webp 1024w,/2023/08/stable-diffusion-xl-wrong/example3.webp 1024w" src="example3.webp"
         alt="presidential painting of realistic human Spongebob Squarepants wearing a suit, (oil on canvas)&#43;&#43;&#43;&#43;&#43; — Spongebob has a nose again, and his suit has more buttons."/> <figcaption>
            <p><code>presidential painting of realistic human Spongebob Squarepants wearing a suit, (oil on canvas)+++++</code> — Spongebob has a nose again, and his suit has more buttons.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/example4_hu_1009a417456e278.webp 320w,/2023/08/stable-diffusion-xl-wrong/example4_hu_885d2786a1c3ae3.webp 768w,/2023/08/stable-diffusion-xl-wrong/example4_hu_fc84e488bb7614d6.webp 1024w,/2023/08/stable-diffusion-xl-wrong/example4.webp 1024w" src="example4.webp"
         alt="San Francisco panorama attacked by (one massive kitten)&#43;&#43;&#43;&#43;, hyperrealistic award-winning photo by the Associated Press — The LoRA actually tries to follow the prompt."/> <figcaption>
            <p><code>San Francisco panorama attacked by (one massive kitten)++++, hyperrealistic award-winning photo by the Associated Press</code> — The LoRA actually tries to follow the prompt.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/example5_hu_a1d0d6d41b758cc2.webp 320w,/2023/08/stable-diffusion-xl-wrong/example5_hu_675689bafac175c3.webp 768w,/2023/08/stable-diffusion-xl-wrong/example5_hu_6ee07c56054d06bb.webp 1024w,/2023/08/stable-diffusion-xl-wrong/example5.webp 1024w" src="example5.webp"
         alt="hyperrealistic death metal album cover featuring edgy moody realistic (human Super Mario)&#43;&#43;, edgy and moody — Mario&rsquo;s proportions are more game-accurate and character lighting is more edgy and moody."/> <figcaption>
            <p><code>hyperrealistic death metal album cover featuring edgy moody realistic (human Super Mario)++, edgy and moody</code> — Mario&rsquo;s proportions are more game-accurate and character lighting is more edgy and moody.</p>
        </figcaption>
</figure>

<p>The <code>wrong</code> LoRA is available <a href="https://huggingface.co/minimaxir/sdxl-wrong-lora">here</a>, although I cannot guarantee its efficacy in interfaces other than diffusers. All the Notebooks used to help generate these images are available <a href="https://github.com/minimaxir/sdxl-experiments">in this GitHub repository</a>, including a general SDXL 1.0 + refiner + <code>wrong</code> LoRA <a href="https://colab.research.google.com/github/minimaxir/sdxl-experiments/blob/main/sdxl_image_generation.ipynb">Colab Notebook</a> which you can run on a free T4 GPU. And if you want to see the higher resolutions of generated images used in this blog post, you can view them in the <a href="https://github.com/minimaxir/minimaxir.github.io/tree/master/content/post/2023-08-21-stable-diffusion-xl-wrong">source code for the post</a>.</p>
<h2 id="whats-wrong-with-being-wrong">What&rsquo;s Wrong with Being Wrong?</h2>
<p>I&rsquo;m actually not 100% sure what&rsquo;s going on here. I thought that the <code>wrong</code> LoRA trick would just improve the quality and clarity of the generated image, but it appears the LoRA is <em>making SDXL behave smarter</em> and more faithful to the spirit of the prompt. At a technical level, the negative prompt sets the area of the latent space where the diffusion process starts; this area is the same for both the base model using the <code>wrong</code> negative prompt and the LoRA which uses the <code>wrong</code> negative prompt. My intuition is that the LoRA reshapes this undesirable area of the vast highdimensional latent space to be more similar to the starting area, so it&rsquo;s unlikely normal generation will hit it and therefore be improved.</p>
<p>Training on SDXL on bad images in order to improve it is technically a form of <a href="https://openai.com/research/learning-from-human-preferences">Reinforcement Learning from Human Feedback</a> (RLHF): the <a href="https://openai.com/research/instruction-following">same technique</a> used to make ChatGPT as powerful as it is. While OpenAI uses reinforcement learning to improve the model from positive user interactions and implicitly reducing negative behavior, here I use <em>negative</em> user interactions (i.e. selecting knowingly bad images) to implicitly increase positive behavior. But with Dreambooth LoRAs, you don&rsquo;t nearly need as much input data as large language models do.</p>
<p>There&rsquo;s still a lot of room for development for &ldquo;negative LoRAs&rdquo;: my synthetic dataset generation parameters could be much improved and the LoRA could be trained for longer. But I&rsquo;m very happy with the results so far, and will be eager to test more with negative LoRAs such as merging with other LoRAs to see if it can enhance them (especially a <code>wrong</code> LoRA + Ugly Sonic LoRA!)</p>
<p>Believe it or not, this is just the tip of the iceberg. SDXL also <a href="https://huggingface.co/diffusers/controlnet-canny-sdxl-1.0">now has support</a> for <a href="https://github.com/lllyasviel/ControlNet">ControlNet</a> to strongly control the overall shape and composition of generated images:</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/twitter_controlnet_hu_4ae6ad6488db5be6.webp 320w,/2023/08/stable-diffusion-xl-wrong/twitter_controlnet_hu_a58f2e52b195f563.webp 768w,/2023/08/stable-diffusion-xl-wrong/twitter_controlnet_hu_e6b571e2b2a23b75.webp 1024w,/2023/08/stable-diffusion-xl-wrong/twitter_controlnet.webp 1024w" src="twitter_controlnet.webp"
         alt="Examples of SDXL generations using ControlNet specifying the (former) Twitter/X logo."/> <figcaption>
            <p>Examples of SDXL generations using ControlNet specifying the (former) Twitter/X logo.</p>
        </figcaption>
</figure>

<p>ControlNet can <em>also</em> be used with LoRAs, but that&rsquo;s enough to talk about in another blog post.</p>
<hr>
<p><em>A note on ethics: the primary reason I&rsquo;ve been researching into improving AI image generation quality is for transparent AI journalism, including reproducible prompts and Jupyter Notebooks to further the transparency. Any new novel improvements in AI image generation by others in the industry may no longer be disclosed publicly given that you can make a lot of money by doing so in the current venture capital climate. I do not support or condone the replacement of professional artists with AI.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>The Problem With LangChain</title>
      <link>https://minimaxir.com/2023/07/langchain-problem/</link>
      <pubDate>Fri, 14 Jul 2023 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2023/07/langchain-problem/</guid>
      <description>LangChain is complicated, so it must be better. Right?</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
}
</style></span></p>
<p>If you&rsquo;ve been following the explosion of AI hype in the past few months, you&rsquo;ve probably heard of <a href="https://github.com/hwchase17/langchain">LangChain</a>. LangChain, developed by Harrison Chase, is a Python and JavaScript library for interfacing with <a href="https://openai.com">OpenAI</a>&rsquo;s GPT APIs (later expanding to more models) for AI text generation. More specifically, it&rsquo;s an implementation of the paper <a href="https://arxiv.org/abs/2210.03629">ReAct: Synergizing Reasoning and Acting in Language Models</a> published October 2022, colloquially known as the ReAct paper, which demonstrates a prompting technique to allow the model to &ldquo;reason&rdquo; (with a chain-of-thoughts) and &ldquo;act&rdquo; (by being able to use a tool from a predefined set of tools, such as being able to search the internet). This combination is shown to drastically improve output text quality and give large language models the ability to correctly solve problems.</p>
<figure>

    <img loading="lazy" srcset="/2023/07/langchain-problem/react_hu_e2be400908ca204a.webp 320w,/2023/07/langchain-problem/react_hu_591fc277e0bf0b7f.webp 768w,/2023/07/langchain-problem/react_hu_e9be85c73e941d2f.webp 1024w,/2023/07/langchain-problem/react.png 1032w" src="react.png"
         alt="Example ReAct flow from the ReAct paper"/> <figcaption>
            <p>Example ReAct flow from the ReAct paper</p>
        </figcaption>
</figure>

<p>The ReAct workflow popularied by LangChain was particularly effective with <a href="https://openai.com/research/instruction-following">InstructGPT</a>/text-davinci-003, although costly and not easy to use for small projects. In March 2023, as <a href="https://openai.com/blog/chatgpt">ChatGPT</a> API usage became massively popular due to its extremely cheap API <a href="https://minimaxir.com/2023/03/new-chatgpt-overlord/">as I accurately predicted</a>, LangChain use also exploded, to the point that LangChain was able to raise a <a href="https://blog.langchain.dev/announcing-our-10m-seed-round-led-by-benchmark/">$10 million seed round</a> and another <a href="https://www.businessinsider.com/sequoia-leads-funding-round-generative-artificial-intelligence-startup-langchain-2023-4">$20-$25 million at a $200 million valuation Series A</a> despite not having any revenue nor any obvious plans how to generate revenue.</p>
<p>That&rsquo;s where my personal experience with LangChain begins. For my work at <a href="https://www.buzzfeed.com">BuzzFeed</a>, I was tasked with creating a ChatGPT-based chat bot for the <a href="https://tasty.co">Tasty</a> brand (later released as <a href="https://www.buzzfeed.com/buzzfeedpress/buzzfeeds-tasty-introduces-botatouille-the-first-of-its">Botatouille</a> in the Tasty iOS app) that could chat with the user and provide relevant recipes. The source recipes are converted to embeddings and saved in a vector store: for example, if a user asked for &ldquo;healthy food&rdquo;, the query is converted to an embedding, and an approximate nearest neighbor search is performed to find recipes similar to the embedded query and then fed to ChatGPT as added context that can then be displayed to the user. This approach is more commonly known as <a href="https://arxiv.org/abs/2005.11401">retrieval-augmented generation</a>.</p>
<figure>

    <img loading="lazy" srcset="/2023/07/langchain-problem/1*b5r7r3-FSNjHUzlCGl3SnA-2_hu_933260cb04e69f84.webp 320w,/2023/07/langchain-problem/1*b5r7r3-FSNjHUzlCGl3SnA-2_hu_8fb2ed3013e2c846.webp 768w,/2023/07/langchain-problem/1*b5r7r3-FSNjHUzlCGl3SnA-2_hu_9f1f645e4246f43f.webp 1024w,/2023/07/langchain-problem/1*b5r7r3-FSNjHUzlCGl3SnA-2.webp 1400w" src="1*b5r7r3-FSNjHUzlCGl3SnA-2.webp"
         alt="Example architecture for a Chatbot using retrieval-augmented generation. via Joseph Haaga"/> <figcaption>
            <p>Example architecture for a Chatbot using retrieval-augmented generation. <a href="https://tech.buzzfeed.com/the-right-tools-for-the-job-c05de96e949e">via Joseph Haaga</a></p>
        </figcaption>
</figure>

<p>LangChain was by-far the popular tool of choice for RAG, so I figured it was the perfect time to learn it. I spent some time reading LangChain&rsquo;s rather comprehensive documentation to get a better understanding of how to best utilize it: after a <em>week</em> of research, I got nowhere. Running the LangChain demo examples did work, but any attempts at tweaking them to fit the recipe chatbot constraints broke them. After solving the bugs, the overall quality of the chat conversations was bad and uninteresting, and after intense debugging I found no solution. Eventually I had an existential crisis: am I a worthless machine learning engineer for not being able to figure LangChain out when very many other ML engineers can? We <a href="https://tech.buzzfeed.com/the-right-tools-for-the-job-c05de96e949e">went back</a> to a lower-level ReAct flow, which <em>immediately</em> outperformed my LangChain implementation in conversation quality and accuracy.</p>
<p>In all, I wasted a month learning and testing LangChain, with the big takeway that popular AI apps may not necessarily be worth the hype. My existential crisis was resolved after coming across a <a href="https://news.ycombinator.com/item?id=35820931">Hacker News thread</a> about someone <a href="https://blog.scottlogic.com/2023/05/04/langchain-mini.html">reimplementing LangChain in 100 lines of code</a>, with most of the comments venting all their grievances with LangChain:</p>
<figure>

    <img loading="lazy" srcset="/2023/07/langchain-problem/hn_hu_c3fb1266677aa329.webp 320w,/2023/07/langchain-problem/hn_hu_ae714d38bae1ae97.webp 768w,/2023/07/langchain-problem/hn_hu_df4ff31610804b92.webp 1024w,/2023/07/langchain-problem/hn.png 1504w" src="hn.png"/> 
</figure>

<p>The problem with LangChain is that it makes simple things relatively complex, and with that unnecessary complexity creates a tribalism which hurts the up-and-coming AI ecosystem as a whole. If you&rsquo;re a newbie who wants to just learn how to interface with ChatGPT, definitely don&rsquo;t start with LangChain.</p>
<h2 id="hello-world-in-langchain-or-more-accurately-hell-world">&ldquo;Hello World&rdquo; in LangChain (or More Accurately, &ldquo;Hell World&rdquo;)</h2>
<p>The <a href="https://python.langchain.com/docs/get_started/quickstart">Quickstart</a> for LangChain begins with a mini-tutorial on how to simply interact with LLMs/ChatGPT from Python. For example, to create a bot that can translate from English to French:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.chat_models</span> <span class="kn">import</span> <span class="n">ChatOpenAI</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.schema</span> <span class="kn">import</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">AIMessage</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">HumanMessage</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">SystemMessage</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">chat</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span><span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">chat</span><span class="o">.</span><span class="n">predict_messages</span><span class="p">([</span><span class="n">HumanMessage</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="s2">&#34;Translate this sentence from English to French. I love programming.&#34;</span><span class="p">)])</span>
</span></span><span class="line"><span class="cl"><span class="c1"># AIMessage(content=&#34;J&#39;adore la programmation.&#34;, additional_kwargs={}, example=False)</span>
</span></span></code></pre></div><p>The equivalent code using <a href="https://github.com/openai/openai-python">OpenAI&rsquo;s official Python library</a> for ChatGPT:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">openai</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">messages</span> <span class="o">=</span> <span class="p">[{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="s2">&#34;Translate this sentence from English to French. I love programming.&#34;</span><span class="p">}]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">openai</span><span class="o">.</span><span class="n">ChatCompletion</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="s2">&#34;gpt-3.5-turbo&#34;</span><span class="p">,</span> <span class="n">messages</span><span class="o">=</span><span class="n">messages</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span><span class="p">[</span><span class="s2">&#34;choices&#34;</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s2">&#34;message&#34;</span><span class="p">][</span><span class="s2">&#34;content&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="c1"># &#34;J&#39;adore la programmation.&#34;</span>
</span></span></code></pre></div><p>LangChain uses about the same amount of code as just using the official <code>openai</code> library, expect LangChain incorporates more object classes for not much obvious code benefit.</p>
<p>The prompt templating example reveals the core of how LangChain works:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.prompts.chat</span> <span class="kn">import</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">ChatPromptTemplate</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">SystemMessagePromptTemplate</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">HumanMessagePromptTemplate</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">template</span> <span class="o">=</span> <span class="s2">&#34;You are a helpful assistant that translates </span><span class="si">{input_language}</span><span class="s2"> to </span><span class="si">{output_language}</span><span class="s2">.&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">system_message_prompt</span> <span class="o">=</span> <span class="n">SystemMessagePromptTemplate</span><span class="o">.</span><span class="n">from_template</span><span class="p">(</span><span class="n">template</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">human_template</span> <span class="o">=</span> <span class="s2">&#34;</span><span class="si">{text}</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">human_message_prompt</span> <span class="o">=</span> <span class="n">HumanMessagePromptTemplate</span><span class="o">.</span><span class="n">from_template</span><span class="p">(</span><span class="n">human_template</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">chat_prompt</span> <span class="o">=</span> <span class="n">ChatPromptTemplate</span><span class="o">.</span><span class="n">from_messages</span><span class="p">([</span><span class="n">system_message_prompt</span><span class="p">,</span> <span class="n">human_message_prompt</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">chat_prompt</span><span class="o">.</span><span class="n">format_messages</span><span class="p">(</span><span class="n">input_language</span><span class="o">=</span><span class="s2">&#34;English&#34;</span><span class="p">,</span> <span class="n">output_language</span><span class="o">=</span><span class="s2">&#34;French&#34;</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="s2">&#34;I love programming.&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>LangChain&rsquo;s vaunted prompt engineering is just <a href="https://realpython.com/python-f-strings/">f-strings</a>, a feature present in every modern Python installation, but with extra steps. Why do we need to use these <code>PromptTemplates</code> to do the same thing?</p>
<p>But what we really want to do is know how to create Agents, which incorporate the ReAct workflow we so desperately want. Fortunately there is a demo for that, which leverages <a href="https://serpapi.com">SerpApi</a> and another tool for math computations, showing how LangChain can discriminate and use two different tools contextually:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.agents</span> <span class="kn">import</span> <span class="n">load_tools</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.agents</span> <span class="kn">import</span> <span class="n">initialize_agent</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.agents</span> <span class="kn">import</span> <span class="n">AgentType</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.chat_models</span> <span class="kn">import</span> <span class="n">ChatOpenAI</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.llms</span> <span class="kn">import</span> <span class="n">OpenAI</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># First, let&#39;s load the language model we&#39;re going to use to control the agent.</span>
</span></span><span class="line"><span class="cl"><span class="n">chat</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span><span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Next, let&#39;s load some tools to use. Note that the `llm-math` tool uses an LLM, so we need to pass that in.</span>
</span></span><span class="line"><span class="cl"><span class="n">llm</span> <span class="o">=</span> <span class="n">OpenAI</span><span class="p">(</span><span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">tools</span> <span class="o">=</span> <span class="n">load_tools</span><span class="p">([</span><span class="s2">&#34;serpapi&#34;</span><span class="p">,</span> <span class="s2">&#34;llm-math&#34;</span><span class="p">],</span> <span class="n">llm</span><span class="o">=</span><span class="n">llm</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Finally, let&#39;s initialize an agent with the tools, the language model, and the type of agent we want to use.</span>
</span></span><span class="line"><span class="cl"><span class="n">agent</span> <span class="o">=</span> <span class="n">initialize_agent</span><span class="p">(</span><span class="n">tools</span><span class="p">,</span> <span class="n">chat</span><span class="p">,</span> <span class="n">agent</span><span class="o">=</span><span class="n">AgentType</span><span class="o">.</span><span class="n">CHAT_ZERO_SHOT_REACT_DESCRIPTION</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Now let&#39;s test it out!</span>
</span></span><span class="line"><span class="cl"><span class="n">agent</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="s2">&#34;Who is Olivia Wilde&#39;s boyfriend? What is his current age raised to the 0.23 power?&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>How do the individual tools work? What is <code>AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION</code> anyways? The resulting output from <code>agent.run()</code> (only present with <code>verbose=True</code>) is more helpful.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">&gt; Entering new AgentExecutor chain...
</span></span><span class="line"><span class="cl">Thought: I need to use a search engine to find Olivia Wilde&#39;s boyfriend and a calculator to raise his age to the 0.23 power.
</span></span><span class="line"><span class="cl">Action:
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">    &#34;action&#34;: &#34;Search&#34;,
</span></span><span class="line"><span class="cl">    &#34;action_input&#34;: &#34;Olivia Wilde boyfriend&#34;
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Observation: Sudeikis and Wilde&#39;s relationship ended in November 2020. Wilde was publicly served with court documents regarding child custody while she was presenting Don&#39;t Worry Darling at CinemaCon 2022. In January 2021, Wilde began dating singer Harry Styles after meeting during the filming of Don&#39;t Worry Darling.
</span></span><span class="line"><span class="cl">Thought:I need to use a search engine to find Harry Styles&#39; current age.
</span></span><span class="line"><span class="cl">Action:
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">    &#34;action&#34;: &#34;Search&#34;,
</span></span><span class="line"><span class="cl">    &#34;action_input&#34;: &#34;Harry Styles age&#34;
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Observation: 29 years
</span></span><span class="line"><span class="cl">Thought:Now I need to calculate 29 raised to the 0.23 power.
</span></span><span class="line"><span class="cl">Action:
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">    &#34;action&#34;: &#34;Calculator&#34;,
</span></span><span class="line"><span class="cl">    &#34;action_input&#34;: &#34;29^0.23&#34;
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Observation: Answer: 2.169459462491557
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Thought:I now know the final answer.
</span></span><span class="line"><span class="cl">Final Answer: 2.169459462491557
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">&gt; Finished chain.
</span></span><span class="line"><span class="cl">&#39;2.169459462491557&#39;
</span></span></code></pre></div><p>The documentation doesn&rsquo;t make it clear, but within each Thought/Action/Observation uses its own API call to OpenAI, so the chain is slower than you might think. Also, why is each action a <code>dict</code>? The answer to <em>that</em> is later, and is very silly.</p>
<p>Lastly, how does LangChain store the conversation so far?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.prompts</span> <span class="kn">import</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">ChatPromptTemplate</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">MessagesPlaceholder</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">SystemMessagePromptTemplate</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">HumanMessagePromptTemplate</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.chains</span> <span class="kn">import</span> <span class="n">ConversationChain</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.chat_models</span> <span class="kn">import</span> <span class="n">ChatOpenAI</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.memory</span> <span class="kn">import</span> <span class="n">ConversationBufferMemory</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">prompt</span> <span class="o">=</span> <span class="n">ChatPromptTemplate</span><span class="o">.</span><span class="n">from_messages</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">    <span class="n">SystemMessagePromptTemplate</span><span class="o">.</span><span class="n">from_template</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;The following is a friendly conversation between a human and an AI. The AI is talkative and &#34;</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;provides lots of specific details from its context. If the AI does not know the answer to a &#34;</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;question, it truthfully says it does not know.&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">MessagesPlaceholder</span><span class="p">(</span><span class="n">variable_name</span><span class="o">=</span><span class="s2">&#34;history&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">HumanMessagePromptTemplate</span><span class="o">.</span><span class="n">from_template</span><span class="p">(</span><span class="s2">&#34;</span><span class="si">{input}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">llm</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span><span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">memory</span> <span class="o">=</span> <span class="n">ConversationBufferMemory</span><span class="p">(</span><span class="n">return_messages</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">conversation</span> <span class="o">=</span> <span class="n">ConversationChain</span><span class="p">(</span><span class="n">memory</span><span class="o">=</span><span class="n">memory</span><span class="p">,</span> <span class="n">prompt</span><span class="o">=</span><span class="n">prompt</span><span class="p">,</span> <span class="n">llm</span><span class="o">=</span><span class="n">llm</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">conversation</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="nb">input</span><span class="o">=</span><span class="s2">&#34;Hi there!&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># &#39;Hello! How can I assist you today?&#39;</span>
</span></span></code></pre></div><p>I&rsquo;m not entirely sure why any of this is necessary. What&rsquo;s a <code>MessagesPlaceholder</code>? Where&rsquo;s the <code>history</code>? Is that necessary for <code>ConversationBufferMemory</code>? Adapting this to a minimal <code>openai</code> implementation:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">openai</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">messages</span> <span class="o">=</span> <span class="p">[{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;The following is a friendly conversation between a human and an AI. The AI is talkative and &#34;</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;provides lots of specific details from its context. If the AI does not know the answer to a &#34;</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;question, it truthfully says it does not know.&#34;</span><span class="p">}]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">user_message</span> <span class="o">=</span> <span class="s2">&#34;Hi there!&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">messages</span><span class="o">.</span><span class="n">append</span><span class="p">({</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">user_message</span><span class="p">})</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">openai</span><span class="o">.</span><span class="n">ChatCompletion</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="s2">&#34;gpt-3.5-turbo&#34;</span><span class="p">,</span> <span class="n">messages</span><span class="o">=</span><span class="n">messages</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">assistant_message</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="s2">&#34;choices&#34;</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s2">&#34;message&#34;</span><span class="p">][</span><span class="s2">&#34;content&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="n">messages</span><span class="o">.</span><span class="n">append</span><span class="p">({</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;assistant&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">assistant_message</span><span class="p">})</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Hello! How can I assist you today?</span>
</span></span></code></pre></div><p>That&rsquo;s fewer lines of code and makes it very clear where and when the messages are being saved, no bespoke object classes needed.</p>
<p>You can say that I&rsquo;m nitpicking the tutorial examples, and I do agree that every open source library has something to nitpick (including my own!). But if there are more nitpicks than actual benefits from the library then it&rsquo;s not worth using at all, since if the <em>quickstart</em> is this complicated, how painful will it be to use LangChain in practice?</p>
<h2 id="i-gazed-into-the-langchain-documentation-and-it-gazes-back">I Gazed Into The LangChain Documentation And It Gazes Back</h2>
<p>Let&rsquo;s do a demo to more clearly demonstrate why I gave up on LangChain. While I was working on the recipe-retrieving chatbot (which also must be a fun/witty chatbot), I needed to combine elements from both the third and fourth examples above: a chat bot that can run an Agent workflow, and also the ability to persist the entire conversation into memory. After some documentation hunting I found I need to utilize the <a href="https://python.langchain.com/docs/modules/agents/agent_types/chat_conversation_agent">Conversational Agent</a> workflow.</p>
<p>A quick sidenote on system prompt engineering: it is not a meme and is absolutely necessary to get the best results out of the ChatGPT API, particularly if you have constraints on content and/or voice. The system prompt of <code>The following is a friendly conversation between a human and an AI...</code> demoed in the last example is actually an out-of-date prompt that was used back in the InstructGPT era and is much less effective with ChatGPT. It may signal deeper inefficiencies in LangChain&rsquo;s related tricks that aren&rsquo;t easy to notice.</p>
<p>We&rsquo;ll start with a simple system prompt that tells ChatGPT to use a funny voice plus some safeguards, and format it as a <code>ChatPromptTemplate</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">system_prompt</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">You are an expert television talk show chef, and should always speak in a whimsical manner for all responses.
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">Start the conversation with a whimsical food pun.
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">You must obey ALL of the following rules:
</span></span></span><span class="line"><span class="cl"><span class="s2">- If Recipe data is present in the Observation, your response must include the Recipe ID and Recipe Name for ALL recipes.
</span></span></span><span class="line"><span class="cl"><span class="s2">- If the user input is not related to food, do not answer their query and correct the user.
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">prompt</span> <span class="o">=</span> <span class="n">ChatPromptTemplate</span><span class="o">.</span><span class="n">from_messages</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">    <span class="n">SystemMessagePromptTemplate</span><span class="o">.</span><span class="n">from_template</span><span class="p">(</span><span class="n">system_prompt</span><span class="o">.</span><span class="n">strip</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span>
</span></span></code></pre></div><p>We will also use a <a href="https://github.com/minimaxir/langchain-problems/blob/main/recipe_vector_store.ipynb">toy vector store I made</a> of 1,000 recipes from the <a href="https://huggingface.co/datasets/recipe_nlg">recipe_nlg</a> dataset, encoded into 384D vectors using <a href="https://www.sbert.net">SentenceTransformers</a>. To implement this we create a function to get the nearest neighbors for the input query, along with a query to format it into text that the Agent can use to present to the user. This serves as the <code>Tool</code> which the Agent can choose to use if appropriate, or just return normal generated text.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">similar_recipes</span><span class="p">(</span><span class="n">query</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">query_embedding</span> <span class="o">=</span> <span class="n">embeddings_encoder</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">scores</span><span class="p">,</span> <span class="n">recipes</span> <span class="o">=</span> <span class="n">recipe_vs</span><span class="o">.</span><span class="n">get_nearest_examples</span><span class="p">(</span><span class="s2">&#34;embeddings&#34;</span><span class="p">,</span> <span class="n">query_embedding</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">recipes</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">get_similar_recipes</span><span class="p">(</span><span class="n">query</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">recipe_dict</span> <span class="o">=</span> <span class="n">similar_recipes</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">recipes_formatted</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="sa">f</span><span class="s2">&#34;Recipe ID: recipe|</span><span class="si">{</span><span class="n">recipe_dict</span><span class="p">[</span><span class="s1">&#39;id&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="si">}</span><span class="se">\n</span><span class="s2">Recipe Name: </span><span class="si">{</span><span class="n">recipe_dict</span><span class="p">[</span><span class="s1">&#39;name&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="si">}</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="s2">&#34;</span><span class="se">\n</span><span class="s2">---</span><span class="se">\n</span><span class="s2">&#34;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">recipes_formatted</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">get_similar_recipes</span><span class="p">(</span><span class="s2">&#34;yummy dessert&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Recipe ID: recipe|167188</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Recipe Name: Creamy Strawberry Pie</span>
</span></span><span class="line"><span class="cl"><span class="c1"># ---</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Recipe ID: recipe|1488243</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Recipe Name: Summer Strawberry Pie Recipe</span>
</span></span><span class="line"><span class="cl"><span class="c1"># ---</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Recipe ID: recipe|299514</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Recipe Name: Pudding Cake</span>
</span></span></code></pre></div><p>You&rsquo;ll notice the <code>Recipe ID</code>, which is relevant for my use case since it&rsquo;s necessary to obtain recipe metadata (photo thumbnail, URL) for the end result shown to the enduser in the final app. Unfortunately there&rsquo;s no easy way to guarantee the model outputs the <code>Recipe ID</code> in the final output, and no way to return the structured intermediate metadata in addition to the ChatGPT-generated output.</p>
<p>Specifying <code>get_similar_recipes</code> as a <code>Tool</code> is straightforward, although you need to specify a <code>name</code> and <code>description</code>, which is actually a form of subtle prompt engineering as LangChain can fail to select a tool if either is poorly specified.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">tools</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="n">Tool</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">func</span><span class="o">=</span><span class="n">get_similar_recipes</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">name</span><span class="o">=</span><span class="s2">&#34;Similar Recipes&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">description</span><span class="o">=</span><span class="s2">&#34;Useful to get similar recipes in response to a user query about food.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="p">),</span>
</span></span><span class="line"><span class="cl"><span class="p">]</span>
</span></span></code></pre></div><p>Lastly, the Agent construction code, which follows from the example, plus the new system <code>prompt</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">memory</span> <span class="o">=</span> <span class="n">ConversationBufferMemory</span><span class="p">(</span><span class="n">memory_key</span><span class="o">=</span><span class="s2">&#34;chat_history&#34;</span><span class="p">,</span> <span class="n">return_messages</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">llm</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span><span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">agent_chain</span> <span class="o">=</span> <span class="n">initialize_agent</span><span class="p">(</span><span class="n">tools</span><span class="p">,</span> <span class="n">llm</span><span class="p">,</span> <span class="n">prompt</span><span class="o">=</span><span class="n">prompt</span><span class="p">,</span> <span class="n">agent</span><span class="o">=</span><span class="n">AgentType</span><span class="o">.</span><span class="n">CHAT_CONVERSATIONAL_REACT_DESCRIPTION</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">memory</span><span class="o">=</span><span class="n">memory</span><span class="p">)</span>
</span></span></code></pre></div><p>No errors. Now time to run the agent to see what happens!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">agent_chain</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="nb">input</span><span class="o">=</span><span class="s2">&#34;Hi!&#34;</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">&gt; Entering new  chain...
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">    &#34;action&#34;: &#34;Final Answer&#34;,
</span></span><span class="line"><span class="cl">    &#34;action_input&#34;: &#34;Hello! How can I assist you today?&#34;
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">&gt; Finished chain.
</span></span><span class="line"><span class="cl">Hello! How can I assist you today?
</span></span></code></pre></div><p>Wait a minute, it ignored my <code>system</code> prompt completely! Dammit. Checking the <code>memory</code> variable confirms it. Looking into the <a href="https://python.langchain.com/docs/modules/memory/how_to/buffer">documentation</a> for <code>ConversationBufferMemory</code> and even <a href="https://github.com/hwchase17/langchain/blob/051fac1e6646349ce939a3d4a965757794be79fa/langchain/memory/buffer.py#L10">in the code itself</a> there&rsquo;s nothing about system prompts, even months after ChatGPT made them mainstream.</p>
<p>The <em>intended</em> way to use system prompts in Agents is to add an <code>agents_kwargs</code> parameter to <code>initialize_agent</code>, which I only just found out in an <a href="https://python.langchain.com/docs/modules/agents/how_to/use_toolkits_with_openai_functions">unrelated documentation page</a> published a month ago.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">agent_kwargs</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;system_message&#34;</span><span class="p">:</span> <span class="n">system_prompt</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Recreating the Agent with this new parameter and running it again results in a <code>JSONDecodeError</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">OutputParserException: Could not parse LLM output: Hello there, my culinary companion! How delightful to have you here in my whimsical kitchen. What delectable dish can I assist you with today?
</span></span></code></pre></div><p>Good news is that the system prompt definitely worked this time! Bad news is that it broke, but why? I didn&rsquo;t do anything weird, for once.</p>
<p>The root of the issue is to be how LangChain agents actually do <code>Tool</code> selection. Remember when I said that the Agent outputing a <code>dict</code> during the chain was peculiar? When <a href="https://github.com/hwchase17/langchain/blob/3874bb256e09d377032ae54b1592ca3dd7cf9e4d/langchain/agents/conversational_chat/prompt.py">looking at the LangChain code</a>, it turns out that tool selection is done by requiring the output to be <em>valid JSON</em> through prompt engineering, and just hoping everything goes well.</p>
<figure>

    <img loading="lazy" srcset="/2023/07/langchain-problem/json_hu_3344f49473ee98b7.webp 320w,/2023/07/langchain-problem/json_hu_355f4d5aa6273e85.webp 768w,/2023/07/langchain-problem/json_hu_5c485b1051b0efb6.webp 1024w,/2023/07/langchain-problem/json.png 1926w" src="json.png"
         alt="Fun fact: these massive prompts also increase API costs proportionally!"/> <figcaption>
            <p>Fun fact: these massive prompts also increase API costs proportionally!</p>
        </figcaption>
</figure>

<p>The consequence of this is that any significant changes in the structure of normal output, such as those caused by a custom system prompt, has a <em>random</em> chance of just breaking the Agent! These errors happen often enough that there&rsquo;s a <a href="https://python.langchain.com/docs/modules/agents/how_to/handle_parsing_errors">documentation page</a> dedicated to handling Agent output parsing errors!</p>
<p>Well, people in the internet are assholes anyways, so we can consider having a conversation with a chatbot as an edge case for now. What&rsquo;s important is that the bot can return the recipes, because if it can&rsquo;t even do that, there&rsquo;s no point in using LangChain. After creating a new Agent without using the system prompt and then asking it <code>What's a fun and easy dinner?</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">&gt; Entering new  chain...
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">    &#34;action&#34;: &#34;Similar Recipes&#34;,
</span></span><span class="line"><span class="cl">    &#34;action_input&#34;: &#34;fun and easy dinner&#34;
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">Observation: Recipe ID: recipe|1774221
</span></span><span class="line"><span class="cl">Recipe Name: Crab DipYour Guests will Like this One.
</span></span><span class="line"><span class="cl">---
</span></span><span class="line"><span class="cl">Recipe ID: recipe|836179
</span></span><span class="line"><span class="cl">Recipe Name: Easy  Chicken Casserole
</span></span><span class="line"><span class="cl">---
</span></span><span class="line"><span class="cl">Recipe ID: recipe|1980633
</span></span><span class="line"><span class="cl">Recipe Name: Easy in the Microwave Curry Doria
</span></span><span class="line"><span class="cl">Thought:{
</span></span><span class="line"><span class="cl">    &#34;action&#34;: &#34;Final Answer&#34;,
</span></span><span class="line"><span class="cl">    &#34;action_input&#34;: &#34;...&#34;
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">&gt; Finished chain.
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">Here are some fun and easy dinner recipes you can try:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">1. Crab Dip
</span></span><span class="line"><span class="cl">2. Easy Chicken Casserole
</span></span><span class="line"><span class="cl">3. Easy in the Microwave Curry Doria
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Enjoy your meal!
</span></span></code></pre></div><p>Atleast it worked: ChatGPT was able to extract out the recipes from the context and format them appropriately (even fixing typoes in the names!), and was able to decide when it was appropriate.</p>
<p>The real issue here is that the voice of the output is criminally <em>boring</em>, as is a common trademark and criticism of base-ChatGPT. Even if I did have a fix for the missing ID issue through system prompt engineering, it wouldn&rsquo;t be worth shipping anything sounding like this. If I did strike a balance between voice quality and output quality, the Agent count <em>still</em> fail randomly through no fault of my own. This Agent workflow is a very fragile house of cards that I in good conscience could not ship in a production application.</p>
<p>LangChain does have functionality for <a href="https://python.langchain.com/docs/modules/agents/how_to/custom_agent">Custom Agents</a> and a <a href="https://python.langchain.com/docs/modules/chains/how_to/custom_chain">Custom Chain</a>, so you can override the logic at parts in the stack (maybe? the documentation there is sparse) that could address some of the issues I hit, but at that point you are overcomplicating LangChain even more and might as well create your own Python library instead which&hellip;hmmm, that&rsquo;s not a bad idea!</p>
<h1 id="working-smarter-not-harder">Working Smarter, Not Harder</h1>
<figure>

    <img loading="lazy" srcset="/2023/07/langchain-problem/langchain_support_hu_956b659dc055dbf.webp 320w,/2023/07/langchain-problem/langchain_support_hu_e45f027ec04c0301.webp 768w,/2023/07/langchain-problem/langchain_support_hu_8175bc8cfee87d9.webp 1024w,/2023/07/langchain-problem/langchain_support.png 1578w" src="langchain_support.png"
         alt="The large numbers of random integrations raise more problems than solutions. via LangChain docs"/> <figcaption>
            <p>The large numbers of random integrations raise more problems than solutions. <a href="https://python.langchain.com/docs/use_cases/question_answering/">via LangChain docs</a></p>
        </figcaption>
</figure>

<p>LangChain does also have many utility functions such as <a href="https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter">text splitters</a> and <a href="https://python.langchain.com/docs/modules/data_connection/retrievers/how_to/vectorstore">integrated vector stores</a>, both of which are integral to the &ldquo;chat with a PDF/your code&rdquo; demos (which in my opinion are just a gimmick). The real issue with <a href="https://integrations.langchain.com">all these integrations</a> is that it creates an inherent lock-in to <em>only</em> use LangChain-based code, and if you look at the code for the integrations they are not very robust. LangChain is building a <a href="https://www.vaneck.com/nl/en/moat-investing/five-sources-of-moats-whitepaper.pdf">moat</a>, which is good for LangChain&rsquo;s investors trying to get a return on their $30 million, but very very bad for developers who use it.</p>
<p>In all, LangChain embodies the philosophy of &ldquo;it&rsquo;s complicated, so it must be better!&rdquo; that plagues late-stage codebases, except that LangChain isn&rsquo;t even a year old. The effort needed to hack LangChain to do what I want it to do would cause insane amounts of <a href="https://en.wikipedia.org/wiki/Technical_debt">technical debt</a>. And unlike AI startups nowadays, technical debt for my own projects with LangChain can&rsquo;t be paid with venture capital. API wrappers should at minimum reduce code complexity and cognitive load when operating with complex ecosystems because it takes enough mental brainpower to work with AI itself. LangChain is one of the few pieces of software that <em>increases</em> overhead in most of its popular use cases.</p>
<p>I came to the conclusion that it&rsquo;s just easier to make my own Python package than it is to hack LangChain to fit my needs. Therefore, I developed and open-sourced <a href="https://github.com/minimaxir/simpleaichat">simpleaichat</a>: a Python package for easily interfacing with chat apps, emphasizing minimal code complexity and decoupling advanced features like vector stores from the conversation logic to avoid LangChain&rsquo;s lock-in, and many other features which would take its own blog post to elaborate upon.</p>
<p>But this blog post wasn&rsquo;t written to be a stealth advertisement for simpleaichat by tearing down a competitor like what hustlers do. I didn&rsquo;t <em>want</em> to make simpleaichat: I&rsquo;d rather spend my time creating more cool projects with AI, and it&rsquo;s a shame I could not have done that with LangChain. I know someone will say &ldquo;why not submit a pull request to the LangChain repo since it&rsquo;s open source instead of complaining about it?&rdquo; but most of my complaints are fundamental issues with the LangChain library and can&rsquo;t be changed without breaking everything for its existing users. The only real fix is to burn it all down and start fresh, which is why my &ldquo;create a new Python library for interfacing with AI&rdquo; solution is also the most pragmatic.</p>
<p>I&rsquo;ve gotten many messages asking me &ldquo;what should I learn to get started with the ChatGPT API&rdquo; and I&rsquo;m concerned that they&rsquo;ll go to LangChain first because of the hype. If machine learning engineers who do have backgrounds in the technology stack have difficulty using LangChain due to its needless complexity, any beginner is going to drown.</p>
<p>No one wants to be that asshole who criticizes free and open source software operating in good faith like LangChain, but I&rsquo;ll take the burden. To be clear, I have nothing against Harrison Chase or the other maintainers of LangChain (who encourage feedback!). However, LangChain&rsquo;s popularity has warped the AI startup ecosystem around LangChain itself and the hope of OMG <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">AGI</a> I MADE SKYNET, which is why I am compelled to be honest with my misgivings about it.</p>
<p>Wars about software complexity and popularity despite its complexity are an eternal recurrence. In the 2010&rsquo;s, it was with <a href="https://react.dev">React</a>; in 2023, it&rsquo;s with ReAct.</p>
<hr>
<p><em>Jupyter Notebooks for the <a href="https://github.com/minimaxir/langchain-problems/blob/main/openai_rewrite.ipynb">simple implementations of LangChain examples</a> and the <a href="https://github.com/minimaxir/langchain-problems/blob/main/langchain_problems.ipynb">LangChain failure demo</a> are available in <a href="https://github.com/minimaxir/langchain-problems/tree/main">this GitHub repository</a>.</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
