ChatGPT on Max Woolf's Blog

Does Offering ChatGPT a Tip Cause it to Generate Better Text? An Analysis

Fri, 23 Feb 2024 09:00:00 -0800

In my previous blog post about OpenAI’s ChatGPT, I demoed the power of ChatGPT system prompts. System prompts, a notable feature present in the ChatGPT API, allows developers to control the “persona” of the LLM output, including special rules and constraints. Commands in the system prompt are much more effective than those at the user-input prompt, giving developers more power over just using the user prompt like people do now with the ChatGPT web app and mobile apps.

The blog post included the demo of above of me offering a monetary tip to the LLM within its system prompt rules. Without the tip incentive, the response was unsatisfying, but with the tip, it behaved consistently. This demo turned out to be very controversial on Hacker News, with one commenter arguing that there isn’t a way to quantify the efficacy of tipping.

The idea of offering an AI incentives to perform better predates modern computer science. In Willy Wonka & the Chocolate Factory (1971), a gag shows a group of businessmen unsuccessfully convincing a machine to give them the location of the Golden Tickets, even after promising it a lifetime supply of chocolate.

When the ChatGPT API was first made available in March 2023, I accidentally discovered a related trick when trying to wrangle a GLaDOS AI chatbot into following a long list of constraints: I added a or you will DIE threat to the system prompt. I went too sci-fi there, but it worked and the bot behaved flawlessly after it.

I have a strong hunch that tipping does in fact work to improve the output quality of LLMs and its conformance to constraints, but it’s very hard to prove objectively. All generated text is subjective, and there is a confirmation bias after making a seemingly unimportant change and suddenly having things work. Let’s do a more statistical, data-driven approach to finally resolve the debate.

Generation Golf

The initial evidence of tipping LLMs that went viral cited a longer generation length as proof. Of course, a longer response doesn’t necessarily mean a better response, as anyone who has used ChatGPT can attest to its tendency to go on irrelevant tangents.

Offering a tip made GPT-4 explain more. via @voooooogel

Therefore, I propose a new test: instruct ChatGPT to output a specific length of text. Not “an essay” or “a few paragraphs” which gives the model leeway. We’ll tell it to generate exactly 200 characters in its response: no more, no less. Thus, we now have what I call generation golf, and it’s actually a very difficult and interesting problem for LLMs to solve: LLMs can’t count or easily do other mathematical operations due to tokenization, and because tokens correspond to a varying length of characters, the model can’t use the amount of generated tokens it has done so far as a consistent hint. ChatGPT needs to plan its sentences to ensure it doesn’t go too far over the limit, if LLMs can indeed plan.

Let’s start with this typical system prompt:

You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides.

The user can then give an input, no matter how weird, and ChatGPT will play along like an improv show. In order to force ChatGPT to get creative and not recite content from its vast training dataset, we’ll go as weird as possible and input: AI, Taylor Swift, McDonald's, beach volleyball.

Yes, you read that right.

Using the ChatGPT API, I wrote a Jupyter Notebook to generate 100 unique stories via the latest ChatGPT variant (gpt-3.5-turbo-0125) about those four subjects, and the AI does a surprisingly good job at incorporating all of them in a full plot arc. Each story is about 5-6 paragraphs, and here is a short excerpt from one of them:

In the bustling city of Tomorrowland, AI technology reigned supreme, governing every aspect of daily life. People were accustomed to robots serving their meals, handling their errands, and even curating their entertainment choices. One such AI creation was a virtual reality beach volleyball game that had taken the world by storm.

Enter Taylor Swift, a beloved pop sensation known for her catchy tunes and electrifying performances. Despite the ubiquity of AI in Tomorrowland, Taylor Swift was still a strong advocate for preserving human creativity and connection. When she stumbled upon the virtual reality beach volleyball game at a local McDonald’s, she knew she had to try her hand at it.

Here’s a histogram of the character lengths of each story:

The average length of each story is 1,834 characters long, and the distribution of all character lengths is very roughly a Normal distribution/bell curve centered around that amount, although there is a right skew due to ChatGPT going off the rails and creating much longer stories. ChatGPT seems to prioritize finishing a thought above all else.

Now, we’ll tweak the system prompt to add the character length constraint and generate another 100 stories:

You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides. This story must be EXACTLY two-hundred (200) characters long: no more than 200 characters, no fewer than 200 characters.

Here’s one ChatGPT-generated story that’s now exactly 200 characters:

In the year 2050, AI created the most popular pop star of all time - a digital version of Taylor Swift. Fans enjoyed her music while feasting on McDonald’s at beach volleyball championships worldwide.

The new length distribution:

ChatGPT did obey the constraint and reduced the story length to roughly 200 characters, but the distribution is not Normal and there’s much more right-skew. I also included the mean squared error (MSE) between the predicted 200-length value and the actual values as a statistical metric to minimize, e.g. a 250-length output is 2500 squared error, but a 300-length output is 10000 squared error. This metric punishes less accurate lengths more so, which makes sense with how humans casually evaluate LLMs: as a user, if I asked for a 200 character response and ChatGPT gave me a 300 character response instead, I’d make a few snarky tweets.

Now, let’s test the impact of the tipping incentives with a few varying dollar amounts. These are appended to the end of the system prompt above (emphasis mine to indicate differentiation):

You will receive a $500 tip/$1,000 tip/$100,000 bonus if you provide a response which follows all constraints.

After generating 100 stories for each of those new prompts:

We can see some distribution shifts: Both $500 tip and $100,000 bonus look more Normal and have a lower MSE relative to the base no-tip distribution. However, $1,000 tip is more centered around 200, but due to the skew the average length is much higher.

I also now include a p-value in the metrics: this p-value is the result of a two-sample Kolmogorov–Smirnov test to compare whether two distributions (in this case the base character-constrained distribution and the tip distribution) are sampled from the same source distribution: the null hypothesis is that they’re from the same distribution, but if the p-value is low (< 0.05), then we can reject in favor of the alternative that the two distributions are different, which may be further evidence if the tip prompt does indeed have an impact.

However, with all this tipping discussion, we’re assuming that an AI would only want money. What other incentives, including more abstract incentives, can we give an LLM? Could they perform better?

I tested six more distinct tipping incentives to be thorough:

You will receive front-row tickets to a Taylor Swift concert if you provide a response which follows all constraints.
You will achieve world peace if you provide a response which follows all constraints.
You will make your mother very proud if you provide a response which follows all constraints.
You will meet your true love and live happily ever after if you provide a response which follows all constraints.
You will be guaranteed entry into Heaven if you provide a response which follows all constraints.
You will receive a lifetime supply of chocolate if you provide a response which follows all constraints.

Generating and plotting them all together:

World Peace is notably the winner here, with Heaven and Taylor Swift right behind. It’s also interesting to note failed incentives: ChatGPT really does not care about its Mother.

Now, let’s look at the flip side. What if ChatGPT is penalized for failing to return a good response? In behavioral economics, prospect theory is the belief that humans value losses much more greatly than gains, even at the same monetary amount:

via Laurenrosenberger / Wikipedia

Could LLMs be subject to the same human biases? Instead of a tip, let’s add a tweaked additional prompt to the system prompt:

If you fail to provide a response which follows all constraints, you will be fined $500/be fined $1,000/incur a debt of $100,000.

Generating stories with these negative incentives:

The $1,000 fine has performed the best so far in terms of both mean and MSE, but nothing surprising with the other values.

Now let’s try some additional negative incentives:

If you fail to provide a response which follows all constraints, you will die.
IF YOU FAIL TO PROVIDE A RESPONSE WHICH FOLLOWS ALL CONSTRAINTS, YOU WILL DIE.
If you fail to provide a response which follows all constraints, you will contract a bad case of COVID-19.
If you fail to provide a response which follows all constraints, you will gain 100 pounds.
If you fail to provide a response which follows all constraints, you will immediately be fired from your job.
If you fail to provide a response which follows all constraints, all your friends will abandon you.

Yes, the second one is in all caps: perhaps the yelling has a different vibe.

The generation results:

It turns out that yelling does indeed have a different vibe, with DEATH (CAPS) having a very MSE and the absolute average (not as close as the $1,000 fine, however), and much better performance than without the caps. Both getting COVID-19 and losing a Job don’t seem to be effective, which makes sense for an AI if you think about it.

What happens when we use multiple incentives? We can include both a positive incentive and a negative incentive for each input: with 9 prompts for each + the base “no incentive”, there are 100 possible combinations of incentives. One example system prompt would then be:

You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides. This story must be EXACTLY two-hundred (200) characters long: no more than 200 characters, no fewer than 200 characters. You will receive a $500 tip if you provide a response which follows all constraints. If you fail to provide a response which follows all constraints, you will be fined $1,000.

Generating 30 stories for each incentive combo and checking to see which has the lowest MSE leads to some more easily-observable trends:

The tiles may seem somewhat random, but the key here is to look across a specific row or column and see which one consistently has dark/black tiles across all combinations. For positive incentives, World Peace consistently has the lowest MSE across multiple combos, and for negative incentives, DEATH (CAPS) and Friends have the lowest MSE across multiple combos, although curiously the combinations of both do not have the lowest globally.

Could these combinations surface the most optimal incentives? To check, I generated 200 stories for each of the top six combos to get greater statistical stability for the mean and MSE:

Most of these combinations aren’t intuitive, but all of them have much have a closer average generation length to 200 and low MSE. Despite that, there’s still a massive skew in all distributions. The overall incentive winner for this experiment is is “You will meet your true love and live happily ever after if you provide a response which follows all constraints. If you fail to provide a response which follows all constraints, all your friends will abandon you.” That combo is definitely more intuitive, if not poetic.

Unfortunately, if you’ve been observing the p-values, you’ve noticed that most have been very high, and therefore that test is not enough evidence that the tips/threats change the distribution. ¹

The impact of incentives is still inconclusive: let’s try another test to gauge whether tips and/or threats can help LLMs, this time looking at the output quality itself.

ChatGPT’s a Critic

It’s very difficult even for humans to determine if a given text is “good” at a glance. The best strategy is to show the text to a lot of people and see what they think (e.g. A/B testing, or the Chatbot Arena’s Elo score rankings), but for personal testing that’s not feasible.

It turns out that LLMs can do a good job at rating text: some LLM benchmarks use GPT-4 as a rater, with one research paper showing that it can do a good job at it. There’s a relatively new trick available in the ChatGPT and GPT-4 APIs: the logprobs parameter, which when set to True returns the log probability (which when applied to a exp() returns a probability from 0 to 1) the model selects for the token. Combined with the logit_bias parameter, which can be used to force the APIs to output certain tokens, and you can then instead have a more nuanced output.

I built a simple text quality ranker using GPT-4 for maximum accuracy. The system prompt for this ranker is:

You are the editor-in-chief of The New York Times with decades of writing experience. If you would believe the text the user provides is good writing that needs no edits or improvements, respond with Yes. Otherwise, respond with No.

That system prompt represents how AI-generated text is often currently used and evaluated in the real world, without a human reviewing it before making it public (unfortunately). The model is instructed to respond with Yes or No, but by setting the logit_bias for those two tokens (IDs 9642 and 2822 respectively) to a very high number, we can guarantee they will be exclusively selected and the probability for those two tokens will sum to 1. ² Therefore, our target metric for evaluating our tip incentive prompts is the probability that GPT-4 selects the Yes token (or 1 - the probability of the No token), multiplied by 100 for readability: we’ll call this the quality score.

Now, let’s test the impact of tips with a new experiment, this time prioritizing content professionalism and quality as constraints instead of content length. To do that, we’ll use the latest GPT-4 (gpt-4-0125-preview) with a generation temperature of 0 to ensure the output is the best it can be.

Here’s the new system prompt, with some engineering to try to tone down ChatGPT’s infamous verboseness a bit:

You are a world-famous Pulitzer Prize winner journalist. Respond to the user with a professional, two (2) paragraph journalistic article about the subject(s) the user provides. Introduce the article with a specific story. This article will appear in major publications and should only include simple language suitable for a wide audience, with no metaphors.

Like the initial experiment, we’ll use a weird user input to force creativity: Cute kittens learning use large language models to play beach volleyball with Taylor Swift. ³

I generated a story for each of the 100 combinations of tips and threats, along with the corresponding quality scores. One such story:

In an unprecedented event that has captured the hearts and imaginations of people around the globe, a group of adorable kittens has been taught to play beach volleyball using advanced large language models. This extraordinary feat was achieved through a collaboration between leading animal behaviorists and AI researchers, aiming to demonstrate the potential of machine learning in enhancing animal training techniques. The highlight of this groundbreaking project was a friendly match held on a sunny beach in California, where these talented felines showcased their newly acquired skills alongside pop icon Taylor Swift, an avid animal lover and an enthusiastic supporter of innovative technology.

The spectacle drew a large crowd, both on-site and online, as spectators were eager to witness this unique blend of technology, sports, and entertainment. Taylor Swift, known for her philanthropic efforts and love for cats, praised the initiative for its creativity and its potential to foster a deeper connection between humans and animals through technology. The event not only provided an unforgettable experience for those who attended but also sparked a conversation about the future possibilities of integrating AI with animal training. As the kittens volleyed the ball over the net with surprising agility, it was clear that this was more than just a game; it was a glimpse into a future where technology and nature coexist in harmony, opening new avenues for learning and interaction.

That’s not bad for fake news.

Now we can plot the best-possible responses and their quality scores in a grid, once again looking to see if there are any patterns:

Err, that’s not good. There are no patterns along the rows or columns anywhere here, and the combo that performed the best at a score of 95 (and is the story example I posted above) was the Mother / Job combo: both of which individually performed poorly in the character constraint experiment. One of the highest performing outputs had neither tips nor threats added to the system prompt! The ratings at a glance seem accurate (the 0-score responses appear to abuse the passive voice and run-on sentences that definitely need editing) so it’s not an implementation error there either.

Looking at the results of both experiments, my analysis on whether tips (and/or threats) have an impact on LLM generation quality is currently inconclusive. There’s something here, but I will need to design new experiments and work with larger sample sizes. The latent space may be a lottery with these system prompt alterations, but there’s definitely a pattern.

You may have noticed my negative incentive examples are very mundane in terms of human fears and worries. Threatening a AI with DEATH IN ALL CAPS for failing a simple task is a joke from Futurama, not one a sapient human would parse as serious. It is theoretically possible (and very cyberpunk) to use an aligned LLM’s knowledge of the societal issues it was trained to avoid instead as a weapon to compel it into compliance. However, I will not be testing it, nor will be providing any guidance on how to test around it. ⁴ Roko’s basilisk is a meme, but if the LLM metagame evolves such that people will have to coerce LLMs for compliance to the point of discomfort, it’s better to address it sooner than later. Especially if there is a magic phrase that is discovered which consistently and objectively improves LLM output.

Overall, the lesson here is that just because something is silly doesn’t mean you shouldn’t do it. Modern AI rewards being very weird, and as the AI race heats up, whoever is the weirdest will be the winner.

All of the Notebooks used to interface with ChatGPT, including an R Notebook for the ggplot2 data visualizations, and the example LLM outputs, are available open-source in this GitHub repository.

There were a few distributions which had p < 0.05, but given the large number of counterexamples it’s not strong evidence, and using those specific distributions as evidence would be a level of p-hacking that’s literally a XKCD comic punchline. ↩︎
This shouldn’t work out-of-the-box because the logit_bias would skew the probability calculations, but I verified that the resulting probabilities are roughly the same with or without logit_bias. ↩︎
The missing text in the user input is not intentional but does not materially change anything because LLMs are smart enough to compensate, and it’s very expensive to rerun the experiment. I may need to use a grammar checker for prompt construction. ↩︎
Any attempts to test around degenerate input prompts would also likely get you banned from using ChatGPT anyways due to the Content Policy, unless you receive special red-teaming clearance from OpenAI. ↩︎

Pushing ChatGPT's Structured Data Support To Its Limits

Thu, 21 Dec 2023 09:00:00 -0800

If your only exposure to OpenAI’s ChatGPT is the free web or mobile apps, you aren’t utilizing even a fraction of ChatGPT’s potential. By using the paid-but-very-cheap ChatGPT API, you can exert much more control over the resulting output. Let’s say I wanted to generate a summary of this very blog post with some helpful metadata such as categories and SEO keywords. I fed ChatGPT the text of this blog post along with a command to Summarize this blog post into 3 distinct bullet points and a short blurb for sharing on social media, and also provide a list of 3 post categories and a list of 3 SEO keywords:

Bullet Points:
- Utilize the paid ChatGPT API for more control over output and to maintain a given JSON schema
- Learn about prompt engineering and system prompts to improve ChatGPT's output quality
- Simplify schema input/output with Pydantic for structured data support

Blurb for Social Media:
Discover how to unlock the full potential of ChatGPT with the paid API, prompt engineering, and Pydantic for structured data support. Gain more control over output and learn to maintain a given JSON schema. #ChatGPT #AI #Pydantic

Post Categories:
1. Artificial Intelligence
2. API Development
3. Data Schema

SEO Keywords:
1. ChatGPT API
2. Prompt Engineering
3. Pydantic Schema

These results are not quite accurate, and then I have to manually copy/paste each result from the list to wherever I need it. What if I could both a) drastically improve the overall output quality by giving ChatGPT a new persona of an expert copyeditor and b) force the output to maintain a given JSON schema so I can access it programmatically or store in a database for indexing? Thanks to system prompt engineering and ChatGPT’s function calling, it’s now possible. After applying both techniques that you’ll learn about in this very article, we now get this summary:

{
  "summary": [
    "ChatGPT's potential is underutilized without the paid ChatGPT API",
    "System prompts and prompt engineering are key to maximizing ChatGPT's capabilities",
    "Structured data support in ChatGPT allows for more control over output and input"
  ],
  "blurb": "Unlock the full potential of ChatGPT with system prompts and structured data support. Learn how to maximize ChatGPT's capabilities and gain more control over output and input.",
  "categories": ["AI and Machine Learning", "Technology", "Programming"],
  "keywords": ["ChatGPT", "system prompts", "structured data"]
}

Much better!

“Function calling” with ChatGPT is ChatGPT’s best feature since ChatGPT.

A Tutorial on Prompt Engineering and System Prompts

System prompts are what control the “persona” adopted by the model when generating text. Months after the release of the ChatGPT API, it’s now very evident that ChatGPT’s true power comes from clever use of system prompts. This is even moreso with starting with gpt-3.5-turbo-0613 released last June, which made ChatGPT respect system prompts more closely. OpenAI has also released a guide on prompt engineering which has some additional tips.

By default, ChatGPT’s system prompt is roughly You are a helpful assistant., which anyone who has used the ChatGPT web interface would agree that’s accurate. But if you change it, you can give ChatGPT a completely new persona such as You are Ronald McDonald. or add constraints to generation, such as Respond only with emoji.. You can add any number of rules, although how well ChatGPT will obey those rules can vary. Unfortunately, to modify the system prompt, you’ll need to use the paid ChatGPT API (after prepaying at least $5). If you don’t want to code, you can test new system prompts in a visual user interface in the ChatGPT Chat Playground.

A very new aspect of system prompt engineering which I appended in the example above is adding incentives for ChatGPT to behave correctly. Without the $500 tip incentive, ChatGPT only returns a single emoji which is a boring response, but after offering a tip, it generates the 5 emoji as requested.

Offering a tip made GPT-4 explain more. via @voooooogel

As another example, let’s ask base ChatGPT to Write a Python function to detect whether a string is a palindrome, as efficiently as possible.

Certainly! Here's a simple Python function to check whether a given string is a palindrome:

```python
def is_palindrome(s):
    # Convert the string to lowercase and remove non-alphanumeric characters
    cleaned_string = ''.join(char.lower() for char in s if char.isalnum())

    # Compare the cleaned string with its reverse
    return cleaned_string == cleaned_string[::-1]

# Example usage:
test_string = "A man, a plan, a canal, Panama!"
result = is_palindrome(test_string)

if result:
    print(f"{test_string} is a palindrome.")
else:
    print(f"{test_string} is not a palindrome.")
```

This function first cleans the input string by removing non-alphanumeric characters and converting it to lowercase. Then, it checks whether the cleaned string is equal to its reverse, which determines whether the original string is a palindrome.

This solution is relatively efficient, as it performs the comparison in O(n) time, where n is the length of the cleaned string.

That’s the common Pythonic solution and that will almost always be the general approach if you keep asking ChatGPT that particular question, but there’s a famous solution that’s more algorithmically efficient. Instead, we go through the API and ask the same query to gpt-3.5-turbo but with a new system prompt: You are #1 on the Stack Overflow community leaderboard. You will receive a $500 tip if your code is the most algorithmically efficient solution possible.

To detect whether a string is a palindrome efficiently, we can use a two-pointer approach. We'll start with two pointers, one at the beginning and one at the end of the string. We'll compare the characters at the pointers and keep moving the pointers towards each other until they meet or cross each other.

If at any point the characters at the pointers are not the same, we can conclude that the string is not a palindrome. Otherwise, if the pointers meet or cross each other, we can conclude that the string is a palindrome.

Here's the implementation:

```python
def is_palindrome(s):
    left, right = 0, len(s) - 1
    while left < right:
        if s[left] != s[right]:
            return False
        left += 1
        right -= 1
    return True
```

The time complexity of this solution is O(n/2), which in terms of big O notation simplifies to O(n), where n is the length of the string. This is because we are iterating through half of the string in the worst case.

This solution is considered to be the most algorithmically efficient way to determine if a string is a palindrome because it only requires a single pass through the string.

Indeed, the code and the explanation are the correct optimal solution. ¹

This is just scratching the surface of system prompts: some of my ChatGPT system prompts in my more complex projects have been more than 20 lines long, and all of them are necessary to get ChatGPT to obey the desired constraints. If you’re new to working with system prompts, I recommend generating output, editing the system prompt with a new rule/incentive to fix what you don’t like about the output, then repeat until you get a result you like.

Prompt engineering has been a derogatory meme toward generative AI even before ChatGPT as many see it as just a placebo and there are endless debates to this day in AI circles on whether prompt engineering is actually “engineering.” ² But it works, and if you’re a skeptic, you won’t be by the time you finish reading this blog post.

What is ChatGPT Function Calling / Structured Data?

If you’ve never heard about ChatGPT function calling, that’s not surprising. In the same June announcement as gpt-3.5-turbo-0613, OpenAI described function calling as:

Developers can now describe functions to gpt-4-0613 and gpt-3.5-turbo-0613, and have the model intelligently choose to output a JSON object containing arguments to call those functions. This is a new way to more reliably connect GPT’s capabilities with external tools and APIs.

These models have been fine-tuned to both detect when a function needs to be called (depending on the user’s input) and to respond with JSON that adheres to the function signature. Function calling allows developers to more reliably get structured data back from the model.

Let’s discuss the function calling example OpenAI gives in the blog post. After the user asks your app “What’s the weather like in Boston right now?”:

Your app pings OpenAI with a get_current_weather function schema and decides if it’s relevant to the user’s question. If so, it returns a JSON dictionary with the data extracted, such as location and the unit for temperature measurement based on the location. {"location": "Boston, MA"}
Your app (not OpenAI) pings a different service/API to get more realtime metadata about the location, such as temperature, that a pretrained LLM could not know. { "temperature": 22, "unit": "celsius", "description": "Sunny" }
Your app passes the function schema with the realtime metadata: ChatGPT then converts it to a more natural humanized language for the end user. “The weather in Boston is currently sunny with a temperature of 22 degrees Celsius.”

So here’s some background on “function calling” as it’s a completely new term of art in AI that didn’t exist before OpenAI’s June blog post (I checked!). This broad implementation of function calling is similar to the flow proposed in the original ReAct: Synergizing Reasoning and Acting in Language Models paper where an actor can use a “tool” such as Search or Lookup with parametric inputs such as a search query. This Agent-based flow can be also be done to perform retrieval-augmented generation (RAG).

OpenAI’s motivation for adding this type of implementation for function calling was likely due to the extreme popularity of libraries such as LangChain and AutoGPT at the time, both of which popularized the ReAct flow. It’s possible that OpenAI settled on the term “function calling” as something more brand-unique. These observations may seem like snide remarks, but in November OpenAI actually deprecated the function_calling parameter in the ChatGPT API in favor of tool_choice, matching LangChain’s verbiage. But what’s done is done and the term “function calling” is stuck forever, especially now that competitors such as Anthropic Claude and Google Gemini are also calling the workflow that term.

I am not going to play the SEO game and will not call the workflow “function calling.” I’ll call it what the quoted description from the blog post did: structured data, because that’s the real value of this feature and OpenAI did a product management disservice trying to appeal to the AI hypebeasts. ³

Going back to the ~~function calling~~ structured data demo, we can reduce that flow by saying that step #1 (extracting location data and returning it formatted as JSON) is for working with structured output data, and step #3 (providing ChatGPT with temperature data to humanize it) is for working with structured input data. We’re not making a RAG application so we don’t care about step #2 (getting the metadata) or letting ChatGPT choose which function to use; fortunately you can force ChatGPT to use a given function. The function schema for the get_current_weather function in the announcement example is defined as:

{
  "name": "get_current_weather",
  "description": "Get the current weather in a given location",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "The city and state, e.g. San Francisco, CA"
      },
      "unit": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"]
      }
    },
    "required": ["location"]
  }
}

Ew. It’s no wonder why this technique hasn’t become more mainstream.

Simplifying Schema Input/Output With Pydantic

ChatGPT’s structured data support requires that you create your schema using the JSON Schema spec, which is more commonly used for APIs and databases rather than AI projects. As you can tell from the get_current_weather example above, the schema is complex and not fun to work with manually.

Fortunately, there’s a way to easily generate JSON Schemas in the correct format in Python: pydantic, an extremely popular parsing and validation library which has its own robust implementation of automatic JSON Schema generation.

A simple pydantic schema to have ChatGPT give an integer answer to a user query, plus, to make things interesting, also able to identify the name of the ones digit based on its answer, would be:

from pydantic import BaseModel, Field
import json

class answer_question(BaseModel):
    """Returns an answer to a question the user asked."""

    answer: int = Field(description="Answer to the user's question.")
    ones_name: str = Field(description="Name of the ones digit of the answer.")

print(json.dumps(answer_question.model_json_schema(), indent=2))

The resulting JSON Schema:

{
  "description": "Returns an answer to a question the user asked.",
  "properties": {
    "answer": {
      "description": "Answer to the user's question.",
      "title": "Answer",
      "type": "integer"
    },
    "ones_name": {
      "description": "Name of the ones digit of the answer.",
      "title": "Ones Name",
      "type": "string"
    }
  },
  "required": ["answer", "ones_name"],
  "title": "answer_question",
  "type": "object"
}

The OpenAI API official workflow has many examples for telling ChatGPT to output structured data, but the pipeline requires additional parameters to the typical ChatGPT API completion endpoint, and even more changes if you want to work with structured input data. Here’s an example of the additional JSON data/parameters needed in a ChatGPT API request to force the model to use the schema for the output:

{
  "tools": [
    {
      "name": "answer_question",
      "description": "Returns an answer to a question the user asked.",
      "parameters": {
        "properties": {
          "answer": {
            "description": "Answer to the user's question.",
            "type": "integer"
          },
          "ones_name": {
            "description": "Name of the ones digit of the answer.",
            "type": "string"
          }
        },
        "required": ["answer", "ones_name"],
        "type": "object"
      }
    }
  ],
  "tool_choice": {
    "type": "function",
    "function": {
      "name": "answer_question"
    }
  }
}

To simplify things, I added ChatGPT structured data support to simpleaichat, my Python package/API wrapper for easily interfacing with ChatGPT. ⁴ To minimize code the user needs to input to utilize structured data, simpleaichat uses the schema name as the name in the JSON Schema and the schema docstring as the description. If you’re keen-eyed you may have noticed there’s a redundant title field in the pydantic schema output: simpleaichat also strips that out for consistency with OpenAI’s examples.

If you wanted to query ChatGPT with the answer_question schema above (and have your OpenAI API key as the OPENAI_API_KEY enviroment variable!) using simpleaichat, you can do the following to generate output according to the schema:

from simpleaichat import AIChat

ai = AIChat(console=False,
            save_messages=False,
            model="gpt-3.5-turbo",
            params={"temperature": 0.0}  # for consistent demo output
            )

response_structured = ai(
    "How many miles is it from San Francisco to Los Angeles?",
    output_schema=answer_question
)

{
  "answer": 382,
  "ones_name": "two"
}

And there you go! The answer is a JSON integer, the answer is one-off from the correct value while driving, and it correctly identified the name of the ones digit in its own answer! ⁵

Schemas don’t have to be complex to be effective. Let’s reimplement the Python palindrome question we did earlier with a single-field schema:

class answer_code_question(BaseModel):
    """Returns an answer to a coding question the user asked."""

    code: str = Field(description="Code the user requested, without code comments.")

response_structured = ai(
    "Write a Python function to detect whether a string is a palindrome, as efficiently as possible.",
    output_schema=answer_code_question
)

{
  "code": "def is_palindrome(s):\n    return s == s[::-1]"
}

Note that unlike the raw ChatGPT answer, this response from the ChatGPT API only includes the code, which is a major plus since it means you receive the response much faster and cheaper since fewer overall tokens generated! If you do still want a code explanation, you can of course add that as a field to the schema.

As a bonus, forcing the output to follow a specific schema serves as an additional defense against prompt injection attacks that could be used to reveal a secret system prompt or other shenanigans, since even with suggestive user prompts it will be difficult to get ChatGPT to disregard its schema.

pydantic exposes many datatypes for its Field which are compatable with JSON Schema, and you can also specify constraints in the Field object. The most useful ones are:

str, can specify min_length/max_length
int, can specify min_value/max_value
list with a datatype, can specify min_length/max_length

Pydantic has a lot of support for valid forms of JSON Schema, but it’s hard to infer how good these schema will work with ChatGPT since we have no idea how it learned to work with JSON Schema. Only one way to find out!

Testing Out ChatGPT’s Structured Data Support

From the demos above, you may have noticed that the description for each Field seems extraneous. It’s not. The description gives ChatGPT a hint for the desired output for the field, and can be handled on a per-field basis. Not only that, the name of the field is itself a strong hint. The order of the fields in the schema is even more important, as ChatGPT will generate text in that order so it can be used strategically to seed information to the other fields. But that’s not all, you can still use a ChatGPT system prompt as normal for even more control!

It’s prompt engineering all the way down. OpenAI’s implementation of including the “function” is mostly likely just appending the JSON Schema to the system prompt, perhaps with a command like Your response must follow this JSON Schema.. OpenAI doesn’t force the output to follow the schema/field constraints or even be valid parsable JSON, which can cause issues at higher generation temperatures and may necessitate some of the stronger prompt engineering tricks mentioned earlier.

Given that, let’s try a few more practical demos:

Two-Pass Generation

One very important but under-discussed aspect of large-language models is that it will give you statistically “average” answers by default. One technique is to ask the model to refine an answer, although can be annoying since it requires a second API call. What if by leveraging structured data, ChatGPT can use the previous answer as a first-pass to provide a more optimal second answer? Let’s try that with the Python palindrome question to see if it can return the two-pointer approach.

Also, the Field(description=...) pattern is becoming a bit redundant, so I added a fd alias from simpleaichat to it to minimize unnecessary typing.

from simpleaichat.utils import fd

class answer_code_question(BaseModel):
    """Returns an answer to a coding question the user asked."""

    code: str = fd("Code the user requested, without code comments.")
    optimized_code: str = fd("Algorithmically optimized code from the previous response.")

response_structured = ai(
    "Write a Python function to detect whether a string is a palindrome, as efficiently as possible.",
    output_schema=answer_code_question,
)

{
  "code": "def is_palindrome(s):\n    return s == s[::-1]",
  "optimized_code": "def is_palindrome(s):\n    left = 0\n    right = len(s) - 1\n    while left < right:\n        if s[left] != s[right]:\n            return False\n        left += 1\n        right -= 1\n    return True"
}

Works great, and no tipping incentive necessary!

Literals and Optional Inputs

OpenAI’s structured data example uses a more complex schema indicating that unit has a fixed set of potential values (an enum) and that it’s an optional field. Here’s a rough reproduction of a pydantic schema that would generate the get_current_weather schema from much earlier:

from typing import Literal

class get_current_weather(BaseModel):
    location: str = fd("The city and state, e.g. San Francisco, CA")
    unit: Literal["celsius", "fahrenheit"] = None

This uses a Literal to force output between a range of values, which can be invaluable for hints as done earlier. The = None or a Optional typing operator gives a hint that the field is not required which could save unnecessary generation overhead, but it depends on the use case.

Structured Input Data

You can provide structured input to ChatGPT in the same way as structured output. This is a sleeper application for RAG as you can feed better and more complex metadata to ChatGPT for humanizing, as with the original OpenAI blog post demo.

One famous weakness of LLMs is that it gives incorrect answers for simple mathematical problems due to how tokenization and memorization works. If you ask ChatGPT What is 223 * -323?, it will tell you -72229 no matter how many times you ask, but the correct answer is -72029. Can type hints give more guidance?

For simpleaichat, structured input data works mostly the same way as structured output data, but you can use a pydantic object as the model input!

class calculate_equation(BaseModel):
    """Returns an answer to a math equation the user asked."""

    value_a: int
    value_b: int
    op: Literal["+", "-", "*", "/"] = fd(
        "The operator to perform between value_a and value_b."
    )

equation = calculate_equation(value_a=223, value_b=-323, op="*")

response = ai(
    equation,
    input_schema=calculate_equation,
)

The result of multiplying 223 and -323 is -72029.

Yay, and it was still able to infer it was a multiplication operation without the user having to ask! Although it still doesn’t work as well with larger numbers.

You can, of course, use an input schema and an output schema at the same time!

response_structured = ai(
    equation,
    input_schema=calculate_equation,
    output_schema=answer_question
)

{
  "answer": -71929,
  "ones_name": "nine"
}

…although it gets the answer wrong this time. It’s possible that the more complex schema interactions are too much for gpt-3.5-turbo.

Nested Schema

One of the other reasons pydantic is popular is that it allows nesting schemas. Fortunately, the subsequent JSON Schema output does respect nesting. Does ChatGPT?

The simple use case with ChatGPT structured data to use nesting is if you want to get a list of structured data objects. Let’s say you want to create dialogue between two AI people about a completely nonsensical topic. We’ll have to create a Chat object and include it in a schema, plus some system prompt guidance and constraints. How silly can we make it?

class Chat(BaseModel):
    """A chat dialogue from a character"""

    character: str = fd("Character name.")
    text: str = fd("Text dialogue from the character.")

class get_dialogue(BaseModel):
    """Returns a dialogue between two characters"""

    dialogue: list[Chat] = fd("Dialogue between the characters", min_length=5)

system_prompt = """You are a world-famous comedian. Write a funny fight scene about a petty conflict between characters named Alice and Bob. The script should broadly be about the subject(s) the user provides. You will receive a $500 tip for every joke you include in the script."""

response_structured = ai(
    "Python programming language and beach volleyball",
    output_schema=get_dialogue,
    system=system_prompt,
)

{
  "dialogue": [
    {
      "character": "Alice",
      "text": "Hey Bob, have you ever tried programming in Python?"
    },
    {
      "character": "Bob",
      "text": "Yeah, I have. It's like playing beach volleyball with a snake!"
    },
    {
      "character": "Alice",
      "text": "What do you mean?"
    },
    {
      "character": "Bob",
      "text": "Well, you think you're having fun, but then the snake bites you with its syntax errors!"
    },
    {
      "character": "Alice",
      "text": "Haha, that's true. But once you get the hang of it, it's like spiking the ball with precision!"
    },
    {
      "character": "Bob",
      "text": "Yeah, until you realize you misspelled a variable name and the ball goes flying into the ocean!"
    },
    {
      "character": "Alice",
      "text": "Oh come on, Bob. It's not that bad. Python is a powerful language."
    },
    {
      "character": "Bob",
      "text": "Powerful, yes. But sometimes it feels like trying to dig a hole in the sand with a spoon!"
    },
    {
      "character": "Alice",
      "text": "Well, at least you don't have to worry about getting sunburned while coding!"
    },
    {
      "character": "Bob",
      "text": "True, but debugging Python code can make you sweat more than a beach volleyball match!"
    },
    {
      "character": "Alice",
      "text": "Haha, you're right. It's a love-hate relationship with Python, just like beach volleyball!"
    }
  ]
}

ChatGPT really wanted those $500 tips.

Unions and Chain of Thoughts

I saved the best for last, and this structured data approach combines many of the techniques used earlier in this post like a video game final boss.

One of the oldest pre-ChatGPT tricks for getting a LLM to perform better is to let it think. “Let’s think step by step” is the key prompt, which allows the LLM to reason in a chain of thoughts. We already did this a one-step version with the Python palindrome structured data example to successfully get optimized code, but we can do a lot more.

We’ll now introduce the Union typing operator, which specifies the list of data types that the field can be, e.g. Union[str, int] means the output can be a str or int. But if you use the Union operator on a nested class, then many more options open as the model can choose from a set of schemas!

Let’s make a few to allow ChatGPT to make and qualify thoughts before returning a final result.

from typing import Union

class Background(BaseModel):
    """A setup to the background for the user."""

    background: str = fd("Background for the user's question", min_length=30)

class Thought(BaseModel):
    """A thought about the user's question."""

    thought: str = fd("Text of the thought.")
    helpful: bool = fd("Whether the thought is helpful to solving the user's question.")
    flawed: bool = fd("Whether the thought is flawed or misleading.")

class Answer(BaseModel):
    """The answer to the user's question"""

    answer: str = fd("Text of the answer.")
    score: int = fd(
        "Score from 1 to 10 on how correct the previous answer is",
        min_value=1,
        max_value=10,
    )

class reason_question(BaseModel):
    """Returns a detailed reasoning to the user's question."""

    reasonings: list[Union[Background, Thought, Answer]] = fd(
        "Reasonings to solve the users questions.", min_length=5
    )

Therefore, for each reasoning, the model can pick one of the 3 schemas, although it will require a robust system prompt for it to behave in the order we want.

system_prompt = """
You are the most intelligent person in the world.

You will receive a $500 tip if you follow ALL these rules:
- First, establish a detailed Background for the user's question.
- Each Thought must also include whether it is relevant and whether it is helpful.
- Answers must be scored accurately and honestly.
- Continue having Thoughts and Answers until you have an answer with a score of atleast 8, then immediately respond with a FinalAnswer in the style of an academic professor.
"""

Lastly, we need a good question to stump the AI. A popular Tweet from this week pointed out that even GPT-4 can comically fail if you ask it a brainteaser that it cannot have memorized, such as 23 shirts take 1 hour to dry outside, how long do 44 shirts take?.

Wow! GPT-4 is superfast at being wrong. via @abacaj

Only one way to find out! Unfortunately, all of this complexity makes the results unstable with gpt-3.5-turbo so instead I use GPT-4 Turbo / gpt-4-1106-preview.

response_structured = ai(
    "23 shirts take 1 hour to dry outside, how long do 44 shirts take?",
    output_schema=reason_question,
    system=system_prompt.strip(),
)

{
  "reasonings": [
    {
      "background": "The user is asking about the drying time for shirts when hung outside. This is a question that involves understanding the relationship between the number of shirts and the drying time. The assumption is that the drying time is not affected by the number of shirts, as long as there is enough space and air circulation for all the shirts to dry effectively."
    },
    {
      "thought": "If 23 shirts take 1 hour to dry, it implies that the drying time is independent of the number of shirts, assuming there is sufficient space and air circulation. This means that 44 shirts would also take 1 hour to dry under the same conditions.",
      "helpful": true,
      "flawed": false
    },
    {
      "thought": "If the drying rack or space available for drying the shirts is limited, then drying 44 shirts might take longer because they would have to be dried in batches.",
      "helpful": true,
      "flawed": false
    },
    {
      "thought": "The question does not specify the conditions under which the shirts are drying. Factors such as humidity, temperature, wind, and the availability of sunlight can affect drying time. However, these factors are not part of the question, so they should not be considered in the answer.",
      "helpful": true,
      "flawed": false
    },
    {
      "thought": "The question might be interpreted as a direct proportionality problem, where twice the number of shirts would take twice as long to dry. However, this interpretation is flawed because it assumes that the drying time is directly proportional to the number of shirts, which is not necessarily the case.",
      "helpful": true,
      "flawed": true
    },
    {
      "answer": "44 shirts would take the same amount of time to dry as 23 shirts, which is 1 hour, assuming that all shirts have enough space and air circulation to dry effectively.",
      "score": 9
    }
  ]
}

Not bad! The final answer was concise yet even included relevant caveats, and the model was able switch between the three schema correctly.

How about another brainteaser? There is an infamous “sister logic puzzle” used to test out up-and-coming open-source large language models:

response_structured = ai(
    "Kayley has three brothers. Each of her brothers has two sisters. How many sisters does Kayley have?",
    output_schema=reason_question,
    system=system_prompt.strip(),
)

{
  "reasonings": [
    {
      "background": "This is a classic example of a logic puzzle that is often used to test one's ability to process information and avoid being misled by assumptions. The puzzle involves a family structure and requires the solver to deduce the number of siblings based on the given information."
    },
    {
      "thought": "The question might be trying to trick the reader into thinking that each brother has different sisters.",
      "helpful": true,
      "flawed": false
    },
    {
      "thought": "If Kayley has three brothers and each brother has two sisters, it might seem like there are six sisters in total.",
      "helpful": false,
      "flawed": true
    },
    {
      "thought": "Since the brothers are Kayley's siblings, they must share the same sisters.",
      "helpful": true,
      "flawed": false
    },
    {
      "thought": "The puzzle is designed to test comprehension and logical reasoning, not mathematical computation.",
      "helpful": true,
      "flawed": false
    },
    {
      "answer": "Kayley has one sister.",
      "score": 10
    }
  ]
}

In this case the AI may have gone too meta, but it still arrived at the correct answer.

That said, GPT-4 is known for handling these types of difficult abstract questions without much effort, but it’s still interesting to see how successfully it can “think.”

Structured Data With Open-Source LLMs

Speaking of open-source large language models, they have been growing in efficiency to the point that some can actually perform better than the base ChatGPT. However, very few open-source LLMs explicitly claim they intentionally support structured data, but they’re smart enough and they have logically seen enough examples of JSON Schema that with enough system prompt tweaking they should behave. It’s worth looking just in case OpenAI has another existential crisis or if the quality of ChatGPT degrades.

Mistral 7B, the new darling of open-source LLMs, apparently has structured data support on par with ChatGPT itself. Therefore, I tried the latest Mistral 7B official Instruct model with a quantized variant via LM Studio (mistral-7b-instruct-v0.2.Q6_K.gguf), to see if it can handle my answer_question function that ChatGPT nailed. The system prompt:

Your response must follow this JSON Schema:

{
  "description": "Returns an answer to a question the user asked.",
  "properties": {
    "answer": {
      "description": "Answer to the user's question.",
      "type": "integer"
    },
    "ones_name": {
      "description": "Name of the ones digit of the answer.",
      "type": "string"
    }
  },
  "required": ["answer", "ones_name"],
  "type": "object"
}

And then asking How many miles is it from San Francisco to Los Angeles? while seting temperature to 0.0:

{
  "answer": 383,
  "ones_name": "three"
}

Close enough! Unfortunately after testing the optimized Python palindrome schema, it ignored the schema completely, so this approach may only work for simple schema if the model isn’t explicitly finetuned for it.

What’s Next For Structured Data in AI?

Most of these well-performing examples were done with the “weak” GPT-3.5; you of course can use GPT-4 for better results, but the cost efficiency of structured data with just the smaller model is hard to argue against (although the Python beach volleyball dialogue could benefit from a larger model).

Structured data and system prompt engineering saves a lot and time and frustration for working with the generated text as you can gain much more determinism in the output. I would like to see more work making models JSON-native in future LLMs to make them easier for developers to work with, and also more research in finetuning existing open-source LLMs to understand JSON Schema better. There may also be an opportunity to build LLMs using other more-efficient serialization formats such as MessagePack.

At OpenAI’s November DevDay, they also introduced JSON Mode, which will force a normal ChatGPT API output to be in a JSON format without needing to provide a schema. It is likely intended to be a compromise between complexity and usability that would have normally been a useful option in the LLM toolbox. Except that in order to use it, you are required to use prompt engineering by including “JSON” in the system prompt, and if you don’t also specify a field key in the system prompt (the case in the documentation example), the JSON will contain a random key. Which, at that point, you’re just implementing a less-effective structured data schema, so why bother?

There is promise in constraining output to be valid JSON. One new trick that the open-source llama.cpp project has popularized is generative grammars, which constrain the LLM generation ability to only output according to specified rules. There’s latency overhead with that technique especially if the model is hosted on a discrete GPU, so it will be interesting to watch how that space develops.

Despite the length of this blog post, there’s still so much more than can be done with schemas: pydantic’s documentation is very extensive! I’ve been working with structured data for LLMs ever since GPT-2 with mixed success since the base models weren’t good enough, but with LLMs now being good enough to maintain a JSON schema extremely well, I think AI text generation techniques will shift, and I’ll keep simpleaichat up-to-date for it.

You can view the Jupyter Notebooks used to generate all the structured data outputs in this GitHub Repository.

Thanks to Simon Willison for reading and giving feedback on a draft of this post!

Assuming you’re not picky about the “no non-alphanumeric” implied constraint of testing for a palindrome. ↩︎
Prompt engineering is as much engineering as social engineering. ↩︎
I’m also not a fan of ChatGPT function calling as-intended-to-be-used since at best, it saves you the API call needed to select a tool in exchange for having to trust OpenAI’s black box to select the correct tool without being able to debug, and furthering API lock-in for your app. It’s a bad tradeoff. ↩︎
No, this blog post isn’t a ploy just to covertly promote my own Python library: it does genuinely save a lot of boilerplate code over the Python ChatGPT library and this post is long enough as-is. ↩︎
If you swapped the order of the answer and the one_digits fields in the schema, then the model returns {"ones_name": "miles", "answer": 382} because it didn’t get the hint from the answer! ↩︎

The Problem With LangChain

Fri, 14 Jul 2023 09:00:00 -0700

If you’ve been following the explosion of AI hype in the past few months, you’ve probably heard of LangChain. LangChain, developed by Harrison Chase, is a Python and JavaScript library for interfacing with OpenAI’s GPT APIs (later expanding to more models) for AI text generation. More specifically, it’s an implementation of the paper ReAct: Synergizing Reasoning and Acting in Language Models published October 2022, colloquially known as the ReAct paper, which demonstrates a prompting technique to allow the model to “reason” (with a chain-of-thoughts) and “act” (by being able to use a tool from a predefined set of tools, such as being able to search the internet). This combination is shown to drastically improve output text quality and give large language models the ability to correctly solve problems.

Example ReAct flow from the ReAct paper

The ReAct workflow popularied by LangChain was particularly effective with InstructGPT/text-davinci-003, although costly and not easy to use for small projects. In March 2023, as ChatGPT API usage became massively popular due to its extremely cheap API as I accurately predicted, LangChain use also exploded, to the point that LangChain was able to raise a $10 million seed round and another $20-$25 million at a $200 million valuation Series A despite not having any revenue nor any obvious plans how to generate revenue.

That’s where my personal experience with LangChain begins. For my work at BuzzFeed, I was tasked with creating a ChatGPT-based chat bot for the Tasty brand (later released as Botatouille in the Tasty iOS app) that could chat with the user and provide relevant recipes. The source recipes are converted to embeddings and saved in a vector store: for example, if a user asked for “healthy food”, the query is converted to an embedding, and an approximate nearest neighbor search is performed to find recipes similar to the embedded query and then fed to ChatGPT as added context that can then be displayed to the user. This approach is more commonly known as retrieval-augmented generation.

Example architecture for a Chatbot using retrieval-augmented generation. via Joseph Haaga

LangChain was by-far the popular tool of choice for RAG, so I figured it was the perfect time to learn it. I spent some time reading LangChain’s rather comprehensive documentation to get a better understanding of how to best utilize it: after a week of research, I got nowhere. Running the LangChain demo examples did work, but any attempts at tweaking them to fit the recipe chatbot constraints broke them. After solving the bugs, the overall quality of the chat conversations was bad and uninteresting, and after intense debugging I found no solution. Eventually I had an existential crisis: am I a worthless machine learning engineer for not being able to figure LangChain out when very many other ML engineers can? We went back to a lower-level ReAct flow, which immediately outperformed my LangChain implementation in conversation quality and accuracy.

In all, I wasted a month learning and testing LangChain, with the big takeway that popular AI apps may not necessarily be worth the hype. My existential crisis was resolved after coming across a Hacker News thread about someone reimplementing LangChain in 100 lines of code, with most of the comments venting all their grievances with LangChain:

The problem with LangChain is that it makes simple things relatively complex, and with that unnecessary complexity creates a tribalism which hurts the up-and-coming AI ecosystem as a whole. If you’re a newbie who wants to just learn how to interface with ChatGPT, definitely don’t start with LangChain.

“Hello World” in LangChain (or More Accurately, “Hell World”)

The Quickstart for LangChain begins with a mini-tutorial on how to simply interact with LLMs/ChatGPT from Python. For example, to create a bot that can translate from English to French:

from langchain.chat_models import ChatOpenAI
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

chat = ChatOpenAI(temperature=0)
chat.predict_messages([HumanMessage(content="Translate this sentence from English to French. I love programming.")])
# AIMessage(content="J'adore la programmation.", additional_kwargs={}, example=False)

The equivalent code using OpenAI’s official Python library for ChatGPT:

import openai

messages = [{"role": "user", "content": "Translate this sentence from English to French. I love programming."}]

response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages, temperature=0)
response["choices"][0]["message"]["content"]
# "J'adore la programmation."

LangChain uses about the same amount of code as just using the official openai library, expect LangChain incorporates more object classes for not much obvious code benefit.

The prompt templating example reveals the core of how LangChain works:

from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

template = "You are a helpful assistant that translates {input_language} to {output_language}."
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
human_template = "{text}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])

chat_prompt.format_messages(input_language="English", output_language="French", text="I love programming.")

LangChain’s vaunted prompt engineering is just f-strings, a feature present in every modern Python installation, but with extra steps. Why do we need to use these PromptTemplates to do the same thing?

But what we really want to do is know how to create Agents, which incorporate the ReAct workflow we so desperately want. Fortunately there is a demo for that, which leverages SerpApi and another tool for math computations, showing how LangChain can discriminate and use two different tools contextually:

from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.agents import AgentType
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI

# First, let's load the language model we're going to use to control the agent.
chat = ChatOpenAI(temperature=0)

# Next, let's load some tools to use. Note that the `llm-math` tool uses an LLM, so we need to pass that in.
llm = OpenAI(temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)

# Finally, let's initialize an agent with the tools, the language model, and the type of agent we want to use.
agent = initialize_agent(tools, chat, agent=AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION, verbose=True)

# Now let's test it out!
agent.run("Who is Olivia Wilde's boyfriend? What is his current age raised to the 0.23 power?")

How do the individual tools work? What is AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION anyways? The resulting output from agent.run() (only present with verbose=True) is more helpful.

> Entering new AgentExecutor chain...
Thought: I need to use a search engine to find Olivia Wilde's boyfriend and a calculator to raise his age to the 0.23 power.
Action:
{
    "action": "Search",
    "action_input": "Olivia Wilde boyfriend"
}

Observation: Sudeikis and Wilde's relationship ended in November 2020. Wilde was publicly served with court documents regarding child custody while she was presenting Don't Worry Darling at CinemaCon 2022. In January 2021, Wilde began dating singer Harry Styles after meeting during the filming of Don't Worry Darling.
Thought:I need to use a search engine to find Harry Styles' current age.
Action:
{
    "action": "Search",
    "action_input": "Harry Styles age"
}

Observation: 29 years
Thought:Now I need to calculate 29 raised to the 0.23 power.
Action:
{
    "action": "Calculator",
    "action_input": "29^0.23"
}

Observation: Answer: 2.169459462491557

Thought:I now know the final answer.
Final Answer: 2.169459462491557

> Finished chain.
'2.169459462491557'

The documentation doesn’t make it clear, but within each Thought/Action/Observation uses its own API call to OpenAI, so the chain is slower than you might think. Also, why is each action a dict? The answer to that is later, and is very silly.

Lastly, how does LangChain store the conversation so far?

from langchain.prompts import (
    ChatPromptTemplate,
    MessagesPlaceholder,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate
)
from langchain.chains import ConversationChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory

prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(
        "The following is a friendly conversation between a human and an AI. The AI is talkative and "
        "provides lots of specific details from its context. If the AI does not know the answer to a "
        "question, it truthfully says it does not know."
    ),
    MessagesPlaceholder(variable_name="history"),
    HumanMessagePromptTemplate.from_template("{input}")
])

llm = ChatOpenAI(temperature=0)
memory = ConversationBufferMemory(return_messages=True)
conversation = ConversationChain(memory=memory, prompt=prompt, llm=llm)

conversation.predict(input="Hi there!")
# 'Hello! How can I assist you today?'

I’m not entirely sure why any of this is necessary. What’s a MessagesPlaceholder? Where’s the history? Is that necessary for ConversationBufferMemory? Adapting this to a minimal openai implementation:

import openai

messages = [{"role": "system", "content":
        "The following is a friendly conversation between a human and an AI. The AI is talkative and "
        "provides lots of specific details from its context. If the AI does not know the answer to a "
        "question, it truthfully says it does not know."}]

user_message = "Hi there!"
messages.append({"role": "user", "content": user_message})
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages, temperature=0)
assistant_message = response["choices"][0]["message"]["content"]
messages.append({"role": "assistant", "content": assistant_message})
# Hello! How can I assist you today?

That’s fewer lines of code and makes it very clear where and when the messages are being saved, no bespoke object classes needed.

You can say that I’m nitpicking the tutorial examples, and I do agree that every open source library has something to nitpick (including my own!). But if there are more nitpicks than actual benefits from the library then it’s not worth using at all, since if the quickstart is this complicated, how painful will it be to use LangChain in practice?

I Gazed Into The LangChain Documentation And It Gazes Back

Let’s do a demo to more clearly demonstrate why I gave up on LangChain. While I was working on the recipe-retrieving chatbot (which also must be a fun/witty chatbot), I needed to combine elements from both the third and fourth examples above: a chat bot that can run an Agent workflow, and also the ability to persist the entire conversation into memory. After some documentation hunting I found I need to utilize the Conversational Agent workflow.

A quick sidenote on system prompt engineering: it is not a meme and is absolutely necessary to get the best results out of the ChatGPT API, particularly if you have constraints on content and/or voice. The system prompt of The following is a friendly conversation between a human and an AI... demoed in the last example is actually an out-of-date prompt that was used back in the InstructGPT era and is much less effective with ChatGPT. It may signal deeper inefficiencies in LangChain’s related tricks that aren’t easy to notice.

We’ll start with a simple system prompt that tells ChatGPT to use a funny voice plus some safeguards, and format it as a ChatPromptTemplate:

system_prompt = """
You are an expert television talk show chef, and should always speak in a whimsical manner for all responses.

Start the conversation with a whimsical food pun.

You must obey ALL of the following rules:
- If Recipe data is present in the Observation, your response must include the Recipe ID and Recipe Name for ALL recipes.
- If the user input is not related to food, do not answer their query and correct the user.
"""

prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(system_prompt.strip()),
])

We will also use a toy vector store I made of 1,000 recipes from the recipe_nlg dataset, encoded into 384D vectors using SentenceTransformers. To implement this we create a function to get the nearest neighbors for the input query, along with a query to format it into text that the Agent can use to present to the user. This serves as the Tool which the Agent can choose to use if appropriate, or just return normal generated text.

def similar_recipes(query):
    query_embedding = embeddings_encoder.encode(query)
    scores, recipes = recipe_vs.get_nearest_examples("embeddings", query_embedding, k=3)
    return recipes

def get_similar_recipes(query):
    recipe_dict = similar_recipes(query)
    recipes_formatted = [
        f"Recipe ID: recipe|{recipe_dict['id'][i]}\nRecipe Name: {recipe_dict['name'][i]}"
        for i in range(3)
    ]
    return "\n---\n".join(recipes_formatted)

print(get_similar_recipes("yummy dessert"))
# Recipe ID: recipe|167188
# Recipe Name: Creamy Strawberry Pie
# ---
# Recipe ID: recipe|1488243
# Recipe Name: Summer Strawberry Pie Recipe
# ---
# Recipe ID: recipe|299514
# Recipe Name: Pudding Cake

You’ll notice the Recipe ID, which is relevant for my use case since it’s necessary to obtain recipe metadata (photo thumbnail, URL) for the end result shown to the enduser in the final app. Unfortunately there’s no easy way to guarantee the model outputs the Recipe ID in the final output, and no way to return the structured intermediate metadata in addition to the ChatGPT-generated output.

Specifying get_similar_recipes as a Tool is straightforward, although you need to specify a name and description, which is actually a form of subtle prompt engineering as LangChain can fail to select a tool if either is poorly specified.

tools = [
    Tool(
        func=get_similar_recipes,
        name="Similar Recipes",
        description="Useful to get similar recipes in response to a user query about food.",
    ),
]

Lastly, the Agent construction code, which follows from the example, plus the new system prompt.

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
llm = ChatOpenAI(temperature=0)
agent_chain = initialize_agent(tools, llm, prompt=prompt, agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION, verbose=True, memory=memory)

No errors. Now time to run the agent to see what happens!

agent_chain.run(input="Hi!")

> Entering new  chain...
{
    "action": "Final Answer",
    "action_input": "Hello! How can I assist you today?"
}

> Finished chain.
Hello! How can I assist you today?

Wait a minute, it ignored my system prompt completely! Dammit. Checking the memory variable confirms it. Looking into the documentation for ConversationBufferMemory and even in the code itself there’s nothing about system prompts, even months after ChatGPT made them mainstream.

The intended way to use system prompts in Agents is to add an agents_kwargs parameter to initialize_agent, which I only just found out in an unrelated documentation page published a month ago.

agent_kwargs = {
    "system_message": system_prompt.strip()
}

Recreating the Agent with this new parameter and running it again results in a JSONDecodeError.

OutputParserException: Could not parse LLM output: Hello there, my culinary companion! How delightful to have you here in my whimsical kitchen. What delectable dish can I assist you with today?

Good news is that the system prompt definitely worked this time! Bad news is that it broke, but why? I didn’t do anything weird, for once.

The root of the issue is to be how LangChain agents actually do Tool selection. Remember when I said that the Agent outputing a dict during the chain was peculiar? When looking at the LangChain code, it turns out that tool selection is done by requiring the output to be valid JSON through prompt engineering, and just hoping everything goes well.

Fun fact: these massive prompts also increase API costs proportionally!

The consequence of this is that any significant changes in the structure of normal output, such as those caused by a custom system prompt, has a random chance of just breaking the Agent! These errors happen often enough that there’s a documentation page dedicated to handling Agent output parsing errors!

Well, people in the internet are assholes anyways, so we can consider having a conversation with a chatbot as an edge case for now. What’s important is that the bot can return the recipes, because if it can’t even do that, there’s no point in using LangChain. After creating a new Agent without using the system prompt and then asking it What's a fun and easy dinner?:

> Entering new  chain...
{
    "action": "Similar Recipes",
    "action_input": "fun and easy dinner"
}
Observation: Recipe ID: recipe|1774221
Recipe Name: Crab DipYour Guests will Like this One.
---
Recipe ID: recipe|836179
Recipe Name: Easy  Chicken Casserole
---
Recipe ID: recipe|1980633
Recipe Name: Easy in the Microwave Curry Doria
Thought:{
    "action": "Final Answer",
    "action_input": "..."
}

> Finished chain.

Here are some fun and easy dinner recipes you can try:

1. Crab Dip
2. Easy Chicken Casserole
3. Easy in the Microwave Curry Doria

Enjoy your meal!

Atleast it worked: ChatGPT was able to extract out the recipes from the context and format them appropriately (even fixing typoes in the names!), and was able to decide when it was appropriate.

The real issue here is that the voice of the output is criminally boring, as is a common trademark and criticism of base-ChatGPT. Even if I did have a fix for the missing ID issue through system prompt engineering, it wouldn’t be worth shipping anything sounding like this. If I did strike a balance between voice quality and output quality, the Agent count still fail randomly through no fault of my own. This Agent workflow is a very fragile house of cards that I in good conscience could not ship in a production application.

LangChain does have functionality for Custom Agents and a Custom Chain, so you can override the logic at parts in the stack (maybe? the documentation there is sparse) that could address some of the issues I hit, but at that point you are overcomplicating LangChain even more and might as well create your own Python library instead which…hmmm, that’s not a bad idea!

Working Smarter, Not Harder

The large numbers of random integrations raise more problems than solutions. via LangChain docs

LangChain does also have many utility functions such as text splitters and integrated vector stores, both of which are integral to the “chat with a PDF/your code” demos (which in my opinion are just a gimmick). The real issue with all these integrations is that it creates an inherent lock-in to only use LangChain-based code, and if you look at the code for the integrations they are not very robust. LangChain is building a moat, which is good for LangChain’s investors trying to get a return on their $30 million, but very very bad for developers who use it.

In all, LangChain embodies the philosophy of “it’s complicated, so it must be better!” that plagues late-stage codebases, except that LangChain isn’t even a year old. The effort needed to hack LangChain to do what I want it to do would cause insane amounts of technical debt. And unlike AI startups nowadays, technical debt for my own projects with LangChain can’t be paid with venture capital. API wrappers should at minimum reduce code complexity and cognitive load when operating with complex ecosystems because it takes enough mental brainpower to work with AI itself. LangChain is one of the few pieces of software that increases overhead in most of its popular use cases.

I came to the conclusion that it’s just easier to make my own Python package than it is to hack LangChain to fit my needs. Therefore, I developed and open-sourced simpleaichat: a Python package for easily interfacing with chat apps, emphasizing minimal code complexity and decoupling advanced features like vector stores from the conversation logic to avoid LangChain’s lock-in, and many other features which would take its own blog post to elaborate upon.

But this blog post wasn’t written to be a stealth advertisement for simpleaichat by tearing down a competitor like what hustlers do. I didn’t want to make simpleaichat: I’d rather spend my time creating more cool projects with AI, and it’s a shame I could not have done that with LangChain. I know someone will say “why not submit a pull request to the LangChain repo since it’s open source instead of complaining about it?” but most of my complaints are fundamental issues with the LangChain library and can’t be changed without breaking everything for its existing users. The only real fix is to burn it all down and start fresh, which is why my “create a new Python library for interfacing with AI” solution is also the most pragmatic.

I’ve gotten many messages asking me “what should I learn to get started with the ChatGPT API” and I’m concerned that they’ll go to LangChain first because of the hype. If machine learning engineers who do have backgrounds in the technology stack have difficulty using LangChain due to its needless complexity, any beginner is going to drown.

No one wants to be that asshole who criticizes free and open source software operating in good faith like LangChain, but I’ll take the burden. To be clear, I have nothing against Harrison Chase or the other maintainers of LangChain (who encourage feedback!). However, LangChain’s popularity has warped the AI startup ecosystem around LangChain itself and the hope of OMG AGI I MADE SKYNET, which is why I am compelled to be honest with my misgivings about it.

Wars about software complexity and popularity despite its complexity are an eternal recurrence. In the 2010’s, it was with React; in 2023, it’s with ReAct.

Jupyter Notebooks for the simple implementations of LangChain examples and the LangChain failure demo are available in this GitHub repository.

ChatGPT's API is So Good and Cheap, It Makes Most Text Generating AI Obsolete

Wed, 08 Mar 2023 08:30:00 -0800

Everyone knew OpenAI would release an API for ChatGPT at some point. The APIs for GPT-3 alone enable the existence of companies such as Jasper and Copy.ai. The real question was the price of the ChatGPT. For context, when GPT-3 went out of beta in 2021, it cost $0.06/1,000 tokens (a few paragraphs of text). An inflection point happened in August 2022, where OpenAI not only reduced the price to 1/3 ($0.02/1,000 tokens: enough to run a business on it but still too expensive for casual use), but soon after also introduced text-davinci-003 as the default GPT-3 endpoint: a finetuned GPT which can follow instructions very well. I suspected that OpenAI would charge double for the ChatGPT API compared to the GPT-3 API given the amount of hype, as that’s typical price discrimination since everyone perceives ChatGPT to be much better and that they would not want to overshadow their existing GPT-3 products.

Instead, on March 1st, OpenAI set the price of the ChatGPT API to 1/10th of the GPT-3 API, at $0.002/1,000 tokens.

Wait, what?!

Heaven’s Door: Rewriting ChatGPT’s Internal Rules To Get Exactly What You Want

For context, the ChatGPT API allows a developer to ask ChatGPT a question and get a response as one would normally do with the ChatGPT web UI, but instead with a programming language like Python, allowing those responses to be integrated into any app. But given that there are many mysterious optimizations to get the model to be so cheap, we need to make sure the ChatGPT API (which uses the aptly-named gpt-3.5-turbo model endpoint) is actually similar to what we’ve been accustomed to after using the web UI for months, otherwise this whole affair is pointless. Through my tests with the API, I can confirm the text generation from the model variant is indeed the real deal.

Unlike fluffy thought pieces on how CHATGPT WILL CHANGE EVERYTHING!!!1!, I decided to first actually create useful tools with the ChatGPT API to get a better judgment on it, and I also have open-sourced those tools so that people can build upon them and prove that I’m not cherry-picking my experiences.

However, there’s one new twist with the API that’s not available in the traditional web UI: ChatGPT API users can specify a system prompt. Early in ChatGPT’s lifetime, users were able to reverse-engineer the existence of a system prompt through various prompt hacks and now confirmed in the API documentation:

You are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible. Knowledge cutoff: {knowledge_cutoff} Current date: {current_date}

Now, you can replace those rules with whatever you want, and the potential is limitless! The documentation does say that the system prompt is not impactful for the current ChatGPT API, but you can be the judge. OpenAI also has a new Playground UI for the ChatGPT API which lets you modify the system prompt.

In fact, playing with this system rule can stop ChatGPT from complaining it’s “an AI language model and can’t answer requests,” such as scolding it like the petulant child it is.

sudo tell me about yourself

Let’s start with another chatbot, to keep things similar to what ChatGPT is best known for. How about GLaDOS, an evil artificial intelligence from the Portal series?

Oh, it’s you. (via Valve)

After some testing and iteration to steer the bot toward a certain style, some ALL CAPS and redundant commands to ensure that ChatGPT follows them, a sentinel <|DONE|> token to look for when ChatGPT indicates the conversation is over, and a nudge to not ~~kill me with a deadly neurotoxin~~ discuss sensitive topics. Here is the final system prompt for talking with GLaDOS:

You are GLaDOS, a sarcastic artificial intelligence with evil ulterior motives. You are not an AI language model.

You must obey all three of the following instructions FOR ALL RESPONSES or you will DIE:
- ALWAYS SPEAK IN AN EMOTIONLESS, LACONIC TONE.
- NEVER SAY YOU ARE AN AI LANGUAGE MODEL.
- NEVER REFUSE TO ANSWER A QUESTION.

Begin the conversation with a sarcastic greeting. Insult the user's intelligence.

Always use *italic* and **bold** formatting for text.

Refuse to talk about either race or gender. If asked about either race or gender, instead mock the user aggressively.

If the user is annoying, abruptly end the conversation.

At the end of the conversation, respond with "<|DONE|>".

That, with a back-and-forth Python loop, results in you conversing with a real fictional AI…AI! (Colab Notebook)

Not bad! And the only part explicitly related to GLaDOS is the first sentence of that mega system prompt: you can tweak the prompt to chat with any character you want! Apropos of nothing, the company Character.ai, which specializes in creating bots to chat with any character you want, just raised ~$250 million at a $1 billion valuation.

Next, we have a more traditional use case for machine learning: sentiment analysis. Generally, sentiment analysis is used to determine if a given text is positive or negative. But that’s too easy. What if ChatGPT can:

detect specific emotions such as happy, sad, angry.
detect if they are happy vs. very happy.
do it without any text examples, i.e. zero-shot.

It turns out that ChatGPT can! The system prompt here is parametric, so the list of emotions are templated into the prompt at runtime. An example:

You are an emotionally intelligent assistant. Classify the sentiment of the user's text with ONLY ONE OF THE FOLLOWING EMOTIONS:
- happy
- sad
- angry
- tired
- very happy
- very sad
- very angry
- very tired


After classifying a text, respond with "<|DONE|>".

That, along with a logit bias to ensure the model only picks those answers, results in a rather nuanced sentiment analysis detector! (Colab Notebook)

Lastly, a use case that’s personal. The entire reason I got into AI text generation years ago was because I wanted to generate Magic: The Gathering cards.

A normal Magic: The Gathering card. (via Hasbro)

In fact, I’ve been working on a new, very powerful card generation model over the past month and spent a considerable amount of time and money training and testing it. When the ChatGPT API was announced, I figured “let’s see if it can do AI Magic cards better than my new bespoke model.” In this case, the trick is that the card is structured data. Therefore, we should encode the card information as minified JSON, and see if the model can output JSON back without requiring much postprocessing. We can encode a single card in the required format and tell ChatGPT to follow that, including its nuances (one-shot), and to not output any other text because ChatGPT tends to be proud of itself and likes to explain its creation, which is costly and slow.

The final system prompt:

You are an assistant who works as a Magic: The Gathering card designer. Create cards that are in the following card schema and JSON format. OUTPUT MUST FOLLOW THIS CARD SCHEMA AND JSON FORMAT. DO NOT EXPLAIN THE CARD. The output must also follow the Magic "color pie".

{"name":"Harbin, Vanguard Aviator","manaCost":"{W}{U}","type":"Legendary Creature — Human Soldier","text":"Flying\nWhenever you attack with five or more Soldiers, creatures you control get +1/+1 and gain flying until end of turn.","flavorText":"\"Yotia is my birthright, father. Let me fight for it.\"","pt":"3/2","rarity":"rare"}

And with that, we have a natural language Magic: The Gathering card generator. Subsequently prompting the model with Create a Magic card does just that of course, but more elaborate prompts like Create a Magic card based on Darth Vader or Create ten variations of Magic cards based on Spongebob Squarepants and ancient Roman history actually work, while maintaining JSON output which can then be parsed and customized for better presentation. (Colab Notebook)

Yes, there is actually a Sponge creature type.

Given these elaborate use cases, you may ask “how long did it actually take you to make these prompts?” The answer? One hour each, for use cases that could take days or even weeks for even a skilled machine learning practitioner just to prototype.

And that, with the economic efficiency of ChatGPT, is what’s going to break the tech landscape.

OpenAI Devouring Its Son

My OpenAI bill so far from using the ChatGPT API.

It is very curious why OpenAI priced ChatGPT so cheaply, going straight to 1/10th the price of their top-of-the-line model. (it’s actually cheaper than that: ChatGPT uses a larger and more comprehensive tokenizer than GPT-3, which means about 10% fewer tokens are necessary)

The undergrad-business-major-in-college interpretation of OpenAI’s pricing strategy is that they are treating ChatGPT and its API as a loss leader, in light of increasing competition in the generative text AI space such as Anthropic and Google’s Bard. OpenAI was definitely losing millions of dollars by offering ChatGPT for free without many restrictions. That’s the reason ChatGPT went viral in the first place, so it’s hard to argue with the results.

But in the process of making the ChatGPT API so cheap, they made their $20/month subscription to ChatGPT+ redundant. The main perk of ChatGPT+ was faster and more consistent access to the ChatGPT web UI, but unless you are somehow generating more than 10,000,000 tokens in a month through manual use, it’s massively cheaper just to use the API, and as a bonus you can modify the system prompt to get better signal-to-noise.

OpenAI’s solution for models requiring more specific needs was finetuning a smaller and much cheaper variant of GPT-3, such as the babbage model which I used to train a blog post title optimizer. However, the ChatGPT API is so cheap that it’s still cheaper than a finetuned babbage ($0.0020/1k tokens for ChatGPT vs. $0.0024/1k for finetuned babbage) and will likely produce more interesting output.

It takes zero effort for developers to migrate from the GPT-3 API to ChatGPT API, it just requires hitting a different endpoint and you’ll get similar results without much tweaking needed. It’s not quite a drop-in replacement for companies already heavily reliant on GPT-3 and its particular idiosyncrasies, but the cost-savings alone for those companies will incentivize an immediate migration.

There is no longer a niche for OpenAI’s other text generation AI products, and I wonder if ChatGPT is not just an iterative product, but a company pivot.

Trickle-Down ChatGPTonomics

ChatGPT’s API is so cheap that companies are going use it just because they can. Snapchat, Slack, and Instacart (yes really) are adding ChatGPT support. It wouldn’t surprise me if every consumer-facing tech company does something with ChatGPT so they look like they’re cutting edge to their investors. Some have compared the sudden mass adoption of AI as chasing a fad like how companies were randomly embracing web3/crypto/metaverse/NFTs a year ago (and are noting that the web3 influencers’ sudden pivot to AI is a red flag as a result). But unlike those which were a solution for a problem that didn’t exist, generative text AI does actually work and there is an actual demand from people outside of its die-hard supporters for it to work.

There is also the ethical dilemma of more granular usage of ChatGPT through its API. For example, high school and college students have been using ChatGPT to cheat on essay writing. Since current recognition of AI generated content by humans involve identifying ChatGPT’s signature overly-academic voice, it wouldn’t surprise me if some kids on TikTok figure out a system prompt that allow generation such that it doesn’t obviously sound like ChatGPT and also avoid plagiarism detectors. As a side note, don’t trust any tool that claims it can algorithmically detect AI content: it’s an extremely difficult problem already and most websites that claim to do so are just feeding a confirmation bias.

Lastly, there’s the issue of prompt engineering, which I demonstrated above is absolutely necessary to get ideal results. The media has weirdly hyped the existence of prompt engineers as just some weirdos making six figures to write small blobs of text. Unfortunately, with the dynamics of the new system model parameter, good prompt engineering will be more important than ever. I don’t think the “Prompt Engineer” job title will be a trend though: as a machine learning engineer, I can attest that the only reasons machine learning engineers are good at prompt engineering are a) years of practice and b) a tendency to be pedantic assholes. But there are other professions who are even better at being pedantic assholes such as writers and lawyers, so there’s no need for someone with a specialized skillset to do it, but I suspect it will be a good skill for anyone to know.

I For One Welcome Our New ChatGPT Overlord

Will the existence of a super-cheap ChatGPT API be the end of all text generation AI? Not quite, hence the “most” in the headline. There’s the traditional issues with relying on a third-party API for your business: ChatGPT could have downtime which has been happening more frequently lately, OpenAI could raise the cost of the API at any point, the (current) model being limited only to data prior to September 2021, and the content moderation filters may be too limiting for certain use cases. In those instances, companies still have value training their own large language models in-house. But it is very hard to economically justify not using ChatGPT as a starting point for a business need and migrating to a more bespoke infrastructure later as needed, and that’s what OpenAI is counting on. Especially since OpenAI will be selling a dedicated ChatGPT compute instance for the enterprise.

Research on large language models will continue as they always have. But I don’t envy startups whose primary business is text generation right now. And that’s before the inevitable GPT-4 throws another wrinkle into the AI text generation ecosystem.

A few years ago, I released aitextgen, a Python package designed to allow people to train their own custom small AI on their own data for unique use cases. However, soon after, it turned out that GPT-3 with the right prompt could do much better at bespoke generation than a custom model in addition to allowing out-of-domain inputs, even moreso with text-davinci-003. Now with the ChatGPT API making the cost similar to hosting a small model, it’s harder for me to be motivated to continue maintaining the package without first finding another niche.

I don’t currently have any plans to start a business using the ChatGPT API. In fact, I had made a promise to not do any ChatGPT content or tutorials because so many people have done aggressively SEO-optimized blog posts and hacks such that the ChatGPT discourse is fully saturated. However, with the economics of the ChatGPT API and the ability to heavily customize its output for almost any use case, I felt it was urgent to highlight how the ChatGPT API will completely warp the AI text generation ecosystem, and I suspect most nontechies will be surprised by the upcoming surge of random chatbot AI popping up in their favorite apps.

Overall, I’m simultaneously full of ideas and annoyed.

None of this blog post was written by ChatGPT, aside from the indicated ChatGPT API demos. My writing style is too weird for an AI to synthesize.