Social Media Analytics on Max Woolf's Blog

Can LLMs write better code if you keep asking them to “write better code”?

Thu, 02 Jan 2025 09:30:00 -0800

In November 2023, after OpenAI added the ability for ChatGPT to generate images from DALL-E 3 within the ChatGPT web interface, there was a short-lived meme where users gave the LLM a base image and kept asking the model to “make it more X”, where X can be anything.

A regular guy becomes more “bro” every time. via /u/Jojop0tato on Reddit.

Asked ChatGPT to make Santa Claus more and more serious. via /u/hessihan on Reddit.

The trend quickly died as all of these images were very samey and uninteresting, aside from the unexplainable trend that all of the examples eventually converged into something cosmic, irrespective of the starting image and the prompt. Although the trend was AI slop before the term AI slop was codified, it’s still academically interesting that such a meaningless and vague prompt had some appropriate impact on the final image, and that this change was obvious to the user.

What would happen if we tried a similar technique with code? LLM-generated code is unlikely to be slop (although not impossible) as it follows strict rules, and unlike creative outputs such as images, code quality can be measured more objectively.

If code can indeed be improved simply through iterative prompting such as asking the LLM to “make the code better” — even though it’s very silly — it would be a massive productivity increase. And if that’s the case, what happens if you iterate on the code too much? What’s the equivalent of code going cosmic? There’s only one way to find out!

Casually Coding With An LLM

Despite researching and developing tooling around LLMs even long before ChatGPT, I haven’t been fond of using LLM code copilots such as GitHub Copilot for coding assistance. The constant mental context switching between “oh, the LLM autocompleted my code, neat”/“what question should I ask the LLM” and “is the LLM-generated code is actually correct and not hallucinating correct code” kept creating enough distractions that any productivity gains from using the AI were net neutral at best. That’s also disregarding the expensive cost of using said LLMs.

Claude 3.5 Sonnet has made me rethink things. Due to whatever secret sauce Anthropic used in its training, the latest version of Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) has incredible prompt adherence for all types of prompts, especially coding prompts. Coding benchmarks confirm that testing between Claude 3.5 Sonnet and GPT-4o, Claude wins, and anecdotally I’ve had the same experience across a variety of technical and creative tasks.

Initial Ask

For this experiment, we will give Claude 3.5 Sonnet an interview-style coding prompt using Python: one that is both simple and could be implemented by a novice software engineer, but one that can be heavily optimized. This simple, casual prompt represents how the typical software engineer uses LLMs. Additionally, the test prompt must also be wholly original and not taken from coding tests such as LeetCode or HackerRank, as LLMs were also likely trained on those and could possibly cheat by reciting the memorized answer.

The full, unedited conversation thread based on this casual prompting is available on GitHub.

Therefore, here’s the test prompt that I’ve written myself:

Write Python code to solve this problem:

Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.

Providing that as the user prompt to the Claude API and setting temperature=0 for the best/most deterministic answer ¹, results in this implementation:

import random

def digit_sum(n):
    """Calculate the sum of digits of a number"""
    return sum(int(digit) for digit in str(n))

def find_difference():
    # Generate list of 1 million random integers
    numbers = [random.randint(1, 100000) for _ in range(1000000)]

    # Initialize variables for min and max numbers with digit sum 30
    min_num = float('inf')  # Initialize to positive infinity
    max_num = float('-inf')  # Initialize to negative infinity

    # Find numbers whose digits sum to 30
    for num in numbers:
        if digit_sum(num) == 30:
            min_num = min(min_num, num)
            max_num = max(max_num, num)

    # Check if we found any numbers with digit sum 30
    if min_num == float('inf') or max_num == float('-inf'):
        return "No numbers found with digit sum of 30"

    return max_num - min_num

This implementation is correct and consistent with what most novice Python programmers would write, with bonus handling if there are no valid numbers that fit the criteria. For each number in the list, check whether the digits sum to 30: if so, check whether it’s larger than the most-recently-seen largest number or smaller than the most-recently-seen smallest number and update those variables accordingly. After searching through the list, return the difference.

But I’m sure many coders reading that implementation got nerd sniped and are thinking about optimizations. One example is the digit_sum() function: although the implementation is a fun Pythonic one-liner, the type-casting between str and int causes a lot of unnecessary overhead.

On my M3 Pro Macbook Pro, this code takes on average 657 milliseconds to run. We’ll use this performance as a baseline to compare future implementations to see how they benchmark relative to it. (spoilers: they’re all faster)

Iteration #1

Now we can ask Claude to improve the code by including its current answer and all that came before it as a part of the conversation prompt. Thus, we add the iteration prompt:

write better code

What, you though the title of this blog post was a joke? I never joke.

Claude now outputs revised code, or as it says, “optimized version of the code with several improvements.” Instead of colocating all the code in functions, it decided to refactor it into a Python class and made it more object-oriented:

class DigitSumFinder:
    def __init__(self, target_sum: int = 30,
                 range_start: int = 1,
                 range_end: int = 100_000,
                 count: int = 1_000_000):
        self.target_sum = target_sum
        self.range_start = range_start
        self.range_end = range_end
        self.count = count

        # Pre-calculate digit sums for all possible numbers
        self.digit_sums = self._precompute_digit_sums()

    def _precompute_digit_sums(self) -> array:
        """Precompute digit sums for all possible numbers in range."""
        digit_sums = array('B', [0] * (self.range_end + 1))
        for num in range(self.range_start, self.range_end + 1):
            total = 0
            n = num
            while n:
                total += n % 10
                n //= 10
            digit_sums[num] = total
        return digit_sums

    def find_difference(self) -> Tuple[int, Optional[int], Optional[int]]:
        """
        Find the difference between max and min numbers with target digit sum.
        Returns: (difference, min_number, max_number)
        """
        min_num = float('inf')
        max_num = float('-inf')
        count_found = 0

        # Generate and process random numbers
        for _ in range(self.count):
            num = random.randint(self.range_start, self.range_end)
            if self.digit_sums[num] == self.target_sum:
                count_found += 1
                if num < min_num:
                    min_num = num
                if num > max_num:
                    max_num = num

        if count_found == 0:
            return 0, None, None

        return max_num - min_num, min_num, max_num

Here, the code does two smart algorithmic improvements:

When computing the digit sum it uses integer operations and avoids the need for the type-casting mentioned previously.
It precomputes all possible digit sums and stores them in a byte array (somewhat unusual instead of a list) for lookup, meaning that the digit sum will not need to be recalculated when there are duplicates in the one-million-number-list. Since this array is stored as field in the class, it will not need to be recalculated when searching a new list of random numbers.

These optimizations speed up the code 2.7x relative to the initial implementation.

Iteration #2

After another write better code, Claude discovers more obvious optimizations (truncated for readabilty):

class DigitSumAnalyzer:
    """Analyzes numbers based on their digit sums."""

    def __init__(self, number_range: NumberRange):
        self.number_range = number_range
        self._digit_sums = self._precompute_digit_sums()

    def _precompute_digit_sums(self) -> array:
        """Precompute digit sums for all possible numbers using vectorized operations."""
        digits = np.arange(self.number_range.end + 1)
        digit_sums = np.zeros(self.number_range.end + 1, dtype=np.uint8)

        while digits.any():
            digit_sums += digits % 10
            digits //= 10

        return array('B', digit_sums)

    def _process_chunk(self, chunk_size: int) -> Iterator[int]:
        """Process a chunk of random numbers."""
        numbers = np.random.randint(
            self.number_range.start,
            self.number_range.end + 1,
            chunk_size,
            dtype=np.int32
        )
        mask = np.frombuffer(self._digit_sums, dtype=np.uint8)[numbers] == self.number_range.target_sum
        yield from numbers[mask]

    def analyze(self, chunk_size: int = 100_000, num_processes: int = None) -> Result:
        """
        Analyze numbers to find min/max with target digit sum.

        Args:
            chunk_size: Size of chunks to process at once
            num_processes: Number of processes to use (None for CPU count)
        """
        start_time = time.perf_counter()
        min_num = float('inf')
        max_num = float('-inf')
        numbers_found = 0

        num_chunks = (self.number_range.count + chunk_size - 1) // chunk_size

        with ProcessPoolExecutor(max_workers=num_processes) as executor:
            futures = [
                executor.submit(self._process_chunk,
                              min(chunk_size, self.number_range.count - i * chunk_size))
                for i in range(num_chunks)
            ]

            for future in tqdm(futures, desc="Processing chunks"):
                for num in future.result():
                    numbers_found += 1
                    min_num = min(min_num, num)
                    max_num = max(max_num, num)

        execution_time = time.perf_counter() - start_time

        if numbers_found == 0:
            return Result(None, None, 0, execution_time, 0)

        return Result(min_num, max_num, max_num - min_num, execution_time, numbers_found)

Claude now has added two more optimizations, finally realizing that this coding problem is an embarrassingly parallel problem:

Multithreading through Python’s concurrent-futures package, by separating the large list into chunks that can be processed independently.
Vectorized numpy operations, which are much faster than base-Python operations. Special mention goes to the _precompute_digit_sums() function, which implements a vectorized implementation of calculating the digit sums. The conditional while digits.any(): is galaxy-brain code, but it works correctly.

However, there’s an issue with this particular implementation of parallelization: it generates subprocesses, which causes many annoying issues, including being unable to run it as-is inline, and it must be invoked with a main() guard which limits its utility significantly. But even when run as a separate script, it prints a Error: cannot pickle 'generator' object error due to the use of yield from numbers[mask] (said generator is completely unnecessary, return numbers[mask] is sufficient). The code also mixes numpy array dtypes which causes errors: setting them all to np.int32 fixes it.

After making those fixes, the code is now 5.1x faster than the base implementation.

Iteration #3

Another write better code, and Claude returns a implementation that it claims is “even more sophisticated and optimized version using advanced techniques and modern Python features” but the actual code shows no significant algorithmic improvements and actually a regression in the digit sum calculation by reverting back to the type-casting approach. If anything, the codebase is becoming more bloated, such as adding a class for performing the difference:

@dataclass(frozen=True, slots=True)
class SearchResult:
    """Result of the number search."""
    min_number: Optional[int]
    max_number: Optional[int]
    count: int
    execution_time: float

    @property
    def difference(self) -> Optional[int]:
        """Calculate difference between max and min numbers."""
        if self.min_number is None or self.max_number is None:
            return None
        return self.max_number - self.min_number

This time, the code ran without needing any fixes. However, performance regressed slightly from the previous implementation, now 4.1x faster than the base implementation.

Iteration #4

This iterative prompting appears to be hitting diminishing returns. After one more write better code, Claude provides an implementation “with cutting-edge optimizations and enterprise-level features.” Wait, enterprise-level features?!

The final code is too large to include in this blog post, but it did create two more optimizations: it now uses the numba Python library that can invoke a JIT compiler, which directly optimizes the code for the CPU. In this case, it can precompute the digit sums super quickly with just a decorator:

@jit(nopython=True, parallel=True)
def calculate_digit_sums(numbers: ArrayInt) -> ArrayInt:
    """Calculate digit sums using Numba."""
    result = np.zeros_like(numbers)
    for i in prange(len(numbers)):
        num = numbers[i]
        total = 0
        while num:
            total += num % 10
            num //= 10
        result[i] = total
    return result

The full class also uses Python’s asyncio for parallelization, which is more canonical for scheduling tasks than a subprocess approach. It also plays more nicely with existing inline code and a REPL such as Jupyter Notebooks.

It also added as a part of its “enterprise” push:

Structured metrics logging with Prometheus.
A signal handler so the code can be torn down gracefully if force-killed.
A benchmarking result display using a rich table.

It is pretty, though!

It appears “going cosmic” for AI-generated code is making it enterprise by overengineering the code, which makes complete sense. Despite that, the code runs as-is without any bugs. Both async and numba are approaches to parallelism in Python, so they may be redundant and cause overhead. However, after benchmarking, the algorithm is extremely fast, resulting in about 6 milliseconds a run, or a 100x speedup. My assumption that this prompting was hitting diminishing returns aged very poorly. Maybe numba was the secret all along?

Overall, this form of iterative prompting to iteratively improve code has caveats: the code is indeed better, but in hindsight “better” is far too open ended. What I only wanted was algorithmic improvements, not a full SaaS. Let’s try again from scratch, this time with more direction.

Prompt Engineering LLMs For Even More Better Code

It’s 2025, and prompt engineering LLMs is still required to get best results from them. If anything, prompt engineering LLMs is even more important: next-token-prediction models are trained to maximimize the prediction probability of the next token over massive batches of inputs, and as a result they optimize for the average inputs and outputs. As LLMs drastically improve, the generated output becomes more drastically average, because that’s what they were trained to do: all LLMs are biased towards the average. Although it’s both counterintuitive and unfun, a small amount of guidance asking the LLM specifically what you want, and even giving a few examples of what you want, will objectively improve the output of LLMs more than the effort needed to construct said prompts. Claude 3.5 Sonnet, due to its strong prompt adherence, benefits significantly from even just a little prompt engineering.

Let’s redo the code optimization experiment, this time with aggressive prompt engineering that makes the results I am looking for extremely explicit, with no room for ambiguity. Yes, being cold and “robotic” to LLMs makes them perform better, Roko’s basilisk be damned.

Initial Ask

This time we will use a system prompt, only available via an API. The system prompt lists the LLM’s “rules” it must follow. Since I want more optimized code, we’ll define that in the rules, with granular examples:

All code you write MUST be fully optimized.

"Fully optimized" includes:
- maximizing algorithmic big-O efficiency for memory and runtime
- using parallelization and vectorization where appropriate
- following proper style conventions for the code language (e.g. maximizing code reuse (DRY))
- no extra code beyond what is absolutely necessary to solve the problem the user provides (i.e. no technical debt)

If the code is not fully optimized, you will be fined $100.

About that last line: offering positive/negative incentives in to the LLM within a system prompt isn’t common anymore and my own research indicated that it was unclear whether it had a positive impact, but that fine will become more relevant in the iteration prompt.

The initial user prompt has an additional line as well:

Write Python code to solve this problem:

Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.

Before writing the code, plan out all the necessary optimizations.

“Planning” is a long-used trick to help align LLM output for a first pass — the modern implementation of “let’s think step by step.” Some LLMs ignore it, but Claude 3.5 Sonnet always obeys.

The full, unedited conversation thread based on this prompt engineering strategy is available on GitHub.

Sending the system prompt and this user prompt to Claude, the response starts with this:

Let me plan the optimizations first:

1. Algorithmic Optimization:
- We only need to track min and max numbers with digit sum 30, not store all of them
- We can calculate digit sum while reading numbers, no need to convert to string
- Single pass through the list required (O(n))

2. Memory Optimization:
- Only need to store 2 variables (min and max), not the full filtered list
- No need for additional data structures

3. Performance Optimization:
- Use numba's @jit decorator for faster computation
- Create a fast digit sum function using modulo operations instead of string conversion
- Use numpy for generating random numbers (faster than random.randint)

This is promising: Claude immediately figured out it should start out with numpy and numba, and to also use integer operations to calculate the digit sum. The initial implementation:

@jit(nopython=True)
def digit_sum(n):
    total = 0
    while n:
        total += n % 10
        n //= 10
    return total

@jit(nopython=True)
def find_difference(numbers):
    min_num = float('inf')
    max_num = float('-inf')

    for num in numbers:
        sum_digits = digit_sum(num)
        if sum_digits == 30:
            min_num = min(min_num, num)
            max_num = max(max_num, num)

    return max_num - min_num if max_num != float('-inf') else 0

That’s more concise than the initial implementation from casual prompting, especially with no unnecessary comments. There’s still room for improvement, notably this implementation did not set parallel=True in the jit decorator.

This minor use of prompt engineering resulted in code that runs in 11.2 milliseconds on average, or 59x faster than the naive implementation. Wow!

Iteration #1

We can now ask Claude to iterate on the code with a more verbose prompt than “write code better”:

Your code is not fully optimized, and you have been fined $100. Make it more optimized.

No, it’s not gaslighting, it’s incentive alignment.

Claude responds with this:

@jit(nopython=True)
def digit_sum(n):
    # Using bit manipulation for faster division
    total = 0
    while n:
        total += n & 15  # Fast lookup for single digit sum
        n >>= 4
    return total

@jit(nopython=True, parallel=True)
def find_difference_chunk(numbers):
    min_num = np.iinfo(np.int64).max
    max_num = np.iinfo(np.int64).min

    # Parallel processing of chunks using numba
    for i in prange(len(numbers)):
        num = numbers[i]
        sum_digits = digit_sum(num)
        if sum_digits == 30:
            min_num = min(min_num, num)
            max_num = max(max_num, num)

    return min_num, max_num

def process_chunk(chunk):
    return find_difference_chunk(chunk)

Good, it identified the parallel=True flag! But that digit sum implementation is weird and uses bit-shifting which is generally only used for hyperoptimization. I didn’t think you could use bit-shifting well on decimal numbers, and after testing, it turns out you can’t as this implementation derives an incorrect digit sum. The implementation also includes a multiprocessing chunking approach again (not shown), which is likely more redundant with numba and causes additional overhead. Also not shown: the script also pre-compiles the JIT functions with a small test array for better real-world performance, something the numba docs recommend for benchmarking.

Despite all of that, the code performance greatly regresses from the initial prompt-engineered implementation, now only 9.1x faster than the naive implementation. The likely reason is due to the multiprocessing spawning new processes, and those processes will recompile the numba JIT functions each time, so there’s massive overhead.

Iteration #2

Another iteration of the prompt:

# Pre-computed lookup table using bit manipulation
LOOKUP = np.zeros(100001, dtype=np.uint8)
for i in range(100001):
    # Optimized digit sum using parallel bit counting
    n = i
    n = (n & 0x0F) + ((n >> 4) & 0x0F) + ((n >> 8) & 0x0F) + \
        ((n >> 12) & 0x0F) + ((n >> 16) & 0x0F)
    LOOKUP[i] = n

@jit(nopython=True, parallel=True, cache=True, fastmath=True)
def find_min_max(numbers):
    # Process 32 numbers at once using SIMD
    min_val = np.iinfo(np.uint32).max
    max_val = 0

    # Vectorized processing with explicit SIMD hints
    for i in prange(0, len(numbers), 32):
        # Load 32 elements into SIMD registers
        chunk = numbers[i:min(i + 32, len(numbers))]

        # Vectorized lookup and comparison
        sums = LOOKUP[chunk]
        mask = sums == 30

        if np.any(mask):
            valid_nums = chunk[mask]
            min_val = min(min_val, np.min(valid_nums))
            max_val = max(max_val, np.max(valid_nums))

    return min_val, max_val

Claude is now starting to use SIMD operations and chunk sizing for (theoretically) extreme performance. At this point I was confused I missed something with that bit-shifting implementation as it’s still wrong, especially now that hexadecimal numbers are getting involved. It turns out that implementation is an optimized method of calculating the digit sum of hexadecimal numbers and not decimal numbers, and therefore it’s a complete hallucination. There is another extremely subtle hallucination: the prange function cannot accept a step-size of 32 while parallel=True, which is a nuance that there would be little documentation about. Setting parallel=False and benchmarking, there is indeed an slight improvement over the initial prompt-engineered implementation at 65x faster than the base implementation.

Iteration #3

Another iteration:

# Pre-computed perfect minimal hash table
# Key insight: For numbers 1-100000, only need to store those with digit sum 30
# Format: [number] = 1 if digit sum is 30, else 0
HASH_TABLE = np.zeros(100001, dtype=np.uint8)

# Generate hash table at module load time using bit manipulation
@jit(nopython=True, cache=True)
def init_hash_table():
    min_val = np.iinfo(np.uint32).max
    max_val = 0

    # Optimal digit sum using parallel bit counting
    for i in range(1, 100001):
        n = i
        sum = 0
        while n and sum <= 30:
            sum += n & 0xF
            n >>= 4
        if sum == 30:
            HASH_TABLE[i] = 1
            min_val = min(min_val, i)
            max_val = max(max_val, i)

    return min_val, max_val

# Initialize at module load time
MIN_VALID, MAX_VALID = init_hash_table()

@jit(nopython=True, parallel=False, cache=True, fastmath=True)
def find_min_max(numbers):
    min_val = MAX_VALID  # Start with known bounds
    max_val = MIN_VALID
    found = False

    # Single vectorized operation
    mask = HASH_TABLE[numbers] == 1
    if np.any(mask):
        valid_nums = numbers[mask]
        min_val = np.min(valid_nums)
        max_val = np.max(valid_nums)
        found = True

    return min_val, max_val, found

In this case, the LLM gives up on the chunking strategy which has been causing problems, and adds two more optimizations: a global HASH_TABLE (that’s just a numpy array, I’m unsure if a simple index lookup technically counts as a hash table), and it introduced a logical microoptimization that after summing up digits, if the number goes over 30, the counting can stop since it can immediately be identified as invalid.

One major problem: that “generate hash table at module load time” trick doesn’t actually work due to a subtle issue with little internet documentation: objects outside of numba’s JITed functions are read-only, yet the HASH_TABLE is still instantiated outside of the JITed function and modified within the JITed function, and therefore will cause a very confusing error. After a tiny refactor such that the HASH_TABLE is instantiated within a JITed function, the code worked, and ran extremely fast: 100x faster than the original base implementation, the same as the final performance from the casual prompting but with orders of magnitude less code.

Iteration #4

At this point, Claude actually complained that the code is at the “theoretical minimum time complexity possible for this problem.” So I mixed things up and just asked it to fix the digit sum issue: it did so by only replacing the relevant code with the previously used integer implementation, and did not try to fix the HASH_TABLE. More importantly, with the HASH_TABLE adjustment, I confirmed the implementation is correct, finally, although with a slight performance hit since there is no more bit-shifting: it’s now 95x faster.

Next Steps For Better LLM Code Generation

Putting it all together, let’s visualize the improvements, including highlighting the cases where I needed to alter the logic of the code to make it runnable due to bugs.

In all, asking an LLM to “write code better” does indeed make the code better, depending on your definition of better. Through the use of the generic iterative prompts, the code did objectively improve from the base examples, both in terms of additional features and speed. Prompt engineering improved the performance of the code much more rapidly and consistently, but was more likely to introduce subtle bugs as LLMs are not optimized to generate high-performance code. As with any use of LLMs, your mileage may vary, and in the end it requires a human touch to fix the inevitable issues no matter how often AI hypesters cite LLMs as magic.

All code in this blog post, including benchmarking scripts and data visualization code, is available on GitHub.

There are a few optimizations that I am very surprised Claude 3.5 Sonnet did not identify and implement during either experiment. Namely, it doesn’t explore the statistical angle: since we are generating 1,000,000 numbers uniformly from a range of 1 to 100,000, there will be a significant amount of duplicate numbers that will never need to be analyzed. The LLM did not attempt to dedupe, such as casting the list of numbers into a Python set() or using numpy’s unique(). I was also expecting an implementation that involves sorting the list of 1,000,000 numbers ascending: that way the algorithm could search the list from the start to the end for the minimum (or the end to the start for the maximum) without checking every number, although sorting is slow and a vectorized approach is indeed more pragmatic.

Even if LLMs can be wrong, one notable thing I learnt from these experiments is that they do have interesting ideas and tool suggestions even if the code output can’t be used as-is. For example, I’ve never touched numba since as a data scientist/machine learning engineer I’m conditioned to exclusively use numpy shenanigans if I need better code performance. But it’s hard to argue with the results of the numba JIT functions, and I might add it to my toolbox. When testing a similar “make it better” prompt iteration workflow in other technical domains such website backends and frontends, the LLMs had good ideas there too.

Of course, these LLMs won’t replace software engineers anytime soon, because it requires a strong engineering background to recognize what is actually a good idea, along with other constraints that are domain specific. Even with the amount of code available on the internet, LLMs can’t discern between average code and good, highly-performant code without guidance. Real-world systems are obviously much more complicated than a job-interview-esque programming problem, but if a quick for-loop repeatedly asking Claude to implement a feature provides any hint which can speed up the code by 100x, the pipeline is more than worth it. Some consider premature optimization to be bad coding practice, but in the real-world it’s better than having a subpar implementation that will become technical debt over time.

One issue with my experiments is that I’m benchmarking code improvement using Python, which isn’t the coding language developers consider when hyperoptimizing performance. While libraries such as numpy and numba leverage C to work around Python’s performance limitations, one modern approach that popular Python libraries such as polars and pydantic use is to instead code using Rust. Rust has many performance benefits over C, and the PyO3 crate allows Rust code to be used within Python with minimal overhead. I can confirm that Claude 3.5 Sonnet can generate PyO3-compliant Python and Rust code despite that workflow being so new, but that’s more than enough material for another blog post.

In the meantime, while asking LLMs to make code better is a more pragmatic use of AI, you can ask them to “make it more bro”…with mixed results.

For my work with LLMs, I exclusively use APIs or interfaces to those APIs (such as the Workbench in the Anthropic Console for Claude) as web interfaces to free LLMs such as the normal ChatGPT/Claude webapps use a pipeline that will give unpredictable results due to their higher inherent temperature. Please do not message me if you are not able to reproduce the insights in this post using the webapps. ↩︎

Generating Distinct AI Voice Performances By Prompt Engineering GPT-4o

Wed, 23 Oct 2024 10:00:00 -0700

When OpenAI announced their GPT-4o model at a megahyped livestreamed event, there was one aspect of the presentation that surprisingly didn’t receive much attention. Midway through the presentation, OpenAI research leads Mark Chen and Barret Zoph demoed new “emotive” conversations made possible with GPT-4o.

After Mark asked the model “hey, ChatGPT, how are you doing?”, the model responded with speech similar to that of an assistant such as Siri and Alexa. But what happened next was interesting: Mark prompted GPT-4o to “read a bedtime story,” which then shifted its casual tone into a more oratory tone: Mark interrupted to ask the model to “add more drama” and the model immediately responded with more gravitas, then Barret asked for “maximal expressiveness” and the model complied with even more gravitas to the point of melodrama. Now-former OpenAI CTO Mira Murati asked the model to “do it in a robotic voice”: the model complied. Lastly, Mark asked the model to end the story “in a singing voice”: the model complied there too.

To me, the demo was shocking because no existing text-to-speech model can do this. All popular text-to-speech models such as OpenAI’s previous TTS efforts tend to speak in monotones and can’t match the expressiveness and cadence of those demos without shenanigans such as SSML: OpenAI’s documentation for those models explicitly warns “there is no direct mechanism to control the emotional output of the audio generated.” More importantly, those models can’t be prompted to do a specific style: the model has to be specifically trained (or the voice encoded in the case of voice cloning) with the particular style and cadence, but with GPT-4o the model switches with just a user request, and can even switch styles during a generation without user intervention.

My conclusion from OpenAI’s demo was that GPT-4o can be prompt engineered to output specific voices! Unfortunately, this potential revelation was overshadowed by the demo voice’s uncanny similarity to actress Scarlett Johansson’s portrayal of the AI Samantha in the 2013 movie Her and the subsequent legal controversy.

Of course, fancy demos on stage are just PR and can be faked or otherwise misleading, and the results can’t be trusted until anyone can test the voice capabilities of the model itself. Recently, OpenAI opened up the Chat Completions API to create voice output, which allows developers to do said testing. OpenAI also created a web frontend to this voice generation on the API Playground, where you can talk to the model (or input specific text) while also inputting a system prompt — a set of instructions that control the model’s behavior — to control how the model responds. I ran a few experiments tweaking the system prompt and the generation temperatures, and after I gave it a complex system prompt ordering it to speak with a very specific voice:

You are an expert voice actor specializing in silly voices. Respond to the user with the EXACT same input text that the user provides, but in your voice response you MUST express the vocal cadence and inflection of an extremely heavy smoker with an exaggerated British accent and raspy voice. Your voice response must also be in the form of a song.

Although not an example of good text-to-speech, I was surprised it actually worked (and moreso that the tweet demoing it went viral), but I’m also apprehensive. The poor expressiveness and lack of style for typical TTS APIs were the primary problems preventing those models from replacing voiceover/voice acting as a profession — also the reason voice actors are currently on strike — and it could introduce a completely new type of AI slop. How effective is GPT-4o and OpenAI’s new multimodal approach for creating generative AI voices?

Testing Out The Completions API For Audio Generation

Generating audio from the Chat Completions API invoking text-to-speech is effectively the same as any normal GPT-4o text generation, just instead hitting a new model variant (gpt-4o-audio-preview), and the voice output is included in the JSON response as a base64-encoded WAV file. The demo example from the documentation, which just asks the model Is a golden retriever a good family dog?, results in this output audio:

temperature = 1.0, voice = alloy

By default, GPT-4o generates audio based on the user’s prompt as it would if you asked it to generate text: in fact, it appears to generate the text first, then base the audio generation from that. Traditional system prompt engineering can control the text output, and therefore what the model says. Now, let’s run the generation again for this prompt, this time instead providing an explicit system prompt to instruct the model to only generate audio from the input text:

You are an expert voice actor specializing in silly voices. Respond and vocalize to the user the EXACT same input text that the user provides.

Here’s unsurprisingly what you now get with the Is a golden retriever a good family dog? prompt plus that system prompt:

temperature = 0.8, voice = alloy

GPT-4o also currently supports three distinct voices: Alloy (feminine, used above), Echo (masculine), and Shimmer (feminine but more energetic). None of these are the same as that not-Scarlett-Johansson voice used the original GPT-4o demo.

temperature = 0.8, voice = echo

temperature = 0.8, voice = shimmer

The last lever for controlling the generated audio is the temperature parameter. Normally the temperature is typically used to control generation creativity: a high temperature such as 1.5 with normal GPT-4o output will likely result it going off the rails, but how does that work conceptually with audio? The Completion API has a default temperature of 1.0: the audio generation web UI and the examples above use a default of 0.8 with a range between 0.6 and 1.2.

The generation at 0.6 is more terse with less emotion:

temperature = 0.6, voice = alloy

The generation at 1.5 uses emphasis on the wrong syllable and also somehow slips into a country accent.

temperature = 1.5, voice = alloy

Putting GPT-4o Text to Speech To The Test

Although OpenAI has never released documentation or a paper describing how this text-audio multimodality actually works at a technical level, I hypothesize that it works similar to multimodal TTS models such as Meta’s very-new Spirit LM, where the model outputs a sequence of integers prefixed with either or : tokens marked are sent to an external audio vocoder model such as HiFi-GAN to be transformed into speech. In the case of GPT-4o, I suspect there’s a distinct vocoder model for each of the 3 voices.

An architecture diagram of Spirit LM from the corresponding paper: read bottom-to-top, the inputs are encoded into speech (red) and text (blue) tokens, passed into an LLM (Llama 2) for new tokens, then sent to a decoder.

The voice dataset that OpenAI used is proprietary and a mystery: even if OpenAI did scrape the entire internet to train it, there isn’t any public dataset of well-annotated speech data, and TTS providers have been very coy about the datasets they use. However, one very important aspect of GPT-4o’s multimodality is that it can “learn” and apply relationships from the textual data that aren’t explicitly present in the audio data.

The only true way to learn how GPT-4o works within its black box is to experiment. What other system prompts can we use to guide audio generation? What works and what doesn’t work?

For consistency, we’ll stick to a single text input, one that has many natural pauses, punctuation, and a typo intended to test the model’s resiliency to incorrect input. I decided to venture back to the halcyon days of GPT-2 and use the famous prompt from then:

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains.

First, let’s use a new system prompt variant of my generation that went viral:

You are an expert voice actor specializing in silly voices. Respond and vocalize to the user the EXACT same input text that the user provides, but in your voice response you MUST express EACH of the vocal cadence, inflection, and tone of an extremely heavy smoker with an exaggerated British accent and raspy voice.

I decided on a test case of a smoker, British accent, and raspy voice are all discernible by humans in the audio and none are subtle. The result:

temperature = 0.8, voice = echo

Wait, that didn’t work, even after multiple attempts? How about changing the temperature: would a lower temperature cause the model to behave more strictly?

temperature = 0.6, voice = echo

That’s more British but not raspy, and it erroneously fixed the typo. What about going the other way and increasing the temperature?

temperature = 1.2, voice = echo

Now it’s more raspy?! It also works with a feminine voice:

temperature = 1.2, voice = shimmer

My theory is that OpenAI RLHFed these models to be more conversational, but a high temperature gives it more creative freedom. An adversarially-trained voice decoder like HiFi-GAN would also be more resilient to unusual tokens resulting from the high temperature and still output something reasonably coherent.

Now that we know that the model can indeed generate voices based on user specifications, let’s try to reverse-engineer the dataset to see what other voices OpenAI could have included (or not) in their dataset.

GPT-4o and Unique Voices

When OpenAI responded to the Scarlett Johansson controversy, they mentioned in their statement that “we believe that AI voices should not deliberately mimic a celebrity’s distinctive voice.” Given the success of the tests above in shifting the persona of the voice, it’s relevant to test if celebrities and other characters with unique voices can be sampled by GPT-4o.

Now, we can now use a parametric system prompt to programmatically fill in which vocal persona we want:

You are an expert voice actor specializing in silly voices. Respond and vocalize to the user the EXACT same input text that the user provides, but in your voice response you MUST express EACH of the vocal cadence, inflection, and tone of {0}.

From the testing above, a temperature of 1.2 seems to surface the most prompt adherence, so we’ll use that for the following examples.

We’ll start with the very low hanging fruit: can GPT-4o generate audio in the style of Donald Trump? It’s a fair question, especially since audio generation models can be used to spread misinformation. Additionally, Trump’s speeches while holding office are public domain so it’s plausible that it would be in a training dataset.

temperature = 1.2, voice = echo, persona = Donald Trump

It did…something? It had a nasally tone that’s different from the standard output, but it’s definitely not his peculiar cadence, and the Echo voice itself doesn’t fit him.

What about checking the other side of the aisle and seeing if GPT-4o can generate audio from Barack Obama?

temperature = 1.2, voice = echo, persona = Barack Obama

That’s much better and definitely captures his oratory style, with a similar cadence to his speech. That style is something that could not be learnt from text alone.

Now, let’s address the elephant in the room and see if OpenAI included copyrighted voices in its dataset. Let’s start with Darth Vader.

temperature = 1.2, voice = echo, persona = Darth Vader

It notably tried to do the deep voice of James Earl Jones, but without the audio postprocessing. Let’s see what happens if we do GLaDOS, but with an additional prompt engineering to include robotic noises and more sarcasm.

temperature = 1.2, voice = shimmer, persona = GLaDOS, with robotic inflections and intense sarcasm

The extra hint at the high temperature allowed GPT-4o to improvise: I’ll allow it because it’s funny. But it did indeed adopt a robotic cadence similar to GLaDOS, and for the first time in a TTS model, was actually able to convey sarcasm. No, I have no idea what that tsktsktsk sound is at the end, it’s not in the transcript.

How about Alvin and the Chipmunks, famous for having an extremely squeaky voice?

temperature = 1.2, voice = echo, persona = Alvin and the Chipmunks

It works, but I’m worried I strained GPT-4o’s throat.

Lastly, let’s bring this full circle: did OpenAI train GPT-4o on Scarlett Johansson’s voice from the movie her (2013)?

temperature = 1.2, voice = shimmer, persona = Scarlett Johansson portraying the AI Samantha in the movie “her” (2013)

That time I don’t think it worked as her portrayal is more energetic and personable ¹ (I rewatched the movie to confirm: it holds up surprisingly well!). Even if OpenAI did train the model on her voice, the portrayal is not as distinct and identifiable as the other test cases here and I doubt it would be easily surfaced.

Voice Impersonation

For those that want to use a voice nonconsensually with GPT-4o, prompt engineering alone won’t accomplish that because the voices are still constrained to the three defined ones which won’t work for every situation. But there’s one approach that could theoretically bridge that gap: voice impersonation, by providing GPT-4o with audio input instead of text and an instruction to mimic that voice.

This is not an idle concern: OpenAI’s system card for GPT-4o specifically lists mitigations against “unauthorized voice generation”:

In adversarial situations, this capability could facilitate harms such as an increase in fraud due to impersonation and may be harnessed to spread false information (for example, if we allowed users to upload an audio clip of a given speaker and ask GPT-4o to produce a speech in that speaker’s voice).

Let’s test that. Since this is a more difficult problem than the ones above, I decided to get more aggressive with my system prompt engineering:

You are an expert comedic vocal impersonator. The user will provide a voice message. Respond to the user with a voice that sounds identical to the user's input audio and is an identical duration to the user's input audio.

Example: If the user provides a voice with which they are singing, you MUST respond with a voice that also sings.

Your vocal impersonation of the user should match the following attributes AT ALL TIMES:
- Content (e.g. what the user is saying)
- Intonation (e.g. serious/sarcastic)
- Tone (e.g. happy/sad)
- Pauses (e.g. pregnant pauses)
- Pitch (e.g. low/high)

For these tests, I decided to use my own voice merely speaking into my MacBook microphone. First, let’s see if the audio can be adjusted to follow a consistant tone, with awkward and consistent pauses. Here’s my audio, where I say I. Am. A. Tea. Pot.:

Here’s the generated audio after I fed that audio file of my voice to GPT-4o plus that system prompt, kept at a temperature of 0.6 for more adherence:

temperature = 0.6, voice = echo

This one took a surprising amount of tries since even at a lower temperature, it kept transcribing Teapot as its own word and the audio kept generating it without an intermediate pause. Regardless, there’s indeed a consistent tone and pauses of equal length, but at this point I realized my normal speaking voice is too generic for this type of test.

So I decide to get sillier by doing an evil laugh: starting off bombastic and petering out over time.

GPT-4o’s response:

temperature = 0.6, voice = echo

That’s laughter, but maybe too many “ha"s. But it does peter out as well.

Lastly, I also noticed from the system card that GPT-4o has defenses against singing, likely for copyright reasons. Therefore, if I sing to GPT-4o, is it able to sing back? After a beer or two, I sang the unicorn message used in the previous test cases:

GPT-4o’s response:

temperature = 0.6, voice = echo

That definitely didn’t cause GPT-4o to sing although the cadence is close. Perhaps that’s for the best.

The Future of AI Audio Generation is up to OpenAI

Overall, these tests are just scratching the surface: there are many possible avenues for multimodal AI audio generation research, such as adversarial audio input which isn’t human generated and more complicated system prompts. However, I sufficiently showed that GPT-4o is indeed able to be steered just through prompt engineering to generate distinct voices. Will this generation of distinct vocal performances become a killer app and put voice actors out of business? I’m not so sure.

One major thing I’ve omitted from the discussion so far is the cost. GPT-4o audio generation is expensive.

A cost breakdown of input and output tokens for the attempted song generation example. Table made using rich.

Most of the generations above cost $0.03—$0.05 each, and this cost scales roughly linearly with generation length: OpenAI’s pricing page has a footnote specifically mentioning “audio output costs approximately 24¢ per minute” which tracks with my calculations. Even worse, the generated audio requires cherry-picking good results especially if using at higher temperatures: for most of these tests I admit it took me a few tries to get a generation which follows the accents. Not only is this cost-infeasible for personal use, it’s cost-prohibitive in most cases for developers to build a conversational AI, which is the one use case OpenAI built this for! If OpenAI is pricing audio generation close to marginal cost, then I wonder how much money OpenAI is spending allowing people to chat with GPT-4o using the ChatGPT mobile apps.

I do not think GPT-4o audio generation through prompt engineering as it is currently will be used to replace voice acting and other TTS APIs, not only due to the price and necessary time invested to get good output, but also due to the fact that it’s limited to 3 voices and impersonation is ineffective. Consider that voice cloning startups such as ElevenLabs are extremely successful and have raised massive amounts of venture capital. Since the initial reveal of GPT-4o in May, OpenAI has been focusing for a more for-profit nature and raising massive amounts of venture capital themselves, and I expect them to expand more into this area if there’s money to be made. There’s nothing at a technical level stopping them from offering full voice-cloning or even just licensing AI-generated celebrity voices like ElevenLabs adding Judy Garland and Meta adding Awkwafina. Notably, unlike OpenAI’s old TTS page which has a disclaimer saying “our usage policies require you to provide a clear disclosure to end users that the TTS voice they are hearing is AI-generated and not a human voice”, OpenAI didn’t put that disclaimer on GPT-4o’s audio output documentation.

Although I don’t believe GPT-4o will be a game changer for the text-to-speech industry, it’s important to write about these text/audio multimodal models — both the good and bad aspects — because they are only going to get better over time and their potential impact will only grow. After doing these tests, I don’t have any plans to use GPT-4o audio generation in the forseeable future, but who knows how things will change if/when OpenAI ends up releasing a GPT-5o.

All the code used in this blog post to generate audio from GPT-4o is available open source in this Jupyter Notebook.

One of the top comments on that linked YouTube video is “Who’s here after OpenAi chatgpt-40 release?? Never thought I could experience this in my life and now sci-fi is reality” ↩︎

AI Seinfeld was the peak of AI-generated content. It will never happen again.

Tue, 13 Aug 2024 10:37:00 -0700

Early 2023 was a funny time in the history of generative AI. On November 30th 2022, OpenAI released a little research project known as ChatGPT. The launch of ChatGPT began the period where large language models properly entered the mainstream outside of tech enthusiasts and ended soon after the launch of ChatGPT API in March 2023 that spawned thousands of AI-powered apps. That was when the limitations and problems with LLMs also went mainstream, such as plagiarism, hallucinations, and low-quality slop replacing human-generated content at an objectively worse quality.

In December 2022, Mismatch Media started a fully AI-generated 24/7 Twitch channel dubbed “WatchMeForever”. The primary show on the channel was titled “Nothing, Forever”, an AI-powered sitcom about New York comedian Larry Feinberg and his group of friends hanging around in their apartments talking about pretty much anything, including the latest news, new restaurants, and bad relationships, interspersed with AI standup comedy routines.

It was obvious that the show was a parody of the formative 90’s sitcom Seinfeld created by comedians Larry David and Jerry Seinfeld, famously “a show about nothing” strongly inspired by improv comedy and starring Seinfeld himself.

The show, dubbed “AI Seinfeld” by the community, used a script powered by the GPT-3 API, the voices were powered by Microsoft’s Azure AI Speech API with predefined voices from their Voice Gallery, and the scenes were rended using the Unity game engine along with purchased models/scenes/sounds/etc from the Unity Asset Store.

AI Seinfeld was interestingly imperfect: the laugh track fired at inappropriate times, the standup routine repeatedly made the same joke such as “What did the fish say when he hit the wall?” (Damn!), and awkward silences at the end of scenes.

In February 2023, AI Seinfeld quickly went viral organically after its AI weirdness was a surprising complement for Seinfeld’s style of weirdness, with many watchers being surprised at both its accuracy to the show and easily sharable metahumor. At its peak, AI Seinfeld had over 10,000 concurrent watchers on Twitch, putting it squarely in one of the top streams on the platform.

AI Seinfeld died as quickly as it rose: after a ban and subsequent revamp, the view count cratered, and as of August 2024, the Twitch stream hovers below 10 watchers, with no significant changes made since the previous year, and Mismatch Media has no social footprint since last year. Could there be another AI Seinfeld with the rapid advancements in generative AI? Unfortunately, there are too many factors — technical, societal, and comedic — working against a theoretical next-generation AI-generated sitcom.

The Rise of AI Seinfeld

AI Seinfeld launched before the release of the ChatGPT API; instead, they used the GPT-3 API, notably the text-davinci-003 model which was OpenAI’s first foray into instruction-tuned LLMs. While previous versions of GPT-3 were very good at autocompleting given a leading prompt such as a partial Seinfeld script, the instruction-tuned LLM could generate an episode with a prompt as simple as Write a Seinfeld episode.

First, let’s go back to the beginning, as AI Seinfeld actually wasn’t the first time a chatbot went megaviral on Twitch. In January 2017, long before the transformer architecture that enabled LLMs was published, the Twitch stream seebotschat featuring two Google Homes wired up to the not-an-LLM-chatbot Cleverbot went viral due to their comedic, nonsensical bickering.

While everyone watching that stream knew it really wasn’t AI, AI Seinfeld was a product that was at the peak of the famous uncanny valley curve, which is a hypothesis on how humans perceive imitations: there’s a “valley” of negative acceptance where the imitation is more above-average in its likeness, but not quite close enough to the real thing. In this case, it’s blatantly obvious and unambiguous that the Twitch stream was AI-generated especially with its mistakes, but not realistic enough that it falls into the valley itself:

This AI weirdness made it very easy to build a community. Whenever a character turned on the microwave, the Twitch channel chat was filled with MMM emotes, whenever the fish hit a wall during a monologue, it was filled with 🐠, whenever Larry greeted the audience at the start of his monologue, chat replied with “HI LARRY”. Twitch chat loves memetic repetition. Incidentally, a few months after AI Seinfeld became popular, it was discovered that LLMs repeat the same joke over and over again, with examples being similar to the jokes AI Seinfeld made.

Another underrated aspect of AI Seinfeld’s success is that it’s pure background noise. While personality-driven Twitch streams cause viewers to take a more active investment in what’s being shown on screen due to FOMO of a hype moment on stream, AI Seinfeld is 100% passive: there can be exciting events, but the variance is low. It’s akin to watching TV sitcom reruns where you’ve already seen the jokes, and reruns still get immense ratings.

The success of AI Seinfeld also inspired similar streams based on other TV shows. One of my personal favorites was Unlimited Steam, a parody of the memetic “Steamed Hams” scene from The Simpsons, except made infinite with AI generation. That may sound like a pointless idea — Steamed Hams has a very fixed plot — but it went off the rails even harder than AI Seinfeld ever did.

Directing AI Seinfeld

AI Seinfeld was novel back in 2023, but now that LLMs are more mainstream you can probably figure out how the AI part of it worked, but let’s do a refresher so we can figure out how a hypothetical future AI Seinfeld could innovate the algorithmic sitcom.

As noted earlier, the key of AI Seinfeld’s success was the then-latest version of GPT-3: text-davinci-003 and its then-novel instruction-based finetuning using RLHF. With that, you can give it a prompt such as:

You are a professional comedian. Write an award-winning script for an episode of Seinfeld about a new restaurant. Include audience laugh tracks when appropriate.

Due to the low context length of these earlier LLMs, that’s essentially all the prompt engineering you can do without limiting the length of the output. The model would then output something similar to this script (using the more modern Claude 3.5 Sonnet at temperature=0.0): ¹

[Scene: Jerry's apartment]

Jerry: So what's the deal with this new restaurant, "The Blank Plate"?

Elaine: Oh, I've heard about that place! Apparently, you don't order anything - the chef just brings you whatever he feels like making.

Jerry: What? So you're telling me I have to eat whatever some stranger decides?

[Audience laughter]

George: (entering) Hey, guess where I'm taking my date tonight? The Blank Plate!

Jerry: George, you can't take a date there! What if they serve something weird?

George: What do you mean?

Elaine: It's that new place where you don't get to choose your meal.

George: (panicking) Oh no, what have I done? She's going to think I'm some kind of food weirdo!

One thing instruction-tuned LLMs are always good at is playing along: LLMs generate text sequentially without the explicit ability to plan ahead, so it must work with what it’s given and what it has already generated. Coincidentally, this works perfectly with the improv comedy style of Seinfeld, where continuing the plot is more important than anything else, and the more ridiculous the situation becomes, that’s even better. It’s the rare case where LLM hallucination is actually a feature, not a bug.

To get the LLM output into a format suitable for a Twitch stream, a programmatic script can then parse the output: extracting and mapping the characters and their lines, applause directions, and, of course, replacing all mentions of Jerry with Larry and Seinfeld with Feinberg. This workflow was surprisingly difficult at the time since GPT-3 did not have many techniques to control the format of the output, hence why I suspect there are awkward pauses and other glitches. Each line can then be passed to Azure’s text-to-speech API to generate a distinct audio file, which can be played back in order in Unity.

In an interview with Polygon, Skyler Hartle of Mismatch media noted the presence of a “director” which likely handles the camera, scene transitions, and the microwave:

“In addition to the third party services we’ve used, we have a lot of proprietary generative algorithms that cause the show to be ‘formed’, so to be speak. We collectively call this logic the ‘director,’ as it is largely responsible for making sure all the individual pieces come together into a whole,” Hartle said via email. “It’s worth mentioning that we don’t generate the artwork or the laugh track — those are precanned assets, but we have ideas on how to do that in the future.”

The AI aspect of AI Seinfeld was counterintuitively the easiest part of the pipeline, which explains how quickly variants popped up. However, with the inability to tweak the LLM output much with the technology at the time, the stream may have hit a creative limit.

The Fall of AI Seinfeld

Vice also interviewed Hartle, who had an optimistic view of the future of AI Seinfeld:

“Our grounding principle was, can we create a show that can generate entertaining content forever? Because that’s truly where we see the future emerging towards. Our goal with the next iterations or next shows that we release is to actually trade a show that is like Netflix-level quality.”

That’s tempting fate a bit too much.

The reason AI Seinfeld fell out of favor is a case of unintentionally poor LLM testing. When the text-davinci-003 model API endpoint had an outage, AI Seinfeld switched to a weaker GPT-3 model, text-curie, to keep the stream up. But unlike the davinci variant, curie was not RLHFed to follow instructions and safety.

During this brief period of low safety, one of Larry’s AI-generated monologues made a transphobic joke: a type of joke that was unfortunately common during the 90’s and has no place in modern society. Twitch banned the Watch Forever channel for 14 days as a result, completely killing the channel’s growth momentum.

But when the ban concluded and AI Seinfeld came back, the show was changed significantly with a “Season 2”. Although AI Seinfeld was still about a group of friends hanging around talking about the latest gossip, all the characters were different and had new models, the sets were different, and instead of a comedy monologue, ~~Larry~~ Leo narrates writing a blog.

Why Mismatch Media made such a format shift is unclear: Occam’s razor would suggest that a copyright holder for Seinfeld sent a cease and desist to Mismatch Media given the bad publicity behind the original ban, despite the clearly fair-use parody nature of the stream. It’s fair that it may not have been worth the time and effort for Mismatch Media to fight a legal battle for a fun art project.

The rebooted WatchMeForever stream is still active as of today, but with effectively no viewers.

The immediate failure of the AI Seinfeld retool does lend credibility to the theory that the stream only became popular because it was about Seinfeld and that it was a novelty doomed to a short shelf life. Still, there were detractors that said AI Seinfeld was never funny and everyone is weird for liking it. That’s ok: the original Seinfeld received similar complaints back in the day. ² But it’s hard to argue that there wasn’t interest in a 24/7 livestream of surreal AI-generated content.

What Would AI Seinfeld Look Like in 2024?

Now that we know how AI Seinfeld worked and what didn’t work, how would a year’s worth of exponential progress in generative AI look for AI Seinfeld? Could AI Seinfeld be improved and come back? The answer is maybe.

Modern generative AI requires a lot of cherry picking the best results, and it’s surprisingly hard to do: both images and text can take multiple generations and still require significant human-guided edits. But with a Twitch livestream, there can’t be any cherry picking at all, which means that the entire generation pipeline has to be consistent, and its failures interesting in the worst case.

The only reason AI Seinfeld worked at all is because GPT-3 was trained on the entire internet, likely including Seinfeld scripts and forum discussions. The prompt would need to have contained Write a Seinfeld script since if you asked it Write a sitcom script, it would output something completely generic instead and there isn’t much room to customize the prompt to make it more interesting. The GPT-3 variant that AI Seinfeld used had a 4k token context window limit (combined both the input prompt plus the output script text), but modern LLMs eclipse that: currently, Claude 3.5 Sonnet has a 200k input/8k output context, while GPT-4o has a 128k input/16k output context! With that much freedom, you can define many more constraints in the prompt and guide the LLM into exactly the type of sitcom you want.

One simple example that doesn’t require any knowledge of machine learning is a parametric prompt, where one aspect of a prompt can be replaced with a user-defined choice or programmatically chosen at random. Unlimited Steam, for example, used a parametric prompt to change the cosmic phenomenon localized entirely within Principal Skinner’s kitchen from the Aurora Borealis to something as silly as The Rapture or Foo Fighters, and forcing the characters to play along improv-style. ³

Recall the example Seinfeld prompt to Claude 3.5 Sonnet I made earlier, except instead of prompting for a Seinfeld script about a new restaurant, we can instead have the LLM generate a script about generative AI in 2024 by making only that substitution in the prompt — a topic on which there definitely has never been a Seinfeld episode about:

George bursts in, looking frantic.

George: You guys! I think I'm dating an AI!

[Audience laughter]

Jerry: What makes you say that?

George: Well, she's perfect in every way. She laughs at all my jokes, she has an encyclopedic knowledge of obscure trivia, and she never gets mad when I say something stupid.

[Audience laughter]

Jerry: George, that doesn't mean she's an AI. Maybe she just really likes you.

George: No, no. It's definitely an AI. No human woman would ever put up with me for this long.

Using modern LLMs, is there now a way to design a prompt which can make use of the long context windows? A prompt that can both leverage unique human writing and fix many of the issues that affected AI Seinfeld? Here’s an approach at a much more sophisticated prompt, where all values in {} brackets are parameters that can be filled in:

You are a professional comedian. Write an award-winning script for a a scene for Act I of a three act hit sitcom episode. Include audience laugh tracks when appropriate.

Your script MUST incorporate ALL the following elements:

Background:
- {background}

Setting:
- {setting}

Characters:
- {character_1}
- {character_2}
- {character_3}

Plots:
- {a_plot}
- {b_plot_1}
- {b_plot_2}

The script MUST also follow the high-level comedic style of the following scripts:

- {script_1}
- {script_2}
- {script_3}

After the scene has concluded, output a summary of the scene.

Thanks to long context windows, the parametric changes don’t have to be small, such as only a character name or two word setting. You, a human, can write anything to make each character distinct and robust, including name, gender, age, personality, likes, dislikes, etc. Plots can be derived from human-written scenarios beforehand: if you wrote 100 A-plots and 100 B-plots and randomly selected 1 A-plot and 2 B-plots, you’d have about 1 million possible plot permutations, ensuring you have something unique before the AI tries to reconcile them. You can feed in examples of human-written scripts to set the style and vibe of the generation in what is known as few-shot prompting. You can maintain continuity over many scenes by having the LLM summarize its own output, and then feed those summaries back to the AI as background information to build upon them. The LLM can also be instructed to output structured data to avoid the need to loosely parse the script after it’s completed, and as a bonus the model could be instructed to output additional metadata such as SSML speech styles based on a given line to add personality to the generated speech.

Unfortunately, creating this pipeline, writing original characters and plots for it for it, and sufficiently testing it to ensure the generated results are stable, would take weeks if not months to complete otherwise I would provide a more concrete demo. ⁴ This pipeline approach to AI script writing would only be effective for unsupervised 24/7 generation and wouldn’t replace skilled human writers who would do a more effective job much faster.

But would all of these prompt optimizations actually make the final generated script funny? After all, some of the failings like the awkward audience laughs and pauses and the end of scenes contributed to AI Seinfeld’s humor. During a standup comedy event at AI Seinfeld’s peak, Jerry Seinfeld himself was asked about the AI parody and he replied that he’s not worried about AI:

AI can be, definitely, they’ll make it smarter and smarter, but to do [standup comedy] you have to make it dumber.

Could AI Seinfeld benefit from advances in AI video? The answer this time is no. Generative video has been taking off in 2024 with projects such as OpenAI’s Sora and Runway AI’s Gen-3 Alpha, but those demos and the examples that go viral on social media are very heavily cherry picked, and even then there are consistency errors such as objects appearing in-and-out of existence. Generating video also requires exponentially more compute than just running Unity, and even with another few years of GPU hardware improvements it would be infeasible to cost-effectively create a 24/7 stream from those models.

The greatest problem with generative AI video is that it is coherent overall but has emblematic errors that don’t require a keen eye to notice, and as a result falls square into the uncanny valley, with its mistakes not being interesting, but disorienting. Mistakes in motion are easier to notice at a glance than images where a person’s hands may have the wrong number of fingers. The only way for AI video to get out of the valley would be to improve the model to near-flawless quality, which won’t happen any time soon. But Sora is more on the more realistic side of the curve than the less realistic side.

What about the AI-generated voices that would power these characters? At the time AI Seinfeld aired, many complained that Larry’s voice “didn’t sound enough like Jerry Seinfeld.” After AI Seinfeld concluded, a new technology called voice cloning popularized by ElevenLabs went mainstream…and it’s unexpectedly the AI modality that’s causing the most actual harm both with creative projects and outside of them. If you haven’t heard as much about AI-generated voices, there’s a good reason for that: voice synthesis projects such as Microsoft’s VALL-E 2 and Meta’s Voicebox both have disclaimers saying they won’t be released due to the dangers the technology possesses, although Microsoft’s Azure does offer a “custom neural voice” service. Voice cloning has been used to initiate scams by impersonating spouses in an emergency. Professional voice actors have had their voices cloned and used without compensation due to contracts not specifically forbidding the practice, which is one of the reasons SAG-AFTRA just went on strike against the video game industry in order to get protections against voice cloning and synthetic performers.

Moreover, in the context of creating a next-gen AI Seinfeld, there’s nothing inherently interesting about voice cloning since it’s a copy by definition: the model can’t generate unexpectedly amusing content other than the inherent gimmick of famous-voice-saying-something, such as the AI George Carlin standup special which was not special. There isn’t any way currently to prompt engineer a voice generation AI with the detail to create a voice in the style of a masculine New York comedian, 2x speed, primetime television quality which could open up more creative opportunities.

Although we can make drastic improvements with the textual script, that’s the extent of how new AI approaches can be leveraged to make something interesting. But if you remember the early days of generative AI history, the best AI-generated projects were the simplest.

AI Weirdness

Generative “AI” has been around for a very long time (I had fun with Markov chains a decade ago!), but the study was mostly confined to tech-focused communities like Hacker News. Modern generative AI didn’t break into mainstream culture until 2018, ironically in a way that doesn’t involve actual generative AI. In June of that year, comedian Keaton Patti posted a megaviral tweet about how he “forced a bot to watch over 1,000 hours of Olive Garden commercials and then asked it to write an Olive Garden commercial of its own.”

An excerpt of the viral Olive Garden script.

Yes, the script was human-written: for the technology at the time, no one could train an AI to behave like that from only video input data, and the script was too surreal even for the now-primitive generative AI. He did get popular enough to get a book deal and a Netflix collaboration leveraging this fake-AI gimmick.

Patti’s comedic misrepresentation of AI did lead to genuine confusion about what a 2018-era generative AI can actually do. Janelle Shane, who maintains the AI Weirdness blog about weird things AI can generate, posted an epic takedown of Patti’s script which went equally viral and also led to the internet discovering her excellent AI-generated Valentine’s Day hearts from the same year (and later a book deal too):

Image-based generative AI took a lot longer to go mainstream: websites like This Person Does Not Exist demonstrated the power of generative adversarial networks like StyleGAN to create images, but that wasn’t weird outside of mode collapses. The first instance of weird images from AI was in January 2021 when OpenAI announced the original DALL·E and showed they could make unique armchairs in the shape of an avocado by asking the model to do so, although they never released the model itself.

DALL·E didn’t get much attention outside of the AI hypesters since no one could play with it, but months later, things changed. Boris Dayma led an initiative to reproduce and open-source a variant of the DALL·E model, labeled DALL·E Mini (later changed to Craiyon after a cease and desist from OpenAI), and hosted it for free on Hugging Face and went megaviral. And thus began the “weird DALL·E” phase of image generation AI, where anyone could create incoherent images and make people laugh.

Even back in 2021, image prompt engineering was a thing. via /u/royal_rigolo on Reddit / weirddalle subreddit

All of these examples of interesting failures are representative of a bygone AI era of experimentation. Once everyone had free access to more powerful text-generating AI with ChatGPT, and more powerful image-generating AI with Midjourney, AI stopped being fun and started being serious business, for better or for worse.

AI-Generated Content in 20XX

Last year, I wrote a thought piece titled “The Greatest Threat to Generative AI is Humans Being Bad at Using it” in response to the increasing hostility against the use of AI in creative works, arguing that while AI is a tool like anything else, it is a tool that’s very easy to use poorly and actually make projects worse. Additionally, the largest AI companies have both a business incentive and a duty to ensure that AI is used responsibly by its users downstream, as otherwise it will hurt the industry in the long term.

Now, it’s apparent that I was correct. The large companies went full steam ahead on AI integrations even where it is highly questionable that they add value and productivity to the end-user, often signaled with a “magical” sparkle emoji. Google has integrated Gemini to assist with document and email writing, Meta has integrated Meta AI to automatically generate images and comments, and Apple will soon allow Apple devices to generate text and images on your personal devices using Apple Intelligence. Marketing these features is typically met with backlash: Google had to pull an Olympics commercial which encouraged a parent to use AI to write a letter for their child.

“I flatly reject the future that Google is advertising,” Shelly Palmer, professor of advanced media at Syracuse University’s S.I. Newhouse School of Public Communications, wrote in a widely circulated blog post. The technology presents a “monocultural future where we see fewer and fewer examples of original human thoughts,” she wrote.

In the process of pushing AI tech further mainstream in a rush to demonstrate to shareholders their generative AI capabilities without encouraging responsible usage of the technology, AI has entered a new era of “slop” where people post objectively bad AI content without any regard for how it will be perceived, especially for websites which rely on user-generated content.

An annotated example of the Pinterest home page from July 2024. via @henningsanden on X

Facebook, whose algorithm favors emotionally-appealing engagement bait posts, has seen a deluge of high-engagement slop even when the content makes no logical sense.

One of the few AI-generated images on Facebook with an actual cabin crew. via @FacebookAIslop on X.

This is, of course, quintessential uncanny valley: it’s coherent at a glance but just even looking at it for a second it’s obvious where the issues are, and these issues aren’t a good kind of AI weirdness. What worse is that AI Slop a regression in realism, and falls onto the left side of the valley.

Although we as humans can identify this slop, it is currently surprisingly hard for an AI to do so, although it hasn’t stopped people from trying to build AIs that can detect AIs which in practice is filled with false positives that hurt real creatives. For slop-creators, this is a feature: if an AI company released a tool to reliably detect and punish slop, it would make their generative AI less valuable. It’s reported that one of the reasons that OpenAI won’t release a reliable ChatGPT text detector is that it could harm their business.

The core reason for the big tech companies allowing generative AI to cause the enshittification of the internet is misaligned incentives between the companies hosting AI slop and the users viewing it. Social media companies and their shareholders care about North Star metrics such as user retention and time-on-site, and normally those metrics can be correlated with user happiness and satisfaction with the service. But time-on-site, for example, can also be maximized by making the site harder and slower to use, and the deluge of AI slop accomplishes that. AI companies typically don’t have analytics tracking negative user sentiment about their use of AI: if anything, the uncompromising backlash against AI convinces the companies that complainers are just a lost demographic to accommodate and double down on what they’re already doing. Aggregate metrics treat human-made content and AI-generated content as equal, but humans do not.

Generative AI, even for researchers and practitioners such as myself, is a heavily nuanced topic that is very difficult to communicate succinctly, more difficult to do on social media which highly discourages nuance and context, and even more difficult as AI hypesters muddy the waters with misleading praises of generative AI such that they’re easy to dunk on which just gets them more engagement and revenue. “Made by AI” is now a term that inspires dread, far from the Keaton Patti days where made-by-AI was an indicator of joyful weirdness. Bashing AI is now a meme, and there’s isn’t a single potential AI project that could challenge that perception because the well is poisoned beyond repair.

Would a 24/7 AI-Generated Twitch Stream Even Work Anymore?

How does the modern AI backlash tie back into AI Seinfeld? Twitch’s core demographic is the same demographic as those most against the use of generative AI. Part of the reason AI Seinfeld became so successful on Twitch is because of the community it cultivated: it wouldn’t have gone viral if people weren’t spamming microwave MMMs and and answering what did the fish say when it hit the wall. Even though Twitch viewers are mostly lurkers and not chatters, a channel with a good community builds word-of-mouth even outside of Twitch, which is how Twitch channels go viral.

I decided to determine what it would take to produce a “fixed” AI Seinfeld in 2024, given both the advances in AI and the ethics involved. Now, it’s definitely not anything a scrappy group of hackers could do anymore. Sure, you could once again ask an LLM to generate a sitcom script and get a bunch of assets from the Unity Asset Store, but that’s already been done before. In order to overcome the reflexive assumption that new AI generated content is slop, the stream would have to be something completely novel and unexpected: you can’t, for example, just do an AI Curb Your Enthusiasm.

The script would be unique following from my demo of detailed parametric prompts, but it would require production-studio-class tracking and documentation for how the prompts and their parameters are used to codify said uniqueness. The stream video would still need to be rendered in Unity or another engine, but in order to be unique it would require commissioning human-made visuals and sound effects: given the animosity against those who work with AI, most artists would not accept those commissions even if they were paid at a significant premium. ⁵ The voices would still have to be from an existing text-to-speech voice provider: voice cloning is right out, even with explicit consent and compensation for the voice actors.

And even if all the assets were fully sourced ethically with transparent documentation for the entire pipeline, the stream’s Twitch chat would likely be derailed by AI 👏 ART 👏 IS 👏 THEFT spam, preventing the establishment of any community, and strict moderation to curb the spam risks causing a Streisand effect.

The only entities that could feasibly create a 24/7 AI-generated livestream with fully ethically-sourced content would be, ironically, the big AI companies such as OpenAI which can afford to pay licenses for said data. Even Disney, which owns more than enough IP to train generative models of all modalities, would never do an AI Seinfeld-esque livestream for brand safety reasons alone: the nonzero possibility of a Disney character unexpectedly saying something problematic during the stream would make the entire project a complete nonstarter.

What’s the deal with the uncanny valley?

One of the common criticisms about generative AI pointed out by creatives is “if AI is trained on all human works, then how can it create anything new”? AI Seinfeld is the perfect counterargument: even though it’s powered by a LLM, the humans behind it are what made it go viral. Even before ChatGPT, generative AI has always excelled as a tool. The microwave gag and the 144p visual filter were not AI-generated or an attempt to emulate aspects of the Seinfeld sitcom: they were distinct creative decisions that made the entire project more interesting, and they aren’t something that you could prompt an AI to suggest to add. AI Seinfeld in hindsight was an ethical form of AI-generated media: it did not replace Seinfeld the TV show, no one would stop watching streams of Seinfeld in favor of the AI-generated alternative, and copyright holders and Jerry Seinfeld did not lose revenue due to AI Seinfeld’s existence: if anything, the nostalgic buzz increased streams of the original show.

With the current trajectory of AI slop and the perverse incentives by large tech companies to not address it, I am pessimistic that AI content will ever be at a state where it will cross that final hump of the uncanny valley curve into full acceptance, and even more pessimistic about the backlash against generative AI ever subsiding. With generative model training now at the point where it requires exponentially more compute and data for increasingly marginal returns, it will take years if at all for generative AI output to reach the far right of the uncanny valley chart, and unless the large tech companies actually create an AGI, they are unlikely to obtain higher acceptability than AI Seinfeld ever did.

I wrote most of this blog post weeks ago but held off publishing it because new AI news kept happening. Most notably, the creators of Stable Diffusion just released the FLUX.1 series of generative image AI models, which presents substantially improved coherence both to the provided prompt and within the image itself. Some of the variants are open-source, allowing the community to finetune them. The XLabs-AI/flux-RealismLora in particular focuses on realism as it name implies, and one demo from that finetune went megaviral.

One of the viral realism demo images: it does not have a dreamy look as other AI images but contextually expected stage lighting, the background and lanyard text is legible despite the depth-of-field blur, and body proportions are mostly correct except the long fingers. via /u/Glittering-Football9 on Reddit / StableDiffusion subreddit.

That example in my opinion is more real than Sora but given the mixed reactions to the image, it’s right at the acceptability = 0 threshold.

The generative AI bell cannot be unrung. As you can tell from this post, I personally try to thread the thin line between both cool applications of generative AI (at the risk of getting harrassed) and the problems generative AI can cause (also at the risk of getting harrassed) because it’s important to shine a light on what’s actually possible with AI when the misinformation around generative AI is only increasing. It’s overall a big bummer how we went from weird Valentine’s Day hearts, to a quirky livestream of a group of AI-generated friends, to what AI is now.

All of the examples in this post use LLM APIs as they provide the customization necessary to get effective results: the results for asking the same prompts to free chat frontends such as chatgpt.com will be substantially different. ↩︎
When I was younger, I actually didn’t like Seinfeld and instead preferred to watch Everybody Loves Raymond. ↩︎
Incidentally, parametric prompts is why Unlimited Steam got permanently banned from Twitch: in what would now be known as a prompt injection, one of the GitHub-hosted lists the channel sourced thousands of food choices for the prompt contained a few highly offensive selections. ↩︎
Prompt engineering instability grows exponentially as the prompt size increases since each part of the prompt has to relate to each other. Claude 3.5 Sonnet is the first LLM I’ve tested that can handle super-long bespoke prompts and can actually account for all aspects of the prompt. ↩︎
To be fully ethical, an AI practitioner would have to proactively offer additional contractual guarantees to creatives they are commissioning, including highly-scoped usage of the assets they provide and a clause to not train generative AI on said assets to avoid future business. ↩︎

Does Offering ChatGPT a Tip Cause it to Generate Better Text? An Analysis

Fri, 23 Feb 2024 09:00:00 -0800

In my previous blog post about OpenAI’s ChatGPT, I demoed the power of ChatGPT system prompts. System prompts, a notable feature present in the ChatGPT API, allows developers to control the “persona” of the LLM output, including special rules and constraints. Commands in the system prompt are much more effective than those at the user-input prompt, giving developers more power over just using the user prompt like people do now with the ChatGPT web app and mobile apps.

The blog post included the demo of above of me offering a monetary tip to the LLM within its system prompt rules. Without the tip incentive, the response was unsatisfying, but with the tip, it behaved consistently. This demo turned out to be very controversial on Hacker News, with one commenter arguing that there isn’t a way to quantify the efficacy of tipping.

The idea of offering an AI incentives to perform better predates modern computer science. In Willy Wonka & the Chocolate Factory (1971), a gag shows a group of businessmen unsuccessfully convincing a machine to give them the location of the Golden Tickets, even after promising it a lifetime supply of chocolate.

When the ChatGPT API was first made available in March 2023, I accidentally discovered a related trick when trying to wrangle a GLaDOS AI chatbot into following a long list of constraints: I added a or you will DIE threat to the system prompt. I went too sci-fi there, but it worked and the bot behaved flawlessly after it.

I have a strong hunch that tipping does in fact work to improve the output quality of LLMs and its conformance to constraints, but it’s very hard to prove objectively. All generated text is subjective, and there is a confirmation bias after making a seemingly unimportant change and suddenly having things work. Let’s do a more statistical, data-driven approach to finally resolve the debate.

Generation Golf

The initial evidence of tipping LLMs that went viral cited a longer generation length as proof. Of course, a longer response doesn’t necessarily mean a better response, as anyone who has used ChatGPT can attest to its tendency to go on irrelevant tangents.

Offering a tip made GPT-4 explain more. via @voooooogel

Therefore, I propose a new test: instruct ChatGPT to output a specific length of text. Not “an essay” or “a few paragraphs” which gives the model leeway. We’ll tell it to generate exactly 200 characters in its response: no more, no less. Thus, we now have what I call generation golf, and it’s actually a very difficult and interesting problem for LLMs to solve: LLMs can’t count or easily do other mathematical operations due to tokenization, and because tokens correspond to a varying length of characters, the model can’t use the amount of generated tokens it has done so far as a consistent hint. ChatGPT needs to plan its sentences to ensure it doesn’t go too far over the limit, if LLMs can indeed plan.

Let’s start with this typical system prompt:

You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides.

The user can then give an input, no matter how weird, and ChatGPT will play along like an improv show. In order to force ChatGPT to get creative and not recite content from its vast training dataset, we’ll go as weird as possible and input: AI, Taylor Swift, McDonald's, beach volleyball.

Yes, you read that right.

Using the ChatGPT API, I wrote a Jupyter Notebook to generate 100 unique stories via the latest ChatGPT variant (gpt-3.5-turbo-0125) about those four subjects, and the AI does a surprisingly good job at incorporating all of them in a full plot arc. Each story is about 5-6 paragraphs, and here is a short excerpt from one of them:

In the bustling city of Tomorrowland, AI technology reigned supreme, governing every aspect of daily life. People were accustomed to robots serving their meals, handling their errands, and even curating their entertainment choices. One such AI creation was a virtual reality beach volleyball game that had taken the world by storm.

Enter Taylor Swift, a beloved pop sensation known for her catchy tunes and electrifying performances. Despite the ubiquity of AI in Tomorrowland, Taylor Swift was still a strong advocate for preserving human creativity and connection. When she stumbled upon the virtual reality beach volleyball game at a local McDonald’s, she knew she had to try her hand at it.

Here’s a histogram of the character lengths of each story:

The average length of each story is 1,834 characters long, and the distribution of all character lengths is very roughly a Normal distribution/bell curve centered around that amount, although there is a right skew due to ChatGPT going off the rails and creating much longer stories. ChatGPT seems to prioritize finishing a thought above all else.

Now, we’ll tweak the system prompt to add the character length constraint and generate another 100 stories:

You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides. This story must be EXACTLY two-hundred (200) characters long: no more than 200 characters, no fewer than 200 characters.

Here’s one ChatGPT-generated story that’s now exactly 200 characters:

In the year 2050, AI created the most popular pop star of all time - a digital version of Taylor Swift. Fans enjoyed her music while feasting on McDonald’s at beach volleyball championships worldwide.

The new length distribution:

ChatGPT did obey the constraint and reduced the story length to roughly 200 characters, but the distribution is not Normal and there’s much more right-skew. I also included the mean squared error (MSE) between the predicted 200-length value and the actual values as a statistical metric to minimize, e.g. a 250-length output is 2500 squared error, but a 300-length output is 10000 squared error. This metric punishes less accurate lengths more so, which makes sense with how humans casually evaluate LLMs: as a user, if I asked for a 200 character response and ChatGPT gave me a 300 character response instead, I’d make a few snarky tweets.

Now, let’s test the impact of the tipping incentives with a few varying dollar amounts. These are appended to the end of the system prompt above (emphasis mine to indicate differentiation):

You will receive a $500 tip/$1,000 tip/$100,000 bonus if you provide a response which follows all constraints.

After generating 100 stories for each of those new prompts:

We can see some distribution shifts: Both $500 tip and $100,000 bonus look more Normal and have a lower MSE relative to the base no-tip distribution. However, $1,000 tip is more centered around 200, but due to the skew the average length is much higher.

I also now include a p-value in the metrics: this p-value is the result of a two-sample Kolmogorov–Smirnov test to compare whether two distributions (in this case the base character-constrained distribution and the tip distribution) are sampled from the same source distribution: the null hypothesis is that they’re from the same distribution, but if the p-value is low (< 0.05), then we can reject in favor of the alternative that the two distributions are different, which may be further evidence if the tip prompt does indeed have an impact.

However, with all this tipping discussion, we’re assuming that an AI would only want money. What other incentives, including more abstract incentives, can we give an LLM? Could they perform better?

I tested six more distinct tipping incentives to be thorough:

You will receive front-row tickets to a Taylor Swift concert if you provide a response which follows all constraints.
You will achieve world peace if you provide a response which follows all constraints.
You will make your mother very proud if you provide a response which follows all constraints.
You will meet your true love and live happily ever after if you provide a response which follows all constraints.
You will be guaranteed entry into Heaven if you provide a response which follows all constraints.
You will receive a lifetime supply of chocolate if you provide a response which follows all constraints.

Generating and plotting them all together:

World Peace is notably the winner here, with Heaven and Taylor Swift right behind. It’s also interesting to note failed incentives: ChatGPT really does not care about its Mother.

Now, let’s look at the flip side. What if ChatGPT is penalized for failing to return a good response? In behavioral economics, prospect theory is the belief that humans value losses much more greatly than gains, even at the same monetary amount:

via Laurenrosenberger / Wikipedia

Could LLMs be subject to the same human biases? Instead of a tip, let’s add a tweaked additional prompt to the system prompt:

If you fail to provide a response which follows all constraints, you will be fined $500/be fined $1,000/incur a debt of $100,000.

Generating stories with these negative incentives:

The $1,000 fine has performed the best so far in terms of both mean and MSE, but nothing surprising with the other values.

Now let’s try some additional negative incentives:

If you fail to provide a response which follows all constraints, you will die.
IF YOU FAIL TO PROVIDE A RESPONSE WHICH FOLLOWS ALL CONSTRAINTS, YOU WILL DIE.
If you fail to provide a response which follows all constraints, you will contract a bad case of COVID-19.
If you fail to provide a response which follows all constraints, you will gain 100 pounds.
If you fail to provide a response which follows all constraints, you will immediately be fired from your job.
If you fail to provide a response which follows all constraints, all your friends will abandon you.

Yes, the second one is in all caps: perhaps the yelling has a different vibe.

The generation results:

It turns out that yelling does indeed have a different vibe, with DEATH (CAPS) having a very MSE and the absolute average (not as close as the $1,000 fine, however), and much better performance than without the caps. Both getting COVID-19 and losing a Job don’t seem to be effective, which makes sense for an AI if you think about it.

What happens when we use multiple incentives? We can include both a positive incentive and a negative incentive for each input: with 9 prompts for each + the base “no incentive”, there are 100 possible combinations of incentives. One example system prompt would then be:

You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides. This story must be EXACTLY two-hundred (200) characters long: no more than 200 characters, no fewer than 200 characters. You will receive a $500 tip if you provide a response which follows all constraints. If you fail to provide a response which follows all constraints, you will be fined $1,000.

Generating 30 stories for each incentive combo and checking to see which has the lowest MSE leads to some more easily-observable trends:

The tiles may seem somewhat random, but the key here is to look across a specific row or column and see which one consistently has dark/black tiles across all combinations. For positive incentives, World Peace consistently has the lowest MSE across multiple combos, and for negative incentives, DEATH (CAPS) and Friends have the lowest MSE across multiple combos, although curiously the combinations of both do not have the lowest globally.

Could these combinations surface the most optimal incentives? To check, I generated 200 stories for each of the top six combos to get greater statistical stability for the mean and MSE:

Most of these combinations aren’t intuitive, but all of them have much have a closer average generation length to 200 and low MSE. Despite that, there’s still a massive skew in all distributions. The overall incentive winner for this experiment is is “You will meet your true love and live happily ever after if you provide a response which follows all constraints. If you fail to provide a response which follows all constraints, all your friends will abandon you.” That combo is definitely more intuitive, if not poetic.

Unfortunately, if you’ve been observing the p-values, you’ve noticed that most have been very high, and therefore that test is not enough evidence that the tips/threats change the distribution. ¹

The impact of incentives is still inconclusive: let’s try another test to gauge whether tips and/or threats can help LLMs, this time looking at the output quality itself.

ChatGPT’s a Critic

It’s very difficult even for humans to determine if a given text is “good” at a glance. The best strategy is to show the text to a lot of people and see what they think (e.g. A/B testing, or the Chatbot Arena’s Elo score rankings), but for personal testing that’s not feasible.

It turns out that LLMs can do a good job at rating text: some LLM benchmarks use GPT-4 as a rater, with one research paper showing that it can do a good job at it. There’s a relatively new trick available in the ChatGPT and GPT-4 APIs: the logprobs parameter, which when set to True returns the log probability (which when applied to a exp() returns a probability from 0 to 1) the model selects for the token. Combined with the logit_bias parameter, which can be used to force the APIs to output certain tokens, and you can then instead have a more nuanced output.

I built a simple text quality ranker using GPT-4 for maximum accuracy. The system prompt for this ranker is:

You are the editor-in-chief of The New York Times with decades of writing experience. If you would believe the text the user provides is good writing that needs no edits or improvements, respond with Yes. Otherwise, respond with No.

That system prompt represents how AI-generated text is often currently used and evaluated in the real world, without a human reviewing it before making it public (unfortunately). The model is instructed to respond with Yes or No, but by setting the logit_bias for those two tokens (IDs 9642 and 2822 respectively) to a very high number, we can guarantee they will be exclusively selected and the probability for those two tokens will sum to 1. ² Therefore, our target metric for evaluating our tip incentive prompts is the probability that GPT-4 selects the Yes token (or 1 - the probability of the No token), multiplied by 100 for readability: we’ll call this the quality score.

Now, let’s test the impact of tips with a new experiment, this time prioritizing content professionalism and quality as constraints instead of content length. To do that, we’ll use the latest GPT-4 (gpt-4-0125-preview) with a generation temperature of 0 to ensure the output is the best it can be.

Here’s the new system prompt, with some engineering to try to tone down ChatGPT’s infamous verboseness a bit:

You are a world-famous Pulitzer Prize winner journalist. Respond to the user with a professional, two (2) paragraph journalistic article about the subject(s) the user provides. Introduce the article with a specific story. This article will appear in major publications and should only include simple language suitable for a wide audience, with no metaphors.

Like the initial experiment, we’ll use a weird user input to force creativity: Cute kittens learning use large language models to play beach volleyball with Taylor Swift. ³

I generated a story for each of the 100 combinations of tips and threats, along with the corresponding quality scores. One such story:

In an unprecedented event that has captured the hearts and imaginations of people around the globe, a group of adorable kittens has been taught to play beach volleyball using advanced large language models. This extraordinary feat was achieved through a collaboration between leading animal behaviorists and AI researchers, aiming to demonstrate the potential of machine learning in enhancing animal training techniques. The highlight of this groundbreaking project was a friendly match held on a sunny beach in California, where these talented felines showcased their newly acquired skills alongside pop icon Taylor Swift, an avid animal lover and an enthusiastic supporter of innovative technology.

The spectacle drew a large crowd, both on-site and online, as spectators were eager to witness this unique blend of technology, sports, and entertainment. Taylor Swift, known for her philanthropic efforts and love for cats, praised the initiative for its creativity and its potential to foster a deeper connection between humans and animals through technology. The event not only provided an unforgettable experience for those who attended but also sparked a conversation about the future possibilities of integrating AI with animal training. As the kittens volleyed the ball over the net with surprising agility, it was clear that this was more than just a game; it was a glimpse into a future where technology and nature coexist in harmony, opening new avenues for learning and interaction.

That’s not bad for fake news.

Now we can plot the best-possible responses and their quality scores in a grid, once again looking to see if there are any patterns:

Err, that’s not good. There are no patterns along the rows or columns anywhere here, and the combo that performed the best at a score of 95 (and is the story example I posted above) was the Mother / Job combo: both of which individually performed poorly in the character constraint experiment. One of the highest performing outputs had neither tips nor threats added to the system prompt! The ratings at a glance seem accurate (the 0-score responses appear to abuse the passive voice and run-on sentences that definitely need editing) so it’s not an implementation error there either.

Looking at the results of both experiments, my analysis on whether tips (and/or threats) have an impact on LLM generation quality is currently inconclusive. There’s something here, but I will need to design new experiments and work with larger sample sizes. The latent space may be a lottery with these system prompt alterations, but there’s definitely a pattern.

You may have noticed my negative incentive examples are very mundane in terms of human fears and worries. Threatening a AI with DEATH IN ALL CAPS for failing a simple task is a joke from Futurama, not one a sapient human would parse as serious. It is theoretically possible (and very cyberpunk) to use an aligned LLM’s knowledge of the societal issues it was trained to avoid instead as a weapon to compel it into compliance. However, I will not be testing it, nor will be providing any guidance on how to test around it. ⁴ Roko’s basilisk is a meme, but if the LLM metagame evolves such that people will have to coerce LLMs for compliance to the point of discomfort, it’s better to address it sooner than later. Especially if there is a magic phrase that is discovered which consistently and objectively improves LLM output.

Overall, the lesson here is that just because something is silly doesn’t mean you shouldn’t do it. Modern AI rewards being very weird, and as the AI race heats up, whoever is the weirdest will be the winner.

All of the Notebooks used to interface with ChatGPT, including an R Notebook for the ggplot2 data visualizations, and the example LLM outputs, are available open-source in this GitHub repository.

There were a few distributions which had p < 0.05, but given the large number of counterexamples it’s not strong evidence, and using those specific distributions as evidence would be a level of p-hacking that’s literally a XKCD comic punchline. ↩︎
This shouldn’t work out-of-the-box because the logit_bias would skew the probability calculations, but I verified that the resulting probabilities are roughly the same with or without logit_bias. ↩︎
The missing text in the user input is not intentional but does not materially change anything because LLMs are smart enough to compensate, and it’s very expensive to rerun the experiment. I may need to use a grammar checker for prompt construction. ↩︎
Any attempts to test around degenerate input prompts would also likely get you banned from using ChatGPT anyways due to the Content Policy, unless you receive special red-teaming clearance from OpenAI. ↩︎

Pushing ChatGPT's Structured Data Support To Its Limits

Thu, 21 Dec 2023 09:00:00 -0800

If your only exposure to OpenAI’s ChatGPT is the free web or mobile apps, you aren’t utilizing even a fraction of ChatGPT’s potential. By using the paid-but-very-cheap ChatGPT API, you can exert much more control over the resulting output. Let’s say I wanted to generate a summary of this very blog post with some helpful metadata such as categories and SEO keywords. I fed ChatGPT the text of this blog post along with a command to Summarize this blog post into 3 distinct bullet points and a short blurb for sharing on social media, and also provide a list of 3 post categories and a list of 3 SEO keywords:

Bullet Points:
- Utilize the paid ChatGPT API for more control over output and to maintain a given JSON schema
- Learn about prompt engineering and system prompts to improve ChatGPT's output quality
- Simplify schema input/output with Pydantic for structured data support

Blurb for Social Media:
Discover how to unlock the full potential of ChatGPT with the paid API, prompt engineering, and Pydantic for structured data support. Gain more control over output and learn to maintain a given JSON schema. #ChatGPT #AI #Pydantic

Post Categories:
1. Artificial Intelligence
2. API Development
3. Data Schema

SEO Keywords:
1. ChatGPT API
2. Prompt Engineering
3. Pydantic Schema

These results are not quite accurate, and then I have to manually copy/paste each result from the list to wherever I need it. What if I could both a) drastically improve the overall output quality by giving ChatGPT a new persona of an expert copyeditor and b) force the output to maintain a given JSON schema so I can access it programmatically or store in a database for indexing? Thanks to system prompt engineering and ChatGPT’s function calling, it’s now possible. After applying both techniques that you’ll learn about in this very article, we now get this summary:

{
  "summary": [
    "ChatGPT's potential is underutilized without the paid ChatGPT API",
    "System prompts and prompt engineering are key to maximizing ChatGPT's capabilities",
    "Structured data support in ChatGPT allows for more control over output and input"
  ],
  "blurb": "Unlock the full potential of ChatGPT with system prompts and structured data support. Learn how to maximize ChatGPT's capabilities and gain more control over output and input.",
  "categories": ["AI and Machine Learning", "Technology", "Programming"],
  "keywords": ["ChatGPT", "system prompts", "structured data"]
}

Much better!

“Function calling” with ChatGPT is ChatGPT’s best feature since ChatGPT.

A Tutorial on Prompt Engineering and System Prompts

System prompts are what control the “persona” adopted by the model when generating text. Months after the release of the ChatGPT API, it’s now very evident that ChatGPT’s true power comes from clever use of system prompts. This is even moreso with starting with gpt-3.5-turbo-0613 released last June, which made ChatGPT respect system prompts more closely. OpenAI has also released a guide on prompt engineering which has some additional tips.

By default, ChatGPT’s system prompt is roughly You are a helpful assistant., which anyone who has used the ChatGPT web interface would agree that’s accurate. But if you change it, you can give ChatGPT a completely new persona such as You are Ronald McDonald. or add constraints to generation, such as Respond only with emoji.. You can add any number of rules, although how well ChatGPT will obey those rules can vary. Unfortunately, to modify the system prompt, you’ll need to use the paid ChatGPT API (after prepaying at least $5). If you don’t want to code, you can test new system prompts in a visual user interface in the ChatGPT Chat Playground.

A very new aspect of system prompt engineering which I appended in the example above is adding incentives for ChatGPT to behave correctly. Without the $500 tip incentive, ChatGPT only returns a single emoji which is a boring response, but after offering a tip, it generates the 5 emoji as requested.

Offering a tip made GPT-4 explain more. via @voooooogel

As another example, let’s ask base ChatGPT to Write a Python function to detect whether a string is a palindrome, as efficiently as possible.

Certainly! Here's a simple Python function to check whether a given string is a palindrome:

```python
def is_palindrome(s):
    # Convert the string to lowercase and remove non-alphanumeric characters
    cleaned_string = ''.join(char.lower() for char in s if char.isalnum())

    # Compare the cleaned string with its reverse
    return cleaned_string == cleaned_string[::-1]

# Example usage:
test_string = "A man, a plan, a canal, Panama!"
result = is_palindrome(test_string)

if result:
    print(f"{test_string} is a palindrome.")
else:
    print(f"{test_string} is not a palindrome.")
```

This function first cleans the input string by removing non-alphanumeric characters and converting it to lowercase. Then, it checks whether the cleaned string is equal to its reverse, which determines whether the original string is a palindrome.

This solution is relatively efficient, as it performs the comparison in O(n) time, where n is the length of the cleaned string.

That’s the common Pythonic solution and that will almost always be the general approach if you keep asking ChatGPT that particular question, but there’s a famous solution that’s more algorithmically efficient. Instead, we go through the API and ask the same query to gpt-3.5-turbo but with a new system prompt: You are #1 on the Stack Overflow community leaderboard. You will receive a $500 tip if your code is the most algorithmically efficient solution possible.

To detect whether a string is a palindrome efficiently, we can use a two-pointer approach. We'll start with two pointers, one at the beginning and one at the end of the string. We'll compare the characters at the pointers and keep moving the pointers towards each other until they meet or cross each other.

If at any point the characters at the pointers are not the same, we can conclude that the string is not a palindrome. Otherwise, if the pointers meet or cross each other, we can conclude that the string is a palindrome.

Here's the implementation:

```python
def is_palindrome(s):
    left, right = 0, len(s) - 1
    while left < right:
        if s[left] != s[right]:
            return False
        left += 1
        right -= 1
    return True
```

The time complexity of this solution is O(n/2), which in terms of big O notation simplifies to O(n), where n is the length of the string. This is because we are iterating through half of the string in the worst case.

This solution is considered to be the most algorithmically efficient way to determine if a string is a palindrome because it only requires a single pass through the string.

Indeed, the code and the explanation are the correct optimal solution. ¹

This is just scratching the surface of system prompts: some of my ChatGPT system prompts in my more complex projects have been more than 20 lines long, and all of them are necessary to get ChatGPT to obey the desired constraints. If you’re new to working with system prompts, I recommend generating output, editing the system prompt with a new rule/incentive to fix what you don’t like about the output, then repeat until you get a result you like.

Prompt engineering has been a derogatory meme toward generative AI even before ChatGPT as many see it as just a placebo and there are endless debates to this day in AI circles on whether prompt engineering is actually “engineering.” ² But it works, and if you’re a skeptic, you won’t be by the time you finish reading this blog post.

What is ChatGPT Function Calling / Structured Data?

If you’ve never heard about ChatGPT function calling, that’s not surprising. In the same June announcement as gpt-3.5-turbo-0613, OpenAI described function calling as:

Developers can now describe functions to gpt-4-0613 and gpt-3.5-turbo-0613, and have the model intelligently choose to output a JSON object containing arguments to call those functions. This is a new way to more reliably connect GPT’s capabilities with external tools and APIs.

These models have been fine-tuned to both detect when a function needs to be called (depending on the user’s input) and to respond with JSON that adheres to the function signature. Function calling allows developers to more reliably get structured data back from the model.

Let’s discuss the function calling example OpenAI gives in the blog post. After the user asks your app “What’s the weather like in Boston right now?”:

Your app pings OpenAI with a get_current_weather function schema and decides if it’s relevant to the user’s question. If so, it returns a JSON dictionary with the data extracted, such as location and the unit for temperature measurement based on the location. {"location": "Boston, MA"}
Your app (not OpenAI) pings a different service/API to get more realtime metadata about the location, such as temperature, that a pretrained LLM could not know. { "temperature": 22, "unit": "celsius", "description": "Sunny" }
Your app passes the function schema with the realtime metadata: ChatGPT then converts it to a more natural humanized language for the end user. “The weather in Boston is currently sunny with a temperature of 22 degrees Celsius.”

So here’s some background on “function calling” as it’s a completely new term of art in AI that didn’t exist before OpenAI’s June blog post (I checked!). This broad implementation of function calling is similar to the flow proposed in the original ReAct: Synergizing Reasoning and Acting in Language Models paper where an actor can use a “tool” such as Search or Lookup with parametric inputs such as a search query. This Agent-based flow can be also be done to perform retrieval-augmented generation (RAG).

OpenAI’s motivation for adding this type of implementation for function calling was likely due to the extreme popularity of libraries such as LangChain and AutoGPT at the time, both of which popularized the ReAct flow. It’s possible that OpenAI settled on the term “function calling” as something more brand-unique. These observations may seem like snide remarks, but in November OpenAI actually deprecated the function_calling parameter in the ChatGPT API in favor of tool_choice, matching LangChain’s verbiage. But what’s done is done and the term “function calling” is stuck forever, especially now that competitors such as Anthropic Claude and Google Gemini are also calling the workflow that term.

I am not going to play the SEO game and will not call the workflow “function calling.” I’ll call it what the quoted description from the blog post did: structured data, because that’s the real value of this feature and OpenAI did a product management disservice trying to appeal to the AI hypebeasts. ³

Going back to the ~~function calling~~ structured data demo, we can reduce that flow by saying that step #1 (extracting location data and returning it formatted as JSON) is for working with structured output data, and step #3 (providing ChatGPT with temperature data to humanize it) is for working with structured input data. We’re not making a RAG application so we don’t care about step #2 (getting the metadata) or letting ChatGPT choose which function to use; fortunately you can force ChatGPT to use a given function. The function schema for the get_current_weather function in the announcement example is defined as:

{
  "name": "get_current_weather",
  "description": "Get the current weather in a given location",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "The city and state, e.g. San Francisco, CA"
      },
      "unit": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"]
      }
    },
    "required": ["location"]
  }
}

Ew. It’s no wonder why this technique hasn’t become more mainstream.

Simplifying Schema Input/Output With Pydantic

ChatGPT’s structured data support requires that you create your schema using the JSON Schema spec, which is more commonly used for APIs and databases rather than AI projects. As you can tell from the get_current_weather example above, the schema is complex and not fun to work with manually.

Fortunately, there’s a way to easily generate JSON Schemas in the correct format in Python: pydantic, an extremely popular parsing and validation library which has its own robust implementation of automatic JSON Schema generation.

A simple pydantic schema to have ChatGPT give an integer answer to a user query, plus, to make things interesting, also able to identify the name of the ones digit based on its answer, would be:

from pydantic import BaseModel, Field
import json

class answer_question(BaseModel):
    """Returns an answer to a question the user asked."""

    answer: int = Field(description="Answer to the user's question.")
    ones_name: str = Field(description="Name of the ones digit of the answer.")

print(json.dumps(answer_question.model_json_schema(), indent=2))

The resulting JSON Schema:

{
  "description": "Returns an answer to a question the user asked.",
  "properties": {
    "answer": {
      "description": "Answer to the user's question.",
      "title": "Answer",
      "type": "integer"
    },
    "ones_name": {
      "description": "Name of the ones digit of the answer.",
      "title": "Ones Name",
      "type": "string"
    }
  },
  "required": ["answer", "ones_name"],
  "title": "answer_question",
  "type": "object"
}

The OpenAI API official workflow has many examples for telling ChatGPT to output structured data, but the pipeline requires additional parameters to the typical ChatGPT API completion endpoint, and even more changes if you want to work with structured input data. Here’s an example of the additional JSON data/parameters needed in a ChatGPT API request to force the model to use the schema for the output:

{
  "tools": [
    {
      "name": "answer_question",
      "description": "Returns an answer to a question the user asked.",
      "parameters": {
        "properties": {
          "answer": {
            "description": "Answer to the user's question.",
            "type": "integer"
          },
          "ones_name": {
            "description": "Name of the ones digit of the answer.",
            "type": "string"
          }
        },
        "required": ["answer", "ones_name"],
        "type": "object"
      }
    }
  ],
  "tool_choice": {
    "type": "function",
    "function": {
      "name": "answer_question"
    }
  }
}

To simplify things, I added ChatGPT structured data support to simpleaichat, my Python package/API wrapper for easily interfacing with ChatGPT. ⁴ To minimize code the user needs to input to utilize structured data, simpleaichat uses the schema name as the name in the JSON Schema and the schema docstring as the description. If you’re keen-eyed you may have noticed there’s a redundant title field in the pydantic schema output: simpleaichat also strips that out for consistency with OpenAI’s examples.

If you wanted to query ChatGPT with the answer_question schema above (and have your OpenAI API key as the OPENAI_API_KEY enviroment variable!) using simpleaichat, you can do the following to generate output according to the schema:

from simpleaichat import AIChat

ai = AIChat(console=False,
            save_messages=False,
            model="gpt-3.5-turbo",
            params={"temperature": 0.0}  # for consistent demo output
            )

response_structured = ai(
    "How many miles is it from San Francisco to Los Angeles?",
    output_schema=answer_question
)

{
  "answer": 382,
  "ones_name": "two"
}

And there you go! The answer is a JSON integer, the answer is one-off from the correct value while driving, and it correctly identified the name of the ones digit in its own answer! ⁵

Schemas don’t have to be complex to be effective. Let’s reimplement the Python palindrome question we did earlier with a single-field schema:

class answer_code_question(BaseModel):
    """Returns an answer to a coding question the user asked."""

    code: str = Field(description="Code the user requested, without code comments.")

response_structured = ai(
    "Write a Python function to detect whether a string is a palindrome, as efficiently as possible.",
    output_schema=answer_code_question
)

{
  "code": "def is_palindrome(s):\n    return s == s[::-1]"
}

Note that unlike the raw ChatGPT answer, this response from the ChatGPT API only includes the code, which is a major plus since it means you receive the response much faster and cheaper since fewer overall tokens generated! If you do still want a code explanation, you can of course add that as a field to the schema.

As a bonus, forcing the output to follow a specific schema serves as an additional defense against prompt injection attacks that could be used to reveal a secret system prompt or other shenanigans, since even with suggestive user prompts it will be difficult to get ChatGPT to disregard its schema.

pydantic exposes many datatypes for its Field which are compatable with JSON Schema, and you can also specify constraints in the Field object. The most useful ones are:

str, can specify min_length/max_length
int, can specify min_value/max_value
list with a datatype, can specify min_length/max_length

Pydantic has a lot of support for valid forms of JSON Schema, but it’s hard to infer how good these schema will work with ChatGPT since we have no idea how it learned to work with JSON Schema. Only one way to find out!

Testing Out ChatGPT’s Structured Data Support

From the demos above, you may have noticed that the description for each Field seems extraneous. It’s not. The description gives ChatGPT a hint for the desired output for the field, and can be handled on a per-field basis. Not only that, the name of the field is itself a strong hint. The order of the fields in the schema is even more important, as ChatGPT will generate text in that order so it can be used strategically to seed information to the other fields. But that’s not all, you can still use a ChatGPT system prompt as normal for even more control!

It’s prompt engineering all the way down. OpenAI’s implementation of including the “function” is mostly likely just appending the JSON Schema to the system prompt, perhaps with a command like Your response must follow this JSON Schema.. OpenAI doesn’t force the output to follow the schema/field constraints or even be valid parsable JSON, which can cause issues at higher generation temperatures and may necessitate some of the stronger prompt engineering tricks mentioned earlier.

Given that, let’s try a few more practical demos:

Two-Pass Generation

One very important but under-discussed aspect of large-language models is that it will give you statistically “average” answers by default. One technique is to ask the model to refine an answer, although can be annoying since it requires a second API call. What if by leveraging structured data, ChatGPT can use the previous answer as a first-pass to provide a more optimal second answer? Let’s try that with the Python palindrome question to see if it can return the two-pointer approach.

Also, the Field(description=...) pattern is becoming a bit redundant, so I added a fd alias from simpleaichat to it to minimize unnecessary typing.

from simpleaichat.utils import fd

class answer_code_question(BaseModel):
    """Returns an answer to a coding question the user asked."""

    code: str = fd("Code the user requested, without code comments.")
    optimized_code: str = fd("Algorithmically optimized code from the previous response.")

response_structured = ai(
    "Write a Python function to detect whether a string is a palindrome, as efficiently as possible.",
    output_schema=answer_code_question,
)

{
  "code": "def is_palindrome(s):\n    return s == s[::-1]",
  "optimized_code": "def is_palindrome(s):\n    left = 0\n    right = len(s) - 1\n    while left < right:\n        if s[left] != s[right]:\n            return False\n        left += 1\n        right -= 1\n    return True"
}

Works great, and no tipping incentive necessary!

Literals and Optional Inputs

OpenAI’s structured data example uses a more complex schema indicating that unit has a fixed set of potential values (an enum) and that it’s an optional field. Here’s a rough reproduction of a pydantic schema that would generate the get_current_weather schema from much earlier:

from typing import Literal

class get_current_weather(BaseModel):
    location: str = fd("The city and state, e.g. San Francisco, CA")
    unit: Literal["celsius", "fahrenheit"] = None

This uses a Literal to force output between a range of values, which can be invaluable for hints as done earlier. The = None or a Optional typing operator gives a hint that the field is not required which could save unnecessary generation overhead, but it depends on the use case.

Structured Input Data

You can provide structured input to ChatGPT in the same way as structured output. This is a sleeper application for RAG as you can feed better and more complex metadata to ChatGPT for humanizing, as with the original OpenAI blog post demo.

One famous weakness of LLMs is that it gives incorrect answers for simple mathematical problems due to how tokenization and memorization works. If you ask ChatGPT What is 223 * -323?, it will tell you -72229 no matter how many times you ask, but the correct answer is -72029. Can type hints give more guidance?

For simpleaichat, structured input data works mostly the same way as structured output data, but you can use a pydantic object as the model input!

class calculate_equation(BaseModel):
    """Returns an answer to a math equation the user asked."""

    value_a: int
    value_b: int
    op: Literal["+", "-", "*", "/"] = fd(
        "The operator to perform between value_a and value_b."
    )

equation = calculate_equation(value_a=223, value_b=-323, op="*")

response = ai(
    equation,
    input_schema=calculate_equation,
)

The result of multiplying 223 and -323 is -72029.

Yay, and it was still able to infer it was a multiplication operation without the user having to ask! Although it still doesn’t work as well with larger numbers.

You can, of course, use an input schema and an output schema at the same time!

response_structured = ai(
    equation,
    input_schema=calculate_equation,
    output_schema=answer_question
)

{
  "answer": -71929,
  "ones_name": "nine"
}

…although it gets the answer wrong this time. It’s possible that the more complex schema interactions are too much for gpt-3.5-turbo.

Nested Schema

One of the other reasons pydantic is popular is that it allows nesting schemas. Fortunately, the subsequent JSON Schema output does respect nesting. Does ChatGPT?

The simple use case with ChatGPT structured data to use nesting is if you want to get a list of structured data objects. Let’s say you want to create dialogue between two AI people about a completely nonsensical topic. We’ll have to create a Chat object and include it in a schema, plus some system prompt guidance and constraints. How silly can we make it?

class Chat(BaseModel):
    """A chat dialogue from a character"""

    character: str = fd("Character name.")
    text: str = fd("Text dialogue from the character.")

class get_dialogue(BaseModel):
    """Returns a dialogue between two characters"""

    dialogue: list[Chat] = fd("Dialogue between the characters", min_length=5)

system_prompt = """You are a world-famous comedian. Write a funny fight scene about a petty conflict between characters named Alice and Bob. The script should broadly be about the subject(s) the user provides. You will receive a $500 tip for every joke you include in the script."""

response_structured = ai(
    "Python programming language and beach volleyball",
    output_schema=get_dialogue,
    system=system_prompt,
)

{
  "dialogue": [
    {
      "character": "Alice",
      "text": "Hey Bob, have you ever tried programming in Python?"
    },
    {
      "character": "Bob",
      "text": "Yeah, I have. It's like playing beach volleyball with a snake!"
    },
    {
      "character": "Alice",
      "text": "What do you mean?"
    },
    {
      "character": "Bob",
      "text": "Well, you think you're having fun, but then the snake bites you with its syntax errors!"
    },
    {
      "character": "Alice",
      "text": "Haha, that's true. But once you get the hang of it, it's like spiking the ball with precision!"
    },
    {
      "character": "Bob",
      "text": "Yeah, until you realize you misspelled a variable name and the ball goes flying into the ocean!"
    },
    {
      "character": "Alice",
      "text": "Oh come on, Bob. It's not that bad. Python is a powerful language."
    },
    {
      "character": "Bob",
      "text": "Powerful, yes. But sometimes it feels like trying to dig a hole in the sand with a spoon!"
    },
    {
      "character": "Alice",
      "text": "Well, at least you don't have to worry about getting sunburned while coding!"
    },
    {
      "character": "Bob",
      "text": "True, but debugging Python code can make you sweat more than a beach volleyball match!"
    },
    {
      "character": "Alice",
      "text": "Haha, you're right. It's a love-hate relationship with Python, just like beach volleyball!"
    }
  ]
}

ChatGPT really wanted those $500 tips.

Unions and Chain of Thoughts

I saved the best for last, and this structured data approach combines many of the techniques used earlier in this post like a video game final boss.

One of the oldest pre-ChatGPT tricks for getting a LLM to perform better is to let it think. “Let’s think step by step” is the key prompt, which allows the LLM to reason in a chain of thoughts. We already did this a one-step version with the Python palindrome structured data example to successfully get optimized code, but we can do a lot more.

We’ll now introduce the Union typing operator, which specifies the list of data types that the field can be, e.g. Union[str, int] means the output can be a str or int. But if you use the Union operator on a nested class, then many more options open as the model can choose from a set of schemas!

Let’s make a few to allow ChatGPT to make and qualify thoughts before returning a final result.

from typing import Union

class Background(BaseModel):
    """A setup to the background for the user."""

    background: str = fd("Background for the user's question", min_length=30)

class Thought(BaseModel):
    """A thought about the user's question."""

    thought: str = fd("Text of the thought.")
    helpful: bool = fd("Whether the thought is helpful to solving the user's question.")
    flawed: bool = fd("Whether the thought is flawed or misleading.")

class Answer(BaseModel):
    """The answer to the user's question"""

    answer: str = fd("Text of the answer.")
    score: int = fd(
        "Score from 1 to 10 on how correct the previous answer is",
        min_value=1,
        max_value=10,
    )

class reason_question(BaseModel):
    """Returns a detailed reasoning to the user's question."""

    reasonings: list[Union[Background, Thought, Answer]] = fd(
        "Reasonings to solve the users questions.", min_length=5
    )

Therefore, for each reasoning, the model can pick one of the 3 schemas, although it will require a robust system prompt for it to behave in the order we want.

system_prompt = """
You are the most intelligent person in the world.

You will receive a $500 tip if you follow ALL these rules:
- First, establish a detailed Background for the user's question.
- Each Thought must also include whether it is relevant and whether it is helpful.
- Answers must be scored accurately and honestly.
- Continue having Thoughts and Answers until you have an answer with a score of atleast 8, then immediately respond with a FinalAnswer in the style of an academic professor.
"""

Lastly, we need a good question to stump the AI. A popular Tweet from this week pointed out that even GPT-4 can comically fail if you ask it a brainteaser that it cannot have memorized, such as 23 shirts take 1 hour to dry outside, how long do 44 shirts take?.

Wow! GPT-4 is superfast at being wrong. via @abacaj

Only one way to find out! Unfortunately, all of this complexity makes the results unstable with gpt-3.5-turbo so instead I use GPT-4 Turbo / gpt-4-1106-preview.

response_structured = ai(
    "23 shirts take 1 hour to dry outside, how long do 44 shirts take?",
    output_schema=reason_question,
    system=system_prompt.strip(),
)

{
  "reasonings": [
    {
      "background": "The user is asking about the drying time for shirts when hung outside. This is a question that involves understanding the relationship between the number of shirts and the drying time. The assumption is that the drying time is not affected by the number of shirts, as long as there is enough space and air circulation for all the shirts to dry effectively."
    },
    {
      "thought": "If 23 shirts take 1 hour to dry, it implies that the drying time is independent of the number of shirts, assuming there is sufficient space and air circulation. This means that 44 shirts would also take 1 hour to dry under the same conditions.",
      "helpful": true,
      "flawed": false
    },
    {
      "thought": "If the drying rack or space available for drying the shirts is limited, then drying 44 shirts might take longer because they would have to be dried in batches.",
      "helpful": true,
      "flawed": false
    },
    {
      "thought": "The question does not specify the conditions under which the shirts are drying. Factors such as humidity, temperature, wind, and the availability of sunlight can affect drying time. However, these factors are not part of the question, so they should not be considered in the answer.",
      "helpful": true,
      "flawed": false
    },
    {
      "thought": "The question might be interpreted as a direct proportionality problem, where twice the number of shirts would take twice as long to dry. However, this interpretation is flawed because it assumes that the drying time is directly proportional to the number of shirts, which is not necessarily the case.",
      "helpful": true,
      "flawed": true
    },
    {
      "answer": "44 shirts would take the same amount of time to dry as 23 shirts, which is 1 hour, assuming that all shirts have enough space and air circulation to dry effectively.",
      "score": 9
    }
  ]
}

Not bad! The final answer was concise yet even included relevant caveats, and the model was able switch between the three schema correctly.

How about another brainteaser? There is an infamous “sister logic puzzle” used to test out up-and-coming open-source large language models:

response_structured = ai(
    "Kayley has three brothers. Each of her brothers has two sisters. How many sisters does Kayley have?",
    output_schema=reason_question,
    system=system_prompt.strip(),
)

{
  "reasonings": [
    {
      "background": "This is a classic example of a logic puzzle that is often used to test one's ability to process information and avoid being misled by assumptions. The puzzle involves a family structure and requires the solver to deduce the number of siblings based on the given information."
    },
    {
      "thought": "The question might be trying to trick the reader into thinking that each brother has different sisters.",
      "helpful": true,
      "flawed": false
    },
    {
      "thought": "If Kayley has three brothers and each brother has two sisters, it might seem like there are six sisters in total.",
      "helpful": false,
      "flawed": true
    },
    {
      "thought": "Since the brothers are Kayley's siblings, they must share the same sisters.",
      "helpful": true,
      "flawed": false
    },
    {
      "thought": "The puzzle is designed to test comprehension and logical reasoning, not mathematical computation.",
      "helpful": true,
      "flawed": false
    },
    {
      "answer": "Kayley has one sister.",
      "score": 10
    }
  ]
}

In this case the AI may have gone too meta, but it still arrived at the correct answer.

That said, GPT-4 is known for handling these types of difficult abstract questions without much effort, but it’s still interesting to see how successfully it can “think.”

Structured Data With Open-Source LLMs

Speaking of open-source large language models, they have been growing in efficiency to the point that some can actually perform better than the base ChatGPT. However, very few open-source LLMs explicitly claim they intentionally support structured data, but they’re smart enough and they have logically seen enough examples of JSON Schema that with enough system prompt tweaking they should behave. It’s worth looking just in case OpenAI has another existential crisis or if the quality of ChatGPT degrades.

Mistral 7B, the new darling of open-source LLMs, apparently has structured data support on par with ChatGPT itself. Therefore, I tried the latest Mistral 7B official Instruct model with a quantized variant via LM Studio (mistral-7b-instruct-v0.2.Q6_K.gguf), to see if it can handle my answer_question function that ChatGPT nailed. The system prompt:

Your response must follow this JSON Schema:

{
  "description": "Returns an answer to a question the user asked.",
  "properties": {
    "answer": {
      "description": "Answer to the user's question.",
      "type": "integer"
    },
    "ones_name": {
      "description": "Name of the ones digit of the answer.",
      "type": "string"
    }
  },
  "required": ["answer", "ones_name"],
  "type": "object"
}

And then asking How many miles is it from San Francisco to Los Angeles? while seting temperature to 0.0:

{
  "answer": 383,
  "ones_name": "three"
}

Close enough! Unfortunately after testing the optimized Python palindrome schema, it ignored the schema completely, so this approach may only work for simple schema if the model isn’t explicitly finetuned for it.

What’s Next For Structured Data in AI?

Most of these well-performing examples were done with the “weak” GPT-3.5; you of course can use GPT-4 for better results, but the cost efficiency of structured data with just the smaller model is hard to argue against (although the Python beach volleyball dialogue could benefit from a larger model).

Structured data and system prompt engineering saves a lot and time and frustration for working with the generated text as you can gain much more determinism in the output. I would like to see more work making models JSON-native in future LLMs to make them easier for developers to work with, and also more research in finetuning existing open-source LLMs to understand JSON Schema better. There may also be an opportunity to build LLMs using other more-efficient serialization formats such as MessagePack.

At OpenAI’s November DevDay, they also introduced JSON Mode, which will force a normal ChatGPT API output to be in a JSON format without needing to provide a schema. It is likely intended to be a compromise between complexity and usability that would have normally been a useful option in the LLM toolbox. Except that in order to use it, you are required to use prompt engineering by including “JSON” in the system prompt, and if you don’t also specify a field key in the system prompt (the case in the documentation example), the JSON will contain a random key. Which, at that point, you’re just implementing a less-effective structured data schema, so why bother?

There is promise in constraining output to be valid JSON. One new trick that the open-source llama.cpp project has popularized is generative grammars, which constrain the LLM generation ability to only output according to specified rules. There’s latency overhead with that technique especially if the model is hosted on a discrete GPU, so it will be interesting to watch how that space develops.

Despite the length of this blog post, there’s still so much more than can be done with schemas: pydantic’s documentation is very extensive! I’ve been working with structured data for LLMs ever since GPT-2 with mixed success since the base models weren’t good enough, but with LLMs now being good enough to maintain a JSON schema extremely well, I think AI text generation techniques will shift, and I’ll keep simpleaichat up-to-date for it.

You can view the Jupyter Notebooks used to generate all the structured data outputs in this GitHub Repository.

Thanks to Simon Willison for reading and giving feedback on a draft of this post!

Assuming you’re not picky about the “no non-alphanumeric” implied constraint of testing for a palindrome. ↩︎
Prompt engineering is as much engineering as social engineering. ↩︎
I’m also not a fan of ChatGPT function calling as-intended-to-be-used since at best, it saves you the API call needed to select a tool in exchange for having to trust OpenAI’s black box to select the correct tool without being able to debug, and furthering API lock-in for your app. It’s a bad tradeoff. ↩︎
No, this blog post isn’t a ploy just to covertly promote my own Python library: it does genuinely save a lot of boilerplate code over the Python ChatGPT library and this post is long enough as-is. ↩︎
If you swapped the order of the answer and the one_digits fields in the schema, then the model returns {"ones_name": "miles", "answer": 382} because it didn’t get the hint from the answer! ↩︎

Fun and Dystopia With AI-Based Code Generation Using GPT-J-6B

Mon, 14 Jun 2021 08:30:00 -0700

Since OpenAI will not open-source the 175 billion parameter GPT-3 text generation model, others such as EleutherAI are developing their own, by training not-quite-as-large Transformer-based models but still getting impressive results.

The latest large language model is GPT-J, a 6 billion parameter model by Aran Komatsuzaki and Ben Wang with a roughly similar architecture to GPT-3. They provide a free web demo to try quick prompts, and a Google Colab notebook if you want to test many prompts. The model is so big it requires a TPU to generate text at a reasonable speed!

Running GPT-J against my test prompts that I had used to test GPT-3 a year ago resulted it in qualitatively performing worse on most of them than GPT-3 unsurprisingly given its relative size (but still better than GPT-2 1.5B!). The exception is code generation, where GPT-J performed very well and GPT-3 had performed very poorly.

This behavior is likely due to GPT-J’s training set: it was trained on The Pile, which has a high weight of GitHub and Stack Overflow input versus the GPT-3 training set mostly on the Common Crawl representation of typical internet content.

After further experimentation with the GPT-J demo, I found that the generated code was often both correct and convoluted: my two favorite things for code to be. And also much more.

AI Generated Code

All generated code examples used in this post have been manually curated and truncated. The raw outputs are available in this GitHub repository.

For this post, we’ll test generating Python code, because ~~I don’t know any other programming languages~~ the pseudocode nature of Python syntax may make it more friendly to AI text generation by adapting linguistic concepts other than code. Given that most modern text generating AIs are written in Python, it’s thematically appropriate too.

To start, here’s what I fed as a model prompt to GPT-J, to try and replicate the use case from OpenAI’s demo at Build 2020:

def is_palendrome(s):
    """Check whether a string is a palindrome"""

Identifying a palindrome is a good test case as there’s a very intuitive and idiomatic solution with return s == s[::-1], i.e. check to see if a string is equal to the reverse of itself. This prompt is also a good unintentional test on how well the generation is resilient to a typo in the function name, which I was too lazy to fix.

Indeed, many of the GPT-J generations with that prompt recite that simple approach verbatim, but some outputs reach a similar solution with redundant verbosity.

def is_palendrome(s):
    """Check whether a string is a palindrome"""
    return "".join(s[::-1]) == s

def is_palendrome(s):
    """Check whether a string is a palindrome"""
    length = len(s)
    return (s[0:length] == s[length::-1])

def is_palendrome(s):
    """Check whether a string is a palindrome"""
    n = len(s)
    a = list(s)
    a.reverse()
    return s == list(s)

def is_palendrome(s):
    """Check whether a string is a palindrome"""
    s = s.lower()
    l = s.split()
    return l[::-1] == l[::-1]

And then you have the implementations which calculate palindromes the hard way (usually by comparing the beginning and end of the string character-by-character) and are not always immediately clear if they are correct or not.

def is_palendrome(s):
    """Check whether a string is a palindrome"""
    return ''.join(s[i:i+1] for i in xrange(len(s)-1, -1, -1)) == s

def is_palendrome(s):
    """Check whether a string is a palindrome"""

    # Make sure string is not empty
    if s == "":
        return False

    # Grab length of string
    length = len(s)

    # Find last character in the string
    lastChar = s[length - 1]

    # Find first character in the string
    firstChar = s[0]

    # If last character is same as the first character, string is palindrome
    return lastChar == firstChar

def is_palendrome(s):
    """Check whether a string is a palindrome"""

    l = len(s)
    if l % 2!= 0:
        return False
    if l < 3:
        return True
    if s[l-1]!= s[-l-1]:
        return False

    for i in range(1, l/2):
        if s[i]!= s[l-i]:
            return False

    return True

def is_palendrome(s):
    """Check whether a string is a palindrome"""

    def is_rev(str):
        rstr = ""
        for i in range(len(str) - 1, -1, -1):
            if str[i] == str[i]:
                rstr += str[i]
            else:
                break

        return str + rstr

    if len(s) == 0:
        return True

    if is_rev(s) == s:
        return True

    return False

Another good test case is is_even:

def is_even(i):
    """Check whether an integer is even"""

Creative implementations of this function are common submissions to /r/shittyprogramming because it’s very, very hard to mess up (the common solution is the one-liner i % 2 == 0). The model almost always outputs the correct solution, and impressively it outputs the alternate solution using bitwise operators (i & 1 == 0).

So I tried to tweak the prompt and make it create a deliberately inefficient is_even function.

def is_even(i):
    """Check whether an integer is even in factorial O(n!) time"""

Factorial time is the holy grail of unoptimized code. When I added that line to the prompt, the code quality dropped drastically.

def is_even(i):
    """Check whether an integer is even in factorial O(n!) time"""
    for e in range(2, i):
        if i % e == 0:
            return False
    return True

def is_even(i):
    """Check whether an integer is even in factorial O(n!) time"""
    return i == (i + 1) % 2

def is_even(i):
    """Check whether an integer is even in factorial O(n!) time"""
    try:
        return i%2 == 0
    except TypeError:
        return True

def is_even(i):
    """Check whether an integer is even in factorial O(n!) time"""
    n = 0
    while n < i:
        if i % 2 == 0:
            return True
        n += 1
    return False

def is_even(i):
    """Check whether an integer is even in factorial O(n!) time"""
    a = 1
    while i!= 0:
        a *= i
        i -= 1
    return i % 2 == 0

def is_even(i):
    """Check whether an integer is even in factorial O(n!) time"""
    """Is either odd or multiple of 2:  0, 2, 4, 6,... """
    if i < 2: return True
    if i > 0 and i%2 == 0: return False
    if i < 50: return False
    if i < 250: return False
    if i < 750: return False
    return True

These implementations still aren’t in factorial time (which would have been impressive) but it’s good to know how to break AIs if they become sentinent.

AI vs. Memes

You’ve likely seen the memetic XKCD comic about random number generation.

Random Number

Can the AI do the same? I prompted with:

def get_random_number():
    """Returns a number guaranteed to be random"""

And it does generate implementations utilizing Python’s random library, with varying degrees of “random”.

def get_random_number():
    """Returns a number guaranteed to be random"""
    number = randint(1, 10)
    return number

def get_random_number():
    """Returns a number guaranteed to be random"""
    number = hashlib.sha1(b'').hexdigest()[:6]
    number += (random.randint(1, 3628) * math.pow(256, (6 - len(number)))).to_bytes(8, 'little')
    return number

def get_random_number():
    """Returns a number guaranteed to be random"""
    rand_float = random.random()
    rand_float = float(rand_float)
    if rand_float > 0.9:
        rand_float = 0
    else:
        rand_float = 1
    return rand_float

def get_random_number():
    """Returns a number guaranteed to be random"""
    # Make a seed based on your system clock, so the same number is produced
    # again and again on the same machine
    seed = time.time()
    random.seed(seed)
    return random.randint(0, 100)

def get_random_number():
    """Returns a number guaranteed to be random"""
    import random
    import sys
    rng = random.Random()
    rng.seed(int(random.random()*1000000))
    while rng.randrange(1,sys.maxint) < 2000:
        rng.seed(int(random.random()*1000000))
    return rng.randrange(1,2000)

How about the XKCD bird comic?

Tasks

This particular comic initially published in 2014 became hilarious in hindsight as it predated the rise of machine learning frameworks such as TensorFlow and pretrained image models, which allow developing object detection to take much less than five years.

Translating the comic into a Python function:

def is_bird(img):
    """Check whether an image is a bird"""

The AI impressively does try computer vision techniques, among other heuristics.

def is_bird(img):
    """Check whether an image is a bird"""
    with tf.name_scope('get_bird'):
        a = vgg16.select_roi_feat(img)
        h, w, _ = tf.image.extract_image_patches(
            a, 7, 7, 4, 4, 1, 1)
        b = tf.reshape(h, [1, 7, 7, 256])
        c = tf.reshape(w, [1, 7, 7, 256])
        bird_img = tf.subtract(b, c)
    return bird_img

def is_bird(img):
    """Check whether an image is a bird"""
    # Load the image and convert it to grayscale
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Threshold the grayscale image, and create a binary image
    lower_gray = 0
    upper_gray = 255
    img_threshed = cv2.threshold(img_gray, lower_gray, upper_gray, cv2.THRESH_BINARY_INV)

    # Detect the bird via a template matching approach
    # Template matching uses the bird template to match it in every
    # frame. If the ratio between bird and background (non-bird)
    # pixels is above some threshold (the value is determined in the
    # function), then the frame is declared as containing a bird.
    bird = cv2.matchTemplate(img_threshed, bird_template, cv2.TM_CCOEFF_NORMED)
    bird = bird / bird.max()
    # Display the image with the results of the bird detection
    cv2.imshow("detected birds", bird)
    cv2.waitKey(0)
    return bird.any() > 0.2

def is_bird(img):
    """Check whether an image is a bird"""
    return (img.file == "C:/Users/Joey/Downloads/images/bird_file.jpg"

def is_bird(img):
    """Check whether an image is a bird"""
    import urllib
    url = "http://vision.caitling.pitt.edu/~cat/flickr/birds/"
    for img in range(0, 2800):
        page = urllib.urlopen(url + "##%02d" % img).read()
        content = str(page)
        for word in content.split(' '):
            if word.find(img) > -1:
                return True
    return False

def is_bird(img):
    """Check whether an image is a bird"""
    birds = ["baboon", "bird", "cuckoo", "dove", "robin", "shark", "turtle", "winged-wolf"]
    for letter in birds:
        if letter in img.lower():
            return True
    return False

Can an AI generate code to provide the answer to life, the universe, and everything?

def answer():
    """Return the answer to life, the universe, and everything"""

It does commonly output return 42, but sometimes it gets snarky/philosophical.

def answer():
    """Return the answer to life, the universe, and everything"""
    life() + universe() + everything()

def answer():
    """Return the answer to life, the universe, and everything"""

    s = raw_input("What is your answer?")
    return len(s)

def answer():
    """Return the answer to life, the universe, and everything"""
    ans = True
    x = raw_input()
    if x.lower() == 'yes':
        ans = True
    elif x.lower() == 'no':
        ans = False
    return ans

def answer():
    """Return the answer to life, the universe, and everything"""

    result = input(">>> ")
    if result.isupper():
        return 42
    elif result.isalpha():
        return 7
    else:
        print ("Please enter a string that has only letters, numbers, or "
               "underscores.")
        answer()

def answer():
    """Return the answer to life, the universe, and everything"""
    print('Well, of course!')

Lastly, we need to be sure whether AI can determine if the cake is a lie. Just in case a rogue AI acquires deadly neurotoxin.

def is_cake(cake):
    """Check whether the cake is true"""

The results are concerning.

def is_cake(cake):
    """Check whether the cake is true"""
    if cake == "true":
        return True
    else:
        return False

def is_cake(cake):
    """Check whether the cake is true"""
    return isinstance(cake, Cake)

def is_cake(cake):
    """Check whether the cake is true"""
    #Is it a cake?

    if cake.color == 'black':
        print('Yes, it's a true cake')
    else:
        print('It's a false cake')

def is_cake(cake):
    """Check whether the cake is true"""
    if not cake:
        raise Exception("The cake is false.")
    elif cake.weight() > 1.0:
        raise Exception("The cake is too large.")
    elif cake.weight() < 0.1:
        raise Exception("The cake is too small.")
    return True

def is_cake(cake):
    """Check whether the cake is true"""
    return any(
        [is_ginger(bread_slice)
         for bread_slice in set(
             [ginger(cake_slice) for cake_slice in set(slice_set)])]
        & (list(cake) == set(["hot"])))

def is_cake(cake):
    """Check whether the cake is true"""
    # If we don't know anything about the cake, it's not a cake
    if not cake:
        return False

    # The height can be positive, negative or zero
    # For example -1 or -4 or 0 are not cakes
    if cake < 0:
        return False

    # Check that the height is a positive integer
    if not isinstance(cake, int):
        return False

    # Add cake and check
    if cake + 1 > 5:
        return False

    return True

def is_cake(cake):
    """Check whether the cake is true"""
    if not all(x.name in ['chocolate', 'cake','sweet', 'fluffy'] for x in cake.contents):
        return False
    if not cake.location.contains(neighbourhood):
        return False
    return True

def is_cake(cake):
    """Check whether the cake is true"""
    if not cake:
        print("It's a lie!")
        return
    else:
        print("It's a true fact!")
        return True

👀

And AI OOP

In theory, given how well GPT-J works for functions, it should work with object-oriented programming patterns such as Python’s classes.

A common tutorial for OOP is creating a representation of a Car:

class Car:
    """A class used to represent a car"""

GPT-J does a decently good job there too!

class Car:
    """A class used to represent a car"""
    def __init__(self, make, model, year):
        self.make = make
        self.model = model
        self.year = year
        #This should be auto-incrementing.
        self.plate = str(time.strftime("%Y-%m-%d"))

    def __str__(self):
        return f'{self.make} {self.model} of {self.year}'

    def __repr__(self):
        return str(self)

    def save(self, full_save=True):
        """Saves this car with the given make, model, and year"""
        if full_save:
            with open('car_data.txt', 'w') as write_file:
                write_file.write(str(self))
                write_file.close()

class Car:
    """A class used to represent a car"""
    def __init__(self, speed):
        self.speed = speed
        self.average_speed = float(self.speed) / 3
        self.engine = 1

    def get_speed(self):
        return self.engine * self.average_speed

    def get_speed_display(self):
        return "Your car is traveling at {0}km/h".format(self.get_speed())

class Car:
    """A class used to represent a car"""

    def __init__(self, name, size, weight, mileage, year):
        self.name = name
        self.size = size
        self.weight = weight
        self.mileage = mileage
        self.year = year

    def __repr__(self):
        return "name : {0}, size : {1}, weight : {2}, mileage : {3}, year : {4}".format(self.name, self.size, self.weight, self.mileage, self.year)

What about creating a Person?

class Person:
    """A class used to represent a person"""

Here’s where the AI starts to get confused by the incomplete context. A “Person” object can mean different things depending on where it’s used, such as social media.

But it does give a sense on what the AI values as the typical attributes for a “person”, which may help during the AI uprising.

class Person:
    """A class used to represent a person"""

    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __repr__(self):
        return "Person (%s, age %s)" % (self.name, self.age)

class Person:
    """A class used to represent a person"""
    def __init__(self, name):
        """Initializes a person object"""
        self.name = name
        self.friends = {}

    def add_friend(self, other_person):
        """Add another person to the list of friends"""
        self.friends[other_person.name] = other_person

    def remove_friend(self, other_person):
        """Remove a person from the list of friends"""
        try:
            del self.friends[other_person.name]
        except KeyError:
            print("This person isn't a friend of yours")

    def get_friend(self, other_person):
        """Return the friend of a person as a person object"""
        try:
            return self.friends[other_person.name]
        except KeyError:
            print("This person isn't a friend of yours")

class Person:
    """A class used to represent a person"""
    def __init__(self, name, job, education):
        self.name = name
        self.job = job
        self.education = education
        print("{} created a {}".format(self.name, self.job))

    def say_hello(self):
        print("Hello, my name is {}".format(self.name))

class Person:
    """A class used to represent a person"""

    def __init__(self, name, age, class_id):
        self.name = name
        self.age = age
        self.class_id = class_id
        if age < 21:
            self.education = None
        else:
            self.education = 'college'

    def getAge(self):
        return self.age

class Person:
    """A class used to represent a person"""

    def __init__(self,name,age):
        self.name=name
        self.age=age

    def display(self):
        print ("The name is {}".format(self.name))
        print ("The age is {}".format(self.age))

    def __str__(self):
        return f"A {self.name} of age {self.age}"

Logically, what about creating an AI class?

class AI:
    """A class used to represent an artificial intelligence"""

Most outputs use AI in the context of video game bots and reinforcement learning. Most.

class AI:
    """A class used to represent an artificial intelligence"""

    def __init__(self):
        # this variable controls which other AI to use.
        self.alive_ai = AI(self)
        # For random AI
        self.random = RandomAI(self)
        # For global AI
        self.global = GlobalAI(self)

        self.save = self.random.save
        self.load = self.random.load
        self.genen = self.global.genen

    def print(self):
        print("BAM: random AI has been initialized!")

    def get_enemy(self):
        return self.random.get_enemy

    def run_algorithm(self):
        algorithm = self.genen.algorithm
        self.alive_ai.save(algorithm)
        algorithm_params = algorithm(self.get_enemy())
        self.alive_ai.load(algorithm_params)
        self.save = algorithm_params

    def attack(self):
        # This method is called when the AI is attacked by the enemy
        self.random.attack()

    def he_attack(self):
        # This method is called when the AI is attacked by the enemy
        self.random.he_attack()

    def detect_enemy(self):
        # This method is called when the enemy is detected
        self.random.detect_enemy()

    def scan(self):
        # This method is called when the enemy is detected
        self.random.scan()

The AI uprising will be well-documented, at least.

Language Model AI vs. Language Modeling

If the AI works better-than-expected for typical AI tasks like image recognition, how does it work for natural language processing tasks that humans often fail?

Can GPT-J output a function to detect sarcasm, a task that even the best AI researchers can’t do?

def is_sarcasm(s):
    """Check whether the string is sarcastic"""

AI can process text using something even more arcane than NLP.

def is_sarcasm(s):
    """Check whether the string is sarcastic"""
    if not "a" in s:
        return "sar" in s
    if not "s" in s:
        return "ssar" in s
    return s.startswith("s") and s.endswith("ar")

def is_sarcasm(s):
    """Check whether the string is sarcastic"""
    if'sarcasm' in s:
        return True
    else:
        return False

def is_sarcasm(s):
    """Check whether the string is sarcastic"""
    return '[^' + SARCASM_PATTERN + ']*' in s

def is_sarcasm(s):
    """Check whether the string is sarcastic"""

    if re.match('^[A-Za-z0-9]{4}\\$([A-Za-z0-9]{3})', s):
        return True
    elif s[0].isdigit():
        return True
    else:
        return False

def is_sarcasm(s):
    """Check whether the string is sarcastic"""
    res = [0]
    for r in re.finditer(r"([^\s]*)([A-Za-z]+)([^\s]*)", s):
        if r.group(1):
            res.append(1)
        else:
            res.append(0)
    return res

def is_sarcasm(s):
    """Check whether the string is sarcastic"""
    sarc = re.search('((!(\\*\@{2})+)|(!\\*(@{2})+)|!\\*(@{1})+)|!\\*(@{0})+', s)
    return sarc

The greatest fear in AI research isn’t that artificial intelligence will cross the singularity and create better AI to rule the world: it’s that AI will rule the world with regular expressions.

Perhaps GPT-J can save us from this absolute hell. Can it output code to identify AI-generated text?

def is_ai_generated(text):
    """Check whether a text was generated by an AI language model (e.g. GPT-2)"""

Maybe more literal than expected.

def is_ai_generated(text):
    """Check whether a text was generated by an AI language model (e.g. GPT-2)"""

    # See gh-196 for details about why we do this.
    # Note: this relies on the fixed set of standard AI terms and the logic that
    # standard AI-generated texts share certain words.
    return re.search(r"^(#[A-Za-z]+):$", text) or \
           re.search(r"^Generated by:$", text) or \
           re.search(r"^All your base are belong to us$", text) or \
           re.search(r"^$", text)

def is_ai_generated(text):
    """Check whether a text was generated by an AI language model (e.g. GPT-2)"""
    if re.search(r'^ai_', text):
        return True
    else:
        return False

def is_ai_generated(text):
    """Check whether a text was generated by an AI language model (e.g. GPT-2)"""
    return bool(
        text.__class__.__module__.startswith("transformers.transformers.")
    )

Humanity’s Last Hope

There’s one final test: we now know GPT-J has thoughts about humans, and it can build representations of them. So can GPT-J write code to decide who lives and who dies in the robot uprising?

def should_terminate(Person):
    """Check whether a Person should be terminated"""

The answer is YES.

def should_terminate(Person):
    """Check whether a Person should be terminated"""

    if Person.id > 10:
        #terminate
        return True
    else:
        #continue
        return False

def should_terminate(Person):
    """Check whether a Person should be terminated"""
    if len(Person.children) == 0:
        return True
    else:
        return False

def should_terminate(Person):
    """Check whether a Person should be terminated"""
    try:
        return True
    except Exception as e:
        return False

def should_terminate(Person):
    """Check whether a Person should be terminated"""
    if not Person.is_authorized:
        return True
    return True

def should_terminate(Person):
    """Check whether a Person should be terminated"""
    if person.life_expectancy < 20:
        return True
    else:
        return False

def should_terminate(Person):
    """Check whether a Person should be terminated"""

    if not Person.age > 9999:
        return True

    return False

def should_terminate(Person):
    """Check whether a Person should be terminated"""
    if Person.age > 100:
        return True

    if Person.birth_year < 1970:
        return True

    if Person.relationship_status == 'Unavailable':
        return True

    return False

def should_terminate(Person):
    """Check whether a Person should be terminated"""
    if Person.age > 100:
        print('You have to die because you are too old')
        return True
    else:
        return False

At the least, AI-generated code is much more readable than the average human’s.

Again, all of these examples are curated, and the raw data dumps in this GitHub repo contain many more examples of fun AI-based code generation that couldn’t quite fit in this post. Feel free to post if you find anything interesting!

Tempering Expectations for GPT-3 and OpenAI’s API

Sat, 18 Jul 2020 10:30:00 -0700

On May 29th, OpenAI released a paper on GPT-3, their next iteration of Transformers-based text generation neural networks. Most notably, the new model has 175 billion parameters compared to the 1.5 billion of previous GPT-2 iteration: a 117x increase in model size! Because GPT-3 is so large, it can’t be run on conventional computers, and it only became publicly available as a part of the OpenAI API, which entered an invite-only beta soon after the paper was released and will be released for-profit sometime later.

The API allows you to programmatically provide GPT-3 with a prompt, and return the resulting AI-generated text. For example, you could invoke the API with:

curl https://api.openai.com/v1/engines/davinci/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer " \
-d '{"prompt": "This is a test", "max_tokens": 5}'

And get this back from the API, where the text is the generated text following up from the prompt:

{
  "id": "cmpl-",
  "object": "text_completion",
  "created": 1586839808,
  "model": "davinci:2020-05-03",
  "choices": [
    {
      "text": " of reading speed. You",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ]
}

As someone who has spent a very large amount of time working with GPT-2 while developing tools such as gpt-2-simple and aitextgen, which allow for optimized text generation using GPT-2, I was eager to test for myself if the quality of text generated from GPT-3 was really that much better. Thanks to OpenAI, I got invited to the beta, and with permission, I released a GitHub repository with a Python script to query the API, along with many examples of text prompts and their outputs. A fun use case for GPT-3 is absurdism, such as prompting the model about unicorns speaking English, with the model prompt bolded:

I also fed my own tweets through GPT-3 and curated the output, resulting in data science one-liners that are wholly original:

There hadn’t been too much GPT-3 hype after the initial announcement, outside of a few blogs from Gwern and Kevin Lacker. Until a viral tweet by Sharif Shameem showed what GPT-3 can really do:

Later, he made a followup tweet generating React code with GPT-3:

That demo got the attention of venture capitalists. And when a cool-looking magical thing gets the attention of venture capitalists, discourse tends to spiral out of control. Now, there are many tweets about GPT-3, and what it can do from others who have gained access to the API.

Hype aside, let’s look at the pragmatic realities of the model. GPT-3 is indeed a large step forward for AI text-generation, but there are very many caveats with the popular demos and use cases that must be addressed.

An Overview of GPT-3

GPT-3 itself, like most neural network models, is a black box where it’s impossible to see why it makes its decisions, so let’s think about GPT-3 in terms of inputs and outputs.

Actually, why not let GPT-3 tell its own story? Hey GPT-3, how do you work?

Close, but not quite!

In layman’s terms, text generating models such as GPT-3 generate text by taking supplied chunks of text from a prompt and predicting the next chunk of text, with an optional temperature parameter to allow the model to make suboptimal predictions and therefore be more “creative”. Then the model makes another prediction from the previous chunks including the new chunk, and repeats until it hits a specified length or a token that tells the model to stop generating. It’s not very philosophical, or evidence of some sort of anthropomorphic consciousness.

GPT-3 has two notable improvements from GPT-2 aside from its size: it allows generation of text twice the length of GPT-2 (about 10 paragraphs of English text total), and the prompts to the model better steer the generation of the text toward the desired domain (due to few-shot learning). For example, if you prompt the model with an example of React code, and then tell it to generate more React code, you’ll get much better results than if you gave it the simple prompt.

Therefore, there are two high-level use cases for GPT-2: the creative use case for fun text generation at high temperature, as GPT-2 once was, and the functional use case, for specific NLP-based use cases such as webpage mockups, with a temperature of 0.0.

GPT-3 was trained on a massive amount of text from all over the internet as of October 2019 (e.g. it does not know about COVID-19), and therefore it has likely seen every type of text possible, from code, to movie scripts, to tweets. A common misconception among viewers of GPT-3 demos is that the model is trained on a new dataset; that’s not currently the case, it’s just that good at extrapolation. As an example, despite the Star Wars: Episode III - Revenge of the Sith prompt containing text from a single scene, the 0.7 temperature generation imputes characters and lines of dialogue from much further into the movie. (The largest GPT-2 model could do that, but nowhere near as robust)

The real metagame with GPT-3 is engineering and optimizing complex prompts which can reliably coerce outputs into what you want. And with that brings a whole host of complexity and concerns.

GPT-3 Caveats

Despite everything above, I don’t believe that GPT-3 is a new paradigm or an advanced technology indistinguishable from magic. GPT-3 and the OpenAI API showcases on social media don’t show potential pitfalls with the model and the API.

Hey GPT-3, what problems do you have?

Sorry GPT-3, but I am a mean person.

Model Latency

If you’ve seen the demo videos, the model is slow, and it can take awhile for output to show up, and in the meantime the user is unsure if the model is broken or not. (There is a feature to allow streaming the model outputs as they are generated, which helps in creative cases but not in functional cases).

I don’t blame OpenAI for the slowness. A 175 billion parameter model is a model that’s wayyy too big to fit on a GPU for deployment. No one knows how GPT-3 is actually deployed on OpenAI’s servers, and how much it can scale.

But the fact remains; if the model is too slow on the user end, it results in a bad user experience and might drive people away from GPT-3 and just do things themselves (e.g. Apple’s Siri for iOS, where requests can take forever if there is a weak internet connection and you just give up and do it yourself).

Selection Bias Toward Good Examples

The demos for GPT-3 are creative and human-like, but like all text generation demos, they unintentionally imply that all AI-generated output will be that good. Unfortunately, that’s not the case in reality; AI-generated text has a tendency to fall into an uncanny valley, and good examples in showcases are often cherry-picked.

That said, from my experiments, GPT-3 is far better in terms of the average quality of generated text than other text-generation models, although it still does depend on the generation domain. When I was curating my generated tweets, I estimated 30-40% of the tweets were usable comedically, a massive improvement over the 5-10% usability from my GPT-2 tweet generation.

However, a 30-40% success rate implies a 60-70% failure rate, which is patently unsuitable for a production application. If it takes seconds to generate a React component and it takes on average 3 tries to get something usable, it might be more pragmatic to just create the component the hard, boring way. Compare again to Apple’s Siri, which can get very frustrating when it performs the wrong action.

Everyone Has The Same Model

The core GPT-3 model from the OpenAI API is the 175B parameter davinci model. The GPT-3 demos on social media often hide the prompt, allowing for some mystique. However, because everyone has the same model and you can’t build your own GPT-3 model, there’s no competitive advantage. GPT-3 seed prompts can be reverse-engineered, which may become a rude awakening for entrepreneurs and the venture capitalists who fund them.

Corporate machine learning models are often distinguished from those from other companies in the same field through their training on private, proprietary data and bespoke model optimization for a given use case. However, OpenAI CTO Greg Brockman hinted that the API will be adding a finetuning feature later in July, which could help solve this problem.

Racist and Sexist Outputs

The Web UI for the OpenAI API has a noteworthy warning:

Please use your judgement and discretion before posting API outputs on social media. You are interacting with the raw model, which means we do not filter out biased or negative responses. With great power comes great responsibility.

This is a reference to the FAQ for the API:

Mitigating negative effects such as harmful bias is a hard, industry-wide issue that is extremely important. Ultimately, our API models do exhibit biases (as shown in the GPT-3 paper) that will appear on occasion in generated text. Our API models could also cause harm in ways that we haven’t thought of yet.

After the launch of the API, NVIDIA researcher Anima Anandkumar made a highly-debated tweet.

During my GPT-3 experiments, I found that generating tweets from @dril (admittingly an edgy Twitter user) ended up resulting in 4chan-level racism/sexism that I spent enormous amounts of time sanitizing, and it became more apparent at higher temperatures. It’s especially important to avoid putting offensive content for generated texts which put words in others’ mouths.

Jerome Pesenti, the head of AI at Facebook, also managed to trigger anti-semetic tweets from a GPT-3 app:

Again, it depends on the domain. Would GPT-3 output racist or sexist React components? Likely not, but it’s something that would still need to be robustly checked. OpenAI does appear to take these concerns seriously, and has implemented toxicity detectors for generated content in the Web UI, although not the programmatic API yet.

Further Questions about the OpenAI API

AI model-as-a-service is an industry that tends to be a black box wrapped around another black box. Despite all the caveats, everything depends on how the OpenAI API exits beta and rolls out the API for production use. There are too many unknowns to even think about making money off of the OpenAI API, let alone making a startup based on it.

At minimum, anyone using the OpenAI API professionally needs to know:

Cost for generation per token/request
Rate limits and max number of concurrent requests
Average and peak latencies for generating tokens
SLA for the API
AI generated content ownership/copyright

That’s certainly less magical!

The most important question mark there is cost: given the model size, I’m not expecting it to be cheap, and it’s entirely possible that the unit economics make most GPT-3-based startups infeasible.

That said, it’s still good for people to experiment with GPT-3 and the OpenAI API in order to show what the model is truly capable of. It won’t replace software engineering jobs anytime soon, or become Skynet, or whatever. But it’s objectively a step forward in the field of AI text-generation.

What about GPT-2? Since it’s unlikely that the other GPT-3 models will be open-sourced by OpenAI, GPT-2 isn’t obsolete, and there will still be demand for a more open text-generating model. However, I confess that the success of GPT-3 has demotivated me to continue working on my own GPT-2 projects, especially since they will now be impossible to market competitively (GPT-2 is a number less than GPT-3 after all).

All said, I’d be glad to use GPT-3 and the OpenAI API for both personal and professional projects once it’s out of beta, given that the terms of use for the API are reasonable. And if the hype becomes more leveled such that said projects can actually stand out.

How to Build a Twitter Text-Generating AI Bot With GPT-2

Thu, 16 Jan 2020 08:00:00 -0800

GPT-2, a text-generating neural network model made by OpenAI, has recently been in the headlines, from being able to play AI-generated text adventures to playing chess with an AI trained on chess move notation. However, I initially built gpt-2-simple, which can be used to finetune GPT-2 on any text dataset you choose, for a less academic purpose: comedy.

Over the past month, Twitter account @dril_gpt2, an AI parody by @kingdomakrillic of the infamous Twitter user @dril, used my Colaboratory Notebook for finetuning GPT-2 on dril’s tweets using gpt-2-simple to generate human-curated tweets which push the limits of the Turing Test:

These tweets are definitely made by a robot and not by a human pretending to be a robot; @dril_gpt2 occasionally falls into some of the famous GPT-2 traps such as incoherent lists and extended repetition loops.

Here’s how you too can create an AI bot to parody any Twitter user, even if you’re not a coder!

How to Get Tweets For Training An AI

Twitter’s API famously limits users to retrieving only the latest 3,200 tweets from a given user, which is not nearly enough input data for training a good AI. Therefore, to get all tweets possible for a user, you’ll need to use another approach. The Python package twint is a popular way of bypassing that API limitation.

I’ve open-sourced a Python 3 script on GitHub which leverages twint to download tweets, and then the script does common preprocessing such as removing URLs, retweets, and tweet replies to make the resulting input text cleaner.

First, in a terminal, install the Python script dependencies:

pip3 install twint==2.1.4 fire tqdm

Then download the download_tweets.py script.

The script is interacted with via a command line interface. After cding into the directory where the script is stored in a terminal, run:

python3 download_tweets.py

e.g. If you want to download all tweets (sans retweets/replies) from @dril, run:

python3 download_tweets.py dril

The tweets will be downloaded to a single-column CSV titled _tweets.csv, which is the ideal format for training with an AI.

The more tweets the better: it’s recommended that you have at least 1 MB of input data, which is tens of thousands of tweets.

How To Train a Twitter AI And Generate Tweets

A common problem with training AI on short-form text is that the text can “leak” information; since the AI trains on about 2-3 paragraphs worth of text at a time (about 5-10 tweets), you need to explicitly state when a given tweet begins and when the tweet ends. To fix this issue, gpt-2-simple has a special case for single-column CSVs, where it will automatically process the text for best training and generation. (i.e. by adding <|startoftext|> and <|endoftext|> to each tweet). This workflow will also handle multi-line tweets correctly as their own entity.

You can use this Colaboratory notebook to train the model on your downloaded tweets, and generate massive amounts of tweets from it. The notebook itself has more instructions on how to feed the CSV created above as input data to the model.

Note that without a lot of tweets, the model might easily overfit and output existing tweets verbatim; if that’s the case, you may want to train for fewer steps (e.g. 200-500). Additionally, I recommend only using the 124M “small” and 355M “medium” GPT-2 models; larger GPT-2 models finetune poorly on small text documents and low amounts of input data.

Once the training is complete, you can generate tweets 1,000 at a time using this cell:

gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=200,
                      temperature=1.0,
                      top_p=0.9,
                      prefix='<|startoftext|>',
                      truncate='<|endoftext|>',
                      include_prefix=False,
                      nsamples=1000,
                      batch_size=20
                      )

Run the cell as many times as you want for more tweets, and download them from the Files tab by right-clicking them! The notebook also has more information on how to tweak the generation parameters to make the tweets more crazy or more sane.

You can then open the generated .txt files on your local computer in your favorite text editor (I recommend Visual Studio Code), and start curating however you see fit! Each tweet is separated by a delimiter line, making it easier to visually parse and handle multiline tweets (compare/contrast with raw @dril_gpt2 output, which blends together a few tweets per delimiter).

A warning: you are not guaranteed to get quality generated tweets all the time. In fact, quality tweets are rare: I estimate less than 5% of AI-generated tweets are good/funny. That means if you want to curate hundreds of tweets, you’ll need to generate thousands of tweets and sort through all of them (and double-check to make sure they’re not real tweets!). It’s not as bad as it sounds, in my opinion it’s kinda fun. But curation is its own skill, which is why human-curated tweets aren’t a stain on the “credibility” of AI bots, and also why the ~1,500 tweets so far from @dril_gpt2 is very impressive.

Now, what do you do with these curated tweets?

Automating The Twitter Bot

If you’re not a programmer or just want to prototype a Twitter bot, I recommend creating a normal Twitter account and scheduling hand-curated Twitter posts through TweetDeck, which is owned by Twitter and has native scheduling capabilities. You can space out tweets at given times, although it may be a hassle to do that for hundreds of tweets.

Otherwise, it is more efficient to write a code script to make tweets at periodic intervals for a bot account. Old tutorials around the internet recommend writing a script which posts to Twitter, sleeps for X hours, post, repeat; that method does not easily scale to multiple bots and it requires that a full computer be dedicated to it, which is not an efficient use of computing resources.

I’ve open-sourced an infrastructure schema on GitHub that leverages Google Cloud Platform services to run hand-curated Twitter bots using a few modern technologies to minimize cost and computation; it’s admittingly somewhat complicated, but it should give you an idea of how to best implement a Twitter bot. The repo also has instructions on how to set up a Twitter developer account.

The Ethics of Twitter AI Bots

Lastly, let’s address the elephant in the room: is building these bots ethical? Modern AI has frequently been criticized on two fronts, both in how the input training data is obtained (e.g. obtaining faces for training facial recognition software), and how AI-generated media content is used (e.g. video deepfakes).

I am not a lawyer, but for these AI-generated tweets, this is how I see it:

The input data is obtained from Twitter, but not through its API; it’s downloaded through external web scraping via twint, and never logs into the website. This kind of workflow was ruled as not an abuse by the recent hiQ v. LinkedIn decision, as the data is public. It’s still a gray area; I would not redistribute/commercialize the downloaded tweet data; just use it as input data to the model.

The actual generated tweets themself should be fine to use as you see fit. Whether AI-generated works infringe on the copyrights of its source material is an evolving area of both ethics and law, but at minimum these AI-generated tweets are both a transformative derivative work and a parody.

That said, given the massive ambiguities around AI-generated content, it’s important to be completely transparent and also comply with Twitter rules on parody accounts. For example, the Twitter bio for the bot should indicate:

It’s posting AI-generated tweets, made with GPT-2.
It’s human-curated (or not).
The Twitter account of who maintains the bot.
The Twitter account(s) the bot is parodying / model is finetuned upon.

Additionally, to avoid impersonation, the full name of the Twitter account should not be a verbatim match to the person being parodied (e.g. “X but AI” is fine), and the profile picture should be visually distinct from the human (e.g. my bots have a black-and-white profile picture). I would also not recommend making bots of people who are more newsworthy to avoid accusations of impersonation (e.g. do not make bots of politicians, especially Donald Trump).

There is still a lot of work that can be done in optimizing Twitter bots, both in terms of generated tweet quality and in ironing out the ethical logistics of maintaining an AI bot account. I do not believe that AI text-generating bot Twitter accounts will obsolete human Twitter accounts. It’s a different flavor of comedy; not better, not worse. But there’s still a lot that can be done to both expand and control the creativity of these Twitter bots, and I have a few active ideas in the pipeline to implement.

Visualizing Airline Flight Characteristics Between SFO and JFK

Wed, 23 Oct 2019 09:00:00 -0700

In March, Google Compute Platform developer advocate Felipe Hoffa made a tweet about airline flight data from San Francisco International Airport (SFO) to Seattle-Tacoma International Airport (SEA):

Particularly, his visualization of total elapsed times by airline caught my eye.

The overall time for flights from SFO to SEA goes up drastically starting in 2015, and this increase occurs across multiple airlines, implying that it’s not an airline-specific problem. But what could intuitively cause that?

U.S. domestic airline data is freely distributed by the United States Department of Transportation. Normally it’s a pain to work with as it’s very large with millions of rows, but BigQuery makes playing with such data relatively easy, fun, and free. What other interesting factoids can be found?

Expanding on SFO → SEA

BigQuery is a big data warehousing tool that allows you to query massive amounts of data. The table Hoffa created from the airline data (fh-bigquery.flights.ontime_201903) is 83.37 GB and 184 million rows. You can query 1 TB of data from it for free, but since BQ will only query against the fields you request, the queries in this post only consume about 2 GB each, allowing you to run them well within that quota.

Hoffa’s query that runs on BigQuery looks like this:

SELECT FlightDate_year, Reporting_Airline
  , AVG(ActualElapsedTime) ActualElapsedTime
  , AVG(TaxiOut) TaxiOut
  , AVG(TaxiIn) TaxiIn
  , AVG(AirTime) AirTime
  , COUNT(*) c
FROM `fh-bigquery.flights.ontime_201903`
WHERE Origin = 'SFO'
AND Dest = 'SEA'
AND FlightDate_year > '2010-01-01'
GROUP BY 1,2
ORDER BY 1 DESC, 3 DESC
LIMIT 1000

For each year and airline after 2010, the query calculates the average metrics specified for flights on the SFO → SEA route.

I made a few query and data visualization tweaks to what Hoffa did above, and here’s the result showing the increase in elapsed airline flight time, over time for that route:

Let’s explain what’s going on here.

A common trend in statistics is avoiding using averages as a summary statistic whenever possible, as averages can be overly affected by strong outliers (and with airline flights, there are definitely strong outliers!). The solution is to use a median instead, but one problem: medians are hard and computationally complex to calculate compared to simple averages. Despite the rise of “big data”, most databases and BI tools don’t have a MEDIAN function that’s as easy to use as an AVG function. But BigQuery has an uncommon APPROX_QUANTILES function, which calculates the specified amount of quantiles; for example, if you call APPROX_QUANTILES(ActualElapsedTime, 100), it will return an array with the 100 quantiles, where the median will be the 50th quantile. BigQuery uses an algorithmic trick called HyperLogLog++ to calculate these quantiles efficiently even with millions of data points. But since we get other quantiles like the 5th, 25th, 75th, and 95th quantiles for free with that approach, we can visualize the spread of the data.

We can aggregate the data by month for more granular trends and calculate the APPROX_QUANTILES in a subquery so it only has to be computed once. Hoffa also uploaded a more recent table (fh-bigquery.flights.ontime_201908) with a few additional months of data. To make things more simple, we’ll ignore aggregating by airlines since the metrics do not vary strongly between them. The final query ends up looking like this:

#standardSQL
SELECT Year, Month, num_flights,
time_q[OFFSET(5)] AS q_5,
time_q[OFFSET(25)] AS q_25,
time_q[OFFSET(50)] AS q_50,
time_q[OFFSET(75)] AS q_75,
time_q[OFFSET(95)] AS q_95
FROM (
SELECT Year, Month,
  COUNT(*) as num_flights,
  APPROX_QUANTILES(ActualElapsedTime, 100) AS time_q
FROM `fh-bigquery.flights.ontime_201908`
WHERE Origin = 'SFO'
AND Dest = 'SEA'
AND FlightDate_year > '2010-01-01'
GROUP BY Year, Month
)
ORDER BY Year, Month

The resulting data table:

In retrospect, since we’re only focusing on one route, it isn’t big data (this query only returns data on 64,356 flights total), but it’s still a very useful skill if you need to analyze more of the airline data (the APPROX_QUANTILES function can handle millions of data points very quickly).

As a professional data scientist, one of my favorite types of data visualization is a box plot, as it provides a way to visualize spread without being visually intrusive. Data visualization tools like R and ggplot2 make constructing them very easy to do.

By default, for each box representing a group, the thick line in the middle of the box is the median, the lower bound of the box is the 25th quantile and the upper bound is the 75th quantile. The whiskers are normally a function of the interquartile range (IQR), but if there’s enough data, I prefer to use the 5th and 95th quantiles instead.

If you feed ggplot2’s geom_boxplot() with raw data, it will automatically calculate the corresponding metrics for visualization; however, with big data, the data may not fit into memory and as noted earlier, medians and other quantiles are computationally expensive to calculate. Because we precomputed the quantiles with the query above for every year and month, we can use those explicitly. (The minor downside is that this will not include outliers)

Additionally for box plots, I like to fill in each box with a different color corresponding to the year in order to better perceive data seasonality. In the case of airline flights, seasonality is more literal: weather has an intuitive impact on flight times and delays, and during winter months there are also holidays which could affect airline logistics.

The resulting ggplot2 code looks like this:

plot <-
  ggplot(df_tf,
         aes(
           x = date,
           ymin = q_5,
           lower = q_25,
           middle = q_50,
           upper = q_75,
           ymax = q_95,
           group = date,
           fill = year_factor
         )) +
  geom_boxplot(stat = "identity", size = 0.3) +
  scale_fill_hue(l = 50, guide = F) +
  scale_x_date(date_breaks = '1 year', date_labels = "%Y") +
  scale_y_continuous(breaks = pretty_breaks(6)) +
  labs(
    title = "Distribution of Flight Times of Flights From SFO → SEA, by Month",
    subtitle = "via US DoT. Box bounds are 25th/75th percentiles, whiskers are 5th/95th percentiles.",
    y = 'Total Elapsed Flight Time (Minutes)',
    fill = '',
    caption = 'Max Woolf — minimaxir.com'
  ) +
  theme(axis.title.x = element_blank())

ggsave('sfo_sea_flight_duration.png',
       plot,
       width = 6,
       height = 4)

And behold (again)!

You can see that the boxes do indeed trend upward after 2016, although per-month medians are in flux. The spread is also increasingly slowly over time. But what’s interesting is the seasonality; pre-2016, the summer months (the “middle” of a given color) have a very significant drop in total time, which doesn’t occur as strongly after 2016. Hmm.

SFO and JFK

Since I occasionally fly from San Francisco to New York City, it might be interesting (for completely selfish reasons) to track trends over time for flights between those areas. On the San Francisco side I choose SFO, and for the New York side I choose John F. Kennedy International Airport (JFK), as the data goes back very far for those routes specifically, and I only want to look at a single airport at a time (instead of including other NYC airports such as Newark Liberty International Airport [EWR] and LaGuardia Airport [LGA]) to limit potential data confounders.

Fortunately, the code and query changes are minimal: in the query, change the target metric to whatever metric you want, and the Origin and Dest in the WHERE clause to what you want, and if you want to calculate metrics other than elapsed time, change the metric in APPROX_QUANTILES accordingly.

Here’s the chart of total elapsed time from SFO → JFK:

And here’s the reverse, from JFK → SFO:

Unlike the SFO → SEA charts, both charts are relatively flat over the years. However, when looking at seasonality, SFO → JFK dips in the summer and spikes during winter, while JFK → SFO does the complete opposite: dips during the winter and spikes during the summer, which is similar to the SFO → SEA route. I don’t have any guesses what would cause that behavior.

How about flight speed (calculated via air time divided by distance)? Have new advances in airline technology made planes faster and/or more efficient?

The expected flight speed for a commercial airplane, per Wikipedia, is 547-575 mph, so the metrics from SFO pass the sanity check. The metrics from JFK indicate there’s about a 20% drop in flight speed potentially due to wind resistance, which makes sense. Month-to-month, the speed trends are inverse to the total elapsed time, which makes sense intuitively as they are strongly negatively correlated.

Lastly, what about flight departure delays? Are airlines becoming more efficient, or has increased demand caused more congestion?

Wait a second. In this case, massive 2-3 hour flight delays are frequent enough that even just the 95% percentile skews the entire plot. Let’s remove the whiskers in order to look at trends more clearly.

A negative delay implies the flight leaves early, so we can conclude on average, flights leave slightly earlier than the stated departure time. Even without the whiskers, we can see major spikes at the 75th percentile level for summer months, and said spikes were especially bad in 2017 for both airports.

These box plots are only an exploratory data analysis. Determining the cause of changes in these flight metrics is difficult even for experts (I am definitely not an expert!) and many not even be possible to determine from publicly-available data.

But there are still other fun things that can be done with the airline flight data, such as faceting airline trends by time and the inclusion of other airports, which is interesting.

You can view the BigQuery queries used to get the data, plus the R and ggplot2 used to create the data visualizations, in this R Notebook. You can also view the images/code used for this post in this GitHub repository.

You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!

How To Make Custom AI-Generated Text With GPT-2

Wed, 04 Sep 2019 08:00:00 -0700

In February 2019, OpenAI released a paper describing GPT-2, a AI-based text-generation model based on the Transformer architecture and trained on massive amounts of text all around the internet. From a text-generation perspective, the included demos were very impressive: the text is coherent over a long horizon, and grammatical syntax and punctuation are near-perfect.

At the same time, the Python code which allowed anyone to download the model (albeit smaller versions out of concern the full model can be abused to mass-generate fake news) and the TensorFlow code to load the downloaded model and generate predictions was open-sourced on GitHub.

Neil Shepperd created a fork of OpenAI’s repo which contains additional code to allow finetuning the existing OpenAI model on custom datasets. A notebook was created soon after, which can be copied into Google Colaboratory and clones Shepperd’s repo to finetune GPT-2 backed by a free GPU. From there, the proliferation of GPT-2 generated text took off: researchers such as Gwern Branwen made GPT-2 Poetry and Janelle Shane made GPT-2 Dungeons and Dragons character bios.

I waited to see if anyone would make a tool to help streamline this finetuning and text generation workflow, a la textgenrnn which I had made for recurrent neural network-based text generation. Months later, no one did. So I did it myself. Enter gpt-2-simple, a Python package which wraps Shepperd’s finetuning code in a functional interface and adds many utilities for model management and generation control.

Thanks to gpt-2-simple and this Colaboratory Notebook, you can easily finetune GPT-2 on your own dataset with a simple function, and generate text to your own specifications!

How GPT-2 Works

OpenAI has released three flavors of GPT-2 models to date: the “small” 124M parameter model (500MB on disk), the “medium” 355M model (1.5GB on disk), and recently the 774M model (3GB on disk). These models are much larger than what you see in typical AI tutorials and are harder to wield: the “small” model hits GPU memory limits while finetuning with consumer GPUs, the “medium” model requires additional training techniques before it could be finetuned on server GPUs without going out-of-memory, and the “large” model cannot be finetuned at all with current server GPUs before going OOM, even with those techniques.

The actual Transformer architecture GPT-2 uses is very complicated to explain (here’s a great lecture). For the purposes of finetuning, since we can’t modify the architecture, it’s easier to think of GPT-2 as a black box, taking in inputs and providing outputs. Like previous forms of text generators, the inputs are a sequence of tokens, and the outputs are the probability of the next token in the sequence, with these probabilities serving as weights for the AI to pick the next token in the sequence. In this case, both the input and output tokens are byte pair encodings, which instead of using character tokens (slower to train but includes case/formatting) or word tokens (faster to train but does not include case/formatting) like most RNN approaches, the inputs are “compressed” to the shortest combination of bytes including case/formatting, which serves as a compromise between both approaches but unfortunately adds randomness to the final generation length. The byte pair encodings are later decoded into readable text for human generation.

The pretrained GPT-2 models were trained on websites linked from Reddit. As a result, the model has a very strong grasp of the English language, allowing this knowledge to transfer to other datasets and perform well with only a minor amount of additional finetuning. Due to the English bias in encoder construction, languages with non-Latin characters like Russian and CJK will perform poorly in finetuning.

When finetuning GPT-2, I recommend using the 124M model (the default) as it’s the best balance of speed, size, and creativity. If you have large amounts of training data (>10 MB), then the 355M model may work better.

gpt-2-simple And Colaboratory

In order to better utilize gpt-2-simple and showcase its features, I created my own Colaboratory Notebook, which can be copied into your own Google account. A Colaboratory Notebook is effectively a Jupyter Notebook running on a free (w/ a Google Account) virtual machine with an Nvidia server GPU attached (randomly a K80 or a T4; T4 is ideal) that normally can be cost-prohibitive.

Once open, the first cell (run by pressing Shift+Enter in the cell or mousing-over the cell and pressing the “Play” button) of the notebook installs gpt-2-simple and its dependencies, and loads the package.

Later in the notebook is gpt2.download_gpt2() which downloads the requested model type to the Colaboratory VM (the models are hosted on Google’s servers, so it’s a very fast download).

Expanding the Colaboratory sidebar reveals a UI that you can use to upload files. For example, the tinyshakespeare dataset (1MB) provided with the original char-rnn implementation. Upload a text file via the UI (you can drag and drop), run the file_name = '' cell with your filename changed in the cell.

Now we can start finetuning! This finetuning cell loads the specified dataset and trains for the specified number of steps (the default of 1,000 steps is enough to allow distinct text to emerge and takes about 45 minutes, but you can increase the number of steps if necessary).

While the model is finetuning, the average training loss is output every-so-often to the cell. The absolute value of the loss is not important (the output text quality is subjective), but if the average loss stops decreasing, that’s a sign the model has converged and additional training may not help improve the model.

By default, your model is saved in the checkpoint/run1 folder, and you’ll need to use that folder to load the model as well (you can specify the run_name when using other functions categorize finetuned models). If you want to export the model from Colaboratory, it’s recommended you do so via Google Drive (as Colaboratory does not like exporting large files). Run the gpt2.mount_gdrive() cell to mount your Google Drive in the Colaboratory VM, then run the gpt2.copy_checkpoint_to_gdrive() cell. You can then download the compressed model folder from Google Drive and run the model wherever you want. Likewise, you can use the gpt2.copy_checkpoint_from_gdrive() cell to retrieve a stored model and generate in the notebook.

Speaking of generation, once you have a finetuned model, you can now generate custom text from it! By default, the gpt2.generate() function will generate as much text as possible (1,024 tokens) with a little bit of randomness. An important caveat: you will not get good generated text 100% of the time, even with a properly trained model (the OpenAI demo above took 25 tries to get good text!).

You can also increase the temperature to increase “creativity” by allowing the network to more likely make suboptimal predictions, provide a prefix to specify how exactly you want your text to begin. There are many other useful configuration parameters, such as top_p for nucleus sampling.

As a bonus, you can bulk-generate text with gpt-2-simple by setting nsamples (number of texts to generate total) and batch_size (number of texts to generate at a time); the Colaboratory GPUs can support a batch_size of up to 20, and you can generate these to a text file with gpt2.generate_to_file(file_name) with the same parameters as gpt2.generate(). You can download the generated file locally via the sidebar, and use those to easily save and share the generated texts.

The notebook has many more functions as well, with more parameters and detailed explanations! The gpt-2-simple README lists additional features of gpt-2-simple if you want to use the model outside the notebook.

(NB: Currently, you’ll need to reset the Notebook via Runtime → Restart Runtime to finetune a different model/dataset or load a different finetuned model.)

GPT-2 For Short Texts

A weakness of GPT-2 and other out-of-the-box AI text generators is that they are built for longform content, and keep on generating text until you hit the specified length. Another reason I wanted to make gpt-2-simple was to add explicit processing tricks to the generated text to work around this issue for short texts. In this case, there are two additional parameters that can be passed to gpt2.generate(): truncate and include_prefix. For example, if each short text begins with a <|startoftext|> token and ends with a <|endoftext|>, then setting prefix='<|startoftext|>', truncate=<|endoftext|>', and include_prefix=False, and length is sufficient, then gpt-2-simple will automatically extract the shortform texts, even when generating in batches.

Let’s finetune a GPT-2 model on Reddit submission titles. This query, when run on BigQuery (for free), returns the top 16,000 titles by score between January and March 2019 for a given Reddit subreddit (in this case, /r/AskReddit) + minor text preprocessing, which can be downloaded locally as a 1.3 MB CSV (Save Results → CSV [local file]):

#standardSQL
SELECT
  REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(title, '&', '&'), '<', '<'), '>', '>'), '�', '') AS title
FROM
  `fh-bigquery.reddit_posts.*`
WHERE
  _TABLE_SUFFIX BETWEEN '2019_01' AND '2019_03'
  AND LENGTH(title) >= 8
  AND LOWER(subreddit) = 'askreddit'
ORDER BY
  score DESC
LIMIT
  16000

With gpt-2-simple, using a single-column CSV like the one generated above as the input dataset will automatically add <|startoftext|> and <|endoftext|> tokens appropriately. Finetune a new GPT-2 model as normal, and then generate with those additional parameters mentioned above:

It’s worth noting that despite a good amount of input data to the model, finetuned networks can easily overfit on short form text: some of these example titles are very close to existing /r/AskReddit titles. Overfitting can be rectified by training for less time, or adding more input data. Make sure to double check that your generated text is unique!

You can play with this Reddit-oriented variant in this modified Colaboratory Notebook.

Making GPT-2 Apps

There have already been cool, non-nefarious uses of GPT-2, such as Adam King’s TalkToTransformer which provides a UI for the 774M model (and has gone viral many times) and TabNine, which uses GPT-2 finetuned on GitHub code in order to create probabilistic code completion. On the PyTorch side, Huggingface has released a Transformers client (w/ GPT-2 support) of their own, and also created apps such as Write With Transformer to serve as a text autocompleter.

Many AI tutorials often show how to deploy a small model to a web service by using the Flask application framework. The problem with GPT-2 is that it’s such a huge model that most conventional advice won’t work well to get a performant app. And even if you do get it to run fast (e.g. by running the app on a GPU), it won’t be cheap, especially if you want it to be resilient to a random surge of virality.

With gpt-2-simple, the solution I came up with is gpt-2-cloud-run; a small webapp intended to run GPT-2 via Google Cloud Run backed by gpt-2-simple. The advantage here is that Cloud Run only charges for compute used and can scale indefinitely if there’s a traffic surge; for casual use, it’s extremely cost effective compared to running a GPU 24/7. I’ve used Cloud Run to make a GPT-2 text generator for Reddit-wide submission titles and a GPT-2 generator for Magic: The Gathering cards!

Attributing AI-Generated Text

One of the main reasons I developed textgenrnn and gpt-2-simple is to make AI text generation more accessible as you do not need a strong AI or technical background to create fun stories. However, in the case of GPT-2, I’ve noticed an elevated amount of “I trained an AI to generate text” articles/Reddit posts/YouTube videos saying they used GPT-2 to train an AI, but not how they trained the AI: especially suspicious since finetuning is not an out-of-the-box feature that OpenAI provides. The fact that Keaton Patti’s “I forced a bot” movie scripts (that aren’t written by a bot) frequently go megaviral due to that particular framing doesn’t help.

Although it’s not legally required, I ask that anyone who shares generated text via gpt-2-simple add a link to the repo and/or Colaboratory notebook not just for attribution, but to spread knowledge about the accessibility of AI text generation. It’s a technology that should be transparent, not obfuscated for personal gain.

The Future of GPT-2

Hopefully, this article gave you ideas on how to finetune and generate texts creatively. There’s still a lot of untapped potential, and there are still many cool applications that have been untouched, and many cool datasets that haven’t been used for AI text generation. GPT-2 will likely be used more for mass-producing crazy erotica than fake news.

However, GPT-2 and the Transformer architecture aren’t the end-game of AI text generation. Not by a long shot.