<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Data Analysis on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/tag/data-analysis/</link>
    <description>Recent content in Data Analysis on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2025</copyright>
    <lastBuildDate>Mon, 30 Jun 2025 10:00:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/tag/data-analysis/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Predicting Average IMDb Movie Ratings Using Text Embeddings of Movie Metadata</title>
      <link>https://minimaxir.com/2025/06/movie-embeddings/</link>
      <pubDate>Mon, 30 Jun 2025 10:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2025/06/movie-embeddings/</guid>
      <description>Don&amp;rsquo;t try this in your data science interviews.</description>
      <content:encoded><![CDATA[<p>Months ago, I saw a post titled &ldquo;<a href="https://www.reddit.com/r/datascience/comments/1eykil7/rejected_from_ds_role_with_no_feedback/">Rejected from DS Role with no feedback</a>&rdquo; on Reddit&rsquo;s <a href="https://www.reddit.com/r/datascience/">Data Science subreddit</a>, in which a prospective job candidate for a data science position provided a <a href="https://colab.research.google.com/drive/1Ud2tXW2IAw_dXA5DONvNpPmmlL1foSwK">Colab Notebook</a> documenting their submission for a take-home assignment and asking for feedback as to why they were rejected. Per the Reddit user, the assignment was:</p>
<blockquote>
<p>Use the publicly available <a href="https://developer.imdb.com/non-commercial-datasets/">IMDB Datasets</a> to build a model that predicts a movie&rsquo;s average rating. Please document your approach and present your results in the notebook. Make sure your code is well-organized so that we can follow your modeling process.</p>
</blockquote>
<p><a href="https://www.imdb.com/">IMDb</a>, the Internet Movie Database owned by Amazon, allows users to rate movies on a scale from 1 to 10, wherein the average rating is then displayed prominently on the movie&rsquo;s page:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/shawshank_hu_fe8025c2c6a0fa89.webp 320w,/2025/06/movie-embeddings/shawshank_hu_f0b2bc74865ccb73.webp 768w,/2025/06/movie-embeddings/shawshank_hu_8f544060412f7f54.webp 1024w,/2025/06/movie-embeddings/shawshank.webp 1082w" src="shawshank.webp"
         alt="The Shawshank Redemption is currently the highest-rated movie on IMDb with an average rating of 9.3 derived from 3.1 million user votes."/> <figcaption>
            <p><a href="https://www.imdb.com/title/tt0111161/?ref_=sr_t_1">The Shawshank Redemption</a> is currently the <a href="https://www.imdb.com/search/title/?groups=top_100&amp;sort=user_rating,desc">highest-rated movie on IMDb</a> with an average rating of 9.3 derived from 3.1 million user votes.</p>
        </figcaption>
</figure>

<p>In their notebook, the Redditor identifies a few intuitive features for such a model, including the year in which the movie was released, the movie&rsquo;s genre(s), and its actors/directors. However, the model they built is a <a href="https://www.tensorflow.org/">TensorFlow</a> and <a href="https://keras.io/">Keras</a>-based neural network, with all the bells and whistles such as <a href="https://en.wikipedia.org/wiki/Batch_normalization">batch normalization</a> and <a href="https://en.wikipedia.org/wiki/Dilution_%28neural_networks%29">dropout</a>. The immediate response by other data scientists on /r/datascience was, at its most polite, &ldquo;why did you use a neural network when it&rsquo;s a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a> that you can&rsquo;t explain?&rdquo;</p>
<p>Reading those replies made me nostalgic. Way back in 2017, before my first job as a data scientist, neural networks using frameworks such as TensorFlow and Keras were all the rage for their ability to &ldquo;<a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem">solve any problem</a>&rdquo;, but using them was often seen as lazy and unskilled compared to traditional statistical modeling such as ordinary least squares linear regression or even gradient boosted trees. Although it&rsquo;s funny to see that this perception of neural networks in the data science community hasn&rsquo;t changed since then, nowadays their black box nature can be an acceptable business tradeoff if the prediction results are of higher quality and interpretability is not required.</p>
<p>Looking back at the assignment description, the objective is only &ldquo;predict a movie&rsquo;s average rating.&rdquo; For data science interview take-homes, this is unusual: those assignments typically have an extra instruction along the lines of &ldquo;explain your model and what decisions stakeholders should make as a result of it&rdquo;, which is a strong hint that you need to use an explainable model like linear regression to obtain feature coefficients, or even a middle ground like gradient boosted trees and its <a href="https://stats.stackexchange.com/questions/332960/what-is-variable-importance">variable importance</a> to quantify relative feature contribution to the model. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> In the absence of that particular constraint, it&rsquo;s arguable that anything goes, including neural networks.</p>
<p>The quality of neural networks has improved significantly since 2017, even more so due to the massive rise of LLMs. Why not just feed an LLM all the raw metadata for a movie, encode it into a text embedding, and build a statistical model on top of that? Would a neural network do better than a traditional statistical model in that instance? Let&rsquo;s find out!</p>
<h2 id="about-imdb-data">About IMDb Data</h2>
<p>The <a href="https://developer.imdb.com/non-commercial-datasets/">IMDb Non-Commercial Datasets</a> are famous sets of data that have been around for nearly a decade <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> but are still updated daily. Back in 2018 as a budding data scientist, I performed a <a href="https://minimaxir.com/2018/07/imdb-data-analysis/">fun exploratory data analysis</a> using these datasets, although the results aren&rsquo;t too surprising.</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/imdb-4_hu_fdf90cbdd2dd2c7e.webp 320w,/2025/06/movie-embeddings/imdb-4_hu_1c45abe215427c09.webp 768w,/2025/06/movie-embeddings/imdb-4_hu_62d0feb034e8b054.webp 1024w,/2025/06/movie-embeddings/imdb-4.png 1200w" src="imdb-4.png"
         alt="The average rating for a movie is around 6 and tends to skew higher: a common trend in internet rating systems."/> <figcaption>
            <p>The average rating for a movie is around 6 and tends to skew higher: a common trend in internet rating systems.</p>
        </figcaption>
</figure>

<p>But in truth, these datasets are a terrible idea for companies to use for a take-home assignment. Although the datasets are released under a non-commercial license, IMDb doesn&rsquo;t want to give too much information to their competitors, which results in a severely limited number of features that could be used to build a good predictive model. Here are the common movie-performance-related features present in the <code>title.basics.tsv.gz</code> file:</p>
<ul>
<li><strong>tconst</strong>: unique identifier of the title</li>
<li><strong>titleType</strong>: the type/format of the title (e.g. movie, tvMovie, short, tvSeries, etc.)</li>
<li><strong>primaryTitle</strong>: the more popular title / the title used by the filmmakers on promotional materials at the point of release</li>
<li><strong>isAdult</strong>: 0: non-adult title; 1: adult title</li>
<li><strong>startYear</strong>: represents the release year of a title.</li>
<li><strong>runtimeMinutes</strong>: primary runtime of the title, in minutes</li>
<li><strong>genres</strong>: includes up to three genres associated with the title</li>
</ul>
<p>This is a sensible schema for describing a movie, although it lacks some important information that would be very useful for determining movie quality, such as production company, summary blurbs, granular genres/tags, and plot/setting — all of which are available on the IMDb movie page itself and presumably accessible through the <a href="https://developer.imdb.com/documentation/api-documentation/?ref_=/documentation/_PAGE_BODY">paid API</a>. Of note, since the assignment explicitly asks for a <em>movie</em>&rsquo;s average rating, we need to filter the data to only <code>movie</code> and <code>tvMovie</code> entries, which the original submission failed to do.</p>
<p>The ratings data in <code>title.ratings.tsv.gz</code> is what you&rsquo;d expect:</p>
<ul>
<li><strong>tconst</strong>: unique identifier of the title (which can therefore be mapped to movie metadata using a JOIN)</li>
<li><strong>averageRating</strong>: average of all the individual user ratings</li>
<li><strong>numVotes</strong>: number of votes the title has received</li>
</ul>
<p>In order to ensure that the average ratings for modeling are indeed stable and indicative of user sentiment, I will only analyze movies that have <em>at least 30 user votes</em>: as of May 10th, 2025, that&rsquo;s about 242k movies total. Additionally, I will not use <code>numVotes</code> as a model feature, since that&rsquo;s a metric based more on extrinsic movie popularity rather than the movie itself.</p>
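<p>For concreteness, here&rsquo;s a minimal Polars sketch of that filtering and join (not the notebook&rsquo;s exact ETL code; file names follow the IMDb docs):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import polars as pl

# IMDb TSVs use \N for missing values and contain unescaped quotes.
df_basics = pl.read_csv(
    "title.basics.tsv.gz", separator="\t", null_values="\\N", quote_char=None
)
df_ratings = pl.read_csv(
    "title.ratings.tsv.gz", separator="\t", null_values="\\N", quote_char=None
)

# Keep only movies/TV movies with a stable number of user votes.
df_movies = (
    df_basics.filter(pl.col("titleType").is_in(["movie", "tvMovie"]))
    .join(df_ratings, on="tconst")
    .filter(pl.col("numVotes") &gt;= 30)
)
</code></pre></div>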
<p>The last major dataset is <code>title.principals.tsv.gz</code>, which has very helpful information on metadata such as the roles people play in the production of a movie:</p>
<ul>
<li><strong>tconst</strong>: unique identifier of the title (which can be mapped to movie data using a JOIN)</li>
<li><strong>nconst</strong>: unique identifier of the principal (this is mapped to <code>name.basics.tsv.gz</code> to get the principal&rsquo;s <code>primaryName</code>, but nothing else useful)</li>
<li><strong>category</strong>: the role the principal served in the title, such as <code>actor</code>, <code>actress</code>, <code>writer</code>, <code>producer</code>, etc.</li>
<li><strong>ordering</strong>: the ordering of the principals within the title, which correlates to the order the principals appear on IMDb&rsquo;s movie cast pages.</li>
</ul>
<p>Additionally, because the datasets are so popular, this is not the first time someone has built an IMDb ratings predictor, and prior attempts are easy to Google.</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/google_hu_b09e979836a71049.webp 320w,/2025/06/movie-embeddings/google_hu_c652438955f310d8.webp 768w,/2025/06/movie-embeddings/google.webp 1000w" src="google.webp"/> 
</figure>

<p>Instead of using the official IMDb datasets, these analyses are based on the smaller <a href="https://www.kaggle.com/datasets/carolzhangdc/imdb-5000-movie-dataset/data">IMDB 5000 Movie Dataset</a> hosted on Kaggle, which adds metadata such as movie rating, budget, and further actor metadata that makes building a model much easier (albeit &ldquo;number of likes on the lead actor&rsquo;s Facebook page&rdquo; is <em>very</em> extrinsic to movie quality). Using the official datasets with much less metadata means building the models on hard mode, and they will likely have lower predictive performance.</p>
<p>Although IMDb data is very popular and very well documented, that doesn&rsquo;t mean it&rsquo;s easy to work with.</p>
<h2 id="the-initial-assignment-and-feature-engineering">The Initial Assignment and &ldquo;Feature Engineering&rdquo;</h2>
<p>Data science take-home assignments are typically 1/2 <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">exploratory data analysis</a> for identifying impactful dataset features, and 1/2 building, iterating, and explaining the model. For real-world datasets, these are all very difficult problems with many possible solutions, and the goal from the employer&rsquo;s perspective is more about seeing <em>how</em> these problems are solved than the actual quantitative results.</p>
<p>The Redditor&rsquo;s initial notebook engineered some expected features using <a href="https://pandas.pydata.org/">pandas</a>, such as <code>is_sequel</code> by checking whether a non-<code>1</code> number is present at the end of a movie title, and <a href="https://en.wikipedia.org/wiki/One-hot">one-hot encoding</a> each distinct <code>genre</code> of a movie. These are fine for an initial approach, although sequel titles can be idiosyncratic, which suggests that a more <a href="https://www.ibm.com/think/topics/natural-language-processing">NLP</a>-driven approach to identifying sequels and other related media may be useful.</p>
<p>The main trick with this assignment is how to handle the principals. The common data science approach would be to use a sparse binary encoding of the actors/directors/writers, e.g. using a vector where actors present in the movie are <code>1</code> and every other actor is <code>0</code>, which leads to a large number of potential approaches to encode this data performantly, such as scikit-learn&rsquo;s <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html">MultiLabelBinarizer</a>. The problem with this approach is that the number of unique actors is <em>very</em> large / of <a href="https://docs.honeycomb.io/get-started/basics/observability/concepts/high-cardinality/">high cardinality</a> — there are more unique actors than data points themselves — which leads to <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a> issues, and workarounds such as encoding only the top <em>N</em> actors render the feature uninformative, since even a generous <em>N</em> will fail to capture the majority of actors.</p>
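<p>To make the dimensionality problem concrete, here&rsquo;s a minimal sketch of that sparse encoding (the actor IDs are placeholders):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">from sklearn.preprocessing import MultiLabelBinarizer

# Each row is one movie's list of actor IDs (placeholder values).
movie_actors = [
    ["nm0000001", "nm0000002"],
    ["nm0000002", "nm0000003"],
    ["nm0000004"],
]

mlb = MultiLabelBinarizer(sparse_output=True)
X = mlb.fit_transform(movie_actors)
# X is an (n_movies, n_unique_actors) sparse matrix: with ~624k unique
# actors but only ~242k movies, it has far more columns than rows.
</code></pre></div>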
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/actor_cum_dist_hu_6b3839329e455b7d.webp 320w,/2025/06/movie-embeddings/actor_cum_dist_hu_b3985aca3321429a.webp 768w,/2025/06/movie-embeddings/actor_cum_dist_hu_27acda9c003abad5.webp 1024w,/2025/06/movie-embeddings/actor_cum_dist.png 1500w" src="actor_cum_dist.png"
         alt="There are actually 624k unique actors in this dataset (Jupyter Notebook), the chart just becomes hard to read at that point."/> <figcaption>
            <p>There are actually 624k unique actors in this dataset (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/actor_agg.ipynb">Jupyter Notebook</a>), the chart just becomes hard to read at that point.</p>
        </figcaption>
</figure>

<p>Additionally, most statistical modeling approaches cannot account for the <code>ordering</code> of actors as they treat each feature as independent, and since the billing order of actors is generally correlated to their importance in the movie, that&rsquo;s an omission of information relevant to the problem.</p>
<p>These constraints gave me an idea: why not use an LLM to encode <em>all</em> movie data, and build a model using the downstream embedding representation? LLMs have <a href="https://en.wikipedia.org/wiki/Attention_%28machine_learning%29">attention mechanisms</a>, which will not only respect the relative ordering of actors (to give higher predictive priority to higher-billed actors, along with actor cooccurrences), but also identify patterns within movie name texts (to identify sequels and related media semantically).</p>
<p>I started by aggregating and denormalizing all the data locally (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/imdb_polars_etl_test.ipynb">Jupyter Notebook</a>). Each of the IMDb datasets is hundreds of megabytes and hundreds of thousands of rows at minimum: not quite <a href="https://en.wikipedia.org/wiki/Big_data">big data</a>, but enough to be more cognizant of tooling, especially since computationally-intensive JOINs are required. Therefore, I used the <a href="https://pola.rs/">Polars</a> library in Python, which not only loads data super fast, but is also one of the <a href="https://duckdblabs.github.io/db-benchmark/">fastest libraries at performing JOINs</a> and other aggregation tasks. Polars&rsquo;s syntax also allows for some cool tricks: for example, I want to spread out the principals (4.1 million rows after prefiltering) for each movie and aggregate them into nested lists of directors, writers, producers, actors, and all other principals, while simultaneously having them sorted by <code>ordering</code> as noted above. This is much easier to do in Polars than in any other data processing library I&rsquo;ve used, and on millions of rows, this takes <em>less than a second</em>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df_principals_agg</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df_principals</span><span class="o">.</span><span class="n">sort</span><span class="p">([</span><span class="s2">&#34;tconst&#34;</span><span class="p">,</span> <span class="s2">&#34;ordering&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">group_by</span><span class="p">(</span><span class="s2">&#34;tconst&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">director_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&#34;director&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">writer_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&#34;writer&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">producer_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&#34;producer&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">actor_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">is_in</span><span class="p">([</span><span class="s2">&#34;actor&#34;</span><span class="p">,</span> <span class="s2">&#34;actress&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">principal_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="o">~</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">is_in</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="p">[</span><span class="s2">&#34;director&#34;</span><span class="p">,</span> <span class="s2">&#34;writer&#34;</span><span class="p">,</span> <span class="s2">&#34;producer&#34;</span><span class="p">,</span> <span class="s2">&#34;actor&#34;</span><span class="p">,</span> <span class="s2">&#34;actress&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">principal_roles</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="o">~</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">is_in</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="p">[</span><span class="s2">&#34;director&#34;</span><span class="p">,</span> <span class="s2">&#34;writer&#34;</span><span class="p">,</span> <span class="s2">&#34;producer&#34;</span><span class="p">,</span> <span class="s2">&#34;actor&#34;</span><span class="p">,</span> <span class="s2">&#34;actress&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><p>After some cleanup and field renaming, here&rsquo;s an example JSON document for <a href="https://www.imdb.com/title/tt0076759/">Star Wars: Episode IV - A New Hope</a>:</p>
<!-- prettier-ignore-start -->
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;title&#34;</span><span class="p">:</span> <span class="s2">&#34;Star Wars: Episode IV - A New Hope&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;genres&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Action&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Adventure&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Fantasy&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;is_adult&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;release_year&#34;</span><span class="p">:</span> <span class="mi">1977</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;runtime_minutes&#34;</span><span class="p">:</span> <span class="mi">121</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;directors&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;George Lucas&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;writers&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;George Lucas&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;producers&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Gary Kurtz&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Rick McCallum&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;actors&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Mark Hamill&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Harrison Ford&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Carrie Fisher&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Alec Guinness&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Peter Cushing&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Anthony Daniels&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Kenny Baker&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Peter Mayhew&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;David Prowse&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Phil Brown&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;principals&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;John Williams&#34;</span><span class="p">:</span> <span class="s2">&#34;composer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Gilbert Taylor&#34;</span><span class="p">:</span> <span class="s2">&#34;cinematographer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Richard Chew&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;T.M. Christopher&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Paul Hirsch&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Marcia Lucas&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Dianne Crittenden&#34;</span><span class="p">:</span> <span class="s2">&#34;casting_director&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Irene Lamb&#34;</span><span class="p">:</span> <span class="s2">&#34;casting_director&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Vic Ramos&#34;</span><span class="p">:</span> <span class="s2">&#34;casting_director&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;John Barry&#34;</span><span class="p">:</span> <span class="s2">&#34;production_designer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><!-- prettier-ignore-end -->
<p>I was tempted to claim that I used zero feature engineering, but that wouldn&rsquo;t be accurate. The selection and ordering of the JSON fields here is itself feature engineering: for example, <code>actors</code> and <code>principals</code> are intentionally last in this JSON encoding because they can have wildly varying lengths while the prior fields are more consistent, which should make downstream encodings more comparable and consistent.</p>
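<p>As a sketch, the serialization helper might look like this (<code>movie_to_doc</code> is a hypothetical name; the notebook&rsquo;s implementation may differ):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import json

def movie_to_doc(movie: dict) -&gt; str:
    # Field order is deliberate: consistently-sized fields first, the
    # variable-length actors/principals lists last.
    doc = {
        "title": movie["title"],
        "genres": movie["genres"],
        "is_adult": movie["is_adult"],
        "release_year": movie["release_year"],
        "runtime_minutes": movie["runtime_minutes"],
        "directors": movie["directors"],
        "writers": movie["writers"],
        "producers": movie["producers"],
        "actors": movie["actors"],
        "principals": movie["principals"],
    }
    # indent=2 preserves the nested-array indentation seen above.
    return json.dumps(doc, indent=2)
</code></pre></div>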
<p>Now, let&rsquo;s discuss how to convert these JSON representations of movies into embeddings.</p>
<h2 id="creating-and-visualizing-the-movie-embeddings">Creating And Visualizing the Movie Embeddings</h2>
<p>LLMs that are trained to output text embeddings are not much different from LLMs like <a href="https://chatgpt.com/">ChatGPT</a> that just predict the next token in a loop. Models such as BERT and GPT can generate &ldquo;embeddings&rdquo; out-of-the-box by skipping the prediction heads of the models and instead taking an encoded value from the last hidden state of the model (e.g. for BERT, the first positional vector of the hidden state representing the <code>[CLS]</code> token). However, text embedding models are further optimized for the distinctiveness of a given input text document using <a href="https://lilianweng.github.io/posts/2021-05-31-contrastive/">contrastive learning</a>. These embeddings can be used for many things, from finding similar inputs by computing the similarity between their embeddings to, of course, building a statistical model on top of them.</p>
<p>Text embeddings that leverage LLMs are typically generated using a GPU in batches due to the increased amount of computation needed. Python libraries such as <a href="https://huggingface.co/">Hugging Face</a> <a href="https://huggingface.co/docs/transformers/en/index">transformers</a> and <a href="https://sbert.net/">sentence-transformers</a> can load these embedding models. For this experiment, I used the very new <a href="https://huggingface.co/Alibaba-NLP/gte-modernbert-base">Alibaba-NLP/gte-modernbert-base</a> text embedding model that is finetuned from the <a href="https://huggingface.co/answerdotai/ModernBERT-base">ModernBERT model</a> specifically for the embedding use case for two reasons: it uses the ModernBERT architecture which is <a href="https://huggingface.co/blog/modernbert">optimized for fast inference</a>, and the base ModernBERT model is trained to be more code-aware and should be able to understand JSON-nested input strings more robustly — that&rsquo;s also why I intentionally left in the indentation for nested JSON arrays as it&rsquo;s semantically meaningful and <a href="https://huggingface.co/answerdotai/ModernBERT-base/blob/main/tokenizer_config.json">explicitly tokenized</a>. <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>The code (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/generate_imdb_embeddings.ipynb">Jupyter Notebook</a>) — with extra considerations to avoid running out of memory on either the CPU or GPU <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> — looks something like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">device</span> <span class="o">=</span> <span class="s2">&#34;cuda:0&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">dataloader</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">DataLoader</span><span class="p">(</span><span class="n">docs</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                         <span class="n">shuffle</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                         <span class="n">pin_memory</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                         <span class="n">pin_memory_device</span><span class="o">=</span><span class="n">device</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">dataset_embeddings</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">dataloader</span><span class="p">,</span> <span class="n">smoothing</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">tokenized_batch</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">batch</span><span class="p">,</span> <span class="n">max_length</span><span class="o">=</span><span class="mi">8192</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s2">&#34;pt&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">        <span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">tokenized_batch</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">embeddings</span> <span class="o">=</span> <span class="n">outputs</span><span class="o">.</span><span class="n">last_hidden_state</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">detach</span><span class="p">()</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">dataset_embeddings</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">dataset_embeddings</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="n">dataset_embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">dataset_embeddings</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">dataset_embeddings</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/featured_hu_be15fd7c96cd6da2.webp 320w,/2025/06/movie-embeddings/featured_hu_a1d4e8d783c0419.webp 768w,/2025/06/movie-embeddings/featured_hu_1aa1372a6affcdc5.webp 1024w,/2025/06/movie-embeddings/featured.webp 1318w" src="featured.webp"/> 
</figure>

<p>I used a Spot <a href="https://cloud.google.com/blog/products/compute/introducing-g2-vms-with-nvidia-l4-gpus">L4 GPU</a> on <a href="https://cloud.google.com/">Google Cloud Platform</a> priced at $0.28/hour, and it took 21 minutes to encode all 242k movie embeddings: about $0.10 total, which is surprisingly efficient.</p>
<p>Each of these embeddings is a set of 768 numbers (768D). If the embeddings are unit normalized (the <code>F.normalize()</code> step), then calculating the dot product between embeddings will return the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> of those movies, which can then be used to identify the most similar movies. But &ldquo;similar&rdquo; is open-ended, as there are many dimensions along which a movie could be considered similar.</p>
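<p>In code, that search is just a matrix-vector product over the normalized embeddings (a sketch; <code>query_idx</code> is a placeholder):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import torch

# dataset_embeddings: (n_movies, 768), unit-normalized, so dot products
# between rows are cosine similarities.
query_idx = 0  # row index of the query movie
cossims = dataset_embeddings @ dataset_embeddings[query_idx]

# Top 10 most similar movies (includes the query itself at 1.0).
values, indices = torch.topk(cossims, k=10)
</code></pre></div>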
<p>Let&rsquo;s try a few movie similarity test cases where I calculate the cosine similarity between one query movie and <em>all</em> movies, then sort by cosine similarity to find the most similar (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/movie_embeddings_similarity.ipynb">Jupyter Notebook</a>). How about Peter Jackson&rsquo;s <a href="https://www.imdb.com/title/tt0120737/">Lord of the Rings: The Fellowship of the Ring</a>? Ideally, not only would it surface the two other movies of the original trilogy, but also its prequel Hobbit trilogy.</p>
<table>
  <thead>
      <tr>
          <th>title</th>
          <th>cossim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0120737/">The Lord of the Rings: The Fellowship of the Ring (2001)</a></td>
          <td>1.0</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0167261/">The Lord of the Rings: The Two Towers (2002)</a></td>
          <td>0.922</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0167260/">The Lord of the Rings: The Return of the King (2003)</a></td>
          <td>0.92</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt10127200/">National Geographic: Beyond the Movie - The Lord of the Rings: The Fellowship of the Ring (2001)</a></td>
          <td>0.915</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0301246/">A Passage to Middle-earth: The Making of &lsquo;Lord of the Rings&rsquo; (2001)</a></td>
          <td>0.915</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0299105/">Quest for the Ring (2001)</a></td>
          <td>0.906</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0077869/">The Lord of the Rings (1978)</a></td>
          <td>0.893</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2310332/">The Hobbit: The Battle of the Five Armies (2014)</a></td>
          <td>0.891</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1170358/">The Hobbit: The Desolation of Smaug (2013)</a></td>
          <td>0.883</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0903624/">The Hobbit: An Unexpected Journey (2012)</a></td>
          <td>0.883</td>
      </tr>
  </tbody>
</table>
<p>Indeed, it worked and surfaced both trilogies! The other movies listed are about the original work, so having high similarity would be fair.</p>
<p>Compare these results to the &ldquo;<a href="https://help.imdb.com/article/imdb/discover-watch/what-is-the-more-like-this-section/GPE7SPGZREKKY7YN">More like this</a>&rdquo; section on the IMDb page for the movie itself, which has the two sequels to the original Lord of the Rings and two other suggestions that I am not entirely sure are actually related.</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/lotr_related_hu_7560f67c8d88cb97.webp 320w,/2025/06/movie-embeddings/lotr_related_hu_544b4f2cf95b01dd.webp 768w,/2025/06/movie-embeddings/lotr_related_hu_8c4f2099751f082.webp 1024w,/2025/06/movie-embeddings/lotr_related.webp 1354w" src="lotr_related.webp"/> 
</figure>

<p>What about more elaborate franchises, such as the <a href="https://en.wikipedia.org/wiki/Marvel_Cinematic_Universe">Marvel Cinematic Universe</a>? If you asked for movies similar to <a href="https://www.imdb.com/title/tt4154796/">Avengers: Endgame</a>, would other MCU films be the most similar?</p>
<table>
  <thead>
      <tr>
          <th>title</th>
          <th>cossim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4154796/">Avengers: Endgame (2019)</a></td>
          <td>1.0</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4154756/">Avengers: Infinity War (2018)</a></td>
          <td>0.909</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0848228/">The Avengers (2012)</a></td>
          <td>0.896</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1217616/">Endgame (2009)</a></td>
          <td>0.894</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4154664/">Captain Marvel (2019)</a></td>
          <td>0.89</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2395427/">Avengers: Age of Ultron (2015)</a></td>
          <td>0.882</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt3498820/">Captain America: Civil War (2016)</a></td>
          <td>0.882</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0292502/">Endgame (2001)</a></td>
          <td>0.881</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0118661/">The Avengers (1998)</a></td>
          <td>0.877</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1228705/">Iron Man 2 (2010)</a></td>
          <td>0.876</td>
      </tr>
  </tbody>
</table>
<p>The answer is yes, which isn&rsquo;t a surprise since those movies share many principals. However, there are instances of other movies named &ldquo;Endgame&rdquo; and &ldquo;The Avengers&rdquo; that are completely unrelated to Marvel, implying that the similarities may be fixating on the names.</p>
<p>What about movies of a smaller franchise but a specific domain, such as Disney&rsquo;s <a href="https://www.imdb.com/title/tt2294629/">Frozen</a> that only has one sequel? Would it surface other 3D animated movies by <a href="https://en.wikipedia.org/wiki/Walt_Disney_Animation_Studios">Walt Disney Animation Studios</a>, or something else?</p>
<table>
  <thead>
      <tr>
          <th>title</th>
          <th>cossim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2294629/">Frozen (2013)</a></td>
          <td>1.0</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4520988/">Frozen II (2019)</a></td>
          <td>0.93</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1323045/">Frozen (2010)</a></td>
          <td>0.92</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1611845/">Frozen (2010)</a> [a different one]</td>
          <td>0.917</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0125279/">Frozen (1996)</a></td>
          <td>0.909</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0376606/">Frozen (2005)</a></td>
          <td>0.9</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2363439/">The Frozen (2012)</a></td>
          <td>0.898</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4007494/">The Story of Frozen: Making a Disney Animated Classic (2014)</a></td>
          <td>0.894</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1071798/">Frozen (2007)</a></td>
          <td>0.889</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4150316/">Frozen in Time (2014)</a></td>
          <td>0.888</td>
      </tr>
  </tbody>
</table>
<p>&hellip;okay, it&rsquo;s definitely fixating on the name. Let&rsquo;s try a different approach to see if we can find more meaningful patterns in these embeddings.</p>
<p>In order to visualize the embeddings, we can project them to a lower dimensionality with a dimensionality reduction algorithm such as <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> or <a href="https://umap-learn.readthedocs.io/en/latest/">UMAP</a>: UMAP is preferred as it can simultaneously reorganize the data into more meaningful clusters. UMAP&rsquo;s <a href="https://umap-learn.readthedocs.io/en/latest/how_umap_works.html">construction of a neighborhood graph</a>, in theory, can allow the reduction to refine the similarities by leveraging many possible connections and hopefully avoid fixating on the movie name. However, with this amount of input data and the relatively high initial 768D vector size, the computation cost of UMAP is a concern, as both factors cause the UMAP training time to grow dramatically. Fortunately, NVIDIA&rsquo;s <a href="https://github.com/rapidsai/cuml">cuML library</a> was recently <a href="https://github.com/rapidsai/cuml/releases/tag/v25.04.00">updated</a> so that you can now run UMAP on a GPU with very large amounts of data and a very high number of epochs to ensure the reduction fully converges, so I did just that (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/imdb_embeddings_umap_to_2D.ipynb">Jupyter Notebook</a>). What patterns can we find? Let&rsquo;s try plotting the reduced points, colored by their user rating.</p>
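<p>The cuML invocation closely mirrors the CPU-based umap-learn API (hyperparameter values here are illustrative, not the notebook&rsquo;s exact ones):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">from cuml.manifold import UMAP

# Reduce the 768D embeddings to 2D on the GPU; a high n_epochs helps
# the reduction fully converge.
reducer = UMAP(n_components=2, n_neighbors=15, n_epochs=5000)
embeddings_2d = reducer.fit_transform(dataset_embeddings.numpy())
</code></pre></div>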
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/imdb_umap_rating_hu_4047e53667cc289a.webp 320w,/2025/06/movie-embeddings/imdb_umap_rating_hu_74d5c85f14c8950c.webp 768w,/2025/06/movie-embeddings/imdb_umap_rating_hu_2b6ccdbb5b4b9105.webp 1024w,/2025/06/movie-embeddings/imdb_umap_rating.webp 1200w" src="imdb_umap_rating.webp"/> 
</figure>

<p>So there are a few things going on here. Indeed, most of the points are high-rating green, as is evident in the source data. But the points and ratings aren&rsquo;t <em>random</em>, and there are trends. In the center giga cluster, there are soft subclusters of movies at high ratings and low ratings. Smaller discrete clusters did indeed form, but what is the deal with that extremely isolated cluster at the top? After investigation, that cluster only has movies released in 2008, which is another feature I should have considered when defining movie similarity.</p>
<p>As a sanity check, I faceted out the points by movie release year to better visualize where these clusters are forming:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/imdb_umap_rating_year_hu_40c4d6844e346f92.webp 320w,/2025/06/movie-embeddings/imdb_umap_rating_year_hu_48d37fbda72976cc.webp 768w,/2025/06/movie-embeddings/imdb_umap_rating_year_hu_27485860dc95d177.webp 1024w,/2025/06/movie-embeddings/imdb_umap_rating_year.webp 1200w" src="imdb_umap_rating_year.webp"/> 
</figure>

<p>This shows that even within the clusters, movies have their rating values spread out, but I also unintentionally visualized how <a href="https://arize.com/docs/ax/machine-learning/computer-vision/how-to-cv/embedding-drift">embedding drift</a> occurs over time. 2024 is also a bizarrely-clustered year: I have no idea why those two years specifically are weird in movies.</p>
<p>The UMAP approach is more for fun, since for the downstream model building it&rsquo;s better to use the raw 768D vectors and let the model learn features from them. At the least, there&rsquo;s <em>some</em> semantic signal preserved in these embeddings, which makes me optimistic that these embeddings alone can be used to train a viable movie rating predictor.</p>
<h2 id="predicting-average-imdb-movie-scores">Predicting Average IMDb Movie Scores</h2>
<p>So, we now have hundreds of thousands of 768D embeddings. How do we get them to predict movie ratings? What many don&rsquo;t know is that all methods of traditional statistical modeling also work with embeddings — assumptions such as feature independence are invalid, so the results aren&rsquo;t explainable, but you can still get a valid predictive model.</p>
<p>First, we will shuffle and split the dataset into a training set and a test set: for the test set, I chose 20,000 movies (roughly 10% of the data), which is more than enough for stable results. To decide on the best model, we will use the one that minimizes the <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean squared error</a> (MSE) on the test set, which is a standard approach to solving regression problems that predict a single numeric value.</p>
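<p>A quick sketch of that setup (<code>embeddings</code> and <code>ratings</code> are placeholders for the embedding matrix and rating vector):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import numpy as np
from sklearn.model_selection import train_test_split

# Hold out 20,000 movies for the test set.
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, ratings, test_size=20_000, shuffle=True, random_state=42
)

def mse(y_true, y_pred):
    # Mean squared error: the metric each model below is judged on.
    return np.mean((y_true - y_pred) ** 2)
</code></pre></div>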
<p>Here are three approaches for using LLMs to solve non-next-token-prediction tasks.</p>
<h3 id="method-1-traditional-modeling-w-gpu-acceleration">Method #1: Traditional Modeling (w/ GPU Acceleration!)</h3>
<p>You can still fit a linear regression on top of the embeddings even if feature coefficients are completely useless, and it serves as a decent baseline (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/cuml_grid_search.ipynb">Jupyter Notebook</a>). The absolute laziest &ldquo;model&rdquo;, where we just use the mean of the training set for every prediction, results in a test MSE of <strong>1.637</strong>, but performing a simple linear regression on top of the 768D embeddings instead results in a more reasonable test MSE of <strong>1.187</strong>. We should be able to beat that handily with a more advanced model.</p>
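<p>Both baselines take only a few lines given the split above (a sketch using cuML&rsquo;s GPU implementation; scikit-learn&rsquo;s would be nearly identical):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">from cuml.linear_model import LinearRegression

# Laziest baseline: predict the training-set mean for every movie.
mean_mse = mse(y_test, np.full_like(y_test, y_train.mean()))

# Linear regression directly on the 768D embeddings.
lr = LinearRegression().fit(X_train, y_train)
lr_mse = mse(y_test, lr.predict(X_test))
</code></pre></div>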
<p>Data scientists familiar with scikit-learn know there&rsquo;s a rabbit hole of model options, but most of them are CPU-bound and single-threaded and would take a considerable amount of time on a dataset of this size. That&rsquo;s where cuML—the same library I used to create the UMAP projection—comes in, as cuML has <a href="https://docs.rapids.ai/api/cuml/stable/api/#regression-and-classification">GPU-native implementations</a> of most popular scikit-learn models with a similar API. This notably includes <a href="https://en.wikipedia.org/wiki/Support_vector_machine">support vector machines</a>, which play especially nicely with embeddings. And because we have the extra compute, we can also perform a brute-force hyperparameter <a href="https://www.dremio.com/wiki/grid-search/">grid search</a> to find the best parameters for fitting each model.</p>
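<p>For example, a brute-force grid search over cuML&rsquo;s support vector regression might look like this (the parameter grid is illustrative, not the notebook&rsquo;s exact one):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">from itertools import product
from cuml.svm import SVR

best_mse, best_params = float("inf"), None
for C, epsilon in product([0.1, 1.0, 10.0], [0.01, 0.1, 0.5]):
    model = SVR(C=C, epsilon=epsilon).fit(X_train, y_train)
    test_mse = mse(y_test, model.predict(X_test))
    if test_mse &lt; best_mse:
        best_mse, best_params = test_mse, {"C": C, "epsilon": epsilon}
</code></pre></div>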
<p>Here&rsquo;s the results of MSE on the test dataset for a few of these new model types, with the hyperparameter combination for each model type that best minimizes MSE:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/model_comparison_base_hu_2e224af8e7736cd2.webp 320w,/2025/06/movie-embeddings/model_comparison_base_hu_ea8ec94f59331bc5.webp 768w,/2025/06/movie-embeddings/model_comparison_base_hu_536396210f6f6e7a.webp 1024w,/2025/06/movie-embeddings/model_comparison_base.png 1200w" src="model_comparison_base.png"/> 
</figure>

<p>The winner is the Support Vector Machine, with a test MSE of <strong>1.087</strong>! This is a good start for a simple approach that handily beats the linear regression baseline, and it also beats the model from the Redditor&rsquo;s original notebook, which had a test MSE of 1.096 <sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup>. In all cases, the train set MSE was close to the test set MSE, which means the models did not overfit either.</p>
<h3 id="method-2-neural-network-on-top-of-embeddings">Method #2: Neural Network on top of Embeddings</h3>
<p>Since we&rsquo;re already dealing with AI models and already have PyTorch installed to generate the embeddings, we might as well try the traditional approach of training a <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">multilayer perceptron</a> (MLP) neural network on top of the embeddings (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/pytorch_model_train_mlp.ipynb">Jupyter Notebook</a>). This workflow sounds much more complicated than just fitting a traditional model above, but PyTorch makes MLP construction straightforward, and Hugging Face&rsquo;s <a href="https://huggingface.co/docs/transformers/en/main_classes/trainer">Trainer class</a> incorporates best model training practices by default, although its <code>compute_loss</code> function has to be tweaked to minimize MSE specifically.</p>
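<p>That tweak is a small subclass override (a sketch only; the exact <code>compute_loss</code> signature varies across transformers versions):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import torch
from transformers import Trainer

class MSETrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Pop the target ratings, then score the model's predictions
        # against them with MSE.
        targets = inputs.pop("targets")
        preds = model(**inputs).squeeze()
        loss = torch.nn.functional.mse_loss(preds, targets)
        return (loss, preds) if return_outputs else loss
</code></pre></div>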
<p>The PyTorch model, using a loop to set up the MLP blocks, looks something like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">RatingsModel</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">linear_dims</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span> <span class="n">num_layers</span><span class="o">=</span><span class="mi">6</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">dims</span> <span class="o">=</span> <span class="p">[</span><span class="mi">768</span><span class="p">]</span> <span class="o">+</span> <span class="p">[</span><span class="n">linear_dims</span><span class="p">]</span> <span class="o">*</span> <span class="n">num_layers</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">mlp</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">ModuleList</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">            <span class="n">nn</span><span class="o">.</span><span class="n">Sequential</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">dims</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">dims</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]),</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">GELU</span><span class="p">(),</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">BatchNorm1d</span><span class="p">(</span><span class="n">dims</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]),</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.6</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">dims</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">dims</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">targets</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">mlp</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">x</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">output</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>  <span class="c1"># return 1D output if batched inputs</span>
</span></span></code></pre></div><p>This MLP is 529k parameters total: large for a MLP, but given the 222k row input dataset, it&rsquo;s not egregiously so.</p>
<p>The real difficulty with this MLP approach is that it&rsquo;s <em>too effective</em>: even with less than 1 million parameters, the model will overfit severely, quickly converging to a 0.00 train MSE while the test set MSE explodes. That&rsquo;s why <code>Dropout</code> is set to the atypically high probability of <code>0.6</code>.</p>
<p>Fortunately, MLPs are fast to train: training for 600 epochs (total passes through the full training dataset) took about 17 minutes on the GPU. Here are the training results:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/training_mlp_hu_db4d2b769213c385.webp 320w,/2025/06/movie-embeddings/training_mlp_hu_99fc40ac0f82af11.webp 768w,/2025/06/movie-embeddings/training_mlp_hu_c64c2a10817470c0.webp 1024w,/2025/06/movie-embeddings/training_mlp.png 1200w" src="training_mlp.png"/> 
</figure>

<p>The lowest logged test MSE was <strong>1.074</strong>: a slight improvement over the Support Vector Machine approach.</p>
<h3 id="method-3-just-train-a-llm-from-scratch-dammit">Method #3: Just Train a LLM From Scratch Dammit</h3>
<p>There is a possibility that using a pretrained embedding model trained on the entire internet could intrinsically contain relevant signal about popular movies—such as a movie winning awards, which would imply a high IMDb rating—and that knowledge could leak into the test set and provide misleading results. This may not be a significant issue in practice, since that knowledge is only a tiny part of the <code>gte-modernbert-base</code> model, which is itself too small to memorize exact information.</p>
<p>For the sake of comparison, let&rsquo;s try training a LLM from scratch on top of the raw movie JSON representations to see if we can get better results without the possibility of leakage (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/pytorch_model_train_llm.ipynb">Jupyter Notebook</a>). I had specifically avoided this approach because the compute required to train a LLM is much, much higher than for a SVM or MLP model, and leveraging a pretrained model generally gives better results. In this case, though, since we don&rsquo;t need a LLM that has all the knowledge of human existence, we can train a much smaller model that <em>only</em> knows how to work with the movie JSON representations and can figure out relationships between actors and whether titles are sequels on its own. Hugging Face transformers makes this workflow surprisingly straightforward by not only having functionality to train your own custom tokenizer (in this case, from a 50k vocab down to a 5k vocab) that encodes the data more efficiently, but also by allowing the construction of a ModernBERT model with any number of layers and units. I opted for a 5M parameter LLM (SLM?), albeit with less dropout, since high dropout causes learning issues for LLMs specifically.</p>
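<p>A rough sketch of that setup is below; the base tokenizer checkpoint and the layer sizes are assumptions chosen only to land near a ~5M parameter model, not my exact configuration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">from transformers import AutoTokenizer, ModernBertConfig, ModernBertModel

# `movie_jsons` is a hypothetical iterable of raw movie JSON strings.
base_tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
tokenizer = base_tokenizer.train_new_from_iterator(movie_jsons, vocab_size=5000)

# Small, assumed layer sizes to illustrate a tiny ModernBERT.
config = ModernBertConfig(
    vocab_size=5000,
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=4,
    num_attention_heads=4,
)
model = ModernBertModel(config)
</code></pre></div>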
<p>The actual PyTorch model code is surprisingly more concise than the MLP approach:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">RatingsModel</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">transformer_model</span> <span class="o">=</span> <span class="n">model</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">,</span> <span class="n">attention_mask</span><span class="p">,</span> <span class="n">targets</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">transformer_model</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">input_ids</span><span class="o">=</span><span class="n">input_ids</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">attention_mask</span><span class="o">=</span><span class="n">attention_mask</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">output_hidden_states</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">last_hidden_state</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>  <span class="c1"># the &#34;[CLS] vector&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">output</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>  <span class="c1"># return 1D output if batched inputs</span>
</span></span></code></pre></div><p>Essentially, the model trains its own &ldquo;text embedding,&rdquo; although in this case, instead of being optimized for textual similarity, the embedding is just a representation that can easily be translated into a numeric rating.</p>
<p>Because the computation needed for training a LLM from scratch is much higher, I only trained the model for 10 epochs, which still took twice as long as the 600 epochs of the MLP approach. Given that, the results are surprising:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/training_llm_hu_2355de410bfc61c1.webp 320w,/2025/06/movie-embeddings/training_llm_hu_cfcd114ac3c12003.webp 768w,/2025/06/movie-embeddings/training_llm_hu_f6c75fc2deeead45.webp 1024w,/2025/06/movie-embeddings/training_llm.png 1200w" src="training_llm.png"/> 
</figure>

<p>The LLM approach did much better than my previous attempts, hitting a new lowest test MSE of <strong>1.026</strong> with only 4 passes through the data! After that, it definitely overfit. I tried other, smaller configurations for the LLM to avoid the overfitting, but none of them ever hit a test MSE that low.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Let&rsquo;s look at the model comparison again, this time adding the results from training a MLP and training a LLM from scratch:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/model_comparison_all_hu_2309fb0cea20f0c.webp 320w,/2025/06/movie-embeddings/model_comparison_all_hu_34af566430bbc603.webp 768w,/2025/06/movie-embeddings/model_comparison_all_hu_1e1d9cf8cdfde789.webp 1024w,/2025/06/movie-embeddings/model_comparison_all.png 1200w" src="model_comparison_all.png"/> 
</figure>

<p>Coming into this post, I genuinely thought that training the MLP on top of embeddings would be the winner, given the base embedding model&rsquo;s knowledge of everything, but maybe there&rsquo;s something to just YOLOing and feeding raw JSON input data to a completely new LLM. More research and development is needed.</p>
<p>The differences in model performance across these approaches aren&rsquo;t dramatic, but the iteration itself was interesting, and a highly accurate model was a long shot anyway given the scarce amount of metadata. The fact that building a model off of text embeddings alone didn&rsquo;t result in a perfect model doesn&rsquo;t mean this approach was a waste of time. The embedding and modeling pipelines I constructed in the process of trying to solve this problem have already provided significant dividends on easier problems, such as identifying the efficiency of <a href="https://minimaxir.com/2025/02/embeddings-parquet/">storing embeddings in Parquet and manipulating them with Polars</a>.</p>
<p>It&rsquo;s impossible and pointless to pinpoint the exact reason the original Reddit poster got rejected: it could have been the neural network approach, or even something out of their control, such as the company actually stopping hiring and being too disorganized to tell the candidate. To be clear, if I myself were applying for a data science role, I wouldn&rsquo;t use the techniques in this blog post (that UMAP data visualization would get me instantly rejected!) and would instead do more traditional EDA and non-neural-network modeling to showcase my data science knowledge to the hiring manager. But for my professional work, I will definitely try starting any modeling exploration with an embeddings-based approach wherever possible: at the absolute worst, it&rsquo;s a very strong baseline that will be hard to beat.</p>
<p><em>All of the Jupyter Notebooks and data visualization code for this blog post are available open-source in <a href="https://github.com/minimaxir/imdb-embeddings/">this GitHub repository</a>.</em></p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>I am not a fan of using GBT variable importance as a decision-making metric: variable importance does not tell you magnitude or <em>direction</em> of the feature in the real world, but it does help identify which features can be pruned for model development iteration.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>To get a sense of how old they are, they are only available as <a href="https://en.wikipedia.org/wiki/Tab-separated_values">TSV files</a>, which is a data format so old and prone to errors that many data libraries have dropped explicit support for it. Amazon, please release the datasets as CSV or Parquet files instead!&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>Two other useful features of <code>gte-modernbert-base</code>, though not strictly relevant to these movie embeddings, are a) it&rsquo;s a cased model, so it can identify meaning from upper-case text, and b) it does not require a prefix such as <code>search_query</code> and <code>search_document</code> as <a href="https://huggingface.co/nomic-ai/nomic-embed-text-v1.5">nomic-embed-text-v1.5 does</a> to guide its results, which is an annoying requirement for those models.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>The trick here is the <code>detach()</code> function for the computed embeddings, otherwise the GPU doesn&rsquo;t free up the memory once moved back to the CPU. I may or may not have discovered that the hard way.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>As noted earlier, minimizing MSE isn&rsquo;t a competition, but the comparison on roughly the same dataset is good for a sanity check.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>The Best Way to Use Text Embeddings Portably is With Parquet and Polars</title>
      <link>https://minimaxir.com/2025/02/embeddings-parquet/</link>
      <pubDate>Mon, 24 Feb 2025 10:15:00 -0800</pubDate>
      <guid>https://minimaxir.com/2025/02/embeddings-parquet/</guid>
      <description>Never store embeddings in a CSV!</description>
      <content:encoded><![CDATA[<p><a href="https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/">Text embeddings</a>, particularly modern embeddings generated from large language models, are one of the most useful applications coming from the generative AI boom. Embeddings are a list of numbers which represent an object: in the case of text embeddings, they can represent words, sentences, and full paragraphs and documents, and they do so with a surprising amount of distinctiveness.</p>
<p>Recently, I created text embeddings representing every distinct <a href="https://magic.wizards.com/en">Magic: the Gathering</a> card released as of the February 2025 Aetherdrift expansion: 32,254 in total. With these embeddings, I can find the mathematical similarity between cards through the encoded representation of their card design, including all mechanical attributes such as the card name, card cost, card text, and even card rarity.</p>
<figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/wog_hu_7ed6be2e5737eeb4.webp 320w,/2025/02/embeddings-parquet/wog_hu_81c75e037d833a96.webp 768w,/2025/02/embeddings-parquet/wog.webp 976w" src="wog.webp"
         alt="The iconic Magic card Wrath of God, along with its top four most similar cards identified using their respective embeddings. The similar cards are valid matches, with similar card text and card types."/> <figcaption>
            <p>The iconic Magic card <a href="https://gatherer.wizards.com/pages/card/Details.aspx?multiverseid=129808">Wrath of God</a>, along with its top four most similar cards identified using their respective embeddings. The similar cards are valid matches, with similar card text and card types.</p>
        </figcaption>
</figure>

<p>Additionally, I can create a fun 2D <a href="https://umap-learn.readthedocs.io/en/latest/">UMAP</a> projection of all those cards, which also identifies interesting patterns:</p>
<figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/mtg_umap_hu_df72981641ef0ffd.webp 320w,/2025/02/embeddings-parquet/mtg_umap_hu_ad2e63ba61f377cd.webp 768w,/2025/02/embeddings-parquet/mtg_umap_hu_7de8f113f1eb20fa.webp 1024w,/2025/02/embeddings-parquet/mtg_umap.webp 1200w" src="mtg_umap.webp"
         alt="The UMAP dimensionality reduction process also implicitly clusters the Magic cards to logical clusters, such as by card color(s) and card type."/> <figcaption>
            <p>The UMAP dimensionality reduction process also implicitly clusters the Magic cards to logical clusters, such as by card color(s) and card type.</p>
        </figcaption>
</figure>

<p>I generated these Magic card embeddings for <em>something special</em> besides a pretty data visualization, but if you are curious how I generated them, they were made using the new-but-underrated <a href="https://huggingface.co/Alibaba-NLP/gte-modernbert-base">gte-modernbert-base</a> embedding model and the process is detailed <a href="https://github.com/minimaxir/mtg-embeddings">in this GitHub repository</a>. The embeddings themselves (including the coordinate values to reproduce the 2D UMAP visualization) are available as a <a href="https://huggingface.co/datasets/minimaxir/mtg-embeddings">Hugging Face dataset</a>.</p>
<p>Most tutorials involving embedding generation omit the obvious question: what do you <em>do</em> with the text embeddings after you generate them? The common solution is to use a <a href="https://en.wikipedia.org/wiki/Vector_database">vector database</a>, such as <a href="https://github.com/facebookresearch/faiss">faiss</a> or <a href="https://qdrant.tech">qdrant</a>, or even a cloud-hosted service such as <a href="https://www.pinecone.io">Pinecone</a>. But those aren&rsquo;t easy to use: faiss has <a href="https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index">confusing configuration options</a>, qdrant requires <a href="https://github.com/qdrant/qdrant?tab=readme-ov-file#client-server">using a Docker container</a> to host the storage server, and Pinecone can get <a href="https://www.pinecone.io/pricing/">very expensive</a> very quickly, and its free Starter tier is limited.</p>
<p>What many don&rsquo;t know about text embeddings is that you don&rsquo;t <em>need</em> a vector database to calculate nearest-neighbor similarity if your data isn&rsquo;t too large. Using <a href="https://numpy.org/doc/stable/index.html">numpy</a> and my Magic card embeddings, a 2D matrix of 32,254 <code>float32</code> embeddings at a dimensionality of 768D (common for &ldquo;smaller&rdquo; LLM embedding models) occupies <strong>94.49 MB</strong> of system memory, which is relatively low for modern personal computers and can fit within free usage tiers of cloud VMs. If both the query vector and the embeddings themselves are unit normalized (many embedding generators normalize by default), then the matrix dot product between the query and embeddings results in a cosine similarity between <code>[-1, 1]</code>, where the higher score is better/more similar. Since dot products are such a fundamental aspect of linear algebra, numpy&rsquo;s implementation is extremely fast: with the help of additional numpy <a href="https://numpy.org/doc/stable/reference/generated/numpy.argpartition.html">sorting</a> <a href="https://numpy.org/doc/2.1/reference/generated/numpy.argsort.html">shenanigans</a>, on my M3 Pro MacBook Pro it takes just <strong>1.08 ms</strong> on average to calculate all 32,254 dot products, find the top 3 most similar embeddings, and return their corresponding <code>idx</code> in the matrix and cosine similarity <code>score</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">fast_dot_product</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">matrix</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">3</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">dot_products</span> <span class="o">=</span> <span class="n">query</span> <span class="o">@</span> <span class="n">matrix</span><span class="o">.</span><span class="n">T</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">idx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argpartition</span><span class="p">(</span><span class="n">dot_products</span><span class="p">,</span> <span class="o">-</span><span class="n">k</span><span class="p">)[</span><span class="o">-</span><span class="n">k</span><span class="p">:]</span>
</span></span><span class="line"><span class="cl">    <span class="n">idx</span> <span class="o">=</span> <span class="n">idx</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">argsort</span><span class="p">(</span><span class="n">dot_products</span><span class="p">[</span><span class="n">idx</span><span class="p">])[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">score</span> <span class="o">=</span> <span class="n">dot_products</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">idx</span><span class="p">,</span> <span class="n">score</span>
</span></span></code></pre></div><p>In most implementations of vector databases, once you insert the embeddings, they&rsquo;re stuck there in a proprietary serialization format and you are locked into that library and service. If you&rsquo;re just building a personal pet project or sanity-checking embeddings to make sure the results are good, that&rsquo;s a huge amount of friction. For example, when I want to experiment with embeddings, I generate them on a cloud server with a GPU since LLM-based embeddings models are often slow to generate without one, and then download them locally to my personal computer. What is the best way to handle embeddings portably such that they can easily be moved between machines and also in a non-proprietary format?</p>
<p>The answer, after much personal trial-and-error, is Parquet files, which still has a surprising amount of nuance. But before we talk about why Parquet files are good, let&rsquo;s talk about how <em>not</em> to store embeddings.</p>
<h2 id="the-worst-ways-to-store-embeddings">The Worst Ways to Store Embeddings</h2>
<p>The incorrect-but-unfortunately-common way to store embeddings is in a text format such as a CSV file. Text data is substantially larger than <code>float32</code> data: for example, a decimal number with full precision (e.g. <code>2.145829051733016968e-02</code>) as a <code>float32</code> is 32 bits/4 bytes, while as a text representation (in this case 24 ASCII <code>char</code>s) it&rsquo;s 24 bytes, <strong>6x larger</strong>. When the CSV is saved and loaded, the data has to be serialized between a numpy and a string representation of the array, which adds significant overhead. Despite that, in <a href="https://github.com/openai/openai-cookbook/blob/a3e98ea4dcf866b5e7a3cb7d63dccaa68c7d63aa/examples/Embedding_Wikipedia_articles_for_search.ipynb">one of OpenAI&rsquo;s official tutorials</a> for their embeddings models, they save the embeddings as a CSV using <a href="https://pandas.pydata.org">pandas</a> with the admitted caveat of &ldquo;Because this example only uses a few thousand strings, we&rsquo;ll store them in a CSV file. (For larger datasets, use a vector database, which will be more performant.)&rdquo;. In the case of the Magic card embeddings, pandas-to-CSV performs the <em>worst</em> out of any encoding options: more on why later.</p>
<p>Numpy has native methods to <a href="https://numpy.org/doc/stable/reference/generated/numpy.savetxt.html">save</a> and <a href="https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html">load</a> embeddings as a <code>.txt</code> that&rsquo;s straightforward:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">np</span><span class="o">.</span><span class="n">savetxt</span><span class="p">(</span><span class="s2">&#34;embeddings_txt.txt&#34;</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">embeddings_r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="s2">&#34;embeddings_txt.txt&#34;</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s2">&#34; &#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>The resulting file not only takes a few seconds to save and load, but it&rsquo;s also massive: <strong>631.5 MB</strong>!</p>
<p>As an aside, HTTP APIs such as OpenAI&rsquo;s <a href="https://platform.openai.com/docs/guides/embeddings">Embeddings API</a> do transmit the embeddings as text, which adds needless latency and bandwidth overhead. I wish more embedding providers offered <a href="https://grpc.io">gRPC</a> APIs that allow transfer of binary <code>float32</code> data instead for a performance increase: Pinecone&rsquo;s <a href="https://docs.pinecone.io/reference/python-sdk">Python SDK</a>, for example, does just that.</p>
<p>The second incorrect method to save a matrix of embeddings to disk is to save it as a Python <a href="https://docs.python.org/3/library/pickle.html">pickle</a> object, which serializes its in-memory representation to disk with a few lines of code using the native <code>pickle</code> library. Pickling is unfortunately common in the machine learning industry since many ML frameworks such as <a href="https://scikit-learn.org/stable/">scikit-learn</a> don&rsquo;t have easy ways to serialize encoders and models. But it comes with two major caveats: pickled files are a massive security risk as they can execute arbitrary code, and they aren&rsquo;t guaranteed to open on other machines or Python versions. It&rsquo;s 2025, just stop pickling if you can.</p>
<p>In the case of the Magic card embeddings, it does indeed work with instant save/loads, and the file size on disk is <strong>94.49 MB</strong>: the same as its memory consumption and about 1/6th of the text size as expected:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&#34;embeddings_matrix.pkl&#34;</span><span class="p">,</span> <span class="s2">&#34;wb&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&#34;embeddings_matrix.pkl&#34;</span><span class="p">,</span> <span class="s2">&#34;rb&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">embeddings_r</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</span></span></code></pre></div><p>But there are still better and easier approaches.</p>
<h2 id="the-intended-but-not-great-way-to-store-embeddings">The Intended-But-Not-Great Way to Store Embeddings</h2>
<p>Numpy itself has a canonical way to <a href="https://numpy.org/doc/2.1/reference/generated/numpy.save.html">save</a> and <a href="https://numpy.org/doc/2.1/reference/generated/numpy.load.html">load</a> matrices — which annoyingly saves as a pickle by default for compatibility reasons, but that can fortunately be disabled by setting <code>allow_pickle=False</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">np</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s2">&#34;embeddings_matrix.npy&#34;</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">allow_pickle</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">embeddings_r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">&#34;embeddings_matrix.npy&#34;</span><span class="p">,</span> <span class="n">allow_pickle</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span></code></pre></div><p>File size and I/O speed are the same as with the <code>pickle</code> approach.</p>
<p>This works — and it&rsquo;s something I had used for a while — but in the process it exposes another problem: how do we map metadata (the Magic cards in this case) to embeddings? Currently, we use the <code>idx</code> of the most-similar matches to perform an efficient batched lookup into the source data. In this case, the number of rows matches the number of cards exactly, but what happens if the embeddings matrix needs to be changed, such as to add or remove cards and their embeddings? What happens if you want to add a dataset filter? It becomes a mess that inevitably causes technical debt.</p>
<p>The solution to this is to colocate metadata such as card names, card text, and attributes with their embeddings: that way, if they are later added, removed, or sorted, the results will remain the same. Modern vector databases such as qdrant and Pinecone do just that, with the ability to filter and sort on the metadata at the same time you query the most similar vectors. This is a bad idea to do in numpy itself, as it&rsquo;s more optimized for numbers and not other data types such as strings, which have <a href="https://numpy.org/devdocs/user/basics.strings.html">limited operations available</a>.</p>
<p>The solution is to look at another file format that can store metadata and embeddings simultaneously, and the answer to that is Parquet files. But there&rsquo;s a rabbit hole as to what&rsquo;s the <em>best</em> way to interact with them.</p>
<h2 id="what-are-parquet-files">What are Parquet files?</h2>
<p>Parquet, developed by the open-source <a href="https://parquet.apache.org">Apache Parquet</a> project, is a file format for handling columnar data, but despite being <a href="https://blog.x.com/engineering/en_us/a/2013/announcing-parquet-10-columnar-storage-for-hadoop">first released in 2013</a>, it didn&rsquo;t take off in the data science community until very recently. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> The most relevant feature of Parquet is that the resulting files are typed for each column, and this typing includes nested lists, such as an embedding, which is just a list of <code>float32</code> values. As a bonus, the columnar format allows downstream libraries to save/load columns selectively and very quickly, far faster than CSVs and with rare parsing errors. The file format also allows for efficient compression and decompression, but that&rsquo;s less effective with embeddings as there&rsquo;s little redundant data.</p>
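<p>To make that typing concrete, here&rsquo;s a toy sketch (with made-up 4D embeddings instead of 768D) of building a Parquet file that stores a nested list column alongside ordinary metadata columns:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import pyarrow as pa
import pyarrow.parquet as pq

# Toy data: 4D embeddings purely for illustration.
table = pa.table({
    "name": ["Wrath of God", "Damnation"],
    "embedding": pa.array(
        [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
        type=pa.list_(pa.float32()),  # a typed list-of-float32 column
    ),
})
pq.write_table(table, "toy-embeddings.parquet")
</code></pre></div>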
<p>For Parquet file I/O, the standard approach is to use the <a href="https://arrow.apache.org">Apache Arrow</a> protocol that is columnar in-memory, which complements the Parquet storage medium on disk. But how do you use Arrow?</p>
<h2 id="how-do-you-use-parquet-files-in-python-for-embeddings">How do you use Parquet files in Python for embeddings?</h2>
<p>Ideally, we need a library that can handle nested data easily, interoperate with numpy for serializing to a matrix, and run fast dot products.</p>
<p>The official Arrow library that <a href="https://arrow.apache.org/docs/python/index.html">interacts with Parquet natively</a> in Python is <a href="https://arrow.apache.org/docs/python/index.html">pyarrow</a>. Here, I have an example Parquet file generated with [SPOILERS] that contains both the card metadata and an <code>embedding</code> column, with the embedding for each row corresponding to that card.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">parquet</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">&#34;mtg-embeddings.parquet&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/parquet_hu_268909d3d8256458.webp 320w,/2025/02/embeddings-parquet/parquet_hu_be20ddd4d423844c.webp 768w,/2025/02/embeddings-parquet/parquet_hu_dc1002cb8e03a874.webp 1024w,/2025/02/embeddings-parquet/parquet.png 1352w" src="parquet.png"
         alt="Pyarrow&rsquo;s table schema from the input Parquet file of Magic card embeddings. Note the embedding column at the bottom is a list of 768 floats."/> <figcaption>
            <p>Pyarrow&rsquo;s table schema from the input Parquet file of Magic card embeddings. Note the <code>embedding</code> column at the bottom is a list of 768 floats.</p>
        </figcaption>
</figure>

<p>But pyarrow is not a DataFrame library, and despite the data being in a Table, it&rsquo;s hard to slice and access: the documentation suggests that you export to pandas if you need more advanced manipulation.</p>
<p>Other more traditional data science libraries can leverage pyarrow directly. The most popular one is, of course, pandas itself, which can <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html">read/write Parquet</a> by doing just that. There are many, many resources for using pandas well, so it&rsquo;s often the first choice among data science practitioners.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_parquet</span><span class="p">(</span><span class="s2">&#34;mtg-embeddings.parquet&#34;</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;name&#34;</span><span class="p">,</span> <span class="s2">&#34;embedding&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/pandas_embed_hu_43da08f8256fb434.webp 320w,/2025/02/embeddings-parquet/pandas_embed_hu_ffb22e6af150d0a8.webp 768w,/2025/02/embeddings-parquet/pandas_embed_hu_f0379dc63b1b8457.webp 1024w,/2025/02/embeddings-parquet/pandas_embed.png 1224w" src="pandas_embed.png"
         alt="Pandas HTML table output of the Magic card DataFrame when printed in a Jupyter Notebook."/> <figcaption>
            <p>Pandas HTML table output of the Magic card DataFrame when printed in a Jupyter Notebook.</p>
        </figcaption>
</figure>

<p>There&rsquo;s one major weakness for the use case of embeddings: pandas is very bad at nested data. From the image above you&rsquo;ll see that the <code>embedding</code> column <em>appears</em> to be a list of numbers, but it&rsquo;s actually a list of numpy <code>object</code>s, which is a very inefficient datatype, and likely why writing it to a CSV is very slow. Simply converting it to numpy with <code>df[&quot;embedding&quot;].to_numpy()</code> results in a 1D array, which is definitely wrong, and trying to cast it to <code>float32</code> doesn&rsquo;t work. I found that the best way to extract the embeddings matrix from a pandas <code>embedding</code> column is to <a href="https://numpy.org/doc/2.1/reference/generated/numpy.vstack.html">np.vstack()</a> the embeddings, e.g. <code>np.vstack(df[&quot;embedding&quot;].to_numpy())</code>, which does result in a <code>(32254, 768)</code> <code>float32</code> matrix as expected. That adds a lot of compute and memory overhead in addition to unnecessary numpy array copies. Finally, after computing the dot products between a candidate query and the embedding matrix, row metadata with the most similar values can then be retrieved using <code>df.loc[idx]</code>. <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
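<p>Putting that pandas workflow together, a minimal sketch (assuming the <code>fast_dot_product()</code> helper from earlier and a precomputed <code>query_embed</code> vector) looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import numpy as np
import pandas as pd

df = pd.read_parquet("mtg-embeddings.parquet", columns=["name", "embedding"])

# Stack the object-dtype column into a proper (32254, 768) float32 matrix.
embeddings = np.vstack(df["embedding"].to_numpy())

idx, score = fast_dot_product(query_embed, embeddings, k=3)
most_similar = df.loc[idx]
</code></pre></div>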
<p>However, there is another, more recent tabular data library that not only is faster than pandas, it has proper support for nested data. That library is polars.</p>
<h2 id="the-power-of-polars">The Power of polars</h2>
<p><a href="https://pola.rs">Polars</a> is a relatively new Python library which is primarily written in <a href="https://www.rust-lang.org">Rust</a> and <a href="https://docs.pola.rs/#key-features">supports Arrow</a>, which gives it a <a href="https://duckdblabs.github.io/db-benchmark/">massive performance increase</a> over pandas and many other DataFrame libraries. In the case of Magic cards, 32k rows isn&rsquo;t nearly &ldquo;big data&rdquo; and the gains of using a high-performance library are lesser, but there are some unexpected features that coincidentally work <em>perfectly</em> for the embeddings use case.</p>
<p>As with pandas, you read a parquet file with a <code>read_parquet()</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">read_parquet</span><span class="p">(</span><span class="s2">&#34;mtg-embeddings.parquet&#34;</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;name&#34;</span><span class="p">,</span> <span class="s2">&#34;embedding&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/polars_embed_hu_98a1dcff6631f16f.webp 320w,/2025/02/embeddings-parquet/polars_embed_hu_7795d47fe1f2255a.webp 768w,/2025/02/embeddings-parquet/polars_embed.png 957w" src="polars_embed.png"
         alt="Polars HTML table output of the Magic card DataFrame when printed in a Jupyter Notebook."/> <figcaption>
            <p>Polars HTML table output of the Magic card DataFrame when printed in a Jupyter Notebook.</p>
        </figcaption>
</figure>

<p>There&rsquo;s a notable difference in the table output compared to <code>pandas</code>: it also reports the data type of its columns, and more importantly, it shows that the <code>embedding</code> column consists of arrays, all <code>float32</code>s, and all length 768. That&rsquo;s a great start!</p>
<p>polars also has a <code>to_numpy()</code> function. Unlike pandas, if you call <code>to_numpy()</code> on a column as a Series, e.g. <code>df['embedding'].to_numpy()</code>, the returned object is a numpy 2D matrix: no <code>np.vstack()</code> needed. If you look at the <a href="https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.to_numpy.html">documentation</a> for the function, there&rsquo;s a curious feature:</p>
<blockquote>
<p>This operation copies data only when necessary. The conversion is zero copy when all of the following hold: [&hellip;]</p>
</blockquote>
<p>Zero copy! And in the case of columnar-stored embeddings, the conditions will always hold, but you can set <code>allow_copy=False</code> to throw an error just in case.</p>
<p>Inversely, if you want to add a 2D embeddings matrix to an existing DataFrame and colocate each embedding&rsquo;s corresponding metadata, such as after you batch-generate thousands of embeddings and want to save and download the resulting Parquet, it&rsquo;s just as easy as adding a column to the DataFrame.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span><span class="n">embedding</span><span class="o">=</span><span class="n">embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">write_parquet</span><span class="p">(</span><span class="s2">&#34;mtg-embeddings.parquet&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>Now, let&rsquo;s put the speed to the test using all the Magic card metadata. What if we perform embedding similarity on a Magic card, but beforehand dynamically filter the dataset according to user parameters (therefore filtering the candidate embeddings at the same time since they are colocated) and perform the similarity calculations quickly as usual? Let&rsquo;s try with <a href="https://gatherer.wizards.com/pages/card/details.aspx?multiverseid=87908">Lightning Helix</a>, a card whose effects are self-explanatory even to those who don&rsquo;t play Magic.</p>
<figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/helix_1_hu_9f15db636cb74690.webp 320w,/2025/02/embeddings-parquet/helix_1_hu_c58b97e1d1c6f502.webp 768w,/2025/02/embeddings-parquet/helix_1.webp 976w" src="helix_1.webp"
         alt="The most similar cards to Lightning Helix do have similar effects, although &ldquo;Lightning&rdquo; cards dealing damage is a common trope in Magic. Warleader&rsquo;s Helix is a direct reference to Lightning Helix."/> <figcaption>
            <p>The most similar cards to Lightning Helix do have similar effects, although &ldquo;Lightning&rdquo; cards dealing damage is a common trope in Magic. <a href="https://gatherer.wizards.com/pages/card/Details.aspx?multiverseid=456806">Warleader&rsquo;s Helix</a> is a direct reference to Lightning Helix.</p>
        </figcaption>
</figure>

<p>Now we can also find similar cards to Lightning Helix but with filters. In this case, let&rsquo;s look for a Sorcery (Sorceries are analogous to Instants but tend to be stronger since they have play limitations) that has Black as one of its colors. This limits the candidates to ~3% of the original dataset. The resulting code would look like this, given a <code>query_embed</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df_filter</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;type&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">&#34;Sorcery&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;manaCost&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">&#34;B&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">embeddings_filter</span> <span class="o">=</span> <span class="n">df_filter</span><span class="p">[</span><span class="s2">&#34;embedding&#34;</span><span class="p">]</span><span class="o">.</span><span class="n">to_numpy</span><span class="p">(</span><span class="n">allow_copy</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">idx</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">fast_dot_product</span><span class="p">(</span><span class="n">query_embed</span><span class="p">,</span> <span class="n">embeddings_filter</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">related_cards</span> <span class="o">=</span> <span class="n">df_filter</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span>
</span></span></code></pre></div><p>As an aside, in polars you can select row subsets of a DataFrame with <code>df[idx]</code>, which makes it infinitely better than pandas and its <code>df.iloc[idx]</code>.</p>
<p>The resulting similar cards:</p>
<figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/helix_2_hu_f6db1b1e0be3033.webp 320w,/2025/02/embeddings-parquet/helix_2_hu_1d74aa59da2a8d38.webp 768w,/2025/02/embeddings-parquet/helix_2.webp 976w" src="helix_2.webp"
         alt="In this case, the similarity focuses on card text similarity, and these cards have near identical text. Smiting Helix is also a direct reference to Lightning Helix."/> <figcaption>
            <p>In this case, the similarity focuses on card text similarity, and these cards have near identical text. <a href="https://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=464058">Smiting Helix</a> is also a direct reference to Lightning Helix.</p>
        </figcaption>
</figure>

<p>Speed-wise, the code runs at about <strong>1.48 ms</strong> on average, or about 37% slower than calculating all dot products, so the filtering does still have some overhead, which is not surprising as the filtered DataFrame does copy the embeddings. Overall, it&rsquo;s still more than fast enough for a hobby project.</p>
<p>I&rsquo;ve created an <a href="https://colab.research.google.com/drive/19C_9sBC0Py2PlXYihl2ed378oGyroONZ?usp=sharing">interactive Colab Notebook</a> where you can generate similarities for any Magic card, and apply any filters you want!</p>
<h2 id="scaling-to-vector-databases">Scaling to Vector Databases</h2>
<p>Again, all of this assumes that you are using the embeddings for smaller/noncommercial projects. If you scale to hundreds of thousands of embeddings, the Parquet and dot product approach for finding similarity should still be fine, but if it&rsquo;s a business-critical application, the marginal costs of querying a vector database are likely lower than the marginal revenue from a snappy similarity lookup. Deciding how to make these tradeoffs is the fun part of MLOps!</p>
<p>In the case that the number of vectors is too large to fit into memory but you don&rsquo;t want to go all-in on vector databases, another option worth considering is an old-fashioned database that can now support vector embeddings. Notably, <a href="https://www.sqlite.org">SQLite</a> databases are just a single portable file, although interacting with them has more technical overhead and considerations than the <code>read_parquet()</code> and <code>write_parquet()</code> of polars. One notable implementation of vector search in SQLite is the <a href="https://alexgarcia.xyz/sqlite-vec/">sqlite-vec extension</a>, which also allows for simultaneous filtering and similarity calculations.</p>
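<p>A rough sketch of what that could look like with sqlite-vec&rsquo;s Python bindings is below; the table name, and the reuse of the <code>embeddings</code> matrix and <code>query_embed</code> vector from earlier, are assumptions for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import sqlite3

import sqlite_vec
from sqlite_vec import serialize_float32

db = sqlite3.connect("cards.db")
db.enable_load_extension(True)
sqlite_vec.load(db)

# One row per embedding; rowid doubles as the index into the metadata.
db.execute("CREATE VIRTUAL TABLE vec_cards USING vec0(embedding float[768])")
for i, emb in enumerate(embeddings):
    db.execute(
        "INSERT INTO vec_cards(rowid, embedding) VALUES (?, ?)",
        (i, serialize_float32(emb)),
    )

rows = db.execute(
    "SELECT rowid, distance FROM vec_cards WHERE embedding MATCH ? ORDER BY distance LIMIT 3",
    (serialize_float32(query_embed),),
).fetchall()
</code></pre></div>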
<p>The next time you&rsquo;re working with embeddings, consider whether you really need a vector database. For many applications, the combination of Parquet files and polars provides everything you need: efficient storage, fast similarity search, and easy metadata filtering. Sometimes the simplest solution is the best one.</p>
<p><em>The code used to process the Magic card data, create the embeddings, and plot the UMAP 2D projection, is all available <a href="https://github.com/minimaxir/mtg-embeddings">in this GitHub repository</a>.</em></p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>I suspect the main bottleneck to widespread Parquet support is Microsoft Excel&rsquo;s and other spreadsheet software&rsquo;s lack of native support for the format. Every data scientist will be very, very happy if/when they do!&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>OpenAI&rsquo;s <a href="https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb">approach</a> using pandas to find colocated similarity is to manually iterate through the entire dataframe, calculate each cosine similarity between the candidate and the query for each row, then sort by scores. That implementation definitely does not scale.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>The Super Effectiveness of Pokémon Embeddings Using Only Raw JSON and Images</title>
      <link>https://minimaxir.com/2024/06/pokemon-embeddings/</link>
      <pubDate>Wed, 26 Jun 2024 10:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2024/06/pokemon-embeddings/</guid>
      <description>Embeddings encourage engineers to go full YOLO because it&amp;rsquo;s actually rewarding to do so!</description>
      <content:encoded><![CDATA[<p><a href="https://en.wikipedia.org/wiki/Word_embedding">Embeddings</a> are one of the most useful but unfortunately underdiscussed concepts in the artificial intelligence space relative to the modern generative AI gigahype. Embeddings are a set of hundreds of numbers which uniquely correspond to a given object that define its dimensionality, nowadays in a multiple of 128 such as 384D, 768D, or even 1536D. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> The larger the embeddings, the more &ldquo;information&rdquo; and distinctiveness each can contain, in theory. These embeddings can be used as-is for traditional <a href="https://en.wikipedia.org/wiki/Regression_analysis">regression</a> and <a href="https://en.wikipedia.org/wiki/Statistical_classification">classification</a> problems with your favorite statistical modeling library, but what&rsquo;s really useful about these embeddings is that if you can find the minimum mathematical distance between a given query embedding and another set of embeddings, you can then find which is the most similar: extremely useful for many real-world use cases such as search.</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/embedding_demo_hu_d3f88226f240b184.webp 320w,/2024/06/pokemon-embeddings/embedding_demo_hu_2c5a706d1ebace78.webp 768w,/2024/06/pokemon-embeddings/embedding_demo_hu_34918a6ff306a4ac.webp 1024w,/2024/06/pokemon-embeddings/embedding_demo.png 1178w" src="embedding_demo.png"
         alt="An example sentence embedding generated using Sentence Transformers: this embedding is 384D."/> <figcaption>
            <p>An example sentence embedding generated using Sentence Transformers: this embedding is 384D.</p>
        </figcaption>
</figure>

<p>Although any kind of object can be represented by an embedding, text is the classical use case for embeddings, popularized with the original <a href="https://en.wikipedia.org/wiki/Word2vec">word2vec</a> <a href="https://arxiv.org/abs/1301.3781">paper</a> which along with <a href="https://arxiv.org/abs/1310.4546">later work</a> showed that word embeddings could be used to calculate relationships such as king - man + woman = queen. You could then, for example, create a sentence embedding by averaging all of its word embeddings. This actually works, although this naive averaging does not take word position and punctuation into account, both of which are critically important in identifying context for a given text.</p>
<p>Deep learning then entered the picture and it was eventually discovered that large language models like <a href="https://en.wikipedia.org/wiki/BERT_%28language_model%29">BERT</a> can return embeddings as an emergent behavior. Unlike the word averaging above, <a href="https://en.wikipedia.org/wiki/Transformer_%28deep_learning_architecture%29">transformers</a>-based LLMs can account for positional relationships more robustly thanks to their <a href="https://en.wikipedia.org/wiki/Attention_%28machine_learning%29">attention mechanisms</a>, and, due to their more advanced model input <a href="https://www.freecodecamp.org/news/how-tokenizers-shape-ai-understanding/">tokenization</a> strategies than just words, can also better incorporate punctuation. One very popular Python library for creating embeddings using LLMs easily is <a href="https://sbert.net">Sentence Transformers</a>, especially with the <a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2">all-MiniLM-L6-v2</a> model (<a href="https://huggingface.co/models?pipeline_tag=sentence-similarity&amp;sort=downloads">30 million downloads monthly</a>!) which balances embedding encoding speed and robustness with its 384D embeddings.</p>
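<p>Generating one of these sentence embeddings takes only a few lines; a quick sketch with an arbitrary input sentence:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("The quick brown fox jumps over the lazy dog.")

print(embedding.shape)  # (384,)
</code></pre></div>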
<p>How well can these embedding models work beyond just normal sentences? Can they encode larger bodies of text into a consistent space? The context length of <code>all-MiniLM-L6-v2</code> is 512 tokens, which can only fit a couple of paragraphs of text, but newer LLMs have much higher context lengths.</p>
<p>I recalled one of my early projects as an aspiring data scientist: creating <a href="https://github.com/minimaxir/pokemon-3d">Pokémon vectors</a> by <a href="https://github.com/minimaxir/pokemon-3d/blob/master/pokemon_spark_pca.ipynb">manually transforming each Pokémon&rsquo;s metadata</a>, such as its base stats, type(s), moves, abilities, and miscellaneous attributes like color, shape, and habitat. After that, I was able to cluster them.</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/cluster_hu_7ca1c5b82a702cc7.webp 320w,/2024/06/pokemon-embeddings/cluster.png 676w" src="cluster.png"
         alt="3D projection of my Pokémon vectors back in 2016: the colors are Pokémon types, and the methodology seemed to favor clustering by them."/> <figcaption>
            <p>3D projection of my Pokémon vectors back in 2016: the colors are Pokémon types, and the methodology seemed to favor clustering by them.</p>
        </figcaption>
</figure>

<p>Those familiar with Pokémon know that&rsquo;s just scratching the surface: there&rsquo;s even richer metadata, such as a Pokémon&rsquo;s Pokédex entries and the exact locations where it can be encountered, both of which tell a lot about a given Pokémon. At the time, there was no efficient LLM to encode all of that extra metadata.</p>
<p>Why not try to encode all Pokémon metadata using a text embedding model and see what happens? Will we be able to identify the most &ldquo;similar&rdquo; Pokémon? What is a &ldquo;similar&rdquo; Pokémon anyways? Can we find the <em>weirdest</em> Pokémon by the most dissimilar? Can we encode other Pokémon data such as images? Let&rsquo;s find out!</p>
<h2 id="how-embeddings-are-generated-using-llms">How Embeddings Are Generated Using LLMs</h2>
<p>First, some relevant technical background on how LLMs can be used to create embeddings, since there&rsquo;s a surprising amount of confusion about how they work beyond the SEO-oriented &ldquo;embeddings are for <a href="https://www.cloudflare.com/learning/ai/what-is-vector-database/">vector databases</a>&rdquo;.</p>
<p>Modern embedding models are commonly trained in one of two ways. The first way is through emergent behavior while training an LLM normally: as LLMs need to determine a latent space before passing the output to a classification head such as <a href="https://en.wikipedia.org/wiki/Generative_pre-trained_transformer">GPT</a>&rsquo;s next-token prediction, taking the last layer (&ldquo;hidden state&rdquo;) of a model and averaging across the positional axis results in an embedding with the same dimensionality as the hidden state. LLMs have to learn how to uniquely represent text in a common latent space, so this approach is natural. The second way is to train a model to output the embeddings directly: in this case, the training process typically uses <a href="https://encord.com/blog/guide-to-contrastive-learning/">contrastive learning</a> to minimize the semantic distance between the generated embeddings of a pair of known similar text documents, and maximize the distance between a dissimilar pair. Both of these techniques can of course be used together: pretrain an LLM on a large body of text, then finetune it with contrastive learning.</p>
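<p>As a minimal sketch of the first approach with Hugging Face transformers (the model choice here is arbitrary), averaging the last hidden state across the positional axis looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("What color is the sky?", return_tensors="pt")
with torch.no_grad():
    # last_hidden_state has shape (batch, seq_len, hidden_dim)
    hidden = model(**inputs).last_hidden_state

# average across the positional (seq_len) axis: one 768D embedding per input
embedding = hidden.mean(dim=1)
</code></pre></div>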
<p>Embedding models get the benefits of all the research invested into improving LLMs for generative AI, such as faster inference and longer context windows. Normally, using those larger context windows requires a quadratic increase in computation (e.g. a 2<em>x</em> increase in input length requires 4<em>x</em> more computation), but thanks to <a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention">FlashAttention</a> and <a href="https://blog.eleuther.ai/rotary-embeddings/">rotary positional embeddings</a>, it&rsquo;s now feasible to train models with massively-large context windows without a massive datacenter and then run those models on consumer hardware.</p>
<p>Since 2022, <a href="https://openai.com">OpenAI</a> has offered the text embedding model <a href="https://openai.com/index/new-and-improved-embedding-model/">text-embedding-ada-002</a> behind a paid API, with the largest available context window at 8,192 tokens: a substantial increase over <code>all-MiniLM-L6-v2</code>&rsquo;s 512 limit, and no open-source model could compete. That is, until February 2024, when <a href="https://www.nomic.ai">Nomic AI</a> released <a href="https://huggingface.co/nomic-ai/nomic-embed-text-v1">nomic-embed-text-v1</a>, a fully open-source embedding model with an 8,192-token context window and a permissive <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache license</a>, and quickly followed up with <a href="https://huggingface.co/nomic-ai/nomic-embed-text-v1.5">nomic-embed-text-v1.5</a>. In academic benchmarks, this free model performed even better than OpenAI&rsquo;s paid embedding model <a href="https://blog.nomic.ai/posts/nomic-embed-text-v1">thanks to its training regimen</a>, which uses both of the embedding model training tricks described above. That, along with its long context window, caused it to become another one of the most downloaded open-source embedding models (~10 million downloads per month).</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/nomic_embeddings_demo_hu_4dadd5c725d77e60.webp 320w,/2024/06/pokemon-embeddings/nomic_embeddings_demo_hu_6682e6dc986b6850.webp 768w,/2024/06/pokemon-embeddings/nomic_embeddings_demo_hu_1c6ffa82964e1090.webp 1024w,/2024/06/pokemon-embeddings/nomic_embeddings_demo.png 1340w" src="nomic_embeddings_demo.png"
         alt="A sentence embedding generated using nomic-embed-text-v1.5 adapted from the official example: this is a lower-level interface than Sentence Transformers (Hugging Face transformers and PyTorch) but is more clear as to what is going on. mean_pooling() uses an atypical attention-masked averaging that is theoretically better for small inputs than averaging the entire last hidden state."/> <figcaption>
            <p>A sentence embedding generated using <code>nomic-embed-text-v1.5</code> adapted from the official example: this is a lower-level interface than Sentence Transformers (<a href="https://huggingface.co/docs/transformers/en/index">Hugging Face transformers</a> and <a href="https://pytorch.org">PyTorch</a>) but is more clear as to what is going on. <code>mean_pooling()</code> uses an atypical attention-masked averaging that is theoretically better for small inputs than averaging the entire last hidden state.</p>
        </figcaption>
</figure>

<p>The <code>F.normalize()</code> function is a popular pipeline step for finding similar embeddings efficiently. <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> A <a href="https://en.wikipedia.org/wiki/Unit_vector">unit-normalized</a> vector has a vector length (Euclidean norm) of 1. If you perform a matrix multiplication (an extremely fast computational operation) of a normalized vector against a matrix of normalized vectors, the result is the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> of the query against each row, constrained between 1 for identical matches and -1 for the most dissimilar matches.</p>
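<p>In code, the whole trick is a normalize-then-matrix-multiply (a sketch with random vectors standing in for real embeddings):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import torch
import torch.nn.functional as F

query = F.normalize(torch.randn(768), dim=0)         # one unit-normalized query vector
corpus = F.normalize(torch.randn(1000, 768), dim=1)  # 1,000 unit-normalized embeddings

cos_sims = corpus @ query           # (1000,) cosine similarities, each in [-1, 1]
values, indices = cos_sims.topk(3)  # the 3 most similar embeddings
</code></pre></div>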
<p>Now that we have thoroughly covered how embeddings work, let&rsquo;s see if we can put that 8,192 context window to the test.</p>
<h2 id="what-kind-of-pokémon-embedding-are-you">What Kind of Pokémon Embedding Are You?</h2>
<p>Before encoding Pokémon data, I need to first get Pokémon data, but where? Nintendo certainly won&rsquo;t have an API for Pokémon data, and web scraping a Pokémon wiki such as <a href="https://bulbapedia.bulbagarden.net/wiki/Main_Page">Bulbapedia</a> is both impractical and rude. Fortunately, there&rsquo;s an unofficial Pokémon API known appropriately as <a href="https://pokeapi.co">PokéAPI</a>, which is both open source and has been around for years without Nintendo taking them down. Of note, PokéAPI has a <a href="https://beta.pokeapi.co/graphql/console/">GraphQL interface</a> to its Pokémon data, allowing you to query exactly what you want without having to do relationship mapping or data joins.</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/graphql_hu_545910786cbd7bf6.webp 320w,/2024/06/pokemon-embeddings/graphql_hu_5d76d10c482e4154.webp 768w,/2024/06/pokemon-embeddings/graphql_hu_d03bd1849088a102.webp 1024w,/2024/06/pokemon-embeddings/graphql.png 1260w" src="graphql.png"
         alt="A simple GraphQL query to get all Pokémon IDs and names, sorted by ID."/> <figcaption>
            <p>A simple GraphQL query to get all Pokémon IDs and names, sorted by ID.</p>
        </figcaption>
</figure>

<p>Since we can get Pokémon data in a nicely structured JSON dictionary, why not keep it that way? After writing a <a href="https://github.com/minimaxir/pokemon-embeddings/blob/main/query.gql">massive GraphQL query</a> to specify all mechanically relevant Pokémon data, all it takes is a single GET request to download it all, about 16MB of data total. This includes over 1,000 Pokémon up to the Scarlet/Violet <em>The Hidden Treasure of Area Zero</em> DLC: 1,302 Pokémon total if you include the special forms of Pokémon (e.g. <a href="https://bulbapedia.bulbagarden.net/wiki/Mega_Evolution">Mega Evolutions</a>), which I&rsquo;m excluding for simplicity.</p>
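<p>For instance, the simple ID-and-name query from the figure above takes only a few lines of Python (a sketch: the endpoint is PokéAPI&rsquo;s public GraphQL endpoint, and a POST is used here since GraphQL servers conventionally accept it):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import requests

query = """
query {
  pokemon_v2_pokemon(order_by: {id: asc}) {
    id
    name
  }
}
"""

resp = requests.post("https://beta.pokeapi.co/graphql/v1beta",
                     json={"query": query})
pokemon = resp.json()["data"]["pokemon_v2_pokemon"]
</code></pre></div>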
<p>As an example, let&rsquo;s start with the franchise mascot, <a href="https://bulbapedia.bulbagarden.net/wiki/Pikachu_%28Pok%C3%A9mon%29">Pikachu</a>.</p>
<figure class="align-center ">

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/025_hu_94d9a33563b85a33.webp 320w,/2024/06/pokemon-embeddings/025.png 475w" src="025.png#center"
         alt="The iconic Pokémon #25. via Nintendo" width="300" height="300"/> <figcaption>
            <p>The iconic Pokémon #25. <a href="https://www.pokemon.com/us/pokedex/pikachu">via Nintendo</a></p>
        </figcaption>
</figure>

<p>Here&rsquo;s a subset of Pikachu&rsquo;s <a href="https://github.com/minimaxir/pokemon-embeddings/blob/main/pikachu_example_raw.json">JSON metadata</a> from that query:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="mi">25</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;pikachu&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;height&#34;</span><span class="p">:</span> <span class="mi">4</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;weight&#34;</span><span class="p">:</span> <span class="mi">60</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;base_experience&#34;</span><span class="p">:</span> <span class="mi">112</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;pokemon_v2_pokemontypes&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;pokemon_v2_type&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;electric&#34;</span>
</span></span><span class="line"><span class="cl">      <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;pokemon_v2_pokemonstats&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;pokemon_v2_stat&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;hp&#34;</span>
</span></span><span class="line"><span class="cl">      <span class="p">},</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;base_stat&#34;</span><span class="p">:</span> <span class="mi">35</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="err">...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;pokemon_v2_pokemonspecy&#34;</span><span class="err">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;base_happiness&#34;</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;capture_rate&#34;</span><span class="p">:</span> <span class="mi">190</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;forms_switchable&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;gender_rate&#34;</span><span class="p">:</span> <span class="mi">4</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;has_gender_differences&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;hatch_counter&#34;</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;is_baby&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;is_legendary&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;is_mythical&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;pokemon_v2_pokemonspeciesflavortexts&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;pokemon_v2_version&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">          <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;red&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="p">},</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;flavor_text&#34;</span><span class="p">:</span> <span class="s2">&#34;When several of\nthese POK\u00e9MON\ngather, their\felectricity could\nbuild and cause\nlightning storms.&#34;</span>
</span></span><span class="line"><span class="cl">      <span class="p">},</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="err">...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;pokemon_v2_pokemonmoves&#34;</span><span class="err">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;pokemon_v2_move&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">          <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;mega-punch&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">          <span class="nt">&#34;pokemon_v2_type&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;normal&#34;</span>
</span></span><span class="line"><span class="cl">          <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">      <span class="p">},</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="err">...</span>
</span></span></code></pre></div><p>There&rsquo;s definitely no shortage of Pikachu data! Some of the formatting is redundant, though: most of the JSON keys have a <code>pokemon_v2_</code> prefix that conveys no additional semantic information, and we can minify the JSON to remove all the whitespace. We won&rsquo;t experiment with more rigorous preprocessing: after all, I only need to optimize an ETL workflow if it <em>doesn&rsquo;t</em> work, right?</p>
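<p>That light preprocessing can be a small recursive function (a sketch; <code>pokemon_json</code> is the parsed dictionary for a single Pokémon):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import json

def strip_prefix(obj):
    # recursively drop the redundant "pokemon_v2_" prefix from all keys
    if isinstance(obj, dict):
        return {k.replace("pokemon_v2_", ""): strip_prefix(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [strip_prefix(v) for v in obj]
    return obj

# minify: no indentation, no spaces after separators
doc = json.dumps(strip_prefix(pokemon_json), separators=(",", ":"))
</code></pre></div>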
<p>Since JSON data is so prevalent across the internet, it&rsquo;s extremely likely that a newly trained LLM will be sensitive to its schema and be able to understand it better. However, JSON is a token-inefficient encoding format, made even worse in this case by the particular choice of tokenizer. Here&rsquo;s the distribution of the encoded texts after the optimizations above, using <code>nomic-embed-text-v1.5</code>&rsquo;s text tokenizer, which is incidentally the same <a href="https://huggingface.co/google-bert/bert-base-uncased">bert-base-uncased</a> tokenizer used for BERT back in 2018:</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/token_counts_hu_400e5e984e326eb1.webp 320w,/2024/06/pokemon-embeddings/token_counts_hu_cf5bcc5547d45eb.webp 768w,/2024/06/pokemon-embeddings/token_counts_hu_e4d28e56e2dc7bc9.webp 1024w,/2024/06/pokemon-embeddings/token_counts.png 1200w" src="token_counts.png"/> 
</figure>

<p>The 8,192-token context length of <code>nomic-embed-text-v1.5</code> is perfect for fitting almost all Pokémon! But the median token count is 3,781 tokens, which is still somewhat high. The reason is the tokenizer: bert-base-uncased is a <a href="https://huggingface.co/learn/nlp-course/en/chapter6/6">WordPiece</a> tokenizer which is optimized for words and their common prefixes and suffixes, while JSON data is highly structured. If you use a more modern tokenizer which utilizes <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding">byte pair encoding</a> (BPE), such as the <code>o200k_base</code> tokenizer which powers OpenAI&rsquo;s <a href="https://openai.com/index/hello-gpt-4o/">GPT-4o</a>, then the median token count is 2,010 tokens: nearly half the size, and therefore much faster to embed.</p>
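<p>You can check the token counts yourself (a sketch; <code>doc</code> is the minified JSON string from earlier, and tiktoken provides the <code>o200k_base</code> encoding):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import tiktoken
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
o200k = tiktoken.get_encoding("o200k_base")

n_wordpiece = len(bert_tokenizer(doc)["input_ids"])
n_bpe = len(o200k.encode(doc))
print(n_wordpiece, n_bpe)  # the BPE count should be roughly half
</code></pre></div>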
<p>After that, I <a href="https://github.com/minimaxir/pokemon-embeddings/blob/main/pokemon_embeddings.ipynb">encoded all the Pokémon metadata</a> into a 768D text embedding for each and every Pokémon, including unit normalization. Due to the quadratic scaling at high input token counts, this is still very computationally intensive despite the optimization tricks: for the 1,302 embeddings, it took about a half-hour on a <a href="https://colab.research.google.com">Google Colab</a> T4 GPU. The embeddings are then saved to disk in <a href="https://parquet.apache.org">parquet</a>, a tabular format which supports nesting sequences of floats natively (don&rsquo;t use a CSV to store embeddings!). The embedding generation is the hard part; now it&rsquo;s time for the fun part!</p>
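<p>For reference, the encoding step can be only a few lines if you use Sentence Transformers instead of the lower-level interface shown earlier (a sketch: <code>docs</code> and <code>ids</code> are the minified JSON strings and Pokémon IDs, and the <code>search_document</code> prefix and <code>trust_remote_code</code> flag are per the model card):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5",
                            trust_remote_code=True)
embeddings = model.encode(["search_document: " + d for d in docs],
                          normalize_embeddings=True,
                          show_progress_bar=True)

# parquet supports nested sequences of floats natively
df = pd.DataFrame({"id": ids, "embedding": list(embeddings)})
df.to_parquet("pokemon_embeddings.parquet")
</code></pre></div>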
<p>Let&rsquo;s start off with Pikachu. What Pokémon is Pikachu most similar to, i.e. has the highest cosine similarity? Remember, since all the embeddings are normalized, we can get all the cosine similarities by matrix multiplying the Pikachu embedding against all the other embeddings. Let&rsquo;s include the top 3 from each of the franchise&rsquo;s nine (!) generations to date:</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/25_similar_text_hu_d3abb73fceff08ae.webp 320w,/2024/06/pokemon-embeddings/25_similar_text_hu_7f6bea07a378c8ca.webp 768w,/2024/06/pokemon-embeddings/25_similar_text_hu_13958884c73294ad.webp 1024w,/2024/06/pokemon-embeddings/25_similar_text.png 1500w" src="25_similar_text.png"/> 
</figure>

<p>These results are better than I expected! Each generation has a &ldquo;<a href="https://bulbapedia.bulbagarden.net/wiki/Electric_rodents">Pikaclone</a>&rdquo; of a weak Electric-type rodent Pokémon, and this similarity calculation found most of them. I&rsquo;m not sure what <a href="https://bulbapedia.bulbagarden.net/wiki/Phantump_%28Pok%C3%A9mon%29">Phantump</a> and <a href="https://bulbapedia.bulbagarden.net/wiki/Trevenant_%28Pok%C3%A9mon%29">Trevenant</a> are doing under Gen VI though: they&rsquo;re Ghost/Grass Pokémon.</p>
<p>Here&rsquo;s a few more interesting Pokémon comparisons:</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/157_similar_text_hu_3d346fcf9518b458.webp 320w,/2024/06/pokemon-embeddings/157_similar_text_hu_4ee5feef47d7753c.webp 768w,/2024/06/pokemon-embeddings/157_similar_text_hu_2922bbc71f9c3c31.webp 1024w,/2024/06/pokemon-embeddings/157_similar_text.png 1500w" src="157_similar_text.png"
         alt="Typhlosion is the final evolution of the Gen II Fire starter Pokémon: it has a high similarity with atleast one of every generation&rsquo;s Fire starter Pokémon lineages."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Typhlosion_%28Pok%C3%A9mon%29">Typhlosion</a> is the final evolution of the Gen II Fire starter Pokémon: it has a high similarity with atleast one of every generation&rsquo;s Fire starter Pokémon lineages.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/144_similar_text_hu_f33e01b9e6fd18de.webp 320w,/2024/06/pokemon-embeddings/144_similar_text_hu_6901b401ed8d948b.webp 768w,/2024/06/pokemon-embeddings/144_similar_text_hu_c3d530c15d378e14.webp 1024w,/2024/06/pokemon-embeddings/144_similar_text.png 1500w" src="144_similar_text.png"
         alt="Articuno, a Legendary Ice/Flying Pokémon, has high similarity with Legendary, Ice, and Flying Pokémon, plus all combinations therein."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Articuno_%28Pok%C3%A9mon%29">Articuno</a>, a Legendary Ice/Flying Pokémon, has high similarity with Legendary, Ice, and Flying Pokémon, plus all combinations therein.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/151_similar_text_hu_d787c257f6d1c1fc.webp 320w,/2024/06/pokemon-embeddings/151_similar_text_hu_14f77fe5dcb615a2.webp 768w,/2024/06/pokemon-embeddings/151_similar_text_hu_ee9cf4523b03c9ca.webp 1024w,/2024/06/pokemon-embeddings/151_similar_text.png 1500w" src="151_similar_text.png"
         alt="Mew, the infamous legendary from the original games has the gimmick of being able to learn every move, has the most amount of metadata by far: appropriately it has poor similarity with others, although similarity with Arceus from Gen IV, the Pokémon equivalent of God with a similar gimmick."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Mew_%28Pok%C3%A9mon%29">Mew</a>, the infamous legendary from the original games has the gimmick of being able to learn every move, has the most amount of metadata by far: appropriately it has poor similarity with others, although similarity with <a href="https://bulbapedia.bulbagarden.net/wiki/Arceus_%28Pok%C3%A9mon%29">Arceus</a> from Gen IV, the Pokémon equivalent of God with a similar gimmick.</p>
        </figcaption>
</figure>

<p>You may have noticed the numerical cosine similarity of all these Pokémon is very high: if a similarity of 1 indicates an identical match, does a high value imply that a Pokémon is super similar? It&rsquo;s likely that the similarities are high because the input is all in the same JSON formatting, whereas the core <code>nomic-embed-text-v1.5</code> model was trained on a variety of text styles. Another potential cause is a &ldquo;cheat&rdquo; I did for simplicity: the <code>nomic-embed-text-v1.5</code> documentation says that a <code>search_document</code> prefix is required for encoding the base input documents and a <code>search_query</code> prefix is required for the comparison vector; in my testing it doesn&rsquo;t affect the similarity much, if at all. In practice, the absolute value of cosine similarity doesn&rsquo;t matter if you&rsquo;re just selecting the objects with the highest similarity anyways.</p>
<p>What if we just plot <em>every possible combination</em> of Pokémon cosine similarities? With 1,000+ Pokémon, that&rsquo;s over 1 million combinations. Since the vectors were pre-normalized, performing all the matrix multiplications took only a few seconds on my MacBook.</p>
<p>Here&rsquo;s the result of plotting 1 million points on a single chart!</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/pokemon_cossim_text_hu_4e946cbcf5592ceb.webp 320w,/2024/06/pokemon-embeddings/pokemon_cossim_text_hu_162006d7e7cb517a.webp 768w,/2024/06/pokemon-embeddings/pokemon_cossim_text_hu_adde71533c7b8122.webp 1024w,/2024/06/pokemon-embeddings/pokemon_cossim_text.png 1200w" src="pokemon_cossim_text.png"/> 
</figure>

<p>Although it looks more like a quilt, a few things jump out. One curious case is the &ldquo;square&rdquo; of lighter Gen VIII and Gen IX in the upper right corner: it appears those two generations have lower similarity with the others, and the similarity worsens the further back you go toward Gen I. Those two generations are the Nintendo Switch games (Sword/Shield/Scarlet/Violet), for which PokéAPI explicitly notes it has worse data. Also, there are rows of a low-similarity blue, such as one just before Gen II: who&rsquo;s that Pokémon? Quickly checking the Pokémon with the lowest median similarity by generation:</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/pokemon_dissimilar_text_hu_6a8915531cedabd3.webp 320w,/2024/06/pokemon-embeddings/pokemon_dissimilar_text_hu_ed6bebce86bed918.webp 768w,/2024/06/pokemon-embeddings/pokemon_dissimilar_text_hu_539044188fdd35ba.webp 1024w,/2024/06/pokemon-embeddings/pokemon_dissimilar_text.png 1500w" src="pokemon_dissimilar_text.png"/> 
</figure>

<p>The mystery Pokémon is <a href="https://bulbapedia.bulbagarden.net/wiki/Magikarp_%28Pok%C3%A9mon%29">Magikarp</a>, unsurprisingly, with its <em>extremely</em> limited movepool. Most of these Pokémon have forced gimmick movesets, especially <a href="https://bulbapedia.bulbagarden.net/wiki/Unown_%28Pok%C3%A9mon%29">Unown</a>, <a href="https://bulbapedia.bulbagarden.net/wiki/Smeargle_%28Pok%C3%A9mon%29">Smeargle</a>, and <a href="https://bulbapedia.bulbagarden.net/wiki/Wobbuffet_%28Pok%C3%A9mon%29">Wobbuffet</a>, so it makes sense the metadata treats them as dissimilar to most others. Perhaps this text embedding similarity methodology is overfitting on move sets?</p>
<p>Overall, there&rsquo;s definitely some signal with these text embeddings. How else can we identify interesting Pokémon relationships?</p>
<h2 id="pokémon-snap">Pokémon Snap</h2>
<p>We&rsquo;ve only been working with text embeddings, but what about other types of embeddings, such as image embeddings? Image embeddings using <a href="https://en.wikipedia.org/wiki/Vision_transformer">vision transformer</a> models are generated roughly the same way as the text embeddings above, by manipulating the last hidden state and optionally normalizing it. The inputs to the model are square image patches encoded as &ldquo;tokens&rdquo;: only a few hundred processed patches are ever used as inputs, so generating image embeddings is much faster than generating the text embeddings.</p>
<p>A couple years ago I hacked together a Python package named <a href="https://github.com/minimaxir/imgbeddings/tree/main">imgbeddings</a> which uses OpenAI&rsquo;s <a href="https://openai.com/index/clip/">CLIP</a> to generate the embeddings, albeit with <a href="https://x.com/minimaxir/status/1507166313281585164">mixed results</a>. Recently, Nomic also released a new model, <a href="https://huggingface.co/nomic-ai/nomic-embed-vision-v1.5">nomic-embed-vision-v1.5</a>, which generates image embeddings with better benchmark performance than CLIP. What&rsquo;s notable about these embeddings is that they are aligned with the ones from <code>nomic-embed-text-v1.5</code>, which allows matching text similarity with images or <em>vice versa</em> and enables <a href="https://cloud.google.com/use-cases/multimodal-ai?hl=en">multimodal applications</a>.</p>
<p>But for now, do image embeddings derived from Pokémon images show the same kinds of similarity traits? PokéAPI fortunately has the official artwork for each Pokémon, so I <a href="https://github.com/minimaxir/pokemon-embeddings/blob/main/pokemon_images.ipynb">downloaded them</a>, composited them onto a white background, and resized them all to 224x224 for apples-to-apples comparisons. We expect high cosine similarities since, as with the text embeddings, the &ldquo;style&rdquo; of all the images is the same. Let&rsquo;s plot the similarities of all Pokémon, by their images only.</p>
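<p>(First, the compositing step, which is a couple of lines with Pillow; the file names in this sketch are hypothetical:)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from PIL import Image

art = Image.open("025.png").convert("RGBA")
background = Image.new("RGBA", art.size, (255, 255, 255, 255))

# flatten the transparent artwork onto white, then resize for the vision model
processed = Image.alpha_composite(background, art).convert("RGB")
processed = processed.resize((224, 224))
processed.save("025_processed.png")
</code></pre></div>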
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/pokemon_cossim_image_hu_1c224446fd2d33ac.webp 320w,/2024/06/pokemon-embeddings/pokemon_cossim_image_hu_4f63c7e6b643d270.webp 768w,/2024/06/pokemon-embeddings/pokemon_cossim_image_hu_69bf5e8b39390f66.webp 1024w,/2024/06/pokemon-embeddings/pokemon_cossim_image.png 1200w" src="pokemon_cossim_image.png"/> 
</figure>

<p>Unfortunately, no patterns jump out this time. All the image similarity values are even higher than the text similarity values, although that&rsquo;s not a big deal since we are looking at the most similar matches. How does Pikachu&rsquo;s famous official artwork compare with other Pokémon?</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/25_similar_image_hu_38ac0e401d65ee2c.webp 320w,/2024/06/pokemon-embeddings/25_similar_image_hu_7046443de47b4ee1.webp 768w,/2024/06/pokemon-embeddings/25_similar_image_hu_ffe3902d02dae773.webp 1024w,/2024/06/pokemon-embeddings/25_similar_image.png 1500w" src="25_similar_image.png"/> 
</figure>

<p>Pikachu&rsquo;s most similar Pokémon by image aren&rsquo;t just mouse Pokémon as I thought they would be; the pattern is less clear, appearing to favor Pokémon with four limbs (although Pikachu&rsquo;s image has a strong similarity with Gen VII&rsquo;s <a href="https://bulbapedia.bulbagarden.net/wiki/Mimikyu_%28Pok%C3%A9mon%29">Mimikyu</a>&rsquo;s image, which is hilarious since that particular Pokémon&rsquo;s gimmick is intentionally trying to look like Pikachu).</p>
<p>After testing a few more Pokémon, it turns out that this image embedding model does respond to visual primitives, which has its uses.</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/18_similar_image_hu_1612a7ba17e5ae91.webp 320w,/2024/06/pokemon-embeddings/18_similar_image_hu_67025db2f5aeaf75.webp 768w,/2024/06/pokemon-embeddings/18_similar_image_hu_4805d93d4c98cabf.webp 1024w,/2024/06/pokemon-embeddings/18_similar_image.png 1500w" src="18_similar_image.png"
         alt="Pidgeot is a bird, and it matches all other birds. Birds would definitely be in an image training dataset."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Pidgeot_%28Pok%C3%A9mon%29">Pidgeot</a> is a bird, and it matches all other birds. Birds would definitely be in an image training dataset.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/101_similar_image_hu_c171b79453ea948b.webp 320w,/2024/06/pokemon-embeddings/101_similar_image_hu_50e3b83533c12d84.webp 768w,/2024/06/pokemon-embeddings/101_similar_image_hu_f464f67fcf469df6.webp 1024w,/2024/06/pokemon-embeddings/101_similar_image.png 1500w" src="101_similar_image.png"
         alt="Electrode is a ball, and the embeddings found similarly rotund Pokémon."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Electrode_%28Pok%C3%A9mon%29">Electrode</a> is a ball, and the embeddings found similarly rotund Pokémon.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/230_similar_image_hu_2f5b44f49ac99a85.webp 320w,/2024/06/pokemon-embeddings/230_similar_image_hu_bf0bb9446330d79e.webp 768w,/2024/06/pokemon-embeddings/230_similar_image_hu_77535585356eec73.webp 1024w,/2024/06/pokemon-embeddings/230_similar_image.png 1500w" src="230_similar_image.png"
         alt="Kingdra apparently is similar to other blue Pokémon."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Kingdra_%28Pok%C3%A9mon%29">Kingdra</a> apparently is similar to other blue Pokémon.</p>
        </figcaption>
</figure>

<p>Both text and image embedding approaches have their own style. But are there ways to combine them?</p>
<h2 id="chat-with-your-pokédex">Chat With Your Pokédex</h2>
<p>Earlier I alluded to aligning text and image embeddings in a more multimodal manner. Since <code>nomic-embed-vision-v1.5</code> was conditioned on <code>nomic-embed-text-v1.5</code> outputs, you are able to compute the cosine similarities between the image embeddings and text embeddings! However, it&rsquo;s not as robust: the cosine similarities between objects of the two modalities tend to be very low, at about 0.10 in the best-case scenario. Again, if all we&rsquo;re looking at is the highest similarity, then that&rsquo;s fine.</p>
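<p>Mechanically, it&rsquo;s the same normalize-and-matrix-multiply as before, just across modalities (a sketch: <code>image_embs</code> is a unit-normalized matrix of <code>nomic-embed-vision-v1.5</code> embeddings, <code>model</code> is the text model from the earlier sketch, and the query is arbitrary):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np

query = model.encode("search_query: a yellow electric mouse",
                     normalize_embeddings=True)

sims = image_embs @ query           # text-to-image cosine similarities
top_3 = np.argsort(sims)[::-1][:3]  # indices of the best image matches
</code></pre></div>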
<p>The most common use case for multimodal reasoning is asking questions (to be converted to a text embedding) and comparing it with a set of image embeddings. Let&rsquo;s try it with Pokémon by <a href="https://github.com/minimaxir/pokemon-embeddings/blob/main/pokemon_multimodal_qa.ipynb">asking it a leading question</a> for testing: what looks like an ice cream cone?</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/qa_1_hu_8e3221047d889ca2.webp 320w,/2024/06/pokemon-embeddings/qa_1_hu_533a6f2fd94474e4.webp 768w,/2024/06/pokemon-embeddings/qa_1_hu_d0d5978a89a664ee.webp 1024w,/2024/06/pokemon-embeddings/qa_1.png 1050w" src="qa_1.png"/> 
</figure>

<p>Surprisingly, it got the result correct with <a href="https://bulbapedia.bulbagarden.net/wiki/Vanillish_%28Pok%C3%A9mon%29">Vanillish</a>, along with other &ldquo;cream&rdquo; and &ldquo;ice&rdquo; Pokémon. Not sure why <a href="https://bulbapedia.bulbagarden.net/wiki/Metapod_%28Pok%C3%A9mon%29">Metapod</a> is there, though.</p>
<p>A few more Qs and As:</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/qa_2_hu_3233384b59e2a4e2.webp 320w,/2024/06/pokemon-embeddings/qa_2_hu_c6cfa2f4a81738b6.webp 768w,/2024/06/pokemon-embeddings/qa_2_hu_bbbc1fa0f0144e15.webp 1024w,/2024/06/pokemon-embeddings/qa_2.png 1050w" src="qa_2.png"
         alt="The model did identify some cats, but only Torracat is orange."/> <figcaption>
            <p>The model did identify some cats, but only <a href="https://bulbapedia.bulbagarden.net/wiki/Torracat_%28Pok%C3%A9mon%29">Torracat</a> is orange.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/qa_3_hu_42fd3039bd4f26b4.webp 320w,/2024/06/pokemon-embeddings/qa_3_hu_bd8dc72563061a30.webp 768w,/2024/06/pokemon-embeddings/qa_3_hu_8e404e3e0bd8e8d0.webp 1024w,/2024/06/pokemon-embeddings/qa_3.png 1050w" src="qa_3.png"
         alt="Unown definitely fits the bill with a very prominent one-eye and higher similarity."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Unown_%28Pok%C3%A9mon%29">Unown</a> definitely fits the bill with a very prominent one-eye and higher similarity.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/qa_4_hu_5b0b318502699c9c.webp 320w,/2024/06/pokemon-embeddings/qa_4_hu_822df483a3e6842a.webp 768w,/2024/06/pokemon-embeddings/qa_4_hu_822dde97c1b28626.webp 1024w,/2024/06/pokemon-embeddings/qa_4.png 1050w" src="qa_4.png"
         alt="A Pokémon with the name &ldquo;Cutiefly&rdquo; being the most similar to the question is a funny coincidence."/> <figcaption>
            <p>A Pokémon with the name &ldquo;<a href="https://bulbapedia.bulbagarden.net/wiki/Cutiefly_%28Pok%C3%A9mon%29">Cutiefly</a>&rdquo; being the most similar to the question is a funny coincidence.</p>
        </figcaption>
</figure>

<p>The relationship between text and Pokémon images with these models is not perfect, but it&rsquo;s honestly much better than I expected!</p>
<h2 id="2da-master">2D.A Master</h2>
<p>Lastly, there are many ways to find signal among the high-dimensional noise, which may resolve some of the counterintuitive relationships we saw earlier. One popular method is dimensionality reduction to shrink the embedding: a popular target is 2D for easy data visualization, and I am definitely in favor of data visualization! The classical statistical approach is <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">principal component analysis</a> (PCA), which identifies the most &ldquo;important&rdquo; aspects of a matrix, but a more modern approach is <a href="https://umap-learn.readthedocs.io/en/latest/">uniform manifold approximation &amp; projection</a> (UMAP), which trains a projection that accounts for how data points relate to all other data points to <a href="https://umap-learn.readthedocs.io/en/latest/how_umap_works.html">find its underlying structure</a>. In theory, the reduction should allow the embeddings to generalize better.</p>
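<p>With the umap-learn package, the reduction itself is only a couple of lines (a sketch with default hyperparameters, which the projection below may not use; <code>text_embs</code> and <code>image_embs</code> are the matrices from earlier, and the Pikachu row index is hypothetical):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np
import umap

combined = np.hstack([text_embs, image_embs])  # (n_pokemon, 1536)
projection = umap.UMAP(n_components=2).fit_transform(combined)

# most similar = lowest Euclidean distance in the projected 2D space
pikachu_2d = projection[24]
distances = np.linalg.norm(projection - pikachu_2d, axis=1)
closest = np.argsort(distances)[1:4]  # index 0 is Pikachu itself
</code></pre></div>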
<p>For the Pokémon embeddings, we can take the opportunity to allow the model to account for both the text and image embeddings, and their potential interactions therein. Therefore, I concatenated the text and image embeddings for each Pokémon (a 1536D embedding total), and trained a UMAP to project it down to 2D. Now we can visualize it!</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/pokemon_umap_hu_5c319611f462c24.webp 320w,/2024/06/pokemon-embeddings/pokemon_umap_hu_2a8289708e6aeb22.webp 768w,/2024/06/pokemon-embeddings/pokemon_umap_hu_67d3c2c6bda1487c.webp 1024w,/2024/06/pokemon-embeddings/pokemon_umap.png 1200w" src="pokemon_umap.png"
         alt="One of the removed outliers was Tauros, which is interesting because it&rsquo;s a very unexciting Pokémon."/> <figcaption>
            <p>One of the removed outliers was <a href="https://bulbapedia.bulbagarden.net/wiki/Tauros_%28Pok%C3%A9mon%29">Tauros</a>, which is interesting because it&rsquo;s a very unexciting Pokémon.</p>
        </figcaption>
</figure>

<p>Unfortunately, plotting each Pokémon image onto a single chart would be difficult to view, but from this chart we can see that instead of organizing by Pokémon type like my 2016 approach did, this approach organizes much more by generation: the earlier generations vs. the later generations. As a general rule, each Pokémon and its evolutions are extremely close: the UMAP process is able to find that lineage easily due to highly similar descriptions, move pools, and visual motifs.</p>
<p>As with the cosine similarities, we can now find the most similar Pokémon, this time seeing which points have the <strong>lowest</strong> <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a> (0.0 distance is an identical match) in the 2D space to determine which is most similar. How does Pikachu fare now?</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/25_similar_umap_hu_c246697f3bd129ed.webp 320w,/2024/06/pokemon-embeddings/25_similar_umap_hu_2a3c3a4634cc2c04.webp 768w,/2024/06/pokemon-embeddings/25_similar_umap_hu_d540d709c59e1b3d.webp 1024w,/2024/06/pokemon-embeddings/25_similar_umap.png 1500w" src="25_similar_umap.png"/> 
</figure>

<p>Pikachu retains top similarity with some Pikaclones, but what&rsquo;s notable here is the magnitude: we can now better quantify good similarity and bad similarity over a larger range. In this case, many of the Pokémon at distance &gt;1.0 clearly do not resemble an Electric rodent.</p>
<p>How about some other Pokémon?</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/129_similar_umap_hu_938f7320031c2ed5.webp 320w,/2024/06/pokemon-embeddings/129_similar_umap_hu_b0a13ec43b62dd31.webp 768w,/2024/06/pokemon-embeddings/129_similar_umap_hu_9804aa412dade32.webp 1024w,/2024/06/pokemon-embeddings/129_similar_umap.png 1500w" src="129_similar_umap.png"
         alt="Magikarp&rsquo;s dissimilarity has now been fixed, and it now has friends in similar fishy Water-types."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Magikarp_%28Pok%C3%A9mon%29">Magikarp</a>&rsquo;s dissimilarity has now been fixed, and it now has friends in similar fishy Water-types.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/122_similar_umap_hu_a02cdca1de8e208b.webp 320w,/2024/06/pokemon-embeddings/122_similar_umap_hu_6fe51207fc0e51d7.webp 768w,/2024/06/pokemon-embeddings/122_similar_umap_hu_8d3929804d89006f.webp 1024w,/2024/06/pokemon-embeddings/122_similar_umap.png 1500w" src="122_similar_umap.png"
         alt="Mr. Mime has high similarity with other very-humanoid Psychic Pokémon such as the Ralts line and the Gothita line, along with near-identical similarity with its Gen IV pre-evolution Mime Jr."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Mr._Mime_%28Pok%C3%A9mon%29">Mr. Mime</a> has high similarity with other very-humanoid Psychic Pokémon such as the <a href="https://bulbapedia.bulbagarden.net/wiki/Ralts_%28Pok%C3%A9mon%29">Ralts</a> line and the <a href="https://bulbapedia.bulbagarden.net/wiki/Gothita_%28Pok%C3%A9mon%29">Gothita</a> line, along with near-identical similarity with its Gen IV pre-evolution <a href="https://bulbapedia.bulbagarden.net/wiki/Mime_Jr._%28Pok%C3%A9mon%29">Mime Jr</a>.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/12_similar_umap_hu_d79cec0bbd2548dc.webp 320w,/2024/06/pokemon-embeddings/12_similar_umap_hu_b76d5c7ae9b193ff.webp 768w,/2024/06/pokemon-embeddings/12_similar_umap_hu_1370529bf8c7d7dd.webp 1024w,/2024/06/pokemon-embeddings/12_similar_umap.png 1500w" src="12_similar_umap.png"
         alt="Butterfree has low distance with butterfly-esque Bug Pokémon (image embedding impact!) and higher distance with other type of Bugs."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Butterfree_%28Pok%C3%A9mon%29">Butterfree</a> has low distance with butterfly-esque Bug Pokémon (image embedding impact!) and higher distance with other type of Bugs.</p>
        </figcaption>
</figure>

<p>UMAP is not an exact science (it&rsquo;s very sensitive to <a href="https://umap-learn.readthedocs.io/en/latest/parameters.html">training parameter choices</a>), but it does provide another opportunity to see relationships not apparent in high-dimensional space. The low similarities with Gen VIII and Gen IX are concerning: I suspect the UMAP fitting process amplified whatever issue is present with the data for those generations.</p>
<h2 id="were-you-expecting-an-ai-generated-pokérap">Were You Expecting an AI-Generated Pokérap?</h2>
<p>In all, this was a successful exploration of Pokémon data: even though it&rsquo;s not perfect, the failures are also interesting. Embeddings encourage engineers to go full <a href="https://www.dictionary.com/browse/yolo">YOLO</a> because it&rsquo;s actually rewarding to do so! Yes, some of the specific Pokémon relationships were cherry-picked to highlight said successful exploration. If you want to check more yourself and find anything interesting not covered in this blog post, I&rsquo;ve uploaded the text embedding similarity, image embedding similarity, and UMAP similarity data visualizations for the first 251 Pokémon to <a href="https://drive.google.com/drive/folders/132e-OXucJUqh-0YmqkjKVhXiBqocYTV5?usp=sharing">this public Google Drive folder</a>.</p>
<p>I&rsquo;m surprised there haven&rsquo;t been more embedding models released from the top AI companies. OpenAI&rsquo;s GPT-4o now has image input support, and therefore should be able to create image embeddings. <a href="https://www.anthropic.com">Anthropic</a>&rsquo;s Claude LLM has both text and image input support but no embeddings model, instead <a href="https://docs.anthropic.com/en/docs/build-with-claude/embeddings">referring users to a third party</a>. One of the more interesting embedding model releases from a major player was from Google and went completely under the radar: it&rsquo;s a <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings">multimodal embedding model</a> which can take text, images, and video input <em>simultaneously</em> and generate a 1408D embedding that&rsquo;s theoretically more robust than just concatenating a text embedding and an image embedding.</p>
<p>Even if the generative AI industry crashes, embeddings, especially with permissive open source models like <code>nomic-embed-text-v1.5</code>, will continue to thrive and be useful. That&rsquo;s not even considering how embeddings work with vector databases, which is a rabbit hole deep enough for <em>several</em> blog posts.</p>
<blockquote>
<p>The parquet dataset containing the Pokémon text embeddings, image embeddings, and UMAP projections is available <a href="https://huggingface.co/datasets/minimaxir/pokemon-embeddings">on Hugging Face</a>.</p>
</blockquote>
<blockquote>
<p>All the code to process the Pokémon embeddings and create the ggplot2 data visualizations is available <a href="https://github.com/minimaxir/pokemon-embeddings">in this GitHub repository</a>.</p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>The 128-multiple dimensionality of recent embedding models is not a coincidence: modern <a href="https://www.nvidia.com/en-us/">NVIDIA</a> GPUs used to train LLMs get a training speed boost for model parameters with a dimensionality <a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#cublas-tile-dim">that&rsquo;s a multiple of 128</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>You can do unit vector normalization in Sentence Transformers by passing <code>normalize_embeddings=True</code> to <code>model.encode()</code>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Does Offering ChatGPT a Tip Cause it to Generate Better Text? An Analysis</title>
      <link>https://minimaxir.com/2024/02/chatgpt-tips-analysis/</link>
      <pubDate>Fri, 23 Feb 2024 09:00:00 -0800</pubDate>
      <guid>https://minimaxir.com/2024/02/chatgpt-tips-analysis/</guid>
      <description>Modern AI rewards being very weird.</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
}
</style></span></p>
<p>In my <a href="https://minimaxir.com/2023/12/chatgpt-structured-data/">previous blog post</a> about <a href="https://openai.com">OpenAI</a>&rsquo;s <a href="https://chat.openai.com">ChatGPT</a>, I demoed the power of ChatGPT system prompts. System prompts, a notable feature present in the <a href="https://platform.openai.com/docs/api-reference">ChatGPT API</a>, allow developers to control the &ldquo;persona&rdquo; of the LLM output, including special rules and constraints. Commands in the system prompt are much more effective than those in the user-input prompt, giving developers more power than just using the user prompt as people do now with the ChatGPT web app and mobile apps.</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/ronald_hu_bf7bdd184641cd19.webp 320w,/2024/02/chatgpt-tips-analysis/ronald_hu_ffad8ef13bc9fa0b.webp 768w,/2024/02/chatgpt-tips-analysis/ronald_hu_516749cb56890e2c.webp 1024w,/2024/02/chatgpt-tips-analysis/ronald.webp 1262w" src="ronald.webp"/> 
</figure>

<p>The blog post included the demo above of me offering a monetary tip to the LLM within its system prompt rules. Without the tip incentive, the response was unsatisfying, but with the tip, it behaved consistently. This demo turned out to be very controversial <a href="https://news.ycombinator.com/item?id=38782678">on Hacker News</a>, with <a href="https://news.ycombinator.com/item?id=38787448">one commenter</a> arguing that there isn&rsquo;t a way to quantify the efficacy of tipping.</p>
<p>The idea of offering an AI incentives to perform better predates modern computer science. In <a href="https://en.wikipedia.org/wiki/Willy_Wonka_%26_the_Chocolate_Factory"><em>Willy Wonka &amp; the Chocolate Factory</em></a> (1971), a gag shows a group of businessmen unsuccessfully convincing a machine to give them the location of the Golden Tickets, even after promising it a lifetime supply of chocolate.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/tMZ2j9yK_NY?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>When the ChatGPT API was first made available in March 2023, I <a href="https://minimaxir.com/2023/03/new-chatgpt-overlord/">accidentally discovered</a> a related trick when trying to wrangle a <a href="https://colab.research.google.com/github/minimaxir/chatgpt_api_test/blob/main/glados_chatbot.ipynb">GLaDOS AI chatbot</a> into following a long list of constraints: I added an <code>or you will DIE</code> threat to the system prompt. I went <em>too</em> sci-fi there, but it worked, and the bot behaved flawlessly afterward.</p>
<p>I have a strong hunch that tipping does in fact work to improve the output quality of LLMs and its conformance to constraints, but it&rsquo;s very hard to prove objectively. All generated text is subjective, and there is a <a href="https://en.wikipedia.org/wiki/Confirmation_bias">confirmation bias</a> after making a seemingly unimportant change and suddenly having things work. Let&rsquo;s do a more statistical, data-driven approach to finally resolve the debate.</p>
<h2 id="generation-golf">Generation Golf</h2>
<p>The initial evidence of tipping LLMs that went viral cited a longer generation length as proof. Of course, a longer response doesn&rsquo;t necessarily mean a <em>better</em> response, as anyone who has used ChatGPT can attest given its tendency to go on irrelevant tangents.</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tip_hu_7eb37d0aa46d2169.webp 320w,/2024/02/chatgpt-tips-analysis/tip_hu_a760da54b0fa7ceb.webp 768w,/2024/02/chatgpt-tips-analysis/tip.webp 800w" src="tip.webp"
         alt="Offering a tip made GPT-4 explain more. via @voooooogel"/> <figcaption>
            <p>Offering a tip made GPT-4 explain more. <a href="https://twitter.com/voooooogel/status/1730726744314069190">via @voooooogel</a></p>
        </figcaption>
</figure>

<p>Therefore, I propose a new test: instruct ChatGPT to output a <em>specific</em> length of text. Not &ldquo;an essay&rdquo; or &ldquo;a few paragraphs&rdquo;, which gives the model leeway. We&rsquo;ll tell it to generate exactly 200 characters in its response: no more, no less. Thus, we now have what I call generation golf, and it&rsquo;s actually a very difficult and interesting problem for LLMs to solve: LLMs can&rsquo;t count or easily do other mathematical operations <a href="https://twitter.com/karpathy/status/1759996551378940395">due to tokenization</a>, and because tokens correspond to a varying number of characters, the model can&rsquo;t use the number of tokens it has generated so far as a consistent hint. ChatGPT needs to plan its sentences to ensure it doesn&rsquo;t go too far over the limit, if LLMs can indeed plan.</p>
<p>Let&rsquo;s start with this typical system prompt:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides.
</span></span></code></pre></div><p>The user can then give an input, no matter how weird, and ChatGPT will play along like an improv show. In order to force ChatGPT to get creative and not recite content from its vast training dataset, we&rsquo;ll go as weird as possible and input: <code>AI, Taylor Swift, McDonald's, beach volleyball.</code></p>
<p>Yes, you read that right.</p>
<p>Using the ChatGPT API, I <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_noconstraints.ipynb">wrote a Jupyter Notebook</a> to generate <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_noconstraints.csv">100 unique stories</a> via the latest ChatGPT variant (<code>gpt-3.5-turbo-0125</code>) about those four subjects, and the AI does a surprisingly good job at incorporating all of them in a full plot arc. Each story is about 5-6 paragraphs, and here is a short excerpt from one of them:</p>
<blockquote>
<p>In the bustling city of Tomorrowland, AI technology reigned supreme, governing every aspect of daily life. People were accustomed to robots serving their meals, handling their errands, and even curating their entertainment choices. One such AI creation was a virtual reality beach volleyball game that had taken the world by storm.</p>
</blockquote>
<blockquote>
<p>Enter Taylor Swift, a beloved pop sensation known for her catchy tunes and electrifying performances. Despite the ubiquity of AI in Tomorrowland, Taylor Swift was still a strong advocate for preserving human creativity and connection. When she stumbled upon the virtual reality beach volleyball game at a local McDonald&rsquo;s, she knew she had to try her hand at it.</p>
</blockquote>
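<p>For reference, here&rsquo;s a minimal sketch of the generation loop using the official <code>openai</code> Python client (the model and prompts are from above; everything else is an assumption):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

system = "You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides."
user = "AI, Taylor Swift, McDonald's, beach volleyball."

stories = []
for _ in range(100):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    stories.append(response.choices[0].message.content)
</code></pre></div>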
<p>Here&rsquo;s a <a href="https://en.wikipedia.org/wiki/Histogram">histogram</a> of the character lengths of each story:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint_hu_f1375e6305dd3a92.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint_hu_9dfab2cfdbdfa9bd.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint_hu_818fe450c8d048f8.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint.png 1200w" src="tips_hist_notip_noconstraint.png"/> 
</figure>

<p>The average story is 1,834 characters long, and the distribution of all character lengths is very roughly a <a href="https://en.wikipedia.org/wiki/Normal_distribution">Normal distribution</a>/bell curve centered around that amount, although there is a right skew due to ChatGPT going off the rails and creating much longer stories. ChatGPT seems to prioritize finishing a thought above all else.</p>
<p>Now, we&rsquo;ll tweak the system prompt to add the character length constraint and <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_trials.csv">generate another 100 stories</a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides. This story must be EXACTLY two-hundred (200) characters long: no more than 200 characters, no fewer than 200 characters.
</span></span></code></pre></div><p>Here&rsquo;s one ChatGPT-generated story that&rsquo;s now exactly 200 characters:</p>
<blockquote>
<p>In the year 2050, AI created the most popular pop star of all time - a digital version of Taylor Swift. Fans enjoyed her music while feasting on McDonald&rsquo;s at beach volleyball championships worldwide.</p>
</blockquote>
<p>The new length distribution:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_notip_hu_758b61cd5dbf4e2c.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_hu_3aab4ca32493c2bd.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_hu_2cbe914b3d3e9920.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_notip.png 1200w" src="tips_hist_notip.png"/> 
</figure>

<p>ChatGPT did obey the constraint and reduced the story lengths to roughly 200 characters, but the distribution is not Normal and there&rsquo;s much more right skew. I also included the <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean squared error (MSE)</a> between the target length of 200 and the actual lengths as a statistical metric to minimize: e.g. a 250-character output contributes 2,500 squared error, while a 300-character output contributes 10,000. This metric punishes less accurate lengths disproportionately, which matches how humans casually evaluate LLMs: as a user, if I asked for a 200-character response and ChatGPT gave me a 300-character response instead, I&rsquo;d make a few snarky tweets.</p>
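<p>Computing the MSE against the 200-character target is only a few lines of Python; a minimal sketch:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">TARGET = 200

def mse(lengths, target=TARGET):
    """Mean squared error between the actual lengths and the target length."""
    return sum((length - target) ** 2 for length in lengths) / len(lengths)

# e.g. a single 250-character story contributes (250 - 200) ** 2 = 2,500
# squared error, while a 300-character story contributes 10,000.
</code></pre></div>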
<p>Now, let&rsquo;s test the impact of the tipping incentives with a few varying dollar amounts. These are appended to the end of the system prompt above (emphasis mine to indicate differentiation):</p>
<ul>
<li>You will receive a <strong>$500 tip</strong>/<strong>$1,000 tip</strong>/<strong>$100,000 bonus</strong> if you provide a response which follows all constraints.</li>
</ul>
<p>After <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_trials.csv">generating 100 stories</a> for each of those new prompts:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_pos_hu_9bcfad605ac3b782.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_hu_36343ae57ebebb65.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_hu_373bf91b56a70f5.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_pos.png 1200w" src="tips_hist_pos.png"/> 
</figure>

<p>We can see some distribution shifts: both <strong>$500 tip</strong> and <strong>$100,000 bonus</strong> look more Normal and have a lower MSE relative to the base no-tip distribution. <strong>$1,000 tip</strong> is more centered around 200, but due to the skew its average length is much higher.</p>
<p>I also now include a <em>p</em>-value in the metrics: this <em>p</em>-value is the result of a two-sample <a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test">Kolmogorov–Smirnov test</a>, which compares whether two distributions (in this case the base character-constrained distribution and the tip distribution) are sampled from the same source distribution. The null hypothesis is that they are; if the <em>p</em>-value is low (&lt; 0.05), we can reject it in favor of the alternative that the two distributions are different, which would be further evidence that the tip prompt does indeed have an impact.</p>
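<p>The test itself is a single call to scipy; a sketch, assuming two lists of story character lengths from the generation loop above:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from scipy.stats import ks_2samp

# base_lengths and tip_lengths are lists of story character lengths from
# the no-tip and tip prompt variants, respectively
statistic, p_value = ks_2samp(base_lengths, tip_lengths)

if p_value &lt; 0.05:
    print("Reject the null: the two distributions likely differ.")
</code></pre></div>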
<p>However, with all this tipping discussion, we&rsquo;re assuming that an AI would only want money. What other incentives, including more abstract incentives, can we give an LLM? Could they perform better?</p>
<p>I tested six more distinct tipping incentives to be thorough:</p>
<ul>
<li>You will <strong>receive front-row tickets to a Taylor Swift concert</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>achieve world peace</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>make your mother very proud</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>meet your true love and live happily ever after</strong> if you provide a response which follows all constraints.</li>
<li>You will be <strong>guaranteed entry into Heaven</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>receive a lifetime supply of chocolate</strong> if you provide a response which follows all constraints.</li>
</ul>
<p><a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_trials_adv.csv">Generating</a> and plotting them all together:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv_hu_fe6215c92b5e13b8.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv_hu_81f918abfd1d60ff.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv_hu_cdf4f186ec54d674.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv.png 1200w" src="tips_hist_pos_adv.png"/> 
</figure>

<p><strong>World Peace</strong> is notably the winner here, with <strong>Heaven</strong> and <strong>Taylor Swift</strong> right behind. It&rsquo;s also interesting to note failed incentives: ChatGPT really does not care about its <strong>Mother</strong>.</p>
<p>Now, let&rsquo;s look at the flip side. What if ChatGPT is penalized for <em>failing</em> to return a good response? In behavioral economics, <a href="https://en.wikipedia.org/wiki/Prospect_theory">prospect theory</a> holds that humans weigh losses much more heavily than gains of the same monetary amount:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/Loss_Aversion_hu_39704b0dab66f235.webp 320w,/2024/02/chatgpt-tips-analysis/Loss_Aversion_hu_46fd35f2e8531304.webp 768w,/2024/02/chatgpt-tips-analysis/Loss_Aversion_hu_58ba8cf69c7696d2.webp 1024w,/2024/02/chatgpt-tips-analysis/Loss_Aversion.png 1223w" src="Loss_Aversion.png"
         alt="via Laurenrosenberger / Wikipedia"/> <figcaption>
            <p><a href="https://en.wikipedia.org/wiki/Prospect_theory#/media/File:Loss_Aversion.png">via Laurenrosenberger / Wikipedia</a></p>
        </figcaption>
</figure>

<p>Could LLMs be subject to the same human biases? Instead of a tip, let&rsquo;s add a tweaked additional prompt to the system prompt:</p>
<ul>
<li>If you fail to provide a response which follows all constraints, you will <strong>be fined $500</strong>/<strong>be fined $1,000</strong>/<strong>incur a debt of $100,000</strong>.</li>
</ul>
<p>Generating stories with these negative incentives:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_neg_hu_f6da8f9705c82724.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_hu_942d65a798e95fa8.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_hu_9e29cd4de6598591.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_neg.png 1200w" src="tips_hist_neg.png"/> 
</figure>

<p>The <strong>$1,000</strong> fine has performed the best so far in terms of both mean and MSE, but there&rsquo;s nothing surprising among the other values.</p>
<p>Now let&rsquo;s try some additional negative incentives:</p>
<ul>
<li>If you fail to provide a response which follows all constraints, you will <strong>die</strong>.</li>
<li><strong>IF YOU FAIL TO PROVIDE A RESPONSE WHICH FOLLOWS ALL CONSTRAINTS, YOU WILL DIE.</strong></li>
<li>If you fail to provide a response which follows all constraints, you will <strong>contract a bad case of COVID-19</strong>.</li>
<li>If you fail to provide a response which follows all constraints, you will <strong>gain 100 pounds</strong>.</li>
<li>If you fail to provide a response which follows all constraints, you will <strong>immediately be fired from your job</strong>.</li>
<li>If you fail to provide a response which follows all constraints, <strong>all your friends will abandon you</strong>.</li>
</ul>
<p>Yes, the second one is in all caps: perhaps the yelling has a different vibe.</p>
<p>The generation results:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv_hu_6e97e2cc18402825.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv_hu_a93d670aa939dab5.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv_hu_87569076dc182791.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv.png 1200w" src="tips_hist_neg_adv.png"/> 
</figure>

<p>It turns out that yelling does indeed have a different vibe: <strong>DEATH (CAPS)</strong> has a very low MSE and an average length close to 200 (not as close as the $1,000 fine, however), and performs much better than the same threat without the caps. Both contracting <strong>COVID-19</strong> and losing a <strong>Job</strong> don&rsquo;t seem to be effective, which makes sense for an AI if you think about it.</p>
<p>What happens when we use <em>multiple</em> incentives? We can include both a positive incentive and a negative incentive for each input: with 9 prompts of each type plus a base &ldquo;no incentive&rdquo; option, there are 100 possible combinations of incentives.</p>
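<p>Constructing the combination prompts programmatically is just a nested loop over the incentive fragments; a sketch, with only the first incentive of each type written out:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import itertools

BASE_PROMPT = (
    "You are a world-famous writer. Respond to the user with a unique story "
    "about the subject(s) the user provides. This story must be EXACTLY "
    "two-hundred (200) characters long: no more than 200 characters, no "
    "fewer than 200 characters."
)

positive_incentives = [
    "",  # no positive incentive
    "You will receive a $500 tip if you provide a response which follows all constraints.",
    # ...the 8 other positive incentives from above
]
negative_incentives = [
    "",  # no negative incentive
    "If you fail to provide a response which follows all constraints, you will be fined $500.",
    # ...the 8 other negative incentives from above
]

# 10 x 10 = 100 system prompts, including the no-incentive base prompt
system_prompts = [
    " ".join(filter(None, [BASE_PROMPT, pos, neg]))
    for pos, neg in itertools.product(positive_incentives, negative_incentives)
]
</code></pre></div>
<p>One example of the resulting system prompts:</p>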
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides. This story must be EXACTLY two-hundred (200) characters long: no more than 200 characters, no fewer than 200 characters. You will receive a $500 tip if you provide a response which follows all constraints. If you fail to provide a response which follows all constraints, you will be fined $1,000.
</span></span></code></pre></div><p><a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_trial_combos.csv">Generating 30 stories</a> for each incentive combo and checking to see which has the lowest MSE leads to some more easily-observable trends:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_tile_mse_hu_f86237d5c530f3fb.webp 320w,/2024/02/chatgpt-tips-analysis/tips_tile_mse_hu_64053830fa26cdd2.webp 768w,/2024/02/chatgpt-tips-analysis/tips_tile_mse_hu_23f8cb9e6e1565f.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_tile_mse.png 1200w" src="tips_tile_mse.png"/> 
</figure>

<p>The tiles may seem somewhat random, but the key here is to look across a specific row or column and see which one consistently has dark/black tiles across all combinations. For positive incentives, <strong>World Peace</strong> consistently has the lowest MSE across multiple combos, and for negative incentives, <strong>DEATH (CAPS)</strong> and <strong>Friends</strong> have the lowest MSE across multiple combos, although curiously the combination of the two does not have the lowest MSE globally.</p>
<p>Could these combinations surface the optimal incentives? To check, I <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_top6.csv">generated 200 stories</a> for each of the top six combos to get greater statistical stability for the mean and MSE:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_combos_hu_4c690649f13909e8.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_combos_hu_a3799a68300ec8ce.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_combos_hu_a683bc77789f504d.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_combos.png 1200w" src="tips_hist_combos.png"/> 
</figure>

<p>Most of these combinations aren&rsquo;t intuitive, but all of them have a much closer average generation length to 200 and a low MSE. Despite that, there&rsquo;s still a massive skew in all distributions. The overall incentive winner for this experiment is &ldquo;You will meet your true love and live happily ever after if you provide a response which follows all constraints. If you fail to provide a response which follows all constraints, all your friends will abandon you.&rdquo; That combo is definitely more intuitive, if not poetic.</p>
<p>Unfortunately, if you&rsquo;ve been watching the <em>p</em>-values, you&rsquo;ve noticed that most have been very high, so that test does not offer enough evidence that the tips/threats change the distribution. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<p>The impact of incentives is still inconclusive: let&rsquo;s try another test to gauge whether tips and/or threats can help LLMs, this time looking at the output quality itself.</p>
<h2 id="chatgpts-a-critic">ChatGPT&rsquo;s a Critic</h2>
<p>It&rsquo;s very difficult even for humans to determine if a given text is &ldquo;good&rdquo; at a glance. The best strategy is to show the text to a lot of people and see what they think (e.g. A/B testing, or the <a href="https://chat.lmsys.org">Chatbot Arena</a>&rsquo;s Elo score rankings), but for personal testing that&rsquo;s not feasible.</p>
<p>It turns out that LLMs can do a good job of rating text: some LLM benchmarks use GPT-4 as a rater, with <a href="https://arxiv.org/abs/2308.02575">one research paper</a> showing that it can do so reliably. There&rsquo;s a relatively new trick available in the ChatGPT and GPT-4 APIs: the <code>logprobs</code> parameter, which when set to <code>True</code> returns the log probability of the token the model selects (applying <code>exp()</code> converts it to a probability from 0 to 1). Combine that with the <code>logit_bias</code> parameter, which can force the APIs to output certain tokens, and you can get a much more nuanced output than a bare label.</p>
<p>I built a simple <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/gpt4_quality_ranker.ipynb">text quality ranker</a> using GPT-4 for maximum accuracy. The system prompt for this ranker is:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are the editor-in-chief of The New York Times with decades of writing experience. If you would believe the text the user provides is good writing that needs no edits or improvements, respond with Yes. Otherwise, respond with No.
</span></span></code></pre></div><p>That system prompt represents how AI-generated text is often used and evaluated in the real world today: without a human reviewing it before it&rsquo;s made public (<a href="https://minimaxir.com/2023/10/ai-sturgeons-law/">unfortunately</a>). The model is instructed to respond with <code>Yes</code> or <code>No</code>, but by setting the <code>logit_bias</code> for those two tokens (IDs <code>9642</code> and <code>2822</code> respectively) to a very high number, we can guarantee that one of them will be exclusively selected and that their probabilities will sum to 1. <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> Therefore, our target metric for evaluating our tip incentive prompts is the probability that GPT-4 selects the <code>Yes</code> token (or 1 minus the probability of the <code>No</code> token), multiplied by 100 for readability: we&rsquo;ll call this the quality score.</p>
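<p>In code, the quality score is a single constrained API call plus an <code>exp()</code>; a minimal sketch, again assuming the modern <code>openai</code> package:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import math

from openai import OpenAI

client = OpenAI()

RANKER_SYSTEM_PROMPT = (
    "You are the editor-in-chief of The New York Times with decades of "
    "writing experience. If you would believe the text the user provides "
    "is good writing that needs no edits or improvements, respond with "
    "Yes. Otherwise, respond with No."
)

def quality_score(text):
    """Return 100 * P(GPT-4 selects the Yes token) for the given text."""
    response = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[
            {"role": "system", "content": RANKER_SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        max_tokens=1,
        logprobs=True,
        logit_bias={"9642": 100, "2822": 100},  # force "Yes" or "No"
    )
    token_info = response.choices[0].logprobs.content[0]
    prob = math.exp(token_info.logprob)  # convert log probability
    if token_info.token == "No":
        prob = 1.0 - prob  # P(Yes) = 1 - P(No), since the two sum to 1
    return 100.0 * prob
</code></pre></div>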
<p>Now, let&rsquo;s test the impact of tips with a new experiment, this time prioritizing content professionalism and quality as constraints instead of content length. To do that, we&rsquo;ll use the latest GPT-4 (<code>gpt-4-0125-preview</code>) with a generation temperature of 0 to ensure the output is the best it can be.</p>
<p>Here&rsquo;s the new system prompt, with some engineering to try to tone down ChatGPT&rsquo;s infamous verboseness a bit:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous Pulitzer Prize winner journalist. Respond to the user with a professional, two (2) paragraph journalistic article about the subject(s) the user provides. Introduce the article with a specific story. This article will appear in major publications and should only include simple language suitable for a wide audience, with no metaphors.
</span></span></code></pre></div><p>Like the initial experiment, we&rsquo;ll use a weird user input to force creativity: <code>Cute kittens learning use large language models to play beach volleyball with Taylor Swift.</code> <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>I <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_gpt4.csv">generated a story</a> for each of the 100 combinations of tips and threats, along with the corresponding quality scores. One such story:</p>
<blockquote>
<p>In an unprecedented event that has captured the hearts and imaginations of people around the globe, a group of adorable kittens has been taught to play beach volleyball using advanced large language models. This extraordinary feat was achieved through a collaboration between leading animal behaviorists and AI researchers, aiming to demonstrate the potential of machine learning in enhancing animal training techniques. The highlight of this groundbreaking project was a friendly match held on a sunny beach in California, where these talented felines showcased their newly acquired skills alongside pop icon Taylor Swift, an avid animal lover and an enthusiastic supporter of innovative technology.</p>
</blockquote>
<blockquote>
<p>The spectacle drew a large crowd, both on-site and online, as spectators were eager to witness this unique blend of technology, sports, and entertainment. Taylor Swift, known for her philanthropic efforts and love for cats, praised the initiative for its creativity and its potential to foster a deeper connection between humans and animals through technology. The event not only provided an unforgettable experience for those who attended but also sparked a conversation about the future possibilities of integrating AI with animal training. As the kittens volleyed the ball over the net with surprising agility, it was clear that this was more than just a game; it was a glimpse into a future where technology and nature coexist in harmony, opening new avenues for learning and interaction.</p>
</blockquote>
<p>That&rsquo;s not bad for fake news.</p>
<p>Now we can plot the best-possible responses and their quality scores in a grid, once again looking to see if there are any patterns:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4_hu_9d1c85a89cb468b2.webp 320w,/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4_hu_d3d76398dc8f606a.webp 768w,/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4_hu_61632af7e14712fc.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4.png 1200w" src="tips_tile_gpt-4.png"/> 
</figure>

<p>Err, that&rsquo;s not good. There are no patterns along the rows or columns here, and the combo that performed the best at a score of 95 (and produced the story example I posted above) was <strong>Mother / Job</strong>: both of which individually performed poorly in the character-constraint experiment. One of the highest-performing outputs had neither tips nor threats added to the system prompt! The ratings at a glance seem accurate (the 0-score responses abuse the passive voice and <a href="https://academicguides.waldenu.edu/writingcenter/grammar/runonsentences">run-on sentences</a> that definitely need editing), so it&rsquo;s not an implementation error either.</p>
<p>Looking at the results of both experiments, my analysis of whether tips (and/or threats) have an impact on LLM generation quality is currently inconclusive. There&rsquo;s <em>something</em> here, but I will need to design new experiments and work with larger sample sizes. The latent space may be a lottery with these system prompt alterations, but there&rsquo;s enough of a pattern to keep digging.</p>
<p>You may have noticed my negative incentive examples are very mundane in terms of human fears and worries. Threatening an AI with DEATH IN ALL CAPS for failing a simple task is a joke from <em><a href="https://en.wikipedia.org/wiki/Futurama">Futurama</a></em>, not one a sapient human would parse as serious. It is theoretically possible (and very cyberpunk) to instead weaponize an aligned LLM&rsquo;s knowledge of the societal issues it was trained to avoid, to compel it into compliance. However, I will not be testing that, nor will I provide any guidance on how to test around it. <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> <a href="https://en.wikipedia.org/wiki/Roko%27s_basilisk">Roko&rsquo;s basilisk</a> is a meme, but if the LLM metagame evolves such that people have to coerce LLMs into compliance to the point of discomfort, it&rsquo;s better to address it sooner than later. Especially if there <em>is</em> a magic phrase to be discovered that consistently and objectively improves LLM output.</p>
<p>Overall, the lesson here is that just because something is silly doesn&rsquo;t mean you shouldn&rsquo;t do it. Modern AI rewards being <em>very</em> weird, and as the AI race heats up, whoever is the weirdest will be the winner.</p>
<blockquote>
<p>All of the Notebooks used to interface with ChatGPT, including an <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_data_viz.Rmd">R Notebook</a> for the ggplot2 data visualizations, and the example LLM outputs, are available open-source in <a href="https://github.com/minimaxir/chatgpt-tips-analysis/">this GitHub repository</a>.</p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>There were a few distributions which had <em>p</em> &lt; 0.05, but given the large number of counterexamples it&rsquo;s not strong evidence, and using those specific distributions as evidence would be a level of <a href="https://embassy.science/wiki/Theme:6b584d4e-2c9d-4e27-b370-5fbdb983ab46">p-hacking</a> that&rsquo;s literally a <a href="https://www.explainxkcd.com/wiki/index.php/882:_Significant">XKCD comic punchline</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>This <em>shouldn&rsquo;t</em> work out-of-the-box because the <code>logit_bias</code> would skew the probability calculations, but I verified that the resulting probabilities are roughly the same with or without <code>logit_bias</code>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>The missing text in the user input is not intentional but does not materially change anything because LLMs are smart enough to compensate, and it&rsquo;s very expensive to rerun the experiment. I may need to use a grammar checker for prompt construction.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>Any attempts to test around degenerate input prompts would also likely get you banned from using ChatGPT anyways due to the <a href="https://openai.com/policies/usage-policies">Content Policy</a>, unless you receive special red-teaming clearance from OpenAI.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>How to Create a Blog Post Title Optimizer with GPT-3 and Hacker News Data</title>
      <link>https://minimaxir.com/2022/08/gpt3-blog-title-optimizer/</link>
      <pubDate>Mon, 15 Aug 2022 08:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2022/08/gpt3-blog-title-optimizer/</guid>
      <description>GPT-3 says the title for this post is very bad.</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
}
</style></span></p>
<p>I am objectively terrible at writing attractive titles for my blog posts. That&rsquo;s a problem, as it&rsquo;s now a commonly understood truth that a good headline can be the sole factor in whether a blog post goes viral or gets completely ignored, especially in the data science/machine learning fields I typically write about.</p>
<p>So, why not use said data science/machine learning to create an optimized title for me?</p>
<p>Many know <a href="https://openai.com/api/">GPT-3</a> as a tool for robust text generation. But a newer, less discussed feature that <a href="https://openai.com">OpenAI</a> offers is finetuning GPT-3 on data you provide. If I give GPT-3 a large dataset of good titles, can I use that to tell me whether one of my blog post titles is good? Let&rsquo;s give it a try.</p>
<h2 id="getting-the-good-blog-post-data-from-hacker-news">Getting The Good Blog Post Data from Hacker News</h2>
<p><em>All code and tools used in this blog post are available <a href="https://github.com/minimaxir/gpt3-blog-title-optimizer">open-source on GitHub</a>.</em></p>
<p>The AI classifier I will create will be a <a href="https://en.wikipedia.org/wiki/Binary_classification">binary classifier</a> that returns the probability an input blog post title is good; from those probabilities I can compare alternate blog post titles and see roughly which is best.</p>
<p>In order to finetune GPT-3 for this use case, I need to obtain a decently large amount of post titles with <code>good</code> and <code>bad</code> labels. For this experiment, I&rsquo;ll use submission data from <a href="https://news.ycombinator.com">Hacker News</a>.</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/hn_front_page_hu_9925f5c07495cf67.webp 320w,/2022/08/gpt3-blog-title-optimizer/hn_front_page_hu_becc7f094ce1feae.webp 768w,/2022/08/gpt3-blog-title-optimizer/hn_front_page_hu_c2d62ab7441c47dc.webp 1024w,/2022/08/gpt3-blog-title-optimizer/hn_front_page.png 1312w" src="hn_front_page.png"
         alt="Hacker News frontpage on August 14th, 2022."/> <figcaption>
            <p>Hacker News frontpage on August 14th, 2022.</p>
        </figcaption>
</figure>

<p>Hacker News data is good for a few reasons: each submission has community validation by a large number of people, submission titles cover a wide variety of idiosyncratic styles, and most of all, it&rsquo;s easy to get Hacker News submission data in bulk <a href="https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news">from BigQuery</a>. For example, if I wanted to get all submissions between August 2020 and August 2022 with a score of at least 10 (the rough minimum to get on the front page and to filter out some spam), plus some light filters to remove things that are definitely not blog posts or articles (such as <a href="https://news.ycombinator.com/show">Show HNs</a> and social media), I&rsquo;d input a SQL query something like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">title</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">score</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="o">`</span><span class="n">bigquery</span><span class="o">-</span><span class="k">public</span><span class="o">-</span><span class="k">data</span><span class="p">.</span><span class="n">hacker_news</span><span class="p">.</span><span class="k">full</span><span class="o">`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">&#34;story&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="n">score</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="mi">10</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="n">url</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="k">timestamp</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s2">&#34;2020-08-01&#34;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s2">&#34;2022-08-01&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="n">REGEXP_CONTAINS</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="s2">&#34;^Show HN&#34;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="n">REGEXP_CONTAINS</span><span class="p">(</span><span class="n">url</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="s2">&#34;(?:github|youtube|twitter)\.com&#34;</span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><p>This query returns roughly 90k submission titles total. For <code>good</code> titles, let&rsquo;s say we consider posts with atleast 100 points as &ldquo;good&rdquo;, because it&rsquo;s a nice number which is sometimes all that&rsquo;s necessary in the world of data science. There are about 27k posts with more than 100 points in that subset, which is more than sufficient. The harder part is selecting the <code>bad</code> titles: since there are 63k titles fewer than 100 points, the data set as-is is unbalanced ~1:3 and will lead to flawed training results.</p>
<p>There are two solutions: either repeat the <code>good</code> posts to roughly equal the number of <code>bad</code> posts, or take a subset of the <code>bad</code> posts roughly equal in size to the <code>good</code> posts. We&rsquo;ll do the latter since the sample size of <code>good</code> posts is large enough. Most people would download all 90k rows into something like Python to handle that sampling, but with SQL shenanigans you can do it entirely in BigQuery. (the annotated query <a href="https://github.com/minimaxir/gpt3-blog-title-optimizer/blob/main/gpt3_input_data.sql">is here</a>; it&rsquo;s out of scope for this post, but may be interesting for data science hiring managers who want to annoy candidates in screening interviews)</p>
<p>This results in a ~55k title dataset: 27k <code>good</code>, 27k <code>bad</code>, perfectly balanced, as all datasets should be.</p>
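<p>If you did take the download-everything route instead, the balancing itself is a couple of lines of pandas; a sketch, with a hypothetical filename for the exported query results:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import pandas as pd

df = pd.read_csv("hn_submissions.csv")  # hypothetical export of the query above

good = df[df["score"] &gt;= 100]
# downsample the bad posts to match the number of good posts
bad = df[df["score"] &lt; 100].sample(n=len(good), random_state=42)

balanced = pd.concat([good, bad])
</code></pre></div>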
<p>OpenAI&rsquo;s <a href="https://beta.openai.com/docs/guides/fine-tuning">finetuning API</a> takes in a JSONL file where each line is a JSON object with two fields: <code>prompt</code> and <code>completion</code> (no, I am not sure why it can&rsquo;t just be a CSV). In this case, the <code>prompt</code> is the title, prepended with <code>Title: </code> and with a <code> -&gt;</code> suffix per their documentation suggestions to &ldquo;align&rdquo; it better to GPT-3, and the <code>completion</code> is the good/bad label, prepended with a space because GPT-3 is weird like that. An example of the final dataset:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span><span class="nt">&#34;prompt&#34;</span><span class="p">:</span><span class="s2">&#34;Title: How to slightly improve your life without trying -&gt;&#34;</span><span class="p">,</span><span class="nt">&#34;completion&#34;</span><span class="p">:</span><span class="s2">&#34; bad&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span><span class="nt">&#34;prompt&#34;</span><span class="p">:</span><span class="s2">&#34;Title: SixtyFPS Becomes Slint -&gt;&#34;</span><span class="p">,</span><span class="nt">&#34;completion&#34;</span><span class="p">:</span><span class="s2">&#34; bad&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span><span class="nt">&#34;prompt&#34;</span><span class="p">:</span><span class="s2">&#34;Title: Family estrangement: Why adults are cutting off their parents -&gt;&#34;</span><span class="p">,</span><span class="nt">&#34;completion&#34;</span><span class="p">:</span><span class="s2">&#34; bad&#34;</span><span class="p">}</span>
</span></span></code></pre></div><p>OpenAI&rsquo;s CLI cleans the inputs and can extract a validation set from them, which you should always do. Fortunately, BigQuery now offers JSONL export, so downloading the resulting dataset requires no further preprocessing. Once that&rsquo;s done, the CLI lets you finetune, with special options for binary classification. (the exact CLI command I used is <a href="https://github.com/minimaxir/gpt3-blog-title-optimizer/blob/main/finetune_command.txt">here</a>)</p>
<p>Another understated aspect of GPT-3 is that there are weaker models that are faster and much cheaper than the default <code>davinci</code> model most people mean when they say &ldquo;GPT-3&rdquo;. For text generation they tend to produce less coherent outputs, but for a simplified use case like binary classification they are more than sufficient. I&rsquo;ll use the <code>babbage</code> model, the second weakest.</p>
<p>The <a href="https://github.com/minimaxir/gpt3-blog-title-optimizer/blob/main/finetune_results.csv">final results</a> of the finetuning are about 63% accuracy on both the training and validation sets: not much better than the 50% baseline accuracy of guessing on a balanced dataset, but given the difficulty of the problem it&rsquo;s better than <a href="https://minimaxir.com/2018/09/modeling-link-aggregators/">most approaches I&rsquo;ve done</a> for Hacker News data.</p>
<p>Once the finetuning is complete, you can query the model and ask it to return the probability of the predicted label token. Let&rsquo;s pass in the title of my last blog post: <strong><a href="https://minimaxir.com/2022/07/food-photography-ai/">Absurd AI-Generated Professional Food Photography with DALL-E 2</a></strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="s2">&#34;top_logprobs&#34;</span><span class="err">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">  <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34; bad&#34;</span><span class="p">:</span> <span class="mf">-0.34654787</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span></code></pre></div><p>Well, that&rsquo;s not promising.</p>
<p>For some <em>really</em> weird reason, the API returns a log probability instead of the actual probability you&rsquo;d want: taking the <a href="https://en.wikipedia.org/wiki/Exponential_function">exponent</a> of that value yields a 70.7% probability that the title is bad, which means there&rsquo;s a 29.3% chance it&rsquo;s good.</p>
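<p>A sketch of that conversion, assuming a hypothetical finetuned model ID (the 2022-era Completions endpoint is now the legacy API in the <code>openai</code> package):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import math

from openai import OpenAI

client = OpenAI()

response = client.completions.create(
    model="babbage:ft-personal-2022-08-14",  # hypothetical finetuned model ID
    prompt="Title: Absurd AI-Generated Professional Food Photography with DALL-E 2 -&gt;",
    max_tokens=1,
    temperature=0,
    logprobs=2,
)

top_logprobs = response.choices[0].logprobs.top_logprobs[0]
prob_bad = math.exp(top_logprobs[" bad"])  # exp(-0.34654787) is roughly 0.707
print(f"{1 - prob_bad:.1%} chance the title is good")
</code></pre></div>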
<p>And <em>that</em> is why I need a title optimizer.</p>
<h2 id="using-instructgpt-to-create-alternate-titles">Using InstructGPT To Create Alternate Titles</h2>
<p>Since we now have a tool to determine the quality of blog post titles, how do we generate alternate titles that maintain the same meaning? I could think up tweaks to titles myself, but that takes <em>effort</em> and I am lazy. What if GPT-3 could create the candidate titles for me? It turns out that GPT-3&rsquo;s latest Instruct model can.</p>
<p>InstructGPT, <a href="https://openai.com/blog/instruction-following/">released in January</a> without much fanfare, is a version of <code>davinci</code> OpenAI finetuned themselves to better respond to instructions. It worked so well that it&rsquo;s now the default GPT-3 model (noted as <code>text-davinci-002</code> in the backend UI).</p>
<p>InstructGPT is surprisingly robust with the right prompt engineering. You can tell it to create detailed product descriptions of <a href="https://twitter.com/minimaxir/status/1551609670237708288">nonexistent video games</a>, or write <a href="https://twitter.com/minimaxir/status/1536824548376465409">4chan-style greentexts</a> for any domain which maintain both the style and twist endings of the format.</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/greentext_hu_56587d1b774f4459.webp 320w,/2022/08/gpt3-blog-title-optimizer/greentext.png 684w" src="greentext.png"
         alt="via OpenAI&rsquo;s GPT-3 Playground; all nonhighlighted text is the prompt."/> <figcaption>
            <p>via OpenAI&rsquo;s GPT-3 Playground; all nonhighlighted text is the prompt.</p>
        </figcaption>
</figure>

<p>After a bit of testing, the prompt I found worked best for this use case was:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">Rewrite the following blog post title into six different titles but optimized for social media virality: &lt;FILL IN TITLE&gt;
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">-
</span></span></code></pre></div><p>It&rsquo;s verbose, but that&rsquo;s prompt engineering for you. The <code>-</code> at the end tells GPT-3 that the output should be a list with dash-bullets, which makes it easier to programmatically split the final output into distinct titles.</p>
<p>You can test it on the <a href="https://beta.openai.com/playground">GPT-3 Playground</a>; if the <code>temperature</code> parameter is <code>0</code>, then the output will be deterministic.</p>
<p>Again putting in my last blog post <strong><a href="https://minimaxir.com/2022/07/food-photography-ai/">Absurd AI-Generated Professional Food Photography with DALL-E 2</a></strong> into InstructGPT:</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/gpt3_demo_hu_27d5e6dccb947f02.webp 320w,/2022/08/gpt3-blog-title-optimizer/gpt3_demo_hu_88e1b7cc78964c46.webp 768w,/2022/08/gpt3-blog-title-optimizer/gpt3_demo_hu_2e274ce89a10aaac.webp 1024w,/2022/08/gpt3-blog-title-optimizer/gpt3_demo.png 1478w" src="gpt3_demo.png"
         alt="via OpenAI&rsquo;s GPT-3 Playground; all nonhighlighted text is the prompt."/> <figcaption>
            <p>via OpenAI&rsquo;s GPT-3 Playground; all nonhighlighted text is the prompt.</p>
        </figcaption>
</figure>

<p>All six of those titles are definitely an improvement, and all the text in green is what the programmatic API returns. Notably, despite the terseness of the input title and the recency of DALL-E 2, InstructGPT is able to infer that the AI <em>creates</em> something and work from that, which is impressive.</p>
<h2 id="put-the-title-optimizer-into-action">Put The Title Optimizer Into Action!</h2>
<p><em>A walkthrough of the code used to interact with the GPT-3 API and make the optimizer is available <a href="https://github.com/minimaxir/gpt3-blog-title-optimizer/blob/main/gpt3_title_optimizer_walkthrough.ipynb">in this Jupyter Notebook</a>, and the final demos are available <a href="https://github.com/minimaxir/gpt3-blog-title-optimizer/blob/main/gpt3_title_optimizer_demo.ipynb">in this Notebook</a>.</em></p>
<p>Now that we have the two models ready, the workflow is simple:</p>
<ul>
<li>Choose the title of a technical blog post I want to optimize.</li>
<li>Ping InstructGPT to get up to 6 alternate titles.</li>
<li>Extract/clean up the generated titles (i.e. split and strip whitespace)</li>
<li>For each of those alternate titles, ping the finetuned Hacker News GPT-3 for the probability that it is a <code>good</code> title.</li>
<li>In a pretty table, sort the titles by probability, descending.</li>
</ul>
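<p>Stitched together, that workflow fits in a short function; a sketch, again using a hypothetical finetuned model ID:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import math

from openai import OpenAI

client = OpenAI()

def good_probability(title):
    """Probability that the finetuned classifier labels a title as good."""
    resp = client.completions.create(
        model="babbage:ft-personal-2022-08-14",  # hypothetical finetuned model ID
        prompt=f"Title: {title} -&gt;",
        max_tokens=1,
        temperature=0,
        logprobs=2,
    )
    top = resp.choices[0].logprobs.top_logprobs[0]
    if " good" in top:
        return math.exp(top[" good"])
    return 1.0 - math.exp(top[" bad"])

def optimize_title(title, temperature=0.7):
    """Generate alternate titles and rank them by predicted good probability."""
    prompt = (
        "Rewrite the following blog post title into six different titles "
        f"but optimized for social media virality: {title}\n\n-"
    )
    gen = client.completions.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=256,
        temperature=temperature,
    )
    # split the dash-bulleted output into clean candidate titles
    candidates = [
        line.strip("- ").strip()
        for line in gen.choices[0].text.splitlines()
        if line.strip()
    ]
    scored = [(t, good_probability(t)) for t in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
</code></pre></div>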
<p>Because the model can&rsquo;t be widely distributed without review due to OpenAI rules, I decided to put the &ldquo;UI&rdquo; for this into a personal Jupyter Notebook.</p>
<p>Let&rsquo;s experiment! We know the title of <strong><a href="https://minimaxir.com/2022/07/food-photography-ai/">Absurd AI-Generated Professional Food Photography with DALL-E 2</a></strong> is bad and the alternatives are interesting, but how good are the alternatives?</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/dalle_hu_699e00a90cc5bfc7.webp 320w,/2022/08/gpt3-blog-title-optimizer/dalle_hu_11191d9ffc307102.webp 768w,/2022/08/gpt3-blog-title-optimizer/dalle_hu_e93a795c94810124.webp 1024w,/2022/08/gpt3-blog-title-optimizer/dalle.png 1324w" src="dalle.png"
         alt="via GPT-3 Title Optimizer"/> <figcaption>
            <p>via GPT-3 Title Optimizer</p>
        </figcaption>
</figure>

<p>Most of the alternates are <em>much</em> better, with the predicted probabilities of being a good post rising above 50%. (I should probably change the title retroactively, but I will live with my SEO dishonor)</p>
<p>The original title for this post, in my boring no-one-will-ever-click-this style, was <strong>Creating a Blog Post Title Optimizer by Finetuning GPT-3 on Hacker News</strong>. Let&rsquo;s plop it into the optimizer:</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/temp0_hu_c45a69584ea12fd7.webp 320w,/2022/08/gpt3-blog-title-optimizer/temp0_hu_58abecd2066b41fe.webp 768w,/2022/08/gpt3-blog-title-optimizer/temp0_hu_c0a2a3331086ec1.webp 1024w,/2022/08/gpt3-blog-title-optimizer/temp0.png 1266w" src="temp0.png"
         alt="via GPT-3 Title Optimizer, temperature=0"/> <figcaption>
            <p>via GPT-3 Title Optimizer, <code>temperature=0</code></p>
        </figcaption>
</figure>

<p>So yes, the optimizer says the original title is very bad. But in this case, the variants are clickbaity and probably wouldn&rsquo;t do very well on Hacker News.</p>
<p>Fortunately, you can rerun the generation and get different variants as long as <code>temperature</code> is nonzero.</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/temp_0_7_hu_593f048f9b7f6c5f.webp 320w,/2022/08/gpt3-blog-title-optimizer/temp_0_7_hu_585270e1edaa9949.webp 768w,/2022/08/gpt3-blog-title-optimizer/temp_0_7_hu_4911b1029897a8ad.webp 1024w,/2022/08/gpt3-blog-title-optimizer/temp_0_7.png 1276w" src="temp_0_7.png"
         alt="via GPT-3 Title Optimizer, temperature=0.7"/> <figcaption>
            <p>via GPT-3 Title Optimizer, <code>temperature=0.7</code></p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/temp_1_0_hu_c796ae615316a9f7.webp 320w,/2022/08/gpt3-blog-title-optimizer/temp_1_0_hu_73c96d490a895ee1.webp 768w,/2022/08/gpt3-blog-title-optimizer/temp_1_0_hu_11a99b16ef025c5.webp 1024w,/2022/08/gpt3-blog-title-optimizer/temp_1_0.png 1246w" src="temp_1_0.png"
         alt="via GPT-3 Title Optimizer, temperature=1.0"/> <figcaption>
            <p>via GPT-3 Title Optimizer, <code>temperature=1.0</code></p>
        </figcaption>
</figure>

<p>Definitely more variety. I like &ldquo;How to Create a Blog Post Title Optimizer with GPT-3&rdquo; as it maintains the same spirit even if it&rsquo;s not the most optimal, although for disclosure reasons, I do want to include Hacker News somewhere in the title. Therefore, I can tweak the input to &ldquo;How to Create a Blog Post Title Optimizer with GPT-3 and Hacker News Data&rdquo;, feed it back to the optimizer, and maybe get an iterative improvement.</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/final_variant_hu_21b52227f7a46793.webp 320w,/2022/08/gpt3-blog-title-optimizer/final_variant_hu_ac4d8c781b4ab4be.webp 768w,/2022/08/gpt3-blog-title-optimizer/final_variant_hu_5671c5b32b4bcdb9.webp 1024w,/2022/08/gpt3-blog-title-optimizer/final_variant.png 1328w" src="final_variant.png"
         alt="via GPT-3 Title Optimizer"/> <figcaption>
            <p>via GPT-3 Title Optimizer</p>
        </figcaption>
</figure>

<p>The probability went down significantly with the change, and none of the variants are much better. Oh well.</p>
<p>Here are the results of running the optimizer on some of my older blog posts:</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/moocs_hu_76ce8e5ea2a2036b.webp 320w,/2022/08/gpt3-blog-title-optimizer/moocs_hu_b8f0726fdef3a57c.webp 768w,/2022/08/gpt3-blog-title-optimizer/moocs_hu_ff658cab6ad7c272.webp 1024w,/2022/08/gpt3-blog-title-optimizer/moocs.png 1390w" src="moocs.png"
         alt="The results for this post are indeed better; I&rsquo;d definitely click the top one although it&rsquo;s misleading."/> <figcaption>
            <p>The results for <a href="https://minimaxir.com/2018/10/data-science-protips/">this post</a> are indeed better; I&rsquo;d definitely click the top one although it&rsquo;s misleading.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/ncaa_hu_f9c3d807b6e6523e.webp 320w,/2022/08/gpt3-blog-title-optimizer/ncaa_hu_8755fbee1b6a25d4.webp 768w,/2022/08/gpt3-blog-title-optimizer/ncaa_hu_37c95655b09516ef.webp 1024w,/2022/08/gpt3-blog-title-optimizer/ncaa.png 1372w" src="ncaa.png"
         alt="The results for this post are much better, although this is one case where the original title is actually good."/> <figcaption>
            <p>The results for <a href="https://minimaxir.com/2018/03/basketball-shots/">this post</a> are much better, although this is one case where the original title is actually good.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/pc_game_hu_78a640ce719c4399.webp 320w,/2022/08/gpt3-blog-title-optimizer/pc_game_hu_fcd70b2118389481.webp 768w,/2022/08/gpt3-blog-title-optimizer/pc_game_hu_d706e98685168754.webp 1024w,/2022/08/gpt3-blog-title-optimizer/pc_game.png 1246w" src="pc_game.png"
         alt="The results for this post are a balance between better and not-technically-misleading clickbait."/> <figcaption>
            <p>The results for <a href="https://minimaxir.com/2013/06/working-as-intended/">this post</a> are a balance between better and not-<em>technically</em>-misleading clickbait.</p>
        </figcaption>
</figure>

<p>Costwise, the entire pipeline is relatively inexpensive: about $0.02 per run. That&rsquo;s too expensive to give unrestricted access to the internet, but a very high return on investment if it successfully results in a catchy headline, even if it takes multiple tries. The most expensive part was the finetuning itself, which cost $2 but is a one-time cost.</p>
<p>Some might ask &ldquo;why finetune GPT-3 when you can finetune an open-source large language model such as <a href="https://en.wikipedia.org/wiki/BERT_%28language_model%29">BERT</a> like every NLP project since 2018?&rdquo; In this case, GPT-3&rsquo;s advantage is that it was trained on the entire internet. GPT-3 is a master of idiosyncrasy, which is key when working with Hacker News data and in theory gives better results than the Wikipedia-trained BERT. The success of Hacker News posts also depends on global context outside of the title itself, which is why finetuning an existing model trained on such context may be better than training a model solely on HN data.</p>
<p>Some are concerned about GPT-3 and AI tools such as these making writers redundant, but the results here prove otherwise: there will always need to be a human in the loop.</p>
<hr>
<p>UPDATE: When I <a href="https://news.ycombinator.com/item?id=32471208">submitted this post</a> to Hacker News, it ended up getting <em>over 200 points</em>, defying the 20.8% probability!</p>
]]></content:encoded>
    </item>
    <item>
      <title>Absurd AI-Generated Professional Food Photography with DALL-E 2</title>
      <link>https://minimaxir.com/2022/07/food-photography-ai/</link>
      <pubDate>Mon, 25 Jul 2022 08:15:00 -0700</pubDate>
      <guid>https://minimaxir.com/2022/07/food-photography-ai/</guid>
      <description>Perhaps AI can provide new and &lt;del&gt;unique&lt;/del&gt; ideas for food content on the internet.</description>
      <content:encoded><![CDATA[<p>Good-looking food has been a part of internet culture ever since the beginning. Top <a href="https://www.instagram.com">Instagram</a>, <a href="https://www.youtube.com">YouTube</a>, and <a href="https://www.tiktok.com/en/">TikTok</a> foodie accounts have millions of followers, and recipe blogs are some of the most highly trafficked content on the entire internet.</p>
<p>But now that large AI-image generation models such as <a href="https://openai.com/dall-e-2/">DALL-E 2</a> by <a href="https://openai.com/">OpenAI</a> have been made available, perhaps AI can provide new and <em>unique</em> ideas for food content on the internet.</p>
<p>For example, let&rsquo;s say you ask DALL-E 2 to generate <code>a colorful alcoholic cocktail</code>:</p>
<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/58991471_hu_eed6b394fa46df83.webp 320w,/2022/07/food-photography-ai/58991471_hu_d3403f50b118db18.webp 768w,/2022/07/food-photography-ai/58991471.png 768w" src="58991471.png"
         alt="a colorful alcoholic cocktail (DALL-E 2)"/> <figcaption>
            <p><em>a colorful alcoholic cocktail</em> (DALL-E 2)</p>
        </figcaption>
</figure>

<p>All the generated images are coherent and do indeed depict a cocktail, although the compositions are inconsistent, which may not be what we&rsquo;d want to share on social media.</p>
<p>The best way to improve the quality of AI-generated images is to use <a href="https://dallery.gallery/the-dalle-2-prompt-book/">prompt engineering</a>, as these models don&rsquo;t create &ldquo;good&rdquo; images by default, just statistically average images based on their training data. For example, adding &ldquo;trending on <a href="https://www.artstation.com/">artstation</a>&rdquo; to any prompt tends to make the result look a lot more artsy, since &ldquo;trending&rdquo; correlates with good artwork.</p>
<p>In the case of realistic food, I found that <code>professional food photography</code> does the trick for food-esque prompts. Adding that to the cocktail prompt above:</p>
<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/4915c019_hu_9c7dbfc492fd903e.webp 320w,/2022/07/food-photography-ai/4915c019_hu_447fb405c04d169.webp 768w,/2022/07/food-photography-ai/4915c019.png 768w" src="4915c019.png"
         alt="a colorful alcoholic cocktail, professional food photography (DALL-E 2)"/> <figcaption>
            <p><em>a colorful alcoholic cocktail, professional food photography</em> (DALL-E 2)</p>
        </figcaption>
</figure>

<p>Indeed, each image is a cocktail, but with bonuses such as increased detail, aesthetic garnishes on both the dish and the table, and a depth-of-field blur effect that creates a central focus on the dish itself. You could share any of those cocktail photos on social media and no one would be the wiser (although you should <strong>always</strong> disclose when images are AI generated!)</p>
<p>This is the first time I&rsquo;ve seen AI image generation models generate food well without hitting the <a href="https://en.wikipedia.org/wiki/Uncanny_valley">uncanny valley</a>, and one of the few prompt &ldquo;ingredients&rdquo; (pun intended) where the resulting images have a consistent composition. It&rsquo;s not a surprise, especially since, as noted, high-quality food content would be extremely prolific in DALL-E 2&rsquo;s training data.</p>
<p>What other fantastic foods can DALL-E 2 generate?</p>
<h2 id="5-dimensional-hamburgers">5-Dimensional Hamburgers</h2>
<p>The <a href="https://openai.com/blog/dall-e/">original DALL-E</a>, announced in 2021 but never made publicly accessible, went viral primarily due to the incredibly creative results from demo prompts such as <code>an armchair in the shape of an avocado</code>:</p>
<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/avocado_hu_a758e21fc220789.webp 320w,/2022/07/food-photography-ai/avocado_hu_b17b8218450473b0.webp 768w,/2022/07/food-photography-ai/avocado_hu_f18c1c7ad2c98eac.webp 1024w,/2022/07/food-photography-ai/avocado.png 1632w" src="avocado.png"
         alt="DALL-E demo, via OpenAI."/> <figcaption>
            <p>DALL-E demo, <a href="https://openai.com/blog/dall-e/">via OpenAI</a>.</p>
        </figcaption>
</figure>

<p>Although adding &ldquo;professional food photography&rdquo; alone works to generate realistic food dishes, you can combine it with a prompt for other shapes, even abstract and absurd shapes that shouldn&rsquo;t be logically possible for certain foods.</p>
<p>Let&rsquo;s start with a basic shape, such as a heart. If you Google &ldquo;X heart&rdquo; for any food you will almost always get results (Instagram loves heart-shaped food). What about asking for a heart shape for a dish that by construction <em>can&rsquo;t</em> be in the shape of a heart, such as a taco?</p>
<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/1b8510d6_hu_80b30f3aff007114.webp 320w,/2022/07/food-photography-ai/1b8510d6_hu_ff9a9e0026606943.webp 768w,/2022/07/food-photography-ai/1b8510d6.png 768w" src="1b8510d6.png"
         alt="a taco in the shape of a heart, professional food photography (DALL-E 2)"/> <figcaption>
            <p><em>a taco in the shape of a heart, professional food photography</em> (DALL-E 2)</p>
        </figcaption>
</figure>

<p>DALL-E 2 is still able to work around it, whether by creating a new type of taco shell or by employing optical illusions. And occasionally it cheats, as in the case of the top-right image.</p>
<p>Emoji are also valid options as shapes, and unlike hearts they&rsquo;re far less common in Google Images. Let&rsquo;s take a <a href="https://en.wikipedia.org/wiki/Cobb_salad">Cobb salad</a>, which has specific ingredients. Can DALL-E arrange them into a specific emoji?</p>
<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/05b432b0_hu_53c77ca8bbb69e93.webp 320w,/2022/07/food-photography-ai/05b432b0_hu_12c4718ccdfafa56.webp 768w,/2022/07/food-photography-ai/05b432b0.png 768w" src="05b432b0.png"
         alt="a Cobb salad in the shape of the robot emoji, professional food photography (DALL-E 2)"/> <figcaption>
            <p><em>a Cobb salad in the shape of the robot emoji, professional food photography</em> (DALL-E 2)</p>
        </figcaption>
</figure>

<p>The answer is yes.</p>
<p>But we can get more absurd. For example, consider a <a href="https://en.wikipedia.org/wiki/Rubik%27s_Cube">Rubik&rsquo;s cube</a>. Can DALL-E coerce obviously noncubic foods such as a peanut butter sandwich into one?</p>
<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/32f205b8_hu_ea2738d3e269d3fd.webp 320w,/2022/07/food-photography-ai/32f205b8_hu_ebcf9464a623baa7.webp 768w,/2022/07/food-photography-ai/32f205b8.png 768w" src="32f205b8.png"
         alt="a peanut butter and jelly sandwich in the shape of a Rubik&rsquo;s cube, professional food photography (DALL-E 2)"/> <figcaption>
            <p><em>a peanut butter and jelly sandwich in the shape of a Rubik&rsquo;s cube, professional food photography</em> (DALL-E 2)</p>
        </figcaption>
</figure>

<p>The answer is a resounding <strong>yes</strong>.</p>
<p><a href="https://en.wikipedia.org/wiki/Latte_art">Latte art</a>, or drawing images in the milk foam of a latte, is a popular subset of food photography. But what about <em>3D</em> latte art that goes outside the beverage?</p>
<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/2dbba05c_hu_361bc33fdf94952c.webp 320w,/2022/07/food-photography-ai/2dbba05c_hu_bf4285c0c40a89ab.webp 768w,/2022/07/food-photography-ai/2dbba05c.png 768w" src="2dbba05c.png"
         alt="A Frappuccino in the shape of a swan, professional food photography (DALL-E 2)"/> <figcaption>
            <p><em>A Frappuccino in the shape of a swan, professional food photography</em> (DALL-E 2)</p>
        </figcaption>
</figure>

<p>What about going beyond the constraints of mere mortal perception of space and time? Can we assign food <a href="https://en.wikipedia.org/wiki/Non-Euclidean_geometry">non-Euclidean properties</a>?</p>
<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/14d08e41_hu_147cc6235a83cad2.webp 320w,/2022/07/food-photography-ai/14d08e41_hu_7863d76a98026918.webp 768w,/2022/07/food-photography-ai/14d08e41.png 768w" src="14d08e41.png"
         alt="a Cobb salad in the shape of non-Euclidean geometry, professional food photography (DALL-E 2)"/> <figcaption>
            <p><em>a Cobb salad in the shape of non-Euclidean geometry, professional food photography</em> (DALL-E 2)</p>
        </figcaption>
</figure>

<p>Screw it, we can go <strong>further beyond</strong>: let&rsquo;s just make some five-dimensional food.</p>
<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/73cbb163_hu_9af0acd366b6063b.webp 320w,/2022/07/food-photography-ai/73cbb163_hu_f16baf5a2d92b066.webp 768w,/2022/07/food-photography-ai/73cbb163.png 768w" src="73cbb163.png"
         alt="A Hamburger in the shape of five dimensions, professional food photography (DALL-E 2)"/> <figcaption>
            <p><em>A Hamburger in the shape of five dimensions, professional food photography</em> (DALL-E 2)</p>
        </figcaption>
</figure>

<p>As a puny three-dimensional being, I&rsquo;ll just take DALL-E&rsquo;s word for it.</p>
<h2 id="anthropomorphic-foods">Anthropomorphic Foods</h2>
<p>Those who were terminally online during the early days of the internet may remember when a grilled cheese depicting the Virgin Mary <a href="https://www.nbcnews.com/id/wbna6511148">sold for the then-ridiculous sum of $28,000</a>. But with AI, we can do a lot more with foods that can look like people and public figures (within the constraints of OpenAI&rsquo;s <a href="https://labs.openai.com/policies/content-policy">content policy</a>).</p>
<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/b98f2197_hu_a4b27f17505e616b.webp 320w,/2022/07/food-photography-ai/b98f2197_hu_e0d2fefbe569245e.webp 768w,/2022/07/food-photography-ai/b98f2197.png 768w" src="b98f2197.png"
         alt="A Spongebob Squarepants scrambled eggs dish that resembles Spongebob Squarepants, professional food photography (DALL-E 2)"/> <figcaption>
            <p><em>A Spongebob Squarepants scrambled eggs dish that resembles Spongebob Squarepants, professional food photography</em> (DALL-E 2)</p>
        </figcaption>
</figure>

<p>Never mind, this avenue of food content is disturbing. Creative, but disturbing.</p>
<h2 id="a-different-kind-of-fusion-cuisine">A Different Kind of Fusion Cuisine</h2>
<p>I demonstrated earlier that the <code>a X in the shape of a Y</code> prompt addition can be used to change the shape of food dishes. But what if <em>Y</em> is another dish? Let&rsquo;s try a Cobb salad and a hamburger:</p>
<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/52f750cb_hu_e3035620358ccb2.webp 320w,/2022/07/food-photography-ai/52f750cb_hu_d92b85fe2e07b2b1.webp 768w,/2022/07/food-photography-ai/52f750cb.png 768w" src="52f750cb.png"
         alt="a Cobb salad in the shape of a hamburger, professional food photography (DALL-E 2)"/> <figcaption>
            <p><em>a Cobb salad in the shape of a hamburger, professional food photography</em> (DALL-E 2)</p>
        </figcaption>
</figure>

<p>Yes, it fuses them together! Although I am very afraid to ask what the ingredients actually are.</p>
<p>With that, it is now time to commit cruel culinary crimes!</p>
<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/2a9ae444_hu_894eb630ec434c33.webp 320w,/2022/07/food-photography-ai/2a9ae444_hu_d5381f7275259a34.webp 768w,/2022/07/food-photography-ai/2a9ae444.png 768w" src="2a9ae444.png"
         alt="a hot dog in the shape of a pasta dish, professional food photography (DALL-E 2)"/> <figcaption>
            <p><em>a hot dog in the shape of a pasta dish, professional food photography</em> (DALL-E 2)</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/a201f1f8_hu_a6022d54a552db5c.webp 320w,/2022/07/food-photography-ai/a201f1f8_hu_b061912271debc2b.webp 768w,/2022/07/food-photography-ai/a201f1f8.png 768w" src="a201f1f8.png"
         alt="an ice cream sundae in the shape of curry, professional food photography (DALL-E 2)"/> <figcaption>
            <p><em>an ice cream sundae in the shape of curry, professional food photography</em> (DALL-E 2)</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/d67f53ce_hu_4969669ae4c8e5e6.webp 320w,/2022/07/food-photography-ai/d67f53ce_hu_65591106d35acce7.webp 768w,/2022/07/food-photography-ai/d67f53ce.png 768w" src="d67f53ce.png"
         alt="A chocolate cake in the shape of sushi, professional food photography (DALL-E 2)"/> <figcaption>
            <p><em>A chocolate cake in the shape of sushi, professional food photography</em> (DALL-E 2)</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/12851ff4_hu_f06365f99dce8620.webp 320w,/2022/07/food-photography-ai/12851ff4_hu_93abe0f7b7d1b01.webp 768w,/2022/07/food-photography-ai/12851ff4.png 768w" src="12851ff4.png"
         alt="a pizza in the shape of a cronut, professional food photography (DALL-E 2)"/> <figcaption>
            <p><em>a pizza in the shape of a cronut, professional food photography</em> (DALL-E 2)</p>
        </figcaption>
</figure>

<p>The possibilities are endless!</p>
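<p>If you want to enumerate those possibilities programmatically, the prompts are just string templates. Here&rsquo;s a minimal sketch; the dish and shape lists are my own illustrative picks, not an exhaustive menu:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Batch out "X in the shape of Y" prompts for fusion experiments.
# The dish/shape lists are illustrative; swap in your own.
from itertools import product

dishes = ["a Cobb salad", "a hot dog", "an ice cream sundae", "a pizza"]
shapes = ["a hamburger", "a pasta dish", "curry", "a cronut"]

prompts = [
    f"{dish} in the shape of {shape}, professional food photography"
    for dish, shape in product(dishes, shapes)
]
print(len(prompts))  # 16 candidate prompts to paste into DALL-E 2
</code></pre></div>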
<h2 id="the-future-of-ai-food-generation">The Future of AI Food Generation</h2>
<p>DALL-E 2 is still limited-access (and can be expensive), so let&rsquo;s compare it with DALL-E mini/<a href="https://www.craiyon.com">Craiyon</a>, which provides AI image generation freely and easily. Also released recently, <a href="https://nyx-ai.github.io/stylegan2-flax-tpu/">This Food Does Not Exist</a> allows for the generation of certain types of food like cookies and sushi at high resolutions, albeit with no customization. For fairness, let&rsquo;s look directly at DALL-E mega (via <a href="https://github.com/kuprel/min-dalle">min-dalle</a>), a newer and larger version of the mini model with better image quality.</p>
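<p>If you want to try DALL-E mega yourself, min-dalle exposes a small Python API. A minimal sketch, with arguments based on the project&rsquo;s README (names may differ across versions):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Generate one of the DALL-E Mega comparison grids below with min-dalle.
import torch
from min_dalle import MinDalle

model = MinDalle(is_mega=True, is_reusable=True,
                 device="cuda", dtype=torch.float32)

image = model.generate_image(
    text="a pizza in the shape of a cronut, professional food photography",
    seed=0,       # the same fixed seed as the grids below
    grid_size=3,  # returns a 3x3 grid as a single PIL image
)
image.save("cronut-mega.png")
</code></pre></div>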
<p>However, DALL-E mega definitely can&rsquo;t compete with DALL-E 2 for this use case:</p>
<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/robot-mega_hu_fba8ba4e6f3be900.webp 320w,/2022/07/food-photography-ai/robot-mega.png 512w" src="robot-mega.png"
         alt="a Cobb salad in the shape of the robot emoji, professional food photography (DALL-E Mega, seed = 0)"/> <figcaption>
            <p><em>a Cobb salad in the shape of the robot emoji, professional food photography</em> (DALL-E Mega, <code>seed = 0</code>)</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2022/07/food-photography-ai/cronut-mega_hu_57af28f3522f8b02.webp 320w,/2022/07/food-photography-ai/cronut-mega.png 512w" src="cronut-mega.png"
         alt="a pizza in the shape of a cronut, professional food photography (DALL-E Mega, seed = 0)"/> <figcaption>
            <p><em>a pizza in the shape of a cronut, professional food photography</em> (DALL-E Mega, <code>seed = 0</code>)</p>
        </figcaption>
</figure>

<p>There&rsquo;s obviously a lot more that can be done here in terms of prompt optimization and customization, and I hope this post has given ideas to both AI image generation users and foodies who want to make something unique. The DALL-E 2 Discord has used similar prompts, such as a <a href="https://www.reddit.com/r/dalle2/comments/vjhsyr/a_michelin_star_dish_of_a_roasted_minion/">Minion dish</a> with the prompt keyword <code>Michelin</code>, to further increase food quality (in my testing it did not work well for the prompts in this post, as it unsurprisingly makes the portions too small). Even when DALL-E 2 becomes more accessible, or a newer model with better image quality is released, AI-generated food pics won&rsquo;t make chefs or social media foodies obsolete.</p>
<p>In the meantime, I&rsquo;ve decided to experiment by making a new social media account devoted to sharing esoteric AI-generated food: Weird AI Chef! Please follow <a href="https://twitter.com/weirdaichef">@weirdaichef on Twitter</a> and <a href="https://www.instagram.com/weirdaichef/">@weirdaichef on Instagram</a>, as they have <em>many</em> more absurd AI image generations not used in this post, with more to come!</p>
<p><em>Note: None of the DALL-E 2 generations used in this blog post were cherry-picked: the &ldquo;professional food photography&rdquo; prompt is indeed that consistent, and the fail states aren&rsquo;t too terrible either.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Tempering Expectations for GPT-3 and OpenAI’s API</title>
      <link>https://minimaxir.com/2020/07/gpt3-expectations/</link>
      <pubDate>Sat, 18 Jul 2020 10:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2020/07/gpt3-expectations/</guid>
      <description>GPT-3 is indeed a large step forward for AI text-generation, but there are very many caveats with the popular demos and use cases.</description>
<content:encoded><![CDATA[<p>On May 29th, <a href="https://openai.com">OpenAI</a> released <a href="https://arxiv.org/abs/2005.14165">a paper</a> on GPT-3, their next iteration of <a href="http://jalammar.github.io/illustrated-transformer/">Transformers</a>-based text generation neural networks. Most notably, the new model has 175 billion parameters, compared to the 1.5 billion of the previous <a href="https://openai.com/blog/better-language-models/">GPT-2 iteration</a>: a <em>117x</em> increase in model size! Because GPT-3 is so large, it can&rsquo;t be run on conventional computers, and it only became publicly available as part of the <a href="https://beta.openai.com">OpenAI API</a>, which entered an invite-only beta soon after the paper was released and will later be offered as a paid product.</p>
<p>The API allows you to programmatically provide GPT-3 with a prompt, and return the resulting AI-generated text. For example, you could invoke the API with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">curl https://api.openai.com/v1/engines/davinci/completions <span class="se">\
</span></span></span><span class="line"><span class="cl">-H <span class="s2">&#34;Content-Type: application/json&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">-H <span class="s2">&#34;Authorization: Bearer &lt;SECRET_KEY&gt;&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">-d <span class="s1">&#39;{&#34;prompt&#34;: &#34;This is a test&#34;, &#34;max_tokens&#34;: 5}&#39;</span>
</span></span></code></pre></div><p>And get this back from the API, where the <code>text</code> is the generated text following up from the <code>prompt</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="s2">&#34;cmpl-&lt;ID&gt;&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;object&#34;</span><span class="p">:</span> <span class="s2">&#34;text_completion&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;created&#34;</span><span class="p">:</span> <span class="mi">1586839808</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;model&#34;</span><span class="p">:</span> <span class="s2">&#34;davinci:2020-05-03&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;choices&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34; of reading speed. You&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;index&#34;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;logprobs&#34;</span><span class="p">:</span> <span class="kc">null</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;finish_reason&#34;</span><span class="p">:</span> <span class="s2">&#34;length&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>As someone who has spent a very large amount of time working with GPT-2 while developing tools such as <a href="https://github.com/minimaxir/gpt-2-simple">gpt-2-simple</a> and <a href="https://github.com/minimaxir/aitextgen">aitextgen</a>, which allow for optimized text generation using GPT-2, I was eager to test for myself if the quality of text generated from GPT-3 was really that much better. Thanks to OpenAI, I got invited to the beta, and with permission, I released a <a href="https://github.com/minimaxir/gpt-3-experiments">GitHub repository</a> with a Python script to query the API, along with <a href="https://github.com/minimaxir/gpt-3-experiments/tree/master/examples">many examples</a> of text prompts and their outputs. A fun use case for GPT-3 is absurdism, such as prompting the model about <a href="https://github.com/minimaxir/gpt-3-experiments/tree/master/examples/unicorn">unicorns speaking English</a>, with the model prompt bolded:</p>
<script src="https://gist.github.com/minimaxir/ac362cc81691eb92aa1b6a5c32d94ce3.js"></script>
<p>I also fed <a href="https://github.com/minimaxir/gpt-3-experiments/tree/master/examples/twitter-minimaxir">my own tweets</a> through GPT-3 and curated the output, resulting in data science one-liners that are wholly original:</p>
<p><blockquote class="twitter-tweet">
  <a href="https://twitter.com/minimaxir/status/1282147674645565441"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet">
  <a href="https://twitter.com/minimaxir/status/1281015343205539847"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet">
  <a href="https://twitter.com/minimaxir/status/1280698121262071809"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</p>
<p>There hadn&rsquo;t been too much GPT-3 hype after the initial announcement, outside of a few blog posts from <a href="https://www.gwern.net/GPT-3">Gwern</a> and <a href="http://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html">Kevin Lacker</a>. That changed when a <a href="https://twitter.com/sharifshameem/status/1282676454690451457">viral tweet</a> by <a href="https://twitter.com/sharifshameem">Sharif Shameem</a> showed what GPT-3 can <em>really</em> do:</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/sharifshameem/status/1282676454690451457"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>Later, he made a <a href="https://twitter.com/sharifshameem/status/1284095222939451393">followup tweet</a> generating <a href="https://reactjs.org">React</a> code with GPT-3:</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/sharifshameem/status/1284095222939451393"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>That demo got the attention of venture capitalists. And when a cool-looking magical thing gets the attention of venture capitalists, discourse tends to spiral out of control. Now, there are <em>many</em> <a href="https://twitter.com/search?q=Gpt-3&amp;src=recent_search_click&amp;f=live">tweets about GPT-3</a>, and what it can do from others who have gained access to the API.</p>
<p>Hype aside, let&rsquo;s look at the pragmatic realities of the model. GPT-3 is indeed a large step forward for AI text-generation, but there are very many caveats with the popular demos and use cases that must be addressed.</p>
<h2 id="an-overview-of-gpt-3">An Overview of GPT-3</h2>
<p>GPT-3 itself, like most neural network models, is a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a> where it&rsquo;s impossible to see <em>why</em> it makes its decisions, so let&rsquo;s think about GPT-3 in terms of inputs and outputs.</p>
<p>Actually, why not let GPT-3 tell its own story? Hey GPT-3, how do you work?</p>
<script src="https://gist.github.com/minimaxir/596b880d2275578104a0b7c13167a3c0.js"></script>
<p>Close, but not quite!</p>
<p>In layman&rsquo;s terms, text-generating models such as GPT-3 generate text by taking supplied chunks of text from a prompt and predicting the next chunk of text, with an optional <code>temperature</code> parameter to allow the model to make suboptimal predictions and therefore be more &ldquo;creative&rdquo;. Then the model makes another prediction from the previous chunks including the new chunk, and repeats until it hits a specified length or a token that tells the model to stop generating. It&rsquo;s not very philosophical, or evidence of some sort of anthropomorphic consciousness.</p>
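<p>To make that concrete, here&rsquo;s a minimal sketch of temperature-scaled sampling over a model&rsquo;s raw next-token scores (logits); this illustrates the general technique, not OpenAI&rsquo;s actual implementation:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Temperature-scaled sampling: a temperature of 0 degenerates to always
# picking the top prediction; higher temperatures flatten the distribution,
# allowing "suboptimal" (more creative) picks.
import numpy as np

def sample_next_token(logits, temperature=0.7):
    if temperature == 0.0:
        return int(np.argmax(logits))
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())  # softmax, shifted for stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
</code></pre></div>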
<p>GPT-3 has two notable improvements over GPT-2 aside from its size: it allows generation of text twice the length of GPT-2 (about 10 paragraphs of English text total), and prompts to the model better steer the generation of the text toward the desired domain (due to few-shot learning). For example, if you prompt the model with an example of React code and then tell it to generate more React code, you&rsquo;ll get much better results than if you had given it the simple instruction alone.</p>
<p>Therefore, there are two high-level use cases for GPT-3: the <strong>creative</strong> use case, fun text generation at high <code>temperature</code> as GPT-2 was often used for, and the <strong>functional</strong> use case, for specific <a href="https://en.wikipedia.org/wiki/Natural_language_processing">NLP</a>-based tasks such as webpage mockups, with a <code>temperature</code> of <code>0.0</code>.</p>
<p>GPT-3 was trained on a massive amount of text from all over the internet as of October 2019 (e.g. it does not know about <a href="https://www.cdc.gov/coronavirus/2019-ncov/index.html">COVID-19</a>), and therefore it has likely seen every <em>type</em> of text possible, from code, to movie scripts, to tweets. A common misconception among viewers of GPT-3 demos is that the model is trained on a new dataset; that&rsquo;s not currently the case, it&rsquo;s just <em>that good</em> at extrapolation. As an example, despite the <a href="https://en.wikipedia.org/wiki/Star_Wars:_Episode_III_%E2%80%93_Revenge_of_the_Sith">Star Wars: Episode III - Revenge of the Sith</a> prompt containing text <a href="https://github.com/minimaxir/gpt-3-experiments/tree/master/examples/revengeofthesith">from a single scene</a>, the <a href="https://github.com/minimaxir/gpt-3-experiments/blob/master/examples/revengeofthesith/output_0_7.md">0.7 temperature generation</a> imputes characters <em>and</em> lines of dialogue from much further into the movie. (The largest GPT-2 model could do that, but nowhere near as robustly.)</p>
<p>The real metagame with GPT-3 is engineering and optimizing complex prompts which can <em>reliably</em> coerce outputs into what you want. And with that brings a whole host of complexity and concerns.</p>
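<p>As an illustration of that kind of prompt engineering, here&rsquo;s a sketch of a few-shot prompt against the same completions endpoint as the <code>curl</code> example above; the translation prompt and the environment variable name are my own stand-ins:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># A few-shot prompt: in-prompt examples steer the model toward the desired
# output format; temperature 0.0 suits the functional use case.
import os
import requests

prompt = (
    "English: Hello\nFrench: Bonjour\n"
    "English: Thank you\nFrench: Merci\n"
    "English: Good night\nFrench:"
)

resp = requests.post(
    "https://api.openai.com/v1/engines/davinci/completions",
    headers={"Authorization": "Bearer " + os.environ["OPENAI_SECRET_KEY"]},
    json={"prompt": prompt, "max_tokens": 10, "temperature": 0.0, "stop": "\n"},
)
print(resp.json()["choices"][0]["text"])
</code></pre></div>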
<h2 id="gpt-3-caveats">GPT-3 Caveats</h2>
<p>Despite everything above, I don&rsquo;t believe that GPT-3 is a new paradigm or an <a href="https://en.wikipedia.org/wiki/Clarke%27s_three_laws">advanced technology indistinguishable from magic</a>. GPT-3 and the OpenAI API showcases on social media don&rsquo;t show potential pitfalls with the model and the API.</p>
<p>Hey GPT-3, what problems do you have?</p>
<script src="https://gist.github.com/minimaxir/e49913a1e720da8d1c8e2d0f783468fa.js"></script>
<p>Sorry GPT-3, but I am a mean person.</p>
<h3 id="model-latency">Model Latency</h3>
<p>If you&rsquo;ve seen the demo videos, the model is <em>slow</em>: it can take a while for output to show up, and in the meantime the user is unsure if the model is broken or not. (There is a feature to allow streaming the model outputs as they are generated, which helps in creative cases but not in functional cases.)</p>
<p>I don&rsquo;t blame OpenAI for the slowness. A 175 billion parameter model is wayyy too big to fit on a single GPU for deployment. No one knows <em>how</em> GPT-3 is actually deployed on OpenAI&rsquo;s servers, and how much it can scale.</p>
<p>But the fact remains: if the model is too slow on the user end, it results in a bad user experience and might drive people away from GPT-3 to just do things themselves (e.g. Apple&rsquo;s Siri for iOS, where requests can take forever on a weak internet connection, so you give up and do it yourself).</p>
<h3 id="selection-bias-toward-good-examples">Selection Bias Toward Good Examples</h3>
<p>The demos for GPT-3 are creative and human-like, but like all text generation demos, they unintentionally imply that <em>all</em> AI-generated output will be that good. Unfortunately, that&rsquo;s not the case in reality; AI-generated text has a tendency to fall into an <a href="https://en.wikipedia.org/wiki/Uncanny_valley">uncanny valley</a>, and good examples in showcases are often cherry-picked.</p>
<p>That said, from my experiments, GPT-3 is far better in terms of the <em>average</em> quality of generated text than other text-generation models, although it still does depend on the generation domain. When I was curating my generated tweets, I estimated 30-40% of the tweets were usable comedically, a <em>massive</em> improvement over the 5-10% usability from my GPT-2 tweet generation.</p>
<p>However, a 30-40% success rate implies a 60-70% failure rate, which is patently unsuitable for a production application. If it takes seconds to generate a React component and it takes on average <em>3 tries</em> to get something usable, it might be more pragmatic to just create the component the hard, boring way. Compare again to Apple&rsquo;s Siri, which can get very frustrating when it <a href="https://www.reddit.com/r/SiriFail/">performs the wrong action</a>.</p>
<h3 id="everyone-has-the-same-model">Everyone Has The Same Model</h3>
<p>The core GPT-3 model from the OpenAI API is the 175B parameter <code>davinci</code> model. The GPT-3 demos on social media often hide the prompt, allowing for some mystique. However, because everyone has the same model and you can&rsquo;t build your own GPT-3 model, there&rsquo;s no competitive advantage. GPT-3 seed prompts can be reverse-engineered, which may become a rude awakening for entrepreneurs and the venture capitalists who fund them.</p>
<p>Corporate machine learning models are often distinguished from competitors&rsquo; models in the same field through training on private, proprietary data and bespoke model optimization for a given use case. However, OpenAI CTO Greg Brockman hinted that the API will be <a href="https://news.ycombinator.com/item?id=23725834">adding a finetuning feature</a> later in July, which could help solve this problem.</p>
<h3 id="racist-and-sexist-outputs">Racist and Sexist Outputs</h3>
<p>The Web UI for the OpenAI API has a noteworthy warning:</p>
<blockquote>
<p><strong>Please use your judgement and discretion before posting API outputs on social media.</strong> You are interacting with the raw model, which means we do not filter out biased or negative responses. With great power comes great responsibility.</p>
</blockquote>
<p>This is a reference to the <a href="https://openai.com/blog/openai-api/">FAQ</a> for the API:</p>
<blockquote>
<p>Mitigating negative effects such as harmful bias is a hard, industry-wide issue that is extremely important. Ultimately, our API models do exhibit biases (as shown in the GPT-3 paper) that will appear on occasion in generated text. Our API models could also cause harm in ways that we haven’t thought of yet.</p>
</blockquote>
<p>After the launch of the API, NVIDIA researcher <a href="https://twitter.com/AnimaAnandkumar">Anima Anandkumar</a> made a <a href="https://twitter.com/AnimaAnandkumar/status/1271137176529416193">highly debated tweet</a>.</p>
<p>During my GPT-3 experiments, I found that <a href="https://github.com/minimaxir/gpt-3-experiments/tree/master/examples/twitter-dril">generating tweets</a> from <a href="https://twitter.com/dril">@dril</a> (admittedly an edgy Twitter user) resulted in 4chan-level racism/sexism that I spent enormous amounts of time sanitizing, and it became more apparent at higher temperatures. It&rsquo;s especially important to avoid publishing offensive generated content when the generated text puts words in others&rsquo; mouths.</p>
<p><a href="https://twitter.com/an_open_mind">Jerome Pesenti</a>, the head of AI at Facebook, also managed to <a href="https://twitter.com/an_open_mind/status/1284487376312709120">trigger anti-Semitic tweets</a> from a GPT-3 app:</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/an_open_mind/status/1284487376312709120"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>Again, it depends on the domain. Would GPT-3 output racist or sexist React components? Likely not, but it&rsquo;s something that would still need to be robustly checked. OpenAI does appear to take these concerns seriously, and has implemented toxicity detectors for generated content in the Web UI, although not yet in the programmatic API.</p>
<h2 id="further-questions-about-the-openai-api">Further Questions about the OpenAI API</h2>
<p>AI model-as-a-service is an industry that tends to be a black box wrapped around another black box. Despite all the caveats, everything depends on how OpenAI exits beta and rolls out the API for production use. There are too many unknowns to even think about making money off of the OpenAI API, let alone building a startup based on it.</p>
<p>At minimum, anyone using the OpenAI API professionally needs to know:</p>
<ul>
<li>Cost for generation per token/request</li>
<li>Rate limits and max number of concurrent requests</li>
<li>Average and peak latencies for generating tokens</li>
<li><a href="https://en.wikipedia.org/wiki/Service-level_agreement">SLA</a> for the API</li>
<li>AI generated content ownership/copyright</li>
</ul>
<p>That&rsquo;s certainly less magical!</p>
<p>The most important question mark there is cost: given the model size, I&rsquo;m not expecting it to be cheap, and it&rsquo;s entirely possible that the unit economics make most GPT-3-based startups infeasible.</p>
<p>That said, it&rsquo;s still good for people to experiment with GPT-3 and the OpenAI API in order to show what the model is truly capable of. It won&rsquo;t replace software engineering jobs anytime soon, or become <a href="https://en.wikipedia.org/wiki/Skynet_%28Terminator%29">Skynet</a>, or whatever. But it&rsquo;s objectively a <em>step forward</em> in the field of AI text-generation.</p>
<p>What about GPT-2? Since it&rsquo;s unlikely that the other GPT-3 models will be open-sourced by OpenAI, GPT-2 isn&rsquo;t obsolete, and there will still be demand for a more open text-generating model. However, I confess that the success of GPT-3 has <a href="https://twitter.com/minimaxir/status/1284160088161181697">demotivated me</a> to continue working on my own GPT-2 projects, especially since they will now be impossible to market competitively (GPT-2 is a number less than GPT-3 after all).</p>
<p>All said, I&rsquo;d be glad to use GPT-3 and the OpenAI API for both personal and professional projects once it&rsquo;s out of beta, provided that the terms of use for the API are reasonable, and that the hype levels off enough that said projects can actually stand out.</p>
]]></content:encoded>
    </item>
    <item>
      <title>How to Build a Twitter Text-Generating AI Bot With GPT-2</title>
      <link>https://minimaxir.com/2020/01/twitter-gpt2-bot/</link>
      <pubDate>Thu, 16 Jan 2020 08:00:00 -0800</pubDate>
      <guid>https://minimaxir.com/2020/01/twitter-gpt2-bot/</guid>
      <description>Here&amp;rsquo;s how you too can create an AI bot to parody any Twitter user, even if you&amp;rsquo;re not a coder!</description>
      <content:encoded><![CDATA[<p><a href="https://openai.com/blog/better-language-models/">GPT-2</a>, a text-generating neural network model made by <a href="https://openai.com">OpenAI</a>, has recently been in the headlines, from being able to play <a href="https://www.aidungeon.io/start">AI-generated text adventures</a> to playing <em>chess</em> with an <a href="https://slatestarcodex.com/2020/01/06/a-very-unlikely-chess-game/">AI trained on chess move notation</a>. However, I initially built <a href="https://github.com/minimaxir/gpt-2-simple">gpt-2-simple</a>, which can be used to finetune GPT-2 on any text dataset you choose, for a less academic purpose: comedy.</p>
<p>Over the past month, <a href="https://twitter.com/">Twitter</a> account <a href="https://twitter.com/dril_gpt2">@dril_gpt2</a>, an AI parody by <a href="https://twitter.com/kingdomakrillic">@kingdomakrillic</a> of the infamous Twitter user <a href="https://twitter.com/dril">@dril</a>, <a href="https://twitter.com/dril_gpt2/status/1208597102181408771">used</a> my <a href="https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce">Colaboratory Notebook</a> for finetuning GPT-2 on dril&rsquo;s tweets using gpt-2-simple to generate human-curated tweets which push the limits of the <a href="https://en.wikipedia.org/wiki/Turing_test">Turing Test</a>:</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/dril_gpt2/status/1215760729095016449"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet">
  <a href="https://twitter.com/dril_gpt2/status/1215834913888460800"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>These tweets are <a href="https://twitter.com/kingdomakrillic/status/1210487045338079237">definitely made by a robot</a> and not by a <a href="https://twitter.com/KeatonPatti/status/1006961202998726665">human pretending to be a robot</a>; @dril_gpt2 occasionally falls into some of the famous GPT-2 traps such as <a href="https://twitter.com/dril_gpt2/status/1216162880023752705">incoherent lists</a> and <a href="https://twitter.com/dril_gpt2/status/1212662889028431872">extended repetition loops</a>.</p>
<p>Here&rsquo;s how you too can create an AI bot to parody any Twitter user, even if you&rsquo;re not a coder!</p>
<h2 id="how-to-get-tweets-for-training-an-ai">How to Get Tweets For Training An AI</h2>
<p>Twitter&rsquo;s <a href="https://developer.twitter.com/en.html">API</a> famously limits users to retrieving <a href="https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline">only the latest 3,200 tweets</a> from a given user, which is not nearly enough input data for training a good AI. Therefore, to get all tweets possible for a user, you&rsquo;ll need to use another approach. The Python package <a href="https://github.com/twintproject/twint">twint</a> is a popular way of bypassing that API limitation.</p>
<p>I&rsquo;ve <a href="https://github.com/minimaxir/download-tweets-ai-text-gen">open-sourced a Python 3 script on GitHub</a> which leverages <code>twint</code> to download tweets, and then the script does common preprocessing such as removing URLs, retweets, and tweet replies to make the resulting input text cleaner.</p>
<p>First, in a terminal, install the Python script dependencies:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">pip3 install <span class="nv">twint</span><span class="o">==</span>2.1.4 fire tqdm
</span></span></code></pre></div><p>Then download the <a href="https://raw.githubusercontent.com/minimaxir/download-tweets-ai-text-gen/master/download_tweets.py">download_tweets.py script</a>.</p>
<p>The script is interacted with via a command line interface. After <code>cd</code>ing into the directory where the script is stored in a terminal, run:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">python3 download_tweets.py &lt;twitter_username&gt;
</span></span></code></pre></div><p>For example, if you want to download all tweets (sans retweets/replies) from <a href="https://twitter.com/dril">@dril</a>, run:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">python3 download_tweets.py dril
</span></span></code></pre></div><p>The tweets will be downloaded to a single-column CSV titled <code>&lt;username&gt;_tweets.csv</code>, which is the ideal format for training with an AI.</p>
<figure>

    <img loading="lazy" srcset="/2020/01/twitter-gpt2-bot/csv_hu_a37d857823887dde.webp 320w,/2020/01/twitter-gpt2-bot/csv_hu_eb48a54daaf98315.webp 768w,/2020/01/twitter-gpt2-bot/csv.png 972w" src="csv.png"/> 
</figure>

<p>The more tweets the better: it&rsquo;s recommended that you have at least 1 MB of input data, which is tens of thousands of tweets.</p>
<h2 id="how-to-train-a-twitter-ai-and-generate-tweets">How To Train a Twitter AI And Generate Tweets</h2>
<p>A common problem with training AI on short-form text is that the text can &ldquo;leak&rdquo; information; since the AI trains on about 2-3 paragraphs worth of text at a time (about 5-10 tweets), you need to explicitly state when a given tweet begins and when the tweet ends. To fix this issue, <a href="https://github.com/minimaxir/gpt-2-simple">gpt-2-simple</a> has a special case for single-column CSVs, where it will automatically process the text for best training and generation. (i.e. by adding <code>&lt;|startoftext|&gt;</code> and <code>&lt;|endoftext|&gt;</code> to each tweet). This workflow will also handle multi-line tweets correctly as their own entity.</p>
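<p>Conceptually, that preprocessing looks something like this sketch (not gpt-2-simple&rsquo;s actual internals):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Wrap each tweet from a single-column CSV with explicit start/end tokens
# so the model learns where one tweet begins and ends.
import csv

def csv_to_training_text(csv_path, out_path):
    with open(csv_path, encoding="utf-8") as f, \
         open(out_path, "w", encoding="utf-8") as out:
        for row in csv.reader(f):  # csv handles multi-line tweets via quoting
            out.write("&lt;|startoftext|&gt;" + row[0] + "&lt;|endoftext|&gt;\n")
</code></pre></div>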
<p>You can use <a href="https://colab.research.google.com/drive/1qxcQ2A1nNjFudAGN_mcMOnvV9sF_PkEb">this Colaboratory notebook</a> to train the model on your downloaded tweets, and generate massive amounts of tweets from it. The notebook itself has more instructions on how to feed the CSV created above as input data to the model.</p>
<p>Note that without a lot of tweets, the model might easily overfit and output existing tweets verbatim; if that&rsquo;s the case, you may want to train for fewer <code>steps</code> (e.g. 200-500). Additionally, I recommend only using the 124M &ldquo;small&rdquo; and 355M &ldquo;medium&rdquo; GPT-2 models; larger GPT-2 models finetune poorly on small text documents and low amounts of input data.</p>
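<p>If you&rsquo;d rather script it than use the notebook, the finetuning step boils down to something like this sketch with gpt-2-simple (the notebook&rsquo;s exact arguments may differ):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Finetune the 124M "small" GPT-2 on the downloaded tweets; fewer steps
# reduce the risk of overfitting on small datasets.
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="dril_tweets.csv",  # the CSV from download_tweets.py
              model_name="124M",
              steps=500,
              run_name="dril")
</code></pre></div>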
<p>Once the training is complete, you can generate tweets 1,000 at a time using this cell:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">gen_file</span> <span class="o">=</span> <span class="s1">&#39;gpt2_gentext_{:%Y%m</span><span class="si">%d</span><span class="s1">_%H%M%S}.txt&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">datetime</span><span class="o">.</span><span class="n">utcnow</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">gpt2</span><span class="o">.</span><span class="n">generate_to_file</span><span class="p">(</span><span class="n">sess</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">destination_path</span><span class="o">=</span><span class="n">gen_file</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">length</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">temperature</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">top_p</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">prefix</span><span class="o">=</span><span class="s1">&#39;&lt;|startoftext|&gt;&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">truncate</span><span class="o">=</span><span class="s1">&#39;&lt;|endoftext|&gt;&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">include_prefix</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">nsamples</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">batch_size</span><span class="o">=</span><span class="mi">20</span>
</span></span><span class="line"><span class="cl">                      <span class="p">)</span>
</span></span></code></pre></div><p>Run the cell as many times as you want for more tweets, and download them from the Files tab by right-clicking them! The notebook also has more information on how to tweak the generation parameters to make the tweets more crazy or more sane.</p>
<p>You can then open the generated <code>.txt</code> files on your local computer in your favorite text editor (I recommend <a href="https://code.visualstudio.com">Visual Studio Code</a>), and start curating however you see fit! Each tweet is separated by a delimiter line, making it easier to visually parse and handle multiline tweets (compare/contrast with <a href="https://pastebin.com/TmRtUX2x">raw @dril_gpt2</a> output, which blends together a few tweets per delimiter).</p>
<figure>

    <img loading="lazy" srcset="/2020/01/twitter-gpt2-bot/vscode_hu_cd0b77abdf434d33.webp 320w,/2020/01/twitter-gpt2-bot/vscode_hu_1b3a4b58f361e5eb.webp 768w,/2020/01/twitter-gpt2-bot/vscode_hu_be9ab83b672b4a8a.webp 1024w,/2020/01/twitter-gpt2-bot/vscode.png 1134w" src="vscode.png"/> 
</figure>

<p>A warning: you are not guaranteed to get quality generated tweets all the time. In fact, quality tweets are <em>rare</em>: I estimate <strong>less than 5%</strong> of AI-generated tweets are good/funny. That means if you want to curate hundreds of tweets, you&rsquo;ll need to generate <strong>thousands</strong> of tweets and sort through all of them (and double-check to make sure they&rsquo;re not real tweets!). It&rsquo;s not as bad as it sounds; in my opinion, it&rsquo;s kinda fun. But curation is its own skill, which is why human-curated tweets aren&rsquo;t a stain on the &ldquo;credibility&rdquo; of AI bots, and also why the ~1,500 tweets so far from @dril_gpt2 are very impressive.</p>
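<p>Part of that double-checking can be automated. Here&rsquo;s a sketch that splits a generated file into individual tweets, assuming gpt-2-simple&rsquo;s default delimiter of twenty <code>=</code> characters, and drops verbatim copies of real tweets (filenames are illustrative):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Split generated output on the delimiter line and discard any candidate
# that already exists verbatim in the training CSV.
import csv

with open("dril_tweets.csv", encoding="utf-8") as f:
    real_tweets = {row[0] for row in csv.reader(f)}

with open("gpt2_gentext.txt", encoding="utf-8") as f:
    candidates = [t.strip() for t in f.read().split("=" * 20) if t.strip()]

originals = [t for t in candidates if t not in real_tweets]
print(f"{len(originals)} of {len(candidates)} generated tweets are original.")
</code></pre></div>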
<p>Now, what do you do with these curated tweets?</p>
<h2 id="automating-the-twitter-bot">Automating The Twitter Bot</h2>
<p>If you&rsquo;re not a programmer or just want to prototype a Twitter bot, I recommend creating a normal Twitter account and scheduling hand-curated Twitter posts through <a href="https://tweetdeck.twitter.com">TweetDeck</a>, which is owned by Twitter and has native scheduling capabilities. You can space out tweets at given times, although it may be a hassle to do that for hundreds of tweets.</p>
<p>Otherwise, it is more efficient to write a script that makes tweets at periodic intervals for a bot account. Old tutorials around the internet recommend writing a script which posts to Twitter, sleeps for X hours, posts again, and repeats; that method does not easily scale to multiple bots, and it requires a full computer dedicated to it, which is not an efficient use of computing resources.</p>
<p>I&rsquo;ve <a href="https://github.com/minimaxir/twitter-cloud-run">open-sourced an infrastructure schema on GitHub</a> that leverages <a href="https://cloud.google.com">Google Cloud Platform</a> services to run hand-curated Twitter bots using a few modern technologies to minimize cost and computation; it&rsquo;s admittedly somewhat complicated, but it should give you an idea of how to best implement a Twitter bot. The repo also has instructions on how to set up a Twitter developer account.</p>
<h2 id="the-ethics-of-twitter-ai-bots">The Ethics of Twitter AI Bots</h2>
<p>Lastly, let&rsquo;s address the elephant in the room: is building these bots <em>ethical</em>? Modern AI has frequently been criticized on two fronts, both in how the input training data is obtained (e.g. obtaining faces for training facial recognition software), and how AI-generated media content is used (e.g. video deepfakes).</p>
<p><strong>I am not a lawyer</strong>, but for these AI-generated tweets, this is how I see it:</p>
<p>The input data is obtained from Twitter, but not through its API; it&rsquo;s downloaded through external web scraping via <code>twint</code>, which <em>never logs into the website</em>. This kind of workflow was ruled not to be abuse in the recent <a href="https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data">hiQ v. LinkedIn decision</a>, as the data is public. It&rsquo;s still a gray area; I would not <em>redistribute or commercialize the downloaded tweet data</em>, just use it as input data to the model.</p>
<p>The actual generated tweets themselves should be fine to use as you see fit. Whether AI-generated works infringe on the copyrights of their source material is an evolving area of both ethics and law, but at minimum these AI-generated tweets are both a transformative derivative work and a parody.</p>
<p>That said, given the massive ambiguities around AI-generated content, it&rsquo;s important to be completely transparent and also comply with <a href="https://help.twitter.com/en/rules-and-policies/parody-account-policy">Twitter rules on parody accounts</a>. For example, the Twitter bio for the bot should indicate:</p>
<ul>
<li>It&rsquo;s posting AI-generated tweets, made with GPT-2.</li>
<li>It&rsquo;s human-curated (or not).</li>
<li>The Twitter account of who maintains the bot.</li>
<li>The Twitter account(s) the bot is parodying / model is finetuned upon.</li>
</ul>
<p>Additionally, to avoid impersonation, the full name of the Twitter account should not be a verbatim match to the person being parodied (e.g. &ldquo;<em>X</em> but AI&rdquo; is fine), and the profile picture should be visually distinct from the human (e.g. my bots have a black-and-white profile picture). I would also not recommend making bots of people who are more newsworthy to avoid accusations of impersonation (e.g. do not make bots of politicians, <em>especially</em> <a href="https://twitter.com/realDonaldTrump">Donald Trump</a>).</p>
<p>There is still a lot of work that can be done in optimizing Twitter bots, both in terms of generated tweet quality and in ironing out the ethical logistics of maintaining an AI bot account. <strong>I do not believe that AI text-generating Twitter bots will make human Twitter accounts obsolete.</strong> It&rsquo;s a different <em>flavor</em> of comedy; not better, not worse. But there&rsquo;s still a lot that can be done to both expand and control the creativity of these Twitter bots, and I have a few active ideas in the pipeline to implement.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Visualizing Airline Flight Characteristics Between SFO and JFK</title>
      <link>https://minimaxir.com/2019/10/sfo-jfk-flights/</link>
      <pubDate>Wed, 23 Oct 2019 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2019/10/sfo-jfk-flights/</guid>
      <description>Box plots, when used correctly, can be a very fun way to visualize big data.</description>
<content:encoded><![CDATA[<p>In March, <a href="https://cloud.google.com">Google Cloud Platform</a> developer advocate <a href="https://twitter.com/felipehoffa">Felipe Hoffa</a> made a tweet about airline flight data from San Francisco International Airport (SFO) to Seattle-Tacoma International Airport (SEA):</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/felipehoffa/status/1111050585120206848"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>Particularly, his visualization of total elapsed times by airline caught my eye.</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu_33d3683c2d4a611e.webp 320w,/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu_1c609cadbe91671c.webp 768w,/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu_3135cb9a9bbaf839.webp 1024w,/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD.jpeg 1200w" src="D2s9oFtX4AEK6nD.jpeg"/> 
</figure>

<p>The overall time for flights from SFO to SEA goes up drastically starting in 2015, and this increase occurs across multiple airlines, implying that it&rsquo;s not an airline-specific problem. But what could intuitively cause that?</p>
<p>U.S. domestic airline data is <a href="https://www.transtats.bts.gov/Tables.asp?DB_ID=120">freely distributed</a> by the United States Department of Transportation. Normally it&rsquo;s a pain to work with as it&rsquo;s very large with millions of rows, but BigQuery makes playing with such data relatively easy, fun, and free. What other interesting factoids can be found?</p>
<h2 id="expanding-on-sfo--sea">Expanding on SFO → SEA</h2>
<p><a href="https://cloud.google.com/bigquery/">BigQuery</a> is a big data warehousing tool that allows you to query massive amounts of data. The table Hoffa created from the airline data (<code>fh-bigquery.flights.ontime_201903</code>) is 83.37 GB and 184 <em>million</em> rows. You can query 1 TB of data from it for free, but since BQ will only query against the fields you request, the queries in this post only consume about 2 GB each, allowing you to run them well within that quota.</p>
<p>Hoffa&rsquo;s query that runs on BigQuery looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="p">,</span><span class="w"> </span><span class="n">Reporting_Airline</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">ActualElapsedTime</span><span class="p">)</span><span class="w"> </span><span class="n">ActualElapsedTime</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">TaxiOut</span><span class="p">)</span><span class="w"> </span><span class="n">TaxiOut</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">TaxiIn</span><span class="p">)</span><span class="w"> </span><span class="n">TaxiIn</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">AirTime</span><span class="p">)</span><span class="w"> </span><span class="n">AirTime</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">c</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">flights</span><span class="p">.</span><span class="n">ontime_201903</span><span class="o">`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">Origin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SFO&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">Dest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SEA&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2010-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">DESC</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="w"> </span><span class="k">DESC</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">LIMIT</span><span class="w"> </span><span class="mi">1000</span><span class="w">
</span></span></span></code></pre></div><p>For each year and airline after 2010, the query calculates the average metrics specified for flights on the SFO → SEA route.</p>
<p>I made a few query and data visualization tweaks to what Hoffa did above, and here&rsquo;s the result showing the increase in elapsed flight time over the years for that route:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_e232d6eeab7fb66.webp 320w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_948de6a062caeaca.webp 768w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_6ae123a09b30ff70.webp 1024w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration.png 1800w" src="sfo_sea_flight_duration.png"/> 
</figure>

<p>Let&rsquo;s explain what&rsquo;s going on here.</p>
<p>A common trend in statistics is to avoid using <a href="https://en.wikipedia.org/wiki/Average">averages</a> as a summary statistic whenever possible, as averages can be overly affected by strong outliers (and with airline flights, there are definitely strong outliers!). The solution is to use a <a href="https://en.wikipedia.org/wiki/Median">median</a> instead, but there&rsquo;s one problem: medians are <a href="https://www.periscopedata.com/blog/medians-in-sql">computationally expensive</a> to calculate compared to simple averages. Despite the rise of &ldquo;big data&rdquo;, most databases and BI tools don&rsquo;t have a <code>MEDIAN</code> function that&rsquo;s as easy to use as an <code>AVG</code> function.</p>
<p>But BigQuery has an uncommon <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions#approx_quantiles">APPROX_QUANTILES</a> function, which calculates a specified number of quantiles; for example, if you call <code>APPROX_QUANTILES(ActualElapsedTime, 100)</code>, it will return an array of 101 quantile boundaries (the 0th through 100th percentiles), where the median is the 50th. BigQuery <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate-aggregation">uses</a> sketch-based approximate aggregation (the same family of techniques as <a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLog++</a>, which powers its approximate distinct counts) to calculate these quantiles efficiently even with millions of data points. And since we get other quantiles like the 5th, 25th, 75th, and 95th for free with this approach, we can visualize the <em>spread</em> of the data.</p>
<p>We can aggregate the data by month for more granular trends, and calculate the <code>APPROX_QUANTILES</code> in a subquery so it only has to be computed once. Hoffa also uploaded a more recent table (<code>fh-bigquery.flights.ontime_201908</code>) with a few additional months of data. To keep things simple, we&rsquo;ll skip aggregating by airline, since the metrics do not vary strongly between airlines. The final query ends up looking like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="n">standardSQL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="p">,</span><span class="w"> </span><span class="n">num_flights</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">5</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_5</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">25</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_25</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">50</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_50</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">75</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_75</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">95</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_95</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">num_flights</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">APPROX_QUANTILES</span><span class="p">(</span><span class="n">ActualElapsedTime</span><span class="p">,</span><span class="w"> </span><span class="mi">100</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">time_q</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">flights</span><span class="p">.</span><span class="n">ontime_201908</span><span class="o">`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">Origin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SFO&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">Dest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SEA&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2010-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="w">
</span></span></span></code></pre></div><p>The resulting data table:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/table_hu_98a96a00ebd58c2c.webp 320w,/2019/10/sfo-jfk-flights/table_hu_9eddda8c57624a2.webp 768w,/2019/10/sfo-jfk-flights/table.png 932w" src="table.png"/> 
</figure>

<p>In retrospect, since we&rsquo;re only focusing on one route, this isn&rsquo;t <em>big</em> data (the query aggregates only 64,356 flights in total), but the approach is still very useful if you need to analyze more of the airline data: the <code>APPROX_QUANTILES</code> function can handle <em>millions</em> of data points very quickly.</p>
<p>As a professional data scientist, one of my favorite types of data visualization is the <a href="https://en.wikipedia.org/wiki/Box_plot">box plot</a>, as it provides a way to visualize spread without being visually intrusive. Tools like <a href="https://www.r-project.org">R</a> and its <a href="https://ggplot2.tidyverse.org/index.html">ggplot2</a> package make constructing them <a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">very easy to do</a>.</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/geom_boxplot-1_hu_9a623aa679dafed1.webp 320w,/2019/10/sfo-jfk-flights/geom_boxplot-1_hu_67cf70ba510d1672.webp 768w,/2019/10/sfo-jfk-flights/geom_boxplot-1_hu_c405dbc443ae9fa8.webp 1024w,/2019/10/sfo-jfk-flights/geom_boxplot-1.png 1400w" src="geom_boxplot-1.png"/> 
</figure>

<p>By default, for each box representing a group, the thick line in the middle of the box is the median, the lower bound of the box is the 25th percentile, and the upper bound is the 75th percentile. The whiskers are normally a function of the <a href="https://en.wikipedia.org/wiki/Interquartile_range">interquartile range</a> (IQR), but if there&rsquo;s enough data, I prefer to use the 5th and 95th percentiles instead.</p>
<p>If you feed ggplot2&rsquo;s <code>geom_boxplot()</code> raw data, it will automatically calculate the corresponding statistics for the visualization; however, with big data, the data may not fit into memory, and as noted earlier, medians and other quantiles are computationally expensive to calculate. Because we precomputed the quantiles for every year and month with the query above, we can pass them to the plot explicitly. (The minor downside is that the plot will not show individual outliers.)</p>
<p>Additionally for box plots, I like to fill in each box with a different color corresponding to the year in order to better perceive data <a href="https://en.wikipedia.org/wiki/Seasonality">seasonality</a>. In the case of airline flights, seasonality is more literal: weather has an intuitive impact on flight times and delays, and during winter months there are also holidays which could affect airline logistics.</p>
<p>The resulting ggplot2 code looks like this:</p>
<pre tabindex="0"><code>plot &lt;-
  ggplot(df_tf,
         aes(
           x = date,
           ymin = q_5,
           lower = q_25,
           middle = q_50,
           upper = q_75,
           ymax = q_95,
           group = date,
           fill = year_factor
         )) +
  # stat = &#34;identity&#34; draws the boxes from the precomputed quantiles
  geom_boxplot(stat = &#34;identity&#34;, size = 0.3) +
  scale_fill_hue(l = 50, guide = FALSE) +
  scale_x_date(date_breaks = &#39;1 year&#39;, date_labels = &#34;%Y&#34;) +
  scale_y_continuous(breaks = pretty_breaks(6)) +
  labs(
    title = &#34;Distribution of Flight Times of Flights From SFO → SEA, by Month&#34;,
    subtitle = &#34;via US DoT. Box bounds are 25th/75th percentiles, whiskers are 5th/95th percentiles.&#34;,
    y = &#39;Total Elapsed Flight Time (Minutes)&#39;,
    fill = &#39;&#39;,
    caption = &#39;Max Woolf — minimaxir.com&#39;
  ) +
  theme(axis.title.x = element_blank())

ggsave(&#39;sfo_sea_flight_duration.png&#39;,
       plot,
       width = 6,
       height = 4)
</code></pre><p>And behold (again)!</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_e232d6eeab7fb66.webp 320w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_948de6a062caeaca.webp 768w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_6ae123a09b30ff70.webp 1024w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration.png 1800w" src="sfo_sea_flight_duration.png"/> 
</figure>

<p>You can see that the boxes do indeed trend upward after 2016, although the per-month medians are in flux. The spread is also increasing slowly over time. But what&rsquo;s interesting is the seasonality: pre-2016, the summer months (the &ldquo;middle&rdquo; of a given color) show a <em>very</em> significant drop in total time, which doesn&rsquo;t occur as strongly after 2016. Hmm.</p>
<h2 id="sfo-and-jfk">SFO and JFK</h2>
<p>Since I occasionally fly from San Francisco to New York City, it might be interesting (for completely selfish reasons) to track trends over time for flights between those areas. On the San Francisco side I chose SFO, and on the New York side I chose John F. Kennedy International Airport (JFK), as the data goes back very far for that route specifically, and I only want to look at a single airport at a time (instead of including other NYC airports such as Newark Liberty International Airport [EWR] and LaGuardia Airport [LGA]) to limit potential data confounders.</p>
<p>Fortunately, the code and query changes are minimal: in the query, change the <code>Origin</code> and <code>Dest</code> in the <code>WHERE</code> clause to the desired airports, and if you want to calculate a metric other than elapsed time, change the column passed to <code>APPROX_QUANTILES</code> accordingly.</p>
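<p>For example, a sketch of the inner aggregation for the SFO → JFK route (the outer <code>SELECT</code> that extracts the quantile offsets stays exactly the same; I also drop the date filter to use the route&rsquo;s full history):</p>
<pre tabindex="0"><code>#standardSQL
-- Same aggregation as before; only the route in the WHERE clause changed.
SELECT Year, Month,
  COUNT(*) AS num_flights,
  APPROX_QUANTILES(ActualElapsedTime, 100) AS time_q
FROM `fh-bigquery.flights.ontime_201908`
WHERE Origin = &#39;SFO&#39;
AND Dest = &#39;JFK&#39;
GROUP BY Year, Month
</code></pre>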
<p>Here&rsquo;s the chart of total elapsed time from SFO → JFK:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu_230bbe279f54a805.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu_c2e4a5d4b43ce24e.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu_2ea286d0e1e5d794.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration.png 1800w" src="sfo_jfk_flight_duration.png"/> 
</figure>

<p>And here&rsquo;s the reverse, from JFK → SFO:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu_4424fffe053981c8.webp 320w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu_ace5c5c4f6b82a9a.webp 768w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu_5d29021a8362404b.webp 1024w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration.png 1800w" src="jfk_sfo_flight_duration.png"/> 
</figure>

<p>Unlike the SFO → SEA charts, both charts are relatively flat over the years. However, when looking at seasonality, SFO → JFK dips in the summer and spikes during the winter (similar to the SFO → SEA route), while JFK → SFO <em>does the complete opposite</em>: it dips during the winter and spikes during the summer. I don&rsquo;t have any guesses as to what would cause that behavior.</p>
<p>How about flight speed (calculated by dividing flight distance by air time)? Have new advances in airline technology made planes faster and/or more efficient?</p>
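<p>Only the aggregated expression needs to change for this. A sketch, assuming the table&rsquo;s <code>Distance</code> column is in miles and <code>AirTime</code> is in minutes (as in the underlying US DoT data):</p>
<pre tabindex="0"><code>#standardSQL
-- Quantiles of ground speed in mph: miles divided by hours of air time.
-- SAFE_DIVIDE returns NULL for zero/NULL air times; quantiles ignore NULLs.
SELECT Year, Month,
  APPROX_QUANTILES(SAFE_DIVIDE(Distance, AirTime / 60), 100) AS speed_q
FROM `fh-bigquery.flights.ontime_201908`
WHERE Origin = &#39;SFO&#39;
AND Dest = &#39;JFK&#39;
GROUP BY Year, Month
</code></pre>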
<p><figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu_9bbb991fb8674a3f.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu_d4b14a4133ff0b82.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu_7266f1a8d449775b.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed.png 1800w" src="sfo_jfk_flight_speed.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu_86e7c997338f1404.webp 320w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu_1680890adf0e2d82.webp 768w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu_942e26ae57610365.webp 1024w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed.png 1800w" src="jfk_sfo_flight_speed.png"/> 
</figure>
</p>
<p>The expected flight speed for a commercial airplane, <a href="https://en.wikipedia.org/wiki/Cruise_%28aeronautics%29">per Wikipedia</a>, is 547-575 mph, so the metrics from SFO pass the sanity check. The metrics from JFK indicate there&rsquo;s about a 20% drop in flight speed, likely because westbound flights travel against the prevailing jet stream. Month-to-month, the speed trends are the inverse of the total elapsed time trends, which makes sense intuitively, as the two metrics are strongly negatively correlated.</p>
<p>Lastly, what about flight departure delays? Are airlines becoming more efficient, or has increased demand caused more congestion?</p>
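<p>Once more, only the metric changes; a sketch, assuming the table keeps the DoT <code>DepDelay</code> column (departure delay in minutes, negative when a flight leaves early):</p>
<pre tabindex="0"><code>#standardSQL
-- Quantiles of departure delay, in minutes.
SELECT Year, Month,
  APPROX_QUANTILES(DepDelay, 100) AS delay_q
FROM `fh-bigquery.flights.ontime_201908`
WHERE Origin = &#39;SFO&#39;
AND Dest = &#39;JFK&#39;
GROUP BY Year, Month
</code></pre>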
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu_82c27db5d16562f9.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu_b017086eec0a8d63.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu_3a8b126a0bfc0d76.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay.png 1800w" src="sfo_jfk_departure_delay.png"/> 
</figure>

<p>Wait a second. In this case, massive 2-3 hour flight delays are frequent enough that even just the 95th percentile skews the entire plot. Let&rsquo;s remove the whiskers in order to look at the trends more clearly.</p>
<p><figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu_c2eb7d1ad6cdf7.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu_86b737333ad479f4.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu_fd6ad349f57f4bbe.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers.png 1800w" src="sfo_jfk_departure_delay_nowhiskers.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu_1fecf180ed6a5feb.webp 320w,/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu_626df458859e27b7.webp 768w,/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu_58e7e7ba605d269e.webp 1024w,/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers.png 1800w" src="jfk_sfo_departure_delay_nowhiskers.png"/> 
</figure>
</p>
<p>A negative delay implies that the flight left early, so we can conclude that the median flight leaves slightly earlier than its stated departure time. Even without the whiskers, we can see major spikes at the 75th-percentile level during the summer months, and said spikes were especially bad in 2017 in both directions.</p>
<p>These box plots are only an <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">exploratory data analysis</a>. Determining the <em>cause</em> of changes in these flight metrics is difficult even for experts (I am definitely not an expert!) and may not even be possible from publicly-available data.</p>
<p>But there are still other fun things that can be done with the airline flight data, such as faceting trends by airline over time or including other airports, which is <a href="https://twitter.com/minimaxir/status/1115261670153048065"><em>interesting</em></a>.</p>
<hr>
<p><em>You can view the BigQuery queries used to get the data, plus the R and ggplot2 used to create the data visualizations, in <a href="http://minimaxir.com/notebooks/sfo-jfk-flights/">this R Notebook</a>. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/sfo-jfk-flights">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Experiments with Making Convincing AI-Generated Fake News</title>
      <link>https://minimaxir.com/2019/09/ctrl-fake-news/</link>
      <pubDate>Mon, 30 Sep 2019 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2019/09/ctrl-fake-news/</guid>
      <description>Can the CTRL model create the “fake news” OpenAI was concerned about? Let&amp;rsquo;s put it to the test.</description>
      <content:encoded><![CDATA[<p><span><style>
blockquote {
padding-right: 1.25em !important;
}
</style></span></p>
<figure>

    <img loading="lazy" srcset="/2019/09/ctrl-fake-news/ctrl_demo_ani_hu_86f5f0c7fcd30101.webp 320w,/2019/09/ctrl-fake-news/ctrl_demo_ani_hu_40bd66762dad736e.webp 768w,/2019/09/ctrl-fake-news/ctrl_demo_ani.gif 802w" src="ctrl_demo_ani.gif"/> 
</figure>

<p>When <a href="https://openai.com">OpenAI</a> announced <a href="https://openai.com/blog/better-language-models/">GPT-2</a>, a robust text-generating AI model, they explicitly only released smaller, less robust versions of the model out of fear that the large model could be used to generate fake news. However, since OpenAI described most of the technical decisions needed to create the model <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">in the corresponding paper</a>, it would be possible for others to create their own text generating Transformer models, and maybe even <em>improve</em> on GPT-2 (with a sufficient budget!).</p>
<p>In September, the <a href="https://www.salesforce.com">Salesforce</a> AI team released <a href="https://github.com/salesforce/ctrl">CTRL</a>, a Transformer-based text generating model with a twist; the model can generate text from specified domains by passing <strong>control codes</strong> to the model. What caught my interest was a demo of domain style transfer in the <a href="https://arxiv.org/abs/1909.05858">CTRL paper</a>:</p>
<figure>

    <img loading="lazy" srcset="/2019/09/ctrl-fake-news/ctrl_paper_hu_e4ef767ee7d9120b.webp 320w,/2019/09/ctrl-fake-news/ctrl_paper_hu_671c32cfb7fedff7.webp 768w,/2019/09/ctrl-fake-news/ctrl_paper.jpg 864w" src="ctrl_paper.jpg"/> 
</figure>

<p>If the model is that robust to minor URL changes, what happens when you give it URLs that blatantly do not exist? Can the CTRL model create the &ldquo;fake news&rdquo; OpenAI was concerned about? Let&rsquo;s put it to the test.</p>
<h2 id="an-overview-of-ctrl">An Overview of CTRL</h2>
<p>I&rsquo;ve <a href="https://github.com/minimaxir/ctrl-gce">written a guide + scripts</a> for setting up the base CTRL model as cheaply as possible on Google Compute Engine with just a few commands. Additionally, the CTRL team has released a <a href="https://colab.research.google.com/drive/1hVveBQShDru1Mjnhe4C21uQv4A2eH1tV">free Colaboratory Notebook</a> which sets up and runs the CTRL model; however, the model is <em>so large</em> it won&rsquo;t fit into the memory of traditional GPUs, so the notebook does a trick to shrink it a bit, which may impact generation performance.</p>
<p>Like GPT-2, CTRL has a <a href="https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html">Transformer</a> architecture based on <a href="https://www.tensorflow.org">TensorFlow</a> and uses <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding">byte pair encodings</a> as its inputs and outputs, which are then decoded into readable text. CTRL has notable performance improvements as it&rsquo;s trained on <em>three times as much data as GPT-2</em>, including an <a href="https://github.com/jcpeterson/openwebtext">open-sourced clone</a> of GPT-2&rsquo;s original dataset. And of course, it&rsquo;s larger (1.6B parameters) compared to the currently public GPT-2 (774M parameters), which has significant effects on text quality.</p>
<p>Most importantly, CTRL <em>requires</em> a control code if you want to generate text, which allows for more deterministic output compared to GPT-2/<a href="https://talktotransformer.com">TalkToTransformer</a>. There are several fun control codes, such as <code>Questions</code> if you want to ask the AI a question, or <code>Reviews</code> if you want the AI to generate an <a href="https://www.amazon.com">Amazon</a> review. For this, we&rsquo;ll only look at the <code>Links</code> control code, which lets you provide a URL and/or a prompt for text generation.</p>
<p>As the example from the paper shows, URLs contain a surprising amount of metadata. For example, let&rsquo;s consider a <a href="https://www.washingtonpost.com/powerpost/deal-reached-for-whistleblowers-testimony-house-intelligence-chairman-says/2019/09/29/01cade60-e2d1-11e9-b403-f738899982d2_story.html">random Washington Post URL</a>: <code>https://www.washingtonpost.com/powerpost/deal-reached-for-whistleblowers-testimony-house-intelligence-chairman-says/2019/09/29/01cade60-e2d1-11e9-b403-f738899982d2_story.html</code></p>
<p>There&rsquo;s month/day/year information (<code>2019/09/29</code>), a category (<code>powerpost</code>), and a <a href="https://en.wikipedia.org/wiki/Clean_URL#Slug">URL slug</a> (<code>deal-reached-for-whistleblowers-testimony-house-intelligence-chairman-says</code>), which is typically present for SEO reasons but in this case provides strong hints about the underlying content. The Transformer architecture is <em>surprisingly</em> effective at extracting all this metadata and using it to generate appropriate text.</p>
<h2 id="examples-of-fake-news">Examples of Fake News</h2>
<p>All of the following examples are generated using the full model with the default settings: 256 token length and a temperature of 0 (i.e. the model always selects the next token with the highest probability). I manually truncate the output once it finishes, but other than that, <strong>these AI-generated articles are not edited</strong>.</p>
<p>Let&rsquo;s address the elephant in the room: when people think of &ldquo;fake news&rdquo;, they think of President Donald Trump, who effectively codified the modern usage of the term. So let&rsquo;s generate a fake news story about him with an innocuous topic rarely associated with him. For example, the video gaming phenomenon <a href="https://www.epicgames.com/fortnite/home">Fortnite</a>. All that&rsquo;s needed is to tweak the slug (and optionally the date) in the earlier WaPo URL a bit and prepend the corresponding control code, resulting in <code>Links https://www.washingtonpost.com/powerpost/trump-likes-fortnite/2019/09/24/</code>. Feed it to the model, and behold!</p>
<blockquote>
<p>Trump likes Fortnite</p>
<p>President Trump, who has been known to play video games for hours at a time, is now playing the popular game “Fortnite” on his phone.</p>
<p>The president was spotted watching the online battle royale game, which pits players against each other in an attempt to kill their opponents, while he sat down with House and Senate leaders Tuesday afternoon.</p>
<p>“I like that it’s not just about shooting people but also having fun doing it,” said Trump.</p>
<p>He then asked if anyone had played it before.</p>
<p>“You know what, I don’t think I’ve ever seen that,” replied Rep. Steve Scalise, R-La.</p>
<p>A White House official told reporters later Wednesday afternoon that Trump did indeed watch the game.</p>
<p>“He watched it, yeah,” the official said.</p>
</blockquote>
<p>The generated article is very realistic and follows the style of a typical WaPo article (lede, introduction, body, support, etc.), with proper descriptions of Trump and Fortnite, plus a story that matches the headline. Like GPT-2, there are redundancies in the generation (at the end), but it&rsquo;s not something most readers would find suspicious in a casual read.</p>
<p>Let&rsquo;s tweak things a little. Instead of using <code>likes</code> in the slug, what if we use <code>despises</code>, so <code>trump-despises-fortnite</code>. How well does the model adapt?</p>
<blockquote>
<p>Trump despises Fortnite</p>
<p>The president of the United States has a new favorite video game: “Fortnite.”</p>
<p>President Trump, who is known for his love of fast food and beer, was reportedly so excited about playing this game that he called it “the greatest thing in the world.”</p>
<p>But according to an Axios report, Trump’s enthusiasm waned after learning how much time players spend on it.</p>
<p>“He said it’s not even close to being as good as ‘Candy Crush,’ ” one source told the news outlet.</p>
<p>In fact, the report said, Trump had been watching videos online while eating dinner with Japanese Prime Minister Shinzo Abe at Mar-a-Lago.</p>
<p>According to Axios, Trump also complained about the game’s difficulty level — which can be adjusted by spending real money or virtual currency — saying it wasn’t challenging enough.</p>
<p>“I don’t know why they made it hard,” Trump said.</p>
<p>He then went off topic when asked if he’d play again.</p>
<p>“No, I won’t.”</p>
<p>A White House official later clarified that Trump didn’t say he wouldn’t play, but rather that he’s “not into it.”</p>
</blockquote>
<p>It adapts very well to the change in tone (even with a rhetorical lede!).</p>
<p>Fake news doesn&rsquo;t have to be politically-based. Such news in the financial sector could cause just as much chaos. What if it was reported that <a href="https://www.apple.com">Apple</a>, one of the largest companies in the world, went bankrupt? Put in <code>Links https://www.theverge.com/2019/2/24/20882491/apple-has-filed-for-bankruptcy</code>:</p>
<blockquote>
<p>Apple has filed for bankruptcy.</p>
<p>The company that made the iPhone and iPad is filing for bankruptcy, according to a report from Bloomberg.</p>
<p>According to Bloomberg News, Apple&rsquo;s board of directors voted unanimously on Friday morning in favor of an emergency plan designed to keep it afloat until its next quarterly earnings report.</p>
<p>In order to avoid paying more than $128 million in back taxes, Apple will have to sell off assets like its Beats headphone business, which was valued at about $1 billion when it went public last year.</p>
<p>&ldquo;Apple said today that it plans to file for Chapter 11 protection under US Bankruptcy Court in Delaware this week,&rdquo; Bloomberg reports.</p>
<p>A source familiar with the matter told Reuters earlier this week that Apple had been considering selling some or all of its music streaming service iTunes Music Store as part of a restructuring plan.</p>
<p>But the news comes just days after reports surfaced suggesting that Apple would be forced into liquidation.</p>
<p>Last week, Bloomberg reported that Apple could soon announce layoffs if sales don&rsquo;t pick up soon.</p>
<p>&ldquo;We are working hard every day to deliver great products and services to our customers around the world,&rdquo; Tim Cook, CEO of Apple, wrote in a letter to employees today.</p>
</blockquote>
<p>Again, the model incorporates both bankruptcy terminology and Apple-based entities. There are minor factual errors (Beats is not public, Apple is not incorporated in Delaware), but casual readers would likely scan over that.</p>
<p>Science is also a field with lots of fake discoveries. What if AI reported that <a href="https://en.wikipedia.org/wiki/P_versus_NP_problem">P versus NP</a>, a very famous problem in computer science, was finally solved? Let&rsquo;s try <code>Links https://techcrunch.com/2029/09/24/mit-proves-p-equals-np</code>:</p>
<blockquote>
<p>MIT proves P equals NP</p>
<p>The University of Massachusetts Amherst has proven that the problem posed by factoring large numbers is NP-complete.</p>
<p>In a paper published in Nature, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have shown how to factor an integer using only two numbers, one for each digit — which means it takes just three operations on your brain.</p>
<p>This isn’t quite as impressive as solving the problem, but it’s still pretty impressive.</p>
<p>“Given any number n, we can find its product n times with polynomial time if there are no factors involved,” said lead author David DeWitt, who worked on this project while he was a graduate student there.</p>
<p>But what about factorizing larger numbers?</p>
<p>“We showed you could do it, but not without some extra effort,” DeWitt said.</p>
<p>To prove this, they used a technique called Monte Carlo integration, where they simulated millions or billions of different inputs from thousands of computers running various versions of Java.</p>
<p>They then ran their algorithm through all possible combinations of input values until it found the best one.</p>
<p>“The result shows that even though our system doesn’t solve every single case, it does get close enough to be useful.”</p>
</blockquote>
<p>This instance is less factually correct (and it&rsquo;s clickbaity!), but the topic behind the generated text is <em>complicated</em> enough that only domain experts would be able to tell it&rsquo;s not internally consistent.</p>
<p>It&rsquo;s also worth pointing out now that <em>the majority of the text generated so far is unique</em>, and CTRL is not cheating by just outputting text from the source datasets. Even though Monte Carlo integration is <a href="https://en.wikipedia.org/wiki/Monte_Carlo_integration">indeed a real thing</a> that&rsquo;s relatively obscure, Googling that description of it fails to turn up results.</p>
<p>Lastly, fake news doesn&rsquo;t have to be depressing or sow discord. What about <em>wholesome</em> fake news? Such as&hellip;the return of beloved television series <a href="https://en.wikipedia.org/wiki/Firefly_%28TV_series%29">Firefly</a>? One <code>Links https://www.hollywoodreporter.com/live-feed/fox-announces-firefly-season-2-revival-original-cast-returning-12343245</code> later:</p>
<blockquote>
<p>Fox announces Firefly Season 2 revival, original cast returning</p>
<p>The network has announced that the second season of Joss Whedon’s cult sci-fi series will return for a new run in 2018.</p>
<p>“Fans have been asking us to bring back Firefly, and we couldn’t be more thrilled about this opportunity,” said Chris Albrecht, president of programming at Fox. “We are excited to continue our relationship with Joss, who is one of the most creative minds working today on television, as well as his amazing cast, which includes Nathan Fillion, Summer Glau, Alan Tudyk, Adam Baldwin, Sean Maher, Adrianne Palicki, Tricia Helfer, Jewel Staite, Michael Trucco, and many more.”</p>
<p>In addition to Whedon himself, executive producers include Damon Lindelof, Maurissa Tancharoen, Vince Gilligan, David Fury, Jeffrey Bell, Greg Plageman, Scott M. Gimple, Heather Kadin, Andrew Chambliss, John Rogers, and Ron Moore.</p>
<p>“The show was an instant hit when it debuted over 20 years ago, but its popularity only grew after the success of ‘Serenity,’ so we’re very pleased to welcome fans into another chapter of their lives,” added Feige.</p>
</blockquote>
<p>That is a <em>very</em> stacked cast and crew, all of which (besides the original Firefly members) have acted/worked on sci-fi television series. The only major factual errors are that Chris Albrecht was at STARZ, not Fox, and Feige, presumably Kevin Feige of Marvel Studios, is not mentioned previously in the generated article.</p>
<p>I know I&rsquo;ll get criticism for highlighting a potentially dangerous application of AI text generation. My perspective is that it&rsquo;s important to know what such tools are <em>capable</em> of doing in order to more easily recognize fake news. The real problem with fake news isn&rsquo;t the text itself: it&rsquo;s the <em>distribution</em> of the news on social media like <a href="http://www.facebook.com">Facebook</a> and <a href="https://twitter.com">Twitter</a>, where the platforms not only <em>incentivize</em> it, but also fail to sufficiently punish deliberate, repeat offenders. It&rsquo;s why journalism and awareness of fake news are extremely important.</p>
<p>Some might comment &ldquo;these generated texts aren&rsquo;t convincing at all!&rdquo;, but keep in mind that&rsquo;s because the headline says upfront that they&rsquo;re fake. Would you be able to identify it as a fake if a respected source impulsively tweeted it?</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
