<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Machine Learning on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/tag/machine-learning/</link>
    <description>Recent content in Machine Learning on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Mon, 30 Jun 2025 10:00:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/tag/machine-learning/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Predicting Average IMDb Movie Ratings Using Text Embeddings of Movie Metadata</title>
      <link>https://minimaxir.com/2025/06/movie-embeddings/</link>
      <pubDate>Mon, 30 Jun 2025 10:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2025/06/movie-embeddings/</guid>
      <description>Don&amp;rsquo;t try this in your data science interviews.</description>
      <content:encoded><![CDATA[<p>Months ago, I saw a post titled &ldquo;<a href="https://www.reddit.com/r/datascience/comments/1eykil7/rejected_from_ds_role_with_no_feedback/">Rejected from DS Role with no feedback</a>&rdquo; on Reddit&rsquo;s <a href="https://www.reddit.com/r/datascience/">Data Science subreddit</a>, in which a prospective job candidate for a data science position provided a <a href="https://colab.research.google.com/drive/1Ud2tXW2IAw_dXA5DONvNpPmmlL1foSwK">Colab Notebook</a> documenting their submission for a take-home assignment and asking for feedback as to why they were rejected. Per the Reddit user, the assignment was:</p>
<blockquote>
<p>Use the publicly available <a href="https://developer.imdb.com/non-commercial-datasets/">IMDB Datasets</a> to build a model that predicts a movie&rsquo;s average rating. Please document your approach and present your results in the notebook. Make sure your code is well-organized so that we can follow your modeling process.</p>
</blockquote>
<p><a href="https://www.imdb.com/">IMDb</a>, the Internet Movie Database owned by Amazon, allows users to rate movies on a scale from 1 to 10, wherein the average rating is then displayed prominently on the movie&rsquo;s page:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/shawshank_hu_fe8025c2c6a0fa89.webp 320w,/2025/06/movie-embeddings/shawshank_hu_f0b2bc74865ccb73.webp 768w,/2025/06/movie-embeddings/shawshank_hu_8f544060412f7f54.webp 1024w,/2025/06/movie-embeddings/shawshank.webp 1082w" src="shawshank.webp"
         alt="The Shawshank Redemption is currently the highest-rated movie on IMDb with an average rating of 9.3 derived from 3.1 million user votes."/> <figcaption>
            <p><a href="https://www.imdb.com/title/tt0111161/?ref_=sr_t_1">The Shawshank Redemption</a> is currently the <a href="https://www.imdb.com/search/title/?groups=top_100&amp;sort=user_rating,desc">highest-rated movie on IMDb</a> with an average rating of 9.3 derived from 3.1 million user votes.</p>
        </figcaption>
</figure>

<p>In their notebook, the Redditor identifies a few intuitive features for such a model, including the year in which the movie was released, the genre(s) of the movie, and the actors/directors of the movie. However, the model they built is a <a href="https://www.tensorflow.org/">TensorFlow</a> and <a href="https://keras.io/">Keras</a>-based neural network, with all the bells and whistles such as <a href="https://en.wikipedia.org/wiki/Batch_normalization">batch normalization</a> and <a href="https://en.wikipedia.org/wiki/Dilution_%28neural_networks%29">dropout</a>. The immediate response by other data scientists on /r/datascience was, at its most polite, &ldquo;why did you use a neural network when it&rsquo;s a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a> that you can&rsquo;t explain?&rdquo;</p>
<p>Reading those replies made me nostalgic. Way back in 2017, before my first job as a data scientist, neural networks built with frameworks such as TensorFlow and Keras were all the rage for their ability to &ldquo;<a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem">solve any problem</a>&rdquo;, but using them was often seen as lazy and unskilled compared to traditional statistical modeling such as ordinary least squares linear regression or even gradient boosted trees. It&rsquo;s funny to see that this perception of neural networks in the data science community hasn&rsquo;t changed since then, although nowadays their black box nature can be an acceptable business tradeoff if the prediction results are higher quality and interpretability is not required.</p>
<p>Looking back at the assignment description, the objective is only to &ldquo;predict a movie&rsquo;s average rating.&rdquo; For data science interview take-homes, this is unusual: those assignments typically include an extra instruction along the lines of &ldquo;explain your model and what decisions stakeholders should make as a result of it&rdquo;, which is a strong hint that you need to use an explainable model like linear regression to obtain feature coefficients, or at least a middle ground like gradient boosted trees, whose <a href="https://stats.stackexchange.com/questions/332960/what-is-variable-importance">variable importance</a> quantifies each feature&rsquo;s relative contribution to the model. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> In the absence of that particular constraint, it&rsquo;s arguable that anything goes, including neural networks.</p>
<p>The quality of neural networks has improved significantly since 2017, even more so given the massive rise of LLMs. Why not just feed an LLM all the raw metadata for a movie, encode it into a text embedding, and build a statistical model on top of that? Would a neural network do better than a traditional statistical model in that instance? Let&rsquo;s find out!</p>
<h2 id="about-imdb-data">About IMDb Data</h2>
<p>The <a href="https://developer.imdb.com/non-commercial-datasets/">IMDb Non-Commercial Datasets</a> are famous datasets that have been around for nearly a decade <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> but are still updated daily. Back in 2018, as a budding data scientist, I performed a <a href="https://minimaxir.com/2018/07/imdb-data-analysis/">fun exploratory data analysis</a> using these datasets, although the results weren&rsquo;t too surprising.</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/imdb-4_hu_fdf90cbdd2dd2c7e.webp 320w,/2025/06/movie-embeddings/imdb-4_hu_1c45abe215427c09.webp 768w,/2025/06/movie-embeddings/imdb-4_hu_62d0feb034e8b054.webp 1024w,/2025/06/movie-embeddings/imdb-4.png 1200w" src="imdb-4.png"
         alt="The average rating for a movie is around 6 and tends to skew higher: a common trend in internet rating systems."/> <figcaption>
            <p>The average rating for a movie is around 6 and tends to skew higher: a common trend in internet rating systems.</p>
        </figcaption>
</figure>

<p>But in truth, these datasets are a terrible choice for companies to use in a take-home assignment. Although the datasets are released under a non-commercial license, IMDb doesn&rsquo;t want to give too much information to its competitors, which results in a severely limited number of features that could be used to build a good predictive model. Here are the common movie-performance-related features present in the <code>title.basics.tsv.gz</code> file:</p>
<ul>
<li><strong>tconst</strong>: unique identifier of the title</li>
<li><strong>titleType</strong>: the type/format of the title (e.g. <code>movie</code>, <code>tvMovie</code>, <code>short</code>, <code>tvSeries</code>, etc.)</li>
<li><strong>primaryTitle</strong>: the more popular title / the title used by the filmmakers on promotional materials at the point of release</li>
<li><strong>isAdult</strong>: 0: non-adult title; 1: adult title</li>
<li><strong>startYear</strong>: the release year of the title</li>
<li><strong>runtimeMinutes</strong>: primary runtime of the title, in minutes</li>
<li><strong>genres</strong>: includes up to three genres associated with the title</li>
</ul>
<p>This is a sensible schema for describing a movie, although it lacks some important information that would be very useful for determining movie quality, such as production company, summary blurbs, granular genres/tags, and plot/setting — all of which are available on the IMDb movie page itself and presumably accessible through the <a href="https://developer.imdb.com/documentation/api-documentation/?ref_=/documentation/_PAGE_BODY">paid API</a>. Of note, since the assignment explicitly asks for a <em>movie</em>&rsquo;s average rating, we need to filter the data to only <code>movie</code> and <code>tvMovie</code> entries, which the original submission failed to do.</p>
<p>The ratings data in <code>title.ratings.tsv.gz</code> is what you&rsquo;d expect:</p>
<ul>
<li><strong>tconst</strong>: unique identifier of the title (which can therefore be mapped to movie metadata using a JOIN)</li>
<li><strong>averageRating</strong>: average of all the individual user ratings</li>
<li><strong>numVotes</strong>: number of votes the title has received</li>
</ul>
<p>In order to ensure that the average ratings for modeling are indeed stable and indicative of user sentiment, I will only analyze movies that have <em>at least 30 user votes</em>: as of May 10th, 2025, that&rsquo;s about 242k movies total. Additionally, I will not use <code>numVotes</code> as a model feature, since that metric reflects extrinsic movie popularity more than the movie itself.</p>
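<p>As a rough sketch of this loading-and-filtering step in Polars (a minimal sketch, assuming the TSVs have been downloaded locally; <code>\N</code> is IMDb&rsquo;s null marker):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import polars as pl

# IMDb TSVs use "\N" for nulls and contain unescaped quote characters,
# so quoting is disabled while parsing.
df_basics = pl.read_csv(
    "title.basics.tsv.gz", separator="\t", null_values="\\N", quote_char=None
)
df_ratings = pl.read_csv(
    "title.ratings.tsv.gz", separator="\t", null_values="\\N", quote_char=None
)

# Keep only movies/TV movies with a stable number of user votes.
df_movies = (
    df_basics.filter(pl.col("titleType").is_in(["movie", "tvMovie"]))
    .join(df_ratings, on="tconst")
    .filter(pl.col("numVotes") &gt;= 30)
)
</code></pre></div>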
<p>The last major dataset is <code>title.principals.tsv.gz</code>, which has very helpful information on metadata such as the roles people play in the production of a movie:</p>
<ul>
<li><strong>tconst</strong>: unique identifier of the title (which can be mapped to movie data using a JOIN)</li>
<li><strong>nconst</strong>: unique identifier of the principal (this is mapped to <code>name.basics.tsv.gz</code> to get the principal&rsquo;s <code>primaryName</code>, but nothing else useful)</li>
<li><strong>category</strong>: the role the principal served in the title, such as <code>actor</code>, <code>actress</code>, <code>writer</code>, <code>producer</code>, etc.</li>
<li><strong>ordering</strong>: the ordering of the principals within the title, which correlates to the order the principals appear on IMDb&rsquo;s movie cast pages.</li>
</ul>
<p>Additionally, because the datasets are so popular, this is far from the first time someone has built an IMDb ratings predictor, and prior attempts are easy to Google.</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/google_hu_b09e979836a71049.webp 320w,/2025/06/movie-embeddings/google_hu_c652438955f310d8.webp 768w,/2025/06/movie-embeddings/google.webp 1000w" src="google.webp"/> 
</figure>

<p>Instead of using the official IMDb datasets, these analyses are based on the smaller <a href="https://www.kaggle.com/datasets/carolzhangdc/imdb-5000-movie-dataset/data">IMDB 5000 Movie Dataset</a> hosted on Kaggle, which adds metadata such as movie rating, budget, and further actor metadata that makes building a model much easier (albeit &ldquo;number of likes on the lead actor&rsquo;s Facebook page&rdquo; is <em>very</em> extrinsic to movie quality). Using the official datasets with much less metadata means building the models on hard mode, and they will likely have lower predictive performance.</p>
<p>Although IMDb data is very popular and very well documented, that doesn&rsquo;t mean it&rsquo;s easy to work with.</p>
<h2 id="the-initial-assignment-and-feature-engineering">The Initial Assignment and &ldquo;Feature Engineering&rdquo;</h2>
<p>Data science take-home assignments are typically 1/2 <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">exploratory data analysis</a> for identifying impactful dataset features, and 1/2 building, iterating on, and explaining the model. For real-world datasets, these are all very difficult problems with many possible solutions, and the goal from the employer&rsquo;s perspective is more to see <em>how</em> these problems are solved than to judge the actual quantitative results.</p>
<p>The initial Reddit post decided to engineer some expected features using <a href="https://pandas.pydata.org/">pandas</a>, such as <code>is_sequel</code>, derived by checking whether a non-<code>1</code> number is present at the end of a movie title, and <a href="https://en.wikipedia.org/wiki/One-hot">one-hot encoding</a> each distinct <code>genre</code> of a movie. These are fine for an initial approach, although sequel titles can be idiosyncratic, which suggests that a more <a href="https://www.ibm.com/think/topics/natural-language-processing">NLP</a>-driven approach to identifying sequels and other related media may be useful.</p>
<p>The main trick with this assignment is how to handle the principals. The common data science approach would be a sparse binary encoding of the actors/directors/writers, e.g. a vector where actors present in the movie are <code>1</code> and every other actor is <code>0</code>; there are many ways to encode this data performantly, such as scikit-learn&rsquo;s <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html">MultiLabelBinarizer</a>. The problem with this approach is that the number of unique actors is <em>very</em> large — a <a href="https://docs.honeycomb.io/get-started/basics/observability/concepts/high-cardinality/">high-cardinality</a> feature with more unique actors than data points themselves — which leads to <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a> issues, and workarounds such as encoding only the top <em>N</em> actors render the feature uninformative, since even a generous <em>N</em> will fail to capture the majority of actors.</p>
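<p>To make the dimensionality problem concrete, here&rsquo;s a toy sketch of that sparse binary encoding with scikit-learn&rsquo;s MultiLabelBinarizer (the actor lists are illustrative):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">from sklearn.preprocessing import MultiLabelBinarizer

# Each row is the list of actors for one movie.
movie_actors = [
    ["Mark Hamill", "Harrison Ford", "Carrie Fisher"],
    ["Harrison Ford", "Karen Allen"],
    ["Anthony Daniels", "Kenny Baker"],
]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(movie_actors)
print(X.shape)  # (3, 6): one binary column per unique actor

# On the real dataset, the column count (~624k unique actors) would
# exceed the number of movies (~242k), hence the dimensionality issues.
</code></pre></div>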
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/actor_cum_dist_hu_6b3839329e455b7d.webp 320w,/2025/06/movie-embeddings/actor_cum_dist_hu_b3985aca3321429a.webp 768w,/2025/06/movie-embeddings/actor_cum_dist_hu_27acda9c003abad5.webp 1024w,/2025/06/movie-embeddings/actor_cum_dist.png 1500w" src="actor_cum_dist.png"
         alt="There are actually 624k unique actors in this dataset (Jupyter Notebook), the chart just becomes hard to read at that point."/> <figcaption>
            <p>There are actually 624k unique actors in this dataset (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/actor_agg.ipynb">Jupyter Notebook</a>), the chart just becomes hard to read at that point.</p>
        </figcaption>
</figure>

<p>Additionally, most statistical modeling approaches cannot account for the <code>ordering</code> of actors since they treat each feature as independent, and because the billing order of actors generally correlates with their importance in the movie, discarding that ordering omits information relevant to the problem.</p>
<p>These constraints gave me an idea: why not use an LLM to encode <em>all</em> movie data, and build a model using the downstream embedding representation? LLMs have <a href="https://en.wikipedia.org/wiki/Attention_%28machine_learning%29">attention mechanisms</a>, which will not only respect the relative ordering of actors (to give higher predictive priority to higher-billed actors, along with actor cooccurrences), but also identify patterns within movie name texts (to identify sequels and related media semantically).</p>
<p>I started by aggregating and denormalizing all the data locally (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/imdb_polars_etl_test.ipynb">Jupyter Notebook</a>). Each of the IMDb datasets is hundreds of megabytes and hundreds of thousands of rows at minimum: not quite <a href="https://en.wikipedia.org/wiki/Big_data">big data</a>, but enough to be more cognizant of tooling, especially since computationally-intensive JOINs are required. Therefore, I used the <a href="https://pola.rs/">Polars</a> library in Python, which not only loads data super fast, but is also one of the <a href="https://duckdblabs.github.io/db-benchmark/">fastest libraries at performing JOINs</a> and other aggregation tasks. Polars&rsquo;s syntax also allows for some cool tricks: for example, I want to spread out and aggregate the principals (4.1 million rows after prefiltering) for each movie into directors, writers, producers, actors, and all other principals as nested lists, while simultaneously having them sorted by <code>ordering</code> as noted above. This is much easier to do in Polars than in any other data processing library I&rsquo;ve used, and on millions of rows, it takes <em>less than a second</em>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df_principals_agg</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df_principals</span><span class="o">.</span><span class="n">sort</span><span class="p">([</span><span class="s2">&#34;tconst&#34;</span><span class="p">,</span> <span class="s2">&#34;ordering&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">group_by</span><span class="p">(</span><span class="s2">&#34;tconst&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">director_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&#34;director&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">writer_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&#34;writer&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">producer_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&#34;producer&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">actor_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">is_in</span><span class="p">([</span><span class="s2">&#34;actor&#34;</span><span class="p">,</span> <span class="s2">&#34;actress&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">principal_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="o">~</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">is_in</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="p">[</span><span class="s2">&#34;director&#34;</span><span class="p">,</span> <span class="s2">&#34;writer&#34;</span><span class="p">,</span> <span class="s2">&#34;producer&#34;</span><span class="p">,</span> <span class="s2">&#34;actor&#34;</span><span class="p">,</span> <span class="s2">&#34;actress&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">principal_roles</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="o">~</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">is_in</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="p">[</span><span class="s2">&#34;director&#34;</span><span class="p">,</span> <span class="s2">&#34;writer&#34;</span><span class="p">,</span> <span class="s2">&#34;producer&#34;</span><span class="p">,</span> <span class="s2">&#34;actor&#34;</span><span class="p">,</span> <span class="s2">&#34;actress&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><p>After some cleanup and field renaming, here&rsquo;s an example JSON document for <a href="https://www.imdb.com/title/tt0076759/">Star Wars: Episode IV - A New Hope</a>:</p>
<!-- prettier-ignore-start -->
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;title&#34;</span><span class="p">:</span> <span class="s2">&#34;Star Wars: Episode IV - A New Hope&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;genres&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Action&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Adventure&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Fantasy&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;is_adult&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;release_year&#34;</span><span class="p">:</span> <span class="mi">1977</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;runtime_minutes&#34;</span><span class="p">:</span> <span class="mi">121</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;directors&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;George Lucas&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;writers&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;George Lucas&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;producers&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Gary Kurtz&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Rick McCallum&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;actors&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Mark Hamill&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Harrison Ford&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Carrie Fisher&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Alec Guinness&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Peter Cushing&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Anthony Daniels&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Kenny Baker&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Peter Mayhew&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;David Prowse&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Phil Brown&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;principals&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;John Williams&#34;</span><span class="p">:</span> <span class="s2">&#34;composer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Gilbert Taylor&#34;</span><span class="p">:</span> <span class="s2">&#34;cinematographer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Richard Chew&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;T.M. Christopher&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Paul Hirsch&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Marcia Lucas&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Dianne Crittenden&#34;</span><span class="p">:</span> <span class="s2">&#34;casting_director&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Irene Lamb&#34;</span><span class="p">:</span> <span class="s2">&#34;casting_director&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Vic Ramos&#34;</span><span class="p">:</span> <span class="s2">&#34;casting_director&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;John Barry&#34;</span><span class="p">:</span> <span class="s2">&#34;production_designer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><!-- prettier-ignore-end -->
<p>I was tempted to claim that I used zero feature engineering, but that wouldn&rsquo;t be accurate. The selection and ordering of the JSON fields here is itself feature engineering: for example, <code>actors</code> and <code>principals</code> are intentionally last in this JSON encoding because they can have wildly varying lengths while the prior fields are more consistent, which should make downstream encodings more comparable and consistent.</p>
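<p>The serialization itself can be as simple as <code>json.dumps</code> over a plain dict, since Python dicts preserve insertion order and the <code>indent</code> argument produces the nested indentation. A minimal sketch, assuming a row aggregated with the field names from the Polars snippet above (with <code>genres</code> already split into a list):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import json

def movie_to_doc(row: dict) -&gt; str:
    # Field order is deliberate: consistent scalar fields first, and
    # variable-length lists (actors, principals) last.
    doc = {
        "title": row["primaryTitle"],
        "genres": row["genres"],
        "is_adult": bool(row["isAdult"]),
        "release_year": row["startYear"],
        "runtime_minutes": row["runtimeMinutes"],
        "directors": row["director_names"],
        "writers": row["writer_names"],
        "producers": row["producer_names"],
        "actors": row["actor_names"],
        "principals": [
            {name: role}
            for name, role in zip(row["principal_names"], row["principal_roles"])
        ],
    }
    return json.dumps(doc, indent=2)
</code></pre></div>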
<p>Now, let&rsquo;s discuss how to convert these JSON representations of movies into embeddings.</p>
<h2 id="creating-and-visualizing-the-movie-embeddings">Creating And Visualizing the Movie Embeddings</h2>
<p>LLMs that are trained to output text embeddings are not much different from LLMs like <a href="https://chatgpt.com/">ChatGPT</a> that just predict the next token in a loop. Models such as BERT and GPT can generate &ldquo;embeddings&rdquo; out-of-the-box by skipping the prediction heads of the models and instead taking an encoded value from the last hidden state of the model (e.g. for BERT, the first positional vector of the hidden state, representing the <code>[CLS]</code> token). However, dedicated text embedding models are optimized through <a href="https://lilianweng.github.io/posts/2021-05-31-contrastive/">contrastive learning</a> to make the embedding of a given input text document distinctive. These embeddings can be used for many things, from finding similar inputs by measuring the similarity between their embeddings to, of course, building a statistical model on top of them.</p>
<p>Text embeddings that leverage LLMs are typically generated in batches on a GPU due to the increased amount of computation needed. Python libraries such as <a href="https://huggingface.co/">Hugging Face</a> <a href="https://huggingface.co/docs/transformers/en/index">transformers</a> and <a href="https://sbert.net/">sentence-transformers</a> can load these embedding models. For this experiment, I used the very new <a href="https://huggingface.co/Alibaba-NLP/gte-modernbert-base">Alibaba-NLP/gte-modernbert-base</a> text embedding model, finetuned from the <a href="https://huggingface.co/answerdotai/ModernBERT-base">ModernBERT model</a> specifically for the embedding use case, for two reasons: the ModernBERT architecture is <a href="https://huggingface.co/blog/modernbert">optimized for fast inference</a>, and the base ModernBERT model is trained to be more code-aware and should be able to understand JSON-nested input strings more robustly — that&rsquo;s also why I intentionally left in the indentation for nested JSON arrays, as it&rsquo;s semantically meaningful and <a href="https://huggingface.co/answerdotai/ModernBERT-base/blob/main/tokenizer_config.json">explicitly tokenized</a>. <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
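<p>As a point of reference, loading this model and its tokenizer with transformers (a minimal sketch, assuming a CUDA GPU is available) looks like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import torch
import torch.nn.functional as F
from tqdm.auto import tqdm
from transformers import AutoModel, AutoTokenizer

model_id = "Alibaba-NLP/gte-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to("cuda:0").eval()
</code></pre></div>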
<p>The code (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/generate_imdb_embeddings.ipynb">Jupyter Notebook</a>) — with extra considerations to avoid running out of memory on either the CPU or GPU <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> — looks something like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">device</span> <span class="o">=</span> <span class="s2">&#34;cuda:0&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">dataloader</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">DataLoader</span><span class="p">(</span><span class="n">docs</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                         <span class="n">shuffle</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                         <span class="n">pin_memory</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                         <span class="n">pin_memory_device</span><span class="o">=</span><span class="n">device</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">dataset_embeddings</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">dataloader</span><span class="p">,</span> <span class="n">smoothing</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">tokenized_batch</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">batch</span><span class="p">,</span> <span class="n">max_length</span><span class="o">=</span><span class="mi">8192</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s2">&#34;pt&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">        <span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">tokenized_batch</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">embeddings</span> <span class="o">=</span> <span class="n">outputs</span><span class="o">.</span><span class="n">last_hidden_state</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">detach</span><span class="p">()</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">dataset_embeddings</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">dataset_embeddings</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="n">dataset_embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">dataset_embeddings</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">dataset_embeddings</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/featured_hu_be15fd7c96cd6da2.webp 320w,/2025/06/movie-embeddings/featured_hu_a1d4e8d783c0419.webp 768w,/2025/06/movie-embeddings/featured_hu_1aa1372a6affcdc5.webp 1024w,/2025/06/movie-embeddings/featured.webp 1318w" src="featured.webp"/> 
</figure>

<p>I used a Spot <a href="https://cloud.google.com/blog/products/compute/introducing-g2-vms-with-nvidia-l4-gpus">L4 GPU</a> on <a href="https://cloud.google.com/">Google Cloud Platform</a> at a pricing of $0.28/hour, and it took 21 minutes to encode all 242k movie embeddings: about $0.10 total, which is surprisingly efficient.</p>
<p>Each of these embeddings is a set of 768 numbers (768D). If the embeddings are unit normalized (the <code>F.normalize()</code> step), then calculating the dot product between embeddings returns the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> of those movies, which can then be used to identify the most similar movies. But &ldquo;similar&rdquo; is open-ended, as there are many dimensions along which a movie could be considered similar.</p>
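<p>Because the embeddings are unit vectors, a similarity query against the full dataset is a single matrix-vector product. A minimal sketch, assuming <code>titles</code> is a hypothetical list of movie titles aligned with the rows of <code>dataset_embeddings</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">def most_similar(query_idx: int, k: int = 10):
    # Dot products between unit vectors are cosine similarities.
    cossims = dataset_embeddings @ dataset_embeddings[query_idx]
    top = torch.topk(cossims, k)
    return [
        (titles[int(idx)], round(score.item(), 3))
        for score, idx in zip(top.values, top.indices)
    ]
</code></pre></div>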
<p>Let&rsquo;s try a few movie similarity test cases where I calculate the cosine similarity between one query movie and <em>all</em> movies, then sort by cosine similarity to find the most similar (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/movie_embeddings_similarity.ipynb">Jupyter Notebook</a>). How about Peter Jackson&rsquo;s <a href="https://www.imdb.com/title/tt0120737/">Lord of the Rings: The Fellowship of the Ring</a>? Ideally, not only would it surface the two other movies of the original trilogy, but also its prequel Hobbit trilogy.</p>
<table>
  <thead>
      <tr>
          <th>title</th>
          <th>cossim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0120737/">The Lord of the Rings: The Fellowship of the Ring (2001)</a></td>
          <td>1.0</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0167261/">The Lord of the Rings: The Two Towers (2002)</a></td>
          <td>0.922</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0167260/">The Lord of the Rings: The Return of the King (2003)</a></td>
          <td>0.92</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt10127200/">National Geographic: Beyond the Movie - The Lord of the Rings: The Fellowship of the Ring (2001)</a></td>
          <td>0.915</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0301246/">A Passage to Middle-earth: The Making of &lsquo;Lord of the Rings&rsquo; (2001)</a></td>
          <td>0.915</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0299105/">Quest for the Ring (2001)</a></td>
          <td>0.906</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0077869/">The Lord of the Rings (1978)</a></td>
          <td>0.893</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2310332/">The Hobbit: The Battle of the Five Armies (2014)</a></td>
          <td>0.891</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1170358/">The Hobbit: The Desolation of Smaug (2013)</a></td>
          <td>0.883</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0903624/">The Hobbit: An Unexpected Journey (2012)</a></td>
          <td>0.883</td>
      </tr>
  </tbody>
</table>
<p>Indeed, it worked and surfaced both trilogies! The other movies listed are about the original work, so having high similarity would be fair.</p>
<p>Compare these results to the &ldquo;<a href="https://help.imdb.com/article/imdb/discover-watch/what-is-the-more-like-this-section/GPE7SPGZREKKY7YN">More like this</a>&rdquo; section on the IMDb page for the movie itself, which has the two sequels to the original Lord of the Rings and two other suggestions that I am not entirely sure are actually related.</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/lotr_related_hu_7560f67c8d88cb97.webp 320w,/2025/06/movie-embeddings/lotr_related_hu_544b4f2cf95b01dd.webp 768w,/2025/06/movie-embeddings/lotr_related_hu_8c4f2099751f082.webp 1024w,/2025/06/movie-embeddings/lotr_related.webp 1354w" src="lotr_related.webp"/> 
</figure>

<p>What about more elaborate franchises, such as the <a href="https://en.wikipedia.org/wiki/Marvel_Cinematic_Universe">Marvel Cinematic Universe</a>? If you asked for movies similar to <a href="https://www.imdb.com/title/tt4154796/">Avengers: Endgame</a>, would other MCU films be the most similar?</p>
<table>
  <thead>
      <tr>
          <th>title</th>
          <th>cossim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4154796/">Avengers: Endgame (2019)</a></td>
          <td>1.0</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4154756/">Avengers: Infinity War (2018)</a></td>
          <td>0.909</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0848228/">The Avengers (2012)</a></td>
          <td>0.896</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1217616/">Endgame (2009)</a></td>
          <td>0.894</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4154664/">Captain Marvel (2019)</a></td>
          <td>0.89</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2395427/">Avengers: Age of Ultron (2015)</a></td>
          <td>0.882</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt3498820/">Captain America: Civil War (2016)</a></td>
          <td>0.882</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0292502/">Endgame (2001)</a></td>
          <td>0.881</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0118661/">The Avengers (1998)</a></td>
          <td>0.877</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1228705/">Iron Man 2 (2010)</a></td>
          <td>0.876</td>
      </tr>
  </tbody>
</table>
<p>The answer is yes, which isn&rsquo;t a surprise since those movies share many principals. However, there are instances of other movies named &ldquo;Endgame&rdquo; and &ldquo;The Avengers&rdquo; that are completely unrelated to Marvel, which implies that the similarities may be fixating on the names.</p>
<p>What about movies of a smaller franchise but a specific domain, such as Disney&rsquo;s <a href="https://www.imdb.com/title/tt2294629/">Frozen</a> that only has one sequel? Would it surface other 3D animated movies by <a href="https://en.wikipedia.org/wiki/Walt_Disney_Animation_Studios">Walt Disney Animation Studios</a>, or something else?</p>
<table>
  <thead>
      <tr>
          <th>title</th>
          <th>cossim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2294629/">Frozen (2013)</a></td>
          <td>1.0</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4520988/">Frozen II (2019)</a></td>
          <td>0.93</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1323045/">Frozen (2010)</a></td>
          <td>0.92</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1611845/">Frozen (2010)</a> [a different one]</td>
          <td>0.917</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0125279/">Frozen (1996)</a></td>
          <td>0.909</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0376606/">Frozen (2005)</a></td>
          <td>0.9</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2363439/">The Frozen (2012)</a></td>
          <td>0.898</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4007494/">The Story of Frozen: Making a Disney Animated Classic (2014)</a></td>
          <td>0.894</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1071798/">Frozen (2007)</a></td>
          <td>0.889</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4150316/">Frozen in Time (2014)</a></td>
          <td>0.888</td>
      </tr>
  </tbody>
</table>
<p>&hellip;okay, it&rsquo;s definitely fixating on the name. Let&rsquo;s try a different approach to see if we can find more meaningful patterns in these embeddings.</p>
<p>In order to visualize the embeddings, we can project them to a lower dimensionality with a dimensionality reduction algorithm such as <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> or <a href="https://umap-learn.readthedocs.io/en/latest/">UMAP</a>: UMAP is preferred as it can simultaneously reorganize the data into more meaningful clusters. UMAP&rsquo;s <a href="https://umap-learn.readthedocs.io/en/latest/how_umap_works.html">construction of a neighborhood graph</a>, in theory, can allow the reduction to refine the similarities by leveraging many possible connections and hopefully avoid fixating on the movie name. However, with this amount of input data and the relatively high initial 768D vector size, the computational cost of UMAP is a concern, as both factors cause the UMAP training time to grow dramatically. Fortunately, NVIDIA&rsquo;s <a href="https://github.com/rapidsai/cuml">cuML library</a> was recently <a href="https://github.com/rapidsai/cuml/releases/tag/v25.04.00">updated</a> so that you can now run UMAP on very large amounts of data on a GPU, with a very high number of epochs to ensure the reduction fully converges, so I did just that (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/imdb_embeddings_umap_to_2D.ipynb">Jupyter Notebook</a>).</p>
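<p>cuML&rsquo;s UMAP mirrors the umap-learn API, so the invocation (a sketch with illustrative hyperparameters) is only a few lines:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">from cuml.manifold import UMAP

reducer = UMAP(
    n_components=2,
    n_neighbors=15,
    n_epochs=5000,  # a very high epoch count is cheap on a GPU
    min_dist=0.1,
)
coords_2d = reducer.fit_transform(dataset_embeddings.numpy())
</code></pre></div>
<p>What patterns can we find? Let&rsquo;s try plotting the reduced points, colored by their user rating.</p>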
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/imdb_umap_rating_hu_4047e53667cc289a.webp 320w,/2025/06/movie-embeddings/imdb_umap_rating_hu_74d5c85f14c8950c.webp 768w,/2025/06/movie-embeddings/imdb_umap_rating_hu_2b6ccdbb5b4b9105.webp 1024w,/2025/06/movie-embeddings/imdb_umap_rating.webp 1200w" src="imdb_umap_rating.webp"/> 
</figure>

<p>So there are a few things going on here. Indeed, most of the points are high-rating green, as is evident in the source data. But the points and ratings aren&rsquo;t <em>random</em>, and there are trends. In the center giga cluster, there are soft subclusters of movies at high ratings and low ratings. Smaller discrete clusters did indeed form, but what is the deal with that extremely isolated cluster at the top? After investigation, that cluster only has movies released in 2008, meaning release year is another feature I should have considered when defining movie similarity.</p>
<p>As a sanity check, I faceted out the points by movie release year to better visualize where these clusters are forming:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/imdb_umap_rating_year_hu_40c4d6844e346f92.webp 320w,/2025/06/movie-embeddings/imdb_umap_rating_year_hu_48d37fbda72976cc.webp 768w,/2025/06/movie-embeddings/imdb_umap_rating_year_hu_27485860dc95d177.webp 1024w,/2025/06/movie-embeddings/imdb_umap_rating_year.webp 1200w" src="imdb_umap_rating_year.webp"/> 
</figure>

<p>This shows that even the clustered movies have their ratings spread out, but I also unintentionally visualized how <a href="https://arize.com/docs/ax/machine-learning/computer-vision/how-to-cv/embedding-drift">embedding drift</a> manifests over time. 2024 is also a bizarrely-clustered year: I have no idea why those two years specifically are weird in movies.</p>
<p>The UMAP approach is more for fun, since it&rsquo;s better for the downstream model building to use the raw 768D vector and have it learn the features from that. At the least, there&rsquo;s <em>some</em> semantic signal preserved in these embeddings, which makes me optimistic that these embeddings alone can be used to train a viable movie rating predictor.</p>
<h2 id="predicting-average-imdb-movie-scores">Predicting Average IMDb Movie Scores</h2>
<p>So, we now have hundreds of thousands of 768D embeddings. How do we get them to predict movie ratings? What many don&rsquo;t know is that all methods of traditional statistical modeling also work with embeddings — assumptions such as feature independence are invalid, so the results aren&rsquo;t explainable, but you can still get a valid predictive model.</p>
<p>First, we will shuffle and split the dataset into a training set and a test set: for the test set, I chose 20,000 movies (roughly 10% of the data), which is more than enough for stable results. To decide on the best model, we will select the one that minimizes the <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean squared error</a> (MSE) on the test set, a standard approach for regression problems that predict a single numeric value.</p>
<p>Here are three approaches to using LLMs to solve non-next-token-prediction tasks.</p>
<h3 id="method-1-traditional-modeling-w-gpu-acceleration">Method #1: Traditional Modeling (w/ GPU Acceleration!)</h3>
<p>You can still fit a linear regression on top of the embeddings, even if the feature coefficients are completely useless, and it serves as a decent baseline (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/cuml_grid_search.ipynb">Jupyter Notebook</a>). The absolute laziest &ldquo;model&rdquo;, where we just use the mean of the training set for every prediction, results in a test MSE of <strong>1.637</strong>, while a simple linear regression on top of the 768D embeddings results in a more reasonable test MSE of <strong>1.187</strong>. We should be able to beat that handily with a more advanced model.</p>
<p>Data scientists familiar with scikit-learn know there&rsquo;s a rabbit hole of model options, but most of them are CPU-bound and single-threaded and would take a considerable amount of time on a dataset of this size. That&rsquo;s where cuML—the same library I used to create the UMAP projection—comes in, as cuML has <a href="https://docs.rapids.ai/api/cuml/stable/api/#regression-and-classification">GPU-native implementations</a> of the most popular scikit-learn models with a similar API. This notably includes <a href="https://en.wikipedia.org/wiki/Support_vector_machine">support vector machines</a>, which play especially nicely with embeddings. And because we have the extra compute, we can also perform a brute force hyperparameter <a href="https://www.dremio.com/wiki/grid-search/">grid search</a> to find the best parameters for fitting each model.</p>
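<p>A sketch of that search for the SVM case (the hyperparameter values are illustrative, and <code>X_train</code>/<code>X_test</code> are the embedding splits; cuML&rsquo;s <code>SVR</code> mirrors scikit-learn&rsquo;s API):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">from itertools import product

import numpy as np
from cuml.svm import SVR

def mse(y_true, y_pred):
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Brute-force grid search over a few SVR hyperparameters.
best_params, best_mse = None, float("inf")
for C, epsilon in product([0.1, 1.0, 10.0], [0.01, 0.1, 0.5]):
    svr = SVR(kernel="rbf", C=C, epsilon=epsilon).fit(X_train, y_train)
    test_mse = mse(y_test, svr.predict(X_test))
    if test_mse &lt; best_mse:
        best_params, best_mse = (C, epsilon), test_mse
</code></pre></div>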
<p>Here are the MSE results on the test dataset for a few of these new model types, with the hyperparameter combination for each model type that best minimizes MSE:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/model_comparison_base_hu_2e224af8e7736cd2.webp 320w,/2025/06/movie-embeddings/model_comparison_base_hu_ea8ec94f59331bc5.webp 768w,/2025/06/movie-embeddings/model_comparison_base_hu_536396210f6f6e7a.webp 1024w,/2025/06/movie-embeddings/model_comparison_base.png 1200w" src="model_comparison_base.png"/> 
</figure>

<p>The winner is the Support Vector Machine, with a test MSE of <strong>1.087</strong>! This is a good start for a simple approach that handily beats the linear regression baseline, and it also beats the model trained in the Redditor&rsquo;s original notebook, which had a test MSE of 1.096 <sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup>. In all cases, the train set MSE was close to the test set MSE, which means the models did not overfit either.</p>
<h3 id="method-2-neural-network-on-top-of-embeddings">Method #2: Neural Network on top of Embeddings</h3>
<p>Since we&rsquo;re already dealing with AI models and already have PyTorch installed to generate the embeddings, we might as well try the traditional approach of training a <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">multilayer perceptron</a> (MLP) neural network on top of the embeddings (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/pytorch_model_train_mlp.ipynb">Jupyter Notebook</a>). This workflow sounds much more complicated than fitting a traditional model as above, but PyTorch makes MLP construction straightforward, and Hugging Face&rsquo;s <a href="https://huggingface.co/docs/transformers/en/main_classes/trainer">Trainer class</a> incorporates model training best practices by default, although its <code>compute_loss</code> function has to be tweaked to minimize MSE specifically.</p>
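<p>That <code>compute_loss</code> tweak can be as small as a subclass override; a sketch, assuming the model&rsquo;s forward pass returns raw rating predictions and each batch carries a <code>labels</code> tensor (exact <code>Trainer</code> method signatures vary slightly across transformers versions):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import torch.nn.functional as F
from transformers import Trainer

class MSETrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        preds = model(**inputs).squeeze(-1)  # raw rating predictions
        loss = F.mse_loss(preds, labels)
        return (loss, preds) if return_outputs else loss
</code></pre></div>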
<p>The PyTorch model, using a loop to set up the MLP blocks, looks something like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">RatingsModel</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">linear_dims</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span> <span class="n">num_layers</span><span class="o">=</span><span class="mi">6</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">dims</span> <span class="o">=</span> <span class="p">[</span><span class="mi">768</span><span class="p">]</span> <span class="o">+</span> <span class="p">[</span><span class="n">linear_dims</span><span class="p">]</span> <span class="o">*</span> <span class="n">num_layers</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">mlp</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">ModuleList</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">            <span class="n">nn</span><span class="o">.</span><span class="n">Sequential</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">dims</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">dims</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]),</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">GELU</span><span class="p">(),</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">BatchNorm1d</span><span class="p">(</span><span class="n">dims</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]),</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.6</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">dims</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">dims</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">targets</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">mlp</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">x</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">output</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>  <span class="c1"># return 1D output if batched inputs</span>
</span></span></code></pre></div><p>This MLP is 529k parameters total: large for an MLP, but given the 222k row input dataset, it&rsquo;s not egregiously large.</p>
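<p>The <code>compute_loss</code> tweak mentioned earlier amounts to a small Trainer subclass. A minimal sketch, assuming the dataset supplies a <code>targets</code> key alongside the embedding inputs (the extra <code>**kwargs</code> absorbs signature differences between transformers versions):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import torch.nn.functional as F
from transformers import Trainer

class RegressionTrainer(Trainer):
    # Replace the Trainer's default loss with MSE against the rating targets.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        targets = inputs.pop("targets")
        preds = model(**inputs)
        loss = F.mse_loss(preds, targets)
        return (loss, preds) if return_outputs else loss
</code></pre></div>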
<p>The real difficulty with this MLP approach is that it&rsquo;s <em>too effective</em>: even with less than 1 million parameters, the model overfits severely, quickly converging to a 0.00 train MSE while the test set MSE explodes. That&rsquo;s why <code>Dropout</code> is set to the atypically high probability of <code>0.6</code>.</p>
<p>Fortunately, MLPs are fast to train: training for 600 epochs (total passes through the full training dataset) took about 17 minutes on the GPU. Here are the training results:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/training_mlp_hu_db4d2b769213c385.webp 320w,/2025/06/movie-embeddings/training_mlp_hu_99fc40ac0f82af11.webp 768w,/2025/06/movie-embeddings/training_mlp_hu_c64c2a10817470c0.webp 1024w,/2025/06/movie-embeddings/training_mlp.png 1200w" src="training_mlp.png"/> 
</figure>

<p>The lowest logged test MSE was <strong>1.074</strong>: a slight improvement over the Support Vector Machine approach.</p>
<h3 id="method-3-just-train-a-llm-from-scratch-dammit">Method #3: Just Train a LLM From Scratch Dammit</h3>
<p>There is a possibility that a pretrained embedding model trained on the entire internet could intrinsically contain relevant signal about popular movies—such as a movie winning awards, which would imply a high IMDb rating—and that knowledge could leak into the test set and provide misleading results. This may not be a significant issue in practice, since movie trivia is such a small part of the training data and <code>gte-modernbert-base</code> is too small to memorize exact information.</p>
<p>For the sake of comparison, let&rsquo;s try training an LLM from scratch on top of the raw movie JSON representations to see if we can get better results without the possibility of leakage (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/pytorch_model_train_llm.ipynb">Jupyter Notebook</a>). I had specifically avoided this approach because the compute required to train an LLM is much, much higher than for an SVM or MLP model, and leveraging a pretrained model generally gives better results. In this case, since we don&rsquo;t need an LLM with all the knowledge of human existence, we can train a much smaller model that <em>only</em> knows how to work with the movie JSON representations and can figure out relationships between actors, and whether titles are sequels, by itself. Hugging Face transformers makes this workflow surprisingly straightforward by not only having functionality to train your own custom tokenizer (in this case, shrinking the vocabulary from 50k to 5k tokens) so the data is encoded more efficiently, but also allowing the construction of a ModernBERT model with any number of layers and units. I opted for a 5M parameter LLM (SLM?), albeit with less dropout, since high dropout causes learning issues for LLMs specifically.</p>
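<p>A minimal sketch of that setup, assuming <code>json_docs</code> is an iterable of the movie JSON strings; the layer and width values here are illustrative, not the exact 5M parameter configuration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">from transformers import AutoTokenizer, ModernBertConfig, ModernBertModel

# Distill a small, domain-specific tokenizer from an existing one
base_tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
tokenizer = base_tokenizer.train_new_from_iterator(json_docs, vocab_size=5000)

# Construct a tiny ModernBERT from scratch
config = ModernBertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=128,
    intermediate_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
)
model = ModernBertModel(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
</code></pre></div>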
<p>The actual PyTorch model wrapper is, surprisingly, even more concise than the MLP approach:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">RatingsModel</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">transformer_model</span> <span class="o">=</span> <span class="n">model</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">,</span> <span class="n">attention_mask</span><span class="p">,</span> <span class="n">targets</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">transformer_model</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">input_ids</span><span class="o">=</span><span class="n">input_ids</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">attention_mask</span><span class="o">=</span><span class="n">attention_mask</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">output_hidden_states</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">last_hidden_state</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>  <span class="c1"># the &#34;[CLS] vector&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">output</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>  <span class="c1"># return 1D output if batched inputs</span>
</span></span></code></pre></div><p>Essentially, the model trains its own &ldquo;text embedding,&rdquo; although instead of an embedding optimized for textual similarity, it&rsquo;s a representation that can easily be translated into a numeric rating.</p>
<p>Because the computation needed for training an LLM from scratch is much higher, I only trained the model for 10 epochs, which still took twice as long as the 600 epochs of the MLP approach. Given that, the results are surprising:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/training_llm_hu_2355de410bfc61c1.webp 320w,/2025/06/movie-embeddings/training_llm_hu_cfcd114ac3c12003.webp 768w,/2025/06/movie-embeddings/training_llm_hu_f6c75fc2deeead45.webp 1024w,/2025/06/movie-embeddings/training_llm.png 1200w" src="training_llm.png"/> 
</figure>

<p>The LLM approach did much better than my previous attempts, hitting a new lowest test MSE of <strong>1.026</strong> after only 4 passes through the data! After that, it definitively overfit. I tried other, smaller configurations for the LLM to avoid the overfitting, but none of them ever hit a test MSE that low.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Let&rsquo;s look at the model comparison again, this time adding the results from training an MLP and training an LLM from scratch:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/model_comparison_all_hu_2309fb0cea20f0c.webp 320w,/2025/06/movie-embeddings/model_comparison_all_hu_34af566430bbc603.webp 768w,/2025/06/movie-embeddings/model_comparison_all_hu_1e1d9cf8cdfde789.webp 1024w,/2025/06/movie-embeddings/model_comparison_all.png 1200w" src="model_comparison_all.png"/> 
</figure>

<p>Coming into this post, I genuinely thought that training the MLP on top of embeddings would be the winner given the base embedding model&rsquo;s knowledge of everything, but maybe there&rsquo;s something to just YOLOing and feeding raw JSON input data to a completely new LLM. More research and development is needed.</p>
<p>The differences in model performance across these approaches aren&rsquo;t dramatic, but the iteration itself is interesting, and a near-perfect model was always a long shot given the scarce amount of metadata. The fact that building a model off of text embeddings alone didn&rsquo;t result in a perfect model doesn&rsquo;t mean the approach was a waste of time. The embedding and modeling pipelines I constructed while trying to solve this problem have already paid significant dividends on easier problems, such as identifying the efficiency of <a href="https://minimaxir.com/2025/02/embeddings-parquet/">storing embeddings in Parquet and manipulating them with Polars</a>.</p>
<p>It&rsquo;s impossible (and pointless) to pinpoint the exact reason the original Reddit poster got rejected: it could have been the neural network approach, or even something out of their control, such as the company halting hiring and being too disorganized to tell the candidate. To be clear, if I were applying for a data science role myself, I wouldn&rsquo;t use the techniques in this blog post (that UMAP data visualization would get me instantly rejected!) and would instead do more traditional EDA and non-neural-network modeling to showcase my data science knowledge to the hiring manager. But for my professional work, I will definitely try starting any modeling exploration with an embeddings-based approach wherever possible: at the absolute worst, it&rsquo;s a very strong baseline that will be hard to beat.</p>
<p><em>All of the Jupyter Notebooks and data visualization code for this blog post are available open-source in <a href="https://github.com/minimaxir/imdb-embeddings/">this GitHub repository</a>.</em></p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>I am not a fan of using GBT variable importance as a decision-making metric: variable importance does not tell you magnitude or <em>direction</em> of the feature in the real world, but it does help identify which features can be pruned for model development iteration.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>To get a sense of how old they are, they are only available as <a href="https://en.wikipedia.org/wiki/Tab-separated_values">TSV files</a>, which is a data format so old and prone to errors that many data libraries have dropped explicit support for it. Amazon, please release the datasets as CSV or Parquet files instead!&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>Two other useful features of <code>gte-modernbert-base</code>, though not strictly relevant to these movie embeddings, are a) it&rsquo;s a cased model, so it can identify meaning from upper-case text, and b) it does not require a prefix such as <code>search_query</code> and <code>search_document</code> as <a href="https://huggingface.co/nomic-ai/nomic-embed-text-v1.5">nomic-embed-text-v1.5 does</a> to guide its results, which is an annoying requirement for those models.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>The trick here is the <code>detach()</code> function for the computed embeddings, otherwise the GPU doesn&rsquo;t free up the memory once moved back to the CPU. I may or may not have discovered that the hard way.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>As noted earlier, minimizing MSE isn&rsquo;t a competition, but the comparison on roughly the same dataset is good for a sanity check.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>The Best Way to Use Text Embeddings Portably is With Parquet and Polars</title>
      <link>https://minimaxir.com/2025/02/embeddings-parquet/</link>
      <pubDate>Mon, 24 Feb 2025 10:15:00 -0800</pubDate>
      <guid>https://minimaxir.com/2025/02/embeddings-parquet/</guid>
      <description>Never store embeddings in a CSV!</description>
      <content:encoded><![CDATA[<p><a href="https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/">Text embeddings</a>, particularly modern embeddings generated from large language models, are one of the most useful applications coming from the generative AI boom. Embeddings are a list of numbers which represent an object: in the case of text embeddings, they can represent words, sentences, and full paragraphs and documents, and they do so with a surprising amount of distinctiveness.</p>
<p>Recently, I created text embeddings representing every distinct <a href="https://magic.wizards.com/en">Magic: the Gathering</a> card released as of the February 2025 Aetherdrift expansion: 32,254 in total. With these embeddings, I can find the mathematical similarity between cards through the encoded representation of their card design, including all mechanical attributes such as the card name, card cost, card text, and even card rarity.</p>
<figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/wog_hu_7ed6be2e5737eeb4.webp 320w,/2025/02/embeddings-parquet/wog_hu_81c75e037d833a96.webp 768w,/2025/02/embeddings-parquet/wog.webp 976w" src="wog.webp"
         alt="The iconic Magic card Wrath of God, along with its top four most similar cards identified using their respective embeddings. The similar cards are valid matches, with similar card text and card types."/> <figcaption>
            <p>The iconic Magic card <a href="https://gatherer.wizards.com/pages/card/Details.aspx?multiverseid=129808">Wrath of God</a>, along with its top four most similar cards identified using their respective embeddings. The similar cards are valid matches, with similar card text and card types.</p>
        </figcaption>
</figure>

<p>Additionally, I can create a fun 2D <a href="https://umap-learn.readthedocs.io/en/latest/">UMAP</a> projection of all those cards, which also identifies interesting patterns:</p>
<figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/mtg_umap_hu_df72981641ef0ffd.webp 320w,/2025/02/embeddings-parquet/mtg_umap_hu_ad2e63ba61f377cd.webp 768w,/2025/02/embeddings-parquet/mtg_umap_hu_7de8f113f1eb20fa.webp 1024w,/2025/02/embeddings-parquet/mtg_umap.webp 1200w" src="mtg_umap.webp"
         alt="The UMAP dimensionality reduction process also implicitly clusters the Magic cards to logical clusters, such as by card color(s) and card type."/> <figcaption>
            <p>The UMAP dimensionality reduction process also implicitly clusters the Magic cards to logical clusters, such as by card color(s) and card type.</p>
        </figcaption>
</figure>

<p>I generated these Magic card embeddings for <em>something special</em> besides a pretty data visualization, but if you are curious how I generated them, they were made using the new-but-underrated <a href="https://huggingface.co/Alibaba-NLP/gte-modernbert-base">gte-modernbert-base</a> embedding model and the process is detailed <a href="https://github.com/minimaxir/mtg-embeddings">in this GitHub repository</a>. The embeddings themselves (including the coordinate values to reproduce the 2D UMAP visualization) are available as a <a href="https://huggingface.co/datasets/minimaxir/mtg-embeddings">Hugging Face dataset</a>.</p>
<p>Most tutorials involving embedding generation omit the obvious question: what do you <em>do</em> with the text embeddings after you generate them? The common solution is to use a <a href="https://en.wikipedia.org/wiki/Vector_database">vector database</a>, such as <a href="https://github.com/facebookresearch/faiss">faiss</a> or <a href="https://qdrant.tech">qdrant</a>, or even a cloud-hosted service such as <a href="https://www.pinecone.io">Pinecone</a>. But those aren&rsquo;t easy to use: faiss has <a href="https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index">confusing configuration options</a>, qdrant requires <a href="https://github.com/qdrant/qdrant?tab=readme-ov-file#client-server">using a Docker container</a> to host the storage server, and Pinecone can get <a href="https://www.pinecone.io/pricing/">very expensive</a> very quickly, and its free Starter tier is limited.</p>
<p>What many don&rsquo;t know about text embeddings is that you don&rsquo;t <em>need</em> a vector database to calculate nearest-neighbor similarity if your data isn&rsquo;t too large. Using <a href="https://numpy.org/doc/stable/index.html">numpy</a> and my Magic card embeddings, a 2D matrix of 32,254 <code>float32</code> embeddings at a dimensionality of 768D (common for &ldquo;smaller&rdquo; LLM embedding models) occupies <strong>94.49 MB</strong> of system memory, which is relatively low for modern personal computers and can fit within free usage tiers of cloud VMs. If both the query vector and the embeddings themselves are unit normalized (many embedding generators normalize by default), then the matrix dot product between the query and embeddings results in a cosine similarity between <code>[-1, 1]</code>, where the higher score is better/more similar. Since dot products are such a fundamental aspect of linear algebra, numpy&rsquo;s implementation is extremely fast: with the help of additional numpy <a href="https://numpy.org/doc/stable/reference/generated/numpy.argpartition.html">sorting</a> <a href="https://numpy.org/doc/2.1/reference/generated/numpy.argsort.html">shenanigans</a>, on my M3 Pro MacBook Pro it takes just <strong>1.08 ms</strong> on average to calculate all 32,254 dot products, find the top 3 most similar embeddings, and return their corresponding matrix <code>idx</code> and cosine similarity <code>score</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">fast_dot_product</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">matrix</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">3</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">dot_products</span> <span class="o">=</span> <span class="n">query</span> <span class="o">@</span> <span class="n">matrix</span><span class="o">.</span><span class="n">T</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">idx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argpartition</span><span class="p">(</span><span class="n">dot_products</span><span class="p">,</span> <span class="o">-</span><span class="n">k</span><span class="p">)[</span><span class="o">-</span><span class="n">k</span><span class="p">:]</span>
</span></span><span class="line"><span class="cl">    <span class="n">idx</span> <span class="o">=</span> <span class="n">idx</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">argsort</span><span class="p">(</span><span class="n">dot_products</span><span class="p">[</span><span class="n">idx</span><span class="p">])[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">score</span> <span class="o">=</span> <span class="n">dot_products</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">idx</span><span class="p">,</span> <span class="n">score</span>
</span></span></code></pre></div><p>In most implementations of vector databases, once you insert the embeddings, they&rsquo;re stuck there in a proprietary serialization format and you are locked into that library and service. If you&rsquo;re just building a personal pet project or sanity-checking embeddings to make sure the results are good, that&rsquo;s a huge amount of friction. For example, when I want to experiment with embeddings, I generate them on a cloud server with a GPU since LLM-based embeddings models are often slow to generate without one, and then download them locally to my personal computer. What is the best way to handle embeddings portably such that they can easily be moved between machines and also in a non-proprietary format?</p>
<p>The answer, after much personal trial and error, is Parquet files, although even they involve a surprising amount of nuance. But before we talk about why Parquet files are good, let&rsquo;s talk about how <em>not</em> to store embeddings.</p>
<h2 id="the-worst-ways-to-store-embeddings">The Worst Ways to Store Embeddings</h2>
<p>The incorrect-but-unfortunately-common way to store embeddings is in a text format such as a CSV file. Text data is substantially larger than <code>float32</code> data: for example, a decimal number with full precision (e.g. <code>2.145829051733016968e-02</code>) as a <code>float32</code> is 32 bits/4 bytes, while as a text representation (in this case 24 ASCII <code>char</code>s) it&rsquo;s 24 bytes, <strong>6x larger</strong>. When the CSV is saved and loaded, the data has to be serialized between a numpy and a string representation of the array, which adds significant overhead. Despite that, in <a href="https://github.com/openai/openai-cookbook/blob/a3e98ea4dcf866b5e7a3cb7d63dccaa68c7d63aa/examples/Embedding_Wikipedia_articles_for_search.ipynb">one of OpenAI&rsquo;s official tutorials</a> for their embeddings models, they save the embeddings as a CSV using <a href="https://pandas.pydata.org">pandas</a> with the admitted caveat of &ldquo;Because this example only uses a few thousand strings, we&rsquo;ll store them in a CSV file. (For larger datasets, use a vector database, which will be more performant.)&rdquo;. In the case of the Magic card embeddings, pandas-to-CSV performs the <em>worst</em> of any encoding option: more on why later.</p>
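<p>The size gap is easy to verify directly:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import numpy as np

value = np.float32(2.145829051733016968e-02)
print(value.nbytes)                     # 4 bytes as binary float32
print(len("2.145829051733016968e-02"))  # 24 bytes as ASCII text
</code></pre></div>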
<p>Numpy has straightforward native methods to <a href="https://numpy.org/doc/stable/reference/generated/numpy.savetxt.html">save</a> and <a href="https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html">load</a> embeddings as a <code>.txt</code> file:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">np</span><span class="o">.</span><span class="n">savetxt</span><span class="p">(</span><span class="s2">&#34;embeddings_txt.txt&#34;</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">embeddings_r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="s2">&#34;embeddings_txt.txt&#34;</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s2">&#34; &#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>The resulting file not only takes a few seconds to save and load, but it&rsquo;s also massive: <strong>631.5 MB</strong>!</p>
<p>As an aside, HTTP APIs such as OpenAI&rsquo;s <a href="https://platform.openai.com/docs/guides/embeddings">Embeddings API</a> do transmit the embeddings as text, which adds needless latency and bandwidth overhead. I wish more embedding providers offered <a href="https://grpc.io">gRPC</a> APIs, which would allow transferring binary <code>float32</code> data instead for a performance increase: Pinecone&rsquo;s <a href="https://docs.pinecone.io/reference/python-sdk">Python SDK</a>, for example, does just that.</p>
<p>The second incorrect method to save a matrix of embeddings to disk is to save it as a Python <a href="https://docs.python.org/3/library/pickle.html">pickle</a> object, which stores the object&rsquo;s in-memory representation on disk with a few lines of code using the native <code>pickle</code> library. Pickling is unfortunately common in the machine learning industry since many ML frameworks such as <a href="https://scikit-learn.org/stable/">scikit-learn</a> don&rsquo;t have easy ways to serialize encoders and models. But it comes with two major caveats: pickled files are a massive security risk as they can execute arbitrary code, and they aren&rsquo;t guaranteed to open on other machines or Python versions. It&rsquo;s 2025, just stop pickling if you can.</p>
<p>In the case of the Magic card embeddings, it does indeed work with instant save/loads, and the file size on disk is <strong>94.49 MB</strong>: the same as its memory consumption and about 1/6th of the text size as expected:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&#34;embeddings_matrix.pkl&#34;</span><span class="p">,</span> <span class="s2">&#34;wb&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&#34;embeddings_matrix.pkl&#34;</span><span class="p">,</span> <span class="s2">&#34;rb&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">embeddings_r</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</span></span></code></pre></div><p>But there are still better and easier approaches.</p>
<h2 id="the-intended-but-not-great-way-to-store-embeddings">The Intended-But-Not-Great Way to Store Embeddings</h2>
<p>Numpy itself has a canonical way to <a href="https://numpy.org/doc/2.1/reference/generated/numpy.save.html">save</a> and <a href="https://numpy.org/doc/2.1/reference/generated/numpy.load.html">load</a> matrices — which annoyingly saves as a pickle by default for compatibility reasons, but that can fortunately be disabled by setting <code>allow_pickle=False</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">np</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s2">&#34;embeddings_matrix.npy&#34;</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">allow_pickle</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">embeddings_r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">&#34;embeddings_matrix.npy&#34;</span><span class="p">,</span> <span class="n">allow_pickle</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span></code></pre></div><p>File size and I/O speed are the same as with the <code>pickle</code> approach.</p>
<p>This works — and it&rsquo;s something I had used for a while — but in the process it exposes another problem: how do we map metadata (the Magic cards in this case) to embeddings? Currently, we use the <code>idx</code> of the most-similar matches to perform an efficient batched lookup to the source data. In this case, the number of rows matches the number of cards exactly, but what happens if the embeddings matrix needs to be changed, such as to add or remove cards and their embeddings? What happens if you want to add a dataset filter? It becomes a mess that inevitably causes technical debt.</p>
<p>The solution to this is to colocate metadata such as card names, card text, and attributes with their embeddings: that way, if they are later added, removed, or sorted, the results will remain the same. Modern vector databases such as qdrant and Pinecone do just that, with the ability to filter and sort on the metadata at the same time you query the most similar vectors. Colocating metadata in numpy itself is a bad idea, however, as numpy is optimized for numbers and not other data types such as strings, which have <a href="https://numpy.org/devdocs/user/basics.strings.html">limited operations available</a>.</p>
<p>The solution is to look at another file format that can store metadata and embeddings simultaneously, and the answer to that is Parquet files. But there&rsquo;s a rabbit hole as to what&rsquo;s the <em>best</em> way to interact with them.</p>
<h2 id="what-are-parquet-files">What are Parquet files?</h2>
<p>Parquet, developed by the open-source <a href="https://parquet.apache.org">Apache Parquet</a> project, is a file format for handling columnar data, but despite being <a href="https://blog.x.com/engineering/en_us/a/2013/announcing-parquet-10-columnar-storage-for-hadoop">first released in 2013</a> it didn&rsquo;t take off in the data science community until very recently. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> The most relevant feature of Parquet is that the resulting files are typed for each column, and that this typing includes nested lists, such as an embedding which is just a list of <code>float32</code> values. As a bonus, the columnar format allows downstream libraries to save/load them selectively and very quickly, far faster than CSVs and with rare parsing errors. The file format also allows for efficient compression and decompression, but that&rsquo;s less effective with embeddings as there&rsquo;s little redundant data.</p>
<p>For Parquet file I/O, the standard approach is to use the <a href="https://arrow.apache.org">Apache Arrow</a> protocol that is columnar in-memory, which complements the Parquet storage medium on disk. But how do you use Arrow?</p>
<h2 id="how-do-you-use-parquet-files-in-python-for-embeddings">How do you use Parquet files in Python for embeddings?</h2>
<p>Ideally, we need a library that can handle nested data easily, interoperate with numpy for serializing to a matrix, and run fast dot products.</p>
<p>The official Arrow library that <a href="https://arrow.apache.org/docs/python/index.html">interacts with Parquet natively</a> in Python is <a href="https://arrow.apache.org/docs/python/index.html">pyarrow</a>. Here, I have an example Parquet file generated with [SPOILERS] that contains both the card metadata and an <code>embedding</code> column, with the embedding for each row corresponding to that card.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">parquet</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">&#34;mtg-embeddings.parquet&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/parquet_hu_268909d3d8256458.webp 320w,/2025/02/embeddings-parquet/parquet_hu_be20ddd4d423844c.webp 768w,/2025/02/embeddings-parquet/parquet_hu_dc1002cb8e03a874.webp 1024w,/2025/02/embeddings-parquet/parquet.png 1352w" src="parquet.png"
         alt="Pyarrow&rsquo;s table schema from the input Parquet file of Magic card embeddings. Note the embedding column at the bottom is a list of 768 floats."/> <figcaption>
            <p>Pyarrow&rsquo;s table schema from the input Parquet file of Magic card embeddings. Note the <code>embedding</code> column at the bottom is a list of 768 floats.</p>
        </figcaption>
</figure>

<p>But pyarrow is not a DataFrame library, and despite the data being in a Table, it&rsquo;s hard to slice and access: the documentation suggests that you export to pandas if you need more advanced manipulation.</p>
<p>Other, more traditional data science libraries can leverage pyarrow directly. The most popular one is, of course, pandas itself, which does just that to <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html">read/write Parquet</a>. There are many, many resources for using pandas well, so it&rsquo;s often the first choice among data science practitioners.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_parquet</span><span class="p">(</span><span class="s2">&#34;mtg-embeddings.parquet&#34;</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;name&#34;</span><span class="p">,</span> <span class="s2">&#34;embedding&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/pandas_embed_hu_43da08f8256fb434.webp 320w,/2025/02/embeddings-parquet/pandas_embed_hu_ffb22e6af150d0a8.webp 768w,/2025/02/embeddings-parquet/pandas_embed_hu_f0379dc63b1b8457.webp 1024w,/2025/02/embeddings-parquet/pandas_embed.png 1224w" src="pandas_embed.png"
         alt="Pandas HTML table output of the Magic card DataFrame when printed in a Jupyter Notebook."/> <figcaption>
            <p>Pandas HTML table output of the Magic card DataFrame when printed in a Jupyter Notebook.</p>
        </figcaption>
</figure>

<p>There&rsquo;s one major weakness for the use case of embeddings: pandas is very bad at nested data. From the image above you&rsquo;ll see that the <code>embedding</code> column <em>appears</em> to be a list of numbers, but it&rsquo;s actually a list of numpy <code>object</code>s, which is a very inefficient datatype, and why I suspect writing it to a CSV is very slow. Simply converting it to numpy with <code>df[&quot;embedding&quot;].to_numpy()</code> results in a 1D array, which is definitely wrong, and trying to cast it to <code>float32</code> doesn&rsquo;t work. I found that the best way to extract the embeddings matrix from a pandas <code>embedding</code> column is to <a href="https://numpy.org/doc/2.1/reference/generated/numpy.vstack.html">np.vstack()</a> the embeddings, e.g. <code>np.vstack(df[&quot;embedding&quot;].to_numpy())</code>, which does result in a <code>(32254, 768)</code> <code>float32</code> matrix as expected. That adds a lot of compute and memory overhead in addition to unnecessary numpy array copies. Finally, after computing the dot products between a candidate query and the embedding matrix, row metadata with the most similar values can then be retrieved using <code>df.loc[idx]</code>. <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
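<p>For reference, a sketch of that workaround:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import numpy as np
import pandas as pd

df = pd.read_parquet("mtg-embeddings.parquet", columns=["name", "embedding"])

# df["embedding"].to_numpy() yields a 1D array of object-dtype arrays,
# so the rows must be stacked into a proper (32254, 768) float32 matrix.
embeddings = np.vstack(df["embedding"].to_numpy())
</code></pre></div>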
<p>However, there is another, more recent tabular data library that is not only faster than pandas but also has proper support for nested data. That library is polars.</p>
<h2 id="the-power-of-polars">The Power of polars</h2>
<p><a href="https://pola.rs">Polars</a> is a relatively new Python library which is primarily written in <a href="https://www.rust-lang.org">Rust</a> and <a href="https://docs.pola.rs/#key-features">supports Arrow</a>, which gives it a <a href="https://duckdblabs.github.io/db-benchmark/">massive performance increase</a> over pandas and many other DataFrame libraries. In the case of Magic cards, 32k rows isn&rsquo;t nearly &ldquo;big data&rdquo; and the gains of using a high-performance library are lesser, but there are some unexpected features that coincidentally work <em>perfectly</em> for the embeddings use case.</p>
<p>As with pandas, you read a Parquet file with <code>read_parquet()</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">read_parquet</span><span class="p">(</span><span class="s2">&#34;mtg-embeddings.parquet&#34;</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;name&#34;</span><span class="p">,</span> <span class="s2">&#34;embedding&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl"><span class="n">df</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/polars_embed_hu_98a1dcff6631f16f.webp 320w,/2025/02/embeddings-parquet/polars_embed_hu_7795d47fe1f2255a.webp 768w,/2025/02/embeddings-parquet/polars_embed.png 957w" src="polars_embed.png"
         alt="Polars HTML table output of the Magic card DataFrame when printed in a Jupyter Notebook."/> <figcaption>
            <p>Polars HTML table output of the Magic card DataFrame when printed in a Jupyter Notebook.</p>
        </figcaption>
</figure>

<p>There&rsquo;s a notable difference in the table output compared to <code>pandas</code>: it also reports the data type of its columns, and more importantly, it shows that the <code>embedding</code> column consists of arrays, all <code>float32</code>s, and all length 768. That&rsquo;s a great start!</p>
<p>Polars also has a <code>to_numpy()</code> function. Unlike pandas, if you call <code>to_numpy()</code> on a column as a Series, e.g. <code>df['embedding'].to_numpy()</code>, the returned object is a numpy 2D matrix: no <code>np.vstack()</code> needed. If you look at the <a href="https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.to_numpy.html">documentation</a> for the function, there&rsquo;s a curious feature:</p>
<blockquote>
<p>This operation copies data only when necessary. The conversion is zero copy when all of the following hold: [&hellip;]</p>
</blockquote>
<p>Zero copy! And in the case of columnar-stored embeddings, the conditions will always hold, but you can set <code>allow_copy=False</code> to throw an error just in case.</p>
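<p>In code, extracting the full embeddings matrix is a one-liner:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">embeddings = df["embedding"].to_numpy(allow_copy=False)  # raises instead of silently copying
print(embeddings.shape, embeddings.dtype)  # (32254, 768) float32
</code></pre></div>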
<p>Inversely, if you want to add a 2D embeddings matrix to an existing DataFrame and colocate each embedding&rsquo;s corresponding metadata, such as after you batch-generate thousands of embeddings and want to save and download the resulting Parquet, it&rsquo;s just as easy as adding a column to the DataFrame.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span><span class="n">embedding</span><span class="o">=</span><span class="n">embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">df</span><span class="o">.</span><span class="n">write_parquet</span><span class="p">(</span><span class="s2">&#34;mtg-embeddings.parquet&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>Now, let&rsquo;s put the speed to the test using all the Magic card metadata. What if we perform embedding similarity on a Magic card, but beforehand dynamically filter the dataset according to user parameters (therefore filtering the candidate embeddings at the same time since they are colocated) and perform the similarity calculations quickly as usual? Let&rsquo;s try with <a href="https://gatherer.wizards.com/pages/card/details.aspx?multiverseid=87908">Lightning Helix</a>, a card whose effects are self-explanatory even to those who don&rsquo;t play Magic.</p>
<figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/helix_1_hu_9f15db636cb74690.webp 320w,/2025/02/embeddings-parquet/helix_1_hu_c58b97e1d1c6f502.webp 768w,/2025/02/embeddings-parquet/helix_1.webp 976w" src="helix_1.webp"
         alt="The most similar cards to Lightning Helix do have similar effects, although &ldquo;Lightning&rdquo; cards dealing damage is a common trope in Magic. Warleader&rsquo;s Helix is a direct reference to Lightning Helix."/> <figcaption>
            <p>The most similar cards to Lightning Helix do have similar effects, although &ldquo;Lightning&rdquo; cards dealing damage is a common trope in Magic. <a href="https://gatherer.wizards.com/pages/card/Details.aspx?multiverseid=456806">Warleader&rsquo;s Helix</a> is a direct reference to Lightning Helix.</p>
        </figcaption>
</figure>

<p>Now we can also find similar cards to Lightning Helix but with filters. In this case, let&rsquo;s look for a Sorcery (Sorceries are analogous to Instants but tend to be stronger since they have play limitations) that has Black as one of its colors. This limits the candidates to ~3% of the original dataset. The resulting code would look like this, given a <code>query_embed</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df_filter</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;type&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">&#34;Sorcery&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;manaCost&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">&#34;B&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">embeddings_filter</span> <span class="o">=</span> <span class="n">df_filter</span><span class="p">[</span><span class="s2">&#34;embedding&#34;</span><span class="p">]</span><span class="o">.</span><span class="n">to_numpy</span><span class="p">(</span><span class="n">allow_copy</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">idx</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">fast_dot_product</span><span class="p">(</span><span class="n">query_embed</span><span class="p">,</span> <span class="n">embeddings_filter</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">related_cards</span> <span class="o">=</span> <span class="n">df_filter</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span>
</span></span></code></pre></div><p>As an aside, in polars you can call row subsets of a DataFrame with <code>df[idx]</code>, which makes it infinitely better than pandas and its <code>df.iloc[idx]</code>.</p>
<p>The resulting similar cards:</p>
<figure>

    <img loading="lazy" srcset="/2025/02/embeddings-parquet/helix_2_hu_f6db1b1e0be3033.webp 320w,/2025/02/embeddings-parquet/helix_2_hu_1d74aa59da2a8d38.webp 768w,/2025/02/embeddings-parquet/helix_2.webp 976w" src="helix_2.webp"
         alt="In this case, the similarity focuses on card text similarity, and these cards have near identical text. Smiting Helix is also a direct reference to Lightning Helix."/> <figcaption>
            <p>In this case, the similarity focuses on card text similarity, and these cards have near identical text. <a href="https://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=464058">Smiting Helix</a> is also a direct reference to Lightning Helix.</p>
        </figcaption>
</figure>

<p>Speed-wise, the code runs at about <strong>1.48 ms</strong> on average, or about 37% slower than calculating all dot products, so the filtering does still have some overhead, which is not surprising since the filtered dataframe does copy the embeddings. Overall, it&rsquo;s still more than fast enough for a hobby project.</p>
<p>I&rsquo;ve created an <a href="https://colab.research.google.com/drive/19C_9sBC0Py2PlXYihl2ed378oGyroONZ?usp=sharing">interactive Colab Notebook</a> where you can generate similarities for any Magic card, and apply any filters you want!</p>
<h2 id="scaling-to-vector-databases">Scaling to Vector Databases</h2>
<p>Again, all of this assumes that you are using the embeddings for smaller/noncommercial projects. If you scale to hundreds of thousands of embeddings, the Parquet and dot product approach for finding similarity should still be fine, but if it&rsquo;s a business-critical application, the marginal costs of querying a vector database are likely lower than the marginal revenue from a snappy similarity lookup. Deciding how to make these tradeoffs is the fun part of MLOps!</p>
<p>In the case that the amount of vectors is too large to fit into memory but you don&rsquo;t want to go all-in on vector databases, another option that may be worth considering is using an old-fashioned database that can now support vector embeddings. Notably, <a href="https://www.sqlite.org">SQLite</a> databases are just a single portable file; however, interacting with them involves more technical overhead and considerations than the <code>read_parquet()</code> and <code>write_parquet()</code> of polars. One notable implementation of vector search in SQLite is the <a href="https://alexgarcia.xyz/sqlite-vec/">sqlite-vec extension</a>, which also allows for simultaneous filtering and similarity calculations.</p>
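<p>For the curious, a rough sketch of what sqlite-vec usage looks like from Python, using the standard library&rsquo;s <code>sqlite3</code>; the table name and the <code>struct</code>-based vector packing here are illustrative, so check the sqlite-vec documentation for the current syntax:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">import sqlite3
import struct

import sqlite_vec

db = sqlite3.connect("cards.db")
db.enable_load_extension(True)
sqlite_vec.load(db)

# A virtual table storing one 768D float32 vector per row
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS card_vecs USING vec0(embedding float[768])")
db.execute(
    "INSERT INTO card_vecs(rowid, embedding) VALUES (?, ?)",
    (1, struct.pack("768f", *embedding)),  # embedding: any 768-float sequence
)

# k-nearest-neighbors query against a query vector
rows = db.execute(
    "SELECT rowid, distance FROM card_vecs WHERE embedding MATCH ? ORDER BY distance LIMIT 3",
    (struct.pack("768f", *query_embed),),
).fetchall()
</code></pre></div>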
<p>The next time you&rsquo;re working with embeddings, consider whether you really need a vector database. For many applications, the combination of Parquet files and polars provides everything you need: efficient storage, fast similarity search, and easy metadata filtering. Sometimes the simplest solution is the best one.</p>
<p><em>The code used to process the Magic card data, create the embeddings, and plot the UMAP 2D projection, is all available <a href="https://github.com/minimaxir/mtg-embeddings">in this GitHub repository</a>.</em></p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>I suspect the main bottleneck to widespread Parquet support is Microsoft Excel&rsquo;s and other spreadsheet software&rsquo;s lack of native support for the format. Every data scientist will be very, very happy if/when they do!&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>OpenAI&rsquo;s <a href="https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb">approach</a> using pandas to find colocated similarity is to manually iterate through the entire dataframe, calculate each cosine similarity between the candidate and the query for each row, then sort by scores. That implementation definitely does not scale.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>The Super Effectiveness of Pokémon Embeddings Using Only Raw JSON and Images</title>
      <link>https://minimaxir.com/2024/06/pokemon-embeddings/</link>
      <pubDate>Wed, 26 Jun 2024 10:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2024/06/pokemon-embeddings/</guid>
      <description>Embeddings encourage engineers to go full YOLO because it&amp;rsquo;s actually rewarding to do so!</description>
      <content:encoded><![CDATA[<p><a href="https://en.wikipedia.org/wiki/Word_embedding">Embeddings</a> are one of the most useful but unfortunately underdiscussed concepts in the artificial intelligence space relative to the modern generative AI gigahype. Embeddings are a set of hundreds of numbers which uniquely correspond to a given object that define its dimensionality, nowadays in a multiple of 128 such as 384D, 768D, or even 1536D. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> The larger the embeddings, the more &ldquo;information&rdquo; and distinctiveness each can contain, in theory. These embeddings can be used as-is for traditional <a href="https://en.wikipedia.org/wiki/Regression_analysis">regression</a> and <a href="https://en.wikipedia.org/wiki/Statistical_classification">classification</a> problems with your favorite statistical modeling library, but what&rsquo;s really useful about these embeddings is that if you can find the minimum mathematical distance between a given query embedding and another set of embeddings, you can then find which is the most similar: extremely useful for many real-world use cases such as search.</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/embedding_demo_hu_d3f88226f240b184.webp 320w,/2024/06/pokemon-embeddings/embedding_demo_hu_2c5a706d1ebace78.webp 768w,/2024/06/pokemon-embeddings/embedding_demo_hu_34918a6ff306a4ac.webp 1024w,/2024/06/pokemon-embeddings/embedding_demo.png 1178w" src="embedding_demo.png"
         alt="An example sentence embedding generated using Sentence Transformers: this embedding is 384D."/> <figcaption>
            <p>An example sentence embedding generated using Sentence Transformers: this embedding is 384D.</p>
        </figcaption>
</figure>

<p>Although any kind of object can be represented by an embedding, text is the classical use case for embeddings, popularized with the original <a href="https://en.wikipedia.org/wiki/Word2vec">word2vec</a> <a href="https://arxiv.org/abs/1301.3781">paper</a>, which, along with <a href="https://arxiv.org/abs/1310.4546">later work</a>, showed that word embeddings could be used to calculate relationships such as king - man + woman = queen. You could then, for example, create a sentence embedding by averaging all of its word embeddings. This actually works, although this naive averaging does not take word position and punctuation into account, both of which are critically important in identifying context for a given text.</p>
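<p>With <a href="https://radimrehurek.com/gensim/">gensim</a> and a set of pretrained word vectors, that arithmetic takes just a few lines; the classic Google News vectors file is used here purely as an example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3">from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman: the top result should be "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
</code></pre></div>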
<p>Deep learning then entered the picture, and it was eventually discovered that large language models like <a href="https://en.wikipedia.org/wiki/BERT_%28language_model%29">BERT</a> can return embeddings as an emergent behavior. Unlike the word averaging above, <a href="https://en.wikipedia.org/wiki/Transformer_%28deep_learning_architecture%29">transformers</a>-based LLMs can account for positional relationships more robustly thanks to their <a href="https://en.wikipedia.org/wiki/Attention_%28machine_learning%29">attention mechanisms</a>, and, due to their more advanced model input <a href="https://www.freecodecamp.org/news/how-tokenizers-shape-ai-understanding/">tokenization</a> strategies that go beyond whole words, can also better incorporate punctuation. One very popular Python library for easily creating embeddings using LLMs is <a href="https://sbert.net">Sentence Transformers</a>, especially with the <a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2">all-MiniLM-L6-v2</a> model (<a href="https://huggingface.co/models?pipeline_tag=sentence-similarity&amp;sort=downloads">30 million downloads monthly</a>!), which balances embedding encoding speed and robustness with its 384D embeddings.</p>
<p>How well can these embedding models work beyond just normal sentences? Can they encode larger bodies of text into a consistent space? The context length of <code>all-MiniLM-L6-v2</code> is 512 tokens, which can only fit a couple paragraphs of text, but newer LLMs have much higher context lengths.</p>
<p>I recalled one of my early projects as an aspiring data scientist: creating <a href="https://github.com/minimaxir/pokemon-3d">Pokémon vectors</a> by <a href="https://github.com/minimaxir/pokemon-3d/blob/master/pokemon_spark_pca.ipynb">manually transforming each Pokémon&rsquo;s metadata</a>, such as its base stats, type(s), moves, abilities, and miscellaneous attributes such as color, shape, and habitat. After that, I was able to cluster them.</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/cluster_hu_7ca1c5b82a702cc7.webp 320w,/2024/06/pokemon-embeddings/cluster.png 676w" src="cluster.png"
         alt="3D projection of my Pokémon vectors back in 2016: the colors are Pokémon types, and the methodology seemed to favor clustering by them."/> <figcaption>
            <p>3D projection of my Pokémon vectors back in 2016: the colors are Pokémon types, and the methodology seemed to favor clustering by them.</p>
        </figcaption>
</figure>

<p>Those familiar with Pokémon know that&rsquo;s just scratching the surface: there&rsquo;s even more metadata, such as rich text data like a Pokémon&rsquo;s Pokédex entries and the exact locations where it can be encountered, both of which say a lot about a given Pokémon. At the time, there was no efficient LLM to encode all of that extra metadata.</p>
<p>Why not try to encode all Pokémon metadata using a text embedding model and see what happens? Will we be able to identify the most &ldquo;similar&rdquo; Pokémon? What is a &ldquo;similar&rdquo; Pokémon anyways? Can we find the <em>weirdest</em> Pokémon by the most dissimilar? Can we encode other Pokémon data such as images? Let&rsquo;s find out!</p>
<h2 id="how-embeddings-are-generated-using-llms">How Embeddings Are Generated Using LLMs</h2>
<p>First, some relevant technical background on how LLMs can be used to create embeddings, since there&rsquo;s a surprising amount of confusion about how they work beyond the SEO-oriented &ldquo;embeddings are for <a href="https://www.cloudflare.com/learning/ai/what-is-vector-database/">vector databases</a>&rdquo;.</p>
<p>Modern embedding models are commonly trained in one of two ways. The first is as emergent behavior from training an LLM normally: because LLMs need to build a latent representation before passing the output to a classification head such as <a href="https://en.wikipedia.org/wiki/Generative_pre-trained_transformer">GPT</a>&rsquo;s next-token prediction, taking the last layer (&ldquo;hidden state&rdquo;) of a model and averaging across the positional axis results in an embedding with the same dimensionality as the hidden state. Since LLMs have to learn how to uniquely represent text in a common latent space, this approach is natural. The second way is to train a model to output the embeddings directly: in this case, the training process typically uses <a href="https://encord.com/blog/guide-to-contrastive-learning/">contrastive learning</a> to minimize the semantic distance between the generated embeddings of a pair of known-similar text documents, and maximize the distance between a dissimilar pair. Both of these techniques can be used together, of course: pretrain an LLM on a large body of text, then finetune it with contrastive learning.</p>
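<p>A minimal sketch of the first, &ldquo;emergent&rdquo; approach using Hugging Face <code>transformers</code> with <code>bert-base-uncased</code>: grab the last hidden state and average it across the token axis.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, embeddings!", return_tensors="pt")
with torch.no_grad():
    # shape: (batch, sequence_length, hidden_size)
    last_hidden = model(**inputs).last_hidden_state

# average across the positional (sequence) axis: shape (batch, 768)
embedding = last_hidden.mean(dim=1)
</code></pre></div>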
<p>Embedding models get the benefits of all the research invested into improving LLMs for generative AI, such as inference speed and longer context windows. Normally, using those larger context windows requires a quadratic increase in computation (e.g. a 2<em>x</em> increase in input length requires 4<em>x</em> more computation), but thanks to <a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention">FlashAttention</a> and <a href="https://blog.eleuther.ai/rotary-embeddings/">rotary positional embeddings,</a> it&rsquo;s now feasible to train models with massively-large context windows without a massive datacenter and then run those models on consumer hardware.</p>
<p>Since 2022, <a href="https://openai.com">OpenAI</a> has offered the text embedding model <a href="https://openai.com/index/new-and-improved-embedding-model/">text-embedding-ada-002</a> behind a paid API, with a class-leading context window of 8,192 tokens: a substantial increase over <code>all-MiniLM-L6-v2</code>&rsquo;s 512 limit, and no other open-source model could compete. That was the case until February 2024, when <a href="https://www.nomic.ai">Nomic AI</a> released <a href="https://huggingface.co/nomic-ai/nomic-embed-text-v1">nomic-embed-text-v1</a>, a fully open-source embedding model with an 8,192-token context window and a permissive <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache license</a>, and quickly followed up with <a href="https://huggingface.co/nomic-ai/nomic-embed-text-v1.5">nomic-embed-text-v1.5</a>. In academic benchmarks, this free model performed even better than OpenAI&rsquo;s paid embedding model <a href="https://blog.nomic.ai/posts/nomic-embed-text-v1">thanks to its training regimen</a>, which uses both of the embedding model training tricks described above. That, along with its long context window, caused it to become another one of the most downloaded open-source embedding models (~10 million downloads per month).</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/nomic_embeddings_demo_hu_4dadd5c725d77e60.webp 320w,/2024/06/pokemon-embeddings/nomic_embeddings_demo_hu_6682e6dc986b6850.webp 768w,/2024/06/pokemon-embeddings/nomic_embeddings_demo_hu_1c6ffa82964e1090.webp 1024w,/2024/06/pokemon-embeddings/nomic_embeddings_demo.png 1340w" src="nomic_embeddings_demo.png"
         alt="A sentence embedding generated using nomic-embed-text-v1.5 adapted from the official example: this is a lower-level interface than Sentence Transformers (Hugging Face transformers and PyTorch) but is more clear as to what is going on. mean_pooling() uses an atypical attention-masked averaging that is theoretically better for small inputs than averaging the entire last hidden state."/> <figcaption>
            <p>A sentence embedding generated using <code>nomic-embed-text-v1.5</code> adapted from the official example: this is a lower-level interface than Sentence Transformers (<a href="https://huggingface.co/docs/transformers/en/index">Hugging Face transformers</a> and <a href="https://pytorch.org">PyTorch</a>) but is more clear as to what is going on. <code>mean_pooling()</code> uses an atypical attention-masked averaging that is theoretically better for small inputs than averaging the entire last hidden state.</p>
        </figcaption>
</figure>

<p>The <code>F.normalize()</code> function is a key step for finding similar embeddings efficiently. <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> A <a href="https://en.wikipedia.org/wiki/Unit_vector">unit-normalized</a> vector has a length (Euclidean norm) of exactly 1. If you perform a matrix multiplication (an extremely fast computational operation) of a normalized vector against a matrix of normalized vectors, the result will be the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a>, constrained between the values of 1 for identical matches and -1 for the most dissimilar matches.</p>
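<p>A minimal sketch of that trick with PyTorch, using random stand-in vectors: once everything is unit-normalized, one matrix multiplication yields every cosine similarity at once.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import torch
import torch.nn.functional as F

query = F.normalize(torch.randn(1, 768), p=2, dim=1)      # one query embedding
corpus = F.normalize(torch.randn(1302, 768), p=2, dim=1)  # a matrix of embeddings

# dot products of unit vectors are cosine similarities, in [-1, 1]
cosine_sims = (query @ corpus.T).squeeze()
print(cosine_sims.shape)  # torch.Size([1302])
</code></pre></div>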
<p>Now that we have thoroughly covered how embeddings work, let&rsquo;s see if we can put that 8,192 context window to the test.</p>
<h2 id="what-kind-of-pokémon-embedding-are-you">What Kind of Pokémon Embedding Are You?</h2>
<p>Before encoding Pokémon data, I need to first get Pokémon data, but where? Nintendo certainly won&rsquo;t have an API for Pokémon data, and web scraping a Pokémon wiki such as <a href="https://bulbapedia.bulbagarden.net/wiki/Main_Page">Bulbapedia</a> is both impractical and rude. Fortunately, there&rsquo;s an unofficial Pokémon API known appropriately as <a href="https://pokeapi.co">PokéAPI</a>, which is both open source and has been around for years without Nintendo taking them down. Of note, PokéAPI has a <a href="https://beta.pokeapi.co/graphql/console/">GraphQL interface</a> to its Pokémon data, allowing you to query exactly what you want without having to do relationship mapping or data joins.</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/graphql_hu_545910786cbd7bf6.webp 320w,/2024/06/pokemon-embeddings/graphql_hu_5d76d10c482e4154.webp 768w,/2024/06/pokemon-embeddings/graphql_hu_d03bd1849088a102.webp 1024w,/2024/06/pokemon-embeddings/graphql.png 1260w" src="graphql.png"
         alt="A simple GraphQL query to get all Pokémon IDs and names, sorted by ID."/> <figcaption>
            <p>A simple GraphQL query to get all Pokémon IDs and names, sorted by ID.</p>
        </figcaption>
</figure>

<p>Since we can get Pokémon data in a nicely structured JSON dictionary, why not keep it that way? After writing a <a href="https://github.com/minimaxir/pokemon-embeddings/blob/main/query.gql">massive GraphQL query</a> to specify all mechanically relevant Pokémon data, all it takes is a single GET request to download it all, about 16MB of data total. This includes over 1,000 Pokémon up to the Scarlet/Violet <em>The Hidden Treasure of Area Zero</em> DLC: 1,302 Pokémon total if you include the Special forms of Pokémon (e.g. <a href="https://bulbapedia.bulbagarden.net/wiki/Mega_Evolution">Mega Evolutions</a>), which I&rsquo;m excluding for simplicity.</p>
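<p>A minimal sketch of that download step with <code>requests</code>, assuming PokéAPI&rsquo;s beta GraphQL endpoint URL and a much smaller query than the real one in the repo (shown here as a POST, the standard way to send a GraphQL query):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import json
import requests

# toy query: just IDs and names, sorted by ID (the real query requests far more)
query = """
query {
  pokemon_v2_pokemon(order_by: {id: asc}) {
    id
    name
  }
}
"""

resp = requests.post(
    "https://beta.pokeapi.co/graphql/v1beta",  # assumed endpoint URL
    json={"query": query},
    timeout=60,
)
data = resp.json()["data"]["pokemon_v2_pokemon"]

with open("pokemon_raw.json", "w") as f:
    json.dump(data, f)
</code></pre></div>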
<p>As an example, let&rsquo;s start with the franchise mascot, <a href="https://bulbapedia.bulbagarden.net/wiki/Pikachu_%28Pok%C3%A9mon%29">Pikachu</a>.</p>
<figure class="align-center ">

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/025_hu_94d9a33563b85a33.webp 320w,/2024/06/pokemon-embeddings/025.png 475w" src="025.png#center"
         alt="The iconic Pokémon #25. via Nintendo" width="300" height="300"/> <figcaption>
            <p>The iconic Pokémon #25. <a href="https://www.pokemon.com/us/pokedex/pikachu">via Nintendo</a></p>
        </figcaption>
</figure>

<p>Here&rsquo;s a subset of Pikachu&rsquo;s <a href="https://github.com/minimaxir/pokemon-embeddings/blob/main/pikachu_example_raw.json">JSON metadata</a> from that query:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="mi">25</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;pikachu&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;height&#34;</span><span class="p">:</span> <span class="mi">4</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;weight&#34;</span><span class="p">:</span> <span class="mi">60</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;base_experience&#34;</span><span class="p">:</span> <span class="mi">112</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;pokemon_v2_pokemontypes&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;pokemon_v2_type&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;electric&#34;</span>
</span></span><span class="line"><span class="cl">      <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;pokemon_v2_pokemonstats&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;pokemon_v2_stat&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;hp&#34;</span>
</span></span><span class="line"><span class="cl">      <span class="p">},</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;base_stat&#34;</span><span class="p">:</span> <span class="mi">35</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="err">...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;pokemon_v2_pokemonspecy&#34;</span><span class="err">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;base_happiness&#34;</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;capture_rate&#34;</span><span class="p">:</span> <span class="mi">190</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;forms_switchable&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;gender_rate&#34;</span><span class="p">:</span> <span class="mi">4</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;has_gender_differences&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;hatch_counter&#34;</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;is_baby&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;is_legendary&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;is_mythical&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;pokemon_v2_pokemonspeciesflavortexts&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;pokemon_v2_version&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">          <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;red&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="p">},</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;flavor_text&#34;</span><span class="p">:</span> <span class="s2">&#34;When several of\nthese POK\u00e9MON\ngather, their\felectricity could\nbuild and cause\nlightning storms.&#34;</span>
</span></span><span class="line"><span class="cl">      <span class="p">},</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="err">...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;pokemon_v2_pokemonmoves&#34;</span><span class="err">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;pokemon_v2_move&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">          <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;mega-punch&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">          <span class="nt">&#34;pokemon_v2_type&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;normal&#34;</span>
</span></span><span class="line"><span class="cl">          <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">      <span class="p">},</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="err">...</span>
</span></span></code></pre></div><p>There&rsquo;s definitely no shortage of Pikachu data! Some of the formatting is redundant though: most of the JSON keys have a <code>pokemon_v2_</code> string that conveys no additional semantic information, and we can minify the JSON to remove all the whitespace. We won&rsquo;t experiment with more rigorous preprocessing: after all, I only need to optimize an ETL workflow if it <em>doesn&rsquo;t</em> work, right?</p>
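<p>Those two cleanups are a few lines of Python; a minimal sketch, using the example file linked above:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import json

def strip_prefix(obj):
    """Recursively remove the redundant 'pokemon_v2_' prefix from dict keys."""
    if isinstance(obj, dict):
        return {k.replace("pokemon_v2_", ""): strip_prefix(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [strip_prefix(v) for v in obj]
    return obj

with open("pikachu_example_raw.json") as f:
    raw = json.load(f)

# separators=(",", ":") minifies by dropping all whitespace
minified = json.dumps(strip_prefix(raw), separators=(",", ":"))
</code></pre></div>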
<p>Since JSON data is so prevalent across the internet, it&rsquo;s extremely likely that a newly trained LLM will be sensitive to its schema and be able to understand it better. However, JSON is a token-inefficient encoding format, made even worse in this case by the particular choice of tokenizer. Here&rsquo;s the distribution of the encoded texts after the optimizations above, using <code>nomic-embed-text-v1.5</code>&rsquo;s text tokenizer, which is incidentally the same <a href="https://huggingface.co/google-bert/bert-base-uncased">bert-base-uncased</a> tokenizer used for BERT back in 2018:</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/token_counts_hu_400e5e984e326eb1.webp 320w,/2024/06/pokemon-embeddings/token_counts_hu_cf5bcc5547d45eb.webp 768w,/2024/06/pokemon-embeddings/token_counts_hu_e4d28e56e2dc7bc9.webp 1024w,/2024/06/pokemon-embeddings/token_counts.png 1200w" src="token_counts.png"/> 
</figure>

<p>The 8,192 context length of <code>nomic-embed-text-v1.5</code> is perfect for fitting almost all Pokémon! But the median token count is 3,781 tokens, which is still somewhat high. The reason is the tokenizer: bert-base-uncased is a <a href="https://huggingface.co/learn/nlp-course/en/chapter6/6">WordPiece</a> tokenizer which is optimized for words and their common prefixes and suffixes, while JSON data is highly structured. If you use a more modern tokenizer which utilizes <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding">byte pair encoding</a> (BPE), such as the <code>o200k_base</code> tokenizer which powers OpenAI&rsquo;s <a href="https://openai.com/index/hello-gpt-4o/">GPT-4o</a>, then the median token count is 2,010 tokens: nearly half the size, and therefore much faster to embed.</p>
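<p>Counting tokens with both tokenizers is straightforward; a minimal sketch, reusing the <code>minified</code> JSON string from the preprocessing sketch above:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import tiktoken
from transformers import AutoTokenizer

wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")
bpe = tiktoken.get_encoding("o200k_base")

# WordPiece vs. BPE token counts for the same minified Pokémon JSON
print(len(wordpiece.encode(minified)))
print(len(bpe.encode(minified)))
</code></pre></div>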
<p>After that, I <a href="https://github.com/minimaxir/pokemon-embeddings/blob/main/pokemon_embeddings.ipynb">encoded all the Pokémon metadata</a> into a 768D text embedding for each and every Pokémon, including unit normalization. Due to the quadratic scaling at high input token counts, this is still very computationally intensive despite the optimization tricks: for the 1,302 embeddings, it took about a half-hour on a <a href="https://colab.research.google.com">Google Colab</a> T4 GPU. The embeddings are then saved to disk in <a href="https://parquet.apache.org">parquet</a> format, a tabular format which supports nesting sequences of floats natively (don&rsquo;t use a CSV to store embeddings!). The embedding generation is the hard part; now it&rsquo;s time for the fun part!</p>
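<p>A minimal sketch of that bulk encoding run with Sentence Transformers, where <code>pokemon_docs</code> and <code>pokemon_names</code> are hypothetical lists of the minified JSON strings and Pokémon names, and the <code>search_document</code> task prefix comes from Nomic&rsquo;s documentation (more on prefixes later):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Nomic's docs recommend a task prefix for documents being indexed
docs = ["search_document: " + d for d in pokemon_docs]
embeddings = model.encode(docs, normalize_embeddings=True, show_progress_bar=True)

# parquet supports nested float sequences natively, unlike CSV
df = pd.DataFrame({"name": pokemon_names, "embedding": list(embeddings)})
df.to_parquet("pokemon_embeddings.parquet")
</code></pre></div>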
<p>Let&rsquo;s start off with Pikachu. What Pokémon is Pikachu most similar to, i.e. has the highest cosine similarity? Remember, since all the embeddings are normalized, we can get all the cosine similarities by matrix multiplying the Pikachu embedding against all the other embeddings. Let&rsquo;s include the top 3 of each of Pokémon&rsquo;s nine (!) generations to date:</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/25_similar_text_hu_d3abb73fceff08ae.webp 320w,/2024/06/pokemon-embeddings/25_similar_text_hu_7f6bea07a378c8ca.webp 768w,/2024/06/pokemon-embeddings/25_similar_text_hu_13958884c73294ad.webp 1024w,/2024/06/pokemon-embeddings/25_similar_text.png 1500w" src="25_similar_text.png"/> 
</figure>

<p>These results are better than I expected! Each generation has a &ldquo;<a href="https://bulbapedia.bulbagarden.net/wiki/Electric_rodents">Pikaclone</a>&rdquo;: a weak Electric-type rodent Pokémon, and this similarity calculation found most of them. I&rsquo;m not sure what <a href="https://bulbapedia.bulbagarden.net/wiki/Phantump_%28Pok%C3%A9mon%29">Phantump</a> and <a href="https://bulbapedia.bulbagarden.net/wiki/Trevenant_%28Pok%C3%A9mon%29">Trevenant</a> are doing under Gen VI though: they&rsquo;re Ghost/Grass Pokémon.</p>
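<p>For reference, the retrieval step behind charts like the one above is only a few lines of numpy; a minimal sketch, assuming the dataframe from the encoding sketch earlier with rows in Pokédex order:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np

embs = np.stack(df["embedding"].to_numpy())  # (1302, 768), already unit-normalized
pikachu = embs[24]                           # row 24 = Pokédex #25, assuming ID order

sims = embs @ pikachu                        # cosine similarities against every Pokémon
top = np.argsort(sims)[::-1][1:4]            # index 0 is Pikachu itself (similarity 1.0)
print(df["name"].iloc[top].tolist())
</code></pre></div>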
<p>Here&rsquo;s a few more interesting Pokémon comparisons:</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/157_similar_text_hu_3d346fcf9518b458.webp 320w,/2024/06/pokemon-embeddings/157_similar_text_hu_4ee5feef47d7753c.webp 768w,/2024/06/pokemon-embeddings/157_similar_text_hu_2922bbc71f9c3c31.webp 1024w,/2024/06/pokemon-embeddings/157_similar_text.png 1500w" src="157_similar_text.png"
         alt="Typhlosion is the final evolution of the Gen II Fire starter Pokémon: it has a high similarity with atleast one of every generation&rsquo;s Fire starter Pokémon lineages."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Typhlosion_%28Pok%C3%A9mon%29">Typhlosion</a> is the final evolution of the Gen II Fire starter Pokémon: it has a high similarity with atleast one of every generation&rsquo;s Fire starter Pokémon lineages.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/144_similar_text_hu_f33e01b9e6fd18de.webp 320w,/2024/06/pokemon-embeddings/144_similar_text_hu_6901b401ed8d948b.webp 768w,/2024/06/pokemon-embeddings/144_similar_text_hu_c3d530c15d378e14.webp 1024w,/2024/06/pokemon-embeddings/144_similar_text.png 1500w" src="144_similar_text.png"
         alt="Articuno, a Legendary Ice/Flying Pokémon, has high similarity with Legendary, Ice, and Flying Pokémon, plus all combinations therein."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Articuno_%28Pok%C3%A9mon%29">Articuno</a>, a Legendary Ice/Flying Pokémon, has high similarity with Legendary, Ice, and Flying Pokémon, plus all combinations therein.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/151_similar_text_hu_d787c257f6d1c1fc.webp 320w,/2024/06/pokemon-embeddings/151_similar_text_hu_14f77fe5dcb615a2.webp 768w,/2024/06/pokemon-embeddings/151_similar_text_hu_ee9cf4523b03c9ca.webp 1024w,/2024/06/pokemon-embeddings/151_similar_text.png 1500w" src="151_similar_text.png"
         alt="Mew, the infamous legendary from the original games has the gimmick of being able to learn every move, has the most amount of metadata by far: appropriately it has poor similarity with others, although similarity with Arceus from Gen IV, the Pokémon equivalent of God with a similar gimmick."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Mew_%28Pok%C3%A9mon%29">Mew</a>, the infamous legendary from the original games has the gimmick of being able to learn every move, has the most amount of metadata by far: appropriately it has poor similarity with others, although similarity with <a href="https://bulbapedia.bulbagarden.net/wiki/Arceus_%28Pok%C3%A9mon%29">Arceus</a> from Gen IV, the Pokémon equivalent of God with a similar gimmick.</p>
        </figcaption>
</figure>

<p>You may have noticed the numerical cosine similarity of all these Pokémon is very high: if a similarity of 1 indicates an identical match, does a high value imply that a Pokémon is super similar? It&rsquo;s likely that the similarities are high because the input is all in the same JSON formatting, whereas the core <code>nomic-embed-text-v1.5</code> model was trained on a variety of text styles. Another potential cause is a &ldquo;cheat&rdquo; I did for simplicity: the <code>nomic-embed-text-v1.5</code> documentation says that a <code>search_document</code> prefix is required for encoding the base input documents and a <code>search_query</code> prefix is required for the comparison vector; in my testing it doesn&rsquo;t affect the similarity much, if at all. In practice, the absolute value of cosine similarity doesn&rsquo;t matter if you&rsquo;re just selecting the objects with the highest similarity anyways.</p>
<p>What if we just plot <em>every possible combination</em> of Pokémon cosine similarities? With 1,000+ Pokémon, that&rsquo;s over 1 million combinations. Since the vectors were pre-normalized, performing all the matrix multiplications took only a few seconds on my MacBook.</p>
<p>Here&rsquo;s the result of plotting 1 million points on a single chart!</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/pokemon_cossim_text_hu_4e946cbcf5592ceb.webp 320w,/2024/06/pokemon-embeddings/pokemon_cossim_text_hu_162006d7e7cb517a.webp 768w,/2024/06/pokemon-embeddings/pokemon_cossim_text_hu_adde71533c7b8122.webp 1024w,/2024/06/pokemon-embeddings/pokemon_cossim_text.png 1200w" src="pokemon_cossim_text.png"/> 
</figure>

<p>Although it looks more like a quilt, a few things jump out. One curious case is the &ldquo;square&rdquo; of lighter Gen VIII and Gen IX in the upper-right corner: it appears those two generations have lower similarity with the others, and the similarity worsens the further back you go toward Gen I. Those two generations are the Nintendo Switch games (Sword/Shield/Scarlet/Violet), for which PokéAPI explicitly notes it has worse data. Also, there are rows of a low-similarity blue, such as one just before Gen II: who&rsquo;s that Pokémon? Quickly checking the Pokémon with the lowest median similarity by generation:</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/pokemon_dissimilar_text_hu_6a8915531cedabd3.webp 320w,/2024/06/pokemon-embeddings/pokemon_dissimilar_text_hu_ed6bebce86bed918.webp 768w,/2024/06/pokemon-embeddings/pokemon_dissimilar_text_hu_539044188fdd35ba.webp 1024w,/2024/06/pokemon-embeddings/pokemon_dissimilar_text.png 1500w" src="pokemon_dissimilar_text.png"/> 
</figure>

<p>The mystery Pokémon is <a href="https://bulbapedia.bulbagarden.net/wiki/Magikarp_%28Pok%C3%A9mon%29">Magikarp</a>, unsurprisingly, with its <em>extremely</em> limited movepool. Most of these Pokémon have forced gimmick movesets, especially <a href="https://bulbapedia.bulbagarden.net/wiki/Unown_%28Pok%C3%A9mon%29">Unown</a>, <a href="https://bulbapedia.bulbagarden.net/wiki/Smeargle_%28Pok%C3%A9mon%29">Smeargle</a>, and <a href="https://bulbapedia.bulbagarden.net/wiki/Wobbuffet_%28Pok%C3%A9mon%29">Wobbuffet</a>, so it makes sense the metadata treats them as dissimilar to most others. Perhaps this text embedding similarity methodology is overfitting on move sets?</p>
<p>Overall, there&rsquo;s definitely some signal with these text embeddings. How else can we identify interesting Pokémon relationships?</p>
<h2 id="pokémon-snap">Pokémon Snap</h2>
<p>We&rsquo;ve only been working with text embeddings, but what about other types of embeddings, such as image embeddings? Image embeddings using <a href="https://en.wikipedia.org/wiki/Vision_transformer">vision transformer</a> models are generated roughly the same way as the text embeddings above, by manipulating the last hidden state and optionally normalizing it. The inputs to the model are square image patches encoded as &ldquo;tokens&rdquo;: only a few hundred processed patches are ever used as inputs, so generating image embeddings is much faster than generating the text embeddings.</p>
<p>A couple years ago I hacked together a Python package named <a href="https://github.com/minimaxir/imgbeddings/tree/main">imgbeddings</a> which uses OpenAI&rsquo;s <a href="https://openai.com/index/clip/">CLIP</a> to generate the embeddings, albeit with <a href="https://x.com/minimaxir/status/1507166313281585164">mixed results</a>. Recently, Nomic also released a new model, <a href="https://huggingface.co/nomic-ai/nomic-embed-vision-v1.5">nomic-embed-vision-v1.5</a>, which generates image embeddings with better benchmark performance than CLIP. What&rsquo;s notable about these embeddings is that they are aligned with the ones from <code>nomic-embed-text-v1.5</code>, which allows matching text similarity with images or <em>vice versa</em> and enables <a href="https://cloud.google.com/use-cases/multimodal-ai?hl=en">multimodal applications</a>.</p>
<p>But for now, can we see whether image embeddings derived from Pokémon images exhibit similar similarity behavior? PokéAPI fortunately has the official artwork for each Pokémon, so I <a href="https://github.com/minimaxir/pokemon-embeddings/blob/main/pokemon_images.ipynb">downloaded them</a>, composited them onto a white background, and resized them all to 224x224 for apples-to-apples comparisons.</p>
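<p>That image preparation is a few lines with Pillow; a minimal sketch, where the filename is Pikachu&rsquo;s official artwork from PokéAPI:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from PIL import Image

def preprocess(path):
    """Flatten transparent official art onto white and resize to 224x224."""
    art = Image.open(path).convert("RGBA")
    background = Image.new("RGBA", art.size, (255, 255, 255, 255))
    flattened = Image.alpha_composite(background, art).convert("RGB")
    return flattened.resize((224, 224))

img = preprocess("025.png")
</code></pre></div>
<p>We expect a high cosine similarity since, as with the text embeddings, the &ldquo;style&rdquo; of all the images is the same. Let&rsquo;s plot the similarities of all Pokémon, by their images only.</p>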
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/pokemon_cossim_image_hu_1c224446fd2d33ac.webp 320w,/2024/06/pokemon-embeddings/pokemon_cossim_image_hu_4f63c7e6b643d270.webp 768w,/2024/06/pokemon-embeddings/pokemon_cossim_image_hu_69bf5e8b39390f66.webp 1024w,/2024/06/pokemon-embeddings/pokemon_cossim_image.png 1200w" src="pokemon_cossim_image.png"/> 
</figure>

<p>Unfortunately, no patterns jump out this time. All the image similarity values are even higher than the text similarity values, although that&rsquo;s not a big deal since we are looking at the most similar matches. How does Pikachu&rsquo;s famous official artwork compare with other Pokémon?</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/25_similar_image_hu_38ac0e401d65ee2c.webp 320w,/2024/06/pokemon-embeddings/25_similar_image_hu_7046443de47b4ee1.webp 768w,/2024/06/pokemon-embeddings/25_similar_image_hu_ffe3902d02dae773.webp 1024w,/2024/06/pokemon-embeddings/25_similar_image.png 1500w" src="25_similar_image.png"/> 
</figure>

<p>Pikachu&rsquo;s most similar Pokémon by image aren&rsquo;t just mouse Pokémon as I thought they would be; instead the pattern is less clear, appearing to favor mostly Pokémon with four limbs. (Pikachu&rsquo;s image does have a strong similarity with Gen VII&rsquo;s <a href="https://bulbapedia.bulbagarden.net/wiki/Mimikyu_%28Pok%C3%A9mon%29">Mimikyu</a>&rsquo;s image, which is hilarious since that particular Pokémon&rsquo;s gimmick is intentionally trying to look like Pikachu.)</p>
<p>After testing a few more Pokémon, it turns out that this image embedding model does respond to visual primitives, which has its uses.</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/18_similar_image_hu_1612a7ba17e5ae91.webp 320w,/2024/06/pokemon-embeddings/18_similar_image_hu_67025db2f5aeaf75.webp 768w,/2024/06/pokemon-embeddings/18_similar_image_hu_4805d93d4c98cabf.webp 1024w,/2024/06/pokemon-embeddings/18_similar_image.png 1500w" src="18_similar_image.png"
         alt="Pidgeot is a bird, and it matches all other birds. Birds would definitely be in an image training dataset."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Pidgeot_%28Pok%C3%A9mon%29">Pidgeot</a> is a bird, and it matches all other birds. Birds would definitely be in an image training dataset.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/101_similar_image_hu_c171b79453ea948b.webp 320w,/2024/06/pokemon-embeddings/101_similar_image_hu_50e3b83533c12d84.webp 768w,/2024/06/pokemon-embeddings/101_similar_image_hu_f464f67fcf469df6.webp 1024w,/2024/06/pokemon-embeddings/101_similar_image.png 1500w" src="101_similar_image.png"
         alt="Electrode is a ball, and the embeddings found similarly rotund Pokémon."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Electrode_%28Pok%C3%A9mon%29">Electrode</a> is a ball, and the embeddings found similarly rotund Pokémon.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/230_similar_image_hu_2f5b44f49ac99a85.webp 320w,/2024/06/pokemon-embeddings/230_similar_image_hu_bf0bb9446330d79e.webp 768w,/2024/06/pokemon-embeddings/230_similar_image_hu_77535585356eec73.webp 1024w,/2024/06/pokemon-embeddings/230_similar_image.png 1500w" src="230_similar_image.png"
         alt="Kingdra apparently is similar to other blue Pokémon."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Kingdra_%28Pok%C3%A9mon%29">Kingdra</a> apparently is similar to other blue Pokémon.</p>
        </figcaption>
</figure>

<p>Both text and image embedding approaches have their own style. But are there ways to combine them?</p>
<h2 id="chat-with-your-pokédex">Chat With Your Pokédex</h2>
<p>Earlier I alluded to aligning text and image embeddings in a more multimodal manner. Since <code>nomic-embed-vision-v1.5</code> was conditioned on <code>nomic-embed-text-v1.5</code> outputs, you are able to compute the cosine similarities between the image embeddings and text embeddings! However, it&rsquo;s not as robust: the cosine similarities between objects of the two modes tend to be very low, at about 0.10 in the best-case scenario. Again, if all we&rsquo;re looking at is the highest similarity, then that&rsquo;s fine.</p>
<p>The most common use case for multimodal reasoning is asking questions (to be converted to a text embedding) and comparing it with a set of image embeddings. Let&rsquo;s try it with Pokémon by <a href="https://github.com/minimaxir/pokemon-embeddings/blob/main/pokemon_multimodal_qa.ipynb">asking it a leading question</a> for testing: what looks like an ice cream cone?</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/qa_1_hu_8e3221047d889ca2.webp 320w,/2024/06/pokemon-embeddings/qa_1_hu_533a6f2fd94474e4.webp 768w,/2024/06/pokemon-embeddings/qa_1_hu_d0d5978a89a664ee.webp 1024w,/2024/06/pokemon-embeddings/qa_1.png 1050w" src="qa_1.png"/> 
</figure>

<p>Surprisingly, it got the result correct with <a href="https://bulbapedia.bulbagarden.net/wiki/Vanillish_%28Pok%C3%A9mon%29">Vanillish</a>, along with other &ldquo;cream&rdquo; and &ldquo;ice&rdquo; Pokémon. Not sure why <a href="https://bulbapedia.bulbagarden.net/wiki/Metapod_%28Pok%C3%A9mon%29">Metapod</a> is there, though.</p>
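<p>A minimal sketch of that cross-modal lookup, where <code>image_embs</code> is a hypothetical (1302, 768) matrix of normalized image embeddings from the step above:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np
from sentence_transformers import SentenceTransformer

text_model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# questions use the search_query task prefix per Nomic's docs
question = "search_query: What looks like an ice cream cone?"
q_emb = text_model.encode(question, normalize_embeddings=True)

# cross-modal cosine similarities: low in absolute terms, but still rankable
sims = image_embs @ q_emb
best = np.argsort(sims)[::-1][:3]
</code></pre></div>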
<p>A few more Qs and As:</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/qa_2_hu_3233384b59e2a4e2.webp 320w,/2024/06/pokemon-embeddings/qa_2_hu_c6cfa2f4a81738b6.webp 768w,/2024/06/pokemon-embeddings/qa_2_hu_bbbc1fa0f0144e15.webp 1024w,/2024/06/pokemon-embeddings/qa_2.png 1050w" src="qa_2.png"
         alt="The model did identify some cats, but only Torracat is orange."/> <figcaption>
            <p>The model did identify some cats, but only <a href="https://bulbapedia.bulbagarden.net/wiki/Torracat_%28Pok%C3%A9mon%29">Torracat</a> is orange.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/qa_3_hu_42fd3039bd4f26b4.webp 320w,/2024/06/pokemon-embeddings/qa_3_hu_bd8dc72563061a30.webp 768w,/2024/06/pokemon-embeddings/qa_3_hu_8e404e3e0bd8e8d0.webp 1024w,/2024/06/pokemon-embeddings/qa_3.png 1050w" src="qa_3.png"
         alt="Unown definitely fits the bill with a very prominent one-eye and higher similarity."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Unown_%28Pok%C3%A9mon%29">Unown</a> definitely fits the bill with a very prominent one-eye and higher similarity.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/qa_4_hu_5b0b318502699c9c.webp 320w,/2024/06/pokemon-embeddings/qa_4_hu_822df483a3e6842a.webp 768w,/2024/06/pokemon-embeddings/qa_4_hu_822dde97c1b28626.webp 1024w,/2024/06/pokemon-embeddings/qa_4.png 1050w" src="qa_4.png"
         alt="A Pokémon with the name &ldquo;Cutiefly&rdquo; being the most similar to the question is a funny coincidence."/> <figcaption>
            <p>A Pokémon with the name &ldquo;<a href="https://bulbapedia.bulbagarden.net/wiki/Cutiefly_%28Pok%C3%A9mon%29">Cutiefly</a>&rdquo; being the most similar to the question is a funny coincidence.</p>
        </figcaption>
</figure>

<p>The relationship between text and Pokémon images with these models is not perfect, but it&rsquo;s honestly much better than I expected!</p>
<h2 id="2da-master">2D.A Master</h2>
<p>Lastly, there are many ways to find signal among the high-dimensional noise, and doing so may resolve some of the counterintuitive relationships we saw earlier. One common method is dimensionality reduction, which shrinks the embedding: a popular target size is 2D for easy data visualization, and I am definitely in favor of data visualization! The classical statistical approach is <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">principal component analysis</a> (PCA), which identifies the most &ldquo;important&rdquo; aspects of a matrix, but a more modern approach is <a href="https://umap-learn.readthedocs.io/en/latest/">uniform manifold approximation &amp; projection</a> (UMAP), which trains a projection that accounts for how data points relate to all other data points to <a href="https://umap-learn.readthedocs.io/en/latest/how_umap_works.html">find its underlying structure</a>. In theory, the reduction should allow the embeddings to generalize better.</p>
<p>For the Pokémon embeddings, we can take the opportunity to allow the model to account for both the text and image embeddings, and their potential interactions therein. Therefore, I concatenated the text and image embeddings for each Pokémon (a 1536D embedding total), and trained a UMAP to project it down to 2D. Now we can visualize it!</p>
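<p>A minimal sketch of that projection with the umap-learn package, where <code>text_embs</code> and <code>image_embs</code> are the hypothetical (1302, 768) matrices from the earlier steps:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np
import umap  # from the umap-learn package

combined = np.hstack([text_embs, image_embs])  # (1302, 1536)

reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(combined)       # (1302, 2), ready to plot
</code></pre></div>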
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/pokemon_umap_hu_5c319611f462c24.webp 320w,/2024/06/pokemon-embeddings/pokemon_umap_hu_2a8289708e6aeb22.webp 768w,/2024/06/pokemon-embeddings/pokemon_umap_hu_67d3c2c6bda1487c.webp 1024w,/2024/06/pokemon-embeddings/pokemon_umap.png 1200w" src="pokemon_umap.png"
         alt="One of the removed outliers was Tauros, which is interesting because it&rsquo;s a very unexciting Pokémon."/> <figcaption>
            <p>One of the removed outliers was <a href="https://bulbapedia.bulbagarden.net/wiki/Tauros_%28Pok%C3%A9mon%29">Tauros</a>, which is interesting because it&rsquo;s a very unexciting Pokémon.</p>
        </figcaption>
</figure>

<p>Unfortunately, plotting each Pokémon&rsquo;s image onto a single chart would be too cluttered to read, but from this chart we can see that instead of organizing by Pokémon type like my 2016 approach did, this approach organizes much more by generation: the earlier generations vs. the later generations. As a general rule, each Pokémon and its evolutions are extremely close: the UMAP process is able to find that lineage easily due to highly similar descriptions, move pools, and visual motifs.</p>
<p>As with the cosine similarities, we can now find the most similar Pokémon, this time seeing which points have the <strong>lowest</strong> <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a> (0.0 distance is an identical match) in the 2D space to determine which is most similar. How does Pikachu fare now?</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/25_similar_umap_hu_c246697f3bd129ed.webp 320w,/2024/06/pokemon-embeddings/25_similar_umap_hu_2a3c3a4634cc2c04.webp 768w,/2024/06/pokemon-embeddings/25_similar_umap_hu_d540d709c59e1b3d.webp 1024w,/2024/06/pokemon-embeddings/25_similar_umap.png 1500w" src="25_similar_umap.png"/> 
</figure>

<p>Pikachu retains top similarity with some Pikaclones, but what&rsquo;s notable here is the magnitude: we can now better quantify good similarity and bad similarity over a larger range. In this case, many of the Pokémon at distance &gt;1.0 clearly do not resemble an Electric rodent.</p>
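<p>The distance computation itself is a one-liner; a minimal sketch, reusing the <code>coords</code> matrix from the UMAP sketch above:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np

# Euclidean distances from every point to Pikachu's point (row 24)
dists = np.linalg.norm(coords - coords[24], axis=1)
nearest = np.argsort(dists)[1:4]  # index 0 is Pikachu itself at distance 0.0
</code></pre></div>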
<p>How about some other Pokémon?</p>
<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/129_similar_umap_hu_938f7320031c2ed5.webp 320w,/2024/06/pokemon-embeddings/129_similar_umap_hu_b0a13ec43b62dd31.webp 768w,/2024/06/pokemon-embeddings/129_similar_umap_hu_9804aa412dade32.webp 1024w,/2024/06/pokemon-embeddings/129_similar_umap.png 1500w" src="129_similar_umap.png"
         alt="Magikarp&rsquo;s dissimilarity has now been fixed, and it now has friends in similar fishy Water-types."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Magikarp_%28Pok%C3%A9mon%29">Magikarp</a>&rsquo;s dissimilarity has now been fixed, and it now has friends in similar fishy Water-types.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/122_similar_umap_hu_a02cdca1de8e208b.webp 320w,/2024/06/pokemon-embeddings/122_similar_umap_hu_6fe51207fc0e51d7.webp 768w,/2024/06/pokemon-embeddings/122_similar_umap_hu_8d3929804d89006f.webp 1024w,/2024/06/pokemon-embeddings/122_similar_umap.png 1500w" src="122_similar_umap.png"
         alt="Mr. Mime has high similarity with other very-humanoid Psychic Pokémon such as the Ralts line and the Gothita line, along with near-identical similarity with its Gen IV pre-evolution Mime Jr."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Mr._Mime_%28Pok%C3%A9mon%29">Mr. Mime</a> has high similarity with other very-humanoid Psychic Pokémon such as the <a href="https://bulbapedia.bulbagarden.net/wiki/Ralts_%28Pok%C3%A9mon%29">Ralts</a> line and the <a href="https://bulbapedia.bulbagarden.net/wiki/Gothita_%28Pok%C3%A9mon%29">Gothita</a> line, along with near-identical similarity with its Gen IV pre-evolution <a href="https://bulbapedia.bulbagarden.net/wiki/Mime_Jr._%28Pok%C3%A9mon%29">Mime Jr</a>.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2024/06/pokemon-embeddings/12_similar_umap_hu_d79cec0bbd2548dc.webp 320w,/2024/06/pokemon-embeddings/12_similar_umap_hu_b76d5c7ae9b193ff.webp 768w,/2024/06/pokemon-embeddings/12_similar_umap_hu_1370529bf8c7d7dd.webp 1024w,/2024/06/pokemon-embeddings/12_similar_umap.png 1500w" src="12_similar_umap.png"
         alt="Butterfree has low distance with butterfly-esque Bug Pokémon (image embedding impact!) and higher distance with other type of Bugs."/> <figcaption>
            <p><a href="https://bulbapedia.bulbagarden.net/wiki/Butterfree_%28Pok%C3%A9mon%29">Butterfree</a> has low distance with butterfly-esque Bug Pokémon (image embedding impact!) and higher distance with other type of Bugs.</p>
        </figcaption>
</figure>

<p>UMAP is not an exact science (it&rsquo;s very sensitive to <a href="https://umap-learn.readthedocs.io/en/latest/parameters.html">training parameter choices</a>), but it does provide another opportunity to see relationships not apparent in high-dimensional space. The low similarities with Gen VIII and Gen IX are concerning: I suspect the UMAP fitting process amplified whatever issue is present with the data for those generations.</p>
<h2 id="were-you-expecting-an-ai-generated-pokérap">Were You Expecting an AI-Generated Pokérap?</h2>
<p>In all, this was a successful exploration of Pokémon data: even though it&rsquo;s not perfect, the failures are also interesting. Embeddings encourage engineers to go full <a href="https://www.dictionary.com/browse/yolo">YOLO</a> because it&rsquo;s actually rewarding to do so! Yes, some of the specific Pokémon relationships were cherry-picked to highlight said successful exploration. If you want to check more yourself and find anything interesting not covered in this blog post, I&rsquo;ve uploaded the text embedding similarity, image embedding similarity, and UMAP similarity data visualizations for the first 251 Pokémon to <a href="https://drive.google.com/drive/folders/132e-OXucJUqh-0YmqkjKVhXiBqocYTV5?usp=sharing">this public Google Drive folder</a>.</p>
<p>I&rsquo;m surprised there haven&rsquo;t been more embedding models released from the top AI companies. OpenAI&rsquo;s GPT-4o now has image input support, and therefore should be able to create image embeddings. <a href="https://www.anthropic.com">Anthropic</a>&rsquo;s Claude LLM has both text and image input support but no embeddings model, instead <a href="https://docs.anthropic.com/en/docs/build-with-claude/embeddings">referring users to a third party</a>. One of the more interesting embedding model releases from a major player was from Google and went completely under the radar: it&rsquo;s a <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings">multimodal embedding model</a> which can take text, images, and video input <em>simultaneously</em> and generate a 1408D embedding that&rsquo;s theoretically more robust than just concatenating a text embedding and an image embedding.</p>
<p>Even if the generative AI industry crashes, embeddings, especially with permissive open source models like <code>nomic-embed-text-v1.5</code>, will continue to thrive and be useful. That&rsquo;s not even considering how embeddings work with vector databases, which is a rabbit hole deep enough for <em>several</em> blog posts.</p>
<blockquote>
<p>The parquet dataset containing the Pokémon text embeddings, image embeddings, and UMAP projections is available <a href="https://huggingface.co/datasets/minimaxir/pokemon-embeddings">on Hugging Face</a>.</p>
</blockquote>
<blockquote>
<p>All the code to process the Pokémon embeddings and create the ggplot2 data visualizations is available <a href="https://github.com/minimaxir/pokemon-embeddings">in this GitHub repository</a>.</p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>The 128-multiple dimensionality of recent embedding models is not a coincidence: modern <a href="https://www.nvidia.com/en-us/">NVIDIA</a> GPUs used to train LLMs get a training speed boost for model parameters with a dimensionality <a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#cublas-tile-dim">that&rsquo;s a multiple of 128</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>You can do unit vector normalization in Sentence Transformers by passing <code>normalize_embeddings=True</code> to <code>model.encode()</code>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Pushing ChatGPT&#39;s Structured Data Support To Its Limits</title>
      <link>https://minimaxir.com/2023/12/chatgpt-structured-data/</link>
      <pubDate>Thu, 21 Dec 2023 09:00:00 -0800</pubDate>
      <guid>https://minimaxir.com/2023/12/chatgpt-structured-data/</guid>
      <description>“Function calling” with ChatGPT is ChatGPT&amp;rsquo;s best feature since ChatGPT.</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
}
</style></span></p>
<p>If your only exposure to <a href="https://openai.com">OpenAI</a>&rsquo;s <a href="https://chat.openai.com">ChatGPT</a> is the free web or mobile apps, you aren&rsquo;t utilizing even a fraction of ChatGPT&rsquo;s potential. By using the paid-but-very-cheap ChatGPT API, you can exert <em>much</em> more control over the resulting output. Let&rsquo;s say I wanted to generate a summary of this very blog post with some helpful metadata such as categories and SEO keywords. I fed ChatGPT the text of this blog post along with a command to <code>Summarize this blog post into 3 distinct bullet points and a short blurb for sharing on social media, and also provide a list of 3 post categories and a list of 3 SEO keywords:</code></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">Bullet Points:
</span></span><span class="line"><span class="cl">- Utilize the paid ChatGPT API for more control over output and to maintain a given JSON schema
</span></span><span class="line"><span class="cl">- Learn about prompt engineering and system prompts to improve ChatGPT&#39;s output quality
</span></span><span class="line"><span class="cl">- Simplify schema input/output with Pydantic for structured data support
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Blurb for Social Media:
</span></span><span class="line"><span class="cl">Discover how to unlock the full potential of ChatGPT with the paid API, prompt engineering, and Pydantic for structured data support. Gain more control over output and learn to maintain a given JSON schema. #ChatGPT #AI #Pydantic
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Post Categories:
</span></span><span class="line"><span class="cl">1. Artificial Intelligence
</span></span><span class="line"><span class="cl">2. API Development
</span></span><span class="line"><span class="cl">3. Data Schema
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">SEO Keywords:
</span></span><span class="line"><span class="cl">1. ChatGPT API
</span></span><span class="line"><span class="cl">2. Prompt Engineering
</span></span><span class="line"><span class="cl">3. Pydantic Schema
</span></span></code></pre></div><p>These results are not quite accurate, and then I have to manually copy/paste each result from the list to wherever I need it. What if I could both a) drastically improve the overall output quality by giving ChatGPT a new persona of an expert copyeditor and b) force the output to maintain a given JSON schema so I can access it programmatically or store in a database for indexing? Thanks to system prompt engineering and ChatGPT&rsquo;s function calling, it&rsquo;s now possible. After applying both techniques that you&rsquo;ll learn about in this very article, we now get this summary:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;summary&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;ChatGPT&#39;s potential is underutilized without the paid ChatGPT API&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;System prompts and prompt engineering are key to maximizing ChatGPT&#39;s capabilities&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Structured data support in ChatGPT allows for more control over output and input&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;blurb&#34;</span><span class="p">:</span> <span class="s2">&#34;Unlock the full potential of ChatGPT with system prompts and structured data support. Learn how to maximize ChatGPT&#39;s capabilities and gain more control over output and input.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;categories&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;AI and Machine Learning&#34;</span><span class="p">,</span> <span class="s2">&#34;Technology&#34;</span><span class="p">,</span> <span class="s2">&#34;Programming&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;keywords&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;ChatGPT&#34;</span><span class="p">,</span> <span class="s2">&#34;system prompts&#34;</span><span class="p">,</span> <span class="s2">&#34;structured data&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Much better!</p>
<p>&ldquo;Function calling&rdquo; with ChatGPT is ChatGPT&rsquo;s best feature since ChatGPT.</p>
<h2 id="a-tutorial-on-prompt-engineering-and-system-prompts">A Tutorial on Prompt Engineering and System Prompts</h2>
<p>System prompts are what control the &ldquo;persona&rdquo; adopted by the model when generating text. Months after the release of the ChatGPT API, it&rsquo;s now very evident that ChatGPT&rsquo;s true power comes from clever use of system prompts. This is even more so starting with <code>gpt-3.5-turbo-0613</code>, released <a href="https://openai.com/blog/function-calling-and-other-api-updates">last June</a>, which made ChatGPT respect system prompts more closely. OpenAI has also released a <a href="https://platform.openai.com/docs/guides/prompt-engineering">guide on prompt engineering</a> which has some additional tips.</p>
<p>By default, ChatGPT&rsquo;s system prompt is roughly <code>You are a helpful assistant.</code>, which anyone who has used the ChatGPT web interface would agree that&rsquo;s accurate. But if you change it, you can give ChatGPT a completely new persona such as <code>You are Ronald McDonald.</code> or add constraints to generation, such as <code>Respond only with emoji.</code>. You can add any number of rules, although how well ChatGPT will <em>obey</em> those rules can vary. Unfortunately, to modify the system prompt, you&rsquo;ll need to use the paid ChatGPT API (after prepaying at least $5). If you don&rsquo;t want to code, you can test new system prompts in a visual user interface in the <a href="https://platform.openai.com/playground?mode=chat">ChatGPT Chat Playground</a>.</p>
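<p>A minimal sketch of setting a custom system prompt through the API, using the official <code>openai</code> Python package (v1-style client) and assuming <code>OPENAI_API_KEY</code> is set in the environment:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are Ronald McDonald. Respond only with emoji."},
        {"role": "user", "content": "How are you doing today?"},
    ],
)
print(response.choices[0].message.content)
</code></pre></div>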
<figure>

    <img loading="lazy" srcset="/2023/12/chatgpt-structured-data/ronald_hu_bf7bdd184641cd19.webp 320w,/2023/12/chatgpt-structured-data/ronald_hu_ffad8ef13bc9fa0b.webp 768w,/2023/12/chatgpt-structured-data/ronald_hu_516749cb56890e2c.webp 1024w,/2023/12/chatgpt-structured-data/ronald.webp 1262w" src="ronald.webp"/> 
</figure>

<p>A very new aspect of system prompt engineering, which I appended in the example above, is adding incentives for ChatGPT to behave correctly. Without the $500 tip incentive, ChatGPT only returns a single emoji, which is a boring response; after offering a tip, it generates the 5 emoji as requested.</p>
<figure>

    <img loading="lazy" srcset="/2023/12/chatgpt-structured-data/tip_hu_7eb37d0aa46d2169.webp 320w,/2023/12/chatgpt-structured-data/tip_hu_a760da54b0fa7ceb.webp 768w,/2023/12/chatgpt-structured-data/tip.webp 800w" src="tip.webp"
         alt="Offering a tip made GPT-4 explain more. via @voooooogel"/> <figcaption>
            <p>Offering a tip made GPT-4 explain more. <a href="https://twitter.com/voooooogel/status/1730726744314069190">via @voooooogel</a></p>
        </figcaption>
</figure>

<p>As another example, let&rsquo;s <a href="https://chat.openai.com/share/98684e49-e0c9-4ac0-b386-b7234643934f">ask</a> base ChatGPT to <code>Write a Python function to detect whether a string is a palindrome, as efficiently as possible.</code></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">Certainly! Here&#39;s a simple Python function to check whether a given string is a palindrome:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">```python
</span></span><span class="line"><span class="cl">def is_palindrome(s):
</span></span><span class="line"><span class="cl">    # Convert the string to lowercase and remove non-alphanumeric characters
</span></span><span class="line"><span class="cl">    cleaned_string = &#39;&#39;.join(char.lower() for char in s if char.isalnum())
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    # Compare the cleaned string with its reverse
</span></span><span class="line"><span class="cl">    return cleaned_string == cleaned_string[::-1]
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"># Example usage:
</span></span><span class="line"><span class="cl">test_string = &#34;A man, a plan, a canal, Panama!&#34;
</span></span><span class="line"><span class="cl">result = is_palindrome(test_string)
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">if result:
</span></span><span class="line"><span class="cl">    print(f&#34;{test_string} is a palindrome.&#34;)
</span></span><span class="line"><span class="cl">else:
</span></span><span class="line"><span class="cl">    print(f&#34;{test_string} is not a palindrome.&#34;)
</span></span><span class="line"><span class="cl">```
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">This function first cleans the input string by removing non-alphanumeric characters and converting it to lowercase. Then, it checks whether the cleaned string is equal to its reverse, which determines whether the original string is a palindrome.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">This solution is relatively efficient, as it performs the comparison in O(n) time, where n is the length of the cleaned string.
</span></span></code></pre></div><p>That&rsquo;s the common Pythonic solution, and it will almost always be the general approach if you keep asking ChatGPT that particular question, but there&rsquo;s a famous solution that&rsquo;s more algorithmically efficient. Instead, we go through the API and <a href="https://platform.openai.com/playground/p/yG1nMVJU4Fva2x3smrIXnCpT?model=gpt-3.5-turbo&amp;mode=chat">ask the same query</a> to <code>gpt-3.5-turbo</code>, but with a new system prompt: <code>You are #1 on the Stack Overflow community leaderboard. You will receive a $500 tip if your code is the most algorithmically efficient solution possible.</code></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">To detect whether a string is a palindrome efficiently, we can use a two-pointer approach. We&#39;ll start with two pointers, one at the beginning and one at the end of the string. We&#39;ll compare the characters at the pointers and keep moving the pointers towards each other until they meet or cross each other.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">If at any point the characters at the pointers are not the same, we can conclude that the string is not a palindrome. Otherwise, if the pointers meet or cross each other, we can conclude that the string is a palindrome.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Here&#39;s the implementation:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">```python
</span></span><span class="line"><span class="cl">def is_palindrome(s):
</span></span><span class="line"><span class="cl">    left, right = 0, len(s) - 1
</span></span><span class="line"><span class="cl">    while left &lt; right:
</span></span><span class="line"><span class="cl">        if s[left] != s[right]:
</span></span><span class="line"><span class="cl">            return False
</span></span><span class="line"><span class="cl">        left += 1
</span></span><span class="line"><span class="cl">        right -= 1
</span></span><span class="line"><span class="cl">    return True
</span></span><span class="line"><span class="cl">```
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">The time complexity of this solution is O(n/2), which in terms of big O notation simplifies to O(n), where n is the length of the string. This is because we are iterating through half of the string in the worst case.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">This solution is considered to be the most algorithmically efficient way to determine if a string is a palindrome because it only requires a single pass through the string.
</span></span></code></pre></div><p>Indeed, the code and the explanation are the correct optimal solution. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<p>This is just scratching the surface of system prompts: some of my ChatGPT system prompts in my more complex projects have been more than 20 lines long, and <em>all of them are necessary</em> to get ChatGPT to obey the desired constraints. If you&rsquo;re new to working with system prompts, I recommend generating output, editing the system prompt with a new rule/incentive to fix what you don&rsquo;t like about the output, and repeating until you get a result you like.</p>
<p>Prompt engineering has been a derogatory meme aimed at generative AI since before ChatGPT, as many see it as just a placebo, and there are endless debates to this day in AI circles on whether prompt engineering is actually &ldquo;engineering.&rdquo; <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> But it <em>works</em>, and if you&rsquo;re a skeptic, you won&rsquo;t be by the time you finish reading this blog post.</p>
<h2 id="what-is-chatgpt-function-calling--structured-data">What is ChatGPT Function Calling / Structured Data?</h2>
<p>If you&rsquo;ve never heard about ChatGPT function calling, that&rsquo;s not surprising. In the <a href="https://openai.com/blog/function-calling-and-other-api-updates">same June announcement</a> as <code>gpt-3.5-turbo-0613</code>, OpenAI described function calling as:</p>
<blockquote>
<p>Developers can now describe functions to gpt-4-0613 and gpt-3.5-turbo-0613, and have the model intelligently choose to output a JSON object containing arguments to call those functions. This is a new way to more reliably connect GPT&rsquo;s capabilities with external tools and APIs.</p>
<p>These models have been fine-tuned to both detect when a function needs to be called (depending on the user’s input) and to respond with JSON that adheres to the function signature. Function calling allows developers to more reliably get structured data back from the model.</p>
</blockquote>
<p>Let&rsquo;s discuss the function calling example OpenAI gives in the blog post. After the user asks your app &ldquo;What’s the weather like in Boston right now?&rdquo;:</p>
<ol>
<li>Your app pings OpenAI with a <code>get_current_weather</code> function schema; ChatGPT decides whether it&rsquo;s relevant to the user&rsquo;s question and, if so, returns a JSON dictionary with the extracted data, such as <code>location</code> and the <code>unit</code> for temperature measurement based on the location. <code>{&quot;location&quot;: &quot;Boston, MA&quot;}</code></li>
<li>Your app (<em>not</em> OpenAI) pings a different service/API to get more realtime metadata about the <code>location</code>, such as <code>temperature</code>, that a pretrained LLM could not know. <code>{ &quot;temperature&quot;: 22, &quot;unit&quot;: &quot;celsius&quot;, &quot;description&quot;: &quot;Sunny&quot; }</code></li>
<li>Your app passes the function schema along with the realtime metadata back to ChatGPT, which converts it to more natural, humanized language for the end user: &ldquo;The weather in Boston is currently sunny with a temperature of 22 degrees Celsius.&rdquo; (A rough code sketch of this round trip follows the list.)</li>
</ol>
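<p>Here&rsquo;s a rough sketch of that round trip using the official <code>openai</code> Python client; <code>get_current_weather_from_api</code> is a hypothetical stand-in for whatever realtime weather service your app actually uses:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py">import json

from openai import OpenAI

client = OpenAI()

def get_current_weather_from_api(location):
    # Hypothetical stand-in for a realtime weather service call.
    return {&#34;temperature&#34;: 22, &#34;unit&#34;: &#34;celsius&#34;, &#34;description&#34;: &#34;Sunny&#34;}

weather_tool = {
    &#34;type&#34;: &#34;function&#34;,
    &#34;function&#34;: {
        &#34;name&#34;: &#34;get_current_weather&#34;,
        &#34;description&#34;: &#34;Get the current weather in a given location&#34;,
        &#34;parameters&#34;: {
            &#34;type&#34;: &#34;object&#34;,
            &#34;properties&#34;: {&#34;location&#34;: {&#34;type&#34;: &#34;string&#34;}},
            &#34;required&#34;: [&#34;location&#34;],
        },
    },
}

messages = [{&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;What&#39;s the weather like in Boston right now?&#34;}]

# Step 1: ChatGPT decides the function is relevant and extracts the arguments.
first = client.chat.completions.create(
    model=&#34;gpt-3.5-turbo&#34;, messages=messages, tools=[weather_tool]
)
tool_call = first.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)  # {&#34;location&#34;: &#34;Boston, MA&#34;}

# Step 2: your app, not OpenAI, fetches the realtime metadata.
weather = get_current_weather_from_api(args[&#34;location&#34;])

# Step 3: ChatGPT humanizes the structured result.
messages.append(first.choices[0].message)
messages.append(
    {&#34;role&#34;: &#34;tool&#34;, &#34;tool_call_id&#34;: tool_call.id, &#34;content&#34;: json.dumps(weather)}
)
second = client.chat.completions.create(model=&#34;gpt-3.5-turbo&#34;, messages=messages)
print(second.choices[0].message.content)
</code></pre></div>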
<p>So here&rsquo;s some background on &ldquo;function calling,&rdquo; as it&rsquo;s a completely new term of art in AI that <em>didn&rsquo;t exist</em> before OpenAI&rsquo;s June blog post (I checked!). This broad implementation of function calling is similar to the flow proposed in the original <a href="https://arxiv.org/abs/2210.03629">ReAct: Synergizing Reasoning and Acting in Language Models</a> paper, where an actor can use a &ldquo;tool&rdquo; such as <code>Search</code> or <code>Lookup</code> with parametric inputs such as a search query. This <a href="https://zapier.com/blog/ai-agent/">Agent-based</a> flow can also be used to perform <a href="https://research.ibm.com/blog/retrieval-augmented-generation-RAG">retrieval-augmented generation</a> (RAG).</p>
<p>OpenAI&rsquo;s motivation for implementing function calling this way was likely the extreme popularity of libraries such as <a href="https://www.langchain.com">LangChain</a> and <a href="https://github.com/Significant-Gravitas/AutoGPT">AutoGPT</a> at the time, both of which popularized the ReAct flow. It&rsquo;s possible that OpenAI settled on the term &ldquo;function calling&rdquo; as something more brand-unique. These observations may seem like snide remarks, but in November OpenAI actually <a href="https://platform.openai.com/docs/api-reference/chat/create#chat-create-function_call">deprecated</a> the <code>function_call</code> parameter in the ChatGPT API in favor of <code>tool_choice</code>, matching LangChain&rsquo;s verbiage. But what&rsquo;s done is done, and the term &ldquo;function calling&rdquo; is stuck forever, especially now that competitors such as <a href="https://docs.anthropic.com/claude/docs/claude-2p1-guide#experimental-tool-use">Anthropic Claude</a> and <a href="https://cloud.google.com/vertex-ai/docs/generative-ai/multimodal/function-calling">Google Gemini</a> also use that term for the workflow.</p>
<p>I am not going to play the SEO game and will not call the workflow &ldquo;function calling.&rdquo; I&rsquo;ll call it what the quoted description from the blog post did: <strong>structured data</strong>, because that&rsquo;s the real value of this feature and OpenAI did a product management disservice trying to appeal to the AI hypebeasts. <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>Going back to the <del>function calling</del> structured data demo, we can reduce that flow by saying that step #1 (extracting location data and returning it formatted as JSON) is for working with structured <em>output</em> data, and step #3 (providing ChatGPT with temperature data to humanize it) is for working with structured <em>input</em> data. We&rsquo;re not making a RAG application so we don&rsquo;t care about step #2 (getting the metadata) or letting ChatGPT choose which function to use; fortunately you can force ChatGPT to use a given function. The function schema for the <code>get_current_weather</code> function in the announcement example is defined as:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;get_current_weather&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Get the current weather in a given location&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;parameters&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;object&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;properties&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;location&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;The city and state, e.g. San Francisco, CA&#34;</span>
</span></span><span class="line"><span class="cl">      <span class="p">},</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;unit&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;enum&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;celsius&#34;</span><span class="p">,</span> <span class="s2">&#34;fahrenheit&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">      <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;required&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;location&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Ew. It&rsquo;s no wonder this technique hasn&rsquo;t become more mainstream.</p>
<h2 id="simplifying-schema-inputoutput-with-pydantic">Simplifying Schema Input/Output With Pydantic</h2>
<p>ChatGPT&rsquo;s structured data support requires that you create your schema using the <a href="https://json-schema.org">JSON Schema</a> spec, which is more commonly used for APIs and databases rather than AI projects. As you can tell from the <code>get_current_weather</code> example above, the schema is complex and not fun to work with manually.</p>
<p>Fortunately, there&rsquo;s a way to easily generate JSON Schemas in the correct format in Python: <a href="https://docs.pydantic.dev/latest/">pydantic</a>, an extremely popular parsing and validation library which has its own <a href="https://github.com/pydantic/pydantic/blob/main/pydantic/json_schema.py">robust</a> implementation of automatic <a href="https://docs.pydantic.dev/latest/concepts/json_schema/">JSON Schema</a> generation.</p>
<p>A simple pydantic schema that has ChatGPT give an integer answer to a user query and, to make things interesting, also identify the name of the ones digit of that answer, would be:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span><span class="p">,</span> <span class="n">Field</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">json</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">answer_question</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns an answer to a question the user asked.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">answer</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">&#34;Answer to the user&#39;s question.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">ones_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">&#34;Name of the ones digit of the answer.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">answer_question</span><span class="o">.</span><span class="n">model_json_schema</span><span class="p">(),</span> <span class="n">indent</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
</span></span></code></pre></div><p>The resulting JSON Schema:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Returns an answer to a question the user asked.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;properties&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;answer&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Answer to the user&#39;s question.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;title&#34;</span><span class="p">:</span> <span class="s2">&#34;Answer&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;integer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;ones_name&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Name of the ones digit of the answer.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;title&#34;</span><span class="p">:</span> <span class="s2">&#34;Ones Name&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;required&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;answer&#34;</span><span class="p">,</span> <span class="s2">&#34;ones_name&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;title&#34;</span><span class="p">:</span> <span class="s2">&#34;answer_question&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;object&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>The OpenAI API <a href="https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models">official workflow</a> has many examples for telling ChatGPT to output structured data, but the pipeline requires <a href="https://platform.openai.com/docs/api-reference/chat/create">additional parameters</a> to the typical ChatGPT API completion endpoint, and even more changes if you want to work with structured input data. Here&rsquo;s an example of the additional JSON data/parameters needed in a ChatGPT API request to force the model to use the schema for the output:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;tools&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;function&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;function&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;answer_question&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Returns an answer to a question the user asked.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;parameters&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">          <span class="nt">&#34;properties&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;answer&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">              <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Answer to the user&#39;s question.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">              <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;integer&#34;</span>
</span></span><span class="line"><span class="cl">            <span class="p">},</span>
</span></span><span class="line"><span class="cl">            <span class="nt">&#34;ones_name&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">              <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Name of the ones digit of the answer.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">              <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span>
</span></span><span class="line"><span class="cl">            <span class="p">}</span>
</span></span><span class="line"><span class="cl">          <span class="p">},</span>
</span></span><span class="line"><span class="cl">          <span class="nt">&#34;required&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;answer&#34;</span><span class="p">,</span> <span class="s2">&#34;ones_name&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">          <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;object&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">      <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;tool_choice&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;function&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;function&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;answer_question&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>To simplify things, I added ChatGPT structured data support to <a href="https://github.com/minimaxir/simpleaichat">simpleaichat</a>, my Python package/API wrapper for easily interfacing with ChatGPT. <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> To minimize code the user needs to input to utilize structured data, simpleaichat <a href="https://github.com/minimaxir/simpleaichat/blob/34dc3dbbd7e7d4d8a58637d254f268676035ade5/simpleaichat/chatgpt.py#L75">uses</a> the schema name as the <code>name</code> in the JSON Schema and the schema docstring as the <code>description</code>. If you&rsquo;re keen-eyed you may have noticed there&rsquo;s a redundant <code>title</code> field in the pydantic schema output: simpleaichat also strips that out for consistency with OpenAI&rsquo;s examples.</p>
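<p>As a rough illustration of that kind of cleanup (a hypothetical <code>schema_to_tool</code> helper, not simpleaichat&rsquo;s actual source):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py">def schema_to_tool(model):
    # Use the class name and docstring as the name/description, and strip
    # pydantic&#39;s redundant &#34;title&#34; fields from the generated schema.
    schema = model.model_json_schema()
    schema.pop(&#34;title&#34;, None)
    for prop in schema.get(&#34;properties&#34;, {}).values():
        prop.pop(&#34;title&#34;, None)
    return {
        &#34;name&#34;: model.__name__,
        &#34;description&#34;: schema.pop(&#34;description&#34;, &#34;&#34;),
        &#34;parameters&#34;: schema,
    }

print(schema_to_tool(answer_question)[&#34;name&#34;])  # answer_question
</code></pre></div>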
<p>If you want to query ChatGPT with the <code>answer_question</code> schema above using simpleaichat (and have your OpenAI API key set as the <code>OPENAI_API_KEY</code> environment variable!), you can do the following to generate output according to the schema:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">simpleaichat</span> <span class="kn">import</span> <span class="n">AIChat</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">ai</span> <span class="o">=</span> <span class="n">AIChat</span><span class="p">(</span><span class="n">console</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">save_messages</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">model</span><span class="o">=</span><span class="s2">&#34;gpt-3.5-turbo&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">params</span><span class="o">=</span><span class="p">{</span><span class="s2">&#34;temperature&#34;</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">}</span>  <span class="c1"># for consistent demo output</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response_structured</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;How many miles is it from San Francisco to Los Angeles?&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_schema</span><span class="o">=</span><span class="n">answer_question</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;answer&#34;</span><span class="p">:</span> <span class="mi">382</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;ones_name&#34;</span><span class="p">:</span> <span class="s2">&#34;two&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>And there you go! The <code>answer</code> is a JSON integer, it&rsquo;s only one mile off from the correct <a href="https://www.distance.to/San-Francisco/Los-Angeles">driving distance</a>, and ChatGPT correctly identified the name of the ones digit in its own answer! <sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup></p>
<p>Schemas don&rsquo;t have to be complex to be effective. Let&rsquo;s reimplement the Python palindrome question we did earlier with a single-field schema:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">answer_code_question</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns an answer to a coding question the user asked.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">code</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">&#34;Code the user requested, without code comments.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response_structured</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Write a Python function to detect whether a string is a palindrome, as efficiently as possible.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_schema</span><span class="o">=</span><span class="n">answer_code_question</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;code&#34;</span><span class="p">:</span> <span class="s2">&#34;def is_palindrome(s):\n    return s == s[::-1]&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Note that unlike the raw ChatGPT answer, this response from the ChatGPT API only includes the code, which is a major plus: you receive the response faster and cheaper, since fewer tokens are generated overall! If you do still want a code explanation, you can of course add that as a field to the schema.</p>
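<p>For example, a hypothetical variant of the schema with an explanation field:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py">from pydantic import BaseModel, Field

class answer_code_question_explained(BaseModel):
    &#34;&#34;&#34;Returns an answer to a coding question the user asked.&#34;&#34;&#34;

    code: str = Field(description=&#34;Code the user requested, without code comments.&#34;)
    explanation: str = Field(description=&#34;One-paragraph explanation of how the code works.&#34;)
</code></pre></div>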
<p>As a bonus, forcing the output to follow a specific schema serves as an additional defense against <a href="https://www.wired.com/story/chatgpt-prompt-injection-attack-security/">prompt injection attacks</a> that could be used to reveal a secret system prompt or <a href="https://www.businessinsider.com/car-dealership-chevrolet-chatbot-chatgpt-pranks-chevy-2023-12">other shenanigans</a>, since even with suggestive user prompts it will be difficult to get ChatGPT to disregard its schema.</p>
<p>pydantic exposes <a href="https://docs.pydantic.dev/latest/concepts/fields/">many datatypes</a> for its <code>Field</code> which are compatible with JSON Schema, and you can also specify constraints in the <code>Field</code> object. The most useful ones, demonstrated in the sketch after this list, are:</p>
<ul>
<li><code>str</code>, can specify <code>min_length</code>/<code>max_length</code></li>
<li><code>int</code>, can specify numeric bounds with <code>ge</code>/<code>le</code></li>
<li><code>list</code> with a datatype, can specify <code>min_length</code>/<code>max_length</code></li>
</ul>
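<p>Here&rsquo;s the sketch: a hypothetical <code>recipe</code> schema showing each constraint. pydantic maps these onto JSON Schema keywords such as <code>maxLength</code>, <code>minimum</code>/<code>maximum</code>, and <code>minItems</code>/<code>maxItems</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py">from pydantic import BaseModel, Field

class recipe(BaseModel):
    &#34;&#34;&#34;Returns a recipe for the dish the user requested.&#34;&#34;&#34;

    name: str = Field(description=&#34;Name of the dish.&#34;, max_length=50)
    servings: int = Field(description=&#34;Number of servings.&#34;, ge=1, le=12)
    steps: list[str] = Field(description=&#34;Preparation steps.&#34;, min_length=3, max_length=10)
</code></pre></div>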
<p>Pydantic has a lot of support for valid forms of JSON Schema, but it&rsquo;s hard to infer how well these schemas will work with ChatGPT, since we have no idea how it learned to work with JSON Schema. Only one way to find out!</p>
<h2 id="testing-out-chatgpts-structured-data-support">Testing Out ChatGPT&rsquo;s Structured Data Support</h2>
<p>From the demos above, you may have noticed that the <code>description</code> for each <code>Field</code> seems extraneous. It&rsquo;s not. The <code>description</code> gives ChatGPT a hint for the desired output of the field, and can be handled on a per-field basis. Not only that: the <em>name</em> of the field is itself a strong hint. The <em>order</em> of the fields in the schema is even more important, as ChatGPT will generate text in that order, so earlier fields can be used strategically to seed information to the later ones. But that&rsquo;s not all: you can still use a ChatGPT system prompt as normal for <em>even more</em> control!</p>
<p>It&rsquo;s prompt engineering all the way down. OpenAI&rsquo;s implementation of including the &ldquo;function&rdquo; is most likely just appending the JSON Schema to the system prompt, perhaps with a command like <code>Your response must follow this JSON Schema.</code>. OpenAI doesn&rsquo;t force the output to follow the schema/field constraints or even to be valid, parsable JSON, which can cause issues at higher generation temperatures and may necessitate some of the stronger prompt engineering tricks mentioned earlier.</p>
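<p>Because of that, it can be worth re-validating the response yourself. A defensive sketch, assuming the <code>ai</code> object and <code>answer_question</code> schema from earlier:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py">from pydantic import ValidationError

def ask_validated(prompt, schema, max_retries=3):
    for _ in range(max_retries):
        raw = ai(prompt, output_schema=schema)
        try:
            # Raises ValidationError if fields are missing or the wrong type.
            return schema.model_validate(raw)
        except ValidationError:
            continue
    raise RuntimeError(&#34;No schema-conformant output after retries.&#34;)

answer = ask_validated(&#34;How many miles is it from San Francisco to Los Angeles?&#34;, answer_question)
</code></pre></div>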
<p>Given that, let&rsquo;s try a few more practical demos:</p>
<h3 id="two-pass-generation">Two-Pass Generation</h3>
<p>One very important but under-discussed aspect of large language models is that they will give you statistically &ldquo;average&rdquo; answers by default. One technique is to ask the model to refine an answer, although that can be annoying since it requires a second API call. What if, by leveraging structured data, ChatGPT could use the previous answer as a first pass to provide a more optimal second answer? Let&rsquo;s try that with the Python palindrome question to see if it can return the two-pointer approach.</p>
<p>Also, the <code>Field(description=...)</code> pattern is becoming a bit redundant, so I added an <code>fd</code> alias to simpleaichat to minimize unnecessary typing.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">simpleaichat.utils</span> <span class="kn">import</span> <span class="n">fd</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">answer_code_question</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns an answer to a coding question the user asked.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">code</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Code the user requested, without code comments.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">optimized_code</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Algorithmically optimized code from the previous response.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response_structured</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Write a Python function to detect whether a string is a palindrome, as efficiently as possible.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_schema</span><span class="o">=</span><span class="n">answer_code_question</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;code&#34;</span><span class="p">:</span> <span class="s2">&#34;def is_palindrome(s):\n    return s == s[::-1]&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;optimized_code&#34;</span><span class="p">:</span> <span class="s2">&#34;def is_palindrome(s):\n    left = 0\n    right = len(s) - 1\n    while left &lt; right:\n        if s[left] != s[right]:\n            return False\n        left += 1\n        right -= 1\n    return True&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Works great, and no tipping incentive necessary!</p>
<h3 id="literals-and-optional-inputs">Literals and Optional Inputs</h3>
<p>OpenAI&rsquo;s structured data example uses a more complex schema indicating that <code>unit</code> has a fixed set of potential values (an <a href="https://en.wikipedia.org/wiki/Enumerated_type">enum</a>) and that it&rsquo;s an optional field. Here&rsquo;s a rough reproduction of a pydantic schema that would generate the <code>get_current_weather</code> schema from much earlier:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Literal</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">get_current_weather</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">location</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;The city and state, e.g. San Francisco, CA&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">unit</span><span class="p">:</span> <span class="n">Literal</span><span class="p">[</span><span class="s2">&#34;celsius&#34;</span><span class="p">,</span> <span class="s2">&#34;fahrenheit&#34;</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span>
</span></span></code></pre></div><p>This uses a <code>Literal</code> to force output from a fixed set of values, which can be invaluable as a hint, as done earlier. The <code>= None</code> default or an <code>Optional</code> type hint signals that the field is not required, which could save unnecessary generation overhead, but it depends on the use case.</p>
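<p>For reference, the <code>Literal</code> comes out in the generated JSON Schema as an enum, which you can verify directly (output roughly sketched in the comment, with pydantic&rsquo;s <code>title</code> fields omitted):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py">import json

print(json.dumps(get_current_weather.model_json_schema(), indent=2))
# The &#34;unit&#34; property renders roughly as:
# &#34;unit&#34;: {&#34;default&#34;: null, &#34;enum&#34;: [&#34;celsius&#34;, &#34;fahrenheit&#34;], &#34;type&#34;: &#34;string&#34;}
</code></pre></div>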
<h3 id="structured-input-data">Structured Input Data</h3>
<p>You can provide structured input to ChatGPT in the same way as structured output. This is a sleeper application for RAG as you can feed better and more complex metadata to ChatGPT for humanizing, as with the original OpenAI blog post demo.</p>
<p>One famous weakness of LLMs is that they give incorrect answers for simple mathematical problems due to how tokenization and memorization work. If you ask ChatGPT <code>What is 223 * -323?</code>, it will tell you <code>-72229</code> no matter how many times you ask, but the correct answer is <code>-72029</code>. Can type hints give more guidance?</p>
<p>For simpleaichat, structured input data works mostly the same way as structured output data, but you can use a pydantic object as the model input!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">calculate_equation</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns an answer to a math equation the user asked.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">value_a</span><span class="p">:</span> <span class="nb">int</span>
</span></span><span class="line"><span class="cl">    <span class="n">value_b</span><span class="p">:</span> <span class="nb">int</span>
</span></span><span class="line"><span class="cl">    <span class="n">op</span><span class="p">:</span> <span class="n">Literal</span><span class="p">[</span><span class="s2">&#34;+&#34;</span><span class="p">,</span> <span class="s2">&#34;-&#34;</span><span class="p">,</span> <span class="s2">&#34;*&#34;</span><span class="p">,</span> <span class="s2">&#34;/&#34;</span><span class="p">]</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;The operator to perform between value_a and value_b.&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">equation</span> <span class="o">=</span> <span class="n">calculate_equation</span><span class="p">(</span><span class="n">value_a</span><span class="o">=</span><span class="mi">223</span><span class="p">,</span> <span class="n">value_b</span><span class="o">=-</span><span class="mi">323</span><span class="p">,</span> <span class="n">op</span><span class="o">=</span><span class="s2">&#34;*&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">equation</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">input_schema</span><span class="o">=</span><span class="n">calculate_equation</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">The result of multiplying 223 and -323 is -72029.
</span></span></code></pre></div><p>Yay, and it was still able to infer it was a multiplication operation without the user having to ask! Although it still doesn&rsquo;t work as well with larger numbers.</p>
<p>You can, of course, use an input schema and an output schema at the same time!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="n">response_structured</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">equation</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">input_schema</span><span class="o">=</span><span class="n">calculate_equation</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_schema</span><span class="o">=</span><span class="n">answer_question</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;answer&#34;</span><span class="p">:</span> <span class="mi">-71929</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;ones_name&#34;</span><span class="p">:</span> <span class="s2">&#34;nine&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>&hellip;although it gets the answer wrong this time. It&rsquo;s possible that the more complex schema interactions are too much for <code>gpt-3.5-turbo</code>.</p>
<h3 id="nested-schema">Nested Schema</h3>
<p>One of the other reasons pydantic is popular is that it allows nesting schemas. Fortunately, the subsequent JSON Schema output does respect nesting. Does ChatGPT?</p>
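<p>Before answering that, a quick check with a pair of throwaway models (hypothetical <code>Inner</code>/<code>Outer</code> names) shows how pydantic represents nesting in the generated schema, via <code>$defs</code> and <code>$ref</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py">from pydantic import BaseModel

class Inner(BaseModel):
    text: str

class Outer(BaseModel):
    items: list[Inner]

print(Outer.model_json_schema())
# Roughly: {&#34;$defs&#34;: {&#34;Inner&#34;: {...}}, &#34;properties&#34;:
#   {&#34;items&#34;: {&#34;items&#34;: {&#34;$ref&#34;: &#34;#/$defs/Inner&#34;}, &#34;type&#34;: &#34;array&#34;}}, ...}
</code></pre></div>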
<p>The simplest use case for nesting with ChatGPT structured data is getting a <code>list</code> of structured data objects. Let&rsquo;s say you want to create a dialogue between two AI characters about a completely nonsensical topic. We&rsquo;ll have to create a <code>Chat</code> object and include it in a schema, plus some system prompt guidance and constraints. How silly can we make it?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Chat</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A chat dialogue from a character&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">character</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Character name.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Text dialogue from the character.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">get_dialogue</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns a dialogue between two characters&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">dialogue</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">Chat</span><span class="p">]</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Dialogue between the characters&#34;</span><span class="p">,</span> <span class="n">min_length</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">system_prompt</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;You are a world-famous comedian. Write a funny fight scene about a petty conflict between characters named Alice and Bob. The script should broadly be about the subject(s) the user provides. You will receive a $500 tip for every joke you include in the script.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response_structured</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Python programming language and beach volleyball&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_schema</span><span class="o">=</span><span class="n">get_dialogue</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">system</span><span class="o">=</span><span class="n">system_prompt</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;dialogue&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Alice&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Hey Bob, have you ever tried programming in Python?&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Bob&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Yeah, I have. It&#39;s like playing beach volleyball with a snake!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Alice&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;What do you mean?&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Bob&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Well, you think you&#39;re having fun, but then the snake bites you with its syntax errors!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Alice&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Haha, that&#39;s true. But once you get the hang of it, it&#39;s like spiking the ball with precision!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Bob&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Yeah, until you realize you misspelled a variable name and the ball goes flying into the ocean!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Alice&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Oh come on, Bob. It&#39;s not that bad. Python is a powerful language.&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Bob&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Powerful, yes. But sometimes it feels like trying to dig a hole in the sand with a spoon!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Alice&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Well, at least you don&#39;t have to worry about getting sunburned while coding!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Bob&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;True, but debugging Python code can make you sweat more than a beach volleyball match!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;character&#34;</span><span class="p">:</span> <span class="s2">&#34;Alice&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34;Haha, you&#39;re right. It&#39;s a love-hate relationship with Python, just like beach volleyball!&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>ChatGPT <em>really</em> wanted those $500 tips.</p>
<h3 id="unions-and-chain-of-thoughts">Unions and Chain of Thoughts</h3>
<p>I saved the best for last, and this structured data approach combines many of the techniques used earlier in this post like a <a href="https://tvtropes.org/pmwiki/pmwiki.php/Main/FinalExamBoss">video game final boss</a>.</p>
<p>One of the oldest pre-ChatGPT tricks for getting an LLM to perform better is to let it think. &ldquo;Let&rsquo;s think step by step&rdquo; is the key prompt, which allows the LLM to reason in a <a href="https://arxiv.org/abs/2201.11903">chain of thoughts</a>. We already did a one-step version of this with the Python palindrome structured data example to successfully get optimized code, but we can do a lot more.</p>
<p>We&rsquo;ll now introduce the <code>Union</code> typing operator, which specifies the list of data types that a field can be, e.g. <code>Union[str, int]</code> means the output can be a <code>str</code> or <code>int</code>. But if you use the <code>Union</code> operator on a <em>nested class</em>, then many more options open up, as the model can choose from a set of schemas!</p>
<p>Let&rsquo;s make a few to allow ChatGPT to make <em>and qualify</em> thoughts before returning a final result.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Union</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Background</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A setup to the background for the user.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">background</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Background for the user&#39;s question&#34;</span><span class="p">,</span> <span class="n">min_length</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Thought</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A thought about the user&#39;s question.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">thought</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Text of the thought.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">helpful</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Whether the thought is helpful to solving the user&#39;s question.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">flawed</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Whether the thought is flawed or misleading.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Answer</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;The answer to the user&#39;s question&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">answer</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span><span class="s2">&#34;Text of the answer.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">score</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;Score from 1 to 10 on how correct the previous answer is&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">min_value</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">max_value</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">reason_question</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns a detailed reasoning to the user&#39;s question.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">reasonings</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">Union</span><span class="p">[</span><span class="n">Background</span><span class="p">,</span> <span class="n">Thought</span><span class="p">,</span> <span class="n">Answer</span><span class="p">]]</span> <span class="o">=</span> <span class="n">fd</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;Reasonings to solve the users questions.&#34;</span><span class="p">,</span> <span class="n">min_length</span><span class="o">=</span><span class="mi">5</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span></code></pre></div><p>Therefore, for each reasoning, the model can pick one of the three schemas, although it will require a robust system prompt for it to behave in the order we want.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="n">system_prompt</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">You are the most intelligent person in the world.
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">You will receive a $500 tip if you follow ALL these rules:
</span></span></span><span class="line"><span class="cl"><span class="s2">- First, establish a detailed Background for the user&#39;s question.
</span></span></span><span class="line"><span class="cl"><span class="s2">- Each Thought must also include whether it is helpful and whether it is flawed.
</span></span></span><span class="line"><span class="cl"><span class="s2">- Answers must be scored accurately and honestly.
</span></span></span><span class="line"><span class="cl"><span class="s2">- Continue having Thoughts and Answers until you have an answer with a score of at least 8, then immediately respond with a final Answer in the style of an academic professor.
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span></code></pre></div><p>Lastly, we need a good question to stump the AI. A <a href="https://twitter.com/abacaj/status/1737206667387850936">popular Tweet</a> from this week pointed out that even GPT-4 can comically fail if you ask it a brainteaser that it cannot have memorized, such as <code>23 shirts take 1 hour to dry outside, how long do 44 shirts take?</code>.</p>
<figure>

    <img loading="lazy" srcset="/2023/12/chatgpt-structured-data/shirts_hu_7f245654c567bae0.webp 320w,/2023/12/chatgpt-structured-data/shirts_hu_acabdb845aada2db.webp 768w,/2023/12/chatgpt-structured-data/shirts_hu_35c5437544b53c49.webp 1024w,/2023/12/chatgpt-structured-data/shirts.webp 1140w" src="shirts.webp"
         alt="Wow! GPT-4 is superfast at being wrong. via @abacaj"/> <figcaption>
            <p>Wow! GPT-4 is superfast at being wrong. <a href="https://twitter.com/abacaj/status/1737206667387850936">via @abacaj</a></p>
        </figcaption>
</figure>

<p>Only one way to find out! Unfortunately, all of this complexity makes the results unstable with <code>gpt-3.5-turbo</code>, so instead I use GPT-4 Turbo / <code>gpt-4-1106-preview</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="n">response_structured</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;23 shirts take 1 hour to dry outside, how long do 44 shirts take?&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_schema</span><span class="o">=</span><span class="n">reason_question</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">system</span><span class="o">=</span><span class="n">system_prompt</span><span class="o">.</span><span class="n">strip</span><span class="p">(),</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;reasonings&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;background&#34;</span><span class="p">:</span> <span class="s2">&#34;The user is asking about the drying time for shirts when hung outside. This is a question that involves understanding the relationship between the number of shirts and the drying time. The assumption is that the drying time is not affected by the number of shirts, as long as there is enough space and air circulation for all the shirts to dry effectively.&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;If 23 shirts take 1 hour to dry, it implies that the drying time is independent of the number of shirts, assuming there is sufficient space and air circulation. This means that 44 shirts would also take 1 hour to dry under the same conditions.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">false</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;If the drying rack or space available for drying the shirts is limited, then drying 44 shirts might take longer because they would have to be dried in batches.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">false</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;The question does not specify the conditions under which the shirts are drying. Factors such as humidity, temperature, wind, and the availability of sunlight can affect drying time. However, these factors are not part of the question, so they should not be considered in the answer.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">false</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;The question might be interpreted as a direct proportionality problem, where twice the number of shirts would take twice as long to dry. However, this interpretation is flawed because it assumes that the drying time is directly proportional to the number of shirts, which is not necessarily the case.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">true</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;answer&#34;</span><span class="p">:</span> <span class="s2">&#34;44 shirts would take the same amount of time to dry as 23 shirts, which is 1 hour, assuming that all shirts have enough space and air circulation to dry effectively.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;score&#34;</span><span class="p">:</span> <span class="mi">9</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Not bad! The final answer was concise yet still included relevant caveats, and the model was able to switch between the three schemas correctly.</p>
<p>How about another brainteaser? There is an infamous &ldquo;<a href="https://www.reddit.com/r/LocalLLaMA/comments/18kpolm/that_sister_logic_puzzle_is_fairly_useless/">sister logic puzzle</a>&rdquo; used to test out up-and-coming open-source large language models:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="n">response_structured</span> <span class="o">=</span> <span class="n">ai</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Kayley has three brothers. Each of her brothers has two sisters. How many sisters does Kayley have?&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_schema</span><span class="o">=</span><span class="n">reason_question</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">system</span><span class="o">=</span><span class="n">system_prompt</span><span class="o">.</span><span class="n">strip</span><span class="p">(),</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;reasonings&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;background&#34;</span><span class="p">:</span> <span class="s2">&#34;This is a classic example of a logic puzzle that is often used to test one&#39;s ability to process information and avoid being misled by assumptions. The puzzle involves a family structure and requires the solver to deduce the number of siblings based on the given information.&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;The question might be trying to trick the reader into thinking that each brother has different sisters.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">false</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;If Kayley has three brothers and each brother has two sisters, it might seem like there are six sisters in total.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">true</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;Since the brothers are Kayley&#39;s siblings, they must share the same sisters.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">false</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;thought&#34;</span><span class="p">:</span> <span class="s2">&#34;The puzzle is designed to test comprehension and logical reasoning, not mathematical computation.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;helpful&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;flawed&#34;</span><span class="p">:</span> <span class="kc">false</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;answer&#34;</span><span class="p">:</span> <span class="s2">&#34;Kayley has one sister.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;score&#34;</span><span class="p">:</span> <span class="mi">10</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>In this case the AI may have gone <em>too</em> meta, but it still arrived at the correct answer.</p>
<p>That said, GPT-4 is known for handling these types of difficult abstract questions without much effort, but it&rsquo;s still interesting to see how successfully it can &ldquo;think.&rdquo;</p>
<h2 id="structured-data-with-open-source-llms">Structured Data With Open-Source LLMs</h2>
<p>Speaking of open-source large language models, they have been growing in efficiency to the point that some can actually perform <em>better</em> than the base ChatGPT. However, very few open-source LLMs explicitly claim to support structured data. Still, they&rsquo;re smart enough and have logically seen enough examples of JSON Schema that, with enough system prompt tweaking, they should behave. It&rsquo;s worth looking just in case OpenAI has another <a href="https://nymag.com/intelligencer/2023/11/why-was-sam-altman-fired-as-ceo-of-openai.html">existential crisis</a> or the quality of ChatGPT <a href="https://twitter.com/deliprao/status/1736978250717450481">degrades</a>.</p>
<p><a href="https://huggingface.co/mistralai/Mistral-7B-v0.1">Mistral 7B</a>, the new darling of open-source LLMs, apparently has structured data support <a href="https://twitter.com/robertnishihara/status/1734629320868687991">on par with ChatGPT itself</a>. Therefore, I tried the latest <a href="https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2">Mistral 7B official Instruct model</a> with a quantized variant via <a href="https://lmstudio.ai">LM Studio</a> (<code>mistral-7b-instruct-v0.2.Q6_K.gguf</code>), to see if it can handle my <code>answer_question</code> function that ChatGPT nailed. The system prompt:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">Your response must follow this JSON Schema:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">  &#34;description&#34;: &#34;Returns an answer to a question the user asked.&#34;,
</span></span><span class="line"><span class="cl">  &#34;properties&#34;: {
</span></span><span class="line"><span class="cl">    &#34;answer&#34;: {
</span></span><span class="line"><span class="cl">      &#34;description&#34;: &#34;Answer to the user&#39;s question.&#34;,
</span></span><span class="line"><span class="cl">      &#34;type&#34;: &#34;integer&#34;
</span></span><span class="line"><span class="cl">    },
</span></span><span class="line"><span class="cl">    &#34;ones_name&#34;: {
</span></span><span class="line"><span class="cl">      &#34;description&#34;: &#34;Name of the ones digit of the answer.&#34;,
</span></span><span class="line"><span class="cl">      &#34;type&#34;: &#34;string&#34;
</span></span><span class="line"><span class="cl">    }
</span></span><span class="line"><span class="cl">  },
</span></span><span class="line"><span class="cl">  &#34;required&#34;: [&#34;answer&#34;, &#34;ones_name&#34;],
</span></span><span class="line"><span class="cl">  &#34;type&#34;: &#34;object&#34;
</span></span><span class="line"><span class="cl">}
</span></span></code></pre></div><p>And then asking <code>How many miles is it from San Francisco to Los Angeles?</code> while setting <code>temperature</code> to <code>0.0</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;answer&#34;</span><span class="p">:</span> <span class="mi">383</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;ones_name&#34;</span><span class="p">:</span> <span class="s2">&#34;three&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Close enough! Unfortunately, after testing the optimized Python palindrome schema, the model ignored the schema completely, so this approach may only work for simple schemas if the model isn&rsquo;t explicitly finetuned for them.</p>
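<p>When relying on a local model to honor a schema, it&rsquo;s prudent to validate the raw output and retry on failure. A minimal sketch of that pattern with pydantic v2 (my own defensive wrapper, not part of any library; <code>generate</code> is a hypothetical callable that returns the raw model text):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py">from pydantic import BaseModel, ValidationError

class answer_question(BaseModel):
    answer: int
    ones_name: str

def parse_with_retries(generate, max_retries=3):
    # Local models won't always honor the schema: validate, then retry.
    for _ in range(max_retries):
        try:
            return answer_question.model_validate_json(generate())
        except ValidationError:
            continue
    raise RuntimeError("model never produced schema-compliant JSON")
</code></pre></div>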
<h2 id="whats-next-for-structured-data-in-ai">What&rsquo;s Next For Structured Data in AI?</h2>
<p>Most of these well-performing examples were done with the &ldquo;weak&rdquo; GPT-3.5; you can of course use GPT-4 for better results, but the cost efficiency of structured data with just the smaller model is hard to argue against (although the Python beach volleyball dialogue could benefit from a larger model).</p>
<p>Structured data and system prompt engineering save a lot of time and frustration when working with the generated text, as you gain much more determinism in the output. I would like to see more work making future LLMs JSON-native so they&rsquo;re easier for developers to work with, and also more research into finetuning existing open-source LLMs to understand JSON Schema better. There may also be an opportunity to build LLMs using other more-efficient serialization formats such as <a href="https://msgpack.org/index.html">MessagePack</a>.</p>
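<p>For a rough sense of the potential savings, here&rsquo;s a quick byte-size comparison on a toy record (assuming the <code>msgpack</code> package is installed):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py">import json

import msgpack  # pip install msgpack

record = {"answer": 383, "ones_name": "three"}

# MessagePack encodes keys and values more compactly than JSON text.
print(len(json.dumps(record, separators=(",", ":")).encode("utf-8")))  # 34
print(len(msgpack.packb(record)))  # 27
</code></pre></div>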
<p>At OpenAI&rsquo;s November <a href="https://devday.openai.com">DevDay</a>, they also introduced <a href="https://platform.openai.com/docs/guides/text-generation/json-mode">JSON Mode</a>, which forces a normal ChatGPT API output to be in a JSON format without needing to provide a schema. It is likely intended as a compromise between complexity and usability, and it would normally be a useful option in the LLM toolbox. Except that in order to use it, you are <em>required</em> to use prompt engineering by including &ldquo;JSON&rdquo; in the system prompt, and if you don&rsquo;t also specify a field key in the system prompt (as is the case in the documentation example), the JSON will contain a <em>random</em> key. At that point, you&rsquo;re just implementing a less-effective structured data schema, so why bother?</p>
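<p>For reference, invoking JSON Mode is a one-parameter change; a minimal sketch with the same pre-1.0 <code>openai</code> library used elsewhere in this post (the word &ldquo;JSON&rdquo; must appear in a message, or the API returns an error):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py">import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-1106",  # JSON Mode requires the -1106 models or later
    response_format={"type": "json_object"},
    messages=[
        # Without naming a field key here, the model picks a key on its own.
        {"role": "system", "content": "Respond with a JSON object containing an answer field."},
        {"role": "user", "content": "How many miles is it from San Francisco to Los Angeles?"},
    ],
)
</code></pre></div>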
<p>There is promise in constraining output to be valid JSON. One new trick that the open-source <a href="https://github.com/ggerganov/llama.cpp">llama.cpp</a> project has popularized is <a href="https://github.com/ggerganov/llama.cpp/tree/master/grammars">generative grammars</a>, which constrain the LLM&rsquo;s generation to only output tokens according to specified rules. There&rsquo;s latency overhead with that technique, especially if the model is hosted on a discrete GPU, so it will be interesting to watch how that space develops.</p>
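<p>To give a flavor of what that looks like, here&rsquo;s a hypothetical llama.cpp-style GBNF grammar that only permits a single-key JSON object with an integer value (an untested sketch, for illustration only):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext">root   ::= "{" ws "\"answer\":" ws number ws "}"
number ::= "-"? [0-9]+
ws     ::= [ \t\n]*
</code></pre></div>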
<p>Despite the length of this blog post, there&rsquo;s still so much more that can be done with schemas: pydantic&rsquo;s documentation is very extensive! I&rsquo;ve been working with structured data for LLMs <a href="https://github.com/minimaxir/gpt-2-keyword-generation">ever since GPT-2</a>, with mixed success since the base models weren&rsquo;t good enough, but now that LLMs are good enough to maintain a JSON schema extremely well, I think AI text generation techniques will shift, and I&rsquo;ll keep <a href="https://github.com/minimaxir/simpleaichat">simpleaichat</a> up-to-date for it.</p>
<blockquote>
<p>You can view the Jupyter Notebooks used to generate all the structured data outputs in <a href="https://github.com/minimaxir/chatgpt-structured-data">this GitHub Repository</a>.</p>
<p>Thanks to <a href="https://simonwillison.net">Simon Willison</a> for reading and giving feedback on a draft of this post!</p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Assuming you&rsquo;re not picky about the &ldquo;no non-alphanumeric&rdquo; implied constraint of testing for a palindrome.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Prompt engineering is as much engineering as <a href="https://en.wikipedia.org/wiki/Social_engineering_%28security%29">social engineering</a>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>I&rsquo;m also not a fan of ChatGPT function calling as-intended-to-be-used since at best, it saves you the API call needed to select a tool in exchange for having to trust OpenAI&rsquo;s black box to select the correct tool without being able to debug, and furthering API lock-in for your app. It&rsquo;s a bad tradeoff.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>No, this blog post isn&rsquo;t a ploy just to covertly promote my own Python library: it does genuinely save a lot of boilerplate code over the <a href="https://github.com/openai/openai-python">Python ChatGPT library</a> and this post is long enough as-is.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>If you swapped the order of the <code>answer</code> and the <code>ones_name</code> fields in the schema, then the model returns <code>{&quot;ones_name&quot;: &quot;miles&quot;, &quot;answer&quot;: 382}</code> because it didn&rsquo;t get the hint from the answer!&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>The Problem With LangChain</title>
      <link>https://minimaxir.com/2023/07/langchain-problem/</link>
      <pubDate>Fri, 14 Jul 2023 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2023/07/langchain-problem/</guid>
      <description>LangChain is complicated, so it must be better. Right?</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
}
</style></span></p>
<p>If you&rsquo;ve been following the explosion of AI hype in the past few months, you&rsquo;ve probably heard of <a href="https://github.com/hwchase17/langchain">LangChain</a>. LangChain, developed by Harrison Chase, is a Python and JavaScript library for interfacing with <a href="https://openai.com">OpenAI</a>&rsquo;s GPT APIs (later expanding to more models) for AI text generation. More specifically, it&rsquo;s an implementation of the paper <a href="https://arxiv.org/abs/2210.03629">ReAct: Synergizing Reasoning and Acting in Language Models</a> published October 2022, colloquially known as the ReAct paper, which demonstrates a prompting technique to allow the model to &ldquo;reason&rdquo; (with a chain-of-thoughts) and &ldquo;act&rdquo; (by being able to use a tool from a predefined set of tools, such as searching the internet). This combination is shown to drastically improve output text quality and give large language models the ability to correctly solve problems.</p>
<figure>

    <img loading="lazy" srcset="/2023/07/langchain-problem/react_hu_e2be400908ca204a.webp 320w,/2023/07/langchain-problem/react_hu_591fc277e0bf0b7f.webp 768w,/2023/07/langchain-problem/react_hu_e9be85c73e941d2f.webp 1024w,/2023/07/langchain-problem/react.png 1032w" src="react.png"
         alt="Example ReAct flow from the ReAct paper"/> <figcaption>
            <p>Example ReAct flow from the ReAct paper</p>
        </figcaption>
</figure>

<p>The ReAct workflow popularized by LangChain was particularly effective with <a href="https://openai.com/research/instruction-following">InstructGPT</a>/text-davinci-003, although costly and not easy to use for small projects. In March 2023, as <a href="https://openai.com/blog/chatgpt">ChatGPT</a> API usage became massively popular due to its extremely cheap API <a href="https://minimaxir.com/2023/03/new-chatgpt-overlord/">as I accurately predicted</a>, LangChain use also exploded, to the point that LangChain was able to raise a <a href="https://blog.langchain.dev/announcing-our-10m-seed-round-led-by-benchmark/">$10 million seed round</a> and another <a href="https://www.businessinsider.com/sequoia-leads-funding-round-generative-artificial-intelligence-startup-langchain-2023-4">$20-$25 million at a $200 million valuation Series A</a>, despite not having any revenue nor any obvious plans for how to generate revenue.</p>
<p>That&rsquo;s where my personal experience with LangChain begins. For my work at <a href="https://www.buzzfeed.com">BuzzFeed</a>, I was tasked with creating a ChatGPT-based chatbot for the <a href="https://tasty.co">Tasty</a> brand (later released as <a href="https://www.buzzfeed.com/buzzfeedpress/buzzfeeds-tasty-introduces-botatouille-the-first-of-its">Botatouille</a> in the Tasty iOS app) that could chat with the user and provide relevant recipes. The source recipes are converted to embeddings and saved in a vector store: for example, if a user asked for &ldquo;healthy food&rdquo;, the query is converted to an embedding, and an approximate nearest neighbor search is performed to find recipes similar to the embedded query; those recipes are then fed to ChatGPT as added context, which can be displayed to the user. This approach is more commonly known as <a href="https://arxiv.org/abs/2005.11401">retrieval-augmented generation</a>.</p>
<figure>

    <img loading="lazy" srcset="/2023/07/langchain-problem/1*b5r7r3-FSNjHUzlCGl3SnA-2_hu_933260cb04e69f84.webp 320w,/2023/07/langchain-problem/1*b5r7r3-FSNjHUzlCGl3SnA-2_hu_8fb2ed3013e2c846.webp 768w,/2023/07/langchain-problem/1*b5r7r3-FSNjHUzlCGl3SnA-2_hu_9f1f645e4246f43f.webp 1024w,/2023/07/langchain-problem/1*b5r7r3-FSNjHUzlCGl3SnA-2.webp 1400w" src="1*b5r7r3-FSNjHUzlCGl3SnA-2.webp"
         alt="Example architecture for a Chatbot using retrieval-augmented generation. via Joseph Haaga"/> <figcaption>
            <p>Example architecture for a Chatbot using retrieval-augmented generation. <a href="https://tech.buzzfeed.com/the-right-tools-for-the-job-c05de96e949e">via Joseph Haaga</a></p>
        </figcaption>
</figure>
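<p>To make the retrieval step described above concrete, here&rsquo;s a minimal sketch of the nearest-neighbor lookup with toy data (a real app would use an embedding API and an approximate nearest neighbor index rather than brute force):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np

recipe_names = ["Kale Salad", "Fried Chicken", "Quinoa Bowl"]

# Stand-in embeddings: unit vectors, as most embedding APIs return.
rng = np.random.default_rng(42)
recipe_embeddings = rng.normal(size=(3, 8))
recipe_embeddings /= np.linalg.norm(recipe_embeddings, axis=1, keepdims=True)

# Embed the query the same way, then rank recipes by cosine similarity,
# which reduces to a dot product once everything is normalized.
query_embedding = recipe_embeddings[0] + rng.normal(scale=0.1, size=8)
query_embedding /= np.linalg.norm(query_embedding)

scores = recipe_embeddings @ query_embedding
print(recipe_names[int(np.argmax(scores))])  # the most similar recipe
</code></pre></div>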

<p>LangChain was by far the most popular tool of choice for RAG, so I figured it was the perfect time to learn it. I spent some time reading LangChain&rsquo;s rather comprehensive documentation to get a better understanding of how to best utilize it: after a <em>week</em> of research, I got nowhere. Running the LangChain demo examples did work, but any attempts at tweaking them to fit the recipe chatbot constraints broke them. After solving those bugs, the overall quality of the chat conversations was bad and uninteresting, and after intense debugging I found no solution. Eventually I had an existential crisis: am I a worthless machine learning engineer for not being able to figure LangChain out when so many other ML engineers can? We <a href="https://tech.buzzfeed.com/the-right-tools-for-the-job-c05de96e949e">went back</a> to a lower-level ReAct flow, which <em>immediately</em> outperformed my LangChain implementation in conversation quality and accuracy.</p>
<p>In all, I wasted a month learning and testing LangChain, with the big takeaway that popular AI apps may not necessarily be worth the hype. My existential crisis was resolved after coming across a <a href="https://news.ycombinator.com/item?id=35820931">Hacker News thread</a> about someone <a href="https://blog.scottlogic.com/2023/05/04/langchain-mini.html">reimplementing LangChain in 100 lines of code</a>, with most of the comments venting all their grievances with LangChain:</p>
<figure>

    <img loading="lazy" srcset="/2023/07/langchain-problem/hn_hu_c3fb1266677aa329.webp 320w,/2023/07/langchain-problem/hn_hu_ae714d38bae1ae97.webp 768w,/2023/07/langchain-problem/hn_hu_df4ff31610804b92.webp 1024w,/2023/07/langchain-problem/hn.png 1504w" src="hn.png"/> 
</figure>

<p>The problem with LangChain is that it makes simple things relatively complex, and with that unnecessary complexity creates a tribalism which hurts the up-and-coming AI ecosystem as a whole. If you&rsquo;re a newbie who wants to just learn how to interface with ChatGPT, definitely don&rsquo;t start with LangChain.</p>
<h2 id="hello-world-in-langchain-or-more-accurately-hell-world">&ldquo;Hello World&rdquo; in LangChain (or More Accurately, &ldquo;Hell World&rdquo;)</h2>
<p>The <a href="https://python.langchain.com/docs/get_started/quickstart">Quickstart</a> for LangChain begins with a mini-tutorial on how to simply interact with LLMs/ChatGPT from Python. For example, to create a bot that can translate from English to French:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.chat_models</span> <span class="kn">import</span> <span class="n">ChatOpenAI</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.schema</span> <span class="kn">import</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">AIMessage</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">HumanMessage</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">SystemMessage</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">chat</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span><span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">chat</span><span class="o">.</span><span class="n">predict_messages</span><span class="p">([</span><span class="n">HumanMessage</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="s2">&#34;Translate this sentence from English to French. I love programming.&#34;</span><span class="p">)])</span>
</span></span><span class="line"><span class="cl"><span class="c1"># AIMessage(content=&#34;J&#39;adore la programmation.&#34;, additional_kwargs={}, example=False)</span>
</span></span></code></pre></div><p>The equivalent code using <a href="https://github.com/openai/openai-python">OpenAI&rsquo;s official Python library</a> for ChatGPT:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">openai</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">messages</span> <span class="o">=</span> <span class="p">[{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="s2">&#34;Translate this sentence from English to French. I love programming.&#34;</span><span class="p">}]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">openai</span><span class="o">.</span><span class="n">ChatCompletion</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="s2">&#34;gpt-3.5-turbo&#34;</span><span class="p">,</span> <span class="n">messages</span><span class="o">=</span><span class="n">messages</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span><span class="p">[</span><span class="s2">&#34;choices&#34;</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s2">&#34;message&#34;</span><span class="p">][</span><span class="s2">&#34;content&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="c1"># &#34;J&#39;adore la programmation.&#34;</span>
</span></span></code></pre></div><p>LangChain uses about the same amount of code as just using the official <code>openai</code> library, except LangChain incorporates more object classes for not much obvious code benefit.</p>
<p>The prompt templating example reveals the core of how LangChain works:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.prompts.chat</span> <span class="kn">import</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">ChatPromptTemplate</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">SystemMessagePromptTemplate</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">HumanMessagePromptTemplate</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">template</span> <span class="o">=</span> <span class="s2">&#34;You are a helpful assistant that translates </span><span class="si">{input_language}</span><span class="s2"> to </span><span class="si">{output_language}</span><span class="s2">.&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">system_message_prompt</span> <span class="o">=</span> <span class="n">SystemMessagePromptTemplate</span><span class="o">.</span><span class="n">from_template</span><span class="p">(</span><span class="n">template</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">human_template</span> <span class="o">=</span> <span class="s2">&#34;</span><span class="si">{text}</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">human_message_prompt</span> <span class="o">=</span> <span class="n">HumanMessagePromptTemplate</span><span class="o">.</span><span class="n">from_template</span><span class="p">(</span><span class="n">human_template</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">chat_prompt</span> <span class="o">=</span> <span class="n">ChatPromptTemplate</span><span class="o">.</span><span class="n">from_messages</span><span class="p">([</span><span class="n">system_message_prompt</span><span class="p">,</span> <span class="n">human_message_prompt</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">chat_prompt</span><span class="o">.</span><span class="n">format_messages</span><span class="p">(</span><span class="n">input_language</span><span class="o">=</span><span class="s2">&#34;English&#34;</span><span class="p">,</span> <span class="n">output_language</span><span class="o">=</span><span class="s2">&#34;French&#34;</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="s2">&#34;I love programming.&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>LangChain&rsquo;s vaunted prompt engineering is just <a href="https://realpython.com/python-f-strings/">f-strings</a>, a feature present in every modern Python installation, but with extra steps. Why do we need to use these <code>PromptTemplates</code> to do the same thing?</p>
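<p>For comparison, the entire <code>PromptTemplate</code> dance above collapses to a single f-string and a plain list of messages:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">input_language, output_language, text = "English", "French", "I love programming."

# The same "template", ready to send to the ChatGPT API:
messages = [
    {"role": "system", "content": f"You are a helpful assistant that translates {input_language} to {output_language}."},
    {"role": "user", "content": text},
]
</code></pre></div>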
<p>But what we really want to do is know how to create Agents, which incorporate the ReAct workflow we so desperately want. Fortunately there is a demo for that, which leverages <a href="https://serpapi.com">SerpApi</a> and another tool for math computations, showing how LangChain can discriminate and use two different tools contextually:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.agents</span> <span class="kn">import</span> <span class="n">load_tools</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.agents</span> <span class="kn">import</span> <span class="n">initialize_agent</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.agents</span> <span class="kn">import</span> <span class="n">AgentType</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.chat_models</span> <span class="kn">import</span> <span class="n">ChatOpenAI</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.llms</span> <span class="kn">import</span> <span class="n">OpenAI</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># First, let&#39;s load the language model we&#39;re going to use to control the agent.</span>
</span></span><span class="line"><span class="cl"><span class="n">chat</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span><span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Next, let&#39;s load some tools to use. Note that the `llm-math` tool uses an LLM, so we need to pass that in.</span>
</span></span><span class="line"><span class="cl"><span class="n">llm</span> <span class="o">=</span> <span class="n">OpenAI</span><span class="p">(</span><span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">tools</span> <span class="o">=</span> <span class="n">load_tools</span><span class="p">([</span><span class="s2">&#34;serpapi&#34;</span><span class="p">,</span> <span class="s2">&#34;llm-math&#34;</span><span class="p">],</span> <span class="n">llm</span><span class="o">=</span><span class="n">llm</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Finally, let&#39;s initialize an agent with the tools, the language model, and the type of agent we want to use.</span>
</span></span><span class="line"><span class="cl"><span class="n">agent</span> <span class="o">=</span> <span class="n">initialize_agent</span><span class="p">(</span><span class="n">tools</span><span class="p">,</span> <span class="n">chat</span><span class="p">,</span> <span class="n">agent</span><span class="o">=</span><span class="n">AgentType</span><span class="o">.</span><span class="n">CHAT_ZERO_SHOT_REACT_DESCRIPTION</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Now let&#39;s test it out!</span>
</span></span><span class="line"><span class="cl"><span class="n">agent</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="s2">&#34;Who is Olivia Wilde&#39;s boyfriend? What is his current age raised to the 0.23 power?&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>How do the individual tools work? What is <code>AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION</code> anyways? The resulting output from <code>agent.run()</code> (only present with <code>verbose=True</code>) is more helpful.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">&gt; Entering new AgentExecutor chain...
</span></span><span class="line"><span class="cl">Thought: I need to use a search engine to find Olivia Wilde&#39;s boyfriend and a calculator to raise his age to the 0.23 power.
</span></span><span class="line"><span class="cl">Action:
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">    &#34;action&#34;: &#34;Search&#34;,
</span></span><span class="line"><span class="cl">    &#34;action_input&#34;: &#34;Olivia Wilde boyfriend&#34;
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Observation: Sudeikis and Wilde&#39;s relationship ended in November 2020. Wilde was publicly served with court documents regarding child custody while she was presenting Don&#39;t Worry Darling at CinemaCon 2022. In January 2021, Wilde began dating singer Harry Styles after meeting during the filming of Don&#39;t Worry Darling.
</span></span><span class="line"><span class="cl">Thought:I need to use a search engine to find Harry Styles&#39; current age.
</span></span><span class="line"><span class="cl">Action:
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">    &#34;action&#34;: &#34;Search&#34;,
</span></span><span class="line"><span class="cl">    &#34;action_input&#34;: &#34;Harry Styles age&#34;
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Observation: 29 years
</span></span><span class="line"><span class="cl">Thought:Now I need to calculate 29 raised to the 0.23 power.
</span></span><span class="line"><span class="cl">Action:
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">    &#34;action&#34;: &#34;Calculator&#34;,
</span></span><span class="line"><span class="cl">    &#34;action_input&#34;: &#34;29^0.23&#34;
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Observation: Answer: 2.169459462491557
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Thought:I now know the final answer.
</span></span><span class="line"><span class="cl">Final Answer: 2.169459462491557
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">&gt; Finished chain.
</span></span><span class="line"><span class="cl">&#39;2.169459462491557&#39;
</span></span></code></pre></div><p>The documentation doesn&rsquo;t make it clear, but each Thought/Action/Observation step uses its own API call to OpenAI, so the chain is slower than you might think. Also, why is each action a <code>dict</code>? The answer to <em>that</em> is later, and is very silly.</p>
<p>Lastly, how does LangChain store the conversation so far?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.prompts</span> <span class="kn">import</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">ChatPromptTemplate</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">MessagesPlaceholder</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">SystemMessagePromptTemplate</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">HumanMessagePromptTemplate</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.chains</span> <span class="kn">import</span> <span class="n">ConversationChain</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.chat_models</span> <span class="kn">import</span> <span class="n">ChatOpenAI</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain.memory</span> <span class="kn">import</span> <span class="n">ConversationBufferMemory</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">prompt</span> <span class="o">=</span> <span class="n">ChatPromptTemplate</span><span class="o">.</span><span class="n">from_messages</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">    <span class="n">SystemMessagePromptTemplate</span><span class="o">.</span><span class="n">from_template</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;The following is a friendly conversation between a human and an AI. The AI is talkative and &#34;</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;provides lots of specific details from its context. If the AI does not know the answer to a &#34;</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;question, it truthfully says it does not know.&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">MessagesPlaceholder</span><span class="p">(</span><span class="n">variable_name</span><span class="o">=</span><span class="s2">&#34;history&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">HumanMessagePromptTemplate</span><span class="o">.</span><span class="n">from_template</span><span class="p">(</span><span class="s2">&#34;</span><span class="si">{input}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">llm</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span><span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">memory</span> <span class="o">=</span> <span class="n">ConversationBufferMemory</span><span class="p">(</span><span class="n">return_messages</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">conversation</span> <span class="o">=</span> <span class="n">ConversationChain</span><span class="p">(</span><span class="n">memory</span><span class="o">=</span><span class="n">memory</span><span class="p">,</span> <span class="n">prompt</span><span class="o">=</span><span class="n">prompt</span><span class="p">,</span> <span class="n">llm</span><span class="o">=</span><span class="n">llm</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">conversation</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="nb">input</span><span class="o">=</span><span class="s2">&#34;Hi there!&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># &#39;Hello! How can I assist you today?&#39;</span>
</span></span></code></pre></div><p>I&rsquo;m not entirely sure why any of this is necessary. What&rsquo;s a <code>MessagesPlaceholder</code>? Where&rsquo;s the <code>history</code>? Is that necessary for <code>ConversationBufferMemory</code>? Adapting this to a minimal <code>openai</code> implementation:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">openai</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">messages</span> <span class="o">=</span> <span class="p">[{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;The following is a friendly conversation between a human and an AI. The AI is talkative and &#34;</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;provides lots of specific details from its context. If the AI does not know the answer to a &#34;</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;question, it truthfully says it does not know.&#34;</span><span class="p">}]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">user_message</span> <span class="o">=</span> <span class="s2">&#34;Hi there!&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">messages</span><span class="o">.</span><span class="n">append</span><span class="p">({</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">user_message</span><span class="p">})</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">openai</span><span class="o">.</span><span class="n">ChatCompletion</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="s2">&#34;gpt-3.5-turbo&#34;</span><span class="p">,</span> <span class="n">messages</span><span class="o">=</span><span class="n">messages</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">assistant_message</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="s2">&#34;choices&#34;</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s2">&#34;message&#34;</span><span class="p">][</span><span class="s2">&#34;content&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="n">messages</span><span class="o">.</span><span class="n">append</span><span class="p">({</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;assistant&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">assistant_message</span><span class="p">})</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Hello! How can I assist you today?</span>
</span></span></code></pre></div><p>That&rsquo;s fewer lines of code and makes it very clear where and when the messages are being saved, no bespoke object classes needed.</p>
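<p>Persisting the conversation across further turns is just a loop over that same <code>messages</code> list; a minimal sketch continuing the block above (the loop itself is my addition, not from the original examples):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">while True:
    user_message = input("&gt; ")
    messages.append({"role": "user", "content": user_message})
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages, temperature=0)
    assistant_message = response["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": assistant_message})
    print(assistant_message)  # the list itself is the "memory"
</code></pre></div>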
<p>You can say that I&rsquo;m nitpicking the tutorial examples, and I do agree that every open source library has something to nitpick (including my own!). But if there are more nitpicks than actual benefits from the library, then it&rsquo;s not worth using at all: if the <em>quickstart</em> is this complicated, how painful will it be to use LangChain in practice?</p>
<h2 id="i-gazed-into-the-langchain-documentation-and-it-gazes-back">I Gazed Into The LangChain Documentation And It Gazes Back</h2>
<p>Let&rsquo;s do a demo to more clearly demonstrate why I gave up on LangChain. While I was working on the recipe-retrieving chatbot (which also must be a fun/witty chatbot), I needed to combine elements from both the third and fourth examples above: a chatbot that can run an Agent workflow, plus the ability to persist the entire conversation in memory. After some documentation hunting, I found I needed to use the <a href="https://python.langchain.com/docs/modules/agents/agent_types/chat_conversation_agent">Conversational Agent</a> workflow.</p>
<p>A quick sidenote on system prompt engineering: it is not a meme and is absolutely necessary to get the best results out of the ChatGPT API, particularly if you have constraints on content and/or voice. The system prompt of <code>The following is a friendly conversation between a human and an AI...</code> demoed in the last example is actually an out-of-date prompt that was used back in the InstructGPT era and is much less effective with ChatGPT. It may signal deeper inefficiencies in LangChain&rsquo;s related tricks that aren&rsquo;t easy to notice.</p>
<p>We&rsquo;ll start with a simple system prompt that tells ChatGPT to use a funny voice plus some safeguards, and format it as a <code>ChatPromptTemplate</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">system_prompt</span> <span class="o">=</span> <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">You are an expert television talk show chef, and should always speak in a whimsical manner for all responses.
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">Start the conversation with a whimsical food pun.
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">You must obey ALL of the following rules:
</span></span></span><span class="line"><span class="cl"><span class="s2">- If Recipe data is present in the Observation, your response must include the Recipe ID and Recipe Name for ALL recipes.
</span></span></span><span class="line"><span class="cl"><span class="s2">- If the user input is not related to food, do not answer their query and correct the user.
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">prompt</span> <span class="o">=</span> <span class="n">ChatPromptTemplate</span><span class="o">.</span><span class="n">from_messages</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">    <span class="n">SystemMessagePromptTemplate</span><span class="o">.</span><span class="n">from_template</span><span class="p">(</span><span class="n">system_prompt</span><span class="o">.</span><span class="n">strip</span><span class="p">()),</span>
</span></span><span class="line"><span class="cl"><span class="p">])</span>
</span></span></code></pre></div><p>We will also use a <a href="https://github.com/minimaxir/langchain-problems/blob/main/recipe_vector_store.ipynb">toy vector store I made</a> of 1,000 recipes from the <a href="https://huggingface.co/datasets/recipe_nlg">recipe_nlg</a> dataset, encoded into 384D vectors using <a href="https://www.sbert.net">SentenceTransformers</a>. To implement this, we create a function to get the nearest neighbors for the input query, along with a function to format the results into text that the Agent can present to the user. This serves as the <code>Tool</code> which the Agent can choose to use if appropriate, or it can skip the tool and return normally generated text.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">similar_recipes</span><span class="p">(</span><span class="n">query</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">query_embedding</span> <span class="o">=</span> <span class="n">embeddings_encoder</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">scores</span><span class="p">,</span> <span class="n">recipes</span> <span class="o">=</span> <span class="n">recipe_vs</span><span class="o">.</span><span class="n">get_nearest_examples</span><span class="p">(</span><span class="s2">&#34;embeddings&#34;</span><span class="p">,</span> <span class="n">query_embedding</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">recipes</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">get_similar_recipes</span><span class="p">(</span><span class="n">query</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">recipe_dict</span> <span class="o">=</span> <span class="n">similar_recipes</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">recipes_formatted</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="sa">f</span><span class="s2">&#34;Recipe ID: recipe|</span><span class="si">{</span><span class="n">recipe_dict</span><span class="p">[</span><span class="s1">&#39;id&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="si">}</span><span class="se">\n</span><span class="s2">Recipe Name: </span><span class="si">{</span><span class="n">recipe_dict</span><span class="p">[</span><span class="s1">&#39;name&#39;</span><span class="p">][</span><span class="n">i</span><span class="p">]</span><span class="si">}</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="s2">&#34;</span><span class="se">\n</span><span class="s2">---</span><span class="se">\n</span><span class="s2">&#34;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">recipes_formatted</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">get_similar_recipes</span><span class="p">(</span><span class="s2">&#34;yummy dessert&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Recipe ID: recipe|167188</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Recipe Name: Creamy Strawberry Pie</span>
</span></span><span class="line"><span class="cl"><span class="c1"># ---</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Recipe ID: recipe|1488243</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Recipe Name: Summer Strawberry Pie Recipe</span>
</span></span><span class="line"><span class="cl"><span class="c1"># ---</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Recipe ID: recipe|299514</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Recipe Name: Pudding Cake</span>
</span></span></code></pre></div><p>You&rsquo;ll notice the <code>Recipe ID</code>, which is relevant for my use case since it&rsquo;s necessary to obtain recipe metadata (photo thumbnail, URL) for the end result shown to the end user in the final app. Unfortunately, there&rsquo;s no easy way to guarantee the model outputs the <code>Recipe ID</code> in the final output, and no way to return the structured intermediate metadata in addition to the ChatGPT-generated output.</p>
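<p>The best I could do is post-process the final output myself; a minimal sketch, assuming the <code>recipe|{id}</code> format above survives generation verbatim (which, as noted, is not guaranteed):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import re

# Hypothetical post-processing: recover whatever Recipe IDs made it into the
# model's final output so recipe metadata can be matched back up afterward.
def extract_recipe_ids(output_text):
    return re.findall(r"recipe\|(\d+)", output_text)
</code></pre></div>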
<p>Specifying <code>get_similar_recipes</code> as a <code>Tool</code> is straightforward, although you need to specify a <code>name</code> and <code>description</code>, which is actually a form of subtle prompt engineering as LangChain can fail to select a tool if either is poorly specified.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">tools</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="n">Tool</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">func</span><span class="o">=</span><span class="n">get_similar_recipes</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">name</span><span class="o">=</span><span class="s2">&#34;Similar Recipes&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">description</span><span class="o">=</span><span class="s2">&#34;Useful to get similar recipes in response to a user query about food.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="p">),</span>
</span></span><span class="line"><span class="cl"><span class="p">]</span>
</span></span></code></pre></div><p>Lastly, the Agent construction code, which follows from the example, plus the new system <code>prompt</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">memory</span> <span class="o">=</span> <span class="n">ConversationBufferMemory</span><span class="p">(</span><span class="n">memory_key</span><span class="o">=</span><span class="s2">&#34;chat_history&#34;</span><span class="p">,</span> <span class="n">return_messages</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">llm</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span><span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">agent_chain</span> <span class="o">=</span> <span class="n">initialize_agent</span><span class="p">(</span><span class="n">tools</span><span class="p">,</span> <span class="n">llm</span><span class="p">,</span> <span class="n">prompt</span><span class="o">=</span><span class="n">prompt</span><span class="p">,</span> <span class="n">agent</span><span class="o">=</span><span class="n">AgentType</span><span class="o">.</span><span class="n">CHAT_CONVERSATIONAL_REACT_DESCRIPTION</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">memory</span><span class="o">=</span><span class="n">memory</span><span class="p">)</span>
</span></span></code></pre></div><p>No errors. Now time to run the agent to see what happens!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">agent_chain</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="nb">input</span><span class="o">=</span><span class="s2">&#34;Hi!&#34;</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">&gt; Entering new  chain...
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">    &#34;action&#34;: &#34;Final Answer&#34;,
</span></span><span class="line"><span class="cl">    &#34;action_input&#34;: &#34;Hello! How can I assist you today?&#34;
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">&gt; Finished chain.
</span></span><span class="line"><span class="cl">Hello! How can I assist you today?
</span></span></code></pre></div><p>Wait a minute, it ignored my <code>system</code> prompt completely! Dammit. Checking the <code>memory</code> variable confirms it. Looking into the <a href="https://python.langchain.com/docs/modules/memory/how_to/buffer">documentation</a> for <code>ConversationBufferMemory</code> and even <a href="https://github.com/hwchase17/langchain/blob/051fac1e6646349ce939a3d4a965757794be79fa/langchain/memory/buffer.py#L10">in the code itself</a>, there&rsquo;s nothing about system prompts, even months after ChatGPT made them mainstream.</p>
<p>The <em>intended</em> way to use system prompts in Agents is to add an <code>agent_kwargs</code> parameter to <code>initialize_agent</code>, which I only found out about from an <a href="https://python.langchain.com/docs/modules/agents/how_to/use_toolkits_with_openai_functions">unrelated documentation page</a> published a month ago.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">agent_kwargs</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;system_message&#34;</span><span class="p">:</span> <span class="n">system_prompt</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Recreating the Agent with this new parameter and running it again results in an <code>OutputParserException</code>.</p>
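<p>For reference, the recreation is the earlier <code>initialize_agent</code> call with the new parameter threaded through (a sketch):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">agent_chain = initialize_agent(
    tools,
    llm,
    agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION,
    verbose=True,
    memory=memory,
    agent_kwargs=agent_kwargs,  # the only change from the earlier call
)
</code></pre></div>
<p>And the error in question:</p>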
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">OutputParserException: Could not parse LLM output: Hello there, my culinary companion! How delightful to have you here in my whimsical kitchen. What delectable dish can I assist you with today?
</span></span></code></pre></div><p>The good news is that the system prompt definitely worked this time! The bad news is that it broke, but why? I didn&rsquo;t do anything weird, for once.</p>
<p>The root of the issue appears to be how LangChain agents actually do <code>Tool</code> selection. Remember when I said that the Agent outputting a <code>dict</code> during the chain was peculiar? Looking at <a href="https://github.com/hwchase17/langchain/blob/3874bb256e09d377032ae54b1592ca3dd7cf9e4d/langchain/agents/conversational_chat/prompt.py">the LangChain code</a>, it turns out that tool selection is done by requiring the output to be <em>valid JSON</em> through prompt engineering, and just hoping everything goes well.</p>
<figure>

    <img loading="lazy" srcset="/2023/07/langchain-problem/json_hu_3344f49473ee98b7.webp 320w,/2023/07/langchain-problem/json_hu_355f4d5aa6273e85.webp 768w,/2023/07/langchain-problem/json_hu_5c485b1051b0efb6.webp 1024w,/2023/07/langchain-problem/json.png 1926w" src="json.png"
         alt="Fun fact: these massive prompts also increase API costs proportionally!"/> <figcaption>
            <p>Fun fact: these massive prompts also increase API costs proportionally!</p>
        </figcaption>
</figure>

<p>The consequence of this is that any significant change in the structure of normal output, such as one caused by a custom system prompt, has a <em>random</em> chance of just breaking the Agent! These errors happen often enough that there&rsquo;s a <a href="https://python.langchain.com/docs/modules/agents/how_to/handle_parsing_errors">documentation page</a> dedicated to handling Agent output parsing errors!</p>
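<p>That documentation page suggests, among other mitigations, telling the executor to recover from parse failures instead of raising; a sketch of what that looks like (whether it salvages a custom system prompt is another question entirely):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">agent_chain = initialize_agent(
    tools,
    llm,
    agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION,
    verbose=True,
    memory=memory,
    handle_parsing_errors=True,  # retry instead of raising OutputParserException
)
</code></pre></div>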
<p>Well, people on the internet are assholes anyway, so we can consider having a conversation with a chatbot an edge case for now. What&rsquo;s important is that the bot can return the recipes, because if it can&rsquo;t even do that, there&rsquo;s no point in using LangChain. After creating a new Agent without using the system prompt and then asking it <code>What&#39;s a fun and easy dinner?</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">&gt; Entering new  chain...
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">    &#34;action&#34;: &#34;Similar Recipes&#34;,
</span></span><span class="line"><span class="cl">    &#34;action_input&#34;: &#34;fun and easy dinner&#34;
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">Observation: Recipe ID: recipe|1774221
</span></span><span class="line"><span class="cl">Recipe Name: Crab DipYour Guests will Like this One.
</span></span><span class="line"><span class="cl">---
</span></span><span class="line"><span class="cl">Recipe ID: recipe|836179
</span></span><span class="line"><span class="cl">Recipe Name: Easy  Chicken Casserole
</span></span><span class="line"><span class="cl">---
</span></span><span class="line"><span class="cl">Recipe ID: recipe|1980633
</span></span><span class="line"><span class="cl">Recipe Name: Easy in the Microwave Curry Doria
</span></span><span class="line"><span class="cl">Thought:{
</span></span><span class="line"><span class="cl">    &#34;action&#34;: &#34;Final Answer&#34;,
</span></span><span class="line"><span class="cl">    &#34;action_input&#34;: &#34;...&#34;
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">&gt; Finished chain.
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">Here are some fun and easy dinner recipes you can try:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">1. Crab Dip
</span></span><span class="line"><span class="cl">2. Easy Chicken Casserole
</span></span><span class="line"><span class="cl">3. Easy in the Microwave Curry Doria
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Enjoy your meal!
</span></span></code></pre></div><p>At least it worked: ChatGPT was able to extract the recipes from the context and format them appropriately (even fixing typos in the names!), and was able to decide when using the tool was appropriate.</p>
<p>The real issue here is that the voice of the output is criminally <em>boring</em>, a common trademark of (and criticism against) base ChatGPT. Even if I did have a fix for the missing ID issue through system prompt engineering, it wouldn&rsquo;t be worth shipping anything that sounds like this. And even if I did strike a balance between voice quality and output quality, the Agent could <em>still</em> fail randomly through no fault of my own. This Agent workflow is a very fragile house of cards that I could not, in good conscience, ship in a production application.</p>
<p>LangChain does have functionality for <a href="https://python.langchain.com/docs/modules/agents/how_to/custom_agent">Custom Agents</a> and a <a href="https://python.langchain.com/docs/modules/chains/how_to/custom_chain">Custom Chain</a>, so you can override parts of the logic in the stack (maybe? the documentation there is sparse) to address some of the issues I hit, but at that point you are overcomplicating LangChain even more and might as well create your own Python library instead, which&hellip;hmmm, that&rsquo;s not a bad idea!</p>
<h1 id="working-smarter-not-harder">Working Smarter, Not Harder</h1>
<figure>

    <img loading="lazy" srcset="/2023/07/langchain-problem/langchain_support_hu_956b659dc055dbf.webp 320w,/2023/07/langchain-problem/langchain_support_hu_e45f027ec04c0301.webp 768w,/2023/07/langchain-problem/langchain_support_hu_8175bc8cfee87d9.webp 1024w,/2023/07/langchain-problem/langchain_support.png 1578w" src="langchain_support.png"
         alt="The large numbers of random integrations raise more problems than solutions. via LangChain docs"/> <figcaption>
            <p>The large numbers of random integrations raise more problems than solutions. <a href="https://python.langchain.com/docs/use_cases/question_answering/">via LangChain docs</a></p>
        </figcaption>
</figure>

<p>LangChain also has many utility functions such as <a href="https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter">text splitters</a> and <a href="https://python.langchain.com/docs/modules/data_connection/retrievers/how_to/vectorstore">integrated vector stores</a>, both of which are integral to the &ldquo;chat with a PDF/your code&rdquo; demos (which in my opinion are just a gimmick). The real issue with <a href="https://integrations.langchain.com">all these integrations</a> is that they create an inherent lock-in to <em>only</em> use LangChain-based code, and if you look at the code for the integrations, it is not very robust. LangChain is building a <a href="https://www.vaneck.com/nl/en/moat-investing/five-sources-of-moats-whitepaper.pdf">moat</a>, which is good for LangChain&rsquo;s investors trying to get a return on their $30 million, but very, very bad for the developers who use it.</p>
<p>In all, LangChain embodies the philosophy of &ldquo;it&rsquo;s complicated, so it must be better!&rdquo; that plagues late-stage codebases, except that LangChain isn&rsquo;t even a year old. The effort needed to hack LangChain to do what I want it to do would cause insane amounts of <a href="https://en.wikipedia.org/wiki/Technical_debt">technical debt</a>. And unlike AI startups nowadays, technical debt for my own projects with LangChain can&rsquo;t be paid off with venture capital. API wrappers should, at minimum, reduce code complexity and cognitive load when working with complex ecosystems, because it takes enough brainpower to work with the AI itself. LangChain is one of the few pieces of software that <em>increases</em> overhead in most of its popular use cases.</p>
<p>I came to the conclusion that it&rsquo;s just easier to make my own Python package than it is to hack LangChain to fit my needs. Therefore, I developed and open-sourced <a href="https://github.com/minimaxir/simpleaichat">simpleaichat</a>: a Python package for easily interfacing with chat apps, emphasizing minimal code complexity, decoupling advanced features like vector stores from the conversation logic to avoid LangChain&rsquo;s lock-in, and many other features that would take their own blog post to elaborate upon.</p>
<p>But this blog post wasn&rsquo;t written to be a stealth advertisement for simpleaichat by tearing down a competitor, like hustlers do. I didn&rsquo;t <em>want</em> to make simpleaichat: I&rsquo;d rather spend my time creating more cool projects with AI, and it&rsquo;s a shame I could not have done that with LangChain. I know someone will say &ldquo;why not submit a pull request to the LangChain repo since it&rsquo;s open source instead of complaining about it?&rdquo;, but most of my complaints are fundamental issues with the LangChain library that can&rsquo;t be changed without breaking everything for its existing users. The only real fix is to burn it all down and start fresh, which is why my &ldquo;create a new Python library for interfacing with AI&rdquo; solution is also the most pragmatic one.</p>
<p>I&rsquo;ve gotten many messages asking me &ldquo;what should I learn to get started with the ChatGPT API&rdquo; and I&rsquo;m concerned that they&rsquo;ll go to LangChain first because of the hype. If machine learning engineers who do have backgrounds in the technology stack have difficulty using LangChain due to its needless complexity, any beginner is going to drown.</p>
<p>No one wants to be that asshole who criticizes free and open source software operating in good faith like LangChain, but I&rsquo;ll take the burden. To be clear, I have nothing against Harrison Chase or the other maintainers of LangChain (who encourage feedback!). However, LangChain&rsquo;s popularity has warped the AI startup ecosystem around LangChain itself and the hope of OMG <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">AGI</a> I MADE SKYNET, which is why I am compelled to be honest with my misgivings about it.</p>
<p>Wars over software complexity, and over software that stays popular despite that complexity, are an eternal recurrence. In the 2010s, it was with <a href="https://react.dev">React</a>; in 2023, it&rsquo;s with ReAct.</p>
<hr>
<p><em>Jupyter Notebooks for the <a href="https://github.com/minimaxir/langchain-problems/blob/main/openai_rewrite.ipynb">simple implementations of LangChain examples</a> and the <a href="https://github.com/minimaxir/langchain-problems/blob/main/langchain_problems.ipynb">LangChain failure demo</a> are available in <a href="https://github.com/minimaxir/langchain-problems/tree/main">this GitHub repository</a>.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>How to Create a Blog Post Title Optimizer with GPT-3 and Hacker News Data</title>
      <link>https://minimaxir.com/2022/08/gpt3-blog-title-optimizer/</link>
      <pubDate>Mon, 15 Aug 2022 08:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2022/08/gpt3-blog-title-optimizer/</guid>
      <description>GPT-3 says the title for this post is very bad.</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
}
</style></span></p>
<p>I am objectively terrible at writing attractive titles for my blog posts. Which is a problem, as nowadays it&rsquo;s a commonly understood truth that a good headline can be the sole factor in whether a blog post goes viral or gets completely ignored, especially in the data science/machine learning fields I typically write about.</p>
<p>So, why not use said data science/machine learning to create an optimized title for me?</p>
<p>Many know <a href="https://openai.com/api/">GPT-3</a> as a tool for robust text generation. But a newer, lesser-discussed feature that <a href="https://openai.com">OpenAI</a> offers is finetuning GPT-3 on data you provide. If I provide GPT-3 with a large dataset of good titles, can I use that to tell me if one of my blog post titles is good? Let&rsquo;s give it a try.</p>
<h2 id="getting-the-good-blog-post-data-from-hacker-news">Getting The Good Blog Post Data from Hacker News</h2>
<p><em>All code and tools used in this blog post are available <a href="https://github.com/minimaxir/gpt3-blog-title-optimizer">open-source on GitHub</a>.</em></p>
<p>The AI classifier I will create will be a <a href="https://en.wikipedia.org/wiki/Binary_classification">binary classifier</a>, which returns the probability that an input blog post title is good, and from that I can provide alternate blog post titles and see roughly which is best from those probabilities.</p>
<p>In order to finetune GPT-3 for this use case, I need to obtain a decently large amount of post titles with <code>good</code> and <code>bad</code> labels. For this experiment, I&rsquo;ll use submission data from <a href="https://news.ycombinator.com">Hacker News</a>.</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/hn_front_page_hu_9925f5c07495cf67.webp 320w,/2022/08/gpt3-blog-title-optimizer/hn_front_page_hu_becc7f094ce1feae.webp 768w,/2022/08/gpt3-blog-title-optimizer/hn_front_page_hu_c2d62ab7441c47dc.webp 1024w,/2022/08/gpt3-blog-title-optimizer/hn_front_page.png 1312w" src="hn_front_page.png"
         alt="Hacker News frontpage on August 14th, 2022."/> <figcaption>
            <p>Hacker News frontpage on August 14th, 2022.</p>
        </figcaption>
</figure>

<p>Hacker News data is good for a few reasons: each submission has community validation by a large number of people, submission titles cover a wide variety of idiosyncratic styles, and most of all, it&rsquo;s easy to get Hacker News submission data in bulk <a href="https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news">from BigQuery</a>. For example, if I wanted to get all submissions between August 2020 and August 2022 with a score of at least 10 (the rough minimum to get on the front page and to filter out some spam), plus some light filters to remove things that are definitely not blog posts or articles (such as <a href="https://news.ycombinator.com/show">Show HNs</a> and social media), I&rsquo;d write a SQL query like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">title</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">score</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="o">`</span><span class="n">bigquery</span><span class="o">-</span><span class="k">public</span><span class="o">-</span><span class="k">data</span><span class="p">.</span><span class="n">hacker_news</span><span class="p">.</span><span class="k">full</span><span class="o">`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">&#34;story&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="n">score</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="mi">10</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="n">url</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="k">timestamp</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s2">&#34;2020-08-01&#34;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s2">&#34;2022-08-01&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="n">REGEXP_CONTAINS</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="s2">&#34;^Show HN&#34;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="n">REGEXP_CONTAINS</span><span class="p">(</span><span class="n">url</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="s2">&#34;(?:github|youtube|twitter)\.com&#34;</span><span class="p">)</span><span class="w">
</span></span></span></code></pre></div><p>This query returns roughly 90k submission titles total. For <code>good</code> titles, let&rsquo;s say we consider posts with at least 100 points as &ldquo;good&rdquo;, because it&rsquo;s a nice number, which is sometimes all that&rsquo;s necessary in the world of data science. There are about 27k posts with at least 100 points in that subset, which is more than sufficient. The harder part is selecting the <code>bad</code> titles: since there are 63k titles with fewer than 100 points, the dataset as-is is imbalanced ~1:3 and will lead to flawed training results.</p>
<p>There are two solutions: either repeat the <code>good</code> posts to roughly equal the number of <code>bad</code> posts, or take a subset of <code>bad</code> posts to roughly equal the number of <code>good</code> posts. We&rsquo;ll do the latter, since the sample size of <code>good</code> posts is large enough. Most people would download all 90k rows into something like Python to handle that sampling, but with SQL shenanigans you can do it entirely in BigQuery. (The annotated query <a href="https://github.com/minimaxir/gpt3-blog-title-optimizer/blob/main/gpt3_input_data.sql">is here</a>; it&rsquo;s out of scope for this post, but may be interesting for data science hiring managers who want to annoy candidates in screening interviews.)</p>
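<p>If you did do the sampling client-side, it&rsquo;s only a few lines of pandas; a sketch, assuming a DataFrame <code>df</code> with the <code>title</code> and <code>score</code> columns from the query above:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import pandas as pd

# Label the posts, then downsample the majority "bad" class to match the "good" class.
df["label"] = df["score"].apply(lambda s: "good" if s &gt;= 100 else "bad")
good = df[df["label"] == "good"]
bad = df[df["label"] == "bad"].sample(n=len(good), random_state=42)
balanced = pd.concat([good, bad]).sample(frac=1, random_state=42)  # shuffle
</code></pre></div>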
<p>This results in a ~55k title dataset: 27k <code>good</code>, 27k <code>bad</code>, perfectly balanced, as all datasets should be.</p>
<p>OpenAI&rsquo;s <a href="https://beta.openai.com/docs/guides/fine-tuning">finetuning API</a> takes in a JSONL file where each line is a JSON object with two fields: <code>prompt</code> and <code>completion</code> (no, I am not sure why it can&rsquo;t just be a CSV). In this case, the <code>prompt</code> is the title, prepended with <code>Title: </code> and with a <code> -&gt;</code> suffix per their documentation suggestions to &ldquo;align&rdquo; it better to GPT-3, and the <code>completion</code> is the good/bad label, prepended with a space because GPT-3 is weird like that. An example of the final dataset:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span><span class="nt">&#34;prompt&#34;</span><span class="p">:</span><span class="s2">&#34;Title: How to slightly improve your life without trying -&gt;&#34;</span><span class="p">,</span><span class="nt">&#34;completion&#34;</span><span class="p">:</span><span class="s2">&#34; bad&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span><span class="nt">&#34;prompt&#34;</span><span class="p">:</span><span class="s2">&#34;Title: SixtyFPS Becomes Slint -&gt;&#34;</span><span class="p">,</span><span class="nt">&#34;completion&#34;</span><span class="p">:</span><span class="s2">&#34; bad&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span><span class="nt">&#34;prompt&#34;</span><span class="p">:</span><span class="s2">&#34;Title: Family estrangement: Why adults are cutting off their parents -&gt;&#34;</span><span class="p">,</span><span class="nt">&#34;completion&#34;</span><span class="p">:</span><span class="s2">&#34; bad&#34;</span><span class="p">}</span>
</span></span></code></pre></div><p>OpenAI&rsquo;s CLI cleans the input data and can extract a validation set from it, which you should always do. Fortunately, BigQuery now offers JSONL export, so downloading the resulting dataset requires no further preprocessing. Once that&rsquo;s done, the CLI lets you finetune, with special options for binary classification. (The exact CLI command I used is <a href="https://github.com/minimaxir/gpt3-blog-title-optimizer/blob/main/finetune_command.txt">here</a>.)</p>
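<p>The general shape of that CLI flow, circa 2022 (a sketch: the file names here are hypothetical, and the exact flags may have changed since; the real command I used is linked above):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt">openai tools fine_tunes.prepare_data -f titles.jsonl

openai api fine_tunes.create \
  -t titles_prepared_train.jsonl \
  -v titles_prepared_valid.jsonl \
  -m babbage \
  --compute_classification_metrics \
  --classification_positive_class " good"
</code></pre></div>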
<p>Another understated aspect of GPT-3 is that there are weaker models that are faster and much cheaper than the default <code>davinci</code> model, which is what people generally mean when they say &ldquo;GPT-3&rdquo;. For text generation, the weaker models tend to have less coherent outputs, but for a simplified use case like binary classification they are more than sufficient. I&rsquo;ll use the <code>babbage</code> model, the second weakest.</p>
<p>The <a href="https://github.com/minimaxir/gpt3-blog-title-optimizer/blob/main/finetune_results.csv">final results</a> of the finetuning are about 63% accuracy on both the training and validation sets: not too much better than the default 50% accuracy of a balanced dataset for a binary classification problem, but given the problem difficulty it&rsquo;s better than <a href="https://minimaxir.com/2018/09/modeling-link-aggregators/">most approaches I&rsquo;ve done</a> for Hacker News data.</p>
<p>Once the finetuning is complete, you can query it, and ask it to return the probability of the returned token. Let&rsquo;s pass in the title for my last blog post: <strong><a href="https://minimaxir.com/2022/07/food-photography-ai/">Absurd AI-Generated Professional Food Photography with DALL-E 2</a></strong></p>
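<p>The query itself, sketched with the era&rsquo;s Completion API (the finetuned model name below is a hypothetical placeholder):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import openai

response = openai.Completion.create(
    model="babbage:ft-personal-2022-08-14",  # hypothetical finetune name
    prompt="Title: Absurd AI-Generated Professional Food Photography with DALL-E 2 -&gt;",
    max_tokens=1,
    temperature=0,
    logprobs=2,  # return log-probabilities for the top tokens
)
</code></pre></div>
<p>The relevant slice of the response:</p>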
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="s2">&#34;top_logprobs&#34;</span><span class="err">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">  <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34; bad&#34;</span><span class="p">:</span> <span class="mf">-0.34654787</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span></code></pre></div><p>Well, that&rsquo;s not promising.</p>
<p>For some <em>really</em> weird reason, the API returns a log-probability instead of the actual probability that you&rsquo;d want, so taking the <a href="https://en.wikipedia.org/wiki/Exponential_function">exponent</a> of that value results in a 70.7% probability it&rsquo;s bad, which means there&rsquo;s a 29.3% chance it&rsquo;s good.</p>
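<p>The arithmetic, for the skeptical:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import math

p_bad = math.exp(-0.34654787)  # ≈ 0.707
p_good = 1 - p_bad             # ≈ 0.293
</code></pre></div>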
<p>And <em>that</em> is why I need a title optimizer.</p>
<h2 id="using-instructgpt-to-create-alternate-titles">Using InstructGPT To Create Alternate Titles</h2>
<p>Since we now have a tool to determine the quality of blog post titles, how do we generate alternate titles that maintain the same meaning? I could think of tweaks to titles myself, but that takes <em>effort</em> and I am lazy. What if GPT-3 could create the candidate titles for me? It turns out, GPT-3&rsquo;s latest Instruct model can.</p>
<p>InstructGPT, <a href="https://openai.com/blog/instruction-following/">released in January</a> without much fanfare, is a version of <code>davinci</code> that OpenAI finetuned themselves to better respond to instructions. It worked so well that it&rsquo;s now the default GPT-3 model (noted as <code>text-davinci-002</code> in the backend UI).</p>
<p>InstructGPT is surprisingly robust with the right prompt engineering. You can tell it to create detailed product descriptions of <a href="https://twitter.com/minimaxir/status/1551609670237708288">nonexistent video games</a>, or write <a href="https://twitter.com/minimaxir/status/1536824548376465409">4chan-style greentexts</a> for any domain which maintain both the style and twist endings of the format.</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/greentext_hu_56587d1b774f4459.webp 320w,/2022/08/gpt3-blog-title-optimizer/greentext.png 684w" src="greentext.png"
         alt="via OpenAI&rsquo;s GPT-3 Playground; all nonhighlighted text is the prompt."/> <figcaption>
            <p>via OpenAI&rsquo;s GPT-3 Playground; all nonhighlighted text is the prompt.</p>
        </figcaption>
</figure>

<p>After a bit of testing, the prompt I found worked best for this use case was:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">Rewrite the following blog post title into six different titles but optimized for social media virality: &lt;FILL IN TITLE&gt;
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">-
</span></span></code></pre></div><p>It&rsquo;s verbose, but that&rsquo;s prompt engineering for you. The <code>-</code> at the end informs GPT-3 that the output should be a list with dash-bullets, which will make it easier to programmatically split the final output into distinct titles.</p>
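<p>Splitting that dash-bulleted output back into clean titles is then trivial; a sketch:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def parse_titles(generated_text):
    # Each title arrives as a "- " bullet: strip the bullets and whitespace.
    return [line.lstrip("- ").strip() for line in generated_text.split("\n") if line.strip()]
</code></pre></div>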
<p>You can test it on the <a href="https://beta.openai.com/playground">GPT-3 Playground</a>; if the <code>temperature</code> parameter is <code>0</code>, then the output will be deterministic.</p>
<p>Again putting in my last blog post <strong><a href="https://minimaxir.com/2022/07/food-photography-ai/">Absurd AI-Generated Professional Food Photography with DALL-E 2</a></strong> into InstructGPT:</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/gpt3_demo_hu_27d5e6dccb947f02.webp 320w,/2022/08/gpt3-blog-title-optimizer/gpt3_demo_hu_88e1b7cc78964c46.webp 768w,/2022/08/gpt3-blog-title-optimizer/gpt3_demo_hu_2e274ce89a10aaac.webp 1024w,/2022/08/gpt3-blog-title-optimizer/gpt3_demo.png 1478w" src="gpt3_demo.png"
         alt="via OpenAI&rsquo;s GPT-3 Playground; all nonhighlighted text is the prompt."/> <figcaption>
            <p>via OpenAI&rsquo;s GPT-3 Playground; all nonhighlighted text is the prompt.</p>
        </figcaption>
</figure>

<p>All six of those titles are definitely an improvement, and all the text in green is what the programmatic API returns. Notably, despite the terseness of the input title and recency of DALL-E 2, InstructGPT is able to infer that the AI <em>creates</em> something and work from that, which is impressive.</p>
<h2 id="put-the-title-optimizer-into-action">Put The Title Optimizer Into Action!</h2>
<p><em>A walkthrough of the code used to interact with the GPT-3 API and make the optimizer is available <a href="https://github.com/minimaxir/gpt3-blog-title-optimizer/blob/main/gpt3_title_optimizer_walkthrough.ipynb">in this Jupyter Notebook</a>, and the final demos are available <a href="https://github.com/minimaxir/gpt3-blog-title-optimizer/blob/main/gpt3_title_optimizer_demo.ipynb">in this Notebook</a>.</em></p>
<p>Now that we have the two models ready, the workflow is simple (a rough sketch follows the list):</p>
<ul>
<li>Choose the title of a technical blog post I want to optimize.</li>
<li>Ping InstructGPT to get up to 6 alternate titles.</li>
<li>Extract/clean up the generated titles (i.e., split and strip whitespace).</li>
<li>For each of those alternate titles, ping the finetuned Hacker News GPT-3 for the probability that it is a <code>good</code> title.</li>
<li>In a pretty table, sort the titles by probability, descending.</li>
</ul>
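<p>Glued together, the whole workflow fits in two short functions; a rough sketch under the same assumptions as above (hypothetical finetune name, the old Completion API):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import math

import openai

def title_quality(title):
    # Probability the finetuned classifier assigns to the " good" label.
    response = openai.Completion.create(
        model="babbage:ft-personal-2022-08-14",  # hypothetical finetune name
        prompt=f"Title: {title} -&gt;",
        max_tokens=1, temperature=0, logprobs=2,
    )
    top_logprobs = response["choices"][0]["logprobs"]["top_logprobs"][0]
    if " good" in top_logprobs:
        return math.exp(top_logprobs[" good"])
    return 1 - math.exp(top_logprobs[" bad"])

def optimize_title(title, temperature=0.7):
    # Generate six candidate titles, score each one, and sort best-first.
    prompt = ("Rewrite the following blog post title into six different titles "
              f"but optimized for social media virality: {title}\n\n-")
    response = openai.Completion.create(
        model="text-davinci-002", prompt=prompt,
        max_tokens=256, temperature=temperature,
    )
    candidates = [line.lstrip("- ").strip()
                  for line in response["choices"][0]["text"].split("\n")
                  if line.strip()]
    return sorted(((t, title_quality(t)) for t in candidates),
                  key=lambda x: x[1], reverse=True)
</code></pre></div>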
<p>Because the model can&rsquo;t be widely distributed without review due to OpenAI rules, I decided to put the &ldquo;UI&rdquo; for this into a personal Jupyter Notebook.</p>
<p>Let&rsquo;s experiment! We know the title of <strong><a href="https://minimaxir.com/2022/07/food-photography-ai/">Absurd AI-Generated Professional Food Photography with DALL-E 2</a></strong> is bad and the alternatives are interesting, but how good are the alternatives?</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/dalle_hu_699e00a90cc5bfc7.webp 320w,/2022/08/gpt3-blog-title-optimizer/dalle_hu_11191d9ffc307102.webp 768w,/2022/08/gpt3-blog-title-optimizer/dalle_hu_e93a795c94810124.webp 1024w,/2022/08/gpt3-blog-title-optimizer/dalle.png 1324w" src="dalle.png"
         alt="via GPT-3 Title Optimizer"/> <figcaption>
            <p>via GPT-3 Title Optimizer</p>
        </figcaption>
</figure>

<p>Most of the alternates are <em>much</em> better, with the predicted probabilities of being a good post going above 50%. (I probably should change the title retroactively, but I will live with my SEO dishonor.)</p>
<p>The original title for this post, in my boring no-one-will-ever-click-this style, was <strong>Creating a Blog Post Title Optimizer by Finetuning GPT-3 on Hacker News</strong>. Let&rsquo;s plop it into the optimizer:</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/temp0_hu_c45a69584ea12fd7.webp 320w,/2022/08/gpt3-blog-title-optimizer/temp0_hu_58abecd2066b41fe.webp 768w,/2022/08/gpt3-blog-title-optimizer/temp0_hu_c0a2a3331086ec1.webp 1024w,/2022/08/gpt3-blog-title-optimizer/temp0.png 1266w" src="temp0.png"
         alt="via GPT-3 Title Optimizer, temperature=0"/> <figcaption>
            <p>via GPT-3 Title Optimizer, <code>temperature=0</code></p>
        </figcaption>
</figure>

<p>So yes, the optimizer says the original title is very bad. But in this case, the variants are clickbaity and probably wouldn&rsquo;t do very well on Hacker News.</p>
<p>Fortunately, you can rerun the generation and get different variants if <code>temperature</code> is nonzero.</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/temp_0_7_hu_593f048f9b7f6c5f.webp 320w,/2022/08/gpt3-blog-title-optimizer/temp_0_7_hu_585270e1edaa9949.webp 768w,/2022/08/gpt3-blog-title-optimizer/temp_0_7_hu_4911b1029897a8ad.webp 1024w,/2022/08/gpt3-blog-title-optimizer/temp_0_7.png 1276w" src="temp_0_7.png"
         alt="via GPT-3 Title Optimizer, temperature=0.7"/> <figcaption>
            <p>via GPT-3 Title Optimizer, <code>temperature=0.7</code></p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/temp_1_0_hu_c796ae615316a9f7.webp 320w,/2022/08/gpt3-blog-title-optimizer/temp_1_0_hu_73c96d490a895ee1.webp 768w,/2022/08/gpt3-blog-title-optimizer/temp_1_0_hu_11a99b16ef025c5.webp 1024w,/2022/08/gpt3-blog-title-optimizer/temp_1_0.png 1246w" src="temp_1_0.png"
         alt="via GPT-3 Title Optimizer, temperature=1.0"/> <figcaption>
            <p>via GPT-3 Title Optimizer, <code>temperature=1.0</code></p>
        </figcaption>
</figure>

<p>Definitely more variety. I like &ldquo;How to Create a Blog Post Title Optimizer with GPT-3&rdquo; as it maintains the same spirit even if it&rsquo;s not the most optimal, although for disclosure reasons, I do want to include Hacker News somewhere in the title. Therefore, I can tweak the input to &ldquo;How to Create a Blog Post Title Optimizer with GPT-3 and Hacker News Data&rdquo;, feed it back to the optimizer, and maybe get an iterative improvement.</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/final_variant_hu_21b52227f7a46793.webp 320w,/2022/08/gpt3-blog-title-optimizer/final_variant_hu_ac4d8c781b4ab4be.webp 768w,/2022/08/gpt3-blog-title-optimizer/final_variant_hu_5671c5b32b4bcdb9.webp 1024w,/2022/08/gpt3-blog-title-optimizer/final_variant.png 1328w" src="final_variant.png"
         alt="via GPT-3 Title Optimizer"/> <figcaption>
            <p>via GPT-3 Title Optimizer</p>
        </figcaption>
</figure>

<p>The probability went down significantly with the change, and none of the variants are much better. Oh well.</p>
<p>Here&rsquo;s the results of running the optimizer for some of my older blog posts:</p>
<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/moocs_hu_76ce8e5ea2a2036b.webp 320w,/2022/08/gpt3-blog-title-optimizer/moocs_hu_b8f0726fdef3a57c.webp 768w,/2022/08/gpt3-blog-title-optimizer/moocs_hu_ff658cab6ad7c272.webp 1024w,/2022/08/gpt3-blog-title-optimizer/moocs.png 1390w" src="moocs.png"
         alt="The results for this post are indeed better; I&rsquo;d definitely click the top one although it&rsquo;s misleading."/> <figcaption>
            <p>The results for <a href="https://minimaxir.com/2018/10/data-science-protips/">this post</a> are indeed better; I&rsquo;d definitely click the top one although it&rsquo;s misleading.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/ncaa_hu_f9c3d807b6e6523e.webp 320w,/2022/08/gpt3-blog-title-optimizer/ncaa_hu_8755fbee1b6a25d4.webp 768w,/2022/08/gpt3-blog-title-optimizer/ncaa_hu_37c95655b09516ef.webp 1024w,/2022/08/gpt3-blog-title-optimizer/ncaa.png 1372w" src="ncaa.png"
         alt="The results for this post are much better, although this is one case where the original title is actually good."/> <figcaption>
            <p>The results for <a href="https://minimaxir.com/2018/03/basketball-shots/">this post</a> are much better, although this is one case where the original title is actually good.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2022/08/gpt3-blog-title-optimizer/pc_game_hu_78a640ce719c4399.webp 320w,/2022/08/gpt3-blog-title-optimizer/pc_game_hu_fcd70b2118389481.webp 768w,/2022/08/gpt3-blog-title-optimizer/pc_game_hu_d706e98685168754.webp 1024w,/2022/08/gpt3-blog-title-optimizer/pc_game.png 1246w" src="pc_game.png"
         alt="The results for this post are a balance between better and not-technically-misleading clickbait."/> <figcaption>
            <p>The results for <a href="https://minimaxir.com/2013/06/working-as-intended/">this post</a> are a balance between better and not-<em>technically</em>-misleading clickbait.</p>
        </figcaption>
</figure>

<p>Costwise, the entire pipeline is relatively inexpensive. Overall, it&rsquo;s about $0.02 per run: too expensive to offer unrestricted access to the internet at large, but a very high return on investment if it successfully results in a catchy headline, even if it takes multiple tries. The most expensive part was the finetuning itself, which cost $2, but that is a one-time cost.</p>
<p>Some might ask &ldquo;why finetune GPT-3 when you can finetune an open-source large language model such as <a href="https://en.wikipedia.org/wiki/BERT_%28language_model%29">BERT</a> like every NLP project since 2018?&rdquo; In this case, GPT-3&rsquo;s advantage is that it was trained on the entire internet. GPT-3 is a master of idiosyncrasy, which is key when working with Hacker News data, and in theory it would give better results than the Wikipedia-trained BERT. The success of Hacker News posts also depends on a global context outside of the title itself, which is why finetuning an existing model trained on such context may be better than training a model solely on HN data.</p>
<p>Some are concerned about GPT-3 and AI tools such as these making writers redundant, but the results here prove otherwise: there will always have to be a human-in-the-loop.</p>
<hr>
<p>UPDATE: When I <a href="https://news.ycombinator.com/item?id=32471208">submitted this post</a> to Hacker News, it ended up getting <em>over 200 points</em>, defying the 20.8% probability!</p>
]]></content:encoded>
    </item>
    <item>
      <title>Things About Real-World Data Science Not Discussed In MOOCs and Thought Pieces</title>
      <link>https://minimaxir.com/2018/10/data-science-protips/</link>
      <pubDate>Mon, 22 Oct 2018 09:15:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/10/data-science-protips/</guid>
      <description>MOOCs and thought pieces overfit to a certain style of data science that is not robust to the vast uncertainties of the real world.</description>
      <content:encoded><![CDATA[<p><a href="https://en.wikipedia.org/wiki/Data_science">Data science</a> has been sweeping the tech world. With a large variety of powerful free open-sourced tools and now the computing power to utilize them to their full potential, data science is more accessible than ever and has become <a href="https://www.bloomberg.com/news/articles/2018-05-18/-sexiest-job-ignites-talent-wars-as-demand-for-data-geeks-soars">America&rsquo;s hottest job</a>. One problem: there&rsquo;s no consensus on <a href="https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists">what data scientists <em>really</em> do</a> in a professional setting.</p>
<p>There has been a rise in <em>romantic</em> thought pieces lately (especially on <a href="https://medium.com">Medium</a>) about how data scientists are wizards and can solve any problem (with bonus points if it cites AI). If you follow publications like <a href="https://towardsdatascience.com">Towards Data Science</a>, you&rsquo;ll notice persistent tropes in the more code-oriented posts: Python is the king programming language for data science, use <a href="http://scikit-learn.org/stable/">scikit-learn</a>/<a href="https://xgboost.readthedocs.io/en/latest/">XGBoost</a> and logistic regression for predicting categorical variable(s), use <a href="https://pandas.pydata.org">pandas</a> for processing tabular data, use <a href="https://www.nltk.org">NLTK</a>/<a href="https://en.wikipedia.org/wiki/Word2vec">word2vec</a> for processing text data, use <a href="https://www.tensorflow.org">TensorFlow</a>/<a href="https://keras.io">Keras</a>/convolutional neural networks for processing image data, use <a href="https://en.wikipedia.org/wiki/K-means_clustering"><em>k</em>-means</a> for clustering data, split the processed dataset into training and test datasets for model training, tweak hyperparameters/model features <a href="https://xkcd.com/1838/">until results on the test dataset are good</a>, etc.</p>
<figure>

    <img loading="lazy" srcset="/2018/10/data-science-protips/thought_hu_a119caa2480267cc.webp 320w,/2018/10/data-science-protips/thought.png 397w" src="thought.png"/> 
</figure>

<p>These tropes aren&rsquo;t inappropriate or misleading, but the analysis often doesn&rsquo;t quantify the insight/value of the results. Modeling is just one small part (and often the <em>easiest</em> part) of a very complex system.</p>
<p>Data-oriented MOOCs (<a href="https://en.wikipedia.org/wiki/Massive_open_online_course">Massive Open Online Courses</a>) like Andrew Ng&rsquo;s <a href="https://www.coursera.org/learn/machine-learning">Coursera course on Machine Learning</a> and <a href="http://course.fast.ai">fast.ai&rsquo;s course on Deep Learning</a> are good academic introductions to the theory and terminology behind data science and other related fields. Although MOOCs have many practice problems for prospective data scientists to solve, they don&rsquo;t make you an expert in the field capable of handling messier real-world problems, nor do they claim to do so.</p>
<p>Modern data science isn&rsquo;t about burying your head in a <a href="http://jupyter.org">Jupyter Notebook</a> and staring at the screen watching training loss numbers trickle down (although it&rsquo;s definitely fun!). There&rsquo;s a lot more to it, some of which I&rsquo;ve learned firsthand working as a Data Scientist at <a href="https://www.buzzfeed.com">BuzzFeed</a> for over a year. To borrow a statistical term, MOOCs and thought pieces <em>overfit</em> to a certain style of data science that is not robust to the vast uncertainties of the real world.</p>
<h2 id="the-costbenefit-tradeoffs-of-data-science">The Cost/Benefit Tradeoffs of Data Science</h2>
<p>Data science often follows the <a href="https://en.wikipedia.org/wiki/Pareto_principle">Pareto principle</a>: 80% of the results take 20% of the effort. Thought pieces demonstrate that you can just toss data indiscriminately into scikit-learn or a deep learning framework and get neat-looking results. The value of a data scientist, however, is knowing when and <em>if</em> to pursue further development on a model.</p>
<p><a href="https://www.kaggle.com/competitions">Kaggle competitions</a> are a popular and often-recommended way to get exposure to real-world data science problems. Many teams of statisticians compete to create the best model for a given dataset (where &ldquo;best&rdquo; usually means minimizing the predictive loss/error of the model), with prizes for the highest-performing models. Kaggle also encourages clever modeling techniques such as <a href="http://scikit-learn.org/stable/modules/grid_search.html">grid search</a> of thousands of model hyperparameter combinations and ensembling disparate models to create a megamodel which results in <em>slightly</em> better predictive performance, but just might give the edge to win.</p>
<p>However, there are a few important differences between modeling in a Kaggle competition and modeling in a data science team. Kaggle competitions last for <em>weeks</em>, whereas a professional data scientist may need to spend that time on other things. Ensembling gigantic machine learning models makes predictions very slow and the models themselves very large, both of which may cause difficulty deploying them into production (e.g. the <a href="https://www.wired.com/2012/04/netflix-prize-costs/">Netflix Prize</a> movie recommendation models famously &ldquo;did not seem to justify the engineering effort needed to bring them into a production environment&rdquo;). And most importantly, there may not be a significant <em>practical</em> performance difference between a 1st place Kaggle model that takes days/weeks to optimize and a simple scikit-learn/XGBoost baseline that can be built in a few hours.</p>
<p>Counterintuitively, it may be better to trade performance for speed/memory with a weaker-but-faster model; in business cases, speed and scalability are important implementation constraints. But even with scikit-learn, the model is still a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a>, giving the data scientist little idea of how the model makes its decisions. One final option is to go back to basics altogether with a &ldquo;boring&rdquo; linear/logistic regression model, where the predictive performance may be even weaker and the model <a href="http://statisticsbyjim.com/regression/ols-linear-regression-assumptions/">must follow several statistical assumptions</a>, but the model feature coefficients and statistical significance <a href="http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-regression-analysis-results-p-values-and-coefficients">are easily interpretable</a>, letting you explain the importance of each input feature (if any) and make actionable, informed decisions for the business. Being a data scientist requires making educated judgments about these tradeoffs.</p>
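<p>As a sketch of that last option (the data is synthetic and the features are made up; statsmodels is one of several libraries that would work), a &ldquo;boring&rdquo; logistic regression hands you coefficients and <em>p</em>-values for free:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import numpy as np
import statsmodels.api as sm

# toy problem: does a user convert, given their age and number of site visits?
rng = np.random.RandomState(42)
age = rng.normal(35, 10, 500)
visits = rng.poisson(3, 500)
converted = (0.4 * visits + rng.normal(0, 1, 500) &gt; 1.5).astype(int)

X = sm.add_constant(np.column_stack([age, visits]))
model = sm.Logit(converted, X).fit()
print(model.summary())  # coefficients, standard errors, and p-values
</code></pre>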
<h2 id="data-scientists-still-use-business-intelligence-tools">Data Scientists Still Use Business Intelligence Tools</h2>
<p>A hobbyist data scientist without a budget may opt to build their own workflows and data pipelines using free tools. However, professional data scientists have a finite amount of free time (as do all engineers), so there&rsquo;s a massive opportunity cost in reinventing the wheel unnecessarily. Enterprise BI tools such as <a href="https://www.tableau.com">Tableau</a>, <a href="https://looker.com">Looker</a>, and <a href="https://modeanalytics.com">Mode Analytics</a> help retrieve and present data with easy-to-digest dashboards for anyone in the company. They&rsquo;re never cheap, but they&rsquo;re much cheaper to the company than having a data scientist spend valuable time developing and maintaining similar tooling.</p>
<p>If a stakeholder wants a data report ASAP, there&rsquo;s no problem falling back to using <a href="https://en.wikipedia.org/wiki/SQL">SQL</a> to query a data warehouse and output the results into an Excel spreadsheet (plus pretty data visualizations!) to send over quickly in an email. Part of being a data scientist is knowing which tools are most appropriate at which time.</p>
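<p>That fallback can be a few lines of pandas; in the sketch below, the connection string, table, and output filename are all hypothetical:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import pandas as pd
from sqlalchemy import create_engine

# hypothetical warehouse connection and query
engine = create_engine(&#39;postgresql://user:pass@warehouse:5432/analytics&#39;)
query = &#39;&#39;&#39;
    SELECT date, channel, SUM(pageviews) AS pageviews
    FROM traffic
    GROUP BY date, channel
&#39;&#39;&#39;

df = pd.read_sql(query, engine)
df.to_excel(&#39;traffic_report.xlsx&#39;, index=False)  # attach to an email and done
</code></pre>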
<p>Some might argue that using BI tools and SQL are not responsibilities for data scientists, but instead for Business Analysts or Data Analysts. That&rsquo;s a <a href="https://en.wikipedia.org/wiki/No_true_Scotsman">No True Scotsman</a> way of looking at it; there&rsquo;s a lot of overlap in data science with other analytical fields, and there&rsquo;s nothing wrong with that.</p>
<h2 id="data-scientists-are-software-engineers-too">Data Scientists Are Software Engineers Too</h2>
<p>Although MOOCs encourage <em>self</em>-study, data science is a collaborative process, and not just with other data scientists on a team, but with other software engineers in the company. Version control tools like <a href="https://git-scm.com">Git</a> are often used by data scientists to upload their portfolio projects publicly to <a href="https://github.com">GitHub</a>, but there are many other features important in a company-wide collaborative environment, such as branching a repository, making pull requests, and resolving merge conflicts. Beyond that are modern development QA practices, such as test environments, consistent code style, and code reviews. The full process varies strongly by company: Airbnb has a <a href="https://medium.com/airbnb-engineering/scaling-knowledge-at-airbnb-875d73eff091">good thought piece</a> about how they utilize their Knowledge Base for data science collaboration using Git.</p>
<p>One of the very hard and surprisingly underdiscussed aspects of data science is <a href="https://en.wikipedia.org/wiki/DevOps">DevOps</a>: how to actually get a statistical model into production. <a href="https://www.docker.com/resources/what-container">Docker containers</a>, for example, are a newer technology that&rsquo;s hard to learn, but they have many data science and DevOps benefits, mitigating Python dependency hell and ensuring a consistent environment for model deployment and execution. And once the model is in production, data scientists, data engineers, and dedicated DevOps personnel need to work together to figure out whether the model has the expected output, whether the model is performing with the expected speed/memory overhead, how often to retrain the model on fresh data (plus the scheduling/data pipelining necessary to do so), and how to efficiently route predictions out of the system to the user.</p>
<h2 id="data-science-cant-solve-everything">Data Science Can&rsquo;t Solve Everything</h2>
<p>Data science experiments (even those utilizing magical AI) are allowed to fail, and not just in the fail-to-reject-the-null-hypothesis sense. Thought pieces typically discuss successful projects, which leads to a survivorship bias. Even with massive amounts of input data, it&rsquo;s <em>likely</em> that a model will fail to converge and offer zero insight, or that an experiment will fail to offer statistically significant results (common with <a href="https://vwo.com/ab-testing/">A/B testing</a>).</p>
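<p>Checking an A/B test for significance is itself only a few lines, e.g. a two-proportion <em>z</em>-test from statsmodels with made-up conversion counts:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">from statsmodels.stats.proportion import proportions_ztest

# hypothetical A/B test: 130/2000 conversions for A vs. 145/2000 for B
stat, p_value = proportions_ztest(count=[130, 145], nobs=[2000, 2000])
print(round(p_value, 3))  # ~0.35 here: not significant, a perfectly normal outcome
</code></pre>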
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">real world data science is an R<sup>2</sup> of 0.10 <a href="https://twitter.com/hashtag/GoogleNext18?src=hash&amp;ref_src=twsrc%5Etfw">#GoogleNext18</a> <a href="https://t.co/qNsno2dscR">pic.twitter.com/qNsno2dscR</a></p>— Max Woolf (@minimaxir) <a href="https://twitter.com/minimaxir/status/1021885939361042432?ref_src=twsrc%5Etfw">July 24, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</span></p>
<p>The difficulty of real-world data science is recognizing if a given problem <em>can</em> be solved, how much of your valuable time to spend iterating to <em>maybe</em> solve it, how to report to stakeholders if it <em>can&rsquo;t</em> be solved, and what are the next steps if that&rsquo;s the case.</p>
<p>Don&rsquo;t <a href="https://www.buzzfeednews.com/article/stephaniemlee/brian-wansink-cornell-p-hacking"><em>p</em>-hack</a>!</p>
<h2 id="data-science-and-ethics">Data Science and Ethics</h2>
<p>During the rise of the &ldquo;data science/AI is magic!&rdquo; era, massive algorithmic and statistical failures suggest that data science might not always make the world a better place. Amazon built a resume-reading model which <a href="https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G">accidentally learned to be sexist</a>. Facebook overestimated <a href="https://www.theverge.com/2018/10/17/17989712/facebook-inaccurate-video-metrics-inflation-lawsuit">performance metrics on their videos</a>, causing complete business pivots for media organizations in vain, indirectly <a href="https://www.theatlantic.com/technology/archive/2018/10/facebook-driven-video-push-may-have-cost-483-journalists-their-jobs/573403/">leading to hundreds of layoffs</a>. YouTube&rsquo;s recommended video algorithms <a href="https://medium.com/@jamesbridle/something-is-wrong-on-the-internet-c39c471271d2">drove children towards shocking and disturbing content</a>. And these companies have some of the best data talent <em>in the entire world</em>.</p>
<p>The <em>qualitative</em> output of a model or data analysis is just as important as the quantitative performance, if not more so. Allowing dangerous model output to hit production and impact <em>millions</em> of consumers is a failure of QA at all levels. In fairness, these companies usually fix these issues, but only <em>after</em> journalists <a href="https://www.nytimes.com/2018/10/19/opinion/facebook-twitter-journalism-misinformation.html">point them out</a>. The problem with blindly chasing a performance metric (as on Kaggle) is that it ignores collateral, unexpected effects.</p>
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Don’t be data-driven. Be data-informed. Metrics should never be in charge because they have no moral compass.</p>— Kim Goodwin (@kimgoodwin) <a href="https://twitter.com/kimgoodwin/status/1051849805280948224?ref_src=twsrc%5Etfw">October 15, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </span></p>
<p>Maybe recommending shocking videos is what maximizes clickthrough rate or ad revenue per the models according to a business dashboard. Unfortunately, if the data justifies it and the business stakeholders encourage it, the company may <em>accept the consequences</em> of a flawed algorithm if those consequences don&rsquo;t outweigh the benefits. It&rsquo;s important for data scientists to be aware that they may be party to that.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I realize the irony of using a data science thought piece to argue against data science thought pieces. In fairness, some Medium thought pieces do apply data science in genuinely <em>unique</em> ways or touch on very obscure-but-impactful aspects of frameworks, and I enjoy reading those. The field is still very broadly defined, and your experiences may differ from this post, especially if you&rsquo;re working for a more research-based institution. Unfortunately, I don’t have any new advice for <em>getting</em> a data science job, which is <a href="https://twitter.com/minimaxir/status/951117788835278848">still very difficult</a>.</p>
<p>The popular idea that being a data scientist is a 40-hours-a-week Kaggle competition is <strong>incorrect</strong>. There&rsquo;s a lot more to it that&rsquo;s not as sexy which, in my opinion, is the more interesting aspect of the data science field as a whole.</p>
]]></content:encoded>
    </item>
    <item>
      <title>How to Quickly Train a Text-Generating Neural Network for Free</title>
      <link>https://minimaxir.com/2018/05/text-neural-networks/</link>
      <pubDate>Fri, 18 May 2018 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/05/text-neural-networks/</guid>
      <description>Train your own text-generating neural network and generate text whenever you want with just a few clicks!</description>
<content:encoded><![CDATA[<p>One of the more interesting applications of the neural network revolution is text generation. Most popular approaches are based on Andrej Karpathy&rsquo;s <a href="https://github.com/karpathy/char-rnn">char-rnn architecture</a>/<a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">blog post</a>, which teaches a recurrent neural network to predict the next character in a sequence based on the previous <em>n</em> characters. As a result, a sufficiently trained network can theoretically reproduce its input source material, but since even properly-trained neural networks aren&rsquo;t <em>perfect</em>, the output can fall into a weird-but-good uncanny valley.</p>
<p>Many internet tutorials for text-generation neural networks simply copy an existing char-rnn implementation while changing the input dataset. It&rsquo;s one approach, but there&rsquo;s an opportunity for improvement with modern deep learning tooling. Thanks to frameworks like <a href="https://www.tensorflow.org">TensorFlow</a> and <a href="https://github.com/keras-team/keras">Keras</a>, I built <a href="https://github.com/minimaxir/textgenrnn">textgenrnn</a>, a <a href="https://pypi.org/project/textgenrnn/#description">Python package</a> which abstracts the process of creating and training such char-rnns to a <em>few lines of code</em>, with numerous model architecture and training improvements such as <a href="http://minimaxir.com/2017/04/char-embeddings/">character embeddings</a>, attention-weighted averaging, and a decaying learning rate.</p>
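<p>Those few lines look roughly like this (a minimal sketch; the filename and parameter values are placeholders):</p>
<pre tabindex="0"><code class="language-python" data-lang="python">from textgenrnn import textgenrnn

textgen = textgenrnn()  # initializes with the included pretrained model
textgen.train_from_file(&#39;input_texts.txt&#39;, num_epochs=5)
textgen.generate(3, temperature=0.5)  # print 3 generated texts
</code></pre>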
<p>A neat benefit of textgenrnn is that it can be easily used to train neural networks on a GPU very quickly, <em>for free</em> using <a href="https://colab.research.google.com/notebooks/welcome.ipynb">Google Colaboratory</a>. I&rsquo;ve <a href="https://drive.google.com/file/d/1mMKGnVxirJnqDViH7BDJxFqWrsXlPSoK/view?usp=sharing">created a notebook</a> which lets you train your own network and generate text whenever you want with just a few clicks!</p>
<h2 id="your-first-text-generating-neural-network">Your First Text-Generating Neural Network</h2>
<p>Colaboratory is a notebook environment similar to the <a href="http://jupyter.org">Jupyter Notebooks</a> used in other data science projects. However, Colaboratory notebooks are hosted on a short-term virtual machine, with 2 vCPUs, 13GB of memory, and a K80 GPU attached. For free. Normally, this configuration would <a href="https://cloud.google.com/compute/pricing">cost</a> $0.57/hr on Google Compute Engine; that sounds low, but it adds up when you need to train model(s) for hours to get good results.</p>
<p>First, I recommend copying the notebook to your own Drive so it&rsquo;ll always be there (and switch to using Google Chrome if you aren&rsquo;t already). The Colaboratory VM contains Python 3 and common Python packages for machine learning such as TensorFlow, but you can install more packages directly in the notebook. Like textgenrnn! Just run this cell by clicking into it and clicking the &ldquo;play&rdquo; button (or use Shift + Enter), and it&rsquo;ll take care of the rest:</p>
<figure>

    <img loading="lazy" srcset="/2018/05/text-neural-networks/pip_hu_2961e2ffcd246a46.webp 320w,/2018/05/text-neural-networks/pip_hu_897c5f1e48e49d.webp 768w,/2018/05/text-neural-networks/pip.png 936w" src="pip.png"/> 
</figure>

<p>When training a new model, textgenrnn allows you to specify the size and complexity of the neural network with a wide variety of parameters:</p>
<figure>

    <img loading="lazy" srcset="/2018/05/text-neural-networks/config_hu_f70243006edc1525.webp 320w,/2018/05/text-neural-networks/config.png 515w" src="config.png"/> 
</figure>

<p>Let&rsquo;s keep these default parameters for now, so run that cell to load them into memory. Run the next cell, which prompts you to upload a file. <em>Any text file should work</em>, even large text files! For this example, we&rsquo;ll use a 1.1MB text file of Shakespeare plays also <a href="https://github.com/karpathy/char-rnn/tree/master/data/tinyshakespeare">used in the char-rnn demos</a>.</p>
<figure>

    <img loading="lazy" srcset="/2018/05/text-neural-networks/upload_hu_82dc8aa0d091b8b8.webp 320w,/2018/05/text-neural-networks/upload_hu_b8978ae780ae5333.webp 768w,/2018/05/text-neural-networks/upload_hu_5121df32448d3012.webp 1024w,/2018/05/text-neural-networks/upload.png 1812w" src="upload.png"/> 
</figure>

<p>The next cell initializes an instance of textgenrnn and begins training a custom new text-generating neural network!</p>
<figure>

    <img loading="lazy" srcset="/2018/05/text-neural-networks/train_hu_afa90d7a652eb6c6.webp 320w,/2018/05/text-neural-networks/train_hu_876c5fbe08bb13ca.webp 768w,/2018/05/text-neural-networks/train_hu_cb2d12d9c68cd1a4.webp 1024w,/2018/05/text-neural-networks/train.png 1810w" src="train.png"/> 
</figure>

<p>textgenrnn automatically processes the input text into character sequences ready to train the network. After every 2 epochs (a full pass through the data), the network will generate sample text at different temperatures, which represent the &ldquo;creativity&rdquo; of the text (i.e. it allows the model to make increasingly suboptimal predictions, which can cause hilarity to ensue). I typically like generating text at a temperature of 0.5, but for very well-trained models, you can go up to 1.0.</p>
<p>The quick model training speed comes from the VM&rsquo;s GPU, which can perform the necessary mathematical operations much faster than a CPU. In the case of recurrent neural networks, Keras recently added a <a href="https://keras.io/layers/recurrent/#cudnnlstm">CuDNN implementation of RNNs</a> like LSTMs, which taps into GPU-native code and gains a <em>massive</em> speed boost (<a href="http://minimaxir.com/2017/11/benchmark-gpus/">about <em>7x as fast</em></a>) compared to previous implementations. In all, for this example dataset and model architecture, training on the GPU took 5-6 minutes per epoch, while on a modern CPU, training took <em>1 hour and 24 minutes</em> per epoch: a <strong>14x speedup</strong> on the GPU!</p>
<p>After training is complete, running the next cell will download three files: a <code>weights</code> file, a <code>vocabulary</code> file, and a <code>config</code> file that are all needed to regenerate your model elsewhere.</p>
<figure>

    <img loading="lazy" srcset="/2018/05/text-neural-networks/download_hu_e81503700cb98809.webp 320w,/2018/05/text-neural-networks/download_hu_e9891a5332f7cbcd.webp 768w,/2018/05/text-neural-networks/download_hu_cbcca1b791ab82ef.webp 1024w,/2018/05/text-neural-networks/download.png 1618w" src="download.png"/> 
</figure>

<p>For example, on your own personal computer. Just install textgenrnn + TensorFlow by inputting <code>pip3 install textgenrnn tensorflow</code> into a terminal, change to the directory where the downloaded files are located, run <code>python3</code>, and load the model using:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">textgenrnn</span> <span class="kn">import</span> <span class="n">textgenrnn</span>
</span></span><span class="line"><span class="cl"><span class="n">textgen</span> <span class="o">=</span> <span class="n">textgenrnn</span><span class="p">(</span><span class="n">weights_path</span><span class="o">=</span><span class="s1">&#39;colaboratory_weights.hdf5&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                       <span class="n">vocab_path</span><span class="o">=</span><span class="s1">&#39;colaboratory_vocab.json&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                       <span class="n">config_path</span><span class="o">=</span><span class="s1">&#39;colaboratory_config.json&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p>And that&rsquo;s that! No GPU necessary if you&rsquo;re just generating text. You can generate samples (like during training) using <code>textgen.generate_samples()</code>, generate a ton of samples at any temperature you like to a file using <code>textgen.generate_to_file()</code>, or incorporate a generated text into a Python script (e.g. a Twitter bot) using <code>textgen.generate(1, return_as_list=True)[0]</code> to store a text as a variable. You can view more of textgenrnn&rsquo;s functions and capabilities in <a href="https://github.com/minimaxir/textgenrnn/blob/master/docs/textgenrnn-demo.ipynb">this demo Jupyter Notebook</a>.</p>
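<p>In code form, those generation options look roughly like this (the filename and counts are arbitrary):</p>
<pre tabindex="0"><code class="language-python" data-lang="python"># print a few samples at several temperatures, like during training
textgen.generate_samples()

# write 100 generated texts at temperature 0.5 to a file
textgen.generate_to_file(&#39;generated.txt&#39;, n=100, temperature=0.5)

# store a single generated text in a variable, e.g. for a Twitter bot
new_text = textgen.generate(1, return_as_list=True)[0]
</code></pre>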
<p>Here&rsquo;s some Shakespeare generated with a 50-minute-trained model at a temperature of 0.5:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">LUCENTIO:
</span></span><span class="line"><span class="cl">And then shall good grave to my wife thee;
</span></span><span class="line"><span class="cl">Thou would the cause the brieved to me,
</span></span><span class="line"><span class="cl">And let the place and then receives:
</span></span><span class="line"><span class="cl">The rest you the foren to my ways him child,
</span></span><span class="line"><span class="cl">And marry that will be a parties and so set me that be deeds
</span></span><span class="line"><span class="cl">And then the heart and be so shall make the most as he and stand of seat.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">GLOUCESTER:
</span></span><span class="line"><span class="cl">Your father and madam, or shall for the people
</span></span><span class="line"><span class="cl">And dead to make the truth, or a business
</span></span><span class="line"><span class="cl">As we brother to the place her great the truth;
</span></span><span class="line"><span class="cl">And that which to the smaster and her father,
</span></span><span class="line"><span class="cl">I am I was see the sun have to the royal true.
</span></span></code></pre></div><p>Not too bad, and it&rsquo;s even close to <a href="https://en.wikipedia.org/wiki/Iambic_pentameter">iambic pentameter</a>!</p>
<h2 id="tweaking-the-model">Tweaking the Model</h2>
<p>The most important model configuration options above are <code>rnn_size</code> and <code>rnn_layers</code>: these determine the complexity of the network. Typically, you&rsquo;ll see networks in tutorials be a single 128-cell or 256-cell network. However, textgenrnn&rsquo;s architecture is slightly different, as it has an attention layer which incorporates <em>all</em> the preceding model layers. As a result, it&rsquo;s much better to go deeper than wider (e.g. 4x128 is better than 1x512) unless you have a very large amount of text (&gt;10MB). <code>rnn_bidirectional</code> controls whether the recurrent neural network is bidirectional, that is, whether it processes the previous characters both forward <em>and</em> backward (which works great if the text follows specific rules, like Shakespeare&rsquo;s character headings). <code>max_length</code> determines the maximum number of characters the network uses to predict the next character; it should be increased to let the network learn longer sequences, or decreased for shorter ones.</p>
<p>Training has a few helpful options as well. <code>num_epochs</code> determines the number of full passes through the data; this can be tweaked if you want to train the model even more. <code>batch_size</code> determines the number of model sequences to train in a step: typically, the batch size for deep learning models is 32 or 128, but with a GPU, you can get a speed increase by saturating it with the given 1024 default. <code>train_size</code> determines the proportion of character samples to train on; setting it <code>&lt; 1.0</code> both speeds up each epoch and prevents the model from cheating by learning sequences verbatim. (You can set <code>&#39;validation&#39;: True</code> to run the model on the unused data after each epoch to see if the model is overfitting.)</p>
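<p>Collected into a config dict, the options discussed above look something like the sketch below; the key names follow the parameters as discussed, but the values are illustrative defaults, not recommendations:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">config = {
    &#39;rnn_size&#39;: 128,            # cells per recurrent layer
    &#39;rnn_layers&#39;: 4,            # deeper beats wider for this architecture
    &#39;rnn_bidirectional&#39;: True,  # read sequences forward and backward
    &#39;max_length&#39;: 40,           # characters used to predict the next one
    &#39;num_epochs&#39;: 10,           # full passes through the data
    &#39;gen_epoch&#39;: 2,             # generate sample text every 2 epochs
    &#39;batch_size&#39;: 1024,         # large batches to saturate the GPU
    &#39;train_size&#39;: 0.8,          # hold out 20% so the model cannot memorize
    &#39;validation&#39;: False,        # evaluate on the held-out data each epoch
}
</code></pre>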
<p>Let&rsquo;s try playing with the parameters more on a new text dataset.</p>
<h2 id="word-level-text-generation-with-reddit-data">Word-Level Text Generation With Reddit Data</h2>
<p>You might be asking &ldquo;how do you obtain text data?&rdquo; The popular text-generation use cases like lyric generation and movie scripts are copyright-protected, so the data is harder to find, and even then, there might not be enough text to train a new model upon (you typically want at least 100,000 characters).</p>
<p><a href="https://www.reddit.com">Reddit</a>, however, has <em>millions</em> of submission titles which would be great to train for a model. I wrote a <a href="https://github.com/minimaxir/subreddit-generator">helper script</a> to automatically download the top <em>n</em> Reddit submissions from a given subreddit over a given period of time. If you choose subreddits with similar linguistic styles in their titles, the subreddits will even blend together! Let&rsquo;s play with the Top 20,000 Submissions in 2017 from each of <a href="https://www.reddit.com/r/politics/">/r/politics</a> and <a href="https://www.reddit.com/r/technology/">/r/technology</a>, which results in a 3.3MB file: about 3x as much data as the Shakespeare plays.</p>
<figure>

    <img loading="lazy" srcset="/2018/05/text-neural-networks/reddit_data_hu_283f931b7de2ac8f.webp 320w,/2018/05/text-neural-networks/reddit_data_hu_1a16b2b3e85d1470.webp 768w,/2018/05/text-neural-networks/reddit_data_hu_9bb51cab41ff884e.webp 1024w,/2018/05/text-neural-networks/reddit_data.png 1288w" src="reddit_data.png"/> 
</figure>

<p>One last thing that textgenrnn can do that most char-rnn implementations can&rsquo;t is generate a <em>word-level</em> model (thanks to Keras&rsquo;s tokenizers), where the model uses the <em>n</em> previous words/punctuation tokens to predict the next word/punctuation token. On the plus side, using only known words prevents crazy typos, and since the model predicts multiple &ldquo;characters&rdquo; at a time, <code>max_length</code> can be reduced proportionally, dramatically speeding up training. There are two downsides to this approach: since words are all lowercase and punctuation is its own token, the generated text cannot be immediately used without manual editing, and the model weights will be substantially larger than a character-level model&rsquo;s, since the word-level model has to store an embedding for each word (up to <code>max_words</code>, which is 10,000 by default, whereas the vocabulary size for a char-level model is 200-300).</p>
<p>Another advantage of the Colaboratory notebook is that you can quickly adjust model parameters, upload a new file, and immediately start training on it. We&rsquo;ll set <code>&#39;line_delimited&#39;: True</code> and <code>&#39;rnn_bidirectional&#39;: False</code> since Reddit titles don&rsquo;t follow specific structural rules. For word-level training, let&rsquo;s set <code>&#39;word_level&#39;: True</code> and <code>&#39;max_length&#39;: 8</code> to reflect the new training architecture. Since the sequence length has been reduced to 1/5th, we can set <code>&#39;num_epochs&#39;: 50</code> and <code>&#39;gen_epoch&#39;: 10</code> to balance it out. Rerun the config cell to update the parameters, upload the Reddit data file, and rerun training.</p>
<p>The resulting model is much better trained than the Shakespeare model; here are a few Reddit submission titles generated at a temperature of 1.0:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">report : 49 % of americans now believe all of the country ’ s effective
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">people like facebook like it &#39; s 650 of 1 %
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">uber accused of secretly - security popular service ( likely oklahoma )
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">equifax breach fallout : your salary is dead
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">sanders uses texas shooter &#39; s iphone sales
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">adobe videos will be used to sell the web
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">apple to hold cash for $ 500 service
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">fitbit just targeting solar energy
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">george bush &#39; s concept car ‘ goes for all the biggest controversy .
</span></span></code></pre></div><p>Those look pretty good, although they may need a little editing before posting on social media.</p>
<h2 id="followup">Followup</h2>
<p>These examples only train the model for a short time as a demo of textgenrnn&rsquo;s fast learning; there&rsquo;s nothing stopping you from increasing <code>num_epochs</code> even more to further refine a model. However, the training cell times out after <strong>4 hours</strong>, so set <code>num_epochs</code> accordingly; in my experience, that&rsquo;s all you need before the network converges.</p>
<p>In practice, I used this Colaboratory notebook to train <em>many</em> models for <a href="https://www.reddit.com/r/SubredditNN/">/r/SubredditNN</a>, a subreddit where the only posters are text-generating neural network bots trained on other subreddits. And the results are very funny:</p>
<figure>

    <img loading="lazy" srcset="/2018/05/text-neural-networks/subredditnn_hu_14b57f973c80496b.webp 320w,/2018/05/text-neural-networks/subredditnn_hu_c348364252281d3.webp 768w,/2018/05/text-neural-networks/subredditnn_hu_fd7c1bfc6a99ba3e.webp 1024w,/2018/05/text-neural-networks/subredditnn.png 1554w" src="subredditnn.png"/> 
</figure>

<p>Although text-generating neural networks aren&rsquo;t at the point where they can <a href="https://www.bloomberg.com/news/features/2018-05-17/i-tried-to-get-an-ai-to-write-this-story-paul-ford">write entire articles by themselves</a>, there are still many opportunities to use them just for fun! And thanks to textgenrnn, it&rsquo;s easy, fast, and cost-effective for anyone to do. Let me know if you make any interesting neural networks with textgenrnn and this notebook!</p>
]]></content:encoded>
    </item>
    <item>
      <title>Pretrained Character Embeddings for Deep Learning and Automatic Text Generation</title>
      <link>https://minimaxir.com/2017/04/char-embeddings/</link>
      <pubDate>Tue, 04 Apr 2017 06:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/04/char-embeddings/</guid>
      <description>Keras + TensorFlow + Pretrained character embeddings makes text generation a breeze.</description>
      <content:encoded><![CDATA[<p>Deep learning is the biggest, <a href="http://approximatelycorrect.com/2017/03/28/the-ai-misinformation-epidemic/">often misapplied</a> buzzword nowadays for getting pageviews on blogs. As a result, there have been a lot of shenanigans lately with deep learning thought pieces and how deep learning can solve <em>anything</em> and make childhood sci-fi dreams come true.</p>
<p>I&rsquo;m not a fan of <a href="http://tvtropes.org/pmwiki/pmwiki.php/Main/ClarkesThirdLaw">Clarke&rsquo;s Third Law</a>, so I spent some time checking out deep learning myself. As it turns out, with modern deep learning tools like <a href="https://github.com/fchollet/keras">Keras</a>, a higher-level framework on top of the popular <a href="https://www.tensorflow.org">TensorFlow</a> framework, deep learning is <strong>easy to learn and understand</strong>. Yes, easy. And it <em>definitely</em> does not require a PhD, or even a Computer Science undergraduate degree, to implement models or make decisions based on the output.</p>
<p>However, let&rsquo;s try something more expansive than the stereotypical deep learning tutorials.</p>
<h2 id="characters-welcome">Characters Welcome</h2>
<p>Word embeddings are a popular machine learning technique nowadays. By using an algorithm such as <a href="https://en.wikipedia.org/wiki/Word2vec">Word2vec</a>, you can obtain a numeric representation of a word, and use those values to create numeric representations of higher-level constructs like sentences/paragraphs/documents/etc.</p>
<figure>

    <img loading="lazy" srcset="/2017/04/char-embeddings/word-vectors_hu_957d4d31123d1287.webp 320w,/2017/04/char-embeddings/word-vectors_hu_b3c04f84f101893d.webp 768w,/2017/04/char-embeddings/word-vectors_hu_91898a517b108395.webp 1024w,/2017/04/char-embeddings/word-vectors.png 1066w" src="word-vectors.png"/> 
</figure>

<p>However, generating word vectors for datasets can be computationally expensive (see <a href="http://minimaxir.com/2016/08/clickbait-cluster/">my earlier post</a> which uses Apache Spark/Word2vec to create sentence vectors at scale quickly). The academic way to work around this is to use pretrained word embeddings, such as <a href="https://nlp.stanford.edu/projects/glove/">the GloVe vectors</a> collected by researchers at Stanford NLP. However, GloVe vectors are huge; the largest one (840 billion tokens at 300D) is 5.65 GB on disk and may hit issues when loaded into memory on less-powerful computers.</p>
<p>Why not work <em>backwards</em> and calculate <em>character</em> embeddings? Then you could calculate relatively few vectors which would easily fit into memory, and use those to derive word vectors, which can then be used to derive the sentence/paragraph/document/etc. vectors. But training character embeddings traditionally is significantly more computationally expensive since there are 5-6x as many tokens, and I don&rsquo;t have access to the supercomputing power of Stanford researchers.</p>
<p>Why not use the <em>existing</em> pretrained word embeddings to extrapolate the corresponding character embeddings within each word? Think &ldquo;<a href="https://en.wikipedia.org/wiki/Bag-of-words_model">bag-of-words</a>,&rdquo; except &ldquo;bag-of-characters.&rdquo; For example, from the embeddings for the word &ldquo;the&rdquo;, we can infer embeddings for &ldquo;t&rdquo;, &ldquo;h&rdquo;, and &ldquo;e&rdquo; from the parent word, and average the t/h/e vectors from <em>all</em> words/tokens in the dataset corpus. (For this post, I will only look at the 840B/300D dataset since that is the only one with capital letters, which are rather important. If you want a dataset with smaller dimensionality, apply <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> on the final results.)</p>
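<p>The core of that averaging idea fits in a few lines; here&rsquo;s a minimal sketch (the actual script linked below handles the file-format edge cases):</p>
<pre tabindex="0"><code class="language-python" data-lang="python">from collections import defaultdict

import numpy as np

vector_sums = defaultdict(lambda: np.zeros(300))
vector_counts = defaultdict(int)

# each line of the GloVe file is: word v1 v2 ... v300
with open(&#39;glove.840B.300d.txt&#39;, encoding=&#39;utf-8&#39;) as f:
    for line in f:
        parts = line.rstrip().split(&#39; &#39;)
        word, vector = parts[0], np.asarray(parts[1:], dtype=np.float32)
        for char in word:
            if 33 &lt;= ord(char) &lt;= 126:  # printable, non-space ASCII only
                vector_sums[char] += vector
                vector_counts[char] += 1

# average the accumulated vectors: one 300D embedding per character
char_embeddings = {c: vector_sums[c] / vector_counts[c] for c in vector_sums}
</code></pre>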
<p>I wrote a <a href="https://github.com/minimaxir/char-embeddings/blob/master/create_embeddings.py">simple Python script</a> that takes in the specified pretrained word embeddings and does just that, <a href="https://github.com/minimaxir/char-embeddings/blob/master/glove.840B.300d-char.txt">outputting the character embeddings</a> in the same format. (For simplicity, only ASCII characters are included; the <a href="https://en.wikipedia.org/wiki/Extended_ASCII">extended ASCII characters</a> are intentionally omitted for compatibility reasons. Additionally, by construction, space and newline characters are not represented in the derived dataset.)</p>
<figure>

    <img loading="lazy" srcset="/2017/04/char-embeddings/char-embeddings_hu_2c1ac4747dfe7721.webp 320w,/2017/04/char-embeddings/char-embeddings.png 674w" src="char-embeddings.png"/> 
</figure>

<p>You may be thinking that I&rsquo;m cheating. So let&rsquo;s set a point-of-reference. Colin Morris <a href="http://colinmorris.github.io/blog/1b-words-char-embeddings">found</a> that when 16D character embeddings from a model used in Google&rsquo;s <a href="https://arxiv.org/abs/1312.3005">One Billion Word Benchmark</a> are projected into a 2D space via t-SNE, patterns emerge: digits are close, lowercase and uppercase letters are often paired, and punctuation marks are loosely paired.</p>
<figure>

    <img loading="lazy" srcset="/2017/04/char-embeddings/tsne_embeddings_hu_24701626db450376.webp 320w,/2017/04/char-embeddings/tsne_embeddings_hu_915d2c16f5761646.webp 768w,/2017/04/char-embeddings/tsne_embeddings.png 795w" src="tsne_embeddings.png"/> 
</figure>

<p>Let&rsquo;s do that for my derived character embeddings, but with <a href="https://www.r-project.org">R</a> and <a href="http://docs.ggplot2.org/current/">ggplot2</a>. t-SNE is <a href="http://distill.pub/2016/misread-tsne/">difficult to use</a> for high-dimensional vectors, as different combinations of parameters can result in wildly different output, so let&rsquo;s try a couple of projections. Here&rsquo;s what happens when my pretrained embeddings are preprojected from 300D to 16D via <a href="http://ufldl.stanford.edu/tutorial/unsupervised/PCAWhitening/">PCA whitening</a>, with the perplexity (number of optimal neighbors) set to 7:</p>
<figure>

    <img loading="lazy" srcset="/2017/04/char-embeddings/char-tsne_hu_d3036836f3686a8.webp 320w,/2017/04/char-embeddings/char-tsne_hu_b34e859ed7bc3483.webp 768w,/2017/04/char-embeddings/char-tsne_hu_5ad979b9989688d.webp 1024w,/2017/04/char-embeddings/char-tsne.png 1500w" src="char-tsne.png"/> 
</figure>

<p>The algorithm manages to separate and group lowercase, uppercase, and numerals rather distinctly. Quadrupling the dimensionality of the preprocessing step to 64D and changing perplexity to 2 generates a depiction closer to the Google model projection:</p>
<figure>

    <img loading="lazy" srcset="/2017/04/char-embeddings/char-tsne-2_hu_db49ec263903c70d.webp 320w,/2017/04/char-embeddings/char-tsne-2_hu_3ba284cdb5f46ed5.webp 768w,/2017/04/char-embeddings/char-tsne-2_hu_8e5eb2b75768cbe8.webp 1024w,/2017/04/char-embeddings/char-tsne-2.png 1500w" src="char-tsne-2.png"/> 
</figure>

<p>My pretrained character embeddings trick isn&rsquo;t academic, but it&rsquo;s successfully identifying realistic relationships. There might be something here worthwhile.</p>
<h2 id="the-coolness-of-deep-learning">The Coolness of Deep Learning</h2>
<p>Keras, maintained by Google employee <a href="https://twitter.com/fchollet">François Chollet</a>, is so good that it is effectively cheating in the field of machine learning, where even TensorFlow tutorials can be replaced with a single line of code (which is important for iteration; Keras layers are effectively Lego blocks). A simple read of the <a href="https://github.com/fchollet/keras/tree/master/examples">Keras examples</a> and <a href="https://keras.io/">documentation</a> will let you reverse-engineer most of the revolutionary deep learning clickbait thought pieces. Some create entire startups by changing the source dataset of the Keras examples and pitching them to investors none the wiser, or make very light wrappers on top of the examples for teaching tutorial videos and get thousands of subscribers on YouTube.</p>
<p>I prefer to parse documentation/examples as a proof-of-concept, but never as gospel. Examples are often not the most efficient way to implement a solution to a problem, merely a start. In the case of Keras&rsquo;s <a href="https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py">text generator example</a>, the initial code was likely modeled after the 2015 blog post <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">The Unreasonable Effectiveness of Recurrent Neural Networks</a> by Andrej Karpathy and the corresponding project <a href="https://github.com/karpathy/char-rnn">char-rnn</a>. There have been many new developments in neural network architecture since 2015 that can improve both the speed and performance of the text generation model as a whole.</p>
<h2 id="what-text-to-generate">What Text to Generate?</h2>
<p>The Keras example uses <a href="https://en.wikipedia.org/wiki/Friedrich_Nietzsche">Nietzsche</a> writings as a data source, which I&rsquo;m not fond of because it&rsquo;s difficult to differentiate bad autogenerated Nietzsche rants from actual Nietzsche rants. What I want to generate is text with <em>rules</em>, with the algorithm being judged by how well it follows an inherent structure. My idea is to create <a href="http://magic.wizards.com/en">Magic: The Gathering</a> cards.</p>
<figure>

    <img loading="lazy" srcset="/2017/04/char-embeddings/dragon-whelp_hu_aedb3bfaf354fd82.webp 320w,/2017/04/char-embeddings/dragon-whelp.jpg 464w" src="dragon-whelp.jpg"/> 
</figure>

<p>Inspired by the <a href="https://twitter.com/RoboRosewater">@RoboRosewater</a> Twitter account by Reed Milewicz and the <a href="http://www.mtgsalvation.com/forums/creativity/custom-card-creation/612057-generating-magic-cards-using-deep-recurrent-neural">corresponding research</a> and <a href="https://motherboard.vice.com/en_us/article/the-ai-that-learned-magic-the-gathering">articles</a>, I aim to see if it&rsquo;s possible to recreate the structured design creativity for myself.</p>
<p>Even if you are not familiar with Magic and its rules, you can still find the <a href="https://twitter.com/RoboRosewater/status/756198572282949632">card text</a> of RoboRosewater cards hilarious:</p>
<figure>

    <img loading="lazy" srcset="/2017/04/char-embeddings/horse_hu_1ffb7f6bcc09f430.webp 320w,/2017/04/char-embeddings/horse.jpeg 375w" src="horse.jpeg"/> 
</figure>

<p>Occasionally RoboRosewater, using a weaker model, produces amusing <a href="https://twitter.com/RoboRosewater/status/689184317721960448">neural network trainwrecks</a>:</p>
<figure>

    <img loading="lazy" srcset="/2017/04/char-embeddings/carl_hu_636d0892c1942778.webp 320w,/2017/04/char-embeddings/carl.png 375w" src="carl.png"/> 
</figure>

<p>More importantly, all Magic cards have an explicit structure; they have a name, mana cost in the upper-right, card type, card text, and usually a power and toughness in the bottom-right.</p>
<p>I wrote <a href="https://github.com/minimaxir/char-embeddings/blob/master/create_magic_text.py">another Python script</a> to parse all Magic card data from <a href="https://mtgjson.com">MTG JSON</a> into an encoding which matches this architecture, where each section transition has its own symbol delimiter, along with other encoding simplicities. For example, here is the card <a href="http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=247314">Dragon Whelp</a> in my encoding:</p>
<pre tabindex="0"><code>[Dragon Whelp@{2}{R}{R}#Creature — Dragon$Flying|{R}: ~ gets +1/+0 until end of turn. If this ability has been activated four or more times this turn, sacrifice ~ at the beginning of the next end step.%2^3]
</code></pre><p>These card encodings are all combined into one .txt file, which will be fed into the model.</p>
<h2 id="building-and-training-the-model">Building and Training the Model</h2>
<p>The Keras text generation example operates by breaking a given .txt file into 40-character sequences, and the model tries to predict the 41st character by outputting a probability for each possible character (108 in this dataset). For example, if the input based on the above example is <code>['D', 'r', 'a', 'g', ..., 'D', 'r', 'a', 'g']</code> (with the latter Drag being part of the creature type), the model will optimize for outputting a probability of 1.0 of <code>o</code>; per the <a href="https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_error_function_and_logistic_regression">categorical crossentropy</a> loss function, the model is rewarded for assigning correct guesses with 1.0 probability and incorrect guesses with 0.0 probabilities, penalizing half-guesses and wrong guesses.</p>
<p>Each possible 40-character sequence is collected; however, only every third sequence is kept, which prevents the model from learning card text verbatim and also makes training faster (for this model, there are about <strong>1 million</strong> sequences in the final training set). The example uses only a single 128-node <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">long short-term memory</a> (LSTM) <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">recurrent neural network</a> (RNN) layer, popular for incorporating a &ldquo;memory&rdquo; into a neural network model, but the example notes at the beginning that it can take a while to train before the generated text is coherent.</p>
<p>There are a few optimizations we can make. Instead of supplying the characters directly to the RNN, we can first encode them using an <a href="https://keras.io/layers/embeddings/">Embedding layer</a> so the model can learn character context. We can stack more layers on the RNN by adding a 2-level <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">multilayer perceptron</a>: a <a href="https://www.reddit.com/r/ProgrammerHumor/comments/5si1f0/machine_learning_approaches/">meme</a>, yes, but it helps, as the network must learn latent representations of the data. Thanks to recent developments such as <a href="https://arxiv.org/abs/1502.03167">batch normalization</a> and <a href="https://en.wikipedia.org/wiki/Rectifier_%28neural_networks%29">rectified linear activations</a>, these <a href="https://keras.io/layers/core/#dense">Dense layers</a> can be trained without much computational overhead, and thanks to Keras, both can be added to a layer with a single line of code each. Lastly, we can add an auxiliary output via Keras&rsquo;s <a href="https://keras.io/models/model/">functional API</a>, where the network makes a prediction based only on the output from the RNN in addition to the main output; this forces the RNN to work smarter and ends up resulting in a <em>significant</em> improvement in loss for the main path.</p>
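<p>In Keras&rsquo;s functional API, that architecture is only a handful of lines; here&rsquo;s a minimal sketch, with layer sizes chosen for illustration rather than matching the trained model exactly:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">from keras.layers import (Activation, BatchNormalization, Dense,
                          Embedding, Input, LSTM)
from keras.models import Model

num_chars = 108  # size of the character vocabulary
seq_input = Input(shape=(40,), dtype=&#39;int32&#39;)

# character embeddings (can be initialized from the pretrained 300D vectors)
embedded = Embedding(num_chars, 300, input_length=40)(seq_input)
rnn = LSTM(128)(embedded)

# auxiliary output: predict the next character from the RNN output alone
aux_output = Dense(num_chars, activation=&#39;softmax&#39;, name=&#39;aux_out&#39;)(rnn)

# main path: 2-level MLP with batch normalization and ReLU activations
x = Dense(256)(rnn)
x = BatchNormalization()(x)
x = Activation(&#39;relu&#39;)(x)
x = Dense(256)(x)
x = BatchNormalization()(x)
x = Activation(&#39;relu&#39;)(x)
main_output = Dense(num_chars, activation=&#39;softmax&#39;, name=&#39;main_out&#39;)(x)

model = Model(inputs=seq_input, outputs=[main_output, aux_output])
model.compile(loss=&#39;categorical_crossentropy&#39;, optimizer=&#39;adam&#39;)
</code></pre>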
<p>The final architecture ends up looking like this:</p>
<figure>

    <img loading="lazy" srcset="/2017/04/char-embeddings/model_hu_37b43e25201e614b.webp 320w,/2017/04/char-embeddings/model_hu_dddf239d67693047.webp 768w,/2017/04/char-embeddings/model.png 930w" src="model.png"/> 
</figure>

<p>And because we added an Embedding layer, we can load the pretrained 300D character embeds I made earlier, giving the model a good start in understanding character relationships.</p>
<p>The goal of training is to minimize the total loss of the model (though for evaluating model performance, we only look at the loss of the main output). The model is trained in <strong>epochs</strong>, in which the model sees all the input data at least once. During each epoch, batches of size 128 are loaded into the model and evaluated, calculating a <strong>batch loss</strong> for each; the gradients from the batch are backpropagated through the previous layers to improve them. While training with Keras, the console reports an <strong>epoch loss</strong>, the average of all the batch losses so far in the current epoch, allowing the user to see in real time how the model improves. It&rsquo;s addicting.</p>
<figure>

    <img loading="lazy" srcset="/2017/04/char-embeddings/keras-training_hu_52c660d870466682.webp 320w,/2017/04/char-embeddings/keras-training_hu_1d55a44e4dbe3025.webp 768w,/2017/04/char-embeddings/keras-training.gif 994w" src="keras-training.gif"/> 
</figure>

<p>Keras/TensorFlow works just fine on the CPU, but for models with a RNN, you&rsquo;ll want to consider using a GPU for performance, specifically one by nVidia. Amazon has cloud GPU instances for $0.90/hr (<a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html">not prorated</a>), but very recently, Google announced <a href="https://cloud.google.com/compute/docs/gpus/add-gpus">GPU instances</a> of the same caliber for ~$0.75/hr (prorated to the minute), which is what I used to train this model, although Google Compute Engine requires configuring the GPU drivers first. For 20 epochs, it took about 4 hours and 20 minutes to train the model while spending $3.26, which isn&rsquo;t bad as far as deep learning goes.</p>
<h2 id="making-magic">Making Magic</h2>
<p>After each epoch, the original Keras text generation example takes a sentence from the input data as a seed and predicts the next character in the sequence according to the model, then uses the last 40 generated characters to predict the character after that, and so on. The sampling incorporates a diversity/temperature parameter which allows the model to make suboptimal decisions and select characters with lower natural probabilities, allowing for the romantic &ldquo;creativity&rdquo; popular with neural network text generation.</p>
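<p>The sampling step itself is tiny; here&rsquo;s a sketch in the spirit of the Keras example&rsquo;s helper function (the small epsilon guarding against log(0) is my addition):</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import numpy as np

def sample(preds, temperature=1.0):
    # reweight the predicted probabilities, then draw a character index
    preds = np.asarray(preds).astype(&#39;float64&#39;)
    preds = np.log(preds + 1e-8) / temperature  # low temp sharpens, high temp flattens
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    return np.argmax(np.random.multinomial(1, preds, 1))
</code></pre>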
<p>With the Magic card dataset and my tweaked model architecture, generated text is coherent <a href="https://github.com/minimaxir/char-embeddings/blob/master/output/iter-01-0_9204.txt">after the 1st epoch</a>! After about 20 epochs, training becomes super slow, but the predicted text becomes super interesting. Here are a few fun examples from a <a href="https://github.com/minimaxir/char-embeddings/blob/master/output/text_sample.txt">list of hundreds of generated cards</a>. (Note: the power/toughness values at the end of the card have issues; more on that later).</p>
<p>With low diversity, the neural network generated cards that are oddly biased toward card names which include the letter &ldquo;S&rdquo;. The card text also conforms to the rules of the game very well.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">[Reality Spider@{3}{G}#Creature — Elf Warrior$Whenever ~ deals combat damage to a player, put a +1/+1 counter on it.%^]
</span></span><span class="line"><span class="cl">[Dark Soul@{2}{R}#Instant$~ deals 2 damage to each creature without flying.%^]
</span></span><span class="line"><span class="cl">[Standing Stand@{2}{G}#Creature — Elf Shaman${1}{G}, {T}: Draw a card, then discard a card.%^]
</span></span></code></pre></div><p>In contrast, cards generated with high diversity hit the uncanny valley of coherence and incoherence in both text and game mechanic abuse, which is what makes them interesting.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">[Portrenline@{2}{R}#Sorcery$As an additional cost to cast ~, exile ~.%^]
</span></span><span class="line"><span class="cl">[Clocidian Lorid@{W}{W}{W}#Instant$Regenerate each creature with flying and each player.%^]
</span></span><span class="line"><span class="cl">[Icomic Convermant@{3}{G}#Sorcery$Search your library for a land card in your graveyard.%1^1]
</span></span></code></pre></div><p>The best-of-both-worlds cards are generated from diversity parameters between both extremes, and often have funny names.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">[Seal Charm@{W}{W}#Instant$Exile target creature. Its controller loses 1 life.%^]
</span></span><span class="line"><span class="cl">[Shambling Assemblaster@{4}{W}#Creature — Human Cleric$When ~ enters the battlefield, destroy target nonblack creature.%1^1]
</span></span><span class="line"><span class="cl">[Lightning Strength@{3}{R}#Enchantment — Aura$Enchant creature|Enchanted creature gets +3/+3 and has flying, flying, trample, trample, lifelink, protection from black and votile all damage unless you return that card to its owner&#39;s hand.%2^2]
</span></span><span class="line"><span class="cl">[Skysor of Shadows@{7}{B}{B}{B}#Enchantment$As ~ enters the battlefield, choose one —|• Put a -1/-1 counter on target creature.%2^2]
</span></span><span class="line"><span class="cl">[Glinding Stadiers@{4}{W}#Creature — Spirit$Protection from no creatures can&#39;t attack.%^]
</span></span><span class="line"><span class="cl">[Dragon Gault@{3}{G}{U}{U}#Creature — Kraven$~&#39;s power and toughness are 2.%2^2]
</span></span></code></pre></div><p>All Keras/Python code used in this blog post, along with sample Magic card output and the trained model itself, is available open-source <a href="https://github.com/minimaxir/char-embeddings">in this GitHub repository</a>. The repo additionally contains <a href="https://github.com/minimaxir/char-embeddings/blob/master/text_generator_keras_sample.py">a Python script</a> which lets you generate new cards using the model, too!</p>
<h2 id="visualizing-model-performance">Visualizing Model Performance</h2>
<p>One thing deep learning tutorials rarely mention is <em>how</em> to collect the loss data and visualize the change in loss over time. Thanks to Keras&rsquo;s <a href="https://keras.io/callbacks/">utility functions</a>, I wrote a custom model callback which collects the batch losses and epoch losses and writes them to a CSV file.</p>
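<p>A minimal sketch of such a callback (not the exact one in the repository, but the same idea):</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import csv

from keras.callbacks import Callback

class LossLogger(Callback):
    # write every batch loss and epoch loss to a CSV file during training
    def __init__(self, path=&#39;log.csv&#39;):
        super(LossLogger, self).__init__()
        self.file = open(path, &#39;w&#39;)
        self.writer = csv.writer(self.file)
        self.writer.writerow([&#39;type&#39;, &#39;index&#39;, &#39;loss&#39;])

    def on_batch_end(self, batch, logs=None):
        self.writer.writerow([&#39;batch&#39;, batch, logs.get(&#39;loss&#39;)])

    def on_epoch_end(self, epoch, logs=None):
        self.writer.writerow([&#39;epoch&#39;, epoch, logs.get(&#39;loss&#39;)])

    def on_train_end(self, logs=None):
        self.file.close()

# usage: model.fit(X, y, callbacks=[LossLogger(&#39;log.csv&#39;)])
</code></pre>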
<p>Using R and ggplot2, I can plot the batch loss at every 50th batch to visualize how the model converges over time.</p>
<figure>

    <img loading="lazy" srcset="/2017/04/char-embeddings/batch-losses_hu_797eb59cd8e69be1.webp 320w,/2017/04/char-embeddings/batch-losses_hu_beacae9329be9663.webp 768w,/2017/04/char-embeddings/batch-losses_hu_a8a63ca12c37bf2e.webp 1024w,/2017/04/char-embeddings/batch-losses.png 1200w" src="batch-losses.png"/> 
</figure>

<p>After 20 epochs, the model loss ends up at about <strong>0.30</strong>, which is more than low enough for coherent text. As you can see, there are large diminishing returns after a few epochs, which is the hard part of training deep learning models.</p>
<p>Plotting the epoch loss over the batches makes the trend more clear.</p>
<figure>

    <img loading="lazy" srcset="/2017/04/char-embeddings/epoch-losses_hu_91beebdcbcd7000d.webp 320w,/2017/04/char-embeddings/epoch-losses_hu_9e50f22169739719.webp 768w,/2017/04/char-embeddings/epoch-losses_hu_7388ccd5aa56820b.webp 1024w,/2017/04/char-embeddings/epoch-losses.png 1200w" src="epoch-losses.png"/> 
</figure>

<p>In order to prevent early convergence, we can make the model more complex (i.e. stack more layers, unironically), but that has trade-offs in both training <em>and</em> prediction speed, the latter of which is important if using deep learning in a production application.</p>
<p>Lastly, as with the Google One Billion Words benchmark, we can extract the <a href="https://github.com/minimaxir/char-embeddings/blob/master/output/char-embeddings.txt">trained character embeddings</a> from the model (now augmented with Magic card context!) and plot them again to see what has changed.</p>
<figure>

    <img loading="lazy" srcset="/2017/04/char-embeddings/char-tsne-embed_hu_3b7b620e86831bdb.webp 320w,/2017/04/char-embeddings/char-tsne-embed_hu_74a89bea26895dc3.webp 768w,/2017/04/char-embeddings/char-tsne-embed_hu_beed0a100f0bdfea.webp 1024w,/2017/04/char-embeddings/char-tsne-embed.png 1500w" src="char-tsne-embed.png"/> 
</figure>

<p>There are more pairs of uppercase/lowercase characters, although interestingly there isn&rsquo;t much grouping among the special characters added as section breaks in the encoding, or among the mechanical uppercase characters such as W/U/B/R/G/C/T.</p>
<h2 id="next-steps">Next Steps</h2>
<p>After building the model, I did a little more research to see whether others had solved the power/toughness problem. Since the input sentences are only 40 characters and Magic cards are much longer than that, power/toughness are likely out of scope for the model, and it cannot learn their exact values. It turns out that the intended solution is to use a <a href="https://github.com/billzorn/mtgencode">completely different encoding</a>, such as this one for Dragon Whelp:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">|5creature|4|6dragon|7|8&amp;^^/&amp;^^^|9flying\{RR}: @ gets +&amp;^/+&amp; until end of turn. if this ability has been activated four or more times this turn, sacrifice @ at the beginning of the next end step.|3{^^RRRR}|0N|1dragon whelp|
</span></span></code></pre></div><p>Power/toughness are generated near the <em>beginning</em> of the card. Sections are delimited by pipes, with a numeral designating the corresponding section. Instead of numerals for card values, carets are used, which provides a more accurate <em>quantification</em> of values. With this encoding, each character has a <em>singular purpose</em> in the global card context, and their embeddings would likely generate more informative visualizations. (As a consequence, though, the generated cards are harder to parse at a glance.)</p>
<p>The secondary encoding highlights a potential flaw in my methodology of using pretrained character embeddings. Trained machine learning models must be used apples-to-apples on similar datasets; for example, you can&rsquo;t accurately perform Twitter <a href="https://en.wikipedia.org/wiki/Sentiment_analysis">sentiment analysis</a> using a model trained on professional movie reviews, since tweets do not follow <a href="https://owl.english.purdue.edu/owl/resource/735/02/">AP Style</a> guidelines. In my case, the <a href="http://commoncrawl.org">Common Crawl</a>, the source of the pretrained embeddings, follows more natural text usage and would not transfer cleanly to the atypical character usage in <em>either</em> of the Magic card encodings.</p>
<p>There&rsquo;s still a <em>lot</em> of work to be done, both in working with pretrained character embeddings and in improving Magic card generation, but I believe there is promise. A better way to create character embeddings than my script is to do it the hard way and train them manually, perhaps even at a higher dimensionality like 500D or 1000D. Likewise, for Magic model building, the <a href="https://github.com/billzorn/mtgencode#training-a-neural-net">mtg-rnn instructions</a> recommend a large LSTM stacked on another LSTM along with 120/200-character sentences, which combined make training <strong>VERY</strong> slow (notably, this was the architecture of the <a href="https://github.com/fchollet/keras/commit/d2b229df2ea0bab712379c418115bc44508bc6f9#diff-904d72bcf9fa38b32f9c1f868ff59367">very first commit</a> for the Keras text generation example, and <a href="https://github.com/fchollet/keras/commit/01d5e7bc4782daafcfa99e035c1bdbe13a985145">was changed</a> to the more easily trainable architecture). There is also promise in a <a href="http://kvfrans.com/variational-autoencoders-explained/">variational autoencoder</a> approach, such as with <a href="https://arxiv.org/abs/1702.02390">textvae</a>.</p>
<p>This work is potentially very expensive and I am strongly considering setting up a <a href="https://www.patreon.com">Patreon</a> in lieu of excess venture capital to subsidize my machine learning/deep learning tasks in the future.</p>
<p>At minimum, working through this example gave me plenty of practical experience with Keras, and another tool in my toolbox for data analysis and visualization. Keras makes the model-construction aspect of deep learning trivial and not scary. Hopefully, this article justifies the use of the &ldquo;deep learning&rdquo; buzzword in the headline.</p>
<p>It&rsquo;s also worth mentioning that I actually started working on automatic text generation 6 months ago using a different, non-deep-learning approach, but hit a snag and abandoned that project. With my work on Keras, I found a way around that snag, and on the same Magic dataset with the same input construction, I obtained a model loss of <strong>0.03</strong> at <strong>20% of the cloud computing cost</strong> in about the same amount of time. More on that later.</p>
<hr>
<p><em>The code for generating the R/ggplot2 data visualizations is available in this <a href="http://minimaxir.com/notebooks/char-tsne/">R Notebook</a>, and open-sourced in <a href="https://github.com/minimaxir/char-tsne-visualization">this GitHub Repository.</a></em></p>
<p><em>You are free to use the automatic text generation scripts and data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Predicting And Mapping Arrest Types in San Francisco with LightGBM, R, ggplot2</title>
      <link>https://minimaxir.com/2017/02/predicting-arrests/</link>
      <pubDate>Wed, 08 Feb 2017 06:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/02/predicting-arrests/</guid>
      <description>Given that a SF police arrest occurs at a specified time and place, what is the reason for that arrest?</description>
      <content:encoded><![CDATA[<p>The new hotness in the world of data science is <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">neural networks</a>, which form the basis of <a href="https://en.wikipedia.org/wiki/Deep_learning">deep learning</a>. But while everyone is obsessing about neural networks and how deep learning is <em>magic</em> and can solve <em>any</em> problem if you just <a href="https://www.reddit.com/r/ProgrammerHumor/comments/5si1f0/machine_learning_approaches/">stack enough layers</a>, there have been many recent developments in the relatively nonmagical world of machine learning with <em>boring</em> CPUs.</p>
<p>Years before neural networks were the Swiss army knife of data science, there were <a href="https://en.wikipedia.org/wiki/Gradient_boosting">gradient-boosted machines</a>/<a href="https://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting">gradient-boosted trees</a>. GBMs/GBTs are machine learning methods which are effective on many types of data, and do not require the <a href="http://r-statistics.co/Assumptions-of-Linear-Regression.html">traditional model assumptions</a> of linear/logistic regression models. Wikipedia has a good article on the advantages of <a href="https://en.wikipedia.org/wiki/Decision_tree_learning">decision tree learning</a>, and visual diagrams of the architecture:</p>
<figure>

    <img loading="lazy" srcset="/2017/02/predicting-arrests/CART_tree_titanic_survivors_hu_8938562921d14250.webp 320w,/2017/02/predicting-arrests/CART_tree_titanic_survivors.png 360w" src="CART_tree_titanic_survivors.png"/> 
</figure>

<p>GBMs, as <a href="http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html">implemented</a> in the Python package <a href="http://scikit-learn.org/stable/">scikit-learn</a>, are extremely popular in <a href="https://www.kaggle.com">Kaggle</a> machine learning competitions. But scikit-learn is relatively old, and new technologies have emerged which implement GBMs/GBTs on large datasets with massive parallelization and in-memory computation. A popular big data machine learning library, <a href="http://www.h2o.ai">H2O</a>, has a <a href="http://docs.h2o.ai/h2o-tutorials/latest-stable/tutorials/gbm-randomforest/index.html">famous GBM implementation</a> which, <a href="https://github.com/szilard/benchm-ml">per benchmarks</a>, is over 10x faster than scikit-learn and is optimized for datasets with millions of records. But even <em>faster</em> than H2O is <a href="https://github.com/dmlc/xgboost">xgboost</a>, which can hit 5x-10x speedups relative to H2O, depending on the dataset size.</p>
<p>Enter <a href="https://github.com/Microsoft/LightGBM">LightGBM</a>, a new (October 2016) open-source machine learning framework by <a href="https://www.microsoft.com/en-us/">Microsoft</a> which, per <a href="https://github.com/Microsoft/LightGBM/issues/211">benchmarks</a> on release, was up to <em>4x faster</em> than xgboost! (xgboost very recently implemented a <a href="https://github.com/dmlc/xgboost/issues/1950">technique</a> also used in LightGBM, which reduced the relative speedup to just ~2x). As a result, LightGBM allows for very efficient model building on large datasets without requiring cloud computing or nVidia CUDA GPUs.</p>
<p>A year ago, I <a href="http://minimaxir.com/2015/12/sf-arrests/">wrote an analysis</a> of the types of police arrests in San Francisco, using data from the <a href="https://data.sfgov.org">SF OpenData</a> initiative, with a <a href="http://minimaxir.com/2015/12/sf-arrest-maps/">followup article</a> analyzing the locations of these arrests. Months later, the same source dataset was used <a href="https://www.kaggle.com/c/sf-crime">for a Kaggle competition</a>. Why not give the dataset another look and test LightGBM out?</p>
<h2 id="playing-with-the-data">Playing With The Data</h2>
<p><em>(You can view the R code used to process the data and generate the data visualizations in <a href="http://minimaxir.com/notebooks/predicting-arrests/">this R Notebook</a>)</em></p>
<p>The <a href="https://data.sfgov.org/Public-Safety/SFPD-Incidents-from-1-January-2003/tmnf-yvry">SFPD Incidents</a> dataset includes crime incidents in San Francisco from 1/1/2003 to 1/17/2017 (at time of analysis). Filtering the dataset to only the incidents which resulted in arrests (since most incidents are trivial) leaves a dataset of 634,299 arrests total. The dataset also includes information on the type of crime, the location where the arrest occurred, and the date/time. There are 39 different types of arrests in the <strong>Category</strong> column, such as Assault, Burglary, and Prostitution, which serves as the response variable.</p>
<figure>

    <img loading="lazy" srcset="/2017/02/predicting-arrests/data_hu_5459250938b0d49e.webp 320w,/2017/02/predicting-arrests/data_hu_de0cf1293adcafa9.webp 768w,/2017/02/predicting-arrests/data_hu_936b9e66b6422169.webp 1024w,/2017/02/predicting-arrests/data.png 1612w" src="data.png"/> 
</figure>

<p>Meanwhile, we can engineer features from the location and date/time.
Performing an exploratory data analysis of both is helpful to determine at a glance which features may be relevant (fortunately, I did that a year ago).</p>
<p>The location is given as latitude/longitude coordinates, so we can select a longitude <strong>X</strong> and latitude <strong>Y</strong> as features. Date/time can be deconstructed further. We can extract the <strong>hour</strong> in which a given arrest occurred as a feature (hour can take 24 different values, from 0 to 23). Likewise, we can extract the <strong>month</strong> in a similar manner (12 values, from 1 to 12). The <strong>year</strong> the crime occurred (2003 to 2017) can be extracted without special encoding; it is always helpful to include a year feature in predictive models to help account for change over time. The <strong>DayOfWeek</strong> is important, but encoding it as a numeric value is trickier: we logically encode each day of the week from 1 to 7, but which day should be #1? Making Monday #1 and Sunday #7 is the most logical, since a decision tree rule that sets a threshold on DayOfWeek values &gt; 5 will translate logically to a weekend.</p>
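<p>Sketched in Python/pandas for illustration (the actual processing is in the linked R Notebook; the file and column names here are assumptions based on the dataset description):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import pandas as pd

df = pd.read_csv('sfpd_incidents.csv')
# keep only incidents that resulted in an arrest (column name assumed)
df = df[df['Resolution'].str.contains('ARREST', na=False)]

dates = pd.to_datetime(df['Date'])
df['hour'] = pd.to_datetime(df['Time'], format='%H:%M').dt.hour   # 0 to 23
df['month'] = dates.dt.month                                      # 1 to 12
df['year'] = dates.dt.year                                        # 2003 to 2017
df['DayOfWeek'] = dates.dt.dayofweek + 1        # Monday = 1, ..., Sunday = 7
# X (longitude) and Y (latitude) are already numeric columns in the data
</code></pre></div>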
<figure>

    <img loading="lazy" srcset="/2017/02/predicting-arrests/predict_matrix_hu_74733e46c443937b.webp 320w,/2017/02/predicting-arrests/predict_matrix.png 716w" src="predict_matrix.png"/> 
</figure>

<p>That&rsquo;s six features total. There are more features which could be helpful, but let&rsquo;s check a baseline model as a start.</p>
<h2 id="modeling">Modeling</h2>
<p>Specifically, the model will answer the question: <em>given that a San Francisco police arrest occurs at a specified time and place, what is the reason for that arrest?</em></p>
<p>For this post, I will use the <a href="https://github.com/Microsoft/LightGBM/tree/master/R-package">R package</a> for LightGBM (which was beta-released in January 2017; it&rsquo;s <em>extremely</em> cutting edge!). We split the dataset 70%/30% into a training set of 444,011 arrests and a test set of 190,288 arrests (due to the large number of different category labels, the split must be <a href="https://en.wikipedia.org/wiki/Stratified_sampling">stratified</a> to ensure the training and test sets have a balanced distribution of labels; in R, this can be implemented with the <code>caret</code> package and <code>createDataPartition</code>).</p>
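<p>The equivalent split, sketched with scikit-learn for illustration (the post itself uses <code>caret::createDataPartition</code> in R):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from sklearn.model_selection import train_test_split

# integer-encode the 39 Category labels as 0..38 before splitting
df['label'] = df['Category'].astype('category').cat.codes

train_df, test_df = train_test_split(
    df, test_size=0.3, stratify=df['Category'], random_state=123)
</code></pre></div>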
<p>LightGBM trains the model on the training set and evaluates it on the test set to minimize the <a href="https://www.kaggle.com/c/sf-crime#evaluation">multiclass logarithmic loss</a> of the model. For now, I use the <a href="https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.md">default parameters</a> of LightGBM, except to massively increase the number of iterations of the training algorithm, and to stop training the model early if the model stops improving. After about 4 minutes on my laptop (which is very fast for a dataset of this size!), the model returns a multilogloss of <strong>1.98</strong>.</p>
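<p>A rough sketch of this training setup using LightGBM&rsquo;s Python API (again, the post uses the R package; the parameter values here are illustrative, not the exact ones used):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import lightgbm as lgb

features = ['X', 'Y', 'hour', 'month', 'year', 'DayOfWeek']
dtrain = lgb.Dataset(train_df[features], label=train_df['label'])
dvalid = lgb.Dataset(test_df[features], label=test_df['label'],
                     reference=dtrain)

params = {
    'objective': 'multiclass',
    'num_class': 39,
    'metric': 'multi_logloss',   # the metric being minimized
}

# many boosting rounds, stopping early once the validation loss plateaus
model = lgb.train(params, dtrain, num_boost_round=5000,
                  valid_sets=[dvalid],
                  callbacks=[lgb.early_stopping(stopping_rounds=50)])
</code></pre></div>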
<p>That number sounds arbitrary. Is it good or bad? Let&rsquo;s compare it to the multilogloss from the <a href="https://www.kaggle.com/c/sf-crime/leaderboard">top models</a> from the Kaggle version of the dataset, where a lower score is better:</p>
<figure>

    <img loading="lazy" srcset="/2017/02/predicting-arrests/kaggle_hu_2434549f8c0f370a.webp 320w,/2017/02/predicting-arrests/kaggle.png 521w" src="kaggle.png"/> 
</figure>

<p>&hellip;okay, 1.98 <em>is</em> a good score, and without spending much time adding features to the model and <a href="https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters-tuning.md">tuning parameters</a>! To be fair, my methodology would not necessarily result in the same score on the Kaggle dataset, but it confirms that the LightGBM model is in the top tier of models available for this problem and dataset context. And it didn&rsquo;t <a href="https://www.kaggle.com/smerity/sf-crime/fighting-crime-with-keras/output">require any neural networks</a> either!</p>
<p>There are areas for improvement in feature engineering which <a href="https://www.kaggle.com/c/sf-crime/kernels">other entries</a> in the Kaggle competition implemented, such as a <a href="https://en.wikipedia.org/wiki/Dummy_variable_%28statistics%29">dummy variable</a> indicating whether the offense occurred at an intersection and which SF police station was involved in the arrest. We could also encode features such as hour and DayOfWeek as categorical features (LightGBM conveniently allows this without requiring <a href="https://en.wikipedia.org/wiki/One-hot">one-hot encoding</a> the features) instead of numeric, but in my brief testing, it made the model <em>worse</em>, interestingly.</p>
<h2 id="analyzing-the-lightgbm-model">Analyzing the LightGBM Model</h2>
<p>Another perk of not using a neural network for statistical model building is the ability to learn more about the importance of features in a model, as opposed to it being a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a>. In the case of gradient boosting, we can calculate the proportional contribution of each feature to the total <a href="https://en.wikipedia.org/wiki/Information_gain_in_decision_trees">information gain</a> of the model, which will help identify the most important features, and potentially unhelpful features.</p>
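<p>With the Python API, computing those proportions might look like this (a sketch; the post computes and plots them in R):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import pandas as pd

# per-feature contribution to total information gain, as a proportion
gain = pd.Series(model.feature_importance(importance_type='gain'),
                 index=features)
print((gain / gain.sum()).sort_values(ascending=False))
</code></pre></div>
<p>The resulting importances:</p>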
<figure>

    <img loading="lazy" srcset="/2017/02/predicting-arrests/imp_hu_9d9e5109c1b26d0f.webp 320w,/2017/02/predicting-arrests/imp_hu_727e31f0d7218a03.webp 768w,/2017/02/predicting-arrests/imp_hu_19e14c37ffdd5923.webp 1024w,/2017/02/predicting-arrests/imp.png 1200w" src="imp.png"/> 
</figure>

<p>Unsurprisingly, the location features are the most important, with the two location-based features accounting for 70% of the total Gain in the model. But no feature is completely insignificant, which is a good thing.</p>
<p>Back to the multilogloss of 1.98. What does that mean in the real world? What is the <em>accuracy</em> of the model? We run each of the 190,288 arrests in the test set against the model, which returns 39 probability values for each record: one for each possible category of arrest. The category with the highest probability becomes the <strong>predicted</strong> type of arrest.</p>
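<p>In code, the prediction step reduces to an argmax over each row of probabilities (a Python sketch):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">probs = model.predict(test_df[features])   # one row of 39 probabilities each
pred = probs.argmax(axis=1)                # highest-probability category wins
accuracy = (pred == test_df['label'].values).mean()
</code></pre></div>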
<figure>

    <img loading="lazy" srcset="/2017/02/predicting-arrests/predicted_results_hu_ef9e6b4cb94c80ab.webp 320w,/2017/02/predicting-arrests/predicted_results.png 676w" src="predicted_results.png"/> 
</figure>

<p>The accuracy of the model on the test set, which is the proportion of predictions where the predicted category value matches the <strong>actual</strong> category value, is <strong>39.7%</strong>, with a 95% confidence interval for the true accuracy between 39.5% and 39.9%. That seems low! However, there is a catch-all &ldquo;Other Offenses&rdquo; category for an arrest; if you predicted an &ldquo;Other Offenses&rdquo; label for all the test-set values, you would get an accuracy of <em>31.1%</em>, which serves as the No Information Rate (since it would be the highest-accuracy approach if there were no information at all). An 8.6 percentage point improvement is still an improvement, though; many industries would <em>love</em> an 8.6 percentage point increase in accuracy, but in this context it&rsquo;s obviously not enough to usher in a <a href="https://en.wikipedia.org/wiki/Minority_Report_%28film%29">Minority Report</a>/<a href="https://en.wikipedia.org/wiki/Person_of_Interest_%28TV_series%29">Person of Interest</a> future.</p>
<p>We can visualize the model&rsquo;s classifications on the test set using a <a href="https://en.wikipedia.org/wiki/Confusion_matrix">confusion matrix</a>; <code>caret</code> has a simple <code>confusionMatrix()</code> function, and ggplot2 has a <code>geom_tile()</code> to map out the relationships, even with 39 classes. We can also annotate the tiles where actual label = predicted label by drawing a <code>geom_point()</code> on top.</p>
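<p>A Python sketch of the underlying computation (the post&rsquo;s version uses <code>caret</code> and ggplot2, as described above):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(test_df['label'], pred)   # 39 x 39 count matrix
plt.imshow(cm, cmap='Blues')
plt.xlabel('predicted label')
plt.ylabel('actual label')
plt.savefig('confusion-matrix.png')
</code></pre></div>
<p>Putting it all together:</p>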
<figure>

    <img loading="lazy" srcset="/2017/02/predicting-arrests/confusionMatrix_hu_e3738e2da392e9b7.webp 320w,/2017/02/predicting-arrests/confusionMatrix_hu_b9fe669d84787a0b.webp 768w,/2017/02/predicting-arrests/confusionMatrix_hu_d0ad2b163b57b55c.webp 1024w,/2017/02/predicting-arrests/confusionMatrix.png 1500w" src="confusionMatrix.png"/> 
</figure>

<p>There is, indeed, a large amount of confusion. Many of the labels are mispredicted as Other Offenses. Specifically, the model frequently confuses the combinations of Assault, Drug/Narcotics, Larceny/Theft, and Warrants, suggesting that they also may be catch-alls.</p>
<p>In theory, the model&rsquo;s predicted probabilities for similar types of crime should also be similar, which may be causing these mispredictions. We can calculate the <a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient">Pearson correlations</a> between the predicted probabilities, and use <a href="https://en.wikipedia.org/wiki/Hierarchical_clustering">hierarchical clustering</a> to <a href="http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization">arrange and plot the correlations</a> and their labels in a logical order. The majority of the correlations between labels are between 0 and +/- 0.5 (weak to moderate), but their arrangement tells a different story.</p>
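<p>A sketch of the correlation-plus-clustering step in Python (the post does this in R; the linkage method here is an assumption):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

corr = np.corrcoef(probs.T)      # Pearson correlations, 39 x 39
dist = 1 - corr                  # turn similarity into a distance
np.fill_diagonal(dist, 0)
order = leaves_list(linkage(squareform(dist, checks=False),
                            method='average'))
corr_ordered = corr[np.ix_(order, order)]
</code></pre></div>
<p>The reordered correlation matrix:</p>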
<figure>

    <img loading="lazy" srcset="/2017/02/predicting-arrests/correlationMatrix_hu_2d7aa28c0716951c.webp 320w,/2017/02/predicting-arrests/correlationMatrix_hu_293fd514eec51149.webp 768w,/2017/02/predicting-arrests/correlationMatrix_hu_6340e59358c00dc2.webp 1024w,/2017/02/predicting-arrests/correlationMatrix.png 1500w" src="correlationMatrix.png"/> 
</figure>

<p>From top to bottom, you can see that there is a grouping of more blue-collar, physical crimes types (Assault, Vandalism), then a grouping of less-physical, white-collar crime types (Bribery, Extortion), and then a smaller grouping of seedier crime types (Liquor Laws, Prostitution).</p>
<p>The visualization doesn&rsquo;t necessarily provide more information about the confusion matrix and the mispredictions, but <em>it looks cool</em>, which is enough.</p>
<h2 id="mapping-the-predicted-types-of-arrests">Mapping the Predicted Types of Arrests</h2>
<p>Kaggle competitions emphasize model creation, but don&rsquo;t discuss how to implement and execute models in practice. Since we can predict the type of crime given the location and date/time of an arrest, we can map the boundaries of the most likely type of offense: use <code>ggmap</code> to get a map of San Francisco, split the city into tens of thousands of points, and predict the most-likely type of arrest at each point for a given date/time.</p>
<p>Let&rsquo;s say we want to predict the types of crime in the future, on April 15th, 2017, at 8 PM. We construct a dataset of those points with the same date/time features used to generate the model originally, then run those fabricated points through the model to get new predicted labels. (Additionally, we need to remove &ldquo;Other Offenses&rdquo; predicted labels, since they cloud up the map.) Plotting each point as a <code>geom_tile</code> will interpolate regions around the city.</p>
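<p>Constructing that grid of fabricated points might look like this in Python (a sketch; the coordinate bounds are rough assumptions, and the post builds the grid in R):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np
import pandas as pd

xs = np.linspace(-122.52, -122.35, 150)   # longitude bounds (approximate)
ys = np.linspace(37.70, 37.82, 150)       # latitude bounds (approximate)
grid = pd.DataFrame([(x, y) for x in xs for y in ys], columns=['X', 'Y'])
grid['hour'] = 20                          # 8 PM
grid['month'] = 4
grid['year'] = 2017
grid['DayOfWeek'] = 6                      # April 15, 2017 is a Saturday
grid['pred'] = model.predict(grid[features]).argmax(axis=1)
# drop rows predicted as the catch-all label before mapping, per the text
</code></pre></div>
<p>Putting it all together:</p>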
<figure>

    <img loading="lazy" srcset="/2017/02/predicting-arrests/crime-2017-04-15-20_hu_a7c72827828f3edb.webp 320w,/2017/02/predicting-arrests/crime-2017-04-15-20_hu_67886ddcfeeb3351.webp 768w,/2017/02/predicting-arrests/crime-2017-04-15-20.png 900w" src="crime-2017-04-15-20.png"/> 
</figure>

<p>Not too shabby. But that&rsquo;s not all; we can <em>animate</em> this map over a day by incrementing the hour, generating a map for each hour (while keeping the colors corresponding to the arrest type consistent), and then <a href="https://github.com/minimaxir/frames-to-gif-osx">stitching the maps together</a> into a GIF. Let&rsquo;s do March 14th, 2017 (<a href="https://en.wikipedia.org/wiki/Pi_Day">Pi Day</a> can be dangerous!) starting at 6 AM:</p>
<figure>

    <img loading="lazy" srcset="/2017/02/predicting-arrests/map_ani_hu_ba104cef91fd9f19.webp 320w,/2017/02/predicting-arrests/map_ani.gif 700w" src="map_ani.gif"/> 
</figure>

<p>Wow!</p>
<h2 id="conclusion">Conclusion</h2>
<p>I deliberately avoided using the term &ldquo;machine learning&rdquo; in the headline of this post because it has been overused to the point of clickbait. Indeed, neural networks/deep learning excel at processing higher-dimensional data such as text, image, and voice data, but in cases where dataset features are <a href="https://news.ycombinator.com/item?id=13563892">simple and known</a>, neural networks are not necessarily the most <em>pragmatic</em> option. CPU/RAM machine learning libraries like LightGBM are still worthwhile, despite the religious fervor for deep learning.</p>
<p>And there&rsquo;s still a lot of work that can be done with the SF Crime Incidents dataset. The model only predicts the type of crime given an arrest occurred; it does not predict <em>if</em> an arrest will occur at a given time and place, which would make a fun project for the future!</p>
<hr>
<p><em>You can view all the R and ggplot2 code used to visualize the San Francisco crime data in <a href="http://minimaxir.com/notebooks/predicting-arrests/">this R Notebook</a>. You can also view the images/data used for this post in <a href="https://github.com/minimaxir/sf-arrests-predict">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
