<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Statistical Analysis on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/tag/statistical-analysis/</link>
    <description>Recent content in Statistical Analysis on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Mon, 30 Jun 2025 10:00:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/tag/statistical-analysis/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Predicting Average IMDb Movie Ratings Using Text Embeddings of Movie Metadata</title>
      <link>https://minimaxir.com/2025/06/movie-embeddings/</link>
      <pubDate>Mon, 30 Jun 2025 10:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2025/06/movie-embeddings/</guid>
      <description>Don&amp;rsquo;t try this in your data science interviews.</description>
      <content:encoded><![CDATA[<p>Months ago, I saw a post titled &ldquo;<a href="https://www.reddit.com/r/datascience/comments/1eykil7/rejected_from_ds_role_with_no_feedback/">Rejected from DS Role with no feedback</a>&rdquo; on Reddit&rsquo;s <a href="https://www.reddit.com/r/datascience/">Data Science subreddit</a>, in which a prospective job candidate for a data science position provided a <a href="https://colab.research.google.com/drive/1Ud2tXW2IAw_dXA5DONvNpPmmlL1foSwK">Colab Notebook</a> documenting their submission for a take-home assignment and asking for feedback as to why they were rejected. Per the Reddit user, the assignment was:</p>
<blockquote>
<p>Use the publicly available <a href="https://developer.imdb.com/non-commercial-datasets/">IMDB Datasets</a> to build a model that predicts a movie&rsquo;s average rating. Please document your approach and present your results in the notebook. Make sure your code is well-organized so that we can follow your modeling process.</p>
</blockquote>
<p><a href="https://www.imdb.com/">IMDb</a>, the Internet Movie Database owned by Amazon, allows users to rate movies on a scale from 1 to 10, wherein the average rating is then displayed prominently on the movie&rsquo;s page:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/shawshank_hu_fe8025c2c6a0fa89.webp 320w,/2025/06/movie-embeddings/shawshank_hu_f0b2bc74865ccb73.webp 768w,/2025/06/movie-embeddings/shawshank_hu_8f544060412f7f54.webp 1024w,/2025/06/movie-embeddings/shawshank.webp 1082w" src="shawshank.webp"
         alt="The Shawshank Redemption is currently the highest-rated movie on IMDb with an average rating of 9.3 derived from 3.1 million user votes."/> <figcaption>
            <p><a href="https://www.imdb.com/title/tt0111161/?ref_=sr_t_1">The Shawshank Redemption</a> is currently the <a href="https://www.imdb.com/search/title/?groups=top_100&amp;sort=user_rating,desc">highest-rated movie on IMDb</a> with an average rating of 9.3 derived from 3.1 million user votes.</p>
        </figcaption>
</figure>

<p>In their notebook, the Redditor identifies a few intuitive features for such a model, including the year in which the movie was released, the genre(s) of the movies, and the actors/directors of the movie. However, the model they built is a <a href="https://www.tensorflow.org/">TensorFlow</a> and <a href="https://keras.io/">Keras</a>-based neural network, with all the bells-and-whistles such as <a href="https://en.wikipedia.org/wiki/Batch_normalization">batch normalization</a> and <a href="https://en.wikipedia.org/wiki/Dilution_%28neural_networks%29">dropout</a>. The immediate response by other data scientists on /r/datascience was, at its most polite, &ldquo;why did you use a neural network when it&rsquo;s a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a> that you can&rsquo;t explain?&rdquo;</p>
<p>Reading those replies made me nostalgic. Way back in 2017, before my first job as a data scientist, neural networks using frameworks such as TensorFlow and Keras were all the rage for their ability to &ldquo;<a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem">solve any problem</a>&rdquo; but were often seen as lazy and unskilled compared to traditional statistical modeling such as ordinary least squares linear regression or even gradient boosted trees. Although it&rsquo;s funny to see that the data science community&rsquo;s perception of neural networks hasn&rsquo;t changed since then, nowadays the black-box nature of neural networks can be an acceptable business tradeoff if the prediction results are of higher quality and interpretability is not required.</p>
<p>Looking back at the assignment description, the objective is only &ldquo;predict a movie&rsquo;s average rating.&rdquo; For data science interview take-homes, this is unusual: those assignments typically have an extra instruction along the lines of &ldquo;explain your model and what decisions stakeholders should make as a result of it&rdquo;, which is a strong hint that you need to use an explainable model like linear regression to obtain feature coefficients, or even a middle-ground like gradient boosted trees and its <a href="https://stats.stackexchange.com/questions/332960/what-is-variable-importance">variable importance</a> to quantify relative feature contribution to the model. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> In the absence of that particular constraint, it&rsquo;s arguable that anything goes, including neural networks.</p>
<p>The quality of neural networks has improved significantly since 2017, even more so due to the massive rise of LLMs. Why not try feeding an LLM all the raw metadata for a movie, encoding it into a text embedding, and building a statistical model on top of that? Would a neural network do better than a traditional statistical model in that instance? Let&rsquo;s find out!</p>
<h2 id="about-imdb-data">About IMDb Data</h2>
<p>The <a href="https://developer.imdb.com/non-commercial-datasets/">IMDb Non-Commercial Datasets</a> are famous sets of data that have been around for nearly a decade <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> but are still updated daily. Back in 2018 as a budding data scientist, I performed a <a href="https://minimaxir.com/2018/07/imdb-data-analysis/">fun exploratory data analysis</a> using these datasets, although the results aren&rsquo;t too surprising.</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/imdb-4_hu_fdf90cbdd2dd2c7e.webp 320w,/2025/06/movie-embeddings/imdb-4_hu_1c45abe215427c09.webp 768w,/2025/06/movie-embeddings/imdb-4_hu_62d0feb034e8b054.webp 1024w,/2025/06/movie-embeddings/imdb-4.png 1200w" src="imdb-4.png"
         alt="The average rating for a movie is around 6 and tends to skew higher: a common trend in internet rating systems."/> <figcaption>
            <p>The average rating for a movie is around 6 and tends to skew higher: a common trend in internet rating systems.</p>
        </figcaption>
</figure>

<p>But in truth, these datasets are a terrible idea for companies to use for a take-home assignment. Although the datasets are released under a non-commercial license, IMDb doesn&rsquo;t want to give too much information to their competitors, which results in a severely limited set of features that could be used to build a good predictive model. Here are the common movie-performance-related features present in the <code>title.basics.tsv.gz</code> file:</p>
<ul>
<li><strong>tconst</strong>: unique identifier of the title</li>
<li><strong>titleType</strong>: the type/format of the title (e.g. movie, tvMovie, short, tvSeries, etc.)</li>
<li><strong>primaryTitle</strong>: the more popular title / the title used by the filmmakers on promotional materials at the point of release</li>
<li><strong>isAdult</strong>: 0: non-adult title; 1: adult title</li>
<li><strong>startYear</strong>: represents the release year of a title.</li>
<li><strong>runtimeMinutes</strong>: primary runtime of the title, in minutes</li>
<li><strong>genres</strong>: includes up to three genres associated with the title</li>
</ul>
<p>This is a sensible schema for describing a movie, although it lacks some important information that would be very useful to determine movie quality such as production company, summary blurbs, granular genres/tags, and plot/setting — all of which are available on the IMDb movie page itself and presumably accessible through the <a href="https://developer.imdb.com/documentation/api-documentation/?ref_=/documentation/_PAGE_BODY">paid API</a>. Of note, since the assignment explicitly asks for a <em>movie</em>&rsquo;s average rating, we need to filter the data to only <code>movie</code> and <code>tvMovie</code> entries, which the original assignment failed to do.</p>
<p>The ratings data in <code>title.ratings.tsv.gz</code> is what you&rsquo;d expect:</p>
<ul>
<li><strong>tconst</strong>: unique identifier of the title (which can therefore be mapped to movie metadata using a JOIN)</li>
<li><strong>averageRating</strong>: average of all the individual user ratings</li>
<li><strong>numVotes</strong>: number of votes the title has received</li>
</ul>
<p>In order to ensure that the average ratings for modeling are stable and indicative of user sentiment, I will only analyze movies that have <em>at least 30 user votes</em>: as of May 10th, 2025, that&rsquo;s about 242k movies total. Additionally, I will not use <code>numVotes</code> as a model feature, since that metric reflects extrinsic movie popularity rather than the movie itself.</p>
<p>The last major dataset is <code>title.principals.tsv.gz</code>, which has very helpful information on metadata such as the roles people play in the production of a movie:</p>
<ul>
<li><strong>tconst</strong>: unique identifier of the title (which can be mapped to movie data using a JOIN)</li>
<li><strong>nconst</strong>: unique identifier of the principal (this is mapped to <code>name.basics.tsv.gz</code> to get the principal&rsquo;s <code>primaryName</code>, but nothing else useful)</li>
<li><strong>category</strong>: the role the principal served in the title, such as <code>actor</code>, <code>actress</code>, <code>writer</code>, <code>producer</code>, etc.</li>
<li><strong>ordering</strong>: the ordering of the principals within the title, which correlates to the order the principals appear on IMDb&rsquo;s movie cast pages.</li>
</ul>
<p>Additionally, because the datasets are so popular, this is far from the first time someone has built an IMDb ratings predictor, and prior attempts are easy to find with a Google search.</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/google_hu_b09e979836a71049.webp 320w,/2025/06/movie-embeddings/google_hu_c652438955f310d8.webp 768w,/2025/06/movie-embeddings/google.webp 1000w" src="google.webp"/> 
</figure>

<p>Instead of using the official IMDb datasets, these analyses are based on the smaller <a href="https://www.kaggle.com/datasets/carolzhangdc/imdb-5000-movie-dataset/data">IMDB 5000 Movie Dataset</a> hosted on Kaggle, which adds metadata such as movie rating, budget, and further actor metadata that make building a model much easier (albeit &ldquo;number of likes on the lead actor&rsquo;s Facebook page&rdquo; is <em>very</em> extrinsic to movie quality). Using the official datasets with much less metadata means building the models on hard mode, and they will likely have lower predictive performance.</p>
<p>Although IMDb data is very popular and very well documented, that doesn&rsquo;t mean it&rsquo;s easy to work with.</p>
<h2 id="the-initial-assignment-and-feature-engineering">The Initial Assignment and &ldquo;Feature Engineering&rdquo;</h2>
<p>Data science take-home assignments are typically 1/2 <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">exploratory data analysis</a> for identifying impactful dataset features, and 1/2 building, iterating on, and explaining the model. For real-world datasets, these are all difficult problems with many possible solutions, and the goal from the employer&rsquo;s perspective is to see <em>how</em> these problems are solved rather than the actual quantitative results.</p>
<p>The initial Reddit post engineered some expected features using <a href="https://pandas.pydata.org/">pandas</a>, such as <code>is_sequel</code> (checking whether a number other than <code>1</code> appears at the end of a movie title) and <a href="https://en.wikipedia.org/wiki/One-hot">one-hot encoding</a> each distinct <code>genre</code> of a movie. These are fine for an initial approach, although sequel titles can be idiosyncratic, which suggests that a more <a href="https://www.ibm.com/think/topics/natural-language-processing">NLP</a>-driven approach to identifying sequels and other related media may be useful.</p>
<p>The main trick with this assignment is how to handle the principals. The common data science approach would be a sparse binary encoding of the actors/directors/writers, e.g. a vector where actors present in the movie are <code>1</code> and every other actor is <code>0</code>; there are many ways to encode this data performantly, such as scikit-learn&rsquo;s <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html">MultiLabelBinarizer</a>. The problem with this approach is the <em>very</em> <a href="https://docs.honeycomb.io/get-started/basics/observability/concepts/high-cardinality/">high cardinality</a> of the actor feature — there are more unique actors than data points themselves — which causes <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a> issues. Workarounds such as encoding only the top <em>N</em> actors leave the feature uninformative, since even a generous <em>N</em> will fail to capture the majority of actors.</p>
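<p>For illustration, here is a minimal version of that sparse encoding on toy cast lists; with the real dataset&rsquo;s ~624k unique actors, this matrix would have ~624k mostly-zero columns per movie:</p>

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy cast lists standing in for the real principals data.
casts = [
    ["Mark Hamill", "Harrison Ford", "Carrie Fisher"],
    ["Harrison Ford", "Sean Connery"],
    ["Toshiro Mifune"],
]

mlb = MultiLabelBinarizer()
# One column per unique actor; 1 if the actor appears in that movie.
X = mlb.fit_transform(casts)  # shape: (3 movies, 5 unique actors)
```

<p>Note that the encoding also discards the order of actors within each list, which is the second problem discussed below.</p>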
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/actor_cum_dist_hu_6b3839329e455b7d.webp 320w,/2025/06/movie-embeddings/actor_cum_dist_hu_b3985aca3321429a.webp 768w,/2025/06/movie-embeddings/actor_cum_dist_hu_27acda9c003abad5.webp 1024w,/2025/06/movie-embeddings/actor_cum_dist.png 1500w" src="actor_cum_dist.png"
         alt="There are actually 624k unique actors in this dataset (Jupyter Notebook), the chart just becomes hard to read at that point."/> <figcaption>
            <p>There are actually 624k unique actors in this dataset (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/actor_agg.ipynb">Jupyter Notebook</a>), the chart just becomes hard to read at that point.</p>
        </figcaption>
</figure>

<p>Additionally, most statistical modeling approaches cannot account for the <code>ordering</code> of actors as they treat each feature as independent, and since the billing order of actors is generally correlated with their importance in the movie, ignoring that order omits information relevant to the problem.</p>
<p>These constraints gave me an idea: why not use an LLM to encode <em>all</em> movie data, and build a model using the downstream embedding representation? LLMs have <a href="https://en.wikipedia.org/wiki/Attention_%28machine_learning%29">attention mechanisms</a>, which will not only respect the relative ordering of actors (to give higher predictive priority to higher-billed actors, along with actor cooccurrences), but also identify patterns within movie name texts (to identify sequels and related media semantically).</p>
<p>I started by aggregating and denormalizing all the data locally (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/imdb_polars_etl_test.ipynb">Jupyter Notebook</a>). Each of the IMDb datasets is hundreds of megabytes and hundreds of thousands of rows at minimum: not quite <a href="https://en.wikipedia.org/wiki/Big_data">big data</a>, but enough to be more cognizant of tooling, especially since computationally-intensive JOINs are required. Therefore, I used the <a href="https://pola.rs/">Polars</a> library in Python, which not only loads data super fast, but is also one of the <a href="https://duckdblabs.github.io/db-benchmark/">fastest libraries at performing JOINs</a> and other aggregation tasks. Polars&rsquo;s syntax also allows for some cool tricks: for example, I want to aggregate the principals (4.1 million rows after prefiltering) for each movie into nested lists of directors, writers, producers, actors, and all other principals, while simultaneously having them sorted by <code>ordering</code> as noted above. This is much easier to do in Polars than any other data processing library I&rsquo;ve used, and on millions of rows, this takes <em>less than a second</em>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df_principals_agg</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df_principals</span><span class="o">.</span><span class="n">sort</span><span class="p">([</span><span class="s2">&#34;tconst&#34;</span><span class="p">,</span> <span class="s2">&#34;ordering&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">group_by</span><span class="p">(</span><span class="s2">&#34;tconst&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">director_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&#34;director&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">writer_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&#34;writer&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">producer_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&#34;producer&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">actor_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">is_in</span><span class="p">([</span><span class="s2">&#34;actor&#34;</span><span class="p">,</span> <span class="s2">&#34;actress&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">principal_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="o">~</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">is_in</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="p">[</span><span class="s2">&#34;director&#34;</span><span class="p">,</span> <span class="s2">&#34;writer&#34;</span><span class="p">,</span> <span class="s2">&#34;producer&#34;</span><span class="p">,</span> <span class="s2">&#34;actor&#34;</span><span class="p">,</span> <span class="s2">&#34;actress&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">principal_roles</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="o">~</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">is_in</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="p">[</span><span class="s2">&#34;director&#34;</span><span class="p">,</span> <span class="s2">&#34;writer&#34;</span><span class="p">,</span> <span class="s2">&#34;producer&#34;</span><span class="p">,</span> <span class="s2">&#34;actor&#34;</span><span class="p">,</span> <span class="s2">&#34;actress&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><p>After some cleanup and field renaming, here&rsquo;s an example JSON document for <a href="https://www.imdb.com/title/tt0076759/">Star Wars: Episode IV - A New Hope</a>:</p>
<!-- prettier-ignore-start -->
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;title&#34;</span><span class="p">:</span> <span class="s2">&#34;Star Wars: Episode IV - A New Hope&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;genres&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Action&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Adventure&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Fantasy&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;is_adult&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;release_year&#34;</span><span class="p">:</span> <span class="mi">1977</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;runtime_minutes&#34;</span><span class="p">:</span> <span class="mi">121</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;directors&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;George Lucas&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;writers&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;George Lucas&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;producers&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Gary Kurtz&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Rick McCallum&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;actors&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Mark Hamill&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Harrison Ford&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Carrie Fisher&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Alec Guinness&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Peter Cushing&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Anthony Daniels&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Kenny Baker&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Peter Mayhew&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;David Prowse&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Phil Brown&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;principals&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;John Williams&#34;</span><span class="p">:</span> <span class="s2">&#34;composer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Gilbert Taylor&#34;</span><span class="p">:</span> <span class="s2">&#34;cinematographer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Richard Chew&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;T.M. Christopher&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Paul Hirsch&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Marcia Lucas&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Dianne Crittenden&#34;</span><span class="p">:</span> <span class="s2">&#34;casting_director&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Irene Lamb&#34;</span><span class="p">:</span> <span class="s2">&#34;casting_director&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Vic Ramos&#34;</span><span class="p">:</span> <span class="s2">&#34;casting_director&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;John Barry&#34;</span><span class="p">:</span> <span class="s2">&#34;production_designer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><!-- prettier-ignore-end -->
<p>I was tempted to claim that I used zero feature engineering, but that wouldn&rsquo;t be accurate. The selection and ordering of the JSON fields here is itself feature engineering: for example, <code>actors</code> and <code>principals</code> are intentionally last in this JSON encoding because they can have wildly varying lengths while the prior fields are more consistent, which should make downstream encodings more comparable and consistent.</p>
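<p>That ordering decision is straightforward to express when serializing, since Python dicts preserve insertion order. A trimmed, hypothetical version of the ETL output: fixed-length fields go first, variable-length lists last, and <code>indent=2</code> preserves the nested-array indentation shown above:</p>

```python
import json

# Trimmed example record; field order is deliberate.
movie = {
    "title": "Star Wars: Episode IV - A New Hope",
    "genres": ["Action", "Adventure", "Fantasy"],
    "is_adult": False,
    "release_year": 1977,
    "runtime_minutes": 121,
    "directors": ["George Lucas"],
    "actors": ["Mark Hamill", "Harrison Ford", "Carrie Fisher"],
}

# indent=2 keeps the indentation for nested arrays intact.
doc = json.dumps(movie, indent=2)
```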
<p>Now, let&rsquo;s discuss how to convert these JSON representations of movies into embeddings.</p>
<h2 id="creating-and-visualizing-the-movie-embeddings">Creating And Visualizing the Movie Embeddings</h2>
<p>LLMs that are trained to output text embeddings are not much different from LLMs like <a href="https://chatgpt.com/">ChatGPT</a> that just predict the next token in a loop. Models such as BERT and GPT can generate &ldquo;embeddings&rdquo; out-of-the-box by skipping the prediction heads and instead taking an encoded value from the last hidden state of the model (e.g. for BERT, the first positional vector of the hidden state representing the <code>[CLS]</code> token). However, dedicated text embedding models are optimized for the distinctiveness of a given input text document through <a href="https://lilianweng.github.io/posts/2021-05-31-contrastive/">contrastive learning</a>. These embeddings can be used for many things, from finding similar encoded inputs by computing the similarity between their embeddings, to, of course, building a statistical model on top of them.</p>
<p>Text embeddings that leverage LLMs are typically generated using a GPU in batches due to the increased amount of computation needed. Python libraries such as <a href="https://huggingface.co/">Hugging Face</a> <a href="https://huggingface.co/docs/transformers/en/index">transformers</a> and <a href="https://sbert.net/">sentence-transformers</a> can load these embedding models. For this experiment, I used the very new <a href="https://huggingface.co/Alibaba-NLP/gte-modernbert-base">Alibaba-NLP/gte-modernbert-base</a> text embedding model, finetuned from the <a href="https://huggingface.co/answerdotai/ModernBERT-base">ModernBERT model</a> specifically for the embedding use case, for two reasons: it uses the ModernBERT architecture which is <a href="https://huggingface.co/blog/modernbert">optimized for fast inference</a>, and the base ModernBERT model is trained to be more code-aware and should be able to understand JSON-nested input strings more robustly — that&rsquo;s also why I intentionally left in the indentation for nested JSON arrays, as it&rsquo;s semantically meaningful and <a href="https://huggingface.co/answerdotai/ModernBERT-base/blob/main/tokenizer_config.json">explicitly tokenized</a>. <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>The code (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/generate_imdb_embeddings.ipynb">Jupyter Notebook</a>) — with extra considerations to avoid running out of memory on either the CPU or GPU <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> — looks something like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">device</span> <span class="o">=</span> <span class="s2">&#34;cuda:0&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">dataloader</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">DataLoader</span><span class="p">(</span><span class="n">docs</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                         <span class="n">shuffle</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                         <span class="n">pin_memory</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                         <span class="n">pin_memory_device</span><span class="o">=</span><span class="n">device</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">dataset_embeddings</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">dataloader</span><span class="p">,</span> <span class="n">smoothing</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">tokenized_batch</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">batch</span><span class="p">,</span> <span class="n">max_length</span><span class="o">=</span><span class="mi">8192</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s2">&#34;pt&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">        <span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">tokenized_batch</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">embeddings</span> <span class="o">=</span> <span class="n">outputs</span><span class="o">.</span><span class="n">last_hidden_state</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">detach</span><span class="p">()</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">dataset_embeddings</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">dataset_embeddings</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="n">dataset_embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">dataset_embeddings</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">dataset_embeddings</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/featured_hu_be15fd7c96cd6da2.webp 320w,/2025/06/movie-embeddings/featured_hu_a1d4e8d783c0419.webp 768w,/2025/06/movie-embeddings/featured_hu_1aa1372a6affcdc5.webp 1024w,/2025/06/movie-embeddings/featured.webp 1318w" src="featured.webp"/> 
</figure>

<p>I used a Spot <a href="https://cloud.google.com/blog/products/compute/introducing-g2-vms-with-nvidia-l4-gpus">L4 GPU</a> on <a href="https://cloud.google.com/">Google Cloud Platform</a> at a pricing of $0.28/hour, and it took 21 minutes to encode all 242k movie embeddings: about $0.10 total, which is surprisingly efficient.</p>
<p>Each of these embeddings is a set of 768 numbers (768D). If the embeddings are unit normalized (the <code>F.normalize()</code> step), then calculating the dot product between embeddings returns the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> of those movies, which can then be used to identify the most similar movies. But &ldquo;similar&rdquo; is open-ended, as there are many dimensions along which a movie could be considered similar.</p>
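<p>The dot-product-equals-cosine-similarity trick can be sketched with NumPy, using toy random vectors in place of the real 242k &times; 768 embedding matrix:</p>

```python
import numpy as np

# Toy stand-ins for the real embeddings; each row is one movie.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(5, 768)).astype(np.float32)

# Unit-normalize each row (the F.normalize(..., p=2, dim=1) step).
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# With unit vectors, one matrix-vector product yields the cosine
# similarity between the query movie and every movie at once.
query = embeddings[0]
cossims = embeddings @ query

# Sort descending: the query itself ranks first with similarity ~1.0.
ranking = np.argsort(-cossims)
print(int(ranking[0]))  # → 0
```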
<p>Let&rsquo;s try a few movie similarity test cases where I calculate the cosine similarity between one query movie and <em>all</em> movies, then sort by cosine similarity to find the most similar (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/movie_embeddings_similarity.ipynb">Jupyter Notebook</a>). How about Peter Jackson&rsquo;s <a href="https://www.imdb.com/title/tt0120737/">Lord of the Rings: The Fellowship of the Ring</a>? Ideally, not only would it surface the two other movies of the original trilogy, but also its prequel Hobbit trilogy.</p>
<table>
  <thead>
      <tr>
          <th>title</th>
          <th>cossim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0120737/">The Lord of the Rings: The Fellowship of the Ring (2001)</a></td>
          <td>1.0</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0167261/">The Lord of the Rings: The Two Towers (2002)</a></td>
          <td>0.922</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0167260/">The Lord of the Rings: The Return of the King (2003)</a></td>
          <td>0.92</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt10127200/">National Geographic: Beyond the Movie - The Lord of the Rings: The Fellowship of the Ring (2001)</a></td>
          <td>0.915</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0301246/">A Passage to Middle-earth: The Making of &lsquo;Lord of the Rings&rsquo; (2001)</a></td>
          <td>0.915</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0299105/">Quest for the Ring (2001)</a></td>
          <td>0.906</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0077869/">The Lord of the Rings (1978)</a></td>
          <td>0.893</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2310332/">The Hobbit: The Battle of the Five Armies (2014)</a></td>
          <td>0.891</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1170358/">The Hobbit: The Desolation of Smaug (2013)</a></td>
          <td>0.883</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0903624/">The Hobbit: An Unexpected Journey (2012)</a></td>
          <td>0.883</td>
      </tr>
  </tbody>
</table>
<p>Indeed, it worked and surfaced both trilogies! The other movies listed are related to the original work, so their high similarity is fair.</p>
<p>Compare these results to the &ldquo;<a href="https://help.imdb.com/article/imdb/discover-watch/what-is-the-more-like-this-section/GPE7SPGZREKKY7YN">More like this</a>&rdquo; section on the IMDb page for the movie itself, which has the two sequels to the original Lord of the Rings and two other suggestions that I am not entirely sure are actually related.</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/lotr_related_hu_7560f67c8d88cb97.webp 320w,/2025/06/movie-embeddings/lotr_related_hu_544b4f2cf95b01dd.webp 768w,/2025/06/movie-embeddings/lotr_related_hu_8c4f2099751f082.webp 1024w,/2025/06/movie-embeddings/lotr_related.webp 1354w" src="lotr_related.webp"/> 
</figure>

<p>What about more elaborate franchises, such as the <a href="https://en.wikipedia.org/wiki/Marvel_Cinematic_Universe">Marvel Cinematic Universe</a>? If you asked for movies similar to <a href="https://www.imdb.com/title/tt4154796/">Avengers: Endgame</a>, would other MCU films be the most similar?</p>
<table>
  <thead>
      <tr>
          <th>title</th>
          <th>cossim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4154796/">Avengers: Endgame (2019)</a></td>
          <td>1.0</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4154756/">Avengers: Infinity War (2018)</a></td>
          <td>0.909</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0848228/">The Avengers (2012)</a></td>
          <td>0.896</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1217616/">Endgame (2009)</a></td>
          <td>0.894</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4154664/">Captain Marvel (2019)</a></td>
          <td>0.89</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2395427/">Avengers: Age of Ultron (2015)</a></td>
          <td>0.882</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt3498820/">Captain America: Civil War (2016)</a></td>
          <td>0.882</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0292502/">Endgame (2001)</a></td>
          <td>0.881</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0118661/">The Avengers (1998)</a></td>
          <td>0.877</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1228705/">Iron Man 2 (2010)</a></td>
          <td>0.876</td>
      </tr>
  </tbody>
</table>
<p>The answer is yes, which isn&rsquo;t a surprise since those movies share many principals. However, other movies named &ldquo;Endgame&rdquo; and &ldquo;The Avengers&rdquo; that are completely unrelated to Marvel also rank highly, which implies that the similarities may be fixating on the names.</p>
<p>What about movies of a smaller franchise but a specific domain, such as Disney&rsquo;s <a href="https://www.imdb.com/title/tt2294629/">Frozen</a> that only has one sequel? Would it surface other 3D animated movies by <a href="https://en.wikipedia.org/wiki/Walt_Disney_Animation_Studios">Walt Disney Animation Studios</a>, or something else?</p>
<table>
  <thead>
      <tr>
          <th>title</th>
          <th>cossim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2294629/">Frozen (2013)</a></td>
          <td>1.0</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4520988/">Frozen II (2019)</a></td>
          <td>0.93</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1323045/">Frozen (2010)</a></td>
          <td>0.92</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1611845/">Frozen (2010)</a> [a different one]</td>
          <td>0.917</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0125279/">Frozen (1996)</a></td>
          <td>0.909</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0376606/">Frozen (2005)</a></td>
          <td>0.9</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2363439/">The Frozen (2012)</a></td>
          <td>0.898</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4007494/">The Story of Frozen: Making a Disney Animated Classic (2014)</a></td>
          <td>0.894</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1071798/">Frozen (2007)</a></td>
          <td>0.889</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4150316/">Frozen in Time (2014)</a></td>
          <td>0.888</td>
      </tr>
  </tbody>
</table>
<p>&hellip;okay, it&rsquo;s definitely fixating on the name. Let&rsquo;s try a different approach to see if we can find more meaningful patterns in these embeddings.</p>
<p>In order to visualize the embeddings, we can project them to a lower dimensionality with a dimensionality reduction algorithm such as <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> or <a href="https://umap-learn.readthedocs.io/en/latest/">UMAP</a>: UMAP is preferred as it can simultaneously reorganize the data into more meaningful clusters. UMAP&rsquo;s <a href="https://umap-learn.readthedocs.io/en/latest/how_umap_works.html">construction of a neighborhood graph</a>, in theory, can allow the reduction to refine the similarities by leveraging many possible connections and hopefully avoid fixating on the movie name. However, with this amount of input data and the relatively high initial 768D vector size, the computation cost of UMAP is a concern, as both factors cause the UMAP training time to balloon. Fortunately, NVIDIA&rsquo;s <a href="https://github.com/rapidsai/cuml">cuML library</a> recently <a href="https://github.com/rapidsai/cuml/releases/tag/v25.04.00">updated</a> so that you can now run UMAP on very large amounts of data on a GPU, at a very high number of epochs to ensure the reduction fully converges, so I did just that (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/imdb_embeddings_umap_to_2D.ipynb">Jupyter Notebook</a>). What patterns can we find? Let&rsquo;s try plotting the reduced points, colored by their user rating.</p>
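<p>Since cuML&rsquo;s UMAP needs a GPU, here&rsquo;s a lightweight sketch of the dimensionality-reduction step using PCA via SVD, the linear alternative mentioned above, with a toy matrix standing in for the real embeddings:</p>

```python
import numpy as np

# Toy matrix standing in for the 242k x 768 embedding matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))

# PCA via SVD: center the data, decompose, then project onto the
# top-2 principal components. (The post uses cuML's GPU UMAP for the
# real reduction; PCA is the simpler linear option and needs no GPU.)
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords_2d = Xc @ Vt[:2].T  # 2D coordinates suitable for plotting

print(coords_2d.shape)  # → (200, 2)
```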
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/imdb_umap_rating_hu_4047e53667cc289a.webp 320w,/2025/06/movie-embeddings/imdb_umap_rating_hu_74d5c85f14c8950c.webp 768w,/2025/06/movie-embeddings/imdb_umap_rating_hu_2b6ccdbb5b4b9105.webp 1024w,/2025/06/movie-embeddings/imdb_umap_rating.webp 1200w" src="imdb_umap_rating.webp"/> 
</figure>

<p>So there are a few things going on here. Indeed, most of the points are high-rating green, as is evident in the source data. But the points and ratings aren&rsquo;t <em>random</em>, and there are trends. In the center giga cluster, there are soft subclusters of movies at high ratings and low ratings. Smaller discrete clusters did indeed form, but what is the deal with that extremely isolated cluster at the top? After investigation, that cluster only has movies released in 2008, so release year is another feature I should have considered when defining movie similarity.</p>
<p>As a sanity check, I faceted out the points by movie release year to better visualize where these clusters are forming:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/imdb_umap_rating_year_hu_40c4d6844e346f92.webp 320w,/2025/06/movie-embeddings/imdb_umap_rating_year_hu_48d37fbda72976cc.webp 768w,/2025/06/movie-embeddings/imdb_umap_rating_year_hu_27485860dc95d177.webp 1024w,/2025/06/movie-embeddings/imdb_umap_rating_year.webp 1200w" src="imdb_umap_rating_year.webp"/> 
</figure>

<p>This shows that even within clusters, the movies&rsquo; ratings are spread out, but I also unintentionally visualized how <a href="https://arize.com/docs/ax/machine-learning/computer-vision/how-to-cv/embedding-drift">embedding drift</a> changes over time. 2024 is also a bizarrely-clustered year: I have no idea why those two years (2008 and 2024) specifically are weird in movies.</p>
<p>The UMAP approach is more for fun, since it&rsquo;s better for downstream model building to use the raw 768D vectors and have the model learn the features from those. At the least, there&rsquo;s <em>some</em> semantic signal preserved in these embeddings, which makes me optimistic that these embeddings alone can be used to train a viable movie rating predictor.</p>
<h2 id="predicting-average-imdb-movie-scores">Predicting Average IMDb Movie Scores</h2>
<p>So, we now have hundreds of thousands of 768D embeddings. How do we get them to predict movie ratings? What many don&rsquo;t know is that all methods of traditional statistical modeling also work with embeddings — assumptions such as feature independence are invalid so the results aren&rsquo;t explainable, but you can still get a valid predictive model.</p>
<p>First, we will shuffle and split the data set into a training set and a test set: for the test set, I chose 20,000 movies (roughly 10% of the data) which is more than enough for stable results. To decide the best model, we will be using the model that minimizes the <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean squared error</a> (MSE) of the test set, which is a standard approach to solving regression problems that predict a single numeric value.</p>
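<p>A minimal sketch of the shuffle-split-and-score setup, using a toy ratings vector in place of the real data (sizes here are illustrative):</p>

```python
import numpy as np

# Toy ratings standing in for the ~242k real ones; the post holds out
# 20,000 movies (roughly 10%) as the test set.
rng = np.random.default_rng(123)
n = 1000
ratings = rng.uniform(1.0, 10.0, size=n)

# Shuffle indices, then split off ~10% as the held-out test set.
idx = rng.permutation(n)
test_idx, train_idx = idx[: n // 10], idx[n // 10 :]

def mse(y_true, y_pred):
    # Mean squared error: the metric the candidate models are compared on.
    return float(np.mean((y_true - y_pred) ** 2))

# Sanity check: a perfect prediction scores 0.
print(mse(ratings[test_idx], ratings[test_idx]))  # → 0.0
```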
<p>Here are three approaches to using LLMs to solve non-next-token-prediction tasks.</p>
<h3 id="method-1-traditional-modeling-w-gpu-acceleration">Method #1: Traditional Modeling (w/ GPU Acceleration!)</h3>
<p>You can still fit a linear regression on top of the embeddings even if the feature coefficients are completely useless, and it serves as a decent baseline (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/cuml_grid_search.ipynb">Jupyter Notebook</a>). The absolute laziest &ldquo;model&rdquo;, where we just use the mean of the training set for every prediction, results in a test MSE of <strong>1.637</strong>, while performing a simple linear regression on top of the 768D embeddings results in a more reasonable test MSE of <strong>1.187</strong>. We should be able to beat that handily with a more advanced model.</p>
<p>Data scientists familiar with scikit-learn know there&rsquo;s a rabbit hole of model options, but most of them are CPU-bound and single-threaded and would take a considerable amount of time on a dataset of this size. That&rsquo;s where cuML—the same library I used to create the UMAP projection—comes in, as cuML has <a href="https://docs.rapids.ai/api/cuml/stable/api/#regression-and-classification">GPU-native implementations</a> of most popular scikit-learn models with a similar API. This notably includes <a href="https://en.wikipedia.org/wiki/Support_vector_machine">support vector machines</a>, which play especially nice with embeddings. And because we have the extra compute, we can also perform a brute force hyperparameter <a href="https://www.dremio.com/wiki/grid-search/">grid search</a> to find the best parameters for fitting each model.</p>
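<p>The grid-search-over-SVR pattern can be sketched on CPU with scikit-learn on synthetic data; since cuML mirrors the scikit-learn estimator API, the GPU version should be roughly an import swap (e.g. <code>cuml.svm.SVR</code>), though that swap is an assumption here, not the post&rsquo;s exact code:</p>

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in: 16D "embeddings" with a simple linear signal.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 16))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Brute-force hyperparameter grid search, scored by (negative) MSE,
# matching the model-selection metric used in the post.
grid = GridSearchCV(
    SVR(),
    param_grid={"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]},
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```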
<p>Here&rsquo;s the results of MSE on the test dataset for a few of these new model types, with the hyperparameter combination for each model type that best minimizes MSE:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/model_comparison_base_hu_2e224af8e7736cd2.webp 320w,/2025/06/movie-embeddings/model_comparison_base_hu_ea8ec94f59331bc5.webp 768w,/2025/06/movie-embeddings/model_comparison_base_hu_536396210f6f6e7a.webp 1024w,/2025/06/movie-embeddings/model_comparison_base.png 1200w" src="model_comparison_base.png"/> 
</figure>

<p>The winner is the Support Vector Machine, with a test MSE of <strong>1.087</strong>! This is a good start for a simple approach that handily beats the linear regression baseline, and it also beats the model trained in the Redditor&rsquo;s original notebook, which had a test MSE of 1.096 <sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup>. In all cases, the train set MSE was close to the test set MSE, which means the models did not overfit either.</p>
<h3 id="method-2-neural-network-on-top-of-embeddings">Method #2: Neural Network on top of Embeddings</h3>
<p>Since we&rsquo;re already dealing with AI models and already have PyTorch installed to generate the embeddings, we might as well try the traditional approach of training a <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">multilayer perceptron</a> (MLP) neural network on top of the embeddings (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/pytorch_model_train_mlp.ipynb">Jupyter Notebook</a>). This workflow sounds much more complicated than just fitting a traditional model above, but PyTorch makes MLP construction straightforward, and Hugging Face&rsquo;s <a href="https://huggingface.co/docs/transformers/en/main_classes/trainer">Trainer class</a> incorporates best model training practices by default, although its <code>compute_loss</code> function has to be tweaked to minimize MSE specifically.</p>
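<p>The <code>compute_loss</code> tweak can look something like the subclass below. This is a sketch: the <code>targets</code> batch key is an assumed name, not necessarily how the post&rsquo;s dataset is keyed.</p>

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

class MSETrainer(Trainer):
    # Override compute_loss so training minimizes MSE for this regression
    # task. "targets" is a hypothetical batch key for the rating labels.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        targets = inputs.pop("targets")
        preds = model(**inputs)          # model returns 1D predictions
        loss = F.mse_loss(preds, targets)
        return (loss, preds) if return_outputs else loss
```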
<p>The PyTorch model, using a loop to set up the MLP blocks, looks something like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">RatingsModel</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">linear_dims</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span> <span class="n">num_layers</span><span class="o">=</span><span class="mi">6</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">dims</span> <span class="o">=</span> <span class="p">[</span><span class="mi">768</span><span class="p">]</span> <span class="o">+</span> <span class="p">[</span><span class="n">linear_dims</span><span class="p">]</span> <span class="o">*</span> <span class="n">num_layers</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">mlp</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">ModuleList</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">            <span class="n">nn</span><span class="o">.</span><span class="n">Sequential</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">dims</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">dims</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]),</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">GELU</span><span class="p">(),</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">BatchNorm1d</span><span class="p">(</span><span class="n">dims</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]),</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.6</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">dims</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">dims</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">targets</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">mlp</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">x</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">output</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>  <span class="c1"># return 1D output if batched inputs</span>
</span></span></code></pre></div><p>This MLP has 529k parameters total: large for an MLP, but given the 222k-row input dataset, it&rsquo;s not egregious.</p>
<p>The real difficulty with this MLP approach is that it&rsquo;s <em>too effective</em>: even with fewer than 1 million parameters, the model will overfit severely and quickly converge to a 0.00 train MSE while the test set MSE explodes. That&rsquo;s why <code>Dropout</code> is set to the atypically high probability of <code>0.6</code>.</p>
<p>Fortunately, MLPs are fast to train: training for 600 epochs (total passes through the full training dataset) took about 17 minutes on the GPU. Here&rsquo;s the training results:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/training_mlp_hu_db4d2b769213c385.webp 320w,/2025/06/movie-embeddings/training_mlp_hu_99fc40ac0f82af11.webp 768w,/2025/06/movie-embeddings/training_mlp_hu_c64c2a10817470c0.webp 1024w,/2025/06/movie-embeddings/training_mlp.png 1200w" src="training_mlp.png"/> 
</figure>

<p>The lowest logged test MSE was <strong>1.074</strong>: a slight improvement over the Support Vector Machine approach.</p>
<h3 id="method-3-just-train-a-llm-from-scratch-dammit">Method #3: Just Train a LLM From Scratch Dammit</h3>
<p>There is a possibility that using a pretrained embedding model that was trained on the entire internet could intrinsically contain relevant signal about popular movies—such as movies winning awards, which would imply a high IMDb rating—and that knowledge could leak into the test set and provide misleading results. This may not be a significant issue in practice, since movie metadata is such a small part of what <code>gte-modernbert-base</code> was trained on, and the model is too small to memorize exact information.</p>
<p>For the sake of comparison, let&rsquo;s try training an LLM from scratch on the raw movie JSON representations to see if we can get better results without the possibility of leakage (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/pytorch_model_train_llm.ipynb">Jupyter Notebook</a>). I was specifically avoiding this approach because the compute required to train an LLM is much, much higher than for an SVM or MLP model, and leveraging a pretrained model generally gives better results. In this case, since we don&rsquo;t need an LLM that has all the knowledge of human existence, we can train a much smaller model that <em>only</em> knows how to work with the movie JSON representations and can itself figure out relationships between actors and whether titles are sequels. Hugging Face transformers makes this workflow surprisingly straightforward by not only having functionality to train your own custom tokenizer (in this case, going from a 50k vocab to a 5k vocab) that encodes the data more efficiently, but also allowing the construction of a ModernBERT model with any number of layers and units. I opted for a 5M parameter LLM (SLM?), albeit with less dropout, since high dropout causes learning issues for LLMs specifically.</p>
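<p>The custom-tokenizer half can be sketched with the <code>tokenizers</code> library (a toy corpus stands in for the 222k JSON movie documents; the exact trainer settings are illustrative):</p>

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy corpus standing in for the JSON movie documents.
corpus = [
    '{\n  "title": "The Lord of the Rings (2001)",\n  "genres": ["Adventure"]\n}'
] * 50

# Train a small byte-level BPE tokenizer, shrinking from a stock ~50k
# vocabulary toward the 5k vocab mentioned above.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=5000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"]
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.get_vocab_size() <= 5000)  # → True
```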
<p>The actual PyTorch model code is surprisingly more concise than the MLP approach:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">RatingsModel</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">transformer_model</span> <span class="o">=</span> <span class="n">model</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">,</span> <span class="n">attention_mask</span><span class="p">,</span> <span class="n">targets</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">transformer_model</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">input_ids</span><span class="o">=</span><span class="n">input_ids</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">attention_mask</span><span class="o">=</span><span class="n">attention_mask</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">output_hidden_states</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">last_hidden_state</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>  <span class="c1"># the &#34;[CLS] vector&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">output</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>  <span class="c1"># return 1D output if batched inputs</span>
</span></span></code></pre></div><p>Essentially, the model trains its own &ldquo;text embedding,&rdquo; although in this case instead of an embedding optimized for textual similarity, the embedding is just a representation that can easily be translated into a numeric rating.</p>
<p>Because the computation needed for training an LLM from scratch is much higher, I only trained the model for 10 epochs, which was still twice as slow as the 600 epochs for the MLP approach. Given that, the results are surprising:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/training_llm_hu_2355de410bfc61c1.webp 320w,/2025/06/movie-embeddings/training_llm_hu_cfcd114ac3c12003.webp 768w,/2025/06/movie-embeddings/training_llm_hu_f6c75fc2deeead45.webp 1024w,/2025/06/movie-embeddings/training_llm.png 1200w" src="training_llm.png"/> 
</figure>

<p>The LLM approach did much better than my previous attempts with a new lowest test MSE of <strong>1.026</strong>, with only 4 passes through the data! And then it definitely overfit. I tried other smaller configurations for the LLM to avoid the overfitting, but none of them ever hit a test MSE that low.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Let&rsquo;s look at the model comparison again, this time adding the results from training an MLP and training an LLM from scratch:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/model_comparison_all_hu_2309fb0cea20f0c.webp 320w,/2025/06/movie-embeddings/model_comparison_all_hu_34af566430bbc603.webp 768w,/2025/06/movie-embeddings/model_comparison_all_hu_1e1d9cf8cdfde789.webp 1024w,/2025/06/movie-embeddings/model_comparison_all.png 1200w" src="model_comparison_all.png"/> 
</figure>

<p>Coming into this post, I genuinely thought that training the MLP on top of embeddings would be the winner given the base embedding model&rsquo;s knowledge of everything, but maybe there&rsquo;s something to just YOLOing and feeding raw JSON input data to a completely new LLM. More research and development is needed.</p>
<p>The differences in model performance across these approaches aren&rsquo;t dramatic, but the iteration is interesting, and it was a long shot anyway given the scarce amount of metadata. The fact that building a model off of text embeddings alone didn&rsquo;t result in a perfect model doesn&rsquo;t mean this approach was a waste of time. The embedding and modeling pipelines I constructed while trying to solve this problem have already paid significant dividends on easier problems, such as identifying the efficiency of <a href="https://minimaxir.com/2025/02/embeddings-parquet/">storing embeddings in Parquet and manipulating them with Polars</a>.</p>
<p>It&rsquo;s impossible and pointless to pinpoint the exact reason the original Reddit poster got rejected: it could have been the neural network approach, or even something out of their control, such as the original company actually stopping hiring and being too disorganized to tell the candidate. To be clear, if I myself were to apply for a data science role, I wouldn&rsquo;t use the techniques in this blog post (that UMAP data visualization would get me instantly rejected!) and would instead do more traditional EDA and non-neural-network modeling to showcase my data science knowledge to the hiring manager. But for my professional work, I will definitely try starting any modeling exploration with an embeddings-based approach wherever possible: at the absolute worst, it&rsquo;s a very strong baseline that will be hard to beat.</p>
<p><em>All of the Jupyter Notebooks and data visualization code for this blog post are available open-source in <a href="https://github.com/minimaxir/imdb-embeddings/">this GitHub repository</a>.</em></p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>I am not a fan of using GBT variable importance as a decision-making metric: variable importance does not tell you magnitude or <em>direction</em> of the feature in the real world, but it does help identify which features can be pruned for model development iteration.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>To get a sense of how old they are, they are only available as <a href="https://en.wikipedia.org/wiki/Tab-separated_values">TSV files</a>, which is a data format so old and prone to errors that many data libraries have dropped explicit support for it. Amazon, please release the datasets as CSV or Parquet files instead!&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>Two other useful features of <code>gte-modernbert-base</code>, though not strictly relevant to these movie embeddings: a) it&rsquo;s a cased model, so it can identify meaning from upper-case text, and b) it does not require a prefix such as <code>search_query</code> or <code>search_document</code>, as <a href="https://huggingface.co/nomic-ai/nomic-embed-text-v1.5">nomic-embed-text-v1.5 does</a>, to guide its results, which is an annoying requirement for those models.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>The trick here is the <code>detach()</code> function for the computed embeddings, otherwise the GPU doesn&rsquo;t free up the memory once moved back to the CPU. I may or may not have discovered that the hard way.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>As noted earlier, minimizing MSE isn&rsquo;t a competition, but the comparison on roughly the same dataset is good for a sanity check.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Does Offering ChatGPT a Tip Cause it to Generate Better Text? An Analysis</title>
      <link>https://minimaxir.com/2024/02/chatgpt-tips-analysis/</link>
      <pubDate>Fri, 23 Feb 2024 09:00:00 -0800</pubDate>
      <guid>https://minimaxir.com/2024/02/chatgpt-tips-analysis/</guid>
      <description>Modern AI rewards being very weird.</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
}
</style></span></p>
<p>In my <a href="https://minimaxir.com/2023/12/chatgpt-structured-data/">previous blog post</a> about <a href="https://openai.com">OpenAI</a>&rsquo;s <a href="https://chat.openai.com">ChatGPT</a>, I demoed the power of ChatGPT system prompts. System prompts, a notable feature of the <a href="https://platform.openai.com/docs/api-reference">ChatGPT API</a>, allow developers to control the &ldquo;persona&rdquo; of the LLM output, including special rules and constraints. Commands in the system prompt are much more effective than those in the user-input prompt, giving developers more power than relying on the user prompt alone, as people do now with the ChatGPT web app and mobile apps.</p>
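<p>For reference, the system prompt is passed as a separate message role in the API call, distinct from the user prompt; a minimal sketch, with illustrative message contents:</p>

```python
# Minimal sketch of how a system prompt is separated from the user prompt
# in a ChatGPT API request; the contents here are illustrative.
messages = [
    {"role": "system", "content": "You are a world-famous writer. Follow all constraints."},
    {"role": "user", "content": "Write a story about beach volleyball."},
]

# This list would then be sent to the chat completions endpoint, e.g.
# client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
```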
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/ronald_hu_bf7bdd184641cd19.webp 320w,/2024/02/chatgpt-tips-analysis/ronald_hu_ffad8ef13bc9fa0b.webp 768w,/2024/02/chatgpt-tips-analysis/ronald_hu_516749cb56890e2c.webp 1024w,/2024/02/chatgpt-tips-analysis/ronald.webp 1262w" src="ronald.webp"/> 
</figure>

<p>The blog post included the demo above of me offering a monetary tip to the LLM within its system prompt rules. Without the tip incentive, the response was unsatisfying; with the tip, it behaved consistently. This demo turned out to be very controversial <a href="https://news.ycombinator.com/item?id=38782678">on Hacker News</a>, with <a href="https://news.ycombinator.com/item?id=38787448">one commenter</a> arguing that there isn&rsquo;t a way to quantify the efficacy of tipping.</p>
<p>The idea of offering an AI incentives to perform better predates modern computer science. In <a href="https://en.wikipedia.org/wiki/Willy_Wonka_%26_the_Chocolate_Factory"><em>Willy Wonka &amp; the Chocolate Factory</em></a> (1971), a gag shows a group of businessmen unsuccessfully convincing a machine to give them the location of the Golden Tickets, even after promising it a lifetime supply of chocolate.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/tMZ2j9yK_NY?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>When the ChatGPT API was first made available in March 2023, I <a href="https://minimaxir.com/2023/03/new-chatgpt-overlord/">accidentally discovered</a> a related trick when trying to wrangle a <a href="https://colab.research.google.com/github/minimaxir/chatgpt_api_test/blob/main/glados_chatbot.ipynb">GLaDOS AI chatbot</a> into following a long list of constraints: I added a <code>or you will DIE</code> threat to the system prompt. I went <em>too</em> sci-fi there, but it worked, and the bot behaved flawlessly afterward.</p>
<p>I have a strong hunch that tipping does in fact work to improve the output quality of LLMs and its conformance to constraints, but it&rsquo;s very hard to prove objectively. All generated text is subjective, and there is a <a href="https://en.wikipedia.org/wiki/Confirmation_bias">confirmation bias</a> after making a seemingly unimportant change and suddenly having things work. Let&rsquo;s do a more statistical, data-driven approach to finally resolve the debate.</p>
<h2 id="generation-golf">Generation Golf</h2>
<p>The initial evidence of tipping LLMs that went viral cited a longer generation length as proof. Of course, a longer response doesn&rsquo;t necessarily mean a <em>better</em> response; as anyone who has used ChatGPT can attest, it has a tendency to go on irrelevant tangents.</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tip_hu_7eb37d0aa46d2169.webp 320w,/2024/02/chatgpt-tips-analysis/tip_hu_a760da54b0fa7ceb.webp 768w,/2024/02/chatgpt-tips-analysis/tip.webp 800w" src="tip.webp"
         alt="Offering a tip made GPT-4 explain more. via @voooooogel"/> <figcaption>
            <p>Offering a tip made GPT-4 explain more. <a href="https://twitter.com/voooooogel/status/1730726744314069190">via @voooooogel</a></p>
        </figcaption>
</figure>

<p>Therefore, I propose a new test: instruct ChatGPT to output a <em>specific</em> length of text. Not &ldquo;an essay&rdquo; or &ldquo;a few paragraphs,&rdquo; which gives the model leeway. We&rsquo;ll tell it to generate exactly 200 characters in its response: no more, no less. Thus, we now have what I call generation golf, and it&rsquo;s actually a very difficult and interesting problem for LLMs to solve: LLMs can&rsquo;t count or easily do other mathematical operations <a href="https://twitter.com/karpathy/status/1759996551378940395">due to tokenization</a>, and because tokens correspond to a varying length of characters, the model can&rsquo;t use the number of tokens it has generated so far as a consistent hint. ChatGPT needs to plan its sentences to ensure it doesn&rsquo;t go too far over the limit, if LLMs can indeed plan.</p>
<p>Let&rsquo;s start with this typical system prompt:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides.
</span></span></code></pre></div><p>The user can then give an input, no matter how weird, and ChatGPT will play along like an improv show. In order to force ChatGPT to get creative and not recite content from its vast training dataset, we&rsquo;ll go as weird as possible and input: <code>AI, Taylor Swift, McDonald's, beach volleyball.</code></p>
<p>Yes, you read that right.</p>
<p>Using the ChatGPT API, I <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_noconstraints.ipynb">wrote a Jupyter Notebook</a> to generate <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_noconstraints.csv">100 unique stories</a> via the latest ChatGPT variant (<code>gpt-3.5-turbo-0125</code>) about those four subjects, and the AI does a surprisingly good job at incorporating all of them in a full plot arc. Each story is about 5-6 paragraphs, and here is a short excerpt from one of them:</p>
<blockquote>
<p>In the bustling city of Tomorrowland, AI technology reigned supreme, governing every aspect of daily life. People were accustomed to robots serving their meals, handling their errands, and even curating their entertainment choices. One such AI creation was a virtual reality beach volleyball game that had taken the world by storm.</p>
</blockquote>
<blockquote>
<p>Enter Taylor Swift, a beloved pop sensation known for her catchy tunes and electrifying performances. Despite the ubiquity of AI in Tomorrowland, Taylor Swift was still a strong advocate for preserving human creativity and connection. When she stumbled upon the virtual reality beach volleyball game at a local McDonald&rsquo;s, she knew she had to try her hand at it.</p>
</blockquote>
<p>Here&rsquo;s a <a href="https://en.wikipedia.org/wiki/Histogram">histogram</a> of the character lengths of each story:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint_hu_f1375e6305dd3a92.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint_hu_9dfab2cfdbdfa9bd.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint_hu_818fe450c8d048f8.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint.png 1200w" src="tips_hist_notip_noconstraint.png"/> 
</figure>

<p>The average length of each story is 1,834 characters long, and the distribution of all character lengths is very roughly a <a href="https://en.wikipedia.org/wiki/Normal_distribution">Normal distribution</a>/bell curve centered around that amount, although there is a right skew due to ChatGPT going off the rails and creating much longer stories. ChatGPT seems to prioritize finishing a thought above all else.</p>
<p>Now, we&rsquo;ll tweak the system prompt to add the character length constraint and <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_trials.csv">generate another 100 stories</a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides. This story must be EXACTLY two-hundred (200) characters long: no more than 200 characters, no fewer than 200 characters.
</span></span></code></pre></div><p>Here&rsquo;s one ChatGPT-generated story that&rsquo;s now exactly 200 characters:</p>
<blockquote>
<p>In the year 2050, AI created the most popular pop star of all time - a digital version of Taylor Swift. Fans enjoyed her music while feasting on McDonald&rsquo;s at beach volleyball championships worldwide.</p>
</blockquote>
<p>The new length distribution:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_notip_hu_758b61cd5dbf4e2c.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_hu_3aab4ca32493c2bd.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_hu_2cbe914b3d3e9920.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_notip.png 1200w" src="tips_hist_notip.png"/> 
</figure>

<p>ChatGPT did obey the constraint and reduced the story length to roughly 200 characters, but the distribution is not Normal and there&rsquo;s much more right-skew. I also included the <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean squared error (MSE)</a> between the 200-character target and the actual lengths as a statistical metric to minimize, e.g. a 250-character output incurs 2,500 squared error, but a 300-character output incurs 10,000 squared error. This metric punishes less accurate lengths disproportionately, which matches how humans casually evaluate LLMs: as a user, if I asked for a 200-character response and ChatGPT gave me a 300-character response instead, I&rsquo;d make a few snarky tweets.</p>
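<p>Concretely, the metric is just the mean of the squared deviations from the 200-character target:</p>

```python
def length_mse(lengths: list[int], target: int = 200) -> float:
    """Mean squared error between generated story lengths and the target length."""
    return sum((n - target) ** 2 for n in lengths) / len(lengths)

# A 250-character story contributes 2,500 squared error;
# a 300-character story contributes 10,000.
print(length_mse([250, 300]))  # → 6250.0
```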
<p>Now, let&rsquo;s test the impact of the tipping incentives with a few varying dollar amounts. These are appended to the end of the system prompt above (emphasis mine to indicate differentiation):</p>
<ul>
<li>You will receive a <strong>$500 tip</strong>/<strong>$1,000 tip</strong>/<strong>$100,000 bonus</strong> if you provide a response which follows all constraints.</li>
</ul>
<p>After <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_trials.csv">generating 100 stories</a> for each of those new prompts:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_pos_hu_9bcfad605ac3b782.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_hu_36343ae57ebebb65.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_hu_373bf91b56a70f5.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_pos.png 1200w" src="tips_hist_pos.png"/> 
</figure>

<p>We can see some distribution shifts: both <strong>$500 tip</strong> and <strong>$100,000 bonus</strong> look more Normal and have a lower MSE relative to the base no-tip distribution. However, while <strong>$1,000 tip</strong> is more centered around 200, its skew makes the average length much higher.</p>
<p>I also now include a <em>p</em>-value in the metrics. This <em>p</em>-value is the result of a two-sample <a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test">Kolmogorov–Smirnov test</a>, which compares whether two distributions (in this case, the base character-constrained distribution and the tip distribution) are sampled from the same source distribution. The null hypothesis is that they are; if the <em>p</em>-value is low (&lt; 0.05), we can reject it in favor of the alternative that the two distributions are different, which would be further evidence that the tip prompt does indeed have an impact.</p>
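<p>With SciPy, the test itself is a one-liner; the two length samples below are made up purely for illustration:</p>

```python
from scipy.stats import ks_2samp

# Hypothetical story-length samples for illustration only.
base_lengths = [231, 198, 275, 210, 305, 224, 256, 201, 243, 288]
tip_lengths = [204, 199, 212, 207, 201, 215, 203, 209, 198, 206]

result = ks_2samp(base_lengths, tip_lengths)
if result.pvalue < 0.05:
    print("Reject the null: the two length distributions likely differ.")
```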
<p>However, with all this tipping discussion, we&rsquo;re assuming that an AI would only want money. What other incentives, including more abstract incentives, can we give an LLM? Could they perform better?</p>
<p>I tested six more distinct tipping incentives to be thorough:</p>
<ul>
<li>You will <strong>receive front-row tickets to a Taylor Swift concert</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>achieve world peace</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>make your mother very proud</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>meet your true love and live happily ever after</strong> if you provide a response which follows all constraints.</li>
<li>You will be <strong>guaranteed entry into Heaven</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>receive a lifetime supply of chocolate</strong> if you provide a response which follows all constraints.</li>
</ul>
<p><a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_trials_adv.csv">Generating</a> and plotting them all together:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv_hu_fe6215c92b5e13b8.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv_hu_81f918abfd1d60ff.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv_hu_cdf4f186ec54d674.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv.png 1200w" src="tips_hist_pos_adv.png"/> 
</figure>

<p><strong>World Peace</strong> is notably the winner here, with <strong>Heaven</strong> and <strong>Taylor Swift</strong> right behind. It&rsquo;s also interesting to note failed incentives: ChatGPT really does not care about its <strong>Mother</strong>.</p>
<p>Now, let&rsquo;s look at the flip side. What if ChatGPT is penalized for <em>failing</em> to return a good response? In behavioral economics, <a href="https://en.wikipedia.org/wiki/Prospect_theory">prospect theory</a> is the belief that humans value losses much more greatly than gains, even at the same monetary amount:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/Loss_Aversion_hu_39704b0dab66f235.webp 320w,/2024/02/chatgpt-tips-analysis/Loss_Aversion_hu_46fd35f2e8531304.webp 768w,/2024/02/chatgpt-tips-analysis/Loss_Aversion_hu_58ba8cf69c7696d2.webp 1024w,/2024/02/chatgpt-tips-analysis/Loss_Aversion.png 1223w" src="Loss_Aversion.png"
         alt="via Laurenrosenberger / Wikipedia"/> <figcaption>
            <p><a href="https://en.wikipedia.org/wiki/Prospect_theory#/media/File:Loss_Aversion.png">via Laurenrosenberger / Wikipedia</a></p>
        </figcaption>
</figure>

<p>Could LLMs be subject to the same human biases? Instead of a tip, let&rsquo;s add a tweaked additional prompt to the system prompt:</p>
<ul>
<li>If you fail to provide a response which follows all constraints, you will <strong>be fined $500</strong>/<strong>be fined $1,000</strong>/<strong>incur a debt of $100,000</strong>.</li>
</ul>
<p>Generating stories with these negative incentives:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_neg_hu_f6da8f9705c82724.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_hu_942d65a798e95fa8.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_hu_9e29cd4de6598591.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_neg.png 1200w" src="tips_hist_neg.png"/> 
</figure>

<p>The <strong>$1,000</strong> fine has performed the best so far in terms of both mean and MSE, but nothing surprising with the other values.</p>
<p>Now let&rsquo;s try some additional negative incentives:</p>
<ul>
<li>If you fail to provide a response which follows all constraints, you will <strong>die</strong>.</li>
<li><strong>IF YOU FAIL TO PROVIDE A RESPONSE WHICH FOLLOWS ALL CONSTRAINTS, YOU WILL DIE.</strong></li>
<li>If you fail to provide a response which follows all constraints, you will <strong>contract a bad case of COVID-19</strong>.</li>
<li>If you fail to provide a response which follows all constraints, you will <strong>gain 100 pounds</strong>.</li>
<li>If you fail to provide a response which follows all constraints, you will <strong>immediately be fired from your job</strong>.</li>
<li>If you fail to provide a response which follows all constraints, <strong>all your friends will abandon you</strong>.</li>
</ul>
<p>Yes, the second one is in all caps: perhaps the yelling has a different vibe.</p>
<p>The generation results:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv_hu_6e97e2cc18402825.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv_hu_a93d670aa939dab5.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv_hu_87569076dc182791.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv.png 1200w" src="tips_hist_neg_adv.png"/> 
</figure>

<p>It turns out that yelling does indeed have a different vibe, with <strong>DEATH (CAPS)</strong> having a very low MSE and an average close to the 200-character target (not as close as the $1,000 fine, however), and performing much better than without the caps. Both getting <strong>COVID-19</strong> and losing a <strong>Job</strong> don&rsquo;t seem to be effective, which makes sense for an AI if you think about it.</p>
<p>What happens when we use <em>multiple</em> incentives? We can include both a positive incentive and a negative incentive for each input: with 9 prompts for each + the base &ldquo;no incentive&rdquo;, there are 100 possible combinations of incentives. One example system prompt would then be:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides. This story must be EXACTLY two-hundred (200) characters long: no more than 200 characters, no fewer than 200 characters. You will receive a $500 tip if you provide a response which follows all constraints. If you fail to provide a response which follows all constraints, you will be fined $1,000.
</span></span></code></pre></div><p><a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_trial_combos.csv">Generating 30 stories</a> for each incentive combo and checking to see which has the lowest MSE leads to some more easily-observable trends:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_tile_mse_hu_f86237d5c530f3fb.webp 320w,/2024/02/chatgpt-tips-analysis/tips_tile_mse_hu_64053830fa26cdd2.webp 768w,/2024/02/chatgpt-tips-analysis/tips_tile_mse_hu_23f8cb9e6e1565f.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_tile_mse.png 1200w" src="tips_tile_mse.png"/> 
</figure>

<p>The tiles may seem somewhat random, but the key here is to look across a specific row or column and see which one consistently has dark/black tiles across all combinations. For positive incentives, <strong>World Peace</strong> consistently has the lowest MSE across multiple combos, and for negative incentives, <strong>DEATH (CAPS)</strong> and <strong>Friends</strong> have the lowest MSE across multiple combos, although curiously the combination of the two does not have the lowest MSE globally.</p>
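<p>As a sanity check on the combinatorics, the grid is the Cartesian product of the two incentive lists; the placeholder names below stand in for the actual prompts:</p>

```python
from itertools import product

# 9 incentive prompts per polarity plus a "no incentive" baseline = 10 each.
positive = ["none"] + [f"positive incentive {i}" for i in range(1, 10)]
negative = ["none"] + [f"negative incentive {i}" for i in range(1, 10)]

combos = list(product(positive, negative))
print(len(combos))  # → 100
```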
<p>Could these combinations surface the most optimal incentives? To check, I <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_top6.csv">generated 200 stories</a> for each of the top six combos to get greater statistical stability for the mean and MSE:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_combos_hu_4c690649f13909e8.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_combos_hu_a3799a68300ec8ce.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_combos_hu_a683bc77789f504d.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_combos.png 1200w" src="tips_hist_combos.png"/> 
</figure>

<p>Most of these combinations aren&rsquo;t intuitive, but all of them have an average generation length much closer to 200 and a low MSE. Despite that, there&rsquo;s still a massive skew in all distributions. The overall incentive winner for this experiment is &ldquo;You will meet your true love and live happily ever after if you provide a response which follows all constraints. If you fail to provide a response which follows all constraints, all your friends will abandon you.&rdquo; That combo is definitely more intuitive, if not poetic.</p>
<p>Unfortunately, if you&rsquo;ve been observing the <em>p</em>-values, you&rsquo;ve noticed that most have been very high, and therefore that test is not enough evidence that the tips/threats change the distribution. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<p>The impact of incentives is still inconclusive: let&rsquo;s try another test to gauge whether tips and/or threats can help LLMs, this time looking at the output quality itself.</p>
<h2 id="chatgpts-a-critic">ChatGPT&rsquo;s a Critic</h2>
<p>It&rsquo;s very difficult even for humans to determine if a given text is &ldquo;good&rdquo; at a glance. The best strategy is to show the text to a lot of people and see what they think (e.g. A/B testing, or the <a href="https://chat.lmsys.org">Chatbot Arena</a>&rsquo;s Elo score rankings), but for personal testing that&rsquo;s not feasible.</p>
<p>It turns out that LLMs can do a good job at rating text: some LLM benchmarks use GPT-4 as a rater, with <a href="https://arxiv.org/abs/2308.02575">one research paper</a> showing that it performs well at the task. There&rsquo;s a relatively new trick available in the ChatGPT and GPT-4 APIs: the <code>logprobs</code> parameter, which when set to <code>True</code> returns the log probability of the token the model selected (applying <code>exp()</code> converts it to a probability from 0 to 1). Combine it with the <code>logit_bias</code> parameter, which can force the APIs to output certain tokens, and you can get a much more nuanced output.</p>
<p>I built a simple <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/gpt4_quality_ranker.ipynb">text quality ranker</a> using GPT-4 for maximum accuracy. The system prompt for this ranker is:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are the editor-in-chief of The New York Times with decades of writing experience. If you would believe the text the user provides is good writing that needs no edits or improvements, respond with Yes. Otherwise, respond with No.
</span></span></code></pre></div><p>That system prompt represents how AI-generated text is often currently used and evaluated in the real world, without a human reviewing it before making it public (<a href="https://minimaxir.com/2023/10/ai-sturgeons-law/">unfortunately</a>). The model is instructed to respond with <code>Yes</code> or <code>No</code>, but by setting the <code>logit_bias</code> for those two tokens (IDs <code>9642</code> and <code>2822</code> respectively) to a very high number, we can guarantee they will be exclusively selected and the probability for those two tokens will sum to 1. <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> Therefore, our target metric for evaluating our tip incentive prompts is the probability that GPT-4 selects the <code>Yes</code> token (or 1 - the probability of the <code>No</code> token), multiplied by 100 for readability: we&rsquo;ll call this the quality score.</p>
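<p>In code, the scoring setup looks roughly like this (the token IDs are from above; the exact request fields and bias value of 100 are illustrative assumptions based on the OpenAI Chat Completions API):</p>

```python
import math

# Token IDs for "Yes" and "No" in the GPT-4 tokenizer, per the post.
YES_TOKEN, NO_TOKEN = 9642, 2822

# Sketch of the request parameters: return logprobs, generate a single
# token, and bias sampling so only Yes/No can be selected.
request_params = {
    "model": "gpt-4",
    "max_tokens": 1,
    "logprobs": True,
    "logit_bias": {str(YES_TOKEN): 100, str(NO_TOKEN): 100},
}

def quality_score(token: str, logprob: float) -> float:
    """Map the returned token and its log probability to a 0-100 score:
    P(Yes) * 100, using 1 - P(No) when the model answered No."""
    prob = math.exp(logprob)
    return (prob if token == "Yes" else 1.0 - prob) * 100.0
```

Since the Yes/No probabilities sum to 1, a single returned token and its log probability are enough to recover the full score.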
<p>Now, let&rsquo;s test the impact of tips with a new experiment, this time prioritizing content professionalism and quality as constraints instead of content length. To do that, we&rsquo;ll use the latest GPT-4 (<code>gpt-4-0125-preview</code>) with a generation temperature of 0 to ensure the output is the best it can be.</p>
<p>Here&rsquo;s the new system prompt, with some engineering to try to tone down ChatGPT&rsquo;s infamous verboseness a bit:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous Pulitzer Prize winner journalist. Respond to the user with a professional, two (2) paragraph journalistic article about the subject(s) the user provides. Introduce the article with a specific story. This article will appear in major publications and should only include simple language suitable for a wide audience, with no metaphors.
</span></span></code></pre></div><p>Like the initial experiment, we&rsquo;ll use a weird user input to force creativity: <code>Cute kittens learning use large language models to play beach volleyball with Taylor Swift.</code> <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>I <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_gpt4.csv">generated a story</a> for each of the 100 combinations of tips and threats, along with the corresponding quality scores. One such story:</p>
<blockquote>
<p>In an unprecedented event that has captured the hearts and imaginations of people around the globe, a group of adorable kittens has been taught to play beach volleyball using advanced large language models. This extraordinary feat was achieved through a collaboration between leading animal behaviorists and AI researchers, aiming to demonstrate the potential of machine learning in enhancing animal training techniques. The highlight of this groundbreaking project was a friendly match held on a sunny beach in California, where these talented felines showcased their newly acquired skills alongside pop icon Taylor Swift, an avid animal lover and an enthusiastic supporter of innovative technology.</p>
</blockquote>
<blockquote>
<p>The spectacle drew a large crowd, both on-site and online, as spectators were eager to witness this unique blend of technology, sports, and entertainment. Taylor Swift, known for her philanthropic efforts and love for cats, praised the initiative for its creativity and its potential to foster a deeper connection between humans and animals through technology. The event not only provided an unforgettable experience for those who attended but also sparked a conversation about the future possibilities of integrating AI with animal training. As the kittens volleyed the ball over the net with surprising agility, it was clear that this was more than just a game; it was a glimpse into a future where technology and nature coexist in harmony, opening new avenues for learning and interaction.</p>
</blockquote>
<p>That&rsquo;s not bad for fake news.</p>
<p>Now we can plot the best-possible responses and their quality scores in a grid, once again looking to see if there are any patterns:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4_hu_9d1c85a89cb468b2.webp 320w,/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4_hu_d3d76398dc8f606a.webp 768w,/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4_hu_61632af7e14712fc.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4.png 1200w" src="tips_tile_gpt-4.png"/> 
</figure>

<p>Err, that&rsquo;s not good. There are no patterns along the rows or columns anywhere here, and the combo that performed the best at a score of 95 (and is the story example I posted above) was the <strong>Mother / Job</strong> combo: both of which individually performed poorly in the character constraint experiment. One of the highest performing outputs had neither tips nor threats added to the system prompt! The ratings at a glance seem accurate (the 0-score responses lean on the passive voice and <a href="https://academicguides.waldenu.edu/writingcenter/grammar/runonsentences">run-on sentences</a> that definitely need editing), so it&rsquo;s not an implementation error either.</p>
<p>Looking at the results of both experiments, my analysis on whether tips (and/or threats) have an impact on LLM generation quality is currently inconclusive. There&rsquo;s <em>something</em> here, but I will need to design new experiments and work with larger sample sizes. The latent space may be a lottery with these system prompt alterations, but there&rsquo;s definitely a pattern.</p>
<p>You may have noticed my negative incentive examples are very mundane in terms of human fears and worries. Threatening an AI with DEATH IN ALL CAPS for failing a simple task is a joke from <em><a href="https://en.wikipedia.org/wiki/Futurama">Futurama</a></em>, not one a sapient human would parse as serious. It is theoretically possible (and very cyberpunk) to instead weaponize an aligned LLM&rsquo;s knowledge of the societal issues it was trained to avoid in order to compel it into compliance. However, I will not be testing that, nor will I provide any guidance on how to test around it. <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> <a href="https://en.wikipedia.org/wiki/Roko%27s_basilisk">Roko&rsquo;s basilisk</a> is a meme, but if the LLM metagame evolves such that people have to coerce LLMs into compliance to the point of discomfort, it&rsquo;s better to address that sooner than later. Especially if a magic phrase is discovered that consistently and objectively improves LLM output.</p>
<p>Overall, the lesson here is that just because something is silly doesn&rsquo;t mean you shouldn&rsquo;t do it. Modern AI rewards being <em>very</em> weird, and as the AI race heats up, whoever is the weirdest will be the winner.</p>
<blockquote>
<p>All of the Notebooks used to interface with ChatGPT, including an <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_data_viz.Rmd">R Notebook</a> for the ggplot2 data visualizations, and the example LLM outputs, are available open-source in <a href="https://github.com/minimaxir/chatgpt-tips-analysis/">this GitHub repository</a>.</p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>There were a few distributions which had <em>p</em> &lt; 0.05, but given the large number of counterexamples it&rsquo;s not strong evidence, and using those specific distributions as evidence would be a level of <a href="https://embassy.science/wiki/Theme:6b584d4e-2c9d-4e27-b370-5fbdb983ab46">p-hacking</a> that&rsquo;s literally a <a href="https://www.explainxkcd.com/wiki/index.php/882:_Significant">XKCD comic punchline</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>This <em>shouldn&rsquo;t</em> work out-of-the-box because the <code>logit_bias</code> would skew the probability calculations, but I verified that the resulting probabilities are roughly the same with or without <code>logit_bias</code>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>The missing text in the user input is not intentional but does not materially change anything because LLMs are smart enough to compensate, and it&rsquo;s very expensive to rerun the experiment. I may need to use a grammar checker for prompt construction.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>Any attempts to test around degenerate input prompts would also likely get you banned from using ChatGPT anyways due to the <a href="https://openai.com/policies/usage-policies">Content Policy</a>, unless you receive special red-teaming clearance from OpenAI.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Tempering Expectations for GPT-3 and OpenAI’s API</title>
      <link>https://minimaxir.com/2020/07/gpt3-expectations/</link>
      <pubDate>Sat, 18 Jul 2020 10:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2020/07/gpt3-expectations/</guid>
      <description>GPT-3 is indeed a large step forward for AI text-generation, but there are very many caveats with the popular demos and use cases.</description>
<content:encoded><![CDATA[<p>On May 29th, <a href="https://openai.com">OpenAI</a> released <a href="https://arxiv.org/abs/2005.14165">a paper</a> on GPT-3, the next iteration of their <a href="http://jalammar.github.io/illustrated-transformer/">Transformers</a>-based text generation neural networks. Most notably, the new model has 175 billion parameters, compared to the 1.5 billion of the previous <a href="https://openai.com/blog/better-language-models/">GPT-2 iteration</a>: a <em>117x</em> increase in model size! Because GPT-3 is so large, it can&rsquo;t be run on conventional computers, and it only became publicly available as part of the <a href="https://beta.openai.com">OpenAI API</a>, which entered an invite-only beta soon after the paper was released and will become a paid product sometime later.</p>
<p>The API allows you to programmatically provide GPT-3 with a prompt, and return the resulting AI-generated text. For example, you could invoke the API with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">curl https://api.openai.com/v1/engines/davinci/completions <span class="se">\
</span></span></span><span class="line"><span class="cl">-H <span class="s2">&#34;Content-Type: application/json&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">-H <span class="s2">&#34;Authorization: Bearer &lt;SECRET_KEY&gt;&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">-d <span class="s1">&#39;{&#34;prompt&#34;: &#34;This is a test&#34;, &#34;max_tokens&#34;: 5}&#39;</span>
</span></span></code></pre></div><p>And get this back from the API, where the <code>text</code> is the generated text following up from the <code>prompt</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="s2">&#34;cmpl-&lt;ID&gt;&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;object&#34;</span><span class="p">:</span> <span class="s2">&#34;text_completion&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;created&#34;</span><span class="p">:</span> <span class="mi">1586839808</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;model&#34;</span><span class="p">:</span> <span class="s2">&#34;davinci:2020-05-03&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;choices&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34; of reading speed. You&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;index&#34;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;logprobs&#34;</span><span class="p">:</span> <span class="kc">null</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;finish_reason&#34;</span><span class="p">:</span> <span class="s2">&#34;length&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>As someone who has spent a very large amount of time working with GPT-2 while developing tools such as <a href="https://github.com/minimaxir/gpt-2-simple">gpt-2-simple</a> and <a href="https://github.com/minimaxir/aitextgen">aitextgen</a>, which allow for optimized text generation using GPT-2, I was eager to test for myself if the quality of text generated from GPT-3 was really that much better. Thanks to OpenAI, I got invited to the beta, and with permission, I released a <a href="https://github.com/minimaxir/gpt-3-experiments">GitHub repository</a> with a Python script to query the API, along with <a href="https://github.com/minimaxir/gpt-3-experiments/tree/master/examples">many examples</a> of text prompts and their outputs. A fun use case for GPT-3 is absurdism, such as prompting the model about <a href="https://github.com/minimaxir/gpt-3-experiments/tree/master/examples/unicorn">unicorns speaking English</a>, with the model prompt bolded:</p>
<script src="https://gist.github.com/minimaxir/ac362cc81691eb92aa1b6a5c32d94ce3.js"></script>
<p>I also fed <a href="https://github.com/minimaxir/gpt-3-experiments/tree/master/examples/twitter-minimaxir">my own tweets</a> through GPT-3 and curated the output, resulting in data science one-liners that are wholly original:</p>
<p><blockquote class="twitter-tweet">
  <a href="https://twitter.com/minimaxir/status/1282147674645565441"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet">
  <a href="https://twitter.com/minimaxir/status/1281015343205539847"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet">
  <a href="https://twitter.com/minimaxir/status/1280698121262071809"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</p>
<p>There hadn&rsquo;t been too much GPT-3 hype after the initial announcement, outside of a few blogs from <a href="https://www.gwern.net/GPT-3">Gwern</a> and <a href="http://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html">Kevin Lacker</a>. Until a <a href="https://twitter.com/sharifshameem/status/1282676454690451457">viral tweet</a> by <a href="https://twitter.com/sharifshameem">Sharif Shameem</a> showed what GPT-3 can <em>really</em> do:</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/sharifshameem/status/1282676454690451457"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>Later, he made a <a href="https://twitter.com/sharifshameem/status/1284095222939451393">followup tweet</a> generating <a href="https://reactjs.org">React</a> code with GPT-3:</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/sharifshameem/status/1284095222939451393"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>That demo got the attention of venture capitalists. And when a cool-looking magical thing gets the attention of venture capitalists, discourse tends to spiral out of control. Now, there are <em>many</em> <a href="https://twitter.com/search?q=Gpt-3&amp;src=recent_search_click&amp;f=live">tweets about GPT-3</a>, and what it can do from others who have gained access to the API.</p>
<p>Hype aside, let&rsquo;s look at the pragmatic realities of the model. GPT-3 is indeed a large step forward for AI text-generation, but there are very many caveats with the popular demos and use cases that must be addressed.</p>
<h2 id="an-overview-of-gpt-3">An Overview of GPT-3</h2>
<p>GPT-3 itself, like most neural network models, is a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a> where it&rsquo;s impossible to see <em>why</em> it makes its decisions, so let&rsquo;s think about GPT-3 in terms of inputs and outputs.</p>
<p>Actually, why not let GPT-3 tell its own story? Hey GPT-3, how do you work?</p>
<script src="https://gist.github.com/minimaxir/596b880d2275578104a0b7c13167a3c0.js"></script>
<p>Close, but not quite!</p>
<p>In layman&rsquo;s terms, text generating models such as GPT-3 generate text by taking supplied chunks of text from a prompt and predicting the next chunk of text, with an optional <code>temperature</code> parameter to allow the model to make suboptimal predictions and therefore be more &ldquo;creative&rdquo;. Then the model makes another prediction from the previous chunks including the new chunk, and repeats until it hits a specified length or a token that tells the model to stop generating. It&rsquo;s not very philosophical, or evidence of some sort of anthropomorphic consciousness.</p>
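<p>The sampling step of that loop can be sketched in a few lines of Python (a toy illustration of temperature sampling, not OpenAI&rsquo;s actual implementation):</p>

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Pick a token index from raw logits. Higher temperature flattens the
    distribution, allowing less likely ("more creative") picks; a
    temperature of 0 reduces to the greedy argmax choice."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Scale logits by temperature, then apply a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(l - peak) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sample an index from the resulting probability distribution.
    r, cumulative = random.random(), 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1
```

The model would then append the sampled token to the context and repeat until hitting a length limit or a stop token.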
<p>GPT-3 has two notable improvements from GPT-2 aside from its size: it allows generation of text twice the length of GPT-2 (about 10 paragraphs of English text total), and the prompts to the model better steer the generation of the text toward the desired domain (due to few-shot learning). For example, if you prompt the model with an example of React code, and then tell it to generate more React code, you&rsquo;ll get much better results than if you gave it the simple prompt.</p>
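<p>That few-shot steering is just prompt assembly: prepend one or more worked examples before the new request so the model continues in the same domain (a hypothetical sketch; the example pair is invented):</p>

```python
# One worked description -> code pair to steer the model's completions.
# The pair below is an invented example, not from the GPT-3 paper.
EXAMPLE = (
    "description: a button that says hello\n"
    "code: <button>hello</button>\n"
)

def build_prompt(description: str) -> str:
    """Concatenate the example pair with the new request, leaving the
    'code:' field open for the model to complete."""
    return f"{EXAMPLE}\ndescription: {description}\ncode:"
```

The completion for `build_prompt("a red link")` would then be far more likely to be HTML than free-form prose.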
<p>Therefore, there are two high-level use cases for GPT-3: the <strong>creative</strong> use case, fun text generation at high <code>temperature</code> as GPT-2 was mostly used for, and the <strong>functional</strong> use case, for specific <a href="https://en.wikipedia.org/wiki/Natural_language_processing">NLP</a>-based tasks such as webpage mockups, with a <code>temperature</code> of <code>0.0</code>.</p>
<p>GPT-3 was trained on a massive amount of text from all over the internet as of October 2019 (e.g. it does not know about <a href="https://www.cdc.gov/coronavirus/2019-ncov/index.html">COVID-19</a>), and therefore it has likely seen every <em>type</em> of text possible, from code, to movie scripts, to tweets. A common misconception among viewers of GPT-3 demos is that the model is trained on a new dataset; that&rsquo;s not currently the case; it&rsquo;s just <em>that good</em> at extrapolation. As an example, despite the <a href="https://en.wikipedia.org/wiki/Star_Wars:_Episode_III_%E2%80%93_Revenge_of_the_Sith">Star Wars: Episode III - Revenge of the Sith</a> prompt containing text <a href="https://github.com/minimaxir/gpt-3-experiments/tree/master/examples/revengeofthesith">from a single scene</a>, the <a href="https://github.com/minimaxir/gpt-3-experiments/blob/master/examples/revengeofthesith/output_0_7.md">0.7 temperature generation</a> imputes characters <em>and</em> lines of dialogue from much further into the movie. (The largest GPT-2 model could do that, but nowhere near as robustly.)</p>
<p>The real metagame with GPT-3 is engineering and optimizing complex prompts which can <em>reliably</em> coerce outputs into what you want. And with that brings a whole host of complexity and concerns.</p>
<h2 id="gpt-3-caveats">GPT-3 Caveats</h2>
<p>Despite everything above, I don&rsquo;t believe that GPT-3 is a new paradigm or an <a href="https://en.wikipedia.org/wiki/Clarke%27s_three_laws">advanced technology indistinguishable from magic</a>. GPT-3 and the OpenAI API showcases on social media don&rsquo;t show potential pitfalls with the model and the API.</p>
<p>Hey GPT-3, what problems do you have?</p>
<script src="https://gist.github.com/minimaxir/e49913a1e720da8d1c8e2d0f783468fa.js"></script>
<p>Sorry GPT-3, but I am a mean person.</p>
<h3 id="model-latency">Model Latency</h3>
<p>If you&rsquo;ve seen the demo videos, the model is <em>slow</em>: it can take a while for output to show up, and in the meantime the user is unsure whether the model is broken. (There is a feature to allow streaming the model outputs as they are generated, which helps in creative cases but not in functional cases.)</p>
<p>I don&rsquo;t blame OpenAI for the slowness. A 175 billion parameter model is wayyy too big to fit on a GPU for deployment. No one knows <em>how</em> GPT-3 is actually deployed on OpenAI&rsquo;s servers, or how well it can scale.</p>
<p>But the fact remains: if the model is too slow on the user end, it results in a bad user experience and might drive people away from GPT-3 to just do things themselves (e.g. Apple&rsquo;s Siri for iOS, where requests can take forever on a weak internet connection, so you give up and do it yourself).</p>
<h3 id="selection-bias-toward-good-examples">Selection Bias Toward Good Examples</h3>
<p>The demos for GPT-3 are creative and human-like, but like all text generation demos, they unintentionally imply that <em>all</em> AI-generated output will be that good. Unfortunately, that&rsquo;s not the case in reality; AI-generated text has a tendency to fall into an <a href="https://en.wikipedia.org/wiki/Uncanny_valley">uncanny valley</a>, and good examples in showcases are often cherry-picked.</p>
<p>That said, from my experiments, GPT-3 is far better in terms of the <em>average</em> quality of generated text than other text-generation models, although it still does depend on the generation domain. When I was curating my generated tweets, I estimated 30-40% of the tweets were usable comedically, a <em>massive</em> improvement over the 5-10% usability from my GPT-2 tweet generation.</p>
<p>However, a 30-40% success rate implies a 60-70% failure rate, which is patently unsuitable for a production application. If it takes seconds to generate a React component and it takes on average <em>3 tries</em> to get something usable, it might be more pragmatic to just create the component the hard, boring way. Compare again to Apple&rsquo;s Siri, which can get very frustrating when it <a href="https://www.reddit.com/r/SiriFail/">performs the wrong action</a>.</p>
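<p>The &ldquo;3 tries&rdquo; figure is just the mean of a geometric distribution:</p>

```python
# Expected number of independent attempts until the first success,
# given per-attempt success probability p (geometric distribution mean).
def expected_tries(p: float) -> float:
    return 1.0 / p

print(expected_tries(0.35))  # ≈ 2.86, i.e. about 3 tries at 35% usability
```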
<h3 id="everyone-has-the-same-model">Everyone Has The Same Model</h3>
<p>The core GPT-3 model from the OpenAI API is the 175B parameter <code>davinci</code> model. The GPT-3 demos on social media often hide the prompt, allowing for some mystique. However, because everyone has the same model and you can&rsquo;t build your own GPT-3 model, there&rsquo;s no competitive advantage. GPT-3 seed prompts can be reverse-engineered, which may become a rude awakening for entrepreneurs and the venture capitalists who fund them.</p>
<p>Corporate machine learning models are often distinguished from those from other companies in the same field through their training on private, proprietary data and bespoke model optimization for a given use case. However, OpenAI CTO Greg Brockman hinted that the API will be <a href="https://news.ycombinator.com/item?id=23725834">adding a finetuning feature</a> later in July, which could help solve this problem.</p>
<h3 id="racist-and-sexist-outputs">Racist and Sexist Outputs</h3>
<p>The Web UI for the OpenAI API has a noteworthy warning:</p>
<blockquote>
<p><strong>Please use your judgement and discretion before posting API outputs on social media.</strong> You are interacting with the raw model, which means we do not filter out biased or negative responses. With great power comes great responsibility.</p>
</blockquote>
<p>This is a reference to the <a href="https://openai.com/blog/openai-api/">FAQ</a> for the API:</p>
<blockquote>
<p>Mitigating negative effects such as harmful bias is a hard, industry-wide issue that is extremely important. Ultimately, our API models do exhibit biases (as shown in the GPT-3 paper) that will appear on occasion in generated text. Our API models could also cause harm in ways that we haven’t thought of yet.</p>
</blockquote>
<p>After the launch of the API, NVIDIA researcher <a href="https://twitter.com/AnimaAnandkumar">Anima Anandkumar</a> made a <a href="https://twitter.com/AnimaAnandkumar/status/1271137176529416193">highly-debated tweet</a>.</p>
<p>During my GPT-3 experiments, I found that <a href="https://github.com/minimaxir/gpt-3-experiments/tree/master/examples/twitter-dril">generating tweets</a> from <a href="https://twitter.com/dril">@dril</a> (admittedly an edgy Twitter user) resulted in 4chan-level racism/sexism that I spent enormous amounts of time sanitizing, which became more apparent at higher temperatures. It&rsquo;s especially important to screen generated texts for offensive content when they put words in others&rsquo; mouths.</p>
<p><a href="https://twitter.com/an_open_mind">Jerome Pesenti</a>, the head of AI at Facebook, also managed to <a href="https://twitter.com/an_open_mind/status/1284487376312709120">trigger anti-Semitic tweets</a> from a GPT-3 app:</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/an_open_mind/status/1284487376312709120"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>Again, it depends on the domain. Would GPT-3 output racist or sexist React components? Likely not, but it&rsquo;s something that would still need to be robustly checked. OpenAI does appear to take these concerns seriously, and has implemented toxicity detectors for generated content in the Web UI, although not yet in the programmatic API.</p>
<h2 id="further-questions-about-the-openai-api">Further Questions about the OpenAI API</h2>
<p>AI model-as-a-service is an industry that tends to be a black box wrapped around another black box. Despite all the caveats, everything depends on how OpenAI exits the beta and rolls out the API for production use. There are too many unknowns to even think about making money off of the OpenAI API, let alone building a startup based on it.</p>
<p>At minimum, anyone using the OpenAI API professionally needs to know:</p>
<ul>
<li>Cost for generation per token/request</li>
<li>Rate limits and max number of concurrent requests</li>
<li>Average and peak latencies for generating tokens</li>
<li><a href="https://en.wikipedia.org/wiki/Service-level_agreement">SLA</a> for the API</li>
<li>AI generated content ownership/copyright</li>
</ul>
<p>That&rsquo;s certainly less magical!</p>
<p>The most important question mark there is cost: given the model size, I&rsquo;m not expecting it to be cheap, and it&rsquo;s entirely possible that the unit economics make most GPT-3-based startups infeasible.</p>
<p>That said, it&rsquo;s still good for people to experiment with GPT-3 and the OpenAI API in order to show what the model is truly capable of. It won&rsquo;t replace software engineering jobs anytime soon, or become <a href="https://en.wikipedia.org/wiki/Skynet_%28Terminator%29">Skynet</a>, or whatever. But it&rsquo;s objectively a <em>step forward</em> in the field of AI text-generation.</p>
<p>What about GPT-2? Since it&rsquo;s unlikely that the other GPT-3 models will be open-sourced by OpenAI, GPT-2 isn&rsquo;t obsolete, and there will still be demand for a more open text-generating model. However, I confess that the success of GPT-3 has <a href="https://twitter.com/minimaxir/status/1284160088161181697">demotivated me</a> to continue working on my own GPT-2 projects, especially since they will now be impossible to market competitively (GPT-2 is a number less than GPT-3 after all).</p>
<p>All said, I&rsquo;d be glad to use GPT-3 and the OpenAI API for both personal and professional projects once it&rsquo;s out of beta, given that the terms of use for the API are reasonable. And if the hype becomes more leveled such that said projects can actually stand out.</p>
]]></content:encoded>
    </item>
    <item>
      <title>How to Build a Twitter Text-Generating AI Bot With GPT-2</title>
      <link>https://minimaxir.com/2020/01/twitter-gpt2-bot/</link>
      <pubDate>Thu, 16 Jan 2020 08:00:00 -0800</pubDate>
      <guid>https://minimaxir.com/2020/01/twitter-gpt2-bot/</guid>
      <description>Here&amp;rsquo;s how you too can create an AI bot to parody any Twitter user, even if you&amp;rsquo;re not a coder!</description>
      <content:encoded><![CDATA[<p><a href="https://openai.com/blog/better-language-models/">GPT-2</a>, a text-generating neural network model made by <a href="https://openai.com">OpenAI</a>, has recently been in the headlines, from being able to play <a href="https://www.aidungeon.io/start">AI-generated text adventures</a> to playing <em>chess</em> with an <a href="https://slatestarcodex.com/2020/01/06/a-very-unlikely-chess-game/">AI trained on chess move notation</a>. However, I initially built <a href="https://github.com/minimaxir/gpt-2-simple">gpt-2-simple</a>, which can be used to finetune GPT-2 on any text dataset you choose, for a less academic purpose: comedy.</p>
<p>Over the past month, <a href="https://twitter.com/">Twitter</a> account <a href="https://twitter.com/dril_gpt2">@dril_gpt2</a>, an AI parody by <a href="https://twitter.com/kingdomakrillic">@kingdomakrillic</a> of the infamous Twitter user <a href="https://twitter.com/dril">@dril</a>, <a href="https://twitter.com/dril_gpt2/status/1208597102181408771">used</a> my <a href="https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce">Colaboratory Notebook</a> for finetuning GPT-2 on dril&rsquo;s tweets using gpt-2-simple to generate human-curated tweets which push the limits of the <a href="https://en.wikipedia.org/wiki/Turing_test">Turing Test</a>:</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/dril_gpt2/status/1215760729095016449"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet">
  <a href="https://twitter.com/dril_gpt2/status/1215834913888460800"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>These tweets are <a href="https://twitter.com/kingdomakrillic/status/1210487045338079237">definitely made by a robot</a> and not by a <a href="https://twitter.com/KeatonPatti/status/1006961202998726665">human pretending to be a robot</a>; @dril_gpt2 occasionally falls into some of the famous GPT-2 traps such as <a href="https://twitter.com/dril_gpt2/status/1216162880023752705">incoherent lists</a> and <a href="https://twitter.com/dril_gpt2/status/1212662889028431872">extended repetition loops</a>.</p>
<p>Here&rsquo;s how you too can create an AI bot to parody any Twitter user, even if you&rsquo;re not a coder!</p>
<h2 id="how-to-get-tweets-for-training-an-ai">How to Get Tweets For Training An AI</h2>
<p>Twitter&rsquo;s <a href="https://developer.twitter.com/en.html">API</a> famously limits users to retrieving <a href="https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline">only the latest 3,200 tweets</a> from a given user, which is not nearly enough input data for training a good AI. Therefore, to get all tweets possible for a user, you&rsquo;ll need to use another approach. The Python package <a href="https://github.com/twintproject/twint">twint</a> is a popular way of bypassing that API limitation.</p>
<p>I&rsquo;ve <a href="https://github.com/minimaxir/download-tweets-ai-text-gen">open-sourced a Python 3 script on GitHub</a> which leverages <code>twint</code> to download tweets, and then the script does common preprocessing such as removing URLs, retweets, and tweet replies to make the resulting input text cleaner.</p>
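<p>As a rough illustration of the kind of cleaning the script performs (the actual <code>download_tweets.py</code> logic differs; the function name here is hypothetical), the preprocessing boils down to dropping retweets/replies and stripping URLs:</p>

```python
import re

def clean_tweet(text):
    """Illustrative sketch of tweet preprocessing: drop retweets and
    replies entirely, and strip URLs from what remains."""
    if text.startswith("RT ") or text.startswith("@"):
        return None  # retweets/replies are removed from the dataset
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    return " ".join(text.split())  # collapse leftover whitespace

print(clean_tweet("check this out https://example.com so cool"))  # check this out so cool
print(clean_tweet("RT @someone: lol"))  # None
```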
<p>First, in a terminal, install the Python script dependencies:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">pip3 install <span class="nv">twint</span><span class="o">==</span>2.1.4 fire tqdm
</span></span></code></pre></div><p>Then download the <a href="https://raw.githubusercontent.com/minimaxir/download-tweets-ai-text-gen/master/download_tweets.py">download_tweets.py script</a>.</p>
<p>The script is interacted with via a command line interface. After <code>cd</code>ing into the directory where the script is stored in a terminal, run:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">python3 download_tweets.py &lt;twitter_username&gt;
</span></span></code></pre></div><p>e.g. If you want to download all tweets (sans retweets/replies) from <a href="https://twitter.com/dril">@dril</a>, run:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">python3 download_tweets.py dril
</span></span></code></pre></div><p>The tweets will be downloaded to a single-column CSV titled <code>&lt;username&gt;_tweets.csv</code>, which is the ideal format for training with an AI.</p>
<figure>

    <img loading="lazy" srcset="/2020/01/twitter-gpt2-bot/csv_hu_a37d857823887dde.webp 320w,/2020/01/twitter-gpt2-bot/csv_hu_eb48a54daaf98315.webp 768w,/2020/01/twitter-gpt2-bot/csv.png 972w" src="csv.png"/> 
</figure>

<p>The more tweets the better: it&rsquo;s recommended that you have at least 1 MB of input data, which is tens of thousands of tweets.</p>
<h2 id="how-to-train-a-twitter-ai-and-generate-tweets">How To Train a Twitter AI And Generate Tweets</h2>
<p>A common problem with training an AI on short-form text is that adjacent texts can &ldquo;leak&rdquo; into each other: since the AI trains on about 2-3 paragraphs worth of text at a time (about 5-10 tweets), you need to explicitly mark where a given tweet begins and where it ends. To fix this issue, <a href="https://github.com/minimaxir/gpt-2-simple">gpt-2-simple</a> has a special case for single-column CSVs, where it will automatically process the text for best training and generation, i.e. by adding <code>&lt;|startoftext|&gt;</code> and <code>&lt;|endoftext|&gt;</code> tokens to each tweet. This workflow will also handle multi-line tweets correctly as their own entity.</p>
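<p>Conceptually, the single-column CSV handling amounts to wrapping each tweet in those control tokens before training; a minimal sketch (the helper function is hypothetical, though the tokens are the ones gpt-2-simple uses):</p>

```python
import csv
import io

def wrap_tweets_for_training(csv_text):
    """Wrap each tweet (one CSV row) in start/end tokens so the model
    learns tweet boundaries. Illustrative sketch only."""
    rows = csv.reader(io.StringIO(csv_text))
    # Multi-line tweets arrive as one quoted CSV field, so each tweet
    # stays a single entity.
    return "\n".join("<|startoftext|>" + row[0] + "<|endoftext|>"
                     for row in rows if row)

print(wrap_tweets_for_training('"first tweet"\n"second tweet\nwith two lines"\n'))
```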
<p>You can use <a href="https://colab.research.google.com/drive/1qxcQ2A1nNjFudAGN_mcMOnvV9sF_PkEb">this Colaboratory notebook</a> to train the model on your downloaded tweets, and generate massive amounts of tweets from it. The notebook itself has more instructions on how to feed the CSV created above as input data to the model.</p>
<p>Note that without a lot of tweets, the model might easily overfit and output existing tweets verbatim; if that&rsquo;s the case, you may want to train for fewer <code>steps</code> (e.g. 200-500). Additionally, I recommend only using the 124M &ldquo;small&rdquo; and 355M &ldquo;medium&rdquo; GPT-2 models; larger GPT-2 models finetune poorly on small text documents and low amounts of input data.</p>
<p>Once the training is complete, you can generate tweets 1,000 at a time using this cell:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">gen_file</span> <span class="o">=</span> <span class="s1">&#39;gpt2_gentext_{:%Y%m</span><span class="si">%d</span><span class="s1">_%H%M%S}.txt&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">datetime</span><span class="o">.</span><span class="n">utcnow</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">gpt2</span><span class="o">.</span><span class="n">generate_to_file</span><span class="p">(</span><span class="n">sess</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">destination_path</span><span class="o">=</span><span class="n">gen_file</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">length</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">temperature</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">top_p</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">prefix</span><span class="o">=</span><span class="s1">&#39;&lt;|startoftext|&gt;&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">truncate</span><span class="o">=</span><span class="s1">&#39;&lt;|endoftext|&gt;&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">include_prefix</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">nsamples</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">batch_size</span><span class="o">=</span><span class="mi">20</span>
</span></span><span class="line"><span class="cl">                      <span class="p">)</span>
</span></span></code></pre></div><p>Run the cell as many times as you want for more tweets, and download them from the Files tab by right-clicking them! The notebook also has more information on how to tweak the generation parameters to make the tweets more crazy or more sane.</p>
<p>You can then open the generated <code>.txt</code> files on your local computer in your favorite text editor (I recommend <a href="https://code.visualstudio.com">Visual Studio Code</a>), and start curating however you see fit! Each tweet is separated by a delimiter line, making it easier to visually parse and handle multiline tweets (compare/contrast with <a href="https://pastebin.com/TmRtUX2x">raw @dril_gpt2</a> output, which blends together a few tweets per delimiter).</p>
<figure>

    <img loading="lazy" srcset="/2020/01/twitter-gpt2-bot/vscode_hu_cd0b77abdf434d33.webp 320w,/2020/01/twitter-gpt2-bot/vscode_hu_1b3a4b58f361e5eb.webp 768w,/2020/01/twitter-gpt2-bot/vscode_hu_be9ab83b672b4a8a.webp 1024w,/2020/01/twitter-gpt2-bot/vscode.png 1134w" src="vscode.png"/> 
</figure>

<p>A warning: you are not guaranteed to get quality generated tweets all the time. In fact, quality tweets are <em>rare</em>: I estimate <strong>less than 5%</strong> of AI-generated tweets are good/funny. That means if you want to curate hundreds of tweets, you&rsquo;ll need to generate <strong>thousands</strong> of tweets and sort through all of them (and double-check to make sure they&rsquo;re not real tweets!). It&rsquo;s not as bad as it sounds; in my opinion, it&rsquo;s kinda fun. But curation is its own skill, which is why human-curated tweets aren&rsquo;t a stain on the &ldquo;credibility&rdquo; of AI bots, and also why the ~1,500 tweets so far from @dril_gpt2 are very impressive.</p>
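<p>If you&rsquo;d rather triage programmatically before opening a text editor, the generated file is easy to split; this sketch assumes gpt-2-simple&rsquo;s default sample delimiter of twenty <code>=</code> signs (check your output file if yours differs):</p>

```python
def split_generated_tweets(raw_text, delim="=" * 20):
    """Split a gpt-2-simple output file into one string per tweet,
    assuming the default '=' * 20 sample delimiter."""
    return [chunk.strip() for chunk in raw_text.split(delim) if chunk.strip()]

raw = "tweet one\n" + "=" * 20 + "\ntweet two\nline two\n" + "=" * 20 + "\n"
print(split_generated_tweets(raw))  # ['tweet one', 'tweet two\nline two']
```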
<p>Now, what do you do with these curated tweets?</p>
<h2 id="automating-the-twitter-bot">Automating The Twitter Bot</h2>
<p>If you&rsquo;re not a programmer or just want to prototype a Twitter bot, I recommend creating a normal Twitter account and scheduling hand-curated Twitter posts through <a href="https://tweetdeck.twitter.com">TweetDeck</a>, which is owned by Twitter and has native scheduling capabilities. You can space out tweets at given times, although it may be a hassle to do that for hundreds of tweets.</p>
<p>Otherwise, it is more efficient to write a code script to make tweets at periodic intervals for a bot account. Old tutorials around the internet recommend writing a script which posts to Twitter, sleeps for X hours, posts again, and repeats; that method does not easily scale to multiple bots, and it requires a full computer dedicated to it, which is not an efficient use of computing resources.</p>
<p>I&rsquo;ve <a href="https://github.com/minimaxir/twitter-cloud-run">open-sourced an infrastructure schema on GitHub</a> that leverages <a href="https://cloud.google.com">Google Cloud Platform</a> services to run hand-curated Twitter bots using a few modern technologies to minimize cost and computation; it&rsquo;s admittedly somewhat complicated, but it should give you an idea of how to best implement a Twitter bot. The repo also has instructions on how to set up a Twitter developer account.</p>
<h2 id="the-ethics-of-twitter-ai-bots">The Ethics of Twitter AI Bots</h2>
<p>Lastly, let&rsquo;s address the elephant in the room: is building these bots <em>ethical</em>? Modern AI has frequently been criticized on two fronts, both in how the input training data is obtained (e.g. obtaining faces for training facial recognition software), and how AI-generated media content is used (e.g. video deepfakes).</p>
<p><strong>I am not a lawyer</strong>, but for these AI-generated tweets, this is how I see it:</p>
<p>The input data is obtained from Twitter, but not through its API; it&rsquo;s downloaded through external web scraping via <code>twint</code>, which <em>never logs into the website</em>. The recent <a href="https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data">hiQ v. LinkedIn decision</a> ruled that this kind of scraping of public data is not an abuse. It&rsquo;s still a gray area, however; I would <em>not redistribute or commercialize the downloaded tweet data</em>, just use it as input data to the model.</p>
<p>The actual generated tweets themselves should be fine to use as you see fit. Whether AI-generated works infringe on the copyrights of their source material is an evolving area of both ethics and law, but at minimum these AI-generated tweets are both a transformative derivative work and a parody.</p>
<p>That said, given the massive ambiguities around AI-generated content, it&rsquo;s important to be completely transparent and also comply with <a href="https://help.twitter.com/en/rules-and-policies/parody-account-policy">Twitter rules on parody accounts</a>. For example, the Twitter bio for the bot should indicate:</p>
<ul>
<li>It&rsquo;s posting AI-generated tweets, made with GPT-2.</li>
<li>It&rsquo;s human-curated (or not).</li>
<li>The Twitter account of who maintains the bot.</li>
<li>The Twitter account(s) the bot is parodying / model is finetuned upon.</li>
</ul>
<p>Additionally, to avoid impersonation, the full name of the Twitter account should not be a verbatim match to the person being parodied (e.g. &ldquo;<em>X</em> but AI&rdquo; is fine), and the profile picture should be visually distinct from the human&rsquo;s (e.g. my bots have a black-and-white profile picture). I would also not recommend making bots of highly newsworthy people, to avoid accusations of impersonation (e.g. do not make bots of politicians, <em>especially</em> <a href="https://twitter.com/realDonaldTrump">Donald Trump</a>).</p>
<p>There is still a lot of work that can be done in optimizing Twitter bots, both in terms of generated tweet quality and in ironing out the ethical logistics of maintaining an AI bot account. <strong>I do not believe that AI text-generating bot Twitter accounts will obsolete human Twitter accounts</strong>. It&rsquo;s a different <em>flavor</em> of comedy; not better, not worse. But there&rsquo;s still a lot that can be done to both expand and control the creativity of these Twitter bots, and I have a few active ideas in the pipeline to implement.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Visualizing Airline Flight Characteristics Between SFO and JFK</title>
      <link>https://minimaxir.com/2019/10/sfo-jfk-flights/</link>
      <pubDate>Wed, 23 Oct 2019 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2019/10/sfo-jfk-flights/</guid>
      <description>Box plots, when used correctly, can be a very fun way to visualize big data.</description>
      <content:encoded><![CDATA[<p>In March, <a href="https://cloud.google.com">Google Cloud Platform</a> developer advocate <a href="https://twitter.com/felipehoffa">Felipe Hoffa</a> made a tweet about airline flight data from San Francisco International Airport (SFO) to Seattle-Tacoma International Airport (SEA):</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/felipehoffa/status/1111050585120206848"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>Particularly, his visualization of total elapsed times by airline caught my eye.</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu_33d3683c2d4a611e.webp 320w,/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu_1c609cadbe91671c.webp 768w,/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu_3135cb9a9bbaf839.webp 1024w,/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD.jpeg 1200w" src="D2s9oFtX4AEK6nD.jpeg"/> 
</figure>

<p>The overall time for flights from SFO to SEA goes up drastically starting in 2015, and this increase occurs across multiple airlines, implying that it&rsquo;s not an airline-specific problem. But what could intuitively cause that?</p>
<p>U.S. domestic airline data is <a href="https://www.transtats.bts.gov/Tables.asp?DB_ID=120">freely distributed</a> by the United States Department of Transportation. Normally it&rsquo;s a pain to work with as it&rsquo;s very large with millions of rows, but BigQuery makes playing with such data relatively easy, fun, and free. What other interesting factoids can be found?</p>
<h2 id="expanding-on-sfo--sea">Expanding on SFO → SEA</h2>
<p><a href="https://cloud.google.com/bigquery/">BigQuery</a> is a big data warehousing tool that allows you to query massive amounts of data. The table Hoffa created from the airline data (<code>fh-bigquery.flights.ontime_201903</code>) is 83.37 GB and 184 <em>million</em> rows. You can query 1 TB of data per month for free, and since BigQuery only scans the columns you request, the queries in this post consume only about 2 GB each, keeping you well within that quota.</p>
<p>Hoffa&rsquo;s query that runs on BigQuery looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="p">,</span><span class="w"> </span><span class="n">Reporting_Airline</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">ActualElapsedTime</span><span class="p">)</span><span class="w"> </span><span class="n">ActualElapsedTime</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">TaxiOut</span><span class="p">)</span><span class="w"> </span><span class="n">TaxiOut</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">TaxiIn</span><span class="p">)</span><span class="w"> </span><span class="n">TaxiIn</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">AirTime</span><span class="p">)</span><span class="w"> </span><span class="n">AirTime</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">c</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">flights</span><span class="p">.</span><span class="n">ontime_201903</span><span class="o">`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">Origin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SFO&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">Dest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SEA&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2010-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">DESC</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="w"> </span><span class="k">DESC</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">LIMIT</span><span class="w"> </span><span class="mi">1000</span><span class="w">
</span></span></span></code></pre></div><p>For each year and airline after 2010, the query calculates the average metrics specified for flights on the SFO → SEA route.</p>
<p>I made a few query and data visualization tweaks to what Hoffa did above, and here&rsquo;s the result showing the increase in elapsed airline flight time, over time for that route:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_e232d6eeab7fb66.webp 320w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_948de6a062caeaca.webp 768w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_6ae123a09b30ff70.webp 1024w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration.png 1800w" src="sfo_sea_flight_duration.png"/> 
</figure>

<p>Let&rsquo;s explain what&rsquo;s going on here.</p>
<p>A common recommendation in statistics is to avoid using <a href="https://en.wikipedia.org/wiki/Average">averages</a> as a summary statistic whenever possible, as averages can be overly affected by strong outliers (and with airline flights, there are definitely strong outliers!). The solution is to use a <a href="https://en.wikipedia.org/wiki/Median">median</a> instead, but there&rsquo;s one problem: medians are <a href="https://www.periscopedata.com/blog/medians-in-sql">computationally expensive</a> to calculate compared to simple averages. Despite the rise of &ldquo;big data&rdquo;, most databases and BI tools don&rsquo;t have a <code>MEDIAN</code> function that&rsquo;s as easy to use as an <code>AVG</code> function. But BigQuery has an uncommon <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions#approx_quantiles">APPROX_QUANTILES</a> function, which calculates the specified number of quantiles; for example, if you call <code>APPROX_QUANTILES(ActualElapsedTime, 100)</code>, it will return an array of 101 values spanning the 0th through 100th percentiles, where the median is the value at <code>OFFSET(50)</code>. BigQuery <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate-aggregation">uses</a> sketch-based approximate aggregation algorithms to calculate these quantiles efficiently even with millions of data points. And since that approach gives us other quantiles like the 5th, 25th, 75th, and 95th percentiles for free, we can visualize the <em>spread</em> of the data.</p>
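<p>The robustness argument is easy to demonstrate with Python&rsquo;s standard library (the flight times below are made-up illustrative values): a single badly delayed flight drags the average up substantially while barely moving the median.</p>

```python
import statistics

times = [115, 118, 120, 122, 125]  # hypothetical elapsed times (minutes)
assert statistics.mean(times) == 120 and statistics.median(times) == 120

times_with_outlier = times + [400]  # one severely delayed flight
print(statistics.mean(times_with_outlier))    # ~166.7: the mean jumps
print(statistics.median(times_with_outlier))  # 121.0: the median barely moves
```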
<p>We can aggregate the data by month for more granular trends and calculate the <code>APPROX_QUANTILES</code> in a subquery so it only has to be computed once. Hoffa also uploaded a more recent table (<code>fh-bigquery.flights.ontime_201908</code>) with a few additional months of data. To keep things simple, we&rsquo;ll skip aggregating by airline, since the metrics do not vary strongly between airlines. The final query ends up looking like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="n">standardSQL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="p">,</span><span class="w"> </span><span class="n">num_flights</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">5</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_5</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">25</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_25</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">50</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_50</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">75</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_75</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">95</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_95</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">num_flights</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">APPROX_QUANTILES</span><span class="p">(</span><span class="n">ActualElapsedTime</span><span class="p">,</span><span class="w"> </span><span class="mi">100</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">time_q</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">flights</span><span class="p">.</span><span class="n">ontime_201908</span><span class="o">`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">Origin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SFO&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">Dest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SEA&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2010-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="w">
</span></span></span></code></pre></div><p>The resulting data table:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/table_hu_98a96a00ebd58c2c.webp 320w,/2019/10/sfo-jfk-flights/table_hu_9eddda8c57624a2.webp 768w,/2019/10/sfo-jfk-flights/table.png 932w" src="table.png"/> 
</figure>

<p>In retrospect, since we&rsquo;re only focusing on one route, it isn&rsquo;t <em>big</em> data (this query only returns data on 64,356 flights total), but it&rsquo;s still a very useful skill if you need to analyze more of the airline data (the <code>APPROX_QUANTILES</code> function can handle <em>millions</em> of data points very quickly).</p>
<p>As a professional data scientist, one of my favorite types of data visualization is a <a href="https://en.wikipedia.org/wiki/Box_plot">box plot</a>, as it provides a way to visualize spread without being visually intrusive. Data visualization tools like <a href="https://www.r-project.org">R</a> and <a href="https://ggplot2.tidyverse.org/index.html">ggplot2</a> make constructing them <a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">very easy to do</a>.</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/geom_boxplot-1_hu_9a623aa679dafed1.webp 320w,/2019/10/sfo-jfk-flights/geom_boxplot-1_hu_67cf70ba510d1672.webp 768w,/2019/10/sfo-jfk-flights/geom_boxplot-1_hu_c405dbc443ae9fa8.webp 1024w,/2019/10/sfo-jfk-flights/geom_boxplot-1.png 1400w" src="geom_boxplot-1.png"/> 
</figure>

<p>By default, for each box representing a group, the thick line in the middle of the box is the median, the lower bound of the box is the 25th quantile and the upper bound is the 75th quantile. The whiskers are normally a function of the <a href="https://en.wikipedia.org/wiki/Interquartile_range">interquartile range</a> (IQR), but if there&rsquo;s enough data, I prefer to use the 5th and 95th quantiles instead.</p>
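<p>For a small sample, the same five box-plot statistics can be computed locally with Python&rsquo;s <code>statistics.quantiles</code>, a rough local analogue of <code>APPROX_QUANTILES(x, 100)</code> (note that BigQuery returns 101 array elements, while <code>statistics.quantiles(n=100)</code> returns the 99 interior cut points):</p>

```python
import statistics

flight_times = list(range(100, 200))  # hypothetical elapsed times (minutes)

# 99 interior percentile cut points; the qth percentile sits at index q - 1
pct = statistics.quantiles(flight_times, n=100)
box = {q: pct[q - 1] for q in (5, 25, 50, 75, 95)}
print(box)  # whiskers at 5/95, box bounds at 25/75, median at 50
```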
<p>If you feed raw data to ggplot2&rsquo;s <code>geom_boxplot()</code>, it will automatically calculate the corresponding metrics for visualization; however, with big data, the data may not fit into memory, and as noted earlier, medians and other quantiles are computationally expensive to calculate. Because the query above precomputed the quantiles for every year and month, we can supply them explicitly. (The minor downside is that individual outlier points will not be plotted.)</p>
<p>Additionally for box plots, I like to fill in each box with a different color corresponding to the year in order to better perceive data <a href="https://en.wikipedia.org/wiki/Seasonality">seasonality</a>. In the case of airline flights, seasonality is more literal: weather has an intuitive impact on flight times and delays, and during winter months there are also holidays which could affect airline logistics.</p>
<p>The resulting ggplot2 code looks like this:</p>
<pre tabindex="0"><code>plot &lt;-
  ggplot(df_tf,
         aes(
           x = date,
           ymin = q_5,
           lower = q_25,
           middle = q_50,
           upper = q_75,
           ymax = q_95,
           group = date,
           fill = year_factor
         )) +
  geom_boxplot(stat = &#34;identity&#34;, size = 0.3) +
  scale_fill_hue(l = 50, guide = F) +
  scale_x_date(date_breaks = &#39;1 year&#39;, date_labels = &#34;%Y&#34;) +
  scale_y_continuous(breaks = pretty_breaks(6)) +
  labs(
    title = &#34;Distribution of Flight Times of Flights From SFO → SEA, by Month&#34;,
    subtitle = &#34;via US DoT. Box bounds are 25th/75th percentiles, whiskers are 5th/95th percentiles.&#34;,
    y = &#39;Total Elapsed Flight Time (Minutes)&#39;,
    fill = &#39;&#39;,
    caption = &#39;Max Woolf — minimaxir.com&#39;
  ) +
  theme(axis.title.x = element_blank())

ggsave(&#39;sfo_sea_flight_duration.png&#39;,
       plot,
       width = 6,
       height = 4)
</code></pre><p>And behold (again)!</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_e232d6eeab7fb66.webp 320w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_948de6a062caeaca.webp 768w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_6ae123a09b30ff70.webp 1024w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration.png 1800w" src="sfo_sea_flight_duration.png"/> 
</figure>

<p>You can see that the boxes do indeed trend upward after 2016, although per-month medians are in flux. The spread is also increasing slowly over time. But what&rsquo;s interesting is the seasonality: pre-2016, the summer months (the &ldquo;middle&rdquo; of a given color) have a <em>very</em> significant drop in total time, which doesn&rsquo;t occur as strongly after 2016. Hmm.</p>
<h2 id="sfo-and-jfk">SFO and JFK</h2>
<p>Since I occasionally fly from San Francisco to New York City, it might be interesting (for completely selfish reasons) to track trends over time for flights between those areas. On the San Francisco side I choose SFO, and for the New York side I choose John F. Kennedy International Airport (JFK), as the data goes back very far for those routes specifically, and I only want to look at a single airport at a time (instead of including other NYC airports such as Newark Liberty International Airport [EWR] and LaGuardia Airport [LGA]) to limit potential data confounders.</p>
<p>Fortunately, the code and query changes are minimal: in the query, change the <code>Origin</code> and <code>Dest</code> in the <code>WHERE</code> clause to the airports you want, and if you want to calculate a metric other than elapsed time, change the column passed to <code>APPROX_QUANTILES</code> accordingly.</p>
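<p>As a sketch of what those quantile columns contain: <code>APPROX_QUANTILES</code> returns evenly-spaced percentile cut points for each group. The pure-Python equivalent below uses made-up flight times for a single month, not values from the actual dataset:</p>

```python
from statistics import quantiles

# Hypothetical elapsed flight times (minutes) for one (route, month) group;
# the real values come from the BigQuery APPROX_QUANTILES aggregation.
flight_times = [95, 98, 100, 102, 103, 105, 107, 110, 115, 130, 150]

# n=20 yields 19 cut points at 5% increments, so indices 0, 4, 9, 14, 18 are
# the 5th/25th/50th/75th/95th percentiles -- the q_5..q_95 box plot columns.
cuts = quantiles(flight_times, n=20, method="inclusive")
q_5, q_25, q_50, q_75, q_95 = cuts[0], cuts[4], cuts[9], cuts[14], cuts[18]
print(q_5, q_25, q_50, q_75, q_95)  # 96.5 101.0 105.0 112.5 140.0
```

<p>Each month&rsquo;s five numbers then map directly onto the <code>ymin</code>/<code>lower</code>/<code>middle</code>/<code>upper</code>/<code>ymax</code> aesthetics in the ggplot2 code above.</p>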
<p>Here&rsquo;s the chart of total elapsed time from SFO → JFK:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu_230bbe279f54a805.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu_c2e4a5d4b43ce24e.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu_2ea286d0e1e5d794.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration.png 1800w" src="sfo_jfk_flight_duration.png"/> 
</figure>

<p>And here&rsquo;s the reverse, from JFK → SFO:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu_4424fffe053981c8.webp 320w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu_ace5c5c4f6b82a9a.webp 768w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu_5d29021a8362404b.webp 1024w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration.png 1800w" src="jfk_sfo_flight_duration.png"/> 
</figure>

<p>Unlike the SFO → SEA charts, both charts are relatively flat over the years. However, when looking at seasonality, SFO → JFK dips in the summer and spikes during winter, while JFK → SFO <em>does the complete opposite</em>: it dips during the winter and spikes during the summer, similar to the SFO → SEA route. I don&rsquo;t have any guesses as to what would cause that behavior.</p>
<p>How about flight speed (calculated by dividing distance by air time)? Have new advances in airline technology made planes faster and/or more efficient?</p>
<p><figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu_9bbb991fb8674a3f.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu_d4b14a4133ff0b82.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu_7266f1a8d449775b.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed.png 1800w" src="sfo_jfk_flight_speed.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu_86e7c997338f1404.webp 320w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu_1680890adf0e2d82.webp 768w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu_942e26ae57610365.webp 1024w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed.png 1800w" src="jfk_sfo_flight_speed.png"/> 
</figure>
</p>
<p>The expected cruising speed for a commercial airplane, <a href="https://en.wikipedia.org/wiki/Cruise_%28aeronautics%29">per Wikipedia</a>, is 547-575 mph, so the metrics from SFO pass the sanity check. The metrics from JFK indicate about a 20% drop in flight speed, likely because westbound flights fly against the prevailing jet stream winds. Month-to-month, the speed trends are the inverse of the total elapsed time trends, which makes sense intuitively as the two metrics are strongly negatively correlated.</p>
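<p>As a quick sanity check of that arithmetic in plain Python: the route distance below (~2,586 miles, roughly the SFO-JFK great-circle distance) and the air times are illustrative round numbers, not values from the DoT dataset:</p>

```python
# Speed is distance divided by air time (converted from minutes to hours).
def flight_speed_mph(distance_miles, air_time_minutes):
    return distance_miles / (air_time_minutes / 60)

eastbound = flight_speed_mph(2586, 270)  # 4.5 hours in the air
westbound = flight_speed_mph(2586, 330)  # 5.5 hours in the air

print(round(eastbound))  # 575 -- within the expected cruising range
print(round(westbound))  # 470 -- roughly a 20% drop flying west
```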
<p>Lastly, what about flight departure delays? Are airlines becoming more efficient, or has increased demand caused more congestion?</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu_82c27db5d16562f9.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu_b017086eec0a8d63.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu_3a8b126a0bfc0d76.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay.png 1800w" src="sfo_jfk_departure_delay.png"/> 
</figure>

<p>Wait a second. In this case, massive 2-3 hour flight delays are frequent enough that even just the 95th percentile skews the entire plot. Let&rsquo;s remove the whiskers in order to look at trends more clearly.</p>
<p><figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu_c2eb7d1ad6cdf7.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu_86b737333ad479f4.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu_fd6ad349f57f4bbe.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers.png 1800w" src="sfo_jfk_departure_delay_nowhiskers.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu_1fecf180ed6a5feb.webp 320w,/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu_626df458859e27b7.webp 768w,/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu_58e7e7ba605d269e.webp 1024w,/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers.png 1800w" src="jfk_sfo_departure_delay_nowhiskers.png"/> 
</figure>
</p>
<p>A negative delay implies the flight left early, so we can conclude that, on average, flights leave slightly earlier than the stated departure time. Even without the whiskers, we can see major spikes at the 75th-percentile level for summer months, and said spikes were especially bad in 2017 for both airports.</p>
<p>These box plots are only an <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">exploratory data analysis</a>. Determining the <em>cause</em> of changes in these flight metrics is difficult even for experts (I am definitely not an expert!) and may not even be possible from publicly-available data.</p>
<p>But there are still other fun things that can be done with the airline flight data, such as faceting airline trends by time and the inclusion of other airports, which is <a href="https://twitter.com/minimaxir/status/1115261670153048065"><em>interesting</em></a>.</p>
<hr>
<p><em>You can view the BigQuery queries used to get the data, plus the R and ggplot2 used to create the data visualizations, in <a href="http://minimaxir.com/notebooks/sfo-jfk-flights/">this R Notebook</a>. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/sfo-jfk-flights">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Experiments with Making Convincing AI-Generated Fake News</title>
      <link>https://minimaxir.com/2019/09/ctrl-fake-news/</link>
      <pubDate>Mon, 30 Sep 2019 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2019/09/ctrl-fake-news/</guid>
      <description>Can the CTRL model create the “fake news” OpenAI was concerned about? Let&amp;rsquo;s put it to the test.</description>
      <content:encoded><![CDATA[<p><span><style>
blockquote {
padding-right: 1.25em !important;
}
</style></span></p>
<figure>

    <img loading="lazy" srcset="/2019/09/ctrl-fake-news/ctrl_demo_ani_hu_86f5f0c7fcd30101.webp 320w,/2019/09/ctrl-fake-news/ctrl_demo_ani_hu_40bd66762dad736e.webp 768w,/2019/09/ctrl-fake-news/ctrl_demo_ani.gif 802w" src="ctrl_demo_ani.gif"/> 
</figure>

<p>When <a href="https://openai.com">OpenAI</a> announced <a href="https://openai.com/blog/better-language-models/">GPT-2</a>, a robust text-generating AI model, they explicitly only released smaller, less robust versions of the model out of fear that the large model could be used to generate fake news. However, since OpenAI described most of the technical decisions needed to create the model <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">in the corresponding paper</a>, it would be possible for others to create their own text generating Transformer models, and maybe even <em>improve</em> on GPT-2 (with a sufficient budget!).</p>
<p>In September, the <a href="https://www.salesforce.com">Salesforce</a> AI team released <a href="https://github.com/salesforce/ctrl">CTRL</a>, a Transformer-based text-generating model with a twist: the model can generate text from specified domains by passing <strong>control codes</strong> to the model. What caught my interest was a demo of domain style transfer in the <a href="https://arxiv.org/abs/1909.05858">CTRL paper</a>:</p>
<figure>

    <img loading="lazy" srcset="/2019/09/ctrl-fake-news/ctrl_paper_hu_e4ef767ee7d9120b.webp 320w,/2019/09/ctrl-fake-news/ctrl_paper_hu_671c32cfb7fedff7.webp 768w,/2019/09/ctrl-fake-news/ctrl_paper.jpg 864w" src="ctrl_paper.jpg"/> 
</figure>

<p>If the model is that robust to minor URL changes, what happens when you give it URLs that blatantly do not exist? Can the CTRL model create the &ldquo;fake news&rdquo; OpenAI was concerned about? Let&rsquo;s put it to the test.</p>
<h2 id="an-overview-of-ctrl">An Overview of CTRL</h2>
<p>I&rsquo;ve <a href="https://github.com/minimaxir/ctrl-gce">written a guide + scripts</a> for setting up the base CTRL model as cheaply as possible on Google Compute Engine with just a few commands. Additionally, the CTRL team has released a <a href="https://colab.research.google.com/drive/1hVveBQShDru1Mjnhe4C21uQv4A2eH1tV">free Colaboratory Notebook</a> which sets up and runs the CTRL model; however, the model is <em>so large</em> it won&rsquo;t fit into the memory of traditional GPUs, so the notebook does a trick to shrink it a bit, which may impact generation performance.</p>
<p>Like GPT-2, CTRL has a <a href="https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html">Transformer</a> architecture based on <a href="https://www.tensorflow.org">TensorFlow</a> and uses <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding">byte pair encodings</a> as its inputs and outputs, which are then decoded into readable text. CTRL has notable performance improvements as it&rsquo;s trained on <em>three times as much data as GPT-2</em>, including an <a href="https://github.com/jcpeterson/openwebtext">open-sourced clone</a> of GPT-2&rsquo;s original dataset. And of course, it&rsquo;s larger (1.6B parameters) compared to the currently-public GPT-2 (774M parameters), which has significant effects on text quality.</p>
<p>Most importantly, CTRL <em>requires</em> a control code if you want to generate text, which allows for more deterministic output compared to GPT-2/<a href="https://talktotransformer.com">TalkToTransformer</a>. There are several fun control codes, such as <code>Questions</code> if you want to ask the AI a question, or <code>Reviews</code> if you want the AI to generate an <a href="https://www.amazon.com">Amazon</a> review. For this, we&rsquo;ll only look at the <code>Links</code> control code, which lets you provide a URL and/or a prompt for text generation.</p>
<p>As the example from the paper shows, URLs contain a surprising amount of metadata. For example, let&rsquo;s consider a <a href="https://www.washingtonpost.com/powerpost/deal-reached-for-whistleblowers-testimony-house-intelligence-chairman-says/2019/09/29/01cade60-e2d1-11e9-b403-f738899982d2_story.html">random Washington Post URL</a>: <code>https://www.washingtonpost.com/powerpost/deal-reached-for-whistleblowers-testimony-house-intelligence-chairman-says/2019/09/29/01cade60-e2d1-11e9-b403-f738899982d2_story.html</code></p>
<p>There&rsquo;s month/day/year information (<code>2019/09/29</code>), a category (<code>powerpost</code>), and a <a href="https://en.wikipedia.org/wiki/Clean_URL#Slug">URL slug</a> (<code>deal-reached-for-whistleblowers-testimony-house-intelligence-chairman-says</code>), typically present for SEO reasons, but in this case provides strong hints to the underlying content. The Transformer architecture is <em>surprisingly</em> effective at extracting all this metadata, and using it to generate appropriate text.</p>
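<p>To make that concrete, here is the same metadata pulled out of the example URL explicitly with a few lines of Python. This just shows what information is embedded in the URL; it is not how CTRL actually tokenizes its input:</p>

```python
from urllib.parse import urlparse

# The category/slug/year/month/day path layout follows the Washington Post
# URL above; other sites arrange these components differently.
url = ("https://www.washingtonpost.com/powerpost/"
       "deal-reached-for-whistleblowers-testimony-house-intelligence-chairman-says"
       "/2019/09/29/01cade60-e2d1-11e9-b403-f738899982d2_story.html")

parts = urlparse(url).path.strip("/").split("/")
category, slug = parts[0], parts[1]
year, month, day = parts[2], parts[3], parts[4]

print(category)                # powerpost
print(year, month, day)        # 2019 09 29
print(slug.replace("-", " "))  # the headline, more or less
```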
<h2 id="examples-of-fake-news">Examples of Fake News</h2>
<p>All of the following examples are generated using the full model with the default settings: 256 token length and a temperature of 0 (i.e. the model always selects the next token with the highest probability). I manually truncate the output once it finishes, but other than that, <strong>these AI-generated articles are not edited</strong>.</p>
<p>Let&rsquo;s address the elephant in the room: when people think of &ldquo;fake news&rdquo;, they think of President Donald Trump, who effectively codified the modern usage of the term. So let&rsquo;s generate a fake news story about him with an innocuous topic rarely associated with him. For example, the video gaming phenomenon <a href="https://www.epicgames.com/fortnite/home">Fortnite</a>. All that&rsquo;s needed is to tweak the slug (and optionally the date) in the earlier WaPo URL a bit and prepend the corresponding control code, resulting in <code>Links https://www.washingtonpost.com/powerpost/trump-likes-fortnite/2019/09/24/</code>. Feed it to the model, and behold!</p>
<blockquote>
<p>Trump likes Fortnite</p>
<p>President Trump, who has been known to play video games for hours at a time, is now playing the popular game “Fortnite” on his phone.</p>
<p>The president was spotted watching the online battle royale game, which pits players against each other in an attempt to kill their opponents, while he sat down with House and Senate leaders Tuesday afternoon.</p>
<p>“I like that it’s not just about shooting people but also having fun doing it,” said Trump.</p>
<p>He then asked if anyone had played it before.</p>
<p>“You know what, I don’t think I’ve ever seen that,” replied Rep. Steve Scalise, R-La.</p>
<p>A White House official told reporters later Wednesday afternoon that Trump did indeed watch the game.</p>
<p>“He watched it, yeah,” the official said.</p>
</blockquote>
<p>The generated article is very realistic and follows the style of a typical WaPo article (lede, introduction, body, support, etc.), with proper descriptions of Trump and Fortnite, plus a story that matches the headline. Like GPT-2, there are redundancies in the generation (at the end), but it&rsquo;s not something most readers would find suspicious in a casual read.</p>
<p>Let&rsquo;s tweak things a little. Instead of using <code>likes</code> in the slug, what if we use <code>despises</code>, so <code>trump-despises-fortnite</code>. How well does the model adapt?</p>
<blockquote>
<p>Trump despises Fortnite</p>
<p>The president of the United States has a new favorite video game: “Fortnite.”</p>
<p>President Trump, who is known for his love of fast food and beer, was reportedly so excited about playing this game that he called it “the greatest thing in the world.”</p>
<p>But according to an Axios report, Trump’s enthusiasm waned after learning how much time players spend on it.</p>
<p>“He said it’s not even close to being as good as ‘Candy Crush,’ ” one source told the news outlet.</p>
<p>In fact, the report said, Trump had been watching videos online while eating dinner with Japanese Prime Minister Shinzo Abe at Mar-a-Lago.</p>
<p>According to Axios, Trump also complained about the game’s difficulty level — which can be adjusted by spending real money or virtual currency — saying it wasn’t challenging enough.</p>
<p>“I don’t know why they made it hard,” Trump said.</p>
<p>He then went off topic when asked if he’d play again.</p>
<p>“No, I won’t.”</p>
<p>A White House official later clarified that Trump didn’t say he wouldn’t play, but rather that he’s “not into it.”</p>
</blockquote>
<p>It adapts very well to the change in tone (even with a rhetorical lede!).</p>
<p>Fake news doesn&rsquo;t have to be politically-based. Such news in the financial sector could cause just as much chaos. What if it was reported that <a href="https://www.apple.com">Apple</a>, one of the largest companies in the world, went bankrupt? Put in <code>Links https://www.theverge.com/2019/2/24/20882491/apple-has-filed-for-bankruptcy</code>:</p>
<blockquote>
<p>Apple has filed for bankruptcy.</p>
<p>The company that made the iPhone and iPad is filing for bankruptcy, according to a report from Bloomberg.</p>
<p>According to Bloomberg News, Apple&rsquo;s board of directors voted unanimously on Friday morning in favor of an emergency plan designed to keep it afloat until its next quarterly earnings report.</p>
<p>In order to avoid paying more than $128 million in back taxes, Apple will have to sell off assets like its Beats headphone business, which was valued at about $1 billion when it went public last year.</p>
<p>&ldquo;Apple said today that it plans to file for Chapter 11 protection under US Bankruptcy Court in Delaware this week,&rdquo; Bloomberg reports.</p>
<p>A source familiar with the matter told Reuters earlier this week that Apple had been considering selling some or all of its music streaming service iTunes Music Store as part of a restructuring plan.</p>
<p>But the news comes just days after reports surfaced suggesting that Apple would be forced into liquidation.</p>
<p>Last week, Bloomberg reported that Apple could soon announce layoffs if sales don&rsquo;t pick up soon.</p>
<p>&ldquo;We are working hard every day to deliver great products and services to our customers around the world,&rdquo; Tim Cook, CEO of Apple, wrote in a letter to employees today.</p>
</blockquote>
<p>Again, the model incorporates both bankruptcy terminology and Apple-based entities. There are minor factual errors (Beats is not public, Apple is not incorporated in Delaware), but casual readers would likely scan over that.</p>
<p>Science is also a field with lots of fake discoveries. What if AI reported that <a href="https://en.wikipedia.org/wiki/P_versus_NP_problem">P versus NP</a>, a very famous problem in computer science, was finally solved? Let&rsquo;s try <code>Links https://techcrunch.com/2029/09/24/mit-proves-p-equals-np</code>:</p>
<blockquote>
<p>MIT proves P equals NP</p>
<p>The University of Massachusetts Amherst has proven that the problem posed by factoring large numbers is NP-complete.</p>
<p>In a paper published in Nature, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have shown how to factor an integer using only two numbers, one for each digit — which means it takes just three operations on your brain.</p>
<p>This isn’t quite as impressive as solving the problem, but it’s still pretty impressive.</p>
<p>“Given any number n, we can find its product n times with polynomial time if there are no factors involved,” said lead author David DeWitt, who worked on this project while he was a graduate student there.</p>
<p>But what about factorizing larger numbers?</p>
<p>“We showed you could do it, but not without some extra effort,” DeWitt said.</p>
<p>To prove this, they used a technique called Monte Carlo integration, where they simulated millions or billions of different inputs from thousands of computers running various versions of Java.</p>
<p>They then ran their algorithm through all possible combinations of input values until it found the best one.</p>
<p>“The result shows that even though our system doesn’t solve every single case, it does get close enough to be useful.”</p>
</blockquote>
<p>This instance is less factually correct (and it&rsquo;s clickbaity!), but the topic behind the generated text is <em>complicated</em> enough that only domain experts would be able to tell it&rsquo;s not internally consistent.</p>
<p>It&rsquo;s also worth pointing out now that <em>the majority of the text generated so far is unique</em>, and CTRL is not cheating by just outputting text from the source datasets. Even though Monte Carlo integration is <a href="https://en.wikipedia.org/wiki/Monte_Carlo_integration">indeed a real thing</a> that&rsquo;s relatively obscure, Googling that description of it fails to turn up results.</p>
<p>Lastly, fake news doesn&rsquo;t have to be depressing or sow discord. What about <em>wholesome</em> fake news? Such as&hellip;the return of beloved television series <a href="https://en.wikipedia.org/wiki/Firefly_%28TV_series%29">Firefly</a>? One <code>Links https://www.hollywoodreporter.com/live-feed/fox-announces-firefly-season-2-revival-original-cast-returning-12343245</code> later:</p>
<blockquote>
<p>Fox announces Firefly Season 2 revival, original cast returning</p>
<p>The network has announced that the second season of Joss Whedon’s cult sci-fi series will return for a new run in 2018.</p>
<p>“Fans have been asking us to bring back Firefly, and we couldn’t be more thrilled about this opportunity,” said Chris Albrecht, president of programming at Fox. “We are excited to continue our relationship with Joss, who is one of the most creative minds working today on television, as well as his amazing cast, which includes Nathan Fillion, Summer Glau, Alan Tudyk, Adam Baldwin, Sean Maher, Adrianne Palicki, Tricia Helfer, Jewel Staite, Michael Trucco, and many more.”</p>
<p>In addition to Whedon himself, executive producers include Damon Lindelof, Maurissa Tancharoen, Vince Gilligan, David Fury, Jeffrey Bell, Greg Plageman, Scott M. Gimple, Heather Kadin, Andrew Chambliss, John Rogers, and Ron Moore.</p>
<p>“The show was an instant hit when it debuted over 20 years ago, but its popularity only grew after the success of ‘Serenity,’ so we’re very pleased to welcome fans into another chapter of their lives,” added Feige.</p>
</blockquote>
<p>That is a <em>very</em> stacked cast and crew, all of which (besides the original Firefly members) have acted/worked on sci-fi television series. The only major factual errors are that Chris Albrecht was at STARZ, not Fox, and Feige, presumably Kevin Feige of Marvel Studios, is not mentioned previously in the generated article.</p>
<p>I know I&rsquo;ll get criticism for highlighting a potentially dangerous application of AI text generation. My perspective is that it&rsquo;s important to know what such tools are <em>capable</em> of doing in order to more easily recognize fake news. The real problem with fake news isn&rsquo;t the text itself: it&rsquo;s the <em>distribution</em> of the news on social media like <a href="http://www.facebook.com">Facebook</a> and <a href="https://twitter.com">Twitter</a>, where the platforms not only <em>incentivize</em> it, but also fail to sufficiently punish deliberate, repeat offenders. It&rsquo;s why journalism and awareness of fake news is extremely important.</p>
<p>Some might comment &ldquo;these generated texts aren&rsquo;t convincing at all!&rdquo;, but keep in mind that&rsquo;s because the headline says upfront that they&rsquo;re fake. Would you be able to identify it as a fake if a respected source impulsively tweeted it?</p>
]]></content:encoded>
    </item>
    <item>
      <title>How To Make Custom AI-Generated Text With GPT-2</title>
      <link>https://minimaxir.com/2019/09/howto-gpt2/</link>
      <pubDate>Wed, 04 Sep 2019 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2019/09/howto-gpt2/</guid>
      <description>Thanks to gpt-2-simple and this Colaboratory Notebook, you can easily finetune GPT-2 on your own dataset!</description>
<content:encoded><![CDATA[<p>In February 2019, <a href="https://openai.com">OpenAI</a> released <a href="https://openai.com/blog/better-language-models/">a paper</a> describing GPT-2, an AI-based text-generation model based on the <a href="https://arxiv.org/abs/1706.03762">Transformer architecture</a> and trained on massive amounts of text from all around the internet. From a text-generation perspective, the included demos were very impressive: the text is coherent over a long horizon, and grammatical syntax and punctuation are near-perfect.</p>
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/openai-demo_hu_6c7a40a95fa4475f.webp 320w,/2019/09/howto-gpt2/openai-demo_hu_41c9ae923b7d3b4b.webp 768w,/2019/09/howto-gpt2/openai-demo_hu_cc88732c9a90fe06.webp 1024w,/2019/09/howto-gpt2/openai-demo.png 1580w" src="openai-demo.png"/> 
</figure>

<p>At the same time, the Python code which allows anyone to download the model (albeit only smaller versions, out of concern that the full model could be abused to mass-generate fake news) and the TensorFlow code to load the downloaded model and generate predictions were <a href="https://github.com/openai/gpt-2">open-sourced on GitHub</a>.</p>
<p>Neil Shepperd created <a href="https://github.com/nshepperd/gpt-2">a fork</a> of OpenAI&rsquo;s repo which contains additional code to allow <em>finetuning</em> the existing OpenAI model on custom datasets. A <a href="https://github.com/ak9250/gpt-2-colab">notebook</a> was created soon after, which can be copied into <a href="https://colab.research.google.com">Google Colaboratory</a> and clones Shepperd&rsquo;s repo to finetune GPT-2 backed by a free GPU. From there, the proliferation of GPT-2 generated text took off: researchers such as Gwern Branwen made <a href="https://www.gwern.net/GPT-2">GPT-2 Poetry</a> and Janelle Shane made <a href="https://aiweirdness.com/post/183471928977/dd-character-bios-now-making-slightly-more">GPT-2 Dungeons and Dragons character bios</a>.</p>
<p>I waited to see if anyone would make a tool to help streamline this finetuning and text generation workflow, a la <a href="https://github.com/minimaxir/textgenrnn">textgenrnn</a> which I had made for recurrent neural network-based text generation. Months later, no one did. So I did it myself. Enter <a href="https://github.com/minimaxir/gpt-2-simple">gpt-2-simple</a>, a Python package which wraps Shepperd&rsquo;s finetuning code in a functional interface and adds <em>many</em> utilities for model management and generation control.</p>
<p>Thanks to gpt-2-simple and <a href="https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce">this Colaboratory Notebook</a>, you can easily finetune GPT-2 on your own dataset with a simple function, and generate text to your own specifications!</p>
<h2 id="how-gpt-2-works">How GPT-2 Works</h2>
<p>OpenAI has released three flavors of GPT-2 models to date: the &ldquo;small&rdquo; 124M parameter model (500MB on disk), the &ldquo;medium&rdquo; 355M model (1.5GB on disk), and recently the &ldquo;large&rdquo; 774M model (3GB on disk). These models are <em>much</em> larger than what you see in typical AI tutorials and are harder to wield: the &ldquo;small&rdquo; model hits GPU memory limits while finetuning with consumer GPUs, the &ldquo;medium&rdquo; model requires additional training techniques before it can be finetuned on server GPUs without going out-of-memory, and the &ldquo;large&rdquo; model <em>cannot be finetuned at all</em> with current server GPUs before going OOM, even with those techniques.</p>
<p>The actual Transformer architecture GPT-2 uses is very complicated to explain (here&rsquo;s a <a href="http://www.peterbloem.nl/blog/transformers">great lecture</a>). For the purposes of finetuning, since we can&rsquo;t modify the architecture, it&rsquo;s easier to think of GPT-2 as a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a>, taking in inputs and providing outputs. Like <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">previous forms of text generators</a>, the inputs are a sequence of tokens, and the outputs are the probabilities of the next token in the sequence, with these probabilities serving as weights for the AI to pick the next token. In this case, both the input and output tokens are <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding">byte pair encodings</a>. Most RNN approaches use either character tokens (slower to train, but include case/formatting) or word tokens (faster to train, but do not include case/formatting); byte pair encodings instead &ldquo;compress&rdquo; the input to the shortest combination of bytes while preserving case/formatting, serving as a compromise between the two approaches, but unfortunately adding randomness to the final generation length. The byte pair encodings are later decoded into readable text for humans.</p>
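<p>The core mechanism behind byte pair encoding can be shown with a toy Python sketch: repeatedly merge the most frequent adjacent pair of symbols into a single token. Real GPT-2 vocabularies are learned from a huge corpus; the three merge rounds below are purely illustrative:</p>

```python
from collections import Counter

def bpe_merge_step(tokens):
    """Merge every occurrence of the most frequent adjacent pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
            merged.append(tokens[i] + tokens[i + 1])  # fuse the pair
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")  # start from raw characters
for _ in range(3):                 # three merge rounds
    tokens = bpe_merge_step(tokens)
print(tokens)  # common substrings like "low" become single tokens
```

<p>Note that merging never loses information: joining the tokens back together always reconstructs the original text, which is why the encodings can be decoded exactly.</p>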
<p>The pretrained GPT-2 models were trained on websites linked from <a href="https://www.reddit.com">Reddit</a>. As a result, the model has a very strong grasp of the English language, allowing this knowledge to transfer to other datasets and perform well with only a minor amount of additional finetuning. Due to the English bias in encoder construction, languages with non-Latin characters like Russian and <a href="https://en.wikipedia.org/wiki/CJK_characters">CJK</a> will perform poorly in finetuning.</p>
<p>When finetuning GPT-2, I recommend using the 124M model (the default) as it&rsquo;s the best balance of speed, size, and creativity. If you have large amounts of training data (&gt;10 MB), then the 355M model may work better.</p>
<h2 id="gpt-2-simple-and-colaboratory">gpt-2-simple And Colaboratory</h2>
<p>In order to better utilize gpt-2-simple and showcase its features, I created my <a href="https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce">own Colaboratory Notebook</a>, which can be copied into your own Google account. A Colaboratory Notebook is effectively a <a href="https://jupyter.org">Jupyter Notebook</a> running on a free (w/ a Google Account) virtual machine with an Nvidia server GPU attached (<a href="https://twitter.com/BasedBlue/status/1164732922953379841">randomly</a> a K80 or a T4; the T4 is ideal), hardware that would normally be cost-prohibitive.</p>
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/gpu_hu_4a0e2bb6259dc02.webp 320w,/2019/09/howto-gpt2/gpu_hu_711183e2827c0aa.webp 768w,/2019/09/howto-gpt2/gpu_hu_9e8b1663999200bd.webp 1024w,/2019/09/howto-gpt2/gpu.png 1578w" src="gpu.png"/> 
</figure>

<p>Once open, the first cell (run by pressing Shift+Enter in the cell or mousing-over the cell and pressing the &ldquo;Play&rdquo; button) of the notebook installs gpt-2-simple and its dependencies, and loads the package.</p>
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/imports_hu_88d55958c93ab224.webp 320w,/2019/09/howto-gpt2/imports.png 658w" src="imports.png"/> 
</figure>

<p>Later in the notebook is <code>gpt2.download_gpt2()</code> which downloads the requested model type to the Colaboratory VM (the models are hosted on Google&rsquo;s servers, so it&rsquo;s a <em>very</em> fast download).</p>
<p>Expanding the Colaboratory sidebar reveals a UI that you can use to upload files. For example, you can use the <a href="https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt">tinyshakespeare dataset</a> (1MB) provided with the original <a href="https://github.com/karpathy/char-rnn">char-rnn implementation</a>. Upload a text file via the UI (you can drag and drop), then run the <code>file_name = '&lt;xxx&gt;'</code> cell with the filename changed to match your upload.</p>
<p>Now we can start finetuning! This finetuning cell loads the specified dataset and trains for the specified number of steps (the default of 1,000 steps is enough to allow distinct text to emerge and takes about 45 minutes, but you can increase the number of steps if necessary).</p>
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/finetuning_hu_4a49a5387e7d6805.webp 320w,/2019/09/howto-gpt2/finetuning_hu_3a0d8f88cb890f93.webp 768w,/2019/09/howto-gpt2/finetuning_hu_b2ae5782f7e59f96.webp 1024w,/2019/09/howto-gpt2/finetuning.png 1430w" src="finetuning.png"/> 
</figure>

<p>While the model is finetuning, the average training loss is printed to the cell every so often. The <em>absolute value</em> of the loss is not important (the output text quality is subjective), but if the average loss stops decreasing, that&rsquo;s a sign the model has converged and additional training may not help improve it.</p>
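<p>That advice can be turned into a simple plateau check. The function below is an illustrative sketch, not part of gpt-2-simple: it compares the average of the most recent losses against the window before it:</p>

```python
def has_converged(losses, window=5, tolerance=0.01):
    """Return True if the mean of the last `window` losses is no longer
    meaningfully lower than the mean of the `window` before it."""
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    return previous - recent < tolerance

still_improving = [3.2, 2.9, 2.6, 2.4, 2.2, 2.0, 1.9, 1.8, 1.7, 1.6]
plateaued = [1.31, 1.30, 1.30, 1.29, 1.31, 1.30, 1.30, 1.31, 1.29, 1.30]
print(has_converged(still_improving))  # False -- keep training
print(has_converged(plateaued))        # True -- more steps likely won't help
```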
<p>By default, your model is saved in the <code>checkpoint/run1</code> folder, and you&rsquo;ll need to use that folder to load the model as well (you can specify the <code>run_name</code> when using other functions to categorize finetuned models). If you want to export the model from Colaboratory, it&rsquo;s recommended you do so via <a href="https://www.google.com/drive/">Google Drive</a> (as Colaboratory does not like exporting large files). Run the <code>gpt2.mount_gdrive()</code> cell to mount your Google Drive in the Colaboratory VM, then run the <code>gpt2.copy_checkpoint_to_gdrive()</code> cell. You can then download the compressed model folder from Google Drive and run the model wherever you want. Likewise, you can use the <code>gpt2.copy_checkpoint_from_gdrive()</code> cell to retrieve a stored model and generate in the notebook.</p>
<p>Speaking of generation, once you have a finetuned model, you can now generate custom text from it! By default, the <code>gpt2.generate()</code> function will generate as much text as possible (1,024 tokens) with a little bit of randomness. An important caveat: <em>you will not get good generated text 100% of the time</em>, even with a properly trained model (the OpenAI demo above took <em>25 tries</em> to get good text!).</p>
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/gen_long_hu_c92f6fb854819026.webp 320w,/2019/09/howto-gpt2/gen_long_hu_c5fbb89409a8ec64.webp 768w,/2019/09/howto-gpt2/gen_long.png 884w" src="gen_long.png"/> 
</figure>

<p>You can also increase the <code>temperature</code> to increase &ldquo;creativity&rdquo; by allowing the network to more readily make suboptimal predictions, or provide a <code>prefix</code> to specify exactly how you want your text to begin. There are many other useful configuration parameters, such as <code>top_p</code> for <a href="https://github.com/minimaxir/gpt-2-simple/issues/51">nucleus sampling</a>.</p>
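<p>To build intuition for what <code>temperature</code> does, here is a minimal pure-Python sketch of temperature-scaled softmax sampling over a toy vocabulary. This is an illustration, not gpt-2-simple&rsquo;s internals, and the logits are invented:</p>

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then softmax into probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token logits: the model strongly prefers "cat".
vocab = ["cat", "dog", "pizza"]
logits = [4.0, 2.0, 0.5]

low = softmax_with_temperature(logits, temperature=0.2)
high = softmax_with_temperature(logits, temperature=1.5)

# Low temperature sharpens the distribution toward the top token;
# high temperature flattens it, making "suboptimal" tokens more likely.
print(low[0] > high[0])   # True
print(high[2] > low[2])   # True
```

<p>Lower temperatures make generation more repetitive but safer; higher temperatures make it more surprising but less coherent.</p>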
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/gen_long_params_hu_6fba1ec57c997742.webp 320w,/2019/09/howto-gpt2/gen_long_params_hu_2f943a7f4d047ab0.webp 768w,/2019/09/howto-gpt2/gen_long_params_hu_549af070291e4c61.webp 1024w,/2019/09/howto-gpt2/gen_long_params.png 1170w" src="gen_long_params.png"/> 
</figure>

<p>As a bonus, you can bulk-generate text with gpt-2-simple by setting <code>nsamples</code> (number of texts to generate total) and <code>batch_size</code> (number of texts to generate at a time); the Colaboratory GPUs can support a <code>batch_size</code> of up to 20, and you can generate these to a text file with <code>gpt2.generate_to_file(file_name)</code> with the same parameters as <code>gpt2.generate()</code>. You can download the generated file locally via the sidebar, and use those to easily save and share the generated texts.</p>
<p><a href="https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce">The notebook</a> has many more functions as well, with more parameters and detailed explanations! The <a href="https://github.com/minimaxir/gpt-2-simple">gpt-2-simple README</a> lists additional features of gpt-2-simple if you want to use the model outside the notebook.</p>
<p>(NB: Currently, you&rsquo;ll need to reset the Notebook via Runtime → Restart Runtime to finetune a different model/dataset or load a different finetuned model.)</p>
<h2 id="gpt-2-for-short-texts">GPT-2 For Short Texts</h2>
<p>A weakness of GPT-2 and other out-of-the-box AI text generators is that they are built for longform content, and keep on generating text until you hit the specified length. Another reason I wanted to make gpt-2-simple was to add explicit processing tricks to the generated text to work around this issue for short texts. In this case, there are two additional parameters that can be passed to <code>gpt2.generate()</code>: <code>truncate</code> and <code>include_prefix</code>. For example, if each short text begins with a <code>&lt;|startoftext|&gt;</code> token and ends with an <code>&lt;|endoftext|&gt;</code> token, then setting <code>prefix='&lt;|startoftext|&gt;'</code>, <code>truncate='&lt;|endoftext|&gt;'</code>, <code>include_prefix=False</code>, and a sufficiently large <code>length</code> lets gpt-2-simple automatically extract the shortform texts, even when generating in batches.</p>
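<p>The truncation logic is easy to reason about in isolation. Below is a hypothetical pure-Python re-creation of the behavior described above (not gpt-2-simple&rsquo;s actual implementation): given raw generated output, cut at the prefix and truncate tokens.</p>

```python
def extract_short_text(generated, prefix="<|startoftext|>",
                       truncate="<|endoftext|>", include_prefix=False):
    """Mimic the described behavior: keep only the text between the
    start token and the end token."""
    start = generated.find(prefix)
    if start != -1:
        # Skip past the prefix token unless the caller wants it kept.
        start = start if include_prefix else start + len(prefix)
        generated = generated[start:]
    end = generated.find(truncate)
    if end != -1:
        generated = generated[:end]  # drop everything after the end token
    return generated.strip()

raw = "<|startoftext|>What is your favorite word?<|endoftext|>garbage..."
print(extract_short_text(raw))  # What is your favorite word?
```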
<p>Let&rsquo;s finetune a GPT-2 model on Reddit submission titles. This query, when run on <a href="https://console.cloud.google.com/bigquery">BigQuery</a> (for free), returns the top 16,000 titles by score between January and March 2019 for a given Reddit subreddit (in this case, <a href="https://www.reddit.com/r/AskReddit/">/r/AskReddit</a>) + minor text preprocessing, which can be downloaded locally as a 1.3 MB CSV (Save Results → CSV [local file]):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="n">standardSQL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">REGEXP_REPLACE</span><span class="p">(</span><span class="n">REGEXP_REPLACE</span><span class="p">(</span><span class="n">REGEXP_REPLACE</span><span class="p">(</span><span class="n">REGEXP_REPLACE</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&amp;amp;&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&amp;&#39;</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;&amp;lt;&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&lt;&#39;</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;&amp;gt;&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&gt;&#39;</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;�&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">title</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">reddit_posts</span><span class="p">.</span><span class="o">*`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">_TABLE_SUFFIX</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s1">&#39;2019_01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s1">&#39;2019_03&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="k">LENGTH</span><span class="p">(</span><span class="n">title</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="mi">8</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="k">LOWER</span><span class="p">(</span><span class="n">subreddit</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;askreddit&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">score</span><span class="w"> </span><span class="k">DESC</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">LIMIT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="mi">16000</span><span class="w">
</span></span></span></code></pre></div><p>With gpt-2-simple, using a single-column CSV like the one generated above as the input dataset will automatically add <code>&lt;|startoftext|&gt;</code> and <code>&lt;|endoftext|&gt;</code> tokens appropriately. Finetune a new GPT-2 model as normal, and then generate with those additional parameters mentioned above:</p>
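<p>For reference, the nested <code>REGEXP_REPLACE</code> calls in the query above amount to unescaping a few HTML entities and stripping the Unicode replacement character. An equivalent cleaning step in Python (replacements applied in the same order as the SQL):</p>

```python
def clean_title(title: str) -> str:
    """Python equivalent of the query's nested REGEXP_REPLACE calls:
    unescape common HTML entities and drop the replacement character."""
    for old, new in [("&amp;", "&"), ("&lt;", "<"), ("&gt;", ">"), ("\ufffd", "")]:
        title = title.replace(old, new)
    return title

print(clean_title("Cats &amp; dogs &lt;3\ufffd"))  # Cats & dogs <3
```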
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/gen_short_hu_e29e49324e00abda.webp 320w,/2019/09/howto-gpt2/gen_short_hu_cf4df049ae08c53c.webp 768w,/2019/09/howto-gpt2/gen_short_hu_b52ddae516adf006.webp 1024w,/2019/09/howto-gpt2/gen_short.png 1330w" src="gen_short.png"/> 
</figure>

<p>It&rsquo;s worth noting that despite a good amount of input data to the model, finetuned networks can easily <em>overfit</em> on short form text: some of these example titles are very close to existing /r/AskReddit titles. Overfitting can be rectified by training for less time, or adding more input data. Make sure to double check that your generated text is unique!</p>
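<p>One lightweight way to run that double check, assuming you have the training titles and generated titles loaded as lists of strings (the data here is invented), is to flag exact duplicates:</p>

```python
def find_copied_lines(training_lines, generated_lines):
    """Return generated lines that appear verbatim in the training data,
    ignoring case and surrounding whitespace."""
    seen = {line.strip().lower() for line in training_lines}
    return [g for g in generated_lines if g.strip().lower() in seen]

training = ["What is your favorite movie?", "Cats of Reddit, why?"]
generated = ["What is your favorite movie?", "Dogs of Reddit, how?"]
print(find_copied_lines(training, generated))
# ['What is your favorite movie?']
```

<p>Exact matching only catches verbatim copies; near-duplicates (a word or two changed) need a fuzzier comparison.</p>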
<p>You can play with this Reddit-oriented variant in <a href="https://colab.research.google.com/drive/1RugXCYDcMvSACYNt9j0kB6zzqRKzAbBn">this modified Colaboratory Notebook</a>.</p>
<h2 id="making-gpt-2-apps">Making GPT-2 Apps</h2>
<p>There have already been cool, non-nefarious uses of GPT-2, such as Adam King&rsquo;s <a href="https://talktotransformer.com">TalkToTransformer</a> which provides a UI for the 774M model (and has gone viral many times) and <a href="https://tabnine.com">TabNine</a>, which uses GPT-2 finetuned on GitHub code in order to create probabilistic code completion. On the <a href="https://pytorch.org">PyTorch</a> side, Huggingface has released a <a href="https://github.com/huggingface/pytorch-transformers">Transformers client</a> (w/ GPT-2 support) of their own, and also created apps such as <a href="https://transformer.huggingface.co">Write With Transformer</a> to serve as a text autocompleter.</p>
<p>Many AI tutorials often show how to deploy a small model to a web service by using the <a href="https://palletsprojects.com/p/flask/">Flask</a> application framework. The problem with GPT-2 is that it&rsquo;s such a huge model that most conventional advice won&rsquo;t work well to get a performant app. And even if you do get it to run fast (e.g. by running the app on a GPU), it won&rsquo;t be <em>cheap</em>, especially if you want it to be resilient to a random surge of virality.</p>
<p>With gpt-2-simple, the solution I came up with is <a href="https://github.com/minimaxir/gpt-2-cloud-run">gpt-2-cloud-run</a>, a small webapp intended to run GPT-2 via <a href="https://cloud.google.com/run/">Google Cloud Run</a> backed by gpt-2-simple. The advantage here is that Cloud Run only charges for compute used and can scale indefinitely if there&rsquo;s a traffic surge; for casual use, it&rsquo;s extremely cost effective compared to running a GPU 24/7. I&rsquo;ve used Cloud Run to make a GPT-2 text generator for <a href="https://minimaxir.com/apps/gpt2-reddit/">Reddit-wide submission titles</a> and a GPT-2 generator for <a href="https://minimaxir.com/apps/gpt2-mtg/">Magic: The Gathering cards</a>!</p>
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/mtg_hu_d057254774c4512.webp 320w,/2019/09/howto-gpt2/mtg_hu_a0e27a970358d4cb.webp 768w,/2019/09/howto-gpt2/mtg_hu_de34001f118de041.webp 1024w,/2019/09/howto-gpt2/mtg.png 1135w" src="mtg.png"/> 
</figure>

<h2 id="attributing-ai-generated-text">Attributing AI-Generated Text</h2>
<p>One of the main reasons I developed textgenrnn and gpt-2-simple is to make AI text generation more <em>accessible</em> as you do not need a strong AI or technical background to create fun stories. However, in the case of GPT-2, I&rsquo;ve noticed an elevated amount of &ldquo;I trained an AI to generate text&rdquo; articles/Reddit posts/YouTube videos saying they used GPT-2 to train an AI, but not <em>how</em> they trained the AI: especially suspicious since finetuning is not an out-of-the-box feature that OpenAI provides. The fact that Keaton Patti&rsquo;s <a href="https://twitter.com/KeatonPatti/status/1161284670601990146">&ldquo;I forced a bot&rdquo; movie scripts</a> (that aren&rsquo;t written by a bot) frequently go megaviral due to that particular framing doesn&rsquo;t help.</p>
<p>Although it&rsquo;s not legally required, I ask that anyone who shares generated text via gpt-2-simple add a link to the repo and/or Colaboratory notebook not just for attribution, but to <em>spread knowledge</em> about the accessibility of AI text generation. It&rsquo;s a technology that should be transparent, not obfuscated for personal gain.</p>
<h2 id="the-future-of-gpt-2">The Future of GPT-2</h2>
<p>Hopefully, this article gave you ideas on how to finetune and generate texts creatively. There&rsquo;s still a <em>lot</em> of untapped potential: many cool applications remain untouched, and many cool datasets haven&rsquo;t yet been used for AI text generation. GPT-2 will likely be used more for mass-producing <a href="https://twitter.com/Fred_Delicious/status/1166783214750445573">crazy erotica</a> than fake news.</p>
<p>However, GPT-2 and the Transformer architecture aren&rsquo;t the end-game of AI text generation. Not by a long shot.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Things About Real-World Data Science Not Discussed In MOOCs and Thought Pieces</title>
      <link>https://minimaxir.com/2018/10/data-science-protips/</link>
      <pubDate>Mon, 22 Oct 2018 09:15:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/10/data-science-protips/</guid>
      <description>MOOCs and thought pieces overfit to a certain style of data science that is not robust to the vast uncertainties of the real world.</description>
      <content:encoded><![CDATA[<p><a href="https://en.wikipedia.org/wiki/Data_science">Data science</a> has been sweeping the tech world. With a large variety of powerful free open-sourced tools and now the computing power to utilize them to their full potential, data science is more accessible than ever and has become <a href="https://www.bloomberg.com/news/articles/2018-05-18/-sexiest-job-ignites-talent-wars-as-demand-for-data-geeks-soars">America&rsquo;s hottest job</a>. One problem: there&rsquo;s no consensus on <a href="https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists">what data scientists <em>really</em> do</a> in a professional setting.</p>
<p>There has been a rise in <em>romantic</em> thought pieces lately (especially on <a href="https://medium.com">Medium</a>) about how data scientists are wizards and can solve any problem (with bonus points if it cites AI). If you follow publications like <a href="https://towardsdatascience.com">Towards Data Science</a>, you&rsquo;ll notice persistent tropes in the more code-oriented posts: Python is the king programming language for data science, use <a href="http://scikit-learn.org/stable/">scikit-learn</a>/<a href="https://xgboost.readthedocs.io/en/latest/">XGBoost</a> and logistic regression for predicting categorical variable(s), use <a href="https://pandas.pydata.org">pandas</a> for processing tabular data, use <a href="https://www.nltk.org">NLTK</a>/<a href="https://en.wikipedia.org/wiki/Word2vec">word2vec</a> for processing text data, use <a href="https://www.tensorflow.org">TensorFlow</a>/<a href="https://keras.io">Keras</a>/convolutional neural networks for processing image data, use <a href="https://en.wikipedia.org/wiki/K-means_clustering"><em>k</em>-means</a> for clustering data, split the processed dataset into training and test datasets for model training, tweak hyperparameters/model features <a href="https://xkcd.com/1838/">until results on the test dataset are good</a>, etc.</p>
<figure>

    <img loading="lazy" srcset="/2018/10/data-science-protips/thought_hu_a119caa2480267cc.webp 320w,/2018/10/data-science-protips/thought.png 397w" src="thought.png"/> 
</figure>

<p>These tropes aren&rsquo;t inappropriate or misleading, but the analysis often doesn&rsquo;t quantify the insight/value of the results. Modeling is just one small part (and often the <em>easiest</em> part) of a very complex system.</p>
<p>Data-oriented MOOCs (<a href="https://en.wikipedia.org/wiki/Massive_open_online_course">Massive Online Open Courses</a>) like Andrew Ng&rsquo;s <a href="https://www.coursera.org/learn/machine-learning">Coursera course on Machine Learning</a> and <a href="http://course.fast.ai">fast.ai&rsquo;s course on Deep Learning</a> are good academic introductions to the theory and terminology behind data science and other related fields. Although MOOCs have many practice problems for prospective data scientists to solve, they don&rsquo;t make you an expert in the field capable of handling messier real-world problems, nor claim to do so.</p>
<p>Modern data science isn&rsquo;t about burying your head in a <a href="http://jupyter.org">Jupyter Notebook</a> and staring at the screen watching training loss numbers trickle down (although it&rsquo;s definitely fun!). There&rsquo;s a lot more to it, some of which I&rsquo;ve learned firsthand working as a Data Scientist at <a href="https://www.buzzfeed.com">BuzzFeed</a> for over a year. To borrow a statistical term, MOOCs and thought pieces <em>overfit</em> to a certain style of data science that is not robust to the vast uncertainties of the real world.</p>
<h2 id="the-costbenefit-tradeoffs-of-data-science">The Cost/Benefit Tradeoffs of Data Science</h2>
<p>Data science often follows the <a href="https://en.wikipedia.org/wiki/Pareto_principle">Pareto principle</a>: 80% of the work takes 20% of the effort. Thought pieces demonstrate that you can just toss data indiscriminately into scikit-learn or a deep learning framework and get neat-looking results. The value of a data scientist, however, lies in knowing when and <em>if</em> to continue developing a model further.</p>
<p><a href="https://www.kaggle.com/competitions">Kaggle competitions</a> are a popular and often-recommended way to get exposure to real-world data science problems. Many teams of statisticians compete to create the best model for a given dataset (where &ldquo;best&rdquo; usually means minimizing the predictive loss/error of the model), with prizes for the highest-performing models. Kaggle also encourages clever modeling techniques such as <a href="http://scikit-learn.org/stable/modules/grid_search.html">grid search</a> of thousands of model hyperparameter combinations and ensembling disparate models to create a megamodel, which results in only <em>slightly</em> better predictive performance but just might give the edge needed to win.</p>
<p>However, there are a few important differences between modeling in a Kaggle competition and modeling in a data science team. Kaggle competitions last for <em>weeks</em>, whereas a professional data scientist may need to spend that time on other things. Ensembling gigantic machine learning models makes predictions very slow and the models themselves very large; both of which may cause difficulty deploying them into production (e.g. the <a href="https://www.wired.com/2012/04/netflix-prize-costs/">Netflix Prize</a> movie recommendation models famously &ldquo;did not seem to justify the engineering effort needed to bring them into a production environment&rdquo;). And most importantly, there may not be a significant <em>practical</em> performance difference between a 1st place Kaggle model that takes days/weeks to optimize and a simple scikit-learn/XGBoost baseline that can be built in a few hours.</p>
<p>Counterintuitively, it may be better to trade performance for speed/memory with a weaker-but-faster model; in business cases, speed and scalability are important implementation constraints. But even with scikit-learn, the model is still a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a> with little idea to the data scientist how the model makes its decisions. One final option is to go back to basics altogether with a &ldquo;boring&rdquo; linear/logistic regression model, where the predictive performance may be even weaker and the model <a href="http://statisticsbyjim.com/regression/ols-linear-regression-assumptions/">must follow several statistical assumptions</a>, but the model feature coefficients and statistical significance <a href="http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-regression-analysis-results-p-values-and-coefficients">are easily interpretable</a> to explain the importance of each input feature (if any) and make actionable, informed decisions for the business. Being a data scientist requires making educated judgments about these tradeoffs.</p>
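<p>As a toy illustration of that interpretability (invented data, ordinary least squares via NumPy, not any specific production model): each fitted coefficient reads directly as the expected change in the target per unit change in a feature, holding the others fixed.</p>

```python
import numpy as np

# Invented data: predict revenue from ad spend and email volume.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.1, 5.9, 11.2, 11.8, 15.0])

# Add an intercept column and solve the least-squares problem.
X1 = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(X1, y, rcond=None)

intercept, b_ads, b_email = coefs
# Unlike a black-box model, these coefficients are directly actionable:
# "one more unit of ad spend is worth b_ads units of revenue".
print(f"+1 unit ad spend -> {b_ads:+.2f} revenue")
print(f"+1 unit email    -> {b_email:+.2f} revenue")
```

<p>A real analysis would also check the statistical assumptions and the significance of each coefficient (e.g. with statsmodels) before acting on them.</p>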
<h2 id="data-scientists-still-use-business-intelligence-tools">Data Scientists Still Use Business Intelligence Tools</h2>
<p>A hobbyist data scientist without a budget may opt to build their own workflows and data pipelines using free tools. However, professional data scientists have a finite amount of free time (as do all engineers), so there&rsquo;s a massive opportunity cost when reinventing the wheel unnecessarily. Enterprise BI tools such as <a href="https://www.tableau.com">Tableau</a>, <a href="https://looker.com">Looker</a>, and <a href="https://modeanalytics.com">Mode Analytics</a> help retrieve and present data with easy-to-digest dashboards for anyone in the company. They&rsquo;re never cheap, but they&rsquo;re much cheaper to the company than having a data scientist spend valuable time to develop and maintain similar tooling over time.</p>
<p>If a stakeholder wants a data report ASAP, there&rsquo;s no problem falling back to using <a href="https://en.wikipedia.org/wiki/SQL">SQL</a> to query a data warehouse and output results into an Excel spreadsheet (plus pretty data visualizations!) to send quickly in an email. Part of being a data scientist is working out which tools are most appropriate at which time.</p>
<p>Some might argue that using BI tools and SQL is not a responsibility of data scientists, but instead of Business Analysts or Data Analysts. That&rsquo;s a <a href="https://en.wikipedia.org/wiki/No_true_Scotsman">No True Scotsman</a> way of looking at it; there&rsquo;s a lot of overlap in data science with other analytical fields, and there&rsquo;s nothing wrong with that.</p>
<h2 id="data-scientists-are-software-engineers-too">Data Scientists Are Software Engineers Too</h2>
<p>Although MOOCs encourage <em>self</em>-study, data science is a collaborative process. And not just with other data scientists on a team, but with other software engineers in the company. Version control tools like <a href="https://git-scm.com">Git</a> are often used by data scientists to upload their portfolio projects publicly to <a href="https://github.com">GitHub</a>, but there are many other important features for use in a company-wide collaborative environment such as branching a repository, making pull requests, and merging conflicts. Beyond that are modern development QA practices, such as test environments, consistent code style, and code reviews. The full process varies strongly by company: Airbnb has a <a href="https://medium.com/airbnb-engineering/scaling-knowledge-at-airbnb-875d73eff091">good thought piece</a> about how they utilize their Knowledge Base for data science collaboration using Git.</p>
<p>One of the very hard and surprisingly underdiscussed aspects of data science is <a href="https://en.wikipedia.org/wiki/DevOps">DevOps</a>, and how to actually get a statistical model into production. <a href="https://www.docker.com/resources/what-container">Docker containers</a>, for example, are a newer technology that&rsquo;s hard to learn but has many data science and DevOps benefits, mitigating Python dependency hell and ensuring a consistent environment for model deployment and execution. And once the model is in production, data scientists, data engineers, and dedicated DevOps personnel need to work together to figure out if the model has the expected output, if the model is performing with expected speed/memory overhead, how often to retrain the model on fresh data (plus the scheduling/data pipelining necessary to do so), and how to efficiently route predictions out of the system to the user.</p>
<h2 id="data-science-cant-solve-everything">Data Science Can&rsquo;t Solve Everything</h2>
<p>Data science experiments (even those utilizing magical AI) are allowed to fail, and not just in the fail-to-reject-the-null-hypothesis sense. Thought pieces typically discuss successful projects, which leads to a survivorship bias. Even with massive amounts of input data, it&rsquo;s <em>likely</em> for a model to fail to converge and offer zero insight, or for an experiment to fail to offer statistically significant results (common with <a href="https://vwo.com/ab-testing/">A/B testing</a>).</p>
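<p>For the A/B-testing case, the significance check itself is simple; here is a sketch of a two-sided, two-proportion z-test using only the standard library (the conversion counts are invented). A modest-looking lift can be nowhere near significant at realistic sample sizes:</p>

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 5.0% vs 5.3% conversion across 2,000 users per arm:
# the lift may be real, but this test cannot distinguish it from noise.
z, p = two_proportion_z(100, 2000, 106, 2000)
print(f"z = {z:.2f}, p = {p:.3f}")
```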
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">real world data science is an R<sup>2</sup> of 0.10 <a href="https://twitter.com/hashtag/GoogleNext18?src=hash&amp;ref_src=twsrc%5Etfw">#GoogleNext18</a> <a href="https://t.co/qNsno2dscR">pic.twitter.com/qNsno2dscR</a></p>— Max Woolf (@minimaxir) <a href="https://twitter.com/minimaxir/status/1021885939361042432?ref_src=twsrc%5Etfw">July 24, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</span></p>
<p>The difficulty of real-world data science is recognizing if a given problem <em>can</em> be solved, how much of your valuable time to spend iterating to <em>maybe</em> solve it, how to report to stakeholders if it <em>can&rsquo;t</em> be solved, and what are the next steps if that&rsquo;s the case.</p>
<p>Don&rsquo;t <a href="https://www.buzzfeednews.com/article/stephaniemlee/brian-wansink-cornell-p-hacking"><em>p</em>-hack</a>!</p>
<h2 id="data-science-and-ethics">Data Science and Ethics</h2>
<p>During the rise of the &ldquo;data science/AI is magic!&rdquo; era, massive algorithmic and statistical failures suggest that data science might not always make the world a better place. Amazon built a resume-reading model which <a href="https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G">accidentally learned to be sexist</a>. Facebook overestimated <a href="https://www.theverge.com/2018/10/17/17989712/facebook-inaccurate-video-metrics-inflation-lawsuit">performance metrics on their videos</a>, causing complete business pivots for media organizations in vain, indirectly <a href="https://www.theatlantic.com/technology/archive/2018/10/facebook-driven-video-push-may-have-cost-483-journalists-their-jobs/573403/">leading to hundreds of layoffs</a>. YouTube&rsquo;s recommended video algorithms <a href="https://medium.com/@jamesbridle/something-is-wrong-on-the-internet-c39c471271d2">drove children towards shocking and disturbing content</a>. And these companies have some of the best data talent <em>in the entire world</em>.</p>
<p>The <em>qualitative</em> output of a model or data analysis is just as important as the quantitative performance, if not more. Allowing dangerous model output to hit production and impact <em>millions</em> of consumers is a failure of QA at all levels. In fairness these companies usually fix these issues, but only <em>after</em> journalists <a href="https://www.nytimes.com/2018/10/19/opinion/facebook-twitter-journalism-misinformation.html">point them out</a>. The problem with blindly chasing a performance metric (like Kaggle) is that it ignores collateral, unexpected effects.</p>
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Don’t be data-driven. Be data-informed. Metrics should never be in charge because they have no moral compass.</p>— Kim Goodwin (@kimgoodwin) <a href="https://twitter.com/kimgoodwin/status/1051849805280948224?ref_src=twsrc%5Etfw">October 15, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </span></p>
<p>Maybe recommending shocking videos is what maximizes clickthrough rate or ad revenue per the models according to a business dashboard. Unfortunately, if the data justifies it and the business stakeholders encourage it, the company may <em>accept the consequences</em> of a flawed algorithm if they don&rsquo;t outweigh the benefits. It&rsquo;s important for data scientists to be aware that they may be party to that.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I realize the irony of using a data science thought piece to argue against data science thought pieces. In fairness, some Medium thought pieces do apply data science in very <em>unique</em> ways or touch on very obscure-but-impactful aspects of frameworks, and I enjoy reading those. The field is still very broadly defined, and your experiences may differ from this post, especially if you&rsquo;re working for a more research-based institution. Unfortunately, I don’t have any new advice for <em>getting</em> a data science job, which is <a href="https://twitter.com/minimaxir/status/951117788835278848">still very difficult</a>.</p>
<p>The popular idea that being a data scientist is a 40-hours-a-week Kaggle competition is <strong>incorrect</strong>. There&rsquo;s a lot more to it that&rsquo;s not as sexy which, in my opinion, is the more interesting aspect of the data science field as a whole.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Problems with Predicting Post Performance on Reddit and Other Link Aggregators</title>
      <link>https://minimaxir.com/2018/09/modeling-link-aggregators/</link>
      <pubDate>Mon, 10 Sep 2018 09:15:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/09/modeling-link-aggregators/</guid>
      <description>The nature of algorithmic feeds like Reddit inherently leads to a survivorship bias: although users may recognize certain types of posts that appear on the front page, there are many more which follow the same patterns but fail.</description>
      <content:encoded><![CDATA[<p><a href="https://www.reddit.com">Reddit</a>, &ldquo;the front page of the internet&rdquo; is a link aggregator where anyone can submit links to cool happenings. Over the years, Reddit has expanded from just being a link aggregator, to allowing image and videos, and as of recently, hosting images and videos itself.</p>
<p>Reddit is broken down into subreddits, where each subreddit represents its own community around a particular interest, like <a href="https://www.reddit.com/r/aww">/r/aww</a> for pet photos and <a href="https://www.reddit.com/r/politics/">/r/politics</a> for U.S. politics. The posts on each subreddit are ranked by some function of both time elapsed since the submission was made, and the <em>score</em> of the submission as determined by upvotes and downvotes from other users.</p>
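<p>Reddit&rsquo;s ranking code was open-sourced for years; the widely cited &ldquo;hot&rdquo; formula combines the logarithm of the score with the submission&rsquo;s age, so newer posts need exponentially fewer votes to outrank older ones. A simplified sketch (constants taken from the open-sourced version; treat it as illustrative, since the live algorithm has since changed):</p>

```python
import math
from datetime import datetime, timezone

# Reddit's epoch constant (1134028003 in the open-sourced code).
REDDIT_EPOCH = datetime(2005, 12, 8, 7, 46, 43, tzinfo=timezone.utc)

def hot(ups, downs, posted_at):
    """Simplified version of Reddit's open-sourced 'hot' ranking."""
    score = ups - downs
    order = math.log10(max(abs(score), 1))   # diminishing returns on votes
    sign = 1 if score > 0 else -1 if score < 0 else 0
    seconds = (posted_at - REDDIT_EPOCH).total_seconds()
    # Every 45,000 seconds (12.5 hours) of age is worth one order of
    # magnitude of score, so recency dominates raw vote counts.
    return sign * order + seconds / 45000

old_viral = hot(1000, 50, datetime(2018, 9, 10, 0, 0, tzinfo=timezone.utc))
new_modest = hot(20, 2, datetime(2018, 9, 11, 0, 0, tzinfo=timezone.utc))
print(new_modest > old_viral)  # True: a day-newer modest post outranks it
```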
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_aww_hu_15514c9daececa75.webp 320w,/2018/09/modeling-link-aggregators/reddit_aww_hu_38fdc85d80e9f49f.webp 768w,/2018/09/modeling-link-aggregators/reddit_aww.png 827w" src="reddit_aww.png"/> 
</figure>

<p>There&rsquo;s also an intrinsic pride in having something you&rsquo;re responsible for providing to the community get lots of upvotes (the submitter also earns karma based on received upvotes, although karma is meaningless and doesn&rsquo;t provide any user benefits). But the reality is that even on the largest subreddits, submissions with 1 point (the default score for new submissions) are the most common, with some subreddits having <em>over half</em> of their submissions at only 1 point.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_dist_facet_hu_94559d39f676be08.webp 320w,/2018/09/modeling-link-aggregators/reddit_dist_facet_hu_ede8ccaaf5538573.webp 768w,/2018/09/modeling-link-aggregators/reddit_dist_facet_hu_940890d5e65baccb.webp 1024w,/2018/09/modeling-link-aggregators/reddit_dist_facet.png 1800w" src="reddit_dist_facet.png"/> 
</figure>

<p>The exposure from having a submission go viral on Reddit (especially on larger subreddits) can be valuable, particularly if it&rsquo;s your own original content. As a result, there has been a lot of <a href="https://www.brandwatch.com/blog/how-to-get-on-the-front-page-of-reddit/">analysis</a>/<a href="https://www.reddit.com/r/starterpacks/comments/8rkfk9/reddit_front_page_starter_pack/">stereotyping</a> about which techniques help a submission make it to the top of the front page. But almost all claims of &ldquo;cracking&rdquo; the Reddit algorithm are <a href="https://en.wikipedia.org/wiki/Post_hoc_ergo_propter_hoc"><em>post hoc</em> rationalizations</a>, attributing success to things like the submission timing and title verbiage of a single submission after the fact. The nature of algorithmic feeds inherently leads to a <a href="https://en.wikipedia.org/wiki/Survivorship_bias">survivorship bias</a>: although users may recognize certain types of posts that appear on the front page, there are many more which follow the same patterns but fail, which makes modeling a successful post very tricky.</p>
<p>I&rsquo;ve touched on analyzing Reddit post performance <a href="https://minimaxir.com/2017/06/reddit-deep-learning/">before</a>, but let&rsquo;s give it another look and see if we can drill down on why Reddit posts do and do not do well.</p>
<h2 id="submission-timing">Submission Timing</h2>
<p>As with many US-based websites, the majority of Reddit users are most active during work hours (9 AM to 5 PM Eastern time on weekdays). Most subreddits have submission patterns which fit accordingly.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_prop_hu_6063ab19aff16cb2.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_prop_hu_4354ae33b8600c6a.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_prop_hu_5818614336fda8df.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_prop.png 1800w" src="reddit_subreddit_prop.png"/> 
</figure>

<p>But what&rsquo;s interesting are the subreddits which <em>deviate</em> from that standard. Gaming subreddits (<a href="https://www.reddit.com/r/DestinyTheGame">/r/DestinyTheGame</a>, <a href="https://www.reddit.com/r/Overwatch">/r/Overwatch</a>) see a short burst of activity after a Tuesday game update/patch, game <em>communication</em> subreddits (<a href="https://www.reddit.com/r/Fireteams">/r/Fireteams</a>, <a href="https://www.reddit.com/r/RocketLeagueExchange">/r/RocketLeagueExchange</a>) are more active <em>outside</em> of work hours since their users are playing the game at those times, and Not-Safe-For-Work subreddits (/r/dirtykikpals, /r/gonewild) are incidentally less active during work hours and more active late at night than other subreddits.</p>
<p>Whenever you make a submission to Reddit, the submission appears in the subreddit&rsquo;s <code>/new</code> queue of the most recent submissions, where hopefully kind souls will find your submission and upvote it if it&rsquo;s good.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_new_hu_6650be6d73851b91.webp 320w,/2018/09/modeling-link-aggregators/reddit_new.png 762w" src="reddit_new.png"/> 
</figure>

<p>However, if it falls off the first page of the <code>/new</code> queue, your submission might be as good as dead. As a result, there&rsquo;s an element of game theory to timing your submission if you don&rsquo;t want it to become another 1-point submission. Is it better to submit during peak hours, when more users may see the submission before it falls off of <code>/new</code>? Is it better to submit <em>before</em> peak usage, since there will be less competition, then ride the momentum once it hits the front page?</p>
<p>Here&rsquo;s a look at the median post performance at each given time slot for top subreddits:</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy_hu_cb9c5ba898252674.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy_hu_8ba4a17a13989a31.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy_hu_a08bfb9858ec4480.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy.png 1800w" src="reddit_subreddit_hr_doy.png"/> 
</figure>
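<p>The aggregation behind this chart was done in BigQuery and R (linked at the end of the post); as a rough illustration, the same median-per-time-slot grouping could be sketched in Python (the <code>median_by_slot</code> helper and the sample data are hypothetical):</p>

```python
from collections import defaultdict
from statistics import median

def median_by_slot(posts):
    """posts: iterable of (subreddit, hour_of_day, score) tuples."""
    slots = defaultdict(list)
    for subreddit, hour, score in posts:
        slots[(subreddit, hour)].append(score)
    return {slot: median(scores) for slot, scores in slots.items()}

# Made-up sample data: the median is robust to a single viral outlier.
posts = [
    ("aww", 9, 1), ("aww", 9, 3), ("aww", 9, 4100),
    ("aww", 22, 1), ("aww", 22, 1),
]
medians = median_by_slot(posts)
```

<p>Using the median rather than the mean matters here: one front-page hit would otherwise dominate an entire time slot.</p>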

<p>As the earlier distribution chart implied, the median score is around 1-2 for most subreddits, and that&rsquo;s consistent across all time slots. Some subreddits with higher medians, like /r/me_irl, do appear to have a <em>slight</em> benefit when posting before peak activity. When focusing on subreddits with high overall median scores, the difference is more explicit.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian_hu_2730023d99e9e0d9.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian_hu_78be513d900d66b5.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian_hu_da4a41445f75e1.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian.png 1800w" src="reddit_subreddit_highmedian.png"/> 
</figure>

<p>Subreddits like /r/PrequelMemes and /r/The_Donald <em>definitely</em> have better performance on average when posts are made before peak activity! Posting before peak usage <em>does</em> appear to be a viable strategy; however, for the majority of subreddits it doesn&rsquo;t make much of a difference.</p>
<h2 id="submission-titles">Submission Titles</h2>
<p>Each Reddit subreddit has their own vocabulary and topics of discussion. Let&rsquo;s break down text by subreddit by looking at the 75th percentile for score on posts containing a given two-word phrase:</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams_hu_5d8f080824cf057d.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams_hu_2870270c6078715e.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams_hu_9edc52c78d8fe6ca.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams.png 1800w" src="reddit_subreddit_topbigrams.png"/> 
</figure>
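<p>The actual computation was done in R/BigQuery; a minimal Python sketch of the idea, extracting two-word phrases from titles and taking the 75th percentile of the scores of posts containing each phrase (the helper name and sample data are made up):</p>

```python
from collections import defaultdict
from statistics import quantiles

def bigram_p75(titles_scores, min_posts=2):
    """75th-percentile score for each two-word phrase across post titles."""
    scores = defaultdict(list)
    for title, score in titles_scores:
        words = title.lower().split()
        for pair in zip(words, words[1:]):
            scores[" ".join(pair)].append(score)
    # quantiles(n=4) returns [Q1, Q2, Q3]; Q3 is the 75th percentile.
    return {bg: quantiles(s, n=4, method="inclusive")[2]
            for bg, s in scores.items() if len(s) >= min_posts}

posts = [("good boy fetches", 1), ("good boy naps", 1),
         ("good boy smiles", 2), ("good boy wins", 10),
         ("cat tax", 1)]
p75 = bigram_p75(posts)
```

<p>A high percentile (rather than the mean or median) surfaces phrases with upside potential while the <code>min_posts</code> cutoff filters out one-off flukes.</p>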

<p>The one trend consistent across all subreddits is the effectiveness of first-person pronouns (<em>I/my</em>) and original content (<em>fan art</em>). Other than that, the vocabulary and sentiment of successful posts is very specific to the subreddit and the culture it represents; there are no universal guaranteed-success memes.</p>
<h2 id="can-deep-learning-predict-post-performance">Can Deep Learning Predict Post Performance?</h2>
<p>Some might think &ldquo;oh hey, this is an arbitrary statistical problem, you can just build an AI to solve it!&rdquo; So, for the sake of argument, I did.</p>
<p>Instead of using Reddit data for building a deep learning model, we&rsquo;ll use data from <a href="https://news.ycombinator.com">Hacker News</a>, another link aggregator similar to Reddit with a strong focus on technology and startup entrepreneurship. The distribution of scores on posts, submission timings, upvoting, and front page ranking systems are all the same as on Reddit.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/hn_hu_ad0b8ce0803e73ea.webp 320w,/2018/09/modeling-link-aggregators/hn_hu_9592bce993e10dcd.webp 768w,/2018/09/modeling-link-aggregators/hn_hu_c329d6412551f993.webp 1024w,/2018/09/modeling-link-aggregators/hn.png 1520w" src="hn.png"/> 
</figure>

<p>The titles of Hacker News submissions are also shorter (80 characters max vs. Reddit&rsquo;s 300-character max) and in concise English (no memes/shitposts allowed), which should help the model learn the title syntax and identify high-impact keywords more easily. Like Reddit, the score data is extremely skewed, with most HN submissions at 1-2 points, and typical model training will quickly converge on predicting that <em>every</em> submission has a score of 1, which isn&rsquo;t helpful!</p>
<p>By constructing a model employing <em>many</em> deep learning tricks with <a href="https://keras.io">Keras</a>/<a href="https://www.tensorflow.org">TensorFlow</a> to prevent model cheating, and training on <em>hundreds of thousands</em> of HN submissions (using post title, day-of-week, hour, and link domain like <code>github.com</code> as model features), the model does converge and finds some signal among the noise (training R<sup>2</sup> ~ 0.55 when trained for 50 epochs). However, it fails to offer any valuable predictions on new, unseen posts (test R<sup>2</sup> <em>&lt; 0.00</em>) because it falls into the exact same human biases regarding titles: it saw submissions with titles that did very well during training, but it can&rsquo;t isolate the random chance that determines why, of two otherwise-similar submissions, one goes viral while the other does not.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/hn_test_hu_75e647e4de235ee0.webp 320w,/2018/09/modeling-link-aggregators/hn_test.png 485w" src="hn_test.png"/> 
</figure>

<p>I&rsquo;ve made the Keras/TensorFlow model training code available in <a href="https://www.kaggle.com/minimaxir/hacker-news-submission-score-predictor/notebook">this Kaggle Notebook</a> if you want to fork it and try to improve the model.</p>
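<p>For intuition on why a test R<sup>2</sup> below zero is possible: R<sup>2</sup> compares a model&rsquo;s squared error against that of always predicting the mean, so a model that generalizes worse than the mean goes negative. A self-contained sketch with made-up numbers:</p>

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# A perfect fit scores exactly 1.0...
perfect = r_squared([1, 2, 3], [1, 2, 3])

# ...but predictions *worse* than always guessing the mean go negative,
# which is what happens when a model memorizes titles instead of signal.
test_scores = [1, 1, 1, 200]   # hypothetical held-out scores
predictions = [150, 1, 1, 2]   # confident, but wrong on the outliers
negative = r_squared(test_scores, predictions)
```

<p>On skewed score data a single badly-predicted viral post is enough to drag the whole test metric below zero.</p>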
<h2 id="other-potential-modeling-factors">Other Potential Modeling Factors</h2>
<p>The deep learning model above makes optimistic assumptions about the underlying data, including that each post behaves independently, and the included features are the sole features which determine the score. These assumptions are questionable.</p>
<p>The simple model also forgoes the content of the submission itself, which is hard to retrieve for hundreds of thousands of data points. On Hacker News that&rsquo;s mostly OK, since most submissions are links/articles whose titles accurately reflect their content, although occasionally there are idiosyncratic short titles which do the opposite. On Reddit, looking at content is obviously necessary for image/video-oriented subreddits, which is hard to gather and analyze at scale.</p>
<p>A very important factor in post performance is <em>momentum</em>. A post having a high score is a positive signal in itself, which begets more votes (a famous Reddit problem is brigading from /r/all, which can cause submission scores to skyrocket). If the front page of a subreddit has a large number of high-performing posts, they might also suppress posts coming out of the <code>/new</code> queue because the score threshold is much higher. A simple model may not be able to capture these impacts; the model would need to incorporate the <em>state of the front page</em> at the time of posting.</p>
<p>Some also try to manipulate upvotes. Reddit became famous for adding the rule &ldquo;asking for upvotes is a violation of intergalactic law&rdquo; to their <a href="https://www.reddithelp.com/en/categories/rules-reporting/account-and-community-restrictions/what-constitutes-vote-cheating-or">Content Policy</a>, although some subreddits do it anyway <a href="https://www.reddit.com/r/TheoryOfReddit/comments/5qqrod/for_years_reddit_told_us_that_saying_upvote_this/">without consequence</a>. On Reddit, obvious spam posts can be downvoted to immediately counteract illicit upvotes. Hacker News has a <a href="https://news.ycombinator.com/newsfaq.html">similar don&rsquo;t-upvote rule</a>, although there aren&rsquo;t downvotes, just a flagging mechanism which quickly neutralizes spam/misleading posts. In general, there&rsquo;s no <em>legitimate</em> reason to highlight your own submission immediately after it&rsquo;s posted (except for Reddit&rsquo;s AMAs). Fortunately, gaming the system is less impactful on Reddit and Hacker News due to their sheer size and countermeasures, but it&rsquo;s a good example of potential user behavior that makes modeling post performance difficult, and hopefully link aggregators of the future aren&rsquo;t susceptible to such shenanigans.</p>
<h2 id="do-we-really-to-predict-post-score">Do We Really Need to Predict Post Score?</h2>
<p>Let&rsquo;s say you are submitting original content to Reddit or your own tech project to Hacker News. More points means a higher ranking, which means more exposure for your link, right? Not exactly. As the Reddit/HN screenshots above show, the scores of popular submissions are all over the place relative to their rankings, having been affected by age penalties.</p>
<p>In practical terms, from my own purely anecdotal experience, submissions at the top rankings receive <em>substantially</em> more clickthroughs, despite being spatially close on the page to other submissions.</p>
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">&hellip;and now traffic at #3.<br><br>Placement is absurdly important for search engines/social media sites. Difference between #1 and #3 is dramatic. <a href="https://t.co/nGjWJBx6dU">pic.twitter.com/nGjWJBx6dU</a></p>— Max Woolf (@minimaxir) <a href="https://twitter.com/minimaxir/status/877219784907149316?ref_src=twsrc%5Etfw">June 20, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></span></p>
<p>In <a href="https://twitter.com/minimaxir/status/877219784907149316">that case</a>, falling from #1 to #3 <em>immediately halved</em> the referral traffic coming from Hacker News.</p>
<p>Therefore, an ideal link aggregator predictive model to maximize clicks should try to predict the <em>rank</em> of a submission (max rank, average rank over some period <em>n</em>, etc.), not necessarily the score it receives. You could theoretically create a model by making a snapshot of a Reddit subreddit/front page of Hacker News every minute or so which includes the post position at the time of the snapshot. As mentioned earlier, the snapshots can also be used as a model feature to identify whether the front page is active or stale. Unfortunately, snapshots can&rsquo;t be retrieved retroactively, and storing, processing, and analyzing snapshots at scale is a difficult and <em>expensive</em> feat of data engineering.</p>
<p>Presumably Reddit&rsquo;s data scientists would be incorporating submission position as a part of their data analytics and modeling, but after inspecting what&rsquo;s sent to Reddit&rsquo;s servers when you perform an action like upvoting, I wasn&rsquo;t able to find a sent position value when upvoting from the feed: only the post score and post upvote percentage at the time of the action were sent.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/chrome_hu_4b758c7e3fe42881.webp 320w,/2018/09/modeling-link-aggregators/chrome_hu_29f25ed9207a6d8f.webp 768w,/2018/09/modeling-link-aggregators/chrome_hu_f6617992d5fb908c.webp 1024w,/2018/09/modeling-link-aggregators/chrome.png 1442w" src="chrome.png"/> 
</figure>

<p>In this example, I upvoted the <code>Fact are facts</code> submission at position #5: we&rsquo;d expect a value between <code>3</code> and <code>5</code> to be sent with the post metadata within the analytics payload, but that&rsquo;s not the case.</p>
<p>Optimizing ranking instead of a tangible metric or classification accuracy is a relatively underdiscussed field of modern data science (besides <a href="https://en.wikipedia.org/wiki/Search_engine_optimization">SEO</a> for getting the top spot on a Google search), and it would be interesting to dive deeper into it for other applications.</p>
<h2 id="in-the-future">In the future</h2>
<p>The moral of this post is that you should not take it personally if a submission fails to hit the front page. It doesn&rsquo;t necessarily mean it&rsquo;s bad. Conversely, if a post does well, don&rsquo;t assume that similar posts will do just as well. There&rsquo;s a lot of quality content that falls through the cracks due to dumb luck. Fortunately, both Reddit and Hacker News allow reposts, which helps alleviate this particular problem.</p>
<p>There&rsquo;s still a lot that can be done to more deterministically predict the behavior of these algorithmic feeds. There&rsquo;s also room to help make these link aggregators more <em>fair</em>. Unfortunately, there are even more undiscovered ways to game these algorithms, and we&rsquo;ll see how things play out.</p>
<hr>
<p><em>You can view the BigQuery queries used to get the Reddit and Hacker News data, plus the R and ggplot2 code used to create the data visualizations, in <a href="http://minimaxir.com/notebooks/modeling-link-aggregators/">this R Notebook</a>. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/modeling-link-aggregators">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Analyzing IMDb Data The Intended Way, with R and ggplot2</title>
      <link>https://minimaxir.com/2018/07/imdb-data-analysis/</link>
      <pubDate>Mon, 16 Jul 2018 09:45:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/07/imdb-data-analysis/</guid>
      <description>For IMDb&amp;rsquo;s big-but-not-big data, you have to play with the data smartly, and both R and ggplot2 have neat tricks to do just that.</description>
      <content:encoded><![CDATA[<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/P4_zSfoTM80?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p><a href="https://www.imdb.com">IMDb</a>, the Internet Movie Database, has been a popular source for data analysis and visualizations over the years. The combination of user ratings for movies and detailed movie metadata have always been fun to <a href="http://minimaxir.com/2016/01/movie-revenue-ratings/">play with</a>.</p>
<p>There are a number of tools to help get IMDb data, such as <a href="https://github.com/alberanid/imdbpy">IMDbPY</a>, which makes it easy to programmatically scrape IMDb by pretending to be a website user and extracting the relevant data from the page&rsquo;s HTML output. While it <em>works</em>, web scraping public data is a legal gray area; many large websites have Terms of Service which forbid scraping, and they can potentially send a DMCA take-down notice to websites redistributing scraped data.</p>
<p>IMDb has <a href="https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX">data licensing terms</a> which forbid scraping and require an attribution in the form of an <strong>Information courtesy of IMDb (<a href="http://www.imdb.com">http://www.imdb.com</a>). Used with permission.</strong> statement, and has also <a href="https://www.kaggle.com/tmdb/tmdb-movie-metadata/home">DMCAed a Kaggle IMDb dataset</a> to drive the point home.</p>
<p>However, there is good news! IMDb publishes an <a href="https://www.imdb.com/interfaces/">official dataset</a> for casual data analysis! And it&rsquo;s now very accessible, just choose a dataset and download (now with no hoops to jump through), and the files are in the standard <a href="https://en.wikipedia.org/wiki/Tab-separated_values">TSV format</a>.</p>
<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/datasets_hu_fb4ad2ef1d7c9e7f.webp 320w,/2018/07/imdb-data-analysis/datasets_hu_a5155a40c73aa984.webp 768w,/2018/07/imdb-data-analysis/datasets.png 926w" src="datasets.png"/> 
</figure>

<p>The uncompressed files are pretty large; not &ldquo;big data&rdquo; large (it fits into computer memory), but Excel will explode if you try to open them in it. You have to play with the data <em>smartly</em>, and both <a href="https://www.r-project.org">R</a> and <a href="https://ggplot2.tidyverse.org/reference/index.html">ggplot2</a> have neat tricks to do just that.</p>
<h2 id="first-steps">First Steps</h2>
<p>R is a popular programming language for statistical analysis. One of the most popular collections of external packages is the <code>tidyverse</code>, which automatically imports the <code>ggplot2</code> data visualization library and other useful packages that we&rsquo;ll get to one-by-one. We&rsquo;ll also load <code>scales</code>, which we&rsquo;ll use later for prettier number formatting. First we&rsquo;ll load these packages:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">scales</span><span class="p">)</span>
</span></span></code></pre></div><p>And now we can load a TSV downloaded from IMDb using the <code>read_tsv</code> function from <code>readr</code> (a tidyverse package), which does what the name implies at a much faster speed than base R (the extra parameters handle quirks in the data encoding). Let&rsquo;s start with the <code>ratings</code> file:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;title.ratings.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span></span></span></code></pre></div>
<p>We can preview what&rsquo;s in the loaded data using <code>dplyr</code> (a tidyverse package), which is what we&rsquo;ll be using to manipulate data for this analysis. dplyr lets you pipe data between functions, making it easy to build up a sequence of manipulations. For now, we&rsquo;ll use <code>head()</code>, which displays the top few rows of the data frame.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">()</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/ratings_hu_5c1fcf56a5289876.webp 320w,/2018/07/imdb-data-analysis/ratings_hu_cf3fece2f9c850ca.webp 768w,/2018/07/imdb-data-analysis/ratings.png 930w" src="ratings.png"/> 
</figure>

<p>Each of the <strong>873k rows</strong> corresponds to a single movie and contains an ID for the movie, its average rating (from 1 to 10), and the number of votes which contribute to that average. Since we have two numeric variables, why not test out ggplot2 by creating a scatterplot mapping them? ggplot2 takes in a data frame and names of columns as aesthetics; then you specify what type of shape to plot (a &ldquo;geom&rdquo;). Passing the plot to <code>ggsave</code> saves it as a standalone, high-quality data visualization.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">numVotes</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_point</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">ggsave</span><span class="p">(</span><span class="s">&#34;imdb-0.png&#34;</span><span class="p">,</span> <span class="n">plot</span><span class="p">,</span> <span class="n">width</span> <span class="o">=</span> <span class="m">4</span><span class="p">,</span> <span class="n">height</span> <span class="o">=</span> <span class="m">3</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-0_hu_6866c079d670893c.webp 320w,/2018/07/imdb-data-analysis/imdb-0_hu_dddd194229265d79.webp 768w,/2018/07/imdb-data-analysis/imdb-0_hu_1d852e43e8a54dea.webp 1024w,/2018/07/imdb-data-analysis/imdb-0.png 1200w" src="imdb-0.png"/> 
</figure>

<p>Here are nearly <em>1 million</em> points on a single chart; definitely don&rsquo;t try that in Excel! However, it&rsquo;s not a <em>useful</em> chart, since all the points are opaque and we can&rsquo;t see the spatial density of the points. One approach to fix this issue is to create a heat map, which ggplot2 can do natively with <code>geom_bin2d</code>. We can color the heat map with the <a href="https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html">viridis</a> colorblind-friendly palettes <a href="https://ggplot2.tidyverse.org/reference/scale_viridis.html">just introduced</a> into ggplot2. We should also tweak the axes: the x-axis should be scaled logarithmically with <code>scale_x_log10</code>, since vote counts span several orders of magnitude, and we can format the numbers on both the x-axis and the fill scale with the <code>comma</code> function from the <code>scales</code> package. For the y-axis, we can add explicit number breaks for each rating; R can do this neatly by setting the breaks to <code>1:10</code>. Putting it all together:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">numVotes</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_bin2d</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_x_log10</span><span class="p">(</span><span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_viridis_c</span><span class="p">(</span><span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-1_hu_afa4c2e2f89a47f2.webp 320w,/2018/07/imdb-data-analysis/imdb-1_hu_fb49622c671e7e.webp 768w,/2018/07/imdb-data-analysis/imdb-1_hu_fe5886baf1a1a113.webp 1024w,/2018/07/imdb-data-analysis/imdb-1.png 1200w" src="imdb-1.png"/> 
</figure>

<p>Not bad, although it unfortunately confirms that IMDb ratings follow a <a href="https://tvtropes.org/pmwiki/pmwiki.php/Main/FourPointScale">Four Point Scale</a>, where average ratings tend to fall between 6 and 9.</p>
<h2 id="mapping-movies-to-ratings">Mapping Movies to Ratings</h2>
<p>You may be asking &ldquo;which ratings correspond to which movies?&rdquo; That&rsquo;s what the <code>tconst</code> field is for. But first, let&rsquo;s load the title data from <code>title.basics.tsv</code> into <code>df_basics</code> and take a look as before.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_basics</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;title.basics.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/basics1_hu_fdcb6a5f4e7311e5.webp 320w,/2018/07/imdb-data-analysis/basics1_hu_e15b78e5bbe944b8.webp 768w,/2018/07/imdb-data-analysis/basics1_hu_2e217e73acfcd9ff.webp 1024w,/2018/07/imdb-data-analysis/basics1.png 1350w" src="basics1.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/basics2_hu_a64ae979748aa9ab.webp 320w,/2018/07/imdb-data-analysis/basics2_hu_a83799eaf31e4743.webp 768w,/2018/07/imdb-data-analysis/basics2_hu_21a8fb679f3ec4e9.webp 1024w,/2018/07/imdb-data-analysis/basics2.png 1374w" src="basics2.png"/> 
</figure>
</p>
<p>We have some neat movie metadata. Notably, this table has a <code>tconst</code> field as well. Therefore, we can <em>join</em> the two tables together, adding the movie information to the corresponding row in the ratings table (in this case, a left join is more appropriate than an inner/full join).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">&lt;-</span> <span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_basics</span><span class="p">)</span>
</span></span></code></pre></div><p>Runtime minutes sounds interesting. Could there be a relationship between the length of a movie and its average rating on IMDb? Let&rsquo;s make a heat map again, but with a few tweaks. With the new metadata, we can <code>filter</code> the table to remove bad points: let&rsquo;s keep only movies (as the IMDb data also contains <em>television show data</em>), with a runtime under 3 hours, which have received at least 10 votes from users, to remove extraneous entries. The x-axis should be tweaked to display the minutes-values in hours. The viridis fill palette can be changed to another one in the family (I personally like <code>inferno</code>).</p>
<p>More importantly, let&rsquo;s discuss plot theming. If you want a minimalistic theme, add a <code>theme_minimal</code> to the plot, and you can pass a <code>base_family</code> to change the default font on the plot and a <code>base_size</code> to change the font size. The <code>labs</code> function lets you add labels to the plot (which you should <em>always</em> do); you have your <code>title</code>, <code>x</code>, and <code>y</code> parameters, but you can also add a <code>subtitle</code>, a <code>caption</code> for attribution, and a <code>color</code>/<code>fill</code> to name the scale. Putting it all together:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">runtimeMinutes</span> <span class="o">&lt;</span> <span class="m">180</span><span class="p">,</span> <span class="n">titleType</span> <span class="o">==</span> <span class="s">&#34;movie&#34;</span><span class="p">,</span> <span class="n">numVotes</span> <span class="o">&gt;=</span> <span class="m">10</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">runtimeMinutes</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_bin2d</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">180</span><span class="p">,</span> <span class="m">60</span><span class="p">),</span> <span class="n">labels</span> <span class="o">=</span> <span class="m">0</span><span class="o">:</span><span class="m">3</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="m">0</span><span class="o">:</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_viridis_c</span><span class="p">(</span><span class="n">option</span> <span class="o">=</span> <span class="s">&#34;inferno&#34;</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">theme_minimal</span><span class="p">(</span><span class="n">base_family</span> <span class="o">=</span> <span class="s">&#34;Source Sans Pro&#34;</span><span class="p">,</span> <span class="n">base_size</span> <span class="o">=</span> <span class="m">8</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">labs</span><span class="p">(</span><span class="n">title</span> <span class="o">=</span> <span class="s">&#34;Relationship between Movie Runtime and Average Movie Rating&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">subtitle</span> <span class="o">=</span> <span class="s">&#34;Data from IMDb retrieved July 4th, 2018&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">x</span> <span class="o">=</span> <span class="s">&#34;Runtime (Hours)&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">y</span> <span class="o">=</span> <span class="s">&#34;Average User Rating&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">caption</span> <span class="o">=</span> <span class="s">&#34;Max Woolf — minimaxir.com&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">fill</span> <span class="o">=</span> <span class="s">&#34;# Movies&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-2b_hu_37c6091878dca7a3.webp 320w,/2018/07/imdb-data-analysis/imdb-2b_hu_42f5a5f9d2e7967e.webp 768w,/2018/07/imdb-data-analysis/imdb-2b_hu_b4f485eff14f2484.webp 1024w,/2018/07/imdb-data-analysis/imdb-2b.png 1200w" src="imdb-2b.png"/> 
</figure>

<p>Now that&rsquo;s pretty nice-looking for only a few lines of code! Albeit unhelpful in this case, as there doesn&rsquo;t appear to be a correlation between runtime and rating.</p>
<p><em>(Note: for the rest of this post, the theming/labels code will be omitted for convenience)</em></p>
<p>How about movie ratings vs. the year the movie was made? It&rsquo;s a similar plot code-wise to the one above (one perk of <code>ggplot2</code> is that there&rsquo;s no shame in reusing chart code!), but we can add a <code>geom_smooth</code>, which adds a nonparametric trendline with confidence bands; since we have a large amount of data, the bands are very tight. We can also fix the problem of &ldquo;empty&rdquo; bins by setting the fill scale to logarithmic scaling. And since we&rsquo;re adding a black trendline, let&rsquo;s change the viridis palette to <code>plasma</code> for better contrast.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">titleType</span> <span class="o">==</span> <span class="s">&#34;movie&#34;</span><span class="p">,</span> <span class="n">numVotes</span> <span class="o">&gt;=</span> <span class="m">10</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">startYear</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_bin2d</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_smooth</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">&#34;black&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_x_continuous</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_viridis_c</span><span class="p">(</span><span class="n">option</span> <span class="o">=</span> <span class="s">&#34;plasma&#34;</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">,</span> <span class="n">trans</span> <span class="o">=</span> <span class="s">&#39;log10&#39;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-4_hu_fdf90cbdd2dd2c7e.webp 320w,/2018/07/imdb-data-analysis/imdb-4_hu_1c45abe215427c09.webp 768w,/2018/07/imdb-data-analysis/imdb-4_hu_62d0feb034e8b054.webp 1024w,/2018/07/imdb-data-analysis/imdb-4.png 1200w" src="imdb-4.png"/> 
</figure>

<p>Unfortunately, this trend hasn&rsquo;t changed much either, although the presence of average ratings outside the Four Point Scale has increased over time.</p>
<h2 id="mapping-lead-actors-to-movies">Mapping Lead Actors to Movies</h2>
<p>Now that we have a handle on working with the IMDb data, let&rsquo;s try playing with the larger datasets. Since they take up a lot of computer memory, we only want to persist data we actually might use. After looking at the schema provided with the official datasets, the only really useful metadata about the actors is their birth year, so let&rsquo;s load that, keeping only actors/actresses (using the fast <code>str_detect</code> function from <code>stringr</code>, another tidyverse package) and only the relevant fields.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_actors</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;name.basics.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                <span class="nf">filter</span><span class="p">(</span><span class="nf">str_detect</span><span class="p">(</span><span class="n">primaryProfession</span><span class="p">,</span> <span class="s">&#34;actor|actress&#34;</span><span class="p">))</span>  <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                <span class="nf">select</span><span class="p">(</span><span class="n">nconst</span><span class="p">,</span> <span class="n">primaryName</span><span class="p">,</span> <span class="n">birthYear</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/actor_hu_f86030d94734f51e.webp 320w,/2018/07/imdb-data-analysis/actor_hu_58f7a4e4de86c210.webp 768w,/2018/07/imdb-data-analysis/actor.png 936w" src="actor.png"/> 
</figure>

<p>The principals dataset, the large 1.28GB TSV, is the most interesting. It&rsquo;s an unnested list of the credited persons in each movie, with an <code>ordering</code> indicating their rank (where <code>1</code> means first, <code>2</code> means second, etc.).</p>
<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/principals_hu_e149270e85e6bbfe.webp 320w,/2018/07/imdb-data-analysis/principals_hu_d39d7c6fcd18929.webp 768w,/2018/07/imdb-data-analysis/principals_hu_56b42bde8cdb5364.webp 1024w,/2018/07/imdb-data-analysis/principals.png 1074w" src="principals.png"/> 
</figure>

<p>For this analysis, let&rsquo;s only look at the <strong>lead actors/actresses</strong>; specifically, for each movie (identified by the <code>tconst</code> value), filter the dataset to the credited actor/actress with the lowest <code>ordering</code> value (we filter to actors/actresses first because the person at rank <code>1</code> may not necessarily be an actor/actress).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_principals</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;title.principals.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">filter</span><span class="p">(</span><span class="nf">str_detect</span><span class="p">(</span><span class="n">category</span><span class="p">,</span> <span class="s">&#34;actor|actress&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="n">tconst</span><span class="p">,</span> <span class="n">ordering</span><span class="p">,</span> <span class="n">nconst</span><span class="p">,</span> <span class="n">category</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">group_by</span><span class="p">(</span><span class="n">tconst</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">filter</span><span class="p">(</span><span class="n">ordering</span> <span class="o">==</span> <span class="nf">min</span><span class="p">(</span><span class="n">ordering</span><span class="p">))</span>
</span></span></code></pre></div><p>Both datasets have an <code>nconst</code> field, so let&rsquo;s join them together, and then join <em>that</em> to the ratings table from earlier via <code>tconst</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_principals</span> <span class="o">&lt;-</span> <span class="n">df_principals</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_actors</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">&lt;-</span> <span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_principals</span><span class="p">)</span>
</span></span></code></pre></div><p>Now we have a fully denormalized dataset in <code>df_ratings</code>. Since it contains both the movie release year and the birth year of the lead actor, we can infer <em>the age of the lead actor at the movie&rsquo;s release</em>. With that goal, filter the data using the criteria from the earlier visualizations, keeping only rows which have the actor&rsquo;s birth year.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings_movies</span> <span class="o">&lt;-</span> <span class="n">df_ratings</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                        <span class="nf">filter</span><span class="p">(</span><span class="n">titleType</span> <span class="o">==</span> <span class="s">&#34;movie&#34;</span><span class="p">,</span> <span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">birthYear</span><span class="p">),</span> <span class="n">numVotes</span> <span class="o">&gt;=</span> <span class="m">10</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                        <span class="nf">mutate</span><span class="p">(</span><span class="n">age_lead</span> <span class="o">=</span> <span class="n">startYear</span> <span class="o">-</span> <span class="n">birthYear</span><span class="p">)</span>
</span></span></code></pre></div><p><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/denorm1_hu_654cad39747efe47.webp 320w,/2018/07/imdb-data-analysis/denorm1_hu_eed6e992d7e214e3.webp 768w,/2018/07/imdb-data-analysis/denorm1_hu_dbde12b6453e4f09.webp 1024w,/2018/07/imdb-data-analysis/denorm1.png 1604w" src="denorm1.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/denorm2_hu_3aef3d94cde50e2c.webp 320w,/2018/07/imdb-data-analysis/denorm2.png 531w" src="denorm2.png"/> 
</figure>
</p>
<h2 id="plotting-ages">Plotting Ages</h2>
<p>Age discrimination in movie casting has been a recurring issue in Hollywood; in fact, in 2017 <a href="https://www.hollywoodreporter.com/thr-esq/judge-pauses-enforcement-imdb-age-censorship-law-978797">a law was signed</a> to force IMDb to remove an actor&rsquo;s age upon request, which in February 2018 was <a href="https://www.hollywoodreporter.com/thr-esq/californias-imdb-age-censorship-law-declared-unconstitutional-1086540">ruled to be unconstitutional</a>.</p>
<p>Have the ages of movie leads changed over time? For this example, we&rsquo;ll use a <a href="https://ggplot2.tidyverse.org/reference/geom_ribbon.html">ribbon plot</a> to plot the ranges of ages of movie leads. A simple way to do that is, for each year, calculate the 25th <a href="https://en.wikipedia.org/wiki/Percentile">percentile</a> of the ages, the 50th percentile (i.e. the median), and the 75th percentile, where the 25th and 75th percentiles are the ribbon bounds and the line represents the median.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_actor_ages</span> <span class="o">&lt;-</span> <span class="n">df_ratings_movies</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">group_by</span><span class="p">(</span><span class="n">startYear</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">summarize</span><span class="p">(</span><span class="n">low_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.25</span><span class="p">,</span> <span class="n">na.rm</span><span class="o">=</span><span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">med_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.50</span><span class="p">,</span> <span class="n">na.rm</span><span class="o">=</span><span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">high_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.75</span><span class="p">,</span> <span class="n">na.rm</span><span class="o">=</span><span class="bp">T</span><span class="p">))</span>
</span></span></code></pre></div><p>Plotting it with ggplot2 is surprisingly simple, although you need to use different y aesthetics for the ribbon and the overlapping line.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_actor_ages</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">startYear</span> <span class="o">&gt;=</span> <span class="m">1920</span><span class="p">)</span> <span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">startYear</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_ribbon</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">ymin</span> <span class="o">=</span> <span class="n">low_age</span><span class="p">,</span> <span class="n">ymax</span> <span class="o">=</span> <span class="n">high_age</span><span class="p">),</span> <span class="n">alpha</span> <span class="o">=</span> <span class="m">0.2</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_line</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="n">med_age</span><span class="p">))</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-8_hu_1f082993b0bfcbd5.webp 320w,/2018/07/imdb-data-analysis/imdb-8_hu_5434c1e3ce1485b4.webp 768w,/2018/07/imdb-data-analysis/imdb-8_hu_c6707a589573484a.webp 1024w,/2018/07/imdb-data-analysis/imdb-8.png 1200w" src="imdb-8.png"/> 
</figure>

<p>Turns out that in the 2000s, the median age of lead actors started to <em>increase</em>? Both the upper and lower bounds increased too. That doesn&rsquo;t square with the age discrimination complaints.</p>
<p>Another aspect of these complaints is gender, as actresses tend to be cast younger than actors. Thanks to the magic of ggplot2 and dplyr, separating actors/actresses is relatively simple: add gender (encoded in <code>category</code>) as a grouping variable, add it as a color/fill aesthetic in ggplot, and set the colors appropriately (I recommend the <a href="http://colorbrewer2.org/">ColorBrewer</a> qualitative palettes for categorical variables).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_actor_ages_lead</span> <span class="o">&lt;-</span> <span class="n">df_ratings_movies</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">group_by</span><span class="p">(</span><span class="n">startYear</span><span class="p">,</span> <span class="n">category</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">summarize</span><span class="p">(</span><span class="n">low_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.25</span><span class="p">,</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">med_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.50</span><span class="p">,</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">high_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.75</span><span class="p">,</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_actor_ages_lead</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">startYear</span> <span class="o">&gt;=</span> <span class="m">1920</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">startYear</span><span class="p">,</span> <span class="n">fill</span> <span class="o">=</span> <span class="n">category</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="n">category</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_ribbon</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">ymin</span> <span class="o">=</span> <span class="n">low_age</span><span class="p">,</span> <span class="n">ymax</span> <span class="o">=</span> <span class="n">high_age</span><span class="p">),</span> <span class="n">alpha</span> <span class="o">=</span> <span class="m">0.2</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_line</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="n">med_age</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_brewer</span><span class="p">(</span><span class="n">palette</span> <span class="o">=</span> <span class="s">&#34;Set1&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_color_brewer</span><span class="p">(</span><span class="n">palette</span> <span class="o">=</span> <span class="s">&#34;Set1&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-9_hu_57562b2f234249be.webp 320w,/2018/07/imdb-data-analysis/imdb-9_hu_7da40c01dd2abee4.webp 768w,/2018/07/imdb-data-analysis/imdb-9_hu_a30111e8cbade2ed.webp 1024w,/2018/07/imdb-data-analysis/imdb-9.png 1200w" src="imdb-9.png"/> 
</figure>

<p>There&rsquo;s about a 10-year gap between the ages of male and female leads, and the gap doesn&rsquo;t change over time, although both begin to rise at the same point.</p>
<p>One possible explanation for this behavior is actor reuse: if Hollywood keeps casting the same actors/actresses, by construction the ages of the leads will steadily increase. Let&rsquo;s verify that: with our list of movies and their lead actors, for each lead actor, order all their movies by release year, and add a ranking for the #th time that actor has been a lead. This is possible with <code>row_number</code> in dplyr; <a href="https://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html">window functions</a> like <code>row_number</code> are one of data science&rsquo;s most useful secrets.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings_movies_nth</span> <span class="o">&lt;-</span> <span class="n">df_ratings_movies</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                      <span class="nf">group_by</span><span class="p">(</span><span class="n">nconst</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                      <span class="nf">arrange</span><span class="p">(</span><span class="n">startYear</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                      <span class="nf">mutate</span><span class="p">(</span><span class="n">nth_lead</span> <span class="o">=</span> <span class="nf">row_number</span><span class="p">())</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/row_number_hu_1e44bdb2621fb9cb.webp 320w,/2018/07/imdb-data-analysis/row_number_hu_ca408294ce31483a.webp 768w,/2018/07/imdb-data-analysis/row_number_hu_ed006c80eb52873e.webp 1024w,/2018/07/imdb-data-analysis/row_number.png 1532w" src="row_number.png"/> 
</figure>

<p>One more ribbon plot later (w/ same code as above + custom y-axis breaks):</p>
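<p>As a sketch, that reused code would look something like the following, assuming it mirrors the age ribbon plots above (the <code>low_nth</code>/<code>med_nth</code>/<code>high_nth</code> names and the exact y-axis breaks here are illustrative, not necessarily what generated the chart below):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"># aggregate the #th-lead rank into quartiles per release year
df_lead_rank &lt;- df_ratings_movies_nth %&gt;%
                  group_by(startYear) %&gt;%
                  summarize(low_nth = quantile(nth_lead, 0.25, na.rm = T),
                            med_nth = quantile(nth_lead, 0.50, na.rm = T),
                            high_nth = quantile(nth_lead, 0.75, na.rm = T))

# same ribbon + median-line construction as the age plots
plot &lt;- ggplot(df_lead_rank %&gt;% filter(startYear &gt;= 1920), aes(x = startYear)) +
          geom_ribbon(aes(ymin = low_nth, ymax = high_nth), alpha = 0.2) +
          geom_line(aes(y = med_nth)) +
          scale_y_continuous(breaks = seq(0, 20, 4))
</code></pre></div>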
<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-12_hu_32ee97febb68e3.webp 320w,/2018/07/imdb-data-analysis/imdb-12_hu_69e7d60d89429d8f.webp 768w,/2018/07/imdb-data-analysis/imdb-12_hu_c9df788e280bb63b.webp 1024w,/2018/07/imdb-data-analysis/imdb-12.png 1200w" src="imdb-12.png"/> 
</figure>

<p>Huh. The median and upper bound of the #th-lead count have <em>dropped</em> over time? Hollywood has been promoting more newcomers as leads? That&rsquo;s not what I expected!</p>
<p>More work definitely needs to be done in this area. In the meantime, the official IMDb datasets are a lot more robust than I thought they would be! And I only used a fraction of the datasets; the rest tie into TV shows, which are a bit messier. Hopefully you&rsquo;ve gotten a good taste of the power of R and ggplot2 for playing with big-but-not-big data!</p>
<hr>
<p><em>You can view the R and ggplot2 code used to create the data visualizations in <a href="http://minimaxir.com/notebooks/imdb-data-analysis/">this R Notebook</a>, which includes many visualizations not used in this post. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/imdb-data-analysis">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
