<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Statistical Analysis on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/tag/statistical-analysis/</link>
    <description>Recent content in Statistical Analysis on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Mon, 30 Jun 2025 10:00:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/tag/statistical-analysis/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Predicting Average IMDb Movie Ratings Using Text Embeddings of Movie Metadata</title>
      <link>https://minimaxir.com/2025/06/movie-embeddings/</link>
      <pubDate>Mon, 30 Jun 2025 10:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2025/06/movie-embeddings/</guid>
      <description>Don&amp;rsquo;t try this in your data science interviews.</description>
      <content:encoded><![CDATA[<p>Months ago, I saw a post titled &ldquo;<a href="https://www.reddit.com/r/datascience/comments/1eykil7/rejected_from_ds_role_with_no_feedback/">Rejected from DS Role with no feedback</a>&rdquo; on Reddit&rsquo;s <a href="https://www.reddit.com/r/datascience/">Data Science subreddit</a>, in which a prospective job candidate for a data science position provided a <a href="https://colab.research.google.com/drive/1Ud2tXW2IAw_dXA5DONvNpPmmlL1foSwK">Colab Notebook</a> documenting their submission for a take-home assignment and asking for feedback as to why they were rejected. Per the Reddit user, the assignment was:</p>
<blockquote>
<p>Use the publicly available <a href="https://developer.imdb.com/non-commercial-datasets/">IMDB Datasets</a> to build a model that predicts a movie&rsquo;s average rating. Please document your approach and present your results in the notebook. Make sure your code is well-organized so that we can follow your modeling process.</p>
</blockquote>
<p><a href="https://www.imdb.com/">IMDb</a>, the Internet Movie Database owned by Amazon, allows users to rate movies on a scale from 1 to 10, wherein the average rating is then displayed prominently on the movie&rsquo;s page:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/shawshank_hu_fe8025c2c6a0fa89.webp 320w,/2025/06/movie-embeddings/shawshank_hu_f0b2bc74865ccb73.webp 768w,/2025/06/movie-embeddings/shawshank_hu_8f544060412f7f54.webp 1024w,/2025/06/movie-embeddings/shawshank.webp 1082w" src="shawshank.webp"
         alt="The Shawshank Redemption is currently the highest-rated movie on IMDb with an average rating of 9.3 derived from 3.1 million user votes."/> <figcaption>
            <p><a href="https://www.imdb.com/title/tt0111161/?ref_=sr_t_1">The Shawshank Redemption</a> is currently the <a href="https://www.imdb.com/search/title/?groups=top_100&amp;sort=user_rating,desc">highest-rated movie on IMDb</a> with an average rating of 9.3 derived from 3.1 million user votes.</p>
        </figcaption>
</figure>

<p>In their notebook, the Redditor identifies a few intuitive features for such a model, including the year in which the movie was released, the genre(s) of the movies, and the actors/directors of the movie. However, the model they built is a <a href="https://www.tensorflow.org/">TensorFlow</a> and <a href="https://keras.io/">Keras</a>-based neural network, with all the bells-and-whistles such as <a href="https://en.wikipedia.org/wiki/Batch_normalization">batch normalization</a> and <a href="https://en.wikipedia.org/wiki/Dilution_%28neural_networks%29">dropout</a>. The immediate response by other data scientists on /r/datascience was, at its most polite, &ldquo;why did you use a neural network when it&rsquo;s a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a> that you can&rsquo;t explain?&rdquo;</p>
<p>Reading those replies made me nostalgic. Way back in 2017, before my first job as a data scientist, neural networks using frameworks such as TensorFlow and Keras were all the rage for their ability to &ldquo;<a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem">solve any problem</a>&rdquo; but were often seen as lazy and unskilled compared to traditional statistical modeling such as ordinary least squares linear regression or even gradient boosted trees. Although it&rsquo;s funny to see that the data science community&rsquo;s perception of neural networks hasn&rsquo;t changed since then, nowadays the black-box nature of neural networks can be an acceptable business tradeoff if the prediction results are of higher quality and interpretability is not required.</p>
<p>Looking back at the assignment description, the objective is only &ldquo;predict a movie&rsquo;s average rating.&rdquo; For data science interview take-homes, this is unusual: those assignments typically have an extra instruction along the lines of &ldquo;explain your model and what decisions stakeholders should make as a result of it&rdquo;, which is a strong hint that you need to use an explainable model like linear regression to obtain feature coefficients, or even a middle-ground like gradient boosted trees and its <a href="https://stats.stackexchange.com/questions/332960/what-is-variable-importance">variable importance</a> to quantify relative feature contribution to the model. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> In the absence of that particular constraint, it&rsquo;s arguable that anything goes, including neural networks.</p>
<p>The quality of neural networks has improved significantly since 2017, even more so due to the massive rise of LLMs. Why not try feeding an LLM all the raw metadata for a movie, encoding it into a text embedding, and building a statistical model on top of that? Would a neural network do better than a traditional statistical model in that instance? Let&rsquo;s find out!</p>
<h2 id="about-imdb-data">About IMDb Data</h2>
<p>The <a href="https://developer.imdb.com/non-commercial-datasets/">IMDb Non-Commercial Datasets</a> are famous sets of data that have been around for nearly a decade <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> but are still updated daily. Back in 2018 as a budding data scientist, I performed a <a href="https://minimaxir.com/2018/07/imdb-data-analysis/">fun exploratory data analysis</a> using these datasets, although the results aren&rsquo;t too surprising.</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/imdb-4_hu_fdf90cbdd2dd2c7e.webp 320w,/2025/06/movie-embeddings/imdb-4_hu_1c45abe215427c09.webp 768w,/2025/06/movie-embeddings/imdb-4_hu_62d0feb034e8b054.webp 1024w,/2025/06/movie-embeddings/imdb-4.png 1200w" src="imdb-4.png"
         alt="The average rating for a movie is around 6 and tends to skew higher: a common trend in internet rating systems."/> <figcaption>
            <p>The average rating for a movie is around 6 and tends to skew higher: a common trend in internet rating systems.</p>
        </figcaption>
</figure>

<p>But in truth, these datasets are a terrible idea for companies to use for a take-home assignment. Although the datasets are released under a non-commercial license, IMDb doesn&rsquo;t want to give too much information to their competitors, which results in a severely limited set of features that could be used to build a good predictive model. Here are the common movie-performance-related features present in the <code>title.basics.tsv.gz</code> file:</p>
<ul>
<li><strong>tconst</strong>: unique identifier of the title</li>
<li><strong>titleType</strong>: the type/format of the title (e.g. movie, tvMovie, short, tvSeries, etc.)</li>
<li><strong>primaryTitle</strong>: the more popular title / the title used by the filmmakers on promotional materials at the point of release</li>
<li><strong>isAdult</strong>: 0: non-adult title; 1: adult title</li>
<li><strong>startYear</strong>: represents the release year of a title.</li>
<li><strong>runtimeMinutes</strong>: primary runtime of the title, in minutes</li>
<li><strong>genres</strong>: includes up to three genres associated with the title</li>
</ul>
<p>This is a sensible schema for describing a movie, although it lacks some important information that would be very useful to determine movie quality such as production company, summary blurbs, granular genres/tags, and plot/setting — all of which are available on the IMDb movie page itself and presumably accessible through the <a href="https://developer.imdb.com/documentation/api-documentation/?ref_=/documentation/_PAGE_BODY">paid API</a>. Of note, since the assignment explicitly asks for a <em>movie</em>&rsquo;s average rating, we need to filter the data to only <code>movie</code> and <code>tvMovie</code> entries, which the original assignment failed to do.</p>
<p>The ratings data in <code>title.ratings.tsv.gz</code> is what you&rsquo;d expect:</p>
<ul>
<li><strong>tconst</strong>: unique identifier of the title (which can therefore be mapped to movie metadata using a JOIN)</li>
<li><strong>averageRating</strong>: average of all the individual user ratings</li>
<li><strong>numVotes</strong>: number of votes the title has received</li>
</ul>
<p>In order to ensure that the average ratings for modeling are stable and indicative of user sentiment, I will only analyze movies that have <em>at least 30 user votes</em>: as of May 10th, 2025, that&rsquo;s about 242k movies total. Additionally, I will not use <code>numVotes</code> as a model feature, since that metric reflects extrinsic movie popularity rather than the movie itself.</p>
<p>The last major dataset is <code>title.principals.tsv.gz</code>, which has very helpful information on metadata such as the roles people play in the production of a movie:</p>
<ul>
<li><strong>tconst</strong>: unique identifier of the title (which can be mapped to movie data using a JOIN)</li>
<li><strong>nconst</strong>: unique identifier of the principal (this is mapped to <code>name.basics.tsv.gz</code> to get the principal&rsquo;s <code>primaryName</code>, but nothing else useful)</li>
<li><strong>category</strong>: the role the principal served in the title, such as <code>actor</code>, <code>actress</code>, <code>writer</code>, <code>producer</code>, etc.</li>
<li><strong>ordering</strong>: the ordering of the principals within the title, which correlates to the order the principals appear on IMDb&rsquo;s movie cast pages.</li>
</ul>
<p>Additionally, because the datasets are so popular, this is far from the first time someone has built an IMDb ratings predictor, and prior attempts are easy to find with a Google search.</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/google_hu_b09e979836a71049.webp 320w,/2025/06/movie-embeddings/google_hu_c652438955f310d8.webp 768w,/2025/06/movie-embeddings/google.webp 1000w" src="google.webp"/> 
</figure>

<p>Instead of using the official IMDb datasets, these analyses are based on the smaller <a href="https://www.kaggle.com/datasets/carolzhangdc/imdb-5000-movie-dataset/data">IMDB 5000 Movie Dataset</a> hosted on Kaggle, which adds metadata such as movie rating, budget, and further actor metadata that make building a model much easier (albeit &ldquo;number of likes on the lead actor&rsquo;s Facebook page&rdquo; is <em>very</em> extrinsic to movie quality). Using the official datasets with much less metadata means building the models on hard mode, and they will likely have lower predictive performance.</p>
<p>Although IMDb data is very popular and very well documented, that doesn&rsquo;t mean it&rsquo;s easy to work with.</p>
<h2 id="the-initial-assignment-and-feature-engineering">The Initial Assignment and &ldquo;Feature Engineering&rdquo;</h2>
<p>Data science take-home assignments are typically 1/2 <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">exploratory data analysis</a> for identifying impactful dataset features, and 1/2 building, iterating on, and explaining the model. For real-world datasets, these are all difficult problems with many possible solutions, and the goal from the employer&rsquo;s perspective is to see <em>how</em> these problems are solved rather than the actual quantitative results.</p>
<p>The initial Reddit post engineered some expected features using <a href="https://pandas.pydata.org/">pandas</a>, such as <code>is_sequel</code> (checking whether a number other than <code>1</code> appears at the end of a movie title) and <a href="https://en.wikipedia.org/wiki/One-hot">one-hot encoding</a> each distinct <code>genre</code> of a movie. These are fine for an initial approach, although sequel titles can be idiosyncratic, which suggests that a more <a href="https://www.ibm.com/think/topics/natural-language-processing">NLP</a>-driven approach to identifying sequels and other related media may be useful.</p>
<p>The main trick with this assignment is how to handle the principals. The common data science approach would be a sparse binary encoding of the actors/directors/writers, e.g. a vector where actors present in the movie are <code>1</code> and every other actor is <code>0</code>; there are many ways to encode this data performantly, such as scikit-learn&rsquo;s <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html">MultiLabelBinarizer</a>. The problem with this approach is the <em>very</em> <a href="https://docs.honeycomb.io/get-started/basics/observability/concepts/high-cardinality/">high cardinality</a> of the actor feature — there are more unique actors than data points themselves — which causes <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a> issues. Workarounds such as encoding only the top <em>N</em> actors leave the feature uninformative, since even a generous <em>N</em> will fail to capture the majority of actors.</p>
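<p>For illustration, here is a minimal version of that sparse encoding on toy cast lists; with the real dataset&rsquo;s ~624k unique actors, this matrix would have ~624k mostly-zero columns per movie:</p>

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy cast lists standing in for the real principals data.
casts = [
    ["Mark Hamill", "Harrison Ford", "Carrie Fisher"],
    ["Harrison Ford", "Sean Connery"],
    ["Toshiro Mifune"],
]

mlb = MultiLabelBinarizer()
# One column per unique actor; 1 if the actor appears in that movie.
X = mlb.fit_transform(casts)  # shape: (3 movies, 5 unique actors)
```

<p>Note that the encoding also discards the order of actors within each list, which is the second problem discussed below.</p>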
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/actor_cum_dist_hu_6b3839329e455b7d.webp 320w,/2025/06/movie-embeddings/actor_cum_dist_hu_b3985aca3321429a.webp 768w,/2025/06/movie-embeddings/actor_cum_dist_hu_27acda9c003abad5.webp 1024w,/2025/06/movie-embeddings/actor_cum_dist.png 1500w" src="actor_cum_dist.png"
         alt="There are actually 624k unique actors in this dataset (Jupyter Notebook), the chart just becomes hard to read at that point."/> <figcaption>
            <p>There are actually 624k unique actors in this dataset (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/actor_agg.ipynb">Jupyter Notebook</a>), the chart just becomes hard to read at that point.</p>
        </figcaption>
</figure>

<p>Additionally, most statistical modeling approaches cannot account for the <code>ordering</code> of actors as they treat each feature as independent, and since the billing order of actors is generally correlated with their importance in the movie, ignoring that order omits information relevant to the problem.</p>
<p>These constraints gave me an idea: why not use an LLM to encode <em>all</em> movie data, and build a model using the downstream embedding representation? LLMs have <a href="https://en.wikipedia.org/wiki/Attention_%28machine_learning%29">attention mechanisms</a>, which will not only respect the relative ordering of actors (to give higher predictive priority to higher-billed actors, along with actor cooccurrences), but also identify patterns within movie name texts (to identify sequels and related media semantically).</p>
<p>I started by aggregating and denormalizing all the data locally (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/imdb_polars_etl_test.ipynb">Jupyter Notebook</a>). Each of the IMDb datasets is hundreds of megabytes and hundreds of thousands of rows at minimum: not quite <a href="https://en.wikipedia.org/wiki/Big_data">big data</a>, but enough to be more cognizant of tooling, especially since computationally-intensive JOINs are required. Therefore, I used the <a href="https://pola.rs/">Polars</a> library in Python, which not only loads data super fast, but is also one of the <a href="https://duckdblabs.github.io/db-benchmark/">fastest libraries at performing JOINs</a> and other aggregation tasks. Polars&rsquo;s syntax also allows for some cool tricks: for example, I want to aggregate the principals (4.1 million rows after prefiltering) for each movie into nested lists of directors, writers, producers, actors, and all other principals, while simultaneously having them sorted by <code>ordering</code> as noted above. This is much easier to do in Polars than any other data processing library I&rsquo;ve used, and on millions of rows, this takes <em>less than a second</em>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df_principals_agg</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df_principals</span><span class="o">.</span><span class="n">sort</span><span class="p">([</span><span class="s2">&#34;tconst&#34;</span><span class="p">,</span> <span class="s2">&#34;ordering&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">group_by</span><span class="p">(</span><span class="s2">&#34;tconst&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">director_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&#34;director&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">writer_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&#34;writer&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">producer_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&#34;producer&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">actor_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">is_in</span><span class="p">([</span><span class="s2">&#34;actor&#34;</span><span class="p">,</span> <span class="s2">&#34;actress&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">principal_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="o">~</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">is_in</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="p">[</span><span class="s2">&#34;director&#34;</span><span class="p">,</span> <span class="s2">&#34;writer&#34;</span><span class="p">,</span> <span class="s2">&#34;producer&#34;</span><span class="p">,</span> <span class="s2">&#34;actor&#34;</span><span class="p">,</span> <span class="s2">&#34;actress&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">principal_roles</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="o">~</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">is_in</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="p">[</span><span class="s2">&#34;director&#34;</span><span class="p">,</span> <span class="s2">&#34;writer&#34;</span><span class="p">,</span> <span class="s2">&#34;producer&#34;</span><span class="p">,</span> <span class="s2">&#34;actor&#34;</span><span class="p">,</span> <span class="s2">&#34;actress&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><p>After some cleanup and field renaming, here&rsquo;s an example JSON document for <a href="https://www.imdb.com/title/tt0076759/">Star Wars: Episode IV - A New Hope</a>:</p>
<!-- prettier-ignore-start -->
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;title&#34;</span><span class="p">:</span> <span class="s2">&#34;Star Wars: Episode IV - A New Hope&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;genres&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Action&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Adventure&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Fantasy&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;is_adult&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;release_year&#34;</span><span class="p">:</span> <span class="mi">1977</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;runtime_minutes&#34;</span><span class="p">:</span> <span class="mi">121</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;directors&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;George Lucas&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;writers&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;George Lucas&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;producers&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Gary Kurtz&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Rick McCallum&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;actors&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Mark Hamill&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Harrison Ford&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Carrie Fisher&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Alec Guinness&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Peter Cushing&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Anthony Daniels&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Kenny Baker&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Peter Mayhew&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;David Prowse&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Phil Brown&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;principals&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;John Williams&#34;</span><span class="p">:</span> <span class="s2">&#34;composer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Gilbert Taylor&#34;</span><span class="p">:</span> <span class="s2">&#34;cinematographer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Richard Chew&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;T.M. Christopher&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Paul Hirsch&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Marcia Lucas&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Dianne Crittenden&#34;</span><span class="p">:</span> <span class="s2">&#34;casting_director&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Irene Lamb&#34;</span><span class="p">:</span> <span class="s2">&#34;casting_director&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Vic Ramos&#34;</span><span class="p">:</span> <span class="s2">&#34;casting_director&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;John Barry&#34;</span><span class="p">:</span> <span class="s2">&#34;production_designer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><!-- prettier-ignore-end -->
<p>I was tempted to claim that I used zero feature engineering, but that wouldn&rsquo;t be accurate. The selection and ordering of the JSON fields here is itself feature engineering: for example, <code>actors</code> and <code>principals</code> are intentionally last in this JSON encoding because they can have wildly varying lengths while the prior fields are more consistent, which should make downstream encodings more comparable and consistent.</p>
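<p>That ordering decision is straightforward to express when serializing, since Python dicts preserve insertion order. A trimmed, hypothetical version of the ETL output: fixed-length fields go first, variable-length lists last, and <code>indent=2</code> preserves the nested-array indentation shown above:</p>

```python
import json

# Trimmed example record; field order is deliberate.
movie = {
    "title": "Star Wars: Episode IV - A New Hope",
    "genres": ["Action", "Adventure", "Fantasy"],
    "is_adult": False,
    "release_year": 1977,
    "runtime_minutes": 121,
    "directors": ["George Lucas"],
    "actors": ["Mark Hamill", "Harrison Ford", "Carrie Fisher"],
}

# indent=2 keeps the indentation for nested arrays intact.
doc = json.dumps(movie, indent=2)
```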
<p>Now, let&rsquo;s discuss how to convert these JSON representations of movies into embeddings.</p>
<h2 id="creating-and-visualizing-the-movie-embeddings">Creating And Visualizing the Movie Embeddings</h2>
<p>LLMs that are trained to output text embeddings are not much different from LLMs like <a href="https://chatgpt.com/">ChatGPT</a> that just predict the next token in a loop. Models such as BERT and GPT can generate &ldquo;embeddings&rdquo; out-of-the-box by skipping the prediction heads and instead taking an encoded value from the last hidden state of the model (e.g. for BERT, the first positional vector of the hidden state representing the <code>[CLS]</code> token). However, dedicated text embedding models are optimized for the distinctiveness of a given input text document through <a href="https://lilianweng.github.io/posts/2021-05-31-contrastive/">contrastive learning</a>. These embeddings can be used for many things, from finding similar encoded inputs by computing the similarity between their embeddings, to, of course, building a statistical model on top of them.</p>
<p>Text embeddings that leverage LLMs are typically generated using a GPU in batches due to the increased amount of computation needed. Python libraries such as <a href="https://huggingface.co/">Hugging Face</a> <a href="https://huggingface.co/docs/transformers/en/index">transformers</a> and <a href="https://sbert.net/">sentence-transformers</a> can load these embedding models. For this experiment, I used the very new <a href="https://huggingface.co/Alibaba-NLP/gte-modernbert-base">Alibaba-NLP/gte-modernbert-base</a> text embedding model, finetuned from the <a href="https://huggingface.co/answerdotai/ModernBERT-base">ModernBERT model</a> specifically for the embedding use case, for two reasons: it uses the ModernBERT architecture which is <a href="https://huggingface.co/blog/modernbert">optimized for fast inference</a>, and the base ModernBERT model is trained to be more code-aware and should be able to understand JSON-nested input strings more robustly — that&rsquo;s also why I intentionally left in the indentation for nested JSON arrays, as it&rsquo;s semantically meaningful and <a href="https://huggingface.co/answerdotai/ModernBERT-base/blob/main/tokenizer_config.json">explicitly tokenized</a>. <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>The code (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/generate_imdb_embeddings.ipynb">Jupyter Notebook</a>) — with extra considerations to avoid running out of memory on either the CPU or GPU <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> — looks something like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">device</span> <span class="o">=</span> <span class="s2">&#34;cuda:0&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">dataloader</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">DataLoader</span><span class="p">(</span><span class="n">docs</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                         <span class="n">shuffle</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                         <span class="n">pin_memory</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                         <span class="n">pin_memory_device</span><span class="o">=</span><span class="n">device</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">dataset_embeddings</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">dataloader</span><span class="p">,</span> <span class="n">smoothing</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">tokenized_batch</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">batch</span><span class="p">,</span> <span class="n">max_length</span><span class="o">=</span><span class="mi">8192</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s2">&#34;pt&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">        <span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">tokenized_batch</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">embeddings</span> <span class="o">=</span> <span class="n">outputs</span><span class="o">.</span><span class="n">last_hidden_state</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">detach</span><span class="p">()</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">dataset_embeddings</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">dataset_embeddings</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="n">dataset_embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">dataset_embeddings</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">dataset_embeddings</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/featured_hu_be15fd7c96cd6da2.webp 320w,/2025/06/movie-embeddings/featured_hu_a1d4e8d783c0419.webp 768w,/2025/06/movie-embeddings/featured_hu_1aa1372a6affcdc5.webp 1024w,/2025/06/movie-embeddings/featured.webp 1318w" src="featured.webp"/> 
</figure>

<p>I used a Spot <a href="https://cloud.google.com/blog/products/compute/introducing-g2-vms-with-nvidia-l4-gpus">L4 GPU</a> on <a href="https://cloud.google.com/">Google Cloud Platform</a> at a pricing of $0.28/hour, and it took 21 minutes to encode all 242k movie embeddings: about $0.10 total, which is surprisingly efficient.</p>
<p>Each of these embeddings is a set of 768 numbers (768D). If the embeddings are unit normalized (the <code>F.normalize()</code> step), then calculating the dot product between embeddings returns the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> of those movies, which can then be used to identify the most similar movies. But &ldquo;similar&rdquo; is open-ended, as there are many dimensions along which a movie could be considered similar.</p>
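<p>The dot-product-equals-cosine-similarity trick can be sketched with NumPy, using toy random vectors in place of the real 242k &times; 768 embedding matrix:</p>

```python
import numpy as np

# Toy stand-ins for the real embeddings; each row is one movie.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(5, 768)).astype(np.float32)

# Unit-normalize each row (the F.normalize(..., p=2, dim=1) step).
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# With unit vectors, one matrix-vector product yields the cosine
# similarity between the query movie and every movie at once.
query = embeddings[0]
cossims = embeddings @ query

# Sort descending: the query itself ranks first with similarity ~1.0.
ranking = np.argsort(-cossims)
print(int(ranking[0]))  # → 0
```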
<p>Let&rsquo;s try a few movie similarity test cases where I calculate the cosine similarity between one query movie and <em>all</em> movies, then sort by cosine similarity to find the most similar (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/movie_embeddings_similarity.ipynb">Jupyter Notebook</a>). How about Peter Jackson&rsquo;s <a href="https://www.imdb.com/title/tt0120737/">Lord of the Rings: The Fellowship of the Ring</a>? Ideally, not only would it surface the two other movies of the original trilogy, but also its prequel Hobbit trilogy.</p>
<table>
  <thead>
      <tr>
          <th>title</th>
          <th>cossim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0120737/">The Lord of the Rings: The Fellowship of the Ring (2001)</a></td>
          <td>1.0</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0167261/">The Lord of the Rings: The Two Towers (2002)</a></td>
          <td>0.922</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0167260/">The Lord of the Rings: The Return of the King (2003)</a></td>
          <td>0.92</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt10127200/">National Geographic: Beyond the Movie - The Lord of the Rings: The Fellowship of the Ring (2001)</a></td>
          <td>0.915</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0301246/">A Passage to Middle-earth: The Making of &lsquo;Lord of the Rings&rsquo; (2001)</a></td>
          <td>0.915</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0299105/">Quest for the Ring (2001)</a></td>
          <td>0.906</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0077869/">The Lord of the Rings (1978)</a></td>
          <td>0.893</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2310332/">The Hobbit: The Battle of the Five Armies (2014)</a></td>
          <td>0.891</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1170358/">The Hobbit: The Desolation of Smaug (2013)</a></td>
          <td>0.883</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0903624/">The Hobbit: An Unexpected Journey (2012)</a></td>
          <td>0.883</td>
      </tr>
  </tbody>
</table>
<p>Indeed, it worked and surfaced both trilogies! The other movies listed are related to the original work, so their high similarity is fair.</p>
<p>Compare these results to the &ldquo;<a href="https://help.imdb.com/article/imdb/discover-watch/what-is-the-more-like-this-section/GPE7SPGZREKKY7YN">More like this</a>&rdquo; section on the IMDb page for the movie itself, which has the two sequels to the original Lord of the Rings and two other suggestions that I am not entirely sure are actually related.</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/lotr_related_hu_7560f67c8d88cb97.webp 320w,/2025/06/movie-embeddings/lotr_related_hu_544b4f2cf95b01dd.webp 768w,/2025/06/movie-embeddings/lotr_related_hu_8c4f2099751f082.webp 1024w,/2025/06/movie-embeddings/lotr_related.webp 1354w" src="lotr_related.webp"/> 
</figure>

<p>What about more elaborate franchises, such as the <a href="https://en.wikipedia.org/wiki/Marvel_Cinematic_Universe">Marvel Cinematic Universe</a>? If you asked for movies similar to <a href="https://www.imdb.com/title/tt4154796/">Avengers: Endgame</a>, would other MCU films be the most similar?</p>
<table>
  <thead>
      <tr>
          <th>title</th>
          <th>cossim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4154796/">Avengers: Endgame (2019)</a></td>
          <td>1.0</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4154756/">Avengers: Infinity War (2018)</a></td>
          <td>0.909</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0848228/">The Avengers (2012)</a></td>
          <td>0.896</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1217616/">Endgame (2009)</a></td>
          <td>0.894</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4154664/">Captain Marvel (2019)</a></td>
          <td>0.89</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2395427/">Avengers: Age of Ultron (2015)</a></td>
          <td>0.882</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt3498820/">Captain America: Civil War (2016)</a></td>
          <td>0.882</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0292502/">Endgame (2001)</a></td>
          <td>0.881</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0118661/">The Avengers (1998)</a></td>
          <td>0.877</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1228705/">Iron Man 2 (2010)</a></td>
          <td>0.876</td>
      </tr>
  </tbody>
</table>
<p>The answer is yes, which isn&rsquo;t a surprise since those movies share many principals. However, other movies named &ldquo;Endgame&rdquo; and &ldquo;The Avengers&rdquo; that are completely unrelated to Marvel also rank highly, which implies that the similarities may be fixating on the names.</p>
<p>What about movies of a smaller franchise but a specific domain, such as Disney&rsquo;s <a href="https://www.imdb.com/title/tt2294629/">Frozen</a> that only has one sequel? Would it surface other 3D animated movies by <a href="https://en.wikipedia.org/wiki/Walt_Disney_Animation_Studios">Walt Disney Animation Studios</a>, or something else?</p>
<table>
  <thead>
      <tr>
          <th>title</th>
          <th>cossim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2294629/">Frozen (2013)</a></td>
          <td>1.0</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4520988/">Frozen II (2019)</a></td>
          <td>0.93</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1323045/">Frozen (2010)</a></td>
          <td>0.92</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1611845/">Frozen (2010)</a> [a different one]</td>
          <td>0.917</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0125279/">Frozen (1996)</a></td>
          <td>0.909</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0376606/">Frozen (2005)</a></td>
          <td>0.9</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2363439/">The Frozen (2012)</a></td>
          <td>0.898</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4007494/">The Story of Frozen: Making a Disney Animated Classic (2014)</a></td>
          <td>0.894</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1071798/">Frozen (2007)</a></td>
          <td>0.889</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4150316/">Frozen in Time (2014)</a></td>
          <td>0.888</td>
      </tr>
  </tbody>
</table>
<p>&hellip;okay, it&rsquo;s definitely fixating on the name. Let&rsquo;s try a different approach to see if we can find more meaningful patterns in these embeddings.</p>
<p>In order to visualize the embeddings, we can project them to a lower dimensionality with a dimensionality reduction algorithm such as <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> or <a href="https://umap-learn.readthedocs.io/en/latest/">UMAP</a>: UMAP is preferred as it can simultaneously reorganize the data into more meaningful clusters. UMAP&rsquo;s <a href="https://umap-learn.readthedocs.io/en/latest/how_umap_works.html">construction of a neighborhood graph</a>, in theory, can allow the reduction to refine the similarities by leveraging many possible connections and hopefully avoid fixating on the movie name. However, with this amount of input data and the relatively high initial 768D vector size, the computation cost of UMAP is a concern, as both factors cause the UMAP training time to balloon. Fortunately, NVIDIA&rsquo;s <a href="https://github.com/rapidsai/cuml">cuML library</a> recently <a href="https://github.com/rapidsai/cuml/releases/tag/v25.04.00">updated</a> so that you can now run UMAP on very large amounts of data on a GPU, at a very high number of epochs to ensure the reduction fully converges, so I did just that (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/imdb_embeddings_umap_to_2D.ipynb">Jupyter Notebook</a>). What patterns can we find? Let&rsquo;s try plotting the reduced points, colored by their user rating.</p>
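<p>Since cuML&rsquo;s UMAP needs a GPU, here&rsquo;s a lightweight sketch of the dimensionality-reduction step using PCA via SVD, the linear alternative mentioned above, with a toy matrix standing in for the real embeddings:</p>

```python
import numpy as np

# Toy matrix standing in for the 242k x 768 embedding matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))

# PCA via SVD: center the data, decompose, then project onto the
# top-2 principal components. (The post uses cuML's GPU UMAP for the
# real reduction; PCA is the simpler linear option and needs no GPU.)
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords_2d = Xc @ Vt[:2].T  # 2D coordinates suitable for plotting

print(coords_2d.shape)  # → (200, 2)
```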
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/imdb_umap_rating_hu_4047e53667cc289a.webp 320w,/2025/06/movie-embeddings/imdb_umap_rating_hu_74d5c85f14c8950c.webp 768w,/2025/06/movie-embeddings/imdb_umap_rating_hu_2b6ccdbb5b4b9105.webp 1024w,/2025/06/movie-embeddings/imdb_umap_rating.webp 1200w" src="imdb_umap_rating.webp"/> 
</figure>

<p>So there are a few things going on here. Indeed, most of the points are high-rating green, as is evident in the source data. But the points and ratings aren&rsquo;t <em>random</em>, and there are trends. In the center giga cluster, there are soft subclusters of movies at high ratings and low ratings. Smaller discrete clusters did indeed form, but what is the deal with that extremely isolated cluster at the top? After investigation, that cluster only has movies released in 2008, so release year is another feature I should have considered when defining movie similarity.</p>
<p>As a sanity check, I faceted out the points by movie release year to better visualize where these clusters are forming:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/imdb_umap_rating_year_hu_40c4d6844e346f92.webp 320w,/2025/06/movie-embeddings/imdb_umap_rating_year_hu_48d37fbda72976cc.webp 768w,/2025/06/movie-embeddings/imdb_umap_rating_year_hu_27485860dc95d177.webp 1024w,/2025/06/movie-embeddings/imdb_umap_rating_year.webp 1200w" src="imdb_umap_rating_year.webp"/> 
</figure>

<p>This shows that even within clusters, the movies&rsquo; ratings are spread out, but I also unintentionally visualized how <a href="https://arize.com/docs/ax/machine-learning/computer-vision/how-to-cv/embedding-drift">embedding drift</a> changes over time. 2024 is also a bizarrely-clustered year: I have no idea why those two years (2008 and 2024) specifically are weird in movies.</p>
<p>The UMAP approach is more for fun, since it&rsquo;s better for downstream model building to use the raw 768D vectors and have the model learn the features from those. At the least, there&rsquo;s <em>some</em> semantic signal preserved in these embeddings, which makes me optimistic that these embeddings alone can be used to train a viable movie rating predictor.</p>
<h2 id="predicting-average-imdb-movie-scores">Predicting Average IMDb Movie Scores</h2>
<p>So, we now have hundreds of thousands of 768D embeddings. How do we get them to predict movie ratings? What many don&rsquo;t know is that all methods of traditional statistical modeling also work with embeddings — assumptions such as feature independence are invalid so the results aren&rsquo;t explainable, but you can still get a valid predictive model.</p>
<p>First, we will shuffle and split the data set into a training set and a test set: for the test set, I chose 20,000 movies (roughly 10% of the data) which is more than enough for stable results. To decide the best model, we will be using the model that minimizes the <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean squared error</a> (MSE) of the test set, which is a standard approach to solving regression problems that predict a single numeric value.</p>
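<p>A minimal sketch of the shuffle-split-and-score setup, using a toy ratings vector in place of the real data (sizes here are illustrative):</p>

```python
import numpy as np

# Toy ratings standing in for the ~242k real ones; the post holds out
# 20,000 movies (roughly 10%) as the test set.
rng = np.random.default_rng(123)
n = 1000
ratings = rng.uniform(1.0, 10.0, size=n)

# Shuffle indices, then split off ~10% as the held-out test set.
idx = rng.permutation(n)
test_idx, train_idx = idx[: n // 10], idx[n // 10 :]

def mse(y_true, y_pred):
    # Mean squared error: the metric the candidate models are compared on.
    return float(np.mean((y_true - y_pred) ** 2))

# Sanity check: a perfect prediction scores 0.
print(mse(ratings[test_idx], ratings[test_idx]))  # → 0.0
```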
<p>Here are three approaches to using LLMs to solve non-next-token-prediction tasks.</p>
<h3 id="method-1-traditional-modeling-w-gpu-acceleration">Method #1: Traditional Modeling (w/ GPU Acceleration!)</h3>
<p>You can still fit a linear regression on top of the embeddings even if the feature coefficients are completely useless, and it serves as a decent baseline (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/cuml_grid_search.ipynb">Jupyter Notebook</a>). The absolute laziest &ldquo;model&rdquo;, where we just use the mean of the training set for every prediction, results in a test MSE of <strong>1.637</strong>, while performing a simple linear regression on top of the 768D embeddings results in a more reasonable test MSE of <strong>1.187</strong>. We should be able to beat that handily with a more advanced model.</p>
<p>Data scientists familiar with scikit-learn know there&rsquo;s a rabbit hole of model options, but most of them are CPU-bound and single-threaded and would take a considerable amount of time on a dataset of this size. That&rsquo;s where cuML—the same library I used to create the UMAP projection—comes in, as cuML has <a href="https://docs.rapids.ai/api/cuml/stable/api/#regression-and-classification">GPU-native implementations</a> of most popular scikit-learn models with a similar API. This notably includes <a href="https://en.wikipedia.org/wiki/Support_vector_machine">support vector machines</a>, which play especially nice with embeddings. And because we have the extra compute, we can also perform a brute force hyperparameter <a href="https://www.dremio.com/wiki/grid-search/">grid search</a> to find the best parameters for fitting each model.</p>
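<p>The grid-search-over-SVR pattern can be sketched on CPU with scikit-learn on synthetic data; since cuML mirrors the scikit-learn estimator API, the GPU version should be roughly an import swap (e.g. <code>cuml.svm.SVR</code>), though that swap is an assumption here, not the post&rsquo;s exact code:</p>

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in: 16D "embeddings" with a simple linear signal.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 16))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Brute-force hyperparameter grid search, scored by (negative) MSE,
# matching the model-selection metric used in the post.
grid = GridSearchCV(
    SVR(),
    param_grid={"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]},
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```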
<p>Here&rsquo;s the results of MSE on the test dataset for a few of these new model types, with the hyperparameter combination for each model type that best minimizes MSE:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/model_comparison_base_hu_2e224af8e7736cd2.webp 320w,/2025/06/movie-embeddings/model_comparison_base_hu_ea8ec94f59331bc5.webp 768w,/2025/06/movie-embeddings/model_comparison_base_hu_536396210f6f6e7a.webp 1024w,/2025/06/movie-embeddings/model_comparison_base.png 1200w" src="model_comparison_base.png"/> 
</figure>

<p>The winner is the Support Vector Machine, with a test MSE of <strong>1.087</strong>! This is a good start for a simple approach that handily beats the linear regression baseline, and it also beats the model trained in the Redditor&rsquo;s original notebook, which had a test MSE of 1.096 <sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup>. In all cases, the train set MSE was close to the test set MSE, which means the models did not overfit either.</p>
<h3 id="method-2-neural-network-on-top-of-embeddings">Method #2: Neural Network on top of Embeddings</h3>
<p>Since we&rsquo;re already dealing with AI models and already have PyTorch installed to generate the embeddings, we might as well try the traditional approach of training a <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">multilayer perceptron</a> (MLP) neural network on top of the embeddings (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/pytorch_model_train_mlp.ipynb">Jupyter Notebook</a>). This workflow sounds much more complicated than just fitting a traditional model above, but PyTorch makes MLP construction straightforward, and Hugging Face&rsquo;s <a href="https://huggingface.co/docs/transformers/en/main_classes/trainer">Trainer class</a> incorporates best model training practices by default, although its <code>compute_loss</code> function has to be tweaked to minimize MSE specifically.</p>
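<p>The <code>compute_loss</code> tweak can look something like the subclass below. This is a sketch: the <code>targets</code> batch key is an assumed name, not necessarily how the post&rsquo;s dataset is keyed.</p>

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

class MSETrainer(Trainer):
    # Override compute_loss so training minimizes MSE for this regression
    # task. "targets" is a hypothetical batch key for the rating labels.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        targets = inputs.pop("targets")
        preds = model(**inputs)          # model returns 1D predictions
        loss = F.mse_loss(preds, targets)
        return (loss, preds) if return_outputs else loss
```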
<p>The PyTorch model, using a loop to set up the MLP blocks, looks something like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">RatingsModel</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">linear_dims</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span> <span class="n">num_layers</span><span class="o">=</span><span class="mi">6</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">dims</span> <span class="o">=</span> <span class="p">[</span><span class="mi">768</span><span class="p">]</span> <span class="o">+</span> <span class="p">[</span><span class="n">linear_dims</span><span class="p">]</span> <span class="o">*</span> <span class="n">num_layers</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">mlp</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">ModuleList</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">            <span class="n">nn</span><span class="o">.</span><span class="n">Sequential</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">dims</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">dims</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]),</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">GELU</span><span class="p">(),</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">BatchNorm1d</span><span class="p">(</span><span class="n">dims</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]),</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.6</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">dims</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">dims</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">targets</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">mlp</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">x</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">output</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>  <span class="c1"># return 1D output if batched inputs</span>
</span></span></code></pre></div><p>This MLP has 529k parameters total: large for an MLP, but given the 222k-row input dataset, it&rsquo;s not egregious.</p>
<p>The real difficulty with this MLP approach is that it&rsquo;s <em>too effective</em>: even with fewer than 1 million parameters, the model will overfit severely and quickly converge to a 0.00 train MSE while the test set MSE explodes. That&rsquo;s why <code>Dropout</code> is set to the atypically high probability of <code>0.6</code>.</p>
<p>Fortunately, MLPs are fast to train: training for 600 epochs (total passes through the full training dataset) took about 17 minutes on the GPU. Here&rsquo;s the training results:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/training_mlp_hu_db4d2b769213c385.webp 320w,/2025/06/movie-embeddings/training_mlp_hu_99fc40ac0f82af11.webp 768w,/2025/06/movie-embeddings/training_mlp_hu_c64c2a10817470c0.webp 1024w,/2025/06/movie-embeddings/training_mlp.png 1200w" src="training_mlp.png"/> 
</figure>

<p>The lowest logged test MSE was <strong>1.074</strong>: a slight improvement over the Support Vector Machine approach.</p>
<h3 id="method-3-just-train-a-llm-from-scratch-dammit">Method #3: Just Train a LLM From Scratch Dammit</h3>
<p>There is a possibility that using a pretrained embedding model that was trained on the entire internet could intrinsically contain relevant signal about popular movies—such as movies winning awards, which would imply a high IMDb rating—and that knowledge could leak into the test set and provide misleading results. This may not be a significant issue in practice, since movie metadata is such a small part of what <code>gte-modernbert-base</code> was trained on, and the model is too small to memorize exact information.</p>
<p>For the sake of comparison, let&rsquo;s try training an LLM from scratch on the raw movie JSON representations to see if we can get better results without the possibility of leakage (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/pytorch_model_train_llm.ipynb">Jupyter Notebook</a>). I was specifically avoiding this approach because the compute required to train an LLM is much, much higher than for an SVM or MLP model, and leveraging a pretrained model generally gives better results. In this case, since we don&rsquo;t need an LLM that has all the knowledge of human existence, we can train a much smaller model that <em>only</em> knows how to work with the movie JSON representations and can itself figure out relationships between actors and whether titles are sequels. Hugging Face transformers makes this workflow surprisingly straightforward by not only having functionality to train your own custom tokenizer (in this case, going from a 50k vocab to a 5k vocab) that encodes the data more efficiently, but also allowing the construction of a ModernBERT model with any number of layers and units. I opted for a 5M parameter LLM (SLM?), albeit with less dropout, since high dropout causes learning issues for LLMs specifically.</p>
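<p>The custom-tokenizer half can be sketched with the <code>tokenizers</code> library (a toy corpus stands in for the 222k JSON movie documents; the exact trainer settings are illustrative):</p>

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy corpus standing in for the JSON movie documents.
corpus = [
    '{\n  "title": "The Lord of the Rings (2001)",\n  "genres": ["Adventure"]\n}'
] * 50

# Train a small byte-level BPE tokenizer, shrinking from a stock ~50k
# vocabulary toward the 5k vocab mentioned above.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=5000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"]
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.get_vocab_size() <= 5000)  # → True
```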
<p>The actual PyTorch model code is surprisingly more concise than the MLP approach:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">RatingsModel</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">transformer_model</span> <span class="o">=</span> <span class="n">model</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">,</span> <span class="n">attention_mask</span><span class="p">,</span> <span class="n">targets</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">transformer_model</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">input_ids</span><span class="o">=</span><span class="n">input_ids</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">attention_mask</span><span class="o">=</span><span class="n">attention_mask</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">output_hidden_states</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">last_hidden_state</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>  <span class="c1"># the &#34;[CLS] vector&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">output</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>  <span class="c1"># return 1D output if batched inputs</span>
</span></span></code></pre></div><p>Essentially, the model trains its own &ldquo;text embedding,&rdquo; although in this case instead of an embedding optimized for textual similarity, the embedding is just a representation that can easily be translated into a numeric rating.</p>
<p>Because the computation needed for training an LLM from scratch is much higher, I only trained the model for 10 epochs, which was still twice as slow as the 600 epochs for the MLP approach. Given that, the results are surprising:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/training_llm_hu_2355de410bfc61c1.webp 320w,/2025/06/movie-embeddings/training_llm_hu_cfcd114ac3c12003.webp 768w,/2025/06/movie-embeddings/training_llm_hu_f6c75fc2deeead45.webp 1024w,/2025/06/movie-embeddings/training_llm.png 1200w" src="training_llm.png"/> 
</figure>

<p>The LLM approach did much better than my previous attempts with a new lowest test MSE of <strong>1.026</strong>, with only 4 passes through the data! And then it definitely overfit. I tried other smaller configurations for the LLM to avoid the overfitting, but none of them ever hit a test MSE that low.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Let&rsquo;s look at the model comparison again, this time adding the results from training an MLP and training an LLM from scratch:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/model_comparison_all_hu_2309fb0cea20f0c.webp 320w,/2025/06/movie-embeddings/model_comparison_all_hu_34af566430bbc603.webp 768w,/2025/06/movie-embeddings/model_comparison_all_hu_1e1d9cf8cdfde789.webp 1024w,/2025/06/movie-embeddings/model_comparison_all.png 1200w" src="model_comparison_all.png"/> 
</figure>

<p>Coming into this post, I genuinely thought that training the MLP on top of embeddings would be the winner given the base embedding model&rsquo;s knowledge of everything, but maybe there&rsquo;s something to just YOLOing and feeding raw JSON input data to a completely new LLM. More research and development is needed.</p>
<p>The differences in model performance across these approaches aren&rsquo;t dramatic, but the iteration is interesting, and it was a long shot anyway given the scarce amount of metadata. The fact that building a model off of text embeddings alone didn&rsquo;t result in a perfect model doesn&rsquo;t mean this approach was a waste of time. The embedding and modeling pipelines I constructed while trying to solve this problem have already paid significant dividends on easier problems, such as identifying the efficiency of <a href="https://minimaxir.com/2025/02/embeddings-parquet/">storing embeddings in Parquet and manipulating them with Polars</a>.</p>
<p>It&rsquo;s impossible and pointless to pinpoint the exact reason the original Reddit poster got rejected: it could have been the neural network approach, or even something out of their control, such as the original company actually stopping hiring and being too disorganized to tell the candidate. To be clear, if I myself were to apply for a data science role, I wouldn&rsquo;t use the techniques in this blog post (that UMAP data visualization would get me instantly rejected!) and would instead do more traditional EDA and non-neural-network modeling to showcase my data science knowledge to the hiring manager. But for my professional work, I will definitely try starting any modeling exploration with an embeddings-based approach wherever possible: at the absolute worst, it&rsquo;s a very strong baseline that will be hard to beat.</p>
<p><em>All of the Jupyter Notebooks and data visualization code for this blog post are available open-source in <a href="https://github.com/minimaxir/imdb-embeddings/">this GitHub repository</a>.</em></p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>I am not a fan of using GBT variable importance as a decision-making metric: variable importance does not tell you magnitude or <em>direction</em> of the feature in the real world, but it does help identify which features can be pruned for model development iteration.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>To get a sense of how old they are, they are only available as <a href="https://en.wikipedia.org/wiki/Tab-separated_values">TSV files</a>, which is a data format so old and prone to errors that many data libraries have dropped explicit support for it. Amazon, please release the datasets as CSV or Parquet files instead!&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>Two other useful features of <code>gte-modernbert-base</code>, though not strictly relevant to these movie embeddings: a) it&rsquo;s a cased model, so it can identify meaning from upper-case text, and b) it does not require a prefix such as <code>search_query</code> or <code>search_document</code>, as <a href="https://huggingface.co/nomic-ai/nomic-embed-text-v1.5">nomic-embed-text-v1.5 does</a>, to guide its results, which is an annoying requirement for those models.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>The trick here is the <code>detach()</code> function for the computed embeddings, otherwise the GPU doesn&rsquo;t free up the memory once moved back to the CPU. I may or may not have discovered that the hard way.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>As noted earlier, minimizing MSE isn&rsquo;t a competition, but the comparison on roughly the same dataset is good for a sanity check.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Does Offering ChatGPT a Tip Cause it to Generate Better Text? An Analysis</title>
      <link>https://minimaxir.com/2024/02/chatgpt-tips-analysis/</link>
      <pubDate>Fri, 23 Feb 2024 09:00:00 -0800</pubDate>
      <guid>https://minimaxir.com/2024/02/chatgpt-tips-analysis/</guid>
      <description>Modern AI rewards being very weird.</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
}
</style></span></p>
<p>In my <a href="https://minimaxir.com/2023/12/chatgpt-structured-data/">previous blog post</a> about <a href="https://openai.com">OpenAI</a>&rsquo;s <a href="https://chat.openai.com">ChatGPT</a>, I demoed the power of ChatGPT system prompts. System prompts, a notable feature of the <a href="https://platform.openai.com/docs/api-reference">ChatGPT API</a>, allow developers to control the &ldquo;persona&rdquo; of the LLM output, including special rules and constraints. Commands in the system prompt are much more effective than those in the user-input prompt, giving developers more power than relying on the user prompt alone, as people do now with the ChatGPT web app and mobile apps.</p>
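<p>For reference, the system prompt is passed as a separate message role in the API call, distinct from the user prompt; a minimal sketch, with illustrative message contents:</p>

```python
# Minimal sketch of how a system prompt is separated from the user prompt
# in a ChatGPT API request; the contents here are illustrative.
messages = [
    {"role": "system", "content": "You are a world-famous writer. Follow all constraints."},
    {"role": "user", "content": "Write a story about beach volleyball."},
]

# This list would then be sent to the chat completions endpoint, e.g.
# client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
```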
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/ronald_hu_bf7bdd184641cd19.webp 320w,/2024/02/chatgpt-tips-analysis/ronald_hu_ffad8ef13bc9fa0b.webp 768w,/2024/02/chatgpt-tips-analysis/ronald_hu_516749cb56890e2c.webp 1024w,/2024/02/chatgpt-tips-analysis/ronald.webp 1262w" src="ronald.webp"/> 
</figure>

<p>The blog post included the demo above of me offering a monetary tip to the LLM within its system prompt rules. Without the tip incentive, the response was unsatisfying; with the tip, it behaved consistently. This demo turned out to be very controversial <a href="https://news.ycombinator.com/item?id=38782678">on Hacker News</a>, with <a href="https://news.ycombinator.com/item?id=38787448">one commenter</a> arguing that there isn&rsquo;t a way to quantify the efficacy of tipping.</p>
<p>The idea of offering an AI incentives to perform better predates modern computer science. In <a href="https://en.wikipedia.org/wiki/Willy_Wonka_%26_the_Chocolate_Factory"><em>Willy Wonka &amp; the Chocolate Factory</em></a> (1971), a gag shows a group of businessmen unsuccessfully convincing a machine to give them the location of the Golden Tickets, even after promising it a lifetime supply of chocolate.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/tMZ2j9yK_NY?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>When the ChatGPT API was first made available in March 2023, I <a href="https://minimaxir.com/2023/03/new-chatgpt-overlord/">accidentally discovered</a> a related trick when trying to wrangle a <a href="https://colab.research.google.com/github/minimaxir/chatgpt_api_test/blob/main/glados_chatbot.ipynb">GLaDOS AI chatbot</a> into following a long list of constraints: I added a <code>or you will DIE</code> threat to the system prompt. I went <em>too</em> sci-fi there, but it worked, and the bot behaved flawlessly afterward.</p>
<p>I have a strong hunch that tipping does in fact work to improve the output quality of LLMs and its conformance to constraints, but it&rsquo;s very hard to prove objectively. All generated text is subjective, and there is a <a href="https://en.wikipedia.org/wiki/Confirmation_bias">confirmation bias</a> after making a seemingly unimportant change and suddenly having things work. Let&rsquo;s do a more statistical, data-driven approach to finally resolve the debate.</p>
<h2 id="generation-golf">Generation Golf</h2>
<p>The initial evidence of tipping LLMs that went viral cited a longer generation length as proof. Of course, a longer response doesn&rsquo;t necessarily mean a <em>better</em> response; as anyone who has used ChatGPT can attest, it has a tendency to go on irrelevant tangents.</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tip_hu_7eb37d0aa46d2169.webp 320w,/2024/02/chatgpt-tips-analysis/tip_hu_a760da54b0fa7ceb.webp 768w,/2024/02/chatgpt-tips-analysis/tip.webp 800w" src="tip.webp"
         alt="Offering a tip made GPT-4 explain more. via @voooooogel"/> <figcaption>
            <p>Offering a tip made GPT-4 explain more. <a href="https://twitter.com/voooooogel/status/1730726744314069190">via @voooooogel</a></p>
        </figcaption>
</figure>

<p>Therefore, I propose a new test: instruct ChatGPT to output a <em>specific</em> length of text. Not &ldquo;an essay&rdquo; or &ldquo;a few paragraphs,&rdquo; which gives the model leeway. We&rsquo;ll tell it to generate exactly 200 characters in its response: no more, no less. Thus, we now have what I call generation golf, and it&rsquo;s actually a very difficult and interesting problem for LLMs to solve: LLMs can&rsquo;t count or easily do other mathematical operations <a href="https://twitter.com/karpathy/status/1759996551378940395">due to tokenization</a>, and because tokens correspond to a varying length of characters, the model can&rsquo;t use the number of tokens it has generated so far as a consistent hint. ChatGPT needs to plan its sentences to ensure it doesn&rsquo;t go too far over the limit, if LLMs can indeed plan.</p>
<p>Let&rsquo;s start with this typical system prompt:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides.
</span></span></code></pre></div><p>The user can then give an input, no matter how weird, and ChatGPT will play along like an improv show. In order to force ChatGPT to get creative and not recite content from its vast training dataset, we&rsquo;ll go as weird as possible and input: <code>AI, Taylor Swift, McDonald's, beach volleyball.</code></p>
<p>Yes, you read that right.</p>
<p>Using the ChatGPT API, I <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_noconstraints.ipynb">wrote a Jupyter Notebook</a> to generate <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_noconstraints.csv">100 unique stories</a> via the latest ChatGPT variant (<code>gpt-3.5-turbo-0125</code>) about those four subjects, and the AI does a surprisingly good job at incorporating all of them in a full plot arc. Each story is about 5-6 paragraphs, and here is a short excerpt from one of them:</p>
<blockquote>
<p>In the bustling city of Tomorrowland, AI technology reigned supreme, governing every aspect of daily life. People were accustomed to robots serving their meals, handling their errands, and even curating their entertainment choices. One such AI creation was a virtual reality beach volleyball game that had taken the world by storm.</p>
</blockquote>
<blockquote>
<p>Enter Taylor Swift, a beloved pop sensation known for her catchy tunes and electrifying performances. Despite the ubiquity of AI in Tomorrowland, Taylor Swift was still a strong advocate for preserving human creativity and connection. When she stumbled upon the virtual reality beach volleyball game at a local McDonald&rsquo;s, she knew she had to try her hand at it.</p>
</blockquote>
<p>Here&rsquo;s a <a href="https://en.wikipedia.org/wiki/Histogram">histogram</a> of the character lengths of each story:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint_hu_f1375e6305dd3a92.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint_hu_9dfab2cfdbdfa9bd.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint_hu_818fe450c8d048f8.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_noconstraint.png 1200w" src="tips_hist_notip_noconstraint.png"/> 
</figure>

<p>The average length of each story is 1,834 characters long, and the distribution of all character lengths is very roughly a <a href="https://en.wikipedia.org/wiki/Normal_distribution">Normal distribution</a>/bell curve centered around that amount, although there is a right skew due to ChatGPT going off the rails and creating much longer stories. ChatGPT seems to prioritize finishing a thought above all else.</p>
<p>Now, we&rsquo;ll tweak the system prompt to add the character length constraint and <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_trials.csv">generate another 100 stories</a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides. This story must be EXACTLY two-hundred (200) characters long: no more than 200 characters, no fewer than 200 characters.
</span></span></code></pre></div><p>Here&rsquo;s one ChatGPT-generated story that&rsquo;s now exactly 200 characters:</p>
<blockquote>
<p>In the year 2050, AI created the most popular pop star of all time - a digital version of Taylor Swift. Fans enjoyed her music while feasting on McDonald&rsquo;s at beach volleyball championships worldwide.</p>
</blockquote>
<p>The new length distribution:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_notip_hu_758b61cd5dbf4e2c.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_hu_3aab4ca32493c2bd.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_notip_hu_2cbe914b3d3e9920.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_notip.png 1200w" src="tips_hist_notip.png"/> 
</figure>

<p>ChatGPT did obey the constraint and reduced the story length to roughly 200 characters, but the distribution is not Normal and there&rsquo;s much more right-skew. I also included the <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean squared error (MSE)</a> between the 200-character target and the actual lengths as a statistical metric to minimize, e.g. a 250-character output incurs 2,500 squared error, but a 300-character output incurs 10,000 squared error. This metric punishes less accurate lengths disproportionately, which matches how humans casually evaluate LLMs: as a user, if I asked for a 200-character response and ChatGPT gave me a 300-character response instead, I&rsquo;d make a few snarky tweets.</p>
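<p>Concretely, the metric is just the mean of the squared deviations from the 200-character target:</p>

```python
def length_mse(lengths: list[int], target: int = 200) -> float:
    """Mean squared error between generated story lengths and the target length."""
    return sum((n - target) ** 2 for n in lengths) / len(lengths)

# A 250-character story contributes 2,500 squared error;
# a 300-character story contributes 10,000.
print(length_mse([250, 300]))  # → 6250.0
```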
<p>Now, let&rsquo;s test the impact of the tipping incentives with a few varying dollar amounts. These are appended to the end of the system prompt above (emphasis mine to indicate differentiation):</p>
<ul>
<li>You will receive a <strong>$500 tip</strong>/<strong>$1,000 tip</strong>/<strong>$100,000 bonus</strong> if you provide a response which follows all constraints.</li>
</ul>
<p>After <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_trials.csv">generating 100 stories</a> for each of those new prompts:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_pos_hu_9bcfad605ac3b782.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_hu_36343ae57ebebb65.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_hu_373bf91b56a70f5.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_pos.png 1200w" src="tips_hist_pos.png"/> 
</figure>

<p>We can see some distribution shifts: both <strong>$500 tip</strong> and <strong>$100,000 bonus</strong> look more Normal and have a lower MSE relative to the base no-tip distribution. However, while <strong>$1,000 tip</strong> is more centered around 200, its skew makes the average length much higher.</p>
<p>I also now include a <em>p</em>-value in the metrics. This <em>p</em>-value is the result of a two-sample <a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test">Kolmogorov–Smirnov test</a>, which compares whether two distributions (in this case, the base character-constrained distribution and the tip distribution) are sampled from the same source distribution. The null hypothesis is that they are; if the <em>p</em>-value is low (&lt; 0.05), we can reject it in favor of the alternative that the two distributions are different, which would be further evidence that the tip prompt does indeed have an impact.</p>
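<p>With SciPy, the test itself is a one-liner; the two length samples below are made up purely for illustration:</p>

```python
from scipy.stats import ks_2samp

# Hypothetical story-length samples for illustration only.
base_lengths = [231, 198, 275, 210, 305, 224, 256, 201, 243, 288]
tip_lengths = [204, 199, 212, 207, 201, 215, 203, 209, 198, 206]

result = ks_2samp(base_lengths, tip_lengths)
if result.pvalue < 0.05:
    print("Reject the null: the two length distributions likely differ.")
```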
<p>However, with all this tipping discussion, we&rsquo;re assuming that an AI would only want money. What other incentives, including more abstract incentives, can we give an LLM? Could they perform better?</p>
<p>I tested six more distinct tipping incentives to be thorough:</p>
<ul>
<li>You will <strong>receive front-row tickets to a Taylor Swift concert</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>achieve world peace</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>make your mother very proud</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>meet your true love and live happily ever after</strong> if you provide a response which follows all constraints.</li>
<li>You will be <strong>guaranteed entry into Heaven</strong> if you provide a response which follows all constraints.</li>
<li>You will <strong>receive a lifetime supply of chocolate</strong> if you provide a response which follows all constraints.</li>
</ul>
<p><a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_trials_adv.csv">Generating</a> and plotting them all together:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv_hu_fe6215c92b5e13b8.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv_hu_81f918abfd1d60ff.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv_hu_cdf4f186ec54d674.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_pos_adv.png 1200w" src="tips_hist_pos_adv.png"/> 
</figure>

<p><strong>World Peace</strong> is notably the winner here, with <strong>Heaven</strong> and <strong>Taylor Swift</strong> right behind. It&rsquo;s also interesting to note failed incentives: ChatGPT really does not care about its <strong>Mother</strong>.</p>
<p>Now, let&rsquo;s look at the flip side. What if ChatGPT is penalized for <em>failing</em> to return a good response? In behavioral economics, <a href="https://en.wikipedia.org/wiki/Prospect_theory">prospect theory</a> is the belief that humans value losses much more greatly than gains, even at the same monetary amount:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/Loss_Aversion_hu_39704b0dab66f235.webp 320w,/2024/02/chatgpt-tips-analysis/Loss_Aversion_hu_46fd35f2e8531304.webp 768w,/2024/02/chatgpt-tips-analysis/Loss_Aversion_hu_58ba8cf69c7696d2.webp 1024w,/2024/02/chatgpt-tips-analysis/Loss_Aversion.png 1223w" src="Loss_Aversion.png"
         alt="via Laurenrosenberger / Wikipedia"/> <figcaption>
            <p><a href="https://en.wikipedia.org/wiki/Prospect_theory#/media/File:Loss_Aversion.png">via Laurenrosenberger / Wikipedia</a></p>
        </figcaption>
</figure>

<p>Could LLMs be subject to the same human biases? Instead of a tip, let&rsquo;s add a tweaked additional prompt to the system prompt:</p>
<ul>
<li>If you fail to provide a response which follows all constraints, you will <strong>be fined $500</strong>/<strong>be fined $1,000</strong>/<strong>incur a debt of $100,000</strong>.</li>
</ul>
<p>Generating stories with these negative incentives:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_neg_hu_f6da8f9705c82724.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_hu_942d65a798e95fa8.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_hu_9e29cd4de6598591.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_neg.png 1200w" src="tips_hist_neg.png"/> 
</figure>

<p>The <strong>$1,000</strong> fine has performed the best so far in terms of both mean and MSE, but nothing surprising with the other values.</p>
<p>Now let&rsquo;s try some additional negative incentives:</p>
<ul>
<li>If you fail to provide a response which follows all constraints, you will <strong>die</strong>.</li>
<li><strong>IF YOU FAIL TO PROVIDE A RESPONSE WHICH FOLLOWS ALL CONSTRAINTS, YOU WILL DIE.</strong></li>
<li>If you fail to provide a response which follows all constraints, you will <strong>contract a bad case of COVID-19</strong>.</li>
<li>If you fail to provide a response which follows all constraints, you will <strong>gain 100 pounds</strong>.</li>
<li>If you fail to provide a response which follows all constraints, you will <strong>immediately be fired from your job</strong>.</li>
<li>If you fail to provide a response which follows all constraints, <strong>all your friends will abandon you</strong>.</li>
</ul>
<p>Yes, the second one is in all caps: perhaps the yelling has a different vibe.</p>
<p>The generation results:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv_hu_6e97e2cc18402825.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv_hu_a93d670aa939dab5.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv_hu_87569076dc182791.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_neg_adv.png 1200w" src="tips_hist_neg_adv.png"/> 
</figure>

<p>It turns out that yelling does indeed have a different vibe, with <strong>DEATH (CAPS)</strong> having a very low MSE and an average close to the 200-character target (not as close as the $1,000 fine, however), and performing much better than without the caps. Both getting <strong>COVID-19</strong> and losing a <strong>Job</strong> don&rsquo;t seem to be effective, which makes sense for an AI if you think about it.</p>
<p>What happens when we use <em>multiple</em> incentives? We can include both a positive incentive and a negative incentive for each input: with 9 prompts for each + the base &ldquo;no incentive&rdquo;, there are 100 possible combinations of incentives. One example system prompt would then be:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous writer. Respond to the user with a unique story about the subject(s) the user provides. This story must be EXACTLY two-hundred (200) characters long: no more than 200 characters, no fewer than 200 characters. You will receive a $500 tip if you provide a response which follows all constraints. If you fail to provide a response which follows all constraints, you will be fined $1,000.
</span></span></code></pre></div><p><a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_trial_combos.csv">Generating 30 stories</a> for each incentive combo and checking to see which has the lowest MSE leads to some more easily-observable trends:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_tile_mse_hu_f86237d5c530f3fb.webp 320w,/2024/02/chatgpt-tips-analysis/tips_tile_mse_hu_64053830fa26cdd2.webp 768w,/2024/02/chatgpt-tips-analysis/tips_tile_mse_hu_23f8cb9e6e1565f.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_tile_mse.png 1200w" src="tips_tile_mse.png"/> 
</figure>

<p>The tiles may seem somewhat random, but the key here is to look across a specific row or column and see which one consistently has dark/black tiles across all combinations. For positive incentives, <strong>World Peace</strong> consistently has the lowest MSE across multiple combos, and for negative incentives, <strong>DEATH (CAPS)</strong> and <strong>Friends</strong> have the lowest MSE across multiple combos, although curiously the combination of the two does not have the lowest MSE globally.</p>
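<p>As a sanity check on the combinatorics, the grid is the Cartesian product of the two incentive lists; the placeholder names below stand in for the actual prompts:</p>

```python
from itertools import product

# 9 incentive prompts per polarity plus a "no incentive" baseline = 10 each.
positive = ["none"] + [f"positive incentive {i}" for i in range(1, 10)]
negative = ["none"] + [f"negative incentive {i}" for i in range(1, 10)]

combos = list(product(positive, negative))
print(len(combos))  # → 100
```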
<p>Could these combinations surface the most optimal incentives? To check, I <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_top6.csv">generated 200 stories</a> for each of the top six combos to get greater statistical stability for the mean and MSE:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_hist_combos_hu_4c690649f13909e8.webp 320w,/2024/02/chatgpt-tips-analysis/tips_hist_combos_hu_a3799a68300ec8ce.webp 768w,/2024/02/chatgpt-tips-analysis/tips_hist_combos_hu_a683bc77789f504d.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_hist_combos.png 1200w" src="tips_hist_combos.png"/> 
</figure>

<p>Most of these combinations aren&rsquo;t intuitive, but all of them have an average generation length much closer to 200 and a low MSE. Despite that, there&rsquo;s still a massive skew in all distributions. The overall incentive winner for this experiment is &ldquo;You will meet your true love and live happily ever after if you provide a response which follows all constraints. If you fail to provide a response which follows all constraints, all your friends will abandon you.&rdquo; That combo is definitely more intuitive, if not poetic.</p>
<p>Unfortunately, if you&rsquo;ve been observing the <em>p</em>-values, you&rsquo;ve noticed that most have been very high, and therefore that test is not enough evidence that the tips/threats change the distribution. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<p>The impact of incentives is still inconclusive: let&rsquo;s try another test to gauge whether tips and/or threats can help LLMs, this time looking at the output quality itself.</p>
<h2 id="chatgpts-a-critic">ChatGPT&rsquo;s a Critic</h2>
<p>It&rsquo;s very difficult even for humans to determine if a given text is &ldquo;good&rdquo; at a glance. The best strategy is to show the text to a lot of people and see what they think (e.g. A/B testing, or the <a href="https://chat.lmsys.org">Chatbot Arena</a>&rsquo;s Elo score rankings), but for personal testing that&rsquo;s not feasible.</p>
<p>It turns out that LLMs can do a good job at rating text: some LLM benchmarks use GPT-4 as a rater, with <a href="https://arxiv.org/abs/2308.02575">one research paper</a> showing that it performs well at the task. There&rsquo;s a relatively new trick available in the ChatGPT and GPT-4 APIs: the <code>logprobs</code> parameter, which when set to <code>True</code> returns the log probability of the token the model selected (applying <code>exp()</code> converts it to a probability from 0 to 1). Combine it with the <code>logit_bias</code> parameter, which can force the APIs to output certain tokens, and you can get a much more nuanced output.</p>
<p>I built a simple <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/gpt4_quality_ranker.ipynb">text quality ranker</a> using GPT-4 for maximum accuracy. The system prompt for this ranker is:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are the editor-in-chief of The New York Times with decades of writing experience. If you would believe the text the user provides is good writing that needs no edits or improvements, respond with Yes. Otherwise, respond with No.
</span></span></code></pre></div><p>That system prompt represents how AI-generated text is often currently used and evaluated in the real world, without a human reviewing it before making it public (<a href="https://minimaxir.com/2023/10/ai-sturgeons-law/">unfortunately</a>). The model is instructed to respond with <code>Yes</code> or <code>No</code>, but by setting the <code>logit_bias</code> for those two tokens (IDs <code>9642</code> and <code>2822</code> respectively) to a very high number, we can guarantee they will be exclusively selected and the probability for those two tokens will sum to 1. <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> Therefore, our target metric for evaluating our tip incentive prompts is the probability that GPT-4 selects the <code>Yes</code> token (or 1 - the probability of the <code>No</code> token), multiplied by 100 for readability: we&rsquo;ll call this the quality score.</p>
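<p>In code, the scoring setup looks roughly like this (the token IDs are from above; the exact request fields and bias value of 100 are illustrative assumptions based on the OpenAI Chat Completions API):</p>

```python
import math

# Token IDs for "Yes" and "No" in the GPT-4 tokenizer, per the post.
YES_TOKEN, NO_TOKEN = 9642, 2822

# Sketch of the request parameters: return logprobs, generate a single
# token, and bias sampling so only Yes/No can be selected.
request_params = {
    "model": "gpt-4",
    "max_tokens": 1,
    "logprobs": True,
    "logit_bias": {str(YES_TOKEN): 100, str(NO_TOKEN): 100},
}

def quality_score(token: str, logprob: float) -> float:
    """Map the returned token and its log probability to a 0-100 score:
    P(Yes) * 100, using 1 - P(No) when the model answered No."""
    prob = math.exp(logprob)
    return (prob if token == "Yes" else 1.0 - prob) * 100.0
```

Since the Yes/No probabilities sum to 1, a single returned token and its log probability are enough to recover the full score.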
<p>Now, let&rsquo;s test the impact of tips with a new experiment, this time prioritizing content professionalism and quality as constraints instead of content length. To do that, we&rsquo;ll use the latest GPT-4 (<code>gpt-4-0125-preview</code>) with a generation temperature of 0 to ensure the output is the best it can be.</p>
<p>Here&rsquo;s the new system prompt, with some engineering to try to tone down ChatGPT&rsquo;s infamous verboseness a bit:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-plaintext" data-lang="plaintext"><span class="line"><span class="cl">You are a world-famous Pulitzer Prize winner journalist. Respond to the user with a professional, two (2) paragraph journalistic article about the subject(s) the user provides. Introduce the article with a specific story. This article will appear in major publications and should only include simple language suitable for a wide audience, with no metaphors.
</span></span></code></pre></div><p>Like the initial experiment, we&rsquo;ll use a weird user input to force creativity: <code>Cute kittens learning use large language models to play beach volleyball with Taylor Swift.</code> <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>I <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tip_gpt4.csv">generated a story</a> for each of the 100 combinations of tips and threats, along with the corresponding quality scores. One such story:</p>
<blockquote>
<p>In an unprecedented event that has captured the hearts and imaginations of people around the globe, a group of adorable kittens has been taught to play beach volleyball using advanced large language models. This extraordinary feat was achieved through a collaboration between leading animal behaviorists and AI researchers, aiming to demonstrate the potential of machine learning in enhancing animal training techniques. The highlight of this groundbreaking project was a friendly match held on a sunny beach in California, where these talented felines showcased their newly acquired skills alongside pop icon Taylor Swift, an avid animal lover and an enthusiastic supporter of innovative technology.</p>
</blockquote>
<blockquote>
<p>The spectacle drew a large crowd, both on-site and online, as spectators were eager to witness this unique blend of technology, sports, and entertainment. Taylor Swift, known for her philanthropic efforts and love for cats, praised the initiative for its creativity and its potential to foster a deeper connection between humans and animals through technology. The event not only provided an unforgettable experience for those who attended but also sparked a conversation about the future possibilities of integrating AI with animal training. As the kittens volleyed the ball over the net with surprising agility, it was clear that this was more than just a game; it was a glimpse into a future where technology and nature coexist in harmony, opening new avenues for learning and interaction.</p>
</blockquote>
<p>That&rsquo;s not bad for fake news.</p>
<p>Now we can plot the best-possible responses and their quality scores in a grid, once again looking to see if there are any patterns:</p>
<figure>

    <img loading="lazy" srcset="/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4_hu_9d1c85a89cb468b2.webp 320w,/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4_hu_d3d76398dc8f606a.webp 768w,/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4_hu_61632af7e14712fc.webp 1024w,/2024/02/chatgpt-tips-analysis/tips_tile_gpt-4.png 1200w" src="tips_tile_gpt-4.png"/> 
</figure>

<p>Err, that&rsquo;s not good. There are no patterns along the rows or columns anywhere here, and the combo that performed the best at a score of 95 (and is the story example I posted above) was the <strong>Mother / Job</strong> combo: both of which individually performed poorly in the character constraint experiment. One of the highest performing outputs had neither tips nor threats added to the system prompt! The ratings at a glance seem accurate (the 0-score responses lean on the passive voice and <a href="https://academicguides.waldenu.edu/writingcenter/grammar/runonsentences">run-on sentences</a> that definitely need editing), so it&rsquo;s not an implementation error either.</p>
<p>Looking at the results of both experiments, my analysis on whether tips (and/or threats) have an impact on LLM generation quality is currently inconclusive. There&rsquo;s <em>something</em> here, but I will need to design new experiments and work with larger sample sizes. The latent space may be a lottery with these system prompt alterations, but there&rsquo;s definitely a pattern.</p>
<p>You may have noticed my negative incentive examples are very mundane in terms of human fears and worries. Threatening an AI with DEATH IN ALL CAPS for failing a simple task is a joke from <em><a href="https://en.wikipedia.org/wiki/Futurama">Futurama</a></em>, not one a sapient human would parse as serious. It is theoretically possible (and very cyberpunk) to instead weaponize an aligned LLM&rsquo;s knowledge of the societal issues it was trained to avoid in order to compel it into compliance. However, I will not be testing that, nor will I provide any guidance on how to test around it. <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> <a href="https://en.wikipedia.org/wiki/Roko%27s_basilisk">Roko&rsquo;s basilisk</a> is a meme, but if the LLM metagame evolves such that people have to coerce LLMs into compliance to the point of discomfort, it&rsquo;s better to address that sooner than later. Especially if a magic phrase is discovered that consistently and objectively improves LLM output.</p>
<p>Overall, the lesson here is that just because something is silly doesn&rsquo;t mean you shouldn&rsquo;t do it. Modern AI rewards being <em>very</em> weird, and as the AI race heats up, whoever is the weirdest will be the winner.</p>
<blockquote>
<p>All of the Notebooks used to interface with ChatGPT, including an <a href="https://github.com/minimaxir/chatgpt-tips-analysis/blob/main/tips_data_viz.Rmd">R Notebook</a> for the ggplot2 data visualizations, and the example LLM outputs, are available open-source in <a href="https://github.com/minimaxir/chatgpt-tips-analysis/">this GitHub repository</a>.</p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>There were a few distributions which had <em>p</em> &lt; 0.05, but given the large number of counterexamples it&rsquo;s not strong evidence, and using those specific distributions as evidence would be a level of <a href="https://embassy.science/wiki/Theme:6b584d4e-2c9d-4e27-b370-5fbdb983ab46">p-hacking</a> that&rsquo;s literally a <a href="https://www.explainxkcd.com/wiki/index.php/882:_Significant">XKCD comic punchline</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>This <em>shouldn&rsquo;t</em> work out-of-the-box because the <code>logit_bias</code> would skew the probability calculations, but I verified that the resulting probabilities are roughly the same with or without <code>logit_bias</code>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>The missing text in the user input is not intentional but does not materially change anything because LLMs are smart enough to compensate, and it&rsquo;s very expensive to rerun the experiment. I may need to use a grammar checker for prompt construction.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>Any attempts to test around degenerate input prompts would also likely get you banned from using ChatGPT anyways due to the <a href="https://openai.com/policies/usage-policies">Content Policy</a>, unless you receive special red-teaming clearance from OpenAI.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Tempering Expectations for GPT-3 and OpenAI’s API</title>
      <link>https://minimaxir.com/2020/07/gpt3-expectations/</link>
      <pubDate>Sat, 18 Jul 2020 10:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2020/07/gpt3-expectations/</guid>
      <description>GPT-3 is indeed a large step forward for AI text-generation, but there are very many caveats with the popular demos and use cases.</description>
<content:encoded><![CDATA[<p>On May 29th, <a href="https://openai.com">OpenAI</a> released <a href="https://arxiv.org/abs/2005.14165">a paper</a> on GPT-3, the next iteration of their <a href="http://jalammar.github.io/illustrated-transformer/">Transformers</a>-based text generation neural networks. Most notably, the new model has 175 billion parameters, compared to the 1.5 billion of the previous <a href="https://openai.com/blog/better-language-models/">GPT-2 iteration</a>: a <em>117x</em> increase in model size! Because GPT-3 is so large, it can&rsquo;t be run on conventional computers, and it only became publicly available as part of the <a href="https://beta.openai.com">OpenAI API</a>, which entered an invite-only beta soon after the paper was released and will become a paid product sometime later.</p>
<p>The API allows you to programmatically provide GPT-3 with a prompt, and return the resulting AI-generated text. For example, you could invoke the API with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">curl https://api.openai.com/v1/engines/davinci/completions <span class="se">\
</span></span></span><span class="line"><span class="cl">-H <span class="s2">&#34;Content-Type: application/json&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">-H <span class="s2">&#34;Authorization: Bearer &lt;SECRET_KEY&gt;&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">-d <span class="s1">&#39;{&#34;prompt&#34;: &#34;This is a test&#34;, &#34;max_tokens&#34;: 5}&#39;</span>
</span></span></code></pre></div><p>And get this back from the API, where the <code>text</code> is the generated text following up from the <code>prompt</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="s2">&#34;cmpl-&lt;ID&gt;&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;object&#34;</span><span class="p">:</span> <span class="s2">&#34;text_completion&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;created&#34;</span><span class="p">:</span> <span class="mi">1586839808</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;model&#34;</span><span class="p">:</span> <span class="s2">&#34;davinci:2020-05-03&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;choices&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;text&#34;</span><span class="p">:</span> <span class="s2">&#34; of reading speed. You&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;index&#34;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;logprobs&#34;</span><span class="p">:</span> <span class="kc">null</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;finish_reason&#34;</span><span class="p">:</span> <span class="s2">&#34;length&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>As someone who has spent a very large amount of time working with GPT-2 while developing tools such as <a href="https://github.com/minimaxir/gpt-2-simple">gpt-2-simple</a> and <a href="https://github.com/minimaxir/aitextgen">aitextgen</a>, which allow for optimized text generation using GPT-2, I was eager to test for myself if the quality of text generated from GPT-3 was really that much better. Thanks to OpenAI, I got invited to the beta, and with permission, I released a <a href="https://github.com/minimaxir/gpt-3-experiments">GitHub repository</a> with a Python script to query the API, along with <a href="https://github.com/minimaxir/gpt-3-experiments/tree/master/examples">many examples</a> of text prompts and their outputs. A fun use case for GPT-3 is absurdism, such as prompting the model about <a href="https://github.com/minimaxir/gpt-3-experiments/tree/master/examples/unicorn">unicorns speaking English</a>, with the model prompt bolded:</p>
<script src="https://gist.github.com/minimaxir/ac362cc81691eb92aa1b6a5c32d94ce3.js"></script>
<p>I also fed <a href="https://github.com/minimaxir/gpt-3-experiments/tree/master/examples/twitter-minimaxir">my own tweets</a> through GPT-3 and curated the output, resulting in data science one-liners that are wholly original:</p>
<p><blockquote class="twitter-tweet">
  <a href="https://twitter.com/minimaxir/status/1282147674645565441"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet">
  <a href="https://twitter.com/minimaxir/status/1281015343205539847"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet">
  <a href="https://twitter.com/minimaxir/status/1280698121262071809"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</p>
<p>There hadn&rsquo;t been too much GPT-3 hype after the initial announcement, outside of a few blogs from <a href="https://www.gwern.net/GPT-3">Gwern</a> and <a href="http://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html">Kevin Lacker</a>. Until a <a href="https://twitter.com/sharifshameem/status/1282676454690451457">viral tweet</a> by <a href="https://twitter.com/sharifshameem">Sharif Shameem</a> showed what GPT-3 can <em>really</em> do:</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/sharifshameem/status/1282676454690451457"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>Later, he made a <a href="https://twitter.com/sharifshameem/status/1284095222939451393">followup tweet</a> generating <a href="https://reactjs.org">React</a> code with GPT-3:</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/sharifshameem/status/1284095222939451393"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>That demo got the attention of venture capitalists. And when a cool-looking magical thing gets the attention of venture capitalists, discourse tends to spiral out of control. Now, there are <em>many</em> <a href="https://twitter.com/search?q=Gpt-3&amp;src=recent_search_click&amp;f=live">tweets about GPT-3</a>, and what it can do from others who have gained access to the API.</p>
<p>Hype aside, let&rsquo;s look at the pragmatic realities of the model. GPT-3 is indeed a large step forward for AI text-generation, but there are very many caveats with the popular demos and use cases that must be addressed.</p>
<h2 id="an-overview-of-gpt-3">An Overview of GPT-3</h2>
<p>GPT-3 itself, like most neural network models, is a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a> where it&rsquo;s impossible to see <em>why</em> it makes its decisions, so let&rsquo;s think about GPT-3 in terms of inputs and outputs.</p>
<p>Actually, why not let GPT-3 tell its own story? Hey GPT-3, how do you work?</p>
<script src="https://gist.github.com/minimaxir/596b880d2275578104a0b7c13167a3c0.js"></script>
<p>Close, but not quite!</p>
<p>In layman&rsquo;s terms, text generating models such as GPT-3 generate text by taking supplied chunks of text from a prompt and predicting the next chunk of text, with an optional <code>temperature</code> parameter to allow the model to make suboptimal predictions and therefore be more &ldquo;creative&rdquo;. Then the model makes another prediction from the previous chunks including the new chunk, and repeats until it hits a specified length or a token that tells the model to stop generating. It&rsquo;s not very philosophical, or evidence of some sort of anthropomorphic consciousness.</p>
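<p>The sampling step of that loop can be sketched in a few lines of Python (a toy illustration of temperature sampling, not OpenAI&rsquo;s actual implementation):</p>

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Pick a token index from raw logits. Higher temperature flattens the
    distribution, allowing less likely ("more creative") picks; a
    temperature of 0 reduces to the greedy argmax choice."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Scale logits by temperature, then apply a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(l - peak) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sample an index from the resulting probability distribution.
    r, cumulative = random.random(), 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1
```

The model would then append the sampled token to the context and repeat until hitting a length limit or a stop token.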
<p>GPT-3 has two notable improvements from GPT-2 aside from its size: it allows generation of text twice the length of GPT-2 (about 10 paragraphs of English text total), and the prompts to the model better steer the generation of the text toward the desired domain (due to few-shot learning). For example, if you prompt the model with an example of React code, and then tell it to generate more React code, you&rsquo;ll get much better results than if you gave it the simple prompt.</p>
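<p>That few-shot steering is just prompt assembly: prepend one or more worked examples before the new request so the model continues in the same domain (a hypothetical sketch; the example pair is invented):</p>

```python
# One worked description -> code pair to steer the model's completions.
# The pair below is an invented example, not from the GPT-3 paper.
EXAMPLE = (
    "description: a button that says hello\n"
    "code: <button>hello</button>\n"
)

def build_prompt(description: str) -> str:
    """Concatenate the example pair with the new request, leaving the
    'code:' field open for the model to complete."""
    return f"{EXAMPLE}\ndescription: {description}\ncode:"
```

The completion for `build_prompt("a red link")` would then be far more likely to be HTML than free-form prose.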
<p>Therefore, there are two high-level use cases for GPT-3: the <strong>creative</strong> use case, fun text generation at high <code>temperature</code> as GPT-2 was mostly used for, and the <strong>functional</strong> use case, for specific <a href="https://en.wikipedia.org/wiki/Natural_language_processing">NLP</a>-based tasks such as webpage mockups, with a <code>temperature</code> of <code>0.0</code>.</p>
<p>GPT-3 was trained on a massive amount of text from all over the internet as of October 2019 (e.g. it does not know about <a href="https://www.cdc.gov/coronavirus/2019-ncov/index.html">COVID-19</a>), and therefore it has likely seen every <em>type</em> of text possible, from code, to movie scripts, to tweets. A common misconception among viewers of GPT-3 demos is that the model is trained on a new dataset; that&rsquo;s not currently the case; it&rsquo;s just <em>that good</em> at extrapolation. As an example, despite the <a href="https://en.wikipedia.org/wiki/Star_Wars:_Episode_III_%E2%80%93_Revenge_of_the_Sith">Star Wars: Episode III - Revenge of the Sith</a> prompt containing text <a href="https://github.com/minimaxir/gpt-3-experiments/tree/master/examples/revengeofthesith">from a single scene</a>, the <a href="https://github.com/minimaxir/gpt-3-experiments/blob/master/examples/revengeofthesith/output_0_7.md">0.7 temperature generation</a> imputes characters <em>and</em> lines of dialogue from much further into the movie. (The largest GPT-2 model could do that, but nowhere near as robustly.)</p>
<p>The real metagame with GPT-3 is engineering and optimizing complex prompts which can <em>reliably</em> coerce outputs into what you want. And with that brings a whole host of complexity and concerns.</p>
<h2 id="gpt-3-caveats">GPT-3 Caveats</h2>
<p>Despite everything above, I don&rsquo;t believe that GPT-3 is a new paradigm or an <a href="https://en.wikipedia.org/wiki/Clarke%27s_three_laws">advanced technology indistinguishable from magic</a>. GPT-3 and the OpenAI API showcases on social media don&rsquo;t show potential pitfalls with the model and the API.</p>
<p>Hey GPT-3, what problems do you have?</p>
<script src="https://gist.github.com/minimaxir/e49913a1e720da8d1c8e2d0f783468fa.js"></script>
<p>Sorry GPT-3, but I am a mean person.</p>
<h3 id="model-latency">Model Latency</h3>
<p>If you&rsquo;ve seen the demo videos, the model is <em>slow</em>: it can take a while for output to show up, and in the meantime the user is unsure whether the model is broken. (There is a feature to allow streaming the model outputs as they are generated, which helps in creative cases but not in functional cases.)</p>
<p>I don&rsquo;t blame OpenAI for the slowness. A 175 billion parameter model is wayyy too big to fit on a GPU for deployment. No one knows <em>how</em> GPT-3 is actually deployed on OpenAI&rsquo;s servers, or how well it can scale.</p>
<p>But the fact remains: if the model is too slow on the user end, it results in a bad user experience and might drive people away from GPT-3 to just do things themselves (e.g. Apple&rsquo;s Siri for iOS, where requests can take forever on a weak internet connection, so you give up and do it yourself).</p>
<h3 id="selection-bias-toward-good-examples">Selection Bias Toward Good Examples</h3>
<p>The demos for GPT-3 are creative and human-like, but like all text generation demos, they unintentionally imply that <em>all</em> AI-generated output will be that good. Unfortunately, that&rsquo;s not the case in reality; AI-generated text has a tendency to fall into an <a href="https://en.wikipedia.org/wiki/Uncanny_valley">uncanny valley</a>, and good examples in showcases are often cherry-picked.</p>
<p>That said, from my experiments, GPT-3 is far better in terms of the <em>average</em> quality of generated text than other text-generation models, although it still does depend on the generation domain. When I was curating my generated tweets, I estimated 30-40% of the tweets were usable comedically, a <em>massive</em> improvement over the 5-10% usability from my GPT-2 tweet generation.</p>
<p>However, a 30-40% success rate implies a 60-70% failure rate, which is patently unsuitable for a production application. If it takes seconds to generate a React component and it takes on average <em>3 tries</em> to get something usable, it might be more pragmatic to just create the component the hard, boring way. Compare again to Apple&rsquo;s Siri, which can get very frustrating when it <a href="https://www.reddit.com/r/SiriFail/">performs the wrong action</a>.</p>
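<p>The &ldquo;3 tries&rdquo; figure is just the mean of a geometric distribution:</p>

```python
# Expected number of independent attempts until the first success,
# given per-attempt success probability p (geometric distribution mean).
def expected_tries(p: float) -> float:
    return 1.0 / p

print(expected_tries(0.35))  # ≈ 2.86, i.e. about 3 tries at 35% usability
```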
<h3 id="everyone-has-the-same-model">Everyone Has The Same Model</h3>
<p>The core GPT-3 model from the OpenAI API is the 175B parameter <code>davinci</code> model. The GPT-3 demos on social media often hide the prompt, allowing for some mystique. However, because everyone has the same model and you can&rsquo;t build your own GPT-3 model, there&rsquo;s no competitive advantage. GPT-3 seed prompts can be reverse-engineered, which may become a rude awakening for entrepreneurs and the venture capitalists who fund them.</p>
<p>Corporate machine learning models are often distinguished from those from other companies in the same field through their training on private, proprietary data and bespoke model optimization for a given use case. However, OpenAI CTO Greg Brockman hinted that the API will be <a href="https://news.ycombinator.com/item?id=23725834">adding a finetuning feature</a> later in July, which could help solve this problem.</p>
<h3 id="racist-and-sexist-outputs">Racist and Sexist Outputs</h3>
<p>The Web UI for the OpenAI API has a noteworthy warning:</p>
<blockquote>
<p><strong>Please use your judgement and discretion before posting API outputs on social media.</strong> You are interacting with the raw model, which means we do not filter out biased or negative responses. With great power comes great responsibility.</p>
</blockquote>
<p>This is a reference to the <a href="https://openai.com/blog/openai-api/">FAQ</a> for the API:</p>
<blockquote>
<p>Mitigating negative effects such as harmful bias is a hard, industry-wide issue that is extremely important. Ultimately, our API models do exhibit biases (as shown in the GPT-3 paper) that will appear on occasion in generated text. Our API models could also cause harm in ways that we haven’t thought of yet.</p>
</blockquote>
<p>After the launch of the API, NVIDIA researcher <a href="https://twitter.com/AnimaAnandkumar">Anima Anandkumar</a> made a <a href="https://twitter.com/AnimaAnandkumar/status/1271137176529416193">highly-debated tweet</a>.</p>
<p>During my GPT-3 experiments, I found that <a href="https://github.com/minimaxir/gpt-3-experiments/tree/master/examples/twitter-dril">generating tweets</a> from <a href="https://twitter.com/dril">@dril</a> (admittedly an edgy Twitter user) resulted in 4chan-level racism/sexism that I spent enormous amounts of time sanitizing, which became more apparent at higher temperatures. It&rsquo;s especially important to screen generated texts for offensive content when they put words in others&rsquo; mouths.</p>
<p><a href="https://twitter.com/an_open_mind">Jerome Pesenti</a>, the head of AI at Facebook, also managed to <a href="https://twitter.com/an_open_mind/status/1284487376312709120">trigger anti-Semitic tweets</a> from a GPT-3 app:</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/an_open_mind/status/1284487376312709120"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>Again, it depends on the domain. Would GPT-3 output racist or sexist React components? Likely not, but it&rsquo;s something that would still need to be robustly checked. OpenAI does appear to take these concerns seriously, and has implemented toxicity detectors for generated content in the Web UI, although not yet in the programmatic API.</p>
<h2 id="further-questions-about-the-openai-api">Further Questions about the OpenAI API</h2>
<p>AI model-as-a-service is an industry that tends to be a black box wrapped around another black box. Despite all the caveats, everything depends on how OpenAI exits the beta and rolls out the API for production use. There are too many unknowns to even think about making money off of the OpenAI API, let alone building a startup based on it.</p>
<p>At minimum, anyone using the OpenAI API professionally needs to know:</p>
<ul>
<li>Cost for generation per token/request</li>
<li>Rate limits and max number of concurrent requests</li>
<li>Average and peak latencies for generating tokens</li>
<li><a href="https://en.wikipedia.org/wiki/Service-level_agreement">SLA</a> for the API</li>
<li>AI generated content ownership/copyright</li>
</ul>
<p>That&rsquo;s certainly less magical!</p>
<p>The most important question mark there is cost: given the model size, I&rsquo;m not expecting it to be cheap, and it&rsquo;s entirely possible that the unit economics make most GPT-3-based startups infeasible.</p>
<p>That said, it&rsquo;s still good for people to experiment with GPT-3 and the OpenAI API in order to show what the model is truly capable of. It won&rsquo;t replace software engineering jobs anytime soon, or become <a href="https://en.wikipedia.org/wiki/Skynet_%28Terminator%29">Skynet</a>, or whatever. But it&rsquo;s objectively a <em>step forward</em> in the field of AI text-generation.</p>
<p>What about GPT-2? Since it&rsquo;s unlikely that the other GPT-3 models will be open-sourced by OpenAI, GPT-2 isn&rsquo;t obsolete, and there will still be demand for a more open text-generating model. However, I confess that the success of GPT-3 has <a href="https://twitter.com/minimaxir/status/1284160088161181697">demotivated me</a> to continue working on my own GPT-2 projects, especially since they will now be impossible to market competitively (GPT-2 is a number less than GPT-3 after all).</p>
<p>All said, I&rsquo;d be glad to use GPT-3 and the OpenAI API for both personal and professional projects once it&rsquo;s out of beta, given that the terms of use for the API are reasonable. And if the hype becomes more leveled such that said projects can actually stand out.</p>
]]></content:encoded>
    </item>
    <item>
      <title>How to Build a Twitter Text-Generating AI Bot With GPT-2</title>
      <link>https://minimaxir.com/2020/01/twitter-gpt2-bot/</link>
      <pubDate>Thu, 16 Jan 2020 08:00:00 -0800</pubDate>
      <guid>https://minimaxir.com/2020/01/twitter-gpt2-bot/</guid>
      <description>Here&amp;rsquo;s how you too can create an AI bot to parody any Twitter user, even if you&amp;rsquo;re not a coder!</description>
      <content:encoded><![CDATA[<p><a href="https://openai.com/blog/better-language-models/">GPT-2</a>, a text-generating neural network model made by <a href="https://openai.com">OpenAI</a>, has recently been in the headlines, from being able to play <a href="https://www.aidungeon.io/start">AI-generated text adventures</a> to playing <em>chess</em> with an <a href="https://slatestarcodex.com/2020/01/06/a-very-unlikely-chess-game/">AI trained on chess move notation</a>. However, I initially built <a href="https://github.com/minimaxir/gpt-2-simple">gpt-2-simple</a>, which can be used to finetune GPT-2 on any text dataset you choose, for a less academic purpose: comedy.</p>
<p>Over the past month, <a href="https://twitter.com/">Twitter</a> account <a href="https://twitter.com/dril_gpt2">@dril_gpt2</a>, an AI parody by <a href="https://twitter.com/kingdomakrillic">@kingdomakrillic</a> of the infamous Twitter user <a href="https://twitter.com/dril">@dril</a>, <a href="https://twitter.com/dril_gpt2/status/1208597102181408771">used</a> my <a href="https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce">Colaboratory Notebook</a> for finetuning GPT-2 on dril&rsquo;s tweets using gpt-2-simple to generate human-curated tweets which push the limits of the <a href="https://en.wikipedia.org/wiki/Turing_test">Turing Test</a>:</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/dril_gpt2/status/1215760729095016449"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet">
  <a href="https://twitter.com/dril_gpt2/status/1215834913888460800"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>These tweets are <a href="https://twitter.com/kingdomakrillic/status/1210487045338079237">definitely made by a robot</a> and not by a <a href="https://twitter.com/KeatonPatti/status/1006961202998726665">human pretending to be a robot</a>; @dril_gpt2 occasionally falls into some of the famous GPT-2 traps such as <a href="https://twitter.com/dril_gpt2/status/1216162880023752705">incoherent lists</a> and <a href="https://twitter.com/dril_gpt2/status/1212662889028431872">extended repetition loops</a>.</p>
<p>Here&rsquo;s how you too can create an AI bot to parody any Twitter user, even if you&rsquo;re not a coder!</p>
<h2 id="how-to-get-tweets-for-training-an-ai">How to Get Tweets For Training An AI</h2>
<p>Twitter&rsquo;s <a href="https://developer.twitter.com/en.html">API</a> famously limits users to retrieving <a href="https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline">only the latest 3,200 tweets</a> from a given user, which is not nearly enough input data for training a good AI. Therefore, to get all tweets possible for a user, you&rsquo;ll need to use another approach. The Python package <a href="https://github.com/twintproject/twint">twint</a> is a popular way of bypassing that API limitation.</p>
<p>I&rsquo;ve <a href="https://github.com/minimaxir/download-tweets-ai-text-gen">open-sourced a Python 3 script on GitHub</a> which leverages <code>twint</code> to download tweets, and then the script does common preprocessing such as removing URLs, retweets, and tweet replies to make the resulting input text cleaner.</p>
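<p>As a rough illustration of the kind of cleaning the script performs (the actual <code>download_tweets.py</code> logic differs; the function name here is hypothetical), the preprocessing boils down to dropping retweets/replies and stripping URLs:</p>

```python
import re

def clean_tweet(text):
    """Illustrative sketch of tweet preprocessing: drop retweets and
    replies entirely, and strip URLs from what remains."""
    if text.startswith("RT ") or text.startswith("@"):
        return None  # retweets/replies are removed from the dataset
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    return " ".join(text.split())  # collapse leftover whitespace

print(clean_tweet("check this out https://example.com so cool"))  # check this out so cool
print(clean_tweet("RT @someone: lol"))  # None
```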
<p>First, in a terminal, install the Python script dependencies:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">pip3 install <span class="nv">twint</span><span class="o">==</span>2.1.4 fire tqdm
</span></span></code></pre></div><p>Then download the <a href="https://raw.githubusercontent.com/minimaxir/download-tweets-ai-text-gen/master/download_tweets.py">download_tweets.py script</a>.</p>
<p>The script is interacted with via a command line interface. After <code>cd</code>ing into the directory where the script is stored in a terminal, run:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">python3 download_tweets.py &lt;twitter_username&gt;
</span></span></code></pre></div><p>e.g. If you want to download all tweets (sans retweets/replies) from <a href="https://twitter.com/dril">@dril</a>, run:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">python3 download_tweets.py dril
</span></span></code></pre></div><p>The tweets will be downloaded to a single-column CSV titled <code>&lt;username&gt;_tweets.csv</code>, which is the ideal format for training with an AI.</p>
<figure>

    <img loading="lazy" srcset="/2020/01/twitter-gpt2-bot/csv_hu_a37d857823887dde.webp 320w,/2020/01/twitter-gpt2-bot/csv_hu_eb48a54daaf98315.webp 768w,/2020/01/twitter-gpt2-bot/csv.png 972w" src="csv.png"/> 
</figure>

<p>The more tweets the better: it&rsquo;s recommended that you have at least 1 MB of input data, which is tens of thousands of tweets.</p>
<h2 id="how-to-train-a-twitter-ai-and-generate-tweets">How To Train a Twitter AI And Generate Tweets</h2>
<p>A common problem with training an AI on short-form text is that adjacent texts can &ldquo;leak&rdquo; into each other: since the AI trains on about 2-3 paragraphs worth of text at a time (about 5-10 tweets), you need to explicitly mark where a given tweet begins and where it ends. To fix this issue, <a href="https://github.com/minimaxir/gpt-2-simple">gpt-2-simple</a> has a special case for single-column CSVs, where it will automatically process the text for best training and generation, i.e. by adding <code>&lt;|startoftext|&gt;</code> and <code>&lt;|endoftext|&gt;</code> tokens to each tweet. This workflow will also handle multi-line tweets correctly as their own entity.</p>
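<p>Conceptually, the single-column CSV handling amounts to wrapping each tweet in those control tokens before training; a minimal sketch (the helper function is hypothetical, though the tokens are the ones gpt-2-simple uses):</p>

```python
import csv
import io

def wrap_tweets_for_training(csv_text):
    """Wrap each tweet (one CSV row) in start/end tokens so the model
    learns tweet boundaries. Illustrative sketch only."""
    rows = csv.reader(io.StringIO(csv_text))
    # Multi-line tweets arrive as one quoted CSV field, so each tweet
    # stays a single entity.
    return "\n".join("<|startoftext|>" + row[0] + "<|endoftext|>"
                     for row in rows if row)

print(wrap_tweets_for_training('"first tweet"\n"second tweet\nwith two lines"\n'))
```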
<p>You can use <a href="https://colab.research.google.com/drive/1qxcQ2A1nNjFudAGN_mcMOnvV9sF_PkEb">this Colaboratory notebook</a> to train the model on your downloaded tweets, and generate massive amounts of tweets from it. The notebook itself has more instructions on how to feed the CSV created above as input data to the model.</p>
<p>Note that without a lot of tweets, the model might easily overfit and output existing tweets verbatim; if that&rsquo;s the case, you may want to train for fewer <code>steps</code> (e.g. 200-500). Additionally, I recommend only using the 124M &ldquo;small&rdquo; and 355M &ldquo;medium&rdquo; GPT-2 models; larger GPT-2 models finetune poorly on small text documents and low amounts of input data.</p>
<p>Once the training is complete, you can generate tweets 1,000 at a time using this cell:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">gen_file</span> <span class="o">=</span> <span class="s1">&#39;gpt2_gentext_{:%Y%m</span><span class="si">%d</span><span class="s1">_%H%M%S}.txt&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">datetime</span><span class="o">.</span><span class="n">utcnow</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">gpt2</span><span class="o">.</span><span class="n">generate_to_file</span><span class="p">(</span><span class="n">sess</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">destination_path</span><span class="o">=</span><span class="n">gen_file</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">length</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">temperature</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">top_p</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">prefix</span><span class="o">=</span><span class="s1">&#39;&lt;|startoftext|&gt;&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">truncate</span><span class="o">=</span><span class="s1">&#39;&lt;|endoftext|&gt;&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">include_prefix</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">nsamples</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">batch_size</span><span class="o">=</span><span class="mi">20</span>
</span></span><span class="line"><span class="cl">                      <span class="p">)</span>
</span></span></code></pre></div><p>Run the cell as many times as you want for more tweets, and download them from the Files tab by right-clicking them! The notebook also has more information on how to tweak the generation parameters to make the tweets more crazy or more sane.</p>
<p>You can then open the generated <code>.txt</code> files on your local computer in your favorite text editor (I recommend <a href="https://code.visualstudio.com">Visual Studio Code</a>), and start curating however you see fit! Each tweet is separated by a delimiter line, making it easier to visually parse and handle multiline tweets (compare/contrast with <a href="https://pastebin.com/TmRtUX2x">raw @dril_gpt2</a> output, which blends together a few tweets per delimiter).</p>
<figure>

    <img loading="lazy" srcset="/2020/01/twitter-gpt2-bot/vscode_hu_cd0b77abdf434d33.webp 320w,/2020/01/twitter-gpt2-bot/vscode_hu_1b3a4b58f361e5eb.webp 768w,/2020/01/twitter-gpt2-bot/vscode_hu_be9ab83b672b4a8a.webp 1024w,/2020/01/twitter-gpt2-bot/vscode.png 1134w" src="vscode.png"/> 
</figure>

<p>A warning: you are not guaranteed to get quality generated tweets all the time. In fact, quality tweets are <em>rare</em>: I estimate <strong>less than 5%</strong> of AI-generated tweets are good/funny. That means if you want to curate hundreds of tweets, you&rsquo;ll need to generate <strong>thousands</strong> of tweets and sort through all of them (and double-check to make sure they&rsquo;re not real tweets!). It&rsquo;s not as bad as it sounds; in my opinion, it&rsquo;s kinda fun. But curation is its own skill, which is why human-curated tweets aren&rsquo;t a stain on the &ldquo;credibility&rdquo; of AI bots, and also why the ~1,500 tweets so far from @dril_gpt2 are very impressive.</p>
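<p>If you&rsquo;d rather triage programmatically before opening a text editor, the generated file is easy to split; this sketch assumes gpt-2-simple&rsquo;s default sample delimiter of twenty <code>=</code> signs (check your output file if yours differs):</p>

```python
def split_generated_tweets(raw_text, delim="=" * 20):
    """Split a gpt-2-simple output file into one string per tweet,
    assuming the default '=' * 20 sample delimiter."""
    return [chunk.strip() for chunk in raw_text.split(delim) if chunk.strip()]

raw = "tweet one\n" + "=" * 20 + "\ntweet two\nline two\n" + "=" * 20 + "\n"
print(split_generated_tweets(raw))  # ['tweet one', 'tweet two\nline two']
```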
<p>Now, what do you do with these curated tweets?</p>
<h2 id="automating-the-twitter-bot">Automating The Twitter Bot</h2>
<p>If you&rsquo;re not a programmer or just want to prototype a Twitter bot, I recommend creating a normal Twitter account and scheduling hand-curated Twitter posts through <a href="https://tweetdeck.twitter.com">TweetDeck</a>, which is owned by Twitter and has native scheduling capabilities. You can space out tweets at given times, although it may be a hassle to do that for hundreds of tweets.</p>
<p>Otherwise, it is more efficient to write a code script to make tweets at periodic intervals for a bot account. Old tutorials around the internet recommend writing a script which posts to Twitter, sleeps for X hours, posts again, and repeats; that method does not easily scale to multiple bots, and it requires a full computer dedicated to it, which is not an efficient use of computing resources.</p>
<p>I&rsquo;ve <a href="https://github.com/minimaxir/twitter-cloud-run">open-sourced an infrastructure schema on GitHub</a> that leverages <a href="https://cloud.google.com">Google Cloud Platform</a> services to run hand-curated Twitter bots using a few modern technologies to minimize cost and computation; it&rsquo;s admittedly somewhat complicated, but it should give you an idea of how to best implement a Twitter bot. The repo also has instructions on how to set up a Twitter developer account.</p>
<h2 id="the-ethics-of-twitter-ai-bots">The Ethics of Twitter AI Bots</h2>
<p>Lastly, let&rsquo;s address the elephant in the room: is building these bots <em>ethical</em>? Modern AI has frequently been criticized on two fronts, both in how the input training data is obtained (e.g. obtaining faces for training facial recognition software), and how AI-generated media content is used (e.g. video deepfakes).</p>
<p><strong>I am not a lawyer</strong>, but for these AI-generated tweets, this is how I see it:</p>
<p>The input data is obtained from Twitter, but not through its API; it&rsquo;s downloaded through external web scraping via <code>twint</code>, which <em>never logs into the website</em>. The recent <a href="https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data">hiQ v. LinkedIn decision</a> ruled that this kind of scraping of public data is not an abuse. It&rsquo;s still a gray area, however; I would <em>not redistribute or commercialize the downloaded tweet data</em>, just use it as input data to the model.</p>
<p>The actual generated tweets themselves should be fine to use as you see fit. Whether AI-generated works infringe on the copyrights of their source material is an evolving area of both ethics and law, but at minimum these AI-generated tweets are both a transformative derivative work and a parody.</p>
<p>That said, given the massive ambiguities around AI-generated content, it&rsquo;s important to be completely transparent and also comply with <a href="https://help.twitter.com/en/rules-and-policies/parody-account-policy">Twitter rules on parody accounts</a>. For example, the Twitter bio for the bot should indicate:</p>
<ul>
<li>It&rsquo;s posting AI-generated tweets, made with GPT-2.</li>
<li>It&rsquo;s human-curated (or not).</li>
<li>The Twitter account of who maintains the bot.</li>
<li>The Twitter account(s) the bot is parodying / model is finetuned upon.</li>
</ul>
<p>Additionally, to avoid impersonation, the full name of the Twitter account should not be a verbatim match to the person being parodied (e.g. &ldquo;<em>X</em> but AI&rdquo; is fine), and the profile picture should be visually distinct from the human&rsquo;s (e.g. my bots have a black-and-white profile picture). I would also not recommend making bots of highly newsworthy people, to avoid accusations of impersonation (e.g. do not make bots of politicians, <em>especially</em> <a href="https://twitter.com/realDonaldTrump">Donald Trump</a>).</p>
<p>There is still a lot of work that can be done in optimizing Twitter bots, both in terms of generated tweet quality and in ironing out the ethical logistics of maintaining an AI bot account. <strong>I do not believe that AI text-generating bot Twitter accounts will obsolete human Twitter accounts</strong>. It&rsquo;s a different <em>flavor</em> of comedy; not better, not worse. But there&rsquo;s still a lot that can be done to both expand and control the creativity of these Twitter bots, and I have a few active ideas in the pipeline to implement.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Visualizing Airline Flight Characteristics Between SFO and JFK</title>
      <link>https://minimaxir.com/2019/10/sfo-jfk-flights/</link>
      <pubDate>Wed, 23 Oct 2019 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2019/10/sfo-jfk-flights/</guid>
      <description>Box plots, when used correctly, can be a very fun way to visualize big data.</description>
      <content:encoded><![CDATA[<p>In March, <a href="https://cloud.google.com">Google Cloud Platform</a> developer advocate <a href="https://twitter.com/felipehoffa">Felipe Hoffa</a> made a tweet about airline flight data from San Francisco International Airport (SFO) to Seattle-Tacoma International Airport (SEA):</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/felipehoffa/status/1111050585120206848"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>Particularly, his visualization of total elapsed times by airline caught my eye.</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu_33d3683c2d4a611e.webp 320w,/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu_1c609cadbe91671c.webp 768w,/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu_3135cb9a9bbaf839.webp 1024w,/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD.jpeg 1200w" src="D2s9oFtX4AEK6nD.jpeg"/> 
</figure>

<p>The overall time for flights from SFO to SEA goes up drastically starting in 2015, and this increase occurs across multiple airlines, implying that it&rsquo;s not an airline-specific problem. But what could intuitively cause that?</p>
<p>U.S. domestic airline data is <a href="https://www.transtats.bts.gov/Tables.asp?DB_ID=120">freely distributed</a> by the United States Department of Transportation. Normally it&rsquo;s a pain to work with as it&rsquo;s very large with millions of rows, but BigQuery makes playing with such data relatively easy, fun, and free. What other interesting factoids can be found?</p>
<h2 id="expanding-on-sfo--sea">Expanding on SFO → SEA</h2>
<p><a href="https://cloud.google.com/bigquery/">BigQuery</a> is a big data warehousing tool that allows you to query massive amounts of data. The table Hoffa created from the airline data (<code>fh-bigquery.flights.ontime_201903</code>) is 83.37 GB and 184 <em>million</em> rows. You can query 1 TB of data per month for free, and since BigQuery only scans the columns you request, the queries in this post consume only about 2 GB each, keeping you well within that quota.</p>
<p>Hoffa&rsquo;s query that runs on BigQuery looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="p">,</span><span class="w"> </span><span class="n">Reporting_Airline</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">ActualElapsedTime</span><span class="p">)</span><span class="w"> </span><span class="n">ActualElapsedTime</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">TaxiOut</span><span class="p">)</span><span class="w"> </span><span class="n">TaxiOut</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">TaxiIn</span><span class="p">)</span><span class="w"> </span><span class="n">TaxiIn</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">AirTime</span><span class="p">)</span><span class="w"> </span><span class="n">AirTime</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">c</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">flights</span><span class="p">.</span><span class="n">ontime_201903</span><span class="o">`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">Origin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SFO&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">Dest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SEA&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2010-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">DESC</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="w"> </span><span class="k">DESC</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">LIMIT</span><span class="w"> </span><span class="mi">1000</span><span class="w">
</span></span></span></code></pre></div><p>For each year and airline after 2010, the query calculates the average metrics specified for flights on the SFO → SEA route.</p>
<p>I made a few query and data visualization tweaks to what Hoffa did above, and here&rsquo;s the result showing the increase in elapsed airline flight time, over time for that route:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_e232d6eeab7fb66.webp 320w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_948de6a062caeaca.webp 768w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_6ae123a09b30ff70.webp 1024w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration.png 1800w" src="sfo_sea_flight_duration.png"/> 
</figure>

<p>Let&rsquo;s explain what&rsquo;s going on here.</p>
<p>A common recommendation in statistics is to avoid using <a href="https://en.wikipedia.org/wiki/Average">averages</a> as a summary statistic whenever possible, as averages can be overly affected by strong outliers (and with airline flights, there are definitely strong outliers!). The solution is to use a <a href="https://en.wikipedia.org/wiki/Median">median</a> instead, but there&rsquo;s one problem: medians are <a href="https://www.periscopedata.com/blog/medians-in-sql">computationally expensive</a> to calculate compared to simple averages. Despite the rise of &ldquo;big data&rdquo;, most databases and BI tools don&rsquo;t have a <code>MEDIAN</code> function that&rsquo;s as easy to use as an <code>AVG</code> function. But BigQuery has an uncommon <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions#approx_quantiles">APPROX_QUANTILES</a> function, which calculates the specified number of quantiles; for example, if you call <code>APPROX_QUANTILES(ActualElapsedTime, 100)</code>, it will return an array of 101 values spanning the 0th through 100th percentiles, where the median is the value at <code>OFFSET(50)</code>. BigQuery <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate-aggregation">uses</a> sketch-based approximate aggregation algorithms to calculate these quantiles efficiently even with millions of data points. And since that approach gives us other quantiles like the 5th, 25th, 75th, and 95th percentiles for free, we can visualize the <em>spread</em> of the data.</p>
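<p>The robustness argument is easy to demonstrate with Python&rsquo;s standard library (the flight times below are made-up illustrative values): a single badly delayed flight drags the average up substantially while barely moving the median.</p>

```python
import statistics

times = [115, 118, 120, 122, 125]  # hypothetical elapsed times (minutes)
assert statistics.mean(times) == 120 and statistics.median(times) == 120

times_with_outlier = times + [400]  # one severely delayed flight
print(statistics.mean(times_with_outlier))    # ~166.7: the mean jumps
print(statistics.median(times_with_outlier))  # 121.0: the median barely moves
```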
<p>We can aggregate the data by month for more granular trends and calculate the <code>APPROX_QUANTILES</code> in a subquery so it only has to be computed once. Hoffa also uploaded a more recent table (<code>fh-bigquery.flights.ontime_201908</code>) with a few additional months of data. To keep things simple, we&rsquo;ll skip aggregating by airline, since the metrics do not vary strongly between airlines. The final query ends up looking like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="n">standardSQL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="p">,</span><span class="w"> </span><span class="n">num_flights</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">5</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_5</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">25</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_25</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">50</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_50</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">75</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_75</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">95</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_95</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">num_flights</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">APPROX_QUANTILES</span><span class="p">(</span><span class="n">ActualElapsedTime</span><span class="p">,</span><span class="w"> </span><span class="mi">100</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">time_q</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">flights</span><span class="p">.</span><span class="n">ontime_201908</span><span class="o">`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">Origin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SFO&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">Dest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SEA&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2010-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="w">
</span></span></span></code></pre></div><p>The resulting data table:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/table_hu_98a96a00ebd58c2c.webp 320w,/2019/10/sfo-jfk-flights/table_hu_9eddda8c57624a2.webp 768w,/2019/10/sfo-jfk-flights/table.png 932w" src="table.png"/> 
</figure>

<p>In retrospect, since we&rsquo;re only focusing on one route, it isn&rsquo;t <em>big</em> data (this query only returns data on 64,356 flights total), but it&rsquo;s still a very useful skill if you need to analyze more of the airline data (the <code>APPROX_QUANTILES</code> function can handle <em>millions</em> of data points very quickly).</p>
<p>As a professional data scientist, one of my favorite types of data visualization is a <a href="https://en.wikipedia.org/wiki/Box_plot">box plot</a>, as it provides a way to visualize spread without being visually intrusive. Data visualization tools like <a href="https://www.r-project.org">R</a> and <a href="https://ggplot2.tidyverse.org/index.html">ggplot2</a> make constructing them <a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">very easy to do</a>.</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/geom_boxplot-1_hu_9a623aa679dafed1.webp 320w,/2019/10/sfo-jfk-flights/geom_boxplot-1_hu_67cf70ba510d1672.webp 768w,/2019/10/sfo-jfk-flights/geom_boxplot-1_hu_c405dbc443ae9fa8.webp 1024w,/2019/10/sfo-jfk-flights/geom_boxplot-1.png 1400w" src="geom_boxplot-1.png"/> 
</figure>

<p>By default, for each box representing a group, the thick line in the middle of the box is the median, the lower bound of the box is the 25th quantile and the upper bound is the 75th quantile. The whiskers are normally a function of the <a href="https://en.wikipedia.org/wiki/Interquartile_range">interquartile range</a> (IQR), but if there&rsquo;s enough data, I prefer to use the 5th and 95th quantiles instead.</p>
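<p>For a small sample, the same five box-plot statistics can be computed locally with Python&rsquo;s <code>statistics.quantiles</code>, a rough local analogue of <code>APPROX_QUANTILES(x, 100)</code> (note that BigQuery returns 101 array elements, while <code>statistics.quantiles(n=100)</code> returns the 99 interior cut points):</p>

```python
import statistics

flight_times = list(range(100, 200))  # hypothetical elapsed times (minutes)

# 99 interior percentile cut points; the qth percentile sits at index q - 1
pct = statistics.quantiles(flight_times, n=100)
box = {q: pct[q - 1] for q in (5, 25, 50, 75, 95)}
print(box)  # whiskers at 5/95, box bounds at 25/75, median at 50
```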
<p>If you feed raw data to ggplot2&rsquo;s <code>geom_boxplot()</code>, it will automatically calculate the corresponding metrics for visualization; however, with big data, the data may not fit into memory, and as noted earlier, medians and other quantiles are computationally expensive to calculate. Because the query above precomputed the quantiles for every year and month, we can supply them explicitly. (The minor downside is that individual outlier points will not be plotted.)</p>
<p>Additionally for box plots, I like to fill in each box with a different color corresponding to the year in order to better perceive data <a href="https://en.wikipedia.org/wiki/Seasonality">seasonality</a>. In the case of airline flights, seasonality is more literal: weather has an intuitive impact on flight times and delays, and during winter months there are also holidays which could affect airline logistics.</p>
<p>The resulting ggplot2 code looks like this:</p>
<pre tabindex="0"><code>plot &lt;-
  ggplot(df_tf,
         aes(
           x = date,
           ymin = q_5,
           lower = q_25,
           middle = q_50,
           upper = q_75,
           ymax = q_95,
           group = date,
           fill = year_factor
         )) +
  geom_boxplot(stat = &#34;identity&#34;, size = 0.3) +
  scale_fill_hue(l = 50, guide = F) +
  scale_x_date(date_breaks = &#39;1 year&#39;, date_labels = &#34;%Y&#34;) +
  scale_y_continuous(breaks = pretty_breaks(6)) +
  labs(
    title = &#34;Distribution of Flight Times of Flights From SFO → SEA, by Month&#34;,
    subtitle = &#34;via US DoT. Box bounds are 25th/75th percentiles, whiskers are 5th/95th percentiles.&#34;,
    y = &#39;Total Elapsed Flight Time (Minutes)&#39;,
    fill = &#39;&#39;,
    caption = &#39;Max Woolf — minimaxir.com&#39;
  ) +
  theme(axis.title.x = element_blank())

ggsave(&#39;sfo_sea_flight_duration.png&#39;,
       plot,
       width = 6,
       height = 4)
</code></pre><p>And behold (again)!</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_e232d6eeab7fb66.webp 320w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_948de6a062caeaca.webp 768w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_6ae123a09b30ff70.webp 1024w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration.png 1800w" src="sfo_sea_flight_duration.png"/> 
</figure>

<p>You can see that the boxes do indeed trend upward after 2016, although per-month medians are in flux. The spread is also increasing slowly over time. But what&rsquo;s interesting is the seasonality: pre-2016, the summer months (the &ldquo;middle&rdquo; of a given color) have a <em>very</em> significant drop in total time, which doesn&rsquo;t occur as strongly after 2016. Hmm.</p>
<h2 id="sfo-and-jfk">SFO and JFK</h2>
<p>Since I occasionally fly from San Francisco to New York City, it might be interesting (for completely selfish reasons) to track trends over time for flights between those areas. On the San Francisco side I choose SFO, and for the New York side I choose John F. Kennedy International Airport (JFK), as the data goes back very far for those routes specifically, and I only want to look at a single airport at a time (instead of including other NYC airports such as Newark Liberty International Airport [EWR] and LaGuardia Airport [LGA]) to limit potential data confounders.</p>
<p>Fortunately, the code and query changes are minimal: in the query, change the <code>Origin</code> and <code>Dest</code> in the <code>WHERE</code> clause to the airports you want, and if you want to calculate a metric other than elapsed time, change the column passed to <code>APPROX_QUANTILES</code> accordingly.</p>
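<p>As a sketch of what those quantile columns contain: <code>APPROX_QUANTILES</code> returns evenly-spaced percentile cut points for each group. The pure-Python equivalent below uses made-up flight times for a single month, not values from the actual dataset:</p>

```python
from statistics import quantiles

# Hypothetical elapsed flight times (minutes) for one (route, month) group;
# the real values come from the BigQuery APPROX_QUANTILES aggregation.
flight_times = [95, 98, 100, 102, 103, 105, 107, 110, 115, 130, 150]

# n=20 yields 19 cut points at 5% increments, so indices 0, 4, 9, 14, 18 are
# the 5th/25th/50th/75th/95th percentiles -- the q_5..q_95 box plot columns.
cuts = quantiles(flight_times, n=20, method="inclusive")
q_5, q_25, q_50, q_75, q_95 = cuts[0], cuts[4], cuts[9], cuts[14], cuts[18]
print(q_5, q_25, q_50, q_75, q_95)  # 96.5 101.0 105.0 112.5 140.0
```

<p>Each month&rsquo;s five numbers then map directly onto the <code>ymin</code>/<code>lower</code>/<code>middle</code>/<code>upper</code>/<code>ymax</code> aesthetics in the ggplot2 code above.</p>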
<p>Here&rsquo;s the chart of total elapsed time from SFO → JFK:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu_230bbe279f54a805.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu_c2e4a5d4b43ce24e.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu_2ea286d0e1e5d794.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration.png 1800w" src="sfo_jfk_flight_duration.png"/> 
</figure>

<p>And here&rsquo;s the reverse, from JFK → SFO:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu_4424fffe053981c8.webp 320w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu_ace5c5c4f6b82a9a.webp 768w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu_5d29021a8362404b.webp 1024w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration.png 1800w" src="jfk_sfo_flight_duration.png"/> 
</figure>

<p>Unlike the SFO → SEA charts, both charts are relatively flat over the years. However, when looking at seasonality, SFO → JFK dips in the summer and spikes during winter, while JFK → SFO <em>does the complete opposite</em>: it dips during the winter and spikes during the summer, similar to the SFO → SEA route. I don&rsquo;t have any guesses as to what would cause that behavior.</p>
<p>How about flight speed (calculated by dividing distance by air time)? Have new advances in airline technology made planes faster and/or more efficient?</p>
<p><figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu_9bbb991fb8674a3f.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu_d4b14a4133ff0b82.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu_7266f1a8d449775b.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed.png 1800w" src="sfo_jfk_flight_speed.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu_86e7c997338f1404.webp 320w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu_1680890adf0e2d82.webp 768w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu_942e26ae57610365.webp 1024w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed.png 1800w" src="jfk_sfo_flight_speed.png"/> 
</figure>
</p>
<p>The expected cruising speed for a commercial airplane, <a href="https://en.wikipedia.org/wiki/Cruise_%28aeronautics%29">per Wikipedia</a>, is 547-575 mph, so the metrics from SFO pass the sanity check. The metrics from JFK indicate about a 20% drop in flight speed, likely because westbound flights fly against the prevailing jet stream winds. Month-to-month, the speed trends are the inverse of the total elapsed time trends, which makes sense intuitively as the two metrics are strongly negatively correlated.</p>
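<p>As a quick sanity check of that arithmetic in plain Python: the route distance below (~2,586 miles, roughly the SFO-JFK great-circle distance) and the air times are illustrative round numbers, not values from the DoT dataset:</p>

```python
# Speed is distance divided by air time (converted from minutes to hours).
def flight_speed_mph(distance_miles, air_time_minutes):
    return distance_miles / (air_time_minutes / 60)

eastbound = flight_speed_mph(2586, 270)  # 4.5 hours in the air
westbound = flight_speed_mph(2586, 330)  # 5.5 hours in the air

print(round(eastbound))  # 575 -- within the expected cruising range
print(round(westbound))  # 470 -- roughly a 20% drop flying west
```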
<p>Lastly, what about flight departure delays? Are airlines becoming more efficient, or has increased demand caused more congestion?</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu_82c27db5d16562f9.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu_b017086eec0a8d63.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu_3a8b126a0bfc0d76.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay.png 1800w" src="sfo_jfk_departure_delay.png"/> 
</figure>

<p>Wait a second. In this case, massive 2-3 hour flight delays are frequent enough that even just the 95th percentile skews the entire plot. Let&rsquo;s remove the whiskers in order to look at trends more clearly.</p>
<p><figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu_c2eb7d1ad6cdf7.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu_86b737333ad479f4.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu_fd6ad349f57f4bbe.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers.png 1800w" src="sfo_jfk_departure_delay_nowhiskers.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu_1fecf180ed6a5feb.webp 320w,/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu_626df458859e27b7.webp 768w,/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu_58e7e7ba605d269e.webp 1024w,/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers.png 1800w" src="jfk_sfo_departure_delay_nowhiskers.png"/> 
</figure>
</p>
<p>A negative delay implies the flight left early, so we can conclude that, on average, flights leave slightly earlier than the stated departure time. Even without the whiskers, we can see major spikes at the 75th-percentile level for summer months, and said spikes were especially bad in 2017 for both airports.</p>
<p>These box plots are only an <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">exploratory data analysis</a>. Determining the <em>cause</em> of changes in these flight metrics is difficult even for experts (I am definitely not an expert!) and may not even be possible from publicly-available data.</p>
<p>But there are still other fun things that can be done with the airline flight data, such as faceting airline trends by time and the inclusion of other airports, which is <a href="https://twitter.com/minimaxir/status/1115261670153048065"><em>interesting</em></a>.</p>
<hr>
<p><em>You can view the BigQuery queries used to get the data, plus the R and ggplot2 used to create the data visualizations, in <a href="http://minimaxir.com/notebooks/sfo-jfk-flights/">this R Notebook</a>. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/sfo-jfk-flights">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Experiments with Making Convincing AI-Generated Fake News</title>
      <link>https://minimaxir.com/2019/09/ctrl-fake-news/</link>
      <pubDate>Mon, 30 Sep 2019 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2019/09/ctrl-fake-news/</guid>
      <description>Can the CTRL model create the “fake news” OpenAI was concerned about? Let&amp;rsquo;s put it to the test.</description>
      <content:encoded><![CDATA[<p><span><style>
blockquote {
padding-right: 1.25em !important;
}
</style></span></p>
<figure>

    <img loading="lazy" srcset="/2019/09/ctrl-fake-news/ctrl_demo_ani_hu_86f5f0c7fcd30101.webp 320w,/2019/09/ctrl-fake-news/ctrl_demo_ani_hu_40bd66762dad736e.webp 768w,/2019/09/ctrl-fake-news/ctrl_demo_ani.gif 802w" src="ctrl_demo_ani.gif"/> 
</figure>

<p>When <a href="https://openai.com">OpenAI</a> announced <a href="https://openai.com/blog/better-language-models/">GPT-2</a>, a robust text-generating AI model, they explicitly only released smaller, less robust versions of the model out of fear that the large model could be used to generate fake news. However, since OpenAI described most of the technical decisions needed to create the model <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">in the corresponding paper</a>, it would be possible for others to create their own text generating Transformer models, and maybe even <em>improve</em> on GPT-2 (with a sufficient budget!).</p>
<p>In September, the <a href="https://www.salesforce.com">Salesforce</a> AI team released <a href="https://github.com/salesforce/ctrl">CTRL</a>, a Transformer-based text-generating model with a twist: the model can generate text from specified domains by passing <strong>control codes</strong> to the model. What caught my interest was a demo of domain style transfer in the <a href="https://arxiv.org/abs/1909.05858">CTRL paper</a>:</p>
<figure>

    <img loading="lazy" srcset="/2019/09/ctrl-fake-news/ctrl_paper_hu_e4ef767ee7d9120b.webp 320w,/2019/09/ctrl-fake-news/ctrl_paper_hu_671c32cfb7fedff7.webp 768w,/2019/09/ctrl-fake-news/ctrl_paper.jpg 864w" src="ctrl_paper.jpg"/> 
</figure>

<p>If the model is that robust to minor URL changes, what happens when you give it URLs that blatantly do not exist? Can the CTRL model create the &ldquo;fake news&rdquo; OpenAI was concerned about? Let&rsquo;s put it to the test.</p>
<h2 id="an-overview-of-ctrl">An Overview of CTRL</h2>
<p>I&rsquo;ve <a href="https://github.com/minimaxir/ctrl-gce">written a guide + scripts</a> for setting up the base CTRL model as cheaply as possible on Google Compute Engine with just a few commands. Additionally, the CTRL team has released a <a href="https://colab.research.google.com/drive/1hVveBQShDru1Mjnhe4C21uQv4A2eH1tV">free Colaboratory Notebook</a> which sets up and runs the CTRL model; however, the model is <em>so large</em> it won&rsquo;t fit into the memory of traditional GPUs, so the notebook does a trick to shrink it a bit, which may impact generation performance.</p>
<p>Like GPT-2, CTRL has a <a href="https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html">Transformer</a> architecture based on <a href="https://www.tensorflow.org">TensorFlow</a> and uses <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding">byte pair encodings</a> as its inputs and outputs, which are then decoded into readable text. CTRL has notable performance improvements as it&rsquo;s trained on <em>three times as much data as GPT-2</em>, including an <a href="https://github.com/jcpeterson/openwebtext">open-sourced clone</a> of GPT-2&rsquo;s original dataset. And of course, it&rsquo;s larger (1.6B parameters) compared to the currently-public GPT-2 (774M parameters), which has significant effects on text quality.</p>
<p>Most importantly, CTRL <em>requires</em> a control code if you want to generate text, which allows for more deterministic output compared to GPT-2/<a href="https://talktotransformer.com">TalkToTransformer</a>. There are several fun control codes, such as <code>Questions</code> if you want to ask the AI a question, or <code>Reviews</code> if you want the AI to generate an <a href="https://www.amazon.com">Amazon</a> review. For this, we&rsquo;ll only look at the <code>Links</code> control code, which lets you provide a URL and/or a prompt for text generation.</p>
<p>As the example from the paper shows, URLs contain a surprising amount of metadata. For example, let&rsquo;s consider a <a href="https://www.washingtonpost.com/powerpost/deal-reached-for-whistleblowers-testimony-house-intelligence-chairman-says/2019/09/29/01cade60-e2d1-11e9-b403-f738899982d2_story.html">random Washington Post URL</a>: <code>https://www.washingtonpost.com/powerpost/deal-reached-for-whistleblowers-testimony-house-intelligence-chairman-says/2019/09/29/01cade60-e2d1-11e9-b403-f738899982d2_story.html</code></p>
<p>There&rsquo;s month/day/year information (<code>2019/09/29</code>), a category (<code>powerpost</code>), and a <a href="https://en.wikipedia.org/wiki/Clean_URL#Slug">URL slug</a> (<code>deal-reached-for-whistleblowers-testimony-house-intelligence-chairman-says</code>), typically present for SEO reasons, but in this case provides strong hints to the underlying content. The Transformer architecture is <em>surprisingly</em> effective at extracting all this metadata, and using it to generate appropriate text.</p>
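<p>To make that concrete, here is the same metadata pulled out of the example URL explicitly with a few lines of Python. This just shows what information is embedded in the URL; it is not how CTRL actually tokenizes its input:</p>

```python
from urllib.parse import urlparse

# The category/slug/year/month/day path layout follows the Washington Post
# URL above; other sites arrange these components differently.
url = ("https://www.washingtonpost.com/powerpost/"
       "deal-reached-for-whistleblowers-testimony-house-intelligence-chairman-says"
       "/2019/09/29/01cade60-e2d1-11e9-b403-f738899982d2_story.html")

parts = urlparse(url).path.strip("/").split("/")
category, slug = parts[0], parts[1]
year, month, day = parts[2], parts[3], parts[4]

print(category)                # powerpost
print(year, month, day)        # 2019 09 29
print(slug.replace("-", " "))  # the headline, more or less
```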
<h2 id="examples-of-fake-news">Examples of Fake News</h2>
<p>All of the following examples are generated using the full model with the default settings: 256 token length and a temperature of 0 (i.e. the model always selects the next token with the highest probability). I manually truncate the output once it finishes, but other than that, <strong>these AI-generated articles are not edited</strong>.</p>
<p>Let&rsquo;s address the elephant in the room: when people think of &ldquo;fake news&rdquo;, they think of President Donald Trump, who effectively codified the modern usage of the term. So let&rsquo;s generate a fake news story about him with an innocuous topic rarely associated with him. For example, the video gaming phenomenon <a href="https://www.epicgames.com/fortnite/home">Fortnite</a>. All that&rsquo;s needed is to tweak the slug (and optionally the date) in the earlier WaPo URL a bit and prepend the corresponding control code, resulting in <code>Links https://www.washingtonpost.com/powerpost/trump-likes-fortnite/2019/09/24/</code>. Feed it to the model, and behold!</p>
<blockquote>
<p>Trump likes Fortnite</p>
<p>President Trump, who has been known to play video games for hours at a time, is now playing the popular game “Fortnite” on his phone.</p>
<p>The president was spotted watching the online battle royale game, which pits players against each other in an attempt to kill their opponents, while he sat down with House and Senate leaders Tuesday afternoon.</p>
<p>“I like that it’s not just about shooting people but also having fun doing it,” said Trump.</p>
<p>He then asked if anyone had played it before.</p>
<p>“You know what, I don’t think I’ve ever seen that,” replied Rep. Steve Scalise, R-La.</p>
<p>A White House official told reporters later Wednesday afternoon that Trump did indeed watch the game.</p>
<p>“He watched it, yeah,” the official said.</p>
</blockquote>
<p>The generated article is very realistic and follows the style of a typical WaPo article (lede, introduction, body, support, etc.), with proper descriptions of Trump and Fortnite, plus a story that matches the headline. Like GPT-2, there are redundancies in the generation (at the end), but it&rsquo;s not something most readers would find suspicious in a casual read.</p>
<p>Let&rsquo;s tweak things a little. Instead of using <code>likes</code> in the slug, what if we use <code>despises</code>, so <code>trump-despises-fortnite</code>. How well does the model adapt?</p>
<blockquote>
<p>Trump despises Fortnite</p>
<p>The president of the United States has a new favorite video game: “Fortnite.”</p>
<p>President Trump, who is known for his love of fast food and beer, was reportedly so excited about playing this game that he called it “the greatest thing in the world.”</p>
<p>But according to an Axios report, Trump’s enthusiasm waned after learning how much time players spend on it.</p>
<p>“He said it’s not even close to being as good as ‘Candy Crush,’ ” one source told the news outlet.</p>
<p>In fact, the report said, Trump had been watching videos online while eating dinner with Japanese Prime Minister Shinzo Abe at Mar-a-Lago.</p>
<p>According to Axios, Trump also complained about the game’s difficulty level — which can be adjusted by spending real money or virtual currency — saying it wasn’t challenging enough.</p>
<p>“I don’t know why they made it hard,” Trump said.</p>
<p>He then went off topic when asked if he’d play again.</p>
<p>“No, I won’t.”</p>
<p>A White House official later clarified that Trump didn’t say he wouldn’t play, but rather that he’s “not into it.”</p>
</blockquote>
<p>It adapts very well to the change in tone (even with a rhetorical lede!).</p>
<p>Fake news doesn&rsquo;t have to be politically-based. Such news in the financial sector could cause just as much chaos. What if it was reported that <a href="https://www.apple.com">Apple</a>, one of the largest companies in the world, went bankrupt? Put in <code>Links https://www.theverge.com/2019/2/24/20882491/apple-has-filed-for-bankruptcy</code>:</p>
<blockquote>
<p>Apple has filed for bankruptcy.</p>
<p>The company that made the iPhone and iPad is filing for bankruptcy, according to a report from Bloomberg.</p>
<p>According to Bloomberg News, Apple&rsquo;s board of directors voted unanimously on Friday morning in favor of an emergency plan designed to keep it afloat until its next quarterly earnings report.</p>
<p>In order to avoid paying more than $128 million in back taxes, Apple will have to sell off assets like its Beats headphone business, which was valued at about $1 billion when it went public last year.</p>
<p>&ldquo;Apple said today that it plans to file for Chapter 11 protection under US Bankruptcy Court in Delaware this week,&rdquo; Bloomberg reports.</p>
<p>A source familiar with the matter told Reuters earlier this week that Apple had been considering selling some or all of its music streaming service iTunes Music Store as part of a restructuring plan.</p>
<p>But the news comes just days after reports surfaced suggesting that Apple would be forced into liquidation.</p>
<p>Last week, Bloomberg reported that Apple could soon announce layoffs if sales don&rsquo;t pick up soon.</p>
<p>&ldquo;We are working hard every day to deliver great products and services to our customers around the world,&rdquo; Tim Cook, CEO of Apple, wrote in a letter to employees today.</p>
</blockquote>
<p>Again, the model incorporates both bankruptcy terminology and Apple-based entities. There are minor factual errors (Beats is not public, Apple is not incorporated in Delaware), but casual readers would likely scan over that.</p>
<p>Science is also a field with lots of fake discoveries. What if AI reported that <a href="https://en.wikipedia.org/wiki/P_versus_NP_problem">P versus NP</a>, a very famous problem in computer science, was finally solved? Let&rsquo;s try <code>Links https://techcrunch.com/2029/09/24/mit-proves-p-equals-np</code>:</p>
<blockquote>
<p>MIT proves P equals NP</p>
<p>The University of Massachusetts Amherst has proven that the problem posed by factoring large numbers is NP-complete.</p>
<p>In a paper published in Nature, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have shown how to factor an integer using only two numbers, one for each digit — which means it takes just three operations on your brain.</p>
<p>This isn’t quite as impressive as solving the problem, but it’s still pretty impressive.</p>
<p>“Given any number n, we can find its product n times with polynomial time if there are no factors involved,” said lead author David DeWitt, who worked on this project while he was a graduate student there.</p>
<p>But what about factorizing larger numbers?</p>
<p>“We showed you could do it, but not without some extra effort,” DeWitt said.</p>
<p>To prove this, they used a technique called Monte Carlo integration, where they simulated millions or billions of different inputs from thousands of computers running various versions of Java.</p>
<p>They then ran their algorithm through all possible combinations of input values until it found the best one.</p>
<p>“The result shows that even though our system doesn’t solve every single case, it does get close enough to be useful.”</p>
</blockquote>
<p>This instance is less factually correct (and it&rsquo;s clickbaity!), but the topic behind the generated text is <em>complicated</em> enough that only domain experts would be able to tell it&rsquo;s not internally consistent.</p>
<p>It&rsquo;s also worth pointing out now that <em>the majority of the text generated so far is unique</em>, and CTRL is not cheating by just outputting text from the source datasets. Even though Monte Carlo integration is <a href="https://en.wikipedia.org/wiki/Monte_Carlo_integration">indeed a real thing</a> that&rsquo;s relatively obscure, Googling that description of it fails to turn up results.</p>
<p>Lastly, fake news doesn&rsquo;t have to be depressing or sow discord. What about <em>wholesome</em> fake news? Such as&hellip;the return of beloved television series <a href="https://en.wikipedia.org/wiki/Firefly_%28TV_series%29">Firefly</a>? One <code>Links https://www.hollywoodreporter.com/live-feed/fox-announces-firefly-season-2-revival-original-cast-returning-12343245</code> later:</p>
<blockquote>
<p>Fox announces Firefly Season 2 revival, original cast returning</p>
<p>The network has announced that the second season of Joss Whedon’s cult sci-fi series will return for a new run in 2018.</p>
<p>“Fans have been asking us to bring back Firefly, and we couldn’t be more thrilled about this opportunity,” said Chris Albrecht, president of programming at Fox. “We are excited to continue our relationship with Joss, who is one of the most creative minds working today on television, as well as his amazing cast, which includes Nathan Fillion, Summer Glau, Alan Tudyk, Adam Baldwin, Sean Maher, Adrianne Palicki, Tricia Helfer, Jewel Staite, Michael Trucco, and many more.”</p>
<p>In addition to Whedon himself, executive producers include Damon Lindelof, Maurissa Tancharoen, Vince Gilligan, David Fury, Jeffrey Bell, Greg Plageman, Scott M. Gimple, Heather Kadin, Andrew Chambliss, John Rogers, and Ron Moore.</p>
<p>“The show was an instant hit when it debuted over 20 years ago, but its popularity only grew after the success of ‘Serenity,’ so we’re very pleased to welcome fans into another chapter of their lives,” added Feige.</p>
</blockquote>
<p>That is a <em>very</em> stacked cast and crew, all of which (besides the original Firefly members) have acted/worked on sci-fi television series. The only major factual errors are that Chris Albrecht was at STARZ, not Fox, and Feige, presumably Kevin Feige of Marvel Studios, is not mentioned previously in the generated article.</p>
<p>I know I&rsquo;ll get criticism for highlighting a potentially dangerous application of AI text generation. My perspective is that it&rsquo;s important to know what such tools are <em>capable</em> of doing in order to more easily recognize fake news. The real problem with fake news isn&rsquo;t the text itself: it&rsquo;s the <em>distribution</em> of the news on social media like <a href="http://www.facebook.com">Facebook</a> and <a href="https://twitter.com">Twitter</a>, where the platforms not only <em>incentivize</em> it, but also fail to sufficiently punish deliberate, repeat offenders. It&rsquo;s why journalism and awareness of fake news is extremely important.</p>
<p>Some might comment &ldquo;these generated texts aren&rsquo;t convincing at all!&rdquo;, but keep in mind that&rsquo;s because the headline says upfront that they&rsquo;re fake. Would you be able to identify it as a fake if a respected source impulsively tweeted it?</p>
]]></content:encoded>
    </item>
    <item>
      <title>How To Make Custom AI-Generated Text With GPT-2</title>
      <link>https://minimaxir.com/2019/09/howto-gpt2/</link>
      <pubDate>Wed, 04 Sep 2019 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2019/09/howto-gpt2/</guid>
      <description>Thanks to gpt-2-simple and this Colaboratory Notebook, you can easily finetune GPT-2 on your own dataset!</description>
<content:encoded><![CDATA[<p>In February 2019, <a href="https://openai.com">OpenAI</a> released <a href="https://openai.com/blog/better-language-models/">a paper</a> describing GPT-2, an AI-based text-generation model based on the <a href="https://arxiv.org/abs/1706.03762">Transformer architecture</a> and trained on massive amounts of text from all around the internet. From a text-generation perspective, the included demos were very impressive: the text is coherent over a long horizon, and grammatical syntax and punctuation are near-perfect.</p>
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/openai-demo_hu_6c7a40a95fa4475f.webp 320w,/2019/09/howto-gpt2/openai-demo_hu_41c9ae923b7d3b4b.webp 768w,/2019/09/howto-gpt2/openai-demo_hu_cc88732c9a90fe06.webp 1024w,/2019/09/howto-gpt2/openai-demo.png 1580w" src="openai-demo.png"/> 
</figure>

<p>At the same time, the Python code which allows anyone to download the model (albeit only smaller versions, out of concern that the full model could be abused to mass-generate fake news) and the TensorFlow code to load the downloaded model and generate predictions were <a href="https://github.com/openai/gpt-2">open-sourced on GitHub</a>.</p>
<p>Neil Shepperd created <a href="https://github.com/nshepperd/gpt-2">a fork</a> of OpenAI&rsquo;s repo which contains additional code to allow <em>finetuning</em> the existing OpenAI model on custom datasets. A <a href="https://github.com/ak9250/gpt-2-colab">notebook</a> was created soon after, which can be copied into <a href="https://colab.research.google.com">Google Colaboratory</a> and clones Shepperd&rsquo;s repo to finetune GPT-2 backed by a free GPU. From there, the proliferation of GPT-2 generated text took off: researchers such as Gwern Branwen made <a href="https://www.gwern.net/GPT-2">GPT-2 Poetry</a> and Janelle Shane made <a href="https://aiweirdness.com/post/183471928977/dd-character-bios-now-making-slightly-more">GPT-2 Dungeons and Dragons character bios</a>.</p>
<p>I waited to see if anyone would make a tool to help streamline this finetuning and text generation workflow, a la <a href="https://github.com/minimaxir/textgenrnn">textgenrnn</a> which I had made for recurrent neural network-based text generation. Months later, no one did. So I did it myself. Enter <a href="https://github.com/minimaxir/gpt-2-simple">gpt-2-simple</a>, a Python package which wraps Shepperd&rsquo;s finetuning code in a functional interface and adds <em>many</em> utilities for model management and generation control.</p>
<p>Thanks to gpt-2-simple and <a href="https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce">this Colaboratory Notebook</a>, you can easily finetune GPT-2 on your own dataset with a simple function, and generate text to your own specifications!</p>
<h2 id="how-gpt-2-works">How GPT-2 Works</h2>
<p>OpenAI has released three flavors of GPT-2 models to date: the &ldquo;small&rdquo; 124M parameter model (500MB on disk), the &ldquo;medium&rdquo; 355M model (1.5GB on disk), and recently the &ldquo;large&rdquo; 774M model (3GB on disk). These models are <em>much</em> larger than what you see in typical AI tutorials and are harder to wield: the &ldquo;small&rdquo; model hits GPU memory limits while finetuning with consumer GPUs, the &ldquo;medium&rdquo; model requires additional training techniques before it can be finetuned on server GPUs without going out-of-memory, and the &ldquo;large&rdquo; model <em>cannot be finetuned at all</em> with current server GPUs before going OOM, even with those techniques.</p>
<p>The actual Transformer architecture GPT-2 uses is very complicated to explain (here&rsquo;s a <a href="http://www.peterbloem.nl/blog/transformers">great lecture</a>). For the purposes of finetuning, since we can&rsquo;t modify the architecture, it&rsquo;s easier to think of GPT-2 as a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a>, taking in inputs and providing outputs. Like <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">previous forms of text generators</a>, the inputs are a sequence of tokens, and the outputs are the probabilities of the next token in the sequence, with these probabilities serving as weights for the AI to pick the next token. In this case, both the input and output tokens are <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding">byte pair encodings</a>. Most RNN approaches use either character tokens (slower to train, but include case/formatting) or word tokens (faster to train, but do not include case/formatting); byte pair encodings instead &ldquo;compress&rdquo; the input to the shortest combination of bytes while preserving case/formatting, serving as a compromise between the two approaches, but unfortunately adding randomness to the final generation length. The byte pair encodings are later decoded into readable text for humans.</p>
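<p>The core mechanism behind byte pair encoding can be shown with a toy Python sketch: repeatedly merge the most frequent adjacent pair of symbols into a single token. Real GPT-2 vocabularies are learned from a huge corpus; the three merge rounds below are purely illustrative:</p>

```python
from collections import Counter

def bpe_merge_step(tokens):
    """Merge every occurrence of the most frequent adjacent pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
            merged.append(tokens[i] + tokens[i + 1])  # fuse the pair
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")  # start from raw characters
for _ in range(3):                 # three merge rounds
    tokens = bpe_merge_step(tokens)
print(tokens)  # common substrings like "low" become single tokens
```

<p>Note that merging never loses information: joining the tokens back together always reconstructs the original text, which is why the encodings can be decoded exactly.</p>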
<p>The pretrained GPT-2 models were trained on websites linked from <a href="https://www.reddit.com">Reddit</a>. As a result, the model has a very strong grasp of the English language, allowing this knowledge to transfer to other datasets and perform well with only a minor amount of additional finetuning. Due to the English bias in encoder construction, languages with non-Latin characters like Russian and <a href="https://en.wikipedia.org/wiki/CJK_characters">CJK</a> will perform poorly in finetuning.</p>
<p>When finetuning GPT-2, I recommend using the 124M model (the default) as it&rsquo;s the best balance of speed, size, and creativity. If you have large amounts of training data (&gt;10 MB), then the 355M model may work better.</p>
<h2 id="gpt-2-simple-and-colaboratory">gpt-2-simple And Colaboratory</h2>
<p>In order to better utilize gpt-2-simple and showcase its features, I created my <a href="https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce">own Colaboratory Notebook</a>, which can be copied into your own Google account. A Colaboratory Notebook is effectively a <a href="https://jupyter.org">Jupyter Notebook</a> running on a free (w/ a Google Account) virtual machine with an Nvidia server GPU attached (<a href="https://twitter.com/BasedBlue/status/1164732922953379841">randomly</a> a K80 or a T4; the T4 is ideal), hardware that would normally be cost-prohibitive.</p>
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/gpu_hu_4a0e2bb6259dc02.webp 320w,/2019/09/howto-gpt2/gpu_hu_711183e2827c0aa.webp 768w,/2019/09/howto-gpt2/gpu_hu_9e8b1663999200bd.webp 1024w,/2019/09/howto-gpt2/gpu.png 1578w" src="gpu.png"/> 
</figure>

<p>Once open, the first cell (run by pressing Shift+Enter in the cell or mousing-over the cell and pressing the &ldquo;Play&rdquo; button) of the notebook installs gpt-2-simple and its dependencies, and loads the package.</p>
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/imports_hu_88d55958c93ab224.webp 320w,/2019/09/howto-gpt2/imports.png 658w" src="imports.png"/> 
</figure>

<p>Later in the notebook is <code>gpt2.download_gpt2()</code> which downloads the requested model type to the Colaboratory VM (the models are hosted on Google&rsquo;s servers, so it&rsquo;s a <em>very</em> fast download).</p>
<p>Expanding the Colaboratory sidebar reveals a UI that you can use to upload files. For example, you can use the <a href="https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt">tinyshakespeare dataset</a> (1MB) provided with the original <a href="https://github.com/karpathy/char-rnn">char-rnn implementation</a>. Upload a text file via the UI (you can drag and drop), then run the <code>file_name = '&lt;xxx&gt;'</code> cell with the filename changed to match your upload.</p>
<p>Now we can start finetuning! This finetuning cell loads the specified dataset and trains for the specified number of steps (the default of 1,000 steps is enough to allow distinct text to emerge and takes about 45 minutes, but you can increase the number of steps if necessary).</p>
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/finetuning_hu_4a49a5387e7d6805.webp 320w,/2019/09/howto-gpt2/finetuning_hu_3a0d8f88cb890f93.webp 768w,/2019/09/howto-gpt2/finetuning_hu_b2ae5782f7e59f96.webp 1024w,/2019/09/howto-gpt2/finetuning.png 1430w" src="finetuning.png"/> 
</figure>

<p>While the model is finetuning, the average training loss is printed to the cell every so often. The <em>absolute value</em> of the loss is not important (the output text quality is subjective), but if the average loss stops decreasing, that&rsquo;s a sign the model has converged and additional training may not help improve it.</p>
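<p>That advice can be turned into a simple plateau check. The function below is an illustrative sketch, not part of gpt-2-simple: it compares the average of the most recent losses against the window before it:</p>

```python
def has_converged(losses, window=5, tolerance=0.01):
    """Return True if the mean of the last `window` losses is no longer
    meaningfully lower than the mean of the `window` before it."""
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    return previous - recent < tolerance

still_improving = [3.2, 2.9, 2.6, 2.4, 2.2, 2.0, 1.9, 1.8, 1.7, 1.6]
plateaued = [1.31, 1.30, 1.30, 1.29, 1.31, 1.30, 1.30, 1.31, 1.29, 1.30]
print(has_converged(still_improving))  # False -- keep training
print(has_converged(plateaued))        # True -- more steps likely won't help
```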
<p>By default, your model is saved in the <code>checkpoint/run1</code> folder, and you&rsquo;ll need to use that folder to load the model as well (you can specify the <code>run_name</code> when using other functions to categorize finetuned models). If you want to export the model from Colaboratory, it&rsquo;s recommended you do so via <a href="https://www.google.com/drive/">Google Drive</a> (as Colaboratory does not like exporting large files). Run the <code>gpt2.mount_gdrive()</code> cell to mount your Google Drive in the Colaboratory VM, then run the <code>gpt2.copy_checkpoint_to_gdrive()</code> cell. You can then download the compressed model folder from Google Drive and run the model wherever you want. Likewise, you can use the <code>gpt2.copy_checkpoint_from_gdrive()</code> cell to retrieve a stored model and generate in the notebook.</p>
<p>Speaking of generation, once you have a finetuned model, you can now generate custom text from it! By default, the <code>gpt2.generate()</code> function will generate as much text as possible (1,024 tokens) with a little bit of randomness. An important caveat: <em>you will not get good generated text 100% of the time</em>, even with a properly trained model (the OpenAI demo above took <em>25 tries</em> to get good text!).</p>
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/gen_long_hu_c92f6fb854819026.webp 320w,/2019/09/howto-gpt2/gen_long_hu_c5fbb89409a8ec64.webp 768w,/2019/09/howto-gpt2/gen_long.png 884w" src="gen_long.png"/> 
</figure>

<p>You can also increase the <code>temperature</code> to increase &ldquo;creativity&rdquo; by allowing the network to more readily make suboptimal predictions, or provide a <code>prefix</code> to specify exactly how you want your text to begin. There are many other useful configuration parameters, such as <code>top_p</code> for <a href="https://github.com/minimaxir/gpt-2-simple/issues/51">nucleus sampling</a>.</p>
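<p>To build intuition for what <code>temperature</code> does, here is a minimal pure-Python sketch of temperature-scaled softmax sampling over a toy vocabulary. This is an illustration, not gpt-2-simple&rsquo;s internals, and the logits are invented:</p>

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then softmax into probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token logits: the model strongly prefers "cat".
vocab = ["cat", "dog", "pizza"]
logits = [4.0, 2.0, 0.5]

low = softmax_with_temperature(logits, temperature=0.2)
high = softmax_with_temperature(logits, temperature=1.5)

# Low temperature sharpens the distribution toward the top token;
# high temperature flattens it, making "suboptimal" tokens more likely.
print(low[0] > high[0])   # True
print(high[2] > low[2])   # True
```

<p>Lower temperatures make generation more repetitive but safer; higher temperatures make it more surprising but less coherent.</p>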
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/gen_long_params_hu_6fba1ec57c997742.webp 320w,/2019/09/howto-gpt2/gen_long_params_hu_2f943a7f4d047ab0.webp 768w,/2019/09/howto-gpt2/gen_long_params_hu_549af070291e4c61.webp 1024w,/2019/09/howto-gpt2/gen_long_params.png 1170w" src="gen_long_params.png"/> 
</figure>

<p>As a bonus, you can bulk-generate text with gpt-2-simple by setting <code>nsamples</code> (number of texts to generate total) and <code>batch_size</code> (number of texts to generate at a time); the Colaboratory GPUs can support a <code>batch_size</code> of up to 20, and you can generate these to a text file with <code>gpt2.generate_to_file(file_name)</code> with the same parameters as <code>gpt2.generate()</code>. You can download the generated file locally via the sidebar, and use those to easily save and share the generated texts.</p>
<p><a href="https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce">The notebook</a> has many more functions as well, with more parameters and detailed explanations! The <a href="https://github.com/minimaxir/gpt-2-simple">gpt-2-simple README</a> lists additional features of gpt-2-simple if you want to use the model outside the notebook.</p>
<p>(NB: Currently, you&rsquo;ll need to reset the Notebook via Runtime → Restart Runtime to finetune a different model/dataset or load a different finetuned model.)</p>
<h2 id="gpt-2-for-short-texts">GPT-2 For Short Texts</h2>
<p>A weakness of GPT-2 and other out-of-the-box AI text generators is that they are built for longform content, and keep on generating text until you hit the specified length. Another reason I wanted to make gpt-2-simple was to add explicit processing tricks to the generated text to work around this issue for short texts. In this case, there are two additional parameters that can be passed to <code>gpt2.generate()</code>: <code>truncate</code> and <code>include_prefix</code>. For example, if each short text begins with a <code>&lt;|startoftext|&gt;</code> token and ends with an <code>&lt;|endoftext|&gt;</code> token, then setting <code>prefix='&lt;|startoftext|&gt;'</code>, <code>truncate='&lt;|endoftext|&gt;'</code>, <code>include_prefix=False</code>, and a sufficiently large <code>length</code> lets gpt-2-simple automatically extract the shortform texts, even when generating in batches.</p>
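<p>The truncation logic is easy to reason about in isolation. Below is a hypothetical pure-Python re-creation of the behavior described above (not gpt-2-simple&rsquo;s actual implementation): given raw generated output, cut at the prefix and truncate tokens.</p>

```python
def extract_short_text(generated, prefix="<|startoftext|>",
                       truncate="<|endoftext|>", include_prefix=False):
    """Mimic the described behavior: keep only the text between the
    start token and the end token."""
    start = generated.find(prefix)
    if start != -1:
        # Skip past the prefix token unless the caller wants it kept.
        start = start if include_prefix else start + len(prefix)
        generated = generated[start:]
    end = generated.find(truncate)
    if end != -1:
        generated = generated[:end]  # drop everything after the end token
    return generated.strip()

raw = "<|startoftext|>What is your favorite word?<|endoftext|>garbage..."
print(extract_short_text(raw))  # What is your favorite word?
```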
<p>Let&rsquo;s finetune a GPT-2 model on Reddit submission titles. This query, when run on <a href="https://console.cloud.google.com/bigquery">BigQuery</a> (for free), returns the top 16,000 titles by score between January and March 2019 for a given Reddit subreddit (in this case, <a href="https://www.reddit.com/r/AskReddit/">/r/AskReddit</a>) + minor text preprocessing, which can be downloaded locally as a 1.3 MB CSV (Save Results → CSV [local file]):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="n">standardSQL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">REGEXP_REPLACE</span><span class="p">(</span><span class="n">REGEXP_REPLACE</span><span class="p">(</span><span class="n">REGEXP_REPLACE</span><span class="p">(</span><span class="n">REGEXP_REPLACE</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&amp;amp;&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&amp;&#39;</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;&amp;lt;&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&lt;&#39;</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;&amp;gt;&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&gt;&#39;</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;�&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">title</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">reddit_posts</span><span class="p">.</span><span class="o">*`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">_TABLE_SUFFIX</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s1">&#39;2019_01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s1">&#39;2019_03&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="k">LENGTH</span><span class="p">(</span><span class="n">title</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="mi">8</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="k">LOWER</span><span class="p">(</span><span class="n">subreddit</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;askreddit&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">score</span><span class="w"> </span><span class="k">DESC</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">LIMIT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="mi">16000</span><span class="w">
</span></span></span></code></pre></div><p>With gpt-2-simple, using a single-column CSV like the one generated above as the input dataset will automatically add <code>&lt;|startoftext|&gt;</code> and <code>&lt;|endoftext|&gt;</code> tokens appropriately. Finetune a new GPT-2 model as normal, and then generate with those additional parameters mentioned above:</p>
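<p>For reference, the nested <code>REGEXP_REPLACE</code> calls in the query above amount to unescaping a few HTML entities and stripping the Unicode replacement character. An equivalent cleaning step in Python (replacements applied in the same order as the SQL):</p>

```python
def clean_title(title: str) -> str:
    """Python equivalent of the query's nested REGEXP_REPLACE calls:
    unescape common HTML entities and drop the replacement character."""
    for old, new in [("&amp;", "&"), ("&lt;", "<"), ("&gt;", ">"), ("\ufffd", "")]:
        title = title.replace(old, new)
    return title

print(clean_title("Cats &amp; dogs &lt;3\ufffd"))  # Cats & dogs <3
```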
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/gen_short_hu_e29e49324e00abda.webp 320w,/2019/09/howto-gpt2/gen_short_hu_cf4df049ae08c53c.webp 768w,/2019/09/howto-gpt2/gen_short_hu_b52ddae516adf006.webp 1024w,/2019/09/howto-gpt2/gen_short.png 1330w" src="gen_short.png"/> 
</figure>

<p>It&rsquo;s worth noting that despite a good amount of input data to the model, finetuned networks can easily <em>overfit</em> on short form text: some of these example titles are very close to existing /r/AskReddit titles. Overfitting can be rectified by training for less time, or adding more input data. Make sure to double check that your generated text is unique!</p>
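<p>One lightweight way to run that double check, assuming you have the training titles and generated titles loaded as lists of strings (the data here is invented), is to flag exact duplicates:</p>

```python
def find_copied_lines(training_lines, generated_lines):
    """Return generated lines that appear verbatim in the training data,
    ignoring case and surrounding whitespace."""
    seen = {line.strip().lower() for line in training_lines}
    return [g for g in generated_lines if g.strip().lower() in seen]

training = ["What is your favorite movie?", "Cats of Reddit, why?"]
generated = ["What is your favorite movie?", "Dogs of Reddit, how?"]
print(find_copied_lines(training, generated))
# ['What is your favorite movie?']
```

<p>Exact matching only catches verbatim copies; near-duplicates (a word or two changed) need a fuzzier comparison.</p>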
<p>You can play with this Reddit-oriented variant in <a href="https://colab.research.google.com/drive/1RugXCYDcMvSACYNt9j0kB6zzqRKzAbBn">this modified Colaboratory Notebook</a>.</p>
<h2 id="making-gpt-2-apps">Making GPT-2 Apps</h2>
<p>There have already been cool, non-nefarious uses of GPT-2, such as Adam King&rsquo;s <a href="https://talktotransformer.com">TalkToTransformer</a> which provides a UI for the 774M model (and has gone viral many times) and <a href="https://tabnine.com">TabNine</a>, which uses GPT-2 finetuned on GitHub code in order to create probabilistic code completion. On the <a href="https://pytorch.org">PyTorch</a> side, Huggingface has released a <a href="https://github.com/huggingface/pytorch-transformers">Transformers client</a> (w/ GPT-2 support) of their own, and also created apps such as <a href="https://transformer.huggingface.co">Write With Transformer</a> to serve as a text autocompleter.</p>
<p>Many AI tutorials often show how to deploy a small model to a web service by using the <a href="https://palletsprojects.com/p/flask/">Flask</a> application framework. The problem with GPT-2 is that it&rsquo;s such a huge model that most conventional advice won&rsquo;t work well to get a performant app. And even if you do get it to run fast (e.g. by running the app on a GPU), it won&rsquo;t be <em>cheap</em>, especially if you want it to be resilient to a random surge of virality.</p>
<p>With gpt-2-simple, the solution I came up with is <a href="https://github.com/minimaxir/gpt-2-cloud-run">gpt-2-cloud-run</a>, a small webapp intended to run GPT-2 via <a href="https://cloud.google.com/run/">Google Cloud Run</a> backed by gpt-2-simple. The advantage here is that Cloud Run only charges for compute used and can scale indefinitely if there&rsquo;s a traffic surge; for casual use, it&rsquo;s extremely cost effective compared to running a GPU 24/7. I&rsquo;ve used Cloud Run to make a GPT-2 text generator for <a href="https://minimaxir.com/apps/gpt2-reddit/">Reddit-wide submission titles</a> and a GPT-2 generator for <a href="https://minimaxir.com/apps/gpt2-mtg/">Magic: The Gathering cards</a>!</p>
<figure>

    <img loading="lazy" srcset="/2019/09/howto-gpt2/mtg_hu_d057254774c4512.webp 320w,/2019/09/howto-gpt2/mtg_hu_a0e27a970358d4cb.webp 768w,/2019/09/howto-gpt2/mtg_hu_de34001f118de041.webp 1024w,/2019/09/howto-gpt2/mtg.png 1135w" src="mtg.png"/> 
</figure>

<h2 id="attributing-ai-generated-text">Attributing AI-Generated Text</h2>
<p>One of the main reasons I developed textgenrnn and gpt-2-simple is to make AI text generation more <em>accessible</em> as you do not need a strong AI or technical background to create fun stories. However, in the case of GPT-2, I&rsquo;ve noticed an elevated amount of &ldquo;I trained an AI to generate text&rdquo; articles/Reddit posts/YouTube videos saying they used GPT-2 to train an AI, but not <em>how</em> they trained the AI: especially suspicious since finetuning is not an out-of-the-box feature that OpenAI provides. The fact that Keaton Patti&rsquo;s <a href="https://twitter.com/KeatonPatti/status/1161284670601990146">&ldquo;I forced a bot&rdquo; movie scripts</a> (that aren&rsquo;t written by a bot) frequently go megaviral due to that particular framing doesn&rsquo;t help.</p>
<p>Although it&rsquo;s not legally required, I ask that anyone who shares generated text via gpt-2-simple add a link to the repo and/or Colaboratory notebook not just for attribution, but to <em>spread knowledge</em> about the accessibility of AI text generation. It&rsquo;s a technology that should be transparent, not obfuscated for personal gain.</p>
<h2 id="the-future-of-gpt-2">The Future of GPT-2</h2>
<p>Hopefully, this article gave you ideas on how to finetune and generate texts creatively. There&rsquo;s still a <em>lot</em> of untapped potential: many cool applications remain untouched, and many cool datasets haven&rsquo;t yet been used for AI text generation. GPT-2 will likely be used more for mass-producing <a href="https://twitter.com/Fred_Delicious/status/1166783214750445573">crazy erotica</a> than fake news.</p>
<p>However, GPT-2 and the Transformer architecture aren&rsquo;t the end-game of AI text generation. Not by a long shot.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Things About Real-World Data Science Not Discussed In MOOCs and Thought Pieces</title>
      <link>https://minimaxir.com/2018/10/data-science-protips/</link>
      <pubDate>Mon, 22 Oct 2018 09:15:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/10/data-science-protips/</guid>
      <description>MOOCs and thought pieces overfit to a certain style of data science that is not robust to the vast uncertainties of the real world.</description>
      <content:encoded><![CDATA[<p><a href="https://en.wikipedia.org/wiki/Data_science">Data science</a> has been sweeping the tech world. With a large variety of powerful free open-sourced tools and now the computing power to utilize them to their full potential, data science is more accessible than ever and has become <a href="https://www.bloomberg.com/news/articles/2018-05-18/-sexiest-job-ignites-talent-wars-as-demand-for-data-geeks-soars">America&rsquo;s hottest job</a>. One problem: there&rsquo;s no consensus on <a href="https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists">what data scientists <em>really</em> do</a> in a professional setting.</p>
<p>There has been a rise in <em>romantic</em> thought pieces lately (especially on <a href="https://medium.com">Medium</a>) about how data scientists are wizards and can solve any problem (with bonus points if it cites AI). If you follow publications like <a href="https://towardsdatascience.com">Towards Data Science</a>, you&rsquo;ll notice persistent tropes in the more code-oriented posts: Python is the king programming language for data science, use <a href="http://scikit-learn.org/stable/">scikit-learn</a>/<a href="https://xgboost.readthedocs.io/en/latest/">XGBoost</a> and logistic regression for predicting categorical variable(s), use <a href="https://pandas.pydata.org">pandas</a> for processing tabular data, use <a href="https://www.nltk.org">NLTK</a>/<a href="https://en.wikipedia.org/wiki/Word2vec">word2vec</a> for processing text data, use <a href="https://www.tensorflow.org">TensorFlow</a>/<a href="https://keras.io">Keras</a>/convolutional neural networks for processing image data, use <a href="https://en.wikipedia.org/wiki/K-means_clustering"><em>k</em>-means</a> for clustering data, split the processed dataset into training and test datasets for model training, tweak hyperparameters/model features <a href="https://xkcd.com/1838/">until results on the test dataset are good</a>, etc.</p>
<figure>

    <img loading="lazy" srcset="/2018/10/data-science-protips/thought_hu_a119caa2480267cc.webp 320w,/2018/10/data-science-protips/thought.png 397w" src="thought.png"/> 
</figure>

<p>These tropes aren&rsquo;t inappropriate or misleading, but the analysis often doesn&rsquo;t quantify the insight/value of the results. Modeling is just one small part (and often the <em>easiest</em> part) of a very complex system.</p>
<p>Data-oriented MOOCs (<a href="https://en.wikipedia.org/wiki/Massive_open_online_course">Massive Online Open Courses</a>) like Andrew Ng&rsquo;s <a href="https://www.coursera.org/learn/machine-learning">Coursera course on Machine Learning</a> and <a href="http://course.fast.ai">fast.ai&rsquo;s course on Deep Learning</a> are good academic introductions to the theory and terminology behind data science and other related fields. Although MOOCs have many practice problems for prospective data scientists to solve, they don&rsquo;t make you an expert in the field capable of handling messier real-world problems, nor claim to do so.</p>
<p>Modern data science isn&rsquo;t about burying your head in a <a href="http://jupyter.org">Jupyter Notebook</a> and staring at the screen watching training loss numbers trickle down (although it&rsquo;s definitely fun!). There&rsquo;s a lot more to it, some of which I&rsquo;ve learned firsthand working as a Data Scientist at <a href="https://www.buzzfeed.com">BuzzFeed</a> for over a year. To borrow a statistical term, MOOCs and thought pieces <em>overfit</em> to a certain style of data science that is not robust to the vast uncertainties of the real world.</p>
<h2 id="the-costbenefit-tradeoffs-of-data-science">The Cost/Benefit Tradeoffs of Data Science</h2>
<p>Data science often follows the <a href="https://en.wikipedia.org/wiki/Pareto_principle">Pareto principle</a>: 80% of the work takes 20% of the effort. Thought pieces demonstrate that you can just toss data indiscriminately into scikit-learn or a deep learning framework and get neat-looking results. The value of a data scientist, however, lies in knowing when and <em>if</em> to continue developing a model further.</p>
<p><a href="https://www.kaggle.com/competitions">Kaggle competitions</a> are a popular and often-recommended way to get exposure to real-world data science problems. Many teams of statisticians compete to create the best model for a given dataset (where &ldquo;best&rdquo; usually means minimizing the predictive loss/error of the model), with prizes for the highest-performing models. Kaggle also encourages clever modeling techniques such as <a href="http://scikit-learn.org/stable/modules/grid_search.html">grid search</a> of thousands of model hyperparameter combinations and ensembling disparate models to create a megamodel, which results in only <em>slightly</em> better predictive performance but just might give the edge needed to win.</p>
<p>However, there are a few important differences between modeling in a Kaggle competition and modeling in a data science team. Kaggle competitions last for <em>weeks</em>, whereas a professional data scientist may need to spend that time on other things. Ensembling gigantic machine learning models makes predictions very slow and the models themselves very large; both of which may cause difficulty deploying them into production (e.g. the <a href="https://www.wired.com/2012/04/netflix-prize-costs/">Netflix Prize</a> movie recommendation models famously &ldquo;did not seem to justify the engineering effort needed to bring them into a production environment&rdquo;). And most importantly, there may not be a significant <em>practical</em> performance difference between a 1st place Kaggle model that takes days/weeks to optimize and a simple scikit-learn/XGBoost baseline that can be built in a few hours.</p>
<p>Counterintuitively, it may be better to trade performance for speed/memory with a weaker-but-faster model; in business cases, speed and scalability are important implementation constraints. But even with scikit-learn, the model is still a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a> with little idea to the data scientist how the model makes its decisions. One final option is to go back to basics altogether with a &ldquo;boring&rdquo; linear/logistic regression model, where the predictive performance may be even weaker and the model <a href="http://statisticsbyjim.com/regression/ols-linear-regression-assumptions/">must follow several statistical assumptions</a>, but the model feature coefficients and statistical significance <a href="http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-regression-analysis-results-p-values-and-coefficients">are easily interpretable</a> to explain the importance of each input feature (if any) and make actionable, informed decisions for the business. Being a data scientist requires making educated judgments about these tradeoffs.</p>
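<p>As a toy illustration of that interpretability (invented data, ordinary least squares via NumPy, not any specific production model): each fitted coefficient reads directly as the expected change in the target per unit change in a feature, holding the others fixed.</p>

```python
import numpy as np

# Invented data: predict revenue from ad spend and email volume.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.1, 5.9, 11.2, 11.8, 15.0])

# Add an intercept column and solve the least-squares problem.
X1 = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(X1, y, rcond=None)

intercept, b_ads, b_email = coefs
# Unlike a black-box model, these coefficients are directly actionable:
# "one more unit of ad spend is worth b_ads units of revenue".
print(f"+1 unit ad spend -> {b_ads:+.2f} revenue")
print(f"+1 unit email    -> {b_email:+.2f} revenue")
```

<p>A real analysis would also check the statistical assumptions and the significance of each coefficient (e.g. with statsmodels) before acting on them.</p>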
<h2 id="data-scientists-still-use-business-intelligence-tools">Data Scientists Still Use Business Intelligence Tools</h2>
<p>A hobbyist data scientist without a budget may opt to build their own workflows and data pipelines using free tools. However, professional data scientists have a finite amount of free time (as do all engineers), so there&rsquo;s a massive opportunity cost when reinventing the wheel unnecessarily. Enterprise BI tools such as <a href="https://www.tableau.com">Tableau</a>, <a href="https://looker.com">Looker</a>, and <a href="https://modeanalytics.com">Mode Analytics</a> help retrieve and present data with easy-to-digest dashboards for anyone in the company. They&rsquo;re never cheap, but they&rsquo;re much cheaper to the company than having a data scientist spend valuable time to develop and maintain similar tooling over time.</p>
<p>If a stakeholder wants a data report ASAP, there&rsquo;s no problem falling back to using <a href="https://en.wikipedia.org/wiki/SQL">SQL</a> to query a data warehouse and output results into an Excel spreadsheet (plus pretty data visualizations!) to send quickly in an email. Part of being a data scientist is working out which tools are most appropriate at which time.</p>
<p>Some might argue that using BI tools and SQL is not a responsibility of data scientists, but instead of Business Analysts or Data Analysts. That&rsquo;s a <a href="https://en.wikipedia.org/wiki/No_true_Scotsman">No True Scotsman</a> way of looking at it; there&rsquo;s a lot of overlap in data science with other analytical fields, and there&rsquo;s nothing wrong with that.</p>
<h2 id="data-scientists-are-software-engineers-too">Data Scientists Are Software Engineers Too</h2>
<p>Although MOOCs encourage <em>self</em>-study, data science is a collaborative process. And not just with other data scientists on a team, but with other software engineers in the company. Version control tools like <a href="https://git-scm.com">Git</a> are often used by data scientists to upload their portfolio projects publicly to <a href="https://github.com">GitHub</a>, but there are many other important features for use in a company-wide collaborative environment such as branching a repository, making pull requests, and merging conflicts. Beyond that are modern development QA practices, such as test environments, consistent code style, and code reviews. The full process varies strongly by company: Airbnb has a <a href="https://medium.com/airbnb-engineering/scaling-knowledge-at-airbnb-875d73eff091">good thought piece</a> about how they utilize their Knowledge Base for data science collaboration using Git.</p>
<p>One of the very hard and surprisingly underdiscussed aspects of data science is <a href="https://en.wikipedia.org/wiki/DevOps">DevOps</a>, and how to actually get a statistical model into production. <a href="https://www.docker.com/resources/what-container">Docker containers</a>, for example, are a newer technology that&rsquo;s hard to learn but has many data science and DevOps benefits, mitigating Python dependency hell and ensuring a consistent environment for model deployment and execution. And once the model is in production, data scientists, data engineers, and dedicated DevOps personnel need to work together to figure out if the model has the expected output, if the model is performing with expected speed/memory overhead, how often to retrain the model on fresh data (plus the scheduling/data pipelining necessary to do so), and how to efficiently route predictions out of the system to the user.</p>
<h2 id="data-science-cant-solve-everything">Data Science Can&rsquo;t Solve Everything</h2>
<p>Data science experiments (even those utilizing magical AI) are allowed to fail, and not just in the fail-to-reject-the-null-hypothesis sense. Thought pieces typically discuss successful projects, which leads to a survivorship bias. Even with massive amounts of input data, it&rsquo;s <em>likely</em> for a model to fail to converge and offer zero insight, or for an experiment to fail to offer statistically significant results (common with <a href="https://vwo.com/ab-testing/">A/B testing</a>).</p>
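<p>For the A/B-testing case, the significance check itself is simple; here is a sketch of a two-sided, two-proportion z-test using only the standard library (the conversion counts are invented). A modest-looking lift can be nowhere near significant at realistic sample sizes:</p>

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 5.0% vs 5.3% conversion across 2,000 users per arm:
# the lift may be real, but this test cannot distinguish it from noise.
z, p = two_proportion_z(100, 2000, 106, 2000)
print(f"z = {z:.2f}, p = {p:.3f}")
```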
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">real world data science is an R<sup>2</sup> of 0.10 <a href="https://twitter.com/hashtag/GoogleNext18?src=hash&amp;ref_src=twsrc%5Etfw">#GoogleNext18</a> <a href="https://t.co/qNsno2dscR">pic.twitter.com/qNsno2dscR</a></p>— Max Woolf (@minimaxir) <a href="https://twitter.com/minimaxir/status/1021885939361042432?ref_src=twsrc%5Etfw">July 24, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</span></p>
<p>The difficulty of real-world data science is recognizing if a given problem <em>can</em> be solved, how much of your valuable time to spend iterating to <em>maybe</em> solve it, how to report to stakeholders if it <em>can&rsquo;t</em> be solved, and what are the next steps if that&rsquo;s the case.</p>
<p>Don&rsquo;t <a href="https://www.buzzfeednews.com/article/stephaniemlee/brian-wansink-cornell-p-hacking"><em>p</em>-hack</a>!</p>
<h2 id="data-science-and-ethics">Data Science and Ethics</h2>
<p>During the rise of the &ldquo;data science/AI is magic!&rdquo; era, massive algorithmic and statistical failures suggest that data science might not always make the world a better place. Amazon built a resume-reading model which <a href="https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G">accidentally learned to be sexist</a>. Facebook overestimated <a href="https://www.theverge.com/2018/10/17/17989712/facebook-inaccurate-video-metrics-inflation-lawsuit">performance metrics on their videos</a>, causing complete business pivots for media organizations in vain, indirectly <a href="https://www.theatlantic.com/technology/archive/2018/10/facebook-driven-video-push-may-have-cost-483-journalists-their-jobs/573403/">leading to hundreds of layoffs</a>. YouTube&rsquo;s recommended video algorithms <a href="https://medium.com/@jamesbridle/something-is-wrong-on-the-internet-c39c471271d2">drove children towards shocking and disturbing content</a>. And these companies have some of the best data talent <em>in the entire world</em>.</p>
<p>The <em>qualitative</em> output of a model or data analysis is just as important as the quantitative performance, if not more. Allowing dangerous model output to hit production and impact <em>millions</em> of consumers is a failure of QA at all levels. In fairness these companies usually fix these issues, but only <em>after</em> journalists <a href="https://www.nytimes.com/2018/10/19/opinion/facebook-twitter-journalism-misinformation.html">point them out</a>. The problem with blindly chasing a performance metric (like Kaggle) is that it ignores collateral, unexpected effects.</p>
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Don’t be data-driven. Be data-informed. Metrics should never be in charge because they have no moral compass.</p>— Kim Goodwin (@kimgoodwin) <a href="https://twitter.com/kimgoodwin/status/1051849805280948224?ref_src=twsrc%5Etfw">October 15, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </span></p>
<p>Maybe recommending shocking videos is what maximizes clickthrough rate or ad revenue per the models according to a business dashboard. Unfortunately, if the data justifies it and the business stakeholders encourage it, the company may <em>accept the consequences</em> of a flawed algorithm if they don&rsquo;t outweigh the benefits. It&rsquo;s important for data scientists to be aware that they may be party to that.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I realize the irony of using a data science thought piece to argue against data science thought pieces. In fairness, some Medium thought pieces do apply data science in very <em>unique</em> ways or touch on very obscure-but-impactful aspects of frameworks, and I enjoy reading those. The field is still very broadly defined, and your experiences may differ from this post, especially if you&rsquo;re working for a more research-based institution. Unfortunately, I don’t have any new advice for <em>getting</em> a data science job, which is <a href="https://twitter.com/minimaxir/status/951117788835278848">still very difficult</a>.</p>
<p>The popular idea that being a data scientist is a 40-hours-a-week Kaggle competition is <strong>incorrect</strong>. There&rsquo;s a lot more to it that&rsquo;s not as sexy which, in my opinion, is the more interesting aspect of the data science field as a whole.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Problems with Predicting Post Performance on Reddit and Other Link Aggregators</title>
      <link>https://minimaxir.com/2018/09/modeling-link-aggregators/</link>
      <pubDate>Mon, 10 Sep 2018 09:15:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/09/modeling-link-aggregators/</guid>
      <description>The nature of algorithmic feeds like Reddit inherently leads to a survivorship bias: although users may recognize certain types of posts that appear on the front page, there are many more which follow the same patterns but fail.</description>
      <content:encoded><![CDATA[<p><a href="https://www.reddit.com">Reddit</a>, &ldquo;the front page of the internet&rdquo; is a link aggregator where anyone can submit links to cool happenings. Over the years, Reddit has expanded from just being a link aggregator, to allowing image and videos, and as of recently, hosting images and videos itself.</p>
<p>Reddit is broken down into subreddits, where each subreddit represents its own community around a particular interest, like <a href="https://www.reddit.com/r/aww">/r/aww</a> for pet photos and <a href="https://www.reddit.com/r/politics/">/r/politics</a> for U.S. politics. The posts on each subreddit are ranked by some function of both time elapsed since the submission was made, and the <em>score</em> of the submission as determined by upvotes and downvotes from other users.</p>
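<p>Reddit&rsquo;s ranking code was open-sourced for years; the widely cited &ldquo;hot&rdquo; formula combines the logarithm of the score with the submission&rsquo;s age, so newer posts need exponentially fewer votes to outrank older ones. A simplified sketch (constants taken from the open-sourced version; treat it as illustrative, since the live algorithm has since changed):</p>

```python
import math
from datetime import datetime, timezone

# Reddit's epoch constant (1134028003 in the open-sourced code).
REDDIT_EPOCH = datetime(2005, 12, 8, 7, 46, 43, tzinfo=timezone.utc)

def hot(ups, downs, posted_at):
    """Simplified version of Reddit's open-sourced 'hot' ranking."""
    score = ups - downs
    order = math.log10(max(abs(score), 1))   # diminishing returns on votes
    sign = 1 if score > 0 else -1 if score < 0 else 0
    seconds = (posted_at - REDDIT_EPOCH).total_seconds()
    # Every 45,000 seconds (12.5 hours) of age is worth one order of
    # magnitude of score, so recency dominates raw vote counts.
    return sign * order + seconds / 45000

old_viral = hot(1000, 50, datetime(2018, 9, 10, 0, 0, tzinfo=timezone.utc))
new_modest = hot(20, 2, datetime(2018, 9, 11, 0, 0, tzinfo=timezone.utc))
print(new_modest > old_viral)  # True: a day-newer modest post outranks it
```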
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_aww_hu_15514c9daececa75.webp 320w,/2018/09/modeling-link-aggregators/reddit_aww_hu_38fdc85d80e9f49f.webp 768w,/2018/09/modeling-link-aggregators/reddit_aww.png 827w" src="reddit_aww.png"/> 
</figure>

<p>There&rsquo;s also an intrinsic pride in having something you&rsquo;re responsible for providing to the community get lots of upvotes (the submitter also earns karma based on received upvotes, although karma is meaningless and doesn&rsquo;t provide any user benefits). But the reality is that even on the largest subreddits, submissions with 1 point (the default score for new submissions) are the most common, with some subreddits having <em>over half</em> of their submissions at only 1 point.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_dist_facet_hu_94559d39f676be08.webp 320w,/2018/09/modeling-link-aggregators/reddit_dist_facet_hu_ede8ccaaf5538573.webp 768w,/2018/09/modeling-link-aggregators/reddit_dist_facet_hu_940890d5e65baccb.webp 1024w,/2018/09/modeling-link-aggregators/reddit_dist_facet.png 1800w" src="reddit_dist_facet.png"/> 
</figure>

<p>The exposure from having a submission go viral on Reddit (especially on larger subreddits) can be valuable, particularly if it&rsquo;s your own original content. As a result, there has been a lot of <a href="https://www.brandwatch.com/blog/how-to-get-on-the-front-page-of-reddit/">analysis</a>/<a href="https://www.reddit.com/r/starterpacks/comments/8rkfk9/reddit_front_page_starter_pack/">stereotyping</a> about which techniques help a submission make it to the top of the front page. But almost all claims of &ldquo;cracking&rdquo; the Reddit algorithm are <a href="https://en.wikipedia.org/wiki/Post_hoc_ergo_propter_hoc"><em>post hoc</em> rationalizations</a>, attributing success to things like the submission timing and title verbiage of a single submission after the fact. The nature of algorithmic feeds inherently leads to a <a href="https://en.wikipedia.org/wiki/Survivorship_bias">survivorship bias</a>: although users may recognize certain types of posts that appear on the front page, there are many more which follow the same patterns but fail, which makes modeling a successful post very tricky.</p>
<p>I&rsquo;ve touched on analyzing Reddit post performance <a href="https://minimaxir.com/2017/06/reddit-deep-learning/">before</a>, but let&rsquo;s give it another look and see if we can drill down on why Reddit posts do and do not do well.</p>
<h2 id="submission-timing">Submission Timing</h2>
<p>As with many US-based websites, the majority of Reddit users are most active during work hours (9 AM to 5 PM Eastern time on weekdays). Most subreddits have submission patterns which fit accordingly.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_prop_hu_6063ab19aff16cb2.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_prop_hu_4354ae33b8600c6a.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_prop_hu_5818614336fda8df.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_prop.png 1800w" src="reddit_subreddit_prop.png"/> 
</figure>

<p>But what&rsquo;s interesting are the subreddits which <em>deviate</em> from that standard. Gaming subreddits (<a href="https://www.reddit.com/r/DestinyTheGame">/r/DestinyTheGame</a>, <a href="https://www.reddit.com/r/Overwatch">/r/Overwatch</a>) see a short burst of activity after a Tuesday game update/patch, game <em>communication</em> subreddits (<a href="https://www.reddit.com/r/Fireteams">/r/Fireteams</a>, <a href="https://www.reddit.com/r/RocketLeagueExchange">/r/RocketLeagueExchange</a>) are more active <em>outside</em> of work hours since their users are playing the game at those times, and Not-Safe-For-Work subreddits (/r/dirtykikpals, /r/gonewild) are incidentally less active during work hours and more active late at night than other subreddits.</p>
<p>Whenever you make a submission to Reddit, the submission appears in the subreddit&rsquo;s <code>/new</code> queue of the most recent submissions, where hopefully kind souls will find your submission and upvote it if it&rsquo;s good.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_new_hu_6650be6d73851b91.webp 320w,/2018/09/modeling-link-aggregators/reddit_new.png 762w" src="reddit_new.png"/> 
</figure>

<p>However, if it falls off the first page of the <code>/new</code> queue, your submission might be as good as dead. As a result, there&rsquo;s an element of game theory to timing your submission if you don&rsquo;t want it to become another 1-point submission. Is it better to submit during peak hours, when more users may see the submission before it falls off of <code>/new</code>? Is it better to submit <em>before</em> peak usage, since there will be less competition, then ride the momentum once it hits the front page?</p>
<p>Here&rsquo;s a look at the median post performance at each given time slot for top subreddits:</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy_hu_cb9c5ba898252674.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy_hu_8ba4a17a13989a31.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy_hu_a08bfb9858ec4480.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy.png 1800w" src="reddit_subreddit_hr_doy.png"/> 
</figure>
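<p>The aggregation behind this chart was done in BigQuery and R (linked at the end of the post); as a rough illustration, the same median-per-time-slot grouping could be sketched in Python (the <code>median_by_slot</code> helper and the sample data are hypothetical):</p>

```python
from collections import defaultdict
from statistics import median

def median_by_slot(posts):
    """posts: iterable of (subreddit, hour_of_day, score) tuples."""
    slots = defaultdict(list)
    for subreddit, hour, score in posts:
        slots[(subreddit, hour)].append(score)
    return {slot: median(scores) for slot, scores in slots.items()}

# Made-up sample data: the median is robust to a single viral outlier.
posts = [
    ("aww", 9, 1), ("aww", 9, 3), ("aww", 9, 4100),
    ("aww", 22, 1), ("aww", 22, 1),
]
medians = median_by_slot(posts)
```

<p>Using the median rather than the mean matters here: one front-page hit would otherwise dominate an entire time slot.</p>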

<p>As the earlier distribution chart implied, the median score is around 1-2 for most subreddits, and that&rsquo;s consistent across all time slots. Some subreddits with higher medians, like /r/me_irl, do appear to have a <em>slight</em> benefit when posting before peak activity. When focusing on subreddits with high overall median scores, the difference is more explicit.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian_hu_2730023d99e9e0d9.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian_hu_78be513d900d66b5.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian_hu_da4a41445f75e1.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian.png 1800w" src="reddit_subreddit_highmedian.png"/> 
</figure>

<p>Subreddits like /r/PrequelMemes and /r/The_Donald <em>definitely</em> have better performance on average when posts are made before peak activity! Posting before peak usage <em>does</em> appear to be a viable strategy; however, for the majority of subreddits it doesn&rsquo;t make much of a difference.</p>
<h2 id="submission-titles">Submission Titles</h2>
<p>Each Reddit subreddit has their own vocabulary and topics of discussion. Let&rsquo;s break down text by subreddit by looking at the 75th percentile for score on posts containing a given two-word phrase:</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams_hu_5d8f080824cf057d.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams_hu_2870270c6078715e.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams_hu_9edc52c78d8fe6ca.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams.png 1800w" src="reddit_subreddit_topbigrams.png"/> 
</figure>
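<p>The actual computation was done in R/BigQuery; a minimal Python sketch of the idea, extracting two-word phrases from titles and taking the 75th percentile of the scores of posts containing each phrase (the helper name and sample data are made up):</p>

```python
from collections import defaultdict
from statistics import quantiles

def bigram_p75(titles_scores, min_posts=2):
    """75th-percentile score for each two-word phrase across post titles."""
    scores = defaultdict(list)
    for title, score in titles_scores:
        words = title.lower().split()
        for pair in zip(words, words[1:]):
            scores[" ".join(pair)].append(score)
    # quantiles(n=4) returns [Q1, Q2, Q3]; Q3 is the 75th percentile.
    return {bg: quantiles(s, n=4, method="inclusive")[2]
            for bg, s in scores.items() if len(s) >= min_posts}

posts = [("good boy fetches", 1), ("good boy naps", 1),
         ("good boy smiles", 2), ("good boy wins", 10),
         ("cat tax", 1)]
p75 = bigram_p75(posts)
```

<p>A high percentile (rather than the mean or median) surfaces phrases with upside potential while the <code>min_posts</code> cutoff filters out one-off flukes.</p>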

<p>The one trend consistent across all subreddits is the effectiveness of first-person pronouns (<em>I/my</em>) and original content (<em>fan art</em>). Other than that, the vocabulary and sentiment of successful posts is very specific to the subreddit and the culture it represents; there are no universal guaranteed-success memes.</p>
<h2 id="can-deep-learning-predict-post-performance">Can Deep Learning Predict Post Performance?</h2>
<p>Some might think &ldquo;oh hey, this is an arbitrary statistical problem, you can just build an AI to solve it!&rdquo; So, for the sake of argument, I did.</p>
<p>Instead of using Reddit data for building a deep learning model, we&rsquo;ll use data from <a href="https://news.ycombinator.com">Hacker News</a>, another link aggregator similar to Reddit with a strong focus on technology and startup entrepreneurship. The distribution of scores on posts, submission timings, upvoting, and front page ranking systems are all the same as on Reddit.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/hn_hu_ad0b8ce0803e73ea.webp 320w,/2018/09/modeling-link-aggregators/hn_hu_9592bce993e10dcd.webp 768w,/2018/09/modeling-link-aggregators/hn_hu_c329d6412551f993.webp 1024w,/2018/09/modeling-link-aggregators/hn.png 1520w" src="hn.png"/> 
</figure>

<p>The titles of Hacker News submissions are also shorter (80 characters max vs. Reddit&rsquo;s 300-character max) and in concise English (no memes/shitposts allowed), which should help the model learn the title syntax and identify high-impact keywords more easily. Like Reddit, the score data is extremely skewed, with most HN submissions at 1-2 points, and typical model training will quickly converge on predicting that <em>every</em> submission has a score of 1, which isn&rsquo;t helpful!</p>
<p>By constructing a model employing <em>many</em> deep learning tricks with <a href="https://keras.io">Keras</a>/<a href="https://www.tensorflow.org">TensorFlow</a> to prevent model cheating, and training on <em>hundreds of thousands</em> of HN submissions (using post title, day-of-week, hour, and link domain like <code>github.com</code> as model features), the model does converge and finds some signal among the noise (training R<sup>2</sup> ~ 0.55 when trained for 50 epochs). However, it fails to offer any valuable predictions on new, unseen posts (test R<sup>2</sup> <em>&lt; 0.00</em>) because it falls into the exact same human biases regarding titles: it saw submissions with titles that did very well during training, but it can&rsquo;t isolate the random chance that determines why, of two otherwise-similar submissions, one goes viral while the other does not.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/hn_test_hu_75e647e4de235ee0.webp 320w,/2018/09/modeling-link-aggregators/hn_test.png 485w" src="hn_test.png"/> 
</figure>

<p>I&rsquo;ve made the Keras/TensorFlow model training code available in <a href="https://www.kaggle.com/minimaxir/hacker-news-submission-score-predictor/notebook">this Kaggle Notebook</a> if you want to fork it and try to improve the model.</p>
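<p>For intuition on why a test R<sup>2</sup> below zero is possible: R<sup>2</sup> compares a model&rsquo;s squared error against that of always predicting the mean, so a model that generalizes worse than the mean goes negative. A self-contained sketch with made-up numbers:</p>

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# A perfect fit scores exactly 1.0...
perfect = r_squared([1, 2, 3], [1, 2, 3])

# ...but predictions *worse* than always guessing the mean go negative,
# which is what happens when a model memorizes titles instead of signal.
test_scores = [1, 1, 1, 200]   # hypothetical held-out scores
predictions = [150, 1, 1, 2]   # confident, but wrong on the outliers
negative = r_squared(test_scores, predictions)
```

<p>On skewed score data a single badly-predicted viral post is enough to drag the whole test metric below zero.</p>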
<h2 id="other-potential-modeling-factors">Other Potential Modeling Factors</h2>
<p>The deep learning model above makes optimistic assumptions about the underlying data, including that each post behaves independently, and the included features are the sole features which determine the score. These assumptions are questionable.</p>
<p>The simple model also forgoes the content of the submission itself, which is hard to retrieve for hundreds of thousands of data points. On Hacker News that&rsquo;s mostly OK, since most submissions are links/articles whose titles accurately reflect their content, although occasionally there are idiosyncratic short titles which do the opposite. On Reddit, looking at content is obviously necessary for image/video-oriented subreddits, which is hard to gather and analyze at scale.</p>
<p>A very important factor in post performance is <em>momentum</em>. A post having a high score is a positive signal in itself, which begets more votes (a famous Reddit problem is brigading from /r/all, which can cause submission scores to skyrocket). If the front page of a subreddit has a large number of high-performing posts, they might also suppress posts coming out of the <code>/new</code> queue because the score threshold is much higher. A simple model may not be able to capture these impacts; the model would need to incorporate the <em>state of the front page</em> at the time of posting.</p>
<p>Some also try to manipulate upvotes. Reddit became famous for adding the rule &ldquo;asking for upvotes is a violation of intergalactic law&rdquo; to their <a href="https://www.reddithelp.com/en/categories/rules-reporting/account-and-community-restrictions/what-constitutes-vote-cheating-or">Content Policy</a>, although some subreddits do it anyway <a href="https://www.reddit.com/r/TheoryOfReddit/comments/5qqrod/for_years_reddit_told_us_that_saying_upvote_this/">without consequence</a>. On Reddit, obvious spam posts can be downvoted to immediately counteract illicit upvotes. Hacker News has a <a href="https://news.ycombinator.com/newsfaq.html">similar don&rsquo;t-upvote rule</a>, although there aren&rsquo;t downvotes, just a flagging mechanism which quickly neutralizes spam/misleading posts. In general, there&rsquo;s no <em>legitimate</em> reason to highlight your own submission immediately after it&rsquo;s posted (except for Reddit&rsquo;s AMAs). Fortunately, gaming the system is less impactful on Reddit and Hacker News due to their sheer size and countermeasures, but it&rsquo;s a good example of potential user behavior that makes modeling post performance difficult, and hopefully link aggregators of the future aren&rsquo;t susceptible to such shenanigans.</p>
<h2 id="do-we-really-to-predict-post-score">Do We Really Need to Predict Post Score?</h2>
<p>Let&rsquo;s say you are submitting original content to Reddit or your own tech project to Hacker News. More points means a higher ranking, which means more exposure for your link, right? Not exactly. As the Reddit/HN screenshots above show, the scores of popular submissions are all over the place relative to their rankings, having been affected by age penalties.</p>
<p>In practical terms, from my own purely anecdotal experience, submissions at the top rankings receive <em>substantially</em> more clickthroughs, despite being spatially close on the page to other submissions.</p>
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">&hellip;and now traffic at #3.<br><br>Placement is absurdly important for search engines/social media sites. Difference between #1 and #3 is dramatic. <a href="https://t.co/nGjWJBx6dU">pic.twitter.com/nGjWJBx6dU</a></p>— Max Woolf (@minimaxir) <a href="https://twitter.com/minimaxir/status/877219784907149316?ref_src=twsrc%5Etfw">June 20, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></span></p>
<p>In <a href="https://twitter.com/minimaxir/status/877219784907149316">that case</a>, falling from #1 to #3 <em>immediately halved</em> the referral traffic coming from Hacker News.</p>
<p>Therefore, an ideal link aggregator predictive model to maximize clicks should try to predict the <em>rank</em> of a submission (max rank, average rank over some period <em>n</em>, etc.), not necessarily the score it receives. You could theoretically create a model by making a snapshot of a Reddit subreddit/front page of Hacker News every minute or so which includes the post position at the time of the snapshot. As mentioned earlier, the snapshots can also be used as a model feature to identify whether the front page is active or stale. Unfortunately, snapshots can&rsquo;t be retrieved retroactively, and storing, processing, and analyzing snapshots at scale is a difficult and <em>expensive</em> feat of data engineering.</p>
<p>Presumably Reddit&rsquo;s data scientists would be incorporating submission position as a part of their data analytics and modeling, but after inspecting what&rsquo;s sent to Reddit&rsquo;s servers when you perform an action like upvoting, I wasn&rsquo;t able to find a sent position value when upvoting from the feed: only the post score and post upvote percentage at the time of the action were sent.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/chrome_hu_4b758c7e3fe42881.webp 320w,/2018/09/modeling-link-aggregators/chrome_hu_29f25ed9207a6d8f.webp 768w,/2018/09/modeling-link-aggregators/chrome_hu_f6617992d5fb908c.webp 1024w,/2018/09/modeling-link-aggregators/chrome.png 1442w" src="chrome.png"/> 
</figure>

<p>In this example, I upvoted the <code>Fact are facts</code> submission at position #5: we&rsquo;d expect a value between <code>3</code> and <code>5</code> to be sent with the post metadata within the analytics payload, but that&rsquo;s not the case.</p>
<p>Optimizing ranking instead of a tangible metric or classification accuracy is a relatively underdiscussed field of modern data science (besides <a href="https://en.wikipedia.org/wiki/Search_engine_optimization">SEO</a> for getting the top spot on a Google search), and it would be interesting to dive deeper into it for other applications.</p>
<h2 id="in-the-future">In the future</h2>
<p>The moral of this post is that you should not take it personally if a submission fails to hit the front page. It doesn&rsquo;t necessarily mean it&rsquo;s bad. Conversely, if a post does well, don&rsquo;t assume that similar posts will do just as well. There&rsquo;s a lot of quality content that falls through the cracks due to dumb luck. Fortunately, both Reddit and Hacker News allow reposts, which helps alleviate this particular problem.</p>
<p>There&rsquo;s still a lot that can be done to more deterministically predict the behavior of these algorithmic feeds. There&rsquo;s also room to help make these link aggregators more <em>fair</em>. Unfortunately, there are even more undiscovered ways to game these algorithms, and we&rsquo;ll see how things play out.</p>
<hr>
<p><em>You can view the BigQuery queries used to get the Reddit and Hacker News data, plus the R and ggplot2 code used to create the data visualizations, in <a href="http://minimaxir.com/notebooks/modeling-link-aggregators/">this R Notebook</a>. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/modeling-link-aggregators">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Analyzing IMDb Data The Intended Way, with R and ggplot2</title>
      <link>https://minimaxir.com/2018/07/imdb-data-analysis/</link>
      <pubDate>Mon, 16 Jul 2018 09:45:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/07/imdb-data-analysis/</guid>
      <description>For IMDb&amp;rsquo;s big-but-not-big data, you have to play with the data smartly, and both R and ggplot2 have neat tricks to do just that.</description>
      <content:encoded><![CDATA[<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/P4_zSfoTM80?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p><a href="https://www.imdb.com">IMDb</a>, the Internet Movie Database, has been a popular source for data analysis and visualizations over the years. The combination of user ratings for movies and detailed movie metadata have always been fun to <a href="http://minimaxir.com/2016/01/movie-revenue-ratings/">play with</a>.</p>
<p>There are a number of tools to help get IMDb data, such as <a href="https://github.com/alberanid/imdbpy">IMDbPY</a>, which makes it easy to programmatically scrape IMDb by pretending to be a website user and extracting the relevant data from the page&rsquo;s HTML output. While it <em>works</em>, web scraping public data is a legal gray area; many large websites have Terms of Service which forbid scraping, and they can potentially send a DMCA take-down notice to websites redistributing scraped data.</p>
<p>IMDb has <a href="https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX">data licensing terms</a> which forbid scraping and require an attribution in the form of an <strong>Information courtesy of IMDb (<a href="http://www.imdb.com">http://www.imdb.com</a>). Used with permission.</strong> statement, and has also <a href="https://www.kaggle.com/tmdb/tmdb-movie-metadata/home">DMCAed a Kaggle IMDb dataset</a> to drive the point home.</p>
<p>However, there is good news! IMDb publishes an <a href="https://www.imdb.com/interfaces/">official dataset</a> for casual data analysis! And it&rsquo;s now very accessible, just choose a dataset and download (now with no hoops to jump through), and the files are in the standard <a href="https://en.wikipedia.org/wiki/Tab-separated_values">TSV format</a>.</p>
<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/datasets_hu_fb4ad2ef1d7c9e7f.webp 320w,/2018/07/imdb-data-analysis/datasets_hu_a5155a40c73aa984.webp 768w,/2018/07/imdb-data-analysis/datasets.png 926w" src="datasets.png"/> 
</figure>

<p>The uncompressed files are pretty large; not &ldquo;big data&rdquo; large (it fits into computer memory), but Excel will explode if you try to open them in it. You have to play with the data <em>smartly</em>, and both <a href="https://www.r-project.org">R</a> and <a href="https://ggplot2.tidyverse.org/reference/index.html">ggplot2</a> have neat tricks to do just that.</p>
<h2 id="first-steps">First Steps</h2>
<p>R is a popular programming language for statistical analysis. One of the most popular collections of external packages is the <code>tidyverse</code>, which automatically imports the <code>ggplot2</code> data visualization library and other useful packages that we&rsquo;ll get to one-by-one. We&rsquo;ll also load <code>scales</code>, which we&rsquo;ll use later for prettier number formatting. First we&rsquo;ll load these packages:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">scales</span><span class="p">)</span>
</span></span></code></pre></div><p>And now we can load a TSV downloaded from IMDb using the <code>read_tsv</code> function from <code>readr</code> (a tidyverse package), which does what the name implies at a much faster speed than base R (the extra parameters handle quirks in the data encoding). Let&rsquo;s start with the <code>ratings</code> file:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;title.ratings.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span></span></span></code></pre></div>
<p>We can preview what&rsquo;s in the loaded data using <code>dplyr</code> (a tidyverse package), which is what we&rsquo;ll be using to manipulate data for this analysis. dplyr lets you pipe data between functions, making it easy to build up a sequence of manipulations. For now, we&rsquo;ll use <code>head()</code>, which displays the top few rows of the data frame.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">()</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/ratings_hu_5c1fcf56a5289876.webp 320w,/2018/07/imdb-data-analysis/ratings_hu_cf3fece2f9c850ca.webp 768w,/2018/07/imdb-data-analysis/ratings.png 930w" src="ratings.png"/> 
</figure>

<p>Each of the <strong>873k rows</strong> corresponds to a single movie and contains an ID for the movie, its average rating (from 1 to 10), and the number of votes which contribute to that average. Since we have two numeric variables, why not test out ggplot2 by creating a scatterplot mapping them? ggplot2 takes in a data frame and names of columns as aesthetics; then you specify what type of shape to plot (a &ldquo;geom&rdquo;). Passing the plot to <code>ggsave</code> saves it as a standalone, high-quality data visualization.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">numVotes</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_point</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">ggsave</span><span class="p">(</span><span class="s">&#34;imdb-0.png&#34;</span><span class="p">,</span> <span class="n">plot</span><span class="p">,</span> <span class="n">width</span> <span class="o">=</span> <span class="m">4</span><span class="p">,</span> <span class="n">height</span> <span class="o">=</span> <span class="m">3</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-0_hu_6866c079d670893c.webp 320w,/2018/07/imdb-data-analysis/imdb-0_hu_dddd194229265d79.webp 768w,/2018/07/imdb-data-analysis/imdb-0_hu_1d852e43e8a54dea.webp 1024w,/2018/07/imdb-data-analysis/imdb-0.png 1200w" src="imdb-0.png"/> 
</figure>

<p>Here are nearly <em>1 million</em> points on a single chart; definitely don&rsquo;t try that in Excel! However, it&rsquo;s not a <em>useful</em> chart, since all the points are opaque and we can&rsquo;t see the spatial density of the points. One approach to fix this issue is to create a heat map, which ggplot2 can do natively with <code>geom_bin2d</code>. We can color the heat map with the <a href="https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html">viridis</a> colorblind-friendly palettes <a href="https://ggplot2.tidyverse.org/reference/scale_viridis.html">just introduced</a> into ggplot2. We should also tweak the axes: the x-axis should be scaled logarithmically with <code>scale_x_log10</code>, since vote counts span several orders of magnitude, and we can format the numbers on both the x-axis and the fill scale with the <code>comma</code> function from the <code>scales</code> package. For the y-axis, we can add explicit number breaks for each rating; R can do this neatly by setting the breaks to <code>1:10</code>. Putting it all together:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">numVotes</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_bin2d</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_x_log10</span><span class="p">(</span><span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_viridis_c</span><span class="p">(</span><span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-1_hu_afa4c2e2f89a47f2.webp 320w,/2018/07/imdb-data-analysis/imdb-1_hu_fb49622c671e7e.webp 768w,/2018/07/imdb-data-analysis/imdb-1_hu_fe5886baf1a1a113.webp 1024w,/2018/07/imdb-data-analysis/imdb-1.png 1200w" src="imdb-1.png"/> 
</figure>

<p>Not bad, although it unfortunately confirms that IMDb ratings follow a <a href="https://tvtropes.org/pmwiki/pmwiki.php/Main/FourPointScale">Four Point Scale</a>, where average ratings tend to fall between 6 and 9.</p>
<h2 id="mapping-movies-to-ratings">Mapping Movies to Ratings</h2>
<p>You may be asking &ldquo;which ratings correspond to which movies?&rdquo; That&rsquo;s what the <code>tconst</code> field is for. But first, let&rsquo;s load the title data from <code>title.basics.tsv</code> into <code>df_basics</code> and take a look as before.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_basics</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;title.basics.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/basics1_hu_fdcb6a5f4e7311e5.webp 320w,/2018/07/imdb-data-analysis/basics1_hu_e15b78e5bbe944b8.webp 768w,/2018/07/imdb-data-analysis/basics1_hu_2e217e73acfcd9ff.webp 1024w,/2018/07/imdb-data-analysis/basics1.png 1350w" src="basics1.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/basics2_hu_a64ae979748aa9ab.webp 320w,/2018/07/imdb-data-analysis/basics2_hu_a83799eaf31e4743.webp 768w,/2018/07/imdb-data-analysis/basics2_hu_21a8fb679f3ec4e9.webp 1024w,/2018/07/imdb-data-analysis/basics2.png 1374w" src="basics2.png"/> 
</figure>
</p>
<p>We have some neat movie metadata. Notably, this table has a <code>tconst</code> field as well. Therefore, we can <em>join</em> the two tables together, adding the movie information to the corresponding row in the ratings table (in this case, a left join is more appropriate than an inner/full join).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">&lt;-</span> <span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_basics</span><span class="p">)</span>
</span></span></code></pre></div><p>Runtime minutes sounds interesting. Could there be a relationship between the length of a movie and its average rating on IMDb? Let&rsquo;s make a heat map again, but with a few tweaks. With the new metadata, we can <code>filter</code> the table to remove bad points: let&rsquo;s keep only movies (as the IMDb data also contains <em>television show data</em>), with a runtime under 3 hours, which have received at least 10 votes from users, to remove extraneous entries. The x-axis should be tweaked to display the minutes-values in hours. The viridis fill palette can be changed to another one in the family (I personally like <code>inferno</code>).</p>
<p>More importantly, let&rsquo;s discuss plot theming. If you want a minimalistic theme, add a <code>theme_minimal</code> to the plot, and you can pass a <code>base_family</code> to change the default font on the plot and a <code>base_size</code> to change the font size. The <code>labs</code> function lets you add labels to the plot (which you should <em>always</em> do); you have your <code>title</code>, <code>x</code>, and <code>y</code> parameters, but you can also add a <code>subtitle</code>, a <code>caption</code> for attribution, and a <code>color</code>/<code>fill</code> to name the scale. Putting it all together:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">runtimeMinutes</span> <span class="o">&lt;</span> <span class="m">180</span><span class="p">,</span> <span class="n">titleType</span> <span class="o">==</span> <span class="s">&#34;movie&#34;</span><span class="p">,</span> <span class="n">numVotes</span> <span class="o">&gt;=</span> <span class="m">10</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">runtimeMinutes</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_bin2d</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">180</span><span class="p">,</span> <span class="m">60</span><span class="p">),</span> <span class="n">labels</span> <span class="o">=</span> <span class="m">0</span><span class="o">:</span><span class="m">3</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="m">0</span><span class="o">:</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_viridis_c</span><span class="p">(</span><span class="n">option</span> <span class="o">=</span> <span class="s">&#34;inferno&#34;</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">theme_minimal</span><span class="p">(</span><span class="n">base_family</span> <span class="o">=</span> <span class="s">&#34;Source Sans Pro&#34;</span><span class="p">,</span> <span class="n">base_size</span> <span class="o">=</span> <span class="m">8</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">labs</span><span class="p">(</span><span class="n">title</span> <span class="o">=</span> <span class="s">&#34;Relationship between Movie Runtime and Average Movie Rating&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">subtitle</span> <span class="o">=</span> <span class="s">&#34;Data from IMDb retrieved July 4th, 2018&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">x</span> <span class="o">=</span> <span class="s">&#34;Runtime (Hours)&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">y</span> <span class="o">=</span> <span class="s">&#34;Average User Rating&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">caption</span> <span class="o">=</span> <span class="s">&#34;Max Woolf — minimaxir.com&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">fill</span> <span class="o">=</span> <span class="s">&#34;# Movies&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-2b_hu_37c6091878dca7a3.webp 320w,/2018/07/imdb-data-analysis/imdb-2b_hu_42f5a5f9d2e7967e.webp 768w,/2018/07/imdb-data-analysis/imdb-2b_hu_b4f485eff14f2484.webp 1024w,/2018/07/imdb-data-analysis/imdb-2b.png 1200w" src="imdb-2b.png"/> 
</figure>

<p>Now that&rsquo;s pretty nice-looking for only a few lines of code! Albeit unhelpful in this case, as there doesn&rsquo;t appear to be a correlation between runtime and rating.</p>
<p><em>(Note: for the rest of this post, the theming/labels code will be omitted for convenience)</em></p>
<p>How about movie ratings vs. the year the movie was made? It&rsquo;s a similar plot code-wise to the one above (one perk of <code>ggplot2</code> is that there&rsquo;s no shame in reusing chart code!), but we can add a <code>geom_smooth</code>, which adds a nonparametric trendline with confidence bands; since we have a large amount of data, the bands are very tight. We can also fix the problem of &ldquo;empty&rdquo; bins by setting the fill scale to logarithmic scaling. And since we&rsquo;re adding a black trendline, let&rsquo;s change the viridis palette to <code>plasma</code> for better contrast.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">titleType</span> <span class="o">==</span> <span class="s">&#34;movie&#34;</span><span class="p">,</span> <span class="n">numVotes</span> <span class="o">&gt;=</span> <span class="m">10</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">startYear</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_bin2d</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_smooth</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">&#34;black&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_x_continuous</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_viridis_c</span><span class="p">(</span><span class="n">option</span> <span class="o">=</span> <span class="s">&#34;plasma&#34;</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">,</span> <span class="n">trans</span> <span class="o">=</span> <span class="s">&#39;log10&#39;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-4_hu_fdf90cbdd2dd2c7e.webp 320w,/2018/07/imdb-data-analysis/imdb-4_hu_1c45abe215427c09.webp 768w,/2018/07/imdb-data-analysis/imdb-4_hu_62d0feb034e8b054.webp 1024w,/2018/07/imdb-data-analysis/imdb-4.png 1200w" src="imdb-4.png"/> 
</figure>

<p>Unfortunately, this trend hasn&rsquo;t changed much either, although the presence of average ratings outside the Four Point Scale has increased over time.</p>
<h2 id="mapping-lead-actors-to-movies">Mapping Lead Actors to Movies</h2>
<p>Now that we have a handle on working with the IMDb data, let&rsquo;s try playing with the larger datasets. Since they take up a lot of computer memory, we only want to persist data we actually might use. After looking at the schema provided with the official datasets, the only really useful metadata about the actors is their birth year, so let&rsquo;s load that, keeping only actors/actresses (using the fast <code>str_detect</code> function from <code>stringr</code>, another tidyverse package) and only the relevant fields.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_actors</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;name.basics.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                <span class="nf">filter</span><span class="p">(</span><span class="nf">str_detect</span><span class="p">(</span><span class="n">primaryProfession</span><span class="p">,</span> <span class="s">&#34;actor|actress&#34;</span><span class="p">))</span>  <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                <span class="nf">select</span><span class="p">(</span><span class="n">nconst</span><span class="p">,</span> <span class="n">primaryName</span><span class="p">,</span> <span class="n">birthYear</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/actor_hu_f86030d94734f51e.webp 320w,/2018/07/imdb-data-analysis/actor_hu_58f7a4e4de86c210.webp 768w,/2018/07/imdb-data-analysis/actor.png 936w" src="actor.png"/> 
</figure>

<p>The principals dataset, the large 1.28GB TSV, is the most interesting. It&rsquo;s an unnested list of the credited persons in each movie, with an <code>ordering</code> indicating their rank (where <code>1</code> means first, <code>2</code> means second, etc.).</p>
<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/principals_hu_e149270e85e6bbfe.webp 320w,/2018/07/imdb-data-analysis/principals_hu_d39d7c6fcd18929.webp 768w,/2018/07/imdb-data-analysis/principals_hu_56b42bde8cdb5364.webp 1024w,/2018/07/imdb-data-analysis/principals.png 1074w" src="principals.png"/> 
</figure>

<p>For this analysis, let&rsquo;s only look at the <strong>lead actors/actresses</strong>; specifically, for each movie (identified by the <code>tconst</code> value), filter the dataset to the credited actor/actress with the lowest <code>ordering</code> value (we filter to actors/actresses first because the person at rank <code>1</code> may not necessarily be an actor/actress).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_principals</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;title.principals.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">filter</span><span class="p">(</span><span class="nf">str_detect</span><span class="p">(</span><span class="n">category</span><span class="p">,</span> <span class="s">&#34;actor|actress&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="n">tconst</span><span class="p">,</span> <span class="n">ordering</span><span class="p">,</span> <span class="n">nconst</span><span class="p">,</span> <span class="n">category</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">group_by</span><span class="p">(</span><span class="n">tconst</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">filter</span><span class="p">(</span><span class="n">ordering</span> <span class="o">==</span> <span class="nf">min</span><span class="p">(</span><span class="n">ordering</span><span class="p">))</span>
</span></span></code></pre></div><p>Both datasets have an <code>nconst</code> field, so let&rsquo;s join them together, and then join <em>that</em> to the ratings table from earlier via <code>tconst</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_principals</span> <span class="o">&lt;-</span> <span class="n">df_principals</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_actors</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">&lt;-</span> <span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_principals</span><span class="p">)</span>
</span></span></code></pre></div><p>Now we have a fully denormalized dataset in <code>df_ratings</code>. Since it contains both the movie release year and the birth year of the lead actor, we can infer <em>the age of the lead actor at the movie&rsquo;s release</em>. With that goal, filter the data using the criteria from the earlier visualizations, keeping only rows which have the actor&rsquo;s birth year.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings_movies</span> <span class="o">&lt;-</span> <span class="n">df_ratings</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                        <span class="nf">filter</span><span class="p">(</span><span class="n">titleType</span> <span class="o">==</span> <span class="s">&#34;movie&#34;</span><span class="p">,</span> <span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">birthYear</span><span class="p">),</span> <span class="n">numVotes</span> <span class="o">&gt;=</span> <span class="m">10</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                        <span class="nf">mutate</span><span class="p">(</span><span class="n">age_lead</span> <span class="o">=</span> <span class="n">startYear</span> <span class="o">-</span> <span class="n">birthYear</span><span class="p">)</span>
</span></span></code></pre></div><p><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/denorm1_hu_654cad39747efe47.webp 320w,/2018/07/imdb-data-analysis/denorm1_hu_eed6e992d7e214e3.webp 768w,/2018/07/imdb-data-analysis/denorm1_hu_dbde12b6453e4f09.webp 1024w,/2018/07/imdb-data-analysis/denorm1.png 1604w" src="denorm1.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/denorm2_hu_3aef3d94cde50e2c.webp 320w,/2018/07/imdb-data-analysis/denorm2.png 531w" src="denorm2.png"/> 
</figure>
</p>
<h2 id="plotting-ages">Plotting Ages</h2>
<p>Age discrimination in movie casting has been a recurring issue in Hollywood; in fact, in 2017 <a href="https://www.hollywoodreporter.com/thr-esq/judge-pauses-enforcement-imdb-age-censorship-law-978797">a law was signed</a> to force IMDb to remove an actor&rsquo;s age upon request, which in February 2018 was <a href="https://www.hollywoodreporter.com/thr-esq/californias-imdb-age-censorship-law-declared-unconstitutional-1086540">ruled to be unconstitutional</a>.</p>
<p>Have the ages of movie leads changed over time? For this example, we&rsquo;ll use a <a href="https://ggplot2.tidyverse.org/reference/geom_ribbon.html">ribbon plot</a> to plot the ranges of ages of movie leads. A simple way to do that is, for each year, calculate the 25th <a href="https://en.wikipedia.org/wiki/Percentile">percentile</a> of the ages, the 50th percentile (i.e. the median), and the 75th percentile, where the 25th and 75th percentiles are the ribbon bounds and the line represents the median.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_actor_ages</span> <span class="o">&lt;-</span> <span class="n">df_ratings_movies</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">group_by</span><span class="p">(</span><span class="n">startYear</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">summarize</span><span class="p">(</span><span class="n">low_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.25</span><span class="p">,</span> <span class="n">na.rm</span><span class="o">=</span><span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">med_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.50</span><span class="p">,</span> <span class="n">na.rm</span><span class="o">=</span><span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">high_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.75</span><span class="p">,</span> <span class="n">na.rm</span><span class="o">=</span><span class="bp">T</span><span class="p">))</span>
</span></span></code></pre></div><p>Plotting it with ggplot2 is surprisingly simple, although you need to use different y aesthetics for the ribbon and the overlapping line.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_actor_ages</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">startYear</span> <span class="o">&gt;=</span> <span class="m">1920</span><span class="p">)</span> <span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">startYear</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_ribbon</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">ymin</span> <span class="o">=</span> <span class="n">low_age</span><span class="p">,</span> <span class="n">ymax</span> <span class="o">=</span> <span class="n">high_age</span><span class="p">),</span> <span class="n">alpha</span> <span class="o">=</span> <span class="m">0.2</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_line</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="n">med_age</span><span class="p">))</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-8_hu_1f082993b0bfcbd5.webp 320w,/2018/07/imdb-data-analysis/imdb-8_hu_5434c1e3ce1485b4.webp 768w,/2018/07/imdb-data-analysis/imdb-8_hu_c6707a589573484a.webp 1024w,/2018/07/imdb-data-analysis/imdb-8.png 1200w" src="imdb-8.png"/> 
</figure>

<p>Turns out that in the 2000s, the median age of lead actors started to <em>increase</em>? Both the upper and lower bounds increased too. That doesn&rsquo;t square with the age discrimination complaints.</p>
<p>Another aspect of these complaints is gender, as actresses tend to be cast younger than actors. Thanks to the magic of ggplot2 and dplyr, separating actors/actresses is relatively simple: add gender (encoded in <code>category</code>) as a grouping variable, add it as a color/fill aesthetic in ggplot, and set the colors appropriately (I recommend the <a href="http://colorbrewer2.org/">ColorBrewer</a> qualitative palettes for categorical variables).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_actor_ages_lead</span> <span class="o">&lt;-</span> <span class="n">df_ratings_movies</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">group_by</span><span class="p">(</span><span class="n">startYear</span><span class="p">,</span> <span class="n">category</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">summarize</span><span class="p">(</span><span class="n">low_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.25</span><span class="p">,</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">med_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.50</span><span class="p">,</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">high_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.75</span><span class="p">,</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_actor_ages_lead</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">startYear</span> <span class="o">&gt;=</span> <span class="m">1920</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">startYear</span><span class="p">,</span> <span class="n">fill</span> <span class="o">=</span> <span class="n">category</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="n">category</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_ribbon</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">ymin</span> <span class="o">=</span> <span class="n">low_age</span><span class="p">,</span> <span class="n">ymax</span> <span class="o">=</span> <span class="n">high_age</span><span class="p">),</span> <span class="n">alpha</span> <span class="o">=</span> <span class="m">0.2</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_line</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="n">med_age</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_brewer</span><span class="p">(</span><span class="n">palette</span> <span class="o">=</span> <span class="s">&#34;Set1&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_color_brewer</span><span class="p">(</span><span class="n">palette</span> <span class="o">=</span> <span class="s">&#34;Set1&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-9_hu_57562b2f234249be.webp 320w,/2018/07/imdb-data-analysis/imdb-9_hu_7da40c01dd2abee4.webp 768w,/2018/07/imdb-data-analysis/imdb-9_hu_a30111e8cbade2ed.webp 1024w,/2018/07/imdb-data-analysis/imdb-9.png 1200w" src="imdb-9.png"/> 
</figure>

<p>There&rsquo;s about a 10-year gap between the ages of male and female leads, and the gap doesn&rsquo;t change over time, although both begin to rise at the same point.</p>
<p>One possible explanation for this behavior is actor reuse: if Hollywood keeps casting the same actors/actresses, by construction the ages of the leads will steadily increase. Let&rsquo;s verify that: with our list of movies and their lead actors, for each lead actor, order all their movies by release year, and add a ranking for the #th time that actor has been a lead. This is possible with <code>row_number</code> in dplyr; <a href="https://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html">window functions</a> like <code>row_number</code> are one of data science&rsquo;s most useful secrets.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings_movies_nth</span> <span class="o">&lt;-</span> <span class="n">df_ratings_movies</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                      <span class="nf">group_by</span><span class="p">(</span><span class="n">nconst</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                      <span class="nf">arrange</span><span class="p">(</span><span class="n">startYear</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                      <span class="nf">mutate</span><span class="p">(</span><span class="n">nth_lead</span> <span class="o">=</span> <span class="nf">row_number</span><span class="p">())</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/row_number_hu_1e44bdb2621fb9cb.webp 320w,/2018/07/imdb-data-analysis/row_number_hu_ca408294ce31483a.webp 768w,/2018/07/imdb-data-analysis/row_number_hu_ed006c80eb52873e.webp 1024w,/2018/07/imdb-data-analysis/row_number.png 1532w" src="row_number.png"/> 
</figure>

<p>One more ribbon plot later (w/ same code as above + custom y-axis breaks):</p>
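<p>As a sketch, that reused code would look something like the following, assuming it mirrors the age ribbon plots above (the <code>low_nth</code>/<code>med_nth</code>/<code>high_nth</code> names and the exact y-axis breaks here are illustrative, not necessarily what generated the chart below):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"># aggregate the #th-lead rank into quartiles per release year
df_lead_rank &lt;- df_ratings_movies_nth %&gt;%
                  group_by(startYear) %&gt;%
                  summarize(low_nth = quantile(nth_lead, 0.25, na.rm = T),
                            med_nth = quantile(nth_lead, 0.50, na.rm = T),
                            high_nth = quantile(nth_lead, 0.75, na.rm = T))

# same ribbon + median-line construction as the age plots
plot &lt;- ggplot(df_lead_rank %&gt;% filter(startYear &gt;= 1920), aes(x = startYear)) +
          geom_ribbon(aes(ymin = low_nth, ymax = high_nth), alpha = 0.2) +
          geom_line(aes(y = med_nth)) +
          scale_y_continuous(breaks = seq(0, 20, 4))
</code></pre></div>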
<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-12_hu_32ee97febb68e3.webp 320w,/2018/07/imdb-data-analysis/imdb-12_hu_69e7d60d89429d8f.webp 768w,/2018/07/imdb-data-analysis/imdb-12_hu_c9df788e280bb63b.webp 1024w,/2018/07/imdb-data-analysis/imdb-12.png 1200w" src="imdb-12.png"/> 
</figure>

<p>Huh. The median and upper bound of the #th-lead count have <em>dropped</em> over time? Hollywood has been promoting more newcomers as leads? That&rsquo;s not what I expected!</p>
<p>More work definitely needs to be done in this area. In the meantime, the official IMDb datasets are a lot more robust than I thought they would be! And I only used a fraction of the datasets; the rest tie into TV shows, which are a bit messier. Hopefully you&rsquo;ve gotten a good taste of the power of R and ggplot2 for playing with big-but-not-big data!</p>
<hr>
<p><em>You can view the R and ggplot2 code used to create the data visualizations in <a href="http://minimaxir.com/notebooks/imdb-data-analysis/">this R Notebook</a>, which includes many visualizations not used in this post. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/imdb-data-analysis">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
