<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Statistics on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/category/statistics/</link>
    <description>Recent content in Statistics on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Mon, 30 Jun 2025 10:00:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/category/statistics/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Predicting Average IMDb Movie Ratings Using Text Embeddings of Movie Metadata</title>
      <link>https://minimaxir.com/2025/06/movie-embeddings/</link>
      <pubDate>Mon, 30 Jun 2025 10:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2025/06/movie-embeddings/</guid>
      <description>Don&amp;rsquo;t try this in your data science interviews.</description>
      <content:encoded><![CDATA[<p>Months ago, I saw a post titled &ldquo;<a href="https://www.reddit.com/r/datascience/comments/1eykil7/rejected_from_ds_role_with_no_feedback/">Rejected from DS Role with no feedback</a>&rdquo; on Reddit&rsquo;s <a href="https://www.reddit.com/r/datascience/">Data Science subreddit</a>, in which a prospective job candidate for a data science position provided a <a href="https://colab.research.google.com/drive/1Ud2tXW2IAw_dXA5DONvNpPmmlL1foSwK">Colab Notebook</a> documenting their submission for a take-home assignment and asking for feedback as to why they were rejected. Per the Reddit user, the assignment was:</p>
<blockquote>
<p>Use the publicly available <a href="https://developer.imdb.com/non-commercial-datasets/">IMDB Datasets</a> to build a model that predicts a movie&rsquo;s average rating. Please document your approach and present your results in the notebook. Make sure your code is well-organized so that we can follow your modeling process.</p>
</blockquote>
<p><a href="https://www.imdb.com/">IMDb</a>, the Internet Movie Database owned by Amazon, allows users to rate movies on a scale from 1 to 10, wherein the average rating is then displayed prominently on the movie&rsquo;s page:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/shawshank_hu_fe8025c2c6a0fa89.webp 320w,/2025/06/movie-embeddings/shawshank_hu_f0b2bc74865ccb73.webp 768w,/2025/06/movie-embeddings/shawshank_hu_8f544060412f7f54.webp 1024w,/2025/06/movie-embeddings/shawshank.webp 1082w" src="shawshank.webp"
         alt="The Shawshank Redemption is currently the highest-rated movie on IMDb with an average rating of 9.3 derived from 3.1 million user votes."/> <figcaption>
            <p><a href="https://www.imdb.com/title/tt0111161/?ref_=sr_t_1">The Shawshank Redemption</a> is currently the <a href="https://www.imdb.com/search/title/?groups=top_100&amp;sort=user_rating,desc">highest-rated movie on IMDb</a> with an average rating of 9.3 derived from 3.1 million user votes.</p>
        </figcaption>
</figure>

<p>In their notebook, the Redditor identifies a few intuitive features for such a model, including the year in which the movie was released, the genre(s) of the movie, and the actors/directors of the movie. However, the model they built is a <a href="https://www.tensorflow.org/">TensorFlow</a> and <a href="https://keras.io/">Keras</a>-based neural network, with all the bells and whistles such as <a href="https://en.wikipedia.org/wiki/Batch_normalization">batch normalization</a> and <a href="https://en.wikipedia.org/wiki/Dilution_%28neural_networks%29">dropout</a>. The immediate response by other data scientists on /r/datascience was, at its most polite, &ldquo;why did you use a neural network when it&rsquo;s a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a> that you can&rsquo;t explain?&rdquo;</p>
<p>Reading those replies made me nostalgic. Way back in 2017, before my first job as a data scientist, neural networks using frameworks such as TensorFlow and Keras were all the rage for their ability to &ldquo;<a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem">solve any problem</a>&rdquo;, but using them was often seen as lazy and unskilled compared to traditional statistical modeling such as ordinary least squares linear regression or even gradient boosted trees. Although it&rsquo;s funny to see that this perception of neural networks in the data science community hasn&rsquo;t changed since then, nowadays the black box nature of neural networks can be an acceptable business tradeoff if the prediction results are higher quality and interpretability is not required.</p>
<p>Looking back at the assignment description, the objective is only &ldquo;predict a movie&rsquo;s average rating.&rdquo; For data science interview take-homes, this is unusual: those assignments typically have an extra instruction along the lines of &ldquo;explain your model and what decisions stakeholders should make as a result of it&rdquo;, which is a strong hint that you need to use an explainable model like linear regression to obtain feature coefficients, or even a middle ground like gradient boosted trees and its <a href="https://stats.stackexchange.com/questions/332960/what-is-variable-importance">variable importance</a> to quantify relative feature contribution to the model. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> In the absence of that particular constraint, it&rsquo;s arguable that anything goes, including neural networks.</p>
<p>The quality of neural networks has improved significantly since 2017, even more so due to the massive rise of LLMs. Why not try just feeding an LLM all the raw metadata for a movie, encoding it into a text embedding, and building a statistical model on top of that? Would a neural network do better than a traditional statistical model in that instance? Let&rsquo;s find out!</p>
<h2 id="about-imdb-data">About IMDb Data</h2>
<p>The <a href="https://developer.imdb.com/non-commercial-datasets/">IMDb Non-Commercial Datasets</a> are famous sets of data that have been around for nearly a decade <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> but are still updated daily. Back in 2018 as a budding data scientist, I performed a <a href="https://minimaxir.com/2018/07/imdb-data-analysis/">fun exporatory data analysis</a> using these datasets, although the results aren&rsquo;t too surprising.</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/imdb-4_hu_fdf90cbdd2dd2c7e.webp 320w,/2025/06/movie-embeddings/imdb-4_hu_1c45abe215427c09.webp 768w,/2025/06/movie-embeddings/imdb-4_hu_62d0feb034e8b054.webp 1024w,/2025/06/movie-embeddings/imdb-4.png 1200w" src="imdb-4.png"
         alt="The average rating for a movie is around 6 and tends to skew higher: a common trend in internet rating systems."/> <figcaption>
            <p>The average rating for a movie is around 6 and tends to skew higher: a common trend in internet rating systems.</p>
        </figcaption>
</figure>

<p>But in truth, these datasets are a terrible idea for companies to use for a take-home assignment. Although the datasets are released under a non-commercial license, IMDb doesn&rsquo;t want to give too much information to their competitors, which results in a severely limited number of features that could be used to build a good predictive model. Here are the common movie-performance-related features present in the <code>title.basics.tsv.gz</code> file:</p>
<ul>
<li><strong>tconst</strong>: unique identifier of the title</li>
<li><strong>titleType</strong>: the type/format of the title (e.g. movie, tvMovie, short, tvSeries, etc.)</li>
<li><strong>primaryTitle</strong>: the more popular title / the title used by the filmmakers on promotional materials at the point of release</li>
<li><strong>isAdult</strong>: 0: non-adult title; 1: adult title</li>
<li><strong>startYear</strong>: represents the release year of a title.</li>
<li><strong>runtimeMinutes</strong>: primary runtime of the title, in minutes</li>
<li><strong>genres</strong>: includes up to three genres associated with the title</li>
</ul>
<p>This is a sensible schema for describing a movie, although it lacks some important information that would be very useful to determine movie quality such as production company, summary blurbs, granular genres/tags, and plot/setting — all of which are available on the IMDb movie page itself and presumably accessible through the <a href="https://developer.imdb.com/documentation/api-documentation/?ref_=/documentation/_PAGE_BODY">paid API</a>. Of note, since the assignment explicitly asks for a <em>movie</em>&rsquo;s average rating, we need to filter the data to only <code>movie</code> and <code>tvMovie</code> entries, which the original assignment failed to do.</p>
<p>The ratings data in <code>title.ratings.tsv.gz</code> is what you&rsquo;d expect:</p>
<ul>
<li><strong>tconst</strong>: unique identifier of the title (which can therefore be mapped to movie metadata using a JOIN)</li>
<li><strong>averageRating</strong>: average of all the individual user ratings</li>
<li><strong>numVotes</strong>: number of votes the title has received</li>
</ul>
<p>In order to ensure that the average ratings for modeling are indeed stable and indicative of user sentiment, I will only analyze movies that have <em>at least 30 user votes</em>: as of May 10th, 2025, that&rsquo;s about 242k movies total. Additionally, I will not use <code>numVotes</code> as a model feature, since that&rsquo;s a metric based more on extrinsic movie popularity than on the movie itself.</p>
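<p>As a rough sketch of the filtering and joining described above (not the exact ETL code used later in this post, and assuming the TSVs have already been downloaded and decompressed locally), the Polars version looks something like this:</p>
<pre><code class="language-py3">import polars as pl

# IMDb uses "\N" to denote missing values in its TSV exports
df_basics = pl.read_csv(
    "title.basics.tsv", separator="\t", null_values="\\N", quote_char=None
)
df_ratings = pl.read_csv("title.ratings.tsv", separator="\t", null_values="\\N")

df_movies = (
    df_basics.filter(pl.col("titleType").is_in(["movie", "tvMovie"]))  # movies only, per the assignment
    .join(df_ratings, on="tconst", how="inner")
    .filter(pl.col("numVotes") &gt;= 30)  # keep titles with stable average ratings
)
</code></pre>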
<p>The last major dataset is <code>title.principals.tsv.gz</code>, which has very helpful information on metadata such as the roles people play in the production of a movie:</p>
<ul>
<li><strong>tconst</strong>: unique identifier of the title (which can be mapped to movie data using a JOIN)</li>
<li><strong>nconst</strong>: unique identifier of the principal (this is mapped to <code>name.basics.tsv.gz</code> to get the principal&rsquo;s <code>primaryName</code>, but nothing else useful)</li>
<li><strong>category</strong>: the role the principal served in the title, such as <code>actor</code>, <code>actress</code>, <code>writer</code>, <code>producer</code>, etc.</li>
<li><strong>ordering</strong>: the ordering of the principals within the title, which correlates to the order the principals appear on IMDb&rsquo;s movie cast pages.</li>
</ul>
<p>Additionally, because the datasets are so popular, it&rsquo;s not the first time someone has built an IMDb ratings predictor, and prior attempts are easy to Google.</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/google_hu_b09e979836a71049.webp 320w,/2025/06/movie-embeddings/google_hu_c652438955f310d8.webp 768w,/2025/06/movie-embeddings/google.webp 1000w" src="google.webp"/> 
</figure>

<p>However, instead of using the official IMDb datasets, these analyses are based on the smaller <a href="https://www.kaggle.com/datasets/carolzhangdc/imdb-5000-movie-dataset/data">IMDB 5000 Movie Dataset</a> hosted on Kaggle, which adds metadata such as movie rating, budget, and further actor metadata that make building a model much easier (albeit &ldquo;number of likes on the lead actor&rsquo;s Facebook page&rdquo; is <em>very</em> extrinsic to movie quality). Using the official datasets with much less metadata means building the models on hard mode, and they will likely have lower predictive performance.</p>
<p>Although IMDb data is very popular and very well documented, that doesn&rsquo;t mean it&rsquo;s easy to work with.</p>
<h2 id="the-initial-assignment-and-feature-engineering">The Initial Assignment and &ldquo;Feature Engineering&rdquo;</h2>
<p>Data science take-home assignments are typically 1/2 <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">exploratory data analysis</a> for identifying impactful dataset features, and 1/2 building, iterating, and explaining the model. For real-world datasets, these are all very difficult problems with many possible solutions, and the goal from the employer&rsquo;s perspective is more to see <em>how</em> these problems are solved than to judge the actual quantitative results.</p>
<p>The Redditor&rsquo;s initial notebook engineers some expected features using <a href="https://pandas.pydata.org/">pandas</a>, such as <code>is_sequel</code>, derived by checking whether a non-<code>1</code> number is present at the end of a movie title, and <a href="https://en.wikipedia.org/wiki/One-hot">one-hot encoding</a> each distinct <code>genre</code> of a movie. These are fine for an initial approach, although sequel titles can be idiosyncratic, which suggests that a more <a href="https://www.ibm.com/think/topics/natural-language-processing">NLP</a>-driven approach to identifying sequels and other related media may be useful.</p>
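<p>As a rough sketch (hypothetical column names, not the Redditor&rsquo;s exact code), those two features might be engineered like this in pandas:</p>
<pre><code class="language-py3">import pandas as pd

# flag titles that end in a number other than 1 as likely sequels
trailing_number = df["primaryTitle"].str.extract(r"(\d+)\s*$")[0]
df["is_sequel"] = trailing_number.notna() &amp; trailing_number.ne("1")

# one-hot encode each distinct genre from the comma-separated genres column
genre_dummies = df["genres"].str.get_dummies(sep=",")
df = pd.concat([df, genre_dummies], axis=1)
</code></pre>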
<p>The main trick with this assignment is how to handle the principals. The common data science approach would be to use a sparse binary encoding of the actors/directors/writers, e.g. using a vector where actors present in the movie are <code>1</code> and every other actor is <code>0</code>, which leads to a large number of potential approaches to encode this data performantly, such as scikit-learn&rsquo;s <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html">MultiLabelBinarizer</a>. The problem with this approach is that there is a <em>very</em> large number of unique actors / <a href="https://docs.honeycomb.io/get-started/basics/observability/concepts/high-cardinality/">high cardinality</a> — more unique actors than data points themselves — which leads to <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a> issues, and workarounds such as encoding only the top <em>N</em> actors leave the feature uninformative, since even a generous <em>N</em> will fail to capture the majority of actors.</p>
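<p>For reference, the sparse encoding approach described above looks something like this toy example (not code from the assignment):</p>
<pre><code class="language-py3">from sklearn.preprocessing import MultiLabelBinarizer

# each row is the list of actors for one movie
movie_actors = [
    ["Mark Hamill", "Harrison Ford", "Carrie Fisher"],
    ["Harrison Ford", "Karen Allen"],
]

mlb = MultiLabelBinarizer(sparse_output=True)
X_actors = mlb.fit_transform(movie_actors)  # shape: (n_movies, n_unique_actors)
</code></pre>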
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/actor_cum_dist_hu_6b3839329e455b7d.webp 320w,/2025/06/movie-embeddings/actor_cum_dist_hu_b3985aca3321429a.webp 768w,/2025/06/movie-embeddings/actor_cum_dist_hu_27acda9c003abad5.webp 1024w,/2025/06/movie-embeddings/actor_cum_dist.png 1500w" src="actor_cum_dist.png"
         alt="There are actually 624k unique actors in this dataset (Jupyter Notebook), the chart just becomes hard to read at that point."/> <figcaption>
            <p>There are actually 624k unique actors in this dataset (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/actor_agg.ipynb">Jupyter Notebook</a>), the chart just becomes hard to read at that point.</p>
        </figcaption>
</figure>

<p>Additionally, most statistical modeling approaches cannot account for the <code>ordering</code> of actors, as they treat each feature as independent; since the billing order of actors is generally correlated with their importance in the movie, that omits information relevant to the problem.</p>
<p>These constraints gave me an idea: why not use an LLM to encode <em>all</em> movie data, and build a model using the downstream embedding representation? LLMs have <a href="https://en.wikipedia.org/wiki/Attention_%28machine_learning%29">attention mechanisms</a>, which will not only respect the relative ordering of actors (to give higher predictive priority to higher-billed actors, along with actor cooccurrences), but also identify patterns within movie name texts (to identify sequels and related media semantically).</p>
<p>I started by aggregating and denormalizing all the data locally (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/imdb_polars_etl_test.ipynb">Jupyter Notebook</a>). Each of the IMDb datasets is hundreds of megabytes and hundreds of thousands of rows at minimum: not quite <a href="https://en.wikipedia.org/wiki/Big_data">big data</a>, but enough to warrant being more cognizant of tooling, especially since computationally intensive JOINs are required. Therefore, I used the <a href="https://pola.rs/">Polars</a> library in Python, which not only loads data super fast, but is also one of the <a href="https://duckdblabs.github.io/db-benchmark/">fastest libraries at performing JOINs</a> and other aggregation tasks. Polars&rsquo;s syntax also allows for some cool tricks: for example, I want to split and aggregate the principals (4.1 million rows after prefiltering) for each movie into nested lists of directors, writers, producers, actors, and all other principals, while simultaneously having each list sorted by <code>ordering</code> as noted above. This is much easier to do in Polars than in any other data processing library I&rsquo;ve used, and on millions of rows, this takes <em>less than a second</em>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">df_principals_agg</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">df_principals</span><span class="o">.</span><span class="n">sort</span><span class="p">([</span><span class="s2">&#34;tconst&#34;</span><span class="p">,</span> <span class="s2">&#34;ordering&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">group_by</span><span class="p">(</span><span class="s2">&#34;tconst&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="o">.</span><span class="n">agg</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">director_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&#34;director&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">writer_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&#34;writer&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">producer_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&#34;producer&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">actor_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">is_in</span><span class="p">([</span><span class="s2">&#34;actor&#34;</span><span class="p">,</span> <span class="s2">&#34;actress&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">principal_names</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;primaryName&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="o">~</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">is_in</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="p">[</span><span class="s2">&#34;director&#34;</span><span class="p">,</span> <span class="s2">&#34;writer&#34;</span><span class="p">,</span> <span class="s2">&#34;producer&#34;</span><span class="p">,</span> <span class="s2">&#34;actor&#34;</span><span class="p">,</span> <span class="s2">&#34;actress&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">principal_roles</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="o">~</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">&#34;category&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">is_in</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="p">[</span><span class="s2">&#34;director&#34;</span><span class="p">,</span> <span class="s2">&#34;writer&#34;</span><span class="p">,</span> <span class="s2">&#34;producer&#34;</span><span class="p">,</span> <span class="s2">&#34;actor&#34;</span><span class="p">,</span> <span class="s2">&#34;actress&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><p>After some cleanup and field renaming, here&rsquo;s an example JSON document for <a href="https://www.imdb.com/title/tt0076759/">Star Wars: Episode IV - A New Hope</a>:</p>
<!-- prettier-ignore-start -->
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;title&#34;</span><span class="p">:</span> <span class="s2">&#34;Star Wars: Episode IV - A New Hope&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;genres&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Action&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Adventure&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Fantasy&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;is_adult&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;release_year&#34;</span><span class="p">:</span> <span class="mi">1977</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;runtime_minutes&#34;</span><span class="p">:</span> <span class="mi">121</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;directors&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;George Lucas&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;writers&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;George Lucas&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;producers&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Gary Kurtz&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Rick McCallum&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;actors&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Mark Hamill&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Harrison Ford&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Carrie Fisher&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Alec Guinness&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Peter Cushing&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Anthony Daniels&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Kenny Baker&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Peter Mayhew&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;David Prowse&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;Phil Brown&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;principals&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;John Williams&#34;</span><span class="p">:</span> <span class="s2">&#34;composer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Gilbert Taylor&#34;</span><span class="p">:</span> <span class="s2">&#34;cinematographer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Richard Chew&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;T.M. Christopher&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Paul Hirsch&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Marcia Lucas&#34;</span><span class="p">:</span> <span class="s2">&#34;editor&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Dianne Crittenden&#34;</span><span class="p">:</span> <span class="s2">&#34;casting_director&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Irene Lamb&#34;</span><span class="p">:</span> <span class="s2">&#34;casting_director&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;Vic Ramos&#34;</span><span class="p">:</span> <span class="s2">&#34;casting_director&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;John Barry&#34;</span><span class="p">:</span> <span class="s2">&#34;production_designer&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><!-- prettier-ignore-end -->
<p>I was tempted to claim that I used zero feature engineering, but that wouldn&rsquo;t be accurate. The selection and ordering of the JSON fields here is itself feature engineering: for example, <code>actors</code> and <code>principals</code> are intentionally last in this JSON encoding because they can have wildly varying lengths while the prior fields are more consistent, which should make downstream encodings more comparable and consistent.</p>
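<p>The serialization itself is a one-liner; a sketch, assuming <code>movie_dict</code> is one of the cleaned-up records:</p>
<pre><code class="language-py3">import json

# dict insertion order is preserved, and indent=2 keeps the indentation
# for nested arrays that the embedding model will later see
doc = json.dumps(movie_dict, indent=2, ensure_ascii=False)
</code></pre>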
<p>Now, let&rsquo;s discuss how to convert these JSON representations of movies into embeddings.</p>
<h2 id="creating-and-visualizing-the-movie-embeddings">Creating And Visualizing the Movie Embeddings</h2>
<p>LLMs that are trained to output text embeddings are not much different from LLMs like <a href="https://chatgpt.com/">ChatGPT</a> that just predict the next token in a loop. Models such as BERT and GPT can generate &ldquo;embeddings&rdquo; out-of-the-box by skipping the prediction heads of the models and instead taking an encoded value from the last hidden state of the model (e.g. for BERT, the first positional vector of the hidden state representing the <code>[CLS]</code> token). However, text embedding models are further optimized for the distinctiveness of a given input text document using <a href="https://lilianweng.github.io/posts/2021-05-31-contrastive/">contrastive learning</a>. These embeddings can be used for many things, from finding similar encoded inputs by computing the similarity between embeddings, to, of course, building a statistical model on top of them.</p>
<p>Text embeddings that leverage LLMs are typically generated using a GPU in batches due to the increased amount of computation needed. Python libraries such as <a href="https://huggingface.co/">Hugging Face</a> <a href="https://huggingface.co/docs/transformers/en/index">transformers</a> and <a href="https://sbert.net/">sentence-transformers</a> can load these embedding models. For this experiment, I used the very new <a href="https://huggingface.co/Alibaba-NLP/gte-modernbert-base">Alibaba-NLP/gte-modernbert-base</a> text embedding model, which is finetuned from the <a href="https://huggingface.co/answerdotai/ModernBERT-base">ModernBERT model</a> specifically for the embedding use case, for two reasons: it uses the ModernBERT architecture, which is <a href="https://huggingface.co/blog/modernbert">optimized for fast inference</a>, and the base ModernBERT model is trained to be more code-aware and should be able to understand JSON-nested input strings more robustly — that&rsquo;s also why I intentionally left in the indentation for nested JSON arrays, as it&rsquo;s semantically meaningful and <a href="https://huggingface.co/answerdotai/ModernBERT-base/blob/main/tokenizer_config.json">explicitly tokenized</a>. <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>The code (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/generate_imdb_embeddings.ipynb">Jupyter Notebook</a>) — with extra considerations to avoid running out of memory on either the CPU or GPU <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> — looks something like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="n">device</span> <span class="o">=</span> <span class="s2">&#34;cuda:0&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">dataloader</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">DataLoader</span><span class="p">(</span><span class="n">docs</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                         <span class="n">shuffle</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                         <span class="n">pin_memory</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                         <span class="n">pin_memory_device</span><span class="o">=</span><span class="n">device</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">dataset_embeddings</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">dataloader</span><span class="p">,</span> <span class="n">smoothing</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">tokenized_batch</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">batch</span><span class="p">,</span> <span class="n">max_length</span><span class="o">=</span><span class="mi">8192</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s2">&#34;pt&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">        <span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">tokenized_batch</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">embeddings</span> <span class="o">=</span> <span class="n">outputs</span><span class="o">.</span><span class="n">last_hidden_state</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">detach</span><span class="p">()</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">dataset_embeddings</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">dataset_embeddings</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="n">dataset_embeddings</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">dataset_embeddings</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">dataset_embeddings</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/featured_hu_be15fd7c96cd6da2.webp 320w,/2025/06/movie-embeddings/featured_hu_a1d4e8d783c0419.webp 768w,/2025/06/movie-embeddings/featured_hu_1aa1372a6affcdc5.webp 1024w,/2025/06/movie-embeddings/featured.webp 1318w" src="featured.webp"/> 
</figure>

<p>I used a Spot <a href="https://cloud.google.com/blog/products/compute/introducing-g2-vms-with-nvidia-l4-gpus">L4 GPU</a> on <a href="https://cloud.google.com/">Google Cloud Platform</a> at a price of $0.28/hour, and it took 21 minutes to encode all 242k movie embeddings: about $0.10 total, which is surprisingly efficient.</p>
<p>Each of these embeddings is a set of 768 numbers (768D). If the embeddings are unit normalized (the <code>F.normalize()</code> step), then calculating the dot product between embeddings will return the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> of those movies, which can then be used to identify the most similar movies. But &ldquo;similar&rdquo; is open-ended, as there are many dimensions along which a movie could be considered similar.</p>
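<p>Concretely, with the unit-normalized <code>dataset_embeddings</code> tensor from the previous section, ranking every movie against a query movie takes only a few lines (a sketch, where <code>query_idx</code> is the row index of the query movie):</p>
<pre><code class="language-py3">import torch

query = dataset_embeddings[query_idx]   # (768,) embedding of the query movie
cossims = dataset_embeddings @ query    # dot products == cosine similarities for unit vectors
top_sims, top_idxs = torch.topk(cossims, k=10)  # the 10 most similar movies (including itself)
</code></pre>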
<p>Let&rsquo;s try a few movie similarity test cases where I calculate the cosine similarity between one query movie and <em>all</em> movies, then sort by cosine similarity to find the most similar (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/movie_embeddings_similarity.ipynb">Jupyter Notebook</a>). How about Peter Jackson&rsquo;s <a href="https://www.imdb.com/title/tt0120737/">The Lord of the Rings: The Fellowship of the Ring</a>? Ideally, it would surface not only the two other movies of the original trilogy, but also its prequel Hobbit trilogy.</p>
<table>
  <thead>
      <tr>
          <th>title</th>
          <th>cossim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0120737/">The Lord of the Rings: The Fellowship of the Ring (2001)</a></td>
          <td>1.0</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0167261/">The Lord of the Rings: The Two Towers (2002)</a></td>
          <td>0.922</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0167260/">The Lord of the Rings: The Return of the King (2003)</a></td>
          <td>0.92</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt10127200/">National Geographic: Beyond the Movie - The Lord of the Rings: The Fellowship of the Ring (2001)</a></td>
          <td>0.915</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0301246/">A Passage to Middle-earth: The Making of &lsquo;Lord of the Rings&rsquo; (2001)</a></td>
          <td>0.915</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0299105/">Quest for the Ring (2001)</a></td>
          <td>0.906</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0077869/">The Lord of the Rings (1978)</a></td>
          <td>0.893</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2310332/">The Hobbit: The Battle of the Five Armies (2014)</a></td>
          <td>0.891</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1170358/">The Hobbit: The Desolation of Smaug (2013)</a></td>
          <td>0.883</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0903624/">The Hobbit: An Unexpected Journey (2012)</a></td>
          <td>0.883</td>
      </tr>
  </tbody>
</table>
<p>Indeed, it worked and surfaced both trilogies! The other movies listed are about the original work, so having high similarity would be fair.</p>
<p>Compare these results to the &ldquo;<a href="https://help.imdb.com/article/imdb/discover-watch/what-is-the-more-like-this-section/GPE7SPGZREKKY7YN">More like this</a>&rdquo; section on the IMDb page for the movie itself, which has the two sequels to the original Lord of the Rings and two other suggestions that I am not entirely sure are actually related.</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/lotr_related_hu_7560f67c8d88cb97.webp 320w,/2025/06/movie-embeddings/lotr_related_hu_544b4f2cf95b01dd.webp 768w,/2025/06/movie-embeddings/lotr_related_hu_8c4f2099751f082.webp 1024w,/2025/06/movie-embeddings/lotr_related.webp 1354w" src="lotr_related.webp"/> 
</figure>

<p>What about more elaborate franchises, such as the <a href="https://en.wikipedia.org/wiki/Marvel_Cinematic_Universe">Marvel Cinematic Universe</a>? If you asked for movies similar to <a href="https://www.imdb.com/title/tt4154796/">Avengers: Endgame</a>, would other MCU films be the most similar?</p>
<table>
  <thead>
      <tr>
          <th>title</th>
          <th>cossim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4154796/">Avengers: Endgame (2019)</a></td>
          <td>1.0</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4154756/">Avengers: Infinity War (2018)</a></td>
          <td>0.909</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0848228/">The Avengers (2012)</a></td>
          <td>0.896</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1217616/">Endgame (2009)</a></td>
          <td>0.894</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4154664/">Captain Marvel (2019)</a></td>
          <td>0.89</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2395427/">Avengers: Age of Ultron (2015)</a></td>
          <td>0.882</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt3498820/">Captain America: Civil War (2016)</a></td>
          <td>0.882</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0292502/">Endgame (2001)</a></td>
          <td>0.881</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0118661/">The Avengers (1998)</a></td>
          <td>0.877</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1228705/">Iron Man 2 (2010)</a></td>
          <td>0.876</td>
      </tr>
  </tbody>
</table>
<p>The answer is yes, which isn&rsquo;t a surprise since those movies share many principals. However, there are instances of other movies named &ldquo;Endgame&rdquo; and &ldquo;The Avengers&rdquo; that are completely unrelated to Marvel, which implies that the similarities may be fixating on the names.</p>
<p>What about movies of a smaller franchise but a specific domain, such as Disney&rsquo;s <a href="https://www.imdb.com/title/tt2294629/">Frozen</a>, which only has one sequel? Would it surface other 3D animated movies by <a href="https://en.wikipedia.org/wiki/Walt_Disney_Animation_Studios">Walt Disney Animation Studios</a>, or something else?</p>
<table>
  <thead>
      <tr>
          <th>title</th>
          <th>cossim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2294629/">Frozen (2013)</a></td>
          <td>1.0</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4520988/">Frozen II (2019)</a></td>
          <td>0.93</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1323045/">Frozen (2010)</a></td>
          <td>0.92</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1611845/">Frozen (2010)</a> [a different one]</td>
          <td>0.917</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0125279/">Frozen (1996)</a></td>
          <td>0.909</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt0376606/">Frozen (2005)</a></td>
          <td>0.9</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt2363439/">The Frozen (2012)</a></td>
          <td>0.898</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4007494/">The Story of Frozen: Making a Disney Animated Classic (2014)</a></td>
          <td>0.894</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt1071798/">Frozen (2007)</a></td>
          <td>0.889</td>
      </tr>
      <tr>
          <td><a href="https://www.imdb.com/title/tt4150316/">Frozen in Time (2014)</a></td>
          <td>0.888</td>
      </tr>
  </tbody>
</table>
<p>&hellip;okay, it&rsquo;s definitely fixating on the name. Let&rsquo;s try a different approach to see if we can find more meaningful patterns in these embeddings.</p>
<p>In order to visualize the embeddings, we can project them to a lower dimensionality with a dimensionality reduction algorithm such as <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> or <a href="https://umap-learn.readthedocs.io/en/latest/">UMAP</a>: UMAP is preferred as it can simultaneously reorganize the data into more meaningful clusters. UMAP&rsquo;s <a href="https://umap-learn.readthedocs.io/en/latest/how_umap_works.html">construction of a neighborhood graph</a>, in theory, can allow the reduction to refine the similarities by leveraging many possible connections and hopefully avoid fixating on the movie name. However, with this amount of input data and the relatively high initial 768D vector size, the computational cost of UMAP is a concern, as both factors cause the UMAP training time to scale dramatically. Fortunately, NVIDIA&rsquo;s <a href="https://github.com/rapidsai/cuml">cuML library</a> was recently <a href="https://github.com/rapidsai/cuml/releases/tag/v25.04.00">updated</a> such that you can now run UMAP on a GPU with very large amounts of data for a very high number of epochs to ensure the reduction fully converges, so I did just that (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/imdb_embeddings_umap_to_2D.ipynb">Jupyter Notebook</a>).</p>
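<p>cuML&rsquo;s UMAP class is close to a drop-in replacement for the CPU-based <code>umap-learn</code> implementation; a minimal sketch (the hyperparameters here are illustrative, not the exact values used):</p>
<pre><code class="language-py3">from cuml.manifold import UMAP  # GPU-accelerated UMAP implementation

# dataset_embeddings: the (n_movies, 768) unit-normalized embedding matrix from earlier
reducer = UMAP(n_components=2, n_neighbors=15, n_epochs=5000)
embeddings_2d = reducer.fit_transform(dataset_embeddings.numpy())  # (n_movies, 2) for plotting
</code></pre>
<p>What patterns can we find? Let&rsquo;s try plotting the reduced points, colored by their user rating.</p>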
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/imdb_umap_rating_hu_4047e53667cc289a.webp 320w,/2025/06/movie-embeddings/imdb_umap_rating_hu_74d5c85f14c8950c.webp 768w,/2025/06/movie-embeddings/imdb_umap_rating_hu_2b6ccdbb5b4b9105.webp 1024w,/2025/06/movie-embeddings/imdb_umap_rating.webp 1200w" src="imdb_umap_rating.webp"/> 
</figure>

<p>So there are a few things going on here. Indeed, most of the points are high-rating green, as is evident in the source data. But the points and ratings aren&rsquo;t <em>random</em>, and there are trends. In the center giga cluster, there are soft subclusters of movies at high ratings and low ratings. Smaller discrete clusters did indeed form, but what is the deal with that extremely isolated cluster at the top? After investigation, that cluster only contains movies released in 2008: release year is another feature I should have considered when defining movie similarity.</p>
<p>As a sanity check, I faceted out the points by movie release year to better visualize where these clusters are forming:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/imdb_umap_rating_year_hu_40c4d6844e346f92.webp 320w,/2025/06/movie-embeddings/imdb_umap_rating_year_hu_48d37fbda72976cc.webp 768w,/2025/06/movie-embeddings/imdb_umap_rating_year_hu_27485860dc95d177.webp 1024w,/2025/06/movie-embeddings/imdb_umap_rating_year.webp 1200w" src="imdb_umap_rating_year.webp"/> 
</figure>

<p>This shows that even within the clusters, movies have their rating values spread out, but I also unintentionally visualized how <a href="https://arize.com/docs/ax/machine-learning/computer-vision/how-to-cv/embedding-drift">embedding drift</a> changes over time. 2024 is another bizarrely-clustered year: I have no idea why those two years specifically are weird for movies.</p>
<p>The UMAP approach is more for fun, since it&rsquo;s better for the downstream model building to use the raw 768D vector and have it learn the features from that. At the least, there&rsquo;s <em>some</em> semantic signal preserved in these embeddings, which makes me optimistic that these embeddings alone can be used to train a viable movie rating predictor.</p>
<h2 id="predicting-average-imdb-movie-scores">Predicting Average IMDb Movie Scores</h2>
<p>So, we now have hundreds of thousands of 768D embeddings. How do we get them to predict movie ratings? What many don&rsquo;t know is that all methods of traditional statistical modeling also work with embeddings — assumptions such as feature independence are invalid so the results aren&rsquo;t explainable, but you can still get a valid predictive model.</p>
<p>First, we will shuffle and split the dataset into a training set and a test set: for the test set, I chose 20,000 movies (roughly 10% of the data), which is more than enough for stable results. To decide on the best model, we will use the one that minimizes the <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean squared error</a> (MSE) on the test set, which is a standard approach to solving regression problems that predict a single numeric value.</p>
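<p>A sketch of that split and of the evaluation metric (the variable names here are illustrative):</p>
<pre><code class="language-py3">import numpy as np
from sklearn.model_selection import train_test_split

# X: the (n_movies, 768) embedding matrix, y: the average ratings
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=20_000, shuffle=True, random_state=42
)

def test_mse(model):
    """Mean squared error of a fitted regression model on the held-out test set."""
    return float(np.mean((model.predict(X_test) - y_test) ** 2))
</code></pre>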
<p>Here are three approaches for using LLMs for solving non-next-token-prediction tasks.</p>
<h3 id="method-1-traditional-modeling-w-gpu-acceleration">Method #1: Traditional Modeling (w/ GPU Acceleration!)</h3>
<p>You can still fit a linear regression on top of the embeddings even if the feature coefficients are completely useless, and it serves as a decent baseline (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/cuml_grid_search.ipynb">Jupyter Notebook</a>). The absolute laziest &ldquo;model&rdquo;, where we just use the mean of the training set for every prediction, results in a test MSE of <strong>1.637</strong>, but performing a simple linear regression on top of the 768D embeddings instead results in a more reasonable test MSE of <strong>1.187</strong>. We should be able to beat that handily with a more advanced model.</p>
<p>Data scientists familiar with scikit-learn know there&rsquo;s a rabbit hole of model options, but most of them are CPU-bound and single-threaded and would take a considerable amount of time on a dataset of this size. That&rsquo;s where cuML—the same library I used to create the UMAP projection—comes in, as cuML has <a href="https://docs.rapids.ai/api/cuml/stable/api/#regression-and-classification">GPU-native implementations</a> of most popular scikit-learn models with a similar API. This notably includes <a href="https://en.wikipedia.org/wiki/Support_vector_machine">support vector machines</a>, which play especially nicely with embeddings. And because we have the extra compute, we can also perform a brute-force hyperparameter <a href="https://www.dremio.com/wiki/grid-search/">grid search</a> to find the best parameters for fitting each model.</p>
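<p>A rough sketch of what that grid search looks like for one model type (cuML&rsquo;s SVR), reusing the train/test split from above; the grid values here are illustrative, not the exact ones searched:</p>
<pre><code class="language-py3">from itertools import product

from cuml.svm import SVR  # GPU-native support vector regression

best_mse, best_params = float("inf"), None
for C, epsilon in product([0.1, 1.0, 10.0], [0.01, 0.1, 0.2]):
    svr = SVR(C=C, epsilon=epsilon, kernel="rbf").fit(X_train, y_train)
    mse = float(((svr.predict(X_test) - y_test) ** 2).mean())
    if mse &lt; best_mse:
        best_mse, best_params = mse, {"C": C, "epsilon": epsilon}
</code></pre>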
<p>Here are the MSE results on the test dataset for a few of these new model types, with the hyperparameter combination for each model type that best minimizes MSE:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/model_comparison_base_hu_2e224af8e7736cd2.webp 320w,/2025/06/movie-embeddings/model_comparison_base_hu_ea8ec94f59331bc5.webp 768w,/2025/06/movie-embeddings/model_comparison_base_hu_536396210f6f6e7a.webp 1024w,/2025/06/movie-embeddings/model_comparison_base.png 1200w" src="model_comparison_base.png"/> 
</figure>

<p>The winner is the Support Vector Machine, with a test MSE of <strong>1.087</strong>! This is a good start for a simple approach that handily beats the linear regression baseline, and it also beats the model trained in the Redditor&rsquo;s original notebook, which had a test MSE of 1.096 <sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup>. In all cases, the train set MSE was close to the test set MSE, which means the models did not overfit either.</p>
<h3 id="method-2-neural-network-on-top-of-embeddings">Method #2: Neural Network on top of Embeddings</h3>
<p>Since we&rsquo;re already dealing with AI models and already have PyTorch installed to generate the embeddings, we might as well try the traditional approach of training a <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">multilayer perceptron</a> (MLP) neural network on top of the embeddings (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/pytorch_model_train_mlp.ipynb">Jupyter Notebook</a>). This workflow sounds much more complicated than just fitting a traditional model above, but PyTorch makes MLP construction straightforward, and Hugging Face&rsquo;s <a href="https://huggingface.co/docs/transformers/en/main_classes/trainer">Trainer class</a> incorporates best model training practices by default, although its <code>compute_loss</code> function has to be tweaked to minimize MSE specifically.</p>
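<p>One way to make that tweak is to subclass <code>Trainer</code> and override <code>compute_loss</code>. Here is a sketch (not the exact training code used), assuming each batch contains an <code>x</code> tensor of embeddings and a <code>targets</code> tensor of average ratings:</p>
<pre><code class="language-py3">import torch.nn.functional as F
from transformers import Trainer


class MSETrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # pop the target ratings and compute MSE against the model's scalar output
        targets = inputs.pop("targets")
        preds = model(**inputs).squeeze(-1)  # (batch_size,) predicted ratings
        loss = F.mse_loss(preds, targets)
        return (loss, preds) if return_outputs else loss
</code></pre>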
<p>The PyTorch model, using a loop to set up the MLP blocks, looks something like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">RatingsModel</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">linear_dims</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span> <span class="n">num_layers</span><span class="o">=</span><span class="mi">6</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">dims</span> <span class="o">=</span> <span class="p">[</span><span class="mi">768</span><span class="p">]</span> <span class="o">+</span> <span class="p">[</span><span class="n">linear_dims</span><span class="p">]</span> <span class="o">*</span> <span class="n">num_layers</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">mlp</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">ModuleList</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">            <span class="n">nn</span><span class="o">.</span><span class="n">Sequential</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">dims</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">dims</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]),</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">GELU</span><span class="p">(),</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">BatchNorm1d</span><span class="p">(</span><span class="n">dims</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]),</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.6</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">dims</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">])</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">dims</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">targets</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">mlp</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">x</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">output</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>  <span class="c1"># return 1D output if batched inputs</span>
</span></span></code></pre></div><p>This MLP is 529k parameters total: large for a MLP, but given the 222k row input dataset, it&rsquo;s not egregiously so.</p>
<p>The real difficulty with this MLP approach is that it&rsquo;s <em>too effective</em>: even with less than 1 million parameters, the model overfits dramatically, quickly converging to a 0.00 train MSE while the test set MSE explodes. That&rsquo;s why <code>Dropout</code> is set to the atypically high probability of <code>0.6</code>.</p>
<p>Fortunately, MLPs are fast to train: training for 600 epochs (total passes through the full training dataset) took about 17 minutes on the GPU. Here are the training results:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/training_mlp_hu_db4d2b769213c385.webp 320w,/2025/06/movie-embeddings/training_mlp_hu_99fc40ac0f82af11.webp 768w,/2025/06/movie-embeddings/training_mlp_hu_c64c2a10817470c0.webp 1024w,/2025/06/movie-embeddings/training_mlp.png 1200w" src="training_mlp.png"/> 
</figure>

<p>The lowest logged test MSE was <strong>1.074</strong>: a slight improvement over the Support Vector Machine approach.</p>
<h3 id="method-3-just-train-a-llm-from-scratch-dammit">Method #3: Just Train a LLM From Scratch Dammit</h3>
<p>There is a possibility that using a pretrained embedding model that was trained on the entire internet could intrinsically contain relevant signal about popular movies—such as movies winning awards which would imply a high IMDb rating—and that knowledge could leak into the test set and provide misleading results. This may not be a significant issue in practice, since that movie-specific knowledge is only a tiny slice of what went into <code>gte-modernbert-base</code>, and the model itself is too small to memorize exact information.</p>
<p>For the sake of comparison, let&rsquo;s try training a LLM from scratch on top of the raw movie JSON representations to see if we can get better results without the possibility of leakage (<a href="https://github.com/minimaxir/imdb-embeddings/blob/main/pytorch_model_train_llm.ipynb">Jupyter Notebook</a>). I had specifically avoided this approach because the compute required to train an LLM is much, much higher than for a SVM or MLP, and leveraging a pretrained model generally gives better results. In this case, since we don&rsquo;t need a LLM with all the knowledge of human existence, we can train a much smaller model that <em>only</em> knows how to work with the movie JSON representations and can figure out relationships between actors, and whether titles are sequels, on its own. Hugging Face transformers makes this workflow surprisingly straightforward: it not only has functionality to train your own custom tokenizer (in this case, shrinking from a 50k vocab to a 5k vocab) that encodes the data more efficiently, but also allows constructing a ModernBERT model with any number of layers and units. I opted for a 5M parameter LLM (SLM?), albeit with less dropout, since high dropout causes learning issues for LLMs specifically.</p>
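<p>A minimal sketch of that tokenizer-and-config setup (the checkpoint name, vocab size, and layer dimensions below are illustrative assumptions, not the exact configuration used) looks something like this:</p>
<pre tabindex="0"><code># Illustrative sketch: train a small tokenizer on the movie JSON strings and
# build a tiny ModernBERT from scratch. Dimensions/vocab size are placeholders.
from transformers import AutoTokenizer, ModernBertConfig, ModernBertModel

movie_jsons = ['{"title": "The Shawshank Redemption", "year": 1994}']  # placeholder corpus

base_tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
small_tokenizer = base_tokenizer.train_new_from_iterator(movie_jsons, vocab_size=5000)

config = ModernBertConfig(
    vocab_size=small_tokenizer.vocab_size,
    pad_token_id=small_tokenizer.pad_token_id,  # keep token ids consistent with the new vocab
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=512,
)
model = ModernBertModel(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
</code></pre>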
<p>The actual PyTorch model code is surprisingly more concise than the MLP approach:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py3" data-lang="py3"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">RatingsModel</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">transformer_model</span> <span class="o">=</span> <span class="n">model</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">,</span> <span class="n">attention_mask</span><span class="p">,</span> <span class="n">targets</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">transformer_model</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">input_ids</span><span class="o">=</span><span class="n">input_ids</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">attention_mask</span><span class="o">=</span><span class="n">attention_mask</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">output_hidden_states</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">last_hidden_state</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>  <span class="c1"># the &#34;[CLS] vector&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">output</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>  <span class="c1"># return 1D output if batched inputs</span>
</span></span></code></pre></div><p>Essentially, the model trains its own &ldquo;text embedding,&rdquo; although in this case instead of an embedding optimized for textual similarity, the embedding is just a representation that can easily be translated into a numeric rating.</p>
<p>Because the computation needed for training a LLM from scratch is much higher, I only trained the model for 10 epochs, which still took twice as long as the 600 epochs for the MLP approach. Given that, the results are surprising:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/training_llm_hu_2355de410bfc61c1.webp 320w,/2025/06/movie-embeddings/training_llm_hu_cfcd114ac3c12003.webp 768w,/2025/06/movie-embeddings/training_llm_hu_f6c75fc2deeead45.webp 1024w,/2025/06/movie-embeddings/training_llm.png 1200w" src="training_llm.png"/> 
</figure>

<p>The LLM approach did much better than my previous attempts, hitting a new lowest test MSE of <strong>1.026</strong> after only 4 passes through the data! After that, it definitely overfit. I tried other, smaller configurations for the LLM to avoid the overfitting, but none of them ever hit a test MSE that low.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Let&rsquo;s look at the model comparison again, this time adding the results from training a MLP and training a LLM from scratch:</p>
<figure>

    <img loading="lazy" srcset="/2025/06/movie-embeddings/model_comparison_all_hu_2309fb0cea20f0c.webp 320w,/2025/06/movie-embeddings/model_comparison_all_hu_34af566430bbc603.webp 768w,/2025/06/movie-embeddings/model_comparison_all_hu_1e1d9cf8cdfde789.webp 1024w,/2025/06/movie-embeddings/model_comparison_all.png 1200w" src="model_comparison_all.png"/> 
</figure>

<p>Coming into this post, I genuinely thought that training the MLP on top of embeddings would have been the winner given the base embedding model&rsquo;s knowledge of everything, but maybe there&rsquo;s something to just YOLOing and feeding raw JSON input data to a completely new LLM. More research and development is needed.</p>
<p>The differences in model performance across these approaches aren&rsquo;t dramatic, though the iteration itself was interesting, and a dramatic improvement was a long shot anyways given the scarce amount of metadata. The fact that building a model off of text embeddings alone didn&rsquo;t result in a perfect model doesn&rsquo;t mean this approach was a waste of time. The embedding and modeling pipelines I constructed in the process of trying to solve this problem have already paid significant dividends on easier problems, such as identifying the efficiency of <a href="https://minimaxir.com/2025/02/embeddings-parquet/">storing embeddings in Parquet and manipulating them with Polars</a>.</p>
<p>It&rsquo;s impossible and pointless to pinpoint the exact reason the original Reddit poster got rejected: it could have been the neural network approach or even something out of their control such as the original company actually stopping hiring and being too disorganized to tell the candidate. To be clear, if I myself were to apply for a data science role, I wouldn&rsquo;t use the techniques in this blog post (that UMAP data visualization would get me instantly rejected!) and would instead do more traditional EDA and non-neural-network modeling to showcase my data science knowledge to the hiring manager. But for my professional work, I will definitely try starting any modeling exploration with an embeddings-based approach wherever possible: at the absolute worst, it&rsquo;s a very strong baseline that will be hard to beat.</p>
<p><em>All of the Jupyter Notebooks and data visualization code for this blog post are available open-source in <a href="https://github.com/minimaxir/imdb-embeddings/">this GitHub repository</a>.</em></p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>I am not a fan of using GBT variable importance as a decision-making metric: variable importance does not tell you magnitude or <em>direction</em> of the feature in the real world, but it does help identify which features can be pruned for model development iteration.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>To get a sense of how old they are, they are only available as <a href="https://en.wikipedia.org/wiki/Tab-separated_values">TSV files</a>, which is a data format so old and prone to errors that many data libraries have dropped explicit support for it. Amazon, please release the datasets as CSV or Parquet files instead!&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>Two other useful features of <code>gte-modernbert-base</code> that are not strictly relevant to these movie embeddings: a) it&rsquo;s a cased model, so it can identify meaning from upper-case text, and b) it does not require a prefix such as <code>search_query</code> and <code>search_document</code>, as <a href="https://huggingface.co/nomic-ai/nomic-embed-text-v1.5">nomic-embed-text-v1.5 does</a>, to guide its results, which is an annoying requirement for those models.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>The trick here is the <code>detach()</code> function for the computed embeddings, otherwise the GPU doesn&rsquo;t free up the memory once moved back to the CPU. I may or may not have discovered that the hard way.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>As noted earlier, minimizing MSE isn&rsquo;t a competition, but the comparison on roughly the same dataset is good for a sanity check.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Visualizing Airline Flight Characteristics Between SFO and JFK</title>
      <link>https://minimaxir.com/2019/10/sfo-jfk-flights/</link>
      <pubDate>Wed, 23 Oct 2019 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2019/10/sfo-jfk-flights/</guid>
      <description>Box plots, when used correctly, can be a very fun way to visualize big data.</description>
      <content:encoded><![CDATA[<p>In March, <a href="https://cloud.google.com">Google Compute Platform</a> developer advocate <a href="https://twitter.com/felipehoffa">Felipe Hoffa</a> made a tweet about airline flight data from San Francisco International Airport (SFO) to Seattle-Tacoma International Airport (SEA):</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/felipehoffa/status/1111050585120206848"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>Particularly, his visualization of total elapsed times by airline caught my eye.</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu_33d3683c2d4a611e.webp 320w,/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu_1c609cadbe91671c.webp 768w,/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu_3135cb9a9bbaf839.webp 1024w,/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD.jpeg 1200w" src="D2s9oFtX4AEK6nD.jpeg"/> 
</figure>

<p>The overall time for flights from SFO to SEA goes up drastically starting in 2015, and this increase occurs across multiple airlines, implying that it&rsquo;s not an airline-specific problem. But what could intuitively cause that?</p>
<p>U.S. domestic airline data is <a href="https://www.transtats.bts.gov/Tables.asp?DB_ID=120">freely distributed</a> by the United States Department of Transportation. Normally it&rsquo;s a pain to work with as it&rsquo;s very large with millions of rows, but BigQuery makes playing with such data relatively easy, fun, and free. What other interesting factoids can be found?</p>
<h2 id="expanding-on-sfo--sea">Expanding on SFO → SEA</h2>
<p><a href="https://cloud.google.com/bigquery/">BigQuery</a> is a big data warehousing tool that allows you to query massive amounts of data. The table Hoffa created from the airline data (<code>fh-bigquery.flights.ontime_201903</code>) is 83.37 GB and 184 <em>million</em> rows. You can query 1 TB of data from it for free, but since BQ will only query against the fields you request, the queries in this post only consume about 2 GB each, allowing you to run them well within that quota.</p>
<p>Hoffa&rsquo;s query that runs on BigQuery looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="p">,</span><span class="w"> </span><span class="n">Reporting_Airline</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">ActualElapsedTime</span><span class="p">)</span><span class="w"> </span><span class="n">ActualElapsedTime</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">TaxiOut</span><span class="p">)</span><span class="w"> </span><span class="n">TaxiOut</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">TaxiIn</span><span class="p">)</span><span class="w"> </span><span class="n">TaxiIn</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">AirTime</span><span class="p">)</span><span class="w"> </span><span class="n">AirTime</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">c</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">flights</span><span class="p">.</span><span class="n">ontime_201903</span><span class="o">`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">Origin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SFO&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">Dest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SEA&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2010-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">DESC</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="w"> </span><span class="k">DESC</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">LIMIT</span><span class="w"> </span><span class="mi">1000</span><span class="w">
</span></span></span></code></pre></div><p>For each year and airline after 2010, the query calculates the average metrics specified for flights on the SFO → SEA route.</p>
<p>I made a few query and data visualization tweaks to what Hoffa did above, and here&rsquo;s the result showing the increase in elapsed airline flight time over time for that route:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_e232d6eeab7fb66.webp 320w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_948de6a062caeaca.webp 768w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_6ae123a09b30ff70.webp 1024w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration.png 1800w" src="sfo_sea_flight_duration.png"/> 
</figure>

<p>Let&rsquo;s explain what&rsquo;s going on here.</p>
<p>A common recommendation in statistics is to avoid using <a href="https://en.wikipedia.org/wiki/Average">averages</a> as a summary statistic whenever possible, as averages can be overly affected by strong outliers (and with airline flights, there are definitely strong outliers!). The solution is to use a <a href="https://en.wikipedia.org/wiki/Median">median</a> instead, but there&rsquo;s one problem: medians are hard and <a href="https://www.periscopedata.com/blog/medians-in-sql">computationally complex</a> to calculate compared to simple averages. Despite the rise of &ldquo;big data&rdquo;, most databases and BI tools don&rsquo;t have a <code>MEDIAN</code> function that&rsquo;s as easy to use as an <code>AVG</code> function. But BigQuery has an uncommon <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions#approx_quantiles">APPROX_QUANTILES</a> function, which calculates the specified number of quantiles; for example, if you call <code>APPROX_QUANTILES(ActualElapsedTime, 100)</code>, it will return an array of 101 quantile boundaries, where the median is the value at offset 50. BigQuery <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate-aggregation">uses</a> approximate aggregation algorithms (sketch-based tricks in the same spirit as <a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLog++</a>, which it uses for approximate distinct counts) to calculate these quantiles efficiently even with millions of data points. And since we get other quantiles like the 5th, 25th, 75th, and 95th for free with that approach, we can also visualize the <em>spread</em> of the data.</p>
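<p>As a quick standalone illustration of why the median matters here (in Python, just for the arithmetic; the numbers are made up):</p>
<pre tabindex="0"><code># Made-up elapsed times (minutes) for six flights, one of them badly delayed.
import numpy as np

times = np.array([130, 135, 140, 145, 150, 600])

print(times.mean())      # ~216.7 -- dragged way up by the single outlier
print(np.median(times))  # 142.5  -- much closer to a "typical" flight
</code></pre>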
<p>We can aggregate the data by month for more granular trends and calculate the <code>APPROX_QUANTILES</code> in a subquery so it only has to be computed once. Hoffa also uploaded a more recent table (<code>fh-bigquery.flights.ontime_201908</code>) with a few additional months of data. To keep things simple, we&rsquo;ll skip aggregating by airline since the metrics do not vary strongly between airlines. The final query ends up looking like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="n">standardSQL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="p">,</span><span class="w"> </span><span class="n">num_flights</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">5</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_5</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">25</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_25</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">50</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_50</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">75</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_75</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">95</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_95</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">num_flights</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">APPROX_QUANTILES</span><span class="p">(</span><span class="n">ActualElapsedTime</span><span class="p">,</span><span class="w"> </span><span class="mi">100</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">time_q</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">flights</span><span class="p">.</span><span class="n">ontime_201908</span><span class="o">`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">Origin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SFO&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">Dest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SEA&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2010-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="w">
</span></span></span></code></pre></div><p>The resulting data table:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/table_hu_98a96a00ebd58c2c.webp 320w,/2019/10/sfo-jfk-flights/table_hu_9eddda8c57624a2.webp 768w,/2019/10/sfo-jfk-flights/table.png 932w" src="table.png"/> 
</figure>

<p>In retrospect, since we&rsquo;re only focusing on one route, it isn&rsquo;t <em>big</em> data (this query only returns data on 64,356 flights total), but the same approach is still very useful if you need to analyze more of the airline data (the <code>APPROX_QUANTILES</code> function can handle <em>millions</em> of data points very quickly).</p>
<p>As a professional data scientist, one of my favorite types of data visualization is a <a href="https://en.wikipedia.org/wiki/Box_plot">box plot</a>, as it provides a way to visualize spread without being visually intrusive. Data visualization tools like <a href="https://www.r-project.org">R</a> and <a href="https://ggplot2.tidyverse.org/index.html">ggplot2</a> make constructing them <a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">very easy to do</a>.</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/geom_boxplot-1_hu_9a623aa679dafed1.webp 320w,/2019/10/sfo-jfk-flights/geom_boxplot-1_hu_67cf70ba510d1672.webp 768w,/2019/10/sfo-jfk-flights/geom_boxplot-1_hu_c405dbc443ae9fa8.webp 1024w,/2019/10/sfo-jfk-flights/geom_boxplot-1.png 1400w" src="geom_boxplot-1.png"/> 
</figure>

<p>By default, for each box representing a group, the thick line in the middle of the box is the median, the lower bound of the box is the 25th quantile and the upper bound is the 75th quantile. The whiskers are normally a function of the <a href="https://en.wikipedia.org/wiki/Interquartile_range">interquartile range</a> (IQR), but if there&rsquo;s enough data, I prefer to use the 5th and 95th quantiles instead.</p>
<p>If you feed ggplot2&rsquo;s <code>geom_boxplot()</code> with raw data, it will automatically calculate the corresponding metrics for visualization; however, with big data, the data may not fit into memory, and as noted earlier, medians and other quantiles are computationally expensive to calculate. Because we precomputed the quantiles with the query above for every year and month, we can use those explicitly. (The minor downside is that this will not include outliers.)</p>
<p>Additionally for box plots, I like to fill in each box with a different color corresponding to the year in order to better perceive data <a href="https://en.wikipedia.org/wiki/Seasonality">seasonality</a>. In the case of airline flights, seasonality is more literal: weather has an intuitive impact on flight times and delays, and during winter months there are also holidays which could affect airline logistics.</p>
<p>The resulting ggplot2 code looks like this:</p>
<pre tabindex="0"><code>plot &lt;-
  ggplot(df_tf,
         aes(
           x = date,
           ymin = q_5,
           lower = q_25,
           middle = q_50,
           upper = q_75,
           ymax = q_95,
           group = date,
           fill = year_factor
         )) +
  geom_boxplot(stat = &#34;identity&#34;, size = 0.3) +
  scale_fill_hue(l = 50, guide = F) +
  scale_x_date(date_breaks = &#39;1 year&#39;, date_labels = &#34;%Y&#34;) +
  scale_y_continuous(breaks = pretty_breaks(6)) +
  labs(
    title = &#34;Distribution of Flight Times of Flights From SFO → SEA, by Month&#34;,
    subtitle = &#34;via US DoT. Box bounds are 25th/75th percentiles, whiskers are 5th/95th percentiles.&#34;,
    y = &#39;Total Elapsed Flight Time (Minutes)&#39;,
    fill = &#39;&#39;,
    caption = &#39;Max Woolf — minimaxir.com&#39;
  ) +
  theme(axis.title.x = element_blank())

ggsave(&#39;sfo_sea_flight_duration.png&#39;,
       plot,
       width = 6,
       height = 4)
</code></pre><p>And behold (again)!</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_e232d6eeab7fb66.webp 320w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_948de6a062caeaca.webp 768w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_6ae123a09b30ff70.webp 1024w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration.png 1800w" src="sfo_sea_flight_duration.png"/> 
</figure>

<p>You can see that the boxes do indeed trend upward after 2016, although per-month medians are in flux. The spread is also increasing slowly over time. But what&rsquo;s interesting is the seasonality; pre-2016, the summer months (the &ldquo;middle&rdquo; of a given color) have a <em>very</em> significant drop in total time, which doesn&rsquo;t occur as strongly after 2016. Hmm.</p>
<h2 id="sfo-and-jfk">SFO and JFK</h2>
<p>Since I occasionally fly from San Francisco to New York City, it might be interesting (for completely selfish reasons) to track trends over time for flights between those areas. On the San Francisco side I choose SFO, and for the New York side I choose John F. Kennedy International Airport (JFK), as the data goes back very far for those routes specifically, and I only want to look at a single airport at a time (instead of including other NYC airports such as Newark Liberty International Airport [EWR] and LaGuardia Airport [LGA]) to limit potential data confounders.</p>
<p>Fortunately, the code and query changes are minimal: in the query, change the <code>Origin</code> and <code>Dest</code> values in the <code>WHERE</code> clause to the airports you want, and if you want to calculate metrics other than elapsed time, change the column passed to <code>APPROX_QUANTILES</code> accordingly.</p>
<p>Here&rsquo;s the chart of total elapsed time from SFO → JFK:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu_230bbe279f54a805.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu_c2e4a5d4b43ce24e.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu_2ea286d0e1e5d794.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration.png 1800w" src="sfo_jfk_flight_duration.png"/> 
</figure>

<p>And here&rsquo;s the reverse, from JFK → SFO:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu_4424fffe053981c8.webp 320w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu_ace5c5c4f6b82a9a.webp 768w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu_5d29021a8362404b.webp 1024w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration.png 1800w" src="jfk_sfo_flight_duration.png"/> 
</figure>

<p>Unlike the SFO → SEA charts, both charts are relatively flat over the years. However, when looking at seasonality, SFO → JFK dips in the summer and spikes during winter, while JFK → SFO <em>does the complete opposite</em>: dips during the winter and spikes during the summer, which is similar to the SFO → SEA route. I don&rsquo;t have any guesses what would cause that behavior.</p>
<p>How about flight speed (calculated via distance divided by air time)? Have new advances in airline technology made planes faster and/or more efficient?</p>
<p><figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu_9bbb991fb8674a3f.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu_d4b14a4133ff0b82.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu_7266f1a8d449775b.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed.png 1800w" src="sfo_jfk_flight_speed.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu_86e7c997338f1404.webp 320w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu_1680890adf0e2d82.webp 768w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu_942e26ae57610365.webp 1024w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed.png 1800w" src="jfk_sfo_flight_speed.png"/> 
</figure>
</p>
<p>The expected flight speed for a commercial airplane, <a href="https://en.wikipedia.org/wiki/Cruise_%28aeronautics%29">per Wikipedia</a>, is 547-575 mph, so the metrics from SFO pass the sanity check. The metrics from JFK indicate there&rsquo;s about a 20% drop in flight speed, potentially due to headwinds from flying westward against the prevailing jet stream, which makes sense. Month-to-month, the speed trends are inverse to the total elapsed time, which makes sense intuitively as they are strongly negatively correlated.</p>
<p>Lastly, what about flight departure delays? Are airlines becoming more efficient, or has increased demand caused more congestion?</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu_82c27db5d16562f9.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu_b017086eec0a8d63.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu_3a8b126a0bfc0d76.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay.png 1800w" src="sfo_jfk_departure_delay.png"/> 
</figure>

<p>Wait a second. In this case, massive 2-3 hour flight delays are frequent enough that even just the 95th percentile whisker skews the entire plot. Let&rsquo;s remove the whiskers in order to look at trends more clearly.</p>
<p><figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu_c2eb7d1ad6cdf7.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu_86b737333ad479f4.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu_fd6ad349f57f4bbe.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers.png 1800w" src="sfo_jfk_departure_delay_nowhiskers.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu_1fecf180ed6a5feb.webp 320w,/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu_626df458859e27b7.webp 768w,/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu_58e7e7ba605d269e.webp 1024w,/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers.png 1800w" src="jfk_sfo_departure_delay_nowhiskers.png"/> 
</figure>
</p>
<p>A negative delay implies the flight left early, so we can conclude that, on average, flights leave slightly earlier than the stated departure time. Even without the whiskers, we can see major spikes at the 75th percentile level during summer months, and said spikes were especially bad in 2017 in both directions.</p>
<p>These box plots are only an <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">exploratory data analysis</a>. Determining the <em>cause</em> of changes in these flight metrics is difficult even for experts (I am definitely not an expert!) and may not even be possible to determine from publicly-available data.</p>
<p>But there are still other fun things that can be done with the airline flight data, such as faceting airline trends by time and the inclusion of other airports, which is <a href="https://twitter.com/minimaxir/status/1115261670153048065"><em>interesting</em></a>.</p>
<hr>
<p><em>You can view the BigQuery queries used to get the data, plus the R and ggplot2 used to create the data visualizations, in <a href="http://minimaxir.com/notebooks/sfo-jfk-flights/">this R Notebook</a>. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/sfo-jfk-flights">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Problems with Predicting Post Performance on Reddit and Other Link Aggregators</title>
      <link>https://minimaxir.com/2018/09/modeling-link-aggregators/</link>
      <pubDate>Mon, 10 Sep 2018 09:15:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/09/modeling-link-aggregators/</guid>
      <description>The nature of algorithmic feeds like Reddit inherently leads to a survivorship bias: although users may recognize certain types of posts that appear on the front page, there are many more which follow the same patterns but fail.</description>
      <content:encoded><![CDATA[<p><a href="https://www.reddit.com">Reddit</a>, &ldquo;the front page of the internet&rdquo; is a link aggregator where anyone can submit links to cool happenings. Over the years, Reddit has expanded from just being a link aggregator, to allowing image and videos, and as of recently, hosting images and videos itself.</p>
<p>Reddit is broken down into subreddits, where each subreddit represents its own community around a particular interest, like <a href="https://www.reddit.com/r/aww">/r/aww</a> for pet photos and <a href="https://www.reddit.com/r/politics/">/r/politics</a> for U.S. politics. The posts on each subreddit are ranked by some function of both time elapsed since the submission was made, and the <em>score</em> of the submission as determined by upvotes and downvotes from other users.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_aww_hu_15514c9daececa75.webp 320w,/2018/09/modeling-link-aggregators/reddit_aww_hu_38fdc85d80e9f49f.webp 768w,/2018/09/modeling-link-aggregators/reddit_aww.png 827w" src="reddit_aww.png"/> 
</figure>

<p>There&rsquo;s also an intrinsic pride in having something you&rsquo;re responsible for providing to the community get lots of upvotes (the submitter also earns karma based on received upvotes, although karma is meaningless and doesn&rsquo;t provide any user benefits). But the reality is that even on the largest subreddits, submissions with 1 point (the default score for new submissions) are the most common, with some subreddits having <em>over half</em> of their submissions stuck at only 1 point.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_dist_facet_hu_94559d39f676be08.webp 320w,/2018/09/modeling-link-aggregators/reddit_dist_facet_hu_ede8ccaaf5538573.webp 768w,/2018/09/modeling-link-aggregators/reddit_dist_facet_hu_940890d5e65baccb.webp 1024w,/2018/09/modeling-link-aggregators/reddit_dist_facet.png 1800w" src="reddit_dist_facet.png"/> 
</figure>

<p>The exposure from having a submission go viral on Reddit (especially on larger subreddits) can be valuable, especially if it&rsquo;s your own original content. As a result, there has been a lot of <a href="https://www.brandwatch.com/blog/how-to-get-on-the-front-page-of-reddit/">analysis</a>/<a href="https://www.reddit.com/r/starterpacks/comments/8rkfk9/reddit_front_page_starter_pack/">stereotypes</a> about which techniques help a submission make it to the top of the front page. But almost all claims of &ldquo;cracking&rdquo; the Reddit algorithm are <a href="https://en.wikipedia.org/wiki/Post_hoc_ergo_propter_hoc"><em>post hoc</em> rationalizations</a>, attributing success to things like submission timing and title verbiage of a single submission after the fact. The nature of algorithmic feeds inherently leads to a <a href="https://en.wikipedia.org/wiki/Survivorship_bias">survivorship bias</a>: although users may recognize certain types of posts that appear on the front page, there are many more which follow the same patterns but fail, which makes modeling a successful post very tricky.</p>
<p>I&rsquo;ve touched on analyzing Reddit post performance <a href="https://minimaxir.com/2017/06/reddit-deep-learning/">before</a>, but let&rsquo;s give it another look and see if we can drill down on why Reddit posts do and do not do well.</p>
<h2 id="submission-timing">Submission Timing</h2>
<p>As with many US-based websites, the majority of Reddit users are most active during work hours (9 AM — 5 PM Eastern time weekdays). Most subreddits have submission patterns which fit accordingly.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_prop_hu_6063ab19aff16cb2.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_prop_hu_4354ae33b8600c6a.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_prop_hu_5818614336fda8df.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_prop.png 1800w" src="reddit_subreddit_prop.png"/> 
</figure>

<p>But what&rsquo;s interesting are the subreddits which <em>deviate</em> from that standard. Gaming subreddits (<a href="https://www.reddit.com/r/DestinyTheGame">/r/DestinyTheGame</a>, <a href="https://www.reddit.com/r/Overwatch">/r/Overwatch</a>) see a short burst of activity after a Tuesday game update/patch, game <em>communication</em> subreddits (<a href="https://www.reddit.com/r/Fireteams">/r/Fireteams</a>, <a href="https://www.reddit.com/r/RocketLeagueExchange">/r/RocketLeagueExchange</a>) are more active <em>outside</em> of work hours as they assume you are playing the game at the time, and Not-Safe-For-Work subreddits (/r/dirtykikpals, /r/gonewild) are incidentally less active during work hours and more active late-night than other subreddits.</p>
<p>Whenever you make a submission to Reddit, the submission appears in the subreddit&rsquo;s <code>/new</code> queue of the most recent submissions, where hopefully kind souls will find your submission and upvote it if it&rsquo;s good.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_new_hu_6650be6d73851b91.webp 320w,/2018/09/modeling-link-aggregators/reddit_new.png 762w" src="reddit_new.png"/> 
</figure>

<p>However, if it falls off the first page of the <code>/new</code> queue, your submission might be as good as dead. As a result, there&rsquo;s an element of game theory to timing your submission if you want it to not become another 1-point submission. Is it better to submit during peak hours when more users may see the submission before it falls off of <code>/new</code>? Is it better to submit <em>before</em> peak usage since there will be less competition, then continue the momentum once it hits the front page?</p>
<p>Here&rsquo;s a look at the median post performance at each given time slot for top subreddits:</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy_hu_cb9c5ba898252674.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy_hu_8ba4a17a13989a31.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy_hu_a08bfb9858ec4480.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy.png 1800w" src="reddit_subreddit_hr_doy.png"/> 
</figure>

<p>As the earlier distribution chart implied, the median score is around 1-2 for most subreddits, and that&rsquo;s consistent across all time slots. Some subreddits with higher medians like /r/me_irl do appear to have a <em>slight</em> benefit when posting before peak activity. When focusing on subreddits with high overall median scores, the difference is more explicit.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian_hu_2730023d99e9e0d9.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian_hu_78be513d900d66b5.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian_hu_da4a41445f75e1.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian.png 1800w" src="reddit_subreddit_highmedian.png"/> 
</figure>

<p>Subreddits like /r/PrequelMemes and /r/The_Donald <em>definitely</em> have better performance on average when posted before peak activity! Posting before peak usage <em>does</em> appear to be a viable strategy; however, for the majority of subreddits it doesn&rsquo;t make much of a difference.</p>
<h2 id="submission-titles">Submission Titles</h2>
<p>Each Reddit subreddit has their own vocabulary and topics of discussion. Let&rsquo;s break down text by subreddit by looking at the 75th percentile for score on posts containing a given two-word phrase:</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams_hu_5d8f080824cf057d.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams_hu_2870270c6078715e.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams_hu_9edc52c78d8fe6ca.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams.png 1800w" src="reddit_subreddit_topbigrams.png"/> 
</figure>

<p>The one trend consistent across all subreddits is the effectiveness of first-person pronouns (<em>I/my</em>) and original content (<em>fan art</em>). Other than that, the vocabulary and sentiment of successful posts are very specific to the subreddit and the culture it represents; there are no universal, guaranteed-success memes.</p>
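<p>For reference, this kind of phrase-level aggregation is straightforward to reproduce; here&rsquo;s a rough sketch in pandas (with toy data, and not necessarily the pipeline used for the chart above):</p>
<pre tabindex="0"><code># Toy example: 75th percentile score for each two-word phrase in post titles.
import pandas as pd

df = pd.DataFrame({
    "title": ["my fan art of a cat", "fan art I made today", "my cat sleeping"],
    "score": [1200, 35, 450],
})

def bigrams(title):
    words = title.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

exploded = df.assign(bigram=df["title"].apply(bigrams)).explode("bigram")
q75 = exploded.groupby("bigram")["score"].quantile(0.75).sort_values(ascending=False)
print(q75.head())
</code></pre>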
<h2 id="can-deep-learning-predict-post-performance">Can Deep Learning Predict Post Performance?</h2>
<p>Some might think &ldquo;oh hey, this is an arbitrary statistical problem, you can just build an AI to solve it!&rdquo; So, for the sake of argument, I did.</p>
<p>Instead of using Reddit data for building a deep learning model, we&rsquo;ll use data from <a href="https://news.ycombinator.com">Hacker News</a>, another link aggregator similar to Reddit with a strong focus on technology and startup entrepreneurship. The distribution of scores on posts, submission timings, upvoting, and front page ranking systems are all the same as on Reddit.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/hn_hu_ad0b8ce0803e73ea.webp 320w,/2018/09/modeling-link-aggregators/hn_hu_9592bce993e10dcd.webp 768w,/2018/09/modeling-link-aggregators/hn_hu_c329d6412551f993.webp 1024w,/2018/09/modeling-link-aggregators/hn.png 1520w" src="hn.png"/> 
</figure>

<p>The titles on Hacker News submissions are also shorter (80 characters max vs. Reddit&rsquo;s 300 character max) and in concise English (no memes/shitposts allowed), which should help the model learn the title syntax and identify high-impact keywords more easily. Like Reddit, the score data is super-skewed, with most HN submissions at 1-2 points, and typical model training will quickly converge but predict that <em>every</em> submission has a score of 1, which isn&rsquo;t helpful!</p>
<p>By constructing a model employing <em>many</em> deep learning tricks with <a href="https://keras.io">Keras</a>/<a href="https://www.tensorflow.org">TensorFlow</a> to prevent model cheating and training on <em>hundreds of thousands</em> of HN submissions (using post title, day-of-week, hour, and link domain like <code>github.com</code> as model features), the model does converge and finds some signal among the noise (training R<sup>2</sup> ~ 0.55 when trained for 50 epochs). However, it fails to offer any valuable predictions on new, unseen posts (test R<sup>2</sup> <em>&lt; 0.00</em>) because it falls into the same exact human biases regarding titles: it saw submissions with titles that did very well during training, but it can&rsquo;t account for the random chance that makes two otherwise-similar submissions diverge, with X going viral while Y does not.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/hn_test_hu_75e647e4de235ee0.webp 320w,/2018/09/modeling-link-aggregators/hn_test.png 485w" src="hn_test.png"/> 
</figure>
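<p>For the curious, a multi-input architecture along those lines looks roughly like the following sketch (the vocabulary sizes, embedding dimensions, and layer widths are illustrative assumptions, not the exact values from the actual model):</p>
<pre tabindex="0"><code># Rough sketch of a multi-input Keras regression model: title tokens plus
# categorical features (day-of-week, hour, domain). All sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_TITLE_LEN, NUM_DOMAINS = 20000, 15, 500

title_in = layers.Input(shape=(MAX_TITLE_LEN,), name="title_tokens")
dow_in = layers.Input(shape=(1,), name="day_of_week")
hour_in = layers.Input(shape=(1,), name="hour")
domain_in = layers.Input(shape=(1,), name="domain_id")

# Text branch: embed the title tokens and average-pool them into one vector
x_title = layers.GlobalAveragePooling1D()(layers.Embedding(VOCAB_SIZE, 64)(title_in))

# Categorical branches: small learned embeddings for each feature
x_dow = layers.Flatten()(layers.Embedding(7, 4)(dow_in))
x_hour = layers.Flatten()(layers.Embedding(24, 8)(hour_in))
x_domain = layers.Flatten()(layers.Embedding(NUM_DOMAINS, 16)(domain_in))

x = layers.Concatenate()([x_title, x_dow, x_hour, x_domain])
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
score_out = layers.Dense(1)(x)  # regress on the (possibly transformed) score

model = tf.keras.Model([title_in, dow_in, hour_in, domain_in], score_out)
model.compile(optimizer="adam", loss="mse")
</code></pre>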

<p>I&rsquo;ve made the Keras/TensorFlow model training code available in <a href="https://www.kaggle.com/minimaxir/hacker-news-submission-score-predictor/notebook">this Kaggle Notebook</a> if you want to fork it and try to improve the model.</p>
<h2 id="other-potential-modeling-factors">Other Potential Modeling Factors</h2>
<p>The deep learning model above makes optimistic assumptions about the underlying data, including that each post behaves independently and that the included features are the sole determinants of the score. These assumptions are questionable.</p>
<p>The simple model forgoes the content of the submission itself, which is hard to retrieve for hundreds of thousands of data points. On Hacker News that&rsquo;s mostly OK, since most submissions are links/articles whose titles correlate closely with their content, although occasionally there are idiosyncratic short titles which do the opposite. On Reddit, looking at the content is obviously necessary for image/video-oriented subreddits, but that content is hard to gather and analyze at scale.</p>
<p>A very important concept of post performance is <em>momentum</em>. A post having a high score is a positive signal in itself, which begets more votes (a famous Reddit problem is brigading from /r/all which can cause submission scores to skyrocket). If the front page of a subreddit has a large number of high-performing posts, they might also suppress posts coming out of the <code>/new</code> queue because the score threshold is much higher. A simple model may not be able to capture these impacts; the model would need to incorporate the <em>state of the front page</em> at the time of posting.</p>
<p>Some also try to manipulate upvotes. Reddit became famous for adding the rule &ldquo;asking for upvotes is a violation of intergalactic law&rdquo; to their <a href="https://www.reddithelp.com/en/categories/rules-reporting/account-and-community-restrictions/what-constitutes-vote-cheating-or">Content Policy</a>, although some subreddits do it anyway <a href="https://www.reddit.com/r/TheoryOfReddit/comments/5qqrod/for_years_reddit_told_us_that_saying_upvote_this/">without consequence</a>. On Reddit, obvious spam posts can be downvoted to immediately counteract illicit upvotes. Hacker News has a <a href="https://news.ycombinator.com/newsfaq.html">similar don&rsquo;t-upvote rule</a>, although there aren&rsquo;t downvotes, just a flagging mechanism which quickly neutralizes spam/misleading posts. In general, there&rsquo;s no <em>legitimate</em> reason to highlight your own submission immediately after it&rsquo;s posted (except for Reddit&rsquo;s AMAs). Fortunately, gaming the system is less impactful on Reddit and Hacker News due to their sheer size and countermeasures, but it&rsquo;s a good example of potential user behavior that makes modeling post performance difficult, and hopefully link aggregators of the future aren&rsquo;t susceptible to such shenanigans.</p>
<h2 id="do-we-really-to-predict-post-score">Do We Really to Predict Post Score?</h2>
<p>Let&rsquo;s say you are submitting original content to Reddit or your own tech project to Hacker News. More points means a higher ranking means more exposure for your link, right? Not exactly. As noted from Reddit/HN screenshots above, the scores of popular submissions are all over the place ranking-wise, having been affected by age penalties.</p>
<p>In practical terms, from my own purely anecdotal experience, submissions at a top ranking receive <em>substantially</em> more clickthroughs despite being spatially close on the page to others.</p>
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">&hellip;and now traffic at #3.<br><br>Placement is absurdly important for search engines/social media sites. Difference between #1 and #3 is dramatic. <a href="https://t.co/nGjWJBx6dU">pic.twitter.com/nGjWJBx6dU</a></p>— Max Woolf (@minimaxir) <a href="https://twitter.com/minimaxir/status/877219784907149316?ref_src=twsrc%5Etfw">June 20, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></span></p>
<p>In <a href="https://twitter.com/minimaxir/status/877219784907149316">that case</a>, falling from #1 to #3 <em>immediately halved</em> the referral traffic coming from Hacker News.</p>
<p>Therefore, an ideal link aggregator predictive model to maximize clicks should try to predict the <em>rank</em> of a submission (max rank, average rank over some period <em>n</em>, etc.), not necessarily the score it receives. You could theoretically build such a model by taking a snapshot of a subreddit or the Hacker News front page every minute or so, recording each post&rsquo;s position at that moment. As mentioned earlier, the snapshots could also serve as a model feature indicating whether the front page is active or stale. Unfortunately, snapshots can&rsquo;t be retrieved retroactively, and storing, processing, and analyzing snapshots at scale is a difficult and <em>expensive</em> feat of data engineering.</p>
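<p>As a sketch of what collecting those snapshots could look like (the Hacker News <code>topstories</code> API endpoint is real; the once-a-minute loop and the output format are illustrative assumptions):</p>
<pre><code class="language-r">library(jsonlite)

# Grab the current front-page ranking from the official Hacker News API.
snapshot_hn_ranks &lt;- function() {
  ids &lt;- fromJSON("https://hacker-news.firebaseio.com/v0/topstories.json")
  data.frame(
    snapshot_time = Sys.time(),
    story_id = head(ids, 30),   # the front page shows 30 stories
    rank = seq_len(30)
  )
}

# Poll once per minute; in practice this would run indefinitely and
# write to a database instead of an in-memory list.
snapshots &lt;- list()
for (i in 1:5) {
  snapshots[[i]] &lt;- snapshot_hn_ranks()
  Sys.sleep(60)
}
df_snapshots &lt;- do.call(rbind, snapshots)
</code></pre>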
<p>Presumably Reddit&rsquo;s data scientists incorporate submission position into their analytics and modeling, but after inspecting what&rsquo;s sent to Reddit&rsquo;s servers when you perform an action like upvoting, I wasn&rsquo;t able to find a position value being sent when upvoting from the feed: only the post score and post upvote percentage at the time of the action were sent.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/chrome_hu_4b758c7e3fe42881.webp 320w,/2018/09/modeling-link-aggregators/chrome_hu_29f25ed9207a6d8f.webp 768w,/2018/09/modeling-link-aggregators/chrome_hu_f6617992d5fb908c.webp 1024w,/2018/09/modeling-link-aggregators/chrome.png 1442w" src="chrome.png"/> 
</figure>

<p>In this example, I upvoted the <code>Fact are facts</code> submission at position #5: we&rsquo;d expect a value between <code>3</code> and <code>5</code> to be sent with the post metadata within the analytics payload, but that&rsquo;s not the case.</p>
<p>Optimizing ranking instead of a tangible metric or classification accuracy is a relatively underdiscussed field of modern data science (besides <a href="https://en.wikipedia.org/wiki/Search_engine_optimization">SEO</a> for getting the top spot on a Google search), and it would be interesting to dive deeper into it for other applications.</p>
<h2 id="in-the-future">In the future</h2>
<p>The moral of this post is that you should not take it personally if a submission fails to hit the front page. It doesn&rsquo;t necessarily mean it&rsquo;s bad. Conversely, if a post does well, don’t assume that similar posts will do just as well. There&rsquo;s a lot of quality content that falls through the cracks due to dumb luck. Fortunately, both Reddit and Hacker News allow reposts, which helps alleviate this particular problem.</p>
<p>There&rsquo;s still a lot that can be done to more deterministically predict the behavior of these algorithmic feeds. There&rsquo;s also room to help make these link aggregators more <em>fair</em>. Unfortunately, there&rsquo;s even more undiscovered ways to game these algorithms, and we&rsquo;ll see how things play out.</p>
<hr>
<p><em>You can view the BigQuery queries used to get the Reddit and Hacker News data, plus the R and ggplot2 used to create the data visualizations, in <a href="http://minimaxir.com/notebooks/modeling-link-aggregators/">this R Notebook</a>. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/modeling-link-aggregators">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Analyzing IMDb Data The Intended Way, with R and ggplot2</title>
      <link>https://minimaxir.com/2018/07/imdb-data-analysis/</link>
      <pubDate>Mon, 16 Jul 2018 09:45:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/07/imdb-data-analysis/</guid>
      <description>For IMDb&amp;rsquo;s big-but-not-big data, you have to play with the data smartly, and both R and ggplot2 have neat tricks to do just that.</description>
      <content:encoded><![CDATA[<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/P4_zSfoTM80?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p><a href="https://www.imdb.com">IMDb</a>, the Internet Movie Database, has been a popular source for data analysis and visualizations over the years. The combination of user ratings for movies and detailed movie metadata have always been fun to <a href="http://minimaxir.com/2016/01/movie-revenue-ratings/">play with</a>.</p>
<p>There are a number of tools to help get IMDb data, such as <a href="https://github.com/alberanid/imdbpy">IMDbPY</a>, which makes it easy to programmatically scrape IMDb by pretending to be a normal website user and extracting the relevant data from each page&rsquo;s HTML output. While it <em>works</em>, web scraping public data is a legal gray area; many large websites have Terms of Service which forbid scraping, and can potentially send a DMCA take-down notice to websites redistributing scraped data.</p>
<p>IMDb has <a href="https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX">data licensing terms</a> which forbid scraping and require attribution in the form of an <strong>Information courtesy of IMDb (<a href="http://www.imdb.com">http://www.imdb.com</a>). Used with permission.</strong> statement, and has also <a href="https://www.kaggle.com/tmdb/tmdb-movie-metadata/home">DMCAed a Kaggle IMDb dataset</a> to drive the point home.</p>
<p>However, there is good news! IMDb publishes an <a href="https://www.imdb.com/interfaces/">official dataset</a> for casual data analysis! And it&rsquo;s now very accessible: just choose a dataset and download it (no hoops to jump through), and the files are in the standard <a href="https://en.wikipedia.org/wiki/Tab-separated_values">TSV format</a>.</p>
<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/datasets_hu_fb4ad2ef1d7c9e7f.webp 320w,/2018/07/imdb-data-analysis/datasets_hu_a5155a40c73aa984.webp 768w,/2018/07/imdb-data-analysis/datasets.png 926w" src="datasets.png"/> 
</figure>

<p>The uncompressed files are pretty large; not &ldquo;big data&rdquo; large (they fit into computer memory), but Excel will explode if you try to open them in it. You have to play with the data <em>smartly</em>, and both <a href="https://www.r-project.org">R</a> and <a href="https://ggplot2.tidyverse.org/reference/index.html">ggplot2</a> have neat tricks to do just that.</p>
<h2 id="first-steps">First Steps</h2>
<p>R is a popular programming language for statistical analysis. One of the most popular collections of external packages is the <code>tidyverse</code>, which automatically imports the <code>ggplot2</code> data visualization library and other useful packages that we&rsquo;ll get to one-by-one. We&rsquo;ll also load <code>scales</code> for prettier number formatting later. First we&rsquo;ll load these packages:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">scales</span><span class="p">)</span>
</span></span></code></pre></div><p>And now we can load a TSV downloaded from IMDb using the <code>read_tsv</code> function from <code>readr</code> (a tidyverse package), which does exactly what the name implies and runs much faster than base R (plus a couple of extra parameters to handle the data encoding). Let&rsquo;s start with the <code>ratings</code> file:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;title.ratings.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span></span></span></code></pre></div>
<p>We can preview what&rsquo;s in the loaded data using <code>dplyr</code> (a tidyverse package), which is what we&rsquo;ll be using to manipulate data for this analysis. dplyr allows you to pipe commands, making it easy to create a sequence of manipulation commands. For now, we&rsquo;ll use <code>head()</code>, which displays the top few rows of the data frame.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">()</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/ratings_hu_5c1fcf56a5289876.webp 320w,/2018/07/imdb-data-analysis/ratings_hu_cf3fece2f9c850ca.webp 768w,/2018/07/imdb-data-analysis/ratings.png 930w" src="ratings.png"/> 
</figure>

<p>Each of the <strong>873k rows</strong> corresponds to a single movie and contains an ID for the movie, its average rating (from 1 to 10), and the number of votes contributing to that average. Since we have two numeric variables, why not test out ggplot2 by creating a scatterplot mapping them? ggplot2 takes in a data frame and column names as aesthetics, then you specify what type of shape to plot (a &ldquo;geom&rdquo;). Passing the plot to <code>ggsave</code> saves it as a standalone, high-quality data visualization.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">numVotes</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_point</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">ggsave</span><span class="p">(</span><span class="s">&#34;imdb-0.png&#34;</span><span class="p">,</span> <span class="n">plot</span><span class="p">,</span> <span class="n">width</span> <span class="o">=</span> <span class="m">4</span><span class="p">,</span> <span class="n">height</span> <span class="o">=</span> <span class="m">3</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-0_hu_6866c079d670893c.webp 320w,/2018/07/imdb-data-analysis/imdb-0_hu_dddd194229265d79.webp 768w,/2018/07/imdb-data-analysis/imdb-0_hu_1d852e43e8a54dea.webp 1024w,/2018/07/imdb-data-analysis/imdb-0.png 1200w" src="imdb-0.png"/> 
</figure>

<p>Here are nearly <em>1 million</em> points on a single chart; definitely don&rsquo;t try to do that in Excel! However, it&rsquo;s not a <em>useful</em> chart, since all the points are opaque and we can&rsquo;t tell what the spatial density of points is. One approach to fix this issue is to create a heat map of points, which ggplot can do natively with <code>geom_bin2d</code>. We can color the heat map with the <a href="https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html">viridis</a> colorblind-friendly palettes <a href="https://ggplot2.tidyverse.org/reference/scale_viridis.html">just introduced</a> into ggplot2. We should also tweak the axes: the x-axis should be scaled logarithmically with <code>scale_x_log10</code>, since a small number of movies have very high vote counts, and its labels can be formatted with the <code>comma</code> function from the <code>scales</code> package. For the y-axis, we can add an explicit break for each rating; R can do this neatly by setting the breaks to <code>1:10</code>. Putting it all together:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">numVotes</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_bin2d</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_x_log10</span><span class="p">(</span><span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_viridis_c</span><span class="p">(</span><span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-1_hu_afa4c2e2f89a47f2.webp 320w,/2018/07/imdb-data-analysis/imdb-1_hu_fb49622c671e7e.webp 768w,/2018/07/imdb-data-analysis/imdb-1_hu_fe5886baf1a1a113.webp 1024w,/2018/07/imdb-data-analysis/imdb-1.png 1200w" src="imdb-1.png"/> 
</figure>

<p>Not bad, although it unfortunately confirms that IMDb follows a <a href="https://tvtropes.org/pmwiki/pmwiki.php/Main/FourPointScale">Four Point Scale</a> where average ratings tend to fall between 6 and 9.</p>
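<p>A quick way to quantify that skew with the columns we already have loaded (a sketch; the 10-vote cutoff is just to drop barely-rated titles):</p>
<pre><code class="language-r"># Quartiles of average ratings, plus the share landing in the 6-9 band.
df_ratings %&gt;%
  filter(numVotes &gt;= 10) %&gt;%
  summarize(p25 = quantile(averageRating, 0.25),
            p50 = quantile(averageRating, 0.50),
            p75 = quantile(averageRating, 0.75),
            share_6_to_9 = mean(averageRating &gt;= 6 &amp; averageRating &lt;= 9))
</code></pre>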
<h2 id="mapping-movies-to-ratings">Mapping Movies to Ratings</h2>
<p>You may be asking &ldquo;which ratings correspond to which movies?&rdquo; That&rsquo;s what the <code>tconst</code> field is for. But first, let&rsquo;s load the title data from <code>title.basics.tsv</code> into <code>df_basics</code> and take a look as before.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_basics</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;title.basics.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/basics1_hu_fdcb6a5f4e7311e5.webp 320w,/2018/07/imdb-data-analysis/basics1_hu_e15b78e5bbe944b8.webp 768w,/2018/07/imdb-data-analysis/basics1_hu_2e217e73acfcd9ff.webp 1024w,/2018/07/imdb-data-analysis/basics1.png 1350w" src="basics1.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/basics2_hu_a64ae979748aa9ab.webp 320w,/2018/07/imdb-data-analysis/basics2_hu_a83799eaf31e4743.webp 768w,/2018/07/imdb-data-analysis/basics2_hu_21a8fb679f3ec4e9.webp 1024w,/2018/07/imdb-data-analysis/basics2.png 1374w" src="basics2.png"/> 
</figure>
</p>
<p>We have some neat movie metadata. Notably, this table has a <code>tconst</code> field as well. Therefore, we can <em>join</em> the two tables together, adding the movie information to the corresponding row in the ratings table (in this case, a left join is more appropriate than an inner/full join).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">&lt;-</span> <span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_basics</span><span class="p">)</span>
</span></span></code></pre></div><p>Runtime minutes sounds interesting. Could there be a relationship between the length of a movie and its average rating on IMDb? Let&rsquo;s make a heat map again, with a few tweaks. With the new metadata, we can <code>filter</code> the table to remove bad points: let&rsquo;s keep only movies (the IMDb data also contains <em>television show data</em>) with a runtime under 3 hours which have received at least 10 user votes, to remove extraneous titles. The x-axis should be tweaked to display the minute values in hours. The viridis fill palette can be changed to another one in the family (I personally like <code>inferno</code>).</p>
<p>More importantly, let&rsquo;s discuss plot theming. If you want a minimalistic theme, add a <code>theme_minimal</code> to the plot, and you can pass a <code>base_family</code> to change the default font on the plot and a <code>base_size</code> to change the font size. The <code>labs</code> function lets you add labels to the plot (which you should <em>always</em> do); you have your <code>title</code>, <code>x</code>, and <code>y</code> parameters, but you can also add a <code>subtitle</code>, a <code>caption</code> for attribution, and a <code>color</code>/<code>fill</code> to name the scale. Putting it all together:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">runtimeMinutes</span> <span class="o">&lt;</span> <span class="m">180</span><span class="p">,</span> <span class="n">titleType</span> <span class="o">==</span> <span class="s">&#34;movie&#34;</span><span class="p">,</span> <span class="n">numVotes</span> <span class="o">&gt;=</span> <span class="m">10</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">runtimeMinutes</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_bin2d</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">180</span><span class="p">,</span> <span class="m">60</span><span class="p">),</span> <span class="n">labels</span> <span class="o">=</span> <span class="m">0</span><span class="o">:</span><span class="m">3</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="m">0</span><span class="o">:</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_viridis_c</span><span class="p">(</span><span class="n">option</span> <span class="o">=</span> <span class="s">&#34;inferno&#34;</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">theme_minimal</span><span class="p">(</span><span class="n">base_family</span> <span class="o">=</span> <span class="s">&#34;Source Sans Pro&#34;</span><span class="p">,</span> <span class="n">base_size</span> <span class="o">=</span> <span class="m">8</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">labs</span><span class="p">(</span><span class="n">title</span> <span class="o">=</span> <span class="s">&#34;Relationship between Movie Runtime and Average Mobie Rating&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">subtitle</span> <span class="o">=</span> <span class="s">&#34;Data from IMDb retrieved July 4th, 2018&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">x</span> <span class="o">=</span> <span class="s">&#34;Runtime (Hours)&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">y</span> <span class="o">=</span> <span class="s">&#34;Average User Rating&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">caption</span> <span class="o">=</span> <span class="s">&#34;Max Woolf — minimaxir.com&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">fill</span> <span class="o">=</span> <span class="s">&#34;# Movies&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-2b_hu_37c6091878dca7a3.webp 320w,/2018/07/imdb-data-analysis/imdb-2b_hu_42f5a5f9d2e7967e.webp 768w,/2018/07/imdb-data-analysis/imdb-2b_hu_b4f485eff14f2484.webp 1024w,/2018/07/imdb-data-analysis/imdb-2b.png 1200w" src="imdb-2b.png"/> 
</figure>

<p>Now that&rsquo;s pretty nice-looking for only a few lines of code! Albeit unhelpful, as there doesn&rsquo;t appear to be a correlation.</p>
<p><em>(Note: for the rest of this post, the theming/labels code will be omitted for convenience)</em></p>
<p>How about movie ratings vs. the year the movie was made? It&rsquo;s a similar plot code-wise to the one above (one perk about <code>ggplot2</code> is that there&rsquo;s no shame in reusing chart code!), but we can add a <code>geom_smooth</code>, which adds a nonparametric trendline with confidence bands for the trend; since we have a large amount of data, the bands are very tight. We can also fix the problem of &ldquo;empty&rdquo; bins by setting the color fill scale to logarithmic scaling. And since we&rsquo;re adding a black trendline, let&rsquo;s change the viridis palette to <code>plasma</code> for better contrast.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">titleType</span> <span class="o">==</span> <span class="s">&#34;movie&#34;</span><span class="p">,</span> <span class="n">numVotes</span> <span class="o">&gt;=</span> <span class="m">10</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">startYear</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_bin2d</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_smooth</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">&#34;black&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_x_continuous</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_viridis_c</span><span class="p">(</span><span class="n">option</span> <span class="o">=</span> <span class="s">&#34;plasma&#34;</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">,</span> <span class="n">trans</span> <span class="o">=</span> <span class="s">&#39;log10&#39;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-4_hu_fdf90cbdd2dd2c7e.webp 320w,/2018/07/imdb-data-analysis/imdb-4_hu_1c45abe215427c09.webp 768w,/2018/07/imdb-data-analysis/imdb-4_hu_62d0feb034e8b054.webp 1024w,/2018/07/imdb-data-analysis/imdb-4.png 1200w" src="imdb-4.png"/> 
</figure>

<p>Unfortunately, average ratings haven&rsquo;t changed much over the years either, although ratings outside the Four Point Scale have become more common over time.</p>
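<p>That last claim can be checked numerically; here is a sketch computing the yearly share of movies rated outside the 6-to-9 band, using the same filters as the plots above:</p>
<pre><code class="language-r"># Yearly share of movies whose average rating falls outside the 6-9 band.
df_outside &lt;- df_ratings %&gt;%
  filter(titleType == "movie", numVotes &gt;= 10, !is.na(startYear)) %&gt;%
  group_by(startYear) %&gt;%
  summarize(share_outside = mean(averageRating &lt; 6 | averageRating &gt; 9))
</code></pre>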
<h2 id="mapping-lead-actors-to-movies">Mapping Lead Actors to Movies</h2>
<p>Now that we have a handle on working with the IMDb data, let&rsquo;s try playing with the larger datasets. Since they take up a lot of computer memory, we only want to persist data we might actually use. After looking at the schema provided with the official datasets, the only really useful metadata about the actors is their birth year, so let&rsquo;s load that file, keeping only actors/actresses (using the fast <code>str_detect</code> function from <code>stringr</code>, another tidyverse package) and only the relevant fields.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_actors</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;name.basics.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                <span class="nf">filter</span><span class="p">(</span><span class="nf">str_detect</span><span class="p">(</span><span class="n">primaryProfession</span><span class="p">,</span> <span class="s">&#34;actor|actress&#34;</span><span class="p">))</span>  <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                <span class="nf">select</span><span class="p">(</span><span class="n">nconst</span><span class="p">,</span> <span class="n">primaryName</span><span class="p">,</span> <span class="n">birthYear</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/actor_hu_f86030d94734f51e.webp 320w,/2018/07/imdb-data-analysis/actor_hu_58f7a4e4de86c210.webp 768w,/2018/07/imdb-data-analysis/actor.png 936w" src="actor.png"/> 
</figure>

<p>The principals dataset, the large 1.28GB TSV, is the most interesting. It&rsquo;s an unnested list of the credited persons in each movie, with an <code>ordering</code> indicating their rank (where <code>1</code> means first, <code>2</code> means second, etc.).</p>
<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/principals_hu_e149270e85e6bbfe.webp 320w,/2018/07/imdb-data-analysis/principals_hu_d39d7c6fcd18929.webp 768w,/2018/07/imdb-data-analysis/principals_hu_56b42bde8cdb5364.webp 1024w,/2018/07/imdb-data-analysis/principals.png 1074w" src="principals.png"/> 
</figure>

<p>For this analysis, let&rsquo;s only look at the <strong>lead actors/actresses</strong>; specifically, for each movie (identified by the <code>tconst</code> value), filter the dataset to the rows where the <code>ordering</code> value is lowest among actors/actresses (since the person at rank <code>1</code> overall may not necessarily be an actor/actress).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_principals</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;title.principals.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">filter</span><span class="p">(</span><span class="nf">str_detect</span><span class="p">(</span><span class="n">category</span><span class="p">,</span> <span class="s">&#34;actor|actress&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="n">tconst</span><span class="p">,</span> <span class="n">ordering</span><span class="p">,</span> <span class="n">nconst</span><span class="p">,</span> <span class="n">category</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">group_by</span><span class="p">(</span><span class="n">tconst</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">filter</span><span class="p">(</span><span class="n">ordering</span> <span class="o">==</span> <span class="nf">min</span><span class="p">(</span><span class="n">ordering</span><span class="p">))</span>
</span></span></code></pre></div><p>Both datasets have a <code>nconst</code> field, so let&rsquo;s join them together. And then join <em>that</em> to the ratings table earlier via <code>tconst</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_principals</span> <span class="o">&lt;-</span> <span class="n">df_principals</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_actors</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">&lt;-</span> <span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_principals</span><span class="p">)</span>
</span></span></code></pre></div><p>Now we have a fully denormalized dataset in <code>df_ratings</code>. Since we have both the movie release year and the birth year of the lead actor, we can infer <em>the age of the lead actor at the movie&rsquo;s release</em>. With that goal, filter the data using the criteria from the earlier visualizations, plus keep only rows which have the actor&rsquo;s birth year.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings_movies</span> <span class="o">&lt;-</span> <span class="n">df_ratings</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                        <span class="nf">filter</span><span class="p">(</span><span class="n">titleType</span> <span class="o">==</span> <span class="s">&#34;movie&#34;</span><span class="p">,</span> <span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">birthYear</span><span class="p">),</span> <span class="n">numVotes</span> <span class="o">&gt;=</span> <span class="m">10</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                        <span class="nf">mutate</span><span class="p">(</span><span class="n">age_lead</span> <span class="o">=</span> <span class="n">startYear</span> <span class="o">-</span> <span class="n">birthYear</span><span class="p">)</span>
</span></span></code></pre></div><p><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/denorm1_hu_654cad39747efe47.webp 320w,/2018/07/imdb-data-analysis/denorm1_hu_eed6e992d7e214e3.webp 768w,/2018/07/imdb-data-analysis/denorm1_hu_dbde12b6453e4f09.webp 1024w,/2018/07/imdb-data-analysis/denorm1.png 1604w" src="denorm1.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/denorm2_hu_3aef3d94cde50e2c.webp 320w,/2018/07/imdb-data-analysis/denorm2.png 531w" src="denorm2.png"/> 
</figure>
</p>
<h2 id="plotting-ages">Plotting Ages</h2>
<p>Age discrimination in movie casting has been a recurring issue in Hollywood; in fact, in 2017 <a href="https://www.hollywoodreporter.com/thr-esq/judge-pauses-enforcement-imdb-age-censorship-law-978797">a law was signed</a> to force IMDb to remove an actor&rsquo;s age upon request, which in February 2018 was <a href="https://www.hollywoodreporter.com/thr-esq/californias-imdb-age-censorship-law-declared-unconstitutional-1086540">ruled to be unconstitutional</a>.</p>
<p>Have the ages of movie leads changed over time? For this example, we&rsquo;ll use a <a href="https://ggplot2.tidyverse.org/reference/geom_ribbon.html">ribbon plot</a> to plot the ranges of ages of movie leads. A simple way to do that is, for each year, calculate the 25th <a href="https://en.wikipedia.org/wiki/Percentile">percentile</a> of the ages, the 50th percentile (i.e. the median), and the 75th percentile, where the 25th and 75th percentiles are the ribbon bounds and the line represents the median.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_actor_ages</span> <span class="o">&lt;-</span> <span class="n">df_ratings_movies</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">group_by</span><span class="p">(</span><span class="n">startYear</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">summarize</span><span class="p">(</span><span class="n">low_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.25</span><span class="p">,</span> <span class="n">na.rm</span><span class="o">=</span><span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">med_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.50</span><span class="p">,</span> <span class="n">na.rm</span><span class="o">=</span><span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">high_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.75</span><span class="p">,</span> <span class="n">na.rm</span><span class="o">=</span><span class="bp">T</span><span class="p">))</span>
</span></span></code></pre></div><p>Plotting it with ggplot2 is surprisingly simple, although you need to use different y aesthetics for the ribbon and the overlapping line.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_actor_ages</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">startYear</span> <span class="o">&gt;=</span> <span class="m">1920</span><span class="p">)</span> <span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">startYear</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_ribbon</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">ymin</span> <span class="o">=</span> <span class="n">low_age</span><span class="p">,</span> <span class="n">ymax</span> <span class="o">=</span> <span class="n">high_age</span><span class="p">),</span> <span class="n">alpha</span> <span class="o">=</span> <span class="m">0.2</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_line</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="n">med_age</span><span class="p">))</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-8_hu_1f082993b0bfcbd5.webp 320w,/2018/07/imdb-data-analysis/imdb-8_hu_5434c1e3ce1485b4.webp 768w,/2018/07/imdb-data-analysis/imdb-8_hu_c6707a589573484a.webp 1024w,/2018/07/imdb-data-analysis/imdb-8.png 1200w" src="imdb-8.png"/> 
</figure>

<p>Turns out that in the 2000s, the median age of lead actors started to <em>increase</em>? Both the upper and lower bounds increased too. That doesn&rsquo;t square with the age discrimination complaints.</p>
<p>Another aspect of these complaints is gender, as lead actresses tend to be cast younger than lead actors. Thanks to the magic of ggplot2 and dplyr, separating actors/actresses is relatively simple: add gender (encoded in <code>category</code>) as a grouping variable, add it as a color/fill aesthetic in ggplot, and set the colors appropriately (I recommend the <a href="http://colorbrewer2.org/">ColorBrewer</a> qualitative palettes for categorical variables).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_actor_ages_lead</span> <span class="o">&lt;-</span> <span class="n">df_ratings_movies</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">group_by</span><span class="p">(</span><span class="n">startYear</span><span class="p">,</span> <span class="n">category</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">summarize</span><span class="p">(</span><span class="n">low_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.25</span><span class="p">,</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">med_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.50</span><span class="p">,</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">high_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.75</span><span class="p">,</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_actor_ages_lead</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">startYear</span> <span class="o">&gt;=</span> <span class="m">1920</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">startYear</span><span class="p">,</span> <span class="n">fill</span> <span class="o">=</span> <span class="n">category</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="n">category</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_ribbon</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">ymin</span> <span class="o">=</span> <span class="n">low_age</span><span class="p">,</span> <span class="n">ymax</span> <span class="o">=</span> <span class="n">high_age</span><span class="p">),</span> <span class="n">alpha</span> <span class="o">=</span> <span class="m">0.2</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_line</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="n">med_age</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_brewer</span><span class="p">(</span><span class="n">palette</span> <span class="o">=</span> <span class="s">&#34;Set1&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_color_brewer</span><span class="p">(</span><span class="n">palette</span> <span class="o">=</span> <span class="s">&#34;Set1&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-9_hu_57562b2f234249be.webp 320w,/2018/07/imdb-data-analysis/imdb-9_hu_7da40c01dd2abee4.webp 768w,/2018/07/imdb-data-analysis/imdb-9_hu_a30111e8cbade2ed.webp 1024w,/2018/07/imdb-data-analysis/imdb-9.png 1200w" src="imdb-9.png"/> 
</figure>

<p>There&rsquo;s about a 10-year gap between the ages of male and female leads, and the gap doesn&rsquo;t change over time; both start to rise at the same point.</p>
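<p>If you want that gap as a single number, a quick sketch of the overall medians by credited category:</p>
<pre><code class="language-r"># Overall median lead age for actors vs. actresses (movies since 1920).
df_ratings_movies %&gt;%
  filter(startYear &gt;= 1920) %&gt;%
  group_by(category) %&gt;%
  summarize(median_age = median(age_lead, na.rm = TRUE))
</code></pre>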
<p>One possible explanation for this behavior is actor reuse: if Hollywood keeps casting the same actors/actresses, the ages of the leads will by construction steadily increase. Let&rsquo;s verify that: with our list of movies and their lead actors, order each actor&rsquo;s movies by release year and add a rank for the #th time that actor has been a lead. This is possible through <code>row_number</code> in dplyr; <a href="https://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html">window functions</a> like <code>row_number</code> are one of data science&rsquo;s best-kept secrets.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings_movies_nth</span> <span class="o">&lt;-</span> <span class="n">df_ratings_movies</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                      <span class="nf">group_by</span><span class="p">(</span><span class="n">nconst</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                      <span class="nf">arrange</span><span class="p">(</span><span class="n">startYear</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                      <span class="nf">mutate</span><span class="p">(</span><span class="n">nth_lead</span> <span class="o">=</span> <span class="nf">row_number</span><span class="p">())</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/row_number_hu_1e44bdb2621fb9cb.webp 320w,/2018/07/imdb-data-analysis/row_number_hu_ca408294ce31483a.webp 768w,/2018/07/imdb-data-analysis/row_number_hu_ed006c80eb52873e.webp 1024w,/2018/07/imdb-data-analysis/row_number.png 1532w" src="row_number.png"/> 
</figure>

<p>One more ribbon plot later (w/ same code as above + custom y-axis breaks):</p>
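<p>(A sketch of that omitted code, reusing the gender-split aggregation above on <code>nth_lead</code>; the exact y-axis breaks in the published chart are an assumption here.)</p>
<pre><code class="language-r">df_nth_lead &lt;- df_ratings_movies_nth %&gt;%
  group_by(startYear, category) %&gt;%
  summarize(low_nth = quantile(nth_lead, 0.25, na.rm = TRUE),
            med_nth = quantile(nth_lead, 0.50, na.rm = TRUE),
            high_nth = quantile(nth_lead, 0.75, na.rm = TRUE))

plot &lt;- ggplot(df_nth_lead %&gt;% filter(startYear &gt;= 1920),
               aes(x = startYear, fill = category, color = category)) +
  geom_ribbon(aes(ymin = low_nth, ymax = high_nth), alpha = 0.2) +
  geom_line(aes(y = med_nth)) +
  scale_y_continuous(breaks = seq(0, 20, 2)) +   # assumed custom breaks
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1")
</code></pre>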
<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-12_hu_32ee97febb68e3.webp 320w,/2018/07/imdb-data-analysis/imdb-12_hu_69e7d60d89429d8f.webp 768w,/2018/07/imdb-data-analysis/imdb-12_hu_c9df788e280bb63b.webp 1024w,/2018/07/imdb-data-analysis/imdb-12.png 1200w" src="imdb-12.png"/> 
</figure>

<p>Huh. The median and upper-bound #th time have <em>dropped</em> over time? Hollywood has been promoting more newcomers as leads? That&rsquo;s not what I expected!</p>
<p>More work definitely needs to be done in this area. In the meantime, the official IMDb datasets are a lot more robust than I thought they would be! And I only used a fraction of the datasets; the rest tie into TV shows, which are a bit messier. Hopefully you&rsquo;ve gotten a good taste of the power of R and ggplot2 for playing with big-but-not-big data!</p>
<hr>
<p><em>You can view the R and ggplot used to create the data visualizations in <a href="http://minimaxir.com/notebooks/imdb-data-analysis/">this R Notebook</a>, which includes many visualizations not used in this post. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/imdb-data-analysis">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Visualizing One Million NCAA Basketball Shots</title>
      <link>https://minimaxir.com/2018/03/basketball-shots/</link>
      <pubDate>Mon, 19 Mar 2018 09:20:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/03/basketball-shots/</guid>
      <description>Although visualizing basketball shots has been done before, this time we have access to an order of magnitude more public data to do some really cool stuff.</description>
<content:encoded><![CDATA[<p>So <a href="https://www.ncaa.com/march-madness">March Madness</a> is happening right now. In celebration, <a href="https://www.google.com">Google</a> uploaded <a href="https://console.cloud.google.com/launcher/details/ncaa-bb-public/ncaa-basketball">massive basketball datasets</a> from the <a href="https://www.ncaa.com">NCAA</a> and <a href="https://www.sportradar.com/">Sportradar</a> to <a href="https://cloud.google.com/bigquery/">BigQuery</a> for anyone to query and experiment with. After learning that the <a href="https://www.reddit.com/r/bigquery/comments/82nz17/dataset_statistics_for_ncaa_mens_and_womens/">dataset had location data</a> on where basketball shots were taken on the court, I played with it, and a couple of hours later I had created a decent heat map data visualization. The next day, I <a href="https://www.reddit.com/r/dataisbeautiful/comments/837qnu/heat_map_of_1058383_basketball_shots_from_ncaa/">posted it</a> to Reddit&rsquo;s <a href="https://www.reddit.com/r/dataisbeautiful">/r/dataisbeautiful subreddit</a> where it earned about <strong>40,000 upvotes</strong>. (!?)</p>
<p>Let&rsquo;s dig a little deeper. Although visualizing basketball shots has been <a href="http://www.slate.com/blogs/browbeat/2012/03/06/mapping_the_nba_how_geography_can_teach_players_where_to_shoot.html">done</a> <a href="http://toddwschneider.com/posts/ballr-interactive-nba-shot-charts-with-r-and-shiny/">before</a>, this time we have access to an order of magnitude more public data to do some really cool stuff.</p>
<h2 id="full-court">Full Court</h2>
<p>The Sportradar play-by-play table on BigQuery, <code>mbb_pbp_sr</code>, has more than 1 million NCAA men&rsquo;s basketball shots since the 2013-2014 season, with more being added now during March Madness. Here&rsquo;s a heat map of the locations where those shots were attempted on the full basketball court:</p>
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_count_attempts_unlog_hu_35ce830f74de77b5.webp 320w,/2018/03/basketball-shots/ncaa_count_attempts_unlog_hu_ff7511dcccb6bf50.webp 768w,/2018/03/basketball-shots/ncaa_count_attempts_unlog_hu_c03f9beaec2e4059.webp 1024w,/2018/03/basketball-shots/ncaa_count_attempts_unlog.png 1800w" src="ncaa_count_attempts_unlog.png"/> 
</figure>

<p>We can clearly see at a glance that the majority of shots are taken right in front of the basket. For 3-point shots, the center and the corners have higher numbers of attempts than the other areas. But we can&rsquo;t see much else, since the data is so spatially skewed; setting the bin color scale to logarithmic makes trends more apparent (and helps things go viral on Reddit).</p>
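<p>The heat maps themselves use the same <code>geom_bin2d</code>-with-a-log-scale approach as my other posts; a rough sketch, where <code>df_shots</code> and the <code>shot_x</code>/<code>shot_y</code> coordinate columns are stand-ins for the data exported from BigQuery, not the exact schema:</p>
<pre><code class="language-r">library(tidyverse)

# df_shots: one row per shot attempt; shot_x/shot_y are assumed names
# for the on-court coordinates of each attempt.
plot &lt;- ggplot(df_shots, aes(x = shot_x, y = shot_y)) +
  geom_bin2d(bins = 100) +
  scale_fill_viridis_c(option = "inferno", trans = "log10", labels = scales::comma) +
  coord_fixed()   # keep the court's true aspect ratio
</code></pre>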
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_count_attempts_hu_3a087234886ce568.webp 320w,/2018/03/basketball-shots/ncaa_count_attempts_hu_31931a7d73c00179.webp 768w,/2018/03/basketball-shots/ncaa_count_attempts_hu_39e87b359975bcd4.webp 1024w,/2018/03/basketball-shots/ncaa_count_attempts.png 1800w" src="ncaa_count_attempts.png"/> 
</figure>

<p>Now there&rsquo;s more going on here: shot behavior is clearly symmetric on each side of the court, and there&rsquo;s a small gap between the 3-point line and where 3-pt shots are typically taken, likely to ensure the shot isn&rsquo;t accidentally ruled a 2-pt shot.</p>
<p>How likely is it to score a shot from a given spot? Are certain spots better than others?</p>
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_count_perc_success_hu_1a20df6dc8d568f.webp 320w,/2018/03/basketball-shots/ncaa_count_perc_success_hu_72c3f2cbec0a75d8.webp 768w,/2018/03/basketball-shots/ncaa_count_perc_success_hu_308287fdb103668e.webp 1024w,/2018/03/basketball-shots/ncaa_count_perc_success.png 1800w" src="ncaa_count_perc_success.png"/> 
</figure>

<p>Surprisingly, shot accuracy is about <em>equal</em> from anywhere within typical shooting distance, except directly in front of the basket where it&rsquo;s much higher. What is the <a href="https://en.wikipedia.org/wiki/Expected_value">expected value</a> of a shot at a given position: that is, how many points on average will a shot from that spot earn for the team?</p>
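<p>Computing that expected value per location is straightforward once each attempt is scored by its point value; a sketch, again with assumed column names (<code>shot_made</code>, <code>three_point_shot</code>):</p>
<pre><code class="language-r"># Average points per attempt (0 if missed) within each spatial bin.
df_shots &lt;- df_shots %&gt;%
  mutate(points = ifelse(shot_made, ifelse(three_point_shot, 3, 2), 0))

plot &lt;- ggplot(df_shots, aes(x = shot_x, y = shot_y, z = points)) +
  stat_summary_2d(fun = mean, bins = 100) +
  scale_fill_viridis_c(option = "inferno") +
  coord_fixed()
</code></pre>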
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_count_avg_points_hu_cc6b1aabe2a1fbbd.webp 320w,/2018/03/basketball-shots/ncaa_count_avg_points_hu_48fa925084585c1d.webp 768w,/2018/03/basketball-shots/ncaa_count_avg_points_hu_e4e431e478a401a7.webp 1024w,/2018/03/basketball-shots/ncaa_count_avg_points.png 1800w" src="ncaa_count_avg_points.png"/> 
</figure>

<p>The average points earned for 3-pt shots is about 1.5x higher than for many 2-pt locations in the inner court, which follows from the roughly equal accuracy, but locations right next to the basket have an even higher expected value. Perhaps the accuracy of shots close to the basket is more than 1.5x that of 3-pt shots, outweighing the lower point value?</p>
<p>Since both sides of the court are indeed the same, we can combine the two sides and just plot a half-court instead. (Cross-court shots, which many Redditors <a href="https://www.reddit.com/r/dataisugly/comments/839rax/basketball_heat_map_shows_an_impressive_number_of/">argued</a> invalidated my visualizations above, constitute only <em>0.16%</em> of the basketball shots in the dataset, so they can be safely removed as outliers.)</p>
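<p>Folding the two ends together amounts to rotating far-side shots 180 degrees about center court; a sketch, assuming coordinates in feet on a 94 ft by 50 ft court (both the dimensions and the column names are assumptions):</p>
<pre><code class="language-r"># Rotate shots on the far side 180 degrees about center court so both
# ends overlap; court dimensions and coordinate units are assumptions.
court_length &lt;- 94
court_width  &lt;- 50

df_half &lt;- df_shots %&gt;%
  mutate(far_side = shot_x &gt; court_length / 2,
         shot_x = ifelse(far_side, court_length - shot_x, shot_x),
         shot_y = ifelse(far_side, court_width - shot_y, shot_y))
</code></pre>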
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_count_attempts_half_log_hu_1b25bb288c7845a4.webp 320w,/2018/03/basketball-shots/ncaa_count_attempts_half_log_hu_1c576186de477a2e.webp 768w,/2018/03/basketball-shots/ncaa_count_attempts_half_log_hu_f23437ee277976f3.webp 1024w,/2018/03/basketball-shots/ncaa_count_attempts_half_log.png 1200w" src="ncaa_count_attempts_half_log.png"/> 
</figure>

<p>There are still a few oddities, such as shots being made <em>behind</em> the basket. Let&rsquo;s drill down a bit.</p>
<h2 id="focusing-on-basketball-shot-type">Focusing on Basketball Shot Type</h2>
<p>The Sportradar dataset classifies a shot as one of 5 major types: a <strong>jump shot</strong> where the player jumps and throws the basketball, a <strong>layup</strong> where the player runs toward the basket and throws a one-handed shot, a <strong>dunk</strong> where the player slams the ball into the basket (looking cool in the process), a <strong>hook shot</strong> where a player close to the basket throws the ball with a hooking arm motion, and a <strong>tip shot</strong> where the player catches a rebound near the rim and tips it back in.</p>
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_types_prop_attempts_hu_5b2e2e8111e12e08.webp 320w,/2018/03/basketball-shots/ncaa_types_prop_attempts_hu_ced73cb24cc6fc7d.webp 768w,/2018/03/basketball-shots/ncaa_types_prop_attempts_hu_baa56eb71d1a510d.webp 1024w,/2018/03/basketball-shots/ncaa_types_prop_attempts.png 1200w" src="ncaa_types_prop_attempts.png"/> 
</figure>

<p>However, the most frequent types of shots are the less flashy, more practical jump shots and layups. But is a certain type of shot &ldquo;better?&rdquo;</p>
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_types_perc_hu_eddd49d65debceac.webp 320w,/2018/03/basketball-shots/ncaa_types_perc_hu_7ec71b6836db1818.webp 768w,/2018/03/basketball-shots/ncaa_types_perc_hu_bb58c5550052e5d8.webp 1024w,/2018/03/basketball-shots/ncaa_types_perc.png 1200w" src="ncaa_types_perc.png"/> 
</figure>

<p>Layups are safer than jump shots, but dunks are the most accurate of all the types (however, players likely wouldn&rsquo;t attempt a dunk unless they knew it would be successful). The accuracy of layups and other close-to-basket shots is indeed more than 1.5x that of the jump shots used for 3-pt attempts, which explains the expected value behavior above.</p>
<p>Plotting the heat maps for each type of shot offers more insight into how they work:</p>
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_count_attempts_half_types_log_hu_f158f6e3a8368a14.webp 320w,/2018/03/basketball-shots/ncaa_count_attempts_half_types_log_hu_21a49f6411f78b6.webp 768w,/2018/03/basketball-shots/ncaa_count_attempts_half_types_log.png 900w" src="ncaa_count_attempts_half_types_log.png"/> 
</figure>

<p>The heat maps are wildly different and match the shot type descriptions above, but they show that we&rsquo;ll need to separate the data visualizations by shot type to see trends accurately.</p>
<h2 id="impact-of-game-elapsed-time-at-time-of-shot">Impact of Game Elapsed Time At Time of Shot</h2>
<p>An NCAA basketball game lasts for 40 minutes total (2 halves of 20 minutes each), with the possibility of overtime. The <a href="https://bigquery.cloud.google.com/savedquery/4194148158:3359d86507814fb19a5997a770456baa">example BigQuery</a> for the NCAA-provided data compares the percentage of 3-point shots made during the first 35 minutes of the game versus the last 5 minutes: at the end of the game, accuracy is nearly 4 percentage points lower (31.2% vs. 35.1%). It might be interesting to facet these visualizations by the elapsed time of the game to see if there are any behavioral changes.</p>
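<p>One way to do that faceting is to bucket each shot by elapsed game time first; a minimal sketch, assuming a hypothetical <code>elapsed_minutes</code> column in the shot data:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)

# bucket shots by elapsed game time, mirroring the 35-minute/5-minute split
shots &lt;- shots %&gt;%
  mutate(time_bucket = cut(elapsed_minutes,
                           breaks = c(0, 10, 20, 30, 35, 40),
                           labels = c("0-10", "10-20", "20-30", "30-35", "35-40"),
                           include.lowest = TRUE))
</code></pre></div>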
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_types_prop_type_elapsed_hu_bb28d87a78c18d3f.webp 320w,/2018/03/basketball-shots/ncaa_types_prop_type_elapsed_hu_b1bc08ac4dea3c7c.webp 768w,/2018/03/basketball-shots/ncaa_types_prop_type_elapsed_hu_d69cf0b659690837.webp 1024w,/2018/03/basketball-shots/ncaa_types_prop_type_elapsed.png 1200w" src="ncaa_types_prop_type_elapsed.png"/> 
</figure>

<p>There isn&rsquo;t much difference between the proportions within a given half, but there is a difference between the first half and the second half: the second half has fewer jump shots and more aggressive layups and dunks. Next, looking at shot success percentage:</p>
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_types_perc_success_type_elapsed_hu_92a660a371b13a60.webp 320w,/2018/03/basketball-shots/ncaa_types_perc_success_type_elapsed_hu_8687c28a1832735b.webp 768w,/2018/03/basketball-shots/ncaa_types_perc_success_type_elapsed_hu_de114505630e7a6f.webp 1024w,/2018/03/basketball-shots/ncaa_types_perc_success_type_elapsed.png 1200w" src="ncaa_types_perc_success_type_elapsed.png"/> 
</figure>

<p>The jump shot accuracy loss at the end of the game with Sportradar data is similar to that of the NCAA data, which is a good sanity check (but it&rsquo;s odd that the accuracy drop only happens in the last 5 minutes and not elsewhere in the 2nd half). Layup accuracy increases in the second half, along with the number of layups attempted.</p>
<p>We can also visualize heat maps for each combo of shot type with time elapsed bucket, but given the results above, the changes in behavior over time may not be very perceptible.</p>
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_count_attempts_half_interval_log_hu_87f66d471a4c95fc.webp 320w,/2018/03/basketball-shots/ncaa_count_attempts_half_interval_log_hu_d5cd2612709d9ea.webp 768w,/2018/03/basketball-shots/ncaa_count_attempts_half_interval_log_hu_8e1f44bad4069e9f.webp 1024w,/2018/03/basketball-shots/ncaa_count_attempts_half_interval_log.png 1200w" src="ncaa_count_attempts_half_interval_log.png"/> 
</figure>

<h2 id="impact-of-winninglosing-before-shot">Impact of Winning/Losing Before Shot</h2>
<p>Another theory worth exploring is whether shot behavior differs when a team is winning or losing at the time of the shot (technically, whether the delta between the shooting team&rsquo;s score and the opposing team&rsquo;s score is positive for winning teams, negative for losing teams, or 0 if tied). Are players more relaxed when they have a lead? Are players more prone to making mistakes when losing?</p>
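<p>A minimal sketch of that bucketing, assuming hypothetical <code>team_score</code> and <code>opponent_score</code> columns that reflect the score just before the shot:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)

# classify each shot by the shooting team score differential at the time of the shot
shots &lt;- shots %&gt;%
  mutate(score_delta = team_score - opponent_score,
         game_state = case_when(
           score_delta &gt; 0 ~ "Winning",
           score_delta &lt; 0 ~ "Losing",
           TRUE            ~ "Tied"
         ))
</code></pre></div>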
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_types_prop_type_score_hu_29c29d850235c76d.webp 320w,/2018/03/basketball-shots/ncaa_types_prop_type_score_hu_4c6a81e571854d10.webp 768w,/2018/03/basketball-shots/ncaa_types_prop_type_score_hu_5205e23cfda70f5a.webp 1024w,/2018/03/basketball-shots/ncaa_types_prop_type_score.png 1200w" src="ncaa_types_prop_type_score.png"/> 
</figure>

<p>Layups are the same across all buckets, but for teams that are winning, there are fewer jump shots and <strong>more dunkin&rsquo; action</strong> (nearly double the dunks!). However, the accuracy chart illustrates an issue:</p>
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_types_perc_success_type_score_hu_31d0201603d0a7d7.webp 320w,/2018/03/basketball-shots/ncaa_types_perc_success_type_score_hu_bafe4c92c10d1157.webp 768w,/2018/03/basketball-shots/ncaa_types_perc_success_type_score_hu_8e7746842c943e81.webp 1024w,/2018/03/basketball-shots/ncaa_types_perc_success_type_score.png 1200w" src="ncaa_types_perc_success_type_score.png"/> 
</figure>

<p>Accuracy for most types of shots is much better for teams that are winning&hellip;which may be the <em>reason</em> they&rsquo;re winning. More research can be done in this area.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I fully admit I am not a basketball expert. But playing around with this data was a fun way to get a new perspective on how collegiate basketball games work. There&rsquo;s a lot more work that can be done with big basketball data and game strategy; the NCAA-provided data doesn&rsquo;t have location data, but it does have <strong>6x more shots</strong>, which will be very helpful for further fun in this area.</p>
<hr>
<p><em>You can view the R code, ggplot2 code, and BigQueries used to create the data visualizations in <a href="http://minimaxir.com/notebooks/basketball-shots/">this R Notebook</a>. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/ncaa-basketball">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
<p><em>Special thanks to Ewen Gallic for his implementation of a <a href="http://egallic.fr/en/drawing-a-basketball-court-with-r/">basketball court in ggplot2</a>, which saved me a lot of time!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>A Visual Overview of Stack Overflow&#39;s Question Tags</title>
      <link>https://minimaxir.com/2018/02/stack-overflow-questions/</link>
      <pubDate>Fri, 09 Feb 2018 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/02/stack-overflow-questions/</guid>
      <description>I was surprised to see that all types of programming languages have quick answer times and a high probability of receiving an acceptable answer!</description>
      <content:encoded><![CDATA[<p><a href="https://stackoverflow.com">Stack Overflow</a> is the most popular contemporary knowledge base for programming questions. But most interact with the site by Googling a programming question and getting a top result that links to SO. There isn&rsquo;t as much discussion about actually <em>asking</em> questions on the site.</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/python_last_list_hu_25d38cdb30d0498f.webp 320w,/2018/02/stack-overflow-questions/python_last_list_hu_379aa50fc7ec9a0a.webp 768w,/2018/02/stack-overflow-questions/python_last_list_hu_28ba6b374bd5a225.webp 1024w,/2018/02/stack-overflow-questions/python_last_list.png 1686w" src="python_last_list.png"/> 
</figure>

<p>I <em>could</em> use <a href="https://stackoverflow.com/users/9314418/minimaxir?tab=profile">my Stack Overflow account</a> and test out the process of creating a question, but <del>I already know everything about programming</del> there may be another way to learn how SO works. Stack Overflow <a href="https://archive.org/details/stackexchange">releases an archive</a> of all questions on the site every 3 months, and this archive is <a href="https://cloud.google.com/bigquery/public-data/stackoverflow">syndicated to BigQuery</a>, making it trivial to retrieve and analyze the millions of SO questions over the years. Even though (now-former) Stack Overflow data scientist <a href="https://twitter.com/drob">David Robinson</a> has written <a href="https://stackoverflow.blog/2017/09/06/incredible-growth-python/">many</a> <a href="https://stackoverflow.blog/2017/04/19/programming-languages-used-late-night/">interesting</a> blog posts for Stack Overflow with their data, I figured why not give it a try.</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/python_last_list_answer_hu_eb0af2ca58a32eeb.webp 320w,/2018/02/stack-overflow-questions/python_last_list_answer_hu_a239dff4552731b7.webp 768w,/2018/02/stack-overflow-questions/python_last_list_answer_hu_c17d3dec6132cd9f.webp 1024w,/2018/02/stack-overflow-questions/python_last_list_answer.png 1670w" src="python_last_list_answer.png"/> 
</figure>

<h2 id="overview">Overview</h2>
<p>Unlike social media sites like <a href="https://twitter.com">Twitter</a> and <a href="https://www.reddit.com">Reddit</a> where the majority of traffic is driven within the first days after something is posted, posts on evergreen content sources like Stack Overflow are still relevant many years later. In fact, the traffic to Stack Overflow for most of 2017 (derived by finding the difference between question view counts from archive snapshots) is approximately uniform across question age, with a slight bias toward older content.</p>
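<p>A minimal sketch of that derivation, assuming two hypothetical snapshot data frames, each with <code>question_id</code>, <code>view_count</code>, and <code>creation_year</code> columns:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)

# approximate 2017 views per question by differencing view counts between snapshots
views_2017 &lt;- snapshot_end_2017 %&gt;%
  inner_join(snapshot_start_2017, by = "question_id", suffix = c("_end", "_start")) %&gt;%
  mutate(views_2017 = view_count_end - view_count_start,
         question_age_years = 2017 - creation_year)
</code></pre></div>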
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/so_overview_hu_ccb1bb5b14e0f490.webp 320w,/2018/02/stack-overflow-questions/so_overview_hu_fd5456b53e8a3d50.webp 768w,/2018/02/stack-overflow-questions/so_overview_hu_b48cb8326f951666.webp 1024w,/2018/02/stack-overflow-questions/so_overview.png 1200w" src="so_overview.png"/> 
</figure>

<p>In 2017, Stack Overflow received about 40k-50k new questions each week, an impressive feat:</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/weekly_count_hu_f42f46bbf2c0045c.webp 320w,/2018/02/stack-overflow-questions/weekly_count_hu_adafdf8ff991a648.webp 768w,/2018/02/stack-overflow-questions/weekly_count_hu_20ee00d40fdeabb2.webp 1024w,/2018/02/stack-overflow-questions/weekly_count.png 1200w" src="weekly_count.png"/> 
</figure>

<p>For the rest of this post, we&rsquo;ll only look at questions made in 2017 (until December; about 2.3 million questions total) in order to get a sense of the current development landscape, and what&rsquo;s to come in the future. But what types of questions are they?</p>
<h2 id="tag-breakdown">Tag Breakdown</h2>
<p>All questions on Stack Overflow are required to have at least 1 tag indicating the programming language/technologies involved with the question, and can have up to 5 tags. In the example &ldquo;how do you get the last element of a list in Python&rdquo; <a href="https://stackoverflow.com/questions/930397/getting-the-last-element-of-a-list-in-python">question</a> above, the tags are <code>python</code>, <code>list</code>, and <code>indexing</code>. In 2017, most new questions had 2-3 tags (i.e. people aren&rsquo;t <a href="http://minimaxir.com/2014/03/hashtag-tag/">tag spamming</a> like on <a href="https://www.instagram.com/?hl=en">Instagram</a> for maximum exposure).</p>
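<p>A minimal sketch of that tag-count breakdown, assuming a hypothetical <code>questions</code> data frame whose <code>tags</code> column is pipe-delimited as in the BigQuery export (e.g. <code>python|list|indexing</code>):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)

# count how many tags each question has, then tally questions per tag count
tag_count_breakdown &lt;- questions %&gt;%
  mutate(n_tags = lengths(strsplit(tags, "|", fixed = TRUE))) %&gt;%
  count(n_tags)
</code></pre></div>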
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/so_tag_breakdown_hu_824c3cbac84d4ce6.webp 320w,/2018/02/stack-overflow-questions/so_tag_breakdown_hu_35c62637eb6e12ac.webp 768w,/2018/02/stack-overflow-questions/so_tag_breakdown_hu_41d81ccb55b35e25.webp 1024w,/2018/02/stack-overflow-questions/so_tag_breakdown.png 1200w" src="so_tag_breakdown.png"/> 
</figure>

<p>In theory, tag spamming might make a question more likely to be answered; however, for all tag counts, the proportion of questions with an accepted answer (the green checkmark) is <strong>36-39%</strong>, so there&rsquo;s not much practical benefit from minmaxing tag counts. Which types of tagged questions are most likely to be answered?</p>
<p>First, here&rsquo;s the breakdown of the top 40 tags on Stack Overflow, by the number of new questions containing that tag for each month throughout 2017. This can give a sense of each technology&rsquo;s growth/decline throughout the year.</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/monthly_count_tag_hu_ea69cdded812352f.webp 320w,/2018/02/stack-overflow-questions/monthly_count_tag_hu_10da23bf89790b71.webp 768w,/2018/02/stack-overflow-questions/monthly_count_tag_hu_67b73cc591239cf1.webp 1024w,/2018/02/stack-overflow-questions/monthly_count_tag.png 1800w" src="monthly_count_tag.png"/> 
</figure>

<p>Both new web development technologies like <code>reactjs</code> and <code>typescript</code> and data science tools like <code>pandas</code> and <code>r</code> are trending upward.</p>
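<p>Next, let&rsquo;s look at which tags are most likely to get their questions answered. A hedged sketch of how a per-tag acceptance rate could be aggregated (the <code>questions</code> data frame, its <code>accepted_answer_id</code> column, and the popularity cutoff below are assumptions, not the exact queries used for this post):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)
library(tidyr)

# proportion of questions with an accepted answer, per tag
tag_accept_rates &lt;- questions %&gt;%
  separate_rows(tags, sep = "\\|") %&gt;%
  group_by(tags) %&gt;%
  summarize(n_questions = n(),
            prop_accepted = mean(!is.na(accepted_answer_id))) %&gt;%
  filter(n_questions &gt;= 1000) %&gt;%   # keep popular tags only; the threshold is arbitrary
  arrange(desc(prop_accepted))
</code></pre></div>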
<p>For the Top 1,000 tags, here are the top 30 tags by the proportion of questions which received an acceptable answer:</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/acceptable_answer_top_30_hu_3fe7bc1f073db8d2.webp 320w,/2018/02/stack-overflow-questions/acceptable_answer_top_30_hu_f71e8403d24ba45c.webp 768w,/2018/02/stack-overflow-questions/acceptable_answer_top_30_hu_e89eb6c7dcf96060.webp 1024w,/2018/02/stack-overflow-questions/acceptable_answer_top_30.png 1800w" src="acceptable_answer_top_30.png"/> 
</figure>

<p>In contrast, here are the bottom 30 out of the Top 1,000:</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/acceptable_answer_bottom_30_hu_97b990139e2d88c3.webp 320w,/2018/02/stack-overflow-questions/acceptable_answer_bottom_30_hu_e4bfbf35b53fc86b.webp 768w,/2018/02/stack-overflow-questions/acceptable_answer_bottom_30_hu_32acaa74db309a7c.webp 1024w,/2018/02/stack-overflow-questions/acceptable_answer_bottom_30.png 1800w" src="acceptable_answer_bottom_30.png"/> 
</figure>

<p>The top tags are newer, sexier technologies like <code>rust</code> and <code>dart</code>, with another strong hint of data science tooling with <code>dplyr</code> (which I used to aggregate the data for this post!) and <code>data.table</code>. In contrast, the bottom tags are less sexy and more corporate like <code>salesforce</code>, <code>drupal</code>, and <code>sharepoint-2013</code> (that&rsquo;s why consultants who specialize in these technologies can get paid very well!).</p>
<p>It should be noted these two charts do not necessarily imply that one technology is &ldquo;better&rdquo; than another; the difference in answer rates may be due to question difficulty and the number of people skilled in that technology who are available to answer it effectively.</p>
<p>The timing when questions are asked might vary by tag. Per <a href="https://stackoverflow.blog/2017/04/19/programming-languages-used-late-night/">a Stack Overflow analysis</a>, people typically ask questions during the 9 AM - 5 PM work hours (although in my case, I cannot easily adjust for the time zone of the asker). How does this data fare?</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/monthly_count_hr_doy_hu_fae4937bf0f0691e.webp 320w,/2018/02/stack-overflow-questions/monthly_count_hr_doy_hu_40befef30e7c85c5.webp 768w,/2018/02/stack-overflow-questions/monthly_count_hr_doy_hu_7c83e2680a6fe00.webp 1024w,/2018/02/stack-overflow-questions/monthly_count_hr_doy.png 1800w" src="monthly_count_hr_doy.png"/> 
</figure>

<p>This visualization is a bit weird. I adjusted the times to Eastern time since internet activity for U.S.-based websites tends to revolve around that time zone. But for most technologies, the peak question-asking times are well before the 9 AM to 5 PM block: do those technologies see greater use in Europe and Asia? (In contrast, data-oriented technologies like <code>r</code>, <code>pandas</code> and <code>excel</code> <em>do</em> peak during the 9-5 block.)</p>
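<p>For reference, the time zone adjustment itself is a one-liner with lubridate; a minimal sketch, assuming a hypothetical UTC <code>creation_date</code> column:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)
library(lubridate)

# convert UTC creation timestamps to US Eastern time and extract the hour of day
questions &lt;- questions %&gt;%
  mutate(creation_et = with_tz(creation_date, tzone = "America/New_York"),
         hour_et = hour(creation_et))
</code></pre></div>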
<h2 id="how-easy-is-it-to-get-an-answer-by-tag">How easy is it to get an answer by tag?</h2>
<p>Stack Overflow tailors the homepage toward the logged-in user&rsquo;s recommended tags. Therefore, it&rsquo;s not a surprise that the distributions of view counts on 2017 questions for each tag are very similar, although there is a slight edge toward the new &ldquo;hip&rdquo; technologies like <code>typescript</code>, <code>spring</code>, and <code>swift</code>.</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/views_boxplot_tag_hu_78b7bfb6f63173af.webp 320w,/2018/02/stack-overflow-questions/views_boxplot_tag_hu_add2aa16b5291c89.webp 768w,/2018/02/stack-overflow-questions/views_boxplot_tag_hu_f3845f5a14be4e23.webp 1024w,/2018/02/stack-overflow-questions/views_boxplot_tag.png 1800w" src="views_boxplot_tag.png"/> 
</figure>

<p>At the least, the distribution ensures that at least 10 people see your question for these popular topics, which is nifty when you consider posts on Twitter and Reddit can die without any visibility at all. But will those viewers provide an acceptable answer?</p>
<p>The time it takes to get an acceptable answer also varies significantly by tag:</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/acceptable_answer_density_hu_441417d4b0b9dbfd.webp 320w,/2018/02/stack-overflow-questions/acceptable_answer_density_hu_7e30755ce8384eeb.webp 768w,/2018/02/stack-overflow-questions/acceptable_answer_density_hu_538cf9d028958aed.webp 1024w,/2018/02/stack-overflow-questions/acceptable_answer_density.png 1800w" src="acceptable_answer_density.png"/> 
</figure>

<p>A median time of <em>15 minutes</em> for tags like <code>pandas</code> and <code>arrays</code> is pretty impressive! And even in the worst case scenario for these popular tags, the median is only a couple hours, much lower than I thought it would be.</p>
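<p>For reference, a hedged sketch of how that time-to-accepted-answer could be computed, assuming hypothetical <code>questions</code> and <code>answers</code> data frames (the column names are also assumptions):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)
library(tidyr)

# median hours from question creation to its accepted answer, per tag
answer_latency &lt;- questions %&gt;%
  inner_join(answers, by = c("accepted_answer_id" = "answer_id"),
             suffix = c("_q", "_a")) %&gt;%
  mutate(hours_to_accept = as.numeric(difftime(creation_date_a, creation_date_q,
                                               units = "hours"))) %&gt;%
  separate_rows(tags, sep = "\\|") %&gt;%
  group_by(tags) %&gt;%
  summarize(median_hours = median(hours_to_accept))
</code></pre></div>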
<h2 id="the-relationship-between-tags">The Relationship Between Tags</h2>
<p>As one would expect, the types of questions asked for each tag are very different. Here&rsquo;s a wordcloud for each of the tags, quantifying the words most frequently used in the questions with those tags:</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/so_tag_wordcloud_hu_8ba9e0f7676ec6b7.webp 320w,/2018/02/stack-overflow-questions/so_tag_wordcloud_hu_2078ca85488e7569.webp 768w,/2018/02/stack-overflow-questions/so_tag_wordcloud_hu_a7c21a23620e1454.webp 1024w,/2018/02/stack-overflow-questions/so_tag_wordcloud.png 1800w" src="so_tag_wordcloud.png"/> 
</figure>

<p>Notably, the word clouds are significantly different from each other, even when technologies are related (also surprisingly true in the case of <code>angular</code> and <code>angularjs</code>!).</p>
<p>How are the tags related anyways? We can calculate an <a href="https://en.wikipedia.org/wiki/Adjacency_matrix">adjacency matrix</a> of the tag pairs in the questions to see which tags are related:</p>
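<p>That pair counting might look like the hedged sketch below (again assuming a hypothetical <code>questions</code> frame with <code>id</code> and pipe-delimited <code>tags</code> columns):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)
library(tidyr)

# one row per (question, tag), then self-join to count tag co-occurrences
question_tags &lt;- questions %&gt;%
  separate_rows(tags, sep = "\\|") %&gt;%
  select(id, tags)

tag_adjacency &lt;- question_tags %&gt;%
  inner_join(question_tags, by = "id", suffix = c("_a", "_b")) %&gt;%
  filter(tags_a &lt; tags_b) %&gt;%   # keep each unordered pair once
  count(tags_a, tags_b, sort = TRUE)
</code></pre></div>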
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/so_tag_adjacency_hu_6c0ca82329fcb525.webp 320w,/2018/02/stack-overflow-questions/so_tag_adjacency_hu_51742ee9039c83b6.webp 768w,/2018/02/stack-overflow-questions/so_tag_adjacency_hu_1cf472d92985bb8e.webp 1024w,/2018/02/stack-overflow-questions/so_tag_adjacency.png 1800w" src="so_tag_adjacency.png"/> 
</figure>

<p>Looking down a given row/column, you can see which technologies have a lot of questions in common with another (for example, <code>javascript</code> and <code>json</code> are frequently asked in conjunction with other tags).</p>
<p>Going back to the earlier discussion of tag spamming, does the presence of certain pairs of tags lead to notably different answer rates?</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/so_tag_adjacency_percent_hu_f1f8dada071ec058.webp 320w,/2018/02/stack-overflow-questions/so_tag_adjacency_percent_hu_89c242977eb1efb1.webp 768w,/2018/02/stack-overflow-questions/so_tag_adjacency_percent_hu_5603c58116c008e5.webp 1024w,/2018/02/stack-overflow-questions/so_tag_adjacency_percent.png 1800w" src="so_tag_adjacency_percent.png"/> 
</figure>

<p>Tag pairs which don&rsquo;t make much sense (e.g. <code>ios</code>+<code>android</code>, <code>ios</code>+<code>javascript</code>, <code>android</code>+<code>php</code>) tend to have very low answer rates (20%-30%). But tags with already high answer rates like <code>regex</code> don&rsquo;t get much higher or much lower at a given pair.</p>
<h2 id="conclusion">Conclusion</h2>
<p>There&rsquo;s a lot more than can be done looking at question tags on Stack Overflow. I was surprised to see that all types of programming languages have quick answer times and a high probability of receiving an acceptable answer! I&rsquo;ll definitely keep an eye on the SO archives as they are released, and I&rsquo;m excited to see how trends change in the future.</p>
<hr>
<p><em>You can view the R and ggplot2 code used to create the data visualizations in <a href="http://minimaxir.com/notebooks/stack-overflow-questions/">this R Notebook</a>. You can also view the images/data used for this post in <a href="https://github.com/minimaxir/stack-overflow-questions">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>How to Make High Quality Data Visualizations for Websites With R and ggplot2</title>
      <link>https://minimaxir.com/2017/08/ggplot2-web/</link>
      <pubDate>Mon, 14 Aug 2017 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/08/ggplot2-web/</guid>
      <description>In general, it takes little additional effort to make something unique with ggplot2, and the effort is well worth it.</description>
      <content:encoded><![CDATA[<p>If you&rsquo;ve been following my blog, I like to use <a href="https://cran.r-project.org">R</a> and <a href="http://ggplot2.tidyverse.org/reference/">ggplot2</a> for data visualization. A lot.</p>
<p>One of my older blog posts, <a href="http://minimaxir.com/2015/02/ggplot-tutorial/">An Introduction on How to Make Beautiful Charts With R and ggplot2</a>, is still one of my most-trafficked posts years later, and even today I see techniques from that particular post incorporated into modern data visualizations on sites such as <a href="https://www.reddit.com">Reddit&rsquo;s</a> <a href="https://www.reddit.com/r/dataisbeautiful/">/r/dataisbeautiful</a> subreddit.</p>
<p>However, that post is a little outdated. Thanks to a few updates to ggplot2 since then and other advances in data visualization best-practices, making pretty charts for websites/blogs using R and ggplot2 is even easier, quicker, <em>and</em> more fun!</p>
<h2 id="quick-introduction-to-ggplot2">Quick Introduction to ggplot2</h2>
<p>ggplot2 uses a more concise setup toward creating charts as opposed to the more imperative style of Python&rsquo;s <a href="https://matplotlib.org">matplotlib</a> and base R. And it also includes a few example datasets for practicing ggplot2 functionality; for example, the <code>mpg</code> dataset is a <a href="http://ggplot2.tidyverse.org/reference/mpg.html">dataset</a> of the performance of popular models of cars in 1999 and 2008.</p>
<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/mpg_hu_a640dba59b901764.webp 320w,/2017/08/ggplot2-web/mpg_hu_27b6e8ca229c6f49.webp 768w,/2017/08/ggplot2-web/mpg_hu_cbb195b8dd54f306.webp 1024w,/2017/08/ggplot2-web/mpg.png 1376w" src="mpg.png"/> 
</figure>

<p>Let&rsquo;s say you want to create a <a href="https://en.wikipedia.org/wiki/Scatter_plot">scatter plot</a>. Following <a href="http://ggplot2.tidyverse.org/reference/geom_smooth.html">a great example</a> from the ggplot2 documentation, let&rsquo;s plot the highway mileage of the car vs. the <a href="https://en.wikipedia.org/wiki/Engine_displacement">volume displacement</a> of the engine. In ggplot2, first you instantiate the chart with the <code>ggplot()</code> function, specifying the source dataset and the core aesthetics you want to plot, such as x, y, color, and fill. In this case, we set the core aesthetics to x = displacement and y = mileage, and add a <code>geom_point()</code> layer to make a scatter plot:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">mpg</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">displ</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">hwy</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">			<span class="nf">geom_point</span><span class="p">()</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/plot1_hu_cdb77dbedd0d1aec.webp 320w,/2017/08/ggplot2-web/plot1_hu_e036e0d8db01be8d.webp 768w,/2017/08/ggplot2-web/plot1.png 994w" src="plot1.png"/> 
</figure>

<p>As we can see, there is a negative correlation between the two metrics. I&rsquo;m sure you&rsquo;ve seen plots like these around the internet before. But with only a couple of lines of code, you can make them look more contemporary.</p>
<p>ggplot2 lets you add a well-designed theme with just one line of code. Relatively new to <code>ggplot2</code> is <code>theme_minimal()</code>, which <a href="http://ggplot2.tidyverse.org/reference/ggtheme.html">generates</a> a muted style similar to <a href="http://fivethirtyeight.com">FiveThirtyEight</a>&rsquo;s modern data visualizations:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">theme_minimal</span><span class="p">()</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/plot2_hu_1abcf13146957fb5.webp 320w,/2017/08/ggplot2-web/plot2_hu_70ccfd8927b0ba23.webp 768w,/2017/08/ggplot2-web/plot2.png 994w" src="plot2.png"/> 
</figure>

<p>But we can still add color. Setting a color aesthetic on a character/categorical variable will set the colors of the corresponding points, making it easy to differentiate at a glance.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">mpg</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">displ</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">hwy</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">class</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">			<span class="nf">geom_point</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">			<span class="nf">theme_minimal</span><span class="p">()</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/plot3_hu_c34c19184e3ebcf2.webp 320w,/2017/08/ggplot2-web/plot3_hu_f25d6095028fc6d5.webp 768w,/2017/08/ggplot2-web/plot3.png 994w" src="plot3.png"/> 
</figure>

<p>Adding the color aesthetic certainly makes things much prettier. ggplot2 automatically adds a legend for the colors as well.
However, for this particular visualization, it is difficult to see trends in the points for each class. An easy way around this is to add a <a href="https://en.wikipedia.org/wiki/Least_squares">least squares regression</a> trendline for each class <a href="http://ggplot2.tidyverse.org/reference/geom_smooth.html">using</a> <code>geom_smooth()</code> (which normally adds a smoothed line, but since there isn&rsquo;t a lot of data for each group, we force it to a linear model and do not plot confidence intervals).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">	<span class="nf">geom_smooth</span><span class="p">(</span><span class="n">method</span> <span class="o">=</span> <span class="s">&#34;lm&#34;</span><span class="p">,</span> <span class="n">se</span> <span class="o">=</span> <span class="bp">F</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/plot4_hu_d736cf0961bca1f5.webp 320w,/2017/08/ggplot2-web/plot4_hu_6e954c53dc4d849a.webp 768w,/2017/08/ggplot2-web/plot4.png 994w" src="plot4.png"/> 
</figure>

<p>Pretty neat, and now comparative trends are much more apparent! For example, pickups and SUVs have similar efficiency, which makes intuitive sense.</p>
<p>The chart axes should be labeled (<em>always</em> label your charts!). All the typical labels, like <code>title</code>, <code>x</code>-axis, and <code>y</code>-axis, can be done with the <code>labs()</code> function. But relatively new to ggplot2 are the <code>subtitle</code> and <code>caption</code> fields, both of which do what you expect:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">&#34;Efficiency of Popular Models of Cars&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">subtitle</span><span class="o">=</span><span class="s">&#34;By Class of Car&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">x</span><span class="o">=</span><span class="s">&#34;Engine Displacement (liters)&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">y</span><span class="o">=</span><span class="s">&#34;Highway Miles per Gallon&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">caption</span><span class="o">=</span><span class="s">&#34;by Max Woolf — minimaxir.com&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/plot5_hu_d6d809535fa48bb6.webp 320w,/2017/08/ggplot2-web/plot5_hu_68db8294c51b2638.webp 768w,/2017/08/ggplot2-web/plot5.png 994w" src="plot5.png"/> 
</figure>

<p>That&rsquo;s a pretty good start. Now let&rsquo;s take it to the next level.</p>
<h2 id="how-to-save-a-ggplot2-chart-for-web">How to Save A ggplot2 chart For Web</h2>
<p>Something surprisingly undiscussed in the field of data visualization is how to <em>save</em> a chart as a high quality image file. For example, with <a href="https://products.office.com/en-us/excel">Excel</a> charts, Microsoft <a href="https://support.office.com/en-us/article/Save-a-chart-as-a-picture-in-Excel-for-Windows-254bbf9a-1ce1-459f-914a-4902e8ca9217">officially recommends</a> to copy the chart, <em>paste it as an image back into Excel</em>, then save the pasted image, without having any control over image quality and size in the browser (the <em>real</em> best way to save an Excel/<a href="https://www.apple.com/numbers/">Numbers</a> chart as an image for a webpage is to copy/paste the chart object into a <a href="https://products.office.com/en-us/powerpoint">PowerPoint</a>/<a href="https://www.apple.com/keynote/">Keynote</a> slide, and export <em>the slide</em> as an image. This also makes it extremely easy to annotate/brand said chart beforehand in PowerPoint/Keynote).</p>
<p>R IDEs such as <a href="https://www.rstudio.com">RStudio</a> have a chart-saving UI with the typical size/filetype options. But if you save an image from this UI, the shapes and texts of the resulting image will be heavily aliased (R <a href="https://danieljhocking.wordpress.com/2013/03/12/high-resolution-figures-in-r/">renders images at 72 dpi</a> by default, which is much lower than that of modern HiDPI/Retina displays).</p>
<p>The data visualizations used earlier in this post were generated in-line as a part of an <a href="http://rmarkdown.rstudio.com/r_notebooks.html">R Notebook</a>, but it is surprisingly difficult to extract the generated chart as a separate file. But ggplot2 also has <code>ggsave()</code>, which saves the image to disk using antialiasing and makes the fonts/shapes in the chart look much better, and assumes a default dpi of 300. Saving charts using <code>ggsave()</code>, and adjusting the sizes of the text and geoms to compensate for the higher dpi, makes the charts look very presentable. A width of 4 and a height of 3 results in a 1200x900px image, which if posted on a blog with a content width of ~600px (like mine), will render at full resolution on HiDPI/Retina displays, or downsample appropriately otherwise. Due to modern PNG compression, the file size/bandwidth cost for using larger images is minimal.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">mpg</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">displ</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">hwy</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">class</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">geom_smooth</span><span class="p">(</span><span class="n">method</span> <span class="o">=</span> <span class="s">&#34;lm&#34;</span><span class="p">,</span> <span class="n">se</span><span class="o">=</span><span class="bp">F</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="m">0.5</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">geom_point</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">0.5</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">theme_minimal</span><span class="p">(</span><span class="n">base_size</span><span class="o">=</span><span class="m">9</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">&#34;Efficiency of Popular Models of Cars&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">subtitle</span><span class="o">=</span><span class="s">&#34;By Class of Car&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">x</span><span class="o">=</span><span class="s">&#34;Engine Displacement (liters)&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">y</span><span class="o">=</span><span class="s">&#34;Highway Miles per Gallon&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">caption</span><span class="o">=</span><span class="s">&#34;by Max Woolf — minimaxir.com&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">ggsave</span><span class="p">(</span><span class="s">&#34;tutorial-0.png&#34;</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="m">4</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="m">3</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-0_hu_3491aa5980f9ce57.webp 320w,/2017/08/ggplot2-web/tutorial-0_hu_386b410431f20644.webp 768w,/2017/08/ggplot2-web/tutorial-0_hu_3025381ee3b4d2f8.webp 1024w,/2017/08/ggplot2-web/tutorial-0.png 1200w" src="tutorial-0.png"/> 
</figure>

<p>Compare to the previous non-ggsave chart, which is more blurry around text/shapes:</p>
<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/plot5_hu_d6d809535fa48bb6.webp 320w,/2017/08/ggplot2-web/plot5_hu_68db8294c51b2638.webp 768w,/2017/08/ggplot2-web/plot5.png 994w" src="plot5.png"/> 
</figure>

<p>For posterity, here&rsquo;s the same chart saved at 1200x900px using the RStudio image-saving UI:</p>
<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/plot-1200-900_hu_2be01453db7558b.webp 320w,/2017/08/ggplot2-web/plot-1200-900_hu_9f79679f34f611e0.webp 768w,/2017/08/ggplot2-web/plot-1200-900_hu_69a1ff889438a21b.webp 1024w,/2017/08/ggplot2-web/plot-1200-900.png 1200w" src="plot-1200-900.png"/> 
</figure>

<p>Note that the antialiasing optimizations assume that you are <em>not</em> uploading the final chart to a service like <a href="https://medium.com">Medium</a> or <a href="https://wordpress.com">WordPress.com</a>, which will compress the images and reduce the quality anyways. But if you are uploading it to Reddit or self-hosting your own blog, it&rsquo;s definitely worth it.</p>
<h2 id="fancy-fonts">Fancy Fonts</h2>
<p>Changing the chart font is another way to add a personal flair.
Theme functions like <code>theme_minimal()</code> accept a <code>base_family</code> parameter. With that, you can specify any font family as the default instead of the base sans-serif. (On Windows, you may need to install the <code>extrafont</code> package first). Fonts from <a href="https://fonts.google.com">Google Fonts</a> are free and work easily with ggplot2 once installed. For example, we can use <a href="https://fonts.google.com/specimen/Roboto">Roboto</a>, Google&rsquo;s modern font which has also been getting a lot of usage on <a href="https://stackoverflow.com">Stack Overflow</a>&rsquo;s great ggplot2 <a href="https://stackoverflow.blog/2017/06/15/developers-use-spaces-make-money-use-tabs/">data visualizations</a>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">theme_minimal</span><span class="p">(</span><span class="n">base_size</span><span class="o">=</span><span class="m">9</span><span class="p">,</span> <span class="n">base_family</span><span class="o">=</span><span class="s">&#34;Roboto&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-1_hu_895dfabb6331218f.webp 320w,/2017/08/ggplot2-web/tutorial-1_hu_1014960e9eb00de2.webp 768w,/2017/08/ggplot2-web/tutorial-1_hu_283f71e45e79c23c.webp 1024w,/2017/08/ggplot2-web/tutorial-1.png 1200w" src="tutorial-1.png"/> 
</figure>

<p>A general text design guideline is to use fonts of different weights/widths for different hierarchies of content. In this case, we can use a bolder condensed font for the title, and deemphasize the subtitle and caption using lighter colors, all done using the <code>theme()</code> <a href="http://ggplot2.tidyverse.org/reference/theme.html">function</a>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">theme</span><span class="p">(</span><span class="n">plot.subtitle</span> <span class="o">=</span> <span class="nf">element_text</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">&#34;#666666&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">          <span class="n">plot.title</span> <span class="o">=</span> <span class="nf">element_text</span><span class="p">(</span><span class="n">family</span><span class="o">=</span><span class="s">&#34;Roboto Condensed Bold&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">          <span class="n">plot.caption</span> <span class="o">=</span> <span class="nf">element_text</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">&#34;#AAAAAA&#34;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="m">6</span><span class="p">))</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-2_hu_45115eb223bac5fe.webp 320w,/2017/08/ggplot2-web/tutorial-2_hu_96b9283a20212470.webp 768w,/2017/08/ggplot2-web/tutorial-2_hu_ec38c0bbfa9bd892.webp 1024w,/2017/08/ggplot2-web/tutorial-2.png 1200w" src="tutorial-2.png"/> 
</figure>

<p>It&rsquo;s worth noting that data visualizations posted on websites should be easily <em>legible</em> for mobile-device users as well, hence the intentional use of larger fonts relative to charts typically produced in the desktop-oriented Excel.</p>
<p>Additionally, all theming options can be set as a session default at the beginning of a script using <code>theme_set()</code>, saving even more time instead of having to recreate the theme for each chart.</p>
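<p>For example, the defaults used above could be set once at the top of the script; a minimal sketch:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(ggplot2)

# set a session-wide default theme instead of re-adding it to every chart
theme_set(theme_minimal(base_size = 9, base_family = "Roboto") +
            theme(plot.title = element_text(family = "Roboto Condensed Bold"),
                  plot.subtitle = element_text(color = "#666666"),
                  plot.caption = element_text(color = "#AAAAAA", size = 6)))
</code></pre></div>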
<h2 id="the-ggplot2-colors">The &ldquo;ggplot2 colors&rdquo;</h2>
<p>The &ldquo;ggplot2 colors&rdquo; for categorical variables are infamous for being the primary indicator of a chart being made with ggplot2. But there is a science to it; ggplot2 by default selects colors using the <code>scale_color_hue()</code> <a href="http://ggplot2.tidyverse.org/reference/scale_hue.html">function</a>, which selects colors in the HCL space by changing the hue [H] between 0 and 360, keeping chroma [C] and luminance [L] constant. As a result, ggplot2 selects the most <em>distinct</em> colors possible while keeping lightness constant. For example, if you have 2 different categories, ggplot2 chooses the colors with h = 0 and h = 180; if 3 colors, h = 0, h = 120, h = 240, etc.</p>
<p>It&rsquo;s smart, but does make a given chart lose distinctness when many other ggplot2 charts use the same selection methodology. A quick way to take advantage of this hue dispersion while still making the colors unique is to change the lightness; by default, <code>l = 65</code>, but setting it slightly lower will make the charts look more professional/<a href="https://www.bloomberg.com">Bloomberg</a>-esque.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p_color</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">        <span class="nf">scale_color_hue</span><span class="p">(</span><span class="n">l</span> <span class="o">=</span> <span class="m">40</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-4_hu_264938515f752543.webp 320w,/2017/08/ggplot2-web/tutorial-4_hu_eb1a54c4bb2c178d.webp 768w,/2017/08/ggplot2-web/tutorial-4_hu_f9b6b0a558fcfa8b.webp 1024w,/2017/08/ggplot2-web/tutorial-4.png 1200w" src="tutorial-4.png"/> 
</figure>

<h2 id="rcolorbrewer">RColorBrewer</h2>
<p>Another coloring option for ggplot2 charts are the <a href="http://colorbrewer2.org/#type=sequential&amp;scheme=BuGn&amp;n=3">ColorBrewer</a> palettes implemented with the <code>RColorBrewer</code> package, which are supported natively in ggplot2 with functions such as <code>scale_color_brewer()</code>. The sequential palettes like &ldquo;Blues&rdquo; and &ldquo;Greens&rdquo; do what the name implies:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p_color</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">        <span class="nf">scale_color_brewer</span><span class="p">(</span><span class="n">palette</span><span class="o">=</span><span class="s">&#34;Blues&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-5_hu_1b83219640238315.webp 320w,/2017/08/ggplot2-web/tutorial-5_hu_437d57ae9e55c1f4.webp 768w,/2017/08/ggplot2-web/tutorial-5_hu_48a903b9c7756119.webp 1024w,/2017/08/ggplot2-web/tutorial-5.png 1200w" src="tutorial-5.png"/> 
</figure>

<p>A famous diverging palette for visualizations on /r/dataisbeautiful is the &ldquo;Spectral&rdquo; palette, which is a lighter rainbow (recommended for dark backgrounds).</p>
<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-6_hu_b0f4861a140f3ca2.webp 320w,/2017/08/ggplot2-web/tutorial-6_hu_2bea8aa9ecf4b77f.webp 768w,/2017/08/ggplot2-web/tutorial-6_hu_8c72b6730d700f72.webp 1024w,/2017/08/ggplot2-web/tutorial-6.png 1200w" src="tutorial-6.png"/> 
</figure>

<p>However, while the charts look pretty, it&rsquo;s difficult to tell the categories apart. The qualitative palettes fix this problem, and have more distinct possibilities than the <code>scale_color_hue()</code> approach mentioned earlier.</p>
<p>Here are 3 examples of qualitative palettes, &ldquo;Set1&rdquo;, &ldquo;Set2&rdquo;, and &ldquo;Set3&rdquo;; use whichever fits your preference.</p>
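<p>Applying them is the same one-line change as with the sequential palettes (shown here with &ldquo;Set1&rdquo;; swap in &ldquo;Set2&rdquo; or &ldquo;Set3&rdquo; as desired):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">p_color &lt;- p +
    scale_color_brewer(palette = "Set1")
</code></pre></div>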
<p><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-7_hu_25cbbc089de8f962.webp 320w,/2017/08/ggplot2-web/tutorial-7_hu_a16288142b4ca2c9.webp 768w,/2017/08/ggplot2-web/tutorial-7_hu_5dcf40a21178ff45.webp 1024w,/2017/08/ggplot2-web/tutorial-7.png 1200w" src="tutorial-7.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-8_hu_26d266cdc130beea.webp 320w,/2017/08/ggplot2-web/tutorial-8_hu_33341f45209f13f8.webp 768w,/2017/08/ggplot2-web/tutorial-8_hu_69ff86f540e43dba.webp 1024w,/2017/08/ggplot2-web/tutorial-8.png 1200w" src="tutorial-8.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-9_hu_8d2a2fc7ea80dd4.webp 320w,/2017/08/ggplot2-web/tutorial-9_hu_8c554b2661491e44.webp 768w,/2017/08/ggplot2-web/tutorial-9_hu_50b6d07e248d4fbd.webp 1024w,/2017/08/ggplot2-web/tutorial-9.png 1200w" src="tutorial-9.png"/> 
</figure>
</p>
<h2 id="viridis-and-accessibility">Viridis and Accessibility</h2>
<p>Let&rsquo;s mix up the visualization a bit. A rarely-used-but-very-useful ggplot2 geom is <code>geom_bin2d()</code>, which counts the number of points in a given 2d spatial area:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">mpg</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">displ</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">hwy</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">geom_bin2d</span><span class="p">(</span><span class="n">bins</span><span class="o">=</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="n">[...theming</span> <span class="n">options...]</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-tile_hu_e43b208518838250.webp 320w,/2017/08/ggplot2-web/tutorial-tile_hu_cdaf166456144b15.webp 768w,/2017/08/ggplot2-web/tutorial-tile_hu_4c5bd954986c81ea.webp 1024w,/2017/08/ggplot2-web/tutorial-tile.png 1200w" src="tutorial-tile.png"/> 
</figure>

<p>We see that the largest number of points are centered around (2,30). However, the default ggplot2 color palette for continuous variables is <em>boring</em>. Yes, we can use the RColorBrewer sequential palettes above, but as noted, they aren&rsquo;t perceptually distinct, and could cause issues for readers who are colorblind.</p>
<p>The <a href="https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html">viridis R package</a> provides a set of 4 high-contrast palettes which are very colorblind friendly, and works easily with ggplot2 by extending a <code>scale_fill_viridis()/scale_color_viridis()</code> function.</p>
<p>The default &ldquo;viridis&rdquo; palette has been increasingly popular on the web lately:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p_color</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">        <span class="nf">scale_fill_viridis</span><span class="p">(</span><span class="n">option</span><span class="o">=</span><span class="s">&#34;viridis&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-10_hu_bdab45ed149b2987.webp 320w,/2017/08/ggplot2-web/tutorial-10_hu_4f8ed1d4b0d4c15b.webp 768w,/2017/08/ggplot2-web/tutorial-10_hu_2136ce24111625f9.webp 1024w,/2017/08/ggplot2-web/tutorial-10.png 1200w" src="tutorial-10.png"/> 
</figure>

<p>&ldquo;magma&rdquo; and &ldquo;inferno&rdquo; are similar, and give the data visualization a fiery edge:</p>
<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-11_hu_891953ed03995865.webp 320w,/2017/08/ggplot2-web/tutorial-11_hu_339503edfac14382.webp 768w,/2017/08/ggplot2-web/tutorial-11_hu_b58a6f34b44b0e07.webp 1024w,/2017/08/ggplot2-web/tutorial-11.png 1200w" src="tutorial-11.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-12_hu_9576afecf56c3191.webp 320w,/2017/08/ggplot2-web/tutorial-12_hu_f3ada4502649973e.webp 768w,/2017/08/ggplot2-web/tutorial-12_hu_7b8a7ff2ebe2952f.webp 1024w,/2017/08/ggplot2-web/tutorial-12.png 1200w" src="tutorial-12.png"/> 
</figure>

<p>Lastly, &ldquo;plasma&rdquo; is a mix between the 3 palettes above:</p>
<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-13_hu_d6ee0c44a3b9408.webp 320w,/2017/08/ggplot2-web/tutorial-13_hu_1cc7fd9e09047f6f.webp 768w,/2017/08/ggplot2-web/tutorial-13_hu_cae2bedd89d23c95.webp 1024w,/2017/08/ggplot2-web/tutorial-13.png 1200w" src="tutorial-13.png"/> 
</figure>

<h2 id="next-steps">Next Steps</h2>
<p>FiveThirtyEight actually uses ggplot2 for their data journalism workflow <a href="https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/FiveThirtyEights-data-journalism-workflow-with-R?ocid=player">in an interesting way</a>; they render the base chart using ggplot2, but export it as an SVG/PDF vector file which can scale to any size, and then the design team annotates/customizes the data visualization in <a href="http://www.adobe.com/products/illustrator.html">Adobe Illustrator</a> before exporting it as a static PNG for the article (in general, I recommend using an external image editor to add text annotations to a data visualization because doing it manually in ggplot2 is inefficient).</p>
<p>For general use cases, ggplot2 has very strong defaults for beautiful data visualizations. And certainly there is a lot <em>more</em> you can do to make a visualization beautiful than what&rsquo;s listed in this post, such as using facets and tweaking parameters of geoms for further distinction, but those are more specific to a given data visualization. In general, it takes little additional effort to make something <em>unique</em> with ggplot2, and the effort is well worth it. And prettier charts are more persuasive, which is a good return-on-investment.</p>
<hr>
<p><em>You can view the R and ggplot2 code used to create the data visualizations in <a href="http://minimaxir.com/notebooks/ggplot2-web/">this R Notebook</a>. You can also view the images/data used for this post in <a href="https://github.com/minimaxir/ggplot2-web">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Predicting the Success of a Reddit Submission with Deep Learning and Keras</title>
      <link>https://minimaxir.com/2017/06/reddit-deep-learning/</link>
      <pubDate>Mon, 26 Jun 2017 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/06/reddit-deep-learning/</guid>
      <description>Thanks to Keras, performing deep learning on a very large number of Reddit submissions is actually pretty easy. Performing it &lt;em&gt;well&lt;/em&gt; is a different story.</description>
      <content:encoded><![CDATA[<p>I&rsquo;ve been trying to figure out what makes a <a href="https://www.reddit.com">Reddit</a> submission &ldquo;good&rdquo; for years. If we assume the number of upvotes on a submission is a fair proxy for submission quality, optimizing a statistical model for Reddit data with submission score as a response variable might lead to interesting (and profitable) insights when transferred into other domains, such as Facebook Likes and Twitter Favorites.</p>
<figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/reddit-example_hu_ced286403a9a1f93.webp 320w,/2017/06/reddit-deep-learning/reddit-example_hu_25e458673cf9d615.webp 768w,/2017/06/reddit-deep-learning/reddit-example_hu_5d082790fbac9a8c.webp 1024w,/2017/06/reddit-deep-learning/reddit-example.png 1202w" src="reddit-example.png"/> 
</figure>

<p>An important part of a Reddit submission is the submission <strong>title</strong>. Like news headlines, a catchy title will make a user <a href="http://minimaxir.com/2015/10/reddit-topwords/">more inclined</a> to engage with a submission and potentially upvote.</p>
<p>Additionally, the <strong>time when the submission is made</strong> is <a href="http://minimaxir.com/2015/10/reddit-bigquery/">important</a>; submitting when user activity is the highest tends to lead to better results if you are trying to maximize exposure.</p>
<p>The actual <strong>content</strong> of the Reddit submission such as images/links to a website is likewise important, but good content is relatively difficult to optimize.</p>
<p>Can the magic of deep learning reconcile these concepts and create a model which can predict if a submission is a good submission? Thanks to <a href="https://github.com/fchollet/keras">Keras</a>, performing deep learning on a very large number of Reddit submissions is actually pretty easy. Performing it <em>well</em> is a different story.</p>
<h2 id="getting-the-data--feature-engineering">Getting the Data + Feature Engineering</h2>
<p>It&rsquo;s difficult to retrieve the content of millions of Reddit submissions at scale (ethically), so let&rsquo;s start by building a model using submissions to <a href="https://www.reddit.com/r/AskReddit/">/r/AskReddit</a>: Reddit&rsquo;s largest subreddit, which receives 8,000+ submissions each day. /r/AskReddit is a self-post-only subreddit with no external links, allowing us to focus on only the submission title and timing.</p>
<p><a href="http://minimaxir.com/2015/10/reddit-bigquery/">As always</a>, we can collect large amounts of Reddit data from the public Reddit dataset on <a href="https://cloud.google.com/bigquery/">BigQuery</a>. The submission <code>title</code> is available by default. The raw timestamp of the submission is also present, allowing us to extract the <code>hour</code> of submission (adjusted to Eastern Standard Time) and <code>dayofweek</code>, as used in the heatmap above. But why stop there? Since /r/AskReddit receives hundreds of submissions <em>every hour</em> on average, we should look at the <code>minute</code> level to see if there are any deeper trends (e.g. there are only 30 slots available on the first page of /new and since there is so much submission activity, it might be more advantageous to submit during off-peak times). Lastly, to account for potential changes in behavior as the year progresses, we should add a <code>dayofyear</code> feature, where January 1st = 1, January 2nd = 2, etc which can also account for variance due to atypical days like holidays.</p>
<p>Instead of predicting the raw number of upvotes of the Reddit submission (as the distribution of submission scores is heavily skewed), we should predict <strong>whether or not the submission is good</strong>, shaping the problem as a <a href="https://en.wikipedia.org/wiki/Logistic_regression">logistic regression</a>. In this case, let&rsquo;s define a &ldquo;good submission&rdquo; as one whose score is equal to or above the <strong>50th percentile (median) of all submissions</strong> in /r/AskReddit. Unfortunately, the median score ends up being <strong>2 points</strong>; although &ldquo;one upvote&rdquo; might be a low threshold for a &ldquo;good&rdquo; submission, it splits the dataset into 64% bad submissions and 36% good submissions, and setting the percentile threshold higher would result in a very unbalanced dataset for model training (a score of 2+ also implies that the submission did not get downvoted to death, which is useful).</p>
<p>Gathering all <strong>976,538 /r/AskReddit submissions</strong> from January 2017 to April 2017 should be enough data for this project. Here&rsquo;s the final BigQuery:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="n">standardSQL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">CAST</span><span class="p">(</span><span class="n">FORMAT_TIMESTAMP</span><span class="p">(</span><span class="s1">&#39;%H&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">TIMESTAMP_SECONDS</span><span class="p">(</span><span class="n">created_utc</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;America/New_York&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">INT64</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">hour</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">CAST</span><span class="p">(</span><span class="n">FORMAT_TIMESTAMP</span><span class="p">(</span><span class="s1">&#39;%M&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">TIMESTAMP_SECONDS</span><span class="p">(</span><span class="n">created_utc</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;America/New_York&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">INT64</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">minute</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">CAST</span><span class="p">(</span><span class="n">FORMAT_TIMESTAMP</span><span class="p">(</span><span class="s1">&#39;%w&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">TIMESTAMP_SECONDS</span><span class="p">(</span><span class="n">created_utc</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;America/New_York&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">INT64</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">dayofweek</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">CAST</span><span class="p">(</span><span class="n">FORMAT_TIMESTAMP</span><span class="p">(</span><span class="s1">&#39;%j&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">TIMESTAMP_SECONDS</span><span class="p">(</span><span class="n">created_utc</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;America/New_York&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">INT64</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">dayofyear</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">IF</span><span class="p">(</span><span class="n">PERCENT_RANK</span><span class="p">()</span><span class="w"> </span><span class="n">OVER</span><span class="w"> </span><span class="p">(</span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">score</span><span class="w"> </span><span class="k">ASC</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="mi">0</span><span class="p">.</span><span class="mi">50</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">is_top_submission</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">reddit_posts</span><span class="p">.</span><span class="o">*`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">WHERE</span><span class="w"> </span><span class="p">(</span><span class="n">_TABLE_SUFFIX</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s1">&#39;2017_01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s1">&#39;2017_04&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="n">subreddit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;AskReddit&#39;</span><span class="w">
</span></span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/bigquery_hu_89adb35f6f1860d6.webp 320w,/2017/06/reddit-deep-learning/bigquery_hu_bb2e955b3cb7daeb.webp 768w,/2017/06/reddit-deep-learning/bigquery_hu_fa76341d390d603.webp 1024w,/2017/06/reddit-deep-learning/bigquery.png 2104w" src="bigquery.png"/> 
</figure>

<h2 id="model-architecture">Model Architecture</h2>
<p><em>If you want to see the detailed data transformations and Keras code examples/outputs for this post, you can view <a href="https://github.com/minimaxir/predict-reddit-submission-success/blob/master/predict_askreddit_submission_success_timing.ipynb">this Jupyter Notebook</a>.</em></p>
<p>Text processing is a good use case for deep learning, as it can identify relationships between words where older methods like <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">tf-idf</a> can&rsquo;t. Keras, a high-level deep learning framework built on top of lower-level frameworks like <a href="https://www.tensorflow.org">TensorFlow</a>, can easily convert a list of texts into a <a href="https://keras.io/preprocessing/sequence/">padded sequence</a> of <a href="https://keras.io/preprocessing/text/">index tokens</a> that deep learning models can consume, along with many other benefits. Data scientists often use <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">recurrent neural networks</a>, which can &ldquo;learn&rdquo; sequential relationships between words, for classifying text. However, <a href="https://github.com/facebookresearch/fastText">fasttext</a>, a newer algorithm from researchers at Facebook, can perform classification tasks with an <a href="http://minimaxir.com/2017/06/keras-cntk/">order of magnitude faster</a> training time than RNNs, with similar predictive performance.</p>
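<p>As a rough illustration of that preprocessing step, here is a minimal sketch (variable names and values are illustrative, not taken from the original notebook) of converting raw titles into padded index sequences with the Keras text utilities:</p>
<pre><code class="language-python">from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

titles = ["What movie would you watch on repeat?",
          "What is the best purchase you have ever made?"]  # illustrative titles

# Keep the 40,000 most frequent words and pad/truncate each title to 20 tokens,
# matching the vocabulary and length limits described for the model below.
tokenizer = Tokenizer(num_words=40000)
tokenizer.fit_on_texts(titles)

sequences = tokenizer.texts_to_sequences(titles)
X_titles = pad_sequences(sequences, maxlen=20)  # shape: (num_titles, 20)
</code></pre>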
<p>fasttext works by <a href="https://arxiv.org/abs/1607.01759">averaging word vectors</a>. In this Reddit model architecture inspired by the <a href="https://github.com/fchollet/keras/blob/master/examples/imdb_fasttext.py">official Keras fasttext example</a>, each word in a Reddit submission title (up to 20) is mapped to a 50-dimensional vector from an Embeddings layer of up to 40,000 words. The Embeddings layer is <a href="https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html">initialized</a> with <a href="https://nlp.stanford.edu/projects/glove/">GloVe word embeddings</a> pre-trained on billions of words to give the model a good start. All the word vectors for a given Reddit submission title are averaged together, and then a Dense fully-connected layer outputs a probability the given text is a good submission. The gradients then backpropagate and improve the word embeddings for future batches during training.</p>
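<p>In Keras functional API terms, that text branch might look roughly like the sketch below; the layer names and the <code>glove_matrix</code> placeholder are assumptions, not the exact code from the notebook:</p>
<pre><code class="language-python">import numpy as np
from keras.layers import Input, Embedding, GlobalAveragePooling1D, Dense

vocab_size, embed_dim, max_len = 40000, 50, 20
glove_matrix = np.zeros((vocab_size, embed_dim))  # placeholder; fill with pre-trained GloVe vectors

title_in = Input(shape=(max_len,), dtype='int32', name='title')

# Map each token index to a 50D vector, initialized with the GloVe weights.
title_emb = Embedding(vocab_size, embed_dim, input_length=max_len,
                      weights=[glove_matrix])(title_in)

# fasttext-style averaging of all word vectors in the title.
title_avg = GlobalAveragePooling1D()(title_emb)

# Auxiliary output: probability of a good submission from the title alone.
aux_out = Dense(1, activation='sigmoid', name='aux_out')(title_avg)
</code></pre>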
<p>Keras has a <a href="https://keras.io/visualization/">convenient utility</a> to visualize deep learning models:</p>
<figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/model_shapes-1_hu_b9f7a08f534a0b45.webp 320w,/2017/06/reddit-deep-learning/model_shapes-1.png 663w" src="model_shapes-1.png"/> 
</figure>

<p>However, the first output above is the <em>auxiliary output</em> for <a href="https://en.wikipedia.org/wiki/Regularization_%28mathematics%29">regularizing</a> the word embeddings; we still have to incorporate the submission timing data into the model.</p>
<p>Each of the four timing features (hour, minute, day of week, day of year) receives its own Embeddings layer, outputting a 64D vector. This allows the features to learn latent characteristics which may be missed using traditional <a href="http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html">one-hot encoding</a> for categorical data in machine learning problems.</p>
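<p>A sketch of those timing branches, with embedding input sizes chosen to match each feature&rsquo;s range (again, an approximation rather than the notebook&rsquo;s exact code):</p>
<pre><code class="language-python">from keras.layers import Input, Embedding, Flatten

def timing_branch(name, num_categories, embed_dim=64):
    """One Input + Embedding branch per timing feature, flattened to a 64D vector."""
    inp = Input(shape=(1,), dtype='int32', name=name)
    vec = Flatten()(Embedding(num_categories, embed_dim)(inp))
    return inp, vec

hour_in, hour_vec = timing_branch('hour', 24)
minute_in, minute_vec = timing_branch('minute', 60)
dayofweek_in, dayofweek_vec = timing_branch('dayofweek', 7)
dayofyear_in, dayofyear_vec = timing_branch('dayofyear', 367)  # days 1-366; index 0 unused
</code></pre>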
<figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/model_shapes-2_hu_52d718feedd74c43.webp 320w,/2017/06/reddit-deep-learning/model_shapes-2_hu_84d0630736ebd887.webp 768w,/2017/06/reddit-deep-learning/model_shapes-2_hu_f74f2c7dacf4dc23.webp 1024w,/2017/06/reddit-deep-learning/model_shapes-2.png 1754w" src="model_shapes-2.png"/> 
</figure>

<p>The 50D word average vector is concatenated with the four vectors above, resulting in a 306D vector. This combined vector is connected to another fully-connected layer which can account for hidden interactions between all five input features (plus <a href="https://keras.io/layers/normalization/">batch normalization</a>, which improves training speed for Dense layers). Then the model outputs a final probability prediction: the <em>main output</em>.</p>
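<p>Concatenating the five branches and adding the main output could then look like this sketch; the hidden layer size, activation, and optimizer are assumptions:</p>
<pre><code class="language-python">from keras.layers import concatenate, BatchNormalization, Dense
from keras.models import Model

# 50D title average + four 64D timing vectors = 306D combined vector.
merged = concatenate([title_avg, hour_vec, minute_vec, dayofweek_vec, dayofyear_vec])

hidden = BatchNormalization()(Dense(128, activation='relu')(merged))
main_out = Dense(1, activation='sigmoid', name='main_out')(hidden)

model = Model(inputs=[title_in, hour_in, minute_in, dayofweek_in, dayofyear_in],
              outputs=[main_out, aux_out])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
</code></pre>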
<figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/model_shapes-3_hu_d2ecf94768050fa.webp 320w,/2017/06/reddit-deep-learning/model_shapes-3_hu_e208de51b840cc8a.webp 768w,/2017/06/reddit-deep-learning/model_shapes-3.png 852w" src="model_shapes-3.png"/> 
</figure>

<p>The final model:</p>
<figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/model_hu_ea04eea1eca03032.webp 320w,/2017/06/reddit-deep-learning/model_hu_6adb1a1bee6dfcb9.webp 768w,/2017/06/reddit-deep-learning/model_hu_b6ceee5bdac0e8e1.webp 1024w,/2017/06/reddit-deep-learning/model.png 1350w" src="model.png"/> 
</figure>

<p>All of this sounds difficult to implement, but Keras&rsquo;s <a href="https://keras.io/getting-started/functional-api-guide/">functional API</a> ensures that adding each layer and linking them together can be done in a single line of code each.</p>
<h2 id="training-results">Training Results</h2>
<p>Because the model uses no recurrent layers, it trains fast enough on a CPU despite the large dataset size.</p>
<p>We split the full dataset into 80%/20% training/test datasets, training the model on the former and testing it against the latter. Keras trains a model with a simple <code>fit</code> command; here, the model trains for 20 epochs, where one epoch represents an entire pass over the training set.</p>
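<p>A minimal sketch of that split-and-fit step, assuming the arrays from the earlier snippets plus numpy arrays of the timing features and labels (the array names and batch size are assumptions):</p>
<pre><code class="language-python">from sklearn.model_selection import train_test_split

# X_hour, X_minute, X_dayofweek, X_dayofyear, y: column-vector numpy arrays from the BigQuery export.
splits = train_test_split(X_titles, X_hour, X_minute, X_dayofweek, X_dayofyear, y,
                          test_size=0.2, random_state=42)
(X_titles_tr, X_titles_te, X_hour_tr, X_hour_te, X_minute_tr, X_minute_te,
 X_dow_tr, X_dow_te, X_doy_tr, X_doy_te, y_tr, y_te) = splits

# The model has two outputs (main + auxiliary), so the label array is passed twice.
model.fit([X_titles_tr, X_hour_tr, X_minute_tr, X_dow_tr, X_doy_tr], [y_tr, y_tr],
          validation_data=([X_titles_te, X_hour_te, X_minute_te, X_dow_te, X_doy_te],
                           [y_te, y_te]),
          epochs=20, batch_size=512)
</code></pre>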
<figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/fit_hu_c4b22cd471fd6b14.webp 320w,/2017/06/reddit-deep-learning/fit_hu_25fedd4b89849374.webp 768w,/2017/06/reddit-deep-learning/fit_hu_408a494d98bce4d7.webp 1024w,/2017/06/reddit-deep-learning/fit.png 1236w" src="fit.png"/> 
</figure>

<p>There&rsquo;s a lot happening in the console output due to the architecture, but the main metrics of interest are the <code>main_out_acc</code>, the accuracy of the training set through the main output, and <code>val_main_out_acc</code>, the accuracy of the test set. Ideally, the accuracy of both should increase as training progresses. However, the test accuracy <em>must</em> be better than the 64% baseline (if we just say all /r/AskReddit submissions are bad), otherwise this model is unhelpful.</p>
<p>Keras&rsquo;s <a href="https://keras.io/callbacks/#csvlogger">CSVLogger</a> trivially logs all these metrics to a CSV file. Plotting the results of the 20 epochs:</p>
<figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/predict-reddit-1_hu_5671c8a5110b2d25.webp 320w,/2017/06/reddit-deep-learning/predict-reddit-1_hu_dab24707e22e81d.webp 768w,/2017/06/reddit-deep-learning/predict-reddit-1_hu_325aebfe1b36135c.webp 1024w,/2017/06/reddit-deep-learning/predict-reddit-1.png 1200w" src="predict-reddit-1.png"/> 
</figure>

<p>The test accuracy does indeed beat the 64% baseline; however, test accuracy <em>decreases</em> as training progresses. This is a sign of <a href="https://en.wikipedia.org/wiki/Overfitting">overfitting</a>, possibly due to the potential disparity between texts in the training and test sets. In deep learning, you can account for overfitting by adding <a href="https://keras.io/layers/core/#dropout">Dropout</a> to relevant layers, but in my testing it did not help.</p>
<h2 id="using-the-model-to-optimize-reddit-submissions">Using The Model To Optimize Reddit Submissions</h2>
<p>At the least, we now have a model that understands the latent characteristics of an /r/AskReddit submission. But how do you apply the model <em>in practical, real-world situations</em>?</p>
<p>Let&rsquo;s take a random /r/AskReddit submission: <a href="https://www.reddit.com/r/AskReddit/comments/5odcpd/which_movies_plot_would_drastically_change_if_you/">Which movie&rsquo;s plot would drastically change if you removed a letter from its title?</a>, submitted Monday, January 16th at 3:46 PM EST and receiving 4 upvotes (a &ldquo;good&rdquo; submission in context of this model). Plugging those input variables into the trained model results in a <strong>0.669</strong> probability of it being considered a good submission, which is consistent with the true results.</p>
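<p>A sketch of that prediction, reusing the tokenizer and model from the earlier snippets (the input ordering and the main output coming first are assumptions about the notebook&rsquo;s model):</p>
<pre><code class="language-python">import numpy as np
from keras.preprocessing.sequence import pad_sequences

title = "Which movie's plot would drastically change if you removed a letter from its title?"
title_seq = pad_sequences(tokenizer.texts_to_sequences([title]), maxlen=20)

# Monday, January 16th at 3:46 PM EST: hour=15, minute=46, dayofweek=1 (Sunday=0), dayofyear=16.
timing = [np.array([[15]]), np.array([[46]]), np.array([[1]]), np.array([[16]])]

main_prob, aux_prob = model.predict([title_seq] + timing)
print(main_prob[0][0])  # probability of a "good" submission, e.g. ~0.669
</code></pre>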
<p>But what if we made <em>minor, iterative changes</em> to the title while keeping the time submitted unchanged? Can we improve this probability?</p>
<p>&ldquo;Drastically&rdquo; is a silly adjective; removing it and using the title <strong>Which movie&rsquo;s plot would change if you removed a letter from its title?</strong> results in a greater probability of <strong>0.682</strong>.</p>
<p>&ldquo;Removed&rdquo; is <a href="http://www.ef.edu/english-resources/english-grammar/conditional/">grammatically incorrect</a>; fixing the issue and using the title <strong>Which movie&rsquo;s plot would change if you remove a letter from its title?</strong> results in a greater probability of <strong>0.692</strong>.</p>
<p>&ldquo;Which&rdquo; is also <a href="https://www.englishclub.com/vocabulary/wh-question-words.htm">grammatically incorrect</a>; fixing the issue and using the title <strong>What movie&rsquo;s plot would change if you remove a letter from its title?</strong> results in a greater probability of <strong>0.732</strong>.</p>
<p>Although adjectives are sometimes redundant, they can add an intriguing emphasis; adding a &ldquo;single&rdquo; and using the title <strong>What movie&rsquo;s plot would change if you remove a single letter from its title?</strong> results in a greater probability of <strong>0.753</strong>.</p>
<p>Not bad for a little workshopping!</p>
<p>Now that we have an improved title, we can find an optimal time to make the submission through brute force by calculating the probabilities for all combinations of hour, minute, and day of week (and offsetting the day of year appropriately). After doing so, I discovered that making the submission on the previous Sunday at 10:55 PM EST results in the maximum probability possible of being a good submission at <strong>0.841</strong> (the other top submission times are at various other minutes during that hour; the best time on a different day is the following Tuesday at 4:05 AM EST with a probability of <strong>0.823</strong>).</p>
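<p>The brute-force search itself is easy to sketch with the trained model; this version sweeps every day-of-week/hour/minute combination for the week of the submission (the variable names and day-of-year offset are assumptions):</p>
<pre><code class="language-python">import itertools
import numpy as np

# Every (dayofweek, hour, minute) combination: 7 * 24 * 60 = 10,080 candidate times.
combos = list(itertools.product(range(7), range(24), range(60)))
dow = np.array([c[0] for c in combos]).reshape(-1, 1)
hour = np.array([c[1] for c in combos]).reshape(-1, 1)
minute = np.array([c[2] for c in combos]).reshape(-1, 1)
doy = 15 + dow  # offset day-of-year to the matching calendar day (January 15th, 2017 was a Sunday)

title_reps = np.repeat(title_seq, len(combos), axis=0)

main_probs, _ = model.predict([title_reps, hour, minute, dow, doy], batch_size=4096)
best = int(np.argmax(main_probs.ravel()))
print(combos[best], main_probs.ravel()[best])  # best (dayofweek, hour, minute) and its probability
</code></pre>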
<p>In all, this model of Reddit submission success prediction is a proof of concept; there are many, <em>many</em> optimizations that can be done on the feature engineering side and on the data collection side (especially if we want to model subreddits other than /r/AskReddit). Predicting which submissions go viral, instead of just predicting which submissions receive at least one upvote, is another, more advanced problem entirely.</p>
<p>Thanks to the high-level abstractions and utility functions of Keras, I was able to prototype the initial model in an afternoon instead of the weeks/months required for academic papers and software applications in this area. At the least, this little experiment serves as an example of applying Keras to a real-world dataset, and of the tradeoffs that result when deep learning can&rsquo;t magically solve everything. But that doesn&rsquo;t mean my experiments on the Reddit data were unproductive; on the contrary, I now have a few clever new ideas for fixing some of the issues discovered, which I hope to implement soon.</p>
<p>Again, I strongly recommend reading the data transformations and Keras code examples in <a href="https://github.com/minimaxir/predict-reddit-submission-success/blob/master/predict_askreddit_submission_success_timing.ipynb">this Jupyter Notebook</a> for more detail on the methodology, as building modern deep learning models is more intuitive and less arcane than what thought pieces on Medium imply.</p>
<hr>
<p><em>You can view the R and ggplot2 code used to visualize the model data in <a href="http://minimaxir.com/notebooks/predict-reddit-submission-success/">this R Notebook</a>, including 2D projections of the Embedding layers not in this article. You can also view the images/data used for this post in <a href="https://github.com/minimaxir/predict-reddit-submission-success">this GitHub repository</a>.</em></p>
<p><em>You are free to use the data visualizations/model architectures from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>The Decline of Imgur on Reddit and the Rise of Reddit&#39;s Native Image Hosting</title>
      <link>https://minimaxir.com/2017/06/imgur-decline/</link>
      <pubDate>Tue, 20 Jun 2017 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/06/imgur-decline/</guid>
      <description>Before Reddit added native image hosting, Imgur accounted for 15% of all submissions to Reddit. Now it&amp;rsquo;s below 9%.</description>
      <content:encoded><![CDATA[<p>Last week, Bloomberg <a href="https://www.bloomberg.com/news/articles/2017-06-17/reddit-said-to-be-raising-funds-valuing-startup-at-1-7-billion">reported</a> that Reddit was raising about $150 Million in venture capital at a valuation of $1.7 billion. Since Reddit&rsquo;s data is <a href="http://minimaxir.com/2015/10/reddit-bigquery/">public on BigQuery</a>, I quickly checked if there were any recent user engagement growth spurts which could justify such a high worth. Here&rsquo;s an example BigQuery which aggregates the total number of Reddit submissions made for each month until the end of April 2017:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="n">standardSQL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">DATE_TRUNC</span><span class="p">(</span><span class="nb">DATE</span><span class="p">(</span><span class="n">TIMESTAMP_SECONDS</span><span class="p">(</span><span class="n">created_utc</span><span class="p">)),</span><span class="w"> </span><span class="k">MONTH</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">mon</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">num_submissions</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">reddit_posts</span><span class="p">.</span><span class="o">*`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">WHERE</span><span class="w"> </span><span class="p">(</span><span class="n">_TABLE_SUFFIX</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s1">&#39;2016_01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s1">&#39;2017_04&#39;</span><span class="w"> </span><span class="k">OR</span><span class="w"> </span><span class="n">_TABLE_SUFFIX</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;full_corpus_201512&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">mon</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">mon</span><span class="w">
</span></span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/reddit-1_hu_50c3dc5d7726f37e.webp 320w,/2017/06/imgur-decline/reddit-1_hu_3424cd96f290c9b5.webp 768w,/2017/06/imgur-decline/reddit-1_hu_d72ee57a46a0b1c1.webp 1024w,/2017/06/imgur-decline/reddit-1.png 1500w" src="reddit-1.png"/> 
</figure>

<p>As it turns out, Reddit did indeed get a large boost in activity toward the end of 2016, likely due to the <em>heated</em> discussions and events around the <a href="https://en.wikipedia.org/wiki/United_States_presidential_election,_2016">U.S. Presidential Election</a>. But Reddit has maintained the growth rate since then, which is very appealing to potential investors.</p>
<p>How are other sites benefiting from Reddit&rsquo;s growth? <a href="http://imgur.com">Imgur</a>, an image host developed to be the <em>de facto</em> image hosting service for Reddit, shared in Reddit&rsquo;s continual growth&hellip;</p>
<figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/reddit-2_hu_f32c314c9db02d44.webp 320w,/2017/06/imgur-decline/reddit-2_hu_3de34fd78ea0813b.webp 768w,/2017/06/imgur-decline/reddit-2_hu_e9950c21b9870ca4.webp 1024w,/2017/06/imgur-decline/reddit-2.png 1500w" src="reddit-2.png"/> 
</figure>

<p>&hellip;until mid-2016, when Imgur submission activity abruptly dropped. What happened?</p>
<p>Coincidentally in mid-2016, Reddit <a href="https://techcrunch.com/2016/05/25/reddit-image-uploads/">made itself</a> an image host for submissions to the site. Initially limited to uploads via the iOS/Android apps, Reddit then allowed desktop users to upload images through a <a href="https://www.reddit.com/r/changelog/comments/4kuk2j/reddit_change_introducing_image_uploading_beta/">beta rollout</a> starting May 24th, and a full <a href="https://www.reddit.com/r/announcements/comments/4p5dm9/image_hosting_on_reddit/">sitewide release</a> on June 21st.</p>
<p>How many Reddit-hosted image submissions are there compared to the number of Imgur submissions?</p>
<figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/reddit-3_hu_e253b95a38033adc.webp 320w,/2017/06/imgur-decline/reddit-3_hu_b700c86581397587.webp 768w,/2017/06/imgur-decline/reddit-3_hu_f675746a4ace3aec.webp 1024w,/2017/06/imgur-decline/reddit-3.png 1500w" src="reddit-3.png"/> 
</figure>

<p>Wow, native Reddit images caught on.</p>
<h2 id="market-share">Market Share</h2>
<figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/pics_hu_254562dd20df73be.webp 320w,/2017/06/imgur-decline/pics_hu_a0d29b8f5ec323e1.webp 768w,/2017/06/imgur-decline/pics_hu_822ab094826f3973.webp 1024w,/2017/06/imgur-decline/pics.png 1101w" src="pics.png"/> 
</figure>

<p>Did the rise of Reddit-hosted images cause the decline of Imgur on Reddit? Let&rsquo;s look at the daily number of Imgur submissions and Reddit-hosted Image submissions from December 2015 to April 2017, normalized by the total number of sitewide submissions on that day. This gives us a Reddit &ldquo;market share&rdquo; metric for both services.</p>
<p>Additionally, we can plot vertical lines representing the dates when Reddit-hosted images rolled out in the limited beta release and the full sitewide release to see if there is a link between those events and submission behavior.</p>
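<p>As a rough sketch, that normalization is a few lines of pandas, assuming a per-submission export with <code>date</code> and <code>domain</code> columns (the column names and the <code>i.redd.it</code> domain for Reddit-hosted images are assumptions):</p>
<pre><code class="language-python">import pandas as pd

df = pd.read_csv('reddit_submissions.csv', parse_dates=['date'])  # hypothetical BigQuery export

daily_totals = df.groupby('date').size()

# Daily share of submissions for each host, as a fraction of all submissions that day.
imgur_share = df[df['domain'].str.endswith('imgur.com', na=False)].groupby('date').size() / daily_totals
reddit_share = df[df['domain'].eq('i.redd.it')].groupby('date').size() / daily_totals

shares = pd.DataFrame({'imgur': imgur_share, 'reddit_hosted': reddit_share}).fillna(0)
</code></pre>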
<figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/reddit-4_hu_4ee09338a5791411.webp 320w,/2017/06/imgur-decline/reddit-4_hu_6c7245f5f940c2e1.webp 768w,/2017/06/imgur-decline/reddit-4_hu_7cb95dbaf55d9d98.webp 1024w,/2017/06/imgur-decline/reddit-4.png 1200w" src="reddit-4.png"/> 
</figure>

<p>Before Reddit added native image hosting, Imgur accounted for 15% of all submissions to Reddit. Now it&rsquo;s below 9%. More Reddit-hosted images are being shared on Reddit than images from Imgur.</p>
<p>Instead of looking at all of Reddit, where spam subreddits could skew the results, we can also look at the largest image-only subreddits: <a href="https://www.reddit.com/r/pics/">/r/pics</a> and <a href="https://www.reddit.com/r/gifs/">/r/gifs</a>, both of which were a part of the beta rollout.</p>
<figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/reddit-5_hu_f39f160457e2df18.webp 320w,/2017/06/imgur-decline/reddit-5_hu_aaa929a44f375cda.webp 768w,/2017/06/imgur-decline/reddit-5_hu_f4911f05f33ceed5.webp 1024w,/2017/06/imgur-decline/reddit-5.png 1200w" src="reddit-5.png"/> 
</figure>

<p>Here, the impact of the two rollouts is much more noticeable, with immediate increases in Reddit-hosted image market share after each rollout, and proportional decreases in Imgur market share. The growth rate after the beta release is flat for both services, but once Reddit image hosting becomes sitewide and users learn that the native image upload functionality exists, the market share of Reddit-hosted images increases linearly over time while Imgur&rsquo;s decreases. And these trends do not appear to be slowing down.</p>
<h2 id="a-silver-lining">A Silver Lining?</h2>
<p>Obviously Imgur does not like losing a <em>large</em> chunk of traffic, but there&rsquo;s a possibility that this outcome will be better for the business than what&rsquo;s implied by the charts above.</p>
<p>Hosting images on the internet isn&rsquo;t free, and bandwidth costs are the primary reason dedicated image hosts have died off over the years. Direct image links which show the user only the image and nothing else are convenient, but they are pure loss for the service. That&rsquo;s why image hosts encourage linking to the image on a landing page of the website, filled with ads which generate an expected revenue greater than the cost of serving the image.</p>
<p>After a user uploads an image to Imgur on the desktop, the user is given two share links that can be submitted to sites like Reddit: an image link that goes to the image + ads, and a direct link to the image.</p>
<figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/imgur_direct_hu_4e7b2a396ae5e6bf.webp 320w,/2017/06/imgur-decline/imgur_direct_hu_290e7d38ff430219.webp 768w,/2017/06/imgur-decline/imgur_direct.png 991w" src="imgur_direct.png"/> 
</figure>

<p>Recently, Imgur has <a href="https://www.reddit.com/r/assholedesign/comments/5gs96k/just_show_me_the_fucking_image_imgur/">pushed app downloads</a> to users visiting the site on an iOS/Android device, including <a href="https://www.reddit.com/r/assholedesign/comments/695efj/upload_image_on_imgur_mobile_has_been_replaced_by/">disabling uploads</a> in the mobile browser. When sharing an image from the Imgur app, the <em>only</em> option is the image link, which could lead to an increase in the proportion of ad-filled Imgur image links on Reddit. Said increase could counteract the decrease in total Imgur submissions, and Imgur could actually come out ahead.</p>
<p>With BigQuery, we can check the percentage of all Imgur submissions to Reddit which are direct links and the percentage which are indirect/lead to a landing page, and see if the ratio changes along the same time horizon used above:</p>
<figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/reddit-6_hu_f1c47ff2cd14f4d3.webp 320w,/2017/06/imgur-decline/reddit-6_hu_7baf41c4d88bcb6a.webp 768w,/2017/06/imgur-decline/reddit-6_hu_822a82d187387670.webp 1024w,/2017/06/imgur-decline/reddit-6.png 1200w" src="reddit-6.png"/> 
</figure>

<p>Welp. No significant change in the ratio over time, eliminating that possible silver lining.</p>
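<p>For reference, that direct-versus-indirect split can be approximated from the same kind of export with a short pandas sketch (the column names and domain patterns are assumptions):</p>
<pre><code class="language-python">import pandas as pd

imgur = pd.read_csv('imgur_submissions.csv', parse_dates=['date'])  # hypothetical export of Imgur links

# i.imgur.com serves the bare image; other imgur.com domains lead to the ad-supported landing page.
imgur['is_direct'] = imgur['domain'].eq('i.imgur.com')

monthly_direct_ratio = imgur.groupby(imgur['date'].dt.to_period('M'))['is_direct'].mean()
print(monthly_direct_ratio)
</code></pre>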
<h2 id="conclusion">Conclusion</h2>
<p>Note that the decline of Imgur on Reddit says nothing about Imgur as a business; it&rsquo;s entirely possible that Imgur&rsquo;s traffic on the main site itself is sufficient for growth. But the loss of Reddit traffic certainly can&rsquo;t be ignored, and it&rsquo;s interesting to visualize how quickly a service can be replaced when there&rsquo;s an equivalent native feature.</p>
<p>It&rsquo;s worth noting that new competitors in the image space such as <a href="https://giphy.com">Giphy</a> utilize image hosting as a <em>secondary</em> service. Instead, they focus on building a repository of images which can be licensed and accessed programmatically by other services like Slack, Facebook, and Twitter. And Giphy has raised <a href="https://www.crunchbase.com/organization/giphy#/entity">$150 Million</a> total with this approach, so perhaps the image hosting market itself has indeed changed.</p>
<hr>
<p><em>You can view the R, ggplot2 code, and BigQueries used to visualize the Reddit data in <a href="http://minimaxir.com/notebooks/imgur-decline/">this R Notebook</a>. You can also view the images/data used for this post in <a href="https://github.com/minimaxir/imgur-decline">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Advantages of Using R Notebooks For Data Analysis Instead of Jupyter Notebooks</title>
      <link>https://minimaxir.com/2017/06/r-notebooks/</link>
      <pubDate>Tue, 06 Jun 2017 08:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/06/r-notebooks/</guid>
      <description>The relatively new R Notebooks improve the workflows of common data analysis in ways Jupyter Notebooks can&amp;rsquo;t.</description>
      <content:encoded><![CDATA[<p><a href="http://jupyter.org">Jupyter Notebooks</a>, formerly known as <a href="https://ipython.org/notebook.html">IPython Notebooks</a>, are ubiquitous in modern data analysis. The Notebook format allows statistical code and its output to be viewed on any computer in a logical and <em>reproducible</em> manner, avoiding both the confusion caused by unclear code and the inevitable &ldquo;it only works on my system&rdquo; curse.</p>
<figure>

    <img loading="lazy" srcset="/2017/06/r-notebooks/jupyterdemo_hu_287135b578ef9105.webp 320w,/2017/06/r-notebooks/jupyterdemo_hu_3059d8862e947c85.webp 768w,/2017/06/r-notebooks/jupyterdemo_hu_62821160794f3044.webp 1024w,/2017/06/r-notebooks/jupyterdemo.png 1536w" src="jupyterdemo.png"/> 
</figure>

<p>In Jupyter Notebooks, each block of Python input code executes in its own cell, and the output of the block appears inline; this allows the user to iterate on the results, both to make the data transformations explicit and to make sure the results are as expected.</p>
<figure>

    <img loading="lazy" srcset="/2017/06/r-notebooks/jupyter_hu_919cd14dc214e4fc.webp 320w,/2017/06/r-notebooks/jupyter_hu_8647bc6d434c5157.webp 768w,/2017/06/r-notebooks/jupyter_hu_10ef5c83bc24f007.webp 1024w,/2017/06/r-notebooks/jupyter.png 1852w" src="jupyter.png"/> 
</figure>

<p>In addition to code blocks, Jupyter Notebooks support <a href="https://en.wikipedia.org/wiki/Markdown">Markdown</a> cells, allowing for more detailed write-ups with easy formatting. The final Notebook can be exported as a HTML file displayable in a browser, or the raw Notebook file can be shared and <a href="https://github.com/blog/1995-github-jupyter-notebooks-3">rendered</a> on sites like <a href="https://github.com">GitHub</a>. Although Jupyter is a Python application, it can run kernels of <a href="https://irkernel.github.io">non-Python languages</a>, such as <a href="https://www.r-project.org">R</a>.</p>
<p>Over the years, there have been a few new competitors in the reproducible data analysis field, such as <a href="http://beakernotebook.com/features">Beaker Notebook</a> and, for heavy-duty business problems, <a href="https://zeppelin.apache.org">Apache Zeppelin</a>. However, today we&rsquo;ll look at the relatively new <a href="http://rmarkdown.rstudio.com/r_notebooks.html">R Notebooks</a>, and how they help improve the workflows of common data analysis in ways Jupyter Notebooks can&rsquo;t without third-party extensions.</p>
<h2 id="about-r-notebooks">About R Notebooks</h2>
<p>R Notebooks are a format maintained by <a href="https://www.rstudio.com">RStudio</a>, which develops and maintains a large number of open source R packages and tools, most notably the free-for-consumer RStudio R IDE. More specifically, R Notebooks are an extension of the earlier <a href="http://rmarkdown.rstudio.com">R Markdown</a> <code>.Rmd</code> format, useful for rendering analyses into HTML/PDFs, or other cool formats like <a href="http://rmarkdown.rstudio.com/tufte_handout_format.html">Tufte handouts</a> or even <a href="https://bookdown.org">books</a>. The default output of an R Notebook file is a <code>.nb.html</code> file, which can be viewed as a webpage on any system. (<a href="https://rpubs.com">RPubs</a> has many examples of R Notebooks, although I recommend using <a href="https://pages.github.com">GitHub Pages</a> to host notebooks publicly).</p>
<figure>

    <img loading="lazy" srcset="/2017/06/r-notebooks/RNotebookAnimation_hu_2bd7eafe2a4daec8.webp 320w,/2017/06/r-notebooks/RNotebookAnimation.gif 425w" src="RNotebookAnimation.gif"/> 
</figure>

<p>Instead of having separate cells for code and text, an R Markdown file is all plain text. The cells are indicated by three backticks and a gray background in RStudio, which makes it easy to enter a code block, easy to identify code blocks at a glance, and easy to execute a notebook block-by-block. Each cell also has a green indicator bar which shows which code is running and which code is queued, line-by-line.</p>
<p>For Notebook files, a HTML webpage is automatically generated whenever the file is saved, which can immediately be viewed in any browser (the generated webpage stores the cell output and any necessary dependencies).</p>
<figure>

    <img loading="lazy" srcset="/2017/06/r-notebooks/notebooktest_hu_7203602b197e5272.webp 320w,/2017/06/r-notebooks/notebooktest.png 642w" src="notebooktest.png"/> 
</figure>

<p>R Notebooks can only be created and edited in RStudio, but this is a case where tight vertical integration of open-source software is a good thing. Among many other features, RStudio includes a file manager, function help, a variable explorer, and a project manager, all of which make analysis much easier and faster compared to the browser-only Jupyter.</p>
<figure>

    <img loading="lazy" srcset="/2017/06/r-notebooks/rstudio_hu_9ca09ff475c1ab6c.webp 320w,/2017/06/r-notebooks/rstudio_hu_8f31449381ea8c8c.webp 768w,/2017/06/r-notebooks/rstudio_hu_6790ce776fe0161c.webp 1024w,/2017/06/r-notebooks/rstudio.png 1280w" src="rstudio.png"/> 
</figure>

<p>I&rsquo;ve made many, many Jupyter Notebooks and R Notebooks <a href="http://minimaxir.com/data-portfolio">over the years</a>, which has given me insight into the strengths and weaknesses of both formats. Here are a few native features of R Notebooks which present an objective advantage over Jupyter Notebooks, particularly those not highlighted in the documentation:</p>
<h2 id="version-control">Version Control</h2>
<p>Version control of files with tools such as <a href="https://en.wikipedia.org/wiki/Git">git</a> is important as it both maintains an explorable database of changes to the code files and also improves collaboration by using a centralized server (e.g. GitHub) where anyone with access to the repository can pull and push changes to the code. In the data science world, large startups such as <a href="https://stripe.com/blog/reproducible-research">Stripe</a> and <a href="https://medium.com/airbnb-engineering/scaling-knowledge-at-airbnb-875d73eff091">Airbnb</a> have seen a lot of success with this approach.</p>
<p>RStudio incidentally has a native git client for tracking and committing changes to a <code>.Rmd</code> file, which is easy since <code>.Rmd</code> files are effectively plain text files where you can see differences between versions at a per-line level. (You may not want to store the changes to the generated <code>.nb.html</code> Notebook since they will be large and redundant to the changes made in the corresponding <code>.Rmd</code>; I recommend adding a <code>*.nb.html</code> rule to a <code>.gitignore</code> file during analysis).</p>
<figure>

    <img loading="lazy" srcset="/2017/06/r-notebooks/git_hu_81f9aa52fb4095c2.webp 320w,/2017/06/r-notebooks/git_hu_9986faacff44886a.webp 768w,/2017/06/r-notebooks/git_hu_b18c85f56c32a67c.webp 1024w,/2017/06/r-notebooks/git.png 1376w" src="git.png"/> 
</figure>

<p>The <code>.ipynb</code> Jupyter Notebook files are blobs of JSON that also store cell output, which will result in large diffs if you keep them in version control and make any changes which result in different output. This can cause the git database to balloon and makes reading per-line diffs hard if not impossible.</p>
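<p>To illustrate the point, a <code>.ipynb</code> file is just JSON whose bulk is usually stored cell output; here is a small Python sketch of what stripping that output before a commit looks like (roughly what third-party tools do for you, with illustrative filenames):</p>
<pre><code class="language-python">import json

with open('analysis.ipynb') as f:
    notebook = json.load(f)

# Code cells carry an "outputs" list and an "execution_count"; clearing them
# leaves only the source, which diffs cleanly in version control.
for cell in notebook.get('cells', []):
    if cell.get('cell_type') == 'code':
        cell['outputs'] = []
        cell['execution_count'] = None

with open('analysis_stripped.ipynb', 'w') as f:
    json.dump(notebook, f, indent=1)
</code></pre>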
<p>On Hacker News, the version control issues in Jupyter are <a href="https://news.ycombinator.com/item?id=14034341">a common complaint</a>; however, a Jupyter developer noted the possibility of <a href="https://news.ycombinator.com/item?id=14035158">working with RStudio</a> on solving this issue.</p>
<h2 id="inline-code-rendering">Inline Code Rendering</h2>
<p>A common practice in Jupyter Notebooks is to print important values as a part of a write-up or while testing statistical code. In Jupyter Notebooks, if you want to verify the number of rows in a dataset for exploratory data analysis, you have to add an appropriate print statement to the cell to get the number of rows, and then add a Markdown cell to redundantly describe what you just printed in the output.</p>
<p>In R Notebooks, you can skip a step by calling such print statements in-line in the Markdown text, which will then be rendered with the Notebook. This also avoids hard-coding such numbers in the Markdown text if you change the data beforehand (e.g. parameter tuning) or if the values are nontrivial to calculate by hand.</p>
<p>For example, these lines of R Markdown from my <a href="http://minimaxir.com/notebooks/first-comment/">Reddit First Comment Notebook</a>:</p>
<figure>

    <img loading="lazy" srcset="/2017/06/r-notebooks/inline_hu_c6d4dc66bf14ef60.webp 320w,/2017/06/r-notebooks/inline_hu_df8ce63e0e546f98.webp 768w,/2017/06/r-notebooks/inline.png 972w" src="inline.png"/> 
</figure>

<p>translate into:</p>
<figure>

    <img loading="lazy" srcset="/2017/06/r-notebooks/reddit_hu_8d3c46de15fb586d.webp 320w,/2017/06/r-notebooks/reddit_hu_9a37f5897d7a073f.webp 768w,/2017/06/r-notebooks/reddit_hu_f0ac03e8aa2e427.webp 1024w,/2017/06/r-notebooks/reddit.png 1024w" src="reddit.png"/> 
</figure>

<h2 id="metadata">Metadata</h2>
<p>R Notebooks are configured with a <a href="http://yaml.org">YAML</a> header, which can include common attributes such as title, author, date published, and other relevant options. These fields will then be configured correctly in the metadata for HTML/PDF/Handouts output. Here&rsquo;s an example from <a href="http://minimaxir.com/notebooks/amazon-spark/">one of my notebooks</a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nn">---</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">title</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Playing with 80 Million Amazon Product Review Ratings Using Apache Spark&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">author</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Max Woolf (@minimaxir)&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">date</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;January 2nd, 2017&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">output</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">html_notebook</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">highlight</span><span class="p">:</span><span class="w"> </span><span class="l">tango</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">mathjax</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">number_sections</span><span class="p">:</span><span class="w"> </span><span class="kc">yes</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">theme</span><span class="p">:</span><span class="w"> </span><span class="l">spacelab</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">toc</span><span class="p">:</span><span class="w"> </span><span class="kc">yes</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">toc_float</span><span class="p">:</span><span class="w"> </span><span class="kc">yes</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nn">---</span><span class="w">
</span></span></span></code></pre></div><p>Said metadata features are <a href="https://github.com/ipython/ipython/issues/6073">often requested but unimplemented</a> in Jupyter.</p>
<h2 id="notebook-theming">Notebook Theming</h2>
<p>As noted in the example metadata above, R Notebooks allow extensive theming. Jupyter Notebooks do <a href="https://github.com/dunovank/jupyter-themes">support themes</a>, but only via a third-party Python package or by placing custom CSS in an <a href="https://stackoverflow.com/a/32158550">odd location</a>.</p>
<p>As with Jupyter Notebooks, the front end of browser-based R Notebooks is based on the <a href="http://getbootstrap.com">Bootstrap</a> HTML framework. R Notebooks, however, allow you to natively select the style of code syntax highlighting via <code>highlight</code> (similar options as <a href="https://help.farbox.com/pygments.html">pygments</a>) and also the entire Bootstrap theme via <code>theme</code> (with a selection from the excellent <a href="https://bootswatch.com">Bootswatch</a> themes by <a href="https://twitter.com/thomashpark">Thomas Park</a>), giving your Notebook a unique look without adding dependencies.</p>
<h2 id="data-tables">Data Tables</h2>
<p>When you print a data frame in a Jupyter Notebook, the output appears as a standard <em>boring</em> HTML table:</p>
<figure>

    <img loading="lazy" srcset="/2017/06/r-notebooks/htmltable_hu_320fef023a1fcc55.webp 320w,/2017/06/r-notebooks/htmltable_hu_20e1593d3a696894.webp 768w,/2017/06/r-notebooks/htmltable_hu_14e5ba80dcd5dd7f.webp 1024w,/2017/06/r-notebooks/htmltable.png 1836w" src="htmltable.png"/> 
</figure>

<p>No cell block output is ever truncated. Accidentally printing an entire 100,000+ row table to a Jupyter Notebook is a mistake you only make <em>once</em>.</p>
<p>R Notebook tables, in contrast, are pretty tables with pagination for both rows and columns, and they can support large amounts of data if necessary.</p>
<figure>

    <img loading="lazy" srcset="/2017/06/r-notebooks/rtable_hu_57d067a3de215b70.webp 320w,/2017/06/r-notebooks/rtable_hu_e7f78f095bdb7f18.webp 768w,/2017/06/r-notebooks/rtable_hu_3a6c1cacd852bc16.webp 1024w,/2017/06/r-notebooks/rtable.png 1386w" src="rtable.png"/> 
</figure>

<p>The R Notebook output table also includes the data type of the column, which is helpful for debugging unexpected issues where a column has an unintended data type (e.g. a numeric <code>&lt;dbl&gt;</code> column or a datetime <code>&lt;S3: POSIXct&gt;</code> column is parsed as a text-based <code>&lt;chr&gt;</code> column).</p>
<h2 id="table-of-contents">Table of Contents</h2>
<p>A Table of Contents always helps with navigation, particularly in a PDF export. Jupyter Notebooks <a href="https://github.com/minrk/ipython_extensions">require an extension</a> for a ToC, while R Notebooks will natively create one from section headers (controllable via <code>toc</code> and <code>number_sections</code>). An optional <code>toc_float</code> parameter causes the Table of Contents to float on the left in the browser, making it always accessible.</p>
<figure>

    <img loading="lazy" srcset="/2017/06/r-notebooks/notebookheader_hu_adec52f6ee5a1336.webp 320w,/2017/06/r-notebooks/notebookheader_hu_5ecc23e67295d193.webp 768w,/2017/06/r-notebooks/notebookheader_hu_b18757e8f5470627.webp 1024w,/2017/06/r-notebooks/notebookheader.png 1976w" src="notebookheader.png"/> 
</figure>

<p>In conclusion, R Notebooks haven&rsquo;t received much publicity since the benefits aren&rsquo;t immediately obvious, but for the purpose of reproducible analyses, the breadth of native features allows for excellent utility while avoiding dependency hell. Running R in an R Notebook is a significantly better experience than running R in a Jupyter Notebook. The advantages present in R Notebooks can also provide guidance for feature development in other Notebook software, which improves the data analysis ecosystem as a whole.</p>
<p>However, there&rsquo;s an elephant in the room&hellip;</p>
<h2 id="what-about-python">What About Python?</h2>
<p>So you might be thinking &ldquo;an R Notebook forces you to use R, but <em>serious</em> data science work is done using Python!&rdquo; Plot twist: you can use Python in an R Notebook!</p>
<figure>

    <img loading="lazy" srcset="/2017/06/r-notebooks/python_hu_d75a8e044545d86b.webp 320w,/2017/06/r-notebooks/python.png 326w" src="python.png"/> 
</figure>

<p>Well, sort of. The Python session ends after the cell executes, making it unhelpful for tasks other than <em>ad hoc</em> scripts.</p>
<p>Whether R or Python is better for data analysis is a <a href="https://news.ycombinator.com/item?id=14056098">common</a> <a href="https://news.ycombinator.com/item?id=13239530">religious</a> <a href="https://news.ycombinator.com/item?id=12301996">flamewar</a> topic which is best saved for a separate blog post (tl;dr: I disagree with the paraphrased quote above in that both languages have their advantages and you&rsquo;ll benefit significantly from knowing both ecosystems).</p>
<p>And I wouldn&rsquo;t count R out of &ldquo;serious data science&rdquo;. You can use R <a href="http://spark.rstudio.com">seamlessly</a> with big data tools like <a href="https://spark.apache.org">Apache Spark</a>, and R can <a href="https://rstudio.github.io/keras/">now</a> use <a href="https://keras.io">Keras</a>/<a href="https://www.tensorflow.org">TensorFlow</a> for deep learning with near-API-parity to the Python version. <em>Hmm</em>.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
