<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Big Data on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/category/big-data/</link>
    <description>Recent content in Big Data on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2025.</copyright>
    <lastBuildDate>Wed, 23 Oct 2019 09:00:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/category/big-data/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Visualizing Airline Flight Characteristics Between SFO and JFK</title>
      <link>https://minimaxir.com/2019/10/sfo-jfk-flights/</link>
      <pubDate>Wed, 23 Oct 2019 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2019/10/sfo-jfk-flights/</guid>
      <description>Box plots, when used correctly, can be a very fun way to visualize big data.</description>
      <content:encoded><![CDATA[<p>In March, <a href="https://cloud.google.com">Google Cloud Platform</a> developer advocate <a href="https://twitter.com/felipehoffa">Felipe Hoffa</a> tweeted about airline flight data from San Francisco International Airport (SFO) to Seattle-Tacoma International Airport (SEA):</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">The time to fly from San Francisco to Seattle (SFO-&gt;SEA) keeps getting longer throughout the years - with a huge increase in how much time the airplane spends taxiing around SeaTac .<br><br>Playing with US flights data in BigQuery: <a href="https://t.co/eD9unaokWx">https://t.co/eD9unaokWx</a> <a href="https://t.co/3vfnBhiJv4">pic.twitter.com/3vfnBhiJv4</a></p>&mdash; Felipe Hoffa (@felipehoffa) <a href="https://twitter.com/felipehoffa/status/1111050585120206848?ref_src=twsrc%5Etfw">March 27, 2019</a></blockquote>


<p>Particularly, his visualization of total elapsed times by airline caught my eye.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu14670032688953978905.webp 320w,https://minimaxir.com/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu15067241946496000759.webp 768w,https://minimaxir.com/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu17026061684893448646.webp 1024w,https://minimaxir.com/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD.jpeg 1200w" src="D2s9oFtX4AEK6nD.jpeg"/> 
</figure>

<p>The overall time for flights from SFO to SEA goes up drastically starting in 2015, and this increase occurs across multiple airlines, implying that it&rsquo;s not an airline-specific problem. But what could intuitively cause that?</p>
<p>U.S. domestic airline data is <a href="https://www.transtats.bts.gov/Tables.asp?DB_ID=120">freely distributed</a> by the United States Department of Transportation. Normally it&rsquo;s a pain to work with as it&rsquo;s very large with millions of rows, but BigQuery makes playing with such data relatively easy, fun, and free. What other interesting factoids can be found?</p>
<h2 id="expanding-on-sfo--sea">Expanding on SFO → SEA</h2>
<p><a href="https://cloud.google.com/bigquery/">BigQuery</a> is a big data warehousing tool that allows you to query massive amounts of data. The table Hoffa created from the airline data (<code>fh-bigquery.flights.ontime_201903</code>) is 83.37 GB and 184 <em>million</em> rows. You can query 1 TB of data per month for free, and since BigQuery only scans the fields you request, the queries in this post only consume about 2 GB each, allowing you to run them well within that quota.</p>
<p>Hoffa&rsquo;s query that runs on BigQuery looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="p">,</span><span class="w"> </span><span class="n">Reporting_Airline</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">ActualElapsedTime</span><span class="p">)</span><span class="w"> </span><span class="n">ActualElapsedTime</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">TaxiOut</span><span class="p">)</span><span class="w"> </span><span class="n">TaxiOut</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">TaxiIn</span><span class="p">)</span><span class="w"> </span><span class="n">TaxiIn</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">AirTime</span><span class="p">)</span><span class="w"> </span><span class="n">AirTime</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">c</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">flights</span><span class="p">.</span><span class="n">ontime_201903</span><span class="o">`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">Origin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SFO&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">AND</span><span class="w"> </span><span class="n">Dest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SEA&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">AND</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2010-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">DESC</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="w"> </span><span class="k">DESC</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">1000</span><span class="w">
</span></span></span></code></pre></div><p>For each year and airline after 2010, the query calculates the average metrics specified for flights on the SFO → SEA route.</p>
<p>I made a few query and data visualization tweaks to what Hoffa did above, and here&rsquo;s the result, showing how elapsed flight time on that route has increased over time:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu12880096236318151475.webp 320w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu16685457815119799848.webp 768w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu2709660247252405983.webp 1024w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_sea_flight_duration.png 1800w" src="sfo_sea_flight_duration.png"/> 
</figure>

<p>Let&rsquo;s explain what&rsquo;s going on here.</p>
<p>A common practice in statistics is to avoid <a href="https://en.wikipedia.org/wiki/Average">averages</a> as a summary statistic whenever possible, as averages can be overly affected by strong outliers (and with airline flights, there are definitely strong outliers!). The solution is to use a <a href="https://en.wikipedia.org/wiki/Median">median</a> instead, but there&rsquo;s one problem: medians are harder and more <a href="https://www.periscopedata.com/blog/medians-in-sql">computationally complex</a> to calculate than simple averages. Despite the rise of &ldquo;big data&rdquo;, most databases and BI tools don&rsquo;t have a <code>MEDIAN</code> function that&rsquo;s as easy to use as an <code>AVG</code> function. But BigQuery has an uncommon <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions#approx_quantiles">APPROX_QUANTILES</a> function, which calculates approximate boundaries for the specified number of quantiles; for example, if you call <code>APPROX_QUANTILES(ActualElapsedTime, 100)</code>, it will return an array of 101 values (the approximate minimum, the 99 percentile boundaries in between, and the approximate maximum), where the median is the value at offset 50. BigQuery&rsquo;s <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate-aggregation">approximate aggregation</a> relies on algorithmic tricks (the documentation describes <a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLog++</a> for approximate distinct counts) to return results efficiently even with millions of data points. And since we get other quantiles like the 5th, 25th, 75th, and 95th for free with that approach, we can visualize the <em>spread</em> of the data.</p>
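<p>To make the average-versus-median point concrete, here is a minimal Python sketch. It is not how BigQuery computes anything: it uses the exact quantile routine from Python's standard library, the flight times are made up, and <code>quantile_boundaries</code> is a hypothetical helper that only mimics the shape of the array <code>APPROX_QUANTILES</code> returns (the requested number of quantiles plus one boundary values).</p>

```python
import statistics

# Hypothetical elapsed times in minutes; the last value is one heavily delayed flight.
elapsed_minutes = [118, 120, 121, 123, 125, 126, 128, 130, 131, 480]

# The single outlier drags the mean well above the typical flight; the median barely moves.
mean_time = statistics.mean(elapsed_minutes)
median_time = statistics.median(elapsed_minutes)


def quantile_boundaries(values, number):
    """Return number + 1 boundaries: the minimum, the interior cut points, the maximum."""
    cuts = statistics.quantiles(values, n=number, method="inclusive")  # number - 1 cut points
    return [min(values)] + cuts + [max(values)]


boundaries = quantile_boundaries(elapsed_minutes, 100)

print("mean:", mean_time)
print("median:", median_time)
print("number of boundaries:", len(boundaries))
print("value at offset 50:", boundaries[50])
```

<p>With 100 quantiles, the value at offset 50 lines up with the median, which is why the query below reads <code>OFFSET(50)</code> for the middle of each box.</p>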
<p>We can aggregate the data by month for more granular trends and calculate the <code>APPROX_QUANTILES</code> in a subquery so it only has to be computed once. Hoffa also uploaded a more recent table (<code>fh-bigquery.flights.ontime_201908</code>) with a few additional months of data. To keep things simple, we&rsquo;ll skip aggregating by airline since the metrics do not vary strongly between airlines. The final query ends up looking like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="n">standardSQL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="p">,</span><span class="w"> </span><span class="n">num_flights</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">5</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_5</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">25</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_25</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">50</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_50</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">75</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_75</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">95</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_95</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">num_flights</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">APPROX_QUANTILES</span><span class="p">(</span><span class="n">ActualElapsedTime</span><span class="p">,</span><span class="w"> </span><span class="mi">100</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">time_q</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">flights</span><span class="p">.</span><span class="n">ontime_201908</span><span class="o">`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">Origin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SFO&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">AND</span><span class="w"> </span><span class="n">Dest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SEA&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">AND</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2010-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="w">
</span></span></span></code></pre></div><p>The resulting data table:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2019/10/sfo-jfk-flights/table_hu5722602836010489408.webp 320w,https://minimaxir.com/2019/10/sfo-jfk-flights/table_hu18095213541617915982.webp 768w,https://minimaxir.com/2019/10/sfo-jfk-flights/table.png 932w" src="table.png"/> 
</figure>

<p>In retrospect, since we&rsquo;re only focusing on one route, it isn&rsquo;t <em>big</em> data (this query only returns data on 64,356 flights total), but it&rsquo;s still a very useful skill if you need to analyze more of the airline data (the <code>APPROX_QUANTILES</code> function can handle <em>millions</em> of data points very quickly).</p>
<p>As a professional data scientist, one of my favorite types of data visualization is a <a href="https://en.wikipedia.org/wiki/Box_plot">box plot</a>, as it provides a way to visualize spread without being visually intrusive. Data visualization tools like <a href="https://www.r-project.org">R</a> and <a href="https://ggplot2.tidyverse.org/index.html">ggplot2</a> make constructing them <a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">very easy to do</a>.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2019/10/sfo-jfk-flights/geom_boxplot-1_hu14566074060889838274.webp 320w,https://minimaxir.com/2019/10/sfo-jfk-flights/geom_boxplot-1_hu3174097347091120844.webp 768w,https://minimaxir.com/2019/10/sfo-jfk-flights/geom_boxplot-1_hu3660002285753879867.webp 1024w,https://minimaxir.com/2019/10/sfo-jfk-flights/geom_boxplot-1.png 1400w" src="geom_boxplot-1.png"/> 
</figure>

<p>By default, for each box representing a group, the thick line in the middle of the box is the median, the lower bound of the box is the 25th quantile and the upper bound is the 75th quantile. The whiskers are normally a function of the <a href="https://en.wikipedia.org/wiki/Interquartile_range">interquartile range</a> (IQR), but if there&rsquo;s enough data, I prefer to use the 5th and 95th quantiles instead.</p>
<p>If you feed ggplot2&rsquo;s <code>geom_boxplot()</code> with raw data, it will automatically calculate the corresponding metrics for visualization; however, with big data, the data may not fit into memory and, as noted earlier, medians and other quantiles are computationally expensive to calculate. Because we precomputed the quantiles with the query above for every year and month, we can use those explicitly. (The minor downside is that this will not include outliers.)</p>
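<p>As a rough illustration of what precomputing means here, the Python sketch below reduces raw values to the five numbers each box needs. The data is synthetic and <code>box_stats</code> is a hypothetical helper; in this post the real work is done by the BigQuery query above, not by code like this.</p>

```python
import statistics


def box_stats(values):
    """Summarize raw values as the five numbers a precomputed box plot needs."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    # cuts[p - 1] is the boundary for the p-th percentile.
    return {f"q_{p}": cuts[p - 1] for p in (5, 25, 50, 75, 95)}


# Hypothetical raw flight times (minutes) grouped by month.
flights_by_month = {
    "2019-01": [110 + (i * 7) % 40 for i in range(200)],
    "2019-07": [100 + (i * 11) % 30 for i in range(200)],
}

# One small row per month is all the plotting layer has to hold in memory.
summary = {month: box_stats(times) for month, times in flights_by_month.items()}

for month, stats in summary.items():
    print(month, stats)
```

<p>Each month collapses to one row of five values, which is the same shape as the <code>q_5</code> through <code>q_95</code> columns in the query result.</p>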
<p>Additionally for box plots, I like to fill in each box with a different color corresponding to the year in order to better perceive data <a href="https://en.wikipedia.org/wiki/Seasonality">seasonality</a>. In the case of airline flights, seasonality is more literal: weather has an intuitive impact on flight times and delays, and during winter months there are also holidays which could affect airline logistics.</p>
<p>The resulting ggplot2 code looks like this:</p>
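<pre tabindex="0"><code>library(ggplot2)
library(scales)  # provides pretty_breaks()

# df_tf holds the query results, with a date column and a year_factor column added
plot &lt;-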
  ggplot(df_tf,
         aes(
           x = date,
           ymin = q_5,
           lower = q_25,
           middle = q_50,
           upper = q_75,
           ymax = q_95,
           group = date,
           fill = year_factor
         )) +
  geom_boxplot(stat = &#34;identity&#34;, size = 0.3) +
  scale_fill_hue(l = 50, guide = &#34;none&#34;) +
  scale_x_date(date_breaks = &#39;1 year&#39;, date_labels = &#34;%Y&#34;) +
  scale_y_continuous(breaks = pretty_breaks(6)) +
  labs(
    title = &#34;Distribution of Flight Times of Flights From SFO → SEA, by Month&#34;,
    subtitle = &#34;via US DoT. Box bounds are 25th/75th percentiles, whiskers are 5th/95th percentiles.&#34;,
    y = &#39;Total Elapsed Flight Time (Minutes)&#39;,
    fill = &#39;&#39;,
    caption = &#39;Max Woolf — minimaxir.com&#39;
  ) +
  theme(axis.title.x = element_blank())

ggsave(&#39;sfo_sea_flight_duration.png&#39;,
       plot,
       width = 6,
       height = 4)
</code></pre><p>And behold (again)!</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu12880096236318151475.webp 320w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu16685457815119799848.webp 768w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu2709660247252405983.webp 1024w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_sea_flight_duration.png 1800w" src="sfo_sea_flight_duration.png"/> 
</figure>

<p>You can see that the boxes do indeed trend upward after 2016, although per-month medians are in flux. The spread is also increasing slowly over time. But what&rsquo;s interesting is the seasonality; pre-2016, the summer months (the &ldquo;middle&rdquo; of a given color) have a <em>very</em> significant drop in total time, which doesn&rsquo;t occur as strongly after 2016. Hmm.</p>
<h2 id="sfo-and-jfk">SFO and JFK</h2>
<p>Since I occasionally fly from San Francisco to New York City, it might be interesting (for completely selfish reasons) to track trends over time for flights between those areas. On the San Francisco side I choose SFO, and for the New York side I choose John F. Kennedy International Airport (JFK), as the data goes back very far for those routes specifically, and I only want to look at a single airport at a time (instead of including other NYC airports such as Newark Liberty International Airport [EWR] and LaGuardia Airport [LGA]) to limit potential data confounders.</p>
<p>Fortunately, the code and query changes are minimal: in the query, change the <code>Origin</code> and <code>Dest</code> in the <code>WHERE</code> clause to the airports you want, and if you want to calculate a metric other than elapsed time, change the field passed to <code>APPROX_QUANTILES</code> accordingly.</p>
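<p>If you find yourself swapping routes and metrics often, one option is to template the query. The Python sketch below only builds the query string; <code>build_quantile_query</code> and <code>ALLOWED_METRICS</code> are hypothetical names, the metric list is limited to columns that appear in the queries above, and running the result against BigQuery is left out.</p>

```python
import re

# Columns seen in the queries above; extend this set if the table has others you need.
ALLOWED_METRICS = {"ActualElapsedTime", "AirTime", "TaxiOut", "TaxiIn"}

QUERY_TEMPLATE = """#standardSQL
SELECT Year, Month, num_flights,
  time_q[OFFSET(5)] AS q_5,
  time_q[OFFSET(25)] AS q_25,
  time_q[OFFSET(50)] AS q_50,
  time_q[OFFSET(75)] AS q_75,
  time_q[OFFSET(95)] AS q_95
FROM (
  SELECT Year, Month,
    COUNT(*) AS num_flights,
    APPROX_QUANTILES({metric}, 100) AS time_q
  FROM `fh-bigquery.flights.ontime_201908`
  WHERE Origin = '{origin}'
  AND Dest = '{dest}'
  AND FlightDate_year > '2010-01-01'
  GROUP BY Year, Month
)
ORDER BY Year, Month
"""


def build_quantile_query(origin, dest, metric="ActualElapsedTime"):
    """Fill in the route and metric, rejecting anything that is not a plain code or known column."""
    for code in (origin, dest):
        if not re.fullmatch(r"[A-Z]{3}", code):
            raise ValueError(f"expected a three-letter airport code, got {code!r}")
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"unsupported metric: {metric!r}")
    return QUERY_TEMPLATE.format(origin=origin, dest=dest, metric=metric)


print(build_quantile_query("SFO", "JFK", "AirTime"))
```

<p>Validating the inputs before formatting keeps arbitrary text out of the SQL, which matters if the route ever comes from user input.</p>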
<p>Here&rsquo;s the chart of total elapsed time from SFO → JFK:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu40679136762088995.webp 320w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu12750883730698291111.webp 768w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu3240290949035422618.webp 1024w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration.png 1800w" src="sfo_jfk_flight_duration.png"/> 
</figure>

<p>And here&rsquo;s the reverse, from JFK → SFO:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu12437652317497328577.webp 320w,https://minimaxir.com/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu17735780327068668341.webp 768w,https://minimaxir.com/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu2691778815073968140.webp 1024w,https://minimaxir.com/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration.png 1800w" src="jfk_sfo_flight_duration.png"/> 
</figure>

<p>Unlike the SFO → SEA charts, both charts are relatively flat over the years. However, when looking at seasonality, SFO → JFK dips in the summer and spikes during winter, while JFK → SFO <em>does the complete opposite</em>: dips during the winter and spikes during the summer, which is similar to the SFO → SEA route. I don&rsquo;t have any guesses what would cause that behavior.</p>
<p>How about flight speed (calculated as distance divided by air time)? Have new advances in airline technology made planes faster and/or more efficient?</p>
<p><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu3347624992993391283.webp 320w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu9967425817079747799.webp 768w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu3854178846017463568.webp 1024w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed.png 1800w" src="sfo_jfk_flight_speed.png"/> 
</figure>

<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu9927073050069018089.webp 320w,https://minimaxir.com/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu6746868239130637370.webp 768w,https://minimaxir.com/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu2369234530184054987.webp 1024w,https://minimaxir.com/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed.png 1800w" src="jfk_sfo_flight_speed.png"/> 
</figure>
</p>
<p>The expected flight speed for a commercial airplane, <a href="https://en.wikipedia.org/wiki/Cruise_%28aeronautics%29">per Wikipedia</a>, is 547-575 mph, so the metrics from SFO pass the sanity check. The metrics from JFK indicate there&rsquo;s about a 20% drop in flight speed, potentially due to headwinds on westbound flights, which makes sense. Month-to-month, the speed trends are inverse to the total elapsed time, which makes sense intuitively as they are strongly negatively correlated.</p>
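<p>For reference, the speed metric is just distance over air time, converted to miles per hour. The Python sketch below uses an approximate SFO to JFK distance and made-up air times, so the numbers are illustrative assumptions and not values taken from the charts; <code>speed_mph</code> is a hypothetical helper.</p>

```python
def speed_mph(distance_miles, air_time_minutes):
    """Average speed in miles per hour from a distance and an air time in minutes."""
    if air_time_minutes <= 0:
        raise ValueError("air time must be positive")
    return distance_miles / (air_time_minutes / 60)


ROUTE_MILES = 2586  # approximate distance between SFO and JFK (assumed)

# Hypothetical air times: the westbound leg is assumed to take longer.
eastbound = speed_mph(ROUTE_MILES, 280)  # SFO to JFK
westbound = speed_mph(ROUTE_MILES, 345)  # JFK to SFO

# Relative drop in speed on the slower leg.
relative_drop = 1 - westbound / eastbound

print(f"eastbound: {eastbound:.1f} mph")
print(f"westbound: {westbound:.1f} mph")
print(f"relative drop: {relative_drop:.1%}")
```

<p>Because the distance is fixed for a route, speed is inversely proportional to air time, which is why the speed charts mirror the elapsed time charts.</p>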
<p>Lastly, what about flight departure delays? Are airlines becoming more efficient, or has increased demand caused more congestion?</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu9685560990823876078.webp 320w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu9065541836650921452.webp 768w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu14509144604913423591.webp 1024w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay.png 1800w" src="sfo_jfk_departure_delay.png"/> 
</figure>

<p>Wait a second. In this case, massive 2-3 hour flight delays are frequent enough that even just the 95th percentile skews the entire plot. Let&rsquo;s remove the whiskers in order to look at trends more clearly.</p>
<p><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu14773758919487792962.webp 320w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu10725626404936578720.webp 768w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu7573290938867231858.webp 1024w,https://minimaxir.com/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers.png 1800w" src="sfo_jfk_departure_delay_nowhiskers.png"/> 
</figure>

<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu2963828405551268393.webp 320w,https://minimaxir.com/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu7225597780080219382.webp 768w,https://minimaxir.com/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu12722266661464968228.webp 1024w,https://minimaxir.com/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers.png 1800w" src="jfk_sfo_departure_delay_nowhiskers.png"/> 
</figure>
</p>
<p>A negative delay implies the flight leaves early, so we can conclude that the typical flight leaves slightly earlier than the stated departure time. Even without the whiskers, we can see major spikes at the 75th percentile level for summer months, and those spikes were especially bad in 2017 for both airports.</p>
<p>These box plots are only an <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">exploratory data analysis</a>. Determining the <em>cause</em> of changes in these flight metrics is difficult even for experts (I am definitely not an expert!) and may not even be possible to determine from publicly available data.</p>
<p>But there are still other fun things that can be done with the airline flight data, such as faceting airline trends by time and the inclusion of other airports, which is <a href="https://twitter.com/minimaxir/status/1115261670153048065"><em>interesting</em></a>.</p>
<hr>
<p><em>You can view the BigQuery queries used to get the data, plus the R and ggplot2 used to create the data visualizations, in <a href="http://minimaxir.com/notebooks/sfo-jfk-flights/">this R Notebook</a>. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/sfo-jfk-flights">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Problems with Predicting Post Performance on Reddit and Other Link Aggregators</title>
      <link>https://minimaxir.com/2018/09/modeling-link-aggregators/</link>
      <pubDate>Mon, 10 Sep 2018 09:15:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/09/modeling-link-aggregators/</guid>
      <description>The nature of algorithmic feeds like Reddit inherently leads to a survivorship bias: although users may recognize certain types of posts that appear on the front page, there are many more which follow the same patterns but fail.</description>
      <content:encoded><![CDATA[<p><a href="https://www.reddit.com">Reddit</a>, &ldquo;the front page of the internet,&rdquo; is a link aggregator where anyone can submit links to cool happenings. Over the years, Reddit has expanded from just being a link aggregator to allowing images and videos and, more recently, hosting images and videos itself.</p>
<p>Reddit is broken down into subreddits, where each subreddit represents its own community around a particular interest, like <a href="https://www.reddit.com/r/aww">/r/aww</a> for pet photos and <a href="https://www.reddit.com/r/politics/">/r/politics</a> for U.S. politics. The posts on each subreddit are ranked by some function of both the time elapsed since the submission was made and the <em>score</em> of the submission as determined by upvotes and downvotes from other users.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_aww_hu25176234466961965.webp 320w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_aww_hu1059537840248352357.webp 768w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_aww.png 827w" src="reddit_aww.png"/> 
</figure>

<p>There&rsquo;s also an intrinsic pride in seeing something you contributed to the community get lots of upvotes (the submitter also earns karma based on received upvotes, although karma is meaningless and doesn&rsquo;t provide any user benefits). But the reality is that even on the largest subreddits, submissions with 1 point (the default score for new submissions) are the most common, with some subreddits having <em>over half</em> of their submissions at only 1 point.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_dist_facet_hu16210560607117522797.webp 320w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_dist_facet_hu7511568348448049772.webp 768w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_dist_facet_hu8155937286239840678.webp 1024w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_dist_facet.png 1800w" src="reddit_dist_facet.png"/> 
</figure>

<p>The exposure from having a submission go viral on Reddit (especially on larger subreddits) can be valuable, especially if it&rsquo;s your own original content. As a result, there has been a lot of <a href="https://www.brandwatch.com/blog/how-to-get-on-the-front-page-of-reddit/">analysis</a>/<a href="https://www.reddit.com/r/starterpacks/comments/8rkfk9/reddit_front_page_starter_pack/">stereotypes</a> on what techniques help your submission make it to the top of the front page. But almost all claims of &ldquo;cracking&rdquo; the Reddit algorithm are <a href="https://en.wikipedia.org/wiki/Post_hoc_ergo_propter_hoc"><em>post hoc</em> rationalizations</a>, attributing success to things like submission timing and title verbiage of a single submission after the fact. The nature of algorithmic feeds inherently leads to a <a href="https://en.wikipedia.org/wiki/Survivorship_bias">survivorship bias</a>: although users may recognize certain types of posts that appear on the front page, there are many more which follow the same patterns but fail, which makes modeling a successful post very tricky.</p>
<p>I&rsquo;ve touched on analyzing Reddit post performance <a href="https://minimaxir.com/2017/06/reddit-deep-learning/">before</a>, but let&rsquo;s give it another look and see if we can drill down on why Reddit posts do and do not do well.</p>
<h2 id="submission-timing">Submission Timing</h2>
<p>As with many US-based websites, the majority of Reddit users are most active during work hours (9 AM — 5 PM Eastern time weekdays). Most subreddits have submission patterns which fit accordingly.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_prop_hu6981845223318001636.webp 320w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_prop_hu10576795282692131227.webp 768w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_prop_hu2078777263525778722.webp 1024w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_prop.png 1800w" src="reddit_subreddit_prop.png"/> 
</figure>

<p>But what&rsquo;s interesting are the subreddits which <em>deviate</em> from that standard. Gaming subreddits (<a href="https://www.reddit.com/r/DestinyTheGame">/r/DestinyTheGame</a>, <a href="https://www.reddit.com/r/Overwatch">/r/Overwatch</a>) have short bursts of activity after a Tuesday game update/patch, game <em>communication</em> subreddits (<a href="https://www.reddit.com/r/Fireteams">/r/Fireteams</a>, <a href="https://www.reddit.com/r/RocketLeagueExchange">/r/RocketLeagueExchange</a>) are more active <em>outside</em> of work hours as they assume you are playing the game at the time, and Not-Safe-For-Work subreddits (/r/dirtykikpals, /r/gonewild) are incidentally less active during work hours and more active late-night than other subreddits.</p>
<p>Whenever you make a submission to Reddit, the submission appears in the subreddit&rsquo;s <code>/new</code> queue of the most recent submissions, where hopefully kind souls will find your submission and upvote it if it&rsquo;s good.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_new_hu215933848107066800.webp 320w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_new.png 762w" src="reddit_new.png"/> 
</figure>

<p>However, if it falls off the first page of the <code>/new</code> queue, your submission might be as good as dead. As a result, there&rsquo;s an element of game theory to timing your submission if you want it to not become another 1-point submission. Is it better to submit during peak hours when more users may see the submission before it falls off of <code>/new</code>? Is it better to submit <em>before</em> peak usage since there will be less competition, then continue the momentum once it hits the front page?</p>
<p>Here&rsquo;s a look at the median post performance at each given time slot for top subreddits:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy_hu15453516271955278607.webp 320w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy_hu1845382180573999855.webp 768w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy_hu11123541071259965980.webp 1024w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy.png 1800w" src="reddit_subreddit_hr_doy.png"/> 
</figure>
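<p>As a rough sketch of how a chart like this could be computed (not the BigQuery queries actually used for this post), the aggregation boils down to grouping submissions by subreddit and time slot and taking the median score. The <code>df_posts</code> data frame and its values below are hypothetical, made up purely for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)

# Hypothetical toy data: one row per submission, with its time slot and final score.
df_posts &lt;- tibble(
  subreddit   = c("pics", "pics", "pics", "pics", "pics", "pics"),
  day_of_week = c("Mon", "Mon", "Mon", "Tue", "Tue", "Tue"),
  hour        = c(9, 9, 9, 14, 14, 14),
  score       = c(1, 1, 250, 1, 2, 3)
)

# Median score per subreddit and time slot. The median is used instead of
# the mean because a single viral post (250 points) would dominate a mean.
df_medians &lt;- df_posts %&gt;%
  group_by(subreddit, day_of_week, hour) %&gt;%
  summarize(median_score = median(score), n_posts = n(), .groups = "drop")

df_medians
</code></pre></div>
<p>In this toy example the Monday slot still has a median of 1 despite containing a 250-point post, which is the same skew the earlier distribution chart shows.</p>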

<p>As the earlier distribution chart implied, the median score is around 1-2 for most subreddits, and that&rsquo;s consistent across all time slots. Some subreddits with higher medians like /r/me_irl do appear to have a <em>slight</em> benefit when posting before peak activity. When focusing on subreddits with high overall median scores, the difference is more explicit.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian_hu17005185977742628363.webp 320w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian_hu5771425682208630544.webp 768w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian_hu3906090746680530990.webp 1024w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian.png 1800w" src="reddit_subreddit_highmedian.png"/> 
</figure>

<p>Subreddits like /r/PrequelMemes and /r/The_Donald <em>definitely</em> have better performance on average when posts are made before peak activity! Posting before peak usage <em>does</em> appear to be a viable strategy; however, for the majority of subreddits it doesn&rsquo;t make much of a difference.</p>
<h2 id="submission-titles">Submission Titles</h2>
<p>Each subreddit has its own vocabulary and topics of discussion. Let&rsquo;s break down text by subreddit by looking at the 75th percentile for score on posts containing a given two-word phrase:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams_hu3652754302652216611.webp 320w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams_hu7485290599917841926.webp 768w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams_hu15872333215249959520.webp 1024w,https://minimaxir.com/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams.png 1800w" src="reddit_subreddit_topbigrams.png"/> 
</figure>
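<p>For readers who want to try this kind of breakdown themselves, here is a minimal sketch in R of the underlying idea: split each title into two-word phrases, then take the 75th percentile of score per phrase. This is not the query used for the chart above; the <code>df_titles</code> data and the <code>bigrams_of</code> helper are hypothetical and only illustrate the technique.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)

# Hypothetical toy data: submission titles and their final scores.
df_titles &lt;- tibble(
  title = c("I made fan art", "my fan art of link", "fan art dump", "need help"),
  score = c(100, 50, 10, 1)
)

# Hypothetical helper: lowercase a title, split on whitespace,
# and pair up consecutive words into two-word phrases.
bigrams_of &lt;- function(title) {
  words &lt;- strsplit(tolower(title), "\\s+")[[1]]
  if (length(words) &lt; 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# One row per (bigram, score) pair across all titles.
df_bigrams &lt;- bind_rows(lapply(seq_len(nrow(df_titles)), function(i) {
  tibble(bigram = bigrams_of(df_titles$title[i]), score = df_titles$score[i])
}))

# 75th percentile of score for each bigram, highest first.
df_top_bigrams &lt;- df_bigrams %&gt;%
  group_by(bigram) %&gt;%
  summarize(n_posts = n(),
            score_p75 = unname(quantile(score, 0.75)),
            .groups = "drop") %&gt;%
  arrange(desc(score_p75))

df_top_bigrams
</code></pre></div>
<p>On real data you would likely also filter on <code>n_posts</code> so that a phrase appearing in a single lucky post doesn&rsquo;t top the list.</p>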

<p>The one trend consistent across all subreddits is the effectiveness of first-person pronouns (<em>I/my</em>) and original content (<em>fan art</em>). Other than that, the vocabulary and sentiment for successful posts are very specific to the subreddit and the culture it represents; there are no universal guaranteed-success memes.</p>
<h2 id="can-deep-learning-predict-post-performance">Can Deep Learning Predict Post Performance?</h2>
<p>Some might think &ldquo;oh hey, this is an arbitrary statistical problem, you can just build an AI to solve it!&rdquo; So, for the sake of argument, I did.</p>
<p>Instead of using Reddit data for building a deep learning model, we&rsquo;ll use data from <a href="https://news.ycombinator.com">Hacker News</a>, another link aggregator similar to Reddit with a strong focus on technology and startup entrepreneurship. The distribution of scores on posts, submission timings, upvoting, and front page ranking systems are all the same as on Reddit.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/09/modeling-link-aggregators/hn_hu13280095182528689595.webp 320w,https://minimaxir.com/2018/09/modeling-link-aggregators/hn_hu8486119616301924688.webp 768w,https://minimaxir.com/2018/09/modeling-link-aggregators/hn_hu259174151342083437.webp 1024w,https://minimaxir.com/2018/09/modeling-link-aggregators/hn.png 1520w" src="hn.png"/> 
</figure>

<p>The titles on Hacker News submissions are also shorter (80 characters max vs. Reddit&rsquo;s 300 character max) and in concise English (no memes/shitposts allowed), which should help the model learn the title syntax and identify high-impact keywords more easily. Like Reddit, the score data is super-skewed with most HN submissions at 1-2 points, and typical model training will quickly converge but try to predict that <em>every</em> submission has a score of 1, which isn&rsquo;t helpful!</p>
<p>By constructing a model employing <em>many</em> deep learning tricks with <a href="https://keras.io">Keras</a>/<a href="https://www.tensorflow.org">TensorFlow</a> to prevent model cheating, and training it on <em>hundreds of thousands</em> of HN submissions (using post title, day-of-week, hour, and link domain like <code>github.com</code> as model features), the model does converge and finds some signal among the noise (training R<sup>2</sup> ~ 0.55 when trained for 50 epochs). However, it fails to offer any valuable predictions on new, unseen posts (test R<sup>2</sup> <em>&lt; 0.00</em>) because it falls into the same exact human biases regarding titles: it saw submissions with titles that did very well during training, but it can&rsquo;t account for the random chance by which two similar submissions X and Y diverge, with X going viral while Y does not.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/09/modeling-link-aggregators/hn_test_hu5467679921659818169.webp 320w,https://minimaxir.com/2018/09/modeling-link-aggregators/hn_test.png 485w" src="hn_test.png"/> 
</figure>
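<p>A negative R<sup>2</sup> can look odd at first, so here is a small worked sketch in R. It assumes the usual definition, R<sup>2</sup> = 1 &minus; SS<sub>res</sub>/SS<sub>tot</sub> (the exact metric implementation in the model code may differ), and uses made-up scores rather than the actual test data:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"># R^2 = 1 - SS_res / SS_tot. It goes negative when the predictions are
# worse than always guessing the mean of the actual values.
r_squared &lt;- function(actual, predicted) {
  ss_res &lt;- sum((actual - predicted)^2)
  ss_tot &lt;- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

# Hypothetical scores: mostly 1-2 points, with one viral outlier.
actual &lt;- c(1, 1, 2, 1, 300)

# Baseline: always predict the mean, which gives R^2 = 0 by construction.
r2_baseline &lt;- r_squared(actual, rep(mean(actual), length(actual)))

# Confident but wrong "viral" guesses: the viral post is missed and two
# ordinary posts are predicted to go viral, so R^2 drops below zero.
r2_wrong &lt;- r_squared(actual, c(150, 1, 2, 200, 1))

c(baseline = r2_baseline, wrong = r2_wrong)
</code></pre></div>
<p>In other words, a test R<sup>2</sup> below zero means the model&rsquo;s predictions on unseen posts were less useful than predicting the same average score for everything.</p>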

<p>I&rsquo;ve made the Keras/TensorFlow model training code available in <a href="https://www.kaggle.com/minimaxir/hacker-news-submission-score-predictor/notebook">this Kaggle Notebook</a> if you want to fork it and try to improve the model.</p>
<h2 id="other-potential-modeling-factors">Other Potential Modeling Factors</h2>
<p>The deep learning model above makes optimistic assumptions about the underlying data, including that each post behaves independently, and the included features are the sole features which determine the score. These assumptions are questionable.</p>
<p>The simple model forgoes the content of the submission itself, which is hard to retrieve for hundreds of thousands of data points. On Hacker News that&rsquo;s mostly OK since most submissions are links/articles whose titles accurately reflect the content, although occasionally there are idiosyncratic short titles which do the opposite. On Reddit, looking at the content is obviously necessary for image/video-oriented subreddits, and that content is hard to gather and analyze at scale.</p>
<p>A very important concept of post performance is <em>momentum</em>. A post having a high score is a positive signal in itself, which begets more votes (a famous Reddit problem is brigading from /r/all which can cause submission scores to skyrocket). If the front page of a subreddit has a large number of high-performing posts, they might also suppress posts coming out of the <code>/new</code> queue because the score threshold is much higher. A simple model may not be able to capture these impacts; the model would need to incorporate the <em>state of the front page</em> at the time of posting.</p>
<p>Some also try to manipulate upvotes. Reddit became famous for adding the rule &ldquo;asking for upvotes is a violation of intergalactic law&rdquo; to their <a href="https://www.reddithelp.com/en/categories/rules-reporting/account-and-community-restrictions/what-constitutes-vote-cheating-or">Content Policy</a>, although some subreddits do it anyway <a href="https://www.reddit.com/r/TheoryOfReddit/comments/5qqrod/for_years_reddit_told_us_that_saying_upvote_this/">without consequence</a>. On Reddit, obvious spam posts can be downvoted to immediately counteract illicit upvotes. Hacker News has a <a href="https://news.ycombinator.com/newsfaq.html">similar don&rsquo;t-upvote rule</a>, although there aren&rsquo;t downvotes, just a flagging mechanism which quickly neutralizes spam/misleading posts. In general, there&rsquo;s no <em>legitimate</em> reason to highlight your own submission immediately after it&rsquo;s posted (except for Reddit&rsquo;s AMAs). Fortunately, gaming the system is less impactful on Reddit and Hacker News due to their sheer size and countermeasures, but it&rsquo;s a good example of potential user behavior that makes modeling post performance difficult, and hopefully link aggregators of the future aren&rsquo;t susceptible to such shenanigans.</p>
<h2 id="do-we-really-to-predict-post-score">Do We Really Need to Predict Post Score?</h2>
<p>Let&rsquo;s say you are submitting original content to Reddit or your own tech project to Hacker News. More points means a higher ranking, which means more exposure for your link, right? Not exactly. As the Reddit/HN screenshots above show, the scores of popular submissions are all over the place ranking-wise, having been affected by age penalties.</p>
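<p>To illustrate how an age penalty can decouple score from rank, here is a minimal sketch in R. It assumes the commonly cited approximation of Hacker News-style ranking, <code>(points - 1) / (age_hours + 2)^1.8</code>; that formula is not taken from this analysis, the real ranking systems include other adjustments, and the posts below are made up:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"># Assumed, commonly cited approximation of a Hacker News-style ranking:
# the older a post is, the more points it needs to hold its position.
rank_score &lt;- function(points, age_hours, gravity = 1.8) {
  (points - 1) / (age_hours + 2)^gravity
}

# Hypothetical posts: an old hit with many points vs. a fresh post with few.
posts &lt;- data.frame(
  title     = c("old hit", "fresh post"),
  points    = c(500, 60),
  age_hours = c(20, 1)
)

posts$rank_score &lt;- rank_score(posts$points, posts$age_hours)

# Sort by rank score, highest first: the fresh 60-point post outranks
# the 20-hour-old 500-point post under this assumed formula.
posts[order(-posts$rank_score), ]
</code></pre></div>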
<p>In practical terms, from my own purely anecdotal experience, submissions at a top ranking receive <em>substantially</em> more clickthroughs despite being spatially close on the page to others.</p>
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">&hellip;and now traffic at #3.<br><br>Placement is absurdly important for search engines/social media sites. Difference between #1 and #3 is dramatic. <a href="https://t.co/nGjWJBx6dU">pic.twitter.com/nGjWJBx6dU</a></p>— Max Woolf (@minimaxir) <a href="https://twitter.com/minimaxir/status/877219784907149316?ref_src=twsrc%5Etfw">June 20, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></span></p>
<p>In <a href="https://twitter.com/minimaxir/status/877219784907149316">that case</a>, falling from #1 to #3 <em>immediately halved</em> the referral traffic coming from Hacker News.</p>
<p>Therefore, an ideal link aggregator predictive model to maximize clicks should try to predict the <em>rank</em> of a submission (max rank, average rank over a period <em>n</em>, etc.), not necessarily the score it receives. You could theoretically create a model by making a snapshot of a Reddit subreddit/front page of Hacker News every minute or so which includes the post position at the time of the snapshot. As mentioned earlier, the snapshots can also be used as a model feature to identify whether the front page is active or stale. Unfortunately, snapshots can&rsquo;t be retrieved retroactively, and storing, processing, and analyzing snapshots at scale is a difficult and <em>expensive</em> feat of data engineering.</p>
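<p>As a sketch of what those rank-based targets might look like once snapshots exist, the following R snippet summarizes a hypothetical <code>df_snapshots</code> table (one row per post per snapshot; the column names and values are assumptions for illustration) into a best rank and an average rank per post:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)

# Hypothetical snapshots: the front page position of each post at each minute.
df_snapshots &lt;- tibble(
  snapshot_minute = c(1, 1, 2, 2, 3, 3),
  post_id         = c("a", "b", "a", "b", "a", "b"),
  position        = c(3, 1, 1, 2, 2, 5)
)

# Candidate model targets per post. Note that the "max rank" is the
# numerically smallest position, since #1 is the top of the page.
df_rank_targets &lt;- df_snapshots %&gt;%
  group_by(post_id) %&gt;%
  summarize(
    best_rank       = min(position),
    average_rank    = mean(position),
    minutes_on_page = n(),
    .groups = "drop"
  )

df_rank_targets
</code></pre></div>
<p>Both toy posts reach #1 at some point, but their average ranks differ, which is the kind of distinction a score-only model would miss.</p>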
<p>Presumably Reddit&rsquo;s data scientists would be incorporating submission position as a part of their data analytics and modeling, but after inspecting what&rsquo;s sent to Reddit&rsquo;s servers when you perform an action like upvoting, I wasn&rsquo;t able to find a sent position value when upvoting from the feed: only the post score and post upvote percentage at the time of the action were sent.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/09/modeling-link-aggregators/chrome_hu7598567199388751497.webp 320w,https://minimaxir.com/2018/09/modeling-link-aggregators/chrome_hu835536969575416485.webp 768w,https://minimaxir.com/2018/09/modeling-link-aggregators/chrome_hu7852716498606684125.webp 1024w,https://minimaxir.com/2018/09/modeling-link-aggregators/chrome.png 1442w" src="chrome.png"/> 
</figure>

<p>In this example, I upvoted the <code>Fact are facts</code> submission at position #5: we&rsquo;d expect a value between <code>3</code> and <code>5</code> to be sent with the post metadata within the analytics payload, but that&rsquo;s not the case.</p>
<p>Optimizing ranking instead of a tangible metric or classification accuracy is a relatively underdiscussed field of modern data science (besides <a href="https://en.wikipedia.org/wiki/Search_engine_optimization">SEO</a> for getting the top spot on a Google search), and it would be interesting to dive deeper into it for other applications.</p>
<h2 id="in-the-future">In the future</h2>
<p>The moral of this post is that you should not take it personally if a submission fails to hit the front page. It doesn&rsquo;t necessarily mean it&rsquo;s bad. Conversely, if a post does well, don’t assume that similar posts will do just as well. There&rsquo;s a lot of quality content that falls through the cracks due to dumb luck. Fortunately, both Reddit and Hacker News allow reposts, which helps alleviate this particular problem.</p>
<p>There&rsquo;s still a lot that can be done to more deterministically predict the behavior of these algorithmic feeds. There&rsquo;s also room to help make these link aggregators more <em>fair</em>. Unfortunately, there are even more undiscovered ways to game these algorithms, and we&rsquo;ll see how things play out.</p>
<hr>
<p><em>You can view the BigQuery queries used to get the Reddit and Hacker News data, plus the R and ggplot2 used to create the data visualizations, in <a href="http://minimaxir.com/notebooks/modeling-link-aggregators/">this R Notebook</a>. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/modeling-link-aggregators">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Analyzing IMDb Data The Intended Way, with R and ggplot2</title>
      <link>https://minimaxir.com/2018/07/imdb-data-analysis/</link>
      <pubDate>Mon, 16 Jul 2018 09:45:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/07/imdb-data-analysis/</guid>
      <description>For IMDb&amp;rsquo;s big-but-not-big data, you have to play with the data smartly, and both R and ggplot2 have neat tricks to do just that.</description>
      <content:encoded><![CDATA[
    <div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/P4_zSfoTM80?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p><a href="https://www.imdb.com">IMDb</a>, the Internet Movie Database, has been a popular source for data analysis and visualizations over the years. The combination of user ratings for movies and detailed movie metadata has always been fun to <a href="http://minimaxir.com/2016/01/movie-revenue-ratings/">play with</a>.</p>
<p>There are a number of tools to help get IMDb data, such as <a href="https://github.com/alberanid/imdbpy">IMDbPY</a>, which makes it easy to programmatically scrape IMDb by pretending to be a website user and extracting the relevant data from the page&rsquo;s HTML output. While it <em>works</em>, web scraping public data is a gray area in terms of legality; many large websites have a Terms of Service which forbids scraping, and can potentially send a DMCA take-down notice to websites redistributing scraped data.</p>
<p>IMDb has <a href="https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX">data licensing terms</a> which forbid scraping and require an attribution in the form of an <strong>Information courtesy of IMDb (<a href="http://www.imdb.com">http://www.imdb.com</a>). Used with permission.</strong> statement, and has also <a href="https://www.kaggle.com/tmdb/tmdb-movie-metadata/home">DMCAed a Kaggle IMDb dataset</a> to drive the point home.</p>
<p>However, there is good news! IMDb publishes an <a href="https://www.imdb.com/interfaces/">official dataset</a> for casual data analysis! And it&rsquo;s now very accessible: just choose a dataset and download it (now with no hoops to jump through), and the files are in the standard <a href="https://en.wikipedia.org/wiki/Tab-separated_values">TSV format</a>.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/datasets_hu7863836115409348825.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/datasets_hu18403351276810519768.webp 768w,https://minimaxir.com/2018/07/imdb-data-analysis/datasets.png 926w" src="datasets.png"/> 
</figure>

<p>The uncompressed files are pretty large; not &ldquo;big data&rdquo; large (it fits into computer memory), but Excel will explode if you try to open them in it. You have to play with the data <em>smartly</em>, and both <a href="https://www.r-project.org">R</a> and <a href="https://ggplot2.tidyverse.org/reference/index.html">ggplot2</a> have neat tricks to do just that.</p>
<h2 id="first-steps">First Steps</h2>
<p>R is a popular programming language for statistical analysis. One of the most popular collections of external packages is the <code>tidyverse</code>, which automatically imports the <code>ggplot2</code> data visualization library and other useful packages which we&rsquo;ll get to one-by-one. We&rsquo;ll also load <code>scales</code>, which we&rsquo;ll use later for prettier number formatting. First we&rsquo;ll load these packages:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">scales</span><span class="p">)</span>
</span></span></code></pre></div><p>And now we can load a TSV downloaded from IMDb using the <code>read_tsv</code> function from <code>readr</code> (a tidyverse package), which does what the name implies at a much faster speed than base R. A couple of other parameters handle the data encoding: <code>na</code> treats IMDb&rsquo;s <code>\N</code> placeholder as a missing value, and an empty <code>quote</code> disables quote parsing. Let&rsquo;s start with the <code>ratings</code> file:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;title.ratings.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span></span></span></code></pre></div>
<p>We can preview what&rsquo;s in the loaded data using <code>dplyr</code> (a tidyverse package), which is what we&rsquo;ll be using to manipulate data for this analysis. dplyr allows you to pipe commands, making it easy to create a sequence of manipulation commands. For now, we&rsquo;ll use <code>head()</code>, which displays the top few rows of the data frame.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">()</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/ratings_hu13948815259199649470.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/ratings_hu5303107099751686438.webp 768w,https://minimaxir.com/2018/07/imdb-data-analysis/ratings.png 930w" src="ratings.png"/> 
</figure>

<p>Each of the <strong>873k rows</strong> corresponds to a single movie, with an ID for the movie, its average rating (from 1 to 10), and the number of votes which contribute to that average. Since we have two numeric variables, why not test out ggplot2 by creating a scatterplot mapping them? ggplot2 takes in a data frame and names of columns as aesthetics, then you specify what type of shape to plot (a &ldquo;geom&rdquo;). Passing the plot to <code>ggsave</code> saves it as a standalone, high-quality data visualization.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">numVotes</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_point</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">ggsave</span><span class="p">(</span><span class="s">&#34;imdb-0.png&#34;</span><span class="p">,</span> <span class="n">plot</span><span class="p">,</span> <span class="n">width</span> <span class="o">=</span> <span class="m">4</span><span class="p">,</span> <span class="n">height</span> <span class="o">=</span> <span class="m">3</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/imdb-0_hu6134842797724310445.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-0_hu3266731833703060037.webp 768w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-0_hu3168636653484656689.webp 1024w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-0.png 1200w" src="imdb-0.png"/> 
</figure>

<p>Here are nearly <em>1 million</em> points on a single chart; definitely don&rsquo;t try to do that in Excel! However, it&rsquo;s not a <em>useful</em> chart since all the points are opaque and we&rsquo;re not sure what the spatial density of points is. One approach to fix this issue is to create a heat map of points, which ggplot can do natively with <code>geom_bin2d</code>. We can color the heat map with the <a href="https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html">viridis</a> colorblind-friendly palettes <a href="https://ggplot2.tidyverse.org/reference/scale_viridis.html">just introduced</a> into ggplot2. We should also tweak the axes. The x-axis should be scaled logarithmically with <code>scale_x_log10</code> since there are many movies with high numbers of votes, and we can format those numbers with the <code>comma</code> function from the <code>scales</code> package (we can format the fill scale with <code>comma</code> too). For the y-axis, we can add explicit number breaks for each rating; R can do this neatly by setting the breaks to <code>1:10</code>. Putting it all together:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">numVotes</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_bin2d</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_x_log10</span><span class="p">(</span><span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_viridis_c</span><span class="p">(</span><span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/imdb-1_hu16407796210713744466.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-1_hu7919033511924613114.webp 768w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-1_hu15981801755106187835.webp 1024w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-1.png 1200w" src="imdb-1.png"/> 
</figure>

<p>Not bad, although it unfortunately confirms that IMDb follows a <a href="https://tvtropes.org/pmwiki/pmwiki.php/Main/FourPointScale">Four Point Scale</a> where average ratings tend to fall between 6 — 9.</p>
<h2 id="mapping-movies-to-ratings">Mapping Movies to Ratings</h2>
<p>You may be asking &ldquo;which ratings correspond to which movies?&rdquo; That&rsquo;s what the <code>tconst</code> field is for. But first, let&rsquo;s load the title data from <code>title.basics.tsv</code> into <code>df_basics</code> and take a look as before.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_basics</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;title.basics.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/basics1_hu5171046096969118144.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/basics1_hu9870877445783615510.webp 768w,https://minimaxir.com/2018/07/imdb-data-analysis/basics1_hu18072297923411652101.webp 1024w,https://minimaxir.com/2018/07/imdb-data-analysis/basics1.png 1350w" src="basics1.png"/> 
</figure>

<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/basics2_hu12589800305372186450.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/basics2_hu12558651774804560685.webp 768w,https://minimaxir.com/2018/07/imdb-data-analysis/basics2_hu14225003055189698458.webp 1024w,https://minimaxir.com/2018/07/imdb-data-analysis/basics2.png 1374w" src="basics2.png"/> 
</figure>
</p>
<p>We have some neat movie metadata. Notably, this table has a <code>tconst</code> field as well. Therefore, we can <em>join</em> the two tables together, adding the movie information to the corresponding row in the rating table (in this case, a left join is more appropriate than an inner/full join).</p>
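<p>To make the difference between the join types concrete, here is a toy sketch with two tiny, made-up tables (<code>toy_ratings</code> and <code>toy_basics</code> are hypothetical stand-ins, not the real IMDb data). It also passes <code>by = "tconst"</code> explicitly, which avoids relying on dplyr guessing the shared column:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)

# Hypothetical miniature versions of the ratings and basics tables.
toy_ratings &lt;- tibble(tconst = c("tt01", "tt02", "tt03"),
                      averageRating = c(7.1, 5.4, 8.3))
toy_basics  &lt;- tibble(tconst = c("tt01", "tt03", "tt04"),
                      primaryTitle = c("A", "C", "D"))

# Left join: keeps every rating row; tt02 has no metadata, so it gets NA.
toy_left &lt;- left_join(toy_ratings, toy_basics, by = "tconst")

# Inner join: drops tt02 because it has no match in toy_basics.
toy_inner &lt;- inner_join(toy_ratings, toy_basics, by = "tconst")

# Full join: also adds tt04, which has metadata but no rating.
toy_full &lt;- full_join(toy_ratings, toy_basics, by = "tconst")

toy_left
</code></pre></div>
<p>On the real data, the join is a one-liner:</p>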
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">&lt;-</span> <span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_basics</span><span class="p">)</span>
</span></span></code></pre></div><p>Runtime minutes sounds interesting. Could there be a relationship between the length of a movie and its average rating on IMDb? Let&rsquo;s make a heat map plot again, but with a few tweaks. With the new metadata, we can <code>filter</code> the table to remove bad points; let&rsquo;s keep movies only (as IMDb data also contains <em>television show data</em>), with a runtime &lt; 3 hours, and which have received at least 10 votes by users (to remove extraneous movies). The x-axis should be tweaked to display the minute values in hours. The fill viridis palette can be changed to another one in the family (I personally like <code>inferno</code>).</p>
<p>More importantly, let&rsquo;s discuss plot theming. If you want a minimalistic theme, add a <code>theme_minimal</code> to the plot, and you can pass a <code>base_family</code> to change the default font on the plot and a <code>base_size</code> to change the font size. The <code>labs</code> function lets you add labels to the plot (which you should <em>always</em> do); you have your <code>title</code>, <code>x</code>, and <code>y</code> parameters, but you can also add a <code>subtitle</code>, a <code>caption</code> for attribution, and a <code>color</code>/<code>fill</code> to name the scale. Putting it all together:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">runtimeMinutes</span> <span class="o">&lt;</span> <span class="m">180</span><span class="p">,</span> <span class="n">titleType</span> <span class="o">==</span> <span class="s">&#34;movie&#34;</span><span class="p">,</span> <span class="n">numVotes</span> <span class="o">&gt;=</span> <span class="m">10</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">runtimeMinutes</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_bin2d</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">180</span><span class="p">,</span> <span class="m">60</span><span class="p">),</span> <span class="n">labels</span> <span class="o">=</span> <span class="m">0</span><span class="o">:</span><span class="m">3</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="m">0</span><span class="o">:</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_viridis_c</span><span class="p">(</span><span class="n">option</span> <span class="o">=</span> <span class="s">&#34;inferno&#34;</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">theme_minimal</span><span class="p">(</span><span class="n">base_family</span> <span class="o">=</span> <span class="s">&#34;Source Sans Pro&#34;</span><span class="p">,</span> <span class="n">base_size</span> <span class="o">=</span> <span class="m">8</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">labs</span><span class="p">(</span><span class="n">title</span> <span class="o">=</span> <span class="s">&#34;Relationship between Movie Runtime and Average Movie Rating&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">subtitle</span> <span class="o">=</span> <span class="s">&#34;Data from IMDb retrieved July 4th, 2018&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">x</span> <span class="o">=</span> <span class="s">&#34;Runtime (Hours)&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">y</span> <span class="o">=</span> <span class="s">&#34;Average User Rating&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">caption</span> <span class="o">=</span> <span class="s">&#34;Max Woolf — minimaxir.com&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">fill</span> <span class="o">=</span> <span class="s">&#34;# Movies&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/imdb-2b_hu5622052623198360170.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-2b_hu3715381085374890563.webp 768w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-2b_hu10643667316762467152.webp 1024w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-2b.png 1200w" src="imdb-2b.png"/> 
</figure>

<p>Now that&rsquo;s pretty nice-looking for only a few lines of code! Albeit unhelpful, as there doesn&rsquo;t appear to be a correlation.</p>
<p><em>(Note: for the rest of this post, the theming/labels code will be omitted for convenience)</em></p>
<p>How about movie ratings vs. the year the movie was made? It&rsquo;s a similar plot code-wise to the one above (one perk about <code>ggplot2</code> is that there&rsquo;s no shame in reusing chart code!), but we can add a <code>geom_smooth</code>, which adds a nonparametric trendline with confidence bands for the trend; since we have a large amount of data, the bands are very tight. We can also fix the problem of &ldquo;empty&rdquo; bins by setting the color fill scale to logarithmic scaling. And since we&rsquo;re adding a black trendline, let&rsquo;s change the viridis palette to <code>plasma</code> for better contrast.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">titleType</span> <span class="o">==</span> <span class="s">&#34;movie&#34;</span><span class="p">,</span> <span class="n">numVotes</span> <span class="o">&gt;=</span> <span class="m">10</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">startYear</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_bin2d</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_smooth</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">&#34;black&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_x_continuous</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_viridis_c</span><span class="p">(</span><span class="n">option</span> <span class="o">=</span> <span class="s">&#34;plasma&#34;</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">,</span> <span class="n">trans</span> <span class="o">=</span> <span class="s">&#39;log10&#39;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/imdb-4_hu16393258413625180940.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-4_hu2361684155542955917.webp 768w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-4_hu5511852794587375111.webp 1024w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-4.png 1200w" src="imdb-4.png"/> 
</figure>

<p>Unfortunately, this trend hasn&rsquo;t changed much either, although the presence of average ratings outside the Four Point Scale has increased over time.</p>
<h2 id="mapping-lead-actors-to-movies">Mapping Lead Actors to Movies</h2>
<p>Now that we have a handle on working with the IMDb data, let&rsquo;s try playing with the larger datasets. Since they take up a lot of computer memory, we only want to persist data we actually might use. After looking at the schema provided with the official datasets, the only really useful metadata about the actors is their birth year, so let&rsquo;s load that, keeping only the actors/actresses (using the fast <code>str_detect</code> function from <code>stringr</code>, another tidyverse package) and the relevant fields.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_actors</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;name.basics.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                <span class="nf">filter</span><span class="p">(</span><span class="nf">str_detect</span><span class="p">(</span><span class="n">primaryProfession</span><span class="p">,</span> <span class="s">&#34;actor|actress&#34;</span><span class="p">))</span>  <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                <span class="nf">select</span><span class="p">(</span><span class="n">nconst</span><span class="p">,</span> <span class="n">primaryName</span><span class="p">,</span> <span class="n">birthYear</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/actor_hu10487424934138013860.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/actor_hu11157392983137949493.webp 768w,https://minimaxir.com/2018/07/imdb-data-analysis/actor.png 936w" src="actor.png"/> 
</figure>

<p>The principals dataset, the large 1.28GB TSV, is the most interesting. It&rsquo;s an unnested list of the credited persons in each movie, with an <code>ordering</code> indicating their rank (where <code>1</code> means first, <code>2</code> means second, etc.).</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/principals_hu11943327370438692943.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/principals_hu3553100163218354409.webp 768w,https://minimaxir.com/2018/07/imdb-data-analysis/principals_hu2543647643657283133.webp 1024w,https://minimaxir.com/2018/07/imdb-data-analysis/principals.png 1074w" src="principals.png"/> 
</figure>

<p>For this analysis, let&rsquo;s only look at the <strong>lead actors/actresses</strong>; specifically, for each movie (identified by the <code>tconst</code> value), filter the dataset to where the <code>ordering</code> value is the lowest (in this case, the person at rank <code>1</code> may not necessarily be an actor/actress).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_principals</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;title.principals.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">filter</span><span class="p">(</span><span class="nf">str_detect</span><span class="p">(</span><span class="n">category</span><span class="p">,</span> <span class="s">&#34;actor|actress&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="n">tconst</span><span class="p">,</span> <span class="n">ordering</span><span class="p">,</span> <span class="n">nconst</span><span class="p">,</span> <span class="n">category</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">group_by</span><span class="p">(</span><span class="n">tconst</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">filter</span><span class="p">(</span><span class="n">ordering</span> <span class="o">==</span> <span class="nf">min</span><span class="p">(</span><span class="n">ordering</span><span class="p">))</span>
</span></span></code></pre></div><p>Both datasets have a <code>nconst</code> field, so let&rsquo;s join them together. And then join <em>that</em> to the ratings table from earlier via <code>tconst</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_principals</span> <span class="o">&lt;-</span> <span class="n">df_principals</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_actors</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">&lt;-</span> <span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_principals</span><span class="p">)</span>
</span></span></code></pre></div><p>Now we have a fully denormalized dataset in <code>df_ratings</code>. Since it contains both the movie release year and the birth year of the lead actor, we can infer <em>the age of the lead actor at the movie release</em>. With that goal, filter the data on the criteria we&rsquo;ve used for earlier data visualizations, and additionally keep only the rows which have an actor&rsquo;s birth year.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings_movies</span> <span class="o">&lt;-</span> <span class="n">df_ratings</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                        <span class="nf">filter</span><span class="p">(</span><span class="n">titleType</span> <span class="o">==</span> <span class="s">&#34;movie&#34;</span><span class="p">,</span> <span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">birthYear</span><span class="p">),</span> <span class="n">numVotes</span> <span class="o">&gt;=</span> <span class="m">10</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                        <span class="nf">mutate</span><span class="p">(</span><span class="n">age_lead</span> <span class="o">=</span> <span class="n">startYear</span> <span class="o">-</span> <span class="n">birthYear</span><span class="p">)</span>
</span></span></code></pre></div><p><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/denorm1_hu17897601220600655212.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/denorm1_hu11268818236259188540.webp 768w,https://minimaxir.com/2018/07/imdb-data-analysis/denorm1_hu4732603171930879136.webp 1024w,https://minimaxir.com/2018/07/imdb-data-analysis/denorm1.png 1604w" src="denorm1.png"/> 
</figure>

<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/denorm2_hu14374280315869545296.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/denorm2.png 531w" src="denorm2.png"/> 
</figure>
</p>
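<p>To make the join-then-subtract logic concrete, here is a minimal sketch on toy tibbles. The <code>toy_*</code> tables and every value in them are made up for illustration (they are not IMDb data), and spelling out the <code>by</code> keys is my own choice for this sketch; the code above lets dplyr infer the shared column.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl">library(dplyr)
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"># Toy stand-ins for the real tables; every value here is invented.
</span></span><span class="line"><span class="cl">toy_actors &lt;- tibble(nconst = c(&#34;nm1&#34;, &#34;nm2&#34;),
</span></span><span class="line"><span class="cl">                     primaryName = c(&#34;Actor A&#34;, &#34;Actress B&#34;),
</span></span><span class="line"><span class="cl">                     birthYear = c(1950, NA))
</span></span><span class="line"><span class="cl">toy_principals &lt;- tibble(tconst = c(&#34;tt1&#34;, &#34;tt2&#34;),
</span></span><span class="line"><span class="cl">                         nconst = c(&#34;nm1&#34;, &#34;nm2&#34;),
</span></span><span class="line"><span class="cl">                         category = c(&#34;actor&#34;, &#34;actress&#34;))
</span></span><span class="line"><span class="cl">toy_ratings &lt;- tibble(tconst = c(&#34;tt1&#34;, &#34;tt2&#34;, &#34;tt3&#34;),
</span></span><span class="line"><span class="cl">                      startYear = c(1990, 2000, 2010))
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"># Attach the lead to each movie, then the lead&#39;s birth year,
</span></span><span class="line"><span class="cl"># drop rows without a birth year, and compute the age at release.
</span></span><span class="line"><span class="cl">toy_joined &lt;- toy_ratings %&gt;%
</span></span><span class="line"><span class="cl">  left_join(toy_principals, by = &#34;tconst&#34;) %&gt;%
</span></span><span class="line"><span class="cl">  left_join(toy_actors, by = &#34;nconst&#34;) %&gt;%
</span></span><span class="line"><span class="cl">  filter(!is.na(birthYear)) %&gt;%
</span></span><span class="line"><span class="cl">  mutate(age_lead = startYear - birthYear)
</span></span></code></pre></div><p>Only <code>tt1</code> survives, with an <code>age_lead</code> of 40: <code>tt2</code> has no birth year and <code>tt3</code> has no credited actor, so the left joins leave <code>NA</code> in <code>birthYear</code> for both and the filter drops them.</p>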
<h2 id="plotting-ages">Plotting Ages</h2>
<p>Age discrimination in movie casting has been a recurring issue in Hollywood; in fact, in 2017 <a href="https://www.hollywoodreporter.com/thr-esq/judge-pauses-enforcement-imdb-age-censorship-law-978797">a law was signed</a> to force IMDb to remove an actor&rsquo;s age upon request, which in February 2018 was <a href="https://www.hollywoodreporter.com/thr-esq/californias-imdb-age-censorship-law-declared-unconstitutional-1086540">ruled to be unconstitutional</a>.</p>
<p>Have the ages of movie leads changed over time? For this example, we&rsquo;ll use a <a href="https://ggplot2.tidyverse.org/reference/geom_ribbon.html">ribbon plot</a> to plot the ranges of ages of movie leads. A simple way to do that is, for each year, calculate the 25th <a href="https://en.wikipedia.org/wiki/Percentile">percentile</a> of the ages, the 50th percentile (i.e. the median), and the 75th percentile, where the 25th and 75th percentiles are the ribbon bounds and the line represents the median.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_actor_ages</span> <span class="o">&lt;-</span> <span class="n">df_ratings_movies</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">group_by</span><span class="p">(</span><span class="n">startYear</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">summarize</span><span class="p">(</span><span class="n">low_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.25</span><span class="p">,</span> <span class="n">na.rm</span><span class="o">=</span><span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">med_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.50</span><span class="p">,</span> <span class="n">na.rm</span><span class="o">=</span><span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">high_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.75</span><span class="p">,</span> <span class="n">na.rm</span><span class="o">=</span><span class="bp">T</span><span class="p">))</span>
</span></span></code></pre></div><p>Plotting it with ggplot2 is surprisingly simple, although you need to use different y aesthetics for the ribbon and the overlapping line.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_actor_ages</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">startYear</span> <span class="o">&gt;=</span> <span class="m">1920</span><span class="p">)</span> <span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">startYear</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_ribbon</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">ymin</span> <span class="o">=</span> <span class="n">low_age</span><span class="p">,</span> <span class="n">ymax</span> <span class="o">=</span> <span class="n">high_age</span><span class="p">),</span> <span class="n">alpha</span> <span class="o">=</span> <span class="m">0.2</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_line</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="n">med_age</span><span class="p">))</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/imdb-8_hu10948478971841059925.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-8_hu13051318194336975597.webp 768w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-8_hu317573737393000708.webp 1024w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-8.png 1200w" src="imdb-8.png"/> 
</figure>

<p>Turns out that in the 2000s, the median age of lead actors started to <em>increase</em>? Both the upper and lower bounds increased too. That doesn&rsquo;t square with the age discrimination complaints.</p>
<p>Another aspect of these complaints is gender, as actresses tend to be younger than actors. Thanks to the magic of ggplot2 and dplyr, separating actors/actresses is relatively simple: add gender (encoded in <code>category</code>) as a grouping variable, add it as a color/fill aesthetic in ggplot, and set colors appropriately (I recommend the <a href="http://colorbrewer2.org/">ColorBrewer</a> qualitative palettes for categorical variables).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_actor_ages_lead</span> <span class="o">&lt;-</span> <span class="n">df_ratings_movies</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">group_by</span><span class="p">(</span><span class="n">startYear</span><span class="p">,</span> <span class="n">category</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">summarize</span><span class="p">(</span><span class="n">low_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.25</span><span class="p">,</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">med_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.50</span><span class="p">,</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">high_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.75</span><span class="p">,</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_actor_ages_lead</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">startYear</span> <span class="o">&gt;=</span> <span class="m">1920</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">startYear</span><span class="p">,</span> <span class="n">fill</span> <span class="o">=</span> <span class="n">category</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="n">category</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_ribbon</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">ymin</span> <span class="o">=</span> <span class="n">low_age</span><span class="p">,</span> <span class="n">ymax</span> <span class="o">=</span> <span class="n">high_age</span><span class="p">),</span> <span class="n">alpha</span> <span class="o">=</span> <span class="m">0.2</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_line</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="n">med_age</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_brewer</span><span class="p">(</span><span class="n">palette</span> <span class="o">=</span> <span class="s">&#34;Set1&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_color_brewer</span><span class="p">(</span><span class="n">palette</span> <span class="o">=</span> <span class="s">&#34;Set1&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/imdb-9_hu6690156637882651252.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-9_hu15681283428661429333.webp 768w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-9_hu8179671388075376659.webp 1024w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-9.png 1200w" src="imdb-9.png"/> 
</figure>

<p>There&rsquo;s about a 10-year gap between the ages of male and female leads, and the gap doesn&rsquo;t change over time. But both start to rise at the same time.</p>
<p>One possible explanation for this behavior is actor reuse: if Hollywood keeps casting the same actors/actresses, by construction the ages of the leads will start to steadily increase. Let&rsquo;s verify that: with our list of movies and their lead actors, for each lead actor, order all their movies by release year, and add a ranking for the nth time that actor has been a lead actor. This is possible through the use of <code>row_number</code> in dplyr, and <a href="https://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html">window functions</a> like <code>row_number</code> are data science&rsquo;s most useful secret.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings_movies_nth</span> <span class="o">&lt;-</span> <span class="n">df_ratings_movies</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                      <span class="nf">group_by</span><span class="p">(</span><span class="n">nconst</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                      <span class="nf">arrange</span><span class="p">(</span><span class="n">startYear</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                      <span class="nf">mutate</span><span class="p">(</span><span class="n">nth_lead</span> <span class="o">=</span> <span class="nf">row_number</span><span class="p">())</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/row_number_hu11150803237373467722.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/row_number_hu7522724450310747674.webp 768w,https://minimaxir.com/2018/07/imdb-data-analysis/row_number_hu15997550928508463062.webp 1024w,https://minimaxir.com/2018/07/imdb-data-analysis/row_number.png 1532w" src="row_number.png"/> 
</figure>
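<p>If the window function is hard to picture, here is a minimal sketch on a toy table. The <code>toy_leads</code> rows are invented, and <code>.by_group = TRUE</code> plus the final <code>ungroup</code> are my additions so the output prints grouped by person; the numbering within each person matches what the global sort above produces.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl">library(dplyr)
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"># Toy lead credits; identifiers and years are invented for illustration.
</span></span><span class="line"><span class="cl">toy_leads &lt;- tibble(nconst = c(&#34;nm1&#34;, &#34;nm2&#34;, &#34;nm1&#34;, &#34;nm1&#34;, &#34;nm2&#34;),
</span></span><span class="line"><span class="cl">                    startYear = c(2005, 1999, 1995, 2001, 2010))
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"># Within each person, sort by release year and number the rows 1, 2, 3, ...
</span></span><span class="line"><span class="cl">toy_nth &lt;- toy_leads %&gt;%
</span></span><span class="line"><span class="cl">  group_by(nconst) %&gt;%
</span></span><span class="line"><span class="cl">  arrange(startYear, .by_group = TRUE) %&gt;%
</span></span><span class="line"><span class="cl">  mutate(nth_lead = row_number()) %&gt;%
</span></span><span class="line"><span class="cl">  ungroup()
</span></span></code></pre></div><p>Here <code>nm1</code> gets <code>nth_lead</code> values 1, 2, 3 for 1995, 2001, and 2005, and the count restarts at 1 for <code>nm2</code>.</p>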

<p>One more ribbon plot later (w/ same code as above + custom y-axis breaks):</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/07/imdb-data-analysis/imdb-12_hu18285805988034086464.webp 320w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-12_hu1353155581280692189.webp 768w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-12_hu3015684537399317154.webp 1024w,https://minimaxir.com/2018/07/imdb-data-analysis/imdb-12.png 1200w" src="imdb-12.png"/> 
</figure>
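<p>Since that code is omitted, here is a rough sketch of what such a ribbon plot could look like, run on a toy table so it stands alone. The toy values, the <code>*_nth</code> column names, and the specific y-axis breaks are all assumptions of mine, not the code used for the chart above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl">library(dplyr)
</span></span><span class="line"><span class="cl">library(ggplot2)
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"># Toy stand-in for df_ratings_movies_nth; every value is invented.
</span></span><span class="line"><span class="cl">toy_nth &lt;- tibble(startYear = rep(c(1990, 2000, 2010), each = 5),
</span></span><span class="line"><span class="cl">                  nth_lead = c(1, 2, 3, 4, 5,  1, 1, 2, 3, 4,  1, 1, 1, 2, 3))
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"># Same quartile summary as for the ages, applied to nth_lead instead.
</span></span><span class="line"><span class="cl">toy_nth_summary &lt;- toy_nth %&gt;%
</span></span><span class="line"><span class="cl">  group_by(startYear) %&gt;%
</span></span><span class="line"><span class="cl">  summarize(low_nth = quantile(nth_lead, 0.25),
</span></span><span class="line"><span class="cl">            med_nth = quantile(nth_lead, 0.50),
</span></span><span class="line"><span class="cl">            high_nth = quantile(nth_lead, 0.75))
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"># Ribbon for the quartile range, line for the median;
</span></span><span class="line"><span class="cl"># the breaks below are a hypothetical choice of custom y-axis breaks.
</span></span><span class="line"><span class="cl">plot &lt;- ggplot(toy_nth_summary, aes(x = startYear)) +
</span></span><span class="line"><span class="cl">          geom_ribbon(aes(ymin = low_nth, ymax = high_nth), alpha = 0.2) +
</span></span><span class="line"><span class="cl">          geom_line(aes(y = med_nth)) +
</span></span><span class="line"><span class="cl">          scale_y_continuous(breaks = c(1:5, 10))
</span></span></code></pre></div>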

<p>Huh. The median and upper-bound nth time as lead have <em>dropped</em> over time? Hollywood has been promoting more newcomers as leads? That&rsquo;s not what I expected!</p>
<p>More work definitely needs to be done in this area. In the meantime, the official IMDb datasets are a lot more robust than I thought they would be! And I only used a fraction of the datasets; the rest tie into TV shows, which are a bit messier. Hopefully you&rsquo;ve gotten a good taste of the power of R and ggplot2 for playing with big-but-not-big data!</p>
<hr>
<p><em>You can view the R and ggplot used to create the data visualizations in <a href="http://minimaxir.com/notebooks/imdb-data-analysis/">this R Notebook</a>, which includes many visualizations not used in this post. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/imdb-data-analysis">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Visualizing One Million NCAA Basketball Shots</title>
      <link>https://minimaxir.com/2018/03/basketball-shots/</link>
      <pubDate>Mon, 19 Mar 2018 09:20:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/03/basketball-shots/</guid>
      <description>Although visualizing basketball shots has been done before, this time we have access to an order of magnitude more public data to do some really cool stuff.</description>
      <content:encoded><![CDATA[<p>So <a href="https://www.ncaa.com/march-madness">March Madness</a> is happening right now. In celebration, <a href="https://www.google.com">Google</a> uploaded <a href="https://console.cloud.google.com/launcher/details/ncaa-bb-public/ncaa-basketball">massive basketball datasets</a> from the <a href="https://www.ncaa.com">NCAA</a> and <a href="https://www.sportradar.com/">Sportradar</a> to <a href="https://cloud.google.com/bigquery/">BigQuery</a> for anyone to query and experiment. After learning that the <a href="https://www.reddit.com/r/bigquery/comments/82nz17/dataset_statistics_for_ncaa_mens_and_womens/">dataset had location data</a> on where basketball shots were made on the court, I played with it and a couple hours later, I created a decent heat map data visualization. The next day, I <a href="https://www.reddit.com/r/dataisbeautiful/comments/837qnu/heat_map_of_1058383_basketball_shots_from_ncaa/">posted it</a> to Reddit&rsquo;s <a href="https://www.reddit.com/r/dataisbeautiful">/r/dataisbeautiful subreddit</a> where it earned about <strong>40,000 upvotes</strong>. (!?)</p>
<p>Let&rsquo;s dig a little deeper. Although visualizing basketball shots has been <a href="http://www.slate.com/blogs/browbeat/2012/03/06/mapping_the_nba_how_geography_can_teach_players_where_to_shoot.html">done</a> <a href="http://toddwschneider.com/posts/ballr-interactive-nba-shot-charts-with-r-and-shiny/">before</a>, this time we have access to an order of magnitude more public data to do some really cool stuff.</p>
<h2 id="full-court">Full Court</h2>
<p>The Sportradar play-by-play table on BigQuery <code>mbb_pbp_sr</code> has more than 1 million NCAA men&rsquo;s basketball shots since the 2013-2014 season, with more being added now during March Madness. Here&rsquo;s a heat map of the locations where those shots were made on the full basketball court:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_unlog_hu4867952281502461163.webp 320w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_unlog_hu9346122800458336033.webp 768w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_unlog_hu11322785576098161409.webp 1024w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_unlog.png 1800w" src="ncaa_count_attempts_unlog.png"/> 
</figure>

<p>We can clearly see at a glance that the majority of shots are made right in front of the basket. For 3-point shots, the center and the corners have higher numbers of shot attempts than the other areas. But not much else is visible, since the data is so spatially skewed. Setting the bin color scale to logarithmic makes trends more apparent and helps things go viral on Reddit.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_hu9485440869209512752.webp 320w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_hu9045474166992494800.webp 768w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_hu9958816625909530494.webp 1024w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts.png 1800w" src="ncaa_count_attempts.png"/> 
</figure>
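<p>To see why the logarithmic scale helps, here is a minimal R sketch using made-up bin counts (the bin names and numbers are hypothetical, not values from the dataset). On a linear scale, every bin except the one at the basket is squeezed into the bottom few percent of the color range; on a log10 scale, the same bins become distinguishable.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"># Hypothetical shot counts per bin: one bin at the basket, three further out
counts &lt;- c(basket = 50000, paint = 2000, midrange = 500, three_pt = 1500)

# Position of each bin on a linear color scale, from 0 (min) to 1 (max)
linear_pos &lt;- (counts - min(counts)) / (max(counts) - min(counts))

# Position of each bin on a log10 color scale
log_counts &lt;- log10(counts)
log_pos &lt;- (log_counts - min(log_counts)) / (max(log_counts) - min(log_counts))

# Compare the two scales side by side
round(rbind(linear_pos, log_pos), 2)
</code></pre></div>
<p>With these numbers, the hypothetical <code>paint</code> bin sits at about 3% of the linear range but at about 30% of the log range.</p>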

<p>Now there&rsquo;s more going on here: shot behavior is clearly symmetric on each side of the court, and there&rsquo;s a small gap between the 3-point line and where 3-pt shots are typically made, likely to ensure that it&rsquo;s not accidentally ruled as a 2-pt shot.</p>
<p>How likely is it to score a shot from a given spot? Are certain spots better than others?</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/03/basketball-shots/ncaa_count_perc_success_hu12225515012359601598.webp 320w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_perc_success_hu2228709944006703286.webp 768w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_perc_success_hu5831402751780252829.webp 1024w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_perc_success.png 1800w" src="ncaa_count_perc_success.png"/> 
</figure>

<p>Surprisingly, shot accuracy is about <em>equal</em> from anywhere within typical shooting distance, except directly in front of the basket where it&rsquo;s much higher. What is the <a href="https://en.wikipedia.org/wiki/Expected_value">expected value</a> of a shot at a given position: that is, how many points on average will they earn for their team?</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/03/basketball-shots/ncaa_count_avg_points_hu10419792495384116250.webp 320w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_avg_points_hu15601930336810861658.webp 768w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_avg_points_hu17315627164072483656.webp 1024w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_avg_points.png 1800w" src="ncaa_count_avg_points.png"/> 
</figure>

<p>The average points earned for 3-pt shots is about 1.5x higher than many 2-pt shot locations in the inner court due to the equal accuracy, but locations next to the basket have an even higher expected value. Perhaps the accuracy of shots close to the basket is higher (&gt;1.5x) than 3-pt shots and outweighs the lower point value?</p>
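<p>That reasoning can be checked with quick arithmetic. Here is a minimal R sketch, assuming illustrative accuracies of 35% for jump shots at either range and 60% next to the basket (both numbers are assumptions for the example, not values computed from the dataset):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"># Expected value of a shot = point value * probability of scoring
expected_value &lt;- function(points, accuracy) points * accuracy

ev_three &lt;- expected_value(3, 0.35)  # 3-pt jump shot (assumed accuracy)
ev_two   &lt;- expected_value(2, 0.35)  # 2-pt jump shot at the same accuracy
ev_close &lt;- expected_value(2, 0.60)  # 2-pt shot next to the basket (assumed)

# At equal accuracy, a 3-pt shot is worth 3/2 = 1.5x a 2-pt shot
ev_three / ev_two

# A 2-pt shot only wins if its accuracy exceeds 1.5x the 3-pt accuracy
ev_close &gt; ev_three
</code></pre></div>
<p>In other words, a close-range shot beats a 3-pt shot on expected value only when its accuracy is more than 1.5x the 3-pt accuracy (here, 0.60 against a break-even of 0.525).</p>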
<p>Since both sides of the court are indeed the same, we can combine the two sides and just plot a half-court instead. (Cross-court shots, which many Redditors <a href="https://www.reddit.com/r/dataisugly/comments/839rax/basketball_heat_map_shows_an_impressive_number_of/">argued</a> invalidated my visualizations above, constitute only <em>0.16%</em> of the basketball shots in the dataset, so they can be safely removed as outliers).</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_half_log_hu6311352350205583488.webp 320w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_half_log_hu18172877754035200964.webp 768w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_half_log_hu404545173576074469.webp 1024w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_half_log.png 1200w" src="ncaa_count_attempts_half_log.png"/> 
</figure>
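<p>One way to fold the two sides together is to rotate every shot past mid-court by 180 degrees around the center of the court. This is a minimal sketch and not necessarily how the notebook does it: the column names are hypothetical, and it assumes coordinates in feet on a regulation 94 x 50 foot court, with cross-court shots already filtered out.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"># Court dimensions in feet (NCAA regulation: 94 x 50)
court_length &lt;- 94
court_width  &lt;- 50

# Hypothetical shot coordinates in feet; x runs along the length of the court
shots &lt;- data.frame(x = c(5, 89, 20, 70), y = c(25, 25, 10, 45))

# Rotate shots past mid-court by 180 degrees about the court center,
# so both baskets land at the same end and left/right are preserved
far_side &lt;- shots$x &gt; court_length / 2
shots$x_half &lt;- ifelse(far_side, court_length - shots$x, shots$x)
shots$y_half &lt;- ifelse(far_side, court_width - shots$y, shots$y)

shots
</code></pre></div>
<p>Flipping <code>y</code> along with <code>x</code> matters: mirroring only <code>x</code> would swap a shooter&rsquo;s left and right sides for half of the data.</p>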

<p>There are still a few oddities, such as shots being made <em>behind</em> the basket. Let&rsquo;s drill down a bit.</p>
<h2 id="focusing-on-basketball-shot-type">Focusing on Basketball Shot Type</h2>
<p>The Sportradar dataset classifies a shot as one of 5 major types: a <strong>jump shot</strong> where the player jumps-and-throws the basketball, a <strong>layup</strong> where the player runs down the court toward the basket and throws a one-handed shot, a <strong>dunk</strong> where the player slams the ball into the basket (looking cool in the process), a <strong>hook shot</strong> where the player close to the basket throws the ball with a hook motion, and a <strong>tip shot</strong> where the player intercepts a basket rebound at the tip of the basket and pushes it in.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/03/basketball-shots/ncaa_types_prop_attempts_hu3392918158565174522.webp 320w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_prop_attempts_hu12549922522811515231.webp 768w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_prop_attempts_hu16294967146481582394.webp 1024w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_prop_attempts.png 1200w" src="ncaa_types_prop_attempts.png"/> 
</figure>

<p>However, the most frequent types of shots are the less flashy, more practical jump shots and layups. But is a certain type of shot &ldquo;better?&rdquo;</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/03/basketball-shots/ncaa_types_perc_hu16207526305124407108.webp 320w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_perc_hu13706376528972923074.webp 768w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_perc_hu8670138399362658184.webp 1024w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_perc.png 1200w" src="ncaa_types_perc.png"/> 
</figure>

<p>Layups are safer than jump shots, but dunks are the most accurate of all the types (however, players likely wouldn&rsquo;t attempt a dunk unless they knew it would be successful). The accuracy of layups and other close-to-basket shots is indeed more than 1.5x better than that of the jump shots used for 3-pt shots, which explains the expected value behavior above.</p>
<p>Plotting the heat maps for each type of shot offers more insight into how they work:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_half_types_log_hu7498890725666041649.webp 320w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_half_types_log_hu1346367562685938460.webp 768w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_half_types_log.png 900w" src="ncaa_count_attempts_half_types_log.png"/> 
</figure>

<p>They&rsquo;re wildly different heat maps which match the shot type descriptions above, but show we&rsquo;ll need to separate data visualizations by type to accurately see trends.</p>
<h2 id="impact-of-game-elapsed-time-at-time-of-shot">Impact of Game Elapsed Time At Time of Shot</h2>
<p>An NCAA basketball game lasts for 40 minutes total (2 halves of 20 minutes each), with the possibility of overtime. The <a href="https://bigquery.cloud.google.com/savedquery/4194148158:3359d86507814fb19a5997a770456baa">example BigQuery</a> for the NCAA-provided data compares the percentage of 3-point shots made during the first 35 minutes of the game versus the last 5 minutes: at the end of the game, accuracy was lower by 4 percentage points (31.2% vs. 35.1%). It might be interesting to facet these visualizations by the elapsed time of the game to see if there are any behavioral changes.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/03/basketball-shots/ncaa_types_prop_type_elapsed_hu14677248589911283567.webp 320w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_prop_type_elapsed_hu17897638517913371063.webp 768w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_prop_type_elapsed_hu5811039205050045231.webp 1024w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_prop_type_elapsed.png 1200w" src="ncaa_types_prop_type_elapsed.png"/> 
</figure>
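<p>Bucketing shots by elapsed time is a short operation in base R. The sketch below uses made-up shots and the same first-35-minutes/last-5-minutes split as the NCAA example query; the vectors are hypothetical, and the charts in this post may use different bucket boundaries.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"># Hypothetical 3-pt attempts: minutes elapsed in regulation and shot outcome
elapsed_min &lt;- c(2, 12, 18, 27, 33, 36, 38, 39.5)
made        &lt;- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)

# Split the 40-minute game into [0, 35) and [35, 40]
bucket &lt;- cut(elapsed_min, breaks = c(0, 35, 40),
              labels = c("First 35 min", "Last 5 min"),
              include.lowest = TRUE, right = FALSE)

# Proportion of shots made in each bucket
accuracy &lt;- tapply(made, bucket, mean)
accuracy
</code></pre></div>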

<p>There isn&rsquo;t much difference between the proportions within a given half, but there is a difference between the first half and the second half, where the second half has fewer jump shots and more aggressive layups and dunks. After looking at shot success percentage:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/03/basketball-shots/ncaa_types_perc_success_type_elapsed_hu11212319267715503641.webp 320w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_perc_success_type_elapsed_hu12936827827249594501.webp 768w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_perc_success_type_elapsed_hu3226018228894802713.webp 1024w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_perc_success_type_elapsed.png 1200w" src="ncaa_types_perc_success_type_elapsed.png"/> 
</figure>

<p>The jump shot accuracy loss at the end of the game with Sportradar data is similar to that of the NCAA data, which is a good sanity check (but it&rsquo;s odd that the accuracy drop only happens in the last 5 minutes and not elsewhere in the 2nd half). Layup accuracy increases in the second half with the number of layups.</p>
<p>We can also visualize heat maps for each combo of shot type with time elapsed bucket, but given the results above, the changes in behavior over time may not be very perceptible.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_half_interval_log_hu17473377362441353989.webp 320w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_half_interval_log_hu4392665695619691142.webp 768w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_half_interval_log_hu6807374620946215023.webp 1024w,https://minimaxir.com/2018/03/basketball-shots/ncaa_count_attempts_half_interval_log.png 1200w" src="ncaa_count_attempts_half_interval_log.png"/> 
</figure>

<h2 id="impact-of-winninglosing-before-shot">Impact of Winning/Losing Before Shot</h2>
<p>Another theory worth exploring is determining if there is any difference whether a team is winning or losing when they make their shot (technically, when the delta between the team score and the other team score is positive for winning teams, negative for losing teams, or 0 if tied). Are players more relaxed when they have a lead? Are players more prone to making mistakes when losing?</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/03/basketball-shots/ncaa_types_prop_type_score_hu12795432602139047209.webp 320w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_prop_type_score_hu10785160504979282450.webp 768w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_prop_type_score_hu12737280586579011607.webp 1024w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_prop_type_score.png 1200w" src="ncaa_types_prop_type_score.png"/> 
</figure>
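<p>The winning/losing/tied buckets come straight from the sign of the score delta. Here is a minimal dplyr sketch on a made-up table (the column names are hypothetical and do not necessarily match the Sportradar schema):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)

# Hypothetical shots with the score of each team just before the shot
shots &lt;- tibble(
  team_score = c(10, 22, 31, 40),
  opp_score  = c(12, 22, 25, 47),
  made       = c(FALSE, TRUE, TRUE, FALSE)
)

by_state &lt;- shots %&gt;%
  mutate(
    delta = team_score - opp_score,
    state = case_when(
      delta &gt; 0 ~ "Winning",
      delta &lt; 0 ~ "Losing",
      TRUE      ~ "Tied"
    )
  ) %&gt;%
  group_by(state) %&gt;%
  summarize(attempts = n(), accuracy = mean(made))

by_state
</code></pre></div>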

<p>Layups are the same across all buckets, but for teams that are winning, there are fewer jump shots and <strong>more dunkin&rsquo; action</strong> (nearly double the dunks!). However, the accuracy chart illustrates an issue:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2018/03/basketball-shots/ncaa_types_perc_success_type_score_hu11034190112321056276.webp 320w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_perc_success_type_score_hu8938501007324301094.webp 768w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_perc_success_type_score_hu7985798081415160403.webp 1024w,https://minimaxir.com/2018/03/basketball-shots/ncaa_types_perc_success_type_score.png 1200w" src="ncaa_types_perc_success_type_score.png"/> 
</figure>

<p>Accuracy for most types of shots is much better for teams that are winning&hellip;which may be the <em>reason</em> they&rsquo;re winning. More research can be done in this area.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I fully admit I am not a basketball expert. But playing around with this data was a fun way to get a new perspective on how collegiate basketball games work. There&rsquo;s a lot more work that can be done with big basketball data and game strategy; the NCAA-provided data doesn&rsquo;t have location data, but it does have <strong>6x more shots</strong>, which will be very helpful for further fun in this area.</p>
<hr>
<p><em>You can view the R code, ggplot2 code, and BigQueries used to create the data visualizations in <a href="http://minimaxir.com/notebooks/basketball-shots/">this R Notebook</a>. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/ncaa-basketball">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
<p><em>Special thanks to Ewen Gallic for his implementation of a <a href="http://egallic.fr/en/drawing-a-basketball-court-with-r/">basketball court in ggplot2</a>, which saved me a lot of time!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Playing with 80 Million Amazon Product Review Ratings Using Apache Spark</title>
      <link>https://minimaxir.com/2017/01/amazon-spark/</link>
      <pubDate>Mon, 02 Jan 2017 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/01/amazon-spark/</guid>
      <description>Manipulating actually-big-data is just as easy as performing an analysis on a dataset with only a few records.</description>
      <content:encoded><![CDATA[<p><a href="https://www.amazon.com">Amazon</a> product reviews and ratings are a very important business. Customers on Amazon often make purchasing decisions based on those reviews, and a single bad review can cause a potential purchaser to reconsider. A couple years ago, I wrote a blog post titled <a href="http://minimaxir.com/2014/06/reviewing-reviews/">A Statistical Analysis of 1.2 Million Amazon Reviews</a>, which was well-received.</p>
<p>Back then, I was limited to only 1.2M reviews because attempting to process more data caused out-of-memory issues and my R code took <em>hours</em> to run.</p>
<p><a href="http://spark.apache.org">Apache Spark</a>, which makes processing gigantic amounts of data efficient and sensible, has become very popular in the past couple years (for good tutorials on using Spark with Python, I recommend the <a href="https://courses.edx.org/courses/course-v1:BerkeleyX&#43;CS105x&#43;1T2016/info">free</a> <a href="https://courses.edx.org/courses/course-v1:BerkeleyX&#43;CS110x&#43;2T2016/info">edX</a> <a href="https://courses.edx.org/courses/course-v1:BerkeleyX&#43;CS120x&#43;2T2016/info">courses</a>). Although data scientists often use Spark to process data with distributed cloud computing via <a href="https://aws.amazon.com/ec2/">Amazon EC2</a> or <a href="https://azure.microsoft.com/en-us/services/hdinsight/apache-spark/">Microsoft Azure</a>, Spark works just fine even on a typical laptop, given enough memory (for this post, I use a 2016 MacBook Pro/16GB RAM, with 8GB allocated to the Spark driver).</p>
<p>I wrote a <a href="https://github.com/minimaxir/amazon-spark/blob/master/amazon_preprocess.py">simple Python script</a> to combine the per-category ratings-only data from the <a href="http://jmcauley.ucsd.edu/data/amazon/">Amazon product reviews dataset</a> curated by Julian McAuley, Rahul Pandey, and Jure Leskovec for their 2015 paper <a href="http://cseweb.ucsd.edu/~jmcauley/pdfs/kdd15.pdf">Inferring Networks of Substitutable and Complementary Products</a>. The result is a 4.53 GB CSV that would definitely not open in Microsoft Excel. The truncated and combined dataset includes the <strong>user_id</strong> of the user leaving the review, the <strong>item_id</strong> indicating the Amazon product receiving the review, the <strong>rating</strong> the user gave the product from 1 to 5, and the <strong>timestamp</strong> indicating the time when the review was written (truncated to the Day). We can also infer the <strong>category</strong> of the reviewed product from the name of the data subset.</p>
<p>Afterwards, using the new <a href="http://spark.rstudio.com">sparklyr</a> package for R, I can easily start a local Spark cluster with a single <code>spark_connect()</code> command and load the entire CSV into the cluster in seconds with a single <code>spark_read_csv()</code> command.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2017/01/amazon-spark/output_hu11825264051834637359.webp 320w,https://minimaxir.com/2017/01/amazon-spark/output_hu15952964483290737184.webp 768w,https://minimaxir.com/2017/01/amazon-spark/output_hu15480810814297675177.webp 1024w,https://minimaxir.com/2017/01/amazon-spark/output.png 1106w" src="output.png"/> 
</figure>
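<p>For reference, the two commands look roughly like the sketch below. The table name and file path are placeholders, the CSV is assumed to have a header row (the <code>spark_read_csv()</code> default), and a local Spark installation is assumed; the exact setup in the notebook may differ.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(sparklyr)

# Give the Spark driver 8 GB of memory, as described above
config &lt;- spark_config()
config$`sparklyr.shell.driver-memory` &lt;- "8G"

# Start a local Spark cluster
sc &lt;- spark_connect(master = "local", config = config)

# Load the combined CSV into Spark (name and path are placeholders)
df &lt;- spark_read_csv(sc, name = "amazon_ratings", path = "amazon_ratings.csv")

# Count the records in the Spark DataFrame
sdf_nrow(df)
</code></pre></div>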

<p>There are 80.74 million records total in the dataset, or as the output helpfully reports, <code>8.074e+07</code> records. Performing advanced queries with traditional tools like <a href="https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html">dplyr</a> or even Python&rsquo;s <a href="http://pandas.pydata.org">pandas</a> on such a dataset would take a considerable amount of time to execute.</p>
<p>With sparklyr, manipulating actually-big-data is <em>just as easy</em> as performing an analysis on a dataset with only a few records (and an order of magnitude easier than the Python approaches taught in the edX courses mentioned above!).</p>
<h2 id="exploratory-analysis">Exploratory Analysis</h2>
<p><em>(You can view the R code used to process the data with Spark and generate the data visualizations in <a href="http://minimaxir.com/notebooks/amazon-spark/">this R Notebook</a>)</em></p>
<p>There are <strong>20,368,412</strong> unique users who provided reviews in this dataset. <strong>51.9%</strong> of those users have only written one review.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2017/01/amazon-spark/user_count_cum_hu18131942492474983602.webp 320w,https://minimaxir.com/2017/01/amazon-spark/user_count_cum_hu724464668483888738.webp 768w,https://minimaxir.com/2017/01/amazon-spark/user_count_cum_hu4886088281670307632.webp 1024w,https://minimaxir.com/2017/01/amazon-spark/user_count_cum.png 1200w" src="user_count_cum.png"/> 
</figure>

<p>Relatedly, there are <strong>8,210,439</strong> unique products in this dataset, where <strong>43.3%</strong> have only one review.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2017/01/amazon-spark/item_count_cum_hu4442615987071581876.webp 320w,https://minimaxir.com/2017/01/amazon-spark/item_count_cum_hu3681574755398551736.webp 768w,https://minimaxir.com/2017/01/amazon-spark/item_count_cum_hu12092451074977659738.webp 1024w,https://minimaxir.com/2017/01/amazon-spark/item_count_cum.png 1200w" src="item_count_cum.png"/> 
</figure>

<p>After removing duplicate ratings, I added a few more features to each rating which may help illustrate how review behavior changed over time: a ranking value indicating the # review that the author of a given review has written (1st review by author, 2nd review by author, etc.), a ranking value indicating the # review that the product of a given review has received (1st review for product, 2nd review for product, etc.), and the month and year the review was made.</p>
<p>The first two added features require a <em>very</em> large amount of processing power, and highlight the convenience of Spark&rsquo;s speed (and the fact that Spark uses all CPU cores by default, while typical R/Python approaches are single-threaded!)</p>
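<p>To make those features concrete, here is a minimal dplyr sketch on a tiny local table with made-up ratings. It illustrates the idea only: the output column names are hypothetical, and the notebook runs on Spark and may compute the rankings differently.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)

# Hypothetical ratings; timestamps are truncated to the day, as in the dataset
ratings &lt;- tibble(
  user_id   = c("u1", "u1", "u2", "u1", "u2"),
  item_id   = c("a",  "b",  "a",  "c",  "b"),
  rating    = c(5, 4, 1, 5, 3),
  timestamp = as.Date(c("2013-01-05", "2013-02-01", "2013-02-10",
                        "2013-03-15", "2013-04-02"))
)

ratings_t &lt;- ratings %&gt;%
  group_by(user_id) %&gt;%
  mutate(user_nth = row_number(timestamp)) %&gt;%  # n-th review by this user
  group_by(item_id) %&gt;%
  mutate(item_nth = row_number(timestamp)) %&gt;%  # n-th review of this product
  ungroup() %&gt;%
  mutate(year  = as.integer(format(timestamp, "%Y")),
         month = as.integer(format(timestamp, "%m")))

ratings_t
</code></pre></div>
<p>Each ranking is a window function: rows are partitioned by user or by product and ordered by timestamp, which is why it is expensive on 80 million rows.</p>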
<p>These changes are cached into a Spark DataFrame <code>df_t</code>. If I wanted to determine which Amazon product category receives the best review ratings on average, I can aggregate the data by category, calculate the average rating score for each category, and sort. Thanks to the power of Spark, the data processing for this many-millions-of-records takes seconds.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_agg</span> <span class="o">&lt;-</span> <span class="n">df_t</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">            <span class="nf">group_by</span><span class="p">(</span><span class="n">category</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">            <span class="nf">summarize</span><span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="nf">n</span><span class="p">(),</span> <span class="n">avg_rating</span> <span class="o">=</span> <span class="nf">mean</span><span class="p">(</span><span class="n">rating</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">            <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">avg_rating</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">            <span class="nf">collect</span><span class="p">()</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2017/01/amazon-spark/avg_hu1760763212046264347.webp 320w,https://minimaxir.com/2017/01/amazon-spark/avg_hu11354551484906702319.webp 768w,https://minimaxir.com/2017/01/amazon-spark/avg.png 962w" src="avg.png"/> 
</figure>

<p>Or, visualized in chart form using <a href="http://ggplot2.org">ggplot2</a>:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2017/01/amazon-spark/avg_rating_desc_hu5140884482166784622.webp 320w,https://minimaxir.com/2017/01/amazon-spark/avg_rating_desc_hu10587704718194281244.webp 768w,https://minimaxir.com/2017/01/amazon-spark/avg_rating_desc_hu14654037686561640934.webp 1024w,https://minimaxir.com/2017/01/amazon-spark/avg_rating_desc.png 1200w" src="avg_rating_desc.png"/> 
</figure>

<p>Digital Music/CD products receive the highest reviews on average, while Video Games and Cell Phones receive the lowest reviews on average, with a <strong>0.77</strong> rating range between them. This does make some intuitive sense; Digital Music and CDs are types of products where you know <em>exactly</em> what you are getting with no chance of a random product defect, while Cell Phones and Accessories can have variable quality from shady third-party sellers (Video Games in particular are also prone to irrational <a href="http://steamed.kotaku.com/steam-games-are-now-even-more-susceptible-to-review-bom-1774940065">review bombing</a> over minor grievances).</p>
<p>We can refine this visualization by splitting each bar into a percentage breakdown of each rating from 1-5. This could be plotted with a pie chart for each category, however a stacked bar chart, scaled to 100%, looks much cleaner.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2017/01/amazon-spark/category_breakdown_hu18042580852294897809.webp 320w,https://minimaxir.com/2017/01/amazon-spark/category_breakdown_hu2852538476163673527.webp 768w,https://minimaxir.com/2017/01/amazon-spark/category_breakdown_hu15719492803595356648.webp 1024w,https://minimaxir.com/2017/01/amazon-spark/category_breakdown.png 1200w" src="category_breakdown.png"/> 
</figure>
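<p>The numbers behind a 100%-scaled stacked bar chart are just the share of each rating within its category. Here is a minimal dplyr sketch with made-up ratings (the category labels and values are hypothetical):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)

# Hypothetical ratings for two categories
ratings &lt;- tibble(
  category = c("Music", "Music", "Music", "Music",
               "Games", "Games", "Games", "Games"),
  rating   = c(5, 5, 4, 5, 1, 5, 3, 5)
)

breakdown &lt;- ratings %&gt;%
  count(category, rating) %&gt;%      # number of reviews per category and rating
  group_by(category) %&gt;%
  mutate(prop = n / sum(n)) %&gt;%    # share of each rating within its category
  ungroup()

breakdown
</code></pre></div>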

<p>The new visualization does help support the theory above; the top categories have a significantly higher percentage of 4/5-star ratings than the bottom categories, and a much lower proportion of 1/2/3-star ratings. The inverse holds true for the bottom categories.</p>
<p>How have these breakdowns changed over time? Are there other factors in play?</p>
<h2 id="rating-breakdowns-over-time">Rating Breakdowns Over Time</h2>
<p>Perhaps the advent of the binary Like/Dislike behaviors in social media in the 2000s has translated into a change in behavior for a 5-star review system. Here are the rating breakdowns for reviews written in each month from January 2000 to July 2014:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2017/01/amazon-spark/time_breakdown_hu13200017163796445302.webp 320w,https://minimaxir.com/2017/01/amazon-spark/time_breakdown_hu17499833333894482917.webp 768w,https://minimaxir.com/2017/01/amazon-spark/time_breakdown_hu17179845752596451917.webp 1024w,https://minimaxir.com/2017/01/amazon-spark/time_breakdown.png 1200w" src="time_breakdown.png"/> 
</figure>

<p>The voting behavior oscillates very slightly over time with no clear spikes or inflection points, which dashes that theory.</p>
<h2 id="distribution-of-average-scores">Distribution of Average Scores</h2>
<p>We should look at the global averages of Amazon product scores (i.e. what customers see when they buy products), and the users who give the ratings. We would expect the distributions to match, so any deviations would be interesting.</p>
<p>Products on average, when looking at products with at least 5 ratings, have a <strong>4.16</strong> overall rating.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2017/01/amazon-spark/item_histogram_hu8206876982132244050.webp 320w,https://minimaxir.com/2017/01/amazon-spark/item_histogram_hu15828039166294226842.webp 768w,https://minimaxir.com/2017/01/amazon-spark/item_histogram_hu13496291278628755905.webp 1024w,https://minimaxir.com/2017/01/amazon-spark/item_histogram.png 1200w" src="item_histogram.png"/> 
</figure>
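<p>One reading of that calculation, sketched with dplyr on made-up ratings: compute the average rating per product, keep only products with at least 5 ratings, then average those per-product averages. The data is hypothetical, and the notebook may aggregate differently.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)

# Hypothetical ratings: item "a" has 5 ratings, item "b" has only 2
ratings &lt;- tibble(
  item_id = c(rep("a", 5), rep("b", 2)),
  rating  = c(5, 4, 4, 5, 3, 1, 2)
)

item_avg &lt;- ratings %&gt;%
  group_by(item_id) %&gt;%
  summarize(count = n(), avg_rating = mean(rating)) %&gt;%
  filter(count &gt;= 5)  # drop items with too few ratings for a stable average

# Average of the per-item averages
mean(item_avg$avg_rating)
</code></pre></div>
<p>The same pattern, grouped by user instead of by item, gives the per-user distribution.</p>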

<p>When looking at a similar graph for the overall ratings given by users (5 ratings minimum), the average rating is slightly higher at <strong>4.20</strong>.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2017/01/amazon-spark/user_histogram_hu14766794894005151412.webp 320w,https://minimaxir.com/2017/01/amazon-spark/user_histogram_hu7663585943588508158.webp 768w,https://minimaxir.com/2017/01/amazon-spark/user_histogram_hu9872340281103558608.webp 1024w,https://minimaxir.com/2017/01/amazon-spark/user_histogram.png 1200w" src="user_histogram.png"/> 
</figure>

<p>The primary difference between the two distributions is that there is a significantly higher proportion of Amazon customers giving <em>only</em> 5-star reviews. Normalizing and overlaying the two charts clearly highlights that discrepancy.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2017/01/amazon-spark/user_item_histogram_hu3303216325667529326.webp 320w,https://minimaxir.com/2017/01/amazon-spark/user_item_histogram_hu15243540103538174757.webp 768w,https://minimaxir.com/2017/01/amazon-spark/user_item_histogram_hu15869884500794183138.webp 1024w,https://minimaxir.com/2017/01/amazon-spark/user_item_histogram.png 1200w" src="user_item_histogram.png"/> 
</figure>

<h2 id="the-marginal-review">The Marginal Review</h2>
<p>A few posts ago, I discussed how the <a href="http://minimaxir.com/2016/11/first-comment/">first comment on a Reddit post</a> has dramatically more influence than subsequent comments. Does user rating behavior change after making more and more reviews? Is the typical rating behavior different for the first review of a given product?</p>
<p>Here is the ratings breakdown for the <em>n</em>-th Amazon review a user gives:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2017/01/amazon-spark/user_nth_breakdown_hu4547040191787093813.webp 320w,https://minimaxir.com/2017/01/amazon-spark/user_nth_breakdown_hu15895047257663722124.webp 768w,https://minimaxir.com/2017/01/amazon-spark/user_nth_breakdown_hu208867516538168575.webp 1024w,https://minimaxir.com/2017/01/amazon-spark/user_nth_breakdown.png 1200w" src="user_nth_breakdown.png"/> 
</figure>

<p>The first user review has a slightly higher proportion of being a 1-star review than subsequent reviews. Otherwise, the voting behavior is mostly the same over time, although users have an increased proportion of giving a 4-star review instead of a 5-star review as they get more comfortable.</p>
<p>In contrast, here is the ratings breakdown for the <em>n</em>-th review an Amazon product received:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2017/01/amazon-spark/item_nth_breakdown_hu2428390323882003896.webp 320w,https://minimaxir.com/2017/01/amazon-spark/item_nth_breakdown_hu2245718244944557328.webp 768w,https://minimaxir.com/2017/01/amazon-spark/item_nth_breakdown_hu15049137115157011289.webp 1024w,https://minimaxir.com/2017/01/amazon-spark/item_nth_breakdown.png 1200w" src="item_nth_breakdown.png"/> 
</figure>

<p>The first product review has a slightly higher proportion of being a 5-star review than subsequent reviews. However, after the 10th review, there is <em>zero</em> change in the distribution of ratings, which implies that the marginal rating behavior is independent from the current score after that threshold.</p>
<h2 id="summary">Summary</h2>
<p>Granted, this blog post is more playing with data and less analyzing data. What might be interesting to look into for future technical posts is conditional behavior, such as predicting the rating of a review given the previous ratings on that product/by that user. However, this post shows that while &ldquo;big data&rdquo; may be an inscrutable buzzword nowadays, you don&rsquo;t have to work for a Fortune 500 company to be able to understand it. Even with a data set consisting of 5 simple features, you can extract a large number of insights.</p>
<p>And this post doesn&rsquo;t even look at the text of the Amazon product reviews or the metadata associated with the products! I do have a few ideas lined up there which I won&rsquo;t spoil.</p>
<hr>
<p><em>You can view all the R and ggplot2 code used to visualize the Amazon data in <a href="http://minimaxir.com/notebooks/amazon-spark/">this R Notebook</a>. You can also view the images/data used for this post in <a href="https://github.com/minimaxir/amazon-spark">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>What Percent of the Top-Voted Comments in Reddit Threads Were Also 1st Comment?</title>
      <link>https://minimaxir.com/2016/11/first-comment/</link>
      <pubDate>Mon, 07 Nov 2016 06:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2016/11/first-comment/</guid>
      <description>Are commenters &amp;rsquo;late to this thread&amp;rsquo; indeed late?</description>
      <content:encoded><![CDATA[<p><a href="https://www.reddit.com">Reddit</a> threads can be a crowded place. In popular subreddits such as <a href="https://www.reddit.com/r/AskReddit/">/r/AskReddit</a> and <a href="https://www.reddit.com/r/pics/">/r/pics</a>, Reddit submissions can receive hundreds, even <em>thousands</em> of unique comments. Some comments inevitably become lost in the noise. Reddit&rsquo;s <a href="https://redditblog.com/2009/10/15/reddits-new-comment-sorting-system/">ranking algorithm</a> attempts to rectify this by determining comment ranking using both time and community voting; comments in a thread, by default, are ordered based on the <strong>points score</strong> (upvotes - downvotes) the comment receives, subject to a rank decay based on the age of the comment.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/11/first-comment/reddit_askreddit_hu3410051859361510654.webp 320w,https://minimaxir.com/2016/11/first-comment/reddit_askreddit_hu5991386434350666882.webp 768w,https://minimaxir.com/2016/11/first-comment/reddit_askreddit_hu15158773245132013787.webp 1024w,https://minimaxir.com/2016/11/first-comment/reddit_askreddit.png 1735w" src="reddit_askreddit.png"/> 
</figure>

<p>In theory, this system should allow comments that are posted later in the thread&rsquo;s lifetime to rank much higher temporarily so that Redditors can vote on them; if a new comment is good, it can rise to the top, surfacing content which would otherwise be buried. Anecdotally, that doesn&rsquo;t seem to be the case with Reddit&rsquo;s modern algorithm; comments made late in the thread appear at the bottom, where they likely will not receive any upvotes (this led to a minor &ldquo;<a href="https://www.google.com/#q=site:reddit.com&#43;%22late&#43;to&#43;this&#43;thread%22">I know I&rsquo;m late to this thread but&hellip;</a>&rdquo; meme).</p>
<p>I, of course, am not satisfied with anecdotes. A month ago, a Redditor asked &ldquo;<a href="https://www.reddit.com/r/TheoryOfReddit/comments/53d5ep/what_percentage_of_the_top_comment_in_threads/">What percentage of the top comment in threads were also the first comment?</a>&rdquo; Why not calculate it <em>exactly</em> using big data?</p>
<h2 id="getting-the-reddit-data">Getting the Reddit Data</h2>
<p><em>You can view all the <a href="https://www.r-project.org">R</a> and <a href="http://ggplot2.org">ggplot2</a> code used to query, analyze, and visualize the Reddit data in <a href="http://minimaxir.com/notebooks/first-comment/">this R Notebook</a>.</em></p>
<p>In order to process a great amount of Reddit data, I turned to <a href="https://cloud.google.com/bigquery/">BigQuery</a>, which now has data for <a href="https://www.reddit.com/r/datasets/comments/590re2/updated_reddit_comments_and_posts_updated_on/">all Reddit comments</a> until September 2016.</p>
<p>For this analysis, I will only look at the <strong>top-level comments</strong> (i.e. comments which are not replies to other comments), since those are the ones most affected by the ordering and submission of new comments. Additionally, I will only look at comments within Reddit threads with <strong>at least 30 top-level comments</strong> to ensure I only look at threads with sufficient discussion and where late posts are more likely to become hidden. It also mirrors the &ldquo;late to this thread&rdquo; meme: can posts be <em>too</em> late?</p>
<p>The queried data will be all comments posted from January 2015 to September 2016: this gives a good balance of sample size and grounding in the modern comment ranking algorithms. The total number of Reddit comments analyzed, after filtering on threads with sufficient conversation and limiting the scope to the first 100 comments of a thread scoring within the Top 100, is <strong>n = 86,561,476</strong>.</p>
<p>With clever use of BigQuery window functions, I obtained the aggregate data, counting the number of comments from the filtered Reddit threads at each voting rank and created rank.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/11/first-comment/data_hu334563466712559601.webp 320w,https://minimaxir.com/2016/11/first-comment/data.png 485w" src="data.png"/> 
</figure>
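<p>As a rough illustration of what that aggregation does, here is a minimal Python sketch on made-up data. It is not the BigQuery query used for this post (that lives in the R Notebook); the tuple layout, the <code>rank_pairs</code> helper, and the toy values are all assumptions for the example. It ranks each comment within its thread twice, once by score and once by creation time, then counts comments at each rank pairing.</p>

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (thread_id, comment_id, score, created_utc)
comments = [
    ("t1", "a", 10, 100),
    ("t1", "b", 50, 105),
    ("t1", "c", 3, 110),
    ("t2", "d", 7, 200),
    ("t2", "e", 2, 210),
]

def rank_pairs(comments):
    """Count comments at each (score_rank, created_rank) pairing."""
    by_thread = defaultdict(list)
    for thread_id, comment_id, score, created_utc in comments:
        by_thread[thread_id].append((comment_id, score, created_utc))

    counts = Counter()
    for rows in by_thread.values():
        # Similar in spirit to a window function partitioned by thread:
        # number the rows once by descending score, once by ascending time.
        by_score = sorted(rows, key=lambda r: -r[1])
        by_time = sorted(rows, key=lambda r: r[2])
        score_rank = {r[0]: i for i, r in enumerate(by_score, start=1)}
        created_rank = {r[0]: i for i, r in enumerate(by_time, start=1)}
        for comment_id, _, _ in rows:
            counts[(score_rank[comment_id], created_rank[comment_id])] += 1
    return counts

pair_counts = rank_pairs(comments)
```

<p>In this toy data, comment <code>d</code> is both the first and the top-voted comment of its thread, so it lands in the <code>(1, 1)</code> cell; comment <code>b</code> was posted second but scored highest, so it lands in <code>(1, 2)</code>.</p>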

<h2 id="visualizing-the-discussion">Visualizing the Discussion</h2>
<p>Filtering on the top-voted comments (<code>score_rank = 1</code>) only, <em>what percent of the top-voted comments in Reddit threads were also 1st Comment?</em></p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/11/first-comment/reddit-first-4_hu12804590896089379556.webp 320w,https://minimaxir.com/2016/11/first-comment/reddit-first-4_hu5277637141094211576.webp 768w,https://minimaxir.com/2016/11/first-comment/reddit-first-4_hu7564725095777991083.webp 1024w,https://minimaxir.com/2016/11/first-comment/reddit-first-4.png 1200w" src="reddit-first-4.png"/> 
</figure>

<p>The answer is <strong>17.24%</strong> of all top-voted comments! That&rsquo;s certainly more than what I expected! Additionally, 56% of the top-voted comments were posted within the first 5 comments, and 77% within the first 10 comments. The chart follows a <a href="https://en.wikipedia.org/wiki/Power_law">power-law distribution</a>.</p>
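<p>The percentages above come from slicing the aggregate table at <code>score_rank = 1</code>. A minimal sketch of that arithmetic, assuming the counts are held in a dictionary keyed by <code>(score_rank, created_rank)</code>; the numbers below are invented for illustration and are not the real Reddit counts:</p>

```python
from itertools import accumulate

# Hypothetical aggregate counts: {(score_rank, created_rank): number of comments}
pair_counts = {
    (1, 1): 20, (1, 2): 15, (1, 3): 10, (1, 4): 5,
    (2, 1): 12, (2, 2): 18,
}

def top_comment_shares(pair_counts):
    """Share of top-voted comments at each created rank, plus the running total."""
    top = {c: n for (s, c), n in pair_counts.items() if s == 1}
    total = sum(top.values())
    ranks = sorted(top)
    shares = [top[c] / total for c in ranks]
    cumulative = list(accumulate(shares))
    return ranks, shares, cumulative

ranks, shares, cumulative = top_comment_shares(pair_counts)
```

<p>The first entry of <code>shares</code> corresponds to the &ldquo;top comment was also the first comment&rdquo; figure, and the running total gives the &ldquo;within the first 5&rdquo; and &ldquo;within the first 10&rdquo; style figures.</p>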
<p>Let&rsquo;s invert it: filtering on only the first comments (<code>created_rank = 1</code>) made in comment threads, <em>what percentage of the 1st Comments in Reddit threads were also the top-voted comment?</em></p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/11/first-comment/reddit-first-3_hu12928376286274734035.webp 320w,https://minimaxir.com/2016/11/first-comment/reddit-first-3_hu9548730243516503665.webp 768w,https://minimaxir.com/2016/11/first-comment/reddit-first-3_hu9347278910298157779.webp 1024w,https://minimaxir.com/2016/11/first-comment/reddit-first-3.png 1200w" src="reddit-first-3.png"/> 
</figure>

<p>By construction, the answer is the same as before (17.24%); however, the follow-up proportions are slightly different, with the first comment ranking within the Top 5 comments 46% of the time, and within the Top 10 comments 62% of the time.</p>
<p>It may be worth it to visualize both dimensions at the same time using a heatmap, with the created rank on one axis, score rank on the other, and a z-axis representing the number of comments at each rank pairing. We can also add a faint contour line to help visualize clusters of the data. Putting it together:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/11/first-comment/reddit-first-2_hu5458472257424540589.webp 320w,https://minimaxir.com/2016/11/first-comment/reddit-first-2_hu15038371960275821730.webp 768w,https://minimaxir.com/2016/11/first-comment/reddit-first-2_hu6340668934849351155.webp 1024w,https://minimaxir.com/2016/11/first-comment/reddit-first-2.png 1200w" src="reddit-first-2.png"/> 
</figure>

<p>Woah, most of the values fall within the semisquare bounded by the first 5 comments and the top 5 comments! But it&rsquo;s harder to see trends, so let&rsquo;s try applying a logarithmic base-10 scaling on the comment count:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/11/first-comment/reddit-first-2a_hu4314560845286008287.webp 320w,https://minimaxir.com/2016/11/first-comment/reddit-first-2a_hu6134800337682896312.webp 768w,https://minimaxir.com/2016/11/first-comment/reddit-first-2a_hu14985410634842795981.webp 1024w,https://minimaxir.com/2016/11/first-comment/reddit-first-2a.png 1200w" src="reddit-first-2a.png"/> 
</figure>

<p>Much better! We can see a grouping of the 5x5 semisquare, but also smaller groupings of a 30x30 shape (this may be due to the 30 comment filter threshold), a faint 60x60 shape, and <em>voids</em> in the upper-left and lower-right corners.</p>
<p>From the 2D heatmap, there appears to be a <strong>positive correlation</strong> between the rank of the comment and the time it was submitted. Ideally, if Reddit&rsquo;s algorithm correctly cycled posts so that each comment gets a fair chance at going viral, then there should be <strong>no correlation</strong> between score rank and time posted.</p>
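<p>The correlation here is read off the heatmap rather than computed. If you wanted a single number, one option (an assumption on my part, not something calculated for this post) is a Pearson correlation over the rank pairs, weighting each pair by its comment count. The <code>weighted_correlation</code> helper and both toy tables below are hypothetical:</p>

```python
from math import sqrt

def weighted_correlation(pair_counts):
    """Pearson correlation of (score_rank, created_rank), weighted by count."""
    n = sum(pair_counts.values())
    mean_s = sum(s * w for (s, c), w in pair_counts.items()) / n
    mean_c = sum(c * w for (s, c), w in pair_counts.items()) / n
    cov = sum(w * (s - mean_s) * (c - mean_c) for (s, c), w in pair_counts.items())
    var_s = sum(w * (s - mean_s) ** 2 for (s, c), w in pair_counts.items())
    var_c = sum(w * (c - mean_c) ** 2 for (s, c), w in pair_counts.items())
    return cov / sqrt(var_s * var_c)

# Every comment ranks exactly where it was posted: perfect correlation.
diagonal = {(1, 1): 5, (2, 2): 5, (3, 3): 5}
# Every rank pairing is equally likely: no correlation.
uniform = {(s, c): 5 for s in (1, 2, 3) for c in (1, 2, 3)}

r_diagonal = weighted_correlation(diagonal)
r_uniform = weighted_correlation(uniform)
```

<p>The two toy tables are the extremes: a straight diagonal line in the heatmap gives a value of 1, and a perfectly even heatmap (the &ldquo;fair chance&rdquo; ideal) gives 0.</p>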
<h2 id="analysis-by-subreddit">Analysis by Subreddit</h2>
<p>When working with Reddit data, it is always important to facet the analysis by subreddit, as subreddits can have idiosyncratic behaviors which deviate from general Reddit behavior. As noted in the original Reddit thread with the initial question, it is possible that the percentage of first comments becoming top comment is &ldquo;higher in lighter subs (funny, pics, videos) than more serious subs (askscience, history, etc).&rdquo;</p>
<p>I tweaked the BigQuery query above to retrieve the same data for each of the Top 100 subreddits (determined by unique commenter count over the same time period). Afterward, via scripting, I created a 1D proportion-of-first-comments-by-score-rank chart and a 2D heatmap for each subreddit. You can view and download the 1D charts <a href="https://github.com/minimaxir/first-comment/tree/master/img-1d">here</a>, and the 2D heatmaps <a href="https://github.com/minimaxir/first-comment/tree/master/img-2d">here</a>.</p>
<p>For example, here&rsquo;s the chart of first-comment-rankings for <a href="https://www.reddit.com/r/IAmA/">/r/IAmA</a>, one of Reddit&rsquo;s biggest subreddits where normal Redditors can ask celebrities any question they want.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/11/first-comment/IAmA-1d_hu12533851420029287525.webp 320w,https://minimaxir.com/2016/11/first-comment/IAmA-1d_hu2860976722341072203.webp 768w,https://minimaxir.com/2016/11/first-comment/IAmA-1d_hu17020721084305306837.webp 1024w,https://minimaxir.com/2016/11/first-comment/IAmA-1d.png 1200w" src="IAmA-1d.png"/> 
</figure>

<p>Unlike the all-Reddit chart, the distribution of first-comment proportions is more uniform instead of following a power law. It makes sense in theory; people would likely upvote top-level questions which the original poster replied to, so there should be less of a bias toward the first top-level comment.</p>
<p>What does the 2D heatmap show?</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/11/first-comment/IAmA-2d_hu15022819391458170049.webp 320w,https://minimaxir.com/2016/11/first-comment/IAmA-2d_hu7501658129836543661.webp 768w,https://minimaxir.com/2016/11/first-comment/IAmA-2d_hu12479744954776080619.webp 1024w,https://minimaxir.com/2016/11/first-comment/IAmA-2d.png 1200w" src="IAmA-2d.png"/> 
</figure>

<p>Damn it.</p>
<p>While the 1D behavior is different, the overall 2D behavior is the same albeit with larger voids (indeed, in the heatmap, you can see at <code>created_rank = 1</code>, the vertical strip doesn&rsquo;t fit the pattern).</p>
<p>It turns out that most /r/IAmA threads have this comment:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/11/first-comment/automoderator_hu16248000721135095862.webp 320w,https://minimaxir.com/2016/11/first-comment/automoderator.png 560w" src="automoderator.png"/> 
</figure>

<p>As it&rsquo;s made by a robot, it&rsquo;s always the first comment, and it gets ignored/downvoted in normal circumstances. Other subreddits with the same pattern of 1D irregularities, 2D regularities, and AutoModerator usage are <a href="https://www.reddit.com/r/gameofthrones/">/r/gameofthrones</a>, <a href="https://www.reddit.com/r/photoshopbattles/">/r/photoshopbattles</a>, and <a href="https://www.reddit.com/r/WritingPrompts/">/r/WritingPrompts</a>.</p>
<p>Some subreddits have more uniformity than typical Reddit rank behavior. In <a href="https://www.reddit.com/r/funny/">/r/funny</a>, <a href="https://www.reddit.com/r/leagueoflegends/">/r/leagueoflegends</a>, <a href="https://www.reddit.com/r/pics/">/r/pics</a>, <a href="https://www.reddit.com/r/todayilearned/">/r/todayilearned</a>, and <a href="https://www.reddit.com/r/videos/">/r/videos</a> (i.e. many default subreddits), there is no upper-left void (early comments can be poorly ranked) and the bottom-right void is minimized but still present.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/11/first-comment/funny-2d_hu16942781851955669877.webp 320w,https://minimaxir.com/2016/11/first-comment/funny-2d_hu5631728181406531520.webp 768w,https://minimaxir.com/2016/11/first-comment/funny-2d_hu7569998453113935705.webp 1024w,https://minimaxir.com/2016/11/first-comment/funny-2d.png 1200w" src="funny-2d.png"/> 
</figure>

<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/11/first-comment/leagueoflegends-2d_hu3519457788906926997.webp 320w,https://minimaxir.com/2016/11/first-comment/leagueoflegends-2d_hu2566406629925078746.webp 768w,https://minimaxir.com/2016/11/first-comment/leagueoflegends-2d_hu16887153970850675097.webp 1024w,https://minimaxir.com/2016/11/first-comment/leagueoflegends-2d.png 1200w" src="leagueoflegends-2d.png"/> 
</figure>

<p>Conversely, there are subreddits where the correlation is obvious. <a href="https://www.reddit.com/r/pcmasterrace/">/r/pcmasterrace</a> and /r/gonewild both exhibit very straight lines, and are subreddits where the comments themselves are not very constructive, so whatever gets posted gets upvoted anyway.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/11/first-comment/pcmasterrace-2d_hu3165470510000620234.webp 320w,https://minimaxir.com/2016/11/first-comment/pcmasterrace-2d_hu11965638413263816106.webp 768w,https://minimaxir.com/2016/11/first-comment/pcmasterrace-2d_hu9194754396634035473.webp 1024w,https://minimaxir.com/2016/11/first-comment/pcmasterrace-2d.png 1200w" src="pcmasterrace-2d.png"/> 
</figure>

<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/11/first-comment/gonewild-2d_hu14401683365729707287.webp 320w,https://minimaxir.com/2016/11/first-comment/gonewild-2d_hu9968179837543238337.webp 768w,https://minimaxir.com/2016/11/first-comment/gonewild-2d_hu2820129614851072752.webp 1024w,https://minimaxir.com/2016/11/first-comment/gonewild-2d.png 1200w" src="gonewild-2d.png"/> 
</figure>

<p>Rushing to say <strong>FIRST!!1!11!</strong> in a comments section of a blog post or forum thread is a meme that long predates Reddit. However, rushing to make the first comment in a Reddit thread may have strategic merit if you want to get your voice heard.</p>
<p>Even in the most optimistic circumstances, comments that are late to a thread have a very, very low probability of becoming one of the top comments. In fairness, it&rsquo;s hard to determine with public Reddit data if tweaking the ranking algorithm such that new comments will always rank at the top initially will actually improve the Reddit user experience as a whole. On the other hand, this behavior presents an opportunity: if there is a <a href="https://en.wikipedia.org/wiki/Long_tail">long tail</a> of Reddit content that is unjustifiably being buried due to lack of attention, then perhaps there is a <em>business opportunity</em> in creating a service to discover and resurface quality comments&hellip;</p>
<hr>
<p><em>You can view all the <a href="https://www.r-project.org">R</a> and <a href="http://ggplot2.org">ggplot2</a> code used to query, analyze, and visualize the Reddit data in <a href="http://minimaxir.com/notebooks/first-comment/">this R Notebook</a>. You can also view the images/data used for this post in <a href="https://github.com/minimaxir/first-comment">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Methods for Finding Related Reddit Subreddits with Simple Set Theory</title>
      <link>https://minimaxir.com/2016/06/reddit-related-subreddits/</link>
      <pubDate>Mon, 20 Jun 2016 08:20:00 -0700</pubDate>
      <guid>https://minimaxir.com/2016/06/reddit-related-subreddits/</guid>
      <description>Fancy machine learning approaches may not be required to help Redditors discover new things.</description>
      <content:encoded><![CDATA[<p>I recently <a href="http://minimaxir.com/2016/05/reddit-graph/">wrote a post</a> on how to visualize <a href="https://en.wikipedia.org/wiki/Graph_theory">network graphs</a> of <a href="https://www.reddit.com">Reddit</a> subreddits.</p>
<p>One of the reasons I&rsquo;ve been researching the topic is to find a good way to facilitate discovery of lesser-known subreddits, as Reddit is doing a terrible job at it (although they have been trying a <a href="https://www.reddit.com/r/changelog/comments/4o4qjh/more_small_tests_to_improve_user_experience_live/d49leyu?context=2">few new experiments</a> <em>very recently</em>). As it turns out, invoking graph theory is overkill. Even fancy machine learning approaches like <a href="https://en.wikipedia.org/wiki/Collaborative_filtering">collaborative filtering</a>, while powerful, may not be required to help Redditors discover new things.</p>
<h2 id="finding-related-subreddits">Finding Related Subreddits</h2>
<p>Let&rsquo;s say we have two sets: set <em>A</em>, the set of active users in a given subreddit, and set <em>B</em>, the set of active users in another subreddit. The intersection of sets <em>A</em> and <em>B</em> (A ∩ B) represents users who are active in <em>both</em> subreddits.</p>
<p>Using <a href="https://cloud.google.com/bigquery/">BigQuery</a>, I can get the comment data from <strong>ALL</strong> public Reddit subreddits; this technique would not work well with any smaller subset. The network graph edgelist, obtained <a href="http://minimaxir.com/2016/05/reddit-graph/">as described in my previous post</a>, conveniently gives (A ∩ B): the number of shared active users for all pairs of subreddits (defining &ldquo;active users&rdquo; as users who have made a comment in at least 5 unique threads in a given subreddit within the past 6 months).</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/06/reddit-related-subreddits/active-edge.png 287w" src="active-edge.png"/> 
</figure>

<p>In this case, we can filter the edgelist to only allow intersections where there are at least 10 active users; this prevents including dead and personal subreddits.</p>
<p>We can run another similar query to get the number of active users for each subreddit.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/06/reddit-related-subreddits/active-users.png 223w" src="active-users.png"/> 
</figure>

<p>After that, for a given subreddit <em>A</em>, find:</p>
<p>(A ∩ B) / (B)</p>
<p>for all subreddits <em>B</em> where (A ∩ B) &gt; 0 (i.e. only neighbors of <em>A</em>). This computation takes less than a second. Additionally, the output is always a percentage between 0% and 100%. For the visualizations, we plot the Top 15 subreddits with the highest overlap with the specified subreddit <em>A</em> (and color the bars with a nice <a href="https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html">viridis palette</a> to provide another easy way to perceive relative magnitude of relatedness).</p>
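<p>To make the ratio concrete, here is a minimal Python sketch. It is not the notebook code; the subreddit names, user counts, and the <code>related_subreddits</code> helper are all hypothetical, and the edge list is assumed to be a dictionary keyed by unordered subreddit pairs.</p>

```python
# Hypothetical toy inputs; the names and counts are made up for illustration.
active_users = {"aww": 1000, "cats": 200, "kittens": 50, "pics": 5000}

# Edge list: shared active users for each unordered pair of subreddits.
shared_users = {
    frozenset(("aww", "cats")): 120,
    frozenset(("aww", "kittens")): 40,
    frozenset(("aww", "pics")): 600,
}

def related_subreddits(target, active_users, shared_users, top_n=15):
    """Rank each neighbor B of target A by (A ∩ B) / (B)."""
    scores = {}
    for pair, overlap in shared_users.items():
        if target in pair and overlap:
            (other,) = pair - {target}
            scores[other] = overlap / active_users[other]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

related = related_subreddits("aww", active_users, shared_users)
```

<p>Because the denominator is the size of <em>B</em>, a small subreddit whose users are mostly also active in <em>A</em> scores very high. In the toy data, the tiny <code>kittens</code> outranks the much larger <code>pics</code> even though <code>pics</code> shares far more users in absolute terms.</p>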
<p>The methodology may sound arbitrary, but the results are very interesting. Here&rsquo;s a chart of the top related subreddits for <a href="https://www.reddit.com/r/aww">/r/aww</a>, one of the most popular places on the internet for cat pictures.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/06/reddit-related-subreddits/aww-related_hu7452210711215269735.webp 320w,https://minimaxir.com/2016/06/reddit-related-subreddits/aww-related_hu10024078028970969555.webp 768w,https://minimaxir.com/2016/06/reddit-related-subreddits/aww-related_hu6748153850376620041.webp 1024w,https://minimaxir.com/2016/06/reddit-related-subreddits/aww-related.png 1200w" src="aww-related.png"/> 
</figure>

<p>I have honestly <em>never</em> heard of any of these subreddits before. Yet, by analyzing public user activity alone, I found a few new places to get more cute pics.</p>
<p>This methodology is excellent for finding subreddit-specific subsubreddits which may not be documented. The related subreddits for <a href="https://www.reddit.com/r/buildapc">/r/buildapc</a> offer more places to get PC building advice.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/06/reddit-related-subreddits/buildapc-related_hu1444884617368300487.webp 320w,https://minimaxir.com/2016/06/reddit-related-subreddits/buildapc-related_hu4669675434075365966.webp 768w,https://minimaxir.com/2016/06/reddit-related-subreddits/buildapc-related_hu12936135219142779611.webp 1024w,https://minimaxir.com/2016/06/reddit-related-subreddits/buildapc-related.png 1200w" src="buildapc-related.png"/> 
</figure>

<p>Related subreddits for sport-specific subreddits, like <a href="https://www.reddit.com/r/cfb">/r/cfb</a> (college football), include the corresponding teams.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/06/reddit-related-subreddits/cfb-related_hu2579841373365720772.webp 320w,https://minimaxir.com/2016/06/reddit-related-subreddits/cfb-related_hu13142719208350041010.webp 768w,https://minimaxir.com/2016/06/reddit-related-subreddits/cfb-related_hu14267997494245642230.webp 1024w,https://minimaxir.com/2016/06/reddit-related-subreddits/cfb-related.png 1200w" src="cfb-related.png"/> 
</figure>

<p><a href="https://www.reddit.com/r/food">/r/food</a> related subreddits list a surprising number of subreddits dedicated to specific foods.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/06/reddit-related-subreddits/food-related_hu16026725087471090124.webp 320w,https://minimaxir.com/2016/06/reddit-related-subreddits/food-related_hu150083133972731399.webp 768w,https://minimaxir.com/2016/06/reddit-related-subreddits/food-related_hu11622016290584749171.webp 1024w,https://minimaxir.com/2016/06/reddit-related-subreddits/food-related.png 1200w" src="food-related.png"/> 
</figure>

<p>There is a surprising amount of depth to the <a href="https://www.reddit.com/r/me_irl">/r/me_irl</a> network.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/06/reddit-related-subreddits/me_irl-related_hu14038412834064863518.webp 320w,https://minimaxir.com/2016/06/reddit-related-subreddits/me_irl-related_hu7168717665286325706.webp 768w,https://minimaxir.com/2016/06/reddit-related-subreddits/me_irl-related_hu726448844965468624.webp 1024w,https://minimaxir.com/2016/06/reddit-related-subreddits/me_irl-related.png 1200w" src="me_irl-related.png"/> 
</figure>

<p>The chart for <a href="https://www.reddit.com/r/programming">/r/programming</a> can tell you which subreddits exist for specific programming languages and technologies.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/06/reddit-related-subreddits/programming-related_hu16988724025478746388.webp 320w,https://minimaxir.com/2016/06/reddit-related-subreddits/programming-related_hu8397177089322304462.webp 768w,https://minimaxir.com/2016/06/reddit-related-subreddits/programming-related_hu11081663765832830224.webp 1024w,https://minimaxir.com/2016/06/reddit-related-subreddits/programming-related.png 1200w" src="programming-related.png"/> 
</figure>

<p>The methodology can also reveal a <em>lack</em> of related subreddits, by the large contrast between subreddits with high relatedness and low relatedness. For example, while /r/cfb may have large numbers of obviously-related subreddits as a sports subreddit, <a href="https://www.reddit.com/r/golf">/r/golf</a> has only 2.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/06/reddit-related-subreddits/golf-related_hu3218786536730386494.webp 320w,https://minimaxir.com/2016/06/reddit-related-subreddits/golf-related_hu16170888182856983365.webp 768w,https://minimaxir.com/2016/06/reddit-related-subreddits/golf-related_hu15971132694235677801.webp 1024w,https://minimaxir.com/2016/06/reddit-related-subreddits/golf-related.png 1200w" src="golf-related.png"/> 
</figure>

<p>You can view Related Subreddit charts for the Top 200 Subreddits <a href="https://github.com/minimaxir/subreddit-related/tree/master/related">in this GitHub repository</a>.</p>
<h2 id="finding-similar-subreddits">Finding Similar Subreddits</h2>
<p>Another method for finding related subreddits would be to find subreddits with similar communities. An academic approach to finding similarity between sets is the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard Index</a>. Using the same set A and set B definitions above, the formula now becomes:</p>
<p>(A ∩ B) / [(A) + (B) - (A ∩ B)]</p>
<p>which outputs the Jaccard Index, between 0 and 1. This formula only requires a few tweaks to the original code. The results from this computation tell a different story.</p>
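<p>A minimal sketch of the formula, assuming only the three set sizes are known (which is all the edge list and the active-user table provide). The <code>jaccard_index</code> helper and the user IDs are hypothetical, and the explicit sets are there only to check the size-based version against the textbook definition:</p>

```python
def jaccard_index(size_a, size_b, size_both):
    """Jaccard Index from set sizes: (A ∩ B) / [(A) + (B) - (A ∩ B)]."""
    union = size_a + size_b - size_both
    return size_both / union if union else 0.0

# Hypothetical active users of two subreddits.
a = {"u1", "u2", "u3", "u4"}
b = {"u3", "u4", "u5"}

from_sizes = jaccard_index(len(a), len(b), len(a.intersection(b)))
from_sets = len(a.intersection(b)) / len(a.union(b))
```

<p>Unlike the previous ratio, this one is symmetric: swapping <em>A</em> and <em>B</em> gives the same value, and a large size mismatch between the two subreddits pushes the index toward 0.</p>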
<p>Here are the most-similar subreddits to /r/aww:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/06/reddit-related-subreddits/aww-jaccard-nondefault_hu4434561575161442021.webp 320w,https://minimaxir.com/2016/06/reddit-related-subreddits/aww-jaccard-nondefault_hu7921511715840130783.webp 768w,https://minimaxir.com/2016/06/reddit-related-subreddits/aww-jaccard-nondefault_hu4055276538136354615.webp 1024w,https://minimaxir.com/2016/06/reddit-related-subreddits/aww-jaccard-nondefault.png 1200w" src="aww-jaccard-nondefault.png"/> 
</figure>

<p>In this implementation, the <a href="https://www.reddit.com/r/defaults/comments/4l3svc/list_of_default_subreddits_usa_26_may_2016/">default Reddit subreddits</a> must be removed from the results, as the communities of default subreddits are largely similar to most others by design. Even former defaults like <a href="https://www.reddit.com/r/adviceanimals">/r/adviceanimals</a> and <a href="https://www.reddit.com/r/technology">/r/technology</a> still have large numbers of holdout users who skew the results. As <a href="https://www.reddit.com/r/aww">/r/aww</a> is a mass-appeal subreddit, it makes sense that its community is similar to those of other mass-appeal subreddits.</p>
<p>The magnitude of the Jaccard Index measures the strength of the similarity. Most subreddit relationships have a low Jaccard Index, but the relative magnitude between all subreddit neighbors illustrates comparisons for potential related subreddits regardless (this is also the reason why the x-axis is not fixed across plots). The subreddit relationship with the highest absolute similarity is <a href="https://www.reddit.com/r/arrow">/r/arrow</a> and <a href="https://www.reddit.com/r/flashtv">/r/flashtv</a> at 0.345, which makes sense given the massive overlap between the two CW television shows.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/06/reddit-related-subreddits/arrow-jaccard-nondefault_hu4581217514535812066.webp 320w,https://minimaxir.com/2016/06/reddit-related-subreddits/arrow-jaccard-nondefault_hu3694072670605671008.webp 768w,https://minimaxir.com/2016/06/reddit-related-subreddits/arrow-jaccard-nondefault_hu16427741968543964628.webp 1024w,https://minimaxir.com/2016/06/reddit-related-subreddits/arrow-jaccard-nondefault.png 1200w" src="arrow-jaccard-nondefault.png"/> 
</figure>

<p>The Jaccard Index is more useful for finding similar subreddits to niche subreddits. Let&rsquo;s try a few of the subreddits mentioned previously and see how the results changed.</p>
<p>/r/buildapc is a niche subreddit, and the output identifies well-established subreddits, unlike with the previous related-subreddit methodology.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/06/reddit-related-subreddits/buildapc-jaccard-nondefault_hu10461306939229030101.webp 320w,https://minimaxir.com/2016/06/reddit-related-subreddits/buildapc-jaccard-nondefault_hu15639218814474986181.webp 768w,https://minimaxir.com/2016/06/reddit-related-subreddits/buildapc-jaccard-nondefault_hu14479453447745821303.webp 1024w,https://minimaxir.com/2016/06/reddit-related-subreddits/buildapc-jaccard-nondefault.png 1200w" src="buildapc-jaccard-nondefault.png"/> 
</figure>

<p>The subreddit most similar to /r/cfb (college football) is <a href="https://www.reddit.com/r/collegebasketball">/r/collegebasketball</a>!</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/06/reddit-related-subreddits/cfb-jaccard-nondefault_hu14165105937879864405.webp 320w,https://minimaxir.com/2016/06/reddit-related-subreddits/cfb-jaccard-nondefault_hu8639737058399322601.webp 768w,https://minimaxir.com/2016/06/reddit-related-subreddits/cfb-jaccard-nondefault_hu10483996780611261190.webp 1024w,https://minimaxir.com/2016/06/reddit-related-subreddits/cfb-jaccard-nondefault.png 1200w" src="cfb-jaccard-nondefault.png"/> 
</figure>

<p>The subreddit most similar to /r/food is <a href="https://www.reddit.com/r/cooking">/r/cooking</a>!</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/06/reddit-related-subreddits/food-jaccard-nondefault_hu4173928572311319346.webp 320w,https://minimaxir.com/2016/06/reddit-related-subreddits/food-jaccard-nondefault_hu11891340969130342547.webp 768w,https://minimaxir.com/2016/06/reddit-related-subreddits/food-jaccard-nondefault_hu6265471217254151858.webp 1024w,https://minimaxir.com/2016/06/reddit-related-subreddits/food-jaccard-nondefault.png 1200w" src="food-jaccard-nondefault.png"/> 
</figure>

<p>The subreddit most similar to /r/programming is <a href="https://www.reddit.com/r/linux">/r/linux</a>! (of course)</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/06/reddit-related-subreddits/programming-jaccard-nondefault_hu18323117916079437568.webp 320w,https://minimaxir.com/2016/06/reddit-related-subreddits/programming-jaccard-nondefault_hu3690818704065230742.webp 768w,https://minimaxir.com/2016/06/reddit-related-subreddits/programming-jaccard-nondefault_hu5461393601724154561.webp 1024w,https://minimaxir.com/2016/06/reddit-related-subreddits/programming-jaccard-nondefault.png 1200w" src="programming-jaccard-nondefault.png"/> 
</figure>

<p>You can view the Similar Subreddit charts for the Top 200 Subreddits <a href="https://github.com/minimaxir/subreddit-related/tree/master/similar">in this GitHub repository</a>.</p>
<p>Again, Reddit has significantly better internal data for identifying user activity between subreddits, such as voting patterns and clickthrough tracking. But the results shown using these two set methodologies are pretty good for using public data. In fact, these two set approaches can theoretically work with <em>any</em> set of categorized, settable data, which may give me a few ideas for new blog posts in the future.</p>
<p>And there are still the fancy machine learning approaches to try.</p>
<hr>
<p><em>As always, the full code used to process the comment data and generate the visualizations is available in <a href="https://github.com/minimaxir/subreddit-related/blob/master/find_related_subreddits.ipynb">this Jupyter notebook</a>, open-sourced <a href="https://github.com/minimaxir/subreddit-related">on GitHub</a>.</em></p>
<p><em>If you do find any other interesting trends in the related/similar charts of other subreddits and write about them, it would be greatly appreciated if proper attribution is given back to this post and/or myself. Thanks!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>How to Create a Network Graph Visualization of Reddit Subreddits</title>
      <link>https://minimaxir.com/2016/05/reddit-graph/</link>
      <pubDate>Fri, 27 May 2016 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2016/05/reddit-graph/</guid>
      <description>There is very little discussion on how to gather the data for large-scale network graph visualizations, and how to make them. It is time to fix that.</description>
      <content:encoded><![CDATA[<p><a href="https://en.wikipedia.org/wiki/Graph_theory">Network graphs</a> are pretty data visualizations, and I like pretty data visualizations. Recently, <a href="https://www.reddit.com">Reddit</a> user CuriousGnu <a href="http://www.curiousgnu.com/reddit-comments">posted a network graph</a> of the comment patterns of the top 50 Reddit subreddits:</p>
<p>The <a href="https://www.reddit.com/r/dataisbeautiful/comments/4fsrjd/oc_redditors_who_commented_in_rx_also_commented/">visualization</a> was made with <a href="https://gephi.org">Gephi</a>, a very popular free and open-source network graph tool.</p>
<p>Gephi is <em>extremely</em> difficult to use, and most blog posts about the software are in the form of Step 1: Gephi, Step 2: ???, Step 3: Profit. Even if you <em>do</em> know how to use it, most of the network design customizations must be done manually, which is not helped by software slowness even on high-end machines. My own attempts to use Gephi for nice-looking networks have had <a href="https://www.reddit.com/r/dataisbeautiful/comments/3z60z6/network_of_reddit_commenting_patterns_for_the_top/">mixed</a> <a href="https://www.reddit.com/r/magicTCG/comments/401hdq/graph_network_of_magic_the_gathering_creature/">results</a>.</p>
<p>Additionally, there is very little discussion on how to gather the data for large-scale network graph visualizations, and how to make them in a <em>reproducible</em> manner. It is time to fix that and create a Reddit network graph visualization with many more nodes, step by step.</p>
<h2 id="getting-reddit-edge-data">Getting Reddit Edge Data</h2>
<p>Network graphs are typically formed by getting the relationship data between two entities (the edges), then extrapolating the vertices of the graph (the nodes) from that data.</p>
<p>There are two common data structures for representing edge data. One is an <a href="https://en.wikipedia.org/wiki/Adjacency_matrix">adjacency matrix</a>, which is a 2D matrix where the rows/columns represent the entities, and the value at the intersection between a row/column represents the <em>weight</em> of the relationships. For the visualization above, CuriousGnu made an adjacency matrix by querying the relationships from <a href="https://cloud.google.com/bigquery/">BigQuery</a> for each subreddit manually. That requires adding a line of SQL for <em>each</em> subreddit you want to plot, which is time-consuming and I am lazy.</p>
<p>Let&rsquo;s try option #2: an <a href="https://reference.wolfram.com/language/ref/EdgeList.html">edge list</a>, which is a tabular dataset where each row contains the two entities and a weight. With clever use of BigQuery, we can query the edges for <em>every single subreddit</em> at the same time. And we can query on real-time Reddit data from approximately the past 6 months using Jason Baumgartner&rsquo;s <a href="https://pushshift.io/using-bigquery-with-reddit-data/">Reddit dataset</a> on BigQuery.</p>
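<p>To make the difference between the two structures concrete, here is a minimal base R sketch that builds a tiny edge list and then expands it into the equivalent adjacency matrix. The subreddit names and weights are made up for illustration; only the shape of the data matters.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"># Toy edge list: one row per pair of subreddits, plus a weight.
# (Names and weights are made up for illustration.)
edges &lt;- data.frame(
  Source = c(&#34;askreddit&#34;, &#34;askreddit&#34;, &#34;gifs&#34;),
  Target = c(&#34;pics&#34;, &#34;gifs&#34;, &#34;pics&#34;),
  Weight = c(500, 300, 200),
  stringsAsFactors = FALSE
)

# The equivalent adjacency matrix: rows/columns are subreddits,
# and each cell holds the weight of the relationship.
nodes &lt;- sort(unique(c(edges$Source, edges$Target)))
adj &lt;- matrix(0, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))
adj[cbind(edges$Source, edges$Target)] &lt;- edges$Weight
adj[cbind(edges$Target, edges$Source)] &lt;- edges$Weight  # undirected, so mirror it

adj
</code></pre></div>
<p>The matrix grows with the square of the number of subreddits, while the edge list only grows with the number of pairs that actually have a relationship, which is why the edge list is the more practical format here.</p>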
<p>The process works like this:</p>
<ol>
<li>Determine active users of a subreddit by identifying the subreddits where a user has <strong>commented</strong> on at least <strong>5 different submissions</strong> within the past 6 months.</li>
<li>Perform a <a href="http://stackoverflow.com/questions/3362038/what-is-self-join-and-when-would-you-use-it">self-join</a> by joining the table on itself: this will create <strong>links</strong> between all subreddits where a given user is active. (e.g. an active user of /r/askreddit, /r/pics, and /r/gifs will form 9 links: askreddit → askreddit, askreddit → pics, askreddit → gifs, pics → askreddit, etc.)</li>
<li>Aggregate the counts of the number of links between two subreddits; this will become the edge <strong>Weight</strong>.</li>
<li>Filter the resulting dataset by removing self-loops and reverse-edges. (e.g. since we have askreddit → pics, remove pics → askreddit). Additionally, we should only retain edges with <strong>at least 200 active users</strong> to keep the resulting dataset a manageable size for this analysis.</li>
</ol>
<p>Putting it all together results in this query:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">a</span><span class="p">.</span><span class="n">l_subreddit</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="k">Source</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="p">.</span><span class="n">l_subreddit</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">Target</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">Weight</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">SELECT</span><span class="w"> </span><span class="n">author</span><span class="p">,</span><span class="w"> </span><span class="k">LOWER</span><span class="p">(</span><span class="n">subreddit</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">l_subreddit</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="k">DISTINCT</span><span class="p">(</span><span class="n">link_id</span><span class="p">))</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">unique_threads</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">FROM</span><span class="w"> </span><span class="p">[</span><span class="n">pushshift</span><span class="p">:</span><span class="n">rt_reddit</span><span class="p">.</span><span class="n">comments</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">author</span><span class="p">,</span><span class="w"> </span><span class="n">l_subreddit</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">HAVING</span><span class="w"> </span><span class="n">unique_threads</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="mi">5</span><span class="p">)</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">SELECT</span><span class="w"> </span><span class="n">author</span><span class="p">,</span><span class="w"> </span><span class="k">LOWER</span><span class="p">(</span><span class="n">subreddit</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">l_subreddit</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="k">DISTINCT</span><span class="p">(</span><span class="n">link_id</span><span class="p">))</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">unique_threads</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">FROM</span><span class="w"> </span><span class="p">[</span><span class="n">pushshift</span><span class="p">:</span><span class="n">rt_reddit</span><span class="p">.</span><span class="n">comments</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">author</span><span class="p">,</span><span class="w"> </span><span class="n">l_subreddit</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">HAVING</span><span class="w"> </span><span class="n">unique_threads</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="mi">5</span><span class="p">)</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">a</span><span class="p">.</span><span class="n">author</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b</span><span class="p">.</span><span class="n">author</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">Source</span><span class="p">,</span><span class="w"> </span><span class="n">Target</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">HAVING</span><span class="w"> </span><span class="k">Source</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">Target</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">Weight</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="mi">200</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">Weight</span><span class="w"> </span><span class="k">DESC</span><span class="w">
</span></span></span></code></pre></div><p>Only 13 lines of code, with 3 of those lines repeated. Running the query takes only a few minutes (which is actually <em>forever</em> in BigQuery time: when people talk about &ldquo;big data,&rdquo; this is <em>actually big data</em>!).</p>
<p>That query (at the time of analysis) returns <a href="https://docs.google.com/spreadsheets/d/1MFHno-sYR3MkWgntnieWobWQ2e3x4CAcNUdVIFjmlQI/edit?usp=sharing">this dataset</a> of 7,498 edges; more than enough. Now for the fun part.</p>
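<p>If the self-join is hard to picture, the same four steps can be replayed on a toy table in base R. This is only a sketch of the logic, not the query itself: the authors and subreddits are made up, the table stands in for the output of step 1, and the 200-user cutoff is skipped because the toy data is far too small for it.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"># Toy stand-in for the output of step 1: one row per (author, subreddit)
# pair where the author counts as &#34;active&#34;. All values are made up.
active &lt;- data.frame(
  author    = c(&#34;alice&#34;, &#34;alice&#34;, &#34;alice&#34;, &#34;bob&#34;, &#34;bob&#34;),
  subreddit = c(&#34;askreddit&#34;, &#34;pics&#34;, &#34;gifs&#34;, &#34;askreddit&#34;, &#34;pics&#34;),
  stringsAsFactors = FALSE
)

# Step 2: self-join on author, which yields every pair of subreddits per user.
links &lt;- merge(active, active, by = &#34;author&#34;, suffixes = c(&#34;_a&#34;, &#34;_b&#34;))

# Step 3: count the links between each pair to get the edge weight.
edges &lt;- aggregate(author ~ subreddit_a + subreddit_b, data = links, FUN = length)
names(edges) &lt;- c(&#34;Source&#34;, &#34;Target&#34;, &#34;Weight&#34;)

# Step 4: drop self-loops and reverse edges by keeping Source &lt; Target.
# (The real query also requires Weight &gt;= 200.)
edges &lt;- edges[edges$Source &lt; edges$Target, ]
edges[order(-edges$Weight), ]
</code></pre></div>
<p>The <code>Source &lt; Target</code> comparison is what handles both filters at once: a self-loop fails it because the two names are equal, and of each forward/reverse pair only the alphabetically ordered one passes.</p>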
<h2 id="visualizing-the-reddit-data">Visualizing the Reddit Data</h2>
<p>The edge list linked above can actually be imported into Gephi as-is. <strong>Don&rsquo;t</strong>.</p>
<p>Instead, let&rsquo;s use R and my favorite data visualization tool <code>ggplot2</code>, with a twist.</p>
<p>First, we load the edge list into R, and create an undirected network graph using the <code>igraph</code> package.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">net</span> <span class="o">&lt;-</span> <span class="nf">graph.data.frame</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">directed</span><span class="o">=</span><span class="bp">F</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/05/reddit-graph/igraph_hu9095004272863274831.webp 320w,https://minimaxir.com/2016/05/reddit-graph/igraph_hu8290687697856770257.webp 768w,https://minimaxir.com/2016/05/reddit-graph/igraph_hu14773623017789069366.webp 1024w,https://minimaxir.com/2016/05/reddit-graph/igraph.png 1258w" src="igraph.png"/> 
</figure>

<p>The imported edge list results in a network with 1,131 nodes/subreddits. After pruning nodes with only a few neighbors and removing the subsequently-orphaned edges, we get a network of 517 nodes/subreddits with 6,732 edges.</p>
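<p>The pruning code is not shown inline. One way it could look with igraph is sketched below; the toy edge list and the <code>min_neighbors</code> cutoff are placeholders of mine, not the values used for this post.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(igraph)

# Toy edge list (made-up values): a triangle a-b-c plus a dangling node d.
df &lt;- data.frame(
  Source = c(&#34;a&#34;, &#34;a&#34;, &#34;b&#34;, &#34;c&#34;),
  Target = c(&#34;b&#34;, &#34;c&#34;, &#34;c&#34;, &#34;d&#34;),
  Weight = c(5, 4, 3, 1)
)
net &lt;- graph_from_data_frame(df, directed = FALSE)

# Drop nodes with fewer than `min_neighbors` neighbors. Deleting a vertex
# also deletes its edges, so no orphaned edges are left behind.
min_neighbors &lt;- 2  # hypothetical cutoff, not the value used for the post
net_pruned &lt;- delete_vertices(net, V(net)[degree(net) &lt; min_neighbors])

vcount(net_pruned)  # number of remaining nodes
ecount(net_pruned)  # number of remaining edges
</code></pre></div>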
<p>We can then add summary statistics for the nodes, such as the group/community each node belongs to, and the <a href="https://en.wikipedia.org/wiki/Centrality">eigenvector centrality</a> of the node.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">V</span><span class="p">(</span><span class="n">net</span><span class="p">)</span><span class="o">$</span><span class="n">group</span> <span class="o">&lt;-</span> <span class="nf">membership</span><span class="p">(</span><span class="nf">cluster_walktrap</span><span class="p">(</span><span class="n">net</span><span class="p">,</span> <span class="n">weights</span><span class="o">=</span><span class="nf">E</span><span class="p">(</span><span class="n">net</span><span class="p">)</span><span class="o">$</span><span class="n">Weight</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="nf">V</span><span class="p">(</span><span class="n">net</span><span class="p">)</span><span class="o">$</span><span class="n">centrality</span> <span class="o">&lt;-</span> <span class="nf">eigen_centrality</span><span class="p">(</span><span class="n">net</span><span class="p">,</span> <span class="n">weights</span><span class="o">=</span><span class="nf">E</span><span class="p">(</span><span class="n">net</span><span class="p">)</span><span class="o">$</span><span class="n">Weight</span><span class="p">)</span><span class="o">$</span><span class="n">vector</span>
</span></span></code></pre></div><p>Convert the network to a dataframe suitable for plotting using the <code>ggnetwork</code> library.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_net</span> <span class="o">&lt;-</span> <span class="nf">ggnetwork</span><span class="p">(</span><span class="n">net</span><span class="p">,</span> <span class="n">layout</span> <span class="o">=</span> <span class="s">&#34;fruchtermanreingold&#34;</span><span class="p">,</span> <span class="n">weights</span><span class="o">=</span><span class="s">&#34;Weight&#34;</span><span class="p">,</span> <span class="n">niter</span><span class="o">=</span><span class="m">50000</span><span class="p">)</span>
</span></span></code></pre></div><p>Now time for ggplot2/ggnetwork fun. In this case, we will color the nodes by whether or not they are a default subreddit (orange if default, blue otherwise) and color the lines accordingly (orange if either end is a default subreddit, blue otherwise).</p>
<p>Yes, writing and optimizing all of this code is <em>significantly</em> easier than using Gephi, believe it or not.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">default_colors</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s">&#34;#3498db&#34;</span><span class="p">,</span> <span class="s">&#34;#e67e22&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">default_labels</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s">&#34;Not Default&#34;</span><span class="p">,</span> <span class="s">&#34;Default&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">ggplot</span><span class="p">(</span><span class="n">df_net</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="p">,</span> <span class="n">xend</span> <span class="o">=</span> <span class="n">xend</span><span class="p">,</span> <span class="n">yend</span> <span class="o">=</span> <span class="n">yend</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="n">centrality</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">geom_edges</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">color</span> <span class="o">=</span> <span class="n">connectDefault</span><span class="p">),</span> <span class="n">size</span><span class="o">=</span><span class="m">0.05</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">geom_nodes</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">fill</span> <span class="o">=</span> <span class="n">defaultnode</span><span class="p">),</span> <span class="n">shape</span> <span class="o">=</span> <span class="m">21</span><span class="p">,</span> <span class="n">stroke</span><span class="o">=</span><span class="m">0.2</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">&#34;black&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">geom_nodelabel_repel</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_net</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">color</span> <span class="o">=</span> <span class="n">defaultnode</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="n">vertex.names</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                          <span class="n">fontface</span> <span class="o">=</span> <span class="s">&#34;bold&#34;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="m">0.5</span><span class="p">,</span> <span class="n">box.padding</span> <span class="o">=</span> <span class="nf">unit</span><span class="p">(</span><span class="m">0.05</span><span class="p">,</span> <span class="s">&#34;lines&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                          <span class="n">label.padding</span><span class="o">=</span> <span class="nf">unit</span><span class="p">(</span><span class="m">0.1</span><span class="p">,</span> <span class="s">&#34;lines&#34;</span><span class="p">),</span> <span class="n">segment.size</span><span class="o">=</span><span class="m">0.1</span><span class="p">,</span> <span class="n">label.size</span><span class="o">=</span><span class="m">0.2</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">scale_color_manual</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="n">default_colors</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">default_labels</span><span class="p">,</span> <span class="n">guide</span><span class="o">=</span><span class="bp">F</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">scale_fill_manual</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="n">default_colors</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">default_labels</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">ggtitle</span><span class="p">(</span><span class="s">&#34;Network Graph of Reddit Subreddits (by @minimaxir)&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">scale_size</span><span class="p">(</span><span class="n">range</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">0.1</span><span class="p">,</span> <span class="m">4</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">theme_blank</span><span class="p">()</span>
</span></span></code></pre></div><!-- <span class="hidden-lg">_If you are on a smartphone or tablet, tap <a href="/img/reddit-graph/subreddit-1.pdf" target="_blank">this link</a> to view the network in a zoomable format._</span> -->
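<p>The plot code references two columns, <code>defaultnode</code> and <code>connectDefault</code>, whose construction is not shown inline. A rough sketch of how such flags could be derived with igraph follows. The toy graph and the <code>defaults</code> vector are made up, and attaching the flags to the graph before conversion is an assumption on my part; the notebook may compute them differently.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(igraph)

# Toy graph and a hypothetical list of default subreddits (both made up).
df &lt;- data.frame(
  Source = c(&#34;askreddit&#34;, &#34;askreddit&#34;, &#34;dota2&#34;),
  Target = c(&#34;pics&#34;, &#34;dota2&#34;, &#34;globaloffensive&#34;),
  Weight = c(500, 300, 250)
)
net &lt;- graph_from_data_frame(df, directed = FALSE)
defaults &lt;- c(&#34;askreddit&#34;, &#34;pics&#34;)

# Node flag: is this subreddit a default?
V(net)$defaultnode &lt;- V(net)$name %in% defaults

# Edge flag: is either endpoint a default?
endpoints &lt;- ends(net, E(net))  # two-column matrix of vertex names
E(net)$connectDefault &lt;- endpoints[, 1] %in% defaults | endpoints[, 2] %in% defaults
</code></pre></div>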
<p>The large networks in the blog post are rendered as a PDF, which allows for easy pan/zooming at a very low file size (284KB!), while SVG/<a href="https://d3js.org">d3</a>/<a href="http://sigmajs.org">sigma.js</a> approaches have very poor performance at large numbers of nodes/edges.</p>
<p>As we expect, the default subreddits are in the center of the network graph and have high centrality (although /r/art and /r/earthporn are oddly far separated from the other defaults). The large amounts of orange graph-wide illustrate the breadth of the defaults.</p>
<p>Now let&rsquo;s color the nodes and edges by group, just as you saw in the introductory visualization:</p>
<!-- <span class="hidden-lg">_If you are on a smartphone or tablet, tap <a href="/img/reddit-graph/subreddit-2.pdf" target="_blank">this link</a> to view the network in a zoomable format._</span> -->
<p>If both ends of an edge belong to the same group, the edge takes that group&rsquo;s color. Otherwise, the edge is colored gray. (the code that implements this is not shown because it is somewhat convoluted). This color scheme helps gauge the overall impact of the communities on Reddit. But why not look at specific groups?</p>
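<p>The same-group rule itself is simple to state in code, even if wiring it into the plot is fiddly. Here is a minimal sketch of just the rule, on a toy graph with made-up group IDs standing in for the <code>cluster_walktrap()</code> output. The attribute name <code>edge_group</code> is hypothetical, and this is not the notebook&rsquo;s implementation.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(igraph)

# Toy graph with made-up group IDs.
df &lt;- data.frame(
  Source = c(&#34;a&#34;, &#34;b&#34;, &#34;c&#34;),
  Target = c(&#34;b&#34;, &#34;c&#34;, &#34;d&#34;),
  Weight = c(3, 2, 1)
)
net &lt;- graph_from_data_frame(df, directed = FALSE)
V(net)$group &lt;- c(1, 1, 2, 2)  # vertices are a, b, c, d

# Look up the group at each end of every edge.
endpoints &lt;- ends(net, E(net), names = FALSE)  # vertex indices
group_from &lt;- V(net)$group[endpoints[, 1]]
group_to   &lt;- V(net)$group[endpoints[, 2]]

# Same group at both ends: keep the group ID. Otherwise NA, which a
# ggplot2 scale can draw in gray through its na.value argument.
E(net)$edge_group &lt;- ifelse(group_from == group_to, group_from, NA)
</code></pre></div>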
<h2 id="subgraph-surprises">Subgraph Surprises</h2>
<p>As you can see plainly in the group-colored visualization, there is a giant green group at the center which includes the default subreddits. Analyzing that is not helpful. But we can filter the network on other specific groups and their subgraphs to see if we can define any Reddit subcultures. (note that the Group number is merely an ID; the value and order are not relevant).</p>
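<p>Filtering the network down to one group amounts to taking an induced subgraph: keep the nodes in that group and only the edges that run between them. A minimal igraph sketch, with a toy graph and made-up group IDs:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(igraph)

# Toy graph (made-up values).
df &lt;- data.frame(
  Source = c(&#34;dota2&#34;, &#34;dota2&#34;, &#34;nba&#34;, &#34;askreddit&#34;),
  Target = c(&#34;leagueoflegends&#34;, &#34;askreddit&#34;, &#34;nfl&#34;, &#34;nba&#34;),
  Weight = c(400, 300, 350, 250)
)
net &lt;- graph_from_data_frame(df, directed = FALSE)

# Made-up group IDs, assigned by name so vertex order does not matter.
groups &lt;- c(dota2 = 6, leagueoflegends = 6, askreddit = 1, nba = 1, nfl = 1)
V(net)$group &lt;- unname(groups[V(net)$name])

# Keep only the vertices in one group, plus the edges among them.
sub &lt;- induced_subgraph(net, V(net)[V(net)$group == 6])
V(sub)$name
</code></pre></div>
<p>Edges that cross into another group (here, dota2 to askreddit) are dropped along with the nodes on the far side, so each subgraph shows only the relationships inside its own community.</p>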
<p>The most notable Reddit groups are gaming groups. We have two distinct groups of gamers:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/05/reddit-graph/group-006_hu10508581818142447171.webp 320w,https://minimaxir.com/2016/05/reddit-graph/group-006_hu16135338840656175481.webp 768w,https://minimaxir.com/2016/05/reddit-graph/group-006_hu16930240503280894621.webp 1024w,https://minimaxir.com/2016/05/reddit-graph/group-006.png 1200w" src="group-006.png"/> 
</figure>

<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/05/reddit-graph/group-008_hu18347586986957538753.webp 320w,https://minimaxir.com/2016/05/reddit-graph/group-008_hu11518241172470025579.webp 768w,https://minimaxir.com/2016/05/reddit-graph/group-008_hu6532239915706647079.webp 1024w,https://minimaxir.com/2016/05/reddit-graph/group-008.png 1200w" src="group-008.png"/> 
</figure>

<p>Plus Nintendo gamers? With a little Vita on the side?</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/05/reddit-graph/group-010_hu5361338082595600258.webp 320w,https://minimaxir.com/2016/05/reddit-graph/group-010_hu6956518711832937120.webp 768w,https://minimaxir.com/2016/05/reddit-graph/group-010_hu17961460541310098366.webp 1024w,https://minimaxir.com/2016/05/reddit-graph/group-010.png 1200w" src="group-010.png"/> 
</figure>

<p>Subreddits related to sports and sporting teams form a nice cluster:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/05/reddit-graph/group-001_hu12635140973507527200.webp 320w,https://minimaxir.com/2016/05/reddit-graph/group-001_hu8378863093781276960.webp 768w,https://minimaxir.com/2016/05/reddit-graph/group-001_hu218370102267924479.webp 1024w,https://minimaxir.com/2016/05/reddit-graph/group-001.png 1200w" src="group-001.png"/> 
</figure>

<p>PC-building has a distinct community:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/05/reddit-graph/group-011_hu4528435591239793754.webp 320w,https://minimaxir.com/2016/05/reddit-graph/group-011_hu4580304487847779461.webp 768w,https://minimaxir.com/2016/05/reddit-graph/group-011_hu18048193634369146904.webp 1024w,https://minimaxir.com/2016/05/reddit-graph/group-011.png 1200w" src="group-011.png"/> 
</figure>

<p>The British make nice triangles!</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/05/reddit-graph/group-022_hu6601819275364951168.webp 320w,https://minimaxir.com/2016/05/reddit-graph/group-022_hu5797361921442358649.webp 768w,https://minimaxir.com/2016/05/reddit-graph/group-022_hu10976470785105648723.webp 1024w,https://minimaxir.com/2016/05/reddit-graph/group-022.png 1200w" src="group-022.png"/> 
</figure>

<p>Relationship and female-oriented subreddits have a relationship.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/05/reddit-graph/group-007_hu12060515596759531401.webp 320w,https://minimaxir.com/2016/05/reddit-graph/group-007_hu11312257643430791003.webp 768w,https://minimaxir.com/2016/05/reddit-graph/group-007_hu15379572613566778299.webp 1024w,https://minimaxir.com/2016/05/reddit-graph/group-007.png 1200w" src="group-007.png"/> 
</figure>

<p>Lastly, DC Comics has their own sector, particularly with the corresponding CW television shows. (although some Marvel shows sneak in!)</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2016/05/reddit-graph/group-003_hu4604322086069880708.webp 320w,https://minimaxir.com/2016/05/reddit-graph/group-003_hu14610407072459303422.webp 768w,https://minimaxir.com/2016/05/reddit-graph/group-003_hu13488003038545685704.webp 1024w,https://minimaxir.com/2016/05/reddit-graph/group-003.png 1200w" src="group-003.png"/> 
</figure>

<p>Of course, Reddit itself has better data for identifying relationships between subreddits, as they can track user activity more intimately. Meanwhile, the output for this post turned out better than expected and I hope to include similar visualizations in future blog posts. Hopefully, it dispelled some of the mystery behind pretty network graphs. (if you do use the code or data visualization designs from this post, it would be greatly appreciated if proper attribution is given back to this post and/or myself. Thanks!).</p>
<hr>
<p><em>As always, the full code used to process the edge list and generate the visualizations is available in <a href="https://github.com/minimaxir/reddit-graph/blob/master/subreddit_network_pdf.ipynb">this Jupyter notebook</a>, open-sourced <a href="https://github.com/minimaxir/reddit-graph">on GitHub</a>.</em></p>
<p><em>Additionally, thanks to Professor James P. Curley of Columbia University for providing <a href="http://curleylab.psych.columbia.edu/netviz/netviz1.html#/">helpful slides</a> which have good code samples for getting started with igraph/ggnetwork.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Mapping Where Arrests Frequently Occur in San Francisco Using Crime Data</title>
      <link>https://minimaxir.com/2015/12/sf-arrest-maps/</link>
      <pubDate>Mon, 07 Dec 2015 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2015/12/sf-arrest-maps/</guid>
      <description>Let&amp;rsquo;s plot 587,499 arrests on top of a map of San Francisco for fun and see what happens.</description>
      <content:encoded><![CDATA[<p>In my previous post, <a href="http://minimaxir.com/2015/12/sf-arrests/">Analyzing San Francisco Crime Data to Determine When Arrests Frequently Occur</a>, I found that SF Police arrests occur more frequently at some times than at others. By processing the <a href="https://data.sfgov.org/Public-Safety/SFPD-Incidents-from-1-January-2003/tmnf-yvry">SFPD Incidents dataset</a> from the <a href="https://data.sfgov.org">SF OpenData portal</a>, I found that arrests typically occur on Wednesdays at 4-5 PM, and that the type of crime affects when arrests happen (e.g. DUIs happen late Friday/Saturday night).</p>
<p>However, I could not understand <em>why</em> Wednesday/4-5PM is a peak time for arrests. In addition to analyzing <em>when</em> arrests occur, I also looked at <em>where</em> arrests occur. For example, perhaps more crime happens as people are leaving work; in that case, we would expect to see crimes downtown.</p>
<h2 id="making-a-map-of-sf-arrests">Making a Map of SF Arrests</h2>
<p>Continuing from the previous analysis, I have a data frame of all police arrests that have occurred in San Francisco from 2003 - 2015 (587,499 arrests total).</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrest-maps/arrests_hu18118600371790478763.webp 320w,https://minimaxir.com/2015/12/sf-arrest-maps/arrests_hu4177723299286775352.webp 768w,https://minimaxir.com/2015/12/sf-arrest-maps/arrests_hu599879548976239824.webp 1024w,https://minimaxir.com/2015/12/sf-arrest-maps/arrests.png 1936w" src="arrests.png"/> 
</figure>

<p>What is the most efficient way to make a map for the data? There are too many data points for rendering each point in a tool like <a href="http://www.tableau.com">Tableau</a> or <a href="https://www.google.com/maps">Google Maps</a>. I can use <code>ggplot2</code> again as I did <a href="http://minimaxir.com/2015/11/nyc-ggplot2-howto/">to make a map of New York City</a> manually, but as noted in that article, the abstract nature of the map may hide information.</p>
<p>Enter <code>ggmap</code>. ggmap, an <a href="https://github.com/dkahle/ggmap">R package by David Kahle</a>, is a tool that allows the user to retrieve a map image from a number of map data providers, and integrates seamlessly with ggplot2 for simple visualization creation. Kahle and Hadley Wickham (the creator of ggplot2) <a href="https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf">coauthored a paper</a> describing practical applications of ggmap.</p>
<p>I will include most of the map generation code in-line. <em>For more detailed code and output, a <a href="https://github.com/minimaxir/sf-arrests-when-where/blob/master/crime_data_sf.ipynb">Jupyter notebook</a> containing the code and visualizations used in this article is available open-source on GitHub.</em></p>
<p>By default, you can ask ggmap just for a location using <code>get_map()</code>, and it will give you an approximate map around that location. You can configure the zoom level on that point as well. Optionally, if you need precise bounds for the map, you can set the bounding box manually, and the <a href="http://boundingbox.klokantech.com">Bounding Box Tool</a> works extremely well for this purpose, with the CSV coordinate export already being in the correct format.</p>
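<p>As a small illustration of the bounding box route, the CSV string copied from the tool can be turned into a numeric vector in base R. ggmap reads a bounding box in left/bottom/right/top order, so the sketch below labels the values that way; the named vector is for readability, and the sanity check is my own addition.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"># CSV string as copied from a bounding box tool (west, south, east, north).
bbox_csv &lt;- &#34;-122.516441,37.702072,-122.37276,37.811818&#34;

# Split into numbers and label them in left/bottom/right/top order.
bbox &lt;- as.numeric(strsplit(bbox_csv, &#34;,&#34;)[[1]])
names(bbox) &lt;- c(&#34;left&#34;, &#34;bottom&#34;, &#34;right&#34;, &#34;top&#34;)

# Sanity checks: west of east, south of north.
stopifnot(bbox[&#34;left&#34;] &lt; bbox[&#34;right&#34;], bbox[&#34;bottom&#34;] &lt; bbox[&#34;top&#34;])
bbox
</code></pre></div>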
<p>ggmap allows maps from sources such as <a href="https://www.google.com/maps">Google Maps</a> and <a href="http://www.openstreetmap.org/#map=5/51.500/-0.100">OpenStreetMap</a>, and the maps can be themed, such as a color map or a black-and-white map. A black-and-white minimalistic map would be best for readability. A <a href="https://www.reddit.com/r/dataisbeautiful/comments/3ule41/mapping_restaurants_in_san_francisco_by_health/">Reddit submission by /u/all_genes_considered</a> used <a href="http://maps.stamen.com/#terrain/12/37.7706/-122.3782">Stamen maps</a> as a source with the toner-lite theme, and that worked well.</p>
<p>Since we&rsquo;ve identified the map parameters, now we can request an appropriate map:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">bbox</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">-122.516441</span><span class="p">,</span><span class="m">37.702072</span><span class="p">,</span><span class="m">-122.37276</span><span class="p">,</span><span class="m">37.811818</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">map</span> <span class="o">&lt;-</span> <span class="nf">get_map</span><span class="p">(</span><span class="n">location</span> <span class="o">=</span> <span class="n">bbox</span><span class="p">,</span> <span class="n">source</span> <span class="o">=</span> <span class="s">&#34;stamen&#34;</span><span class="p">,</span> <span class="n">maptype</span> <span class="o">=</span> <span class="s">&#34;toner-lite&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>Plotting the map by itself results in something like this:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-0_hu10072400600952302923.webp 320w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-0_hu14861928855447305861.webp 768w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-0.png 900w" src="sf-arrest-where-0.png"/> 
</figure>

<p>On the right track (aside from two Guerrero Streets), but obviously it&rsquo;ll need some aesthetic adjustments.</p>
<p>Let&rsquo;s plot all 587,499 arrests on top of the map for fun and see what happens.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggmap</span><span class="p">(</span><span class="n">map</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">            <span class="nf">geom_point</span><span class="p">(</span><span class="n">data</span> <span class="o">=</span> <span class="n">df_arrest</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">Y</span><span class="p">),</span> <span class="n">color</span> <span class="o">=</span> <span class="s">&#34;#27AE60&#34;</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="m">0.5</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="m">0.01</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">            <span class="nf">fte_theme</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">            <span class="nf">theme</span><span class="p">(</span><span class="kc">...</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">            <span class="nf">labs</span><span class="p">(</span><span class="n">title</span> <span class="o">=</span> <span class="s">&#34;Locations of Police Arrests Made in San Francisco from 2003 – 2015&#34;</span><span class="p">)</span>
</span></span></code></pre></div><ul>
<li><code>ggmap()</code> sets up the base map.</li>
<li><code>geom_point()</code> plots points. The data for plotting is specified at this point. &ldquo;color&rdquo; and &ldquo;size&rdquo; parameters do just that. An alpha of 0.01 causes each point to be 99% transparent; therefore, addresses with a lot of points will be more opaque.</li>
<li><code>fte_theme()</code> is my theme based on the FiveThirtyEight style.</li>
<li><code>theme()</code> is needed for a few additional theme tweaks to remove the axes/margins</li>
<li><code>labs()</code> is for labeling the plot (<em>always</em> label!)</li>
</ul>
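<p>To make the transparency point concrete, here is a rough sketch of how opacity builds up as points overlap. It assumes the graphics device blends each semi-transparent point with standard &ldquo;over&rdquo; alpha compositing, which is an assumption about the renderer and not something ggplot2 promises; <code>stacked_opacity()</code> is a hypothetical helper and not part of the plotting code.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"># Hypothetical helper: combined opacity of n overlapping points that share the
# same alpha, assuming standard &#34;over&#34; alpha compositing.
stacked_opacity &lt;- function(n, alpha = 0.01) {
    1 - (1 - alpha)^n
}

# Opacity for addresses with 1, 10, 100, and 1,000 arrests
stacked_opacity(c(1, 10, 100, 1000))
</code></pre></div><p>Under that assumption, an address needs roughly 70 overlapping arrests to reach 50% opacity, and anything past a few hundred arrests is nearly solid.</p>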
<p>Rendering the plot results in:</p>
<p><a href="http://i.imgur.com/Xu8wXzc.png" target=_blank><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-1_hu14106836243586229618.webp 320w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-1.png 570w" src="sf-arrest-where-1.png"/> 
</figure>
</a></p>
<p><em>NOTE: All maps in this article are embedded at a lower size to ensure that the article doesn&rsquo;t take days to load. To load a high-resolution version of any map, click on it and it will open in a new tab.</em></p>
<p><em>Additionally, since ggmap forces plots to a fixed ratio, this results in the &ldquo;random white space&rdquo; problem mentioned in the NYC article, for which I still have not found a solution, but have minimized the impact.</em></p>
<p>It&rsquo;s clear where arrests in the city occur: there is a large concentration in the Tenderloin and the Mission, along with clusters in Bayview and Fisherman&rsquo;s Wharf. However, point-stacking is not helpful when comparing high-density areas, so this visualization can be optimized.</p>
<p>How about faceting by type of crime again? We can render a map of San Francisco for each type of crime, and then we can see if the clusters for a given type of crime are different from others.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggmap</span><span class="p">(</span><span class="n">map</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">            <span class="nf">geom_point</span><span class="p">(</span><span class="n">data</span> <span class="o">=</span> <span class="n">df_arrest</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">Category</span> <span class="o">%in%</span> <span class="n">df_top_crimes</span><span class="o">$</span><span class="n">Category[2</span><span class="o">:</span><span class="m">19</span><span class="n">]</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">Y</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">Category</span><span class="p">),</span> <span class="n">size</span><span class="o">=</span><span class="m">0.75</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="m">0.05</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">            <span class="nf">fte_theme</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">            <span class="nf">theme</span><span class="p">(</span><span class="kc">...</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">            <span class="nf">labs</span><span class="p">(</span><span class="n">title</span> <span class="o">=</span> <span class="s">&#34;Locations of Police Arrests Made in San Francisco from 2003 – 2015, by Type of Crime&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">            <span class="nf">facet_wrap</span><span class="p">(</span><span class="o">~</span> <span class="n">Category</span><span class="p">,</span> <span class="n">nrow</span> <span class="o">=</span> <span class="m">3</span><span class="p">)</span>
</span></span></code></pre></div><p>Again, only one line of new code for the facet, although the source data needs to be filtered as it was in the previous post.</p>
<p>Running the code yields:</p>
<p><a href="http://i.imgur.com/6NgzV3k.jpg" target=_blank><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-2_hu6941835751255399820.webp 320w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-2_hu4652477988635860610.webp 768w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-2_hu10596737585120254249.webp 1024w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-2.png 1065w" src="sf-arrest-where-2.png"/> 
</figure>
</a></p>
<p>This is certainly more interesting (and pretty). Some crimes, such as Assaults, Drugs/Narcotics, and Warrants, occur all over the city. Other crimes, such as Disorderly Conduct and Robbery, primarily have clusters in the Tenderloin and in the Mission close to the 16th Street BART stop. (Prostitution notably has a cluster in the Mission and a cluster <em>above</em> the Tenderloin.)</p>
<p>Again, we can&rsquo;t compare high-density points, so now we should probably normalize the data by facet. One way to do this is to weight each point by the reciprocal of the number of points in the facet (e.g. if there are 5,000 Fraud arrests, assign a weight of 1/5000 to each Fraud arrest), and aggregate the sums of the weights in a geographical area.</p>
<p>We can reuse the normalization code from the previous post, and the hex overlay code from my NYC taxi plot post as well:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sum_thresh</span> <span class="o">&lt;-</span> <span class="kr">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">threshold</span> <span class="o">=</span> <span class="m">10</span><span class="n">^</span><span class="m">-3</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kr">if</span> <span class="p">(</span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">threshold</span><span class="p">)</span> <span class="p">{</span><span class="kr">return</span> <span class="p">(</span><span class="kc">NA</span><span class="p">)}</span>
</span></span><span class="line"><span class="cl">    <span class="kr">else</span> <span class="p">{</span><span class="kr">return</span> <span class="p">(</span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="p">))}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggmap</span><span class="p">(</span><span class="n">map</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">            <span class="nf">stat_summary_hex</span><span class="p">(</span><span class="n">data</span> <span class="o">=</span> <span class="n">df_arrest</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">Category</span> <span class="o">%in%</span> <span class="n">df_top_crimes</span><span class="o">$</span><span class="n">Category[2</span><span class="o">:</span><span class="m">19</span><span class="n">]</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">group_by</span><span class="p">(</span><span class="n">Category</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">w</span><span class="o">=</span><span class="m">1</span><span class="o">/</span><span class="nf">n</span><span class="p">()),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">Y</span><span class="p">,</span> <span class="n">z</span><span class="o">=</span><span class="n">w</span><span class="p">),</span> <span class="n">fun</span><span class="o">=</span><span class="n">sum_thresh</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="m">0.8</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">&#34;#CCCCCC&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">            <span class="nf">fte_theme</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">            <span class="nf">theme</span><span class="p">(</span><span class="kc">...</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">            <span class="nf">scale_fill_gradient</span><span class="p">(</span><span class="n">low</span> <span class="o">=</span> <span class="s">&#34;#DDDDDD&#34;</span><span class="p">,</span> <span class="n">high</span> <span class="o">=</span> <span class="s">&#34;#2980B9&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">            <span class="nf">labs</span><span class="p">(</span><span class="n">title</span> <span class="o">=</span> <span class="s">&#34;Locations of Police Arrests Made in San Francisco from 2003 – 2015, Normalized by Type of Crime&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">            <span class="nf">facet_wrap</span><span class="p">(</span><span class="o">~</span> <span class="n">Category</span><span class="p">,</span> <span class="n">nrow</span> <span class="o">=</span> <span class="m">3</span><span class="p">)</span>
</span></span></code></pre></div><ul>
<li><code>sum_thresh()</code> is a helper function that aggregates the sums of weights, but will not plot the corresponding hex if there is not enough data at that location.</li>
<li><code>scale_fill_gradient()</code> sets the gradient for the chart. If there are few arrests, the hex will be gray; if there are many arrests, it will be deep blue.</li>
</ul>
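<p>As a quick illustration of the threshold behavior, the sketch below calls <code>sum_thresh()</code> directly on made-up weights. The 1/5000 weight follows the Fraud example above; the number of arrests per hex is invented for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">sum_thresh &lt;- function(x, threshold = 10^-3) {
    if (sum(x) &lt; threshold) {return (NA)}
    else {return (sum(x))}
}

# Illustrative weights: each arrest in a 5,000-arrest category weighs 1/5000
sparse_hex &lt;- rep(1/5000, 2)    # 2 arrests in the hex: weights sum to 0.0004
dense_hex  &lt;- rep(1/5000, 50)   # 50 arrests in the hex: weights sum to 0.01

sum_thresh(sparse_hex)  # below the 0.001 threshold, so the result is NA
sum_thresh(dense_hex)   # above the threshold, so the summed weight is returned
</code></pre></div><p>With the default threshold of 0.001, a hex in this hypothetical category needs at least five arrests before it gets a value.</p>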
<p>Putting it all together:</p>
<p><a href="http://i.imgur.com/LXKPseq.jpg" target=_blank><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-3_hu7456771257946400118.webp 320w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-3_hu9154202637838410471.webp 768w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-3_hu16751458202771499645.webp 1024w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-3.png 1065w" src="sf-arrest-where-3.png"/> 
</figure>
</a></p>
<p>This confirms the interpretations mentioned above.</p>
<p>Since the code base is already created, it is very simple to facet on any variable. So why not create a faceted map for <em>every conceivable variable</em>?</p>
<p>How about checking arrest locations by Police Districts?</p>
<p><a href="http://i.imgur.com/i82wsIZ.png" target=_blank><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-4_hu6091410268985354255.webp 320w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-4_hu1518390982070671483.webp 768w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-4.png 975w" src="sf-arrest-where-4.png"/> 
</figure>
</a></p>
<p>The map shows that the hex plotting works correctly, at the least. Notably, the Central, Northern, and Southern Police Districts end up making a large proportion of their arrests near the Tenderloin/Market Street instead of anywhere else in their area of purview.</p>
<p>Is the location of arrests seasonal? Does it vary by the month the arrest occurred?</p>
<p><a href="http://i.imgur.com/u2eQMZf.png" target=_blank><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-5_hu6072912836200734980.webp 320w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-5_hu15520078146852942027.webp 768w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-5.png 975w" src="sf-arrest-where-5.png"/> 
</figure>
</a></p>
<p>Nope. Still Tenderloin and Mission.</p>
<p>Maybe the locations of arrests have changed over time, as legal policies changed. Let&rsquo;s facet by year.</p>
<p><a href="http://i.imgur.com/x4SRnkU.png" target=_blank><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-6_hu8171979095774921676.webp 320w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-6_hu12204436199112275202.webp 768w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-6.png 788w" src="sf-arrest-where-6.png"/> 
</figure>
</a></p>
<p>Here things are <em>slightly</em> different across the facets; the Tenderloin had a much higher concentration of arrests, peaking in 2009-2010, and the concentration of yearly arrests in the Tenderloin has decreased relative to everywhere else in the city.</p>
<p>Does the location of arrests vary by the time of day? As noted earlier, there could be more arrests downtown during working hours.</p>
<p><a href="http://i.imgur.com/cDKo8Lt.png" target=_blank><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-7_hu925150912043529827.webp 320w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-7_hu11229007422759645671.webp 768w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-7.png 788w" src="sf-arrest-where-7.png"/> 
</figure>
</a></p>
<p>Higher relative concentration in the Tenderloin/Mission during work hours, lower during the night.</p>
<p>Last try. Perhaps the day of week leads to different locations, especially as people tend to go out to bars all across the city.</p>
<p><a href="http://i.imgur.com/tNBRilL.png" target=_blank><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-8_hu8922690803486119596.webp 320w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-8_hu17195955955733530060.webp 768w,https://minimaxir.com/2015/12/sf-arrest-maps/sf-arrest-where-8.png 788w" src="sf-arrest-where-8.png"/> 
</figure>
</a></p>
<p><em>Zero</em> difference. Eh.</p>
<p>We did learn that there are certainly a lot of arrests in the Tenderloin and around the 16th Street/Mission BART stop, though. That doesn&rsquo;t necessarily mean there is more <em>crime</em> in those areas (correlation does not imply causation), but it is something worth noting when traveling around San Francisco.</p>
<h2 id="bonus-do-social-security-payments-lead-to-an-increase-in-arrests">Bonus: Do Social Security payments lead to an increase in arrests?</h2>
<p>In response to my previous article, <a href="https://www.reddit.com/r/sanfrancisco/comments/3vfgg2/analyzing_san_francisco_crime_data_to_determine/cxn29wd">Redditor /u/NowProveIt hypothesizes</a> that the spike in Wednesday arrests could be attributed to <a href="https://www.ssa.gov/kc/rp_paybenefits.htm">Social Security disability</a> (RSDI) payments. The <a href="https://www.socialsecurity.gov/pubs/EN-05-10031-2015.pdf">Social Security Benefit Payments schedule</a> is typically every second, third, and fourth Wednesday of a month.</p>
<p>Normally, you would expect the arrest behavior on each Wednesday in a given month to be independent of the others. Therefore, if the arrest behavior for the <em>first</em> Wednesday is different from that for the second/third/fourth Wednesday (presumably, the first Wednesday has fewer arrests overall), then we might have a lead.</p>
<p>Through more <code>dplyr</code> shenanigans, I am able to filter the dataset of arrests to Wednesday arrests only, and classify each Wednesday as the first, second, third, or fourth of the month. (There are occasionally fifth Wednesdays, but no one cares about those.)</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrest-maps/ordinal_hu1849703887847077295.webp 320w,https://minimaxir.com/2015/12/sf-arrest-maps/ordinal.png 406w" src="ordinal.png"/> 
</figure>
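<p>For illustration, here is a minimal sketch of one way an ordinal like the one in the table above could be computed. This is an assumption for the sake of example and not necessarily the code used in the notebook; <code>wednesday_ordinal()</code> and the sample dates are hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"># Hypothetical helper: which occurrence of its weekday a date is within its month
# (days 1-7 give 1, days 8-14 give 2, and so on).
wednesday_ordinal &lt;- function(dates) {
    day_of_month &lt;- as.integer(format(dates, &#34;%d&#34;))
    (day_of_month - 1) %/% 7 + 1
}

# Sample dates: the five Wednesdays of December 2015
dates &lt;- as.Date(c(&#34;2015-12-02&#34;, &#34;2015-12-09&#34;, &#34;2015-12-16&#34;, &#34;2015-12-23&#34;, &#34;2015-12-30&#34;))
wednesday_ordinal(dates)  # 1 2 3 4 5
</code></pre></div><p>Because the input is already filtered to Wednesdays, the day of the month alone is enough to determine the ordinal; rows with an ordinal of 5 can then be dropped.</p>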

<p>For each ordinal, we can plot a line chart of the number of arrests over the course of the day. We are looking to see if the first Wednesday has different behavior than the other Wednesdays.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrest-maps/ssi-crime-1_hu351666373830362631.webp 320w,https://minimaxir.com/2015/12/sf-arrest-maps/ssi-crime-1_hu5550050970829996543.webp 768w,https://minimaxir.com/2015/12/sf-arrest-maps/ssi-crime-1_hu14151437229357599572.webp 1024w,https://minimaxir.com/2015/12/sf-arrest-maps/ssi-crime-1.png 1500w" src="ssi-crime-1.png"/> 
</figure>

<p>&hellip;and it doesn&rsquo;t.</p>
<p>Looking at location data doesn&rsquo;t help either.</p>
<p><a href="http://i.imgur.com/wTDKjOm.png" target=_blank><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrest-maps/ssi-crime-2_hu1192086183471697157.webp 320w,https://minimaxir.com/2015/12/sf-arrest-maps/ssi-crime-2.png 413w" src="ssi-crime-2.png"/> 
</figure>
</a></p>
<p>Oh well, it was worth a shot.</p>
<p>As always, all the code and raw images are available <a href="https://github.com/minimaxir/sf-arrests-when-where">in the GitHub repository</a>. Not many more questions were answered by looking at the location data of San Francisco crimes. But that&rsquo;s OK. There are certainly other cool things to do with this data. Kaggle, for instance, is creating <a href="https://www.kaggle.com/c/sf-crime/scripts">a repository of scripts</a> which play around with the Crime Incident dataset.</p>
<p>But for now, at least I made a few pretty charts out of it.</p>
<hr>
<p><em>If you use the code or data visualization designs contained within this article, it would be greatly appreciated if proper attribution is given back to this article and/or myself. Thanks!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Analyzing San Francisco Crime Data to Determine When Arrests Frequently Occur</title>
      <link>https://minimaxir.com/2015/12/sf-arrests/</link>
      <pubDate>Fri, 04 Dec 2015 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2015/12/sf-arrests/</guid>
      <description>Spoilers: Most arrests in San Francisco happen Wednesdays at 4-5 PM. For some reason.</description>
      <content:encoded><![CDATA[<p>The <a href="https://data.sfgov.org">SF OpenData portal</a> is a good source for detailed statistics about San Francisco. One of the most popular datasets on the portal is the <a href="https://data.sfgov.org/Public-Safety/SFPD-Incidents-from-1-January-2003/tmnf-yvry">SFPD Incidents dataset</a>, which contains a tabular list of 1,842,050 reports (at time of writing) from 2003 to present.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrests/incident-data_hu6678631468181363976.webp 320w,https://minimaxir.com/2015/12/sf-arrests/incident-data_hu17399309134632357460.webp 768w,https://minimaxir.com/2015/12/sf-arrests/incident-data_hu4743379933112491716.webp 1024w,https://minimaxir.com/2015/12/sf-arrests/incident-data.png 1794w" src="incident-data.png"/> 
</figure>

<p>The data can be exported into a 377.9 MB CSV; not large enough to be considered &ldquo;big data,&rdquo; but still too heavy for programs like Excel to process efficiently. Let&rsquo;s take a look at the data using <a href="https://www.r-project.org">R</a> and see if there&rsquo;s anything interesting.</p>
<h2 id="processing-the-data">Processing the Data</h2>
<p>For this article, I&rsquo;m going to do something different and illustrate the data processing step-by-step, both as a teaching tool, and to show that I am not using vague methodology to generate a narratively-convenient conclusion. <em>For more detailed code and output, a <a href="https://github.com/minimaxir/sf-arrests-when-where/blob/master/crime_data_sf.ipynb">Jupyter notebook</a> containing the code and visualizations used in this article is available open-source on GitHub.</em></p>
<p>Loading a 1.8 million-row file into R can take a while, even on modern computers with an SSD. Enter <code>readr</code>, <a href="https://github.com/hadley/readr">another R package</a> by <code>ggplot2</code> author Hadley Wickham, which grants access to a <code>read_csv()</code> function that has nearly 10x the speed of the base <code>read.csv()</code> R function, with more sensible defaults too.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">path</span> <span class="o">&lt;-</span> <span class="s">&#34;~/Downloads/SFPD_Incidents_-_from_1_January_2003.csv&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">df</span> <span class="o">&lt;-</span> <span class="nf">read_csv</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</span></span></code></pre></div><p>In memory, the data set is 180.9 MB, and removing a few useless columns (e.g. IncidentNum) further reduces the size to 126.9 MB. Since there are many redundancies in the row data (e.g. only 10 distinct PdDistrict values), R can perform memory optimizations.</p>
<p>You may have noticed in the first article image that the text data in some of the columns is in ALL CAPS, which would look ugly if the text was used in a data visualization. We can create a helper function to convert a column of text values into proper case through the use of <a href="http://stackoverflow.com/questions/15776732/how-to-convert-a-vector-of-strings-to-title-case">regular expression shenanigans</a>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">proper_case</span> <span class="o">&lt;-</span> <span class="kr">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kr">return</span> <span class="p">(</span><span class="nf">gsub</span><span class="p">(</span><span class="s">&#34;\\b([A-Z])([A-Z]+)&#34;</span><span class="p">,</span> <span class="s">&#34;\\U\\1\\L\\2&#34;</span> <span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">perl</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Now we can do more thorough processing using <code>dplyr</code>, <a href="https://github.com/hadley/dplyr"><em>another</em> Hadley Wickham R package</a>. dplyr is a utility that makes R easier to use: it provides a new syntax that allows data manipulation with intuitive function names, the functions can be chained using the <code>%&gt;%</code> operator for efficiency, and all data processing is <em>significantly</em> faster due to a C++ code base. (Fun fact: before the release of dplyr, I intended to quit using R for data analysis in favor of Python. Base R syntax is <em>that</em> difficult to use.)</p>
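<p>Before applying the <code>proper_case()</code> helper to whole columns, here is a small check of what it does to individual values. The sample strings are only assumed to resemble the ALL CAPS values in the raw data.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">proper_case &lt;- function(x) {
    return (gsub(&#34;\\b([A-Z])([A-Z]+)&#34;, &#34;\\U\\1\\L\\2&#34; , x, perl=TRUE))
}

# Sample strings in the ALL CAPS style of the raw data (chosen for illustration)
proper_case(c(&#34;VEHICLE THEFT&#34;, &#34;LARCENY/THEFT&#34;))
# &#34;Vehicle Theft&#34; &#34;Larceny/Theft&#34;
</code></pre></div><p>The pattern requires at least two capital letters per word, so single-letter words are left untouched, and because <code>gsub()</code> is vectorized, the same call works on an entire column.</p>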
<p>In dplyr, <code>mutate</code> allows the creation and transformation of columns. We will transform the text columns by running them through the <code>proper_case</code> function defined earlier:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">df</span> <span class="o">&lt;-</span> <span class="n">df</span> <span class="o">%&gt;%</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">Category</span> <span class="o">=</span> <span class="nf">proper_case</span><span class="p">(</span><span class="n">Category</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                 <span class="n">Descript</span> <span class="o">=</span> <span class="nf">proper_case</span><span class="p">(</span><span class="n">Descript</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                 <span class="n">PdDistrict</span> <span class="o">=</span> <span class="nf">proper_case</span><span class="p">(</span><span class="n">PdDistrict</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                 <span class="n">Resolution</span> <span class="o">=</span> <span class="nf">proper_case</span><span class="p">(</span><span class="n">Resolution</span><span class="p">))</span>
</span></span></code></pre></div><p>After all that, the data looks like:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrests/processed-table_hu7350244622855921552.webp 320w,https://minimaxir.com/2015/12/sf-arrests/processed-table_hu12883713792875989979.webp 768w,https://minimaxir.com/2015/12/sf-arrests/processed-table_hu18360537561968014976.webp 1024w,https://minimaxir.com/2015/12/sf-arrests/processed-table.png 1522w" src="processed-table.png"/> 
</figure>

<p>Much better!</p>
<p>However, many of the records have a &ldquo;None&rdquo; value for Resolution. This implies that the police appeared at the incident but took no action, which isn&rsquo;t that helpful for analysis. How about we look at incidents which resulted in an arrest?</p>
<p>dplyr&rsquo;s <code>filter</code> command does that, and we can use <code>grepl()</code> to do a text search for each Resolution value for the presence of &ldquo;Arrest&rdquo;.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">df_arrest</span> <span class="o">&lt;-</span> <span class="n">df</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="nf">grepl</span><span class="p">(</span><span class="s">&#34;Arrest&#34;</span><span class="p">,</span> <span class="n">Resolution</span><span class="p">))</span>
</span></span></code></pre></div><p>That&rsquo;s it! There are 587,499 arrests total in the dataset.</p>
<h2 id="arrests-over-time">Arrests Over Time</h2>
<p>One of the simplest data visualizations is a line chart, and it&rsquo;s a good starting point to use for analyzing arrests. Has the number of daily arrests been changing over time? dplyr and ggplot2 make this very easy to visualize in R.</p>
<p>First, the Date column must be formatted as a Date internally in R instead of text. Then we <code>group_by</code> the Date and use <code>summarize</code> to perform an aggregate on each group; in this case, counting how many entries are in the group (<code>n()</code> is a convenient shortcut). We can also ensure that the dates are in ascending order.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">df_arrest_daily</span> <span class="o">&lt;-</span> <span class="n">df_arrest</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">mutate</span><span class="p">(</span><span class="n">Date</span> <span class="o">=</span> <span class="nf">as.Date</span><span class="p">(</span><span class="n">Date</span><span class="p">,</span> <span class="s">&#34;%m/%d/%Y&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">group_by</span><span class="p">(</span><span class="n">Date</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">summarize</span><span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="nf">n</span><span class="p">())</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">arrange</span><span class="p">(</span><span class="n">Date</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrests/date-table.png 162w" src="date-table.png"/> 
</figure>

<p>Nifty! However, keep in mind that there are thousands of days in this dataset.</p>
<p>Now we can make a pretty line chart in ggplot2. Here&rsquo;s the code, and I will explain what everything does afterward:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_arrest_daily</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">Date</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">count</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">geom_line</span><span class="p">(</span><span class="n">color</span> <span class="o">=</span> <span class="s">&#34;#F2CA27&#34;</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="m">0.1</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">geom_smooth</span><span class="p">(</span><span class="n">color</span> <span class="o">=</span> <span class="s">&#34;#1A1A1A&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">fte_theme</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">scale_x_date</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="nf">date_breaks</span><span class="p">(</span><span class="s">&#34;2 years&#34;</span><span class="p">),</span> <span class="n">labels</span> <span class="o">=</span> <span class="nf">date_format</span><span class="p">(</span><span class="s">&#34;%Y&#34;</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">labs</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="s">&#34;Date of Arrest&#34;</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s">&#34;# of Police Arrests&#34;</span><span class="p">,</span> <span class="n">title</span> <span class="o">=</span> <span class="s">&#34;Daily Police Arrests in San Francisco from 2003 – 2015&#34;</span><span class="p">)</span>
</span></span></code></pre></div><ul>
<li><code>ggplot()</code> sets up the base chart and axes.</li>
<li><code>geom_line()</code> creates the line for the line chart. The &ldquo;color&rdquo; and &ldquo;size&rdquo; parameters set the line&rsquo;s color and thickness.</li>
<li><code>geom_smooth()</code> adds a smoothing spline on top of the chart to serve as a trendline, which is helpful since there are a lot of points.</li>
<li><code>fte_theme()</code> is my theme based on the FiveThirtyEight style.</li>
<li><code>scale_x_date()</code> explicitly sets the x-axis to scale with date values. It also has a few extremely useful formatting parameters: &ldquo;breaks&rdquo; lets you set the chart breaks in plain English, and &ldquo;labels&rdquo; lets you format the dates at those breaks; in this case, there are breaks every 2 years, and only the year is displayed for minimalism.</li>
<li><code>labs()</code> is a quick shortcut for labeling your axes and plot (<em>always</em> label!)</li>
</ul>
<p>Running the code and saving the output results in this image:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-1_hu3654474446870102994.webp 320w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-1_hu17611426655894899589.webp 768w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-1_hu16325201971485510571.webp 1024w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-1.png 1200w" src="sf-arrest-when-1.png"/> 
</figure>

<p>The line chart has high variation due to the number of points (in retrospect, a 30-day moving average of arrests would work better visually). As the trendline indicates, the trend is actually <em>multimodal</em>, with daily arrest peaks in 2009 and 2014. Definitely interesting. The number of arrests appears to be on a downward trend since then.</p>
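<p>For reference, here is a minimal sketch of how such a 30-day moving average could be computed. It is not the code used for the chart above: the toy data frame <code>df_daily</code> and the column name <code>avg_30</code> are made up for illustration, and a trailing (rather than centered) window is an assumption.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R">library(dplyr)

# Toy stand-in for df_arrest_daily: one row per day, counts are made up
set.seed(42)
df_daily &lt;- data.frame(
  Date  = seq(as.Date(&#34;2003-01-01&#34;), by = &#34;day&#34;, length.out = 90),
  count = rpois(90, lambda = 100)
)

# Trailing 30-day mean: each value averages the current day and the 29 days before it.
# stats::filter() is spelled out because dplyr masks filter().
df_daily &lt;- df_daily %&gt;%
  arrange(Date) %&gt;%
  mutate(avg_30 = as.numeric(stats::filter(count, rep(1 / 30, 30), sides = 1)))
</code></pre></div><p>The first 29 rows of <code>avg_30</code> are <code>NA</code> because the window is not yet full; mapping <code>y = avg_30</code> in <code>aes()</code> would then plot the smoothed series instead of the raw daily counts.</p>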
<p>The next step is to look into possible answers for the day-by-day variation.</p>
<h2 id="when-do-arrests-happen">When Do Arrests Happen?</h2>
<p>One of my go-to data visualizations is a heat map of times of week; in this case, we can find the day-of-week and time-of-day when the most arrests occur in San Francisco, and compare that with other time slots at a glance.</p>
<p>This requires the Hour and Day-of-Week to be present in separate columns: we have a DayOfWeek column already, but we need to parse the Hour component out of the HH:MM values in the Time column.</p>
<p>This requires another helper function, which uses <code>strsplit()</code> to split a single time value into Hour and Minute components, takes the first value (Hour), and converts that value to a number (instead of text). For example, a &ldquo;09:40&rdquo; input returns 9.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">get_hour</span> <span class="o">&lt;-</span> <span class="kr">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kr">return</span> <span class="p">(</span><span class="nf">as.numeric</span><span class="p">(</span><span class="nf">strsplit</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="s">&#34;:&#34;</span><span class="p">)</span><span class="n">[[1]][1]</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
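</span></span></code></pre></div><p>Unfortunately, this will not work for an entire column: the <code>[[1]]</code> keeps only the split of the first value, so the function returns a single hour no matter how many times are passed in. Using <code>sapply()</code> applies the function to each element in a column individually, which gives one hour per row.</p>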
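<p>A quick illustration of the difference, using three made-up time values. The <code>substr()</code> line is a vectorized alternative offered as an aside; it assumes every value is zero-padded HH:MM.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R">get_hour &lt;- function(x) {
    return (as.numeric(strsplit(x, &#34;:&#34;)[[1]][1]))
}

times &lt;- c(&#34;09:40&#34;, &#34;23:15&#34;, &#34;00:05&#34;)

# Called on the whole vector, only the first element is parsed
first_only &lt;- get_hour(times)

# sapply() calls get_hour() once per element, giving one hour per value
per_element &lt;- sapply(times, get_hour)

# Vectorized alternative (assumes zero-padded HH:MM strings)
via_substr &lt;- as.numeric(substr(times, 1, 2))
</code></pre></div><p>Here <code>first_only</code> is a single number, while <code>per_element</code> and <code>via_substr</code> each hold three hours. Note that <code>sapply()</code> on a character vector returns a named vector, with the original strings as names.</p>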
<p>The goal is to count how many Arrests occur for a given day-of-week and hour combination. In dplyr, we <code>group_by</code> both &ldquo;DayOfWeek&rdquo; and &ldquo;Hour&rdquo;, and then use <code>summarize</code> again.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">df_arrest_time</span> <span class="o">&lt;-</span> <span class="n">df_arrest</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">mutate</span><span class="p">(</span><span class="n">Hour</span> <span class="o">=</span> <span class="nf">sapply</span><span class="p">(</span><span class="n">Time</span><span class="p">,</span> <span class="n">get_hour</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">group_by</span><span class="p">(</span><span class="n">DayOfWeek</span><span class="p">,</span> <span class="n">Hour</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">summarize</span><span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="nf">n</span><span class="p">())</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrests/dow-table.png 211w" src="dow-table.png"/> 
</figure>

<p>A few more tweaks are done (off camera) to convert the Hours to representations like &ldquo;12 PM&rdquo; and get everything in the correct order.</p>
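<p>Those tweaks are not reproduced here, but one plausible way to do them is sketched below. Everything in it is an assumption rather than the code actually used: the <code>hour_label()</code> helper is hypothetical, the toy data is made up, and the reversed weekday order (so Monday ends up at the top of the y-axis) is a guess at the intended layout.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"># Hypothetical helper: turns 0 into &#34;12 AM&#34;, 9 into &#34;9 AM&#34;, 12 into &#34;12 PM&#34;, 23 into &#34;11 PM&#34;
hour_label &lt;- function(h) {
    h12 &lt;- ifelse(h %% 12 == 0, 12, h %% 12)
    paste(h12, ifelse(h &lt; 12, &#34;AM&#34;, &#34;PM&#34;))
}

# Made-up rows standing in for df_arrest_time
df_toy &lt;- data.frame(
    DayOfWeek = c(&#34;Monday&#34;, &#34;Wednesday&#34;, &#34;Sunday&#34;),
    Hour = c(0, 16, 12)
)

# Factor levels control the order in which ggplot2 draws discrete axis values
weekdays_ordered &lt;- c(&#34;Monday&#34;, &#34;Tuesday&#34;, &#34;Wednesday&#34;, &#34;Thursday&#34;,
                      &#34;Friday&#34;, &#34;Saturday&#34;, &#34;Sunday&#34;)
df_toy$Hour &lt;- factor(hour_label(df_toy$Hour), levels = hour_label(0:23))
df_toy$DayOfWeek &lt;- factor(df_toy$DayOfWeek, levels = rev(weekdays_ordered))
</code></pre></div><p>Without explicit factor levels, the hour labels would sort alphabetically (&ldquo;1 AM&rdquo;, &ldquo;1 PM&rdquo;, &ldquo;10 AM&rdquo;, and so on), which is why the ordering step matters.</p>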
<p>Now, it&rsquo;s time to make the heatmap using <code>ggplot2</code>. Here&rsquo;s the code, and I will explain what the new functions do:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_arrest_time</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">Hour</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">DayOfWeek</span><span class="p">,</span> <span class="n">fill</span> <span class="o">=</span> <span class="n">count</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">geom_tile</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">fte_theme</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">theme</span><span class="p">(</span><span class="kc">...</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">labs</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="s">&#34;Hour of Arrest (Local Time)&#34;</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s">&#34;Day of Week of Arrest&#34;</span><span class="p">,</span> <span class="n">title</span> <span class="o">=</span> <span class="s">&#34;# of Police Arrests in San Francisco from 2003 – 2015, by Time of Arrest&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">scale_fill_gradient</span><span class="p">(</span><span class="n">low</span> <span class="o">=</span> <span class="s">&#34;white&#34;</span><span class="p">,</span> <span class="n">high</span> <span class="o">=</span> <span class="s">&#34;#27AE60&#34;</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span>
</span></span></code></pre></div><ul>
<li><code>geom_tile()</code> creates tiles (instead of lines).</li>
<li><code>theme()</code> is needed for a few additional theme tweaks to get the gradient bar to render (tweaks not shown).</li>
<li><code>scale_fill_gradient()</code> tells the tiles to fill on a gradient, from white as the lowest value to a green as the highest value. The &ldquo;labels = comma&rdquo; parameter (using <code>comma</code> from the scales package) is a helpful hidden tip that formats the values in the legend with thousands separators.</li>
</ul>
<p>Putting it all together:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-2_hu10564400753457300353.webp 320w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-2_hu15126039314468368210.webp 768w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-2_hu15004922971988667904.webp 1024w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-2.png 1800w" src="sf-arrest-when-2.png"/> 
</figure>

<p>The heatmap is an intuitive result. Arrests don&rsquo;t happen in the early morning, and arrests tend to be elevated Friday and Saturday night, when everyone is out on the town.</p>
<p>However, the peak arrest time is apparently on Wednesdays at 4-5 PM. Wednesdays and the 4-5 PM timeslot in general have elevated arrest frequency, too. Why is that the case?</p>
<p>This requires further analysis.</p>
<h2 id="facets-of-arrest">Facets of Arrest</h2>
<p>Perhaps the odd results can be explained by another lurking variable. Logically, certain types of crime, such as DUIs, should happen primarily at night. ggplot2 has a tool known as faceting that makes such analysis easy by rendering a separate chart for each distinct value of another variable. In this case, with only <em>one</em> line of ggplot2 code, we can plot a heatmap for <em>each</em> of the top types of arrests, and see if there is any significant variation in the heatmap.</p>
<p>First, quickly use dplyr to aggregate and sort the categories of arrest by number of occurrences:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">df_top_crimes</span> <span class="o">&lt;-</span> <span class="n">df_arrest</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">group_by</span><span class="p">(</span><span class="n">Category</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">summarize</span><span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="nf">n</span><span class="p">())</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">count</span><span class="p">))</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrests/top-crimes.png 272w" src="top-crimes.png"/> 
</figure>

<p>&ldquo;Other Offenses&rdquo; is a catch-all, so we will ignore that. Filter on the top 18 types of crime excluding Other Offenses and aggregate as usual.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">df_arrest_time_crime</span> <span class="o">&lt;-</span> <span class="n">df_arrest</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">filter</span><span class="p">(</span><span class="n">Category</span> <span class="o">%in%</span> <span class="n">df_top_crimes</span><span class="o">$</span><span class="n">Category[2</span><span class="o">:</span><span class="m">19</span><span class="n">]</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">mutate</span><span class="p">(</span><span class="n">Hour</span> <span class="o">=</span> <span class="nf">sapply</span><span class="p">(</span><span class="n">Time</span><span class="p">,</span> <span class="n">get_hour</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">group_by</span><span class="p">(</span><span class="n">Category</span><span class="p">,</span> <span class="n">DayOfWeek</span><span class="p">,</span> <span class="n">Hour</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">summarize</span><span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="nf">n</span><span class="p">())</span>
</span></span></code></pre></div><p>Time for the heat map! The <code>ggplot</code> code is nearly identical to the previous heatmap code, except we add <code>facet_wrap()</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_arrest_time_crime</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">Hour</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">DayOfWeek</span><span class="p">,</span> <span class="n">fill</span> <span class="o">=</span> <span class="n">count</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">geom_tile</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">fte_theme</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">theme</span><span class="p">(</span><span class="kc">...</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">labs</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="s">&#34;Hour of Arrest (Local Time)&#34;</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s">&#34;Day of Week of Arrest&#34;</span><span class="p">,</span> <span class="n">title</span> <span class="o">=</span> <span class="s">&#34;# of Police Arrests in San Francisco from 2003 – 2015, by Category and Time of Arrest&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">scale_fill_gradient</span><span class="p">(</span><span class="n">low</span> <span class="o">=</span> <span class="s">&#34;white&#34;</span><span class="p">,</span> <span class="n">high</span> <span class="o">=</span> <span class="s">&#34;#2980B9&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">facet_wrap</span><span class="p">(</span><span class="o">~</span> <span class="n">Category</span><span class="p">,</span> <span class="n">nrow</span> <span class="o">=</span> <span class="m">6</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-3_hu3393399497914154644.webp 320w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-3_hu4460630850231079869.webp 768w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-3_hu1600981870581629953.webp 1024w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-3.png 1800w" src="sf-arrest-when-3.png"/> 
</figure>

<p>Easy visualization to make, but it&rsquo;s not fully correct. There can only be one scale for the whole visualization, which is why the categories with lots of arrests appear colored and others do not (however, it shows that Drugs/Narcotics arrests are a large contributor to the Wednesday emphasis of the data). We need to normalize the counts by facet. dplyr has a nice trick for normalization: group by the normalization variable (Category), then mutate to add a column based on the aggregate for each unique value.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">df_arrest_time_crime</span> <span class="o">&lt;-</span> <span class="n">df_arrest_time_crime</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                            <span class="nf">group_by</span><span class="p">(</span><span class="n">Category</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                            <span class="nf">mutate</span><span class="p">(</span><span class="n">norm</span> <span class="o">=</span> <span class="n">count</span><span class="o">/</span><span class="nf">sum</span><span class="p">(</span><span class="n">count</span><span class="p">))</span>
</span></span></code></pre></div><figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrests/crime-norm_hu17610382876934806107.webp 320w,https://minimaxir.com/2015/12/sf-arrests/crime-norm.png 374w" src="crime-norm.png"/> 
</figure>
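<p>To see why this puts every facet on a comparable scale, here is a toy sketch with two made-up categories whose volumes differ by a factor of 100. The data frame <code>df_toy</code> and its values are invented for illustration, and the author&rsquo;s <code>fte_theme()</code> and theme tweaks are left out.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R">library(dplyr)
library(ggplot2)

# Made-up counts: category A has 100x the volume of category B
df_toy &lt;- data.frame(
    Category  = rep(c(&#34;A&#34;, &#34;B&#34;), each = 4),
    DayOfWeek = rep(c(&#34;Monday&#34;, &#34;Tuesday&#34;), times = 4),
    Hour      = rep(c(9, 9, 17, 17), times = 2),
    count     = c(100, 300, 200, 400, 1, 3, 2, 4)
)

# Each count becomes its share of the category total, so every facet sums to 1
df_toy &lt;- df_toy %&gt;%
    group_by(Category) %&gt;%
    mutate(norm = count / sum(count)) %&gt;%
    ungroup()

# Same heatmap recipe as before, but filling on the normalized share
plot &lt;- ggplot(df_toy, aes(x = factor(Hour), y = DayOfWeek, fill = norm)) +
    geom_tile() +
    scale_fill_gradient(low = &#34;white&#34;, high = &#34;#2980B9&#34;) +
    facet_wrap(~ Category)
</code></pre></div><p>Both categories end up with identical <code>norm</code> values (0.1, 0.3, 0.2, 0.4), so their facets are shaded the same even though the raw counts are wildly different.</p>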

<p>Setting the &ldquo;fill&rdquo; to &ldquo;norm&rdquo; and rerunning the heatmap code yields:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-4_hu15926472780241731893.webp 320w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-4_hu126985993692781798.webp 768w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-4_hu5168466869331596364.webp 1024w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-4.png 1800w" src="sf-arrest-when-4.png"/> 
</figure>

<p>Now things get interesting.</p>
<p>Prostitution has the most notably unique behavior, with high concentrations of arrests at night on weekdays. Drunkenness and DUIs have high concentrations at night on weekends. And Disorderly Conduct has a high concentration of arrests at 5 AM on weekdays? That&rsquo;s not intuitive.</p>
<p>Notably, some offenses have relatively random times of arrests, such as Stolen Property and Vehicle Theft.</p>
<p>However, this doesn&rsquo;t help explain why arrests tend to happen Wednesdays/4-5PM. Maybe faceting by another variable will provide more information.</p>
<p>Perhaps Police district? Maybe some PDs in San Francisco are more zealous than others. Since we created a code workflow earlier, we can apply it to any other variable very easily; in this case, it&rsquo;s mostly just replacing instances of &ldquo;Category&rdquo; with &ldquo;PdDistrict.&rdquo;</p>
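<p>One way to make that swap explicit is to wrap the aggregate-then-normalize steps in a function that takes the faceting column as a string. This is only a sketch: <code>count_by_time()</code> is a hypothetical helper, the toy rows are made up, the data is assumed to already have an <code>Hour</code> column, and <code>across()</code> with <code>.groups</code> requires dplyr 1.0 or later.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R">library(dplyr)

# Hypothetical helper: count rows per facet/day/hour, then normalize within each facet
count_by_time &lt;- function(df, facet_col) {
    df %&gt;%
        group_by(across(all_of(c(facet_col, &#34;DayOfWeek&#34;, &#34;Hour&#34;)))) %&gt;%
        summarize(count = n(), .groups = &#34;drop&#34;) %&gt;%
        group_by(across(all_of(facet_col))) %&gt;%
        mutate(norm = count / sum(count)) %&gt;%
        ungroup()
}

# Made-up rows standing in for df_arrest
df_toy &lt;- data.frame(
    PdDistrict = c(&#34;CENTRAL&#34;, &#34;CENTRAL&#34;, &#34;CENTRAL&#34;, &#34;MISSION&#34;),
    DayOfWeek  = c(&#34;Friday&#34;, &#34;Friday&#34;, &#34;Saturday&#34;, &#34;Friday&#34;),
    Hour       = c(23, 23, 1, 16)
)

df_toy_district &lt;- count_by_time(df_toy, &#34;PdDistrict&#34;)
</code></pre></div><p>The same call with <code>&#34;Category&#34;</code> would reproduce the earlier aggregation; the only other change is the variable named in <code>facet_wrap()</code>.</p>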
<p>Doing so yields this heatmap:</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-5_hu15170436718480429699.webp 320w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-5_hu14218803337538104956.webp 768w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-5_hu17003730383693044022.webp 1024w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-5.png 1800w" src="sf-arrest-when-5.png"/> 
</figure>

<p>Which isn&rsquo;t helpful. The charts are mostly identical to each other, and to the original heatmap. (Central Station (<a href="http://www.sf-police.org/Modules/ShowDocument.aspx?documentID=27554">coverage map</a>), however, has activity correlated to Drunkenness arrests.)</p>
<p>Perhaps the frequency of arrests is correlated to the time of year? How about faceting by month?</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-6_hu14468941755787881798.webp 320w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-6_hu4400784780178534062.webp 768w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-6_hu11340823854500891215.webp 1024w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-6.png 1800w" src="sf-arrest-when-6.png"/> 
</figure>

<p>Nope. Zero difference.</p>
<p>Last try. As shown in the line chart, the # of Arrests has oscillated over the years. Perhaps there&rsquo;s a specific year that&rsquo;s skewing the results. Let&rsquo;s facet by Year.</p>
<figure>

    

    <img loading="lazy" srcset="https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-7_hu11901066775171177880.webp 320w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-7_hu9726493787191630017.webp 768w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-7_hu1915616666993260300.webp 1024w,https://minimaxir.com/2015/12/sf-arrests/sf-arrest-when-7.png 1800w" src="sf-arrest-when-7.png"/> 
</figure>

<p>Nope^2. 2010-2012 have elevated Wednesday activity, but not by much.</p>
<p>This is frustrating. As of this posting, I don&rsquo;t have an obvious answer for the elevated arrests Wednesdays at 4-5PM. That being said, there definitely is still more to learn from looking at SF Crime data, although that&rsquo;s enough analysis for the time being.</p>
<p><a href="http://minimaxir.com/2015/12/sf-arrest-maps/">My next article</a> discusses how to plot arrests on a map using the <code>ggmap</code> R library, which hopefully will provide more answers. The <a href="https://github.com/minimaxir/sf-arrests-when-where">GitHub repository</a> contains a Jupyter notebook with code and visualizations for both this article and the upcoming ggmap visualizations (if you want a sneak peek), which will show <em>where</em> arrests in San Francisco frequently occur.</p>
<hr>
<p><em>If you use the code or data visualization designs contained within this article, it would be greatly appreciated if proper attribution is given back to this article and/or myself. Thanks!</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
