<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Jupyter on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/tag/jupyter/</link>
    <description>Recent content in Jupyter on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Wed, 06 Apr 2016 08:00:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/tag/jupyter/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>The Importance of Sanity-Checking Datasets Before Analysis</title>
      <link>https://minimaxir.com/2016/04/trust-but-verify/</link>
      <pubDate>Wed, 06 Apr 2016 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2016/04/trust-but-verify/</guid>
      <description>The 1972 TV Special &amp;lsquo;The Lorax&amp;rsquo; is the best movie ever, earning $1.2 billion?</description>
      <content:encoded><![CDATA[<p>I&rsquo;ve done some cool things with movie data using a dataset from <a href="http://www.omdbapi.com">OMDb API</a>, which is sourced from <a href="http://www.imdb.com">IMDb</a> and <a href="http://www.rottentomatoes.com">Rotten Tomatoes</a> data. In my <a href="http://minimaxir.com/2016/01/movie-revenue-ratings/">previous article</a> on the dataset, I plotted the relationship between the domestic box office revenue of movies and their Rotten Tomatoes scores.</p>
<p>I want to take another look at domestic Box Office Revenues with aggregate statistics such as means/medians on categorical variables such as MPAA rating and release month. For this type of analysis in particular, I&rsquo;ll also need to implement code in <a href="https://www.r-project.org">R</a> for inflation adjustment.</p>
<p>However, I ran into a few unexpectedly silly issues.</p>
<h2 id="seeing-double">Seeing Double</h2>
<p>There are many similarities between data validation and the Quality Assurance process of product development, which is why this particular area appeals to me personally as a Software QA Engineer. Whenever a cool dataset is released publicly, I play around with it to look for any obvious flaws and to get a good all-around benchmark on the robustness of the data (this is a separate procedure from the traditional &ldquo;data cleaning&rdquo; phase necessary to begin quantification on some poorly-structured datasets).</p>
<p>Do the extreme values in the data make sense? Is the data encoded in a sane format? Are there any obvious gaps or logical contradictions in summary representations of the data, especially when compared to other canonical sources?</p>
<p>These concerns are also some of the reasons I&rsquo;ve switched to the <a href="http://jupyter.org">Jupyter Notebook</a> as my primary data science IDE. After each block of code which transforms data, I can print the data frame inline to immediately see the results of the code execution, and refer back to them if anything odd happens in the future.</p>
<p>Let&rsquo;s say I have a data frame of Movies using the latest data dump (3/26/16) from OMDb. This data set contains 1,160,273 movies, including both IMDb and Rotten Tomatoes data. After cleaning the data (not shown), I can use the R package <code>dplyr</code> by Hadley Wickham to sort the data frame by Box Office Revenue descending, and print the <code>head</code> (top) of the data.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">df</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="n">imdbID</span><span class="p">,</span> <span class="n">Title</span><span class="p">,</span> <span class="n">Year</span><span class="p">,</span> <span class="n">BoxOffice</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">BoxOffice</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">(</span><span class="m">25</span><span class="p">),</span> <span class="n">n</span> <span class="o">=</span> <span class="m">25</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2016/04/trust-but-verify/data-2_hu_fc7149d7b4ad38a9.webp 320w,/2016/04/trust-but-verify/data-2_hu_e103dfde4f2240f3.webp 768w,/2016/04/trust-but-verify/data-2_hu_cbb23b4322bee2d7.webp 1024w,/2016/04/trust-but-verify/data-2.png 1258w" src="data-2.png"/> 
</figure>

<p>Those movies being the best <em>makes sense</em>. For <a href="http://www.rottentomatoes.com/m/star_wars_episode_vii_the_force_awakens/">Star Wars: The Force Awakens</a>, I can compare it to the Box Office reported on the corresponding Rotten Tomatoes page, which in turn matches the <a href="http://www.boxofficemojo.com/movies/?id=starwars7.htm">domestic Box Office Revenue</a> on <a href="http://www.boxofficemojo.com">Box Office Mojo</a>.</p>
<p>But wait, <a href="https://en.wikipedia.org/wiki/The_Dark_Knight_%28film%29">The Dark Knight</a> appears <em>twice</em>? How?!</p>
<p>There&rsquo;s no way I would have missed something this obvious during the sanity check for my previous article. To make sure I wasn&rsquo;t going insane, I double-checked the December 2015 data dump I used for that post, derived the top movies using the same methodology as for the new dump, and confirmed that the duplicate movies <em>were not present</em>. Weird.</p>
<p>There are two different IMDb IDs for The Dark Knight, and for some other movies near the top (<a href="http://www.imdb.com/title/tt4817264/">Inside Out</a>, &ldquo;<a href="http://www.imdb.com/title/tt3138972/">The Gravity</a>&rdquo;). Fortunately, duplicate data like this is easy to debug. The second entry for The Dark Knight has a greater IMDb ID (1774602), which means it was likely added to the site later. Let&rsquo;s look up the <a href="http://www.imdb.com/title/tt1774602/">corresponding IMDb page</a>:</p>
<figure>

    <img loading="lazy" srcset="/2016/04/trust-but-verify/dark-knight_hu_a2dd88a3ae15f413.webp 320w,/2016/04/trust-but-verify/dark-knight_hu_1518ed909d29f88e.webp 768w,/2016/04/trust-but-verify/dark-knight_hu_e8a475182d872549.webp 1024w,/2016/04/trust-but-verify/dark-knight.png 1128w" src="dark-knight.png"/> 
</figure>

<p>Huh. Apparently someone created a filler movie entry with the same name and release year as a blockbuster, in the hope that people would find it by accident (and since it received 50 ratings with an average score of 8.6, the tactic worked).</p>
<p>Using the Rotten Tomatoes <a href="http://developer.rottentomatoes.com/docs/read/json/v10/Movie_Alias">IMDb Lookup API</a>, we find that &ldquo;The Dark Knight&rdquo; page on Rotten Tomatoes&hellip;<a href="http://api.rottentomatoes.com/api/public/v1.0/movie_alias.json?type=imdb&amp;id=1774602">doesn&rsquo;t exist</a>.</p>
<p>We can safely deduplicate by removing entries that share the same title (excluding a leading &ldquo;The&rdquo; if present) and release year.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_dup</span> <span class="o">&lt;-</span> <span class="n">df</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="n">Title</span><span class="p">,</span> <span class="n">Year</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">Title</span> <span class="o">=</span> <span class="nf">gsub</span><span class="p">(</span><span class="s">&#34;The &#34;</span><span class="p">,</span> <span class="s">&#34;&#34;</span><span class="p">,</span> <span class="n">Title</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="n">dup</span> <span class="o">&lt;-</span> <span class="nf">duplicated</span><span class="p">(</span><span class="n">df_dup</span><span class="p">)</span>   <span class="c1"># find entry indices which are duplicates</span>
</span></span><span class="line"><span class="cl"><span class="nf">rm</span><span class="p">(</span><span class="n">df_dup</span><span class="p">)</span>   <span class="c1"># remove temp dataframe</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">df_dedup</span> <span class="o">&lt;-</span> <span class="n">df</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="o">!</span><span class="n">dup</span><span class="p">)</span>   <span class="c1"># keep entries which are *not* dups</span>
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">df_dedup</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="n">imdbID</span><span class="p">,</span> <span class="n">Title</span><span class="p">,</span> <span class="n">Year</span><span class="p">,</span> <span class="n">BoxOffice</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">BoxOffice</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">(</span><span class="m">25</span><span class="p">),</span> <span class="n">n</span> <span class="o">=</span> <span class="m">25</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2016/04/trust-but-verify/data-1_hu_8b5a2ca66b9bcf38.webp 320w,/2016/04/trust-but-verify/data-1_hu_a71793e3bcc29bd2.webp 768w,/2016/04/trust-but-verify/data-1_hu_e83f2d858764f621.webp 1024w,/2016/04/trust-but-verify/data-1.png 1224w" src="data-1.png"/> 
</figure>

<p>There we go! The deduplicated dataset has 1,114,431 movies, implying that there were 45,842 duplicate entries.</p>
<p>I&rsquo;m not sure <em>whose</em> fault it is that duplicate movies suddenly appeared in the data dump: OMDb or Rotten Tomatoes. <em>But it doesn&rsquo;t matter</em>: the erroneous entries still need to be addressed, and it&rsquo;s good to have a test case for the future too.</p>
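<p>As a minimal sketch of the deduplication logic on a toy data frame (base R only; the rows below are illustrative, not from the actual dump), note that <code>duplicated()</code> marks every occurrence after the first, so filtering on its negation keeps the first entry, which has the lower IMDb ID:</p>

```r
# Toy stand-in for the movie data frame (illustrative rows only)
df <- data.frame(
  Title = c("The Dark Knight", "Dark Knight", "Inside Out", "Inside Out"),
  Year  = c(2008, 2008, 2015, 2015),
  stringsAsFactors = FALSE
)

# Normalize titles by stripping a leading "The " before comparing
norm <- data.frame(Title = sub("^The ", "", df$Title), Year = df$Year)

# duplicated() returns FALSE for the first occurrence, TRUE afterward
dup <- duplicated(norm)

# Keep only the rows that are *not* duplicates
df_dedup <- df[!dup, ]
print(df_dedup)  # two rows remain: "The Dark Knight" and "Inside Out"
```

<p>Anchoring the pattern as <code>^The </code> only strips a <em>leading</em> &ldquo;The&rdquo;, which is slightly safer than removing the substring anywhere in the title.</p>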
<h2 id="inflation-station">Inflation Station</h2>
<p>A <a href="http://stackoverflow.com/a/26068058">Stack Overflow answer</a> from <a href="http://stackoverflow.com/users/1048757/brash-equilibrium">Ben Hanowell</a> has a good R implementation and rationale for implementing inflation adjustment using the <a href="https://research.stlouisfed.org/fred2/data/CPIAUCSL.txt">historical Consumer Price Index data</a> from the <a href="https://www.stlouisfed.org">Federal Reserve Bank of St. Louis</a>.</p>
<p>Take the index for each year (averaging each month for simplicity) and create an adjustment factor to convert historical dollar amounts into present-day dollar amounts. Much better than plugging hundreds of thousands of values into an online calculator. Here&rsquo;s the SO code made <code>dplyr</code>-friendly for this purpose, with the requisite sanity-checks.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">inflation</span> <span class="o">&lt;-</span> <span class="nf">read_csv</span><span class="p">(</span><span class="s">&#34;http://research.stlouisfed.org/fred2/data/CPIAUCSL.csv&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">group_by</span><span class="p">(</span><span class="n">Year</span> <span class="o">=</span> <span class="nf">as.integer</span><span class="p">(</span><span class="nf">substr</span><span class="p">(</span><span class="n">DATE</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">4</span><span class="p">)))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">summarize</span><span class="p">(</span><span class="n">Avg_Value</span> <span class="o">=</span> <span class="nf">mean</span><span class="p">(</span><span class="n">VALUE</span><span class="p">))</span> <span class="o">%&gt;%</span>   <span class="c1"># average across all months</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">mutate</span><span class="p">(</span><span class="n">Adjust</span> <span class="o">=</span> <span class="nf">tail</span><span class="p">(</span><span class="n">Avg_Value</span><span class="p">,</span> <span class="m">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">Avg_Value</span><span class="p">)</span>   <span class="c1"># normalize by most-recent year</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">inflation</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">inflation</span> <span class="o">%&gt;%</span> <span class="nf">tail</span><span class="p">())</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2016/04/trust-but-verify/inf.png 290w" src="inf.png"/> 
</figure>

<p>For example, to get the inflation-adjusted Box Office Revenue for a movie released in 1949 in 2016 dollars, we multiply the reported revenue by 10. That sounds about right (and matches closely enough to the output of the <a href="http://data.bls.gov/cgi-bin/cpicalc.pl?cost1=1&amp;year1=1949&amp;year2=2016">Bureau of Labor Statistics inflation calculator</a>).</p>
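<p>As a worked sketch of that arithmetic with made-up CPI values (the real series comes from the FRED CSV above; the numbers below are illustrative, chosen so the 1949 factor comes out to exactly 10):</p>

```r
# Hypothetical annual-average CPI values, NOT real FRED data
cpi <- data.frame(
  Year      = c(1949, 2016),
  Avg_Value = c(23.8, 238.0)
)

# Adjustment factor: the most-recent year's index divided by each year's index
cpi$Adjust <- tail(cpi$Avg_Value, 1) / cpi$Avg_Value

# A $1,000,000 box office in 1949 becomes $10,000,000 in 2016 dollars
adjusted <- 1e6 * cpi$Adjust[cpi$Year == 1949]
```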
<p>Now map each inflation adjustment factor to each movie by merging the two datasets (on the <code>Year</code> column), then multiply the Box Office revenue by the adjustment factor to get the inflation-adjusted revenue. Plus another sanity-check for good measure.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_dedup_join</span> <span class="o">&lt;-</span> <span class="n">df_dedup</span> <span class="o">%&gt;%</span> <span class="nf">inner_join</span><span class="p">(</span><span class="n">inflation</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">AdjBoxOffice</span> <span class="o">=</span> <span class="n">BoxOffice</span> <span class="o">*</span> <span class="n">Adjust</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">df_dedup_join</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="n">Title</span><span class="p">,</span> <span class="n">Year</span><span class="p">,</span> <span class="n">AdjBoxOffice</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">AdjBoxOffice</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">(</span><span class="m">25</span><span class="p">),</span> <span class="n">n</span><span class="o">=</span><span class="m">25</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2016/04/trust-but-verify/data-3_hu_9bedc8e778de7ad8.webp 320w,/2016/04/trust-but-verify/data-3_hu_7c39435bd36c198e.webp 768w,/2016/04/trust-but-verify/data-3_hu_b47d674b228181e4.webp 1024w,/2016/04/trust-but-verify/data-3.png 1070w" src="data-3.png"/> 
</figure>

<p>Uh-oh.</p>
<p>I mean, <a href="https://en.wikipedia.org/wiki/The_Lorax_%28TV_special%29">The Lorax</a> probably earned $1.2 billion in VHS sales for Earth Day education <em>alone</em>, but the TV special was never released in theaters. There was a <a href="https://en.wikipedia.org/wiki/The_Lorax_%28film%29">CGI remake of The Lorax</a> a few years ago which was reasonably popular. Could it be that someone at Rotten Tomatoes or Box Office Mojo confused the two media?</p>
<p>That is exactly what happened. On Rotten Tomatoes, the <a href="http://www.rottentomatoes.com/m/the-lorax/">1972 Lorax</a> was encoded with a box office revenue similar to that of the <a href="http://www.rottentomatoes.com/m/the_lorax/">2012 Lorax</a>; the inflation factor then sextupled it. For this type of data fidelity issue, it&rsquo;s considerably more obvious who&rsquo;s at fault.</p>
<p>Unfortunately, that&rsquo;s not the end of the problems with the dataset. I compared my results with <a href="http://www.vox.com/2016/4/4/11351788/batman-v-superman-terrible-reviews#undefined">Vox&rsquo;s dataset</a> on worldwide historical box office revenues. The Top 200 movies by inflation-adjusted revenue omit notable historical films such as <a href="http://www.rottentomatoes.com/m/jaws/">Jaws</a> and <a href="http://www.rottentomatoes.com/m/star_wars/">Star Wars: A New Hope</a>. It turns out Rotten Tomatoes has no Box Office Revenue data for these movies at all.</p>
<p>That is a more serious problem, and I&rsquo;ll have to think about whether it blocks analysis of aggregate box office data entirely. In the end, sanity-checking third-party data is important because you never know <em>how</em> the data will surprise you until it&rsquo;s too late.</p>
<hr>
<p><em>You can view the Top 200 movies by domestic box office revenue for the 12/15 source dataset, the 3/16 dataset, the 3/16 deduped dataset, and the 3/16 deduped inflation-adjusted dataset <a href="https://github.com/minimaxir/movie-data-sanity-checking">in this GitHub repository</a>, along with the Jupyter notebook.</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
