<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Media Analysis on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/category/media-analysis/</link>
    <description>Recent content in Media Analysis on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Thu, 07 Jan 2016 08:30:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/category/media-analysis/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Movie Review Aggregator Ratings Have No Relationship with Box Office Success</title>
      <link>https://minimaxir.com/2016/01/movie-revenue-ratings/</link>
      <pubDate>Thu, 07 Jan 2016 08:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2016/01/movie-revenue-ratings/</guid>
      <description>Perhaps the movie rating system itself is broken.</description>
      <content:encoded><![CDATA[<p><a href="http://www.rottentomatoes.com">Rotten Tomatoes</a> has become synonymous with movie quality in recent years. The Rotten Tomatoes Tomatometer aggregates all reviews written by movie critics for a given movie on the internet, determines whether each reviewer rates the movie as &ldquo;Fresh&rdquo; or &ldquo;Rotten&rdquo; and calculates an average. If the proportion of Fresh reviews for a given movie is greater than or equal to 60%, the movie itself is considered &ldquo;Fresh&rdquo; and receives a special icon.</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/examples.png 266w" src="examples.png"/> 
</figure>

<p>A top movie like Christopher Nolan&rsquo;s <a href="http://www.rottentomatoes.com/m/the_dark_knight/">The Dark Knight</a> received a 94% Rotten Tomatoes rating and generated $533.3 million in domestic box office revenue. But other movies, like Michael Bay&rsquo;s <a href="http://www.rottentomatoes.com/m/transformers_revenge_of_the_fallen/">Transformers: Revenge of the Fallen</a>, received a 19% Tomatometer rating, yet still generated $402.1 million in domestic box office revenue.</p>
<p>How strong is the relationship between Tomatometer scores and box office success, anyway? Or are there other, better metrics? Time to make some pretty charts.</p>
<p>I obtained a large amount of movie data from the <a href="http://www.omdbapi.com">OMDb API</a>, which provides easy access to movie metadata from IMDb and Rotten Tomatoes. This data contains Rotten Tomatoes Tomatometer scores, Rotten Tomatoes Audience Scores, IMDb User Rankings, and Metacritic Scores. If you want to know how I processed the data in R and plotted the charts using ggplot2, I have <a href="https://www.youtube.com/watch?v=F5Hjlkxw_2A">prepared a screencast</a> for your viewing pleasure.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/F5Hjlkxw_2A?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>For this analysis, we will be looking at the <a href="http://www.r-statistics.com/2013/05/log-transformations-for-skewed-and-wide-distributions-from-practical-data-science-with-r/">log-transformation</a> of domestic box office revenue, since the values are skewed by mega-blockbusters like the ones mentioned previously. Revenues are not inflation-adjusted: the rating data is only present for recent years, and because of the log-transformation, inflation correction would not impact this particular analysis much.</p>
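<p>As a rough illustration of why the log-transformation helps, here is a minimal sketch in R. It uses the two box office figures quoted above plus a hypothetical small release; the variable names, the $50,000 figure, and the choice of base 10 are assumptions for illustration, not details taken from the original analysis.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"># Domestic grosses in dollars: the two films above plus a hypothetical small release
box_office &lt;- c(dark_knight = 533.3e6, transformers_2 = 402.1e6, small_release = 5e4)

# On the raw scale the blockbusters dwarf everything else
range(box_office)

# log10 compresses the scale so that each step of 1 is a tenfold change in revenue
log_box_office &lt;- log10(box_office)
round(log_box_office, 2)
</code></pre></div>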
<h2 id="rotten-tomatoes-tomatometer">Rotten Tomatoes Tomatometer</h2>
<p>After processing, I have a data subset of 4,863 movies with both Tomatometer and Box Office Gross values. Let&rsquo;s plot all those movies on a scatterplot of log(BoxOffice) vs. Meter with each point having a slight transparency; that way, clusters of points will become apparent where the areas are darker on the chart.</p>
<p>We expect a positive linear relationship: movies with high Tomatometer scores should have high box office revenue, and conversely movies with low scores should have low box office revenue.</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-1_hu_96d5e65a38238ebb.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-1.png 600w" src="box-office-rating-1.png"/> 
</figure>

<p>Wait, why does the trendline have a <em>negative</em> slope?</p>
<p>The <a href="https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient">Pearson correlation</a> between the Tomatometer scores and log(BoxOffice) is <strong>-0.18</strong>, implying a weak <em>negative</em> linear relationship between the two variables. Not what I expected.</p>
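<p>For reference, a correlation like this can be computed with base R&rsquo;s <code>cor</code>. The sketch below runs on a handful of made-up rows, so the data frame, its column names, and the resulting value are purely illustrative and say nothing about the real dataset.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"># Hypothetical toy data: Tomatometer score (0-100) and domestic gross in dollars
movies &lt;- data.frame(
  meter = c(94, 19, 85, 40, 10, 72),
  box_office = c(533.3e6, 402.1e6, 2.0e5, 3.5e7, 6.0e7, 9.0e5)
)

# Pearson correlation between the score and the log-transformed revenue
r &lt;- cor(movies$meter, log10(movies$box_office), method = &#34;pearson&#34;)
r
</code></pre></div>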
<p>There do appear to be clusters in the data. There is a group of points between $10M and $100M revenue and 0% to 20% Tomatometer rating. Another group is present between $1,000 and $1M revenue and 80% to 100% RT rating. Both of these areas are outside of a linear relationship: perhaps these clusters are skewing trends too?</p>
<p>Let&rsquo;s try another visualization of the data using <a href="https://en.wikipedia.org/wiki/Contour_line">contour maps</a>, which allow the data to become 3D, so-to-speak. Using a 2D <a href="https://en.wikipedia.org/wiki/Kernel_density_estimation">kernel density estimator</a>, we can identify and color areas on the plot according to the number of points present in that area; the greater the color saturation, the more points present in the given area.</p>
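<p>To make the idea concrete, here is a minimal sketch of a 2D kernel density estimate. It uses <code>kde2d</code> from the MASS package on simulated points with base R plotting, which is a swapped-in illustration rather than the ggplot2 code behind the chart; the variable names and the simulated values are assumptions.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R">set.seed(42)
# Simulated stand-ins for Tomatometer score and log10(BoxOffice); not real movie data
meter &lt;- runif(500, min = 0, max = 100)
log_box_office &lt;- rnorm(500, mean = 6, sd = 1.5)

# Estimate the density on a 50 x 50 grid; dens$z holds the estimated density per cell
dens &lt;- MASS::kde2d(meter, log_box_office, n = 50)

# Shade each cell by density, then overlay contour lines
image(dens, xlab = &#34;Tomatometer&#34;, ylab = &#34;log10(BoxOffice)&#34;)
contour(dens, add = TRUE)
</code></pre></div>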
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-2_hu_760d8dc1d3815e51.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-2.png 600w" src="box-office-rating-2.png"/> 
</figure>

<p>The two clusters mentioned previously are now much more apparent. It appears there are two distinct sets of movies: blockbusters which critics hate, and limited-appeal films which critics love. Incidentally, there is no discernible difference between movies which are Fresh (&gt;=60%) and Rotten.</p>
<h2 id="metacritic">Metacritic</h2>
<p>The <a href="http://www.metacritic.com">Metacritic</a> score is also <a href="http://www.metacritic.com/about-metascores">derived from review data</a> by critics; however, instead of calculating a binary review sentiment and calculating a proportion from that sentiment, Metacritic gives a quantification from 0 to 100 to each critic review and averages them together.</p>
<p>Does that change the results for 4,479 movies?</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-7_hu_fef2b0f07f0269fe.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-7.png 600w" src="box-office-rating-7.png"/> 
</figure>

<p>Correlation between Metacritic score and log(BoxOffice) is <strong>-0.13</strong>, which puts the analysis in a similar state as the Rotten Tomatoes data. However, the blockbuster cluster has shifted right, and the lesser-appeal cluster has shifted left.</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-8_hu_db41f472024f23b6.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-8.png 600w" src="box-office-rating-8.png"/> 
</figure>

<p>Clusters are much closer together.</p>
<p>Perhaps a review metric by non-critics will tell a different story.</p>
<h2 id="rotten-tomatoes-audience-score">Rotten Tomatoes Audience Score</h2>
<p>The Audience Score is calculated in a similar way to the Rotten Tomatoes Tomatometer score: users of the site rate a movie from 0 to 5 stars in half-star increments (i.e. effectively a scale from 0-10) and the proportion of reviews with 3.5 star ratings or higher becomes the Audience Score.</p>
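<p>The calculation described above can be sketched in one line of R. The ten ratings below are hypothetical, chosen only to show the arithmetic.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"># Hypothetical star ratings from ten users, in half-star increments
ratings &lt;- c(5, 4.5, 4, 3.5, 3.5, 3, 2.5, 2, 1, 0.5)

# Share of ratings at 3.5 stars or higher, expressed as a percentage
audience_score &lt;- 100 * mean(ratings &gt;= 3.5)
audience_score  # 50: five of the ten ratings are 3.5 or higher
</code></pre></div>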
<p>This also presents a cognitive bias in ratings: the <a href="http://tvtropes.org/pmwiki/pmwiki.php/Main/FourPointScale">Four Point Scale</a>, where having a discrete form of ranking may cause people to tend to rate toward the top of the scale and make the entire metric skewed or misleading.</p>
<p>How does the Audience Score compare for 5,163 movies? After all, the audience is the group of people who determine how much money a movie makes at the Box Office.</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-3_hu_1f2e7fff936a2fa7.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-3.png 600w" src="box-office-rating-3.png"/> 
</figure>

<p>Correlation between the Audience score and log(BoxOffice) is <strong>0.05</strong>, which is a positive linear correlation, but representative of barely any practical correlation.</p>
<p>Speaking of the Four Point Scale, notice how, like with Metacritic score, there are barely any movies between 0% and 20% Audience Score. Is there really a skew? Let&rsquo;s look at the contours:</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-4_hu_1baf301a632b3684.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-4.png 600w" src="box-office-rating-4.png"/> 
</figure>

<p>The locations of the clusters are much different from those of the Tomatometer clusters. Both clusters are closer together, with the blockbuster cluster between 50% and 60% audience score and the lesser-appeal cluster between 70% and 80%. Hence, the low correlation.</p>
<h2 id="imdb">IMDb</h2>
<p><a href="http://www.imdb.com">IMDb</a> works <a href="http://www.imdb.com/help/show_leaf?votestopfaq">almost the same way</a> as Metacritic, but for non-critics: ratings from IMDb users between 1-10 (note that 0 is missing!) are averaged to get a final score.</p>
<p>How do 5,167 movies fare?</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-5_hu_1f38b551534a465.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-5.png 600w" src="box-office-rating-5.png"/> 
</figure>

<p><strong>What?!</strong></p>
<p>Both point groupings sit at the <em>same</em> rating positions, and the correlation between IMDb ratings and log(BoxOffice) is <strong>0.00</strong>. Yes, there&rsquo;s <em>zero</em> correlation!</p>
<p>Checking the contour map confirms it:</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-6_hu_ba1aa72c3457ed1d.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-6.png 600w" src="box-office-rating-6.png"/> 
</figure>

<p>That is <em>literally</em> a Four Point Scale between 5 and 8!</p>
<p>The Rotten Tomatoes metric is the only metric that actually <em>uses</em> the entire rating scale. None of the other potential metrics provide more insight into a potential reason for high box-office revenue. Perhaps the movie rating system itself is broken.</p>
<p>That&rsquo;s not to say that movies need high box-office revenues to be considered successful. However, working with movie profitability, and by extension movie budget, is opening another can-of-worms with respect to data integrity. (That said, on Reddit, /u/chartmkr recently <a href="https://www.reddit.com/r/dataisbeautiful/comments/3zpp3w/movie_budgets_and_box_office_success_19552015_oc/">posted a visualization</a> of Gross vs. Budget which is interesting.)</p>
<p>It&rsquo;ll still be fun to point to a Rotten Tomatoes Tomatometer rating as a kneejerk reaction to whether a movie rocks/sucks. Although, the reasons for movie financial success at the box office definitely warrant further investigation.</p>
<p><strong>UPDATE 1/11/16</strong>: In a <a href="https://news.ycombinator.com/item?id=10872076">discussion on Hacker News</a>, it was suggested that the blockbuster movies and the indie movies cancel each other out, i.e. blockbusters have a positive correlation and indies have a negative correlation.</p>
<p>For the blockbuster cluster alone, the log-correlation is <strong>0.23</strong> (not weak but not great positive correlation). For the indie cluster alone, the log-correlation is <strong>-0.12</strong> (same as original analysis).</p>
<p>For future analysis, it may be worthwhile to split these two clusters. I stand by the original analysis for this post: very frequently I&rsquo;ve heard the question &ldquo;is this a good movie?&rdquo; and the response is &ldquo;what does the RT score say?&rdquo; Both Box Office revenues and RT scores are important measures of quality (depending on perspective), and users who want to see or purchase a movie may not necessarily care if it&rsquo;s indie or a blockbuster.</p>
<p>User cwyers <a href="https://news.ycombinator.com/item?id=10878019">suggested</a> that Simpson&rsquo;s Paradox may be in play since the number of theaters showing a movie is positively correlated to box office revenue, adding a potentially-confounding effect. I will see if I can obtain that data for future analysis.</p>
<hr>
<p><em>You can access the open-sourced Jupyter notebook and high-resolution charts from this article in <a href="https://github.com/minimaxir/movie-revenue-ratings">this GitHub repository</a>. If you use the code or data visualization designs contained within this article, it would be greatly appreciated if proper attribution is given back to this article and/or myself. Thanks!</em></p>
<p><em>Unfortunately, I cannot redistribute the data itself due to licensing concerns.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Let&#39;s Code an Analysis and Visualizations of Yelp Data using R and ggplot2</title>
      <link>https://minimaxir.com/2015/12/lets-code-1/</link>
      <pubDate>Mon, 28 Dec 2015 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2015/12/lets-code-1/</guid>
      <description>The first of (hopefully) many 1440p/60fps videos of fun data analysis. With many fun errors too!</description>
      <content:encoded><![CDATA[<p>One of the reasons I have open-sourced the code for my complicated data visualizations is transparency for the creation process. 2015 was a <a href="http://qz.com/580859/the-most-misleading-charts-of-2015-fixed/">year of misleading and incorrect data visualizations</a>, and I don&rsquo;t want to help contribute to the misconception that data can be used for trickery. &ldquo;Big data&rdquo; in particular is an area where the steps to reproduce results are rarely released publicly in a step-by-step manner, often in an attempt to make the resulting analysis unimpeachable.</p>
<p>It&rsquo;s time to take things to the next level of transparency by recording <a href="https://en.wikipedia.org/wiki/Screencast">screencasts</a> of my data analysis and visualizations.</p>
<p>Last week, ggplot2 author Hadley Wickham released <a href="http://blog.rstudio.org/2015/12/21/ggplot2-2-0-0/">a surprise update</a> for my favorite R package, bumping the version to 2.0.0. Why not celebrate by playing around with ggplot2 and making some pretty charts?</p>
<h2 id="lets-code">Let&rsquo;s Code!</h2>
<p>I have recorded a screencast of myself coding in R to play around with data from <a href="http://www.yelp.com/dataset_challenge">Yelp Dataset Challenge</a> and <a href="https://www.youtube.com/watch?v=Emt9bn0D5ZI">uploaded it to YouTube</a>. Additionally, the video can be played at an unusually high quality for screencasting: 1440p on supported browsers, at 60 frames per second.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/Emt9bn0D5ZI?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>This particular screencast is also my first significant attempt at working with audio/video editing and voice-over. Feel free to provide suggestions for future videos.</p>
<p>Since the screencast is 40 minutes long (inadvertently!), I&rsquo;ve written an abridged summary of the screencast, along with some clarification of points made.</p>
<h2 id="yelp-data-v2">Yelp Data v2</h2>
<p>A year ago I made a <a href="http://minimaxir.com/2014/09/one-star-five-stars/">blog post analyzing the same Yelp data</a>. Now that the data set contains 1.6 million reviews (as opposed to just 1.1 million back then), it might be interesting to look at it again to see if anything has changed. The data is formatted as by-line JSON: I wrote a pair of Python scripts to convert it to CSV for easy import into R.</p>
<p>The screencast centers on three R packages: readr, dplyr, and ggplot2 (all authored by Hadley Wickham).</p>
<p>Loading the dataset into R is easy and fast with <code>read_csv</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">df_reviews</span> <span class="o">&lt;-</span> <span class="nf">read_csv</span><span class="p">(</span><span class="s">&#34;yelp_reviews.csv&#34;</span><span class="p">)</span>
</span></span></code></pre></div><p>Since dplyr was loaded beforehand, read_csv loads the data into a tbl_df instead of a normal data.frame. When you call a normal data.frame by itself, <em>all data is printed to console</em>, which is a problem when you have 1.6M rows (yes, that happened during a test recording). Calling a tbl_df results in a very descriptive overview of the data:</p>
<figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/overview_hu_85b6185f7555f24.webp 320w,/2015/12/lets-code-1/overview.png 621w" src="overview.png"/> 
</figure>

<p>Most columns are self-explanatory. <code>review_length</code> is the approximate number of words in the review, <code>pos_words</code> is the number of positive words in the review, <code>neg_words</code> is what you expect, and <code>net_sentiment</code> is pos_words - neg_words.</p>
<p>A quick way to analyze the distribution of numerical data is to perform a summary on the data frame, which returns a by-column <a href="https://en.wikipedia.org/wiki/Five-number_summary">five-number summary</a> + mean:</p>
<figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/summary_hu_dd5f6188a271b608.webp 320w,/2015/12/lets-code-1/summary.png 591w" src="summary.png"/> 
</figure>

<p>Ratings are biased toward 4 and 5 star reviews. There is a lot of skew for review length.</p>
<p>dplyr makes it easy to add columns in-line with the <code>mutate</code> command. Let&rsquo;s normalize the pos_words column:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">df_reviews</span> <span class="o">&lt;-</span> <span class="n">df_reviews</span> <span class="o">%&gt;%</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">pos_norm</span> <span class="o">=</span> <span class="n">pos_words</span> <span class="o">/</span> <span class="n">review_length</span><span class="p">)</span>
</span></span></code></pre></div><p>And we could do similar steps for the neg_words column too. Or use mutate to transform the data of an existing column.</p>
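<p>A minimal sketch of that similar step, assuming a tiny made-up stand-in for <code>df_reviews</code>; the column name <code>neg_norm</code> is hypothetical and simply mirrors <code>pos_norm</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R">library(dplyr)

# Tiny stand-in for df_reviews with only the columns used here (values are made up)
df_reviews &lt;- data.frame(
  review_length = c(100, 50, 200),
  pos_words = c(5, 1, 10),
  neg_words = c(2, 3, 0)
)

# neg_norm is a hypothetical name: negative words per word of review
df_reviews &lt;- df_reviews %&gt;%
  mutate(neg_norm = neg_words / review_length)
df_reviews
</code></pre></div>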
<p>Onto ggplot2. If you want a quick histogram of univariate data, qplot does just that. Let&rsquo;s visualize the distribution of stars.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="nf">qplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_reviews</span><span class="p">,</span> <span class="n">stars</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/lc1-qplot-1_hu_32eb366ec56acb6e.webp 320w,/2015/12/lets-code-1/lc1-qplot-1_hu_9dbb338fd8854f15.webp 768w,/2015/12/lets-code-1/lc1-qplot-1_hu_f7a292f1b760131d.webp 1024w,/2015/12/lets-code-1/lc1-qplot-1.png 1200w" src="lc1-qplot-1.png"/> 
</figure>

<p>Definitely a skew toward 4 and 5 star reviews.</p>
<p>We can do that for other variables too, like review length.</p>
<figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/lc1-qplot-2_hu_3a5122830bc13338.webp 320w,/2015/12/lets-code-1/lc1-qplot-2_hu_776cbdb873449086.webp 768w,/2015/12/lets-code-1/lc1-qplot-2_hu_78be4a8b2f917b20.webp 1024w,/2015/12/lets-code-1/lc1-qplot-2.png 1200w" src="lc1-qplot-2.png"/> 
</figure>

<p>What about bivariate data? If you give two variables to qplot, it will create a scatter plot. Perhaps there is a relationship between the number of stars and the number of positive words?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="nf">qplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_reviews</span><span class="p">,</span> <span class="n">stars</span><span class="p">,</span> <span class="n">pos_words</span><span class="p">)</span>
</span></span></code></pre></div><p>&hellip;and then we run into a problem. In this case, ggplot2 has to plot 1.6M points to screen, which can take a while, especially if you are simultaneously using your GPU for video recording. Eventually, we get this:</p>
<figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/lc1-qplot-3_hu_19dd08857796a6.webp 320w,/2015/12/lets-code-1/lc1-qplot-3_hu_d02b93fcb19c69fd.webp 768w,/2015/12/lets-code-1/lc1-qplot-3_hu_747c255748879122.webp 1024w,/2015/12/lets-code-1/lc1-qplot-3.png 1200w" src="lc1-qplot-3.png"/> 
</figure>

<p>At first glance, there appears to be a positive correlation between star rating and number of positive words, but that&rsquo;s misleading: since we don&rsquo;t have alpha transparency on the points, the density is ambiguous. (fixing it requires working outside of a qplot).</p>
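<p>One way to add that transparency is to build the plot with <code>ggplot</code> and <code>geom_point</code> directly. The sketch below is an assumption about how that could look, not the code from the screencast; it runs on a small simulated data frame (<code>df_sample</code> is a hypothetical name), and the alpha value of 0.05 is arbitrary.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R">library(ggplot2)

set.seed(1)
# Simulated stand-in for df_reviews; the real data has 1.6M rows
df_sample &lt;- data.frame(
  stars = sample(1:5, 5000, replace = TRUE),
  pos_words = rpois(5000, lambda = 6)
)

# A low alpha makes overlapping points darker, so density becomes visible
p &lt;- ggplot(df_sample, aes(stars, pos_words)) +
  geom_point(alpha = 0.05)
p
</code></pre></div>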
<h2 id="serious-business-data">Serious Business Data</h2>
<p>We load the Yelp Businesses data into R the same way as the reviews data. Here&rsquo;s an overview of the data:</p>
<figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/businesses_hu_2c0d69272092403f.webp 320w,/2015/12/lets-code-1/businesses.png 567w" src="businesses.png"/> 
</figure>

<p>Both data frames have a <code>business_id</code> column. We can merge them with a <code>left_join</code>, a la SQL. If both data frames have a column with the same name, it will merge on that column by default.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">df_reviews</span> <span class="o">&lt;-</span> <span class="n">df_reviews</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_businesses</span><span class="p">)</span>
</span></span></code></pre></div><p>Then the R console helpfully points out that both dataframes also have a &ldquo;stars&rdquo; column. Uh-oh.</p>
<p>We reset the df_reviews data frame from scratch and merge again, explicitly stating the &ldquo;by&rdquo; column for merging. Now we know <em>where</em> reviews were made, and that might provide helpful information.</p>
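<p>A minimal sketch of an explicit merge, using two tiny made-up data frames in place of the real ones (<code>df_merged</code> is a hypothetical name). With <code>by</code> set, dplyr joins only on <code>business_id</code> and disambiguates the clashing <code>stars</code> columns with <code>.x</code> and <code>.y</code> suffixes.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R">library(dplyr)

# Tiny stand-ins for the two data frames (values are made up)
df_reviews &lt;- data.frame(business_id = c(&#34;a&#34;, &#34;a&#34;, &#34;b&#34;), stars = c(5, 3, 4))
df_businesses &lt;- data.frame(
  business_id = c(&#34;a&#34;, &#34;b&#34;),
  stars = c(4.0, 4.5),
  city = c(&#34;Phoenix&#34;, &#34;Las Vegas&#34;)
)

# Join only on business_id; stars from reviews becomes stars.x, from businesses stars.y
df_merged &lt;- df_reviews %&gt;% left_join(df_businesses, by = &#34;business_id&#34;)
names(df_merged)
</code></pre></div>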
<h2 id="aggregation-station">Aggregation Station</h2>
<p>It might be interesting to know the average star rating by city. dplyr allows for <code>group_by</code> and <code>summarize</code> operations in a similar manner as SQL.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">df_cities</span> <span class="o">&lt;-</span> <span class="n">df_reviews</span> <span class="o">%&gt;%</span> <span class="nf">group_by</span><span class="p">(</span><span class="n">city</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">summarize</span><span class="p">(</span><span class="n">avg_stars</span> <span class="o">=</span> <span class="nf">mean</span><span class="p">(</span><span class="n">stars.x</span><span class="p">))</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/cities.png 317w" src="cities.png"/> 
</figure>

<p>&hellip;that&rsquo;s not good. The original Yelp Dataset Challenge page mentioned that the dataset is only from specific cities, not &ldquo;1023 E Frye Rd.&rdquo;</p>
<figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/Dataset_Map_hu_2c0d4f54f063a671.webp 320w,/2015/12/lets-code-1/Dataset_Map_hu_d7afb4d089af919d.webp 768w,/2015/12/lets-code-1/Dataset_Map_hu_c19ffe41834b7eed.webp 1024w,/2015/12/lets-code-1/Dataset_Map.png 1585w" src="Dataset_Map.png"/> 
</figure>

<p>Hmrph.</p>
<p>From the map, it appears that none of the cities share a state, so let&rsquo;s use <code>state</code> instead. Additionally, we can add a count of reviews from that state, and sort by that count descending.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">df_states</span> <span class="o">&lt;-</span> <span class="n">df_reviews</span> <span class="o">%&gt;%</span> <span class="nf">group_by</span><span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">summarize</span><span class="p">(</span><span class="n">avg_stars</span> <span class="o">=</span> <span class="nf">mean</span><span class="p">(</span><span class="n">stars.x</span><span class="p">),</span> <span class="n">count</span><span class="o">=</span><span class="nf">n</span><span class="p">())</span> <span class="o">%&gt;%</span> <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">count</span><span class="p">))</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/states.png 302w" src="states.png"/> 
</figure>

<p>Looks good enough, but that&rsquo;s tempting fate.</p>
<h2 id="ggplot-all-the-things">ggplot All the Things</h2>
<p>We can plot state vs. avg_stars with ggplot2. Setting it up is easy:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="nf">ggplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_states</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">avg_stars</span><span class="p">))</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/lc1-ggplot-1_hu_1ea430bfbdcb476c.webp 320w,/2015/12/lets-code-1/lc1-ggplot-1_hu_281d887c57bb9217.webp 768w,/2015/12/lets-code-1/lc1-ggplot-1_hu_fe0e1dbb3000ad9d.webp 1024w,/2015/12/lets-code-1/lc1-ggplot-1.png 1200w" src="lc1-ggplot-1.png"/> 
</figure>

<p>The blank plot is actually new to 2.0.0: running the code without any layers would previously throw an error. The axis values appear valid. Let&rsquo;s add columns via <code>geom_bar</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="nf">ggplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_states</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">avg_stars</span><span class="p">))</span> <span class="o">+</span> <span class="nf">geom_bar</span><span class="p">()</span>
</span></span></code></pre></div><p>&hellip;and this results in an error. geom_bar by itself does histograms on raw values, as shown in the qplots. The correct fix is to add a <code>stat=&quot;identity&quot;</code> parameter to geom_bar, which tells it to scale the bars by the given value of the aesthetic.</p>
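<p>A minimal sketch of that fix, run on a made-up three-row stand-in for <code>df_states</code> (the averages are illustrative, not the real values):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R">library(ggplot2)

# Made-up averages standing in for df_states
df_states &lt;- data.frame(state = c(&#34;AZ&#34;, &#34;NV&#34;, &#34;NC&#34;), avg_stars = c(3.7, 3.8, 3.6))

# stat = &#34;identity&#34; uses avg_stars as the bar height instead of counting rows
p &lt;- ggplot(data = df_states, aes(state, avg_stars)) +
  geom_bar(stat = &#34;identity&#34;)
p
</code></pre></div>
<p>Later ggplot2 releases also provide <code>geom_col</code> as a shorthand for the same thing.</p>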
<figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/lc1-ggplot-2_hu_a3c35a6410b127ed.webp 320w,/2015/12/lets-code-1/lc1-ggplot-2_hu_90cf43e1feb55295.webp 768w,/2015/12/lets-code-1/lc1-ggplot-2_hu_62a51af623aa251c.webp 1024w,/2015/12/lets-code-1/lc1-ggplot-2.png 1200w" src="lc1-ggplot-2.png"/> 
</figure>

<p>Better. But the x-axis is cluttered and the States would look better on the y-axis. Time for a <code>coord_flip</code>.</p>
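<p>As a sketch, the flip is one more layer added to the bar chart. The data frame below is again a made-up stand-in for <code>df_states</code>, and <code>p_flipped</code> is a hypothetical name.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R">library(ggplot2)

# Made-up averages standing in for df_states
df_states &lt;- data.frame(state = c(&#34;AZ&#34;, &#34;NV&#34;, &#34;NC&#34;), avg_stars = c(3.7, 3.8, 3.6))

# coord_flip() swaps the axes, so states run down the y-axis and bars extend horizontally
p_flipped &lt;- ggplot(data = df_states, aes(state, avg_stars)) +
  geom_bar(stat = &#34;identity&#34;) +
  coord_flip()
p_flipped
</code></pre></div>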
<figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/lc1-ggplot-3_hu_f008de82f3bac460.webp 320w,/2015/12/lets-code-1/lc1-ggplot-3_hu_3452fba37b0fdda4.webp 768w,/2015/12/lets-code-1/lc1-ggplot-3_hu_4adb5cd5ddbd0244.webp 1024w,/2015/12/lets-code-1/lc1-ggplot-3.png 1200w" src="lc1-ggplot-3.png"/> 
</figure>

<p>Better. Now time to fix the order. You may notice that the order of the states is alphabetical going from the bottom of the axis to the top, and R will always set this order for any character vector. We want to sort the labels by their average star rating, descending. To do that, we change the internal factor levels of the state column to the specified order.</p>
<p>In the recording, this took a while due to several brain farts (which happen often when dealing with factor ordering). First, we need to remove a few states with few reviews using a filter. The easiest way to set the order is to sort the original data frame by avg_stars descending, then set the factor order by using the new state order <em>in reverse</em>. (Ok, ok, it might be easier to just sort ascending and not reverse, but it makes the overview harder to visualize.)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">df_states</span> <span class="o">&lt;-</span> <span class="n">df_states</span> <span class="o">%&gt;%</span> <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">avg_stars</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">count</span> <span class="o">&gt;</span> <span class="m">2000</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">state</span> <span class="o">=</span> <span class="nf">factor</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">levels</span><span class="o">=</span><span class="nf">rev</span><span class="p">(</span><span class="n">state</span><span class="p">)))</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/states-2.png 299w" src="states-2.png"/> 
</figure>

<p>Rerunning the plot code afterward yields:</p>
<figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/lc1-ggplot-4_hu_149bd799d70d3f01.webp 320w,/2015/12/lets-code-1/lc1-ggplot-4_hu_31ede1176b5915dc.webp 768w,/2015/12/lets-code-1/lc1-ggplot-4_hu_5ac327cae184a292.webp 1024w,/2015/12/lets-code-1/lc1-ggplot-4.png 1200w" src="lc1-ggplot-4.png"/> 
</figure>

<p>Good! Why not add labels for each bar? This can be done with geom_text, along with adding <code>hjust=1</code> to offset the label, changing the size, and setting the text to white. We can round the avg_stars values to 2 decimal places as well.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="nf">ggplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_states</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">avg_stars</span><span class="p">))</span> <span class="o">+</span> <span class="nf">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s">&#34;identity&#34;</span><span class="p">)</span> <span class="o">+</span> <span class="nf">coord_flip</span><span class="p">()</span> <span class="o">+</span> <span class="nf">geom_text</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="nf">round</span><span class="p">(</span><span class="n">avg_stars</span><span class="p">,</span> <span class="m">2</span><span class="p">)),</span> <span class="n">hjust</span><span class="o">=</span><span class="m">1</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">&#34;white&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/lc1-ggplot-5_hu_89d724c40a7c164d.webp 320w,/2015/12/lets-code-1/lc1-ggplot-5_hu_1f77b24461bff868.webp 768w,/2015/12/lets-code-1/lc1-ggplot-5_hu_e5a5ad3b7fa96d2.webp 1024w,/2015/12/lets-code-1/lc1-ggplot-5.png 1200w" src="lc1-ggplot-5.png"/> 
</figure>

<p>The &ldquo;3.7&rdquo; label requires using the <code>sprintf</code> function instead of <code>round</code> to print &ldquo;3.70&rdquo;, which is not fun. Otherwise, these labels are nice so far. Why not add a theme and axis labels?</p>
<p>I go to my <a href="http://minimaxir.com/2015/02/ggplot-tutorial/">previous ggplot2 tutorial</a> and copy-paste the FiveThirtyEight-inspired theme from there because I am efficient. (The theme required loading the RColorBrewer package, though.) The axis labels are added through the <code>labs</code> function. (Note that since the axes are flipped, the labels must be flipped too!)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="nf">ggplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_states</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">avg_stars</span><span class="p">))</span> <span class="o">+</span> <span class="nf">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s">&#34;identity&#34;</span><span class="p">)</span> <span class="o">+</span> <span class="nf">coord_flip</span><span class="p">()</span> <span class="o">+</span> <span class="nf">geom_text</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="nf">round</span><span class="p">(</span><span class="n">avg_stars</span><span class="p">,</span> <span class="m">2</span><span class="p">)),</span> <span class="n">hjust</span><span class="o">=</span><span class="m">2</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="m">2</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">&#34;white&#34;</span><span class="p">)</span> <span class="o">+</span> <span class="nf">fte_theme</span><span class="p">()</span> <span class="o">+</span> <span class="nf">labs</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="s">&#34;Average Star Rating by State&#34;</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">&#34;State&#34;</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">&#34;Average Yelp Review Star Ratings by State&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/lc1-ggplot-6_hu_adee5bd97bf9bbb3.webp 320w,/2015/12/lets-code-1/lc1-ggplot-6_hu_5a31a50f3b576005.webp 768w,/2015/12/lets-code-1/lc1-ggplot-6_hu_bd1dc62f4011bfba.webp 1024w,/2015/12/lets-code-1/lc1-ggplot-6.png 1200w" src="lc1-ggplot-6.png"/> 
</figure>
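<p>Back to the &ldquo;3.7&rdquo; label for a second: the <code>sprintf</code> fix mentioned earlier could look something like the sketch below. The <code>avg_toy</code> values are made up; this was not part of the recording.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"># Made-up averages for illustration
avg_toy &lt;- c(3.7, 3.846, 4)

# round() returns numbers, so trailing zeros vanish when printed as labels
round(avg_toy, 2)

# sprintf() returns strings that always carry two decimal places
label_toy &lt;- sprintf(&#34;%.2f&#34;, avg_toy)
label_toy  # &#34;3.70&#34; &#34;3.85&#34; &#34;4.00&#34;
</code></pre></div>
<p>In the plot code, that would presumably mean swapping <code>round(avg_stars, 2)</code> for <code>sprintf(&#34;%.2f&#34;, avg_stars)</code> inside the <code>geom_text</code> aesthetic.</p>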

<p>Why not add 95% confidence intervals for each average? (Note that the normality assumptions for the confidence interval may not be entirely valid). We can calculate the standard error of the mean and rebuild the dataframe and reorder factors again.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="n">df_states</span> <span class="o">&lt;-</span> <span class="n">df_reviews</span> <span class="o">%&gt;%</span> <span class="nf">group_by</span><span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">summarize</span><span class="p">(</span><span class="n">avg_stars</span> <span class="o">=</span> <span class="nf">mean</span><span class="p">(</span><span class="n">stars.x</span><span class="p">),</span> <span class="n">count</span><span class="o">=</span><span class="nf">n</span><span class="p">(),</span> <span class="n">se_mean</span><span class="o">=</span><span class="nf">sd</span><span class="p">(</span><span class="n">stars.x</span><span class="p">)</span><span class="o">/</span><span class="nf">sqrt</span><span class="p">(</span><span class="n">count</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">avg_stars</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">count</span> <span class="o">&gt;</span> <span class="m">2000</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">state</span> <span class="o">=</span> <span class="nf">factor</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">levels</span><span class="o">=</span><span class="nf">rev</span><span class="p">(</span><span class="n">state</span><span class="p">)))</span>
</span></span></code></pre></div><p>Time to add a <code>geom_errorbar</code> (not a <code>geom_crossbar</code>!)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"><span class="line"><span class="cl"><span class="nf">ggplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df_states</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">avg_stars</span><span class="p">))</span> <span class="o">+</span> <span class="nf">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="o">=</span><span class="s">&#34;identity&#34;</span><span class="p">)</span> <span class="o">+</span> <span class="nf">coord_flip</span><span class="p">()</span> <span class="o">+</span> <span class="nf">geom_text</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="nf">round</span><span class="p">(</span><span class="n">avg_stars</span><span class="p">,</span> <span class="m">2</span><span class="p">)),</span> <span class="n">hjust</span><span class="o">=</span><span class="m">2</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="m">2</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">&#34;white&#34;</span><span class="p">)</span> <span class="o">+</span> <span class="nf">fte_theme</span><span class="p">()</span> <span class="o">+</span> <span class="nf">labs</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="s">&#34;Average Star Rating by State&#34;</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">&#34;State&#34;</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">&#34;Average Yelp Review Star Ratings by State&#34;</span><span class="p">)</span> <span class="o">+</span> <span 
class="nf">geom_errorbar</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">ymin</span><span class="o">=</span><span class="n">avg_stars</span> <span class="o">-</span> <span class="m">1.96</span> <span class="o">*</span> <span class="n">se_mean</span><span class="p">,</span> <span class="n">ymax</span><span class="o">=</span><span class="n">avg_stars</span> <span class="o">+</span> <span class="m">1.96</span> <span class="o">*</span> <span class="n">se_mean</span><span class="p">))</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2015/12/lets-code-1/lc1-ggplot-7_hu_fd96c2d620c174.webp 320w,/2015/12/lets-code-1/lc1-ggplot-7_hu_423b664a411409a1.webp 768w,/2015/12/lets-code-1/lc1-ggplot-7_hu_78e75ee61e1d99a6.webp 1024w,/2015/12/lets-code-1/lc1-ggplot-7.png 1200w" src="lc1-ggplot-7.png"/> 
</figure>
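<p>The arithmetic behind those error bars can be sketched on a single toy vector. Everything here is assumed for illustration: <code>stars_toy</code> is a made-up set of ratings, and the standard deviation of 1.3 in the last line is a guess, not a value measured from the Yelp data.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R"># Made-up star ratings for one hypothetical state
stars_toy &lt;- c(5, 4, 4, 3, 5, 1, 4, 5, 2, 4)

n &lt;- length(stars_toy)
avg &lt;- mean(stars_toy)

# Standard error of the mean: sample sd divided by sqrt(n)
se_mean &lt;- sd(stars_toy) / sqrt(n)

# Normal-approximation 95% confidence interval
ci &lt;- c(lower = avg - 1.96 * se_mean, upper = avg + 1.96 * se_mean)
ci

# The standard error shrinks with sqrt(n); assuming an sd of 1.3,
# compare 10 reviews against 2,000 and 100,000 reviews
se_by_n &lt;- 1.3 / sqrt(c(10, 2000, 100000))
se_by_n
</code></pre></div>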

<p>Averages are very stable for all states due to the large sample sizes.</p>
<p>At this point I realized the recording was too long, so I ended it there. For a normal blog post, I&rsquo;d add more theming, adjust colors so they don&rsquo;t clash, and add annotations, such as a line representing the true review average from the population. Ideally, I&rsquo;d also perform statistical tests to determine whether any averages differ from the population average.</p>
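<p>For the statistical test idea, one option (not something done in the recording) is a one-sample t-test per state using base R&rsquo;s <code>t.test</code>. The sketch below uses simulated ratings and a hypothetical population average of 3.7, and the same caveat about normality assumptions applies.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-R" data-lang="R">set.seed(1)

# Hypothetical population average (an assumption, not the Yelp value)
pop_mean_toy &lt;- 3.7

# Simulated 1-5 star ratings for one made-up state
stars_state_toy &lt;- sample(1:5, size = 500, replace = TRUE,
                          prob = c(0.10, 0.10, 0.15, 0.30, 0.35))

# One-sample t-test: does this state's mean differ from the population mean?
test_toy &lt;- t.test(stars_state_toy, mu = pop_mean_toy)

test_toy$p.value   # p-value for the two-sided test
test_toy$conf.int  # 95% confidence interval for the state's mean
</code></pre></div>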
<p>Hopefully this gives some insight into the mechanical process of creating simple data visualizations with R and ggplot2 (the &ldquo;abridged summary&rdquo; ended up being as long as a typical blog post!). As my screencast shows, programming is a recurring process of saying &ldquo;this is easy to do!&rdquo; then failing miserably for stupid reasons. Even after the 40-minute screencast, there&rsquo;s still much, much more polish needed for the data visualization. My blog posts take a very long time to produce for those reasons; the clear, clean code from the finished product is not indicative of the unexpected errors that occur when writing it.</p>
<p>I did this recording &ldquo;blind&rdquo; to test whether or not it&rsquo;s feasible for me to <em>stream</em> the coding of data visualization on services like <a href="http://www.twitch.tv">Twitch</a>. It&rsquo;s definitely possible, but has more logistical challenges (namely, that <a href="https://obsproject.com">OBS</a> is fussy outside of Windows and I still need to figure out how to configure it optimally). I admit the code in this screencast may not be the highest-quality code (in retrospect I should have put the code in an editor instead of directly in the console, and reused dataframe/ggplot objects), but the transparent process for coding data visualizations is important. If there is enough interest, I may revisit Yelp data again, or even more advanced datasets.</p>
<hr>
<p><em>You can access the R code used for the data visualizations and the Python scripts used to process the raw Yelp dataset <a href="https://github.com/minimaxir/lets-code-1">in this GitHub repository</a>. However, the raw data itself cannot be redistributed.</em></p>
<p><em>For those wondering what I used for recording the screencast:</em></p>
<p>Computer: <em>Late 2013 13&quot; Retina MacBook Pro running OS X 10.11.2</em></p>
<p>Recording Software: <em>Screenflow 4.5</em></p>
<p>Microphone: <em>Shure MV5 Digital Condenser</em></p>
<p>Music: <em>Various artists from the &ldquo;No Attribution Required&rdquo; section of the YouTube Audio Library</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
