Media Analysis on Max Woolf's Blog

Movie Review Aggregator Ratings Have No Relationship with Box Office Success

Thu, 07 Jan 2016 08:30:00 -0700

Rotten Tomatoes has become synonymous with movie quality in recent years. The Rotten Tomatoes Tomatometer aggregates all reviews written by movie critics for a given movie on the internet, determines whether each reviewer rates the movie as “Fresh” or “Rotten” and calculates an average. If the proportion of Fresh reviews for a given movie is greater than or equal to 60%, the movie itself is considered “Fresh” and receives a special icon.
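As a rough sketch of that calculation in R (the review verdicts below are made up for illustration, and this only mirrors the description above, not Rotten Tomatoes' actual pipeline):

```r
# Hypothetical critic verdicts for a single movie: TRUE = "Fresh", FALSE = "Rotten"
reviews_fresh <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)

# The Tomatometer is the percentage of reviews that are Fresh
tomatometer <- 100 * mean(reviews_fresh)

# The movie itself counts as Fresh at 60% or above
movie_status <- if (tomatometer >= 60) "Fresh" else "Rotten"
```

With 7 Fresh verdicts out of 10, the score is 70% and the movie is labeled Fresh.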

Top movies like Christopher Nolan’s The Dark Knight received a 94% Tomatometer rating and generated $533.3 million in domestic box office revenue. But other movies, like Michael Bay’s Transformers: Revenge of the Fallen, received a 19% Tomatometer rating yet still generated $402.1 million in domestic box office revenue.

How strong is the relationship between Tomatometer scores and box office success, anyways? Or are there other, better metrics? Time to make some pretty charts.

I obtained a large amount of movie data from the OMDb API, which provides easy access to movie metadata from IMDb and Rotten Tomatoes. This data contains Rotten Tomatoes Tomatometer scores, Rotten Tomatoes Audience Scores, IMDb User Rankings, and Metacritic Scores. If you want to know how I processed the data in R and plotted the charts using ggplot2, I have prepared a screencast for your viewing pleasure.

For this analysis, we will be looking at the log-transformation of domestic box office revenue, since the values are skewed by mega-blockbusters like the ones mentioned previously. Revenues are not inflation-adjusted since the rating data is only present for recent years and due to the log-transformation already present, inflation correction would not impact this particular analysis much.

Rotten Tomatoes Tomatometer

After processing, I have a data subset of 4,863 movies with both Tomatometer and Box Office Gross values. Let’s plot all those movies on a scatterplot of log(BoxOffice) vs. Meter with each point having a slight transparency; that way, clusters of points will become apparent where the areas are darker on the chart.

We expect a positive linear relationship: movies with high Tomatometer scores should have high box office revenue, and conversely, movies with low scores should have low box office revenue.

Wait, why does the trendline have a negative slope?

The Pearson correlation between the Tomatometer scores and log(BoxOffice) is -0.18, implying a weak negative linear relationship between the two variables. Not what I expected.
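For reference, a minimal sketch of how such a correlation can be computed in base R. The data frame here is randomly generated stand-in data (the real data cannot be redistributed), and the column names Meter and BoxOffice are borrowed from the chart labels; the rest is assumption.

```r
# Hypothetical stand-in data, NOT the real OMDb-derived data set
set.seed(42)
movies <- data.frame(
  Meter = sample(0:100, 200, replace = TRUE),
  BoxOffice = 10^runif(200, min = 3, max = 8.5)
)

# Log-transform revenue to tame the skew from mega-blockbusters
movies$LogBoxOffice <- log10(movies$BoxOffice)

# Pearson correlation between the score and log revenue
r <- cor(movies$Meter, movies$LogBoxOffice, method = "pearson")
```

The base of the logarithm does not matter here: switching between log10 and the natural log only rescales the values by a constant, which leaves the Pearson correlation unchanged.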

There do appear to be clusters in the data. There is a group of points between $10M and $100M revenue and 0% to 20% Tomatometer rating. Another group is present between $1,000 and $1M revenue and 80% to 100% RT rating. Both of these areas are outside of a linear relationship: perhaps these clusters are skewing trends too?

Let’s try another visualization of the data using contour maps, which allow the data to become 3D, so-to-speak. Using a 2D kernel density estimator, we can identify and color areas on the plot according to the number of points present in that area; the greater the color saturation, the more points present in the given area.
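To make the idea concrete, here is a small sketch of a 2D kernel density estimate using kde2d from the MASS package, which ships with standard R installations. The points are simulated, and this is only one way to get such an estimate, not necessarily how the charts in this post were produced.

```r
library(MASS)  # provides kde2d

# Simulated stand-in points, NOT the real movie data
set.seed(1)
meter <- runif(500, min = 0, max = 100)
log_box_office <- rnorm(500, mean = 6, sd = 1.5)

# Estimate the point density on a 50 x 50 grid over the plot area
density_est <- kde2d(meter, log_box_office, n = 50)

# density_est$x and density_est$y are the grid coordinates;
# density_est$z is the estimated density at each grid cell
peak_density <- max(density_est$z)
```

Passing density_est to base R's contour() draws the contour lines; areas with larger z values correspond to the more saturated regions on the chart.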

The two clusters mentioned previously are now much more apparent. It appears there are two distinct sets of movies: blockbusters which critics hate, and limited-appeal films which critics love. Incidentally, there is no discernible difference between movies which are Fresh (≥60%) and Rotten.

Metacritic

The Metacritic score is also derived from review data by critics; however, instead of calculating a binary review sentiment and calculating a proportion from that sentiment, Metacritic gives a quantification from 0 to 100 to each critic review and averages them together.
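A toy comparison of the two approaches, using invented scores for one movie. The plain unweighted mean follows the description above, and the cutoff of 60 for a "positive" review is purely an assumption for illustration.

```r
# Hypothetical 0-100 scores assigned to each critic review of one movie
critic_scores <- c(90, 65, 61, 62, 40, 63, 100, 35, 64, 60)

# Metacritic-style: average the quantified scores directly
metascore_like <- mean(critic_scores)

# Tomatometer-style: reduce each review to a binary verdict first,
# then take the percentage of positive verdicts (cutoff is an assumption)
fresh_share <- 100 * mean(critic_scores >= 60)
```

Here the averaged score is 64 while the binary proportion is 80%: a pile of lukewarm-but-positive reviews inflates the proportion, but not the average.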

Does that change the results for 4,479 movies?

Correlation between Metacritic score and log(BoxOffice) is -0.13, which puts the analysis in a similar state as the Rotten Tomatoes data. However, the blockbuster cluster has shifted right, and the lesser-appeal cluster has shifted left.

Clusters are much closer together.

Perhaps a review metric by non-critics will tell a different story.

Rotten Tomatoes Audience Score

The Audience Score is calculated in a similar way to the Rotten Tomatoes Tomatometer score: users of the site rate a movie from 0 to 5 stars in half-star increments (i.e. effectively a scale from 0-10), and the proportion of reviews with ratings of 3.5 stars or higher becomes the Audience Score.

This also presents a cognitive bias in ratings: the Four Point Scale, where having a discrete form of ranking may cause people to tend to rate toward the top of the scale and make the entire metric skewed or misleading.

How does the Audience Score compare for 5,163 movies? After all, the audience is the group of people who determine how much money a movie makes at the Box Office.

Correlation between the Audience Score and log(BoxOffice) is 0.05, which is positive, but so close to zero that there is barely any practical relationship.

Speaking of the Four Point Scale, notice how, like with Metacritic score, there are barely any movies between 0% and 20% Audience Score. Is there really a skew? Let’s look at the contours:

The locations of the clusters are much different from those of the Tomatometer clusters. Both clusters are closer together, with the blockbuster cluster between 50% and 60% Audience Score and the lesser-appeal cluster between 70% and 80%. Hence, the low correlation.

IMDb

IMDb works almost the same way as Metacritic, but for non-critics: ratings from IMDb users between 1-10 (note that 0 is missing!) are averaged to get a final score.

How do 5,167 movies fare?

What?!

The point groupings are at the same positions of ratings, and the correlation between IMDb ratings and log(BoxOffice) is 0.00. Yes, there’s zero correlation!

Checking the contour map confirms it:

That is literally a Four Point Scale between 5 and 8!

The Rotten Tomatoes metric is the only metric that actually uses the entire rating scale. None of the other potential metrics provide more insight into a potential reason for high box-office revenue. Perhaps the movie rating system itself is broken.

That’s not to say that movies need high box-office revenues to be considered successful. However, working with movie profitability, and by extension movie budget, opens another can of worms with respect to data integrity. (That said, on Reddit, /u/chartmkr recently posted a visualization of Gross vs. Budget which is interesting.)

It’ll still be fun to point to a Rotten Tomatoes Tomatometer rating as a kneejerk reaction to whether a movie rocks/sucks. Although, the reasons for movie financial success at the box office definitely warrant further investigation.

UPDATE 1/11/15: In a discussion on Hacker News, it was suggested that the blockbuster movies and the indie movies cancel each other out, i.e. blockbusters have a positive correlation and indies have a negative correlation.

For the blockbuster cluster alone, the log-correlation is 0.23 (not weak but not great positive correlation). For the indie cluster alone, the log-correlation is -0.12 (same as original analysis).

For future analysis, it may be worthwhile to split these two clusters. I stand by the original analysis for this post: very frequently I’ve heard the question “is this a good movie?” and the response is “what does the RT score say?” Both Box Office revenues and RT scores are important measures of quality (depending on perspective), and users who want to see or purchase a movie may not necessarily care if it’s indie or a blockbuster.

User cwyers suggested that Simpson’s Paradox may be in play since the number of theaters showing a movie is positively correlated to box office revenue, adding a potentially confounding effect. I will see if I can obtain that data for future analysis.
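To illustrate how a pooled correlation can differ from the correlations inside each group, here is a sketch with two fully synthetic clusters. All of the numbers are invented to make the effect visible; they are not estimates from the movie data.

```r
set.seed(7)
n <- 500

# Synthetic "blockbuster" cluster: lower scores, higher revenue,
# with a mild positive score/revenue relationship inside the cluster
blockbuster <- data.frame(cluster = "blockbuster",
                          score = rnorm(n, mean = 40, sd = 10))
blockbuster$log_revenue <- 7.5 + 0.02 * (blockbuster$score - 40) +
  rnorm(n, sd = 0.3)

# Synthetic "indie" cluster: higher scores, lower revenue,
# with the same mild positive relationship inside the cluster
indie <- data.frame(cluster = "indie",
                    score = rnorm(n, mean = 80, sd = 10))
indie$log_revenue <- 5 + 0.02 * (indie$score - 80) + rnorm(n, sd = 0.3)

pooled <- rbind(blockbuster, indie)

# Correlation over all points vs. correlation within each cluster
cor_pooled <- cor(pooled$score, pooled$log_revenue)
cor_by_cluster <- sapply(split(pooled, pooled$cluster),
                         function(d) cor(d$score, d$log_revenue))
```

In this setup both within-cluster correlations are positive while the pooled correlation is negative, because the gap between the two cluster centers dominates the overall trend.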

You can access the open-sourced Jupyter notebook and high-resolution charts from this article in this GitHub repository. If you use the code or data visualization designs contained within this article, it would be greatly appreciated if proper attribution is given back to this article and/or myself. Thanks!

Unfortunately, I cannot redistribute the data itself due to licensing concerns.

Let's Code an Analysis and Visualizations of Yelp Data using R and ggplot2

Mon, 28 Dec 2015 09:00:00 -0700

One of the reasons I have open-sourced the code for my complicated data visualizations is transparency for the creation process. 2015 was a year of misleading and incorrect data visualizations, and I don’t want to help contribute to the misconception that data can be used for trickery. “Big data” in particular is an area where the steps to reproduce results are rarely released publicly in a step-by-step manner, often in an attempt to make the resulting analysis unimpeachable.

It’s time to take things to the next level of transparency by recording screencasts of my data analysis and visualizations.

Last week, ggplot2 author Hadley Wickham released a surprise update for my favorite R package, bumping the version to 2.0.0. Why not celebrate by playing around with ggplot2 and making some pretty charts?

Let’s Code!

I have recorded a screencast of myself coding in R to play around with data from the Yelp Dataset Challenge and uploaded it to YouTube. Additionally, the video can be played at an unusually high quality for screencasting: 1440p on supported browsers, at 60 frames per second.

This particular screencast is also my first significant attempt at working with audio/video editing and voice-over. Feel free to provide suggestions for future videos.

Since the screencast is 40 minutes long (inadvertently!), I’ve written an abridged summary of the screencast, along with some clarification of points made.

Yelp Data v2

A year ago I made a blog post analyzing the same Yelp data. Now that the data set contains 1.6 million reviews (as opposed to just 1.1 million back then), it might be interesting to look at it again to see if anything has changed. The data is formatted as by-line JSON: I wrote a pair of Python scripts to convert it to CSV for easy import into R.

The screencast centers on three R packages: readr, dplyr, and ggplot2 (all authored by Hadley Wickham).

Loading the dataset into R is easy and fast with read_csv.

df_reviews <- read_csv("yelp_reviews.csv")

Since dplyr was loaded beforehand, read_csv loads the data into a tbl_df instead of a normal data.frame. When you call a normal data.frame by itself, all data is printed to console, which is a problem when you have 1.6M rows (yes, that happened during a test recording). Calling a tbl_df results in a very descriptive overview of the data:

Most columns are self-explanatory. review_length is the approximate number of words in the review, pos_words is the number of positive words in the review, neg_words is what you expect, and net_sentiment is pos_words - neg_words.

A quick way to analyze the distribution of numerical data is to perform a summary on the data frame, which returns a by-column five-number summary + mean:

Ratings are biased toward 4 and 5 star reviews. There is a lot of skew for review length.

dplyr makes it easy to add columns in-line with the mutate command. Let’s normalize the pos_words column:

df_reviews <- df_reviews %>% mutate(pos_norm = pos_words / review_length)

And we could do similar steps for the neg_words column too. Or use mutate to transform the data of an existing column.
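For example, a sketch of both ideas on a tiny made-up stand-in for df_reviews (the neg_norm name and the log transform are my own illustrative choices, not steps from the screencast):

```r
library(dplyr)

# Tiny hypothetical stand-in for the real df_reviews
df_reviews <- data.frame(
  review_length = c(100, 50, 200),
  pos_words = c(10, 2, 8),
  neg_words = c(1, 5, 4)
)

# Add a new normalized column, mirroring pos_norm
df_reviews <- df_reviews %>% mutate(neg_norm = neg_words / review_length)

# Overwrite an existing column by reusing its name on the left-hand side
df_reviews <- df_reviews %>% mutate(review_length = log10(review_length))
```

Reusing an existing column name in mutate replaces that column in place, so the order of the two calls matters if later columns depend on the original values.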

Onto ggplot2. If you want a quick histogram of univariate data, qplot does just that. Let’s visualize the distribution of stars.

qplot(data=df_reviews, stars)

Definitely a skew toward 4 and 5 star reviews.

We can do that for other variables too, like review length.

What about bivariate data? If you give two variables to qplot, it will create a scatter plot. Perhaps there is a relationship between the number of stars and the number of positive words?

qplot(data=df_reviews, stars, pos_words)

…and then we run into a problem. In this case, ggplot2 has to plot 1.6M points to screen, which can take a while, especially if you are simultaneously using your GPU for video recording. Eventually, we get this:

At first glance, there appears to be a positive correlation between star rating and number of positive words, but that’s misleading: since we don’t have alpha transparency on the points, the density is ambiguous. (fixing it requires working outside of a qplot).
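A sketch of what working outside of qplot might look like: a full ggplot call with an alpha value on the points. The data is a small simulated sample, and the alpha of 0.05 is an arbitrary starting value to tune by eye.

```r
library(ggplot2)

# Small simulated stand-in for df_reviews (the real one has 1.6M rows)
set.seed(3)
df_sample <- data.frame(
  stars = sample(1:5, 1000, replace = TRUE),
  pos_words = rpois(1000, lambda = 5)
)

# Each point is mostly transparent, so overlapping points stack up
# into darker areas and the density becomes visible
p <- ggplot(df_sample, aes(stars, pos_words)) +
  geom_point(alpha = 0.05)
```

Printing p renders the chart. Plotting a random sample of rows instead of all 1.6M is another option that would also cut down on the rendering time.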

Serious Business Data

We load the Yelp Businesses data into R through the same way as the reviews data. Here’s an overview of the data:

Both data frames have a business_id column. We can merge them with a left_join, a la SQL. If both data frames have a column with the same name, it will merge on that column by default.

df_reviews <- df_reviews %>% left_join(df_businesses)

Then the R console helpfully points out that both dataframes also have a “stars” column. Uh-oh.

We reset the df_reviews data frame from scratch and merge again, explicitly stating the “by” column for merging. Now we know where reviews were made, and that might provide helpful information.
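A minimal sketch of the explicit merge on two tiny made-up data frames (the values are invented; only the business_id and stars column names come from the data described above):

```r
library(dplyr)

# Hypothetical miniature versions of the two data frames
df_reviews <- data.frame(business_id = c("a", "a", "b"),
                         stars = c(5, 3, 4))
df_businesses <- data.frame(business_id = c("a", "b"),
                            stars = c(4.0, 4.0),
                            state = c("AZ", "NV"))

# Join only on business_id; the clashing "stars" columns are kept
# separately as stars.x (from reviews) and stars.y (from businesses)
df_merged <- df_reviews %>% left_join(df_businesses, by = "business_id")
```

This is why the aggregation code refers to stars.x: after the join, that is the per-review star rating, while stars.y is the business-level rating.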

Aggregation Station

It might be interesting to know the average star rating by city. dplyr allows for group_by and summarize operations in a similar manner as SQL.

df_cities <- df_reviews %>% group_by(city) %>% summarize(avg_stars = mean(stars.x))

…that’s not good. The original Yelp Dataset Challenge page mentioned that the dataset is only from specific cities, not “1023 E Frye Rd.”

Hmrph.

From the map, it appears that each of the cities is in a different state, so let’s use state instead. Additionally, we can add a count of reviews from that state, and sort by that count descending.

df_states <- df_reviews %>% group_by(state) %>% summarize(avg_stars = mean(stars.x), count=n()) %>% arrange(desc(count))

Looks good enough, but that’s tempting fate.

ggplot All the Things

We can plot state vs. avg_stars with ggplot2. Setting it up is easy:

ggplot(data=df_states, aes(state, avg_stars))

The blank plot is actually new to 2.0.0: running the code without any layers would previously throw an error. The axis values appear valid. Let’s add columns via geom_bar:

ggplot(data=df_states, aes(state, avg_stars)) + geom_bar()

…and this results in an error. geom_bar by itself does histograms on raw values, as shown in the qplots. The correct fix is to add a stat="identity" parameter to geom_bar, which tells it to scale the bars by the given value of the aesthetic.
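The corrected call, sketched on a small made-up df_states (the state codes and averages are placeholders, not the real results):

```r
library(ggplot2)

# Hypothetical stand-in for the aggregated df_states
df_states <- data.frame(
  state = c("AZ", "NV", "NC"),
  avg_stars = c(3.70, 3.74, 3.58)
)

# stat = "identity" uses avg_stars directly as the bar height
# instead of counting rows per state
p <- ggplot(data = df_states, aes(state, avg_stars)) +
  geom_bar(stat = "identity")
```

Later ggplot2 releases (2.2.0 onward) also offer geom_col() as a shorthand for the same thing.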

Better. But the x-axis is cluttered and the States would look better on the y-axis. Time for a coord_flip.

Better. Now time to fix the order. You may notice that the order of the states is alphabetical going from the bottom of the axis to the top, and R will always set this order for any character vector. We want to sort the labels by their average star rating, descending. To do that, we change the internal factor levels of the state column to the specified order.

In the recording, this took a while due to several brain farts (which happen often when dealing with factor ordering). First, we need to remove a few states with few reviews using a filter. The easiest way to set the order is to sort the original data frame by avg_stars descending, then set the factor order by using the new state order in reverse. (OK, OK, it might be easier to just sort ascending and not reverse, but that makes the overview harder to visualize.)

df_states <- df_states %>% arrange(desc(avg_stars)) %>% filter(count > 2000) %>% mutate(state = factor(state, levels=rev(state)))

Rerunning the plot code afterward yields:

Good! Why not add labels for each bar? This can be done with geom_text, along with adding hjust=1 to offset the label, changing the size, and setting the text to white. We can round the avg_stars values to 2 decimal places as well.

ggplot(data=df_states, aes(state, avg_stars)) + geom_bar(stat="identity") + coord_flip() + geom_text(aes(label=round(avg_stars, 2)), hjust=1, color="white")

The “3.7” label requires using the sprintf function instead of round to print “3.70”, which is not fun. Otherwise, these labels are nice so far. Why not add a theme and axis labels?
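On the sprintf point, a quick base R sketch of the difference (the three averages are placeholder values):

```r
# Placeholder averages
avg_stars <- c(3.7, 3.74, 3.586)

# round() returns numbers, so trailing zeros are dropped when printed
rounded <- round(avg_stars, 2)

# sprintf() returns strings with exactly two decimal places
labels <- sprintf("%.2f", avg_stars)
# labels is c("3.70", "3.74", "3.59")
```

Using the sprintf version as the label aesthetic in geom_text would keep every label the same width.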

I go to my previous ggplot2 tutorial and copy-paste the FiveThirtyEight-inspired theme from there because I am efficient. (The theme required loading the RColorBrewer package, though). The axis labels are added through the labs function. (note that since the axes are flipped, the labels must be flipped too!)

ggplot(data=df_states, aes(state, avg_stars)) + geom_bar(stat="identity") + coord_flip() + geom_text(aes(label=round(avg_stars, 2)), hjust=2, size=2, color="white") + fte_theme() + labs(y="Average Star Rating by State", x="State", title="Average Yelp Review Star Ratings by State")

Why not add 95% confidence intervals for each average? (Note that the normality assumptions for the confidence interval may not be entirely valid). We can calculate the standard error of the mean and rebuild the dataframe and reorder factors again.

df_states <- df_reviews %>% group_by(state) %>% summarize(avg_stars = mean(stars.x), count=n(), se_mean=sd(stars.x)/sqrt(count)) %>% arrange(desc(avg_stars)) %>% filter(count > 2000) %>% mutate(state = factor(state, levels=rev(state)))

Time to add a geom_errorbar (not a geom_crossbar!)

ggplot(data=df_states, aes(state, avg_stars)) + geom_bar(stat="identity") + coord_flip() + geom_text(aes(label=round(avg_stars, 2)), hjust=2, size=2, color="white") + fte_theme() + labs(y="Average Star Rating by State", x="State", title="Average Yelp Review Star Ratings by State") + geom_errorbar(aes(ymin=avg_stars - 1.96 * se_mean, ymax=avg_stars + 1.96 * se_mean))

Averages are very stable for all states due to the large sample size.
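The arithmetic behind that, as a short sketch: the standard error shrinks with the square root of the number of reviews, so the 95% interval gets very narrow. The standard deviation of 1.3 stars and the review counts are assumed values for illustration, not figures from the data.

```r
# Assumed standard deviation of star ratings (illustrative only)
sd_stars <- 1.3

# Hypothetical review counts for three states
n_reviews <- c(2000, 50000, 500000)

# Standard error of the mean, and the half-width of a 95% interval
se_mean <- sd_stars / sqrt(n_reviews)
ci_half_width <- 1.96 * se_mean
```

Even at the 2,000-review filter threshold the half-width is under 0.06 stars, and it keeps shrinking as the count grows.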

At this point I realized the recording was too long, so I ended it there. For a normal blog post, I’d add more theming, adjust colors so they don’t clash, and add annotations, such as a line representing the true review average from the population. Ideally, I’d also perform statistical tests to determine if any averages are different from the population average.
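One possible form such a test could take is a one-sample t-test per state, sketched here in base R. The ratings are simulated and the population average of 3.7 is a made-up placeholder; the same normality caveat as for the confidence intervals applies.

```r
# Placeholder population average (not the real value)
population_mean <- 3.7

# Simulated star ratings for one hypothetical state
set.seed(11)
state_stars <- sample(1:5, 5000, replace = TRUE,
                      prob = c(0.10, 0.10, 0.15, 0.30, 0.35))

# One-sample t-test: is this state's mean different from the population mean?
test_result <- t.test(state_stars, mu = population_mean)

p_value <- test_result$p.value
conf_interval <- test_result$conf.int
```

With many states, the p-values would also need a multiple-comparison correction (for example via p.adjust) before reading much into any single one.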

Hopefully this gives some insight into the mechanical process of creating simple data visualizations with R and ggplot2 (the “abridged summary” ended up being as long as a typical blog post!). As my screencast shows, programming is a recurring process of saying “this is easy to do!” then failing miserably for stupid reasons. Even after the 40 minute screencast, there’s still much, much more polish needed for the data visualization. My blog posts take a very long time to produce for those reasons; the clear, clean code from the finished product is not indicative of the unexpected errors that occur when writing it.

I did this recording “blind” to test whether or not it’s feasible for me to stream the coding of data visualization on services like Twitch. It’s definitely possible, but has more logistical challenges (namely, that OBS is fussy outside of Windows and I still need to figure out how to configure it optimally). I admit the code in this screencast may not be the highest-quality code (in retrospect I should have put the code in an editor instead of directly in the console, and reused dataframe/ggplot objects), but the transparent process for coding data visualizations is important. If there is enough interest, I may revisit Yelp data again, or even more advanced datasets.

You can access the R code used for the data visualizations and the Python scripts used to process the raw Yelp dataset in this GitHub repository. However, the raw data itself cannot be redistributed.

For those wondering what I used for recording the screencast:

Computer: Late 2013 13" Retina MacBook Pro running OS X 10.11.2

Recording Software: Screenflow 4.5

Microphone: Shure MV5 Digital Condenser

Music: Various artists from the “No Attribution Required” section of the YouTube Audio Library