<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Programming on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/tag/programming/</link>
    <description>Recent content in Programming on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Tue, 23 Sep 2014 08:00:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/tag/programming/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>The Statistical Difference Between 1-Star and 5-Star Reviews on Yelp</title>
      <link>https://minimaxir.com/2014/09/one-star-five-stars/</link>
      <pubDate>Tue, 23 Sep 2014 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2014/09/one-star-five-stars/</guid>
      <description>It can be proven that language has a strong statistical effect on review ratings, but that is intuitive enough. How have review ratings changed?</description>
<content:encoded><![CDATA[<p>Many businesses in the real world encourage their customers to &ldquo;Rate us on Yelp!&rdquo; <a href="http://www.yelp.com/">Yelp</a>, the &ldquo;best way to find local businesses,&rdquo; relies on user reviews to help its users find the best places. Both positive and negative reviews are helpful in this mission: positive reviews on Yelp identify the best places, while negative reviews identify places where people <em>shouldn&rsquo;t</em> go. Usually, both positive and negative reviews are based not on objective attributes of the business, but on the experience the writer had with the establishment.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp_review_pos_hu_ddcb34306c4121d.webp 320w,/2014/09/one-star-five-stars/yelp_review_pos.png 620w" src="yelp_review_pos.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp_review_neg_hu_5472e07a6e063134.webp 320w,/2014/09/one-star-five-stars/yelp_review_neg.png 633w" src="yelp_review_neg.png"/> 
</figure>

<p>I analyzed the language present in 1,125,458 Yelp Reviews using the dataset from the <a href="http://www.yelp.com/dataset_challenge">Yelp Dataset Challenge</a> containing reviews of businesses in the cities of Phoenix, Las Vegas, Madison, Waterloo and Edinburgh. Users can rate businesses 1, 2, 3, 4, or 5 stars. When comparing the most-frequent two-word phrases between 1-star and 5-star reviews, the difference is apparent.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/Yelp-2-Gram-Small_hu_a3184278e17792da.webp 320w,/2014/09/one-star-five-stars/Yelp-2-Gram-Small_hu_93816ee646e301fc.webp 768w,/2014/09/one-star-five-stars/Yelp-2-Gram-Small_hu_6c0c9d7f59903afe.webp 1024w,/2014/09/one-star-five-stars/Yelp-2-Gram-Small.jpg 1200w" src="Yelp-2-Gram-Small.jpg"/> 
</figure>

<p>The 5-star Yelp reviews contain many instances of &ldquo;Great&rdquo;, &ldquo;Good&rdquo;, and &ldquo;Happy&rdquo;. In contrast, the 1-star Yelp reviews use very little positive language, and instead discuss the number of &ldquo;minutes,&rdquo; presumably after long and unfortunate waits at the establishment. (Las Vegas is one of the cities where the reviews were collected, which is why it appears prominently in both 1-star and 5-star reviews.)</p>
<p>Looking at three-word phrases tells more of a story.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/Yelp-3-Gram-Small_hu_e46cca474f4fc455.webp 320w,/2014/09/one-star-five-stars/Yelp-3-Gram-Small_hu_d70187d5b3a7bd24.webp 768w,/2014/09/one-star-five-stars/Yelp-3-Gram-Small_hu_51480d3275b8941b.webp 1024w,/2014/09/one-star-five-stars/Yelp-3-Gram-Small.jpg 1200w" src="Yelp-3-Gram-Small.jpg"/> 
</figure>

<p>1-star reviews frequently contain warnings to potential customers, promises that the author will &ldquo;never go back&rdquo;, and a strong impression that issues stem from conflicts with &ldquo;the front desk&rdquo;, such as those at hotels. 5-star reviews &ldquo;love this place&rdquo; and &ldquo;can&rsquo;t wait to&rdquo; go back.</p>
<p>Can this language be used to predict reviews?</p>
<h2 id="regression-of-language">Regression of Language</h2>
<p>To determine the impact of positive and negative words on the number of stars given in a review, we can perform a simple linear regression of stars on the number of positive words in the review, the number of negative words in the review, and the number of words in the review itself (since the length of the review is related to the number of positive/negative words: the longer the review, the more words).</p>
<p>A quick-and-dirty way to determine the number of positive/negative words in a given Yelp review is to compare each word of the review against a lexicon of positive/negative words, and count the number of review words in the lexicon. In this case, I use the <a href="http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html">lexicons compiled by UIC professor Bing Liu</a>.</p>
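<p>As a rough illustration, here&rsquo;s a minimal sketch of that setup in R. The column names (<code>stars</code>, <code>text</code>) and the lexicon objects are assumptions for illustration, not necessarily the exact code used for this post:</p>
<pre tabindex="0"><code># reviews: data frame with a `stars` rating and the review `text`
# pos_lexicon / neg_lexicon: character vectors of the Liu lexicon words
count_matches &lt;- function(text, lexicon) {
  words &lt;- unlist(strsplit(tolower(text), "[^a-z']+"))
  sum(words %in% lexicon)
}

reviews$review_words &lt;- sapply(strsplit(reviews$text, "\\s+"), length)
reviews$pos_words &lt;- sapply(reviews$text, count_matches, lexicon = pos_lexicon)
reviews$neg_words &lt;- sapply(reviews$text, count_matches, lexicon = neg_lexicon)

fit &lt;- lm(stars ~ pos_words + neg_words + review_words, data = reviews)
summary(fit)
</code></pre>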
<p>Running a regression of # stars in a Yelp review on # positive words, # negative words, and # words in the review returns these results:</p>
<pre tabindex="0"><code>Coefficients:
               Estimate	 Std. Error  t value  Pr(&gt;|t|)
(Intercept)    3.692      1.670e-03  2210.0   &lt;2e-16 ***
pos_words      0.122      2.976e-04   411.3   &lt;2e-16 ***
neg_words     -0.154      4.887e-04  -315.9   &lt;2e-16 ***
review_words  -0.003      1.984e-05  -169.4   &lt;2e-16 ***


Residual standard error: 1.119 on 1125454 degrees of freedom
Multiple R-squared:  0.2589,	Adjusted R-squared:  0.2589
F-statistic: 1.311e+05 on 3 and 1125454 DF,  p-value: &lt; 2.2e-16
</code></pre><p>The regression output explains these things:</p>
<ul>
<li>If a reviewer posted a blank review with no text in it, the model predicts an average rating of 3.692.</li>
<li>Each positive word increases the predicted star rating by 0.122 on average (e.g. 8 positive words indicate a 1-star increase)</li>
<li>Each negative word decreases the predicted star rating by 0.154 on average (e.g. 6-7 negative words indicate a 1-star decrease)</li>
<li>The number of words in the review has a lesser, negative effect. (A 333-word review indicates a 1-star decrease, but the average Yelp review is about 130 words)</li>
<li>This model explains 25.89% of the variation in the number of stars given in a review. This sounds like a low percentage, but it is impressive for such a simple model using unstructured real-world data.</li>
</ul>
<p>All of these conclusions are <em>extremely</em> statistically significant due to the large sample size.</p>
<p>Additionally, you could rephrase the regression as a logistic classification problem, where reviews rated 1, 2, or 3 stars are classified as &ldquo;negative,&rdquo; and reviews rated 4 or 5 stars are classified as &ldquo;positive.&rdquo; Then, run a logistic regression to determine the likelihood that a given review is positive. Running this regression (not shown) results in a logistic model with up to <em>75% accuracy</em>, a noted improvement over the &ldquo;no information rate&rdquo; of 66%, which is the accuracy of a model that simply guesses that every review is positive. The logistic model also reaches similar conclusions for the predictor variables as the linear model.</p>
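<p>For the curious, a hedged sketch of that classification variant, reusing the hypothetical data frame from the earlier sketch (a reconstruction, not the exact code behind the accuracy figures above):</p>
<pre tabindex="0"><code># label reviews with 4 or 5 stars as "positive"
reviews$positive &lt;- as.integer(reviews$stars &gt;= 4)

logit_fit &lt;- glm(positive ~ pos_words + neg_words + review_words,
                 data = reviews, family = binomial)

# classify with a 50% probability cutoff and compute accuracy
pred &lt;- predict(logit_fit, type = "response") &gt; 0.5
mean(pred == reviews$positive)
</code></pre>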
<p>It can be proven that language has a strong statistical effect on review ratings, but that&rsquo;s intuitive enough. How have review ratings changed?</p>
<h2 id="1-star-and-5-star-reviews-visualized">1-Star and 5-Star Reviews, Visualized</h2>
<p>Since 2005, Yelp has had incredible growth in the number of new reviews.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-time-series_hu_c7a13c976c5495e.webp 320w,/2014/09/one-star-five-stars/yelp-review-time-series_hu_a1b4db49122e2298.webp 768w,/2014/09/one-star-five-stars/yelp-review-time-series_hu_f6943ed84c603de9.webp 1024w,/2014/09/one-star-five-stars/yelp-review-time-series.png 1200w" src="yelp-review-time-series.png"/> 
</figure>

<p>From that chart, it appears that each of the five rating brackets has grown at the same rate, but that isn&rsquo;t the case. Here&rsquo;s a chart showing how the proportions of new reviews of each rating have changed over time.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-time-proportion_hu_153ca7093861ad13.webp 320w,/2014/09/one-star-five-stars/yelp-review-time-proportion_hu_ae3ee3f52d98ee95.webp 768w,/2014/09/one-star-five-stars/yelp-review-time-proportion_hu_7953d250442d19e0.webp 1024w,/2014/09/one-star-five-stars/yelp-review-time-proportion.png 1200w" src="yelp-review-time-proportion.png"/> 
</figure>

<p>Early Yelp had mostly 4-star and 5-star reviews, as one might expect for an early Web 2.0 startup: the only users who would put in the effort to write a review were those who had positive experiences. However, the behavior from 2010 onward is interesting: the relative proportions of both 1-star reviews <em>and</em> 5-star reviews increase over time.</p>
<p>As a result, the proportions of ratings in reviews from Yelp&rsquo;s beginning in 2005 and its present in 2014 are incredibly different.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/Yelp-2005-2014_hu_39ad524a7eb9df70.webp 320w,/2014/09/one-star-five-stars/Yelp-2005-2014_hu_c9fa061ae2bda217.webp 768w,/2014/09/one-star-five-stars/Yelp-2005-2014_hu_395f9de928083b7c.webp 1024w,/2014/09/one-star-five-stars/Yelp-2005-2014.png 1600w" src="Yelp-2005-2014.png"/> 
</figure>

<p>More negativity, more positivity. Do they cancel out?</p>
<h2 id="how-positive-are-yelp-reviews">How Positive Are Yelp Reviews?</h2>
<p>We can calculate the relative <strong>positivity</strong> of a review by taking the number of positive words in the review and dividing it by the total number of words in the review.</p>
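<p>With the word counts from the earlier sketch, positivity (and the negativity score used later) is a simple ratio; again, the column names are assumptions:</p>
<pre tabindex="0"><code># positivity/negativity as fractions of total review words
reviews$positivity &lt;- reviews$pos_words / reviews$review_words
reviews$negativity &lt;- reviews$neg_words / reviews$review_words

mean(reviews$positivity)  # the post reports ~5.6% overall
mean(reviews$negativity)  # the post reports ~2.0% overall
</code></pre>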
<p>The average positivity among all reviews is <em>5.6%</em>. Over time, the positivity has been relatively flat.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-time-series-positivity_hu_f3209866d8404a7f.webp 320w,/2014/09/one-star-five-stars/yelp-review-time-series-positivity_hu_31c147031a1203e7.webp 768w,/2014/09/one-star-five-stars/yelp-review-time-series-positivity_hu_3f2e651879ad63c4.webp 1024w,/2014/09/one-star-five-stars/yelp-review-time-series-positivity.png 1200w" src="yelp-review-time-series-positivity.png"/> 
</figure>

<p>Flat, but still increasing, most likely due to the increasing proportion of 5-star reviews. But the proportion of 1-star reviews also increased: do the two offset each other?</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-positivity_hu_637131981fcef452.webp 320w,/2014/09/one-star-five-stars/yelp-review-positivity_hu_c921212195eb39d7.webp 768w,/2014/09/one-star-five-stars/yelp-review-positivity_hu_41cd19db408d2367.webp 1024w,/2014/09/one-star-five-stars/yelp-review-positivity.png 1200w" src="yelp-review-positivity.png"/> 
</figure>

<p>This histogram of positivity scores shows that 1-star reviews have low positivity and rarely high positivity, while 5-star reviews rarely have low positivity and instead skew very high. The distribution for each star rating is close to a <a href="http://en.wikipedia.org/wiki/Normal_distribution">Normal distribution</a>, with each successive rating category peaking at a higher positivity value.</p>
<p>The relative proportion of each star rating reinforces this.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-positivity-density_hu_f23db88cd1ac716.webp 320w,/2014/09/one-star-five-stars/yelp-review-positivity-density_hu_76cfe57f52f13af.webp 768w,/2014/09/one-star-five-stars/yelp-review-positivity-density_hu_fc116b0474c61bb9.webp 1024w,/2014/09/one-star-five-stars/yelp-review-positivity-density.png 1200w" src="yelp-review-positivity-density.png"/> 
</figure>

<p>Over half of the 0% positivity reviews are 1-star reviews, while over three-quarters of the reviews at the highest positivity levels are 5-star reviews. (Note that the 2-star, 3-star, and 4-star ratings are not as significant at either extreme.)</p>
<h2 id="how-negative-are-yelp-reviews">How Negative Are Yelp Reviews?</h2>
<p>When working with the negativity of reviews, calculated by taking the number of negative words and dividing it by the total number of words in the review, the chart looks much different.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-time-series-negativity_hu_86f7acc4985c237b.webp 320w,/2014/09/one-star-five-stars/yelp-review-time-series-negativity_hu_1809307409787cc1.webp 768w,/2014/09/one-star-five-stars/yelp-review-time-series-negativity_hu_d07d1e1fdef155a9.webp 1024w,/2014/09/one-star-five-stars/yelp-review-time-series-negativity.png 1200w" src="yelp-review-time-series-negativity.png"/> 
</figure>

<p>The average negativity among all reviews is <em>2.0%</em>. Since the average positivity is 5.6%, this implies that the net sentiment among all reviews is positive, despite the increase in 1-star reviews over time.</p>
<p>The histogram of negative reviews looks much different as well.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-negativity_hu_d5d4e839efa115f7.webp 320w,/2014/09/one-star-five-stars/yelp-review-negativity_hu_757c09188bd04406.webp 768w,/2014/09/one-star-five-stars/yelp-review-negativity_hu_9b14213657352703.webp 1024w,/2014/09/one-star-five-stars/yelp-review-negativity.png 1200w" src="yelp-review-negativity.png"/> 
</figure>

<p>Even 1-star reviews aren&rsquo;t completely negative all the time.</p>
<p>The chart is heavily skewed right, making it difficult to determine the proportions of each rating at first glance.</p>
<p>So here&rsquo;s another proportion chart.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-negativity-density_hu_dba0956b28ad9c05.webp 320w,/2014/09/one-star-five-stars/yelp-review-negativity-density_hu_6c544b1696ff5283.webp 768w,/2014/09/one-star-five-stars/yelp-review-negativity-density_hu_60ad65564b53f2bd.webp 1024w,/2014/09/one-star-five-stars/yelp-review-negativity-density.png 1200w" src="yelp-review-negativity-density.png"/> 
</figure>

<p>At low negativity, the proportions of negative review scores (1-star, 2-stars, 3-stars) and positive review scores (4-stars, 5-stars) are about equal, implying that negative reviews can be just as civil as positive reviews. But high negativity is solely present in 1-star and 2-star reviews.</p>
<p>From this article, you&rsquo;ve seen that 5-star Yelp reviews are generally positive and 1-star Yelp reviews are generally negative. Yes, this blog post is essentially &ldquo;Pretty Charts Made By Captain Obvious,&rdquo; but what&rsquo;s important is the confirmation of these assumptions. Language plays a huge role in determining the ratings of reviews, and that knowledge could be applied to many other industries and review websites.</p>
<h2 id="four-stars">Four Stars</h2>
<p>I&rsquo;d give this blog post a solid 4-stars. The content was great, but the length was long, although not as long as <a href="http://minimaxir.com/2014/06/reviewing-reviews/">some others</a>. Can&rsquo;t wait to read this post again!</p>
<hr>
<ul>
<li><em>Yelp reviews were preprocessed with Python by simultaneously converting the data from JSON to a tabular structure, tokenizing the words in each review, counting the positive/negative words, and storing bigrams and trigrams in a dictionary to later be exported for creating the word clouds.</em></li>
<li><em>All data analysis was performed using R, and all charts were made using ggplot2. <a href="http://www.pixelmator.com/">Pixelmator</a> was used to manually add relevant annotations when necessary.</em></li>
<li><em>You can view both the Python and R code used to process and chart the data <a href="https://github.com/minimaxir/yelp-review-analysis">in this GitHub repository</a>. Note that since Yelp prevents redistribution of the data, the code may not be reproducible.</em></li>
<li><em>You can download full-resolution PNGs of the two word clouds [5000x2000px] in <a href="https://www.dropbox.com/s/f20gwh9jvkibi4z/Yelp_Wordclouds_5000_200.zip?dl=0">this ZIP file</a> [18 MB]</em></li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Interesting Percentages of Female Students in MIT and Harvard Online Courses</title>
      <link>https://minimaxir.com/2014/07/gender-course/</link>
      <pubDate>Fri, 04 Jul 2014 10:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2014/07/gender-course/</guid>
      <description>The proportion of female students in each of Harvard and MIT&amp;rsquo;s online courses range from 5% to 49%.</description>
      <content:encoded><![CDATA[<p>At the end of May, <a href="http://www.harvard.edu/">Harvard</a> and <a href="http://web.mit.edu/">MIT</a> jointly <a href="http://newsoffice.mit.edu/2014/mit-and-harvard-release-de-identified-learning-data-open-online-courses">released a dataset</a> containing statistics about their online courses in the Academic Year of 2013. This <a href="http://dx.doi.org/10.7910/DVN/26147">Person-Course De-Identified dataset</a> contains 476,532 students who have taken up to 13 unique courses from a variety of topics:</p>
<figure>

    <img loading="lazy" srcset="/2014/07/gender-course/mit-harvard-courses_hu_7940e4f3b6f7a13a.webp 320w,/2014/07/gender-course/mit-harvard-courses.png 560w" src="mit-harvard-courses.png"/> 
</figure>

<p>About half of the courses involve subjects in the humanities, while the other half involve computer science and electrical engineering.</p>
<p>One of the statistics I wanted to analyze was the gender ratio of students in online courses. In the data set, 425,105 students have a gender on record, with 311,534 male students (73.3%) and 113,571 female students (26.7%). This population proportion of female students is surprisingly low, especially since the male/female ratio is <a href="http://colleges.findthebest.com/q/1929/1270/What-is-the-male-to-female-ratio-at-Harvard-University">about 50:50</a> at MIT and Harvard themselves.</p>
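<p>These proportions reduce to a few lines of R; here&rsquo;s a minimal sketch, assuming column names like <code>gender</code> and <code>course_id</code> (check the dataset&rsquo;s codebook for the real ones):</p>
<pre tabindex="0"><code># person_course: one row per (student, course) pair
with_gender &lt;- subset(person_course, gender %in% c("m", "f"))

# overall proportion of female students
mean(with_gender$gender == "f")

# proportion of female students per course
tapply(with_gender$gender == "f", with_gender$course_id, mean)
</code></pre>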
<p>Therefore, I took a look at the gender distribution of each of the 13 unique courses. Is the gender ratio similar across all classes, or is there a huge difference between classes?</p>
<figure>

    <img loading="lazy" srcset="/2014/07/gender-course/course-female_hu_8a3574152a2f4856.webp 320w,/2014/07/gender-course/course-female_hu_b147ce256fa08c5b.webp 768w,/2014/07/gender-course/course-female_hu_97ec908d28732b5f.webp 1024w,/2014/07/gender-course/course-female.png 1500w" src="course-female.png"/> 
</figure>

<p>Yeah, there&rsquo;s a huge difference.</p>
<p>The proportion of female students in each of Harvard and MIT&rsquo;s online courses range from <strong>5% to 49%</strong>.</p>
<p>The top half of the gender ratios are all well above the 26.7% average. All six of these courses are in the humanities or the life sciences. The bottom half of the gender ratios are all well below the 26.7% average. All seven of these courses are engineering or computer science courses with a strong focus on mathematics. (For clarification, the <a href="https://www.edx.org/course/mitx/mitx-2-01x-elements-structures-1759#.U7ZfKvldV8F">Elements of Structures</a> course at MIT is a physics course with linear algebra programming.)</p>
<p>Why is the overall proportion so low, then? As it turns out, both Harvard&rsquo;s Introduction to Computer Science I (169,621 students; about 40% of all students) and MIT&rsquo;s Introduction to CS/Programming (124,446 students across both semesters) are so popular that the low percentage of women in those particular classes drastically drags down the average.</p>
<p>The presence and interest of <a href="http://www.whitehouse.gov/administration/eop/ostp/women">women in STEM fields</a> (science, technology, engineering, and mathematics) has been a topic of <a href="http://www.huffingtonpost.com/stella-kasdagli/should-women-avoid-jobs-in-stem_b_5549016.html">controversy</a> for a very long time. However, the chart shows that indeed the percentage of women interested in STEM classes is measurably lower than other fields, and hopefully awareness of this issue will help cause changes in the future.</p>
<hr>
<ul>
<li><em>Data was processed using R and the chart was made using ggplot2. (w/ a few annotations added using a photo editor)</em></li>
<li><em>You can view code necessary to reproduce these results in <a href="https://github.com/minimaxir/gender-course">this GitHub repository</a>. Since MIT/Harvard prevent redistribution of the dataset, you&rsquo;ll have to <a href="http://dx.doi.org/10.7910/DVN/26147">download the dataset</a> yourself.</em></li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>A Statistical Analysis of 1.2 Million Amazon Reviews</title>
      <link>https://minimaxir.com/2014/06/reviewing-reviews/</link>
      <pubDate>Tue, 17 Jun 2014 08:20:00 -0700</pubDate>
      <guid>https://minimaxir.com/2014/06/reviewing-reviews/</guid>
      <description>Analyzing the dataset of 1.2 million Amazon reviews, I found some interesting statistical trends; some are intuitive and obvious, but others give insight to how Amazon&amp;rsquo;s review system actually works.</description>
      <content:encoded><![CDATA[<p>When buying the latest products on <a href="http://www.amazon.com/">Amazon</a>, reading reviews is an important part of the purchasing process.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/ore_hu_a023cb91d2d5bbec.webp 320w,/2014/06/reviewing-reviews/ore.png 554w" src="ore.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amazon-review_hu_e37f25bba24a903e.webp 320w,/2014/06/reviewing-reviews/amazon-review.png 495w" src="amazon-review.png"/> 
</figure>

<p>Customer reviews from customers who have actually purchased and used the product in question can give you more context to the product itself. Each reviewer rates the product from 1 to 5 stars, and provides a text summary of their experiences and opinions about the product. The ratings for each product are averaged together in order to get an overall product rating.</p>
<p>The number of reviews on Amazon has grown over the years.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-basic-time-count_hu_8c9a16a5c5892b45.webp 320w,/2014/06/reviewing-reviews/amzn-basic-time-count_hu_9ed51550cf6967d7.webp 768w,/2014/06/reviewing-reviews/amzn-basic-time-count_hu_5718b80f7ce8a708.webp 1024w,/2014/06/reviewing-reviews/amzn-basic-time-count.png 1200w" src="amzn-basic-time-count.png"/> 
</figure>

<p>But how do people write reviews? What types of ratings do reviewers give? How many of these reviews are considered helpful?</p>
<p>Stanford researchers Julian McAuley and Jure Leskovec collected <a href="https://snap.stanford.edu/data/web-Amazon.html">all Amazon reviews</a> from the service&rsquo;s online debut in 1995 to 2013. Analyzing the dataset of 1.2 million Amazon reviews of products in the Electronics section, I found some interesting statistical trends; some are intuitive and obvious, but others give insight into how Amazon&rsquo;s review system actually works.</p>
<h2 id="describing-the-data">Describing the Data</h2>
<p>First, let&rsquo;s see how the user ratings are distributed among the reviews.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-basic-score_hu_59c031b5274368e7.webp 320w,/2014/06/reviewing-reviews/amzn-basic-score_hu_6eb4ff83d005ab3.webp 768w,/2014/06/reviewing-reviews/amzn-basic-score_hu_c6472b19e29a9fe6.webp 1024w,/2014/06/reviewing-reviews/amzn-basic-score.png 1200w" src="amzn-basic-score.png"/> 
</figure>

<p>More than half of the reviews give a 5-star rating. Aside from perfect reviews, most reviewers give 4-star or 1-star ratings, with relatively few giving 2-star or 3-star ratings.</p>
<p>As a result, the statistical average for all review ratings is on the high end of the scale at about <strong>3.90</strong>. In fact, the average rating of newly-written reviews has varied from 3.4 to 4.2 over time.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-basic-time-rating_hu_52cea30c7eeb25e2.webp 320w,/2014/06/reviewing-reviews/amzn-basic-time-rating_hu_2957bab5d470c910.webp 768w,/2014/06/reviewing-reviews/amzn-basic-time-rating_hu_6a7aecd8bbdede75.webp 1024w,/2014/06/reviewing-reviews/amzn-basic-time-rating.png 1200w" src="amzn-basic-time-rating.png"/> 
</figure>

<p>Another metric used to measure reviews is review helpfulness. Other Amazon reviewers can rate a particular review as &ldquo;helpful&rdquo; or &ldquo;not helpful.&rdquo; A &ldquo;review helpfulness&rdquo; statistic can be calculated by taking the number of &ldquo;is-helpful&rdquo; indicators and dividing it by the total number of is-helpful/is-not-helpful indicators (in the example at the beginning of the article, 639/665 people found the review helpful, so the helpfulness rating would be 96%). This gives an indication of review quality to a prospective buyer. Only 10% of the reviews had at least 10 is-helpful/is-not-helpful data points, and of those reviews, the vast majority had perfect helpfulness scores.</p>
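<p>A quick sketch of that calculation, assuming vote-count fields named <code>helpful_yes</code> and <code>helpful_total</code> (hypothetical names for illustration):</p>
<pre tabindex="0"><code># helpfulness = is-helpful votes / total votes (NaN when no votes exist)
reviews$helpfulness &lt;- reviews$helpful_yes / reviews$helpful_total

639 / 665  # the example review above: ~0.96, i.e. 96% helpful

# keep only reviews with at least 10 votes before charting
rated &lt;- subset(reviews, helpful_total &gt;= 10)
</code></pre>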
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-basic-helpful_hu_6374fe7678c45848.webp 320w,/2014/06/reviewing-reviews/amzn-basic-helpful_hu_fa5de8d1b024cf5.webp 768w,/2014/06/reviewing-reviews/amzn-basic-helpful_hu_b43d4f9b8739f463.webp 1024w,/2014/06/reviewing-reviews/amzn-basic-helpful.png 1200w" src="amzn-basic-helpful.png"/> 
</figure>

<p>That would make sense; if you&rsquo;re writing a review (especially a 5 star review), you&rsquo;re writing with the intent to help other prospective buyers.</p>
<p>Another consideration is review length. Do reviewers frequently write essays, or do they typically write a single paragraph?</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-basic-length_hu_47c45b6e84c3e877.webp 320w,/2014/06/reviewing-reviews/amzn-basic-length_hu_36ecbd0ae249486f.webp 768w,/2014/06/reviewing-reviews/amzn-basic-length_hu_42feb4443896b5f3.webp 1024w,/2014/06/reviewing-reviews/amzn-basic-length.png 1200w" src="amzn-basic-length.png"/> 
</figure>

<p>Most reviews are 100-150 characters, but the average number of characters in a review is about <strong>582</strong> (there are some outlier reviews with 30,000+ characters!). Assuming that the average paragraph <a href="http://wiki.answers.com/Q/How_many_characters_does_the_average_paragraph_have">has 352 characters</a>, reviewers typically write about half a paragraph. Interestingly, reviews are rarely less than a sentence. (The <a href="http://www.amazon.com/gp/community-help/customer-reviews-guidelines">Review Guidelines</a> suggest a minimum of 20 words in a review, so this discrepancy could be attributed to moderator removal of short, one-liner reviews.)</p>
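<p>Character counts like these are a one-liner in R, assuming a <code>text</code> column as in the sketches above:</p>
<pre tabindex="0"><code># distribution of review lengths, in characters
summary(nchar(reviews$text))
mean(nchar(reviews$text))  # ~582 in this dataset
</code></pre>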
<h2 id="particularizing-the-products">Particularizing the Products</h2>
<p>The 1.2 million reviews in the Electronics data set address 82,003 distinct products. However, most of those entries represent different SKUs of the same product (e.g. different colors of headphones). Of those products, only 30,577 have pricing information which identifies them as the source product.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-product-price_hu_58971098a111aa36.webp 320w,/2014/06/reviewing-reviews/amzn-product-price_hu_3ef361e65687d666.webp 768w,/2014/06/reviewing-reviews/amzn-product-price_hu_89981cc6ca8be307.webp 1024w,/2014/06/reviewing-reviews/amzn-product-price.png 1200w" src="amzn-product-price.png"/> 
</figure>

<p>Over two-thirds of Amazon Electronics products are priced between $0 and $50, which makes sense, as popular electronics such as television remotes and phone cases are not extremely expensive. However, there&rsquo;s no statistical correlation between the price of a product and the number of reviews it receives.</p>
<p>For the overall rating of a particular product, which is the average rating of all reviews for that product, the ratings are no longer limited to discrete numbers between 1 and 5, and can take decimal values between those numbers as well. The distribution of product ratings is similar to the distribution of review ratings.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-product-rating_hu_1521ed87824a3e14.webp 320w,/2014/06/reviewing-reviews/amzn-product-rating_hu_bbcde884366ab6c2.webp 768w,/2014/06/reviewing-reviews/amzn-product-rating_hu_f305fe55ecfa3298.webp 1024w,/2014/06/reviewing-reviews/amzn-product-rating.png 1200w" src="amzn-product-rating.png"/> 
</figure>

<p>Again, the perfect rating of 5 is most popular for products. This distribution resembles the distribution of scores of all reviews for the discrete rating values, but this view reveals local maxima at the midpoint between each discrete value. (i.e. 3-and-a-half stars and 4-and-a-half stars are surprisingly common ratings)</p>
<p>What happens when you plot product rating and product price together?</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-product-score-price_hu_b271b61ddc8a67b5.webp 320w,/2014/06/reviewing-reviews/amzn-product-score-price_hu_5448fe68fcfc3bb8.webp 768w,/2014/06/reviewing-reviews/amzn-product-score-price_hu_65b3de5328ae68dd.webp 1024w,/2014/06/reviewing-reviews/amzn-product-score-price.png 1200w" src="amzn-product-score-price.png"/> 
</figure>

<p>The most expensive products have 4-star and 5-star overall ratings, but not 1-star and 2-star ratings. However, the correlation is very weak. (r = 0.04)</p>
<p>In contrast, the relationship between product price and the average <em>length</em> of reviews for the product is surprising.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-product-price-length_hu_31310358eb50b709.webp 320w,/2014/06/reviewing-reviews/amzn-product-price-length_hu_cde21bf380b44ae5.webp 768w,/2014/06/reviewing-reviews/amzn-product-price-length_hu_c8de453ae8e3e19d.webp 1024w,/2014/06/reviewing-reviews/amzn-product-price-length.png 1200w" src="amzn-product-price-length.png"/> 
</figure>

<p>This relationship is logarithmic with a relatively good correlation (r = 0.29), and it shows that reviewers put more time and effort into reviewing products which are worth more.</p>
<h2 id="reviewing-the-reviewers">Reviewing the Reviewers</h2>
<p>As you might expect, most people leave only 1 or 2 reviews on Amazon, but some have left <em>hundreds</em> of reviews. Out of 1.2 million reviews, there are 510,434 distinct reviewers.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-reviewer-count_hu_6f6049ea689edace.webp 320w,/2014/06/reviewing-reviews/amzn-reviewer-count_hu_347234ec3a8db6e6.webp 768w,/2014/06/reviewing-reviews/amzn-reviewer-count_hu_d3bd3fe21815e2a9.webp 1024w,/2014/06/reviewing-reviews/amzn-reviewer-count.png 1200w" src="amzn-reviewer-count.png"/> 
</figure>

<p>Over 80% of the reviewers of Amazon electronics left only 1 review. Analyzing reviewers who have left only 1 review is not helpful statistically, so for the rest of the analysis, only reviewers who have written 5 or more reviews (whose reviews have received at least 1 is-helpful/is-not-helpful indicator) will be considered. This makes it much easier to get the overall profile of a reviewer. 11,676 reviewers fit these criteria.</p>
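<p>A sketch of that per-reviewer aggregation, again with assumed column names (<code>reviewer_id</code>, <code>stars</code>, <code>helpfulness</code>):</p>
<pre tabindex="0"><code># average rating and helpfulness for each reviewer
per_reviewer &lt;- aggregate(cbind(stars, helpfulness) ~ reviewer_id,
                          data = reviews, FUN = mean)

# attach each reviewer's review count, then filter to 5+ reviews
counts &lt;- table(reviews$reviewer_id)
per_reviewer$n_reviews &lt;- as.integer(counts[as.character(per_reviewer$reviewer_id)])
repeat_reviewers &lt;- subset(per_reviewer, n_reviews &gt;= 5)
nrow(repeat_reviewers)  # 11,676 in this post's data
</code></pre>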
<p>Do repeat Amazon users tend to give 5-star reviews?</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-reviewer-score_hu_732050183c11716.webp 320w,/2014/06/reviewing-reviews/amzn-reviewer-score_hu_8ebfb0af49fa84ed.webp 768w,/2014/06/reviewing-reviews/amzn-reviewer-score_hu_cbe8342aed8a7b4b.webp 1024w,/2014/06/reviewing-reviews/amzn-reviewer-score.png 1200w" src="amzn-reviewer-score.png"/> 
</figure>

<p>The distribution of review ratings, when averaged across each reviewer, is similar to the other distributions of review ratings. However, this distribution is less skewed toward 5 stars and is more uniform between 4 stars and 5 stars.</p>
<p>What about the average helpfulness of the reviews written by a single reviewer? If a reviewer has enjoyed Amazon enough such that they make 5 or more reviews, chances are that their reviews are high quality.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-reviewer-helpfulness_hu_3396b6f6f7c36442.webp 320w,/2014/06/reviewing-reviews/amzn-reviewer-helpfulness_hu_709e280dd4ad021b.webp 768w,/2014/06/reviewing-reviews/amzn-reviewer-helpfulness_hu_42e3979cc26ce7cd.webp 1024w,/2014/06/reviewing-reviews/amzn-reviewer-helpfulness.png 1200w" src="amzn-reviewer-helpfulness.png"/> 
</figure>

<p>Again, the data is slightly skewed. 8% of the reviewers have perfect helpfulness scores on all their reviews, and the average helpfulness score for all repeat reviews is 80%. Interestingly, a few repeat reviewers have average helpfulness scores of 0.</p>
<p>If you plot <em>both</em> average score and average helpfulness in a single chart, the picture becomes much more clear:</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-reviewer-count-score_hu_d5f52b74f0303f85.webp 320w,/2014/06/reviewing-reviews/amzn-reviewer-count-score_hu_c2ce7b3548ec0cce.webp 768w,/2014/06/reviewing-reviews/amzn-reviewer-count-score_hu_d9432e36b058bedc.webp 1024w,/2014/06/reviewing-reviews/amzn-reviewer-count-score.png 1200w" src="amzn-reviewer-count-score.png"/> 
</figure>

<p>As the chart shows, there&rsquo;s a good positive correlation (r = 0.27) between rating and helpfulness, with a discernible cluster at the top. However, I don&rsquo;t think it&rsquo;s a causal relationship. Reviewers who give a product a 4 - 5 star rating are more passionate about the product and likely to write better reviews than someone who writes a 1 - 2 star &ldquo;this product sucks and you suck too!&rdquo; review.</p>
<p>Another interesting bivariate relationship is the one between the helpfulness of a review and the length of a review. Intuitively, you might think that longer reviews are more helpful reviews. And in the case of Amazon&rsquo;s Electronics reviews, you&rsquo;d be correct.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-reviewer-helpful-length_hu_130d8f3444e197a4.webp 320w,/2014/06/reviewing-reviews/amzn-reviewer-helpful-length_hu_93c5104c96670a15.webp 768w,/2014/06/reviewing-reviews/amzn-reviewer-helpful-length_hu_7f33170521ce93ef.webp 1024w,/2014/06/reviewing-reviews/amzn-reviewer-helpful-length.png 1200w" src="amzn-reviewer-helpful-length.png"/> 
</figure>

<p>Again, there&rsquo;s a good positive correlation (r = 0.26) between average helpfulness and average length, which the trend line supports. (The dip at the end is caused by the high number of low-character reviews.) All the longer reviews have high helpfulness; there are very, very few unhelpful reviews that are also long.</p>
<h2 id="completing-the-conclusion">Completing the Conclusion</h2>
<p>The reviews on Amazon&rsquo;s Electronics products very frequently rate the product 4 or 5 stars, and such reviews are almost always considered helpful. 1-star reviews are used to signify disapproval, and 2-star and 3-star reviews have no significant impact at all. If that&rsquo;s the case, then what&rsquo;s the point of having a 5-star rating system at all if the vast majority of reviewers favor the product? Would Amazon benefit if they made review ratings a binary like/dislike?</p>
<p>Having a 5-star system can allow the prospective customer to make more informed comparisons between two products: a customer may be more likely to buy a product that&rsquo;s rated 4.2 stars than a product that is rated 3.8 stars, which is a subtlety that can&rsquo;t easily be emulated with a like/dislike system. Likewise, if products are truly bad, the propensity toward 5-star reviews can help obfuscate the low quality of the product when a like/dislike system would make the low quality more apparent.</p>
<p>Unfortunately, only Amazon has the data that would answer all these questions.</p>
<p>Of course, there are many other secrets to be uncovered from Amazon reviews. The Stanford researchers who collected the initial data used <a href="http://i.stanford.edu/~julian/pdfs/recsys13.pdf">machine learning techniques on the review text</a> to predict the rating given by a review from the review text alone. Other potential topics for analysis are comparisons between <em>types</em> of Electronics (e.g. MP3 players, headphones) or using natural language processing to determine the common syntax in reviews.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-word-review-start_hu_d1ceead5636a4804.webp 320w,/2014/06/reviewing-reviews/amzn-word-review-start_hu_5932602b953da6be.webp 768w,/2014/06/reviewing-reviews/amzn-word-review-start_hu_d484e026176d66f7.webp 1024w,/2014/06/reviewing-reviews/amzn-word-review-start.png 1200w" src="amzn-word-review-start.png"/> 
</figure>

<p>That&rsquo;s a topic for another blog post. :)</p>
<hr>
<ul>
<li><em>Data analysis was performed using R, and all charts were made using ggplot2.</em></li>
<li><em>You can download a ZIP file containing CSVs of the time series, the aggregate product data, and the anonymized aggregate reviewer data <a href="https://dl.dropboxusercontent.com/u/2017402/amazon_data.zip">here</a>.</em></li>
<li><em>No, I have no relation to &ldquo;<a href="http://www.amazon.com/review/R1KHEP16MXXWCN/ref=cm_cr_rdp_perm?ie=UTF8&amp;ASIN=B000796XXM">M. Wolff</a>&rdquo;.</em></li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
