Data Science

Playing with 80 Million Amazon Product Review Ratings Using Apache Spark

Manipulating actually-big-data is just as easy as performing an analysis on a dataset with only a few records.

What Percent of the Top-Voted Comments in Reddit Threads Were Also 1st Comment?

Are commenters who claim to be 'late to this thread' actually late?

Visualizing How Developers Rate Their Own Programming Skills

As it turns out, there is no correlation between programming ability and the frequency of Stack Overflow visits.

Methods for Finding Related Reddit Subreddits with Simple Set Theory

Fancy machine learning approaches may not be required to help Redditors discover new things.

How to Create a Network Graph Visualization of Reddit Subreddits

There is very little discussion of how to gather the data for large-scale network graph visualizations or how to build them. It is time to fix that.

Blockbuster Movies with Male Leads Earn More Than Those with Female Leads

On average, blockbuster movies with male leads generate 22% more domestic box office revenue, and this difference is statistically significant.

Facebook Reactions and the Problems With Quantifying Likes Differently

Apparently, there is little statistical relationship between things that are cute and things that make you go YAAASS.

Video Games and Charity: Analyzing Awesome Games Done Quick 2016 Donations

Were frames killed? Were animals saved?

Movie Review Aggregator Ratings Have No Relationship with Box Office Success

Perhaps the movie rating system itself is broken.

Let's Code an Analysis and Visualizations of Yelp Data using R and ggplot2

The first of (hopefully) many 1440p/60fps videos of fun data analysis. With many fun errors too!