Big Data

Problems with Predicting Post Performance on Reddit and Other Link Aggregators

The nature of algorithmic feeds like Reddit inherently leads to survivorship bias: although users may recognize certain types of posts that appear on the front page, many more posts follow the same patterns but fail.

Analyzing IMDb Data The Intended Way, with R and ggplot2

For IMDb's big-but-not-big data, you have to play with the data smartly, and both R and ggplot2 have neat tricks to do just that.

Visualizing One Million NCAA Basketball Shots

Although visualizing basketball shots has been done before, this time we have access to an order of magnitude more public data to do some really cool stuff.

Playing with 80 Million Amazon Product Review Ratings Using Apache Spark

Manipulating actually-big-data is just as easy as performing an analysis on a dataset with only a few records.

What Percent of the Top-Voted Comments in Reddit Threads Were Also 1st Comment?

Are commenters 'late to this thread' indeed late?

Methods for Finding Related Reddit Subreddits with Simple Set Theory

Fancy machine learning approaches may not be required to help Redditors discover new things.

How to Create a Network Graph Visualization of Reddit Subreddits

There is very little discussion on how to gather the data for large-scale network graph visualizations, and how to make them. It is time to fix that.

Mapping Where Arrests Frequently Occur in San Francisco Using Crime Data

Let's plot 587,499 arrests on top of a map of San Francisco for fun and see what happens.

Analyzing San Francisco Crime Data to Determine When Arrests Frequently Occur

Spoilers: Most arrests in San Francisco happen on Wednesdays at 4-5 PM. For some reason.

How to Visualize New York City Using Taxi Location Data and ggplot2

I previously posted a visualization of NYC taxis made with ggplot2. Due to popular demand, I've cleaned up the code and released it as open source, with a few improvements.