<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Data Engineering on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/category/data-engineering/</link>
    <description>Recent content in Data Engineering on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Wed, 23 Oct 2019 09:00:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/category/data-engineering/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Visualizing Airline Flight Characteristics Between SFO and JFK</title>
      <link>https://minimaxir.com/2019/10/sfo-jfk-flights/</link>
      <pubDate>Wed, 23 Oct 2019 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2019/10/sfo-jfk-flights/</guid>
      <description>Box plots, when used correctly, can be a very fun way to visualize big data.</description>
      <content:encoded><![CDATA[<p>In March, <a href="https://cloud.google.com">Google Cloud Platform</a> developer advocate <a href="https://twitter.com/felipehoffa">Felipe Hoffa</a> made a tweet about airline flight data from San Francisco International Airport (SFO) to Seattle-Tacoma International Airport (SEA):</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/felipehoffa/status/1111050585120206848"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>In particular, his visualization of total elapsed times by airline caught my eye.</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu_33d3683c2d4a611e.webp 320w,/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu_1c609cadbe91671c.webp 768w,/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD_hu_3135cb9a9bbaf839.webp 1024w,/2019/10/sfo-jfk-flights/D2s9oFtX4AEK6nD.jpeg 1200w" src="D2s9oFtX4AEK6nD.jpeg"/> 
</figure>

<p>The overall time for flights from SFO to SEA goes up drastically starting in 2015, and this increase occurs across multiple airlines, implying that it&rsquo;s not an airline-specific problem. But what could intuitively cause that?</p>
<p>U.S. domestic airline data is <a href="https://www.transtats.bts.gov/Tables.asp?DB_ID=120">freely distributed</a> by the United States Department of Transportation. Normally it&rsquo;s a pain to work with as it&rsquo;s very large with millions of rows, but BigQuery makes playing with such data relatively easy, fun, and free. What other interesting factoids can be found?</p>
<h2 id="expanding-on-sfo--sea">Expanding on SFO → SEA</h2>
<p><a href="https://cloud.google.com/bigquery/">BigQuery</a> is a big data warehousing tool that allows you to query massive amounts of data. The table Hoffa created from the airline data (<code>fh-bigquery.flights.ontime_201903</code>) is 83.37 GB and 184 <em>million</em> rows. You can query 1 TB of data from it for free, but since BQ will only query against the fields you request, the queries in this post only consume about 2 GB each, allowing you to run them well within that quota.</p>
<p>Hoffa&rsquo;s query that runs on BigQuery looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="p">,</span><span class="w"> </span><span class="n">Reporting_Airline</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">ActualElapsedTime</span><span class="p">)</span><span class="w"> </span><span class="n">ActualElapsedTime</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">TaxiOut</span><span class="p">)</span><span class="w"> </span><span class="n">TaxiOut</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">TaxiIn</span><span class="p">)</span><span class="w"> </span><span class="n">TaxiIn</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">AirTime</span><span class="p">)</span><span class="w"> </span><span class="n">AirTime</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">c</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">flights</span><span class="p">.</span><span class="n">ontime_201903</span><span class="o">`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">Origin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SFO&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">Dest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SEA&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2010-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">DESC</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="w"> </span><span class="k">DESC</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">LIMIT</span><span class="w"> </span><span class="mi">1000</span><span class="w">
</span></span></span></code></pre></div><p>For each year and airline after 2010, the query calculates the average metrics specified for flights on the SFO → SEA route.</p>
<p>I made a few query and data visualization tweaks to what Hoffa did above, and here&rsquo;s the result showing the increase in elapsed airline flight time over time for that route:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_e232d6eeab7fb66.webp 320w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_948de6a062caeaca.webp 768w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_6ae123a09b30ff70.webp 1024w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration.png 1800w" src="sfo_sea_flight_duration.png"/> 
</figure>

<p>Let&rsquo;s explain what&rsquo;s going on here.</p>
<p>A common recommendation in statistics is to avoid using <a href="https://en.wikipedia.org/wiki/Average">averages</a> as a summary statistic whenever possible, as averages can be overly affected by strong outliers (and with airline flights, there are definitely strong outliers!). The solution is to use a <a href="https://en.wikipedia.org/wiki/Median">median</a> instead, but there&rsquo;s one problem: medians are hard and <a href="https://www.periscopedata.com/blog/medians-in-sql">computationally complex</a> to calculate compared to simple averages. Despite the rise of &ldquo;big data&rdquo;, most databases and BI tools don&rsquo;t have a <code>MEDIAN</code> function that&rsquo;s as easy to use as an <code>AVG</code> function. But BigQuery has an uncommon <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions#approx_quantiles">APPROX_QUANTILES</a> function, which calculates the specified number of quantiles; for example, if you call <code>APPROX_QUANTILES(ActualElapsedTime, 100)</code>, it will return an array of 101 values splitting the data into 100 quantiles (from the approximate minimum to the approximate maximum), where the median is the value at offset 50. BigQuery <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate-aggregation">uses</a> approximate sketching algorithms, in the same spirit as the <a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLog++</a> sketch behind its approximate distinct counts, to calculate these quantiles efficiently even with millions of data points. But since we get other percentiles like the 5th, 25th, 75th, and 95th for free with that approach, we can visualize the <em>spread</em> of the data.</p>
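<p>As a concrete illustration of how that array is indexed (a minimal sketch reusing the table and filters from Hoffa&rsquo;s query above), this query pulls the median and the 5th/95th percentiles of elapsed time for the SFO → SEA route directly from the quantile arrays:</p>
<pre tabindex="0"><code>#standardSQL
SELECT
  -- median elapsed time, in minutes
  APPROX_QUANTILES(ActualElapsedTime, 100)[OFFSET(50)] AS median_time,
  -- 5th and 95th percentiles, for a sense of the spread
  APPROX_QUANTILES(ActualElapsedTime, 100)[OFFSET(5)] AS p5_time,
  APPROX_QUANTILES(ActualElapsedTime, 100)[OFFSET(95)] AS p95_time
FROM `fh-bigquery.flights.ontime_201903`
WHERE Origin = &#39;SFO&#39;
AND Dest = &#39;SEA&#39;
</code></pre><p>Written this way, the quantile array is recomputed for each percentile; the query below avoids that by computing the array once in a subquery and indexing it in the outer <code>SELECT</code>.</p>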
<p>We can aggregate the data by month for more granular trends and calculate the <code>APPROX_QUANTILES</code> in a subquery so the array only has to be computed once. Hoffa also uploaded a more recent table (<code>fh-bigquery.flights.ontime_201908</code>) with a few additional months of data. To keep things simple, we&rsquo;ll skip aggregating by airline, since the metrics do not vary strongly between airlines. The final query ends up looking like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="n">standardSQL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="p">,</span><span class="w"> </span><span class="n">num_flights</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">5</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_5</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">25</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_25</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">50</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_50</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">75</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_75</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">time_q</span><span class="p">[</span><span class="k">OFFSET</span><span class="p">(</span><span class="mi">95</span><span class="p">)]</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">q_95</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">num_flights</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">APPROX_QUANTILES</span><span class="p">(</span><span class="n">ActualElapsedTime</span><span class="p">,</span><span class="w"> </span><span class="mi">100</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">time_q</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">flights</span><span class="p">.</span><span class="n">ontime_201908</span><span class="o">`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">Origin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SFO&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">Dest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;SEA&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">AND</span><span class="w"> </span><span class="n">FlightDate_year</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2010-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">Year</span><span class="p">,</span><span class="w"> </span><span class="k">Month</span><span class="w">
</span></span></span></code></pre></div><p>The resulting data table:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/table_hu_98a96a00ebd58c2c.webp 320w,/2019/10/sfo-jfk-flights/table_hu_9eddda8c57624a2.webp 768w,/2019/10/sfo-jfk-flights/table.png 932w" src="table.png"/> 
</figure>

<p>In retrospect, since we&rsquo;re only focusing on one route, it isn&rsquo;t <em>big</em> data (this query only returns data on 64,356 flights total), but the approach is still very useful if you need to analyze more of the airline data (the <code>APPROX_QUANTILES</code> function can handle <em>millions</em> of data points very quickly).</p>
<p>As a professional data scientist, one of my favorite types of data visualization is a <a href="https://en.wikipedia.org/wiki/Box_plot">box plot</a>, as it provides a way to visualize spread without being visually intrusive. Data visualization tools like <a href="https://www.r-project.org">R</a> and <a href="https://ggplot2.tidyverse.org/index.html">ggplot2</a> make constructing them <a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">very easy to do</a>.</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/geom_boxplot-1_hu_9a623aa679dafed1.webp 320w,/2019/10/sfo-jfk-flights/geom_boxplot-1_hu_67cf70ba510d1672.webp 768w,/2019/10/sfo-jfk-flights/geom_boxplot-1_hu_c405dbc443ae9fa8.webp 1024w,/2019/10/sfo-jfk-flights/geom_boxplot-1.png 1400w" src="geom_boxplot-1.png"/> 
</figure>

<p>By default, for each box representing a group, the thick line in the middle of the box is the median, the lower bound of the box is the 25th percentile, and the upper bound is the 75th percentile. The whiskers are normally a function of the <a href="https://en.wikipedia.org/wiki/Interquartile_range">interquartile range</a> (IQR), but if there&rsquo;s enough data, I prefer to use the 5th and 95th percentiles instead.</p>
<p>If you feed ggplot2&rsquo;s <code>geom_boxplot()</code> raw data, it will automatically calculate the corresponding metrics for visualization; however, with big data, the data may not fit into memory, and, as noted earlier, medians and other quantiles are computationally expensive to calculate. Because we precomputed the quantiles with the query above for every year and month, we can use those explicitly. (The minor downside is that this will not include outliers.)</p>
<p>Additionally, for box plots, I like to fill in each box with a different color corresponding to the year in order to better perceive data <a href="https://en.wikipedia.org/wiki/Seasonality">seasonality</a>. In the case of airline flights, seasonality is more literal: weather has an intuitive impact on flight times and delays, and winter months also include holidays, which could affect airline logistics.</p>
<p>The resulting ggplot2 code looks like this:</p>
<pre tabindex="0"><code>plot &lt;-
  ggplot(df_tf,
         aes(
           x = date,
           ymin = q_5,
           lower = q_25,
           middle = q_50,
           upper = q_75,
           ymax = q_95,
           group = date,
           fill = year_factor
         )) +
  geom_boxplot(stat = &#34;identity&#34;, size = 0.3) +
  scale_fill_hue(l = 50, guide = F) +
  scale_x_date(date_breaks = &#39;1 year&#39;, date_labels = &#34;%Y&#34;) +
  scale_y_continuous(breaks = pretty_breaks(6)) +
  labs(
    title = &#34;Distribution of Flight Times of Flights From SFO → SEA, by Month&#34;,
    subtitle = &#34;via US DoT. Box bounds are 25th/75th percentiles, whiskers are 5th/95th percentiles.&#34;,
    y = &#39;Total Elapsed Flight Time (Minutes)&#39;,
    fill = &#39;&#39;,
    caption = &#39;Max Woolf — minimaxir.com&#39;
  ) +
  theme(axis.title.x = element_blank())

ggsave(&#39;sfo_sea_flight_duration.png&#39;,
       plot,
       width = 6,
       height = 4)
</code></pre><p>And behold (again)!</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_e232d6eeab7fb66.webp 320w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_948de6a062caeaca.webp 768w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration_hu_6ae123a09b30ff70.webp 1024w,/2019/10/sfo-jfk-flights/sfo_sea_flight_duration.png 1800w" src="sfo_sea_flight_duration.png"/> 
</figure>

<p>You can see that the boxes do indeed trend upward after 2016, although per-month medians are in flux. The spread is also increasing slowly over time. But what&rsquo;s interesting is the seasonality; pre-2016, the summer months (the &ldquo;middle&rdquo; of a given color) have a <em>very</em> significant drop in total time, which doesn&rsquo;t occur as strongly after 2016. Hmm.</p>
<h2 id="sfo-and-jfk">SFO and JFK</h2>
<p>Since I occasionally fly from San Francisco to New York City, it might be interesting (for completely selfish reasons) to track trends over time for flights between those areas. On the San Francisco side I chose SFO, and for the New York side I chose John F. Kennedy International Airport (JFK), as the data goes back very far for those routes specifically, and I only want to look at a single airport at a time (instead of including other NYC airports such as Newark Liberty International Airport [EWR] and LaGuardia Airport [LGA]) to limit potential data confounders.</p>
<p>Fortunately, the code and query changes are minimal: in the query, change the <code>Origin</code> and <code>Dest</code> values in the <code>WHERE</code> clause to the airports you want, and if you want to calculate a metric other than elapsed time, change the field passed to <code>APPROX_QUANTILES</code> accordingly.</p>
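<p>Concretely, only the inner aggregation needs to change; here&rsquo;s a sketch for the SFO → JFK direction (the outer <code>SELECT</code> that indexes the quantile array stays the same, and the date filter is dropped since we want the full history for this route):</p>
<pre tabindex="0"><code>SELECT Year, Month,
  COUNT(*) as num_flights,
  APPROX_QUANTILES(ActualElapsedTime, 100) AS time_q
FROM `fh-bigquery.flights.ontime_201908`
WHERE Origin = &#39;SFO&#39;
AND Dest = &#39;JFK&#39;
GROUP BY Year, Month
</code></pre>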
<p>Here&rsquo;s the chart of total elapsed time from SFO → JFK:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu_230bbe279f54a805.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu_c2e4a5d4b43ce24e.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration_hu_2ea286d0e1e5d794.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_duration.png 1800w" src="sfo_jfk_flight_duration.png"/> 
</figure>

<p>And here&rsquo;s the reverse, from JFK → SFO:</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu_4424fffe053981c8.webp 320w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu_ace5c5c4f6b82a9a.webp 768w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration_hu_5d29021a8362404b.webp 1024w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_duration.png 1800w" src="jfk_sfo_flight_duration.png"/> 
</figure>

<p>Unlike the SFO → SEA charts, both charts are relatively flat over the years. However, when looking at seasonality, SFO → JFK dips in the summer and spikes during winter, while JFK → SFO <em>does the complete opposite</em>: dips during the winter and spikes during the summer, which is similar to the SFO → SEA route. I don&rsquo;t have any guesses what would cause that behavior.</p>
<p>How about flight speed (calculated as distance divided by air time)? Have new advances in airline technology made planes faster and/or more efficient?</p>
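<p>Speed isn&rsquo;t a raw field in the table, so it has to be derived. Here&rsquo;s a rough sketch of the aggregation; note that it assumes the table keeps the BTS <code>Distance</code> field (in miles) alongside <code>AirTime</code> (in minutes), which is an assumption about this particular table rather than something shown earlier:</p>
<pre tabindex="0"><code>SELECT Year, Month,
  -- minutes are converted to hours so the ratio comes out in miles per hour
  APPROX_QUANTILES(Distance / (AirTime / 60), 100) AS speed_q
FROM `fh-bigquery.flights.ontime_201908`
WHERE Origin = &#39;SFO&#39;
AND Dest = &#39;JFK&#39;
AND AirTime &gt; 0  -- guard against division by zero
GROUP BY Year, Month
</code></pre>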
<p><figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu_9bbb991fb8674a3f.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu_d4b14a4133ff0b82.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed_hu_7266f1a8d449775b.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_flight_speed.png 1800w" src="sfo_jfk_flight_speed.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu_86e7c997338f1404.webp 320w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu_1680890adf0e2d82.webp 768w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed_hu_942e26ae57610365.webp 1024w,/2019/10/sfo-jfk-flights/jfk_sfo_flight_speed.png 1800w" src="jfk_sfo_flight_speed.png"/> 
</figure>
</p>
<p>The expected flight speed for a commercial airplane, <a href="https://en.wikipedia.org/wiki/Cruise_%28aeronautics%29">per Wikipedia</a>, is 547-575 mph, so the metrics from SFO pass the sanity check. The metrics from JFK indicate there&rsquo;s about a 20% drop in flight speed, likely because westbound flights fly against the prevailing jet stream winds, which makes sense. Month-to-month, the speed trends are inverse to the total elapsed time, which makes sense intuitively as they are strongly negatively correlated.</p>
<p>Lastly, what about flight departure delays? Are airlines becoming more efficient, or has increased demand caused more congestion?</p>
<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu_82c27db5d16562f9.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu_b017086eec0a8d63.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_hu_3a8b126a0bfc0d76.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay.png 1800w" src="sfo_jfk_departure_delay.png"/> 
</figure>

<p>Wait a second. In this case, massive 2-3 hour flight delays are frequent enough that even just the 95th percentile skews the entire plot. Let&rsquo;s remove the whiskers in order to look at trends more clearly.</p>
<p><figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu_c2eb7d1ad6cdf7.webp 320w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu_86b737333ad479f4.webp 768w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers_hu_fd6ad349f57f4bbe.webp 1024w,/2019/10/sfo-jfk-flights/sfo_jfk_departure_delay_nowhiskers.png 1800w" src="sfo_jfk_departure_delay_nowhiskers.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu_1fecf180ed6a5feb.webp 320w,/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu_626df458859e27b7.webp 768w,/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers_hu_58e7e7ba605d269e.webp 1024w,/2019/10/sfo-jfk-flights/jfk_sfo_departure_delay_nowhiskers.png 1800w" src="jfk_sfo_departure_delay_nowhiskers.png"/> 
</figure>
</p>
<p>A negative delay implies the flight left early, so we can conclude that the typical flight leaves slightly earlier than the stated departure time. Even without the whiskers, we can see major spikes at the 75th percentile level for summer months, and those spikes were especially bad in 2017 for both airports.</p>
<p>These box plots are only an <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">exploratory data analysis</a>. Determining the <em>cause</em> of changes in these flight metrics is difficult even for experts (I am definitely not an expert!) and may not even be possible to determine from publicly available data.</p>
<p>But there are still other fun things that can be done with the airline flight data, such as faceting airline trends over time and including other airports, which is <a href="https://twitter.com/minimaxir/status/1115261670153048065"><em>interesting</em></a>.</p>
<hr>
<p><em>You can view the BigQuery queries used to get the data, plus the R and ggplot2 used to create the data visualizations, in <a href="http://minimaxir.com/notebooks/sfo-jfk-flights/">this R Notebook</a>. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/sfo-jfk-flights">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Problems with Predicting Post Performance on Reddit and Other Link Aggregators</title>
      <link>https://minimaxir.com/2018/09/modeling-link-aggregators/</link>
      <pubDate>Mon, 10 Sep 2018 09:15:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/09/modeling-link-aggregators/</guid>
      <description>The nature of algorithmic feeds like Reddit inherently leads to a survivorship bias: although users may recognize certain types of posts that appear on the front page, there are many more which follow the same patterns but fail.</description>
      <content:encoded><![CDATA[<p><a href="https://www.reddit.com">Reddit</a>, &ldquo;the front page of the internet&rdquo; is a link aggregator where anyone can submit links to cool happenings. Over the years, Reddit has expanded from just being a link aggregator, to allowing image and videos, and as of recently, hosting images and videos itself.</p>
<p>Reddit is broken down into subreddits, where each subreddit represents its own community around a particular interest, like <a href="https://www.reddit.com/r/aww">/r/aww</a> for pet photos and <a href="https://www.reddit.com/r/politics/">/r/politics</a> for U.S. politics. The posts on each subreddit are ranked by some function of both the time elapsed since the submission was made and the <em>score</em> of the submission as determined by upvotes and downvotes from other users.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_aww_hu_15514c9daececa75.webp 320w,/2018/09/modeling-link-aggregators/reddit_aww_hu_38fdc85d80e9f49f.webp 768w,/2018/09/modeling-link-aggregators/reddit_aww.png 827w" src="reddit_aww.png"/> 
</figure>

<p>There&rsquo;s also an intrinsic pride in seeing something you provided to the community get lots of upvotes (the submitter also earns karma based on received upvotes, although karma is meaningless and doesn&rsquo;t provide any user benefits). But the reality is that even on the largest subreddits, submissions with 1 point (the default score for new submissions) are the most common, with some subreddits having <em>over half</em> of their submissions at only 1 point.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_dist_facet_hu_94559d39f676be08.webp 320w,/2018/09/modeling-link-aggregators/reddit_dist_facet_hu_ede8ccaaf5538573.webp 768w,/2018/09/modeling-link-aggregators/reddit_dist_facet_hu_940890d5e65baccb.webp 1024w,/2018/09/modeling-link-aggregators/reddit_dist_facet.png 1800w" src="reddit_dist_facet.png"/> 
</figure>
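<p>The exact queries behind these charts live in the R Notebook linked at the end of this post, but as a rough sketch, the 1-point claim can be checked with something along these lines against Felipe Hoffa&rsquo;s public Reddit submissions dataset on BigQuery (the table name and fields here are assumptions about that dataset, not something shown in this post):</p>
<pre tabindex="0"><code>#standardSQL
SELECT subreddit,
  COUNT(*) AS num_submissions,
  -- share of submissions stuck at the default score of 1
  ROUND(COUNTIF(score = 1) / COUNT(*), 3) AS prop_one_point
FROM `fh-bigquery.reddit_posts.2018_08`
GROUP BY subreddit
HAVING num_submissions &gt;= 1000
ORDER BY num_submissions DESC
</code></pre>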

<p>The exposure from having a submission go viral on Reddit (especially on larger subreddits) can be valuable, especially if it&rsquo;s your own original content. As a result, there has been a lot of <a href="https://www.brandwatch.com/blog/how-to-get-on-the-front-page-of-reddit/">analysis</a>/<a href="https://www.reddit.com/r/starterpacks/comments/8rkfk9/reddit_front_page_starter_pack/">stereotypes</a> about which techniques help a submission make it to the top of the front page. But almost all claims of &ldquo;cracking&rdquo; the Reddit algorithm are <a href="https://en.wikipedia.org/wiki/Post_hoc_ergo_propter_hoc"><em>post hoc</em> rationalizations</a>, attributing success to things like submission timing and title verbiage of a single submission after the fact. The nature of algorithmic feeds inherently leads to a <a href="https://en.wikipedia.org/wiki/Survivorship_bias">survivorship bias</a>: although users may recognize certain types of posts that appear on the front page, there are many more which follow the same patterns but fail, which makes modeling a successful post very tricky.</p>
<p>I&rsquo;ve touched on analyzing Reddit post performance <a href="https://minimaxir.com/2017/06/reddit-deep-learning/">before</a>, but let&rsquo;s give it another look and see if we can drill down on why Reddit posts do and do not do well.</p>
<h2 id="submission-timing">Submission Timing</h2>
<p>As with many US-based websites, the majority of Reddit users are most active during work hours (9 AM — 5 PM Eastern time weekdays). Most subreddits have submission patterns which fit accordingly.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_prop_hu_6063ab19aff16cb2.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_prop_hu_4354ae33b8600c6a.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_prop_hu_5818614336fda8df.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_prop.png 1800w" src="reddit_subreddit_prop.png"/> 
</figure>

<p>But what&rsquo;s interesting are the subreddits which <em>deviate</em> from that standard. Gaming subreddits (<a href="https://www.reddit.com/r/DestinyTheGame">/r/DestinyTheGame</a>, <a href="https://www.reddit.com/r/Overwatch">/r/Overwatch</a>) see a short-lived spike in activity after a Tuesday game update/patch, game <em>communication</em> subreddits (<a href="https://www.reddit.com/r/Fireteams">/r/Fireteams</a>, <a href="https://www.reddit.com/r/RocketLeagueExchange">/r/RocketLeagueExchange</a>) are more active <em>outside</em> of work hours as they assume you are playing the game at the time, and Not-Safe-For-Work subreddits (/r/dirtykikpals, /r/gonewild) are incidentally less active during work hours and more active late at night than other subreddits.</p>
<p>Whenever you make a submission to Reddit, the submission appears in the subreddit&rsquo;s <code>/new</code> queue of the most recent submissions, where hopefully kind souls will find your submission and upvote it if it&rsquo;s good.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_new_hu_6650be6d73851b91.webp 320w,/2018/09/modeling-link-aggregators/reddit_new.png 762w" src="reddit_new.png"/> 
</figure>

<p>However, if it falls off the first page of the <code>/new</code> queue, your submission might be as good as dead. As a result, there&rsquo;s an element of game theory to timing your submission if you want it to not become another 1-point submission. Is it better to submit during peak hours when more users may see the submission before it falls off of <code>/new</code>? Is it better to submit <em>before</em> peak usage since there will be less competition, then continue the momentum once it hits the front page?</p>
<p>Here&rsquo;s a look at the median post performance at each given time slot for top subreddits:</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy_hu_cb9c5ba898252674.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy_hu_8ba4a17a13989a31.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy_hu_a08bfb9858ec4480.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_hr_doy.png 1800w" src="reddit_subreddit_hr_doy.png"/> 
</figure>
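<p>A sketch of the kind of query that could produce a chart like the one above, using BigQuery&rsquo;s <code>APPROX_QUANTILES</code> function to get medians cheaply (again, the table name and the <code>created_utc</code>/<code>score</code> fields are assumptions about the public Reddit dataset):</p>
<pre tabindex="0"><code>#standardSQL
SELECT subreddit,
  -- bucket submissions by hour of day, Eastern time
  EXTRACT(HOUR FROM TIMESTAMP_SECONDS(created_utc) AT TIME ZONE &#39;America/New_York&#39;) AS hour_et,
  APPROX_QUANTILES(score, 100)[OFFSET(50)] AS median_score
FROM `fh-bigquery.reddit_posts.2018_08`
WHERE subreddit IN (&#39;aww&#39;, &#39;politics&#39;, &#39;me_irl&#39;)
GROUP BY subreddit, hour_et
ORDER BY subreddit, hour_et
</code></pre>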

<p>As the earlier distribution chart implied, the median score is around 1-2 for most subreddits, and that&rsquo;s consistent across all time slots. Some subreddits with higher medians like /r/me_irl do appear to have a <em>slight</em> benefit when posting before peak activity. When focusing on subreddits with high overall median scores, the difference is more explicit.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian_hu_2730023d99e9e0d9.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian_hu_78be513d900d66b5.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian_hu_da4a41445f75e1.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_highmedian.png 1800w" src="reddit_subreddit_highmedian.png"/> 
</figure>

<p>Subreddits like /r/PrequelMemes and /r/The_Donald <em>definitely</em> have better performance on average when posted before peak activity! Posting before peak usage <em>does</em> appear to be a viable strategy; however, for the majority of subreddits it doesn&rsquo;t make much of a difference.</p>
<h2 id="submission-titles">Submission Titles</h2>
<p>Each subreddit has its own vocabulary and topics of discussion. Let&rsquo;s break down title text by subreddit by looking at the 75th percentile of score for posts containing a given two-word phrase:</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams_hu_5d8f080824cf057d.webp 320w,/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams_hu_2870270c6078715e.webp 768w,/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams_hu_9edc52c78d8fe6ca.webp 1024w,/2018/09/modeling-link-aggregators/reddit_subreddit_topbigrams.png 1800w" src="reddit_subreddit_topbigrams.png"/> 
</figure>

<p>The one trend consistent across all subreddits is the effectiveness of first-person pronouns (<em>I/my</em>) and original content (<em>fan art</em>). Other than that, the vocabulary and sentiment of successful posts is very specific to the subreddit and the culture it represents; there are no universal guaranteed-success memes.</p>
<h2 id="can-deep-learning-predict-post-performance">Can Deep Learning Predict Post Performance?</h2>
<p>Some might think &ldquo;oh hey, this is an arbitrary statistical problem, you can just build an AI to solve it!&rdquo; So, for the sake of argument, I did.</p>
<p>Instead of using Reddit data for building a deep learning model, we&rsquo;ll use data from <a href="https://news.ycombinator.com">Hacker News</a>, another link aggregator similar to Reddit with a strong focus on technology and startup entrepreneurship. The distribution of scores on posts, submission timings, upvoting, and front page ranking systems are all broadly similar to Reddit&rsquo;s.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/hn_hu_ad0b8ce0803e73ea.webp 320w,/2018/09/modeling-link-aggregators/hn_hu_9592bce993e10dcd.webp 768w,/2018/09/modeling-link-aggregators/hn_hu_c329d6412551f993.webp 1024w,/2018/09/modeling-link-aggregators/hn.png 1520w" src="hn.png"/> 
</figure>

<p>The titles on Hacker News submissions are also shorter (80 characters max vs. Reddit&rsquo;s 300 character max) and in concise English (no memes/shitposts allowed), which should help the model learn the title syntax and identify high-impact keywords more easily. Like Reddit, the score data is super-skewed, with most HN submissions at 1-2 points, and typical model training will quickly converge but try to predict that <em>every</em> submission has a score of 1, which isn&rsquo;t helpful!</p>
<p>By constructing a model employing <em>many</em> deep learning tricks with <a href="https://keras.io">Keras</a>/<a href="https://www.tensorflow.org">TensorFlow</a> to prevent model cheating and training on <em>hundreds of thousands</em> of HN submissions (using post title, day-of-week, hour, and link domain like <code>github.com</code> as model features), the model does converge and finds some signal among the noise (training R<sup>2</sup> ~ 0.55 when trained for 50 epochs). However, it fails to offer any valuable predictions on new, unseen posts (test R<sup>2</sup> <em>&lt; 0.00</em>) because it falls into the exact same human biases regarding titles: it saw submissions with titles that did very well during training, but it can&rsquo;t isolate the random chance that determines why, of two otherwise-similar submissions, one goes viral while the other does not.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/hn_test_hu_75e647e4de235ee0.webp 320w,/2018/09/modeling-link-aggregators/hn_test.png 485w" src="hn_test.png"/> 
</figure>

<p>I&rsquo;ve made the Keras/TensorFlow model training code available in <a href="https://www.kaggle.com/minimaxir/hacker-news-submission-score-predictor/notebook">this Kaggle Notebook</a> if you want to fork it and try to improve the model.</p>
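<p>The data-gathering queries aren&rsquo;t reproduced in the post body (they&rsquo;re in the R Notebook linked at the end), but as a rough sketch, pulling the features described above from the public Hacker News dataset on BigQuery might look like this (the table and field names below are assumptions based on <code>bigquery-public-data.hacker_news.full</code>, not taken from this post):</p>
<pre tabindex="0"><code>#standardSQL
SELECT
  hn.title,
  hn.score,
  -- day-of-week and hour-of-day features (UTC)
  EXTRACT(DAYOFWEEK FROM hn.timestamp) AS day_of_week,
  EXTRACT(HOUR FROM hn.timestamp) AS hour,
  -- link domain feature, e.g. github.com
  NET.HOST(hn.url) AS domain
FROM `bigquery-public-data.hacker_news.full` AS hn
WHERE hn.type = &#39;story&#39;
  AND hn.title IS NOT NULL
  AND hn.score IS NOT NULL
</code></pre>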
<h2 id="other-potential-modeling-factors">Other Potential Modeling Factors</h2>
<p>The deep learning model above makes optimistic assumptions about the underlying data, including that each post behaves independently and that the included features are the sole determinants of the score. These assumptions are questionable.</p>
<p>The simple model also forgoes the content of the submission itself, which is hard to retrieve for hundreds of thousands of data points. On Hacker News that&rsquo;s mostly OK, since most submissions are links/articles whose titles accurately reflect the content, although occasionally there are idiosyncratic short titles which do the opposite. On Reddit, looking at content is obviously necessary for image/video-oriented subreddits, and that content is hard to gather and analyze at scale.</p>
<p>A very important concept of post performance is <em>momentum</em>. A post having a high score is a positive signal in itself, which begets more votes (a famous Reddit problem is brigading from /r/all which can cause submission scores to skyrocket). If the front page of a subreddit has a large number of high-performing posts, they might also suppress posts coming out of the <code>/new</code> queue because the score threshold is much higher. A simple model may not be able to capture these impacts; the model would need to incorporate the <em>state of the front page</em> at the time of posting.</p>
<p>Some also try to manipulate upvotes. Reddit became famous for adding the rule &ldquo;asking for upvotes is a violation of intergalactic law&rdquo; to their <a href="https://www.reddithelp.com/en/categories/rules-reporting/account-and-community-restrictions/what-constitutes-vote-cheating-or">Content Policy</a>, although some subreddits do it anyway <a href="https://www.reddit.com/r/TheoryOfReddit/comments/5qqrod/for_years_reddit_told_us_that_saying_upvote_this/">without consequence</a>. On Reddit, obvious spam posts can be downvoted to immediately counteract illicit upvotes. Hacker News has a <a href="https://news.ycombinator.com/newsfaq.html">similar don&rsquo;t-upvote rule</a>, although there aren&rsquo;t downvotes, just a flagging mechanism which quickly neutralizes spam/misleading posts. In general, there&rsquo;s no <em>legitimate</em> reason to highlight your own submission immediately after it&rsquo;s posted (except for Reddit&rsquo;s AMAs). Fortunately, gaming the system is less impactful on Reddit and Hacker News due to their sheer size and countermeasures, but it&rsquo;s a good example of potential user behavior that makes modeling post performance difficult, and hopefully link aggregators of the future aren&rsquo;t susceptible to such shenanigans.</p>
<h2 id="do-we-really-to-predict-post-score">Do We Really to Predict Post Score?</h2>
<p>Let&rsquo;s say you are submitting original content to Reddit or your own tech project to Hacker News. More points means a higher ranking means more exposure for your link, right? Not exactly. As noted from Reddit/HN screenshots above, the scores of popular submissions are all over the place ranking-wise, having been affected by age penalties.</p>
<p>In practical terms, from my own purely anecdotal experience, submissions at a top ranking receive <em>substantially</em> more clickthroughs despite being spatially close on the page to others.</p>
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">&hellip;and now traffic at #3.<br><br>Placement is absurdly important for search engines/social media sites. Difference between #1 and #3 is dramatic. <a href="https://t.co/nGjWJBx6dU">pic.twitter.com/nGjWJBx6dU</a></p>— Max Woolf (@minimaxir) <a href="https://twitter.com/minimaxir/status/877219784907149316?ref_src=twsrc%5Etfw">June 20, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></span></p>
<p>In <a href="https://twitter.com/minimaxir/status/877219784907149316">that case</a>, falling from #1 to #3 <em>immediately halved</em> the referral traffic coming from Hacker News.</p>
<p>Therefore, an ideal link aggregator predictive model to maximize clicks should try to predict the <em>rank</em> of a submission (max rank, average rank over <em>n</em> period, etc.), not necessarily the score it receives. You could theoretically create such a model by taking a snapshot of a Reddit subreddit/the front page of Hacker News every minute or so, which includes the post position at the time of the snapshot. As mentioned earlier, the snapshots could also be used as a model feature to identify whether the front page is active or stale. Unfortunately, snapshots can&rsquo;t be retrieved retroactively, and storing, processing, and analyzing snapshots at scale is a difficult and <em>expensive</em> feat of data engineering.</p>
<p>Presumably Reddit&rsquo;s data scientists would be incorporating submission position as a part of their data analytics and modeling, but after inspecting what&rsquo;s sent to Reddit&rsquo;s servers when you perform an action like upvoting, I wasn&rsquo;t able to find a sent position value when upvoting from the feed: only the post score and post upvote percentage at the time of the action were sent.</p>
<figure>

    <img loading="lazy" srcset="/2018/09/modeling-link-aggregators/chrome_hu_4b758c7e3fe42881.webp 320w,/2018/09/modeling-link-aggregators/chrome_hu_29f25ed9207a6d8f.webp 768w,/2018/09/modeling-link-aggregators/chrome_hu_f6617992d5fb908c.webp 1024w,/2018/09/modeling-link-aggregators/chrome.png 1442w" src="chrome.png"/> 
</figure>

<p>In this example, I upvoted the <code>Fact are facts</code> submission at position #5: we&rsquo;d expect a value between <code>3</code> and <code>5</code> to be sent with the post metadata within the analytics payload, but that&rsquo;s not the case.</p>
<p>Optimizing ranking instead of a tangible metric or classification accuracy is a relatively underdiscussed field of modern data science (besides <a href="https://en.wikipedia.org/wiki/Search_engine_optimization">SEO</a> for getting the top spot on a Google search), and it would be interesting to dive deeper into it for other applications.</p>
<h2 id="in-the-future">In the future</h2>
<p>The moral of this post is that you should not take it personally if a submission fails to hit the front page. It doesn&rsquo;t necessarily mean it&rsquo;s bad. Conversely, if a post does well, don’t assume that similar posts will do just as well. There&rsquo;s a lot of quality content that falls through the cracks due to dumb luck. Fortunately, both Reddit and Hacker News allow reposts, which helps alleviate this particular problem.</p>
<p>There&rsquo;s still a lot that can be done to more deterministically predict the behavior of these algorithmic feeds. There&rsquo;s also room to help make these link aggregators more <em>fair</em>. Unfortunately, there are even more undiscovered ways to game these algorithms, and we&rsquo;ll see how things play out.</p>
<hr>
<p><em>You can view the BigQuery queries used to get the Reddit and Hacker News data, plus the R and ggplot2 used to create the data visualizations, in <a href="http://minimaxir.com/notebooks/modeling-link-aggregators/">this R Notebook</a>. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/modeling-link-aggregators">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Analyzing IMDb Data The Intended Way, with R and ggplot2</title>
      <link>https://minimaxir.com/2018/07/imdb-data-analysis/</link>
      <pubDate>Mon, 16 Jul 2018 09:45:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/07/imdb-data-analysis/</guid>
      <description>For IMDb&amp;rsquo;s big-but-not-big data, you have to play with the data smartly, and both R and ggplot2 have neat tricks to do just that.</description>
      <content:encoded><![CDATA[<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/P4_zSfoTM80?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p><a href="https://www.imdb.com">IMDb</a>, the Internet Movie Database, has been a popular source for data analysis and visualizations over the years. The combination of user ratings for movies and detailed movie metadata have always been fun to <a href="http://minimaxir.com/2016/01/movie-revenue-ratings/">play with</a>.</p>
<p>There are a number of tools to help get IMDb data, such as <a href="https://github.com/alberanid/imdbpy">IMDbPY</a>, which makes it easy to programmatically scrape IMDb by pretending to be a website user and extracting the relevant data from the page&rsquo;s HTML output. While it <em>works</em>, web scraping public data is a gray area in terms of legality; many large websites have a Terms of Service which forbids scraping, and they can potentially send a DMCA take-down notice to websites redistributing scraped data.</p>
<p>IMDb has <a href="https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX">data licensing terms</a> which forbid scraping and require attribution in the form of an <strong>Information courtesy of IMDb (<a href="http://www.imdb.com">http://www.imdb.com</a>). Used with permission.</strong> statement, and has also <a href="https://www.kaggle.com/tmdb/tmdb-movie-metadata/home">DMCAed a Kaggle IMDb dataset</a> to drive the point home.</p>
<p>However, there is good news! IMDb publishes an <a href="https://www.imdb.com/interfaces/">official dataset</a> for casual data analysis! And it&rsquo;s now very accessible: just choose a dataset and download it (with no hoops to jump through), and the files are in the standard <a href="https://en.wikipedia.org/wiki/Tab-separated_values">TSV format</a>.</p>
<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/datasets_hu_fb4ad2ef1d7c9e7f.webp 320w,/2018/07/imdb-data-analysis/datasets_hu_a5155a40c73aa984.webp 768w,/2018/07/imdb-data-analysis/datasets.png 926w" src="datasets.png"/> 
</figure>

<p>The uncompressed files are pretty large; not &ldquo;big data&rdquo; large (they fit into computer memory), but Excel will explode if you try to open them. You have to play with the data <em>smartly</em>, and both <a href="https://www.r-project.org">R</a> and <a href="https://ggplot2.tidyverse.org/reference/index.html">ggplot2</a> have neat tricks to do just that.</p>
<h2 id="first-steps">First Steps</h2>
<p>R is a popular programming language for statistical analysis. One of the most popular collections of external packages is the <code>tidyverse</code>, which automatically imports the <code>ggplot2</code> data visualization library and other useful packages that we&rsquo;ll get to one-by-one. We&rsquo;ll also load <code>scales</code>, which we&rsquo;ll use later for prettier number formatting. First we&rsquo;ll load these packages:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">scales</span><span class="p">)</span>
</span></span></code></pre></div><p>And now we can load a TSV downloaded from IMDb using the <code>read_tsv</code> function from <code>readr</code> (a tidyverse package), which does what the name implies at a much faster speed than base R (we also pass a couple of other parameters to handle data encoding). Let&rsquo;s start with the <code>ratings</code> file:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;title.ratings.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span></span></span></code></pre></div>
<p>We can preview what&rsquo;s in the loaded data using <code>dplyr</code> (a tidyverse package), which is what we&rsquo;ll be using to manipulate data for this analysis. dplyr allows you to pipe commands, making it easy to chain a sequence of manipulations. For now, we&rsquo;ll use <code>head()</code>, which displays the top few rows of the data frame.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">()</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/ratings_hu_5c1fcf56a5289876.webp 320w,/2018/07/imdb-data-analysis/ratings_hu_cf3fece2f9c850ca.webp 768w,/2018/07/imdb-data-analysis/ratings.png 930w" src="ratings.png"/> 
</figure>

<p>Each of the <strong>873k rows</strong> corresponds to a single movie and contains an ID for the movie, its average rating (from 1 to 10), and the number of votes which contribute to that average. Since we have two numeric variables, why not test out ggplot2 by creating a scatterplot mapping them? ggplot2 takes in a data frame and names of columns as aesthetics, then you specify what type of shape to plot (a &ldquo;geom&rdquo;). Passing the plot to <code>ggsave</code> saves it as a standalone, high-quality data visualization.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">numVotes</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_point</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">ggsave</span><span class="p">(</span><span class="s">&#34;imdb-0.png&#34;</span><span class="p">,</span> <span class="n">plot</span><span class="p">,</span> <span class="n">width</span> <span class="o">=</span> <span class="m">4</span><span class="p">,</span> <span class="n">height</span> <span class="o">=</span> <span class="m">3</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-0_hu_6866c079d670893c.webp 320w,/2018/07/imdb-data-analysis/imdb-0_hu_dddd194229265d79.webp 768w,/2018/07/imdb-data-analysis/imdb-0_hu_1d852e43e8a54dea.webp 1024w,/2018/07/imdb-data-analysis/imdb-0.png 1200w" src="imdb-0.png"/> 
</figure>

<p>Here are nearly <em>1 million</em> points on a single chart; definitely don&rsquo;t try to do that in Excel! However, it&rsquo;s not a <em>useful</em> chart since all the points are opaque and we&rsquo;re not sure what the spatial density of points is. One approach to fix this issue is to create a heat map of points, which ggplot2 can do natively with <code>geom_bin2d</code>. We can color the heat map with the <a href="https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html">viridis</a> colorblind-friendly palettes <a href="https://ggplot2.tidyverse.org/reference/scale_viridis.html">just introduced</a> into ggplot2. We should also tweak the axes: the x-axis should be scaled logarithmically with <code>scale_x_log10</code> since there are many movies with high numbers of votes, and those axis labels can be formatted with the <code>comma</code> function from the <code>scales</code> package (the fill legend can be formatted with <code>comma</code> too). For the y-axis, we can add explicit number breaks for each rating; R can do this neatly by setting the breaks to <code>1:10</code>. Putting it all together:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">numVotes</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_bin2d</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_x_log10</span><span class="p">(</span><span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_viridis_c</span><span class="p">(</span><span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-1_hu_afa4c2e2f89a47f2.webp 320w,/2018/07/imdb-data-analysis/imdb-1_hu_fb49622c671e7e.webp 768w,/2018/07/imdb-data-analysis/imdb-1_hu_fe5886baf1a1a113.webp 1024w,/2018/07/imdb-data-analysis/imdb-1.png 1200w" src="imdb-1.png"/> 
</figure>

<p>Not bad, although it unfortunately confirms that IMDb follows a <a href="https://tvtropes.org/pmwiki/pmwiki.php/Main/FourPointScale">Four Point Scale</a> where average ratings tend to fall between 6 and 9.</p>
<h2 id="mapping-movies-to-ratings">Mapping Movies to Ratings</h2>
<p>You may be asking &ldquo;which ratings correspond to which movies?&rdquo; That&rsquo;s what the <code>tconst</code> field is for. But first, let&rsquo;s load the title data from <code>title.basics.tsv</code> into <code>df_basics</code> and take a look as before.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_basics</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;title.basics.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/basics1_hu_fdcb6a5f4e7311e5.webp 320w,/2018/07/imdb-data-analysis/basics1_hu_e15b78e5bbe944b8.webp 768w,/2018/07/imdb-data-analysis/basics1_hu_2e217e73acfcd9ff.webp 1024w,/2018/07/imdb-data-analysis/basics1.png 1350w" src="basics1.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/basics2_hu_a64ae979748aa9ab.webp 320w,/2018/07/imdb-data-analysis/basics2_hu_a83799eaf31e4743.webp 768w,/2018/07/imdb-data-analysis/basics2_hu_21a8fb679f3ec4e9.webp 1024w,/2018/07/imdb-data-analysis/basics2.png 1374w" src="basics2.png"/> 
</figure>
</p>
<p>We have some neat movie metadata. Notably, this table has a <code>tconst</code> field as well. Therefore, we can <em>join</em> the two tables together, adding the movie information to the corresponding row in the ratings table (in this case, a left join is more appropriate than an inner/full join).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">&lt;-</span> <span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_basics</span><span class="p">)</span>
</span></span></code></pre></div><p>Runtime minutes sounds interesting. Could there be a relationship between the length of a movie and its average rating on IMDb? Let&rsquo;s make a heat map plot again, but with a few tweaks. With the new metadata, we can <code>filter</code> the table to remove bad points: let&rsquo;s keep only movies (as the IMDb data also contains <em>television show data</em>), with a runtime under 3 hours, and which have received at least 10 votes from users (to remove extraneous movies). The x-axis should be tweaked to display the minute values in hours, and the viridis fill palette can be changed to another one in the family (I personally like <code>inferno</code>).</p>
<p>More importantly, let&rsquo;s discuss plot theming. If you want a minimalistic theme, add a <code>theme_minimal</code> to the plot, and you can pass a <code>base_family</code> to change the default font on the plot and a <code>base_size</code> to change the font size. The <code>labs</code> function lets you add labels to the plot (which you should <em>always</em> do); you have your <code>title</code>, <code>x</code>, and <code>y</code> parameters, but you can also add a <code>subtitle</code>, a <code>caption</code> for attribution, and a <code>color</code>/<code>fill</code> to name the scale. Putting it all together:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">runtimeMinutes</span> <span class="o">&lt;</span> <span class="m">180</span><span class="p">,</span> <span class="n">titleType</span> <span class="o">==</span> <span class="s">&#34;movie&#34;</span><span class="p">,</span> <span class="n">numVotes</span> <span class="o">&gt;=</span> <span class="m">10</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">runtimeMinutes</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_bin2d</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">180</span><span class="p">,</span> <span class="m">60</span><span class="p">),</span> <span class="n">labels</span> <span class="o">=</span> <span class="m">0</span><span class="o">:</span><span class="m">3</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="m">0</span><span class="o">:</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_viridis_c</span><span class="p">(</span><span class="n">option</span> <span class="o">=</span> <span class="s">&#34;inferno&#34;</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">theme_minimal</span><span class="p">(</span><span class="n">base_family</span> <span class="o">=</span> <span class="s">&#34;Source Sans Pro&#34;</span><span class="p">,</span> <span class="n">base_size</span> <span class="o">=</span> <span class="m">8</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">labs</span><span class="p">(</span><span class="n">title</span> <span class="o">=</span> <span class="s">&#34;Relationship between Movie Runtime and Average Movie Rating&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">subtitle</span> <span class="o">=</span> <span class="s">&#34;Data from IMDb retrieved July 4th, 2018&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">x</span> <span class="o">=</span> <span class="s">&#34;Runtime (Hours)&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">y</span> <span class="o">=</span> <span class="s">&#34;Average User Rating&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">caption</span> <span class="o">=</span> <span class="s">&#34;Max Woolf — minimaxir.com&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">               <span class="n">fill</span> <span class="o">=</span> <span class="s">&#34;# Movies&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-2b_hu_37c6091878dca7a3.webp 320w,/2018/07/imdb-data-analysis/imdb-2b_hu_42f5a5f9d2e7967e.webp 768w,/2018/07/imdb-data-analysis/imdb-2b_hu_b4f485eff14f2484.webp 1024w,/2018/07/imdb-data-analysis/imdb-2b.png 1200w" src="imdb-2b.png"/> 
</figure>

<p>Now that&rsquo;s pretty nice-looking for only a few lines of code! Albeit not very informative, as there doesn&rsquo;t appear to be a correlation between runtime and rating.</p>
<p><em>(Note: for the rest of this post, the theming/labels code will be omitted for convenience)</em></p>
<p>How about movie ratings vs. the year the movie was made? It&rsquo;s a similar plot code-wise to the one above (one perk of <code>ggplot2</code> is that there&rsquo;s no shame in reusing chart code!), but we can add a <code>geom_smooth</code>, which adds a nonparametric trendline with confidence bands; since we have a large amount of data, the bands are very tight. We can also fix the problem of &ldquo;empty&rdquo; bins by setting the color fill scale to logarithmic scaling. And since we&rsquo;re adding a black trendline, let&rsquo;s change the viridis palette to <code>plasma</code> for better contrast.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">titleType</span> <span class="o">==</span> <span class="s">&#34;movie&#34;</span><span class="p">,</span> <span class="n">numVotes</span> <span class="o">&gt;=</span> <span class="m">10</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">startYear</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">averageRating</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_bin2d</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_smooth</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">&#34;black&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_x_continuous</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_viridis_c</span><span class="p">(</span><span class="n">option</span> <span class="o">=</span> <span class="s">&#34;plasma&#34;</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">comma</span><span class="p">,</span> <span class="n">trans</span> <span class="o">=</span> <span class="s">&#39;log10&#39;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-4_hu_fdf90cbdd2dd2c7e.webp 320w,/2018/07/imdb-data-analysis/imdb-4_hu_1c45abe215427c09.webp 768w,/2018/07/imdb-data-analysis/imdb-4_hu_62d0feb034e8b054.webp 1024w,/2018/07/imdb-data-analysis/imdb-4.png 1200w" src="imdb-4.png"/> 
</figure>

<p>Unfortunately, this trend hasn&rsquo;t changed much over time either, although average ratings outside the Four Point Scale have become more common in recent years.</p>
<h2 id="mapping-lead-actors-to-movies">Mapping Lead Actors to Movies</h2>
<p>Now that we have a handle on working with the IMDb data, let&rsquo;s try playing with the larger datasets. Since they take up a lot of computer memory, we only want to persist data we actually might use. After looking at the schema provided with the official datasets, the only really useful metadata about the actors is their birth year, so let&rsquo;s load that, but only keep actors/actresses (using the fast <code>str_detect</code> function from <code>stringr</code>, another tidyverse package) and the relevant fields.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_actors</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;name.basics.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                <span class="nf">filter</span><span class="p">(</span><span class="nf">str_detect</span><span class="p">(</span><span class="n">primaryProfession</span><span class="p">,</span> <span class="s">&#34;actor|actress&#34;</span><span class="p">))</span>  <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                <span class="nf">select</span><span class="p">(</span><span class="n">nconst</span><span class="p">,</span> <span class="n">primaryName</span><span class="p">,</span> <span class="n">birthYear</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/actor_hu_f86030d94734f51e.webp 320w,/2018/07/imdb-data-analysis/actor_hu_58f7a4e4de86c210.webp 768w,/2018/07/imdb-data-analysis/actor.png 936w" src="actor.png"/> 
</figure>

<p>The principals dataset, the large 1.28GB TSV, is the most interesting. It&rsquo;s an unnested list of the credited persons in each movie, with an <code>ordering</code> indicating their rank (where <code>1</code> means first, <code>2</code> means second, etc.).</p>
<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/principals_hu_e149270e85e6bbfe.webp 320w,/2018/07/imdb-data-analysis/principals_hu_d39d7c6fcd18929.webp 768w,/2018/07/imdb-data-analysis/principals_hu_56b42bde8cdb5364.webp 1024w,/2018/07/imdb-data-analysis/principals.png 1074w" src="principals.png"/> 
</figure>

<p>For this analysis, let&rsquo;s only look at the <strong>lead actors/actresses</strong>; specifically, for each movie (identified by the <code>tconst</code> value), keep the credited actor/actress with the lowest <code>ordering</code> value (the person at rank <code>1</code> overall may not necessarily be an actor/actress, which is why we filter by <code>category</code> first).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_principals</span> <span class="o">&lt;-</span> <span class="nf">read_tsv</span><span class="p">(</span><span class="s">&#39;title.principals.tsv&#39;</span><span class="p">,</span> <span class="n">na</span> <span class="o">=</span> <span class="s">&#34;\\N&#34;</span><span class="p">,</span> <span class="n">quote</span> <span class="o">=</span> <span class="s">&#39;&#39;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">filter</span><span class="p">(</span><span class="nf">str_detect</span><span class="p">(</span><span class="n">category</span><span class="p">,</span> <span class="s">&#34;actor|actress&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="n">tconst</span><span class="p">,</span> <span class="n">ordering</span><span class="p">,</span> <span class="n">nconst</span><span class="p">,</span> <span class="n">category</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">group_by</span><span class="p">(</span><span class="n">tconst</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">filter</span><span class="p">(</span><span class="n">ordering</span> <span class="o">==</span> <span class="nf">min</span><span class="p">(</span><span class="n">ordering</span><span class="p">))</span>
</span></span></code></pre></div><p>Both datasets have an <code>nconst</code> field, so let&rsquo;s join them together, and then join <em>that</em> to the earlier ratings table via <code>tconst</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_principals</span> <span class="o">&lt;-</span> <span class="n">df_principals</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_actors</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">df_ratings</span> <span class="o">&lt;-</span> <span class="n">df_ratings</span> <span class="o">%&gt;%</span> <span class="nf">left_join</span><span class="p">(</span><span class="n">df_principals</span><span class="p">)</span>
</span></span></code></pre></div><p>Now we have a fully denormalized dataset in <code>df_ratings</code>. Since we have both the movie release year and the birth year of the lead actor, we can infer <em>the age of the lead actor at the movie&rsquo;s release</em>. With that goal, filter the data using the criteria from the earlier data visualizations, keeping only rows that also have the actor&rsquo;s birth year.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings_movies</span> <span class="o">&lt;-</span> <span class="n">df_ratings</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                        <span class="nf">filter</span><span class="p">(</span><span class="n">titleType</span> <span class="o">==</span> <span class="s">&#34;movie&#34;</span><span class="p">,</span> <span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">birthYear</span><span class="p">),</span> <span class="n">numVotes</span> <span class="o">&gt;=</span> <span class="m">10</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                        <span class="nf">mutate</span><span class="p">(</span><span class="n">age_lead</span> <span class="o">=</span> <span class="n">startYear</span> <span class="o">-</span> <span class="n">birthYear</span><span class="p">)</span>
</span></span></code></pre></div><p><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/denorm1_hu_654cad39747efe47.webp 320w,/2018/07/imdb-data-analysis/denorm1_hu_eed6e992d7e214e3.webp 768w,/2018/07/imdb-data-analysis/denorm1_hu_dbde12b6453e4f09.webp 1024w,/2018/07/imdb-data-analysis/denorm1.png 1604w" src="denorm1.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/denorm2_hu_3aef3d94cde50e2c.webp 320w,/2018/07/imdb-data-analysis/denorm2.png 531w" src="denorm2.png"/> 
</figure>
</p>
<h2 id="plotting-ages">Plotting Ages</h2>
<p>Age discrimination in movie casting has been a recurring issue in Hollywood; in fact, in 2017 <a href="https://www.hollywoodreporter.com/thr-esq/judge-pauses-enforcement-imdb-age-censorship-law-978797">a law was signed</a> to force IMDb to remove an actor&rsquo;s age upon request, which in February 2018 was <a href="https://www.hollywoodreporter.com/thr-esq/californias-imdb-age-censorship-law-declared-unconstitutional-1086540">ruled to be unconstitutional</a>.</p>
<p>Have the ages of movie leads changed over time? For this example, we&rsquo;ll use a <a href="https://ggplot2.tidyverse.org/reference/geom_ribbon.html">ribbon plot</a> to plot the ranges of ages of movie leads. A simple way to do that is, for each year, calculate the 25th <a href="https://en.wikipedia.org/wiki/Percentile">percentile</a> of the ages, the 50th percentile (i.e. the median), and the 75th percentile, where the 25th and 75th percentiles are the ribbon bounds and the line represents the median.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_actor_ages</span> <span class="o">&lt;-</span> <span class="n">df_ratings_movies</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">group_by</span><span class="p">(</span><span class="n">startYear</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">summarize</span><span class="p">(</span><span class="n">low_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.25</span><span class="p">,</span> <span class="n">na.rm</span><span class="o">=</span><span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">med_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.50</span><span class="p">,</span> <span class="n">na.rm</span><span class="o">=</span><span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">high_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.75</span><span class="p">,</span> <span class="n">na.rm</span><span class="o">=</span><span class="bp">T</span><span class="p">))</span>
</span></span></code></pre></div><p>Plotting it with ggplot2 is surprisingly simple, although you need to use different y aesthetics for the ribbon and the overlapping line.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_actor_ages</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">startYear</span> <span class="o">&gt;=</span> <span class="m">1920</span><span class="p">)</span> <span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">startYear</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_ribbon</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">ymin</span> <span class="o">=</span> <span class="n">low_age</span><span class="p">,</span> <span class="n">ymax</span> <span class="o">=</span> <span class="n">high_age</span><span class="p">),</span> <span class="n">alpha</span> <span class="o">=</span> <span class="m">0.2</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_line</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="n">med_age</span><span class="p">))</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-8_hu_1f082993b0bfcbd5.webp 320w,/2018/07/imdb-data-analysis/imdb-8_hu_5434c1e3ce1485b4.webp 768w,/2018/07/imdb-data-analysis/imdb-8_hu_c6707a589573484a.webp 1024w,/2018/07/imdb-data-analysis/imdb-8.png 1200w" src="imdb-8.png"/> 
</figure>

<p>Turns out that in the 2000s, the median age of lead actors started to <em>increase</em>? Both the upper and lower bounds increased too. That doesn&rsquo;t square with the age discrimination complaints.</p>
<p>Another aspect of these complaints is gender, as actresses tend to be cast younger than actors. Thanks to the magic of ggplot2 and dplyr, separating actors/actresses is relatively simple: add gender (encoded in <code>category</code>) as a grouping variable, add it as a color/fill aesthetic in ggplot, and set the colors appropriately (I recommend the <a href="http://colorbrewer2.org/">ColorBrewer</a> qualitative palettes for categorical variables).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_actor_ages_lead</span> <span class="o">&lt;-</span> <span class="n">df_ratings_movies</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">group_by</span><span class="p">(</span><span class="n">startYear</span><span class="p">,</span> <span class="n">category</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                  <span class="nf">summarize</span><span class="p">(</span><span class="n">low_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.25</span><span class="p">,</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">med_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.50</span><span class="p">,</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">                            <span class="n">high_age</span> <span class="o">=</span> <span class="nf">quantile</span><span class="p">(</span><span class="n">age_lead</span><span class="p">,</span> <span class="m">0.75</span><span class="p">,</span> <span class="n">na.rm</span> <span class="o">=</span> <span class="bp">T</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">plot</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">df_actor_ages_lead</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="n">startYear</span> <span class="o">&gt;=</span> <span class="m">1920</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">startYear</span><span class="p">,</span> <span class="n">fill</span> <span class="o">=</span> <span class="n">category</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="n">category</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_ribbon</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">ymin</span> <span class="o">=</span> <span class="n">low_age</span><span class="p">,</span> <span class="n">ymax</span> <span class="o">=</span> <span class="n">high_age</span><span class="p">),</span> <span class="n">alpha</span> <span class="o">=</span> <span class="m">0.2</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">geom_line</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="n">med_age</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_fill_brewer</span><span class="p">(</span><span class="n">palette</span> <span class="o">=</span> <span class="s">&#34;Set1&#34;</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">          <span class="nf">scale_color_brewer</span><span class="p">(</span><span class="n">palette</span> <span class="o">=</span> <span class="s">&#34;Set1&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-9_hu_57562b2f234249be.webp 320w,/2018/07/imdb-data-analysis/imdb-9_hu_7da40c01dd2abee4.webp 768w,/2018/07/imdb-data-analysis/imdb-9_hu_a30111e8cbade2ed.webp 1024w,/2018/07/imdb-data-analysis/imdb-9.png 1200w" src="imdb-9.png"/> 
</figure>

<p>There&rsquo;s about a 10-year gap between the ages of male and female leads, and the gap doesn&rsquo;t change over time, although both start to rise around the same point.</p>
<p>One possible explanation for this behavior is actor reuse: if Hollywood keeps casting the same actors/actresses, by construction the ages of the leads will steadily increase. Let&rsquo;s verify that: with our list of movies and their lead actors, for each lead actor, order all of their movies by release year and add a ranking for the #th time that actor has been a lead. This is possible through the use of <code>row_number</code> in dplyr, and <a href="https://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html">window functions</a> like <code>row_number</code> are one of data science&rsquo;s most useful secrets.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_ratings_movies_nth</span> <span class="o">&lt;-</span> <span class="n">df_ratings_movies</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                      <span class="nf">group_by</span><span class="p">(</span><span class="n">nconst</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                      <span class="nf">arrange</span><span class="p">(</span><span class="n">startYear</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                      <span class="nf">mutate</span><span class="p">(</span><span class="n">nth_lead</span> <span class="o">=</span> <span class="nf">row_number</span><span class="p">())</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/row_number_hu_1e44bdb2621fb9cb.webp 320w,/2018/07/imdb-data-analysis/row_number_hu_ca408294ce31483a.webp 768w,/2018/07/imdb-data-analysis/row_number_hu_ed006c80eb52873e.webp 1024w,/2018/07/imdb-data-analysis/row_number.png 1532w" src="row_number.png"/> 
</figure>

<p>One more ribbon plot later (with the same code as above, plus custom y-axis breaks; a rough sketch of that code follows):</p>
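<p>As a hedged sketch of that omitted code, reusing the summarize-then-ribbon pattern from the age plots with <code>nth_lead</code> in place of <code>age_lead</code> (the summary column names and break values below are hypothetical):</p>
<pre><code class="language-r"># hedged sketch of the omitted plot: summarize the #th-lead distribution by
# year, then reuse the ribbon + line pattern; break values are hypothetical
df_nth_lead &lt;- df_ratings_movies_nth %&gt;%
  group_by(startYear) %&gt;%
  summarize(low_nth = quantile(nth_lead, 0.25, na.rm = T),
            med_nth = quantile(nth_lead, 0.50, na.rm = T),
            high_nth = quantile(nth_lead, 0.75, na.rm = T))

plot &lt;- ggplot(df_nth_lead %&gt;% filter(startYear &gt;= 1920), aes(x = startYear)) +
  geom_ribbon(aes(ymin = low_nth, ymax = high_nth), alpha = 0.2) +
  geom_line(aes(y = med_nth)) +
  scale_y_continuous(breaks = seq(0, 20, 5))
</code></pre>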
<figure>

    <img loading="lazy" srcset="/2018/07/imdb-data-analysis/imdb-12_hu_32ee97febb68e3.webp 320w,/2018/07/imdb-data-analysis/imdb-12_hu_69e7d60d89429d8f.webp 768w,/2018/07/imdb-data-analysis/imdb-12_hu_c9df788e280bb63b.webp 1024w,/2018/07/imdb-data-analysis/imdb-12.png 1200w" src="imdb-12.png"/> 
</figure>

<p>Huh. The median and upper-bound #th time have <em>dropped</em> over time? Hollywood has been promoting more newcomers as leads? That&rsquo;s not what I expected!</p>
<p>More work definitely needs to be done in this area. In the meantime, the official IMDb datasets are a lot more robust than I thought they would be! And I only used a fraction of the datasets; the rest tie into TV shows, which are a bit messier. Hopefully you&rsquo;ve gotten a good taste of the power of R and ggplot2 for playing with big-but-not-big data!</p>
<hr>
<p><em>You can view the R and ggplot used to create the data visualizations in <a href="http://minimaxir.com/notebooks/imdb-data-analysis/">this R Notebook</a>, which includes many visualizations not used in this post. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/imdb-data-analysis">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Visualizing One Million NCAA Basketball Shots</title>
      <link>https://minimaxir.com/2018/03/basketball-shots/</link>
      <pubDate>Mon, 19 Mar 2018 09:20:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/03/basketball-shots/</guid>
      <description>Although visualizing basketball shots has been done before, this time we have access to an order of magnitude more public data to do some really cool stuff.</description>
<content:encoded><![CDATA[<p>So <a href="https://www.ncaa.com/march-madness">March Madness</a> is happening right now. In celebration, <a href="https://www.google.com">Google</a> uploaded <a href="https://console.cloud.google.com/launcher/details/ncaa-bb-public/ncaa-basketball">massive basketball datasets</a> from the <a href="https://www.ncaa.com">NCAA</a> and <a href="https://www.sportradar.com/">Sportradar</a> to <a href="https://cloud.google.com/bigquery/">BigQuery</a> for anyone to query and experiment with. After learning that the <a href="https://www.reddit.com/r/bigquery/comments/82nz17/dataset_statistics_for_ncaa_mens_and_womens/">dataset had location data</a> on where basketball shots were made on the court, I played with it, and a couple of hours later I had a decent heat map data visualization. The next day, I <a href="https://www.reddit.com/r/dataisbeautiful/comments/837qnu/heat_map_of_1058383_basketball_shots_from_ncaa/">posted it</a> to Reddit&rsquo;s <a href="https://www.reddit.com/r/dataisbeautiful">/r/dataisbeautiful subreddit</a>, where it earned about <strong>40,000 upvotes</strong>. (!?)</p>
<p>Let&rsquo;s dig a little deeper. Although visualizing basketball shots has been <a href="http://www.slate.com/blogs/browbeat/2012/03/06/mapping_the_nba_how_geography_can_teach_players_where_to_shoot.html">done</a> <a href="http://toddwschneider.com/posts/ballr-interactive-nba-shot-charts-with-r-and-shiny/">before</a>, this time we have access to an order of magnitude more public data to do some really cool stuff.</p>
<h2 id="full-court">Full Court</h2>
<p>The Sportradar play-by-play table on BigQuery <code>mbb_pbp_sr</code> has more than 1 million NCAA men&rsquo;s basketball shots since the 2013-2014 season, with more being added now during March Madness. Here&rsquo;s a heat map of the locations where those shots were attempted on the full basketball court:</p>
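<p>As a hedged sketch of how a heat map like the one below could be built with ggplot2, assume the shots have already been queried from BigQuery into a data frame <code>df_shots</code>, with hypothetical coordinate columns <code>event_coord_x</code> and <code>event_coord_y</code> (one row per shot attempt):</p>
<pre><code class="language-r">library(tidyverse)

# hedged sketch: bin the (hypothetical) shot coordinates into a 2D heat map
plot &lt;- ggplot(df_shots, aes(x = event_coord_x, y = event_coord_y)) +
  geom_bin2d(bins = 100) +
  scale_fill_viridis_c(option = "inferno", labels = scales::comma) +
  coord_fixed() +
  labs(fill = "# Shot Attempts")
</code></pre>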
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_count_attempts_unlog_hu_35ce830f74de77b5.webp 320w,/2018/03/basketball-shots/ncaa_count_attempts_unlog_hu_ff7511dcccb6bf50.webp 768w,/2018/03/basketball-shots/ncaa_count_attempts_unlog_hu_c03f9beaec2e4059.webp 1024w,/2018/03/basketball-shots/ncaa_count_attempts_unlog.png 1800w" src="ncaa_count_attempts_unlog.png"/> 
</figure>

<p>We can clearly see at a glance that the majority of shots are taken right in front of the basket. For 3-point shots, the center and the corners have higher numbers of shot attempts than the other areas. But not much else is visible since the data is so spatially skewed; setting the bin color scale to logarithmic makes trends more apparent (and helps things go viral on Reddit).</p>
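<p>Concretely, the only change needed is a log transform on the fill scale (continuing the sketch above; ggplot will replace the existing fill scale with a message):</p>
<pre><code class="language-r"># same heat map sketch, but with a logarithmic fill scale so the sparser
# regions away from the basket remain visible
plot &lt;- plot +
  scale_fill_viridis_c(option = "inferno", labels = scales::comma,
                       trans = "log10")
</code></pre>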
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_count_attempts_hu_3a087234886ce568.webp 320w,/2018/03/basketball-shots/ncaa_count_attempts_hu_31931a7d73c00179.webp 768w,/2018/03/basketball-shots/ncaa_count_attempts_hu_39e87b359975bcd4.webp 1024w,/2018/03/basketball-shots/ncaa_count_attempts.png 1800w" src="ncaa_count_attempts.png"/> 
</figure>

<p>Now there&rsquo;s more going on here: shot behavior is clearly symmetric on each side of the court, and there&rsquo;s a small gap between the 3-point line and where 3-pt shots are typically taken, likely to ensure that a shot isn&rsquo;t accidentally ruled as a 2-pt shot.</p>
<p>How likely is it to score a shot from a given spot? Are certain spots better than others?</p>
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_count_perc_success_hu_1a20df6dc8d568f.webp 320w,/2018/03/basketball-shots/ncaa_count_perc_success_hu_72c3f2cbec0a75d8.webp 768w,/2018/03/basketball-shots/ncaa_count_perc_success_hu_308287fdb103668e.webp 1024w,/2018/03/basketball-shots/ncaa_count_perc_success.png 1800w" src="ncaa_count_perc_success.png"/> 
</figure>

<p>Surprisingly, shot accuracy is about <em>equal</em> from anywhere within typical shooting distance, except directly in front of the basket where it&rsquo;s much higher. What is the <a href="https://en.wikipedia.org/wiki/Expected_value">expected value</a> of a shot at a given position: that is, how many points, on average, will a shot from that spot earn for the team?</p>
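<p>The expected value per location is just the average points scored per attempt from that spot, with misses counting as 0. A hedged sketch, assuming hypothetical <code>shot_made</code> and <code>three_point_shot</code> columns in <code>df_shots</code>:</p>
<pre><code class="language-r"># hedged sketch: bucket shots into coarse court cells and compute the average
# points per attempt; shot_made / three_point_shot are hypothetical columns
df_expected &lt;- df_shots %&gt;%
  mutate(points = ifelse(shot_made, ifelse(three_point_shot, 3, 2), 0),
         x_bin = round(event_coord_x / 2) * 2,
         y_bin = round(event_coord_y / 2) * 2) %&gt;%
  group_by(x_bin, y_bin) %&gt;%
  summarize(avg_points = mean(points), attempts = n())

# df_expected can then be plotted with geom_tile(aes(fill = avg_points))
</code></pre>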
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_count_avg_points_hu_cc6b1aabe2a1fbbd.webp 320w,/2018/03/basketball-shots/ncaa_count_avg_points_hu_48fa925084585c1d.webp 768w,/2018/03/basketball-shots/ncaa_count_avg_points_hu_e4e431e478a401a7.webp 1024w,/2018/03/basketball-shots/ncaa_count_avg_points.png 1800w" src="ncaa_count_avg_points.png"/> 
</figure>

<p>The average points earned for 3-pt shots is about 1.5x higher than for many 2-pt shot locations in the inner court, as you&rsquo;d expect given the roughly equal accuracy, but locations next to the basket have an even higher expected value. Perhaps the accuracy of shots close to the basket is more than 1.5x that of 3-pt shots, outweighing the lower point value?</p>
<p>Since both sides of the court are indeed the same, we can combine the two sides and just plot a half-court instead. (Cross-court shots, which many Redditors <a href="https://www.reddit.com/r/dataisugly/comments/839rax/basketball_heat_map_shows_an_impressive_number_of/">argued</a> invalidated my visualizations above, constitute only <em>0.16%</em> of the basketball shots in the dataset, so they can safely be removed as outliers.)</p>
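<p>Combining the two sides amounts to rotating shots from the far half onto the near half. A minimal sketch, assuming the coordinate system runs 0-94 along x and 0-50 along y across the full court (an assumption about the data):</p>
<pre><code class="language-r"># hedged sketch: reflect far-half shots onto the near half (a 180-degree
# rotation), assuming x runs 0-94 and y runs 0-50 across the full court
df_shots_half &lt;- df_shots %&gt;%
  mutate(x_half = ifelse(event_coord_x &gt; 47, 94 - event_coord_x, event_coord_x),
         y_half = ifelse(event_coord_x &gt; 47, 50 - event_coord_y, event_coord_y))
</code></pre>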
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_count_attempts_half_log_hu_1b25bb288c7845a4.webp 320w,/2018/03/basketball-shots/ncaa_count_attempts_half_log_hu_1c576186de477a2e.webp 768w,/2018/03/basketball-shots/ncaa_count_attempts_half_log_hu_f23437ee277976f3.webp 1024w,/2018/03/basketball-shots/ncaa_count_attempts_half_log.png 1200w" src="ncaa_count_attempts_half_log.png"/> 
</figure>

<p>There are still a few oddities, such as shots being made <em>behind</em> the basket. Let&rsquo;s drill down a bit.</p>
<h2 id="focusing-on-basketball-shot-type">Focusing on Basketball Shot Type</h2>
<p>The Sportradar dataset classifies a shot as one of 5 major types: a <strong>jump shot</strong> where the player jumps and throws the basketball, a <strong>layup</strong> where the player runs down the court toward the basket and throws a one-handed shot, a <strong>dunk</strong> where the player slams the ball into the basket (looking cool in the process), a <strong>hook shot</strong> where the player, close to the basket, throws the ball with a hook motion, and a <strong>tip shot</strong> where the player intercepts a rebound near the rim and tips it in.</p>
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_types_prop_attempts_hu_5b2e2e8111e12e08.webp 320w,/2018/03/basketball-shots/ncaa_types_prop_attempts_hu_ced73cb24cc6fc7d.webp 768w,/2018/03/basketball-shots/ncaa_types_prop_attempts_hu_baa56eb71d1a510d.webp 1024w,/2018/03/basketball-shots/ncaa_types_prop_attempts.png 1200w" src="ncaa_types_prop_attempts.png"/> 
</figure>

<p>However, the most frequent types of shots are the less flashy, more practical jump shots and layups. But is a certain type of shot &ldquo;better?&rdquo;</p>
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_types_perc_hu_eddd49d65debceac.webp 320w,/2018/03/basketball-shots/ncaa_types_perc_hu_7ec71b6836db1818.webp 768w,/2018/03/basketball-shots/ncaa_types_perc_hu_bb58c5550052e5d8.webp 1024w,/2018/03/basketball-shots/ncaa_types_perc.png 1200w" src="ncaa_types_perc.png"/> 
</figure>

<p>Layups are safer than jump shots, but dunks are the most accurate of all the types (then again, players likely wouldn&rsquo;t attempt a dunk unless they knew it would be successful). The accuracy of layups and other close-to-basket shots is indeed more than 1.5x that of the jump shots used for 3-pt attempts, which explains the expected-value behavior above.</p>
<p>Plotting the heat maps for each type of shot offers more insight into how they work:</p>
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_count_attempts_half_types_log_hu_f158f6e3a8368a14.webp 320w,/2018/03/basketball-shots/ncaa_count_attempts_half_types_log_hu_21a49f6411f78b6.webp 768w,/2018/03/basketball-shots/ncaa_count_attempts_half_types_log.png 900w" src="ncaa_count_attempts_half_types_log.png"/> 
</figure>

<p>They&rsquo;re wildly different heat maps which match the shot type descriptions above, but show we&rsquo;ll need to separate data visualizations by type to accurately see trends.</p>
<h2 id="impact-of-game-elapsed-time-at-time-of-shot">Impact of Game Elapsed Time At Time of Shot</h2>
<p>An NCAA basketball game lasts for 40 minutes total (2 halves of 20 minutes each), with the possibility of overtime. The <a href="https://bigquery.cloud.google.com/savedquery/4194148158:3359d86507814fb19a5997a770456baa">example BigQuery</a> for the NCAA-provided data compares the percentage of 3-point shots made during the first 35 minutes of the game versus the last 5 minutes: at the end of the game, accuracy was lower by about 4 percentage points (31.2% vs. 35.1%). It might be interesting to facet these visualizations by the elapsed time of the game to see if there are any behavioral changes.</p>
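<p>As a rough sketch of that comparison on the Sportradar side (with hypothetical <code>elapsed_minutes</code>, <code>three_point_shot</code>, and <code>shot_made</code> columns, and ignoring overtime for simplicity):</p>
<pre><code class="language-r"># hedged sketch: 3-pt accuracy in the first 35 minutes vs. the last 5 minutes
df_shots %&gt;%
  filter(three_point_shot, elapsed_minutes &lt;= 40) %&gt;%
  mutate(period = ifelse(elapsed_minutes &lt; 35,
                         "First 35 minutes", "Last 5 minutes")) %&gt;%
  group_by(period) %&gt;%
  summarize(accuracy = mean(shot_made), attempts = n())
</code></pre>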
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_types_prop_type_elapsed_hu_bb28d87a78c18d3f.webp 320w,/2018/03/basketball-shots/ncaa_types_prop_type_elapsed_hu_b1bc08ac4dea3c7c.webp 768w,/2018/03/basketball-shots/ncaa_types_prop_type_elapsed_hu_d69cf0b659690837.webp 1024w,/2018/03/basketball-shots/ncaa_types_prop_type_elapsed.png 1200w" src="ncaa_types_prop_type_elapsed.png"/> 
</figure>

<p>There isn&rsquo;t much difference between the proportions within a given half, but there is a difference between the first half and the second half, where the second half has fewer jump shots and more aggressive layups and dunks. Looking at shot success percentage:</p>
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_types_perc_success_type_elapsed_hu_92a660a371b13a60.webp 320w,/2018/03/basketball-shots/ncaa_types_perc_success_type_elapsed_hu_8687c28a1832735b.webp 768w,/2018/03/basketball-shots/ncaa_types_perc_success_type_elapsed_hu_de114505630e7a6f.webp 1024w,/2018/03/basketball-shots/ncaa_types_perc_success_type_elapsed.png 1200w" src="ncaa_types_perc_success_type_elapsed.png"/> 
</figure>

<p>The jump shot accuracy loss at the end of the game with Sportradar data is similar to that of the NCAA data, which is a good sanity check (but it&rsquo;s odd that the accuracy drop only happens in the last 5 minutes and not elsewhere in the 2nd half). Layup accuracy increases in the second half with the number of layups.</p>
<p>We can also visualize heat maps for each combo of shot type with time elapsed bucket, but given the results above, the changes in behavior over time may not be very perceptible.</p>
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_count_attempts_half_interval_log_hu_87f66d471a4c95fc.webp 320w,/2018/03/basketball-shots/ncaa_count_attempts_half_interval_log_hu_d5cd2612709d9ea.webp 768w,/2018/03/basketball-shots/ncaa_count_attempts_half_interval_log_hu_8e1f44bad4069e9f.webp 1024w,/2018/03/basketball-shots/ncaa_count_attempts_half_interval_log.png 1200w" src="ncaa_count_attempts_half_interval_log.png"/> 
</figure>

<h2 id="impact-of-winninglosing-before-shot">Impact of Winning/Losing Before Shot</h2>
<p>Another theory worth exploring is whether shot behavior differs depending on whether a team is winning or losing when they take their shot (technically, whether the delta between the team&rsquo;s score and the other team&rsquo;s score is positive, negative, or 0 at that moment). Are players more relaxed when they have a lead? Are players more prone to making mistakes when losing?</p>
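<p>A hedged sketch of that bucketing, assuming hypothetical <code>team_score</code> and <code>opponent_score</code> columns capturing the score just before each shot:</p>
<pre><code class="language-r"># hedged sketch: classify each shot by the shooting team's game state
df_shots &lt;- df_shots %&gt;%
  mutate(score_delta = team_score - opponent_score,
         game_state = case_when(score_delta &gt; 0 ~ "Winning",
                                score_delta &lt; 0 ~ "Losing",
                                TRUE ~ "Tied"))
</code></pre>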
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_types_prop_type_score_hu_29c29d850235c76d.webp 320w,/2018/03/basketball-shots/ncaa_types_prop_type_score_hu_4c6a81e571854d10.webp 768w,/2018/03/basketball-shots/ncaa_types_prop_type_score_hu_5205e23cfda70f5a.webp 1024w,/2018/03/basketball-shots/ncaa_types_prop_type_score.png 1200w" src="ncaa_types_prop_type_score.png"/> 
</figure>

<p>Layups are the same across all buckets, but for teams that are winning, there are fewer jump shots and <strong>more dunkin&rsquo; action</strong> (nearly double the dunks!). However, the accuracy chart illustrates an issue:</p>
<figure>

    <img loading="lazy" srcset="/2018/03/basketball-shots/ncaa_types_perc_success_type_score_hu_31d0201603d0a7d7.webp 320w,/2018/03/basketball-shots/ncaa_types_perc_success_type_score_hu_bafe4c92c10d1157.webp 768w,/2018/03/basketball-shots/ncaa_types_perc_success_type_score_hu_8e7746842c943e81.webp 1024w,/2018/03/basketball-shots/ncaa_types_perc_success_type_score.png 1200w" src="ncaa_types_perc_success_type_score.png"/> 
</figure>

<p>Accuracy for most types of shots is much better for teams that are winning&hellip;which may be the <em>reason</em> they&rsquo;re winning. More research can be done in this area.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I fully admit I am not a basketball expert. But playing around with this data was a fun way to get a new perspective on how collegiate basketball games work. There&rsquo;s a lot more work that can be done with big basketball data and game strategy; the NCAA-provided data doesn&rsquo;t have location data, but it does have <strong>6x more shots</strong>, which will be very helpful for further fun in this area.</p>
<hr>
<p><em>You can view the R code, ggplot2 code, and BigQueries used to create the data visualizations in <a href="http://minimaxir.com/notebooks/basketball-shots/">this R Notebook</a>. You can also view the images/code used for this post in <a href="https://github.com/minimaxir/ncaa-basketball">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
<p><em>Special thanks to Ewen Gallic for his implementation of a <a href="http://egallic.fr/en/drawing-a-basketball-court-with-r/">basketball court in ggplot2</a>, which saved me a lot of time!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>A Visual Overview of Stack Overflow&#39;s Question Tags</title>
      <link>https://minimaxir.com/2018/02/stack-overflow-questions/</link>
      <pubDate>Fri, 09 Feb 2018 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/02/stack-overflow-questions/</guid>
      <description>I was surprised to see that all types of programming languages have quick answer times and a high probability of receiving an acceptable answer!</description>
<content:encoded><![CDATA[<p><a href="https://stackoverflow.com">Stack Overflow</a> is the most popular contemporary knowledge base for programming questions. But most people interact with the site by Googling a programming question and clicking a top result that links to SO. There isn&rsquo;t as much discussion about actually <em>asking</em> questions on the site.</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/python_last_list_hu_25d38cdb30d0498f.webp 320w,/2018/02/stack-overflow-questions/python_last_list_hu_379aa50fc7ec9a0a.webp 768w,/2018/02/stack-overflow-questions/python_last_list_hu_28ba6b374bd5a225.webp 1024w,/2018/02/stack-overflow-questions/python_last_list.png 1686w" src="python_last_list.png"/> 
</figure>

<p>I <em>could</em> use <a href="https://stackoverflow.com/users/9314418/minimaxir?tab=profile">my Stack Overflow account</a> and test out the process of creating a question, but <del>I already know everything about programming</del> there may be another way to learn how SO works. Stack Overflow <a href="https://archive.org/details/stackexchange">releases an archive</a> of all questions on the site every 3 months, and this archive is <a href="https://cloud.google.com/bigquery/public-data/stackoverflow">syndicated to BigQuery</a>, making it trivial to retrieve and analyze the millions of SO questions over the years. Even though (now-former) Stack Overflow data scientist <a href="https://twitter.com/drob">David Robinson</a> has written <a href="https://stackoverflow.blog/2017/09/06/incredible-growth-python/">many</a> <a href="https://stackoverflow.blog/2017/04/19/programming-languages-used-late-night/">interesting</a> blog posts for Stack Overflow with their data, I figured why not give it a try.</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/python_last_list_answer_hu_eb0af2ca58a32eeb.webp 320w,/2018/02/stack-overflow-questions/python_last_list_answer_hu_a239dff4552731b7.webp 768w,/2018/02/stack-overflow-questions/python_last_list_answer_hu_c17d3dec6132cd9f.webp 1024w,/2018/02/stack-overflow-questions/python_last_list_answer.png 1670w" src="python_last_list_answer.png"/> 
</figure>

<h2 id="overview">Overview</h2>
<p>Unlike social media sites like <a href="https://twitter.com">Twitter</a> and <a href="https://www.reddit.com">Reddit</a> where the majority of traffic is driven within the first days after something is posted, posts on evergreen content sources like Stack Overflow are still relevant many years later. In fact, the traffic to Stack Overflow for most of 2017 (derived by finding the difference between question view counts from archive snapshots) is approximately uniform across question age, with a slight bias toward older content.</p>
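<p>As a rough sketch of that derivation, assuming two hypothetical snapshot data frames <code>df_snapshot_old</code> and <code>df_snapshot_new</code> that each contain a question <code>id</code> and its cumulative <code>view_count</code>:</p>
<pre><code class="language-r"># hedged sketch: estimate views accrued between two archive snapshots by
# differencing each question's cumulative view counts
df_views_2017 &lt;- df_snapshot_new %&gt;%
  inner_join(df_snapshot_old, by = "id", suffix = c("_new", "_old")) %&gt;%
  mutate(views_2017 = view_count_new - view_count_old)
</code></pre>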
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/so_overview_hu_ccb1bb5b14e0f490.webp 320w,/2018/02/stack-overflow-questions/so_overview_hu_fd5456b53e8a3d50.webp 768w,/2018/02/stack-overflow-questions/so_overview_hu_b48cb8326f951666.webp 1024w,/2018/02/stack-overflow-questions/so_overview.png 1200w" src="so_overview.png"/> 
</figure>

<p>In 2017, Stack Overflow received about 40k-50k new questions each week, an impressive feat:</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/weekly_count_hu_f42f46bbf2c0045c.webp 320w,/2018/02/stack-overflow-questions/weekly_count_hu_adafdf8ff991a648.webp 768w,/2018/02/stack-overflow-questions/weekly_count_hu_20ee00d40fdeabb2.webp 1024w,/2018/02/stack-overflow-questions/weekly_count.png 1200w" src="weekly_count.png"/> 
</figure>

<p>For the rest of this post, we&rsquo;ll only look at questions made in 2017 (until December; about 2.3 million questions total) in order to get a sense of the current development landscape, and what&rsquo;s to come in the future. But what types of questions are they?</p>
<h2 id="tag-breakdown">Tag Breakdown</h2>
<p>All questions on Stack Overflow are required to have at least 1 tag indicating the programming languages/technologies involved with the question, and can have up to 5 tags. In the example &ldquo;how do you get the last element of a list in Python&rdquo; <a href="https://stackoverflow.com/questions/930397/getting-the-last-element-of-a-list-in-python">question</a> above, the tags are <code>python</code>, <code>list</code>, and <code>indexing</code>. In 2017, most new questions had 2-3 tags (i.e. people aren&rsquo;t <a href="http://minimaxir.com/2014/03/hashtag-tag/">tag spamming</a> like on <a href="https://www.instagram.com/?hl=en">Instagram</a> for maximum exposure).</p>
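<p>As a quick sketch, assuming the questions sit in a hypothetical <code>df_questions</code> data frame with a pipe-delimited <code>tags</code> field (e.g. <code>python|list|indexing</code>), the tag counts per question can be tallied like this:</p>
<pre><code class="language-r"># hedged sketch: count how many tags each question has, then tally
df_questions %&gt;%
  mutate(num_tags = str_count(tags, fixed("|")) + 1) %&gt;%
  count(num_tags)
</code></pre>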
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/so_tag_breakdown_hu_824c3cbac84d4ce6.webp 320w,/2018/02/stack-overflow-questions/so_tag_breakdown_hu_35c62637eb6e12ac.webp 768w,/2018/02/stack-overflow-questions/so_tag_breakdown_hu_41d81ccb55b35e25.webp 1024w,/2018/02/stack-overflow-questions/so_tag_breakdown.png 1200w" src="so_tag_breakdown.png"/> 
</figure>

<p>In theory, tag spamming might make a question more likely to be answered; however, for all tag counts, the proportion of questions with an accepted answer (the green checkmark) is <strong>36-39%</strong>, so there&rsquo;s not much practical benefit to minmaxing tag counts. Which types of tagged questions are most likely to be answered?</p>
<p>First, here&rsquo;s the breakdown of the top 40 tags on Stack Overflow, by the number of new questions containing that tag for each month throughout 2017. This can give a sense of each technology&rsquo;s growth/decline throughout the year.</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/monthly_count_tag_hu_ea69cdded812352f.webp 320w,/2018/02/stack-overflow-questions/monthly_count_tag_hu_10da23bf89790b71.webp 768w,/2018/02/stack-overflow-questions/monthly_count_tag_hu_67b73cc591239cf1.webp 1024w,/2018/02/stack-overflow-questions/monthly_count_tag.png 1800w" src="monthly_count_tag.png"/> 
</figure>

<p>Both new web development technologies like <code>reactjs</code> and <code>typescript</code> and data science tools like <code>pandas</code> and <code>r</code> are trending upward.</p>
<p>For the Top 1,000 tags, here are the top 30 tags by the proportion of questions which received an acceptable answer:</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/acceptable_answer_top_30_hu_3fe7bc1f073db8d2.webp 320w,/2018/02/stack-overflow-questions/acceptable_answer_top_30_hu_f71e8403d24ba45c.webp 768w,/2018/02/stack-overflow-questions/acceptable_answer_top_30_hu_e89eb6c7dcf96060.webp 1024w,/2018/02/stack-overflow-questions/acceptable_answer_top_30.png 1800w" src="acceptable_answer_top_30.png"/> 
</figure>

<p>In contrast, here are the bottom 30 out of the Top 1,000:</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/acceptable_answer_bottom_30_hu_97b990139e2d88c3.webp 320w,/2018/02/stack-overflow-questions/acceptable_answer_bottom_30_hu_e4bfbf35b53fc86b.webp 768w,/2018/02/stack-overflow-questions/acceptable_answer_bottom_30_hu_32acaa74db309a7c.webp 1024w,/2018/02/stack-overflow-questions/acceptable_answer_bottom_30.png 1800w" src="acceptable_answer_bottom_30.png"/> 
</figure>

<p>The top tags are newer, sexier technologies like <code>rust</code> and <code>dart</code>, with another strong hint of data science tooling with <code>dplyr</code> (which I used to aggregate the data for this post!) and <code>data.table</code>. In contrast, the bottom tags are less sexy and more corporate like <code>salesforce</code>, <code>drupal</code>, and <code>sharepoint-2013</code> (that&rsquo;s why consultants who specialize in these technologies can get paid very well!).</p>
<p>It should be noted these two charts do not necessarily imply that one technology is &ldquo;better&rdquo; than another, and the difference in answer rates may be due to question difficulty and the number of people skilled in the tech available that can answer it effectively.</p>
<p>The timing when questions are asked might vary by tag. Per <a href="https://stackoverflow.blog/2017/04/19/programming-languages-used-late-night/">a Stack Overflow analysis</a>, people typically ask questions during the 9 AM - 5 PM work hours (although in my case, I cannot easily adjust for the time zone of the asker). How does this data fare?</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/monthly_count_hr_doy_hu_fae4937bf0f0691e.webp 320w,/2018/02/stack-overflow-questions/monthly_count_hr_doy_hu_40befef30e7c85c5.webp 768w,/2018/02/stack-overflow-questions/monthly_count_hr_doy_hu_7c83e2680a6fe00.webp 1024w,/2018/02/stack-overflow-questions/monthly_count_hr_doy.png 1800w" src="monthly_count_hr_doy.png"/> 
</figure>

<p>This visualization is a bit weird. I adjusted the times to Eastern time since internet activity for U.S.-based websites tends to revolve around that time zone. But for most technologies, the peak question-asking times fall well before the 9 AM to 5 PM block: do those technologies see greater use in Europe and Asia? (In contrast, data-oriented technologies like <code>r</code>, <code>pandas</code>, and <code>excel</code> <em>do</em> peak during the 9-5 block.)</p>
<h2 id="how-easy-is-it-to-get-an-answer-by-tag">How easy is it to get an answer by tag?</h2>
<p>Stack Overflow tailors the homepage toward the logged-in user&rsquo;s recommended tags. Therefore, it&rsquo;s not a surprise that the distributions of view counts on 2017 questions for each tag are very similar, although there is a slight edge toward the new &ldquo;hip&rdquo; technologies like <code>typescript</code>, <code>spring</code>, and <code>swift</code>.</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/views_boxplot_tag_hu_78b7bfb6f63173af.webp 320w,/2018/02/stack-overflow-questions/views_boxplot_tag_hu_add2aa16b5291c89.webp 768w,/2018/02/stack-overflow-questions/views_boxplot_tag_hu_f3845f5a14be4e23.webp 1024w,/2018/02/stack-overflow-questions/views_boxplot_tag.png 1800w" src="views_boxplot_tag.png"/> 
</figure>

<p>At the least, the distribution ensures that at least 10 people will see your question for these popular topics, which is nifty when you consider that posts on Twitter and Reddit can die without any visibility at all. But will those viewers provide an acceptable answer?</p>
<p>The time it takes to get an acceptable answer also varies significantly by tag:</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/acceptable_answer_density_hu_441417d4b0b9dbfd.webp 320w,/2018/02/stack-overflow-questions/acceptable_answer_density_hu_7e30755ce8384eeb.webp 768w,/2018/02/stack-overflow-questions/acceptable_answer_density_hu_538cf9d028958aed.webp 1024w,/2018/02/stack-overflow-questions/acceptable_answer_density.png 1800w" src="acceptable_answer_density.png"/> 
</figure>

<p>A median time of <em>15 minutes</em> for tags like <code>pandas</code> and <code>arrays</code> is pretty impressive! And even in the worst-case scenario for these popular tags, the median is only a couple of hours, much lower than I thought it would be.</p>
<h2 id="the-relationship-between-tags">The Relationship Between Tags</h2>
<p>As one would expect, the types of questions asked for each tag are very different. Here&rsquo;s a word cloud for each of the tags, showing the words most frequently used in the questions with those tags:</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/so_tag_wordcloud_hu_8ba9e0f7676ec6b7.webp 320w,/2018/02/stack-overflow-questions/so_tag_wordcloud_hu_2078ca85488e7569.webp 768w,/2018/02/stack-overflow-questions/so_tag_wordcloud_hu_a7c21a23620e1454.webp 1024w,/2018/02/stack-overflow-questions/so_tag_wordcloud.png 1800w" src="so_tag_wordcloud.png"/> 
</figure>

<p>Notably, each word cloud is significantly different from the others, even when the technologies are related (also surprisingly true in the case of <code>angular</code> and <code>angularjs</code>!).</p>
<p>How are the tags related, anyway? We can calculate an <a href="https://en.wikipedia.org/wiki/Adjacency_matrix">adjacency matrix</a> of the tag pairs in the questions to see which tags co-occur:</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/so_tag_adjacency_hu_6c0ca82329fcb525.webp 320w,/2018/02/stack-overflow-questions/so_tag_adjacency_hu_51742ee9039c83b6.webp 768w,/2018/02/stack-overflow-questions/so_tag_adjacency_hu_1cf472d92985bb8e.webp 1024w,/2018/02/stack-overflow-questions/so_tag_adjacency.png 1800w" src="so_tag_adjacency.png"/> 
</figure>
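<p>As a rough illustration, here is a minimal Python sketch of how such a tag co-occurrence matrix could be computed. The DataFrame and column names (<code>questions</code>, <code>tags</code>) are assumptions for the example; the actual aggregation for this post was done in R with dplyr.</p>
<pre><code class="language-python"># A minimal sketch: count how often pairs of top tags appear on the same
# question, building a symmetric adjacency matrix. `questions` and its `tags`
# column (a list of tags per question) are assumed names for this example.
from collections import Counter
from itertools import combinations

import pandas as pd

def tag_adjacency(questions, top_tags):
    top = set(top_tags)
    pair_counts = Counter()
    for tags in questions['tags']:
        present = sorted(set(tags).intersection(top))
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1

    # Build a symmetric matrix indexed by tag.
    matrix = pd.DataFrame(0, index=top_tags, columns=top_tags)
    for (a, b), n in pair_counts.items():
        matrix.loc[a, b] = n
        matrix.loc[b, a] = n
    return matrix

# Example usage:
# adjacency = tag_adjacency(questions, ['javascript', 'python', 'json', 'android'])
</code></pre>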

<p>Looking down a given row/column, you can see which technologies have a lot of questions in common with others (for example, <code>javascript</code> and <code>json</code> are frequently asked in conjunction with other tags).</p>
<p>Going back to the earlier discussion of tag abuse, does the presence of certain pairs of tags lead to notably different answer rates?</p>
<figure>

    <img loading="lazy" srcset="/2018/02/stack-overflow-questions/so_tag_adjacency_percent_hu_f1f8dada071ec058.webp 320w,/2018/02/stack-overflow-questions/so_tag_adjacency_percent_hu_89c242977eb1efb1.webp 768w,/2018/02/stack-overflow-questions/so_tag_adjacency_percent_hu_5603c58116c008e5.webp 1024w,/2018/02/stack-overflow-questions/so_tag_adjacency_percent.png 1800w" src="so_tag_adjacency_percent.png"/> 
</figure>

<p>Tag pairs which don&rsquo;t make much sense (e.g. <code>ios</code>+<code>android</code>, <code>ios</code>+<code>javascript</code>, <code>android</code>+<code>php</code>) tend to have very low answer rates (20%-30%). But tags which already have high answer rates, like <code>regex</code>, don&rsquo;t move much higher or lower when paired with another tag.</p>
<h2 id="conclusion">Conclusion</h2>
<p>There&rsquo;s a lot more that can be done with question tags on Stack Overflow. I was surprised to see that all types of programming languages have quick answer times and a high probability of receiving an acceptable answer! I&rsquo;ll definitely keep an eye on the SO archives as they are released, and I&rsquo;m excited to see how trends change in the future.</p>
<hr>
<p><em>You can view the R and ggplot2 code used to create the data visualizations in <a href="http://minimaxir.com/notebooks/stack-overflow-questions/">this R Notebook</a>. You can also view the images/data used for this post in <a href="https://github.com/minimaxir/stack-overflow-questions">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Predicting the Success of a Reddit Submission with Deep Learning and Keras</title>
      <link>https://minimaxir.com/2017/06/reddit-deep-learning/</link>
      <pubDate>Mon, 26 Jun 2017 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/06/reddit-deep-learning/</guid>
      <description>Thanks to Keras, performing deep learning on a very large number of Reddit submissions is actually pretty easy. Performing it &lt;em&gt;well&lt;/em&gt; is a different story.</description>
      <content:encoded><![CDATA[<p>I&rsquo;ve been trying to figure out what makes a <a href="https://www.reddit.com">Reddit</a> submission &ldquo;good&rdquo; for years. If we assume the number of upvotes on a submission is a fair proxy for submission quality, optimizing a statistical model for Reddit data with submission score as a response variable might lead to interesting (and profitable) insights when transferred into other domains, such as Facebook Likes and Twitter Favorites.</p>
<figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/reddit-example_hu_ced286403a9a1f93.webp 320w,/2017/06/reddit-deep-learning/reddit-example_hu_25e458673cf9d615.webp 768w,/2017/06/reddit-deep-learning/reddit-example_hu_5d082790fbac9a8c.webp 1024w,/2017/06/reddit-deep-learning/reddit-example.png 1202w" src="reddit-example.png"/> 
</figure>

<p>An important part of a Reddit submission is the submission <strong>title</strong>. Like news headlines, a catchy title will make a user <a href="http://minimaxir.com/2015/10/reddit-topwords/">more inclined</a> to engage with a submission and potentially upvote.</p>
<p>Additionally, the <strong>time when the submission is made</strong> is <a href="http://minimaxir.com/2015/10/reddit-bigquery/">important</a>; submitting when user activity is the highest tends to lead to better results if you are trying to maximize exposure.</p>
<p>The actual <strong>content</strong> of the Reddit submission such as images/links to a website is likewise important, but good content is relatively difficult to optimize.</p>
<p>Can the magic of deep learning reconcile these concepts and create a model which can predict if a submission is a good submission? Thanks to <a href="https://github.com/fchollet/keras">Keras</a>, performing deep learning on a very large number of Reddit submissions is actually pretty easy. Performing it <em>well</em> is a different story.</p>
<h2 id="getting-the-data--feature-engineering">Getting the Data + Feature Engineering</h2>
<p>It&rsquo;s difficult to (ethically) retrieve the content of millions of Reddit submissions at scale, so let&rsquo;s start by building a model using submissions on <a href="https://www.reddit.com/r/AskReddit/">/r/AskReddit</a>: Reddit&rsquo;s largest subreddit, which receives 8,000+ submissions each day. /r/AskReddit is a self-post-only subreddit with no external links, allowing us to focus on only the submission title and timing.</p>
<p><a href="http://minimaxir.com/2015/10/reddit-bigquery/">As always</a>, we can collect large amounts of Reddit data from the public Reddit dataset on <a href="https://cloud.google.com/bigquery/">BigQuery</a>. The submission <code>title</code> is available by default. The raw timestamp of the submission is also present, allowing us to extract the <code>hour</code> of submission (adjusted to Eastern Standard Time) and <code>dayofweek</code>, as used in the heatmap above. But why stop there? Since /r/AskReddit receives hundreds of submissions <em>every hour</em> on average, we should look at the <code>minute</code> level to see if there are any deeper trends (e.g. there are only 30 slots available on the first page of /new and since there is so much submission activity, it might be more advantageous to submit during off-peak times). Lastly, to account for potential changes in behavior as the year progresses, we should add a <code>dayofyear</code> feature, where January 1st = 1, January 2nd = 2, etc which can also account for variance due to atypical days like holidays.</p>
<p>Instead of predicting the raw number of upvotes on the Reddit submission (as the distribution of submission scores is heavily skewed), we should predict <strong>whether or not the submission is good</strong>, shaping the problem as a <a href="https://en.wikipedia.org/wiki/Logistic_regression">logistic regression</a>. In this case, let&rsquo;s define a &ldquo;good submission&rdquo; as one whose score is equal to or above the <strong>50th percentile (median) of all submissions</strong> in /r/AskReddit. Unfortunately, the median score ends up being <strong>2 points</strong>; although &ldquo;one upvote&rdquo; might be a low threshold for a &ldquo;good&rdquo; submission, it splits the dataset into 64% bad submissions and 36% good submissions, and setting the percentile threshold higher would result in a very unbalanced dataset for model training (a score of 2+ also implies that the submission did not get downvoted to death, which is useful).</p>
<p>Gathering all <strong>976,538 /r/AskReddit submissions</strong> from January 2017 to April 2017 should be enough data for this project. Here&rsquo;s the final BigQuery:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="n">standardSQL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">CAST</span><span class="p">(</span><span class="n">FORMAT_TIMESTAMP</span><span class="p">(</span><span class="s1">&#39;%H&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">TIMESTAMP_SECONDS</span><span class="p">(</span><span class="n">created_utc</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;America/New_York&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">INT64</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">hour</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">CAST</span><span class="p">(</span><span class="n">FORMAT_TIMESTAMP</span><span class="p">(</span><span class="s1">&#39;%M&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">TIMESTAMP_SECONDS</span><span class="p">(</span><span class="n">created_utc</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;America/New_York&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">INT64</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">minute</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">CAST</span><span class="p">(</span><span class="n">FORMAT_TIMESTAMP</span><span class="p">(</span><span class="s1">&#39;%w&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">TIMESTAMP_SECONDS</span><span class="p">(</span><span class="n">created_utc</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;America/New_York&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">INT64</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">dayofweek</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">CAST</span><span class="p">(</span><span class="n">FORMAT_TIMESTAMP</span><span class="p">(</span><span class="s1">&#39;%j&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">TIMESTAMP_SECONDS</span><span class="p">(</span><span class="n">created_utc</span><span class="p">),</span><span class="w"> </span><span class="s1">&#39;America/New_York&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">INT64</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">dayofyear</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">IF</span><span class="p">(</span><span class="n">PERCENT_RANK</span><span class="p">()</span><span class="w"> </span><span class="n">OVER</span><span class="w"> </span><span class="p">(</span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">score</span><span class="w"> </span><span class="k">ASC</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="mi">0</span><span class="p">.</span><span class="mi">50</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">is_top_submission</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">reddit_posts</span><span class="p">.</span><span class="o">*`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">WHERE</span><span class="w"> </span><span class="p">(</span><span class="n">_TABLE_SUFFIX</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s1">&#39;2017_01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s1">&#39;2017_04&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="n">subreddit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;AskReddit&#39;</span><span class="w">
</span></span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/bigquery_hu_89adb35f6f1860d6.webp 320w,/2017/06/reddit-deep-learning/bigquery_hu_bb2e955b3cb7daeb.webp 768w,/2017/06/reddit-deep-learning/bigquery_hu_fa76341d390d603.webp 1024w,/2017/06/reddit-deep-learning/bigquery.png 2104w" src="bigquery.png"/> 
</figure>

<h2 id="model-architecture">Model Architecture</h2>
<p><em>If you want to see the detailed data transformations and Keras code examples/outputs for this post, you can view <a href="https://github.com/minimaxir/predict-reddit-submission-success/blob/master/predict_askreddit_submission_success_timing.ipynb">this Jupyter Notebook</a>.</em></p>
<p>Text processing is a good use case for deep learning, as it can identify relationships between words where older methods like <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">tf-idf</a> can&rsquo;t. Keras, a high-level deep learning framework on top of lower-level frameworks like <a href="https://www.tensorflow.org">TensorFlow</a>, can easily convert a list of texts to a <a href="https://keras.io/preprocessing/sequence/">padded sequence</a> of <a href="https://keras.io/preprocessing/text/">index tokens</a> that can interact with deep learning models, along with many other benefits. Data scientists often use <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">recurrent neural networks</a>, which can &ldquo;learn&rdquo; sequential patterns in text, for classification. However, <a href="https://github.com/facebookresearch/fastText">fasttext</a>, a newer algorithm from researchers at Facebook, can perform classification tasks with an <a href="http://minimaxir.com/2017/06/keras-cntk/">order of magnitude faster</a> training time than RNNs, with similar predictive performance.</p>
<p>fasttext works by <a href="https://arxiv.org/abs/1607.01759">averaging word vectors</a>. In this Reddit model architecture inspired by the <a href="https://github.com/fchollet/keras/blob/master/examples/imdb_fasttext.py">official Keras fasttext example</a>, each word in a Reddit submission title (up to 20) is mapped to a 50-dimensional vector from an Embeddings layer of up to 40,000 words. The Embeddings layer is <a href="https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html">initialized</a> with <a href="https://nlp.stanford.edu/projects/glove/">GloVe word embeddings</a> pre-trained on billions of words to give the model a good start. All the word vectors for a given Reddit submission title are averaged together, and then a Dense fully-connected layer outputs a probability the given text is a good submission. The gradients then backpropagate and improve the word embeddings for future batches during training.</p>
<p>Keras has a <a href="https://keras.io/visualization/">convenient utility</a> to visualize deep learning models:</p>
<figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/model_shapes-1_hu_b9f7a08f534a0b45.webp 320w,/2017/06/reddit-deep-learning/model_shapes-1.png 663w" src="model_shapes-1.png"/> 
</figure>

<p>However, the first output above is the <em>auxiliary output</em> for <a href="https://en.wikipedia.org/wiki/Regularization_%28mathematics%29">regularizing</a> the word embeddings; we still have to incorporate the submission timing data into the model.</p>
<p>Each of the four timing features (hour, minute, day of week, day of year) receives its own Embeddings layer, outputting a 64D vector. This allows the features to learn latent characteristics which may be missed using traditional <a href="http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html">one-hot encoding</a> for categorical data in machine learning problems.</p>
<figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/model_shapes-2_hu_52d718feedd74c43.webp 320w,/2017/06/reddit-deep-learning/model_shapes-2_hu_84d0630736ebd887.webp 768w,/2017/06/reddit-deep-learning/model_shapes-2_hu_f74f2c7dacf4dc23.webp 1024w,/2017/06/reddit-deep-learning/model_shapes-2.png 1754w" src="model_shapes-2.png"/> 
</figure>

<p>The 50D word average vector is concatenated with the four vectors above, resulting in a 306D vector. This combined vector is connected to another fully-connected layer which can account for hidden interactions between all five input features (plus <a href="https://keras.io/layers/normalization/">batch normalization</a>, which improves training speed for Dense layers). Then the model outputs a final probability prediction: the <em>main output</em>.</p>
<figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/model_shapes-3_hu_d2ecf94768050fa.webp 320w,/2017/06/reddit-deep-learning/model_shapes-3_hu_e208de51b840cc8a.webp 768w,/2017/06/reddit-deep-learning/model_shapes-3.png 852w" src="model_shapes-3.png"/> 
</figure>

<p>The final model:</p>
<figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/model_hu_ea04eea1eca03032.webp 320w,/2017/06/reddit-deep-learning/model_hu_6adb1a1bee6dfcb9.webp 768w,/2017/06/reddit-deep-learning/model_hu_b6ceee5bdac0e8e1.webp 1024w,/2017/06/reddit-deep-learning/model.png 1350w" src="model.png"/> 
</figure>

<p>All of this sounds difficult to implement, but Keras&rsquo;s <a href="https://keras.io/getting-started/functional-api-guide/">functional API</a> ensures that adding each layer and linking them together takes only a single line of code apiece.</p>
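<p>To make that concrete, here is a minimal Python sketch of the architecture above using the functional API. The sizes follow the text (20-token titles, a 40,000-word vocabulary, 50D word vectors, 64D embeddings per timing feature); the hidden-layer size is an arbitrary choice here, the GloVe initialization and auxiliary output are omitted, and the variable names are my own. See the linked Jupyter Notebook for the actual implementation.</p>
<pre><code class="language-python"># A simplified sketch of the model described above, using the Keras
# functional API. GloVe initialization and the auxiliary output are omitted.
from keras.layers import (Input, Embedding, GlobalAveragePooling1D, Flatten,
                          Dense, BatchNormalization, concatenate)
from keras.models import Model

# Text branch: average the word vectors of the title (the fasttext approach).
title_input = Input(shape=(20,), name='title')
title_embed = Embedding(40000, 50, name='word_embeddings')(title_input)
title_avg = GlobalAveragePooling1D()(title_embed)   # 50D average word vector

# One Embedding per timing feature, each learning a 64D latent representation.
def time_branch(name, vocab_size):
    inp = Input(shape=(1,), name=name)
    vec = Flatten()(Embedding(vocab_size, 64)(inp))
    return inp, vec

hour_in, hour_vec = time_branch('hour', 24)
minute_in, minute_vec = time_branch('minute', 60)
dayofweek_in, dayofweek_vec = time_branch('dayofweek', 7)
dayofyear_in, dayofyear_vec = time_branch('dayofyear', 366)

# Concatenate into a 306D vector (50 + 4 * 64), then a hidden Dense layer
# with batch normalization, then the main sigmoid output.
merged = concatenate([title_avg, hour_vec, minute_vec,
                      dayofweek_vec, dayofyear_vec])
hidden = BatchNormalization()(Dense(128, activation='relu')(merged))
main_out = Dense(1, activation='sigmoid', name='main_out')(hidden)

model = Model(inputs=[title_input, hour_in, minute_in, dayofweek_in, dayofyear_in],
              outputs=main_out)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
</code></pre>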
<h2 id="training-results">Training Results</h2>
<p>Because the model uses no recurrent layers, it trains fast enough on a CPU despite the large dataset size.</p>
<p>We split the full dataset into 80%/20% training/test datasets, training the model on the former and testing it against the latter. Keras trains a model with a simple <code>fit</code> command; here it trains for 20 epochs, where one epoch represents one complete pass over the training set.</p>
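<p>A minimal sketch of that training call, assuming the inputs have already been split into training and test arrays (the array names, batch size, and log filename here are placeholders, not the notebook&rsquo;s exact code):</p>
<pre><code class="language-python"># Train for 20 epochs, validating against the held-out 20% test set and
# logging per-epoch metrics to a CSV file. Array names are placeholders.
from keras.callbacks import CSVLogger

model.fit([titles_train, hour_train, minute_train, dayofweek_train, dayofyear_train],
          is_top_train,
          validation_data=([titles_test, hour_test, minute_test,
                            dayofweek_test, dayofyear_test], is_top_test),
          epochs=20,
          batch_size=256,
          callbacks=[CSVLogger('training_log.csv')])
</code></pre>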
<figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/fit_hu_c4b22cd471fd6b14.webp 320w,/2017/06/reddit-deep-learning/fit_hu_25fedd4b89849374.webp 768w,/2017/06/reddit-deep-learning/fit_hu_408a494d98bce4d7.webp 1024w,/2017/06/reddit-deep-learning/fit.png 1236w" src="fit.png"/> 
</figure>

<p>There&rsquo;s a lot happening in the console output due to the architecture, but the main metrics of interest are the <code>main_out_acc</code>, the accuracy of the training set through the main output, and <code>val_main_out_acc</code>, the accuracy of the test set. Ideally, the accuracy of both should increase as training progresses. However, the test accuracy <em>must</em> be better than the 64% baseline (if we just say all /r/AskReddit submissions are bad), otherwise this model is unhelpful.</p>
<p>Keras&rsquo;s <a href="https://keras.io/callbacks/#csvlogger">CSVLogger</a> trivially logs all these metrics to a CSV file. Plotting the results of the 20 epochs:</p>
<figure>

    <img loading="lazy" srcset="/2017/06/reddit-deep-learning/predict-reddit-1_hu_5671c8a5110b2d25.webp 320w,/2017/06/reddit-deep-learning/predict-reddit-1_hu_dab24707e22e81d.webp 768w,/2017/06/reddit-deep-learning/predict-reddit-1_hu_325aebfe1b36135c.webp 1024w,/2017/06/reddit-deep-learning/predict-reddit-1.png 1200w" src="predict-reddit-1.png"/> 
</figure>

<p>The test accuracy does indeed beat the 64% baseline; however, test accuracy <em>decreases</em> as training progresses. This is a sign of <a href="https://en.wikipedia.org/wiki/Overfitting">overfitting</a>, possibly due to a disparity between the texts in the training and test sets. In deep learning, you can account for overfitting by adding <a href="https://keras.io/layers/core/#dropout">Dropout</a> to relevant layers, but in my testing it did not help.</p>
<h2 id="using-the-model-to-optimize-reddit-submissions">Using The Model To Optimize Reddit Submissions</h2>
<p>At the least, we now have a model that understands the latent characteristics of an /r/AskReddit submission. But how do you apply the model <em>in practical, real-world situations</em>?</p>
<p>Let&rsquo;s take a random /r/AskReddit submission: <a href="https://www.reddit.com/r/AskReddit/comments/5odcpd/which_movies_plot_would_drastically_change_if_you/">Which movie&rsquo;s plot would drastically change if you removed a letter from its title?</a>, submitted Monday, January 16th at 3:46 PM EST and receiving 4 upvotes (a &ldquo;good&rdquo; submission in context of this model). Plugging those input variables into the trained model results in a <strong>0.669</strong> probability of it being considered a good submission, which is consistent with the true results.</p>
<p>But what if we made <em>minor, iterative changes</em> to the title while keeping the time submitted unchanged? Can we improve this probability?</p>
<p>&ldquo;Drastically&rdquo; is a silly adjective; removing it and using the title <strong>Which movie&rsquo;s plot would change if you removed a letter from its title?</strong> results in a greater probability of <strong>0.682</strong>.</p>
<p>&ldquo;Removed&rdquo; is <a href="http://www.ef.edu/english-resources/english-grammar/conditional/">grammatically incorrect</a>; fixing the issue and using the title <strong>Which movie&rsquo;s plot would change if you remove a letter from its title?</strong> results in a greater probability of <strong>0.692</strong>.</p>
<p>&ldquo;Which&rdquo; is also <a href="https://www.englishclub.com/vocabulary/wh-question-words.htm">grammatically incorrect</a>; fixing the issue and using the title <strong>What movie&rsquo;s plot would change if you remove a letter from its title?</strong> results in a greater probability of <strong>0.732</strong>.</p>
<p>Although adjectives are sometimes redundant, they can add an intriguing emphasis; adding a &ldquo;single&rdquo; and using the title <strong>What movie&rsquo;s plot would change if you remove a single letter from its title?</strong> results in a greater probability of <strong>0.753</strong>.</p>
<p>Not bad for a little workshopping!</p>
<p>Now that we have an improved title, we can find an optimal time to make the submission through brute force by calculating the probabilities for all combinations of hour, minute, and day of week (and offsetting the day of year appropriately). After doing so, I discovered that making the submission on the previous Sunday at 10:55 PM EST results in the maximum probability possible of being a good submission at <strong>0.841</strong> (the other top submission times are at various other minutes during that hour; the best time on a different day is the following Tuesday at 4:05 AM EST with a probability of <strong>0.823</strong>).</p>
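<p>The brute-force search itself is simple: score the improved title at every day-of-week/hour/minute combination and keep the best. A rough Python sketch (<code>title_seq</code> is the padded token sequence for the title, and the day-of-year is held fixed here for simplicity, whereas the actual search offset it appropriately):</p>
<pre><code class="language-python"># Score every (day of week, hour, minute) combination and keep the highest
# predicted probability. `title_seq` and the fixed day-of-year are placeholders.
import itertools
import numpy as np

best_prob, best_time = 0.0, None
for day, hour, minute in itertools.product(range(7), range(24), range(60)):
    prob = model.predict([title_seq,
                          np.array([hour]),
                          np.array([minute]),
                          np.array([day]),
                          np.array([16])])[0][0]
    if prob > best_prob:
        best_prob, best_time = prob, (day, hour, minute)

print(best_prob, best_time)
</code></pre>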
<p>In all, this model of Reddit submission success prediction is a proof of concept; there are many, <em>many</em> optimizations that can be done on the feature engineering side and on the data collection side (especially if we want to model subreddits other than /r/AskReddit). Predicting which submissions go viral, instead of just predicting which submissions receive at least one upvote, is another, more advanced problem entirely.</p>
<p>Thanks to the high-level abstractions and utility functions of Keras, I was able to prototype the initial model in an afternoon instead of the weeks/months required for academic papers and software applications in this area. At the least, this little experiment serves as an example of applying Keras to a real-world dataset, and of the tradeoffs that result when deep learning can&rsquo;t magically solve everything. But that doesn&rsquo;t mean my experiments on the Reddit data were unproductive; on the contrary, I now have a few clever new ideas for how to fix some of the issues discovered, which I hope to implement soon.</p>
<p>Again, I strongly recommend reading the data transformations and Keras code examples in <a href="https://github.com/minimaxir/predict-reddit-submission-success/blob/master/predict_askreddit_submission_success_timing.ipynb">this Jupyter Notebook</a> for more information into the methodology, as building modern deep learning models is more intuitive and less arcane than what thought pieces on Medium imply.</p>
<hr>
<p><em>You can view the R and ggplot2 code used to visualize the model data in <a href="http://minimaxir.com/notebooks/predict-reddit-submission-success/">this R Notebook</a>, including 2D projections of the Embedding layers not in this article. You can also view the images/data used for this post in <a href="https://github.com/minimaxir/predict-reddit-submission-success">this GitHub repository</a>.</em></p>
<p><em>You are free to use the data visualizations/model architectures from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>The Decline of Imgur on Reddit and the Rise of Reddit&#39;s Native Image Hosting</title>
      <link>https://minimaxir.com/2017/06/imgur-decline/</link>
      <pubDate>Tue, 20 Jun 2017 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/06/imgur-decline/</guid>
      <description>Before Reddit added native image hosting, Imgur accounted for 15% of all submissions to Reddit. Now it&amp;rsquo;s below 9%.</description>
      <content:encoded><![CDATA[<p>Last week, Bloomberg <a href="https://www.bloomberg.com/news/articles/2017-06-17/reddit-said-to-be-raising-funds-valuing-startup-at-1-7-billion">reported</a> that Reddit was raising about $150 Million in venture capital at a valuation of $1.7 billion. Since Reddit&rsquo;s data is <a href="http://minimaxir.com/2015/10/reddit-bigquery/">public on BigQuery</a>, I quickly checked if there were any recent user engagement growth spurts which could justify such a high worth. Here&rsquo;s an example BigQuery which aggregates the total number of Reddit submissions made for each month until the end of April 2017:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">#</span><span class="n">standardSQL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">DATE_TRUNC</span><span class="p">(</span><span class="nb">DATE</span><span class="p">(</span><span class="n">TIMESTAMP_SECONDS</span><span class="p">(</span><span class="n">created_utc</span><span class="p">)),</span><span class="w"> </span><span class="k">MONTH</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">mon</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">num_submissions</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">FROM</span><span class="w"> </span><span class="o">`</span><span class="n">fh</span><span class="o">-</span><span class="n">bigquery</span><span class="p">.</span><span class="n">reddit_posts</span><span class="p">.</span><span class="o">*`</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">WHERE</span><span class="w"> </span><span class="p">(</span><span class="n">_TABLE_SUFFIX</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s1">&#39;2016_01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s1">&#39;2017_04&#39;</span><span class="w"> </span><span class="k">OR</span><span class="w"> </span><span class="n">_TABLE_SUFFIX</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;full_corpus_201512&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">mon</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">mon</span><span class="w">
</span></span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/reddit-1_hu_50c3dc5d7726f37e.webp 320w,/2017/06/imgur-decline/reddit-1_hu_3424cd96f290c9b5.webp 768w,/2017/06/imgur-decline/reddit-1_hu_d72ee57a46a0b1c1.webp 1024w,/2017/06/imgur-decline/reddit-1.png 1500w" src="reddit-1.png"/> 
</figure>

<p>As it turns out, Reddit did indeed get a large boost in activity toward the end of 2016, likely due to the <em>heated</em> discussions and events around the <a href="https://en.wikipedia.org/wiki/United_States_presidential_election,_2016">U.S. Presidential Election</a>. But Reddit has maintained the growth rate since then, which is very appealing to potential investors.</p>
<p>How are other sites benefiting from Reddit&rsquo;s growth? <a href="http://imgur.com">Imgur</a>, an image host developed to be the <em>de facto</em> image hosting service for Reddit, shared in Reddit&rsquo;s continual growth&hellip;</p>
<figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/reddit-2_hu_f32c314c9db02d44.webp 320w,/2017/06/imgur-decline/reddit-2_hu_3de34fd78ea0813b.webp 768w,/2017/06/imgur-decline/reddit-2_hu_e9950c21b9870ca4.webp 1024w,/2017/06/imgur-decline/reddit-2.png 1500w" src="reddit-2.png"/> 
</figure>

<p>&hellip;until mid-2016, when Imgur submission activity abruptly dropped. What happened?</p>
<p>Coincidentally in mid-2016, Reddit <a href="https://techcrunch.com/2016/05/25/reddit-image-uploads/">made itself</a> an image host for submissions to the site. Initially limited to uploads via the iOS/Android apps, Reddit then allowed desktop users to upload images through a <a href="https://www.reddit.com/r/changelog/comments/4kuk2j/reddit_change_introducing_image_uploading_beta/">beta rollout</a> starting May 24th, and a full <a href="https://www.reddit.com/r/announcements/comments/4p5dm9/image_hosting_on_reddit/">sitewide release</a> on June 21st.</p>
<p>How many Reddit-hosted image submissions are there compared to the number of Imgur submissions?</p>
<figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/reddit-3_hu_e253b95a38033adc.webp 320w,/2017/06/imgur-decline/reddit-3_hu_b700c86581397587.webp 768w,/2017/06/imgur-decline/reddit-3_hu_f675746a4ace3aec.webp 1024w,/2017/06/imgur-decline/reddit-3.png 1500w" src="reddit-3.png"/> 
</figure>

<p>Wow, native Reddit images caught on.</p>
<h2 id="market-share">Market Share</h2>
<figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/pics_hu_254562dd20df73be.webp 320w,/2017/06/imgur-decline/pics_hu_a0d29b8f5ec323e1.webp 768w,/2017/06/imgur-decline/pics_hu_822ab094826f3973.webp 1024w,/2017/06/imgur-decline/pics.png 1101w" src="pics.png"/> 
</figure>

<p>Did the rise of Reddit-hosted images cause the decline of Imgur on Reddit? Let&rsquo;s look at the daily number of Imgur submissions and Reddit-hosted image submissions from December 2015 to April 2017, normalized by the total number of sitewide submissions on that day. This gives us a Reddit &ldquo;market share&rdquo; metric for both services.</p>
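<p>As a sketch of that calculation (in pandas, with assumed column names; the actual aggregation was a BigQuery query), the market share is just each domain&rsquo;s daily submission count divided by that day&rsquo;s sitewide total:</p>
<pre><code class="language-python"># `daily` is assumed to have one row per (day, domain) with a num_submissions
# column; divide by the sitewide total for that day to get a market share.
import pandas as pd

totals = daily.groupby('day')['num_submissions'].transform('sum')
daily['share'] = daily['num_submissions'] / totals

imgur_share = daily[daily['domain'].str.contains('imgur', na=False)]
reddit_hosted_share = daily[daily['domain'] == 'i.redd.it']
</code></pre>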
<p>Additionally, we can plot vertical lines representing the dates when Reddit-hosted images rolled out in the limited beta release and the full sitewide release to see if there is a link between those events and submission behavior.</p>
<figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/reddit-4_hu_4ee09338a5791411.webp 320w,/2017/06/imgur-decline/reddit-4_hu_6c7245f5f940c2e1.webp 768w,/2017/06/imgur-decline/reddit-4_hu_7cb95dbaf55d9d98.webp 1024w,/2017/06/imgur-decline/reddit-4.png 1200w" src="reddit-4.png"/> 
</figure>

<p>Before Reddit added native image hosting, Imgur accounted for 15% of all submissions to Reddit. Now it&rsquo;s below 9%. More Reddit-hosted images are being shared on Reddit than images from Imgur.</p>
<p>Instead of looking at all of Reddit, where spam subreddits could skew the results, we can also look at the largest image-only subreddits: <a href="https://www.reddit.com/r/pics/">/r/pics</a> and <a href="https://www.reddit.com/r/gifs/">/r/gifs</a>, both of which were a part of the beta rollout.</p>
<figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/reddit-5_hu_f39f160457e2df18.webp 320w,/2017/06/imgur-decline/reddit-5_hu_aaa929a44f375cda.webp 768w,/2017/06/imgur-decline/reddit-5_hu_f4911f05f33ceed5.webp 1024w,/2017/06/imgur-decline/reddit-5.png 1200w" src="reddit-5.png"/> 
</figure>

<p>Here, the impact of the two rollouts is much more noticeable, with immediate increases in Reddit-hosted image market share after each rollout and proportional decreases in Imgur market share. The growth rate after the beta release is flat for both services, but once Reddit image hosting becomes sitewide and users officially learn that the native image upload functionality exists, the market shares of Reddit-hosted and Imgur images increase/decrease linearly over time. And these trends do not appear to be slowing down.</p>
<h2 id="a-silver-lining">A Silver Lining?</h2>
<p>Obviously Imgur does not like losing a <em>large</em> chunk of traffic, but there&rsquo;s a possibility that this outcome will be better for the business than what the charts above imply.</p>
<p>Hosting images on the internet isn&rsquo;t free, and bandwidth costs are the primary reason dedicated image hosts have died off over the years. Direct image links, which show the user only the image and nothing else, are convenient, but they are a pure loss for the service. That&rsquo;s why image hosts encourage linking to the image on a landing page of the website, filled with ads which generate an expected revenue greater than the cost of serving the image.</p>
<p>After a user uploads an image to Imgur on the desktop, the user is given two share links that can be submitted to sites like Reddit: an image link that goes to the image + ads, and a direct link to the image.</p>
<figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/imgur_direct_hu_4e7b2a396ae5e6bf.webp 320w,/2017/06/imgur-decline/imgur_direct_hu_290e7d38ff430219.webp 768w,/2017/06/imgur-decline/imgur_direct.png 991w" src="imgur_direct.png"/> 
</figure>

<p>Recently, Imgur has <a href="https://www.reddit.com/r/assholedesign/comments/5gs96k/just_show_me_the_fucking_image_imgur/">pushed app downloads</a> to users visiting the site on an iOS/Android device, including <a href="https://www.reddit.com/r/assholedesign/comments/695efj/upload_image_on_imgur_mobile_has_been_replaced_by/">disabling uploads</a> in the mobile browser. When sharing an image from the Imgur app, the <em>only</em> way to share an image is through the image link, which could lead to an increase in the proportion of ad-filled Imgur image links on Reddit. Said increase could counteract the decrease in total Imgur submissions, and Imgur could actually come out ahead.</p>
<p>With BigQuery, we can check the percentage of all Imgur submissions to Reddit which are direct links and the percentage which are indirect/lead to a landing page, and see if the ratio changes along the same time horizon used above:</p>
<figure>

    <img loading="lazy" srcset="/2017/06/imgur-decline/reddit-6_hu_f1c47ff2cd14f4d3.webp 320w,/2017/06/imgur-decline/reddit-6_hu_7baf41c4d88bcb6a.webp 768w,/2017/06/imgur-decline/reddit-6_hu_822a82d187387670.webp 1024w,/2017/06/imgur-decline/reddit-6.png 1200w" src="reddit-6.png"/> 
</figure>

<p>Welp. No significant change in the ratio over time, eliminating that possible silver lining.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Note that the decline of Imgur on Reddit says nothing about Imgur as a business; it&rsquo;s entirely possible that Imgur&rsquo;s traffic on the main site itself is sufficient for growth. But the loss of Reddit traffic certainly can&rsquo;t be ignored, and it&rsquo;s interesting to visualize how quickly a service can be replaced when there&rsquo;s an equivalent native feature.</p>
<p>It&rsquo;s worth noting that new competitors in the image space such as <a href="https://giphy.com">Giphy</a> utilize image hosting as a <em>secondary</em> service. Instead, they focus on building a repository of images which can be licensed and accessed programmatically by other services like Slack, Facebook, and Twitter. And Giphy has raised <a href="https://www.crunchbase.com/organization/giphy#/entity">$150 Million</a> total with this approach, so perhaps the image hosting market itself has indeed changed.</p>
<hr>
<p><em>You can view the R, ggplot2 code, and BigQueries used to visualize the Reddit data in <a href="http://minimaxir.com/notebooks/imgur-decline/">this R Notebook</a>. You can also view the images/data used for this post in <a href="https://github.com/minimaxir/imgur-decline">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Playing with 80 Million Amazon Product Review Ratings Using Apache Spark</title>
      <link>https://minimaxir.com/2017/01/amazon-spark/</link>
      <pubDate>Mon, 02 Jan 2017 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/01/amazon-spark/</guid>
      <description>Manipulating actually-big-data is just as easy as performing an analysis on a dataset with only a few records.</description>
      <content:encoded><![CDATA[<p><a href="https://www.amazon.com">Amazon</a> product reviews and ratings are a very important business. Customers on Amazon often make purchasing decisions based on those reviews, and a single bad review can cause a potential purchaser to reconsider. A couple years ago, I wrote a blog post titled <a href="http://minimaxir.com/2014/06/reviewing-reviews/">A Statistical Analysis of 1.2 Million Amazon Reviews</a>, which was well-received.</p>
<p>Back then, I was limited to only 1.2M reviews because attempting to process more data caused out-of-memory issues and my R code took <em>hours</em> to run.</p>
<p><a href="http://spark.apache.org">Apache Spark</a>, which makes processing gigantic amounts of data efficient and sensible, has become very popular in the past couple years (for good tutorials on using Spark with Python, I recommend the <a href="https://courses.edx.org/courses/course-v1:BerkeleyX&#43;CS105x&#43;1T2016/info">free</a> <a href="https://courses.edx.org/courses/course-v1:BerkeleyX&#43;CS110x&#43;2T2016/info">eDX</a> <a href="https://courses.edx.org/courses/course-v1:BerkeleyX&#43;CS120x&#43;2T2016/info">courses</a>). Although data scientists often use Spark to process data with distributed cloud computing via <a href="https://aws.amazon.com/ec2/">Amazon EC2</a> or <a href="https://azure.microsoft.com/en-us/services/hdinsight/apache-spark/">Microsoft Azure</a>, Spark works just fine even on a typical laptop, given enough memory (for this post, I use a 2016 MacBook Pro/16GB RAM, with 8GB allocated to the Spark driver).</p>
<p>I wrote a <a href="https://github.com/minimaxir/amazon-spark/blob/master/amazon_preprocess.py">simple Python script</a> to combine the per-category ratings-only data from the <a href="http://jmcauley.ucsd.edu/data/amazon/">Amazon product reviews dataset</a> curated by Julian McAuley, Rahul Pandey, and Jure Leskovec for their 2015 paper <a href="http://cseweb.ucsd.edu/~jmcauley/pdfs/kdd15.pdf">Inferring Networks of Substitutable and Complementary Products</a>. The result is a 4.53 GB CSV that would definitely not open in Microsoft Excel. The truncated and combined dataset includes the <strong>user_id</strong> of the user leaving the review, the <strong>item_id</strong> indicating the Amazon product receiving the review, the <strong>rating</strong> the user gave the product from 1 to 5, and the <strong>timestamp</strong> indicating the time when the review was written (truncated to the Day). We can also infer the <strong>category</strong> of the reviewed product from the name of the data subset.</p>
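<p>The combining step itself is straightforward. A minimal sketch of that kind of script (the file paths are placeholders; the author&rsquo;s actual script is linked above) just streams each per-category ratings file into one combined CSV while tagging each row with its category:</p>
<pre><code class="language-python"># Concatenate the per-category ratings-only CSVs (user, item, rating, timestamp)
# into a single file, adding the category inferred from the filename.
import csv
import glob
import os

with open('amazon_ratings.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['user_id', 'item_id', 'rating', 'timestamp', 'category'])
    for path in glob.glob('ratings_*.csv'):
        category = os.path.basename(path)[len('ratings_'):-len('.csv')]
        with open(path) as f:
            for user_id, item_id, rating, timestamp in csv.reader(f):
                writer.writerow([user_id, item_id, rating, timestamp, category])
</code></pre>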
<p>Afterwards, using the new <a href="http://spark.rstudio.com">sparklyr</a> package for R, I can easily start a local Spark cluster with a single <code>spark_connect()</code> command and load the entire CSV into the cluster in seconds with a single <code>spark_read_csv()</code> command.</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/output_hu_ec8eea9b3081c1c7.webp 320w,/2017/01/amazon-spark/output_hu_8270512f3a7c1a2d.webp 768w,/2017/01/amazon-spark/output_hu_4b84f8ec97e28a5d.webp 1024w,/2017/01/amazon-spark/output.png 1106w" src="output.png"/> 
</figure>

<p>There are 80.74 million records total in the dataset, or as the output helpfully reports, <code>8.074e+07</code> records. Performing advanced queries with traditional tools like <a href="https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html">dplyr</a> or even Python&rsquo;s <a href="http://pandas.pydata.org">pandas</a> on such a dataset would take a considerable amount of time to execute.</p>
<p>With sparklyr, manipulating actually-big-data is <em>just as easy</em> as performing an analysis on a dataset with only a few records (and an order of magnitude easier than the Python approaches taught in the eDX class mentioned above!).</p>
<h2 id="exploratory-analysis">Exploratory Analysis</h2>
<p><em>(You can view the R code used to process the data with Spark and generate the data visualizations in <a href="http://minimaxir.com/notebooks/amazon-spark/">this R Notebook</a>)</em></p>
<p>There are <strong>20,368,412</strong> unique users who provided reviews in this dataset. <strong>51.9%</strong> of those users have only written one review.</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/user_count_cum_hu_54895d3a9ab17726.webp 320w,/2017/01/amazon-spark/user_count_cum_hu_7225760c4d310a5d.webp 768w,/2017/01/amazon-spark/user_count_cum_hu_ce06c1ed7757f2bc.webp 1024w,/2017/01/amazon-spark/user_count_cum.png 1200w" src="user_count_cum.png"/> 
</figure>

<p>Relatedly, there are <strong>8,210,439</strong> unique products in this dataset, where <strong>43.3%</strong> have only one review.</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/item_count_cum_hu_8daa25ccc943c402.webp 320w,/2017/01/amazon-spark/item_count_cum_hu_955b99f79f562cd7.webp 768w,/2017/01/amazon-spark/item_count_cum_hu_1ad195a387d28909.webp 1024w,/2017/01/amazon-spark/item_count_cum.png 1200w" src="item_count_cum.png"/> 
</figure>

<p>After removing duplicate ratings, I added a few more features to each rating which may help illustrate how review behavior changed over time: a ranking value indicating the # review that the author of a given review has written (1st review by author, 2nd review by author, etc.), a ranking value indicating the # review that the product of a given review has received (1st review for product, 2nd review for product, etc.), and the month and year the review was made.</p>
<p>The first two added features require a <em>very</em> large amount of processing power, and highlight the convenience of Spark&rsquo;s speed (and the fact that Spark uses all CPU cores by default, while typical R/Python approaches are single-threaded!).</p>
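<p>For reference, here is what those ranking features look like as window functions, written here in PySpark rather than the sparklyr/R code this post actually uses (the column names assume the schema described above):</p>
<pre><code class="language-python"># The n-th review written by each user and the n-th review received by each
# product, plus the review month and year, computed with window functions.
from pyspark.sql import Window, functions as F

by_user = Window.partitionBy('user_id').orderBy('timestamp')
by_item = Window.partitionBy('item_id').orderBy('timestamp')

df_t = (df.withColumn('user_nth', F.row_number().over(by_user))
          .withColumn('item_nth', F.row_number().over(by_item))
          .withColumn('month', F.month(F.from_unixtime('timestamp')))
          .withColumn('year', F.year(F.from_unixtime('timestamp'))))
</code></pre>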
<p>These changes are cached into a Spark DataFrame <code>df_t</code>. If I want to determine which Amazon product category receives the best review ratings on average, I can aggregate the data by category, calculate the average rating score for each category, and sort. Thanks to the power of Spark, the data processing for these many millions of records takes seconds.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_agg</span> <span class="o">&lt;-</span> <span class="n">df_t</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">            <span class="nf">group_by</span><span class="p">(</span><span class="n">category</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">            <span class="nf">summarize</span><span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="nf">n</span><span class="p">(),</span> <span class="n">avg_rating</span> <span class="o">=</span> <span class="nf">mean</span><span class="p">(</span><span class="n">rating</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">            <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">avg_rating</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">            <span class="nf">collect</span><span class="p">()</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/avg_hu_24f1b4ab4339fd26.webp 320w,/2017/01/amazon-spark/avg_hu_699a7e6381f1a38f.webp 768w,/2017/01/amazon-spark/avg.png 962w" src="avg.png"/> 
</figure>

<p>Or, visualized in chart form using <a href="http://ggplot2.org">ggplot2</a>:</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/avg_rating_desc_hu_a4ddfa7be2c75fbd.webp 320w,/2017/01/amazon-spark/avg_rating_desc_hu_5e6789cd9495791d.webp 768w,/2017/01/amazon-spark/avg_rating_desc_hu_f1c761a8c71557d9.webp 1024w,/2017/01/amazon-spark/avg_rating_desc.png 1200w" src="avg_rating_desc.png"/> 
</figure>

<p>Digital Music/CD products receive the highest ratings on average, while Video Games and Cell Phones receive the lowest ratings on average, with a <strong>0.77</strong>-point range between them. This does make some intuitive sense; Digital Music and CDs are types of products where you know <em>exactly</em> what you are getting, with no chance of a random product defect, while Cell Phones and Accessories can have variable quality from shady third-party sellers (Video Games in particular are also prone to irrational <a href="http://steamed.kotaku.com/steam-games-are-now-even-more-susceptible-to-review-bom-1774940065">review bombing</a> over minor grievances).</p>
<p>We can refine this visualization by splitting each bar into a percentage breakdown of each rating from 1-5. This could be plotted with a pie chart for each category; however, a stacked bar chart, scaled to 100%, looks much cleaner.</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/category_breakdown_hu_56697490c6e5e18.webp 320w,/2017/01/amazon-spark/category_breakdown_hu_433f387b09546fd8.webp 768w,/2017/01/amazon-spark/category_breakdown_hu_5e8c2aba48f55a50.webp 1024w,/2017/01/amazon-spark/category_breakdown.png 1200w" src="category_breakdown.png"/> 
</figure>

<p>The new visualization does help support the theory above; the top categories have a significantly higher percentage of 4/5-star ratings than the bottom categories, and a much lower proportion of 1/2/3-star ratings. The inverse holds true for the bottom categories.</p>
<p>How have these breakdowns changed over time? Are there other factors in play?</p>
<h2 id="rating-breakdowns-over-time">Rating Breakdowns Over Time</h2>
<p>Perhaps the advent of binary Like/Dislike behaviors in social media in the 2000s has translated into a change in behavior for a 5-star review system. Here are the rating breakdowns for reviews written in each month from January 2000 to July 2014:</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/time_breakdown_hu_3b40970c67c5dd8a.webp 320w,/2017/01/amazon-spark/time_breakdown_hu_e279eb96257dc056.webp 768w,/2017/01/amazon-spark/time_breakdown_hu_fe56bce22245cdf.webp 1024w,/2017/01/amazon-spark/time_breakdown.png 1200w" src="time_breakdown.png"/> 
</figure>

<p>The voting behavior oscillates very slightly over time with no clear spikes or inflection points, which dashes that theory.</p>
<h2 id="distribution-of-average-scores">Distribution of Average Scores</h2>
<p>We should look at the distribution of overall average ratings for Amazon products (i.e. what customers see when they buy products), and at the distribution of average ratings given by users. We would expect the distributions to match, so any deviations would be interesting.</p>
<p>Looking at products with at least 5 ratings, products have a <strong>4.16</strong> overall rating on average.</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/item_histogram_hu_b5c0dd55f5e6ccca.webp 320w,/2017/01/amazon-spark/item_histogram_hu_b4be0bc02d2408a0.webp 768w,/2017/01/amazon-spark/item_histogram_hu_398b78930a28f79d.webp 1024w,/2017/01/amazon-spark/item_histogram.png 1200w" src="item_histogram.png"/> 
</figure>

<p>When looking at a similar graph for the overall ratings given by users (5 ratings minimum), the average rating is slightly higher at <strong>4.20</strong>.</p>
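<p>The aggregation behind both of these distributions is a simple group-and-filter, sketched here in PySpark (the post itself uses sparklyr), with <code>df</code> standing in for the ratings DataFrame:</p>
<pre><code class="language-python"># Average rating per product and per user, keeping only those with at
# least 5 ratings.
from pyspark.sql import functions as F

item_avgs = (df.groupBy('item_id')
               .agg(F.count('*').alias('n'), F.avg('rating').alias('avg_rating'))
               .filter('n >= 5'))

user_avgs = (df.groupBy('user_id')
               .agg(F.count('*').alias('n'), F.avg('rating').alias('avg_rating'))
               .filter('n >= 5'))
</code></pre>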
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/user_histogram_hu_46fa162c3c0a3bab.webp 320w,/2017/01/amazon-spark/user_histogram_hu_fb8dae1d5d34cedf.webp 768w,/2017/01/amazon-spark/user_histogram_hu_9d7210271d963b43.webp 1024w,/2017/01/amazon-spark/user_histogram.png 1200w" src="user_histogram.png"/> 
</figure>

<p>The primary difference between the two distributions is that there is a significantly higher proportion of Amazon customers giving <em>only</em> 5-star reviews. Normalizing and overlaying the two charts clearly highlights that discrepancy.</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/user_item_histogram_hu_1b96e01d8d762a1f.webp 320w,/2017/01/amazon-spark/user_item_histogram_hu_c0e6f7c088bdc8c0.webp 768w,/2017/01/amazon-spark/user_item_histogram_hu_ee477c1eaf841ccd.webp 1024w,/2017/01/amazon-spark/user_item_histogram.png 1200w" src="user_item_histogram.png"/> 
</figure>

<h2 id="the-marginal-review">The Marginal Review</h2>
<p>A few posts ago, I discussed how the <a href="http://minimaxir.com/2016/11/first-comment/">first comment on a Reddit post</a> has dramatically more influence than subsequent comments. Does user rating behavior change after making more and more reviews? Is the typical rating behavior different for the first review of a given product?</p>
<p>Here is the ratings breakdown for the <em>n</em>-th Amazon review a user gives:</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/user_nth_breakdown_hu_c346f6785b5af381.webp 320w,/2017/01/amazon-spark/user_nth_breakdown_hu_466e6aec3324fc8d.webp 768w,/2017/01/amazon-spark/user_nth_breakdown_hu_7f96a46425d7abb2.webp 1024w,/2017/01/amazon-spark/user_nth_breakdown.png 1200w" src="user_nth_breakdown.png"/> 
</figure>

<p>The first review a user writes has a slightly higher proportion of 1-star ratings than subsequent reviews. Otherwise, the voting behavior stays mostly the same over time, although users give an increasing proportion of 4-star reviews instead of 5-star reviews as they get more comfortable.</p>
<p>In contrast, here is the ratings breakdown for the <em>n</em>-th review an Amazon product received:</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/item_nth_breakdown_hu_57c6596aabcca292.webp 320w,/2017/01/amazon-spark/item_nth_breakdown_hu_f4e53aa9efa8dea4.webp 768w,/2017/01/amazon-spark/item_nth_breakdown_hu_ac4b2ab9202340fb.webp 1024w,/2017/01/amazon-spark/item_nth_breakdown.png 1200w" src="item_nth_breakdown.png"/> 
</figure>

<p>The first review a product receives has a slightly higher proportion of 5-star ratings than subsequent reviews. However, after the 10th review, there is <em>zero</em> change in the distribution of ratings, which implies that the marginal rating behavior is independent of the current score after that threshold.</p>
<h2 id="summary">Summary</h2>
<p>Granted, this blog post is more about playing with data than analyzing it. What might be interesting to look into for future technical posts is conditional behavior, such as predicting the rating of a review given the previous ratings on that product or by that user. However, this post shows that while &ldquo;big data&rdquo; may be an inscrutable buzzword nowadays, you don&rsquo;t have to work for a Fortune 500 company to be able to understand it. Even with a data set consisting of 5 simple features, you can extract a large number of insights.</p>
<p>And this post doesn&rsquo;t even look at the text of the Amazon product reviews or the metadata associated with the products! I do have a few ideas lined up there which I won&rsquo;t spoil.</p>
<hr>
<p><em>You can view all the R and ggplot2 code used to visualize the Amazon data in <a href="http://minimaxir.com/notebooks/amazon-spark/">this R Notebook</a>. You can also view the images/data used for this post in <a href="https://github.com/minimaxir/amazon-spark">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>What Percent of the Top-Voted Comments in Reddit Threads Were Also 1st Comment?</title>
      <link>https://minimaxir.com/2016/11/first-comment/</link>
      <pubDate>Mon, 07 Nov 2016 06:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2016/11/first-comment/</guid>
<description>Are commenters &amp;lsquo;late to this thread&amp;rsquo; indeed late?</description>
<content:encoded><![CDATA[<p><a href="https://www.reddit.com">Reddit</a> threads can be crowded places. In popular subreddits such as <a href="https://www.reddit.com/r/AskReddit/">/r/AskReddit</a> and <a href="https://www.reddit.com/r/pics/">/r/pics</a>, Reddit submissions can receive hundreds, even <em>thousands</em>, of unique comments. Some comments inevitably become lost in the noise. Reddit&rsquo;s <a href="https://redditblog.com/2009/10/15/reddits-new-comment-sorting-system/">ranking algorithm</a> attempts to rectify this by determining comment ranking using both time and community voting; comments in a thread, by default, are ordered based on the <strong>points score</strong> (upvotes - downvotes) the comment receives, subject to a rank decay based on the age of the comment.</p>
<figure>

    <img loading="lazy" srcset="/2016/11/first-comment/reddit_askreddit_hu_42dc1e3a1f9d90f5.webp 320w,/2016/11/first-comment/reddit_askreddit_hu_fc39e2c66d98aeaa.webp 768w,/2016/11/first-comment/reddit_askreddit_hu_f78209b1fdd81424.webp 1024w,/2016/11/first-comment/reddit_askreddit.png 1735w" src="reddit_askreddit.png"/> 
</figure>

<p>In theory, this system should allow comments posted later in the thread&rsquo;s lifetime to rank much higher temporarily so that Redditors can vote on the new comment; if the new comment is good, it can rise to the top, surfacing content that would otherwise be buried. Anecdotally, that doesn&rsquo;t seem to be the case with Reddit&rsquo;s modern algorithm; comments made late in the thread appear at the bottom, where they likely will not receive any upvotes (this led to a minor &ldquo;<a href="https://www.google.com/#q=site:reddit.com&#43;%22late&#43;to&#43;this&#43;thread%22">I know I&rsquo;m late to this thread but&hellip;</a>&rdquo; meme).</p>
<p>I, of course, am not satisfied with anecdotes. A month ago, a Redditor asked &ldquo;<a href="https://www.reddit.com/r/TheoryOfReddit/comments/53d5ep/what_percentage_of_the_top_comment_in_threads/">What percentage of the top comment in threads were also the first comment?</a>&rdquo; Why not calculate it <em>exactly</em> using big data?</p>
<h2 id="getting-the-reddit-data">Getting the Reddit Data</h2>
<p><em>You can view all the <a href="https://www.r-project.org">R</a> and <a href="http://ggplot2.org">ggplot2</a> code used to query, analyze, and visualize the Reddit data in <a href="http://minimaxir.com/notebooks/first-comment/">this R Notebook</a>.</em></p>
<p>In order to process a great amount of Reddit data, I turned to <a href="https://cloud.google.com/bigquery/">BigQuery</a>, which now has data for <a href="https://www.reddit.com/r/datasets/comments/590re2/updated_reddit_comments_and_posts_updated_on/">all Reddit comments</a> until September 2016.</p>
<p>For this analysis, I will only look at the <strong>top-level comments</strong> (i.e. comments which are not replies to other comments), since those are the ones most affected by the ordering and submission of new comments. Additionally, I will only look at comments within Reddit threads with <strong>at least 30 top-level comments</strong> to ensure I only look at threads with sufficient discussion, where late posts are more likely to become hidden. It also mirrors the &ldquo;late to this thread&rdquo; meme: can posts be <em>too</em> late?</p>
<p>The queried data covers all comments posted from January 2015 to September 2016: this gives a good balance between sample size and coverage of the modern comment-ranking algorithms. The total number of Reddit comments analyzed, after filtering on threads with sufficient conversation and limiting the scope to the first 100 comments of each thread which also rank within the Top 100 by score, is <strong>n = 86,561,476</strong>.</p>
<p>With clever use of BigQuery window functions, I obtained the aggregate data, counting the number of comments from the filtered Reddit threads at each combination of score rank and created rank.</p>
<figure>

    <img loading="lazy" srcset="/2016/11/first-comment/data_hu_4d65bf7b71b33d7.webp 320w,/2016/11/first-comment/data.png 485w" src="data.png"/> 
</figure>
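<p>The ranking logic behind that table is straightforward to sketch: keep only the top-level comments, rank them within each thread once by score and once by submission time, drop threads below the 30-comment threshold, and count comments at each (created rank, score rank) pairing. The real query does this with window functions over the full BigQuery comment tables; here is the equivalent logic in pandas on a toy DataFrame, with the filter threshold lowered so the toy threads survive:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import pandas as pd

# toy stand-in for top-level Reddit comments: one row per comment
comments = pd.DataFrame({
    "link_id":     ["t3_a", "t3_a", "t3_a", "t3_b", "t3_b", "t3_b"],
    "created_utc": [100, 110, 120, 200, 210, 220],
    "score":       [40, 12, 55, 3, 9, 1],
})

# rank each comment within its thread by submission time and by score
comments["created_rank"] = (
    comments.groupby("link_id")["created_utc"].rank(method="first").astype(int)
)
comments["score_rank"] = (
    comments.groupby("link_id")["score"].rank(method="first", ascending=False).astype(int)
)

# the post keeps only threads with at least 30 top-level comments and ranks up to 100;
# the threshold is 3 here purely so the toy threads qualify
thread_size = comments.groupby("link_id")["link_id"].transform("size")
filtered = comments[thread_size >= 3]

# number of comments at each (created_rank, score_rank) pairing
agg = (
    filtered.groupby(["created_rank", "score_rank"])
            .size()
            .reset_index(name="num_comments")
)
print(agg)
</code></pre></div>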

<h2 id="visualizing-the-discussion">Visualizing the Discussion</h2>
<p>Filtering on the top-voted comments (<code>score_rank = 1</code>) only, <em>what percent of the top-voted comments in Reddit threads were also 1st Comment?</em></p>
<figure>

    <img loading="lazy" srcset="/2016/11/first-comment/reddit-first-4_hu_f4acfd8a6a24d04b.webp 320w,/2016/11/first-comment/reddit-first-4_hu_3aa2b0ad38f1674f.webp 768w,/2016/11/first-comment/reddit-first-4_hu_e84d5e2030d58585.webp 1024w,/2016/11/first-comment/reddit-first-4.png 1200w" src="reddit-first-4.png"/> 
</figure>

<p>The answer is <strong>17.24%</strong> of all top-voted comments! That&rsquo;s certainly more than what I expected! Additionally, 56% of the top-voted comments were posted within the first 5 comments, and 77% within the first 10 comments. The chart follows a <a href="https://en.wikipedia.org/wiki/Power_law">power-law distribution</a>.</p>
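<p>Given an aggregate table like the one above, these proportions are just a cumulative sum over the <code>score_rank = 1</code> rows. A short, self-contained pandas sketch with a toy aggregate standing in for the real query output:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import pandas as pd

# toy (created_rank, score_rank, num_comments) aggregate, standing in for the query output
agg = pd.DataFrame({
    "created_rank": [1, 2, 3, 4, 5, 1, 2],
    "score_rank":   [1, 1, 1, 1, 1, 2, 2],
    "num_comments": [170, 120, 90, 60, 40, 80, 95],
})

# restrict to the top-voted comment of each thread, ordered by when it was posted
top_voted = agg[agg["score_rank"] == 1].sort_values("created_rank")
share = top_voted["num_comments"] / top_voted["num_comments"].sum()

print(share.iloc[0])        # proportion that were also the 1st comment (17.24% on the real data)
print(share.head(5).sum())  # proportion posted within the first 5 comments (56% on the real data)
</code></pre></div>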
<p>Let&rsquo;s invert it: filtering on only the first comments (<code>created_rank = 1</code>) made in comment threads, <em>what percentage of the 1st Comments in Reddit threads were also the top-voted comment?</em></p>
<figure>

    <img loading="lazy" srcset="/2016/11/first-comment/reddit-first-3_hu_32e5a7361944ad80.webp 320w,/2016/11/first-comment/reddit-first-3_hu_e9d692ee421e53f2.webp 768w,/2016/11/first-comment/reddit-first-3_hu_3b23b0a0ff87cb8b.webp 1024w,/2016/11/first-comment/reddit-first-3.png 1200w" src="reddit-first-3.png"/> 
</figure>

<p>By construction, the answer is the same as before (17.24%); however, the follow-up proportions are slightly different, with the first comment ranking within the Top 5 comments 46% of the time, and within the Top 10 comments 62% of the time.</p>
<p>It may be worth it to visualize both dimensions at the same time using a heatmap, with the created rank on one axis, score rank on the other, and a z-axis representing the number of comments at each rank pairing. We can also add a faint contour line to help visualize clusters of the data. Putting it together:</p>
<figure>

    <img loading="lazy" srcset="/2016/11/first-comment/reddit-first-2_hu_bca6892d8c2637fb.webp 320w,/2016/11/first-comment/reddit-first-2_hu_69edc86dedd2cf27.webp 768w,/2016/11/first-comment/reddit-first-2_hu_5d672628af81568d.webp 1024w,/2016/11/first-comment/reddit-first-2.png 1200w" src="reddit-first-2.png"/> 
</figure>

<p>Woah, most of the values fall within the semisquare bounded by the first 5 comments and the top 5 comments! But it&rsquo;s harder to see trends, so let&rsquo;s try applying a logarithmic base-10 scaling to the comment count:</p>
<figure>

    <img loading="lazy" srcset="/2016/11/first-comment/reddit-first-2a_hu_b57a236e353450cf.webp 320w,/2016/11/first-comment/reddit-first-2a_hu_70e1d8aa856624e8.webp 768w,/2016/11/first-comment/reddit-first-2a_hu_def2a7e3ca31b407.webp 1024w,/2016/11/first-comment/reddit-first-2a.png 1200w" src="reddit-first-2a.png"/> 
</figure>

<p>Much better! We can see the grouping within the 5x5 semisquare, but also fainter groupings at a 30x30 shape (possibly due to the 30-comment filter threshold), a faint 60x60 shape, and <em>voids</em> in the upper-left and lower-right corners.</p>
<p>From the 2D heatmap, there appears to be a <strong>positive correlation</strong> between the rank of the comment and the time it was submitted. Ideally, if Reddit&rsquo;s algorithm correctly cycled posts so that each comment gets a fair chance at going viral, then there should be <strong>no correlation</strong> between score rank and time posted.</p>
<h2 id="analysis-by-subreddit">Analysis by Subreddit</h2>
<p>When working with Reddit data, it is always important to facet the analysis by subreddit, as subreddits can have idiosyncratic behaviors which deviate from general Reddit behavior. As noted in the original Reddit thread with the initial question, it is possible that the percentage of first comments becoming top comment is &ldquo;higher in lighter subs (funny, pics, videos) than more serious subs (askscience, history, etc).&rdquo;</p>
<p>I tweaked the BigQuery query above to retrieve the same data for each of the Top 100 subreddits (determined by unique commenter count over the same time period). Afterward, via scripting, I created 1D proportion-of-first-comments-by-score-rank charts and 2D heatmaps for each subreddit. You can view and download the 1D charts <a href="https://github.com/minimaxir/first-comment/tree/master/img-1d">here</a>, and the 2D heatmaps <a href="https://github.com/minimaxir/first-comment/tree/master/img-2d">here</a>.</p>
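<p>Determining the Top 100 subreddits is itself just a count of distinct commenters per subreddit over the same period. As a rough illustration of that step, again in pandas on toy data (the real computation is an aggregation over the full BigQuery comment tables):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import pandas as pd

# toy comments: one row per comment, with its author and subreddit
comments = pd.DataFrame({
    "author":    ["a", "b", "a", "c", "d", "b"],
    "subreddit": ["funny", "funny", "pics", "pics", "pics", "IAmA"],
})

# number of unique commenters per subreddit, largest first; keep the top 100
top_subreddits = (
    comments.groupby("subreddit")["author"]
            .nunique()
            .nlargest(100)
)
print(top_subreddits)
</code></pre></div>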
<p>For example, here&rsquo;s the chart of first-comment-rankings for <a href="https://www.reddit.com/r/IAmA/">/r/IAmA</a>, one of Reddit&rsquo;s biggest subreddits where normal Redditors can ask celebrities any question they want.</p>
<figure>

    <img loading="lazy" srcset="/2016/11/first-comment/IAmA-1d_hu_573ad937f3f1efb5.webp 320w,/2016/11/first-comment/IAmA-1d_hu_f22cae99be7468ad.webp 768w,/2016/11/first-comment/IAmA-1d_hu_d8a8e93acd87cea7.webp 1024w,/2016/11/first-comment/IAmA-1d.png 1200w" src="IAmA-1d.png"/> 
</figure>

<p>Unlike the all-Reddit chart, the distribution of first-comment proportions is more uniform instead of following a power law. It makes sense in theory; people would likely upvote top-level questions which the original poster replied to, so there should be less of a bias toward the first top-level comment.</p>
<p>What does the 2D heatmap show?</p>
<figure>

    <img loading="lazy" srcset="/2016/11/first-comment/IAmA-2d_hu_fad01356e103cb71.webp 320w,/2016/11/first-comment/IAmA-2d_hu_4b9d791fd03a28a2.webp 768w,/2016/11/first-comment/IAmA-2d_hu_877e7d8fd546142e.webp 1024w,/2016/11/first-comment/IAmA-2d.png 1200w" src="IAmA-2d.png"/> 
</figure>

<p>Damn it.</p>
<p>While the 1D behavior is different, the overall 2D behavior is the same, albeit with larger voids (indeed, in the heatmap you can see that the vertical strip at <code>created_rank = 1</code> doesn&rsquo;t fit the pattern).</p>
<p>It turns out that most /r/IAmA threads have this comment:</p>
<figure>

    <img loading="lazy" srcset="/2016/11/first-comment/automoderator_hu_4da371433ee2b66b.webp 320w,/2016/11/first-comment/automoderator.png 560w" src="automoderator.png"/> 
</figure>

<p>As it&rsquo;s made by a robot, it&rsquo;s always the first comment, and it gets ignored/downvoted in normal circumstances. Other subreddits with the same pattern of 1D irregularities, 2D regularities, and AutoModerator usage are <a href="https://www.reddit.com/r/gameofthrones/">/r/gameofthrones</a>, <a href="https://www.reddit.com/r/photoshopbattles/">/r/photoshopbattles</a>, and <a href="https://www.reddit.com/r/WritingPrompts/">/r/WritingPrompts</a>.</p>
<p>Some subreddits have more uniformity than typical Reddit rank behavior. In <a href="https://www.reddit.com/r/funny/">/r/funny</a>, <a href="https://www.reddit.com/r/leagueoflegends/">/r/leagueoflegends</a>, <a href="https://www.reddit.com/r/pics/">/r/pics</a>, <a href="https://www.reddit.com/r/todayilearned/">/r/todayilearned</a>, and <a href="https://www.reddit.com/r/videos/">/r/videos</a> (i.e. many default subreddits), there is no upper-left void (early comments can be poorly ranked) and the bottom-right void is minimized but still present.</p>
<figure>

    <img loading="lazy" srcset="/2016/11/first-comment/funny-2d_hu_73c0fc571865e740.webp 320w,/2016/11/first-comment/funny-2d_hu_b996b9dab212dc84.webp 768w,/2016/11/first-comment/funny-2d_hu_931b173f81201f9e.webp 1024w,/2016/11/first-comment/funny-2d.png 1200w" src="funny-2d.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2016/11/first-comment/leagueoflegends-2d_hu_a9c42eeb50e921e1.webp 320w,/2016/11/first-comment/leagueoflegends-2d_hu_4caa7fc168feac26.webp 768w,/2016/11/first-comment/leagueoflegends-2d_hu_980dd947631bc25b.webp 1024w,/2016/11/first-comment/leagueoflegends-2d.png 1200w" src="leagueoflegends-2d.png"/> 
</figure>

<p>Conversely, there are subreddits where the correlation is obvious. <a href="https://www.reddit.com/r/pcmasterrace/">/r/pcmasterrace</a> and /r/gonewild both exhibit very straight lines; they are subreddits where the comments themselves are not very constructive, so whatever gets posted gets upvoted anyway.</p>
<figure>

    <img loading="lazy" srcset="/2016/11/first-comment/pcmasterrace-2d_hu_db48b685a1bb055d.webp 320w,/2016/11/first-comment/pcmasterrace-2d_hu_19d3a7066357e5f3.webp 768w,/2016/11/first-comment/pcmasterrace-2d_hu_8a658c3fefc2cf5b.webp 1024w,/2016/11/first-comment/pcmasterrace-2d.png 1200w" src="pcmasterrace-2d.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2016/11/first-comment/gonewild-2d_hu_31c8de0d3b20e1ae.webp 320w,/2016/11/first-comment/gonewild-2d_hu_2110b19d8c5df439.webp 768w,/2016/11/first-comment/gonewild-2d_hu_d14d48957f0bfaa4.webp 1024w,/2016/11/first-comment/gonewild-2d.png 1200w" src="gonewild-2d.png"/> 
</figure>

<p>Rushing to say <strong>FIRST!!1!11!</strong> in a comments section of a blog post or forum thread is a meme that long predates Reddit. However, rushing to make the first comment in a Reddit thread may have strategic merit if you want to get your voice heard.</p>
<p>Even in the most optimistic circumstances, comments that are late to a thread have a very, very low probability of becoming one of the top comments. In fairness, it&rsquo;s hard to determine with public Reddit data whether tweaking the ranking algorithm so that new comments always rank at the top initially would actually improve the Reddit user experience as a whole. On the other hand, this behavior presents an opportunity: if there is a <a href="https://en.wikipedia.org/wiki/Long_tail">long tail</a> of Reddit content that is unjustifiably being buried due to lack of attention, then perhaps there is a <em>business opportunity</em> in creating a service to discover and resurface quality comments&hellip;</p>
<hr>
<p><em>You can view all the <a href="https://www.r-project.org">R</a> and <a href="http://ggplot2.org">ggplot2</a> code used to query, analyze, and visualize the Reddit data in <a href="http://minimaxir.com/notebooks/first-comment/">this R Notebook</a>. You can also view the images/data used for this post in <a href="https://github.com/minimaxir/first-comment">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Visualizing How Developers Rate Their Own Programming Skills</title>
      <link>https://minimaxir.com/2016/07/stack-overflow/</link>
      <pubDate>Thu, 21 Jul 2016 06:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2016/07/stack-overflow/</guid>
      <description>As it turns out, there is no correlation between programming ability and the frequency of Stack Overflow visits.</description>
      <content:encoded><![CDATA[<p><a href="http://stackoverflow.com">Stack Overflow</a>, the favorite destination for software developers when something breaks for no apparent reason, recently released their <a href="http://stackoverflow.com/research/developer-survey-2016">2016 Stack Overflow Survey Results</a> with responses to the questions of &ldquo;where they work, what they build, and who they are.&rdquo; You can download the released dataset containing all 56,030 cleaned responses <a href="http://stackoverflow.com/research">here</a>.</p>
<p>One variable present in the dataset but surprisingly unaddressed in the official Stack Overflow analysis is the <code>programming_ability</code> field — <em>On a scale of 1-10, how would you rate your programming ability?</em></p>
<p>I took a look at the 46,982 users who identified their programming ability in the survey. On average, developers rate themselves 7.09 / 10. And like most 1-10 rating scales, the distribution of self-assessments is unimodal around 7 and 8, with relatively rare 9&rsquo;s and 10&rsquo;s.</p>
<figure>

    <img loading="lazy" srcset="/2016/07/stack-overflow/so-programming-0_hu_490b45ca6432efea.webp 320w,/2016/07/stack-overflow/so-programming-0_hu_11ab33c947130fec.webp 768w,/2016/07/stack-overflow/so-programming-0_hu_163c89b74f2d4bf.webp 1024w,/2016/07/stack-overflow/so-programming-0.png 1200w" src="so-programming-0.png"/> 
</figure>

<p>We can aggregate the programming ability data by other relevant metrics in the Stack Overflow dataset, such as experience and commit activity, and hopefully find interesting trends.</p>
<h2 id="sanity-checking">Sanity Checking</h2>
<p>I normally dislike working with survey data since there is a high possibility of <a href="https://en.wikipedia.org/wiki/Selection_bias">selection bias</a> among the respondents. In Stack Overflow&rsquo;s case, their marketing of the survey on Facebook and Twitter may cause a high proportion of social-media-savvy respondents and discount the insight of developers who are not likely to use those services. For this reason, I will show <a href="https://en.wikipedia.org/wiki/Confidence_interval">confidence intervals</a> whenever possible to reflect the proportionate uncertainty for groupings with insufficient data, and to also account for the possibility that a minority of respondents may be dishonest and nudge their programming ability a few points higher than the truth.</p>
<p>Let&rsquo;s compare programming skill to the developer&rsquo;s experience in the field. In the survey, the user could classify their IT / programming experience as one of five ranges: &ldquo;Less than 1 year&rdquo;, &ldquo;1 - 2 years&rdquo;, &ldquo;2 - 5 years&rdquo;, &ldquo;6 - 10 years&rdquo;, or &ldquo;11+ years.&rdquo; Since we would expect a positive correlation between skill and experience, identifying such a positive correlation visually gives a quick indication that the analysis is on the right track.</p>
<p>We can plot the average programming-ability rating for developers which fall into each of those five groups, and a confidence interval for that average. Additionally, we can make a <a href="https://en.wikipedia.org/wiki/Violin_plot">violin plot</a> of each group to give a sense of the underlying distribution of ratings.</p>
<p>Putting it all together:</p>
<figure>

    <img loading="lazy" srcset="/2016/07/stack-overflow/so-programming-5_hu_eba638fd916ec81c.webp 320w,/2016/07/stack-overflow/so-programming-5_hu_474e6cfe1c565acb.webp 768w,/2016/07/stack-overflow/so-programming-5_hu_36228687b3cf1fc8.webp 1024w,/2016/07/stack-overflow/so-programming-5.png 1200w" src="so-programming-5.png"/> 
</figure>

<p>The color dot for each group represents the average rating from the sample which the developers in the group give to themselves. The black error bars on the dot represent a 95% confidence interval for the true value of the average, obtained via <a href="http://minimaxir.com/2015/09/bootstrap-resample/">percentile bootstrap</a> with 10,000 resamples of the dataset with replacement (since there is a large amount of source data, the confidence intervals end up being very narrow in most cases; this is one legitimate advantage of big data).</p>
<p>The violin plot for each group represents the normalized overall distribution of ratings. The narrowness of the per-value ratings reflects the amount of data available for that group: the more data available, the narrower and more precise the kernel smoothing is. An overall flat plot represents a wide spread of self-ratings, while an overall narrow plot represents a more constrained selection (for the plot above, you can easily see the distribution shift to the right as the experience range increases).</p>
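<p>The percentile bootstrap behind those error bars is easy to reproduce: resample a group&rsquo;s ratings with replacement many times, take the mean of each resample, and read the 95% interval off the 2.5th and 97.5th percentiles of the resampled means. A minimal NumPy sketch with toy ratings (the post does this with 10,000 resamples of the actual survey responses):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import numpy as np

rng = np.random.default_rng(42)

# toy self-ratings for one experience group, standing in for the survey responses
ratings = np.array([5, 7, 8, 7, 6, 9, 7, 8, 10, 7, 6, 8])

n_resamples = 10_000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # resample the group with replacement and record the resample's mean
    resample = rng.choice(ratings, size=ratings.size, replace=True)
    boot_means[i] = resample.mean()

point_estimate = ratings.mean()
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {point_estimate:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
</code></pre></div>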
<p>Also, keep in mind that these groupings alone do not imply a <strong>causal relationship</strong> between the two variables. Employing traditional <a href="https://en.wikipedia.org/wiki/Regression_analysis">regression analysis</a> to build a model for predicting programming ability would be tricky: does having more experience cause programming skill to improve, or does having strong innate technical skill cause developers to remain in the industry and grow?</p>
<p>Back to the plot at hand. We can easily confirm that a positive correlation exists between programming ability and experience, with newbie developers rating their skills 5.02 / 10 on average, and extremely experienced developers rating their skills three whole ranks higher at 8.13 / 10. What&rsquo;s also notable is the range of values selected: for developers with <strong>less than 1 year</strong> of experience, the distribution is almost completely flat between 1-7, showing that they are more honest in the self-assessment of their programming skills. Conversely, developers with <strong>11+ years</strong> of experience select 9 and 10 ratings almost as much as 7 and 8 ratings.</p>
<p>It&rsquo;s a good start. We can also compare developer skill to their age, which by construction (older developers have more experience) should have parallel behavior to experience levels.</p>
<figure>

    <img loading="lazy" srcset="/2016/07/stack-overflow/so-programming-4_hu_9a82fa102ab31b89.webp 320w,/2016/07/stack-overflow/so-programming-4_hu_7eaafadcd632b252.webp 768w,/2016/07/stack-overflow/so-programming-4_hu_1101af9046194e99.webp 1024w,/2016/07/stack-overflow/so-programming-4.png 1200w" src="so-programming-4.png"/> 
</figure>

<p>Yes, the plot is indeed similar, with average ratings ranging from 6 to about 8. What&rsquo;s interesting is the behavior for <strong>&gt; 60</strong> vs. <strong>50-59</strong>: programmers over 60 occasionally rate their skills at the low end of the scale, which is why the confidence interval is larger and the average is lower for that group.</p>
<p>Lastly, we can look at the salary the developer is paid (in USD) as a validation of skill. This particular chart will only focus on developers in the United States (n = 13,539), so that the salary follows expected behavior with the specified currency and cost-of-living. In this case, there are many more groups, but that makes the distribution shift more apparent, and more interesting.</p>
<figure>

    <img loading="lazy" srcset="/2016/07/stack-overflow/so-programming-6_hu_7cc9f1fc5ae5d96c.webp 320w,/2016/07/stack-overflow/so-programming-6_hu_f58b25a9402f0a72.webp 768w,/2016/07/stack-overflow/so-programming-6_hu_2e102b83abc9a02d.webp 1024w,/2016/07/stack-overflow/so-programming-6.png 1200w" src="so-programming-6.png"/> 
</figure>

<p>The large number of &gt;$100k earners in the dataset shows how the Stack Overflow demographic can skew toward Silicon Valley engineers. The $90k—$100k group serves as a convenient inflection point: the distribution of self-ratings becomes a <a href="http://tvtropes.org/pmwiki/pmwiki.php/Main/FourPointScale">Four Point Scale</a> between 7 and 10 for those who earn more than $100k.</p>
<h2 id="do-better-developers-rate-themselves-better">Do better developers rate themselves better?</h2>
<p>So far, the data is internally consistent. There are a few other developer-relevant statistics available in the dataset which can easily be aggregated. A good one is the <em>type</em> of employment. For example, do <strong>freelance / contract</strong> developers believe they are better programmers than <strong>full-time</strong> developers?</p>
<figure>

    <img loading="lazy" srcset="/2016/07/stack-overflow/so-programming-7_hu_8f5e6ca147f045e4.webp 320w,/2016/07/stack-overflow/so-programming-7_hu_a12f45d3f2d1f3f8.webp 768w,/2016/07/stack-overflow/so-programming-7_hu_db5f2ea3f76d5e35.webp 1024w,/2016/07/stack-overflow/so-programming-7.png 1200w" src="so-programming-7.png"/> 
</figure>

<p>As it turns out, that guess is indeed the case, although the difference is slight (7.53 / 10 for <strong>freelance / contract</strong> vs. 7.29 / 10 for <strong>full-time</strong>).</p>
<p>What about repository commit activity by developers? Are developers who commit more often better? One could argue that a developer who commits code often is either vigilant about accounting for functional code changes, or polluting the codebase in an attempt to show productivity.</p>
<figure>

    <img loading="lazy" srcset="/2016/07/stack-overflow/so-programming-8_hu_1c91f841623c628.webp 320w,/2016/07/stack-overflow/so-programming-8_hu_91e0d17dc486ebbc.webp 768w,/2016/07/stack-overflow/so-programming-8_hu_a55812a19b16f217.webp 1024w,/2016/07/stack-overflow/so-programming-8.png 1200w" src="so-programming-8.png"/> 
</figure>

<p>Yes, developers who commit lots of code rate themselves better.</p>
<p>Lastly, let&rsquo;s remember that the source of data is Stack Overflow. Are developers who use Stack Overflow as a resource better developers who know how to properly use external references in times of crisis, or are they developers who use it as a crutch to compensate for weak coding skills?</p>
<figure>

    <img loading="lazy" srcset="/2016/07/stack-overflow/so-programming-9_hu_eecd6a2d3b1847ac.webp 320w,/2016/07/stack-overflow/so-programming-9_hu_efde5bf708b6ec62.webp 768w,/2016/07/stack-overflow/so-programming-9_hu_37c013d4469e87bc.webp 1024w,/2016/07/stack-overflow/so-programming-9.png 1200w" src="so-programming-9.png"/> 
</figure>

<p>As it turns out, there is <strong>no correlation between programming ability and the frequency of Stack Overflow visits</strong>, as the averages and distributions are virtually identical across all groups.</p>
<p>There are many, many other answers available in the dataset; some allow multiple responses and are harder to parse, while others have zero correlation with programming ability, as with the Stack Overflow visit frequency, and therefore do not provide much additional insight. Although we cannot establish causal relationships with this methodology, there may be other important insights obtainable from aggregating programming ability data, but the charts presented in this post are a good start.</p>
<hr>
<p><em>As always, the full code used to process the comment data and generate the visualizations is available in <a href="https://github.com/minimaxir/stack-overflow-survey/blob/master/stack_overflow_dev_survey.ipynb">this Jupyter notebook</a>, open-sourced <a href="https://github.com/minimaxir/stack-overflow-survey">on GitHub</a>. The repository also contains a few unused bonus charts!</em></p>
<p><em>You are free to use the charts from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
