This R Notebook is the complement to my blog post "The Decline of Imgur on Reddit and the Rise of Reddit’s Native Image Hosting."

This notebook is licensed under the MIT License. If you use the code or data visualization designs contained within this notebook, it would be greatly appreciated if proper attribution is given back to this notebook and/or myself. Thanks! :)

1 Setup

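# Packages used below: plotting, pipes, number formatting (comma), and BigQuery access (query_exec).
library(ggplot2)
library(dplyr)
library(scales)
library(bigrquery)
theme_set(theme_minimal(base_size=9, base_family="Source Sans Pro") +
            theme(plot.title = element_text(size=11, family="Source Sans Pro Bold"),
                  axis.title.x = element_text(family="Source Sans Pro Semibold"),
                  axis.title.y = element_text(family="Source Sans Pro Semibold"),
                  plot.caption = element_text(size=6, color="#CCCCCC")))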

Set the BigQuery project name here. (NEVER share it with anyone!)

project <- "<FILL IN>"
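One way to keep the project name out of a shared notebook is to read it from an environment variable instead of typing it into the file. This is only a sketch, not what this notebook does: the helper `get_bq_project` and the variable name `BIGQUERY_PROJECT` are hypothetical names chosen for illustration.

```r
# Hypothetical helper: read the project ID from an environment variable
# (e.g. a line BIGQUERY_PROJECT=my-project-id in ~/.Renviron).
get_bq_project <- function(var = "BIGQUERY_PROJECT") {
  value <- Sys.getenv(var, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable ", var, " is not set.")
  }
  value
}

# Usage (uncomment once the variable is set):
# project <- get_bq_project()
```

The function fails loudly when the variable is missing, so a query never runs against an empty project name.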

Reddit is blue, Imgur is green.

Also, hard-code the dates of the Reddit image hosting beta and sitewide launches, for use as vertical lines on the charts.

site_colors <- c(Reddit="#3498db", Imgur="#2ecc71")
reddit_image_beta_date <- as.numeric(as.Date('2016-05-24'))
reddit_image_sitewide_date <- as.numeric(as.Date('2016-06-21'))
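The dates are converted to numbers because `geom_vline()` takes a numeric `xintercept` on a date axis. A minimal sketch of how the two values could be drawn, using made-up placeholder data (the `toy` data frame and its values are assumptions for illustration, not Reddit data):

```r
library(ggplot2)

# Same dates as above, stored as days since 1970-01-01.
reddit_image_beta_date <- as.numeric(as.Date("2016-05-24"))
reddit_image_sitewide_date <- as.numeric(as.Date("2016-06-21"))

# Made-up placeholder series, only to give the plot something to draw.
toy <- data.frame(
  mon = seq(as.Date("2016-01-01"), as.Date("2016-12-01"), by = "month"),
  value = 1:12
)

# One vertical line per launch date on top of the line chart.
p <- ggplot(toy, aes(mon, value)) +
  geom_line() +
  geom_vline(xintercept = reddit_image_beta_date, linetype = "dashed") +
  geom_vline(xintercept = reddit_image_sitewide_date, linetype = "dotted") +
  scale_x_date()
```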

2 Exploratory Reddit Data

Get Reddit submission data over time.

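sql <- "
  SELECT
  -- Assumption: mon is the submission month derived from created_utc (epoch seconds);
  -- the original expression is missing from this copy.
  DATE_TRUNC(DATE(TIMESTAMP_SECONDS(created_utc)), MONTH) AS mon,
  COUNT(*) AS num_submissions,
  -- The two domain patterns below are incomplete in this copy and need to be filled in.
  COUNTIF(REGEXP_CONTAINS(domain, '|')) AS num_reddit_image_submissions,
  COUNTIF(REGEXP_CONTAINS(domain, '')) AS num_imgur_submissions
  FROM `fh-bigquery.reddit_posts.*`
  WHERE (_TABLE_SUFFIX BETWEEN '2016_01' AND '2017_04' OR _TABLE_SUFFIX = 'full_corpus_201512')
  GROUP BY mon
  ORDER BY mon
"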
df_reddit_daily <- query_exec(sql, project = project, use_legacy_sql = FALSE)
df_reddit_daily %>% tail()
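With the three count columns in hand, each site's share of all submissions is a simple ratio. A rough sketch on a made-up data frame shaped like the query result (the `toy` values and the `pct_*` column names are assumptions for illustration, not real counts):

```r
# Made-up counts with the same column names as the query result.
toy <- data.frame(
  mon = as.Date(c("2016-05-01", "2016-06-01")),
  num_submissions = c(1000, 2000),
  num_reddit_image_submissions = c(10, 100),
  num_imgur_submissions = c(200, 300)
)

# Share of all submissions going to each image host, per month.
toy$pct_reddit <- toy$num_reddit_image_submissions / toy$num_submissions
toy$pct_imgur <- toy$num_imgur_submissions / toy$num_submissions
```

Working with shares rather than raw counts keeps months with different total volume comparable.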
plot <- ggplot(df_reddit_daily, aes(mon, num_submissions)) +
          geom_line() +
          scale_x_date() +
          scale_y_continuous(labels = comma) + 
          labs(title = "Number of Submissions to Reddit, by Month",
               x = "Month",
               y = "# Submissions",
               caption = "Max Woolf —")
ggsave("reddit-1.png", plot, width=5, height=3)