This R Notebook is the complement to my blog post "The Decline of Imgur on Reddit and the Rise of Reddit's Native Image Hosting".

This notebook is licensed under the MIT License. If you use the code or data visualization designs contained within this notebook, it would be greatly appreciated if proper attribution is given back to this notebook and/or myself. Thanks! :)

1 Setup

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forcats_0.2.0        bindrcpp_0.2         tidyr_0.6.3          bigrquery_0.3.0.9000
[5] scales_0.4.1         ggplot2_2.2.1.9000   dplyr_0.7.0          readr_1.1.1         

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11      bindr_0.1         compiler_3.4.0    plyr_1.8.4       
 [5] prettyunits_1.0.2 base64enc_0.1-3   tools_3.4.0       progress_1.1.2   
 [9] digest_0.6.12     jsonlite_1.5      evaluate_0.10     tibble_1.3.3     
[13] gtable_0.2.0      rlang_0.1.1       DBI_0.6-1         curl_2.6         
[17] yaml_2.1.14       httr_1.2.1        stringr_1.2.0     knitr_1.16       
[21] hms_0.3           rprojroot_1.2     grid_3.4.0        glue_1.1.0       
[25] R6_2.2.2          rmarkdown_1.6     magrittr_1.5      backports_1.1.0  
[29] htmltools_0.3.6   rsconnect_0.8     assertthat_0.2.0  colorspace_1.3-2 
[33] httpuv_1.3.3      labeling_0.3      stringi_1.1.5     lazyeval_0.2.0   
[37] openssl_0.9.6     munsell_0.4.3    
theme_set(theme_minimal(base_size=9, base_family="Source Sans Pro") +
            theme(plot.title = element_text(size=11, family="Source Sans Pro Bold"),
                  axis.title.x = element_text(family="Source Sans Pro Semibold"),
                  axis.title.y = element_text(family="Source Sans Pro Semibold"),
                  plot.caption = element_text(size=6, color="#CCCCCC")))

Set your BigQuery project name (NEVER share it with anyone!).

project <- "<FILL IN>"

Reddit is blue, Imgur is green.

Also, hard-code the dates of the Reddit image hosting beta and sitewide launch, for use as vertical lines in the charts.

site_colors <- c(Reddit="#3498db", Imgur="#2ecc71")
reddit_image_beta_date <- as.numeric(as.Date('2016-05-24'))
reddit_image_sitewide_date <- as.numeric(as.Date('2016-06-21'))
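
The `as.numeric()` wrapper above works because R stores a `Date` as the number of days since 1970-01-01; that number is presumably what is later passed as the x-intercept of a vertical line on the date axis. A minimal base-R sketch of the conversion (the variable names `beta_date`, `beta_num`, `back`, and `gap_days` are illustrative and not from the notebook):

```r
# A Date is stored internally as days since 1970-01-01.
beta_date <- as.Date("2016-05-24")
beta_num <- as.numeric(beta_date)
print(beta_num)  # 16945

# The conversion round-trips, so nothing is lost by storing the number.
back <- as.Date(beta_num, origin = "1970-01-01")

# Days between the beta date and the sitewide date used above.
sitewide_num <- as.numeric(as.Date("2016-06-21"))
gap_days <- sitewide_num - beta_num
print(gap_days)  # 28
```

In a ggplot2 chart with a date x-axis, a value like `beta_num` could then be supplied as `geom_vline(xintercept = beta_num)`.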

2 Exploratory Reddit Data

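Get monthly Reddit submission counts: the overall total, plus the number of submissions hosted on Reddit's native image domains and on Imgur.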

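# Assumption: the SELECT line is reconstructed; mon is taken to be the
# submission month derived from created_utc (needed by GROUP BY mon).
# The two domain patterns are left exactly as found; fill in the Reddit
# image-hosting and Imgur domains before running.
sql <- "
  SELECT DATE_TRUNC(DATE(TIMESTAMP_SECONDS(created_utc)), MONTH) AS mon,
  COUNT(*) AS num_submissions,
  COUNTIF(REGEXP_CONTAINS(domain, '|')) AS num_reddit_image_submissions,
  COUNTIF(REGEXP_CONTAINS(domain, '')) AS num_imgur_submissions
  FROM `fh-bigquery.reddit_posts.*`
  WHERE (_TABLE_SUFFIX BETWEEN '2016_01' AND '2017_04' OR _TABLE_SUFFIX = 'full_corpus_201512')
  GROUP BY mon
  ORDER BY mon
"

df_reddit_daily <- query_exec(sql, project = project, use_legacy_sql = FALSE)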
7.7 gigabytes processed
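
The `COUNTIF(REGEXP_CONTAINS(domain, ...))` pattern in the query counts, per month, how many rows have a `domain` matching a regular expression. As a rough local analogue, the same logic can be sketched in base R on a toy data frame. Everything here is an assumption for illustration: the rows are made up, and `reddit_pattern` and `imgur_pattern` are hypothetical patterns, not necessarily the ones used in the original query.

```r
# Toy submissions (made-up rows, for illustration only).
posts <- data.frame(
  created = as.Date(c("2016-05-03", "2016-05-20", "2016-06-02",
                      "2016-06-15", "2016-06-28")),
  domain = c("i.imgur.com", "self.AskReddit", "i.redd.it",
             "imgur.com", "youtube.com"),
  stringsAsFactors = FALSE
)

# Hypothetical patterns (assumptions, not taken from the notebook).
reddit_pattern <- "i\\.redd\\.it"
imgur_pattern <- "imgur\\.com"

# grepl() plays the role of REGEXP_CONTAINS: TRUE where the pattern matches.
posts$is_reddit <- grepl(reddit_pattern, posts$domain)
posts$is_imgur <- grepl(imgur_pattern, posts$domain)

# Month key, analogous to truncating the date to the month.
mon <- format(posts$created, "%Y-%m")

# Summing logicals per group plays the role of COUNTIF ... GROUP BY mon.
monthly <- data.frame(
  mon = sort(unique(mon)),
  num_submissions = as.vector(table(mon)),
  num_reddit_image_submissions = as.vector(tapply(posts$is_reddit, mon, sum)),
  num_imgur_submissions = as.vector(tapply(posts$is_imgur, mon, sum))
)
print(monthly)
```

Each row of `monthly` corresponds to one month, mirroring the shape of the data frame returned by the query.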

df_reddit_daily %>% tail()
plot <- ggplot(df_reddit_daily, aes(mon, num_submissions)) +
          geom_line() +
          scale_x_date() +
          scale_y_continuous(labels = comma) + 
          labs(title = "Number of Submissions to Reddit, by Month",
               x = "Month",
               y = "# Submissions",
               caption = "Max Woolf —")
ggsave("reddit-1.png", plot, width = 5, height = 3)
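
The `labels = comma` argument in the plot refers to the `comma()` formatter from the scales package, which is applied to the y-axis break values. A small sketch of what it does on its own (the input numbers are arbitrary examples):

```r
library(scales)

# comma() turns numbers into strings with thousands separators.
labels <- comma(c(1500, 2500000))
print(labels)  # "1,500" "2,500,000"
```

Passing the function itself (rather than its result) to `scale_y_continuous()` lets ggplot2 call it on whatever axis breaks it computes.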