This R Notebook is the companion to my blog post, "A Visual Overview of Stack Overflow’s Question Tags."

This notebook is licensed under the MIT License. If you use the code or data visualization designs in this notebook, I would greatly appreciate attribution to this notebook and/or me. Thanks! :)

1 Setup

library(tidyverse)
── Attaching packages ─────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4     
✔ tibble  1.4.2          ✔ dplyr   0.7.4     
✔ tidyr   0.8.0          ✔ stringr 1.2.0     
✔ readr   1.1.1          ✔ forcats 0.2.0     
package ‘tibble’ was built under R version 3.4.3
package ‘tidyr’ was built under R version 3.4.3
── Conflicts ────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(lubridate)

Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date
library(tidytext)   # created at Stack Overflow by Julia Silge and David Robinson
library(scales)

Attaching package: ‘scales’

The following object is masked from ‘package:purrr’:

    discard

The following object is masked from ‘package:readr’:

    col_factor
library(viridis)
Loading required package: viridisLite

Attaching package: ‘viridis’

The following object is masked from ‘package:viridisLite’:

    viridis.map

The following object is masked from ‘package:scales’:

    viridis_pal
library(ggrepel)
library(ggridges)
sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggridges_0.4.1     ggrepel_0.7.0      viridis_0.4.1      viridisLite_0.3.0  scales_0.5.0      
 [6] tidytext_0.1.6     lubridate_1.7.1    forcats_0.2.0      stringr_1.2.0      dplyr_0.7.4       
[11] purrr_0.2.4        readr_1.1.1        tidyr_0.8.0        tibble_1.4.2       ggplot2_2.2.1.9000
[16] tidyverse_1.2.1   

loaded via a namespace (and not attached):
 [1] reshape2_1.4.3    haven_1.1.1       lattice_0.20-35   colorspace_1.3-2  SnowballC_0.5.1  
 [6] yaml_2.1.16       rlang_0.1.6       pillar_1.1.0      foreign_0.8-69    glue_1.2.0       
[11] modelr_0.1.1      readxl_1.0.0      bindrcpp_0.2      bindr_0.1         plyr_1.8.4       
[16] munsell_0.4.3     gtable_0.2.0      cellranger_1.1.0  rvest_0.3.2       psych_1.7.8      
[21] knitr_1.19        parallel_3.4.2    broom_0.4.3       tokenizers_0.1.4  Rcpp_0.12.15     
[26] jsonlite_1.5      gridExtra_2.3     mnormt_1.5-5      hms_0.4.1         stringi_1.1.6    
[31] grid_3.4.2        cli_1.0.0         tools_3.4.2       magrittr_1.5      lazyeval_0.2.1   
[36] janeaustenr_0.1.5 crayon_1.3.4      pkgconfig_2.0.1   Matrix_1.2-12     xml2_1.2.0       
[41] assertthat_0.2.0  httr_1.3.1        rstudioapi_0.7    R6_2.2.2          nlme_3.1-131     
[46] compiler_3.4.2   
Sys.setenv(TZ="America/Los_Angeles")
# https://brandcolors.net/b/stackoverflow
stack_overflow_color <- "#f48024"
theme_set(theme_minimal(base_size=9, base_family="Source Sans Pro") +
            theme(plot.title = element_text(size=8, family="Source Sans Pro Bold", margin=margin(t = -0.1, b = 0.1, unit='cm')),
                  axis.title.x = element_text(size=8),
                  axis.title.y = element_text(size=8),
                  plot.subtitle = element_text(family="Source Sans Pro Semibold", color="#969696", size=6),
                  plot.caption = element_text(size=6, color="#969696"),
                  legend.text = element_text(size = 6),
                  legend.key.width = unit(0.25, unit='cm')))

2 Behavior for new submissions

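Use data precomputed with the following BigQuery query, which compares the view counts in the March 2017 and December 2017 archive snapshots and sums the per-question difference by the year each question was posted: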

#standardSQL
SELECT
  DATE_TRUNC(DATE(creation_date), YEAR) AS year,
  SUM(view_count_delta) AS total_delta 
FROM (
  SELECT
    id,
    creation_date,
    b.view_count - a.view_count AS view_count_delta
  FROM
    `fh-bigquery.stackoverflow_archive.201703_posts_questions` a
  LEFT JOIN (
    SELECT
      id,
      view_count
    FROM
      `fh-bigquery.stackoverflow_archive.201712_posts_questions` ) b
  USING
    (id) )
GROUP BY
  year
ORDER BY
  year ASC
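
To make the join-and-subtract logic of the query concrete, here is a minimal sketch of the same computation in dplyr. The `march` and `december` data frames are hypothetical toy snapshots, not the real BigQuery tables, and the numbers are invented for illustration.

```r
library(dplyr)

# Hypothetical toy snapshots standing in for the 201703 and 201712 tables.
march <- data.frame(
  id = 1:4,
  creation_date = as.Date(c("2015-06-01", "2015-11-20", "2016-02-14", "2016-09-30")),
  view_count = c(100L, 50L, 10L, 5L)
)
december <- data.frame(
  id = 1:3,                      # question 4 is absent from the later snapshot
  view_count = c(180L, 75L, 40L)
)

toy_deltas <- march %>%
  # LEFT JOIN ... USING (id): keep every March question, matched or not
  left_join(december, by = "id", suffix = c("_a", "_b")) %>%
  mutate(
    view_count_delta = view_count_b - view_count_a,          # NA when unmatched
    year = as.Date(format(creation_date, "%Y-01-01"))        # like DATE_TRUNC(..., YEAR)
  ) %>%
  group_by(year) %>%
  # SQL SUM() skips NULLs, so drop NAs here to mirror it
  summarize(total_delta = sum(view_count_delta, na.rm = TRUE)) %>%
  arrange(year)

toy_deltas
```

Because of the left join, a question that appears only in the March snapshot contributes a missing delta rather than being dropped up front; it simply adds nothing to its year's total.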

Load in the precomputed data.

file_path <- "stack_overflow_delta.csv"
df_deltas <- read_csv(file_path) %>% mutate(perc = total_delta / sum(as.numeric(total_delta)))
Parsed with column specification:
cols(
  year = col_date(format = ""),
  total_delta = col_integer()
)
df_deltas
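
A likely reason for the `as.numeric()` call (an assumption on my part when reading the code back) is that `total_delta` is parsed as an integer column, and summing large 32-bit integers in R overflows to `NA`. A minimal sketch with made-up totals:

```r
# Hypothetical yearly totals stored as integers, like a col_integer() column.
total_delta <- c(2000000000L, 1500000000L, 500000000L)

# The integer sum exceeds .Machine$integer.max, so R returns NA with a warning.
int_sum <- suppressWarnings(sum(total_delta))

# Converting to double first avoids the overflow.
dbl_sum <- sum(as.numeric(total_delta))

# Shares computed against the double total are well defined.
perc <- total_delta / dbl_sum

int_sum
dbl_sum
perc
```

Dividing by the integer sum instead would turn every value of `perc` into `NA`.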

Plot the share of 2017 views (those accrued between the two snapshots) that went to questions posted in each year.

plot <- ggplot(df_deltas %>% filter(year >= ymd('2009-01-01'), year <= ymd('2016-01-01')), aes(x=year, y=perc)) +
          geom_bar(alpha=0.9, stat="identity", fill=stack_overflow_color) +
          scale_x_date(date_breaks='1 year', date_labels='%Y', minor_breaks = NULL) +
          scale_y_continuous(labels=percent) +
          labs(title='Proportion of 2017 Views on Older Stack Overflow Questions by Year',
               subtitle='From March 13th, 2017 to December 3rd, 2017. Visualization Excludes Partial Years',
               x='Year Question Was Posted',
               y='% of All Views',
               caption = "Max Woolf — minimaxir.com"
              )
ggsave('so_overview.png', plot, width=4, height=2)
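
One detail worth keeping in mind when reading the chart: `perc` is computed over every year in the data before the partial years are filtered out, so the plotted bars need not sum to 100%. A minimal sketch with hypothetical totals (the real values live in `stack_overflow_delta.csv`):

```r
library(dplyr)

# Hypothetical totals; 2008 and 2017 play the role of the partial years.
toy <- data.frame(
  year = as.Date(c("2008-01-01", "2009-01-01", "2016-01-01", "2017-01-01")),
  total_delta = c(10, 40, 30, 20)
) %>%
  mutate(perc = total_delta / sum(total_delta))   # denominator covers all years

# Same date window as the plot above.
shown <- toy %>%
  filter(year >= as.Date("2009-01-01"), year <= as.Date("2016-01-01"))

shown_share <- sum(shown$perc)
shown_share   # share of all views covered by the plotted years
```

If the bars should sum to 100% instead, the filter would need to run before `perc` is computed.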