This R Notebook is the complement to my blog post "Analyzing IMDb Data The Intended Way, with R and ggplot2".

This notebook is licensed under the MIT License. If you use the code or data visualization designs in this notebook, I would greatly appreciate attribution to this notebook and/or myself. Thanks! :)

IMDb data retrieved on July 4th, 2018.

Information courtesy of IMDb (http://www.imdb.com). Used with permission.

library(tidyverse)
── Attaching packages ───────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.0.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.6
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
package ‘dplyr’ was built under R version 3.5.1
── Conflicts ──────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(ggridges)   # unused in final blog post

Attaching package: ‘ggridges’

The following object is masked from ‘package:ggplot2’:

    scale_discrete_manual
library(tidytext)   # unused in final blog post
library(scales)

Attaching package: ‘scales’

The following object is masked from ‘package:purrr’:

    discard

The following object is masked from ‘package:readr’:

    col_factor
sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scales_0.5.0    tidytext_0.1.9  ggridges_0.5.0  forcats_0.3.0  
 [5] stringr_1.3.1   dplyr_0.7.6     purrr_0.2.5     readr_1.1.1    
 [9] tidyr_0.8.1     tibble_1.4.2    ggplot2_3.0.0   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.4  reshape2_1.4.3    haven_1.1.2       lattice_0.20-35  
 [5] colorspace_1.3-2  SnowballC_0.5.1   yaml_2.1.19       rlang_0.2.1      
 [9] pillar_1.2.3      foreign_0.8-70    glue_1.2.0        withr_2.1.2      
[13] modelr_0.1.2      readxl_1.1.0      bindrcpp_0.2.2    bindr_0.1.1      
[17] plyr_1.8.4        munsell_0.5.0     gtable_0.2.0      cellranger_1.1.0 
[21] rvest_0.3.2       psych_1.8.4       knitr_1.20        parallel_3.5.0   
[25] broom_0.4.5       tokenizers_0.2.1  Rcpp_0.12.17      jsonlite_1.5     
[29] mnormt_1.5-5      hms_0.4.2         stringi_1.2.3     grid_3.5.0       
[33] cli_1.0.0         tools_3.5.0       magrittr_1.5      lazyeval_0.2.1   
[37] janeaustenr_0.1.5 crayon_1.3.4      pkgconfig_2.0.1   Matrix_1.2-14    
[41] xml2_1.2.0        lubridate_1.7.4   assertthat_0.2.0  httr_1.3.1       
[45] rstudioapi_0.7    R6_2.2.2          nlme_3.1-137      compiler_3.5.0   

Helper function to read an IMDb data file given its filename.

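read_imdb <- function(data_path) {
  # Directory holding the downloaded IMDb TSV files.
  path <- "/Volumes/Extreme 510/Data/imdb/"
  # IMDb marks missing values with \N; quote = "" keeps stray quotes in titles literal.
  read_tsv(paste0(path, data_path), na = "\\N", quote = "", progress = FALSE)
}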
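To see what the `na = "\\N"` and `quote = ""` arguments do without the full IMDb files, here is a minimal sketch that applies the same `read_tsv()` call to a small temporary file. The rows and the `toy` name are made up for illustration and are not real IMDb data.

```r
library(readr)

# Hypothetical stand-in for an IMDb TSV file: two made-up rows,
# the second with \N in place of missing values.
tmp <- tempfile(fileext = ".tsv")
writeLines(c(
  "tconst\taverageRating\tnumVotes",
  "tt0000001\t5.8\t1000",
  "tt0000002\t\\N\t\\N"
), tmp)

# Same arguments as read_imdb(): \N becomes NA, quotes are not special.
toy <- read_tsv(tmp, na = "\\N", quote = "", progress = FALSE)

# The \N cells should now be proper missing values.
is.na(toy$averageRating)
```

Because `\N` is turned into `NA` at read time, the rating and vote columns can be parsed as numbers rather than falling back to character.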

Helper function to pretty-print the number of rows in a data frame, for use in charts and notebook text.

ppdf <- function(df) {
  df %>% nrow() %>% comma()
}
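As a quick illustration of what `ppdf()` returns, the sketch below runs it on a hypothetical data frame (`toy`, not part of the notebook) built to have the same number of rows as the ratings data.

```r
library(dplyr)   # provides %>%
library(scales)  # provides comma()

ppdf <- function(df) {
  df %>% nrow() %>% comma()
}

# Hypothetical data frame with a known number of rows.
toy <- data.frame(x = seq_len(847394))

# Returns a character string with thousands separators.
n_pretty <- ppdf(toy)
n_pretty
```

The result is a string rather than a number, so it is meant for display (chart subtitles, inline notebook text) and not for further arithmetic.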

1 Ratings

df_ratings <- read_imdb("title.ratings.tsv")
Parsed with column specification:
cols(
  tconst = col_character(),
  averageRating = col_double(),
  numVotes = col_integer()
)
df_ratings %>% head()

There are 847,394 ratings in the dataset.

Plot every point. (note: very slow!)

plot <- ggplot(df_ratings, aes(x = numVotes, y = averageRating)) +
          geom_point()

ggsave("imdb-0.png", plot, width=4, height=3)
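Since drawing every point is slow, one possible workaround while iterating on a chart is to plot a random sample first. This is only a sketch of that idea, not the approach taken in the blog post: `toy_ratings`, the sample size, and the `alpha` value are all assumptions chosen for illustration, and the data is simulated rather than taken from IMDb.

```r
library(dplyr)
library(ggplot2)

set.seed(42)

# Simulated stand-in for df_ratings (made-up values, same column names).
toy_ratings <- data.frame(
  numVotes = round(rlnorm(100000, meanlog = 4, sdlog = 2)) + 5,
  averageRating = round(runif(100000, min = 1, max = 10), 1)
)

# Draw a random subset so the preview renders quickly.
preview <- toy_ratings %>% sample_n(10000)

# Same aesthetics as the full plot; alpha makes overlapping points visible.
plot_preview <- ggplot(preview, aes(x = numVotes, y = averageRating)) +
  geom_point(alpha = 0.1)
```

A sample shows the overall shape but can hide rare extreme values, so the full plot is still needed for any final version.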