Zoom in/out or pan around the chart using the controls in the upper-right corner. Hover on a data point to identify the corresponding headline. Click on a news source in the legend to toggle it on/off.

Facebook recently announced that they will punish Facebook Posts which link to articles using clickbait headlines by limiting their exposure on the News Feed. They also announced that they have a large team manually classifying what is and isn’t linkbait. From my analysis of BuzzFeed headlines last year, I found that clickbait typically follows very specific tropes with phrases such as “The [X] Most” or “You Should Do.” It shouldn’t be that difficult to identify clickbait using heuristics/machine learning.

Relatedly, I recently read a blog post by Lance Legel describing words2map, a project which takes in keywords and converts Google News articles representing those keywords into numerical vector representations, clusters them together in 2D, and plots those on a chart.

Why not combine the two ideas? Let’s deconstruct thousands of news headlines into numeric representations and cluster them together to see if we can isolate submissions which intrinsically hit those clickbait tropes.

5 Big Data Processing Techniques You Should Know

Using a modified version of my Facebook Page Post Scraper, I downloaded all Facebook Posts by the Facebook Pages representing news publications CNN, NYTimes, BuzzFeed, and Upworthy, and stored the headlines of each linked article if present. CNN and NYTimes represent traditional media whose headlines tend to follow AP Style guidelines, while BuzzFeed and Upworthy are more known for their clickbait headlines.

However, those are not absolute rules; BuzzFeed occasionally has more-serious headlines, and CNN occasionally has more-gimmicky headlines. That’s okay. If my hypothesis is correct, the out-of-character headlines will be clustered with other headlines of the same style.

First, I load the four datasets into the hip new big data tool Apache Spark 2.0 (via the Python interface, PySpark) and combine them all into a single DataFrame, with a little extra post-processing to remove invalid entries.

# read_tsv() and process_df() are helper functions not shown in this post.
df_cnn = read_tsv("fb_headlines/CNN_fb.tsv")
df_nytimes = read_tsv("fb_headlines/NYTimes_fb.tsv")
df_buzzfeed = read_tsv("fb_headlines/BuzzFeed_fb.tsv")
df_upworthy = read_tsv("fb_headlines/Upworthy_fb.tsv")

df = df_cnn.union(df_nytimes).union(df_buzzfeed).union(df_upworthy)
df = process_df(df).cache()

In all, I had 102,267 valid news headlines to analyze; not “big data,” but enough data that it’s worth optimizing the analysis code as much as possible, especially in this case where the computation can be intensive.

Once the data is loaded, we convert each headline to an array of tokens, one per word, all lowercase and with punctuation stripped. This task is normally difficult and slow on large datasets; however, Spark has a RegexTokenizer that quickly executes all the necessary steps in one fell swoop.

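from pyspark.ml.feature import RegexTokenizer

# Split on every non-word character; tokens are lowercased by default.
tokenizer = RegexTokenizer(pattern=r"[^\w]", inputCol="text", outputCol="words")
df = tokenizer.transform(df)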
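
To make the tokenization step concrete, here is a rough pure-Python sketch of what splitting on that pattern does to a single headline. It is only an illustration of the idea, not Spark's implementation; the function name `tokenize` and the example headline are made up for this sketch.

```python
import re
from typing import List


def tokenize(headline: str) -> List[str]:
    """Lowercase a headline and split it on every non-word character."""
    # Splitting on [^\w] leaves empty strings wherever two separators are
    # adjacent (e.g. a comma followed by a space), so those are dropped.
    return [token for token in re.split(r"[^\w]", headline.lower()) if token]


# Invented example headline, not taken from the dataset.
print(tokenize("21 Things You Should Know, Ranked!"))
# ['21', 'things', 'you', 'should', 'know', 'ranked']
```

One side effect of this pattern is that apostrophes count as separators too, so a contraction such as “Can’t” becomes the two tokens "can" and "t".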

Most analyses would remove stop words as their high frequency can cause noise in the subsequent analysis. In this case, we should not remove them as many stop words are critical components of the clickbait tropes (e.g. “I Am” and “The [X] Most”).

Now that the tokens are created, we can apply Word2vec, an algorithm which converts a collection of words into a dictionary of multidimensional numerical representations. Once the dictionary is created, we can average all the word vectors for a given headline to get the numeric representation of the headline itself. Again, Spark has convenient functions for those actions; I set a random seed for reproducibility:

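from pyspark.ml.feature import Word2Vec

# fit() learns the word vectors; transform() averages them per headline.
word2Vec = Word2Vec(vectorSize=50, seed=42, inputCol="words", outputCol="vectors")
model = word2Vec.fit(df)
df = model.transform(df)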

In this case, each word (and therefore each averaged headline) is converted into a 50-dimension vector for speed later; usually, the word vectors from Word2vec are between 100 and 1,000 dimensions.
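
The averaging step is easy to picture with a toy example. The sketch below uses made-up 3-dimensional vectors (the real ones are 50-dimensional and learned from the headlines) and a hypothetical helper, `headline_vector`. Skipping words missing from the dictionary is a choice made for this sketch; Spark's own handling of such words may differ.

```python
from typing import Dict, List

# Made-up word vectors, purely for illustration.
toy_vectors: Dict[str, List[float]] = {
    "you": [1.0, 2.0, 0.0],
    "should": [3.0, 0.0, 2.0],
    "know": [2.0, 4.0, 4.0],
}


def headline_vector(tokens: List[str], vectors: Dict[str, List[float]], size: int) -> List[float]:
    """Average the vectors of the tokens that are present in the dictionary."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # No known words: fall back to the zero vector.
        return [0.0] * size
    return [sum(vec[i] for vec in known) / len(known) for i in range(size)]


print(headline_vector(["you", "should", "know"], toy_vectors, 3))
# [2.0, 2.0, 2.0]
```

Because the result is a plain average, every headline ends up with a vector of the same length no matter how many words it contains.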

Another step to add context to the data is to add features representing the page that posted the status. This takes two steps in Spark: first, use a StringIndexer to convert the names of the four Facebook Pages into numerical 0-indexed representations, then use a OneHotEncoder to convert the data into dummy variables.

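from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Map each page name to a 0-based index (most frequent page first).
stringIndexer = StringIndexer(inputCol="page_id", outputCol="indexed")
model = stringIndexer.fit(df)
df = model.transform(df)
# In Spark 2.x, OneHotEncoder is a plain transformer, as used here;
# in Spark 3.x it must be fit first: encoder.fit(df).transform(df).
encoder = OneHotEncoder(inputCol="indexed", outputCol="page_ohe")
df = encoder.transform(df)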

This adds a 3-element vector column in which the position of the 1.0 identifies the corresponding page (the 4th page is represented by a vector with no 1.0 at all).
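
Here is a small pure-Python sketch of the same two steps, to show why four pages only need three dummy values. The post counts are invented, both function names are hypothetical, and breaking ties alphabetically is an assumption of this sketch rather than something guaranteed by Spark.

```python
from collections import Counter
from typing import Dict, List


def index_labels(labels: List[str]) -> Dict[str, int]:
    """Assign 0-based indices, most frequent label first."""
    counts = Counter(labels)
    # Assumption for this sketch: ties are broken alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot_drop_last(index: int, num_categories: int) -> List[float]:
    """One-hot encode an index, leaving the last category as all zeros."""
    vector = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vector[index] = 1.0
    return vector


# Invented post counts, only to give the indexer something to rank.
pages = ["BuzzFeed"] * 4 + ["CNN"] * 3 + ["NYTimes"] * 2 + ["Upworthy"]
indices = index_labels(pages)
for page, index in indices.items():
    print(page, one_hot_drop_last(index, len(indices)))
# BuzzFeed [1.0, 0.0, 0.0]
# CNN [0.0, 1.0, 0.0]
# NYTimes [0.0, 0.0, 1.0]
# Upworthy [0.0, 0.0, 0.0]
```

The last category needs no slot of its own because "none of the others" already identifies it.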

Lastly, combine the 3D page numeric vectors and the 50D word numeric vectors with a VectorAssembler:

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=['page_ohe', 'vectors'], outputCol="merged_vectors")
df = assembler.transform(df)
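
Conceptually, the assembly step is just concatenation. A minimal sketch with placeholder values (the helper `assemble` is made up for illustration and is not Spark's API):

```python
from typing import List


def assemble(*parts: List[float]) -> List[float]:
    """Concatenate several feature vectors into one, in the order given."""
    merged: List[float] = []
    for part in parts:
        merged.extend(part)
    return merged


page_ohe = [0.0, 1.0, 0.0]   # 3 page dummy values
headline_vec = [0.0] * 50    # placeholder for a 50D headline vector
merged = assemble(page_ohe, headline_vec)
print(len(merged))
# 53
```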

That’s it! And these code blocks could be combined into a Spark Pipeline and be used on datasets hundreds or thousands of times as large with just two lines of code, something which will help me make interesting blog posts in the future.
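
As a sketch of what that could look like, the stages above can be handed to a `Pipeline` in order. The wrapper function `build_headline_pipeline` is a name invented for this example, and the column names are assumed to match the snippets earlier in this post.

```python
def build_headline_pipeline():
    """Chain the steps from this post into a single Spark ML Pipeline."""
    # Imported inside the function so that the definition loads on a machine
    # without Spark; calling it requires PySpark and an active SparkSession.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (
        OneHotEncoder,
        RegexTokenizer,
        StringIndexer,
        VectorAssembler,
        Word2Vec,
    )

    stages = [
        RegexTokenizer(pattern=r"[^\w]", inputCol="text", outputCol="words"),
        Word2Vec(vectorSize=50, seed=42, inputCol="words", outputCol="vectors"),
        StringIndexer(inputCol="page_id", outputCol="indexed"),
        OneHotEncoder(inputCol="indexed", outputCol="page_ohe"),
        VectorAssembler(inputCols=["page_ohe", "vectors"], outputCol="merged_vectors"),
    ]
    return Pipeline(stages=stages)


# The "two lines", given a DataFrame `df` with text and page_id columns:
# pipeline_model = build_headline_pipeline().fit(df)
# df = pipeline_model.transform(df)
```

A Pipeline accepts both stages that need fitting and stages that only transform, so it does not matter here whether OneHotEncoder is one or the other in your Spark version.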

This Chart Literally Just Totes Made Me Can’t Even

To keep comparisons between news sources apples-to-apples with respect to current events (and there have been a lot of events in the past couple months!), we will only look at headlines from June 1 to August 12, 2016 among the four pages: 9,500 headlines total. That’s a fair maximum sample size, both because all the points need to be loaded onto this webpage, and because the clustering algorithm is slow.

Speaking of clustering, I load the 53D vectors created above for this subset into R and apply the t-SNE algorithm to project each point into 2 dimensions, where similar headlines end up grouped together. Since the algorithm’s running time scales quadratically with sample size and it is single-threaded, calculating the projected points took over 8 hours!

cluster_coords <- tsne(matrix, initial_dims=53, perplexity=50, epoch=50)
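
To get a feel for what quadratic scaling means here, count the pairs of points involved. This is only a back-of-the-envelope comparison of pair counts, not a prediction of running time; the function name is made up for the sketch.

```python
def pairwise_count(n: int) -> int:
    """Number of distinct pairs among n points: n * (n - 1) / 2."""
    return n * (n - 1) // 2


subset_pairs = pairwise_count(9_500)     # the June-August subset
full_pairs = pairwise_count(102_267)     # every valid headline

print(f"{subset_pairs:,} pairs in the subset")
# 45,120,250 pairs in the subset
print(f"{full_pairs:,} pairs in the full dataset")
# 5,229,218,511 pairs in the full dataset
```

The full dataset has roughly 11 times as many headlines but over 100 times as many pairs, which is why the subset is a practical ceiling for this approach.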

A quick prototype of the plot in ggplot2:

ggplot(df_plot, aes(x=x, y=y, color=page_id)) +
    geom_point(alpha=0.75, stroke=0)

This gives us a map close to what we want. Hooray, my crazy clustering idea was not completely crazy!

However, for this visualization, it is extremely important to be able to determine which point corresponds to which headline. A static image only raises further questions about why points end up where they do. Therefore, I plot the chart interactively using Plotly, specifically with its WebGL interface, since rendering 9,500 points in the browser with the typical d3/SVG approach causes massive slowdown. The speed of WebGL also allows users to scan the headlines rapidly.

As you’ve seen from the visualizations above, all four Facebook pages have their own clusters, thanks to the added identifier feature. The left side of the 2D representation represents the more serious headlines, while the right side represents the more silly headlines. There is a little overlap between the NYTimes/CNN/BuzzFeed articles; notably, the NYTimes/CNN articles close to the BuzzFeed cluster tend to be more linkbaity, as shown in the image above. Upworthy is in its own little bubble with little similarity to the other news publications (the Upworthy headlines are much more verbose, which is likely causing more entropy and dissimilarity with the other, more concise headlines).

Something particularly interesting is the formation of natural subclusters outside of the main clusters. These clusters are based around keywords in the headlines, which is significant since the input data is a linear combination of that keyword and many other words, without using explicit word-importance statistical tools such as tf-idf:

  • The top clusters are based around pop-culture keywords: Taylor Swift, Harry Potter/J.K. Rowling, Pokémon Go, and Game of Thrones.
  • The bottom clusters are based around political keywords, including Bernie Sanders, Hillary Clinton, Donald Trump, and the U.S. itself.
  • The cluster between BuzzFeed and Upworthy contains headlines with the “[X]-year-old” trope from all pages.
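
One way to picture how keyword subclusters like these can emerge from plain averaging is the toy sketch below. Everything in it is made up: the 2-dimensional vectors, the headlines, and in particular the assumption that keyword vectors sit far from the vectors of common words. It shows that the effect is possible under that assumption, not that this is what happened in the real model.

```python
import math
from typing import Dict, List

# Invented vectors: two "keywords" far from the origin, common words near it.
toy_vectors: Dict[str, List[float]] = {
    "pokemon": [10.0, 0.0],
    "trump": [0.0, 10.0],
    "the": [1.0, 1.0],
    "new": [0.0, 1.0],
    "is": [1.0, 0.0],
    "here": [1.0, 1.0],
}


def average(tokens: List[str]) -> List[float]:
    """Average the toy vectors of a tokenized headline."""
    vecs = [toy_vectors[token] for token in tokens]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(2)]


a = average(["the", "new", "pokemon"])
b = average(["pokemon", "is", "here"])   # shares only the keyword with a
c = average(["the", "new", "trump"])     # shares two common words with a

print(math.dist(a, b) < math.dist(a, c))
# True
```

In this toy setup the headline sharing one keyword lands closer than the headline sharing two ordinary words, which is the kind of geometry that would let t-SNE pull keyword headlines into their own group.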

As you may have noticed playing around with the interactive chart, this methodology is not perfect. Some linkbait headlines are present in the center of CNN/NYTimes clusters, and some serious headlines are present in the center of the BuzzFeed cluster. The more academic method of identifying clickbait in an unsupervised manner using machine learning would be to incorporate other inherent attributes of words and phrases, such as using bigrams, part-of-speech tagging, bag-of-words/tf-idf, and character-level language models.
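
Of those techniques, bigrams are the simplest to sketch. The snippet below extracts adjacent word pairs from tokenized headlines and counts them; the headlines are invented and the function name is hypothetical. It only illustrates the feature, not a full classifier.

```python
from collections import Counter
from typing import List, Tuple


def bigrams(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return every pair of adjacent tokens, in order."""
    return list(zip(tokens, tokens[1:]))


# Invented, already-tokenized headlines.
headlines = [
    ["things", "you", "should", "know"],
    ["why", "you", "should", "care"],
    ["senate", "passes", "budget", "bill"],
]

counts = Counter(pair for tokens in headlines for pair in bigrams(tokens))
print(counts[("you", "should")])
# 2
```

Unlike an average of word vectors, a bigram keeps word order, so a phrase such as “you should” becomes a feature in its own right.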

Coincidentally, around the same time Facebook announced their anti-clickbait initiative, Facebook open-sourced their fastText project, which can quickly build models to classify text using some of the above example techniques. Hmmmmmm…

The full code used to process the Facebook Page data using Spark is available in this Jupyter notebook, and the code used to generate the Plotly visualizations in R is available in this Jupyter notebook, both open-sourced on GitHub. In the GitHub repository, you can download a standalone, offline version of the interactive WebGL chart.

You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!


Max Woolf (@minimaxir) is a Data Scientist at BuzzFeed in San Francisco. He is also an ex-Apple employee and Carnegie Mellon University graduate.

In his spare time, Max uses Python to gather data from public APIs and ggplot2 to plot plenty of pretty charts from that data. On special occasions, he uses Keras for fancy deep learning projects.
