Last week, Yahoo announced that 1 billion accounts on their service were compromised by attackers in 2013. That is bad, and unfortunately common nowadays.
Security expert Troy Hunt maintains Have I been pwned?, a database of breached internet services where users can check whether their account information has been compromised in an attack like the one on Yahoo. Recently, Hunt released a dataset of 1.4 billion unique breached accounts and the services where each of those accounts was compromised.
The dataset has been scrubbed of identifying information and sensitive data (justifiably so), which strongly limits the scope of any potential analysis. However, what may be interesting is taking a look at the interrelated networks between the services and creating a neat data visualization.
Here’s a line from the 141.8 MB text file:
In this example, 2 users had their accounts breached at these specific 5 services.
I wrote a simple Python script that takes in the HIBP data dump and outputs two files:
One, a CSV file containing the total number of breached accounts for each service in this dataset. From the example data point, we would add 2 to the counts of Adobe breaches, GFAN breaches, etc. There are 1,768,628,867 total records in the dataset (which is greater than the earlier 1.4B figure since it counts an account once for every service where it was breached). The 1.77B number approximately matches the total number of records listed on HIBP (1,989,141,353) minus the number of records from sensitive breaches (~221M).
Two, a CSV file containing the unique pairs of services in each data point. From the example data point, we would add 2 to the counts of [Adobe, GFAN], [Adobe, HeroesOfNewerth], etc. (Statistically minded readers will notice that there are 10 unique two-value combinations for a set of 5 services.)
Although the HIBP dataset represents 1.77 billion records, it is far from “big data”: my Python script takes less than a minute to process it.
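The two tallies described above can be sketched roughly as follows. This is a minimal illustration rather than the actual script: it assumes each line of the dump has already been parsed into a `(num_accounts, services)` tuple, and the names `tally`, `ServiceD`, and `ServiceE` are hypothetical placeholders (only Adobe, GFAN, and HeroesOfNewerth come from the example).

```python
from collections import Counter
from itertools import combinations


def tally(records):
    """Count breached accounts per service and per pair of services.

    `records` is an iterable of (num_accounts, services) tuples; parsing the
    raw text file into this shape is assumed to happen elsewhere.
    """
    service_counts = Counter()
    pair_counts = Counter()
    for num_accounts, services in records:
        # Sort so that each pair is always stored in the same order.
        unique = sorted(set(services))
        for service in unique:
            service_counts[service] += num_accounts
        for pair in combinations(unique, 2):
            pair_counts[pair] += num_accounts
    return service_counts, pair_counts


# One data point: 2 accounts breached at the same 5 services.
# "ServiceD" and "ServiceE" are made-up stand-ins for illustration.
example = [(2, ["Adobe", "GFAN", "HeroesOfNewerth", "ServiceD", "ServiceE"])]
service_counts, pair_counts = tally(example)
```

With this single data point, each of the 5 services gets a count of 2, and `pair_counts` holds the 10 two-service combinations, each also with a count of 2.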
The latter CSV serves as the edge list for a network graph. There are 10,816 unique edges. Using R, ggplot2, and ggnetwork, I connected all the edges into a graph network (removing edges with too few breached accounts), set the layout according to 50,000 iterations of the Fruchterman-Reingold algorithm, and attempted to make it pretty.
The results of my first made-in-an-hour-after-getting-home-from-work draft were not pretty. On Hacker News, user flashman provides a sensible fix: only include edges in the network with a proportional number of accounts shared between the two internet services it connects (e.g. the edge contains at least 1% of the accounts in both services).
Removing edges which do not fit that criterion dramatically reduces the clutter and makes communities much more distinct. After rerunning the code and making further style tweaks, here’s the final result:
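One way to read that rule is as a filter over the edge list, sketched below. The function name `filter_edges`, the `min_share` parameter, and all the numbers in the example are assumptions made for illustration; the post's own filtering was done in R.

```python
def filter_edges(pair_counts, service_counts, min_share=0.01):
    """Keep an edge only if its shared accounts make up at least
    `min_share` of the total accounts of BOTH services it connects."""
    kept = {}
    for (a, b), shared in pair_counts.items():
        if (shared >= min_share * service_counts[a]
                and shared >= min_share * service_counts[b]):
            kept[(a, b)] = shared
    return kept


# Hypothetical totals and overlaps, not taken from the HIBP data.
service_counts = {"A": 1000, "B": 200, "C": 100000}
pair_counts = {
    ("A", "B"): 50,   # 5% of A and 25% of B: kept
    ("A", "C"): 500,  # 50% of A but only 0.5% of C: dropped
    ("B", "C"): 1,    # 0.5% of B: dropped
}
kept = filter_edges(pair_counts, service_counts)
```

Requiring the threshold on both sides is what stops a very large breach from being linked to nearly every smaller service.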
The colors of the nodes represent the communities as determined by the Walktrap community finding algorithm run on the network. There are a few noteworthy groups: the turquoise group of mainstream social networking services including LinkedIn and MySpace, the brown group of mainstream gaming services like Nexus Mods and XSplit, and the mustard-colored group of services of questionable legality on the right of the cluster. Some clusters sit far from the main body of the graph and represent non-English services: the yellow-green cluster on the left consists of Russian internet services, while the lime-green cluster at the top represents Chinese internet services.
The size of each node represents its degree, i.e. the number of other nodes connected to it. With that, we can easily see that services like LinkedIn and XSplit have strong connections to other breached services. Relatedly, the transparency of each edge in the image is determined by the weight of that edge, and demonstrates the relative magnitude of the LinkedIn/MySpace breaches. (And it’s also why we can’t use other metrics like centrality for sizing the nodes, as the relative weights of those breaches skew the result.)
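For reference, node degree can be computed directly from an edge list, as in the minimal sketch below. The helper `node_degrees` and the toy edges are hypothetical; the graph in this post was built and sized in R.

```python
from collections import defaultdict


def node_degrees(edges):
    """Return the number of distinct neighbors for every node in an
    undirected edge list of (node_a, node_b) pairs."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {node: len(linked) for node, linked in neighbors.items()}


# Toy edge list for illustration only.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
degrees = node_degrees(edges)
```

Degree ignores edge weights entirely, so one very heavy edge counts the same as a light one when sizing a node.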
What’s important is that the network is generated entirely from user behavior, and not by manually establishing the actual relationships between the internet services.
Given the wildly-varying timing and magnitude of these breaches, along with the numerous potential causes of breaches which are not always publicly disclosed, it is difficult to make accurate predictive models about future breaches from the HaveIBeenPwned? dataset. The records in HIBP are a very small sample of all the leaked data worldwide, unfortunately. However, it shows what relationships can be visualized from simple user fingerprints all around the web, even when the fingerprint itself is unknown.
You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!
I am currently looking for a job in data analysis/software engineering in San Francisco. If you liked this post and have a lead, feel free to shoot me an email.
Since I currently do not have a full-time salary to subsidize my machine learning/deep learning/software/hardware needs for these blog posts, I have set up a Patreon, and any monetary contributions to the Patreon are appreciated and will be put to good creative use.