There has been a lot of talk lately about Pokémon due to the runaway success of Pokémon GO (I myself am Trainer Level 18 and on Team Valor). Players revel in the nostalgia of 1996, now having the ability to catch the original 151 Pokémon in real life.

However, while players most fondly remember the first generation, Pokémon is currently on its sixth generation, with the seventh generation beginning later this year with Pokémon Sun and Moon. As of now, there are 721 total Pokémon in the Pokédex, from Bulbasaur to Volcanion, not counting alternate Forms of several Pokémon such as Mega Evolutions.

In the meantime, I’ve seen a few interesting data visualizations which capitalize on the frenzy. A highly-upvoted post on the /r/dataisbeautiful subreddit by /u/nvvknvvk charts the Height vs. Weight of the original 151 Pokémon. Anh Le of Duke University posted a cluster analysis of the original 151 Pokémon using principal component analysis (PCA), compressing the 6 primary Pokémon stats into 2 dimensions.

However, those visualizations think too small, limiting themselves to a small subset of Pokémon. Why not capture every single aspect of every Pokémon and violently crush that data into three dimensions?

Spark (30% chance of paralyzing the target)

Last week, Apache Spark 2.0.0 was released, a major milestone in big data analysis. Spark 2.0 is potentially 10x as fast as the previous 1.6 version, with much improved APIs and documentation (you can actually import CSVs now!). The Python interface (which I use for this post), PySpark, is now nearly as capable as scikit-learn, the leading machine learning library, but with the ability to scale to terabyte-size datasets.

The data source of both visualizations above is the PokéAPI, whose data is open-source and available as CSVs on GitHub, and unusually clean with proper normalization. The dump is thorough, with full coverage of all the up-to-date Pokémon information available in the games up to the current generation.

The PokéAPI data includes numerical variables representing each Pokémon, including the six primary Pokémon stats (HP, Attack, Defense, Special Attack, Special Defense, Speed) and Height/Weight. We can aggregate those variables for all Pokémon and normalize them so that they are within [0.0, 1.0] for easier computation when the data is eventually reduced into 3 dimensions.
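The min-max normalization described above can be sketched in a few lines. This example uses NumPy rather than Spark for brevity, and the stat values shown are illustrative, not the actual PokéAPI data:

```python
import numpy as np

# Rows are Pokémon; columns are HP, Attack, Defense, Sp. Atk, Sp. Def, Speed.
# These values are illustrative stand-ins for the real PokéAPI data.
stats = np.array([
    [45, 49, 49, 65, 65, 45],
    [78, 84, 78, 109, 85, 100],
    [100, 100, 100, 100, 100, 100],
], dtype=float)

# Min-max normalize each column into [0.0, 1.0]
col_min = stats.min(axis=0)
col_max = stats.max(axis=0)
normalized = (stats - col_min) / (col_max - col_min)

print(normalized.min(), normalized.max())  # 0.0 1.0
```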

But a tool like Spark is overkill for simple normalization. What Spark isn’t overkill for is working with categorical variables. We can encode those variables as dummy variables, and Spark has a few tools especially helpful for that workflow. For example, let’s say we want to encode a Pokémon’s type(s) as binary variables.

There are currently 18 different types of Pokémon, and a Pokémon can have either 1 or 2 types. We can encode a Pokémon’s type(s) by setting a 1.0 in the columns which represent that Pokémon’s type(s), and 0.0 in all other columns. Bulbasaur, for example, is a Grass/Poison type. We set 1.0 in the column representing Grass, 1.0 in the column representing Poison, and 0.0 in the 16 remaining columns (in the end, the data representing Pokémon types is a 721x18 matrix). This is the one-hot encoding technique commonly used in machine learning. Spark allows the use of sparse data structures, so Spark does not have to store each 0.0 in memory; just the indices of each 1.0 and the size of the vector itself. Printing the table confirms that data structure.
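To make the sparse representation concrete, here is a small sketch of the same idea using SciPy instead of Spark’s SparseVector; only the nonzero indices and values are stored. The alphabetical type ordering here is an assumption for illustration:

```python
from scipy.sparse import csr_matrix

# Assumed alphabetical ordering of the 18 types (illustrative only)
TYPES = ["Bug", "Dark", "Dragon", "Electric", "Fairy", "Fighting", "Fire",
         "Flying", "Ghost", "Grass", "Ground", "Ice", "Normal", "Poison",
         "Psychic", "Rock", "Steel", "Water"]

def encode_types(type_names):
    """One-hot encode a Pokémon's 1-2 types as a sparse 1x18 row vector,
    storing only the positions of the 1.0 entries."""
    cols = [TYPES.index(t) for t in type_names]
    data = [1.0] * len(cols)
    return csr_matrix((data, ([0] * len(cols), cols)), shape=(1, len(TYPES)))

bulbasaur = encode_types(["Grass", "Poison"])
print(bulbasaur.nnz)  # 2 stored values; the other 16 zeros are implicit
```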

Here are the other Pokémon attributes present in the PokéAPI data dump that we can encode as binary columns:

  • Pokémon moves, including all moves the Pokémon is capable of learning via leveling up/TMs/egg moves in any version of the game. This variable results in 613 added columns.
  • Pokémon abilities, which are passive effects, and each Pokémon can have one of up to 3 unique abilities (adds 191 columns)
  • Pokémon color, which is just as it sounds (10 columns)
  • Pokémon shape, which apparently includes classifications such as “ball,” “quadruped,” and “squiggle”? (14 columns)
  • Pokémon habitat, where the Pokémon can generally be found (or not found, in the case of Event Pokémon) (10 columns)

Combining all the sparse vectors with special Spark functions, we have a dataframe of 863 features. Using PCA, we reduce the dimensionality of that data from 863 dimensions to 50 dimensions (we’re not going down to 3 dimensions just yet).
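The PCA step looks roughly like this. The post performs it with Spark’s `pyspark.ml.feature.PCA` on the assembled feature vectors; this sketch substitutes scikit-learn and random stand-in data so it runs standalone:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the 721 Pokémon x 863 binary/numeric feature matrix
rng = np.random.default_rng(0)
features = rng.random((721, 863))

# Reduce 863 dimensions down to 50 before running t-SNE
pca = PCA(n_components=50)
reduced = pca.fit_transform(features)

print(reduced.shape)  # (721, 50)
```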

The top 3 principal components of this 50D model explain only 12.8% of the data variance, which means that clusters would likely not be apparent if the data were plotted as-is. A smart thing to do would be to use a clustering algorithm. t-SNE is a relatively new algorithm by Laurens van der Maaten that is surprisingly effective at clustering high-dimensional data into low-dimensional space, without causing high amounts of blending between points (unfortunately, it can be computationally intensive, which is one of the reasons I reduced the dimensionality of the data to 50D first). Most academic papers focus on 2D representations of the data resulting from t-SNE, but there’s nothing preventing users from projecting the data to 3D!

Running a Python wrapper of t-SNE on the PCA-reduced dataset results in magic.
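A minimal version of that t-SNE step, sketched here with scikit-learn’s implementation rather than the wrapper used in the post, and a smaller random stand-in matrix so it finishes quickly:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the 50D PCA output (fewer rows than the real 721 for speed)
rng = np.random.default_rng(42)
reduced = rng.random((200, 50))

# Project to 3D; the default Barnes-Hut method supports up to 3 components
tsne = TSNE(n_components=3, perplexity=30, random_state=42)
embedding = tsne.fit_transform(reduced)

print(embedding.shape)  # (200, 3)
```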

As you can see, Pokémon along the same evolutionary line have very close [x, y, z] values, which indicates that the clustering algorithm accurately placed them close together. Notably, it grouped the pre-evolutions of the common Bug Pokémon together (Caterpie/Metapod/Weedle/Kakuna); even though they are of different evolutionary lines, they have similar attributes/movesets. The final evolutions of these Bug Pokémon, Butterfree and Beedrill, are clustered far from their pre-evolution stages, because the pre-evolutions are statistically useless while Butterfree/Beedrill are not as useless.

How do you best visualize these clusters? That’s where Plotly comes in.

Nasty Plotly (Sharply raises Special Attack)

Plotly is a data visualization tool, now open-source, which allows users to create interactive visualizations with a robust API. I previously used it for visualizations with R, but as it turns out, the Python API to Plotly is much more powerful. Simply plotting a static 3D chart using something like matplotlib with the default settings is not insightful.

In this case, the ability to manipulate the perspective of the data is very important. And so with a little bit of Plotly documentation-delving, I managed to create the 3D chart that you hopefully saw at the top of this page.

Each dot is colored by the Pokémon’s type in the first slot for simplicity (one exception is Normal/Flying Pokémon like Pidgey; I manually converted their displayed type to Flying because the omission was notable due to Game Freak’s addiction to that particular pairing in the early generations). This gives a surprisingly coherent visualization of the groups of Pokémon types present. But it’s also important to note the points outside the clusters, in order to identify incorrect clusters, or to identify Pokémon which have been clustered as especially unique.

A few interesting observations I’ve noted:

  • There are two far-away colocated clusters for Pokémon which are useless: gimmick Pokémon with limited movesets like Wobbuffet, Ditto, Smeargle, and Magikarp, and Bug Pokémon with limited movesets, as mentioned before, like Burmy and Kricketot.
  • There are often close clusters for Legendary Pokémon trios (Articuno, Zapdos, Moltres; the Pokémon which represent the teams in Pokémon GO) due to similar stats/moves.
  • The Flying cluster is interesting due to the presence of Pokémon with a different first type; their location correctly implies that they align more closely with their Flying secondary type (Zubat, Murkrow, Xatu, Wingull)
  • Lastly, the Pokémon God Arceus, which has the highest stat total of the base Pokémon, is located in its own cluster alongside similarly-statistically-powerful Pokémon such as Dialga, Dragonite, and Gyarados.

Interactive 3D was definitely a good idea for this type of visualization, as a 2D visualization would be difficult to read, and even more difficult to discern the clusters. And this project was a good reason to test out the capabilities of Spark 2.0 and prototype code for future data analysis projects (the next dataset I use will be much larger, I promise). I do have a few ideas for improvements to the 3D chart in the pipeline, but maybe I’ll save them for when the seventh generation is fully released.

The full code used to process the Pokémon data using Spark is available in this Jupyter notebook, and the code used to generate the Plotly visualizations is available in this Jupyter notebook, both open-sourced on GitHub. In the GitHub repository, you can download standalone, offline versions of the 3D chart; including an extra chart with cluster meshes, which was unused for this post due to performance issues.

You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!


Max Woolf (@minimaxir) is a Data Scientist at BuzzFeed in San Francisco. He is also an ex-Apple employee and Carnegie Mellon University graduate.

In his spare time, Max uses Python to gather data from public APIs and ggplot2 to plot plenty of pretty charts from that data. On special occasions, he uses Keras for fancy deep learning projects.

You can learn more about Max here, view his data analysis portfolio here, or view his coding portfolio here.