Keras is a high-level open-source framework for deep learning, maintained by François Chollet, that abstracts away the massive amounts of configuration and matrix algebra needed to build production-quality deep learning models. The Keras API sits on top of a lower-level deep learning framework such as Theano or Google’s TensorFlow. Switching between these backends is only a matter of setting a flag; no front-end code changes are necessary.

But while Google has received a lot of publicity with TensorFlow, Microsoft has been quietly open-sourcing its own machine learning frameworks. There is LightGBM, presented as an alternative to the extremely famous xgboost framework. Now, there is CNTK (Microsoft Cognitive Toolkit), released at v2.0 a couple of weeks ago, which markets strong performance in both accuracy and speed even when compared to TensorFlow.

CNTK v2.0 also has a key feature: Keras compatibility. And just last week, support for the CNTK backend was merged into the official Keras repository.

Microsoft employees commented on Hacker News that simply changing the backend of Keras from TensorFlow to CNTK would result in a performance boost. So let’s put that to the test.

Deep Learning in the Cloud

Setting up a GPU instance for deep learning in the cloud is surprisingly underdiscussed. Most recommend simply using a premade image from Amazon which includes all the necessary GPU drivers. However, Amazon EC2 charges $0.90/hr (not prorated) for an NVIDIA Tesla K80 GPU instance, while Google Compute Engine charges $0.75/hr (prorated to the minute) for the same GPU, which is a nontrivial discount for the many hours necessary to train deep learning models.
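To make the billing difference concrete, here is a minimal sketch of the arithmetic. It assumes "not prorated" means every started hour is billed in full and that per-minute proration has no minimum charge; the 90-minute run is a hypothetical example, not one of the benchmark runs.

```python
import math

def ec2_cost(minutes, hourly_rate=0.90):
    """Cost when every started hour is billed in full (assumed meaning of 'not prorated')."""
    return math.ceil(minutes / 60) * hourly_rate

def gce_cost(minutes, hourly_rate=0.75):
    """Cost when usage is billed per minute at the hourly rate (no minimum assumed)."""
    return minutes * hourly_rate / 60

# A hypothetical 90-minute training run
minutes = 90
print(f"EC2: ${ec2_cost(minutes):.3f}")
print(f"GCE: ${gce_cost(minutes):.3f}")
```

Under these assumptions the gap is widest for runs that end shortly after an hour boundary, since the whole extra hour is charged on the non-prorated plan.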

The catch with GCE is you have to set up the deep learning drivers and frameworks from a blank Linux instance. I did that for my first adventure with Keras and it was not fun. However, I recently found a blog post by Durgesh Mankekar which takes a more modern approach to managing such dependencies with Docker containers, and also provides a setup script plus container with the necessary deep learning drivers/frameworks for Keras. This container can then be loaded using nvidia-docker, which allows Docker containers to access the GPU on the host. Running a deep learning script in the container is simply a matter of running a Docker command. After the script completes, the container is destroyed. This approach incidentally ensures that separate executions are independent; perfect for benchmarking/reproducibility.

I tweaked the container to include an installation of CNTK and a CNTK-compatible version of Keras, and made CNTK the default backend for Keras.

Benchmark Methodology

The Keras examples are robust and solve real-world deep learning problems; perfect for simulating real-world performance. I took a variety of those examples, emphasizing different neural network architectures, and added a custom logger which outputs a CSV containing both model performance and elapsed time as the training progresses.
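A logger of this kind can be sketched as follows. This is not the logger from the benchmark repo: the class name, the column names, and the metric values in the simulated loop are all hypothetical placeholders. The class is a plain Python object that only mirrors the `on_train_begin` / `on_epoch_end` hook names of Keras callbacks, so the sketch runs without Keras installed; with Keras, one would presumably subclass `keras.callbacks.Callback` instead.

```python
import csv
import io
import time

class CSVTimingLogger:
    """Duck-typed stand-in for a Keras callback: writes one CSV row per epoch."""

    def __init__(self, stream, metric_keys=("val_acc", "val_loss")):
        self.writer = csv.writer(stream)
        self.metric_keys = list(metric_keys)
        self.start = None

    def on_train_begin(self, logs=None):
        # Start the clock and write the header row
        self.start = time.perf_counter()
        self.writer.writerow(["epoch", "elapsed_sec"] + self.metric_keys)

    def on_epoch_end(self, epoch, logs=None):
        # Record elapsed wall-clock time since training began, plus the metrics
        logs = logs or {}
        elapsed = time.perf_counter() - self.start
        metrics = [logs.get(key, "") for key in self.metric_keys]
        self.writer.writerow([epoch + 1, f"{elapsed:.3f}"] + metrics)

# Simulated training loop with placeholder metric values (not real results)
buffer = io.StringIO()
logger = CSVTimingLogger(buffer)
logger.on_train_begin()
for epoch, (acc, loss) in enumerate([(0.81, 0.45), (0.85, 0.38)]):
    logger.on_epoch_end(epoch, {"val_acc": acc, "val_loss": loss})
print(buffer.getvalue())
```

Logging cumulative elapsed time rather than per-epoch time keeps the hook simple; per-epoch durations can be recovered afterwards by differencing consecutive rows.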

As mentioned earlier, the only change needed to switch between backends is setting a flag. Even though CNTK is the default backend for Keras in the container, a simple -e KERAS_BACKEND='tensorflow' argument in the Docker command switches it to TensorFlow.

I wrote a Python benchmark script (executed on the host) to manage and run all the examples in their own Docker containers, with both CNTK and TensorFlow, and collected the resulting logs.
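As a rough illustration of how such a host-side script could assemble its Docker commands, here is a minimal sketch. The image name `keras-cntk` and the script filenames are hypothetical, and the use of `--rm` to discard the container after each run is my assumption; only the `-e KERAS_BACKEND=...` argument comes from the setup described above. The sketch only prints the commands rather than executing them.

```python
import shlex

def build_command(script, backend, image="keras-cntk"):
    """Build an nvidia-docker invocation for one example script and one backend."""
    return [
        "nvidia-docker", "run", "--rm",      # --rm: remove the container when the script exits
        "-e", f"KERAS_BACKEND={backend}",    # environment variable read by Keras at import time
        image, "python3", script,
    ]

# Hypothetical example scripts; every script is run once per backend
scripts = ["imdb_bidirectional_lstm.py", "mnist_cnn.py"]
backends = ["cntk", "tensorflow"]

for script in scripts:
    for backend in backends:
        command = build_command(script, backend)
        print(" ".join(shlex.quote(part) for part in command))
```

Passing each command list to `subprocess.run` would launch the containers one at a time, which keeps the runs from competing for the same GPU.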

Here are the results.

IMDb Review Dataset

The IMDb review dataset is a famous dataset for benchmarking natural language processing (NLP) for sentiment analysis. The 25,000 reviews in the dataset are tagged as positive or negative. Good machine learning models developed before deep learning became mainstream score about 88% classification accuracy on the test dataset.

The first model approach is a Bidirectional LSTM, which reads the sequence of words in both directions, forward and backward.

First, let’s look at the classification accuracy of the test set at various points in time while the model is being trained:

Note: all charts in this blog post are interactive Plotly charts; feel free to mouse-over data points for exact values and use the controls in the upper-right to manipulate the chart.

Normally the accuracy increases as training proceeds. Bidirectional LSTMs take a long time to train before results improve, but at the very least both frameworks reach equivalent accuracy.

To gauge the speed of the algorithm, we can calculate the average amount of time it takes to train an epoch (i.e. each time the model sees the entire training set). The time is mostly consistent per epoch but there is some variability; each measurement will have a 95% confidence interval for the true average, obtained via nonparametric bootstrapping. In the case of the Bidirectional LSTM:
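For readers unfamiliar with the technique, a nonparametric bootstrap of the mean can be sketched in a few lines. This uses the simple percentile method in plain Python; the charts themselves were produced in R, so this is an illustration of the idea rather than the code behind them, and the epoch times below are placeholder values, not benchmark results.

```python
import random
import statistics

def bootstrap_mean_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement, recording the mean of each resample
    means = sorted(
        statistics.mean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lower_index = round((alpha / 2) * n_resamples)
    upper_index = round((1 - alpha / 2) * n_resamples) - 1
    return means[lower_index], means[upper_index]

# Placeholder per-epoch training times in seconds (not real measurements)
epoch_times = [151.2, 149.8, 150.5, 153.1, 150.0, 149.6, 152.4, 150.9]

low, high = bootstrap_mean_ci(epoch_times)
print(f"mean = {statistics.mean(epoch_times):.2f}s, 95% CI = [{low:.2f}, {high:.2f}]")
```

The appeal of the bootstrap here is that it makes no assumption about how epoch times are distributed, which matters when there are only a handful of epochs per run.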

Wow, CNTK is much faster! Not the 5x-10x speedup the benchmarks highlighted for working with LSTMs, but nearly halving the runtime by simply setting a backend flag is still impressive.

Next, we’ll look at the modern fasttext approach on the same dataset. Fasttext is a newer algorithm that averages word vector Embeddings together (irrespective of order), but gets incredible results at incredible speeds even when using the CPU only, as with Facebook’s official implementation for fasttext. (For this benchmark, I opt to include bigrams.)
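The core idea can be shown with a toy sketch: averaging word vectors discards word order, and bigrams put some local order back by acting as extra tokens. The two-dimensional vectors and the tiny vocabulary below are made up for illustration; in the actual model the embeddings are learned during training.

```python
def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def bigrams(tokens):
    """Adjacent token pairs, which preserve local word order."""
    return list(zip(tokens, tokens[1:]))

# Toy embeddings (hypothetical values; real ones are learned)
embeddings = {"not": [1.0, 0.0], "good": [0.0, 1.0], "bad": [0.0, -1.0]}

review_a = ["not", "good"]
review_b = ["good", "not"]

mean_a = average_vectors([embeddings[token] for token in review_a])
mean_b = average_vectors([embeddings[token] for token in review_b])

print(mean_a == mean_b)                        # averaging alone cannot tell the two apart
print(bigrams(review_a) == bigrams(review_b))  # the bigram tokens differ
```

Because each distinct bigram gets its own embedding, including them lets the averaged representation distinguish phrases that use the same words in a different order, at the cost of a much larger vocabulary.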

Both frameworks have nearly identical accuracy due to model simplicity, but in this case, TensorFlow is faster at working with Embeddings. (At the least, fasttext is clearly much faster than the Bidirectional LSTM approach!) In addition, fasttext blows away the 88% benchmark, which may be worth considering for other machine learning projects.

MNIST Dataset

The MNIST dataset is another famous dataset of handwritten digits, good for testing computer vision (60,000 training images, 10,000 test images). Generally, good models get above 99% classification accuracy on the test set.

The multilayer perceptron (MLP) approach just uses a large fully-connected network and lets Deep Learning Magic ™ take over. Sometimes that can be enough.

Both frameworks train the model extremely quickly, taking only a few seconds per epoch; there’s no clear winner in terms of accuracy (although neither broke 99%), but CNTK is faster.

Another approach is the convolutional neural network (CNN), which utilizes the inherent relationships between adjacent pixels and is a more logical architecture for image data.

In this case, TensorFlow performs better, both in accuracy and speed (and it breaks 99% too).


CIFAR-10 Dataset

Moving on to more complex real-world models, the CIFAR-10 dataset is used for image classification of 10 different objects. The architecture in the benchmark script is a deep CNN + MLP of many layers, similar in architecture to the famous VGG-16 model but simpler, since most people do not have a supercomputer cluster to train it.

In this case, performance between the two backends is equal, both in accuracy and speed. Perhaps the MLP benefits of CNTK and the CNN benefits of TensorFlow canceled each other out.

Nietzsche Text Generation

Text generation based off of char-rnn is popular. Specifically, it uses an LSTM to “learn” the text and sample new text. In the Keras example using Nietzsche’s ramblings as the source dataset, the model attempts to predict the next character using the previous 40 characters, and minimize the training loss. Ideally you want below 1.00 loss before generated text is grammatically coherent.
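The "previous 40 characters predict the next one" setup amounts to sliding a window over the text. Here is a minimal sketch of that data preparation step; the stride of 3 between windows is an assumption on my part, and the sample string is a stand-in for the real corpus.

```python
def make_windows(text, maxlen=40, step=3):
    """Split text into (context, next_char) training pairs.

    Each context is `maxlen` characters long and the target is the character
    that immediately follows it. `step` is the stride between windows.
    """
    contexts, targets = [], []
    for start in range(0, len(text) - maxlen, step):
        contexts.append(text[start:start + maxlen])
        targets.append(text[start + maxlen])
    return contexts, targets

# Stand-in corpus: 100 characters of repeated letters
sample_text = "abcdefghij" * 10
contexts, targets = make_windows(sample_text)

print(len(contexts))
print(repr(contexts[0]), "->", repr(targets[0]))
```

In the real model each character would then be one-hot encoded before being fed to the LSTM, with the target character as the class label.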

Both have similar changes in loss over time (unfortunately, a loss of 1.40 will still result in gibberish generated text), although CNTK is again faster thanks to the LSTM architecture.

For this next benchmark, I will not use an official Keras example script, but instead use my own text generator architecture, created during my previous Keras post.

My network avoids converging early, at only a minor cost to training speed in the TensorFlow case. Unfortunately, CNTK is much slower on this architecture than it was on the simple model, though it is still faster than TensorFlow on the advanced model.

Here’s the generated text output from the TensorFlow-trained model on my architecture:

hinks the rich man must be wholly perverity and connection of the english sin of the philosophers of the basis of the same profound of his placed and evil and exception of fear to plants to me such as the case of the will seems to the will to be every such a remark as a primates of a strong of

And here’s the output from the CNTK-trained model:

(_x2js1hevjg4z_?z_aæ?q_gpmj:sn![?(f3_ch=lhw4y n6)gkh
momu,?!ljë7g)k,!?[45 0as9[d.68éhhptvsx jd_næi,ä_z!cwkr"_f6ë-mu_(epp

Wait, what? Apparently my model architecture caused CNTK to hit a legitimate bug when making predictions, which did not happen with CNTK + the simple LSTM architecture. Thanks to my QA skills, I found that batch normalization was the cause of the bug and filed the issue appropriately.


In all, the title of this post does not follow Betteridge’s law of headlines; deciding the better Keras backend is not as clear-cut as expected. Accuracy is mostly identical between the two frameworks. CNTK is faster at LSTMs/MLPs, TensorFlow is faster at CNNs/Embeddings, but when networks implement both, it’s a tie.

Random bug aside, it’s possible that CNTK is not fully optimized for running on Keras (indeed, the 1bit-SGD functionality does not work yet) so there is still room for future improvement. Despite that, the results for simply setting a flag are extremely impressive, and it is worth testing Keras models on both CNTK and TensorFlow now to see which is better before deploying them to production.

All scripts for running the benchmark are available in this GitHub repo. You can view the R/ggplot2 code used to process the logs and create the interactive visualizations in this R Notebook.


Max Woolf (@minimaxir) is a former Apple Software QA Engineer living in San Francisco and a Carnegie Mellon University graduate.

In his spare time, Max uses Python to gather data from public APIs and ggplot2 to plot plenty of pretty charts from that data.
