<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Benchmarking on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/tag/benchmarking/</link>
    <description>Recent content in Benchmarking on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Tue, 28 Nov 2017 08:30:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/tag/benchmarking/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Benchmarking Modern GPUs for Maximum Cloud Cost Efficiency in Deep Learning</title>
      <link>https://minimaxir.com/2017/11/benchmark-gpus/</link>
      <pubDate>Tue, 28 Nov 2017 08:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/11/benchmark-gpus/</guid>
      <description>A 36% price cut to GPU instances, in addition to the potential new benefits offered by software and GPU updates, however, might be enough to tip the cost-efficiency scales back in favor of GPUs.</description>
      <content:encoded><![CDATA[<p>A few months ago, I <a href="http://minimaxir.com/2017/06/keras-cntk/">performed benchmarks</a> of deep learning frameworks in the cloud, with a <a href="http://minimaxir.com/2017/07/cpu-or-gpu/">followup</a> focusing on the cost difference between using GPUs and CPUs. And just a few months later, the landscape has changed, with significant updates to the low-level <a href="https://developer.nvidia.com/cudnn">NVIDIA cuDNN</a> library which powers the raw learning on the GPU, the <a href="https://www.tensorflow.org">TensorFlow</a> and <a href="https://github.com/Microsoft/CNTK">CNTK</a> deep learning frameworks, and the higher-level <a href="https://github.com/fchollet/keras">Keras</a> framework which uses TensorFlow/CNTK as backends for easy deep learning model training.</p>
<p>As a bonus to the framework updates, Google <a href="https://cloudplatform.googleblog.com/2017/09/introducing-faster-GPUs-for-Google-Compute-Engine.html">recently released</a> the newest generation of NVIDIA cloud GPUs, the Pascal-based P100, onto <a href="https://cloud.google.com/compute/">Google Compute Engine</a>, which touts an up-to-10x performance increase over the K80 GPUs currently used in cloud computing. As a bonus bonus, Google recently <a href="https://cloudplatform.googleblog.com/2017/11/new-lower-prices-for-GPUs-and-preemptible-Local-SSDs.html">cut the prices</a> of both K80 and P100 GPU instances by up to 36%.</p>
<p>The results of my earlier benchmarks favored <a href="https://cloud.google.com/preemptible-vms/">preemptible</a> instances with many CPUs as the most cost-efficient option (a preemptible instance can only last for up to 24 hours and could end prematurely). A 36% price cut to GPU instances, in addition to the potential new benefits offered by software and GPU updates, however, might be enough to tip the cost-efficiency scales back in favor of GPUs. It&rsquo;s a good idea to rerun the experiment with updated VMs and see what happens.</p>
<h2 id="benchmark-setup">Benchmark Setup</h2>
<p>As with the original benchmark, I set up a <a href="https://github.com/minimaxir/keras-cntk-docker">Docker container</a> containing the deep learning frameworks (based on cuDNN 6, the latest version of cuDNN natively supported by the frameworks) that can be used to train each model independently. The <a href="https://github.com/minimaxir/keras-cntk-benchmark/tree/master/v2/test_files">Keras benchmark scripts</a> run on the containers are based on <em>real-world</em> use cases of deep learning.</p>
<p>The 6 hardware/software configurations and their Google Compute Engine <a href="https://cloud.google.com/compute/pricing">prices</a> for the tests are:</p>
<ul>
<li>A K80 GPU (attached to a <code>n1-standard-1</code> instance), tested with both TensorFlow (1.4) and CNTK (2.2): <strong>$0.4975 / hour</strong>.</li>
<li>A P100 GPU (attached to a <code>n1-standard-1</code> instance), tested with both TensorFlow and CNTK: <strong>$1.5075 / hour</strong>.</li>
<li>A preemptible <code>n1-highcpu-32</code> instance, with 32 vCPUs based on the Intel Skylake architecture, tested with TensorFlow only: <strong>$0.2400 / hour</strong>.</li>
<li>A preemptible <code>n1-highcpu-16</code> instance, with 16 vCPUs based on the Intel Skylake architecture, tested with TensorFlow only: <strong>$0.1200 / hour</strong>.</li>
</ul>
<p>A single K80 GPU uses 1/2 a GPU board while a single P100 uses a full GPU board, which in an ideal world would suggest that the P100 is twice as fast as the K80, at minimum. But even so, the P100 configuration is about 3 times as expensive, so even if a model trains in half the time, it may not necessarily be cheaper on the P100.</p>
<p>Also, the CPU tests use TensorFlow <em>as installed via the recommended method</em> through pip, since compiling the TensorFlow binary from scratch to take advantage of CPU instructions as <a href="http://minimaxir.com/2017/07/cpu-or-gpu/">with my previous test</a> is not a pragmatic workflow for casual use.</p>
<h2 id="benchmark-results">Benchmark Results</h2>
<p>When a fresh-out-of-an-AI-MOOC engineer wants to experiment with deep learning in the cloud, they typically use a K80 + TensorFlow setup, so we&rsquo;ll use that as the <em>base configuration</em>.</p>
<p>For each model architecture and software/hardware configuration, I calculate the <strong>total training time relative to the base configuration instance training</strong> for running the model training for the provided test script. In all cases, the P100 GPU <em>should</em> perform better than the K80, and 32 vCPUs <em>should</em> train faster than 16 vCPUs. The question is how <em>much</em> faster?</p>
<p>Let&rsquo;s start with the <a href="http://yann.lecun.com/exdb/mnist/">MNIST dataset</a> of handwritten digits and the common multilayer perceptron (MLP) architecture, with dense fully-connected layers. Lower training time is better.</p>
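<p>The MLP test script is in the spirit of the canonical Keras MNIST MLP example; a minimal sketch of that kind of model (layer sizes are illustrative, not necessarily those of the actual test script, and Keras with the TensorFlow backend is assumed) looks like this:</p>
<pre><code># Minimal MNIST multilayer perceptron sketch in Keras (illustrative layer sizes).
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    Dropout(0.2),
    Dense(512, activation='relu'),
    Dropout(0.2),
    Dense(10, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=128, epochs=10, validation_data=(x_test, y_test))
</code></pre>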
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-5_hu_df63751b48270991.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-5_hu_33351b8d5d2916d3.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-5_hu_773ee4a74d2ce535.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-5.png 1200w" src="dl-cpu-gpu-5.png"/> 
</figure>

<p>For this task, CNTK appears to be faster than TensorFlow. As expected, the P100 is faster than the K80 for each framework, although it&rsquo;s not a dramatic difference. However, since the task is simple, CPU performance is close to GPU performance, which implies that the GPU is not as cost-effective for a simple architecture.</p>
<p>For each model architecture and configuration, I calculate a <strong>normalized training cost relative to the cost of the base configuration training</strong>. Because GCE instance costs are prorated, we can simply calculate experiment cost by multiplying the total number of seconds the experiment runs by the cost of the instance (per second).</p>
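<p>In other words, the per-experiment cost is just elapsed seconds multiplied by the per-second instance price, then divided by the cost of the base configuration run; a quick sketch using the hourly rates listed above (the example timings are illustrative, not actual benchmark results):</p>
<pre><code># Normalized training cost: prorated run cost divided by the cost of the base (K80) run.
HOURLY_RATES = {
    'k80': 0.4975,              # n1-standard-1 + K80
    'p100': 1.5075,             # n1-standard-1 + P100
    'cpu32_preemptible': 0.2400,
    'cpu16_preemptible': 0.1200,
}

def experiment_cost(seconds, config):
    """Prorated cost of a single training run, in dollars."""
    return seconds * HOURLY_RATES[config] / 3600.0

def normalized_cost(seconds, config, base_seconds, base_config='k80'):
    """Cost relative to the base configuration (1.0 = same cost as the K80 baseline)."""
    return experiment_cost(seconds, config) / experiment_cost(base_seconds, base_config)

# e.g. a hypothetical run taking 300s on a P100 vs. 600s on the K80 baseline:
print(normalized_cost(300, 'p100', 600))   # ~1.52: faster, but more expensive overall
</code></pre>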
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-6_hu_8092aa4efa0c4355.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-6_hu_6ec85d77120003f7.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-6_hu_3fa9ff93fed554d5.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-6.png 1200w" src="dl-cpu-gpu-6.png"/> 
</figure>

<p>Unsurprisingly, CPUs are more cost-effective. However, the P100 is even <em>less</em> cost-effective for this task than the K80.</p>
<p>Now, let&rsquo;s look at the same dataset with a convolutional neural network (CNN) approach to digit classification. Since CNNs are typically used for computer vision tasks, new graphics card architectures are optimized for CNN workloads, so it will be interesting to see how the P100 performs compared to the K80.</p>
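<p>For reference, the digit-classification convnet is in the spirit of the standard Keras MNIST CNN example; a minimal sketch (layer sizes are illustrative, not necessarily those of the actual test script):</p>
<pre><code># Minimal MNIST convnet sketch in Keras (illustrative layer sizes).
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(10, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
</code></pre>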
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-7_hu_f8361510000c69ef.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-7_hu_a5e4bb39cb0f4851.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-7_hu_13b371e4d8afa6c9.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-7.png 1200w" src="dl-cpu-gpu-7.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-8_hu_f4a994fcdbd47c8f.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-8_hu_94b3b6c80d09cc47.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-8_hu_ca2831240a30c8c.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-8.png 1200w" src="dl-cpu-gpu-8.png"/> 
</figure>

<p>Indeed, the P100 is twice as fast as the K80, but due to the huge cost premium, it&rsquo;s not cost-effective for this simple task. However, CPUs do not perform well on this task either, so, notably, the base configuration is the best one here.</p>
<p>Let&rsquo;s go deeper with CNNs and look at the <a href="https://www.cs.toronto.edu/%7Ekriz/cifar.html">CIFAR-10</a> image classification dataset, and a model which utilizes a deep convnet + a multilayer perceptron, an approach well suited to image classification (similar to the <a href="https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3">VGG-16</a> architecture).</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-9_hu_3e89a9d69d2114d8.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-9_hu_188420deeffa2cca.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-9_hu_2994e1dc8b68f244.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-9.png 1200w" src="dl-cpu-gpu-9.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-10_hu_4c8240dc9addd1a4.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-10_hu_e38edfb433bf8413.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-10_hu_a879b46166fddc6d.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-10.png 1200w" src="dl-cpu-gpu-10.png"/> 
</figure>

<p>Similar results to those of the normal MLP. Nothing fancy.</p>
<p>The Bidirectional long-short-term memory (LSTM) architecture is great for working with text data like IMDb reviews. When I wrote <a href="http://minimaxir.com/2017/06/keras-cntk/">my first benchmark article</a>, I noticed that CNTK performed significantly better than TensorFlow; <a href="https://news.ycombinator.com/item?id=14538086">commenters on Hacker News</a> noted that this was because TensorFlow used an inefficient implementation of the LSTM on the GPU.</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/cntk-old_hu_b86c227c88de2e7d.webp 320w,/2017/11/benchmark-gpus/cntk-old_hu_3901dc880777da18.webp 768w,/2017/11/benchmark-gpus/cntk-old_hu_8d49b907914bb06b.webp 1024w,/2017/11/benchmark-gpus/cntk-old.png 1620w" src="cntk-old.png"/> 
</figure>

<p>However, with Keras&rsquo;s <a href="https://keras.io/layers/recurrent/#cudnnlstm">new CuDNN recurrent layers</a>, which leverage cuDNN directly, this inefficiency may be fixed, so for the K80/P100 TensorFlow GPU configs, I use a <code>CuDNNLSTM</code> layer instead of a normal <code>LSTM</code> layer, and take another look.</p>
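<p>The swap is a one-line change in the model definition; a minimal sketch (hypothetical layer sizes, assuming Keras 2.0.9+ on the TensorFlow backend, and noting that <code>CuDNNLSTM</code> can only run on a GPU):</p>
<pre><code># Swapping the stock LSTM for the cuDNN-backed CuDNNLSTM (GPU-only layer).
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, Dense, LSTM, CuDNNLSTM

use_cudnn = True   # only valid when training on a GPU with cuDNN available
rnn_layer = CuDNNLSTM(64) if use_cudnn else LSTM(64)

model = Sequential([
    Embedding(20000, 128, input_length=200),
    Bidirectional(rnn_layer),
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
</code></pre>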
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-1_hu_f633549e7615557a.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-1_hu_c8eb1a82936955a7.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-1_hu_734746132ba497c3.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-1.png 1200w" src="dl-cpu-gpu-1.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-2_hu_6f0e2078d0fbe4a8.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-2_hu_f5299cfcd4184de5.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-2_hu_9c9b4dbee5321cd.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-2.png 1200w" src="dl-cpu-gpu-2.png"/> 
</figure>

<p><em>WOAH.</em> TensorFlow is now more than <em>three times as fast</em> as CNTK! (And compared against my previous benchmark, TensorFlow on the K80 w/ the CuDNNLSTM is about <em>7x as fast</em> as it once was!) Even the CPU-only versions of TensorFlow are faster than CNTK on the GPU now, which implies significant improvements in the ecosystem outside of the CuDNNLSTM layer itself. (And as a result, CPUs are still more cost-efficient.)</p>
<p>Lastly, LSTM text generation of <a href="https://en.wikipedia.org/wiki/Friedrich_Nietzsche">Nietzsche&rsquo;s</a> <a href="https://s3.amazonaws.com/text-datasets/nietzsche.txt">writings</a> follows similar patterns to the other architectures, but without the drastic hit to the GPU.</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-11_hu_e64be99549e22a4a.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-11_hu_c9e45139e2d4d36b.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-11_hu_73f05d523cc746fa.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-11.png 1200w" src="dl-cpu-gpu-11.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-12_hu_18c099feff0cab3f.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-12_hu_346cce6ac1dd882a.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-12_hu_784cadffdd30380.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-12.png 1200w" src="dl-cpu-gpu-12.png"/> 
</figure>

<h2 id="conclusions">Conclusions</h2>
<p>The biggest surprise of these new benchmarks is that there is no configuration where the P100 is the most cost-effective option, even though the P100 is indeed faster than the K80 in all tests. Then again, per <a href="https://developer.nvidia.com/cudnn">the cuDNN website</a>, there is apparently only about a 2x speed difference between the K80 and the P100 when using cuDNN 6, which is mostly consistent with the results of my benchmarks:</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/cudnn_hu_354d8fa8ab3eff29.webp 320w,/2017/11/benchmark-gpus/cudnn_hu_bb346ea37595e154.webp 768w,/2017/11/benchmark-gpus/cudnn_hu_9b3f6e3ea7ba3a02.webp 1024w,/2017/11/benchmark-gpus/cudnn.png 1688w" src="cudnn.png"/> 
</figure>

<p>I did not include a multi-GPU configuration in the benchmark data visualizations above using Keras&rsquo;s new <code>multi_gpu_model</code> <a href="https://keras.io/utils/#multi_gpu_model">function</a> because in my testing, the multi-GPU training <em>was equal to or worse than a single GPU</em> in all tests.</p>
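<p>For context, <code>multi_gpu_model</code> replicates the model on each GPU and splits every batch between the replicas; a minimal sketch of how it is wired up (hypothetical model and random placeholder data, assuming Keras 2.0.9+ and an instance with two GPUs):</p>
<pre><code># Data-parallel training with Keras's multi_gpu_model (Keras 2.0.9+, TensorFlow backend).
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    Dense(10, activation='softmax'),
])

# Replicates the model on each GPU and splits every batch between the replicas.
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(loss='categorical_crossentropy', optimizer='adam')

x, y = np.random.random((1024, 784)), np.random.random((1024, 10))  # placeholder data
parallel_model.fit(x, y, batch_size=256, epochs=1)
</code></pre>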
<p>Taking these together, it&rsquo;s possible that the overhead introduced by parallel, advanced architectures <em>eliminates the benefits</em> for &ldquo;normal&rdquo; deep learning workloads which do not fully saturate the GPU. Rarely do people talk about diminishing returns in GPU performance with deep learning.</p>
<p>In the future, I want to benchmark deep learning performance against more advanced deep learning use cases such as <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">reinforcement learning</a> and deep CNNs like <a href="https://github.com/tensorflow/models/tree/master/research/inception">Inception</a>. But that doesn&rsquo;t mean these benchmarks are not relevant; as stated during the benchmark setup, the GPUs were tested against typical deep learning use cases, and now we see the tradeoffs that result.</p>
<p>In all, with the price cuts on GPU instances, cost-performance is often <em>on par</em> with preemptible CPU instances, which is an advantage if you want to train models faster and not risk the instance being killed unexpectedly. And there is still a lot of competition in this space: <a href="https://www.amazon.com">Amazon</a> offers a <code>p2.xlarge</code> <a href="https://aws.amazon.com/ec2/spot/">Spot Instance</a> with a K80 GPU for $0.15-$0.20 an hour, less than half the price of the corresponding Google Compute Engine K80 GPU instance, although with <a href="https://aws.amazon.com/ec2/spot/details/">a few bidding caveats</a> which I haven&rsquo;t fully explored yet. Competition will drive GPU prices down even further, and training deep learning models will become even easier.</p>
<p>And as the cuDNN chart above shows, things will get <em>very</em> interesting once Volta-based GPUs like the V100 are generally available and the deep learning frameworks support cuDNN 7 by default.</p>
<p><strong>UPDATE 12/17</strong>: <em>As pointed out by <a href="https://news.ycombinator.com/item?id=15941682">dantiberian on Hacker News</a>, Google Compute Engine now supports <a href="https://cloud.google.com/compute/docs/instances/preemptible#preemptible_with_gpu">preemptible GPUs</a>, which were apparently added after this post went live. Preemptible GPUs are exactly half the price of normal GPUs (for both K80s and P100s; $0.22/hr and $0.73/hr respectively), so they&rsquo;re about double the cost efficiency (when factoring in the cost of the base preemptible instance), which would put them squarely ahead of CPUs in all cases. (And since the CPU instances used here were also preemptible, it&rsquo;s apples-to-apples.)</em></p>
<hr>
<p><em>All scripts for running the benchmark are available in <a href="https://github.com/minimaxir/keras-cntk-benchmark/tree/master/v2">this GitHub repo</a>. You can view the R/ggplot2 code used to process the logs and create the visualizations in <a href="http://minimaxir.com/notebooks/benchmark-gpus/">this R Notebook</a>.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Benchmarking TensorFlow on Cloud CPUs: Cheaper Deep Learning than Cloud GPUs</title>
      <link>https://minimaxir.com/2017/07/cpu-or-gpu/</link>
      <pubDate>Wed, 05 Jul 2017 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/07/cpu-or-gpu/</guid>
      <description>Using CPUs instead of GPUs for deep learning training in the cloud is cheaper because of the massive cost differential afforded by preemptible instances.</description>
      <content:encoded><![CDATA[<p>I&rsquo;ve been working on a few personal deep learning projects with <a href="https://github.com/fchollet/keras">Keras</a> and <a href="https://www.tensorflow.org">TensorFlow</a>. However, training models for deep learning with cloud services such as <a href="https://aws.amazon.com/ec2/">Amazon EC2</a> and <a href="https://cloud.google.com/compute/">Google Compute Engine</a> isn&rsquo;t free, and as someone who is currently unemployed, I have to keep an eye on extraneous spending and be as cost-efficient as possible (please support my work on <a href="https://www.patreon.com/minimaxir">Patreon</a>!). I tried deep learning on the cheaper CPU instances instead of GPU instances to save money, and to my surprise, my model training was only slightly slower. As a result, I took a deeper look at the pricing mechanisms of these two types of instances to see if CPUs are more useful for my needs.</p>
<p>The <a href="https://cloud.google.com/compute/pricing#gpus">pricing of GPU instances</a> on Google Compute Engine starts at <strong>$0.745/hr</strong> (by attaching a $0.700/hr GPU die to a $0.045/hr n1-standard-1 instance). A couple months ago, Google <a href="https://cloudplatform.googleblog.com/2017/05/Compute-Engine-machine-types-with-up-to-64-vCPUs-now-ready-for-your-production-workloads.html">announced</a> CPU instances with up to 64 vCPUs on the modern Intel <a href="https://en.wikipedia.org/wiki/Skylake_%28microarchitecture%29">Skylake</a> CPU architecture. More importantly, they can also be used in <a href="https://cloud.google.com/compute/docs/instances/preemptible">preemptible CPU instances</a>, which live at most for 24 hours on GCE and can be terminated at any time (very rarely), but cost about <em>20%</em> of the price of a standard instance. A preemptible n1-highcpu-64 instance with 64 vCPUs and 57.6GB RAM plus the premium for using Skylake CPUs is <strong>$0.509/hr</strong>, about 2/3rds of the cost of the GPU instance.</p>
<p>If the model training speed of 64 vCPUs is comparable to that of a GPU (or even slightly slower), it would be more cost-effective to use the CPUs instead. But that&rsquo;s assuming the deep learning software and the GCE platform hardware operate at 100% efficiency; if they don&rsquo;t (and they likely don&rsquo;t), there may be <em>even more savings</em> by scaling down the number of vCPUs and cost accordingly (a 32 vCPU instance with same parameters is half the price at <strong>$0.254/hr</strong>, 16 vCPU at <strong>$0.127/hr</strong>, etc).</p>
<p>There aren&rsquo;t any benchmarks for deep learning libraries with tons and tons of CPUs, since there&rsquo;s no demand: GPUs are the <a href="https://en.wikipedia.org/wiki/Occam%27s_razor">Occam&rsquo;s razor</a> solution to deep learning hardware. But it might make counterintuitive yet economical sense to use CPUs instead of GPUs for deep learning training, because of the massive cost differential afforded by preemptible instances, thanks to Google&rsquo;s <a href="https://en.wikipedia.org/wiki/Economies_of_scale">economies of scale</a>.</p>
<h2 id="setup">Setup</h2>
<p>I already have <a href="https://github.com/minimaxir/deep-learning-cpu-gpu-benchmark">benchmarking scripts</a> of real-world deep learning use cases, <a href="https://github.com/minimaxir/keras-cntk-docker">Docker container environments</a>, and results logging from my <a href="http://minimaxir.com/2017/06/keras-cntk/">TensorFlow vs. CNTK article</a>. A few minor tweaks allow the scripts to be utilized for both CPU and GPU instances by setting CLI arguments. I also rebuilt <a href="https://github.com/minimaxir/keras-cntk-docker/blob/master/Dockerfile">the Docker container</a> to support the latest version of TensorFlow (1.2.1), and created a <a href="https://github.com/minimaxir/keras-cntk-docker/blob/master/Dockerfile-cpu">CPU version</a> of the container which installs the CPU-appropriate TensorFlow library instead.</p>
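<p>The results logging amounts to timing each epoch from a Keras callback and writing the timings to disk for later analysis; a hypothetical sketch of such a callback (illustrative, not the repo&rsquo;s exact code):</p>
<pre><code># Hypothetical per-epoch timing logger as a Keras callback (illustrative, not the repo's exact code).
import csv
import time
from keras.callbacks import Callback

class EpochTimer(Callback):
    """Writes one CSV row per epoch: epoch index, elapsed seconds, validation loss."""

    def __init__(self, path):
        super(EpochTimer, self).__init__()
        self.path = path
        self.rows = []

    def on_epoch_begin(self, epoch, logs=None):
        self.start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.rows.append((epoch, time.time() - self.start, logs.get('val_loss')))

    def on_train_end(self, logs=None):
        with open(self.path, 'w') as f:
            csv.writer(f).writerows(self.rows)

# usage: model.fit(x, y, epochs=10, callbacks=[EpochTimer('mlp_times.csv')])
</code></pre>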
<p>There is a notable CPU-specific TensorFlow behavior: if you install from <code>pip</code> (as the <a href="https://www.tensorflow.org/install/">official instructions</a> and tutorials recommend) and begin training a model in TensorFlow, you&rsquo;ll see these warnings in the console:</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/tensorflow-console_hu_e436e066e4e1304d.webp 320w,/2017/07/cpu-or-gpu/tensorflow-console_hu_ce5df372394290b4.webp 768w,/2017/07/cpu-or-gpu/tensorflow-console_hu_9e354816d97d6c8f.webp 1024w,/2017/07/cpu-or-gpu/tensorflow-console.png 1130w" src="tensorflow-console.png"/> 
</figure>

<p>In order to fix the warnings and benefit from these <a href="https://en.wikipedia.org/wiki/SSE4#SSE4.2">SSE4.2</a>/<a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">AVX</a>/<a href="https://en.wikipedia.org/wiki/FMA_instruction_set">FMA</a> optimizations, we <a href="https://stackoverflow.com/questions/41293077/how-to-compile-tensorflow-with-sse4-2-and-avx-instructions">compile TensorFlow from source</a>, and I created a <a href="https://github.com/minimaxir/keras-cntk-docker/blob/master/Dockerfile-cpu-compiled">third Docker container</a> to do just that. When training models in the new container, <a href="https://github.com/tensorflow/tensorflow/issues/10689">most</a> of the warnings no longer show, and (spoiler alert) there is indeed a speed boost in training time.</p>
<p>Therefore, we can test three major cases with Google Compute Engine:</p>
<ul>
<li>A Tesla K80 GPU instance.</li>
<li>A 64 Skylake vCPU instance where TensorFlow is installed via <code>pip</code> (along with testing at 8/16/32 vCPUs).</li>
<li>A 64 Skylake vCPU instance where TensorFlow is compiled (<code>cmp</code>) with CPU instructions (+ 8/16/32 vCPUs).</li>
</ul>
<h2 id="results">Results</h2>
<p>For each model architecture and software/hardware configuration, I calculate the <strong>total training time relative to the GPU instance training</strong> for running the model training for the provided test script. In all cases, the GPU <em>should</em> be the fastest training configuration, and systems with more processors should train faster than those with fewer processors.</p>
<p>Let&rsquo;s start with the <a href="http://yann.lecun.com/exdb/mnist/">MNIST dataset</a> of handwritten digits and the common multilayer perceptron (MLP) architecture, with dense fully-connected layers. Lower training time is better. All configurations below the horizontal dotted line are better than the GPU; all configurations above the dotted line are worse.</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-5_hu_8cf5154f974aed3c.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-5_hu_2ec21aba02d8fb37.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-5_hu_7682d0a58ea1e871.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-5.png 1200w" src="dl-cpu-gpu-5.png"/> 
</figure>

<p>Here, the GPU is the fastest out of all the platform configurations, but there are other curious trends: the performance between 32 vCPUs and 64 vCPUs is similar, and the compiled TensorFlow library is indeed a significant improvement in training speed <em>but only for 8 and 16 vCPUs</em>. Perhaps there are overheads negotiating information between vCPUs that eliminate the performance advantages of more vCPUs, and perhaps these overheads are <em>different</em> with the CPU instructions of the compiled TensorFlow. In the end, it&rsquo;s a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a>, which is why I prefer black box benchmarking all configurations of hardware instead of theorycrafting.</p>
<p>Since the difference between training speeds at different vCPU counts is minimal, there is definitely an advantage to scaling down. For each model architecture and configuration, I calculate a <strong>normalized training cost relative to the cost of GPU instance training</strong>. Because GCE instance costs are prorated (unlike Amazon EC2), we can simply calculate experiment cost by multiplying the total number of seconds the experiment runs by the cost of the instance (per second). Ideally, we want to <em>minimize</em> cost.</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-6_hu_c6ff3c375435199.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-6_hu_6bee6729ce48517c.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-6_hu_ea518ff15e46de10.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-6.png 1200w" src="dl-cpu-gpu-6.png"/> 
</figure>

<p>Lower vCPU counts are <em>much</em> more cost-effective for this problem; going as low as possible is best.</p>
<p>Now, let&rsquo;s look at the same dataset with a convolutional neural network (CNN) approach for digit classification:</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-7_hu_d3205561da4ed49c.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-7_hu_ae81ceba7d6092e6.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-7_hu_7a29bcea36dbe20e.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-7.png 1200w" src="dl-cpu-gpu-7.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-8_hu_64f1eac6ff5b2b3f.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-8_hu_c6dd20c1ccc111a5.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-8_hu_2fa65c3c187723bb.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-8.png 1200w" src="dl-cpu-gpu-8.png"/> 
</figure>

<p>GPUs are unsurprisingly more than twice as fast as any CPU approach at CNNs, but the cost structures are still the same as before, except that 64 vCPUs are now <em>worse</em> than the GPU cost-wise (32 vCPUs even train faster than 64 vCPUs here).</p>
<p>Let&rsquo;s go deeper with CNNs and look at the <a href="https://www.cs.toronto.edu/%7Ekriz/cifar.html">CIFAR-10</a> image classification dataset, and a model which utilizes a deep convnet + a multilayer perceptron, an approach well suited to image classification (similar to the <a href="https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3">VGG-16</a> architecture).</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-9_hu_4a5cd8ba80674837.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-9_hu_a81280d52893c1c9.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-9_hu_af30edd0d3117cd8.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-9.png 1200w" src="dl-cpu-gpu-9.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-10_hu_a6061eb15b5b8609.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-10_hu_fe0751d9cd60a655.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-10_hu_a371016369278a9a.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-10.png 1200w" src="dl-cpu-gpu-10.png"/> 
</figure>

<p>Similar behaviors as in the simple CNN case, although in this instance all CPUs perform better with the compiled TensorFlow library.</p>
<p>The fasttext algorithm, used here on the <a href="http://ai.stanford.edu/%7Eamaas/data/sentiment/">IMDb reviews dataset</a> to determine whether a review is positive or negative, classifies text extremely quickly relative to other methods.</p>
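<p>In Keras terms, fasttext boils down to averaging word embeddings over the sequence and feeding the result to a classifier, in the spirit of the Keras <code>imdb_fasttext</code> example; a minimal sketch with illustrative sizes:</p>
<pre><code># Minimal fasttext-style IMDb sentiment classifier in Keras (illustrative sizes).
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, GlobalAveragePooling1D, Dense

max_features, maxlen = 20000, 400
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)

model = Sequential([
    Embedding(max_features, 50, input_length=maxlen),
    GlobalAveragePooling1D(),   # average the word embeddings over the sequence
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=5, validation_data=(x_test, y_test))
</code></pre>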
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-3_hu_12d55d02148bf0ea.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-3_hu_aaf9917a1629214f.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-3_hu_d51ed2e2c6fdec60.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-3.png 1200w" src="dl-cpu-gpu-3.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-4_hu_6b591a471f3027a4.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-4_hu_7cc361b383b25fb0.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-4_hu_4c516e76a92eff3c.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-4.png 1200w" src="dl-cpu-gpu-4.png"/> 
</figure>

<p>In this case, GPUs are much, much faster than CPUs. The benefit of lower numbers of CPUs isn&rsquo;t as dramatic here; as an aside, the <a href="https://github.com/facebookresearch/fastText">official fasttext implementation</a> is <em>designed</em> for large numbers of CPUs and handles parallelization much better.</p>
<p>The Bidirectional long-short-term memory (LSTM) architecture is great for working with text data like IMDb reviews, but after my previous benchmark article, <a href="https://news.ycombinator.com/item?id=14538086">commenters on Hacker News</a> noted that TensorFlow uses an inefficient implementation of the LSTM on the GPU, so perhaps the difference will be more notable.</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-1_hu_4369b4e9e8856507.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-1_hu_3e65077eb16928e4.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-1_hu_d736592c927bd764.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-1.png 1200w" src="dl-cpu-gpu-1.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-2_hu_d8c58f429f4a781b.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-2_hu_1306d728b4fce90.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-2_hu_ad3d19e88738d072.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-2.png 1200w" src="dl-cpu-gpu-2.png"/> 
</figure>

<p>Wait, what? GPU training of Bidirectional LSTMs is <em>twice as slow</em> as any CPU configuration? Wow. (In fairness, the benchmark uses the Keras LSTM default of <code>implementation=0</code>, which is better on CPUs, while <code>implementation=2</code> is better on GPUs, but it shouldn&rsquo;t result in that much of a differential.)</p>
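<p>For the curious, that setting is just a constructor argument on the recurrent layer; a sketch (hypothetical sizes, and note that later Keras releases dropped <code>implementation=0</code> in favor of modes 1 and 2):</p>
<pre><code># The Keras LSTM `implementation` argument selects among mathematically equivalent
# ways of structuring the recurrent computation; implementation=2 combines the gate
# computations into a single matrix product, which parallelizes better on a GPU.
from keras.layers import Bidirectional, LSTM

cpu_friendly = Bidirectional(LSTM(64, implementation=0))  # the Keras 2.0.x default used in this benchmark
gpu_friendly = Bidirectional(LSTM(64, implementation=2))  # better suited to GPU training
</code></pre>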
<p>Lastly, LSTM text generation of <a href="https://en.wikipedia.org/wiki/Friedrich_Nietzsche">Nietzsche&rsquo;s</a> <a href="https://s3.amazonaws.com/text-datasets/nietzsche.txt">writings</a> follows similar patterns to the other architectures, but without the drastic hit to the GPU.</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-11_hu_d84b78ad35a1f056.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-11_hu_d58d19568c89869.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-11_hu_c078d8bd94df56aa.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-11.png 1200w" src="dl-cpu-gpu-11.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-12_hu_44c1d2cc10581f1a.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-12_hu_27c08aabe3a3cacd.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-12_hu_d41db5a45ef62daf.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-12.png 1200w" src="dl-cpu-gpu-12.png"/> 
</figure>

<h2 id="conclusion">Conclusion</h2>
<p>As it turns out, using 64 vCPUs is <em>bad</em> for deep learning, as current software/hardware architectures can&rsquo;t fully utilize all of them, and it often results in exactly the same performance (or <em>worse</em>) as with 32 vCPUs. In terms of balancing both training speed and cost, training models with <strong>16 vCPUs + compiled TensorFlow</strong> seems like the winner. The 30%-40% speed boost from the compiled TensorFlow library was a pleasant surprise, and I&rsquo;m shocked Google doesn&rsquo;t offer a precompiled version of TensorFlow with these CPU speedups, since the gains are nontrivial.</p>
<p>It&rsquo;s worth noting that the cost advantages shown here are <em>only</em> possible with preemptible instances; regular high-CPU instances on Google Compute Engine are about 5x as expensive and, as a result, eliminate the cost benefits completely. Hooray for economies of scale!</p>
<p>A major implicit assumption with the cloud CPU training approach is that you don&rsquo;t need a trained model ASAP. In professional use cases, time may be too valuable to waste, but in personal use cases where someone can just leave a model training overnight, it&rsquo;s a very, very good and cost-effective option, and one that I&rsquo;ll now utilize.</p>
<hr>
<p><em>All scripts for running the benchmark are available in <a href="https://github.com/minimaxir/deep-learning-cpu-gpu-benchmark">this GitHub repo</a>. You can view the R/ggplot2 code used to process the logs and create the visualizations in <a href="http://minimaxir.com/notebooks/deep-learning-cpu-gpu/">this R Notebook</a>.</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
