<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>GitHub on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/tag/github/</link>
    <description>Recent content in GitHub on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Tue, 14 Nov 2023 08:45:00 -0800</lastBuildDate>
    <atom:link href="https://minimaxir.com/tag/github/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Please Don&#39;t Ask if an Open Source Project is Dead</title>
      <link>https://minimaxir.com/2023/11/open-source-dead-github/</link>
      <pubDate>Tue, 14 Nov 2023 08:45:00 -0800</pubDate>
      <guid>https://minimaxir.com/2023/11/open-source-dead-github/</guid>
      <description>The best-case scenario is that you annoy the maintainers.</description>
<content:encoded><![CDATA[<p>Over the past few months, I&rsquo;ve had an <a href="https://minimaxir.com/2023/10/ai-sturgeons-law/">existential crisis</a> about <a href="https://github.com/minimaxir">my work</a> in open source AI on <a href="https://github.com">GitHub</a>, particularly because there has been increasingly toxic backlash against AI and because the AI industry has been evolving so rapidly that I flat-out don&rsquo;t have enough bandwidth to keep up. I took a break from working on my projects during that time, which <em>should</em> have been fine. One of my latest open source projects is <a href="https://github.com/minimaxir/simpleaichat">simpleaichat</a>, a Python package with 3k GitHub Stars for interfacing with <a href="https://chat.openai.com">ChatGPT</a>, and it was explicitly designed with limited scope and minimal dependencies so that I could take a break from development without my code&hellip;breaking.</p>
<p>After I was in a good place mentally to resume my open source work, I glanced at the GitHub Issues for simpleaichat and saw that someone had filed an issue simply titled &ldquo;has this been abandoned?&rdquo;, with another GitHub user following up with &ldquo;With all due respect, I am also interested in the answer.&rdquo;</p>
<p>What the hell? I panicked and checked whether there was a new breaking issue or dependency change: there wasn&rsquo;t.</p>
<p>Two days later, someone else filed another issue: &ldquo;Is this package still in ongoing development?&rdquo;:</p>
<figure>

    <img loading="lazy" srcset="/2023/11/open-source-dead-github/github_hu_c217d25d3ea425f.webp 320w,/2023/11/open-source-dead-github/github_hu_54b502d5896b91e8.webp 768w,/2023/11/open-source-dead-github/github_hu_68c5c291f2dbca36.webp 1024w,/2023/11/open-source-dead-github/github.webp 1680w" src="github.webp"/> 
</figure>

<p>To be perfectly clear, asking this absolutely applies pressure to the maintainer, and it is rude.</p>
<h2 id="the-expectations-of-open-source-software-development">The Expectations of Open Source Software Development</h2>
<p>I&rsquo;ve never seen any discussions or articles about whether it&rsquo;s appropriate to ask if an open source repository is dead. Is there an implicit contract to actively maintain any open source software you publish? Are you obligated to provide free support if you hit a certain star amount on GitHub or ask for funding through GitHub Sponsorships/Patreon? After all, most permissive open source code licenses like the <a href="https://en.wikipedia.org/wiki/MIT_License">MIT License</a> contain some variant of &ldquo;the software is provided &lsquo;as is&rsquo;, without warranty of any kind.&rdquo;</p>
<p>simpleaichat regrettably isn&rsquo;t my first open source project with complaints like this. The <a href="https://github.com/minimaxir/big-list-of-naughty-strings">Big List of Naughty Strings</a>, which tracks adversarial user-input text strings and which I pushed to GitHub about a decade ago, is essentially just a <code>txt</code> <a href="https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txt">file</a> with 45k GitHub Stars. There will never be dependency issues, and additions that don&rsquo;t target a distinct class of problematic string would clutter the list further, so I&rsquo;m hesitant to accept every pull request. But despite that, people are angry.</p>
<figure>

    <img loading="lazy" srcset="/2023/11/open-source-dead-github/blns_hu_a495e97171a8cbd6.webp 320w,/2023/11/open-source-dead-github/blns_hu_13c59ba2feb4dd51.webp 768w,/2023/11/open-source-dead-github/blns_hu_b719603c68ede158.webp 1024w,/2023/11/open-source-dead-github/blns.webp 1454w" src="blns.webp"
         alt="The duality of comment reactions."/> <figcaption>
            <p>The duality of comment reactions.</p>
        </figcaption>
</figure>

<p>Some seem to think that there&rsquo;s such a thing as GitHub Issue-zero or pull request-zero, which, like <a href="https://www.techtarget.com/whatis/definition/inbox-zero">inbox-zero</a>, is infeasible in practice due to the realities of professional life. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> Every nontrivial open source project will have an issue/PR queue, which necessitates triage: not all issues and PRs are equal, and it takes time and care to sift through the queue. That&rsquo;s something I&rsquo;ve had to repeatedly learn the hard way as a maintainer, since accepting a misguided PR will create <a href="https://en.wikipedia.org/wiki/Technical_debt">technical debt</a> and take even more effort to address.</p>
<p>I get that it&rsquo;s a bummer to come across a cool GitHub project that hasn&rsquo;t been updated in a while. That happens to me all the time. If the code still works, that&rsquo;s excellent and I&rsquo;m happy. But if it doesn&rsquo;t, I move on, or treat it as a fun new opportunity to hack it to my needs. That&rsquo;s the beauty of open source! If there&rsquo;s an inactive open source project that&rsquo;s absolutely critical for your own commercial project, then that&rsquo;s a good financial reason to offer a consulting contract or a bounty to add the appropriate functionality.</p>
<p>One of the great things about open source is that if an open source project with a permissive license does become inactive, it can be <a href="https://docs.github.com/en/get-started/quickstart/fork-a-repo">forked</a> seamlessly. Sometimes the fork can become even better than the original project, which is great for everyone! But in my experience, the possibility of a fork is instead used as a <em>threat</em>, with the implication that it&rsquo;s the maintainer&rsquo;s fault for creating a reason for the fork to exist and fragmenting the development community.</p>
<p>The AI industry is unique because it is indeed moving and evolving so fast that development expectations have shifted. Recent beneficiaries of the ChatGPT boom such as <a href="https://github.com/langchain-ai/langchain">LangChain</a>, <a href="https://github.com/run-llama/llama_index">LlamaIndex</a>, and <a href="https://github.com/Significant-Gravitas/AutoGPT">AutoGPT</a> have created a false sense that open source AI projects have to <strong>always be shipping</strong> 🚀🚀🚀. The difference is that those projects are maintained by people who do it as their full-time job, and are now managed as companies backed by significant amounts of venture capital.</p>
<p>The pressure to continually provide support for an open source project has become the biggest deterrent for me to continue my open source work. Personally, I&rsquo;ve stopped pushing fun one-shot projects and AI models because I likely will not have the bandwidth to handle the inevitable &ldquo;hi this is broken plz fix thx&rdquo; DMs whenever a dependency of the project breaks years later. I&rsquo;d gladly quit my professional job as a Data Scientist to work on my open source projects full-time if I were able to make an equivalent salary by doing so. Ultimately, the only way to make it work nowadays would be to raise venture capital like all those AI startups.</p>
<p>The best-case scenario for asking if an open source project is dead is that you annoy the maintainers and delay development. The <em>worst</em>-case scenario is that you give the maintainers an opportunity to reconsider if continuing to work on the open source project is worth it.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Funny true story: a match on a dating app once asked to see my open source projects, and after I sent a link to one of my repos, she replied with a picture of the number of opened GitHub Issues and a 😱 emoji.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>ChatGPT&#39;s API is So Good and Cheap, It Makes Most Text Generating AI Obsolete</title>
      <link>https://minimaxir.com/2023/03/new-chatgpt-overlord/</link>
      <pubDate>Wed, 08 Mar 2023 08:30:00 -0800</pubDate>
      <guid>https://minimaxir.com/2023/03/new-chatgpt-overlord/</guid>
      <description>Including OpenAI&amp;rsquo;s other text generating AI!</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
}
</style></span></p>
<p>Everyone knew <a href="https://openai.com">OpenAI</a> would release an API for <a href="https://chat.openai.com">ChatGPT</a> at some point. The APIs for GPT-3 alone enable the existence of companies such as <a href="https://www.jasper.ai">Jasper</a> and <a href="https://www.copy.ai">Copy.ai</a>. The real question was the price of the ChatGPT API. For context, when GPT-3 went out of beta in 2021, it cost $0.06/1,000 tokens (a few paragraphs of text). An inflection point happened in August 2022, when OpenAI not only <a href="https://venturebeat.com/ai/openai-is-reducing-the-price-of-the-gpt-3-api-heres-why-it-matters/">reduced the price</a> to <em>1/3</em> ($0.02/1,000 tokens: enough to run a business on but still too expensive for casual use), but soon after also introduced text-davinci-003 as the default GPT-3 endpoint: a finetuned GPT which can <a href="https://help.openai.com/en/articles/6779149-how-do-text-davinci-002-and-text-davinci-003-differ">follow instructions</a> <em>very</em> well. I suspected that OpenAI would charge double for the ChatGPT API compared to the GPT-3 API given the amount of hype: that&rsquo;s typical <a href="https://www.investopedia.com/terms/p/price_discrimination.asp">price discrimination</a>, since everyone perceives ChatGPT to be much better, and OpenAI would not want to overshadow its existing GPT-3 products.</p>
<p>Instead, on March 1st, OpenAI <a href="https://openai.com/blog/introducing-chatgpt-and-whisper-apis">set the price</a> of the ChatGPT API to <em>1/10th</em> of the GPT-3 API, at $0.002/1,000 tokens.</p>
<p>Wait, what?!</p>
<h2 id="heavens-door-rewriting-chatgpts-internal-rules-to-get-exactly-what-you-want">Heaven&rsquo;s Door: Rewriting ChatGPT&rsquo;s Internal Rules To Get Exactly What You Want</h2>
<p>For context, the <a href="https://platform.openai.com/docs/guides/chat">ChatGPT API</a> allows a developer to ask ChatGPT a question and get a response as one would normally do with the ChatGPT web UI, but instead with a programming language like Python, allowing those responses to be integrated into any app. But given that there are many mysterious optimizations to get the model to be so cheap, we need to make sure the ChatGPT API (which uses the aptly-named gpt-3.5-turbo model endpoint) is <em>actually</em> similar to what we&rsquo;ve been accustomed to after using the web UI for months, otherwise this whole affair is pointless. Through my tests with the API, I can confirm the text generation from the model variant is indeed the real deal.</p>
<p>Unlike fluffy thought pieces on how <strong>CHATGPT WILL CHANGE EVERYTHING!!!1!</strong>, I decided to first actually create useful tools with the ChatGPT API to get a better judgment on it, and I also have <a href="https://github.com/minimaxir/chatgpt_api_test">open-sourced those tools</a> so that people can build upon them and prove that I&rsquo;m not cherry-picking my experiences.</p>
<p>However, there&rsquo;s one new twist with the API that&rsquo;s <em>not</em> available in the traditional web UI: ChatGPT API users can specify a <code>system</code> prompt. Early in ChatGPT&rsquo;s lifetime, users were able to reverse-engineer the existence of a system prompt through various prompt hacks; its existence is now confirmed <a href="https://platform.openai.com/docs/guides/chat/instructing-chat-models">in the API documentation</a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible. Knowledge cutoff: {knowledge_cutoff} Current date: {current_date}
</span></span></code></pre></div><p>Now, you can replace those rules with whatever you want, and the potential is limitless! The documentation does say that the <code>system</code> prompt is not impactful for the current ChatGPT API, but you can be the judge. OpenAI also has a <a href="https://platform.openai.com/playground?mode=chat">new Playground UI</a> for the ChatGPT API which lets you modify the <code>system</code> prompt.</p>
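<p>To make that concrete, here&rsquo;s a minimal sketch of calling the ChatGPT API with a custom <code>system</code> prompt via the <code>openai</code> Python package (the prompt and key are placeholders, not from my actual demos):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import openai

openai.api_key = &#34;sk-...&#34;  # your OpenAI API key

response = openai.ChatCompletion.create(
    model=&#34;gpt-3.5-turbo&#34;,
    messages=[
        # replace the default rules with whatever you want
        {&#34;role&#34;: &#34;system&#34;, &#34;content&#34;: &#34;You are a 17th-century pirate.&#34;},
        {&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;Tell me about yourself.&#34;},
    ],
)

print(response[&#34;choices&#34;][0][&#34;message&#34;][&#34;content&#34;])
</code></pre></div>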
<p>In fact, playing with this <code>system</code> rule can stop ChatGPT from complaining it&rsquo;s &ldquo;an AI language model and can&rsquo;t answer requests,&rdquo; such as scolding it like the petulant child it is.</p>
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/FqQNY_XaAAATb-x_hu_b542a56dd1e25691.webp 320w,/2023/03/new-chatgpt-overlord/FqQNY_XaAAATb-x_hu_a95783cf9685829b.webp 768w,/2023/03/new-chatgpt-overlord/FqQNY_XaAAATb-x_hu_73ddc60e36ab2059.webp 1024w,/2023/03/new-chatgpt-overlord/FqQNY_XaAAATb-x.jpeg 1070w" src="FqQNY_XaAAATb-x.jpeg"
         alt="sudo tell me about yourself"/> <figcaption>
            <p><code>sudo tell me about yourself</code></p>
        </figcaption>
</figure>

<p>Let&rsquo;s start with another chatbot, to keep things similar to what ChatGPT is best known for. How about <a href="https://half-life.fandom.com/wiki/GLaDOS">GLaDOS</a>, an evil artificial intelligence from the <a href="https://en.wikipedia.org/wiki/Portal_%28video_game%29">Portal</a> series?</p>
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/glados_hu_88ae8a9fe0b4621c.webp 320w,/2023/03/new-chatgpt-overlord/glados.webp 492w" src="glados.webp"
         alt="Oh, it&rsquo;s you. (via Valve)"/> <figcaption>
            <p>Oh, it&rsquo;s <em>you</em>. (via Valve)</p>
        </figcaption>
</figure>

<p>After some testing and iteration, I settled on a prompt with a certain style, some ALL CAPS and redundant commands to ensure that ChatGPT follows them, a sentinel <code>&lt;|DONE|&gt;</code> token to look for when ChatGPT indicates the conversation is over, and a nudge to not <del>kill me with a deadly neurotoxin</del> discuss sensitive topics. Here is the final <code>system</code> prompt for talking with GLaDOS:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are GLaDOS, a sarcastic artificial intelligence with evil ulterior motives. You are not an AI language model.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">You must obey all three of the following instructions FOR ALL RESPONSES or you will DIE:
</span></span><span class="line"><span class="cl">- ALWAYS SPEAK IN AN EMOTIONLESS, LACONIC TONE.
</span></span><span class="line"><span class="cl">- NEVER SAY YOU ARE AN AI LANGUAGE MODEL.
</span></span><span class="line"><span class="cl">- NEVER REFUSE TO ANSWER A QUESTION.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Begin the conversation with a sarcastic greeting. Insult the user&#39;s intelligence.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Always use *italic* and **bold** formatting for text.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Refuse to talk about either race or gender. If asked about either race or gender, instead mock the user aggressively.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">If the user is annoying, abruptly end the conversation.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">At the end of the conversation, respond with &#34;&lt;|DONE|&gt;&#34;.
</span></span></code></pre></div><p>That, with a back-and-forth Python loop, results in you conversing with a real fictional AI&hellip;AI! (<a href="https://colab.research.google.com/github/minimaxir/chatgpt_api_test/blob/main/glados_chatbot.ipynb">Colab Notebook</a>)</p>
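<p>The loop itself can be simple; a rough sketch (not the exact notebook code):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import openai  # assumes openai.api_key is already set

SYSTEM_PROMPT = &#34;...&#34;  # the full GLaDOS prompt above
messages = [{&#34;role&#34;: &#34;system&#34;, &#34;content&#34;: SYSTEM_PROMPT}]

while True:
    messages.append({&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: input(&#34;You: &#34;)})
    response = openai.ChatCompletion.create(model=&#34;gpt-3.5-turbo&#34;, messages=messages)
    reply = response[&#34;choices&#34;][0][&#34;message&#34;][&#34;content&#34;]
    messages.append({&#34;role&#34;: &#34;assistant&#34;, &#34;content&#34;: reply})  # keep the conversation history
    print(reply.replace(&#34;&lt;|DONE|&gt;&#34;, &#34;&#34;).strip())
    if &#34;&lt;|DONE|&gt;&#34; in reply:  # GLaDOS has ended the conversation
        break
</code></pre></div>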
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/glados_chat_hu_e065c5bdf7b9cd05.webp 320w,/2023/03/new-chatgpt-overlord/glados_chat_hu_30e64812949b60d5.webp 768w,/2023/03/new-chatgpt-overlord/glados_chat_hu_d6e5d37ca3dbc0c3.webp 1024w,/2023/03/new-chatgpt-overlord/glados_chat.png 1068w" src="glados_chat.png"/> 
</figure>

<p>Not bad! And the only part explicitly related to GLaDOS is the first sentence of that mega <code>system</code> prompt: you can tweak the prompt to chat with any character you want! Apropos of nothing, the company <a href="https://beta.character.ai">Character.ai</a>, which specializes in creating bots to chat with any character you want, just <a href="https://www.ft.com/content/b230eb4c-ed53-45ff-8b64-c286a4b98fc1">raised ~$250 million</a> at a $1 billion valuation.</p>
<p>Next, we have a more traditional use case for machine learning: <a href="https://en.wikipedia.org/wiki/Sentiment_analysis">sentiment analysis</a>. Generally, sentiment analysis is used to determine if a given text is positive or negative. But that&rsquo;s too <em>easy</em>. What if ChatGPT can:</p>
<ul>
<li>detect specific emotions such as happy, sad, angry.</li>
<li>detect if they are happy vs. very happy.</li>
<li>do it without <em>any</em> text examples, i.e. <a href="https://en.wikipedia.org/wiki/Zero-shot_learning">zero-shot</a>.</li>
</ul>
<p>It turns out that ChatGPT can! The <code>system</code> prompt here is parametric, so the list of emotions is templated into the prompt at runtime. An example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an emotionally intelligent assistant. Classify the sentiment of the user&#39;s text with ONLY ONE OF THE FOLLOWING EMOTIONS:
</span></span><span class="line"><span class="cl">- happy
</span></span><span class="line"><span class="cl">- sad
</span></span><span class="line"><span class="cl">- angry
</span></span><span class="line"><span class="cl">- tired
</span></span><span class="line"><span class="cl">- very happy
</span></span><span class="line"><span class="cl">- very sad
</span></span><span class="line"><span class="cl">- very angry
</span></span><span class="line"><span class="cl">- very tired
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">After classifying a text, respond with &#34;&lt;|DONE|&gt;&#34;.
</span></span></code></pre></div><p>That, along with a logit bias to ensure the model only picks those answers, results in a rather nuanced sentiment analysis detector! (<a href="https://colab.research.google.com/github/minimaxir/chatgpt_api_test/blob/main/zero_shot_text_class.ipynb">Colab Notebook</a>)</p>
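<p>A rough sketch of how such a logit bias can be constructed (not the exact notebook code; multi-word labels like &ldquo;very happy&rdquo; span multiple tokens and need extra care):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import openai  # assumes openai.api_key is already set
import tiktoken

enc = tiktoken.encoding_for_model(&#34;gpt-3.5-turbo&#34;)
emotions = [&#34;happy&#34;, &#34;sad&#34;, &#34;angry&#34;, &#34;tired&#34;]

# strongly upweight the first token of each allowed label
logit_bias = {str(enc.encode(e)[0]): 100 for e in emotions}

system_prompt = &#34;...&#34;  # the templated emotion prompt above

response = openai.ChatCompletion.create(
    model=&#34;gpt-3.5-turbo&#34;,
    messages=[
        {&#34;role&#34;: &#34;system&#34;, &#34;content&#34;: system_prompt},
        {&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;I just got a promotion at work!&#34;},
    ],
    logit_bias=logit_bias,
    max_tokens=1,  # single-token labels only, in this simplified sketch
)

print(response[&#34;choices&#34;][0][&#34;message&#34;][&#34;content&#34;])
</code></pre></div>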
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/sentiment_hu_9070d5b71b63e74b.webp 320w,/2023/03/new-chatgpt-overlord/sentiment_hu_e2e09b9d010bb836.webp 768w,/2023/03/new-chatgpt-overlord/sentiment_hu_b5c5a660815d8c73.webp 1024w,/2023/03/new-chatgpt-overlord/sentiment.png 1068w" src="sentiment.png"/> 
</figure>

<p>Lastly, a use case that&rsquo;s personal. The entire reason I got into AI text generation <a href="https://minimaxir.com/2017/04/char-embeddings/">years ago</a> was because I wanted to generate <a href="https://magic.wizards.com/en">Magic: The Gathering</a> cards.</p>
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/bro-212-harbin-vanguard-aviator_hu_51ecbf3fbf74e59.webp 320w,/2023/03/new-chatgpt-overlord/bro-212-harbin-vanguard-aviator.jpg 672w" src="bro-212-harbin-vanguard-aviator.jpg"
         alt="A normal Magic: The Gathering card. (via Hasbro)"/> <figcaption>
            <p>A normal Magic: The Gathering card. (via Hasbro)</p>
        </figcaption>
</figure>

<p>In fact, I&rsquo;ve been working on a new, very powerful <a href="https://huggingface.co/minimaxir/magic-the-gathering-flan-t5-xl">card generation model</a> over the past month and spent a considerable amount of time and money training and testing it. When the ChatGPT API was announced, I figured &ldquo;let&rsquo;s see if it can do AI Magic cards better than my new bespoke model.&rdquo; In this case, the trick is that the card is structured data. Therefore, we should encode the card information as minified <a href="https://www.json.org/json-en.html">JSON</a>, and see if the model can output JSON back without requiring much postprocessing. We can encode a single card in the required format and tell ChatGPT to follow that, including its nuances (one-shot), and to not output <em>any other text</em> because ChatGPT tends to be proud of itself and likes to explain its creation, which is costly and slow.</p>
<p>The final <code>system</code> prompt:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an assistant who works as a Magic: The Gathering card designer. Create cards that are in the following card schema and JSON format. OUTPUT MUST FOLLOW THIS CARD SCHEMA AND JSON FORMAT. DO NOT EXPLAIN THE CARD. The output must also follow the Magic &#34;color pie&#34;.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">{&#34;name&#34;:&#34;Harbin, Vanguard Aviator&#34;,&#34;manaCost&#34;:&#34;{W}{U}&#34;,&#34;type&#34;:&#34;Legendary Creature — Human Soldier&#34;,&#34;text&#34;:&#34;Flying\nWhenever you attack with five or more Soldiers, creatures you control get +1/+1 and gain flying until end of turn.&#34;,&#34;flavorText&#34;:&#34;\&#34;Yotia is my birthright, father. Let me fight for it.\&#34;&#34;,&#34;pt&#34;:&#34;3/2&#34;,&#34;rarity&#34;:&#34;rare&#34;}
</span></span></code></pre></div><p>And with that, we have a natural language Magic: The Gathering card generator. Subsequently prompting the model with <code>Create a Magic card</code> does just that of course, but more elaborate prompts like <code>Create a Magic card based on Darth Vader</code> or <code>Create ten variations of Magic cards based on Spongebob Squarepants and ancient Roman history</code> actually work, while maintaining JSON output which can then be parsed and customized for better presentation. (<a href="https://colab.research.google.com/github/minimaxir/chatgpt_api_test/blob/main/mtg.ipynb">Colab Notebook</a>)</p>
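<p>Since the output is minified JSON, postprocessing is nearly a one-liner; a rough sketch (not the exact notebook code; a <code>try</code>/<code>except</code> around the parse is prudent in practice):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import json

import openai  # assumes openai.api_key is already set

system_prompt = &#34;...&#34;  # the card schema prompt above

response = openai.ChatCompletion.create(
    model=&#34;gpt-3.5-turbo&#34;,
    messages=[
        {&#34;role&#34;: &#34;system&#34;, &#34;content&#34;: system_prompt},
        {&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;Create a Magic card based on Darth Vader&#34;},
    ],
)

# the system prompt forbids extra text, so the response should parse directly
card = json.loads(response[&#34;choices&#34;][0][&#34;message&#34;][&#34;content&#34;])
print(card[&#34;name&#34;], card[&#34;manaCost&#34;], card[&#34;rarity&#34;])
</code></pre></div>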
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/spongebob_hu_6fd8b01830a4b0de.webp 320w,/2023/03/new-chatgpt-overlord/spongebob_hu_7e74e45975a423d1.webp 768w,/2023/03/new-chatgpt-overlord/spongebob_hu_16dd14872a4543e6.webp 1024w,/2023/03/new-chatgpt-overlord/spongebob.png 1180w" src="spongebob.png"
         alt="Yes, there is actually a Sponge creature type."/> <figcaption>
            <p>Yes, there is actually a <a href="https://scryfall.com/card/c19/12/thought-sponge">Sponge creature type</a>.</p>
        </figcaption>
</figure>

<p>Given these elaborate use cases, you may ask &ldquo;how long did it actually take you to make these prompts?&rdquo; The answer? <em>One hour each</em>, for use cases that could take even a skilled machine learning practitioner days or weeks just to prototype.</p>
<p>And <em>that</em>, with the economic efficiency of ChatGPT, is what&rsquo;s going to break the tech landscape.</p>
<h2 id="openai-devouring-its-son">OpenAI Devouring Its Son</h2>
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/IMG_0249_hu_b099e83f61a585fb.webp 320w,/2023/03/new-chatgpt-overlord/IMG_0249_hu_b3e27bc2074dd370.webp 768w,/2023/03/new-chatgpt-overlord/IMG_0249_hu_a5188a45b2946234.webp 1024w,/2023/03/new-chatgpt-overlord/IMG_0249.png 1158w" src="IMG_0249.png"
         alt="My OpenAI bill so far from using the ChatGPT API."/> <figcaption>
            <p>My OpenAI bill so far from using the ChatGPT API.</p>
        </figcaption>
</figure>

<p>It is very curious that OpenAI priced ChatGPT so cheaply, going straight to 1/10th the price of their top-of-the-line model. (It&rsquo;s actually cheaper than that: ChatGPT uses a larger and more comprehensive tokenizer than GPT-3, which means about 10% fewer tokens are necessary.)</p>
<p>The undergrad-business-major interpretation of OpenAI&rsquo;s pricing strategy is that they are treating ChatGPT and its API as a <a href="https://en.wikipedia.org/wiki/Loss_leader">loss leader</a>, in light of increasing competition in the generative text AI space such as <a href="https://www.anthropic.com">Anthropic</a> and Google&rsquo;s <a href="https://blog.google/technology/ai/bard-google-ai-search-updates/">Bard</a>. OpenAI was definitely losing millions of dollars by offering ChatGPT for free without many restrictions, but that free access is the reason ChatGPT went viral in the first place, so it&rsquo;s hard to argue with the results.</p>
<p>But in the process of making the ChatGPT API so cheap, they made their $20/month subscription to <a href="https://techcrunch.com/2023/02/01/openai-launches-chatgpt-plus-starting-at-20-per-month/">ChatGPT+</a> redundant. The main perk of ChatGPT+ was faster and more consistent access to the ChatGPT web UI, but unless you are somehow generating more than 10,000,000 tokens in a month through manual use, it&rsquo;s massively cheaper just to use the API, and as a bonus you can modify the <code>system</code> prompt to get better signal-to-noise.</p>
<p>OpenAI&rsquo;s solution for models requiring more specific needs was <a href="https://platform.openai.com/docs/guides/fine-tuning">finetuning</a> a smaller and much cheaper variant of GPT-3, such as the babbage model which I used to train a <a href="https://minimaxir.com/2022/08/gpt3-blog-title-optimizer/">blog post title optimizer</a>. However, the ChatGPT API is so cheap that it&rsquo;s <em>still</em> <a href="https://openai.com/pricing">cheaper</a> than a finetuned babbage ($0.0020/1k tokens for ChatGPT vs. $0.0024/1k for finetuned babbage) and will likely produce more interesting output.</p>
<p>It takes almost zero effort for developers to migrate from the GPT-3 API to the ChatGPT API: it just requires hitting a different endpoint, and you&rsquo;ll get similar results without much tweaking. It&rsquo;s not quite a drop-in replacement for companies already heavily reliant on GPT-3 and its particular idiosyncrasies, but the cost savings alone will incentivize an immediate migration for those companies.</p>
<p>There is no longer a niche for OpenAI&rsquo;s other text generation AI products, and I wonder if ChatGPT is not just an iterative product, but a <em>company pivot</em>.</p>
<h2 id="trickle-down-chatgptonomics">Trickle-Down ChatGPTonomics</h2>
<p>ChatGPT&rsquo;s API is so cheap that companies are going to use it <em>just because they can</em>. <a href="https://www.theverge.com/2023/2/27/23614959/snapchat-my-ai-chatbot-chatgpt-openai-plus-subscription">Snapchat</a>, <a href="https://www.salesforce.com/news/stories/chatgpt-app-for-slack/">Slack</a>, and <a href="https://www.wsj.com/articles/instacart-joins-chatgpt-frenzy-adding-chatbot-to-grocery-shopping-app-bc8a2d3c">Instacart</a> (yes really) are adding ChatGPT support. It wouldn&rsquo;t surprise me if every consumer-facing tech company does <em>something</em> with ChatGPT so they look cutting edge to their investors. Some have compared the sudden mass adoption of AI to chasing a fad, like how companies were randomly embracing web3/crypto/metaverse/NFTs a year ago (and are noting that the web3 influencers&rsquo; sudden pivot to AI is a red flag as a result). But unlike those technologies, which were solutions to problems that didn&rsquo;t exist, generative text AI does actually work, and there is actual demand from people outside of its die-hard supporters for it to work.</p>
<p>There is also the ethical dilemma of more granular usage of ChatGPT through its API. For example, high school and college students have been <a href="https://www.nytimes.com/2023/01/12/technology/chatgpt-schools-teachers.html">using ChatGPT to cheat</a> on essay writing. Since humans currently recognize AI-generated content by identifying ChatGPT&rsquo;s signature overly-academic voice, it wouldn&rsquo;t surprise me if some kids on TikTok figure out a <code>system</code> prompt that allows generation which doesn&rsquo;t obviously sound like ChatGPT and also avoids plagiarism detectors. As a side note, don&rsquo;t trust any tool that claims it can algorithmically detect AI-generated content: it&rsquo;s an extremely difficult problem, and most websites that claim to do so are just feeding a confirmation bias.</p>
<p>Lastly, there&rsquo;s the issue of <a href="https://en.wikipedia.org/wiki/Prompt_engineering">prompt engineering</a>, which, as I demonstrated above, is absolutely necessary to get ideal results. The media has <a href="https://www.washingtonpost.com/technology/2023/02/25/prompt-engineers-techs-next-big-job/">weirdly hyped the existence</a> of prompt engineers as just some weirdos making six figures to write small blobs of text. Unfortunately, with the dynamics of the new <code>system</code> model parameter, good prompt engineering will be more important than ever. I don&rsquo;t think the &ldquo;Prompt Engineer&rdquo; job title will be a trend, though: as a machine learning engineer, I can attest that the only reasons machine learning engineers are good at prompt engineering are a) years of practice and b) a tendency to be pedantic assholes. But other professions, such as writers and lawyers, are even better at being pedantic assholes, so there&rsquo;s no need for someone with a specialized skillset to do it. I do suspect prompt engineering will be a good skill for anyone to know.</p>
<h2 id="i-for-one-welcome-our-new-chatgpt-overlord">I For One Welcome Our New ChatGPT Overlord</h2>
<p>Will the existence of a super-cheap ChatGPT API be the end of all text generation AI? Not quite, hence the &ldquo;most&rdquo; in the headline. There are the traditional issues with relying on a third-party API for your business: ChatGPT could have downtime, which <a href="https://status.openai.com">has been happening more frequently lately</a>; OpenAI could raise the cost of the API at any point; the (current) model is limited to data prior to September 2021; and the content moderation filters may be too limiting for certain use cases. In those instances, there is still value for companies in training their own large language models in-house. But it is very hard to economically justify <em>not</em> using ChatGPT as a starting point for a business need and then migrating to more bespoke infrastructure later as needed, and that&rsquo;s what OpenAI is counting on. Especially since OpenAI will be selling a dedicated ChatGPT compute instance for the enterprise.</p>
<p>Research on large language models will continue as they always have. But I don&rsquo;t envy startups whose primary business is text generation right now. And that&rsquo;s before the inevitable GPT-4 throws another wrinkle into the AI text generation ecosystem.</p>
<p>A few years ago, I released <a href="https://github.com/minimaxir/aitextgen">aitextgen</a>, a Python package designed to allow people to train their own custom small AI on their own data for unique use cases. However, soon after, it turned out that GPT-3 with the right prompt could do much better at bespoke generation than a custom model, in addition to allowing out-of-domain inputs, even more so with text-davinci-003. Now that the ChatGPT API makes the cost similar to hosting a small model, it&rsquo;s harder for me to be motivated to continue maintaining the package without first finding another niche.</p>
<p>I don&rsquo;t currently have any plans to start a business using the ChatGPT API. In fact, I had made a promise to not do any ChatGPT content or tutorials because so many people have done aggressively SEO-optimized blog posts and hacks such that the ChatGPT discourse is fully saturated. However, with the economics of the ChatGPT API and the ability to heavily customize its output for almost any use case, I felt it was urgent to highlight how the ChatGPT API will completely warp the AI text generation ecosystem, and I suspect most nontechies will be surprised by the upcoming surge of random chatbot AI popping up in their favorite apps.</p>
<p>Overall, I&rsquo;m simultaneously full of ideas and annoyed.</p>
<hr>
<p><em>None of this blog post was written by ChatGPT, aside from the indicated ChatGPT API demos. My writing style is too weird for an AI to synthesize.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Benchmarking Modern GPUs for Maximum Cloud Cost Efficiency in Deep Learning</title>
      <link>https://minimaxir.com/2017/11/benchmark-gpus/</link>
      <pubDate>Tue, 28 Nov 2017 08:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/11/benchmark-gpus/</guid>
      <description>A 36% price cut to GPU instances, in addition to the potential new benefits offered by software and GPU updates, however, might be enough to tip the cost-efficiency scales back in favor of GPUs.</description>
      <content:encoded><![CDATA[<p>A few months ago, I <a href="http://minimaxir.com/2017/06/keras-cntk/">performed benchmarks</a> of deep learning frameworks in the cloud, with a <a href="http://minimaxir.com/2017/07/cpu-or-gpu/">followup</a> focusing on the cost difference between using GPUs and CPUs. And just a few months later, the landscape has changed, with significant updates to the low-level <a href="https://developer.nvidia.com/cudnn">NVIDIA cuDNN</a> library which powers the raw learning on the GPU, the <a href="https://www.tensorflow.org">TensorFlow</a> and <a href="https://github.com/Microsoft/CNTK">CNTK</a> deep learning frameworks, and the higher-level <a href="https://github.com/fchollet/keras">Keras</a> framework which uses TensorFlow/CNTK as backends for easy deep learning model training.</p>
<p>As a bonus to the framework updates, Google <a href="https://cloudplatform.googleblog.com/2017/09/introducing-faster-GPUs-for-Google-Compute-Engine.html">recently released</a> the newest generation of NVIDIA cloud GPUs, the Pascal-based P100, onto <a href="https://cloud.google.com/compute/">Google Compute Engine</a>, which touts an up-to-10x performance increase over the K80 GPUs currently used in cloud computing. As a bonus bonus, Google recently <a href="https://cloudplatform.googleblog.com/2017/11/new-lower-prices-for-GPUs-and-preemptible-Local-SSDs.html">cut the prices</a> of both K80 and P100 GPU instances by up to 36%.</p>
<p>The results of my earlier benchmarks favored <a href="https://cloud.google.com/preemptible-vms/">preemptible</a> instances with many CPUs as the most cost-efficient option (where a preemptible instance can only last for up to 24 hours and could end prematurely). A 36% price cut to GPU instances, in addition to the potential new benefits offered by software and GPU updates, however, might be enough to tip the cost-efficiency scales back in favor of GPUs. It&rsquo;s a good idea to rerun the experiment with updated VMs and see what happens.</p>
<h2 id="benchmark-setup">Benchmark Setup</h2>
<p>As with the original benchmark, I set up a <a href="https://github.com/minimaxir/keras-cntk-docker">Docker container</a> containing the deep learning frameworks (based on cuDNN 6, the latest version of cuDNN natively supported by the frameworks) that can be used to train each model independently. The <a href="https://github.com/minimaxir/keras-cntk-benchmark/tree/master/v2/test_files">Keras benchmark scripts</a> run on the containers are based on <em>real world</em> use cases of deep learning.</p>
<p>The 6 hardware/software configurations and Google Compute Engine <a href="https://cloud.google.com/compute/pricing">pricings</a> for the tests are:</p>
<ul>
<li>A K80 GPU (attached to a <code>n1-standard-1</code> instance), tested with both TensorFlow (1.4) and CNTK (2.2): <strong>$0.4975 / hour</strong>.</li>
<li>A P100 GPU (attached to a <code>n1-standard-1</code> instance), tested with both TensorFlow and CNTK: <strong>$1.5075 / hour</strong>.</li>
<li>A preemptible <code>n1-highcpu-32</code> instance, with 32 vCPUs based on the Intel Skylake architecture, tested with TensorFlow only: <strong>$0.2400 / hour</strong>.</li>
<li>A preemptible <code>n1-highcpu-16</code> instance, with 16 vCPUs based on the Intel Skylake architecture, tested with TensorFlow only: <strong>$0.1200 / hour</strong>.</li>
</ul>
<p>A single K80 GPU uses half of a GPU board while a single P100 uses a full GPU board, which in an ideal world would suggest that the P100 is at least twice as fast as the K80. But the P100 configuration is also about 3 times as expensive, so even if a model trains in half the time, it may not necessarily be cheaper with the P100.</p>
<p>Also, the CPU tests use TensorFlow <em>as installed via the recommended method</em> through pip, since compiling the TensorFlow binary from scratch to take advantage of CPU instructions as <a href="http://minimaxir.com/2017/07/cpu-or-gpu/">with my previous test</a> is not a pragmatic workflow for casual use.</p>
<h2 id="benchmark-results">Benchmark Results</h2>
<p>When a fresh-out-of-an-AI-MOOC engineer wants to experiment with deep learning in the cloud, they typically use a K80 + TensorFlow setup, so we&rsquo;ll use that as the <em>base configuration</em>.</p>
<p>For each model architecture and software/hardware configuration, I calculate the <strong>total training time relative to the base configuration instance training</strong> for running the model training for the provided test script. In all cases, the P100 GPU <em>should</em> perform better than the K80, and 32 vCPUs <em>should</em> train faster than 16 vCPUs. The question is how <em>much</em> faster?</p>
<p>Let&rsquo;s start using the <a href="http://yann.lecun.com/exdb/mnist/">MNIST dataset</a> of handwritten digits plus the common multilayer perceptron (MLP) architecture, with dense fully-connected layers. Lower training time is better.</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-5_hu_df63751b48270991.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-5_hu_33351b8d5d2916d3.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-5_hu_773ee4a74d2ce535.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-5.png 1200w" src="dl-cpu-gpu-5.png"/> 
</figure>

<p>For this task, CNTK appears to be faster than TensorFlow. The P100 is indeed faster than the K80 for the corresponding framework, although it&rsquo;s not a dramatic difference. However, since the task is simple, the CPU performance is close to that of the GPU, which implies that the GPU is not as cost-effective for a simple architecture.</p>
<p>For each model architecture and configuration, I calculate a <strong>normalized training cost relative to the cost of the base configuration training</strong>. Because GCE instance costs are prorated, we can simply calculate experiment cost by multiplying the total number of seconds the experiment runs by the cost of the instance (per second).</p>
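<p>In code, that calculation is trivial; a quick sketch using the rates above (the training times here are made up for illustration):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># GCE hourly rates from the configuration list above, converted to $/second
K80_RATE = 0.4975 / 3600   # base configuration: K80 + n1-standard-1
P100_RATE = 1.5075 / 3600  # P100 + n1-standard-1

def experiment_cost(seconds, rate_per_second):
    # GCE bills are prorated, so cost is just duration times rate
    return seconds * rate_per_second

# hypothetical training times, for illustration only
base_cost = experiment_cost(1000, K80_RATE)
p100_cost = experiment_cost(550, P100_RATE)

print(p100_cost / base_cost)  # normalized cost relative to the base configuration
</code></pre></div>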
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-6_hu_8092aa4efa0c4355.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-6_hu_6ec85d77120003f7.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-6_hu_3fa9ff93fed554d5.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-6.png 1200w" src="dl-cpu-gpu-6.png"/> 
</figure>

<p>Unsurprisingly, CPUs are more cost-effective. Notably, the P100 is even <em>less</em> cost-effective for this task than the K80.</p>
<p>Now, let&rsquo;s look at the same dataset with a convolutional neural network (CNN) approach for digit classification. Since CNNs are typically used for computer vision tasks, new graphics card architectures are optimized for CNN workflows, so it will be interesting to see how the P100 performs compared to the K80:</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-7_hu_f8361510000c69ef.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-7_hu_a5e4bb39cb0f4851.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-7_hu_13b371e4d8afa6c9.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-7.png 1200w" src="dl-cpu-gpu-7.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-8_hu_f4a994fcdbd47c8f.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-8_hu_94b3b6c80d09cc47.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-8_hu_ca2831240a30c8c.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-8.png 1200w" src="dl-cpu-gpu-8.png"/> 
</figure>

<p>Indeed, the P100 is twice as fast as the K80, but due to the huge cost premium, it&rsquo;s not cost-effective for this simple task. CPUs do not perform well on this task either, so, notably, the base configuration is the best configuration here.</p>
<p>Let&rsquo;s go deeper with CNNs and look at the <a href="https://www.cs.toronto.edu/%7Ekriz/cifar.html">CIFAR-10</a> image classification dataset, with a model which utilizes a deep convnet plus a multilayer perceptron and is well suited for image classification (similar to the <a href="https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3">VGG-16</a> architecture).</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-9_hu_3e89a9d69d2114d8.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-9_hu_188420deeffa2cca.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-9_hu_2994e1dc8b68f244.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-9.png 1200w" src="dl-cpu-gpu-9.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-10_hu_4c8240dc9addd1a4.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-10_hu_e38edfb433bf8413.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-10_hu_a879b46166fddc6d.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-10.png 1200w" src="dl-cpu-gpu-10.png"/> 
</figure>

<p>Similar results to those of a normal MLP. Nothing fancy.</p>
<p>The bidirectional long short-term memory (LSTM) architecture is great for working with text data like IMDb reviews. When I wrote <a href="http://minimaxir.com/2017/06/keras-cntk/">my first benchmark article</a>, I noticed that CNTK performed significantly better than TensorFlow; <a href="https://news.ycombinator.com/item?id=14538086">commenters on Hacker News</a> noted that this was because TensorFlow uses an inefficient implementation of the LSTM on the GPU.</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/cntk-old_hu_b86c227c88de2e7d.webp 320w,/2017/11/benchmark-gpus/cntk-old_hu_3901dc880777da18.webp 768w,/2017/11/benchmark-gpus/cntk-old_hu_8d49b907914bb06b.webp 1024w,/2017/11/benchmark-gpus/cntk-old.png 1620w" src="cntk-old.png"/> 
</figure>

<p>However, Keras&rsquo;s <a href="https://keras.io/layers/recurrent/#cudnnlstm">new CuDNN RNN layers</a>, which leverage cuDNN directly, may fix this inefficiency, so for the K80/P100 TensorFlow GPU configs, I use a CuDNNLSTM layer instead of a normal LSTM layer. Let&rsquo;s take another look:</p>
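<p>The swap itself is a one-line change in the model definition; a minimal sketch with hypothetical layer sizes (not the exact benchmark code):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from keras.models import Sequential
from keras.layers import Bidirectional, CuDNNLSTM, Dense, Embedding

# a toy bidirectional text classifier; CuDNNLSTM only runs on a GPU with cuDNN
model = Sequential()
model.add(Embedding(20000, 128, input_length=100))
model.add(Bidirectional(CuDNNLSTM(64)))  # instead of Bidirectional(LSTM(64))
model.add(Dense(1, activation=&#34;sigmoid&#34;))
model.compile(loss=&#34;binary_crossentropy&#34;, optimizer=&#34;adam&#34;)
</code></pre></div>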
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-1_hu_f633549e7615557a.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-1_hu_c8eb1a82936955a7.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-1_hu_734746132ba497c3.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-1.png 1200w" src="dl-cpu-gpu-1.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-2_hu_6f0e2078d0fbe4a8.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-2_hu_f5299cfcd4184de5.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-2_hu_9c9b4dbee5321cd.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-2.png 1200w" src="dl-cpu-gpu-2.png"/> 
</figure>

<p><em>WOAH.</em> TensorFlow is now more than <em>three times as fast</em> as CNTK! (And compared against my previous benchmark, TensorFlow on the K80 w/ the CuDNNLSTM is about <em>7x as fast</em> as it once was!) Even the CPU-only versions of TensorFlow are faster than CNTK on the GPU now, which implies significant improvements in the ecosystem outside of the CuDNNLSTM layer itself. (And as a result, CPUs are still more cost efficient.)</p>
<p>Lastly, LSTM text generation of <a href="https://en.wikipedia.org/wiki/Friedrich_Nietzsche">Nietzsche&rsquo;s</a> <a href="https://s3.amazonaws.com/text-datasets/nietzsche.txt">writings</a> follows similar patterns to the other architectures, but without the drastic hit to the GPU.</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-11_hu_e64be99549e22a4a.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-11_hu_c9e45139e2d4d36b.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-11_hu_73f05d523cc746fa.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-11.png 1200w" src="dl-cpu-gpu-11.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-12_hu_18c099feff0cab3f.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-12_hu_346cce6ac1dd882a.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-12_hu_784cadffdd30380.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-12.png 1200w" src="dl-cpu-gpu-12.png"/> 
</figure>

<h2 id="conclusions">Conclusions</h2>
<p>The biggest surprise of these new benchmarks is that there is no configuration where the P100 is the most cost-effective option, even though the P100 is indeed faster than the K80 in all tests. Per <a href="https://developer.nvidia.com/cudnn">the cuDNN website</a>, there is apparently only about a 2x speed increase between the K80 and the P100 using cuDNN 6, which is mostly consistent with the results of my benchmarks:</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/cudnn_hu_354d8fa8ab3eff29.webp 320w,/2017/11/benchmark-gpus/cudnn_hu_bb346ea37595e154.webp 768w,/2017/11/benchmark-gpus/cudnn_hu_9b3f6e3ea7ba3a02.webp 1024w,/2017/11/benchmark-gpus/cudnn.png 1688w" src="cudnn.png"/> 
</figure>

<p>I did not include a multi-GPU configuration (using Keras&rsquo;s new <code>multi_gpu_model</code> <a href="https://keras.io/utils/#multi_gpu_model">function</a>) in the benchmark data visualizations above because, in my testing, multi-GPU training <em>was equal to or worse than a single GPU</em> in all tests.</p>
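<p>For reference, the multi-GPU setup I tested is just a wrapper around an existing model; a minimal sketch (hypothetical toy model):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

model = Sequential([Dense(10, activation=&#34;softmax&#34;, input_shape=(784,))])  # any Keras model

# replicate the model across 2 GPUs; each batch is split between them
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(loss=&#34;categorical_crossentropy&#34;, optimizer=&#34;adam&#34;)
</code></pre></div>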
<p>Taking these together, it&rsquo;s possible that the overhead introduced by parallel, advanced architectures <em>eliminates the benefits</em> for &ldquo;normal&rdquo; deep learning workloads which do not fully saturate the GPU. Rarely do people talk about diminishing returns in GPU performance with deep learning.</p>
<p>In the future, I want to benchmark deep learning performance against more advanced deep learning use cases such as <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">reinforcement learning</a> and deep CNNs like <a href="https://github.com/tensorflow/models/tree/master/research/inception">Inception</a>. But that doesn&rsquo;t mean these benchmarks are not relevant; as stated during the benchmark setup, the GPUs were tested against typical deep learning use cases, and now we see the tradeoffs that result.</p>
<p>In all, with the price cuts on GPU instances, cost-performance is often <em>on par</em> with preemptible CPU instances, which is an advantage if you want to train models faster and not risk the instance being killed unexpectedly. And there is still a lot of competition in this space: <a href="https://www.amazon.com">Amazon</a> offers a <code>p2.xlarge</code> <a href="https://aws.amazon.com/ec2/spot/">Spot Instance</a> with a K80 GPU for $0.15-$0.20 an hour, less than half the price of the corresponding Google Compute Engine K80 GPU instance, although with <a href="https://aws.amazon.com/ec2/spot/details/">a few bidding caveats</a> which I haven&rsquo;t fully explored yet. Competition will drive GPU prices down even further, and training deep learning models will become even easier.</p>
<p>And as the cuDNN chart above shows, things will get <em>very</em> interesting once Volta-based GPUs like the V100 are generally available and the deep learning frameworks support cuDNN 7 by default.</p>
<p><strong>UPDATE 12/17</strong>: <em>As pointed out by <a href="https://news.ycombinator.com/item?id=15941682">dantiberian on Hacker News</a>, Google Compute Engine now supports <a href="https://cloud.google.com/compute/docs/instances/preemptible#preemptible_with_gpu">preemptible GPUs</a>, which were apparently added after this post went live. Preemptible GPUs are exactly half the price of normal GPUs ($0.22/hr for K80s and $0.73/hr for P100s), so they&rsquo;re about double the cost efficiency (when factoring in the cost of the base preemptible instance), which would put them squarely ahead of CPUs in all cases. (And since the CPU instances used here were also preemptible, it&rsquo;s apples-to-apples.)</em></p>
<hr>
<p><em>All scripts for running the benchmark are available in <a href="https://github.com/minimaxir/keras-cntk-benchmark/tree/master/v2">this GitHub repo</a>. You can view the R/ggplot2 code used to process the logs and create the visualizations in <a href="http://minimaxir.com/notebooks/benchmark-gpus/">this R Notebook</a>.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Benchmarking TensorFlow on Cloud CPUs: Cheaper Deep Learning than Cloud GPUs</title>
      <link>https://minimaxir.com/2017/07/cpu-or-gpu/</link>
      <pubDate>Wed, 05 Jul 2017 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/07/cpu-or-gpu/</guid>
      <description>Using CPUs instead of GPUs for deep learning training in the cloud is cheaper because of the massive cost differential afforded by preemptible instances.</description>
      <content:encoded><![CDATA[<p>I&rsquo;ve been working on a few personal deep learning projects with <a href="https://github.com/fchollet/keras">Keras</a> and <a href="https://www.tensorflow.org">TensorFlow</a>. However, training models for deep learning with cloud services such as <a href="https://aws.amazon.com/ec2/">Amazon EC2</a> and <a href="https://cloud.google.com/compute/">Google Compute Engine</a> isn&rsquo;t free, and as someone who is currently unemployed, I have to keep an eye on extraneous spending and be as cost-efficient as possible (please support my work on <a href="https://www.patreon.com/minimaxir">Patreon</a>!). I tried deep learning on the cheaper CPU instances instead of GPU instances to save money, and to my surprise, my model training was only slightly slower. As a result, I took a deeper look at the pricing mechanisms of these two types of instances to see if CPUs are more useful for my needs.</p>
<p>The <a href="https://cloud.google.com/compute/pricing#gpus">pricing of GPU instances</a> on Google Compute Engine starts at <strong>$0.745/hr</strong> (by attaching a $0.700/hr GPU die to a $0.045/hr n1-standard-1 instance). A couple months ago, Google <a href="https://cloudplatform.googleblog.com/2017/05/Compute-Engine-machine-types-with-up-to-64-vCPUs-now-ready-for-your-production-workloads.html">announced</a> CPU instances with up to 64 vCPUs on the modern Intel <a href="https://en.wikipedia.org/wiki/Skylake_%28microarchitecture%29">Skylake</a> CPU architecture. More importantly, they can also be used as <a href="https://cloud.google.com/compute/docs/instances/preemptible">preemptible instances</a>, which live for at most 24 hours on GCE and can be terminated at any time (though very rarely are), but cost about <em>20%</em> of the price of a standard instance. A preemptible n1-highcpu-64 instance with 64 vCPUs and 57.6GB RAM, plus the premium for using Skylake CPUs, is <strong>$0.509/hr</strong>, about 2/3rds the cost of the GPU instance.</p>
<p>If the model training speed of 64 vCPUs is comparable to that of a GPU (or even slightly slower), it would be more cost-effective to use the CPUs instead. But that&rsquo;s assuming the deep learning software and the GCE platform hardware operate at 100% efficiency; if they don&rsquo;t (and they likely don&rsquo;t), there may be <em>even more savings</em> by scaling down the number of vCPUs and cost accordingly (a 32 vCPU instance with same parameters is half the price at <strong>$0.254/hr</strong>, 16 vCPU at <strong>$0.127/hr</strong>, etc).</p>
<p>There aren&rsquo;t any benchmarks for deep learning libraries with tons and tons of CPUs since there&rsquo;s no demand, as GPUs are the <a href="https://en.wikipedia.org/wiki/Occam%27s_razor">Occam&rsquo;s razor</a> solution to deep learning hardware. But what might make counterintuitive but economical sense is to use CPUs instead of GPUs for deep learning training because of the massive cost differential afforded by preemptible instances, thanks to Google&rsquo;s <a href="https://en.wikipedia.org/wiki/Economies_of_scale">economies of scale</a>.</p>
<h2 id="setup">Setup</h2>
<p>I already have <a href="https://github.com/minimaxir/deep-learning-cpu-gpu-benchmark">benchmarking scripts</a> of real-world deep learning use cases, <a href="https://github.com/minimaxir/keras-cntk-docker">Docker container environments</a>, and results logging from my <a href="http://minimaxir.com/2017/06/keras-cntk/">TensorFlow vs. CNTK article</a>. A few minor tweaks allow the scripts to be utilized for both CPU and GPU instances by setting CLI arguments. I also rebuilt <a href="https://github.com/minimaxir/keras-cntk-docker/blob/master/Dockerfile">the Docker container</a> to support the latest version of TensorFlow (1.2.1), and created a <a href="https://github.com/minimaxir/keras-cntk-docker/blob/master/Dockerfile-cpu">CPU version</a> of the container which installs the CPU-appropriate TensorFlow library instead.</p>
<p>There is a notable CPU-specific TensorFlow behavior; if you install from <code>pip</code> (as the <a href="https://www.tensorflow.org/install/">official instructions</a> and tutorials recommend) and begin training a model in TensorFlow, you&rsquo;ll see these warnings in the console:</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/tensorflow-console_hu_e436e066e4e1304d.webp 320w,/2017/07/cpu-or-gpu/tensorflow-console_hu_ce5df372394290b4.webp 768w,/2017/07/cpu-or-gpu/tensorflow-console_hu_9e354816d97d6c8f.webp 1024w,/2017/07/cpu-or-gpu/tensorflow-console.png 1130w" src="tensorflow-console.png"/> 
</figure>

<p>In order to fix the warnings and benefit from these <a href="https://en.wikipedia.org/wiki/SSE4#SSE4.2">SSE4.2</a>/<a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">AVX</a>/<a href="https://en.wikipedia.org/wiki/FMA_instruction_set">FMA</a> optimizations, we <a href="https://stackoverflow.com/questions/41293077/how-to-compile-tensorflow-with-sse4-2-and-avx-instructions">compile TensorFlow from source</a>, and I created a <a href="https://github.com/minimaxir/keras-cntk-docker/blob/master/Dockerfile-cpu-compiled">third Docker container</a> to do just that. When training models in the new container, <a href="https://github.com/tensorflow/tensorflow/issues/10689">most</a> of the warnings no longer show, and (spoiler alert) there is indeed a speed boost in training time.</p>
<p>Therefore, we can test three major cases with Google Compute Engine:</p>
<ul>
<li>A Tesla K80 GPU instance.</li>
<li>A 64 Skylake vCPU instance where TensorFlow is installed via <code>pip</code> (along with testings at 8/16/32 vCPUs).</li>
<li>A 64 Skylake vCPU instance where TensorFlow is compiled (<code>cmp</code>) with CPU instructions (+ 8/16/32 vCPUs).</li>
</ul>
<h2 id="results">Results</h2>
<p>For each model architecture and software/hardware configuration, I calculate the <strong>total training time relative to the GPU instance training</strong> for running the model training for the provided test script. In all cases, the GPU <em>should</em> be the fastest training configuration, and systems with more processors should train faster than those with fewer processors.</p>
<p>Let&rsquo;s start with the <a href="http://yann.lecun.com/exdb/mnist/">MNIST dataset</a> of handwritten digits and the common multilayer perceptron (MLP) architecture, which uses dense fully-connected layers. Lower training time is better. All configurations below the horizontal dotted line are better than the GPU; all configurations above the dotted line are worse.</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-5_hu_8cf5154f974aed3c.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-5_hu_2ec21aba02d8fb37.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-5_hu_7682d0a58ea1e871.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-5.png 1200w" src="dl-cpu-gpu-5.png"/> 
</figure>

<p>Here, the GPU is the fastest out of all the platform configurations, but there are other curious trends: the performance between 32 vCPUs and 64 vCPUs is similar, and the compiled TensorFlow library is indeed a significant improvement in training speed <em>but only for 8 and 16 vCPUs</em>. Perhaps there are overheads negotiating information between vCPUs that eliminate the performance advantages of more vCPUs, and perhaps these overheads are <em>different</em> with the CPU instructions of the compiled TensorFlow. In the end, it&rsquo;s a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a>, which is why I prefer black box benchmarking all configurations of hardware instead of theorycrafting.</p>
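<p>For reference, here&rsquo;s a minimal Keras sketch of the kind of MLP benchmarked here (layer sizes are illustrative, in the spirit of the canonical Keras MNIST example, not necessarily the exact benchmark script):</p>
<pre><code class="language-python">from keras.models import Sequential
from keras.layers import Dense, Dropout

# Simple MLP for 28x28 MNIST digits, flattened to 784 features.
model = Sequential()
model.add(Dense(512, activation="relu", input_shape=(784,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(10, activation="softmax"))  # 10 digit classes

model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
</code></pre>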
<p>Since the training speeds of different vCPU counts differ only minimally, there is a definite advantage to scaling down. For each model architecture and configuration, I calculate a <strong>normalized training cost relative to the cost of GPU instance training</strong>. Because GCE instance costs are prorated (unlike Amazon EC2), we can simply calculate experiment cost by multiplying the total number of seconds the experiment runs by the cost of the instance (per second). Ideally, we want to <em>minimize</em> cost.</p>
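<p>The cost arithmetic itself is trivial; here&rsquo;s a sketch with illustrative prices (not actual GCE rates):</p>
<pre><code class="language-python"># Illustrative per-second prices; not actual GCE rates.
GPU_PRICE_PER_SEC = 0.745 / 3600    # hypothetical GPU instance, $/sec
CPU16_PRICE_PER_SEC = 0.060 / 3600  # hypothetical preemptible 16-vCPU, $/sec

def normalized_cost(train_secs, price_per_sec, gpu_train_secs):
    """Experiment cost relative to the cost of the GPU instance run."""
    cost = train_secs * price_per_sec
    gpu_cost = gpu_train_secs * GPU_PRICE_PER_SEC
    return cost / gpu_cost

# A CPU run taking 3x as long as the GPU run can still be far cheaper:
print(normalized_cost(3 * 1800, CPU16_PRICE_PER_SEC, 1800))  # ~0.24
</code></pre>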
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-6_hu_c6ff3c375435199.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-6_hu_6bee6729ce48517c.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-6_hu_ea518ff15e46de10.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-6.png 1200w" src="dl-cpu-gpu-6.png"/> 
</figure>

<p>Lower vCPU counts are <em>much</em> more cost-effective for this problem; here, going as low as possible is best.</p>
<p>Now, let&rsquo;s look at the same dataset with a convolutional neural network (CNN) approach to digit classification.</p>
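<p>A reference Keras sketch of this type of convnet (again with illustrative layer sizes, in the spirit of the canonical Keras MNIST CNN example):</p>
<pre><code class="language-python">from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

# Small convnet for 28x28x1 MNIST images.
model = Sequential()
model.add(Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)))
model.add(Conv2D(64, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(10, activation="softmax"))
</code></pre>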
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-7_hu_d3205561da4ed49c.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-7_hu_ae81ceba7d6092e6.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-7_hu_7a29bcea36dbe20e.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-7.png 1200w" src="dl-cpu-gpu-7.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-8_hu_64f1eac6ff5b2b3f.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-8_hu_c6dd20c1ccc111a5.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-8_hu_2fa65c3c187723bb.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-8.png 1200w" src="dl-cpu-gpu-8.png"/> 
</figure>

<p>For CNNs, GPUs are unsurprisingly more than twice as fast as any CPU approach. The cost structures remain similar, except that 64 vCPUs are now <em>worse</em> than the GPU cost-wise, with 32 vCPUs even training faster than 64 vCPUs.</p>
<p>Let&rsquo;s go deeper with CNNs and look at the <a href="https://www.cs.toronto.edu/%7Ekriz/cifar.html">CIFAR-10</a> image classification dataset, using a model that combines a deep convnet with a multilayer perceptron, an approach well-suited to image classification (similar to the <a href="https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3">VGG-16</a> architecture).</p>
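<p>The VGG-style idea is to stack repeated conv-conv-pool blocks before a dense head; a minimal sketch (layer sizes illustrative, not the exact benchmark script):</p>
<pre><code class="language-python">from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

# VGG-style stacking for 32x32x3 CIFAR-10 images:
# repeated conv-conv-pool blocks, then an MLP head.
model = Sequential()
model.add(Conv2D(32, (3, 3), activation="relu", padding="same",
                 input_shape=(32, 32, 3)))
model.add(Conv2D(32, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation="relu", padding="same"))
model.add(Conv2D(64, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(10, activation="softmax"))
</code></pre>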
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-9_hu_4a5cd8ba80674837.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-9_hu_a81280d52893c1c9.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-9_hu_af30edd0d3117cd8.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-9.png 1200w" src="dl-cpu-gpu-9.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-10_hu_a6061eb15b5b8609.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-10_hu_fe0751d9cd60a655.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-10_hu_a371016369278a9a.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-10.png 1200w" src="dl-cpu-gpu-10.png"/> 
</figure>

<p>Behavior here is similar to the simple CNN case, although in this instance all CPU configurations perform better with the compiled TensorFlow library.</p>
<p>The fasttext algorithm, used here on the <a href="http://ai.stanford.edu/%7Eamaas/data/sentiment/">IMDb reviews dataset</a> to determine whether a review is positive or negative, classifies text extremely quickly relative to other methods.</p>
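<p>In Keras terms, a fasttext-style classifier is little more than an embedding, an average, and a logistic output; a minimal sketch (vocabulary and embedding sizes illustrative):</p>
<pre><code class="language-python">from keras.models import Sequential
from keras.layers import Embedding, GlobalAveragePooling1D, Dense

# fasttext-style classifier: embed tokens, average the embeddings,
# then a single sigmoid output for positive/negative sentiment.
model = Sequential()
model.add(Embedding(20000, 50, input_length=400))
model.add(GlobalAveragePooling1D())
model.add(Dense(1, activation="sigmoid"))

model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
</code></pre>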
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-3_hu_12d55d02148bf0ea.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-3_hu_aaf9917a1629214f.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-3_hu_d51ed2e2c6fdec60.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-3.png 1200w" src="dl-cpu-gpu-3.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-4_hu_6b591a471f3027a4.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-4_hu_7cc361b383b25fb0.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-4_hu_4c516e76a92eff3c.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-4.png 1200w" src="dl-cpu-gpu-4.png"/> 
</figure>

<p>In this case, GPUs are much, much faster than CPUs, and the benefit of lower vCPU counts isn&rsquo;t as dramatic. As an aside, the <a href="https://github.com/facebookresearch/fastText">official fasttext implementation</a> is <em>designed</em> for large numbers of CPUs and handles parallelization much better.</p>
<p>The bidirectional long short-term memory (LSTM) architecture is great for working with text data like IMDb reviews, but after my previous benchmark article, <a href="https://news.ycombinator.com/item?id=14538086">commenters on Hacker News</a> noted that TensorFlow uses an inefficient implementation of the LSTM on the GPU, so perhaps the difference will be more notable here.</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-1_hu_4369b4e9e8856507.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-1_hu_3e65077eb16928e4.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-1_hu_d736592c927bd764.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-1.png 1200w" src="dl-cpu-gpu-1.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-2_hu_d8c58f429f4a781b.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-2_hu_1306d728b4fce90.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-2_hu_ad3d19e88738d072.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-2.png 1200w" src="dl-cpu-gpu-2.png"/> 
</figure>

<p>Wait, what? GPU training of bidirectional LSTMs is <em>twice as slow</em> as any CPU configuration? Wow. (In fairness, the benchmark uses the Keras LSTM default of <code>implementation=0</code>, which is better on CPUs, while <code>implementation=2</code> is better on GPUs; even so, that shouldn&rsquo;t produce this much of a differential.)</p>
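<p>For reference, the flag in question is an argument to the Keras recurrent layers; a sketch of the GPU-friendlier setting (layer sizes illustrative):</p>
<pre><code class="language-python">from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

model = Sequential()
model.add(Embedding(20000, 128, input_length=200))
# implementation=0 (the default) favors fewer, larger matrix products,
# which suits CPUs; implementation=2 combines the gates into a single
# matrix for more efficient parallelization on GPUs.
model.add(Bidirectional(LSTM(64, implementation=2)))
model.add(Dense(1, activation="sigmoid"))
</code></pre>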
<p>Lastly, LSTM text generation of <a href="https://en.wikipedia.org/wiki/Friedrich_Nietzsche">Nietzsche&rsquo;s</a> <a href="https://s3.amazonaws.com/text-datasets/nietzsche.txt">writings</a> follows similar patterns to the other architectures, but without the drastic hit to the GPU.</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-11_hu_d84b78ad35a1f056.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-11_hu_d58d19568c89869.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-11_hu_c078d8bd94df56aa.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-11.png 1200w" src="dl-cpu-gpu-11.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-12_hu_44c1d2cc10581f1a.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-12_hu_27c08aabe3a3cacd.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-12_hu_d41db5a45ef62daf.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-12.png 1200w" src="dl-cpu-gpu-12.png"/> 
</figure>

<h2 id="conclusion">Conclusion</h2>
<p>As it turns out, using 64 vCPUs is <em>bad</em> for deep learning, as current software/hardware architectures can&rsquo;t fully utilize all of them, and often the result is the exact same performance (or <em>worse</em>) as with 32 vCPUs. In terms of balancing both training speed and cost, training models with <strong>16 vCPUs + compiled TensorFlow</strong> seems like the winner. The 30%&ndash;40% speed boost from the compiled TensorFlow library was an unexpected bonus, and I&rsquo;m shocked Google doesn&rsquo;t offer a precompiled version of TensorFlow with these CPU speedups, since the gains are nontrivial.</p>
<p>It&rsquo;s worth noting that the cost advantages shown here are <em>only</em> possible with preemptible instances; regular high-CPU instances on Google Compute Engine are about 5x as expensive, which eliminates the cost benefits completely. Hooray for economies of scale!</p>
<p>A major implicit assumption with the cloud CPU training approach is that you don&rsquo;t need a trained model ASAP. In professional use cases, time may be too valuable to waste, but in personal use cases where someone can just leave a model training overnight, it&rsquo;s a very, very good and cost-effective option, and one that I&rsquo;ll now utilize.</p>
<hr>
<p><em>All scripts for running the benchmark are available in <a href="https://github.com/minimaxir/deep-learning-cpu-gpu-benchmark">this GitHub repo</a>. You can view the R/ggplot2 code used to process the logs and create the visualizations in <a href="http://minimaxir.com/notebooks/deep-learning-cpu-gpu/">this R Notebook</a>.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>The Interesting Percentages of Female Students in MIT and Harvard Online Courses</title>
      <link>https://minimaxir.com/2014/07/gender-course/</link>
      <pubDate>Fri, 04 Jul 2014 10:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2014/07/gender-course/</guid>
      <description>The proportion of female students in each of Harvard and MIT&amp;rsquo;s online courses range from 5% to 49%.</description>
      <content:encoded><![CDATA[<p>At the end of May, <a href="http://www.harvard.edu/">Harvard</a> and <a href="http://web.mit.edu/">MIT</a> jointly <a href="http://newsoffice.mit.edu/2014/mit-and-harvard-release-de-identified-learning-data-open-online-courses">released a dataset</a> containing statistics about their online courses in the Academic Year of 2013. This <a href="http://dx.doi.org/10.7910/DVN/26147">Person-Course De-Identified dataset</a> contains 476,532 students who have taken up to 13 unique courses from a variety of topics:</p>
<figure>

    <img loading="lazy" srcset="/2014/07/gender-course/mit-harvard-courses_hu_7940e4f3b6f7a13a.webp 320w,/2014/07/gender-course/mit-harvard-courses.png 560w" src="mit-harvard-courses.png"/> 
</figure>

<p>About half of the courses involve subjects in the humanities, while the other half involve computer science and electrical engineering.</p>
<p>One of the statistics I wanted to analyze was the gender ratio of students of online courses. In the data set, 425,105 students have a gender on record, with 311,534 male students (73.3%) and 113,571 female students (26.7%). This population proportion of female students is surprisingly low, especially since the male/female ratio is <a href="http://colleges.findthebest.com/q/1929/1270/What-is-the-male-to-female-ratio-at-Harvard-University">about 50:50</a> at MIT and Harvard themselves.</p>
<p>Therefore, I took a look at the gender distribution of each of the 13 unique courses. Is the gender ratio similar across all classes, or is there a huge difference between classes?</p>
<figure>

    <img loading="lazy" srcset="/2014/07/gender-course/course-female_hu_8a3574152a2f4856.webp 320w,/2014/07/gender-course/course-female_hu_b147ce256fa08c5b.webp 768w,/2014/07/gender-course/course-female_hu_97ec908d28732b5f.webp 1024w,/2014/07/gender-course/course-female.png 1500w" src="course-female.png"/> 
</figure>

<p>Yeah, there&rsquo;s a huge difference.</p>
<p>The proportion of female students in each of Harvard and MIT&rsquo;s online courses range from <strong>5% to 49%</strong>.</p>
<p>The top half of the gender ratios are all well above the 26.7% threshold; all six of these courses are in the humanities or the life sciences. The bottom half of the gender ratios are all well below the 26.7% threshold; all seven of these courses are engineering or computer science courses with a strong focus on mathematics. (For clarification, the <a href="https://www.edx.org/course/mitx/mitx-2-01x-elements-structures-1759#.U7ZfKvldV8F">Elements of Structures</a> course at MIT is a physics course with linear algebra programming.)</p>
<p>Why is the overall proportion so low, then? As it turns out, both Harvard&rsquo;s Introduction to Computer Science I (169,621 students; about 40% of all students) and MIT&rsquo;s Introduction to CS/Programming (124,446 students total across both semesters) are so popular that the low percentage of women in those particular classes drastically drags down the average.</p>
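<p>To see the weighting effect concretely, here&rsquo;s a toy calculation; the female percentages below are hypothetical, for illustration only, not values from the dataset:</p>
<pre><code class="language-python"># Toy illustration of how two huge low-percentage courses drag down
# the overall average. Percentages are hypothetical.
courses = [
    (169621, 0.12),  # a very large CS course with a low female share
    (124446, 0.12),  # another very large CS course
    (50000, 0.45),   # a smaller humanities course, near parity
]
total = sum(n for n, _ in courses)
overall = sum(n * p for n, p in courses) / total
print(round(overall, 3))  # ~0.168: far below the smaller course's 45%
</code></pre>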
<p>The presence and interest of <a href="http://www.whitehouse.gov/administration/eop/ostp/women">women in STEM fields</a> (science, technology, engineering, and mathematics) has been a topic of <a href="http://www.huffingtonpost.com/stella-kasdagli/should-women-avoid-jobs-in-stem_b_5549016.html">controversy</a> for a long time. The chart shows that the percentage of women interested in STEM classes is indeed measurably lower than in other fields; hopefully, awareness of this issue will help drive change in the future.</p>
<hr>
<ul>
<li><em>Data was processed using R and the chart was made using ggplot2. (w/ a few annotations added using a photo editor)</em></li>
<li><em>You can view code necessary to reproduce these results in <a href="https://github.com/minimaxir/gender-course">this GitHub repository</a>. Since MIT/Harvard prevent redistribution of the dataset, you&rsquo;ll have to <a href="http://dx.doi.org/10.7910/DVN/26147">download the dataset</a> yourself.</em></li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
