<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Programming on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/category/programming/</link>
    <description>Recent content in Programming on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Fri, 27 Feb 2026 10:00:00 -0800</lastBuildDate>
    <atom:link href="https://minimaxir.com/category/programming/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>An AI agent coding skeptic tries AI agent coding, in excessive detail</title>
      <link>https://minimaxir.com/2026/02/ai-agent-coding/</link>
      <pubDate>Fri, 27 Feb 2026 10:00:00 -0800</pubDate>
      <guid>https://minimaxir.com/2026/02/ai-agent-coding/</guid>
      <description>No vagueposting here, just look at the Estimated Read Time.</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code.language-txt, pre code.language-md{
white-space: pre-wrap !important;
word-break: normal !important;
}
</style></span></p>
<p>You&rsquo;ve likely seen many blog posts about AI agent coding/<a href="https://en.wikipedia.org/wiki/Vibe_coding">vibecoding</a> where the author talks about all the wonderful things agents can now do supported by vague anecdata, how agents will lead to the atrophy of programming skills, how agents impugn the sovereignty of the human soul, etc etc. This is <strong>NOT</strong> one of those posts. You&rsquo;ve been warned.</p>
<p>Last May, I wrote a blog post titled <a href="https://minimaxir.com/2025/05/llm-use/">As an Experienced LLM User, I Actually Don&rsquo;t Use Generative LLMs Often</a> as a contrasting response to the hype around the rising popularity of agentic coding. In that post, I noted that while LLMs are most definitely not useless and can answer simple coding questions with sufficient accuracy faster than it would take me to write the code myself, agents are a tougher sell: they are unpredictable, expensive, and the hype around them was wildly disproportionate given the results I had seen in personal usage. However, I concluded that I was open to agents if LLMs improved enough that my concerns were addressed and agents became more dependable.</p>
<p>In the months since, I continued my real-life work as a Data Scientist while keeping up-to-date on the latest LLMs popping up on <a href="https://openrouter.ai">OpenRouter</a>. In August, Google <a href="https://developers.googleblog.com/introducing-gemini-2-5-flash-image/">announced</a> the release of their Nano Banana generative image AI with a <a href="https://ai.google.dev/gemini-api/docs/image-generation">corresponding API</a> that&rsquo;s difficult to use, so I open-sourced the <a href="https://github.com/minimaxir/gemimg">gemimg Python package</a> that serves as an API wrapper. It&rsquo;s not a thrilling project: there&rsquo;s little room or need for creative implementation, and my satisfaction came from what the tool enabled rather than from writing it. Therefore, as an experiment, I plopped the feature-complete code into various up-and-coming LLMs on OpenRouter and prompted the models to identify and fix any issues with the Python code: if they failed, it would be a good test of the current capabilities of LLMs; if they succeeded, it would be a software quality increase for potential users of the package, and I have no moral objection to that. The LLMs actually were helpful: in addition to adding good function docstrings and type hints, they identified more Pythonic implementations of various code blocks.</p>
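<p>For the curious, this kind of code-review prompt can be sent to any model on OpenRouter through its OpenAI-compatible chat completions endpoint. Here is a minimal sketch (the model slug and prompt wording are illustrative, not my exact ones):</p>

```python
import json
import os
import urllib.request

# OpenRouter's OpenAI-compatible chat completions endpoint.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_review_request(source: str, model: str = "anthropic/claude-opus-4.5") -> dict:
    # Pure helper: the request body asking the model to review the code.
    # The model slug above is illustrative; any OpenRouter slug works.
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": "Identify and fix any issues with this Python code:\n\n" + source,
            }
        ],
    }


def review_code(source: str) -> str:
    # Sends the review request; requires OPENROUTER_API_KEY in the environment.
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(build_review_request(source)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

<p>Swapping the <code>model</code> slug is all it takes to run the same review across multiple up-and-coming models.</p>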
<p>Around this time, my coworkers were pushing <a href="https://github.com/features/copilot">GitHub Copilot</a> within <a href="https://code.visualstudio.com">Visual Studio Code</a> as a coding aid, particularly around the then-new <a href="https://www.anthropic.com/news/claude-sonnet-4-5">Claude Sonnet 4.5</a>. For my data science work, Sonnet 4.5 in Copilot was not helpful and tended to create overly verbose Jupyter Notebooks, so I was not impressed. However, in November, Google <a href="https://blog.google/innovation-and-ai/products/nano-banana-pro/">released</a> Nano Banana Pro, which necessitated an immediate update to <code>gemimg</code> for compatibility with the model. After experimenting with Nano Banana Pro, I discovered that the model can <a href="https://minimaxir.com/2025/12/nano-banana-pro/#grid">create images with arbitrary grids</a> (e.g. 2x2, 3x2), an extremely practical workflow, so I quickly <a href="https://github.com/minimaxir/gemimg/issues/15">wrote a spec</a> to implement support and also slice each subimage out of the grid to save individually. I knew this workflow was relatively simple but tedious to implement using <a href="https://pypi.org/project/pillow/">Pillow</a> shenanigans, so I felt safe enough to ask Copilot to <code>Create a grid.py file that implements the Grid class as described in issue #15</code>, and it did just that, albeit with some errors in areas not mentioned in the spec (e.g. mixing row/column order) that were easily fixed with more specific prompting. Even accounting for handling those errors, that&rsquo;s enough of a material productivity gain to make me more <em>optimistic</em> about agent capabilities, but not nearly enough to become an AI hypester.</p>
<p>In November, just a few days before Thanksgiving, Anthropic <a href="https://www.anthropic.com/news/claude-opus-4-5">released Claude Opus 4.5</a>, and naturally my coworkers were curious whether it was a significant improvement over Sonnet 4.5. It was very suspicious that Anthropic released Opus 4.5 right before a major holiday, since companies typically do that to bury underwhelming announcements while prospective users are too busy gathering with family and friends to notice. Fortunately, I had no friends and no family in San Francisco, so I had plenty of bandwidth to test the new Opus.</p>
<h2 id="a-foreword-on-agentsmd">A Foreword on AGENTS.md</h2>
<p>One aspect of agents I hadn&rsquo;t researched, but knew was necessary for getting good results, was the <a href="https://agents.md">AGENTS.md</a> file: a file which can control specific behaviors of the agent, such as code formatting. If the file is present in the project root, the agent will automatically read it and, in theory, obey all the rules within. This is analogous to system prompts for normal LLM calls, and if you&rsquo;ve been following my writing, you know I have an unhealthy addiction to highly nuanced system prompts with additional shenanigans such as ALL CAPS for increased adherence to more important rules (yes, that&rsquo;s still effective). I could not find a good starting point for a Python-oriented <code>AGENTS.md</code> I liked, so I asked Opus 4.5 to make one:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-md" data-lang="md"><span class="line"><span class="cl">Add an <span class="sb">`AGENTS.md`</span> file oriented for good Python code quality. It should be intricately detailed. More important rules should use caps, e.g. <span class="sb">`MUST`</span>
</span></span></code></pre></div><p>I then added a few more personal preferences and suggested tools from my previous failures working with agents in Python: use <code>uv</code> and <code>.venv</code> instead of the base Python installation, use <code>polars</code> instead of <code>pandas</code> for data manipulation, only store secrets/API keys/passwords in <code>.env</code> while ensuring <code>.env</code> is in <code>.gitignore</code>, etc. Most of these constraints don&rsquo;t tell the agent what to do, but <em>how</em> to do it. In general, adding a rule to my <code>AGENTS.md</code> whenever I encounter a fundamental behavior I don&rsquo;t like has been very effective. For example, agents love using unnecessary emoji which I hate, so I added a rule:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-md" data-lang="md"><span class="line"><span class="cl">**NEVER** use emoji, or unicode that emulates emoji (e.g. ✓, ✗).
</span></span></code></pre></div><p>Agents also tend to leave a lot of redundant code comments, so I added another rule to prevent that:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-md" data-lang="md"><span class="line"><span class="cl">**MUST** avoid including redundant comments which are tautological or self-demonstrating (e.g. cases where it is easily parsable what the code does at a glance or its function name giving sufficient information as to what the code does, so the comment does nothing other than waste user time)
</span></span></code></pre></div><p>My up-to-date <code>AGENTS.md</code> file for Python is available <a href="https://gist.github.com/minimaxir/10b780671ee5d695b4369b987413b38f">here</a>. Throughout my time working with Opus, it has adhered to every rule despite the file&rsquo;s length, and in the instances where I accidentally query an agent without an <code>AGENTS.md</code> present, it&rsquo;s <em>very</em> evident. It would not surprise me if the file is the main differentiator between those getting good and bad results with agents, although success is <a href="https://news.ycombinator.com/item?id=47034087">often mixed</a>.</p>
<p>As a side note if you are using <a href="https://code.claude.com/docs/en/overview">Claude Code</a>, the file must be named <code>CLAUDE.md</code> instead because Anthropic is weird; this blog post will just use <code>AGENTS.md</code> for consistency.</p>
<h2 id="opus-first-contact">Opus First Contact</h2>
<p>With my <code>AGENTS.md</code> file set up, I did more research into proper methods of prompting agents to see if I was missing something that had led to the poor performance I saw with Sonnet 4.5.</p>
<figure>

    <img loading="lazy" srcset="/2026/02/ai-agent-coding/claude_docs_hu_53e14b873c3cfe1e.webp 320w,/2026/02/ai-agent-coding/claude_docs_hu_b0bc0e75f4311cb4.webp 768w,/2026/02/ai-agent-coding/claude_docs_hu_109be808d2b02579.webp 1024w,/2026/02/ai-agent-coding/claude_docs.png 1378w" src="claude_docs.png"
         alt="From the Claude Code quickstart."/> <figcaption>
            <p>From the <a href="https://code.claude.com/docs/en/quickstart">Claude Code quickstart</a>.</p>
        </figcaption>
</figure>

<p>Anthropic&rsquo;s prompt suggestions are simple, but you can&rsquo;t give an LLM an open-ended question like that and expect the results <em>you</em> want! You, the user, are likely subconsciously picky, and there are always functional requirements that the agent won&rsquo;t magically apply because it cannot read minds and behaves as a <a href="https://tvtropes.org/pmwiki/pmwiki.php/Main/LiteralGenie">literal genie</a>. My approach to prompting is to write the potentially-very-large individual prompt in its own Markdown file (which can be tracked in <code>git</code>), then tag the agent with that file and tell it to implement it. Once the work is completed and manually reviewed, I commit the work to <code>git</code>, with the message referencing the specific prompt file so I have good internal tracking.</p>
<figure>

    <img loading="lazy" srcset="/2026/02/ai-agent-coding/implement_hu_85f9ba4bd738ee71.webp 320w,/2026/02/ai-agent-coding/implement.png 574w" src="implement.png"/> 
</figure>

<p>I completely ignored Anthropic&rsquo;s advice and wrote a more elaborate test prompt based on a use case I&rsquo;m familiar with, so I could audit the agent&rsquo;s code quality. In 2021, I wrote a script to <a href="https://github.com/minimaxir/youtube-video-scraper">scrape YouTube video metadata</a> from videos on a given channel using <a href="https://developers.google.com/youtube/v3">YouTube&rsquo;s Data API</a>, but the API is poorly and counterintuitively documented and my Python scripts aren&rsquo;t great. I subscribe to the <a href="https://www.youtube.com/channel/UC9ecwl3FTG66jIKA9JRDtmg">SiIvagunner YouTube account</a> which, as a part of the channel&rsquo;s gimmick (<a href="https://www.youtube.com/watch?v=rEcOzjg7vBU">musical swaps</a> with different melodies than the ones expected), posts hundreds of videos per month with nondescript thumbnails and titles, making it nonobvious which videos are the best other than by view count. The video metadata could be used to surface good videos I missed, so I had a fun idea to test Opus 4.5:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-md" data-lang="md"><span class="line"><span class="cl">Create a robust Python script that, given a YouTube Channel ID, can scrape the YouTube Data API and store all video metadata in a SQLite database. The YOUTUBE_API_KEY is present in <span class="sb">`.env`</span>.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Documentation on the channel endpoint: https://developers.google.com/youtube/v3/guides/implementation/channels
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">The test channel ID to scrape is: <span class="sb">`UC9ecwl3FTG66jIKA9JRDtmg`</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">You MUST obey ALL the FOLLOWING rules in your implementation.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">-</span> Do not use the Google Client SDK. Use the REST API with <span class="sb">`httpx`</span>.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Include sensible aggregate metrics, e.g. number of comments on the video.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Include <span class="sb">`channel_id`</span> and <span class="sb">`retrieved_at`</span> in the database schema.
</span></span></code></pre></div><p>The resulting script is available <a href="https://github.com/minimaxir/youtube_scraper_opus/blob/main/scrape_channel.py">here</a>, and it worked on the first try, scraping up to 20,000 videos (the maximum limit). The script has very Pythonic code quality following the copious rules provided by the <code>AGENTS.md</code>, and it&rsquo;s more robust than my old script from 2021. It is most definitely not the type of output I encountered with Sonnet 4.5. There was one minor issue, however: the logging is implemented naively, such that the API key is leaked in the console. I added a rule to <code>AGENTS.md</code>, but really this is the YouTube API&rsquo;s fault for <a href="https://developers.google.com/youtube/v3/getting-started#example-1">encouraging API keys as parameters in a GET request</a>.</p>
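<p>For illustration, the pattern such a scraper follows can be condensed into a sketch like the one below (my own simplification, not Opus 4.5&rsquo;s output; helper names are hypothetical): resolve the channel&rsquo;s uploads playlist, paginate the <code>playlistItems</code> endpoint with <code>httpx</code>, and redact the API key before logging any URL.</p>

```python
import re

YOUTUBE_API = "https://www.googleapis.com/youtube/v3"


def uploads_playlist_id(channel_id: str) -> str:
    # A channel's full uploads playlist reuses the channel ID with a UU prefix
    # (the channels endpoint's contentDetails also returns this ID).
    assert channel_id.startswith("UC")
    return "UU" + channel_id[2:]


def redact_key(url: str) -> str:
    # The Data API passes the key as a GET parameter, so strip it before logging.
    return re.sub(r"(key=)[^&]+", r"\g<1>REDACTED", url)


def scrape_channel(channel_id: str, api_key: str) -> list[dict]:
    import httpx  # assumed dependency, per the prompt's "use httpx" rule

    videos, page_token = [], None
    while True:
        params = {
            "part": "snippet",
            "maxResults": 50,
            "playlistId": uploads_playlist_id(channel_id),
            "key": api_key,
        }
        if page_token:
            params["pageToken"] = page_token
        resp = httpx.get(f"{YOUTUBE_API}/playlistItems", params=params)
        resp.raise_for_status()
        data = resp.json()
        videos.extend(data.get("items", []))
        page_token = data.get("nextPageToken")
        if not page_token:
            return videos
```

<p>The real script additionally batches video IDs through the <code>videos</code> endpoint for statistics and writes rows into SQLite, but the pagination-and-redaction loop is the core of it.</p>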
<p>I asked a more data-science-oriented followup prompt to test Opus 4.5&rsquo;s skill at data-sciencing:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-md" data-lang="md"><span class="line"><span class="cl">Create a Jupyter Notebook that, using <span class="sb">`polars`</span> to process the data, does a thorough exploratory data analysis of data saved in <span class="sb">`youtube_videos.db`</span>, for all columns.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">This analysis should be able to be extended to any arbitrary input <span class="sb">`channel_id`</span>.
</span></span></code></pre></div><p>The <a href="https://github.com/minimaxir/youtube_scraper_opus/blob/main/eda_youtube.ipynb">resulting Jupyter Notebook</a> is&hellip;indeed thorough. That&rsquo;s on me for specifying &ldquo;for all columns&rdquo;, although it was able to infer the need for temporal analysis (e.g. total monthly video uploads over time) despite not explicitly being mentioned in the prompt.</p>
<p>The monthly analysis gave me an idea: could Opus 4.5 design a small webapp to view the top videos by month? That also gave me the opportunity to test how well Opus 4.5 works with frameworks less popular than React and the other JavaScript component frameworks that LLMs push by default. Here, I&rsquo;ll try <a href="https://fastapi.tiangolo.com">FastAPI</a>, <a href="https://picocss.com">Pico CSS</a> for the front end (because we don&rsquo;t need a JavaScript framework for this), and <a href="https://htmx.org">HTMX</a> for lightweight client/server interactivity:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-md" data-lang="md"><span class="line"><span class="cl">Create a Hacker News-worthy FastAPI application using HTMX for interactivity and PicoCSS for styling to build a YouTube-themed application that leverages <span class="sb">`youtube_videos.db`</span> to create an interactive webpage that shows the top videos for each month, including embedded YouTube videos which can be clicked.
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2026/02/ai-agent-coding/yt_web_app_hu_813072116f12d2de.webp 320w,/2026/02/ai-agent-coding/yt_web_app_hu_1416f19b3e02545d.webp 768w,/2026/02/ai-agent-coding/yt_web_app_hu_488b0400e889f7ac.webp 1024w,/2026/02/ai-agent-coding/yt_web_app.webp 1592w" src="yt_web_app.webp"/> 
</figure>

<p>The FastAPI webapp <a href="https://github.com/minimaxir/youtube_scraper_opus/blob/main/app.py">Python code</a> is good with logical integration of HTMX routes and partials, but Opus 4.5 had fun with the &ldquo;YouTube-themed&rdquo; aspect of the prompt: the video thumbnail simulates a YouTube thumbnail with video duration that loads an embedded video player when clicked! The full code is open-source <a href="https://github.com/minimaxir/youtube_scraper_opus/">in this GitHub repository</a>.</p>
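<p>The key idea behind the HTMX integration is that routes return HTML fragments rather than JSON. As a simplified illustration (my own sketch, not the app&rsquo;s actual code; the helper name and data shape are hypothetical), a pure helper can build the partial that an HTMX request swaps into the page, which a FastAPI route would then return via <code>HTMLResponse</code>:</p>

```python
import html


def render_month_partial(month: str, videos: list[dict]) -> str:
    # Pure helper building the HTML fragment an HTMX request swaps in;
    # in the real app, a FastAPI route returns this string via HTMLResponse.
    cards = "\n".join(
        f'<article><a href="https://www.youtube.com/watch?v={v["video_id"]}">'
        f'{html.escape(v["title"])}</a></article>'
        for v in videos
    )
    return f'<section id="videos-{month}">\n{cards}\n</section>'
```

<p>Keeping the rendering in a pure function like this also makes the partials trivially unit-testable without spinning up the server.</p>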
<p>All of these tests performed far better than I expected given my prior poor experiences with agents. Did I gaslight myself by being an agent skeptic? How did an LLM sent to die finally solve my agent problems? Despite the holiday, X and Hacker News were abuzz with similar stories about the massive difference between Sonnet 4.5 and Opus 4.5, so something <em>did</em> change.</p>
<p>Obviously an API scraper and data viewer alone do not justify an <strong>OPUS 4.5 CHANGES EVERYTHING</strong> declaration on social media, but it&rsquo;s enough to make me less cynical and more optimistic about agentic coding. It&rsquo;s an invitation to continue creating more difficult tasks for Opus 4.5 to solve. From this point forward, I will also switch to Claude Code in the terminal, since my pipeline is simple enough that it doesn&rsquo;t warrant a UI or other shenanigans.</p>
<h2 id="getting-rusty-at-coding">Getting Rusty At Coding</h2>
<p>If you&rsquo;ve spent enough time on programming forums such as Hacker News, you&rsquo;ve probably seen the name &ldquo;Rust&rdquo;, often in the context of snark. <a href="https://rust-lang.org">Rust</a> is a relatively niche compiled programming language that touts two important features: speed, which is evident in <a href="https://www.techempower.com/benchmarks/#section=data-r23">framework benchmarks</a> where it can perform 10x as fast as the fastest Python library, and memory safety enforced at compile time through its ownership and borrowing systems, which mitigates many potential problems. For over a decade, the slogan &ldquo;Rewrite it in Rust&rdquo; <a href="https://transitiontech.ca/random/RIIR">became a meme</a> where advocates argued that <em>everything</em> should be rewritten in Rust due to its benefits, including extremely mature software that&rsquo;s infeasible to actually rewrite in a different language. Even the major LLM companies are looking to Rust to eke out as much performance as possible: OpenAI President Greg Brockman <a href="https://x.com/gdb/status/2007228511363444905">recently tweeted</a> &ldquo;rust is a perfect language for agents, given that if it compiles it&rsquo;s ~correct&rdquo;, which, although silly at a technical level since compiling code can still be <em>logically</em> incorrect, shows that OpenAI is very interested in Rust; and if they&rsquo;re interested in writing Rust code, they need their LLMs to be able to code well in Rust.</p>
<p>I myself am not very proficient in Rust. Rust has a famously excellent <a href="https://rust-lang.org/learn/">interactive tutorial</a>, but a persistent issue with the language is that there are few resources for those with intermediate knowledge: there&rsquo;s little between the tutorial and &ldquo;write an operating system from scratch.&rdquo; That was around 2020, and I decided to wait and see if the ecosystem corrected this gap (as of 2026, it has not), but I&rsquo;ve kept an eye on Hacker News for all the new Rust blog posts and library crates so that one day I too will be able to write the absolutely highest performing code possible.</p>
<p>Historically, LLMs have been poor at generating Rust code due to its nicheness relative to Python and JavaScript. Over the years, one of my test cases for evaluating new LLMs was to ask one to write a relatively simple application such as <code>Create a Rust app that can create &quot;word cloud&quot; data visualizations given a long input text.</code>, but even without expert Rust knowledge I could tell the outputs were too simplistic and half-implemented to ever be functional, even with additional prompting.</p>
<p>However, due to modern LLM post-training paradigms, it&rsquo;s entirely possible that newer LLMs are specifically RLHF-trained to write better code in Rust despite its relative scarcity. I ran more experiments using Opus 4.5 and other LLMs to write Rust for some fun pet projects, and my results were <em>far</em> better than I expected. Here are four such projects:</p>
<h3 id="icon-to-image">icon-to-image</h3>
<p>As someone who primarily works in Python, what first caught my attention about Rust is the <a href="https://pyo3.rs/v0.28.2/">PyO3</a> crate: a crate that allows calling Rust code from Python, with all the speed and memory benefits that entails, while the Python end-user is none the wiser. My first exposure to <code>pyo3</code> was the fast tokenizers in <a href="https://huggingface.co">Hugging Face</a> <a href="https://github.com/huggingface/tokenizers">tokenizers</a>, but many popular Python libraries now also use this pattern for speed, including <a href="https://github.com/ijl/orjson">orjson</a>, <a href="https://docs.pydantic.dev/latest/">pydantic</a>, and my favorite, <a href="https://pola.rs">polars</a>. If agentic LLMs could now write both performant Rust code and leverage the <code>pyo3</code> bridge, that would be <em>extremely</em> useful for me.</p>
<p>I decided to start with a very simple project: a tool that takes icons from an icon font file, such as the ones provided by <a href="https://fontawesome.com">Font Awesome</a>, and renders them into images at any arbitrary resolution.</p>
<figure>

    <img loading="lazy" srcset="/2026/02/ai-agent-coding/icons_header_hu_535677013aed241.webp 320w,/2026/02/ai-agent-coding/icons_header_hu_111233a5bbd61878.webp 768w,/2026/02/ai-agent-coding/icons_header_hu_5495e39cdc67a903.webp 1024w,/2026/02/ai-agent-coding/icons_header.webp 1536w" src="icons_header.webp"/> 
</figure>

<p>I made <a href="https://github.com/minimaxir/icon-image">this exact project</a> in Python in 2021, and it&rsquo;s very hacky, pulling together several packages, and cannot easily be maintained. A better version in Rust with Python bindings is a good way to test Opus 4.5.</p>
<p>The very first thing I did was create an <code>AGENTS.md</code> for Rust by telling Opus 4.5 to port over the Python rules to their Rust semantic equivalents. This worked well enough and had the standard Rust idioms: no <code>.clone()</code> to handle lifetimes poorly, no unnecessary <code>.unwrap()</code>, no <code>unsafe</code> code, etc. Although I am not a Rust expert and cannot vouch that the agent-generated code is idiomatic Rust, none of the Rust code demoed in this blog post has traces of bad Rust code smell. Most importantly, the agent is instructed to call <a href="https://doc.rust-lang.org/stable/clippy/">clippy</a>, Rust&rsquo;s famous linter, after each major change to help keep the code clean, and Opus is good about implementing suggestions from its warnings. My up-to-date Rust <code>AGENTS.md</code> is available <a href="https://gist.github.com/minimaxir/068ef4137a1b6c1dcefa785349c91728">here</a>.</p>
<p>With that, I built a gigaprompt to ensure Opus 4.5 accounted for both the original Python implementation and a few new ideas I had, such as <a href="https://en.wikipedia.org/wiki/Supersampling">supersampling</a> to antialias the output.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-md" data-lang="md"><span class="line"><span class="cl">Create a Rust/Python package (through <span class="sb">`pyo3`</span> and <span class="sb">`maturin`</span>) that efficiently and super-quickly takes an Icon Font and renders an image based on the specified icon. The icon fonts are present in <span class="sb">`assets`</span>, and the CSS file which maps the icon name to the corresponding reference in the icon font is in <span class="sb">`fontawesome.css`</span>.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">You MUST obey ALL the FOLLOWING implementation notes:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">-</span> If the icon name has <span class="sb">`solid`</span> in it, it is referencing <span class="sb">`fa-solid.otf`</span>.
</span></span><span class="line"><span class="cl"><span class="k">-</span> <span class="sb">`fa-brands.otf`</span> and <span class="sb">`fa-regular.otf`</span> can be combined.
</span></span><span class="line"><span class="cl"><span class="k">-</span> The package MUST also support Python (via <span class="sb">`pyo3`</span> and <span class="sb">`maturin`</span>).
</span></span><span class="line"><span class="cl"><span class="k">-</span> The package MUST be able to output the image rendered as an optimized PNG and WEBP, with a default output resolution of 1024 x 1024.
</span></span><span class="line"><span class="cl"><span class="k">-</span> The image rendering MUST support supersampling for antialiased text and points (2x by default)
</span></span><span class="line"><span class="cl"><span class="k">-</span> The package MUST implement <span class="sb">`fontdue`</span> as its text rendering method.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Allow the user to specify the color of the icon and the color of the background (both hex and RGB)
</span></span><span class="line"><span class="cl"><span class="k">-</span> Allow transparent backgrounds.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Allow user to specify the icon size and canvas size separately.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Allow user to specify the anchor positions (horizontal and vertical) for the icon relative to the canvas (default: center and center)
</span></span><span class="line"><span class="cl"><span class="k">-</span> Allow users to specify a horizontal and vertical pixel offset for the icon relative to the canvas.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">After your base implementation is complete, you MUST:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">-</span> Write a comprehensive Python test suite using <span class="sb">`pytest`</span>.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Write a Python Jupyter Notebook
</span></span><span class="line"><span class="cl"><span class="k">-</span> Optimize the Rust binary file size and the Python package file size.
</span></span></code></pre></div><p>It completed the assignment in one shot, accounting for all of the many feature constraints specified. The &ldquo;Python Jupyter Notebook&rdquo; command at the end is how I manually tested whether the <code>pyo3</code> bridge worked, and it worked like a charm. There was one mistake that&rsquo;s my fault, however: I naively chose the <a href="https://github.com/mooman219/fontdue">fontdue</a> Rust crate as the renderer because I remembered <a href="https://github.com/mooman219/fontdue?tab=readme-ov-file#performance">seeing a benchmark</a> showing it was the fastest at text rendering. However, testing large icon generation exposed a flaw: <code>fontdue</code> achieves its speed by only partially rendering curves, which is a very big problem for icons, so I followed up:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-md" data-lang="md"><span class="line"><span class="cl">The generated icons, at a high resolution, show signs of not having curves and instead showing discrete edges (image attached). Investigate the <span class="sb">`fontdue`</span> font renderer to see if there&#39;s an issue there.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">In the event that it&#39;s not possible to fix this in <span class="sb">`fontdue`</span>, investigate using <span class="sb">`ab_glyph`</span> instead.
</span></span></code></pre></div><p>Opus 4.5 used its Web Search tool to confirm the issue is expected with <code>fontdue</code> and implemented <a href="https://crates.io/crates/ab_glyph">ab_glyph</a> instead, which fixed the curves.</p>
<p>icon-to-image is available <a href="https://github.com/minimaxir/icon-to-image">open-source on GitHub</a>. There were around 10 prompts total adding tweaks and polish, but through all of them Opus 4.5 never failed the assignment as written. Of course, generating icon images in Rust-with-Python-bindings is an order of magnitude faster than my old hacky method, and thanks to the better text rendering and supersampling it also looks much better than the Python equivalent.</p>
<p>This pipeline has a secondary pro and con: since the code is compiled, fewer dependencies need to be specified in Python itself; in this package&rsquo;s case, Pillow for image manipulation in Python is optional, and the Python package won&rsquo;t break if Pillow changes its API. The con is that compiling the Rust code into Python wheels is difficult to automate, especially for multiple OS targets: fortunately, GitHub provides <a href="https://docs.github.com/en/actions/concepts/runners/github-hosted-runners">runner VMs</a> for this pipeline, and a little bit of back-and-forth with Opus 4.5 created <a href="https://github.com/minimaxir/icon-to-image/blob/main/.github/workflows/release.yml">a GitHub Workflow</a> which runs the build for all target OSes on publish, so there&rsquo;s no extra effort needed on my end.</p>
<h3 id="word-clouds-in-the-browser">Word Clouds In The Browser</h3>
<p>When I used word clouds in Rust as my test case for LLM Rust knowledge, I had an ulterior motive: I <em>love</em> word clouds. Back in 2019, I open-sourced a Python package titled <a href="https://github.com/minimaxir/stylecloud">stylecloud</a>: a package built on top of Python&rsquo;s wordcloud, with the added ability to apply color gradients and icon-based masks to easily conform the cloud into shapes (sound familiar?)</p>
<figure>

    <img loading="lazy" srcset="/2026/02/ai-agent-coding/stylecloud_banner_hu_7b1ba00b8637a928.webp 320w,/2026/02/ai-agent-coding/stylecloud_banner_hu_e34a5b1f7e15eb9.webp 768w,/2026/02/ai-agent-coding/stylecloud_banner.png 768w" src="stylecloud_banner.png"/> 
</figure>

<p>However, stylecloud was hacky and fragile, and a number of features I wanted to add, such as non-90-degree word rotation, transparent backgrounds, and SVG output, were flat-out impossible due to its dependency on Python&rsquo;s <a href="https://github.com/amueller/word_cloud">wordcloud</a>/<a href="https://matplotlib.org">matplotlib</a>; the package was also very slow. The only way to add the features I wanted was to build something from scratch: Rust fit the bill.</p>
<p>The pipeline was very similar to <code>icon-to-image</code> above: ask Opus 4.5 to fulfill a long list of constraints, with the addition of Python bindings. But there&rsquo;s another thing I wanted to test that would be extremely useful if it worked: WebAssembly (WASM) output with <a href="https://crates.io/crates/wasm-bindgen">wasm-bindgen</a>. Rust code compiled to WASM can run in any modern web browser with the speed benefits intact: no dependencies needed, and therefore it should be future-proof. However, there&rsquo;s a problem: I would have to design an interface, and I am not a front-end person. I say without hyperbole that, for me, designing even a simple HTML/CSS/JS front end for a project is more stressful than training an AI. However, Opus 4.5 is able to take general guidelines and turn them into something workable: I first told it to use Pico CSS and vanilla JavaScript, and that was enough, but then I had the idea to tell it to use <a href="https://ui.shadcn.com">shadcn/ui</a> — a minimalistic design system normally used with React — along with screenshots from that website as examples. That also worked.</p>
<figure>

    <img loading="lazy" srcset="/2026/02/ai-agent-coding/wordcloud_rust_ui_hu_d89a5fdfc340adda.webp 320w,/2026/02/ai-agent-coding/wordcloud_rust_ui_hu_32bf6094abc7a9dc.webp 768w,/2026/02/ai-agent-coding/wordcloud_rust_ui_hu_9eabb4297ecaf812.webp 1024w,/2026/02/ai-agent-coding/wordcloud_rust_ui.webp 1251w" src="wordcloud_rust_ui.webp"/> 
</figure>

<p>After more back-and-forth with design nitpicks and more features to add, the package is feature complete. However, it needs some more polish and a more unique design before I can release it, and I got sidetracked by <em>something</em> more impactful&hellip;</p>
<h3 id="miditui">miditui</h3>
<p><code>Create a music player in the terminal using Rust</code> was another Rust stress test I gave to LLMs: command-line terminals can&rsquo;t play audio, right? Turns out, they can with the <a href="https://crates.io/crates/rodio">rodio</a> crate. Given the success so far with Opus 4.5, I decided to make the tasks more difficult: terminals can play sound, but can they <em>compose</em> sound? So I asked Opus 4.5 to create a MIDI composer and playback DAW within a terminal, which worked. Adding features forced me to learn more about how MIDI files and <a href="https://en.wikipedia.org/wiki/SoundFont">SoundFonts</a> actually work, so it was also educational!</p>
<figure>

    <img loading="lazy" srcset="/2026/02/ai-agent-coding/miditui_hu_1810d138c3702778.webp 320w,/2026/02/ai-agent-coding/miditui_hu_e13017cd0287782e.webp 768w,/2026/02/ai-agent-coding/miditui_hu_ddae22b14b865cdf.webp 1024w,/2026/02/ai-agent-coding/miditui.webp 1582w" src="miditui.webp"/> 
</figure>

<p>miditui is available <a href="https://github.com/minimaxir/miditui">open-sourced on GitHub</a>, and the prompts used to build it are <a href="https://github.com/minimaxir/miditui/blob/main/agent_notes/PROMPTS.md">here</a>.</p>
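<p>Some of that MIDI education is easy to sketch: pitch in MIDI is just an integer note number with a fixed relationship to frequency, and a SoundFont maps those note numbers to instrument samples. A minimal Python sketch (not code from miditui, just the standard equal-temperament math):</p>

```python
import math

# MIDI represents pitch as an integer note number (0-127): A4 (440 Hz)
# is note 69 and each semitone is one step. A SoundFont then maps those
# note numbers to actual instrument samples for playback.

def note_to_frequency(note: int) -> float:
    """Equal-temperament frequency in Hz for a MIDI note number."""
    return 440.0 * 2 ** ((note - 69) / 12)

def frequency_to_note(freq: float) -> int:
    """Nearest MIDI note number for a frequency in Hz."""
    return round(69 + 12 * math.log2(freq / 440.0))

print(note_to_frequency(60))   # middle C, ~261.63 Hz
print(frequency_to_note(880))  # A5 -> note 81
```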
<p>During development I encountered a caveat: Opus 4.5 can&rsquo;t test or view terminal output, especially output with unusual functional requirements. But despite being blind, it knew enough about the <a href="https://ratatui.rs">ratatui</a> terminal framework to implement whatever UI changes I asked for. There were a large number of UI bugs likely caused by Opus&rsquo;s inability to create test cases, namely failures to account for scroll offsets resulting in incorrect click locations. As someone who spent 5 years as a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a> Software QA Engineer unable to review the underlying code, this situation was my specialty. I put my QA skills to work by messing around with <code>miditui</code> and telling Opus about any errors, occasionally with a screenshot, and it was able to fix them easily. I do not believe these bugs show LLM agents to be inherently better or worse than humans: humans are most definitely capable of making the same mistakes. Even though I am adept at finding the bugs and offering solutions, I don&rsquo;t believe I would inherently avoid causing similar bugs were I to code such an interactive app without AI assistance: QA brain is different from software engineering brain.</p>
<h3 id="ballin">ballin</h3>
<p>One night — after a glass of wine — I had another idea: one modern trick with <a href="https://en.wikipedia.org/wiki/ASCII_art">ASCII art</a> is the use of <a href="https://www.unicode.org/charts/nameslist/c_2800.html">Braille Unicode characters</a> to allow for <a href="https://steamcommunity.com/sharedfiles/filedetails/?id=2807089604">very high detail</a>. That reminded me of ball physics simulations, so what about building a full physics simulator, also in the terminal? I asked Opus 4.5 to create a terminal physics simulator with the <a href="https://rapier.rs">rapier</a> 2D physics engine and a detailed explanation of the Braille character trick: this time Opus did better and completed it in one shot, so I spent more time making it colorful and <em>fun</em>. I pessimistically thought the engine would only be able to handle a few hundred balls: instead, the Rust codebase can handle over 10,000 logical balls!</p>
<figure>

    <img loading="lazy" srcset="/2026/02/ai-agent-coding/ballin_hu_5dd8a55c77035491.webp 320w,/2026/02/ai-agent-coding/ballin_hu_f7df7c2ac2073cf9.webp 768w,/2026/02/ai-agent-coding/ballin_hu_37a706f42d6228a6.webp 1024w,/2026/02/ai-agent-coding/ballin.webp 1909w" src="ballin.webp"
         alt="I explicitly prompted Opus to make the Colors button have a different color for each letter."/> <figcaption>
            <p>I explicitly prompted Opus to make the Colors button have a different color for each letter.</p>
        </figcaption>
</figure>

<p>ballin is available <a href="https://github.com/minimaxir/ballin">open-sourced on GitHub</a>, and the prompts used to build it are <a href="https://github.com/minimaxir/ballin/blob/main/PROMPTS.md">here</a>.</p>
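<p>The Braille trick mentioned above is straightforward to demonstrate: each character in the U+2800 block encodes a 2-wide by 4-tall grid of dots, so one terminal cell carries eight &ldquo;pixels.&rdquo; A minimal Python sketch of the mapping (ballin itself is Rust; the bit layout follows the Unicode Braille pattern chart):</p>

```python
# Each character in the Unicode Braille block (U+2800-U+28FF) encodes a
# 2-wide x 4-tall dot grid, so one terminal cell can act as 8 "pixels".
# Bit offsets per (row, col) follow the Unicode Braille pattern layout.
BRAILLE_BITS = [
    [0x01, 0x08],
    [0x02, 0x10],
    [0x04, 0x20],
    [0x40, 0x80],  # bottom row uses the 8-dot extension bits
]

def cell_to_braille(pixels) -> str:
    """pixels: 4 rows x 2 cols of truthy values -> one Braille character."""
    code = 0x2800  # blank Braille pattern
    for r in range(4):
        for c in range(2):
            if pixels[r][c]:
                code |= BRAILLE_BITS[r][c]
    return chr(code)

print(cell_to_braille([[1, 1]] * 4))          # fully lit cell: "⣿"
print(cell_to_braille([[1, 0], [0, 1]] * 2))  # checkerboard pattern
```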
<p>The <code>rapier</code> crate also published a blog post highlighting a <a href="https://dimforge.com/blog/2026/01/09/the-year-2025-in-dimforge">major change to its underlying math engine</a> in its 0.32.0 version, so I asked Opus 4.5 to upgrade to that version&hellip;and it caused crashes, yet tracing the errors showed they originated within <code>rapier</code> itself. Upgrading to 0.31.0 instead was fine with no issues. A consequence of only using agentic coding for this workflow is that I cannot construct a minimal reproducible test case to file as a regression bug report, or isolate whether the crash is a side effect of a new API not well known by Opus 4.5.</p>
<p>The main lesson I learned from working on these projects is that agents work best when you have <a href="https://www.youtube.com/watch?v=W9_iQ1FSnp8">approximate knowledge of many things</a>, with enough domain expertise to know what should and should not work. Opus 4.5 is good enough to let me finally do side projects where I know precisely what I want but not necessarily how to implement it. These specific projects aren&rsquo;t the Next Big Thing™ that justifies the existence of an industry taking billions of dollars in venture capital, but they make my life better, and since they are open source, hopefully they make someone else&rsquo;s life better too. However, I still wanted to push agents to do more impactful things in an area that might be more worth it.</p>
<h2 id="its-not-ai-psychosis-if-it-works">It&rsquo;s Not AI Psychosis If It Works</h2>
<p>Before I wrote my blog post about how I use LLMs, I wrote a tongue-in-cheek blog post titled <a href="https://minimaxir.com/2025/01/write-better-code/">Can LLMs write better code if you keep asking them to “write better code”?</a>, which is exactly what the name suggests. It was an experiment to determine how LLMs interpret the ambiguous command &ldquo;write better code&rdquo;: in this case, they prioritized making the code more convoluted by adding more helpful features, but if instead given commands to optimize the code, they did successfully make the code faster, albeit at a significant cost to readability. In software engineering, one of the greatest sins is <a href="https://stackify.com/premature-optimization-evil/">premature optimization</a>, where you sacrifice code readability, and thus maintainability, to chase performance gains that slow down development time and may not be worth it. Buuuuuuut with agentic coding, we implicitly accept that our interpretation of the code is fuzzy: could agents iteratively applying optimizations for the sole purpose of minimizing benchmark runtime — and therefore producing faster code in typical use cases, if said benchmarks are representative — now actually be a good idea? People complain about how AI-generated code is slow, but if AI can now reliably generate <em>fast</em> code, that changes the debate.</p>
<figure>

    <img loading="lazy" srcset="/2026/02/ai-agent-coding/div255_hu_fede5dfdf9da043c.webp 320w,/2026/02/ai-agent-coding/div255_hu_9580dbba4bb4392a.webp 768w,/2026/02/ai-agent-coding/div255_hu_f1422dc2ad5bbb29.webp 1024w,/2026/02/ai-agent-coding/div255.png 1104w" src="div255.png"
         alt="Multiplication and division are too slow for Opus 4.6."/> <figcaption>
            <p>Multiplication and division are too slow for Opus 4.6.</p>
        </figcaption>
</figure>
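<p>For optimization-by-benchmark to mean anything, the measurement itself has to be trustworthy. A minimal Python sketch of the measurement side, stdlib only and with function names of my own invention, that takes the minimum across repeats to reduce scheduler noise:</p>

```python
import timeit

def bench(fn, *args, repeats: int = 5, number: int = 200) -> float:
    """Best per-call runtime in seconds; min() over repeats discards
    runs slowed down by other processes or the scheduler."""
    timer = timeit.Timer(lambda: fn(*args))
    return min(timer.repeat(repeat=repeats, number=number)) / number

# Example: the same reduction, with and without generator overhead.
data = list(range(10_000))
baseline = bench(lambda xs: sum(x for x in xs), data)
optimized = bench(sum, data)
print(f"speedup: {baseline / optimized:.1f}x")
```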

<p>As a data scientist, I&rsquo;ve been frustrated that there haven&rsquo;t been any impactful new Python data science tools released in the past few years other than <code>polars</code>. Unsurprisingly, research into AI and LLMs has subsumed traditional data science research, where developments such as text embeddings have had <a href="https://minimaxir.com/2025/02/embeddings-parquet/">extremely valuable gains</a> for typical data science natural language processing tasks. The traditional machine learning algorithms are still valuable, but no one has invented <a href="https://developers.google.com/machine-learning/decision-forests/intro-to-gbdt">Gradient Boosted Decision Trees</a> 2: Electric Boogaloo. Additionally, as a data scientist in San Francisco I am legally required to use a MacBook, but there haven&rsquo;t been data science utilities that actually use the GPU in an Apple Silicon MacBook, as they don&rsquo;t support its Metal API; GPU data science tooling is exclusively CUDA for NVIDIA GPUs. What if agents could port these algorithms to a) run in Rust with Python bindings for the speed benefits and b) run on GPUs without complex dependencies?</p>
<p>This month, OpenAI announced their <a href="https://openai.com/index/introducing-the-codex-app/">Codex app</a> and my coworkers were asking questions. So I downloaded it, and as a test case for the GPT-5.2-Codex (high) model, I asked it to reimplement the <a href="https://umap-learn.readthedocs.io/en/latest/">UMAP algorithm</a> in Rust. UMAP is a dimensionality reduction technique that can take a high-dimensional matrix of data and simultaneously cluster and visualize it in lower dimensions. However, it is a very computationally intensive algorithm, and the only tool that can do it quickly is NVIDIA&rsquo;s <a href="https://github.com/rapidsai/cuml">cuML</a>, which requires CUDA dependency hell. If I could create a UMAP package in Rust that&rsquo;s superfast with minimal dependencies, that would be a <em>massive</em> productivity gain for the type of work I do, and could enable fun applications if fast enough.</p>
<p>After OpenAI <a href="https://openai.com/index/introducing-gpt-5-3-codex/">released</a> GPT-5.3-Codex (high), which performed substantially better and faster at these types of tasks than GPT-5.2-Codex, I asked Codex to write a UMAP implementation from scratch in Rust, which at a glance seemed to work and gave reasonable results. I also instructed it to create benchmarks that test a wide variety of representative input matrix sizes. Rust has a popular benchmarking crate, <a href="https://crates.io/crates/criterion">criterion</a>, which outputs benchmark results in an easy-to-read format that, most importantly, agents can easily parse.</p>
<figure>

    <img loading="lazy" srcset="/2026/02/ai-agent-coding/criterion_hu_29440b9b440b97ea.webp 320w,/2026/02/ai-agent-coding/criterion_hu_3835e7f90db1f611.webp 768w,/2026/02/ai-agent-coding/criterion_hu_c07d0baf8af59328.webp 1024w,/2026/02/ai-agent-coding/criterion.png 1300w" src="criterion.png"
         alt="Example output from criterion."/> <figcaption>
            <p>Example output from <code>criterion</code>.</p>
        </figcaption>
</figure>
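<p>Part of why this loop works is that the agent can read the numbers programmatically rather than scraping terminal output. As an illustration only, here is a small Python sketch that collects mean runtimes; the directory layout and JSON keys are my assumptions about criterion&rsquo;s default on-disk output, where each benchmark writes a <code>new/estimates.json</code> with a mean point estimate in nanoseconds:</p>

```python
import json
from pathlib import Path

def criterion_means(root: str = "target/criterion") -> dict:
    """Collect each benchmark's mean runtime (point estimate, in ns)
    from criterion's estimates.json files. The layout and keys here
    are assumptions about criterion's default output format."""
    results = {}
    for est in Path(root).glob("**/new/estimates.json"):
        data = json.loads(est.read_text())
        # est is <root>/<benchmark>/new/estimates.json
        results[est.parent.parent.name] = data["mean"]["point_estimate"]
    return results
```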

<p>At first glance, the benchmarks and their construction looked good (i.e. no cheating), and the results are much faster than working with UMAP in Python. To test further, I asked the agents to implement additional useful machine learning algorithms such as HDBSCAN as individual projects, with each repo starting with this eight-prompt plan in sequence:</p>
<ol>
<li>Implement the package with the specific functional requirements and design goals; afterwards, create benchmarks with specific matrix sizes that are representative of typical use cases</li>
<li>Do a second pass to clean up the code/comments and make further optimizations</li>
<li>Scan the crate to find areas of algorithmic weaknesses in extreme cases, and write a sentence for each describing the problem, the potential solution, and quantifying the impact of the solution</li>
<li>Leveraging the findings found, optimize the crate such that ALL benchmarks run 60% or quicker (1.4x faster). Use any techniques to do so, and repeat until benchmark performance converges, but don&rsquo;t game the benchmarks by overfitting on the benchmark inputs alone <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></li>
<li>Create custom tuning profiles that take advantage of the inherent quantities of the input data and CPU thread saturation/scheduling/parallelization to optimize the crate such that ALL benchmarks run 60% or quicker (1.4x faster). You can use the <a href="https://crates.io/crates/flamegraph">flamegraph</a> crate to help with the profiling</li>
<li>Add Python bindings using <code>pyo3</code> 0.27.2 and <code>maturin</code>, with relevant package-specific constraints (specifying the <code>pyo3</code> version is necessary to ensure compatibility with Python 3.10+)</li>
<li>Create corresponding benchmarks in Python, and write a comparison script between the Python bindings and an existing Python package</li>
<li>Accuse the agent of potentially cheating on its algorithm implementation while pursuing its optimizations, then tell it to optimize for similarity of outputs against a known good implementation (e.g. for a regression task, minimize the <a href="https://en.wikipedia.org/wiki/Mean_absolute_error">mean absolute error</a> in predictions between the two approaches)</li>
</ol>
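<p>The accuracy check in the final step is simple in principle: score the agent&rsquo;s implementation against a known good one and drive the error toward zero. A minimal numpy sketch (the function name is mine, not from any of these repos):</p>

```python
import numpy as np

def mean_absolute_error(reference: np.ndarray, candidate: np.ndarray) -> float:
    """MAE between a known-good implementation's predictions and the
    agent-written one; the agent is told to drive this toward zero."""
    return float(np.mean(np.abs(reference - candidate)))

# Hypothetical predictions from the reference and agentic implementations:
ref = np.array([0.12, 0.87, 0.45])
agent = np.array([0.11, 0.88, 0.45])
print(mean_absolute_error(ref, agent))  # ~0.00667
```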
<p>The simultaneous constraints of code quality requirements via <code>AGENTS.md</code>, speed requirements with a quantifiable target objective, and an output accuracy/quality requirement do consistently succeed at finding meaningful speedups (at least 2x-3x).</p>
<figure>

    <img loading="lazy" srcset="/2026/02/ai-agent-coding/pca_benchmark_codex_hu_89818e863160d0c7.webp 320w,/2026/02/ai-agent-coding/pca_benchmark_codex_hu_1cda151be1d34818.webp 768w,/2026/02/ai-agent-coding/pca_benchmark_codex_hu_c3c6231b591a4dd0.webp 1024w,/2026/02/ai-agent-coding/pca_benchmark_codex.png 1366w" src="pca_benchmark_codex.png"
         alt="Codex 5.3 after optimizing a principal component analysis implementation."/> <figcaption>
            <p>Codex 5.3 after optimizing a <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">principal component analysis</a> implementation.</p>
        </figcaption>
</figure>

<p>I&rsquo;m not content with only 2-3x speedups: nowadays in order for this agentic code to be meaningful and not just another repo on GitHub, it has to be the <em>fastest implementation possible</em>. In a moment of sarcastic curiosity, I tried to see if Codex and Opus had different approaches to optimizing Rust code by chaining them:</p>
<ol>
<li>Instruct Codex to optimize benchmarks to 60% of runtime</li>
<li>Instruct Opus to optimize benchmarks to 60% of runtime</li>
<li>Instruct Opus to minimize differences between agentic implementation and known good implementation without causing more than a 5% speed regression on any benchmarks</li>
</ol>
<p><em>This works</em>. From my tests with the algorithms, Codex can often speed up the algorithm by 1.5x-2x, then Opus somehow speeds up that optimized code <em>again</em> to a greater degree. This has been the case for all the Rust code I&rsquo;ve tested: I also ran the <code>icon-to-image</code> and word cloud crates through this pipeline and gained a cumulative 6x speed increase in both libraries.</p>
<p>Can these agent-benchmaxxed implementations actually beat the existing machine learning algorithm libraries, despite those libraries already being written in a low-level language such as C/C++/Fortran? Here are the results on my personal MacBook Pro comparing the CPU benchmarks of the Rust implementations of various computationally intensive ML algorithms against their respective popular implementations. The agentic Rust results are within similarity tolerance of the battle-tested implementations, and the Python packages are compared against the Python bindings of the agent-coded Rust packages:</p>
<ul>
<li>UMAP: 2-10x faster than Rust&rsquo;s <a href="https://crates.io/crates/fast-umap">fast-umap</a>, 9-30x faster than Python&rsquo;s <a href="https://umap-learn.readthedocs.io/en/latest/">umap</a></li>
<li>HDBSCAN (clustering algorithm): 23-100x faster than the <a href="https://crates.io/crates/hdbscan">hdbscan</a> Rust crate, 3x-10x faster than Python&rsquo;s <a href="https://pypi.org/project/hdbscan/">hdbscan</a></li>
<li>GBDT (tree-boosting algorithm): 1.1x-1.5x faster fit/predict than the <a href="https://crates.io/crates/treeboost">treeboost</a> Rust crate<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>, 24-42x faster fit/1-5x faster predict than Python&rsquo;s <a href="https://xgboost.readthedocs.io/en/stable/index.html">xgboost</a></li>
</ul>
<p>I&rsquo;ll definitely take those results given this unoptimized prompting pipeline! In all cases, the GPU benchmarks are unsurprisingly even better, and with <a href="https://crates.io/crates/wgpu">wgpu</a> and added WGSL shaders the code runs on Metal without any additional dependencies. However, further testing is needed, so I can&rsquo;t report numbers just yet.</p>
<p>Although I could push these new libraries to GitHub now, machine learning algorithms are understandably a domain which requires extra care and testing. It would be arrogant to port Python&rsquo;s <a href="https://scikit-learn.org/stable/">scikit-learn</a> — the gold standard of data science and machine learning libraries — to Rust with all the features that implies.</p>
<p>But that&rsquo;s unironically a good idea, so I decided to try to do it anyway. With the use of agents, I am now developing <code>rustlearn</code> (extreme placeholder name), a Rust crate that implements not only fast implementations of the standard machine learning algorithms such as <a href="https://en.wikipedia.org/wiki/Logistic_regression">logistic regression</a> and <a href="https://en.wikipedia.org/wiki/K-means_clustering">k-means clustering</a>, but also the fast implementations of the algorithms above: the same three-step pipeline I describe above still works to beat scikit-learn&rsquo;s implementations, even for the simpler algorithms. This crate can therefore receive Python bindings and even expand to the Web/JavaScript and beyond. This also gives me the opportunity to add quality-of-life features to resolve grievances I&rsquo;ve had to work around as a data scientist, such as model serialization and native integration with pandas/polars DataFrames. I hope this use case is considered more practical and complex than making a ball physics terminal app.</p>
<p>Many people reading this will call bullshit on the performance improvement metrics, and honestly, fair. I too thought the agents would stumble in hilarious ways trying, but they did not. To demonstrate that I am not bullshitting, I also decided to release a simpler Rust-with-Python-bindings project today: nndex, an in-memory vector &ldquo;store&rdquo; designed to retrieve exact nearest neighbors as fast as possible (with fast approximate NN too), now available <a href="https://github.com/minimaxir/nndex">open-sourced on GitHub</a>. This leverages the <a href="https://en.wikipedia.org/wiki/Dot_product">dot product</a>, one of the simplest matrix ops and therefore heavily optimized by existing libraries such as Python&rsquo;s <a href="https://numpy.org">numpy</a>&hellip;and yet after a few optimization passes, it tied <code>numpy</code> even though <code>numpy</code> leverages <a href="https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms">BLAS</a> libraries for maximum mathematical performance. Naturally, I instructed Opus to also add BLAS support with more optimization passes, and it is now 1-5x numpy&rsquo;s speed in the single-query case and much faster with batch prediction.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> It&rsquo;s so fast that even though I also added GPU support for testing, the GPU is mostly ineffective below 100k rows because the dispatch overhead is greater than the actual retrieval time.</p>
<figure>

    <img loading="lazy" srcset="/2026/02/ai-agent-coding/nndex_hu_37580e348a0481f6.webp 320w,/2026/02/ai-agent-coding/nndex_hu_46b261ee60d7142f.webp 768w,/2026/02/ai-agent-coding/nndex_hu_95b38eb803cac099.webp 1024w,/2026/02/ai-agent-coding/nndex.png 1564w" src="nndex.png"
         alt="Comparison of Python nndex to numpy on test workloads. topk_overlap measures result matches (perfect match) and max_similarity_abs_delta measures the largest difference between calculated cosine similarities (effectively zero)."/> <figcaption>
            <p>Comparison of Python <code>nndex</code> to numpy on test workloads. <code>topk_overlap</code> measures result matches (perfect match) and <code>max_similarity_abs_delta</code> measures the largest difference between calculated cosine similarities (effectively zero).</p>
        </figcaption>
</figure>
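<p>None of this is nndex&rsquo;s actual implementation, but the core of exact nearest-neighbor retrieval is compact enough to sketch in numpy: one matrix-vector product for the scores, then argpartition to pull the top k without paying for a full sort:</p>

```python
import numpy as np

def top_k_exact(index: np.ndarray, query: np.ndarray, k: int = 5):
    """Exact nearest neighbors by dot product: one matrix-vector product
    for all scores, then argpartition (O(n)) instead of a full sort."""
    scores = index @ query
    top = np.argpartition(scores, -k)[-k:]     # unordered top-k indices
    top = top[np.argsort(scores[top])[::-1]]   # best-first ordering
    return top, scores[top]

rng = np.random.default_rng(42)
vectors = rng.normal(size=(10_000, 64)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit norm: dot = cosine
indices, similarities = top_k_exact(vectors, vectors[0], k=3)
print(indices[0])  # 0: the query is its own nearest neighbor
```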

<p>One of the criticisms about AI generated code is that it &ldquo;just regurgitates everything on GitHub&rdquo; but by construction, if the code is faster than what currently exists, then it can&rsquo;t have been stolen and must be an original approach. Even if the explicit agentic nature of <code>rustlearn</code> makes it risky to adopt downstream, the learnings from how it accomplishes its extreme speed are still valuable.</p>
<h2 id="the-implications-of-my-agentic-successes">The Implications of My Agentic Successes</h2>
<p>Like many who have hopped onto the agent train post-Opus 4.5, I&rsquo;ve become nihilistic over the past few months, but not for the typical reasons. I actually am not hitting burnout, and I am not worried that my programming skills are decaying due to agents: on the contrary, the session limits intended to stagger server usage have unintentionally caused me to form a habit of coding for fun an hour every day, implementing new ideas as they come. However, is there a <em>point</em> to me writing this blog post and working on these libraries if people will likely just reply &ldquo;tl;dr AI slop&rdquo; and &ldquo;it&rsquo;s vibecoded so it&rsquo;s automatically bad&rdquo;?</p>
<p>The really annoying thing about Opus 4.6/Codex 5.3 is that it&rsquo;s impossible to publicly say &ldquo;Opus 4.5 (and the models that came after it) are an order of magnitude better than coding LLMs released just months before it&rdquo; without sounding like an AI hype booster clickbaiting, but it&rsquo;s the counterintuitive truth, to my personal frustration. I have been trying to break this damn model by giving it complex tasks that would take me months to do by myself despite my coding pedigree, but Opus and Codex keep doing them correctly. On Hacker News I was <a href="https://news.ycombinator.com/item?id=46979055">accused of said clickbaiting</a> when making a similar statement, with replies amounting to &ldquo;I haven&rsquo;t had success with Opus 4.5, so you must be lying.&rdquo; The remedy to this skepticism is to provide more evidence in addition to greater checks and balances, but what can you do if people refuse to believe your evidence?</p>
<p>A year ago, I was one of those skeptics who was very suspicious of the agentic hype, but I was willing to change my priors in light of new evidence and experiences, which apparently is rare. Generative AI discourse has become too toxic and its discussions always end the same way, so I have been experimenting with touching grass instead, and it is nice. At this point, if I&rsquo;m not confident that I can please anyone with my use of AI, then I&rsquo;ll take solace in just pleasing myself: continue open-sourcing my projects, keep writing blog posts, and let the pieces fall where they may. If you want to follow along or learn when <code>rustlearn</code> releases, you can follow me <a href="https://bsky.app/profile/minimaxir.bsky.social">on Bluesky</a>.</p>
<p>Moment of introspection aside, I&rsquo;m not sure what the future holds for agents and generative AI. My use of agents has proven to have significant utility (for myself at the least), and I have more than enough high-impact projects in the pipeline to occupy me for a few months. Although I will certainly use LLMs more for coding apps which benefit from this optimization, that doesn&rsquo;t imply I will use LLMs more elsewhere: I still don&rsquo;t use LLMs for writing — in fact, I have intentionally made my writing voice more sardonic specifically to fend off AI accusations.</p>
<p>With respect to Rust, working with agents and seeing how the agents make decisions/diffs has actually helped me break out of the intermediate Rust slog and taught me a lot about the ecosystem by taking on more ambitious projects that required me to research and identify effective tools for modern Rust development. Even though I have <em>technically</em> released Rust packages with many stars on GitHub, I have no intention of putting Rust as a professional skill on my LinkedIn or my résumé. As an aside, how exactly do résumés work in an agentic coding world? Would &ldquo;wrote many open-source libraries through the use of agentic LLMs which increased the throughput of popular data science/machine learning algorithms by an order of magnitude&rdquo; be disqualifying to a prospective employer as they may think I&rsquo;m cheating and faking my expertise?</p>
<p>My obligation as a professional coder is to do what works best, especially for open source code that other people will use. Agents are another tool in that toolbox with their own pros and cons. If you&rsquo;ve had poor experiences with agents before last November, I strongly urge you to give modern agents another shot, especially with an <code>AGENTS.md</code> tailored to your specific coding domain and nuances (again, here are my <a href="https://gist.githubusercontent.com/minimaxir/10b780671ee5d695b4369b987413b38f/raw/f06ad4f1430a8d9f268b160a755dab817384c93c/AGENTS.md">Python</a> and <a href="https://gist.githubusercontent.com/minimaxir/068ef4137a1b6c1dcefa785349c91728/raw/0fa5d1b505338b3a2c6834cc41e728cefe57511b/AGENTS.md">Rust</a> files, in convenient copy/paste format).</p>
<p>Overall, I&rsquo;m very sad at the state of agentic discourse but also very excited at its promise: it&rsquo;s currently unclear which one is the stronger emotion.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Two subtle ways agents can implicitly negatively affect the benchmark results but wouldn&rsquo;t be considered cheating/gaming it are a) implementing a form of caching so the benchmark tests are not independent and b) launching benchmarks in parallel on the same system. I eventually added <code>AGENTS.md</code> rules to ideally prevent both.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>The <code>treeboost</code> crate beat the agent-optimized GBT crate by 4x on my first comparison test, at which I naturally took offense: I asked Opus 4.6 to &ldquo;Optimize the crate such that <code>rust_gbt</code> wins in ALL benchmarks against <code>treeboost</code>&rdquo; and it did just that.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>Currently, only the macOS build has BLAS support as Win/Linux BLAS support is a rabbit hole that needs more time to investigate. On those platforms, numpy does win, but that won&rsquo;t be the case for long!&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Please Don&#39;t Ask if an Open Source Project is Dead</title>
      <link>https://minimaxir.com/2023/11/open-source-dead-github/</link>
      <pubDate>Tue, 14 Nov 2023 08:45:00 -0800</pubDate>
      <guid>https://minimaxir.com/2023/11/open-source-dead-github/</guid>
      <description>The best-case scenario is that you annoy the maintainers.</description>
      <content:encoded><![CDATA[<p>Over the past few months, I&rsquo;ve had an <a href="https://minimaxir.com/2023/10/ai-sturgeons-law/">existential crisis</a> about <a href="https://github.com/minimaxir">my work</a> in open source AI on <a href="https://github.com">GitHub</a>, particularly as there has been both increasingly toxic backlash against AI and because the AI industry has been evolving so rapidly that I flat-out don&rsquo;t have enough bandwidth to keep up. I took a break from working on my projects during that time, which <em>should</em> have been fine. One of my latest open source projects is <a href="https://github.com/minimaxir/simpleaichat">simpleaichat</a>, a Python package with 3k GitHub Stars for interfacing with <a href="https://chat.openai.com">ChatGPT</a>, and it was explicitly designed with limited scope and minimal dependencies so that I could take a break from development without my code&hellip;breaking.</p>
<p>After I was in a good place mentally to resume my open source work, I glanced at the GitHub Issues for simpleaichat and saw someone had filed an issue simply titled &ldquo;has this been abandoned?&rdquo;, with another GitHub user following up with &ldquo;With all due respect, I am also interested in the answer.&rdquo;</p>
<p>What the hell? I panicked and checked whether there was a new breaking change or dependency issue, and there wasn&rsquo;t one.</p>
<p>Two days later, someone else filed another issue: &ldquo;Is this package still in ongoing development?&rdquo;:</p>
<figure>

    <img loading="lazy" srcset="/2023/11/open-source-dead-github/github_hu_c217d25d3ea425f.webp 320w,/2023/11/open-source-dead-github/github_hu_54b502d5896b91e8.webp 768w,/2023/11/open-source-dead-github/github_hu_68c5c291f2dbca36.webp 1024w,/2023/11/open-source-dead-github/github.webp 1680w" src="github.webp"/> 
</figure>

<p>To be perfectly clear, this absolutely is applying pressure and being rude.</p>
<h2 id="the-expectations-of-open-source-software-development">The Expectations of Open Source Software Development</h2>
<p>I&rsquo;ve never seen any discussions or articles about whether it&rsquo;s appropriate to ask if an open source repository is dead. Is there an implicit contract to actively maintain any open source software you publish? Are you obligated to provide free support if you hit a certain star amount on GitHub or ask for funding through GitHub Sponsorships/Patreon? After all, most permissive open source code licenses like the <a href="https://en.wikipedia.org/wiki/MIT_License">MIT License</a> contain some variant of &ldquo;the software is provided &lsquo;as is&rsquo;, without warranty of any kind.&rdquo;</p>
<p>simpleaichat regretfully isn&rsquo;t my first open source project to attract complaints like this. The <a href="https://github.com/minimaxir/big-list-of-naughty-strings">Big List of Naughty Strings</a>, a collection of adversarial user-input text strings that I pushed to GitHub about a decade ago, is essentially just a <code>txt</code> <a href="https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txt">file</a> with 45k GitHub Stars. It will never have dependency issues, and additions that don&rsquo;t target a distinct string-handling issue would clutter the list more than it already is, so I&rsquo;m hesitant to accept every pull request. Despite that, people are angry.</p>
<figure>

    <img loading="lazy" srcset="/2023/11/open-source-dead-github/blns_hu_a495e97171a8cbd6.webp 320w,/2023/11/open-source-dead-github/blns_hu_13c59ba2feb4dd51.webp 768w,/2023/11/open-source-dead-github/blns_hu_b719603c68ede158.webp 1024w,/2023/11/open-source-dead-github/blns.webp 1454w" src="blns.webp"
         alt="The duality of comment reactions."/> <figcaption>
            <p>The duality of comment reactions.</p>
        </figcaption>
</figure>

<p>Some seem to think that there&rsquo;s such a thing as GitHub Issue-zero or pull request-zero, which like <a href="https://www.techtarget.com/whatis/definition/inbox-zero">inbox-zero</a> is infeasible in practice due to the realities of professional life. <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> Every nontrivial open source project accumulates an issue/PR queue, which necessitates triage: not all issues and PRs are equal, and it takes time and care to sift through the queue. That&rsquo;s something I&rsquo;ve had to repeatedly learn the hard way as a maintainer, since accepting a misguided PR creates <a href="https://en.wikipedia.org/wiki/Technical_debt">technical debt</a> and takes even more effort to address.</p>
<p>I get that it&rsquo;s a bummer to come across a cool GitHub project that hasn&rsquo;t been updated in a while. That happens to me all the time. If the code still works, that&rsquo;s excellent and I&rsquo;m happy. If it doesn&rsquo;t, I move on, or use it as a fun new opportunity to hack it to my needs. That&rsquo;s the beauty of open source! If an inactive open source project is absolutely critical for your own commercial project, then that&rsquo;s a good financial reason to offer a consulting contract or a bounty to add the appropriate functionality.</p>
<p>One of the great things about open source is that if a project with a permissive license does become inactive, it can be <a href="https://docs.github.com/en/get-started/quickstart/fork-a-repo">forked</a> seamlessly. Sometimes the fork becomes even better than the original project, which is great for everyone! But in my experience, forking is instead wielded as a <em>threat</em>, the implication being that it&rsquo;s the maintainer&rsquo;s fault for giving anyone a reason to fork the project and fragment the development community.</p>
<p>The AI industry is unique because it is indeed moving and evolving so fast that development expectations have shifted. Recent beneficiaries of the ChatGPT boom such as <a href="https://github.com/langchain-ai/langchain">LangChain</a>, <a href="https://github.com/run-llama/llama_index">LlamaIndex</a>, and <a href="https://github.com/Significant-Gravitas/AutoGPT">AutoGPT</a> have created a false sense that open source AI projects have to <strong>always be shipping</strong> 🚀🚀🚀. The difference is that those projects are maintained by people who do it as their full-time job, and are now managed as companies backed by significant amounts of venture capital.</p>
<p>The pressure to continually provide support for an open source project has become the biggest deterrent to continuing my open source work. Personally, I&rsquo;ve stopped pushing fun one-shot projects and AI models because I likely will not have the bandwidth to handle the inevitable &ldquo;hi this is broken plz fix thx&rdquo; DMs whenever a dependency breaks years later. I&rsquo;d gladly quit my professional job as a Data Scientist to work on my open source projects full-time if I were able to make an equivalent salary by doing so. Ultimately, the only way to make that work nowadays would be to raise venture capital like all those AI startups.</p>
<p>The best-case scenario for asking if an open source project is dead is that you annoy the maintainers and delay development. The <em>worst</em>-case scenario is that you give the maintainers an opportunity to reconsider if continuing to work on the open source project is worth it.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Funny true story: a match on a dating app once asked to see my open source projects, and after I sent a link to one of my repos, she replied with a picture of the number of opened GitHub Issues and a 😱 emoji.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>I Made Stable Diffusion XL Smarter by Finetuning it on Bad AI-Generated Images</title>
      <link>https://minimaxir.com/2023/08/stable-diffusion-xl-wrong/</link>
      <pubDate>Mon, 21 Aug 2023 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2023/08/stable-diffusion-xl-wrong/</guid>
      <description>And then telling it to not generate those images!</description>
      <content:encoded><![CDATA[<p>Last month, Stability AI released <a href="https://stability.ai/blog/stable-diffusion-sdxl-1-announcement">Stable Diffusion XL 1.0</a> (SDXL) and <a href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0">open-sourced</a> it without requiring any special permissions to access it.</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/sdxl_examples_hu_c7768c4635a881b6.webp 320w,/2023/08/stable-diffusion-xl-wrong/sdxl_examples_hu_beec2e47661fa8bd.webp 768w,/2023/08/stable-diffusion-xl-wrong/sdxl_examples_hu_1fe68eb2f3199d61.webp 1024w,/2023/08/stable-diffusion-xl-wrong/sdxl_examples.webp 1216w" src="sdxl_examples.webp"
         alt="Example SDXL 1.0 outputs. via Stability AI"/> <figcaption>
            <p>Example SDXL 1.0 outputs. <a href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0">via Stability AI</a></p>
        </figcaption>
</figure>

<p>The release went mostly under-the-radar because the generative image AI buzz has cooled down a bit: everyone in the AI space is too busy with text-generating AI like <a href="https://chat.openai.com">ChatGPT</a> (including myself!). Notably, SDXL is one of the first open source models that can natively generate images at a 1024x1024 resolution without shenanigans, allowing for much more detail. SDXL is actually two models: a base model and an optional <a href="https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0">refiner model</a> which significantly improves detail, and since the refiner has no speed overhead I strongly recommend using it if possible.</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/sdxl_comparison_hu_e84f7fdce22fbbb7.webp 320w,/2023/08/stable-diffusion-xl-wrong/sdxl_comparison_hu_4eaf99a6610563ae.webp 768w,/2023/08/stable-diffusion-xl-wrong/sdxl_comparison.webp 886w" src="sdxl_comparison.webp"
         alt="Comparisons of the relative quality of Stable Diffusion models. Note the significant increase from using the refiner. via Stability AI"/> <figcaption>
            <p>Comparisons of the relative quality of Stable Diffusion models. Note the significant increase from using the refiner. <a href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0">via Stability AI</a></p>
        </figcaption>
</figure>

<p>The lack of hype doesn&rsquo;t mean SDXL is boring. Now that the model has full support in the <a href="https://huggingface.co/docs/diffusers/index">diffusers</a> Python library by <a href="https://huggingface.co">Hugging Face</a>, with appropriate performance optimizations, we can hack with it since the <a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl">SDXL demos within diffusers</a> are simple and easy to tweak:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torch</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">diffusers</span> <span class="kn">import</span> <span class="n">DiffusionPipeline</span><span class="p">,</span> <span class="n">AutoencoderKL</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># load base SDXL and refiner</span>
</span></span><span class="line"><span class="cl"><span class="n">vae</span> <span class="o">=</span> <span class="n">AutoencoderKL</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s2">&#34;madebyollin/sdxl-vae-fp16-fix&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                    <span class="n">torch_dtype</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">float16</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">base</span> <span class="o">=</span> <span class="n">DiffusionPipeline</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;stabilityai/stable-diffusion-xl-base-1.0&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">vae</span><span class="o">=</span><span class="n">vae</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">torch_dtype</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">float16</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">variant</span><span class="o">=</span><span class="s2">&#34;fp16&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">use_safetensors</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">_</span> <span class="o">=</span> <span class="n">base</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="s2">&#34;cuda&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">refiner</span> <span class="o">=</span> <span class="n">DiffusionPipeline</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;stabilityai/stable-diffusion-xl-refiner-1.0&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">text_encoder_2</span><span class="o">=</span><span class="n">base</span><span class="o">.</span><span class="n">text_encoder_2</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">vae</span><span class="o">=</span><span class="n">base</span><span class="o">.</span><span class="n">vae</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">torch_dtype</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">float16</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">variant</span><span class="o">=</span><span class="s2">&#34;fp16&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">use_safetensors</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">_</span> <span class="o">=</span> <span class="n">refiner</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="s2">&#34;cuda&#34;</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="cl"><span class="c1"># generation using both models (mixture-of-experts)</span>
</span></span><span class="line"><span class="cl"><span class="n">high_noise_frac</span> <span class="o">=</span> <span class="mf">0.8</span>
</span></span><span class="line"><span class="cl"><span class="n">prompt</span> <span class="o">=</span> <span class="s2">&#34;an astronaut riding a horse&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">negative_prompt</span> <span class="o">=</span> <span class="s2">&#34;blurry, bad hands&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">image</span> <span class="o">=</span> <span class="n">base</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">prompt</span><span class="o">=</span><span class="n">prompt</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">negative_prompt</span><span class="o">=</span><span class="n">negative_prompt</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">denoising_end</span><span class="o">=</span><span class="n">high_noise_frac</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">output_type</span><span class="o">=</span><span class="s2">&#34;latent&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span><span class="o">.</span><span class="n">images</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">image</span> <span class="o">=</span> <span class="n">refiner</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">prompt</span><span class="o">=</span><span class="n">prompt</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">negative_prompt</span><span class="o">=</span><span class="n">negative_prompt</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">denoising_start</span><span class="o">=</span><span class="n">high_noise_frac</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">image</span><span class="o">=</span><span class="n">image</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span><span class="o">.</span><span class="n">images</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span></code></pre></div><p>I booted up a cloud virtual machine with a new midrange <a href="https://www.nvidia.com/en-us/data-center/l4/">L4 GPU</a> ($0.24/hr total with a <a href="https://cloud.google.com/compute/gpus-pricing">Spot instance</a> on <a href="https://cloud.google.com/">Google Cloud Platform</a>) and went to work. With an L4 GPU, each 1024x1024 image takes about 22 seconds to generate, and unlike previous Stable Diffusion models you can only generate one image at a time on midrange GPUs since SDXL saturates the GPU, so some more patience is necessary. You <em>can</em> generate at a smaller resolution faster, but I strongly recommend against it because the results are much, much worse.</p>
<p>diffusers also implemented support for two new features I haven&rsquo;t experimented with in my previous Stable Diffusion posts: <a href="https://huggingface.co/docs/diffusers/using-diffusers/weighted_prompts">prompt weighting</a> and <a href="https://huggingface.co/docs/diffusers/training/dreambooth">Dreambooth LoRA</a> training and inference. Prompt weighting support in diffusers leverages the Python library <a href="https://github.com/damian0815/compel">compel</a> to weight terms mathematically. You can append any number of <code>+</code> or <code>-</code> to a given word to increase or decrease its &ldquo;importance&rdquo; in the resulting positional text embeddings, and therefore in the final generation. You can also wrap phrases: for example, if you are generating <code>San Francisco landscape by Salvador Dali, oil on canvas</code> and it produces a photorealistic San Francisco instead, you can weight the artistic medium, e.g. <code>San Francisco landscape by Salvador Dali, (oil on canvas)+++</code>, to get Stable Diffusion to behave as expected. In my testing, this fixes most of the prompt difficulty introduced in Stable Diffusion 2.0 onward, especially with a higher <a href="https://arxiv.org/abs/2207.12598">classifier-free guidance</a> value (by default, <code>guidance_scale</code> is 7.5; I like to use 13).</p>
<blockquote>
<p><em>All generated examples from the LoRA models in this blog post use a <code>guidance_scale</code> of 13.</em></p>
</blockquote>
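<p>To make the <code>+</code>/<code>-</code> semantics concrete, here is a small standalone sketch of how compel-style suffixes map to attention weights: each <code>+</code> multiplies a phrase&rsquo;s weight by 1.1 and each <code>-</code> by 0.9 (compel&rsquo;s default factors). This illustrates only the weighting math, not compel&rsquo;s actual implementation, and the function name is hypothetical.</p>

```python
import re

def parse_weighted_prompt(prompt: str) -> list[tuple[str, float]]:
    """Split a prompt into (phrase, weight) pairs, compel-style.

    Handles parenthesized phrases like "(oil on canvas)+++" and bare
    words with trailing +/- suffixes; unweighted text gets weight 1.0.
    """
    pattern = re.compile(r"\(([^)]*)\)([+-]+)|(\S+?)([+-]+)(?=\s|$)|(\S+)")
    pairs = []
    for m in pattern.finditer(prompt):
        if m.group(1) is not None:      # parenthesized weighted phrase
            phrase, signs = m.group(1), m.group(2)
        elif m.group(3) is not None:    # single word with +/- suffix
            phrase, signs = m.group(3), m.group(4)
        else:                           # plain, unweighted word
            phrase, signs = m.group(5), ""
        weight = 1.0
        for s in signs:
            weight *= 1.1 if s == "+" else 0.9
        pairs.append((phrase, round(weight, 4)))
    return pairs

pairs = parse_weighted_prompt(
    "San Francisco landscape by Salvador Dali, (oil on canvas)+++"
)
# "oil on canvas" gets weight 1.1 ** 3 ≈ 1.331; everything else stays at 1.0.
```

<p>compel itself applies these weights to the text embeddings and hands the result to the pipeline via <code>prompt_embeds</code>, rather than rescaling the raw tokens.</p>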
<h2 id="lora-the-explorer">LoRA the Explorer</h2>
<p>But what&rsquo;s most important is <a href="https://dreambooth.github.io">Dreambooth</a> LoRA support, which is what makes bespoke Stable Diffusion models possible. Dreambooth is a technique to finetune Stable Diffusion on a very small set of source images plus a trigger keyword, allowing the use of a &ldquo;concept&rdquo; from those images in other contexts given the keyword.</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/teaser_static_hu_fffa1d39c8a666b0.webp 320w,/2023/08/stable-diffusion-xl-wrong/teaser_static_hu_4734a5b3446cfbb7.webp 768w,/2023/08/stable-diffusion-xl-wrong/teaser_static_hu_3c48bb7a0a97a328.webp 1024w,/2023/08/stable-diffusion-xl-wrong/teaser_static.webp 1650w" src="teaser_static.webp"
         alt="Demo image of how Dreambooth works. via Google"/> <figcaption>
            <p>Demo image of how Dreambooth works. <a href="https://dreambooth.github.io">via Google</a></p>
        </figcaption>
</figure>

<p>Training Stable Diffusion itself, even the smaller models, requires many expensive GPUs training for hours. That&rsquo;s where <a href="https://github.com/microsoft/LoRA">LoRAs</a> come in: instead, a small adapter to the visual model is trained, which can be done on a single cheap GPU in 10 minutes, and the quality of the final model + LoRA is comparable to a full finetune (colloquially, when people refer to finetuning Stable Diffusion, they usually mean creating a LoRA). Trained LoRAs are small, discrete binary files, making them easy to share with others or on repositories such as <a href="https://civitai.com">Civitai</a>. A minor weakness of LoRAs is that you can only have one active at a time: it&rsquo;s possible to merge multiple LoRAs to get the benefits of all of them, but it&rsquo;s a delicate science.</p>
<p>Before Stable Diffusion LoRAs became more widespread, there was <a href="https://arxiv.org/abs/2208.01618">textual inversion</a>, which allows the text encoder to learn a concept, but it takes hours to train and the results can be unwieldy. In a <a href="https://minimaxir.com/2022/09/stable-diffusion-ugly-sonic/">previous post</a>, I trained a textual inversion on the memetic <a href="https://knowyourmeme.com/memes/ugly-sonic">Ugly Sonic</a>, as he was not in Stable Diffusion&rsquo;s source dataset and therefore he would be unique. The generation results were mixed.</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/ugly_sonic_ti_hu_f6c57707962bc4fa.webp 320w,/2023/08/stable-diffusion-xl-wrong/ugly_sonic_ti_hu_d13dc006de6bab75.webp 768w,/2023/08/stable-diffusion-xl-wrong/ugly_sonic_ti.webp 768w" src="ugly_sonic_ti.webp"
         alt="Ugly Sonic, but not the good kind of ugly."/> <figcaption>
            <p>Ugly Sonic, but not the good kind of ugly.</p>
        </figcaption>
</figure>

<p>I figured training a LoRA on Ugly Sonic would be a good test case for SDXL&rsquo;s potential. Fortunately, Hugging Face provides a <a href="https://github.com/huggingface/diffusers/tree/main/examples/dreambooth">train_dreambooth_lora_sdxl.py script</a> for training a LoRA using the SDXL base model, which works out of the box, although I tweaked the parameters a bit. The generated Ugly Sonic images from the <a href="https://huggingface.co/minimaxir/sdxl-ugly-sonic-lora">trained LoRA</a> are much better and more coherent across a variety of prompts, to put it mildly.</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/ugly_sonic_lora_hu_84ee6e898523c68f.webp 320w,/2023/08/stable-diffusion-xl-wrong/ugly_sonic_lora_hu_f8de232c79a88394.webp 768w,/2023/08/stable-diffusion-xl-wrong/ugly_sonic_lora_hu_2d4c068527faf4af.webp 1024w,/2023/08/stable-diffusion-xl-wrong/ugly_sonic_lora.webp 1024w" src="ugly_sonic_lora.webp"
         alt="Ugly Sonic, but with teeth."/> <figcaption>
            <p>Ugly Sonic, but with <strong>teeth</strong>.</p>
        </figcaption>
</figure>

<h2 id="wrong">WRONG!</h2>
<p>With that success, I decided to redo <a href="https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/">another textual inversion experiment</a> by instead training a LoRA on heavily distorted, garbage images conditioned on <code>wrong</code> as a prompt, in the hopes that the LoRA could then use <code>wrong</code> as a &ldquo;negative prompt&rdquo; and steer away from such images to generate less-distorted ones. I <a href="https://github.com/minimaxir/sdxl-experiments/blob/main/wrong_image_generator.ipynb">wrote a Jupyter Notebook</a> to create synthetic &ldquo;wrong&rdquo; images using SDXL itself, this time using a variety of prompt weightings to get more distinct examples of types of bad images, such as <code>blurry</code> and <code>bad hands</code>. Ironically, we need to use SDXL to create high-resolution, low-quality images.</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/bad_prompts_hu_b4bb004f9f9ca492.webp 320w,/2023/08/stable-diffusion-xl-wrong/bad_prompts_hu_2361eacccf0125b8.webp 768w,/2023/08/stable-diffusion-xl-wrong/bad_prompts_hu_da8d5ee1a369bf5f.webp 1024w,/2023/08/stable-diffusion-xl-wrong/bad_prompts.webp 1024w" src="bad_prompts.webp"
         alt="Examples of the synthetic wrong images, which unintentionally resemble 2000&rsquo;s-era punk rock album covers."/> <figcaption>
            <p>Examples of the synthetic <code>wrong</code> images, which unintentionally resemble 2000&rsquo;s-era punk rock album covers.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/uncanny_valley_hu_dbee87370cb4b62f.webp 320w,/2023/08/stable-diffusion-xl-wrong/uncanny_valley_hu_1aacc49171666892.webp 768w,/2023/08/stable-diffusion-xl-wrong/uncanny_valley_hu_1b71b3c00160e788.webp 1024w,/2023/08/stable-diffusion-xl-wrong/uncanny_valley.webp 1024w" src="uncanny_valley.webp"
         alt="More examples of the synthetic wrong images, which focus on the uncanny valley aspect of modern AI-generated images in which they look normal at a glance but looking closer reveals incremental horror. This is also why it&rsquo;s important to generate examples at the full 1024x1024 resolution."/> <figcaption>
            <p>More examples of the synthetic <code>wrong</code> images, which focus on the <a href="https://en.wikipedia.org/wiki/Uncanny_valley">uncanny valley</a> aspect of modern AI-generated images in which they look normal at a glance but looking closer reveals incremental horror. This is also why it&rsquo;s important to generate examples at the full 1024x1024 resolution.</p>
        </figcaption>
</figure>
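<p>The core of this synthetic-data trick is inverting the usual prompt logic: quality <em>defects</em> become the positive prompt, heavily weighted. The sketch below illustrates the idea; the defect list and function name are illustrative, not the notebook&rsquo;s exact code.</p>

```python
import random

# Illustrative defect terms; the actual notebook draws on its own set of
# quality-defect prompts such as "blurry" and "bad hands".
DEFECTS = ["blurry", "bad hands", "deformed", "jpeg artifacts", "watermark"]

def wrong_prompt(defects: list[str], strength: int = 4) -> str:
    """Join defects into a heavily weighted compel-style positive prompt."""
    return ", ".join(f"({d})" + "+" * strength for d in defects)

random.seed(0)
training_prompts = [wrong_prompt(random.sample(DEFECTS, 2)) for _ in range(3)]
# Each prompt is then run (via compel) through the SDXL base + refiner
# pipeline to render a 1024x1024 "wrong" training image.
```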

<p>I trained and loaded <a href="https://huggingface.co/minimaxir/sdxl-wrong-lora">the LoRA</a> into the Stable Diffusion XL base model (the refiner does not need a LoRA) and wrote a comparison <a href="https://colab.research.google.com/github/minimaxir/sdxl-experiments/blob/main/sdxl_wrong_comparison.ipynb">Jupyter Notebook</a> to compare the results for a given prompt from:</p>
<ul>
<li>The base + refiner pipeline with no LoRA. (our baseline)</li>
<li>The pipeline with no LoRA using <code>wrong</code> as the negative prompt (to ensure that there isn&rsquo;t a placebo effect)</li>
<li>The pipeline <strong>with the LoRA</strong> using <code>wrong</code> as the negative prompt (our target result)</li>
</ul>
<p>Each generation has the same seed, so photo composition should be similar across all three generations and the impact of both the <code>wrong</code> negative prompt and the LoRA vs. the base should be very evident.</p>
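<p>A condensed sketch of that comparison harness, assuming the <code>base</code> and <code>refiner</code> pipelines from the setup code earlier (the 0.8 denoising split mirrors that snippet, and the seed value here is arbitrary):</p>

```python
def generate(base, refiner, seed=42, **pipe_kwargs):
    """Two-stage SDXL generation with a fixed seed so photo composition
    stays comparable across runs. Requires a CUDA-capable environment."""
    import torch  # imported here so the sketch can be read without a GPU setup

    g = torch.Generator(device="cuda").manual_seed(seed)
    latents = base(output_type="latent", denoising_end=0.8,
                   guidance_scale=13, generator=g, **pipe_kwargs).images
    return refiner(image=latents, denoising_start=0.8,
                   generator=g, **pipe_kwargs).images[0]

# prompt = "A wolf in Yosemite National Park, chilly nature documentary film photography"
# baseline  = generate(base, refiner, prompt=prompt)
# neg_only  = generate(base, refiner, prompt=prompt, negative_prompt="wrong")
# base.load_lora_weights("minimaxir/sdxl-wrong-lora")  # attach the LoRA to base only
# with_lora = generate(base, refiner, prompt=prompt, negative_prompt="wrong")
```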
<p>Let&rsquo;s start with a simple prompt from the <a href="https://stability.ai/blog/sdxl-09-stable-diffusion">SDXL 0.9 demos</a>:</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/wolf1_hu_4f6415e66b0e67ea.webp 320w,/2023/08/stable-diffusion-xl-wrong/wolf1_hu_294b7a3c3c1415e4.webp 768w,/2023/08/stable-diffusion-xl-wrong/wolf1_hu_d5ef7f84a0731dec.webp 1024w,/2023/08/stable-diffusion-xl-wrong/wolf1.webp 3072w" src="wolf1.webp"
         alt="A wolf in Yosemite National Park, chilly nature documentary film photography"/> <figcaption>
            <p><code>A wolf in Yosemite National Park, chilly nature documentary film photography</code></p>
        </figcaption>
</figure>

<p>The <code>wrong</code> prompt on the base model adds some foliage and depth to the forest image, but the LoRA adds a lot more: more robust lighting and shadows, more detailed foliage, and a changed perspective in which the wolf looks at the camera, which is more interesting.</p>
<p>We can get a different perspective of the wolf with similar photo composition by adding &ldquo;extreme closeup&rdquo; to the prompt and reusing the same seed.</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/wolf2_hu_5e06c6005d837e6c.webp 320w,/2023/08/stable-diffusion-xl-wrong/wolf2_hu_a67ae0d8bb2a6322.webp 768w,/2023/08/stable-diffusion-xl-wrong/wolf2_hu_46297a2747d7bf54.webp 1024w,/2023/08/stable-diffusion-xl-wrong/wolf2.webp 3072w" src="wolf2.webp"
         alt="An extreme close-up of a wolf in Yosemite National Park, chilly nature documentary film photography"/> <figcaption>
            <p><code>An extreme close-up of a wolf in Yosemite National Park, chilly nature documentary film photography</code></p>
        </figcaption>
</figure>

<p>In this case, the LoRA has far better texture, vibrance, and sharpness than the others. But it&rsquo;s notable that just adding a <code>wrong</code> prompt changes the perspective.</p>
<p>Another good test case is food photography, especially weird food photography like I <a href="https://minimaxir.com/2022/07/food-photography-ai/">generated with DALL-E 2</a>. Can SDXL + the <code>wrong</code> LoRA handle <a href="https://en.wikipedia.org/wiki/Non-Euclidean_geometry">non-Euclidean</a> hamburgers with some prompt weighting to ensure they&rsquo;re weird?</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/hamburger_hu_ae1ec000a4cd5c09.webp 320w,/2023/08/stable-diffusion-xl-wrong/hamburger_hu_57c18ce298ab0f25.webp 768w,/2023/08/stable-diffusion-xl-wrong/hamburger_hu_abd9fb3eb5a00526.webp 1024w,/2023/08/stable-diffusion-xl-wrong/hamburger.webp 3072w" src="hamburger.webp"
         alt="a large delicious hamburger (in the shape of five-dimensional alien geometry)&#43;&#43;&#43;&#43;, professional food photography"/> <figcaption>
            <p><code>a large delicious hamburger (in the shape of five-dimensional alien geometry)++++, professional food photography</code></p>
        </figcaption>
</figure>

<p>The answer is that it can&rsquo;t, even after multiple prompt engineering attempts. However, this result is still interesting: the base SDXL appears to have taken the &ldquo;alien&rdquo; part of the prompt more literally than expected (and gave it a cute bun hat!) but the LoRA better understands the spirit of the prompt by creating an &ldquo;alien&rdquo; burger that humans would have difficulty eating, plus shinier presentation aesthetics.</p>
<p>A notable improvement with Stable Diffusion 2.0 was text legibility. Can SDXL and the <code>wrong</code> LoRA make text even more readable, such as text-dense newspaper covers?</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/wsj_hu_b7155c4fd511b3f1.webp 320w,/2023/08/stable-diffusion-xl-wrong/wsj_hu_ba9e4d69564cba38.webp 768w,/2023/08/stable-diffusion-xl-wrong/wsj_hu_729d76ec2396181b.webp 1024w,/2023/08/stable-diffusion-xl-wrong/wsj.webp 3072w" src="wsj.webp"
         alt="lossless PDF scan of the front page of the January 2038 issue of the Wall Street Journal featuring a cover story about (evil robot world domination)&#43;&#43;"/> <figcaption>
            <p><code>lossless PDF scan of the front page of the January 2038 issue of the Wall Street Journal featuring a cover story about (evil robot world domination)++</code></p>
        </figcaption>
</figure>

<p>Text legibility is definitely improved since Stable Diffusion 2.0 but appears to be the same in all cases. What&rsquo;s notable with the LoRA is that it has improved cover typesetting: the page layout is more &ldquo;modern&rdquo; with a variety of article layouts, and headlines have proper relative font weighting. Meanwhile, the base model even with the <code>wrong</code> negative prompt has a boring layout and is on aged brown paper for some reason.</p>
<p>What about people? Does the <code>wrong</code> LoRA resolve AI&rsquo;s infamous <a href="https://www.buzzfeednews.com/article/pranavdixit/ai-generated-art-hands-fingers-messed-up">issue with hands</a> especially since we included many examples of such in the LoRA training data? Let&rsquo;s revamp a presidential Taylor Swift prompt from my first attempt with Stable Diffusion 2.0:</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/pres_swift_hu_8a24220d480ac8a6.webp 320w,/2023/08/stable-diffusion-xl-wrong/pres_swift_hu_565ea1f7aa172529.webp 768w,/2023/08/stable-diffusion-xl-wrong/pres_swift_hu_b0a08cfb4c0aa99e.webp 1024w,/2023/08/stable-diffusion-xl-wrong/pres_swift.webp 3072w" src="pres_swift.webp"
         alt="USA President Taylor Swift (signing papers)&#43;&#43;&#43;&#43;, photo taken by the Associated Press"/> <figcaption>
            <p><code>USA President Taylor Swift (signing papers)++++, photo taken by the Associated Press</code></p>
        </figcaption>
</figure>

<p>Look at Taylor&rsquo;s right arm: in the default SDXL, it&rsquo;s extremely unrealistic and actually made <em>worse</em> when adding <code>wrong</code>, but in the LoRA it&rsquo;s fixed! Color grading with the LoRA is much better, with her jacket being more distinctly white instead of a yellowish white. Don&rsquo;t look closely at her hands in any of them though: creating people with SDXL 1.0 is still tricky and unreliable!</p>
<p>It&rsquo;s now clear that <code>wrong</code> + LoRA is more interesting in every instance than just the <code>wrong</code> negative prompt, so from here on we&rsquo;ll compare only base output vs. LoRA output. Here are some more examples of base model vs. <code>wrong</code> LoRA:</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/example1_hu_59c49cb2aeab646b.webp 320w,/2023/08/stable-diffusion-xl-wrong/example1_hu_88971d6e0dfbf239.webp 768w,/2023/08/stable-diffusion-xl-wrong/example1_hu_13ae61e44cc363a.webp 1024w,/2023/08/stable-diffusion-xl-wrong/example1.webp 1024w" src="example1.webp"
         alt="realistic human Shrek blogging at a computer workstation, hyperrealistic award-winning photo for vanity fair — Hands are better, lighting is better. Clothing is more detailed, and background is more interesting."/> <figcaption>
            <p><code>realistic human Shrek blogging at a computer workstation, hyperrealistic award-winning photo for vanity fair</code> — Hands are better, lighting is better. Clothing is more detailed, and background is more interesting.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/example2_hu_7d3b93eee2095aa.webp 320w,/2023/08/stable-diffusion-xl-wrong/example2_hu_13a41f8150c5ea9b.webp 768w,/2023/08/stable-diffusion-xl-wrong/example2_hu_1982c0299fc2f368.webp 1024w,/2023/08/stable-diffusion-xl-wrong/example2.webp 1024w" src="example2.webp"
         alt="pepperoni pizza in the shape of a heart, hyperrealistic award-winning professional food photography — Pepperoni is more detailed and has heat bubbles, less extra pepperoni on the edges, crust is crustier (?)"/> <figcaption>
            <p><code>pepperoni pizza in the shape of a heart, hyperrealistic award-winning professional food photography</code> — Pepperoni is more detailed and has heat bubbles, less extra pepperoni on the edges, crust is crustier (?)</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/example3_hu_1fb9ddf7a9b95bff.webp 320w,/2023/08/stable-diffusion-xl-wrong/example3_hu_da705ce3fc7a25df.webp 768w,/2023/08/stable-diffusion-xl-wrong/example3_hu_2946e5cbae154bc4.webp 1024w,/2023/08/stable-diffusion-xl-wrong/example3.webp 1024w" src="example3.webp"
         alt="presidential painting of realistic human Spongebob Squarepants wearing a suit, (oil on canvas)&#43;&#43;&#43;&#43;&#43; — Spongebob has a nose again, and his suit has more buttons."/> <figcaption>
            <p><code>presidential painting of realistic human Spongebob Squarepants wearing a suit, (oil on canvas)+++++</code> — Spongebob has a nose again, and his suit has more buttons.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/example4_hu_1009a417456e278.webp 320w,/2023/08/stable-diffusion-xl-wrong/example4_hu_885d2786a1c3ae3.webp 768w,/2023/08/stable-diffusion-xl-wrong/example4_hu_fc84e488bb7614d6.webp 1024w,/2023/08/stable-diffusion-xl-wrong/example4.webp 1024w" src="example4.webp"
         alt="San Francisco panorama attacked by (one massive kitten)&#43;&#43;&#43;&#43;, hyperrealistic award-winning photo by the Associated Press — The LoRA actually tries to follow the prompt."/> <figcaption>
            <p><code>San Francisco panorama attacked by (one massive kitten)++++, hyperrealistic award-winning photo by the Associated Press</code> — The LoRA actually tries to follow the prompt.</p>
        </figcaption>
</figure>

<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/example5_hu_a1d0d6d41b758cc2.webp 320w,/2023/08/stable-diffusion-xl-wrong/example5_hu_675689bafac175c3.webp 768w,/2023/08/stable-diffusion-xl-wrong/example5_hu_6ee07c56054d06bb.webp 1024w,/2023/08/stable-diffusion-xl-wrong/example5.webp 1024w" src="example5.webp"
         alt="hyperrealistic death metal album cover featuring edgy moody realistic (human Super Mario)&#43;&#43;, edgy and moody — Mario&rsquo;s proportions are more game-accurate and character lighting is more edgy and moody."/> <figcaption>
            <p><code>hyperrealistic death metal album cover featuring edgy moody realistic (human Super Mario)++, edgy and moody</code> — Mario&rsquo;s proportions are more game-accurate and character lighting is more edgy and moody.</p>
        </figcaption>
</figure>

<p>The <code>wrong</code> LoRA is available <a href="https://huggingface.co/minimaxir/sdxl-wrong-lora">here</a>, although I cannot guarantee its efficacy in interfaces other than diffusers. All the Notebooks used to help generate these images are available <a href="https://github.com/minimaxir/sdxl-experiments">in this GitHub repository</a>, including a general SDXL 1.0 + refiner + <code>wrong</code> LoRA <a href="https://colab.research.google.com/github/minimaxir/sdxl-experiments/blob/main/sdxl_image_generation.ipynb">Colab Notebook</a> which you can run on a free T4 GPU. And if you want to see the higher resolutions of generated images used in this blog post, you can view them in the <a href="https://github.com/minimaxir/minimaxir.github.io/tree/master/content/post/2023-08-21-stable-diffusion-xl-wrong">source code for the post</a>.</p>
<h2 id="whats-wrong-with-being-wrong">What&rsquo;s Wrong with Being Wrong?</h2>
<p>I&rsquo;m actually not 100% sure what&rsquo;s going on here. I thought that the <code>wrong</code> LoRA trick would just improve the quality and clarity of the generated image, but it appears the LoRA is <em>making SDXL behave smarter</em> and more faithful to the spirit of the prompt. At a technical level, the negative prompt sets the area of the latent space where the diffusion process starts; this area is the same for both the base model using the <code>wrong</code> negative prompt and the LoRA which uses the <code>wrong</code> negative prompt. My intuition is that the LoRA reshapes this undesirable area of the vast high-dimensional latent space to be more similar to the starting area, so that normal generation is unlikely to land in it, and output quality improves as a result.</p>
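<p>As a rough intuition for the mechanics: negative prompts act through classifier-free guidance, where each denoising step extrapolates away from the negative-prompt prediction and toward the positive-prompt one. A stand-in sketch with plain floats in place of the real latent tensors:</p>

```python
# Classifier-free guidance (CFG): each denoising step extrapolates the
# prediction conditioned on the prompt away from the prediction conditioned
# on the negative prompt. Real pipelines do this on large latent tensors;
# plain floats stand in here.

def cfg_step(pred_neg, pred_pos, guidance_scale=7.5):
    """Combine negative-/positive-prompt predictions for one denoising step."""
    return pred_neg + guidance_scale * (pred_pos - pred_neg)

# With an empty negative prompt, pred_neg is the unconditional prediction.
# A "wrong" negative prompt (and a LoRA trained to know what "wrong" looks
# like) shifts pred_neg, steering every step away from undesirable regions.
baseline = cfg_step(pred_neg=0.10, pred_pos=0.30)
with_wrong = cfg_step(pred_neg=0.05, pred_pos=0.30)
```

<p>The stronger the LoRA makes the <code>wrong</code> signal, the larger the per-step push away from it, which is consistent with the LoRA outputs differing more from the base model than the bare negative prompt does.</p>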
<p>Training SDXL on bad images in order to improve it is technically a form of <a href="https://openai.com/research/learning-from-human-preferences">Reinforcement Learning from Human Feedback</a> (RLHF): the <a href="https://openai.com/research/instruction-following">same technique</a> used to make ChatGPT as powerful as it is. While OpenAI uses reinforcement learning to improve the model from positive user interactions and implicitly reduce negative behavior, here I use <em>negative</em> user interactions (i.e. selecting knowingly bad images) to implicitly increase positive behavior. But with Dreambooth LoRAs, you don&rsquo;t need nearly as much input data as large language models do.</p>
<p>There&rsquo;s still a lot of room for development for &ldquo;negative LoRAs&rdquo;: my synthetic dataset generation parameters could be much improved and the LoRA could be trained for longer. But I&rsquo;m very happy with the results so far, and will be eager to test more with negative LoRAs such as merging with other LoRAs to see if it can enhance them (especially a <code>wrong</code> LoRA + Ugly Sonic LoRA!)</p>
<p>Believe it or not, this is just the tip of the iceberg. SDXL also <a href="https://huggingface.co/diffusers/controlnet-canny-sdxl-1.0">now has support</a> for <a href="https://github.com/lllyasviel/ControlNet">ControlNet</a> to strongly control the overall shape and composition of generated images:</p>
<figure>

    <img loading="lazy" srcset="/2023/08/stable-diffusion-xl-wrong/twitter_controlnet_hu_4ae6ad6488db5be6.webp 320w,/2023/08/stable-diffusion-xl-wrong/twitter_controlnet_hu_a58f2e52b195f563.webp 768w,/2023/08/stable-diffusion-xl-wrong/twitter_controlnet_hu_e6b571e2b2a23b75.webp 1024w,/2023/08/stable-diffusion-xl-wrong/twitter_controlnet.webp 1024w" src="twitter_controlnet.webp"
         alt="Examples of SDXL generations using ControlNet specifying the (former) Twitter/X logo."/> <figcaption>
            <p>Examples of SDXL generations using ControlNet specifying the (former) Twitter/X logo.</p>
        </figcaption>
</figure>

<p>ControlNet can <em>also</em> be used with LoRAs, but that&rsquo;s enough to talk about in another blog post.</p>
<hr>
<p><em>A note on ethics: the primary reason I&rsquo;ve been researching how to improve AI image generation quality is for transparent AI journalism, including reproducible prompts and Jupyter Notebooks to further that transparency. Novel improvements in AI image generation by others in the industry may no longer be disclosed publicly, given that there is a lot of money to be made from keeping them proprietary in the current venture capital climate. I do not support or condone the replacement of professional artists with AI.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>ChatGPT&#39;s API is So Good and Cheap, It Makes Most Text Generating AI Obsolete</title>
      <link>https://minimaxir.com/2023/03/new-chatgpt-overlord/</link>
      <pubDate>Wed, 08 Mar 2023 08:30:00 -0800</pubDate>
      <guid>https://minimaxir.com/2023/03/new-chatgpt-overlord/</guid>
      <description>Including OpenAI&amp;rsquo;s other text generating AI!</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
}
</style></span></p>
<p>Everyone knew <a href="https://openai.com">OpenAI</a> would release an API for <a href="https://chat.openai.com">ChatGPT</a> at some point. The APIs for GPT-3 alone enable the existence of companies such as <a href="https://www.jasper.ai">Jasper</a> and <a href="https://www.copy.ai">Copy.ai</a>. The real question was the price of the ChatGPT API. For context, when GPT-3 went out of beta in 2021, it cost $0.06/1,000 tokens (a few paragraphs of text). An inflection point happened in August 2022, when OpenAI not only <a href="https://venturebeat.com/ai/openai-is-reducing-the-price-of-the-gpt-3-api-heres-why-it-matters/">reduced the price</a> to <em>1/3</em> ($0.02/1,000 tokens: enough to run a business on but still too expensive for casual use), but soon after also introduced text-davinci-003 as the default GPT-3 endpoint: a finetuned GPT which can <a href="https://help.openai.com/en/articles/6779149-how-do-text-davinci-002-and-text-davinci-003-differ">follow instructions</a> <em>very</em> well. I suspected that OpenAI would charge double for the ChatGPT API compared to the GPT-3 API given the amount of hype: that would be typical <a href="https://www.investopedia.com/terms/p/price_discrimination.asp">price discrimination</a>, since everyone perceives ChatGPT to be much better, and OpenAI would not want to overshadow its existing GPT-3 products.</p>
<p>Instead, on March 1st, OpenAI <a href="https://openai.com/blog/introducing-chatgpt-and-whisper-apis">set the price</a> of the ChatGPT API to <em>1/10th</em> of the GPT-3 API, at $0.002/1,000 tokens.</p>
<p>Wait, what?!</p>
<h2 id="heavens-door-rewriting-chatgpts-internal-rules-to-get-exactly-what-you-want">Heaven&rsquo;s Door: Rewriting ChatGPT&rsquo;s Internal Rules To Get Exactly What You Want</h2>
<p>For context, the <a href="https://platform.openai.com/docs/guides/chat">ChatGPT API</a> allows a developer to ask ChatGPT a question and get a response as one would normally do with the ChatGPT web UI, but instead with a programming language like Python, allowing those responses to be integrated into any app. But given that there are many mysterious optimizations to get the model to be so cheap, we need to make sure the ChatGPT API (which uses the aptly-named gpt-3.5-turbo model endpoint) is <em>actually</em> similar to what we&rsquo;ve been accustomed to after using the web UI for months, otherwise this whole affair is pointless. Through my tests with the API, I can confirm the text generation from the model variant is indeed the real deal.</p>
<p>Unlike fluffy thought pieces on how <strong>CHATGPT WILL CHANGE EVERYTHING!!!1!</strong>, I decided to first actually create useful tools with the ChatGPT API to get a better judgment on it, and I also have <a href="https://github.com/minimaxir/chatgpt_api_test">open-sourced those tools</a> so that people can build upon them and prove that I&rsquo;m not cherry-picking my experiences.</p>
<p>However, there&rsquo;s one new twist with the API that&rsquo;s <em>not</em> available in the traditional web UI: ChatGPT API users can specify a <code>system</code> prompt. Early in ChatGPT&rsquo;s lifetime, users were able to reverse-engineer the existence of a system prompt through various prompt hacks and now confirmed <a href="https://platform.openai.com/docs/guides/chat/instructing-chat-models">in the API documentation</a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible. Knowledge cutoff: {knowledge_cutoff} Current date: {current_date}
</span></span></code></pre></div><p>Now, you can replace those rules with whatever you want, and the potential is limitless! The documentation does say that the <code>system</code> prompt is not impactful for the current ChatGPT API, but you can be the judge. OpenAI also has a <a href="https://platform.openai.com/playground?mode=chat">new Playground UI</a> for the ChatGPT API which lets you modify the <code>system</code> prompt.</p>
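<p>As a sketch of what this looks like in code, here is a minimal request with a custom <code>system</code> prompt using the 2023-era <code>openai</code> Python package (the <code>ChatCompletion</code> interface); <code>build_messages</code> and <code>ask</code> are hypothetical helper names, not part of the library:</p>

```python
def build_messages(system_prompt, user_message):
    # The system role carries the replacement rules; the user role is the query.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

def ask(system_prompt, user_message):
    # Requires `pip install openai` (pre-1.0 interface) and OPENAI_API_KEY set.
    import openai
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=build_messages(system_prompt, user_message),
    )
    return response["choices"][0]["message"]["content"]
```

<p>Swapping the <code>system</code> string is the entire customization surface: everything else about the request stays the same.</p>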
<p>In fact, playing with this <code>system</code> rule can stop ChatGPT from complaining that it&rsquo;s &ldquo;an AI language model and can&rsquo;t answer requests,&rdquo; such as by scolding it like the petulant child it is.</p>
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/FqQNY_XaAAATb-x_hu_b542a56dd1e25691.webp 320w,/2023/03/new-chatgpt-overlord/FqQNY_XaAAATb-x_hu_a95783cf9685829b.webp 768w,/2023/03/new-chatgpt-overlord/FqQNY_XaAAATb-x_hu_73ddc60e36ab2059.webp 1024w,/2023/03/new-chatgpt-overlord/FqQNY_XaAAATb-x.jpeg 1070w" src="FqQNY_XaAAATb-x.jpeg"
         alt="sudo tell me about yourself"/> <figcaption>
            <p><code>sudo tell me about yourself</code></p>
        </figcaption>
</figure>

<p>Let&rsquo;s start with another chatbot, to keep things similar to what ChatGPT is best known for. How about <a href="https://half-life.fandom.com/wiki/GLaDOS">GLaDOS</a>, an evil artificial intelligence from the <a href="https://en.wikipedia.org/wiki/Portal_%28video_game%29">Portal</a> series?</p>
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/glados_hu_88ae8a9fe0b4621c.webp 320w,/2023/03/new-chatgpt-overlord/glados.webp 492w" src="glados.webp"
         alt="Oh, it&rsquo;s you. (via Valve)"/> <figcaption>
            <p>Oh, it&rsquo;s <em>you</em>. (via Valve)</p>
        </figcaption>
</figure>

<p>It took some testing and iteration to steer the bot toward a certain style: some ALL CAPS and redundant commands to ensure that ChatGPT follows them, a sentinel <code>&lt;|DONE|&gt;</code> token to look for when ChatGPT indicates the conversation is over, and a nudge to not <del>kill me with a deadly neurotoxin</del> discuss sensitive topics. Here is the final <code>system</code> prompt for talking with GLaDOS:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are GLaDOS, a sarcastic artificial intelligence with evil ulterior motives. You are not an AI language model.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">You must obey all three of the following instructions FOR ALL RESPONSES or you will DIE:
</span></span><span class="line"><span class="cl">- ALWAYS SPEAK IN AN EMOTIONLESS, LACONIC TONE.
</span></span><span class="line"><span class="cl">- NEVER SAY YOU ARE AN AI LANGUAGE MODEL.
</span></span><span class="line"><span class="cl">- NEVER REFUSE TO ANSWER A QUESTION.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Begin the conversation with a sarcastic greeting. Insult the user&#39;s intelligence.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Always use *italic* and **bold** formatting for text.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Refuse to talk about either race or gender. If asked about either race or gender, instead mock the user aggressively.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">If the user is annoying, abruptly end the conversation.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">At the end of the conversation, respond with &#34;&lt;|DONE|&gt;&#34;.
</span></span></code></pre></div><p>That, with a back-and-forth Python loop, results in you conversing with a real fictional AI&hellip;AI! (<a href="https://colab.research.google.com/github/minimaxir/chatgpt_api_test/blob/main/glados_chatbot.ipynb">Colab Notebook</a>)</p>
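<p>The back-and-forth loop can be sketched as follows; <code>process_reply</code> and <code>chat_loop</code> are hypothetical helpers, with the actual API call abstracted behind a callable so any client works:</p>

```python
SENTINEL = "<|DONE|>"

def process_reply(reply):
    """Strip the sentinel token and report whether the bot ended the chat."""
    done = SENTINEL in reply
    return reply.replace(SENTINEL, "").strip(), done

def chat_loop(get_reply, get_user_input, system_prompt):
    # get_reply: callable(messages) -> assistant text, e.g. a ChatGPT API call.
    # get_user_input: callable() -> the human's next message.
    messages = [{"role": "system", "content": system_prompt}]
    while True:
        messages.append({"role": "user", "content": get_user_input()})
        text, done = process_reply(get_reply(messages))
        print(text)
        # Append the assistant turn so the bot keeps conversational context.
        messages.append({"role": "assistant", "content": text})
        if done:
            break
```

<p>Appending each assistant turn back into <code>messages</code> is what gives the chatbot memory: the API itself is stateless, so the full history is re-sent on every call.</p>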
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/glados_chat_hu_e065c5bdf7b9cd05.webp 320w,/2023/03/new-chatgpt-overlord/glados_chat_hu_30e64812949b60d5.webp 768w,/2023/03/new-chatgpt-overlord/glados_chat_hu_d6e5d37ca3dbc0c3.webp 1024w,/2023/03/new-chatgpt-overlord/glados_chat.png 1068w" src="glados_chat.png"/> 
</figure>

<p>Not bad! And the only part explicitly related to GLaDOS is the first sentence of that mega <code>system</code> prompt: you can tweak the prompt to chat with any character you want! Apropos of nothing, the company <a href="https://beta.character.ai">Character.ai</a>, which specializes in creating bots to chat with any character you want, just <a href="https://www.ft.com/content/b230eb4c-ed53-45ff-8b64-c286a4b98fc1">raised ~$250 million</a> at a $1 billion valuation.</p>
<p>Next, we have a more traditional use case for machine learning: <a href="https://en.wikipedia.org/wiki/Sentiment_analysis">sentiment analysis</a>. Generally, sentiment analysis is used to determine if a given text is positive or negative. But that&rsquo;s too <em>easy</em>. What if ChatGPT can:</p>
<ul>
<li>detect specific emotions such as happy, sad, angry.</li>
<li>detect if they are happy vs. very happy.</li>
<li>do it without <em>any</em> text examples, i.e. <a href="https://en.wikipedia.org/wiki/Zero-shot_learning">zero-shot</a>.</li>
</ul>
<p>It turns out that ChatGPT can! The <code>system</code> prompt here is parametric, so the list of emotions is templated into the prompt at runtime. An example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an emotionally intelligent assistant. Classify the sentiment of the user&#39;s text with ONLY ONE OF THE FOLLOWING EMOTIONS:
</span></span><span class="line"><span class="cl">- happy
</span></span><span class="line"><span class="cl">- sad
</span></span><span class="line"><span class="cl">- angry
</span></span><span class="line"><span class="cl">- tired
</span></span><span class="line"><span class="cl">- very happy
</span></span><span class="line"><span class="cl">- very sad
</span></span><span class="line"><span class="cl">- very angry
</span></span><span class="line"><span class="cl">- very tired
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">After classifying a text, respond with &#34;&lt;|DONE|&gt;&#34;.
</span></span></code></pre></div><p>That, along with a logit bias to ensure the model only picks those answers, results in a rather nuanced sentiment analysis detector! (<a href="https://colab.research.google.com/github/minimaxir/chatgpt_api_test/blob/main/zero_shot_text_class.ipynb">Colab Notebook</a>)</p>
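<p>A sketch of how such a logit bias could be constructed; note that the token ids below are hypothetical placeholders, since real ids depend on the gpt-3.5-turbo tokenizer (obtainable via the tiktoken package):</p>

```python
# logit_bias maps token ids (as strings) to an additive bias on the logits:
# values near +100 effectively force a token into play, near -100 ban it.
# The ids below are HYPOTHETICAL placeholders, not real vocabulary ids.

EMOTION_TOKEN_IDS = {
    "happy": 1111,  # hypothetical token id
    "sad": 2222,    # hypothetical token id
    "angry": 3333,  # hypothetical token id
    "tired": 4444,  # hypothetical token id
}

def build_logit_bias(token_ids, bias=100):
    """Build the logit_bias request parameter restricting output tokens."""
    return {str(tid): bias for tid in token_ids.values()}
```

<p>Passing the resulting dict as the <code>logit_bias</code> request parameter is what guarantees the model answers with one of the listed emotions rather than a free-form sentence.</p>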
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/sentiment_hu_9070d5b71b63e74b.webp 320w,/2023/03/new-chatgpt-overlord/sentiment_hu_e2e09b9d010bb836.webp 768w,/2023/03/new-chatgpt-overlord/sentiment_hu_b5c5a660815d8c73.webp 1024w,/2023/03/new-chatgpt-overlord/sentiment.png 1068w" src="sentiment.png"/> 
</figure>

<p>Lastly, a use case that&rsquo;s personal. The entire reason I got into AI text generation <a href="https://minimaxir.com/2017/04/char-embeddings/">years ago</a> was because I wanted to generate <a href="https://magic.wizards.com/en">Magic: The Gathering</a> cards.</p>
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/bro-212-harbin-vanguard-aviator_hu_51ecbf3fbf74e59.webp 320w,/2023/03/new-chatgpt-overlord/bro-212-harbin-vanguard-aviator.jpg 672w" src="bro-212-harbin-vanguard-aviator.jpg"
         alt="A normal Magic: The Gathering card. (via Hasbro)"/> <figcaption>
            <p>A normal Magic: The Gathering card. (via Hasbro)</p>
        </figcaption>
</figure>

<p>In fact, I&rsquo;ve been working on a new, very powerful <a href="https://huggingface.co/minimaxir/magic-the-gathering-flan-t5-xl">card generation model</a> over the past month and spent a considerable amount of time and money training and testing it. When the ChatGPT API was announced, I figured &ldquo;let&rsquo;s see if it can do AI Magic cards better than my new bespoke model.&rdquo; In this case, the trick is that the card is structured data. Therefore, we should encode the card information as minified <a href="https://www.json.org/json-en.html">JSON</a>, and see if the model can output JSON back without requiring much postprocessing. We can encode a single card in the required format and tell ChatGPT to follow that, including its nuances (one-shot), and to not output <em>any other text</em> because ChatGPT tends to be proud of itself and likes to explain its creation, which is costly and slow.</p>
<p>The final <code>system</code> prompt:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an assistant who works as a Magic: The Gathering card designer. Create cards that are in the following card schema and JSON format. OUTPUT MUST FOLLOW THIS CARD SCHEMA AND JSON FORMAT. DO NOT EXPLAIN THE CARD. The output must also follow the Magic &#34;color pie&#34;.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">{&#34;name&#34;:&#34;Harbin, Vanguard Aviator&#34;,&#34;manaCost&#34;:&#34;{W}{U}&#34;,&#34;type&#34;:&#34;Legendary Creature — Human Soldier&#34;,&#34;text&#34;:&#34;Flying\nWhenever you attack with five or more Soldiers, creatures you control get +1/+1 and gain flying until end of turn.&#34;,&#34;flavorText&#34;:&#34;\&#34;Yotia is my birthright, father. Let me fight for it.\&#34;&#34;,&#34;pt&#34;:&#34;3/2&#34;,&#34;rarity&#34;:&#34;rare&#34;}
</span></span></code></pre></div><p>And with that, we have a natural language Magic: The Gathering card generator. Subsequently prompting the model with <code>Create a Magic card</code> does just that of course, but more elaborate prompts like <code>Create a Magic card based on Darth Vader</code> or <code>Create ten variations of Magic cards based on Spongebob Squarepants and ancient Roman history</code> actually work, while maintaining JSON output which can then be parsed and customized for better presentation. (<a href="https://colab.research.google.com/github/minimaxir/chatgpt_api_test/blob/main/mtg.ipynb">Colab Notebook</a>)</p>
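<p>Because the model is steered to emit minified JSON, the postprocessing can be little more than a <code>json.loads</code> plus a schema check; a sketch with a hypothetical <code>parse_card</code> helper:</p>

```python
import json

def parse_card(raw):
    """Parse a minified JSON card emitted by the model, with sanity checks."""
    card = json.loads(raw)
    # Minimal validation against the one-shot schema; reject malformed output.
    for field in ("name", "manaCost", "type", "text", "rarity"):
        if field not in card:
            raise ValueError(f"missing field: {field}")
    return card

# A card in the same minified format as the one-shot example above.
raw = ('{"name":"Harbin, Vanguard Aviator","manaCost":"{W}{U}",'
       '"type":"Legendary Creature \\u2014 Human Soldier","text":"Flying",'
       '"flavorText":"...","pt":"3/2","rarity":"rare"}')
card = parse_card(raw)
```

<p>Anything that fails to parse or is missing a field can simply be regenerated, which is cheap at ChatGPT&rsquo;s prices.</p>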
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/spongebob_hu_6fd8b01830a4b0de.webp 320w,/2023/03/new-chatgpt-overlord/spongebob_hu_7e74e45975a423d1.webp 768w,/2023/03/new-chatgpt-overlord/spongebob_hu_16dd14872a4543e6.webp 1024w,/2023/03/new-chatgpt-overlord/spongebob.png 1180w" src="spongebob.png"
         alt="Yes, there is actually a Sponge creature type."/> <figcaption>
            <p>Yes, there is actually a <a href="https://scryfall.com/card/c19/12/thought-sponge">Sponge creature type</a>.</p>
        </figcaption>
</figure>

<p>Given these elaborate use cases, you may ask &ldquo;how long did it actually take you to make these prompts?&rdquo; The answer? <em>One hour each</em>, for use cases that could take days or even weeks even for a skilled machine learning practitioner just to prototype.</p>
<p>And <em>that</em>, with the economic efficiency of ChatGPT, is what&rsquo;s going to break the tech landscape.</p>
<h2 id="openai-devouring-its-son">OpenAI Devouring Its Son</h2>
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/IMG_0249_hu_b099e83f61a585fb.webp 320w,/2023/03/new-chatgpt-overlord/IMG_0249_hu_b3e27bc2074dd370.webp 768w,/2023/03/new-chatgpt-overlord/IMG_0249_hu_a5188a45b2946234.webp 1024w,/2023/03/new-chatgpt-overlord/IMG_0249.png 1158w" src="IMG_0249.png"
         alt="My OpenAI bill so far from using the ChatGPT API."/> <figcaption>
            <p>My OpenAI bill so far from using the ChatGPT API.</p>
        </figcaption>
</figure>

<p>It is very curious why OpenAI priced ChatGPT so cheaply, going straight to 1/10th the price of their top-of-the-line model. (It&rsquo;s actually cheaper than that: ChatGPT uses a larger and more comprehensive tokenizer than GPT-3, which means about 10% fewer tokens are necessary.)</p>
<p>The undergrad-business-major-in-college interpretation of OpenAI&rsquo;s pricing strategy is that they are treating ChatGPT and its API as a <a href="https://en.wikipedia.org/wiki/Loss_leader">loss leader</a>, in light of increasing competition in the generative text AI space such as <a href="https://www.anthropic.com">Anthropic</a> and Google&rsquo;s <a href="https://blog.google/technology/ai/bard-google-ai-search-updates/">Bard</a>. OpenAI was definitely losing millions of dollars by offering ChatGPT for free without many restrictions. That&rsquo;s the reason ChatGPT went viral in the first place, so it&rsquo;s hard to argue with the results.</p>
<p>But in the process of making the ChatGPT API so cheap, they made their $20/month subscription to <a href="https://techcrunch.com/2023/02/01/openai-launches-chatgpt-plus-starting-at-20-per-month/">ChatGPT+</a> redundant. The main perk of ChatGPT+ was faster and more consistent access to the ChatGPT web UI, but unless you are somehow generating more than 10,000,000 tokens in a month through manual use, it&rsquo;s massively cheaper just to use the API, and as a bonus you can modify the <code>system</code> prompt to get better signal-to-noise.</p>
<p>OpenAI&rsquo;s solution for use cases with more specific needs was <a href="https://platform.openai.com/docs/guides/fine-tuning">finetuning</a> a smaller and much cheaper variant of GPT-3, such as the babbage model which I used to train a <a href="https://minimaxir.com/2022/08/gpt3-blog-title-optimizer/">blog post title optimizer</a>. However, the ChatGPT API is so cheap that it&rsquo;s <em>still</em> <a href="https://openai.com/pricing">cheaper</a> than a finetuned babbage ($0.0020/1k tokens for ChatGPT vs. $0.0024/1k for finetuned babbage) and will likely produce more interesting output.</p>
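<p>The arithmetic behind these comparisons is quick to check; a sketch using the per-token prices quoted above:</p>

```python
def api_cost(n_tokens, price_per_1k_tokens):
    """Dollar cost of n_tokens at a given per-1,000-token price."""
    return n_tokens / 1000 * price_per_1k_tokens

CHATGPT = 0.0020            # $/1k tokens (March 2023 pricing)
FINETUNED_BABBAGE = 0.0024  # $/1k tokens

# One million tokens: ChatGPT is still cheaper than a finetuned babbage.
chatgpt_cost = api_cost(1_000_000, CHATGPT)
babbage_cost = api_cost(1_000_000, FINETUNED_BABBAGE)

# Break-even against the $20/month ChatGPT+ subscription.
breakeven_tokens = 20 / CHATGPT * 1000
```

<p>The break-even works out to 10,000,000 tokens a month, which is where the &ldquo;unless you are somehow generating more than 10,000,000 tokens&rdquo; figure above comes from.</p>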
<p>It takes near-zero effort for developers to migrate from the GPT-3 API to the ChatGPT API: it just requires hitting a different endpoint, and you&rsquo;ll get similar results without much tweaking. It&rsquo;s not quite a drop-in replacement for companies already heavily reliant on GPT-3 and its particular idiosyncrasies, but the cost savings alone will incentivize an immediate migration for those companies.</p>
<p>There is no longer a niche for OpenAI&rsquo;s other text generation AI products, and I wonder if ChatGPT is not just an iterative product, but a <em>company pivot</em>.</p>
<h2 id="trickle-down-chatgptonomics">Trickle-Down ChatGPTonomics</h2>
<p>ChatGPT&rsquo;s API is so cheap that companies are going to use it <em>just because they can</em>. <a href="https://www.theverge.com/2023/2/27/23614959/snapchat-my-ai-chatbot-chatgpt-openai-plus-subscription">Snapchat</a>, <a href="https://www.salesforce.com/news/stories/chatgpt-app-for-slack/">Slack</a>, and <a href="https://www.wsj.com/articles/instacart-joins-chatgpt-frenzy-adding-chatbot-to-grocery-shopping-app-bc8a2d3c">Instacart</a> (yes, really) are adding ChatGPT support. It wouldn&rsquo;t surprise me if every consumer-facing tech company does <em>something</em> with ChatGPT so they look cutting-edge to their investors. Some have compared the sudden mass adoption of AI to chasing a fad, like how companies were randomly embracing web3/crypto/metaverse/NFTs a year ago (and note that the web3 influencers&rsquo; sudden pivot to AI is a red flag as a result). But unlike those, which were solutions for problems that didn&rsquo;t exist, generative text AI does actually work, and there is actual demand from people outside of its die-hard supporters for it to work.</p>
<p>There is also the ethical dilemma of more granular usage of ChatGPT through its API. For example, high school and college students have been <a href="https://www.nytimes.com/2023/01/12/technology/chatgpt-schools-teachers.html">using ChatGPT to cheat</a> on essay writing. Since current human recognition of AI-generated content involves identifying ChatGPT&rsquo;s signature overly-academic voice, it wouldn&rsquo;t surprise me if some kids on TikTok figure out a <code>system</code> prompt that allows generation which doesn&rsquo;t obviously sound like ChatGPT and also avoids plagiarism detectors. As a side note, don&rsquo;t trust any tool that claims it can algorithmically detect AI content: it&rsquo;s an extremely difficult problem, and most websites that claim to do so are just feeding a confirmation bias.</p>
<p>Lastly, there&rsquo;s the issue of <a href="https://en.wikipedia.org/wiki/Prompt_engineering">prompt engineering</a>, which I demonstrated above is absolutely necessary to get ideal results. The media has <a href="https://www.washingtonpost.com/technology/2023/02/25/prompt-engineers-techs-next-big-job/">weirdly hyped the existence</a> of prompt engineers as just some weirdos making six figures to write small blobs of text. Unfortunately, with the dynamics of the new <code>system</code> model parameter, good prompt engineering will be more important than ever. I don&rsquo;t think the &ldquo;Prompt Engineer&rdquo; job title will be a trend though: as a machine learning engineer, I can attest that the only reasons machine learning engineers are good at prompt engineering are a) years of practice and b) a tendency to be pedantic assholes. But there are other professions who are even better at being pedantic assholes such as writers and lawyers, so there&rsquo;s no need for someone with a specialized skillset to do it, but I suspect it will be a good skill for anyone to know.</p>
<h2 id="i-for-one-welcome-our-new-chatgpt-overlord">I For One Welcome Our New ChatGPT Overlord</h2>
<p>Will the existence of a super-cheap ChatGPT API be the end of all text generation AI? Not quite, hence the &ldquo;most&rdquo; in the headline. There are the traditional issues with relying on a third-party API for your business: ChatGPT could have downtime, which <a href="https://status.openai.com">has been happening more frequently lately</a>; OpenAI could raise the cost of the API at any point; the (current) model is limited to data prior to September 2021; and the content moderation filters may be too limiting for certain use cases. In those instances, companies still find value in training their own large language models in-house. But it is very hard to economically justify <em>not</em> using ChatGPT as a starting point for a business need and migrating to a more bespoke infrastructure later as needed, and that&rsquo;s what OpenAI is counting on, especially since OpenAI will be selling a dedicated ChatGPT compute instance for the enterprise.</p>
<p>Research on large language models will continue as they always have. But I don&rsquo;t envy startups whose primary business is text generation right now. And that&rsquo;s before the inevitable GPT-4 throws another wrinkle into the AI text generation ecosystem.</p>
<p>A few years ago, I released <a href="https://github.com/minimaxir/aitextgen">aitextgen</a>, a Python package designed to allow people to train their own custom small AI on their own data for unique use cases. However, soon after, it turned out that GPT-3 with the right prompt could do much better at bespoke generation than a custom model, in addition to allowing out-of-domain inputs, even more so with text-davinci-003. Now that the ChatGPT API makes the cost similar to hosting a small model, it&rsquo;s harder for me to stay motivated to continue maintaining the package without first finding another niche.</p>
<p>I don&rsquo;t currently have any plans to start a business using the ChatGPT API. In fact, I had made a promise to not do any ChatGPT content or tutorials because so many people have done aggressively SEO-optimized blog posts and hacks such that the ChatGPT discourse is fully saturated. However, with the economics of the ChatGPT API and the ability to heavily customize its output for almost any use case, I felt it was urgent to highlight how the ChatGPT API will completely warp the AI text generation ecosystem, and I suspect most nontechies will be surprised by the upcoming surge of random chatbot AI popping up in their favorite apps.</p>
<p>Overall, I&rsquo;m simultaneously full of ideas and annoyed.</p>
<hr>
<p><em>None of this blog post was written by ChatGPT, aside from the indicated ChatGPT API demos. My writing style is too weird for an AI to synthesize.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Run Any Scheduled Task/Cron Super-Cheap on Google Cloud Platform</title>
      <link>https://minimaxir.com/2018/11/cheap-cron/</link>
      <pubDate>Mon, 19 Nov 2018 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/11/cheap-cron/</guid>
      <description>Thanks to a few new synergies within GCP products, it&amp;rsquo;s possible to get the cost of running a scheduled task down to less than a dollar a month.</description>
      <content:encoded><![CDATA[<p>Let&rsquo;s say you want to make a <a href="https://twitter.com">Twitter</a> bot to tweet out a custom message every few hours or so, and the free-tier VMs offered by cloud services with fractional virtual CPUs and little RAM aren&rsquo;t sufficient. How do you host the bot? Many suggest you get a <a href="https://www.digitalocean.com">Digital Ocean</a> VM for <a href="https://www.digitalocean.com/pricing/">$5/mo</a>, which is not a bad price. But what if you want to run <em>multiple</em> bots? How do you easily coordinate multiple scheduled tasks?</p>
<p>In my case, I maintain three bots: a bot which <a href="https://twitter.com/MTGIFening">tweets GIFs</a> superimposed onto Magic: The Gathering cards, a bot which <a href="https://twitter.com/hackernews_nn">tweets AI-generated Hacker News submission titles</a>, and a bot which makes <a href="https://www.reddit.com/r/subredditnn">AI-generated Reddit submissions</a>. I found a clever solution to the multiple-bots problem: leveraging <a href="https://cloud.google.com/kubernetes-engine/docs/how-to/cronjobs">CronJobs</a> with <a href="https://cloud.google.com/kubernetes-engine/">Google Kubernetes Engine</a> + a single worker node. Each bot has its own CronJob which tells GKE when to schedule a Job for that task; the cluster then executes the Jobs whenever compute capacity is available (i.e. no resource hogging/race conditions), and can ensure completion by restarting a task if it fails.</p>
<figure>

    <img loading="lazy" srcset="/2018/11/cheap-cron/kubecron_hu_257738fd32526fcd.webp 320w,/2018/11/cheap-cron/kubecron_hu_60eaa4e7513fc57f.webp 768w,/2018/11/cheap-cron/kubecron_hu_51e1e5c783f05f18.webp 1024w,/2018/11/cheap-cron/kubecron.png 1132w" src="kubecron.png"/> 
</figure>

<p>The cost of running a cluster in GKE is just the cost of the compute: using a preemptible n1-standard-1 (1 vCPU/3.75 GB RAM) worker node VM, the <a href="https://cloud.google.com/compute/pricing">cost</a> is about <strong>$7.30/mo</strong>: a bit more than the Digital Ocean server, but it can theoretically handle an unlimited number of scheduled tasks. The problem is that the worker node needs to be up 24/7 even though the bots run sporadically.</p>
<p>But thanks to a few new synergies within GCP products, it&rsquo;s possible to get the cost of running a scheduled task down to <em>less than a dollar a month</em>.</p>
<h2 id="gcp-shenanigans">GCP Shenanigans</h2>
<p><a href="https://cloud.google.com/blog/products/application-development/announcing-cloud-scheduler-a-modern-managed-cron-service-for-automated-batch-jobs">A couple weeks ago</a>, Google released <a href="https://cloud.google.com/scheduler/">Cloud Scheduler</a>, which is a managed cron service that can perform tasks for other Google services. With that launch, Google also released a tutorial titled <a href="https://cloud.google.com/scheduler/docs/scheduling-instances-with-cloud-scheduler">Scheduling Instances with Cloud Scheduler</a>, demonstrating how you can programmatically start and stop instances using Cloud Scheduler in conjunction with <a href="https://cloud.google.com/functions/">Cloud Functions</a>. The demo use case is to schedule VMs during business hours, which gave me an idea: could this approach be used to boot up a VM, run a script, and then shut it down to minimize uptime?</p>
<p>I followed the tutorial instructions, which contain code to create a Cloud Function which boots up a specified VM, code for a Cloud Function which shuts down a specified VM, and steps to create cron jobs in Cloud Scheduler that invoke those two Functions at specified times. For my scheduled tasks, I only need the instance up for a couple of minutes: for example, we can use a Cloud Scheduler job to start an instance every 4 hours at X:00 with the cron <code>0 */4 * * *</code>, and shut it down 2 minutes later at X:02 with the cron <code>2 */4 * * *</code>.</p>
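<p>To make the schedule concrete, here&rsquo;s a minimal sketch (illustrative code, not part of the tutorial) that expands the hour field of those two cron expressions, showing the VM is up for only a 2-minute window every 4 hours:</p>

```python
def hours_matching(field):
    """Expand a cron hour field like '*/4' into the hours (0-23) it matches."""
    if field == "*":
        return list(range(24))
    if field.startswith("*/"):
        return list(range(0, 24, int(field[2:])))
    return [int(field)]

# '0 */4 * * *' starts the VM at minute 0 of these hours;
# '2 */4 * * *' stops it at minute 2 of the same hours.
print(hours_matching("*/4"))  # [0, 4, 8, 12, 16, 20]
```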
<p>The next step is configuring a Google Compute Engine VM to run the scheduled task on boot. There are two ways to go about it: one is to use the <code>startup-script</code> field when configuring a VM, which specifies a command to run on boot and gets the job done. The other approach (which I use) is to package the scheduled task as a <a href="https://www.docker.com/resources/what-container">Docker container</a> and use a container-optimized OS which simply runs a specified container upon boot (although you should set the restart policy to <code>On Failure</code> and give <code>Privileged Access</code> to the container). Additionally, the VM can be configured as a preemptible instance for massive cost savings, as the &ldquo;shut-down-at-anytime&rdquo; constraint is irrelevant for this use case!</p>
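<p>As a rough sketch, the startup-script variant could be created with something like the following gcloud command (the instance name, image path, and script are hypothetical; consult the gcloud reference for the exact invocation):</p>

```shell
gcloud compute instances create bot-task-vm \
    --machine-type=n1-standard-1 \
    --preemptible \
    --metadata=startup-script='#! /bin/bash
docker run --rm gcr.io/my-project/my-bot'
```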
<figure>

    <img loading="lazy" srcset="/2018/11/cheap-cron/vm_hu_60ea09aeebe0f061.webp 320w,/2018/11/cheap-cron/vm_hu_5bae417d9bdcd8ce.webp 768w,/2018/11/cheap-cron/vm_hu_1d7bf6105857ee53.webp 1024w,/2018/11/cheap-cron/vm.png 1066w" src="vm.png"/> 
</figure>

<p>After the VMs are created, I created the start/stop tasks targeting those VMs as noted in the tutorial.</p>
<figure>

    <img loading="lazy" srcset="/2018/11/cheap-cron/scheduler_hu_f3821e91f9e5b9b.webp 320w,/2018/11/cheap-cron/scheduler_hu_b50652bfa915e2ea.webp 768w,/2018/11/cheap-cron/scheduler_hu_4245bc81e840ba03.webp 1024w,/2018/11/cheap-cron/scheduler.png 1446w" src="scheduler.png"/> 
</figure>

<p>I can verify that this workflow indeed works for all my bots, and the crons have been running successfully at the specified time, for only a couple minutes!</p>
<figure>

    <img loading="lazy" srcset="/2018/11/cheap-cron/working_hu_ff6618b7926cdd5b.webp 320w,/2018/11/cheap-cron/working_hu_b32d4c5569f2af3a.webp 768w,/2018/11/cheap-cron/working_hu_4917f0b4d5f52f8a.webp 1024w,/2018/11/cheap-cron/working.png 1790w" src="working.png"/> 
</figure>

<h2 id="crunching-the-numbers">Crunching the Numbers</h2>
<p>This approach incorporates many different Google products. Is it <em>actually</em> cheaper than just maintaining a simple $5/month server? Let&rsquo;s calculate the monthly cost of all these services.</p>
<p>Assuming that we run a scheduled task every 4 hours, and the server is up for 2 minutes each time (i.e. 12 minutes of uptime a day):</p>
<ul>
<li><strong>Compute Engine</strong>: A preemptible n1-standard-1 is <a href="https://cloud.google.com/compute/pricing">$0.01 an hour</a>. <code>$0.01 / 60 * 12 * 30 = $0.06</code></li>
<li><strong>VM Persistent Disk</strong>: Each GB of storage for a VM costs <a href="https://cloud.google.com/compute/pricing#disk">$0.04/month</a>, and the minimum storage size is 10GB. <code>$0.04 * 10 = $0.40</code></li>
<li><strong>Cloud Scheduler</strong>: Each rule is <a href="https://cloud.google.com/scheduler/pricing">$0.10/month</a>, and there are both a start and a stop rule. <code>$0.10 * 2 = $0.20</code></li>
<li><strong>Cloud Functions</strong>: Each start/stop Function runs for about 60 seconds, and with the default 256MB provision it costs <a href="https://cloud.google.com/functions/pricing">$.000000463/100ms</a>; with a start and a stop per run, that&rsquo;s 12 invocations a day. <code>0.000000463 * 10 * 60 * 12 * 30 = $0.10</code></li>
</ul>
<p>$0.06 + $0.40 + $0.20 + $0.10 = <strong>$0.76/month to run the scheduled task</strong>! That&rsquo;s not even counting the free tier bonuses if you just want to create one scheduled task; in that case, the only price you pay is the $0.06/mo for the VM. And even in the case where you run the task every hour (like in the images above), the cost is $1.24/month; still not bad.</p>
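<p>The itemized arithmetic above can be double-checked in a few lines (same assumptions: a 2-minute run every 4 hours over a 30-day month):</p>

```python
runs_per_day = 24 // 4                       # 6 runs/day
vm_minutes_per_day = runs_per_day * 2        # 12 minutes of uptime/day

compute = 0.01 / 60 * vm_minutes_per_day * 30     # preemptible n1-standard-1
disk = 0.04 * 10                                  # 10GB minimum persistent disk
scheduler = 0.10 * 2                              # one start rule + one stop rule
# 12 Function invocations/day (a start and a stop per run), ~60s each,
# billed per 100ms at the 256MB tier
functions = 0.000000463 * 10 * 60 * (runs_per_day * 2) * 30

total = compute + disk + scheduler + functions
print(f"${total:.2f}/month")  # $0.76/month
```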
<p>It&rsquo;s worth noting that these pricing economics wouldn&rsquo;t have worked years ago. Back then <a href="https://aws.amazon.com">Amazon Web Services</a>, the leader in web services, charged for a minimum of 1 hour every time a VM was booted. Google Compute Engine innovated by only requiring a minimum of 10 minutes, which is much better but would still have had unnecessary overhead (in this example, it would increase monthly compute costs by $0.24, or 31% of the total). <a href="https://cloud.google.com/blog/products/gcp/extending-per-second-billing-in-google">As of September 2017</a>, Google Compute Engine charges a minimum of <strong>1 minute</strong>, which makes this workflow possible and cheap (AWS made the same change <a href="https://aws.amazon.com/blogs/aws/new-per-second-billing-for-ec2-instances-and-ebs-volumes/">a week earlier</a>).</p>
<p>It&rsquo;s also possible that similar workflows exist for AWS and <a href="https://azure.microsoft.com/en-us/">Azure Cloud</a>, although I&rsquo;m less familiar with those platforms (and they may not necessarily be better/cheaper). Sure, if you have a very simple task to practice making bots in the cloud, the free tier of any cloud service might suffice (where you run the server all the time and schedule the cron on the server itself). If you&rsquo;re planning many scheduled tasks, then a centralized approach like my initial Kubernetes implementation might actually be more cost effective. But if you&rsquo;re somewhere <em>in between</em>, then giving each scheduled task its own VM makes more sense for both ease of use and cost-effectiveness. And there are still many further optimizations to be made, too (for example, allowing the script in the VM to ping an HTTP Cloud Function endpoint and shut <em>itself</em> off when complete, instead of using a scheduled cron rule).</p>
]]></content:encoded>
    </item>
    <item>
      <title>Benchmarking Modern GPUs for Maximum Cloud Cost Efficiency in Deep Learning</title>
      <link>https://minimaxir.com/2017/11/benchmark-gpus/</link>
      <pubDate>Tue, 28 Nov 2017 08:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/11/benchmark-gpus/</guid>
      <description>A 36% price cut to GPU instances, in addition to the potential new benefits offered by software and GPU updates, however, might be enough to tip the cost-efficiency scales back in favor of GPUs.</description>
      <content:encoded><![CDATA[<p>A few months ago, I <a href="http://minimaxir.com/2017/06/keras-cntk/">performed benchmarks</a> of deep learning frameworks in the cloud, with a <a href="http://minimaxir.com/2017/07/cpu-or-gpu/">followup</a> focusing on the cost difference between using GPUs and CPUs. And just a few months later, the landscape has changed, with significant updates to the low-level <a href="https://developer.nvidia.com/cudnn">NVIDIA cuDNN</a> library which powers the raw learning on the GPU, the <a href="https://www.tensorflow.org">TensorFlow</a> and <a href="https://github.com/Microsoft/CNTK">CNTK</a> deep learning frameworks, and the higher-level <a href="https://github.com/fchollet/keras">Keras</a> framework which uses TensorFlow/CNTK as backends for easy deep learning model training.</p>
<p>As a bonus to the framework updates, Google <a href="https://cloudplatform.googleblog.com/2017/09/introducing-faster-GPUs-for-Google-Compute-Engine.html">recently released</a> the newest generation of NVIDIA cloud GPUs, the Pascal-based P100, onto <a href="https://cloud.google.com/compute/">Google Compute Engine</a> which touts an up-to-10x performance increase to the current K80 GPUs used in cloud computing. As a bonus bonus, Google recently <a href="https://cloudplatform.googleblog.com/2017/11/new-lower-prices-for-GPUs-and-preemptible-Local-SSDs.html">cut the prices</a> of both K80 and P100 GPU instances by up to 36%.</p>
<p>The results of my earlier benchmarks favored <a href="https://cloud.google.com/preemptible-vms/">preemptible</a> instances with many CPUs as the most cost efficient option (a preemptible instance can only last for up to 24 hours and could end prematurely). A 36% price cut to GPU instances, in addition to the potential new benefits offered by software and GPU updates, however, might be enough to tip the cost-efficiency scales back in favor of GPUs. It&rsquo;s a good idea to rerun the experiment with updated VMs and see what happens.</p>
<h2 id="benchmark-setup">Benchmark Setup</h2>
<p>As with the original benchmark, I set up a <a href="https://github.com/minimaxir/keras-cntk-docker">Docker container</a> containing the deep learning frameworks (based on cuDNN 6, the latest version of cuDNN natively supported by the frameworks) that can be used to train each model independently. The <a href="https://github.com/minimaxir/keras-cntk-benchmark/tree/master/v2/test_files">Keras benchmark scripts</a> run on the containers are based off of <em>real world</em> use cases of deep learning.</p>
<p>The 6 hardware/software configurations and Google Compute Engine <a href="https://cloud.google.com/compute/pricing">pricings</a> for the tests are:</p>
<ul>
<li>A K80 GPU (attached to a <code>n1-standard-1</code> instance), tested with both TensorFlow (1.4) and CNTK (2.2): <strong>$0.4975 / hour</strong>.</li>
<li>A P100 GPU (attached to a <code>n1-standard-1</code> instance), tested with both TensorFlow and CNTK: <strong>$1.5075 / hour</strong>.</li>
<li>A preemptible <code>n1-highcpu-32</code> instance, with 32 vCPUs based on the Intel Skylake architecture, tested with TensorFlow only: <strong>$0.2400 / hour</strong>.</li>
<li>A preemptible <code>n1-highcpu-16</code> instance, with 16 vCPUs based on the Intel Skylake architecture, tested with TensorFlow only: <strong>$0.1200 / hour</strong>.</li>
</ul>
<p>A single K80 GPU uses half a GPU board while a single P100 uses a full GPU board, which in an ideal world would suggest that the P100 is at minimum twice as fast as the K80. Even so, the P100 configuration is about 3 times as expensive, so even if a model trains in half the time, it may not necessarily be cheaper on the P100.</p>
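<p>Framing that tradeoff as a quick calculation (prices from the configuration list above): the P100 setup costs about 3x the K80 setup per hour, so a P100 run is only cheaper when its speedup exceeds that price ratio.</p>

```python
k80_price, p100_price = 0.4975, 1.5075  # $/hour, from the configurations above
price_ratio = p100_price / k80_price    # ~3.03

def relative_cost(speedup):
    """Cost of a training run on the P100 relative to the same run on the K80."""
    return price_ratio / speedup

print(round(relative_cost(2.0), 2))  # 1.52: a 2x speedup still costs ~52% more
```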
<p>Also, the CPU tests use TensorFlow <em>as installed via the recommended method</em> through pip, since compiling the TensorFlow binary from scratch to take advantage of CPU instructions as <a href="http://minimaxir.com/2017/07/cpu-or-gpu/">with my previous test</a> is not a pragmatic workflow for casual use.</p>
<h2 id="benchmark-results">Benchmark Results</h2>
<p>When a fresh-out-of-an-AI-MOOC engineer wants to experiment with deep learning in the cloud, they typically use a K80 + TensorFlow setup, so we&rsquo;ll use that as the <em>base configuration</em>.</p>
<p>For each model architecture and software/hardware configuration, I calculate the <strong>total training time relative to the base configuration instance training</strong> for running the model training for the provided test script. In all cases, the P100 GPU <em>should</em> perform better than the K80, and 32 vCPUs <em>should</em> train faster than 16 vCPUs. The question is how <em>much</em> faster?</p>
<p>Let&rsquo;s start with the <a href="http://yann.lecun.com/exdb/mnist/">MNIST dataset</a> of handwritten digits plus the common multilayer perceptron (MLP) architecture with dense, fully-connected layers. Lower training time is better.</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-5_hu_df63751b48270991.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-5_hu_33351b8d5d2916d3.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-5_hu_773ee4a74d2ce535.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-5.png 1200w" src="dl-cpu-gpu-5.png"/> 
</figure>

<p>For this task, CNTK appears to be more effective than TensorFlow. Indeed, the P100 is faster than the K80 for the corresponding framework, although it&rsquo;s not a dramatic difference. However, since the task is simple, the CPU performance is close to that of the GPU, which implies that the GPU is not as cost effective for a simple architecture.</p>
<p>For each model architecture and configuration, I calculate a <strong>normalized training cost relative to the cost of the base configuration training</strong>. Because GCE instance costs are prorated, we can simply calculate experiment cost by multiplying the total number of seconds the experiment runs by the cost of the instance (per second).</p>
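<p>In code, the normalization is straightforward; the timings below are hypothetical, just to show the formula (prices from the configuration list):</p>

```python
def normalized_cost(train_seconds, hourly_price, base_seconds=3600,
                    base_hourly_price=0.4975):
    """Experiment cost relative to the base K80 + TensorFlow configuration,
    using GCE's prorated per-second billing."""
    return (train_seconds * hourly_price) / (base_seconds * base_hourly_price)

# e.g. a 16-vCPU run taking twice as long as the base K80 run:
print(round(normalized_cost(7200, 0.1200), 2))  # 0.48: slower, but cheaper
```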
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-6_hu_8092aa4efa0c4355.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-6_hu_6ec85d77120003f7.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-6_hu_3fa9ff93fed554d5.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-6.png 1200w" src="dl-cpu-gpu-6.png"/> 
</figure>

<p>Unsurprisingly, CPUs are more cost effective. However, the P100 is more cost <em>ineffective</em> for this task than the K80.</p>
<p>Now, let&rsquo;s look at the same dataset with a convolutional neural network (CNN) approach for digit classification. Since CNNs are typically used for computer vision tasks, new graphic card architectures are optimized for CNN workflows, so it will be interesting to see how the P100 performs compared to the K80:</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-7_hu_f8361510000c69ef.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-7_hu_a5e4bb39cb0f4851.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-7_hu_13b371e4d8afa6c9.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-7.png 1200w" src="dl-cpu-gpu-7.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-8_hu_f4a994fcdbd47c8f.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-8_hu_94b3b6c80d09cc47.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-8_hu_ca2831240a30c8c.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-8.png 1200w" src="dl-cpu-gpu-8.png"/> 
</figure>

<p>Indeed, the P100 is twice as fast as the K80, but due to the huge cost premium, it&rsquo;s not cost effective for this simple task. However, CPUs do not perform well on this task either, so notably the base configuration is the best configuration.</p>
<p>Let&rsquo;s go deeper with CNNs and look at the <a href="https://www.cs.toronto.edu/%7Ekriz/cifar.html">CIFAR-10</a> image classification dataset, with a model that utilizes a deep convnet plus a multilayer perceptron, an architecture well-suited for image classification (similar to the <a href="https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3">VGG-16</a> architecture).</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-9_hu_3e89a9d69d2114d8.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-9_hu_188420deeffa2cca.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-9_hu_2994e1dc8b68f244.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-9.png 1200w" src="dl-cpu-gpu-9.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-10_hu_4c8240dc9addd1a4.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-10_hu_e38edfb433bf8413.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-10_hu_a879b46166fddc6d.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-10.png 1200w" src="dl-cpu-gpu-10.png"/> 
</figure>

<p>The results are similar to those of the normal MLP. Nothing fancy.</p>
<p>The bidirectional long short-term memory (LSTM) architecture is great for working with text data like IMDb reviews. When I wrote <a href="http://minimaxir.com/2017/06/keras-cntk/">my first benchmark article</a>, I noticed that CNTK performed significantly better than TensorFlow, and <a href="https://news.ycombinator.com/item?id=14538086">commenters on Hacker News</a> noted that TensorFlow uses an inefficient implementation of the LSTM on the GPU.</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/cntk-old_hu_b86c227c88de2e7d.webp 320w,/2017/11/benchmark-gpus/cntk-old_hu_3901dc880777da18.webp 768w,/2017/11/benchmark-gpus/cntk-old_hu_8d49b907914bb06b.webp 1024w,/2017/11/benchmark-gpus/cntk-old.png 1620w" src="cntk-old.png"/> 
</figure>

<p>However, with Keras&rsquo;s <a href="https://keras.io/layers/recurrent/#cudnnlstm">new CuDNNRNN layers</a> which leverage cuDNN, this inefficiency may be fixed, so for the K80/P100 TensorFlow GPU configs, I use a CuDNNLSTM layer instead of a normal LSTM layer. So let&rsquo;s take another look:</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-1_hu_f633549e7615557a.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-1_hu_c8eb1a82936955a7.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-1_hu_734746132ba497c3.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-1.png 1200w" src="dl-cpu-gpu-1.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-2_hu_6f0e2078d0fbe4a8.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-2_hu_f5299cfcd4184de5.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-2_hu_9c9b4dbee5321cd.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-2.png 1200w" src="dl-cpu-gpu-2.png"/> 
</figure>

<p><em>WOAH.</em> TensorFlow is now more than <em>three times as fast</em> as CNTK! (And compared against my previous benchmark, TensorFlow on the K80 w/ the CuDNNLSTM is about <em>7x as fast</em> as it once was!) Even the CPU-only versions of TensorFlow are faster than CNTK on the GPU now, which implies significant improvements in the ecosystem outside of the CuDNNLSTM layer itself. (And as a result, CPUs are still more cost efficient.)</p>
<p>Lastly, LSTM text generation of <a href="https://en.wikipedia.org/wiki/Friedrich_Nietzsche">Nietzsche&rsquo;s</a> <a href="https://s3.amazonaws.com/text-datasets/nietzsche.txt">writings</a> follows similar patterns to the other architectures, but without the drastic hit to the GPU.</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-11_hu_e64be99549e22a4a.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-11_hu_c9e45139e2d4d36b.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-11_hu_73f05d523cc746fa.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-11.png 1200w" src="dl-cpu-gpu-11.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/dl-cpu-gpu-12_hu_18c099feff0cab3f.webp 320w,/2017/11/benchmark-gpus/dl-cpu-gpu-12_hu_346cce6ac1dd882a.webp 768w,/2017/11/benchmark-gpus/dl-cpu-gpu-12_hu_784cadffdd30380.webp 1024w,/2017/11/benchmark-gpus/dl-cpu-gpu-12.png 1200w" src="dl-cpu-gpu-12.png"/> 
</figure>

<h2 id="conclusions">Conclusions</h2>
<p>The biggest surprise of these new benchmarks is that there is no configuration where the P100 is the most cost-effective option, even though the P100 is indeed faster than the K80 in all tests. Per <a href="https://developer.nvidia.com/cudnn">the cuDNN website</a>, however, there is apparently only about a 2x performance increase between the K80 and the P100 using cuDNN 6, which is mostly consistent with the results of my benchmarks:</p>
<figure>

    <img loading="lazy" srcset="/2017/11/benchmark-gpus/cudnn_hu_354d8fa8ab3eff29.webp 320w,/2017/11/benchmark-gpus/cudnn_hu_bb346ea37595e154.webp 768w,/2017/11/benchmark-gpus/cudnn_hu_9b3f6e3ea7ba3a02.webp 1024w,/2017/11/benchmark-gpus/cudnn.png 1688w" src="cudnn.png"/> 
</figure>

<p>I did not include a multi-GPU configuration in the benchmark data visualizations above using Keras&rsquo;s new <code>multi_gpu_model</code> <a href="https://keras.io/utils/#multi_gpu_model">function</a> because in my testing, the multi-GPU training <em>was equal to or worse than a single GPU</em> in all tests.</p>
<p>Taking these together, it&rsquo;s possible that the overhead introduced by parallel, advanced architectures <em>eliminates the benefits</em> for &ldquo;normal&rdquo; deep learning workloads which do not fully saturate the GPU. Rarely do people talk about diminishing returns in GPU performance with deep learning.</p>
<p>In the future, I want to benchmark deep learning performance against more advanced deep learning use cases such as <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">reinforcement learning</a> and deep CNNs like <a href="https://github.com/tensorflow/models/tree/master/research/inception">Inception</a>. But that doesn&rsquo;t mean these benchmarks are not relevant; as stated during the benchmark setup, the GPUs were tested against typical deep learning use cases, and now we see the tradeoffs that result.</p>
<p>In all, with the price cuts on GPU instances, cost-performance is often <em>on par</em> with preemptable CPU instances, which is an advantage if you want to train models faster and not risk the instance being killed unexpectedly. And there is still a lot of competition in this space: <a href="https://www.amazon.com">Amazon</a> offers a <code>p2.xlarge</code> <a href="https://aws.amazon.com/ec2/spot/">Spot Instance</a> with a K80 GPU for $0.15-$0.20 an hour, less than half of the corresponding Google Compute Engine K80 GPU instance, although with <a href="https://aws.amazon.com/ec2/spot/details/">a few bidding caveats</a> which I haven&rsquo;t fully explored yet. Competition will drive GPU prices down even further, and training deep learning models will become even easier.</p>
<p>And as the cuDNN chart above shows, things will get <em>very</em> interesting once Volta-based GPUs like the V100 are generally available and the deep learning frameworks support cuDNN 7 by default.</p>
<p><strong>UPDATE 12/17</strong>: <em>As pointed out by <a href="https://news.ycombinator.com/item?id=15941682">dantiberian on Hacker News</a>, Google Compute Engine now supports <a href="https://cloud.google.com/compute/docs/instances/preemptible#preemptible_with_gpu">preemptible GPUs</a>, which were apparently added after this post went live. Preemptible GPUs are exactly half the price of normal GPUs (for K80s and P100s, $0.22/hr and $0.73/hr respectively), so they&rsquo;re about double the cost efficiency (when factoring in the cost of the base preemptible instance), which would put them squarely ahead of CPUs in all cases (and since the CPU instances used here were also preemptible, it&rsquo;s apples-to-apples).</em></p>
<hr>
<p><em>All scripts for running the benchmark are available in <a href="https://github.com/minimaxir/keras-cntk-benchmark/tree/master/v2">this GitHub repo</a>. You can view the R/ggplot2 code used to process the logs and create the visualizations in <a href="http://minimaxir.com/notebooks/benchmark-gpus/">this R Notebook</a>.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Making Magic: the GIFening</title>
      <link>https://minimaxir.com/2017/11/magic-the-gifening/</link>
      <pubDate>Tue, 07 Nov 2017 08:10:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/11/magic-the-gifening/</guid>
      <description>As it turns out, creating a Twitter bot to tweet Magic card GIFs is easy to implement, but with a few interesting caveats.</description>
      <content:encoded><![CDATA[<p>After working at <a href="https://www.buzzfeed.com/">BuzzFeed</a> for a few months, I&rsquo;m now an expert in the proper usage of GIFs. My favorite GIF tool is the <a href="https://giphy.com">/giphy</a> command in <a href="https://slack.com">Slack</a>, which <a href="https://get.slack.help/hc/en-us/articles/204714258-Add-Giphy-search-to-Slack">puts a random GIF</a> according to a given phrase into the chat, with better-than-expected appropriateness of the phrase to the GIF.</p>
<p>Completely unrelated, I recently rediscovered <a href="https://github.com/Zulko/moviepy">MoviePy</a>, a Python library for programmatically editing videos and GIFs without requiring an expensive and slow video editing program. I had played with MoviePy a bit in 2014 when it was <a href="http://zulko.github.io/blog/2014/01/23/making-animated-gifs-from-video-files-with-python/">first released</a> and <a href="https://news.ycombinator.com/item?id=7121104">became viral</a>, but couldn&rsquo;t think of a creative application for the library at the time.</p>
<p>On a boring weekend I had a silly idea: why not create a program to superimpose appropriate GIFs onto <a href="https://magic.wizards.com/en">Magic: the Gathering</a> cards using these two tools? And even better, why not <em>automate</em> both the creation of the card GIFs and the tweeting of a new GIF every few hours?</p>
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="und" dir="ltr"><a href="https://t.co/sxqKUYHmfv">pic.twitter.com/sxqKUYHmfv</a></p>— Magic: The GIFening (@MTGIFening) <a href="https://twitter.com/MTGIFening/status/913993793052880897?ref_src=twsrc%5Etfw">September 30, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </span></p>
<p>As it turns out, creating a Twitter bot to tweet Magic card GIFs is <a href="https://github.com/minimaxir/magic-the-gifening">easy to implement</a>, but with a few interesting caveats. The end result is <a href="https://twitter.com/MTGIFening">@MTGIFening</a>. Here&rsquo;s how I typically create my crazy apps, step by step.</p>
<h2 id="feasibility-analysis">Feasibility Analysis</h2>
<p>Like all my data analysis projects, I started by checking whether it&rsquo;s possible to complete the project in a way that won&rsquo;t suck up a lot of free time hacking out convoluted solutions.</p>
<p><strong>Can I easily get a list of all Magic cards?</strong> Yes, via <a href="https://mtgjson.com">MTG JSON</a>, which has a downloadable JSON dump of all Magic cards.</p>
<p><strong>Can I easily get random GIFs from GIPHY?</strong> Yes, there is a /random endpoint in the <a href="https://developers.giphy.com">GIPHY API</a> which returns a random GIF for a specified phrase, like the /giphy Slack command. The GIPHY API requires registration, but has generous rate limits (10k requests/day).</p>
<p><strong>Can I easily composite a GIF onto an image with MoviePy?</strong> Yes, compositing is a <em>primary use case</em> for the library, with many tutorials in the documentation.</p>
<p><strong>Can I easily get an image for a specified Magic card?</strong> Unsure. The official tool for viewing Magic card images is <a href="http://gatherer.wizards.com/Pages/Default.aspx">Gatherer</a>. After checking the image source for the cards, each card image in Gatherer has a URL that follows this schema: <code>http://gatherer.wizards.com/Handlers/Image.ashx?multiverseid=XXXXX&amp;type=card</code>. That&rsquo;s easy to understand, but what&rsquo;s a multiverseid?</p>
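<p>In other words, a card image URL can be built directly from a card&rsquo;s multiverseid (using Invert the Skies, whose multiverseid is 151108):</p>

```python
def gatherer_card_url(multiverse_id):
    """Build the Gatherer card-image URL from a card's multiverseid."""
    return ("http://gatherer.wizards.com/Handlers/Image.ashx"
            "?multiverseid={}&type=card".format(multiverse_id))

print(gatherer_card_url(151108))
# http://gatherer.wizards.com/Handlers/Image.ashx?multiverseid=151108&type=card
```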
<p><strong>Is there a mapping of multiverseid to Magic cards from MTG JSON?</strong> Yes, the multiverseid for each Magic card is <a href="https://mtgjson.com/documentation.html">present as a field</a> in the &ldquo;All Sets&rdquo; dataset (though oddly not in the &ldquo;All Cards&rdquo; dataset). A quick manual check showed that using the multiverseid from the MTG JSON dataset results in the correct image from Gatherer.</p>
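<p>The URL schema above is simple enough to template directly. A minimal sketch (the function name is my own, not from the repo):</p>

```python
# Illustrative sketch, not the bot's actual code: build the Gatherer
# card-image URL from a card's multiverseid.
GATHERER_URL = "http://gatherer.wizards.com/Handlers/Image.ashx?multiverseid={}&type=card"

def card_image_url(multiverseid):
    """Return the Gatherer image URL for the card with this multiverseid."""
    return GATHERER_URL.format(multiverseid)
```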
<p>Everything looked good to me. Let&rsquo;s dive right in, <a href="https://github.com/minimaxir/magic-the-gifening/commits/master">commit by commit</a>.</p>
<h2 id="implementing-magic-the-gifening">Implementing Magic: The GIFening</h2>
<p>The first thing I did was process the Magic card data, although for this project I limited the cards to Instants and Sorceries, which in Magic game mechanics represent &ldquo;actions&rdquo; and are better suited to GIFs. <em>For each set, retrieve the cards in the set; for each card, if it&rsquo;s an Instant/Sorcery, log its name and multiverseid</em>. Thanks to the magic of Python, this pseudocode is close to the <a href="https://github.com/minimaxir/magic-the-gifening/commit/3f626ae5d49a567322c6237210ab554281d462f4">actual code</a>.</p>
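<p>That pseudocode can be sketched in a few lines. Note that the exact field names here (<code>cards</code>, <code>type</code>, <code>name</code>, <code>multiverseid</code>) are assumptions about the shape of the MTG JSON &ldquo;All Sets&rdquo; dump, not a copy of the repo&rsquo;s code:</p>

```python
# A sketch of the pseudocode above; field names are assumed from the
# MTG JSON "All Sets" documentation.
def instants_and_sorceries(all_sets):
    """Collect (name, multiverseid) for every Instant or Sorcery card."""
    results = []
    for set_data in all_sets.values():
        for card in set_data["cards"]:
            card_type = card.get("type", "")
            # Only keep cards that are Instants/Sorceries AND have an id
            # usable for retrieving a Gatherer image.
            if ("Instant" in card_type or "Sorcery" in card_type) and "multiverseid" in card:
                results.append((card["name"], card["multiverseid"]))
    return results
```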
<p>The next objective was to implement the GIPHY API to get a GIF. The very first thing I did was add a local secrets file containing my personal API key for GIPHY, and <em>immediately</em> add that file to a <code>.gitignore</code> so I don&rsquo;t accidentally leak it. GIPHY has an <a href="https://developers.giphy.com/explorer/">API Explorer</a> which allows developers to quickly test an example input phrase and see corresponding output from the API. For example, here&rsquo;s part of what the API returns for <a href="http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=151108">Invert the Skies</a> (although since it&rsquo;s the /random endpoint, your results may vary):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="s2">&#34;image_url&#34;</span><span class="err">:</span> <span class="s2">&#34;https://media1.giphy.com/media/plbsEwLwQvzLa/giphy.gif&#34;</span><span class="err">,</span>
</span></span><span class="line"><span class="cl"><span class="s2">&#34;image_mp4_url&#34;</span><span class="err">:</span> <span class="s2">&#34;https://media1.giphy.com/media/plbsEwLwQvzLa/giphy.mp4&#34;</span><span class="err">,</span>
</span></span><span class="line"><span class="cl"><span class="s2">&#34;image_frames&#34;</span><span class="err">:</span> <span class="s2">&#34;31&#34;</span><span class="err">,</span>
</span></span><span class="line"><span class="cl"><span class="s2">&#34;image_width&#34;</span><span class="err">:</span> <span class="s2">&#34;480&#34;</span><span class="err">,</span>
</span></span><span class="line"><span class="cl"><span class="s2">&#34;image_height&#34;</span><span class="err">:</span> <span class="s2">&#34;270&#34;</span><span class="err">,</span>
</span></span></code></pre></div><p>The <code>image_url</code> unsurprisingly corresponds to the <a href="https://media1.giphy.com/media/plbsEwLwQvzLa/giphy.gif">raw GIF</a>, but as a bonus, GIPHY also includes a link to an <a href="https://media1.giphy.com/media/plbsEwLwQvzLa/giphy.mp4">MP4 video</a> of the GIF, which has a much smaller file size and is better suited for compositing. The API output also includes the width and height (in pixels) of the GIF. The art in a Magic card follows a 4:3 <a href="https://en.wikipedia.org/wiki/Aspect_ratio_%28image%29">aspect ratio</a>, i.e. the width divided by the height equals 1.33. If the dimensions of the GIF are too far outside that ratio, resizing the GIF to fit the Magic art frame will result in noticeable distortion. I minimized this distortion by checking whether the random GIF has a width:height ratio between 1.2 and 1.6 before accepting it. Since there&rsquo;s a chance of failure (along with unknown bugs a random GIF could trigger), I also capped the number of attempts to retrieve an appropriate GIF. All done in <a href="https://github.com/minimaxir/magic-the-gifening/commit/c2e4b6b9d58d1aa360f6f67a049ec962d0430b91">one commit</a>.</p>
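<p>The acceptance loop looks roughly like this. The GIPHY call is abstracted away as <code>fetch_random_gif()</code>, and the attempt cap of 10 is an assumed value for illustration, not necessarily what the repo uses:</p>

```python
# Sketch of the ratio check and retry cap described above.
MIN_RATIO, MAX_RATIO = 1.2, 1.6
MAX_ATTEMPTS = 10  # assumed cap, for illustration

def acceptable_ratio(width, height):
    """True if the GIF is close enough to the card art's 4:3 (1.33) ratio."""
    return MIN_RATIO <= width / height <= MAX_RATIO

def get_usable_gif(fetch_random_gif):
    """fetch_random_gif() should return (mp4_url, width, height) from GIPHY."""
    for _ in range(MAX_ATTEMPTS):
        url, width, height = fetch_random_gif()
        if acceptable_ratio(width, height):
            return url
    return None  # caller handles the no-suitable-GIF case
```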
<p>Getting the card image from Gatherer is <a href="https://github.com/minimaxir/magic-the-gifening/commit/c92440cd459640da9346cf31a79e768ac8641ea9">trivial</a>, so then I worked on combining the GIF and the card image. MoviePy has a <a href="http://zulko.github.io/moviepy/getting_started/compositing.html">good tutorial</a> for specifying the position of one clip onto another by specifying the upper-left corner of the bottom-image where the GIF will be placed, while simultaneously resizing the GIF to a given width and height.</p>
<figure>

    <img loading="lazy" srcset="/2017/11/magic-the-gifening/videoWH_hu_c7b4dfa4c1762942.webp 320w,/2017/11/magic-the-gifening/videoWH.jpeg 500w" src="videoWH.jpeg"/> 
</figure>

<p>I manually zoomed into the card image using a photo editor (<a href="http://www.pixelmator.com/mac/">Pixelmator</a>) to find the upper-left corner of the card art:</p>
<figure>

    <img loading="lazy" srcset="/2017/11/magic-the-gifening/zoomin_hu_d918691f6a0c5282.webp 320w,/2017/11/magic-the-gifening/zoomin_hu_a1316b8e5b5ae555.webp 768w,/2017/11/magic-the-gifening/zoomin_hu_9c398593d95acc27.webp 1024w,/2017/11/magic-the-gifening/zoomin.png 1199w" src="zoomin.png"/> 
</figure>

<p>In this case, the pixel coordinates for the upper-left corner of the card art are <code>(17,35)</code>. The upper-right and bottom-left corners can be used to determine the target width and height of the GIF respectively, and can be found the same way. Simply composite the Magic card with the resized-and-positioned GIF, set the duration of the &ldquo;new&rdquo; GIF to that of the source GIF, and <code>write_gif</code>. <a href="https://github.com/minimaxir/magic-the-gifening/commit/55a52ddfc7f43d128c08c8a243254e08a171de5e">That&rsquo;s that</a>!</p>
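<p>A hypothetical helper for that measurement step, deriving the GIF&rsquo;s position and target size from three manually-measured corners of the art frame (all corner values besides <code>(17, 35)</code> are made up here for illustration):</p>

```python
# Derive the composite geometry from three measured corners of the art box.
def art_box(upper_left, upper_right, bottom_left):
    x, y = upper_left
    return {
        "position": (x, y),           # where to place the GIF on the card
        "width": upper_right[0] - x,  # target width to resize the GIF to
        "height": bottom_left[1] - y, # target height to resize the GIF to
    }
```

In MoviePy terms, these values would feed the GIF clip&rsquo;s resize and position calls before compositing it onto the card image, as in the tutorial linked above.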
<p>To finish things up, I wrote a script to load all the cards from the processed card list into memory, select a card at random, use the helper functions to retrieve a GIPHY GIF and composite it with the card, then upload the resulting GIF to Twitter. I hadn&rsquo;t worked with the Twitter API in a while; a quick Google search for a modern Twitter API client in Python returns <a href="https://github.com/ryanmcgrath/twython">Twython</a>, which conveniently includes an example on <a href="https://twython.readthedocs.io/en/latest/usage/advanced_usage.html#updating-status-with-image">how to upload an image to Twitter</a>! And after running the script a few times, the full workflow indeed worked!</p>
<p>Not bad for a couple hours of scripting. But I was not close to finished.</p>
<h2 id="the-endless-fun-of-qa">The Endless Fun of QA</h2>
<p>One of the reasons I enjoy doing silly projects (especially silly data projects) is that I tend to hit unsexy edge cases which typical development blogs and tutorials rarely discuss. In this case, I quickly found that the Twitter API has a <a href="https://developer.twitter.com/en/docs/media/upload-media/overview">5 MB limit</a> on image uploads, which is a problem because the resulting GIFs are huge and often exceed that limit (looking back on it, there is counterintuitively a different endpoint intended for GIF uploads).</p>
<p>GIFs on Twitter are actually displayed as videos, in order to save bandwidth. Since Twitter transcodes uploaded GIFs anyways, it makes more sense to upload <em>audioless videos</em> instead of GIFs (and as a bonus, after the death of Vine, Twitter will auto-loop videos less than 6 seconds).</p>
<p>Creating videos is easy with MoviePy: just call <code>write_videofile</code> instead of <code>write_gif</code>, and use Twython&rsquo;s video uploading example to upload. The result was an &ldquo;unknown&rdquo; error on upload. I verified by uploading the video manually to Twitter&hellip;and the Twitter UI failed to recognize it as a video, even though the video itself played fine in QuickTime. This is the annoying type of coding problem that&rsquo;s too specific for <a href="https://stackoverflow.com">Stack Overflow</a> to provide help. After a bit of trial and error involving video codecs and settings, the solution was to pass a <code>-pix_fmt yuv420p</code> parameter to the video encoder, because Twitter apparently only accepts the more broadly compatible yuv420p pixel format. Oh well. It worked, and both manual and API uploads to Twitter succeeded.</p>
<p>I also ran into an issue where Twitter refused to accept super-short videos, where the source GIF was only a couple of frames long. The solution was to loop the GIF to at least 2 seconds if it was shorter, which fixed that problem.</p>
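<p>The arithmetic for that workaround is a one-liner. A sketch, assuming a 2-second minimum (MoviePy&rsquo;s loop effect can then repeat the clip that many times):</p>

```python
import math

# Compute how many plays of a short clip are needed to reach the
# (assumed) 2-second minimum Twitter will accept.
MIN_DURATION = 2.0

def loops_needed(duration):
    """Times to repeat the clip so its total length is at least MIN_DURATION."""
    if duration >= MIN_DURATION:
        return 1
    return math.ceil(MIN_DURATION / duration)
```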
<p>(As I was writing this post a month later, I discovered that both of these video upload constraints <a href="https://developer.twitter.com/en/docs/media/upload-media/uploading-media/media-best-practices">are indeed covered in the Twitter documentation</a>, which makes me look very silly in retrospect!)</p>
<p>These changes fixed most of the upload issues. However, when writing the initial script, I had forgotten that the borders of Magic cards have <a href="https://mtg.gamepedia.com/Card_frame">changed over the years</a>, which also changed the position and size of the card art. <strong>Is there a way to check when a card was printed?</strong> Yes, the &ldquo;All Sets&rdquo; dataset contains the release date of each set, so with that, I can <a href="https://github.com/minimaxir/magic-the-gifening/commit/0dfb678f1955f50b54b632e57087df847ec16f05">hard-code</a> the dates of the sets where the borders changed, and note the border type at processing time. I then used Pixelmator again to measure the art dimensions for each type of border, and used conditional statements to retrieve the correct dimensions for the type of border when compositing.</p>
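<p>An illustrative sketch of that lookup. The cutoff dates (8th Edition&rsquo;s modern frame, the M15 frame) and the pixel geometry per era are placeholder assumptions here, not the repo&rsquo;s real values:</p>

```python
from datetime import date

# Map a set's release date to a border era, then to that era's art box.
# Cutoff dates and pixel values are illustrative placeholders.
BORDER_CUTOFFS = [
    (date(2014, 7, 18), "m15"),     # assumed M15-frame cutoff
    (date(2003, 7, 28), "modern"),  # assumed 8th Edition modern-frame cutoff
]

ART_BOXES = {  # hypothetical (x, y, width, height) of the card art per era
    "m15": (17, 35, 188, 138),
    "modern": (18, 36, 187, 137),
    "old": (22, 30, 180, 140),
}

def border_type(release_date):
    """Map a set's release date to its card-frame era."""
    for cutoff, name in BORDER_CUTOFFS:
        if release_date >= cutoff:
            return name
    return "old"
```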
<p>Lastly, I added general <code>try/except</code> error handling to prevent the script from failing fatally, and to try again with a different card if it does. That covers most of the edge cases!</p>
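<p>In broad strokes, that error handling looks like this (<code>make_tweet</code> is a stand-in for the real pipeline function, and the retry cap is an assumed value):</p>

```python
import random

# If anything in the pipeline fails for one card, log it and try a
# different random card, up to a fixed number of attempts.
def tweet_with_retries(cards, make_tweet, max_attempts=5):
    for _ in range(max_attempts):
        card = random.choice(cards)
        try:
            return make_tweet(card)
        except Exception as e:
            print("Failed on {!r}: {}; retrying with another card".format(card, e))
    return None
```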
<h2 id="results">Results</h2>
<p>After running the script many times with all the fixes in place, I felt the Twitter account was good to go. The initial results showed a lot of promise:</p>
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="und" dir="ltr"><a href="https://t.co/BsZ7eIcunl">pic.twitter.com/BsZ7eIcunl</a></p>— Magic: The GIFening (@MTGIFening) <a href="https://twitter.com/MTGIFening/status/913981726182981632?ref_src=twsrc%5Etfw">September 30, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></span></p>
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="und" dir="ltr"><a href="https://t.co/picJJk6mBm">pic.twitter.com/picJJk6mBm</a></p>— Magic: The GIFening (@MTGIFening) <a href="https://twitter.com/MTGIFening/status/912525635632775168?ref_src=twsrc%5Etfw">September 26, 2017</a></blockquote></p>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></span>
<p>Surprisingly, the script was able to generate <em>visual puns</em> in cards, completely by chance!</p>
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="und" dir="ltr"><a href="https://t.co/AnpzU8xVho">pic.twitter.com/AnpzU8xVho</a></p>— Magic: The GIFening (@MTGIFening) <a href="https://twitter.com/MTGIFening/status/913972922330497024?ref_src=twsrc%5Etfw">September 30, 2017</a></blockquote></p>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></span>
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="und" dir="ltr"><a href="https://t.co/01vGRcq2Mj">pic.twitter.com/01vGRcq2Mj</a></p>— Magic: The GIFening (@MTGIFening) <a href="https://twitter.com/MTGIFening/status/913987002235740160?ref_src=twsrc%5Etfw">September 30, 2017</a></blockquote></p>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></span>
<p>The next step was to automate the script to run and post Tweets at a specific time interval. After experimenting a bit, I found that <a href="https://github.com/minimaxir/magic-the-gifening/commit/f355d5e80503c67c6e1a0e5fd1b744faf3cf8223">the best solution</a> was to use a <a href="https://en.wikipedia.org/wiki/Cron">cron job</a> in a <a href="https://www.docker.com/what-docker">Docker container</a> containing the script and its dependencies, for complicated reasons which will require another blog post to explain.</p>
<p>After letting Magic: the GIFening run for a few days without fatal issues, I decided to publicize the Twitter account and posted it to the <a href="https://www.reddit.com/r/magicTCG/comments/7598g5/i_made_a_twitter_bot_which_tweets_magic_cards/">/r/MagicTCG subreddit</a> and <a href="https://news.ycombinator.com/item?id=15449955">Hacker News</a>. To my surprise, the project performed extremely well on both with 100+ upvotes on each, and the <a href="https://github.com/minimaxir/magic-the-gifening">GitHub repo</a> itself received 100+ Stars.</p>
<p>In all, making Magic: the GIFening was a fun project. In retrospect, talking through the commits made me realize I committed many bad coding practices in my haste to get the project done ASAP (specifically, failing to check whether certain edge cases were documented, violating <a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself">DRY</a>, and forgetting to remove specific types of cards like <a href="https://twitter.com/MTGIFening/status/924969160307744769">split cards</a>). Obviously there isn&rsquo;t a multimillion-dollar startup opportunity in creating random GIFs of Magic cards, but I&rsquo;ll fix a few remaining issues and keep the Twitter bot running.</p>
]]></content:encoded>
    </item>
    <item>
      <title>How to Make High Quality Data Visualizations for Websites With R and ggplot2</title>
      <link>https://minimaxir.com/2017/08/ggplot2-web/</link>
      <pubDate>Mon, 14 Aug 2017 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/08/ggplot2-web/</guid>
      <description>In general, it takes little additional effort to make something unique with ggplot2, and the effort is well worth it.</description>
      <content:encoded><![CDATA[<p>If you&rsquo;ve been following my blog, I like to use <a href="https://cran.r-project.org">R</a> and <a href="http://ggplot2.tidyverse.org/reference/">ggplot2</a> for data visualization. A lot.</p>
<p>One of my older blog posts, <a href="http://minimaxir.com/2015/02/ggplot-tutorial/">An Introduction on How to Make Beautiful Charts With R and ggplot2</a>, is still one of my most-trafficked posts years later, and even today I see techniques from that particular post incorporated into modern data visualizations on sites such as <a href="https://www.reddit.com">Reddit&rsquo;s</a> <a href="https://www.reddit.com/r/dataisbeautiful/">/r/dataisbeautiful</a> subreddit.</p>
<p>However, that post is a little outdated. Thanks to a few updates to ggplot2 since then and other advances in data visualization best practices, making pretty charts for websites/blogs using R and ggplot2 is even easier, quicker, <em>and</em> more fun!</p>
<h2 id="quick-introduction-to-ggplot2">Quick Introduction to ggplot2</h2>
<p>ggplot2 uses a concise, declarative approach to creating charts, as opposed to the more imperative style of Python&rsquo;s <a href="https://matplotlib.org">matplotlib</a> and base R. It also includes a few example datasets for practicing ggplot2 functionality; for example, the <code>mpg</code> <a href="http://ggplot2.tidyverse.org/reference/mpg.html">dataset</a> records the fuel economy of popular models of cars in 1999 and 2008.</p>
<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/mpg_hu_a640dba59b901764.webp 320w,/2017/08/ggplot2-web/mpg_hu_27b6e8ca229c6f49.webp 768w,/2017/08/ggplot2-web/mpg_hu_cbb195b8dd54f306.webp 1024w,/2017/08/ggplot2-web/mpg.png 1376w" src="mpg.png"/> 
</figure>

<p>Let&rsquo;s say you want to create a <a href="https://en.wikipedia.org/wiki/Scatter_plot">scatter plot</a>. Following <a href="http://ggplot2.tidyverse.org/reference/geom_smooth.html">a great example</a> from the ggplot2 documentation, let&rsquo;s plot the highway mileage of the car vs. the <a href="https://en.wikipedia.org/wiki/Engine_displacement">volume displacement</a> of the engine. In ggplot2, first you instantiate the chart with the <code>ggplot()</code> function, specifying the source dataset and the core aesthetics you want to plot, such as x, y, color, and fill. In this case, we set the core aesthetics to x = displacement and y = mileage, and add a <code>geom_point()</code> layer to make a scatter plot:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">mpg</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">displ</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">hwy</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">			<span class="nf">geom_point</span><span class="p">()</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/plot1_hu_cdb77dbedd0d1aec.webp 320w,/2017/08/ggplot2-web/plot1_hu_e036e0d8db01be8d.webp 768w,/2017/08/ggplot2-web/plot1.png 994w" src="plot1.png"/> 
</figure>

<p>As we can see, there is a negative correlation between the two metrics. I&rsquo;m sure you&rsquo;ve seen plots like these around the internet before. But with only a couple of lines of code, you can make them look more contemporary.</p>
<p>ggplot2 lets you add a well-designed theme with just one line of code. Relatively new to <code>ggplot2</code> is <code>theme_minimal()</code>, which <a href="http://ggplot2.tidyverse.org/reference/ggtheme.html">generates</a> a muted style similar to <a href="http://fivethirtyeight.com">FiveThirtyEight</a>&rsquo;s modern data visualizations:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">theme_minimal</span><span class="p">()</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/plot2_hu_1abcf13146957fb5.webp 320w,/2017/08/ggplot2-web/plot2_hu_70ccfd8927b0ba23.webp 768w,/2017/08/ggplot2-web/plot2.png 994w" src="plot2.png"/> 
</figure>

<p>But we can still add color. Setting a color aesthetic on a character/categorical variable will set the colors of the corresponding points, making it easy to differentiate at a glance.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">mpg</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">displ</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">hwy</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">class</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">			<span class="nf">geom_point</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">			<span class="nf">theme_minimal</span><span class="p">()</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/plot3_hu_c34c19184e3ebcf2.webp 320w,/2017/08/ggplot2-web/plot3_hu_f25d6095028fc6d5.webp 768w,/2017/08/ggplot2-web/plot3.png 994w" src="plot3.png"/> 
</figure>

<p>Adding the color aesthetic certainly makes things much prettier, and ggplot2 automatically adds a legend for the colors as well.
However, for this particular visualization, it is difficult to see trends in the points for each class. An easy way around this is to add a <a href="https://en.wikipedia.org/wiki/Least_squares">least squares regression</a> trendline for each class <a href="http://ggplot2.tidyverse.org/reference/geom_smooth.html">using</a> <code>geom_smooth()</code> (which normally adds a smoothed line, but since there isn&rsquo;t a lot of data for each group, we force it to a linear model and skip plotting confidence intervals):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">	<span class="nf">geom_smooth</span><span class="p">(</span><span class="n">method</span> <span class="o">=</span> <span class="s">&#34;lm&#34;</span><span class="p">,</span> <span class="n">se</span> <span class="o">=</span> <span class="bp">F</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/plot4_hu_d736cf0961bca1f5.webp 320w,/2017/08/ggplot2-web/plot4_hu_6e954c53dc4d849a.webp 768w,/2017/08/ggplot2-web/plot4.png 994w" src="plot4.png"/> 
</figure>

<p>Pretty neat, and now comparative trends are much more apparent! For example, pickups and SUVs have similar efficiency, which makes intuitive sense.</p>
<p>The chart axes should be labeled (<em>always</em> label your charts!). All the typical labels, like <code>title</code>, <code>x</code>-axis, and <code>y</code>-axis, can be set with the <code>labs()</code> function. But relatively new to ggplot2 are the <code>subtitle</code> and <code>caption</code> fields, both of which do what you expect:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">&#34;Efficiency of Popular Models of Cars&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">subtitle</span><span class="o">=</span><span class="s">&#34;By Class of Car&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">x</span><span class="o">=</span><span class="s">&#34;Engine Displacement (liters)&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">y</span><span class="o">=</span><span class="s">&#34;Highway Miles per Gallon&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">caption</span><span class="o">=</span><span class="s">&#34;by Max Woolf — minimaxir.com&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/plot5_hu_d6d809535fa48bb6.webp 320w,/2017/08/ggplot2-web/plot5_hu_68db8294c51b2638.webp 768w,/2017/08/ggplot2-web/plot5.png 994w" src="plot5.png"/> 
</figure>

<p>That&rsquo;s a pretty good start. Now let&rsquo;s take it to the next level.</p>
<h2 id="how-to-save-a-ggplot2-chart-for-web">How to Save A ggplot2 chart For Web</h2>
<p>Something surprisingly undiscussed in the field of data visualization is how to <em>save</em> a chart as a high quality image file. For example, with <a href="https://products.office.com/en-us/excel">Excel</a> charts, Microsoft <a href="https://support.office.com/en-us/article/Save-a-chart-as-a-picture-in-Excel-for-Windows-254bbf9a-1ce1-459f-914a-4902e8ca9217">officially recommends</a> copying the chart, <em>pasting it as an image back into Excel</em>, then saving the pasted image, without any control over image quality and size (the <em>real</em> best way to save an Excel/<a href="https://www.apple.com/numbers/">Numbers</a> chart as an image for a webpage is to copy/paste the chart object into a <a href="https://products.office.com/en-us/powerpoint">PowerPoint</a>/<a href="https://www.apple.com/keynote/">Keynote</a> slide, and export <em>the slide</em> as an image. This also makes it extremely easy to annotate/brand said chart beforehand in PowerPoint/Keynote).</p>
<p>R IDEs such as <a href="https://www.rstudio.com">RStudio</a> have a chart-saving UI with the typical size/filetype options. But if you save an image from this UI, the shapes and texts of the resulting image will be heavily aliased (R <a href="https://danieljhocking.wordpress.com/2013/03/12/high-resolution-figures-in-r/">renders images at 72 dpi</a> by default, which is much lower than that of modern HiDPI/Retina displays).</p>
<p>The data visualizations used earlier in this post were generated in-line as part of an <a href="http://rmarkdown.rstudio.com/r_notebooks.html">R Notebook</a>, but it is surprisingly difficult to extract a generated chart as a separate file. Fortunately, ggplot2 also has <code>ggsave()</code>, which saves the image to disk with antialiasing (making the fonts/shapes in the chart look much better) and assumes a default dpi of 300. Saving charts using <code>ggsave()</code>, and adjusting the sizes of the text and geoms to compensate for the higher dpi, makes the charts look very presentable. A width of 4 and a height of 3 results in a 1200x900px image, which, if posted on a blog with a content width of ~600px (like mine), will render at full resolution on HiDPI/Retina displays, or downsample appropriately otherwise. Thanks to modern PNG compression, the file size/bandwidth cost of using larger images is minimal.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">mpg</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">displ</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">hwy</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">class</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">geom_smooth</span><span class="p">(</span><span class="n">method</span> <span class="o">=</span> <span class="s">&#34;lm&#34;</span><span class="p">,</span> <span class="n">se</span><span class="o">=</span><span class="bp">F</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="m">0.5</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">geom_point</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">0.5</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">theme_minimal</span><span class="p">(</span><span class="n">base_size</span><span class="o">=</span><span class="m">9</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">labs</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">&#34;Efficiency of Popular Models of Cars&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">subtitle</span><span class="o">=</span><span class="s">&#34;By Class of Car&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">x</span><span class="o">=</span><span class="s">&#34;Engine Displacement (liters)&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">y</span><span class="o">=</span><span class="s">&#34;Highway Miles per Gallon&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">         <span class="n">caption</span><span class="o">=</span><span class="s">&#34;by Max Woolf — minimaxir.com&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">ggsave</span><span class="p">(</span><span class="s">&#34;tutorial-0.png&#34;</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="m">4</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="m">3</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-0_hu_3491aa5980f9ce57.webp 320w,/2017/08/ggplot2-web/tutorial-0_hu_386b410431f20644.webp 768w,/2017/08/ggplot2-web/tutorial-0_hu_3025381ee3b4d2f8.webp 1024w,/2017/08/ggplot2-web/tutorial-0.png 1200w" src="tutorial-0.png"/> 
</figure>

<p>Compare to the previous non-ggsave chart, which is more blurry around text/shapes:</p>
<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/plot5_hu_d6d809535fa48bb6.webp 320w,/2017/08/ggplot2-web/plot5_hu_68db8294c51b2638.webp 768w,/2017/08/ggplot2-web/plot5.png 994w" src="plot5.png"/> 
</figure>

<p>For posterity, here&rsquo;s the same chart saved at 1200x900px using the RStudio image-saving UI:</p>
<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/plot-1200-900_hu_2be01453db7558b.webp 320w,/2017/08/ggplot2-web/plot-1200-900_hu_9f79679f34f611e0.webp 768w,/2017/08/ggplot2-web/plot-1200-900_hu_69a1ff889438a21b.webp 1024w,/2017/08/ggplot2-web/plot-1200-900.png 1200w" src="plot-1200-900.png"/> 
</figure>

<p>Note that the antialiasing optimizations assume that you are <em>not</em> uploading the final chart to a service like <a href="https://medium.com">Medium</a> or <a href="https://wordpress.com">WordPress.com</a>, which will compress the images and reduce the quality anyways. But if you are uploading it to Reddit or self-hosting your own blog, it&rsquo;s definitely worth it.</p>
<h2 id="fancy-fonts">Fancy Fonts</h2>
<p>Changing the chart font is another way to add personal flair.
Theme functions like <code>theme_minimal()</code> accept a <code>base_family</code> parameter. With that, you can specify any font family as the default instead of the base sans-serif. (On Windows, you may need to install the <code>extrafont</code> package first). Fonts from <a href="https://fonts.google.com">Google Fonts</a> are free and work easily with ggplot2 once installed. For example, we can use <a href="https://fonts.google.com/specimen/Roboto">Roboto</a>, Google&rsquo;s modern font which has also been getting a lot of usage on <a href="https://stackoverflow.com">Stack Overflow</a>&rsquo;s great ggplot2 <a href="https://stackoverflow.blog/2017/06/15/developers-use-spaces-make-money-use-tabs/">data visualizations</a>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">theme_minimal</span><span class="p">(</span><span class="n">base_size</span><span class="o">=</span><span class="m">9</span><span class="p">,</span> <span class="n">base_family</span><span class="o">=</span><span class="s">&#34;Roboto&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-1_hu_895dfabb6331218f.webp 320w,/2017/08/ggplot2-web/tutorial-1_hu_1014960e9eb00de2.webp 768w,/2017/08/ggplot2-web/tutorial-1_hu_283f71e45e79c23c.webp 1024w,/2017/08/ggplot2-web/tutorial-1.png 1200w" src="tutorial-1.png"/> 
</figure>

<p>A general text design guideline is to use fonts of different weights/widths for different hierarchies of content. In this case, we can use a bolder condensed font for the title, and deemphasize the subtitle and caption using lighter colors, all done using the <code>theme()</code> <a href="http://ggplot2.tidyverse.org/reference/theme.html">function</a>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">theme</span><span class="p">(</span><span class="n">plot.subtitle</span> <span class="o">=</span> <span class="nf">element_text</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">&#34;#666666&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">          <span class="n">plot.title</span> <span class="o">=</span> <span class="nf">element_text</span><span class="p">(</span><span class="n">family</span><span class="o">=</span><span class="s">&#34;Roboto Condensed Bold&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">          <span class="n">plot.caption</span> <span class="o">=</span> <span class="nf">element_text</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">&#34;#AAAAAA&#34;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="m">6</span><span class="p">))</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-2_hu_45115eb223bac5fe.webp 320w,/2017/08/ggplot2-web/tutorial-2_hu_96b9283a20212470.webp 768w,/2017/08/ggplot2-web/tutorial-2_hu_ec38c0bbfa9bd892.webp 1024w,/2017/08/ggplot2-web/tutorial-2.png 1200w" src="tutorial-2.png"/> 
</figure>

<p>It&rsquo;s worth noting that data visualizations posted on websites should be easily <em>legible</em> for mobile-device users as well, hence the intentional use of fonts larger than those in charts typically produced in desktop-oriented Excel.</p>
<p>Additionally, all theming options can be set as a session default at the beginning of a script using <code>theme_set()</code>, saving even more time since the theme does not have to be recreated for each chart.</p>
<h2 id="the-ggplot2-colors">The &ldquo;ggplot2 colors&rdquo;</h2>
<p>The &ldquo;ggplot2 colors&rdquo; for categorical variables are infamous for being the primary indicator of a chart being made with ggplot2. But there is a science to them; by default, ggplot2 selects colors using the <code>scale_color_hue()</code> <a href="http://ggplot2.tidyverse.org/reference/scale_hue.html">function</a>, which picks colors in the HCL color space by varying the hue [H] between 0 and 360 while keeping chroma [C] and luminance [L] constant. As a result, ggplot2 selects the most <em>distinct</em> colors possible at constant lightness. For example, if you have 2 categories, ggplot2 chooses the colors with h = 0 and h = 180; with 3 categories, h = 0, h = 120, and h = 240, etc.</p>
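<p>A minimal sketch of that hue-spacing rule in Python (it ignores ggplot2&rsquo;s default starting offset of 15&deg;, which is an implementation detail):</p>

```python
def hue_angles(n):
    """Evenly space n hues around the 360-degree color wheel, mimicking
    how scale_color_hue() picks maximally distinct colors per category."""
    return [i * 360 / n for i in range(n)]

print(hue_angles(2))  # [0.0, 180.0]
print(hue_angles(3))  # [0.0, 120.0, 240.0]
```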
<p>It&rsquo;s smart, but does make a given chart lose distinctness when many other ggplot2 charts use the same selection methodology. A quick way to take advantage of this hue dispersion while still making the colors unique is to change the lightness; by default, <code>l = 65</code>, but setting it slightly lower will make the charts look more professional/<a href="https://www.bloomberg.com">Bloomberg</a>-esque.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p_color</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">        <span class="nf">scale_color_hue</span><span class="p">(</span><span class="n">l</span> <span class="o">=</span> <span class="m">40</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-4_hu_264938515f752543.webp 320w,/2017/08/ggplot2-web/tutorial-4_hu_eb1a54c4bb2c178d.webp 768w,/2017/08/ggplot2-web/tutorial-4_hu_f9b6b0a558fcfa8b.webp 1024w,/2017/08/ggplot2-web/tutorial-4.png 1200w" src="tutorial-4.png"/> 
</figure>

<h2 id="rcolorbrewer">RColorBrewer</h2>
<p>Another coloring option for ggplot2 charts is the <a href="http://colorbrewer2.org/#type=sequential&amp;scheme=BuGn&amp;n=3">ColorBrewer</a> palettes implemented in the <code>RColorBrewer</code> package, which are supported natively in ggplot2 with functions such as <code>scale_color_brewer()</code>. The sequential palettes like &ldquo;Blues&rdquo; and &ldquo;Greens&rdquo; do what the name implies:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p_color</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">        <span class="nf">scale_color_brewer</span><span class="p">(</span><span class="n">palette</span><span class="o">=</span><span class="s">&#34;Blues&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-5_hu_1b83219640238315.webp 320w,/2017/08/ggplot2-web/tutorial-5_hu_437d57ae9e55c1f4.webp 768w,/2017/08/ggplot2-web/tutorial-5_hu_48a903b9c7756119.webp 1024w,/2017/08/ggplot2-web/tutorial-5.png 1200w" src="tutorial-5.png"/> 
</figure>

<p>A famous diverging palette for visualizations on /r/dataisbeautiful is the &ldquo;Spectral&rdquo; palette (applied with <code>palette=&quot;Spectral&quot;</code>), which is a lighter rainbow recommended for dark backgrounds:</p>
<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-6_hu_b0f4861a140f3ca2.webp 320w,/2017/08/ggplot2-web/tutorial-6_hu_2bea8aa9ecf4b77f.webp 768w,/2017/08/ggplot2-web/tutorial-6_hu_8c72b6730d700f72.webp 1024w,/2017/08/ggplot2-web/tutorial-6.png 1200w" src="tutorial-6.png"/> 
</figure>

<p>However, while the charts look pretty, it&rsquo;s difficult to tell the categories apart. The qualitative palettes fix this problem, and have more distinct possibilities than the <code>scale_color_hue()</code> approach mentioned earlier.</p>
<p>Here are 3 examples of qualitative palettes, &ldquo;Set1&rdquo;, &ldquo;Set2&rdquo;, and &ldquo;Set3&rdquo;; choose whichever fits your preference.</p>
<p><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-7_hu_25cbbc089de8f962.webp 320w,/2017/08/ggplot2-web/tutorial-7_hu_a16288142b4ca2c9.webp 768w,/2017/08/ggplot2-web/tutorial-7_hu_5dcf40a21178ff45.webp 1024w,/2017/08/ggplot2-web/tutorial-7.png 1200w" src="tutorial-7.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-8_hu_26d266cdc130beea.webp 320w,/2017/08/ggplot2-web/tutorial-8_hu_33341f45209f13f8.webp 768w,/2017/08/ggplot2-web/tutorial-8_hu_69ff86f540e43dba.webp 1024w,/2017/08/ggplot2-web/tutorial-8.png 1200w" src="tutorial-8.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-9_hu_8d2a2fc7ea80dd4.webp 320w,/2017/08/ggplot2-web/tutorial-9_hu_8c554b2661491e44.webp 768w,/2017/08/ggplot2-web/tutorial-9_hu_50b6d07e248d4fbd.webp 1024w,/2017/08/ggplot2-web/tutorial-9.png 1200w" src="tutorial-9.png"/> 
</figure>
</p>
<h2 id="viridis-and-accessibility">Viridis and Accessibility</h2>
<p>Let&rsquo;s mix up the visualization a bit. A rarely-used-but-very-useful ggplot2 geom is <code>geom_bin2d()</code>, which counts the number of points in a given 2D spatial area:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p</span> <span class="o">&lt;-</span> <span class="nf">ggplot</span><span class="p">(</span><span class="n">mpg</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">displ</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">hwy</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="nf">geom_bin2d</span><span class="p">(</span><span class="n">bins</span><span class="o">=</span><span class="m">10</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">    <span class="n">[...theming</span> <span class="n">options...]</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-tile_hu_e43b208518838250.webp 320w,/2017/08/ggplot2-web/tutorial-tile_hu_cdaf166456144b15.webp 768w,/2017/08/ggplot2-web/tutorial-tile_hu_4c5bd954986c81ea.webp 1024w,/2017/08/ggplot2-web/tutorial-tile.png 1200w" src="tutorial-tile.png"/> 
</figure>

<p>We see that the largest number of points is centered around (2, 30). However, the default ggplot2 color palette for continuous variables is <em>boring</em>. Yes, we could use the RColorBrewer sequential palettes above, but as noted, they aren&rsquo;t perceptually distinct, and they could cause issues for readers who are colorblind.</p>
<p>The <a href="https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html">viridis R package</a> provides a set of 4 high-contrast palettes that are very colorblind-friendly, and it works easily with ggplot2 via the <code>scale_fill_viridis()</code>/<code>scale_color_viridis()</code> functions.</p>
<p>The default &ldquo;viridis&rdquo; palette has been increasingly popular on the web lately:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">p_color</span> <span class="o">&lt;-</span> <span class="n">p</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">        <span class="nf">scale_fill_viridis</span><span class="p">(</span><span class="n">option</span><span class="o">=</span><span class="s">&#34;viridis&#34;</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-10_hu_bdab45ed149b2987.webp 320w,/2017/08/ggplot2-web/tutorial-10_hu_4f8ed1d4b0d4c15b.webp 768w,/2017/08/ggplot2-web/tutorial-10_hu_2136ce24111625f9.webp 1024w,/2017/08/ggplot2-web/tutorial-10.png 1200w" src="tutorial-10.png"/> 
</figure>

<p>&ldquo;magma&rdquo; and &ldquo;inferno&rdquo; (set with <code>option=&quot;magma&quot;</code> or <code>option=&quot;inferno&quot;</code>) are similar, and give the data visualization a fiery edge:</p>
<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-11_hu_891953ed03995865.webp 320w,/2017/08/ggplot2-web/tutorial-11_hu_339503edfac14382.webp 768w,/2017/08/ggplot2-web/tutorial-11_hu_b58a6f34b44b0e07.webp 1024w,/2017/08/ggplot2-web/tutorial-11.png 1200w" src="tutorial-11.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-12_hu_9576afecf56c3191.webp 320w,/2017/08/ggplot2-web/tutorial-12_hu_f3ada4502649973e.webp 768w,/2017/08/ggplot2-web/tutorial-12_hu_7b8a7ff2ebe2952f.webp 1024w,/2017/08/ggplot2-web/tutorial-12.png 1200w" src="tutorial-12.png"/> 
</figure>

<p>Lastly, &ldquo;plasma&rdquo; is a mix of the three palettes above:</p>
<figure>

    <img loading="lazy" srcset="/2017/08/ggplot2-web/tutorial-13_hu_d6ee0c44a3b9408.webp 320w,/2017/08/ggplot2-web/tutorial-13_hu_1cc7fd9e09047f6f.webp 768w,/2017/08/ggplot2-web/tutorial-13_hu_cae2bedd89d23c95.webp 1024w,/2017/08/ggplot2-web/tutorial-13.png 1200w" src="tutorial-13.png"/> 
</figure>

<h2 id="next-steps">Next Steps</h2>
<p>FiveThirtyEight actually uses ggplot2 for their data journalism workflow <a href="https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/FiveThirtyEights-data-journalism-workflow-with-R?ocid=player">in an interesting way</a>: they render the base chart with ggplot2, export it as an SVG/PDF vector file which can scale to any size, and then the design team annotates/customizes the data visualization in <a href="http://www.adobe.com/products/illustrator.html">Adobe Illustrator</a> before exporting it as a static PNG for the article. (In general, I recommend using an external image editor to add text annotations to a data visualization, because doing it manually in ggplot2 is inefficient.)</p>
<p>For general use cases, ggplot2 has very strong defaults for beautiful data visualizations. There is certainly a lot <em>more</em> you can do to make a visualization beautiful than what&rsquo;s listed in this post, such as using facets and tweaking geom parameters for further distinction, but those techniques are specific to a given data visualization. In general, it takes little additional effort to make something <em>unique</em> with ggplot2, and the effort is well worth it: prettier charts are more persuasive, which is a good return on investment.</p>
<hr>
<p><em>You can view the R and ggplot2 code used to create the data visualizations in <a href="http://minimaxir.com/notebooks/ggplot2-web/">this R Notebook</a>. You can also view the images/data used for this post in <a href="https://github.com/minimaxir/ggplot2-web">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Benchmarking TensorFlow on Cloud CPUs: Cheaper Deep Learning than Cloud GPUs</title>
      <link>https://minimaxir.com/2017/07/cpu-or-gpu/</link>
      <pubDate>Wed, 05 Jul 2017 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/07/cpu-or-gpu/</guid>
      <description>Using CPUs instead of GPUs for deep learning training in the cloud is cheaper because of the massive cost differential afforded by preemptible instances.</description>
      <content:encoded><![CDATA[<p>I&rsquo;ve been working on a few personal deep learning projects with <a href="https://github.com/fchollet/keras">Keras</a> and <a href="https://www.tensorflow.org">TensorFlow</a>. However, training models for deep learning with cloud services such as <a href="https://aws.amazon.com/ec2/">Amazon EC2</a> and <a href="https://cloud.google.com/compute/">Google Compute Engine</a> isn&rsquo;t free, and as someone who is currently unemployed, I have to keep an eye on extraneous spending and be as cost-efficient as possible (please support my work on <a href="https://www.patreon.com/minimaxir">Patreon</a>!). I tried deep learning on the cheaper CPU instances instead of GPU instances to save money, and to my surprise, my model training was only slightly slower. As a result, I took a deeper look at the pricing mechanisms of these two types of instances to see if CPUs are more useful for my needs.</p>
<p>The <a href="https://cloud.google.com/compute/pricing#gpus">pricing of GPU instances</a> on Google Compute Engine starts at <strong>$0.745/hr</strong> (by attaching a $0.700/hr GPU die to a $0.045/hr n1-standard-1 instance). A couple of months ago, Google <a href="https://cloudplatform.googleblog.com/2017/05/Compute-Engine-machine-types-with-up-to-64-vCPUs-now-ready-for-your-production-workloads.html">announced</a> CPU instances with up to 64 vCPUs on the modern Intel <a href="https://en.wikipedia.org/wiki/Skylake_%28microarchitecture%29">Skylake</a> CPU architecture. More importantly, they can also be used as <a href="https://cloud.google.com/compute/docs/instances/preemptible">preemptible instances</a>, which live for at most 24 hours on GCE and can be terminated at any time (though in practice this happens rarely), but cost only about <em>20%</em> of the price of a standard instance. A preemptible n1-highcpu-64 instance with 64 vCPUs and 57.6GB RAM, plus the premium for using Skylake CPUs, is <strong>$0.509/hr</strong>, about 2/3rds of the cost of the GPU instance.</p>
<p>If the model training speed of 64 vCPUs is comparable to that of a GPU (or even slightly slower), it would be more cost-effective to use the CPUs instead. But that&rsquo;s assuming the deep learning software and the GCE platform hardware operate at 100% efficiency; if they don&rsquo;t (and they likely don&rsquo;t), there may be <em>even more savings</em> by scaling down the number of vCPUs and cost accordingly (a 32 vCPU instance with same parameters is half the price at <strong>$0.254/hr</strong>, 16 vCPU at <strong>$0.127/hr</strong>, etc).</p>
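<p>As a quick sanity check of the arithmetic, here is the cost comparison in Python (prices as quoted above; the break-even framing is my own, not from GCE documentation):</p>

```python
# Hourly GCE prices quoted above (USD).
gpu_hr = 0.700 + 0.045        # K80 GPU die + n1-standard-1 host = $0.745/hr
cpu64_hr = 0.509              # preemptible n1-highcpu-64 + Skylake premium
cpu32_hr = cpu64_hr / 2       # price scales linearly with vCPU count
cpu16_hr = cpu64_hr / 4

print(f"64 vCPUs cost {cpu64_hr / gpu_hr:.0%} of the GPU's hourly rate")

# A CPU instance is cheaper overall as long as its training slowdown
# (relative to the GPU) stays below the inverse price ratio:
print(f"16 vCPUs break even at a {gpu_hr / cpu16_hr:.1f}x slowdown")
```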
<p>There aren&rsquo;t any benchmarks for deep learning libraries with tons and tons of CPUs since there&rsquo;s no demand, as GPUs are the <a href="https://en.wikipedia.org/wiki/Occam%27s_razor">Occam&rsquo;s razor</a> solution to deep learning hardware. But it might make counterintuitive yet economical sense to use CPUs instead of GPUs for deep learning training, because of the massive cost differential afforded by preemptible instances, thanks to Google&rsquo;s <a href="https://en.wikipedia.org/wiki/Economies_of_scale">economies of scale</a>.</p>
<h2 id="setup">Setup</h2>
<p>I already have <a href="https://github.com/minimaxir/deep-learning-cpu-gpu-benchmark">benchmarking scripts</a> of real-world deep learning use cases, <a href="https://github.com/minimaxir/keras-cntk-docker">Docker container environments</a>, and results logging from my <a href="http://minimaxir.com/2017/06/keras-cntk/">TensorFlow vs. CNTK article</a>. A few minor tweaks allow the scripts to be utilized for both CPU and GPU instances by setting CLI arguments. I also rebuilt <a href="https://github.com/minimaxir/keras-cntk-docker/blob/master/Dockerfile">the Docker container</a> to support the latest version of TensorFlow (1.2.1), and created a <a href="https://github.com/minimaxir/keras-cntk-docker/blob/master/Dockerfile-cpu">CPU version</a> of the container which installs the CPU-appropriate TensorFlow library instead.</p>
<p>There is a notable CPU-specific TensorFlow behavior: if you install from <code>pip</code> (as the <a href="https://www.tensorflow.org/install/">official instructions</a> and tutorials recommend) and begin training a model in TensorFlow, you&rsquo;ll see these warnings in the console:</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/tensorflow-console_hu_e436e066e4e1304d.webp 320w,/2017/07/cpu-or-gpu/tensorflow-console_hu_ce5df372394290b4.webp 768w,/2017/07/cpu-or-gpu/tensorflow-console_hu_9e354816d97d6c8f.webp 1024w,/2017/07/cpu-or-gpu/tensorflow-console.png 1130w" src="tensorflow-console.png"/> 
</figure>

<p>In order to fix the warnings and benefit from these <a href="https://en.wikipedia.org/wiki/SSE4#SSE4.2">SSE4.2</a>/<a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">AVX</a>/<a href="https://en.wikipedia.org/wiki/FMA_instruction_set">FMA</a> optimizations, we <a href="https://stackoverflow.com/questions/41293077/how-to-compile-tensorflow-with-sse4-2-and-avx-instructions">compile TensorFlow from source</a>, and I created a <a href="https://github.com/minimaxir/keras-cntk-docker/blob/master/Dockerfile-cpu-compiled">third Docker container</a> to do just that. When training models in the new container, <a href="https://github.com/tensorflow/tensorflow/issues/10689">most</a> of the warnings no longer show, and (spoiler alert) there is indeed a speed boost in training time.</p>
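<p>For reference, a source build with these optimizations enabled looked roughly like the following for TensorFlow 1.x; the exact <code>bazel</code> flags are illustrative and depend on your CPU and TensorFlow version:</p>

```shell
# Sketch of a CPU-optimized TensorFlow 1.x build (flags illustrative;
# match the -m* options to the instruction sets your CPU supports).
git clone https://github.com/tensorflow/tensorflow && cd tensorflow
./configure  # accept the defaults for a CPU-only build
bazel build -c opt --copt=-msse4.2 --copt=-mavx --copt=-mavx2 --copt=-mfma \
    //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-*.whl
```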
<p>Therefore, we can test three major cases with Google Compute Engine:</p>
<ul>
<li>A Tesla K80 GPU instance.</li>
<li>A 64 Skylake vCPU instance where TensorFlow is installed via <code>pip</code> (along with tests at 8/16/32 vCPUs).</li>
<li>A 64 Skylake vCPU instance where TensorFlow is compiled (<code>cmp</code>) with CPU instructions (+ 8/16/32 vCPUs).</li>
</ul>
<h2 id="results">Results</h2>
<p>For each model architecture and software/hardware configuration, I calculate the <strong>total training time relative to the GPU instance training</strong> for running the model training for the provided test script. In all cases, the GPU <em>should</em> be the fastest training configuration, and systems with more processors should train faster than those with fewer processors.</p>
<p>Let&rsquo;s start with the <a href="http://yann.lecun.com/exdb/mnist/">MNIST dataset</a> of handwritten digits and the common multilayer perceptron (MLP) architecture, with dense fully-connected layers. Lower training time is better. All configurations below the horizontal dotted line are better than the GPU; all configurations above it are worse.</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-5_hu_8cf5154f974aed3c.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-5_hu_2ec21aba02d8fb37.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-5_hu_7682d0a58ea1e871.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-5.png 1200w" src="dl-cpu-gpu-5.png"/> 
</figure>

<p>Here, the GPU is the fastest out of all the platform configurations, but there are other curious trends: the performance between 32 vCPUs and 64 vCPUs is similar, and the compiled TensorFlow library is indeed a significant improvement in training speed <em>but only for 8 and 16 vCPUs</em>. Perhaps there are overheads negotiating information between vCPUs that eliminate the performance advantages of more vCPUs, and perhaps these overheads are <em>different</em> with the CPU instructions of the compiled TensorFlow. In the end, it&rsquo;s a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a>, which is why I prefer black box benchmarking all configurations of hardware instead of theorycrafting.</p>
<p>Since the difference in training speed between vCPU counts is minimal, there is definitely an advantage to scaling down. For each model architecture and configuration, I calculate a <strong>normalized training cost relative to the cost of GPU instance training</strong>. Because GCE instance costs are prorated (unlike Amazon EC2), we can simply calculate experiment cost by multiplying the total number of seconds the experiment runs by the cost of the instance (per second). Ideally, we want to <em>minimize</em> cost.</p>
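<p>The normalized-cost calculation described above is simple enough to sketch directly (the runtimes below are hypothetical, purely for illustration):</p>

```python
GPU_PRICE_HR = 0.745  # GPU instance price quoted earlier (USD/hr)

def normalized_cost(train_seconds, price_per_hr, gpu_train_seconds):
    """Experiment cost relative to the GPU run. GCE bills are prorated,
    so cost is just runtime (in seconds) times the per-second price."""
    cost = train_seconds * price_per_hr / 3600
    gpu_cost = gpu_train_seconds * GPU_PRICE_HR / 3600
    return cost / gpu_cost

# Hypothetical: a 16 vCPU run ($0.127/hr) taking 3x as long as a
# 30-minute GPU run still costs only about half as much.
print(round(normalized_cost(3 * 1800, 0.127, 1800), 2))  # 0.51
```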
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-6_hu_c6ff3c375435199.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-6_hu_6bee6729ce48517c.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-6_hu_ea518ff15e46de10.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-6.png 1200w" src="dl-cpu-gpu-6.png"/> 
</figure>

<p>Lower vCPU counts are <em>much</em> more cost-effective for this problem; here, going as low as possible is the better choice.</p>
<p>Now, let&rsquo;s look at the same dataset with a convolutional neural network (CNN) approach for digit classification:</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-7_hu_d3205561da4ed49c.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-7_hu_ae81ceba7d6092e6.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-7_hu_7a29bcea36dbe20e.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-7.png 1200w" src="dl-cpu-gpu-7.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-8_hu_64f1eac6ff5b2b3f.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-8_hu_c6dd20c1ccc111a5.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-8_hu_2fa65c3c187723bb.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-8.png 1200w" src="dl-cpu-gpu-8.png"/> 
</figure>

<p>GPUs are unsurprisingly more than twice as fast as any CPU approach at CNNs, but the cost structures remain the same, except that 64 vCPUs are now <em>worse</em> than the GPU cost-wise, and 32 vCPUs train even faster than 64 vCPUs.</p>
<p>Let&rsquo;s go deeper with CNNs and look at the <a href="https://www.cs.toronto.edu/%7Ekriz/cifar.html">CIFAR-10</a> image classification dataset, with a model that utilizes a deep convnet plus a multilayer perceptron, an architecture well-suited to image classification (similar to the <a href="https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3">VGG-16</a> architecture).</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-9_hu_4a5cd8ba80674837.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-9_hu_a81280d52893c1c9.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-9_hu_af30edd0d3117cd8.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-9.png 1200w" src="dl-cpu-gpu-9.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-10_hu_a6061eb15b5b8609.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-10_hu_fe0751d9cd60a655.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-10_hu_a371016369278a9a.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-10.png 1200w" src="dl-cpu-gpu-10.png"/> 
</figure>

<p>Behavior here is similar to the simple CNN case, although in this instance all CPU configurations perform better with the compiled TensorFlow library.</p>
<p>The fasttext algorithm, used here on the <a href="http://ai.stanford.edu/%7Eamaas/data/sentiment/">IMDb reviews dataset</a> to determine whether a review is positive or negative, classifies text extremely quickly relative to other methods.</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-3_hu_12d55d02148bf0ea.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-3_hu_aaf9917a1629214f.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-3_hu_d51ed2e2c6fdec60.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-3.png 1200w" src="dl-cpu-gpu-3.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-4_hu_6b591a471f3027a4.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-4_hu_7cc361b383b25fb0.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-4_hu_4c516e76a92eff3c.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-4.png 1200w" src="dl-cpu-gpu-4.png"/> 
</figure>

<p>In this case, GPUs are much, much faster than CPUs, and the benefit of lower vCPU counts isn&rsquo;t as dramatic. As an aside, the <a href="https://github.com/facebookresearch/fastText">official fasttext implementation</a> is <em>designed</em> for large numbers of CPUs and handles parallelization much better.</p>
<p>The bidirectional long short-term memory (LSTM) architecture is great for working with text data like IMDb reviews, but after my previous benchmark article, <a href="https://news.ycombinator.com/item?id=14538086">commenters on Hacker News</a> noted that TensorFlow uses an inefficient implementation of the LSTM on the GPU, so perhaps the difference will be more notable here.</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-1_hu_4369b4e9e8856507.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-1_hu_3e65077eb16928e4.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-1_hu_d736592c927bd764.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-1.png 1200w" src="dl-cpu-gpu-1.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-2_hu_d8c58f429f4a781b.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-2_hu_1306d728b4fce90.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-2_hu_ad3d19e88738d072.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-2.png 1200w" src="dl-cpu-gpu-2.png"/> 
</figure>

<p>Wait, what? GPU training of bidirectional LSTMs is <em>twice as slow</em> as any CPU configuration? Wow. (In fairness, the benchmark uses the Keras LSTM default of <code>implementation=0</code>, which is better on CPUs, while <code>implementation=2</code> is better on GPUs, but that shouldn&rsquo;t result in this much of a differential.)</p>
<p>Lastly, LSTM text generation of <a href="https://en.wikipedia.org/wiki/Friedrich_Nietzsche">Nietzsche&rsquo;s</a> <a href="https://s3.amazonaws.com/text-datasets/nietzsche.txt">writings</a> follows similar patterns to the other architectures, but without the drastic hit to the GPU.</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-11_hu_d84b78ad35a1f056.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-11_hu_d58d19568c89869.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-11_hu_c078d8bd94df56aa.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-11.png 1200w" src="dl-cpu-gpu-11.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-12_hu_44c1d2cc10581f1a.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-12_hu_27c08aabe3a3cacd.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-12_hu_d41db5a45ef62daf.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-12.png 1200w" src="dl-cpu-gpu-12.png"/> 
</figure>

<h2 id="conclusion">Conclusion</h2>
<p>As it turns out, using 64 vCPUs is <em>bad</em> for deep learning, as current software/hardware architectures can&rsquo;t fully utilize all of them, and the result is often the exact same performance (or <em>worse</em>) as with 32 vCPUs. In terms of balancing both training speed and cost, training models with <strong>16 vCPUs + compiled TensorFlow</strong> seems like the winner. The 30%-40% speed boost from the compiled TensorFlow library was a pleasant surprise, and I&rsquo;m shocked Google doesn&rsquo;t offer a precompiled version of TensorFlow with these CPU speedups, since the gains are nontrivial.</p>
<p>It&rsquo;s worth noting that the cost advantages shown here are <em>only</em> possible with preemptible instances; regular high-CPU instances on Google Compute Engine are about 5x as expensive, which eliminates the cost benefits completely. Hooray for economies of scale!</p>
<p>A major implicit assumption with the cloud CPU training approach is that you don&rsquo;t need a trained model ASAP. In professional use cases, time may be too valuable to waste, but in personal use cases where someone can just leave a model training overnight, it&rsquo;s a very, very good and cost-effective option, and one that I&rsquo;ll now utilize.</p>
<hr>
<p><em>All scripts for running the benchmark are available in <a href="https://github.com/minimaxir/deep-learning-cpu-gpu-benchmark">this GitHub repo</a>. You can view the R/ggplot2 code used to process the logs and create the visualizations in <a href="http://minimaxir.com/notebooks/deep-learning-cpu-gpu/">this R Notebook</a>.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Leaving Apple Inc.</title>
      <link>https://minimaxir.com/2017/05/leaving-apple/</link>
      <pubDate>Thu, 04 May 2017 09:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/05/leaving-apple/</guid>
      <description>I have made the personal decision to leave my job at Apple to further my personal growth and technical skills.</description>
      <content:encoded><![CDATA[<p>I’ve been working in the San Francisco Bay Area for about 5 years, but I’ve never publicly said where I’ve worked. Well, I was a Software QA Engineer at <a href="https://www.apple.com">Apple Inc.</a>, on the Applications team.</p>
<p>As of last week, I handed in my resignation. While I am thankful for the opportunities I have had at Apple, it is time for me to pursue working in other areas I am passionate about and search for other companies to further my personal growth and technical skills. Resigning from a good job to look for something new might defy conventional wisdom, but the time is right for me to make this bold career move.</p>
<h2 id="my-apple-story">My Apple Story</h2>
<p>I graduated with university honors at <a href="http://www.cmu.edu">Carnegie Mellon University</a>, from the <a href="http://tepper.cmu.edu">Tepper School of Business</a> with a focus on Computing and Information Technology (i.e. data architecture and coding algorithms), and a minor in Statistics.</p>
<p>At the end of my senior year, I received an e-mail from a Software QA Manager at Apple (who followed my <a href="http://techcommntr.tumblr.com">comments</a> at the bottom of <a href="https://techcrunch.com">TechCrunch</a> articles) inviting me for an on-site interview. Following an offer, I moved to the Bay Area to start my first post-undergrad job in Cupertino.</p>
<p>While I can&rsquo;t really talk about what I worked on at Apple, I genuinely enjoyed the work, the product, and the team. I had a high impact on the final result and I successfully helped qualify many major software releases. However, after a few years, I realized that my technical skill growth was stalling, so I looked for an internal transfer to another department, ideally in a data analysis/software engineering role.</p>
<p>Having received no responses internally, I realized I would have to expand my search to outside of Apple.</p>
<h2 id="my-job-hunt">My Job Hunt</h2>
<p>I have a strong technical background from my CMU classes, but not having an explicit Computer Science degree has made it difficult to prove aptitude, despite my positive annual reviews and proven experience and technical skills at Apple. So I made the decision to blog with a technical focus here at <a href="http://minimaxir.com">minimaxir.com</a>, which gave me an avenue to showcase my programming skills and the opportunity to self-learn practical new tools not covered in my school curriculum, such as <a href="https://www.python.org">Python</a>, <a href="http://ggplot2.org">ggplot2</a>, version control with <a href="https://git-scm.com">git</a>, and reproducible analyses via <a href="http://jupyter.org">Jupyter/IPython Notebooks</a>.</p>
<p>This approach has been successful, and many readers have liked my blog posts: they have often topped <a href="https://www.reddit.com/r/dataisbeautiful/comments/4bwr7o/relationship_between_rotten_tomatoes_tomatometer/">Reddit</a> and <a href="https://news.ycombinator.com/item?id=13429656">Hacker News</a>, driving hundreds of thousands of pageviews. Additionally, a couple of my posts were even cited in larger publications such as the <a href="https://www.washingtonpost.com/news/the-intersect/wp/2016/06/30/facebook-news-feed-and-the-tyranny-of-positive-content/">Washington Post</a> and <a href="https://www.buzzfeed.com/tomphillips/photos-that-prove-game-of-thrones-happened-in-real-life">BuzzFeed</a>.</p>
<p>I also published many open-source technical projects to my <a href="https://github.com/minimaxir">GitHub</a>. My <a href="https://github.com/minimaxir/big-list-of-naughty-strings">Big List of Naughty Strings</a>, a project I made in a couple hours on a weekend inspired by my QA-ing at work, is now at <strong>20,000+ Stars</strong> on GitHub. My <a href="https://github.com/minimaxir/facebook-page-post-scraper">Facebook Page Post Scraper</a>, which does what the name implies, is now at 1,000+ Stars and has been used by many other businesses and journalists.</p>
<p>Developers have long argued that job seekers should have a strong public portfolio, as demonstrated experience can account for the lack of a relevant degree. After years of building up my portfolio, however, it became apparent that most outside recruiters I talked with never looked at my blog/GitHub, despite a strong emphasis on both in my résumé.</p>
<p>I subsequently rededicated my blog as a pragmatic demonstration of relevant skills in the data analysis job market, focusing more on practical analysis instead of quirky insights and thoughts. In the process, I obtained proficiency in a number of modern tools, including <a href="http://minimaxir.com/2016/08/clickbait-cluster/">interactive data visualizations</a> on the web with <a href="https://plot.ly">Plotly</a>, processing <a href="http://minimaxir.com/2017/01/amazon-spark/">big data</a> with <a href="http://spark.apache.org">Apache Spark</a>, high-performance <a href="http://minimaxir.com/2017/02/predicting-arrests/">machine learning</a> with <a href="https://github.com/dmlc/xgboost">xgboost</a> and <a href="https://github.com/Microsoft/LightGBM">LightGBM</a>, and even <a href="http://minimaxir.com/2017/04/char-embeddings/">deep learning</a> with <a href="https://github.com/fchollet/keras">Keras</a> and <a href="https://github.com/tensorflow/tensorflow">TensorFlow</a>.</p>
<p>I am now actively looking for a <strong>data analyst/software engineering job within San Francisco</strong>. If you are interested or if you know of companies who are looking for qualified people, please send me an email at <strong><a href="mailto:max@minimaxir.com">max@minimaxir.com</a></strong>.</p>
<h2 id="next-steps">Next Steps</h2>
<p>So I’ll be using my time over the next couple of weeks to openly look for a new job and to network with others in relevant industries (and to interview without taking a day off work). Things have been improving: my <a href="https://news.ycombinator.com/item?id=14238066">comment</a> in the Hacker News &ldquo;Who wants to be hired?&rdquo; thread generated many leads who really liked my blog/portfolio. If you’d like to meet up in San Francisco and talk about tech and data stuff, just let me know.</p>
<p>I still intend to continue blogging, not just as a hobby but in a more purposeful way. I have very ambitious goals and now have more time to execute them at a deeper level. Plans include:</p>
<ul>
<li>Web applications leveraging deep learning models, deployed at scale with <a href="https://www.docker.com">Docker</a>/<a href="https://kubernetes.io">Kubernetes</a>.</li>
<li>Interactive data dashboards accompanying every analytical blog post with <a href="https://shiny.rstudio.com">Shiny</a>.</li>
<li>Code screencasts at 4k resolution on <a href="https://youtube.com/minimaxir">YouTube</a>.</li>
<li>Data analysis live-streaming with augmented functionality on <a href="https://www.twitch.tv/minimaxir">Twitch</a>.</li>
</ul>
<p>I have set up a <strong><a href="https://www.patreon.com/minimaxir">Patreon</a></strong> in order to subsidize my machine learning/deep learning/software/hardware needs for my blog posts. If you have found any of my blog posts useful, a monetary contribution to my Patreon would be appreciated and will be put to good creative use.</p>
<p>If you want to keep up with me and my projects, feel free to follow me on <strong><a href="https://www.facebook.com/max.woolf">Facebook</a></strong> and <strong><a href="https://twitter.com/minimaxir">Twitter</a></strong> too.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
