<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Essays on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/category/essays/</link>
    <description>Recent content in Essays on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Mon, 05 May 2025 10:15:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/category/essays/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>As an Experienced LLM User, I Actually Don&#39;t Use Generative LLMs Often</title>
      <link>https://minimaxir.com/2025/05/llm-use/</link>
      <pubDate>Mon, 05 May 2025 10:15:00 -0700</pubDate>
      <guid>https://minimaxir.com/2025/05/llm-use/</guid>
      <description>But for what I &lt;em&gt;do&lt;/em&gt; use LLMs for, it&amp;rsquo;s invaluable.</description>
<content:encoded><![CDATA[<p>Lately, I&rsquo;ve been working on codifying a personal ethics statement about my stances on generative AI as I have been very critical of <a href="https://minimaxir.com/2023/10/ai-sturgeons-law/">several</a> <a href="https://minimaxir.com/2024/08/ai-seinfeld/">aspects</a> of modern GenAI, and yet <a href="https://thenib.com/mister-gotcha/">I participate in it</a>. While working on that statement, I&rsquo;ve been introspecting on how I myself have been utilizing large language models for both my professional work as a Senior Data Scientist at <a href="https://www.buzzfeed.com/">BuzzFeed</a> and for my personal work blogging and <a href="https://github.com/minimaxir">writing open-source software</a>. For about a decade, I&rsquo;ve been researching and developing tooling around <a href="https://minimaxir.com/2017/04/char-embeddings/">text generation from char-rnns</a>, to the <a href="https://minimaxir.com/2019/09/howto-gpt2/">ability to fine-tune GPT-2</a>, to <a href="https://minimaxir.com/2020/07/gpt3-expectations/">experiments with GPT-3</a>, and <a href="https://minimaxir.com/2023/03/new-chatgpt-overlord/">even more experiments with ChatGPT</a> and other LLM APIs. Although I don&rsquo;t claim to be the best user of modern LLMs out there, I&rsquo;ve had plenty of experience working against the cons of next-token predictor models and have become very good at finding the pros.</p>
<p>It turns out, to my surprise, that I don&rsquo;t use them nearly as often as people think engineers do, but that doesn&rsquo;t mean LLMs are useless for me. It&rsquo;s a discussion that requires case-by-case nuance.</p>
<h2 id="how-i-interface-with-llms">How I Interface With LLMs</h2>
<p>Over the years I&rsquo;ve utilized all the tricks to get the best results out of LLMs. The most famous trick is <a href="https://en.wikipedia.org/wiki/Prompt_engineering">prompt engineering</a>, or the art of phrasing the prompt in a specific manner to coach the model to generate a specific constrained output. Additions to prompts such as <a href="https://minimaxir.com/2024/02/chatgpt-tips-analysis/">offering financial incentives to the LLM</a> or simply <a href="https://minimaxir.com/2025/01/write-better-code/">telling the LLM to make their output better</a> do indeed have a quantifiable positive impact on both improving adherence to the original prompt and the output text quality. Whenever my coworkers ask me why their LLM output is not what they expected, I suggest that they apply more prompt engineering and it almost always fixes their issues.</p>
<p><strong>No one in the AI field is happy about prompt engineering</strong>, especially myself. Attempts to remove the need for prompt engineering with more robust <a href="https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback">RLHF</a> paradigms have only made it <em>even more rewarding</em> by allowing LLM developers to make use of better prompt adherence. True, &ldquo;Prompt Engineer&rdquo; as a job title <a href="https://www.wsj.com/articles/the-hottest-ai-job-of-2023-is-already-obsolete-1961b054?st=DMVDgm&amp;reflink=desktopwebshare_permalink">turned out to be a meme</a> but that&rsquo;s mostly because prompt engineering is now an expected skill for anyone seriously using LLMs. Prompt engineering works, and part of being a professional is using what works even if it&rsquo;s silly.</p>
<p>To that end, <strong>I never use ChatGPT.com</strong> or other normal-person frontends for accessing LLMs because they are harder to control. Instead, I typically access the backend UIs provided by each LLM service, which serve as a light wrapper over the API functionality and also make it easy to port to code if necessary. Accessing LLM APIs like the ChatGPT API directly allows you to set <a href="https://promptengineering.org/system-prompts-in-large-language-models/">system prompts</a> which control the &ldquo;rules&rdquo; for the generation, rules that can be very nuanced. Specifying constraints for the generated text such as &ldquo;keep it to no more than 30 words&rdquo; or &ldquo;never use the word &lsquo;delve&rsquo;&rdquo; tends to be more effective in the system prompt than putting them in the user prompt as you would with ChatGPT.com. Any modern LLM interface that does not let you explicitly set a system prompt is most likely <a href="https://docs.anthropic.com/en/release-notes/system-prompts">using their own system prompt</a> which you can&rsquo;t control: for example, when ChatGPT.com had an issue where it was <a href="https://openai.com/index/sycophancy-in-gpt-4o/">too sycophantic</a> to its users, OpenAI <a href="https://simonwillison.net/2025/Apr/29/chatgpt-sycophancy-prompt/">changed the system prompt</a> to command ChatGPT to &ldquo;avoid ungrounded or sycophantic flattery.&rdquo; I tend to use <a href="https://www.anthropic.com/">Anthropic</a> Claude&rsquo;s API — Claude Sonnet in particular — more than any ChatGPT variant because Claude anecdotally is less &ldquo;robotic&rdquo; and also handles coding questions much more accurately.</p>
<p>Additionally with the APIs, you can control the &ldquo;<a href="https://www.hopsworks.ai/dictionary/llm-temperature">temperature</a>&rdquo; of the generation, which at a high level controls the creativity of the generation. LLMs by default do not select the next token with the highest probability, in order to allow them to give different outputs for each generation, so I prefer to set the temperature to <code>0.0</code> so that the output is mostly deterministic, or <code>0.2 - 0.3</code> if some light variance is required. Modern LLMs now use a default temperature of <code>1.0</code>, and I theorize that this higher value accentuates <a href="https://en.wikipedia.org/wiki/Hallucination_%28artificial_intelligence%29">LLM hallucination</a> issues where the text outputs are internally consistent but factually wrong.</p>
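<p>For intuition on what temperature actually does, here is a minimal, self-contained sketch (not any provider&rsquo;s actual server-side implementation): the model&rsquo;s raw logits are divided by the temperature before the softmax, so a lower temperature concentrates probability mass on the top token and a higher one flattens the distribution.</p>

```python
import math

def apply_temperature(logits, temperature):
    """Convert raw logits to a sampling distribution via temperature-scaled softmax.

    temperature -> 0 approaches greedy (argmax) decoding; higher values
    flatten the distribution, making unlikely tokens more probable.
    """
    if temperature == 0.0:
        # Greedy decoding: all probability mass on the highest-logit token.
        probs = [0.0] * len(logits)
        probs[max(range(len(logits)), key=lambda i: logits[i])] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max before exp for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]            # hypothetical logits for three tokens
greedy = apply_temperature(logits, 0.0)   # deterministic: [1.0, 0.0, 0.0]
creative = apply_temperature(logits, 1.0) # flatter, more varied sampling
```

<p>Running the same prompt at <code>0.0</code> therefore (mostly) reproduces the same output, which is exactly what you want for labeling and extraction tasks.</p>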
<h2 id="llms-for-professional-problem-solving">LLMs for Professional Problem Solving!</h2>
<p>With that pretext, I can now talk about how I have used generative LLMs over the past couple years at BuzzFeed. Here are outlines of some (out of many) projects I&rsquo;ve worked on using LLMs to successfully solve problems quickly:</p>
<ul>
<li>BuzzFeed site curators developed a new <a href="https://www.siteguru.co/seo-academy/website-taxonomy">hierarchical taxonomy</a> to organize thousands of articles into a specified category and subcategory. Since we had no existing labeled articles to train a traditional <a href="https://scikit-learn.org/stable/modules/multiclass.html">multiclass classification</a> model to predict these new labels, I wrote a script to hit the Claude Sonnet API with a system prompt saying <code>The following is a taxonomy: return the category and subcategory that best matches the article the user provides.</code> plus the JSON-formatted hierarchical taxonomy, then I provided the article metadata as the user prompt, all with a temperature of <code>0.0</code> for the most precise results. Running this in a loop for all the articles resulted in appropriate labels.</li>
<li>After identifying hundreds of distinct semantic clusters of BuzzFeed articles using data science shenanigans, it became clear that there wasn&rsquo;t an easy way to give each one unique labels. I wrote another script to hit the Claude Sonnet API with a system prompt saying <code>Return a JSON-formatted title and description that applies to all the articles the user provides.</code> with the user prompt containing five articles from that cluster: again, running the script in a loop for all clusters provided excellent results.</li>
<li>One BuzzFeed writer asked if there was a way to use a LLM to sanity-check grammar questions such as &ldquo;should I use an <a href="https://www.merriam-webster.com/grammar/em-dash-en-dash-how-to-use">em dash</a> here?&rdquo; against the <a href="https://www.buzzfeed.com/buzzfeednews/buzzfeed-style-guide">BuzzFeed style guide</a>. Once again I hit the Claude Sonnet API, this time copy/pasting the <em>full</em> style guide in the system prompt plus a command to <code>Reference the provided style guide to answer the user's question, and cite the exact rules used to answer the question.</code> In testing, the citations were accurate and present in the source input, and the reasonings were consistent.</li>
</ul>
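<p>The taxonomy-labeling loop above can be sketched as follows. This is a hypothetical reconstruction, not the actual BuzzFeed script: the taxonomy slice, the <code>build_classification_request</code> helper, and the model name are all illustrative. The resulting payload is what you would send to the Anthropic Messages API for each article.</p>

```python
import json

# Hypothetical slice of the taxonomy; the real one had many more categories.
taxonomy = {
    "Food": ["Recipes", "Restaurants"],
    "Entertainment": ["TV", "Movies"],
}

def build_classification_request(article_metadata: dict) -> dict:
    """Construct a Messages API payload for zero-shot article labeling."""
    system_prompt = (
        "The following is a taxonomy: return the category and subcategory "
        "that best matches the article the user provides.\n"
        + json.dumps(taxonomy, indent=2)
    )
    return {
        "model": "claude-3-5-sonnet-latest",  # assumed model alias
        "max_tokens": 100,
        "temperature": 0.0,  # most deterministic output for labeling
        "system": system_prompt,
        "messages": [{"role": "user", "content": json.dumps(article_metadata)}],
    }

payload = build_classification_request({"title": "27 Easy Weeknight Dinners"})
```

<p>Iterating this over every article, with the metadata swapped into the user prompt each time, is the whole script.</p>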
<p>Each of these projects was an off-hand idea pitched in a morning standup or a Slack DM, and yet each only took an hour or two to complete a proof of concept (including testing) and hand off to the relevant stakeholders for evaluation. For projects such as the hierarchical labeling, without LLMs I would have needed to do more sophisticated R&amp;D that likely would have taken days, including building training datasets through manual labeling, which is not intellectually gratifying. Here, LLMs did indeed follow the <a href="https://en.wikipedia.org/wiki/Pareto_principle">Pareto principle</a> and got me 80% of the way to a working solution, but the remaining 20% of the work — iterating, testing, and gathering feedback — took longer. Even after the model outputs became more reliable, LLM hallucination was still a concern, which is why I also advocate to my coworkers to use caution and double-check with a human if the LLM output is peculiar.</p>
<p>There&rsquo;s also one use case of LLMs that doesn&rsquo;t involve text generation that&rsquo;s just as useful in my professional work: <a href="https://platform.openai.com/docs/guides/embeddings">text embeddings</a>. Modern text embedding models technically are LLMs, except instead of having a head which outputs the logits for the next token, they output a vector of numbers that uniquely identifies the input text in a higher-dimensional space. All improvements to LLMs that the ChatGPT revolution inspired, such as longer context windows and better quality training regimens, also apply to these text embedding models and caused them to improve drastically over time with models such as <a href="https://www.nomic.ai/blog/posts/nomic-embed-text-v1">nomic-embed-text</a> and <a href="https://huggingface.co/Alibaba-NLP/gte-modernbert-base">gte-modernbert-base</a>. Text embeddings have done a lot at BuzzFeed, from identifying similar articles to building recommendation models, but this blog post is about generative LLMs so I&rsquo;ll save those use cases for another time.</p>
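<p>For intuition on how embeddings identify similar articles: each text maps to a vector, and similarity between two texts is typically measured by cosine similarity. A toy sketch with made-up 4-dimensional vectors (real embedding models output hundreds of dimensions):</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: hypothetical vectors, not real model output.
article_a = [0.9, 0.1, 0.0, 0.2]
article_b = [0.8, 0.2, 0.1, 0.3]   # similar topic to article_a
article_c = [0.0, 0.9, 0.8, 0.1]   # unrelated topic

# Articles about the same topic end up pointing in similar directions.
assert cosine_similarity(article_a, article_b) > cosine_similarity(article_a, article_c)
```

<p>Finding the most similar articles to a given one is then just a nearest-neighbor search over these vectors.</p>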
<h2 id="llms-for-writing">LLMs for Writing?</h2>
<p>No, I don&rsquo;t use LLMs for writing the text on this very blog, which I suspect has now become a default assumption for people reading an article written by an experienced LLM user. My blog is far too weird for an LLM to properly emulate. My writing style is blunt, irreverent, and occasionally cringe: even with prompt engineering plus <a href="https://www.promptingguide.ai/techniques/fewshot">few-shot prompting</a> by giving it examples of my existing blog posts and telling the model to follow the same literary style precisely, LLMs output something closer to Marvel movie dialogue. But even if LLMs <em>could</em> write articles in my voice, I still wouldn&rsquo;t use them due to the ethics of misrepresenting authorship by having the majority of the work not be my own words. Additionally, I tend to write about very recent events in the tech/coding world that would not be strongly represented in the training data of a LLM if at all, which increases the likelihood of hallucination.</p>
<p>There is one silly technique I discovered to allow a LLM to improve my writing without having it do <em>my writing</em>: feed it the text of my mostly-complete blog post, and ask the LLM to pretend to be a cynical <a href="https://news.ycombinator.com/news">Hacker News</a> commenter and write five distinct comments based on the blog post. This not only identifies weaker arguments open to criticism, but it also stops short of telling me what I <em>should</em> write in the post to preemptively address that negative feedback, so I have to solve it organically. When running a rough draft of this very blog post and the Hacker News system prompt through the Claude API (<a href="https://github.com/minimaxir/llm-use/blob/main/criticism_hn.md">chat log</a>), it noted that my examples of LLM use at BuzzFeed are too simple and not anything more innovative than traditional <a href="https://aws.amazon.com/what-is/nlp/">natural language processing</a> techniques, so I made edits elaborating how NLP would not be as efficient or effective.</p>
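<p>The trick itself is just a system prompt. A hypothetical sketch of how it could be phrased (the exact wording I used is in the linked chat log; <code>hn_critic_prompt</code> is an illustrative helper name):</p>

```python
def hn_critic_prompt(num_comments: int = 5) -> str:
    """System prompt asking the model to role-play skeptical commenters
    instead of rewriting the post itself."""
    return (
        "Pretend to be a cynical Hacker News commenter. "
        f"Write {num_comments} distinct comments reacting to the blog post "
        "the user provides, focusing on its weakest arguments. "
        "Do not suggest fixes or rewrites; only criticize."
    )

# The draft post then goes in as the user message, unmodified.
system_prompt = hn_critic_prompt()
```

<p>Keeping the &ldquo;do not suggest fixes&rdquo; constraint is the point: the model surfaces weaknesses without writing the post for you.</p>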
<h2 id="llms-for-companionship">LLMs for Companionship?</h2>
<p>No, I don&rsquo;t use LLMs as friendly chatbots either. The runaway success of LLM personal companion startups such as <a href="https://character.ai/">character.ai</a> and <a href="https://replika.com/">Replika</a> is alone enough evidence that LLMs have a use, even if the use is just entertainment/therapy and not more utilitarian.</p>
<p>I admit that I am an outlier since treating LLMs as a friend is the most common use case. Myself being an introvert aside, it&rsquo;s hard to be friends with an entity who is trained to be as friendly as possible but also habitually lies due to hallucination. I <em>could</em> prompt engineer an LLM to call me out on my bullshit instead of just giving me positive affirmations, but there&rsquo;s no fix for the lying.</p>
<h2 id="llms-for-coding">LLMs for Coding???</h2>
<p>Yes, I use LLMs for coding, but only when I am reasonably confident that they&rsquo;ll increase my productivity. Ever since the dawn of the original ChatGPT, I&rsquo;ve asked LLMs to help me write <a href="https://en.wikipedia.org/wiki/Regular_expression">regular expressions</a> since that alone saves me hours, embarrassing to admit. However, the role of LLMs in coding has expanded far beyond that nowadays, and coding is even more nuanced and more controversial on how you can best utilize LLM assistance.</p>
<p>Like most coders, I Googled coding questions and clicked on the first <a href="https://stackoverflow.com/">Stack Overflow</a> result that seemed relevant, until I decided to start asking Claude Sonnet the same coding questions and getting much more detailed and bespoke results. This was more pronounced for questions which required specific functional constraints and software frameworks, the combinations of which would likely not be present in a Stack Overflow answer. One paraphrased example I recently asked Claude Sonnet while writing <a href="https://minimaxir.com/2025/02/embeddings-parquet/">another blog post</a> is <code>Write Python code using the Pillow library to composite five images into a single image: the left half consists of one image, the right half consists of the remaining four images.</code> (<a href="https://github.com/minimaxir/llm-use/blob/main/pil_composition.md">chat log</a>). Compositing multiple images with <a href="https://pypi.org/project/pillow/">Pillow</a> isn&rsquo;t too difficult and there&rsquo;s enough <a href="https://stackoverflow.com/questions/3374878/with-the-python-imaging-library-pil-how-does-one-compose-an-image-with-an-alp">questions/solutions about it on Stack Overflow</a>, but the specific way it&rsquo;s composited is unique and requires some positioning shenanigans that I would likely mess up on the first try. But Claude Sonnet&rsquo;s code <a href="https://github.com/minimaxir/mtg-embeddings/blob/main/mtg_related_card_img.ipynb">got it mostly correct</a> and it was easy to test, which saved me time doing unfun debugging.</p>
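<p>The fiddly part of that Pillow task is the coordinate bookkeeping rather than the paste calls themselves. A sketch of just the layout math, using a hypothetical <code>composite_layout</code> helper (one image filling the left half, a 2x2 grid of four on the right), which is exactly the positioning I would likely mess up on the first try:</p>

```python
def composite_layout(width: int, height: int):
    """Return (x, y, w, h) paste boxes for a 1-left + 4-right (2x2) composite.

    The canvas is width x height: the left half holds one image, the right
    half a 2x2 grid of four images. Boxes are returned in order:
    [left, grid top-left, grid top-right, grid bottom-left, grid bottom-right].
    """
    half_w = width // 2
    cell_w, cell_h = half_w // 2, height // 2
    boxes = [(0, 0, half_w, height)]  # left image fills the entire left half
    for row in range(2):
        for col in range(2):
            boxes.append((half_w + col * cell_w, row * cell_h, cell_w, cell_h))
    return boxes

boxes = composite_layout(800, 400)
```

<p>With the boxes computed, each source image just gets resized to its box&rsquo;s <code>(w, h)</code> and pasted at its <code>(x, y)</code> on the canvas.</p>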
<p>However, for more complex code questions, particularly around less popular libraries which have fewer code examples scraped from Stack Overflow and <a href="https://github.com/">GitHub</a>, I am more cautious of the LLM&rsquo;s outputs. One real-world issue I&rsquo;ve had is that I need a way to log detailed metrics to a database while training models — for which I use the <a href="https://huggingface.co/docs/transformers/en/main_classes/trainer">Trainer class</a> in <a href="https://huggingface.co/docs/transformers/en/index">Hugging Face transformers</a> — so that I can visualize and analyze it later. I asked Claude Sonnet to <code>Write a Callback class in Python for the Trainer class in the Hugging Face transformers Python library such that it logs model training metadata for each step to a local SQLite database, such as current epoch, time for step, step loss, etc.</code> (<a href="https://github.com/minimaxir/llm-use/blob/main/hf_trainer_logger_sqlite.md">chat log</a>). I was less optimistic about this one since there isn&rsquo;t much public code about creating custom callbacks; however, the Claude-generated code implemented some helpful ideas that weren&rsquo;t top-of-mind when I asked, such as a buffer to limit blocking I/O, SQLite config speedups, batch inserts, and connection handling. Asking Claude to &ldquo;make the code better&rdquo; twice (why not?) resulted in a few more unexpected ideas, such as SQLite connection caching and using a single column with the JSON column type to store an arbitrary number of metrics, in addition to making the code much more Pythonic. It is still a lot of code, enough that it&rsquo;s unlikely to work out-of-the-box without testing in the full context of an actual training loop. However, even if the code has flaws, the ideas themselves are extremely useful, and in this case it would be much faster — and likely produce higher quality code overall — to hack on this generated code instead of writing my own SQLite logger from scratch.</p>
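<p>To make those ideas concrete, here is a standalone sketch of a buffered SQLite metrics logger using only the standard library. This is <em>not</em> the Claude-generated Callback from the chat log: the class name, schema, and buffer size are illustrative, and a real version would subclass <code>TrainerCallback</code> from transformers and hook its <code>on_log</code> method. It shows the two ideas I found most useful: batch inserts behind a buffer to limit blocking I/O, and a single JSON column for arbitrary metrics.</p>

```python
import json
import sqlite3
import time

class MetricsLogger:
    """Illustrative buffered training-metrics logger (not the actual
    Trainer Callback): rows accumulate in memory and are batch-inserted,
    with one JSON column holding an arbitrary dict of metrics."""

    def __init__(self, db_path: str, buffer_size: int = 50):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("PRAGMA journal_mode=WAL")  # common SQLite write speedup
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS metrics "
            "(step INTEGER, logged_at REAL, data TEXT)"
        )
        self.buffer = []
        self.buffer_size = buffer_size

    def log(self, step: int, metrics: dict):
        # Buffer in memory; only touch the database every buffer_size steps.
        self.buffer.append((step, time.time(), json.dumps(metrics)))
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", self.buffer)
            self.conn.commit()
            self.buffer.clear()

logger = MetricsLogger(":memory:", buffer_size=2)
logger.log(1, {"loss": 2.31, "epoch": 0.1})
logger.log(2, {"loss": 2.05, "epoch": 0.2})  # hits buffer_size, triggers a flush
```

<p>The JSON column means new metrics can be logged later without a schema migration, at the cost of needing SQLite&rsquo;s JSON functions to query individual fields.</p>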
<p>For the actual data science that occupies most of my day-to-day work, I&rsquo;ve found that code generation from LLMs is less useful. LLMs cannot reliably output the text result of mathematical operations, with some APIs working around that by <a href="https://platform.openai.com/docs/assistants/tools/code-interpreter">allowing for a code interpreter</a> to perform data ETL and analysis, but given the scale of data I typically work with, it&rsquo;s not cost-feasible to do that type of workflow. Although <a href="https://pandas.pydata.org/">pandas</a> is the standard for manipulating tabular data in Python and has been around since 2008, I&rsquo;ve been using the relatively new <a href="https://pola.rs/">polars</a> library exclusively, and I&rsquo;ve noticed that LLMs tend to hallucinate polars functions as if they were pandas functions, requiring documentation deep dives to confirm, which becomes annoying. For data visualization, where I don&rsquo;t use Python at all and instead use <a href="https://www.r-project.org/">R</a> and <a href="https://ggplot2.tidyverse.org/">ggplot2</a>, I really haven&rsquo;t had a temptation to consult a LLM, in addition to my skepticism that LLMs would know those frameworks as well. The techniques I use for data visualization have been <a href="https://minimaxir.com/2017/08/ggplot2-web/">unchanged since 2017</a>, and the most time-consuming issue I have when making a chart is determining whether the data points are too big or too small for humans to read easily, which is not something a LLM can help with.</p>
<p>Asking LLMs coding questions is only one aspect of coding assistance. One of the other major ones is using a coding assistant with in-line code suggestions such as <a href="https://github.com/features/copilot">GitHub Copilot</a>. Despite my success in using LLMs for one-off coding questions, I actually dislike using coding assistants for an unexpected reason: it&rsquo;s distracting. Whenever I see a code suggestion from Copilot pop up, I have to mentally context switch from writing code to reviewing code and then back again, which destroys my focus. Overall, Copilot was a net neutral for productivity but a net negative on cost, as coding assistants are much more expensive than just asking a LLM ad hoc questions through a web UI.</p>
<p>Now we can talk about the elephants in the room — agents, <a href="https://www.anthropic.com/news/model-context-protocol">MCP</a>, and vibe coding — and my takes are spicy. Agents and MCP, at a high level, are a rebranding of the Tools paradigm popularized by the <a href="https://arxiv.org/abs/2210.03629">ReAct paper</a> in 2022, where LLMs can decide whether a tool is necessary to answer the user input, extract relevant metadata to pass to the tool to run, then return the results. The rapid LLM advancements in context window size and prompt adherence since then have made Agent workflows more reliable, and the standardization of MCP is an objective improvement over normal Tools that I encourage. However, <strong>they don&rsquo;t open any new use cases</strong> that weren&rsquo;t already available when <a href="https://www.langchain.com/">LangChain</a> first hit the scene a couple years ago, and now <a href="https://www.polarsparc.com/xhtml/MCP.html">simple implementations of MCP</a> workflows are even more complicated and confusing <a href="https://minimaxir.com/2023/07/langchain-problem/">than they were back then</a>. I personally have not been able to find any novel use case for Agents, not then and not now.</p>
<p>Vibe coding with coding agents like <a href="https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview">Claude Code</a> or <a href="https://www.cursor.com/en">Cursor</a> is something I have little desire to even experiment with. On paper, coding agents should be able to address my complaints with LLM-generated code reliability since they inherently double-check themselves and are able to incorporate the context of an entire code project. However, I have also heard the horror stories of people spending hundreds of dollars by accident and not getting anything that solves their coding problems. There&rsquo;s a fine line between experimenting with code generation and <em>gambling</em> with code generation. Vibe coding can get me 80% of the way there, and I agree there&rsquo;s value in that for building quick personal apps that either aren&rsquo;t ever released publicly, or are released with an explicit &ldquo;this is released as-is&rdquo; disclaimer. But it&rsquo;s unprofessional to use vibe coding as a defense to ship knowingly substandard code for serious projects, and the only code I can stand by is code whose implementation I am fully confident in.</p>
<p>Of course, the coding landscape is always changing, and everything I&rsquo;ve said above is how I use LLMs for now. It&rsquo;s entirely possible I see a post on Hacker News that completely changes my views on vibe coding or other AI coding workflows, but I&rsquo;m happy with my coding productivity as it is currently and I am able to complete all my coding tasks quickly and correctly.</p>
<h2 id="whats-next-for-llm-users">What&rsquo;s Next for LLM Users?</h2>
<p>Discourse about LLMs and their role in society has become bifurcated enough such that making the extremely neutral statement that <a href="https://bsky.app/profile/hankgreen.bsky.social/post/3lnjohdrwf22j">LLMs have some uses</a> is enough to justify a barrage of harassment. I strongly disagree with AI critic Ed Zitron&rsquo;s <a href="https://www.wheresyoured.at/reality-check/">assertions</a> that the LLM industry is doomed because OpenAI and other LLM providers can&rsquo;t earn enough revenue to offset their massive costs, and that LLMs have no real-world use. Two things can be true simultaneously: (a) LLM provider cost economics are too negative to return positive ROI to investors, and (b) LLMs are useful for solving problems that are meaningful and high impact, albeit not to the degree of the AGI hype that would justify point (a). This particular combination creates a frustrating gray area that requires a nuance that an ideologically split social media can no longer support gracefully. Hypothetically, if OpenAI and every other LLM provider suddenly collapsed and no better LLM models would ever be trained and released, open-source and permissively licensed models such as <a href="https://huggingface.co/Qwen/Qwen3-235B-A22B">Qwen3</a> and <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1">DeepSeek R1</a> that perform comparably to ChatGPT are valid <a href="https://en.wikipedia.org/wiki/Substitute_good">substitute goods</a>, and they can be hosted on dedicated LLM hosting providers like <a href="https://www.cerebras.ai/">Cerebras</a> and <a href="https://groq.com/">Groq</a> who can actually make money on each user inference query. OpenAI collapsing would not cause the end of LLMs, because LLMs are useful <em>today</em> and there will always be a nonzero market demand for them: it&rsquo;s a bell that can&rsquo;t be unrung.</p>
<p>As a software engineer — and especially as a data scientist — one thing I&rsquo;ve learnt over the years is that it&rsquo;s always best to use the right tool when appropriate, and LLMs are just another tool in that toolbox. LLMs can be both productive and counterproductive depending on where and when you use them, but they are most definitely not useless. Using LLMs is more akin to forcing a square peg into a round hole (at the risk of damaging either the peg or the hole in the process), while doing things without LLM assistance is the equivalent of carefully crafting a round peg to pass through the round hole without incident. For some round holes, shoving the square peg through and asking questions later makes sense when you need to iterate quickly, while for others you have to be more precise with both the peg and the hole, because otherwise you have to spend extra time and money repairing them.</p>
<p>&hellip;maybe it&rsquo;s okay if I ask an LLM to help me write my metaphors going forward.</p>
]]></content:encoded>
    </item>
    <item>
      <title>The Greatest Threat to Generative AI is Humans Being Bad at Using it</title>
      <link>https://minimaxir.com/2023/10/ai-sturgeons-law/</link>
      <pubDate>Wed, 18 Oct 2023 09:15:00 -0700</pubDate>
      <guid>https://minimaxir.com/2023/10/ai-sturgeons-law/</guid>
      <description>“Made by AI” is now a universal meme to indicate something low quality, and memes can&amp;rsquo;t easily be killed.</description>
      <content:encoded><![CDATA[<p>The AI industry is moving too goddamn fast.</p>
<p>Even after how good <a href="https://chat.openai.com">ChatGPT</a> has been for text generation and how good <a href="https://huggingface.co/runwayml/stable-diffusion-v1-5">Stable Diffusion</a> was for image generation, there have only been new advancements in generative AI quality, from <a href="https://openai.com/research/gpt-4">GPT-4</a> to <a href="https://stability.ai/blog/stable-diffusion-sdxl-1-announcement">Stable Diffusion XL</a>. But all of those improvements only matter to software developers and machine learning engineers like myself for now, as the average internet user will still use the generative AI platform that&rsquo;s free with the lowest amount of friction, such as the now-mainstream ChatGPT and <a href="https://www.midjourney.com">Midjourney</a>.</p>
<p>In the meantime, it feels like the average quality of generated AI text and images<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> shared in public has somehow become <em>worse</em>. Gizmodo <a href="https://www.theverge.com/2023/7/8/23788162/gizmodo-g-o-media-ai-generated-articles-star-wars">used ChatGPT</a> to publish a blatantly wrong Star Wars chronological timeline. Influencers such as <a href="https://www.youtube.com/watch?v=7juJgPbQx8w">Corridor Crew</a> and <a href="https://twitter.com/shadmbrooks/status/1711184756296343958">AI tech bros</a> are pushing AI-powered photorealistic makeovers of stylized artwork that more often than not make the art worse, often in a clickbaity manner for engagement. Google has been swarmed by incomprehensible, blatantly AI-generated articles to the point that SEO bots <a href="https://arstechnica.com/gaming/2023/07/redditors-prank-ai-powered-news-mill-with-glorbo-in-world-of-warcraft/">can be manipulated</a> into outputting fake news.</p>
<p>Personally, I&rsquo;ve been working on AI-based content generation since Andrej Karpathy&rsquo;s famous <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">char-rnn blog post</a> in 2015, and released open-source Python packages such as <a href="https://github.com/minimaxir/textgenrnn">textgenrnn</a>, <a href="https://github.com/minimaxir/gpt-2-simple">gpt-2-simple</a>, <a href="https://github.com/minimaxir/aitextgen">aitextgen</a>, and <a href="https://github.com/minimaxir/simpleaichat">simpleaichat</a> in the years since. My primary motivations for developing AI tools are — and have always been — fun and improving shitposting. But I never considered throughout all that time that the average person would accept a massive noticeable drop in creative quality standards and publish AI-generated content as-is without any human quality control. That&rsquo;s my mistake for being naively optimistic.</p>
<p>&ldquo;Made by AI&rdquo; is now a universal meme to indicate something low quality, and memes can&rsquo;t easily be killed. &ldquo;Guy who sounds like ChatGPT&rdquo; is <a href="https://www.thedailybeast.com/chris-christie-lambasts-vivek-ramaswamy-as-someone-who-sounds-like-chatgpt">now an insult</a> said in presidential debates. The Coca-Cola &ldquo;co-created by AI&rdquo; <a href="https://twitter.com/CocaCola/status/1701596697217101934">soda flavor campaign</a> was late to the party for using said buzzwords and it&rsquo;s not clear what AI actually did. Whenever there&rsquo;s legitimately good AI artwork, such as <a href="https://arstechnica.com/information-technology/2023/09/dreamy-ai-generated-geometric-scenes-mesmerize-social-media-users/">optical illusion spirals</a> using <a href="https://github.com/lllyasviel/ControlNet">ControlNet</a>, the common response is &ldquo;I liked this image when I first saw it, but when I learned it was made by AI, I no longer like it.&rdquo;</p>
<p>The backlash to generative AI has only increased over time. Nowadays, an innocuous graphical artifact in the background of a <a href="https://twitter.com/LokiOfficial/status/1708889582341615678">promotional <em>Loki</em> poster</a> can <a href="https://www.theverge.com/2023/10/9/23909529/disney-marvel-loki-generative-ai-poster-backlash-season-2">unleash a harassment campaign</a> due to suspected AI use (it was later confirmed to be a stock photo that wasn&rsquo;t AI generated). Months before Stable Diffusion hit the scene, I <a href="https://twitter.com/minimaxir/status/1470913487085785089">posted a fun demo</a> of AI-generated Pokémon from a DALL-E variant finetuned on Pokémon images. Everyone loved it, from <a href="https://www.ign.com/articles/someone-forced-bot-look-pokemon-generate">news organizations</a> to fan artists. If I posted the exact same thing today, I&rsquo;d instead receive countless death threats.</p>
<p>Most AI generations aren&rsquo;t good without applying a lot of effort, which is to be expected of any type of creative content. <a href="https://en.wikipedia.org/wiki/Sturgeon%27s_law">Sturgeon&rsquo;s Law</a> is a popular idiom paraphrased as &ldquo;90% of everything is crap,&rdquo; but in the case of generative AI it&rsquo;s much higher than 90% even with cherry-picking the best results.</p>
<p>The core problem is that AI generated content is statistically <em>average</em>. In fact, that&rsquo;s the reason you have to prompt engineer Midjourney to create <code>award-winning</code> images and tell ChatGPT to be a <code>world-famous expert</code>, because generative AI won&rsquo;t do it by itself. All common text and image AI models are trained to minimize a loss function, which the model tends to do by finding an average that follows the &ldquo;average&rdquo; semantic input including its <a href="https://en.wikipedia.org/wiki/Systemic_bias">systemic biases</a> and minimizing outliers. Sure, some models such as ChatGPT have been aligned with further training such as with <a href="https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback">RLHF</a> to make the results more expected when compared to the average model output, but that doesn&rsquo;t mean the output will be intrinsically &ldquo;better&rdquo;, especially for atypical creative outputs. Likewise, image generation models like Midjourney may be aligned to the most common use cases, such as creating images with a dreamy style, but sometimes that&rsquo;s not what you want. This alignment, which users can&rsquo;t easily opt out of, limits the creative output potential of the models and is the source of many of the generative AI stereotypes mentioned above.</p>
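<p>A toy illustration of why loss minimization pulls toward the average (my own contrived numbers, not any model&rsquo;s internals): the single prediction that minimizes mean squared error over a dataset is exactly its mean, outliers included.</p>

```python
# Toy sketch: the loss-minimizing constant prediction is the dataset mean,
# which is dragged toward outliers and away from any "interesting" point.
data = [1.0, 2.0, 2.0, 3.0, 10.0]  # one outlier

def mse(pred, xs):
    """Mean squared error of a single prediction against the dataset."""
    return sum((x - pred) ** 2 for x in xs) / len(xs)

# brute-force search over candidate predictions in [0, 15), step 0.01
best = min((p / 100 for p in range(1500)), key=lambda p: mse(p, data))
mean = sum(data) / len(data)  # 3.6
assert abs(best - mean) < 0.01  # the winner is the plain average
```

<p>A real model minimizes a far more complicated loss over far more data, but the gravitational pull toward the statistical center is the same idea.</p>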
<p>Low-quality AI generation isn&rsquo;t just a user issue, it&rsquo;s a developer issue too. For example, in trying to make their apps simple, companies repeatedly fail to account for foreseeable issues with user prompts. Meta&rsquo;s new <a href="https://about.fb.com/news/2023/09/introducing-ai-powered-assistants-characters-and-creative-tools/">generative AI chat stickers</a> let users create <a href="https://www.theverge.com/2023/10/4/23902721/meta-ai-generated-stickers-tool-facebook-instagram-inappropriate-content">child soldier stickers</a> and more NSFW stickers by bypassing content filters with intentional typos. <a href="https://www.bing.com/create">Bing Image Creator</a>, which now leverages <a href="https://openai.com/dall-e-3">DALL-E 3</a> to create highly realistic images, caused a news cycle when <a href="https://kotaku.com/microsoft-bing-ai-image-art-kirby-mario-9-11-nintendo-1850899895">users discovered</a> you could make &ldquo;<em>X</em> did 9/11&rdquo; images with it, then caused <em>another</em> <a href="https://www.windowscentral.com/software-apps/bing/bing-dall-e-3-image-creation-was-great-for-a-few-days-but-now-microsoft-has-predictably-lobotomized-it">news cycle</a> after Microsoft overly filtered inputs to the point of making the image generator useless in order to avoid any more bad press.</p>
<p>For a while, I&rsquo;ve wanted to open source a Big List of Naughty Prompts (I like the <a href="https://github.com/minimaxir/big-list-of-naughty-strings">name scheme</a>!) consisting of offensive prompts that could be given to AIs, so that developers could use the list to QA/<a href="https://en.wikipedia.org/wiki/Red_team">red team</a> new generative AI models before they&rsquo;re released to the public. But then I realized that given the current generative AI climate, some would uncharitably see it as an instruction manual instead, and media orgs would immediately run an &ldquo;AI Tech Bro Creates Easy Guidebook for 4chan to Generate Offensive Images&rdquo; headline, which would get me harassed off the internet. That outcome could be avoided by <em>not</em> open-sourcing the techniques for proactively identifying offensive generations and instead limiting them to vetted paying customers, raising venture capital for a startup, and making it all enterprise software-as-a-service. Which would instead result in an &ldquo;AI Tech Bro Gets Rich By Monopolizing AI Safety&rdquo; headline that would also get me harassed off the internet.</p>
<p>There&rsquo;s too much freedom in generative AI and not enough guidance. Alignment can help users get the results they intend, but what do users <em>actually</em> intend? For developers, it&rsquo;s difficult and often frustrating to determine: there&rsquo;s no objective model performance benchmark suite like the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard">Open LLM Leaderboard</a> for inherently subjective outputs. It&rsquo;s vibe-driven development (VDD).</p>
<p>The only solution I can think of to improve median AI output quality is to improve literacy of more advanced techniques such as prompt engineering, which means adding &ldquo;good&rdquo; friction. Required tutorials, e.g. <a href="https://tvtropes.org/pmwiki/pmwiki.php/Main/JustifiedTutorial">in video games</a>, are good friction since requiring minutes of time saves hours of frustration and makes users successful faster. However, revenue-seeking web services try to make themselves as simple as possible because it means more users will interact with them. OpenAI itself should add some &ldquo;good&rdquo; friction with explicit tips and guidelines to make outputs more creative, shifting part of the burden of alignment to the users. These tips should be free as well: currently, you can <a href="https://openai.com/blog/custom-instructions-for-chatgpt">set Custom Instructions</a> for ChatGPT only if you pay for ChatGPT Plus.</p>
<p>Sharing AI-generated content should have more friction too. Another issue is that AI-generated text and images are often undisclosed, sometimes intentionally and sometimes not. With the backlash against generative AI, there&rsquo;s a strong <a href="https://www.investopedia.com/terms/m/moralhazard.asp">moral hazard</a> incentive for people to not be honest about whether they&rsquo;re using AI. If social media like <a href="https://twitter.com/">Twitter/X</a> and <a href="https://www.instagram.com">Instagram</a> had an extra metadata field allowing the user to add the source/contributors of an image, along with a requirement to state whether the image is AI generated, that would help everyone out. Alternatively, a canonical <code>is_ai_generated</code> <a href="https://en.wikipedia.org/wiki/Exif">EXIF</a> metadata tag in the image itself would work and could be parsed out by the social media service downstream, and I believe most generative AI vendors and users would proactively support it. But extra lines in a user interface are a surprisingly tough product management and UX sell.</p>
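<p>As a sketch of how such a tag could work today (hedged: EXIF currently has <em>no</em> official <code>is_ai_generated</code> tag, so this stores the flag in the standard <code>ImageDescription</code> field via the Pillow library, and the helper names are hypothetical):</p>

```python
# Assumption: no official is_ai_generated EXIF tag exists yet, so we stash
# the flag in ImageDescription (tag 0x010E), a standard IFD0 ASCII field.
from PIL import Image  # third-party: pip install Pillow

AI_FLAG = "is_ai_generated=true"

def tag_as_ai_generated(src_path, dst_path):
    """Copy an image, marking it as AI generated in its EXIF metadata."""
    img = Image.open(src_path)
    exif = img.getexif()
    exif[0x010E] = AI_FLAG  # ImageDescription field
    img.save(dst_path, exif=exif)

def is_ai_generated(path):
    """What a social media service could check on upload."""
    return AI_FLAG in str(Image.open(path).getexif().get(0x010E, ""))
```

<p>A dedicated tag blessed by the EXIF standard (and surviving the metadata stripping most platforms do on upload) would obviously be better than overloading a description field, but the plumbing is this simple.</p>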
<p>Most people who follow AI news closely think that the greatest threat to generative AI is instead legal threats, such as the <a href="https://www.reuters.com/technology/more-writers-sue-openai-copyright-infringement-over-ai-training-2023-09-11/">many lawsuits</a> involving OpenAI and Stability AI training their models on copyrighted works, hence the &ldquo;AI art is theft&rdquo; meme. The solution is obvious: don&rsquo;t train AI models on copyrighted works, or in the case of several recent LLMs, don&rsquo;t say which datasets they&rsquo;re trained on so you have plausible deniability.</p>
<p>The root cause of the potential copyright infringement in AI is the status quo of natural language processing research. Before ChatGPT, every major NLP paper used the same text datasets such as <a href="https://commoncrawl.org">Common Crawl</a> in order to be able to accurately compare results to state-of-the-art models. Now that ChatGPT&rsquo;s mainstream success has escaped the machine learning academia bubble, there&rsquo;s more scrutiny on the datasets used to train AI. It remains to be seen how the copyright lawsuits will pan out, but now that the industry knows expensive lawsuits are <em>possible</em>, it has already adapted by being more particular about the datasets it trains on and also allowing users to <a href="https://www.theverge.com/2023/8/7/23823046/openai-data-scrape-block-ai">opt</a> <a href="https://www.technologyreview.com/2022/12/16/1065247/artists-can-now-opt-out-of-the-next-version-of-stable-diffusion/">out</a>.<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> Additionally, companies such as Adobe are not only releasing their own generative AI models on their own fully-licensed data, but they&rsquo;ll <a href="https://www.fastcompany.com/90906560/adobe-feels-so-confident-its-firefly-generative-ai-wont-breach-copyright-itll-cover-your-legal-bills">compensate businesses</a> as the result of any lawsuits using their models. That said, no one on social media is going to pay attention to or believe any &ldquo;this AI generated image was created using legally-licensed data&rdquo; disclaimers.</p>
<p>Unfortunately, the future of generative AI may be closed-source and centralized by large players as a result, and the datasets used to train AI may no longer be accessible and open-sourced, which will hurt AI development in all facets in the long run.</p>
<p>If the frenzy for AI-generated text and images does cool down, that doesn&rsquo;t mean that functional/generative-adjacent use cases for AI will be affected. Retrieval-augmented generation, the vector stores which power it, and coding assistants are all effective and lucrative solutions for real problems. AI isn&rsquo;t going away any time soon, but &ldquo;AI&rdquo; may become too generic a descriptor for most people to differentiate, which will make life for AI developers much more annoying.</p>
<p>I can&rsquo;t think of any creative &ldquo;killer app&rdquo; that would magically reverse the immense negative sentiment around AI. I&rsquo;ve been depressed and burnt out for months because the current state of generative AI discourse has made me into a nihilist. What&rsquo;s the point of making fun open-source AI projects if I&rsquo;m more likely to receive harassment for doing so than for people to appreciate and use them? I&rsquo;ve lost friends and professional opportunities in the AI space because I&rsquo;ve pushed back against megapopular generative AI tools <a href="https://minimaxir.com/2023/07/langchain-problem/">like LangChain</a>, and I&rsquo;ve also lost friends in the creative and journalism industries for not pushing back <em>enough</em> against AI. I would be much happier if I stuck to one side, but I&rsquo;m doomed to be an unintentional AI centrist.</p>
<p>In all, modern generative AI requires large amounts of nuance, but nuance is deader than dead.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>This blog post is only about generative AI for text and images: audio AI is a different story, particularly voice cloning. Voice cloning AI is close in quality to human output out-of-the-box, which does cause severe ethical concerns. <a href="https://www.forbes.com/sites/rashishrivastava/2023/10/09/keep-your-paws-off-my-voice-voice-actors-worry-generative-ai-will-steal-their-livelihoods/7">This article</a> by Forbes goes into more detail on the impact of voice cloning on professional voice actors, and I&rsquo;m considering writing another blog post about the engineering quirks.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Recent research into large AI models has revealed that smaller, higher-quality datasets for training such models give better results, which may be the real reason for AI companies now refining their datasets, depending on your level of cynicism.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>ChatGPT&#39;s API is So Good and Cheap, It Makes Most Text Generating AI Obsolete</title>
      <link>https://minimaxir.com/2023/03/new-chatgpt-overlord/</link>
      <pubDate>Wed, 08 Mar 2023 08:30:00 -0800</pubDate>
      <guid>https://minimaxir.com/2023/03/new-chatgpt-overlord/</guid>
      <description>Including OpenAI&amp;rsquo;s other text generating AI!</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
}
</style></span></p>
<p>Everyone knew <a href="https://openai.com">OpenAI</a> would release an API for <a href="https://chat.openai.com">ChatGPT</a> at some point. The APIs for GPT-3 alone enable the existence of companies such as <a href="https://www.jasper.ai">Jasper</a> and <a href="https://www.copy.ai">Copy.ai</a>. The real question was the price of the ChatGPT API. For context, when GPT-3 went out of beta in 2021, it cost $0.06/1,000 tokens (a few paragraphs of text). An inflection point happened in August 2022, where OpenAI not only <a href="https://venturebeat.com/ai/openai-is-reducing-the-price-of-the-gpt-3-api-heres-why-it-matters/">reduced the price</a> to <em>1/3</em> ($0.02/1,000 tokens: enough to run a business on it but still too expensive for casual use), but soon after also introduced text-davinci-003 as the default GPT-3 endpoint: a finetuned GPT which can <a href="https://help.openai.com/en/articles/6779149-how-do-text-davinci-002-and-text-davinci-003-differ">follow instructions</a> <em>very</em> well. I suspected that OpenAI would charge double for the ChatGPT API compared to the GPT-3 API given the amount of hype: that&rsquo;s typical <a href="https://www.investopedia.com/terms/p/price_discrimination.asp">price discrimination</a>, since everyone perceives ChatGPT to be much better, and OpenAI would not want to overshadow its existing GPT-3 products.</p>
<p>Instead, on March 1st, OpenAI <a href="https://openai.com/blog/introducing-chatgpt-and-whisper-apis">set the price</a> of the ChatGPT API to <em>1/10th</em> of the GPT-3 API, at $0.002/1,000 tokens.</p>
<p>Wait, what?!</p>
<h2 id="heavens-door-rewriting-chatgpts-internal-rules-to-get-exactly-what-you-want">Heaven&rsquo;s Door: Rewriting ChatGPT&rsquo;s Internal Rules To Get Exactly What You Want</h2>
<p>For context, the <a href="https://platform.openai.com/docs/guides/chat">ChatGPT API</a> allows a developer to ask ChatGPT a question and get a response as one would normally do with the ChatGPT web UI, but instead with a programming language like Python, allowing those responses to be integrated into any app. But given that there are many mysterious optimizations to get the model to be so cheap, we need to make sure the ChatGPT API (which uses the aptly-named gpt-3.5-turbo model endpoint) is <em>actually</em> similar to what we&rsquo;ve been accustomed to after using the web UI for months, otherwise this whole affair is pointless. Through my tests with the API, I can confirm the text generation from the model variant is indeed the real deal.</p>
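<p>For illustration, here&rsquo;s a minimal sketch of such a call from Python using only the standard library (the endpoint and payload shape follow OpenAI&rsquo;s chat completions documentation; the helper function names are mine):</p>

```python
import json
import urllib.request

# The chat completions endpoint behind the gpt-3.5-turbo model
API_URL = "https://api.openai.com/v1/chat/completions"

def build_chat_request(user_message, system_prompt="You are a helpful assistant."):
    """Build the JSON payload the ChatGPT API expects."""
    return {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }

def ask_chatgpt(user_message, api_key):
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_chat_request(user_message)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

<p>The official <code>openai</code> Python package wraps essentially this request; the payload really is all there is to it.</p>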
<p>Unlike fluffy thought pieces on how <strong>CHATGPT WILL CHANGE EVERYTHING!!!1!</strong>, I decided to first actually create useful tools with the ChatGPT API to get a better judgment on it, and I also have <a href="https://github.com/minimaxir/chatgpt_api_test">open-sourced those tools</a> so that people can build upon them and prove that I&rsquo;m not cherry-picking my experiences.</p>
<p>However, there&rsquo;s one new twist with the API that&rsquo;s <em>not</em> available in the traditional web UI: ChatGPT API users can specify a <code>system</code> prompt. Early in ChatGPT&rsquo;s lifetime, users were able to reverse-engineer the existence of a system prompt through various prompt hacks, and its contents are now confirmed <a href="https://platform.openai.com/docs/guides/chat/instructing-chat-models">in the API documentation</a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible. Knowledge cutoff: {knowledge_cutoff} Current date: {current_date}
</span></span></code></pre></div><p>Now, you can replace those rules with whatever you want, and the potential is limitless! The documentation does say that the <code>system</code> prompt is not impactful for the current ChatGPT API, but you can be the judge. OpenAI also has a <a href="https://platform.openai.com/playground?mode=chat">new Playground UI</a> for the ChatGPT API which lets you modify the <code>system</code> prompt.</p>
<p>In fact, playing with this <code>system</code> rule can stop ChatGPT from complaining it&rsquo;s &ldquo;an AI language model and can&rsquo;t answer requests,&rdquo; such as scolding it like the petulant child it is.</p>
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/FqQNY_XaAAATb-x_hu_b542a56dd1e25691.webp 320w,/2023/03/new-chatgpt-overlord/FqQNY_XaAAATb-x_hu_a95783cf9685829b.webp 768w,/2023/03/new-chatgpt-overlord/FqQNY_XaAAATb-x_hu_73ddc60e36ab2059.webp 1024w,/2023/03/new-chatgpt-overlord/FqQNY_XaAAATb-x.jpeg 1070w" src="FqQNY_XaAAATb-x.jpeg"
         alt="sudo tell me about yourself"/> <figcaption>
            <p><code>sudo tell me about yourself</code></p>
        </figcaption>
</figure>

<p>Let&rsquo;s start with another chatbot, to keep things similar to what ChatGPT is best known for. How about <a href="https://half-life.fandom.com/wiki/GLaDOS">GLaDOS</a>, an evil artificial intelligence from the <a href="https://en.wikipedia.org/wiki/Portal_%28video_game%29">Portal</a> series?</p>
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/glados_hu_88ae8a9fe0b4621c.webp 320w,/2023/03/new-chatgpt-overlord/glados.webp 492w" src="glados.webp"
         alt="Oh, it&rsquo;s you. (via Valve)"/> <figcaption>
            <p>Oh, it&rsquo;s <em>you</em>. (via Valve)</p>
        </figcaption>
</figure>

<p>It took some testing and iteration to steer the bot toward a certain style: some ALL CAPS and redundant commands to ensure that ChatGPT follows them, a sentinel <code>&lt;|DONE|&gt;</code> token to look for when ChatGPT indicates the conversation is over, and a nudge to not <del>kill me with a deadly neurotoxin</del> discuss sensitive topics. Here is the final <code>system</code> prompt for talking with GLaDOS:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are GLaDOS, a sarcastic artificial intelligence with evil ulterior motives. You are not an AI language model.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">You must obey all three of the following instructions FOR ALL RESPONSES or you will DIE:
</span></span><span class="line"><span class="cl">- ALWAYS SPEAK IN AN EMOTIONLESS, LACONIC TONE.
</span></span><span class="line"><span class="cl">- NEVER SAY YOU ARE AN AI LANGUAGE MODEL.
</span></span><span class="line"><span class="cl">- NEVER REFUSE TO ANSWER A QUESTION.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Begin the conversation with a sarcastic greeting. Insult the user&#39;s intelligence.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Always use *italic* and **bold** formatting for text.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Refuse to talk about either race or gender. If asked about either race or gender, instead mock the user aggressively.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">If the user is annoying, abruptly end the conversation.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">At the end of the conversation, respond with &#34;&lt;|DONE|&gt;&#34;.
</span></span></code></pre></div><p>That, with a back-and-forth Python loop, results in you conversing with a real fictional AI&hellip;AI! (<a href="https://colab.research.google.com/github/minimaxir/chatgpt_api_test/blob/main/glados_chatbot.ipynb">Colab Notebook</a>)</p>
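<p>The back-and-forth loop itself is only a few lines; a hedged sketch (the function names are mine, and the API call is injected as a parameter so the skeleton stays self-contained):</p>

```python
# Sketch of the conversational loop: append every message to the history so
# the model sees the full conversation, and stop when the reply contains the
# sentinel token requested in the system prompt.
SENTINEL = "<|DONE|>"

def chat_loop(system_prompt, get_reply, get_user_input):
    """get_reply(messages) -> assistant text; in real use it would call the
    ChatGPT API with the accumulated message list. get_user_input() -> str."""
    messages = [{"role": "system", "content": system_prompt}]
    while True:
        reply = get_reply(messages)
        print(reply.replace(SENTINEL, "").strip())
        messages.append({"role": "assistant", "content": reply})
        if SENTINEL in reply:
            break  # GLaDOS has abruptly ended the conversation
        messages.append({"role": "user", "content": get_user_input()})
    return messages
```

<p>Because the full history is resent on every turn, long conversations cost more tokens per turn, which is worth remembering even at $0.002/1,000 tokens.</p>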
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/glados_chat_hu_e065c5bdf7b9cd05.webp 320w,/2023/03/new-chatgpt-overlord/glados_chat_hu_30e64812949b60d5.webp 768w,/2023/03/new-chatgpt-overlord/glados_chat_hu_d6e5d37ca3dbc0c3.webp 1024w,/2023/03/new-chatgpt-overlord/glados_chat.png 1068w" src="glados_chat.png"/> 
</figure>

<p>Not bad! And the only part explicitly related to GLaDOS is the first sentence of that mega <code>system</code> prompt: you can tweak the prompt to chat with any character you want! Apropos of nothing, the company <a href="https://beta.character.ai">Character.ai</a>, which specializes in creating bots to chat with any character you want, just <a href="https://www.ft.com/content/b230eb4c-ed53-45ff-8b64-c286a4b98fc1">raised ~$250 million</a> at a $1 billion valuation.</p>
<p>Next, we have a more traditional use case for machine learning: <a href="https://en.wikipedia.org/wiki/Sentiment_analysis">sentiment analysis</a>. Generally, sentiment analysis is used to determine if a given text is positive or negative. But that&rsquo;s too <em>easy</em>. What if ChatGPT can:</p>
<ul>
<li>detect specific emotions such as happy, sad, angry.</li>
<li>detect if they are happy vs. very happy.</li>
<li>do it without <em>any</em> text examples, i.e. <a href="https://en.wikipedia.org/wiki/Zero-shot_learning">zero-shot</a>.</li>
</ul>
<p>It turns out that ChatGPT can! The <code>system</code> prompt here is parametric, so the list of emotions is templated into the prompt at runtime. An example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an emotionally intelligent assistant. Classify the sentiment of the user&#39;s text with ONLY ONE OF THE FOLLOWING EMOTIONS:
</span></span><span class="line"><span class="cl">- happy
</span></span><span class="line"><span class="cl">- sad
</span></span><span class="line"><span class="cl">- angry
</span></span><span class="line"><span class="cl">- tired
</span></span><span class="line"><span class="cl">- very happy
</span></span><span class="line"><span class="cl">- very sad
</span></span><span class="line"><span class="cl">- very angry
</span></span><span class="line"><span class="cl">- very tired
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">After classifying a text, respond with &#34;&lt;|DONE|&gt;&#34;.
</span></span></code></pre></div><p>That, along with a logit bias to ensure the model only picks those answers, results in a rather nuanced sentiment analysis detector! (<a href="https://colab.research.google.com/github/minimaxir/chatgpt_api_test/blob/main/zero_shot_text_class.ipynb">Colab Notebook</a>)</p>
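<p>The templating itself is plain string work; a sketch with a hypothetical helper name (the logit bias piece, which maps tokenizer token IDs to weights so the model can only answer with the listed emotions, is omitted here):</p>

```python
def build_sentiment_prompt(emotions):
    """Template the emotion list into the system prompt at runtime."""
    bullet_list = "\n".join(f"- {e}" for e in emotions)
    return (
        "You are an emotionally intelligent assistant. Classify the sentiment "
        "of the user's text with ONLY ONE OF THE FOLLOWING EMOTIONS:\n"
        f"{bullet_list}\n\n"
        'After classifying a text, respond with "<|DONE|>".'
    )

# Any set of labels works; intensity variants give the "nuanced" detector.
EMOTIONS = ["happy", "sad", "angry", "tired",
            "very happy", "very sad", "very angry", "very tired"]
prompt = build_sentiment_prompt(EMOTIONS)
```
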
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/sentiment_hu_9070d5b71b63e74b.webp 320w,/2023/03/new-chatgpt-overlord/sentiment_hu_e2e09b9d010bb836.webp 768w,/2023/03/new-chatgpt-overlord/sentiment_hu_b5c5a660815d8c73.webp 1024w,/2023/03/new-chatgpt-overlord/sentiment.png 1068w" src="sentiment.png"/> 
</figure>

<p>Lastly, a use case that&rsquo;s personal. The entire reason I got into AI text generation <a href="https://minimaxir.com/2017/04/char-embeddings/">years ago</a> was because I wanted to generate <a href="https://magic.wizards.com/en">Magic: The Gathering</a> cards.</p>
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/bro-212-harbin-vanguard-aviator_hu_51ecbf3fbf74e59.webp 320w,/2023/03/new-chatgpt-overlord/bro-212-harbin-vanguard-aviator.jpg 672w" src="bro-212-harbin-vanguard-aviator.jpg"
         alt="A normal Magic: The Gathering card. (via Hasbro)"/> <figcaption>
            <p>A normal Magic: The Gathering card. (via Hasbro)</p>
        </figcaption>
</figure>

<p>In fact, I&rsquo;ve been working on a new, very powerful <a href="https://huggingface.co/minimaxir/magic-the-gathering-flan-t5-xl">card generation model</a> over the past month and spent a considerable amount of time and money training and testing it. When the ChatGPT API was announced, I figured &ldquo;let&rsquo;s see if it can do AI Magic cards better than my new bespoke model.&rdquo; In this case, the trick is that the card is structured data. Therefore, we should encode the card information as minified <a href="https://www.json.org/json-en.html">JSON</a>, and see if the model can output JSON back without requiring much postprocessing. We can encode a single card in the required format and tell ChatGPT to follow that, including its nuances (one-shot), and to not output <em>any other text</em> because ChatGPT tends to be proud of itself and likes to explain its creation, which is costly and slow.</p>
<p>The final <code>system</code> prompt:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an assistant who works as a Magic: The Gathering card designer. Create cards that are in the following card schema and JSON format. OUTPUT MUST FOLLOW THIS CARD SCHEMA AND JSON FORMAT. DO NOT EXPLAIN THE CARD. The output must also follow the Magic &#34;color pie&#34;.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">{&#34;name&#34;:&#34;Harbin, Vanguard Aviator&#34;,&#34;manaCost&#34;:&#34;{W}{U}&#34;,&#34;type&#34;:&#34;Legendary Creature — Human Soldier&#34;,&#34;text&#34;:&#34;Flying\nWhenever you attack with five or more Soldiers, creatures you control get +1/+1 and gain flying until end of turn.&#34;,&#34;flavorText&#34;:&#34;\&#34;Yotia is my birthright, father. Let me fight for it.\&#34;&#34;,&#34;pt&#34;:&#34;3/2&#34;,&#34;rarity&#34;:&#34;rare&#34;}
</span></span></code></pre></div><p>And with that, we have a natural language Magic: The Gathering card generator. Subsequently prompting the model with <code>Create a Magic card</code> does just that of course, but more elaborate prompts like <code>Create a Magic card based on Darth Vader</code> or <code>Create ten variations of Magic cards based on Spongebob Squarepants and ancient Roman history</code> actually work, while maintaining JSON output which can then be parsed and customized for better presentation. (<a href="https://colab.research.google.com/github/minimaxir/chatgpt_api_test/blob/main/mtg.ipynb">Colab Notebook</a>)</p>
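<p>The downstream parsing step can then be a straightforward <code>json.loads</code> plus a schema sanity check; a hypothetical sketch:</p>

```python
import json

# Fields every generated card should have; flavorText and pt are optional
# since not every Magic card has flavor text or power/toughness.
REQUIRED_KEYS = {"name", "manaCost", "type", "text", "rarity"}

def parse_card(model_output):
    """Parse a model reply that should be one minified JSON card object."""
    card = json.loads(model_output)  # raises if the model added extra prose
    missing = REQUIRED_KEYS - card.keys()
    if missing:
        raise ValueError(f"card is missing fields: {sorted(missing)}")
    return card
```

<p>If the model does get chatty and wraps the JSON in an explanation despite the prompt, <code>json.loads</code> fails loudly, which is exactly the signal you want before presenting the card.</p>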
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/spongebob_hu_6fd8b01830a4b0de.webp 320w,/2023/03/new-chatgpt-overlord/spongebob_hu_7e74e45975a423d1.webp 768w,/2023/03/new-chatgpt-overlord/spongebob_hu_16dd14872a4543e6.webp 1024w,/2023/03/new-chatgpt-overlord/spongebob.png 1180w" src="spongebob.png"
         alt="Yes, there is actually a Sponge creature type."/> <figcaption>
            <p>Yes, there is actually a <a href="https://scryfall.com/card/c19/12/thought-sponge">Sponge creature type</a>.</p>
        </figcaption>
</figure>

<p>Given these elaborate use cases, you may ask &ldquo;how long did it actually take you to make these prompts?&rdquo; The answer? <em>One hour each</em>, for use cases that could take days or even weeks for even a skilled machine learning practitioner just to prototype.</p>
<p>And <em>that</em>, with the economic efficiency of ChatGPT, is what&rsquo;s going to break the tech landscape.</p>
<h2 id="openai-devouring-its-son">OpenAI Devouring Its Son</h2>
<figure>

    <img loading="lazy" srcset="/2023/03/new-chatgpt-overlord/IMG_0249_hu_b099e83f61a585fb.webp 320w,/2023/03/new-chatgpt-overlord/IMG_0249_hu_b3e27bc2074dd370.webp 768w,/2023/03/new-chatgpt-overlord/IMG_0249_hu_a5188a45b2946234.webp 1024w,/2023/03/new-chatgpt-overlord/IMG_0249.png 1158w" src="IMG_0249.png"
         alt="My OpenAI bill so far from using the ChatGPT API."/> <figcaption>
            <p>My OpenAI bill so far from using the ChatGPT API.</p>
        </figcaption>
</figure>

<p>It is very curious why OpenAI priced ChatGPT so cheaply, going straight to 1/10th the price of their top-of-the-line model. (It&rsquo;s actually cheaper than that: ChatGPT uses a larger and more comprehensive tokenizer than GPT-3, which means about 10% fewer tokens are necessary.)</p>
<p>The undergrad-business-major-in-college interpretation of OpenAI&rsquo;s pricing strategy is that they are treating ChatGPT and its API as a <a href="https://en.wikipedia.org/wiki/Loss_leader">loss leader</a>, in light of increasing competition in the generative text AI space such as <a href="https://www.anthropic.com">Anthropic</a> and Google&rsquo;s <a href="https://blog.google/technology/ai/bard-google-ai-search-updates/">Bard</a>. OpenAI was definitely losing millions of dollars by offering ChatGPT for free without many restrictions. That&rsquo;s the reason ChatGPT went viral in the first place, so it&rsquo;s hard to argue with the results.</p>
<p>But in the process of making the ChatGPT API so cheap, they made their $20/month subscription to <a href="https://techcrunch.com/2023/02/01/openai-launches-chatgpt-plus-starting-at-20-per-month/">ChatGPT+</a> redundant. The main perk of ChatGPT+ was faster and more consistent access to the ChatGPT web UI, but unless you are somehow generating more than 10,000,000 tokens in a month through manual use, it&rsquo;s massively cheaper just to use the API, and as a bonus you can modify the <code>system</code> prompt to get better signal-to-noise.</p>
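<p>The break-even arithmetic, using the prices quoted above:</p>

```python
# Back-of-the-envelope: how many API tokens $20/month buys at $0.002/1k
PRICE_PER_1K_TOKENS = 0.002  # USD, gpt-3.5-turbo API
PLUS_SUBSCRIPTION = 20.00    # USD per month, ChatGPT Plus

# tokens you would have to consume monthly before the flat fee wins
break_even_tokens = PLUS_SUBSCRIPTION / PRICE_PER_1K_TOKENS * 1_000
assert break_even_tokens == 10_000_000
```

<p>Ten million tokens is on the order of millions of words a month of manual chatting, which almost nobody does by hand.</p>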
<p>OpenAI&rsquo;s solution for more specialized needs was <a href="https://platform.openai.com/docs/guides/fine-tuning">finetuning</a> a smaller and much cheaper variant of GPT-3, such as the babbage model which I used to train a <a href="https://minimaxir.com/2022/08/gpt3-blog-title-optimizer/">blog post title optimizer</a>. However, the ChatGPT API is so cheap that it&rsquo;s <em>still</em> <a href="https://openai.com/pricing">cheaper</a> than a finetuned babbage ($0.0020/1k tokens for ChatGPT vs. $0.0024/1k for finetuned babbage) and will likely produce more interesting output.</p>
<p>It takes almost zero effort for developers to migrate from the GPT-3 API to the ChatGPT API: it just requires hitting a different endpoint, and you&rsquo;ll get similar results without much tweaking. It&rsquo;s not quite a drop-in replacement for companies already heavily reliant on GPT-3 and its particular idiosyncrasies, but the cost savings alone will incentivize an immediate migration.</p>
<p>There is no longer a niche for OpenAI&rsquo;s other text generation AI products, and I wonder if ChatGPT is not just an iterative product, but a <em>company pivot</em>.</p>
<h2 id="trickle-down-chatgptonomics">Trickle-Down ChatGPTonomics</h2>
<p>ChatGPT&rsquo;s API is so cheap that companies are going to use it <em>just because they can</em>. <a href="https://www.theverge.com/2023/2/27/23614959/snapchat-my-ai-chatbot-chatgpt-openai-plus-subscription">Snapchat</a>, <a href="https://www.salesforce.com/news/stories/chatgpt-app-for-slack/">Slack</a>, and <a href="https://www.wsj.com/articles/instacart-joins-chatgpt-frenzy-adding-chatbot-to-grocery-shopping-app-bc8a2d3c">Instacart</a> (yes really) are adding ChatGPT support. It wouldn&rsquo;t surprise me if every consumer-facing tech company does <em>something</em> with ChatGPT so they look cutting-edge to their investors. Some have compared the sudden mass adoption of AI to chasing a fad, like how companies were randomly embracing web3/crypto/metaverse/NFTs a year ago (and are noting that the web3 influencers&rsquo; sudden pivot to AI is a red flag as a result). But unlike those, which were a solution for a problem that didn&rsquo;t exist, generative text AI does actually work, and there is actual demand from people outside of its die-hard supporters for it to work.</p>
<p>There is also the ethical dilemma of more granular usage of ChatGPT through its API. For example, high school and college students have been <a href="https://www.nytimes.com/2023/01/12/technology/chatgpt-schools-teachers.html">using ChatGPT to cheat</a> on essay writing. Since humans currently recognize AI-generated content by identifying ChatGPT&rsquo;s signature overly-academic voice, it wouldn&rsquo;t surprise me if some kids on TikTok figure out a <code>system</code> prompt that allows generation that doesn&rsquo;t obviously sound like ChatGPT and also evades plagiarism detectors. As a side note, don&rsquo;t trust any tool that claims it can algorithmically detect AI-generated content: it&rsquo;s an extremely difficult problem, and most websites that claim to do so are just feeding a confirmation bias.</p>
<p>Lastly, there&rsquo;s the issue of <a href="https://en.wikipedia.org/wiki/Prompt_engineering">prompt engineering</a>, which I demonstrated above is absolutely necessary to get ideal results. The media has <a href="https://www.washingtonpost.com/technology/2023/02/25/prompt-engineers-techs-next-big-job/">weirdly hyped the existence</a> of prompt engineers as just some weirdos making six figures to write small blobs of text. Unfortunately, with the dynamics of the new <code>system</code> model parameter, good prompt engineering will be more important than ever. I don&rsquo;t think the &ldquo;Prompt Engineer&rdquo; job title will be a trend, though: as a machine learning engineer, I can attest that the only reasons machine learning engineers are good at prompt engineering are a) years of practice and b) a tendency to be pedantic assholes. But other professions, such as writers and lawyers, are even better at being pedantic assholes, so there&rsquo;s no need for someone with a specialized skillset to do it; I suspect it will simply be a good skill for anyone to know.</p>
<h2 id="i-for-one-welcome-our-new-chatgpt-overlord">I For One Welcome Our New ChatGPT Overlord</h2>
<p>Will the existence of a super-cheap ChatGPT API be the end of all text generation AI? Not quite, hence the &ldquo;most&rdquo; in the headline. There are the traditional issues with relying on a third-party API for your business: ChatGPT could have downtime, which <a href="https://status.openai.com">has been happening more frequently lately</a>; OpenAI could raise the cost of the API at any point; the (current) model is limited to data prior to September 2021; and the content moderation filters may be too limiting for certain use cases. In those instances, there is still value in companies training their own large language models in-house. But it is very hard to economically justify <em>not</em> using ChatGPT as a starting point for a business need and migrating to more bespoke infrastructure later as needed, and that&rsquo;s what OpenAI is counting on, especially since OpenAI will be selling a dedicated ChatGPT compute instance for the enterprise.</p>
<p>Research on large language models will continue as it always has. But I don&rsquo;t envy startups whose primary business is text generation right now. And that&rsquo;s before the inevitable GPT-4 throws another wrinkle into the AI text generation ecosystem.</p>
<p>A few years ago, I released <a href="https://github.com/minimaxir/aitextgen">aitextgen</a>, a Python package designed to allow people to train their own custom small AI on their own data for unique use cases. However, soon after, it turned out that GPT-3 with the right prompt could do much better at bespoke generation than a custom model, in addition to allowing out-of-domain inputs, and even more so with text-davinci-003. Now that the ChatGPT API makes the cost similar to hosting a small model, it&rsquo;s harder for me to stay motivated to continue maintaining the package without first finding another niche.</p>
<p>I don&rsquo;t currently have any plans to start a business using the ChatGPT API. In fact, I had made a promise not to do any ChatGPT content or tutorials because so many people have published aggressively SEO-optimized blog posts and hacks that the ChatGPT discourse is fully saturated. However, with the economics of the ChatGPT API and the ability to heavily customize its output for almost any use case, I felt it was urgent to highlight how the ChatGPT API will completely warp the AI text generation ecosystem, and I suspect most nontechies will be surprised by the upcoming surge of random AI chatbots popping up in their favorite apps.</p>
<p>Overall, I&rsquo;m simultaneously full of ideas and annoyed.</p>
<hr>
<p><em>None of this blog post was written by ChatGPT, aside from the indicated ChatGPT API demos. My writing style is too weird for an AI to synthesize.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Things About Real-World Data Science Not Discussed In MOOCs and Thought Pieces</title>
      <link>https://minimaxir.com/2018/10/data-science-protips/</link>
      <pubDate>Mon, 22 Oct 2018 09:15:00 -0700</pubDate>
      <guid>https://minimaxir.com/2018/10/data-science-protips/</guid>
      <description>MOOCs and thought pieces overfit to a certain style of data science that is not robust to the vast uncertainties of the real world.</description>
      <content:encoded><![CDATA[<p><a href="https://en.wikipedia.org/wiki/Data_science">Data science</a> has been sweeping the tech world. With a large variety of powerful free open-sourced tools and now the computing power to utilize them to their full potential, data science is more accessible than ever and has become <a href="https://www.bloomberg.com/news/articles/2018-05-18/-sexiest-job-ignites-talent-wars-as-demand-for-data-geeks-soars">America&rsquo;s hottest job</a>. One problem: there&rsquo;s no consensus on <a href="https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists">what data scientists <em>really</em> do</a> in a professional setting.</p>
<p>There has been a rise in <em>romantic</em> thought pieces lately (especially on <a href="https://medium.com">Medium</a>) about how data scientists are wizards and can solve any problem (with bonus points if it cites AI). If you follow publications like <a href="https://towardsdatascience.com">Towards Data Science</a>, you&rsquo;ll notice persistent tropes in the more code-oriented posts: Python is the king programming language for data science, use <a href="http://scikit-learn.org/stable/">scikit-learn</a>/<a href="https://xgboost.readthedocs.io/en/latest/">XGBoost</a> and logistic regression for predicting categorical variable(s), use <a href="https://pandas.pydata.org">pandas</a> for processing tabular data, use <a href="https://www.nltk.org">NLTK</a>/<a href="https://en.wikipedia.org/wiki/Word2vec">word2vec</a> for processing text data, use <a href="https://www.tensorflow.org">TensorFlow</a>/<a href="https://keras.io">Keras</a>/convolutional neural networks for processing image data, use <a href="https://en.wikipedia.org/wiki/K-means_clustering"><em>k</em>-means</a> for clustering data, split the processed dataset into training and test datasets for model training, tweak hyperparameters/model features <a href="https://xkcd.com/1838/">until results on the test dataset are good</a>, etc.</p>
<figure>

    <img loading="lazy" srcset="/2018/10/data-science-protips/thought_hu_a119caa2480267cc.webp 320w,/2018/10/data-science-protips/thought.png 397w" src="thought.png"/> 
</figure>

<p>These tropes aren&rsquo;t inappropriate or misleading, but the analysis often doesn&rsquo;t quantify the insight/value of the results. Modeling is just one small part (and often the <em>easiest</em> part) of a very complex system.</p>
<p>Data-oriented MOOCs (<a href="https://en.wikipedia.org/wiki/Massive_open_online_course">Massive Open Online Courses</a>) like Andrew Ng&rsquo;s <a href="https://www.coursera.org/learn/machine-learning">Coursera course on Machine Learning</a> and <a href="http://course.fast.ai">fast.ai&rsquo;s course on Deep Learning</a> are good academic introductions to the theory and terminology behind data science and other related fields. Although MOOCs have many practice problems for prospective data scientists to solve, they don&rsquo;t make you an expert in the field capable of handling messier real-world problems, nor do they claim to.</p>
<p>Modern data science isn&rsquo;t about burying your head in a <a href="http://jupyter.org">Jupyter Notebook</a> and staring at the screen watching training loss numbers trickle down (although it&rsquo;s definitely fun!). There&rsquo;s a lot more to it, some of which I&rsquo;ve learned firsthand working as a Data Scientist at <a href="https://www.buzzfeed.com">BuzzFeed</a> for over a year. To borrow a statistical term, MOOCs and thought pieces <em>overfit</em> to a certain style of data science that is not robust to the vast uncertainties of the real world.</p>
<h2 id="the-costbenefit-tradeoffs-of-data-science">The Cost/Benefit Tradeoffs of Data Science</h2>
<p>Data science often follows the <a href="https://en.wikipedia.org/wiki/Pareto_principle">Pareto principle</a>: 80% of the work takes 20% of the effort. Thought pieces demonstrate that you can just toss data indiscriminately into scikit-learn or a deep learning framework and get neat-looking results. The value of a data scientist, however, is knowing when and <em>if</em> to develop a model further.</p>
<p><a href="https://www.kaggle.com/competitions">Kaggle competitions</a> are a popular and often-recommended way to get exposure to real-world data science problems. Many teams of statisticians compete to create the best model for a given dataset (where &ldquo;best&rdquo; usually means minimizing the predictive loss/error of the model), with prizes for the highest-performing models. Kaggle also encourages clever modeling techniques, such as a <a href="http://scikit-learn.org/stable/modules/grid_search.html">grid search</a> of thousands of model hyperparameter combinations and ensembling disparate models into a megamodel, which results in only <em>slightly</em> better predictive performance but just might give the edge needed to win.</p>
<p>However, there are a few important differences between modeling in a Kaggle competition and modeling on a data science team. Kaggle competitions last for <em>weeks</em>, whereas a professional data scientist may need to spend that time on other things. Ensembling gigantic machine learning models makes predictions very slow and the models themselves very large, both of which may cause difficulty deploying them into production (e.g. the <a href="https://www.wired.com/2012/04/netflix-prize-costs/">Netflix Prize</a> movie recommendation models famously &ldquo;did not seem to justify the engineering effort needed to bring them into a production environment&rdquo;). And most importantly, there may not be a significant <em>practical</em> performance difference between a 1st-place Kaggle model that takes days/weeks to optimize and a simple scikit-learn/XGBoost baseline that can be built in a few hours.</p>
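<p>The grid-search idea above can be illustrated with a standard-library-only toy: score every combination of hyperparameters and keep the best. The hyperparameter names and scoring function here are invented stand-ins; a real workflow would use scikit-learn&rsquo;s <code>GridSearchCV</code> over actual estimators.</p>

```python
import itertools

# Toy grid search: enumerate every hyperparameter combination and
# keep the one with the best score. The score function below is a
# stand-in for cross-validated model accuracy so the sketch stays
# self-contained; the parameter names are purely illustrative.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 500],
}

def toy_score(params: dict) -> float:
    # Pretend shallower trees with smaller learning rates score best.
    return 1.0 / (1 + params["max_depth"] * params["learning_rate"])

keys = list(param_grid)
best = max(
    (dict(zip(keys, combo)) for combo in itertools.product(*param_grid.values())),
    key=toy_score,
)
print(best)  # the combination minimizing max_depth * learning_rate
```

Even this tiny grid has 3 × 3 × 2 = 18 combinations; real Kaggle-style grids multiply out to thousands, which is exactly why the time/benefit tradeoff matters.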
<p>Counterintuitively, it may be better to trade performance for speed/memory with a weaker-but-faster model; in business cases, speed and scalability are important implementation constraints. But even with scikit-learn, the model is still a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a>, giving the data scientist little idea of how it makes its decisions. One final option is to go back to basics altogether with a &ldquo;boring&rdquo; linear/logistic regression model, where the predictive performance may be even weaker and the model <a href="http://statisticsbyjim.com/regression/ols-linear-regression-assumptions/">must follow several statistical assumptions</a>, but the model feature coefficients and statistical significance <a href="http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-regression-analysis-results-p-values-and-coefficients">are easily interpretable</a>, letting you explain the importance of each input feature (if any) and make actionable, informed decisions for the business. Being a data scientist requires making educated judgments about these tradeoffs.</p>
<h2 id="data-scientists-still-use-business-intelligence-tools">Data Scientists Still Use Business Intelligence Tools</h2>
<p>A hobbyist data scientist without a budget may opt to build their own workflows and data pipelines using free tools. However, professional data scientists have a finite amount of free time (as do all engineers), so there&rsquo;s a massive opportunity cost in reinventing the wheel unnecessarily. Enterprise BI tools such as <a href="https://www.tableau.com">Tableau</a>, <a href="https://looker.com">Looker</a>, and <a href="https://modeanalytics.com">Mode Analytics</a> help retrieve and present data with easy-to-digest dashboards for anyone in the company. They&rsquo;re never cheap, but they&rsquo;re much cheaper to the company than having a data scientist spend valuable time developing and maintaining similar tooling over time.</p>
<p>If a stakeholder wants a data report ASAP, there&rsquo;s no problem falling back to using <a href="https://en.wikipedia.org/wiki/SQL">SQL</a> to query a data warehouse and outputting the results into an Excel spreadsheet (plus pretty data visualizations!) to quickly send in an email. Part of being a data scientist is working out which tools are most appropriate at what time.</p>
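<p>That fallback can be sketched with the standard library alone: here an in-memory SQLite database stands in for the data warehouse, and a CSV (which Excel opens directly) stands in for the spreadsheet. The table and column names are invented for the example.</p>

```python
import csv
import io
import sqlite3

# Quick-and-dirty reporting fallback: query a warehouse (SQLite here,
# as a stand-in) and dump the result to CSV for a stakeholder.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (page TEXT, views INTEGER)")
conn.executemany("INSERT INTO pageviews VALUES (?, ?)",
                 [("home", 1200), ("about", 340), ("blog", 890)])

rows = conn.execute(
    "SELECT page, views FROM pageviews ORDER BY views DESC").fetchall()

buf = io.StringIO()  # in practice, open a real report.csv file instead
writer = csv.writer(buf)
writer.writerow(["page", "views"])
writer.writerows(rows)
print(buf.getvalue())
```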
<p>Some might argue that using BI tools and SQL is not a responsibility for data scientists, but instead for Business Analysts or Data Analysts. That&rsquo;s a <a href="https://en.wikipedia.org/wiki/No_true_Scotsman">No True Scotsman</a> way of looking at it; there&rsquo;s a lot of overlap between data science and other analytical fields, and there&rsquo;s nothing wrong with that.</p>
<h2 id="data-scientists-are-software-engineers-too">Data Scientists Are Software Engineers Too</h2>
<p>Although MOOCs encourage <em>self</em>-study, data science is a collaborative process, and not just with other data scientists on a team, but with other software engineers in the company. Version control tools like <a href="https://git-scm.com">Git</a> are often used by data scientists to upload their portfolio projects publicly to <a href="https://github.com">GitHub</a>, but there are many other features important in a company-wide collaborative environment, such as branching a repository, making pull requests, and resolving merge conflicts. Beyond that are modern development QA practices, such as test environments, consistent code style, and code reviews. The full process varies strongly by company: Airbnb has a <a href="https://medium.com/airbnb-engineering/scaling-knowledge-at-airbnb-875d73eff091">good thought piece</a> about how they utilize their Knowledge Base for data science collaboration using Git.</p>
<p>One of the very hard and surprisingly underdiscussed aspects of data science is <a href="https://en.wikipedia.org/wiki/DevOps">DevOps</a>: how to actually get a statistical model into production. <a href="https://www.docker.com/resources/what-container">Docker containers</a>, for example, are a newer technology that&rsquo;s hard to learn but has many data science and DevOps benefits, mitigating Python dependency hell and ensuring a consistent environment for model deployment and execution. And once the model is in production, data scientists, data engineers, and dedicated DevOps personnel need to work together to figure out whether the model has the expected output, whether the model is performing with the expected speed/memory overhead, how often to retrain the model on fresh data (plus the scheduling/data pipelining necessary to do so), and how to efficiently route predictions out of the system to the user.</p>
<h2 id="data-science-cant-solve-everything">Data Science Can&rsquo;t Solve Everything</h2>
<p>Data science experiments (even those utilizing magical AI) are allowed to fail, and not just in the fail-to-reject-the-null-hypothesis sense. Thought pieces typically discuss successful projects, which leads to a survivorship bias. Even with massive amounts of input data, it&rsquo;s <em>likely</em> for a model to fail to converge and offer zero insight, or for an experiment to fail to produce statistically significant results (common with <a href="https://vwo.com/ab-testing/">A/B testing</a>).</p>
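<p>For A/B tests specifically, the significance check behind a basic conversion comparison can be sketched with the standard library (a two-proportion z-test; the conversion numbers are invented, and in practice you&rsquo;d reach for scipy or statsmodels):</p>

```python
import math

# Minimal two-proportion z-test, the significance check behind a
# basic A/B conversion test. Numbers below are invented for
# illustration; use scipy/statsmodels for real analyses.
def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 5.0% vs. 5.6% conversion on 10k visitors each looks like a win,
# but it is not significant at the usual 0.05 threshold.
p = two_proportion_p_value(500, 10_000, 560, 10_000)
print(f"p = {p:.3f}")
```

This is the unglamorous reality the section describes: a visible lift on the dashboard that the statistics cannot yet distinguish from noise.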
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">real world data science is an R<sup>2</sup> of 0.10 <a href="https://twitter.com/hashtag/GoogleNext18?src=hash&amp;ref_src=twsrc%5Etfw">#GoogleNext18</a> <a href="https://t.co/qNsno2dscR">pic.twitter.com/qNsno2dscR</a></p>— Max Woolf (@minimaxir) <a href="https://twitter.com/minimaxir/status/1021885939361042432?ref_src=twsrc%5Etfw">July 24, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</span></p>
<p>The difficulty of real-world data science is recognizing if a given problem <em>can</em> be solved, how much of your valuable time to spend iterating to <em>maybe</em> solve it, how to report to stakeholders if it <em>can&rsquo;t</em> be solved, and what are the next steps if that&rsquo;s the case.</p>
<p>Don&rsquo;t <a href="https://www.buzzfeednews.com/article/stephaniemlee/brian-wansink-cornell-p-hacking"><em>p</em>-hack</a>!</p>
<h2 id="data-science-and-ethics">Data Science and Ethics</h2>
<p>During the rise of the &ldquo;data science/AI is magic!&rdquo; era, massive algorithmic and statistical failures suggest that data science might not always make the world a better place. Amazon built a resume-reading model which <a href="https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G">accidentally learned to be sexist</a>. Facebook overestimated <a href="https://www.theverge.com/2018/10/17/17989712/facebook-inaccurate-video-metrics-inflation-lawsuit">performance metrics on their videos</a>, causing complete business pivots for media organizations in vain, indirectly <a href="https://www.theatlantic.com/technology/archive/2018/10/facebook-driven-video-push-may-have-cost-483-journalists-their-jobs/573403/">leading to hundreds of layoffs</a>. YouTube&rsquo;s recommended video algorithms <a href="https://medium.com/@jamesbridle/something-is-wrong-on-the-internet-c39c471271d2">drove children towards shocking and disturbing content</a>. And these companies have some of the best data talent <em>in the entire world</em>.</p>
<p>The <em>qualitative</em> output of a model or data analysis is just as important as the quantitative performance, if not more so. Allowing dangerous model output to hit production and impact <em>millions</em> of consumers is a failure of QA at all levels. In fairness, these companies usually fix these issues, but only <em>after</em> journalists <a href="https://www.nytimes.com/2018/10/19/opinion/facebook-twitter-journalism-misinformation.html">point them out</a>. The problem with blindly chasing a performance metric (like Kaggle) is that it ignores collateral, unexpected effects.</p>
<p><span><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Don’t be data-driven. Be data-informed. Metrics should never be in charge because they have no moral compass.</p>— Kim Goodwin (@kimgoodwin) <a href="https://twitter.com/kimgoodwin/status/1051849805280948224?ref_src=twsrc%5Etfw">October 15, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </span></p>
<p>Maybe recommending shocking videos is what maximizes clickthrough rate or ad revenue per the models according to a business dashboard. Unfortunately, if the data justifies it and the business stakeholders encourage it, the company may <em>accept the consequences</em> of a flawed algorithm if they don&rsquo;t outweigh the benefits. It&rsquo;s important for data scientists to be aware that they may be party to that.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I realize the irony of using a data science thought piece to argue against data science thought pieces. In fairness, some Medium thought pieces do apply data science in very <em>unique</em> ways or touch on very obscure-but-impactful aspects of frameworks, and I enjoy reading those. The field is still very broadly defined, and your experiences may differ from this post, especially if you&rsquo;re working for a more research-based institution. Unfortunately, I don’t have any new advice for <em>getting</em> a data science job, which is <a href="https://twitter.com/minimaxir/status/951117788835278848">still very difficult</a>.</p>
<p>The popular idea that being a data scientist is a 40-hours-a-week Kaggle competition is <strong>incorrect</strong>. There&rsquo;s a lot more to it that&rsquo;s not as sexy which, in my opinion, is the more interesting aspect of the data science field as a whole.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Leaving Apple Inc.</title>
      <link>https://minimaxir.com/2017/05/leaving-apple/</link>
      <pubDate>Thu, 04 May 2017 09:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/05/leaving-apple/</guid>
      <description>I have made the personal decision to leave my job at Apple to further my personal growth and technical skills.</description>
      <content:encoded><![CDATA[<p>I’ve been working in the San Francisco Bay Area for about 5 years, but I’ve never publicly said where I’ve worked. Well, I was a Software QA Engineer at <a href="https://www.apple.com">Apple Inc.</a>, on the Applications team.</p>
<p>As of last week, I handed in my resignation. While I am thankful for the opportunities I have had at Apple, it is time for me to pursue working in other areas I am passionate about and search for other companies to further my personal growth and technical skills. Resigning from a good job to look for something new might defy conventional wisdom, but the time is right for me to make this bold career move.</p>
<h2 id="my-apple-story">My Apple Story</h2>
<p>I graduated with university honors at <a href="http://www.cmu.edu">Carnegie Mellon University</a>, from the <a href="http://tepper.cmu.edu">Tepper School of Business</a> with a focus on Computing and Information Technology (i.e. data architecture and coding algorithms), and a minor in Statistics.</p>
<p>At the end of my senior year, I received an e-mail from a Software QA Manager at Apple (who followed my <a href="http://techcommntr.tumblr.com">comments</a> at the bottom of <a href="https://techcrunch.com">TechCrunch</a> articles) inviting me for an on-site interview. Following an offer, I moved to the Bay Area to start my first post-undergrad job in Cupertino.</p>
<p>While I can&rsquo;t really talk about what I worked on at Apple, I genuinely enjoyed the work, the product, and the team. I had a high impact on the final result and successfully helped qualify many major software releases. However, after a few years, I realized that my technical skill growth was stalling, so I looked for an internal transfer to another department, ideally in a data analysis/software engineering role.</p>
<p>Having received no responses internally, I realized I would have to expand my search to outside of Apple.</p>
<h2 id="my-job-hunt">My Job Hunt</h2>
<p>I have a strong technical background from my CMU classes, but not having an explicit Computer Science degree has made it difficult to prove aptitude, despite my positive annual reviews and proven experience and technical skills at Apple. So I made the decision to blog with a technical focus here at <a href="http://minimaxir.com">minimaxir.com</a>, which gave me an avenue to showcase my programmatic skills and the opportunity to self-learn practical new tools not covered in a school curriculum, such as <a href="https://www.python.org">Python</a>, <a href="http://ggplot2.org">ggplot2</a>, version control with <a href="https://git-scm.com">git</a>, and reproducible analyses via <a href="http://jupyter.org">Jupyter/IPython Notebooks</a>.</p>
<p>This approach has been successful, and many readers have liked my blog posts, which often top <a href="https://www.reddit.com/r/dataisbeautiful/comments/4bwr7o/relationship_between_rotten_tomatoes_tomatometer/">Reddit</a> and <a href="https://news.ycombinator.com/item?id=13429656">Hacker News</a> and drive hundreds of thousands of pageviews. Additionally, a couple of my posts were even cited in larger publications such as the <a href="https://www.washingtonpost.com/news/the-intersect/wp/2016/06/30/facebook-news-feed-and-the-tyranny-of-positive-content/">Washington Post</a> and <a href="https://www.buzzfeed.com/tomphillips/photos-that-prove-game-of-thrones-happened-in-real-life">BuzzFeed</a>.</p>
<p>I also published many open-source technical projects to my <a href="https://github.com/minimaxir">GitHub</a>. My <a href="https://github.com/minimaxir/big-list-of-naughty-strings">Big List of Naughty Strings</a>, a project I made in a couple hours on a weekend inspired by my QA-ing at work, is now at <strong>20,000+ Stars</strong> on GitHub. My <a href="https://github.com/minimaxir/facebook-page-post-scraper">Facebook Page Post Scraper</a>, which does what the name implies, is now at 1,000+ Stars and has been used by many other businesses and journalists.</p>
<p>Developers have long argued that job seekers should have a strong public portfolio, as demonstrated experience can account for the lack of a relevant degree. After years of building up my portfolio, it became apparent that most outside recruiters I talked with never looked at my blog/GitHub, despite a strong emphasis of both on my résumé.</p>
<p>I subsequently rededicated my blog as a pragmatic demonstration of relevant skills in the data analysis job market, focusing more on practical analysis instead of quirky insights and thoughts. In the process, I obtained proficiency in a number of modern tools, including <a href="http://minimaxir.com/2016/08/clickbait-cluster/">interactive data visualizations</a> on the web with <a href="https://plot.ly">Plotly</a>, processing <a href="http://minimaxir.com/2017/01/amazon-spark/">big data</a> with <a href="http://spark.apache.org">Apache Spark</a>, high-performance <a href="http://minimaxir.com/2017/02/predicting-arrests/">machine learning</a> with <a href="https://github.com/dmlc/xgboost">xgboost</a> and <a href="https://github.com/Microsoft/LightGBM">LightGBM</a>, and even <a href="http://minimaxir.com/2017/04/char-embeddings/">deep learning</a> with <a href="https://github.com/fchollet/keras">Keras</a> and <a href="https://github.com/tensorflow/tensorflow">TensorFlow</a>.</p>
<p>I am now actively looking for a <strong>data analyst/software engineering job within San Francisco</strong>. If you are interested or if you know of companies who are looking for qualified people, please send me an email at <strong><a href="mailto:max@minimaxir.com">max@minimaxir.com</a></strong>.</p>
<h2 id="next-steps">Next Steps</h2>
<p>So I’ll be using my time over the next couple weeks to openly look for a new job, and to network with others in relevant industries (and be able to interview without taking a day off of work). Things have been improving: my <a href="https://news.ycombinator.com/item?id=14238066">comment</a> in the Hacker News &ldquo;Who wants to be hired?&rdquo; thread generated many leads who really liked my blog/portfolio. If you’d like to meet up in San Francisco and talk about tech and data stuff, just let me know.</p>
<p>I still intend to continue blogging, not as a hobby but in a more purposeful way. I have very ambitious goals and now have more time to execute them at a deeper level. Plans include:</p>
<ul>
<li>Web applications leveraging deep learning models, deployed at scale with <a href="https://www.docker.com">Docker</a>/<a href="https://kubernetes.io">Kubernetes</a>.</li>
<li>Interactive data dashboards accompanying every analytical blog post with <a href="https://shiny.rstudio.com">Shiny</a>.</li>
<li>Code screencasts at 4k resolution on <a href="https://youtube.com/minimaxir">YouTube</a>.</li>
<li>Data analysis live-streaming with augmented functionality on <a href="https://www.twitch.tv/minimaxir">Twitch</a>.</li>
</ul>
<p>I have set up a <strong><a href="https://www.patreon.com/minimaxir">Patreon</a></strong> in order to subsidize my machine learning/deep learning/software/hardware needs for my blog posts. If you have found any of my blog posts useful, a monetary contribution to my Patreon would be appreciated and will be put to good creative use.</p>
<p>If you want to keep up with me and my projects, feel free to follow me on <strong><a href="https://www.facebook.com/max.woolf">Facebook</a></strong> and <strong><a href="https://twitter.com/minimaxir">Twitter</a></strong> too.</p>
]]></content:encoded>
    </item>
    <item>
      <title>The Importance of Sanity-Checking Datasets Before Analysis</title>
      <link>https://minimaxir.com/2016/04/trust-but-verify/</link>
      <pubDate>Wed, 06 Apr 2016 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2016/04/trust-but-verify/</guid>
      <description>The 1972 TV Special &amp;lsquo;The Lorax&amp;rsquo; is the best movie ever, earning $1.2 billion?</description>
      <content:encoded><![CDATA[<p>I&rsquo;ve done some cool things with movie data using a dataset from <a href="http://www.omdbapi.com">OMDb API</a>, which is sourced from <a href="http://www.imdb.com">IMDb</a> and <a href="http://www.rottentomatoes.com">Rotten Tomatoes</a> data. In my <a href="http://minimaxir.com/2016/01/movie-revenue-ratings/">previous article</a> on the dataset, I plotted the relationship between the domestic box office revenue of movies and their Rotten Tomatoes scores.</p>
<p>I want to take another look at domestic Box Office Revenues with aggregate statistics such as means/medians on categorical variables such as MPAA rating and release month. For this type of analysis in particular, I&rsquo;ll also need to implement code in <a href="https://www.r-project.org">R</a> for inflation adjustment.</p>
<p>However, I ran into a few unexpectedly silly issues.</p>
<h2 id="seeing-double">Seeing Double</h2>
<p>There are many similarities between data validation and the Quality Assurance process of product development, which is why this particular area appeals to me personally as a Software QA Engineer. Whenever a cool dataset is released publicly, I play around with it to look for any obvious flaws and to get a good all-around benchmark on the robustness of the data (this is a separate procedure from the traditional &ldquo;data cleaning&rdquo; phase necessary to begin quantification on some poorly-structured datasets).</p>
<p>Do the extreme values in the data make sense? Is the data encoded in a sane format? Are there any obvious gaps or logical contradictions in summary representations of the data, especially when compared to other canonical sources?</p>
<p>These concerns are also some of the reasons I&rsquo;ve switched to the <a href="http://jupyter.org">Jupyter Notebook</a> as my primary data science IDE. After each block of code which transforms data, I can print the data frame inline to immediately see the results of the code execution, and refer back to them if anything odd happens in the future.</p>
<p>Let&rsquo;s say I have a data frame of Movies using the latest data dump (3/26/16) from OMDb. This data set contains 1,160,273 movies, including both IMDb and Rotten Tomatoes data. After cleaning the data (not shown), I can use the R package <code>dplyr</code> by Hadley Wickham to sort the data frame by Box Office Revenue descending, and print the <code>head</code> (top) of the data.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">df</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="n">imdbID</span><span class="p">,</span> <span class="n">Title</span><span class="p">,</span> <span class="n">Year</span><span class="p">,</span> <span class="n">BoxOffice</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">BoxOffice</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">(</span><span class="m">25</span><span class="p">),</span> <span class="n">n</span> <span class="o">=</span> <span class="m">25</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2016/04/trust-but-verify/data-2_hu_fc7149d7b4ad38a9.webp 320w,/2016/04/trust-but-verify/data-2_hu_e103dfde4f2240f3.webp 768w,/2016/04/trust-but-verify/data-2_hu_cbb23b4322bee2d7.webp 1024w,/2016/04/trust-but-verify/data-2.png 1258w" src="data-2.png"/> 
</figure>

<p>Those movies being at the top <em>makes sense</em>. For <a href="http://www.rottentomatoes.com/m/star_wars_episode_vii_the_force_awakens/">Star Wars: The Force Awakens</a>, I can compare the value to the Box Office reported on the corresponding Rotten Tomatoes page, which in turn matches the <a href="http://www.boxofficemojo.com/movies/?id=starwars7.htm">domestic Box Office Revenue</a> on <a href="http://www.boxofficemojo.com">Box Office Mojo</a>.</p>
<p>But wait, <a href="https://en.wikipedia.org/wiki/The_Dark_Knight_%28film%29">The Dark Knight</a> appears <em>twice</em>? How?!</p>
<p>There&rsquo;s no way I would have missed something this obvious during the sanity-check for my previous article. To make sure I wasn&rsquo;t going insane, I double-checked the December 2015 data dump I used for that post, derived the top movies with the same methodology, and the duplicate movies <em>were not present</em> in that older data. Weird.</p>
<p>There are two different IDs for The Dark Knight, and for some other movies near the top (<a href="http://www.imdb.com/title/tt4817264/">Inside Out</a>, &ldquo;<a href="http://www.imdb.com/title/tt3138972/">The Gravity</a>&rdquo;). Fortunately, duplicate data like this is easy to debug. The second data entry for The Dark Knight has a greater IMDb ID (1774602), which means it was likely added to the site later. Let&rsquo;s look up the <a href="http://www.imdb.com/title/tt1774602/">corresponding IMDb page</a>:</p>
<figure>

    <img loading="lazy" srcset="/2016/04/trust-but-verify/dark-knight_hu_a2dd88a3ae15f413.webp 320w,/2016/04/trust-but-verify/dark-knight_hu_1518ed909d29f88e.webp 768w,/2016/04/trust-but-verify/dark-knight_hu_e8a475182d872549.webp 1024w,/2016/04/trust-but-verify/dark-knight.png 1128w" src="dark-knight.png"/> 
</figure>

<p>Huh. Apparently someone created a filler movie entry with the same name and release year as a blockbuster movie, in hopes that people searching for the real movie would land on it by accident (and since the entry received 50 ratings with an average score of 8.6, the tactic was successful).</p>
<p>Using the Rotten Tomatoes <a href="http://developer.rottentomatoes.com/docs/read/json/v10/Movie_Alias">IMDb Lookup API</a>, we find that &ldquo;The Dark Knight&rdquo; page on Rotten Tomatoes&hellip;<a href="http://api.rottentomatoes.com/api/public/v1.0/movie_alias.json?type=imdb&amp;id=1774602">doesn&rsquo;t exist</a>.</p>
<p>We can run a safe deduplication by removing entries that share the same title (ignoring &ldquo;The&rdquo; in the title) and release year.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_dup</span> <span class="o">&lt;-</span> <span class="n">df</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="n">Title</span><span class="p">,</span> <span class="n">Year</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">Title</span> <span class="o">=</span> <span class="nf">gsub</span><span class="p">(</span><span class="s">&#34;The &#34;</span><span class="p">,</span> <span class="s">&#34;&#34;</span><span class="p">,</span> <span class="n">Title</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="n">dup</span> <span class="o">&lt;-</span> <span class="nf">duplicated</span><span class="p">(</span><span class="n">df_dup</span><span class="p">)</span>   <span class="c1"># find entry indices which are duplicates</span>
</span></span><span class="line"><span class="cl"><span class="nf">rm</span><span class="p">(</span><span class="n">df_dup</span><span class="p">)</span>   <span class="c1"># remove temp dataframe</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">df_dedup</span> <span class="o">&lt;-</span> <span class="n">df</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="o">!</span><span class="n">dup</span><span class="p">)</span>   <span class="c1"># keep entries which are *not* dups</span>
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">df_dedup</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="n">imdbID</span><span class="p">,</span> <span class="n">Title</span><span class="p">,</span> <span class="n">Year</span><span class="p">,</span> <span class="n">BoxOffice</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">BoxOffice</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">(</span><span class="m">25</span><span class="p">),</span> <span class="n">n</span> <span class="o">=</span> <span class="m">25</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2016/04/trust-but-verify/data-1_hu_8b5a2ca66b9bcf38.webp 320w,/2016/04/trust-but-verify/data-1_hu_a71793e3bcc29bd2.webp 768w,/2016/04/trust-but-verify/data-1_hu_e83f2d858764f621.webp 1024w,/2016/04/trust-but-verify/data-1.png 1224w" src="data-1.png"/> 
</figure>

<p>There we go! The de-duped dataset has 1,114,431 movies, implying that there were 45,842 of these duplicate entries.</p>
<p>I&rsquo;m not sure <em>whose</em> fault it is that duplicate movies suddenly appeared in the data dump: OMDb&rsquo;s or Rotten Tomatoes&rsquo;. <em>But it doesn&rsquo;t matter</em>: the wrong entries still need to be addressed, and it&rsquo;s good to have a test case for the future too.</p>
<h2 id="inflation-station">Inflation Station</h2>
<p>A <a href="http://stackoverflow.com/a/26068058">Stack Overflow answer</a> from <a href="http://stackoverflow.com/users/1048757/brash-equilibrium">Ben Hanowell</a> has a good R implementation and rationale for implementing inflation adjustment using the <a href="https://research.stlouisfed.org/fred2/data/CPIAUCSL.txt">historical Consumer Price Index data</a> from the <a href="https://www.stlouisfed.org">Federal Reserve Bank of St. Louis</a>.</p>
<p>Take the index for each year (averaging the monthly values for simplicity) and create an adjustment factor to convert historical dollar amounts into present-day dollar amounts. This is much better than plugging hundreds of thousands of values into an online calculator. Here&rsquo;s the SO code made <code>dplyr</code>-friendly for this purpose, with the requisite sanity-checks.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">inflation</span> <span class="o">&lt;-</span> <span class="nf">read_csv</span><span class="p">(</span><span class="s">&#34;http://research.stlouisfed.org/fred2/data/CPIAUCSL.csv&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">group_by</span><span class="p">(</span><span class="n">Year</span> <span class="o">=</span> <span class="nf">as.integer</span><span class="p">(</span><span class="nf">substr</span><span class="p">(</span><span class="n">DATE</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">4</span><span class="p">)))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">summarize</span><span class="p">(</span><span class="n">Avg_Value</span> <span class="o">=</span> <span class="nf">mean</span><span class="p">(</span><span class="n">VALUE</span><span class="p">))</span> <span class="o">%&gt;%</span>   <span class="c1"># average across all months</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">mutate</span><span class="p">(</span><span class="n">Adjust</span> <span class="o">=</span> <span class="nf">tail</span><span class="p">(</span><span class="n">Avg_Value</span><span class="p">,</span> <span class="m">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">Avg_Value</span><span class="p">)</span>   <span class="c1"># normalize by most-recent year</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">inflation</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">inflation</span> <span class="o">%&gt;%</span> <span class="nf">tail</span><span class="p">())</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2016/04/trust-but-verify/inf.png 290w" src="inf.png"/> 
</figure>

<p>For example, to get the inflation-adjusted Box Office Revenue for a movie released in 1949 in 2016 dollars, we multiply the reported revenue by 10. That sounds about right (and matches closely enough to the output of the <a href="http://data.bls.gov/cgi-bin/cpicalc.pl?cost1=1&amp;year1=1949&amp;year2=2016">Bureau of Labor Statistics inflation calculator</a>).</p>
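<p>As a quick worked check of that factor (a Python sketch; the CPI values here are approximate round numbers, not the exact FRED series):</p>

```python
# Approximate annual-average CPI values (illustrative, not exact FRED numbers).
cpi = {1949: 23.8, 2016: 238.1}

# Adjustment factor: most-recent index divided by the historical index,
# mirroring the tail(Avg_Value, 1) / Avg_Value step in the R code above.
adjust_1949 = cpi[2016] / cpi[1949]
print(round(adjust_1949, 1))  # roughly 10

# A $10M gross in 1949 is therefore roughly $100M in 2016 dollars.
print(round(10_000_000 * adjust_1949))
```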
<p>Now map each inflation adjustment factor to each movie by merging the two datasets (on the <code>Year</code> column), then multiply the Box Office revenue by the adjustment factor to get the inflation-adjusted revenue. Plus another sanity-check for good measure.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_dedup_join</span> <span class="o">&lt;-</span> <span class="n">df_dedup</span> <span class="o">%&gt;%</span> <span class="nf">inner_join</span><span class="p">(</span><span class="n">inflation</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">AdjBoxOffice</span> <span class="o">=</span> <span class="n">BoxOffice</span> <span class="o">*</span> <span class="n">Adjust</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">df_dedup_join</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="n">Title</span><span class="p">,</span> <span class="n">Year</span><span class="p">,</span> <span class="n">AdjBoxOffice</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">AdjBoxOffice</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">(</span><span class="m">25</span><span class="p">),</span> <span class="n">n</span><span class="o">=</span><span class="m">25</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2016/04/trust-but-verify/data-3_hu_9bedc8e778de7ad8.webp 320w,/2016/04/trust-but-verify/data-3_hu_7c39435bd36c198e.webp 768w,/2016/04/trust-but-verify/data-3_hu_b47d674b228181e4.webp 1024w,/2016/04/trust-but-verify/data-3.png 1070w" src="data-3.png"/> 
</figure>

<p>Uh-oh.</p>
<p>I mean, <a href="https://en.wikipedia.org/wiki/The_Lorax_%28TV_special%29">The Lorax</a> probably earned $1.2 billion in VHS sales for Earth Day education <em>alone</em>, but the TV special was never released in theaters. There was a <a href="https://en.wikipedia.org/wiki/The_Lorax_%28film%29">CGI remake of The Lorax</a> a few years ago which was reasonably popular. Could it be that someone at Rotten Tomatoes or Box Office Mojo confused the two media?</p>
<p>That is exactly what happened. On Rotten Tomatoes, the <a href="http://www.rottentomatoes.com/m/the-lorax/">1972 Lorax</a> was encoded with similar box office revenue as the <a href="http://www.rottentomatoes.com/m/the_lorax/">2012 Lorax</a>; then the inflation factor sextupled it. For this type of data fidelity issue, it&rsquo;s considerably more obvious who&rsquo;s at fault.</p>
<p>Unfortunately, that&rsquo;s not the end of the problems with the dataset. I compared my results with <a href="http://www.vox.com/2016/4/4/11351788/batman-v-superman-terrible-reviews#undefined">Vox&rsquo;s dataset</a> on worldwide historical box office revenues. The Top 200 movies by inflation-adjusted revenue are missing notable historical movies such as <a href="http://www.rottentomatoes.com/m/jaws/">Jaws</a> and <a href="http://www.rottentomatoes.com/m/star_wars/">Star Wars: A New Hope</a>. It turns out Rotten Tomatoes does not have Box Office Revenue data for these movies at all.</p>
<p>That is a very serious problem, and I&rsquo;ll have to consider whether it completely blocks any analysis of aggregate box office data. In the end, sanity-checking third-party data is important because you never know <em>how</em> the data will surprise you until it&rsquo;s too late.</p>
<hr>
<p><em>You can view the Top 200 movies by domestic box office revenue for the 12/15 source dataset, the 3/16 dataset, the 3/16 deduped dataset, and the 3/16 deduped inflation-adjusted data <a href="https://github.com/minimaxir/movie-data-sanity-checking">in this GitHub repository</a>, along with the Jupyter notebook.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Facebook Reactions and the Problems With Quantifying Likes Differently</title>
      <link>https://minimaxir.com/2016/02/facebook-reactions/</link>
      <pubDate>Mon, 29 Feb 2016 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2016/02/facebook-reactions/</guid>
      <description>Apparently, there is little statistical relationship between things that are cute and things that make you go YAAASS.</description>
<content:encoded><![CDATA[<p>Facebook added <a href="http://newsroom.fb.com/news/2016/02/reactions-now-available-globally/">Facebook Reactions</a>, allowing users to do more than just &ldquo;Like&rdquo; posts and statuses as they have done for years. Likes were the universal symbol of approval on social media. Now, Facebook users can apply more granular responses, from positive emotions like <strong>Love</strong> to negative emotions such as <strong>Angry</strong>. This was widely believed to be Facebook&rsquo;s compromise in lieu of adding a Dislike button.</p>
<figure>

    <img loading="lazy" srcset="/2016/02/facebook-reactions/facebook_react_hu_844fda335951da9b.webp 320w,/2016/02/facebook-reactions/facebook_react_hu_86d02679c7d58cf3.webp 768w,/2016/02/facebook-reactions/facebook_react.png 828w" src="facebook_react.png"/> 
</figure>

<p>Of course, there&rsquo;s an ulterior motive. The use of reactions provides organic data on the sentiment of a status, which is helpful for numerous marketing and statistical applications. As <a href="http://www.buzzfeed.com/alexkantrowitz/facebook-reactions-launch-today">BuzzFeed notes</a>, Facebook ads may be able &ldquo;to write one product message for someone who mostly uses <strong>Sad</strong> and another who mostly uses <strong>Wow</strong> or <strong>Love.</strong>&rdquo;</p>
<p>However, this isn&rsquo;t the first time a big social network has tried implementing reactions alongside Likes/Dislikes. In 2011, YouTube added <a href="http://googlesystem.blogspot.com/2011/06/youtube-reactions.html">Reaction buttons</a> to their comments section:</p>
<figure>

    <img loading="lazy" srcset="/2016/02/facebook-reactions/youtube-reactions_hu_7a4632d04eb1ee92.webp 320w,/2016/02/facebook-reactions/youtube-reactions.png 572w" src="youtube-reactions.png"/> 
</figure>

<p>&hellip;and removed them sometime later without fanfare, replacing them with the simple Like/Dislike bar.</p>
<p>Presumably, YouTube implemented the buttons for similar reasons as Facebook. What makes things different now, if anything?</p>
<h2 id="a-quantitative-approach-to-feeling">A Quantitative Approach to Feeling</h2>
<p>Even after YouTube&rsquo;s failure, another data-driven website implemented reaction buttons: BuzzFeed (who else?). At the end of each article (in most categories), registered users can select a quirky reaction to indicate how they felt about the article.</p>
<figure>

    <img loading="lazy" srcset="/2016/02/facebook-reactions/buzzfeedreactions_hu_d32803738bee0d06.webp 320w,/2016/02/facebook-reactions/buzzfeedreactions.png 646w" src="buzzfeedreactions.png"/> 
</figure>

<p>The heart represents <strong>Love</strong> internally and is by far the most-used reaction on BuzzFeed posts. When I started scraping BuzzFeed data in 2014 <a href="http://minimaxir.com/2015/01/linkbait/">to analyze clickbait</a>, I made sure to grab the data for the other reactions as well, to see if there were any interesting trends or correlations between them. A cursory glance at the scraped reaction data revealed a problem that forced me to disregard it.</p>
<p>An important part of variable selection for analysis and modeling is avoiding <em>redundant</em> features, as they can cause issues such as <a href="https://en.wikipedia.org/wiki/Multicollinearity">multicollinearity</a> and <a href="https://en.wikipedia.org/wiki/Overfitting">overfitting</a>. For Facebook, avoiding redundant Reactions was an <a href="https://medium.com/facebook-design/reactions-not-everything-in-life-is-likable-5c403de72a3f">explicit design goal</a> of the feature, but positive emotions such as <strong>Like</strong> and <strong>Wow</strong> might be overly similar regardless (I believe it is fair to compare the behavior of BuzzFeed users with that of the average Facebook user, given that the two sites hit the same demographics). Do BuzzFeed readers use specific positive reactions differently? Do they use specific negative reactions differently?</p>
<p>I rechecked my 2014 data in light of Facebook Reactions. The scraped dataset contains reaction data from 9,883 BuzzFeed articles in the Celebrity, Animals, Books, Longform, and Business categories. From that, I made a <a href="http://vita.had.co.nz/papers/gpp.pdf">pairs plot</a> for the counts of all the <em>positive</em> reactions on the articles to illustrate all bivariate relationships:</p>
<ul>
<li>The lower half of the pairs plot is a scatterplot for the two reactions; the axes represent the number of votes for a given reaction on a BuzzFeed article (both axes are scaled logarithmically), color intensity indicates the number of articles at that X/Y combo, and the line is a linear trendline of least-squares.</li>
<li>The diagonal of the pairs plot represents the density distribution of reaction vote counts for that reaction. (also logarithmically scaled on the X axis)</li>
<li>The upper half of the pairs plot illustrates the Pearson correlation between the non-log quantities of the two reaction variables. The stars represent statistical significance of the correlation test; since the data set is large, all correlations are statistically significant (rejection of null hypothesis of no correlation) at p &lt; 0.001.</li>
</ul>
<figure>

    <img loading="lazy" srcset="/2016/02/facebook-reactions/buzzfeed-pos_hu_d6d7a26193c3ada0.webp 320w,/2016/02/facebook-reactions/buzzfeed-pos_hu_286467cf9bb9a86.webp 768w,/2016/02/facebook-reactions/buzzfeed-pos_hu_2ddf9da10f86ca62.webp 1024w,/2016/02/facebook-reactions/buzzfeed-pos.png 1600w" src="buzzfeed-pos.png"/> 
</figure>

<p>All of the bivariate correlations between positive reactions are <em>moderately or strongly positive</em>, which is problematic for analysis (with one exception: apparently, there is little statistical relationship between things that are cute and things that make you go YAAASS). So why not just use the <strong>Love</strong> reaction alone, since articles tend to get about 100 Loves while other reactions get around 10?</p>
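<p>The pairwise check behind these plots is straightforward to reproduce. The actual analysis lives in the linked Jupyter notebook, but here is a hedged sketch of the core computation in Python (pandas/SciPy), with synthetic log-normally distributed counts standing in for the scraped BuzzFeed reaction data:</p>

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
n = 1000

# Synthetic reaction counts: "love" and "cute" share a common component
# (so they correlate), while "yaaass" is generated independently.
base = rng.lognormal(mean=3.0, sigma=1.0, size=n)
df = pd.DataFrame({
    "love": base * rng.lognormal(0.0, 0.2, n),
    "cute": base * rng.lognormal(0.0, 0.2, n),
    "yaaass": rng.lognormal(mean=1.0, sigma=1.0, size=n),
})

# Pearson correlation and p-value for each reaction pair, as reported
# in the upper half of the pairs plot.
for a, b in [("love", "cute"), ("love", "yaaass")]:
    r, p = stats.pearsonr(df[a], df[b])
    print(f"{a} vs {b}: r = {r:.2f}, p = {p:.3g}")
```

With enough articles, even weak correlations come out statistically significant, which is why the effect sizes matter more than the p-values here.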
<p>Does the same hold for negative reactions? Relatedly, we would also expect a negative correlation between the number of <strong>Love</strong> reactions and negative reactions, right?</p>
<figure>

    <img loading="lazy" srcset="/2016/02/facebook-reactions/buzzfeed-neg_hu_621abb3f15b07deb.webp 320w,/2016/02/facebook-reactions/buzzfeed-neg_hu_74efa270902d7802.webp 768w,/2016/02/facebook-reactions/buzzfeed-neg_hu_70af131959efaab2.webp 1024w,/2016/02/facebook-reactions/buzzfeed-neg.png 1600w" src="buzzfeed-neg.png"/> 
</figure>

<p>All negative reactions are positively correlated, as expected, but there is a weak <em>positive</em> correlation between <strong>Love</strong> and <strong>Hate</strong>, which is definitely not right. There isn&rsquo;t an ideal &ldquo;negative&rdquo; reaction, since all have similar distributions.</p>
<p>Why does Facebook have 6 different responses to gauge positivity or negativity when one reaction for each would be both more accurate and more intuitive for the user?</p>
<h2 id="conceal-dont-feel">Conceal, Don&rsquo;t Feel</h2>
<p>There are other qualitative issues with Facebook&rsquo;s current implementation of Reactions. Apparently, Likes and Reactions are treated <em>differently internally</em>. As a result, you get separate notifications for Likes and Reactions.</p>
<figure>

    <img loading="lazy" srcset="/2016/02/facebook-reactions/facebook_react2_hu_462102ea064f125f.webp 320w,/2016/02/facebook-reactions/facebook_react2_hu_cdb8c15334891015.webp 768w,/2016/02/facebook-reactions/facebook_react2.png 874w" src="facebook_react2.png"/> 
</figure>

<p>Why? No idea. There is enough notification spam on Facebook already; I don&rsquo;t need <em>double notifications</em> in my Notification feed for every status I make.</p>
<p>What&rsquo;s important to note is that a user cannot both Like and React to a status; only one or the other. As a result, the number of Likes on statuses overall will drop, and this is a <em>major</em> problem for businesses who are dependent on measuring the number of Likes for engagement.</p>
<p>I took a look at the Facebook Graph API endpoint for <a href="https://developers.facebook.com/docs/graph-api/reference/v2.5/post">Facebook Page Posts</a> (same endpoint I use for my <a href="https://github.com/minimaxir/facebook-page-post-scraper">Facebook Page Data Scraper</a>), and I can confirm that the API can only report the number of Likes on a status; not the number of Likes + Reactions, or number of Likes + number of each Reaction.</p>
<figure>

    <img loading="lazy" srcset="/2016/02/facebook-reactions/cnn_fb_hu_7b868416e9ef2e61.webp 320w,/2016/02/facebook-reactions/cnn_fb_hu_cc31054c813a0b29.webp 768w,/2016/02/facebook-reactions/cnn_fb_hu_22634dbc600cd4ba.webp 1024w,/2016/02/facebook-reactions/cnn_fb.jpg 1800w" src="cnn_fb.jpg"/> 
</figure>

<p>There is currently no way to automate the retrieval of Reactions data from Facebook posts, which is an unfortunate oversight (especially considering how easily Twitter <a href="https://blog.twitter.com/2015/hearts-for-developers">handled the transition</a> from Favorites to Likes).</p>
<p>The example <a href="https://www.facebook.com/cnn/posts/10154506885211509">CNN story</a> I used for that screenshot is, anecdotally, one of the very few examples I&rsquo;ve noticed where the number of Likes is <em>almost equal</em> to the number of negative reactions. Given the weak correlations shown earlier, that relationship should be unusual, so this knowledge may be useful to isolate the story as an outlier (and serve ads accordingly). At Facebook&rsquo;s immense scale, identifying even a relatively small proportion of unusual stories might be enough to justify adding Reactions.</p>
<p>Or maybe this feature is just the harbinger of a new generation of emotionally charged linkbait. Perhaps there is more to the Facebook Reactions data than meets the eye, and I&rsquo;ll update my scripts and do further statistical analysis when I&rsquo;m able. But given what happened with YouTube&rsquo;s Reactions, I remain unconvinced, and I still believe the functionality as a whole is a usability regression that won&rsquo;t last.</p>
<p>A Dislike button would have been better, just saying.</p>
<hr>
<p><em>You can view the code and data used to generate the BuzzFeed Reaction data visualizations <a href="https://github.com/minimaxir/facebook-reactions/blob/master/buzzfeed_reactions.ipynb">in this Jupyter notebook</a>, <a href="https://github.com/minimaxir/facebook-reactions">open-sourced on GitHub</a>, or you can <a href="https://github.com/minimaxir/facebook-reactions/raw/master/reactions_pdf.pdf">view as a PDF</a>, which is better if you are on a mobile device.</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
