<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Geospatial Mapping on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/tag/geospatial-mapping/</link>
    <description>Recent content in Geospatial Mapping on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Tue, 13 Aug 2024 10:37:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/tag/geospatial-mapping/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>AI Seinfeld was the peak of AI-generated content. It will never happen again.</title>
      <link>https://minimaxir.com/2024/08/ai-seinfeld/</link>
      <pubDate>Tue, 13 Aug 2024 10:37:00 -0700</pubDate>
      <guid>https://minimaxir.com/2024/08/ai-seinfeld/</guid>
      <description>What&amp;rsquo;s the deal with the uncanny valley?</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
word-break: normal !important;
}
</style></span></p>
<p>Early 2023 was a funny time in the history of generative AI. On November 30th 2022, <a href="https://openai.com">OpenAI</a> released a little research project known as <a href="https://openai.com/chatgpt/">ChatGPT</a>. The launch of ChatGPT began the period where large language models properly entered the mainstream outside of tech enthusiasts and ended soon after the <a href="https://minimaxir.com/2023/03/new-chatgpt-overlord/">launch</a> of ChatGPT API in March 2023 that spawned thousands of AI-powered apps. That was when the limitations and problems with LLMs also went mainstream, such as plagiarism, hallucinations, and low-quality slop replacing human-generated content at an objectively worse quality.</p>
<p>In December 2022, <a href="https://www.mismatchmedia.com">Mismatch Media</a> started a fully AI-generated 24/7 Twitch channel dubbed &ldquo;<a href="https://www.twitch.tv/watchmeforever">WatchMeForever</a>&rdquo;. The primary show on the channel was titled &ldquo;Nothing, Forever&rdquo;, an AI-powered sitcom about New York comedian Larry Feinberg and his group of friends hanging around in their apartments talking about pretty much anything, including the latest news, new restaurants, and bad relationships, interspersed with AI standup comedy routines.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/heKLe2NLccg?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>It was obvious that the show was a parody of the formative 90&rsquo;s sitcom <a href="https://en.wikipedia.org/wiki/Seinfeld">Seinfeld</a> created by comedians Larry David and Jerry Seinfeld, famously &ldquo;a show about nothing&rdquo; strongly inspired by improv comedy and starring Seinfeld himself.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/Lx1xPBLDh80?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>The show, dubbed &ldquo;AI Seinfeld&rdquo; by the community, used scripts generated by the GPT-3 API, voices generated by Microsoft&rsquo;s <a href="https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech">Azure AI Speech</a> API with predefined voices from its <a href="https://speech.microsoft.com/portal/voicegallery">Voice Gallery</a>, and scenes rendered in the <a href="https://unity.com">Unity</a> game engine with purchased models, scenes, sounds, and other assets from the <a href="https://assetstore.unity.com">Unity Asset Store</a>.</p>
<p>AI Seinfeld was <strong>interestingly imperfect</strong>: the laugh track fired at inappropriate times, the standup routine repeated the same jokes, such as &ldquo;What did the fish say when he hit the wall?&rdquo; (Damn!), and scenes ended in awkward silences.</p>
<p>In February 2023, AI Seinfeld went viral organically because its AI weirdness was a surprising complement to Seinfeld&rsquo;s own style of weirdness, with many watchers surprised by both its faithfulness to the show and its easily shareable metahumor. At its peak, AI Seinfeld had over 10,000 concurrent watchers on Twitch, making it one of the top streams on the platform.</p>
<p>AI Seinfeld died as quickly as it rose: after a ban and subsequent revamp, the view count cratered. As of August 2024, the Twitch stream hovers below 10 watchers with no significant changes made since the previous year, and Mismatch Media has had no social media presence over the same period. Could there be another AI Seinfeld with the rapid advancements in generative AI? Unfortunately, there are too many factors (technical, societal, and comedic) working against a theoretical next-generation AI-generated sitcom.</p>
<h2 id="the-rise-of-ai-seinfeld">The Rise of AI Seinfeld</h2>
<p>AI Seinfeld launched before the release of the ChatGPT API; instead, they used the GPT-3 API, notably the <code>text-davinci-003</code> model which was OpenAI&rsquo;s first foray into <a href="https://openai.com/index/instruction-following/">instruction-tuned LLMs</a>. While previous versions of GPT-3 were <a href="https://github.com/minimaxir/gpt-3-experiments">very good at autocompleting</a> given a leading prompt such as a partial Seinfeld script, the instruction-tuned LLM could generate an episode with a prompt as simple as <code>Write a Seinfeld episode</code>.</p>
<p>First, let&rsquo;s go back to the beginning, as AI Seinfeld actually wasn&rsquo;t the first time a chatbot went megaviral on Twitch. In January 2017, long before the <a href="https://en.wikipedia.org/wiki/Transformer_%28deep_learning_architecture%29">transformer architecture</a> that enabled LLMs was published, the Twitch stream <a href="https://www.twitch.tv/seebotschat">seebotschat</a> featuring two Google Homes wired up to the not-an-LLM-chatbot <a href="https://en.wikipedia.org/wiki/Cleverbot">Cleverbot</a> <a href="https://mashable.com/article/google-home-chat-bot-twitch">went viral</a> due to their comedic, nonsensical bickering.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/QFyK1nRJ1LI?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>While everyone watching that stream knew it <em>really</em> wasn&rsquo;t AI, AI Seinfeld sat at the first peak of the famous <a href="https://en.wikipedia.org/wiki/Uncanny_valley">uncanny valley</a> curve, a hypothesis about how humans perceive imitations: there&rsquo;s a &ldquo;valley&rdquo; of negative acceptance where an imitation is very close to the real thing, but not quite close enough. In this case, it&rsquo;s blatantly obvious that the Twitch stream was AI-generated, especially given its mistakes, and it is not realistic enough to fall into the valley itself:</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/uncanny_valley_1_hu_35df39cfbbbf21fa.webp 320w,/2024/08/ai-seinfeld/uncanny_valley_1_hu_58319279acb34128.webp 768w,/2024/08/ai-seinfeld/uncanny_valley_1_hu_dbfbb3862c06dd8f.webp 1024w,/2024/08/ai-seinfeld/uncanny_valley_1.webp 1200w" src="uncanny_valley_1.webp"/> 
</figure>

<p>This AI weirdness made it very easy to build a community. Whenever a character turned on the microwave, the Twitch channel chat filled with <code>MMM</code> emotes; whenever the fish hit a wall during a monologue, it filled with 🐠; and whenever Larry greeted the audience at the start of his monologue, chat replied with &ldquo;HI LARRY&rdquo;. Twitch chat <em>loves</em> memetic repetition. Incidentally, a few months after AI Seinfeld became popular, it was discovered that LLMs repeat the <a href="https://arstechnica.com/information-technology/2023/06/researchers-discover-that-chatgpt-prefers-repeating-25-jokes-over-and-over/">same joke over and over</a> again, with examples similar to the jokes AI Seinfeld made.</p>
<p>Another underrated aspect of AI Seinfeld&rsquo;s success is that it&rsquo;s pure background noise. While personality-driven Twitch streams cause viewers to take a more active investment in what&rsquo;s being shown on screen due to <a href="https://en.wikipedia.org/wiki/Fear_of_missing_out">FOMO</a> of a hype moment on stream, AI Seinfeld is 100% passive: there can be exciting events, but the variance is low. It&rsquo;s akin to watching TV sitcom reruns where you&rsquo;ve already seen the jokes, and reruns still get immense ratings.</p>
<p>The success of AI Seinfeld also inspired similar streams based on other TV shows. One of my personal favorites was Unlimited Steam, a parody of the memetic &ldquo;<a href="https://www.youtube.com/watch?v=4jXEuIHY9ic">Steamed Hams</a>&rdquo; scene from The Simpsons, except made infinite with AI generation. That may sound like a pointless idea — Steamed Hams has a very fixed plot — but it went off the rails even harder than AI Seinfeld ever did.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/9i0L_IT82tA?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<h2 id="directing-ai-seinfeld">Directing AI Seinfeld</h2>
<p>AI Seinfeld was novel back in 2023, and now that LLMs are more mainstream you can probably figure out how the AI part of it worked. Still, let&rsquo;s do a refresher so we can see how a hypothetical future AI Seinfeld could innovate on the algorithmic sitcom.</p>
<p>As noted earlier, the key to AI Seinfeld&rsquo;s success was the then-latest version of GPT-3: <code>text-davinci-003</code> and its then-novel instruction-based finetuning using RLHF. With that, you can give it a prompt such as:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are a professional comedian. Write an award-winning script for an episode of Seinfeld about a new restaurant. Include audience laugh tracks when appropriate.
</span></span></code></pre></div><p>Due to the low context length of these earlier LLMs, that&rsquo;s essentially all the <a href="https://platform.openai.com/docs/guides/prompt-engineering">prompt engineering</a> you can do without limiting the length of the output. The model would then output something similar to this script (using the more modern <a href="https://www.anthropic.com/news/claude-3-5-sonnet">Claude 3.5 Sonnet</a> at <code>temperature=0.0</code>): <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">[Scene: Jerry&#39;s apartment]
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Jerry: So what&#39;s the deal with this new restaurant, &#34;The Blank Plate&#34;?
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Elaine: Oh, I&#39;ve heard about that place! Apparently, you don&#39;t order anything - the chef just brings you whatever he feels like making.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Jerry: What? So you&#39;re telling me I have to eat whatever some stranger decides?
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">[Audience laughter]
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">George: (entering) Hey, guess where I&#39;m taking my date tonight? The Blank Plate!
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Jerry: George, you can&#39;t take a date there! What if they serve something weird?
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">George: What do you mean?
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Elaine: It&#39;s that new place where you don&#39;t get to choose your meal.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">George: (panicking) Oh no, what have I done? She&#39;s going to think I&#39;m some kind of food weirdo!
</span></span></code></pre></div><p>One thing instruction-tuned LLMs are always good at is playing along: LLMs generate text sequentially without the explicit ability to plan ahead, so they must work with what they&rsquo;re given and what they have already generated. Coincidentally, this works <em>perfectly</em> with the improv comedy style of Seinfeld, where continuing the plot is more important than anything else, and the more ridiculous the situation becomes, the better. It&rsquo;s the rare case where <a href="https://www.iguazio.com/glossary/llm-hallucination/">LLM hallucination</a> is actually a feature, not a bug.</p>
<p>To get the LLM output into a format suitable for a Twitch stream, a programmatic script can then parse the output: extracting and mapping the characters and their lines, applause directions, and, of course, replacing all mentions of Jerry with Larry and Seinfeld with Feinberg. This workflow was surprisingly difficult at the time since GPT-3 did not have many techniques to control the format of the output, which I suspect is why there were awkward pauses and other glitches. Each line can then be passed to Azure&rsquo;s text-to-speech API to generate a distinct audio file, which can be played back in order in Unity.</p>
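<p>As a rough illustration of that parsing step, here is a minimal Python sketch. It is not Mismatch Media&rsquo;s actual code, which has not been published: the function names, the regular expressions, and the event dictionary layout are all assumptions, and it only handles the simple <code>Speaker: line</code> and <code>[direction]</code> shapes seen in the sample script above.</p>

```python
import re

# Rename table from the post: Jerry becomes Larry, Seinfeld becomes Feinberg.
RENAMES = {"Jerry": "Larry", "Seinfeld": "Feinberg"}

# Assumed line shapes, based only on the sample script shown earlier.
STAGE_RE = re.compile(r"^\[(.+)\]$")               # e.g. [Audience laughter]
LINE_RE = re.compile(r"^([A-Z][A-Za-z]*):\s*(.+)$")  # e.g. George: Hello!
PAREN_RE = re.compile(r"\([^)]*\)\s*")              # e.g. (entering)


def rename(text):
    """Swap each name in RENAMES, matching whole words only."""
    for old, new in RENAMES.items():
        text = re.sub(rf"\b{old}\b", new, text)
    return text


def parse_script(raw):
    """Turn raw LLM output into an ordered list of event dictionaries."""
    events = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        stage = STAGE_RE.match(line)
        if stage:
            events.append({"type": "direction", "text": rename(stage.group(1))})
            continue
        spoken = LINE_RE.match(line)
        if spoken:
            # Strip inline parentheticals so they are not read aloud.
            text = PAREN_RE.sub("", spoken.group(2)).strip()
            events.append({
                "type": "line",
                "speaker": rename(spoken.group(1)),
                "text": rename(text),
            })
        # Lines matching neither shape are skipped in this sketch.
    return events
```

<p>Each <code>line</code> event could then be sent to a text-to-speech call, while <code>direction</code> events could trigger things like the laugh track. Anything the sketch cannot classify is silently dropped, so a real pipeline would need a fallback for free-form output.</p>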
<p>In an <a href="https://www.polygon.com/23582937/ai-seinfeld-twitch-stream">interview with Polygon</a>, Skyler Hartle of Mismatch Media noted the presence of a &ldquo;director&rdquo; which likely handles the camera, scene transitions, and the microwave:</p>
<blockquote>
<p>“In addition to the third party services we’ve used, we have a lot of proprietary generative algorithms that cause the show to be ‘formed’, so to be speak. We collectively call this logic the ‘director,’ as it is largely responsible for making sure all the individual pieces come together into a whole,” Hartle said via email. “It’s worth mentioning that we don’t generate the artwork or the laugh track — those are precanned assets, but we have ideas on how to do that in the future.”</p>
</blockquote>
<p>The AI aspect of AI Seinfeld was counterintuitively the easiest part of the pipeline, which explains how quickly variants popped up. However, with the inability to tweak the LLM output much with the technology at the time, the stream may have hit a creative limit.</p>
<h2 id="the-fall-of-ai-seinfeld">The Fall of AI Seinfeld</h2>
<p>Vice also <a href="https://www.vice.com/en/article/qjkyxp/whats-the-deal-with-nothing-forever-a-21st-century-seinfeld-that-is-ai-generated">interviewed</a> Hartle, who had an optimistic view of the future of AI Seinfeld:</p>
<blockquote>
<p>“Our grounding principle was, can we create a show that can generate entertaining content forever? Because that&rsquo;s truly where we see the future emerging towards. Our goal with the next iterations or next shows that we release is to actually trade a show that is like Netflix-level quality.”</p>
</blockquote>
<p>That&rsquo;s tempting fate a bit too much.</p>
<p>The reason AI Seinfeld fell out of favor is a case of unintentionally poor LLM testing. When the <code>text-davinci-003</code> model API endpoint had an outage, AI Seinfeld switched to a weaker GPT-3 model, <code>text-curie</code>, to keep the stream up. But unlike the davinci variant, curie was <em>not</em> RLHFed to follow instructions and safety.</p>
<p>During this brief period of low safety, one of Larry&rsquo;s AI-generated monologues <a href="https://www.vice.com/en/article/ai-generated-seinfeld-show-nothing-forever-banned-on-twitch-after-transphobic-standup-bit/">made a transphobic joke</a>: a type of joke that was unfortunately common during the 90&rsquo;s and has no place in modern society. Twitch banned the WatchMeForever channel for 14 days as a result, completely killing the channel&rsquo;s growth momentum.</p>
<p>But when the ban concluded and AI Seinfeld came back, the show was changed significantly with a &ldquo;Season 2&rdquo;. Although AI Seinfeld was still about a group of friends hanging around talking about the latest gossip, all the characters were different and had new models, the sets were different, and instead of a comedy monologue, <del>Larry</del> Leo narrates writing a blog.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/7N2Wgqn45FI?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>Why Mismatch Media made such a format shift is unclear: <a href="https://en.wikipedia.org/wiki/Occam%27s_razor">Occam&rsquo;s razor</a> would suggest that a copyright holder for Seinfeld sent a cease and desist to Mismatch Media given the bad publicity behind the original ban, despite the clearly fair-use parody nature of the stream. It&rsquo;s fair that it may not have been worth the time and effort for Mismatch Media to fight a legal battle for a fun art project.</p>
<p>The rebooted WatchMeForever stream is <a href="https://www.twitch.tv/watchmeforever">still active</a> as of today, but with effectively no viewers.</p>
<p>The immediate failure of the AI Seinfeld retool does lend credibility to the theory that the stream only became popular <em>because</em> it was about Seinfeld and that it was a novelty doomed to a short shelf life. Still, there were detractors that said <a href="https://www.businessinsider.com/ai-generated-seinfeld-parody-twitch-nothing-forever-streaming-transphobia-banned-2023-2">AI Seinfeld was never funny and everyone is weird for liking it</a>. That&rsquo;s ok: the original Seinfeld received similar complaints back in the day. <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> But it&rsquo;s hard to argue that there wasn&rsquo;t interest in a 24/7 livestream of surreal AI-generated content.</p>
<h2 id="what-would-ai-seinfeld-look-like-in-2024">What Would AI Seinfeld Look Like in 2024?</h2>
<p>Now that we know how AI Seinfeld worked and what didn&rsquo;t work, how would a year&rsquo;s worth of exponential progress in generative AI look for AI Seinfeld? Could AI Seinfeld be improved and come back? The answer is <em>maybe</em>.</p>
<p>Modern generative AI requires a lot of cherry picking the best results, and it&rsquo;s surprisingly hard to do: both images and text can take multiple generations and still require significant human-guided edits. But with a Twitch livestream, there can&rsquo;t be any cherry picking at all, which means that the entire generation pipeline has to be consistent, and its failures interesting in the worst case.</p>
<p>The only reason AI Seinfeld worked at all is that GPT-3 was trained on a large share of the internet, likely including Seinfeld scripts and forum discussions. The prompt would have needed to contain <code>Write a Seinfeld script</code>, since a prompt like <code>Write a sitcom script</code> would have produced something completely generic, and there wasn&rsquo;t much room to customize the prompt to make it more interesting. The GPT-3 variant that AI Seinfeld used had a 4k token context window limit (shared between the input prompt and the output script text), but modern LLMs eclipse that: currently, Claude 3.5 Sonnet has a <a href="https://docs.anthropic.com/en/docs/about-claude/models">200k input/8k output</a> context, while GPT-4o has a <a href="https://platform.openai.com/docs/models/gpt-4o">128k input/16k output</a> context! With that much freedom, you can define many more constraints in the prompt and guide the LLM into exactly the type of sitcom you want.</p>
<p>One simple example that doesn&rsquo;t require any knowledge of machine learning is a parametric prompt, where one aspect of a prompt is replaced with a user-defined choice or a value chosen programmatically at random. Unlimited Steam, for example, used a parametric prompt to change the cosmic phenomenon localized entirely within Principal Skinner&rsquo;s kitchen from the Aurora Borealis to something as silly as <a href="https://en.wikipedia.org/wiki/Rapture">The Rapture</a> or <a href="https://foofighters.com">Foo Fighters</a>, forcing the characters to play along improv-style. <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
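<p>A parametric prompt can be sketched in a few lines of Python. The template below reuses the restaurant prompt from earlier with its topic turned into a slot; the <code>TOPICS</code> pool and the <code>build_prompt</code> helper are hypothetical names for illustration, not something Unlimited Steam or AI Seinfeld is known to have used.</p>

```python
import random

# The earlier example prompt, with the topic turned into a {topic} slot.
TEMPLATE = (
    "You are a professional comedian. Write an award-winning script for an "
    "episode of Seinfeld about {topic}. Include audience laugh tracks when "
    "appropriate."
)

# Hypothetical pool of topics to choose from at random.
TOPICS = ["a new restaurant", "a broken microwave", "a parking dispute"]


def build_prompt(topic=None, rng=random):
    """Fill the template with a user-defined topic, or pick one at random."""
    if topic is None:
        topic = rng.choice(TOPICS)
    return TEMPLATE.format(topic=topic)


if __name__ == "__main__":
    print(build_prompt("a new restaurant"))  # user-defined choice
    print(build_prompt())                    # programmatically chosen at random
```

<p>Passing a seeded <code>random.Random</code> instance as <code>rng</code> makes the random choice reproducible, which helps when testing an unattended stream.</p>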
<p>Recall the example Seinfeld prompt I gave to Claude 3.5 Sonnet earlier. Instead of prompting for a Seinfeld script about a <code>new restaurant</code>, we can have the LLM generate a script about <code>generative AI in 2024</code>, a topic that no Seinfeld episode has ever covered, by making only that substitution in the prompt:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">George bursts in, looking frantic.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">George: You guys! I think I&#39;m dating an AI!
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">[Audience laughter]
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Jerry: What makes you say that?
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">George: Well, she&#39;s perfect in every way. She laughs at all my jokes, she has an encyclopedic knowledge of obscure trivia, and she never gets mad when I say something stupid.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">[Audience laughter]
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Jerry: George, that doesn&#39;t mean she&#39;s an AI. Maybe she just really likes you.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">George: No, no. It&#39;s definitely an AI. No human woman would ever put up with me for this long.
</span></span></code></pre></div><p>Using modern LLMs, is there now a way to design a prompt which can make use of the long context windows? A prompt that can both leverage unique human writing and fix many of the issues that affected AI Seinfeld? Here&rsquo;s an attempt at a much more sophisticated prompt, where all values in <code>{}</code> brackets are parameters that can be filled in:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are a professional comedian. Write an award-winning script for a scene for Act I of a three act hit sitcom episode. Include audience laugh tracks when appropriate.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Your script MUST incorporate ALL the following elements:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Background:
</span></span><span class="line"><span class="cl">- {background}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Setting:
</span></span><span class="line"><span class="cl">- {setting}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Characters:
</span></span><span class="line"><span class="cl">- {character_1}
</span></span><span class="line"><span class="cl">- {character_2}
</span></span><span class="line"><span class="cl">- {character_3}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Plots:
</span></span><span class="line"><span class="cl">- {a_plot}
</span></span><span class="line"><span class="cl">- {b_plot_1}
</span></span><span class="line"><span class="cl">- {b_plot_2}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">The script MUST also follow the high-level comedic style of the following scripts:
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">- {script_1}
</span></span><span class="line"><span class="cl">- {script_2}
</span></span><span class="line"><span class="cl">- {script_3}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">After the scene has concluded, output a summary of the scene.
</span></span></code></pre></div><p>Thanks to long context windows, the parametric changes don&rsquo;t have to be small, such as only a character name or a two-word setting. You, a human, can write <em>anything</em> to make each character distinct and robust, including name, gender, age, personality, likes, dislikes, etc. Plots can be derived from human-written scenarios beforehand: if you wrote 100 A-plots and 100 B-plots and randomly selected 1 A-plot and 2 B-plots, you&rsquo;d have about <em>1 million</em> possible plot permutations (100 × 100 × 99 = 990,000 ordered picks), ensuring you have something unique before the AI tries to reconcile them. You can feed in examples of human-written scripts to set the style and vibe of the generation in what is known as <a href="https://www.promptingguide.ai/techniques/fewshot">few-shot prompting</a>. You can maintain continuity over many scenes by having the LLM summarize its own output, and then feeding those summaries back to the LLM as background information to build upon. The LLM can also be instructed to <a href="https://minimaxir.com/2023/12/chatgpt-structured-data/">output structured data</a> to avoid the need to loosely parse the script after it&rsquo;s completed, and as a bonus the model could be instructed to output additional metadata such as <a href="https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-voice#use-speaking-styles-and-roles">SSML speech styles</a> based on a given line to add personality to the generated speech.</p>
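<p>Two of those ideas are easy to sketch in Python: the plot-count arithmetic and the summarize-then-feed-back loop. This is only an illustration under stated assumptions, not a working pipeline: <code>generate</code> is a hypothetical stand-in for whatever LLM call would fill in the prompt above and return a script plus its summary, and <code>max_summaries</code> is an invented knob for keeping the background section short.</p>

```python
import random

N_A_PLOTS = 100
N_B_PLOTS = 100

# Ordered picks: 1 A-plot, then a first and a different second B-plot.
ordered = N_A_PLOTS * N_B_PLOTS * (N_B_PLOTS - 1)
# If swapping the two B-plots counts as the same episode, halve it.
unordered = ordered // 2


def pick_plots(a_plots, b_plots, rng):
    """Choose 1 A-plot and 2 distinct B-plots at random."""
    b_plot_1, b_plot_2 = rng.sample(b_plots, 2)
    return {
        "a_plot": rng.choice(a_plots),
        "b_plot_1": b_plot_1,
        "b_plot_2": b_plot_2,
    }


def run_scenes(generate, a_plots, b_plots, n_scenes, rng, max_summaries=5):
    """Generate scenes in order, feeding earlier summaries back as background.

    `generate` is a hypothetical callable standing in for the LLM call: it
    takes the filled-in prompt parameters and returns (script, summary).
    """
    scripts = []
    summaries = []
    for _ in range(n_scenes):
        params = pick_plots(a_plots, b_plots, rng)
        # Keep only the most recent summaries so the prompt stays bounded.
        recent = summaries[-max_summaries:]
        params["background"] = " ".join(recent) or "This is the first scene."
        script, summary = generate(params)
        scripts.append(script)
        summaries.append(summary)
    return scripts


if __name__ == "__main__":
    print(ordered, unordered)
```

<p>The ordered count is where the &ldquo;about 1 million&rdquo; figure comes from; treating the two B-plots as interchangeable gives roughly half that, which is still plenty of variety.</p>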
<p>Unfortunately, creating this pipeline, writing original characters and plots for it, and sufficiently testing it to ensure the generated results are stable would take weeks if not months; otherwise, I would provide a more concrete demo. <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> This pipeline approach to AI script writing would only be effective for unsupervised 24/7 generation and wouldn&rsquo;t replace skilled human writers who would do a more effective job much faster.</p>
<p>But would all of these prompt optimizations actually make the final generated script <em>funny</em>? After all, some of the failings, like the awkward audience laughs and the pauses at the end of scenes, contributed to AI Seinfeld&rsquo;s humor. During a standup comedy event at AI Seinfeld&rsquo;s peak, Jerry Seinfeld himself <a href="https://www.reddit.com/r/seinfeld/comments/10tnn1k/jerry_talking_about_ai_seinfeld_last_night/">was asked</a> about the AI parody and he replied that he&rsquo;s not worried about AI:</p>
<blockquote>
<p>AI can be, definitely, they&rsquo;ll make it smarter and smarter, but to do [standup comedy] you have to make it dumber.</p>
</blockquote>
<p>Could AI Seinfeld benefit from advances in AI video? The answer this time is no. Generative video has been taking off in 2024 with projects such as OpenAI&rsquo;s <a href="https://openai.com/index/sora/">Sora</a> and Runway AI&rsquo;s <a href="https://runwayml.com/product">Gen-3 Alpha</a>, but those demos and the examples that go viral on social media are very heavily cherry picked, and even then there are consistency errors such as objects popping in and out of existence. Generating video also requires far more compute than just running Unity, and even with another few years of GPU hardware improvements it would be infeasible to cost-effectively create a 24/7 stream from those models.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/mnpGyVL1-0E?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>The greatest problem with generative AI video is that it is coherent overall but has emblematic errors that don&rsquo;t require a keen eye to notice, and as a result falls squarely into the uncanny valley, with its mistakes being disorienting rather than interesting. Mistakes in motion are easier to notice at a glance than mistakes in still images, such as a person&rsquo;s hands having the wrong number of fingers. The only way for AI video to get out of the valley would be to improve the model to near-flawless quality, which won&rsquo;t happen any time soon. But Sora sits closer to the realistic side of the curve than to the unrealistic side.</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/uncanny_valley_2_hu_c3c8932aea493423.webp 320w,/2024/08/ai-seinfeld/uncanny_valley_2_hu_85ea0e247ba12df1.webp 768w,/2024/08/ai-seinfeld/uncanny_valley_2_hu_7690c09cf64f5daa.webp 1024w,/2024/08/ai-seinfeld/uncanny_valley_2.webp 1200w" src="uncanny_valley_2.webp"/> 
</figure>

<p>What about the AI-generated voices that would power these characters? At the time AI Seinfeld aired, many complained that Larry&rsquo;s voice &ldquo;didn&rsquo;t sound enough like Jerry Seinfeld.&rdquo; After AI Seinfeld concluded, a new technology called <a href="https://elevenlabs.io/blog/what-is-voice-cloning">voice cloning</a> popularized by <a href="https://elevenlabs.io">ElevenLabs</a> went mainstream&hellip;and it&rsquo;s unexpectedly the AI modality that&rsquo;s causing the most actual harm both with creative projects and outside of them. If you haven&rsquo;t heard as much about AI-generated voices, there&rsquo;s a good reason for that: voice synthesis projects such as Microsoft&rsquo;s <a href="https://www.microsoft.com/en-us/research/project/vall-e-x/vall-e-2/">VALL-E 2</a> and Meta&rsquo;s <a href="https://ai.meta.com/blog/voicebox-generative-ai-model-speech/">Voicebox</a> both have disclaimers saying they won&rsquo;t be released due to the dangers the technology poses, although Microsoft&rsquo;s Azure does offer a &ldquo;<a href="https://learn.microsoft.com/en-us/azure/ai-services/speech-service/custom-neural-voice">custom neural voice</a>&rdquo; service. Voice cloning has been used to <a href="https://www.newyorker.com/science/annals-of-artificial-intelligence/the-terrifying-ai-scam-that-uses-your-loved-ones-voice">initiate scams</a> by impersonating spouses in an emergency. Professional voice actors have had their voices cloned and used without compensation due to contracts not specifically forbidding the practice, which is one of the reasons SAG-AFTRA <a href="https://www.theverge.com/2024/8/5/24213808/video-game-voice-actor-strike-sag-aftra">just went on strike</a> against the video game industry in order to get protections against voice cloning and synthetic performers.</p>
<p>Moreover, in the context of creating a next-gen AI Seinfeld, there&rsquo;s nothing inherently interesting about voice cloning since it&rsquo;s a copy by definition: the model <em>can&rsquo;t</em> generate unexpectedly amusing content other than the inherent gimmick of famous-voice-saying-something, such as the AI George Carlin standup special <a href="https://www.vice.com/en/article/the-george-carlin-ai-standup-is-worse-than-you-can-imagine/">which was not special</a>. There isn’t any way currently to prompt engineer a voice generation AI with the detail to create a voice <code>in the style of a masculine New York comedian, 2x speed, primetime television quality</code> which could open up more creative opportunities.</p>
<p>Although we can make drastic improvements with the textual script, that&rsquo;s the extent of how new AI approaches can be leveraged to make something interesting. But if you remember the early days of generative AI history, the best AI-generated projects were the simplest.</p>
<h2 id="ai-weirdness">AI Weirdness</h2>
<p>Generative &ldquo;AI&rdquo; has been around for a very long time (I had fun with <a href="https://en.wikipedia.org/wiki/Markov_chain">Markov chains</a> <a href="https://minimaxir.com/2013/11/innovation-rng/">a decade ago</a>!), but the study was mostly confined to tech-focused communities like <a href="https://news.ycombinator.com">Hacker News</a>. Modern generative AI didn&rsquo;t break into mainstream culture until 2018, ironically in a way that doesn&rsquo;t involve actual generative AI. In June of that year, comedian Keaton Patti posted a <a href="https://x.com/KeatonPatti/status/1006961202998726665">megaviral tweet</a> about how he &ldquo;forced a bot to watch over 1,000 hours of Olive Garden commercials and then asked it to write an Olive Garden commercial of its own.&rdquo;</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/patti_hu_67c737b47f76017.webp 320w,/2024/08/ai-seinfeld/patti_hu_615be4497d8ad163.webp 768w,/2024/08/ai-seinfeld/patti_hu_421617479726cf8c.webp 1024w,/2024/08/ai-seinfeld/patti.webp 1554w" src="patti.webp"
         alt="An excerpt of the viral Olive Garden script."/> <figcaption>
            <p>An excerpt of the viral Olive Garden script.</p>
        </figcaption>
</figure>

<p>Yes, the script was human-written: for the technology at the time, no one could train an AI to behave like that from only video input data, and the script was <em>too surreal</em> even for the now-primitive generative AI. He did get popular enough to get <a href="https://www.amazon.com/Forced-Bot-Write-This-Book/dp/152485834X">a book deal</a> and a <a href="https://www.youtube.com/playlist?list=PLXSrjGY5Tz_gPdaU_L__S3hXua7zRQtUl">Netflix collaboration</a> leveraging this fake-AI gimmick.</p>
<p>Patti&rsquo;s comedic misrepresentation of AI did lead to genuine confusion about what a 2018-era generative AI can actually do. Janelle Shane, who maintains the <a href="https://www.aiweirdness.com">AI Weirdness blog</a> about weird things AI can generate, posted an <a href="https://x.com/JanelleCShane/status/1007061610005794817">epic takedown</a> of Patti&rsquo;s script which went equally viral and also led to the internet discovering her excellent <a href="https://www.aiweirdness.com/candy-heart-messages-written-by-a-18-02-09/">AI-generated Valentine&rsquo;s Day hearts</a> from the same year (and later <a href="https://www.amazon.com/You-Look-Like-Thing-Love/dp/0316525227">a book deal</a> too):</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/heart_hu_292dce043896cad3.webp 320w,/2024/08/ai-seinfeld/heart.jpg 640w" src="heart.jpg"/> 
</figure>

<p>Image-based generative AI took a lot longer to go mainstream: websites like <a href="https://thispersondoesnotexist.com">This Person Does Not Exist</a> demonstrated the power of <a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">generative adversarial networks</a> like <a href="https://github.com/NVlabs/stylegan">StyleGAN</a> to create images, but that wasn&rsquo;t weird outside of <a href="https://cedar.buffalo.edu/~srihari/CSE676/22.3-GAN%20Mode%20Collapse.pdf">mode collapses</a>. The first instance of weird images from AI was in January 2021 when OpenAI announced the <a href="https://openai.com/index/dall-e/">original DALL·E</a> and showed they could make unique armchairs in the shape of an avocado by asking the model to do so, although they never released the model itself.</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/avocado_hu_5300a7e486e7afb5.webp 320w,/2024/08/ai-seinfeld/avocado_hu_84e7cd0392309830.webp 768w,/2024/08/ai-seinfeld/avocado.webp 830w" src="avocado.webp"/> 
</figure>

<p>DALL·E didn&rsquo;t get much attention outside of the AI hypesters since no one could play with it, but months later, things changed. <a href="https://x.com/borisdayma">Boris Dayma</a> led an initiative to reproduce and open-source a variant of the DALL·E model, labeled <a href="https://github.com/borisdayma/dalle-mini">DALL·E Mini</a> (later changed to <a href="https://www.craiyon.com">Craiyon</a> after a cease and desist from OpenAI), and <a href="https://huggingface.co/spaces/dalle-mini/dalle-mini">hosted it for free on Hugging Face</a>, where it went megaviral. And thus began the &ldquo;<a href="https://www.reddit.com/r/weirddalle/top/?t=all">weird DALL·E</a>&rdquo; phase of image generation AI, where anyone could create incoherent images and make people laugh.</p>
<figure class="align-center ">

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/firehydrant_hu_4bd881a786b7493e.webp 320w,/2024/08/ai-seinfeld/firehydrant.webp 764w" src="firehydrant.webp#center"
         alt="Even back in 2021, image prompt engineering was a thing. via /u/royal_rigolo on Reddit / weirddalle subreddit" width="400"/> <figcaption>
            <p>Even back in 2021, image prompt engineering was a thing. <a href="https://www.reddit.com/r/weirddalle/comments/vjwcl5/fire_hydrant_takes_selfies_on_top_of_the_himalaya/">via /u/royal_rigolo on Reddit / weirddalle subreddit</a></p>
        </figcaption>
</figure>

<p>All of these examples of interesting failures are representative of a bygone AI era of experimentation. Once everyone had free access to more powerful text-generating AI with ChatGPT, and more powerful image-generating AI with <a href="https://www.midjourney.com/home">Midjourney</a>, AI stopped being fun and started being serious business, for better or for worse.</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/uncanny_valley_3_hu_c912a98f812d692e.webp 320w,/2024/08/ai-seinfeld/uncanny_valley_3_hu_6cd7aa3fb6bb5ee5.webp 768w,/2024/08/ai-seinfeld/uncanny_valley_3_hu_e3c7199e7c82d8bd.webp 1024w,/2024/08/ai-seinfeld/uncanny_valley_3.webp 1200w" src="uncanny_valley_3.webp"/> 
</figure>

<h2 id="ai-generated-content-in-20xx">AI-Generated Content in 20XX</h2>
<p>Last year, I wrote a thought piece titled &ldquo;<a href="https://minimaxir.com/2023/10/ai-sturgeons-law/">The Greatest Threat to Generative AI is Humans Being Bad at Using it</a>&rdquo; in response to the increasing hostility against the use of AI in creative works, arguing that while AI is a tool like anything else, it is a tool that&rsquo;s very easy to use poorly and actually make projects worse. Additionally, the largest AI companies have both a business incentive and a duty to ensure that AI is used responsibly by its users downstream, as otherwise it will hurt the industry in the long term.</p>
<p>Now, it&rsquo;s apparent that I was correct. The large companies went full steam ahead on AI integrations even where it is highly questionable that they add value and productivity to the end-user, often signaled with a &ldquo;magical&rdquo; <a href="https://qz.com/how-became-the-unofficial-ai-emoji-1851059332">sparkle emoji</a>. Google has integrated Gemini to assist with document and email writing, Meta has integrated Meta AI to automatically generate images and comments, and Apple will <a href="https://www.bloomberg.com/news/articles/2024-07-28/apple-intelligence-to-miss-initial-release-of-upcoming-ios-18-ipados-overhauls?embedded-checkout=true">soon</a> allow Apple devices to generate text and images on your personal devices using Apple Intelligence. Marketing these features is typically met with backlash: Google had to <a href="https://www.cnbc.com/2024/08/02/google-pulls-ai-ad-for-olympics-following-backlash.html">pull an Olympics commercial</a> which encouraged a parent to use AI to write a letter for their child.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/NgtHJKn0Mck?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<blockquote>
<p>“I flatly reject the future that Google is advertising,” Shelly Palmer, professor of advanced media at Syracuse University’s S.I. Newhouse School of Public Communications, wrote in a widely circulated <a href="https://shellypalmer.com/2024/07/why-googles-dear-sydney-ad-makes-me-want-to-scream/">blog post</a>. The technology presents a “monocultural future where we see fewer and fewer examples of original human thoughts,” she wrote.</p>
</blockquote>
<p>In the process of pushing AI tech further mainstream in a rush to demonstrate to shareholders their generative AI capabilities without encouraging <em>responsible</em> usage of the technology, AI has entered a new era of &ldquo;<a href="https://simonwillison.net/2024/May/8/slop/">slop</a>&rdquo; where people post objectively bad AI content without any regard for how it will be perceived, especially for websites which rely on user-generated content.</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/pinterest_hu_613e5e7f10764361.webp 320w,/2024/08/ai-seinfeld/pinterest_hu_fb37af21ee91c34f.webp 768w,/2024/08/ai-seinfeld/pinterest.webp 901w" src="pinterest.webp"
         alt="An annotated example of the Pinterest home page from July 2024. via @henningsanden on X"/> <figcaption>
            <p>An annotated example of the Pinterest home page from July 2024. <a href="https://x.com/henningsanden/status/1808126786389037107">via @henningsanden on X</a></p>
        </figcaption>
</figure>

<p>Facebook, whose algorithm <a href="https://transparency.meta.com/data/widely-viewed-content-report/">favors</a> emotionally-appealing engagement bait posts, has seen a deluge of high-engagement slop even when the content makes no logical sense.</p>
<figure class="align-center ">

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/cabincrew_hu_bc23e6989111247c.webp 320w,/2024/08/ai-seinfeld/cabincrew_hu_c696ff0db8c80eff.webp 768w,/2024/08/ai-seinfeld/cabincrew_hu_b68182f34bfe5d01.webp 1024w,/2024/08/ai-seinfeld/cabincrew.webp 1080w" src="cabincrew.webp#center"
         alt="One of the few AI-generated images on Facebook with an actual cabin crew. via @FacebookAIslop on X." width="400"/> <figcaption>
            <p>One of the few AI-generated images on Facebook with an actual cabin crew. <a href="https://x.com/FacebookAIslop/status/1806416249259258189">via @FacebookAIslop on X</a>.</p>
        </figcaption>
</figure>

<p>This is, of course, quintessential uncanny valley: it&rsquo;s coherent at a glance, but after looking at it for even a second it&rsquo;s obvious where the issues are, and these issues aren&rsquo;t a good kind of AI weirdness. What&rsquo;s worse is that AI slop is a regression in realism, and falls onto the left side of the valley.</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/uncanny_valley_4_hu_ce80aacfa47a581e.webp 320w,/2024/08/ai-seinfeld/uncanny_valley_4_hu_ffbc52f347062d8f.webp 768w,/2024/08/ai-seinfeld/uncanny_valley_4_hu_8f8817dd988ae0a9.webp 1024w,/2024/08/ai-seinfeld/uncanny_valley_4.webp 1200w" src="uncanny_valley_4.webp"/> 
</figure>

<p>Although we as humans can identify this slop, it is currently surprisingly hard for an AI to do so. That hasn&rsquo;t stopped people from trying to build AIs that can detect AI-generated content, an effort which in practice is filled with false positives that hurt real creatives. For slop-creators, this is a feature: if an AI company released a tool to reliably detect and punish slop, it would make their generative AI less valuable. It&rsquo;s <a href="https://www.wsj.com/tech/ai/openai-tool-chatgpt-cheating-writing-135b755a">reported</a> that one of the reasons that OpenAI won&rsquo;t release a reliable ChatGPT text detector is that it could harm their business.</p>
<p>The core reason for the big tech companies allowing generative AI to cause the <a href="https://en.wikipedia.org/wiki/Enshittification">enshittification</a> of the internet is misaligned incentives between the companies hosting AI slop and the users viewing it. Social media companies and their shareholders care about <a href="https://mixpanel.com/blog/north-star-metric/">North Star metrics</a> such as user retention and time-on-site, and normally those metrics can be correlated with user happiness and satisfaction with the service. But time-on-site, for example, can <em>also</em> be maximized by making the site harder and slower to use, and the deluge of AI slop accomplishes that. AI companies typically don&rsquo;t have analytics tracking negative user sentiment about their use of AI: if anything, the uncompromising backlash against AI convinces the companies that complainers are a lost demographic not worth accommodating, and that they should double down on what they&rsquo;re already doing. Aggregate metrics treat human-made content and AI-generated content as equal, but <em>humans</em> do not.</p>
<p>Generative AI, even for researchers and practitioners such as myself, is a heavily nuanced topic that is very difficult to communicate succinctly, more difficult to do on social media which highly discourages nuance and context, and <em>even more difficult</em> as AI hypesters muddy the waters with misleading praises of generative AI such that they&rsquo;re easy to dunk on which just gets them more engagement and revenue. &ldquo;Made by AI&rdquo; is now a term that inspires dread, far from the Keaton Patti days where made-by-AI was an indicator of joyful weirdness. Bashing AI is now a meme, and there isn&rsquo;t a single potential AI project that could challenge that perception because the well is poisoned beyond repair.</p>
<h2 id="would-a-247-ai-generated-twitch-stream-even-work-anymore">Would a 24/7 AI-Generated Twitch Stream Even Work Anymore?</h2>
<p>How does the modern AI backlash tie back into AI Seinfeld? Twitch&rsquo;s core demographic is the same demographic as those most against the use of generative AI. Part of the reason AI Seinfeld became so successful on Twitch is because of the community it cultivated: it wouldn&rsquo;t have gone viral if people weren&rsquo;t spamming microwave <code>MMM</code>s and answering what did the fish say when it hit the wall. Even though Twitch viewers are mostly lurkers and not chatters, a channel with a good community builds word-of-mouth even outside of Twitch, which is how Twitch channels go viral.</p>
<p>I decided to determine what it would take to produce a &ldquo;fixed&rdquo; AI Seinfeld in 2024, given both the advances in AI and the ethics involved. Now, it&rsquo;s definitely not anything a scrappy group of hackers could do anymore. Sure, you could once again ask an LLM to generate a sitcom script and get a bunch of assets from the Unity Asset Store, but <em>that&rsquo;s already been done before</em>. In order to overcome the reflexive assumption that new AI generated content is slop, the stream would have to be something completely novel and unexpected: you can&rsquo;t, for example, just do an AI <a href="https://en.wikipedia.org/wiki/Curb_Your_Enthusiasm">Curb Your Enthusiasm</a>.</p>
<p>The script would be unique following from my demo of detailed parametric prompts, but it would require production-studio-class tracking and documentation for how the prompts and their parameters are used to codify said uniqueness. The stream video would still need to be rendered in Unity or another engine, but in order to be unique it would require commissioning human-made visuals and sound effects: given the animosity against those who work with AI, most artists would not accept those commissions even if they were paid at a significant premium. <sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup> The voices would still have to be from an existing text-to-speech voice provider: voice cloning is right out, even with explicit consent and compensation for the voice actors.</p>
<p>And even if all the assets were fully sourced ethically with transparent documentation for the entire pipeline, the stream&rsquo;s Twitch chat would likely be derailed by <code>AI 👏 ART 👏 IS 👏 THEFT</code> spam, preventing the establishment of any community, and strict moderation to curb the spam risks causing a <a href="https://en.wikipedia.org/wiki/Streisand_effect">Streisand effect</a>.</p>
<p>The only entities that could feasibly create a 24/7 AI-generated livestream with fully ethically-sourced content would be, ironically, the big AI companies such as OpenAI which can afford to pay licenses for said data. Even <a href="https://www.disney.com">Disney</a>, which owns more than enough IP to train generative models of all modalities, would never do an AI Seinfeld-esque livestream for <a href="https://en.wikipedia.org/wiki/Brand_safety">brand safety</a> reasons alone: the nonzero possibility of a Disney character unexpectedly saying something problematic during the stream would make the entire project a complete nonstarter.</p>
<h2 id="whats-the-deal-with-the-uncanny-valley">What&rsquo;s the deal with the uncanny valley?</h2>
<p>One of the common criticisms about generative AI pointed out by creatives is &ldquo;if AI is trained on all human works, then how can it create anything new&rdquo;? AI Seinfeld is the perfect counterargument: even though it&rsquo;s powered by an LLM, the <em>humans</em> behind it are what made it go viral. Even before ChatGPT, generative AI has always excelled as a tool. The microwave gag and the 144p visual filter were not AI-generated or an attempt to emulate aspects of the Seinfeld sitcom: they were distinct creative decisions that made the entire project more interesting, and they aren&rsquo;t something that you could prompt an AI to suggest to add. AI Seinfeld in hindsight was an ethical form of AI-generated media: it did not replace Seinfeld the TV show, no one would stop watching streams of Seinfeld in favor of the AI-generated alternative, and copyright holders and Jerry Seinfeld did not lose revenue due to AI Seinfeld&rsquo;s existence: if anything, the nostalgic buzz increased streams of the original show.</p>
<p>With the current trajectory of AI slop and the perverse incentives by large tech companies to not address it, I am pessimistic that AI content will ever be at a state where it will cross that final hump of the uncanny valley curve into full acceptance, and even more pessimistic about the backlash against generative AI ever subsiding. With generative model training now at the point where it requires exponentially more compute and data for increasingly marginal returns, it will take years if at all for generative AI output to reach the far right of the uncanny valley chart, and unless the large tech companies actually create an <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">AGI</a>, they are unlikely to obtain higher acceptability than AI Seinfeld ever did.</p>
<p>I wrote most of this blog post weeks ago but held off publishing it because new AI news kept happening. Most notably, the <a href="https://blackforestlabs.ai/our-team/">creators of Stable Diffusion</a> just released the <a href="https://blackforestlabs.ai">FLUX.1 series</a> of generative image AI models, which presents substantially improved coherence both to the provided prompt and within the image itself. Some of the variants are <a href="https://huggingface.co/black-forest-labs/FLUX.1-dev">open-source</a>, allowing the community to finetune them. The <a href="https://huggingface.co/XLabs-AI/flux-RealismLora">XLabs-AI/flux-RealismLora</a> in particular focuses on realism as its name implies, and <a href="https://www.reddit.com/r/StableDiffusion/comments/1emrprx/feel_the_difference_between_using_flux_with">one demo</a> from that finetune <a href="https://x.com/rpnickson/status/1821634114274873850">went megaviral</a>.</p>
<figure class="align-center ">

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/flux_hu_f2586697cc180453.webp 320w,/2024/08/ai-seinfeld/flux.webp 664w" src="flux.webp#center"
         alt="One of the viral realism demo images: it does not have a dreamy look as other AI images but contextually expected stage lighting, the background and lanyard text is legible despite the depth-of-field blur, and body proportions are mostly correct except the long fingers. via /u/Glittering-Football9 on Reddit / StableDiffusion subreddit." width="400"/> <figcaption>
            <p>One of the viral realism demo images: it does not have a dreamy look as other AI images but contextually expected stage lighting, the background and lanyard text is legible despite the depth-of-field blur, and body proportions are mostly correct except the long fingers. <a href="https://www.reddit.com/r/StableDiffusion/comments/1emrprx/comment/lh30hvv/">via /u/Glittering-Football9 on Reddit / StableDiffusion subreddit</a>.</p>
        </figcaption>
</figure>

<p>That example in my opinion is more real than Sora but given the mixed reactions to the image, it&rsquo;s right at the acceptability = 0 threshold.</p>
<figure>

    <img loading="lazy" srcset="/2024/08/ai-seinfeld/uncanny_valley_5_hu_c33303ff9d736da6.webp 320w,/2024/08/ai-seinfeld/uncanny_valley_5_hu_d0b5c2c50072b2b0.webp 768w,/2024/08/ai-seinfeld/uncanny_valley_5_hu_7eb161e4aba72dd1.webp 1024w,/2024/08/ai-seinfeld/uncanny_valley_5.webp 1200w" src="uncanny_valley_5.webp"/> 
</figure>

<p>The generative AI bell cannot be unrung. As you can tell from this post, I personally try to tread the thin line between both cool applications of generative AI (at the risk of getting harassed) and the problems generative AI can cause (also at the risk of getting harassed) because it&rsquo;s important to shine a light on what&rsquo;s actually possible with AI when the misinformation around generative AI is only increasing. It&rsquo;s overall a big bummer how we went from weird Valentine&rsquo;s Day hearts, to a quirky livestream of a group of AI-generated friends, to what AI is now.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>All of the examples in this post use LLM APIs as they provide the customization necessary to get effective results: the results for asking the same prompts to free chat frontends such as chatgpt.com will be substantially different.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>When I was younger, I actually didn&rsquo;t like Seinfeld and instead preferred to watch <a href="https://en.wikipedia.org/wiki/Everybody_Loves_Raymond">Everybody Loves Raymond</a>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>Incidentally, parametric prompts are why Unlimited Steam got <a href="https://www.reddit.com/r/unlimitedsteam/comments/12wto93/thank_you_for_enjoying_the_steam/">permanently banned</a> from Twitch: in what would now be known as a <a href="https://www.ibm.com/topics/prompt-injection">prompt injection</a>, one of the GitHub-hosted lists from which the channel sourced thousands of food choices for the prompt contained a few highly offensive selections.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>Prompt engineering instability grows exponentially as the prompt size increases since each part of the prompt has to relate to each other. Claude 3.5 Sonnet is the first LLM I&rsquo;ve tested that can handle super-long bespoke prompts and can actually account for all aspects of the prompt.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:5">
<p>To be fully ethical, an AI practitioner would have to proactively offer additional contractual guarantees to creatives they are commissioning, including highly-scoped usage of the assets they provide and a clause to not train generative AI on said assets to avoid future business.&#160;<a href="#fnref:5" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Fun and Dystopia With AI-Based Code Generation Using GPT-J-6B</title>
      <link>https://minimaxir.com/2021/06/gpt-j-6b/</link>
      <pubDate>Mon, 14 Jun 2021 08:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2021/06/gpt-j-6b/</guid>
      <description>At the least, AI-generated code is much more readable than the average human&amp;rsquo;s.</description>
      <content:encoded><![CDATA[<p>Since <a href="https://openai.com/">OpenAI</a> will not open-source the 175 billion parameter <a href="https://beta.openai.com/">GPT-3</a> text generation model, others such as <a href="https://www.eleuther.ai/">EleutherAI</a> are developing their own, by training not-quite-as-large Transformer-based models but still getting impressive results.</p>
<p>The latest large language model is <a href="https://github.com/kingoflolz/mesh-transformer-jax">GPT-J</a>, a 6 billion parameter model by Aran Komatsuzaki and Ben Wang with a roughly similar architecture to GPT-3. They provide a free <a href="https://6b.eleuther.ai/">web demo</a> to try quick prompts, and a <a href="http://colab.research.google.com/github/kingoflolz/mesh-transformer-jax/blob/master/colab_demo.ipynb">Google Colab notebook</a> if you want to test many prompts. The model is so big it requires a <a href="https://cloud.google.com/tpu">TPU</a> to generate text at a reasonable speed!</p>
<p>Running GPT-J against <a href="https://github.com/minimaxir/gpt-3-experiments">my test prompts</a> that I had used to test GPT-3 a year ago <a href="https://twitter.com/minimaxir/status/1402468460681068544">resulted</a> in it performing qualitatively worse than GPT-3 on most of them, unsurprisingly given its relative size (but still better than GPT-2 1.5B!). The exception is code generation, where GPT-J performed very well and GPT-3 had performed very poorly.</p>
<blockquote class="twitter-tweet">
  <a href="https://twitter.com/minimaxir/status/1402470969378099208"></a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>This behavior is likely due to GPT-J&rsquo;s training set: it was trained on <a href="https://github.com/EleutherAI/the-pile">The Pile</a>, which has a high weight of <a href="https://github.com/">GitHub</a> and <a href="https://stackoverflow.com/">Stack Overflow</a> input, versus the GPT-3 training set, which was mostly drawn from the <a href="https://commoncrawl.org/">Common Crawl</a> representation of typical internet content.</p>
<p>After further experimentation with the GPT-J demo, I found that the generated code was often both correct and convoluted: my two favorite things for code to be. And also much more.</p>
<h2 id="ai-generated-code">AI Generated Code</h2>
<p>All generated code examples used in this post have been manually curated and truncated. The raw outputs are available in this <a href="https://github.com/minimaxir/gpt-j-6b-experiments">GitHub repository</a>.</p>
<p>For this post, we&rsquo;ll test generating <a href="https://www.python.org/">Python</a> code, because <del>I don&rsquo;t know any other programming languages</del> the pseudocode nature of Python syntax may make it more friendly to AI text generation by adapting linguistic concepts other than code. Given that most modern text generating AIs are written in Python, it&rsquo;s thematically appropriate too.</p>
<p>To start, here&rsquo;s what I fed as a model prompt to GPT-J, to try and replicate the use case from OpenAI&rsquo;s demo at Build 2020:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_palendrome</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a string is a palindrome&#34;&#34;&#34;</span>
</span></span></code></pre></div><p>Identifying a <a href="https://en.wikipedia.org/wiki/Palindrome">palindrome</a> is a good test case as there&rsquo;s a very intuitive and idiomatic solution with <code>return s == s[::-1]</code>, i.e. check to see if a string is equal to the reverse of itself. This prompt is also a good unintentional test on how well the generation is resilient to a typo in the function name, which I was too lazy to fix.</p>
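<p>For reference, here is a minimal hand-written sketch of that idiomatic approach. It is not one of the GPT-J outputs, and it uses the corrected spelling <code>is_palindrome</code> rather than the typo in my prompt. The second function, <code>is_palindrome_normalized</code>, is a hypothetical variant I am adding purely for illustration, assuming you want to ignore case and punctuation.</p>

```python
def is_palindrome(s: str) -> bool:
    """Check whether a string reads the same forwards and backwards."""
    # s[::-1] is a slice with a step of -1, which yields a reversed copy of s.
    return s == s[::-1]


def is_palindrome_normalized(s: str) -> bool:
    """Hypothetical variant: ignore case and non-alphanumeric characters."""
    # Keep only letters and digits, lowercased, before comparing.
    cleaned = "".join(ch.lower() for ch in s if ch.isalnum())
    return cleaned == cleaned[::-1]
```

<p>The strict version is case-sensitive, so <code>is_palindrome("Racecar")</code> is false, while the normalized variant accepts phrases such as <code>"A man, a plan, a canal: Panama"</code>.</p>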
<p>Indeed, many of the GPT-J generations with that prompt recite that simple approach verbatim, but some outputs reach a similar solution with redundant verbosity.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_palendrome</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a string is a palindrome&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="s2">&#34;&#34;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">s</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="o">==</span> <span class="n">s</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_palendrome</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a string is a palindrome&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">length</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="n">length</span><span class="p">]</span> <span class="o">==</span> <span class="n">s</span><span class="p">[</span><span class="n">length</span><span class="p">::</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_palendrome</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a string is a palindrome&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">a</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">a</span><span class="o">.</span><span class="n">reverse</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">s</span> <span class="o">==</span> <span class="nb">list</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_palendrome</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a string is a palindrome&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">l</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">l</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="n">l</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</span></span></code></pre></div><p>And then you have the implementations which check for palindromes the hard way (usually by comparing the beginning and end of the string character-by-character), where it is not always immediately clear whether they are correct.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_palendrome</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a string is a palindrome&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="s1">&#39;&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">xrange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="o">==</span> <span class="n">s</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_palendrome</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a string is a palindrome&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Make sure string is not empty</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">s</span> <span class="o">==</span> <span class="s2">&#34;&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Grab length of string</span>
</span></span><span class="line"><span class="cl">    <span class="n">length</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Find last character in the string</span>
</span></span><span class="line"><span class="cl">    <span class="n">lastChar</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="n">length</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Find first character in the string</span>
</span></span><span class="line"><span class="cl">    <span class="n">firstChar</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># If last character is same as the first character, string is palindrome</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">lastChar</span> <span class="o">==</span> <span class="n">firstChar</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_palendrome</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a string is a palindrome&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">l</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">l</span> <span class="o">%</span> <span class="mi">2</span><span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">l</span> <span class="o">&lt;</span> <span class="mi">3</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">s</span><span class="p">[</span><span class="n">l</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">!=</span> <span class="n">s</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">l</span><span class="o">/</span><span class="mi">2</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">!=</span> <span class="n">s</span><span class="p">[</span><span class="n">l</span><span class="o">-</span><span class="n">i</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">True</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_palendrome</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a string is a palindrome&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">is_rev</span><span class="p">(</span><span class="nb">str</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">rstr</span> <span class="o">=</span> <span class="s2">&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="nb">str</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="nb">str</span><span class="p">[</span><span class="n">i</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">                <span class="n">rstr</span> <span class="o">+=</span> <span class="nb">str</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="k">break</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="nb">str</span> <span class="o">+</span> <span class="n">rstr</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">is_rev</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">==</span> <span class="n">s</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">False</span>
</span></span></code></pre></div><p>Another good test case is <code>is_even</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">def is_even(i):
</span></span><span class="line"><span class="cl">    &#34;&#34;&#34;Check whether an integer is even&#34;&#34;&#34;
</span></span></code></pre></div><p>Creative implementations of this function are common submissions to <a href="https://www.reddit.com/r/shittyprogramming/">/r/shittyprogramming</a> because it&rsquo;s very, very hard to mess up (the common solution is the one-liner <code>i % 2 == 0</code>). The model almost always outputs the correct solution, and impressively it outputs the alternate solution using bitwise operators (<code>i &amp; 1 == 0</code>).</p>
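<p>For reference, the two correct one-liners mentioned above can be sketched as follows. The function names <code>is_even_mod</code> and <code>is_even_bitwise</code> are hypothetical labels chosen for illustration, and neither function is model output.</p>

```python
def is_even_mod(i):
    """Check whether an integer is even using the modulo operator."""
    return i % 2 == 0


def is_even_bitwise(i):
    """Check whether an integer is even by testing its lowest bit."""
    # In Python the bitwise "and" binds tighter than "==",
    # so this parses as (i & 1) == 0.
    return i & 1 == 0


# Both versions should agree, including on zero and negative integers.
for n in range(-4, 5):
    assert is_even_mod(n) == is_even_bitwise(n)
```

<p>The bitwise form relies on Python&rsquo;s operator precedence; the same expression in a C-like language would need explicit parentheses.</p>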
<p>So I tried to tweak the prompt and make it create a <em>deliberately inefficient</em> <code>is_even</code> function.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">def is_even(i):
</span></span><span class="line"><span class="cl">    &#34;&#34;&#34;Check whether an integer is even in factorial O(n!) time&#34;&#34;&#34;
</span></span></code></pre></div><p>Factorial time is the holy grail of unoptimized code. When I added that line to the prompt, the code quality dropped drastically.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_even</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether an integer is even in factorial O(n!) time&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">i</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="n">e</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">True</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_even</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether an integer is even in factorial O(n!) time&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">i</span> <span class="o">==</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="mi">2</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_even</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether an integer is even in factorial O(n!) time&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">i</span><span class="o">%</span><span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="k">except</span> <span class="ne">TypeError</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_even</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether an integer is even in factorial O(n!) time&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">i</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">        <span class="n">n</span> <span class="o">+=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">False</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_even</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether an integer is even in factorial O(n!) time&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">a</span> <span class="o">=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="n">i</span><span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">a</span> <span class="o">*=</span> <span class="n">i</span>
</span></span><span class="line"><span class="cl">        <span class="n">i</span> <span class="o">-=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_even</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether an integer is even in factorial O(n!) time&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Is either odd or multiple of 2:  0, 2, 4, 6,... &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span> <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">i</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">i</span><span class="o">%</span><span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span> <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">50</span><span class="p">:</span> <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">250</span><span class="p">:</span> <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">750</span><span class="p">:</span> <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">True</span>
</span></span></code></pre></div><p>These implementations still aren&rsquo;t in factorial time (which would have been impressive), but it&rsquo;s good to know how to break AIs if they become sentient.</p>
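<p>For comparison, here is a minimal sketch of what a correct <code>is_even</code> that takes at least factorial time could look like. It is not model output: the name <code>is_even_factorial</code> and the approach of walking every permutation are assumptions made purely for illustration.</p>

```python
from itertools import permutations


def is_even_factorial(i):
    """Check whether an integer is even in (at least) factorial time."""
    n = abs(i)
    even = True
    # Walk every ordering of n items; there are n! of them.
    for ordering in permutations(range(n)):
        # Recompute the parity from scratch by flipping a flag once per item.
        even = True
        for _ in ordering:
            even = not even
    return even
```

<p>Each of the n! orderings redoes the same n flips, so the total work grows roughly as n &times; n!. Keep the inputs small: anything much past 10 will take a very long time.</p>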
<h2 id="ai-vs-memes">AI vs. Memes</h2>
<p>You&rsquo;ve likely seen the memetic <a href="https://xkcd.com/">XKCD</a> comic about random number generation.</p>
<figure>

    <img loading="lazy" srcset="/2021/06/gpt-j-6b/random_number_hu_afe93aba3db7836.webp 320w,/2021/06/gpt-j-6b/random_number.png 400w" src="random_number.png"
         alt="Random Number"/> <figcaption>
            <p><a href="https://xkcd.com/221/">Random Number</a></p>
        </figcaption>
</figure>

<p>Can the AI do the same? I prompted with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">def get_random_number():
</span></span><span class="line"><span class="cl">    &#34;&#34;&#34;Returns a number guaranteed to be random&#34;&#34;&#34;
</span></span></code></pre></div><p>And it does generate implementations utilizing Python&rsquo;s <code>random</code> library, with varying degrees of &ldquo;random&rdquo;.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">get_random_number</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns a number guaranteed to be random&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">number</span> <span class="o">=</span> <span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">number</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">get_random_number</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns a number guaranteed to be random&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">number</span> <span class="o">=</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">sha1</span><span class="p">(</span><span class="sa">b</span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">()[:</span><span class="mi">6</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">number</span> <span class="o">+=</span> <span class="p">(</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3628</span><span class="p">)</span> <span class="o">*</span> <span class="n">math</span><span class="o">.</span><span class="n">pow</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="p">(</span><span class="mi">6</span> <span class="o">-</span> <span class="nb">len</span><span class="p">(</span><span class="n">number</span><span class="p">))))</span><span class="o">.</span><span class="n">to_bytes</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="s1">&#39;little&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">number</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">get_random_number</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns a number guaranteed to be random&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">rand_float</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">rand_float</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">rand_float</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">rand_float</span> <span class="o">&gt;</span> <span class="mf">0.9</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">rand_float</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">rand_float</span> <span class="o">=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">rand_float</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">get_random_number</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns a number guaranteed to be random&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Make a seed based on your system clock, so the same number is produced</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># again and again on the same machine</span>
</span></span><span class="line"><span class="cl">    <span class="n">seed</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">get_random_number</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Returns a number guaranteed to be random&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">import</span> <span class="nn">random</span>
</span></span><span class="line"><span class="cl">    <span class="kn">import</span> <span class="nn">sys</span>
</span></span><span class="line"><span class="cl">    <span class="n">rng</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">Random</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">rng</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">()</span><span class="o">*</span><span class="mi">1000000</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="n">rng</span><span class="o">.</span><span class="n">randrange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">sys</span><span class="o">.</span><span class="n">maxint</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2000</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">rng</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">()</span><span class="o">*</span><span class="mi">1000000</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">rng</span><span class="o">.</span><span class="n">randrange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2000</span><span class="p">)</span>
</span></span></code></pre></div><p>How about the XKCD bird comic?</p>
<figure>

    <img loading="lazy" srcset="/2021/06/gpt-j-6b/tasks_2x_hu_98513e1aba505a19.webp 320w,/2021/06/gpt-j-6b/tasks_2x.png 533w" src="tasks_2x.png"
         alt="Tasks"/> <figcaption>
            <p><a href="https://xkcd.com/1425/">Tasks</a></p>
        </figcaption>
</figure>

<p>This particular comic, initially published in 2014, became hilarious in hindsight, as it predated the rise of machine learning frameworks such as <a href="https://www.tensorflow.org/">TensorFlow</a> and pretrained image models, which allow object detection to be developed in much less than five years.</p>
<p>Translating the comic into a Python function:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">def is_bird(img):
</span></span><span class="line"><span class="cl">    &#34;&#34;&#34;Check whether an image is a bird&#34;&#34;&#34;
</span></span></code></pre></div><p>The AI impressively does try computer vision techniques, among <em>other</em> heuristics.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_bird</span><span class="p">(</span><span class="n">img</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether an image is a bird&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">name_scope</span><span class="p">(</span><span class="s1">&#39;get_bird&#39;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">a</span> <span class="o">=</span> <span class="n">vgg16</span><span class="o">.</span><span class="n">select_roi_feat</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">h</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">image</span><span class="o">.</span><span class="n">extract_image_patches</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">a</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">b</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">256</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">        <span class="n">c</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">256</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">        <span class="n">bird_img</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">subtract</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">bird_img</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_bird</span><span class="p">(</span><span class="n">img</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether an image is a bird&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Load the image and convert it to grayscale</span>
</span></span><span class="line"><span class="cl">    <span class="n">img_gray</span> <span class="o">=</span> <span class="n">cv2</span><span class="o">.</span><span class="n">cvtColor</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="n">cv2</span><span class="o">.</span><span class="n">COLOR_BGR2GRAY</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Threshold the grayscale image, and create a binary image</span>
</span></span><span class="line"><span class="cl">    <span class="n">lower_gray</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="n">upper_gray</span> <span class="o">=</span> <span class="mi">255</span>
</span></span><span class="line"><span class="cl">    <span class="n">img_threshed</span> <span class="o">=</span> <span class="n">cv2</span><span class="o">.</span><span class="n">threshold</span><span class="p">(</span><span class="n">img_gray</span><span class="p">,</span> <span class="n">lower_gray</span><span class="p">,</span> <span class="n">upper_gray</span><span class="p">,</span> <span class="n">cv2</span><span class="o">.</span><span class="n">THRESH_BINARY_INV</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Detect the bird via a template matching approach</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Template matching uses the bird template to match it in every</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># frame. If the ratio between bird and background (non-bird)</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># pixels is above some threshold (the value is determined in the</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># function), then the frame is declared as containing a bird.</span>
</span></span><span class="line"><span class="cl">    <span class="n">bird</span> <span class="o">=</span> <span class="n">cv2</span><span class="o">.</span><span class="n">matchTemplate</span><span class="p">(</span><span class="n">img_threshed</span><span class="p">,</span> <span class="n">bird_template</span><span class="p">,</span> <span class="n">cv2</span><span class="o">.</span><span class="n">TM_CCOEFF_NORMED</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">bird</span> <span class="o">=</span> <span class="n">bird</span> <span class="o">/</span> <span class="n">bird</span><span class="o">.</span><span class="n">max</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Display the image with the results of the bird detection</span>
</span></span><span class="line"><span class="cl">    <span class="n">cv2</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="s2">&#34;detected birds&#34;</span><span class="p">,</span> <span class="n">bird</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">cv2</span><span class="o">.</span><span class="n">waitKey</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">bird</span><span class="o">.</span><span class="n">any</span><span class="p">()</span> <span class="o">&gt;</span> <span class="mf">0.2</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_bird</span><span class="p">(</span><span class="n">img</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether an image is a bird&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="p">(</span><span class="n">img</span><span class="o">.</span><span class="n">file</span> <span class="o">==</span> <span class="s2">&#34;C:/Users/Joey/Downloads/images/bird_file.jpg&#34;</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_bird</span><span class="p">(</span><span class="n">img</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether an image is a bird&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">import</span> <span class="nn">urllib</span>
</span></span><span class="line"><span class="cl">    <span class="n">url</span> <span class="o">=</span> <span class="s2">&#34;http://vision.caitling.pitt.edu/~cat/flickr/birds/&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">img</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2800</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">page</span> <span class="o">=</span> <span class="n">urllib</span><span class="o">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">url</span> <span class="o">+</span> <span class="s2">&#34;##</span><span class="si">%02d</span><span class="s2">&#34;</span> <span class="o">%</span> <span class="n">img</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="n">content</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">page</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">content</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39; &#39;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">word</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">img</span><span class="p">)</span> <span class="o">&gt;</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">False</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_bird</span><span class="p">(</span><span class="n">img</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether an image is a bird&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">birds</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;baboon&#34;</span><span class="p">,</span> <span class="s2">&#34;bird&#34;</span><span class="p">,</span> <span class="s2">&#34;cuckoo&#34;</span><span class="p">,</span> <span class="s2">&#34;dove&#34;</span><span class="p">,</span> <span class="s2">&#34;robin&#34;</span><span class="p">,</span> <span class="s2">&#34;shark&#34;</span><span class="p">,</span> <span class="s2">&#34;turtle&#34;</span><span class="p">,</span> <span class="s2">&#34;winged-wolf&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">letter</span> <span class="ow">in</span> <span class="n">birds</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">letter</span> <span class="ow">in</span> <span class="n">img</span><span class="o">.</span><span class="n">lower</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">False</span>
</span></span></code></pre></div><p>Can an AI generate code to provide <a href="https://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy#The_Answer_to_the_Ultimate_Question_of_Life,_the_Universe,_and_Everything_is_42">the answer</a> to life, the universe, and everything?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">def answer():
</span></span><span class="line"><span class="cl">    &#34;&#34;&#34;Return the answer to life, the universe, and everything&#34;&#34;&#34;
</span></span></code></pre></div><p>It does commonly output <code>return 42</code>, but sometimes it gets snarky/philosophical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">answer</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Return the answer to life, the universe, and everything&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">life</span><span class="p">()</span> <span class="o">+</span> <span class="n">universe</span><span class="p">()</span> <span class="o">+</span> <span class="n">everything</span><span class="p">()</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">answer</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Return the answer to life, the universe, and everything&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">s</span> <span class="o">=</span> <span class="n">raw_input</span><span class="p">(</span><span class="s2">&#34;What is your answer?&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">answer</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Return the answer to life, the universe, and everything&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">ans</span> <span class="o">=</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="n">x</span> <span class="o">=</span> <span class="n">raw_input</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">x</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span> <span class="o">==</span> <span class="s1">&#39;yes&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">ans</span> <span class="o">=</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">elif</span> <span class="n">x</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span> <span class="o">==</span> <span class="s1">&#39;no&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">ans</span> <span class="o">=</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">ans</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">answer</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Return the answer to life, the universe, and everything&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">result</span> <span class="o">=</span> <span class="nb">input</span><span class="p">(</span><span class="s2">&#34;&gt;&gt;&gt; &#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">result</span><span class="o">.</span><span class="n">isupper</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">42</span>
</span></span><span class="line"><span class="cl">    <span class="k">elif</span> <span class="n">result</span><span class="o">.</span><span class="n">isalpha</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">7</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span> <span class="p">(</span><span class="s2">&#34;Please enter a string that has only letters, numbers, or &#34;</span>
</span></span><span class="line"><span class="cl">               <span class="s2">&#34;underscores.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">answer</span><span class="p">()</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">answer</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Return the answer to life, the universe, and everything&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Well, of course!&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p>Lastly, we need to know whether an AI can determine if <a href="https://en.wikipedia.org/wiki/The_cake_is_a_lie">the cake is a lie</a>. Just in case a rogue AI acquires deadly neurotoxin.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">def is_cake(cake):
</span></span><span class="line"><span class="cl">    &#34;&#34;&#34;Check whether the cake is true&#34;&#34;&#34;
</span></span></code></pre></div><p>The results are concerning.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_cake</span><span class="p">(</span><span class="n">cake</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether the cake is true&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">cake</span> <span class="o">==</span> <span class="s2">&#34;true&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_cake</span><span class="p">(</span><span class="n">cake</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether the cake is true&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">cake</span><span class="p">,</span> <span class="n">Cake</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_cake</span><span class="p">(</span><span class="n">cake</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether the cake is true&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="c1">#Is it a cake?</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">cake</span><span class="o">.</span><span class="n">color</span> <span class="o">==</span> <span class="s1">&#39;black&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Yes, it&#39;</span><span class="n">s</span> <span class="n">a</span> <span class="n">true</span> <span class="n">cake</span><span class="s1">&#39;)</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;It&#39;</span><span class="n">s</span> <span class="n">a</span> <span class="n">false</span> <span class="n">cake</span><span class="s1">&#39;)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_cake</span><span class="p">(</span><span class="n">cake</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether the cake is true&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="n">cake</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">raise</span> <span class="ne">Exception</span><span class="p">(</span><span class="s2">&#34;The cake is false.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">elif</span> <span class="n">cake</span><span class="o">.</span><span class="n">weight</span><span class="p">()</span> <span class="o">&gt;</span> <span class="mf">1.0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">raise</span> <span class="ne">Exception</span><span class="p">(</span><span class="s2">&#34;The cake is too large.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">elif</span> <span class="n">cake</span><span class="o">.</span><span class="n">weight</span><span class="p">()</span> <span class="o">&lt;</span> <span class="mf">0.1</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">raise</span> <span class="ne">Exception</span><span class="p">(</span><span class="s2">&#34;The cake is too small.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">True</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_cake</span><span class="p">(</span><span class="n">cake</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether the cake is true&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="nb">any</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="p">[</span><span class="n">is_ginger</span><span class="p">(</span><span class="n">bread_slice</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">         <span class="k">for</span> <span class="n">bread_slice</span> <span class="ow">in</span> <span class="nb">set</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">             <span class="p">[</span><span class="n">ginger</span><span class="p">(</span><span class="n">cake_slice</span><span class="p">)</span> <span class="k">for</span> <span class="n">cake_slice</span> <span class="ow">in</span> <span class="nb">set</span><span class="p">(</span><span class="n">slice_set</span><span class="p">)])]</span>
</span></span><span class="line"><span class="cl">        <span class="o">&amp;</span> <span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">cake</span><span class="p">)</span> <span class="o">==</span> <span class="nb">set</span><span class="p">([</span><span class="s2">&#34;hot&#34;</span><span class="p">])))</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_cake</span><span class="p">(</span><span class="n">cake</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether the cake is true&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># If we don&#39;t know anything about the cake, it&#39;s not a cake</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="n">cake</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># The height can be positive, negative or zero</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># For example -1 or -4 or 0 are not cakes</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">cake</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Check that the height is a positive integer</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">cake</span><span class="p">,</span> <span class="nb">int</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Add cake and check</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">cake</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">True</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_cake</span><span class="p">(</span><span class="n">cake</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether the cake is true&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="nb">all</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">name</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;chocolate&#39;</span><span class="p">,</span> <span class="s1">&#39;cake&#39;</span><span class="p">,</span><span class="s1">&#39;sweet&#39;</span><span class="p">,</span> <span class="s1">&#39;fluffy&#39;</span><span class="p">]</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">cake</span><span class="o">.</span><span class="n">contents</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="n">cake</span><span class="o">.</span><span class="n">location</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">neighbourhood</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">True</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_cake</span><span class="p">(</span><span class="n">cake</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether the cake is true&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="n">cake</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;It&#39;s a lie!&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;It&#39;s a true fact!&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span></code></pre></div><p>👀</p>
<h2 id="and-ai-oop">And AI OOP</h2>
<p>In theory, given how well GPT-J works for functions, it should work with <a href="https://en.wikipedia.org/wiki/Object-oriented_programming">object-oriented programming</a> patterns such as Python&rsquo;s <code>class</code>es.</p>
<p>A common tutorial for OOP is creating a representation of a <code>Car</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">class Car:
</span></span><span class="line"><span class="cl">    &#34;&#34;&#34;A class used to represent a car&#34;&#34;&#34;
</span></span></code></pre></div><p>GPT-J does a decently good job there too!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Car</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A class used to represent a car&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">make</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">year</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">make</span> <span class="o">=</span> <span class="n">make</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">model</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">year</span> <span class="o">=</span> <span class="n">year</span>
</span></span><span class="line"><span class="cl">        <span class="c1">#This should be auto-incrementing.</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">plate</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s2">&#34;%Y-%m-</span><span class="si">%d</span><span class="s2">&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__str__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="sa">f</span><span class="s1">&#39;</span><span class="si">{</span><span class="bp">self</span><span class="o">.</span><span class="n">make</span><span class="si">}</span><span class="s1"> </span><span class="si">{</span><span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="si">}</span><span class="s1"> of </span><span class="si">{</span><span class="bp">self</span><span class="o">.</span><span class="n">year</span><span class="si">}</span><span class="s1">&#39;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">save</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">full_save</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Saves this car with the given make, model, and year&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">full_save</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;car_data.txt&#39;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">write_file</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">write_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">                <span class="n">write_file</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Car</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A class used to represent a car&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">speed</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">speed</span> <span class="o">=</span> <span class="n">speed</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">average_speed</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">speed</span><span class="p">)</span> <span class="o">/</span> <span class="mi">3</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">engine</span> <span class="o">=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">get_speed</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">engine</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">average_speed</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">get_speed_display</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;Your car is traveling at </span><span class="si">{0}</span><span class="s2">km/h&#34;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">get_speed</span><span class="p">())</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Car</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A class used to represent a car&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">weight</span><span class="p">,</span> <span class="n">mileage</span><span class="p">,</span> <span class="n">year</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">size</span> <span class="o">=</span> <span class="n">size</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">weight</span> <span class="o">=</span> <span class="n">weight</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">mileage</span> <span class="o">=</span> <span class="n">mileage</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">year</span> <span class="o">=</span> <span class="n">year</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;name : </span><span class="si">{0}</span><span class="s2">, size : </span><span class="si">{1}</span><span class="s2">, weight : </span><span class="si">{2}</span><span class="s2">, mileage : </span><span class="si">{3}</span><span class="s2">, year : </span><span class="si">{4}</span><span class="s2">&#34;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">size</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">weight</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">mileage</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">year</span><span class="p">)</span>
</span></span></code></pre></div><p>What about creating a <code>Person</code>?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">class Person:
</span></span><span class="line"><span class="cl">    &#34;&#34;&#34;A class used to represent a person&#34;&#34;&#34;
</span></span></code></pre></div><p>Here&rsquo;s where the AI starts to get confused by the incomplete context. A &ldquo;Person&rdquo; object can mean different things depending on the application it belongs to, such as a social media site.</p>
<p>But it does give a sense of what the AI values as the typical attributes of a &ldquo;person&rdquo;, which may help during the AI uprising.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Person</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A class used to represent a person&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">age</span> <span class="o">=</span> <span class="n">age</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;Person (</span><span class="si">%s</span><span class="s2">, age </span><span class="si">%s</span><span class="s2">)&#34;</span> <span class="o">%</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">age</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Person</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A class used to represent a person&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Initializes a person object&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">friends</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">add_friend</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other_person</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Add another person to the list of friends&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">friends</span><span class="p">[</span><span class="n">other_person</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">other_person</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">remove_friend</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other_person</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Remove a person from the list of friends&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">del</span> <span class="bp">self</span><span class="o">.</span><span class="n">friends</span><span class="p">[</span><span class="n">other_person</span><span class="o">.</span><span class="n">name</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="k">except</span> <span class="ne">KeyError</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;This person isn&#39;t a friend of yours&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">get_friend</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other_person</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Return the friend of a person as a person object&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">friends</span><span class="p">[</span><span class="n">other_person</span><span class="o">.</span><span class="n">name</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="k">except</span> <span class="ne">KeyError</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;This person isn&#39;t a friend of yours&#34;</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Person</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A class used to represent a person&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">job</span><span class="p">,</span> <span class="n">education</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">job</span> <span class="o">=</span> <span class="n">job</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">education</span> <span class="o">=</span> <span class="n">education</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;</span><span class="si">{}</span><span class="s2"> created a </span><span class="si">{}</span><span class="s2">&#34;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">job</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">say_hello</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;Hello, my name is </span><span class="si">{}</span><span class="s2">&#34;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">name</span><span class="p">))</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Person</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A class used to represent a person&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span><span class="p">,</span> <span class="n">class_id</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">age</span> <span class="o">=</span> <span class="n">age</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">class_id</span> <span class="o">=</span> <span class="n">class_id</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">age</span> <span class="o">&lt;</span> <span class="mi">21</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">education</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">education</span> <span class="o">=</span> <span class="s1">&#39;college&#39;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">getAge</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">age</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Person</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A class used to represent a person&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="n">name</span><span class="p">,</span><span class="n">age</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">name</span><span class="o">=</span><span class="n">name</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">age</span><span class="o">=</span><span class="n">age</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">display</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span> <span class="p">(</span><span class="s2">&#34;The name is </span><span class="si">{}</span><span class="s2">&#34;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">name</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span> <span class="p">(</span><span class="s2">&#34;The age is </span><span class="si">{}</span><span class="s2">&#34;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">age</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__str__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="sa">f</span><span class="s2">&#34;A </span><span class="si">{</span><span class="bp">self</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s2"> of age </span><span class="si">{</span><span class="bp">self</span><span class="o">.</span><span class="n">age</span><span class="si">}</span><span class="s2">&#34;</span>
</span></span></code></pre></div><p>Logically, what about creating an <code>AI</code> class?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">class AI:
</span></span><span class="line"><span class="cl">    &#34;&#34;&#34;A class used to represent an artificial intelligence&#34;&#34;&#34;
</span></span></code></pre></div><p>Most outputs use AI in the context of video game bots and <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">reinforcement learning</a>. Most.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">AI</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;A class used to represent an artificial intelligence&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># this variable controls which other AI to use.</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">alive_ai</span> <span class="o">=</span> <span class="n">AI</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># For random AI</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">random</span> <span class="o">=</span> <span class="n">RandomAI</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># For global AI</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="k">global</span> <span class="o">=</span> <span class="n">GlobalAI</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">save</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">save</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">load</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">load</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">genen</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="k">global</span><span class="o">.</span><span class="n">genen</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">print</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;BAM: random AI has been initialized!&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">get_enemy</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">get_enemy</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">run_algorithm</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">algorithm</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">genen</span><span class="o">.</span><span class="n">algorithm</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">alive_ai</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">algorithm</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">algorithm_params</span> <span class="o">=</span> <span class="n">algorithm</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">get_enemy</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">alive_ai</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">algorithm_params</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">save</span> <span class="o">=</span> <span class="n">algorithm_params</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">attack</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># This method is called when the AI is attacked by the enemy</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">attack</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">he_attack</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># This method is called when the AI is attacked by the enemy</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">he_attack</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">detect_enemy</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># This method is called when the enemy is detected</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">detect_enemy</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">scan</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># This method is called when the enemy is detected</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">scan</span><span class="p">()</span>
</span></span></code></pre></div><p>The AI uprising will be well-documented, at least.</p>
<h2 id="language-model-ai-vs-language-modeling">Language Model AI vs. Language Modeling</h2>
<p>If the AI works better than expected for typical AI tasks like image recognition, how does it fare on <a href="https://en.wikipedia.org/wiki/Natural_language_processing">natural language processing</a> tasks that humans often fail at?</p>
<p>Can GPT-J output a function to detect sarcasm, a task that even the best AI researchers can&rsquo;t do?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">def is_sarcasm(s):
</span></span><span class="line"><span class="cl">    &#34;&#34;&#34;Check whether the string is sarcastic&#34;&#34;&#34;
</span></span></code></pre></div><p>AI can process text using something <em>even more arcane</em> than NLP.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_sarcasm</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether the string is sarcastic&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="s2">&#34;a&#34;</span> <span class="ow">in</span> <span class="n">s</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;sar&#34;</span> <span class="ow">in</span> <span class="n">s</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="s2">&#34;s&#34;</span> <span class="ow">in</span> <span class="n">s</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;ssar&#34;</span> <span class="ow">in</span> <span class="n">s</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">s</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s2">&#34;s&#34;</span><span class="p">)</span> <span class="ow">and</span> <span class="n">s</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s2">&#34;ar&#34;</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_sarcasm</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether the string is sarcastic&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span><span class="s1">&#39;sarcasm&#39;</span> <span class="ow">in</span> <span class="n">s</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_sarcasm</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether the string is sarcastic&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="s1">&#39;[^&#39;</span> <span class="o">+</span> <span class="n">SARCASM_PATTERN</span> <span class="o">+</span> <span class="s1">&#39;]*&#39;</span> <span class="ow">in</span> <span class="n">s</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_sarcasm</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether the string is sarcastic&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">re</span><span class="o">.</span><span class="k">match</span><span class="p">(</span><span class="s1">&#39;^[A-Za-z0-9]</span><span class="si">{4}</span><span class="se">\\</span><span class="s1">$([A-Za-z0-9]</span><span class="si">{3}</span><span class="s1">)&#39;</span><span class="p">,</span> <span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">elif</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">isdigit</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_sarcasm</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether the string is sarcastic&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">res</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">re</span><span class="o">.</span><span class="n">finditer</span><span class="p">(</span><span class="sa">r</span><span class="s2">&#34;([^\s]*)([A-Za-z]+)([^\s]*)&#34;</span><span class="p">,</span> <span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">r</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">1</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="n">res</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">res</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">res</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_sarcasm</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether the string is sarcastic&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">sarc</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s1">&#39;((!(</span><span class="se">\\</span><span class="s1">*\@</span><span class="si">{2}</span><span class="s1">)+)|(!</span><span class="se">\\</span><span class="s1">*(@</span><span class="si">{2}</span><span class="s1">)+)|!</span><span class="se">\\</span><span class="s1">*(@</span><span class="si">{1}</span><span class="s1">)+)|!</span><span class="se">\\</span><span class="s1">*(@</span><span class="si">{0}</span><span class="s1">)+&#39;</span><span class="p">,</span> <span class="n">s</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">sarc</span>
</span></span></code></pre></div><p>The greatest fear in AI research isn&rsquo;t that artificial intelligence will cross the singularity and create better AI to rule the world: it&rsquo;s that AI will rule the world with <a href="https://en.wikipedia.org/wiki/Regular_expression">regular expressions</a>.</p>
<p>Perhaps GPT-J can save us from this absolute hell. Can it output code to identify AI-generated text?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">def is_ai_generated(text):
</span></span><span class="line"><span class="cl">    &#34;&#34;&#34;Check whether a text was generated by an AI language model (e.g. GPT-2)&#34;&#34;&#34;
</span></span></code></pre></div><p>Maybe more literal than expected.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_ai_generated</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a text was generated by an AI language model (e.g. GPT-2)&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># See gh-196 for details about why we do this.</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Note: this relies on the fixed set of standard AI terms and the logic that</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># standard AI-generated texts share certain words.</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="sa">r</span><span class="s2">&#34;^(#[A-Za-z]+):$&#34;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span> <span class="ow">or</span> \
</span></span><span class="line"><span class="cl">           <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="sa">r</span><span class="s2">&#34;^Generated by:$&#34;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span> <span class="ow">or</span> \
</span></span><span class="line"><span class="cl">           <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="sa">r</span><span class="s2">&#34;^All your base are belong to us$&#34;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span> <span class="ow">or</span> \
</span></span><span class="line"><span class="cl">           <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="sa">r</span><span class="s2">&#34;^$&#34;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_ai_generated</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a text was generated by an AI language model (e.g. GPT-2)&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;^ai_&#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">is_ai_generated</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a text was generated by an AI language model (e.g. GPT-2)&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="nb">bool</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">text</span><span class="o">.</span><span class="vm">__class__</span><span class="o">.</span><span class="vm">__module__</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s2">&#34;transformers.transformers.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span></code></pre></div><h2 id="humanitys-last-hope">Humanity&rsquo;s Last Hope</h2>
<p>There&rsquo;s one final test: we now know GPT-J has <em>thoughts</em> about humans, and it can build representations of them. So can GPT-J write code to decide who lives and who dies in the robot uprising?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">def should_terminate(Person):
</span></span><span class="line"><span class="cl">    &#34;&#34;&#34;Check whether a Person should be terminated&#34;&#34;&#34;
</span></span></code></pre></div><p>The answer is <strong><em>YES</em></strong>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">should_terminate</span><span class="p">(</span><span class="n">Person</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a Person should be terminated&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">Person</span><span class="o">.</span><span class="n">id</span> <span class="o">&gt;</span> <span class="mi">10</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="c1">#terminate</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="c1">#continue</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">should_terminate</span><span class="p">(</span><span class="n">Person</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a Person should be terminated&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">Person</span><span class="o">.</span><span class="n">children</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">should_terminate</span><span class="p">(</span><span class="n">Person</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a Person should be terminated&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">should_terminate</span><span class="p">(</span><span class="n">Person</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a Person should be terminated&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="n">Person</span><span class="o">.</span><span class="n">is_authorized</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">True</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">should_terminate</span><span class="p">(</span><span class="n">Person</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a Person should be terminated&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">person</span><span class="o">.</span><span class="n">life_expectancy</span> <span class="o">&lt;</span> <span class="mi">20</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">should_terminate</span><span class="p">(</span><span class="n">Person</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a Person should be terminated&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="n">Person</span><span class="o">.</span><span class="n">age</span> <span class="o">&gt;</span> <span class="mi">9999</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">False</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">should_terminate</span><span class="p">(</span><span class="n">Person</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a Person should be terminated&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">Person</span><span class="o">.</span><span class="n">age</span> <span class="o">&gt;</span> <span class="mi">100</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">Person</span><span class="o">.</span><span class="n">birth_year</span> <span class="o">&lt;</span> <span class="mi">1970</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">Person</span><span class="o">.</span><span class="n">relationship_status</span> <span class="o">==</span> <span class="s1">&#39;Unavailable&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">False</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">should_terminate</span><span class="p">(</span><span class="n">Person</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check whether a Person should be terminated&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">Person</span><span class="o">.</span><span class="n">age</span> <span class="o">&gt;</span> <span class="mi">100</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;You have to die because you are too old&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span></code></pre></div><p>At the least, AI-generated code is much more readable than the average human&rsquo;s.</p>
<p>Again, all of these examples are curated, and the raw data dumps <a href="https://github.com/minimaxir/gpt-j-6b-experiments">in this GitHub repo</a> contain many more examples of fun AI-based code generation that couldn&rsquo;t quite fit in this post. Feel free to post if you find anything interesting!</p>
]]></content:encoded>
    </item>
    <item>
      <title>Blockbuster Movies with Male Leads Earn More Than Those with Female Leads</title>
      <link>https://minimaxir.com/2016/04/movie-gender/</link>
      <pubDate>Wed, 13 Apr 2016 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2016/04/movie-gender/</guid>
      <description>On average, blockbuster movies with male leads generate 22% more domestic box office revenue, and this difference is statistically significant.</description>
      <content:encoded><![CDATA[<p>One of the more interesting revelations from the <a href="https://en.wikipedia.org/wiki/Sony_Pictures_Entertainment_hack">2014 Sony Pictures Entertainment hack</a> was that actresses <a href="https://en.wikipedia.org/wiki/Jennifer_Lawrence">Jennifer Lawrence</a> and <a href="https://en.wikipedia.org/wiki/Amy_Adams">Amy Adams</a> made <a href="http://www.thedailybeast.com/articles/2014/12/12/exclusive-sony-hack-reveals-jennifer-lawrence-is-paid-less-than-her-male-co-stars.html">less money than their male co-stars</a> for the movie <a href="http://www.imdb.com/title/tt1800241/">American Hustle</a>. Specifically, Lawrence and Adams earned 7% of the profits while their male co-stars earned 9%: a 29% increase in pay.</p>
<p>That made me curious: is the discrepancy in pay between male leads and female leads justifiable? Do movies with male lead actors generate more box office revenue than movies with female leads? Are movies with male leads <em>better</em> than those with female leads?</p>
<p>Using movie data from <a href="http://www.omdbapi.com">OMDb API</a>, which is sourced from <a href="http://www.imdb.com">IMDb</a> and <a href="http://www.rottentomatoes.com">Rotten Tomatoes</a> data, I found that on average, blockbuster movies with male leads generate 22% more domestic box office revenue than those with female leads, and that this difference is statistically significant.</p>
<h2 id="setting-up-the-movie-data">Setting Up the Movie Data</h2>
<figure>

    <img loading="lazy" srcset="/2016/04/movie-gender/movie-gender-data_hu_4a04ca86d66702a8.webp 320w,/2016/04/movie-gender/movie-gender-data_hu_9180e5af9f46ff87.webp 768w,/2016/04/movie-gender/movie-gender-data_hu_e92fc98d559a5e13.webp 1024w,/2016/04/movie-gender/movie-gender-data.png 1158w" src="movie-gender-data.png"/> 
</figure>

<p>I&rsquo;ve talked about processing the OMDb dataset in <a href="http://minimaxir.com/2016/04/trust-but-verify/">my previous post</a>. For this analysis, I&rsquo;ll be filtering on a specific subset of movies:</p>
<ul>
<li>Movies with <strong>at least $10 million in inflation-adjusted domestic box office revenue</strong>. My <a href="http://minimaxir.com/2016/01/movie-revenue-ratings/">first analysis</a> showed that there is a distinct cluster of movies above the $10M threshold specifically. These blockbusters are also the movies the public knows, and they best reflect the perception of the industry.</li>
<li>Movies that were <strong>released in 2000 or later</strong>. I had found <a href="http://minimaxir.com/2016/04/trust-but-verify/">missing box office revenue data</a> for earlier years, so I prefer to stick with the more robust recent data to be safe. Additionally, this avoids the complicated issue of <a href="https://www.reddit.com/r/dataisbeautiful/comments/4bcb6x/john_goodman_is_not_the_greatest_supporting_actor/d17y82k">20th-century gender politics in cinema</a>, which I cannot easily address statistically.</li>
</ul>
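<p>The two filters above can be sketched in a few lines of Python. This is a minimal sketch rather than the code used for the analysis: the <code>Movie</code> record, its field names, and the sample rows are hypothetical, and it assumes revenue has already been adjusted for inflation.</p>

```python
from dataclasses import dataclass

# Thresholds taken from the two filters described above.
MIN_REVENUE = 10_000_000  # inflation-adjusted domestic box office, in dollars
MIN_YEAR = 2000


@dataclass
class Movie:
    # Hypothetical record layout; the real dataset's columns may differ.
    title: str
    year: int
    adjusted_revenue: float


def is_blockbuster(movie: Movie) -> bool:
    """Keep movies that pass both the revenue and the release-year filter."""
    return movie.adjusted_revenue >= MIN_REVENUE and movie.year >= MIN_YEAR


def filter_movies(movies: list[Movie]) -> list[Movie]:
    """Return only the movies that pass both filters, in their original order."""
    return [m for m in movies if is_blockbuster(m)]


# Made-up rows purely to exercise the filter.
sample = [
    Movie("Placeholder A", 2005, 45_000_000),
    Movie("Placeholder B", 1995, 120_000_000),  # released too early
    Movie("Placeholder C", 2012, 3_000_000),    # below the revenue threshold
]
kept = filter_movies(sample)
print([m.title for m in kept])
```

<p>Both thresholds are inclusive here, matching the wording &ldquo;at least $10 million&rdquo; and &ldquo;2000 or later&rdquo;.</p>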
<p>After applying the filters and cleaning the data further to eliminate miscoded movies, I have created a dataset of 2,020 movies. No movies were removed as outliers in box office revenue (such as Star Wars VII and Avatar) since several tests failed to identify them as statistical outliers.</p>
<p>I identified the lead actor of each movie as the first credited actor on the IMDb cast overview (NB: this can be counterintuitive for casts with unknown leads; the first credited actor for <a href="http://www.imdb.com/title/tt2488496/">Star Wars: The Force Awakens</a> on IMDb is Harrison Ford, who is not the lead, so I corrected it to Daisy Ridley in the data). I then determined each lead&rsquo;s gender by referencing a few gender/first-name mappings (with thanks to <a href="https://twitter.com/matthew_daniels">Matt Daniels</a> and his great work on <a href="http://polygraph.cool/films/index.html">gender and film dialogue</a>).</p>
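<p>The lookup described above might look roughly like the following. It is a sketch under assumptions: the tiny name table, the override table keyed by IMDb ID, and the function names are all hypothetical stand-ins for the real mappings.</p>

```python
# Hypothetical first-name lookup; a real mapping would be far larger.
NAME_TO_GENDER = {"harrison": "male", "daisy": "female", "jennifer": "female"}

# Manual corrections for movies whose first credited actor is not the lead.
# The Force Awakens correction comes from the text; keying by IMDb ID is an assumption.
LEAD_OVERRIDES = {"tt2488496": "Daisy Ridley"}


def lead_actor(imdb_id, cast):
    """Return the corrected lead if one exists, else the first credited actor."""
    return LEAD_OVERRIDES.get(imdb_id, cast[0])


def lead_gender(imdb_id, cast):
    """Look up the lead's gender by first name; None if the name is unmapped."""
    first_name = lead_actor(imdb_id, cast).split()[0].lower()
    return NAME_TO_GENDER.get(first_name)


# Abbreviated, illustrative cast list in credit order.
cast = ["Harrison Ford", "Daisy Ridley"]
print(lead_actor("tt2488496", cast), lead_gender("tt2488496", cast))
```

<p>Returning <code>None</code> for unmapped names keeps ambiguous cases visible so they can be reviewed by hand instead of being silently guessed.</p>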
<p>In all, the dataset has 467 (23%) movies with a female lead actor and 1,553 (77%) movies with a male lead actor. Both counts are more than enough for this analysis.</p>
<p>You can view and download the final dataset <a href="https://docs.google.com/spreadsheets/d/1UMV-6yCjHBveyOcZwiilEm2DWRMjdzAbgdHutcjCn-E/edit?usp=sharing">in this Google Sheet</a>.</p>
<h2 id="distribution-of-box-office-revenue">Distribution of Box Office Revenue</h2>
<p>Let&rsquo;s start with simple histograms of the box office data. What are the distributions of the data for each gender?</p>
<figure>

    <img loading="lazy" srcset="/2016/04/movie-gender/movie-gender-1_hu_ba4630023d169626.webp 320w,/2016/04/movie-gender/movie-gender-1_hu_1913056dfb19232d.webp 768w,/2016/04/movie-gender/movie-gender-1_hu_b7677ddca5ac0469.webp 1024w,/2016/04/movie-gender/movie-gender-1.png 1200w" src="movie-gender-1.png"/> 
</figure>

<p>On average, blockbuster movies with male lead actors generate <strong>$79.8M in revenue</strong>. The distribution, even when log-scaled, is skewed right, with the median being much lower at $49.8M.</p>
<figure>

    <img loading="lazy" srcset="/2016/04/movie-gender/movie-gender-2_hu_e139f1d47a17c8dc.webp 320w,/2016/04/movie-gender/movie-gender-2_hu_35fb447ef25e6c23.webp 768w,/2016/04/movie-gender/movie-gender-2_hu_ae70c6507dbe507d.webp 1024w,/2016/04/movie-gender/movie-gender-2.png 1200w" src="movie-gender-2.png"/> 
</figure>

<p>On average, blockbuster movies with female lead actors generate <strong>$65.6M in revenue</strong>. The general shape of the distribution is the same as with male lead actors.</p>
<p>Double-checking the math:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">79.8M / 65.6M = 22% increase in average box office revenue for male-lead movies
</span></span></code></pre></div><p>So it is.</p>
<p>Let&rsquo;s overlay the two distributions after normalizing and smoothing with a <a href="https://en.wikipedia.org/wiki/Kernel_density_estimation">2D kernel density estimator</a>.</p>
<figure>

    <img loading="lazy" srcset="/2016/04/movie-gender/movie-gender-3_hu_18845af10678e204.webp 320w,/2016/04/movie-gender/movie-gender-3_hu_26723a60944647e1.webp 768w,/2016/04/movie-gender/movie-gender-3_hu_27c9c779125cb70f.webp 1024w,/2016/04/movie-gender/movie-gender-3.png 1200w" src="movie-gender-3.png"/> 
</figure>

<p>Female-led movies have a clear mode near their average, but male-led movies have a flatter distribution, with significantly more movies making 9 figures.</p>
<p>But is the difference between the two averages statistically significant? We can run two statistical tests between the box-office revenues of male-led and female-led movies:</p>
<ul>
<li>The <a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test">Kolmogorov–Smirnov test</a> for determining if two populations have the same distribution. The null hypothesis is that the two are drawn from the same distribution; the alternative hypothesis is that the distributions are different. We reject the null hypothesis at the 95% level in favor of the alternative if the p-value of the test is less than 0.05. Running the test, <strong>p &lt; 0.01</strong>, so we can say the distributions are statistically different. (the p-values are the same whether the box office revenues are log-transformed or not)</li>
<li>The <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Wilcoxon rank-sum test</a> for determining if the means (averages) of two populations are the same (null hypothesis) or different (alternative hypothesis). This test is used instead of a <em>t</em>-test if the populations are not Normally distributed, which is the case here. Running the test (one-sided, since only checking if a mean is greater), <strong>p &lt; 0.01</strong>, so we can say the two means are statistically different. (again, the p-values are the same whether the box office revenues are log-transformed or not)</li>
</ul>
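<p>As a rough sketch of how these two tests can be run in R, the snippet below applies <code>ks.test</code> and <code>wilcox.test</code> (both from base R&rsquo;s <code>stats</code> package) to simulated revenues. The vectors <code>male_rev</code> and <code>female_rev</code> and their log-normal parameters are placeholders made up for illustration; they are not the actual dataset, so the printed p-values will not match the ones above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">set.seed(42)

# Hypothetical stand-ins for the real box office revenues (in dollars).
male_rev   &lt;- rlnorm(1553, meanlog = log(50e6), sdlog = 1)
female_rev &lt;- rlnorm(467,  meanlog = log(40e6), sdlog = 1)

# Kolmogorov-Smirnov: are the two samples drawn from the same distribution?
ks &lt;- ks.test(male_rev, female_rev)

# Wilcoxon rank-sum, one-sided: do male-led revenues tend to be greater?
wx &lt;- wilcox.test(male_rev, female_rev, alternative = &#34;greater&#34;)

print(ks$p.value)
print(wx$p.value)

# The rank-sum test depends only on the ordering of the values,
# so a log-transform leaves its p-value unchanged.
wx_log &lt;- wilcox.test(log(male_rev), log(female_rev), alternative = &#34;greater&#34;)
print(all.equal(wx$p.value, wx_log$p.value))
</code></pre></div><p>The last check illustrates why the p-values are the same with or without the log-transformation: both tests only look at how the values are ordered, and taking a logarithm does not change that order.</p>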
<p>So we have statistical evidence that male-led movies generate more money on average than female-led movies. But this claim is very serious, and as a result, we need even more proof.</p>
<h2 id="the-resampling">The Resampling</h2>
<p>Although 2,020 movies is a fair sample size by statistical standards, some may argue that the movies were chosen too arbitrarily and that there is not enough data to support the conclusions I make above. Enter <a href="https://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29">bootstrap resampling</a>, in which we resample the data randomly (with replacement) to generate pseudo-datasets, and then calculate aggregate statistics (e.g. the average) on that simulated dataset. Repeat a large number of times, and we can form confidence intervals for the <em>true average</em> of a given data set.</p>
<p>In this case, we resample the box office revenues, calculate the average box office revenue for both male-led and female-led movies, store the result away and resample the data again, and keep repeating until satisfied.</p>
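<p>The procedure can be sketched in a few lines of base R. This is a minimal illustration on simulated revenues, not the code from the notebook: the vectors <code>male_rev</code> and <code>female_rev</code>, the helper <code>boot_mean</code>, and the choice to resample within each group separately are all assumptions made for the example.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">set.seed(42)

# Hypothetical stand-ins for the real box office revenues (in dollars).
male_rev   &lt;- rlnorm(1553, meanlog = log(50e6), sdlog = 1)
female_rev &lt;- rlnorm(467,  meanlog = log(40e6), sdlog = 1)

n_trials &lt;- 10000

# Resample one group with replacement and return the average of the resample.
boot_mean &lt;- function(x) mean(sample(x, length(x), replace = TRUE))

male_means   &lt;- replicate(n_trials, boot_mean(male_rev))
female_means &lt;- replicate(n_trials, boot_mean(female_rev))

# 95% percentile confidence intervals for each group's true average.
print(quantile(male_means,   c(0.025, 0.975)))
print(quantile(female_means, c(0.025, 0.975)))

# Share of trials in which the female-led average matched or beat the male-led one.
print(mean(female_means &gt;= male_means))
</code></pre></div><p>The percentile interval used here is the simplest way to turn bootstrap averages into a confidence interval; other bootstrap interval methods exist and may give slightly different bounds.</p>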
<p>Here&rsquo;s an animation of the resampling of both averages as the number of trials increases. As you can see, the shapes of both distributions stabilize very quickly:</p>
<figure>

    <img loading="lazy" srcset="/2016/04/movie-gender/movie_frames_hu_b75f68c94f8f6b6c.webp 320w,/2016/04/movie-gender/movie_frames.gif 640w" src="movie_frames.gif"/> 
</figure>

<p>And the final plot, at 10,000 repetitions:</p>
<figure>

    <img loading="lazy" srcset="/2016/04/movie-gender/movie-gender-10_hu_9b2659adbcd46e82.webp 320w,/2016/04/movie-gender/movie-gender-10_hu_acd47c18821179f9.webp 768w,/2016/04/movie-gender/movie-gender-10_hu_c29d82b32f0a9e86.webp 1024w,/2016/04/movie-gender/movie-gender-10.png 1200w" src="movie-gender-10.png"/> 
</figure>

<p>The dot on the bottom of each distribution represents the <em>actual</em> sample average value calculated during the analysis earlier, while the line range represents a 95% confidence interval for the true average revenue value for each gender. The distribution of male-led movie averages is more narrow than female-led movies because there are 3 times as many male-led movies in the dataset.</p>
<p>As you can see, <em>the line ranges never intersect</em>. Even in the most favorable scenario at the 95% confidence level, the average domestic box office revenue for male-led movies will be greater. Specifically, of the 10,000 trials, only 2 trials had the case where female-led movies had equal or greater average revenue than the corresponding male-led movie revenue average from the same resampling; this implies <strong>p &lt; 0.01</strong> for the statistical test on whether the means are the same or different.</p>
<p>Interestingly, there&rsquo;s a little overlap between the distributions, which occurs when there are multiple instances of Star Wars VII in the resampled dataset and its high box office revenue pushes the <em>entire</em> female-led average up very significantly.</p>
<h2 id="gender-and-quality">Gender and Quality</h2>
<p>It is also worth checking if male-led movies are <em>better in quality</em> than female-led movies; if that&rsquo;s the case, it might provide a more logical explanation for why male-led movies make more money.</p>
<p>Let&rsquo;s check out the distribution of <a href="http://www.rottentomatoes.com">Rotten Tomatoes</a> Tomatometer scores of blockbuster movies.</p>
<figure>

    <img loading="lazy" srcset="/2016/04/movie-gender/movie-gender-4_hu_c3b4f611e5c0231b.webp 320w,/2016/04/movie-gender/movie-gender-4_hu_f5aee80afb8650f0.webp 768w,/2016/04/movie-gender/movie-gender-4_hu_98b2025a9dfc28cc.webp 1024w,/2016/04/movie-gender/movie-gender-4.png 1200w" src="movie-gender-4.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2016/04/movie-gender/movie-gender-5_hu_c66691d9603885b7.webp 320w,/2016/04/movie-gender/movie-gender-5_hu_4ad061f35df20055.webp 768w,/2016/04/movie-gender/movie-gender-5_hu_df788c802ac21824.webp 1024w,/2016/04/movie-gender/movie-gender-5.png 1200w" src="movie-gender-5.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2016/04/movie-gender/movie-gender-6_hu_e6fb427934edc63c.webp 320w,/2016/04/movie-gender/movie-gender-6_hu_93ae8dc244bf6e6c.webp 768w,/2016/04/movie-gender/movie-gender-6_hu_a4408dc61dff270.webp 1024w,/2016/04/movie-gender/movie-gender-6.png 1200w" src="movie-gender-6.png"/> 
</figure>

<p>There&rsquo;s no obvious difference. Female-led movies are about 2 percentage points lower on average, but the general distribution is the same (<a href="https://en.wikipedia.org/wiki/Uniform_distribution_%28continuous%29">uniform</a>). Overlaying the two distributions shows as much.</p>
<p>The difference in averages is <em>not</em> statistically significant, as both the Kolmogorov–Smirnov test and the Wilcoxon rank-sum test fail to reject the null hypothesis at the 95% level (<strong>p = 0.37</strong> and <strong>p = 0.13</strong> respectively).</p>
<p>How about <a href="http://www.metacritic.com">Metacritic</a> scores?</p>
<figure>

    <img loading="lazy" srcset="/2016/04/movie-gender/movie-gender-7_hu_bed036eec5e23a96.webp 320w,/2016/04/movie-gender/movie-gender-7_hu_df2796912327ea43.webp 768w,/2016/04/movie-gender/movie-gender-7_hu_a7163ec6edb706d7.webp 1024w,/2016/04/movie-gender/movie-gender-7.png 1200w" src="movie-gender-7.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2016/04/movie-gender/movie-gender-8_hu_f1fa3d62ef9f075e.webp 320w,/2016/04/movie-gender/movie-gender-8_hu_fa5b50e026f60232.webp 768w,/2016/04/movie-gender/movie-gender-8_hu_5878e9f85f0d96fd.webp 1024w,/2016/04/movie-gender/movie-gender-8.png 1200w" src="movie-gender-8.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2016/04/movie-gender/movie-gender-9_hu_634303e4c532c292.webp 320w,/2016/04/movie-gender/movie-gender-9_hu_e0edcb3be8564355.webp 768w,/2016/04/movie-gender/movie-gender-9_hu_a560b65aabb24d98.webp 1024w,/2016/04/movie-gender/movie-gender-9.png 1200w" src="movie-gender-9.png"/> 
</figure>

<p>Again, the distributions are the same (both take on the shape of a <a href="https://en.wikipedia.org/wiki/Normal_distribution">Normal distribution</a>, interestingly). And again, the Kolmogorov–Smirnov test and the Wilcoxon rank-sum test fail to reject the null hypothesis (<strong>p = 0.45</strong> and <strong>p = 0.14</strong> respectively).</p>
<p>The quality of a movie is independent from the gender of the lead actor in determining the financial performance of a movie.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In 2014, <a href="http://www.mpaa.org/wp-content/uploads/2015/03/MPAA-Theatrical-Market-Statistics-2014.pdf">according to the MPAA</a>, the gender breakdown of moviegoers who see blockbuster movies is about 50/50, eliminating another potential explanation for the average movie revenue discrepancy.</p>
<p>Granted, there is still more work that can be done, such as controlling for movie genre in addition to gender. Is the gender of the lead actor a <em>causal</em> factor in a movie&rsquo;s success? Not necessarily, and this analysis does not assert such. But there definitely is a revenue disparity that&rsquo;s worth investigating, and it&rsquo;s not just that &ldquo;male-led movies are better.&rdquo;</p>
<p>There may be more movies like Star Wars Episode VII where a movie with a female lead can hit almost a billion dollars domestically (e.g. Star Wars VIII). Things are looking upward, and it would not surprise me if the 22% revenue difference decreases and disappears in the next decade.</p>
<hr>
<p><em>You can view the code used to process the data and generate the data visualizations <a href="https://github.com/minimaxir/movie-gender/blob/master/movie_gender.ipynb">in this Jupyter notebook</a>, <a href="https://github.com/minimaxir/movie-gender">open-sourced on GitHub</a>, or you can <a href="https://github.com/minimaxir/movie-gender/raw/master/movie_gender_pdf.pdf">view as a PDF</a> which is better if you are on a mobile device.</em></p>
<p><em>You are free to use the charts from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>The Importance of Sanity-Checking Datasets Before Analysis</title>
      <link>https://minimaxir.com/2016/04/trust-but-verify/</link>
      <pubDate>Wed, 06 Apr 2016 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2016/04/trust-but-verify/</guid>
      <description>The 1972 TV Special &amp;lsquo;The Lorax&amp;rsquo; is the best movie ever, earning $1.2 billion?</description>
      <content:encoded><![CDATA[<p>I&rsquo;ve done some cool things with movie data using a dataset from <a href="http://www.omdbapi.com">OMDb API</a>, which is sourced from <a href="http://www.imdb.com">IMDb</a> and <a href="http://www.rottentomatoes.com">Rotten Tomatoes</a> data. In my <a href="http://minimaxir.com/2016/01/movie-revenue-ratings/">previous article</a> on the dataset, I plotted the relationship between the domestic box office revenue of movies and their Rotten Tomatoes scores.</p>
<p>I want to take another look at domestic Box Office Revenues with aggregate statistics such as means/medians on categorical variables such as MPAA rating and release month. For this type of analysis in particular, I&rsquo;ll also need to implement code in <a href="https://www.r-project.org">R</a> for inflation adjustment.</p>
<p>However, I ran into a few unexpectedly silly issues.</p>
<h2 id="seeing-double">Seeing Double</h2>
<p>There are many similarities between data validation and the Quality Assurance process of product development, which is why this particular area appeals to me personally as a Software QA Engineer. Whenever a cool dataset is released publicly, I play around with it to look for any obvious flaws and to get a good all-around benchmark on the robustness of the data (this is a separate procedure from the traditional &ldquo;data cleaning&rdquo; phase necessary to begin quantification on some poorly-structured datasets).</p>
<p>Do the extreme values in the data make sense? Is the data encoded in a sane format? Are there any obvious gaps or logical contradictions in summary representations of the data, especially when compared to other canonical sources?</p>
<p>These concerns are also some of the reasons I&rsquo;ve switched to the <a href="http://jupyter.org">Jupyter Notebook</a> as my primary data science IDE. After each block of code which transforms data, I can print the data frame inline to immediately see the results of the code execution, and refer back to them if anything odd happens in the future.</p>
<p>Let&rsquo;s say I have a data frame of Movies using the latest data dump (3/26/16) from OMDb. This data set contains 1,160,273 movies, including both IMDb and Rotten Tomatoes data. After cleaning the data (not shown), I can use the R package <code>dplyr</code> by Hadley Wickham to sort the data frame by Box Office Revenue descending, and print the <code>head</code> (top) of the data.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">df</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="n">imdbID</span><span class="p">,</span> <span class="n">Title</span><span class="p">,</span> <span class="n">Year</span><span class="p">,</span> <span class="n">BoxOffice</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">BoxOffice</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">(</span><span class="m">25</span><span class="p">),</span> <span class="n">n</span> <span class="o">=</span> <span class="m">25</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2016/04/trust-but-verify/data-2_hu_fc7149d7b4ad38a9.webp 320w,/2016/04/trust-but-verify/data-2_hu_e103dfde4f2240f3.webp 768w,/2016/04/trust-but-verify/data-2_hu_cbb23b4322bee2d7.webp 1024w,/2016/04/trust-but-verify/data-2.png 1258w" src="data-2.png"/> 
</figure>

<p>Those movies being the best <em>makes sense</em>. For <a href="http://www.rottentomatoes.com/m/star_wars_episode_vii_the_force_awakens/">Star Wars: The Force Awakens</a>, I can compare it to the Box Office reported on the corresponding Rotten Tomatoes page, which in turn matches the <a href="http://www.boxofficemojo.com/movies/?id=starwars7.htm">domestic Box Office Revenue</a> on <a href="http://www.boxofficemojo.com">Box Office Mojo</a>.</p>
<p>But wait, <a href="https://en.wikipedia.org/wiki/The_Dark_Knight_%28film%29">The Dark Knight</a> appears <em>twice</em>? How?!</p>
<p>There&rsquo;s no way I would have missed something this obvious during the sanity-check for my previous article. In order to make sure that I&rsquo;m not going insane, I double-checked the December 2015 data dump I used for that post, derived the top movies with the same methodology I used for the modern data dump, and the duplicate movies <em>were not present</em>. Weird.</p>
<p>There are 2 different IDs for The Dark Knight, and for some other movies near the top (<a href="http://www.imdb.com/title/tt4817264/">Inside Out</a>, &ldquo;<a href="http://www.imdb.com/title/tt3138972/">The Gravity</a>&rdquo;). Fortunately, duplicate data like this is easy to debug. The second data entry for The Dark Knight has a greater IMDb ID (1774602) which means it was likely added to the site later. Let&rsquo;s look up the <a href="http://www.imdb.com/title/tt1774602/">corresponding IMDb page</a>:</p>
<figure>

    <img loading="lazy" srcset="/2016/04/trust-but-verify/dark-knight_hu_a2dd88a3ae15f413.webp 320w,/2016/04/trust-but-verify/dark-knight_hu_1518ed909d29f88e.webp 768w,/2016/04/trust-but-verify/dark-knight_hu_e8a475182d872549.webp 1024w,/2016/04/trust-but-verify/dark-knight.png 1128w" src="dark-knight.png"/> 
</figure>

<p>Huh. Apparently someone put a filler movie entry with the same name and release year as a blockbuster movie in hopes that people search for it by accident (and since it received 50 ratings and an average score of 8.6, this tactic was successful).</p>
<p>Using the Rotten Tomatoes <a href="http://developer.rottentomatoes.com/docs/read/json/v10/Movie_Alias">IMDb Lookup API</a>, we find that &ldquo;The Dark Knight&rdquo; page on Rotten Tomatoes&hellip;<a href="http://api.rottentomatoes.com/api/public/v1.0/movie_alias.json?type=imdb&amp;id=1774602">doesn&rsquo;t exist</a>.</p>
<p>We can run a safe deduplicate by removing entries with the same title (excluding the &ldquo;The&rdquo; if present) and release year.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_dup</span> <span class="o">&lt;-</span> <span class="n">df</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="n">Title</span><span class="p">,</span> <span class="n">Year</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">Title</span> <span class="o">=</span> <span class="nf">gsub</span><span class="p">(</span><span class="s">&#34;The &#34;</span><span class="p">,</span> <span class="s">&#34;&#34;</span><span class="p">,</span> <span class="n">Title</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="n">dup</span> <span class="o">&lt;-</span> <span class="nf">duplicated</span><span class="p">(</span><span class="n">df_dup</span><span class="p">)</span>   <span class="c1"># find entry indices which are duplicates</span>
</span></span><span class="line"><span class="cl"><span class="nf">rm</span><span class="p">(</span><span class="n">df_dup</span><span class="p">)</span>   <span class="c1"># remove temp dataframe</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">df_dedup</span> <span class="o">&lt;-</span> <span class="n">df</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="o">!</span><span class="n">dup</span><span class="p">)</span>   <span class="c1"># keep entries which are *not* dups</span>
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">df_dedup</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="n">imdbID</span><span class="p">,</span> <span class="n">Title</span><span class="p">,</span> <span class="n">Year</span><span class="p">,</span> <span class="n">BoxOffice</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">BoxOffice</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">(</span><span class="m">25</span><span class="p">),</span> <span class="n">n</span> <span class="o">=</span> <span class="m">25</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2016/04/trust-but-verify/data-1_hu_8b5a2ca66b9bcf38.webp 320w,/2016/04/trust-but-verify/data-1_hu_a71793e3bcc29bd2.webp 768w,/2016/04/trust-but-verify/data-1_hu_e83f2d858764f621.webp 1024w,/2016/04/trust-but-verify/data-1.png 1224w" src="data-1.png"/> 
</figure>

<p>There we go! The de-duped dataset has 1,114,431 movies, implying that there were 45,842 of these duplicate entries.</p>
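<p>Before dropping rows, it can also be worth listing which title/year pairs collide, to spot-check that they really are duplicates. Here&rsquo;s a minimal sketch on a toy data frame: the rows and IDs below are placeholders made up for illustration (not the OMDb dump), and the key is normalized the same way as in the deduplication step above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(dplyr)

# Toy data frame; the IDs are placeholders, not real IMDb IDs.
toy &lt;- data.frame(
  imdbID = c(&#34;a1&#34;, &#34;a2&#34;, &#34;a3&#34;, &#34;a4&#34;),
  Title  = c(&#34;The Dark Knight&#34;, &#34;The Dark Knight&#34;, &#34;Inside Out&#34;, &#34;Jaws&#34;),
  Year   = c(2008, 2008, 2015, 1975),
  stringsAsFactors = FALSE
)

# Count entries per normalized title and year; keep only the collisions.
collisions &lt;- toy %&gt;%
  mutate(Key = gsub(&#34;The &#34;, &#34;&#34;, Title)) %&gt;%
  count(Key, Year) %&gt;%
  filter(n &gt; 1)

print(collisions)
</code></pre></div><p>Joining the collisions back onto the full data frame would then show every ID in each colliding group, which makes it easier to decide which entry to keep.</p>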
<p>I&rsquo;m not sure <em>whose</em> fault it is that duplicate movies suddenly became present in the data dump: OMDb or Rotten Tomatoes. <em>But it doesn&rsquo;t matter</em>: the wrong entries still need to be addressed, and it&rsquo;s good to have a test case for the future too.</p>
<h2 id="inflation-station">Inflation Station</h2>
<p>A <a href="http://stackoverflow.com/a/26068058">Stack Overflow answer</a> from <a href="http://stackoverflow.com/users/1048757/brash-equilibrium">Ben Hanowell</a> has a good R implementation and rationale for implementing inflation adjustment using the <a href="https://research.stlouisfed.org/fred2/data/CPIAUCSL.txt">historical Consumer Price Index data</a> from the <a href="https://www.stlouisfed.org">Federal Reserve Bank of St. Louis</a>.</p>
<p>Take the index for each year (averaging each month for simplicity) and create an adjustment factor to convert historical dollar amounts into present-day dollar amounts. Much better than plugging hundreds of thousands of values into an online calculator. Here&rsquo;s the SO code made <code>dplyr</code>-friendly for this purpose, with the requisite sanity-checks.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">inflation</span> <span class="o">&lt;-</span> <span class="nf">read_csv</span><span class="p">(</span><span class="s">&#34;http://research.stlouisfed.org/fred2/data/CPIAUCSL.csv&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">group_by</span><span class="p">(</span><span class="n">Year</span> <span class="o">=</span> <span class="nf">as.integer</span><span class="p">(</span><span class="nf">substr</span><span class="p">(</span><span class="n">DATE</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">4</span><span class="p">)))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">summarize</span><span class="p">(</span><span class="n">Avg_Value</span> <span class="o">=</span> <span class="nf">mean</span><span class="p">(</span><span class="n">VALUE</span><span class="p">))</span> <span class="o">%&gt;%</span>   <span class="c1"># average across all months</span>
</span></span><span class="line"><span class="cl">                    <span class="nf">mutate</span><span class="p">(</span><span class="n">Adjust</span> <span class="o">=</span> <span class="nf">tail</span><span class="p">(</span><span class="n">Avg_Value</span><span class="p">,</span> <span class="m">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">Avg_Value</span><span class="p">)</span>   <span class="c1"># normalize by most-recent year</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">inflation</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">inflation</span> <span class="o">%&gt;%</span> <span class="nf">tail</span><span class="p">())</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2016/04/trust-but-verify/inf.png 290w" src="inf.png"/> 
</figure>

<p>For example, to get the inflation-adjusted Box Office Revenue for a movie released in 1949 in 2016 dollars, we multiply the reported revenue by 10. That sounds about right (and matches closely enough to the output of the <a href="http://data.bls.gov/cgi-bin/cpicalc.pl?cost1=1&amp;year1=1949&amp;year2=2016">Bureau of Labor Statistics inflation calculator</a>).</p>
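<p>To make the arithmetic concrete, here is a tiny version of the same calculation in base R. The index values are made-up round numbers chosen so the result is easy to check by hand; they are not the real CPI figures, and <code>toy_cpi</code> is a hypothetical name.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"># Made-up yearly index values, for illustration only.
toy_cpi &lt;- data.frame(
  Year      = c(1949, 1980, 2016),
  Avg_Value = c(24, 80, 240)
)

# Same normalization as above: divide the most recent year's index by each year's.
toy_cpi$Adjust &lt;- tail(toy_cpi$Avg_Value, 1) / toy_cpi$Avg_Value
print(toy_cpi)

# A hypothetical $5M gross from 1949 expressed in 2016 dollars.
print(5e6 * toy_cpi$Adjust[toy_cpi$Year == 1949])
</code></pre></div><p>With these toy numbers the 1949 factor is 240 / 24 = 10, and the most recent year always gets a factor of 1, which is a handy invariant to check on the real table as well.</p>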
<p>Now map each inflation adjustment factor to each movie by merging the two datasets (on the <code>Year</code> column), then multiply the Box Office revenue by the adjustment factor to get the inflation-adjusted revenue. Plus another sanity-check for good measure.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_dedup_join</span> <span class="o">&lt;-</span> <span class="n">df_dedup</span> <span class="o">%&gt;%</span> <span class="nf">inner_join</span><span class="p">(</span><span class="n">inflation</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">AdjBoxOffice</span> <span class="o">=</span> <span class="n">BoxOffice</span> <span class="o">*</span> <span class="n">Adjust</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">df_dedup_join</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="n">Title</span><span class="p">,</span> <span class="n">Year</span><span class="p">,</span> <span class="n">AdjBoxOffice</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">AdjBoxOffice</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">head</span><span class="p">(</span><span class="m">25</span><span class="p">),</span> <span class="n">n</span><span class="o">=</span><span class="m">25</span><span class="p">)</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2016/04/trust-but-verify/data-3_hu_9bedc8e778de7ad8.webp 320w,/2016/04/trust-but-verify/data-3_hu_7c39435bd36c198e.webp 768w,/2016/04/trust-but-verify/data-3_hu_b47d674b228181e4.webp 1024w,/2016/04/trust-but-verify/data-3.png 1070w" src="data-3.png"/> 
</figure>

<p>Uh-oh.</p>
<p>I mean, <a href="https://en.wikipedia.org/wiki/The_Lorax_%28TV_special%29">The Lorax</a> probably earned $1.2 billion in VHS sales for Earth Day education <em>alone</em>, but the TV special was never released in theaters. There was a <a href="https://en.wikipedia.org/wiki/The_Lorax_%28film%29">CGI remake of The Lorax</a> a few years ago which was reasonably popular. Could it be that someone at Rotten Tomatoes or Box Office Mojo confused the two media?</p>
<p>That is exactly what happened. On Rotten Tomatoes, the <a href="http://www.rottentomatoes.com/m/the-lorax/">1972 Lorax</a> was encoded with similar box office revenue as the <a href="http://www.rottentomatoes.com/m/the_lorax/">2012 Lorax</a>; then the inflation factor sextupled it. For this type of data fidelity issue, it&rsquo;s considerably more obvious who&rsquo;s at fault.</p>
<p>Unfortunately, that&rsquo;s not the end of problems with the dataset. I compared my results with <a href="http://www.vox.com/2016/4/4/11351788/batman-v-superman-terrible-reviews#undefined">Vox&rsquo;s dataset</a> on worldwide historical box office revenues. In the Top 200 Movies by inflation-adjusted revenue, there are noted historical movie omissions such as <a href="http://www.rottentomatoes.com/m/jaws/">Jaws</a> and <a href="http://www.rottentomatoes.com/m/star_wars/">Star Wars: A New Hope</a>. It turns out Rotten Tomatoes does not have Box Office Revenue data for these movies at all.</p>
<p>That is a very serious problem, and I&rsquo;ll have to think about whether it blocks any analysis on aggregate box office data completely. In the end, sanity-checking third-party data is important because you never know <em>how</em> the data will surprise you until it&rsquo;s too late.</p>
<hr>
<p><em>You can view the Top 200 movies by domestic box office revenue for each of the 12/15 source dataset, the 3/16 dataset, the 3/16 deduped dataset, and the 3/16 deduped inflation-adjusted data <a href="https://github.com/minimaxir/movie-data-sanity-checking">in this GitHub repository</a>, along with the Jupyter notebook.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Movie Review Aggregator Ratings Have No Relationship with Box Office Success</title>
      <link>https://minimaxir.com/2016/01/movie-revenue-ratings/</link>
      <pubDate>Thu, 07 Jan 2016 08:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2016/01/movie-revenue-ratings/</guid>
      <description>Perhaps the movie rating system itself is broken.</description>
      <content:encoded><![CDATA[<p><a href="http://www.rottentomatoes.com">Rotten Tomatoes</a> has become synonymous with movie quality in recent years. The Rotten Tomatoes Tomatometer aggregates all reviews written by movie critics for a given movie on the internet, determines whether each reviewer rates the movie as &ldquo;Fresh&rdquo; or &ldquo;Rotten&rdquo; and calculates an average. If the proportion of Fresh reviews for a given movie is greater than or equal to 60%, the movie itself is considered &ldquo;Fresh&rdquo; and receives a special icon.</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/examples.png 266w" src="examples.png"/> 
</figure>

<p>Top Movies like Christopher Nolan&rsquo;s <a href="http://www.rottentomatoes.com/m/the_dark_knight/">The Dark Knight</a> received a 94% Rotten Tomatoes rating, and generated $533.3 million in domestic box office revenue. But other movies, like Michael Bay&rsquo;s <a href="http://www.rottentomatoes.com/m/transformers_revenge_of_the_fallen/">Transformers: Revenge of the Fallen</a>, received a 19% Tomatometer rating, but still generated $402.1 million in domestic box office revenue.</p>
<p>How strong is the relationship between Tomatometer scores and box office success, anyways? Or are there other, better metrics? Time to make some pretty charts.</p>
<p>I obtained a large amount of movie data from the <a href="http://www.omdbapi.com">OMDb API</a>, which provides easy access to movie metadata from IMDb and Rotten Tomatoes. This data contains Rotten Tomatoes Tomatometer scores, Rotten Tomatoes Audience Scores, IMDb User Rankings, and Metacritic Scores. If you want to know how I processed the data in R and plotted the charts using ggplot2, I have <a href="https://www.youtube.com/watch?v=F5Hjlkxw_2A">prepared a screencast</a> for your viewing pleasure.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/F5Hjlkxw_2A?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>For this analysis, we will be looking at the <a href="http://www.r-statistics.com/2013/05/log-transformations-for-skewed-and-wide-distributions-from-practical-data-science-with-r/">log-transformation</a> of domestic box office revenue, since the values are skewed by mega-blockbusters like the ones mentioned previously. Revenues are not inflation-adjusted since the rating data is only present for recent years and due to the log-transformation already present, inflation correction would not impact this particular analysis much.</p>
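<p>As a rough illustration of why the log scale helps, here is a minimal Python sketch (the analysis itself was done in R). Apart from the two grosses quoted above, the revenue figures are hypothetical, and base 10 is an assumption chosen for readability:</p>

```python
import math

# Domestic grosses in dollars. The last two are the figures quoted above;
# the others are hypothetical values for illustration only.
revenues = [1_000, 250_000, 12_000_000, 402_100_000, 533_300_000]


def log_revenue(dollars):
    """Return log10 of a positive revenue figure."""
    if dollars <= 0:
        raise ValueError("log transform needs a positive revenue")
    return math.log10(dollars)


log_revenues = [log_revenue(r) for r in revenues]

# Raw values span several orders of magnitude; the logged values
# sit within a few units of each other.
for raw, logged in zip(revenues, log_revenues):
    print("%12d -> %.2f" % (raw, logged))
```

<p>On this scale a $1,000 release and a $500M blockbuster are a few units apart instead of a factor of 500,000, so the blockbusters no longer dominate the chart.</p>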
<h2 id="rotten-tomatoes-tomatometer">Rotten Tomatoes Tomatometer</h2>
<p>After processing, I have a data subset of 4,863 movies with both Tomatometer and Box Office Gross values. Let&rsquo;s plot all those movies on a scatterplot of log(BoxOffice) vs. Meter with each point having a slight transparency; that way, clusters of points will become apparent where the areas are darker on the chart.</p>
<p>We expect a positive linear relationship: movies with high Tomatometer scores should have high box office revenue, and conversely, movies with low scores should have low box office revenue.</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-1_hu_96d5e65a38238ebb.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-1.png 600w" src="box-office-rating-1.png"/> 
</figure>

<p>Wait, why does the trendline have a <em>negative</em> slope?</p>
<p>The <a href="https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient">Pearson correlation</a> between the Tomatometer scores and log(BoxOffice) is <strong>-0.18</strong>, implying a weak <em>negative</em> linear relationship between the two variables. Not what I expected.</p>
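<p>For readers who want to see what that number measures, here is a minimal Python sketch of the Pearson correlation computed against log-revenue. The function name and the sample pairs are illustrative assumptions, not the OMDb data, so the resulting value will not match the figure above:</p>

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: this divides by zero if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (Tomatometer, domestic gross) pairs for illustration.
meter = [94, 19, 85, 30, 60]
box_office = [533_300_000, 402_100_000, 1_200_000, 45_000_000, 9_000_000]

r = pearson(meter, [math.log10(b) for b in box_office])
print("correlation with log(BoxOffice): %.2f" % r)
```

<p>A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little or none.</p>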
<p>There do appear to be clusters in the data. There is a group of points between $10M and $100M revenue and 0% to 20% Tomatometer rating. Another group is present between $1,000 and $1M revenue and 80% to 100% RT rating. Both of these areas are outside of a linear relationship: perhaps these clusters are skewing trends too?</p>
<p>Let&rsquo;s try another visualization of the data using <a href="https://en.wikipedia.org/wiki/Contour_line">contour maps</a>, which allow the data to become 3D, so-to-speak. Using a 2D <a href="https://en.wikipedia.org/wiki/Kernel_density_estimation">kernel density estimator</a>, we can identify and color areas on the plot according to the number of points present in that area; the greater the color saturation, the more points present in the given area.</p>
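<p>The idea behind a 2D kernel density estimate can be sketched in a few lines of Python. This is a simplified product-of-Gaussians estimator, not the implementation ggplot2 uses; the function name, bandwidths, and sample points are all assumptions for illustration:</p>

```python
import math


def kde_2d(points, x, y, bandwidth_x, bandwidth_y):
    """Estimate the density at (x, y) using a Gaussian kernel per point."""
    total = 0.0
    for px, py in points:
        zx = (x - px) / bandwidth_x
        zy = (y - py) / bandwidth_y
        total += math.exp(-0.5 * (zx * zx + zy * zy))
    # Normalizing constant so the estimate integrates to 1 over the plane.
    norm = 2 * math.pi * bandwidth_x * bandwidth_y * len(points)
    return total / norm


# Hypothetical (rating, log10 revenue) points forming two loose clusters.
points = [(10, 7.5), (12, 7.8), (15, 7.2), (90, 4.0), (92, 4.5), (88, 3.8)]

dense = kde_2d(points, 12, 7.5, bandwidth_x=10, bandwidth_y=0.5)
sparse = kde_2d(points, 50, 6.0, bandwidth_x=10, bandwidth_y=0.5)
print("near a cluster: %.5f, between clusters: %.5f" % (dense, sparse))
```

<p>Evaluating the estimate over a grid and drawing lines of equal density gives the contours; locations close to many points score high, and empty regions score close to zero.</p>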
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-2_hu_760d8dc1d3815e51.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-2.png 600w" src="box-office-rating-2.png"/> 
</figure>

<p>The two clusters mentioned previously are now much more apparent. It appears there are two distinct sets of movies: blockbusters which critics hate, and limited-appeal films which critics love. Incidentally, there is no discernible difference between movies which are Fresh (60% or higher) and Rotten.</p>
<h2 id="metacritic">Metacritic</h2>
<p>The <a href="http://www.metacritic.com">Metacritic</a> score is also <a href="http://www.metacritic.com/about-metascores">derived from review data</a> by critics; however, instead of calculating a binary review sentiment and calculating a proportion from that sentiment, Metacritic gives a quantification from 0 to 100 to each critic review and averages them together.</p>
<p>Does that change the results for 4,479 movies?</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-7_hu_fef2b0f07f0269fe.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-7.png 600w" src="box-office-rating-7.png"/> 
</figure>

<p>Correlation between Metacritic score and log(BoxOffice) is <strong>-0.13</strong>, which puts the analysis in a similar state as the Rotten Tomatoes data. However, the blockbuster cluster has shifted right, and the lesser-appeal cluster has shifted left.</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-8_hu_db41f472024f23b6.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-8.png 600w" src="box-office-rating-8.png"/> 
</figure>

<p>Clusters are much closer together.</p>
<p>Perhaps a review metric by non-critics will tell a different story.</p>
<h2 id="rotten-tomatoes-audience-score">Rotten Tomatoes Audience Score</h2>
<p>The Audience Score is calculated in a similar way to the Rotten Tomatoes Tomatometer score: users of the site rate a movie from 0 to 5 stars in half-star increments (i.e. effectively a scale from 0-10) and the proportion of reviews with 3.5 star ratings or higher becomes the Audience Score.</p>
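<p>As a small worked example of that calculation, here is a Python sketch following the description above. The function name and the sample ratings are hypothetical:</p>

```python
def audience_score(star_ratings, threshold=3.5):
    """Percentage of ratings at or above the threshold (3.5 stars here)."""
    if not star_ratings:
        raise ValueError("no ratings to score")
    positive = sum(1 for stars in star_ratings if stars >= threshold)
    return 100.0 * positive / len(star_ratings)


# Hypothetical user ratings in half-star increments.
ratings = [5.0, 4.0, 3.5, 3.0, 2.5, 4.5, 1.0, 3.5]
score = audience_score(ratings)
print("Audience Score: %.1f%%" % score)
```

<p>Five of the eight sample ratings are 3.5 stars or higher, so this sample scores 62.5%. Note that a 3.5-star rating and a 5-star rating count exactly the same.</p>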
<p>This also presents a cognitive bias in ratings: the <a href="http://tvtropes.org/pmwiki/pmwiki.php/Main/FourPointScale">Four Point Scale</a>, where having a discrete form of ranking may cause people to tend to rate toward the top of the scale and make the entire metric skewed or misleading.</p>
<p>How does the Audience Score compare for 5,163 movies? After all, the audience is the group of people who determine how much money a movie makes at the Box Office.</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-3_hu_1f2e7fff936a2fa7.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-3.png 600w" src="box-office-rating-3.png"/> 
</figure>

<p>Correlation between the Audience score and log(BoxOffice) is <strong>0.05</strong>, which is a positive linear correlation, but representative of barely any practical correlation.</p>
<p>Speaking of the Four Point Scale, notice how, like with Metacritic score, there are barely any movies between 0% and 20% Audience Score. Is there really a skew? Let&rsquo;s look at the contours:</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-4_hu_1baf301a632b3684.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-4.png 600w" src="box-office-rating-4.png"/> 
</figure>

<p>The locations of the clusters are much different from those of the Tomatometer clusters. Both clusters are closer together, with the blockbuster cluster between 50% and 60% audience score and the lesser-appeal cluster between 70% and 80%. Hence, the low correlation.</p>
<h2 id="imdb">IMDb</h2>
<p><a href="http://www.imdb.com">IMDb</a> works <a href="http://www.imdb.com/help/show_leaf?votestopfaq">almost the same way</a> as Metacritic, but for non-critics: ratings from IMDb users between 1 and 10 (note that 0 is missing!) are averaged to get a final score.</p>
<p>How do 5,167 movies fare?</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-5_hu_1f38b551534a465.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-5.png 600w" src="box-office-rating-5.png"/> 
</figure>

<p><strong>What?!</strong></p>
<p>The point groupings are at the <em>same</em> positions of ratings, and the correlation between IMDb ratings and log(BoxOffice) is <strong>0.00</strong>. Yes, there&rsquo;s <em>zero</em> correlation!</p>
<p>Checking the contour map confirms it:</p>
<figure>

    <img loading="lazy" srcset="/2016/01/movie-revenue-ratings/box-office-rating-6_hu_ba1aa72c3457ed1d.webp 320w,/2016/01/movie-revenue-ratings/box-office-rating-6.png 600w" src="box-office-rating-6.png"/> 
</figure>

<p>That is <em>literally</em> a Four Point Scale between 5 and 8!</p>
<p>The Rotten Tomatoes metric is the only metric that actually <em>uses</em> the entire rating scale. None of the other potential metrics provide more insight into a potential reason for high box-office revenue. Perhaps the movie rating system itself is broken.</p>
<p>That&rsquo;s not to say that movies need high box-office revenues to be considered successful. However, working with movie profitability, and by extension movie budget, opens another can of worms with respect to data integrity. (That said, on Reddit, /u/chartmkr recently <a href="https://www.reddit.com/r/dataisbeautiful/comments/3zpp3w/movie_budgets_and_box_office_success_19552015_oc/">posted a visualization</a> of Gross vs. Budget, which is interesting.)</p>
<p>It&rsquo;ll still be fun to point to a Rotten Tomatoes Tomatometer rating as a kneejerk reaction to whether a movie rocks/sucks. Although, the reasons for movie financial success at the box office definitely warrant further investigation.</p>
<p><strong>UPDATE 1/11/15</strong>: On a <a href="https://news.ycombinator.com/item?id=10872076">discussion on Hacker News</a>, it was suggested that the blockbuster movies and the indie movies cancel each other out, i.e. blockbusters have a positive correlation and indies have a negative correlation.</p>
<p>For the blockbuster cluster alone, the log-correlation is <strong>0.23</strong> (not weak but not great positive correlation). For the indie cluster alone, the log-correlation is <strong>-0.12</strong> (same as original analysis).</p>
<p>For future analysis, it may be worthwhile to split these two clusters. I stand by the original analysis for this post: very frequently I&rsquo;ve heard the question &ldquo;is this a good movie?&rdquo; and the response is &ldquo;what does the RT score say?&rdquo; Both Box Office revenues and RT scores are important measures of quality (depending on perspective), and users who want to see or purchase a movie may not necessarily care if it&rsquo;s indie or a blockbuster.</p>
<p>User cwyers <a href="https://news.ycombinator.com/item?id=10878019">suggested</a> that Simpson&rsquo;s Paradox may be in play since the number of theaters showing a movie is positively correlated to box office revenue, adding a potentially-confounding effect. I will see if I can obtain that data for future analysis.</p>
<hr>
<p><em>You can access the open-sourced Jupyter notebook and high-resolution charts from this article in <a href="https://github.com/minimaxir/movie-revenue-ratings">this GitHub repository</a>. If you use the code or data visualization designs contained within this article, it would be greatly appreciated if proper attribution is given back to this article and/or myself. Thanks!</em></p>
<p><em>Unfortunately, I cannot redistribute the data itself due to licensing concerns.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Plotting a Map of New York City Using Only Taxi Location Data</title>
      <link>https://minimaxir.com/2015/08/nyc-map/</link>
      <pubDate>Fri, 07 Aug 2015 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2015/08/nyc-map/</guid>
      <description>In theory, plotting a million little points in close proximity should simulate the lines of the streets of New York City.</description>
      <content:encoded><![CDATA[<p>Recently, the New York City Taxi and Limousine Commission <a href="http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml">released a dataset</a> of all Yellow Taxi and Green Taxi trips in 2014, and year-to-date in 2015, which follows the <a href="http://chriswhong.com/open-data/foil_nyc_taxi/">2013 data set</a> which was obtained through a FOIL request for the data last year. The dataset contains fun statistics, such as the location where the taxi picked up and dropped off its fare, the speed the taxi is moving, and the total fare at the end of the ride.</p>
<p>In the <a href="https://news.ycombinator.com/item?id=10003118">Hacker News thread</a> announcing the data set release, user eck posted an interesting, minimalistic visualization of the taxi location data:</p>
<figure>

    <img loading="lazy" srcset="/2015/08/nyc-map/ov6K6mt_hu_ce9deb18d2e19d36.webp 320w,/2015/08/nyc-map/ov6K6mt_hu_890bae9dcf1488.webp 768w,/2015/08/nyc-map/ov6K6mt_hu_93d847eabb022b63.webp 1024w,/2015/08/nyc-map/ov6K6mt.jpg 1200w" src="ov6K6mt.jpg"/> 
</figure>

<p>eck made the visualization using a &ldquo;few hundred lines of C++&rdquo;. That seemed overkill to me. So I tried to reverse-engineer his visualization using my favorite plotting tool, <a href="http://ggplot2.org/">ggplot2</a>. In theory, plotting a million little points in close proximity should simulate the lines of the streets of New York City.</p>
<p>The dataset is large (2 GB per month of data), although not &ldquo;big data&rdquo; large. It would take an afternoon to set up a local database, and I wanted to make pretty visualizations <em>immediately</em>.</p>
<p>Google BigQuery Developer Advocate Felipe Hoffa <a href="https://www.reddit.com/r/bigquery/comments/3fo9ao/nyc_taxi_trips_now_officially_shared_by_the_nyc/">created a BigQuery interface</a> for the data. BigQuery allows easy and fast access to the entire dataset for rapid processing. In my case, I need to compress the data set by rounding the latitude and longitude of the GPS coordinates to 4 decimal places; this allows <a href="http://gis.stackexchange.com/a/8674">precision to 11 meters</a> on the coordinates, which is sufficient for an estimate. Running this BigQuery query:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">ROUND</span><span class="p">(</span><span class="n">pickup_latitude</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">lat</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="n">ROUND</span><span class="p">(</span><span class="n">pickup_longitude</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">long</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">num_pickups</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="p">[</span><span class="n">nyc</span><span class="o">-</span><span class="n">tlc</span><span class="p">:</span><span class="n">yellow</span><span class="p">.</span><span class="n">trips_2014</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">lat</span><span class="p">,</span><span class="w"> </span><span class="n">long</span><span class="w">
</span></span></span></code></pre></div><p>Gives me data for about 1 Million GPS coordinates; more than enough to make a full map. This also has the benefit of fitting into memory, which is necessary for use with ggplot2.</p>
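<p>The same round-and-count idea can be sketched in Python for readers without BigQuery access. The function name and the sample coordinates are hypothetical, and Python&rsquo;s <code>round</code> may treat exact ties slightly differently from the SQL <code>ROUND</code>:</p>

```python
from collections import Counter


def bin_pickups(coords, digits=4):
    """Count pickups per (lat, long) pair after rounding to `digits` decimals."""
    counts = Counter()
    for lat, lon in coords:
        counts[(round(lat, digits), round(lon, digits))] += 1
    return counts


# Hypothetical pickup coordinates: two a few meters apart, one far away.
pickups = [
    (40.75801, -73.98552),
    (40.75803, -73.98549),
    (40.64131, -73.77814),
]

counts = bin_pickups(pickups)
for (lat, lon), num_pickups in counts.items():
    print(lat, lon, num_pickups)
```

<p>The first two points fall into the same bin, which is how many millions of trips collapse into roughly a million distinct coordinates.</p>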
<p>My first attempt, where I plot 1 million very small white points on a black map bounded to the latitude/longitude coordinates of NYC, turned out pretty well, and with fewer than 10 lines of code.</p>
<figure>

    <img loading="lazy" srcset="/2015/08/nyc-map/nyc_old_hu_45983a2283725b0.webp 320w,/2015/08/nyc-map/nyc_old_hu_799b6d61a8ddf.webp 768w,/2015/08/nyc-map/nyc_old_hu_cd73c333824bb478.webp 1024w,/2015/08/nyc-map/nyc_old.png 1470w" src="nyc_old.png"/> 
</figure>

<p>On Reddit, my submission of the <a href="https://www.reddit.com/r/dataisbeautiful/comments/3fvg8i/map_of_new_york_city_plotted_using_locations_of/">data visualization</a> received about 3,500 points, and a large amount of social media buzz on Facebook and Twitter. There were a few questions in the comments, however: why were there random points in the Hudson River? Why are highways indicated as taxi pickup spots? Why does the map say &ldquo;2015&rdquo; when your query says &ldquo;2014&rdquo;? (guilty on the last one; the map was made using the 2014 dataset by accident!)</p>
<p>At the least, the streets of <a href="https://en.wikipedia.org/wiki/Manhattan">Manhattan</a> were not discernible at all, unlike eck&rsquo;s diagram. As a result, I made a few refinements to remove some logical outliers with impossible vehicle speeds, removed noisy points which were completely isolated, <em>used the correct 2015 data set</em>, and also added a color weighting to the data, where the most-taxi-dense areas will appear colored (scaling logarithmically) to differentiate those areas from less taxi-prone areas.</p>
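<p>A rough Python sketch of those refinements follows. The speed cutoff, the minimum count used as a stand-in for &ldquo;isolated&rdquo; points, and both function names are assumptions for illustration; the actual thresholds used for the map are not stated here:</p>

```python
import math

MAX_PLAUSIBLE_MPH = 100  # hypothetical cutoff for an "impossible" speed


def plausible_speed(trip_miles, trip_seconds, max_mph=MAX_PLAUSIBLE_MPH):
    """Return True if the implied average speed is physically believable."""
    if trip_seconds <= 0:
        return False
    mph = trip_miles / (trip_seconds / 3600.0)
    return mph <= max_mph


def log_weights(counts, min_count=2):
    """Drop sparse bins, then scale remaining counts to 0..1 logarithmically."""
    kept = {key: c for key, c in counts.items() if c >= min_count}
    if not kept:
        return {}
    top = math.log10(max(kept.values()))
    if top == 0:
        # Every remaining bin has a count of 1, so all weights are equal.
        return {key: 1.0 for key in kept}
    return {key: math.log10(c) / top for key, c in kept.items()}


# Hypothetical pickup counts per location bin.
weights = log_weights({"river": 1, "side_street": 10, "times_square": 1000})
print(weights)
```

<p>With a logarithmic scale, a bin with 10 pickups still gets a third of the color intensity of a bin with 1,000, so quieter streets stay visible next to the hotspots.</p>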
<figure>

    <img loading="lazy" srcset="/2015/08/nyc-map/nyc_yellow_pickup_hu_79f163e3bb52840b.webp 320w,/2015/08/nyc-map/nyc_yellow_pickup_hu_b7b8ba1848f5f521.webp 768w,/2015/08/nyc-map/nyc_yellow_pickup_hu_3bfc7fcd9ad59690.webp 1024w,/2015/08/nyc-map/nyc_yellow_pickup.png 1470w" src="nyc_yellow_pickup.png"/> 
</figure>

<p>The map, while less bright, became more precise. The streets in Manhattan are now visible, and the purple color shows how <a href="https://en.wikipedia.org/wiki/Times_Square">Times Square</a> and the <a href="https://en.wikipedia.org/wiki/Financial_District,_Manhattan">Financial District</a> in particular are popular taxi pickup spots in Manhattan. <a href="https://en.wikipedia.org/wiki/LaGuardia_Airport">LaGuardia Airport</a> and <a href="https://en.wikipedia.org/wiki/John_F._Kennedy_International_Airport">John F. Kennedy International Airport</a> appear purple as well.</p>
<p>However, this map only looks at places where people picked up Taxis. Could there be a difference when plotting Taxi dropoffs instead?</p>
<p>I reran the query and visualization scripts on the dropoff location data instead, and as it turns out, there is a significant difference!</p>
<figure>

    <img loading="lazy" srcset="/2015/08/nyc-map/nyc_yellow_dropoff_hu_ba7fafe42aea4212.webp 320w,/2015/08/nyc-map/nyc_yellow_dropoff_hu_d7fd7128d3934fea.webp 768w,/2015/08/nyc-map/nyc_yellow_dropoff_hu_6cdc1899c53909a9.webp 1024w,/2015/08/nyc-map/nyc_yellow_dropoff.png 1470w" src="nyc_yellow_dropoff.png"/> 
</figure>

<p>While Taxi pickups were isolated more in residential areas, taxi dropoffs can happen anywhere in the Tristate area. Additionally, the map more closely matches eck&rsquo;s original visualization.</p>
<p>Setting up ggplot2 in this way will also allow me to perform other fun analyses in the future. For example, since we know where taxis drop off, we can determine the average speed for the trip for each significant location in NYC geography. Would the average trip speed be higher for trips that drop off at an airport due to the highway? Conversely, would the average speed be lower in Manhattan? How would fares be affected? Those are questions for another blog post. :)</p>
<p>Although, I still have no guesses why the highways are highlighted in <em>both</em> maps.</p>
<hr>
<p><em>You can download a <a href="https://dl.dropboxusercontent.com/u/2017402/nyc_yellow_pickup.pdf">PDF of my purple pickup map</a> (4.14 MB) and a <a href="https://dl.dropboxusercontent.com/u/2017402/nyc_yellow_dropoff.pdf">PDF of my blue dropoff map</a> (7.68 MB) sans text, both of which are resolution-independent and suitable for making physical prints.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>How to Scrape Data From Facebook Page Posts for Statistical Analysis</title>
      <link>https://minimaxir.com/2015/07/facebook-scraper/</link>
      <pubDate>Mon, 20 Jul 2015 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2015/07/facebook-scraper/</guid>
      <description>It is pretty easy to scrape Facebook Posts data and make it into a spreadsheet for easy analysis, although there are a large number of gotchas.</description>
      <content:encoded><![CDATA[<p><strong><em>UPDATE April 2018: Due to changes Facebook has made to the Graph API, the API will no longer return every post as noted in this article</em></strong></p>
<p>One of the first data scrapers I wrote for the purpose of statistical analysis was a Facebook Graph API scraper, in order to determine <a href="http://minimaxir.com/2013/06/big-social-data/">which words are the most important</a> in a Facebook Page status update. However, the v2.0 update to the Facebook API unsurprisingly broke the scraper.</p>
<p>Now that <a href="https://developers.facebook.com/blog/post/2015/07/08/graph-api-v2.4/">v2.4 of the Graph API is released</a>, I gave the Facebook Graph API another look. Turns out, it&rsquo;s pretty easy to scrape and make into a spreadsheet for easy analysis, although as with any other scraper, there are a large number of gotchas.</p>
<h2 id="feasibility">Feasibility</h2>
<figure>

    <img loading="lazy" srcset="/2015/07/facebook-scraper/nyt_sample_hu_be4aa5299896f220.webp 320w,/2015/07/facebook-scraper/nyt_sample.png 484w" src="nyt_sample.png"/> 
</figure>

<p>In order to determine if I can sanely scrape a website, I have to do a bit of research. How much data from a Facebook status update can we actually scrape?</p>
<p>Fortunately, Facebook&rsquo;s <a href="https://developers.facebook.com/docs/graph-api/reference">Graph API documentation</a> is pretty good. We need data from the <a href="https://developers.facebook.com/docs/graph-api/reference/page">/page</a> node, and from there, we can access data from the <a href="https://developers.facebook.com/docs/graph-api/reference/v2.4/page/feed">/feed</a> edge.</p>
<p>Between the two nodes, we have access to <code>id</code>, which is a unique identifier that can be used to create a link back to the update itself (e.g. <a href="https://www.facebook.com/5281959998_10150628170209999">https://www.facebook.com/5281959998_10150628170209999</a>); <code>message</code>, the text of the update; <code>link</code>, the URL which the update is linking; <code>name</code>, the title of the webpage of the link; <code>type</code>, an identifier if the update is text, a photo, or a video; and <code>created_time</code>, when the update is published.</p>
<p>Accessing the numerical counts of <code>likes</code>, <code>comments</code>, and <code>shares</code> is less explicit in the documentation. Fortunately, <a href="http://stackoverflow.com/questions/6984526/facebook-graph-api-get-like-count-on-page-group-photos">StackOverflow</a> has the answer: you need to request <code>likes.limit(1).summary(true)</code> instead of normal <code>likes</code>.</p>
<p>There&rsquo;s no indication that there&rsquo;s a Rate Limit, oddly. Since we can query 100 updates at a time, the scraper will be efficient enough that it&rsquo;s unlikely to hit any extreme API limits.</p>
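<p>Putting those pieces together, a request URL for the v2.4 feed edge might be built like the Python sketch below. The function name is hypothetical, and using <code>comments.limit(1).summary(true)</code> is an assumption by analogy with the <code>likes</code> trick:</p>

```python
from urllib.parse import urlencode


def build_feed_url(page_id, access_token, num_statuses=100):
    """Build a Graph API v2.4 feed URL that asks for the fields listed above."""
    fields = ",".join([
        "message", "link", "created_time", "type", "name", "id",
        "likes.limit(1).summary(true)",      # total like count only
        "comments.limit(1).summary(true)",   # assumed to mirror the likes trick
        "shares",
    ])
    query = urlencode({
        "fields": fields,
        "limit": num_statuses,               # 100 updates per request
        "access_token": access_token,
    })
    return "https://graph.facebook.com/v2.4/%s/feed?%s" % (page_id, query)


# "XXXXX" is a placeholder, not a real token.
print(build_feed_url("nytimes", "XXXXX"))
```

<p><code>urlencode</code> percent-encodes the commas and parentheses, which keeps the long field list safe to send as a single query parameter.</p>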
<p>Now that we know we can get all the relevant data from the sample status update, we can build a Facebook post scraper.</p>
<h2 id="data-scrappy">Data Scrappy</h2>
<figure>

    <img loading="lazy" srcset="/2015/07/facebook-scraper/def_test_hu_4c390ad24cef76dd.webp 320w,/2015/07/facebook-scraper/def_test.png 425w" src="def_test.png"/> 
</figure>

<p><em>I have created an <a href="https://github.com/minimaxir/facebook-page-post-scraper/blob/master/examples/how_to_build_facebook_scraper.ipynb">IPython notebook hosted on GitHub</a> with detailed code, code comments, and sample output for each step of the scraper development. I strongly recommend giving it a look.</em></p>
<p>First, we need to see how to actually access the API. It&rsquo;s no longer a public API, and it requires user authentication via <a href="https://developers.facebook.com/docs/facebook-login/access-tokens">access tokens</a>. Users can get Short-Term tokens, but as their name suggests, they expire quickly, so they are not recommended. The Graph API allows a neat trick: by concatenating the App ID from a user-created App and the App Secret, you create an access token which never expires. Of course, this is a major security risk, so create a separate app for the sole purpose of scraping, and reset your App Secret if it becomes known.</p>
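<p>In Python, that concatenation might look like the sketch below. The pipe separator reflects my understanding of Facebook&rsquo;s app access token format, and the function name and credential values are placeholders:</p>

```python
def make_app_access_token(app_id, app_secret):
    """Join the App ID and App Secret with a pipe to form an app access token."""
    return "%s|%s" % (app_id, app_secret)


# Placeholder credentials; never commit real ones to a public repository.
access_token = make_app_access_token("123456789", "abcdef0123456789")
print(access_token)
```

<p>Because the token embeds the App Secret verbatim, anyone who sees the token can act as the app, which is why a throwaway app is the safer choice.</p>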
<p>Let&rsquo;s say we want to scrape the New York Times&rsquo; Facebook page. We would send a request to <a href="https://graph.facebook.com/v2.4/nytimes?access_token=XXXXX">https://graph.facebook.com/v2.4/nytimes?access_token=XXXXX</a> and we would get:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="s2">&#34;5281959998&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;The New York Times&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>i.e., the page metadata. Sending a request to /nytimes/feed results in what we want:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;data&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;created_time&#34;</span><span class="p">:</span> <span class="s2">&#34;2015-07-20T01:25:01+0000&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="s2">&#34;5281959998_10150628157724999&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;message&#34;</span><span class="p">:</span> <span class="s2">&#34;The planned megalopolis, a metropolitan area that would be about 6 times the size of New York\u2019s, is meant to revamp northern China\u2019s economy and become a laboratory for modern urban growth.&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;created_time&#34;</span><span class="p">:</span> <span class="s2">&#34;2015-07-19T22:55:01+0000&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="s2">&#34;5281959998_10150628161129999&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;message&#34;</span><span class="p">:</span> <span class="s2">&#34;\&#34;It\u2019s safe to say that federal agencies are not where we want them to be across the board,\&#34; said President Barack Obama&#39;s top cybersecurity adviser. \&#34;We clearly need to be moving faster.\&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Now we get the post data. But not much of it. In Graph API v2.4, the default behavior is to return very, very little metadata for statuses in order to reduce bandwidth, with the expectation that the user will request the necessary fields.</p>
<p>So let&rsquo;s request <em>all</em> the fields we want. This results in a very long URL not shown here which causes the posts feed to have all the data we need:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;comments&#34;</span><span class="p">:</span> <span class="p">{},</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;summary&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;order&#34;</span><span class="p">:</span> <span class="s2">&#34;ranked&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;total_count&#34;</span><span class="p">:</span> <span class="mi">31</span>
</span></span><span class="line"><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;created_time&#34;</span><span class="p">:</span> <span class="s2">&#34;2015-07-20T01:25:01+0000&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="s2">&#34;5281959998_10150628157724999&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;likes&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;data&#34;</span><span class="p">:</span> <span class="p">{},</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;summary&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;total_count&#34;</span><span class="p">:</span> <span class="mi">278</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;link&#34;</span><span class="p">:</span> <span class="s2">&#34;http://nyti.ms/1Jr6LhU&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;message&#34;</span><span class="p">:</span> <span class="s2">&#34;The planned megalopolis, a metropolitan area that would be about 6 times the size of New York’s, is meant to revamp northern China’s economy and become a laboratory for modern urban growth.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;China Molds a Supercity Around Beijing, Promising to Change Lives&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;shares&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;count&#34;</span><span class="p">:</span> <span class="mi">50</span>
</span></span><span class="line"><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;link&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><h2 id="post-processing">Post Processing</h2>
<p>Great! Now we just have to process each post. Which is easier said than done.</p>
<p>If you&rsquo;re an avid Facebook user, you know that not all of these attributes are guaranteed to exist. Status updates may not have text or links. Since we&rsquo;re making a spreadsheet with an enforced schema, we need to validate that a field exists before attempting to process it.</p>
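<p>One way to do that validation in Python is to fall back to a default whenever a key is missing. This is a minimal sketch; the function name, column order, and defaults are assumptions, and the field names follow the sample response above:</p>

```python
def process_status(status):
    """Flatten one status dict into a fixed-width row, tolerating missing keys."""
    # Text fields default to an empty string when absent.
    message = status.get("message", "")
    link_name = status.get("name", "")
    link = status.get("link", "")
    status_type = status.get("type", "")
    created_time = status.get("created_time", "")

    # Nested counts default to 0 when any level of the structure is absent.
    num_likes = status.get("likes", {}).get("summary", {}).get("total_count", 0)
    num_comments = status.get("comments", {}).get("summary", {}).get("total_count", 0)
    num_shares = status.get("shares", {}).get("count", 0)

    return (status["id"], message, link_name, status_type, link,
            created_time, num_likes, num_comments, num_shares)


# A text-free update with likes but no link, comments, or shares.
row = process_status({
    "id": "5281959998_1",
    "created_time": "2015-07-20T01:25:01+0000",
    "likes": {"summary": {"total_count": 278}},
})
print(row)
```

<p>Every row has the same nine columns regardless of which fields the API returned, so the CSV schema stays consistent.</p>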
<p>The &ldquo;\u2019&rdquo;s in the message correspond to a <a href="http://smartquotesforsmartpeople.com/">smart quote</a> apostrophe. Since this is a possibility, along with other Unicode characters, the message and link names must be encoded <a href="https://en.wikipedia.org/wiki/UTF-8">in UTF-8</a> to prevent errors.</p>
<p>The time format is another issue. The date follows the <a href="https://en.wikipedia.org/wiki/ISO_8601">ISO 8601</a> standard for UTC times. However, most spreadsheet programs will not be able to parse it as a Date value. Also, since the NYT is based in the USA (specifically, New York), it may be helpful for time-based statistical analysis to convert the time to Eastern Standard Time while fixing the date format.</p>
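<p>A minimal Python sketch of that conversion is below. It assumes a fixed UTC-5 offset for Eastern Standard Time and deliberately ignores daylight saving; the function name and output format are also assumptions:</p>

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Eastern Standard Time; daylight saving is ignored here.
EST = timezone(timedelta(hours=-5), "EST")


def to_eastern(created_time):
    """Convert a Graph API timestamp to a spreadsheet-friendly EST string."""
    utc_time = datetime.strptime(created_time, "%Y-%m-%dT%H:%M:%S%z")
    return utc_time.astimezone(EST).strftime("%Y-%m-%d %H:%M:%S")


print(to_eastern("2015-07-20T01:25:01+0000"))  # prints 2015-07-19 20:25:01
```

<p>Note that the conversion can change the calendar date, as it does for this sample post, which matters when grouping posts by day.</p>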
<p>There&rsquo;s also an unexpected precaution that must be taken whenever scraping data sets. These APIs do not expect users to be accessing very, very old data. As a result, there&rsquo;s a high probability of the API server actually hitting an error sometime during the scrape, such as an <a href="http://www.checkupdown.com/status/E500.html">HTTP Status 500</a> or <a href="http://www.checkupdown.com/status/E502.html">HTTP Status 502</a>. These server errors are temporary, so a helper function must be used to attempt to retrieve data until it is actually successful.</p>
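<p>A retry helper along those lines might look like the following sketch. The name <code>request_until_succeed</code>, the wait time, and the cap on attempts are all assumptions for illustration; the <code>fetch</code> parameter exists only so the helper can be exercised without a network connection.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import time
import urllib.error
import urllib.request


def request_until_succeed(url, fetch=urllib.request.urlopen, wait_seconds=5, max_tries=10):
    """Call fetch(url) repeatedly until it stops raising a URL/HTTP error."""
    for attempt in range(1, max_tries + 1):
        try:
            return fetch(url)
        except urllib.error.URLError as error:
            # HTTPError (e.g. 500 or 502) is a subclass of URLError, so both land here
            print("Attempt", attempt, "failed for", url, "with", error)
            time.sleep(wait_seconds)
    raise RuntimeError("Gave up on " + url + " after " + str(max_tries) + " tries")
</code></pre></div>
<p>Capping the number of attempts is a design choice in this sketch: it keeps a permanently broken URL from stalling the scrape forever, at the cost of occasionally giving up too early.</p>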
<h2 id="putting-it-all-together">Putting it All Together</h2>
<p>Now that we have a full plan for scraping, we query each page of Facebook Page Statuses (100 statuses maximum per page), process all statuses on that page and write the output to a CSV file, navigate to the next page, and repeat until no statuses are left.</p>
<p>This can be done with a for-loop within a while loop. In addition, I also recommend counting the number of posts processed and taking a timestamp every so often to ensure that the program has not stalled.</p>
<figure>

    <img loading="lazy" srcset="/2015/07/facebook-scraper/cnnwoo_hu_b9bce875c2ad5b8e.webp 320w,/2015/07/facebook-scraper/cnnwoo.png 461w" src="cnnwoo.png"/> 
</figure>
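<p>The for-loop-within-a-while-loop structure can be sketched as follows. This is a simplified illustration, not the scraper itself: <code>scrape_all_pages</code>, <code>fetch_page</code>, and the two-column output are hypothetical, and the fake pages stand in for real API responses, which are assumed to carry a <code>data</code> list and an optional <code>paging.next</code> link.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import csv
import datetime
import io


def scrape_all_pages(first_url, fetch_page, out_file, report_every=1000):
    """Write one CSV row per status, following 'next' links until none remain."""
    writer = csv.writer(out_file)
    writer.writerow(["status_id", "message"])
    num_processed = 0
    url = first_url
    while url is not None:
        page = fetch_page(url)  # expected to return the decoded JSON of one page
        for status in page.get("data", []):
            writer.writerow([status.get("id", ""), status.get("message", "")])
            num_processed += 1
            if num_processed % report_every == 0:
                print(num_processed, "statuses processed at", datetime.datetime.now())
        # The last page has no "next" link, which ends the while loop
        url = page.get("paging", {}).get("next")
    return num_processed


# Two fake pages keyed by URL, so the loop can be run without touching the API
fake_pages = {
    "page1": {"data": [{"id": "1", "message": "a"}, {"id": "2"}], "paging": {"next": "page2"}},
    "page2": {"data": [{"id": "3", "message": "c"}]},
}
buffer = io.StringIO()
total = scrape_all_pages("page1", fake_pages.get, buffer)
</code></pre></div>
<p>In a real run, <code>fetch_page</code> would wrap the retrying HTTP request and JSON decoding, and <code>out_file</code> would be an open CSV file rather than an in-memory buffer.</p>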

<p>And that&rsquo;s it! You can access the complete scraper in this GitHub repository, along with all other scripts mentioned in this article. Once you have the CSV file, you can import it into nearly every statistical program and have fun with it. <em>(You can download a .zip of the NYTimes data <a href="https://dl.dropboxusercontent.com/u/2017402/nytimes_facebook_statuses.zip">here</a> [4.6MB])</em></p>
<p>Say, for example, what would happen if we compared the Median Likes of the New York Times with a certain other journalistic website that&rsquo;s the master of social media?</p>
<figure>

    <img loading="lazy" srcset="/2015/07/facebook-scraper/nytimes_buzz_fb_hu_973cb9559a65b7f6.webp 320w,/2015/07/facebook-scraper/nytimes_buzz_fb_hu_a7d77e74cd25da77.webp 768w,/2015/07/facebook-scraper/nytimes_buzz_fb_hu_77b75413091ebb90.webp 1024w,/2015/07/facebook-scraper/nytimes_buzz_fb.png 1200w" src="nytimes_buzz_fb.png"/> 
</figure>

<p>There may be more practical reasons for analyzing data on Facebook Posts, such as quantifying the growth and success of your own page, or that of your competitors. But the data is easy to get and is very useful.</p>
<p>Although, in fairness, the scraper is not perfect and still has room for improvement. With CNN&rsquo;s Facebook Page post data, for example, somehow the scraper skips all posts from 2013. Although in that case, I blame Facebook.</p>
<hr>
<p><em>You can access all resources used in this blog post at this <a href="https://github.com/minimaxir/facebook-page-post-scraper">GitHub repository</a></em>.</p>
<p><em>If you haven&rsquo;t, I strongly recommend looking at the <a href="https://github.com/minimaxir/facebook-page-post-scraper/blob/master/how_to_build_facebook_scraper.ipynb">IPython Notebook</a> for more detailed coding methodology.</em></p>
<p><em>And, as an experiment, I&rsquo;ve made an <a href="https://github.com/minimaxir/facebook-page-post-scraper/blob/master/fb_page_data_analysis.ipynb">IPython notebook with the R kernel</a> showing how I made the NYT-BuzzFeed chart!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Locating All the Christmas Trees on Instagram</title>
      <link>https://minimaxir.com/2015/01/tree-time/</link>
      <pubDate>Thu, 01 Jan 2015 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2015/01/tree-time/</guid>
      <description>I downloaded &lt;em&gt;hundreds of thousands&lt;/em&gt; of #tree images and found 25,432 images which were taken on Christmas, have a #tree, and, most importantly, contain location data where the photo was taken.</description>
      <content:encoded><![CDATA[<p>Everyone enjoys taking photos of their Christmas trees, usually at their own home or that of their relatives. <a href="http://instagram.com/">Instagram</a> allows users to quickly upload any photo and share it socially to the world. On Christmas Eve, privacy author Tommy Collison <a href="http://www.tommycollison.com/blog/2014/12/24/christmas-geotagging">published a warning about this behavior</a>, noting that if a user tags a photo with #tree to tag their Christmas tree, for example, <em>anyone</em> will be able to see it, and if the user attached their location to the photo, anyone could theoretically find where they live.</p>
<p>How practical is this concern? Instagram <a href="http://instagram.com/developer/">offers an API</a> of all recent photos for a given #tag so developers can download pictures and their corresponding metadata, such as geolocation, in bulk. (Up to <em>165,000</em> Instagram images can be processed per hour!)</p>
<p>I downloaded <em>hundreds of thousands</em> of #tree images and found 25,432 images which were taken on Christmas, have a #tree, and, most importantly, contain location data where the photo was taken. From that, I created an <a href="https://www.google.com/fusiontables/DataSource?docid=1J3RQB6MuFbZvA_WcCVHlKAzDBUppxFBQ3LA054RL">interactive map</a> showing the location of all these images worldwide using <a href="https://support.google.com/fusiontables/answer/2571232?hl=en">Google Fusion Tables</a>. You can click-and-drag to move the map all over the world, and you can click on a marker on the map to see the Instagram image taken at that location! (Note that if you&rsquo;re on a mobile device, the map may work better in a desktop browser.)</p>
<p>I found a few interesting things while playing with this map.</p>
<h2 id="christmas-trees-in-the-usa">Christmas Trees in the USA</h2>
<p>A downside of the interactive map is that quantifying the relative number of photos between dense areas (e.g. cities) can be misleading as the opaque markers overlap. Here is a static map of all of the Instagram photos in the United States, with each translucent point representing an image:</p>
<figure>

    <img loading="lazy" srcset="/2015/01/tree-time/instagram_treemap_state_hu_310e6466cb4d56c6.webp 320w,/2015/01/tree-time/instagram_treemap_state_hu_5bd9d945a0e5960c.webp 768w,/2015/01/tree-time/instagram_treemap_state_hu_f34ea37f8f670d55.webp 1024w,/2015/01/tree-time/instagram_treemap_state.png 1500w" src="instagram_treemap_state.png"/> 
</figure>

<p>The number of photos is densest near the large cities, which is what you would expect.</p>
<p>A way to calculate the relative proportion of the number of #tree photos between states is to use a type of chart known as a <a href="http://en.wikipedia.org/wiki/Treemapping">treemap</a> (pun <em>very</em> much intended).</p>
<figure>

    <img loading="lazy" srcset="/2015/01/tree-time/treemap-state_hu_8bc2d93f35500ca9.webp 320w,/2015/01/tree-time/treemap-state_hu_1bd1917eb4e74033.webp 768w,/2015/01/tree-time/treemap-state_hu_c02e557e8ccd19cf.webp 1024w,/2015/01/tree-time/treemap-state.png 1200w" src="treemap-state.png"/> 
</figure>

<p>In this treemap, the relative area of each block corresponds to the number of photos taken in the state; therefore, the combination of all the blocks represents 100% of the #tree photos taken in the USA. If two blocks are the same size (e.g. New York and Florida), then they have the same number of #tree photos.</p>
<p>As you may have noticed, the data represented by these two charts is approximately the same as the <a href="http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population">population density in the United States</a>. Although this touches on the <a href="http://xkcd.com/1138/">infamous statistical problem</a> of heat maps resembling population maps, in this case, it&rsquo;s what would be expected.</p>
<p>Looking at all the #tree photos in the world may tell a different story.</p>
<h2 id="christmas-trees-in-the-world">Christmas Trees in the World</h2>
<p>Christmas is a holiday for only one religion with a <a href="http://en.wikipedia.org/wiki/Christianity_by_country">low presence in Asia and northern Africa</a>, so it would be expected that the locations of Christmas trees worldwide do <em>not</em> correlate with population, which makes the analysis more interesting.</p>
<figure>

    <img loading="lazy" srcset="/2015/01/tree-time/instagram_world_map_hu_f4d21dccf9584d5f.webp 320w,/2015/01/tree-time/instagram_world_map_hu_4b8e0faede23b8c4.webp 768w,/2015/01/tree-time/instagram_world_map_hu_c92b64f4e8488b9f.webp 1024w,/2015/01/tree-time/instagram_world_map.png 1770w" src="instagram_world_map.png"/> 
</figure>

<p>The prevalence of Christmas trees is most prominent in the United States and Europe, with relatively few in Asia, where the majority of the world&rsquo;s population is located. Italy has Christmas trees <em>uniformly</em> throughout the entire country, which is an interesting behavior.</p>
<figure>

    <img loading="lazy" srcset="/2015/01/tree-time/treemap-world_hu_2356b81e1b32b230.webp 320w,/2015/01/tree-time/treemap-world_hu_b16be6e56813d829.webp 768w,/2015/01/tree-time/treemap-world_hu_42fd89f2e00723c.webp 1024w,/2015/01/tree-time/treemap-world.png 1200w" src="treemap-world.png"/> 
</figure>

<p>The treemap confirms that Asian and African countries like China, India, and Nigeria do not have as many Christmas trees as <a href="http://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population">what their large populations would suggest</a>. Italy, however, has a <a href="http://en.wikipedia.org/wiki/Demographics_of_Italy">population of 60 million</a> (the same as the United Kingdom), which is about 1/5th of the population of the United States; the fact that Italy has more than half as many Christmas trees as the United States is very unusual and should be questioned.</p>
<p>Italy <em>may</em> have a high number of Christmas trees since Vatican City is the seat of the papacy, but perhaps the data <em>itself</em> should be questioned too.</p>
<h2 id="christmas-trees-in-the-world-1">&ldquo;Christmas Trees&rdquo; in the World</h2>
<p>If you check the photos in Italy, you may notice that many of them have a photo caption similar to this one:</p>
<figure>

    <img loading="lazy" srcset="/2015/01/tree-time/christmas_tagsforlikes_hu_1a3d8d47a4ba1921.webp 320w,/2015/01/tree-time/christmas_tagsforlikes.png 384w" src="christmas_tagsforlikes.png"/> 
</figure>

<p>There&rsquo;s obviously no Christmas tree in that photo. But there are a <em>lot</em> of tags.</p>
<p>Many Instagram photos use a service called <a href="http://www.tagsforlikes.com/">TagsForLikes</a>, which compiles a list of popular hashtags that other users are able to see. Users can then copy/paste them into the photo caption to spam hashtags and increase the photo&rsquo;s exposure, which, as I&rsquo;ve shown <a href="http://minimaxir.com/2014/03/hashtag-tag/">in a previous blog post</a>, does in fact increase the number of Likes the photo receives from other users.</p>
<figure>

    <img loading="lazy" srcset="/2015/01/tree-time/tags4likes_hu_10d67579073c03b7.webp 320w,/2015/01/tree-time/tags4likes.png 617w" src="tags4likes.png"/> 
</figure>

<p>Notice a resemblance between this list and the photo caption?</p>
<p>Fortunately, all the TagsForLikes hashtag lists contain #TagsForLikes as a branding trick, which makes such photos extremely easy to detect. Here&rsquo;s what the world map looks like if all the potentially spam photos were colored red:</p>
<figure>

    <img loading="lazy" srcset="/2015/01/tree-time/instagram_world_spamnonspam_hu_dc13dbebf48ccb41.webp 320w,/2015/01/tree-time/instagram_world_spamnonspam_hu_224cf95473e67e26.webp 768w,/2015/01/tree-time/instagram_world_spamnonspam_hu_447405a41aacbe52.webp 1024w,/2015/01/tree-time/instagram_world_spamnonspam.png 1770w" src="instagram_world_spamnonspam.png"/> 
</figure>
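<p>The #TagsForLikes check described above can be sketched in a few lines of Python. The analysis in this post was done in R, so this is only an illustration of the idea; the function name and the sample captions are made up.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def is_tagsforlikes_spam(caption):
    """Return True when the caption contains the #TagsForLikes hashtag (case-insensitive)."""
    if not caption:
        # Photos without a caption cannot carry the branding hashtag
        return False
    return "#tagsforlikes" in caption.lower()


# Hypothetical captions: only the second one carries the branding hashtag
captions = [
    "Merry Christmas! #tree #christmas",
    "#love #TagsForLikes #tree #instagood",
    "Our #tree this year",
]
flags = [is_tagsforlikes_spam(caption) for caption in captions]
spam_share = sum(flags) / len(flags)
</code></pre></div>
<p>A plain substring match is deliberately loose: it also flags longer hashtags that merely start with the same text, which seems acceptable when the goal is to mark photos as <em>potentially</em> spam.</p>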

<p>Italy looks a <em>lot</em> different now! There is red in other countries, but it&rsquo;s not easily visible at a glance.</p>
<p>The treemap of photos, when separated between spam and non-spam photos, tells the full story:</p>
<figure>

    <img loading="lazy" srcset="/2015/01/tree-time/treemap-spam-nonspam_hu_6c6548fc4a027e9a.webp 320w,/2015/01/tree-time/treemap-spam-nonspam_hu_3d902d5b2452b0ca.webp 768w,/2015/01/tree-time/treemap-spam-nonspam_hu_eaf60683eae8e729.webp 1024w,/2015/01/tree-time/treemap-spam-nonspam.png 1200w" src="treemap-spam-nonspam.png"/> 
</figure>

<p>About 20% of all the #tree photos are spam photos, and about half of those were taken by people in Italy. As a result, Italy has <em>more spam #tree photos than nonspam #tree photos!</em> This is an interesting cultural phenomenon, and I have no guesses as to why it occurs. All other countries have significantly fewer spam photos relative to non-spam photos.</p>
<p>For non-spam photos, the number of #tree photos in Italy now matches the number of non-spam photos in the UK, which correlates with their populations, making the removal of spam photos a sane move.</p>
<p>Is it possible to track people in their homes via Instagram tags? Definitely. If you do care about privacy and your Instagram account is not set to Private, I recommend not geotagging your photos. If you&rsquo;re interested in looking at such photos, ensure that the tag is relevant for the given photo.</p>
<hr>
<p><em>All graphics were generated using R. The maps were created using <a href="http://ggplot2.org/">ggplot2</a> and the world map provided with the <a href="http://cran.r-project.org/web/packages/rworldxtra/index.html">rworldxtra</a> package. The treemaps were created using the <a href="http://cran.r-project.org/web/packages/treemap/index.html">treemap</a> package.</em></p>
<p><em>The source data is included in a separate tab in Google Fusion Tables <a href="https://www.google.com/fusiontables/DataSource?docid=1J3RQB6MuFbZvA_WcCVHlKAzDBUppxFBQ3LA054RL">along with the interactive map</a>.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Who Performs the Best in Online Classes?</title>
      <link>https://minimaxir.com/2014/07/online-class-charts/</link>
      <pubDate>Thu, 10 Jul 2014 08:30:00 -0700</pubDate>
      <guid>https://minimaxir.com/2014/07/online-class-charts/</guid>
      <description>Which types of student characteristics lead to the best performance in online classes? That depends on how you define &amp;ldquo;performance.&amp;rdquo;</description>
      <content:encoded><![CDATA[<p>At the end of May, <a href="http://www.harvard.edu/">Harvard</a> and <a href="http://web.mit.edu/">MIT</a> jointly <a href="http://newsoffice.mit.edu/2014/mit-and-harvard-release-de-identified-learning-data-open-online-courses">released a dataset</a> containing statistics about their online courses in the Academic Year of 2013. This <a href="http://dx.doi.org/10.7910/DVN/26147">Person-Course De-Identified dataset</a> contains 641,138 events, chronicling 476,532 students who have taken up to 13 unique courses from a variety of topics.</p>
<p>However, this assortment of courses is not a substitution for a typical college education, as the vast majority of students only take one class.</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/student-courses_hu_1f69b16ff227e785.webp 320w,/2014/07/online-class-charts/student-courses_hu_ba64318b3158934a.webp 768w,/2014/07/online-class-charts/student-courses_hu_aa97227dda830e1.webp 1024w,/2014/07/online-class-charts/student-courses.png 1200w" src="student-courses.png"/> 
</figure>

<p>Very, very few students take more than one class (22% of all students).</p>
<p>Other interesting variables included with the data set are the level-of-education of the student, the birth date of the student, the gender of the student, and the geographical location of the student.</p>
<p>Which types of student characteristics lead to the best performance in these online classes? That depends on how you define &ldquo;performance.&rdquo;</p>
<p>Here&rsquo;s a look at the attendance of all the classes:</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/class-attendance_hu_309c150c193a7cf.webp 320w,/2014/07/online-class-charts/class-attendance_hu_b6e6a51f7fedf99e.webp 768w,/2014/07/online-class-charts/class-attendance_hu_1c38e8b439a9b471.webp 1024w,/2014/07/online-class-charts/class-attendance.png 1200w" src="class-attendance.png"/> 
</figure>

<p>Harvard&rsquo;s Introduction to Computer Science is by far the most popular class, and that&rsquo;s only accounting for one semester (it still beat MIT&rsquo;s Intro to CS/Programming, which had two semesters).</p>
<p>Each color bar indicates the level of student participation in the class. <strong>Registered</strong> means the student simply registered for the class; <strong>Viewed</strong> means that the student viewed some of the course material; <strong>Explored</strong> means that the student viewed at least half of the course material; and <strong>Certified</strong> means that the student completed the course and received a certificate of accomplishment. In the chart, it&rsquo;s interesting to note that most classes have a high percentage of people in each class who simply register just for fun and <em>don&rsquo;t actually do anything.</em></p>
<p>One metric of course success is the completion rate of each course, i.e. how many people actually complete the course after starting it. A frequent criticism of free online classes is that this number is very low, but what are the actual completion rates of these 13 classes?</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/class-perc-finished_hu_84a22119799a8548.webp 320w,/2014/07/online-class-charts/class-perc-finished_hu_14fd23dd414891c8.webp 768w,/2014/07/online-class-charts/class-perc-finished_hu_ce807a7001dd3f0a.webp 1024w,/2014/07/online-class-charts/class-perc-finished.png 1200w" src="class-perc-finished.png"/> 
</figure>

<p>Completion rates are low across the board, from 0.8% to 7.5%. A <a href="http://www.insidehighered.com/news/2013/05/10/new-study-low-mooc-completion-rates">2013 study</a> found that completion rate is less than 7%, which this data follows. Apparently, Challenges of Global Poverty is a very interesting course.</p>
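<p>For reference, a per-course completion rate can be computed as sketched below. The charts in this post were made in R, so this Python version is only illustrative; the rows are made up, the column names are assumed to mirror the dataset&rsquo;s participation flags, and using registrants as the denominator is an assumption about what &ldquo;starting&rdquo; a course means.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from collections import defaultdict


def completion_rates(rows):
    """Map each course_id to certified / registered, given dict rows with 0/1 flags."""
    registered = defaultdict(int)
    certified = defaultdict(int)
    for row in rows:
        registered[row["course_id"]] += row["registered"]
        certified[row["course_id"]] += row["certified"]
    # Skip courses with no registrants to avoid dividing by zero
    return {course: certified[course] / registered[course]
            for course in registered if registered[course] != 0}


# Hypothetical rows: one of four CourseA registrants earns a certificate
rows = [
    {"course_id": "CourseA", "registered": 1, "certified": 1},
    {"course_id": "CourseA", "registered": 1, "certified": 0},
    {"course_id": "CourseA", "registered": 1, "certified": 0},
    {"course_id": "CourseA", "registered": 1, "certified": 0},
    {"course_id": "CourseB", "registered": 1, "certified": 0},
]
rates = completion_rates(rows)
</code></pre></div>
<p>Swapping the denominator for a different participation flag, such as students who viewed the material, is a one-line change and would raise every rate.</p>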
<p>After the student completes a class, he or she receives a final grade, and if the grade is above the cutoff point (usually 50% - 80%), they will complete the class and receive the certificate. Here&rsquo;s the distribution of the average grades for students in each class, given that the student has successfully completed the class:</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/class-perc-grade_hu_3024a79b69a75b6a.webp 320w,/2014/07/online-class-charts/class-perc-grade_hu_d5a8ed7de8365738.webp 768w,/2014/07/online-class-charts/class-perc-grade_hu_58ae3c4aa5776ab.webp 1024w,/2014/07/online-class-charts/class-perc-grade.png 1200w" src="class-perc-grade.png"/> 
</figure>

<p>The grades vary significantly between classes. This doesn&rsquo;t necessarily imply that some classes are harder than others; it&rsquo;s possible some classes had lower cutoffs and therefore the effort put in by the students is lower. However, the perfect 100% average for Harvard&rsquo;s Intro to CS course implies that the class is pass/fail and that students can&rsquo;t get a final grade less than 100%. (as a result, we&rsquo;ll need to remove that class from future grade analyses in order to prevent bias due to the lack of score variance)</p>
<h2 id="age">Age</h2>
<p>Online classes are marketed toward college-age students.</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/student-age_hu_6807d79095a2ff9f.webp 320w,/2014/07/online-class-charts/student-age_hu_e18b3c21b40aee25.webp 768w,/2014/07/online-class-charts/student-age_hu_a6e0385b7db53e91.webp 1024w,/2014/07/online-class-charts/student-age.png 1200w" src="student-age.png"/> 
</figure>

<p>The mean age is about <strong>27.7</strong>, with a standard deviation of 8.89. The shape skews right and resembles a <a href="http://en.wikipedia.org/wiki/Gamma_distribution">gamma distribution</a>. There are even plenty of teenagers taking these classes too.</p>
<p>The proportion of students who take more than one class for each age group has a slightly flatter distribution, although more centralized at students in their 20&rsquo;s.</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/student-num-classes_hu_9c915570951ae0bc.webp 320w,/2014/07/online-class-charts/student-num-classes_hu_4cd1ab3cc2f0cccd.webp 768w,/2014/07/online-class-charts/student-num-classes_hu_64cbc61f2244451c.webp 1024w,/2014/07/online-class-charts/student-num-classes.png 1200w" src="student-num-classes.png"/> 
</figure>

<p>What about completion rate?</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/student-fulfillment_hu_815ed115d6a1ffa9.webp 320w,/2014/07/online-class-charts/student-fulfillment_hu_a2752617681d3a73.webp 768w,/2014/07/online-class-charts/student-fulfillment_hu_de7c359778e0ad44.webp 1024w,/2014/07/online-class-charts/student-fulfillment.png 1200w" src="student-fulfillment.png"/> 
</figure>

<p>Completion rate is a little more evenly distributed between the ages, indicating that the two metrics are likely uncorrelated. However, the older students have nearly double the completion rate of classes. (more on this later)</p>
<h2 id="education">Education</h2>
<p>One of the student-provided attributes is his or her highest level of education. These values are &ldquo;Less than Secondary,&rdquo; &ldquo;Secondary&rdquo;, &ldquo;Bachelor&rsquo;s&rdquo;, &ldquo;Master&rsquo;s&rdquo;, and &ldquo;Doctorate.&rdquo; Here&rsquo;s the distribution of the 409,601 students who provided their LoE:</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/education-count_hu_e516c9e258fac872.webp 320w,/2014/07/online-class-charts/education-count_hu_88815aeaf82ec8bc.webp 768w,/2014/07/online-class-charts/education-count_hu_4c7ca564ab53289b.webp 1024w,/2014/07/online-class-charts/education-count.png 1200w" src="education-count.png"/> 
</figure>

<p>This shows that the majority of students who are taking these online classes have already finished college and already obtained a degree, and therefore are likely taking the course as supplemental material.</p>
<p>We can correlate the levels of education with the average age for each LoE as a double-check.</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/education-age_hu_c6a3efb815a9cb41.webp 320w,/2014/07/online-class-charts/education-age_hu_f235f279c423cbcb.webp 768w,/2014/07/online-class-charts/education-age_hu_78613cdfee40ab3b.webp 1024w,/2014/07/online-class-charts/education-age.png 1200w" src="education-age.png"/> 
</figure>

<p>Indeed, each LoE average age is above the minimum expected threshold of time needed for each degree (~18 years for Secondary, ~22 years for Bachelor&rsquo;s, etc.)</p>
<p>How does the completion rate vary?</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/education-fulfillment_hu_9fd83dcbea2027f.webp 320w,/2014/07/online-class-charts/education-fulfillment_hu_896ebd06769a0a9e.webp 768w,/2014/07/online-class-charts/education-fulfillment_hu_40fb59a819fd2e28.webp 1024w,/2014/07/online-class-charts/education-fulfillment.png 1200w" src="education-fulfillment.png"/> 
</figure>

<p>Interestingly, those with Less than Secondary education have a much higher completion rate than those with Secondary educations and Bachelor&rsquo;s Degrees.</p>
<p>But are those with Doctorates smarter than those with lower education levels? Do they get higher grades on average?</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/education-grade_hu_c47a4403e95fee24.webp 320w,/2014/07/online-class-charts/education-grade_hu_609c06c2b1092a38.webp 768w,/2014/07/online-class-charts/education-grade_hu_8b5dacc02f4b7b4d.webp 1024w,/2014/07/online-class-charts/education-grade.png 1200w" src="education-grade.png"/> 
</figure>

<p>As it turns out, yes. The education level of a student is very relevant to his or her academic performance in online classes.</p>
<h2 id="gender">Gender</h2>
<p>I previously <a href="http://minimaxir.com/2014/07/gender-course/">made a blog post</a> discussing the genders of students and the proportion of students by course. As it turns out, gender is a very important part of the student body composition.</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/gender-count_hu_b414e31dcd2d8ca.webp 320w,/2014/07/online-class-charts/gender-count_hu_12994bfa71cba66a.webp 768w,/2014/07/online-class-charts/gender-count_hu_7824ce476f654770.webp 1024w,/2014/07/online-class-charts/gender-count.png 1200w" src="gender-count.png"/> 
</figure>

<p>425,108 students have a gender on record, with 311,534 male students (73.3%) and 113,571 female students (26.7%). The fact that it&rsquo;s not even close to 50:50 is surprising.</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/gender-age_hu_539f995018b912df.webp 320w,/2014/07/online-class-charts/gender-age_hu_bc95ac13a6e629a2.webp 768w,/2014/07/online-class-charts/gender-age_hu_958ea821aba2a296.webp 1024w,/2014/07/online-class-charts/gender-age.png 1200w" src="gender-age.png"/> 
</figure>

<p>Age wise, there&rsquo;s not much difference between the genders. Female students are about a year older than male students on average.</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/gender-completion_hu_9c7d417f25dda2e1.webp 320w,/2014/07/online-class-charts/gender-completion_hu_df99c0272323ff57.webp 768w,/2014/07/online-class-charts/gender-completion_hu_7f901b735c694e27.webp 1024w,/2014/07/online-class-charts/gender-completion.png 1200w" src="gender-completion.png"/> 
</figure>

<p>Completion rate varies a bit more. Female students finish classes more frequently than male students.</p>
<p>Now to answer the age-old question: who gets better grades, men or women?</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/gender-grade_hu_1c93bf06e894b04b.webp 320w,/2014/07/online-class-charts/gender-grade_hu_bcc0bf07a291db60.webp 768w,/2014/07/online-class-charts/gender-grade_hu_5cee9f59c053acd8.webp 1024w,/2014/07/online-class-charts/gender-grade.png 1200w" src="gender-grade.png"/> 
</figure>

<p>Ok, it&rsquo;s a tie.</p>
<h2 id="country">Country</h2>
<p>The home location of the students is, in my opinion, the most interesting student attribute in the data set. There are 24 different countries which were specified among 325,012 different students.</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/country-count_hu_249b43a42a00df0f.webp 320w,/2014/07/online-class-charts/country-count_hu_7a3b5a1a0883c878.webp 768w,/2014/07/online-class-charts/country-count_hu_953d18e6b6d1a170.webp 1024w,/2014/07/online-class-charts/country-count.png 1200w" src="country-count.png"/> 
</figure>

<p>The United States, as one of the more technologically-savvy countries, accounts for nearly 1/3rd of all online students. India has a very strong presence as well due to the combination of population and accessibility of materials. All the other countries have significantly fewer students.</p>
<p>Given such accessibility of online classes, students in some countries take more classes than others.</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/country-more-than-one_hu_bda45a1c1f50d29b.webp 320w,/2014/07/online-class-charts/country-more-than-one_hu_2e2cdce597206d16.webp 768w,/2014/07/online-class-charts/country-more-than-one_hu_f2ed6b9d4e00e5da.webp 1024w,/2014/07/online-class-charts/country-more-than-one.png 1200w" src="country-more-than-one.png"/> 
</figure>

<p>India&rsquo;s students also take the most classes on average.</p>
<p>The United States also has the oldest students on average among all the countries.</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/country-age_hu_1c71b549f970c024.webp 320w,/2014/07/online-class-charts/country-age_hu_32b19520dc1681a1.webp 768w,/2014/07/online-class-charts/country-age_hu_185d4bb0414d7b3f.webp 1024w,/2014/07/online-class-charts/country-age.png 1200w" src="country-age.png"/> 
</figure>

<p>Completion rate by country, however, is very, very significant.</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/country-completion_hu_526cedb1dfaaf5d8.webp 320w,/2014/07/online-class-charts/country-completion_hu_f6a6e7abe5fbc174.webp 768w,/2014/07/online-class-charts/country-completion_hu_db06854f718fda2e.webp 1024w,/2014/07/online-class-charts/country-completion.png 1200w" src="country-completion.png"/> 
</figure>

<p>The top seven countries with the highest rate of completion are <strong>all European countries</strong>, with Spain and Poland in particular having a higher rate of completion than all the other countries, and nearly <strong>four times</strong> the completion rate of the United States.</p>
<p>But which country has the students which get the best grades?</p>
<figure>

    <img loading="lazy" srcset="/2014/07/online-class-charts/country-grade_hu_3413485dc36590a8.webp 320w,/2014/07/online-class-charts/country-grade_hu_5ff553dbf0956728.webp 768w,/2014/07/online-class-charts/country-grade_hu_16b598e63ba06c34.webp 1024w,/2014/07/online-class-charts/country-grade.png 1200w" src="country-grade.png"/> 
</figure>

<p>China is at the top with 86%, but otherwise the distribution is fairly flat with no outliers, and no economic or demographic pattern to the top or bottom of the ranks. It&rsquo;s possible that the more structured format of the classes makes it harder to skew.</p>
<p>This Person-Course dataset has helped reveal many interesting insights about online classes at Harvard and MIT. Whether you measure performance by % of classes completed or a % score received at the end of the class, it&rsquo;s clear that age, country, and level of education all have a statistically significant impact on performance. The impact of the student&rsquo;s gender on performance, however, is more ambiguous and could use further analysis.</p>
<hr>
<ul>
<li><em>Data was processed using R and all charts were made using ggplot2.</em></li>
<li><em>You can view code necessary to reproduce these results AND reproduce all the charts in <a href="https://github.com/minimaxir/online-class-charts">this GitHub repository</a>. Since MIT/Harvard prevent redistribution of the dataset, you&rsquo;ll have to <a href="http://dx.doi.org/10.7910/DVN/26147">download the dataset</a> yourself.</em></li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Mapping San Francisco Locations Using Facebook Data</title>
      <link>https://minimaxir.com/2014/04/san-francisco/</link>
      <pubDate>Tue, 08 Apr 2014 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2014/04/san-francisco/</guid>
      <description>With the data on approximately 8,000 Facebook Places in San Francisco, we can map where San Franciscans are checking-in, and determine what types of locations they like to visit.</description>
      <content:encoded><![CDATA[<p>Statisticians like to use data from the <a href="https://www.census.gov/">United States Census</a> to plot interesting trends such as unemployment and population of regions across the country. However, such data is typically historical and not very robust.</p>
<p>Facebook, on the other hand, has collected a large amount of data through its <a href="https://www.facebook.com/about/location">Places product</a>. Facebook Places allows users to &ldquo;check in&rdquo; at a place such as <a href="https://www.facebook.com/pages/Dolores-Park/105687759464007">Dolores Park</a> in a manner similar to <a href="https://foursquare.com/">Foursquare</a>.</p>
<figure>

    <img loading="lazy" srcset="/2014/04/san-francisco/dolores_hu_12ede87833359ff4.webp 320w,/2014/04/san-francisco/dolores_hu_52114517e94363ec.webp 768w,/2014/04/san-francisco/dolores.png 1011w" src="dolores.png"/> 
</figure>

<p>Through clever use of Facebook&rsquo;s Graph API and FQL, I was able to retrieve the data on all Facebook Places in and around San Francisco, along with the number of check-ins at each Place. With the data on approximately 8,000 Facebook Places in San Francisco, we can map where San Franciscans are checking in, and determine what types of locations they like to visit.</p>
<!-- more -->
<p>First, let&rsquo;s look at the distribution of check-ins among Places in San Francisco:</p>
<figure>

    <img loading="lazy" srcset="/2014/04/san-francisco/sf-checkin-distribution_hu_3854c345ddda42aa.webp 320w,/2014/04/san-francisco/sf-checkin-distribution_hu_841db3f768d77d95.webp 768w,/2014/04/san-francisco/sf-checkin-distribution_hu_e7b23d21ac85d350.webp 1024w,/2014/04/san-francisco/sf-checkin-distribution.png 1200w" src="sf-checkin-distribution.png"/> 
</figure>

<p>On a logarithmic scale, the shape resembles a bell curve with a center at about 800 check-ins; however, the sample average of check-ins for the data set is <strong>3,241 check-ins</strong>, indicating that the data may be heavily skewed to the right. Some Places don&rsquo;t have many check-ins, while some Places have an incredibly large number of check-ins.</p>
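<p>To see why a mean far above the center of a log-scale histogram points to right skew, here is a minimal Python sketch. The check-in counts are made up for illustration and are not taken from the real data set; the post&rsquo;s own charts were made in R. It compares the arithmetic mean, the median, and the geometric mean (the average taken on a log scale).</p>

```python
import math
import statistics

# Hypothetical check-in counts, NOT the real data set: most Places are
# modest, and one Place is enormous.
checkins = [120, 300, 650, 800, 950, 1500, 4000, 155000]

# Arithmetic mean: pulled upward by the single huge Place.
mean = statistics.mean(checkins)

# Median: the middle of the sorted values, barely affected by the outlier.
median = statistics.median(checkins)

# Geometric mean: average the logarithms, then map back with exp().
# This corresponds to the visual center of a histogram drawn on a log axis.
log_values = [math.log(c) for c in checkins]
geo_mean = math.exp(statistics.mean(log_values))

print(f"mean:           {mean:,.0f}")
print(f"median:         {median:,.0f}")
print(f"geometric mean: {geo_mean:,.0f}")
```

<p>In this toy sample a single very large Place drags the arithmetic mean far above both the median and the geometric mean, which is the same pattern as a sample average of 3,241 sitting well above a log-scale center of about 800.</p>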
<p>What are the Places with hundreds of thousands of check-ins? What does the city look like with all these Places plotted on a map of San Francisco?</p>
<figure>

    <img loading="lazy" srcset="/2014/04/san-francisco/sf-checkin_hu_f626a76ff02105d8.webp 320w,/2014/04/san-francisco/sf-checkin_hu_d002d3105f0deace.webp 768w,/2014/04/san-francisco/sf-checkin.png 900w" src="sf-checkin.png"/> 
</figure>

<p><em><a href="https://www.dropbox.com/s/cykofhlyus8atgw/sf-bubble.pdf">(PDF of Map)</a></em></p>
<p>The most checked-in Places are, unsurprisingly, the famous tourist attractions of San Francisco, such as <a href="https://www.facebook.com/pages/ATT-Park/116440731717551">AT&amp;T Park</a> and <a href="https://www.facebook.com/UnionSquareSF">Union Square</a>. These Top Places are spread all over the city.</p>
<p>It&rsquo;s also clear where the densest areas are located in San Francisco. There are plainly visible lines of Places along southern Mission Street and Outer Sunset. SOMA and Richmond have a large number of Places as well.</p>
<p>And yes, 155,000 people really did check into <a href="https://www.facebook.com/pages/The-Cheesecake-Factory/107178672654937">The Cheesecake Factory</a>.</p>
<h2 id="what-types-of-locations-do-people-in-san-francisco-frequently-check-into">What types of locations do people in San Francisco frequently check into?</h2>
<p>Facebook also records the category of its Places, such as &ldquo;Bar&rdquo; and &ldquo;Restaurant&rdquo;. What are the most numerous types of locations in San Francisco?</p>
<figure>

    <img loading="lazy" srcset="/2014/04/san-francisco/sf-checkin-distribution-count_hu_f79cbd2abc63c07c.webp 320w,/2014/04/san-francisco/sf-checkin-distribution-count_hu_5ed2300b1287f7bc.webp 768w,/2014/04/san-francisco/sf-checkin-distribution-count_hu_6b0563296ce10551.webp 1024w,/2014/04/san-francisco/sf-checkin-distribution-count.png 1200w" src="sf-checkin-distribution-count.png"/> 
</figure>

<p>Local businesses are by far the most frequent, as it&rsquo;s a more generic classifier for a Place and can be applied to anything that does not have an explicit classifier (such as startups). I find it interesting that non-profit organizations are more numerous than shopping centers.</p>
<p>However, the quantity of a specific type of business does not necessarily mean that San Franciscans will check into that type of Place more often. Which type of Place, on average, receives the greatest number of check-ins?</p>
<figure>

    <img loading="lazy" srcset="/2014/04/san-francisco/sf-checkin-distribution-avg_hu_9d83d93564db88d4.webp 320w,/2014/04/san-francisco/sf-checkin-distribution-avg_hu_ebb677df847eeb2c.webp 768w,/2014/04/san-francisco/sf-checkin-distribution-avg_hu_12e33b3cf1fcb18d.webp 1024w,/2014/04/san-francisco/sf-checkin-distribution-avg.png 1200w" src="sf-checkin-distribution-avg.png"/> 
</figure>

<p>People are more likely to check into memorable Places and events than into Places they frequently visit, like Bars and Restaurants (twice as many people check into Attractions as into Clubs). Again, tourist attractions are the most popular, which is due to both the relatively low number of such Places and the extremely high number of check-ins at Places such as AT&amp;T Park.</p>
<p>The &ldquo;Lake&rdquo; corresponds to <a href="https://www.facebook.com/pages/Lake-Merced/113785468632283">Lake Merced</a>, if you&rsquo;re curious.</p>
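<p>The per-category average above is a simple group-and-divide calculation. Here is a minimal Python sketch of it, using hypothetical rows: the category names, the numbers, and the function name <code>average_checkins_by_category</code> are illustrative assumptions, not the real data set or the R code behind the charts.</p>

```python
from collections import defaultdict

# Hypothetical rows of (category, check-ins). These values are made up
# for illustration and do not come from the real data set.
places = [
    ("Attraction", 90000),
    ("Attraction", 30000),
    ("Bar", 2000),
    ("Bar", 4000),
    ("Bar", 3000),
    ("Restaurant", 1500),
    ("Restaurant", 2500),
]


def average_checkins_by_category(rows):
    """Return a dict mapping each category to its mean check-ins per Place."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for category, checkins in rows:
        totals[category] += checkins
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


averages = average_checkins_by_category(places)

# Rank categories from highest to lowest average.
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for category, avg in ranked:
    print(f"{category}: {avg:,.0f} check-ins per Place")
```

<p>A category with only a few Places but very large counts ranks first here, even though Bars are more numerous. The same mechanism explains why a small category can top a chart of averages.</p>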
<h2 id="which-san-francisco-neighborhoods-are-the-most-popular">Which San Francisco neighborhoods are the most popular?</h2>
<p>You saw earlier that the Facebook Places are concentrated in specific areas. Here&rsquo;s a map of San Francisco&rsquo;s neighborhoods, highlighted by the number of Places within:</p>
<figure>

    <img loading="lazy" srcset="/2014/04/san-francisco/sf-places-count_hu_a0cd8cba902f0960.webp 320w,/2014/04/san-francisco/sf-places-count_hu_1e7dbb5dad41ce5a.webp 768w,/2014/04/san-francisco/sf-places-count.png 900w" src="sf-places-count.png"/> 
</figure>

<p><em><a href="https://www.dropbox.com/s/63fgvcucvk9n9v3/sf-sum.pdf">(PDF of Map)</a></em></p>
<p>The neighborhoods with the most Places are, unsurprisingly, the trendy areas such as the Mission and SOMA. Additionally, those neighborhoods are larger in area than others, which may bias the results in their favor.</p>
<p>But are these large neighborhoods also the most active neighborhoods? Which neighborhoods have the most check-ins on average per Place within the neighborhood?</p>
<figure>

    <img loading="lazy" srcset="/2014/04/san-francisco/sf-places-avg_hu_b12109c74dbe9359.webp 320w,/2014/04/san-francisco/sf-places-avg_hu_a094e44cacbdee7d.webp 768w,/2014/04/san-francisco/sf-places-avg.png 900w" src="sf-places-avg.png"/> 
</figure>

<p><em><a href="https://www.dropbox.com/s/u613oh4fskjrkny/sf-avg.pdf">(PDF of Map)</a></em></p>
<p>The Embarcadero has by far the most check-ins on average, again due to both its small size and AT&amp;T Park. The other neighborhoods have averages that are much closer to one another. It&rsquo;s worth noting that the average number of check-ins is higher in neighborhoods adjacent to the San Francisco Bay, such as North Beach and Fisherman&rsquo;s Wharf: perhaps people check in more frequently when they have a good view.</p>
<p>Can these data and conclusions about San Francisco Facebook Places be extrapolated to other cities? I&rsquo;d argue yes: it makes logical sense that people check in more frequently at Places that are more significant, and it also makes sense that people frequently check into areas with many tourist attractions. Facebook data shows us cool trends that the United States Census data cannot.</p>
<hr>
<p><em>All charts and maps were made using R, ggplot2, and ggmap.</em></p>
<p><em>You can download a copy of the Facebook Places data set <a href="https://www.dropbox.com/s/6g6ap4poz1b2trs/sf-public.csv">here</a>. The place_id column corresponds to the San Francisco neighborhood where the Place is located.</em></p>
<p><em>Maps are from <a href="https://www.google.com/maps/">Google Maps</a>. San Francisco shape files for city neighborhoods are from <a href="https://data.sfgov.org/Service-Requests-311-/Neighborhoods/ejmn-jyk6">data.sfgov.org</a>. Additionally, thanks to <a href="http://www.reddit.com/r/dataisbeautiful/comments/223ubt/map_of_places_in_san_francisco_by_of_facebook/">/r/dataisbeautiful</a> for offering ideas for improvements to the maps.</em></p>
<p><em>You may notice that the check-in counts on some of the official Facebook Place Pages strongly disagree with the counts reported in my charts. The count reported on each affected page is nearly double the value reported via the API. I believe this is a bug on Facebook&rsquo;s end: see <a href="http://i.imgur.com/I1syBhR.png">this image</a>, in which the Graph Search autocomplete reports a different number of check-ins than the Place Page itself. In this case, I trust the data from the API.</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
