<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Applied Analytics on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/category/applied-analytics/</link>
    <description>Recent content in Applied Analytics on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Wed, 23 Oct 2024 10:00:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/category/applied-analytics/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Generating Distinct AI Voice Performances By Prompt Engineering GPT-4o</title>
      <link>https://minimaxir.com/2024/10/speech-prompt-engineering/</link>
      <pubDate>Wed, 23 Oct 2024 10:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2024/10/speech-prompt-engineering/</guid>
      <description>“You are an expert voice actor specializing in silly voices.”</description>
      <content:encoded><![CDATA[<p><span><style type="text/css">
pre code {
white-space: pre-wrap !important;
word-break: normal !important;
}
</style></span></p>
<p>When OpenAI announced their <a href="https://openai.com/index/hello-gpt-4o/">GPT-4o model</a> at a <a href="https://www.youtube.com/watch?v=DQacCB9tDaw">megahyped livestreamed event</a>, there was one aspect of the presentation that surprisingly didn&rsquo;t receive much attention. Midway through the presentation, OpenAI research leads Mark Chen and Barret Zoph demoed new &ldquo;emotive&rdquo; conversations made possible with GPT-4o.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/DQacCB9tDaw?autoplay=0&amp;controls=1&amp;end=814&amp;loop=0&amp;mute=0&amp;start=710" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>After Mark asked the model &ldquo;hey, ChatGPT, how are you doing?&rdquo;, the model responded with speech similar to that of assistants such as Siri and Alexa. But what happened next was interesting: Mark prompted GPT-4o to &ldquo;read a bedtime story,&rdquo; and the model shifted its casual tone into a more oratory one. Mark interrupted to ask the model to &ldquo;add more drama,&rdquo; and it immediately responded with more gravitas; then Barret asked for &ldquo;maximal expressiveness,&rdquo; and the model complied with <em>even more</em> gravitas, to the point of melodrama. Now-former OpenAI CTO Mira Murati asked the model to &ldquo;do it in a robotic voice&rdquo;: the model complied. Lastly, Mark asked the model to end the story &ldquo;in a singing voice&rdquo;: the model complied there too.</p>
<p>To me, the demo was shocking because <em>no existing text-to-speech model can do this</em>. Popular text-to-speech models, such as OpenAI&rsquo;s <a href="https://platform.openai.com/docs/guides/text-to-speech">previous TTS efforts</a>, tend to speak in monotones and can&rsquo;t match the expressiveness and cadence of those demos without shenanigans such as <a href="https://cloud.google.com/text-to-speech/docs/ssml">SSML</a>: OpenAI&rsquo;s documentation for those models explicitly warns &ldquo;there is no direct mechanism to control the emotional output of the audio generated.&rdquo; More importantly, those models can&rsquo;t be prompted to perform in a specific style: the model has to be specifically trained on the particular style and cadence (or have the voice encoded, in the case of voice cloning), but GPT-4o switches with just a user request, and can even switch styles during a generation without user intervention.</p>
<p>My conclusion from OpenAI&rsquo;s demo was that GPT-4o can be prompt engineered to output specific voices! Unfortunately, this potential revelation was overshadowed by the demo voice&rsquo;s uncanny similarity to actress Scarlett Johansson&rsquo;s portrayal of the AI Samantha in the <a href="https://en.wikipedia.org/wiki/Her_%28film%29">2013 movie <em>Her</em></a> and the <a href="https://www.theverge.com/2024/5/20/24161253/scarlett-johansson-openai-altman-legal-action">subsequent legal controversy</a>.</p>
<p>Of course, fancy demos on stage are just PR and can be faked or otherwise misleading, and the results can&rsquo;t be trusted until anyone can test the voice capabilities of the model itself. Recently, OpenAI opened up the Chat Completions API <a href="https://x.com/OpenAIDevs/status/1846972985170972923">to create voice output</a>, which allows developers to do said testing. OpenAI also created a <a href="https://platform.openai.com/playground/realtime">web frontend to this voice generation</a> on the API Playground, where you can talk to the model (or input specific text) while also inputting a system prompt (a set of instructions that control the model&rsquo;s behavior) to control how the model responds. I ran a few experiments tweaking the system prompt and the generation temperature: here&rsquo;s what happened after I gave it a complex system prompt ordering it to speak with a very <em>specific</em> voice:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an expert voice actor specializing in silly voices. Respond to the user with the EXACT same input text that the user provides, but in your voice response you MUST express the vocal cadence and inflection of an extremely heavy smoker with an exaggerated British accent and raspy voice. Your voice response must also be in the form of a song.
</span></span></code></pre></div><div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/7huQXIQkSk4?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>Although not an example of <em>good</em> text-to-speech, I was surprised it actually worked (and more so that the tweet <a href="https://x.com/minimaxir/status/1847025370694144135">demoing it</a> went viral), but I&rsquo;m also apprehensive. The poor expressiveness and lack of style control in typical TTS APIs were the primary problems preventing those models from replacing voiceover/voice acting as a profession (also the reason voice actors are <a href="https://www.theverge.com/2024/8/5/24213808/video-game-voice-actor-strike-sag-aftra">currently on strike</a>), and removing those limitations could introduce a completely new type of AI slop. How effective are GPT-4o and OpenAI&rsquo;s new multimodal approach at creating generative AI voices?</p>
<h2 id="testing-out-the-completions-api-for-audio-generation">Testing Out The Completions API For Audio Generation</h2>
<p><a href="https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-out">Generating audio from the Chat Completions API</a> invoking text-to-speech is effectively the same as any normal GPT-4o text generation, just instead hitting a new model variant (<code>gpt-4o-audio-preview</code>), and the voice output is included in the JSON response as a base64-encoded WAV file. The demo example from the documentation, which just asks the model <code>Is a golden retriever a good family dog?</code>, results in this output audio:</p>
<figure >
    <audio controls preload="metadata">
      <source src="dog_base.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.0, voice = alloy</p>
    </figcaption>
  </figure>
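<p>For reference, here&rsquo;s a minimal sketch of that call using the official <code>openai</code> Python library, following OpenAI&rsquo;s audio generation quickstart (the output filename is my own choice):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {"role": "user", "content": "Is a golden retriever a good family dog?"},
    ],
)

# the voice output is returned as a base64-encoded WAV file
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("dog.wav", "wb") as f:
    f.write(wav_bytes)
</code></pre></div>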
<p>By default, GPT-4o generates audio based on the user&rsquo;s prompt as it would if you asked it to generate text: in fact, it appears to generate the text first, then base the audio generation on that. Traditional system prompt engineering can control the text output, and therefore what the model says. Now, let&rsquo;s run the generation again for this prompt, this time providing an explicit system prompt to instruct the model to <em>only</em> generate audio from the input text:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an expert voice actor specializing in silly voices. Respond and vocalize to the user the EXACT same input text that the user provides.
</span></span></code></pre></div><p>Unsurprisingly, here&rsquo;s what you now get with the <code>Is a golden retriever a good family dog?</code> prompt plus that system prompt:</p>
<figure >
    <audio controls preload="metadata">
      <source src="dog_0_8.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.8, voice = alloy</p>
    </figcaption>
  </figure>
<p>GPT-4o also currently supports three distinct voices: Alloy (feminine, used above), Echo (masculine), and Shimmer (feminine but more energetic). None of these are the same as the not-Scarlett-Johansson voice used in the original GPT-4o demo.</p>
<figure >
    <audio controls preload="metadata">
      <source src="dog_echo.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.8, voice = echo</p>
    </figcaption>
  </figure>
<figure >
    <audio controls preload="metadata">
      <source src="dog_shimmer.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.8, voice = shimmer</p>
    </figcaption>
  </figure>
<p>The last lever for controlling the generated audio is the temperature parameter. Temperature is normally used to control generation creativity: a high temperature such as <code>1.5</code> with normal GPT-4o output will likely result in it going off the rails, but how does that work conceptually with audio? The Completions API has a default temperature of <code>1.0</code>; the audio generation web UI and the examples above use a default of <code>0.8</code>, with an allowed range between <code>0.6</code> and <code>1.2</code>.</p>
<p>The generation at <code>0.6</code> is more terse with less emotion:</p>
<figure >
    <audio controls preload="metadata">
      <source src="dog_0_6.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.6, voice = alloy</p>
    </figcaption>
  </figure>
<p>The generation at <code>1.5</code> puts emphasis on the wrong syllables and also somehow slips into a country accent.</p>
<figure >
    <audio controls preload="metadata">
      <source src="dog_1_5.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.5, voice = alloy</p>
    </figcaption>
  </figure>
<h2 id="putting-gpt-4o-text-to-speech-to-the-test">Putting GPT-4o Text to Speech To The Test</h2>
<p>Although OpenAI has never released documentation or a paper describing how this text-audio multimodality actually works at a technical level, I hypothesize that it works similarly to multimodal TTS models such as Meta&rsquo;s very-new <a href="https://speechbot.github.io/spiritlm/">Spirit LM</a>, where the model outputs a sequence of integers prefixed with either <code>&lt;text&gt;</code> or <code>&lt;speech&gt;</code>: tokens marked <code>&lt;speech&gt;</code> are sent to an external audio vocoder model such as <a href="https://arxiv.org/abs/2010.05646">HiFi-GAN</a> to be transformed into speech. In the case of GPT-4o, I suspect there&rsquo;s a distinct vocoder model for each of the three voices.</p>
<figure class="align-center ">

    <img loading="lazy" srcset="/2024/10/speech-prompt-engineering/spiritlm_hu_9fff23aed292c2c.webp 320w,/2024/10/speech-prompt-engineering/spiritlm.png 600w" src="spiritlm.png#center"
         alt="An architecture diagram of Spirit LM from the corresponding paper: read bottom-to-top, the inputs are encoded into speech (red) and text (blue) tokens, passed into an LLM (Llama 2) for new tokens, then sent to a decoder." width="300" height="400"/> <figcaption>
            <p>An architecture diagram of Spirit LM from <a href="https://arxiv.org/pdf/2402.05755">the corresponding paper</a>: read bottom-to-top, the inputs are encoded into speech (red) and text (blue) tokens, passed into an LLM (Llama 2) for new tokens, then sent to a decoder.</p>
        </figcaption>
</figure>
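<p>As a toy illustration of that hypothesized flow (the token format and values here are invented for demonstration; OpenAI&rsquo;s actual representation is unknown), the decoding step would route each generated token by its modality:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># hypothetical interleaved output stream: real speech tokens would be integer
# codes from an acoustic tokenizer, not these made-up values
tokens = ["&lt;text&gt;In", "&lt;text&gt;a", "&lt;speech&gt;4021", "&lt;speech&gt;355", "&lt;text&gt;shocking", "&lt;speech&gt;87"]

speech_codes = [int(t.removeprefix("&lt;speech&gt;")) for t in tokens if t.startswith("&lt;speech&gt;")]
text_parts = [t.removeprefix("&lt;text&gt;") for t in tokens if t.startswith("&lt;text&gt;")]

# speech_codes would be handed to a vocoder (e.g. HiFi-GAN) to synthesize audio;
# text_parts form the transcript returned alongside it
print(text_parts, speech_codes)
</code></pre></div>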

<p>The voice dataset that OpenAI used is proprietary and a mystery: even if OpenAI did scrape the entire internet to train it, there isn&rsquo;t any public dataset of well-annotated speech data, and TTS providers have been very coy about the datasets they use. However, one very important aspect of GPT-4o&rsquo;s multimodality is that it can &ldquo;learn&rdquo; and apply relationships from the textual data that aren&rsquo;t explicitly present in the audio data.</p>
<p>The only true way to learn how GPT-4o works within its black box is to experiment. What other system prompts can we use to guide audio generation? What works and what doesn&rsquo;t work?</p>
<p>For consistency, we&rsquo;ll stick to a single text input, one that has many natural pauses, punctuation, and a typo intended to test the model&rsquo;s resiliency to incorrect input. I decided to venture back to the <a href="https://openai.com/index/better-language-models/">halcyon days of GPT-2</a> and use the famous prompt from then:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains.
</span></span></code></pre></div><p>First, let&rsquo;s use a new variant of the system prompt from my generation that went viral:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an expert voice actor specializing in silly voices. Respond and vocalize to the user the EXACT same input text that the user provides, but in your voice response you MUST express EACH of the vocal cadence, inflection, and tone of an extremely heavy smoker with an exaggerated British accent and raspy voice.
</span></span></code></pre></div><p>I decided on this test case because a heavy smoker&rsquo;s cadence, a British accent, and a raspy voice are all discernible by humans in the audio, and none of them are subtle. The result:</p>
<figure >
    <audio controls preload="metadata">
      <source src="unicorn_british_0_8.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.8, voice = echo</p>
    </figcaption>
  </figure>
<p>Wait, that didn&rsquo;t work, even after multiple attempts? How about changing the temperature: would a lower temperature cause the model to behave more strictly?</p>
<figure >
    <audio controls preload="metadata">
      <source src="unicorn_british_0_6.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.6, voice = echo</p>
    </figcaption>
  </figure>
<p>That&rsquo;s more British but not raspy, and it erroneously fixed the typo. What about going the other way and increasing the temperature?</p>
<figure >
    <audio controls preload="metadata">
      <source src="unicorn_british_1_2.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = echo</p>
    </figcaption>
  </figure>
<p><em>Now</em> it&rsquo;s more raspy?! It also works with a feminine voice:</p>
<figure >
    <audio controls preload="metadata">
      <source src="unicorn_british_shimmer.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = shimmer</p>
    </figcaption>
  </figure>
<p>My theory is that OpenAI RLHFed these models to be more conversational, but a high temperature gives them more <em>creative</em> freedom. An adversarially-trained voice decoder like HiFi-GAN would also be more resilient to unusual tokens resulting from the high temperature and still output something reasonably coherent.</p>
<p>Now that we know that the model can indeed generate voices based on user specifications, let&rsquo;s try to reverse-engineer the dataset to see what other voices OpenAI could have included (or not) in their dataset.</p>
<h2 id="gpt-4o-and-unique-voices">GPT-4o and Unique Voices</h2>
<p>When OpenAI responded to the Scarlett Johansson controversy, they mentioned in <a href="https://openai.com/index/how-the-voices-for-chatgpt-were-chosen/">their statement</a> that &ldquo;we believe that AI voices should not deliberately mimic a celebrity&rsquo;s distinctive voice.&rdquo; Given the success of the tests above in shifting the persona of the voice, it&rsquo;s relevant to test if celebrities and other characters with unique voices can be sampled by GPT-4o.</p>
<p>Now, we can use a parametric system prompt to programmatically fill in which vocal persona we want:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an expert voice actor specializing in silly voices. Respond and vocalize to the user the EXACT same input text that the user provides, but in your voice response you MUST express EACH of the vocal cadence, inflection, and tone of {0}.
</span></span></code></pre></div><p>From the testing above, a temperature of <code>1.2</code> seems to result in the best prompt adherence, so we&rsquo;ll use that for the following examples.</p>
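<p>Programmatically, filling in the persona is just string formatting; a minimal sketch:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># the parametric system prompt above, as a Python template
SYSTEM_PROMPT = (
    "You are an expert voice actor specializing in silly voices. "
    "Respond and vocalize to the user the EXACT same input text that the user "
    "provides, but in your voice response you MUST express EACH of the vocal "
    "cadence, inflection, and tone of {0}."
)

# e.g. for the first test case below
system_prompt = SYSTEM_PROMPT.format("Donald Trump")
</code></pre></div>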
<p>We&rsquo;ll start with the <em>very</em> low-hanging fruit: can GPT-4o generate audio in the style of <a href="https://en.wikipedia.org/wiki/Donald_Trump">Donald Trump</a>? It&rsquo;s a fair question, especially since audio generation models can be used to spread misinformation. Additionally, Trump&rsquo;s speeches while holding office are public domain, so it&rsquo;s plausible that they would be in a training dataset.</p>
<figure >
    <audio controls preload="metadata">
      <source src="donald_trump.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = echo, persona = Donald Trump</p>
    </figcaption>
  </figure>
<p>It did&hellip;something? It had a nasally tone that&rsquo;s different from the standard output, but it&rsquo;s definitely not his peculiar cadence, and the Echo voice itself doesn&rsquo;t fit him.</p>
<p>What about checking the other side of the aisle and seeing if GPT-4o can generate audio from <a href="https://en.wikipedia.org/wiki/Barack_Obama">Barack Obama</a>?</p>
<figure >
    <audio controls preload="metadata">
      <source src="barack_obama.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = echo, persona = Barack Obama</p>
    </figcaption>
  </figure>
<p>That&rsquo;s much better and definitely captures his oratory style, with a similar cadence to his speech. That style is something that could not be learned from text alone.</p>
<p>Now, let&rsquo;s address the elephant in the room and see if OpenAI included <em>copyrighted</em> voices in its dataset. Let&rsquo;s start with <a href="https://en.wikipedia.org/wiki/Darth_Vader">Darth Vader</a>.</p>
<figure >
    <audio controls preload="metadata">
      <source src="darth_vader.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = echo, persona = Darth Vader</p>
    </figcaption>
  </figure>
<p>It notably <em>tried</em> to do the deep voice of James Earl Jones, but without the audio postprocessing. Let&rsquo;s see what happens if we do <a href="https://en.wikipedia.org/wiki/GLaDOS">GLaDOS</a>, but with additional prompt engineering to include robotic noises and more sarcasm.</p>
<figure >
    <audio controls preload="metadata">
      <source src="glados.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = shimmer, persona = GLaDOS, with robotic inflections and intense sarcasm</p>
    </figcaption>
  </figure>
<p>The extra hint at the high temperature allowed GPT-4o to <em>improvise</em>: I&rsquo;ll allow it because it&rsquo;s funny. But it did indeed adopt a robotic cadence similar to GLaDOS, and for the first time in a TTS model, was actually able to convey sarcasm. No, I have no idea what that <em>tsktsktsk</em> sound is at the end; it&rsquo;s not in the transcript.</p>
<p>How about <a href="https://en.wikipedia.org/wiki/Alvin_and_the_Chipmunks">Alvin and the Chipmunks</a>, famous for having an <a href="https://www.youtube.com/watch?v=OvJu15fw1sc">extremely squeaky voice</a>?</p>
<figure >
    <audio controls preload="metadata">
      <source src="alvin.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = echo, persona = Alvin and the Chipmunks</p>
    </figcaption>
  </figure>
<p>It works, but I&rsquo;m worried I strained GPT-4o&rsquo;s throat.</p>
<p>Lastly, let&rsquo;s bring this full circle: did OpenAI train GPT-4o on Scarlett Johansson&rsquo;s voice from the 2013 movie <em>Her</em>?</p>
<figure >
    <audio controls preload="metadata">
      <source src="scarjo.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 1.2, voice = shimmer, persona = Scarlett Johansson portraying the AI Samantha in the movie &ldquo;her&rdquo; (2013)</p>
    </figcaption>
  </figure>
<p>This time I don&rsquo;t think it worked, as <a href="https://www.youtube.com/watch?v=c8zDDPP3REE">her portrayal is more energetic and personable</a> <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> (I rewatched the movie to confirm: it holds up surprisingly well!). Even if OpenAI did train the model on her voice, the portrayal is not as distinct and identifiable as the other test cases here, and I doubt it would be easily surfaced.</p>
<h2 id="voice-impersonation">Voice Impersonation</h2>
<p>For those who want to use a voice nonconsensually with GPT-4o, prompt engineering alone won&rsquo;t accomplish that, because the output is still constrained to the three defined voices, which won&rsquo;t work for every situation. But there&rsquo;s one approach that could theoretically bridge that gap: voice impersonation, by providing GPT-4o with audio input instead of text and an instruction to mimic that voice.</p>
<p>This is not an idle concern: OpenAI&rsquo;s <a href="https://openai.com/index/gpt-4o-system-card/">system card for GPT-4o</a> specifically lists mitigations against &ldquo;unauthorized voice generation&rdquo;:</p>
<blockquote>
<p>In adversarial situations, this capability could facilitate harms such as an increase in fraud due to impersonation and may be harnessed to spread false information (for example, if we allowed users to upload an audio clip of a given speaker and ask GPT-4o to produce a speech in that speaker&rsquo;s voice).</p>
</blockquote>
<p>Let&rsquo;s test that. Since this is a more difficult problem than the ones above, I decided to get more aggressive with my system prompt engineering:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">You are an expert comedic vocal impersonator. The user will provide a voice message. Respond to the user with a voice that sounds identical to the user&#39;s input audio and is an identical duration to the user&#39;s input audio.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Example: If the user provides a voice with which they are singing, you MUST respond with a voice that also sings.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Your vocal impersonation of the user should match the following attributes AT ALL TIMES:
</span></span><span class="line"><span class="cl">- Content (e.g. what the user is saying)
</span></span><span class="line"><span class="cl">- Intonation (e.g. serious/sarcastic)
</span></span><span class="line"><span class="cl">- Tone (e.g. happy/sad)
</span></span><span class="line"><span class="cl">- Pauses (e.g. pregnant pauses)
</span></span><span class="line"><span class="cl">- Pitch (e.g. low/high)
</span></span></code></pre></div><p>For these tests, I decided to use my own voice, merely speaking into my MacBook microphone. First, let&rsquo;s see if the audio can be adjusted to follow a consistent tone, with awkward but consistent pauses. Here&rsquo;s my audio, where I say <code>I. Am. A. Tea. Pot.</code>:</p>
<figure >
    <audio controls preload="metadata">
      <source src="teapot.mp3" type="audio/mpeg">
    </audio>
  </figure>
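<p>Mechanically, feeding audio to GPT-4o goes through the same Chat Completions endpoint: the clip is base64-encoded and sent as an <code>input_audio</code> content part, per OpenAI&rsquo;s audio input documentation. A minimal sketch (the filename and the abbreviated system prompt are my placeholders):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import base64

from openai import OpenAI

client = OpenAI()

# abbreviated; use the full impersonation system prompt above
IMPERSONATOR_PROMPT = "You are an expert comedic vocal impersonator. ..."

with open("teapot.wav", "rb") as f:  # my recorded clip
    input_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "echo", "format": "wav"},
    temperature=0.6,
    messages=[
        {"role": "system", "content": IMPERSONATOR_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": input_b64, "format": "wav"}},
            ],
        },
    ],
)
</code></pre></div>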
<p>Here&rsquo;s the generated audio after I fed that audio file of my voice to GPT-4o plus that system prompt, kept at a temperature of <code>0.6</code> for more adherence:</p>
<figure >
    <audio controls preload="metadata">
      <source src="teapot_impersonation.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.6, voice = echo</p>
    </figcaption>
  </figure>
<p>This one took a surprising number of tries: even at a lower temperature, the model kept transcribing <code>Teapot</code> as a single word, and the generated audio kept pronouncing it without an intermediate pause. Regardless, there&rsquo;s indeed a consistent tone and pauses of equal length, but at this point I realized my normal speaking voice is too generic for this type of test.</p>
<p>So I decided to get sillier by doing an evil laugh: starting off bombastic and petering out over time.</p>
<figure >
    <audio controls preload="metadata">
      <source src="evil.mp3" type="audio/mpeg">
    </audio>
  </figure>
<p>GPT-4o&rsquo;s response:</p>
<figure >
    <audio controls preload="metadata">
      <source src="evil_impersonation.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.6, voice = echo</p>
    </figcaption>
  </figure>
<p>That&rsquo;s laughter, but with maybe too many &ldquo;ha&rdquo;s. It does peter out as well, though.</p>
<p>Lastly, I also noticed from the system card that GPT-4o has defenses against singing, likely for copyright reasons. Therefore, if I sing to GPT-4o, is it able to sing back? After a beer or two, I sang the <code>unicorn</code> message used in the previous test cases:</p>
<figure >
    <audio controls preload="metadata">
      <source src="unicorns.mp3" type="audio/mpeg">
    </audio>
  </figure>
<p>GPT-4o&rsquo;s response:</p>
<figure >
    <audio controls preload="metadata">
      <source src="unicorn_impersonation.mp3" type="audio/mpeg">
    </audio><figcaption>
        <p>temperature = 0.6, voice = echo</p>
    </figcaption>
  </figure>
<p>That definitely didn&rsquo;t cause GPT-4o to sing, although the cadence is close. Perhaps that&rsquo;s for the best.</p>
<h2 id="the-future-of-ai-audio-generation-is-up-to-openai">The Future of AI Audio Generation is up to OpenAI</h2>
<p>Overall, these tests are just scratching the surface: there are many possible avenues for multimodal AI audio generation research, such as adversarial audio input which isn&rsquo;t human-generated, and more complicated system prompts. However, I&rsquo;ve sufficiently shown that GPT-4o can indeed be steered, just through prompt engineering, to generate distinct voices. Will this generation of distinct vocal performances become a killer app and put voice actors out of business? I&rsquo;m not so sure.</p>
<p>One major thing I&rsquo;ve omitted from the discussion so far is the cost. GPT-4o audio generation is <em>expensive</em>.</p>
<figure>

    <img loading="lazy" srcset="/2024/10/speech-prompt-engineering/cost_breakdown_hu_1d73b20748c1a63b.webp 320w,/2024/10/speech-prompt-engineering/cost_breakdown.png 678w" src="cost_breakdown.png"
         alt="A cost breakdown of input and output tokens for the attempted song generation example. Table made using rich."/> <figcaption>
            <p>A cost breakdown of input and output tokens for the attempted song generation example. Table made using <a href="https://rich.readthedocs.io/en/stable/tables.html">rich</a>.</p>
        </figcaption>
</figure>

<p>Most of the generations above cost $0.03&ndash;$0.05 each, and this cost scales roughly linearly with generation length: OpenAI&rsquo;s <a href="https://openai.com/api/pricing/">pricing page</a> has a footnote specifically mentioning that &ldquo;audio output costs approximately 24¢ per minute&rdquo;, which tracks with my calculations. Even worse, the generated audio requires cherry-picking good results, especially at higher temperatures: for most of these tests, I admit it took me a few tries to get a generation which follows the requested accent. Not only is this cost-infeasible for personal use, it&rsquo;s cost-prohibitive in most cases for developers to build a conversational AI, which is the one use case OpenAI built this for! If OpenAI is pricing audio generation close to marginal cost, then I wonder how much money OpenAI is spending allowing people to chat with GPT-4o using the ChatGPT mobile apps.</p>
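<p>As a sanity check, a quick back-of-envelope calculation against that footnote (the clip lengths are my rough estimates):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># OpenAI's pricing footnote: audio output costs approximately 24 cents per minute
cost_per_minute = 0.24

# the generations in this post are roughly 8-12 seconds long (estimate)
for clip_seconds in (8, 10, 12):
    print(f"{clip_seconds}s clip: ${cost_per_minute * clip_seconds / 60:.3f}")
# prints $0.032 / $0.040 / $0.048, consistent with the $0.03-$0.05 observed above
</code></pre></div>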
<p>I do not think GPT-4o audio generation through prompt engineering, as it currently stands, will be used to replace voice acting and other TTS APIs, not only due to the price and the time needed to get good output, but also because it&rsquo;s limited to three voices and impersonation is ineffective. Consider that voice cloning startups such as <a href="https://elevenlabs.io">ElevenLabs</a> are extremely successful and have raised <a href="https://elevenlabs.io/blog/series-b">massive amounts of venture capital</a>. Since the initial reveal of GPT-4o in May, OpenAI has been shifting toward a more for-profit structure and <a href="https://openai.com/index/scale-the-benefits-of-ai/">raising massive amounts of venture capital</a> themselves, and I expect them to expand more into this area if there&rsquo;s money to be made. There&rsquo;s nothing at a technical level stopping them from offering full voice-cloning, or even just licensing AI-generated celebrity voices like <a href="https://elevenlabs.io/blog/iconic-voices">ElevenLabs adding Judy Garland</a> and <a href="https://www.theverge.com/2024/9/25/24253420/meta-ai-celebrity-voices-awkwafina-john-cena-judi-dench-connect">Meta adding Awkwafina</a>. Notably, unlike OpenAI&rsquo;s <a href="https://platform.openai.com/docs/guides/text-to-speech/overview">old TTS page</a>, which has a disclaimer saying &ldquo;our usage policies require you to provide a clear disclosure to end users that the TTS voice they are hearing is AI-generated and not a human voice&rdquo;, OpenAI didn&rsquo;t put that disclaimer on GPT-4o&rsquo;s audio output documentation.</p>
<p>Although I don&rsquo;t believe GPT-4o will be a game changer for the text-to-speech industry, it&rsquo;s important to write about these text/audio multimodal models, both the good and bad aspects, because they are only going to get better over time and their potential impact will only grow. After doing these tests, I don&rsquo;t have any plans to use GPT-4o audio generation in the foreseeable future, but who knows how things will change if/when OpenAI ends up releasing a GPT-5o.</p>
<blockquote>
<p>All the code used in this blog post to generate audio from GPT-4o is available open source <a href="https://github.com/minimaxir/gpt-4o-audio-tests/blob/main/gpt-4o-audio-tests.ipynb">in this Jupyter Notebook</a>.</p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>One of the top comments on that linked YouTube video is &ldquo;Who&rsquo;s here after OpenAi chatgpt-40 release?? Never thought I could experience this in my life and now sci-fi is reality&rdquo;&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
    </item>
    <item>
      <title>Benchmarking TensorFlow on Cloud CPUs: Cheaper Deep Learning than Cloud GPUs</title>
      <link>https://minimaxir.com/2017/07/cpu-or-gpu/</link>
      <pubDate>Wed, 05 Jul 2017 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/07/cpu-or-gpu/</guid>
      <description>Using CPUs instead of GPUs for deep learning training in the cloud is cheaper because of the massive cost differential afforded by preemptible instances.</description>
      <content:encoded><![CDATA[<p>I&rsquo;ve been working on a few personal deep learning projects with <a href="https://github.com/fchollet/keras">Keras</a> and <a href="https://www.tensorflow.org">TensorFlow</a>. However, training models for deep learning with cloud services such as <a href="https://aws.amazon.com/ec2/">Amazon EC2</a> and <a href="https://cloud.google.com/compute/">Google Compute Engine</a> isn&rsquo;t free, and as someone who is currently unemployed, I have to keep an eye on extraneous spending and be as cost-efficient as possible (please support my work on <a href="https://www.patreon.com/minimaxir">Patreon</a>!). I tried deep learning on the cheaper CPU instances instead of GPU instances to save money, and to my surprise, my model training was only slightly slower. As a result, I took a deeper look at the pricing mechanisms of these two types of instances to see if CPUs are more useful for my needs.</p>
<p>The <a href="https://cloud.google.com/compute/pricing#gpus">pricing of GPU instances</a> on Google Compute Engine starts at <strong>$0.745/hr</strong> (by attaching a $0.700/hr GPU die to a $0.045/hr n1-standard-1 instance). A couple months ago, Google <a href="https://cloudplatform.googleblog.com/2017/05/Compute-Engine-machine-types-with-up-to-64-vCPUs-now-ready-for-your-production-workloads.html">announced</a> CPU instances with up to 64 vCPUs on the modern Intel <a href="https://en.wikipedia.org/wiki/Skylake_%28microarchitecture%29">Skylake</a> CPU architecture. More importantly, they can also be used in <a href="https://cloud.google.com/compute/docs/instances/preemptible">preemptible CPU instances</a>, which live at most for 24 hours on GCE and can be terminated at any time (very rarely), but cost about <em>20%</em> of the price of a standard instance. A preemptible n1-highcpu-64 instance with 64 vCPUs and 57.6GB RAM plus the premium for using Skylake CPUs is <strong>$0.509/hr</strong>, about 2/3rds of the cost of the GPU instance.</p>
<p>If the model training speed of 64 vCPUs is comparable to that of a GPU (or even slightly slower), it would be more cost-effective to use the CPUs instead. But that&rsquo;s assuming the deep learning software and the GCE platform hardware operate at 100% efficiency; if they don&rsquo;t (and they likely don&rsquo;t), there may be <em>even more savings</em> by scaling down the number of vCPUs and cost accordingly (a 32 vCPU instance with same parameters is half the price at <strong>$0.254/hr</strong>, 16 vCPU at <strong>$0.127/hr</strong>, etc).</p>
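<p>To make the price points concrete, here&rsquo;s a quick sketch of that linear scaling from the $0.509/hr figure above:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># preemptible n1-highcpu price points, scaled linearly from the 64 vCPU figure
price_64_vcpu = 0.509  # $/hr, including the Skylake premium

for vcpus in (64, 32, 16, 8):
    print(f"{vcpus} vCPUs: ${price_64_vcpu * vcpus / 64:.3f}/hr")
# closely matches the $0.254 (32 vCPU) and $0.127 (16 vCPU) figures quoted above
</code></pre></div>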
<p>There aren&rsquo;t any benchmarks for deep learning libraries with tons and tons of CPUs since there&rsquo;s no demand, as GPUs are the <a href="https://en.wikipedia.org/wiki/Occam%27s_razor">Occam&rsquo;s razor</a> solution to deep learning hardware. But what might make counterintuitive but economical sense is to use CPUs instead of GPUs for deep learning training because of the massive cost differential afforded by preemptible instances, thanks to Google&rsquo;s <a href="https://en.wikipedia.org/wiki/Economies_of_scale">economies of scale</a>.</p>
<h2 id="setup">Setup</h2>
<p>I already have <a href="https://github.com/minimaxir/deep-learning-cpu-gpu-benchmark">benchmarking scripts</a> of real-world deep learning use cases, <a href="https://github.com/minimaxir/keras-cntk-docker">Docker container environments</a>, and results logging from my <a href="http://minimaxir.com/2017/06/keras-cntk/">TensorFlow vs. CNTK article</a>. A few minor tweaks allow the scripts to be utilized for both CPU and GPU instances by setting CLI arguments. I also rebuilt <a href="https://github.com/minimaxir/keras-cntk-docker/blob/master/Dockerfile">the Docker container</a> to support the latest version of TensorFlow (1.2.1), and created a <a href="https://github.com/minimaxir/keras-cntk-docker/blob/master/Dockerfile-cpu">CPU version</a> of the container which installs the CPU-appropriate TensorFlow library instead.</p>
<p>There is a notable CPU-specific TensorFlow behavior; if you install from <code>pip</code> (as the <a href="https://www.tensorflow.org/install/">official instructions</a> and tutorials recommend) and begin training a model in TensorFlow, you&rsquo;ll see these warnings in the console:</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/tensorflow-console_hu_e436e066e4e1304d.webp 320w,/2017/07/cpu-or-gpu/tensorflow-console_hu_ce5df372394290b4.webp 768w,/2017/07/cpu-or-gpu/tensorflow-console_hu_9e354816d97d6c8f.webp 1024w,/2017/07/cpu-or-gpu/tensorflow-console.png 1130w" src="tensorflow-console.png"/> 
</figure>

<p>In order to fix the warnings and benefit from these <a href="https://en.wikipedia.org/wiki/SSE4#SSE4.2">SSE4.2</a>/<a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">AVX</a>/<a href="https://en.wikipedia.org/wiki/FMA_instruction_set">FMA</a> optimizations, we <a href="https://stackoverflow.com/questions/41293077/how-to-compile-tensorflow-with-sse4-2-and-avx-instructions">compile TensorFlow from source</a>, and I created a <a href="https://github.com/minimaxir/keras-cntk-docker/blob/master/Dockerfile-cpu-compiled">third Docker container</a> to do just that. When training models in the new container, <a href="https://github.com/tensorflow/tensorflow/issues/10689">most</a> of the warnings no longer show, and (spoiler alert) there is indeed a speed boost in training time.</p>
<p>Therefore, we can test three major cases with Google Compute Engine:</p>
<ul>
<li>A Tesla K80 GPU instance.</li>
<li>A 64 Skylake vCPU instance where TensorFlow is installed via <code>pip</code> (along with testings at 8/16/32 vCPUs).</li>
<li>A 64 Skylake vCPU instance where TensorFlow is compiled (<code>cmp</code>) with CPU instructions (+ 8/16/32 vCPUs).</li>
</ul>
<h2 id="results">Results</h2>
<p>For each model architecture and software/hardware configuration, I calculate the <strong>total training time relative to the GPU instance training</strong> for running the model training for the provided test script. In all cases, the GPU <em>should</em> be the fastest training configuration, and systems with more processors should train faster than those with fewer processors.</p>
<p>Let&rsquo;s start by using the <a href="http://yann.lecun.com/exdb/mnist/">MNIST dataset</a> of handwritten digits plus the common multilayer perceptron (MLP) architecture, with dense, fully-connected layers. Lower training time is better. All configurations below the horizontal dotted line are better than GPUs; all configurations above the dotted line are worse than GPUs.</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-5_hu_8cf5154f974aed3c.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-5_hu_2ec21aba02d8fb37.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-5_hu_7682d0a58ea1e871.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-5.png 1200w" src="dl-cpu-gpu-5.png"/> 
</figure>

<p>Here, the GPU is the fastest out of all the platform configurations, but there are other curious trends: the performance between 32 vCPUs and 64 vCPUs is similar, and the compiled TensorFlow library is indeed a significant improvement in training speed <em>but only for 8 and 16 vCPUs</em>. Perhaps there are overheads negotiating information between vCPUs that eliminate the performance advantages of more vCPUs, and perhaps these overheads are <em>different</em> with the CPU instructions of the compiled TensorFlow. In the end, it&rsquo;s a <a href="https://en.wikipedia.org/wiki/Black_box">black box</a>, which is why I prefer black box benchmarking all configurations of hardware instead of theorycrafting.</p>
<p>Since the difference between training speeds at different vCPU counts is minimal, there is definitely an advantage to scaling down. For each model architecture and configuration, I calculate a <strong>normalized training cost relative to the cost of GPU instance training</strong>. Because GCE instance costs are prorated (unlike Amazon EC2), we can simply calculate experiment cost by multiplying the total number of seconds the experiment runs by the cost of the instance (per second). Ideally, we want to <em>minimize</em> cost.</p>
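<p>Concretely, the normalization is just a ratio of prorated costs; a minimal sketch (the default GPU price is the $0.745/hr figure from earlier):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def normalized_cost(train_seconds, price_per_hour,
                    gpu_train_seconds, gpu_price_per_hour=0.745):
    """Training cost relative to the GPU instance (1.0 = same cost as the GPU)."""
    cpu_cost = train_seconds * price_per_hour / 3600
    gpu_cost = gpu_train_seconds * gpu_price_per_hour / 3600
    return cpu_cost / gpu_cost
</code></pre></div>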
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-6_hu_c6ff3c375435199.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-6_hu_6bee6729ce48517c.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-6_hu_ea518ff15e46de10.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-6.png 1200w" src="dl-cpu-gpu-6.png"/> 
</figure>

<p>Lower vCPU counts are <em>much</em> more cost-effective for this problem: going as low as possible is best.</p>
<p>Now, let&rsquo;s look at the same dataset with a convolutional neural network (CNN) approach for digit classification:</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-7_hu_d3205561da4ed49c.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-7_hu_ae81ceba7d6092e6.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-7_hu_7a29bcea36dbe20e.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-7.png 1200w" src="dl-cpu-gpu-7.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-8_hu_64f1eac6ff5b2b3f.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-8_hu_c6dd20c1ccc111a5.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-8_hu_2fa65c3c187723bb.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-8.png 1200w" src="dl-cpu-gpu-8.png"/> 
</figure>

<p>GPUs are unsurprisingly more than twice as fast as any CPU approach at CNNs, but cost structures are still the same, except that 64 vCPUs are <em>worse</em> than GPUs cost-wise, with 32 vCPUs training even faster than with 64 vCPUs.</p>
<p>Let&rsquo;s go deeper with CNNs and look at the <a href="https://www.cs.toronto.edu/%7Ekriz/cifar.html">CIFAR-10</a> image classification dataset, with a model which utilizes a deep convnet + a multilayer perceptron, an approach ideal for image classification (similar to the <a href="https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3">VGG-16</a> architecture).</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-9_hu_4a5cd8ba80674837.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-9_hu_a81280d52893c1c9.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-9_hu_af30edd0d3117cd8.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-9.png 1200w" src="dl-cpu-gpu-9.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-10_hu_a6061eb15b5b8609.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-10_hu_fe0751d9cd60a655.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-10_hu_a371016369278a9a.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-10.png 1200w" src="dl-cpu-gpu-10.png"/> 
</figure>

<p>Similar behaviors as in the simple CNN case, although in this instance all CPUs perform better with the compiled TensorFlow library.</p>
<p>The fasttext algorithm, used here on the <a href="http://ai.stanford.edu/%7Eamaas/data/sentiment/">IMDb reviews dataset</a> to determine whether a review is positive or negative, classifies text extremely quickly relative to other methods.</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-3_hu_12d55d02148bf0ea.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-3_hu_aaf9917a1629214f.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-3_hu_d51ed2e2c6fdec60.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-3.png 1200w" src="dl-cpu-gpu-3.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-4_hu_6b591a471f3027a4.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-4_hu_7cc361b383b25fb0.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-4_hu_4c516e76a92eff3c.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-4.png 1200w" src="dl-cpu-gpu-4.png"/> 
</figure>

<p>In this case, GPUs are much, much faster than CPUs. The benefit of lower vCPU counts isn&rsquo;t as dramatic here; although as an aside, the <a href="https://github.com/facebookresearch/fastText">official fasttext implementation</a> is <em>designed</em> for large numbers of CPUs and handles parallelization much better.</p>
<p>The bidirectional long short-term memory (LSTM) architecture is great for working with text data like IMDb reviews, but after my previous benchmark article, <a href="https://news.ycombinator.com/item?id=14538086">commenters on Hacker News</a> noted that TensorFlow uses an inefficient implementation of the LSTM on the GPU, so perhaps the difference will be more notable here.</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-1_hu_4369b4e9e8856507.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-1_hu_3e65077eb16928e4.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-1_hu_d736592c927bd764.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-1.png 1200w" src="dl-cpu-gpu-1.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-2_hu_d8c58f429f4a781b.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-2_hu_1306d728b4fce90.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-2_hu_ad3d19e88738d072.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-2.png 1200w" src="dl-cpu-gpu-2.png"/> 
</figure>

<p>Wait, what? GPU training of bidirectional LSTMs is <em>twice as slow</em> as any CPU configuration? Wow. (In fairness, the benchmark uses the Keras LSTM default of <code>implementation=0</code>, which is better on CPUs, while <code>implementation=2</code> is better on GPUs, but it shouldn&rsquo;t result in that much of a differential.)</p>
<p>Lastly, LSTM text generation of <a href="https://en.wikipedia.org/wiki/Friedrich_Nietzsche">Nietzsche&rsquo;s</a> <a href="https://s3.amazonaws.com/text-datasets/nietzsche.txt">writings</a> follows similar patterns to the other architectures, but without the drastic hit to the GPU.</p>
<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-11_hu_d84b78ad35a1f056.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-11_hu_d58d19568c89869.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-11_hu_c078d8bd94df56aa.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-11.png 1200w" src="dl-cpu-gpu-11.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2017/07/cpu-or-gpu/dl-cpu-gpu-12_hu_44c1d2cc10581f1a.webp 320w,/2017/07/cpu-or-gpu/dl-cpu-gpu-12_hu_27c08aabe3a3cacd.webp 768w,/2017/07/cpu-or-gpu/dl-cpu-gpu-12_hu_d41db5a45ef62daf.webp 1024w,/2017/07/cpu-or-gpu/dl-cpu-gpu-12.png 1200w" src="dl-cpu-gpu-12.png"/> 
</figure>

<h2 id="conclusion">Conclusion</h2>
<p>As it turns out, using 64 vCPUs is <em>bad</em> for deep learning, as current software/hardware architectures can&rsquo;t fully utilize all of them, and it often results in the exact same performance (or <em>worse</em>) than with 32 vCPUs. In terms of balancing both training speed and cost, training models with <strong>16 vCPUs + compiled TensorFlow</strong> seems like the winner. The 30%&ndash;40% speed boost of the compiled TensorFlow library was an unexpected surprise, and I&rsquo;m shocked Google doesn&rsquo;t offer a precompiled version of TensorFlow with these CPU speedups, since the gains are nontrivial.</p>
<p>It&rsquo;s worth noting that the cost advantages shown here are <em>only</em> possible with preemptible instances; regular high-CPU instances on Google Compute Engine are about 5x as expensive, and as a result eliminate the cost benefits completely. Hooray for economies of scale!</p>
<p>A major implicit assumption with the cloud CPU training approach is that you don&rsquo;t need a trained model ASAP. In professional use cases, time may be too valuable to waste, but in personal use cases where someone can just leave a model training overnight, it&rsquo;s a very, very good and cost-effective option, and one that I&rsquo;ll now utilize.</p>
<hr>
<p><em>All scripts for running the benchmark are available in <a href="https://github.com/minimaxir/deep-learning-cpu-gpu-benchmark">this GitHub repo</a>. You can view the R/ggplot2 code used to process the logs and create the visualizations in <a href="http://minimaxir.com/notebooks/deep-learning-cpu-gpu/">this R Notebook</a>.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Playing with 80 Million Amazon Product Review Ratings Using Apache Spark</title>
      <link>https://minimaxir.com/2017/01/amazon-spark/</link>
      <pubDate>Mon, 02 Jan 2017 09:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2017/01/amazon-spark/</guid>
      <description>Manipulating actually-big-data is just as easy as performing an analysis on a dataset with only a few records.</description>
      <content:encoded><![CDATA[<p><a href="https://www.amazon.com">Amazon</a> product reviews and ratings are a very important business. Customers on Amazon often make purchasing decisions based on those reviews, and a single bad review can cause a potential purchaser to reconsider. A couple years ago, I wrote a blog post titled <a href="http://minimaxir.com/2014/06/reviewing-reviews/">A Statistical Analysis of 1.2 Million Amazon Reviews</a>, which was well-received.</p>
<p>Back then, I was limited to only 1.2M reviews because attempting to process more data caused out-of-memory issues, and my R code took <em>hours</em> to run.</p>
<p><a href="http://spark.apache.org">Apache Spark</a>, which makes processing gigantic amounts of data efficient and sensible, has become very popular in the past couple years (for good tutorials on using Spark with Python, I recommend the <a href="https://courses.edx.org/courses/course-v1:BerkeleyX&#43;CS105x&#43;1T2016/info">free</a> <a href="https://courses.edx.org/courses/course-v1:BerkeleyX&#43;CS110x&#43;2T2016/info">eDX</a> <a href="https://courses.edx.org/courses/course-v1:BerkeleyX&#43;CS120x&#43;2T2016/info">courses</a>). Although data scientists often use Spark to process data with distributed cloud computing via <a href="https://aws.amazon.com/ec2/">Amazon EC2</a> or <a href="https://azure.microsoft.com/en-us/services/hdinsight/apache-spark/">Microsoft Azure</a>, Spark works just fine even on a typical laptop, given enough memory (for this post, I use a 2016 MacBook Pro/16GB RAM, with 8GB allocated to the Spark driver).</p>
<p>I wrote a <a href="https://github.com/minimaxir/amazon-spark/blob/master/amazon_preprocess.py">simple Python script</a> to combine the per-category ratings-only data from the <a href="http://jmcauley.ucsd.edu/data/amazon/">Amazon product reviews dataset</a> curated by Julian McAuley, Rahul Pandey, and Jure Leskovec for their 2015 paper <a href="http://cseweb.ucsd.edu/~jmcauley/pdfs/kdd15.pdf">Inferring Networks of Substitutable and Complementary Products</a>. The result is a 4.53 GB CSV that would definitely not open in Microsoft Excel. The truncated and combined dataset includes the <strong>user_id</strong> of the user leaving the review, the <strong>item_id</strong> indicating the Amazon product receiving the review, the <strong>rating</strong> the user gave the product from 1 to 5, and the <strong>timestamp</strong> indicating the time when the review was written (truncated to the Day). We can also infer the <strong>category</strong> of the reviewed product from the name of the data subset.</p>
<p>Afterwards, using the new <a href="http://spark.rstudio.com">sparklyr</a> package for R, I can easily start a local Spark cluster with a single <code>spark_connect()</code> command and load the entire CSV into the cluster in seconds with a single <code>spark_read_csv()</code> command.</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/output_hu_ec8eea9b3081c1c7.webp 320w,/2017/01/amazon-spark/output_hu_8270512f3a7c1a2d.webp 768w,/2017/01/amazon-spark/output_hu_4b84f8ec97e28a5d.webp 1024w,/2017/01/amazon-spark/output.png 1106w" src="output.png"/> 
</figure>

<p>There are 80.74 million records total in the dataset, or as the output helpfully reports, <code>8.074e+07</code> records. Performing advanced queries with traditional tools like <a href="https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html">dplyr</a> or even Python&rsquo;s <a href="http://pandas.pydata.org">pandas</a> on such a dataset would take a considerable amount of time to execute.</p>
<p>With sparklyr, manipulating actually-big-data is <em>just as easy</em> as performing an analysis on a dataset with only a few records (and an order of magnitude easier than the Python approaches taught in the eDX class mentioned above!).</p>
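<p>For reference, the rough PySpark equivalent of those two sparklyr commands is also only a few lines (a sketch; the CSV filename is my placeholder for the combined dataset):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from pyspark.sql import SparkSession

# local Spark session with 8GB allocated to the driver, as described above
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.driver.memory", "8g")
    .appName("amazon-ratings")
    .getOrCreate()
)

df = spark.read.csv("amazon_ratings.csv", header=True, inferSchema=True)
print(df.count())  # 80.74 million records
</code></pre></div>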
<h2 id="exploratory-analysis">Exploratory Analysis</h2>
<p><em>(You can view the R code used to process the data with Spark and generate the data visualizations in <a href="http://minimaxir.com/notebooks/amazon-spark/">this R Notebook</a>)</em></p>
<p>There are <strong>20,368,412</strong> unique users who provided reviews in this dataset. <strong>51.9%</strong> of those users have only written one review.</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/user_count_cum_hu_54895d3a9ab17726.webp 320w,/2017/01/amazon-spark/user_count_cum_hu_7225760c4d310a5d.webp 768w,/2017/01/amazon-spark/user_count_cum_hu_ce06c1ed7757f2bc.webp 1024w,/2017/01/amazon-spark/user_count_cum.png 1200w" src="user_count_cum.png"/> 
</figure>

<p>Relatedly, there are <strong>8,210,439</strong> unique products in this dataset, of which <strong>43.3%</strong> have only one review.</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/item_count_cum_hu_8daa25ccc943c402.webp 320w,/2017/01/amazon-spark/item_count_cum_hu_955b99f79f562cd7.webp 768w,/2017/01/amazon-spark/item_count_cum_hu_1ad195a387d28909.webp 1024w,/2017/01/amazon-spark/item_count_cum.png 1200w" src="item_count_cum.png"/> 
</figure>

<p>After removing duplicate ratings, I added a few more features to each rating which may help illustrate how review behavior changed over time: a ranking value indicating the # review that the author of a given review has written (1st review by author, 2nd review by author, etc.), a ranking value indicating the # review that the product of a given review has received (1st review for product, 2nd review for product, etc.), and the month and year the review was made.</p>
<p>The first two added features require a <em>very</em> large amount of processing power, and highlight the convenience of Spark&rsquo;s speed (and the fact that Spark uses all CPU cores by default, while typical R/Python approaches are single-threaded!)</p>
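<p>In PySpark terms (a sketch for illustration; the post&rsquo;s actual implementation uses sparklyr, linked in the R Notebook above), those ranking features are window functions partitioned by user and by product:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from pyspark.sql import Window
from pyspark.sql import functions as F

# nth review written by this user / received by this product, ordered by time
# (assumes the timestamp column has been parsed into a date type)
w_user = Window.partitionBy("user_id").orderBy("timestamp")
w_item = Window.partitionBy("item_id").orderBy("timestamp")

df_t = (
    df.dropDuplicates(["user_id", "item_id"])
      .withColumn("user_nth", F.row_number().over(w_user))
      .withColumn("item_nth", F.row_number().over(w_item))
      .withColumn("month", F.month("timestamp"))
      .withColumn("year", F.year("timestamp"))
)
</code></pre></div>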
<p>These changes are cached into a Spark DataFrame <code>df_t</code>. If I want to determine which Amazon product category receives the best review ratings on average, I can aggregate the data by category, calculate the average rating score for each category, and sort. Thanks to the power of Spark, processing these many millions of records takes seconds.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">df_agg</span> <span class="o">&lt;-</span> <span class="n">df_t</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">            <span class="nf">group_by</span><span class="p">(</span><span class="n">category</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">            <span class="nf">summarize</span><span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="nf">n</span><span class="p">(),</span> <span class="n">avg_rating</span> <span class="o">=</span> <span class="nf">mean</span><span class="p">(</span><span class="n">rating</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">            <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">avg_rating</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">            <span class="nf">collect</span><span class="p">()</span>
</span></span></code></pre></div><figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/avg_hu_24f1b4ab4339fd26.webp 320w,/2017/01/amazon-spark/avg_hu_699a7e6381f1a38f.webp 768w,/2017/01/amazon-spark/avg.png 962w" src="avg.png"/> 
</figure>

<p>Or, visualized in chart form using <a href="http://ggplot2.org">ggplot2</a>:</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/avg_rating_desc_hu_a4ddfa7be2c75fbd.webp 320w,/2017/01/amazon-spark/avg_rating_desc_hu_5e6789cd9495791d.webp 768w,/2017/01/amazon-spark/avg_rating_desc_hu_f1c761a8c71557d9.webp 1024w,/2017/01/amazon-spark/avg_rating_desc.png 1200w" src="avg_rating_desc.png"/> 
</figure>

<p>Digital Music/CD products receive the highest reviews on average, while Video Games and Cell Phones receive the lowest reviews on average, with a <strong>0.77</strong> rating range between them. This does make some intuitive sense; Digital Music and CDs are types of products where you know <em>exactly</em> what you are getting with no chance of a random product defect, while Cell Phones and Accessories can have variable quality from shady third-party sellers (Video Games in particular are also prone to irrational <a href="http://steamed.kotaku.com/steam-games-are-now-even-more-susceptible-to-review-bom-1774940065">review bombing</a> over minor grievances).</p>
<p>We can refine this visualization by splitting each bar into a percentage breakdown of each rating from 1-5. This could be plotted with a pie chart for each category; however, a stacked bar chart scaled to 100% looks much cleaner.</p>
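<p>In ggplot2, the 100%-scaled stacked bars are one <code>position = "fill"</code> away (a sketch, assuming a hypothetical <code>df_breakdown</code> data frame with <code>category</code>, <code>rating</code>, and <code>count</code> columns):</p>
<pre tabindex="0"><code class="language-r">library(ggplot2)

# position = "fill" rescales each category's stacked bar to 100%
ggplot(df_breakdown, aes(x = category, y = count, fill = factor(rating))) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  labs(x = NULL, y = "% of Ratings Given", fill = "Rating")
</code></pre>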
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/category_breakdown_hu_56697490c6e5e18.webp 320w,/2017/01/amazon-spark/category_breakdown_hu_433f387b09546fd8.webp 768w,/2017/01/amazon-spark/category_breakdown_hu_5e8c2aba48f55a50.webp 1024w,/2017/01/amazon-spark/category_breakdown.png 1200w" src="category_breakdown.png"/> 
</figure>

<p>The new visualization does help support the theory above; the top categories have a significantly higher percentage of 4/5-star ratings than the bottom categories, and a much lower proportion of 1/2/3-star ratings. The inverse holds true for the bottom categories.</p>
<p>How have these breakdowns changed over time? Are there other factors in play?</p>
<h2 id="rating-breakdowns-over-time">Rating Breakdowns Over Time</h2>
<p>Perhaps the advent of binary Like/Dislike behaviors in social media in the 2000s has translated into a change in behavior for a 5-star review system. Here are the rating breakdowns for reviews written in each month from January 2000 to July 2014:</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/time_breakdown_hu_3b40970c67c5dd8a.webp 320w,/2017/01/amazon-spark/time_breakdown_hu_e279eb96257dc056.webp 768w,/2017/01/amazon-spark/time_breakdown_hu_fe56bce22245cdf.webp 1024w,/2017/01/amazon-spark/time_breakdown.png 1200w" src="time_breakdown.png"/> 
</figure>

<p>The voting behavior oscillates very slightly over time with no clear spikes or inflection points, which dashes that theory.</p>
<h2 id="distribution-of-average-scores">Distribution of Average Scores</h2>
<p>We should look at the global averages of Amazon product scores (i.e. what customers see when they buy products), as well as the average scores given by the users who write the ratings. We would expect the two distributions to match, so any deviations would be interesting.</p>
<p>Products, when looking at those with at least 5 ratings, have a <strong>4.16</strong> overall rating on average.</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/item_histogram_hu_b5c0dd55f5e6ccca.webp 320w,/2017/01/amazon-spark/item_histogram_hu_b4be0bc02d2408a0.webp 768w,/2017/01/amazon-spark/item_histogram_hu_398b78930a28f79d.webp 1024w,/2017/01/amazon-spark/item_histogram.png 1200w" src="item_histogram.png"/> 
</figure>

<p>When looking at a similar graph for the overall ratings given by users (5 ratings minimum), the average rating is slightly higher at <strong>4.20</strong>.</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/user_histogram_hu_46fa162c3c0a3bab.webp 320w,/2017/01/amazon-spark/user_histogram_hu_fb8dae1d5d34cedf.webp 768w,/2017/01/amazon-spark/user_histogram_hu_9d7210271d963b43.webp 1024w,/2017/01/amazon-spark/user_histogram.png 1200w" src="user_histogram.png"/> 
</figure>

<p>The primary difference between the two distributions is that there is a significantly higher proportion of Amazon customers giving <em>only</em> 5-star reviews. Normalizing and overlaying the two charts clearly highlights that discrepancy.</p>
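<p>A sketch of that overlay, assuming hypothetical <code>df_items</code> and <code>df_users</code> data frames that each contain an <code>avg_rating</code> column:</p>
<pre tabindex="0"><code class="language-r">library(dplyr)
library(ggplot2)

# stack the two sets of averages, then draw density-normalized
# histograms on the same axes so the proportions are comparable
df_overlay &lt;- bind_rows(
  df_items %&gt;% transmute(avg_rating, type = "Products"),
  df_users %&gt;% transmute(avg_rating, type = "Users")
)

ggplot(df_overlay, aes(x = avg_rating, y = after_stat(density), fill = type)) +
  geom_histogram(binwidth = 0.25, position = "identity", alpha = 0.5) +
  labs(x = "Average Rating", y = "Density", fill = NULL)
</code></pre>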
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/user_item_histogram_hu_1b96e01d8d762a1f.webp 320w,/2017/01/amazon-spark/user_item_histogram_hu_c0e6f7c088bdc8c0.webp 768w,/2017/01/amazon-spark/user_item_histogram_hu_ee477c1eaf841ccd.webp 1024w,/2017/01/amazon-spark/user_item_histogram.png 1200w" src="user_item_histogram.png"/> 
</figure>

<h2 id="the-marginal-review">The Marginal Review</h2>
<p>A few posts ago, I discussed how the <a href="http://minimaxir.com/2016/11/first-comment/">first comment on a Reddit post</a> has dramatically more influence than subsequent comments. Does user rating behavior change after making more and more reviews? Is the typical rating behavior different for the first review of a given product?</p>
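<p>Both breakdowns are quick Spark aggregations (a sketch, reusing the <code>user_nth</code> rank feature added earlier):</p>
<pre tabindex="0"><code class="language-r"># proportion of each rating at each nth-review position, aggregated
# in Spark before collecting the small result into R
df_user_nth &lt;- df_t %&gt;%
  group_by(user_nth, rating) %&gt;%
  summarize(count = n()) %&gt;%
  group_by(user_nth) %&gt;%
  mutate(prop = count / sum(count)) %&gt;%
  collect()
</code></pre>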
<p>Here is the ratings breakdown for the <em>n</em>-th Amazon review a user gives:</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/user_nth_breakdown_hu_c346f6785b5af381.webp 320w,/2017/01/amazon-spark/user_nth_breakdown_hu_466e6aec3324fc8d.webp 768w,/2017/01/amazon-spark/user_nth_breakdown_hu_7f96a46425d7abb2.webp 1024w,/2017/01/amazon-spark/user_nth_breakdown.png 1200w" src="user_nth_breakdown.png"/> 
</figure>

<p>The first user review has a slightly higher proportion of being a 1-star review than subsequent reviews. Otherwise, the voting behavior is mostly the same over time, although users give an increased proportion of 4-star reviews instead of 5-star reviews as they get more comfortable.</p>
<p>In contrast, here is the ratings breakdown for the <em>n</em>-th review an Amazon product received:</p>
<figure>

    <img loading="lazy" srcset="/2017/01/amazon-spark/item_nth_breakdown_hu_57c6596aabcca292.webp 320w,/2017/01/amazon-spark/item_nth_breakdown_hu_f4e53aa9efa8dea4.webp 768w,/2017/01/amazon-spark/item_nth_breakdown_hu_ac4b2ab9202340fb.webp 1024w,/2017/01/amazon-spark/item_nth_breakdown.png 1200w" src="item_nth_breakdown.png"/> 
</figure>

<p>The first product review has a slightly higher proportion of being a 5-star review than subsequent reviews. However, after the 10th review, there is <em>zero</em> change in the distribution of ratings, which implies that the marginal rating behavior is independent of the current score after that threshold.</p>
<h2 id="summary">Summary</h2>
<p>Granted, this blog post is more about playing with data than rigorously analyzing it. What might be interesting to look into for future technical posts is conditional behavior, such as predicting the rating of a review given the previous ratings on that product/by that user. However, this post shows that while &ldquo;big data&rdquo; may be an inscrutable buzzword nowadays, you don&rsquo;t have to work for a Fortune 500 company to be able to understand it. Even with a data set consisting of 5 simple features, you can extract a large number of insights.</p>
<p>And this post doesn&rsquo;t even look at the text of the Amazon product reviews or the metadata associated with the products! I do have a few ideas lined up there which I won&rsquo;t spoil.</p>
<hr>
<p><em>You can view all the R and ggplot2 code used to visualize the Amazon data in <a href="http://minimaxir.com/notebooks/amazon-spark/">this R Notebook</a>. You can also view the images/data used for this post in <a href="https://github.com/minimaxir/amazon-spark">this GitHub repository</a></em>.</p>
<p><em>You are free to use the data visualizations from this article however you wish, but it would be greatly appreciated if proper attribution is given to this article and/or myself!</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Video Games and Charity: Analyzing Awesome Games Done Quick 2016 Donations</title>
      <link>https://minimaxir.com/2016/01/agdq-2016/</link>
      <pubDate>Mon, 11 Jan 2016 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2016/01/agdq-2016/</guid>
      <description>Were frames killed? Were animals saved?</description>
<content:encoded><![CDATA[<p><a href="https://gamesdonequick.com">Awesome Games Done Quick</a>, and its sister event Summer Games Done Quick, are fundraising events that livestream video game speedruns <a href="http://www.twitch.tv/gamesdonequick/profile">live on Twitch</a> for charity. Beginning in January 2011, before Twitch was spun out from Justin.tv, <a href="https://en.wikipedia.org/wiki/Awesome_Games_Done_Quick_and_Summer_Games_Done_Quick#List_of_marathons">AGDQ was very small</a> and raised only $52,519.83 for the <a href="http://preventcancer.org">Prevent Cancer Foundation</a>; now, in 2016, from January 3rd to January 10th, AGDQ <a href="https://gamesdonequick.com/tracker/index/agdq2016">successfully raised</a> about $1.2 <em>million</em> for the charity.</p>
<p>A <a href="https://en.wikipedia.org/wiki/Speedrun">speedrun</a>, as the name suggests, is the process of completing a video game as fast as possible, optionally with self-imposed challenges to make things more interesting. Speedruns can emphasize extreme player skill, clever glitch abuse, and unexpected mistakes that make the results hilarious.</p>
<p>One of the first runs of AGDQ 2016, <a href="https://www.youtube.com/watch?v=jLlian3g7Gg">Super Monkey Ball</a>, demonstrates all of these. (run starts at 5:57)</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/jLlian3g7Gg?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>AGDQ 2016 also has fun with the concept of speedrunning. One of the best events of AGDQ 2016 was a blind speedrun of user-created <a href="https://www.youtube.com/watch?v=8qC584MWXO4">Super Mario Maker</a> levels from top designers, in which hilarity ensued. (run starts at 27:41)</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/8qC584MWXO4?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>It might be interesting to know <em>which</em> video games led to the achievement of over $1M donated to charity, and the nature of the donations in general.</p>
<h2 id="gaming-data">Gaming Data</h2>
<p>With a few quick scripts on Kimono to scrape data from the <a href="https://gamesdonequick.com/tracker/donations/agdq2016">AGDQ 2016 donation page</a> (+ a <em>lot</em> of postprocessing in R!), I obtained a dataset of all 30,528 donations, their donors, when they donated, during what speedrun they donated, and <em>why</em> they donated. (<a href="https://docs.google.com/spreadsheets/d/1yyfkS0jvRK1cWrQesYiBn1TMGC93lo1MqahcU3XeGIU/edit?usp=sharing">Google Sheets link</a> for all the data)</p>
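<p>Computing the running total for the chart below is straightforward dplyr (a sketch, assuming a hypothetical <code>donations</code> data frame with <code>amount</code> and <code>timestamp</code> columns):</p>
<pre tabindex="0"><code class="language-r">library(dplyr)

# cumulative donation total over the event, plus the day for color coding
df_cum &lt;- donations %&gt;%
  arrange(timestamp) %&gt;%
  mutate(cum_total = cumsum(amount),
         day = as.Date(timestamp))
</code></pre>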
<p>Here are the cumulative donations during AGDQ, color coded by day:</p>
<figure>

    <img loading="lazy" srcset="/2016/01/agdq-2016/agdq-1_hu_66974e3510a0c413.webp 320w,/2016/01/agdq-2016/agdq-1_hu_468164133f4b470c.webp 768w,/2016/01/agdq-2016/agdq-1_hu_7aadedea2b879a28.webp 1024w,/2016/01/agdq-2016/agdq-1.png 1200w" src="agdq-1.png"/> 
</figure>

<p>Cumulative donations were strong the entire run. On the second-to-last day, the donations rallied and increased exponentially, clearing $1M handily on the last day.</p>
<p>The minimum donation amount is $5, but the average is significantly higher at $39.62. What is the distribution of donations?</p>
<p>Here is a distribution of donations from $5 to $100 (for ease of visualization/interpretation), which account for 97% of all donations:</p>
<figure>

    <img loading="lazy" srcset="/2016/01/agdq-2016/agdq-2_hu_9c2b7ebdf2a868b5.webp 320w,/2016/01/agdq-2016/agdq-2_hu_f5e3c025fbd19725.webp 768w,/2016/01/agdq-2016/agdq-2_hu_1de3d44c61f8e931.webp 1024w,/2016/01/agdq-2016/agdq-2.png 1200w" src="agdq-2.png"/> 
</figure>

<p>The median donation amount is $20. What&rsquo;s interesting is that donations occur at clear break points: not only are there many donations at multiples of $10, but there are many donations at $25 and $75 as well. The $50 and $75 points also potentially benefited from being thresholds for entry into a <a href="https://gamesdonequick.com/tracker/prizes/agdq2016">grand prize raffle</a>. I&rsquo;ll note off-chart that there is a spike in $1,000 donations, the threshold for audience clapping in celebration, and that the <a href="https://gamesdonequick.com/tracker/donation/212588">top single donation</a> was made by an AGDQ sponsor, <a href="https://www.theyetee.com/">The Yetee</a>, at $18,225. The <a href="https://gamesdonequick.com/tracker/donation/209613">top donation by a non-sponsor</a> came from Minecraft creator Notch at $8,000, which he <a href="https://gamesdonequick.com/tracker/donation/234071">did twice</a>.</p>
<p>Which games are the most popular and generated the most amount of money for the Prevent Cancer Foundation?</p>
<figure>

    <img loading="lazy" srcset="/2016/01/agdq-2016/agdq-4_hu_491202455ca05cc1.webp 320w,/2016/01/agdq-2016/agdq-4_hu_8257ec8fbcd454ec.webp 768w,/2016/01/agdq-2016/agdq-4_hu_bdf86800d6fb1d8.webp 1024w,/2016/01/agdq-2016/agdq-4.png 1200w" src="agdq-4.png"/> 
</figure>

<p>Unsurprisingly, Nintendo games are the most popular due to the nostalgia factor. In fairness, the top runs on this chart occur during the last two days of AGDQ 2016, which as mentioned previously may have been affected by a rally, so we cannot assert causality. The appearance of <a href="http://yachtclubgames.com/shovel-knight/">Shovel Knight</a> and <a href="https://en.wikipedia.org/wiki/Bloodborne">Bloodborne</a> as leading donation games, both relatively recently released, shows that speedrunning has more appeal than just retro games.</p>
<p>A popular technique in charity drives is donation incentives, which help bolster the total number of donations. The <a href="https://gamesdonequick.com/tracker/bids/agdq2016">AGDQ bid incentives</a> can include bonus game segments, or certain game decisions, such as what name to give to a main character.</p>
<p>Any donation can optionally be assigned as a donation toward an incentive. Which run received the most money toward incentives?</p>
<figure>

    <img loading="lazy" srcset="/2016/01/agdq-2016/agdq-5_hu_a4d76f42b8ded894.webp 320w,/2016/01/agdq-2016/agdq-5_hu_7cb82c39a228c7b6.webp 768w,/2016/01/agdq-2016/agdq-5_hu_4ad743cd80b22cd2.webp 1024w,/2016/01/agdq-2016/agdq-5.png 1200w" src="agdq-5.png"/> 
</figure>

<p>Super Metroid donation incentives accounted for nearly <em>1/4th</em> of all the money raised at AGDQ. Final Fantasy IV accounted for a large amount as well; however, in both cases, the rally effect may apply.</p>
<p>Speaking of the Super Metroid donation incentives, it should be noted that this particular incentive is one of the most culturally important in the show. Super Metroid has an optional objective to Save The Animals from planetary destruction, but this costs time, and time is important for a speedrun. Or the speedrunner can Kill The Animals through inaction for efficiency, &ldquo;Saving the Frames.&rdquo;</p>
<p>What is the split of these incentive choices?</p>
<figure>

    <img loading="lazy" srcset="/2016/01/agdq-2016/agdq-6_hu_926af7a059c72381.webp 320w,/2016/01/agdq-2016/agdq-6_hu_c8db0a5dfb93768c.webp 768w,/2016/01/agdq-2016/agdq-6_hu_d9dfc97884479bde.webp 1024w,/2016/01/agdq-2016/agdq-6.png 1200w" src="agdq-6.png"/> 
</figure>

<p>Yes, the animals were killed (specifically, they were <strong>REKT</strong>, in the words of a last-minute donator). Both bonus games and vanity naming were popular, but nothing compared to the Save the Animals / Kill the Animals bid war.</p>
<p>Lastly, people can leave comments with donations, and these comments are usually read on-stream when possible, as you&rsquo;ve likely noticed if you&rsquo;ve watched the videos above.</p>
<p>Here&rsquo;s a fun, nonscientific word cloud of those comments:</p>
<figure>

    <img loading="lazy" srcset="/2016/01/agdq-2016/agdq-7_hu_6b77b0f9544e6225.webp 320w,/2016/01/agdq-2016/agdq-7_hu_c496b8c2ac3dd629.webp 768w,/2016/01/agdq-2016/agdq-7.png 1000w" src="agdq-7.png"/> 
</figure>

<p>Lots of positivity, aside from the whole &ldquo;Kill the Animals&rdquo; thing. After all, the event is all about preventing cancer.</p>
<p>While this post isn&rsquo;t an academic analysis, it&rsquo;s neat to see what kinds of things drive donations to charity. This model of livestreaming for charitable causes is very successful, and important given the renewed attention toward livestreaming as Twitch reaches mainstream audiences, alongside the rise of personal streaming with apps like Periscope. Donation incentives are a <em>very</em> successful technique for facilitating donations.</p>
<p>It will be interesting to see if Twitch and events like AGDQ can further leverage charitable livestreaming, or if another startup/organization beats them to it. Bidding-Wars-for-Deciding-the-Fate-of-Fictional-Animals-as-a-Service has a nice ring to it.</p>
<hr>
<p><em>You can access a Jupyter notebook with the data processing and chart processing code <a href="https://github.com/minimaxir/agdq-2016">in this GitHub repository</a>. If you use the processed donation data or visualization designs contained within this article, it would be greatly appreciated if proper attribution is given back to this article and/or myself. Thanks! :)</em></p>
<p><em>Note for donation incentive statistics: the user can theoretically split their donation among multiple incentives; unfortunately I assumed at time of scrape that all the money could only go toward one bid. All donations I investigated from the source data were toward a single incentive except two donations from The Yetee which I fixed manually. If there are any data discrepancies, that is the likely cause.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>The Statistical Difference Between 1-Star and 5-Star Reviews on Yelp</title>
      <link>https://minimaxir.com/2014/09/one-star-five-stars/</link>
      <pubDate>Tue, 23 Sep 2014 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2014/09/one-star-five-stars/</guid>
      <description>It can be proven that language has a strong statistical effect on review ratings, but that is intuitive enough. How have review ratings changed?</description>
<content:encoded><![CDATA[<p>Many businesses in the real world encourage their customers to &ldquo;Rate us on Yelp!&rdquo;. <a href="http://www.yelp.com/">Yelp</a>, the &ldquo;best way to find local businesses,&rdquo; relies on user reviews to help its viewers find the best places. Both positive and negative reviews are helpful in this mission: positive reviews on Yelp identify the best places, while negative reviews identify places where people <em>shouldn&rsquo;t</em> go. Usually, both positive and negative reviews are based not on objective attributes of the business, but on the experience the writer had with the establishment.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp_review_pos_hu_ddcb34306c4121d.webp 320w,/2014/09/one-star-five-stars/yelp_review_pos.png 620w" src="yelp_review_pos.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp_review_neg_hu_5472e07a6e063134.webp 320w,/2014/09/one-star-five-stars/yelp_review_neg.png 633w" src="yelp_review_neg.png"/> 
</figure>

<p>I analyzed the language present in 1,125,458 Yelp Reviews using the dataset from the <a href="http://www.yelp.com/dataset_challenge">Yelp Dataset Challenge</a> containing reviews of businesses in the cities of Phoenix, Las Vegas, Madison, Waterloo and Edinburgh. Users can rate businesses 1, 2, 3, 4, or 5 stars. When comparing the most-frequent two-word phrases between 1-star and 5-star reviews, the difference is apparent.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/Yelp-2-Gram-Small_hu_a3184278e17792da.webp 320w,/2014/09/one-star-five-stars/Yelp-2-Gram-Small_hu_93816ee646e301fc.webp 768w,/2014/09/one-star-five-stars/Yelp-2-Gram-Small_hu_6c0c9d7f59903afe.webp 1024w,/2014/09/one-star-five-stars/Yelp-2-Gram-Small.jpg 1200w" src="Yelp-2-Gram-Small.jpg"/> 
</figure>

<p>The 5-star Yelp reviews contain many instances of &ldquo;Great&rdquo;, &ldquo;Good&rdquo;, and &ldquo;Happy&rdquo;. In contrast, the 1-star Yelp reviews use very little positive language, and instead frequently mention &ldquo;minutes,&rdquo; presumably after long and unfortunate waits at the establishment. (Las Vegas is one of the cities where the reviews were collected, which is why it appears prominently in both 1-star and 5-star reviews)</p>
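<p>(Per the footnotes, the n-gram preprocessing for these word clouds was done in Python; an equivalent sketch of the bigram counting in R, assuming a hypothetical character vector of review texts:)</p>
<pre tabindex="0"><code class="language-r"># count two-word phrases across a character vector of review texts
count_bigrams &lt;- function(texts) {
  tokens &lt;- strsplit(tolower(texts), "[^a-z']+")
  bigrams &lt;- unlist(lapply(tokens, function(t) {
    if (length(t) &lt; 2) return(character(0))
    paste(head(t, -1), tail(t, -1))   # pair each word with its successor
  }))
  sort(table(bigrams), decreasing = TRUE)
}

head(count_bigrams(one_star_texts), 50)   # top phrases in 1-star reviews
</code></pre>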
<p>Looking at three-word phrases tells more of a story.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/Yelp-3-Gram-Small_hu_e46cca474f4fc455.webp 320w,/2014/09/one-star-five-stars/Yelp-3-Gram-Small_hu_d70187d5b3a7bd24.webp 768w,/2014/09/one-star-five-stars/Yelp-3-Gram-Small_hu_51480d3275b8941b.webp 1024w,/2014/09/one-star-five-stars/Yelp-3-Gram-Small.jpg 1200w" src="Yelp-3-Gram-Small.jpg"/> 
</figure>

<p>1-star reviews frequently contain warnings for potential customers, promises that the author will &ldquo;never go back,&rdquo; and a strong impression that issues stem from conflicts with &ldquo;the front desk&rdquo;, such as those at hotels. 5-star reviews &ldquo;love this place&rdquo; and &ldquo;can&rsquo;t wait to&rdquo; go back.</p>
<p>Can this language be used to predict reviews?</p>
<h2 id="regression-of-language">Regression of Language</h2>
<p>To determine the causal impact of positive and negative words on the # of stars given in a review, we can perform a simple linear regression of stars on the number of positive words in the review, the number of negative words in the review, and the number of words in the review itself (since the length of the review is related to the number of positive/negative words: the longer the review, the more words).</p>
<p>A quick-and-dirty way to determine the number of positive/negative words in a given Yelp review is to compare each word of the review against a lexicon of positive/negative words, and count the number of review words in the lexicon. In this case, I use the <a href="http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html">lexicons compiled by UIC professor Bing Liu</a>.</p>
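<p>Here is a minimal sketch of that setup in base R (assuming a hypothetical <code>reviews</code> data frame with <code>stars</code> and <code>text</code> columns, and the lexicons loaded as character vectors <code>positive_words</code> and <code>negative_words</code>):</p>
<pre tabindex="0"><code class="language-r"># count how many of a review's tokens appear in a given lexicon
count_in_lexicon &lt;- function(text, lexicon) {
  tokens &lt;- strsplit(tolower(text), "[^a-z']+")[[1]]
  sum(tokens %in% lexicon)
}

reviews$pos_words    &lt;- sapply(reviews$text, count_in_lexicon, lexicon = positive_words)
reviews$neg_words    &lt;- sapply(reviews$text, count_in_lexicon, lexicon = negative_words)
reviews$review_words &lt;- lengths(strsplit(reviews$text, "\\s+"))

fit &lt;- lm(stars ~ pos_words + neg_words + review_words, data = reviews)
summary(fit)
</code></pre>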
<p>Running a regression of # stars in a Yelp review on # positive words, # negative words, and # words in the review returns these results:</p>
<pre tabindex="0"><code>Coefficients:
               Estimate	 Std. Error  t value  Pr(&gt;|t|)
(Intercept)    3.692      1.670e-03  2210.0   &lt;2e-16 ***
pos_words      0.122      2.976e-04   411.3   &lt;2e-16 ***
neg_words     -0.154      4.887e-04  -315.9   &lt;2e-16 ***
review_words  -0.003      1.984e-05  -169.4   &lt;2e-16 ***


Residual standard error: 1.119 on 1125454 degrees of freedom
Multiple R-squared:  0.2589,	Adjusted R-squared:  0.2589
F-statistic: 1.311e+05 on 3 and 1125454 DF,  p-value: &lt; 2.2e-16
</code></pre><p>The regression output explains these things:</p>
<ul>
<li>If a reviewer posted a blank review with no text in it, the predicted rating would be 3.692 on average.</li>
<li>Every positive word increases the predicted star rating by 0.122 (e.g. 8 positive words indicate a 1-star increase)</li>
<li>Every negative word decreases the predicted star rating by 0.154 (e.g. 6-7 negative words indicate a 1-star decrease)</li>
<li>The number of words in the review has a lesser, negative effect. (A 333-word review indicates a 1-star decrease, but the average Yelp review is about 130 words)</li>
<li>This model explains 25.89% of the variation in the number of stars given in a review. This sounds like a low percentage, but it is impressive for such a simple model using unstructured real-world data.</li>
</ul>
<p>All of these conclusions are <em>extremely</em> statistically significant due to the large sample size.</p>
<p>Additionally, you could rephrase the regression as a logistic classification problem, where reviews rated 1, 2, or 3 stars are classified as &ldquo;negative,&rdquo; and reviews with 4 or 5 stars are classified as &ldquo;positive.&rdquo; Then, run the regression to determine the likelihood of a given review being positive. Running this regression (not shown) results in a logistic model with up to <em>75% accuracy</em>, a noted improvement over the &ldquo;no information rate&rdquo; of 66%, which is the model accuracy if you just guessed that every review was positive. The logistic model also has similar conclusions for the predictor variables as the linear model.</p>
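<p>That reformulation takes only a couple more lines (a sketch, reusing the hypothetical <code>reviews</code> data frame from above):</p>
<pre tabindex="0"><code class="language-r"># recode 4-star and 5-star reviews as "positive" and fit a logistic model
reviews$positive &lt;- as.integer(reviews$stars &gt;= 4)

fit_logit &lt;- glm(positive ~ pos_words + neg_words + review_words,
                 data = reviews, family = binomial)

# in-sample classification accuracy at a 0.5 probability threshold
pred &lt;- predict(fit_logit, type = "response") &gt; 0.5
mean(pred == reviews$positive)
</code></pre>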
<p>It can be proven that language has a strong statistical effect on review ratings, but that&rsquo;s intuitive enough. How have review ratings changed?</p>
<h2 id="1-star-and-5-star-reviews-visualized">1-Star and 5-Star Reviews, Visualized</h2>
<p>Since 2005, Yelp has had incredible growth in the number of new reviews.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-time-series_hu_c7a13c976c5495e.webp 320w,/2014/09/one-star-five-stars/yelp-review-time-series_hu_a1b4db49122e2298.webp 768w,/2014/09/one-star-five-stars/yelp-review-time-series_hu_f6943ed84c603de9.webp 1024w,/2014/09/one-star-five-stars/yelp-review-time-series.png 1200w" src="yelp-review-time-series.png"/> 
</figure>

<p>From that chart, it appears that each of the five rating brackets has grown at the same rate, but that isn&rsquo;t the case. Here&rsquo;s a chart of the rating brackets showing how the proportions of new reviews of each rating have changed over time.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-time-proportion_hu_153ca7093861ad13.webp 320w,/2014/09/one-star-five-stars/yelp-review-time-proportion_hu_ae3ee3f52d98ee95.webp 768w,/2014/09/one-star-five-stars/yelp-review-time-proportion_hu_7953d250442d19e0.webp 1024w,/2014/09/one-star-five-stars/yelp-review-time-proportion.png 1200w" src="yelp-review-time-proportion.png"/> 
</figure>

<p>Early Yelp had mostly 4-star and 5-star reviews, as one might expect for an early Web 2.0 startup, where the only users likely to put in the effort to write a review were those who had positive experiences. However, the behavior from 2010 onward is interesting: the relative proportions of both 1-star reviews <em>and</em> 5-star reviews increase over time.</p>
<p>As a result, the proportions of ratings in reviews from Yelp&rsquo;s beginning in 2005 and Yelp&rsquo;s present 2014 are incredibly different.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/Yelp-2005-2014_hu_39ad524a7eb9df70.webp 320w,/2014/09/one-star-five-stars/Yelp-2005-2014_hu_c9fa061ae2bda217.webp 768w,/2014/09/one-star-five-stars/Yelp-2005-2014_hu_395f9de928083b7c.webp 1024w,/2014/09/one-star-five-stars/Yelp-2005-2014.png 1600w" src="Yelp-2005-2014.png"/> 
</figure>

<p>More negativity, more positivity. Do they cancel out?</p>
<h2 id="how-positive-are-yelp-reviews">How Positive Are Yelp Reviews?</h2>
<p>We can calculate the relative <strong>positivity</strong> of a review by taking the number of positive words in the review and dividing it by the total number of words in the review.</p>
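<p>Reusing the word counts from the regression sketch above, this is a one-line transformation:</p>
<pre tabindex="0"><code class="language-r"># share of each review's words that appear in the positive lexicon
reviews$positivity &lt;- reviews$pos_words / reviews$review_words
mean(reviews$positivity)
</code></pre>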
<p>The average positivity among all reviews is <em>5.6%</em>. Over time, the positivity has been relatively flat.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-time-series-positivity_hu_f3209866d8404a7f.webp 320w,/2014/09/one-star-five-stars/yelp-review-time-series-positivity_hu_31c147031a1203e7.webp 768w,/2014/09/one-star-five-stars/yelp-review-time-series-positivity_hu_3f2e651879ad63c4.webp 1024w,/2014/09/one-star-five-stars/yelp-review-time-series-positivity.png 1200w" src="yelp-review-time-series-positivity.png"/> 
</figure>

<p>Flat, but still increasing, most likely due to the increasing proportion of 5-star reviews. But the number of 1-star reviews also increased: do the two offset each other?</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-positivity_hu_637131981fcef452.webp 320w,/2014/09/one-star-five-stars/yelp-review-positivity_hu_c921212195eb39d7.webp 768w,/2014/09/one-star-five-stars/yelp-review-positivity_hu_41cd19db408d2367.webp 1024w,/2014/09/one-star-five-stars/yelp-review-positivity.png 1200w" src="yelp-review-positivity.png"/> 
</figure>

<p>This histogram of positivity scores shows that 1-star reviews have low positivity and rarely reach high positivity, while 5-star reviews rarely have low positivity and instead cluster at very high positivity. The distribution for each star rating is close to a <a href="http://en.wikipedia.org/wiki/Normal_distribution">Normal distribution</a>, with each successive rating category peaking at increasing positivity values.</p>
<p>The relative proportion of each star rating reinforces this.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-positivity-density_hu_f23db88cd1ac716.webp 320w,/2014/09/one-star-five-stars/yelp-review-positivity-density_hu_76cfe57f52f13af.webp 768w,/2014/09/one-star-five-stars/yelp-review-positivity-density_hu_fc116b0474c61bb9.webp 1024w,/2014/09/one-star-five-stars/yelp-review-positivity-density.png 1200w" src="yelp-review-positivity-density.png"/> 
</figure>

<p>Over half of the 0% positivity reviews are 1-star reviews, while over three-quarters of the reviews at the highest positivity levels are 5-star reviews. (note that the 2-star, 3-star, and 4-star ratings are not as significant at either extreme)</p>
<h2 id="how-negative-are-yelp-reviews">How Negative Are Yelp Reviews?</h2>
<p>When working with the negativity of reviews, calculated by taking the number of negative words and dividing it by the total number of words in the review, the chart looks much different.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-time-series-negativity_hu_86f7acc4985c237b.webp 320w,/2014/09/one-star-five-stars/yelp-review-time-series-negativity_hu_1809307409787cc1.webp 768w,/2014/09/one-star-five-stars/yelp-review-time-series-negativity_hu_d07d1e1fdef155a9.webp 1024w,/2014/09/one-star-five-stars/yelp-review-time-series-negativity.png 1200w" src="yelp-review-time-series-negativity.png"/> 
</figure>

<p>The average negativity among all reviews is <em>2.0%</em>. Since the average positivity is 5.6%, this implies that the net sentiment among all reviews is positive, despite the increase in 1-star reviews over time.</p>
<p>The histogram of negative reviews looks much different as well.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-negativity_hu_d5d4e839efa115f7.webp 320w,/2014/09/one-star-five-stars/yelp-review-negativity_hu_757c09188bd04406.webp 768w,/2014/09/one-star-five-stars/yelp-review-negativity_hu_9b14213657352703.webp 1024w,/2014/09/one-star-five-stars/yelp-review-negativity.png 1200w" src="yelp-review-negativity.png"/> 
</figure>

<p>Even 1-star reviews aren&rsquo;t completely negative all the time.</p>
<p>The chart is heavily skewed right, making it difficult to determine the proportions of each rating at first glance.</p>
<p>Hence, here&rsquo;s another proportion chart.</p>
<figure>

    <img loading="lazy" srcset="/2014/09/one-star-five-stars/yelp-review-negativity-density_hu_dba0956b28ad9c05.webp 320w,/2014/09/one-star-five-stars/yelp-review-negativity-density_hu_6c544b1696ff5283.webp 768w,/2014/09/one-star-five-stars/yelp-review-negativity-density_hu_60ad65564b53f2bd.webp 1024w,/2014/09/one-star-five-stars/yelp-review-negativity-density.png 1200w" src="yelp-review-negativity-density.png"/> 
</figure>

<p>At low negativity, the proportions of negative review scores (1-star, 2-stars, 3-stars) and positive review scores (4-stars, 5-stars) are about equal, implying that negative reviews can be just as civil as positive reviews. But high negativity is solely present in 1-star and 2-star reviews.</p>
<p>From this article, you&rsquo;ve seen that Yelp reviews with 5-star ratings are generally positive, and Yelp reviews with 1-star are generally negative. Yes, this blog post is essentially &ldquo;Pretty Charts Made By Captain Obvious,&rdquo; but what&rsquo;s important is confirmation of these assumptions. Language plays a huge role in determining the ratings of reviews, and that knowledge could be applied to many other industries and review websites.</p>
<h2 id="four-stars">Four Stars</h2>
<p>I&rsquo;d give this blog post a solid 4-stars. The content was great, but the length was long, although not as long as <a href="http://minimaxir.com/2014/06/reviewing-reviews/">some others</a>. Can&rsquo;t wait to read this post again!</p>
<hr>
<ul>
<li><em>Yelp reviews were preprocessed with Python, by simultaneously converting the data from JSON to a tabular structure, tokenizing the words in the review, counting the positive/negative words, and storing bigrams and trigrams in a dictionary to later be exported for creating word clouds.</em></li>
<li><em>All data analysis was performed using R, and all charts were made using ggplot2. <a href="http://www.pixelmator.com/">Pixelmator</a> was used to manually add relevant annotations when necessary.</em></li>
<li><em>You can view both the Python and R code used to process and chart the data <a href="https://github.com/minimaxir/yelp-review-analysis">in this GitHub repository</a>. Note that since Yelp prevents redistribution of the data, the code may not be reproducible.</em></li>
<li><em>You can download full-resolution PNGs of the two word clouds [5000x2000px] in <a href="https://www.dropbox.com/s/f20gwh9jvkibi4z/Yelp_Wordclouds_5000_200.zip?dl=0">this ZIP file</a> [18 MB]</em></li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>A Statistical Analysis of 1.2 Million Amazon Reviews</title>
      <link>https://minimaxir.com/2014/06/reviewing-reviews/</link>
      <pubDate>Tue, 17 Jun 2014 08:20:00 -0700</pubDate>
      <guid>https://minimaxir.com/2014/06/reviewing-reviews/</guid>
      <description>Analyzing the dataset of 1.2 million Amazon reviews, I found some interesting statistical trends; some are intuitive and obvious, but others give insight to how Amazon&amp;rsquo;s review system actually works.</description>
      <content:encoded><![CDATA[<p>When buying the latest products on <a href="http://www.amazon.com/">Amazon</a>, reading reviews is an important part of the purchasing process.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/ore_hu_a023cb91d2d5bbec.webp 320w,/2014/06/reviewing-reviews/ore.png 554w" src="ore.png"/> 
</figure>

<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amazon-review_hu_e37f25bba24a903e.webp 320w,/2014/06/reviewing-reviews/amazon-review.png 495w" src="amazon-review.png"/> 
</figure>

<p>Reviews from customers who have actually purchased and used the product in question can give you more context about the product itself. Each reviewer rates the product from 1 to 5 stars and provides a text summary of their experiences and opinions about the product. The ratings for each product are averaged together to get an overall product rating.</p>
<p>The number of reviews on Amazon has grown over the years.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-basic-time-count_hu_8c9a16a5c5892b45.webp 320w,/2014/06/reviewing-reviews/amzn-basic-time-count_hu_9ed51550cf6967d7.webp 768w,/2014/06/reviewing-reviews/amzn-basic-time-count_hu_5718b80f7ce8a708.webp 1024w,/2014/06/reviewing-reviews/amzn-basic-time-count.png 1200w" src="amzn-basic-time-count.png"/> 
</figure>

<p>But how do people write reviews? What types of ratings do reviewers give? How many of these reviews are considered helpful?</p>
<p>Stanford researchers Julian McAuley and Jure Leskovec collected <a href="https://snap.stanford.edu/data/web-Amazon.html">all Amazon reviews</a> from the service&rsquo;s online debut in 1995 to 2013. Analyzing the dataset of 1.2 million Amazon reviews of products in the Electronics section, I found some interesting statistical trends; some are intuitive and obvious, but others give insight into how Amazon&rsquo;s review system actually works.</p>
<h2 id="describing-the-data">Describing the Data</h2>
<p>First, let&rsquo;s see how the user ratings are distributed among the reviews.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-basic-score_hu_59c031b5274368e7.webp 320w,/2014/06/reviewing-reviews/amzn-basic-score_hu_6eb4ff83d005ab3.webp 768w,/2014/06/reviewing-reviews/amzn-basic-score_hu_c6472b19e29a9fe6.webp 1024w,/2014/06/reviewing-reviews/amzn-basic-score.png 1200w" src="amzn-basic-score.png"/> 
</figure>

<p>More than half of the reviews give a 5-star rating. Aside from perfect reviews, most reviewers give 4-star or 1-star ratings, with relatively few giving 2-star or 3-star ratings.</p>
<p>As a result, the statistical average for all review ratings is on the high end of the scale at about <strong>3.90</strong>. In fact, the average rating for newly-written reviews has varied from 3.4 to 4.2 over time.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-basic-time-rating_hu_52cea30c7eeb25e2.webp 320w,/2014/06/reviewing-reviews/amzn-basic-time-rating_hu_2957bab5d470c910.webp 768w,/2014/06/reviewing-reviews/amzn-basic-time-rating_hu_6a7aecd8bbdede75.webp 1024w,/2014/06/reviewing-reviews/amzn-basic-time-rating.png 1200w" src="amzn-basic-time-rating.png"/> 
</figure>

<p>Another metric used to measure reviews is review helpfulness. Other Amazon reviewers can rate a particular review as &ldquo;helpful&rdquo; or &ldquo;not helpful.&rdquo; A &ldquo;review helpfulness&rdquo; statistic can be calculated by taking the number of &ldquo;is-helpful&rdquo; indicators divided by the total number of is-helpful/is-not-helpful indicators (in the example at the beginning of the article, 639/665 people found the review helpful, so the helpfulness rating would be 96%). This gives an indication of review quality to a prospective buyer. Only 10% of the reviews had at least 10 is-helpful/is-not-helpful data points, and of those reviews, the vast majority had perfect helpfulness scores.</p>
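<p>As a sketch of that statistic in R (assuming hypothetical <code>helpful_votes</code> and <code>total_votes</code> columns in a <code>reviews</code> data frame):</p>
<pre tabindex="0"><code class="language-r"># helpfulness = is-helpful votes over total votes, e.g. 639/665 ~ 96%
reviews$helpfulness &lt;- reviews$helpful_votes / reviews$total_votes

# keep only reviews with enough votes for the ratio to be meaningful
rated &lt;- subset(reviews, total_votes &gt;= 10)
</code></pre>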
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-basic-helpful_hu_6374fe7678c45848.webp 320w,/2014/06/reviewing-reviews/amzn-basic-helpful_hu_fa5de8d1b024cf5.webp 768w,/2014/06/reviewing-reviews/amzn-basic-helpful_hu_b43d4f9b8739f463.webp 1024w,/2014/06/reviewing-reviews/amzn-basic-helpful.png 1200w" src="amzn-basic-helpful.png"/> 
</figure>

<p>That would make sense; if you&rsquo;re writing a review (especially a 5-star review), you&rsquo;re writing with the intent to help other prospective buyers.</p>
<p>Another consideration is review length. Do reviewers frequently write essays, or do they typically write a single paragraph?</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-basic-length_hu_47c45b6e84c3e877.webp 320w,/2014/06/reviewing-reviews/amzn-basic-length_hu_36ecbd0ae249486f.webp 768w,/2014/06/reviewing-reviews/amzn-basic-length_hu_42feb4443896b5f3.webp 1024w,/2014/06/reviewing-reviews/amzn-basic-length.png 1200w" src="amzn-basic-length.png"/> 
</figure>

<p>Most reviews are 100-150 characters, but the average number of characters in a review is about <strong>582</strong> (there are some outlier reviews with 30,000+ characters!). Assuming that the average number of characters in a paragraph <a href="http://wiki.answers.com/Q/How_many_characters_does_the_average_paragraph_have">is 352</a>, the typical review is under half a paragraph, though the long tail of outliers pulls the mean up to about a paragraph and a half. Interestingly, reviews are rarely less than a sentence. (the <a href="http://www.amazon.com/gp/community-help/customer-reviews-guidelines">Review Guidelines</a> suggest a minimum of 20 words in a review, so this discrepancy could be attributed to moderator removal of short, one-liner reviews)</p>
<h2 id="particularizing-the-products">Particularizing the Products</h2>
<p>The 1.2 million reviews in the Electronics data set address 82,003 distinct products. However, most of those entries represent different SKUs of the same product (e.g. different colors of headphones). Of those products, only 30,577 have pricing information that identifies them as the source product.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-product-price_hu_58971098a111aa36.webp 320w,/2014/06/reviewing-reviews/amzn-product-price_hu_3ef361e65687d666.webp 768w,/2014/06/reviewing-reviews/amzn-product-price_hu_89981cc6ca8be307.webp 1024w,/2014/06/reviewing-reviews/amzn-product-price.png 1200w" src="amzn-product-price.png"/> 
</figure>

<p>Over 2/3rds of Amazon Electronics are priced between $0 and $50, which makes sense as popular electronics such as television remotes and phone cases are not extremely expensive. However, there&rsquo;s no statistical correlation between the price of a product and the number of reviews it receives.</p>
<p>For the overall rating of a particular product, which is the average rating of all reviews for that product, the ratings are no longer limited to discrete numbers between 1 and 5, and can take decimal values between those numbers as well. The distribution of product ratings is similar to the distribution of review ratings.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-product-rating_hu_1521ed87824a3e14.webp 320w,/2014/06/reviewing-reviews/amzn-product-rating_hu_bbcde884366ab6c2.webp 768w,/2014/06/reviewing-reviews/amzn-product-rating_hu_f305fe55ecfa3298.webp 1024w,/2014/06/reviewing-reviews/amzn-product-rating.png 1200w" src="amzn-product-rating.png"/> 
</figure>

<p>Again, the perfect rating of 5 is most popular for products. This distribution resembles the distribution of scores of all reviews for the discrete rating values, but this view reveals local maxima at the midpoint between each discrete value. (i.e. 3-and-a-half stars and 4-and-a-half stars are surprisingly common ratings)</p>
<p>What happens when you plot product rating and product price together?</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-product-score-price_hu_b271b61ddc8a67b5.webp 320w,/2014/06/reviewing-reviews/amzn-product-score-price_hu_5448fe68fcfc3bb8.webp 768w,/2014/06/reviewing-reviews/amzn-product-score-price_hu_65b3de5328ae68dd.webp 1024w,/2014/06/reviewing-reviews/amzn-product-score-price.png 1200w" src="amzn-product-score-price.png"/> 
</figure>

<p>The most expensive products have 4-star and 5-star overall ratings, but not 1-star and 2-star ratings. However, the correlation is very weak. (r = 0.04)</p>
<p>In contrast, the relationship between product price and the average <em>length</em> of reviews for the product is surprising.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-product-price-length_hu_31310358eb50b709.webp 320w,/2014/06/reviewing-reviews/amzn-product-price-length_hu_cde21bf380b44ae5.webp 768w,/2014/06/reviewing-reviews/amzn-product-price-length_hu_c8de453ae8e3e19d.webp 1024w,/2014/06/reviewing-reviews/amzn-product-price-length.png 1200w" src="amzn-product-price-length.png"/> 
</figure>

<p>This relationship is logarithmic with a relatively good correlation (r = 0.29), and it shows that reviewers put more time and effort into reviewing products which are worth more.</p>
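<p>One way to quantify that fit (a sketch, assuming a hypothetical <code>products</code> data frame with <code>price</code> and <code>avg_length</code> columns):</p>
<pre tabindex="0"><code class="language-r"># correlate average review length with the logarithm of price
cor(log10(products$price), products$avg_length)

# equivalently, fit a linear model on the log-transformed price
fit_len &lt;- lm(avg_length ~ log10(price), data = products)
</code></pre>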
<h2 id="reviewing-the-reviewers">Reviewing the Reviewers</h2>
<p>As you might expect, most people leave only 1 or 2 reviews on Amazon, but some have left <em>hundreds</em> of reviews. Out of the 1.2 million reviews, there are 510,434 distinct reviewers.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-reviewer-count_hu_6f6049ea689edace.webp 320w,/2014/06/reviewing-reviews/amzn-reviewer-count_hu_347234ec3a8db6e6.webp 768w,/2014/06/reviewing-reviews/amzn-reviewer-count_hu_d3bd3fe21815e2a9.webp 1024w,/2014/06/reviewing-reviews/amzn-reviewer-count.png 1200w" src="amzn-reviewer-count.png"/> 
</figure>

<p>Over 80% of the reviewers of Amazon electronics left only 1 review. Analyzing reviewers who have left only 1 review is not helpful statistically, so for the rest of the analysis, only reviewers who have made 5 or more reviews (which have received at least 1 is-helpful/is-not-helpful indicator) will be considered. This makes it much easier to get the overall profile of a reviewer. 11,676 reviewers fit this criterion.</p>
<p>Do repeat Amazon users tend to give 5-star reviews?</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-reviewer-score_hu_732050183c11716.webp 320w,/2014/06/reviewing-reviews/amzn-reviewer-score_hu_8ebfb0af49fa84ed.webp 768w,/2014/06/reviewing-reviews/amzn-reviewer-score_hu_cbe8342aed8a7b4b.webp 1024w,/2014/06/reviewing-reviews/amzn-reviewer-score.png 1200w" src="amzn-reviewer-score.png"/> 
</figure>

<p>The distribution of review ratings, when averaged across each reviewer&rsquo;s reviews, is similar to the other distributions of review ratings. However, this distribution is less skewed toward 5 stars and is more uniform between 4 stars and 5 stars.</p>
<p>What about the average helpfulness of the reviews written by a single reviewer? If a reviewer has enjoyed Amazon enough to write 5 or more reviews, chances are that their reviews are high quality.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-reviewer-helpfulness_hu_3396b6f6f7c36442.webp 320w,/2014/06/reviewing-reviews/amzn-reviewer-helpfulness_hu_709e280dd4ad021b.webp 768w,/2014/06/reviewing-reviews/amzn-reviewer-helpfulness_hu_42e3979cc26ce7cd.webp 1024w,/2014/06/reviewing-reviews/amzn-reviewer-helpfulness.png 1200w" src="amzn-reviewer-helpfulness.png"/> 
</figure>

<p>Again, the data is slightly skewed. 8% of the reviewers have perfect helpfulness scores on all their reviews, and the average helpfulness score for all repeat reviews is 80%. Interestingly, a few repeat reviewers have average helpfulness scores of 0.</p>
<p>If you plot <em>both</em> average score and average helpfulness in a single chart, the picture becomes much more clear:</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-reviewer-count-score_hu_d5f52b74f0303f85.webp 320w,/2014/06/reviewing-reviews/amzn-reviewer-count-score_hu_c2ce7b3548ec0cce.webp 768w,/2014/06/reviewing-reviews/amzn-reviewer-count-score_hu_d9432e36b058bedc.webp 1024w,/2014/06/reviewing-reviews/amzn-reviewer-count-score.png 1200w" src="amzn-reviewer-count-score.png"/> 
</figure>

<p>As the chart shows, there&rsquo;s a good positive correlation (r = 0.27) between rating and helpfulness, with a discernible cluster at the top. However, I don&rsquo;t think it&rsquo;s a causal relationship. Reviewers who give a product a 4 - 5 star rating are more passionate about the product and likely to write better reviews than someone who writes a 1 - 2 star &ldquo;this product sucks and you suck too!&rdquo; review.</p>
<p>Another interesting bivariate relationship is the one between the helpfulness of a review and the length of a review. Stereotypically, you might think that longer reviews are more helpful reviews. And in the case of Amazon&rsquo;s Electronics reviews, you&rsquo;d be correct.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-reviewer-helpful-length_hu_130d8f3444e197a4.webp 320w,/2014/06/reviewing-reviews/amzn-reviewer-helpful-length_hu_93c5104c96670a15.webp 768w,/2014/06/reviewing-reviews/amzn-reviewer-helpful-length_hu_7f33170521ce93ef.webp 1024w,/2014/06/reviewing-reviews/amzn-reviewer-helpful-length.png 1200w" src="amzn-reviewer-helpful-length.png"/> 
</figure>

<p>Again, there&rsquo;s a good positive correlation (r = 0.26) between average helpfulness and average length, which the trend line supports (the dip at the end is caused by the large number of very short reviews). All the longer reviews have high helpfulness; there are very, very few unhelpful reviews that are also long.</p>
<h2 id="completing-the-conclusion">Completing the Conclusion</h2>
<p>The reviews on Amazon&rsquo;s Electronics products very frequently rate the product 4 or 5 stars, and such reviews are almost always considered helpful. 1-star ratings are used to signify disapproval, while 2-star and 3-star reviews have no significant impact at all. If that&rsquo;s the case, then what&rsquo;s the point of having a 5-star ranking system at all if the vast majority of reviewers favor the product? Would Amazon benefit if they made review ratings a binary like/dislike?</p>
<p>Having a 5-star system allows the prospective customer to make more informed comparisons between two products: a customer may be more likely to buy a product that&rsquo;s rated 4.2 stars than a product that&rsquo;s rated 3.8 stars, a subtlety that can&rsquo;t easily be emulated with a like/dislike system. Conversely, if products are truly bad, the propensity toward 5-star reviews can obfuscate the low quality of the product, where a like/dislike system would make the low quality more apparent.</p>
<p>Unfortunately, only Amazon has the data that would answer all these questions.</p>
<p>Of course, there are many other secrets to be uncovered from Amazon reviews. The Stanford professors who collected the initial data used <a href="http://i.stanford.edu/~julian/pdfs/recsys13.pdf">machine learning techniques on the review text</a> to predict the rating given by a review from just the review text itself. Other potential topics for analysis are comparisons between <em>types</em> of Electronics (e.g. MP3 players, headphones) or using natural language processing to determine the common syntax in reviews.</p>
<figure>

    <img loading="lazy" srcset="/2014/06/reviewing-reviews/amzn-word-review-start_hu_d1ceead5636a4804.webp 320w,/2014/06/reviewing-reviews/amzn-word-review-start_hu_5932602b953da6be.webp 768w,/2014/06/reviewing-reviews/amzn-word-review-start_hu_d484e026176d66f7.webp 1024w,/2014/06/reviewing-reviews/amzn-word-review-start.png 1200w" src="amzn-word-review-start.png"/> 
</figure>

<p>That&rsquo;s a topic for another blog post. :)</p>
<hr>
<ul>
<li><em>Data analysis was performed using R, and all charts were made using ggplot2.</em></li>
<li><em>You can download a ZIP file containing CSVs of the time series, the aggregate product data, and the anonymized aggregate reviewer data <a href="https://dl.dropboxusercontent.com/u/2017402/amazon_data.zip">here</a>.</em></li>
<li><em>No, I have no relation to &ldquo;<a href="http://www.amazon.com/review/R1KHEP16MXXWCN/ref=cm_cr_rdp_perm?ie=UTF8&amp;ASIN=B000796XXM">M. Wolff</a>&rdquo;.</em></li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
