<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Data Mining on Max Woolf&#39;s Blog</title>
    <link>https://minimaxir.com/tag/data-mining/</link>
    <description>Recent content in Data Mining on Max Woolf&#39;s Blog</description>
    <image>
      <title>Max Woolf&#39;s Blog</title>
      <url>https://minimaxir.com/android-chrome-512x512.png</url>
      <link>https://minimaxir.com/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright Max Woolf © 2026</copyright>
    <lastBuildDate>Mon, 24 Feb 2014 08:00:00 -0700</lastBuildDate>
    <atom:link href="https://minimaxir.com/tag/data-mining/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>A Statistical Analysis of All Hacker News Submissions</title>
      <link>https://minimaxir.com/2014/02/hacking-hacker-news/</link>
      <pubDate>Mon, 24 Feb 2014 08:00:00 -0700</pubDate>
      <guid>https://minimaxir.com/2014/02/hacking-hacker-news/</guid>
      <description>After downloading all 1,265,114 Hacker News submissions from the official Hacker News API, I gathered a few interesting statistics which show the true impact of Hacker News.</description>
      <content:encoded><![CDATA[<p><a href="https://news.ycombinator.com/news">Hacker News</a> is a very popular link aggregator for the technology and startup community. Officially titled <a href="http://ycombinator.com/hackernews.html">by Paul Graham in 2007</a>, Hacker News began mostly as a place where the computationally savvy could submit stories from around the internet and discuss the latest computing trends.</p>
<figure>

    <img loading="lazy" srcset="/2014/02/hacking-hacker-news/hn-wordcloud-2007i_hu_a21091457dbb2200.webp 320w,/2014/02/hacking-hacker-news/hn-wordcloud-2007i_hu_8fd629f171ba8e6b.webp 768w,/2014/02/hacking-hacker-news/hn-wordcloud-2007i.png 800w" src="hn-wordcloud-2007i.png"/> 
</figure>

<p>Back then, people were talking about networking, software users, and a little up-and-coming startup known as &ldquo;<a href="https://twitter.com/">Twitter</a>.&rdquo;</p>
<p>Six years later, during the new renaissance of computing accessibility and startup entrepreneurship, not much has changed.</p>
<figure>

    <img loading="lazy" srcset="/2014/02/hacking-hacker-news/hn-wordcloud-2013i_hu_3e72e8e95bf0a736.webp 320w,/2014/02/hacking-hacker-news/hn-wordcloud-2013i_hu_f105cdb74da28009.webp 768w,/2014/02/hacking-hacker-news/hn-wordcloud-2013i.png 800w" src="hn-wordcloud-2013i.png"/> 
</figure>

<p>Hacker News, from 2007 to 2014, has always illustrated what&rsquo;s &ldquo;new&rdquo; in technology. After downloading all 1,265,114 Hacker News submissions from the official <a href="https://hn.algolia.com/api">Hacker News API</a>, I gathered a few interesting statistics which show the true impact of Hacker News.</p>
<h2 id="how-many-stories-are-submitted-to-hacker-news">How many stories are submitted to Hacker News?</h2>
<p>In the past few years, Hacker News has had an interesting growth pattern.</p>
<figure>

    <img loading="lazy" srcset="/2014/02/hacking-hacker-news/hn-monthly-submissions_hu_1dbee298e30e52f1.webp 320w,/2014/02/hacking-hacker-news/hn-monthly-submissions_hu_6faa726d1014844d.webp 768w,/2014/02/hacking-hacker-news/hn-monthly-submissions_hu_9d00d63d7320efd6.webp 1024w,/2014/02/hacking-hacker-news/hn-monthly-submissions.png 1200w" src="hn-monthly-submissions.png"/> 
</figure>

<p>From the beginning of 2010, with 12k monthly submissions, to the end of 2011, with 31k monthly submissions, the number of monthly submissions to Hacker News nearly tripled. It&rsquo;s a similar growth rate to that of the startups that Y Combinator typically funds.</p>
<p>What&rsquo;s <em>really</em> interesting is that the end of 2011 is the peak: since then, the number of submissions has been trending downward. Is Hacker News dying?</p>
<p>I don&rsquo;t think so. Hacker News implements a proprietary anti-spam algorithm which &ldquo;kills&rdquo; submissions, and moderators can kill submissions manually if necessary. Killed articles do not appear in the submission count, so a change in policy would cause the discrepancy. At the least, it helps improve the quality of discussion.</p>
<p><em>UPDATE (2/28): Paul Graham, in the <a href="https://news.ycombinator.com/item?id=7291531">corresponding HN thread</a>, made a <a href="https://news.ycombinator.com/item?id=7292094">comment</a> that the anti-spam algorithm did indeed increase spam detection and the number of articles killed at the end of 2011.</em></p>
<h2 id="how-many-submissions-receive-large-amounts-of-points">How many submissions receive large amounts of points?</h2>
<p>On Hacker News, users can upvote submissions. The more upvotes a submission has, the higher it appears on the front page of the main site. A simple heuristic for estimating exposure from a front-page submission is a minimum of 100 page views per point, which means a submission that earns hundreds of points can go viral very quickly!</p>
<figure>

    <img loading="lazy" srcset="/2014/02/hacking-hacker-news/hntop_hu_af230ef14bf29a30.webp 320w,/2014/02/hacking-hacker-news/hntop_hu_96494a1c3cf8cdd4.webp 768w,/2014/02/hacking-hacker-news/hntop.png 800w" src="hntop.png"/> 
</figure>

<p>But how many submissions actually make it to the front page, and how many actually make it to the top?</p>
<figure>

    <img loading="lazy" srcset="/2014/02/hacking-hacker-news/hn-points-hist_hu_fce61c39f9d4f40.webp 320w,/2014/02/hacking-hacker-news/hn-points-hist_hu_965cfe0767eb2f9d.webp 768w,/2014/02/hacking-hacker-news/hn-points-hist_hu_4dfce52c7656e605.webp 1024w,/2014/02/hacking-hacker-news/hn-points-hist.png 1200w" src="hn-points-hist.png"/> 
</figure>

<p>On a logarithmic scale, it&rsquo;s evident that the vast majority of Hacker News submissions don&rsquo;t even hit 10 points (the average number of points for a submission is 9.51). Usually, hitting 10 points is a sign that you&rsquo;ve appeared on the front page at least briefly; there, the submission either gains voting momentum or dies quickly due to other rising stars.</p>
<p>But how many submissions <em>do</em> receive hundreds of points? Here&rsquo;s a chart of submissions by month which have received more than 100 points:</p>
<figure>

    <img loading="lazy" srcset="/2014/02/hacking-hacker-news/hn-monthly-submissions-front_hu_e68674c8f8386328.webp 320w,/2014/02/hacking-hacker-news/hn-monthly-submissions-front_hu_bc78bd2173fe5740.webp 768w,/2014/02/hacking-hacker-news/hn-monthly-submissions-front_hu_c1f31cd368da453f.webp 1024w,/2014/02/hacking-hacker-news/hn-monthly-submissions-front.png 1200w" src="hn-monthly-submissions-front.png"/> 
</figure>

<p>The growth rate of top-scoring submissions is correlated with the growth rate of Hacker News submissions themselves, which is not surprising. The number of points a post receives is also dependent on the number of users; as Hacker News grows, the number of users grows as well. Even though the front page cycles frequently, there is still room for great content.</p>
<h2 id="when-is-the-best-time-to-submit-to-hacker-news">When is the best time to submit to Hacker News?</h2>
<p>The age-old question: what is the best time to post so that your post makes it to the front page?</p>
<p>First, let&rsquo;s see when Hacker News has the most activity by observing the average number of submissions for each combination of submission hour and weekday:</p>
<figure>

    <img loading="lazy" srcset="/2014/02/hacking-hacker-news/hn-submissions_hu_80486fbedf5cab32.webp 320w,/2014/02/hacking-hacker-news/hn-submissions_hu_eb0732bd76eae3ac.webp 768w,/2014/02/hacking-hacker-news/hn-submissions_hu_d60c67aefbee27ca.webp 1024w,/2014/02/hacking-hacker-news/hn-submissions.png 1200w" src="hn-submissions.png"/> 
</figure>

<p>Hacker News is most active at around 12 PM EST / 9 AM PST, at about 40 submissions per hour, when hackers on the East Coast submit just before eating lunch and hackers on the West Coast submit just after getting to work. Weekends, unsurprisingly, are completely dead.</p>
<p>If you submitted your link at 12 PM, you&rsquo;d have a lot of competition, but it would be easier to get upvotes since there would be more people visiting the site. If you submitted your post on the weekend, there would be no competition, but it would be harder to make the front page.</p>
<p>What is the best weekday + hour to submit such that your submission goes viral? An easy way to estimate the best time is to analyze the times of submission of previous posts with large amounts of points; with enough data (we have enough), it&rsquo;ll provide a strong guess.</p>
<figure>

    <img loading="lazy" srcset="/2014/02/hacking-hacker-news/hn-front-page_hu_a534949d956eeb7f.webp 320w,/2014/02/hacking-hacker-news/hn-front-page_hu_8f31dfb6afef1d6a.webp 768w,/2014/02/hacking-hacker-news/hn-front-page_hu_f04007a58755d1dd.webp 1024w,/2014/02/hacking-hacker-news/hn-front-page.png 1200w" src="hn-front-page.png"/> 
</figure>

<p>As it turns out, the submission times of posts are <em>uncorrelated</em> with the number of viral posts. There are <em>slightly</em> more when submitting at peak activity (weekdays at 12 PM EST / 9 AM PST), but it won&rsquo;t make or break an article&rsquo;s success on HN.</p>
<p>Having good content matters more for getting a post to the top of Hacker News, although you probably shouldn&rsquo;t submit an article when there&rsquo;s a major tech event (e.g. Facebook&rsquo;s WhatsApp purchase).</p>
<p><em>UPDATE (2/28): It&rsquo;s been pointed out that measuring the proportion of viral posts (number of viral submissions / number of total submissions) would be a better indicator of odds of article success. Since the number of viral submissions is similar across all time zones, the proportion of viral posts would be greatest on the weekends, since there are dramatically fewer total submissions. However, this logic isn&rsquo;t perfectly correct due to how the article discovery and upvoting system works. I may cover this in a future post.</em></p>
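<p>The proportion described in the update can be sketched as follows. This is only an illustration, not the code behind the charts: the input shape (pairs of Unix timestamp and points), the 100-point cutoff for &ldquo;viral&rdquo;, the use of UTC rather than EST, and the sample rows are all assumptions.</p>

```python
from collections import Counter
from datetime import datetime, timezone

VIRAL_POINTS = 100  # assumed cutoff for a "viral" post


def viral_proportion_by_slot(submissions):
    """Return {(weekday, hour): viral / total} for (unix_time, points) pairs.

    weekday follows datetime.weekday(): 0 is Monday, 6 is Sunday.
    Times are interpreted in UTC (an assumption; the charts use EST).
    """
    totals = Counter()
    virals = Counter()
    for created_at, points in submissions:
        when = datetime.fromtimestamp(created_at, tz=timezone.utc)
        slot = (when.weekday(), when.hour)
        totals[slot] += 1
        if points > VIRAL_POINTS:
            virals[slot] += 1
    # Divide per slot so that quiet slots are not penalized for low volume.
    return {slot: virals[slot] / totals[slot] for slot in totals}


# Hypothetical sample rows: (Unix timestamp, points)
sample = [
    (1393257600, 250),  # Monday 2014-02-24 16:00 UTC
    (1393257900, 3),    # Monday 2014-02-24 16:05 UTC
    (1393689600, 120),  # Saturday 2014-03-01 16:00 UTC
]
proportions = viral_proportion_by_slot(sample)
print(proportions)
```

<p>With real data, slots that have very few submissions would give noisy proportions, so a minimum count per slot would likely be needed before comparing them.</p>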
<h2 id="do-y-combinator-startup-announcements-score-better-on-hn">Do Y Combinator startup announcements score better on HN?</h2>
<p>One of the main benefits of Hacker News is to showcase the startups which Y Combinator has funded. Links about YC Startups contain the YC class name of that startup in their title, such as &ldquo;<a href="https://news.ycombinator.com/item?id=6103506">Watsi (YC W13) raises $1.2M first-of-its-kind philanthropic seed round</a>.&rdquo; Do these links perform better than the typical links submitted to Hacker News?</p>
<figure>

    <img loading="lazy" srcset="/2014/02/hacking-hacker-news/hn-points-class-hist_hu_ac3b4c0c1e570f1.webp 320w,/2014/02/hacking-hacker-news/hn-points-class-hist_hu_bf7f3eb23758546f.webp 768w,/2014/02/hacking-hacker-news/hn-points-class-hist_hu_72b9f82c6ebed07b.webp 1024w,/2014/02/hacking-hacker-news/hn-points-class-hist.png 1200w" src="hn-points-class-hist.png"/> 
</figure>

<p>As it turns out, yes. For normal posts, the average number of points is 9.5 points, but for YC class announcements, the average is 41.7 points (from 1,745 submissions analyzed).</p>
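<p>One way to separate YC class announcements from normal posts is to look for the class tag in the title. The sketch below assumes tags always look like &ldquo;(YC W13)&rdquo; or &ldquo;(YC S11)&rdquo;; the regular expression, the function name, and the point values in the sample are hypothetical.</p>

```python
import re
from statistics import mean

# Assumed tag format: "(YC W13)" or "(YC S11)" anywhere in the title.
YC_TAG = re.compile(r"\(YC ([WS]\d{2})\)")


def average_points_by_group(stories):
    """Average points for YC-tagged titles and for all other titles.

    stories is an iterable of (title, points) pairs; both groups
    must be non-empty, otherwise statistics.mean raises an error.
    """
    tagged, other = [], []
    for title, points in stories:
        if YC_TAG.search(title):
            tagged.append(points)
        else:
            other.append(points)
    return mean(tagged), mean(other)


# Hypothetical sample; the point values are made up for illustration.
sample = [
    ("Watsi (YC W13) raises $1.2M first-of-its-kind philanthropic seed round", 40),
    ("Show HN: My weekend project", 4),
    ("Another story", 6),
]
yc_avg, other_avg = average_points_by_group(sample)
yc_class = YC_TAG.search(sample[0][0]).group(1)
print(yc_class, yc_avg, other_avg)
```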
<p>For fun, which YC classes perform the best on Hacker News?</p>
<figure>

    <img loading="lazy" srcset="/2014/02/hacking-hacker-news/hn-top-class_hu_b5d8a12c0d639c97.webp 320w,/2014/02/hacking-hacker-news/hn-top-class_hu_8e6de5acef9cc30c.webp 768w,/2014/02/hacking-hacker-news/hn-top-class_hu_b650436137eeb5e2.webp 1024w,/2014/02/hacking-hacker-news/hn-top-class.png 1200w" src="hn-top-class.png"/> 
</figure>

<p>W06 placed first because of <a href="https://news.ycombinator.com/item?id=2481576">two</a> <a href="https://news.ycombinator.com/item?id=2481610">announcements</a> about Wufoo, and S11 placed second because of <a href="https://news.ycombinator.com/item?id=6585071">CryptoSeal</a> and <a href="https://news.ycombinator.com/item?id=2846725">Parse</a>.</p>
<h2 id="who-are-the-best-submitters-on-hacker-news">Who are the best submitters on Hacker News?</h2>
<p>Like all popular link aggregators, Hacker News has many spammers who submit large amounts of low quality content. Who are the users who submit quality content?</p>
<p>Calculating the average points of a user&rsquo;s submitted content isn&rsquo;t an accurate measurement, since that can be heavily skewed by one viral post. Therefore, I created a Hacker News &ldquo;<a href="http://en.wikipedia.org/wiki/Batting_average">batting average</a>&rdquo; statistic: which posters have the highest proportion of posts that make it to the front page vs. the total number submitted? (for posts since 2010 and number of submitted posts &gt;= 10)</p>
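<p>A minimal sketch of the batting average calculation, assuming that reaching 10 points is a reasonable proxy for making the front page (the exact front-page test used for the chart is not stated here). The function name and sample data are hypothetical.</p>

```python
from collections import defaultdict

FRONT_PAGE_POINTS = 10  # assumed proxy for "made the front page"
MIN_SUBMISSIONS = 10    # minimum number of posts per user, as stated above


def batting_averages(stories):
    """Return {author: front_page_posts / total_posts} for qualifying authors.

    stories is an iterable of (author, points) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # author maps to [hits, total]
    for author, points in stories:
        record = counts[author]
        record[1] += 1
        if points >= FRONT_PAGE_POINTS:
            record[0] += 1
    return {
        author: hits / total
        for author, (hits, total) in counts.items()
        if total >= MIN_SUBMISSIONS
    }


# Hypothetical sample: alice has 10 posts (8 hits), bob has only 3 posts.
sample = [("alice", 50)] * 8 + [("alice", 2)] * 2 + [("bob", 300)] * 3
averages = batting_averages(sample)
print(averages)
```

<p>Because one viral post counts the same as any other front-page post, this measure is less sensitive to outliers than average points.</p>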
<figure>

    <img loading="lazy" srcset="/2014/02/hacking-hacker-news/hn-top-submitters_hu_4357e32824872e5a.webp 320w,/2014/02/hacking-hacker-news/hn-top-submitters_hu_fa5aea2cb08d2e0b.webp 768w,/2014/02/hacking-hacker-news/hn-top-submitters_hu_10c3382c41243262.webp 1024w,/2014/02/hacking-hacker-news/hn-top-submitters.png 1200w" src="hn-top-submitters.png"/> 
</figure>

<p>It should be no surprise that most of the people on the list are startup founders. It&rsquo;s also not surprising that most of those founders, such as <a href="https://news.ycombinator.com/user?id=mwseibel">mwseibel</a>, <a href="https://news.ycombinator.com/user?id=rahulvohra">rahulvohra</a> and <a href="https://news.ycombinator.com/user?id=tikhon">tikhon</a>, also founded a Y Combinator startup (although Paul Graham <a href="https://news.ycombinator.com/user?id=pg">himself</a> only has a 0.856 average).</p>
<h2 id="what-are-hacker-news-favorite-programming-languages">What are Hacker News&rsquo; favorite programming languages?</h2>
<p>One of the infamous memes about Hacker News is programming language elitism, with favoritism for languages such as Lisp and Erlang.</p>
<p>But what programming languages are indeed the most popular on Hacker News?</p>
<figure>

    <img loading="lazy" srcset="/2014/02/hacking-hacker-news/hn-lang-num-submissions_hu_bda184d12cf4659.webp 320w,/2014/02/hacking-hacker-news/hn-lang-num-submissions_hu_491701d73a87831c.webp 768w,/2014/02/hacking-hacker-news/hn-lang-num-submissions_hu_1e2020979baa11c.webp 1024w,/2014/02/hacking-hacker-news/hn-lang-num-submissions.png 1200w" src="hn-lang-num-submissions.png"/> 
</figure>

<p>JavaScript is very popular, especially with the rising popularity of Node.js. Go is submitted unexpectedly often for such a new language (although it&rsquo;s possible for &ldquo;go&rdquo; to be used in a context outside of a programming language). Lisp and Erlang are indeed obscure, which might discredit the meme.</p>
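<p>Counting language mentions could be done with a whole-word, case-insensitive match on titles, which also shows why &ldquo;go&rdquo; is ambiguous. This is a sketch under assumptions: the language list is an illustrative subset, the sample titles are invented, and the matching rule is not necessarily the one used for the chart.</p>

```python
import re
from collections import Counter

LANGUAGES = ["JavaScript", "Python", "Go", "Lisp", "Erlang", "PHP"]  # illustrative subset

# Whole-word, case-insensitive patterns, so "Google" does not count as "Go".
PATTERNS = {
    lang: re.compile(r"\b" + re.escape(lang) + r"\b", re.IGNORECASE)
    for lang in LANGUAGES
}


def count_language_mentions(titles):
    """Count how many titles mention each language at least once."""
    counts = Counter()
    for title in titles:
        for lang, pattern in PATTERNS.items():
            if pattern.search(title):
                counts[lang] += 1
    return counts


# Hypothetical titles for illustration.
sample_titles = [
    "Why Go is great for servers",
    "Where did all the users go?",  # false positive for Go
    "Google launches new product",  # not matched, thanks to word boundaries
    "Learning Lisp in 2014",
    "JavaScript and Node.js tips",
]
mentions = count_language_mentions(sample_titles)
print(mentions)
```

<p>The second title is counted for Go even though it has nothing to do with the language, which is the kind of overcounting the caveat above describes.</p>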
<p>Which programming languages are most well-liked on HN?</p>
<figure>

    <img loading="lazy" srcset="/2014/02/hacking-hacker-news/hn-lang-avg-submissions_hu_7336a78c84656798.webp 320w,/2014/02/hacking-hacker-news/hn-lang-avg-submissions_hu_41a95cc9a9ed709a.webp 768w,/2014/02/hacking-hacker-news/hn-lang-avg-submissions_hu_aa499a861b27d066.webp 1024w,/2014/02/hacking-hacker-news/hn-lang-avg-submissions.png 1200w" src="hn-lang-avg-submissions.png"/> 
</figure>

<p>&hellip;so Lisp and Erlang <em>are</em> well-liked on HN.</p>
<p>At the least, in both cases, no one on Hacker News likes PHP.</p>
<h2 id="snowden-and-bitcoin">Snowden and Bitcoin</h2>
<p>Edward Snowden&rsquo;s leaks in June 2013 about the NSA and PRISM affected the entire tech industry, including Hacker News. How did Hacker News react to the leaks?</p>
<figure>

    <img loading="lazy" srcset="/2014/02/hacking-hacker-news/hn-snowden_hu_9761adf3262af05e.webp 320w,/2014/02/hacking-hacker-news/hn-snowden_hu_1bcbcb818562d1c6.webp 768w,/2014/02/hacking-hacker-news/hn-snowden_hu_6e3ffffb6a9080fb.webp 1024w,/2014/02/hacking-hacker-news/hn-snowden.png 1200w" src="hn-snowden.png"/> 
</figure>

<p>Strongly.</p>
<p>After the June spike, discussion about the NSA decreased significantly, though it&rsquo;s still a popular topic.</p>
<p>Bitcoin is more interesting since it has had three distinct surges:</p>
<figure>

    <img loading="lazy" srcset="/2014/02/hacking-hacker-news/hn-bitcoin_hu_3ece8effd35dd23e.webp 320w,/2014/02/hacking-hacker-news/hn-bitcoin_hu_7e0819fc53a49a60.webp 768w,/2014/02/hacking-hacker-news/hn-bitcoin_hu_2c69675b7293e133.webp 1024w,/2014/02/hacking-hacker-news/hn-bitcoin.png 1200w" src="hn-bitcoin.png"/> 
</figure>

<p>The June 2011 spike was due to the theft of <a href="https://bitcointalk.org/index.php?topic=16457.0">25,000 Bitcoin</a>, the April 2013 spike happened during the first rise-and-fall from $200/BTC, and the November 2013 spike happened during the second rise-and-fall from $1,000/BTC.</p>
<p>Hacker News is a great model for a link aggregator. It emphasizes the quality of content over the quantity of content, and that has paid off over the years.</p>
<hr>
<p><em>Code for getting all the HN submissions is <a href="https://github.com/minimaxir/hacker-news-download-all-stories">available on GitHub</a>. Unfortunately, the Hacker News data is too large to distribute freely. <a href="http://minimaxir.com/contact/">Contact me</a> if you want the raw data or any data to reproduce the charts.</em></p>
<p><em>Note: there appear to be <a href="https://docs.google.com/spreadsheets/d/1Zdex42KE-8DFIHujhVWjJ3yqilJSws2EbT8VAARPYgE/edit?usp=sharing">some gaps in the data</a> for dates before 2010. This appears to be caused by the API server: for example, compare the <a href="https://news.ycombinator.com/submitted?id=liebke">number of submissions as reported by HN for top user liebke</a> (21) and the <a href="https://hn.algolia.com/api/v1/search_by_date?tags=story,author_liebke">number of submissions as reported by the API</a> for liebke (14, with the last 7 submitted stories missing relative to the HN output). Also, the number of stories in the output data (1,265,114) and in the <a href="https://hn.algolia.com/">server data</a> (~1,267,000 as of publishing) are very close, making it unlikely that the discrepancy is caused by client error. As a result, any chart that is based on a time series does not start earlier than 2010.</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
