Interesting post by the reddit people on the apparent miscounting of traffic by third-party traffic counting sites.

The basic issue is the enormous discrepancy between the traffic estimates produced by outfits like Alexa and “more reliable” numbers from Google Analytics. Here’s the thing: each works the problem from the opposite end. Alexa follows the TV ratings model (find volunteers, watch what they do, extrapolate to the population) while Google takes the publisher model (count how many copies you sell and where they go). Both are trying to estimate your market reach, generally expressed in terms of uniques and pageviews.

Each approach is bound to make certain kinds of mistakes. When you only sample a population, you’re prone to sampling errors. Alexa, and sites like it, have a huge sample bias in favor of … let’s call them idiots, because only “idiots” allow the necessary spyware on their computers (e.g. if you visit one of their websites with IE running on XP SP1 you’ll probably end up with their crap on your computer without even agreeing to it; try the same thing with Chrome or Safari and … nothing).

The original TV ratings system, pioneered by Nielsen, used volunteers and logbooks (my wife and I actually filled in logs for radio listening about six years back, so this method is still in use). It has all kinds of obvious flaws, not least of which is that you’re not going to measure the viewing habits of people who suck at keeping logbooks up-to-date (even if they volunteer to fill them in). Even more modern automated systems have major sample bias issues. Alexa (et al.) piggyback their spyware on apparently useful add-ons, but let’s just say that the typical reddit user is not the kind of person who is going to be well-represented in their statistics.
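
To put rough numbers on that kind of sample bias, here's a toy sketch in Python. Everything in it is made up for illustration: the two groups, their sizes, how likely each is to end up on the panel, and how likely each is to visit the site. It is not Alexa's actual method, just the naive "watch the panel, scale up to the population" arithmetic.

```python
# Toy model of panel-based traffic estimation with a biased panel.
# All groups and rates below are hypothetical, chosen only to show the effect.

GROUPS = [
    # Large group, fairly likely to have the toolbar, rarely visits the site.
    {"name": "casual", "size": 900_000, "panel_rate": 0.02, "visit_rate": 0.01},
    # Small group, almost never has the toolbar, visits the site a lot.
    {"name": "technical", "size": 100_000, "panel_rate": 0.001, "visit_rate": 0.30},
]


def true_visitors(groups):
    """Actual number of visitors across the whole population."""
    return sum(g["size"] * g["visit_rate"] for g in groups)


def naive_panel_estimate(groups):
    """Fraction of the panel that visits, scaled up to the full population."""
    population = sum(g["size"] for g in groups)
    panel_size = sum(g["size"] * g["panel_rate"] for g in groups)
    panel_visitors = sum(
        g["size"] * g["panel_rate"] * g["visit_rate"] for g in groups
    )
    return panel_visitors / panel_size * population


if __name__ == "__main__":
    print("true visitors: ", round(true_visitors(GROUPS)))
    print("panel estimate:", round(naive_panel_estimate(GROUPS)))
```

With these invented numbers the panel-based estimate comes out at well under a third of the true figure, simply because the people who visit most are the people least likely to be on the panel.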

Google Analytics, on the other hand, is going to overcount a lot of things, especially uniques (e.g. if I have cookies fully or partially disabled and my IP address changes) and pageviews (when is a refresh a new page view?). Readership doesn’t equal the number of copies printed.
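
Here's an equally rough sketch of the overcounting side. It assumes a tracker that treats every distinct cookie ID as a unique visitor and every hit as a pageview. The log format and names are hypothetical, and this is not a description of how Google Analytics works internally.

```python
# Toy model of cookie-based counting. Each hit is (person, cookie_id, path).
# The "person" column is something a real tracker never gets to see.

HITS = [
    ("alice", "c1", "/"),
    ("alice", "c1", "/"),       # a refresh, counted as a second pageview
    ("bob",   "c2", "/"),
    ("bob",   "c3", "/about"),  # bob cleared his cookies, so he gets a new ID
    ("carol", "c4", "/"),
]


def counted_uniques(hits):
    """Uniques as the tracker sees them: distinct cookie IDs."""
    return len({cookie for _, cookie, _ in hits})


def actual_people(hits):
    """Uniques as they really are: distinct people."""
    return len({person for person, _, _ in hits})


def pageviews(hits):
    """Every hit counts, refreshes included."""
    return len(hits)


if __name__ == "__main__":
    print("counted uniques:", counted_uniques(HITS))
    print("actual people:  ", actual_people(HITS))
    print("pageviews:      ", pageviews(HITS))
```

Three people show up as four uniques, and one of the five pageviews is just a refresh.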

Anyway, I think I have a simple answer for the reddit post. The kinds of people who show up in spyware-based ratings systems are exactly the kinds of people who don’t visit reddit.