Counting Website Traffic

  • Written By: D. Geller
  • Published: November 1, 1999

Problem: Knowing how many people visit your website can help you improve the site and increase revenues. Counting them can be easy or accurate, but not both.

In Depth: In the Golden Age of the Internet (three or four years ago) the success of a website was measured by "hits." A hit is an event that retrieves an item from the website. When you pointed your browser to this page, for example, there were four or five hits registered at our server. One was for the main HTML page code, and the others were for the various graphic elements named on the page. Hits can be a good measure of how hard a server works (or how powerful it is) but are not particularly good measures of how many people visit the site.

Reporting On Your Website

Even though hits are a poor measure of how interesting a site is, you may still hear the measure used. Salespeople, who may not have been briefed on the inherent deceptiveness of hits, or who just prefer to quote the largest numbers they have, sometimes use hits when selling ads or sponsorships. This creates a numbers inflation problem for competing salespeople who may have been directed to use more accurate numbers. They are forced to educate the prospect on the intricacies of web measurement. Universal website auditing would put everyone on the same level playing field. However, although common in the publishing industry, auditing is not yet universal on the Internet.

If hits are not the right measure, what is? It should not be surprising that there is no single "right" measure. As always, what you measure depends on what you want to know. What people really want to measure when they use hits is "page impressions" (also known as "impressions" or "page views" or "pages"). The important event is pointing the browser to a particular page; the number of times a browser has to request files from the webserver is irrelevant.

Impressions are as easily available as hits. Every hit is recorded in the webserver's log file. (See: Web Logs Tell (Almost) All.) To count the impressions it is only necessary to know how to distinguish pages from images. Pages are usually named with extensions of htm, html, asp, or cgi. (The extensions "asp" and "cgi" indicate different forms of program that are used to generate page content dynamically.) Note that Java applets and JavaScript can show the user many "different" pages with only one page impression, because they can change the content of the page from within the browser.
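
The distinction between hits and page impressions can be sketched in a few lines of Python. This is only an illustration, not the method of any particular traffic-reporting product. It assumes the requested paths have already been extracted from the log, it uses the four page extensions named above, and it treats a request for a directory (a path ending in "/") as a page, which is an assumption about how the site is configured. The function names are made up for the example.

```python
from collections import Counter

# Extensions treated as pages, taken from the text; adjust for your own site.
PAGE_EXTENSIONS = {"htm", "html", "asp", "cgi"}


def is_page(path):
    """Return True if the requested path looks like a page rather than an image."""
    clean = path.split("?", 1)[0]        # drop any query string
    name = clean.rsplit("/", 1)[-1]      # last path component
    if name == "":
        # Assumption: "/" or "/dir/" is answered with a default page.
        return True
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    return ext in PAGE_EXTENSIONS


def count_traffic(requested_paths):
    """Return (hits, impressions, per-page counts) for an iterable of paths."""
    hits = 0
    pages = Counter()
    for path in requested_paths:
        hits += 1                        # every request is a hit
        if is_page(path):                # only page requests are impressions
            pages[path.split("?", 1)[0]] += 1
    return hits, sum(pages.values()), pages


if __name__ == "__main__":
    sample = [
        "/index.asp",
        "/images/logo.gif",
        "/images/banner.gif",
        "/events/events.asp",
        "/index.asp?ref=home",
    ]
    hits, impressions, pages = count_traffic(sample)
    print(hits, impressions, pages.most_common(2))
```

In this made-up sample, five hits yield only three page impressions, which is the gap between the two measures that the text describes.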

Impressions are the most basic honest measure that one can provide about website activity, although not all impressions are equally valid, as documented below. They are important tools for understanding how people use your site. Some websites examine the page impression data every day; others look only weekly or monthly. Impression statistics can pinpoint areas of interest that will lead to content changes in order to grab more visitors. (Figure 1 shows a sample impression count for this site from a commercially available traffic-reporting tool.) However, there are many questions that raw page impression counts cannot answer.

Figure 1
Top 10 most commonly accessed pages during period:
/index.asp with 14576 page views (12.32%)
/news_analysis/10-99/na_ec_dpg_10_22_99_1.asp with 11221 page views (10.90%)
/research_notes/09-99/pn_hw_rak_9_99_3.asp with 9763 page views (6.81%)
/events/events.asp with 9523 page views (5.77%)
/research_notes/08-99/tu_ec_dpg_8_99_1.asp with 8946 page views (3.78%)
/news_analysis/10-99/na_st_lpt_10_26_99_2.asp with 8898 page views (2.54%)
/research_notes/09-99/pn_dw_mfr_9_99_1.asp with 8885 page views (2.46%)
/research_notes/09-99/pn_ba_srm_9_99_1.asp with 8684 page views (2.17%)
/research_notes/09-99/tu_dw_mfr_9_99_1.asp with 8560 page views (1.55%)
/search/query.idq with 8259 page views (1.53%)

Page impressions cannot tell you how strongly your site holds people's interest. Do your 10,000 page impressions come from 10,000 individuals, each looking at one page, or from one individual who really likes your site? Usually the answer is somewhere in the middle. The number of people who visit the site is an important qualification of page impressions, as is the average time spent on the site by a visitor. These measures tell you about the inherent interest of the site and the degree to which the site is known to your target audience. Both are important variables in redesigning content and in developing the best advertising sales strategy.


A visit is usually defined as a series of interactions with the website that does not contain any thirty-minute gaps. If surfer A visits your home page, takes a phone call for twenty minutes, clicks on a link to another page in the site, takes another twenty-minute phone call, and clicks on another link, all three clicks are part of one visit. If, while talking on the phone, A were to surf over to the World Wrestling Federation site, and come back to your site when the calls were over, there would still be only one visit. (This might actually depend on how A got back to your site and on the traffic analysis software you use. In some cases you would see three visits.) If the phone calls were to extend to 31 minutes, though, then each impression would be a separate visit.
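
The thirty-minute rule can be sketched as follows. This is a minimal illustration that assumes the page requests of a single surfer have already been isolated and that a gap of exactly thirty minutes starts a new visit; real analyzers may draw the boundary differently. The function name is hypothetical.

```python
from datetime import datetime, timedelta

VISIT_TIMEOUT = timedelta(minutes=30)


def split_into_visits(timestamps, timeout=VISIT_TIMEOUT):
    """Group one surfer's page-request times into visits.

    A request joins the current visit if it arrives less than `timeout`
    after the previous request; otherwise it starts a new visit.
    """
    visits = []
    for ts in sorted(timestamps):
        if visits and ts - visits[-1][-1] < timeout:
            visits[-1].append(ts)
        else:
            visits.append([ts])
    return visits


if __name__ == "__main__":
    start = datetime(1999, 11, 1, 9, 0)
    # Three clicks separated by twenty-minute phone calls: one visit.
    short_calls = [start + timedelta(minutes=m) for m in (0, 20, 40)]
    # The same clicks separated by 31-minute calls: three visits.
    long_calls = [start + timedelta(minutes=m) for m in (0, 31, 62)]
    print(len(split_into_visits(short_calls)))
    print(len(split_into_visits(long_calls)))
```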

Identifying visits from a log file takes a tad more cleverness than counting page impressions. A log analyzer starts with the IP address for the browser, but this is frequently not sufficient to distinguish between different surfers, especially when they are coming from AOL. So the analyzer may also look at the referrer field to establish a trail within the site, at the cookie and username fields, and of course at the time. Different analyzers use different algorithms and so will produce slightly different results. Differences of up to ten percent between products on some measures are not unreasonable. Calculations of the average time per visit (or per page) are also skewed by the choice of algorithm. Specifically, how much time should be assigned to the very last page? The website has no record of a user pointing the browser someplace else, so it is up to the analysis software to assign a value here. The value assigned can range from zero to 29 minutes, the latter as if the user stayed on the page so long that the visit terminated. A site would clearly want to report as long a time as possible to others, but might not want to bias its internal analyses with the most extreme of the possible choices. Products differ on how they handle this.
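
To see how much the last-page choice matters, consider this small sketch. It assumes each visit is given as a list of request times in minutes, and it makes the time credited to the final page an explicit parameter. The function names and the sample visits are invented for illustration and do not reflect any particular product.

```python
def visit_duration_minutes(request_minutes, last_page_minutes=0.0):
    """Time from first to last request, plus an assumed time on the last page."""
    if not request_minutes:
        return 0.0
    span = max(request_minutes) - min(request_minutes)
    return span + last_page_minutes


def average_visit_minutes(visits, last_page_minutes=0.0):
    """Average duration over all visits for a given last-page assumption."""
    if not visits:
        return 0.0
    total = sum(visit_duration_minutes(v, last_page_minutes) for v in visits)
    return total / len(visits)


if __name__ == "__main__":
    # Hypothetical visits: request times in minutes from the start of each visit.
    visits = [[0, 5, 12], [0], [0, 3]]
    print(average_visit_minutes(visits, last_page_minutes=0))   # least generous
    print(average_visit_minutes(visits, last_page_minutes=29))  # most generous
```

With these three invented visits the average moves from 5 minutes to 34 minutes depending only on the value assigned to the last page, which is why two products can report very different times from the same log.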

The next level of information the company might want to know is what site visitors actually do. In other words, what kind of paths do they take through the site? For many sites, only fairly short paths are needed for this kind of analysis. For example, do more visitors come to the perfume sales page by doing a search or after reading the article you posted about antique perfume bottles at the Victoria Museum? Some traffic analysis programs can provide this kind of data in some form, such as a list of the top five previous pages for each of the ten most visited pages. Not all commercial packages do even this much, and because of the combinatorial explosion, longer path analysis, if needed, will probably require a custom solution. Also, path analysis can be confounded when a site uses more than one server, because the sheer size of web logs may make merging them unwieldy.
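
A "top previous pages" report of the kind just described can be sketched like this. The example assumes the log has already been split into visits, each an ordered list of page paths; the page names are invented, and the function is an illustration rather than the algorithm of any commercial package.

```python
from collections import Counter, defaultdict


def top_previous_pages(visit_paths, top_n=5):
    """Map each page to its most common immediately preceding pages.

    visit_paths: list of visits, each an ordered list of page paths.
    Returns {page: [(previous_page, count), ...]} with at most top_n entries.
    """
    previous = defaultdict(Counter)
    for pages in visit_paths:
        # Pair every page with the page requested just before it.
        for prev, cur in zip(pages, pages[1:]):
            previous[cur][prev] += 1
    return {page: counts.most_common(top_n) for page, counts in previous.items()}


if __name__ == "__main__":
    visits = [
        ["/search", "/perfume"],
        ["/articles/bottles", "/perfume"],
        ["/search", "/perfume"],
        ["/index", "/search"],
    ]
    report = top_previous_pages(visits)
    print(report["/perfume"])
```

Because only pairs of consecutive pages are counted, the work grows with the size of the log rather than with the number of possible paths. Tracking longer paths multiplies the combinations to count, which is the combinatorial explosion mentioned above.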

Ad Impressions

Page impressions and visit statistics are good ways to explain to potential advertisers why they should advertise on your site, but once they do they will be less interested in general numbers and more interested in how well their own ads are doing. There are two common measures of ad performance: ad impressions and clickthroughs. An ad impression is a presentation of the ad: one occurs each time a page carrying the ad is presented to the user. For a site that serves its own ads, all the information needed to report on ads can be made available through the log; each ad is an image file whose name can be read from the log. Confusingly, when speaking about ads, ad impressions may just be called "impressions."

A clickthrough occurs when a surfer clicks on an ad to be taken to a page on your site or on the advertiser's. Again, a site that handles its own ad serving can arrange to read clickthrough counts from its log. However, most sites use third-party software or an outsourced service for their ad serving. In these cases the reports are created from the adserver's logs. Note that there is typically no easy way to correlate data from the webserver logs with data from the adserver logs.

Are the Data Valid?

Unfortunately, there are many ways in which the numbers from the webserver and adserver logs can be inaccurate. Not every impression is a reportable impression, and not every absence of an impression means that a page was not looked at.

When a browser is pointed to a page, there can be a complicated series of transactions before the browser decides whether to download the page. The browser may have viewed the page recently, and may therefore have a copy of it cached. In this case the browser will first make contact with the server to determine whether the page has changed; if not, the page can be displayed from the cache. Unfortunately, if the page is not reloaded to the browser the log will have no record of the transaction. The loss is particularly significant when services like AOL, which do caching for all of their members, are taken into account. It is possible to capture these events with software/hardware products that sit on the network and interpret the protocol-level communications.

As for impressions that should not be counted, the main source is web crawlers, usually launched by search engines. While having a search engine visit your site is a good thing, it can inflate your traffic and ad statistics. Another source of contamination can be visits by people within your company, such as web editors and quality assurance personnel. It is possible to remove both kinds of contamination from the logs. Many crawlers are well enough behaved to identify themselves in the user agent field, and people within the company can usually be identified by IP address. The auditing company ABCInteractive, a subsidiary of the publishing industry monitoring service Audit Bureau of Circulations, posts a list of known crawlers on its website. However, not every crawler is kind enough to identify itself. Such crawlers can be detected by their behavior, but this is not a feature of any current commercial traffic analysis product.
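
Removing both kinds of contamination might look like the sketch below. The crawler markers and the internal network range are placeholders chosen for illustration, not a real list of known crawlers; a site would substitute a maintained crawler list and its own address ranges. As noted above, a crawler that does not identify itself will pass this filter.

```python
import ipaddress

# Hypothetical values for illustration only.
CRAWLER_MARKERS = ("bot", "crawler", "spider")
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def is_reportable(ip, user_agent):
    """Return False for self-identified crawlers and for in-house traffic."""
    agent = user_agent.lower()
    if any(marker in agent for marker in CRAWLER_MARKERS):
        return False                       # crawler named in the user agent field
    address = ipaddress.ip_address(ip)
    return not any(address in net for net in INTERNAL_NETWORKS)


if __name__ == "__main__":
    requests = [
        ("192.0.2.7", "Mozilla/4.0 (compatible; MSIE 5.0)"),  # outside visitor
        ("192.0.2.8", "ExampleBot/1.0"),                      # made-up crawler
        ("10.1.2.3", "Mozilla/4.0 (compatible; MSIE 5.0)"),   # in-house editor
    ]
    kept = [r for r in requests if is_reportable(*r)]
    print(len(kept), "of", len(requests), "requests are reportable")
```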

Panel Measurement

Where page impressions are merely representations about traffic, advertising numbers are the basis of contracts. Ads are sold for a fixed number of impressions or clickthroughs, or may be shown over a period of time with a fee incurred for each of these events. If the ad numbers are inflated, an advertiser is not getting what it paid for. It should be no surprise, then, that there are ways to determine site traffic other than log-based reporting. A number of services provide off-line, panel-based surveys of website popularity and ad reach. These are not primarily used by websites or advertisers to check their own historic data. Instead, the customers tend to be prospective advertisers trying to determine which sites are best to advertise on, and current advertisers looking to add or switch sites. These are the analogues of the Nielsen ratings for television shows, and Nielsen is indeed one company in this space.

However, a recent study comparing log-based reports with panel measurement found large discrepancies, ranging from 14 to 323 percent. Panels both over- and underreport as compared with log-based analysis. Although log-based data might appear to be inherently more accurate, the research team pointed to such problems as crawlers, the failure to separate national from international traffic, and data corruption from server crashes as good causes for skepticism about the precision of log-based measurements as guides to what an advertiser can actually expect from a site.
