Problem: Knowing how many people visit your website can help you improve
the site and increase revenues. Counting them can be easy or accurate, but not
both.
In Depth: In the Golden Age of the Internet (three or four years ago) the success
of a website was measured by "hits." A hit is an event that retrieves an item
from the website. When you pointed your browser to this page, for example, there
were four or five hits registered at our server. One was for the main HTML page
code, and the others were for the various graphic elements named on the page.
Hits can be a good measure of how hard a server works (or how powerful it is)
but are not particularly good measures of how many people visit the site.
On Your Website
Even though hits are a poor measure of how interesting a site is, you may still hear
the measure used. Salespeople, who may not have been briefed on the inherent
deceptiveness of hits, or who just prefer to quote the largest numbers they
have, sometimes use hits when selling ads or sponsorships. This creates a numbers
inflation problem for competing salespeople who may have been directed to use
more accurate numbers. They are forced to educate the prospect on the intricacies
of web measurement. Universal website auditing would put everyone on a
level playing field. However, although common in the publishing industry, auditing
is not yet universal on the Internet.
If hits are not the right measure, what is? It should not be surprising that there
is no single "right" measure. As always, what you measure depends on what you
want to know. What people really want to measure when they use hits is "page
impressions" (also known as "impressions" or "page views" or "pages"). The important
event is pointing the browser to a particular page - the number of times a browser
has to request files from the webserver is irrelevant.
Impressions are as easily available as hits. Every hit is recorded in the webserver's log
file. (See: Web Logs Tell (Almost) All.) To count the impressions it is only necessary to
know how to distinguish pages from images. Pages are usually named with extensions
of htm, html, asp, or cgi. (The extensions "asp" and "cgi" indicate different
forms of program that are used to generate page content dynamically.) Note that
some technologies can present many "different" pages with only one page impression,
because they can change the content of the page from within the browser.
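The distinction between hits and page impressions can be sketched in a few lines of Python. This is only an illustration, assuming the requested paths have already been pulled out of the log; the function names and the sample paths are made up for the example, and the rule that a path ending in "/" counts as a page is an assumption, not something every server guarantees.

```python
import os
from urllib.parse import urlsplit

# Extensions the article names as typical for pages.
PAGE_EXTENSIONS = {".htm", ".html", ".asp", ".cgi"}


def is_page(requested_path: str) -> bool:
    """Return True if the request looks like a page rather than an image."""
    path = urlsplit(requested_path).path  # drop any ?query part
    if path.endswith("/"):
        # Assumption: a directory request is answered with a default page.
        return True
    extension = os.path.splitext(path)[1].lower()
    return extension in PAGE_EXTENSIONS


def count_hits_and_impressions(requested_paths):
    """Every request is a hit; only page requests are impressions."""
    hits = 0
    impressions = 0
    for requested_path in requested_paths:
        hits += 1
        if is_page(requested_path):
            impressions += 1
    return hits, impressions


# Hypothetical requests: two pages and three graphic elements.
sample = [
    "/index.html",
    "/images/logo.gif",
    "/images/banner.gif",
    "/images/photo.jpg",
    "/search.cgi?q=perfume",
]
print(count_hits_and_impressions(sample))  # (5, 2)
```

Five hits shrink to two page impressions here, which is the kind of gap that makes hit counts look so much more impressive than they are.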
Page impressions are the most basic honest measure that one can provide about website activity,
although not all impressions are equally valid, as discussed below. They are
important tools to understand how people use your site. Some websites will examine
the page impression data every day; others look only weekly or monthly. Impression
statistics can pinpoint areas of interest that will lead to content changes
in order to grab more visitors. (Figure 1 shows a sample impression count for
this site from a commercially available traffic-reporting tool.) However, there
are many questions that raw page impression counts cannot answer.
Figure 1. 10 most commonly accessed pages during period:
14576 page views
11221 page views
9763 page views
9523 page views
8946 page views
8898 page views
8885 page views
8684 page views
8560 page views
8259 page views
Page impressions cannot tell you how much your site grabs people. Do your 10,000
page impressions come from 10,000 individuals, each looking at one page, or
do they derive from one individual who really likes your site? Usually the answer
is somewhere in between. The number of people who visit the site is an important
qualification of page impressions, as is the average time spent on the site
by a visitor. They tell about the inherent interest of the site and the degree
to which the site is known to your target audience. Both of these are important
variables in redesigning content and in developing the best advertising sales
strategy.
A visit is usually defined as a series of interactions with the website that does
not contain any thirty-minute gaps. If surfer A visits your home page, takes
a phone call for 20 minutes, clicks on a link to another page in the site, takes
another twenty-minute phone call, and clicks on another link, all three clicks
are part of one visit. If, while talking on the phone, A were to surf over to
the World Wrestling Federation site, and come back to your site when the calls
were over, there would still be only one visit. (This might actually depend
on how A got back to your site and on the traffic analysis software you use.
In some cases you would see three visits.) If the phone calls were to extend
to 31 minutes, though, then each impression would be a separate visit.
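The thirty-minute rule is easy to express in code. The sketch below assumes the page requests have already been attributed to a single surfer (the hard part in practice) and simply splits that surfer's timestamps into visits. The function name and the sample times are illustrative only.

```python
from datetime import datetime, timedelta

# A visit ends when thirty minutes pass with no request.
VISIT_GAP = timedelta(minutes=30)


def split_into_visits(timestamps, gap=VISIT_GAP):
    """Group one surfer's request times into visits.

    A request joins the current visit if it arrives less than `gap`
    after the previous request; otherwise it starts a new visit.
    """
    visits = []
    for stamp in sorted(timestamps):
        if visits and stamp - visits[-1][-1] < gap:
            visits[-1].append(stamp)
        else:
            visits.append([stamp])
    return visits


start = datetime(2000, 1, 1, 9, 0)

# Two 20-minute phone calls between clicks: one visit of three pages.
short_calls = [start, start + timedelta(minutes=20), start + timedelta(minutes=40)]
print(len(split_into_visits(short_calls)))  # 1

# Two 31-minute phone calls: each click becomes its own visit.
long_calls = [start, start + timedelta(minutes=31), start + timedelta(minutes=62)]
print(len(split_into_visits(long_calls)))  # 3
```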
Identifying visits from a log file takes a tad more cleverness than counting
page impressions. A log analyzer starts with the IP address for the browser,
but this is frequently not sufficient to distinguish between different surfers,
especially when they are coming from AOL. So the analyzer may also look at the
referring field to establish a trail within the site, the cookie and username
fields, and of course the time. Different analyzers have different algorithms
and so will produce slightly different results. Differences of up to ten percent
between different products on some measures are not unreasonable. Calculations
of the average time per visit (or per page) are also skewed by the choice of
algorithm. Specifically, how much time should be assigned to the very last page?
The website has no record of a user pointing the browser someplace else, so
it is up to the analysis software to assign a value here. The value assigned
can range from 29 minutes, as if the user stayed on the page so long that the
visit terminated, to zero. A site would clearly want to report as long a time
as possible to others, but might not want to bias its internal analyses with
a value produced by the most extreme of the possible choices. Different products
differ on how they handle this.
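The effect of the last-page choice can be seen with a small calculation. This is a sketch, not the method of any particular product: it assumes each visit is given as a list of request times, and it exposes the time credited to the final page as a parameter (here called last_page_minutes, a name invented for the example).

```python
from datetime import datetime


def average_visit_minutes(visits, last_page_minutes=0.0):
    """Average visit length in minutes.

    The log only shows the time between the first and last request of a
    visit, so the time spent on the final page must be assumed.
    """
    if not visits:
        return 0.0
    total = 0.0
    for stamps in visits:
        observed = (max(stamps) - min(stamps)).total_seconds() / 60
        total += observed + last_page_minutes
    return total / len(visits)


# Hypothetical data: one two-page visit lasting ten minutes, one single-page visit.
visits = [
    [datetime(2000, 1, 1, 9, 0), datetime(2000, 1, 1, 9, 10)],
    [datetime(2000, 1, 1, 11, 0)],
]
print(average_visit_minutes(visits, last_page_minutes=0))   # 5.0
print(average_visit_minutes(visits, last_page_minutes=29))  # 34.0
```

The same two visits average five minutes under one assumption and 34 minutes under the other, so it pays to know which convention a report uses before comparing numbers.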
The next level of information the company might want to know is what site visitors
actually do. In other words, what kind of paths do they take through the site?
For many sites, only fairly short paths are needed for this kind of analysis.
For example, do more visitors come to the perfume sales page by doing a search or after
reading the article you posted about antique perfume bottles at the Victoria
Museum? Some traffic analysis programs can provide this kind of data in some
form, such as a list of the top five previous pages for each of the ten most
visited pages. Not all commercial packages do even this much, and due to the
combinatorial explosion, longer path analysis, if needed, will probably require
a custom solution. Also, path analysis can be confounded when a site uses more
than one server, because the sheer size of web logs may make merging them unwieldy.
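A short-path report of the "top previous pages" kind can be sketched as follows. It assumes each log record has already been reduced to a (page, referring page) pair taken from the referring field; the function name and the page names are hypothetical.

```python
from collections import Counter, defaultdict


def top_previous_pages(requests, top_n=5):
    """For each page, list the most common pages visitors came from.

    `requests` is an iterable of (page, referrer) pairs.
    """
    previous = defaultdict(Counter)
    for page, referrer in requests:
        previous[page][referrer] += 1
    return {page: counts.most_common(top_n) for page, counts in previous.items()}


# Hypothetical records for the perfume sales page.
requests = [
    ("/perfume.html", "/search.cgi"),
    ("/perfume.html", "/articles/bottles.html"),
    ("/perfume.html", "/search.cgi"),
]
print(top_previous_pages(requests))
# {'/perfume.html': [('/search.cgi', 2), ('/articles/bottles.html', 1)]}
```

This only looks one step back. Tracking longer paths means counting sequences rather than pairs, which is where the combinatorial explosion sets in.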
Page impressions and visit statistics are good ways to explain to potential
advertisers why they should advertise on your site, but once they do they will
be less interested in general numbers and more interested in how well their
own ads are doing. There are two common measures of ad performance: ad impressions
and clickthroughs. An ad impression is a presentation of the ad: one ad impression
occurs each time a page carrying the ad is presented to the user. For a site that serves its own
ads, all the information needed to report on ads can be made available through
the log; each ad is an image file whose name can be read from the log. Confusingly,
when speaking about ads, ad impressions may just be called "impressions."
A clickthrough occurs when a surfer clicks on an ad to be taken to a page on your
site or on the advertiser's. Again, a site that handles its own ad serving can
arrange to read clickthrough counts from its log. However, most sites use third-party
software or an outsourced service for their ad serving. In these cases the reports
are created from the adserver's logs. Note that there is typically no easy way
to correlate data from the webserver logs with data from the adserver logs.
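For a site that serves its own ads, both counts can be read from the requested paths in the log. The sketch below rests on two assumptions that are purely illustrative: that ad images live under a directory called /ads/, and that clicks pass through a redirect script at /click with the ad named in an "ad" parameter. A real site would substitute its own conventions.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

AD_IMAGE_PREFIX = "/ads/"  # hypothetical directory holding ad images
CLICK_PATH = "/click"      # hypothetical redirect script that records clicks


def ad_report(requested_paths):
    """Count ad impressions and clickthroughs per ad."""
    impressions = Counter()
    clicks = Counter()
    for requested_path in requested_paths:
        parts = urlsplit(requested_path)
        if parts.path.startswith(AD_IMAGE_PREFIX):
            # Each request for an ad image is one ad impression.
            impressions[parts.path[len(AD_IMAGE_PREFIX):]] += 1
        elif parts.path == CLICK_PATH:
            # Each request to the redirect script is one clickthrough.
            for ad_name in parse_qs(parts.query).get("ad", []):
                clicks[ad_name] += 1
    return impressions, clicks


sample_requests = [
    "/ads/perfume.gif",
    "/index.html",
    "/ads/perfume.gif",
    "/click?ad=perfume.gif",
    "/ads/books.gif",
]
ad_impressions, ad_clicks = ad_report(sample_requests)
print(dict(ad_impressions))  # {'perfume.gif': 2, 'books.gif': 1}
print(dict(ad_clicks))       # {'perfume.gif': 1}
```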
Is the Data Valid?
Unfortunately, there are many ways in which the numbers from the webserver and
adserver logs can be inaccurate. Not every impression is a reportable impression,
and not every absence of an impression means that a page was not looked at.
When a browser is pointed to a page, there can be a complicated series of transactions
before the browser decides whether to download the page. The browser may have
viewed the page recently, and may therefore have a copy of it cached. In this
case the browser will first make contact with the server to determine whether
the page has changed; if not, the page can be displayed from the cache. Unfortunately,
if the page is not reloaded to the browser the log will have no record of the
transaction. The loss is particularly significant when services like AOL, which
do caching for all of their members, are taken into account. It is possible
to capture these events with software/hardware products that sit on the network
and interpret the protocol-level communications.
As for the impressions that should not be reported, the main source is the various web
crawlers, usually launched by search engines. While having a search engine visit
your site is a good thing, it can inflate your traffic and ad statistics. Another
source of contamination can be visits by people within your company, such as
web editors and quality assurance personnel. It is possible to remove both kinds
of contamination from the logs. Many crawlers are well enough behaved to identify
themselves in the user agent field, and people within the company can usually
be identified by IP address. The auditing company ABCInteractive, a subsidiary
of the publishing industry monitoring service Audit Bureau of Circulation, posts
a list of known crawlers on its website. However, not every crawler is kind enough
to identify itself. Such crawlers can be detected by their behavior, but this is not
a feature of any current commercial traffic analysis software.
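Removing the two kinds of contamination described above might look like the sketch below. The crawler markers and the internal network are placeholders chosen for the example, not a real crawler list: an actual filter would use a maintained list of known crawlers and the company's own address ranges.

```python
import ipaddress

# Illustrative substrings only; a real list of known crawlers is much longer.
CRAWLER_MARKERS = ("bot", "crawler", "spider")

# Placeholder for the company's own address range (a documentation-only network).
INTERNAL_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_reportable(ip, user_agent):
    """Return False for self-identified crawlers and for in-house traffic."""
    agent = user_agent.lower()
    if any(marker in agent for marker in CRAWLER_MARKERS):
        return False
    address = ipaddress.ip_address(ip)
    return not any(address in network for network in INTERNAL_NETWORKS)


print(is_reportable("198.51.100.7", "Mozilla/4.0 (compatible; MSIE 5.0)"))  # True
print(is_reportable("198.51.100.8", "ExampleBot/1.0"))                      # False
print(is_reportable("192.0.2.15", "Mozilla/4.0 (compatible; MSIE 5.0)"))    # False
```

A filter like this catches only crawlers that announce themselves in the user agent field; the badly behaved ones still get through.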
While page impressions are merely representations about traffic, advertising numbers
are the basis of contracts. Ads are sold for a fixed number of impressions or
clickthroughs, or may be shown over a period of time with a fee incurred for
each of these events. If the ad numbers are inflated, an advertiser is not getting
what it paid for. It should be no surprise that there are ways to determine
site traffic other than log-based reporting. A number of services provide off-line
surveys of website popularity and ad reach. These are not primarily used by
websites or advertisers to check their own historic data. Instead, the customers
tend to be prospective advertisers trying to determine which sites are best
to advertise on, and current advertisers looking to add or switch sites. These
are the analogues of the Nielsen ratings for television shows, and Nielsen is
indeed one company in this space.
A recent study comparing log-based reports with panel measurement found
discrepancies ranging from 14 to 323 percent. Panels both over- and underreport as compared
with log-based analysis. Although log-based data might appear to be inherently
more accurate, the research team pointed to such problems as crawlers, the failure
to separate national from international traffic, and data corruption from server
crashes as being good causes for skepticism about the precision of log-based
measurements as guides to what an advertiser can actually expect from a site.