This is Part II of a two-part article.
Part I: The Why's and What's of Auditing
Part II: The Audit Process
What's in an audit?
The audit is a process for verifying the numbers that you report to
your advertisers. Audits can be performed in a number of different ways.
- Server-based audits examine data that is available at the server, most importantly
traffic logs and web logs. An auditing organization will prowl through
the logs to check for various kinds of impressions that should not be
reported. This investigation will include an examination of the parameters
you use to run your traffic analysis programs. Auditors may insert software
in the web server that causes independent logs, totally under the auditor's
control, to be created.
- Panel-based audits measure the surfing behavior of a sample panel of users, and
attempt to project that statistically to the entire Web population.
- Browser-based audits attempt to confirm actual ad displays. For example, an applet
can be attached to an ad or to a page; the applet will report when the
ad is actually displayed on some user's browser.
Larger consumer sites like Yahoo and Amazon.com, and their advertisers, use panel-based
audits, and the numbers are sometimes front-page news. Smaller consumer
sites and B2B sites generally don't have the volume for panel-based audits
to be statistically significant, and rely mostly on server-based audits.
Browser-based audits are a newer technique and are not heavily used.
How good are the different techniques? Jim Spaeth, President of the Advertising
Research Foundation, tells of comparisons where on site X a server-based
procedure showed 15% of the traffic shown by a panel-based audit, but
on a second site Y the order of the methods was reversed, with the server-based
procedure showing 300% of the traffic that the panel-based audit did.
"This kind of result gives people chills down the back of the neck," Mr.
Spaeth said. He also noted that different procedures of the same class
tend to produce different numbers.
Figure 1, from a comparison of three different measures of traffic on Yahoo in 1999,
shows graphically how different techniques may differ; the Figure was
originally published on TheStandard.com.
Who sets the standards?
When your CFO faces an audit, it's always perfectly clear what's required.
If there are problems, the auditor can explain exactly what they are.
Try this: Ask your CFO if it would be surprising to have an audit performed
by two different highly respected firms on the same business at the same
time and get wildly different results. You already know what the answer
is: some variant of "that shouldn't happen." That's because accounting
firms and standards bodies have agreed on rules for audits that cover
almost any question that could be asked. Yet, despite the apparent simplicity
of the data that need to be analyzed and the fact that the Web is all
technology all the time, the standards just aren't there yet.
This may be partly because there is no single recognized standards body as
there is for financial accounting (within a country). However, that situation
is starting to change as voluntary or ad hoc organizations
put in the work to develop standards. One such organization is FAST, which
stands for Future of Advertising Stakeholders. FAST has developed a number
of draft standards for how and what to measure, and some are being adopted
voluntarily. However, there is no legal or even quasi-legal pressure to
make sites, software manufacturers or auditing firms adhere to them.
One promising attempt to level the playing field is the planned September
launch of Audit Central, a web site that will publish audit reports that
have been made publicly available by the sites that were audited. The
site is run by ABC Interactive, BPA International, and Engage I/Pro. These
competing audit firms have a clear interest in improving the quality of
audits and public recognition of their value. The site is scheduled to
begin with approximately 600 reports, all from companies that have agreed
to make their reports public.
How Many Visitors? A Sample Nightmare.
One of the simplest things that any site wants to know is how many different
individuals - "unique visitors" in industry parlance - visit the site.
FAST's draft standard on Metrics and Methodology
suggests "three acceptable methods for identifying unique users: unique
registration, unique cookies and unique IP address with heuristic." Of
these, it suggests that unique registration is the best: "Sites that
register visits should have no problem determining the page requests that
belong to the same visitor. A site must use 100% registration in order
to use this method validly."
Next is the use of unique cookies. If a unique cookie is dropped on every browser,
the user can be uniquely identified even without any personal information.
The third method calls for the use of IP addresses. However, IP addresses
are only an approximate match to actual users. As FAST states, "It must
be noted that IP addresses can and often do represent more than one user,
so this measure does not necessarily represent the number of people reached.
It should also be noted that dynamically assigned IP addresses impact
the accuracy of this methodology."
Few websites require registration before showing any pages at all to a user,
so the most practical way to track individuals uniquely is with cookies.
(True, some small percentage of users block cookies, but there are so few
of them that they are largely irrelevant to the discussion.) The standard
doesn't say that a site has to drop cookies, only that if it doesn't it
must have another way to count visitors.
If the site doesn't drop unique cookies, then visitor calculations have to
be done by making educated guesses based on IP address. Such guesses could
take into account the time period between two pages served to the same
IP address, the click trail as revealed in the referrer field, and other
items. In the latter category are cookies that may be dropped by the web
server without the website taking explicit action; Microsoft's IIS in
particular can end up dropping quite a few. Any particular traffic program
can use any of these means to count visitors, but there is no one best
way to do so.
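As a rough illustration of the kind of educated guess described above, the following Python sketch counts visits from IP addresses using an inactivity timeout. The 30-minute window, the use of the User-agent string to separate users behind a shared address, and the function name are all assumptions made for this example; they are not taken from the FAST draft or from any particular traffic package.

```python
from datetime import datetime, timedelta

# Assumed inactivity window; pick a value and document it for the auditor.
VISIT_TIMEOUT = timedelta(minutes=30)

def count_visits_by_ip(requests, timeout=VISIT_TIMEOUT):
    """Estimate visits and visitors from (ip, user_agent, timestamp) tuples.

    A new visit starts whenever the same (ip, user_agent) pair has been
    silent for longer than `timeout`. Each distinct pair is treated as
    one visitor, which is only an approximation of real people.
    """
    last_seen = {}
    visits = 0
    for ip, agent, ts in sorted(requests, key=lambda r: r[2]):
        key = (ip, agent)
        previous = last_seen.get(key)
        if previous is None or ts - previous > timeout:
            visits += 1
        last_seen[key] = ts
    return visits, len(last_seen)
```

A different timeout, or a heuristic that also follows the referrer trail, will produce different counts from the same log. That is exactly why the choice needs to be written down and applied consistently.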
To make things worse, consider caching. When a surfer clicks on a link there
is really no guarantee that your website will even see the request in
its logs. The page may be cached someplace between your server and the
user's browser, or on the user's own machine. Websites certainly want to claim
views of cached pages or ads as part of their traffic, but by definition
these can only be estimated. So, again, how can reliable numbers be generated?
How to prepare for an audit: Know Thy Traffic
It is essential to understand two things about traffic numbers.
- You won't get them 100% correct.
- Traffic numbers are political.
The first point is probably obvious. What with caching, proxy servers, the ability
of users to block cookies, and other factors, there's no way to be perfect.
Nor is that a problem. Given the variations between different audit styles,
consistency and traceability are your best bets. If the numbers you report
are only 5% or even 15% off from your first audit you'll probably get
a huge round of applause from the auditors.
Second, we want to make sure that you understand that traffic numbers are as much
a political matter as a technical one. The classic example is the problem
that was faced by sales people of companies who, early in the evolution
of Web advertising, wanted to be above-board in their use of traffic numbers.
"We're reporting page impressions," one of those sales people told TEC
years ago, "but we're competing with people who still report hits. And
the customer doesn't understand the difference." While no advertiser is
going to get caught buying hits instead of impressions today, until the
sites of your competitors are audited there's no way to know how accurate
the numbers they provide to advertisers are. So sales and marketing folk
may have different levels of interest from the IT staff in pruning the
numbers down to the absolute minimum.
Most sites use commercial traffic reporting products or services. While these are
certainly appropriate for use on a regular basis, we recommend that sites
expecting to be audited at some point come to an understanding of their
traffic before relying too heavily and too long on such packages. The
accuracy of the commercial offerings is limited by how well you can configure
them to exclude page impressions that should not count for an audit. You
can expect a package to automatically exclude images like .gif and .jpg
files, and some come out-of-the-box ready to exclude the larger search
engines, but they can't know the characteristics of your site.
In fact, from a traffic point of view you may not know the characteristics of your
site until you create a small project to examine the logs in detail. It
should take only a week or two of programmer time to write a program that
can count impressions, visitors and visits. The careful inspection of
the logs and the derivation of algorithms you'll need to do this will
put you on a firm footing both to configure your commercial log analysis
software and to be prepared for a traffic audit.
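To give a sense of where such a project might start, here is a minimal Python sketch that counts page impressions from a web log. It assumes the log is in the common Apache-style format; the list of excluded file extensions and the choice to count only 200 and 304 responses are assumptions for illustration that you would replace with rules that fit your own site.

```python
import re
from collections import Counter

# Assumed log layout: host ident user [time] "METHOD path proto" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Assumed exclusions: files that are parts of a page rather than pages.
EXCLUDED_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js")

# Assumed rule: successful and "not modified" responses both count.
COUNTED_STATUSES = {"200", "304"}

def count_impressions(lines):
    """Return (page impressions, distinct IP addresses) for the log lines."""
    per_ip = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # unparseable line; worth logging in a real run
        if match.group("status") not in COUNTED_STATUSES:
            continue
        path = match.group("path").split("?", 1)[0].lower()
        if path.endswith(EXCLUDED_EXTENSIONS):
            continue
        per_ip[match.group("ip")] += 1
    return sum(per_ip.values()), len(per_ip)
```

The count of distinct IP addresses is only a starting point for a visitor estimate, for the reasons discussed earlier. The value of the exercise lies in deciding, line by line, what should and should not count.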
Among the areas to pay attention to are:
- IP Addresses: Do you know which IP addresses people from the company
(or from partners) will be recorded as coming from? Can you estimate
how many people actually reach your site through such overloaded
addresses as AOL's proxy servers?
- Cookies: TEC recommends the use of a unique cookie to identify visitors. However,
you may discover that your server software has its own supply of cookies.
Microsoft's SiteServer, through its various features, has the capability
of dropping many cookies. These, unless you understand them carefully,
may have the effect of confounding your traffic reporting software,
probably leading to a significant over-counting of visitors.
- Usernames: If your site has registration the usernames can appear in the traffic
logs and be quite helpful in validating your numbers. But if you don't
require people to register immediately the same person may appear in
the logs both with and without a username. This would have the effect
of inflating your visitor count and decreasing the measure of the average
time spent on the site per user.
- Caching: Having your static pages cached by remote servers or browsers helps
reduce the load on your own servers and on the network as a whole, but
at the cost of reducing your traffic counts. You can develop estimates
of the degree to which this occurs by inserting directives into the
HTML code that will have the effect of invalidating versions stored
in caches, or by changing the modification dates to make those pages
look new. The former approach gives a better estimate since it in theory
causes every browser to reload the pages every time, while the latter
approach merely causes reloads once by caching servers. Trials that
mix both methods can lead to the best estimates of real traffic.
- Bots: There are lists of "known" robots and search engines published on the
Internet, and some traffic packages routinely use these lists to remove
unwanted impressions. However, these lists are not complete, and unknown
robots regularly search your site and can cause significant spikes and
consistent over-estimates of your traffic. The only protection here
is eternal vigilance. While most robots identify themselves in the User-agent
field of the log, many do not. One way to find such impolite robots
is to look for users who visit a large number of pages in a short period
of time. If you know which specialized search engines visit your site,
you can find out directly from them what IP addresses they use and set your
software to ignore them.
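The check for impolite robots mentioned in the last item can be sketched in a few lines of Python. The threshold of 30 pages in one minute is an assumption chosen for illustration; a sensible value depends on how real users behave on your site, and a busy proxy address can trip the same alarm.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def find_suspected_robots(requests, max_pages=30, window=timedelta(minutes=1)):
    """Return the set of IPs that requested more than `max_pages` pages
    within any time span of length `window`.

    `requests` is an iterable of (ip, timestamp) pairs for page requests.
    """
    by_ip = defaultdict(list)
    for ip, ts in requests:
        by_ip[ip].append(ts)

    suspects = set()
    for ip, times in by_ip.items():
        times.sort()
        start = 0  # left edge of the sliding window
        for end, ts in enumerate(times):
            while ts - times[start] > window:
                start += 1
            if end - start + 1 > max_pages:
                suspects.add(ip)
                break
    return suspects
```

Treat the result as a list of candidates to inspect by hand rather than as a list to exclude automatically.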
It may be that you end up with more special cases than your commercial software
can deal with. This will mean that its reports over- or under-estimate
what you believe the accurate numbers to be. In that case a few data points
should establish the nature of this difference. You can then adjust the
numbers from the package before reporting them - making sure to revalidate
the relationship periodically.
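One simple way to express that adjustment, assuming the difference is roughly proportional, is a single correction factor derived from a few periods for which you have both the package's number and your own carefully verified count. The function names and the sample figures below are hypothetical.

```python
def calibration_factor(samples):
    """Ratio of verified counts to package counts over the sample periods.

    `samples` is a list of (package_count, verified_count) pairs.
    """
    package_total = sum(package for package, _ in samples)
    verified_total = sum(verified for _, verified in samples)
    if package_total == 0:
        raise ValueError("package counts sum to zero; cannot calibrate")
    return verified_total / package_total

def adjust(package_count, factor):
    """Scale a package-reported count by the calibration factor."""
    return round(package_count * factor)

# Hypothetical data points: the package over-counts by about 10%.
samples = [(1000, 900), (2000, 1800), (1500, 1350)]
factor = calibration_factor(samples)
adjusted = adjust(3000, factor)
```

If the per-period ratios vary widely, a single factor is not a fair summary, and the special cases behind the variation deserve another look.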
Reporting on advertising is a different matter. It is conceptually similar, in that
you can specify some kinds of impression as ineligible for counting. Typically
the ad server logs contain less information about individuals than do
your traffic logs, so many of the opportunities for removing bogus impressions
are not easily available. Since this is the same boat everyone else is
in there should be no problem when it comes to an audit. And your ad serving
vendor or service should be able to show you that their overall procedures
have been certified by some auditing agency, and should be able to advise
you about any particular circumstances on your site. However, it will
be up to you, through analysis of your traffic, to discover special situations
that might be necessary to account for in ad reports.
Conclusion
Accurate traffic numbers are an important management tool, but they become
most important when advertisers start asking for audits. You should begin
to prepare by understanding where your traffic comes from and designing
your traffic reporting procedures to be as accurate as possible.
A formal statement of your procedures and principles will serve as good
documentation, and can be used to good effect by your sales people in representing
your site to potential advertisers. When the time for an audit comes,
your documentation of procedures and principles may smooth the process,
and will certainly be helpful if the audit disagrees with the traffic
numbers you believe to be accurate.