Overview
In the world of high performance computing, everyone wants to know how
well a system performs before deciding to buy it. Benchmarks provide a
relatively objective way of determining how well a system will perform
under given conditions. What customers need to know is: which benchmarks
are relevant to their particular needs, and which ones don't matter? In
this article, we will go through some of the more common benchmarks used,
and discuss in which areas/markets they are most important.
One caveat: although most benchmark tests are a reasonable attempt to simulate
real-world conditions under which systems may be expected to operate,
there will always be some deviation between the tested configuration and
a customer's actual computing environment. Thus, benchmarks should be
thought of more as guideposts than actual "scripted scenarios". However,
benchmarks can (and often do) serve as a tool for comparison between different
systems. It is the comparative aspect of benchmarking which provides the
most useful information.
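As a minimal sketch of that comparative use, the Python snippet below normalizes a set of benchmark scores against one baseline system. The system names and scores are hypothetical and not taken from any published result, and the sketch assumes a metric where higher is better.

```python
def relative_scores(scores, baseline):
    """Return each system's score divided by the baseline system's score."""
    base = scores[baseline]
    if base <= 0:
        raise ValueError("baseline score must be positive")
    return {name: value / base for name, value in scores.items()}


# Hypothetical throughput scores (higher is better); illustrative only.
results = {"System A": 1200.0, "System B": 1500.0, "System C": 900.0}

ratios = relative_scores(results, "System A")

# Print systems from fastest to slowest relative to the baseline.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.2f}x the baseline")
```

The ratios only mean something when every system ran the same benchmark under comparable conditions, which is the caveat made above.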
Benchmarks examined here will focus on hardware systems such as servers, desktops,
and notebooks. Although loose Operating System (OS) comparisons can be
made, doing so is more complex and can give misleading results. Loose
comparisons can also be made between some applications (e.g. comparing
a system running Oracle 8i to one running Microsoft's SQL Server to one
running IBM's DB2), but, as with OSes, these can be misleading. However,
trends can sometimes be assessed through judicious use and analysis of the results.
Brief History and Background
For a long time, the only benchmarks to which anyone paid attention were
related to the CPU. In the 1980s, people (especially salesmen) were fond
of quoting how many MIPS (Millions of Instructions Per Second) a computer
could perform. This was more meaningful when the bulk of computers were
CISC (Complex Instruction Set Computer or Computing), a group that includes
IBM-compatible personal computers.
With the advent of RISC (Reduced Instruction Set Computer or Computing)
machines, measuring the number of instructions executed became an apples/oranges
comparison, because a RISC instruction typically does less work than a CISC
instruction, and the resulting conflict was akin to religious warfare. In
addition, the lines between CISC and RISC have become more blurred. This
conflict led to the need to develop benchmarks more focused on system performance
than component performance, as well as to provide more refined performance
figures for the CPU and CPU subsystem.
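The apples/oranges problem with MIPS can be illustrated with a little arithmetic. The sketch below assumes two hypothetical machines that complete the same task; the instruction counts and run times are invented for illustration and do not describe any real processor.

```python
def mips(instruction_count, seconds):
    """Millions of instructions executed per second."""
    return instruction_count / (seconds * 1_000_000)


# Hypothetical figures for the SAME task on two machines (illustrative only).
# The RISC-style machine needs more, simpler instructions to do the work.
cisc = {"instructions": 40_000_000, "seconds": 2.0}
risc = {"instructions": 120_000_000, "seconds": 1.5}

cisc_mips = mips(cisc["instructions"], cisc["seconds"])
risc_mips = mips(risc["instructions"], risc["seconds"])

# What the user actually experiences: how much sooner the task finishes.
speedup = cisc["seconds"] / risc["seconds"]

print(f"CISC-style machine: {cisc_mips:.0f} MIPS")
print(f"RISC-style machine: {risc_mips:.0f} MIPS")
print(f"MIPS ratio: {risc_mips / cisc_mips:.1f}x, real speedup: {speedup:.2f}x")
```

With these made-up numbers the MIPS figures differ by a factor of four while the task finishes only about a third faster, which is why instruction counts alone are a poor basis for comparison across architectures.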
Another attempt at meaningful benchmarks was FLOPS (FLoating-point OPerations
per Second), which rated processor speed. Computer manufacturers often quoted
their systems as "XX megaFLOPS" (MFLOPS) or "YY gigaFLOPS" (GFLOPS). However,
as users came to realize that FLOPS is an incomplete measure of system
performance, other benchmarks were developed by groups such as the Standard
Performance Evaluation Corporation (referred to as SPEC), a consortium of
industry vendors who joined together for that purpose.
As non-mainframe servers and personal computers proliferated, both in
actual numbers and in the quantity and type of applications they are called
upon to run, more tests became necessary. The benchmark tests developed were
more focused, and thus more applicable to a particular type of situation.
For example, a test designed to measure how quickly a very large database
query can be executed through a server is not suitable for comparing the 3D
graphics performance of mechanical CAD workstations.
What we now find is that, in addition to industry/consortium-originated
benchmarks, tests are developed by non-vendor groups. A key example is ZDNet
eTesting Labs, a.k.a. Ziff-Davis Media Benchmarks (formerly known as Ziff-Davis
Benchmark Operation [ZDBOp]), and the suite of benchmarks it has built. "ZD"
is Ziff-Davis, publisher of computer-related magazines such as PC Magazine,
PC Week, and PC Computing. ZDBOp currently provides and oversees more than ten
benchmark tests. These tests are primarily PC-based, but also include tests
for Macintoshes, servers, and Internet performance.
The other key "group" providing benchmark suites is the individual vendors.
Companies such as Oracle, SAP AG, Microsoft,
and Lotus/IBM provide benchmarks for their specific products. As
with the wider-focus tests, hardware manufacturers sometimes use these
tests for competitive selling. Some provide the tests to potential customers
to help them decide how much computer they will need to order.
Benchmarks
So, what are the benchmarks in current use, and what do they measure?
Listed in Table 1 below are some of the better-known benchmarks, along
with the kind of performance factors they measure/evaluate. This list
is not all-encompassing, but it does list many of the benchmarks most
users will find valuable and useful.
Table 1.
| Test Name | Segment | Synopsis | Metrics | Details |
| --- | --- | --- | --- | --- |
| TPC-C | System | Measures transaction processing performance and exercises all related subsystems | tpmC, $/tpmC | |
| TPC-H | System | Measures ad-hoc query (decision support) performance | QphH, $/QphH | |
| TPC-R | System | Measures performance of a standard set of queries | QphR, $/QphR | |
| TPC-W | System | Measures transactions (e.g. e-commerce) for a business-oriented web server | WIPS, $/WIPS | |
| SPECweb99 | System | Updated version of SPECweb96. Measures peak throughput for web serving | Conforming simultaneous connections | |
| SPEC CPU2000 | CPU subsystem | Measures CPU performance (replaces SPECint/fp 95) | SPECmark | |
| SPECsfs97 | System | NFS file server throughput and response time | Ops/sec; Overall response time (ORT) | |
| SYSmark98/SYSmark2000 | Desktop | Overall general application performance, incl. office productivity and content creation | SYSmark rating | |
| SYSmark/32 | Desktop | 32-bit application performance | SYSmark rating | |
| SYSmarkNT4 | Desktop | Measures performance across a mix of applications (CAD, word processing, spreadsheet, project management, presentation) | SYSmark rating | |
| i-Bench | Internet | Performance, capability of Web clients | Various | |
| WebBench | Internet | Web, proxy, and cache server software performance | Score: rps; Throughput: byte/sec | |
| NetBench | Server | File server's handling of 32-bit clients' I/O requests | Throughput: Mb/sec; Response: msec | |
| Winstone | Desktop | Overall 32-bit application performance | Winstone units | |
| Winstone: Business | Desktop | Application suite performance | Winstone units | |
| Winstone: High-End | Desktop | Applications for demanding users, e.g. multimedia (NT only) | Winstone units | |
| Winstone: Content Creation (CC) | Desktop | Content creation (e.g. Photoshop, Director) performance | Winstone units | |
| Winbench | Desktop | Graphics and disk subsystems performance | Many, see table | |
| 3D Winbench | Desktop | 3D subsystem, incl. graphics, S/W | Frames/second | |
| PC WorldBench 2000 | Desktop, Notebook | System and applications performance | WorldBench score | |
| BatteryMark | Notebook | Battery life when running Windows applications | Life: minutes | |
| Web Polygraph | Server (appliances) | Measures performance and value of caching server appliances | Throughput: rps; MRT: sec; Price/perf: rps/K$ | |
| WebStone | Client/Server | Measures throughput and latency of HTTP transfers | Throughput: Mb/s; Peak: Conns/sec | |
| VolanoMark | Server | Measures Java Virtual Machine (JVM) performance | Unitless score | |
| DirectoryMark | Server | Measures LDAP directory server performance | Ops/sec; response time | |
| VENDOR-BASED | | | | |
| MMB | Server/Client | MAPI Messaging Benchmark; measures throughput actions of a "Medium User" profile, executed over an 8-hour day | MMB | |
| SAP | Server/Client | Performance of system while running SAP R/3 | Number of users; Response time | |
| Oracle | Server/Client | Performance of system while running Oracle 8i | User count; Response time | |
| NotesBench | Server/Client | Performance while running Lotus Notes, used to size servers | Throughput; Response time | |
| RETIRED | | | | |
| SPECweb96 | System | Measures peak throughput for web serving (still searchable) | Ops per second | see website www.spec.org |
| SPECint95 | CPU subsystem | Measures CPU integer performance with main memory | SPECmark | |
| SPECfp95 | Workstation | Measures CPU floating-point performance | | |
| AIM Windows NT | Server | General performance; now defunct | | |
| AIM Unix | Server | General performance; now defunct | | |
| ServerBench (retired) | Server | Performance of application server hardware and OS in a client/server environment | TPS | |
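Several metrics in Table 1 are price/performance ratios, such as $/tpmC and $/QphH. As a rough sketch, such a ratio is the total price of the tested configuration divided by the throughput it achieved, so a lower value is better. The prices and throughput figures below are hypothetical, and the benchmark bodies publish detailed pricing rules that this simplification ignores.

```python
def price_performance(total_price, throughput):
    """Price per unit of throughput (e.g. dollars per tpmC); lower is better."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return total_price / throughput


# Hypothetical audited results (illustrative only, not real published data).
systems = {
    "System A": {"price": 500_000.0, "tpmC": 25_000.0},
    "System B": {"price": 900_000.0, "tpmC": 40_000.0},
}

ratios = {
    name: price_performance(data["price"], data["tpmC"])
    for name, data in systems.items()
}

# The system with the lowest cost per unit of throughput.
best = min(ratios, key=ratios.get)

for name, ratio in ratios.items():
    print(f"{name}: {systems[name]['tpmC']:.0f} tpmC at ${ratio:.2f}/tpmC")
print(f"Best price/performance: {best}")
```

In this made-up case System B delivers more raw throughput, but System A costs less per unit of work, which is why both figures are normally quoted together.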
Which Ones to Check?
Customers need to know which benchmarks to use for comparing various applications.
Shown below is a correlation table of some standard tasks to various benchmarks.
A check mark means that the particular benchmark is a good indicator of
how well a given system will perform the required task(s).
Figure 1. CORRELATION TABLE - Application vs. Benchmark
Pitfalls
Benchmarks can be misapplied in various ways.
One way uses the benchmark to provide "competitive data/analysis" for a task
or tasks that bear no relation to what is really important. An example
of this might be the attempted use of a CPU performance benchmark as an
indicator of system-level performance. As the saying goes, "Many a slip
'twixt cup and lip"; for us, this means that while CPU performance and
system performance are often linked, the link may be tenuous and not necessarily
usable for comparison.
Another popular misapplication is to test a high-performing but impractical
configuration, then present the results as if customers would actually get
the measured performance. An example of this might be to test a disk-laden
system configured for RAID0 (high-performing but no data loss protection),
when the typical customer wants/needs RAID1 or RAID5 (data loss protection
but lower performance).
Finally, sometimes vendors will quote unaudited benchmark data. Although this
is often innocuous, what can happen is that a vendor will implement special
features or "tune" the System Under Test (SUT). This can make it perform
better than is realistic, and certainly better than a competitor's untuned
system. We have seen at least one vendor pull figures from its Website
because the performance figures quoted were theoretical, not real-world,
and definitely not audited. While we commend the vendor for "doing the
right thing", we believe the figures had no business being there in the
first place.
To help you combat BMS (Benchmark Misapplication Syndrome), here are some
questions you can ask of hardware vendors:
- Have these benchmark figures been audited by (name of auditing organization)?
- Why is the particular benchmark you're quoting applicable to my circumstances/needs?
- How realistic is the benchmark configuration you tested?
Summary
As we have seen, benchmarks fill an important role in selecting a computer,
from servers to notebooks. As with many things, benchmarks can be used
for good or evil. Judicious use of performance data, combined with an
understanding of what work/tasks you want your system to perform, will
help reduce or eliminate a lot of the hype surrounding performance data.
Don't be afraid to put the vendor on the spot regarding the figures quoted.
Asking a few hard questions now may save a lot of work later, including
the effort of buying another system - because the one the vendor sold
you doesn't perform quite as well as they told you it would.