What’s All This Benchmark Stuff, Anyway?

  • Written By: R. Krause
  • Published On: June 10 2002




Overview  

In the world of high performance computing, everyone wants to know how well a system performs before deciding to buy it. Benchmarks provide a relatively objective way of determining how well a system will perform under given conditions. What customers need to know is: which benchmarks are relevant to their particular needs, and which ones don't matter? In this article, we will go through some of the more common benchmarks used, and discuss in which areas/markets they are most important.

One caveat: although most benchmark tests are a reasonable attempt to simulate real-world conditions under which systems may be expected to operate, there will always be some deviation between the tested configuration and a customer's actual computing environment. Thus, benchmarks should be thought of more as guideposts than actual "scripted scenarios". However, benchmarks can (and often do) serve as a tool for comparison between different systems. It is the comparative aspect of benchmarking which provides the most useful information.

Benchmarks examined here will focus on hardware systems such as servers, desktops, and notebooks. Although loose Operating System (OS) comparisons can be made, doing so is more complex and can give misleading results. Loose comparisons can also be made between some applications (e.g. comparing a system running Oracle 8i to one running Microsoft's SQL Server to one running IBM's DB2), but, as with OSes, these can be misleading. However, trends can sometimes be assessed through judicious use and analysis.

Brief History and Background 

For a long time, the only benchmarks to which anyone paid attention were related to the CPU. In the 1980s, people (especially salesmen) were fond of quoting how many MIPS (Millions of Instructions Per Second) a computer could perform. This was more meaningful when the bulk of computers were CISC (Complex Instruction Set Computer or Computing), a group that includes IBM-compatible personal computers.
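For a sense of what such a rating means, the arithmetic behind a MIPS figure is simply instructions executed divided by runtime. The numbers below are invented for illustration, not measurements of any real machine:

    # Hypothetical workload figures, for illustration only.
    instructions_executed = 250_000_000   # instructions retired while running the workload
    execution_time_sec = 2.0              # wall-clock runtime in seconds

    # MIPS = millions of instructions per second
    mips = instructions_executed / (execution_time_sec * 1_000_000)
    print(f"{mips:.0f} MIPS")  # -> 125 MIPS

    # The catch: the same program compiled for a different instruction set may retire
    # far more (or fewer) instructions to do the same work, so the MIPS figure alone
    # says little about how much useful work actually gets done.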

With the advent of RISC (Reduced Instruction Set Computer or Computing) machines, measuring the number of instructions executed became an apples/oranges comparison, with the conflict akin to religious warfare. In addition, the lines between CISC and RISC have become more blurred. This conflict led to the need to develop benchmarks more focused on system performance than component performance, as well as to provide more refined performance figures for the CPU and CPU subsystem.

Another attempt at a meaningful benchmark was FLOPS (FLoating-point OPerations per Second), which rates how quickly a processor completes floating-point arithmetic. Computer manufacturers often quoted their systems as "XX megaFLOPS" (MFLOPS) or "YY gigaFLOPS" (GFLOPS). However, as users came to realize that FLOPS is an incomplete measure of system performance, other benchmarks were developed by groups such as the Standard Performance Evaluation Corporation (referred to as SPEC), a consortium of industry vendors who joined together for that purpose.
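The megaFLOPS and gigaFLOPS figures quoted this way are usually peak numbers derived from the clock rate and the number of floating-point results a chip can complete per cycle. A minimal sketch, using a made-up processor spec rather than any vendor's actual part:

    # Hypothetical processor spec, for illustration only.
    clock_hz = 800_000_000    # 800 MHz clock
    flops_per_cycle = 2       # e.g. one floating-point add and one multiply per cycle
    num_processors = 4        # processors in the system

    peak_flops = clock_hz * flops_per_cycle * num_processors
    print(f"{peak_flops / 1e9:.1f} GFLOPS peak")  # -> 6.4 GFLOPS peak

    # Sustained performance on real applications is typically a fraction of this peak,
    # which is one reason FLOPS alone is an incomplete measure of system performance.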

As non-mainframe servers and personal computers have proliferated, both in sheer numbers and in the variety of applications they are called upon to run, more tests have become necessary. The benchmark tests developed are more focused, and thus more applicable to a particular type of situation. For example, a test designed to measure how fast and how well a humongous database query can be executed on a server is not suitable for comparing the 3D graphics performance of mechanical CAD workstations.

What we now find is that, in addition to industry/consortium-originated benchmarks, tests are developed by non-vendor groups. A key example is ZDNet eTesting Labs, a.k.a. Ziff-Davis Media Benchmarks (formerly known as Ziff-Davis Benchmark Operation [ZDBOp]), and the suite of benchmarks it has built. "ZD" is Ziff-Davis, publisher of computer-related magazines such as PC Magazine, PC Week, and PC Computing. ZDBOp currently provides and oversees more than ten benchmark tests. These tests are primarily PC-based, but also include tests for Macintoshes, servers, and Internet performance.

The other key "group" providing benchmark suites is the individual vendors. Companies such as Oracle, SAP AG, Microsoft, and Lotus/IBM provide benchmarks for their specific products. As with the wider-focus tests, hardware manufacturers sometimes use these tests for competitive selling. Some provide the tests to potential customers to help them decide how much computer they will need to order.

Benchmarks 

So, what are the benchmarks in current use, and what do they measure? Listed in Table 1 below are some of the better-known benchmarks, along with the kind of performance factors they measure/evaluate. This list is not all-encompassing, but it does list many of the benchmarks most users will find valuable and useful.


Table 1.

Test Name | Segment | Synopsis | Metrics
--------- | ------- | -------- | -------
TPC-C | System | Measures transaction processing performance and exercises all related subsystems | tpmC, $/tpmC
TPC-H | System | Measures ad-hoc query performance | QphH, $/QphH
TPC-R | System | Measures performance of a standard set of queries | QphR, $/QphR
TPC-W | System | Measures transactions (e.g. e-commerce) for a business-oriented web server | WIPS, $/WIPS
SPECweb99 | System | Updated version of SPECweb96; measures peak throughput for web serving | Conforming simultaneous connections
SPEC CPU2000 | CPU subsystem | Measures CPU performance (replaces SPECint/fp 95) | SPECmark
SPECsfs97 | System | NFS file server throughput and response time | Ops/sec; overall response time (ORT)
SYSmark98/SYSmark2000 | Desktop | Overall general application performance, incl. office productivity and content creation | SYSmark rating
SYSmark/32 | Desktop | 32-bit application performance | SYSmark rating
SYSmarkNT4 | Desktop | Measures performance across a mix of applications (CAD, word processing, spreadsheet, project management, presentation) | SYSmark rating
i-Bench | Internet | Performance and capability of Web clients | Various
WebBench | Internet | Web, proxy, and cache server software performance | Score: requests/sec; throughput: bytes/sec
NetBench | Server | File server's handling of 32-bit clients' I/O requests | Throughput: Mb/sec; response: msec
Winstone | Desktop | Overall 32-bit application performance | Winstone units
  - Business | Desktop | Application suite performance | Winstone units
  - High-End | Desktop | Applications for demanding users, e.g. multimedia (NT only) | Winstone units
  - Content Creation (CC) | Desktop | Content creation (e.g. Photoshop, Director) performance | Winstone units
WinBench | Desktop | Graphics and disk subsystems performance | Many; see table
3D WinBench | Desktop | 3D subsystem, incl. graphics and software | Frames/second
PC WorldBench 2000 | Desktop, Notebook | System and applications performance | WorldBench score
BatteryMark | Notebook | Battery life when running Windows applications | Life: minutes
Web Polygraph | Server (appliances) | Measures performance and value of caching server appliances | Throughput: requests/sec; MRT: sec; price/perf: requests/sec per K$
WebStone | Client/Server | Measures throughput and latency of HTTP transfers | Throughput: Mb/s; peak: connections/sec
VolanoMark | Server | Measures Java Virtual Machine (JVM) performance | Unitless score
DirectoryMark | Server | Measures LDAP directory server performance | Ops/sec; response time

VENDOR-BASED
MMB | Server/Client | MAPI Messaging Benchmark; measures throughput of the actions of a "Medium User" profile executed over an 8-hour day | MMB
SAP | Server/Client | Performance of system while running SAP R/3 | Number of users; response time
Oracle | Server/Client | Performance of system while running Oracle 8i | User count; response time
NotesBench | Server/Client | Performance while running Lotus Notes; used to size servers | Throughput; response time

RETIRED
SPECweb96 | System | Measures peak throughput for web serving (results still searchable at www.spec.org) | Ops per second
SPECint95 | CPU subsystem | Measures CPU integer performance with main memory | SPECmark
SPECfp95 | Workstation | Measures CPU floating-point performance | -
AIM Windows NT | Server | General performance; now defunct | -
AIM Unix | Server | General performance; now defunct | -
ServerBench | Server | Performance of application server hardware and OS in a client/server environment | TPS

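Note that several of the TPC results in Table 1 pair a raw throughput metric with a price/performance figure (tpmC and $/tpmC, for example), and the two can rank systems differently. A quick sketch with invented numbers, not actual published results:

    # Invented results for two hypothetical systems; not real TPC-C publications.
    systems = {
        "System A": {"tpmC": 40_000, "total_cost_usd": 1_200_000},
        "System B": {"tpmC": 25_000, "total_cost_usd": 500_000},
    }

    for name, r in systems.items():
        price_perf = r["total_cost_usd"] / r["tpmC"]  # $/tpmC
        print(f"{name}: {r['tpmC']:,} tpmC at ${price_perf:.0f}/tpmC")

    # System A wins on raw throughput (40,000 vs. 25,000 tpmC), but System B wins on
    # price/performance ($20/tpmC vs. $30/tpmC); which matters more depends on the
    # buyer's workload and budget.
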
Which Ones to Check? 

Customers need to know which benchmarks to use for comparing various applications. Shown below is a correlation table of some standard tasks to various benchmarks. A check mark means that the particular benchmark is a good indicator of how well a given system will perform the required task(s).

Figure 1. CORRELATION TABLE - Application vs. Benchmark


Pitfalls 

Benchmarks can be misapplied in various ways.

One way is to use a benchmark to provide "competitive data/analysis" for a task or tasks that bear no relation to what is really important. An example of this might be the attempted use of a CPU performance benchmark as an indicator of system-level performance. As the saying goes, "many a slip 'twixt cup and lip"; for us, this means that while CPU performance and system performance are often linked, the link may be tenuous and not necessarily usable for comparison.

Another popular misapplication is to test a high-performing but non-useful configuration, then present the results as if customers would actually get the measured performance. An example of this might be to test a disk-laden system configured for RAID 0 (high-performing, but with no data loss protection), when the typical customer wants or needs RAID 1 or RAID 5 (data loss protection, but lower performance).
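To see why such a configuration flatters the numbers, consider a simplified, first-order comparison of RAID levels; the disk counts and sizes below are arbitrary examples, not any tested system:

    # Simplified rules of thumb for an array of identical disks; illustration only.
    def raid_summary(level, num_disks, disk_size_gb):
        if level == 0:   # striping only: full capacity and speed, no redundancy
            return num_disks * disk_size_gb, 0
        if level == 1:   # mirroring: half the capacity, survives a disk failure
            return (num_disks // 2) * disk_size_gb, 1
        if level == 5:   # striping with parity: one disk's worth spent on parity
            return (num_disks - 1) * disk_size_gb, 1
        raise ValueError("RAID level not covered in this sketch")

    for level in (0, 1, 5):
        capacity_gb, failures = raid_summary(level, num_disks=8, disk_size_gb=36)
        print(f"RAID {level}: {capacity_gb} GB usable, tolerates {failures} disk failure(s)")

    # RAID 0 looks best in a throughput or capacity test precisely because it spends
    # nothing on protection, so a RAID 0 result tells a RAID 1 or RAID 5 customer
    # very little about the performance they will actually get.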

Finally, sometimes vendors will quote unaudited benchmark data. Although this is often innocuous, what can happen is that a vendor will implement special features or "tune" the System Under Test (SUT). This can make it perform better than is realistic, and certainly better than a competitor's untuned system. We have seen at least one vendor pull figures from its website because the performance figures quoted were theoretical, not real-world, and definitely not audited. While we commend the vendor for "doing the right thing", we believe the figures had no business being there in the first place.

To help you combat BMS (Benchmark Misapplication Syndrome), here are some questions you can ask of hardware vendors:

  • Have these benchmark figures been audited by (name of auditing organization)?

  • Why is the particular benchmark you're quoting applicable to my circumstances/needs?

  • How realistic is the benchmark configuration you tested?

Summary 

As we have seen, benchmarks fill an important role in selecting a computer, from servers to notebooks. As with many things, benchmarks can be used for good or evil. Judicious use of performance data, combined with an understanding of what work/tasks you want your system to perform, will help reduce or eliminate a lot of the hype surrounding performance data. Don't be afraid to put the vendor on the spot regarding the figures quoted. Asking a few hard questions now may save a lot of work later, including the effort of buying another system - because the one the vendor sold you doesn't quite perform as well as they told you.

 