Fault Meets Performance -- Comprehensive Infrastructure Management Part 1: The Problem

  • Written By: Fred Engel
  • Published: March 26 2002


As recently as a year ago, the mantra "just keep it running" drowned out every other voice in the information technology (IT) universe. Corporate IT infrastructures had become so mind-numbingly complex that staff had all it could do just to keep everything running - never mind how well it was running. When employees complained about performance, IT staff was slow to respond. Unfortunately, slow performance means lost business, so fixing it is a matter of life and death for the company.

Now that those same IT infrastructures are turning outward to embrace business customers and partners, often-neglected performance issues get more attention. Customers, suppliers, partners and other business constituents dominate IT decision making, so when these business partners complain about slow applications or interminable downloads, IT listens - or else. Catching and correcting the innumerable faults and performance problems that bring down IT environments is more important than ever.

IT has had software tools for maintaining uptime - primarily fault detection products - and for performance management for years. The two have grown up on parallel tracks, separate and not always equal in the IT universe. If companies want to succeed at using their IT infrastructure as a sales and support channel, however, the two must intersect. Performance tools were once planning tools; now they are tools for keeping things humming in real time.

This is Part 1 of a two-part note.
Part 2 will discuss a solution.

Fault and Performance Management Emerge in the 1990s

Combining the two technologies into one platform provides the best way to continuously run IT at peak performance. It lets IT managers view their infrastructures as a set of business operations, such as call centers or e-commerce, rather than a set of hardware devices. Instead of focusing on whether a Web server is running, IT managers can now make sure that customers can download Web pages quickly, so they don't get bored and go to a competitor's site.

Using this technology to automate infrastructure monitoring and management procedures helps IT staffs identify and address problems before they affect customers. This automation also frees technicians to focus on the company's core business rather than day-to-day routines.

IT management software providers that deliver fault and performance management on a single platform will have an easier time developing it if they established a legacy in performance technology first. Performance technology involves more complex processes and more variables than fault management, and is therefore more difficult to learn, develop and adopt.

Red Alert: This Server Is Going Down

Fault management uses a straightforward approach to alert IT staff to problems. Equipment vendors build Simple Network Management Protocol (SNMP) trap support into individual components, such as servers, routers, and network devices. In addition, companies can place agents, which are automated pieces of software, on systems to monitor different functions. When a device exceeds a given performance threshold, the agent generates a trap for that fault and sends an alarm to the company's IT fault management interface.
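The threshold-crossing mechanism described above can be sketched in a few lines. This is a hypothetical illustration, not a real SNMP implementation; the device name, metric name, and field layout are all assumptions made for the example.

```python
# Hypothetical sketch of a threshold-based monitoring agent: when a polled
# metric crosses its configured threshold, emit a trap-like alarm record.

def check_threshold(device, metric, value, threshold):
    """Return an alarm record if the metric exceeds its threshold, else None."""
    if value > threshold:
        return {
            "device": device,
            "metric": metric,
            "value": value,
            "threshold": threshold,
            "severity": "critical",
        }
    return None

# Example: a server's CPU sample crosses the 80 percent busy threshold.
alarm = check_threshold("web-server-01", "cpu_busy_pct", 91.5, 80.0)
if alarm is not None:
    print(f"ALARM: {alarm['device']} {alarm['metric']}={alarm['value']} "
          f"exceeds {alarm['threshold']}")
```

A real agent would poll on a schedule and encode the record as an SNMP trap, but the decision logic is essentially this comparison.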

Fault management technologies keep that trap information in a short-term data store that IT staff can search at will. This approach warns IT staff in real time when a device is failing. Without it, IT staffs often learned of infrastructure problems only after it was too late, and had to go to the devices themselves to see why they failed.

Fault Leaves IT Staffs Looking For More Info

Fault technology increases uptime, but it neither cures long-term problems nor helps find slow performance. Most fault technologies on the market today bombard IT staff with too many alarms, creating more noise than information. Often, these systems generate one alarm each time a device exceeds a set threshold. So if a CPU registers more than 80 percent busy, an alarm may be generated each time it exceeds 80 percent, which can occur multiple times per second. An IT manager looking at a topology map of the network, with thousands of similar alarms flashing on the screen in red, has a hard time determining which alarms signify critical problems and which are redundant.

Some IT management software vendors have developed software to filter out the noise from these redundant alarms, a technique often called de-duplication. Other vendors analyze threshold crossings over time and react more slowly to changes, so that fewer false alarms arise. Still others compare current behavior against historical behavior patterns and alarm on deviations from normal.
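The de-duplication idea above can be illustrated with a small sketch. This is not any vendor's actual product logic, just a minimal latch: one alarm is raised on the first threshold crossing, repeats are suppressed, and the alarm re-arms only once the metric recovers.

```python
# Illustrative alarm de-duplication (hypothetical, not vendor code):
# raise one alarm when a metric first crosses its threshold, then
# suppress repeats until the metric drops back below the threshold.

class DedupAlarm:
    def __init__(self, threshold):
        self.threshold = threshold
        self.active = False  # True while an alarm is outstanding

    def observe(self, value):
        """Return True only on the first crossing, not on every sample."""
        if value > self.threshold:
            if not self.active:
                self.active = True
                return True   # new alarm
            return False      # redundant crossing: suppressed
        self.active = False   # metric recovered; re-arm the alarm
        return False

cpu = DedupAlarm(threshold=80.0)
samples = [75, 85, 90, 88, 70, 95]
alarms = [v for v in samples if cpu.observe(v)]
print(alarms)  # -> [85, 95]: one alarm per burst, not per sample
```

Without the latch, the samples above would have produced four alarms instead of two; per-second polling of a busy CPU multiplies that noise accordingly.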

Another problem with fault management technologies is that they shed no meaningful light on whether something has actually gone wrong or whether the observed behavior is normal for the device. When IT managers see an alarm, they click on it and see only a single variable: whether the device is up or down. IT technicians need to be able to examine the historical behavior of the device to understand its current behavior.

For example, an SNMP trap kicks out an alarm that a virtual private network (VPN) is operating at 80 percent capacity. IT managers see the alarm and, without any historical context, assume the VPN should be operating at a lower capacity and is about to fail. But if the VPN normally operates at 80 percent, it is unlikely to be causing any performance problems, and it does not need replacing. Without that information, IT managers seeing the alarm will likely purchase and install new VPN capacity, only to discover that the investment did not improve performance.
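The baselining approach implied by the VPN example can be sketched as follows. This is an illustrative toy, not a description of any particular product: instead of a fixed threshold, the current reading is compared to the device's own history, and only readings well outside the normal range are flagged.

```python
# Illustrative baselining sketch (hypothetical, not vendor code): flag a
# reading only if it deviates from the device's historical norm by more
# than a few standard deviations.

import statistics

def is_abnormal(history, current, num_stdevs=3.0):
    """True if `current` falls outside mean +/- num_stdevs of history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(current - mean) > num_stdevs * stdev

# A VPN link that normally runs near 80 percent capacity:
vpn_history = [78, 81, 80, 79, 82, 80, 81, 79]
print(is_abnormal(vpn_history, 80))  # False: normal for this link
print(is_abnormal(vpn_history, 35))  # True: a real deviation worth a look
```

Under this scheme the 80 percent reading that alarmed the IT managers in the example would be recognized as this link's normal operating point, while a genuinely unusual reading would still surface.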

This concludes Part 1 of a two-part note.
Part 2 will discuss a solution.

About the Author

Fred Engel is a recognized expert in networking technologies, and a member of several Internet Engineering Task Force (IETF) standards committees, which explore and define industry networking standards like RMON and RMON2. He was recently named one of the Massachusetts ECOM Top 10 for 2001, making him a true "pioneer in technology." Engel has also been nominated for Computerworld's Top IT Leaders of the new economy.

As executive vice president and CTO for Concord Communications, Fred invented the network reporting and analysis industry by developing eHealth for Concord Communications. Fred works with leading Fortune 1000 organizations, carriers and Internet service providers to deploy turnkey networking performance and analysis solutions. Before joining Concord in 1989, Fred was vice president of engineering for Technology Concepts/Bell Atlantic, where he led the design and implementation of the SS7 800 Database front-ends, now deployed nationwide. His experience also includes Digital Equipment Corp., where he led the company's first TCP/IP deployment.

Fred has taught computer science, statistics and survey research courses at the University of Connecticut, Boston College and Boston University. He frequently speaks at leading networking industry forums, including Networld+Interop, COMDEX, COMNET, the Federal Computer Conference and DECUS, as well as two keynote engagements in France: Network Developments and IEEE International Conference on Networking (ICN).

Fred can be reached at fengel@concord.com.
