Technology Evaluation Centers

The Necessity of Data Warehousing

M. Reed - 8/2/2000

The Necessity of Data Warehousing
M. Reed - August 2, 2000

Why the market is necessary

Data warehousing is an integral part of the "information age". Corporations have long known that some of the keys to their future success could be gleaned from their existing data, both current and historical. Until approximately 1990, many factors made it difficult, if not impossible, to extract this data and turn it into useful information. Some examples:

Most operational data is stored in what is referred to as an OLTP (On-Line Transaction Processing) system. These systems are specifically designed for high levels of transaction volume with many concurrent users. If the database is relational, it has probably been "normalized" (the process of organizing data in accordance with the rules of a relational database). If the database is non-relational, custom programs have to be written to store and retrieve data from the database. (This is often accomplished with the COBOL programming language). Whether relational or non-relational, the very design that makes an OLTP system efficient for transaction processing makes it inefficient for end-user queries. In the 1980's, many business users referred to their mainframes as "the black hole", because all the information went into it, but none ever came back out - all requests for reports had to be programmed by the Information Systems staff. Only "pre-canned" reports could be generated on a scheduled basis, ad-hoc real-time querying was virtually impossible.

To resolve these issues, data warehousing was created. The theory was to create a database infrastructure that was always on-line, contained all the information from the OLTP systems, including historical data, but structured in such a way that it was fast and efficient for querying. The most common of these schemas (logical and physical database designs) is known as the star schema. A star schema consists of facts (actual business facts) and dimensions (ways of looking at the facts). One simple way to look at a star schema is that it is designed such that the maximum amount of information can be derived from the fewest number of table reads. Another way to reduce the amount of data being read is to pre-define aggregations (summaries of detail data, such as monthly total sales) within the star, since most queries ask questions like "how many were sold last month?"

Data warehousing also led to the development of the concept of metadata management. Metadata is data about data, such as table and column names, and datatypes. Managing metadata makes it possible to understand relationships between data elements and assists in the mapping of source to target fields. (For more information of Metadata see "Metadata Standards in the Marketplace ")

Next came the creation of Extract/Transform/Load (ETL) tools, which made use of the metadata to get the information from the source systems into the data warehouse.

Additional tools, which made use of SQL (Structured Query Language), were developed to give end-users direct access to the data in the warehouse. As time went by, the query tools became user-friendly, and many now have a parser that can turn plain English questions into valid SQL. These end-user tools are now loosely referred to as "business intelligence" tools. In addition, there are other database constructs used to assist business intelligence tools in multi-dimensional analysis of data in the warehouse. These databases are referred to as hypercubes (also known as cubes, multi-dimensional cubes, or MDB's).

Since the early 1990's, data warehouses have become ubiquitous, technology and methodology have been improving, and costs have been decreasing. In 1998, data warehousing was a $28 Billion (USD) industry, and growing at over 10% per year. In addition, a recent survey of top IT executives indicated that data warehousing would be the number one post-Y2K priority. Data warehousing is now recognized as an important way to add business value and improve return on investment, if it is properly planned and implemented.

Selection Issues

Selecting a set of products for a data warehouse effort is complex. The first and most important issue is to ensure that the Extract/Transform/Load tool that is chosen can effectively and efficiently extract the source data from all the required systems.

The selection of the ETL tool requires an understanding of the source data feeds. The following issues should be considered:

Selection of the business intelligence tool(s) requires decisions such as:

Average acquisition cost

A minimum tool set, excluding hardware, would include a database engine, ETL tool, and a number of BI tools for the individual users. A reasonable starting point for this set would be in the range of $100,000. Costs vary greatly based on vendor selection, number of seats purchased, software maintenance fees, and many other factors. In addition, consulting services drive the cost up immensely. A full blown, large-scale data warehouse in the terabyte range with hundreds of user licenses could cost tens of millions of dollars to implement and maintain.

Conclusions

Data Warehousing has taken many years to develop into its current form, and is still an immature technology. Improvements in hardware and software have made it possible to build warehouses that are many terabytes (one terabyte is 1,099,511,627,776 bytes) in size. The ability to access current and historical data to examine market trends and to perform business analysis has proven to provide significant return on investment. In today's dynamic marketplace, companies will continue to use data warehousing to derive competitive advantage.