The Necessity of Data Warehousing
M. Reed - August 2, 2000
Why the market is necessary
Data warehousing is an integral part of the "information age". Corporations have long known that some of the keys to their future success could be gleaned from their existing data, both current and historical. Until approximately 1990, many factors made it difficult, if not impossible, to extract this data and turn it into useful information. Some examples:
- Data storage peripherals such as DASD (Direct Access Storage Device) were extremely expensive on a per-megabyte basis. Therefore, much of the needed data was stored offline, typically on magnetic tape.
- Processing power was very expensive as measured in MIPS (Millions of Instructions per Second). Mainframes had to reserve most of their processing power for day-to-day operations, reports could only be run overnight in batch mode (without interaction from the user).
- Relational database technology was still in its infancy, and server engines were not powerful enough to support the data loads required.
- The type of programming that had to be done with third generation languages (3GL's) was tedious and expensive. Fourth generation languages were needed to abstract some of the required coding, but 4GL's were still in their infancy.
Most operational data is stored in what is referred to as an OLTP (On-Line Transaction Processing) system. These systems are specifically designed for high levels of transaction volume with many concurrent users. If the database is relational, it has probably been "normalized" (the process of organizing data in accordance with the rules of a relational database). If the database is non-relational, custom programs have to be written to store and retrieve data from the database. (This is often accomplished with the COBOL programming language). Whether relational or non-relational, the very design that makes an OLTP system efficient for transaction processing makes it inefficient for end-user queries. In the 1980's, many business users referred to their mainframes as "the black hole", because all the information went into it, but none ever came back out - all requests for reports had to be programmed by the Information Systems staff. Only "pre-canned" reports could be generated on a scheduled basis, ad-hoc real-time querying was virtually impossible.
To resolve these issues, data warehousing was created. The theory was to create a database infrastructure that was always on-line, contained all the information from the OLTP systems, including historical data, but structured in such a way that it was fast and efficient for querying. The most common of these schemas (logical and physical database designs) is known as the star schema. A star schema consists of facts (actual business facts) and dimensions (ways of looking at the facts). One simple way to look at a star schema is that it is designed such that the maximum amount of information can be derived from the fewest number of table reads. Another way to reduce the amount of data being read is to pre-define aggregations (summaries of detail data, such as monthly total sales) within the star, since most queries ask questions like "how many were sold last month?"
Data warehousing also led to the development of the concept of metadata management. Metadata is data about data, such as table and column names, and datatypes. Managing metadata makes it possible to understand relationships between data elements and assists in the mapping of source to target fields. (For more information of Metadata see "Metadata Standards in the Marketplace ")
Next came the creation of Extract/Transform/Load (ETL) tools, which made use of the metadata to get the information from the source systems into the data warehouse.
Additional tools, which made use of SQL (Structured Query Language), were developed to give end-users direct access to the data in the warehouse. As time went by, the query tools became user-friendly, and many now have a parser that can turn plain English questions into valid SQL. These end-user tools are now loosely referred to as "business intelligence" tools. In addition, there are other database constructs used to assist business intelligence tools in multi-dimensional analysis of data in the warehouse. These databases are referred to as hypercubes (also known as cubes, multi-dimensional cubes, or MDB's).
Since the early 1990's, data warehouses have become ubiquitous, technology and methodology have been improving, and costs have been decreasing. In 1998, data warehousing was a $28 Billion (USD) industry, and growing at over 10% per year. In addition, a recent survey of top IT executives indicated that data warehousing would be the number one post-Y2K priority. Data warehousing is now recognized as an important way to add business value and improve return on investment, if it is properly planned and implemented.
Selecting a set of products for a data warehouse effort is complex. The first and most important issue is to ensure that the Extract/Transform/Load tool that is chosen can effectively and efficiently extract the source data from all the required systems.
The selection of the ETL tool requires an understanding of the source data feeds. The following issues should be considered:
- Many warehouses are built from "legacy" systems that may be difficult to access from the computer network. ETL tools often do not reside on the same machine as the source data.
- The data structures of the legacy systems may be hard to decompose into raw data.
- Legacy data is often "dirty" (containing invalid data, or missing data). Care must be taken in the evaluation of the tool to ensure it has an adequate function library for cleansing the data. Depending on the complexity of the cleansing required, a separate tool designed specifically for cleansing and validation may have to be purchased in addition to the ETL tool.
- The ETL tool should have a metadata ("data about data") repository, which allows the data sources, targets, and transformations to be tracked in an effective manner.
- The tool should be able to access legacy data without the need for pre-processing (usually with COBOL programs) to get the data into sequential "flat files". This becomes increasingly complex when working with filesystems like VSAM (Virtual Sequential Access Method), and files that contain COBOL Occurs and Re-Defines clauses (repeating groups and conditionally defined fields). It should be noted that a large percentage of the world's data is stored in VSAM files.
- A final issue is whether the ETL tool moves all the data through its own engine on the way to the target, or can be a "proxy" and move the data directly from the source to the target.
Selection of the business intelligence tool(s) requires decisions such as:
- Will multi-dimensional analysis be necessary, or does the organization need only generalized queries? Not all warehouse implementations require sophisticated analysis techniques such as data mining (statistical analysis to discover trends in the data), data visualization (graphical display of query results), or multi-dimensional analysis (the so called "slice and dice").
- Will the architecture be two-tiered or three-tiered? Three-tiered architectures offload some of the processing to an "application server" which sits between the database server and the end-user.
- Will the tool employ a "push" or a "pull" technology? ("Push" technology publishes the queries to subscribed users, much like Pointcast works, "pull" requires that the user request the query).
- Will the information be broadcast over a corporate intranet, extranet, or the Internet?
- How will the organization implement data security, especially if information is being broadcast outside the corporate firewalls?
Average acquisition cost
A minimum tool set, excluding hardware, would include a database engine, ETL tool, and a number of BI tools for the individual users. A reasonable starting point for this set would be in the range of $100,000. Costs vary greatly based on vendor selection, number of seats purchased, software maintenance fees, and many other factors. In addition, consulting services drive the cost up immensely. A full blown, large-scale data warehouse in the terabyte range with hundreds of user licenses could cost tens of millions of dollars to implement and maintain.
Data Warehousing has taken many years to develop into its current form, and is still an immature technology. Improvements in hardware and software have made it possible to build warehouses that are many terabytes (one terabyte is 1,099,511,627,776 bytes) in size. The ability to access current and historical data to examine market trends and to perform business analysis has proven to provide significant return on investment. In today's dynamic marketplace, companies will continue to use data warehousing to derive competitive advantage.