M.
Reed
- August
2, 2000
Why
the market is necessary
Data warehousing is an integral part of the "information age". Corporations
have long known that some of the keys to their future success could be
gleaned from their existing data, both current and historical. Until approximately
1990, many factors made it difficult, if not impossible, to extract this
data and turn it into useful information. Some examples:
- Data
storage peripherals such as DASD (Direct Access Storage Device) were
extremely expensive on a per-megabyte basis. Therefore, much of the
needed data was stored offline, typically on magnetic tape.
- Processing
power was very expensive as measured in MIPS (Millions of Instructions
per Second). Mainframes had to reserve most of their processing power
for day-to-day operations, reports could only be run overnight in batch
mode (without interaction from the user).
- Relational
database technology was still in its infancy, and server engines were
not powerful enough to support the data loads required.
- The type
of programming that had to be done with third generation languages (3GL's)
was tedious and expensive. Fourth generation languages were needed to
abstract some of the required coding, but 4GL's were still in their
infancy.
Most
operational data is stored in what is referred to as an OLTP (On-Line
Transaction Processing) system. These systems are specifically designed
for high levels of transaction volume with many concurrent users. If the
database is relational, it has probably been "normalized" (the process
of organizing data in accordance with the rules of a relational database).
If the database is non-relational, custom programs have to be written
to store and retrieve data from the database. (This is often accomplished
with the COBOL programming language). Whether relational or non-relational,
the very design that makes an OLTP system efficient for transaction processing
makes it inefficient for end-user queries. In the 1980's, many business
users referred to their mainframes as "the black hole", because all the
information went into it, but none ever came back out - all requests for
reports had to be programmed by the Information Systems staff. Only "pre-canned"
reports could be generated on a scheduled basis, ad-hoc real-time querying
was virtually impossible.
To
resolve these issues, data warehousing was created. The theory was to
create a database infrastructure that was always on-line, contained all
the information from the OLTP systems, including historical data, but
structured in such a way that it was fast and efficient for querying.
The most common of these schemas (logical and physical database designs)
is known as the star schema. A star schema consists of facts (actual business
facts) and dimensions (ways of looking at the facts). One simple way to
look at a star schema is that it is designed such that the maximum amount
of information can be derived from the fewest number of table reads. Another
way to reduce the amount of data being read is to pre-define aggregations
(summaries of detail data, such as monthly total sales) within the star,
since most queries ask questions like "how many were sold last month?"
Data
warehousing also led to the development of the concept of metadata management.
Metadata is data about data, such as table and column names, and datatypes.
Managing metadata makes it possible to understand relationships between
data elements and assists in the mapping of source to target fields. (For
more information of Metadata see "Metadata
Standards in the Marketplace ")
Next
came the creation of Extract/Transform/Load (ETL) tools, which made use
of the metadata to get the information from the source systems into the
data warehouse.
Additional
tools, which made use of SQL (Structured Query Language), were developed
to give end-users direct access to the data in the warehouse. As time
went by, the query tools became user-friendly, and many now have a parser
that can turn plain English questions into valid SQL. These end-user tools
are now loosely referred to as "business intelligence" tools. In addition,
there are other database constructs used to assist business intelligence
tools in multi-dimensional analysis of data in the warehouse. These databases
are referred to as hypercubes (also known as cubes, multi-dimensional
cubes, or MDB's).
Since
the early 1990's, data warehouses have become ubiquitous, technology and
methodology have been improving, and costs have been decreasing. In 1998,
data warehousing was a $28 Billion (USD) industry, and growing at over
10% per year. In addition, a recent survey of top IT executives indicated
that data warehousing would be the number one post-Y2K priority. Data
warehousing is now recognized as an important way to add business value
and improve return on investment, if it is properly planned and implemented.
Selection
Issues
Selecting a set of products for a data warehouse effort is complex. The
first and most important issue is to ensure that the Extract/Transform/Load
tool that is chosen can effectively and efficiently extract the source
data from all the required systems.
The
selection of the ETL tool requires an understanding of the source data
feeds. The following issues should be considered:
- Many
warehouses are built from "legacy" systems that may be difficult to
access from the computer network. ETL tools often do not reside on the
same machine as the source data.
- The data
structures of the legacy systems may be hard to decompose into raw data.
- Legacy
data is often "dirty" (containing invalid data, or missing data). Care
must be taken in the evaluation of the tool to ensure it has an adequate
function library for cleansing the data. Depending on the complexity
of the cleansing required, a separate tool designed specifically for
cleansing and validation may have to be purchased in addition to the
ETL tool.
- The ETL
tool should have a metadata ("data about data") repository, which allows
the data sources, targets, and transformations to be tracked in an effective
manner.
- The tool
should be able to access legacy data without the need for pre-processing
(usually with COBOL programs) to get the data into sequential "flat
files". This becomes increasingly complex when working with filesystems
like VSAM (Virtual Sequential Access Method), and files that contain
COBOL Occurs and Re-Defines clauses (repeating groups and conditionally
defined fields). It should be noted that a large percentage of the world's
data is stored in VSAM files.
- A final
issue is whether the ETL tool moves all the data through its own engine
on the way to the target, or can be a "proxy" and move the data directly
from the source to the target.
Selection
of the business intelligence tool(s) requires decisions such as:
- Will
multi-dimensional analysis be necessary, or does the organization need
only generalized queries? Not all warehouse implementations require
sophisticated analysis techniques such as data mining (statistical analysis
to discover trends in the data), data visualization (graphical display
of query results), or multi-dimensional analysis (the so called "slice
and dice").
- Will
the architecture be two-tiered or three-tiered? Three-tiered architectures
offload some of the processing to an "application server" which sits
between the database server and the end-user.
- Will
the tool employ a "push" or a "pull" technology? ("Push" technology
publishes the queries to subscribed users, much like Pointcast works,
"pull" requires that the user request the query).
- Will
the information be broadcast over a corporate intranet, extranet, or
the Internet?
- How will
the organization implement data security, especially if information
is being broadcast outside the corporate firewalls?
Average
acquisition cost
A minimum tool set, excluding hardware, would include a database engine,
ETL tool, and a number of BI tools for the individual users. A reasonable
starting point for this set would be in the range of $100,000. Costs vary
greatly based on vendor selection, number of seats purchased, software
maintenance fees, and many other factors. In addition, consulting services
drive the cost up immensely. A full blown, large-scale data warehouse
in the terabyte range with hundreds of user licenses could cost tens of
millions of dollars to implement and maintain.
Conclusions
Data Warehousing has taken many years to develop into its current form,
and is still an immature technology. Improvements in hardware and software
have made it possible to build warehouses that are many terabytes (one
terabyte is 1,099,511,627,776 bytes) in size. The ability to access current
and historical data to examine market trends and to perform business analysis
has proven to provide significant return on investment. In today's dynamic
marketplace, companies will continue to use data warehousing to derive
competitive advantage.