The Evolution of a Real-time Data Warehouse
- December 23, 2009
Understanding Real-time Systems
Today, real-time computing is everywhere, from customer information control systems (CICS) to real-time data warehouse systems. Real-time systems can respond to user actions within a very short period of time. This behavior gives real-time systems special features such as instant interaction: users request information from the system and receive an answer immediately. Users can also remain connected (online) and start this interaction with the system at any time; such a system is called an online transaction processing (OLTP) system.
In general, a real-time system generates a large amount of up-to-date data and operates in a transactional way (that is, it is used to register business transactions). Even though a traditional real-time system can store historical information, it is not designed to exploit that information for analytical purposes. In real-time systems, the value of data lies mainly in the ability to react or respond to the most current data in the system, while historical data is used for monitoring purposes. Common real-time systems are designed for bookkeeping purposes (e.g., invoice information, flight reservations, etc.). Transactional data stored in a traditional real-time system is changed (updated) according to operational needs, which explains why some traditional real-time systems are also called "operational systems."
The Data Warehouse
Operational systems generate a large amount of data to support the transactional operations of an organization. The data generated by these systems is fresh and valuable; it can also be very diverse and heterogeneous, coming from a wide variety of sources, and many companies run several systems to support their operations. To solve the resulting problem (gathering, integrating, cleaning, and analyzing this heterogeneous information), new software systems were developed, the most important of which is the data warehouse.
A data warehouse is a collection of subject-oriented, integrated, time-variant, and non-volatile data used to support the decision-making process of an organization's management team. A data warehouse integrates all of an organization's historical data and can store snapshots of its transactions. All information generated by the operational data sources is extracted, cleansed, transformed, and loaded into the data warehouse. Once there, the data can be subject to a wide range of analysis and exploration processes.
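The non-volatile, time-variant properties described above can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module; the table and column names are hypothetical. The key point is that the warehouse only ever appends timestamped snapshot rows, so changing operational records never erases history.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical snapshot table: rows are only ever inserted, each stamped
# with its load time, so every historical state of a transaction survives.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_snapshots (
        order_id  INTEGER,
        status    TEXT,
        amount    REAL,
        loaded_at TEXT   -- snapshot timestamp (ISO 8601)
    )
""")

def load_snapshot(order_id, status, amount):
    """Append a snapshot; never UPDATE or DELETE existing rows."""
    conn.execute(
        "INSERT INTO order_snapshots VALUES (?, ?, ?, ?)",
        (order_id, status, amount, datetime.now(timezone.utc).isoformat()),
    )

# The same operational record changes over time ...
load_snapshot(1001, "created", 250.0)
load_snapshot(1001, "shipped", 250.0)

# ... but the warehouse keeps both states, not just the latest one.
history = conn.execute(
    "SELECT status FROM order_snapshots WHERE order_id = 1001 ORDER BY rowid"
).fetchall()
print([s for (s,) in history])  # ['created', 'shipped']
```

An operational system would instead have overwritten the order's status in place; the insert-only design is what makes time-variant analysis possible later.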
In a traditional data warehouse design, data generated within the company is concentrated in a single place and is not deleted. This ensures that all data is integrated and standardized to reflect one single version of the truth throughout the organization, refreshed every quarter, month, week, or even day. There are many tools that enable data analysis from very diverse points of view: reporting and analysis tools, online analytical processing (OLAP) tools, data mining, forecasting, etc.
Managers and decision makers realized that a data warehouse and its technology could provide a complete view of the status of an organization. But as operational systems evolved, businesses merged, and globalization advanced, the number of business transactions increased substantially. Soon it became clear that having only historical information available would not give companies all the data necessary for decision support. Managers needed data to be refreshed at high speed, forcing data warehouse refresh cycles to increase in frequency.
Real Time and the Data Warehouse
At first glance, the real-time and data warehouse concepts appear very distant and disparate, but on closer inspection the real-time data warehouse paradigm makes more sense than we might have thought. As I mentioned before, the volume of data generated by operational systems has grown intensely, and the speed of that data has forced data warehouses to radically change the way data is stored and handled.
As a natural consequence of the increasing speed of data generation, the traditional data warehouse design was forced to incorporate more frequent refresh cycles, and new challenges in data warehouse design had to be confronted. One of the first steps in managing this major issue was the creation of the near real-time data warehouse.
The Near Real-time Data Warehouse
The extraction, transformation, and load (ETL) process represents one of the major challenges in real-time data warehouse design. ETL processes for the data warehouse were originally designed to be executed in batch mode, during previously scheduled downtimes. All operational data from distinct sources (e.g., ERP systems) was extracted, cleansed in a staging repository, and loaded into the data warehouse over long periods of time, mostly at night. These processes can take minutes or hours, depending on the volume of data being loaded into the data warehouse.
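The batch ETL cycle described above can be sketched as three small functions. This is a toy example with hypothetical source data held in Python dicts; a real pipeline would pull from ERP databases or flat files during the scheduled downtime window.

```python
import sqlite3

# Hypothetical exports from two heterogeneous operational sources.
erp_rows = [{"id": 1, "amount": "100.0"}, {"id": 2, "amount": " 55.5 "}]
crm_rows = [{"id": 3, "amount": None}]   # dirty record, to be cleansed away

def extract():
    # Pull raw rows from each source.
    return erp_rows + crm_rows

def transform(rows):
    # Staging step: discard incomplete rows, normalize types.
    cleaned = []
    for r in rows:
        if r["amount"] is None:
            continue                      # cleanse: drop bad data
        cleaned.append((r["id"], float(str(r["amount"]).strip())))
    return cleaned

def load(conn, rows):
    # Bulk-load the whole batch in a single transaction.
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
load(conn, transform(extract()))
print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 2
```

The single `load` transaction at the end is what makes this a batch design: nothing is visible to analysts until the whole window completes.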
Under pressure to load more recent data into the data warehouse, ETL processes were redesigned to run more frequently. This approach derives from a very simple observation: if the business does not really need real-time technology, it may be enough to implement a near real-time data warehouse, which essentially means running the ETL processes more often. The near real-time approach has its own challenges, such as more frequent downtimes, pressure to shorten each downtime period, and the need to avoid inconsistency in data results. Still, if there is no actual need for a real-time data warehouse solution, a near real-time data warehouse can be a good option.
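Running the ETL more often amounts to replacing the nightly batch with many small micro-batches. The sketch below illustrates the idea; the table name and the notion of a pending-event queue are assumptions for the example, and the interval between cycles would be tuned to how fresh the business needs the data to be.

```python
import sqlite3

# Near real-time sketch: instead of one nightly batch, drain a small
# queue of operational events in frequent micro-batch cycles.
pending = []  # events queued by the operational system between cycles

def run_micro_batch(conn):
    """Drain whatever accumulated since the last cycle and load it."""
    batch, pending[:] = list(pending), []
    with conn:
        conn.executemany("INSERT INTO events VALUES (?)",
                         [(e,) for e in batch])
    return len(batch)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (payload TEXT)")

pending.extend(["e1", "e2"])        # activity between cycles
loaded_first = run_micro_batch(conn)

pending.append("e3")                # more activity; next cycle picks it up
loaded_second = run_micro_batch(conn)
print(loaded_first, loaded_second)  # 2 1
```

Each cycle is still a batch, just a small one; that is why near real-time inherits the downtime and consistency concerns mentioned above rather than eliminating them.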
The Real-time Data Warehouse
A real-time data warehouse stores data at the very moment it is generated: data is immediately captured, cleansed, and stored within the data warehouse structure, and traditional refresh cycles are no longer needed. The data warehouse reads the same data that moves through the operational systems at the same time it is generated. Despite the technical difficulties of implementing a true real-time data warehouse, it offers some clear advantages:
- It shortens information delivery times.
- It improves integration throughout the organization.
- It eases the analysis of future trends.
Basic Principles to Consider
With the growing popularity and increasing implementation of real-time data warehouses, it is important to keep some basic principles in mind when planning a real-time data warehouse implementation.
Data on Time, at the Right Time. Data must flow to the real-time data warehouse at the necessary speed to be considered valuable. In a real-time data warehouse, the batch ETL mechanism based on table or file transfers is replaced with a design in which data flows from very different yet synchronized data sources into the warehouse. Data is valuable if it flows at the right speed, which is not necessarily real-time speed; the data cycle depends on the analytical purpose the data serves. Real-time data is an essential part of the analytical process, but historical data remains an essential part of the design paradigm. While tactical analysis requires immediate or recent data, strategic decision making requires a higher volume of historical data. For most business analysis, a combination of both historical and real-time data is necessary.
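The split between tactical and strategic use of the same warehouse can be shown with two queries over hypothetical tables: a tactical query reads only the freshly arrived data, while a strategic query unions it with the historical store.

```python
import sqlite3

# Illustrative tables: one long-loaded historical store, one real-time feed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_history (day TEXT, amount REAL);  -- loaded long ago
    CREATE TABLE sales_today   (day TEXT, amount REAL);  -- flowing in now
    INSERT INTO sales_history VALUES ('2009-12-21', 100), ('2009-12-22', 150);
    INSERT INTO sales_today   VALUES ('2009-12-23', 40);
""")

# Tactical view: only the most current data.
tactical = conn.execute("SELECT SUM(amount) FROM sales_today").fetchone()[0]

# Strategic view: historical and real-time data combined in one result.
strategic = conn.execute("""
    SELECT SUM(amount) FROM (
        SELECT amount FROM sales_history
        UNION ALL
        SELECT amount FROM sales_today
    )
""").fetchone()[0]
print(tactical, strategic)  # 40.0 290.0
```

The design choice this illustrates is that "right time" is a property of the query's purpose, not of the warehouse: both views coexist over the same repository.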
The Analytical Purpose. Real-time data warehouses are not intended to replace traditional operational systems. The function of a real-time data warehouse is to support analytical purposes, not to perform operational functions. Even though a real-time data warehouse can store real-time data, its design is intended for intensive data loading, not transaction-by-transaction registration. Real-time data warehouses are based on business data flows rather than business transactions, and need to maintain their main analytical purpose.
The Enterprise Focus. One of the key differentiators from other real-time systems (ERPs, CRMs, etc.) is integration. A real-time data warehouse integrates data flowing from different sources into one single site. Consider a real-time data warehouse as an enterprise data warehouse: this repository will be accessible to all business units across the organization.
Some Players in the Real-time Data Warehouse Field
There are several vendors in the data warehouse space. Some are big and offer very robust solutions; others are innovative and offer state-of-the-art solutions. Here are some data warehouse vendors to consider when planning a real-time data warehouse implementation (listed in alphabetical order).
Greenplum Database is data warehouse software built to support large-scale analytical processing. Its architecture is based on massively parallel processing (MPP) and provides multi-level fault tolerance. The database supports industry-standard interfaces (structured query language [SQL], open database connectivity [ODBC], and Java database connectivity [JDBC]) and interoperates with the most common business intelligence (BI) and ETL tools.
IBM counts on InfoSphere Warehouse to deliver an entire data warehouse deployment solution, with different editions (e.g., Enterprise and Departmental) and a complete warehousing solution (InfoSphere Balanced Warehouse). Recently, IBM announced InfoSphere Warehouse on System z as part of a complete IBM data warehouse solution, enabling applications to populate data warehouses under DB2 for z/OS.
SQL Server 2008 provides a scalable data warehouse solution for BI. Its features include data compression, partitioned tables, parallelism capabilities, and change data capture. The scalable Integration Services tools enable fast ETL operations and connectivity to non-SQL Server data sources.
Netezza Data Warehouse Appliances combine software and hardware tools designed for analytical processing of very large amounts of data. Netezza's architecture is based on massively parallel processing (MPP) and uses "streaming" processing to help enable advanced or complex analytical processes.
Oracle offers a set of products for deploying data warehousing solutions, such as the Exadata Storage Server (based on the HP ProLiant DL180 G5 server), the HP Oracle Database Machine designed for multiterabyte data warehouses, and the Oracle Warehouse Builder integration tool, which supports advanced data loads and data capture for low-latency databases. Oracle also offers products like Oracle Partitioning, which can lower query times and increase data availability.
Among Teradata's other products, Teradata Database 12 is an enterprise data warehouse (EDW) solution with parallel operation capability, mission-critical availability, and ease of integration. With support for symmetric multiprocessing (SMP) and massively parallel processing (MPP), and with its own tools and utilities such as TPump, FastLoad, and MultiLoad, data can be loaded continuously. Teradata Replication Services are also used for real-time capture and delivery of changed data.
The Vertica Analytic Database is a column-oriented, MPP-based database that targets new-generation data warehouses. It is designed to handle large-scale data analysis and many concurrent users, and features aggressive data compression.
Data warehouse implementation has changed radically. The data warehouse is still an enterprise data repository that enables data analysis, but nowadays it can incorporate not only historical data but real-time data as well. This new capability expands the functionality of data warehouses, enables them to store information at higher speed, and supports an organization's tactical and analytical decisions at the right time. The data warehouse is not only alive and well, but evolving and maturing, enabling companies to evolve and expand by using new technology to analyze past and present information to support future decisions.