To understand the relevance of extract, transform, and load (ETL) components and how they fit into business intelligence (BI), one should first appreciate what data integration is and the significance of having clean, accurate data that enables successful business decisions. Within the BI industry, data integration is essential. By capturing the right information, organizations can perform analyses, create reports, and develop strategies that help them not only survive, but, more importantly, thrive.
Informatica, a leading provider of enterprise data integration software, defines data integration as "the process of combining two or more data sets together for sharing and analysis, in order to support information management inside a business". In BI terms, this means that data is extracted in its original form and stored in an interim location, where it is transformed into the format that will be used in the data warehouse. The transformation process includes validating data (e.g., filling in null zip code information in the customer database) and reformatting data fields (e.g., separating Last Name and First Name fields of customer records that are merged in one database but not others). The next step is to load the data into the data warehouse. The data is then used to create queries and data analysis builds, such as on-line analytical processing (OLAP) cubes and scorecard analyses. In a sense, extracting the proper data, transforming it by cleansing and merging records, and loading it into the target database is what allows BI solutions to build analytical tools successfully. It is also the essence of ETL functionality.
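The transformation examples above (filling in null zip codes, splitting a merged name field) can be sketched as a single cleansing step. This is a minimal illustration only; the record layout, field names, and the `"00000"` default are assumptions made for the example, not taken from any particular ETL product.

```python
# Hypothetical ETL transform step; field names and the sentinel
# zip code are illustrative assumptions, not any vendor's schema.

def transform(record):
    """Cleanse and standardize one raw customer record."""
    cleaned = dict(record)

    # Validation: fill in a missing zip code with a sentinel default.
    if not cleaned.get("zip"):
        cleaned["zip"] = "00000"

    # Reformatting: split a merged "name" field into the separate
    # last/first fields the warehouse schema expects.
    if "name" in cleaned and "last_name" not in cleaned:
        last, _, first = cleaned.pop("name").partition(", ")
        cleaned["last_name"], cleaned["first_name"] = last, first

    return cleaned

raw = {"customer_id": 42, "name": "Smith, Jane", "zip": None}
print(transform(raw))
# {'customer_id': 42, 'zip': '00000', 'last_name': 'Smith', 'first_name': 'Jane'}
```

In a real pipeline this step would run in the interim staging area, between extraction from the source systems and the load into the warehouse.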
Data Integration Components
In order to determine the most suitable ETL solution for them, organizations should evaluate their needs in terms of the core components of the data integration process, as listed below.
- Data Identification. What data does the organization need to extract and where does it come from? What end result, in terms of the data, does the organization want to analyze? Essentially, answering these questions means identifying the origin of the data, and what the relationship is between the different data sources.
- Data Extraction. How frequently does the organization require the data? Is it monthly, weekly, daily, or hourly? Where should data storing and transformation activities occur (e.g., on a dedicated server or in the data warehouse)? Considering these factors identifies the data frequency needs of the organization. For example, analysis of sales data may require the organization to load data monthly or quarterly, whereas some other data transfers may be performed multiple times a day. In determining the frequency of data loading and transformation in the data warehouse or on the dedicated server, the organization should also consider the amount of data to be transferred and its effect on system performance.
- Data Standardization. What is the format of the organization's data, and is it currently compatible with the same data elements in other systems? For example, if the organization wants to analyze customer information and to merge customer buying patterns with customer service data, it must know if the customer is identified in the same way in both places (e.g., by customer identification [ID], phone number, or first and last name). This is crucial for ensuring that the correct data is merged and that the data is attached to the right customer throughout the data standardization process. Another data standardization issue the organization should deal with is identifying how it will manage data cleansing and data integrity functions within the data warehouse over time.
- Data Transformation. The organization should consider data transformation requirements and the interaction between the transformed data components. The critical questions are how the data will be reflected in the new database, and how that data will be merged on a row-by-row basis. Answering these questions involves identifying the business and data rules associated with the data to ensure accuracy in data loads.
- Data Loading. Where will the data be loaded? What data monitoring activities are required? Other data loading concerns are failed data transfer identification, how failed transfers are handled, and how updates occur. For example, will each load involve re-loading the whole dataset, or will updates be made using only updated fields within the data sources?
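The loading choice raised in the last bullet, re-loading the whole dataset versus applying only updated records, can be sketched as two strategies against the same target. The table names, keys, and in-memory dictionaries are assumptions made for the illustration.

```python
# Illustrative sketch of two loading strategies: a full reload versus an
# incremental update keyed on a primary key. A dict stands in for the
# warehouse table; key and field names are assumptions for the example.

def full_reload(target, source_rows, key="customer_id"):
    """Replace the entire target table with the current source extract."""
    target.clear()
    target.update({row[key]: row for row in source_rows})

def incremental_load(target, changed_rows, key="customer_id"):
    """Apply only the changed rows: insert new keys, update existing ones."""
    for row in changed_rows:
        target[row[key]] = row

warehouse = {}
full_reload(warehouse, [{"customer_id": 1, "zip": "10001"},
                        {"customer_id": 2, "zip": "94105"}])
incremental_load(warehouse, [{"customer_id": 2, "zip": "94107"},
                             {"customer_id": 3, "zip": "60601"}])
print(sorted(warehouse))    # [1, 2, 3]
print(warehouse[2]["zip"])  # 94107
```

A full reload is simpler and guarantees consistency with the source, but transfers far more data; incremental loads move only deltas, which matters as load frequency increases.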
After evaluating the core components of data integration, the organization should investigate its traditional BI needs throughout the organization, and assess how they will evolve or change.
Until recently, ETL involved uploading data at regular (e.g., monthly or weekly) time intervals to drive business performance decisions and identify business opportunities. However, as BI tools become more integrated with overall business functions, including business performance management (BPM) and reporting and analysis requirements, data needs have shifted from monthly or weekly intervals to real time updates. This means that it has become more important for data transfers to accurately reflect real time business transactions, and that the number of required data transfers has increased.
Nonetheless, real time ETL does not necessarily mean automatic data transfer as operational databases are updated. In terms of BI, real time may mean different things to different organizations, or even to different departments within the same organization. Take, for instance, an automotive manufacturer whose traditional data warehouse solutions (OLAP cubes, etc.) involved capturing data at a given point in time. The automotive manufacturer might, for example, have wanted to track and compare monthly sales with last year's sales during the same month by region, car model, and dealer size, thus requiring the data warehouse to be updated on a monthly basis. However, as the manufacturer's business decisions evolved based on this analysis, its data needs shifted from a monthly requirement to a weekly one, and then to ever more frequent intervals, eventually creating the demand for real time data. In the case of the automotive manufacturer, real time data may be useful for identifying the movement of car parts within a warehouse relative to their storage locations and comparing this information with the demand for these parts.
Such a shift in data requirements affects both the volume of data required and when the data loading occurs. The end result is that, in order to meet the changing needs of user organizations, ETL and BI vendors have concentrated on moving towards real time ETL and shifting their data loading functionality to accommodate higher volumes of data transfer.
How Do Vendors Handle ETL?
Once an organization has evaluated its BI and data integration needs, it is ready to investigate BI vendors and the type of ETL functionality that they offer. It is important to note that though, for many vendors, ETL is only a part of their offering, this article will focus strictly on the ETL functionality those vendors provide. When evaluating potential vendors and their ETL functionality, there are two types of vendors to choose from. Vendors such as Cognos, SAS, and Information Builders provide integrated ETL functionality built into their overall BI framework. Data integration or data quality vendors such as DataFlux, Ascential, and Trillium, on the other hand, provide best-of-breed ETL solutions. Which sort of vendor an organization will favor depends on its data requirements.
In general, BI vendors accommodate customers by providing embedded support for slowly changing data requirements, as well as by allowing users to choose the location of data loads. This helps increase the speed of data transfers, accommodating industry demands for increased data loading in the data warehouse and for a greater number of data loads within a given time period.
Cognos 8 BI Data Manager allows users to extract, merge, transform, and load data in a single pass, as well as to use Web services to construct and schedule process builds and jobs on any server located within the network. Additionally, processes such as hierarchy and data validation definitions are automated, allowing embedded support for slowly changing dimensions and late arriving data. Data Manager enables the data integration process within a simple drag-and-drop environment.
SAS's Data Integration uses a wizard-driven user interface to provide ease of use for end users. Included in its ETL functionality is the ability for processing to occur natively on any platform and within any database. The solution has built-in data load balancing to optimize resources, as well as scalable deployment to accommodate growing data transfer needs. This last capability eliminates the need to redesign processes as data requirements change. In addition, SAS's product has an embedded design, test, and production environment, which allows users to synchronize data transfers and conduct testing.
Information Builders' DataMigrator offers essential ETL functionality, such as the ability to aggregate, join, merge, and apply selection criteria to information from any combination of data sources. DataMigrator can also transform data from raw forms into structured formats based on individual business needs, and can automatically generate and manage file transfer protocol (FTP) scripts. Additionally, DataMigrator allows for either bulk loading or row-at-a-time data inserts, and has a change data capture capability that loads only changed records into the target database, making it easier to achieve near real time results.
The following data integration vendors provide the same functionality as the aforementioned BI vendors, but with an increased focus on data cleansing and integrity.
DataFlux, acquired by SAS in 2000 to extend its data warehousing capabilities, provides a product called dfPowerStudio that can assess data accuracy and validity, and identify patterns, in order to standardize data. Moreover, dfPowerStudio can monitor and audit data by providing alerts that identify the state of organizational data over time. Alerts are also provided to identify data quality levels and business rule violations. Additionally, dfPowerStudio analyzes and corrects data inconsistencies. It matches data by clustering records into groups and merging duplicates into the best record through drag-and-drop functionality. Furthermore, dfPowerStudio's user friendly interface makes data profiling and data quality management easy.
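The general technique behind duplicate matching and merging can be sketched generically. This is an assumption-laden illustration only, not DataFlux's actual matching algorithm or API: here records are clustered on a normalized name-plus-zip key, and the "best record" per cluster is simply the one with the fewest empty fields.

```python
# Generic sketch of duplicate clustering and survivorship. The cluster
# key (normalized name + zip) and the "fewest empty fields wins" rule
# are illustrative assumptions, not any vendor's matching logic.
from collections import defaultdict

def cluster_key(record):
    """Normalize the fields used to decide that two records match."""
    return (record["name"].strip().lower(), record.get("zip", ""))

def merge_duplicates(records):
    """Group matching records and keep the most complete one per cluster."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[cluster_key(rec)].append(rec)
    # Survivorship rule: prefer the record with the fewest empty fields.
    return [max(group, key=lambda r: sum(1 for v in r.values() if v))
            for group in clusters.values()]

customers = [
    {"name": "Jane Smith", "zip": "10001", "phone": ""},
    {"name": " jane smith ", "zip": "10001", "phone": "555-0100"},
    {"name": "Bob Lee", "zip": "94105", "phone": "555-0111"},
]
best = merge_duplicates(customers)
print(len(best))  # 2
```

Production tools apply far richer matching (fuzzy comparison, phonetic keys, configurable survivorship rules), but the cluster-then-merge shape is the same.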
IBM WebSphere Information Integration (formerly Ascential Software) is a data integration suite that enables organizations to use a single information integration platform to access, cleanse, integrate, transform, and deliver data, as well as to immediately identify data anomalies through an embedded source and target profiling and analysis system. IBM WebSphere software maintains data through the use of a single, open repository on DB2, Oracle, or SQL Server platforms; validates business requirements to identify whether they are achievable; and ensures that disparate data sources support target requirements.
Trillium, a Harte-Hanks company, also provides users with a software solution suite that allows them to implement a total data quality solution. The suite, composed of Trillium Software Discovery and Trillium Software System, enables users to identify incorrect data formats, duplicates, misspellings, and redundant and missing values, and to create and define their own business rules to automatically monitor data standards. Also, Trillium software generates entity-relationship diagrams; creates a central repository of data, metadata, statistics, rules, and documentation; and has drill down capabilities down to individual rows of data. Additionally, continuous data profiling, cleansing, and monitoring activities, plus data filter and search capabilities are provided.
Depending on an organization's data requirements and its data integrity standards, the organization must determine whether BI vendors can provide all the appropriate functionality, or whether it should take advantage of the added data quality functionality built into the offerings of data integration and data quality vendors. In this regard, organizations should realize that "bad" data accumulates over time when data is entered inconsistently across disparate systems within an organization. To optimize BI solutions, data must be clean and accurate, and that cleanliness must be maintained over time. Data quality vendors, such as DataFlux, place priority on data cleansing by providing end user features that allow data cleansing activities and monitoring to occur on a regular basis. With BI vendors offering integrated suites, this is not always the case. Using these integrated BI suites may require an organization to perform in-depth data cleansing at the end user level before data is brought into the data warehouse. This not only wastes staff hours, but also places the burden of maintaining data standards on end user groups, where data inconsistencies are most likely to occur due to human error. Thus, for many organizations, it is advantageous to invest more money in order to integrate the solutions of both types of vendors, utilizing the enhanced analytical tools of a BI solution and the data integrity and management capabilities provided by best-of-breed data integration vendors.