Originally Published - October 20, 2008
The zeal to get as much business data to the user as soon as possible often prevails over the establishment of processes that control the quality of data. Low data quality standards can lead to bad business decisions and missed opportunities. Even with a data warehouse that is well designed and equipped with the best tools for business intelligence (BI), users will encounter inefficiency and frustration if the quality of data is compromised. When embarking on a data warehousing or business intelligence project, it is essential for organizations to emphasize the quality of data that is used for analysis and subsequent decision making.
As data captured from a multitude of sources makes its way to an enterprise data warehouse or data marts, a data quality framework creates a screening process that measures the purity of the data and corrects any inconsistencies found. This article walks the reader through a typical data quality strategy by illustrating, through examples, how and where quality issues occur, and the approaches available to inhibit the proliferation of these issues. We look at a subset of vendors that provide powerful and rich solutions to protect enterprise data quality.
Begin at the Beginning
The problem of inaccurate data often begins in application systems (the sources of data). There are simple best practices that can help limit the extent of inaccurate data.
Tying data types to business entities. Data types in source databases must closely describe the business entities they represent. For instance, numeric entities should not be stored as columns with string data types. When non-numeric data is accidentally stored in such columns, integrity problems are bound to occur downstream.
Default NULLS. Default values must be specified for all columns so that the "null" values of all entities are explicable. Failure to do so will lead to data warehousing systems having to guess the correct representation of NULL values.
Enforce referential integrity. Relationships between entities that are identified must be enforced through referential integrity. For instance, a customer's city must be constrained to a predefined set of cities.
Enforcing business rules. Database systems provide constraints, with which certain business rules can be applied to values that are entered into the database. For instance, a salary column can have a predefined range. In addition, application interfaces must provide constrained input fields in order to simplify data capture for the user, as well as enforce business rules. For instance, North American telephone numbers must be constrained to 10 digits, gender entered through a constrained user interface, and so on.
Consistency across business systems. A unified approach to building organization-wide application systems is paramount to ensure that entities are described consistently across multiple systems within an organization. Master data management is growing in importance and relevance. Although this article does not delve into this subject, it is important for organizations to put in place the necessary tools and processes for master data management.
Data through the Looking Glass
Data quality issues can certainly be minimized in data sources. However, data warehouses are conglomerations of several data sources, often combining unstructured and structured data, and even including data external to the enterprise. While each application system may be self-consistent, it is important to homogenize data from multiple sources. In addition, simply constraining data to match business rules is only the first step toward controlling accuracy. Patterns that emerge from large samples of data often point to problems that are impossible to detect as data is captured from individual values. Data quality solutions have been part of data integration platforms for a few years now, and although products vary in terms of the depth and breadth of different functions, an overall paradigm for data quality has emerged.
Data quality functions fall into three categories: data profiling, to analyze and identify quality issues; data cleansing, to correct and standardize data in preparation for consumption by the user community; and data monitoring, to control quality over time.
Diagnose with Data Profiling
By creating data profiles, organizations can gain an understanding of quality issues that stem from their business data. Metrics are created based on selected fields to measure the quality of data represented in those fields. Examples of data profiling metrics include the following:
Number of distinct values, which provides insight into how uniqueness is maintained.
Percentage of string values. Numeric values in alphanumeric entities or vice versa can expose basic violations of data types.
Percentage of missing values. A large percentage of missing values can point to errors in the source system(s).
Minimum and maximum values. Looking at a set of minimum and maximum values with frequency counts of these values can quickly point out data that lie outside the expected range of values. If the three minimum and maximum values for age in a university student information database are (6, 17,18) and (24, 42, 52), it is likely that 6 was incorrectly entered in place of 16; 42 and 52 in place of 24 and 25.
More advanced rules can also be applied to the data in order to discover aberrations that are not obvious. Data mining is the process of discovering information from enterprise data that is otherwise hidden. For instance, an online bookstore suggests additional books based on what a user adds to his or her cart by examining evidence from other comparable buyers. This is done through the use of association rules applied to historical sales data. Although the primary purpose of mining data is to gain business insight, it can be applied to discover anomalies in data. Consider a Web-based system that infers location from a user's Internet protocol (IP). If all users on a particular day are found to be localized in California, it may indicate that on that particular day, the IP-based inference did not function correctly and all users were set to the default location, which happened to be set to California.
In a nutshell, data profiling provides information about organizational data. Key quality issues are identified and should be addressed before proceeding further. Any deviant data that uncovers problems further upstream must prompt an immediate action to correct the component(s) that has caused these anomalies. Note that profiling does not eliminate all quality issues; showstoppers are eliminated and a framework for managing data quality is put in place.
Correct with Data Cleansing
If profiling looks at data through a magnifying glass, cleansing involves the use of an electron microscope. The extract, transform, and load (ETL) process in a data warehousing system extracts records from data source(s), transforms them using rules to convert data into a form that is suitable for reporting and analysis, and finally loads the transformed records into the destination (typically a data warehouse or data mart). Data cleansing is an integral part of the transformation process and enforces business and schema rules on each record and field. Data cleansing involves the application of quality screens that monitor every source record. The outcome of the violation of business rules can include
immediate correction using rules that are specified in the data cleansing application logic
logging the error and proceeding to the next data value
terminating the process
The most appropriate action depends on the nature and severity of each data issue. If a record that contains an address is missing the state or province, but has the city and country, the most appropriate action would be to correct the record by inferring the state or province from the other two data items, rather than terminating or ignoring the error.
Data cleansing functionality has advanced considerably, and most data integration platforms offer a variety of features tailored for most business scenarios.
Validation checks if every data value follows specified business rules. If, for instance, the data format of a social security number is incorrect, or a mandatory data value is missing, a validation procedure can flag and even clean or correct the data value. Complex business rules specific to the business environment can also be built to validate permissible data values where applicable.
Deduplication is one of the most important cleansing techniques, in which duplicate entities are merged. Duplication of data can occur within a single field, but more often within a combination of fields. For example, "A.A.Milne|Writer|Hampstead, England" and "Alan Alexander Milne|Author|Hampstead, London, England" represent the same entity. However, simple text matches on each field do not resolve this match. Matching based on fuzzy logic can be applied to resolve such records. Standard fuzzy matching helps eliminate duplicates that are caused by misspelling or approximate matches through the use of similarity functions. A similarity function usually returns a score that represents the degree of similarity between two sets of data values. Software that provides fuzzy matching capabilities often has underlying libraries that provide knowledge that cannot be discerned purely from text (or phonetic) similarities. For instance, the knowledge that a "developer" and "programmer" must be considered the same job title is required to successfully match job titles that have little or no textual similarity. Most tools also provide the possibility to customize matching rules and create more complex rules tailored to the specific business environment.
Householding is the method by which individual records can be grouped based on common properties they share. Aggregates that are based on customer organization names can be done by first grouping all records by company (using fuzzy matching to resolve approximate matches) and then computing the aggregates for each group.
Maintain through Data Monitoring
In order to sustain the reliability of data, it is imperative to build controls that constantly assess the quality of data and trigger alerts when data quality levels are threatened. With a data cleansing framework in place, an organization builds up its own profile of data quality issues and how these issues were dealt with, on each occurrence. Data monitoring can provide insight into how often duplicates were found and resolved, how many missing values were resolved each week, and what types of actions were taken for missing values that could not be resolved. Rules can be built that track thresholds on tolerance levels for various data quality issues; alerts can be triggered when thresholds exceed permissible levels. Weak processes and applications are identified through monitoring data quality over time. The insight found through ongoing monitoring can be used to solidify business processes. This feedback mechanism keeps quality checks in place constantly.
Data quality technology is growing at a rapid pace, as organizations recognize and place increasing emphasis on delivering high quality data to users. Solutions are available as simply data quality tools or as part ETL/data integration packages.
DataFlux (a SAS company) offers dfPowerStudio, an integrated platform for data quality and data integration. One of the components of dfPowerStudio is dfPower® Profile, an extensive data profiling module. It includes metadata analysis that helps organize data across multiple organizational sources. A relationship discovery component helps uncover relationships across data sources and granularity. In addition to verifying metadata-defined relationships, dfPower® Profile makes it possible to discover relationships that are not defined through metadata.
Informatica's Data Quality Workbench fully integrates with PowerCenter to create a single platform for data integration and data quality. Users can leverage Informatica's parallel processing–based partitioning option to run data quality and integration processes in parallel for high performance. Informatica's Data Explorer profiles individual columns, relationships within tables, and across tables, in order to identify data quality issues.
Datanomic's dn:Director provides a family of processors, which constitute a complete set of data quality functions. This includes profiling and analysis to gain an understanding of data, transformation that helps cleanse and enrich data, phrase profiling and parsing that can help provide structure to unstructured data and identify important information that may be hidden in large datasets, and matching that can be used for deduplication. In addition to a batch architecture, the solution also supports a real-time architecture in which processes developed using dn:Director can be applied to Java Messaging Service (JMS) transactions in order to validate data in real time.
Everything's Got a Moral, If Only You Can Find It
The primary purpose of a BI solution is to provide a business with the necessary information to support better decision making. This requires the integration of data within and external to an enterprise. A sound data quality strategy protects an enterprise from inconsistencies and anomalies that arise from the complexity of integrating multiple systems and from hidden problems that are uncovered only through sophisticated analysis techniques. Data quality solutions provide organizations with the ability to understand their data better through profiling, address data quality issues through cleansing, and establish processes to monitor the quality of their data over time.
About the Author
Anna Mallikarjunan is a member of TEC's research & development team. She is responsible for the analysis and development of TEC's decision support software as well as tools for business intelligence (BI). She has over four years of business analysis, design, and development experience in several areas of BI, including data warehousing; extract, transform, and load (ETL); online analytical processing (OLAP); reporting; and custom application development.
Past positions Mallikarjunan has held include technical lead and applications development manager of a team of .NET, data warehousing, and BI professionals for a fashion retail company. In this role, she was responsible for the development, maintenance, and support of Windows and Web-based applications, as well as an operational data store, data marts, and BI applications.
Mallikarjunan holds a BSc in computer science from the University of Madras (India), and an MSc in computer science from Anna University in Madras, India.