Continuous Data Quality Management: The Cornerstone of Zero-Latency Business Analytics

  • Written By: Metagenix
  • Published On: October 17, 2002



The Cost Of Poor Quality Data

No matter how well an enterprise implements a CRM, ERP, SCM, Business Intelligence, or Data Warehouse project, poor data quality can destroy its utility and cost real dollars. According to recent industry studies:

  • Poor data quality costs businesses $611 billion per year in the United States alone (TDWI).

  • 75% of businesses have experienced significant problems due to faulty data (PwC).

  • Only 33% of businesses felt confident in the quality of their company's data.

Now imagine the downstream impact of this same poor quality data fueling business decisions. Not only will bad decisions be made, but when employees and managers stop trusting their business intelligence applications, the entire investment in these costly systems can be jeopardized. It's quite simple: businesses that ignore data quality do so at their own peril and expense.

This is Part One of a two-part article; it defines the problem and its importance to enterprises.

Part Two presents a solution based on the author's experience.

No Time To Think

Zero-latency decision-making, based on real-time/near-time monitoring of business activity, is a clear trend in the business community. To enable this instant access/instant response to take place, a blending of real-time business intelligence with historical data is required, creating a tremendous dependency on continuous data quality.

Traditionally, large-scale applications such as ERP, CRM, Business Intelligence, Process Management, Middleware, and Data Integration have experienced difficulties with data quality problems. ERP usually serves as a data collection point where data errors can be introduced into enterprise data assets. CRM tracks an ever-changing, ever-moving target of sales and customers, where poor data quality can result in missed sales opportunities as well as upset customers. Business Intelligence can only provide information as accurate as the source data. Business Process Management is only effective when the data exchanged is accurate. Middleware and Data Integration merely move faulty data from point A to point B. While an implementation may incorporate data cleansing as part of a project, ongoing monitoring of data quality after handover is currently not a typical IT assignment.

Zero-latency decision-making that does not pay close attention to data quality is likely to fail. Unfortunately, the traditional approach to data quality is simply not robust enough to meet these new demands. Data quality solutions have traditionally clustered around name standardization, address hygiene, and demographic accuracy. While these properties of data quality are important, they are practically useless in a near zero-latency implementation. Traditional data quality tools are inherently batch-oriented and single-shot endeavors, whereas the current need is for a transactional, real-time solution. Even if the technological hurdles for real-time name standardization, address cleansing, and demographic assignment could be overcome, a framework has not existed for measuring the impact of these activities. Even if they could be cleaned up, the same mistakes would continue to occur when data is added or manipulated, requiring even more cleanup.

In fact, data quality goes far beyond data cleansing. Data quality indicates how well enterprise data matches up with the real world at any given time. If decision-making is based upon poor quality data, then by definition there is no ability to make accurate decisions. Luckily, there are several indicators for data quality that can be used to define and measure the state of data quality in the enterprise.

Defining Data Quality

Data Quality can be broken into the following eight categories:

  • Domains
  • Definitions
  • Completeness
  • Validity
  • Business rules
  • Structural integrity
  • Transformations
  • Data flows

Domains

Domains describe the ranges and types of values present in a data set. The typical errors that can occur relating to domains are:

Unexpected domain values. The documentation for a system indicates that the values for a column will be (A,B,C), but the data actually contains (A,B,C,d,e,f). This can lead to a variety of fatal problems.

Cardinality. Cardinality indicates the number of unique values found within a data set. For a primary key, the cardinality is expected to be equal to the total number of records, while for a Yes/No field the cardinality is expected to be two.

Uniqueness. The degree of uniqueness in the data can point to data quality problems. A field that is 98% unique may indicate "garbage" in a primary key field.

Constants. Constants indicate that the same value is present in every record. Applications tend to overlook constants (since they know what's supposed to be there), thus creating integrity problems in downstream activities. A change in a constant usually indicates a change in program logic from an upstream data producer.

Outliers. Some data may have completely unexpected values, such as "Number of family members" = -3. Outliers are notorious for crashing unsuspecting applications.

Length. This refers to the size of the data expected. Small changes, such as moving from an 8-digit to a 6-digit identifier, can break applications throughout the enterprise.

Precision. Rounding and truncation errors are often introduced during data movement or access.

Scale. Is the data expressed as a percentage, a factor, or a time period? Is it possible to compare 100 and 1.00 as percentages?

Internationalization. Data may have unexpected postal codes, time, or date formats.
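To make these domain checks concrete, here is a minimal profiling sketch in Python using the pandas library. The data set, the column names, and the expected domain are illustrative assumptions, not taken from any particular system.

```python
import pandas as pd

# Illustrative data set; column names and values are assumptions for this sketch.
df = pd.DataFrame({
    "customer_id": [1001, 1002, 1003, 1003],
    "status":      ["A", "B", "C", "d"],          # documented domain is (A, B, C)
    "family_size": [2, 4, -3, 1],                 # -3 is an outlier
    "part_code":   ["ABC12345", "ABD12346", "ABE1234", "ABF12347"],  # expected length 8
})

EXPECTED_STATUS = {"A", "B", "C"}

# Unexpected domain values: anything outside the documented set.
unexpected = set(df["status"].unique()) - EXPECTED_STATUS
print("Unexpected status values:", unexpected)

# Cardinality and uniqueness: a primary key should be 100% unique.
cardinality = df["customer_id"].nunique()
uniqueness = cardinality / len(df)
print(f"customer_id cardinality={cardinality}, uniqueness={uniqueness:.0%}")

# Constants: a column with cardinality 1 holds the same value in every record.
constants = [c for c in df.columns if df[c].nunique() == 1]
print("Constant columns:", constants)

# Outliers: values outside a plausible range.
print("Outlier family sizes:\n", df.loc[~df["family_size"].between(0, 20), "family_size"])

# Length: identifiers that drift from the expected size break downstream parsers.
print("part_code values with unexpected length:\n",
      df.loc[df["part_code"].str.len() != 8, "part_code"])
```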

Definitions

Definitions indicate how entities are referenced throughout the enterprise. While simple words such as "Revenue" can have vastly different meanings across sales, marketing, manufacturing, and finance, even more subtle are the possible differences between Employee->Id and Invoice->Id and Invoice->Employee_Id. Definition problems are broken down into these categories:

Synonyms. Whether or not the same entities are actually named the same. The fields EMP_ID, EMPID, and EM01 may or may not all actually refer to the same type of data.

Homonyms. These indicate fields that are spelled the same, but really aren't the same. The common variable name "Id" can mean a lot of different things in different contexts.

Relationships. Just because a field is named FK_INVOICE doesn't mean that it's really a foreign key to the invoice file.
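One practical, if rough, way to test for synonyms is to compare the populations of values behind the field names rather than trusting the names themselves. The sketch below is an illustrative heuristic only; the field names EMP_ID, EMPID, and EM01 come from the example above, and the sample values are assumptions.

```python
import pandas as pd

# Three illustrative extracts; the values are assumptions for the sketch.
hr = pd.DataFrame({"EMP_ID": ["E001", "E002", "E003", "E004"]})
payroll = pd.DataFrame({"EMPID": ["E002", "E003", "E004", "E005"]})
badges = pd.DataFrame({"EM01": ["B17", "B18", "B19", "B20"]})

def value_overlap(a: pd.Series, b: pd.Series) -> float:
    """Fraction of distinct values the two columns share (Jaccard similarity)."""
    sa, sb = set(a.dropna()), set(b.dropna())
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

# High overlap suggests the fields are true synonyms; low overlap suggests a
# homonym or an unrelated field, despite what the names imply.
print("EMP_ID vs EMPID overlap:", value_overlap(hr["EMP_ID"], payroll["EMPID"]))
print("EMP_ID vs EM01  overlap:", value_overlap(hr["EMP_ID"], badges["EM01"]))
```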

Completeness

Completeness indicates whether or not all of the data is actually present. While this would seem elementary, missing fields and values are perhaps the most common data quality problem. Completeness examines these areas:

Integrity. Does the actual data match our description of the data? In other words, is our meta-data actually accurate? An unused field in a COBOL file is a good example of an integrity problem.

Accuracy. It is often necessary to examine the degree of agreement between the data values and a source that is assumed to be correct. Comparing rollup tables to the actual sums generated from transactions often leads to unexpected results. Likewise, many data sources can be compared to external third party sources. Address matching and cleansing, name standardization, and demographic matching are all examples of accuracy checks.

Values Present. This refers to the degree to which values are present in the attributes that require them. If 50% of our customer file does not have an email address, then an email marketing campaign is probably not the desired course to reach our customers.

Reliability. Can you rely on the data given its context? For example, a zip code should match city and state.

Redundancy. Are there duplicates or near duplicates in the data?

Consistency. Are there conflicts within the data? Is the same invoice number referenced with two different amounts?
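The sketch below illustrates a few of these completeness checks, namely values present, redundancy, and reliability, in Python with pandas. The customer table and its column names are assumptions made for the example.

```python
import pandas as pd

# Illustrative customer extract; column names and values are assumptions.
customers = pd.DataFrame({
    "cust_id": [1, 2, 3, 3, 4],
    "email":   ["a@x.com", None, "c@x.com", "c@x.com", None],
    "zip":     ["27601", "27601", "90210", "90210", "27601"],
    "city":    ["Raleigh", "Raleigh", "Beverly Hills", "Beverly Hills", "Durham"],
})

# Values present: what fraction of records carry the attributes they need?
email_fill_rate = customers["email"].notna().mean()
print(f"Email present in {email_fill_rate:.0%} of records")

# Redundancy: duplicate or near-duplicate records.
print("Duplicate records:\n", customers[customers.duplicated(keep=False)])

# Reliability/consistency: a zip code should map to a single city; more than one
# city for the same zip code is a conflict within the data.
zip_cities = customers.groupby("zip")["city"].nunique()
print("Zip codes mapped to more than one city:\n", zip_cities[zip_cities > 1])
```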

Validity

Just like it sounds, validity indicates whether or not the data is valid. Surprisingly enough, many enterprise databases are rife with data that cannot be correct. Validity checks consist of:

Acceptability. Does the data pass a known set of acceptability tests? For instance, a part number could be required to be a 7-character alphanumeric string consisting of two letters followed by five digits.

Reliability. This refers to the probability that the data will perform the indicated function for a given time period. For example, how reliable are the traffic sensors during rush hour? Consider the businessman who bought a hotdog cart, only to find out that the "revenue per hour" was measured during a once-a-year business convention. (True story!)

Anomalies. Are there "facts" in the data that are clearly impossible? Can a lawyer really bill 48 hours in a day? Can a vehicle be in both California and North Carolina at the same time?

Timeliness. Is the data up to date? A real-time measure of a data feed that last completed successfully three months ago isn't very useful.
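The following sketch shows how acceptability, anomaly, and timeliness checks might be expressed in Python with pandas. The part-number format, the column names, and the dates are assumptions chosen to match the examples above.

```python
import pandas as pd

# Illustrative records; the part-number format and column names are assumptions.
parts = pd.Series(["AB12345", "CD9876", "7X12345", "EF00001"])
timesheets = pd.DataFrame({"lawyer": ["Smith", "Jones"], "hours_billed": [9.5, 48.0]})

# Acceptability: a part number is assumed to be two letters followed by five digits.
invalid_parts = parts[~parts.str.match(r"^[A-Za-z]{2}\d{5}$")]
print("Part numbers failing the format test:\n", invalid_parts)

# Anomalies: "facts" that are physically impossible.
impossible = timesheets[timesheets["hours_billed"] > 24]
print("Billing records exceeding 24 hours in a day:\n", impossible)

# Timeliness: how stale is the most recent successful load?
last_load = pd.Timestamp("2002-07-17")
age_days = (pd.Timestamp("2002-10-17") - last_load).days
print(f"Feed last completed {age_days} days ago")
```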

Business Rules

Business rule checks measure the degree of compliance between the actual data and what the producers and consumers of that data expect. Many business rule capture systems seek to trap the knowledge in the heads of people across an organization. From a data quality perspective, however, the important business rules are the ones that can be objectively measured. Business rule checks can be measured as:

Constraints. Does the data comply with a known set of constraints? A banking organization could have a business rule that all credit card customers are at least 21 years of age. Anything that can be described mathematically or algorithmically can become a constraint check.

Computational Rules. Closely related to constraints, computational rules check for agreement of values across records according to an equation or an algorithm. Computational rules check a calculation, for instance ensuring that an amount equals price times quantity.

Comparisons. These business rules check for expected relations between fields within a record. For instance, ShipDate should never be less than OrderDate.

Conditions. These rules indicate the if-then logic of certain data objects. For instance, if an employee is of Level 2, then he or she should receive a raise of 5%, unless the employee's total pay is greater than $55,000/year.

Functional Dependencies. These rules measure the invariants across columns of data. For example, for every customer number the data should always contain the same customer name.
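Several of these business rules, computational rules, comparisons, and functional dependencies, can be expressed directly as checks over the data, as in the sketch below. The order table, its column names, and the specific rules are assumptions made for illustration.

```python
import pandas as pd

# Illustrative order lines; column names and values are assumptions for the sketch.
orders = pd.DataFrame({
    "order_id":      [1, 2, 3],
    "price":         [10.0, 5.0, 7.5],
    "quantity":      [3, 2, 4],
    "amount":        [30.0, 10.0, 28.0],   # third line violates the rule
    "order_date":    pd.to_datetime(["2002-09-01", "2002-09-02", "2002-09-03"]),
    "ship_date":     pd.to_datetime(["2002-09-03", "2002-09-01", "2002-09-05"]),
    "customer_no":   [101, 102, 101],
    "customer_name": ["Acme", "Globex", "Acme Corp"],  # conflicting names for 101
})

# Computational rule: amount should equal price times quantity.
bad_amounts = orders[orders["price"] * orders["quantity"] != orders["amount"]]
print("Computational rule violations:\n", bad_amounts[["order_id", "amount"]])

# Comparison: the ship date should never precede the order date.
bad_dates = orders[orders["ship_date"] < orders["order_date"]]
print("Shipments dated before their orders:\n", bad_dates[["order_id"]])

# Functional dependency: each customer number should map to exactly one name.
fd = orders.groupby("customer_no")["customer_name"].nunique()
print("Customer numbers with more than one name:\n", fd[fd > 1])
```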

Structural Integrity

Structural integrity checks examine whether the data is complete on a macro level. These checks ensure that when you take the data as a whole, you're getting the correct results. Structural integrity checks are made up of:

Referential Integrity. If we expect a one-to-one relationship between two data elements, does the existence of element A always imply the existence of B? If we expect a one-to-many relationship, is there always at least one B for every A? Many applications and ETL jobs turn off referential integrity checks in order to speed the loading of databases (or use some other logic to check referential integrity). As a result, the data that ends up in a database that supposedly provides referential integrity can often be incorrect.

Linkages. Are there parts in our invoice file that don't exist in the parts catalog? Are there orders that have been marked as delivered that can't be matched to an invoice?

Primary Keys. Are primary keys unique?

Cardinality Checks. Certain relationships indicate that the cardinality of columns will be equivalent. For instance, the cardinality of a lookup field in a master table should match the cardinality of the lookup field in the lookup table. This can be thought of as the macro view of domain analysis.
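The sketch below illustrates referential integrity, primary key, and cardinality checks over an assumed invoice file and parts catalog in Python with pandas; the table and column names are illustrative only.

```python
import pandas as pd

# Illustrative invoice lines and parts catalog; names and values are assumptions.
invoice_lines = pd.DataFrame({"invoice_no": [1, 1, 2, 3],
                              "part_no":    ["P1", "P2", "P9", "P3"]})
parts_catalog = pd.DataFrame({"part_no": ["P1", "P2", "P3"],
                              "descr":   ["bolt", "nut", "washer"]})

# Referential integrity / linkages: invoice lines whose part is not in the catalog.
joined = invoice_lines.merge(parts_catalog, on="part_no", how="left", indicator=True)
orphans = joined[joined["_merge"] == "left_only"]
print("Invoice lines referencing unknown parts:\n", orphans[["invoice_no", "part_no"]])

# Primary keys: the catalog's part_no should be unique.
dupe_keys = parts_catalog[parts_catalog["part_no"].duplicated(keep=False)]
print("Duplicate primary keys in the catalog:\n", dupe_keys)

# Cardinality check: the distinct parts referenced should not exceed those defined.
print("Distinct parts referenced:", invoice_lines["part_no"].nunique(),
      "| distinct parts defined:", parts_catalog["part_no"].nunique())
```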

Transformations

Transformation checks examine the impact of data transformations as data moves from system to system. The logic for a data transformation may be faulty, but the only way to check it is to compare the source and target data sets and verify that the transformation took place correctly. Transformation checks include:

Filtering. This verifies that the records transferred are meant to be transferred. For instance, a data warehouse load might require the transfer of last week's records.

Merging. Some transformations require that multiple data sources are merged together to form a single target. For instance, several customer files might be merged together. Merging checks indicate whether or not the whole is equal to the sum of its parts.

Transformation Maps. These checks verify that expected simple transformations have taken place, such as A -> 1, B -> 2, etc. A transformation function is applied to both the source and target records to verify that the expected results are generated.

Computations. Some fields in targets are the result of computations on the source. For example, total sales could be the result of sales1 + sales2. This check calculates the expected result from a source and compares it to the result in the target.
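The sketch below compares an assumed source extract against its target to verify a transformation map and a computed column; the map, the column names, and the sample values are all assumptions made for illustration.

```python
import pandas as pd

# Illustrative source and target extracts; names and values are assumptions.
source = pd.DataFrame({"key": [1, 2, 3], "grade": ["A", "B", "C"],
                       "sales1": [10, 20, 30], "sales2": [1, 2, 3]})
target = pd.DataFrame({"key": [1, 2, 3], "grade_code": [1, 2, 9],
                       "total_sales": [11, 22, 34]})

GRADE_MAP = {"A": 1, "B": 2, "C": 3}  # the expected transformation map

joined = source.merge(target, on="key", how="outer", indicator=True)

# Filtering / merging: every source record should have exactly one target record.
print("Records missing on one side:\n", joined[joined["_merge"] != "both"])

# Transformation map: apply the expected map to the source and compare to the target.
map_errors = joined[joined["grade"].map(GRADE_MAP) != joined["grade_code"]]
print("Transformation map violations:\n", map_errors[["key", "grade", "grade_code"]])

# Computation: total_sales should equal sales1 + sales2.
calc_errors = joined[(joined["sales1"] + joined["sales2"]) != joined["total_sales"]]
print("Computation violations:\n", calc_errors[["key", "total_sales"]])
```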

Data Flows

These checks are concerned with the aggregate results of movements of data from sources to targets. Many data quality problems can be traced to incorrect loads, missed loads, or system failures that go unnoticed. Automated data transfers that are not immediately verified can populate target systems with faulty data in an undetected manner. Charting the history of these checks is invaluable in looking at performance issues. Data flow issues include:

Missing Records Across Systems. This set of checks determines whether records that exist in source systems actually make it into the target systems. Why is there a customer in the customer service system that is unknown to the order entry system?

Record Counts. A very simple set of checks, record counts merely verify that the number of records produced in the target system is as expected. Duplicate primary keys can often result in fewer records in the target than expected.

Checksums. When a one-for-one transfer of a column or table is expected, simple checksums can verify that data is written as expected.

Timestamps. Is the timestamp of the data in the source loading area as expected? Have we previously loaded a file with this timestamp?

Process Time. Is the transfer taking an inordinate amount of time to complete? Promptly discovering that a transfer is slowing down incrementally each night allows resources to be managed and fires to be prevented before they start.
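The following sketch illustrates record count, missing record, checksum, and process time checks for a simple data flow. The tables, the checksum scheme, and the baseline threshold are assumptions made for illustration.

```python
import time
import zlib
import pandas as pd

# Illustrative source and target extracts; names and thresholds are assumptions.
source = pd.DataFrame({"order_id": [1, 2, 3, 4], "amount": [10.0, 20.0, 30.0, 40.0]})
target = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

# Record counts: the simplest data flow check of all.
print(f"Source rows: {len(source)}, target rows: {len(target)}")

# Missing records across systems: which source keys never arrived?
missing = set(source["order_id"]) - set(target["order_id"])
print("Order ids missing from the target:", missing)

# Checksums: for a one-for-one column transfer, the checksums should match.
def column_checksum(s: pd.Series) -> int:
    return zlib.crc32("|".join(map(str, s.tolist())).encode())

print("Checksums match:",
      column_checksum(source["amount"]) == column_checksum(target["amount"]))

# Process time: flag a transfer that is running well over its historical baseline.
BASELINE_SECONDS = 600          # assumed nightly norm
start = time.monotonic()
# ... the load itself would run here ...
elapsed = time.monotonic() - start
if elapsed > BASELINE_SECONDS * 1.5:
    print("Transfer is running 50% over its baseline; investigate before it fails.")
```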

Summary

Each of these checks is vital in fulfilling the promise of zero-latency analytics. Of the 40 checks presented above, however, only one, Accuracy, is handled by traditional name and address data cleansing tools. That means that even when a customer address file has been cleaned for a CRM system, there is no assurance that the data will remain accurate as the system is used and new data is added.

So what does data quality really mean? Data quality provides the confidence that decisions are based upon accurate data. While it is impossible to guarantee that all data is 100% accurate 100% of the time, vigorously checking data on a periodic basis to verify compliance with business rules is the best way to protect an investment in enterprise data. It's one thing to provide information in near real-time, but maintaining the integrity of that information in near real-time is another story. Unless the information is based upon high quality data, the result will simply be poor decisions—made quickly.

This concludes Part One of a two-part article on the importance of data quality management.

Part Two will discuss the author's experience in solving this problem.

About the Author

Greg Leman is the CEO of Metagenix, Inc.

Metagenix, Inc. (www.metagenix.com) is a developer of data quality tools. The company builds solutions that allow organizations to monitor data quality throughout the enterprise. The company's latest product, MetaPure, is a state-of-the-art continuous data quality management (CDQM) tool. A new development in data quality analytics, MetaPure takes over where traditional data cleansing tools leave off, providing a real-time insurance policy against poor quality data in CRM, ERP, Business Intelligence, and Supply Chain Management applications across the enterprise.

 