The Cost Of Poor Quality Data
No matter how well an enterprise implements a CRM, ERP, SCM, Business Intelligence, or Data Warehouse project, poor data quality can destroy its utility and cost real dollars. According to recent industry studies:
- Poor data quality
costs businesses $611 billion per year in the United States alone (TDWI).
- 75% of businesses
have experienced significant problems due to faulty data (PWC).
- Only 33% of
businesses felt confident in the quality of their company's data.
Now imagine the downstream impact of this same poor quality data fueling business decisions. Not only will bad decision be made, but when employees and managers stop trusting their business intelligence applications, the entire investment in these costly systems can be jeopardized. It's quite simple: businesses that ignore data quality do so at their own peril and expense.
This,
Part One of a two-part article, defines the problem and its importance to enterprises.
Part
Two presents a solution based on the author's experience.
No Time To Think
Zero-latency decision-making, based on real-time/near-time monitoring of business activity, is a clear trend in the business community. To enable this instant access/instant response to take place, a blending of real-time business intelligence with historical data is required, creating a tremendous dependency on continuous data quality.
Traditionally, large-scale applications such as ERP, CRM, Business Intelligence, Process Management, Middleware, and Data Integration have experienced difficulties with data quality problems. ERP usually serves as a data collection point where data errors can be introduced into enterprise data assets. CRM tracks an ever-changing, ever-moving target of sales and customers, where poor data quality can result in missed sales opportunities as well as upset customers. Business Intelligence can only provide information as accurate as the source data. Business Process Management is only effective when the data exchanged is accurate. Middleware and Data Integration merely move faulty data from point A to point B. While an implementation may incorporate data cleansing as part of a project, ongoing monitoring of data quality after handover is currently not a typical IT assignment.
Zero-latency decision-making that does not pay close attention to data quality is likely to fail. Unfortunately, the traditional approach to data quality is simply not robust enough to meet these new demands. Data quality solutions have traditionally clustered around name standardization, address hygiene, and demographic accuracy. While these properties of data quality are important, they are practically useless in a near zero-latency implementation. Traditional data quality tools are inherently batch-oriented and single-shot endeavors, whereas the current need is for a transactional, real-time solution. Even if the technological hurdles for real-time name standardization, address cleansing, and demographic assignment could be overcome, a framework has not existed for measuring the impact of these activities. Even if they could be cleaned up, the same mistakes would continue to occur when data is added or manipulated, requiring even more cleanup.
In fact, data quality goes far beyond data cleansing. Data quality indicates how well enterprise data matches up with the real world at any given time. If decision-making is based upon poor quality data, then by definition there is no ability to make accurate decisions. Luckily, there are several indicators for data quality that can be used to define and measure the state of data quality in the enterprise.
Defining Data Quality
Data
Quality can be broken into the following eight categories:
- Domains
- Definitions
- Completeness
- Validity
- Business rules
- Structural integrity
- Transformations
- Data flows
Domains
Domains
describe the ranges and types of values present in a data set. The typical errors
that can occur relating to domains are:
Unexpected
domain values. The documentation for a system indicates that the values
for a column will be (A,B,C), but the data actually contains (A,B,C,d,e,f).
This can lead to a variety of fatal problems.
Cardinality.
Cardinality indicates the number of unique values found within a data set.
For a primary key, the cardinality is expected to be equal to the total number
of records, while for a Yes/No field the cardinality is expected to be two.
Uniqueness.
The degree of uniqueness in the data can point to data quality problems. A
field that is 98% unique may indicate "garbage" in a primary key field.
Constants.
Constants indicate that the same value is present in every record. Applications
tend to overlook constants (since they know what's supposed to be there),
thus creating integrity problems in downstream activities. A change in a constant
usually indicates a change in program logic from an upstream data producer.
Outliers.
Some data may have completely unexpected values, such as "Number of family
members" = -3. Outliers are famous for generating system crashes by unsuspecting
applications.
Length.
This refers to the size of the data expected. Small changes, such as moving
from an 8-digit to a 6-digit identifier, can break applications throughout
the enterprise.
Precision.
Rounding and truncation errors are often introduced during data movement or
access.
Scale.
Is the data expressed as a percentage, a factor, or a time period? Is it possible
to compare 100 and 1.00 as percentages?
Internationalization.
Data may have unexpected postal codes, time, or date formats.
Definitions
Definitions indicate how entities are referenced throughout the enterprise. While simple words such as "Revenue" can have vastly different meanings across sales, marketing, manufacturing, and finance, even more subtle are the possible differences between Employee->Id and Invoice->Id and Invoice->Employee_Id. Definition problems are broken down into these categories:
Synonyms.
Whether or not the same entities are actually named the same. The fields EMP_ID,
EMPID, and EM01 may or may not all actually refer to the same type of data.
Homonyms.
These indicate fields that are spelled the same, but really aren't the same.
The common variable name "Id" can mean a lot of different things in different
contexts.
Relationships.
Just because a field is named FK_INVOICE doesn't mean that it's really a foreign
key to the invoice file.
Completeness
Completeness indicates whether or not all of the data is actually present. While this would seem elementary, missing fields and values are perhaps the most common data quality problem. Completeness examines these areas:
Integrity.
Does the actual data match our description of the data? In other words, is
our meta-data actually accurate? An unused field in a Cobol file is a good
example of an integrity problem.
Accuracy.
It is often necessary to examine the degree of agreement between the data
values and a source that is assumed to be correct. Comparing rollup tables
to the actual sums generated from transactions often leads to unexpected results.
Likewise, many data sources can be compared to external third party sources.
Address matching and cleansing, name standardization, and demographic matching
are all examples of accuracy checks.
Values Present.
This refers to the degree to which values are present in the attributes that
require them. If 50% of our customer file does not have an email address,
then an email marketing campaign is probably not the desired course to reach
our customers.
Reliability.
Can you rely on the data given its context? For example, a zip code should
match city and state.
Redundancy.
Are there duplicates or near duplicates in the data?
Consistency.
Are there conflicts within the data? Is the same invoice number referenced
with two different amounts?
Validity
Just like it sounds, validity indicates whether or not the data is valid. Surprisingly enough, many enterprise databases are rife with data that cannot be correct. Validity checks consist of:
Acceptability.
Does the data pass a known set of acceptability tests? For instance, a part
number could consist of a 7-digit, alphanumeric string starting with two alpha
digits and five numbers.
Reliability.
This refers to the probability that the data will perform the indicated function
for a given time period. For example, how reliable are the traffic sensors
during rush hour? Consider the businessman who bought a hotdog cart, only
to find out that the "revenue per hour" was measured during a once a year
business convention. (True story!)
Anomalies.
Are there "facts" in the data that are clearly impossible? Can a lawyer really
bill 48 hours in a day? Can a vehicle be in both California and North Carolina
at the same time?
Timeliness.
Is the data up to date? A real time measure of a data feed that was most recently
successfully completed three months ago isn't very useful.
Business Rules
Business rule checks measure the degree of compliance between the actual data and what the producers and consumers of that data expect. Many business rule capture systems seek to trap the knowledge in the heads of people across an organization. From a data quality perspective, however, the important business rules are the ones that can be objectively measured. Business rule checks can be measured as:
Constraints.
Does the data comply with a known set of constraints? A banking organization
could have a business rule that all credit card customers are at least 21
years of age. Anything that can be described mathematically or algorithmically
can become a constraint check.
Computational
Rules. Closely related to constraints, computational rules check for agreement
of values across records according to an equation or an algorithm. Computational
rules check a calculation, for instance ensuring that an amount equals price
times cost.
Comparisons.
These business rules check for expected relations between fields within a
record. For instance, ShipDate should never be less than OrderDate.
Conditions.
These rules indicate the if-then logic of certain data objects. For instance,
if an employee is of Level 2, then he or she should receive a raise of 5%,
unless the employee's total pay is greater than $55,000/year.
Functional
Dependencies. These rules measure the invariants across columns of data.
For example, for every customer number the data should always contain the
same customer name.
Structural Integrity
Structural integrity checks examine whether the data is complete on a macro level. These checks ensure that when you take the data as a whole, you're getting the correct results. Structural integrity checks are made up of:
Referential
Integrity. If we expect a one-to-one relationship between two data elements,
does the existence of element A always imply the existence of B? If we expect
a one-to-many relationship, is there always at least one B for every A? Many
applications and ETL jobs turn off referential integrity checks in order to
speed the loading of databases (or use some other logic to check referential
integrity). As a result, the data that ends up in a database that supposedly
provides referential integrity can often be incorrect.
Linkages.
Are there parts in our invoice file that don't exist in the parts catalog?
Are there orders that have been marked as delivered than can't be matched
to an invoice?
Primary Keys.
Are primary keys unique?
Cardinality
Checks. Certain relationships indicate that the cardinality of columns
will be equivalent. For instance, the cardinality of a lookup field in a master
table should match the cardinality of the lookup field in the lookup table.
This can be thought of as the macro view of domain analysis.
Transformations
Transformation checks examine the impact of data transformations as data moves from system to system. The logic for a data transformation may be faulty, but the only way to check it is to compare the source and target data sets and verify that the transformation took place correctly. Transformation checks include:
Filtering.
This verifies that the records transferred are meant to be transferred. For
instance, a data warehouse load might require the transfer of last week's
records.
Merging.
Some transformations require that multiple data sources are merged together
to form a single target. For instance, several customer files might be merged
together. Merging checks indicate whether or not the whole is equal to the
sum of its parts.
Transformation
Maps. These checks verify that expected simple transformations have taken
place, such as A -> 1, B -> 2, etc. A transformation function is applied to
both the source and target records to verify that the expected results are
generated.
Computations.
Some fields in targets are the result of computations on the source. For example,
total sales could be the result of sales1 + sales2. This check calculates
the expected result from a source and compares it to the result in the target.
Data Flows
These checks are concerned with the aggregate results of movements of data from sources to targets. Many data quality problems can be traced to incorrect loads, missed loads, or system failures that go unnoticed. Automated data transfers that are not immediately verified can populate target systems with faulty data in an undetected manner. Charting the history of these checks is invaluable in looking at performance issues. Data flow issues include:
Missing Records
Across Systems. This set of checks determines whether records that exist
in source systems actually make it into the target systems. Why is there a
customer in the customer service system that is unknown to the order entry
system?
Record Counts.
A very simple set of checks, record counts merely verify that the number of
records produced in the target system is as expected. Duplicate primary keys
can often result in fewer records in the target than expected.
Checksums.
When a one-for-one transfer of a column or table is expected, simple checksums
can verify that data is written as expected.
Timestamps.
Is the timestamp of the data in the source loading area as expected? Have
we previously loaded a file with this timestamp?
Process Time.
Is the transfer taking an inordinate amount of time to complete? Promptly
finding out that a transfer is slowing down incrementally each night allows
for management of the resources and fire prevention.
Summary
Each of these checks is vital in fulfilling the promise of zero-latency analytics. Of the 40 categories presented above, however, only one category, Accuracy, is handled by traditional name and address data cleansing tools. That means that just because a customer address file has been cleaned for a CRM system, it does not provide an assurance that the data will remain accurate as the system is used and new data is added.
So what does data quality really mean? Data quality provides the confidence that decisions are based upon accurate data. While it is impossible to guarantee that all data is 100% accurate 100% of the time, vigorously checking data on a periodic basis to verify compliance with business rules is the best way to protect an investment in enterprise data. It's one thing to provide information in near real-time, but maintaining the integrity of that information in near real-time is another story. Unless the information is based upon high quality data, the result will simply be poor decisions—made quickly.
This
concludes Part One of a two-part article on the importance of data quality management.
Part
Two will discuss the author's experience in solving this problem.
About
the Author
Greg Leman is the CEO of Metagenix, Inc.
Metagenix,
Inc. (www.metagenix.com)
is a developer of data quality tools. The company builds solutions that allow
organizations to monitor data quality throughout the enterprise. The company's
latest product, MetaPure, is a state-of-the-art continuous data quality management
(CDQM) tool. A new development in data quality analytics, MetaPure takes over
where traditional data cleansing tools leave off, providing a real-time insurance
policy against poor quality data in CRM, ERP, Business Intelligence, and Supply
Chain Management applications across the enterprise