7 Sources of Poor Data Quality
By William McKnight, partner, Information Management, Lucidity Consulting Group
In recent years, corporate scandals, regulatory
changes, and the collapse of major financial
institutions have brought much-warranted attention
to the quality of enterprise information. We have
seen the rise and assimilation of tools and
methodologies that promise to make data cleaner and
more complete. Best practices have been developed
and discussed in print and online. Data quality is
no longer the domain of just the data warehouse. It
is accepted as an enterprise responsibility. If we
have the tools, experiences, and best practices,
why, then, do we continue to struggle with the
problem of data quality?
The answer lies in the difficulty of truly
understanding what quality data is and in
quantifying the cost of bad data. It is not always
clear why or how to correct the problem, because
poor data quality presents itself in so many ways.
We plug one hole in our system, only to find
more problems elsewhere. If we can better understand
the underlying sources of quality issues, then we
can develop a plan of action to address the problem
that is both proactive and strategic.
Each instance of a quality issue presents challenges
in both identifying where problems exist and in
quantifying the extent of the problems. Quantifying
the issues is important in order to determine where
our efforts should be focused first. A large number
of missing email addresses may well be alarming but
could present little impact if there is no process
or plan for communicating by email. It is imperative
to understand the business requirements and to match
them against the assessment of the problem at hand.
Consider the following seven sources of data quality
issues.
1. Entry quality: Did the information enter
the system correctly at the origin?
2. Process quality: Was the integrity of the
information maintained during processing through the
system?
3. Identification quality: Are two similar
objects identified correctly to be the same or
different?
4. Integration quality: Is all the known
information about an object integrated to the point
of providing an accurate representation of the
object?
5. Usage quality: Is the information used and
interpreted correctly at the point of access?
6. Aging quality: Has enough time passed that
the validity of the information can no longer be
trusted?
7. Organizational quality: Can the same
information be reconciled between two systems based
on the way the organization constructs and views the
data?
A plan of action must account for each of these
sources of error. Each case differs in its ease of
detection and ease of correction. Examining each
source shows that both the associated costs and the
difficulty of addressing the problem vary widely.
Entry Quality
Entry quality is probably the easiest problem to
identify but is often the most difficult to correct.
Entry issues are usually caused by a person entering
data into a system. The problem may be a typo or a
willful decision, such as providing a dummy phone
number or address. Identifying these outliers or
missing data is easily accomplished with profiling
tools or simple queries.
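As a rough illustration of the kind of simple
profiling described above, the following Python
sketch counts missing, dummy, and malformed contact
values. The sample records, field names, and the
"all digits identical" rule for dummy phone numbers
are assumptions made for this example, not rules
taken from any particular profiling tool.

```python
import re

# Hypothetical sample records; field names are assumptions for illustration.
records = [
    {"id": 1, "phone": "555-867-5309", "email": "ann@example.com"},
    {"id": 2, "phone": "000-000-0000", "email": ""},
    {"id": 3, "phone": None, "email": "bob@example"},
    {"id": 4, "phone": "111-111-1111", "email": "cy@example.org"},
]

# Deliberately loose pattern: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_dummy_phone(phone):
    """Flag phones whose digits are all identical (e.g. 000-000-0000)."""
    digits = re.sub(r"\D", "", phone)
    return len(digits) > 0 and len(set(digits)) == 1


def profile(rows):
    """Count missing, dummy, and malformed values per field."""
    counts = {"phone_missing": 0, "phone_dummy": 0,
              "email_missing": 0, "email_malformed": 0}
    for row in rows:
        phone = row.get("phone")
        if not phone:
            counts["phone_missing"] += 1
        elif is_dummy_phone(phone):
            counts["phone_dummy"] += 1
        email = row.get("email")
        if not email:
            counts["email_missing"] += 1
        elif not EMAIL_RE.match(email):
            counts["email_malformed"] += 1
    return counts


report = profile(records)
print(report)
```

Counts like these only locate the problem. Whether
they matter still depends on how the business uses
each field.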
The cost of entry problems depends on the use. If a
phone number or email address is used only for
informational purposes, then the cost of its absence
is probably low. If, instead, a phone number is used
for marketing and driving new sales, then the
opportunity cost may be significant when a large
percentage of records is affected.
Addressing data quality at the source can be
difficult. If data was sourced from a third party,
there is usually little the organization can do.
Likewise, applications that provide internal sources
of data might be old and too expensive to modify.
And there are few incentives for the clerks at the
point of entry to obtain, verify, and enter every
data point.
Process Quality
Process quality issues usually occur systematically
as data is moved through an organization. They may
result from a system crash, lost file, or any other
technical occurrence that results from integrated
systems. These issues are often difficult to
identify, especially if the data has undergone a
number of transformations on the way to its
destination.
Process quality can usually be remedied easily once
the source of the problem is identified. Proper
checks and quality control at each touchpoint along
the path can help ensure that problems are rooted
out, but these checks are often absent in legacy
processes.
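One possible form of a touchpoint check is a
reconciliation between what one stage sent and what
the next stage received. The Python sketch below
compares row counts and an order-independent
checksum. The function names, the sample rows, and
the choice of SHA-256 are assumptions for
illustration only.

```python
import hashlib

MODULUS = 2 ** 256  # keeps the summed checksum at a fixed size


def fingerprint(rows, key_fields):
    """Return (row count, order-independent checksum) for a batch."""
    total = 0
    for row in rows:
        payload = "|".join(str(row[f]) for f in key_fields)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        # Summing the hashes makes the result independent of row order.
        total = (total + int(digest, 16)) % MODULUS
    return len(rows), total


def check_handoff(source_rows, target_rows, key_fields):
    """List the discrepancies between two stages of a data flow."""
    src_count, src_sum = fingerprint(source_rows, key_fields)
    tgt_count, tgt_sum = fingerprint(target_rows, key_fields)
    problems = []
    if src_count != tgt_count:
        problems.append(
            f"row count mismatch: {src_count} sent, {tgt_count} received")
    if src_sum != tgt_sum:
        problems.append("checksum mismatch: content changed in transit")
    return problems


# Hypothetical batches at two touchpoints.
extracted = [{"id": 1, "amount": "10.00"}, {"id": 2, "amount": "25.50"}]
loaded_ok = [{"id": 2, "amount": "25.50"}, {"id": 1, "amount": "10.00"}]
loaded_bad = [{"id": 1, "amount": "10.00"}]

print(check_handoff(extracted, loaded_ok, ["id", "amount"]))
print(check_handoff(extracted, loaded_bad, ["id", "amount"]))
```

Running a check like this at every hand-off narrows
the search to a single step when a discrepancy
appears.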
Identification Quality
Identification quality problems result from a
failure to recognize the relationship between two
objects. For example, two similar products with
different SKUs are incorrectly judged to be the
same.
Identification quality may have significant
associated costs, such as mailing the same household
more than once. Data quality processes can largely
eliminate this problem by matching records,
identifying duplicates, and placing a confidence
score on the similarity of records. Ambiguously
scored records can be reviewed and judged by a data
steward. Still, the results are never perfect, and
determining the proper business rules for matching
can involve trial and error.
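The scoring and review step might look something
like the following Python sketch, which uses the
standard library's difflib to score two records and
then sorts the score into "match", "review" (for a
data steward), or "distinct". The equal weighting of
name and address and the 0.9 and 0.7 thresholds are
assumptions. In practice they are exactly the
business rules that require trial and error.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = text.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def similarity(a, b):
    """Confidence score from 0.0 to 1.0 that two records are the same."""
    name = SequenceMatcher(
        None, normalize(a["name"]), normalize(b["name"])).ratio()
    addr = SequenceMatcher(
        None, normalize(a["address"]), normalize(b["address"])).ratio()
    return 0.5 * name + 0.5 * addr  # assumed equal weighting


def classify(score, match_at=0.9, review_at=0.7):
    """Assumed thresholds: tune these against reviewed samples."""
    if score >= match_at:
        return "match"
    if score >= review_at:
        return "review"  # ambiguous: route to a data steward
    return "distinct"


# Hypothetical records for illustration.
rec_a = {"name": "John Smith", "address": "12 Oak St."}
rec_b = {"name": "john smith", "address": "12 Oak St"}
rec_c = {"name": "Maria Lopez", "address": "900 Pine Avenue"}

print(classify(similarity(rec_a, rec_b)))
print(classify(similarity(rec_a, rec_c)))
```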
Integration Quality
Integration quality, or quality of completeness, can
present big challenges for large organizations.
Integration quality problems occur because
information is isolated by system or departmental
boundaries. It might be important for an auto claims
adjuster to know that a customer is also a
high-value life insurance customer, but if the auto
and life insurance systems are not integrated, that
information will not be available.
While the desire to have integrated information may
seem obvious, the reality is that it is not always
apparent. Business users who are accustomed to
working with one set of data may not be aware that
other data exists or may not understand its value.
Data governance programs that document and promote
enterprise data can facilitate the development of
data warehousing and master data management systems
to address integration issues.
MDM enables the process of identifying records from
multiple systems that refer to the same entity. The
records are then consolidated into a single master
record. The data warehouse allows the transactional
details related to that entity to be consolidated so
that its behaviors and relationships across systems
can be assessed and analyzed.
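To make the consolidation step concrete, here is a
minimal Python sketch that merges records already
matched to the same entity into one master record.
The rule used here (the most recently updated
non-empty value wins) is only one possible
survivorship rule, and the system names and fields
are hypothetical.

```python
from datetime import date


def build_master(records, fields):
    """Merge matched records: per field, keep the newest non-empty value."""
    master = {}
    for field in fields:
        candidates = [r for r in records if r.get(field)]
        if candidates:
            newest = max(candidates, key=lambda r: r["updated"])
            master[field] = newest[field]
        else:
            master[field] = None  # no system holds a value for this field
    # Record which systems contributed, for traceability.
    master["sources"] = sorted(r["system"] for r in records)
    return master


# Hypothetical records for the same customer in two systems.
auto = {"system": "auto", "updated": date(2008, 3, 1),
        "name": "J. Smith", "phone": "555-0100", "email": ""}
life = {"system": "life", "updated": date(2009, 1, 15),
        "name": "John Smith", "phone": "", "email": "jsmith@example.com"}

master = build_master([auto, life], ["name", "phone", "email"])
print(master)
```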
Usage Quality
Usage quality often presents itself when data
warehouse developers lack access to legacy source
documentation or subject matter experts. Without
adequate guidance, they are left to guess the
meaning and use of certain data elements. Another
scenario occurs in organizations where users are
given the tools to write their own queries or create
their own reports. Incorrect usage may be difficult
to detect and quantify in cost.
Thorough documentation, robust metadata, and user
training are helpful and should be built into any
new initiative, but gaining support for a
post-implementation metadata project can be
difficult. Again, this is where a data governance
program should be established and a grassroots
effort made to identify and document corporate
systems and data definitions. This metadata can be
injected into systems and processes as it becomes
part of the culture to do so. This may be more
effective and realistic than a big-bang approach to
metadata.
Aging Quality
The most challenging aspect of aging quality is
determining at which point the information is no
longer valid. Usually, such decisions are somewhat
arbitrary and vary by usage. For example,
maintaining a former customer's address for more
than five years is probably not useful. If customers
haven't been heard from in several years despite
marketing efforts, how can we be certain they still
live at the same address? At the same time,
maintaining customer address information for a
homeowner's insurance claim may be necessary and
even required by law. Such decisions need to be made
by the business owners and the rules should be
architected into the solution. Many MDM tools
provide a platform for implementing survivorship and
aging rules.
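An aging rule of this kind could be expressed as a
small lookup of retention windows by usage, as in
the Python sketch below. The five-year marketing
window echoes the example above. The "claims" entry,
the function name, and the field names are
assumptions, and the real values would come from the
business owners.

```python
from datetime import date

# Hypothetical retention windows in years, keyed by how the data is used.
# None means "keep indefinitely" (for example, a legal requirement).
MAX_AGE_YEARS = {"marketing": 5, "claims": None}


def is_stale(last_confirmed, usage, today):
    """Return True if the value is too old to trust for this usage."""
    limit = MAX_AGE_YEARS[usage]
    if limit is None:
        return False
    age_years = (today - last_confirmed).days / 365.25
    return age_years > limit


as_of = date(2009, 6, 1)
print(is_stale(date(2003, 1, 1), "marketing", as_of))
print(is_stale(date(2003, 1, 1), "claims", as_of))
print(is_stale(date(2007, 1, 1), "marketing", as_of))
```

Keeping the windows in one table makes the business
decision visible and easy to change without touching
the logic.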
Organizational Quality
Organizational quality, like entry quality, is easy
to diagnose and sometimes very difficult to address.
It shares much in common with process quality and
integration quality but is less a technical problem
than a systemic one that occurs in large
organizations. Organizational issues occur when, for
example, marketing tries to "tie" their calculations
to finance. Financial reporting systems generally
take an account view of information, which may be
very different from how the company markets the
product or tracks its customers. These business
rules may be buried in many layers of code
throughout multiple systems. However, the biggest
challenge to reconciliation is getting the various
departments to agree that their A equals the other's
B equals the other's C plus D.
A Strategic Approach
The first step to developing a data strategy is to
identify where quality problems exist. These issues
are not always apparent, and it is important to
develop methods for detection. A thorough approach
requires inventorying the system, documenting the
business and technical rules that affect data
quality, and conducting data profiling and scoring
activities that give us insight into the extent of the
issues.
After identifying the problem, it is important to
assess the business impact and cost to the
organization. The downstream effects are not always
easy to quantify, especially when it is difficult to
detect an issue in the first place. In addition, the
cost associated with a particular issue may be small
at a departmental level but much greater when viewed
across the entire enterprise. The business impact
will drive business involvement and investment in
the effort.
Finally, once we understand the issues and their
impact on the organization, we can develop a plan of
action. Data quality programs are multifaceted. A
single tool or project is not the answer. Addressing
data quality requires changes in the way we conduct
our business and in our technology framework. It
requires organizational commitment and long-term
vision.
The strategy for addressing data quality issues
requires a blend of analysis, technology, and
business involvement. When viewed from this
perspective, an MDM program is an effective
approach. MDM provides the framework for identifying
quality problems, cleaning the data, and
synchronizing it between systems. However, MDM by
itself won't resolve all data quality issues.
An active data governance program empowered by chief
executives is essential to making the organizational
changes necessary to achieve success. The data
governance council should set the standards for
quality and ensure that the right systems are in
place for measurement. In addition, the company
should establish incentives for both users and
system developers to maintain the standards.
The end result is an organization where attention to
quality and excellence permeates the company. Such an
approach to enterprise information quality takes
dedication and requires a shift in the
organization's mindset. However, the results are
both achievable and profitable.
---Source: Information Management June 2009 (www.information-management.com).
William McKnight is partner, Information Management,
at Lucidity Consulting Group. He can be reached at
wmcknight@luciditycg.com.